Best practices for building machine learning platforms on the cloud

August 18, 2022 | August 22, 2022 | hitesh nikam

Most people are familiar with major technology platforms like iOS, Windows, and AWS. Platforms, in essence, are groups of technologies that serve as a base from which to build, contribute to, experiment with and scale other applications. They enable many of today’s most advanced technology capabilities and cutting-edge customer experiences. To keep pace with the scale and complexity of the technology capabilities brought by big data, AI, and machine learning (ML), many companies are developing sophisticated internal platforms of their own. In fact, Gartner predicts that by 2025, cloud-native platforms will serve as the foundation for more than 95% of new digital initiatives, up from less than 40% in 2021. In my experience, enterprise technology platforms have been transformational: they enable cross-functional teams to test, launch and learn at a rapid pace; reduce duplication and standardize capabilities; and provide consistent and integrated experiences. In short, they help turn technology into a competitive advantage.

The evolution of enterprise platforms

Increasingly, organizations are becoming more adept at delivering top-notch customer experiences by leveraging cloud-native platforms like Kubernetes that can run large AI and ML workloads. Capital One’s move to become the first U.S. financial institution to go all in on the cloud and our ability to re-architect our data environment have been integral to expanding our cloud-based platform capabilities. With that strong foundation, we’re better able to leverage big data to build new ML capabilities on top of our enterprise platforms and to deliver new, more meaningful customer experiences faster. Much of our work in this area is already showing results for the business and for our customers. For example, our fraud decisioning platform was built from the ground up to make complex real-time decisions. By leveraging massive amounts of data and enabling model updates in days (as opposed to months), the platform helps protect millions of customers from card fraud and can be used by various stakeholders across the enterprise. Based on my experience leading teams to deliver enterprise technology platforms, there are important lessons and best practices I’ve learned along the way:

It all starts with the team: Build a cross-functional team of the best people, even if it slows you down at first. A bigger team is not always better! At a minimum, the team should have product managers, engineers, and designers. Staff these functions with people who truly understand the users of the platform. For example, if you’re building a platform that will be used primarily by data scientists, hire a product manager who used to be a data scientist or put a data scientist on your leadership team. If the team is made up of people from several organizations, make sure you have shared goals.
Work backwards from a well-defined end state: Before you start to build, take the time to align on the end-state architecture and your plan to iterate your way to that destination. Make sure your architecture is designed for self-service and contribution from the start. Better yet, design the platform assuming that you will expand it to users outside of your immediate organization or line of business. Assume that over time you will want to swap out components as technology changes.
Estimate how long you think it will take, then double it: It is important to take the time to brainstorm all the capabilities that you need to build at the outset and then assign a t-shirt-sized level of effort to each component. Once your tech teams combine these sizes with their velocity to estimate how long it will take to build each feature, add a 50% buffer. In my experience, this estimate ends up being surprisingly accurate.
Focus on business outcomes: Building great platforms can take a long time. It is important to sequence the work so that business value can be achieved along the way. This motivates the team, builds credibility, and creates a virtuous cycle.
Be radically transparent and overcommunicate: Share decisions, progress and roadmaps with stakeholders liberally. In addition to articulating what you are working on, also articulate what you are not currently prioritizing. Invest in documentation that enables contribution as well as easy onboarding to the platform.
Start small: Even the best testing and QA environment can miss issues that are not found until something is put into production. For big changes that will have meaningful customer impact, always start with a tiny population, and then ramp up once you see things working in production at a small scale. When possible, use associates only for the initial population when a change impacts external customers.
Get serious about being well managed: Platform owners should obsess about platform performance. All issues should be self-identified through controls and automated alerts. Exceptions should be addressed quickly. Root cause analysis of issues as well as changes to prevent recurrence should be prioritized. A lack of issues should be properly celebrated so that teams know it is appreciated.
If it seems too good to be true… Exception monitoring is a great way to ensure that your execution matches your intent. Often the goal is to have zero exceptions. For example, latency should never exceed 200 milliseconds. If your exception reporting NEVER shows any exceptions, it’s possible that the monitoring is broken. Always force an exception to make sure that it triggers properly. I’ve learned this one the hard way.
A happy team is a productive team. Celebrate accomplishments, recognize team members when they go above and beyond, and create a psychologically safe environment. Measure team happiness (with a quick 1-5 scale) regularly and give teams the space to discuss what would make them happier and the autonomy to try things out to squash dissatisfiers.
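
The estimation practice above (t-shirt sizes, team velocity, then a buffer) comes down to simple arithmetic. Here is a minimal sketch in Python, assuming the 50% buffer mentioned in the text. The point values per t-shirt size, the component names and the velocity are all hypothetical and chosen only for illustration:

```python
import math

# Hypothetical mapping from t-shirt size to story points (an assumption, not from the article).
TSHIRT_POINTS = {"S": 3, "M": 8, "L": 20, "XL": 40}


def estimate_sprints(components, velocity_per_sprint, buffer=0.5):
    """Return (raw_sprints, buffered_sprints) for a dict of component -> t-shirt size."""
    total_points = sum(TSHIRT_POINTS[size] for size in components.values())
    raw = total_points / velocity_per_sprint
    buffered = raw * (1 + buffer)  # 0.5 is the 50% buffer from the text
    return math.ceil(raw), math.ceil(buffered)


# Made-up platform components, sized by the team during brainstorming.
components = {
    "feature store": "L",
    "model registry": "M",
    "serving layer": "XL",
    "monitoring": "M",
}

raw, buffered = estimate_sprints(components, velocity_per_sprint=19)
print(f"raw estimate: {raw} sprints, with buffer: {buffered} sprints")
```

Keeping the buffer as an explicit parameter makes it easy to compare the padded estimate with what actually happened and adjust it for the next planning cycle.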
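
The "start small" practice above can be sketched as a staged rollout gate: associates first, then a small percentage of customers, then a gradual ramp. The function names, the hashing scheme and the stage percentages below are assumptions made for illustration, not a description of any particular platform:

```python
import hashlib


def bucket(user_id: str) -> float:
    """Map a user id to a stable value in [0, 100) so ramping up never drops earlier users."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 10000 / 100.0


def in_rollout(user_id: str, is_associate: bool, percent: float, associates_only: bool) -> bool:
    """Decide whether a user sees the change at the current rollout stage."""
    if associates_only:
        return is_associate  # initial population: internal associates only
    return is_associate or bucket(user_id) < percent


# Hypothetical ramp: associates only, then 1%, 10% and 100% of customers.
stages = [(0.0, True), (1.0, False), (10.0, False), (100.0, False)]
for percent, associates_only in stages:
    exposed = sum(
        in_rollout(f"customer-{i}", False, percent, associates_only) for i in range(10000)
    )
    print(f"percent={percent:>5} associates_only={associates_only}: {exposed} of 10000 customers")
```

Because each user hashes to a fixed bucket, anyone included at 1% is still included at 10%, which keeps the experience consistent while the population grows.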
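
The advice to force an exception so you know the monitoring still works can be sketched as a self-test that injects one synthetic breach and checks that an alert comes out. The 200-millisecond limit is the example from the text. The `LatencyMonitor` class and `self_test` function are hypothetical stand-ins for whatever alerting system is actually in use:

```python
LATENCY_LIMIT_MS = 200  # threshold taken from the example in the text


class LatencyMonitor:
    """Toy exception monitor: records an alert whenever latency exceeds the limit."""

    def __init__(self, limit_ms=LATENCY_LIMIT_MS):
        self.limit_ms = limit_ms
        self.alerts = []  # stand-in for a real paging or alerting channel

    def record(self, request_id, latency_ms):
        if latency_ms > self.limit_ms:
            self.alerts.append((request_id, latency_ms))


def self_test(monitor):
    """Force one synthetic breach and confirm the monitor reports it."""
    before = len(monitor.alerts)
    monitor.record("synthetic-self-test", monitor.limit_ms + 1)
    fired = len(monitor.alerts) == before + 1
    if fired:
        monitor.alerts.pop()  # remove the synthetic alert so it does not pollute real reporting
    return fired


monitor = LatencyMonitor()
monitor.record("req-1", 150)  # within the limit, so no alert is expected
if self_test(monitor):
    print("monitoring path verified")
else:
    print("WARNING: forced exception did not trigger an alert")
```

Running a check like this on a schedule helps distinguish "no exceptions occurred" from "the exception report is silently broken".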

When a team has a strong culture backed by the right platform technology, the possibilities are endless. By combining cloud-native platforms with data at scale, companies can better advance and experiment with newer, more innovative products and experiences. And when those experiences enable end users and customers to achieve exactly what they need, when they need it, that’s revolutionary.
