How to get the most out of your AI/ML investments: Start with your data infrastructure

The era of Big Data has helped democratize information, create a wealth of data, and grow revenues for technology-based businesses. But for all this intelligence, we are not getting the level of insight from machine learning that one would expect, as many companies struggle to make machine learning (ML) projects feasible and useful. A successful AI/ML program does not start with a large team of data scientists. It starts with a strong data infrastructure. Data must be accessible across systems and ready for analysis so that data scientists can quickly make comparisons and deliver business results. The data must also be reliable, which highlights the challenge many companies face when starting a data science program.

The problem is that many companies jump into data science first, hire expensive data scientists, and then discover they don’t have the tools or infrastructure those data scientists need to succeed. Highly paid researchers end up spending their time categorizing, validating, and preparing data rather than seeking insights. This preparation work is important, but it means data scientists miss the opportunity to apply their most valuable skills where they add the most value.

Challenges with data management

When leaders evaluate the reasons for the success or failure of a data science project (and 87% of projects never make it to production), they often find that their company tried to jump ahead to results without first building a foundation of reliable data. If they don’t have that solid foundation, data engineers can… 44% of their time maintaining data pipelines as APIs or data structures change. Creating an automated data integration process can save engineers time and ensure companies have all the data they need for accurate machine learning. This also helps cut costs and maximize efficiency as companies expand their data science capabilities.
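As a rough illustration of why pipeline maintenance eats so much time, the sketch below checks one incoming record against the schema a pipeline expects and reports what changed. The schema, the field names, and the `detect_schema_drift` helper are all hypothetical and chosen only for this example. An automated integration process would typically run a check like this on every load instead of waiting for a downstream job to break.

```python
from typing import Any

# Hypothetical schema the pipeline expects: column name -> Python type.
EXPECTED_SCHEMA: dict[str, type] = {
    "customer_id": str,
    "plan": str,
    "monthly_spend": float,
}


def detect_schema_drift(
    record: dict[str, Any], expected: dict[str, type]
) -> dict[str, list[str]]:
    """Compare one incoming record against the expected schema."""
    # Columns the pipeline needs but the source no longer sends.
    missing = sorted(name for name in expected if name not in record)
    # Columns the source sends that the pipeline does not know about.
    unexpected = sorted(name for name in record if name not in expected)
    # Columns that are present but carry a value of the wrong type.
    wrong_type = sorted(
        name
        for name, expected_type in expected.items()
        if name in record and not isinstance(record[name], expected_type)
    )
    return {"missing": missing, "unexpected": unexpected, "wrong_type": wrong_type}


# Example: the upstream API renamed "plan" to "tier" and now sends spend as text.
incoming = {"customer_id": "c-001", "tier": "pro", "monthly_spend": "49.00"}
drift = detect_schema_drift(incoming, EXPECTED_SCHEMA)
print(drift)
```

Here the report flags `plan` as missing, `tier` as unexpected, and `monthly_spend` as the wrong type, which is the kind of change an engineer would otherwise have to track down by hand.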

Narrow data yields limited insights

Machine learning is finicky: if there are gaps in the data, or if the data isn’t formatted correctly, machine learning either won’t work or, worse, will produce inaccurate results.
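A minimal sketch of catching both problems before training might look like the following. It assumes rows arrive as dictionaries and that dates should use the `YYYY-MM-DD` format. The field names and the `audit_rows` helper are hypothetical, and a real project would likely need more checks than these two.

```python
from datetime import datetime


def audit_rows(rows, required_fields, date_field):
    """Split rows into those safe to train on and those rejected, with a reason."""
    clean, rejected = [], []
    for index, row in enumerate(rows):
        # A gap is a required field that is absent, None, or an empty string.
        gaps = [field for field in required_fields if row.get(field) in (None, "")]
        if gaps:
            rejected.append((index, f"missing: {', '.join(gaps)}"))
            continue
        try:
            # Assumed format for this example: ISO-style dates such as 2021-03-14.
            datetime.strptime(row.get(date_field), "%Y-%m-%d")
        except (ValueError, TypeError):
            rejected.append((index, f"bad date format in {date_field}"))
            continue
        clean.append(row)
    return clean, rejected


rows = [
    {"customer_id": "c-001", "signup_date": "2021-03-14", "seats": 12},
    {"customer_id": "c-002", "signup_date": "14/03/2021", "seats": 5},
    {"customer_id": "c-003", "signup_date": "2021-06-01", "seats": None},
]
clean, rejected = audit_rows(
    rows, ["customer_id", "signup_date", "seats"], "signup_date"
)
for index, reason in rejected:
    print(f"row {index}: {reason}")
```

Rejecting rows with an explicit reason, instead of silently dropping them, also makes it visible when data issues are shrinking the training set.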

When companies are uncertain about their data, most ask the data science team to manually label the data set as part of supervised machine learning, but this is a time-consuming process that adds risk to the project. Worse, when the training examples are trimmed too far because of data issues, the limited scope may mean that the ML model can only tell us what we already know.

The solution is to ensure that the team can draw on a comprehensive, centralized data store that spans a wide range of sources and provides a shared understanding of the data. This improves the potential ROI of the ML models by giving them more consistent data to work with. A data science program can only evolve if it is based on reliable, consistent data and an understanding of how reliable its results need to be.

Big models vs. valuable data

One of the biggest challenges for a successful data science program is balancing the volume and value of the data when making a prediction. A social media company that analyzes billions of interactions every day can use the large number of relatively low-value actions (e.g., someone swiping up or sharing an article) to make reliable predictions. An organization trying to determine which customers are likely to renew a contract at the end of the year, by contrast, is likely working with smaller data sets that carry big consequences. Since it can take up to a year to find out whether the recommended actions led to success, this places huge constraints on a data science program.

In these situations, companies must break down internal data silos and combine all the data they have to make the best recommendations. This can include zero-party information captured with gated content, first-party website data, and data from customer interactions with the product, along with successful outcomes, support tickets, customer satisfaction surveys, and even unstructured data such as user feedback. All of these data sources contain clues as to whether a customer will renew their contract. By combining data silos from different business groups, companies can standardize metrics and gain enough depth and breadth to make reliable predictions.
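To make the idea concrete, here is a small sketch of merging per-customer records from several silos into one profile per customer. The silo names, the fields, and the `combine_silos` helper are hypothetical, and the sketch assumes every source shares a common `customer_id` key, which in practice is often the hardest part to establish.

```python
from collections import defaultdict


def combine_silos(silos):
    """Merge per-customer records from several sources into one profile each."""
    profiles = defaultdict(dict)
    for source_name, records in silos.items():
        for record in records:
            customer_id = record["customer_id"]
            for field, value in record.items():
                if field != "customer_id":
                    # Prefix with the source so fields from different teams never collide.
                    profiles[customer_id][f"{source_name}.{field}"] = value
    return dict(profiles)


# Hypothetical silos owned by three different business groups.
silos = {
    "product": [
        {"customer_id": "c-001", "weekly_logins": 14},
        {"customer_id": "c-002", "weekly_logins": 1},
    ],
    "support": [
        {"customer_id": "c-001", "open_tickets": 0},
        {"customer_id": "c-002", "open_tickets": 4},
    ],
    "survey": [{"customer_id": "c-001", "csat": 9}],
}
profiles = combine_silos(silos)
for customer_id, profile in profiles.items():
    print(customer_id, profile)
```

In this toy data, customer `c-002` has few logins, several open tickets, and no survey response. No single silo shows that full picture, but the combined profile does.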

To avoid the pitfall of declining trust and diminishing returns from an ML/AI program, companies can take the following steps.

  1. Recognize where you are: Does your company have a clear view of how ML contributes to the business? Does your company already have the infrastructure in place? Don’t try to put fancy gilding on fuzzy data. Be clear about where you are starting so you don’t jump too far ahead.
  2. Get all your data in one place: Make sure you have identified and integrated a central cloud service or data lake. Once everything is centralized, you can start acting on the data and find any discrepancies in reliability.
  3. Crawl, walk, run: Follow the correct sequence of operations as you build your data science program. Focus on data analytics and business intelligence first, then build data engineering, and finally add a data science team.
  4. Don’t forget the basics: Once you’ve combined, cleaned, and validated all the data, you’re ready to get into data science. But do not forget about the “housekeeping” work needed to maintain a foundation that will deliver significant results. These critical tasks include investing in cataloging and data hygiene, ensuring that the right metrics are used to improve the customer experience, and maintaining data connections between systems, either manually or through an infrastructure service.
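As one example of what finding discrepancies can look like once the data is centralized (step 2 above), the sketch below compares the same metric as reported by two systems and flags customers where the values disagree. The system names, the 1% tolerance, and the `find_discrepancies` helper are assumptions made for illustration.

```python
def find_discrepancies(source_a, source_b, tolerance=0.01):
    """Return keys whose values disagree between two sources.

    A value counts as a mismatch when the absolute difference exceeds
    `tolerance` times the larger of the two magnitudes.
    """
    issues = {}
    for key in sorted(set(source_a) | set(source_b)):
        a, b = source_a.get(key), source_b.get(key)
        if a is None or b is None:
            issues[key] = "present in only one source"
        elif abs(a - b) > tolerance * max(abs(a), abs(b)):
            issues[key] = f"values differ: {a} vs {b}"
    return issues


# Hypothetical annual contract values as recorded by two systems.
billing = {"c-001": 1200.0, "c-002": 300.0, "c-003": 450.0}
crm = {"c-001": 1200.0, "c-002": 360.0}

issues = find_discrepancies(billing, crm)
for customer_id, problem in issues.items():
    print(customer_id, problem)
```

A report like this does not say which system is right, but it shows where the data needs attention before any model is trained on it.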

By building the right data science infrastructure, companies can see what matters to the business and where the blind spots are. Doing the groundwork first can deliver solid ROI, but more importantly, it sets up the data science team for significant impact. Getting a budget for a flashy data science program is relatively easy, but remember that most such projects fail. Getting the budget for the “boring” infrastructure tasks isn’t as easy, but data management lays the foundation for data scientists to deliver the most meaningful business impact.

Alexander Lovell is head of product at Fivetran.

DataDecisionMakers

Welcome to the VentureBeat Community!

DataDecisionMakers is where experts, including the technical people who do data work, can share data-related insights and innovation.