Machine learning (ML) projects: 5 reasons they fail

As enterprises do more machine learning (ML) projects, teams will encounter some common pitfalls. Consider this expert advice on how to avoid five frequent causes of failure.

You don’t have to look far to see what’s at the root of enterprise IT’s enthusiasm for artificial intelligence (AI) and machine learning (ML) projects – data, and lots of it! Data, indeed, is king across a range of industries, and companies need AI/ML to glean meaningful insights from it.

HCA Healthcare, for example, used machine learning to create a big data analysis platform to speed sepsis detection, while BMW used it to support its automated vehicle initiatives. While AI/ML can bring tremendous value to businesses, your team will first have to navigate around a common set of challenges.

[ Want best practices for AI workloads? Get the eBook: Top considerations for building a production-ready AI/ML environment. ]

5 machine learning project pitfalls to beware

According to Guillaume Moutier, senior principal data engineering architect at Red Hat, a few common reasons why machine learning projects fail have emerged. Spoiler alert: Many of these pitfalls can be avoided.

1. Jumping in without a clearly defined use case

You could also call this one “shiny object syndrome.”

If businesses pursue AI/ML simply because it’s the hottest trend in tech, they can end up wasting tremendous amounts of time and money. Your AI/ML initiative shouldn’t feel like a solution in search of a problem – before anything else, identify a real business problem you’re trying to solve and ask yourself if it would truly benefit from an ML approach.

How to avoid trouble

There are two major questions you should ask yourself before starting a machine learning project. First, what is my organization’s business goal? Second, can that goal meaningfully be framed as an ML problem?

Let’s say your goal is to increase customer satisfaction. Great! Maybe you can use a machine learning algorithm to create better personalization and sentiment analysis for your customers. From there, you can strategize your way to acquiring the right talent, collecting the right data, coming up with ways to measure success, and so on.

What you want to avoid is a situation where you fall so much in love with the idea of machine learning that less expensive, common-sense solutions to your business problems are overlooked. For example, are AI-powered chatbots really the best solution for providing better customer service, or is there a simpler way to improve your business’s customer service?

In other words, the potential business value of your ML project should be your first consideration.

2. Projects lack access to relevant data

Data is a key component in all AI/ML initiatives – it is required for training, testing, and operating models. Actually gathering that data, however, is a thorn in the side of many enterprise ML projects. That’s because most businesses generate enormous amounts of data and have no easy way to manage or utilize it. Additionally, most enterprise data is scattered between on-premises and cloud data stores, subject to their own compliance or quality control requirements, which can make it even more difficult to consolidate and analyze the data.

Data silos pose another obstacle. Data silos – collections of data held by one team but not fully accessible to others – can develop when teams use different tools to store and manage data sets. But they may also reflect a siloed organizational structure.

How to avoid trouble

Many organizations benefit from automating data pipelines to connect all the different data sources across the enterprise. Data pipelines can help you collect, prepare, store and access datasets for AI/ML model development, training, and inferencing.

You can think of a data pipeline as a set of processing activities consisting of three key elements: a source, a processing step, and a destination. Standardized application programming interfaces (APIs) and high-bandwidth, low-latency networking will make it easier to access data throughout the AI/ML life cycle as well.
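
To make the three-element pattern concrete, here is a minimal sketch of a source, a processing step, and a destination built from plain Python generators. The record values are purely illustrative; a production pipeline would swap in tools such as Kafka or Spark for each stage.

```python
# A minimal sketch of the source -> processing -> destination pattern.
# Record values are illustrative; real stages would be backed by
# tools such as Kafka (source), Spark (processing), or a data store (sink).

def source():
    """Source: yield raw records (hard-coded here for illustration)."""
    raw = [" 42 ", "17", "not-a-number", "8 "]
    for item in raw:
        yield item

def process(records):
    """Processing step: clean each record and drop invalid ones."""
    for item in records:
        item = item.strip()
        if item.isdigit():  # validation: keep only numeric records
            yield int(item)

def sink(records):
    """Destination: collect results (a real sink might write to storage)."""
    return list(records)

if __name__ == "__main__":
    print(sink(process(source())))  # cleaned, validated records only
```

Each stage only needs to agree with its neighbors on the shape of the records passed along, which is the role standardized APIs play at enterprise scale.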

Integration with open source data streaming, manipulation, and analytics tools like Apache Spark, Kafka, and Presto can help you manage your data more efficiently. Data governance capabilities and security features should also be part of your tooling decisions.

3. Data scientists don’t work together with business teams

Let’s say you have the perfect use case for your machine learning project and have hired top-notch data scientists to work on it. Your data scientists have started training the model with all the data you gathered, and things are going swimmingly.

Except, according to Moutier, the story often doesn’t end here.

That’s because one of the biggest things he has noticed businesses doing wrong with machine learning projects in the past two years has to do with silos – specifically, they hire a bunch of people with machine learning PhDs only to keep them locked in a room far away from the business people and far from the applications that will deploy the models they develop.

This has profoundly negative implications for a machine learning project because, as just discussed, a fragmented structure tends to result in data silos. Data scientists can’t run production operations on their own, and a model trained in isolation from the business and the applications that will use it rarely generates meaningful insights.

How to avoid trouble

Consider taking an MLOps approach. Machine Learning Operations (MLOps) is a practice that aims to improve the lifecycle management of an ML application in a robust, collaborative, and scalable way through a combination of processes, people, and technology. It shares the same principles as DevOps and GitOps, with some key differences.

A core part of the MLOps process includes building teams with a variety of skill sets – not just data science skills – and empowering them to work as one unit to achieve common goals.

On the technology side of MLOps, teams should consider using CI/CD pipelines to introduce ongoing automation and monitoring throughout the ML lifecycle. Also, using Git as a central source of truth for all code and configuration – and committing often – can give added consistency and reproducibility to teams across the entire organization.
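
As one hedged illustration of treating Git as the source of truth, the sketch below (the model name and parameters are hypothetical) records the exact commit a model artifact was built from, so any teammate can reproduce the training run:

```python
# A sketch of recording provenance for a model artifact.
# The model name and parameters are illustrative; the idea is that every
# saved artifact carries the Git commit and configuration it came from.

import json
import subprocess

def current_commit(fallback="unknown"):
    """Return the current Git commit hash, or a fallback outside a repo."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return fallback

def artifact_metadata(model_name, params, commit):
    """Bundle everything needed to reproduce a training run."""
    return {
        "model": model_name,
        "params": params,
        "git_commit": commit,
    }

if __name__ == "__main__":
    meta = artifact_metadata("churn-model", {"lr": 0.01}, current_commit())
    print(json.dumps(meta, indent=2))
```

A CI/CD pipeline can then refuse to promote any artifact whose recorded commit is not in the shared repository, keeping the whole team on one source of truth.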

4. Infrastructure lacks flexibility

AI/ML models, software, and applications require infrastructure for development and deployment.

In contrast to academic settings, where machine learning is often research-focused and solves problems under controlled conditions, business settings require more complex infrastructure. The many moving parts involved include data collection, data verification, and model monitoring. Your ML infrastructure not only allows data scientists to develop and test models, but also serves as the way to deploy models into production.

And as you can probably imagine, without flexible infrastructure, your machine learning project can fall flat on its face. That’s because an ML infrastructure supports every stage of the machine learning workflow. It affects how much time data scientists spend on DevOps tasks, communication between tools, and so on.

How to avoid trouble

If you want to develop, test, deploy, and manage AI/ML models and applications in the same manner across all parts of your infrastructure, consider taking a hybrid cloud approach.

By allowing you to combine on-premises data centers and private clouds with one or more public cloud services, a hybrid cloud model can improve computing performance, increase your agility now and later, and give you the “best of both worlds” of public and private cloud.

But why does hybrid cloud matter for your ML infrastructure exactly? Simply put, a hybrid cloud approach gives you the most flexibility for your machine learning infrastructure. For example, you might store some data sets on-premises (for compliance reasons) but use a public cloud to set up the complex infrastructure for the AI stack. By using a public cloud provider, your data scientists don’t have to spend their time configuring hardware and other tools, letting them focus on doing actual data science.

In another example, you might use on-premises technology for preliminary testing while letting cloud providers handle the heavy lifting for the development of production-ready models. Public clouds can also help data scientists quickly build and deploy AI/ML models by integrating open source applications with commercial partners’ technology.

5. Software stack proves difficult to manage

Machine learning development environments can be … messy. The software stacks used in such environments are complex, sometimes immature, and above all, continually evolving.

For example, you may be using open source tools such as TensorFlow and PyTorch on the ML framework side, Kubeflow or MLflow on the platform side, and Kubernetes for infrastructure purposes. And all of these tools need to be continually maintained and updated.

That can open the door to inconsistency. As an example, if you’re using a dataset in TensorFlow 2.5 to train a model and your colleague is using the exact same dataset in TensorFlow 2.6, you may well get different results.

If you don’t ensure everyone is using the same tooling and hardware across different training environments, it’s hard to reliably share code and datasets across (and within) teams. Problems with consistency, portability, and dependency management can arise, creating multiple possible points of failure along the way.
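
One lightweight defense is a fail-fast version check at the start of every training job. The sketch below (the package pins are illustrative) compares the libraries a job actually sees against a pinned manifest, so two colleagues cannot silently train on different stacks:

```python
# A sketch of a fail-fast environment check. The pinned versions below
# are illustrative; in practice they would come from a shared lockfile
# or requirements manifest kept under version control.

PINNED = {"tensorflow": "2.5.0", "numpy": "1.19.5"}  # hypothetical pins

def check_versions(installed, pinned):
    """Return a list of (package, expected, found) mismatches."""
    mismatches = []
    for package, expected in pinned.items():
        found = installed.get(package, "missing")
        if found != expected:
            mismatches.append((package, expected, found))
    return mismatches

if __name__ == "__main__":
    # In a real job, 'installed' would be read from the live environment,
    # e.g. via importlib.metadata.version() for each pinned package.
    installed = {"tensorflow": "2.6.0", "numpy": "1.19.5"}
    for package, expected, found in check_versions(installed, PINNED):
        print(f"{package}: expected {expected}, found {found}")
```

Containers make this check largely unnecessary by baking the pinned stack into the image itself, which is one reason the next section’s tooling matters.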

[ Want more advice from ML experts and your peers? Read also: AI/ML workloads in containers: 6 things to know. ]

How to avoid trouble

Tools such as Red Hat OpenShift Data Science, a managed cloud service that helps data scientists and developers develop intelligent applications in a sandbox, can add flexibility.

So can containers, as part of a machine learning development environment. Team members can cleanly move a containerized application between environments (dev, test, production) while retaining full application functionality. Containers can also simplify collaboration. Teams can iteratively modify and share container images with versioning capabilities that track changes for transparency.

Once you adopt containers, you need a way to manage and scale them efficiently, which is where a container orchestration tool like Kubernetes, along with Kubernetes Operators, comes into play. Enterprise Kubernetes platforms such as Red Hat OpenShift also have benefits to offer for AI/ML development tasks.

[ Read also: OpenShift and Kubernetes: What’s the difference? ]

If your organization doesn’t want to manage and maintain Kubernetes itself, you will want to consider cloud services such as OpenShift Dedicated, Red Hat OpenShift on AWS (ROSA), and Microsoft Azure Red Hat OpenShift (ARO).

[ Want to learn more? Get the eBook: Modernize your IT with managed cloud services. ]

Bill Cozens
Bill Cozens is a recent UNC-Chapel Hill grad. He is a writer and editor for the Malwarebytes blog.