Why 90% of machine learning models never hit the market

Corporations are going through rough times. And I’m not talking about the pandemic and the stock market volatility.

The times are uncertain, and having to make customer experiences more and more seamless and immersive isn’t taking any pressure off companies. In that light, it’s understandable that they’re pouring billions of dollars into the development of machine learning models to improve their products.

But there’s a problem. Companies can’t just throw money at data scientists and machine learning engineers, and hope that magic happens.

The data speaks for itself. As VentureBeat reported last year, around 90 percent of machine learning models never make it into production. In other words, roughly one in ten of a data scientist’s workdays actually ends up producing something useful for the company.

Even though 9 out of 10 tech executives believe that AI will be at the center of the next technological revolution, its adoption and deployment leave room for growth. And the data scientists aren’t the ones to blame.

Corporations aren’t set up for machine learning

Leadership support means more than money

The job market for data scientists is pretty great. Companies are hiring, and they’re ready to pay a good salary, too.

Of course, managers and corporate leaders expect these data scientists to add a lot of value in return. For the moment, however, they’re not making it easy to do so.

“Sometimes people think, all I need to do is throw money at a problem or put a technology in, and success comes out the other end,” says Chris Chapo, SVP of data and analytics at GAP.

To help data scientists excel in their roles, leaders need not only to direct resources to the right places, but also to understand what machine learning models are all about. One possible solution is for leaders to get some introductory training in data science themselves, so they can put this knowledge into practice at their companies.

Lacking access to data

Companies aren’t bad at collecting data. However, many companies are highly siloed, which means that each department has its own ways of collecting data, preferred formats, places where they store it, and security and privacy preferences.

Data scientists, on the other hand, often need data from several departments. Siloing makes it harder to clean and process that data. Moreover, many data scientists complain that they can’t even obtain the data they need. But how should you even start training a model if you don’t have the necessary data?

Siloed company structures — and inaccessible data — might have been manageable in the past. But in an era where technological transformation is happening at breakneck speed, companies will need to step up and set up uniform data structures throughout.
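As a rough illustration of what a uniform data structure can look like, here is a minimal Python sketch in which two departments export the same customer in different formats and each export is mapped into one shared schema. Everything in it is hypothetical: the department names, the field names, the date formats, and the CustomerRecord class are assumptions made for the example, not a description of any particular company’s setup.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class CustomerRecord:
    """Shared schema that every department maps its export into (hypothetical)."""
    customer_id: str
    signup_date: date
    country: str


def from_sales(row: dict) -> CustomerRecord:
    # Assumed sales export: "CustID", "Signup" as DD/MM/YYYY, free-text "Country".
    return CustomerRecord(
        customer_id=str(row["CustID"]).strip(),
        signup_date=datetime.strptime(row["Signup"], "%d/%m/%Y").date(),
        country=row["Country"].strip().upper(),
    )


def from_support(row: dict) -> CustomerRecord:
    # Assumed support export: numeric "customer", ISO "created_at", "country_code".
    return CustomerRecord(
        customer_id=str(row["customer"]).strip(),
        signup_date=date.fromisoformat(row["created_at"]),
        country=row["country_code"].strip().upper(),
    )


# The same customer, as each department would export it.
sales_rows = [{"CustID": " 1001 ", "Signup": "03/02/2020", "Country": "de"}]
support_rows = [{"customer": 1001, "created_at": "2020-02-03", "country_code": "DE"}]

# After mapping, both records share one structure and can be compared or joined.
unified = [from_sales(r) for r in sales_rows] + [from_support(r) for r in support_rows]
```

The point of the sketch is the design choice rather than the details: each department keeps its own export, but there is exactly one agreed target structure, so a data scientist only has to learn that one.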

Woman sitting in front of computer screen which shows the words “code is beautiful” — For data scientists to do their job, it’s vital that they get access to the data they need. Image by author

The disconnect between IT, data science, and engineering

If companies aim to get less siloed, that also means that departments need to communicate more with one another and align their goals.

In many companies, there’s a fundamental divide between the IT and data science departments. IT tends to prioritize making things work and keeping them stable. Data scientists, on the other hand, like experimenting and breaking things. This doesn’t lead to effective communication.

In addition, engineering skills aren’t always deemed essential for data scientists. This is a problem because the engineers who take over might not understand all the details of what a data scientist envisions, or might implement things differently due to miscommunication. Therefore, data scientists who can deploy their models have a competitive edge over those who can’t, as Stack Overflow points out.

Machine learning models come with their own set of challenges

Scaling up is harder than you think

If a model works great in a small environment, that doesn’t imply that it’ll work everywhere.

For one, the hardware or cloud storage space to handle bigger datasets might not be available. In addition, the modularity of machine learning models doesn’t always work the same way at large scales as it does at small ones.

Finally, data sourcing may not be easy or even possible. This can be due to siloed structures in companies, as discussed earlier, or due to other challenges in obtaining more data.

This is yet another reason to unify data structures across organizations, and encourage communication between different departments.

Efforts get duplicated

On the long road to deploying machine learning models, more than a quarter of all companies face duplicated efforts.

For example, a software engineer might try to implement what a data scientist told them to. The latter might go ahead and do some of the work themselves, too.

Not only is this a waste of time and resources; it can also lead to additional confusion when stakeholders don’t know which version of the code to use, or whom to turn to if they encounter any bugs.

Although data scientists have an advantage if they’re able to implement their models, they should clearly communicate with the engineers about what needs to be done by whom. This way, they’ll save the company’s time and resources.

One man and two women sitting and talking at table with a laptop on it — Effective communication is vital to make machine learning models work. Image by author

Execs don’t always buy in

Tech executives strongly believe in the power of AI as a whole, but that doesn’t mean that they’re convinced by every idea out there. As Algorithmia reports, a third of all business executives blame the poor deployment statistics on a lack of senior buy-in.

It seems as if data scientists are still viewed as somewhat nerdy and devoid of business sense. This makes it all the more important that data scientists sharpen their business skills and seek dialogue with senior execs whenever possible.

Of course, that doesn’t mean that every data scientist suddenly needs an MBA to excel at their job. However, a few key lessons from business classes or business experience might go a long way.

Lack of cross-language and framework support

Since machine learning models are still in their infancy, there are still considerable gaps when it comes to different languages and frameworks.

Some pipelines start in Python, continue in R, and end in Julia. Others go the other way around, or use other languages entirely. Since each language comes with unique sets of libraries and dependencies, projects quickly get hard to keep track of.

In addition, some pipelines might make use of containerization with Docker and Kubernetes, others might not. Some pipelines will deploy specific APIs, others not. And the list goes on.

Tools like TFX, MLflow, and Kubeflow are starting to emerge to fill this gap. But these tools are still in their infancy, and expertise in them is rare as of now.

Data scientists know that they need to keep checking out the newest developments in their field. This should apply to model deployment as well.

Versioning and reproducibility remain challenging

A related issue is that there is, as of now, no go-to way of versioning machine learning models. Data scientists clearly need to keep track of any changes they make, but doing so is still quite cumbersome.

In addition, datasets may drift over time. That’s natural as companies and projects evolve, but it makes it harder to reproduce past results.
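To make the idea of drift a bit more concrete, the following is a minimal Python sketch, not a recommended method: it compares the mean of one numeric feature in a recent batch against a stored reference sample, measured in reference standard deviations. The function names, the example numbers, and the threshold of 0.5 are all assumptions chosen for illustration; real projects typically rely on more robust statistical tests.

```python
from statistics import mean, stdev
from typing import Sequence


def mean_shift(reference: Sequence[float], current: Sequence[float]) -> float:
    """Distance between the two sample means, in reference standard deviations."""
    spread = stdev(reference)
    if spread == 0:
        raise ValueError("reference sample has no variation")
    return abs(mean(current) - mean(reference)) / spread


def has_drifted(reference: Sequence[float], current: Sequence[float],
                threshold: float = 0.5) -> bool:
    # The threshold is an arbitrary illustrative value, not an established standard.
    return mean_shift(reference, current) > threshold


# Made-up feature values: the sample the model was trained on, and two later batches.
reference = [10.0, 12.0, 11.0, 13.0, 9.0]
recent_similar = [11.0, 10.0, 12.0, 11.0, 11.0]
recent_shifted = [15.0, 16.0, 14.0, 17.0, 15.0]

similar_flag = has_drifted(reference, recent_similar)
shifted_flag = has_drifted(reference, recent_shifted)
```

A check like this only signals that a feature has moved; deciding whether to retrain or to investigate the data source is still a judgment call.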

It’s all the more important that as soon as a project is started, a benchmark is established against which the model is run now and in the future. In combination with diligent version control, this lets data scientists make their models reproducible.
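One way to picture a benchmark combined with version control is to store, for every run, a fingerprint of the data, the parameters used, and the resulting metric, and then compare later runs against that record. The Python sketch below does this with the standard library only. The record layout, the helper names, the example parameters, and the tolerance of 0.01 are hypothetical choices for illustration, and dedicated tools cover far more than this.

```python
import hashlib
import json


def fingerprint(rows: list) -> str:
    """SHA-256 hash of the dataset, so any later change to the data is detectable."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def make_run_record(rows: list, params: dict, metric: float) -> dict:
    # One record per run: which data, which settings, which result.
    return {"data_sha256": fingerprint(rows), "params": params, "metric": metric}


def matches_benchmark(record: dict, benchmark: dict, tolerance: float = 0.01) -> bool:
    """True if the run used the same data and settings and landed near the benchmark metric."""
    same_data = record["data_sha256"] == benchmark["data_sha256"]
    same_params = record["params"] == benchmark["params"]
    close_metric = abs(record["metric"] - benchmark["metric"]) <= tolerance
    return same_data and same_params and close_metric


# Made-up data, parameters, and metric values for illustration.
rows = [{"x": 1, "y": 0}, {"x": 2, "y": 1}]
params = {"model": "logreg", "seed": 42}

benchmark = make_run_record(rows, params, 0.81)      # stored when the project starts
rerun = make_run_record(rows, params, 0.812)         # same data and settings, later date
changed = make_run_record(rows + [{"x": 3, "y": 1}], params, 0.81)  # data has changed

rerun_ok = matches_benchmark(rerun, benchmark)
changed_ok = matches_benchmark(changed, benchmark)
```

If such records are committed alongside the code, a failed comparison points directly at what changed: the data, the settings, or the model’s behavior.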

Doctor holding stethoscope to computer screen depicting lines of code — If a model isn’t reproducible, this could lead to lengthy investigations later on. Image by author

How to stop trying and start deploying

If 90 percent of a data scientist’s efforts lead to nothing, that’s not a good sign. This isn’t the fault of data scientists, as shown above, but rather due to inherent and organizational obstacles.

Change doesn’t happen overnight. For companies that are just getting started with machine learning models, it’s therefore advisable to begin with a really small and simple project.

Once managers have outlined a clear and simple project, the second step is to choose the right team. It should be cross-functional, and should include data scientists, engineers, DevOps, and any other roles that seem important for its success.

Third, managers should consider leveraging third parties to help them accelerate at the beginning. IBM is among the companies that offer such a service, but there are others on the market, too.

A final caveat is not to strive for sophistication at all costs. If a cheap and simple model fulfills 80 percent of customer needs and could be shipped within a couple of months, that’s already a great feat. Moreover, the lessons learned from building the simple model will fuel the implementation of a more sophisticated model that, hopefully, makes customers 100 percent satisfied.

The bottom line: revolutions take time

The next decade is bound to be revolutionary, just like the last one was. The widespread adoption of artificial intelligence is only one of many growing trends. The rise of the internet of things, advanced robotics, and blockchain technology belong on this list, too.

I’m deliberately speaking of decades and not years, though. For example, consider that 90 percent of companies are in the cloud, so many that it’s hard to even think about how our lives would be without it. On the flip side, the cloud took several decades to gain widespread adoption.

There’s no reason to believe that the AI revolution should be any different. It will take a while to implement because the status quo contains a host of obstacles to tackle.

But since machine learning offers so many ways to improve customer experience and corporate efficiency, it’s clear that the winners will be those that deploy models fast and early.

This article was written by Ari Joury and was originally published on Towards Data Science.