Synthetic data is the safe, low-cost alternative to real data that we need

Content provided by IBM and TNW.

Babies learn to talk from hearing other humans — mostly their parents — repeatedly produce sounds. Slowly, through repetition and discovering patterns, infants start connecting those sounds to meaning. Through a lot of practice, they eventually manage to produce similar sounds that humans around them can understand.

Machine learning algorithms work in much the same way, but instead of having a couple of parents to copy from, they use data that has been painstakingly categorized by thousands of humans, who manually review it and tell the machine what it means.

However, this tedious and time-consuming process isn’t the only problem with the real-world data used to train machine learning algorithms.

Take fraud detection in insurance claims. For an algorithm to accurately tell a case of fraud apart from legit claims, it needs to see both. Thousands upon thousands of both. And because AI systems are often supplied by third parties rather than run by the insurance company itself, those third parties have to be given access to all that sensitive data. You can see where this is going: the same applies to healthcare records and financial data.

More esoteric but just as worrying are all the algorithms trained on text, pictures, and videos. Aside from questions of copyright, many creators have voiced disagreement with their work being sucked into a data set to train a machine that might eventually take (part of) their job. And that’s assuming their creations aren’t racist or problematic in other ways, which in turn could lead to problematic outputs.

Also, what if there’s simply not enough data available to train an AI on all eventualities? In a 2016 RAND Corporation report, the authors calculated how many miles “a fleet of 100 autonomous vehicles driving 24 hours a day, 365 days a year, at an average speed of 25 miles per hour” would have to drive to show that their failure rate (resulting in fatalities or injuries) was reliably lower than that of humans. Their answer? 500 years and 11 billion miles.
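Those two numbers are easy to sanity-check against each other. The sketch below only multiplies out the fleet parameters quoted above; the constant and function names are ours, not RAND's.

```python
# Fleet parameters as quoted from the 2016 RAND report.
FLEET_SIZE = 100                 # vehicles
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365
AVG_SPEED_MPH = 25
TARGET_MILES = 11_000_000_000    # 11 billion miles


def fleet_miles_per_year() -> int:
    """Miles the whole fleet covers in one year of nonstop driving."""
    return FLEET_SIZE * HOURS_PER_DAY * DAYS_PER_YEAR * AVG_SPEED_MPH


def years_to_reach(target_miles: int) -> float:
    """Years the fleet needs to accumulate the target mileage."""
    return target_miles / fleet_miles_per_year()


print(fleet_miles_per_year())                # 21900000
print(round(years_to_reach(TARGET_MILES)))   # 502
```

At roughly 22 million miles a year, the fleet needs about five centuries to reach 11 billion miles, which matches the report's headline figure.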

You don’t have to be a super-brained genius to figure out that the current process is not ideal. So what can we do? How can we create enough privacy-respecting, non-problematic, all-eventuality-covering, accurately labeled data? You guessed it: more AI.

Fake data can help AIs deal with real data

Even before the RAND report, it was clear to companies working on autonomous driving that they were woefully underequipped to gather enough data to reliably train algorithms to drive safely under any condition or circumstance.

Take Waymo, Alphabet’s autonomous driving company. Instead of relying solely on its real-world vehicles, it created a totally simulated world, in which simulated cars with simulated sensors could drive around endlessly, collecting real data on their simulated way. According to the company, by 2020 it had collected data on 15 billion miles of simulated driving, compared to a measly 20 million miles of real-world driving.

In the parlance of AI, this is called synthetic data, or “data applicable to a given situation that is not obtained by direct measurement,” if you want to get technical. Or less technically: AIs are producing fake data so other AIs can learn about the real world at a speedier pace.

One example is Task2Sim, an AI model built by the MIT-IBM Watson AI Lab that creates synthetic data for training classifiers. Rather than teaching the classifier to recognize one object at a time, the model creates images that can be used to teach multiple tasks. The scalability of this type of model makes collecting data less time-consuming and less expensive for data-hungry businesses.

As Rogerio Feris, an IBM researcher who co-authored the paper on Task2Sim, put it:

The beauty of synthetic images is that you can control their parameters — the background, lighting, and the way objects are posed.

Thanks to all of the concerns listed above, the production of all kinds of synthetic data has ballooned over the past few years, with dozens of startups in the field blooming and picking up hundreds of millions of dollars in investment.

The synthetic data generated ranges from ‘human data’ like health or financial records, to synthesized pictures of a diverse range of human faces, to more abstract data sets like genomic data that mimics the structure of DNA.

How to make really fake data

There are a couple of ways to generate synthetic data, the most common and well established of which is the generative adversarial network, or GAN.

In a GAN, two AIs are pitted against each other. One AI produces a synthetic data set, while the other tries to establish whether the generated data is genuine. The feedback from the latter loops back into the former, ‘training’ it to become more accurate in producing convincing fake data. You’ve probably seen one of the many this-X-does-not-exist websites (ranging from people to cats to buildings), which generate their images using GANs.
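To make that feedback loop concrete, here is a minimal sketch of a toy GAN in plain Python. Everything in it is an assumption made for illustration: the "real" data is just numbers scattered around 4.0, the generator has a single learnable offset, the discriminator is a one-feature logistic classifier, and the name `train_toy_gan` and its parameters are hypothetical. Real GANs use deep neural networks on both sides.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def train_toy_gan(real_mean=4.0, steps=5000, batch=32, lr=0.05, seed=0):
    """Train a one-parameter generator against a logistic discriminator."""
    rng = random.Random(seed)
    offset = 0.0        # generator: fake = noise + offset
    w, c = 0.0, 0.0     # discriminator: D(x) = sigmoid(w * x + c)
    for _ in range(steps):
        real = [rng.gauss(real_mean, 1.0) for _ in range(batch)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        fake = [z + offset for z in noise]

        # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
        grad_w = grad_c = 0.0
        for x in real:
            err = sigmoid(w * x + c) - 1.0   # gradient of -log D(x) w.r.t. the logit
            grad_w += err * x
            grad_c += err
        for x in fake:
            err = sigmoid(w * x + c)         # gradient of -log(1 - D(x)) w.r.t. the logit
            grad_w += err * x
            grad_c += err
        w -= lr * grad_w / batch
        c -= lr * grad_c / batch

        # Generator step: use the discriminator's verdict as feedback and
        # shift the fake samples in the direction that raises D(fake).
        grad_offset = 0.0
        for x in fake:
            grad_offset += (sigmoid(w * x + c) - 1.0) * w   # gradient of -log D(x) w.r.t. x
        offset -= lr * grad_offset / batch
    return offset, w, c


offset, w, c = train_toy_gan()
print(f"learned generator offset: {offset:.2f} (real data is centered on 4.0)")
```

The generator never looks at a real sample directly. It only receives the discriminator's verdicts, and with these settings its offset should drift toward the center of the real data.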

Lately, more methods for producing synthetic data have been gaining ground. Chief among them are diffusion models, in which an AI is trained to reconstruct data that has been gradually corrupted by adding more and more noise (random values) to real-world training data. Eventually, the AI can be fed pure random noise, which it works back into the kind of data it was originally trained on.
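The noising half of that process is simple enough to sketch in a few lines of plain Python. This is an illustration under stated assumptions, not a full diffusion model: the trained network that undoes the noise is left out, the function names are hypothetical, and the linear noise schedule (1,000 steps, from 0.0001 to 0.02) is a commonly used setup from the diffusion literature rather than anything specified in this article.

```python
import math
import random


def alpha_bar_schedule(num_steps: int = 1000,
                       beta_start: float = 1e-4,
                       beta_end: float = 0.02) -> list[float]:
    """Fraction of the original signal's variance left after each noising step."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0: list[float], alpha_bar: float,
              rng: random.Random) -> tuple[list[float], list[float]]:
    """Corrupt a clean sample; returns (noisy sample, the noise that was mixed in)."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    noisy = [signal_scale * x + noise_scale * n for x, n in zip(x0, noise)]
    return noisy, noise


rng = random.Random(0)
alpha_bars = alpha_bar_schedule()
clean = [1.0, -1.0, 0.5, 0.0]   # stand-in for one real training example
for t in (0, 499, 999):
    noisy, _ = add_noise(clean, alpha_bars[t], rng)
    print(t, round(alpha_bars[t], 4), [round(v, 2) for v in noisy])
```

Early steps leave the sample nearly intact, while the final step is almost pure noise. During training, a model would be shown the noisy sample and asked to predict the noise that was added, which is what later lets it work random input back into realistic data.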

Fake data is like real data without, well, the realness

Synthetic data, however it is produced, offers a number of very concrete advantages over using real-world data. First of all, it’s easier to collect way more of it, because you don’t have to rely on humans creating it. Second, the synthetic data comes perfectly labeled, so there’s no need to rely on labor-intensive data-labeling centers to (sometimes incorrectly) label data. Third, it can protect privacy and copyright, as the data is, well, synthetic. And finally, and perhaps most importantly, it can reduce biased outcomes.
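The "perfectly labeled" point follows from how the data is made: the program picks the label first and then renders an example to match, so the label is correct by construction. A minimal sketch, using a made-up task (telling 3x3 squares from crosses on a tiny grid) and hypothetical function names; it does not reflect how Task2Sim or any production generator works.

```python
import random

SHAPES = ("square", "cross")


def render(shape: str, size: int, top: int, left: int,
           background: float, brightness: float) -> list[list[float]]:
    """Draw a 3x3 square or cross on a size x size grid of pixel intensities."""
    image = [[background] * size for _ in range(size)]
    for dr in range(3):
        for dc in range(3):
            # A square fills all nine cells; a cross only its middle row and column.
            if shape == "square" or dr == 1 or dc == 1:
                image[top + dr][left + dc] = brightness
    return image


def make_dataset(n: int, size: int = 8, seed: int = 0):
    """Generate n (image, label) pairs with randomized position and lighting."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(n):
        label = rng.choice(SHAPES)              # the label is chosen first...
        image = render(
            label, size,
            top=rng.randrange(size - 2),        # controllable position
            left=rng.randrange(size - 2),
            background=rng.uniform(0.0, 0.3),   # controllable background
            brightness=rng.uniform(0.7, 1.0),   # controllable lighting
        )
        dataset.append((image, label))          # ...so it cannot be wrong
    return dataset


dataset = make_dataset(1000)
print(len(dataset), dataset[0][1])
```

No human ever reviews these 1,000 examples, and generating a million instead only means changing one argument.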

With AI playing an increasingly large role in technology and society, expectations around synthetic data are pretty optimistic. Gartner has famously estimated that 60% of training data will be synthetic data by 2024. Market analyst Cognilytica valued the market for synthetic data generation at $110 million in 2021 and projected it to grow to $1.15 billion by 2027.

Data has been called the most valuable commodity in the digital age. Big tech has sat on mountains of user data that gave it an advantage over smaller contenders in the AI space. Synthetic data can give smaller players the opportunity to turn the tables.

As you might suspect, the big question regarding synthetic data is its so-called fidelity, or how closely it matches real-world data. The jury is still out on this, but research seems to show that combining synthetic data with real data gives statistically sound results. This year, researchers from MIT and the MIT-IBM Watson AI Lab showed that an image classifier pretrained on synthetic data in combination with real data performed as well as an image classifier trained exclusively on real data.

All in all, both synthetic and real-world stoplights appear to be green for the near-future dominance of synthetic data in training privacy-friendly and safer AI models. With that, a possible future of smarter AIs is just over the horizon.