Why synthetic data makes real AI better

Data is precious; it has been called the world’s most valuable commodity.

And when it comes to training artificial intelligence (AI) and machine learning (ML) models, it is absolutely essential.

However, various factors can make it difficult or even impossible to obtain high-quality, real-world data.

This is where synthetic data becomes so valuable.

Synthetic data reflects data from the real world, both mathematically and statistically, but is generated in the digital world by computer simulations, algorithms, statistical modeling, simple rules, and other techniques. This is in contrast to data that is collected, compiled, annotated and labeled based on real sources, scenarios and experiments.
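
To make the statistical-modeling approach concrete, here is a minimal sketch in Python; the dataset and all names in it are invented for illustration. It fits a simple model, the empirical mean and covariance, to a table of real-world numeric records, then samples synthetic rows that match the original mathematically and statistically without reproducing any individual record.

```python
import numpy as np

# Hypothetical "real" dataset: rows are customers, columns are
# (age, annual_income, monthly_spend). In practice this would come from
# a real source; here it is simulated just to make the sketch runnable.
rng = np.random.default_rng(seed=42)
real_data = rng.multivariate_normal(
    mean=[40, 55_000, 1_200],
    cov=[[90, 30_000, 900],
         [30_000, 2.5e7, 4.5e5],
         [900, 4.5e5, 4.0e4]],
    size=1_000,
)

# Fit a simple statistical model: the empirical mean and covariance.
mu = real_data.mean(axis=0)
sigma = np.cov(real_data, rowvar=False)

# Sample synthetic records that mirror the real data's means and
# correlations without copying any actual row.
synthetic_data = rng.multivariate_normal(mu, sigma, size=1_000)

print("real means:     ", np.round(mu, 1))
print("synthetic means:", np.round(synthetic_data.mean(axis=0), 1))
```

Real generators are usually far more sophisticated (simulation engines, generative models, rule systems), but the principle is the same: learn the statistical structure of the source data, then sample from the model rather than from the people or events behind it.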

The concept of synthetic data has been around since the early 1990s, when Donald Rubin, a Harvard statistics professor, generated a series of anonymized US Census responses that matched those of the original dataset (but without identifying respondents by their home address, telephone number or social security number).

Synthetic data became more widely used in the 2000s, especially in the development of autonomous vehicles. Now synthetic data is increasingly being applied to numerous AI and ML use cases.

Synthetic data vs. real data

Real-world data is almost always the best source of insights for AI and ML models (because it’s real). That said, it can often be simply unavailable, unusable due to privacy regulations and restrictions, unbalanced or expensive. Bias can also introduce errors.

In fact, Gartner estimates that through 2022, 85% of AI projects will produce erroneous results.

“Real-world data is happenstantial and does not include all permutations of conditions or events that are possible in the real world,” said Alexander Linden, VP analyst at Gartner, in a Q&A conducted by the firm.

Synthetic data can address many of these challenges. According to experts and practitioners, it is often faster, easier and cheaper to produce, and it does not require cleaning and maintenance. It removes or reduces restrictions on the use of sensitive and regulated data, can account for edge cases, can be tailored to circumstances that are otherwise inaccessible or have not yet occurred, and can enable faster insights. Training is also less cumbersome and much more effective, particularly when real data cannot be used, shared or moved.

As Linden points out, information injected into AI models can sometimes be more valuable than direct observation. Likewise, some argue that synthetic data is better than the real thing — even revolutionary.

Companies apply synthetic data to various use cases: software testing, marketing, creating digital twins, testing AI systems for bias, or simulating the future, alternate futures, or the metaverse. Banks and financial institutions use synthetic data to investigate market behavior, make better credit decisions or combat financial fraud, Linden explains. Retailers, meanwhile, rely on it for autonomous POS systems, cashless stores, and customer demographics analysis.

“Combined with real data, synthetic data creates an enhanced data set that can often mitigate the weaknesses of the real data,” Linden says.

Still, he cautions that synthetic data has risks and limitations. Its quality depends on the quality of the model that created it, it can be misleading and lead to inferior results, and it may not be “100% fail safe” in terms of privacy.

Then there’s user skepticism: some have called it “fake data” or “inferior data.” As synthetic data becomes more widely adopted, business leaders may also raise questions about data generation techniques, transparency and explainability.

Real-world growth for synthetic data

According to a widely quoted forecast from Gartner, by 2024, 60% of the data used to develop AI and analytics projects will be generated synthetically. In fact, the firm says that high-quality, high-performing AI models simply won’t be possible without the use of synthetic data. Gartner further estimates that synthetic data will completely eclipse real data in AI models by 2030.

“Its breadth of applicability makes it a critical accelerator for AI,” Linden says. “Synthetic data enables AI where a lack of data makes AI useless due to bias or inability to recognize rare or unprecedented scenarios.”

According to Cognilytica, the synthetic data generation market was approximately $110 million in 2021. The research firm expects this to reach $1.15 billion by 2027. Grand View Research expects the market for AI training datasets to reach more than $8.6 billion by 2030, representing a compound annual growth rate (CAGR) of just over 22%.
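
As a quick sanity check on figures like these, the compound annual growth rate implied by a start value, an end value and a time span can be computed directly. A small illustrative Python snippet, using only the Cognilytica numbers quoted above:

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate implied by a start value, end value and span."""
    return (end / start) ** (1 / years) - 1

# Cognilytica: ~$110M in 2021 growing to ~$1.15B by 2027 (six years).
print(f"{cagr(110e6, 1.15e9, 2027 - 2021):.1%}")  # about 47.9% per year
```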

And as the concept grows, so do the contenders.

An increasing number of startups are entering the synthetic data space, receiving significant funding in the process. These include Datagen, which recently closed a $50 million Series B; Gretel.ai, with a $50 million Series B; Mostly AI, with a $25 million Series B; and Synthesis AI, with a $17 million Series A.

Other companies in the space include Sky Engine, OneView, Cvedia and leading data engineering firm Innodata, which recently launched an e-commerce portal where customers can purchase synthetic datasets on-demand and train models instantly. Several open source tools are also available: Synner, Synthea, Synthetig, and The Synthetic Data Vault.

Likewise, Google, Microsoft, Facebook, IBM and Nvidia already use or develop engines and programs for synthetic data.

Amazon, for its part, has relied on synthetic data to generate and fine-tune its Alexa virtual assistant. The company also offers WorldForge, which can generate synthetic scenes, and announced last week at the re:MARS (Machine Learning, Automation, Robotics and Space) conference that its SageMaker Ground Truth tool can now be used to generate labeled synthetic image data.

“By combining your real-world data with synthetic data, you can create more complete training datasets for training your ML models,” said Antje Barth, principal developer advocate for AI and ML at Amazon Web Services (AWS), in a blog post published in conjunction with re:MARS.

How synthetic data improves on the real world

Barth described building ML models as an iterative process involving data collection and preparation, model training, and model implementation.

In the beginning, a data scientist might spend months collecting hundreds of thousands of images from production environments. A major hurdle is capturing all possible scenarios and annotating them correctly. Obtaining variations may be impossible, as in the case of rare product defects; developers may then need to intentionally damage products to simulate different scenarios.

Then comes the time-consuming, error-prone, expensive process of manually labeling images or building labeling tools, Barth points out.

To simplify, streamline and improve this process, AWS introduced a new capability in SageMaker Ground Truth, Amazon’s data labeling service: the tool can now create synthetic, photo-realistic images.

The service allows developers to create an unlimited number of images of a given object in different positions, proportions, lighting conditions and other variations, explains Barth. This is critical, she notes, because models learn best when they have an abundance of sample images and training data, allowing them to calculate countless variations and scenarios.

Synthetic data can be created in huge quantities through the service, with “high-precision” labels for annotations across thousands of images. Labels can be applied in great detail, down to the sub-object or pixel level, and across modalities, including bounding boxes, polygons, depth and segments. Objects and environments can also be customized with variations in elements such as lighting, textures, poses, colors and background.
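
AWS’s actual output schema isn’t shown in the passage above, but as a rough, hypothetical illustration, a per-image label record covering the modalities Barth describes might look something like the following; every field name here is invented, not the real SageMaker Ground Truth format.

```python
# Hypothetical label record for one synthetic image. The structure and
# field names are illustrative only, not the actual service output.
synthetic_label = {
    "image": "scene_00042.png",
    "scene": {                          # the "ordered" variations
        "lighting": "overcast",
        "background": "conveyor_belt",
        "object_pose_deg": (12.0, -5.0, 90.0),
    },
    "objects": [
        {
            "class": "scratch_defect",
            "bounding_box": [112, 48, 188, 97],            # x_min, y_min, x_max, y_max
            "polygon": [(112, 60), (150, 48), (188, 80), (140, 97)],
            "depth_m": 0.84,                               # distance from camera
            "segment_mask": "scene_00042_obj0_mask.png",   # pixel-level labels
        }
    ],
}
```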

“In other words, you can ‘order’ exactly the use case for which you are training your ML model,” says Barth.

She adds that “if you combine your real-world data with synthetic data, you can create more complete and balanced datasets, adding data variety that real-world data may not have.”
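
To illustrate what that combination step could look like in practice, here is a minimal, hypothetical Python sketch that balances an under-represented class in a real labeled dataset by topping it up with synthetic samples; the file names, class labels and counts are all invented.

```python
from collections import Counter

# Hypothetical labeled datasets: (image_path, class_label) pairs.
# The real data is heavily imbalanced: few examples of the rare defect.
real_dataset = [("real_ok.png", "ok")] * 950 + [("real_defect.png", "defect")] * 50
synthetic_defects = [(f"synth_{i:04d}.png", "defect") for i in range(2_000)]

# Count how under-represented the rare class is in the real data.
counts = Counter(label for _, label in real_dataset)
target = max(counts.values())            # balance up to the majority class
shortfall = target - counts["defect"]    # 900 synthetic defect images needed

# Augment the real data with just enough synthetic samples to balance it.
combined = real_dataset + synthetic_defects[:shortfall]

print(Counter(label for _, label in combined))  # Counter({'ok': 950, 'defect': 950})
```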

Any scenario

In SageMaker Ground Truth, users can request new projects with synthetic data, track them in progress, and view batches of generated images as they are available for review.

After project requirements are established, an AWS project development team creates small test batches by collecting inputs, including reference photos and 2D and 3D sources, explains Barth. These are then modified to reflect any variation or scenario, such as scratches, dents, and textures. They can also create and add new objects, configure distributions and locations of objects in a scene, and change object size, shape, color, and surface texture.

Once prepared, objects are rendered through a photo-realistic physics engine and labeled automatically. Throughout the process, companies receive a fidelity and diversity report with image- and object-level metrics to “understand” synthetic images and compare them to real images, Barth said.

“With synthetic data,” she said, “you have the freedom to create any image environment.”
