Billion-Dollar Synthetic Data Market Booms: Feeding AI with 'Artificial' Data
The AI wave is surging, and data has become a hot business. To feed artificial intelligence with ample 'nourishment,' companies are capitalizing on every stage of data processing, from discovery and collection to annotation. Today, real data can no longer satisfy AI's growing 'appetite,' so businesses are turning to 'fake' data produced by AI itself, giving birth to the synthetic data industry.
At the end of last month, the Chinese synthetic data company 'Lightwheel Intelligence' announced the completion of its angel+ round of financing. A few months earlier, Singapore-based synthetic data startup Betterdata secured a $1.65 million seed round. Tech giants are entering the field as well: Microsoft, NVIDIA, Meta, and Amazon have all made moves in synthetic data, whether through business deployments, investments, or acquisitions.
What exactly is synthetic data? What industrial value and risks does it carry? How will it disrupt the AI industry?
The Rise of 'Artificial' Data
Unlike real data collected or measured from the real world, synthetic data, as the name suggests, is artificially generated 'fake' data. Because it can reflect the properties of the original data, synthetic data can serve as a substitute for real data to train, test, and validate AI models.
However, artificial synthesis does not mean fabrication out of thin air. At this stage, most synthetic data still has its 'roots' in real data.
Qian Wenyi, a senior software engineer at Unity China, explained the general generation process of synthetic data in computer vision-related projects: First, identifiable objects are found in reality and scanned to accurately recreate their models in a 3D scene. Next, labels such as color and size are applied to the model, depending on training needs. Finally, these objects are placed in various designed scenarios and randomly combined to quickly generate multiple images.
Thus, when training the same AI model, using real data might require cameras to continuously capture multiple photos of an object in different scenes and states, whereas synthetic data can produce hundreds or thousands of different images in a minute by adjusting parameters like object position, angle, and background—reducing costs and improving dataset generation efficiency.
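The loop below is a minimal sketch of that randomized generation process in Python. It assumes Pillow is installed; the file names ("object.png", a "backgrounds/" folder) and the pose ranges are illustrative stand-ins for the scanned asset and the designed scenes, not any particular vendor's pipeline.

```python
# Minimal sketch of the randomized image-generation loop described above.
# "object.png" and "backgrounds/" are hypothetical inputs standing in for the
# scanned asset and the designed scenes; pose ranges are illustrative.
import json
import random
from pathlib import Path
from PIL import Image

def generate_samples(object_path, background_dir, out_dir, n=1000):
    obj = Image.open(object_path).convert("RGBA")
    backgrounds = list(Path(background_dir).glob("*.jpg"))
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    labels = []
    for i in range(n):
        bg = Image.open(random.choice(backgrounds)).convert("RGBA")
        # Randomize pose: rotation, scale, and position within the scene.
        angle = random.uniform(0, 360)
        scale = random.uniform(0.5, 1.5)
        inst = obj.rotate(angle, expand=True)
        inst = inst.resize((int(inst.width * scale), int(inst.height * scale)))
        x = random.randint(0, max(0, bg.width - inst.width))
        y = random.randint(0, max(0, bg.height - inst.height))
        bg.alpha_composite(inst, (x, y))
        bg.convert("RGB").save(f"{out_dir}/img_{i:05d}.jpg")
        # The label (here a bounding box) comes "for free" from the placement.
        labels.append({"image": f"img_{i:05d}.jpg",
                       "bbox": [x, y, inst.width, inst.height]})
    with open(f"{out_dir}/labels.json", "w") as f:
        json.dump(labels, f)
```

Because every image is composed programmatically, its annotations are known exactly at generation time, which is where the cost and efficiency gains over manual capture and labeling come from.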
In fact, the concept of synthetic data is not new. It is said to have originated as early as 1993 in an article by Donald Rubin. In recent years, as AI technology has achieved breakthrough after breakthrough, the difficulty of collecting and acquiring real data has also increased, making it hard to satisfy the massive 'appetite' of AI training.
Synthetic data often serves as a cheaper alternative to real data. Aidan Gomez, CEO of AI startup Cohere, revealed at the end of last month that due to the high costs of acquiring data from companies like Reddit and Twitter, Microsoft, OpenAI, and Cohere have already used synthetic data to train AI models. Gomez noted that synthetic data can be applied to many training scenarios, though it has not yet been widely adopted.
However, in the view of Professor Wang Yuangen from the School of Computer Science and Network Engineering at Guangzhou University, cost is not the primary consideration for choosing synthetic data.
Real data involves significant personal privacy concerns, and reckless use could lead to serious legal disputes. Moreover, not all real data is usable. The internet is flooded with information that is hard to verify, and extracting usable data from chaotic real-world sources requires extensive manual filtering. Additionally, real data often suffers from imbalance. For example, when training a facial recognition system, scraped facial data from the internet may predominantly feature lighter-skinned faces, leading to biased models. Synthetic data can help mitigate these issues to some extent.
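As a rough illustration of that rebalancing idea, the sketch below tops up underrepresented groups with synthetic samples until every group matches the largest one; `generate_synthetic` is a hypothetical stand-in for whatever generator a team actually uses.

```python
# Sketch of rebalancing an uneven dataset with synthetic samples, as in the
# facial-recognition example above. `generate_synthetic` is a hypothetical
# callable that produces one synthetic image for a given group label.
from collections import Counter

def rebalance(samples, generate_synthetic):
    """samples: list of (image, group_label); top up every group to the max count."""
    counts = Counter(label for _, label in samples)
    target = max(counts.values())
    augmented = list(samples)
    for label, count in counts.items():
        # Fill the gap for underrepresented groups with synthetic images.
        augmented += [(generate_synthetic(label), label) for _ in range(target - count)]
    return augmented
```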
'Some real data is simply unavailable, such as clear underwater images. Synthetic data technology can simulate such data to supplement the completeness of training datasets,' Wang Yuangen added. Although much synthetic data is currently based on real data, technological advancements will gradually reduce this reliance. Some techniques already allow directly synthesized data to be indistinguishable from real data.
However, synthetic data is not flawless. An article published by AI training data provider Appen noted that synthetic data lacks outliers, which naturally occur in real data and are crucial for model accuracy. Additionally, the quality of synthetic data often depends on the input data used for generation, and biases in the input can easily propagate to the synthetic data. Thus, the importance of high-quality input data cannot be overstated. Companies must compare synthetic data with manually annotated real data as an additional control measure.
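One simple version of such a control check is to compare summary statistics of the synthetic set against a manually annotated real reference set, as in the hedged sketch below; the choice of features and the 10% tolerance are illustrative assumptions, not an industry standard.

```python
# Sketch of a basic control check: compare per-feature means of synthetic
# data against a manually annotated real reference set.
import numpy as np

def distribution_gap(real: np.ndarray, synthetic: np.ndarray) -> np.ndarray:
    """Relative difference of per-feature means between the two sets."""
    real_mean = real.mean(axis=0)
    synth_mean = synthetic.mean(axis=0)
    return np.abs(real_mean - synth_mean) / (np.abs(real_mean) + 1e-8)

# Usage: flag features where the synthetic set drifts more than 10% from real data.
# gap = distribution_gap(real_features, synthetic_features)
# suspect_features = np.where(gap > 0.10)[0]
```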
The More Sensitive, the Sooner the Breakthrough
Currently, in which fields is synthetic data primarily applied?
Compared with natural language and audio, synthetic data has made its earliest and most significant impact in computer vision. Experts attribute this to the relative tractability of image processing and to the fact that vision is the primary channel through which humans engage with their environment. In the future, synthetic data is expected to expand into other domains.
Synthetic data holds vast potential in scenarios like autonomous driving, healthcare, and finance. These fields share common traits: real data is highly sensitive, difficult to acquire, and often tied to critical outcomes, including personal safety, demanding exceptionally high data quality. "Areas with the greatest needs will see the earliest development and application. Synthetic data technology is most likely to achieve breakthroughs in these sensitive scenarios," said Wang Yuangen.
Take autonomous driving as an example. During actual driving, vehicles may encounter diverse and unpredictable road conditions, including extreme situations like severe traffic jams, accidents, or harsh weather. Testing with real vehicles in such scenarios is nearly impossible, making real-world data collection extremely challenging.
Synthetic data can simulate these scenarios. Wang Yuangen explained, "For instance, to simulate heavy rain, we use ordinary weather data collected daily to build a physical or network model. By inputting key parameters like 'heavy rain,' we can generate corresponding scenarios. The more accurate the model and parameters, the more realistic the scenario." This approach enhances autonomous driving capabilities while ensuring the safety of personnel and equipment.
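A hedged sketch of that kind of parameterized scenario generation is shown below; the weather presets and parameter ranges are made-up placeholders for whatever physical model a real simulator would use.

```python
# Sketch of parameterized scenario generation: a key parameter such as
# "heavy_rain" drives a simple model that samples scene conditions.
# Preset values are illustrative assumptions, not measured physics.
import random
from dataclasses import dataclass

@dataclass
class Scenario:
    rain_mm_per_h: float   # rainfall intensity
    visibility_m: float    # how far sensors/cameras can see
    road_friction: float   # 1.0 = dry asphalt

WEATHER_PRESETS = {
    "clear":      dict(rain=(0, 0),   visibility=(2000, 5000), friction=(0.9, 1.0)),
    "heavy_rain": dict(rain=(30, 80), visibility=(100, 400),   friction=(0.4, 0.6)),
}

def sample_scenario(weather: str) -> Scenario:
    p = WEATHER_PRESETS[weather]
    return Scenario(
        rain_mm_per_h=random.uniform(*p["rain"]),
        visibility_m=random.uniform(*p["visibility"]),
        road_friction=random.uniform(*p["friction"]),
    )

# e.g. feed sample_scenario("heavy_rain") into a driving simulator to render
# synthetic sensor data for conditions too risky to collect on real roads.
```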
Public records show that many autonomous vehicle manufacturers have heavily invested in synthetic data and simulation. For example, Waymo, an autonomous driving subsidiary under Alphabet (Google's parent company), generated 2.5 billion miles of simulated driving data in 2016 to train its autonomous systems (compared to just 3 million miles of real-world driving data). By 2019, this figure had reached 10 billion miles.
In China, Tencent's Autonomous Driving Lab has developed the TADSim simulation system, which can automatically generate unlabeled traffic scenario data. Huawei Cloud has also built a scene reconstruction model based on its Pangu model, capable of reconstructing scenes (synthetic data) from collected road video data. The reconstructed scenes are nearly indistinguishable from real ones to the naked eye.
However, because autonomous driving involves personal safety and synthetic data is, after all, not entirely real, companies are inevitably cautious about using it for training.
Lou Tiancheng, co-founder and CTO of Pony.ai, emphasized to 21st Century Business Herald that synthetic data includes both purely virtual data and modified real data. Currently, Pony.ai does not use purely virtual data for L4 perception modules, primarily because L4 solutions rely on LiDAR. For challenging scenarios like harsh weather or long-tail objects, the distribution gap between virtual LiDAR data and real data is too significant to improve real-world performance.
However, Pony.ai modifies real data to create synthetic data for perception algorithms. For modules not reliant on raw sensor input, such as path planning and scene understanding, synthetic data is used for training and simulation evaluation.
Lou believes that making virtual data realistic enough requires even higher annotation quality. For simpler scenarios, data mining and intelligent annotation loops are more cost-effective than developing highly realistic synthetic data. While academic research explores fully virtual data for autonomous driving training, and many companies are conducting preliminary studies, practical deployment remains limited. Virtual data helps progress from 0 to 80% proficiency but struggles to bridge the gap from 90% to 99%.
"We are closely monitoring advancements in synthetic virtual data and remain open-minded. If the technology matures sufficiently, we will consider its application," Lou added.
Will the Data Annotation Industry Be Reshaped?
Consulting firm Gartner predicts that by 2030, synthetic data will completely replace real data as the primary source for AI models. According to U.S. AI research firm Cognilytica, the synthetic data market was valued at approximately $110 million in 2021 and is projected to reach $1.15 billion by 2027. This represents a lucrative opportunity for tech giants and startups alike.
Multiple tech giants have made business deployments, investments, or acquisitions related to synthetic data. For example, in 2021, NVIDIA launched the Omniverse Replicator synthetic data generation engine for AI training. In July this year, Rendered.ai, a member of the NVIDIA Inception program, integrated Omniverse Replicator into its synthetic data generation platform, making AI training simpler and more accessible. Amazon is also exploring synthetic data applications in various scenarios, such as using synthetic data to train and debug its virtual assistant Alexa to avoid user privacy issues. Meta directly acquired synthetic data startup AI.Reverie to integrate it into its metaverse division, Reality Labs.
On the startup front, investment and acquisitions in the synthetic data field continue to heat up. Computer vision synthetic data provider Datagen announced the completion of a $50 million Series B funding round in early 2022. In April this year, Singapore-based synthetic data startup Betterdata secured $1.65 million in seed funding. At the end of July, domestic synthetic data company "Lightwheel Intelligence" announced the completion of an angel+ funding round. This newly established company has already completed seed, angel, and angel+ rounds, with cumulative funding reaching tens of millions of yuan.
Qian Wenyi observed, "Over the past few years, hundreds or even thousands of new startups have emerged globally each year, providing synthetic data products for algorithm training across various industries."
Amid this industry boom, China has also begun encouraging and guiding the development of the synthetic data industry. In early March this year, Yao Qian, Director of the Technology Supervision Bureau of the China Securities Regulatory Commission, wrote in the "China Finance" magazine, suggesting a focus on developing the synthetic data industry based on AIGC technology. This would "expand incrementally" the data element market with higher efficiency, lower costs, and better quality, helping to build a data advantage for the future development of AI. On May 19, Beijing released the "Beijing General Artificial Intelligence Industry Innovation Partnership Plan," which mentioned plans to establish a national-level data training base and proposed support for developing a new synthetic data industry based on AIGC technology.
For a long time, the massive demand for data in AI has spawned a workforce of data annotators. Now, as synthetic data gains momentum, will the data annotation industry face disruption?
Wang Yuangen believes that disruption is inevitable, but demand will persist. "First, this shift won't happen overnight. Second, annotators will need to adapt. For example, instead of annotating raw data, they will now work with AI-generated data. Additionally, annotators may be required to distinguish between AI-generated data and natural data. Even as synthetic data becomes more prevalent and higher in quality, human guidance and supervision will remain essential to correct potential biases promptly."