
Synthetic Data: Exploring the Promise and Perils

December 24, 2024

The Viability of AI Training on AI-Generated Data

The question of whether an artificial intelligence can be effectively trained using data exclusively produced by another AI is a topic of growing interest. Initially, this concept may appear counterintuitive. However, it has been a subject of exploration for a considerable period, and its relevance is increasing.

The increasing scarcity of new, authentic data is a primary driver behind this trend. As obtaining real-world datasets becomes more challenging, the possibility of leveraging synthetic data for AI training is gaining momentum.

Examples of AI Trained on Synthetic Data

Several prominent organizations are already employing synthetic data in their AI development processes. Anthropic, for instance, utilized synthetically generated data during the training of its Claude 3.5 Sonnet model.

Furthermore, Meta enhanced its Llama 3.1 models through fine-tuning with data created by artificial intelligence. Reports also suggest that OpenAI is exploring the use of synthetic training data, sourced from its own o1 model – designed for reasoning – for the forthcoming Orion model.

Understanding AI’s Data Requirements

A fundamental question arises: why is data essential for AI, and what specific characteristics should this data possess? Moreover, is it genuinely feasible to substitute authentic data with its synthetic counterpart?

AI algorithms require vast amounts of data to learn patterns, make predictions, and improve their performance. The quality and diversity of this data are crucial factors influencing the AI’s capabilities.

The debate centers around whether AI-generated data can adequately replicate the complexity and nuance found in real-world data, and whether training on such data will lead to comparable results.

The Critical Role of Data Annotations

Artificial intelligence systems function as statistical engines. Through exposure to extensive datasets, they discern underlying patterns and subsequently utilize these patterns to formulate predictions. For instance, they learn that the phrase “to whom” in an email is frequently followed by “it may concern.”
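The "statistical engine" idea can be illustrated with a deliberately tiny next-word predictor: count which word follows each word in a corpus, then predict the most frequent continuation. This is a toy sketch for intuition only; production language models use neural networks rather than lookup tables, but the underlying statistical principle is the same.

```python
from collections import Counter, defaultdict

# A toy corpus standing in for the extensive datasets a real model sees.
corpus = (
    "to whom it may concern . "
    "to whom it may concern . "
    "to whom do I address this ? "
).split()

# Count which token follows each token (a bigram table).
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def predict_next(word):
    """Predict the most frequent continuation observed in training."""
    return following[word].most_common(1)[0][0]

print(predict_next("whom"))  # "it" -- the most common continuation seen
print(predict_next("may"))   # "concern"
```

Because "it" follows "whom" more often than "do" does in this corpus, the model predicts "it"; scale the same idea up by many orders of magnitude and you get the pattern-completion behavior described above.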

Annotations, typically involving textual labels defining the meaning or components of the data fed into these systems, are fundamental to this process. They act as crucial indicators, effectively instructing a model to differentiate between objects, locations, and concepts.

How Annotations Train AI Models

Imagine a model designed to classify images, presented with numerous pictures of kitchens, each labeled as “kitchen.” During its training phase, the model establishes connections between the term “kitchen” and the typical features of kitchens, such as refrigerators and countertops.

Following training, when presented with a previously unseen image of a kitchen, the model should accurately identify it. This highlights the significance of accurate annotation; mislabeling kitchens as “cow” would lead to erroneous identification.
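The label-to-feature association described above can be sketched as a nearest-centroid classifier over hand-made feature vectors. The features and examples here are invented for illustration; real image models learn features directly from pixels rather than receiving them pre-computed.

```python
import math

# Hypothetical feature vectors: (has_refrigerator, has_countertop, has_bed).
# Each example carries a label, exactly as a human annotator would supply it.
training_data = [
    ((1.0, 1.0, 0.0), "kitchen"),
    ((1.0, 0.9, 0.1), "kitchen"),
    ((0.0, 0.1, 1.0), "bedroom"),
    ((0.1, 0.0, 0.9), "bedroom"),
]

# "Training": average the feature vectors that share a label (a centroid).
centroids = {}
for label in {lbl for _, lbl in training_data}:
    vectors = [vec for vec, lbl in training_data if lbl == label]
    centroids[label] = tuple(sum(dim) / len(vectors) for dim in zip(*vectors))

def classify(features):
    """Assign the label whose centroid is closest to the new example."""
    return min(centroids, key=lambda lbl: math.dist(features, centroids[lbl]))

# A previously unseen kitchen-like image is identified via its label.
print(classify((0.9, 1.0, 0.0)))  # kitchen
```

The dependence on annotation quality is visible here too: had the annotators labeled those first two examples “cow,” the classifier would confidently return “cow” for every kitchen it sees.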

The Expanding Annotation Market

The increasing demand for AI and the consequent need for labeled training data have dramatically expanded the market for annotation services. Dimension Market Research currently values this market at $838.2 million, projecting a substantial growth to $10.34 billion within the next decade.

While the exact number of individuals involved in data labeling remains elusive, a 2022 research paper estimates the workforce to be in the “millions.”

The Human Cost of AI Training

Organizations of all sizes depend on workers employed by data annotation companies to generate labels for AI training datasets. Some of these positions offer competitive compensation, especially when specialized skills are required, such as a background in mathematics.

However, many annotation tasks are grueling and monotonous. Workers in developing nations often earn only a few dollars per hour, with no benefits and little job security.

  • Key takeaway: Accurate annotations are essential for effective AI.
  • The annotation market is experiencing rapid growth.
  • Annotation work presents both opportunities and challenges for workers globally.

The Diminishing Availability of Training Data

There are compelling reasons, both ethical and pragmatic, to explore alternatives to relying solely on human-labeled data. Yet dependence on this labor continues to grow: Uber, for example, is expanding its fleet of gig economy workers to handle AI annotation and data labeling tasks.

The speed at which humans can label data is inherently limited. Furthermore, human annotators are susceptible to biases that can be reflected in their work, ultimately impacting the models trained on this data. Errors are also common, and inconsistencies can arise from ambiguous labeling guidelines.

The cost associated with human labor is a significant factor. More broadly, acquiring data itself is an expensive undertaking. Shutterstock, for instance, is levying substantial fees on AI companies for access to its image library.

Reddit has also capitalized on this trend, generating considerable revenue through data licensing agreements with companies like Google and OpenAI. Beyond cost, the accessibility of data is also decreasing.

The majority of AI models are currently trained using vast amounts of publicly available data. However, data owners are increasingly restricting access due to concerns about potential plagiarism and the need for proper attribution.

Growing Restrictions on Data Access

Currently, over 35% of the top 1,000 websites globally are actively blocking OpenAI’s web scraping tools. A recent study indicates that approximately 25% of data originating from sources considered “high-quality” is now inaccessible to the major datasets used for model training.

Epoch AI forecasts that, if these access restrictions persist, developers may exhaust the available data needed to train generative AI models sometime between 2026 and 2032. This scarcity, coupled with the threat of copyright litigation and the potential inclusion of undesirable content in open datasets, is prompting a critical reassessment for AI developers.

The combination of these factors is creating a challenging environment for AI innovation and driving the search for alternative data acquisition and labeling strategies.

  • Human labeling is slow and expensive.
  • Annotator bias can negatively impact model performance.
  • Data access is becoming increasingly restricted.

Data scarcity presents a significant hurdle for the continued advancement of artificial intelligence.

Synthetic Data as an Alternative

Initially, synthetic data appears to offer a comprehensive solution to numerous challenges. The need for annotations can be met through generation. A greater volume of example data is readily achievable. The potential seems limitless.

To a degree, this assessment holds true.

Os Keyes, a PhD candidate at the University of Washington specializing in the ethical implications of emerging technologies, articulated to TechCrunch that, “If ‘data is the new oil,’ synthetic data positions itself as biofuel, capable of being created without the detrimental consequences associated with its natural counterpart.” He further explained that a limited initial dataset can be leveraged to simulate and extrapolate new data points.

The artificial intelligence sector has enthusiastically embraced this concept.

Writer, a company focused on generative AI for enterprise applications, recently introduced Palmyra X 004, a model primarily trained on synthetic data. The development cost, according to Writer, was only $700,000 – a figure significantly lower than the estimated $4.6 million required for a comparable OpenAI model.

Microsoft’s Phi open models benefited from training with synthetic data. Similarly, Google’s Gemma models incorporated it. Nvidia unveiled a family of models this summer specifically designed for generating synthetic training data, and Hugging Face, an AI startup, has released what it claims is the largest synthetic text dataset for AI training.

The creation of synthetic data has evolved into a distinct industry, projected to reach a value of $2.34 billion by 2030. Gartner forecasts that 60% of the data utilized for AI and analytics projects this year will be synthetically generated.

Luca Soldaini, a senior research scientist at the Allen Institute for AI, highlighted that synthetic data techniques facilitate the creation of training data in formats difficult to obtain through conventional methods like web scraping or content licensing. As an example, Meta employed Llama 3 to generate captions for footage used in training its video generator, Movie Gen, with human refinement subsequently added to enhance detail, such as lighting descriptions.

Following a similar approach, OpenAI reports utilizing synthetic data to fine-tune GPT-4o, enabling the development of the sketchpad-like Canvas feature within ChatGPT. Amazon has also stated its use of synthetic data to augment the real-world data employed in training speech recognition models for Alexa.

Soldaini emphasized that “Synthetic data models can efficiently expand upon human understanding of the data required to achieve a specific model behavior.”

Potential Risks Associated with Synthetic Data

While offering numerous benefits, synthetic data isn't without its drawbacks. Like all AI-driven processes, it's susceptible to the “garbage in, garbage out” principle. The quality of synthetic data is directly dependent on the data used to train the generating models.

If the initial training data contains biases or limitations, these will inevitably be reflected in the synthetic outputs. Consequently, underrepresented groups in the original dataset will likely remain so in the artificially created data.

“The extent of improvement is limited,” Keyes explained. “For example, if a dataset includes only 30 individuals identifying as Black, attempting to extrapolate from this limited sample may offer some benefit, but the resulting ‘representative’ data will likely mirror the characteristics of those 30 individuals – perhaps all being middle-class or having lighter skin tones.”

A 2023 study conducted by researchers at Rice University and Stanford University highlighted that excessive reliance on synthetic data during model training can lead to a progressive decline in both quality and diversity.

According to the researchers, sampling bias – a lack of accurate real-world representation – causes a model’s diversity to diminish with each successive generation of training. However, they also noted that incorporating even a small amount of real-world data can help to counteract this effect.
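The generational decay the researchers describe can be demonstrated with a deliberately simplified simulation, not drawn from the study itself: each "model generation" fits a Gaussian to the previous generation's output and samples from it, with the generator's tendency to oversample typical examples modeled by discarding the rarest draws. Diversity (here, standard deviation) shrinks generation over generation, while blending real data back in counteracts the collapse.

```python
import random
import statistics

random.seed(42)

def train_and_sample(data, n=500):
    """'Train' a toy generator on `data` (fit a Gaussian), then sample from it.
    Generative models oversample typical examples; that bias is modeled here
    by discarding the rarest 20% of draws (the distribution's tails)."""
    mu, sigma = statistics.mean(data), statistics.stdev(data)
    draws = sorted(random.gauss(mu, sigma) for _ in range(int(n * 1.25)))
    trim = (len(draws) - n) // 2
    return draws[trim:trim + n]  # keep only the central, "typical" draws

real_data = [random.gauss(0.0, 1.0) for _ in range(500)]

# Generation after generation trained purely on the previous model's output.
data = list(real_data)
for _ in range(10):
    data = train_and_sample(data)
collapsed_spread = statistics.stdev(data)

# The same loop, but blending 20% real data back in at every generation.
data = list(real_data)
for _ in range(10):
    data = random.sample(train_and_sample(data), 400) + random.sample(real_data, 100)
blended_spread = statistics.stdev(data)

print(statistics.stdev(real_data))  # roughly 1.0
print(collapsed_spread)             # far smaller: diversity has decayed
print(blended_spread)               # real data keeps the spread from collapsing
```

The numbers are a statistical toy, but the mechanism matches the finding above: without real-world samples the distribution narrows each generation, and even a modest real-data fraction anchors it.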

Keyes also identifies risks associated with more complex models, such as OpenAI’s o1, suggesting they may generate synthetic data containing subtle, difficult-to-detect hallucinations. These inaccuracies could then compromise the accuracy of models trained on this data, particularly if the origins of the hallucinations are unclear.

“Complex models are prone to hallucination, and data originating from these models will inevitably contain such inaccuracies,” Keyes stated. “Furthermore, with a model like o1, even the developers may struggle to explain the emergence of certain artefacts.”

The compounding of hallucinations can ultimately result in models that produce nonsensical outputs. A study featured in the journal Nature demonstrates how models, when trained on flawed data, generate increasingly error-ridden data, creating a detrimental feedback loop that degrades subsequent model generations.

The researchers discovered that models gradually lose their understanding of specialized knowledge, becoming more generalized and frequently providing responses that are irrelevant to the questions posed.

Further research indicates that other model types, such as image generators, are also vulnerable to this type of degradation.

Soldaini emphasizes that “unprocessed” synthetic data should be treated with caution, especially if the goal is to prevent the development of chatbots that forget information and image generators that produce homogenous results.

He asserts that utilizing it “safely” necessitates thorough review, curation, and filtering, along with the integration of current, real-world data – mirroring the best practices for handling any dataset.

Neglecting these precautions could potentially lead to model collapse, where a model’s outputs become less innovative – and more biased – ultimately impairing its functionality. While this process can be identified and corrected, it remains a significant risk.

“It is crucial for researchers to examine the generated data, refine the generation process, and implement safeguards to eliminate low-quality data points,” Soldaini said. “Synthetic data pipelines are not inherently self-improving; their output requires careful inspection and enhancement before being used for training.”
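The generate-inspect-filter-blend workflow Soldaini describes can be sketched as a minimal pipeline. Everything here is a stand-in invented for illustration: the generator, the lexical-diversity heuristic, and the blend ratio. Real pipelines use trained quality classifiers, perplexity filters, and human review rather than a one-line score.

```python
def generate_synthetic(n):
    """Stand-in generator: some of its outputs are degenerate (empty or
    repetitive), mimicking the low-quality samples a real model produces."""
    good = ["The invoice is attached for your review."] * (n // 2)
    bad = ["the the the", ""] * (n // 4)
    return good + bad

def quality_score(text):
    """Toy heuristic: penalize empty or highly repetitive text."""
    words = text.split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)  # lexical diversity in [0, 1]

def curate(samples, threshold=0.5):
    """Filter out low-quality samples before they reach training."""
    return [s for s in samples if quality_score(s) >= threshold]

def build_training_set(real_samples, n_synthetic, real_fraction=0.3):
    """Blend curated synthetic data with real data, per the best practice
    described above of pairing synthetic outputs with fresh real-world data."""
    synthetic = curate(generate_synthetic(n_synthetic))
    n_real = int(len(synthetic) * real_fraction)
    return real_samples[:n_real] + synthetic

real = ["Please find the quarterly report enclosed."] * 100
training_set = build_training_set(real, n_synthetic=40)
print(len(training_set))  # degenerate samples never reach the training set
```

The point of the sketch is the shape of the pipeline, not the heuristic: every synthetic sample passes through an explicit quality gate, and real data is mixed in by design rather than as an afterthought.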

Sam Altman, CEO of OpenAI, has proposed that AI will eventually be capable of generating synthetic data of sufficient quality to effectively train itself. However, even if this is achievable, the necessary technology is not currently available.

Currently, no major AI laboratory has released a model that has been trained exclusively on synthetic data. Therefore, human oversight will likely remain essential to ensure the integrity of model training for the foreseeable future.


Update: This article was initially published on October 23 and was updated on December 24 with additional information.
