
DeepSeek AI Model Training: Did They Use Google Gemini?

June 3, 2025

DeepSeek AI Model and Potential Data Sourcing Concerns

Recently, DeepSeek, a Chinese AI laboratory, launched an upgraded iteration of its R1 reasoning AI model. This new version demonstrates strong performance across various mathematical and coding assessments.

However, the precise data sources utilized for the model’s training remain undisclosed. Some AI researchers hypothesize that a portion of the training data may have originated from Google’s Gemini series of AI models.

Evidence Suggesting Gemini Data Usage

Sam Paech, a Melbourne-based developer who builds emotional-intelligence evaluations for AI models, has presented evidence suggesting that DeepSeek’s latest model may have been trained on Gemini outputs. Paech asserts that the R1-0528 model favors vocabulary and phrasing consistent with Google’s Gemini 2.5 Pro.

Further analysis by the creator of SpeechMap, a platform evaluating AI free speech, revealed that the DeepSeek model’s internal reasoning processes – its “traces” – closely resemble those generated by Gemini.

Previous Accusations of Data Training Practices

This isn't the first time DeepSeek has faced accusations regarding its data training methods. In December, observations suggested that DeepSeek’s V3 model frequently identified itself as ChatGPT, the AI chatbot developed by OpenAI.

This behavior implied potential training on ChatGPT chat logs.

OpenAI's Findings and Microsoft's Detection

Earlier this year, OpenAI informed the Financial Times that it had discovered evidence linking DeepSeek to the practice of distillation.

Distillation is a technique for training an AI model on the outputs of a larger, more capable model, so that the smaller model learns to imitate the larger one’s behavior.
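To make the idea concrete, here is a minimal numpy sketch of the core of classic knowledge distillation: the student is penalized (via KL divergence) for diverging from the teacher’s temperature-softened output distribution. The function names, temperature value, and example logits are illustrative, not anything attributed to DeepSeek or OpenAI.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Convert raw logits into a probability distribution, optionally softened."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between the teacher's soft labels and the student's predictions."""
    p = softmax(teacher_logits, temperature)  # teacher's "soft labels"
    q = softmax(student_logits, temperature)  # student's current distribution
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Illustrative logits for a single 3-class prediction
teacher = [4.0, 1.0, 0.5]
student = [3.0, 1.5, 0.2]
loss = distillation_loss(teacher, student)  # higher = student further from teacher
```

In the scenario researchers describe, the “teacher” signal would not be logits at all but sampled text from a commercial API, which is precisely why providers restrict such use in their terms of service.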

According to reports from Bloomberg, Microsoft, a significant investor and collaborator with OpenAI, detected substantial data exfiltration through OpenAI developer accounts in late 2024. These accounts are believed to be associated with DeepSeek.

The Implications of Distillation

While distillation itself isn't unusual, OpenAI’s terms of service explicitly prohibit customers from utilizing the company’s model outputs to develop competing AI systems.

The Challenge of Data Contamination

It’s important to note that model misidentification and convergence on similar phrasing are common occurrences.

The open web, a primary source of training data for AI companies, is increasingly saturated with AI-generated content, including clickbait and bot-generated posts on platforms like Reddit and X.

This “contamination” significantly complicates the process of effectively filtering AI outputs from training datasets.
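One simple (and admittedly crude) way to picture the filtering problem is phrase-based screening, where documents containing tell-tale assistant boilerplate are dropped. The marker phrases and corpus below are hypothetical; production pipelines rely on far more sophisticated classifiers, and no specific company’s filter is described here.

```python
# Toy heuristic: flag documents containing phrases typical of AI assistant output.
AI_MARKERS = (
    "as an ai language model",
    "i am chatgpt",
    "i cannot assist with",
)

def looks_ai_generated(doc: str) -> bool:
    """Return True if the document contains any known AI-output marker phrase."""
    text = doc.lower()
    return any(marker in text for marker in AI_MARKERS)

corpus = [
    "The stock closed higher on Tuesday after strong earnings.",
    "As an AI language model, I cannot provide that information.",
]
clean = [doc for doc in corpus if not looks_ai_generated(doc)]  # keeps only the first doc
```

The weakness is obvious: most AI-generated text carries no such markers, which is why contamination is so hard to scrub from web-scale training sets.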

Expert Opinions

Despite these challenges, AI experts, such as Nathan Lambert from the AI research institute AI2, believe the possibility of DeepSeek training on Google’s Gemini data remains plausible.

Lambert suggests that generating synthetic training data from the best available API models would be a strategically sensible move for DeepSeek, since the lab is short on GPUs but well funded.

Security Measures and Responses

In response to concerns about distillation, AI companies are enhancing their security protocols.

OpenAI implemented an ID verification process in April for organizations seeking access to advanced models. This process requires government-issued identification from a supported country, excluding China.

Google has begun “summarizing” the traces generated by models available through its AI Studio platform, making it more difficult to train rival models using Gemini traces. Anthropic announced a similar measure in May to safeguard its “competitive advantages.”

A request for comment has been sent to Google, and this article will be updated upon receiving a response.

Key Takeaways

  • DeepSeek’s R1 model shows strong performance but faces scrutiny over its training data.
  • Evidence suggests potential use of Google’s Gemini outputs in the training process.
  • AI companies are increasing security measures to prevent data distillation.
  • Data contamination from AI-generated content poses a significant challenge.
Tags: DeepSeek, Gemini, AI model, training data, Google, artificial intelligence