Here's a question that puzzles newcomers to AI: How do you train a model to do something humans can't evaluate?
If you want AI to write code, humans can check if the code works. If you want AI to summarize documents, humans can judge quality. But what about:
- Math problems too complex for most humans to verify?
- Scientific reasoning requiring domain expertise?
- Scale that exceeds available human evaluators?
The answer is increasingly: synthetic data. AI training AI.
This might be the most important - and least discussed - development in modern AI.
What Is Synthetic Data?
The Traditional Approach
Human-generated data:
1. Collect examples created by humans (text, images, labels)
2. Train model on human examples
3. Model learns to approximate human behavior
The limitation: You can only train on tasks where humans produce examples. Scale is limited by human effort.
The Synthetic Approach
AI-generated data:
1. Use existing AI to generate examples
2. Filter for quality (verified, scored, ranked)
3. Train new AI on the synthetic examples
4. The new AI exceeds the original's capability
The question: How can training on AI output produce a model better than the AI that generated it?
How Synthetic Data Works
Verification vs. Generation
The key insight: It's often easier to verify a correct answer than to generate it.
Examples:
- Code: Generating correct code is hard; running tests is easy
- Math: Producing proofs is hard; checking proofs is mechanical
- Logic: Reasoning is hard; evaluating validity is structured
The strategy: Generate many candidates, verify automatically, train on verified correct examples.
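To make the strategy concrete, here's a minimal sketch in Python, using integer-root equations as a stand-in for genuinely hard problems. The `propose_root` function is a hypothetical placeholder for a model call; the verifier is just substitution, which is the whole point.

```python
import random

def propose_root(a, b, c):
    """Stand-in for a model proposing an integer root of a*x^2 + b*x + c = 0.
    A real pipeline would sample candidate solutions from an LLM."""
    return random.randint(-10, 10)

def verify_root(a, b, c, x):
    """Verification is cheap and mechanical: substitute and check."""
    return a * x * x + b * x + c == 0

def build_verified_dataset(problems, attempts_per_problem=100):
    """Generate many candidates; keep only the ones that check out."""
    dataset = []
    for a, b, c in problems:
        for _ in range(attempts_per_problem):
            x = propose_root(a, b, c)
            if verify_root(a, b, c, x):
                dataset.append({"equation": (a, b, c), "root": x})
                break
    return dataset

# x^2 - x - 6 = 0 has roots 3 and -2. Blind proposals are weak generators,
# but the verifier filters them into reliable training examples.
print(build_verified_dataset([(1, -1, -6)]))
```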
Constitutional AI and Self-Critique
The approach (pioneered by Anthropic):
1. AI generates responses
2. AI critiques its own responses against principles
3. AI revises based on its own critique
4. Training data includes improved responses
The result: AI that's better at following guidelines than could be achieved through direct human feedback alone.
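A rough sketch of the critique-and-revise loop follows. The `call_model` function and the two principles are illustrative placeholders, not Anthropic's actual API or constitution.

```python
# Minimal sketch of a critique-and-revise loop. call_model() is a
# hypothetical placeholder for any chat-completion client; the
# principles are illustrative, not Anthropic's actual constitution.

PRINCIPLES = [
    "Avoid giving dangerous instructions.",
    "Acknowledge uncertainty instead of guessing.",
]

def call_model(prompt: str) -> str:
    raise NotImplementedError("wire up your LLM client here")

def constitutional_revision(user_prompt: str) -> dict:
    response = call_model(user_prompt)
    for principle in PRINCIPLES:
        critique = call_model(
            f"Critique this response against the principle "
            f"'{principle}':\n\n{response}"
        )
        response = call_model(
            f"Revise the response to address this critique.\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    # The (prompt, revised response) pair becomes a training example.
    return {"prompt": user_prompt, "response": response}
```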
Reinforcement Learning from AI Feedback (RLAIF)
Traditional RLHF:
- Humans rank AI outputs
- The model learns to produce preferred outputs
- Scale is limited by human rating capacity
The RLAIF alternative:
- An AI evaluator ranks outputs
- The model learns from AI preferences
- Scale is limited only by compute
The concern: Does this just amplify the evaluator's biases?
The evidence: With careful design, RLAIF produces models that humans also prefer.
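Here's a minimal sketch of how AI preference labels might be collected. The `policy` and `judge` functions are hypothetical model calls, and the output format assumes a reward-model or DPO-style trainer downstream.

```python
# Sketch: building a preference dataset with an AI judge instead of
# human raters. policy() and judge() are hypothetical model calls.

def policy(prompt: str) -> str:
    raise NotImplementedError("the model being trained")

def judge(prompt: str, a: str, b: str) -> str:
    """Ask an evaluator model which response better serves the prompt.
    Returns 'a' or 'b'. In practice you'd score both orderings to
    control for position bias."""
    raise NotImplementedError("the evaluator model")

def collect_preferences(prompts):
    pairs = []
    for prompt in prompts:
        a, b = policy(prompt), policy(prompt)  # two samples per prompt
        winner = judge(prompt, a, b)
        chosen, rejected = (a, b) if winner == "a" else (b, a)
        # These triples feed a reward model or a DPO-style trainer.
        pairs.append({"prompt": prompt, "chosen": chosen,
                      "rejected": rejected})
    return pairs
```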
Where Synthetic Data Shines
Mathematics and Reasoning
The challenge: Human-verified math proofs are scarce. You can't train frontier-level reasoning on textbook problems alone.
The solution:
1. Generate candidate solutions to problems
2. Verify solutions automatically (proofs can be checked)
3. Train on verified solutions
4. Model learns reasoning patterns, not just answers
The result: systems like DeepMind's AlphaProof, which trained on automatically verified proofs and reached silver-medal performance on 2024 International Mathematical Olympiad problems.
Code Generation
The challenge: There's lots of code, but limited high-quality annotated examples.
The solution:
1. Generate code for programming tasks
2. Execute tests to verify correctness
3. Train on code that passes tests
4. Include reasoning traces showing problem-solving approach
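A toy version of steps 2-3, assuming Python as the target language. Note that `exec` on untrusted model output needs real sandboxing in production; this is a sketch of the filtering idea, not a safe harness.

```python
# Sketch: keep only generated code that passes its tests. In production
# you would run candidates in a sandbox; exec() on untrusted model
# output is unsafe outside a toy example.

CANDIDATES = [
    "def median(xs): return sorted(xs)[len(xs) // 2]",  # buggy for even n
    ("def median(xs):\n"
     "    s = sorted(xs)\n"
     "    m = len(s) // 2\n"
     "    return s[m] if len(s) % 2 else (s[m - 1] + s[m]) / 2"),
]

def passes_tests(source: str) -> bool:
    scope = {}
    try:
        exec(source, scope)
        fn = scope["median"]
        assert fn([3, 1, 2]) == 2
        assert fn([4, 1, 3, 2]) == 2.5
        return True
    except Exception:
        return False

verified = [c for c in CANDIDATES if passes_tests(c)]
print(f"{len(verified)} of {len(CANDIDATES)} candidates kept")  # 1 of 2
```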
The result: Models that solve competitive programming problems better than most human competitors.
Scientific Discovery
The challenge: Scientific knowledge is specialized and sparse.
The solution:
1. Generate hypotheses and experimental designs
2. Validate against known constraints and simulation
3. Train on plausible, consistent proposals
4. Human scientists evaluate the most promising
The result: DeepMind's GNoME, which used AI-guided search to propose roughly 2.2 million candidate crystal structures, hundreds of thousands of them predicted to be stable.
The Model Collapse Concern
The Risk
The fear: If AI trains on AI output, errors accumulate. Each generation is slightly worse. Eventually: garbage.
The term: Model collapse - the progressive degradation of model quality when models are trained, generation after generation, on the outputs of earlier models.
The Research
Early studies: Yes, naive training on model outputs degrades quality.
The nuance:
- Collapse happens when synthetic data replaces human data entirely
- Mixing synthetic and human data avoids collapse
- Verification/filtering maintains quality
- Diverse generation sources prevent mode collapse
The evidence: Frontier models use synthetic data extensively without obvious collapse.
The Best Practices
- Never fully replace human data with synthetic
- Always verify/filter synthetic examples
- Maintain diversity in generation
- Monitor for quality degradation
- Use synthetic data to augment, not replace
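A minimal sketch of the augment-don't-replace rule. The 30% synthetic cap and the `verifier` callback are illustrative assumptions, not established constants.

```python
import random

def build_training_mix(human_data, synthetic_data, verifier,
                       synthetic_fraction=0.3, seed=0):
    """Cap verified synthetic examples at a fixed fraction of the mix.
    The 30% cap is an illustrative choice, not an established constant."""
    verified = [ex for ex in synthetic_data if verifier(ex)]
    # Solve s / (len(human_data) + s) = synthetic_fraction for s.
    max_synthetic = int(len(human_data) * synthetic_fraction
                        / (1 - synthetic_fraction))
    rng = random.Random(seed)
    sampled = rng.sample(verified, min(max_synthetic, len(verified)))
    mix = human_data + sampled
    rng.shuffle(mix)
    return mix
```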
Implications for AI Development
The Data Moat Erodes
The old advantage: Access to large, high-quality human datasets.
The new reality: Anyone can generate synthetic data. The differentiator is knowing how to use it effectively.
The shift: From "who has the most data" to "who uses data most intelligently."
Training Becomes More Efficient
The pattern: Each model generation enables more efficient training of the next.
DeepSeek's efficiency: Partially explained by sophisticated synthetic data use.
The implication: The compute cost curve may fall faster than Moore's Law due to data efficiency gains.
Capabilities Accelerate in Specific Domains
Domains with good verifiers see faster progress:
- Math (proofs are checkable)
- Code (tests are runnable)
- Science (simulations can validate)
Domains with weak verifiers progress more slowly:
- Creative writing (quality is subjective)
- Ethics (correctness is contested)
- General knowledge (verification is expensive)
The Societal Questions
Attribution and Ownership
The question: If synthetic data trains AI, who owns the result?
The complexity:
- Synthetic data is generated by models trained on human data
- But synthetic examples are novel, not copies
- Where does "derived from" become "independent of"?
The legal uncertainty: Courts are grappling with these questions. No clear precedent.
The Self-Improvement Loop
The possibility: AI that improves itself by generating better training data.
The concern: Recursive self-improvement is a long-standing AI safety concern.
The reality check: Current synthetic data pipelines still require significant human oversight. We're not in recursive self-improvement territory yet.
Trust and Verification
The challenge: How do we trust AI trained on AI?
The approach:
- Evaluation on human-curated benchmarks
- Real-world deployment with monitoring
- Diverse evaluation to catch mode collapse
- Transparency about training methods
The tension: Competitive pressure discourages full transparency about training techniques.
What This Means for Builders
Using Synthetic Data
For fine-tuning:
- Generate task-specific examples
- Filter for quality (automatic or hybrid)
- Mix with real human examples
- Evaluate carefully for domain-specific degradation
For augmentation:
- Expand limited datasets
- Generate edge cases underrepresented in real data
- Create variations for robustness
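A toy augmentation sketch, using cheap text perturbations as stand-ins for model-generated variations; a real pipeline would paraphrase with an LLM rather than inject character noise.

```python
import random

def perturb(text: str, rng: random.Random) -> str:
    """Cheap stand-in for a model-generated variation: flip the case of
    one character and duplicate another, simulating a typo."""
    chars = list(text)
    i = rng.randrange(len(chars))
    chars[i] = chars[i].swapcase()
    j = rng.randrange(len(chars))
    chars.insert(j, chars[j])  # duplicate a character: a simple typo
    return "".join(chars)

def augment(dataset, copies=3, seed=0):
    """Expand each (text, label) pair with noisy variations."""
    rng = random.Random(seed)
    out = list(dataset)
    for text, label in dataset:
        out.extend((perturb(text, rng), label) for _ in range(copies))
    return out

seed_data = [("the package never arrived", "complaint")]
for text, label in augment(seed_data):
    print(label, "|", text)
```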
Watching for Problems
Warning signs:
- Decreasing diversity in outputs
- Increasing homogeneity in style
- Failure on distribution shifts
- Degradation on held-out human evaluations
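One simple proxy for the first warning sign is distinct-n: the ratio of unique n-grams to total n-grams across a sample of outputs. A sketch:

```python
def distinct_n(outputs, n=2):
    """Ratio of unique n-grams to total n-grams across model outputs.
    A falling value across model generations suggests shrinking diversity."""
    total, unique = 0, set()
    for text in outputs:
        tokens = text.split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

diverse = ["the cat sat", "a dog ran fast", "birds fly south"]
collapsed = ["the cat sat", "the cat sat", "the cat sat down"]
print(distinct_n(diverse), distinct_n(collapsed))  # 1.0 vs. ~0.43
```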
The Competitive Landscape
If you're building AI:
- Synthetic data capability is becoming essential
- The "data moat" around proprietary human data is narrowing
- Execution and technique matter as much as data access
The Bottom Line
Synthetic data is transforming AI development. The idea that AI can train AI - far from being circular - is enabling capabilities that human data alone couldn't support.
But it's not magic. Verification, filtering, diversity, and mixing with human data are essential. Without care, synthetic training leads to model collapse.
Used correctly, synthetic data is the secret sauce behind many recent AI advances. Used carelessly, it's a path to mediocrity.
Understanding the difference is becoming crucial for anyone building or evaluating AI systems.
