Problem Detail

AI Model Collapse from Synthetic Training

Source: Andrej Karpathy
Durability: Durable
Demand: Documented
The speaker explicitly describes people paying for or seeking this.
Buildability: Research Needed
Significant new learning required — a fundamental research problem in information theory and optimization that may require new mathematical frameworks for entropy-preserving learning.
Solution Status: None
No existing product addresses this.
Problem Statement
LLMs trained on their own outputs become 'silently collapsed' — they generate content that looks reasonable individually but lacks entropy across samples. This creates a fundamental barrier to synthetic data generation and self-improvement loops.
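The distinction between per-sample quality and cross-sample entropy can be sketched with a toy diversity check. The outputs below are placeholder strings standing in for repeated LLM samples; this is an illustration, not an actual evaluation harness:

```python
import math
from collections import Counter

def sample_entropy(samples):
    """Shannon entropy (bits) of the empirical distribution over outputs."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A healthy model spreads probability mass over many distinct outputs...
diverse = ["joke_%d" % i for i in range(16)]

# ...while a collapsed model keeps repeating a handful, even though each
# individual output still looks fine in isolation.
collapsed = ["joke_1", "joke_2", "joke_3", "joke_1"] * 4

print(sample_entropy(diverse))    # 4.0 bits (16 equally likely outputs)
print(sample_entropy(collapsed))  # 1.5 bits (only 3 distinct outputs)
```

This is exactly the "three jokes" symptom from the source quote below: single samples pass inspection, but the entropy over repeated samples is far lower than it should be.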
Job to Be Done
Give me a way to train AI models on synthetic data without losing the diversity and creativity that makes them useful for novel problems.
Assessment
Helmer Power
Proprietary data (whoever solves collapse can generate unlimited training data)
Lenses Triggered
Durable Truths Filter
Constraint Inversion
Human Behavior Constant
Variable Cost
The current approach requires a constant stream of human-generated training data; cost = human time × data volume. Solving collapse would enable unlimited self-improvement cycles at near-zero marginal cost.
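The linear-cost claim can be made concrete with a toy annotation-cost model. All figures here (tokens per annotator-hour, hourly rate) are illustrative assumptions, not numbers from the source:

```python
# Toy model of human-annotation cost; every parameter is an assumption
# chosen for illustration only.
def human_data_cost(tokens_needed, tokens_per_hour=2_000, rate_usd=30.0):
    """Cost scales linearly: human time x data volume."""
    hours = tokens_needed / tokens_per_hour
    return hours * rate_usd

print(human_data_cost(1e9))   # 15000000.0 USD under these assumed rates
print(human_data_cost(1e10))  # 10x the data -> exactly 10x the cost
```

The point of the sketch is the shape of the curve, not the constants: with human data the marginal cost per token never falls, whereas a working synthetic loop would drive it toward zero.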
Why This Is Durable
Model collapse is a mathematical consequence of training on compressed representations. Any system that learns from its own outputs faces entropy loss unless explicitly designed to maintain diversity.
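The entropy-loss mechanism can be demonstrated with a minimal self-consuming training loop: repeatedly fit a Gaussian to samples drawn from the previous fit. This is a standard toy model of collapse, not Karpathy's formulation; the variance shrinks generation over generation because each maximum-likelihood fit underestimates spread:

```python
import math
import random
import statistics

random.seed(0)  # fixed seed for reproducibility

def one_generation(mu, sigma, n=50):
    """Draw n samples from the current model, fit a Gaussian to them,
    and return the fit as the next generation's model."""
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    mean = statistics.fmean(xs)
    var = sum((x - mean) ** 2 for x in xs) / n  # MLE variance: biased low
    return mean, math.sqrt(var)

mu, sigma = 0.0, 1.0
for gen in range(300):
    mu, sigma = one_generation(mu, sigma)

print(sigma)  # far below the original 1.0: the distribution has collapsed
```

No single generation looks broken — each fit is a reasonable summary of its inputs — yet the composition of many generations silently drives the variance toward zero, which is the "silent collapse" described above.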
Solution Gap
No known method to maintain entropy in synthetic training while preserving quality. Current approaches (regularization, diversity rewards) fail at scale.
Demand Evidence
Karpathy explicitly describes this as blocking synthetic data generation at frontier labs, with billions in annotation costs at stake.
Human Behavior Insight
Humans maintain high entropy in outputs through unpredictable life experiences and individual cognitive differences — something deterministic systems may never replicate.
Paradigm Challenge
Synthetic data generation will solve AI training bottlenecks
Source Quote
The LLMs, when they come off, they're what we call 'collapsed.' They have a collapsed data distribution. One easy way to see it is to go to ChatGPT and ask it, 'Tell me a joke.' It only has like three jokes.
Broad Tags
per_unit_cost_collapsible
Training on human-generated data scales linearly with data needs. Solving collapse would eliminate this cost entirely, enabling unlimited self-improvement.
constraint_accepted_as_fixed
Industry treats the need for fresh human training data as permanent. Karpathy suggests this may be an insurmountable constraint, not just an engineering challenge.
capability_doesn_t_exist_yet
No existing approach maintains both quality and diversity in synthetic training loops — a fundamental capability gap blocking AI self-improvement.
Specific Tags (structural patterns for cross-referencing)
synthetic_data_entropy_degradation
self_improvement_bottleneck_fundamental
training_distribution_collapse_silent
diversity_quality_tradeoff_unsolved
recursive_learning_failure_mode
information_theoretic_barrier_training
human_data_dependency_scaling_wall
model_output_statistical_bias_inevitable
entropy_preservation_mechanism_missing
adversarial_examples_in_judgment_models
Constraints Blocking Progress
PHYSICS information entropy conservation violation
Training on compressed representations necessarily loses information entropy — may be mathematically impossible to avoid without external entropy sources.
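A standard information-theoretic framing of this constraint is the data-processing inequality (a textbook result, applied here as an illustration rather than an argument from the source): synthetic data X' produced by a model with parameters θ̂ fitted to real data X forms a Markov chain, so each generation can only lose information about the original distribution unless an external entropy source is injected.

```latex
% Data-processing inequality applied to a self-training loop
% (standard result; illustrative framing, not from the source).
X \;\longrightarrow\; \hat{\theta} \;\longrightarrow\; X'
\qquad\Longrightarrow\qquad
I(X; X') \;\le\; I(X; \hat{\theta})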
TECHNICAL diversity measurement intractable
No efficient method exists to measure and preserve diversity in high-dimensional output spaces during training.
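One reason measurement is intractable: even the naive diversity score — average pairwise distance between output embeddings — is quadratic in the number of samples. A minimal sketch (toy vectors standing in for real embeddings):

```python
import math

def mean_pairwise_cosine_distance(vectors):
    """Naive diversity score: mean (1 - cosine similarity) over all pairs.
    O(n^2 * d) time — already impractical at training-batch scale, which
    is part of why diversity is hard to preserve during training."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    n = len(vectors)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(1 - cos(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

# Identical outputs -> zero diversity; orthogonal outputs -> score 1.0.
print(mean_pairwise_cosine_distance([[1, 0], [1, 0], [1, 0]]))  # 0.0
print(mean_pairwise_cosine_distance([[1, 0], [0, 1]]))          # 1.0
```

Even setting aside the cost, scores like this degrade in high dimensions (distances concentrate), so a cheap, trustworthy training-time diversity signal remains the open capability gap.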
COGNITIVE human diversity source irreplaceable
Humans generate true entropy through unpredictable creativity — synthetic systems may fundamentally lack this capability.