Unraveling Language Model Collapse

The study demonstrates that compounding data and weight recursion, in which each model generation is trained from the previous generation's weights on that generation's synthetic text, produces measurable drift, quantified by [latex]\Delta U_{\mathrm{LLN,cov}}(\delta)[/latex] and [latex]\Delta G_{\mathrm{KF}}(\delta)[/latex] relative to a baseline checkpoint.
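
To make the recursion concrete, here is a minimal, hypothetical sketch in PyTorch. Everything in it is an assumption for illustration, not the paper's method: a toy next-token model (`ToyLM`) stands in for an LLM, and a plain L2 distance between checkpoints stands in for the paper's drift metrics [latex]\Delta U_{\mathrm{LLN,cov}}(\delta)[/latex] and [latex]\Delta G_{\mathrm{KF}}(\delta)[/latex], whose definitions are not reproduced here. What it does show faithfully is the loop structure: each generation samples synthetic data from the current model, fine-tunes from the current weights, and is measured against the frozen baseline.

```python
"""Sketch of recursive synthetic-data training with drift tracking.

All names here (ToyLM, weight_drift, sample_synthetic) are hypothetical
stand-ins; the paper's actual models and drift metrics are not defined
in this article.
"""
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, SEQ, BATCH = 64, 32, 16, 128

class ToyLM(nn.Module):
    """Bag-of-embeddings next-token predictor; a stand-in for a real LLM."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):               # tokens: (batch, seq)
        h = self.embed(tokens).mean(dim=1)   # crude context summary
        return self.head(h)                  # logits over the next token

def weight_drift(model, baseline):
    """L2 distance between checkpoints; a hypothetical proxy for the
    paper's Delta-metrics relative to the baseline checkpoint."""
    diffs = [(p - q).flatten()
             for p, q in zip(model.parameters(), baseline.parameters())]
    return torch.cat(diffs).norm().item()

@torch.no_grad()
def sample_synthetic(model):
    """Draw a synthetic training batch from the current generation."""
    ctx = torch.randint(VOCAB, (BATCH, SEQ))          # random prompts
    probs = F.softmax(model(ctx), dim=-1)
    targets = torch.multinomial(probs, 1).squeeze(1)  # model-generated labels
    return ctx, targets

def train_generation(model, ctx, targets, steps=50, lr=1e-2):
    """Fine-tune one generation on the previous generation's synthetic data."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(model(ctx), targets)
        loss.backward()
        opt.step()
    return model

baseline = ToyLM()
model = copy.deepcopy(baseline)  # generation 0 starts at the baseline weights
for gen in range(1, 6):
    ctx, targets = sample_synthetic(model)         # data from the prior model
    model = train_generation(model, ctx, targets)  # weight recursion
    print(f"generation {gen}: drift = {weight_drift(model, baseline):.3f}")
```

Because each generation both inherits the previous weights and trains only on that model's own outputs, the printed drift grows monotonically across generations, which is the qualitative pattern the study formalizes with its baseline-relative metrics.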

New research offers a powerful framework for understanding and predicting when large language models begin to lose coherence during prolonged training on synthetic data.