Escaping Model Collapse via Synthetic Data Verification: Near-term Improvements and Long-term Convergence, by Bingji Yi and 3 other authors
Abstract: Synthetic data has been increasingly used to train frontier generative models. However, recent studies raise a key concern: iteratively retraining a generative model on its own self-generated synthetic data can progressively degrade model performance, a phenomenon often called model collapse. In this paper, we investigate ways to modify the synthetic retraining process to avoid model collapse, and possibly even reverse the trend from collapse to improvement. Our key finding is that by injecting information through an external synthetic data verifier, whether a human or a better model, synthetic retraining will not cause model collapse. Specifically, we situate our theoretical analysis in the fundamental linear regression setting, showing that verifier-guided retraining can yield near-term improvements, but ultimately drives the parameter estimate to the verifier's "knowledge center" in the long run. Our theory further predicts that, unless the verifier is perfectly reliable, these early gains will plateau and may even reverse. Indeed, our experiments across linear regression, Variational Autoencoders (VAEs) trained on MNIST, and fine-tuning SmolLM2-135M on the XSUM task confirm these theoretical insights.
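The dynamic described in the abstract can be illustrated with a toy simulation. The sketch below is not the paper's construction; it is a minimal hypothetical setup (one-dimensional linear regression without intercept, a verifier modeled as a fixed reference parameter that accepts synthetic samples lying near its own predictions, and arbitrary choices of noise scale and acceptance threshold). It shows the long-run behavior the abstract predicts: under verifier-filtered retraining, the estimate does not collapse but instead drifts toward the verifier's "knowledge center."

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: true slope, and an imperfect verifier whose
# "knowledge center" differs slightly from the truth.
theta_true = 2.0
theta_verifier = 1.8
noise_scale = 0.5
n = 500

# Initial model: OLS fit (no intercept) on real data.
x = rng.normal(size=n)
y = theta_true * x + rng.normal(scale=noise_scale, size=n)
theta = (x @ y) / (x @ x)

for step in range(50):
    # Generate synthetic data from the current model.
    x_syn = rng.normal(size=n)
    y_syn = theta * x_syn + rng.normal(scale=noise_scale, size=n)
    # Verifier keeps only samples consistent with its own predictions.
    keep = np.abs(y_syn - theta_verifier * x_syn) < 0.5
    xs, ys = x_syn[keep], y_syn[keep]
    # Retrain on the verified synthetic data.
    theta = (xs @ ys) / (xs @ xs)

# The estimate stays bounded and settles near the verifier's parameter,
# rather than degenerating as in unfiltered self-retraining.
print(theta)
```

Removing the `keep` filter recovers the classic collapse setting, where estimation noise compounds across generations; the acceptance step is what anchors the iteration to the verifier.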
Submission history
From: Qiyuan Liu
[v1]
Sat, 18 Oct 2025 22:39:39 UTC (1,540 KB)
[v2]
Thu, 5 Mar 2026 22:35:40 UTC (2,227 KB)