The Risk of Model Degradation in Self-Training GAI

doi:10.19678/j.issn.1000-3428.0253240

Abstract

This study investigates the degradation risks of Generative Artificial Intelligence (GAI) models in self-training loops, focusing on two core phenomena: content homogenization and the widening divergence between human-written and machine-generated texts. We select two representative generative models with distinct architectures and build an iterative self-training framework in which the proportion of human data in the training set (α) is the key hyperparameter. For several initial values of α, we run controlled experiments with two typical dynamic strategies, linear decay and exponential decay, and systematically evaluate the quality, diversity, and human-likeness of the generated content using multidimensional performance metrics. The results show that, during self-training, GAI models exhibit a persistent decline in performance, a marked reduction in output diversity, and a gradually widening gap between human-written and machine-generated texts. The linear decay strategy effectively slows the decline of information entropy and helps maintain content diversity, but in later stages it becomes increasingly vulnerable to the cumulative impact of pollution from model-generated data. In contrast, the exponential decay strategy produces more pronounced performance fluctuations in the early phase but achieves better long-run stability. Moreover, lightweight unidirectional language models (GPT2) are more prone to a vicious cycle of noise amplification during self-training, whereas bidirectional encoder models (BART), which have stronger global modeling capacity, are more robust to synthetic data contamination. These findings provide empirical support for optimizing dynamic data-mixing strategies in GAI self-training.

摘要： 本研究旨在探究生成式人工智能（Generative Artificial Intelligence, GAI）在自训练循环中的模型退化风险，重点聚焦内容同质化与人机文本差异两大核心现象。研究选取两种结构具有代表性的生成模型，构建自训练迭代实验框架，以人类数据在训练集中的占比α为核心超参数，在α不同取值下并结合线性递减、指数衰减两类典型动态策略开展对照实验，通过多维度性能指标系统评估生成内容的质量、多样性及与人类文本的差异程度。结果显示，GAI在自训练过程中性能呈持续下降趋势，生成内容多样性显著弱化，人机文本差异逐步扩大；线性递减策略可有效延缓信息熵下降、维持内容多样性，但后期易受模型生成数据污染的累积影响；指数衰减策略虽初期性能波动较明显，但其长期稳定性更优。此外，轻量级单向语言模型（GPT2）在自训练中更易陷入噪声放大的恶性循环，而具备更强全局建模能力的双向编码器模型（BART）在面对生成数据污染时，展现出更优异的鲁棒性。本研究为优化GAI自训练的动态数据配比策略提供了重要实证支撑。
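
The two dynamic strategies for α and the information-entropy measure mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact decay formulas, the decay rate, or the tokenization used, so everything below is an assumption made for illustration: the function names (linear_alpha, exponential_alpha, mix_counts, token_entropy), the schedule forms, and the default rate of 0.5 are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from typing import List, Tuple


def linear_alpha(alpha0: float, step: int, total_steps: int) -> float:
    """Assumed linear schedule: alpha falls from alpha0 to 0 over total_steps."""
    return max(0.0, alpha0 * (1.0 - step / total_steps))


def exponential_alpha(alpha0: float, step: int, rate: float = 0.5) -> float:
    """Assumed exponential schedule: alpha0 * exp(-rate * step)."""
    return alpha0 * math.exp(-rate * step)


def mix_counts(alpha: float, train_size: int) -> Tuple[int, int]:
    """Split a training set of train_size samples into (human, generated) counts."""
    n_human = round(alpha * train_size)
    return n_human, train_size - n_human


def token_entropy(tokens: List[str]) -> float:
    """Shannon entropy (in bits) of the empirical token distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # Print both hypothetical schedules side by side for five iterations.
    for step in range(5):
        lin = linear_alpha(0.8, step, total_steps=5)
        exp = exponential_alpha(0.8, step)
        print(step, round(lin, 3), round(exp, 3), mix_counts(lin, 1000))
```

In a self-training loop of this kind, each iteration would draw the human share of the next training set according to the current α and fill the remainder with text generated by the previous model; tracking token_entropy on the generated text across iterations is one simple way to observe a loss of diversity.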

Anran Fang, Lemen Chao. The Risk of Model Degradation in Self-Training GAI[J]. Computer Engineering, doi: 10.19678/j.issn.1000-3428.0253240.

方安然, 朝乐门. 生成式人工智能自训练的模型退化风险[J]. 计算机工程, doi: 10.19678/j.issn.1000-3428.0253240.


URL: https://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0253240

