Model collapse: What happens when AI learns from AI?

The term “model collapse” refers to a degenerative process in which a model loses its connection to reality and becomes tainted by its own output.

We can expect a sizeable share of online content to eventually be created by AI, as more and more content workflows rely on ChatGPT and similar tools to increase efficiency. That could put large language models in serious danger, because they currently depend on human-created data gathered from the internet.

Researchers from Imperial College London, the universities of Cambridge and Oxford, the University of Toronto, and other institutions have expressed concern about a situation in which LLMs use AI-generated material as part of their training data. Their paper, titled “The Curse of Recursion: Training on Generated Data Makes Models Forget,” describes a degenerative process called “model collapse,” in which a model loses touch with reality and is tainted by its own output.

Model Collapse

The growing use of AI tools makes this scenario plausible. As AI-generated material spreads and is fed back to large language models as training data, their output is expected to become increasingly inaccurate and distorted.

The researchers observed this effect in large language models, variational autoencoders, and Gaussian mixture models. Over time, these models start to “forget the true underlying data distribution”: the training data becomes so contaminated that it no longer resembles real-world data, and the model’s picture of reality grows inaccurate.
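The forgetting described above can be illustrated with a toy simulation. The sketch below is not the paper's experimental setup. It assumes the simplest possible "model," a single Gaussian, and refits it in every generation using only samples drawn from the previous generation's fit. The function name `simulate_generations` and all parameter values are hypothetical, chosen only for illustration.

```python
import random
import statistics


def simulate_generations(n_samples=10, n_generations=500, seed=0):
    """Refit a Gaussian to its own samples, generation after generation.

    Toy assumption: the "model" is just a mean and a standard deviation,
    and each generation is trained only on data sampled from the previous
    generation's model. Returns a list of (mean, std) pairs, starting with
    the true distribution N(0, 1).
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the "real" data distribution
    history = [(mu, sigma)]
    for _ in range(n_generations):
        # Training data for this generation comes from the previous model.
        samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        # "Training" is re-estimating the two parameters from that data.
        mu = statistics.fmean(samples)
        sigma = statistics.stdev(samples)
        history.append((mu, sigma))
    return history


if __name__ == "__main__":
    history = simulate_generations()
    print("std at generation 0:  ", history[0][1])
    print("std at last generation:", history[-1][1])
```

Because every refit is based on a finite sample, small estimation errors compound from one generation to the next. In this toy setting the fitted spread tends to shrink toward zero, meaning the tails of the original distribution are lost first, and the fitted mean drifts away from its true value.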

Given the substantial danger of model collapse, the researchers stress the importance of retaining access to data from the original distribution, which is typically created by humans. Since AI language models are meant to interact with people, they must stay in touch with reality to represent our world accurately.
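To make the value of original data concrete, here is a minimal sketch under strong simplifying assumptions. The "model" is a single Gaussian refitted each generation, and a fixed share of every training set is drawn fresh from the true distribution, standing in for human-created data. The function `final_spread` and the parameter `human_fraction` are hypothetical names, and the setup is an illustration rather than the method used in the paper.

```python
import random
import statistics


def final_spread(human_fraction, n_samples=100, n_generations=500, seed=0):
    """Return the fitted std after many generations of refitting.

    Each generation's training set mixes `human_fraction` fresh samples
    from the true distribution N(0, 1) with samples generated by the
    previous generation's model (toy assumption, for illustration only).
    """
    if not 0.0 <= human_fraction <= 1.0:
        raise ValueError("human_fraction must be between 0 and 1")
    rng = random.Random(seed)
    n_human = round(n_samples * human_fraction)
    mu, sigma = 0.0, 1.0  # start from the true distribution
    for _ in range(n_generations):
        # Stand-in for original human-created data.
        human = [rng.gauss(0.0, 1.0) for _ in range(n_human)]
        # Data generated by the previous generation's model.
        synthetic = [rng.gauss(mu, sigma) for _ in range(n_samples - n_human)]
        data = human + synthetic
        mu = statistics.fmean(data)
        sigma = statistics.stdev(data)
    return sigma


if __name__ == "__main__":
    print("half original data:  ", final_spread(0.5))
    print("only synthetic data: ", final_spread(0.0, n_samples=10))
```

With a steady supply of original samples, each refit is pulled back toward the true distribution, so the fitted spread stays close to its real value. With purely synthetic training data nothing anchors the estimate, and in this toy setting the spread tends to decay toward zero.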

The researchers suggest several ways to address this issue when training large language models. One is what they call the “first-mover advantage,” which underscores the importance of preserving access to the original human-generated data source.

Because it is difficult to distinguish AI-generated data from human-produced data, the paper argues that “community-wide coordination” is needed so that the various parties involved in creating and deploying LLMs share the information required to determine where data came from.

The paper continues, “Without access to data that was crawled from the Internet before the technology was widely adopted, or direct access to data generated by humans at scale, it may become increasingly difficult to train newer versions of LLMs.”

There is a bright spot for human creators, despite the growing use of generative AI and worries that it will replace workers. According to the paper, human-created material will become more valuable as AI-generated data fills the internet, even if only as a source of clean data for large language models.


