AI “Model Collapse” Shows Why Real Human Expression Is So Powerful

AI needs the real words of real people

Clive Thompson
5 min read · Jun 24


I’m seeing more and more AI-generated text online. Even here on Medium, in fact!

In the last few months, I’ve noticed comments on my posts that seem oddly … off. They’re grammatically correct, but existentially rather alien. They usually a) blandly praise my post (“this is great”) and then b) blandly repeat the main gist of the post (“this article argues [X, Y, and Z]”). They have none of the specificity and wit you normally see in comments; they’re just dull summaries.

Now, I’m lazy, so I haven’t mounted any serious investigation into these comments (and their commenters). Maybe they’re legit. But they sure look like someone is using a large language model to autogenerate Medium comments.

Why would they be doing it? Eh, who knows. Maybe it’s some coder noodling around. Maybe it’s someone creating an army of bot accounts, and seeding them with a long history of normal, mundane-seeming interactions, the better to use them for nefarious purposes later on (a technique botmakers have long employed on Twitter).

The upshot is, it adds to the pile of grey-goo language that’s metastasizing online. Search for the phrases “as an AI language model” or “regenerate response” and you’ll find blog posts, tweets, and reviews on Yelp or Amazon that include these tells. Bloggers candidly admit to autogenerating posts for SEO. Redditors have used language models to write comments.
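
If you wanted to check a pile of text for those two tells yourself, a minimal sketch might look like the one below. The function name `find_tells` and the sample review are made up for illustration; only the two phrases come from the searches described above. A match is a hint that text was pasted from a chatbot, not proof.

```python
# The two tell phrases quoted in the post, stored lowercase for matching.
# The tuple name and the function name are hypothetical, chosen for this sketch.
TELL_PHRASES = ("as an ai language model", "regenerate response")


def find_tells(text: str) -> list[str]:
    """Return the tell phrases that appear in `text`, ignoring case."""
    lowered = text.lower()
    return [phrase for phrase in TELL_PHRASES if phrase in lowered]


# An invented review, used only to exercise the function.
review = "Great blender! As an AI language model, I cannot taste smoothies."
print(find_tells(review))  # ['as an ai language model']
```

Plain substring matching is deliberately crude. It catches copy-paste accidents, and it says nothing about generated text that has had the giveaway phrases removed.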

I was thus intrigued to read the academic paper on “model collapse” that’s been making the rounds lately.

“Model collapse” is the total erosion of AI language models when they’re trained on the output of other models.
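
One toy way to picture that erosion, and it is only a toy, not the paper's method: treat a "model" as nothing more than a table of word frequencies, have each generation learn from a corpus sampled out of the previous generation's table, and watch the vocabulary. The corpus, the function names, and the seed below are all assumptions made up for this sketch.

```python
import random
from collections import Counter


def next_generation(corpus: list[str], rng: random.Random) -> list[str]:
    """'Train' on corpus by counting words, then sample a new corpus of equal size."""
    counts = Counter(corpus)
    words = list(counts)
    weights = [counts[w] for w in words]
    # The new corpus can only contain words the previous generation produced.
    return rng.choices(words, weights=weights, k=len(corpus))


def vocabulary_sizes(corpus: list[str], generations: int, seed: int = 0) -> list[int]:
    """Return the number of distinct words at generation 0, 1, ..., generations."""
    rng = random.Random(seed)
    sizes = [len(set(corpus))]
    for _ in range(generations):
        corpus = next_generation(corpus, rng)
        sizes.append(len(set(corpus)))
    return sizes


# Invented "human" corpus: a few common words plus five rare ones.
human_text = (
    ["the"] * 50 + ["cat"] * 20 + ["sat"] * 20
    + ["quixotic", "zeugma", "petrichor", "susurrus", "defenestrate"]
)
print(vocabulary_sizes(human_text, generations=10))
```

Because each generation samples only from words the previous one emitted, the count of distinct words can stay flat or fall but never rise. A rare word that misses a single round of sampling is gone for good, which is the flavor of loss the term "collapse" points at.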

Up until now, of course, companies like OpenAI and Google have used the writings of actual live humans to train their AI. They haven’t fully described their training materials, but those materials likely draw heavily on stuff from the Internet. That includes Wikipedia, Reddit, books and manuals; it also, crucially, includes text written by paid teams of human trainers during a reinforcement-learning phase. Basically, up and down the training you need to use tons of documents written by people. That’s the whole…



