We sometimes talk about web scraping projects as "ingesting" or "slurping" text from the web, but we usually understand that nothing is actually being consumed; the text is still there where it should be. But now it's different.
Large Language Models (LLMs), of which the most famous right now is ChatGPT, are eating the web, actually consuming and destroying it.
- Google: overrun by fake sites with unhelpful content generated by LLMs. They have, bafflingly, decided to counter with their own first-party unhelpful content generated by LLMs.
- Reddit: despite its community, it was previously a refuge of meaningful, open, human-written text on topics people care about. Now it's mostly down, because they betrayed their community with drastic changes and the community revolted. One reason given for the drastic changes: their tasty meaningful text was being exploited by Google and OpenAI and their ilk, and they weren't getting any benefit from it. So they decided to charge a gazillion dollars for API access, which took down everything else the community had built on the API, such as accessibility tools and usable interfaces.
- Twitter: bought by a complete dipshit who (among his many foolish ideas) thought he could save money by replacing workers and essential site features with AI, so essential site features are being replaced with nothing. Also decided "API access and third-party apps should cost a gazillion dollars so LLMs will pay us" before Reddit did.
- Stack Overflow: rampant use of ChatGPT threatens to turn its questions and answers into nonsense, much like the questions and answers you see on Google. Moderators responded by moderating even more harshly than they usually do. Site owners saw that fewer and fewer people even want to attempt to use Stack Overflow, panicked, and told moderators to stop banning people for using ChatGPT, instead of addressing any of the other reasons people don't want to use Stack Overflow. Moderators went on strike. The site and its siblings are now mostly unmoderated and, confusingly, still working for now.
- Wikipedia: apparently standing strong for now, but its standards of information are threatened as formerly "reliable sources" start generating nonsense with LLMs. Wikipedia is particularly vulnerable, because if LLM output is ever treated as a reliable source, it can create self-reinforcing fake facts that people repeat because they're on Wikipedia.
- Many independent websites: buried under competing LLM nonsense, or bought out by venture capitalists who fire their staff and replace them with LLM nonsense. (I just saw a hecking GeoGuessr tips page destroyed by someone who copied all the work people put into it, pasted it onto a ChatGPT-generated website, and took credit for it.)
They have found a way to scrape the web so hard that there isn't any web there anymore.
I don't know what we can do about it. I hope that a specifically anti-metrics, anti-capitalist website like Cohost can be a refuge, but there aren't enough Cohosts.
