• he/they

27, US expat in Toronto, transmasc, chronically ill/immunocompromised, neurodivergent, arospec, nonmonogamous. i guess i'm a furry now? that's a recent development though. i'm not a programmer but i am a computer nerd and a linux user (apparently that's a thing people like to list here).


art page: calico-art


arborelia
@arborelia

We sometimes talk about web scraping projects as "ingesting" or "slurping" text from the web, but usually we understand that nothing is actually being consumed; the text is still there where it should be. But now it's different.

Large Language Models (LLMs), of which the current most famous is ChatGPT, are eating the web, actually consuming and destroying it.

  • Google: overrun by fake sites with unhelpful content generated by LLMs. They have, bafflingly, decided to counter with their own first-party unhelpful content generated by LLMs.

  • Reddit: despite its community, it was previously a refuge of meaningful, open, human-written text on topics people care about. Now it's mostly down, because they betrayed their community with drastic changes and the community revolted. One reason given for the drastic changes: their tasty meaningful text was being exploited by Google and OpenAI and their ilk, and they weren't getting any benefit from it. So they decided to charge a gazillion dollars for API access, causing everything else the community built that uses the API, such as accessibility tools and usable interfaces, to go down.

  • Twitter: bought by a complete dipshit who (among his many foolish ideas) thought he could save money by replacing workers and essential site features with AI, so essential site features are being replaced with nothing. Also decided "API access and third-party apps should cost a gazillion dollars so LLMs will pay us" before Reddit did.

  • Stack Overflow: Rampant use of ChatGPT threatens to turn its questions and answers into nonsense, much like the questions and answers you see on Google. Moderators responded by moderating even more harshly than they usually do. Site owners saw that a declining number of people want to even attempt to use Stack Overflow, and panicked, and told moderators to stop banning people for using ChatGPT, instead of addressing any of the other reasons people don't want to use Stack Overflow. Moderators went on strike. The site and its siblings are now mostly unmoderated, and, confusingly, still working for now.

  • Wikipedia: apparently standing strong for now, but their standards of information are threatened as formerly "reliable sources" start generating nonsense with LLMs. Wikipedia is particularly vulnerable, because if LLM output is ever treated as a reliable source, it can create self-reinforcing fake facts that people repeat because they're on Wikipedia.

  • Many independent websites: buried under competing LLM nonsense, or bought out by venture capitalists who fire their staff and replace them with LLM nonsense. (I just saw a hecking GeoGuessr tips page destroyed by someone who copied all the work people put into it, pasted it onto a ChatGPT-generated website, and took credit for it.)

They have found a way to scrape the web so hard that there isn't any web there anymore.
I don't know what we can do about it. I hope that a specifically anti-metrics, anti-capitalist website like Cohost can be a refuge, but there's not enough Cohosts.


calico-catboy
@calico-catboy

I'm at the point of considering tracking down and joining niche forums every time I have an issue. Unfortunately, this would come with the difficulty of finding them when search engines aren't usable. Reddit works for most things for now, but the way things are going I'm getting concerned about how long that'll last.



in reply to @arborelia's post:

a good deal of Wikipedia already feels like the output of a ChatGPT thingummy—articles are effectively cobbled together from bits of public-domain content copied and pasted together without any cohesive effort to create a readable article. (honestly the whole Wikipedia model is flat busted I think; encyclopedias need top-level editors so that articles read like an organic whole, not like a jumble of bits.) so I daresay the ChatGPT-ification of Wikipedia is only a matter of time. AI nerds seem very likely to win Wikipedia editing wars in which victory goes to the most stubborn. ~Chara

"encyclopedias need top-level editors"

which get underpaid, which end up using chatgpt to meet their metrics, etc

it's incredible how many problems are caused just by not paying people enough (and honestly, the system we have currently only works because of this)

honestly the risk is more that humans decide that notable content is not in fact notable and decide to delete most of it, which is the way the site is trending; LLMs, if they hasten it, will do so by enabling those people to discount once-reliable sites as no longer reliable even before they started using LLMs (this is sort of happening with CNET; the reliable source guidelines page draws a distinction between pre- and post-2022 but this is not always read)

I tried to look up a recipe the other day just to remind myself if I was missing anything. At least 4/5ths of the results were LLM-generated nonsense, clearly all repeating information from each other in disconnected ways.

The worst part is that LLM stuff basically looks good enough to get you to open the page, only to realize a bit later it's nothing. So it takes like 5x longer to sift through bad info than it did before, on top of there just plain being more of it.

Archive.org was recently hit with a gigantic volume of scraping for its OCR data that brought it down twice. The worst part is that its OCR data really isn't high quality; it has a lot of issues with a fair bit of language processing. So any LLM that ingests the data will take in a lot of garbage too.

I think this is how the internet we have known until now dies. People are smart and we're going to figure out new forms, hopefully ones that are less intensely clinging to venture capital, but it's going to be a challenging few years in this regard. Lucky there's nothing else worrying going on in the world right now, really.