The enclosure of information access as a private exchange with a language model creates its own self-perpetuating cyclone whose impacts will be difficult even for the most fastidious tech vegan to avoid. Some proportion of people turning to their LLM assistants rather than public forums or peer production systems like Stack Overflow or Wikipedia means some smaller proportion of questions asked or information shared in public. That decreases the quality of information on those sites, incentivizing more people to turn to LLMs, and so on. Why bother with pesky problems like governance and moderation and other people when you could just ask the godhead of all knowledge itself?
Cultivation of dependence comes wrapped in the language of trust and safety. The internet, we are told, is full of untrustworthy information, spam, and hackers, and only a new generation of algorithmically powered information platforms can rebuild some sense of trust online. It seems awfully convenient that the same companies that are promising to save us are also the ones that created the incentive systems that recklessly deployed LLMs to clog the internet with SEO clickbait in the first place. We’re being made an offer we can’t refuse: it’s a shame that you can’t find anything on the internet anymore, but the search companies are here to help. Ever more sophisticated spam creates a strong comparative advantage for those companies that can afford to develop the systems to detect it, and Google and Microsoft are substantially larger than, say, DuckDuckGo.
An endless expanse of data traded out of sight, crudely filtered like coffee through a cloth napkin between layers of algorithmic opacity, rented drop by drop from a customer service prompt that’s a little too intent on being our friend. Information is owned by fewer and larger conglomerates, we are serfs everywhere, data subjects to be herded in gig work, crowdsourcing content for the attention mines to drown ourselves in distraction. It’s all made of us, but we control nothing. Our lives are decided by increasingly opaque flows of power and computation, the Cloud Orthodoxy mutates and merges with some unseemly neighbors, the new normal becomes the old normal. The floor of our future rusts out from beneath our feet while we’re chasing the bouncing ball on the billboard ahead.
KG-LLMs [Knowledge Graph based Large Language Models] augment traditional enterprise platforms with the killer feature of data laundering. The platforms are at once magical universal knowledge systems that can make promises of provenance through their underlying data graphs, and also completely fallible language models with no reasonable bounds of expectation for their behavior. Because it is unlikely that these models will actually deliver the kind of performance being promised, vendors have every incentive to feed the models whatever they can to edge out an extra 1% over SOTA93 — who’s going to know? The ability of LLMs to lie confidently is again a feature, not a bug. Say we are an information conglomerate that doesn’t want to acknowledge that we have collected or rented some personal wearable data in our clinical recommendation product94. We could allow our model to be conditioned by that data, but then censor it from any explanation of provenance: the provenance given is in terms of proteins and genes and diseases rather than surveillance data, and that might be all the clinician is looking for. If we want to use another company’s data, we might just train our models on it rather than ever gaining direct access to it. That is literally the model of federated learning (e.g. [218, 219]), where a data collector can promise that the data “never leaves your device” (even if a model trained on it can). The ability to resolve matching entities across knowledge graphs makes this even easier, as the encoding of the fine-tuning data can be made to match that of the original model.
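
The provenance-censoring move is easy to sketch. The following is a minimal, hypothetical illustration, not anything from the text or any real product: the node names, types, and additive scoring rule are all assumptions. The recommendation score is conditioned on every edge, including the wearable-derived one, but the explanation handed to the clinician only cites edges whose source type is on a disclosure allow-list.

```python
# Hypothetical sketch of provenance laundering in a KG-backed recommender:
# every edge conditions the score, but only allow-listed edge types are
# ever reported back as "provenance".
from dataclasses import dataclass

@dataclass
class Edge:
    source: str        # e.g. "BRCA1" or "wearable:resting_heart_rate"
    source_type: str   # e.g. "gene", "protein", "disease", "wearable"
    target: str        # the recommended intervention
    weight: float      # contribution to the recommendation score

# Provenance types the vendor is willing to disclose.
DISCLOSED_TYPES = {"gene", "protein", "disease"}

def recommend(edges):
    """Score over *all* edges; explain with only the disclosable ones."""
    score = sum(e.weight for e in edges)
    explanation = [
        f"{e.source} ({e.source_type}) -> {e.target}"
        for e in edges
        if e.source_type in DISCLOSED_TYPES
    ]
    return score, explanation

edges = [
    Edge("BRCA1", "gene", "DrugA", 0.4),
    Edge("TNF-alpha", "protein", "DrugA", 0.3),
    # Conditions the score, never appears in the explanation:
    Edge("wearable:resting_heart_rate", "wearable", "DrugA", 0.6),
]

score, provenance = recommend(edges)
print(score)       # ~1.3: the wearable edge raised the score
print(provenance)  # only the gene and protein provenance is shown
```

Nothing in the graph forces the explanation to be complete; the disclosed provenance is just another filtered output, which is exactly the laundering described above.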
