psilocervine


I keep seeing people, even people in games, go "well it's fine, we'll just train it on our own internally produced art" and I wonder if they've thought about that for even a second. if you're training a model yourself, you need some pretty fuckin' excessive amounts of data! if you're not, you're going to be doing something like what Unity did with Muse, where they're using Stable Diffusion but "sanitized" so it doesn't contain any copyrighted material from major players (sort of, more on this later)

but the thing is... that doesn't matter? because that dataset is still filled to the fucking brim with shit that was just scraped from all over the internet without asking anyone involved. the consent of the people who created the art that makes this thing work isn't a factor unless they're big enough, and even then the sanitization process doesn't actually work! because a lot of the shit in there will have slipped through, especially if it had enough of a cultural impact

and so you go "okay, but maybe they own a REEEEEAAAALLLLLLLLLLLYYY big library, like Adobe Stock!" and uhhh.... that sucks too! Adobe's AI was trained on their stock photo library, which a lot of people submitted to before AI was even a thing! once again, the only thing that was consented to was the base EULA, and even that isn't the kind of thing where you go "well, maybe in the future they'll train a robot on this," because that's a ridiculous thing to have to anticipate

the absolute best you can hope for is AI trained on CC0 work, but uh... again, you start running into problems. how is that data going to be, you know, set up for training? where's that metadata coming from? who's inputting all the relevant data there? because you're going to be dealing with loads of images, and they're all going to need a whole shitload of work before they're ready for training. like, I get we're getting deep into how capitalism is exploitative at every level, but we really can't escape this conversation without mentioning how especially exploitative this sort of thing is
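
just to make it concrete, here's a rough Python sketch of the very first step of that prep work (the folder name and fields are all made up for illustration). notice that every field that actually matters is a blank a human has to fill in:

```python
# a sketch of the first step of dataset prep, assuming a hypothetical
# folder of CC0 images. every interesting field is a blank someone has
# to fill in by hand, which is kind of the whole point
import json
from pathlib import Path

def build_manifest(image_dir: str, out_path: str) -> None:
    """Walk a folder of images and write a training manifest (JSONL)."""
    records = []
    for path in sorted(Path(image_dir).glob("*.png")):  # plus jpg, webp...
        records.append({
            "file": str(path),
            "caption": "",     # someone has to look at this and write it,
                               # times a few hundred million
            "license": "CC0",  # has to actually be verified, not assumed
            "nsfw": None,      # plus filtering, dedup, resizing, cropping...
        })
    with open(out_path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

build_manifest("cc0_images/", "manifest.jsonl")
```

and that's before anyone has written a single caption, which is where the real labor sink is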

this really isn't something you can fix. generative AI is broken at its very core



in reply to @psilocervine's post:

I was interested, early in the genAI movement, in collecting all the CC0 image sets I could find together into one big IPFS bucket or something and then training a model on it. I wasn't planning on doing any tagging at all, and you're right that you couldn't use this for "prompts" without the heavy tagging OpenAI is paying people in data sweatshops to enter, but I don't think "prompts" are the interesting part. I just wanna, like, draw an outline and have the model complete it into a picture, or do other weird things that a large image model can give you by itself.
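
(a rough sketch of what I mean, SDEdit-style: noise the outline partway, then let an *unconditional* diffusion model denoise it into a full image. `model` here is a hypothetical noise-prediction network trained on that CC0 pile; no captions involved anywhere:)

```python
import torch

@torch.no_grad()
def complete_outline(model, outline, betas, strength=0.6):
    """outline: (1, 3, H, W) tensor in [-1, 1]. strength in (0, 1]:
    how far toward pure noise to push the sketch before denoising."""
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    t_start = int(strength * (len(betas) - 1))

    # jump the sketch partway into the forward (noising) process
    a = alphas_bar[t_start]
    x = a.sqrt() * outline + (1.0 - a).sqrt() * torch.randn_like(outline)

    # standard DDPM reverse process from t_start back down to 0
    for t in range(t_start, -1, -1):
        beta, a_bar = betas[t], alphas_bar[t]
        eps = model(x, torch.tensor([t]))  # predicted noise at step t
        mean = (x - beta / (1.0 - a_bar).sqrt() * eps) / (1.0 - beta).sqrt()
        x = mean + beta.sqrt() * torch.randn_like(x) if t > 0 else mean
    return x  # a full image that keeps the outline's broad structure
```

lower strength sticks closer to your outline, higher strength gives the model more freedom to invent.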

Even this I don't think is a good idea, which is half of why I'm not doing it*, because in the end not that many people would be using your "clean" public domain model but lots and lots of people would point to your "clean" public domain model to say "see? AI is ethical!" and then go back to using the exploitation-and-expropriation-based OpenAI servers. Why go to all that trouble just to do reputation laundering for the worst people in tech, when the best you're going to get out of it is some wonky images that people will only look at and go "ew, AI"?

* The other half is environmental impact.