I keep seeing people, even people in games, go "well it's fine, we'll just train it on our own internally produced art" and I wonder if they've thought about that for even a second. if you're training your own model from scratch, you need some pretty fuckin' excessive amounts of data! if you're not, you're going to be doing something like what Unity did with Muse, where they took Stable Diffusion but "sanitized" it so it doesn't contain any copyrighted material from major players (sort of, more on this later)
but the thing is... that doesn't matter? because that dataset is still filled to the fucking brim with shit that was just scraped from all over the internet without asking anyone involved. the consent of the people who created the art that makes this thing work isn't a factor unless they're big enough, and even then the sanitization process doesn't actually work! a lot of the shit in there will have slipped through anyway, especially if it had enough of a cultural impact
and so you go "okay but maybe they own a REEEEEAAAALLLLLLLLLLLYYY big library, like Adobe Stock!" and uhhh.... that sucks too! adobe's AI (Firefly) was trained on their stock photo library, which a lot of people submitted to before generative AI was even a thing! once again, the only thing anyone consented to was the base EULA, and even that isn't the kind of thing where you go "well maybe in the future they'll train a robot on this" because that's a ridiculous thing to have to extrapolate from
the absolute best you can hope for is AI trained on CC0 work, but uh... again, you start running into problems. how is that data going to be, you know, set up for training? where's that metadata coming from? who's writing the captions and tags for every single image? because you're going to be dealing with loads of images, and every one of them needs a whole shitload of work before it's ready for training (rough sketch of what that prep actually involves below). like I get we're getting deep into how capitalism is exploitative at every level, but we really can't escape this conversation without mentioning how especially exploitative this particular kind of labor is
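just to make that concrete: here's a minimal sketch of what "training-ready" means for a text-to-image setup, assuming the standard diffusion-style approach where every image gets paired with a text caption. all the names and numbers here are made up for illustration; the point is the shape of the labor, not any specific pipeline.

```python
from dataclasses import dataclass

# a text-to-image model learns from (image, caption) pairs, so
# "training-ready" means every single image comes with descriptive
# text, plus (for in-house data) a provenance check.
@dataclass
class TrainingExample:
    image_path: str  # the art itself
    caption: str     # human-written description of what's in it
    license: str     # who made it, under what terms

# hypothetical in-house library: 10,000 images (tiny by diffusion
# standards; Stable Diffusion trained on billions of image-text pairs)
dataset = [
    TrainingExample("art/concept_0001.png",
                    "a knight standing in fog",
                    "in-house, work for hire"),
    # ... times ten thousand, each one captioned by a person,
    # or by another model that was itself trained on scraped data
]

# back-of-envelope labor cost: one minute per caption
num_images = 10_000
minutes_per_caption = 1
print(f"captioning alone: ~{num_images * minutes_per_caption / 60:.0f} hours")
```

and one minute per image is generous: caption quality directly shapes what the model learns, so somebody is doing that work carefully, at scale, for every single image, before you even get to license checks or cleanup.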
this really isn't something you can fix. generative AI is broken at its very core
