arborelia
@arborelia

When I was looking for more information about what recourse I might have against the Software Heritage Archive, which has been putting code that deadnames me and other people into their archive, I encountered this announcement of theirs:

They pivoted to AI! They were blockchain bros in 2018 and they pivoted to AI in 2023! Of course they did!

Now you don't have to be a European who changed their name to have an action you can take against them. Everyone who has put code on GitHub has a claim against them.

Their press release says "ethical" several times in hopes that it becomes true, but they have taken all the code they could possibly scrape from GitHub except sometimes leaving out GPL code. Apparently what makes it "ethical" is that you can ask for an opt-out, and they will try to remember to get around to removing the code you specified from later versions. Though of course we've seen how the Archive operates -- they'll most likely say "oh, we removed your code, but not this identical copy of it that we also have", or "we endeavor in the future to be able to remove your code".

Here are some calls to action:

  • If you've ever put code on GitHub, check Am I In The Stack? (edit: more direct link), which will say what they've scraped from your GitHub namespace in particular.
  • If you enjoy futility, send them an opt-out request.
  • If you have a HuggingFace login or you can stomach getting one, log in to HuggingFace and report the dataset for copyright infringement. It's on the three inconspicuous vertical dots on the right sidebar.
  • Send a takedown notice to takedown@softwareheritage.org, demanding that they take your code out of all versions of their AI model dataset, as well as their archive, because they have almost certainly violated your license -- even if it's open source, especially if it's open source -- and therefore they no longer have any right to your code.

No matter what they say, they are not following your license unless your license is equivalent to the public domain.


Every open-source license I know of has an attribution clause. Even for permissively licensed code, they have to credit the author, and usually they have to include the license text that allowed them to use the code.
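
To make concrete what an attribution clause asks for, here is a minimal sketch in Python of the bookkeeping involved: one credit line plus the full license text for every source you draw on. The `Source` record, the `build_notice` helper, and the example project are all hypothetical, made up for illustration; real licenses differ in exactly what they require.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """One piece of third-party code (hypothetical record, for illustration)."""
    project: str       # name of the project the code came from
    author: str        # copyright holder who must be credited
    license_name: str  # short name of the license, e.g. "MIT"
    license_text: str  # full text the license asks you to reproduce


def build_notice(sources):
    """Return NOTICE-style text that credits each (project, author) pair once."""
    seen = set()
    sections = []
    for src in sources:
        key = (src.project, src.author)
        if key in seen:
            continue  # already credited, don't repeat the license text
        seen.add(key)
        sections.append(
            f"{src.project}\n"
            f"Copyright (c) {src.author}\n"
            f"License: {src.license_name}\n\n"
            f"{src.license_text.strip()}\n"
        )
    return "\n".join(sections)


if __name__ == "__main__":
    # Made-up example data; not a real project or author.
    example = Source(
        project="example-lib",
        author="Jane Doe",
        license_name="MIT",
        license_text="Permission is hereby granted, free of charge, ...",
    )
    print(build_notice([example]))
```

The point is that the notice can only be built if you know who wrote each piece and under which license.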

They claim that users of their model are responsible for following all the attribution clauses, and they claim that they provide the information to let users do so. For one thing, licenses don't work that way! You can't break a license and then say it's someone else's job to comply with it.

If their supposedly reassuring statement about attribution were true, each user of the model would have to write pages of license text next to every line of code they "wrote" (plagiarized) using the BigCode model. I assure you, no user is doing that. But also it's just not true, because there is no language model that is capable of correctly attributing its sources. They are simply lying, as they always do.

Also, if you didn't put a license on your code, the default license on GitHub is just that GitHub can store your code and forks of it. It's "all rights reserved" for everyone else. SWH took repositories like that also.

Remember, "fair use" is not a thing in France. They just stole your code.


in reply to @arborelia's post:

and scraping everyone's code to train up an LLM thingie is somehow relevant to the purported "mission"? how does this help with the ostensible mission of preservation? ugh, I'd think all the corporate blather about "missions" was a complete joke if I didn't see executives actually pretend to take their sense of mission seriously, when the mood strikes them anyway ~Chara

This is almost certainly why they based their whole archive on a very stupid immutable data structure and said it's like a blockchain. Because that was the thing you needed to do to get money from rich people in the late 2010s.

The only repo of mine that they grabbed contains five text files of probably-out-of-copyright poetry, but I sent in a request for removal anyway because screw ‘em, I didn’t say they could use it.

oh cool the stack manages to include: two open sourced projects that require reproducing the license along with whatever you do with it, and one project with no license, which means by default the license is "all rights reserved"

i'm furious and will be following the outlined steps