When I was looking for more information about what recourse I might have against the Software Heritage Archive, which has been putting code that deadnames me and other people into its archive, I encountered this announcement of theirs:
They pivoted to AI! They were blockchain bros in 2018 and they pivoted to AI in 2023! Of course they did!
Now you don't have to be a European who changed their name to have an action you can take against them. Everyone who has put code on GitHub has a claim against them.
Their press release says "ethical" several times in hopes that it becomes true, but they have taken all the code they could possibly scrape from GitHub except sometimes leaving out GPL code. Apparently what makes it "ethical" is that you can ask for an opt-out, and they will try to remember to get around to removing the code you specified from later versions. Though of course we've seen how the Archive operates -- they'll most likely say "oh, we removed your code, but not this identical copy of it that we also have", or "we endeavor in the future to be able to remove your code".
Here are some calls to action:
- If you've ever put code on GitHub, check Am I In The Stack? (edit: more direct link), which will say what they've scraped from your GitHub namespace in particular.
- If you enjoy futility, send them an opt-out request.
- If you have a HuggingFace login or you can stomach getting one, log in to HuggingFace and report the dataset for copyright infringement. It's on the three inconspicuous vertical dots on the right sidebar.
- Send a takedown notice to takedown@softwareheritage.org, demanding that they take your code out of all versions of their AI dataset, as well as their archive, because they have almost certainly violated your license -- even if it's open source, especially if it's open source -- and therefore they no longer have any right to your code.
No matter what they say, they are not following your license unless your license is equivalent to the public domain.
Every open-source license I know of has an attribution clause. Even for permissively licensed code, they have to credit the author, and usually they have to include the license text that allowed them to use the code.
They claim that users of their model are responsible for following all the attribution clauses, and they claim that they provide the information to let users do so. For one thing, licenses don't work that way! You can't break a license and then say it's someone else's job to comply with it.
If their supposedly reassuring statement about attribution were true, each user of the model would have to write pages of license text next to every line of code they "wrote" (plagiarized) using the BigCode model. I assure you, no user is doing that. But also it's just not true, because there is no language model that is capable of correctly attributing its sources. They are simply lying, as they always do.
Also, if you didn't put a license on your code, the default license on GitHub is just that GitHub can store your code and forks of it. It's "all rights reserved" for everyone else. SWH took repositories like that also.
Remember, "fair use" is not a thing in France. They just stole your code.
