I have had partial success sending the Software Heritage Archive a takedown notice. They have at least removed one repository of mine from their website. I don't know yet whether it has been, or will be, removed from their dataset.
This repository (which happens to have been my 2020 Advent of Code solutions) was unquestionably mine, not theirs to use in any way, because I never put any kind of license on it.
I had no idea that the scope of their archive had grown so large that it included ephemera like this, but it makes sense: now they want as much code as they can possibly hoard so they can feed it into a Hugging Face AI dataset.
I sent this message to dpo@inria.fr and takedown@softwareheritage.org:
The Software Heritage Archive contains an infringing copy of my code:
The copyright on this code belongs strictly to me. GitHub merely has permission to host it under their terms of service. It is not available for distribution under any terms. It is not licensed for use for any purpose.
You must cease and desist using, copying, and distributing this code. You must remove this code and all copies of it from your archive, including where it appears in data exports and derived datasets such as "The Stack", within 30 days.
If you've put a repository on GitHub without a license on it, and you see that they've made a copy of it on https://archive.softwareheritage.org/, I encourage you to do the same. (You could also check "Am I in The Stack?", except it went down.)
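If you'd rather script the check than search the website by hand, here is a minimal sketch in Python. It assumes the Software Heritage web API's origin lookup endpoint (`/api/1/origin/<origin URL>/get/`) behaves as its public documentation describes, answering 404 for an origin it doesn't know about. The function names and the example repository URL are hypothetical, made up for illustration. It only tells you whether the archive lists your repository as an origin, not whether your code ended up in any derived dataset.

```python
import urllib.error
import urllib.parse
import urllib.request

# Assumption: base URL of the public Software Heritage web API.
API_BASE = "https://archive.softwareheritage.org/api/1"


def origin_lookup_url(repo_url: str) -> str:
    """Build the origin lookup URL for a repository (hypothetical helper).

    Assumption: the origin URL is embedded in the path as-is, so ':' and '/'
    are left unescaped and everything else is percent-encoded.
    """
    return f"{API_BASE}/origin/{urllib.parse.quote(repo_url, safe=':/')}/get/"


def is_archived(repo_url: str) -> bool:
    """Return True if the archive reports this origin, False on a 404."""
    request = urllib.request.Request(
        origin_lookup_url(repo_url),
        headers={"Accept": "application/json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # rate limiting or a server error: don't guess


if __name__ == "__main__":
    # Hypothetical repository URL; substitute your own.
    repo = "https://github.com/example/advent-of-code-2020"
    print(repo, "archived:", is_archived(repo))
```

A `False` here only means the origin lookup returned 404 at that moment. If the answer is `True`, that is your cue to send the notice.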