how technology & human effort let us preserve the often hostile, incomprehensibly large web

First Encounter: Javascript

archiving a normal website is easy. you load up the HTML, find everything it links to with normal tags like <a> and <link>, download those, repeat.
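
in python, that whole loop is only a handful of lines. here's a toy sketch (illustrative only - a real archiver also saves the raw responses into WARCs, scopes itself to one site, checks content types, and so on):

import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

# collect the href/src of every <a>, <link>, <script> and <img> we see
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("a", "link") and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag in ("script", "img") and attrs.get("src"):
            self.links.append(attrs["src"])

def crawl(start_url):
    seen, queue = set(), [start_url]
    while queue:
        url = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        page = urllib.request.urlopen(url).read().decode("utf-8", "replace")
        # a real archiver would write the raw bytes into a WARC right here
        collector = LinkCollector()
        collector.feed(page)
        queue.extend(urljoin(url, link) for link in collector.links)
    return seen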

picrew.me is not a normal website. it is a website built atop the hallowed bones of Modern Javascript. in this realm we do not write tags by hand, you fool; we ask javascript to write them, and then we twist that javascript into an obfuscated, minified and monstrous reflection of itself.

this means we can't just rely on the raw HTML, because it's incomplete. stylesheets and javascript are loaded by scripts that are themselves loaded by other scripts, through code that looks like this:

new Promise((function(t,r){for(var n={0:"1277e5ce5342b60da42d",3:"31d6cfe0d16ae931b73c",4:"31d6cfe0d16ae931b73c",5:"31d6cfe0d16ae931b73c",6:"99d1582281960cfd42f9",7:"31d6cfe0d16ae931b73c",8:"46a8df33541082a86879",9:"868c05267c7a48f8659f",10:"31d6cfe0d16ae931b73c",11:"31d6cfe0d16ae931b73c",12:"31d6cfe0d16ae931b73c",13:"31d6cfe0d16ae931b73c",14:"31d6cfe0d16ae931b73c",17:"31d6cfe0d16ae931b73c",18:"31d6cfe0d16ae931b73c"}[e]+".css"

this is why dynamic loading is often the foremost enemy of the archivist: we cannot rely on parsing HTML anymore. we need to either run a full, resource-hungry and slow browser, or learn how to reverse the tricks of every different approach to javascript obfuscation.

the latter is the approach i took when archiving picrews: acting like a browser and manually trawling through every possible combination of images a picrew may contain would have taken hours upon hours. so instead i spent hours upon hours writing a regex that looks like this: {([^{]*)}(?=\[e\]\s*?\+\s*?"\.(css|js)")
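
in python, pointing that regex at the minified bundle looks roughly like this. (the base url and filename template below are placeholders i made up for illustration - the real ones have to be dug out of the same chunk loader code.)

import re

# grab the {chunkId: "contenthash", ...} object literal that sits right
# before [e]+".css" / [e]+".js" in the minified webpack chunk loader
CHUNK_MAP = re.compile(r'\{([^{]*)\}(?=\[e\]\s*?\+\s*?"\.(css|js)")')
PAIR = re.compile(r'(\d+):"([0-9a-f]+)"')

def chunk_urls(bundle_js, base="https://example.invalid/static/"):
    # `base` is a stand-in; picrew's actual asset path is different
    urls = []
    for mapping, ext in CHUNK_MAP.findall(bundle_js):
        for chunk_id, content_hash in PAIR.findall(mapping):
            # webpack glues these into something like <id>.<hash>.<ext>;
            # the exact filename template varies per site, so this is a guess
            urls.append(f"{base}{chunk_id}.{content_hash}.{ext}")
    return urls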


Second Encounter: This Is Too Many Urls

after parsing cursed javascript, learning picrew's format for storing image combinations in hidden JSON files, and trawling through their search page, i had acquired a cool 5,716,000 urls.

that's too damn many urls.

so i did what i was warned never to do: learn the wonders of HTTP/2. unfortunately, despite existing for 7 years, being supported by over 40% of websites, and being way faster than HTTP/1.1, it's not allowed by the main website archival file format, WARC.
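
(the speed comes from multiplexing: one HTTP/2 connection can carry a pile of requests at once, instead of HTTP/1.1's effectively one-at-a-time-per-connection. a tiny sketch of what that buys you, using httpx as an example client - not necessarily what we actually ran:)

import asyncio
import httpx  # pip install "httpx[http2]"

async def fetch_all(urls):
    # all of these requests share, and are multiplexed over, a small
    # number of HTTP/2 connections instead of queueing up one by one
    async with httpx.AsyncClient(http2=True) as client:
        return await asyncio.gather(*(client.get(u) for u in urls))

# in practice you'd cap concurrency rather than gather 5 million at once
responses = asyncio.run(fetch_all(["https://picrew.me/"]))

fetching is the easy half, though; the problem is purely on the storage side.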

but... HTTP/2 often looks a lot like HTTP/1.1. so much so, in fact, that we can often just pass it off as if it was...

this is a breach of protocol - we are altering the data we are archiving! but when you are scraping 5 million links with two computers (thanks to the help of @artemis), you must learn how to live with sin.

so i built a converter that squeezes HTTP/2 exchanges into WARC's HTTP/1.1-shaped records. it's off-spec and frowned upon, but it got the job done in days instead of months - and in the realm of picrew.me, a picrew may live only for weeks before being DMCA'ed off the face of the earth.
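
the core of that converter, very roughly. this is a toy sketch using httpx and warcio as stand-ins for the real tooling, and it commits exactly the sin described above: it hand-writes an HTTP/1.1 status line and header block that never actually went over the wire.

from io import BytesIO

import httpx  # pip install "httpx[http2]"
from warcio.statusandheaders import StatusAndHeaders
from warcio.warcwriter import WARCWriter

def archive_one(url, warc_path="picrew.warc.gz"):
    with httpx.Client(http2=True) as client:
        resp = client.get(url)

    # HTTP/2 has no reason phrase and its headers are binary-framed,
    # so we fabricate an HTTP/1.1-shaped version of the same exchange
    status_line = f"{resp.status_code} {resp.reason_phrase or 'OK'}"
    headers = [(k.decode(), v.decode()) for k, v in resp.headers.raw]
    http_headers = StatusAndHeaders(status_line, headers, protocol="HTTP/1.1")

    with open(warc_path, "ab") as out:
        writer = WARCWriter(out, gzip=True)
        record = writer.create_warc_record(
            url, "response",
            payload=BytesIO(resp.content),
            http_headers=http_headers,
        )
        writer.write_record(record)
    # caveat: the client already decoded the body, so headers like
    # content-encoding / content-length may need fixing for clean replay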

Third Encounter: uhh conclusion

while a lot of this post was complaining, i want to end on a hopeful note: what it shows is that two people can archive a massive, weirdly designed website on their own. the web is huge, yes, but we have the technology to untangle it and preserve it. we just need to use it.

and once more, thanks to @artemis - who took on downloading half of the URLs for this project! it'd have been a lot harder without her.

