gretchenleigh

middle-aged multimedia queer

Gretchen
The PlayStation Experiment | Game Mag Print Ads | Rando Chrontendo
software engineer @ Internet Archive
anarcho-left
trans lesbian 🏳️‍🌈🏳️‍⚧️


gosokkyu
@gosokkyu

The long-running Japanese software portal vector.co.jp is ending their hp.vector homepage service and deleting all hosted sites on December 20—this service hosts a ton of turn-of-the-millennium doujin and homebrew pages, including a lot of pasocom-relevant material for PC98, MSX, etc by many authors who never moved and/or went MIA and aren't going to back up their pages of their own accord, so if there's anything you want to save, don't sleep on it. (IIRC they never expanded their 5MB hosting cap, so we're not talking unwieldy levels of data.)


gretchenleigh
@gretchenleigh

With the help of @dog and @asie, who were able to point me to the resources to build a pretty darn close to perfect list of seed URLs, we're expecting to get a full archive. It became a side project that I was able to successfully pitch as part of my work at Archive.

I'll have more updates soon as we make the material available, but the final product will be a collection with full-text search and potentially extended metadata, so more than just a bunch of unwieldy Wayback captures.


gretchenleigh
@gretchenleigh

Just a real quick update to say that the crawling process is going well, but it's taking a bit longer than I originally anticipated. A lot more Vector homepages had automated redirects on them than I was expecting, and our crawler follows those and tries to crawl the destinations. Some of those redirects went to sites that caused the crawler to enter a "trap" where it builds up a big queue of bogus links. I was able to tame these with some help from a teammate, but there are still more legit URLs to crawl. Hopefully soon!
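For anyone curious what a crawler "trap" looks like in practice, here is a minimal sketch of one common heuristic: flag URLs whose paths are suspiciously deep or keep repeating the same segments. This is not the rule set the crawl described above actually uses; the function name `looks_like_trap` and both thresholds are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def looks_like_trap(url, max_depth=12, max_repeats=3):
    """Guess whether a URL came from a crawler trap.

    Hypothetical heuristic: both thresholds are illustrative assumptions.
    """
    # Split the path into its non-empty segments.
    segments = [s for s in urlsplit(url).path.split("/") if s]

    # Very deep paths are often generated endlessly (calendars, relative-link loops).
    if len(segments) > max_depth:
        return True

    # The same segment showing up again and again usually means a link loop.
    counts = Counter(segments)
    return any(n > max_repeats for n in counts.values())
```

A crawler could run a check like this before queueing a newly discovered link, so a looping site cannot fill the queue with bogus URLs.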


gretchenleigh
@gretchenleigh

Hey folks, I've found that the crawl has been digging up a ton of great stuff, so I'm continuing to run it. If you find that certain pages are missing (usually deep links within a site), it's likely that they'll eventually get filled in. But what we currently have available is around 40GB of the Vector Homepage Service and a lot of surrounding pages, including a lot of the main Vector software pages so we can capture as much of the actual software as possible.

I am hoping at some point to better organize this. We have the ability to apply metadata fields (most commonly Dublin Core fields for all my librarian friends out there, but custom fields are possible) for faceted filtering and a bunch of other options to make it easier to navigate. There's also full text search, although this probably isn't going to be perfect (at least not yet). But the data is being captured, so anything like that can be fixed/improved later.



in reply to @gosokkyu's post:

Turns out that as of at least 2016, Vector published a list of all homepages that existed. (There were around 5000.) So you can visit the full list of homepages manually, and there are also a couple attempts to archive the full set of pages before they go down. Given the very reasonable size of all of this, I think 100% is going to get backed up.

in reply to @gretchenleigh's post:

OK, good news: the URL format is 100% predictable. Usernames are in the format VAnnnnnn, where the user ID is a six-digit number left-padded with zeroes. I don't think they were assigned sequentially because there are a lot of missing numbers, but it should be very easy to just enumerate every possible username in that range to see which ones exist.
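A rough sketch of that enumeration, in case it helps anyone building their own seed list. The `VA` prefix and six-digit zero padding come from the description above; `BASE_URL` is an assumption about where the homepages live and should be checked before use. The function names are hypothetical. The code only generates candidates and makes no requests.

```python
# ASSUMPTION: the base path is a guess and is not confirmed in this thread.
BASE_URL = "http://hp.vector.co.jp/authors/"


def candidate_usernames(start=0, stop=1_000_000):
    """Yield every possible username of the form VAnnnnnn in [start, stop)."""
    for n in range(start, stop):
        # :06d left-pads the number with zeroes to six digits.
        yield f"VA{n:06d}"


def candidate_urls(base_url=BASE_URL, start=0, stop=1_000_000):
    """Yield one candidate homepage URL per possible username."""
    for name in candidate_usernames(start, stop):
        yield f"{base_url}{name}/"
```

That is a million candidates at most, so probing each one politely (with rate limiting) to see which exist is entirely feasible.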

in reply to @gretchenleigh's post:

Curious about full text search: how does this work across encodings? Those pages are primarily Shift-JIS, but if I enter Japanese in my browser I'll be typing UTF-8.

Ahhh, this is a great question!

First, I almost forgot to mention this: You're likely to hit quite a bit of mojibake looking at the older pages in the collection. The Shift-JIS text will render as UTF-8 on replay, I think probably due to the banner we insert on replayed pages (which is UTF-8, obviously). I need to see if there's a way to hint otherwise directly in Wayback replay, but it is possible to force a page to render with a non-default encoding in most browsers (e.g. this Chrome extension).
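To make the mojibake concrete, here is a small Python illustration of what happens when Shift-JIS bytes are read as UTF-8. The sample string is an arbitrary choice, and this only mimics the effect; it says nothing about how Wayback replay is implemented.

```python
# An arbitrary Japanese sample string ("game").
text = "ゲーム"

# The same characters become different bytes in each encoding.
sjis_bytes = text.encode("shift_jis")
utf8_bytes = text.encode("utf-8")

# Reading Shift-JIS bytes as UTF-8 yields replacement characters and stray ASCII.
garbled = sjis_bytes.decode("utf-8", errors="replace")

# Reading them with the right codec recovers the original text.
recovered = sjis_bytes.decode("shift_jis")

print(repr(garbled))
print(recovered)
```

Forcing the browser to use Shift-JIS is the equivalent of that last `decode("shift_jis")` step.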

As far as FTS goes, it's a very imperfect process in its current form, but we are working on improving it. You bring up a very good point that it probably doesn't handle Shift-JIS content very well even if we can change the default encoding on replay to fix the mojibake. This is a really great edge case to point out, and hopefully we can figure out something, because we have a lot of captures that aren't UTF-8.
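One way this could work in principle is to decode every capture to Unicode before indexing, so a UTF-8 query from the browser and a Shift-JIS page end up being compared as the same characters. The sketch below is a naive assumption-laden illustration, not how FTS is actually built: `decode_page`, `page_matches`, and the trial-and-error codec order are all hypothetical, and a real pipeline would also look at HTTP headers and `<meta>` charset hints.

```python
def decode_page(raw, candidates=("utf-8", "cp932")):
    """Decode captured bytes to text by trying each candidate codec in order.

    Naive heuristic (an assumption): Shift-JIS text is rarely valid UTF-8,
    so strict UTF-8 is tried first and cp932 is the fallback.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep what we can and mark the undecodable bytes.
    return raw.decode("utf-8", errors="replace")


def page_matches(raw, query):
    """Return True if the query string appears in the decoded page text."""
    return query in decode_page(raw)
```

With this, `page_matches` gives the same answer for a page whether it was stored as Shift-JIS or UTF-8, because the comparison happens on decoded text rather than raw bytes.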

Oh and if there's a way this can get fixed in the archive.org web interface for real, this would make my life a lot easier. This has been an ongoing encoding problem for non-English pages for a long time, and it would certainly make archives of historical pages in East Asian languages a lot easier to read!

I absolutely cannot guarantee anything, but lemme see what I can do. It's definitely something I can get on the board for Archive-It at least. I obviously 100% agree on its importance; I'm just not sure how much effort it would be to implement a fix.

I use Safari, which is the only mainstream browser that still lets you manually specify an encoding, so that hasn't been a problem for me, personally - but I'm sure it could be an issue for users of other browsers. I'm very used to the "specify the encoding if it gets confused" dance.