nago-
@nago-

I guess I'll formally spill the beans on the poorly kept secret that what I've been working on is fully reverse engineering the avatar and background file formats for Microsoft Comic Chat. (Knowing the crowd here on cohost: you probably know about this program through the NSFW webcomic BoneQuest.)

(Look back through my #mscc tag for posts on this subject.)

I went to look up how the files work and was surprised that no good resource for it existed already.

This turned into me writing my own toolset for analyzing, extracting, editing and repairing these comic chat avatars. That project turned into me scraping the web trying to find as many files as I could to use as test data for perfecting my toolset.

For reasons I can't go into just yet, that project turned into me scraping archive.org itself attempting to recover and repair as many avatars as I possibly could that have otherwise been lost forever. This process requires me to at least know the domain to which an avatar may have been posted, and has led me to scraping the entirety of angelfire, xoom, tripod and geocities archives from the wayback machine.

There will be a longer writeup later accompanied by a full FOSS release of the library and toolkit. I was aiming for April Fool's Day originally, but now it's almost July...

...The good news is that I went from 3000 files (sourced mostly from mermeliz) to almost 16,000 avatars and background files. Only a scant few are irreparable, fewer than 20 at current count, and I am still making frequent progress in uncorrupting files and finding replacements. There are unfortunately many more that are currently entirely lost to the sands of time, already 404 by the time archive.org got to them.

Of those 16,000 files, there are about 6,300 absolutely unique avatars. Many of these have very likely never seen the light of day, as the free web hosts they were stored on corrupted the files when they were first served back in the late 90s. I don't think anyone had the expertise or desire to fix them until... now.

That said, I have a request for you:

If any of you have MS Comic Chat avatars, backgrounds, or source files on your hard drives (.avb, .bgb or .avs files) or know where to find any, I am officially happy to receive literally any and all files for the sake of preserving a very interesting piece of history. I will even take corrupt ones from dying hard drives. I have fixed worse.

I am also very happy to know about literally any website that has ever hosted any of these files. I have crawled and scraped a truly impressive number of them, but on the off chance you know of any (especially for non-English speaking audiences) I'd be delighted to hear about it.

Or, if you know anyone who might be interested in this project - please share this chost with them! I'm happy to chat about this at exhausting length with anyone who would listen. I'd be especially happy to connect with anyone at archive.org for more efficient searching of their archives. I know there's more to find there.

(Yes, I am already in contact with mermeliz, you don't need to share this post with her, thanks!)

Sincerely, with love and heartfelt affection;
--nago


seria-mau-genlicher
@seria-mau-genlicher

When I worked on the Skype team I tried SO HARD to bring MS Comic Chat back for MS Teams but “The Man” wasn’t into it


karobit
@karobit

thinking about the parallel timeline where microsoft didn't shutter their twitch competitor mixer unceremoniously after buying & rebranding beam, and instead integrated ms comic chat into a widget streamers could put onto their feed to display the viewer chat



in reply to @nago-'s post:

https://trimex.us/~nago/hugh.7z ... the Microsoft OEM avatar files actually have separate face and body data, which allow them a greater range of emotion with fewer actual bytes. Microsoft planned the ability to create these with CCEdit, but the functionality never saw the light of day.

In this archive, I included the separate heads and bodies, and I also included every possible combination of head+body -- but some are likely to be duplicates (for reasons) and some are probably impossible to get MSCC to actually display.

But, there's hugh!

[Recommend saving this file, I don't promise it'll stick around past the next time I clean my webfolder.]

awesome work. I'm guessing by scraping those domains, you went into the archive.org block/report files (I can't recall the exact term) and pulled all the URLs? I was thinking about doing that for another project, but it never got off the ground because it was going to take, like, literal YEARS of processing archive.org data. Did you figure out a better way?

Oh, hi foone!

Literally what I've been doing is using the CDX API and scraping the paginated results on a per-domain basis with no filtering. I found that any filter I add tends to get applied after pagination, so what I do instead is just query for an entire domain and filter locally.

Not terribly efficient, but it does seem to be thorough. I probably owe them a pretty good donation by now. The biggest domains I've been able to crawl have 13,000 pages of results, which only take about a day to iterate.
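A rough sketch of that "query the whole domain, filter locally" approach follows. It is not nago's actual tooling. It assumes the public Wayback CDX endpoint with its `url`, `matchType` and `page` parameters, and it assumes the default space-separated output with seven fields per line. The function names and the extension list are made up for illustration, and the page fetcher is injectable so the filtering can be tried without touching the network.

```python
import re
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"
# Extensions of interest (avatars, backgrounds, sources), matched case-insensitively.
WANTED = re.compile(r"\.(avb|bgb|avs)$", re.IGNORECASE)
# Assumed default field order of a plain-text CDX line.
FIELDS = ("urlkey", "timestamp", "original", "mimetype",
          "statuscode", "digest", "length")

def build_cdx_url(domain, page):
    """Ask for one page of an entire domain, with no server-side filter."""
    params = {"url": domain, "matchType": "domain", "page": page}
    return CDX_ENDPOINT + "?" + urlencode(params)

def fetch_page(domain, page):
    """Fetch one page of results as text lines (needs network access)."""
    with urlopen(build_cdx_url(domain, page)) as resp:
        return resp.read().decode("utf-8", "replace").splitlines()

def parse_line(line):
    """Turn one CDX line into a dict, or None if it looks malformed."""
    parts = line.split(" ")
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))

def local_candidates(lines):
    """The local filter: keep captures whose URL path ends in a wanted extension."""
    for line in lines:
        row = parse_line(line)
        if row is None:
            continue
        path = row["original"].split("?", 1)[0]
        if WANTED.search(path):
            yield row

def crawl_domain(domain, num_pages, fetch=fetch_page):
    """Walk every page of a domain's results and yield candidate captures."""
    for page in range(num_pages):
        yield from local_candidates(fetch(domain, page))
```

The page count has to come from somewhere. The CDX server can report it for a query, but here it is simply passed in as `num_pages`. A polite scraper would also sleep between requests.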

I can't crawl, for instance, ".com", which I calculated would take about 18 years with a single threaded, polite scraper.

I scraped all of the big free hosts I remember (tripod, xoom, angelfire, geocities) and saved a list of candidate URLs. In a second loop, I'd wget everything that wasn't nailed down. Then I'd parse the metadata of every file I obtained to build a new candidate list of domains to scrape, and lather, rinse, repeat. I'm at a point where I think I'm finally out of new domains to scrape... but I haven't done any deep investigation or parsing of HTML files for links on websites, which would possibly yield some more. It's just that you have to cut the crawling off "somewhere" and I have "a dayjob", so I haven't gone this far yet.
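The scrape, download, parse, repeat loop described above is basically a worklist that runs until no new domains turn up. Here is a minimal sketch of that shape. The three callables (`list_candidates`, `download`, `domains_from_file`) are hypothetical stand-ins for the real CDX scraping, the wget step and the avatar metadata parser, none of which are shown in the post.

```python
def crawl_rounds(seed_domains, list_candidates, download, domains_from_file):
    """Crawl domains until no file points at a domain we have not seen yet.

    list_candidates(domain) -> iterable of candidate URLs   (hypothetical)
    download(url)           -> file contents                (hypothetical)
    domains_from_file(data) -> iterable of domain names     (hypothetical)
    """
    seen_domains = set()
    seen_urls = set()
    pending = list(seed_domains)
    while pending:
        domain = pending.pop()
        if domain in seen_domains:
            continue
        seen_domains.add(domain)
        for url in list_candidates(domain):
            if url in seen_urls:
                continue
            seen_urls.add(url)
            data = download(url)
            # Any domain mentioned inside the file becomes a new crawl target.
            for new_domain in domains_from_file(data):
                if new_domain not in seen_domains:
                    pending.append(new_domain)
    return seen_domains, seen_urls
```

The loop ends on its own once a round stops producing unseen domains, which matches the "finally out of new domains" state described above.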

I was hoping I could somehow bribe someone at wayback with a charitable donation to just run a manual query for me or something and save me weeks or months of effort, but I don't know who or where to ask. It'd be amazing if someone could just look for *.[aA][vV][bB]. Obviously databases don't usually index strings right-to-left, so it's a pretty monstrous query...
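To illustrate why a suffix search is expensive on an ordinary index, and what a right-to-left index would buy, here is a toy sketch. It stores each URL lowercased and reversed in a sorted list, so "ends with .avb" becomes a cheap "starts with bva." prefix lookup. This is purely illustrative and says nothing about how the Wayback Machine actually stores its index. All names are made up.

```python
import bisect

def build_reversed_index(urls):
    """Sort lowercased, reversed URLs so that suffixes become prefixes."""
    return sorted(u.lower()[::-1] for u in urls)

def find_by_suffix(index, suffix):
    """Return (lowercased) URLs ending in suffix, via one binary search."""
    key = suffix.lower()[::-1]
    i = bisect.bisect_left(index, key)
    matches = []
    # All entries sharing the reversed prefix sit next to each other.
    while i < len(index) and index[i].startswith(key):
        matches.append(index[i][::-1])
        i += 1
    return matches
```

Without such an index, the only option is to scan every URL and test its ending, which is what makes the query monstrous at archive scale.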

In another comment on this post somewhere I listed all of the domains I've thought to crawl on wayback - anything wayback has that isn't 4xx or 206 for any capture on those domains is something I've grabbed. A lot of these are "200 OK" that are secretly redirects and 404s. Urgh.
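The "isn't 4xx or 206" rule is simple enough to write down. A minimal sketch, assuming the status code arrives as a string or int from the capture listing (the function name is made up):

```python
def is_wanted_status(statuscode):
    """Keep a capture unless its status is 206 or in the 4xx range."""
    code = str(statuscode)
    if code == "206":
        return False  # partial content, so the file is incomplete
    if len(code) == 3 and code.isdigit() and code.startswith("4"):
        return False  # client errors such as 404
    # Everything else is kept, including non-numeric placeholders.
    return True
```

Note that this does nothing about the "200 OK" captures that are secretly redirects or 404 pages. Those can only be caught by looking at the downloaded content itself.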

I think I've grabbed all of the really big sites (and multiple mirrors and host moves thereof for each), but even as I get to the bottom of the barrel I'm still finding 800-1000 new files per round of scraping, so I assume there's plenty more to find. I also folded in my own personal collection of avatars from my own Windows backups!