fizbin


I have discovered a bug in how the Wayback Machine archives sites, and I have absolutely no idea how to report it to anyone who would care.

To see this in action, compare https://breakmessage.com/ and https://web.archive.org/web/20240122142937/https://breakmessage.com/


If you care about the technical details: it appears that when an HTML tag has an attribute with &-encoded data in it (e.g. <input type="submit" value="&nbsp;go&nbsp;">), the Wayback Machine interprets the & escapes when it stores the value, and keeps the attribute value as UTF-8-encoded text. So in the example, it would store not the 14 characters &nbsp;go&nbsp;, but rather the 4 characters \u00A0\u0067\u006F\u00A0, and when serving the page back up it would send them as the six UTF-8 bytes c2 a0 67 6f c2 a0. (I'm totally guessing about the intermediate storage format, but I do know that when it reads &nbsp; it serves that back as the two bytes c2 a0.)
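To make the character and byte counts concrete, here is a minimal Python sketch of the round trip I'm describing. It only illustrates the arithmetic; the variable names are mine, and it says nothing about how the archiver is actually implemented.

```python
import html

# Attribute value exactly as written in the page source (pure ASCII).
raw_attr = "&nbsp;go&nbsp;"

# What you get once the & escapes are interpreted.
decoded = html.unescape(raw_attr)

# What goes over the wire if that decoded text is written back out as UTF-8.
served = decoded.encode("utf-8")

print(len(raw_attr))     # 14
print(len(decoded))      # 4
print(served.hex(" "))   # c2 a0 67 6f c2 a0
```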

This is fine for any page that explicitly declares itself to be UTF-8, in headers or elsewhere, as the majority of pages do. However, if a page declares no character set (say, because it sticks to ASCII and uses &-escapes for everything else), then web.archive.org also declares no character set when it serves the page back. Most browsers will then treat the page as ISO-8859-1 (in practice, windows-1252), but the contents of HTML attributes arrive as UTF-8, leading to the muck seen in the archived copy linked above.
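The mismatch is easy to reproduce outside a browser. This sketch, assuming the six bytes from my example, decodes them once as UTF-8 (what the archiver presumably intended) and once as ISO-8859-1 (roughly what a browser does when no character set is declared):

```python
# The six bytes served for the attribute value in my example.
served = "\u00a0go\u00a0".encode("utf-8")

# Decoded with the encoding the bytes were actually written in.
as_utf8 = served.decode("utf-8")

# Decoded the way a browser would if it assumed ISO-8859-1.
as_latin1 = served.decode("iso-8859-1")

print(repr(as_utf8))     # '\xa0go\xa0'
print(repr(as_latin1))   # 'Â\xa0goÂ\xa0'
```

Each c2 byte turns into a stray Â in front of the non-breaking space, so a button that should just say "go" picks up two extra characters.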

As a web page author, I can fix this by having my page explicitly declare itself to be UTF-8 despite being pure ASCII and therefore needing no translation, but it really seems like an archiver should go out of its way not to mangle data at ingestion time. (Either that, or it should detect "oh, this page doesn't declare itself as UTF-8. I'd better render all non-ASCII in HTML attributes as stuff like &#x00A0;")
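For what it's worth, the second option is not much code. Here is a rough sketch of the kind of re-escaping I have in mind; escape_non_ascii is a hypothetical helper of my own, not anything the Wayback Machine has, and a real archiver would need to apply it only where it rewrites attribute values.

```python
def escape_non_ascii(text: str) -> str:
    """Replace every non-ASCII character with a hex numeric character reference."""
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):04X};"
        for ch in text
    )

decoded = "\u00a0go\u00a0"            # attribute value after the & escapes were interpreted
safe = escape_non_ascii(decoded)

print(safe)                           # &#x00A0;go&#x00A0;
print(safe.encode("ascii"))           # pure ASCII, so the declared charset no longer matters

# Python's built-in error handler does much the same thing, using decimal references.
print(decoded.encode("ascii", "xmlcharrefreplace"))   # b'&#160;go&#160;'
```

The output is not byte-for-byte what the original page said (&nbsp; becomes &#x00A0;), but it renders identically whatever encoding the browser guesses.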

