

unascribed
@unascribed

If you're anything like me, you've wondered what would happen if you used a compression algorithm for something it was not designed for. A lot of the time, the results are boring — you just get really large output.

However, image and audio codecs are special. Nowadays, psychoacoustic and psychovisual codecs are used — that is, codecs that are tuned specifically for human perception, and throw away information that humans will not notice. A great example of this is that most audio codecs will simply discard all audio frequencies above 20kHz. But the characteristics of the visual system and the auditory system are very different.

To explore this idea, I've created a dedicated page on my site with an interactive demo showing a sample image converted to audio, compressed with a lossy codec, and then converted back again, with a variety of codecs and parameters to choose from. That page has an extended introduction; I'll reproduce some of it here:


There are two obvious ideas — what if we use an audio codec to encode imagery, or an image codec to encode audio? Unfortunately, doing anything strange with audio tends to produce horrible screeching, and that generally makes you want to duck and cover or run away screaming. Imagery is a lot different — for most people, a garishly colored or otherwise corrupted image can be viewed safely. So this page provides images that have been compressed with lossy psychoacoustic audio codecs, and then converted back to images.

In fact, these serve as interesting visualizations of the kinds of changes that audio codecs make to the signals they encode. None of them is a useful means of comparing codecs (for that you need an ABX test), but they are at least entertaining.

Methodology

For the uninitiated, the very idea here makes no sense. How do you expect to encode an image as audio? They're completely different formats!

Well, on a computer, everything is at the end of the day a stream of bits, ones and zeroes. 8 bits make up a byte, which we can represent as a number from 0 to 255, or 00 to FF in hexadecimal. Without the proper tagging information, you can "lie" to the computer that some data is in a different format than it really is. Calling it lying is overselling it a bit, though. Computers do not truly "know" anything; data is data.

To oversimplify, raw imagery looks something like this (in hex):

Or just FFAA0000AAFF556677 as a continuous stream. Split into red, green, and blue triplets, that is FF AA 00, 00 AA FF, and 55 66 77, which represent a gold color, a light-blue color, and a blue-gray color respectively. You just repeat that for as many pixels as you have to represent, so a 640x480 image is just 307,200 such triplets in a row, or 921,600 bytes. PNG, JPEG, WebP, JXL, etc. are all just different ways to encode that without needing all that raw data.
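As a rough sketch in Python (this is not code from the demo page, and the variable names are made up for illustration), the three example pixels and the size of a raw 640x480 frame work out like this:

```python
# Illustrative only: three RGB pixels as one continuous byte stream.
stream = bytes.fromhex("FFAA0000AAFF556677")

# Split the stream back into (R, G, B) triplets, 3 bytes per pixel.
pixels = [tuple(stream[i:i + 3]) for i in range(0, len(stream), 3)]
print(pixels)  # [(255, 170, 0), (0, 170, 255), (85, 102, 119)]

# Size of a raw 640x480 image at 3 bytes per pixel.
width, height = 640, 480
pixel_count = width * height
raw_size = pixel_count * 3
print(pixel_count, raw_size)  # 307200 921600
```

This assumes plain 8-bit-per-channel RGB with no padding or alpha channel, which is the layout described above; other pixel formats would change the arithmetic.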

Meanwhile, audio is encoded as PCM — just a list of amplitudes.

For example (in hex, one byte per sample): B8DCF4FFFDEBCA944B2008020B254E as a continuous stream.

So, 10 seconds of 8-bit 48kHz audio (fairly standard) is just… 480,000 bytes in a row. 16-bit and 24-bit audio are generally more common these days, but 8-bit is easier to discuss so I'll stop there. Opus, Vorbis, MP3, AAC, etc are all just ways to encode that in less space.
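The same arithmetic can be sketched in Python. This is only an illustration: it assumes the example bytes are unsigned 8-bit samples (the convention WAV files use for 8-bit audio, with 128 as the midpoint) and a single channel, neither of which is stated outright above.

```python
# Illustrative only: the example byte stream read as 8-bit PCM samples.
samples = bytes.fromhex("B8DCF4FFFDEBCA944B2008020B254E")

# Each byte is one amplitude value from 0 to 255.
amplitudes = list(samples)

# Assumption: unsigned samples, so subtract 128 to centre them around zero.
centered = [a - 128 for a in amplitudes]
print(len(amplitudes), max(amplitudes), min(amplitudes))

# Size of 10 seconds of 8-bit, 48 kHz audio (assuming one channel).
sample_rate = 48_000
seconds = 10
bytes_per_sample = 1
pcm_size = sample_rate * seconds * bytes_per_sample
print(pcm_size)  # 480000
```

With two channels the byte count simply doubles, since stereo PCM interleaves one sample per channel.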

Okay, long-winded diversion over. Hopefully you have some idea of what I mean now: it's all just bits, and bits can group up into numbers, and numbers are how audio and images both work. So, what happens if we take a raw image stream, and give it to an audio encoder as if it were audio?

Well, this is what happens. I'm using Bliss, the famous Windows XP wallpaper, as an example. No particular reason, it's just what I was playing with when I first posted about this on Mastodon three years ago. We'll re-interpret the raw image data as raw audio data, with 8-bit samples, in stereo, at 48kHz.
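To make the re-interpretation step concrete, here is a minimal sketch using only Python's standard `wave` module. It is not the script behind the demo: the synthetic `raw_image` bytes are a stand-in for real image data, the 640x480 size is borrowed from the earlier example rather than from Bliss itself, and `image_bytes_to_wav` is a hypothetical helper name. The lossy codec step is left out entirely; the resulting WAV is what you would hand to an encoder.

```python
import io
import wave

# Stand-in for raw RGB image data (assumed 640x480, 3 bytes per pixel).
width, height = 640, 480
raw_image = bytes(i % 256 for i in range(width * height * 3))


def image_bytes_to_wav(raw, channels=2, sample_rate=48_000):
    """Wrap raw bytes in a WAV header, claiming they are 8-bit PCM audio."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(1)  # 1 byte per sample; 8-bit WAV audio is unsigned
        wav.setframerate(sample_rate)
        wav.writeframes(raw)  # the pixel bytes are written unchanged
    return buf.getvalue()


wav_bytes = image_bytes_to_wav(raw_image)

# Read it back: the "audio" payload is byte-for-byte the image data.
with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
    frame_count = wav.getnframes()
    duration = frame_count / wav.getframerate()
    roundtrip = wav.readframes(frame_count)

print(frame_count, duration)   # 460800 9.6
print(roundtrip == raw_image)  # True
```

Under these assumptions the image becomes 9.6 seconds of stereo "audio". Without a lossy codec in the middle the round trip is exact; the interesting results come from compressing that audio, decoding it, and reading the decoded samples back as pixels.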

Go to the demo!



in reply to @unascribed's post:

this is super cool. i've been doing audio databending on images for a while now but somehow i've never thought to try using a compressed format as part of the process.

I've been aware of Audacity databending for a while — it's interesting the kinds of things audio processing filters do when applied to general data. As a codec nerd, my immediate thought is to use a lossy codec in addition. :P

I'm working on a revision to the site with more pixel formats and codecs — the results from low-bitrate Speex are actually pretty interesting:

FFmpeg doesn't support codec2, so it's out of scope for now as I can't plug it into my encoding harness.

Someone mentioned this on fedi as well, but I forgot to update the comment here: the system I use to get an FFmpeg build doesn't support codec2 unless you turn on an option that builds almost every feature, which I don't want. So it's a bit of a rock/hard-place situation, as I also wouldn't want to build FFmpeg manually all the time.

The person that mentioned this on fedi did some test encodes using my script snippet, and found that even at max bitrates codec2 emits completely unrecognizable smears very similar to low-bitrate Speex.

I tried compressing text data as JPEG and MP3 once. IIRC, you could make out some words in the JPEG-compressed text but the MP3-compressed one was totally garbled. Your approach is more interesting.

takes me back to fucking around with images in audacity a whole decade ago, a bunch of people i knew online were doing that and it was a great time, i still have some of those saved as wallpapers. i've wanted to mess around with it again but been short of novel ideas, so this could certainly be something to play with

Maybe I missed it but what order are the pixels in? I was thinking whether something like a Hilbert curve where a following pixel is always a neighbor of the previous one* would change anything. This is probably less straightforward to implement with planar formats and chroma subsampling, though.

*edit: while still true, locality is preserved only in 1D->2D conversion, not the other way, makes sense ig