If you're anything like me, you've wondered what would happen if you used a compression algorithm for something it was not designed for. A lot of the time, the results are boring — you just get really large output.
However, image and audio codecs are special. Nowadays, psychoacoustic and psychovisual codecs are used — that is, codecs that are tuned specifically for human perception, and throw away information that humans will not notice. A great example of this is that most audio codecs will simply discard all audio frequencies above 20kHz. But the characteristics of the visual system and the auditory system are very different.
To explore this idea, I've created a dedicated page on my site with an interactive demo showing a sample image converted to audio, compressed with a lossy codec, and then back again, with various different codecs and parameters to choose from. That page has an extended introduction; I'll reproduce some of it here:
There are two obvious ideas — what if we use an audio codec to encode imagery, or an image codec to encode audio? Unfortunately, doing anything strange with audio tends to produce horrible screeching, and that generally makes you want to duck and cover or run away screaming. Imagery is a lot different — for most people, a garishly colored or otherwise corrupted image can be viewed safely. So this page provides images that have been compressed with lossy psychoacoustic audio codecs, and then converted back to images.
In fact, these serve as interesting visualizations of the various kinds of changes that audio codecs make to their bitstreams. None of these are useful means of comparison (for that you need an ABX test), but they are at least entertaining.
Methodology
For the uninitiated, the very idea here makes no sense. How do you expect to encode an image as audio? They're completely different formats!
Well, on a computer, everything is, at the end of the day, a stream of bits: ones and zeroes. Without the proper tagging information, you can "lie" to the computer and tell it that some data is in a different format than it really is. Calling it lying is overselling it a bit, though; computers do not truly "know" anything, data is data. 8 bits make up a byte, which we can represent as a number from 0 to 255, or 00 to FF in hexadecimal.
To oversimplify, raw imagery looks something like this (in hex):
```
FF AA 00
00 AA FF
55 66 77
```
Or just FFAA0000AAFF556677 as a continuous stream. These represent a gold color, a light-blue color, and a blue-gray color respectively. You just repeat that for as many pixels as you have to represent — so a 640x480 image is just 307,200 such triplets in a row, or 921,600 bytes. PNG, JPEG, WebP, JXL, etc. are all just different ways to encode that without needing all that raw data.
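If you want to see that interpretation spelled out, here's a quick Python sketch. The bytes are just the three example pixels above, not anything from a real image:

```python
# The three example pixels above, as one continuous byte stream.
raw = bytes.fromhex("FFAA0000AAFF556677")

# Group the stream into (R, G, B) triplets, one tuple per pixel.
pixels = [tuple(raw[i:i + 3]) for i in range(0, len(raw), 3)]
print(pixels)  # [(255, 170, 0), (0, 170, 255), (85, 102, 119)]
```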
Meanwhile, audio is encoded as PCM — just a list of amplitudes.
```
B8 DC F4 FF FD EB CA 94 4B 20 08 02 0B 25 4E
```
Or just B8DCF4FFFDEBCA944B2008020B254E as a continuous stream.
So, 10 seconds of 8-bit 48kHz mono audio (fairly standard) is just… 480,000 bytes in a row. 16-bit and 24-bit audio are generally more common these days, but 8-bit is easier to discuss so I'll stop there. Opus, Vorbis, MP3, AAC, etc. are all just ways to encode that in less space.
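For symmetry, here's the audio side of that in the same kind of Python sketch, using the example bytes above:

```python
# The example bytes above, read as 8-bit unsigned PCM: one byte per amplitude sample.
raw = bytes.fromhex("B8DCF4FFFDEBCA944B2008020B254E")
samples = list(raw)
print(samples)  # [184, 220, 244, 255, 253, 235, 202, 148, 75, 32, 8, 2, 11, 37, 78]

# Ten seconds of 8-bit mono audio at 48 kHz is just 480,000 such bytes in a row.
print(48_000 * 10)  # 480000
```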
Okay, long-winded diversion over. Hopefully you have some idea of what I mean now — it's all just bits, and bits can group up into numbers, and numbers are how audio and images both work. So, what happens if we take a raw image stream, and give it to an audio encoder as if it were audio?
Well, this is what happens. I'm using Bliss, the famous Windows XP wallpaper, as an example. No particular reason; it's just what I was playing with when I first posted about this on Mastodon three years ago. We'll re-interpret the raw image data as raw audio data, with 8-bit samples, in stereo, at 48kHz.
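If you want to try the round trip yourself, here's a rough Python sketch that shells out to ffmpeg. To be clear, this isn't necessarily how the demo on my site does it; the filenames, the 640x480 dimensions, and the choice of Opus are just placeholders for illustration.

```python
import subprocess

WIDTH, HEIGHT = 640, 480  # swap in the real dimensions of your image

# 1. Dump the image to raw RGB bytes.
subprocess.run(["ffmpeg", "-y", "-i", "bliss.png",
                "-f", "rawvideo", "-pix_fmt", "rgb24", "bliss.rgb"], check=True)

# 2. "Lie" to the encoder: treat those bytes as 8-bit unsigned stereo PCM at 48 kHz.
subprocess.run(["ffmpeg", "-y", "-f", "u8", "-ar", "48000", "-ac", "2",
                "-i", "bliss.rgb", "-c:a", "libopus", "bliss.opus"], check=True)

# 3. Decode the "audio" back to raw 8-bit bytes...
subprocess.run(["ffmpeg", "-y", "-i", "bliss.opus",
                "-f", "u8", "-ar", "48000", "-ac", "2", "decoded.raw"], check=True)

# 4. ...and reinterpret them as an image again. Lossy codecs may pad or trim
#    samples, so force the byte stream back to exactly one frame's worth.
frame_bytes = WIDTH * HEIGHT * 3
with open("decoded.raw", "rb") as f:
    data = f.read()[:frame_bytes]
data = data.ljust(frame_bytes, b"\x00")
with open("trimmed.raw", "wb") as f:
    f.write(data)

subprocess.run(["ffmpeg", "-y", "-f", "rawvideo", "-pix_fmt", "rgb24",
                "-s", f"{WIDTH}x{HEIGHT}", "-i", "trimmed.raw",
                "-frames:v", "1", "bliss_roundtrip.png"], check=True)
```

The trim-and-pad step matters because lossy audio codecs rarely hand back exactly the same number of samples they were given.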
[Image: Bliss]
