
yaodema
@yaodema

and why they might be better off as 6 bits

wait I thought a byte was always 8 bits?

nope! a byte is whatever size a "character" is, at its smallest. on modern systems, this is pretty much always 8 bits, but for a good while, the most common byte size was instead 6 bits. I'll get into why, and why we swapped, as this post goes on.

so we used to use 6 bit bytes?

yes, in the earliest binary computers, and in many later designs until around the mid 1970s, that was the norm. the reason we swapped is partly the fault of the American Standards Association, but also partly the fault of Johannes Gutenberg, and the fault of some monks in the late 700s choosing a new, faster way to write Latin.


let's start with the mechanical and electromechanical computers the 1950s inherited. basically all of these pre-electronic machines were designed to work with decimal digits. top-of-the-line engineering calculators offered ten decimal digits of precision, far more than the typical three significant digits you could get on a slide rule. so, when new, fully electronic computers came out, working in binary, they needed at least enough precision to match those calculators, or engineers would have little interest in them.

it turns out that 34 bits is enough to pull this off (having 17,179,869,184 possible values, more than the 10 billion you need!). adding a sign bit, this gives 35... but both 34 and 35 are rather ugly numbers of bits. the first is 2 x 17, the second is 5 x 7, and neither of these are very nice ways to split up the full word. so, they added one more bit, bumping this up to 36 bits. this allowed them to divide the word into 6 x 6 bit characters, each with enough space to fit all 26 letters and 10 digits used in English, plus punctuation, spaces, and some control characters.
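a quick sanity check on those numbers, in python just because it's handy (nothing here is specific to any real machine):

```python
# 34 bits is enough for ten decimal digits of precision
print(2 ** 34)           # 17179869184, comfortably more than 10 ** 10
# a 36-bit word splits evenly into six 6-bit characters...
print(36 // 6)           # 6
# ...and each 6-bit character has 64 codes for 26 letters, 10 digits, and the rest
print(2 ** 6, 26 + 10)   # 64 36
```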

later computers would come out in mostly 18 and 12 bit word sizes, but always with the concept of the 6 bit byte in mind, in these early days. also, sometimes, they cared about binary-coded decimal, which needs 4 bits for each decimal digit; this made 12, 24, and 36 bit words preferable.

some early 16 bit computers would come about, such as the MIT Whirlwind in the early 1950s, but these were unconcerned with handling written words, as far as I can tell. they were designed to handle only numbers, so they didn't need a multiple of 6 bits of width.

however, while IBM had its own 6 bit BCD character codes, these weren't standardized even between different IBM machines. you could never be sure that punch cards or tapes taken from one machine would be readable on another, unless they were the same model of machine. we needed a proper standard, and we'd get one in the 1960s from the American Standards Association, in the form of ASCII.

but first, let's look further back...

something like 770 CE

near as we can tell, the concept of minuscule latin script, as we know it today, came from an abbey somewhere to the north of Paris, sometime before 778 CE. "Carolingian minuscule" was created to make writing faster and more legible between scribes and readers of Latin, within the churches of Europe. the Irish "Insular script" and the widespread "uncial" scripts were beautiful, but not always that legible, in comparison. with majuscule characters used for headers and the starts of sentences, and minuscule characters within sentences, a more familiarly modern style of writing started to emerge.

this style would soon be common in official writing, posted bills, and so on, anywhere the Latin script was used. through cultural exchange, some other languages with similar enough alphabets would do the same, so Greek and Cyrillic eventually got their own minuscule forms.

then forward to the 1400s

as Johannes Gutenberg worked on a movable type printing press, it was plainly obvious to him that he needed both majuscule and minuscule letters. after all, proper writing by hand or by woodcut used both! he also needed many other special characters and diacritics and ligatures, and so on. the resulting metal letters and glyphs were sorted into cases, which were laid out according to the needs of the printer.

over time, it became the norm to put majuscule letters in an upper case and minuscule letters in a lower case for the printing press. while Gutenberg himself likely didn't come up with this standard, his decision to ensure all of these letters were present lets us draw a direct line to why we have this concept of "upper and lower case" nowadays.

and back to the 1960s

when the ASA decided what to do for a computer character standard, they were only concerned with English letters, which made the character set they needed small. but, they were also concerned with both upper and lower case letters, due to the centuries-old printing standards! since there isn't room in a mere 64 glyphs for all the upper and lower case characters, plus punctuation, spaces, control characters, and so on, they couldn't fit this into only 6 bits. (26 + 26 + 10 already takes up 62 of the 64 codes, leaving only 2 for everything that isn't a letter or number!)

they apparently briefly considered making the new standard 6 bits, but using shift codes to swap between code pages. two 6-bit sets would be enough to cover all the characters they needed, and would make the encoding very compact, saving both transfer time and storage space. however, data transmission at the time had error rates that were too high for this, and data was often sent with zero error correction. (for example, over the 300 baud frequency-shift keying modems that were the norm for a long time!)

without corrections, one error in a shift code could make long sections of the transmission illegible. for this reason, they gave up on the idea of using a 6 bit encoding, and moved to 7 bits instead. the encoding they made in the end is what we now know as ASCII.
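to see why that's so fragile, here's a toy two-page code with a single shift code (completely made up for illustration, not any real 6 bit standard):

```python
# page 0 is letters, page 1 is digits; code 63 toggles between them
LETTERS = {i: chr(ord('a') + i) for i in range(26)}
FIGURES = {i: chr(ord('0') + i) for i in range(10)}
SHIFT = 63

def decode(codes):
    page, out = LETTERS, []
    for c in codes:
        if c == SHIFT:
            page = FIGURES if page is LETTERS else LETTERS
        else:
            out.append(page.get(c, '?'))
    return ''.join(out)

good = [7, 4, 11, 11, 14, SHIFT, 1, 9, 6, 2]   # "hello" then "1962"
bad  = [7, 4, 11, 11, 14, 13,    1, 9, 6, 2]   # same stream, shift code corrupted
print(decode(good))   # hello1962
print(decode(bad))    # hellonbjgc - everything after the bad code lands on the wrong page
```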

while modern computers mostly use Unicode to send text, ASCII is still embedded in the first 128 symbols of Unicode, in the exact same positions. so, as long as UTF-8 encoded text uses no other characters, it's byte-for-byte identical to ASCII, and can still be read by older systems.
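you can check that compatibility directly in pretty much any modern language; in python, for instance:

```python
text = "plain old ascii text"
# ascii-only text encodes to the exact same bytes either way
print(text.encode("utf-8") == text.encode("ascii"))   # True
# anything outside ascii spills into multi-byte sequences instead
print("🦋".encode("utf-8"))                            # b'\xf0\x9f\xa6\x8b'
```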

microcomputers

as microchips became more common through the '70s and into the '80s, the computers built around them tended to use ASCII for characters on screen, or printed onto paper. since 7 bit words would be awkward, and 8 bits is a multiple of 4 that can hold two binary-coded decimal digits, these newer machines settled on an 8 bit byte. as a result, the 1970s saw microchip CPUs emerge with 8 and 16 bit word widths.

the 1980s saw the rise of microcomputers, computers small enough to sit on a desk, rather than being the desk or the entire room. these, of course, used microchips, and thus saw the use of CPUs like the Intel 8080 and the Zilog Z80 (8 bit), the MOS Technology 6502 (8 bit), and the Motorola 68000 (16/32 bit). as these became widespread, so too did the concept of the 8 bit byte.

and that's why bytes are 8 bits now.

since no 6 bit encoding standard ever emerged that could cover the entire ASCII repertoire, and a modern 6 bit encoding would also need to handle the entire Unicode space, we've pretty much stuck with 8 bit bytes for the past 40 years and change.

this isn't to say we couldn't swap to a 6 bit byte again, if we had the hardware and the encodings to allow for it. maybe at some point in the future, the efficiency that such a tight encoding could offer would drive us in that direction, especially if there are other factors making the hardware attractive.

but, for now, we live in a world of 8 bit bytes, and in great part, it's thanks to English having "upper and lower case" letters. 🦋



in reply to @yaodema's post:

thank you for a wonderful history! the 6-bit byte is interesting but i think i would find it personally upsetting to use because it's not a power of two and my brain likes numbers that way (and it makes a lot of binary, base-2 calculations very very easy if the space the number fits into is also a power of two)

I get how it could annoy the pattern-seeking brain bits that got used to eight, yeah. but binary computation really doesn't care if you're aligned to six bits or to eight! for most purposes, six bit aligned numbers would be smaller (and thus faster), but serve much the same purpose. computers don't care about the size of their words being powers of two, though thanks to multiplication, they do prefer that the number of bits is even.

oh, six-bit numbers would be faster? that's interesting to me! i was under the impression that the cycle speed of the ALU likely wouldn't change, as long as the physical number of lines on the bus matches the size of the word. the flow of electrons wouldn't really care whether the muxer had 6 or 8 or 24 pins, was my understanding - i mean, there's gonna be a handful more nand gates, but those would only really increase the thermal load. but if that's not actually how it works, then that is really interesting, and i wonder why we didn't actually shift down to 4 bits and just use two words per character, the way that a bunch of modern UTF-8 characters use two or more "characters" to represent a single glyph or codepoint in some languages today.

smaller numbers of bits mean fewer carry operations in additions, shorter Dadda reduction trees for multiplication (generally), and faster convergence on Goldschmidt division. none of these methods require that bit widths be powers of 2. modern adders use 4-bit carry-lookahead sections, so carries propagate about four times faster than with single bit adders alone. this is why I think of possible future ISAs as being 12 bit aligned (with 24 and 48 bit modes, at least) rather than going all the way down to 6 bits.
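to make the "4-bit sections" part concrete, here's roughly what one carry-lookahead group computes (a python sketch of the logic, not gates, and not tied to any particular adder design):

```python
def cla4(a, b, cin):
    # generate and propagate signals for each of the 4 bit positions
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(4)]
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(4)]
    # every carry is computed straight from g, p, and cin - no rippling bit to bit
    c = [cin,
         g[0] | (p[0] & cin),
         g[1] | (p[1] & g[0]) | (p[1] & p[0] & cin),
         g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & cin)]
    cout = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
            | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & cin))
    s = sum((p[i] ^ c[i]) << i for i in range(4))
    return s, cout

# exhaustive check against plain addition
assert all(cla4(a, b, cin) == ((a + b + cin) & 0xF, (a + b + cin) >> 4)
           for a in range(16) for b in range(16) for cin in (0, 1))
```

wider words just chain more of these groups (or add another level of lookahead), which is where the extra delay creeps in as word sizes grow.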

as long as multiplication and division aren't involved, the size of the number matters less, but once you need those, bigger numbers are always going to be at least somewhat slower. so it's best to use only the word size you actually need for the largest numbers that code section would be working with, unless the architecture is slower for smaller words (like some x86 processors are when you try to do 8 bit ops with them, weirdly)

i love this history!

as it happens now, so many things are built on the advantages of a byte with a power of two bits that switching back to 6 would break soooo much. defining "efficiency" is important: text is one of the least voluminous data types out there, and optimizing it simply isn't a huge problem. if you want to save data you just compress it somewhere along the way, but it's rarely a huge impact to do so; other content like images require so much more data that simply using better compression there usually saves more.

even within the realm of uncompressed text, a hypothetical utf-6 encoding would also be far less efficient than utf-8. an encoding with similar features would require 7 hexets to replace what utf-8 can encode in 4 octets, and many common characters even in latin writing would necessarily require a second hexet, with 5 out of 12 bits being encoding overhead. we could not even fit both upper and lower case in the single-hexet space!
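here's the rough counting behind that, assuming a hypothetical utf-6 that just copies utf-8's prefix scheme (single units start with 0, an n-unit lead starts with n ones and a zero, continuations start with 10 - all invented by analogy, obviously):

```python
def payload_bits(unit_bits, units):
    if units == 1:
        return unit_bits - 1                        # lone unit: one marker bit
    lead = unit_bits - units - 1                    # lead unit: n ones plus a zero
    return lead + (units - 1) * (unit_bits - 2)     # continuations: 2-bit marker each

print(payload_bits(8, 4))   # 21 - utf-8 covers all of unicode in 4 octets
print(payload_bits(6, 2))   # 7  - only 7 of 12 bits are payload in 2 hexets
print(2 ** payload_bits(6, 1), "single-hexet code points vs", 26 + 26, "letters")   # 32 vs 52
# the literal scheme tops out at payload_bits(6, 5) == 16 bits, so the lead
# convention has to be bent (say, an all-ones lead meaning "7 units follow")
# before the 6 x 4 = 24 continuation bits of a 7-hexet sequence can cover 21
```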

for all these reasons and more i think it's likely we're going to be on octets basically forever. they're the smallest power-of-two size that fits a conveniently sized symbol set; absolutely any other worries of inefficiency can be solved by implementing compressed encodings atop these machine-convenient octets. :eggbug-relieved:

glad you like the history bits! though, I don't agree with the sentiment that you have advantages with "a power of 2 bits" that wouldn't be present with a power of 2 times 3 (12, 24, 48). I have no idea where this concept comes from, but it's just not a thing in the actual hardware that calculates. as long as the number of bits is a multiple of 4, typical adder circuits will work without much effort.

also, you don't have to encode text like unicode does; it'd be very inefficient to do so, for this case. but a text encoding that uses 6 bits for most code pages, states at the start what encoding pages it's using and how to swap between them, and uses the low two or four codes as shift codes, could swap for capitals one at a time or for long strings of them, depending on which shift code is used. for most languages, this would be more than enough, and would consistently use less space than unicode, especially considering that anything that doesn't fit in ASCII uses up a minimum of 16 bits per character in Unicode.
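as a sketch of what I mean, here's a toy version with a single one-shot shift code for capitals (the page layout and code assignments are invented for illustration, and the header that declares which pages are in use is left out):

```python
# code 0 is a one-shot shift: "the next character is a capital"
# (a second, locking shift code for long runs of capitals would work the same way)
SHIFT_ONE = 0
PAGE = {c: i + 1 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz .,")}
REV = {v: k for k, v in PAGE.items()}

def encode(text):
    out = []
    for ch in text:
        if ch.isupper():
            out.append(SHIFT_ONE)
        out.append(PAGE[ch.lower()])
    return out                                   # every entry fits in 6 bits

def decode(codes):
    out, shifted = [], False
    for c in codes:
        if c == SHIFT_ONE:
            shifted = True
        else:
            out.append(REV[c].upper() if shifted else REV[c])
            shifted = False
    return "".join(out)

text = "Carolingian minuscule, but Encoded."
codes = encode(text)
print(decode(codes) == text)                                        # True, round-trips fine
print(len(codes) * 6, "bits vs", len(text) * 8, "in ascii/utf-8")   # 222 vs 280
```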

Chinese, Japanese, Korean, and other encodings that use ideogram sets can use a 12 bit encoding, with shift codes to swap between pages. the jouyou ("common use") kanji set consists of only a bit over 2000 characters; 12 bits can encode over 4000 characters, leaving more than enough space for the Japanese encoding to contain the jouyou kanji, hiragana, katakana, and full-width Latin characters, plus punctuation, all in the initial block. if an uncommon character is needed in isolation, it'd take up a total of 24 bits (single-character shift code and encoding point), which is on par with uncommon CJK characters in Unicode.
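rough headcount, to show the budget works out (approximate figures, not an actual allocation):

```python
jouyou_kanji = 2136   # the current joyo list
kana         = 92     # basic hiragana + katakana (small forms and voicing marks add a few dozen more)
fullwidth    = 94     # full-width latin letters, digits, and punctuation
extras       = 300    # generous room for other punctuation, symbols, and the shift codes themselves
print(jouyou_kanji + kana + fullwidth + extras, "of", 2 ** 12, "code points")   # 2622 of 4096
```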

I've been designing a 12 bit ISA, so my brain keeps insisting on answering questions like this, at least enough to let me keep poking at the project. another post that I'll be putting together for this is "48 bits is all you need" (most of the time) which covers why 64 bits is too big for most purposes, outside cryptography, where you need more. I want to demonstrate 12 bit aligned floating point numbers for it, so it might be a bit.

one of the things which works less well for grains without a power-of-two bit count is bit addressing, which comes up more often than it seems. you're right the adders and math don't care at all, but addressing is i feel a real consideration.

i would also say that while they're totally valid and work, codepage based encodings are... well, they probably miss some of the key features that make utf-8 in particular work really well (self-synchronization and lexical sorting, features cool enough that i totally forgive it for being less compact than it could be)

48 bit ints though... that is more compelling generally. i like the idea of it, though in terms of "how much data do we have" that puts a limit at about 281 To (2^48 bytes). in my career i've encountered a relatively small handful of things which desire more:

  • time instants and durations
  • counts of arbitrarily small compute and data units, like bytes or cpu cycles
  • possibly also memory addresses in an ASLR scenario, where having a mostly-unused address space is a feature not a bug

i think that might actually be pretty swag as a default int size, but it doesn't quite have the "one and done, you will never think about this choice again" you get from 64. (i'm a strong advocate for never ever using 32 bit ids in databases: by the time you have to resize you've accrued paltry savings and immense regret)

hmmm. overall...? i think the majority of my leanings towards octets here stem from an underlying philosophy that bytes are for machines, and symbols for humans are a relatively small proportion of all data that can always be encoded into that space as efficiently as desired, often by just using an agnostic abstraction like compression. we westerners were arrogant enough to assume that text is representable in a motley series of increasingly large but still insufficient numbers before we landed on the monster that is unicode today, and i feel that complexity just comes with the territory. let text be text imo

all that aside: the concept of a de novo hexet-based architecture absolutely whips ass and i wanna see it