
yaodema
@yaodema

and why they might be better off as 6 bits

wait I thought a byte was always 8 bits?

nope! a byte is whatever size a "character" is, at its smallest. on modern systems, this is pretty much always 8 bits, but for a good while, the most common byte size was instead 6 bits. I'll get into why, and why we swapped, as this post goes on.

so we used to use 6 bit bytes?

yes, in the earliest binary computers, and in many later designs until around the mid 1970s, that was the norm. the reason we swapped is partly the fault of the American Standards Association, but also partly the fault of Johannes Gutenberg, and the fault of some monks in the late 700s choosing a new, faster way to write Latin.


queerinmech
@queerinmech

another funny thing about this is that many home computers in the 1980s did not actually use proper ASCII despite it having been a standard for almost two decades by then

this is (partially) due to the Signetics 2513 "character generator" chip from the 1970s, used for text mode by many home computers, which only supported 64 characters - all uppercase!

the 2513 chip was used by early Apple computers

the ZX80 and ZX81 likewise only supported uppercase, but this time due to implementing their own bespoke character set in ROM where every character took space away from the system software!

(also - most hardware used in the 1980s was based on chips from the 1970s which were being misused or cheaply manufactured descendants thereof, so ASCII was still relatively new when those chips were designed)



in reply to @yaodema's post:

thank you for a wonderful history! the 6-bit byte is interesting but i think i would find it personally upsetting to use because it's not a power of two and my brain likes numbers that way (and it makes a lot of binary, base-2 calculations very very easy if the space the number fits into is also a power of two)

I get how it could annoy the pattern-seeking brain bits that got used to eight, yeah. but binary computation really doesn't care if you're aligned to six bits or to eight! for most purposes, six bit aligned numbers would be smaller (and thus faster), but serve much the same purpose. computers don't care about the size of their words being powers of two, though thanks to multiplication, they do prefer that the number of bits is even.

oh, six-bit numbers would be faster? that's interesting to me! i was under the impression that the cycle speed of the ALU likely wouldn't change, as long as the physical number of lines on the bus matches the size of the word. the flow of electrons wouldn't really care whether the muxer had 6 or 8 or 24 pins, was my understanding - i mean, there's gonna be a handful more nand gates, but those would only really increase the thermal load. if that's not actually how it works, then that is really interesting, and i wonder why we didn't actually shift down to 4 bits and just use two words per character, the way a bunch of modern UTF-8 characters use two or more code units to represent a single glyph or codepoint in some languages today.

smaller numbers of bits mean fewer carry operations in additions, shorter Dadda reduction trees for multiplication (generally), and faster convergence in Goldschmidt division. none of these methods require that bit widths be powers of 2. modern adders use 4-bit carry-lookahead sections, so carries propagate about four times faster than with single-bit adders alone. this is why I think of possible future ISAs being 12-bit aligned (with 24- and 48-bit modes, at least) rather than going all the way down to 6 bits.
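(for anyone curious what that 4-bit lookahead trick actually looks like, here's a tiny python sketch of it. the variable names and the little test are mine, purely for illustration - real hardware flattens the carry recurrence into a couple of gate levels rather than a loop.)

```
# generate/propagate-based 4-bit carry lookahead, sketched in python.
# hardware computes all four carries as flat boolean expressions; the loop
# below just walks the same recurrence so it's easy to read.

def cla_add_4bit(a, b, carry_in=0):
    """add two 4-bit values with carry-lookahead logic; returns (sum, carry_out)."""
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(4)]   # generate: both bits set
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(4)]   # propagate: exactly one bit set
    c = [carry_in]
    for i in range(4):
        # carry into bit i+1: generated at bit i, or propagated through it
        c.append(g[i] | (p[i] & c[i]))
    s = sum((p[i] ^ c[i]) << i for i in range(4))
    return s, c[4]

# chaining these 4-bit groups builds adders of any width that's a multiple
# of four - 12, 24, or 48 bits work exactly as well as 16 or 64.
assert cla_add_4bit(0b1011, 0b0110) == (0b0001, 1)   # 11 + 6 = 17
```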

as long as multiplication and division aren't involved, the size of the number matters less, but once you need those, bigger numbers are always going to be at least somewhat slower. so it's best to use only the word size you actually need for the largest numbers that code section will be working with, unless the architecture is slower for smaller words (like some x86 processors are when you try to do 8-bit ops on them, weirdly)

i love this history!

as it happens now, so many things are built on the advantages of a byte with a power-of-two number of bits that switching back to 6 would break soooo much. defining "efficiency" is important here: text is one of the least voluminous data types out there, and optimizing it simply isn't a huge problem. if you want to save space you can just compress it somewhere along the way, but even that rarely makes a huge difference; other content like images requires so much more data that better compression there usually saves more.

even within the realm of uncompressed text, a hypothetical utf-6 encoding would also be far less efficient than utf-8. an encoding with similar features would require 7 hexets to replace what utf-8 can encode in 4 octets, and many common characters even in latin writing would necessarily require a second hexet, with 5 out of 12 bits being encoding overhead. we could not even fit both upper and lower case in the single-hexet space!
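(back-of-the-envelope version of that claim, assuming a utf-8-style scheme where continuation units spend two bits on a marker and the lead unit is mostly length prefix - the exact lead layout for long hexet sequences is hand-waved here, this is just the bit budget:)

```
# payload bits available in a utf-8-style multi-unit sequence, for 8-bit
# units (octets) versus hypothetical 6-bit units (hexets).

def payload_bits(unit_bits, seq_len):
    if seq_len == 1:
        return unit_bits - 1                    # single unit: one marker bit, rest payload
    lead = max(unit_bits - seq_len - 1, 0)      # lead unit: length prefix eats most of it
    cont = (unit_bits - 2) * (seq_len - 1)      # continuation units: 2-bit marker each
    return lead + cont

print(payload_bits(8, 4))   # 21 -> four octets cover every unicode codepoint
print(payload_bits(6, 1))   # 5  -> only 32 single-hexet characters, no room for both cases
print(payload_bits(6, 2))   # 7  -> 5 of the 12 bits are overhead
print(payload_bits(6, 7))   # 24 -> takes about 7 hexets to match the 21 bits of 4 octets
```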

for all these reasons and more i think it's likely we're going to be on octets basically forever. they're the smallest power-of-two width that fits a conveniently sized symbol set; absolutely any other worries about inefficiency can be solved by implementing compressed encodings atop these machine-convenient octets. :eggbug-relieved:

glad you like the history bits! though, I don't agree with the sentiment that you have advantages with "a power of 2 bits" that wouldn't be present with a power of 2 times 3 (12, 24, 48). I have no idea where this concept comes from, but it's just not a thing in the actual hardware that calculates. as long as the number of bits is a multiple of 4, typical adder circuits will work without much effort.

also, you don't have to encode text the way unicode does; that would be very inefficient for this case. but a text encoding that uses 6 bits for most code pages, states at the start which encoding pages it's using and how to swap between them, and uses the low two or four codes as shift codes, could swap for capitals one at a time or for long strings of them, depending on which shift code is used. for most languages, this would be more than enough, and would consistently use less space than unicode, especially considering that anything that doesn't fit in ASCII takes a minimum of 16 bits per character in UTF-8.
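here's a toy sketch of what I mean, in python - the page contents, the particular code values, and the single-shift convention are all made up on the spot, just to show the shape of it:

```
# toy 6-bit page with the low codes reserved for shifts. code values and
# page contents here are invented for illustration only.

SHIFT_ONE = 0   # the next code comes from the shifted (uppercase) page
# (a locking shift for long runs of capitals would use another low code;
#  it's left out of this tiny sketch.)

BASE  = " abcdefghijklmnopqrstuvwxyz.,!?'-"   # made-up base page, codes 4..36
UPPER = " ABCDEFGHIJKLMNOPQRSTUVWXYZ.,!?'-"   # the shifted page

def encode(text):
    """return a list of 6-bit code values (ints in 0..63)."""
    codes = []
    for ch in text:
        if ch in BASE:
            codes.append(4 + BASE.index(ch))
        elif ch in UPPER:
            codes.append(SHIFT_ONE)            # one isolated capital costs one extra hexet
            codes.append(4 + UPPER.index(ch))
        else:
            raise ValueError(f"not in this toy page: {ch!r}")
    return codes

def decode(codes):
    out, shifted = [], False
    for c in codes:
        if c == SHIFT_ONE:
            shifted = True
        else:
            out.append((UPPER if shifted else BASE)[c - 4])
            shifted = False
    return "".join(out)

msg = "Hello, world! It works."
codes = encode(msg)
assert decode(codes) == msg
print(len(codes) * 6, "bits as hexets vs", len(msg.encode("utf-8")) * 8, "bits as utf-8")
```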

Chinese, Japanese, Korean, and other languages that use ideogram sets can use a 12-bit encoding, with shift codes to swap between pages. the jouyou ("common use") kanji set consists of only a bit over 2000 characters; 12 bits can encode over 4000, leaving more than enough space for the Japanese page to contain the jouyou kanji, hiragana, katakana, and full-width Latin characters, plus punctuation, all in the initial block. if an uncommon character is needed in isolation, it'd take up a total of 24 bits (a single-character shift code plus the encoding point), which is on par with uncommon CJK characters in UTF-8.
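quick headcount to show that 12-bit block has room to spare (the counts here are rough and the page layout is hypothetical):

```
# rough headcount for a hypothetical 12-bit japanese page
jouyou_kanji    = 2136     # the current "common use" kanji list
kana            = 2 * 96   # hiragana + katakana, generously rounded up
latin_fullwidth = 95       # printable ascii range as full-width forms
shift_codes     = 16       # page-swap and single-shift controls
used = jouyou_kanji + kana + latin_fullwidth + shift_codes
print(used, "of", 2**12, "codes used,", 2**12 - used, "left over")   # 2439 of 4096, 1657 left
```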

I've been designing a 12 bit ISA, so my brain keeps insisting on answering questions like this, at least enough to let me keep poking at the project. another post that I'll be putting together for this is "48 bits is all you need" (most of the time) which covers why 64 bits is too big for most purposes, outside cryptography, where you need more. I want to demonstrate 12 bit aligned floating point numbers for it, so it might be a bit.

one of the things that works less well for grains without a power-of-two bit count is bit addressing, which comes up more often than you'd think. you're right that the adders and math don't care at all, but addressing is, i feel, a real consideration.
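(concretely, what i mean: splitting a flat bit address into a grain index plus offset is a shift and a mask for octets, but a genuine divide for hexets. no particular architecture assumed here:)

```
def locate_octet(bit_addr):
    return bit_addr >> 3, bit_addr & 0b111   # divide by 8 comes for free

def locate_hexet(bit_addr):
    return bit_addr // 6, bit_addr % 6       # real division (or multiply-by-reciprocal tricks)

print(locate_octet(1000))   # (125, 0)
print(locate_hexet(1000))   # (166, 4)
```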

i would also say that while they're totally valid and work, codepage-based encodings are... well, they probably miss some of the key features that make utf-8 in particular work really well (self-synchronization and lexical sorting, features cool enough that i totally forgive it for being less compact than it could be)

48-bit ints though... that is more compelling generally. i like the idea of it, though in terms of "how much data do we have" that puts a limit at around 281 To (see the quick check after the list). in my career i've encountered a relatively small handful of things which desire more:

  • time instants and durations
  • counts of arbitrarily small compute and data units, like bytes or cpu cycles
  • possibly also memory addresses in an ASLR scenario, where having a mostly-unused address space is a feature not a bug
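
(the quick check i mentioned - counting bytes with one 48-bit unsigned int tops out around 281 teraoctets:)

```
print(2**48)          # 281474976710656
print(2**48 / 1e12)   # ~281.5 To if you're counting bytes with a 48-bit int
```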

i think that might actually be pretty swag as a default int size, but it doesn't quite have the "one and done, you will never think about this choice again" you get from 64. (i'm a strong advocate for never ever using 32 bit ids in databases: by the time you have to resize you've accrued paltry savings and immense regret)

hmmm. overall...? i think the majority of my leanings towards octets here stem from an underlying philosophy that bytes are for machines, and symbols for humans are a relatively small proportion of all data that can always be encoded into that space as efficiently as desired, often by just using an agnostic abstraction like compression. we westerners were arrogant enough to assume that text is representable in a motley series of increasingly large but still insufficient numbers before we landed on the monster that is unicode today, and i feel that complexity just comes with the territory. let text be text imo

all that aside: the concept of a de novo hexet-based architecture absolutely whips ass and i wanna see it