


dog
@dog

Unicode is pretty cool (waits for anyone less obsessive than me to filter out of the post) Unicode is pretty cool, but they've made a bunch of weird decisions over the years, and nothing has ever been weirder than their disastrous choice about how to encode CJK characters.

Take a look at these two characters:
直 直
Really appreciate the different ways they're written and how seeing one of them implies which language it comes from. Wait a minute, those are the same character, aren't they? They sure are! But now look at these next two:
直 直
They're sure different, right? But they're the same bytes, all thanks to the curse of Han unification.

See, back in the day Unicode was built with the assumption that they'd be encoding a maximum of 65,536 characters total (that's the most you can fit in 16 bits). But there are several languages that use Chinese-derived characters, and there are a lot of those. More specifically, between Chinese, Japanese, Korean, and Vietnamese, if you looked at all the different variations of each character, you were looking at over 100,000 characters. That wasn't going to fit, right? You could make Unicode larger, but what if... what if you made those languages smaller?

Enter Han unification: the idea that instead of giving each language's version of a character its own codepoint, you'd define one single unified codepoint for each character, and then define the different ways that codepoint should be rendered depending on the language of the surrounding text. But of course that gives you a new problem: if all you have is a bunch of raw characters, you don't know how those characters should be rendered. If it's traditional Chinese, you render them one way. If it's Japanese, you render them another way. You need information from outside the text itself to figure out how it's supposed to look.

For example, take those two characters above. When I got them to render differently, what I actually did was this:

<div lang="ja">直</div>
<div lang="zh">直</div>

That is, I used the HTML lang attribute to hint what the rendering should look like, so the browser knows whether to use the Japanese or Chinese variant of the character. Without that hint, the browser just picks one - and it might get it wrong!
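
And if you want to see the unification itself from the code side, here's a rough sketch in Python (any language with Unicode strings would show the same thing): no matter which language that character "belongs" to, there's only one codepoint involved.

# 直 is a single unified codepoint, U+76F4, whether it's Japanese or Chinese.
ch = "直"
print(hex(ord(ch)))        # 0x76f4 - one codepoint shared by every language
print(ch.encode("utf-8"))  # b'\xe7\x9b\xb4' - same bytes either way
# Nothing in the string itself says "Japanese" or "Chinese"; that
# information has to come from outside the text (like the lang attribute).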

If you're guessing this might have pissed off a bunch of people, you're completely right. As far as I know, Han unification single-handedly pushed back Unicode adoption in Japan by at least a decade; it didn't really pick up until the iPhone era and international messaging kind of forced the issue. And, best of all: Unicode didn't even end up sticking with the 65,536-character limit. Pretty soon afterward they switched to encoding forms that can handle more than a million codepoints, but Han unification was already finished, the damage was done, and we have to deal with it for the rest of our lives.
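
(For a sense of what "more than a million" means: the expanded code space runs up to U+10FFFF, and characters past the old 65,536 ceiling take two 16-bit units in UTF-16. A quick Python illustration:)

# U+20000 is the first character of CJK Unified Ideographs Extension B,
# well past the old 16-bit ceiling of 65,536.
ch = "\U00020000"
print(hex(ord(ch)))                  # 0x20000
print(ch.encode("utf-16-be").hex())  # d840dc00 - a surrogate pair, two
                                     # 16-bit units encoding one character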



in reply to @dog's post:

The only good thing I will say about Han unification is that it allows me, in my ignorance, to input Chinese characters using the Japanese IME I understand instead of the Chinese IME I don’t understand. But even that is fraught with peril and generally a bad idea, and pretty much every time I do it I think about how ridiculous Han unification is.

I agree completely! Also, just a quick note: the Wikipedia article lacks nuance. Modern Vietnamese is written using an alphabet, not Han characters, so this isn't a huge concern for us (some people do know and use them, and the variants do exist, but it's not an everyday-usage sort of thing). Mentioning this because I think you included Vietnamese based on the article, to be inclusive, which I appreciate, and it's not your fault the article is wrong!
(Unicode properly handling Vietnamese accents and digraphs is a whole different can of worms, which also has a ton of problems for similar reasons.)
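(For the curious, here's a tiny Python sketch of that can of worms: one Vietnamese letter with stacked diacritics can arrive as several different codepoint sequences that all look identical on screen.)

import unicodedata

# Three valid spellings of ế (e with circumflex and acute):
a = "\u1ebf"            # precomposed ế
b = "\u00ea\u0301"      # ê + combining acute accent
c = "e\u0302\u0301"     # e + combining circumflex + combining acute

print(a == b == c)      # False - three different codepoint sequences
# Normalizing collapses them all into the same precomposed form:
print({unicodedata.normalize("NFC", s) for s in (a, b, c)})  # {'ế'}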

Oh yes, sorry - I didn't get into the situation with Korean/Vietnamese, but my understanding is that hanja (Korean) isn't used in day-to-day writing either. So it affects Korean and Vietnamese, but, like you say, absolutely not to the same degree as Chinese and Japanese.

I sort of wish the TRON character set, as used in the TRON operating system (1, 2), had been adopted instead. It was quite a bit more complete, designed to be input-primary rather than display-primary, and IIRC it had both individual languages' scripts and disambiguations for their directly synonymous/derived characters, for Han siblings as well as, e.g., Latin/Greek/Cyrillic - so one input method could be used to enter characters in another language with minimal effort. There was also a TRON keyboard, which facilitated CJK and foreign/Latin-script input with a compact keyset.
The TRON keyboard, via Wikimedia Commons
("TRON Keyboard Unit TK1" キー拡大 by Sszzkk on Japanese Wikipedia, via Wikimedia Commons. CC-BY-SA)

Another strange thing in Unicode is that, if I understand correctly, you can compose a character from its radicals, meaning one kanji character may have multiple byte patterns that mean the same glyph. That's a rather obvious problem for things like search engines, the solution being, of course, to normalize all possible byte patterns to their proper Han glyph. Which also erases information about the original glyph, like which language it was intended to be displayed or input as.

Oh god, yeah. I didn't even consider the mess you'd get by normalizing it down to one glyph, but yeah, absolutely makes sense.

I first encountered the "composing a character from its radicals" thing when reading about a character that didn't have a Unicode codepoint yet, so a composed-from-radicals sequence was its closest approximation.

I think you've gotten two parts of Unicode crossed. A Latin character plus an accent mark as two separate characters, visually combined at runtime by the font renderer, is a real thing, and it does cause normalization issues. Then there are "ideographic description characters," which are symbols reserved for describing the layout of extremely rare or hypothetical characters that don't exist in Unicode; no real-world practical typographic renderer attempts to guess what the final result should be and render it as one character.

Example using a real character I can show you: ⿰木目 means "divided into two boxes left to right, first a tree and then an eye," which describes 相.
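
(Those description characters are ordinary codepoints you can put in a string, by the way; here's a quick Python sketch. How the sequence actually displays will depend on your fonts.)

# U+2FF0 is IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT.
ids = "\u2ff0\u6728\u76ee"  # ⿰ + 木 (tree) + 目 (eye)
print(ids)       # ⿰木目 - renderers show three characters in a row;
                 # they don't combine them into one glyph
print("\u76f8")  # 相 - the actual precomposed character being described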

Total tangent, but I'm just getting reminded of what a pain it is to deal with different filesystems that have different ideas about how paths should be normalized. Having the UTF-8 bytes of a string be different depending on the filesystem's preferences, in an app that may not itself have access to Unicode normalization functions, is a fun time.
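
(As a sketch of the failure mode, in Python, with a hypothetical same_path helper: two filesystems can hand you the "same" filename as different bytes, and a byte-for-byte comparison quietly fails unless you normalize first.)

import unicodedata

# The same visible filename, spelled two different ways:
path_a = "r\u00e9sum\u00e9.txt"    # precomposed é (NFC-style)
path_b = "re\u0301sume\u0301.txt"  # e + combining acute (NFD-style)

print(path_a == path_b)  # False - a naive comparison fails

# Hypothetical helper: compare paths in one agreed-upon normalization form.
def same_path(a, b):
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(same_path(path_a, path_b))  # True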

Unfortunately, in this case I'm not personally using Linux (Apple's filesystems have historically normalized filenames to a decomposed form), and the program I'm working on is cross-platform, so it needs to be able to handle pretty much whatever a random user's filesystem does.

Huh, interesting. Maybe it comes down to you only having a Chinese font or only having a Japanese font? For me, the second character of the second set is different in Firefox or Safari for Mac.

Interesting! Those are both rendering with the Chinese variant, so I guess you're missing a Japanese font. Here's how they're meant to be rendered (Japanese first, Chinese second)

The character 直, in Japanese and traditional Chinese variants.

(BTW, you can use markdown to put images in comments.)

I seem to have Japanese fonts, in that random web pages with Japanese text look correct, with kanji, hiragana, and katakana all present.

Coming in to say I'm also on Linux/Firefox, and I do in fact see the different characters. I have a pretty minimal install too (Debian minimal with a manually installed DE). I'd be surprised if you're missing the Noto fonts for Japanese and I'm not.

I'd assume you're seeing the only font that reports it can render it.

If you're using Ubuntu, you can add support for languages under "Region & Language" in the system settings (have to log out and back in afterward).

The fonts are another aspect of this whole thing, as apps somehow have to pick which font is "best".

I'll be honest, I don't understand the ideograms in question to the degree they're talking about, but there's gotta be someone working on putting them in, somewhere.

We have ancient Egyptian hieroglyphics and, like, the poop emoji after almost two decades, but not shit people actually use (if I'm reading right, anyway).

It's not that we couldn't now; it's just a question of it being a bit too late. We have 30 years of documents written using the shared codepoints we have now, and 30 years of operating systems and software that think of those codepoints as living where they do. As long as we have to keep supporting those documents and that software, adding new, properly non-unified codepoints would just make things more complicated, since we'd have to support them alongside the unified ones.