
dog
@dog

Unicode is pretty cool (waits for anyone less obsessive than me to filter out of the post). Unicode is pretty cool, but they've made a bunch of weird decisions over the years, and nothing has ever been weirder than their disastrous choice about how to encode CJK characters.

Take a look at these two characters:

Really appreciate the different ways they're written, and how seeing one of them implies which language it comes from. Wait a minute, those are the same character, aren't they? They sure are! But now look at these next two:

They're sure different, right? But it's the same bytes, all thanks to the curse of han unification.
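(If you want to check that for yourself, here's a minimal sketch in Python 3, standard library only - the character below is the same 直 from the examples above.)

# Both "versions" of the character are the single unified codepoint U+76F4,
# so the encoded bytes are identical no matter which language you meant.
c = "直"
print(hex(ord(c)))        # 0x76f4 - one codepoint
print(c.encode("utf-8"))  # b'\xe7\x9b\xb4' - same bytes either way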

See, back in the day Unicode was built with the assumption that it would encode a maximum of 65,536 characters total, but there are several languages that use Chinese-derived characters, and there are a lot of those characters. More specifically, between Chinese, Japanese, Korean, and Vietnamese, if you looked at all the different variations of each character, you were looking at over 100,000 characters. Wasn't going to fit, right? You could make Unicode larger, but what if... what if you made those languages smaller?

Enter han unification: the idea that instead of giving each language's version of a character its own codepoint, you'd define one single unified codepoint for each character and then define the different ways that character should be rendered depending on the language being written. But of course that gives you a new problem: if you just have a bunch of raw characters, you don't know how those characters should be rendered. If it's traditional Chinese, you render them one way. If it's Japanese, you render them another way. You have to have information from outside the text itself to figure out how it's supposed to look.

For example, take those two characters above. When I got them to render differently, what I actually did was this:

<div lang="ja">直</div>
<div lang="zh">直</div>

That is, I used the HTML lang attribute to hint which language each line is in, so the browser knows whether to use the Japanese or the Chinese variant of the character. Without that hint, the browser just picks one - and it might get it wrong!

If you're guessing this might have pissed off a bunch of people, you're completely right. As far as I know, han unification single-handedly pushed back Unicode adoption in Japan by at least a decade, and it didn't really pick up until the iPhone era and international messaging kind of forced the issue. And, best of all: Unicode didn't even end up sticking with the 65,536 limit. Pretty soon afterwards they switched to encoding methods that can handle more than a million characters, but by then han unification was already finished, the damage was done, and we have to deal with it for the rest of our lives.


nortti
@nortti

First thread

@⁠Randy_Au
x_X So I started today THINKING I had a rough understanding of Han Unification in Unicode... and I'm pretty much ending the day being even more confused...

https://en.wikipedia.org/wiki/Han_unification
@⁠henryfhchan
Describing Han Unification as a way to reduce the number of encoded characters to fit in UCS-2 is not quite correct.
@⁠henryfhchan
Although that was the technical goal, the fact is, Hanzi/Kanji have been written somewhat arbitrarily over the past thousand years without affecting understanding.
@⁠henryfhchan
When the East Asian nations decided to build their countries post-WWII, they often settled on different standard forms.
@⁠henryfhchan
But they're still the same character.
So they should be unified, just like the fancy two-story a and g and the single-story a and g aren't separately encoded for the Latin script.
@⁠henryfhchan
The problem was, some territories encoded the variations separately in their national encoding standards. Unicode's goal was to replace all these conflicting legacy encodings, so 1-to-1 compatibility was desired.
@⁠henryfhchan
Hence, some variants got encoded at different codepoints, but some didn't. This is called the "Source Separation Rule"
@⁠henryfhchan
Some people are particularly attached to a specific way of writing a character, especially for personal names / place names (family tradition, etc), so the IVD was born, where using a particular variation selector will yield a particular glyph form, provided the fonts exist.
@⁠henryfhchan
Then to complicate things, there are a bunch of unification errors, past glyph errors, mismatch between standard glyphs in national standards vs representative glyph in Unicode, but those are somewhat self-contained and affect display and not general processing.
@⁠henryfhchan
The fact that some orthographic variants are disunified means that when you are writing a search engine, you need to roll your own normalization algorithm (which might not even be uni-directional due to non-1-to-1 traditional-simplified translations!). NFC/NFD aren't applicable.
@⁠henryfhchan
And to further the damage, some of the orthographic variants aren't even properly documented on the Unihan Database (maintained by Unicode Technical Consortium, which relies on volunteer contribution).
@⁠henryfhchan
And there are no specifications for how IMEs should handle simp/trad, variant forms, different regional forms etc. Gboard for Android uses stroke data based on PRC forms, even for the Taiwan/HK region :facepalm:
@⁠henryfhchan
Combined with some variants being unified and some not, you can end up typing one sequence for one character with a component (let it be named "x"), and typing another sequence for another character containing the same component "x" :facepalm: :facepalm:
@⁠henryfhchan
Since some characters are disunified, if you want a more traditional-looking orthography, not only do you have to change the font, you need to replace some characters. Published books in the past 20 years often have a mixed orthography, because typesetters have no idea how not to!
@⁠henryfhchan
And then, there's the argument about how far the unification model should be extended for historical variants. Let's take a look at the variants of 瞻 submitted for a future CJK Extension H:
Eight glyphs that look like they could be completely separate characters.
@⁠henryfhchan
Now imagine that, but for nearly every character containing 詹 (>50 of them -- of course not all have that many variants). >80,000 Han characters are currently encoded, but if we encode all of these, probably 200,000 isn't enough.
@⁠henryfhchan
We can definitely continue to fill up all 10 remaining planes and fill another 10 if this goes unchecked.
@⁠henryfhchan
It's lucky that each working set is limited to around 5000 characters, and a new set is processed about every 2-3 years.
@⁠henryfhchan
By that rate, we'll fill a whole plane in 30 years.

Second thread

@⁠mjg59
The unicode way is to have combining characters for arbitrary burger topping ordering but to map cheeseburger and veggieburger to the same codepoint and tell people to distinguish by using omnivore or vegetarian fonts
@⁠Kazinsal
Han unification was bizarrely misguided. Also clearly UTF16 was a daft idea, should have been 8 only or maybe 7 and 8
@⁠desmondpalaam
Problem with that: You'd need time travel to have solved it. UTF-16/UCS-2 predates UTF-8 by 3 years in standardization and 8 more in common usage (UCS-2 was the only available transformation format when the NT kernel was being developed circa 1989, notably).
@⁠celestialweasel
Yes indeed, I know it's easy to say with hindsight, but it was readily foreseeable. It's not like people didn't criticise Han unification at the time. No HU means UCS-2 is pointless, so no real point in having UTF-16
@⁠henryfhchan
As a traditional Chinese user, thank God Han Unification happened.
@⁠henryfhchan
Otherwise, the vast majority of traditional Chinese characters would have been dropped had China, Japan and Korea each allocated a single block for their characters.
@⁠henryfhchan
Han Unification was misguided in that it took a simple decompositional approach without properly considering semantics -- more characters were needlessly disunified than accidentally unified.
@⁠henryfhchan
Unnecessary disunifications such as 説說温溫茲兹為爲青靑 affect users in Hong Kong and Taiwan every day. Characters need to be swapped out to ensure the text is orthographically consistent. Getting the preferred glyph was always a font issue before ISO10646
@⁠henryfhchan
If a web publisher prefers the old orthography, they need to swap out all the characters where the old orthography has been disunified. Meanwhile, there's no guarantee that the viewer has the right font, so the text often ends up in a messy mixture of old and new orthographies.
@⁠henryfhchan
It's often portrayed in the Western media (and some nationalistic posts in China) how the Unicode Consortium trumped the demands of the Chinese, Japanese and Korean standardization bodies. The reality is much more complex than that.
@⁠henryfhchan
The leading Chinese linguistic expert in China's standardization efforts, 王寧, advocates that characters as a result of transliteration differences should be unified -- i.e. 兼 and 𥡝 should have been unified.
@⁠henryfhchan
With a total repertoire of 80,000 CJK characters (and growing), many of which solely exist for the sake of archiving a rare form of a common character, the character set is basically unmanageable and nearly impossible to support well.

The IVD was simply born ten years too late.
@⁠JuEeHa
What would your opinion be on unifying all the variant forms, but allowing Unicode variation selectors to be used to select a specific form if you want to encode that detail?

Of course this won't happen, but hypothetically if Han characters were to be added to unicode now
@⁠henryfhchan
That should have happened, and is supposed to happen for newly submitted characters. Meanwhile, a lot of Adobe-Japan glyphs are encoded using variation selectors.
@⁠henryfhchan
It's up to the IRG to decide whether to unify a variant or not. Experts' preferences to unify and encode via IVD are often accepted in one meeting, but rolled back in the next because other national experts reject the decision between meetings.
@⁠henryfhchan
This leads to inconsistencies within one character set, let alone all extensions combined.

On the other hand, meanings and glyphs often exist in a parallel spectrum, which makes it hard to decide which glyphs on that spectrum to encode.
@⁠henryfhchan
I.e. there are thirty glyph variations morphed between Glyph A and B. Glyph A and B are used for exclusive meanings but the morphed forms may have historical uses as either. Which one do you keep and which one do you unify to?
@⁠henryfhchan
Most often, member bodies will argue "let's just encode every permutation we've seen", but anyone who's actively involved in the IRG meetings and is familiar with the etymology knows that's not really a sustainable model. But they have to stick to the national position.
@⁠henryfhchan
Personally, I believe national bodies should set a small subset of normalized glyphs, then unify everything to them. But that involves education policy and is out of scope for the IT experts and linguists sent to meetings.
@⁠meisterluk
Wow, one of the most interesting threads I've read on Twitter so far. Thanks.

@henryfhchan Do I understand you correctly, that HU didn't go "far enough" in your opinion? Otherwise, can you explain "took a simple decompositional approach without properly considering semantics"?
@⁠henryfhchan
Yes. They looked at the variations across the standardized forms of commonly-used characters, then drafted unification rules for them, called Unifiable Component Variations (UCV). But those only covered the most common characters; the devil is in the long tail of archaic forms.
@⁠henryfhchan
When we look at archaic vulgar variants, we see multiple corruptions of both the radical and the phonetic. If we made these corruptions into separate rules, the false positives would be super high. A component decomposition model, like the model described in the Han Unification...
@⁠henryfhchan
... section of The Unicode Standard, doesn't work for the long tail of variants -- setting out blanket rules leads to too many characters being unified (what people complain about), and a lack of rules leaves too many disunified (what actually is the status quo).
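(One concrete way to see the NFC/NFD point from the first thread: here's a minimal Python 3 sketch using the disunified pair 為/爲 mentioned above - no standard normalization form maps them together, so search code has to bring its own variant table.)

import unicodedata

a, b = "為", "爲"   # U+70BA and U+7232: a disunified variant pair
print(hex(ord(a)), hex(ord(b)))
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, unicodedata.normalize(form, a) == unicodedata.normalize(form, b))
# Every line prints False: Unicode normalization never treats these as the same
# character, so matching them is left to application-level mapping tables.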


in reply to @dog's post:

The only good thing I will say about Han unification is that it allows me, in my ignorance, to input Chinese characters using the Japanese IME I understand instead of the Chinese IME I don’t understand. But even that is fraught with peril and generally a bad idea, and pretty much every time I do it I think about how ridiculous Han unification is.

I agree completely. Also, just a quick note: the Wikipedia article lacks nuance; modern Vietnamese is written using an alphabet, not Han characters, so this is not a huge concern for us (some people do know and use them, and the variants do exist, but it's not an everyday-usage sort of thing). I mention this because I think you included Vietnamese based on the article to be inclusive, which I appreciate, and it's not your fault the article is wrong!
(Unicode properly understanding Vietnamese accents and digraphs is a whole different can of worms, which also has a ton of problems for similar reasons.)

Oh yes, sorry - I didn't get into the situation with Korean/Vietnamese, but my understanding is that hanja (Korean) isn't used in day-to-day writing either. So it affects Korean/Vietnamese, but like you say, absolutely not to the same degree as Chinese and Japanese.

I sort of wish the TRON character set, as used in the TRON operating system (1, 2), had been adopted instead. It was quite a bit more complete, designed to be input-primary rather than display-primary, and iirc it had both the individual languages' scripts and disambiguations for their directly synonymous/derived characters - for Han siblings as well as e.g. Latin/Greek/Cyrillic - so one input method could be used to enter characters in another language with minimal effort. There was also a TRON keyboard, which facilitated CJK and foreign/Latin script input with a compact keyset.
The TRON keyboard, via Wikimedia Commons
("TRON Keyboard Unit TK1" キー拡大 by Sszzkk on Japanese Wikipedia, via Wikimedia Commons. CC-BY-SA)

Another strange thing in Unicode is that, if I understand correctly, you can compose a character from its radicals, meaning one Kanji character may have multiple byte patterns that mean the same glyph, which is a rather obvious problem for things like search engines; the solution being, of course, to normalize all possible byte patterns to their proper Han glyph. Which also erases information about the original glyph, like which language it was intended to be displayed/input as.

Oh god, yeah. I didn't even consider the mess you'd get by normalizing it down to one glyph, but yeah, absolutely makes sense.

I first encountered the "composing a character from its radicals" thing when reading about a character that didn't have a Unicode codepoint yet, so that was its closest approximation.

I think you've gotten two parts of Unicode crossed: a Latin character plus an accent mark as two separate characters that are visually combined at runtime by the font renderer is a real thing, and it does cause normalization issues. Then there are "ideographic description characters", which are symbols reserved for describing the layout of extremely rare or hypothetical characters that do not exist in Unicode - but no real-world practical typographic renderer attempts to guess what the final result should be and render it as one character.

Example using a real character I can show you: ⿰木目 means "divided into two boxes left to right, first a tree and then an eye", which describes 相.
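(For anyone curious, a quick Python 3 sketch of that: the description sequence really is three separate codepoints, and none of the standard normalization forms will "assemble" it into 相.)

import unicodedata

ids = "⿰木目"   # U+2FF0 + U+6728 + U+76EE: an ideographic description sequence
real = "相"      # U+76F8: the actual encoded character
print([hex(ord(ch)) for ch in ids])
print(unicodedata.normalize("NFC", ids) == ids)    # True: the sequence stays as-is
print(unicodedata.normalize("NFC", ids) == real)   # False: normalization won't compose it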

Total tangent, but I'm just getting reminded of what a pain it is to deal with different filesystems that have different ideas about how paths should be normalized. Having the UTF-8 bytes of a string be different depending on the filesystem's preferences, in an app that may not itself have access to unicode normalization functions, is a fun time.

Unfortunately, in this case I'm not personally using Linux (APFS normalizes to decomposed), and the program I'm working on is cross-platform so it needs to be able to handle pretty much whatever a random user's filesystem does.
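(A tiny Python 3 illustration of that pain, with a made-up filename: NFC and NFD give different bytes for the "same" name, so cross-platform path comparison has to normalize first. The same_name helper here is just a hypothetical sketch.)

import unicodedata

nfc = unicodedata.normalize("NFC", "café.txt")  # é as one codepoint, U+00E9
nfd = unicodedata.normalize("NFD", "café.txt")  # 'e' plus combining acute, U+0065 U+0301
print(nfc == nfd)              # False
print(nfc.encode("utf-8"))     # b'caf\xc3\xa9.txt'
print(nfd.encode("utf-8"))     # b'cafe\xcc\x81.txt'

# Hypothetical helper: compare names in a normalization-insensitive way.
def same_name(a, b):
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(same_name(nfc, nfd))     # True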

Huh, interesting. Maybe it comes down to you only having a Chinese font or only having a Japanese font? For me, the second character of the second set is different in Firefox or Safari for Mac.

Interesting! Those are both rendering with the Chinese variant, so I guess you're missing a Japanese font. Here's how they're meant to be rendered (Japanese first, Chinese second)

The character 直, in Japanese and traditional Chinese variants.

(BTW, you can use markdown to put images in comments.)

I seem to have Japanese fonts, in that random web pages with Japanese text look correct, with Kanji, Hiragana, and Katakana all present.

Coming in to say I'm also on Linux/Firefox, and I do in fact see the different characters. I have a pretty minimal install too (Debian minimal with a manually installed DE). I would be surprised if you're missing the Noto fonts for Japanese and I'm not.

I'd assume you're seeing the only font that reports it can render it.

If you're using Ubuntu, you can add support for languages under "Region & Language" in the system settings (have to log out and back in afterward).

The fonts are another aspect of this whole thing, as apps somehow have to pick which font is "best".

I'll be honest, I don't understand the ideograms to the degree that they're talking about, but there's gotta be someone working on putting them in, somewhere.

We have ancient Egyptian hieroglyphics and like, the poop emoji after almost 2 decades but not shit people actually use (if I'm reading right anyway)

It's not that we couldn't do it now, it's just a question of it being a bit too late. We have 30 years of documents written using the shared codepoints that we have now, and 30 years of operating systems and software that think of those codepoints as existing in the places they do. As long as we have to support the documents and software that we already do, adding new, properly non-unified codepoints would just make things more complicated, since we'd have to support them alongside the unified ones.