Unicode is pretty cool (waits for anyone less obsessive than me to filter out of the post). Unicode is pretty cool, but it has made a bunch of weird decisions over the years, and nothing has ever been weirder than its disastrous choice about how to encode CJK characters.
Take a look at these two characters:
Take a moment to appreciate the different ways they're written, and how seeing one of them tells you which language it comes from. Wait a minute, those are the same character, aren't they? They sure are! But now look at these next two:
They sure look different, right? But they're the same bytes, all thanks to the curse of Han unification.
See, back in the day Unicode was built on the assumption that it would encode a maximum of 65,536 characters total, but several languages use Chinese-derived characters, and there are a lot of those. Between Chinese, Japanese, Korean and Vietnamese, if you counted all the different variations of each character, you were looking at over 100,000 characters. That wasn't going to fit, right? You could make Unicode larger, but what if... what if you made those languages smaller?
Enter Han unification: the idea that instead of giving each language's version of a character its own codepoint, you define one single unified codepoint for each character, then define different ways that codepoint should be rendered depending on the language being displayed. But of course that gives you a new problem: if all you have is a bunch of raw characters, you don't know how they're supposed to be rendered. If the text is traditional Chinese, you render it one way. If it's Japanese, you render it another way. You need information from outside the text itself to figure out how it's supposed to look.
For example take those two characters above. When I got them to render differently, what I actually did was this:
<div lang="ja">直</div>
<div lang="zh">直</div>
That is, I used the HTML lang attribute to hint at which language the text is in, so the browser knows whether to use the Japanese or Chinese variant of the character. Without that hint, the browser just picks one, and it might pick wrong!
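If you want to convince yourself that the markup really is the only thing that differs, here's a quick Python sketch (standard library only) poking at the character from those two divs:

```python
import unicodedata

# The character inside both <div>s above is literally the same codepoint;
# any difference in how it's drawn has to come from metadata like lang="...".
ch = "直"
print(hex(ord(ch)))           # 0x76f4
print(unicodedata.name(ch))   # CJK UNIFIED IDEOGRAPH-76F4
print("直" == "直")            # True: same character, whichever language you meant
```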
If you're guessing this pissed off a bunch of people, you're completely right. As far as I know, Han unification single-handedly pushed back Unicode adoption in Japan by at least a decade; it didn't really take off there until the iPhone era and international messaging more or less forced the issue. And, best of all: Unicode didn't even end up sticking with the 65,536 limit. Soon afterwards it moved to encoding methods that can handle more than a million codepoints, but Han unification was already finished, the damage was done, and we have to deal with it for the rest of our lives.
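For the curious, here's a small sketch of what that expansion looks like in practice: UTF-16 spends two 16-bit units (a surrogate pair) on anything outside the original range, which is exactly where the later CJK extensions ended up living.

```python
# U+20000 is the first character of CJK Extension B, well outside the old
# 16-bit range, so UTF-16 encodes it as a surrogate pair of two 16-bit units.
ch = "\U00020000"                    # CJK UNIFIED IDEOGRAPH-20000
print(hex(ord(ch)))                  # 0x20000
print(ch.encode("utf-16-be").hex())  # d840dc00 -> surrogate pair D840 DC00
print(0x10FFFF + 1)                  # 1114112: the modern codepoint ceiling
```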
Imagine for a moment if Latin, Greek and Cyrillic didn't have their own code blocks; instead, what we'd get is a Unified European Characters block where A, А and Α are the same character, sure... but so are D, Д and Δ, or S, С and Σ (the second, Cyrillic one of course not to be confused with the Latin C, which gets its own codepoint).
If you're wondering whether this would cause massive display issues, leaving huge blocks of text illegible to their intended audiences unless readers knew both A) those other alphabets and B) how the differences are mapped in Unicode specifically, then congrats: you have a glimpse of just how much this sucks.
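Mercifully, that's not the world we live in for European scripts; each lookalike gets its own codepoint, which you can check with another little Python sketch:

```python
import unicodedata

# In real Unicode the European lookalikes are all distinct codepoints...
for ch in "AАΑ":  # Latin A, Cyrillic А, Greek Α
    print(hex(ord(ch)), unicodedata.name(ch))
# 0x41  LATIN CAPITAL LETTER A
# 0x410 CYRILLIC CAPITAL LETTER A
# 0x391 GREEK CAPITAL LETTER ALPHA

# ...while the unified ideograph from the top of the post is just one.
print(unicodedata.name("直"))  # CJK UNIFIED IDEOGRAPH-76F4
```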