Aside from having thousands of overlapping characters, CJK text traditionally does not use whitespace. The result is text that seems very difficult to process algorithmically. A good example of a tool that doesn't work well with CJK languages is the regular expression word boundary: without spaces between words, the very concept of a word boundary is extremely ambiguous.
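To make the word boundary problem concrete, here is a minimal sketch using Python's built-in `re` module. The sample sentences are my own illustration, and the Chinese one is only a rough equivalent of the English one. In Python 3, `\w` matches CJK ideographs, so a run of them with no spaces looks like one long "word" to `\b`.

```python
import re

# \b matches at the edge of a run of \w characters.
WORD = re.compile(r"\b\w+\b")

english = "I like eating apples"
chinese = "我喜欢吃苹果"  # roughly the same sentence, written without spaces

english_tokens = WORD.findall(english)
chinese_tokens = WORD.findall(chinese)

print(english_tokens)  # ['I', 'like', 'eating', 'apples']
print(chinese_tokens)  # ['我喜欢吃苹果'] (the whole sentence is one "word")
```

The regex is not wrong here; it simply has no whitespace or punctuation to anchor on, so it cannot tell where one word ends and the next begins.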
For starters, let's talk about history and how we got to where we are now. Modern CJK languages mix two writing systems: one for phonetic writing, the other for ideographic writing. In Chinese these two writing systems overlap. In Japanese and Korean there are separate character sets for each use case (kana and kanji in Japanese, hangul and hanja in Korean). Phonetic spelling is familiar, so I'm not going to talk much about it: just spell a word how it sounds. However, ideographic writing may seem foreign to someone coming from a Western language background.
An ideograph, or ideogram, is a symbol that represents a meaning more than a pronunciation. A famous example of ideographic writing is hieroglyphs, with their pictographic imagery. These symbols are immediately recognizable as an idea, but no pronunciation comes to mind unless you have studied the language. CJK ideograms are similar in that they represent an idea more than a pronunciation; however, the meaning is not immediately recognizable unless you have also studied the language. Instead, CJK symbols are built from parts: 214 parts, to be specific. These parts are called radicals, and they are everywhere. A single symbol may be composed of more than 10 separate parts.
In 1710 the Kangxi Emperor of the Manchu Qing dynasty ordered the compilation of a Chinese dictionary. It was organized around precisely 214 so-called radicals. Each of these radicals was an extremely common component of other characters, and a single component from each character was taken to form the index of the dictionary. Although these radicals were common components, they were not unambiguous: the same radical could be written differently depending on its position in a symbol. These similar alternatives became known as variants of the indexed radical, which furthered the idea that they were more or less equivalent in usage. In total, the Kangxi Dictionary contained more than 47,000 entries.
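As an aside, the same index of 214 survives in Unicode as the Kangxi Radicals block (U+2F00 through U+2FD5). The sketch below uses only Python's standard `unicodedata` module to count the block and fold each radical onto an ordinary ideograph with NFKC normalization. The variable names are my own, and the forms Unicode folds to will not necessarily match the form another source chooses as the indexed radical.

```python
import unicodedata

# The Kangxi Radicals block: one code point per indexed radical.
KANGXI_BLOCK = range(0x2F00, 0x2FD5 + 1)

radicals = [chr(cp) for cp in KANGXI_BLOCK]

# NFKC replaces each dedicated radical code point with the
# ordinary (unified) ideograph it is compatibility-equivalent to.
unified = [unicodedata.normalize("NFKC", r) for r in radicals]

print(len(radicals))                    # 214
print(unicodedata.name(radicals[0]))    # KANGXI RADICAL ONE
print(unified[0] == "\u4e00")           # True: folds to the ideograph for "one"
```

This matters when comparing text: a radical copied from a dictionary index and the visually identical character typed in running text can be different code points until they are normalized.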
Here are all 214 radicals and their variants:
# | Radical | Variants |
---|---|---|
1 | 一 | 一 |
2 | 丨 | 丨 |
3 | 丶 | 丶 |
4 | 丿 | 丿,乀,乁 |
5 | 乙 | 乙,乚,乛,𠃉,𠃊,𠃋,𠃌,𠃍,𠃎,𠃑 |
6 | 亅 | 亅,𠄌 |
7 | 二 | 二,𠄞,𠄟,𠄠 |
8 | 亠 | 亠 |
9 | 人 | 人,亻,𠆢 |
10 | 儿 | 儿 |
11 | 入 | 入 |
12 | 八 | 八 |
13 | 冂 | 冂 |
14 | 冖 | 冖 |
15 | 冫 | 冫 |
16 | 几 | 几,𠘧,𠘨 |
17 | 凵 | 凵,𠙴 |
18 | 刀 | 刀,刁,刂,𠚣 |
19 | 力 | 力,力 |
20 | 勹 | 勹 |
21 | 匕 | 匕,𠤎 |
22 | 匚 | 匚,𠤬 |
23 | 匸 | 匸 |
24 | 十 | 十 |
25 | 卜 | 卜 |
26 | 卩 | 卩 |
27 | 厂 | 厂 |
28 | 厶 | 厶 |
29 | 又 | 又 |
30 | 口 | 口 |
31 | 囗 | 囗 |
32 | 土 | 土 |
33 | 士 | 士 |
34 | 夂 | 夂,𡕒 |
35 | 夊 | 夊 |
36 | 夕 | 夕 |
37 | 大 | 大,夨 |
38 | 女 | 女,女 |
39 | 子 | 子,孑,孒,孓,𡤼 |
40 | 宀 | 宀 |
41 | 寸 | 寸 |
42 | 小 | 小,𡭔 |
43 | 尢 | 尢,尣,𡯁,𡯂 |
44 | 尸 | 尸,𡰣 |
45 | 屮 | 屮,𡳾 |
46 | 山 | 山 |
47 | 巛 | 巛,川,𡿦 |
48 | 工 | 工 |
49 | 己 | 己,已,巳 |
50 | 巾 | 巾 |
51 | 干 | 干 |
52 | 乡 | 乡,幺 |
53 | 广 | 广 |
54 | 廴 | 廴 |
55 | 廾 | 廾,𢌬 |
56 | 弋 | 弋,𢍺 |
57 | 弓 | 弓,𢎗,𢎘 |
58 | 彐 | 彐,彑 |
59 | 彡 | 彡 |
60 | 彳 | 彳 |
61 | 心 | 心,忄 |
62 | 戈 | 戈 |
63 | 戶 | 戶,户,戸 |
64 | 手 | 手,扌,才 |
65 | 支 | 支 |
66 | 攴 | 攴,攵 |
67 | 文 | 文 |
68 | 斗 | 斗,𣁬 |
69 | 斤 | 斤,𣂑 |
70 | 方 | 方 |
71 | 无 | 无 |
72 | 日 | 日 |
73 | 曰 | 曰 |
74 | 月 | 月 |
75 | 木 | 木,𣎳,𣎴 |
76 | 欠 | 欠 |
77 | 止 | 止,𣥂 |
78 | 歹 | 歹,𣦵,𣦶 |
79 | 殳 | 殳 |
80 | 毋 | 毋,毌,𣫬 |
81 | 比 | 比 |
82 | 毛 | 毛,𣬛 |
83 | 氏 | 氏 |
84 | 气 | 气 |
85 | 水 | 水,氵,𣱱 |
86 | 火 | 火,灬 |
87 | 爪 | 爪,爫,𤓯,𤓰 |
88 | 父 | 父 |
89 | 爻 | 爻 |
90 | 丬 | 丬,爿,𤕪 |
91 | 片 | 片 |
92 | 牙 | 牙 |
93 | 牛 | 牛,牜 |
94 | 犬 | 犬,犭 |
95 | 玄 | 玄 |
96 | 玉 | 玉,玊,𤣩 |
97 | 瓜 | 瓜 |
98 | 瓦 | 瓦 |
99 | 甘 | 甘,𤮺 |
100 | 生 | 生,𤯓 |
101 | 用 | 用,甩 |
102 | 曱 | 曱,田,由,甲,申,甴,𤰒 |
103 | 疋 | 疋,𤴓,𤴔 |
104 | 疒 | 疒 |
105 | 癶 | 癶 |
106 | 白 | 白 |
107 | 皮 | 皮 |
108 | 皿 | 皿 |
109 | 目 | 目 |
110 | 矛 | 矛 |
111 | 矢 | 矢 |
112 | 石 | 石 |
113 | 示 | 示,礻,𥘅 |
114 | 禸 | 禸 |
115 | 禾 | 禾,𥝌 |
116 | 穴 | 穴,𥤢 |
117 | 立 | 立,立 |
118 | 竹 | 竹,𥫗 |
119 | 米 | 米 |
120 | 糸 | 糸,糹,纟 |
121 | 缶 | 缶,𦈢 |
122 | 网 | 网,罒,罓,𦉪,𦉫,𦉭,𦉰 |
123 | 羊 | 羊,𦍋,𦍌 |
124 | 羽 | 羽,羽,𦏲 |
125 | 老 | 老,考,老 |
126 | 而 | 而 |
127 | 耒 | 耒,𦓤 |
128 | 耳 | 耳 |
129 | 聿 | 聿 |
130 | 肉 | 肉 |
131 | 臣 | 臣,𦣝,𦣞 |
132 | 自 | 自,𦣹 |
133 | 至 | 至,𦤳,𦤴 |
134 | 臼 | 臼,𦥑,𦥒,𦥓 |
135 | 舌 | 舌 |
136 | 舛 | 舛 |
137 | 舟 | 舟,𠂨 |
138 | 艮 | 艮 |
139 | 色 | 色 |
140 | 艸 | 艸,艹 |
141 | 虍 | 虍 |
142 | 虫 | 虫 |
143 | 血 | 血 |
144 | 行 | 行,行 |
145 | 衣 | 衣,衤 |
146 | 襾 | 襾,西,覀 |
147 | 見 | 見,见,見 |
148 | 角 | 角,𧢲 |
149 | 言 | 言,訁,讠 |
150 | 谷 | 谷,𧮫 |
151 | 豆 | 豆 |
152 | 豕 | 豕 |
153 | 豸 | 豸 |
154 | 貝 | 貝,贝 |
155 | 赤 | 赤 |
156 | 走 | 走,赱,𧺆 |
157 | 足 | 足,𧾷 |
158 | 身 | 身,𨈏 |
159 | 車 | 車,车,車 |
160 | 辛 | 辛,𨐋 |
161 | 辰 | 辰,辰,𨑃,𨑄 |
162 | 辵 | 辵,辶,𠔇 |
163 | 邑 | 邑,𨙨 |
164 | 酉 | 酉 |
165 | 釆 | 釆 |
166 | 里 | 里,里 |
167 | 金 | 金,釒,钅,金 |
168 | 長 | 長,镸,长,𨱗,𨱘 |
169 | 門 | 門,门,𨳇,𨳈 |
170 | 阜 | 阜,阝,𨸏 |
171 | 隶 | 隶 |
172 | 隹 | 隹 |
173 | 雨 | 雨 |
174 | 靑 | 靑,青 |
175 | 非 | 非 |
176 | 面 | 面,靣,𠚑 |
177 | 革 | 革 |
178 | 韋 | 韋,韦 |
179 | 韭 | 韭 |
180 | 音 | 音 |
181 | 頁 | 頁,页,𩑋 |
182 | 凬 | 凬,風,风 |
183 | 飛 | 飛,𩙱 |
184 | 食 | 食,飠,饣,𠋑,𩙿,𩚀,𩚁,𩚃 |
185 | 首 | 首,𩠐 |
186 | 香 | 香 |
187 | 馬 | 馬,马,𩡧 |
188 | 骨 | 骨 |
189 | 高 | 高,髙 |
190 | 髟 | 髟 |
191 | 鬥 | 鬥 |
192 | 鬯 | 鬯 |
193 | 鬲 | 鬲 |
194 | 鬼 | 鬼 |
195 | 魚 | 魚,鱼,𤋳,𩵋 |
196 | 鳥 | 鳥,鸟 |
197 | 鹵 | 鹵,𠧸 |
198 | 鹿 | 鹿,鹿,𢉖 |
199 | 麥 | 麥,麦 |
200 | 麻 | 麻 |
201 | 黃 | 黃,黄 |
202 | 黍 | 黍 |
203 | 黑 | 黑,黒,𪐗 |
204 | 黹 | 黹 |
205 | 黽 | 黽,黾 |
206 | 鼎 | 鼎,𪔂 |
207 | 鼓 | 鼓,鼔,𡔷 |
208 | 鼠 | 鼠,鼡 |
209 | 鼻 | 鼻 |
210 | 斉 | 斉,齊,齐,𪗄 |
211 | 歯 | 歯,齒,齿,𣦋 |
212 | 龍 | 龍,龙,龍 |
213 | 亀 | 亀,龜,龟,龜,龜,𪚦,𪛉 |
214 | 龠 | 龠 |
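
Since variants are treated as more or less equivalent to their indexed radical, the table can be turned into a simple lookup. What follows is a minimal sketch using five rows copied from the table; `RADICAL_VARIANTS` and `canonical_radical` are names I made up for illustration, and a real version would load all 214 rows.

```python
# A few rows copied from the table: indexed radical -> its variants.
RADICAL_VARIANTS = {
    "心": ["心", "忄"],
    "手": ["手", "扌", "才"],
    "火": ["火", "灬"],
    "犬": ["犬", "犭"],
    "衣": ["衣", "衤"],
}

# Invert the table so that any variant maps back to its indexed radical.
VARIANT_TO_RADICAL = {
    variant: radical
    for radical, variants in RADICAL_VARIANTS.items()
    for variant in variants
}


def canonical_radical(component: str) -> str:
    """Return the indexed radical for a component, or the component itself if unknown."""
    return VARIANT_TO_RADICAL.get(component, component)


print(canonical_radical("忄"))  # 心
print(canonical_radical("扌"))  # 手
```

Falling back to the input for unknown components is a design choice for the sketch: it keeps the function total, at the cost of silently passing through anything that is not in the table.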