Software localization
What Is Unicode?
In multilingual applications, text needs to be stored in a form that is compatible with all writing systems. Unicode provides this capability, and for most purposes, the web in particular, UTF-8 is the dominant encoding.
Telegraphic beginnings
In the mid-1800s, American Morse code became a standard method for efficiently transmitting messages over the electrical telegraph. This code, whose signals were either long or short, laid the foundation for information encoding in computers, where bits are represented in binary format as 0 or 1.
ASCII, an early character encoding for computers, covers 128 characters: the basic Latin letters used in English, digits, punctuation, and a set of control codes. It requires only 7 bits, so it doesn’t fully use the 8 bits of the smallest storage unit (“byte”) on computers.
Babylonian confusion
“Extended ASCII” versions use this spare 8th bit, and sometimes additional bytes, to provide character codes for a range of other languages. To decode such text, people had to know exactly which code chart had been used; otherwise, they ended up with garbled text.
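The problem can be reproduced with a short Python sketch. The three code charts chosen here (`latin-1`, `cp1251`, `cp437`) and the sample word are illustrative picks, not ones named in this article; any set of incompatible 8-bit charts would show the same effect.

```python
# A single byte above 127 has no fixed meaning: it depends on the code chart.
raw = bytes([0xE9])

for codec in ("latin-1", "cp1251", "cp437"):
    print(f"{codec:8} -> {raw.decode(codec)}")

# Text written with one chart and read with another comes out garbled.
stored = "caf\u00e9".encode("latin-1")   # "café" saved with a Western European chart
garbled = stored.decode("cp1251")        # opened with a Cyrillic chart
print(garbled)
```

The bytes on disk never change; only the reader's assumption about the chart does, which is why the damage is invisible until someone opens the file.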
Universal character encoding scheme
By the late 1980s, it had become clear that constantly dealing with different versions of extended ASCII was impractical, and people proposed “Unicode”: a universal encoding system that would provide codes for the characters of all the world’s languages without the need for code conversions.
Two organizations established competing standards: the International Organization for Standardization (ISO) and the US-based Unicode Consortium. Within a few years, however, they agreed to a truce and harmonized their standards, so that ISO/IEC 10646 and Unicode match in their core specifications and define identical character sets.
While the ISO standard merely consists of a character map, the Unicode Standard Annexes also offer guidance on a number of linguistic issues pertaining to the internationalization of software, such as line-breaking, text segmentation, and handling of right-to-left scripts like Arabic or Hebrew. Unicode is the standard with which software needs to comply.
Competing Unicode encodings
While people agreed that a universal character set like Unicode was clearly needed, different methods of implementation proliferated. Microsoft Windows, the Java programming language, and JavaScript settled on UTF-16, which is built on 16-bit (double-byte) code units. Most common characters take a single unit; all others are encoded as a pair of units (a “surrogate pair”), which lets UTF-16 cover all 1.1 million code points defined by the Unicode standard.
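As a rough illustration of how UTF-16 reaches code points beyond what a single 16-bit unit can hold, the sketch below counts code units for a few sample characters and computes one surrogate pair by hand. The helper name `utf16_units` and the sample characters are assumptions made for this example.

```python
def utf16_units(ch: str) -> int:
    """Number of 16-bit code units UTF-16 needs for one code point."""
    return len(ch.encode("utf-16-le")) // 2

for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    print(f"U+{ord(ch):04X} -> {utf16_units(ch)} code unit(s)")

# Code points above U+FFFF are split into a high and a low surrogate.
code_point = 0x1F600
offset = code_point - 0x10000
high = 0xD800 + (offset >> 10)     # top 10 bits of the offset
low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits of the offset
print(f"U+{code_point:X} -> {high:04X} {low:04X}")

# Total number of code points in the Unicode code space (U+0000..U+10FFFF).
total_code_points = 0x10FFFF + 1
```

The hand-computed pair can be compared against Python's own encoder with `"\U0001F600".encode("utf-16-be")`.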
However, Unix-like operating systems such as Linux and the web converged on UTF-8. This is a variable-length encoding that uses one to four bytes per character, and the location of characters from different languages reflects the history delineated above. Single-byte sequences are used for ASCII (English), two-byte sequences mainly for other European languages as well as Arabic and Hebrew, and three-byte sequences cover most Chinese, Japanese and Korean characters.
Storage and transmission efficiency explains the popularity of UTF-8 for the web: languages with Latin script still dominate, and HTML markup draws from the same character set. Since UTF-8 is fully compatible with ASCII, most web content requires only single-byte encoding.
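The relationship between a character's position in the Unicode map and its size in UTF-8 can be sketched as follows. The function name `utf8_length` and the sample characters are chosen for this illustration; the range boundaries follow the UTF-8 definition, and the result is checked against Python's built-in encoder.

```python
def utf8_length(code_point: int) -> int:
    """Bytes UTF-8 needs for one code point, derived from its numeric range."""
    if code_point < 0x80:        # ASCII
        return 1
    if code_point < 0x800:       # e.g. Latin with accents, Greek, Cyrillic, Hebrew, Arabic
        return 2
    if code_point < 0x10000:     # rest of the Basic Multilingual Plane, incl. most CJK
        return 3
    return 4                     # supplementary planes

samples = {
    "A": "English",
    "\u00e9": "French",
    "\u05e9": "Hebrew",
    "\u0628": "Arabic",
    "\u4e2d": "Chinese",
    "\u3042": "Japanese",
    "\ud55c": "Korean",
}

for ch, language in samples.items():
    predicted = utf8_length(ord(ch))
    actual = len(ch.encode("utf-8"))
    print(f"{language:9} U+{ord(ch):04X}  predicted={predicted}  actual={actual}")
```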
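A small size comparison makes the efficiency argument concrete. The HTML fragment below is a made-up sample, and the sketch only measures encoded length, not real-world page weight.

```python
# Hypothetical HTML fragment consisting only of ASCII characters.
html = '<p class="note">Hello, world!</p>'

ascii_bytes = html.encode("ascii")
utf8_bytes = html.encode("utf-8")
utf16_bytes = html.encode("utf-16-le")   # without a byte-order mark

print("ASCII :", len(ascii_bytes), "bytes")
print("UTF-8 :", len(utf8_bytes), "bytes")
print("UTF-16:", len(utf16_bytes), "bytes")
print("UTF-8 identical to ASCII:", utf8_bytes == ascii_bytes)
```

For pure ASCII content, the UTF-8 bytes are the same as the ASCII bytes, while UTF-16 needs twice as many.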
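Scripts that occupy “higher” positions in the character map require more storage space. Indic scripts such as Devanagari, for example, need three bytes per code point in UTF-8, and a single visible character is often built from several code points, so it can easily take six bytes or more. Characters outside the Basic Multilingual Plane, such as many emoji and rarer Chinese characters, need four bytes, which is the maximum for a single code point in UTF-8. When Unicode code points are added, user machines need updated fonts to display them.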
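To see how storage grows for such scripts, the following sketch measures a Hindi word and an emoji. The sample word and the variable names are chosen for this example. Note that UTF-8 as currently specified uses at most four bytes per code point; a visible character made of several code points can still take more.

```python
# The Hindi greeting "namaste", written as explicit Devanagari code points.
word = "\u0928\u092e\u0938\u094d\u0924\u0947"
emoji = "\U0001F600"   # a code point outside the Basic Multilingual Plane

for label, text in (("Devanagari word", word), ("Emoji", emoji)):
    code_points = len(text)
    utf8_size = len(text.encode("utf-8"))
    utf16_size = len(text.encode("utf-16-le"))
    print(f"{label}: {code_points} code point(s), "
          f"{utf8_size} bytes in UTF-8, {utf16_size} bytes in UTF-16")

# Largest UTF-8 sequence for any single valid code point.
max_utf8 = len(chr(0x10FFFF).encode("utf-8"))
```

For this word, UTF-8 is larger than UTF-16, which is the trade-off such languages pay for UTF-8's compact handling of ASCII.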
UTF-8 dominance
Today, Unicode is virtually everywhere: since Unicode 6.0, even emoji have been included as a set of characters. The battle between Unicode encodings seems to be decided: 95% of all web pages (up to 100% for some languages) use UTF-8.
Linux-based operating systems, which default to UTF-8, are gaining ground. The majority of servers run directly on Linux. Android is based on the Linux kernel, and macOS, although derived from BSD Unix rather than Linux, likewise uses UTF-8 widely. After decades of resistance, Microsoft is offering Linux on Azure servers and is beginning to integrate Linux into Windows itself. It may therefore be just a matter of time until UTF-16 fades from use.
The last cause for real confusion in practice is the byte-order mark (BOM), which some Windows programs rely on to recognize a file as UTF-8 encoded. Linux tools, however, assume that this invisible character isn’t needed and usually don’t write it. That’s why such Windows programs may interpret files coming from Linux as text in a legacy code page, garbling non-ASCII characters, until a BOM is added. Linguists and localization specialists just need to be aware of this.
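In Python, the BOM can be added or stripped with the standard `utf-8-sig` codec. The sketch below is one possible way to handle files from both worlds; the helper name `read_utf8` and the sample text are assumptions for this example, and `cp1252` stands in for whichever legacy code page a given Windows program might fall back to.

```python
import codecs

text = "na\u00efve caf\u00e9"

without_bom = text.encode("utf-8")       # what Linux tools typically write
with_bom = text.encode("utf-8-sig")      # same bytes, prefixed with EF BB BF

print(with_bom.startswith(codecs.BOM_UTF8))   # True
print(len(with_bom) - len(without_bom))       # 3

def read_utf8(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if one is present."""
    return data.decode("utf-8-sig")

# Both variants decode to the same text.
print(read_utf8(with_bom) == read_utf8(without_bom) == text)

# What a reader sees if BOM-less UTF-8 is mistaken for a legacy code page.
misread = without_bom.decode("cp1252")
print(misread)
```

Decoding with plain `utf-8` instead of `utf-8-sig` keeps the BOM as an invisible leading character (`\ufeff`), which can cause subtle bugs in string comparisons.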
Last updated on September 22, 2022.