In multilingual applications, text needs to be stored in a form that works for every writing system. Unicode provides this capability, and for most purposes, the web in particular, UTF-8 has become the dominant encoding. To understand what Unicode is and why UTF-8 is the default today, it is useful to learn how text handling developed.
In the mid-1800s, American Morse code became a standard method for efficiently transmitting messages over the electrical telegraph. This code built its messages from two kinds of signal, long and short, and so laid the foundation for information encoding in computers, where bits are likewise binary and represented as 0 or 1.
An early encoding format for computers covers 128 characters: the basic Latin letters used in English, digits, punctuation, and a set of control codes. Known as ASCII, it requires only 7 bits and thus does not fully use the 8 bits of the smallest storage unit (“byte”) on computers.
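The 7-bit limit is easy to check. The following is a minimal sketch in Python; the sample string and the helper name `is_ascii` are chosen for illustration only.

```python
# Check that a plain English string stays within the 7-bit ASCII range.
text = "Hello, World! 123"

# ord() gives the numeric code of each character.
code_points = [ord(ch) for ch in text]

# bit_length() tells how many bits are needed to store each code.
widest = max(cp.bit_length() for cp in code_points)
print(widest)

# Encoding as ASCII yields exactly one byte per character.
print(len(text.encode("ascii")) == len(text))


def is_ascii(s: str) -> bool:
    """Return True if every character has a code below 128 (hypothetical helper)."""
    return all(ord(ch) < 128 for ch in s)
```

A string such as "caf\u00e9" fails this check, because the accented letter lies outside the 128 ASCII codes.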
“Extended ASCII” versions use the available 8th bit, and sometimes additional bytes, to provide character codes for a range of other languages. To decode such text, you always have to know exactly which code chart was used; otherwise you end up with garbled text.
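The garbling can be reproduced with any two 8-bit code charts. The sketch below assumes Latin-1 as the chart used for writing and Windows-1251 (Cyrillic) as the chart wrongly used for reading; the sample sentence is arbitrary.

```python
# The same bytes, read with two different 8-bit code charts.
text = "Gr\u00fc\u00dfe aus K\u00f6ln"  # German greeting with non-ASCII letters

# Written with the Latin-1 code chart: one byte per character.
raw = text.encode("latin-1")

# Read back with the wrong chart (Cyrillic): the bytes above 127 turn
# into unrelated letters, while the ASCII part survives.
garbled = raw.decode("cp1251")

# Read back with the right chart: the original text is restored.
restored = raw.decode("latin-1")

print(garbled)
print(restored)
```

Nothing in the bytes themselves says which chart applies, which is why the information had to travel separately from the text.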
Universal character encoding scheme
By the late 1980s, it had become clear that constantly dealing with different versions of extended ASCII was impractical, and people proposed “Unicode”, a universal encoding system that would provide codes for the characters of all the world’s languages without the need for code conversions.
Two organizations established competing standards: the International Organization for Standardization (ISO) and the US-based Unicode Consortium. Within a few years, however, they agreed to a truce and harmonized their standards, so that ISO/IEC 10646 and Unicode match in their core specifications and define identical character sets.
While the ISO standard merely consists of a character map, the Unicode Standard Annexes also offer guidance on a number of linguistic issues pertaining to the internationalization of software, such as line-breaking, text segmentation, and handling of right-to-left scripts like Arabic or Hebrew. Unicode is the standard with which software needs to comply.
Competing Unicode versions
Unix-based operating systems such as Linux, along with the web, converged on UTF-8. This is a variable-length encoding, and the location of characters from different languages reflects the history delineated above. One-byte sequences cover ASCII (English), two-byte sequences mainly cover other European languages as well as Arabic and Hebrew, and three-byte sequences cover most Chinese, Japanese and Korean characters.
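The byte counts per script can be verified directly. This is a small sketch in Python; the sample characters and the helper name `utf8_len` are illustrative choices, not part of any standard.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes a single character occupies in UTF-8 (hypothetical helper)."""
    return len(ch.encode("utf-8"))


# One sample character per script, written as escapes to avoid ambiguity.
samples = {
    "Latin A": "A",                # U+0041
    "Latin e-acute": "\u00e9",     # U+00E9
    "Hebrew alef": "\u05d0",       # U+05D0
    "Arabic beh": "\u0628",        # U+0628
    "Chinese zhong": "\u4e2d",     # U+4E2D
    "Korean han": "\ud55c",        # U+D55C
}

for name, ch in samples.items():
    print(f"{name}: U+{ord(ch):04X} -> {utf8_len(ch)} byte(s)")
```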
Storage and transmission efficiency explains the popularity of UTF-8 for the web: languages written in Latin script still dominate, and HTML markup draws from the same character set. Since UTF-8 is fully compatible with ASCII, most web content requires only a single byte per character.
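Both points, ASCII compatibility and compactness, can be illustrated with a short HTML fragment. The sketch below uses an arbitrary snippet and compares UTF-8 with UTF-16 (little-endian, without a byte-order mark) purely as an example.

```python
html = "<p>Hello, world!</p>"

as_ascii = html.encode("ascii")
as_utf8 = html.encode("utf-8")
as_utf16 = html.encode("utf-16-le")  # UTF-16 without a byte-order mark

# For pure ASCII text, the UTF-8 bytes are identical to the ASCII bytes.
print(as_utf8 == as_ascii)

# UTF-8 needs one byte per character here; UTF-16 needs two.
print(len(as_utf8), len(as_utf16))
```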
The scripts of many Indian languages occupy “higher” positions in the character map than Latin and the other European scripts. Those languages therefore require more storage space: each code point takes 3 bytes in UTF-8, and because a single written syllable often combines several code points, it can take 6 bytes or more. No single code point needs more than 4 bytes in UTF-8. When Unicode code points are added, user machines need updated fonts to display them.
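As a rough illustration of the storage cost, the sketch below measures one Devanagari letter and one Devanagari syllable built from four code points. The sample text is an arbitrary choice, and the emoji is included only to show a 4-byte code point.

```python
# A single Devanagari letter: KA (U+0915).
letter = "\u0915"

# One written syllable made of four code points:
# KA (U+0915) + VIRAMA (U+094D) + SSA (U+0937) + VOWEL SIGN I (U+093F).
syllable = "\u0915\u094d\u0937\u093f"

# An emoji outside the Basic Multilingual Plane (U+1F600).
emoji = "\U0001f600"

for label, s in [("letter", letter), ("syllable", syllable), ("emoji", emoji)]:
    data = s.encode("utf-8")
    print(f"{label}: {len(s)} code point(s), {len(data)} byte(s)")
```

The syllable appears as one unit on screen but is stored as four code points of three bytes each.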
As of 2020, Unicode is everywhere; Unicode 13.0 even adds new emoji to those already in the standard. The battle between versions of the Unicode standard seems to be decided: 95% of all web pages (up to 100% for some languages) use UTF-8.
Operating systems that default to UTF-8 are gaining ground. The majority of servers run directly on Linux. macOS is Unix-based, and Android is built on the Linux kernel. And after decades of resistance, Microsoft is offering Linux on Azure servers and is beginning to integrate Linux into Windows itself. It therefore just seems to be a matter of time until UTF-16 disappears.
The last cause for real confusion in practice is the byte-order mark (BOM), which some Windows applications expect before they accept a file as UTF-8 encoded. Linux tools, however, assume that this invisible character is not needed and generally do not write it. Such Windows applications may therefore misread UTF-8 files coming from Linux, typically interpreting them in a legacy code page, until a BOM is added. Localizers just need to be aware of this.
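In Python, the standard `utf-8-sig` codec handles the BOM in both directions. The following is a minimal sketch; the sample text and the helper name `read_utf8` are illustrative assumptions.

```python
import codecs

text = "Gr\u00fc\u00dfe"

# Plain UTF-8, as Linux tools typically write it.
without_bom = text.encode("utf-8")

# UTF-8 with a leading byte-order mark, as some Windows tools expect.
with_bom = text.encode("utf-8-sig")

# The BOM is the three bytes EF BB BF placed in front of the text.
print(codecs.BOM_UTF8)
print(with_bom == codecs.BOM_UTF8 + without_bom)


def read_utf8(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if one is present (hypothetical helper)."""
    return data.decode("utf-8-sig")


print(read_utf8(with_bom) == read_utf8(without_bom))
```

Reading with `utf-8-sig` accepts files from either platform, while writing with it adds the mark for applications that require one.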