What Is Unicode?

In multilingual apps, text needs to be stored in a form compatible with all the world's writing systems. Learn how Unicode provides this capability.

In multilingual applications, text needs to be stored in a form compatible with all the world's writing systems. Unicode provides this capability, and for most purposes (the web in particular) UTF-8 is the dominant encoding.

Telegraphic beginnings

In the mid-1800s, American Morse code became a standard method for efficiently transmitting messages over the electrical telegraph. This binary code, in which information "bits" consisted of either long or short signals, laid the foundation for information encoding in computers, where bits are likewise binary, represented as 0 or 1.

Out of this tradition grew an encoding of 128 characters: the basic Latin letters used in English, plus digits and punctuation. Known as ASCII, it requires only 7 bits, so it doesn't fully use the 8 bits of the smallest storage unit on computers, the byte.
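To see what that means in practice, here is a quick sketch in Python (assuming any standard Python 3 interpreter) showing that every ASCII character fits comfortably into 7 bits:

```python
# Every ASCII code point is below 128, i.e. it fits into 7 bits.
for ch in ["A", "z", "7", "!"]:
    print(ch, ord(ch), format(ord(ch), "07b"))

# The 8th bit of a byte is never needed for pure ASCII text.
assert all(ord(c) < 2**7 for c in "Hello, ASCII!")
```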

Babylonian confusion

"Extended ASCII" versions use this spare 8th bit, and sometimes additional bytes, to provide character codes for a range of other languages. To decode such text correctly, you always had to know exactly which code chart was in use; otherwise you ended up with garbled text.
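A small Python sketch illustrates the problem: the very same byte means three different things under three common code pages (codec names as used by Python's standard library):

```python
# One byte, three legacy interpretations:
raw = bytes([0xE9])

print(raw.decode("latin-1"))   # 'é' (ISO 8859-1, Western European)
print(raw.decode("koi8_r"))    # 'И' (KOI8-R, Russian)
print(raw.decode("cp1251"))    # 'й' (Windows-1251, Cyrillic)
```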

Universal character encoding scheme

By the late 1980s, it had become clear that constantly juggling different versions of extended ASCII was impractical, and "Unicode" was proposed: a universal encoding system that would provide codes for the characters of all the world's languages without the need for code conversions.

Two organizations established competing standards: the International Organization for Standardization (ISO) and the US-based Unicode Consortium. Within a few years, however, they agreed to a truce and harmonized their standards, so that ISO/IEC 10646 and Unicode match in their core specifications and define identical character sets.

While the ISO standard merely consists of a character map, the Unicode Standard Annexes also offer guidance on a number of linguistic issues pertaining to the internationalization of software, such as line breaking, text segmentation, and handling of right-to-left scripts like Arabic or Hebrew. In practice, Unicode is therefore the standard with which software needs to comply.
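Some of this guidance is reflected in character properties that programming languages expose directly. As a rough illustration, Python's standard unicodedata module reports each character's bidirectional class, which layout engines use when handling right-to-left scripts:

```python
import unicodedata

# Every Unicode character carries a bidirectional class, among other properties.
for ch in ["A", "א", "ع", "1"]:
    print(ch, unicodedata.name(ch), unicodedata.bidirectional(ch))
# 'L' = left-to-right letter, 'R' = right-to-left (Hebrew),
# 'AL' = Arabic letter, 'EN' = European number
```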

Competing Unicode versions

While everyone agreed that a universal character set like Unicode was clearly needed, different encodings of it proliferated. Microsoft Windows, the Java programming language, and JavaScript settled on UTF-16, which stores each character as one or two 16-bit units. Thanks to these so-called surrogate pairs, it covers the roughly 1.1 million code points defined by the Unicode standard.
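A short Python sketch makes the surrogate-pair mechanism visible (the utf-16-be codec is used here simply to keep the byte order fixed):

```python
# Most characters fit into one 16-bit unit; characters outside the
# Basic Multilingual Plane need a surrogate pair (two units).
for ch in ["A", "中", "😀"]:
    encoded = ch.encode("utf-16-be")        # big-endian, no BOM
    print(ch, f"U+{ord(ch):04X}", len(encoded) // 2, "unit(s)", encoded.hex())
```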

However, Unix-based operating systems such as Linux and the web converged on UTF-8. This is a variable-length encoding, and the number of bytes a character needs reflects the history delineated above: ASCII characters (English) take a single byte, most other European scripts as well as Arabic and Hebrew take two bytes, and most Chinese, Japanese, and Korean characters take three.
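In Python, the byte counts are easy to check:

```python
# UTF-8 byte counts mirror the layout described above.
samples = {"A": "Latin (ASCII)", "é": "Latin with accent", "ا": "Arabic", "中": "Chinese"}
for ch, label in samples.items():
    print(f"{label:18} {ch} -> {len(ch.encode('utf-8'))} byte(s)")
```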

Storage and transmission efficiency explains the popularity of UTF-8 on the web: languages written in Latin script still dominate, HTML markup itself consists of ASCII characters, and since UTF-8 is fully backward compatible with ASCII, most web content requires only single-byte encoding.
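That backward compatibility is easy to verify: an ASCII-only snippet of HTML produces exactly the same bytes whether it is encoded as ASCII or as UTF-8.

```python
# ASCII-only markup is byte-for-byte identical in ASCII and UTF-8.
text = "<p>Hello</p>"
assert text.encode("ascii") == text.encode("utf-8")
print(text.encode("utf-8"))   # b'<p>Hello</p>'
```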

Scripts added to Unicode later, including several Indic scripts, occupy "higher" positions in the character map and therefore require more storage: most Indian-language characters take 3 bytes each in UTF-8, and characters in the supplementary planes take the maximum of 4 bytes. When new code points are added, user machines also need updated fonts to display them.
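The difference shows up directly in the encoded byte counts, for example with a Devanagari letter from the Basic Multilingual Plane and a letter from the historic Brahmi script, which lives in a supplementary plane:

```python
# DEVANAGARI LETTER A (U+0905) vs. BRAHMI LETTER A (U+11005)
for ch in ["अ", "\U00011005"]:
    print(f"U+{ord(ch):04X}", len(ch.encode("utf-8")), "bytes in UTF-8")
```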

UTF-8 dominance

Today, Unicode is literally everywhere: as of Unicode 13.0, even emoji are part of the standard character set. The battle between encodings also seems to be decided: 95% of all web pages (up to 100% for some languages) use UTF-8.

Linux-based operating systems, which default to UTF-8, are also gaining ground: the majority of servers run directly on Linux, Android is built on the Linux kernel, and macOS is likewise Unix-based. After decades of resistance, Microsoft now offers Linux on Azure servers and is beginning to integrate Linux into Windows itself. It therefore seems to be only a matter of time until UTF-16 disappears.

The last cause for real confusion in practice is the byte-order mark (BOM), an invisible character that some Windows applications expect at the start of a file before they treat it as UTF-8. Tools on Linux generally leave it out, so those Windows applications may misinterpret files coming from Linux (falling back to plain ASCII or a legacy code page) until a BOM is added. Linguists and localization specialists just need to be aware of this.
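In Python, the BOM is just three extra bytes at the start of the file, and the standard utf-8-sig codec handles it in both directions; a minimal sketch:

```python
import codecs

print(codecs.BOM_UTF8.hex())                    # 'efbbbf': the UTF-8 BOM

text = "Héllo"
with_bom = text.encode("utf-8-sig")             # prepends the BOM some Windows tools expect
without_bom = text.encode("utf-8")              # no BOM, as is customary on Linux

print(with_bom.startswith(codecs.BOM_UTF8))     # True
print(without_bom.startswith(codecs.BOM_UTF8))  # False

# Decoding with 'utf-8-sig' strips a leading BOM if present,
# so it reads both variants cleanly.
print(with_bom.decode("utf-8-sig") == without_bom.decode("utf-8-sig"))  # True
```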
