The Missing Guide to the ICU Message Format

If you've worked on a project with i18n/l10n, you may have used the ICU message format without knowing it. But what is ICU and how is it related to the ICU message format? We answer these questions here, and lots more.

The ICU message format syntax is used by a significant number of i18n libraries and solutions. You may have used the format yourself.

A basic message and a plural message in ICU message format
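
As an illustration, a translation file in this format might contain messages like the following. The message IDs and wording here are hypothetical, and the file is shown as a JavaScript object for convenience; in practice such messages usually live in YAML or JSON files.

```javascript
// Hypothetical English messages. Each value is an ICU message string.
const messages = {
  // A basic message: plain text with no arguments.
  welcome: "Welcome to our app!",
  // A plural message: the form is chosen from the value of `count`,
  // and `#` is replaced by the formatted count.
  inbox: "You have {count, plural, one {# new message} other {# new messages}}.",
};

console.log(Object.keys(messages)); // the message IDs: welcome, inbox
```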

The syntax is intuitive. If you have any familiarity with i18n/l10n, you can probably tell what’s going on in the translation file above.

While the ICU message syntax can be intuitive, there are a few things about the message format that can be a bit confusing. Different i18n libraries implement different subsets of ICU. And of course, what ICU actually is can often be a source of confusion itself. We mean to clear some of these mysteries up in this article, as well as cover the practical usage of ICU messages when internationalizing and localizing our apps.


What is ICU in General?

According to the official documentation, ICU stands for International Components for Unicode: a set of portable libraries that are meant to make working with i18n easier for Java and C/C++ developers.

🗒 Note » Since ICU’s inception, the libraries’ implementations have expanded beyond Java and C/C++, and can now be found in other languages. See Which i18n Libraries are Using the ICU Message Format? for more info.

ICU libraries cover a great deal more than translation messages. As you can probably infer from the name, ICU is closely tied to the Unicode international character encoding standard. The ICU library suite provides utilities for working with Unicode in Java and C/C++. It also provides functionality for i18n.

The Official ICU Libraries

The following is a summarized overview of some of the different modules that make up the ICU library suite.

  • Unicode Strings: Provides macros and utilities for working with Unicode strings.
  • Conversion: Handles conversion between Unicode and non-Unicode character encodings.
  • Locale: Deals with the i18n concept of a locale (a language along with optional country and script variant) as well as information relevant to that locale, such as its calendar, currency, etc. Also deals with fallback logic when a locale is not supported.
  • Resources: Handles resource bundles, which are effectively translation message files (e.g. the es_MX, or Mexican Spanish, resource bundle), and the retrieval of these bundles’ contents.
  • Date/Time Services: Handles the representation of time zones and provides logic to work with various kinds of calendars.
  • Formatting: Deals with the display of text, particularly when internationalizing, focusing on displaying numbers, dates, times, and messages (translated strings). This module, of course, describes the ICU message format, and is of particular interest to us.

🔗 Resource » We’re just presenting some of the ICU modules here. Check out the ICU User Guide for a comprehensive look at everything ICU has to offer.

The ICU Message Format

ICU itself is a general set of libraries for Unicode and i18n. One of these libraries/modules deals with i18n text formatting, and it provides the ICU message format syntax. The ICU message format is powerful and flexible enough to have outgrown its Java and C/C++ origins, and it has been ported to several other languages and platforms. The next section, Which i18n Libraries are Using the ICU Message Format?, lists some of these ported implementations.

We saw the ICU message format syntax earlier. It allows for basic messages, interpolation, and general selection based on value. The format also ties into the given library’s date, time, and number formatting functions.

Here’s the example from the ICU documentation, put in a YAML file for some realistic context:

Let’s say we’re using some JavaScript implementation of ICU message formatting that provides a function called format(). We may be able to display the message above using a call like the following.

The output of the above would be "Maria invites Tamer and one other person to her party."

If we wanted to translate the above message to Spanish, we would provide a parameterized message, perhaps a bit like the following.

The same format() call above, given the Spanish translation message, would output "Maria invita a Tamer y a otra persona a su fiesta."
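
The whole exchange can be sketched end to end. The English message below is an abbreviated version of the party-invitation example in the ICU documentation, and the Spanish message is an illustrative translation. The format() function and its helpers are hypothetical stand-ins written for this guide, not the API of any particular library; they handle only plain arguments, select, and plural with exact matches, offset, and #.

```javascript
// Find the index of the "}" that closes the "{" at position `open`.
function matchingBrace(text, open) {
  let depth = 0;
  for (let i = open; i < text.length; i++) {
    if (text[i] === "{") depth++;
    else if (text[i] === "}" && --depth === 0) return i;
  }
  throw new Error("Unbalanced braces in message");
}

// Split "key {text} key {text} ..." into a map of key -> text.
function parseForms(source) {
  const forms = Object.create(null);
  let pos = 0;
  for (;;) {
    const open = source.indexOf("{", pos);
    if (open === -1) return forms;
    const close = matchingBrace(source, open);
    forms[source.slice(pos, open).trim()] = source.slice(open + 1, close);
    pos = close + 1;
  }
}

// Hypothetical format(): walks the message and expands each {...} argument.
// `hash` is the number that "#" stands for inside a plural form.
function format(message, args, locale = "en", hash = null) {
  let out = "";
  for (let i = 0; i < message.length; i++) {
    if (message[i] === "#" && hash !== null) {
      out += new Intl.NumberFormat(locale).format(hash);
    } else if (message[i] === "{") {
      const close = matchingBrace(message, i);
      out += formatArgument(message.slice(i + 1, close), args, locale, hash);
      i = close; // the loop's i++ then steps past the closing brace
    } else {
      out += message[i];
    }
  }
  return out;
}

function formatArgument(body, args, locale, hash) {
  const [name, type, ...tail] = body.split(",");
  const value = args[name.trim()];
  if (type === undefined) return String(value); // plain {arg}
  let source = tail.join(",");
  if (type.trim() === "select") {
    const forms = parseForms(source);
    return format(forms[value] ?? forms.other, args, locale, hash);
  }
  if (type.trim() === "plural") {
    let offset = 0;
    const match = source.match(/^\s*offset:(\d+)/);
    if (match) {
      offset = Number(match[1]);
      source = source.slice(match[0].length);
    }
    const forms = parseForms(source);
    // Exact forms (=N) match the plain value; categories use value - offset.
    const category = new Intl.PluralRules(locale).select(value - offset);
    const form = forms[`=${value}`] ?? forms[category] ?? forms.other;
    return format(form, args, locale, value - offset);
  }
  throw new Error(`Unsupported argument type: ${type.trim()}`);
}

// Abbreviated English message, after the example in the ICU documentation.
const en = `{gender_of_host, select,
  female {{num_guests, plural, offset:1
    =0 {{host} does not give a party.}
    =1 {{host} invites {guest} to her party.}
    =2 {{host} invites {guest} and one other person to her party.}
    other {{host} invites {guest} and # other people to her party.}}}
  other {{num_guests, plural, offset:1
    =0 {{host} does not give a party.}
    =1 {{host} invites {guest} to their party.}
    =2 {{host} invites {guest} and one other person to their party.}
    other {{host} invites {guest} and # other people to their party.}}}}`;

// Illustrative Spanish translation; "su" needs no gender selection.
const es = `{num_guests, plural, offset:1
  =0 {{host} no da ninguna fiesta.}
  =1 {{host} invita a {guest} a su fiesta.}
  =2 {{host} invita a {guest} y a otra persona a su fiesta.}
  other {{host} invita a {guest} y a otras # personas a su fiesta.}}`;

const args = { gender_of_host: "female", host: "Maria", guest: "Tamer", num_guests: 2 };
console.log(format(en, args, "en"));
// Maria invites Tamer and one other person to her party.
console.log(format(es, args, "es"));
// Maria invita a Tamer y a otra persona a su fiesta.
```

Note that the calling code is identical for both locales: only the message string changes, which is exactly the contract between programmer and translator.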

This allows translators a great deal of flexibility when working with messages, and separates concerns between translators and programmers. A programmer doesn’t need to worry about the nuances of a language to develop her software. She just puts one function call in her code, and depends on the translator to handle the linguistic minutiae. The only thing the programmer and the translator need to know is the contract of the message, i.e. its ID and parameters.

In later sections we’ll dive deeper into the different formatting options the syntax provides.

Which i18n Libraries are Using the ICU Message Format?

Across programming languages and platforms, different i18n libraries have implemented ICU message format support. What follows is a list of some of these libs.

✋🏽 Heads Up » Different ports implement different subsets of ICU message format. In all cases, when adopting an ICU message format port be sure to read the documentation carefully and know what ICU features are supported by the library.

C/C++

  • ICU4C, the papa spec and implementation; this and the Java library are where ICU started. Of course, ICU4C is a complete implementation of ICU.

Dart/Flutter

  • intl: the first-party Dart i18n package implements ICU message formatting.
  • Flutter i18n: based on Dart’s intl, Flutter’s first-party i18n library also uses ICU message formats.

Java

  • ICU4J: the mama spec and implementation; this and the C/C++ library are the ICU originals. And it goes without saying that ICU4J is a complete implementation of ICU.

JavaScript

JavaScript doesn’t have an official, first-party i18n message format. But this is JavaScript we’re talking about, so of course you have your pick of third-party libraries to use for both browsers and Node.

  • Globalize: a well-rounded implementation of the ICU message format.
  • messageformat: resembling the original ICU implementations, messageformat provides message compilation to JavaScript for better performance.
  • i18next with ICU module: an official ICU extension for the robust i18next library.
  • Angular: Google’s popular front-end framework uses ICU expressions in its first-party i18n solution.
  • react-intl: part of the FormatJS family, the i18n library for React uses the ICU message syntax.

PHP

  • Symfony: the popular web framework supports ICU messages in its Translation component.

🗒 Note » The first-party PHP i18n message solution uses gettext.

Python

  • PyICU: Python wrappers for the ICU C++ libraries.

🗒 Note » The first-party Python i18n message solution uses gettext.


🗒 Note » Have we missed any ICU libraries? Do you use an ICU library that you think is worthy of mention here? Let us know in the comments below.

What is CLDR?

If you’ve perused the documentation of some of the above libraries, or the docs of the original ICU project, you may have seen references to the CLDR. And you may have wondered what that was.

Well, CLDR stands for Common Locale Data Repository, and it’s the official Unicode collection of l10n data. For a given locale, the CLDR can give you the locale’s script, its preferred calendar, number system, date formats, pluralization rules, and more.

The CLDR is used by the main ICU project and other libraries that implement ICU features.

🔗 Resource » Check out the official CLDR documentation for more info.

✋🏽 Heads Up » We often use CLDR data without thinking about it since it’s baked into some of the i18n libraries we use. Other times, however, we need to manually pull in CLDR data for the locales our apps support. Be sure to read the documentation of the i18n library you’re using to see if you need to manually fetch CLDR data for your locales.
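
To see baked-in CLDR data at work, here is a small sketch that assumes a Node.js runtime, where the built-in Intl objects are backed by ICU and the CLDR data bundled with it. The cldrInfo variable is just an illustrative name.

```javascript
// Ask the runtime what it knows about the en-US locale.
const pluralOptions = new Intl.PluralRules("en-US").resolvedOptions();
const dateOptions = new Intl.DateTimeFormat("en-US").resolvedOptions();

const cldrInfo = {
  pluralCategories: pluralOptions.pluralCategories, // English uses "one" and "other"
  calendar: dateOptions.calendar,                   // "gregory"
  numberingSystem: dateOptions.numberingSystem,     // "latn"
};

console.log(cldrInfo);
```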

Working with the ICU Message Format

So what does working with ICU messages actually look like? It’s fairly straightforward. Let’s take a look at the features the syntax gives us.

🗒 Note » In the examples below, we’re assuming that our translation messages are stored in YAML or JSON files. We’re also assuming that we’re using a JavaScript i18n library with ICU message format support and a format() function to display our messages. This is just for demonstration’s sake, and you will probably want to take a look at how your i18n library handles message files, and the exact function or method it uses to display messages. The formats presented here should hold no matter which library you use, however.

🔗 Resource » If you want to play around with the following examples, or your own, we recommend the Online ICU Message Editor by Andy VanWagoner.

Basic Messages

A basic ICU message is just plain text in a given locale.
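
A minimal sketch of this idea, with hypothetical catalogs and a hypothetical getMessage() helper: because a basic message has no arguments, displaying it amounts to looking up the text for the active locale.

```javascript
// Hypothetical per-locale message catalogs, keyed by message ID.
const catalogs = {
  en: { greeting: "Hello, world!" },
  es: { greeting: "¡Hola, mundo!" },
};

// A basic message has no arguments, so formatting is a plain lookup.
function getMessage(locale, id) {
  const message = catalogs[locale]?.[id];
  if (message === undefined) {
    throw new Error(`Missing message: ${locale}.${id}`);
  }
  return message;
}

console.log(getMessage("en", "greeting")); // Hello, world!
console.log(getMessage("es", "greeting")); // ¡Hola, mundo!
```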

Interpolation

Dynamic text is denoted by curly braces {}.
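
For instance, the message "Hello, {username}!" takes one argument named username. The sketch below shows the substitution a library performs; interpolate() is a hypothetical helper that handles only simple {name} placeholders, not the full ICU syntax.

```javascript
// Replace each simple {name} placeholder with the matching argument.
// Placeholders without a matching argument are left untouched.
function interpolate(message, args) {
  return message.replace(/\{(\w+)\}/g, (placeholder, name) =>
    name in args ? String(args[name]) : placeholder
  );
}

console.log(interpolate("Hello, {username}!", { username: "Adam" }));
// Hello, Adam!
```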

Plurals

Pluralized forms can appear anywhere within a message, and have the form {n, plural, ...forms}, where n is the count variable, and forms is one or more plural forms for the phrase.

The special # symbol will display the given count in the active locale’s number system. We’re using n to denote the count here, but that’s just a matter of convention. We can use any name we like as our count variable.

🗒 Note » Pluralization generally uses CLDR rules for the given locale. For example, according to the CLDR, English has two plural forms: one and other. Arabic has six forms. Each locale file can specify its own plural forms for a given message.

🗒 Note » The other variant is always required in plural expressions.

Also, we can nest other interpolated values within our plural messages. The following message is perfectly legal.
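
Take the message {n, plural, one {{name} has # new message} other {{name} has # new messages}} as an example. The sketch below shows roughly how it resolves, assuming the forms have already been parsed into an object; formatPlural() is a hypothetical helper, and the runtime's Intl.PluralRules stands in for the library's CLDR plural rules.

```javascript
// Parsed forms of the ICU message:
// {n, plural, one {{name} has # new message} other {{name} has # new messages}}
const forms = {
  one: "{name} has # new message",
  other: "{name} has # new messages",
};

function formatPlural(forms, n, args, locale) {
  // Ask the CLDR plural rules which category n falls into for this locale.
  const category = new Intl.PluralRules(locale).select(n);
  // Fall back to the required "other" form when the category has no form.
  const form = forms[category] ?? forms.other;
  return form
    // "#" displays the count in the locale's number system.
    .replace(/#/g, () => new Intl.NumberFormat(locale).format(n))
    // Nested {name} placeholders are interpolated as usual.
    .replace(/\{(\w+)\}/g, (placeholder, name) =>
      name in args ? String(args[name]) : placeholder
    );
}

console.log(formatPlural(forms, 1, { name: "Adam" }, "en-US"));
// Adam has 1 new message
console.log(formatPlural(forms, 1200, { name: "Adam" }, "en-US"));
// Adam has 1,200 new messages
```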


Overriding CLDR Plural Rules

We can customize our messages using single number specifiers. We use the =42 syntax to provide these custom forms, which override CLDR locale forms like one or other.
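
As a sketch, consider the message {n, plural, =0 {You have no messages.} =42 {You have exactly forty-two messages.} one {You have # message.} other {You have # messages.}}. The helper below is hypothetical and assumes the forms are already parsed; it checks for an exact =N form before consulting the locale's plural rules.

```javascript
// Parsed forms, including two exact-number overrides.
const pluralForms = {
  "=0": "You have no messages.",
  "=42": "You have exactly forty-two messages.",
  one: "You have # message.",
  other: "You have # messages.",
};

function formatWithOverrides(forms, n, locale) {
  const category = new Intl.PluralRules(locale).select(n);
  // An exact match such as =0 or =42 wins over the CLDR category.
  const form = forms[`=${n}`] ?? forms[category] ?? forms.other;
  return form.replace(/#/g, () => new Intl.NumberFormat(locale).format(n));
}

console.log(formatWithOverrides(pluralForms, 0, "en-US"));  // You have no messages.
console.log(formatWithOverrides(pluralForms, 1, "en-US"));  // You have 1 message.
console.log(formatWithOverrides(pluralForms, 42, "en-US")); // You have exactly forty-two messages.
```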

Offsets

An optional offset specifier can be added to ICU plural messages. When an offset is provided, explicit forms like =1 are still matched against the plain value of n. If no explicit form matches, the offset is subtracted from n, and the difference is used both for selecting the plural form (one, other, etc.) and for the value of #.

✋🏽 Heads Up » Because offsets can obfuscate the expected display of a message, they may cause confusion when used on a team. Use your own judgement here.
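
A common use of offsets is a "likes" message such as {n, plural, offset:1 =0 {Nobody liked this.} =1 {{name} liked this.} one {{name} and one other person liked this.} other {{name} and # others liked this.}}. The sketch below follows the behavior documented for ICU's plural format: exact forms are matched against the plain count, while the plural category and # use the count minus the offset. The formatLikes() helper and its parsed forms are hypothetical.

```javascript
// Parsed forms of the message above; the offset is 1.
const likeForms = {
  "=0": "Nobody liked this.",
  "=1": "{name} liked this.",
  one: "{name} and one other person liked this.",
  other: "{name} and # others liked this.",
};
const OFFSET = 1;

function formatLikes(n, name, locale) {
  const shifted = n - OFFSET;
  // The category is chosen from the shifted value, not the plain count.
  const category = new Intl.PluralRules(locale).select(shifted);
  // Exact forms are still matched against the plain count.
  const form = likeForms[`=${n}`] ?? likeForms[category] ?? likeForms.other;
  return form
    .replace(/#/g, () => new Intl.NumberFormat(locale).format(shifted))
    .replace(/\{name\}/g, () => name);
}

console.log(formatLikes(1, "Adam", "en-US")); // Adam liked this.
console.log(formatLikes(2, "Adam", "en-US")); // Adam and one other person liked this.
console.log(formatLikes(5, "Adam", "en-US")); // Adam and 4 others liked this.
```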

Switching with select

A select expression is a conditional branch, and is a lot like the switch statement found in many programming languages. It can be placed anywhere in a message, and is denoted by {arg, select, ...forms} where arg is the argument variable to switch on, and forms is one or more alternatives.

🗒 Note » The other variant is always required in select expressions.
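
For example, the message {gender, select, female {She} male {He} other {They}} liked your post. switches on a gender argument. The sketch below shows the selection logic only; formatSelect() is a hypothetical helper working on already-parsed forms.

```javascript
// Parsed forms of: {gender, select, female {She} male {He} other {They}}
const pronouns = { female: "She", male: "He", other: "They" };

function formatSelect(forms, value) {
  if (!("other" in forms)) {
    throw new Error('A select expression requires an "other" form');
  }
  // Use the matching form, or fall back to "other" for unknown values.
  return Object.prototype.hasOwnProperty.call(forms, value)
    ? forms[value]
    : forms.other;
}

console.log(`${formatSelect(pronouns, "female")} liked your post.`);
// She liked your post.
console.log(`${formatSelect(pronouns, "unspecified")} liked your post.`);
// They liked your post.
```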

Nesting

Plural forms and select expressions can be nested within themselves or each other. Dynamic strings can be nested in plurals and selects as well. Here’s the example from the ICU docs:


Number Formatting

ICU message format supports predefined number formats: percent and currency. We specify number formats in our messages via the syntax, {n, number, format} where n is the number argument, and format is either percent or currency.

The percent format expects the number argument to be a fraction, typically between 0 and 1. For example, 0.42 displays as 42%.

🗒 Note » The format specifier is optional. Without it, the number is displayed in the number system of the active locale.
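
The messages {n, number}, {n, number, percent}, and {n, number, currency} can be approximated with JavaScript's built-in Intl.NumberFormat, as in the sketch below. The formatNumber() helper is hypothetical. One difference to note: Intl.NumberFormat requires an explicit currency code, so the sketch passes "USD" as an assumption rather than deriving a currency from the locale.

```javascript
// Rough Intl equivalents of {n, number}, {n, number, percent},
// and {n, number, currency}.
function formatNumber(value, locale, style, currency) {
  const options = {};
  if (style === "percent") options.style = "percent";
  if (style === "currency") {
    options.style = "currency";
    options.currency = currency; // Intl needs an explicit ISO 4217 code
  }
  return new Intl.NumberFormat(locale, options).format(value);
}

console.log(formatNumber(1234.5, "en-US"));                  // 1,234.5
console.log(formatNumber(0.42, "en-US", "percent"));         // 42%
console.log(formatNumber(9.99, "en-US", "currency", "USD")); // $9.99
```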

Custom Number Formats

You may find the above options limited for your needs. There are, of course, ways to have more fine-grained control over your number formatting. The official ICU Java libraries, for example, offer a NumberFormatter class, which allows you to do things like the following.

Every ICU library implements its number formatting a bit differently, so check the documentation of the library you’re using to see how you can better control your number formats.
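
In JavaScript, for instance, the built-in Intl.NumberFormat offers comparable fine-grained control through its options. This is an analogous illustration rather than the ICU4J NumberFormatter API, and the variable names are only for demonstration.

```javascript
// Always show exactly two fraction digits.
const fixed = new Intl.NumberFormat("en-US", {
  minimumFractionDigits: 2,
  maximumFractionDigits: 2,
});
console.log(fixed.format(1234.5)); // 1,234.50

// Percentages with up to one fraction digit.
const share = new Intl.NumberFormat("en-US", {
  style: "percent",
  maximumFractionDigits: 1,
});
console.log(share.format(0.256)); // 25.6%

// Compact notation for large numbers.
const compact = new Intl.NumberFormat("en-US", { notation: "compact" });
console.log(compact.format(1234567)); // 1.2M
```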

Date/Time Formatting

ICU has four predefined date formats: short, medium, long, and full. Date formats are specified using the syntax, {myDate, date, format}, where myDate is the date value argument, and format is one of the predefined formats.
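
To get a feel for the four styles, the sketch below uses JavaScript's built-in Intl.DateTimeFormat, whose dateStyle option takes the same four names. Treat it as an approximation of what {myDate, date, short} and its siblings produce; the formatDate() helper is hypothetical, and the time zone is pinned to UTC so the output does not depend on the machine running it.

```javascript
const myDate = new Date(Date.UTC(2021, 2, 4)); // 4 March 2021, midnight UTC

function formatDate(date, locale, style) {
  return new Intl.DateTimeFormat(locale, {
    dateStyle: style, // "short", "medium", "long", or "full"
    timeZone: "UTC",
  }).format(date);
}

for (const style of ["short", "medium", "long", "full"]) {
  console.log(`${style}: ${formatDate(myDate, "en-US", style)}`);
}
// short: 3/4/21
// medium: Mar 4, 2021
// long: March 4, 2021
// full: Thursday, March 4, 2021
```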


Custom Date/Time Formatting

The official ICU spec contains support for pattern characters, called Date Field Symbols, that allow for precise control over date formatting. Again, every i18n library does its custom date/time formatting a bit differently, and you may need to supplement your i18n library with another date-specific solution for your date formats. Peruse the docs of your i18n library to see how it achieves or allows for fine-grained control over date formatting.

Even More

We’ve covered what we think are the most commonly used formats here. However, the ICU spec and third-party ICU libraries (to varying degrees) offer much more functionality. This includes working with ordinals (e.g. 1st, 2nd), units (e.g. km, lbs), durations, relative dates (e.g. yesterday), and more. Check out the documentation of your i18n library of choice to see which advanced formats are supported.

Wrapping Up

ICU message format is certainly one of the de-facto standards of translation messages in i18n. We hope that we’ve shed some light on some of your questions around ICU, and that you’ve enjoyed our little guide to ICU and the ICU message format.

🗒 Note » Is there anything we’ve missed? Would you like us to cover more ICU topics, or dive deeper into a topic we touched on here? Let us know in the comments below.

And when you’re working with a team on an internationalized project, few things will make you more efficient than a good platform like Phrase. Phrase supports the ICU message format, and gives your translators ICU syntax checking and highlighting.

ICU message format support in Phrase: syntax highlighting and easy plurals

Phrase is also a professional, fully-featured i18n platform that helps product managers track their l10n progress, allows translators to work in an intuitive UI, gives developers the ability to sync translation files through the CLI, and much, much more. Leave the i18n pipeline to Phrase and focus on your product. Check out all of Phrase’s features, and sign up for a free 14-day trial.

Author
Mohammad Phrase Content Team