Localization Source File Formats

In software localization projects, translatable strings are first collected in so-called “localization files” or “resource files.” These files are then handed off to translators who translate strings and thereby create copies of the original files now containing equivalent strings in a different language. They return their translation files to the developers who integrate them into the software. Thus, resource files are a crucial component of the localization process.

What are localization files?

Localization resource files can be super simple or fairly complex. In all cases, though, we are dealing with plain text files. This means they are typically easy to create and can – in principle – be edited in a common text editor.

In the simplest case, a resource file contains text strings, one on each line, each preceded with a unique ID – the “translation key.” More complex files may in addition contain a target language string for each source string, making the file bilingual. For each string, the file may also contain comments, keywords or tags, context indicators, and pluralized variants. And instead of a simple one-dimensional list, the file might contain strings organized in a hierarchical structure.
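To make the simplest case concrete, here is a minimal sketch of reading such a file. The "key = string" layout, the "#" comment marker, and the function name are assumptions chosen for illustration; real formats each have their own rules.

```python
def parse_simple_resources(text):
    """Parse 'key = string' lines into a dict; skip blank lines and # comments."""
    strings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in line: {line!r}")
        strings[key.strip()] = value.strip()
    return strings


# Hypothetical file content: one translation key and one string per line.
SAMPLE = """\
# main menu
menu.open = Open file
menu.save = Save file
"""

resources = parse_simple_resources(SAMPLE)
print(resources["menu.save"])
```

Everything beyond this (comments for translators, plurals, hierarchy) is what makes the richer formats harder to parse by hand.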

Once the file format reaches a certain level of complexity, a simple text editor begins to lose its utility. Very complicated files are best edited with applications that understand the structure and can present the information in an understandable manner.

How are localization files created?

The software localization process begins with a step called “internationalization.” Here, the developer modifies the source code and extracts keys for localization. This means that all displayable text snippets (“strings”) are removed from the code and replaced with unique IDs (“keys”). The keys and their strings are then added to a resource file.

At run time, the software consults the resource file: whenever it needs to display some text, it uses the key provided in the code to pull the appropriate string from the file. When the user selects a different display language, the software pulls strings from the resource file for that language instead.
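The lookup described here can be sketched in a few lines. The in-memory tables, the function name, and the fallback to the source language are assumptions made for illustration; actual frameworks differ in how they load files and handle missing strings.

```python
# Hypothetical in-memory copies of two resource files, keyed by language code.
RESOURCES = {
    "en": {"greeting": "Hello", "farewell": "Goodbye"},
    "de": {"greeting": "Hallo"},
}


def translate(key, language, source_language="en"):
    """Return the string for `key` in `language`.

    Falls back to the source language, then to the key itself,
    so that the interface never shows an empty label.
    """
    for lang in (language, source_language):
        table = RESOURCES.get(lang, {})
        if key in table:
            return table[key]
    return key


print(translate("greeting", "de"))
print(translate("farewell", "de"))
```

The fallback is a design choice, not a requirement: some teams prefer to show the raw key so that missing translations are easy to spot during testing.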

While some programming environments (Visual Studio for example) have tools to automate the string extraction, the developer needs to mark displayable strings by hand in most cases. Tools like gettext can then pull all marked strings into resource files. The exact method and the preferred resource file format depend on the programming language and development environment.

What are common localization source file formats?

Linux developers are likely familiar with Translate Toolkit and the gettext utilities. Both mainly work with po files, along with pot files (the untranslated templates from which po files are created) and mo files (the compact binary equivalents that programs load at run time). po files are of medium complexity, because “translation units” span several lines and files may contain meta-information about each string as well as pluralized variants. The advantage of working with po is that it is a very common format and there are many tools and utilities to work with it. However, loading a po file into software can be a challenge, especially if developers work with newer programming languages that do not have specific parsers (yet).
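To show what a multi-line translation unit looks like, here is a small po sample and a deliberately simplified reader. This is a sketch under stated assumptions: it handles only single-line msgid and msgstr values, translator comments ("#.") and source references ("#:"), and it ignores the header entry, plural forms, escapes, and multi-line strings. The file paths in the sample are invented.

```python
PO_SAMPLE = '''\
#. Label of the first item in the File menu
#: src/menu.py:12
msgid "Open file"
msgstr "Ouvrir le fichier"

#: src/menu.py:13
msgid "Save file"
msgstr ""
'''


def parse_po_units(text):
    """Split a po document into translation units (simplified, see lead-in)."""
    units = []
    for block in text.strip().split("\n\n"):
        unit = {"comments": [], "references": [], "msgid": None, "msgstr": None}
        for line in block.splitlines():
            if line.startswith("#. "):
                unit["comments"].append(line[3:])
            elif line.startswith("#: "):
                unit["references"].append(line[3:])
            elif line.startswith("msgid "):
                unit["msgid"] = line[len("msgid "):].strip().strip('"')
            elif line.startswith("msgstr "):
                unit["msgstr"] = line[len("msgstr "):].strip().strip('"')
        units.append(unit)
    return units


units = parse_po_units(PO_SAMPLE)
# An empty msgstr means the unit has not been translated yet.
untranslated = [u["msgid"] for u in units if not u["msgstr"]]
print(untranslated)
```

For production use, an established po library is a safer choice than a hand-written reader like this one.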

For easy and speedy development, programmers often choose to forgo the complexities of po files and opt for simpler formats, such as YAML. In its simplest use, this is a very straightforward file type, with one key-string pair per line. Such flat files are very easy to read and write, and any programmer can write the relevant routines in a few lines of code. But the format does not contain a lot of information that could help the translator.

JSON-based formats have the same appeal for programmers: they are easy to read and easy to generate. JSON is an information interchange format that most programming languages support. Thus, it is a useful format for development environments with a mix of programming languages. JSON allows you to add as much meta-information as desired, but any added complexity reduces the value of JSON as a lowest common denominator format: complex structures require complex code to handle them.
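The trade-off can be seen in a small sketch. The two layouts below are hypothetical: a flat file that maps keys to strings, and a richer one that wraps each string in an object with an invented "comment" field. As soon as the richer layout appears, the reading code has to know about it.

```python
import json

# Flat layout: key -> string.
FLAT = '{"menu.open": "Open file"}'

# Richer layout with assumed field names "text" and "comment".
RICH = '{"menu.open": {"text": "Open file", "comment": "File menu, first item"}}'


def get_string(resources, key):
    """Return the display string for `key` from either layout."""
    entry = resources[key]
    if isinstance(entry, dict):
        # Extra structure means extra handling code.
        return entry["text"]
    return entry


print(get_string(json.loads(FLAT), "menu.open"))
print(get_string(json.loads(RICH), "menu.open"))
```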

While YAML and JSON are formats that primarily serve for exchanging data between software programs and/or development tools, XLIFF is an exchange format for translation tools. XLIFF is a translation industry standard that most if not all translation and localization programs support. This is an XML-based format, which means it uses <tags> for separating and identifying pieces of content.
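For readers who have not seen XLIFF, the following sketch reads a tiny XLIFF 1.2 document with Python's standard XML parser and collects source/target pairs. The sample content, the key "menu.open", and the variable names are made up for illustration; real files carry far more attributes and may use XLIFF 2.x, which has a different structure.

```python
import xml.etree.ElementTree as ET

XLIFF_SAMPLE = """\
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file source-language="en" target-language="fr" datatype="plaintext" original="menu">
    <body>
      <trans-unit id="menu.open">
        <source>Open file</source>
        <target>Ouvrir le fichier</target>
      </trans-unit>
    </body>
  </file>
</xliff>
"""

# XLIFF 1.2 elements live in this XML namespace.
NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

root = ET.fromstring(XLIFF_SAMPLE)
pairs = {}
for unit in root.iterfind(".//x:trans-unit", NS):
    source = unit.findtext("x:source", namespaces=NS)
    target = unit.findtext("x:target", namespaces=NS)
    pairs[unit.get("id")] = (source, target)

print(pairs)
```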

Finally, some tools use csv or xlsx as resource file formats. The benefit of these is that they can be opened, inspected, and edited with a familiar spreadsheet program. Thus, these formats are useful for exchanging information with people who do not have access to either development or translation tools.
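A spreadsheet-style resource file might look like the sample below, with one column for the key and one column per language. The column names and the helper function are assumptions for this sketch; tools that export csv each define their own layout.

```python
import csv
import io

# Hypothetical layout: a "key" column followed by one column per language.
CSV_SAMPLE = (
    "key,en,fr\n"
    "menu.open,Open file,Ouvrir le fichier\n"
    "menu.save,Save file,\n"
)


def load_language(csv_text, language):
    """Return key -> string for one language column, skipping empty cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["key"]: row[language] for row in reader if row[language]}


print(load_language(CSV_SAMPLE, "fr"))
```

Empty cells are skipped here so that untranslated strings can be detected by their absence.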

Thus, localization file formats differ because they are used with slightly different intents. Typically, those are the intents of the programmers who created a development tool, and developers and localizers may not have much of a choice once they work in a specific development environment.

What are some advanced features in localization files?

If localization file formats allow for complexity, someone is guaranteed to increase the complexity of their data. With most formats, developers have some freedom to pack information into the translation key and use it to tell where exactly a string is used. Most formats also provide ways to add comments or explanations for the translator. In po files, Android XML, and iOS stringsdict files, among others, developers can insert both singular and plural versions of strings and thereby give translators the opportunity to provide different wordings for each. JSON files can encode deeply nested structures and thus reflect the dependency relationships of display screens in a software program.
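As an illustration of nesting, the sketch below takes a hypothetical nested structure that mirrors two screens and converts it into flat, dotted keys. The screen names and the dot separator are assumptions; the point is only that the hierarchy can be packed into the key.

```python
# Hypothetical nested resources that mirror the screens of an application.
NESTED = {
    "settings": {
        "title": "Settings",
        "audio": {"volume": "Volume"},
    },
    "home": {"title": "Home"},
}


def flatten(nested, prefix=""):
    """Turn nested dictionaries into a flat dict with dotted keys."""
    flat = {}
    for name, value in nested.items():
        key = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            flat.update(flatten(value, key))
        else:
            flat[key] = value
    return flat


print(flatten(NESTED))
```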

One caveat here: if a format offers you the option to use complex structures, it is not necessarily a good idea to make use of it. Pluralizations from one language do not always have direct equivalents in other languages, and making appropriate modifications to the localization file is not a trivial task. Comments in localization files are only useful if the translator can see them, and translators only see them if their translation tool can show them. Likewise, if keys hold crucial information, translation tools must provide ways to use it. Nested structures are most problematic: since these require specific decoders, it is extremely unlikely that a translation tool can display this information in a meaningful way.
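The point about plurals can be made concrete with two rule functions. English distinguishes two forms for whole numbers, while Polish distinguishes three. The category names ("one", "few", "many", "other") and the function names are chosen for this sketch, and only non-negative whole numbers are covered.

```python
def english_form(n):
    """English: singular for exactly 1, plural for everything else."""
    return "one" if n == 1 else "other"


def polish_form(n):
    """Polish: three forms for whole numbers."""
    if n == 1:
        return "one"
    # 2-4, 22-24, 32-34, ... but not 12-14
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return "few"
    return "many"


for count in (1, 2, 5, 12, 22):
    print(count, english_form(count), polish_form(count))
```

A resource file that stores exactly one singular and one plural string per key therefore has no slot for the third Polish form, which is why the file itself may need to change during translation.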

Therefore, if you know that your localization files contain information besides just translatable strings, it is worth pointing this out to your translation team.

How do you best manage resource files?

As with localization formats themselves, the way localization files are organized may be dictated by the tools you use for development.

Some tools use monolingual files and maintain translations in files with the same names but added language suffixes. Other tools use bilingual or even multilingual files, which means that all source and target strings that belong together automatically stay together.
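The suffix convention for monolingual files can be sketched as follows. The underscore separator, the sample path, and the function name are assumptions for illustration; each tool defines its own naming scheme.

```python
from pathlib import PurePosixPath


def localized_path(source_path, language):
    """Insert a language suffix before the file extension."""
    path = PurePosixPath(source_path)
    return str(path.with_name(f"{path.stem}_{language}{path.suffix}"))


print(localized_path("locales/messages.properties", "de"))
```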

Some environments encode dependencies inside their localization files while others create subfolder structures for the same purpose.

Each aspect has its advantages and disadvantages. For example, while multilingual files may be easy to work with for the programmer, they pose a challenge when you want different translators to work on them simultaneously.

If you have control over the choice of localization format, the complexities inside the files, and the way resource files are named and organized, the best practice would be to hold off on any decisions until you have discussed them with your localization team. Your localization tool may steer you in a specific direction, or it may offer surprising features and conveniences to you.