How to Translate Python Applications with the GNU gettext Module

Learn how to use the GNU gettext module, bundled with Python's official standard library, to get your Python i18n and l10n development going.

Internationalization (i18n) is the process of making a program aware of multiple languages. Localization (l10n) is the adaptation of an internationalized program to a local language and its cultural conventions. In theory, this looks simple to implement. In practice, it takes time and effort to provide a good internationalization and localization experience for a global audience. Python bundles a module for this purpose called gettext, which consists of a public API and a set of tools that help extract and generate message catalogs from the source code.

Gettext is a mature and battle-tested solution initially released by Sun Microsystems more than 25 years ago. Gettext provides a set of utilities that allow localizing various programs and even operating systems. In this article, we are going to use this module and walk through the process of localizing a small Python app while learning the different rules and options that it provides.

For the purposes of this demo, I will be using Python 3.6 but the gettext module is bundled in the Python 2.7 version as well. The code is hosted on GitHub.

Introduction To GNU gettext module

GNU gettext is the de facto universal solution for localization, offering a set of tools that provide a framework to help other packages produce multilingual messages. It prescribes an opinionated way of writing programs so that they support translated message strings, along with a directory and file naming convention for the messages that need to be translated.

With regard to directory conventions, we need a place to put our translations for each supported locale. For example, let's say we need to support two languages, English and Greek. Their language codes are en and el respectively.

We can create a folder named "locales" and, inside it, one folder per language code. Each of those folders contains another folder named LC_MESSAGES with one or more .po files.

So, the file structure should look like this:

locales/
├── el
│   └── LC_MESSAGES
│       └── base.po
└── en
    └── LC_MESSAGES
        └── base.po

Here we can see that the files have a .po extension. PO is a plain text format. A PO file contains a number of messages, partly independent text segments to be translated, which have been grouped into one file according to some logical division of what is being translated. Those groups are called domains. In the example above, we have only one domain, named base. The PO files themselves are also called message catalogs.

Apart from PO files, you will also encounter .mo files. MO, or Machine Object, is the binary form of a message catalog. It is compiled from a PO file and is what gettext actually loads at runtime to look up translations.

In addition, there are also .pot files. These are the template files for PO files. A POT file is essentially a PO file that contains only the original strings, with all translations left empty. In practice, .pot files are generated by tools and we should not modify them directly.
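To make the msgid/msgstr pairing concrete, here is a minimal sketch that reads simple entries from PO text and lists the ones that are still untranslated. The parse_simple_po helper is hypothetical and only handles single-line entries (no plurals, escapes, or multi-line strings); for real work, rely on the gettext tools or a dedicated library.

```python
import re

# Two entries in PO syntax: one translated, one still empty (as in a POT template).
PO_TEXT = '''\
#: main.py:5
msgid "Hello world"
msgstr "Χαίρε Κόσμε"

#: main.py:6
msgid "This is a translatable string"
msgstr ""
'''


def parse_simple_po(text):
    """Return {msgid: msgstr} for single-line entries only (illustrative)."""
    pairs = {}
    current_id = None
    for line in text.splitlines():
        match = re.match(r'^(msgid|msgstr) "(.*)"$', line)
        if not match:
            continue  # comments and blank lines are skipped
        keyword, value = match.groups()
        if keyword == "msgid":
            current_id = value
        elif current_id is not None:
            pairs[current_id] = value
            current_id = None
    return pairs


catalog = parse_simple_po(PO_TEXT)
untranslated = [msgid for msgid, msgstr in catalog.items() if not msgstr]
print(untranslated)
```

An entry with an empty msgstr is exactly what every entry of a POT template looks like before a translator fills it in.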

Using the Python gettext module

The gettext module ships with Python. It provides internationalization (i18n) and localization (l10n) services for your Python modules and applications. The module exposes two APIs. The first is the basic API, which mirrors the GNU gettext catalog API. The second is a higher-level, class-based API that may be more appropriate for Python files. The class-based API offers more flexibility and greater convenience than the GNU gettext API, and it is the recommended way of localizing your Python applications and modules. This is also the API that we are going to use in this tutorial.

In order to provide multilingual messages for your Python programs, you need to take the following steps:

  1. Mark all translatable strings in your program with a wrapper function.
  2. Run a suite of tools over your marked files to generate raw message catalogs, or POT files.
  3. Duplicate the POT files into specific locale folders and write the translations.
  4. Import and use the gettext module so that message strings are properly translated.

Let's create a sample application to see how we are going to do that in practice.

Example Application

In order to understand the whole process better, it's important to have an example program that we want to localize. Let's start with a function that prints some strings.

# main.py

def print_some_strings():
    print("Hello world")
    print("This is a translatable string")

if __name__ == '__main__':
    print_some_strings()

As it stands, this program cannot be localized with gettext.

As we said earlier, the first step is to mark all translatable strings in the program. To do that, we wrap every translatable string in _():

# main.py

import gettext

_ = gettext.gettext

def print_some_strings():
    print(_("Hello world"))
    print(_("This is a translatable string"))

if __name__ == '__main__':
    print_some_strings()

Notice that we imported gettext and assigned gettext.gettext to _. This ensures that the _ name is defined, so the program still runs.

If you run the program, you will see that nothing has changed:

$ python main.py
Hello world
This is a translatable string

However, we are now able to proceed to the next step, which is extracting the translatable messages into a POT file.

Generate raw translatable messages

To automate the generation of raw translatable messages from the wrapped strings throughout an application, the gettext authors provide a set of tools that parse the source files and extract the messages into a general message catalog.

Originally, GNU gettext only supported C and C++ source code, but its xgettext tool now scans code written in a number of languages, including Python, to find strings marked as translatable.

The Python distribution also includes two scripts, pygettext.py and msgfmt.py. pygettext.py extracts strings from Python source code only and does not recognize other languages, while msgfmt.py compiles .po files into .mo files.

The location of those files depends mainly on the OS default installation of the Python library. In order to find it you can issue the following command:

$ locate pygettext.py
/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/share/doc/python3.6/examples/Tools/i18n/pygettext.py

This was on macOS. Generally, the script lives in the Tools/i18n directory of the Python distribution. You may need to run the updatedb or /usr/libexec/locate.updatedb command beforehand to update the search index.

Once you have found the tool, call it and specify the file you want to extract the strings from:

$ pygettext.py -d base -o locales/base.pot src/main.py

That will generate a base.pot file in the locales folder taken from our main.py program. Remember that POT files are just templates and we should not touch them. Let us inspect the contents of the base.pot  file:

# SOME DESCRIPTIVE TITLE.
# Copyright (C) YEAR ORGANIZATION
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"POT-Creation-Date: 2018-01-28 16:47+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: pygettext.py 1.5\n"

#: src/main.py:5
msgid "Hello world"
msgstr ""

#: src/main.py:6
msgid "This is a translatable string"
msgstr ""

In a bigger program, many more translatable strings would follow. Here we specified a single domain called base because the application is only one file. In bigger applications, I would use multiple domains to logically separate the messages by application scope.

Notice the simple convention for our translatable strings: msgid is the original string wrapped in _(), and msgstr is the translation we need to provide.
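To illustrate what an extraction tool does conceptually, the sketch below walks the syntax tree of a source string and collects every literal passed to _(). It is not how pygettext.py is implemented; find_marked_strings is a hypothetical helper, and the code assumes Python 3.8 or newer (for ast.Constant).

```python
import ast

# The marked-up example program from this tutorial, kept as a string for the demo.
SOURCE = '''\
import gettext
_ = gettext.gettext

def print_some_strings():
    print(_("Hello world"))
    print(_("This is a translatable string"))
'''


def find_marked_strings(source, marker="_"):
    """Return sorted (line number, string) pairs for calls like _("literal")."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == marker
            and node.args
            and isinstance(node.args[0], ast.Constant)
            and isinstance(node.args[0].value, str)
        ):
            found.append((node.lineno, node.args[0].value))
    return sorted(found)


# Print the findings in a PO-like layout: location comment, msgid, empty msgstr.
for lineno, msgid in find_marked_strings(SOURCE):
    print(f'#: main.py:{lineno}\nmsgid "{msgid}"\nmsgstr ""\n')
```

Only string literals can be found this way, which is why translatable text should be passed to _() as a literal rather than built up in a variable first.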

Now we are ready to create our translations. Because we have the template generated for us, the next step is to create the required directory structure and copy the template into the right spot. We've seen the recommended file structure before. We are going to create two additional folders inside the locales directory, following this pattern:

$localedir/$language/LC_MESSAGES/$domain.po

Where:

  • $localedir is locales
  • $language is en and el
  • $domain is base

The .po  files will contain the translations we need to provide.

Copy base.pot into the folders locales/en/LC_MESSAGES and locales/el/LC_MESSAGES, renaming each copy to base.po. Then modify their headers to include more information about the locale. For example, this is the Greek translation:
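If you prefer to script the copy step, the following is a minimal sketch. The scaffold_locales helper is hypothetical, and the demo runs inside a temporary directory with a stand-in template so that nothing in a real project is touched.

```python
import shutil
import tempfile
from pathlib import Path


def scaffold_locales(localedir, pot_file, languages, domain="base"):
    """Copy the template to <localedir>/<language>/LC_MESSAGES/<domain>.po."""
    created = []
    for lang in languages:
        target_dir = Path(localedir) / lang / "LC_MESSAGES"
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / f"{domain}.po"
        if not target.exists():  # never overwrite existing translations
            shutil.copy(pot_file, target)
        created.append(target)
    return created


with tempfile.TemporaryDirectory() as tmp:
    locales = Path(tmp) / "locales"
    locales.mkdir()
    pot = locales / "base.pot"
    # Stand-in template content, just enough for the demo.
    pot.write_text('msgid "Hello world"\nmsgstr ""\n', encoding="utf-8")

    po_files = scaffold_locales(locales, pot, ["en", "el"])
    relative = [p.relative_to(locales).as_posix() for p in po_files]
    print(relative)
```

The existence check matters in practice: re-running the script after translators have started working should not replace their files with the empty template.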

# My App.
# Copyright (C) 2018
#
msgid ""
msgstr ""
"Project-Id-Version: 1.0\n"
"POT-Creation-Date: 2018-01-28 16:47+0000\n"
"PO-Revision-Date: 2018-01-28 16:48+0000\n"
"Last-Translator: me <johndoe@example.com>\n"
"Language-Team: Greek <yourteam@example.com>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: pygettext.py 1.5\n"

#: main.py:5
msgid "Hello world"
msgstr "Χαίρε Κόσμε"

#: main.py:6
msgid "This is a translatable string"
msgstr "Αυτό είναι ένα μεταφραζόμενο κείμενο"

You can find the specification for these files on the gnu.org website. Every PO file starts with a header entry that contains information about the file, the author, the last revision date and the pluralization rules.

Although there is a lot of metadata in the header, not all of it is mandatory. Also note that everything in the header is supposed to be in English, so that it remains understandable to people who do not speak the target language.

The catalog is built from the .po file using a tool called msgfmt.py. This tool parses the .po file and generates an equivalent .mo file. We mentioned before that MO files are binary data files that the Python gettext module reads in order to translate our program. This tool is usually located in the same folder as pygettext.py:

$ cd locales/el/LC_MESSAGES
$ msgfmt.py -o base.mo base

This command will generate a base.mo file in the same folder as the base.po  file.
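To get a feel for what such a compiled catalog contains, here is a minimal sketch that serializes a dictionary into the binary MO layout and loads it back with gettext.GNUTranslations. It is an illustration under simplifying assumptions, not a replacement for msgfmt.py: build_mo is a hypothetical helper that ignores plurals, contexts and the optional hash table.

```python
import gettext
import io
import struct


def build_mo(messages):
    """Serialize {msgid: msgstr} (both str) into GNU .mo bytes (simplified)."""
    encoded = sorted(
        (key.encode("utf-8"), value.encode("utf-8")) for key, value in messages.items()
    )
    ids = b"".join(key + b"\0" for key, value in encoded)
    strs = b"".join(value + b"\0" for key, value in encoded)
    count = len(encoded)

    key_table = 7 * 4                    # the header is seven 32-bit integers
    value_table = key_table + count * 8  # each table row is (length, offset)
    ids_start = value_table + count * 8
    strs_start = ids_start + len(ids)

    key_rows, value_rows = [], []
    id_pos = str_pos = 0
    for key, value in encoded:
        key_rows += [len(key), ids_start + id_pos]
        value_rows += [len(value), strs_start + str_pos]
        id_pos += len(key) + 1   # +1 for the NUL terminator
        str_pos += len(value) + 1

    # magic number, format version, entry count, two table offsets, no hash table
    header = struct.pack("<7I", 0x950412DE, 0, count, key_table, value_table, 0, 0)
    tables = struct.pack(f"<{len(key_rows) * 2}I", *key_rows, *value_rows)
    return header + tables + ids + strs


catalog = {
    # The entry with an empty msgid carries the header; the charset is needed
    # so that gettext can decode the Greek strings.
    "": "Content-Type: text/plain; charset=UTF-8\n",
    "Hello world": "Χαίρε Κόσμε",
    "This is a translatable string": "Αυτό είναι ένα μεταφραζόμενο κείμενο",
}

el = gettext.GNUTranslations(io.BytesIO(build_mo(catalog)))
print(el.gettext("Hello world"))
```

Loading from an in-memory buffer also shows that the class-based API does not care where the bytes come from, as long as they follow the MO layout.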

So, the final file structure should look like this:

locales
├── el
│   └── LC_MESSAGES
│       ├── base.mo
│       └── base.po
├── en
│   └── LC_MESSAGES
│       ├── base.mo
│       └── base.po
└── base.pot

Now that we have translated our application, let's glue everything together by adding the ability to install and switch locales.

Switching Locale

To switch locales in our program, we need to use the class-based gettext API. In this tutorial, I will explain only one function, gettext.translation. It accepts parameters that are used to load the .mo file for a particular language. If no .mo file is found, it raises an error, so we need to be careful to provide the right path.

Add the following code to the program:

import gettext

el = gettext.translation('base', localedir='locales', languages=['el'])
el.install()
_ = el.gettext # Greek

The first argument, base, is the domain, and the function will look for a .mo file with that name in our locales folder. The domain is a required argument of gettext.translation; the default messages domain only applies to the module-level functions such as gettext.gettext. The localedir parameter is the location of the locales folder you created, given as either a relative or an absolute path. The languages parameter lists the language codes to try, and each one is expanded into several candidate folder names. For example, because we specified el, it will look for .mo files in the following list of paths:

locales/el_GR.ISO8859-7/LC_MESSAGES/base.mo
locales/el_GR/LC_MESSAGES/base.mo
locales/el/LC_MESSAGES/base.mo
locales/el.ISO8859-7/LC_MESSAGES/base.mo

If you run the program again you will see the translations happening:

$ python main.py
Χαίρε Κόσμε
Αυτό είναι ένα μεταφραζόμενο κείμενο

The install method binds _() in Python's built-in namespace, so every module of the application can call it without importing anything. In the snippet above we also assign _ explicitly in the module, pointing it to the Greek catalog of translations. To go back to English, just reassign _ to the original gettext function or to a lambda that returns the original string that was wrapped.

Either of these assignments will work:

_ = lambda s: s
_ = gettext.gettext
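Since gettext.translation raises when no catalog is found, one way to make language switching more forgiving is to pass fallback=True, which returns a NullTranslations object that hands back the original strings. The sketch below shows both behaviours; use_language is a hypothetical helper, and an empty temporary directory stands in for a locales folder with no Greek catalog.

```python
import gettext
import tempfile


def use_language(code, localedir, domain="base"):
    """Return a translation function for `code`, or a pass-through if none exists."""
    translation = gettext.translation(
        domain, localedir=localedir, languages=[code], fallback=True
    )
    return translation.gettext


with tempfile.TemporaryDirectory() as empty_localedir:
    # Without fallback=True, a missing catalog raises an OSError
    # (FileNotFoundError on Python 3).
    try:
        gettext.translation("base", localedir=empty_localedir, languages=["el"])
        missing_catalog_raised = False
    except OSError:
        missing_catalog_raised = True

    # With the fallback, the untranslated string comes back unchanged.
    _ = use_language("el", empty_localedir)
    greeting = _("Hello world")

print(missing_catalog_raised, greeting)
```

Whether a silent fallback is desirable depends on the application; raising early can be the better choice when a missing catalog indicates a packaging mistake.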

Now that we know how to setup basic i18n functionality for our program, let's explore some additional cases that we will encounter while translating our applications.

Finding Message Catalogs

When you need to locate translation files at runtime, you can use the find function provided by the gettext module. This function takes a few parameters and returns the .mo files it finds on disk.

You can pass a domain, a localedir and a list of languages. If you don't, the module will use the respective defaults, which is usually not what you want. For example, if you don't specify a localedir parameter, it will fall back to sys.prefix + '/share/locale', which is a global locale directory that can contain a lot of unrelated files.

When no languages are passed, the language portion of the path is taken from one of several environment variables that can be used to configure localization features (LANGUAGE, LC_ALL, LC_MESSAGES, and LANG). The first variable found to be set is used. Multiple languages can be selected by separating the values with a colon (:).

We can see an example of how this works in the program below. Start an interactive Python session inside the project base folder:

$ ipython
Python 3.6.4 (default, Jan 6 2018, 11:51:15)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.2.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import gettext

In [2]: gettext.find('base') # No localedir given, so nothing is found

In [3]: gettext.find('base', 'locales') # Language taken from the environment, here 'en'
Out[3]: 'locales/en/LC_MESSAGES/base.mo'

Now let's see what happens when we set the LANGUAGE environment variable to el:

In [5]: import os

In [6]: os.environ['LANGUAGE']='el'

In [7]: gettext.find('base', 'locales')
Out[7]: 'locales/el/LC_MESSAGES/base.mo'

As you can see it will pick up the environment value for language and use that as the languages parameter.

Let's test passing the multiple languages in the environment:

In [8]: os.environ['LANGUAGE']='el:en'

In [9]: gettext.find('base', 'locales')
Out[9]: 'locales/el/LC_MESSAGES/base.mo'

In [10]: gettext.find('base', 'locales', all=True) # Need to pass all=True to get all languages
Out[10]: ['locales/el/LC_MESSAGES/base.mo', 'locales/en/LC_MESSAGES/base.mo']

To get all translations we need to pass all=True; otherwise the call returns only the first one found.
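The same lookups can be reproduced without touching environment variables by passing the languages argument explicitly. The sketch below builds a throwaway locales tree with empty placeholder .mo files (find only checks that the files exist, it does not parse them); the folder layout mirrors the one used in this tutorial.

```python
import gettext
import os
import tempfile

with tempfile.TemporaryDirectory() as localedir:
    # Create empty placeholder catalogs for Greek and English.
    for lang in ("el", "en"):
        folder = os.path.join(localedir, lang, "LC_MESSAGES")
        os.makedirs(folder)
        open(os.path.join(folder, "base.mo"), "wb").close()

    first = gettext.find("base", localedir, languages=["el", "en"])
    every = gettext.find("base", localedir, languages=["el", "en"], all=True)
    absent = gettext.find("base", localedir, languages=["fr"])  # no French catalog

    # Make the paths relative so they are easier to read.
    found_first = os.path.relpath(first, localedir)
    found_all = [os.path.relpath(path, localedir) for path in every]

print(found_first, found_all, absent)
```

Passing languages explicitly keeps the result independent of how the machine running the code happens to be configured.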

Plural Rules

So far we have handled simple cases of translatable strings. There are other cases we need to be aware of, because gettext treats them specially. Pluralization, for example, is dependent on the language. Some languages have different rules for messages referring to one item or many items.

To make managing plurals easier (and possible), there is a separate set of functions for asking for the plural form of a message. One of them is the ngettext function. To understand how it works, let's add another function with a few messages containing plurals:

import gettext

el = gettext.translation('base', localedir='locales', languages=['el'])
el.install()
_ = el.gettext # Greek
ngettext = el.ngettext

def print_some_strings():
    print(_("Hello world"))
    print(_("This is a translatable string"))

def print_some_plural_strings(num):
    message1 = ngettext('{0} Human', '{0} Humans', num)
    message2 = ngettext('I possess {0} laptop', 'I possess {0} laptops', num)
    print(message1.format(num))
    print(message2.format(num))

print_some_plural_strings(1)
print_some_plural_strings(5)

We used the ngettext function, which takes three parameters. The first is the singular message, the second is the plural message, and the third is the quantity that decides which form is returned. The returned string is still unformatted, so before the call to format the messages look like this:

{0} Human
I possess {0} laptop
{0} Humans
I possess {0} laptops

When we run the program it will format the messages according to the number passed:

$ python main.py
1 Human
I possess 1 laptop
5 Humans
I possess 5 laptops
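The selection step can be tried in isolation, without any catalog on disk. The sketch below uses gettext.NullTranslations, which has no translations and therefore applies the built-in rule: the singular form for exactly 1, the plural form for everything else. The describe helper is only there for the demo.

```python
import gettext

# NullTranslations carries no catalog, so ngettext just picks between
# the two English forms that are passed in.
untranslated = gettext.NullTranslations()


def describe(num):
    """Return the chosen template and the formatted message for `num`."""
    template = untranslated.ngettext("{0} Human", "{0} Humans", num)
    return template, template.format(num)


for n in (0, 1, 5):
    print(n, describe(n))
```

Note that 0 selects the plural form under this rule, which is correct for English but not for every language, and that is why each catalog declares its own plural rule.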

Run the pygettext.py tool again to generate the new translatable strings. That will produce the following .pot file:

# SOME DESCRIPTIVE TITLE.
# Copyright (C) YEAR THE PACKAGE'S COPYRIGHT HOLDER
# This file is distributed under the same license as the PACKAGE package.
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
#, fuzzy
msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2018-01-30 20:20+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
"Language: \n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=CHARSET\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=n != 1;\n"

#: main.py:10
msgid "Hello world"
msgstr ""

#: main.py:11
msgid "This is a translatable string"
msgstr ""

#: main.py:14
#, python-brace-format
msgid "{0} Human"
msgid_plural "{0} Humans"
msgstr[0] ""
msgstr[1] ""

#: main.py:15
#, python-brace-format
msgid "I possess {0} laptop"
msgid_plural "I possess {0} laptops"
msgstr[0] ""
msgstr[1] ""

Note: If the pygettext.py command does not emit the plural entries, your version of the tool most likely does not support them. In that case you can use the xgettext tool, which ships with the original GNU gettext package, like this:

$ xgettext -L Python -o locales/base.pot main.py

Now, in addition to filling in the translation strings, we will also need to describe the way plurals are formed so the library knows how to index into the array for any given count value.

We need to add the following line in the header section:

Plural-Forms: nplurals=INTEGER; plural=EXPRESSION;\n

  • nplurals is an integer indicating the size of the array (the number of translations used)
  • plural is an expression for converting the incoming quantity to an index in the array when looking up the translation

For our example, English and Greek include two plural forms:

Plural-Forms: nplurals=2; plural=n != 1;

The singular translation would then go in position 0, and the plural translation in position 1.
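As a conceptual sketch of that lookup, the plural expression can be written as a small Python function whose result indexes into the list of translations. gettext does the equivalent internally after reading the header; plural_index and pick_plural are hypothetical names used only for this illustration, and the Greek strings are the ones from this tutorial's catalog.

```python
def plural_index(n):
    """Python equivalent of the header expression `plural=n != 1`."""
    return int(n != 1)


# The msgstr[...] array of a single catalog entry.
laptop_forms = [
    "Κατέχω {0} υπολογιστή",   # msgstr[0], singular
    "Κατέχω {0} υπολογιστές",  # msgstr[1], plural
]


def pick_plural(forms, n):
    """Select the form for quantity `n` and interpolate the number."""
    return forms[plural_index(n)].format(n)


for quantity in (1, 5):
    print(pick_plural(laptop_forms, quantity))
```

The value of nplurals must match the number of msgstr[...] entries, otherwise the index could point past the end of the array.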

Modify the Greek translation to include the plural rules:

# My App.
# Copyright (C) 2018
#
msgid ""
msgstr ""
"Project-Id-Version: 1.0\n"
"POT-Creation-Date: 2018-01-28 16:47+0000\n"
"PO-Revision-Date: 2018-01-28 16:48+0000\n"
"Last-Translator: me <johndoe@example.com>\n"
"Language-Team: Greek <yourteam@example.com>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=n != 1;\n"
"Generated-By: pygettext.py 1.5\n"

#: src/main.py:7
msgid "Hello world"
msgstr "Χαίρε Κόσμε"

#: src/main.py:8
msgid "This is a translatable string"
msgstr "Αυτό είναι ένα μεταφραζόμενο κείμενο"

#: main.py:14
#, python-brace-format
msgid "{0} Human"
msgid_plural "{0} Humans"
msgstr[0] "{0} 'Ατομο"
msgstr[1] "{0} 'Ατομα"

#: main.py:15
#, python-brace-format
msgid "I possess {0} laptop"
msgid_plural "I possess {0} laptops"
msgstr[0] "Κατέχω {0} υπολογιστή"
msgstr[1] "Κατέχω {0} υπολογιστές"

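Notice the comment starting with #, python-brace-format. It is a flag telling translators and tools that the string contains str.format()-style placeholders such as {0}, which must be preserved in the translation. Because we used this Python-specific style, the extraction tool annotated the entries accordingly. Another way of interpolating the messages is the older percent style with a named placeholder: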

'%(num)d'

Then we would have to format the messages in our program like this:

print(message1 % {'num': num})
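For a side-by-side comparison, here is a small sketch of both interpolation styles producing the same result. The variable names are illustrative only; the point is that a named percent placeholder such as %(num)d expects a mapping on the right-hand side of the % operator.

```python
num = 5

brace_template = "{0} Humans"        # str.format style (python-brace-format)
percent_template = "%(num)d Humans"  # percent style with a named placeholder

brace_result = brace_template.format(num)
percent_result = percent_template % {"num": num}  # mapping, not a bare number

print(brace_result, "|", percent_result)
```

Named placeholders have a practical benefit for translators: they can be reordered freely in the translated string without changing the calling code.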

Now generate the .mo  file as before. If you did all the steps correctly and run the program again you will see the translations happening:

$ python main.py
1 'Ατομο
Κατέχω 1 υπολογιστή
5 'Ατομα
Κατέχω 5 υπολογιστές

There are a lot of caveats regarding plural rules. For more information, I suggest you head over to the official docs.

Manipulating PO files

To load the PO files in your application and make some manipulations with them, unfortunately, there is no built-in solution. There is, however, a third party library called polib.

Install it first using pip3:

$ pip3 install polib

Let's import it into our app and load the .po file:

import polib

po = polib.pofile('./el/LC_MESSAGES/el.po') # specify the path to the .po file

Once loaded, you can inspect the .po entries:

for entry in po.translated_entries():
    print(entry.msgid, entry.msgstr)

To see the percentage of translated entries, call the percent_translated method:

po.percent_translated() # returns 100

Of course, if you were missing some translations you would have a lower percentage.

You can also create a new .po catalog and add entries to it. First, create a new .po file object and add the metadata header for it:

import datetime

po = polib.POFile()
po.metadata = {
    'Project-Id-Version': '1.0',
    'Report-Msgid-Bugs-To': 'johndoe@example.com',
    'POT-Creation-Date': datetime.datetime.now(),
    'PO-Revision-Date': datetime.datetime.now(),
    'Last-Translator': 'me <johndoe@example.com>',
    'Language-Team': 'Greek <yourteam@example.com>',
    'MIME-Version': '1.0',
    'Content-Type': 'text/plain; charset=utf-8',
    'Content-Transfer-Encoding': '8bit'
}

With the file created in memory, let's add some entries:

entry = polib.POEntry(
    msgid=u'Welcome to Python',
    msgstr=u'Καλώς ήρθατε στην Python'
)
po.append(entry)

entry = polib.POEntry(msgid=u'Hello world')
po.append(entry)

Inspect the percentage again:

po.percent_translated() # returns 50

And save it to a specified path:

po.save('./el/LC_MESSAGES/el.po')

You can also save it as a .mo file:

po.save_as_mofile('./el/LC_MESSAGES/el.mo')

If you inspect the contents of the files written to disk, you will see that they follow the correct file format. polib also supports iterating over all the entries. Check out its API documentation for more information.

Phrase

Phrase supports many different languages and frameworks, including Python. It allows you to easily import and export translation data and search for any missing translations, which is quite convenient. On top of that, you can collaborate with translators as it is much better to have professionally done localization for your website. If you’d like to learn more about Phrase, check out the Phrase Localization Suite.

Conclusion

In this article, we've seen how to translate Python applications with the GNU gettext module. We learned what gettext and the PO file format are. We saw how to add pluralization and interpolation rules. We also learned how to parse PO files with the polib third-party library.

I hope you enjoyed the article and that it helped you understand how to integrate i18n capabilities into your next Python app. If you want to keep learning about more topics surrounding localization and Python, make sure to check out the following guides: