Internationalization (i18n) is the process of designing a program so that it can support multiple languages. Localization (l10n) is the process of adapting an internationalized program to a particular language and its cultural conventions. A locale is a set of parameters that defines a user’s language, region, and any special variant preferences that should be reflected in the user interface of programs, scripts, or the desktop. Sometimes the locale consists only of a language code, like en or de, but more often it combines a language code with a country/region code, like en_US or de_DE. This is because some languages are spoken in several countries and some countries have more than one language. Often codeset information (which character set to use) and a modifier are added. On POSIX systems this results in a string of the form language[_territory][.codeset][@modifier].
Examples:
German language in Germany, encoded in UTF-8: de_DE.UTF-8
German language in Belgium, encoded in ISO-8859-1: de_BE.ISO-8859-1
German language in Belgium, encoded in ISO-8859-15 with the euro symbol: de_BE.ISO-8859-15@euro, often shortened to de_BE@euro. The modifier exists because ISO-8859-1 has no euro sign, whereas ISO-8859-15 and UTF-8 both include it.
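To make the language[_territory][.codeset][@modifier] structure concrete, here is a minimal Python sketch that splits such a string into its parts. The names LOCALE_RE, LocaleName and parse_locale are illustrative and not part of any library. The pattern is a simplification that does not handle special names such as C or POSIX.

```python
import re
from typing import NamedTuple, Optional

# Simplified pattern for language[_territory][.codeset][@modifier].
LOCALE_RE = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:_(?P<territory>[A-Za-z]{2}|\d{3}))?"
    r"(?:\.(?P<codeset>[^@]+))?"
    r"(?:@(?P<modifier>.+))?$"
)


class LocaleName(NamedTuple):
    language: str
    territory: Optional[str]
    codeset: Optional[str]
    modifier: Optional[str]


def parse_locale(name: str) -> LocaleName:
    """Split a POSIX-style locale name; missing optional parts become None."""
    match = LOCALE_RE.match(name)
    if match is None:
        raise ValueError(f"not a POSIX-style locale name: {name!r}")
    return LocaleName(**match.groupdict())


for name in ("de", "de_DE.UTF-8", "de_BE.ISO-8859-15@euro"):
    print(name, "->", parse_locale(name))
```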
Often GNU gettext or a similar system is used. The procedure consists of several steps:
1. A program is written using special tags or functions known to gettext, such as _. Every user-visible string is passed through this function, which looks up the translation at run time. The usual print("text") becomes print(_("text")).
2. The gettext framework provides tools that extract the strings passed to the _ function. In the above case it would extract “text”. These strings, usually in English although any other language will do, are called msgid and become the ID of a text. So here we have the message ID msgid "text". These message IDs are collected, together with other useful information such as source file locations, in a file with the extension *.pot (PO template).
3. Steps 1 and 2 are basically the internationalization of a program. The next step is the localization. Here gettext is used to create a localized version, which starts out as a copy of the *.pot file with the extension *.po. For British English it would be en_GB.UTF-8.po and for German as spoken in Germany it would be de_DE.UTF-8.po. Below each msgid entry you will find a msgstr "" entry. For a localization to be complete, you need to fill in every msgstr with the translation of its msgid. For English this would be msgid "text" and msgstr "text".
4. In principle, the localization is now done. However, the gettext runtime reads translations from a compiled binary catalog rather than from the *.po text file. The gettext framework includes tools to convert a *.po file into this binary format, which usually has the extension *.mo.
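The extraction in step 2 can be illustrated with a toy extractor. This is only a sketch of the idea, not how xgettext is implemented: it uses Python's ast module to find calls of the form _("literal") and prints them as empty PO entries. The function name extract_msgids is hypothetical.

```python
import ast
from typing import List, Tuple


def extract_msgids(source: str, keyword: str = "_") -> List[Tuple[int, str]]:
    """Return (line number, msgid) for every keyword("string literal") call."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == keyword
            and node.args
            and isinstance(node.args[0], ast.Constant)
            and isinstance(node.args[0].value, str)
        ):
            found.append((node.lineno, node.args[0].value))
    return sorted(found)


example = 'print(_("text"))\nprint("not marked for translation")\n'
for lineno, msgid in extract_msgids(example):
    # Each extracted string becomes an entry with an empty translation.
    print(f'#: example.py:{lineno}\nmsgid "{msgid}"\nmsgstr ""')
```

Only the string wrapped in _ is reported. The second print call is ignored, which is why every translatable string must be marked in step 1.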
Consider the following Python program main.py:
import gettext
_ = gettext.gettext

def print_some_strings():
    print(_("Hello world"))
    print(_("Internationalisation"))

if __name__ == '__main__':
    print_some_strings()
The corresponding *.pot file would look like:
# SOME DESCRIPTIVE TITLE.
# Copyright (C) YEAR ORGANIZATION
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"POT-Creation-Date: 2022-01-28 16:47+0000\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: pygettext.py 1.5\n"
#: src/main.py:5
msgid "Hello world"
msgstr ""
#: src/main.py:6
msgid "Internationalisation"
msgstr ""
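The template is then translated from the reference language (whatever the developer chose, here English with British spelling) to the target language (for example, English as spoken in the United States of America). In this case, most strings are simply copied; only the spelling of "Internationalisation" changes.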
# English translations for PACKAGE package.
# Copyright (C) 2022 THE PACKAGE'S COPYRIGHT HOLDER
# This file is distributed under the same license as the PACKAGE package.
# Automatically generated, 2022.
#
msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2022-05-18 06:38+0200\n"
"PO-Revision-Date: 2022-05-20 17:16+0200\n"
"Last-Translator: Automatically generated\n"
"Language-Team: none\n"
"Language: en_US\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Plural-Forms: nplurals=2; plural=(n != 1);\n"
#: src/main.py:5
msgid "Hello world"
msgstr "Hello world"
#: src/main.py:6
msgid "Internationalisation"
msgstr "Internationalization"
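Conceptually, a catalog like the one above is a mapping from msgid to msgstr. The following sketch makes that mapping explicit with a deliberately naive parser. It handles only single-line msgid/msgstr pairs and ignores escapes, multi-line strings and plural forms, so it is no substitute for the real gettext tools. The names SAMPLE_PO and parse_simple_po are illustrative.

```python
import re
from typing import Dict

ENTRY_RE = re.compile(r'^(msgid|msgstr)\s+"(.*)"\s*$')

# Shortened copy of the en_US catalog shown above.
SAMPLE_PO = r'''
msgid ""
msgstr ""
"Language: en_US\n"

#: src/main.py:5
msgid "Hello world"
msgstr "Hello world"

#: src/main.py:6
msgid "Internationalisation"
msgstr "Internationalization"
'''


def parse_simple_po(text: str) -> Dict[str, str]:
    """Map each single-line msgid to its msgstr, skipping the header entry."""
    catalog = {}
    msgid = None
    for line in text.splitlines():
        match = ENTRY_RE.match(line)
        if match is None:
            continue  # comments, header continuation lines, blank lines
        key, value = match.groups()
        if key == "msgid":
            msgid = value
        elif msgid:  # the header entry has an empty msgid and is skipped
            catalog[msgid] = value
            msgid = None
    return catalog


print(parse_simple_po(SAMPLE_PO))
```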
In short:
*.pot from the source code: xgettext --no-wrap --from-code=UTF-8 --keyword=_ -L Python --copyright-holder='NAME' --package-name='main.py' --package-version='0.1.0' --output=main.pot main.py
New *.po from *.pot: msginit --no-wrap --no-translator --input=main.pot --locale=en_US -o en_US.po
Update an existing *.po from a changed *.pot: msgmerge --no-wrap --backup=none --update en_US.po main.pot
*.mo from *.po: msgfmt -o en_US.mo en_US.po
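When several locales are maintained, the same commands can be assembled by a small script. The sketch below assumes the GNU gettext tools are installed and on PATH. The helper names gettext_commands and run_all are hypothetical, and only a subset of the options shown above is passed. The example at the bottom only prints the commands; calling run_all would execute them.

```python
import shlex
import shutil
import subprocess
from typing import List


def gettext_commands(source: str, domain: str, locale: str) -> List[List[str]]:
    """Build the extract, init and compile commands for one locale."""
    pot, po, mo = f"{domain}.pot", f"{locale}.po", f"{locale}.mo"
    return [
        ["xgettext", "--from-code=UTF-8", "--keyword=_", "-L", "Python",
         f"--output={pot}", source],
        ["msginit", "--no-translator", f"--input={pot}",
         f"--locale={locale}", "-o", po],
        ["msgfmt", "-o", mo, po],
    ]


def run_all(commands: List[List[str]]) -> None:
    """Run the commands in order, failing early if a tool is missing."""
    for argv in commands:
        if shutil.which(argv[0]) is None:
            raise RuntimeError(f"{argv[0]} not found; is GNU gettext installed?")
        subprocess.run(argv, check=True)


for argv in gettext_commands("main.py", "main", "en_US"):
    print(shlex.join(argv))
```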
While the flat layout above works, it is more common to store the catalogs in the directory layout that gettext expects, locale/LANGUAGE/LC_MESSAGES/DOMAIN.mo, such as:
locale
├── de_DE
│   └── LC_MESSAGES
│       ├── main.mo
│       └── main.po
├── en_US
│   └── LC_MESSAGES
│       ├── main.mo
│       └── main.po
├── ja_JP
│   └── LC_MESSAGES
│       ├── main.mo
│       └── main.po
└── main.pot
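With this layout, Python's standard gettext module can load a catalog by domain and language. The following is a minimal sketch, assuming the domain is named main and the locale directory sits next to the script. The helper name load_translator is illustrative. With fallback=True a missing catalog does not raise an error, and the original msgid is returned instead.

```python
import gettext
from typing import Callable


def load_translator(
    language: str, domain: str = "main", localedir: str = "locale"
) -> Callable[[str], str]:
    """Return a gettext function for localedir/language/LC_MESSAGES/domain.mo."""
    translation = gettext.translation(
        domain, localedir=localedir, languages=[language], fallback=True
    )
    return translation.gettext


_ = load_translator("de_DE")
# Prints the German text if locale/de_DE/LC_MESSAGES/main.mo exists,
# otherwise the untranslated msgid.
print(_("Hello world"))
```

Binding _ to the function returned for a specific catalog, rather than to the module-level gettext.gettext, makes it explicit which catalog is in use.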
To use *.mo files in Python with jinja2, you can enable the extension jinja2.ext.i18n:
env = Environment(loader=file_loader, extensions=["jinja2.ext.i18n"])
This allows the use of gettext and ngettext function calls in the template, once translations have been installed on the environment (for example with env.install_gettext_translations):
{{ gettext('Hello World') }}
It is also possible to configure jinja2 to use a default call or to rename the function call.
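Enabling the extension only makes the calls available; a catalog still has to be installed on the environment. The sketch below is one way to wire this up, assuming the main domain and a locale directory as in the layout shown earlier. The template name hello.txt and the helper render_hello are made up for illustration, and a DictLoader stands in for a real file loader so the example is self-contained.

```python
import gettext

from jinja2 import DictLoader, Environment


def render_hello(language: str = "de_DE", localedir: str = "locale") -> str:
    """Render a one-line template with the catalog for the given language."""
    # fallback=True yields a pass-through catalog if no *.mo file is found.
    translations = gettext.translation(
        "main", localedir=localedir, languages=[language], fallback=True
    )
    env = Environment(
        loader=DictLoader({"hello.txt": "{{ gettext('Hello World') }}"}),
        extensions=["jinja2.ext.i18n"],
    )
    # Makes gettext and ngettext resolve against the loaded catalog.
    env.install_gettext_translations(translations)
    return env.get_template("hello.txt").render()


print(render_hello())
```

If no catalog exists for the requested language, the template renders the untranslated string.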
This document describes the basics of GNU gettext. Advanced topics such as number formatting and singular/plural handling with ngettext are not covered.
Version | Date | Notes |
---|---|---|
0.1.2 | 2023-04-13 | Improve writing |
0.1.1 | 2023-01-18 | Add Jinja2 section |
0.1.0 | 2022-05-30 | Initial release |