FOSS Localization/Annex A: Key Concepts

This annex provides a quick tour of the key concepts of localization, so that those interested in localizing FOSS for their own language get a broad picture of the kind of knowledge that is needed. The next annex provides the technical details required to get started.

Standardization
When two or more entities interact, common conventions are important. Car drivers must abide by traffic rules to prevent accidents. People need common conventions on languages and gestures to communicate. Likewise, software needs standards and protocols to interoperate seamlessly. In terms of software engineering, contracts between parts of programs need to be established before implementation. Such contracts matter most for systems developed by a large group of individual developers from different backgrounds, and are essential for cross-platform interoperability.

Standards provide such contracts for all computing systems in the world. Software developers need to conform to such conventions to prevent miscommunication. Therefore, standardization should be the very first step for any kind of software development, including localization.

To start localization, it is a good idea to study related standards and use them throughout the project. Nowadays, many international standards and specifications have been developed to cover the languages of the world. If these do not fit the project's needs, one may consider participating in standardization activities. Important sources are:


 * ISO/IEC JTC1 (International Organization for Standardization and International Electrotechnical Commission Joint Technical Committee 1)
 * A joint technical committee for international standards for information technology. There are many subcommittees (SC) for different categories, under which working groups (WG) are formed to work on subcategories of standards. For example, ISO/IEC JTC1/SC2/WG2 is the working group for the Universal Coded Character Set (UCS). The standardization process, however, proceeds in a closed manner. If the national standards body is an ISO/IEC member, it can propose the requirements for the project. Otherwise, one may need to approach individual committees. They may ask for participation as a specialist. Information for JTC1/SC2 (coded character sets) is published at http://anubis.dkuug.dk/JTC1/SC2 . Information for JTC1/SC22 (programming languages, their environments and system software interfaces) is at http://anubis.dkuug.dk/JTC1/SC22.


 * Unicode Consortium
 * A non-profit organization working on a universal character set. It is closely related to ISO/IEC JTC1 subcommittees. Its Web site is at http://www.unicode.org, where channels of contribution are provided.


 * Free Standards Group
 * A non-profit organization dedicated to accelerating the use of FOSS by developing and promoting standards. Its Web site is at http://www.freestandards.org . It is open to participation. There are a number of work groups under its umbrella, including OpenI18N for internationalization ( http://www.openi18n.org ).

Note, however, that some issues such as national keyboard maps and input/output methods are not covered by the standards mentioned above. The national standards body should define these standards, or unify existing solutions used by different vendors, so that users can benefit from the consistency.

Unicode
Characters are the most fundamental units for representing text data of any particular language. In mathematical terms, the character set defines the set of all characters used in a language. In ICT terms, the character set must be encoded as bytes in the storage, according to some conventions, called encoding. These conventions must be agreed upon both by the sender and receiver of data for the information to remain intact and exact.

In the 1970s, the character set used by most programs consisted of the letters of the English alphabet, decimal digits and some punctuation marks. The most widely used encoding was 7-bit ASCII (American Standard Code for Information Interchange), which can represent up to 128 characters, just sufficient for English. When the need to use non-English languages in computers arose, other encodings were defined. Codepages were devised as extensions to ASCII: a second set of 128 characters was added in the upper half, making an 8-bit code table of 256 positions in total. Vendors defined several codepages for decorative special characters and for accented Latin letters. Some non-European scripts, such as Hebrew and Thai, were also added by this strategy, and national standards were defined for character encoding.
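
The effect of codepages can be illustrated with a minimal Python sketch. It decodes one and the same byte under three single-byte encodings; the choice of codecs and of the byte value 0xE9 is only an assumption made for illustration, and the helper name decode_under is hypothetical.

```python
# Decode the same byte under several single-byte encodings (codepages).
# The codec names are those of Python's standard library.
CODECS = ["latin_1", "iso8859_8", "tis_620"]


def decode_under(byte_value, codecs=CODECS):
    """Return {codec: character} for a single byte value (0..255)."""
    raw = bytes([byte_value])
    return {codec: raw.decode(codec) for codec in codecs}


if __name__ == "__main__":
    # 0x41 lies in the ASCII half, so every codepage agrees on it.
    print(decode_under(0x41))
    # 0xE9 lies in the upper half, so its meaning depends on the codepage.
    for codec, char in decode_under(0xE9).items():
        print(f"{codec:10s} U+{ord(char):04X} {char}")
```

The byte in the upper half comes out as a Latin, a Hebrew or a Thai character depending on which codepage the receiver assumes, which is why sender and receiver must agree on the encoding.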

The traditional encoding systems were not suitable for Asian languages that have large character sets and particular complexities. For example, encoding the Han characters used by Chinese, Japanese and Korean (CJK), the total number of which is still not determined, is much more complicated. A large number of codepages would have to be defined to cover all of them. Moreover, compatibility with other single-byte encodings is another significant challenge. The result was a number of multi-byte encodings for CJK.
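
As a rough illustration, the following Python sketch encodes an ASCII letter and one Han character under four legacy CJK encodings. The sample character and the list of codecs are assumptions chosen for the example; the helper name encode_all is hypothetical.

```python
# Encode the same text under several legacy CJK encodings.
CJK_CODECS = ["shift_jis", "euc_kr", "big5", "gb2312"]
HAN = "\u4e2d"  # a Han character shared by Chinese, Japanese and Korean


def encode_all(text, codecs=CJK_CODECS):
    """Return {codec: encoded bytes} for the given text."""
    return {codec: text.encode(codec) for codec in codecs}


if __name__ == "__main__":
    # ASCII letters keep their single-byte value in all four encodings.
    print(encode_all("A"))
    # The Han character needs two bytes, and each encoding picks its own.
    for codec, raw in encode_all(HAN).items():
        print(f"{codec:10s} {len(raw)} bytes: {raw.hex()}")
```

Each encoding stays compatible with single-byte ASCII, yet assigns its own two-byte sequence to the same character, so data cannot be exchanged without knowing which encoding was used.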

However, having a lot of encoding standards to support is a problem for software developers. A group of vendors thus agreed to work together to define a single character set that covers the characters of all languages of the world, so that developers have a single point of reference, and users have a single encoding. The Unicode Consortium was thus founded. Major languages in the world were added to the code table. Later on, ISO and IEC formed JTC1/SC2/WG2 to standardize the code table, which is published as ISO/IEC 10646. The Unicode Consortium is also a member of the working group, along with the standards bodies of ISO member countries. Unicode and ISO/IEC 10646 are synchronized, so the code tables are the same. But Unicode also provides additional implementation guidelines, such as character properties, rendering, editing, string collation, etc.
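
Some of the character properties mentioned above can be inspected with Python's standard unicodedata module. The sketch below is only an illustration; the helper name describe and the selection of properties are assumptions.

```python
import unicodedata


def describe(char):
    """Collect a few Unicode properties of a single character."""
    return {
        "code_point": f"U+{ord(char):04X}",
        "name": unicodedata.name(char),
        "category": unicodedata.category(char),    # e.g. Lu = uppercase letter
        "combining": unicodedata.combining(char),  # 0 for non-combining characters
        "utf8": char.encode("utf-8").hex(),
    }


if __name__ == "__main__":
    for char in ["A", "\u0e49"]:  # a Latin letter and a Thai tone mark
        print(describe(char))
```

The code point is the same everywhere, while the properties tell software how the character behaves, for example that the Thai tone mark is a non-spacing mark that combines with a base character.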

Nowadays, many applications have moved to Unicode and have benefited from the clear definitions for supporting new languages. Users of Unicode are able to exchange information in their own languages, especially through the Internet, without compatibility issues.

Fonts
Once the character set and encoding of a script are defined, the first step to enabling it on a system is to display it. Rendering text on the screen requires some resource to describe the shapes of the characters, i.e., the fonts, and some process to render the character images as per script conventions. The process is called the output method. This section will try to cover important aspects of these requirements.

Characters and Glyphs
A font is a set of glyphs for a character set. A glyph is an appearance form of a character or a sequence of characters. It is quite important to distinguish the concepts of characters and glyphs. For some scripts, a character can have more than one variation, depending on the context. In that case, the font may contain more than one glyph for each of those characters, so that the text renderer can dynamically pick the appropriate one. On the other hand, the concept of ligatures, such as "ff" in English text, also allows some sequence of characters to be drawn together. This introduces another kind of mapping of multiple characters to a single glyph.
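
The many-to-one mapping introduced by ligatures can be sketched in Python as follows. The ligature table and the glyph names (such as f_f_i) are hypothetical; a real font defines its own glyphs and substitution rules.

```python
# Hypothetical ligature table: character sequence -> glyph name.
LIGATURES = {"ffi": "f_f_i", "ff": "f_f", "fi": "f_i"}


def to_glyphs(text, ligatures=LIGATURES):
    """Map characters to glyph names, preferring the longest ligature."""
    longest = max(len(sequence) for sequence in ligatures)
    glyphs = []
    i = 0
    while i < len(text):
        for size in range(longest, 1, -1):
            chunk = text[i:i + size]
            if chunk in ligatures:
                glyphs.append(ligatures[chunk])
                i += len(chunk)
                break
        else:
            # No ligature starts here: one character maps to one glyph.
            glyphs.append(text[i])
            i += 1
    return glyphs


if __name__ == "__main__":
    print(to_glyphs("office"))
```

The six characters of "office" end up as four glyphs, because the sequence "ffi" is drawn as a single shape.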

Bitmap and Vector Fonts
In principle, there are two methods of describing glyphs in fonts: bitmaps and vectors. Bitmap fonts describe glyph shapes by plotting the pixels directly onto a two-dimensional grid of determined size, while vector fonts describe the outlines of the glyphs with line and curve drawing instructions. In other words, bitmap fonts are designed for a particular size, while vector fonts are designed for all sizes. The quality of the glyphs rendered from bitmap fonts always drops when they are scaled up, while that from vector fonts does not. However, vector fonts often render poorly in small sizes on low-resolution devices, such as computer screens, due to the limited pixels available to fit the curves. In this case, bitmap fonts may be more precise.
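
Why bitmap glyphs degrade when scaled up can be seen in a small Python sketch. The 5x5 glyph and the nearest-neighbour scaling are assumptions made for illustration; the helper name scale_bitmap is hypothetical.

```python
# A hypothetical 5x5 bitmap glyph of the letter A ('#' = pixel set).
GLYPH = [
    "..#..",
    ".#.#.",
    "#...#",
    "#####",
    "#...#",
]


def scale_bitmap(rows, factor):
    """Nearest-neighbour upscale: each pixel becomes a factor x factor block."""
    scaled = []
    for row in rows:
        wide = "".join(pixel * factor for pixel in row)
        scaled.extend([wide] * factor)
    return scaled


if __name__ == "__main__":
    print("\n".join(scale_bitmap(GLYPH, 3)))
```

The enlarged glyph contains no more detail than the original: the diagonal strokes turn into visible staircases, whereas a vector outline would simply be rasterized again at the new size.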

Nevertheless, the quality problem at low resolution has been addressed by font technology. For example:


 * 1) Hinting, additional guideline information stored in the fonts for rasterizers to fit the curves in a way that preserves the proper glyph shape.
 * 2) Anti-aliasing, capability of the rasterizer to simulate unfitted pixels with some illusion to human perception, such as using grayscales and coloured-subpixels, resulting in the feeling of "smooth curves."

These can improve the quality of vector fonts at small sizes. Moreover, the need for bitmap fonts in modern desktops is gradually diminishing.
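
Grayscale anti-aliasing can be approximated by supersampling, as in the Python sketch below. This is only one simple way to estimate pixel coverage, not how any particular rasterizer is implemented; the disc shape, the grid size and the function names are assumptions.

```python
def coverage(px, py, inside, samples=4):
    """Fraction of the sub-sample points of pixel (px, py) inside the shape."""
    hits = 0
    for i in range(samples):
        for j in range(samples):
            x = px + (i + 0.5) / samples
            y = py + (j + 0.5) / samples
            if inside(x, y):
                hits += 1
    return hits / (samples * samples)


def rasterize(width, height, inside, samples=4):
    """Return rows of gray levels between 0.0 (empty) and 1.0 (fully covered)."""
    return [[coverage(x, y, inside, samples) for x in range(width)]
            for y in range(height)]


def in_disc(x, y, cx=4.0, cy=4.0, radius=3.0):
    """A disc stands in for a curved glyph outline."""
    return (x - cx) ** 2 + (y - cy) ** 2 <= radius * radius


if __name__ == "__main__":
    for row in rasterize(8, 8, in_disc):
        print(" ".join(f"{value:.2f}" for value in row))
```

Pixels on the edge of the shape receive intermediate gray levels instead of being switched fully on or off, which the eye perceives as a smoother curve.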

Font Formats
Currently, the X Window system for GNU/Linux desktop supports many font formats.

BDF Fonts
BDF (Bitmap Distribution Format) is a bitmap font format of the X Consortium for exchanging fonts in a form that is both human-readable and machine-readable. Its content is plain text.
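
Because BDF is plain text, a glyph entry can be read with a few lines of code. The Python sketch below parses a single hand-written glyph entry; the sample data and the function name parse_bdf_glyph are assumptions, and a complete BDF file also carries font-level header properties that are ignored here.

```python
# A hand-written glyph entry in BDF syntax (sample data for illustration).
SAMPLE = """\
STARTCHAR A
ENCODING 65
SWIDTH 500 0
DWIDTH 8 0
BBX 8 5 0 0
BITMAP
18
24
42
7E
42
ENDCHAR
"""


def parse_bdf_glyph(text):
    """Return the encoding, bounding box size and pixel rows of one glyph."""
    glyph = {"encoding": None, "width": None, "height": None, "rows": []}
    in_bitmap = False
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        keyword = fields[0]
        if keyword == "ENCODING":
            glyph["encoding"] = int(fields[1])
        elif keyword == "BBX":
            glyph["width"], glyph["height"] = int(fields[1]), int(fields[2])
        elif keyword == "BITMAP":
            in_bitmap = True
        elif keyword == "ENDCHAR":
            break
        elif in_bitmap:
            # Each bitmap line is one pixel row in hexadecimal, padded to bytes.
            bits = bin(int(keyword, 16))[2:].zfill(len(keyword) * 4)
            row = bits[:glyph["width"]]
            glyph["rows"].append(row.replace("1", "#").replace("0", "."))
    return glyph


if __name__ == "__main__":
    print("\n".join(parse_bdf_glyph(SAMPLE)["rows"]))
```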

PCF Fonts
PCF (Portable Compiled Format) is just the compiled form of the BDF format. It is binary and thus, only machine-readable. The utility that compiles BDF into PCF is bdftopcf. Although BDF fonts can be directly installed into the X Window system, they are usually compiled for better performance.

Type 1 Fonts
Type 1 is a vector font standard devised by Adobe and supported by its PostScript standard. It is therefore well supported under most UNIX and GNU/Linux systems, through the X Window system and Ghostscript, and is the recommended format for traditional UNIX printing.

TrueType Fonts
TrueType is a vector font standard developed by Apple, and is also used in Microsoft Windows. Its popularity has grown along with the growth of Windows. XFree86 supports TrueType fonts with the help of the FreeType library, and Ghostscript supports TrueType as well. Thus, it has become another potential choice for fonts on GNU/Linux desktops.

OpenType Fonts
Recently, Adobe and Microsoft have agreed to create a new font standard that covers both Type 1 and TrueType technologies with some enhancements to cover the requirements of different scripts in the world. The result is OpenType.

An OpenType font can describe glyph outlines with either Type 1 or TrueType splines. In addition, information for relative glyph positioning (namely, GPOS table) has been added for combining marks to base characters or to other marks, as well as some glyph substitution rules (namely, GSUB table), so that it is flexible enough to draw characters of various languages.

Output Methods
An output method is a procedure for drawing text on output devices. It converts text strings into sequences of properly positioned glyphs of the given fonts. For simple cases like English, the character-to-glyph mapping may be straightforward. But for other scripts the output methods are more complicated. Some scripts use combining marks, some are written in directions other than left-to-right, some have glyph variations of a single character, some require character reordering, and so on.
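
One of the first tasks of an output method for a script with combining marks is to group each base character with the marks attached to it. The Python sketch below does this using Unicode general categories only; it is a simplification under that assumption (real rendering engines apply more elaborate, script-specific rules), and the function name clusters is hypothetical.

```python
import unicodedata

MARK_CATEGORIES = ("Mn", "Mc", "Me")  # Unicode general categories for marks


def clusters(text):
    """Split text into base characters followed by their combining marks."""
    result = []
    for char in text:
        is_mark = unicodedata.category(char) in MARK_CATEGORIES
        if is_mark and result:
            result[-1] += char   # attach the mark to the preceding base
        else:
            result.append(char)  # start a new cluster
    return result


if __name__ == "__main__":
    thai = "\u0e17\u0e35\u0e48\u0e19\u0e35\u0e48"  # six characters
    for cluster in clusters(thai):
        print([f"U+{ord(c):04X}" for c in cluster])
```

The six Thai characters form only two clusters, each a consonant with a vowel sign and a tone mark stacked above it, so the renderer has to position the marks relative to their base rather than advance after every character.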

With traditional font technologies, the information for handling complex scripts is not stored in the fonts. So the output methods bear the burden. But with OpenType fonts, where all of the rules are stored, the output methods just need the capability to read and apply the rules.

Output methods are implemented differently in different systems. For X Window, the output method is called X Output Method (XOM). GTK+ uses a separate module called Pango. Qt implements the output method in some of its classes. Modern rendering engines are now capable of using OpenType fonts. So, there are two ways of drawing texts in output method implementations. If you are using TrueType or Type 1 fonts and your script has some complications beyond those of Latin-based languages, you need to provide an output method that knows how to process and typeset characters of your script. Otherwise, you may use OpenType fonts with OpenType tables that describe rules for glyph substitution and positioning.

Input Methods
There are many factors in the design and implementation of input methods. The more different the character set size and the input device capability are, the more complicated the input method becomes. For example, inputting English characters with a 104-key keyboard is straightforward (mostly one-to-one - that is, one key stroke produces one character), while inputting English with mobile phone keypad requires some more steps. For languages with huge character sets, such as CJK, character input is very complicated, even with PC keyboards.

Therefore, analysis and design are important stages of input method creation. The first step is to list all the characters (not glyphs) needed for input, including digits and punctuation marks. The next step is to decide whether they can be matched one-to-one with the available keys, or whether some composing (like European accents) or conversion (like romanized input for CJK, such as Japanese Romaji) mechanism is needed, in which multiple key strokes are required to input some characters.
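
The composing mechanism can be sketched in Python as a table lookup over pairs of key strokes. The compose table below is a hypothetical example for a few European accents, not the table of any real input method.

```python
# Hypothetical compose table: (accent key, base key) -> composed character.
COMPOSE = {
    ("'", "e"): "\u00e9",  # e with acute accent
    ("`", "a"): "\u00e0",  # a with grave accent
    ('"', "u"): "\u00fc",  # u with diaeresis
}


def compose(keystrokes, table=COMPOSE):
    """Turn a sequence of key strokes into text, combining accent + letter."""
    accent_keys = {accent for accent, _ in table}
    output = []
    pending = None  # an accent key waiting for its base letter
    for key in keystrokes:
        if pending is not None:
            # Known pair: emit the composed character; otherwise emit both keys.
            output.append(table.get((pending, key), pending + key))
            pending = None
        elif key in accent_keys:
            pending = key
        else:
            output.append(key)
    if pending is not None:
        output.append(pending)
    return "".join(output)


if __name__ == "__main__":
    print(compose("caf'e"))
```

Five key strokes produce the four characters of "café", while a pair that is not in the table, as in "it's", passes through unchanged.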

When the input scheme is decided for the script, the keyboard layout may be designed. A good keyboard layout should help users by putting the most frequently used characters in the home row, and the rest in the upper and lower rows. If the script has no concept of upper/lower cases (which is the case for most non-Latin scripts), rare characters may be put in the shift positions.

Then, there are two major steps to implement the input method. First, a map of the keyboard layout is created. This is usually an easy step, as there are existing keyboard maps to refer to. Then, if necessary, the second step is to write the input method based on the keyboard map. In general, this means writing an input method module to plug into the system framework.

Locales
Locale is a term introduced by the concept of internationalization (I18N), in which generic frameworks are made so that the software can adjust its behaviour to the requirements of different native languages, cultural conventions and coded character sets, without modification or re-compilation.

Within such frameworks, locales are defined for describing particular cultures. Users can configure their systems to pick up their locales. The programs will load the corresponding predefined locale definition to accomplish internationalized functions. Therefore, to make internationalized software support a new language or culture, one must create a locale definition and fill in the required information, and things will work without having to touch the software code.

According to POSIX, a number of C library functions, such as those for date and time formatting, string collation, and numerical and monetary formatting, are locale-dependent. ISO/IEC 14652 has added more features to POSIX locale specifications and defined new categories for paper size, measurement unit, address and telephone formats, and personal names. The GNU C library has implemented all of these categories. Thus, cultural conventions may be described through it.
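
Python's standard locale module wraps these C library facilities, so the locale-dependent behaviour can be observed with a short sketch. The function name numeric_conventions and the locale name de_DE.UTF-8 are assumptions; which locales are installed depends on the system, so only the "C" locale can be relied upon.

```python
import locale


def numeric_conventions(locale_name):
    """Return a few conventions of the named locale, or None if unavailable."""
    saved = locale.setlocale(locale.LC_ALL)  # remember the current setting
    try:
        locale.setlocale(locale.LC_ALL, locale_name)
    except locale.Error:
        return None
    try:
        conv = locale.localeconv()
        return {
            "decimal_point": conv["decimal_point"],
            "thousands_sep": conv["thousands_sep"],
            # Collation: does "a" sort before "B" in this locale?
            "a_before_B": locale.strcoll("a", "B") < 0,
        }
    finally:
        locale.setlocale(locale.LC_ALL, saved)  # restore the previous setting


if __name__ == "__main__":
    for name in ["C", "de_DE.UTF-8"]:
        print(name, numeric_conventions(name))
```

The program code stays the same for every locale; only the locale definition loaded at run time changes the decimal point, the grouping separator and the collation order.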

Locale definitions are discussed in detail on pages 41–42.

Translation
Translating messages in programs, including menus, dialog boxes, button labels, error messages, and so on, ensures that local users who are not familiar with English can use the software. This task can be accomplished only after the input methods, output methods and fonts are done; otherwise the translated messages will be useless.

There are many message translation frameworks available, but the general concepts are the same. Messages are extracted into a working file to be translated and compiled into a hash table. When the program executes, it loads the appropriate translation data as per locale. Then, messages are quickly looked up for the translation to be used in the user interface.
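
The lookup step can be sketched in Python with ordinary dictionaries standing in for the compiled hash tables. This is not the API of any real framework such as GNU gettext; the catalogs, their sample translations and the function names are assumptions for illustration.

```python
# Hypothetical in-memory catalogs: language -> {original message: translation}.
CATALOGS = {
    "fr": {"Open": "Ouvrir", "Save": "Enregistrer"},
    "th": {"Open": "เปิด", "Save": "บันทึก"},
}


def load_catalog(locale_name, catalogs=CATALOGS):
    """Pick the catalog for a locale name such as 'fr_FR.UTF-8'."""
    language = locale_name.split(".")[0].split("_")[0]
    return catalogs.get(language, {})


def translate(message, catalog):
    """Look up a message; fall back to the original if it is untranslated."""
    return catalog.get(message, message)


if __name__ == "__main__":
    catalog = load_catalog("fr_FR.UTF-8")
    for message in ["Open", "Save", "Quit"]:
        print(message, "->", translate(message, catalog))
```

Falling back to the original message means that a partially translated program remains usable, with untranslated strings simply appearing in English.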

Translation is a labour-intensive task. It takes time to translate a huge number of messages, which is why it is always done by a group of people. When forming a team, make sure that all members use consistent terminology in all parts of the programs. Therefore it is vital to work together in a forum through close discussion and to build the glossary database from the decisions made collectively. Sometimes the translator needs to run the program to see the context surrounding the message, in order to find a proper translation. At other times the translator needs to investigate the source code to locate conditional messages, such as error messages. Translating each message individually in a literal manner, without running the program, can often result in incomprehensible outputs.

Like other FOSS development activities, translation is a long-term commitment. New messages are usually introduced in every new version. Even though all messages have been completed in the current version, it is necessary to check for new messages before the next release. There is usually a string freeze period before a version is released, when no new strings are allowed in the code base, and an appropriate time period is allocated for the translators. Technical aspects of the message translation process are discussed on page 45.

GNU/Linux Desktop Structure
Before planning to enable a language on the GNU/Linux desktop, a clear understanding of its overall structure is required. The GNU/Linux desktop is composed of layers of subsystems working on top of one another. Every layer has its own locale-dependent operations. Therefore, to enable a language completely, it is necessary to work in all layers. The layers, from the bottom up, are as follows (see Figure 1):




 * The C Library
 * C is the programming language of the lowest level for developing GNU/Linux applications. Other languages rely on the C library to make calls to the operating system kernel.


 * The X Window
 * In most UNIX systems, the graphical environment is provided by the X Window system. It is a client-server system. X servers are the agents that provide service to control hardware devices, such as video cards, monitors, keyboards, mice or tablets, as well as pass user input events from the devices to the clients. X clients are GUI application programs that request the X server to draw graphical objects on the screen, and accept user inputs via the events fed by the X server. Note that with this architecture, the X client and server can be on different machines in the network. In that case, the X server runs on the machine that the user operates, while the X client can be a process running on the same machine or on a remote machine in the network.


 * Toolkits
 * Writing programs using the low-level Xlib can be tedious, and it leads to inconsistent GUIs when all applications draw menus and buttons according to their own preferences. Some libraries have been developed as a middle layer to help reduce both problems. In X terminology, these libraries are called toolkits, and the GUI components they provide, such as buttons, text entries, etc., are called widgets. Many toolkits have been developed in the past, either by the X Consortium itself, like the X Toolkit and Athena widget set (Xaw), or by vendors, like XView from Sun, Motif from Open Group, etc. In the FOSS realm, the toolkits most widely adopted are GTK+ (The GIMP Toolkit) and Qt.
 * Desktop Environments
 * Toolkits help developers create a consistent look-and-feel among a set of programs. But to make a complete desktop, applications need to interoperate more closely to form a convenient workplace. The concept of the desktop environment was invented to provide common conventions, resource sharing and communication among applications. The first desktop environment ever created on UNIX platforms was CDE (Common Desktop Environment) by Open Group, based on its Motif toolkit. But it is proprietary. The first FOSS desktop environment for GNU/Linux is KDE (K Desktop Environment), based on Trolltech’s Qt toolkit. However, some developers objected to the licensing conditions of Qt at that time. A second one was thus created, called GNOME (GNU Network Object Model Environment), based on GTK+. Nowadays, although the licensing issue of Qt has been resolved, GNOME continues to grow and get more support from vendors and the community. KDE and GNOME have thus become the desktops most widely used on GNU/Linux and other FOSS operating systems such as FreeBSD.

Each component is internationalized, allowing local implementation for different locales:
 * GNU C Library
 * Internationalized according to POSIX and ISO/IEC 14652.


 * XFree86 (and X Window in general)
 * Internationalization in this layer includes X locale (XLC), describing font sets and character code conversion; X Input Method (XIM) for the text input process, in which X Keyboard Extension (XKB) is used to describe the keyboard map; and X Output Method (XOM) for text rendering. XOM was implemented too late, when both GTK+ and Qt had already handled rendering with their own solutions. Therefore, it is questionable whether XOM is still needed.


 * GTK+
 * For GTK+ 2, internationalization frameworks have been defined in a modular way. It has its own input method framework called GTK+ IM, where input method modules can be dynamically plugged in as per user command. Text rendering in GTK+ 2 is handled by a separate general-purpose text layout engine called Pango. Pango can be used by any application that needs to render multilingual texts, not just by GTK+.


 * Qt
 * Internationalization in Qt 3 is done in a minimal way. It relies solely on XIM for all text inputs, and handles text rendering with QComplexText C++ class, which relies completely on Unicode data for character properties from Unicode.org.

For the desktop environment layer, namely, GNOME and KDE, there is no additional internationalization apart from what is provided by GTK+ and Qt.