Naming Languages

As part of the Novetta Mission Analytics team, I work on a data pipeline that ingests traditional and social media from around the world, enriches it, and makes the enriched data available to customers. Enrichment can involve any number of steps, many of them powered by machine learning, and one of the earliest and most common steps is translation.

When new content arrives, the source language is often unknown and must be detected; if the source language is different from the target language, the content is also translated. In order to translate this volume of content automatically, accurately, and cost effectively, we rely on multiple cloud translation services.

To the surprise of no one, cloud translation services differ not only in pricing but also in the languages they support and in the quality of translation across them. It’s often most cost effective to perform language detection with one service and, depending on the detected language, translation with another. In addition, these services occasionally use different identifiers to refer to the same language, which requires us to do some mapping on our end. For example, Microsoft Azure’s cloud translation service uses zh-Hant to refer to Chinese in traditional script, while Google Cloud’s uses zh-TW.

How can two different identifiers refer to the same language? Well, it turns out that Hant refers to a script (specifically, the one that uses traditional Chinese characters) and TW refers to a region (Taiwan). When translating textual content, script seems more relevant, but it turns out that where a language is spoken is often a pretty good proxy for a number of defining language characteristics, script among them in this case.

It’s clear then that zh must refer to the Chinese language, but upon closer inspection that’s not exactly straightforward either. Chinese, at least in this context, is considered a macrolanguage—a group of individual languages that includes Mandarin, Cantonese, Gan, Min, Wu, and others. When spoken, many of these languages are virtually unintelligible to speakers of the others.

It would seem that referring to a group of languages that are mutually unintelligible would make for a lousy language identifier in a translation service. In a purely spoken context this might be true, but when written, varieties of Chinese generally use one of two scripts: traditional Chinese characters (Hant) or simplified Chinese characters (Hans). Partly for historical reasons—zh was in circulation before there were ever identifiers for individual varieties of Chinese—and partly because script tends to be the key distinguishing piece of information for Chinese text, we’ll often see zh-Hant rather than a more specific identifier, like cmn-Hant (Mandarin in traditional script).

Mandarin is spoken widely in Taiwan, but so too is Hokkien, a variety of Min Nan Chinese (nan), so we can’t really assume a specific variety of Chinese based on TW alone. But we can make an assumption about the implied script because the traditional script is widely used in Taiwan. Therefore we can consider Google’s zh-TW as referring to approximately the same language as Microsoft’s zh-Hant, that is, some unspecified variety of Chinese that uses traditional Chinese characters. (Although, of course, Taiwan is not the only place where the traditional script is used.)
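The mapping described above can be sketched as a small lookup table. This is only an illustration, not our actual pipeline code: the function names are hypothetical, the zh-Hant / zh-TW pair comes from the example above, and the zh-Hans / zh-CN pair is an assumption made by analogy.

```python
# Identifiers that two services use for approximately the same language.
# zh-Hant / zh-TW is the pair discussed in the text; zh-Hans / zh-CN is assumed by analogy.
AZURE_TO_GOOGLE = {
    "zh-Hant": "zh-TW",
    "zh-Hans": "zh-CN",
}
GOOGLE_TO_AZURE = {google: azure for azure, google in AZURE_TO_GOOGLE.items()}


def to_google(azure_tag):
    """Return Google's identifier for an Azure identifier, or the tag itself if unmapped."""
    return AZURE_TO_GOOGLE.get(azure_tag, azure_tag)


def to_azure(google_tag):
    """Return Azure's identifier for a Google identifier, or the tag itself if unmapped."""
    return GOOGLE_TO_AZURE.get(google_tag, google_tag)
```

Falling back to the unchanged tag keeps the common case simple, since most identifiers are the same across services.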

As I hope this example illustrates, there is some nuance to language identification. Already we’ve seen the need for backwards compatibility with older identifiers, the importance of the intended application (spoken? written? signed?), and the need for some understanding of regional variation. What other characteristics are used to identify languages? And where do these identifiers come from?

Wanting answers to these questions and others, I took a deep dive and what follows is the byproduct. The relevant information comes from a variety of sources in a variety of formats, so I built a database to allow me to easily make cross-cutting queries. If you’re interested, you can find langdb on GitHub.


Fortunately humans addressed the challenge of identifying languages decades ago in the same way we have addressed most technical problems faced by a post-Internet world: by creating a bunch of standards. We have standards for the names of languages, groups of languages, writing systems, countries and geographical areas, and the list goes on.

This brings us to BCP 47. BCP stands for Best Current Practice and BCP 47 is the IETF’s authoritative guide on “how to do language identifiers” on the Internet. It draws from many of the existing standards to provide a practical mechanism for identifying languages that can be parsed by machines and humans alike.

The document is really two RFCs concatenated together—RFC 5646, Tags for Identifying Languages, and RFC 4647, Matching of Language Tags. You’ve already been acquainted with language tags because virtually every Internet standard that needs to work with languages or locales uses them.

<html lang="fr-CA">
Example: the opening tag of an HTML document in French as used in Canada.
Accept-Language: es-419, es;q=0.9
Example: an HTTP header requesting content in Spanish as used in Latin America and the Caribbean and accepting, as a fallback, any variety of Spanish.
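
To make the header example concrete, here is a minimal sketch of reading an Accept-Language value into an ordered list of preferences. The function name is hypothetical and the parsing is deliberately simplified: it assumes a syntactically clean header and treats a missing q parameter as a quality of 1.

```python
def parse_accept_language(header):
    """Return (language range, quality) pairs, most preferred first."""
    ranges = []
    for item in header.split(","):
        parts = [part.strip() for part in item.split(";")]
        tag = parts[0]
        if not tag:
            continue
        quality = 1.0  # quality defaults to 1 when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                quality = float(value)
        ranges.append((tag, quality))
    # sorted() is stable, so ranges of equal quality keep their header order
    return sorted(ranges, key=lambda pair: pair[1], reverse=True)
```

For the header above, this yields es-419 first and es second.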

Language Tags

A subtag is a string of one to eight ASCII letters and digits. A language tag is a sequence of subtags separated by hyphens. There are different types of subtags, and each type must obey a set of rules and appear in a particular order relative to the other types. The rules were designed so that a subtag’s type can always be determined from its length, contents, and/or position within the enclosing language tag. There are case conventions for most subtags, but letter case is not part of the heuristic for determining a subtag’s type.

Subtag       Form                        Example        Standards
-----------  --------------------------  -------------  -------------------------------
language     [a-z]{2}                    zh             ISO 639-1
             [a-z]{3}                    cmn            ISO 639-2, ISO 639-3, ISO 639-5
extlang      [a-z]{3}                    cmn            ISO 639-3
script       [a-z]{4}                    Hant           ISO 15924
region       [a-z]{2}                    TW             ISO 3166-1
             [0-9]{3}                    030            UN M49
variant      [a-z][a-z0-9]{4,7}          tongyong
             [0-9][a-z0-9]{3,7}          1996
extension    [a-wy-z](-[a-z0-9]{2,8})+   u-co-phonebk   RFC 6067
                                         u-ms-metric    RFC 6497
private use  x-[a-z0-9]{1,8}             x-sarcasm

Figure 1. Language tag structure, in order from top to bottom. Each subtag after the language narrows the language tag scope.

Armed with knowledge of the language tag structure, we can tell, for example, that azb-Arab-IR is made up of language-script-region, even if we’ve never seen this language tag before (South Azerbaijani in Arabic script as used in Iran). Similarly, we can tell that fr-u-fw-mon is made up of language-extension, even if we aren’t familiar with the contents of this particular extension (French in a locale where the week starts on Monday).
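
The reasoning above can be sketched in code. This is a simplified reading of Figure 1, not a complete BCP 47 parser: the function and pattern names are my own, it allows at most one extended language subtag, and it rejects grandfathered tags and tags made up only of private-use subtags.

```python
import re

# Forms from Figure 1, applied after lowercasing (case does not determine type).
LANGUAGE = re.compile(r"[a-z]{2,3}")
EXTLANG = re.compile(r"[a-z]{3}")
SCRIPT = re.compile(r"[a-z]{4}")
REGION = re.compile(r"[a-z]{2}|[0-9]{3}")
VARIANT = re.compile(r"[a-z][a-z0-9]{4,7}|[0-9][a-z0-9]{3,7}")
SINGLETON = re.compile(r"[a-wy-z]")
EXTENSION_PART = re.compile(r"[a-z0-9]{2,8}")
PRIVATE_SINGLETON = re.compile(r"x")
PRIVATE_PART = re.compile(r"[a-z0-9]{1,8}")


def parse_language_tag(tag):
    """Split a well-formed tag into typed subtags; raise ValueError otherwise."""
    subtags = tag.lower().split("-")
    position = 0

    def take(pattern):
        # Consume the next subtag only if it has the form expected at this position.
        nonlocal position
        if position < len(subtags) and pattern.fullmatch(subtags[position]):
            position += 1
            return subtags[position - 1]
        return None

    parsed = {"language": take(LANGUAGE)}
    if parsed["language"] is None:
        raise ValueError(f"no primary language subtag in {tag!r}")
    parsed["extlang"] = take(EXTLANG)
    parsed["script"] = take(SCRIPT)
    parsed["region"] = take(REGION)

    parsed["variants"] = []
    while (variant := take(VARIANT)) is not None:
        parsed["variants"].append(variant)

    parsed["extensions"] = {}
    while (singleton := take(SINGLETON)) is not None:
        parts = []
        while (part := take(EXTENSION_PART)) is not None:
            parts.append(part)
        if not parts:
            raise ValueError(f"extension {singleton!r} has no subtags in {tag!r}")
        parsed["extensions"][singleton] = parts

    parsed["private_use"] = []
    if take(PRIVATE_SINGLETON) is not None:
        while (part := take(PRIVATE_PART)) is not None:
            parsed["private_use"].append(part)
        if not parsed["private_use"]:
            raise ValueError(f"private-use singleton has no subtags in {tag!r}")

    if position != len(subtags):
        raise ValueError(f"unexpected subtag {subtags[position]!r} in {tag!r}")
    return parsed
```

Because each type is tried in the order given by Figure 1, a subtag is never ambiguous: parsing azb-Arab-IR skips the extlang slot (arab has four letters) and fills script and region.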

After the language subtag, each additional subtag further narrows the scope of the overall language tag. The goal is to use the most concise language tag that appropriately identifies the language for a particular application. That is, subtags should only be used when they add “useful distinguishing information” (Phillips & Davis 2009). For example, en-Latn doesn’t provide any additional value over en since Latin is the assumed script for English; en-Brai (English with Braille script), however, does.

We will use this language tag structure as a path to follow as we look at each type of subtag and the supporting standards.

IANA Language Subtag Registry

BCP 47 establishes a subtag registry that contains an updated list of all valid subtags. “Registry” here is another way of describing a text file in record-jar format that is hosted and maintained by IANA. As part of this arrangement, an official reviewer monitors for any changes to the related standards and determines if and how the changes should be reflected in the registry.

A language tag can be one of three things: well-formed and valid, well-formed and invalid, or ill-formed. In a well-formed language tag, every subtag appears the correct number of times, in the correct order, and has the correct form for its type (e.g., two letters or three digits). But in order for a language tag to be valid, each of its subtags must additionally have a corresponding record in the subtag registry (except for private-use subtags, of course).

%%
Type: language
Subtag: ase
Description: American Sign Language
Added: 2009-07-29
%%
Example: the language subtag entry for American Sign Language in the subtag registry.
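
As a rough sketch of how the registry could be consumed, the snippet below reads record-jar text into dictionaries and collects the registered subtags of one type, which is the lookup a validity check needs. The function names are hypothetical, and field values are stored as lists on the assumption that a field such as Description can repeat within a record.

```python
def parse_registry(text):
    """Parse record-jar text into a list of records (field name -> list of values)."""
    records, current, last_key = [], {}, None
    for line in text.splitlines():
        if line.strip() == "%%":
            # A line containing only %% separates records.
            if current:
                records.append(current)
            current, last_key = {}, None
        elif line[:1] in (" ", "\t") and last_key is not None:
            # A line starting with whitespace continues the previous field's value.
            current[last_key][-1] += " " + line.strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            last_key = key.strip()
            current.setdefault(last_key, []).append(value.strip())
    if current:
        records.append(current)
    return records


def registered_subtags(records, subtag_type):
    """Return the set of subtags of the given type, e.g. 'language' or 'script'."""
    return {
        record["Subtag"][0]
        for record in records
        if record.get("Type") == [subtag_type] and "Subtag" in record
    }
```

With the full registry loaded, a well-formed tag would be valid when each of its subtags appears in the set for its type.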

Language

The first subtag to appear in any language tag is the primary language subtag and, with a couple of rare exceptions that I’m going to ignore, it’s always present. We’ve already seen some examples in zh, cmn, azb, and fr. Perhaps you’ve been wondering why some language subtags have two letters and others have three. To answer that we need to dig into ISO 639.

ISO 639 is a set of standards for representing the names of languages and groups of languages. It has four parts: ISO 639-1, ISO 639-2, ISO 639-3, and ISO 639-5. (Well, sort of: ISO 639-4, originally published in 2010, is being revised; ISO 639-6 was withdrawn.)

It’s worth mentioning that the standards are, by design, “open.” Each part has a designated registration authority (RA) in charge of maintaining an updated list of registered codes and reviewing change requests, and minor changes to the lists are in fact made fairly regularly. So naturally any specific numbers mentioned here were accurate at the time of writing but can’t be held to account as things change.

ISO 639-1

Codes for the representation of names of languages — Part 1: Alpha-2 code

ISO 639-1 defines a set of two-letter codes for the names of 184 languages. Examples include de for German, en for English, he for Hebrew, and of course zh for Chinese.

The standard was first introduced as ISO/R 639 in 1967, then replaced by ISO 639 in 1988. Because they are so old in Internet terms, these two-letter codes have been embedded into many systems and it’s likely that we’ll be living with them for a very long time.

ISO 639-2

Codes for the representation of names of languages — Part 2: Alpha-3 code

ISO 639-2 is a little different. First, it introduces three-letter codes. It contains codes for all of the languages covered by ISO 639-1 and adds about as many more (although we still don’t see codes for individual varieties of Chinese yet).

Second, it provides two sets of codes: a terminology code set and a bibliography code set. With twenty exceptions, the code for each entry is the same in both code sets. For example, English is assigned eng in both, and Chinese, an exception, is assigned zho as its terminology code and chi as its bibliography code.

Finally, it includes codes for some groups of languages, such as afa for Afro-Asiatic languages and sla for Slavic languages.

In total, ISO 639-2 assigns 486 codes: 358 individual languages, 65 groups of languages, 58 macrolanguages, and four special cases. (We’ll look at the special cases and macrolanguages shortly.) Notably, it also reserves a range of codes, qaa, qab, ..., qtz, for local use.
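
Since the local-use codes form a contiguous alphabetical range, checking for one can be sketched as a plain string comparison. The function name is hypothetical.

```python
def is_local_use_language(code):
    """Return True if code falls in the range reserved for local use (qaa through qtz)."""
    code = code.lower()
    return (
        len(code) == 3
        and code.isascii()
        and code.isalpha()
        and "qaa" <= code <= "qtz"
    )
```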

While the first two parts of the standard are suitable for many practical applications, they fall short of providing comprehensive coverage. This is where ISO 639-3 and ISO 639-5 come in.

ISO 639-3

Codes for the representation of names of languages — Part 3: Alpha-3 code for comprehensive coverage of languages

ISO 639-3 seeks to provide exhaustive coverage of all known individual languages. Like ISO 639-2, it uses three-letter codes. This is when we get codes for Mandarin, Cantonese, Gan, Min, Wu—hell, we even get tlh for Klingon. All told, it assigns 7,891 codes. In addition to living languages, these include ancient, historic, extinct, and constructed languages. It contains a code for every entry in ISO 639-1, but the same is not quite true for ISO 639-2.

All ISO 639-2 codes are included in ISO 639-3 with two caveats: codes for groups of languages are excluded, and in cases where the terminology code and the bibliography code are different, the terminology code is the one included in ISO 639-3. Referring to our earlier example, the ISO 639-3 code for Chinese is zho, not chi.
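
In practice this means a bibliography code has to be swapped for its terminology counterpart before it can be looked up in ISO 639-3. Here is a minimal sketch with a hypothetical function name; the table is deliberately partial, listing only three of the twenty exceptions.

```python
# Partial table: bibliography (B) code -> terminology (T) code.
# Only three of the twenty exceptions are listed here, as an illustration.
BIBLIOGRAPHY_TO_TERMINOLOGY = {
    "chi": "zho",  # Chinese
    "fre": "fra",  # French
    "ger": "deu",  # German
}


def to_terminology_code(code):
    """Return the terminology code for an ISO 639-2 code (most codes map to themselves)."""
    code = code.lower()
    return BIBLIOGRAPHY_TO_TERMINOLOGY.get(code, code)
```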

ISO 639-5

Codes for the representation of names of languages — Part 5: Alpha-3 code for language families and groups

ISO 639-5 finishes what ISO 639-2 started for language families and groups. Its focus is the classification of languages by geographical, linguistic, or other criteria. It includes all of the codes for groups of languages established in ISO 639-2 and adds some others, for a total of 115 codes.

Examples include sit (Sino-Tibetan languages), cai (Central American Indian languages), sgn (Sign languages), and roa (Romance languages).

[Diagram of the ISO 639 code sets: ISO 639-1, 184 codes; ISO 639-2 (T and B code sets), 486 codes; ISO 639-3, 7,891 codes; ISO 639-5, 115 codes, including the language groups in ISO 639-2.]
Figure 2. Relationship among the ISO 639 code sets.

Special Cases

There are four codes in ISO 639-3 (originally introduced in ISO 639-2) for scenarios in which a valid code is required but no other code is appropriate.

mis
Uncoded languages. A language is used that does not (yet) have a code in the ISO 639 standard.
mul
Multiple languages. More than one language is used and it would be impractical to list them all, or a single language code is required.
und
Undetermined. The language is not (yet) known.
zxx
No linguistic content. For example, audio and images.

These codes have come in handy in our data pipeline. For example, when new content arrives it is tagged with und until its language can be determined. This allows us to store a valid, non-NULL value in the database, and it gives us a known value to check for when performing operations that require the language to be known.
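
The checks described above might look something like the following. This is a hedged sketch rather than our production code: the function names are hypothetical, and skipping translation for zxx is an assumption about how such content would be handled.

```python
UNDETERMINED = "und"           # the language is not (yet) known
NO_LINGUISTIC_CONTENT = "zxx"  # e.g. audio and images


def needs_detection(language_tag):
    """New content is tagged und until its language has been detected."""
    return language_tag == UNDETERMINED


def needs_translation(source_tag, target_tag):
    """Decide whether content should be translated into the target language."""
    if source_tag == UNDETERMINED:
        # Translation requires a known source language.
        raise ValueError("detect the source language before translating")
    if source_tag == NO_LINGUISTIC_CONTENT:
        return False  # assumption: nothing to translate
    return source_tag != target_tag
```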

Extended Language

Earlier I mentioned that Chinese was considered a macrolanguage. For our purposes here, a macrolanguage is simply an entry in ISO 639-1 or ISO 639-2 for which mappings exist between itself and two or more individual languages in ISO 639-3. There are 62 such macrolanguages in the subtag registry and their mappings are captured by the Scope and Macrolanguage fields.

%%
Type: language
Subtag: zh
Description: Chinese
Added: 2005-10-16
Scope: macrolanguage
%%
%%
Type: extlang
Subtag: cmn
Description: Mandarin Chinese
Added: 2009-07-29
Preferred-Value: cmn
Prefix: zh
Macrolanguage: zh
%%
Example: the primary language subtag entry for Chinese and the extended language subtag entry for Mandarin Chinese in the subtag registry.

Because macrolanguage subtags were in circulation before there were subtags for the individual languages they map to, they may still be conventionally used as the primary language subtag, despite the availability of more precise options. The extended language subtag exists to accommodate this reality: the macrolanguage may continue to be used as the primary language subtag, and one of its mapped individual languages may be additionally provided as the extended language. Its values have been drawn exclusively from ISO 639-3, so all extended language subtags are made up of three letters. When used in a language tag, the extended language subtag appears immediately after the primary language subtag.

For every extended language subtag in the subtag registry there are two additional entries: a primary language subtag with the same three-letter code and a redundant subtag for the extended language form. The redundant subtag tells us that, for example, while both uz-uzn and uzn are valid tags for Northern Uzbek, the second form is preferred when possible.
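
Rewriting the extended language form into the preferred form can be sketched as follows. The function name is hypothetical and the two records are written out by hand for illustration: the cmn fields come from the registry entry shown above, while the uzn fields are assumed to follow the same pattern.

```python
# Hand-copied extlang records; a real implementation would load them from the registry.
EXTLANG_RECORDS = {
    "cmn": {"Prefix": "zh", "Preferred-Value": "cmn"},
    "uzn": {"Prefix": "uz", "Preferred-Value": "uzn"},  # assumed, by analogy with cmn
}


def prefer_primary_language(tag):
    """Rewrite macrolanguage-extlang tags (zh-cmn-...) to the preferred form (cmn-...)."""
    subtags = tag.split("-")
    if len(subtags) >= 2:
        record = EXTLANG_RECORDS.get(subtags[1].lower())
        if record is not None and subtags[0].lower() == record["Prefix"]:
            # Drop the macrolanguage and promote the extlang to primary language.
            return "-".join([record["Preferred-Value"]] + subtags[2:])
    return tag
```

Tags that don't use an extended language subtag, such as zh-Hant, pass through unchanged.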

Script

The script subtag follows the primary language and extended language subtags and it is used to specify a particular writing system when multiple possibilities exist or when it is useful to do so.

Serbian is a good example, as it uses Latin and Cyrillic scripts interchangeably. Indeed, Azure Translator lists sr-Latn and sr-Cyrl among its supported languages.

In general, however, the script subtag is not necessary. The subtag registry helps us out here: a record’s Suppress-Script field names the script subtag that should not be used with that subtag, because it is already the assumed script.

%%
Type: language
Subtag: ja
Description: Japanese
Added: 2005-10-16
Suppress-Script: Jpan
%%
Example: the language subtag entry for Japanese indicates that the Jpan script subtag should be omitted.
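
Applying Suppress-Script when cleaning up tags can be sketched like this. The function name is hypothetical, the table holds just two hand-copied values (ja from the entry above, en on the basis that Latin is the assumed script for English), and the sketch assumes the script subtag directly follows the primary language subtag.

```python
# Hand-copied Suppress-Script values; a real implementation would load them from the registry.
SUPPRESS_SCRIPT = {
    "ja": "Jpan",
    "en": "Latn",
}


def drop_suppressed_script(tag):
    """Remove a script subtag that the language's record says to suppress."""
    subtags = tag.split("-")
    if len(subtags) >= 2:
        suppressed = SUPPRESS_SCRIPT.get(subtags[0].lower())
        if suppressed is not None and subtags[1].lower() == suppressed.lower():
            del subtags[1]
    return "-".join(subtags)
```

A tag like sr-Latn is left alone because Serbian has no entry in the table.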

ISO 15924

Codes for the representation of names of scripts

ISO 15924 assigns four-letter codes for the names of scripts. (It also assigns a three-digit code for each entry, but only the four-letter codes are used in language tags.) In similar fashion to ISO 639-2, it additionally reserves a range of codes for private use, Qaaa, Qaab, ..., Qabx. While not officially listed, the codes True and Root are also reserved, at the request of the Unicode CLDR Technical Committee. ISO 15924 was introduced in 2004 and is maintained by the Unicode Consortium.

Examples beyond the ones we’ve already seen include Hang (Hangul), Hebr (Hebrew), and Zmth (Mathematical notation). Oh, and let’s not forget Piqd for Klingon. It should come as no surprise at this point that Azure Translator also supports tlh-Latn and tlh-Piqd, Klingon in Latin and pIqaD scripts.

Special Cases

There may be times when a valid script code is required but no other code is appropriate. For these ISO 15924 provides the following:

Zxxx
Code for unwritten documents.
Zyyy
Code for undetermined script. The script is not (yet) known.
Zzzz
Code for uncoded script. A script is used that does not (yet) have a code in the ISO 15924 standard.

In cases where only script is relevant, und can be used as the primary language subtag, as in und-Latn.

Region

The region subtag draws its values from two standards: two-letter codes like BR (Brazil) and JP (Japan) come from ISO 3166-1, and three-digit codes like 005 (South America) and 030 (Eastern Asia) come from UN M49.

pt-BR   # Portuguese as used in Brazil
es-419  # Spanish as used in Latin America and the Caribbean
Example: language tags containing a region subtag.

ISO 3166-1

Codes for the Representation of Names of Countries and their Subdivisions

ISO 3166 was first published in 1974; it was divided into three parts in 1997. The first part, ISO 3166-1, contains both two-letter and three-letter code sets for the current names of countries and their administrative areas. (The second part contains a list of codes for the names of principal administrative divisions of the countries listed in the first part, such as Canadian provinces. The third part contains a list of codes for country names that have been removed from the first part.)

UN M49

Standard Country or Area Codes for Statistical Use

The UN M49 standard contains three-digit codes for a variety of geographic regions, including countries, territories, continents—there’s even a code for the world (it’s 001). It also assigns codes for some groupings of countries or areas based on development status.

Notably, the only UN M49 codes that have been added to the subtag registry are for geographic regions, like Asia, Northern Africa, and Western Europe; subtag values for countries and territories come from ISO 3166-1.

ISO 3166 and UN M49 have a somewhat symbiotic relationship: ISO 3166 uses the names of countries as set forth by the United Nations in UN M49 and it lists the three-digit UN M49 code for each country; in turn, UN M49 lists the two- and three-letter ISO 3166 codes for each entry that is a country.

Variant

The variant subtag is the only type of subtag (other than private-use) that doesn’t draw its values from some underlying standard. It exists to enable the specification of language distinctions like orthography, which aren’t captured by other types of subtags. Like all subtags, a variant subtag is only valid if it has an entry in the subtag registry. Here’s an example: en-US-spanglis, Spanglish-leaning-English as spoken in the United States (and no, spanglis is not a typo here; remember that no subtag can have more than eight characters).

Extension

The extension subtag begins with a single letter, called a singleton, followed by one or more two- to eight-character subtags. Since x is the designated singleton for private-use subtags, extension subtags can’t use it. The singleton identifier for each extension must be registered but the two- to eight-character subtags that follow it are not—their contents and meaning are defined by the particular extension.

Registered extensions don’t live in the same place as registered subtags. Instead, there is another registry just for extensions. Like the subtag registry, it’s a text file in record-jar format that is hosted and maintained by IANA.

Currently there are only two registered extensions: t and u. Both extensions are administered by the Unicode Consortium, have an associated RFC, and draw their values from the Unicode CLDR Project.

For content that has been transformed in some way, such as by transliteration, transcription, or translation, the t extension can be used to specify the source and optionally the mechanism of transformation. For example, en-t-ar could be used for English content that has been translated from Arabic, and und-Latn-t-und-cyrl could be used for content in any language that has been transliterated from Cyrillic to Latin script. While we could make good use of the t extension for transliterated or translated content in our data pipeline, in practice it’s unnecessary since we store the source language elsewhere in structured data associated with the content.

The u extension is for specifying Unicode locale-based variations in language tags. It follows the form u-(?<key>[a-z0-9]{2})-(?<type>[a-z0-9]{3,8}), where key and type are subtags whose values are enumerated in the Unicode CLDR. Examples of keys include co (collation type), ms (measurement system), and nu (numbering system type). Types correspond to a particular key. For example, the available types for the ms key are metric, uksystem, and ussystem.
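
Reading the key/type pairs out of a tag can be sketched as follows. The function name is hypothetical, and the sketch follows the simplified form given above: each two-character key is followed by a single type. It does not check the values against the Unicode CLDR.

```python
import re

KEY = re.compile(r"[a-z0-9]{2}")
TYPE = re.compile(r"[a-z0-9]{3,8}")


def unicode_locale_keywords(tag):
    """Return {key: type} for the u extension of a tag, or {} if there is none."""
    subtags = tag.lower().split("-")
    if "u" not in subtags:
        return {}
    keywords = {}
    key = None
    for subtag in subtags[subtags.index("u") + 1:]:
        if len(subtag) == 1:
            break  # the next singleton starts another extension or private use
        if KEY.fullmatch(subtag):
            key = subtag
        elif key is not None and TYPE.fullmatch(subtag):
            keywords[key] = subtag
            key = None
    return keywords
```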

Private Use

There may be times when it is necessary to specify language distinctions that cannot be captured by the script, region, variant, or extension subtags. For such cases we have the private-use subtag. It consists of the x singleton followed by one or more one- to eight-character subtags. Within a language tag, private-use subtags must appear last, although a language tag consisting solely of private-use subtags is technically legal. The contents and meaning of private-use subtags are defined only by private agreement, so unless you happen to have a use case for them, private-use subtags should be avoided.

Grandfathered and Redundant Tags

Truth be told, RFC 5646 (the first of the two RFCs in BCP 47) was not the first to take on the challenge of identifying languages. The earliest, RFC 1766, and its successor, RFC 3066, both included provisions for registering whole language tags with IANA. The 93 language tags registered under their direction are considered grandfathered and entries for them still exist in the subtag registry today.

Entries with Type: redundant are grandfathered registrations that have since become redundant. An example is the entry with Tag: sr-Cyrl. It’s redundant because it could be constructed from the more recently added individual subtags sr (Serbian language) and Cyrl (Cyrillic script), and it follows the rules of the language tag structure.

Entries with Type: grandfathered are the grandfathered tags that can’t be constructed that way. Some examples are en-GB-oed, i-default, sgn-BE-NL, and zh-min-nan. There are 26 such tags. RFC 5646 calls most of them irregular, and those are easy to spot because they break the rules of the language tag structure we’ve just explored (i-default, for one, starts with a single-letter subtag). The rest, zh-min-nan among them, are regular: they at least appear to follow the rules, but they contain subtags that aren’t individually registered with the intended meaning.


We’ve covered a lot of ground. We’ve examined the language tag structure and visited each of the related standards—for languages, scripts, regions, and locale-based variations—along the way. In the event you’re ever asked to explain the difference among zh-Hans, zh-Hant, zh-CN, and zh-TW, I hope you’ll feel well equipped to answer. Perhaps you’ll even feel inspired to explore aspects of localization that you hadn’t considered before.

References