As part of the Novetta Mission Analytics team, I work on a data pipeline that ingests traditional and social media from around the world, enriches it, and makes the enriched data available to customers. Enrichment can involve any number of steps, many of them powered by machine learning, and one of the earliest and most common steps is translation.
When new content arrives, the source language is often unknown and must be detected; if the source language is different from the target language, the content is also translated. In order to translate this volume of content automatically, accurately, and cost effectively, we rely on multiple cloud translation services.
To the surprise of no one, cloud translation services differ not only in pricing but also in the languages they support and in the quality of translation across them. It’s often most cost effective to perform language detection with one service and, depending on the detected language, translation with another. In addition, these services occasionally use different identifiers to refer to the same language, which requires us to do some mapping on our end. For example, Microsoft Azure’s cloud translation service uses `zh-Hant` to refer to Chinese in traditional script, while Google Cloud’s uses `zh-TW`.
How can two different identifiers refer to the same language? Well, it turns out that `Hant` refers to a script—specifically the one that uses traditional Chinese characters—and `TW` refers to a region—Taiwan. When translating textual content, script seems more relevant, but it turns out that where a language is spoken is often a pretty good proxy for a number of defining language characteristics, script among them in this case.
It’s clear then that `zh` must refer to the Chinese language, but upon closer inspection that’s not exactly straightforward either. Chinese, at least in this context, is considered a macrolanguage—a group of individual languages that includes Mandarin, Cantonese, Gan, Min, Wu, and others. When spoken, many of these languages are virtually unintelligible to speakers of the others.
It would seem that referring to a group of languages that are mutually unintelligible would make for a lousy language identifier in a translation service. In a purely spoken context this might be true, but when written, varieties of Chinese generally use one of two scripts: traditional Chinese characters (`Hant`) or simplified Chinese characters (`Hans`). Partly for historical reasons—`zh` was in circulation before there were ever identifiers for individual varieties of Chinese—and partly because script tends to be the key distinguishing piece of information for Chinese text, we’ll often see `zh-Hant` rather than a more specific identifier, like `cmn-Hant` (Mandarin in traditional script).
Mandarin is spoken widely in Taiwan, but so too is Hokkien, a variety of Min Nan Chinese (`nan`), so we can’t really assume a specific variety of Chinese based on `TW` alone. But we can make an assumption about the implied script because the traditional script is widely used in Taiwan. Therefore we can consider Google’s `zh-TW` as referring to approximately the same language as Microsoft’s `zh-Hant`, that is, some unspecified variety of Chinese that uses traditional Chinese characters. (Although, of course, Taiwan is not the only place where the traditional script is used.)
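To make the mapping concrete, here’s a minimal sketch of the kind of normalization this implies. The `normalize_tag` helper and the equivalences in the table are illustrative assumptions, not an exhaustive (or production) mapping.

```python
# A minimal sketch of normalizing provider-specific identifiers to a common
# internal tag. The equivalences shown are illustrative, not exhaustive.
EQUIVALENT_TAGS = {
    "zh-TW": "zh-Hant",  # Google: Chinese (Taiwan) -> traditional script
    "zh-CN": "zh-Hans",  # Google: Chinese (mainland China) -> simplified script
}

def normalize_tag(tag: str) -> str:
    """Map a provider-specific language identifier to our internal form."""
    return EQUIVALENT_TAGS.get(tag, tag)

assert normalize_tag("zh-TW") == "zh-Hant"
assert normalize_tag("fr") == "fr"  # unknown tags pass through unchanged
```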
As I hope this example illustrates, there is some nuance to language identification. Already we’ve seen the need for backwards compatibility with older identifiers, the importance of the intended application (spoken? written? signed?), and the need for some understanding of regional variation. What other characteristics are used to identify languages? And where do these identifiers come from?
Wanting answers to these questions and others, I took a deep dive and what follows is the byproduct. The relevant information comes from a variety of sources in a variety of formats, so I built a database to allow me to easily make cross-cutting queries. If you’re interested, you can find langdb on GitHub.
Fortunately humans addressed the challenge of identifying languages decades ago in the same way we have addressed most technical problems faced by a post-Internet world: by creating a bunch of standards. We have standards for the names of languages, groups of languages, writing systems, countries and geographical areas, and the list goes on.
This brings us to BCP 47. BCP stands for Best Current Practice and BCP 47 is the IETF’s authoritative guide on “how to do language identifiers” on the Internet. It draws from many of the existing standards to provide a practical mechanism for identifying languages that can be parsed by machines and humans alike.
The document is really two RFCs concatenated together—RFC 5646, Tags for Identifying Languages, and RFC 4647, Matching of Language Tags. You’ve already been acquainted with language tags because virtually every Internet standard that needs to work with languages or locales uses them.
Language Tags
A subtag is a string of one to eight ASCII letters and numbers. A language tag is a sequence of subtags separated by hyphens. There are different types of subtags and each type must obey a set of rules and appear in a particular order relative to other types of subtags. This is because they were designed in such a way that one’s type can always be determined from its length, contents, and/or position within the enclosing language tag. There are case conventions for most subtags, but letter case is not part of the heuristic for determining one’s type.
| Subtag | Form | Example | Standards |
|---|---|---|---|
| language | `[a-z]{2}` | `zh` | ISO 639-1 |
| | `[a-z]{3}` | `cmn` | ISO 639-2, ISO 639-3, ISO 639-5 |
| extlang | `[a-z]{3}` | `cmn` | ISO 639-3 |
| script | `[a-z]{4}` | `Hant` | ISO 15924 |
| region | `[a-z]{2}` | `TW` | ISO 3166-1 |
| | `[0-9]{3}` | `030` | UN M49 |
| variant | `[a-z][a-z0-9]{4,7}` | `tongyong` | |
| | `[0-9][a-z0-9]{3,7}` | `1996` | |
| extension | `[a-wy-z](-[a-z0-9]{2,8})+` | `u-co-phonebk`, `u-ms-metric` | RFC 6067, RFC 6497 |
| private use | `x-[a-z0-9]{1,8}` | `x-sarcasm` | |
Armed with knowledge of the language tag structure, we can tell, for example, that `azb-Arab-IR` is made up of language-script-region, even if we’ve never seen this language tag before (South Azerbaijani in Arabic script as used in Iran). Similarly, we can tell that `fr-u-fw-mon` is made up of language-extension, even if we aren’t familiar with the contents of this particular extension (French in a locale where the week starts on Monday).
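To illustrate how far length, content, and position alone can get you, here’s a deliberately simplified sketch. It handles only language, script, and region subtags, assumes the tag is already well-formed, and is no substitute for the full grammar in RFC 5646; the function name and structure are my own.

```python
import re

def classify_subtags(tag: str) -> dict:
    """Naively split a language tag into language/script/region parts.

    Simplified sketch: ignores extlang, variant, extension, and private-use
    subtags and assumes the tag is well-formed.
    """
    parts = tag.split("-")
    result = {"language": parts[0].lower()}
    for subtag in parts[1:]:
        if re.fullmatch(r"[A-Za-z]{4}", subtag):
            result["script"] = subtag.title()      # e.g. Hant, Arab
        elif re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", subtag):
            result["region"] = subtag.upper()      # e.g. TW, IR, 030
    return result

print(classify_subtags("azb-Arab-IR"))
# {'language': 'azb', 'script': 'Arab', 'region': 'IR'}
```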
After the language subtag, each additional subtag further narrows the scope of the overall language tag. The goal is to use the most concise language tag that appropriately identifies the language for a particular application. That is, subtags should only be used when they add “useful distinguishing information” (Phillips & Davis 2009). For example, `en-Latn` doesn’t provide any additional value over `en` since Latin is the assumed script for English; `en-Brai` (English with Braille script), however, does.
We will use this language tag structure as a path to follow as we look at each type of subtag and the supporting standards.
IANA Language Subtag Registry
BCP 47 establishes a subtag registry that contains an updated list of all valid subtags. “Registry” here is another way of describing a text file in record-jar format that is hosted and maintained by IANA. As part of this arrangement, an official reviewer monitors for any changes to the related standards and determines if and how the changes should be reflected in the registry.
A language tag can be one of three things: well-formed and valid, well-formed and invalid, or ill-formed. In a well-formed language tag, every subtag appears the correct number of times, in the correct order, and has the correct form for its type (e.g., two letters or three numbers). But in order for a language tag to be valid, each of its subtags must additionally have a corresponding record in the subtag registry (except for private-use subtags, of course).
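A rough sketch of that two-step distinction: the snippet below downloads the registry, parses its record-jar format (records separated by `%%`, with `Field: Value` lines and indented continuation lines), and then checks whether a language subtag is actually registered. The parsing is deliberately simplified, and `zq` is used here as an example of a subtag that is well-formed (two letters) but invalid (no registry record).

```python
from urllib.request import urlopen

REGISTRY_URL = (
    "https://www.iana.org/assignments/language-subtag-registry/"
    "language-subtag-registry"
)

def load_registry(url: str = REGISTRY_URL) -> list:
    """Parse the IANA subtag registry (record-jar: records separated by %%)."""
    text = urlopen(url).read().decode("utf-8")
    records = []
    for chunk in text.split("%%"):
        record, last_field = {}, None
        for line in chunk.splitlines():
            if line.startswith(" ") and last_field:   # continuation line
                record[last_field] += " " + line.strip()
            elif ":" in line:
                field, _, value = line.partition(":")
                last_field = field.strip()
                record[last_field] = value.strip()
        if "Type" in record:
            records.append(record)
    return records

# Well-formedness is purely syntactic; validity additionally requires that
# each subtag have a registry record. "zh-Hant" is valid; "zq-Hant" is
# well-formed but invalid because "zq" has no language record.
registry = load_registry()
languages = {r["Subtag"] for r in registry if r["Type"] == "language"}
print("zh" in languages, "zq" in languages)   # True False
```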
Language
The first subtag to appear in any language tag is the primary language and, with a couple of rare exceptions that I’m going to ignore, it’s always present. We’ve already seen some examples in `zh`, `cmn`, `azb`, and `fr`. Perhaps you’ve been wondering why some language subtags have two letters and others have three. To answer that we need to dig into ISO 639.
ISO 639 is a set of standards for representing the names of languages and groups of languages. It has four parts: ISO 639-1, ISO 639-2, ISO 639-3, and ISO 639-5. (Well, sort of: ISO 639-4, originally published in 2010, is being revised; ISO 639-6 was withdrawn.)
It’s worth mentioning that the standards are, by design, “open.” Each part has a designated registration authority (RA) in charge of maintaining an updated list of registered codes and reviewing change requests, and minor changes to the lists are in fact made fairly regularly. So naturally any specific numbers mentioned here were accurate at the time of writing but can’t be held to account as things change.
ISO 639-1
Codes for the representation of names of languages — Part 1: Alpha-2 code
ISO 639-1 defines a set of two-letter codes for the names of 184 languages. Examples include `de` for German, `en` for English, `he` for Hebrew, and of course `zh` for Chinese.
The standard was first introduced as ISO/R 639 in 1967, then replaced by ISO 639 in 1988. Because these two-letter codes predate the Internet and have been embedded into so many systems, it’s likely that we’ll be living with them for a very long time.
ISO 639-2
Codes for the representation of names of languages — Part 2: Alpha-3 code
ISO 639-2 is a little different. First, it introduces three-letter codes. It contains codes for all of the languages covered by ISO 639-1 and adds about as many more (although we still don’t see codes for individual varieties of Chinese yet).
Second, it provides two sets of codes: a terminology code set and a bibliography code set. With twenty exceptions, the code for each entry is the same in both code sets. For example, English is assigned `eng` in both, and Chinese, an exception, is assigned `zho` as its terminology code and `chi` as its bibliography code.
Finally, it includes codes for some groups of languages, such as `afa` for Afro-Asiatic languages and `sla` for Slavic languages.
In total, ISO 639-2 assigns 486 codes: 358 individual languages, 65 groups of languages, 58 macrolanguages, and four special cases. (We’ll look at the special cases and macrolanguages shortly.) Notably, it also reserves a range of codes, `qaa`, `qab`, ..., `qtz`, for local use.
While the first two parts of the standard are suitable for many practical applications, they fall short of providing comprehensive coverage. This is where ISO 639-3 and ISO 639-5 come in.
ISO 639-3
Codes for the representation of names of languages — Part 3: Alpha-3 code for comprehensive coverage of languages
ISO 639-3 seeks to provide exhaustive coverage of all known individual languages. Like ISO 639-2, it uses three-letter codes. This is when we get codes for Mandarin, Cantonese, Gan, Min, Wu—hell, we even get `tlh` for Klingon. All told, it assigns 7,891 codes. In addition to living languages, these include ancient, historic, extinct, and constructed languages. It contains a code for every entry in ISO 639-1, but the same is not quite true for ISO 639-2.
All ISO 639-2 codes are included in ISO 639-3 with two caveats: codes for groups of languages are excluded, and in cases where the terminology code and the bibliography code are different, the terminology code is the one included in ISO 639-3. Referring to our earlier example, the ISO 639-3 code for Chinese is `zho`, not `chi`.
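If you ever need to reconcile data keyed on bibliographic codes with ISO 639-3, a small lookup of the exceptions is enough. The handful shown here is an illustrative subset of the twenty; the full list lives in the ISO 639-2 code tables.

```python
# A few of the twenty ISO 639-2 entries whose bibliographic (B) and
# terminology (T) codes differ; ISO 639-3 uses the terminology code.
# Illustrative subset only -- see the ISO 639-2 code tables for all twenty.
BIB_TO_TERM = {
    "chi": "zho",  # Chinese
    "fre": "fra",  # French
    "ger": "deu",  # German
    "dut": "nld",  # Dutch
    "gre": "ell",  # Greek (Modern)
}

def to_terminology(code: str) -> str:
    return BIB_TO_TERM.get(code, code)

assert to_terminology("chi") == "zho"
assert to_terminology("eng") == "eng"  # same code in both sets
```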
ISO 639-5
Codes for the representation of names of languages — Part 5: Alpha-3 code for language families and groups
ISO 639-5 finishes what ISO 639-2 started for language families and groups. Its focus is the classification of languages by geographical, linguistic, or other criteria. It includes all of the codes for groups of languages established in ISO 639-2 and adds some others, for a total of 115 codes.
Examples include `sit` (Sino-Tibetan languages), `cai` (Central American Indian languages), `sgn` (Sign languages), and `roa` (Romance languages).
Special Cases
There are four codes in ISO 639-3 (originally introduced in ISO 639-2) for scenarios in which a valid code is required but no other code is appropriate.
- `mis` - Uncoded languages. A language is used that does not (yet) have a code in the ISO 639 standard.
- `mul` - Multiple languages. More than one language is used and it would be impractical to list them all, or a single language code is required.
- `und` - Undetermined. The language is not (yet) known.
- `zxx` - No linguistic content. For example, audio and images.
These codes have come in handy in our data pipeline. For example, when new content arrives it is tagged with `und` until its language can be determined. This allows us to store a valid, non-NULL value in the database, and it gives us a known value to check for when performing operations that require the language to be known.
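A minimal sketch of that pattern (the record shape and the `ingest`/`needs_detection` helpers are hypothetical, not our actual pipeline code):

```python
UNDETERMINED = "und"  # ISO 639 special code: language not (yet) known

def ingest(content: str) -> dict:
    """Create a record for newly arrived content, before language detection."""
    return {"text": content, "language": UNDETERMINED}

def needs_detection(record: dict) -> bool:
    return record["language"] == UNDETERMINED

record = ingest("Bonjour tout le monde")
assert needs_detection(record)
record["language"] = "fr"        # later, once a detection service responds
assert not needs_detection(record)
```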
Extended Language
Earlier I mentioned that Chinese was considered a macrolanguage. For our purposes here, a macrolanguage is simply an entry in ISO 639-1 or ISO 639-2 for which mappings exist between itself and two or more individual languages in ISO 639-3. There are 62 such macrolanguages in the subtag registry and their mappings are captured by the `Scope` and `Macrolanguage` fields.
Because macrolanguage subtags were in circulation before there were subtags for the individual languages they map to, they may still be conventionally used as the primary language subtag, despite the availability of more precise options. The extended language subtag exists to accommodate this reality: the macrolanguage may continue to be used as the primary language subtag, and one of its mapped individual languages may be additionally provided as the extended language. Its values have been drawn exclusively from ISO 639-3, so all extended language subtags are made up of three letters. When used in a language tag, the extended language subtag appears immediately after the primary language subtag.
For every extended language subtag in the subtag registry there is also a primary language subtag with the same three-letter code, and the extended language record carries a `Preferred-Value` pointing at it. That preference tells us that, for example, while both `uz-uzn` and `uzn` are valid tags for Northern Uzbek, the second form is preferred when possible.
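Because that preference is machine-readable, canonicalization can be automated. The sketch below builds a lookup from the `Prefix` and `Preferred-Value` fields on extlang records, reusing the `load_registry` helper sketched earlier; treat it as an illustration rather than a complete canonicalizer.

```python
# Sketch: canonicalize extended-language forms like "uz-uzn" to the preferred
# single-subtag form "uzn". Reuses load_registry() from the earlier sketch.
extlang_preferred = {
    f"{r['Prefix']}-{r['Subtag']}": r["Preferred-Value"]
    for r in load_registry()
    if r["Type"] == "extlang" and "Prefix" in r and "Preferred-Value" in r
}

def canonicalize(tag: str) -> str:
    return extlang_preferred.get(tag, tag)

print(canonicalize("uz-uzn"))   # uzn
print(canonicalize("en-GB"))    # en-GB (unchanged)
```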
Script
The script subtag follows the primary language and extended language subtags and it is used to specify a particular writing system when multiple possibilities exist or when it is useful to do so.
Serbian is a good example, as it uses Latin and Cyrillic scripts interchangeably. Indeed, Azure Translator lists `sr-Latn` and `sr-Cyrl` among its supported languages.
In general, however, the script subtag is not necessary. The subtag registry helps us out here by including a `Suppress-Script` field when a script subtag should definitively not be used in combination with a particular primary language subtag.
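Here’s a sketch of putting `Suppress-Script` to work, again reusing the `load_registry` helper from earlier: if a tag’s script subtag matches the `Suppress-Script` of its primary language, it adds nothing and can be dropped.

```python
# Sketch: drop a script subtag when it matches the language's Suppress-Script.
# Reuses load_registry() from the earlier sketch.
suppress = {
    r["Subtag"]: r["Suppress-Script"]
    for r in load_registry()
    if r["Type"] == "language" and "Suppress-Script" in r
}

def drop_redundant_script(tag: str) -> str:
    parts = tag.split("-")
    if len(parts) >= 2 and suppress.get(parts[0]) == parts[1]:
        return "-".join([parts[0]] + parts[2:])
    return tag

print(drop_redundant_script("en-Latn"))   # en
print(drop_redundant_script("en-Brai"))   # en-Brai
print(drop_redundant_script("sr-Cyrl"))   # sr-Cyrl (sr has no Suppress-Script)
```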
ISO 15924
Codes for the representation of names of scripts
ISO 15924 assigns four-letter codes for the names of scripts. (It also assigns a three-digit code for each entry, but only the four-letter codes are used in language tags.) In similar fashion to ISO 639-2, it additionally reserves a range of codes for private use, `Qaaa`, `Qaab`, ..., `Qabx`. While not officially listed, the codes `True` and `Root` are also reserved, at the request of the Unicode CLDR Technical Committee. ISO 15924 was introduced in 2004 and is maintained by the Unicode Consortium.
Examples beyond the ones we’ve already seen include `Hang` (Hangul), `Hebr` (Hebrew), and `Zmth` (Mathematical notation). Oh, and let’s not forget `Piqd` for Klingon. It should come as no surprise at this point that Azure Translator also supports `tlh-Latn` and `tlh-Piqd`, Klingon in Latin and pIqaD scripts.
Special Cases
There may be times when a valid script code is required but no other code is appropriate. For these ISO 15924 provides the following:
- `Zxxx` - Code for unwritten documents.
- `Zyyy` - Code for undetermined script. The script is not (yet) known.
- `Zzzz` - Code for uncoded script. A script is used that does not (yet) have a code in the ISO 15924 standard.
In cases where only script is relevant, `und` can be used as the primary language subtag, as in `und-Latn`.
Region
The region subtag draws its values from two standards: two-letter codes like `BR` (Brazil) and `JP` (Japan) come from ISO 3166-1, and three-digit codes like `005` (South America) and `030` (Eastern Asia) come from UN M49.
ISO 3166-1
Codes for the representation of names of countries and their subdivisions
ISO 3166 was first published in 1974; it was divided into three parts in 1997. The first part, ISO 3166-1, contains both two-letter and three-letter code sets for the current names of countries and their administrative areas. (The second part contains a list of codes for the names of principal administrative divisions of the countries listed in the first part, such as Canadian provinces. The third part contains a list of codes for country names that have been removed from the first part.)
UN M49
Standard Country or Area Codes for Statistical Use
The UN M49 standard contains three-digit codes for a variety of geographic regions, including countries, territories, continents—there’s even a code for the world (it’s `001`). It also assigns codes for some groupings of countries or areas based on development status.
Notably, the only UN M49 codes that have been added to the subtag registry are for geographic regions, like Asia, Northern Africa, and Western Europe; subtag values for countries and territories come from ISO 3166-1.
ISO 3166 and UN M49 have a somewhat symbiotic relationship: ISO 3166 uses the names of countries as set forth by the United Nations in UN M49 and it lists the three-digit UN M49 code for each country; in turn, UN M49 lists the two- and three-letter ISO 3166 codes for each entry that is a country.
Variant
The variant subtag is the only type of subtag (other than private-use) that doesn’t draw its values from some underlying standard. It exists to enable the specification of language distinctions like orthography, which aren’t captured by other types of subtags. Like all subtags, a variant subtag is only valid if it has an entry in the subtag registry. Here’s an example: `en-US-spanglis`, Spanglish-leaning English as spoken in the United States (and no, `spanglis` is not a typo here—remember that no subtag can have more than eight characters).
Extension
The extension subtag begins with a single letter, called a singleton, followed by one or more two- to eight-character subtags. Since `x` is the designated singleton for private-use subtags, extension subtags can’t use it. The singleton identifier for each extension must be registered, but the two- to eight-character subtags that follow it are not—their contents and meaning are defined by the particular extension.
Registered extensions don’t live in the same place as registered subtags. Instead, there is another registry just for extensions. Like the subtag registry, it’s a text file in record-jar format that is hosted and maintained by IANA.
Currently there are only two registered extensions: `t` and `u`. Both extensions are administered by the Unicode Consortium, have an associated RFC, and draw their values from the Unicode CLDR Project.
For content that has been transformed in some way, such as by transliteration, transcription, or translation, the `t` extension can be used to specify the source and optionally the mechanism of transformation. For example, `en-t-ar` could be used for English content that has been translated from Arabic, and `und-Latn-t-und-cyrl` could be used for content in any language that has been transliterated from Cyrillic to Latin script. While we could make good use of the `t` extension for transliterated or translated content in our data pipeline, in practice it’s unnecessary since we store the source language elsewhere in structured data associated with the content.
The `u` extension is for specifying Unicode locale-based variations in language tags. It follows the form `u-(?<key>[a-z0-9]{2})-(?<type>[a-z0-9]{3,8})`, where key and type are subtags whose values are enumerated in the Unicode CLDR. Examples of keys include `co` (collation type), `ms` (measurement system), and `nu` (numbering system type). Types correspond to a particular key. For example, the available types for the `ms` key are `metric`, `uksystem`, and `ussystem`.
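Pulling key/type pairs out of a `u` extension is straightforward if you accept a simplification: the sketch below assumes every two-character key is followed by exactly one type subtag, which the full syntax doesn’t actually require.

```python
import re

def parse_u_extension(tag: str) -> dict:
    """Extract key/type pairs from a Unicode (u) extension.

    Simplified sketch: assumes each two-character key is followed by exactly
    one type subtag.
    """
    match = re.search(r"(?:^|-)u((?:-[a-z0-9]{2,8})+)", tag.lower())
    if not match:
        return {}
    subtags = match.group(1).strip("-").split("-")
    return dict(zip(subtags[0::2], subtags[1::2]))

print(parse_u_extension("fr-u-fw-mon"))        # {'fw': 'mon'}
print(parse_u_extension("en-US-u-ms-metric"))  # {'ms': 'metric'}
```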
Private Use
There may be times when it is necessary to specify language distinctions that cannot be captured by the script, region, variant, or extension subtags. For such cases we have the private-use subtag. It consists of the `x` singleton followed by one or more two- to eight-character subtags. Within a language tag, private-use subtags must appear last, although a language tag consisting solely of private-use subtags is technically legal. The contents and meaning of private-use subtags are defined only by private agreement, so unless you happen to have a use case for them, private-use subtags should be avoided.
Grandfathered and Redundant Tags
Truth be told, RFC 5646 (the first of the two RFCs in BCP 47) was not the first to take on the challenge of identifying languages. The earliest, RFC 1766, and its successor, RFC 3066, both included provisions for registering whole language tags with IANA. The 93 language tags registered under their direction were carried forward into the subtag registry, where entries for them still exist today.
Entries with `Type: redundant` cover the registered tags that can now be composed entirely from individual subtags. An example is the entry with `Tag: sr-Cyrl`. It’s redundant because it could be constructed from the more recently added subtags `sr` (Serbian language) and `Cyrl` (Cyrillic script).
Entries with `Type: grandfathered` cover the registered tags that can’t be. Some examples are `en-GB-oed`, `i-default`, `sgn-BE-NL`, and `zh-min-nan`. There are 26 such tags; RFC 5646 further divides them into “regular” ones, which at least follow the language tag structure we’ve just explored, and “irregular” ones, which break it.
We’ve covered a lot of ground. We’ve examined the language tag structure and visited each of the related standards—for languages, scripts, regions, and locale-based variations—along the way. In the event you’re ever asked to explain the difference among `zh-Hans`, `zh-Hant`, `zh-CN`, and `zh-TW`, I hope you’ll feel well equipped to answer. Perhaps you’ll even feel inspired to explore aspects of localization that you hadn’t considered before.
References
- Alvestrand, H. T. (1995, March). Tags for the Identification of Languages. RFC 1766. https://doi.org/10.17487/RFC1766
- Alvestrand, H. T. (2001, January). Tags for the Identification of Languages. RFC 3066. https://doi.org/10.17487/RFC3066
- Azerbaijani Language. (2021, June 17). In Wikipedia. https://en.wikipedia.org/w/index.php?title=Azerbaijani_language&oldid=1029048636
- Davis, M., Falk, C., Phillips, A., & Umaoka, Y. (2012, February). BCP 47 Extension T - Transformed Content. RFC 6497. https://doi.org/10.17487/RFC6497
- Davis, M., Phillips, A., & Umaoka, Y. (2010, December). BCP 47 Extension U. RFC 6067. https://doi.org/10.17487/RFC6067
- Google Cloud. (2021, June 23). Cloud Translation - Language support [API reference]. Retrieved June 19, 2021, from https://cloud.google.com/translate/docs/languages
- Haugen, E. (1966). Dialect, Language, Nation. American Anthropologist, 68(4), new series, 922–935. Retrieved May 7, 2021, from https://www.jstor.org/stable/670407
- Languages of Taiwan. (2021, June 3). In Wikipedia. https://en.wikipedia.org/w/index.php?title=Languages_of_Taiwan&oldid=1026630746
- Library of Congress. (n.d.). ISO 639-2 Registration Authority. https://www.loc.gov/standards/iso639-2/
- Library of Congress. (n.d.). ISO 639-5 Registration Authority. https://www.loc.gov/standards/iso639-5/
- Microsoft Azure. (2020, June 6). Language support for text and speech translation [API reference]. Retrieved June 19, 2021, from https://docs.microsoft.com/en-us/azure/cognitive-services/translator/language-support
- Phillips, A. & Davis, M. (2006, September). Matching of Language Tags. BCP 47, RFC 4647. https://doi.org/10.17487/RFC4647
- Phillips, A. & Davis, M. (Eds.). (2009, December). Tags for Identifying Languages. BCP 47, RFC 5646. https://doi.org/10.17487/RFC5646
- SIL International. (n.d.). ISO 639-3 Registration Authority. https://iso639-3.sil.org/
- The Unicode Consortium. (n.d.). Codes for the representation of names of scripts. ISO 15924 Registration Authority. https://www.unicode.org/iso15924/codelists.html
- The Unicode Consortium. (2021, April 6). Unicode Language and Locale Identifiers. In Unicode Technical Standard #35: Unicode Locale Data Markup Language (version 39). https://www.unicode.org/reports/tr35/tr35-63/tr35.html#Unicode_Language_and_Locale_Identifiers
- United Nations Statistics Division. (n.d.) Standard country or area codes for statistical use (M49). Department of Economic and Social Affairs, United Nations Secretariat. https://unstats.un.org/unsd/methodology/m49/