UTS #35: Unicode Locale Data Markup Language
Unicode Technical Standard #35
Unicode Locale Data Markup Language (LDML)
Version: 23
Editors: Mark Davis (markdavis@google.com) and other CLDR committee members
Date: 2013-03-15
Revision: 31
Summary
This document describes an XML format (vocabulary) for the exchange of structured locale data. This format is used in the Unicode Common Locale Data Repository.
Status
This document has been reviewed by Unicode members and other
interested parties, and has been approved for publication by the Unicode
Consortium. This is a stable document and may be used as reference
material or cited as a normative reference by other specifications.
A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS.
Please submit corrigenda and other comments with the CLDR bug reporting form [Bugs]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].
Parts
The LDML specification is divided into the following parts:
Part 1: Core specification (languages, locales, basic structure)
Part 2: General (display names & transforms, etc.)
Part 3: Numbers (number & currency formatting)
Part 4: Dates (date, time, time zone formatting)
Part 5: Collation (sorting, searching, grouping)
Part 6: Related Information (supplemental data)
Part 7: Keyboards (keyboard mappings)
Contents of Part 1, Core
1 Introduction
1.1 Conformance
2 What is a Locale?
3 Unicode Language and Locale Identifiers
3.1 Unicode Language Identifier
3.2 Unicode Locale Identifier
3.3 BCP 47 Conformance
3.3.1 BCP 47 Language Tag Conversion
3.4 Field Definitions
3.5 Key And Type Definitions
3.6 Unknown or Invalid Identifiers
3.6.1 Numeric Codes
3.7 Unicode BCP 47 Extension Data
3.7.1 Unicode Locale Extension Data Files
3.7.1.1 Numbering System Data
3.7.1.2 Time Zone Identifiers
3.7.2 Transformed Content Data File
3.8 Compatibility with Older Identifiers
3.8.1 Legacy Variants
3.8.2 Old Locale Extension Syntax
3.8.3 Relation to OpenI18n
3.9 Transmitting Locale Information
3.9.1 Message Formatting and Exceptions
3.10 Unicode Language and Locale IDs
3.10.1 Written Language
4 Locale Inheritance and Matching
4.1 Multiple Inheritance
4.1.1 Parent Locales
4.2 Inheritance and Validity
4.2.1 Definitions
4.2.2 Resolved Data File
4.2.3 Valid Data
4.2.4 Checking for Draft Status
4.2.5 Keyword and Default Resolution
4.3 Likely Subtags
4.4 Language Matching
5 XML Format
5.1 Common Elements
5.1.1 Element special
5.1.1.1 Sample Special Elements
5.1.2 Element alias
5.1.3 Element displayName
5.1.4 Element cp
5.2 Common Attributes
5.2.1 Attribute type
5.2.2 Attribute draft
5.2.3 Attribute alt
5.3 Common Structures
5.3.1 Date and Date Ranges
5.3.2 Text Directionality
5.3.3 Unicode Sets
5.3.3.1 Single Quote
5.3.3.2 Backslash Escapes
5.4 Identity Elements
5.5 Valid Attribute Values
5.6 Canonical Form
5.6.1 Content
5.6.2 Ordering
5.6.3 Comments
5.6.4 Canonicalization
5.6.5 Element Order Table
5.6.6 Attribute Order Table
5.6.7 Value Order Table
5.6.8 Defaulted Values Table
6 Property Data
7 Lenient Parsing
7.1 Motivation
7.2 Loose Matching
8 Deprecated Structure
8.1 Element fallback
8.2 BCP 47 Keyword Mapping
8.3 Choice Patterns
8.4 Element default
8.5 Attribute standard
Links to Other Parts
References
Acknowledgments
Modifications
1 Introduction
Not long ago, computer systems were like separate worlds, isolated from one another. The internet and related events have changed all that. A single system
can be built of many different components, hardware and software, all needing to work together. Many different technologies have been important in bridging
the gaps; in the internationalization arena, Unicode has provided a lingua franca for communicating textual data.
However, there remain differences in the locale
data used by different systems.
The best practice for internationalization is to store and communicate language-neutral data, and format that data for the client. This formatting
can take place on any of a number of the components in a system; a server might format data based on the user's locale, or it could be that a client machine
does the formatting. The same goes for parsing data, and locale-sensitive analysis of data.
But there remain significant differences across systems and applications in the locale-sensitive data used for such formatting, parsing, and analysis. Many
of those differences are simply gratuitous; all within acceptable limits for human beings, but
yielding different results. In many other cases there are
outright errors. Whatever the cause, the differences can cause discrepancies to creep into a heterogeneous system. This is especially serious in the case of
collation (sort-order), where different collation causes not only ordering differences, but also different results of queries! That is, with a query of customers
with names between "Abbot, Cosmo" and "Arnold, James", if different systems have different sort orders, different lists will be returned. (For comparisons across
systems formatted as HTML tables, see [Comparisons].)
Note: There are many different equally valid ways in which data can be judged to be "correct" for a particular locale. The goal for
the common locale data is to make it as consistent as possible with existing locale data, and acceptable to users in that locale.
This document specifies an XML format for the communication of locale data: the Unicode Locale Data Markup Language (LDML). This provides a common format for systems
to interchange locale data so that they can get the same results in the services provided by internationalization libraries. It also provides a standard format
that can allow users to customize the behavior of a system. With it, for example, collation (sorting) rules can be exchanged, allowing two implementations to
exchange a specification of tailored collation rules. Using the same specification, the two implementations will achieve the same results in comparing strings. Unicode LDML can also be used to let a user encapsulate specialized sorting behavior for a specific domain,
or create a customized locale for a minority language. Unicode LDML is also used in the Unicode Common Locale Data Repository (CLDR). CLDR uses an open process for reconciling differences between
the locale data used on different systems and validating the data, to produce a useful, common, consistent base of locale data.
For more information, see the Common Locale Data Repository project page [LocaleProject].
As LDML is an interchange format, it was designed for ease of maintenance and simplicity of transformation into other formats, above efficiency of run-time lookup and use. Implementations should consider converting LDML data into a more compact format prior to use.
1.1 Conformance
There are many ways to use the Unicode LDML format and the data in CLDR,
and the Unicode Consortium does not restrict the ways in which the format or data are used.
However, an implementation may also claim conformance to LDML or to CLDR, as follows:
UAX35-C1. An implementation that claims conformance to this specification shall:
Identify the sections of the specification that it conforms to.
  For example, an implementation might claim conformance to all LDML features except for transforms and segments.
Interpret the relevant elements and attributes of LDML documents in accordance with the descriptions in those sections.
  For example, an implementation that claims conformance to the date format patterns must interpret the characters in such patterns according to the Date Field Symbol Table.
Declare which types of CLDR data it uses.
  For example, an implementation might declare that it only uses language names, and those with a draft status of contributed or approved.
UAX35-C2. An implementation that claims conformance to Unicode locale or language identifiers shall:
Specify whether Unicode locale extensions are allowed.
Specify the canonical form used for identifiers in terms of casing and field separator characters.
External specifications may also reference particular components of Unicode locale or language identifiers, such as:
  Field X can contain any Unicode region subtag values as given in Unicode Technical Standard #35: Unicode Locale Data Markup Language (LDML), excluding grouping codes.
2 What is a Locale?
Before diving into the XML structure, it is helpful to describe the model behind the structure. People do not have to subscribe to this model to use data
in LDML, but they do need to understand it so that the data can be correctly translated into whatever model their implementation uses.
The first issue is basic:
what is a locale?
In this model, a locale is an identifier (id) that refers to a set of user preferences that tend to be
shared across significant swaths of the world. Traditionally, the data associated with this id provides support for formatting and parsing of dates, times,
numbers, and currencies; for measurement units, for sort-order (collation), plus translated names for time
zones, languages, countries, and scripts. The data can
also include support for text boundaries (character, word, line, and sentence), text transformations (including transliterations), and other services.
Locale data is not cast in stone: the data used on someone's machine generally may reflect the US format, for example, but preferences can typically be set
to override particular items, such as setting the date format for 2002.03.15, or using metric or Imperial measurement units. In the abstract, locales are simply
one of many sets of preferences that, say, a website may want to remember for a particular user. Depending on the application, it may want to also remember
the user's time zone, preferred currency, preferred character set, smoker/non-smoker preference, meal preference (vegetarian, kosher,
and so on), music preference, religion, party affiliation, favorite charity,
and so on.
Locale data in a system may also change over time: country boundaries change; governments (and currencies) come and go; committees impose new standards;
bugs are found and fixed in the source data; and so on. Thus the data needs to be versioned for stability over time.
In general terms, the locale id is a parameter that is supplied to a particular service (date formatting, sorting, spell-checking,
and so on). The format in this
document does not attempt to represent all the data that could conceivably be used by all possible services. Instead, it collects together data that is in common
use in systems and internationalization libraries for basic services. The main difference among locales is in terms of language; there may also be some differences
according to different countries or regions. However, the line between locales and languages, as commonly used in the industry, is rather fuzzy.
Note also that the vast majority of the locale data in CLDR is in fact language data; all non-linguistic data is separated out into a separate tree. For more
information, see Section 3.10 Unicode Language and Locale IDs.
We will speak of data as being "in locale X". That does not imply that a locale is a collection of data; it is simply shorthand for "the set of data
associated with the locale id X". Each individual piece of data is called a resource or field, and a tag indicating the key of the resource is
called a resource tag.
3 Unicode Language and Locale Identifiers
Unicode LDML uses stable identifiers based on [BCP47] for distinguishing among languages, locales, regions, currencies, time
zones, transforms, and so on. There are many systems for identifiers for these entities. The Unicode LDML identifiers may not match the identifiers used
on a particular target system. If so, some process of identifier translation may be required when using LDML data.
3.1 Unicode Language Identifier
A Unicode language identifier has the following structure (provided in either EBNF (Perl-based) or ABNF [RFC5234]):
EBNF
unicode_language_id = "root"
                    | unicode_language_subtag
                      (sep unicode_script_subtag)?
                      (sep unicode_region_subtag)?
                      (sep unicode_variant_subtag)*
sep                 = "-" | "_"
ABNF
unicode_language_id = "root"
                    / unicode_language_subtag
                      [sep unicode_script_subtag]
                      [sep unicode_region_subtag]
                      *(sep unicode_variant_subtag)
sep                 = "-" / "_"
For example, "en-US" (American English), "en_GB" (British English),
"es-419" (Latin American Spanish), and "uz-Cyrl" (Uzbek in Cyrillic) are all
Unicode language identifiers.
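The grammar above can be sketched as a small parser. This is only an illustration: the function name `parse_language_id` is hypothetical, and the subtag shapes (2-3 or 5-8 letters for the language, 4 letters for the script, 2 letters or 3 digits for the region, 5-8 alphanumerics or a digit plus 3 alphanumerics for a variant) are assumptions borrowed from [BCP47], since the grammar above does not spell them out. It checks syntax only, not whether a subtag is registered.

```python
import re

# Assumed subtag shapes (taken from BCP 47, not from the grammar above).
_LANG = r"(?P<language>[A-Za-z]{2,3}|[A-Za-z]{5,8})"
_SCRIPT = r"(?:[-_](?P<script>[A-Za-z]{4}))?"
_REGION = r"(?:[-_](?P<region>[A-Za-z]{2}|[0-9]{3}))?"
_VARIANTS = r"(?P<variants>(?:[-_](?:[0-9A-Za-z]{5,8}|[0-9][0-9A-Za-z]{3}))*)"
_PATTERN = re.compile(
    rf"(?:root|{_LANG}{_SCRIPT}{_REGION}{_VARIANTS})", re.IGNORECASE
)


def parse_language_id(text):
    """Split a Unicode language identifier into fields, or return None."""
    match = _PATTERN.fullmatch(text)
    if match is None:
        return None
    if match.group("language") is None:
        # Only the literal "root" alternative leaves the language group empty.
        return {"language": "root", "script": None, "region": None, "variants": []}
    variants = [v for v in re.split(r"[-_]", match.group("variants")) if v]
    return {
        "language": match.group("language"),
        "script": match.group("script"),
        "region": match.group("region"),
        "variants": variants,
    }


for sample in ("en-US", "en_GB", "es-419", "uz-Cyrl"):
    print(sample, parse_language_id(sample))
```

The fields are returned exactly as written in the input; casing is not normalized here.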
3.2 Unicode Locale Identifier
A Unicode locale identifier is composed of a Unicode language identifier plus
(optional) locale extensions. It has the following structure:
EBNF
unicode_locale_id         = unicode_language_id
                            transformed_extensions?
                            unicode_locale_extensions?
unicode_locale_extensions = sep "u"
                            ((sep keyword)+
                            | (sep attribute)+ (sep keyword)*)
transformed_extensions    = sep "t"
                            (("-" tlang ("-" tfield)*)
                            | ("-" tfield)+)
keyword                   = key (sep type)?
key                       = alphanum{2}
type                      = alphanum{3,8} (sep alphanum{3,8})*
attribute                 = alphanum{3,8}
tlang                     = unicode_language_subtag
                            ("-" unicode_script_subtag)?
                            ("-" unicode_region_subtag)?
                            ("-" unicode_variant_subtag)*
tfield                    = fsep ("-" alphanum{3,8})+
fsep                      = [A-Z a-z] [0-9]
alphanum                  = [0-9 A-Z a-z]
ABNF
unicode_locale_id         = unicode_language_id
                            [transformed_extensions]
                            [unicode_locale_extensions]
unicode_locale_extensions = sep "u"
                            (1*(sep keyword)
                            / 1*(sep attribute) *(sep keyword))
transformed_extensions    = sep "t"
                            (("-" tlang *("-" tfield))
                            / 1*("-" tfield))
keyword                   = key [sep type]
key                       = 2alphanum
type                      = 3*8alphanum *(sep 3*8alphanum)
attribute                 = 3*8alphanum
tlang                     = unicode_language_subtag
                            ["-" unicode_script_subtag]
                            ["-" unicode_region_subtag]
                            *("-" unicode_variant_subtag)
tfield                    = fsep 1*("-" 3*8alphanum)
fsep                      = ALPHA DIGIT
alphanum                  = ALPHA / DIGIT
For historical reasons, this is called a Unicode locale identifier. However, it really functions (with few exceptions) as a language identifier, and accesses language-based data. Except where it would be unclear, this document uses the term "locale" data loosely to encompass
both types of data: for more information, see Section 3.10 Unicode Language and Locale IDs.
Although not shown in the syntax above, Unicode locale identifiers may also have [BCP47] extensions (other than "u")
and private use subtags; these are not, however, relevant to their use in Unicode.
As for terminology, the term code may also be used instead of "subtag",
and "territory" instead of "region". The primary language subtag is also
called the base language code. For example, the base language code
for "en-US" (American English) is "en" (English). The type may also
be referred to as a value or key-value.
The identifiers can vary in case and in the
separator characters. The "-" and "_" separators are
treated as equivalent. All identifier field values are case-insensitive.
Although case distinctions do not carry any special meaning,
an implementation of LDML should use the casing recommendations in [BCP47],
especially when a Unicode locale identifier is used for locale data exchange in software
protocols. The recommendation is that: the region subtag is in uppercase, the script subtag
is in title case, and all other subtags are in lowercase.
Note:
The current version of CLDR uses upper case letters for variant subtags
in its file names for backward compatibility reasons. This might be changed in future
CLDR releases.
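The casing recommendation can be sketched as follows. The function name `recase` is hypothetical, and the way subtags are classified (a 4-letter subtag directly after the language is a script; a 2-letter or 3-digit subtag before any extension is a region; everything from the first single-character subtag onward belongs to an extension) is an assumption based on [BCP47] subtag shapes rather than something stated in this section.

```python
import re


def recase(identifier, sep="-"):
    """Apply the recommended casing: region upper, script title, rest lower."""
    subtags = re.split(r"[-_]", identifier)
    out = [subtags[0].lower()]          # primary language subtag
    in_extension = False
    for position, sub in enumerate(subtags[1:], start=1):
        if in_extension or len(sub) == 1:
            # A singleton such as "u" or "t" starts an extension;
            # everything after it is lowercased.
            in_extension = True
            out.append(sub.lower())
        elif position == 1 and len(sub) == 4 and sub.isalpha():
            out.append(sub.title())     # script subtag
        elif (len(sub) == 2 and sub.isalpha()) or (len(sub) == 3 and sub.isdigit()):
            out.append(sub.upper())     # region subtag
        else:
            out.append(sub.lower())     # variant subtags
    return sep.join(out)


print(recase("EN_us"))
print(recase("UZ-CYRL-uz"))
print(recase("de_de_U_CO_PHONEBK"))
```

The `sep` parameter reflects that "-" and "_" are treated as equivalent; a caller that needs underscores can pass `sep="_"`.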
3.3 BCP 47 Conformance
Unicode language and locale identifiers inherit the design and the repertoire
of subtags from [BCP47] Language Tags. There are some
extensions and restrictions made for the use of the Unicode locale identifier in CLDR:
It does not allow for the full syntax of [BCP47]:
  No irregular or BCP47 grandfathered tags are allowed
  No extlang subtags are allowed
It allows for certain additions:
  For field separator characters, the "_" character can be used as well as the "-" used in [BCP47].
  "root" to indicate the generic locale used as the parent of all languages in the CLDR data model.
  Defined semantics of certain private use codes, and some "macrolanguage" codes.
3.3.1 BCP 47 Language Tag Conversion
A Unicode language/locale identifier can be converted to a valid [BCP47] language tag by performing the following transformation.
Replace the "_" separators with "-"
Replace the special language identifier "root" with the BCP 47 primary language tag "und"
For example,
en_US → en-US
de_DE_u_co_phonebk → de-DE-u-co-phonebk
root → und
root_u_cu_usd → und-u-cu-usd
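The two steps above are simple enough to sketch directly. The function name `to_bcp47` is hypothetical, and the sketch assumes its input is already a well-formed Unicode language/locale identifier.

```python
def to_bcp47(unicode_id):
    """Convert a Unicode language/locale identifier to a BCP 47 language tag."""
    # Step 1: replace "_" separators with "-".
    subtags = unicode_id.replace("_", "-").split("-")
    # Step 2: replace the special language identifier "root" with "und".
    if subtags[0].lower() == "root":
        subtags[0] = "und"
    return "-".join(subtags)


for sample in ("en_US", "de_DE_u_co_phonebk", "root", "root_u_cu_usd"):
    print(sample, "->", to_bcp47(sample))
```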
A valid [BCP47] language tag can be converted to a valid Unicode language/locale identifier by performing the following transformation.
Canonicalize the language tag (afterwards, there will be no extlang subtag)
Replace the BCP 47 primary language subtag "und" with "root" if no script, region, or variant subtags are present
If the BCP 47 primary language subtag matches the type attribute of a languageAlias element in Supplemental Data, replace the language subtag with the replacement value.
If the BCP 47 region subtag matches the type attribute of a territoryAlias element in Supplemental Data, replace the region subtag with the replacement value. (When multiple replacement values are available, use the first one.)
For example,
en-US → en-US (no changes)
und → root
und-US → und-US (no changes, because region subtag is present)
und-u-cu-USD → root-u-cu-usd
cmn-TW → zh-TW (language alias)
sr-CS → sr-RS (territory alias)
Note: In some rare cases, BCP 47 language tags cannot be converted to valid Unicode language/locale identifiers, such as certain [BCP47] grandfathered tags.
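The reverse conversion can be sketched under some loudly stated assumptions. The function name `from_bcp47` and the two alias tables are hypothetical stand-ins: real alias data comes from the languageAlias and territoryAlias elements in the supplemental data, and only the entries needed for the examples above are included. Full BCP 47 canonicalization (for instance, removing extlang subtags) is assumed to have happened already; the sketch only lowercases extension subtags, as the "und-u-cu-USD" example suggests.

```python
# Illustrative subsets only; real values come from CLDR supplemental data.
LANGUAGE_ALIASES = {"cmn": "zh"}
TERRITORY_ALIASES = {"CS": ["RS", "ME"]}  # several replacements: the first is used


def _is_region(subtag):
    # Assumed region shape: two letters or three digits.
    return (len(subtag) == 2 and subtag.isalpha()) or (
        len(subtag) == 3 and subtag.isdigit()
    )


def from_bcp47(tag):
    """Convert an already-canonicalized BCP 47 tag to a Unicode identifier."""
    subtags = tag.split("-")
    # Extensions and private use start at the first single-character subtag.
    ext_start = next(
        (i for i, s in enumerate(subtags) if i > 0 and len(s) == 1), len(subtags)
    )
    language = subtags[0].lower()
    middle = subtags[1:ext_start]          # script, region, variants
    extensions = [s.lower() for s in subtags[ext_start:]]

    # "und" becomes "root" only when no script, region, or variant is present.
    if language == "und" and not middle:
        language = "root"
    language = LANGUAGE_ALIASES.get(language, language)

    resolved = []
    for sub in middle:
        if _is_region(sub) and sub.upper() in TERRITORY_ALIASES:
            resolved.append(TERRITORY_ALIASES[sub.upper()][0])
        else:
            resolved.append(sub)
    return "-".join([language] + resolved + extensions)


for sample in ("en-US", "und", "und-US", "und-u-cu-USD", "cmn-TW", "sr-CS"):
    print(sample, "->", from_bcp47(sample))
```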
3.4 Field Definitions
Unicode language and locale identifier field values are provided in the
following table. Note that some private-use BCP 47 field values are given
specific meanings in CLDR.
Language/Locale Field Definitions
Each entry below gives the field, its allowable characters, and sample values.
Field: unicode_language_subtag (also known as a Unicode base language code)
Allowable Characters: ASCII letters
Sample values: [BCP47] subtag values marked as Type: language
ISO 639-3 introduces the notion of "macrolanguages", where
certain ISO 639-1 or ISO 639-2 codes are given broad semantics, and
additional codes are given for the narrower semantics. For backwards
compatibility, Unicode language identifiers retain use of the
narrower semantics for these codes. For example:
For                          Use  Not
Standard Chinese (Mandarin)  zh   cmn
Standard Arabic              ar   arb
Standard Malay               ms   zsm
Standard Swahili             sw   swh
Standard Uzbek               uz   uzn
Standard Konkani             kok  knn
If a language subtag matches the type attribute of a languageAlias element, then the replacement value is used instead. For example, because
"swh" occurs as the type of a languageAlias element whose replacement is "sw", "sw" must be used instead of "swh". Thus Unicode language identifiers
use "ar-EG" for Standard Arabic (Egypt), not "arb-EG"; they use "zh-TW" for Mandarin Chinese
(Taiwan), not "cmn-TW".
The private use codes from qfz..qtz will never be given specific semantics in Unicode
identifiers, and are thus safe for use for other purposes by other applications.
The CLDR provides data for normalizing language/locale codes, including mapping overlong codes like "eng-840" or "eng-USA" to the correct code "en-US".
Field: unicode_script_subtag (also known as a Unicode script code)
Allowable Characters: ASCII letters
Sample values: [BCP47] subtag values marked as Type: script
In most cases the script is not necessary, since
the language is only customarily written in a single script. Examples of cases where it is used are:
az_Arab  Azerbaijani in Arabic script
az_Cyrl  Azerbaijani in Cyrillic script
az_Latn  Azerbaijani in Latin script
zh_Hans  Chinese, in simplified script
zh_Hant  Chinese, in traditional script
Unicode identifiers give specific semantics to three Unicode Script values [UAX24]:
Zyyy  Common
Qaai  Inherited (the preferred form is now Zinh)
Zzzz  Unknown
The private use subtags from Qaaq..Qabx will never be given specific semantics in Unicode identifiers, and are thus safe for use for other purposes by
other applications.
Field: unicode_region_subtag (also known as a Unicode region code, or a Unicode territory code)
Allowable Characters: ASCII letters and digits
Sample values: [BCP47] subtag values marked as Type: region
Unicode identifiers give specific semantics to the following subtags:
Code  Name                          Comment                                                     ISO 3166-1 status
QO    Outlying Oceania              countries in Oceania [009] that do not have a subcontinent  private use
QU    European Union                the preferred form is now EU                                private use
UK    United Kingdom                the correct form is GB                                      exceptionally reserved
ZZ    Unknown or Invalid Territory  used in APIs or as replacement for invalid code             private use
The private use subtags from XA..XZ will never be given specific semantics in Unicode identifiers, and are thus safe for use for other purposes by
other applications.
The CLDR provides data for normalizing territory/region codes, including mapping overlong codes like "eng-840" or "eng-USA" to the correct code "en-US".
Special Codes:
The territory code 'UK' has a special status in ISO, and is used for the domain name instead of GB. It is thus recognized by CLDR as being an alternate (unnormalized) form of 'GB'.
The territory code '001' (the World) is used to indicate a standardized form, such as "ar-001" for Modern Standard Arabic.
Field: unicode_variant_subtag (also known as a Unicode language variant code)
Allowable Characters: ASCII letters
Sample values: [BCP47] subtag values marked as Type: variant
The CLDR provides data for normalizing variant codes.
Field: attribute
Allowable Characters: ASCII letters and digits
Currently not used, reserved for future use.
Field: key
Allowable Characters: ASCII letters and digits
Key/type definitions are discussed below.
For information on the process for adding new key/type, see [LocaleProject].
All type values except ones used for key "ka" (colAlternate) and "vt" (variableTop) are represented by a single
subtag in the current version of CLDR.
If the type is not included, and one of the possible type values is "true",
then that value is assumed. Note that the default for key with a possible "true" value is often "false", but may not always be.
Field: type
Allowable Characters: ASCII letters and digits
Examples:
en
fr_BE
de_DE_u_co_phonebk_cu_ddm
A locale that only has a language subtag (and optionally a script subtag) is called a language locale; one with both language and territory subtag
is called a territory locale (or country locale).
3.5 Key And Type Definitions
The following chart contains a set of key values that are currently available, with a description or sampling of type values. Each category is associated with an XML file in the bcp47 directory. For the complete
list of valid keys and types defined for Unicode locale extensions, see Section 3.7 Unicode BCP 47 Extension Data.
The BCP47 form is the canonical form, and recommended. Other aliases are included for backwards compatibility.
Key/Type Definitions
category
key
(old key name)
key description
type
(old type name)
type description
Calendar
bcp47/calendar.xml
"ca"
(calendar)
Calendar algorithm
(For information on the calendar algorithms associated with the data
used with these, see [
Calendars
].)
"buddhist"
Thai Buddhist calendar (same as Gregorian except for the year)
"chinese"
Traditional Chinese calendar
"gregory"
(gregorian)
Gregorian calendar
Collation
bcp47/collation.xml
"co"
(collation)
Collation type
"standard"
The default ordering for each language.
For root it is based on the [DUCET] (Default Unicode Collation Element Table): see Root Collation.
Each other locale is based on that, except for appropriate modifications to certain characters for that language.
"search"
A special collation type dedicated for string search—it is not used to determine the relative
order of two strings, but only to determine whether they should be considered equivalent for the specified strength, using
the string search matching rules appropriate for the language. Compared to the normal collator for the language, this may
add or remove primary equivalences, may make additional characters ignorable or change secondary equivalences, and may modify
contractions to allow matching within them, depending on the desired behavior. For example, in Czech, the distinction between
‘a’ and ‘á’ is secondary for normal collation, but primary for search; a search for ‘a’ should never match ‘á’ and vice versa.
A search collator is normally used with strength set to PRIMARY or SECONDARY (should be SECONDARY if using “asymmetric”
search as described in the [UCA] section Asymmetric Search). The
search collator in root supplies matching rules that are appropriate for most languages (and which are different than the
root collation behavior); language-specific search collators may be provided to override the matching rules for a given
language as necessary.
Other keywords provide additional choices for certain locales;
they only have effect in certain locales.
"phonetic"
Requests a phonetic variant if available, where text is sorted based on pronunciation. It may interleave different scripts, if multiple scripts are in common use.
"pinyin"
Pinyin ordering for Latin and for CJK characters; that is, an
ordering for CJK characters based on a character-by-character
transliteration into a pinyin. (used in Chinese)
"reformed"
Reformed collation (such as in Swedish)
"searchjl"
Special collation type for a modified string search in which a pattern consisting of a sequence
of Hangul initial consonants (jamo lead consonants) will match a sequence of Hangul syllable characters whose initial
consonants match the pattern. The jamo lead consonants can be represented using conjoining or compatibility jamo. This
search collator is best used at SECONDARY strength with an "asymmetric" search as described in the
[UCA] section Asymmetric Search and obtained, for example, using
ICU4C's usearch facility with attribute USEARCH_ELEMENT_COMPARISON set to value USEARCH_PATTERN_BASE_WEIGHT_IS_WILDCARD;
this ensures that a full Hangul syllable in the search pattern will only match the same syllable in the searched text
(instead of matching any syllable with the same initial consonant), while a Hangul initial consonant in the search pattern
will match any Hangul syllable in the searched text with the same initial consonant.
For information on each collation setting parameter, from ka to vt, see Setting Options.
Currency
bcp47/currency.xml
"cu"
(currency)
Currency type
ISO 4217 code,
plus others in common use
Codes that are or have been valid in ISO 4217, plus certain additional codes that are or have been in common use.
The full list of codes, with descriptions, is available in the common/main/en.xml file for each release of CLDR.
The list of countries and time periods associated with each currency value is
available.
The XXX code is given a broader interpretation as Unknown or Invalid Currency.
For more information, see
Supplemental Currency Data
Number
bcp47/number.xml
"nu"
(numbers)
Numbering system
Unicode script subtag
Four-letter types indicating the primary numbering system for the corresponding script represented in Unicode. Unless otherwise specified, it is a decimal numbering system using digits [:GeneralCategory=Nd:]. For example, "latn" refers to the ASCII / Western digits 0-9, while "taml" is an algorithmic (non-decimal) numbering system. (The code "tamldec" indicates the "modern Tamil decimal digits".)
For more information, see Numbering Systems.
"arabext"
Extended Arabic-Indic digits ("arab" means the base Arabic-Indic digits)
"armnlow"
Armenian lowercase numerals
"roman"
Roman numerals
"romanlow"
Roman lowercase numerals
"tamldec"
Modern Tamil decimal digits
Time zone
bcp47/timezone.xml
"tz"
(timezone)
Time zone
Unicode short time zone IDs
Short identifiers defined in terms of a TZ time zone database [Olson] identifier in the file common/bcp47/timezone.xml.
For more information, see Section 3.7.1.2 Time Zone Identifiers.
The CLDR provides data for normalizing timezone codes.
Locale variant
bcp47/variant.xml
"va"
Common variant type
"posix"
POSIX style locale variant
For more information on the allowed keys and types, see the specific elements below, and Section 3.7 Unicode BCP 47 Extension Data.
Additional keys or types might be added in future versions. Implementations of LDML should be robust
to handle any syntactically valid key or type values.
3.6 Unknown or Invalid Identifiers
The following identifiers are used to indicate an unknown or invalid code in
Unicode language and locale identifiers. For Unicode identifiers, the region
code uses a private use ISO 3166 code, and Time Zone code uses an
additional code; the
others are defined by the relevant standards. When these codes are used in APIs connected with
Unicode identifiers, the meaning is that either there was no identifier available,
or that at some point an input identifier value was determined to be invalid or ill-formed.
Code Type  Value  Description in Referenced Standards
Language   und    Undetermined language
Script     Zzzz   Code for uncoded script, Unknown [UAX24]
Region     ZZ     Unknown or Invalid Territory
Currency   XXX    The codes assigned for transactions where no currency is involved
Time Zone  unk    Unknown or Invalid Time Zone
When only the script or region are known, then a locale ID will use "und" as the language subtag portion. Thus the locale tag "und_Grek" represents the Greek
script; "und_US" represents the US territory.
3.6.1 Numeric Codes
For region codes, ISO and the UN establish a mapping to three-letter codes and numeric codes. However, this does not extend to the private use codes, which
are the codes 900-999 (total: 100), and AAA-AAZ, QMA-QZZ, XAA-XZZ, and ZZA-ZZZ (total: 1092).
Unicode identifiers supply a standard mapping to these: for the numeric codes, it
uses the top of the numeric private use range; for the 3-letter codes it doubles the final letter. These are the resulting mappings for all of the private use
region codes:
Region  UN/ISO Numeric  ISO 3-Letter
AA      958             AAA
QM..QZ  959..972        QMM..QZZ
XA..XZ  973..998        XAA..XZZ
ZZ      999             ZZZ
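The mapping in this table is regular enough to compute. The sketch below is illustrative only; the function name `private_use_region_codes` is hypothetical. It returns the numeric code and the 3-letter code (formed by doubling the final letter) for a private use 2-letter region code, and rejects anything else.

```python
def private_use_region_codes(region):
    """Return (numeric, three_letter) for a private use 2-letter region code."""
    code = region.upper()
    if len(code) != 2 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"not a 2-letter region code: {region!r}")
    first, second = code
    if code == "AA":
        numeric = 958
    elif first == "Q" and "M" <= second <= "Z":
        numeric = 959 + ord(second) - ord("M")   # QM..QZ -> 959..972
    elif first == "X":
        numeric = 973 + ord(second) - ord("A")   # XA..XZ -> 973..998
    elif code == "ZZ":
        numeric = 999
    else:
        raise ValueError(f"not a private use region code: {region!r}")
    # The 3-letter form doubles the final letter.
    return numeric, code + second


for sample in ("AA", "QM", "QZ", "XA", "XZ", "ZZ"):
    print(sample, private_use_region_codes(sample))
```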
For script codes, ISO 15924 supplies a mapping (however, the numeric codes are not in common use):
Script      Numeric
Qaaa..Qabx  900..949
3.7 Unicode BCP 47 Extension Data
[BCP47] Language Tags provides a mechanism for extending language tags
for use in various applications by extension subtags. Each extension subtag is identified by
a single alphanumeric character subtag assigned by IANA.
The Unicode Consortium has registered and is the maintaining authority for two BCP 47 language tag extensions: the extension 'u' for Unicode locale extension [RFC6067]
and extension 't' for transformed content [RFC6497]. The Unicode BCP 47 extension data defines the complete list of valid subtags. These
subtags are all in lowercase (that is the canonical casing for these subtags); however,
subtags are case-insensitive and casing does not carry any specific meaning. All subtags within the Unicode extensions are alphanumeric characters
in length of two to eight that meet the rule extension in [BCP47].
The -u- Extension.
The syntax of 'u' extension subtags is defined by the rule unicode_locale_extensions in Section 3.2 Unicode Locale Identifier, except that the separator of
subtags, sep, must always be a hyphen '-' when the extension is used as part of a
BCP 47 language tag.
A 'u' extension may contain multiple attributes or keywords as defined in Section 3.2 Unicode Locale Identifier.
Although the order of attributes or keywords does not matter,
this specification defines the canonical form as below:
All attributes are sorted in alphabetical order.
All keywords are sorted by alphabetical order of keys.
All keywords are in lowercase.
All keys and types use the canonical form (from the name attribute; see Appendix Q).
For example, the canonical form of 'u' extension "u-foo-bar-nu-thai-ca-buddhist" is
"u-bar-foo-ca-buddhist-nu-thai". The attributes "foo" and "bar" in this example are provided only for illustration; no attribute subtags are defined
by the current CLDR specification.
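The ordering and casing rules can be sketched as follows. The function name `canonicalize_u_extension` is hypothetical, and the sketch covers only sorting and lowercasing: it does not map old key or type names to their canonical names, which would require the data files. It relies on the grammar in Section 3.2: attributes (3 to 8 alphanumerics) come before the first key (2 alphanumerics), and each key is followed by zero or more type subtags.

```python
def canonicalize_u_extension(extension):
    """Sort attributes and keywords of a 'u' extension and lowercase it."""
    subtags = extension.lower().replace("_", "-").split("-")
    if len(subtags) < 2 or subtags[0] != "u":
        raise ValueError(f"not a 'u' extension: {extension!r}")

    attributes = []
    keywords = []                      # each entry is [key, type_subtag, ...]
    for sub in subtags[1:]:
        if not (sub.isascii() and sub.isalnum()):
            raise ValueError(f"invalid subtag: {sub!r}")
        if len(sub) == 2:
            keywords.append([sub])     # a 2-character subtag starts a keyword
        elif 3 <= len(sub) <= 8:
            if keywords:
                keywords[-1].append(sub)   # part of the current keyword's type
            else:
                attributes.append(sub)     # before any key: an attribute
        else:
            raise ValueError(f"invalid subtag length: {sub!r}")

    attributes.sort()
    keywords.sort(key=lambda keyword: keyword[0])
    flat = [sub for keyword in keywords for sub in keyword]
    return "-".join(["u"] + attributes + flat)


print(canonicalize_u_extension("u-foo-bar-nu-thai-ca-buddhist"))
```

Type subtags keep their original relative order within a keyword, since only attributes and keys are sorted.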
The -t- Extension.
The syntax of 't' extension subtags is defined by the rule transformed_extensions in Section 3.2 Unicode Locale Identifier, except that the separator of
subtags, sep, must always be a hyphen '-' when the extension is used as part of a
BCP 47 language tag. For information about the registration process, meaning, and usage of the 't' extension, see [RFC6497].
The 'u' extension data is stored in multiple XML files located under
common/bcp47 directory in CLDR. Each file contains the locale extension key/type values
and their backward compatibility mappings appropriate for a particular domain.
For example,
common/bcp47/collation.xml
contains key/type values for collation, including optional collation parameters and valid type values for each key.
The 't' extension data is stored in
common/bcp47/transform.xml
The extension attribute in
the extension attribute is "u" (Unicode locale extension). The
3.7.1 Unicode Locale Extension Data Files
In the Unicode locale extension 'u' data files, the common attributes
for the
Note: There are no values defined for the locale extension attribute in the current CLDR release.
name
The key or type name used by Unicode locale extension with
'u' extension syntax
When
alias
below is absent, this name can be also
used with the old style
"@key=type" syntax
The type name "CODEPOINTS" is reserved for a variable representing
Unicode code point(s). The syntax is:
            EBNF                            ABNF
codepoints  = codepoint (sep codepoint)*    = codepoint *(sep codepoint)
codepoint   = [0-9 A-F a-f]{4,6}            = 4*6HEXDIG
In addition, no codepoint may exceed 10FFFF. For example, "00A0", "300b", "10D40C" and "00C1-00E1" are
valid, but "A0", "U060C" and "110000" are not.
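As a rough check of the syntax above, the following Python sketch validates a "CODEPOINTS" value. The function name is hypothetical; it follows the ABNF form (one or more code points separated by sep, here assumed to be a hyphen) and adds the 10FFFF upper bound stated in the text.

```python
import re

# Four to six hexadecimal digits, as in the codepoint rule.
_CODEPOINT = re.compile(r"[0-9A-Fa-f]{4,6}")

def is_valid_codepoints(value: str, sep: str = "-") -> bool:
    """Return True if value is one or more valid code points joined by sep."""
    for part in value.split(sep):
        if not _CODEPOINT.fullmatch(part):
            return False
        if int(part, 16) > 0x10FFFF:
            return False
    return True
```

With this sketch, "00A0", "300b", "10D40C" and "00C1-00E1" are accepted, while "A0", "U060C" and "110000" are rejected, matching the examples in the text.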
In the current version of CLDR, the type "CODEPOINTS" is only used for the locale extension
key "vt" (variableTop). The subtags forming the type for "vt" represent an arbitrary string of characters.
There is no formal limit on the number of characters, although in practice anything above 1 will be rare,
and anything longer than 4 might be useless. Repetition is allowed; for example, 0061-0061 ("aa") is a valid
type value for "vt", since the sequence may be a collating element. Order is vital: 0061-0062 ("ab") is
different from 0062-0061 ("ba").
For example,
en-u-vt-0061
: this indicates English, with any characters sorting at or below "a" (at a primary level) considered Variable.
en-u-vt-0061-0065
: this indicates English, with any characters sorting at or below the sequence "ae" (at a primary level) considered Variable.
By default in UCA, variable characters are ignored in sorting at a primary, secondary, and tertiary level. But in CLDR, they are not ignorable by default.
For more information, see
Collation: Section 3.3
Setting Options
The type name "REORDER_CODE" is reserved for reordering block names (e.g. "latn", "digit" and "others")
defined in the
Root Collation
The type "REORDER_CODE" is used for locale extension key "kr" (colReorder). The value of type for "kr" is represented by one or more
reordering block names such as "latn-digit".
For more information, see
Collation: Section 3.12
Collation Reordering
In the current version of CLDR, all type names except "CODEPOINTS" and "REORDER_CODE" are final and used alone.
For example, "gregory" and "japanese" are valid type names for key "ca" (calendar). Both "u-ca-gregory" and "u-ca-japanese" are
valid representations of Unicode locale extension, but "u-ca-gregory-japanese" is not.
alias
(Not applicable to
The BCP47 form is the canonical form, and recommended. Other aliases are included only for backwards compatibility.
Example:
description="Phonebook style ordering (such as in German)"/>
The preferred term, and the only one to be used in BCP47, is the name: in this example, "phonebk".
The alias is a key or type name used by Unicode locale extensions with
the old
"@key=type" syntax
The attribute value for type may contain multiple names delimited by
ASCII space characters. Of those aliases, the first name is the preferred
value.
description
The description of the key, type or attribute element.
since
The version of CLDR in which this key or type was introduced.
Absence of this attribute value implies the key or type was available
in CLDR 1.7.2.
deprecated
The deprecation status of the key, type or attribute element. The value
"true" indicates the element is deprecated and no longer used in the version of CLDR. The
default value is "false".
For example,
...
...
The data above indicates:
type "pinyin" is valid for key "co", thus "u-co-pinyin" is a valid Unicode
locale extension.
type "pinyin" is not valid for key "ka", thus "u-ka-pinyin" is not a valid
Unicode locale extension.
type "pinyin" has no
alias
, so "zh@collation=pinyin" is a
valid Unicode locale identifier according to the old syntax.
type "noignore" has an alias attribute, so "en@colAlternate=noignore" is not
a valid Unicode locale identifier according to the old syntax.
type "aumel" is valid for key "tz", supported by CLDR 1.7.2
(default value) or later versions.
type "aumqi" is valid for key "tz", supported by CLDR 1.8.1
or later versions.
3.7.1.1 Numbering System Data
LDML supports multiple numbering systems. The identifiers for those numbering systems are defined in the file
bcp47/number.xml
. For example, for the 'trunk' version of the data see
bcp47/number.xml
Details about those numbering systems are defined in
supplemental/numberingSystems.xml
. For example, for the 'trunk' version of the data see
supplemental/numberingSystems.xml
LDML makes certain stability guarantees on this data:
Like other BCP47 identifiers, once a numeric identifier is added to
bcp47/number.xml
or
numberingSystems.xml
, it will never be removed from either of those files.
If an identifier has type="numeric" in numberingSystems.xml, then
It is a decimal, positional numbering system with an attribute digits=X, where X is a string with the 10 digits in order used by the numbering system.
The values of the type and digits will never change.
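Because a numeric numbering system is decimal and positional, rendering a number in it amounts to a digit-for-digit substitution. The Python sketch below illustrates this under stated assumptions: the function name is hypothetical, and THAI_DIGITS is assumed to be the ten-character digits string for the "thai" numbering system (U+0E50 through U+0E59).

```python
# Assumption: the digits string for "thai", built from U+0E50..U+0E59 in order.
THAI_DIGITS = "".join(chr(0x0E50 + i) for i in range(10))

def to_numbering_system(ascii_number: str, digits: str) -> str:
    """Replace ASCII digits 0-9 with the corresponding digits of a numeric system."""
    if len(digits) != 10:
        raise ValueError("a numeric numbering system has exactly 10 digits")
    table = {ord("0") + i: digits[i] for i in range(10)}
    # Characters other than ASCII digits (signs, separators) pass through unchanged.
    return ascii_number.translate(table)
```

Grouping and decimal separators are not handled here; they come from locale data rather than from the numbering system definition.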
3.7.1.2 Time Zone Identifiers
Short Time Zone Identifiers
LDML inherits time zone IDs from the tz database [
Olson
]. Because these IDs from the tz database do not satisfy the BCP 47
language subtag syntax requirements, CLDR defines short identifiers for use in the Unicode locale extension. The short identifiers are defined
in the file
common/bcp47/timezone.xml
The short identifiers use UN/LOCODE [
LOCODE
] (excluding a space character) codes where possible. For example,
the short identifier for "America/Los_Angeles" is "uslax" (the LOCODE for Los Angeles, US is "US LAX"). Identifiers of length not equal to 5 are
used where there is no corresponding LOCODE, such as "usnavajo" for "America/Shiprock", or "utcw01" for "Etc/GMT+1".
There is a special code "unk" for an Unknown or Invalid time zone. This can be expressed in the tz database style ID "Etc/Unknown",
although it is not defined in the tz database.
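The relationship between a LOCODE and a short identifier can be sketched as follows. This is only an illustration: the function and table names are hypothetical, and the table holds just the identifiers mentioned in the text, not the contents of common/bcp47/timezone.xml.

```python
def short_id_from_locode(locode: str) -> str:
    """Derive a short time zone identifier from a UN/LOCODE such as 'US LAX'."""
    return locode.replace(" ", "").lower()

# Illustrative excerpt only, taken from the examples in the text.
SHORT_IDS = {
    "America/Los_Angeles": short_id_from_locode("US LAX"),
    "America/Shiprock": "usnavajo",  # no corresponding LOCODE
    "Etc/GMT+1": "utcw01",           # no corresponding LOCODE
    "Etc/Unknown": "unk",            # special code for unknown or invalid zones
}
```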
Stability of Time Zone Identifiers
Although the short time zone identifiers are guaranteed to be stable, the preferred IDs in the tz database (as those found in
zone.tab
file)
might be changed from time to time. For example, "Asia/Calcutta" was replaced with "Asia/Kolkata" and moved to
backward
file in the tz database.
Because CLDR contains locale data using a time zone ID from the tz database as the key, stability of the IDs is critical.
To maintain the stability of "long" IDs (for those inherited from the tz database), a special rule is applied to the
alias
attribute in the
element for "tz" - the first "long" ID is the CLDR canonical "long" time zone ID.
For example:
Above
CLDR canonical
"long" ID
"Asia/Calcutta", and an alias "Asia/Kolkata".
3.7.2 Transformed Content Data File
In the transformed content 't' data file, the name attribute in a
defines a valid field separator subtag. The name attribute in an enclosed
field subtag for the field separator subtag. For example:
since="21"/>
The data above indicates:
"m0" is a valid field separator for the transformed content extension 't'.
field subtag "ungegn" is valid for field separator "m0".
field subtag "ungegn" was introduced in CLDR 21.
The attributes are:
name
The name of the mechanism, limited to 3-8 characters (or sequences of them).
description
A description of the name, with all and only that information necessary to distinguish one name from others with which it might be confused. Descriptions are not intended to provide general background information.
since
Indicates the first version of CLDR where the name appears. (Required for new items.)
alias
Alternative name, not limited in number of characters. Aliases are intended for compatibility, not to provide all possible alternate names or designations.
(Optional)
For information about the registration process, meaning, and usage of the 't' extension, see [
RFC6497
].
3.8 Compatibility with Older Identifiers
LDML versions before 1.7.2 used a slightly different syntax for variant subtags and locale
extensions. Implementations of LDML may provide backward-compatible identifier support as described in
the following sections.
3.8.1 Legacy Variants
Older LDML specifications allowed codes other than registered [BCP47] variant
subtags to be used in Unicode language and locale identifiers for representing variations of locale data.
Unicode locale identifiers including such variant codes can be converted to the new [BCP47]
compatible identifiers by following the descriptions below:
Legacy Variant Mappings
Variant Code    Description
AALAND          Åland, variant of "sv" Swedish used in Finland. Use "sv_AX" to indicate this.
BOKMAL          Bokmål, variant of "no" Norwegian. Use primary language subtag "nb" to indicate this.
NYNORSK         Nynorsk, variant of "no" Norwegian. Use primary language subtag "nn" to indicate this.
POSIX           POSIX variation of locale data. Use Unicode locale extension "-u-va-posix" to indicate this.
POLYTONI        Polytonic, variant of "el" Greek. Use [BCP47] variant subtag "polyton" to indicate this.
SAAHO           The Saaho variant of Afar. Use primary language subtag "ssy" to indicate this.
3.8.2 Old Locale Extension Syntax
LDML 1.7 and older specifications used a different syntax for representing Unicode
locale extensions. The previous definition of Unicode locale extensions had the following structure:
                               EBNF                             ABNF
old_unicode_locale_extensions  = "@" old_key "=" old_type       = "@" old_key "=" old_type
                                 (";" old_key "=" old_type)*      *(";" old_key "=" old_type)
The new specification mandates keys to be two alphanumeric characters and types to be
three to eight alphanumeric characters. As a result, new codes were assigned to all
existing keys and some types. For example, a new key "co" replaced the previous key
"collation", and a new type "phonebk" replaced the previous type "phonebook". However,
the existing collation type "big5han" already satisfied the new requirement, so no new type code
was assigned to the type. The chart below shows some example mappings between the
new syntax and the old syntax.
Locale Extension Mappings
Old (LDML 1.7 or older)                     New
de_DE@collation=phonebook                   de_DE_u_co_phonebk
zh_Hant_TW@collation=big5han                zh_Hant_TW_u_co_big5han
th_TH@calendar=gregorian;numbers=thai       th_TH_u_ca_gregory_nu_thai
en_US_POSIX@timezone=America/Los_Angeles    en_US_u_tz_uslax_va_posix
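A conversion from the old syntax to the new one might be sketched as below. This is a minimal illustration under stated assumptions: the function name and the two mapping tables are hypothetical and contain only the keys and types that appear in the examples here; a real implementation would take its mappings from the name and alias attributes in the common/bcp47 data files.

```python
# Hypothetical excerpts; only the mappings used in the examples of this section.
KEY_MAP = {"collation": "co", "calendar": "ca", "numbers": "nu", "timezone": "tz"}
TYPE_MAP = {
    ("co", "phonebook"): "phonebk",
    ("ca", "gregorian"): "gregory",
    ("tz", "America/Los_Angeles"): "uslax",
}

def convert_old_locale_id(old_id: str) -> str:
    """Convert an old-style '@key=type' locale id to the 'u' extension form."""
    base, _, ext = old_id.partition("@")
    keywords = {}
    if ext:
        for pair in ext.split(";"):
            # Leniently accept a repeated '@' before a later pair.
            old_key, _, old_type = pair.lstrip("@").partition("=")
            key = KEY_MAP[old_key]
            # Types without a new code (such as "big5han") are kept as they are.
            keywords[key] = TYPE_MAP.get((key, old_type), old_type).lower()

    subtags = base.split("_")
    if subtags[-1] == "POSIX":
        # Legacy POSIX variant becomes the "va-posix" keyword.
        subtags.pop()
        keywords["va"] = "posix"

    if not keywords:
        return "_".join(subtags)
    parts = subtags + ["u"]
    for key in sorted(keywords):  # canonical form sorts keywords by key
        parts += [key, keywords[key]]
    return "_".join(parts)
```

For instance, convert_old_locale_id("en_US_POSIX@timezone=America/Los_Angeles") returns "en_US_u_tz_uslax_va_posix".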
3.8.3 Relation to OpenI18n
The locale id format generally follows the description in the
OpenI18N Locale Naming Guideline [NamingGuideline], with
some enhancements. The main differences from those guidelines are that the locale id:
does not include a charset (since the data in LDML format always provides a representation of all Unicode characters. The repository is stored in UTF-8,
although that can be transcoded to other encodings as well.),
adds the ability to have a variant, as in Java
adds the ability to discriminate the written language by script (or script variant).
is a superset of [BCP47] codes.
3.9 Transmitting Locale Information
In a world of on-demand software components, with arbitrary connections between those components, it is important to get a sense of where localization should
be done, and how to transmit enough information so that it can be done at that appropriate place. End-users need to get messages localized to their languages,
messages that not only contain a translation of text, but also contain variables such as date, time, number formats, and currencies formatted according to the
users' conventions. The strategy for doing the so-called
JIT localization
is made up of two parts:
Store and transmit
neutral-format
data wherever possible.
Neutral-format data is data that is kept in a standard format, no matter what the local user's environment is. Neutral-format is also (loosely)
called
binary data
, even though it actually could be represented in many different ways, including a textual representation such as in XML.
Such data should use accepted standards where possible, such as for currency codes.
Textual data should also be in a uniform character set (Unicode/10646) to avoid possible data corruption problems when converting between encodings.
Localize that data as "
close
" to the end-user as possible.
There are a number of advantages to this strategy. The longer the data is kept in a neutral format, the more flexible the entire system is. On a practical
level, if transmitted data is neutral-format, then it is much easier to manipulate the data, debug the processing of the data, and maintain the software connections
between components.
Once data has been localized into a given language, it can be quite difficult to programmatically convert that data into another format, if required. This
is especially true if the data contains a mixture of translated text and formatted variables. Once information has been localized into, say, Romanian, it is
much more difficult to localize that data into, say, French. Parsing is more difficult than formatting, and may run up against different ambiguities in interpreting
text that has been localized, even if the original translated message text is available (which it may not be).
Moreover, the closer we are to the end-user, the more we know about that user's preferred formats. If we format dates, for example, at the user's machine, then
it can easily take into account any customizations that the user has specified. If the formatting is done elsewhere, either we have to transmit whatever user
customizations are in play, or we only transmit the user's locale code, which may only approximate the desired format. Thus the closer the localization is to
the end user, the less we need to ship all of the user's preferences around to all the places that localization could possibly need to be done.
Even though localization should be done as close to the end-user as possible, there will be cases where different components need to be aware of whatever
settings are appropriate for doing the localization. Thus information such as a locale code or
time zone needs to be communicated between different components.
3.9.1 Message Formatting and Exceptions
Windows (FormatMessage, String.Format), Java (MessageFormat), and ICU (MessageFormat, umsg) all provide methods of formatting variables (dates, times, etc.) and inserting them
at arbitrary positions in a string. This avoids the manual string concatenation that causes severe problems for localization. The question is, where to do this?
It is especially important since the original code site that originates a particular message may be far down in the bowels of a component, and passed up to
the top of the component with an exception. So we will take that case as representative of this class of issues.
There are circumstances where the message can be communicated with a language-neutral code, such as a numeric error code or mnemonic string key, that is
understood outside of the component. If there are arguments that need to accompany that message, such as a number of files or a datetime, those need to accompany
the numeric code so that when the localization is finally done at some point, the full information can be presented to the end-user. This is the best case for localization.
More often, the exact messages that could originate from within the component are not known outside of the component itself; or at least they may not be
known by the component that is finally displaying text to the user. In such a case, the information as to the user's locale needs to be communicated in some
way to the component that is doing the localization. That locale information does not necessarily need to be communicated deep within the component; ideally,
any exceptions should bundle up some language-neutral message ID, plus the arguments needed to format the message (for
example, datetime), but not do the localization
at the throw site. This approach has the advantages noted above for JIT localization.
In addition, exceptions are often caught at a higher level; they do not end up being displayed to any end-user at all. By avoiding the localization at the
throw site, the cost of doing formatting is avoided when that formatting is not really necessary. In fact, in many running programs most of the exceptions that are
thrown at a low level never end up being presented to an end-user, so this can have considerable performance benefits.
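The approach described above, an exception that carries a language-neutral message ID and its arguments and is localized only when displayed, can be sketched in Python as follows. Everything here is hypothetical and for illustration: the class name, the message ID, and the tiny hard-coded catalogs, which stand in for real message formatting driven by locale data.

```python
from datetime import datetime

class NeutralError(Exception):
    """Carries a message ID and neutral-format arguments; no text is formatted here."""
    def __init__(self, message_id: str, **arguments):
        super().__init__(message_id)
        self.message_id = message_id
        self.arguments = arguments

# Hypothetical catalogs standing in for localized resources.
CATALOGS = {
    "en": {"file.modified": "File {name} was last modified on {when:%m/%d/%Y}."},
    "fr": {"file.modified": "Le fichier {name} a été modifié le {when:%d/%m/%Y}."},
}

def localize(error: NeutralError, locale: str) -> str:
    """Format the message at the display site, using the user's locale."""
    pattern = CATALOGS[locale][error.message_id]
    return pattern.format(**error.arguments)

def low_level_operation():
    # The throw site records only neutral data; it does no localization.
    raise NeutralError("file.modified", name="a.txt", when=datetime(2003, 3, 20))
```

A caller that catches the exception and never shows it to a user pays no formatting cost; a caller that does display it invokes localize with whatever locale applies at that point.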
3.10 Unicode Language and Locale IDs
People have very slippery notions of what distinguishes a language code
versus a locale code. The problem is that both are somewhat nebulous concepts.
In practice, many people use [BCP47] codes to mean locale codes instead of strictly language codes. It is easy to see why this came
about; because [BCP47] includes an explicit region (territory) code, for most people it was sufficient for use as a
locale code as well. For example, when typical web software receives a [BCP47] code, it will use it as a locale code. Other typical software will
do the same: in practice, language codes and locale codes are treated interchangeably. Some people recommend distinguishing on the basis of "-"
versus "_" (for example,
zh-TW
for language code,
zh_TW
for locale code), but in practice that does not work because of the free variation out in the world in the use
of these separators. Notice that Windows, for example, uses "-" as a separator in its locale codes. So pragmatically one is forced to treat "-" and "_" as equivalent
when interpreting either one on input.
Another reason for the conflation of these codes is that very little data in most systems is distinguished by region alone; currency codes and measurement
systems being some of the few. Sometimes date or number formats are mentioned as regional, but that really
does not make much sense. If people see the sentence
"You will have to adjust the value to १,२३४.५६७ from ૭૧,૨૩૪.૫૬" (using Indic digits), they would say that sentence is simply not English. Number format is far
more closely associated with language than it is with region. The same is true for date formats: people would never expect to see intermixed a date in the format
"2003年4月1日" (using Kanji) in text purporting to be purely English. There are regional differences in date and number format — differences which can be important
— but those are different in kind than other language differences between regions.
As far as we are concerned —
as a completely practical matter
— two languages are different if they require substantially different localized resources.
Distinctions according to spoken form are important in some contexts, but the written form is by far and away the most important issue for data interchange.
Unfortunately, this is not the principle used in [ISO639], which has the fairly unproductive notion (for data interchange) that only spoken
language matters (it is also not completely consistent about this, however).
[BCP47] can express a difference if the use of written languages happens to correspond to region boundaries expressed
as [ISO3166] region codes, and has recently added codes that allow it to express some important cases that are not distinguished by [ISO3166]
codes. These written languages include simplified and traditional Chinese (both used in Hong Kong S.A.R.); Serbian in Latin script; Azerbaijani in Arabic script,
and so on.
Notice also that currency codes are different from currency localizations. The currency localizations should largely be in the language-based
resource bundles, not in the territory-based resource bundles. Thus, the resource bundle en contains the localized mappings in English for a range of
different currency codes: USD → US$, RUR → Rub, AUD → $A and so on. Of course, some currency symbols are used for more than one currency, and in such cases specializations
appear in the territory-based bundles. Continuing the example, en_US would have USD → $, while en_AU would have AUD → $. (In protocols, the currency
codes should always accompany any currency amounts; otherwise the data is ambiguous, and software is forced to use the user's territory to guess at the currency.
For some informal discussion of this, see JIT Localization.)
3.10.1 Written Language
Criteria for what makes a written language should be purely pragmatic: what would copy-editors say? If one gave them text like the following, they
would respond that it is far from acceptable English for publication, and ask for it to be redone:
"Theatre Center News: The date of the last version of this document was 2003年3月20日. A copy can be obtained for $50,0 or 1.234,57 грн. We would
like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Behdad Esfahbod, Ahmed Talaat, Eric Mader, Asmus Freytag,
Avery Bishop, and Doug Felt."
So one would change it to one of the two versions below, depending on which orthographic variant of English was the target for the publication:
"Theater Center News: The date of the last version of this document was 3/20/2003. A copy can be obtained for $50.00 or 1,234.57 Ukrainian Hryvni. We
would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Ahmed Talaat, Asmus Freytag, Avery Bishop, Behdad
Esfahbod, Doug Felt, Eric Mader."
"Theatre Centre News: The date of the last version of this document was 20/3/2003. A copy can be obtained for $50.00 or 1,234.57 Ukrainian Hryvni. We
would like to acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Ahmed Talaat, Asmus Freytag, Avery Bishop, Behdad
Esfahbod, Doug Felt, Eric Mader."
Clearly there are many acceptable variations on this text. For example, copy editors might still quibble with the use of first
versus last name sorting in the list, but clearly the first list was not in acceptable English alphabetical order. And in quoting a name, like "Theatre Centre News", one may leave it in
the source orthography even if it differs from the publication target orthography. And so on. However, just as clearly, there are limits on what is acceptable English,
and "2003年3月20日", for example, is not.
Note that the language of locale data may differ from the language of localized software or web sites, when those latter are not localized into the user's
preferred language. In such cases, the kind of incongruous juxtapositions described above may well appear, but this situation is usually preferable to forcing
unfamiliar date or number formats on the user as well.
4 Locale Inheritance and Matching
The XML format relies on an inheritance model, whereby the resources are collected into bundles, and the bundles organized into a tree. Data for the
many Spanish locales does not need to be duplicated across all of the countries having Spanish as a national language. Instead, common data is collected in
the Spanish language locale, and territory locales only need to supply differences. The parent of all of the language locales is a generic locale known as
root. Wherever possible, the resources in the root are language & territory neutral.
For example, the collation (sorting) order in the root is based on the [DUCET] (see Root Collation).
Since English language collation has the same ordering as the root locale,
the 'en' locale data does not need
to supply any collation data, nor does either the 'en_US' or the 'en_IE' locale data.
Given a particular locale id "en_IE_someVariant", the search chain for a particular resource is the following.
en_IE_someVariant
en_IE
en
root
If a type and key are supplied in the locale id, then logically the chain from that id to the root is searched for a resource tag with a given type, all
the way up to root. If no resource is found with that tag and type, then the chain is searched again without the type.
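The search chain above can be produced by repeatedly truncating the locale id at its last underscore. The Python sketch below illustrates this; the function names and the dictionary-of-dictionaries bundle representation are assumptions made for the example, and the sketch ignores keyword handling and any inheritance overrides.

```python
def fallback_chain(locale_id: str) -> list:
    """Return the truncation chain from locale_id up to and including root."""
    chain = []
    current = locale_id
    while current and current != "root":
        chain.append(current)
        # Drop the last "_"-separated subtag; yields "" when none is left.
        current = current.rpartition("_")[0]
    chain.append("root")
    return chain

def lookup(bundles: dict, locale_id: str, key: str):
    """Return the first value found for key along the chain, or None."""
    for locale in fallback_chain(locale_id):
        if key in bundles.get(locale, {}):
            return bundles[locale][key]
    return None
```

Here fallback_chain("en_IE_someVariant") yields the four entries listed above, in the same order.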
Thus the data for any given locale will only contain resources that are different from the parent locale. For example, most territory locales will inherit
the bulk of their data from the language locale: "en" will contain the bulk of the data, while "en_IE" will only contain a few items like currency. All data that
is inherited from a parent is presumed to be valid, just as valid as if it were physically present in the file. This provides for much smaller resource bundles,
and much simpler (and less error-prone) maintenance. At the
script or region level, the "primary" child locale will be empty, since its
parent will contain all of the appropriate resources for it. For more
information see
CLDR Information : Section 9.3
Default Content
Certain data items depend only on the region specified in a locale id, and are obtained from supplemental
data rather than through locale resources. For example:
The currency for the specified region (see Supplemental Currency Data)
The measurement system for the specified region (see Measurement System Data)
The week conventions for the specified region (see Week Data)
These items will be correct for the specified region regardless of whether a locale bundle actually exists
with the same combination of language and region as in the locale id. For example, suppose data is requested for the locale id
"fr_US" and there is no bundle for that combination. Data obtained via locale inheritance, such as currency patterns and
currency symbols, will be obtained from the parent locale "fr". However, currency amounts would be formatted by default using
US dollars, just displayed in the manner governed by the locale "fr". When a locale id does not specify a region, the
region-specific items such as those above are obtained from the likely region for the locale (obtained via
Likely Subtags
).
If a language has more than one script in
customary modern use, then the CLDR file structure in common/main follows
the following model:
lang
lang_script
lang_script_region
lang_region
(aliases to lang_script_region)
There are actually two different kinds of
fallback: resource bundle lookup and resource item lookup. For the former, a
process is looking to find the first, best resource bundle it can; for the
latter, it is fallback within bundles on individual items, like the
translated name for the region "CN" in Breton. These are closely related,
but distinct, processes. Below "key" stands for zero or more key/type pairs.
Lookup Differences
Lookup Type: Resource bundle lookup
Example: se-FI → se → default* → root
Comments: * default may have its own inheritance chain; for example, it may be "en-GB → en". In that case, the chain is expanded by inserting the chain, resulting in: se-FI → se → fi → en-GB → en → root
Lookup Type: Resource item lookup
Example: se-FI+key → se+key → root_alias*+key → root+key
Comments: * if there is a root_alias to another key or locale, then insert that entire chain. For example, suppose that months for another calendar system have a root alias to Gregorian months. In that case, the root alias would change the key, and retry from se-FI downward: se-FI+key → se+key → root_alias*+key → se-FI+key2 → se+key2 → root_alias*+key2 → root+key2
The fallback is a bit different for these two
cases; internal aliases and keys are not involved in the bundle lookup, and the
default locale is not involved in the item lookup. Moreover, the resource
item lookup must remain stable, because the resources are built with a
certain fallback in mind; changing the core fallback order can render the
bundle structure incoherent. Resource bundle lookup, on the other hand, is
more flexible; changes in the view of the "best" match between the input
request and the output bundle are more tolerable when they represent overall
improvements for users. For more information, see
Section 8.1 Element fallback
Where the LDML inheritance relationship does not match a target system, such as POSIX, the data logically should be fully resolved in converting to a format
for use by that system, by adding
all
inherited data to each locale data set.
For a more complete description of how inheritance applies to data, and the use of keywords, see
Section 4.2 Inheritance
The locale data does not contain general character properties that are derived from the Unicode Character Database [UAX44].
That data being common across locales, it is not duplicated in the bundles. Constructing a POSIX locale from the CLDR data requires use of UCD data. In addition,
POSIX locales may also specify the character encoding, which requires the data to be transformed into that target encoding.
Warning:
If a locale has a different script than its parent (for
example, sr_Latn), then special attention must be paid to make sure that all inheritance is
covered. For example, auxiliary exemplar characters may need to be empty ("[]") to block inheritance.
Empty Override:
There is one special value reserved in LDML to indicate that a child locale is to have no value for a path, even if the parent locale has a value for that path. That value is "∅∅∅". For example, if there is no phrase for "two days ago" in a language, that can be indicated with:
4.1 Multiple Inheritance
In clearly specified instances, resources may inherit from within the same locale. For example, currency format symbols inherit from the number format symbols;
the Buddhist calendar inherits from the Gregorian calendar. This only happens where documented in this specification. In these special cases, the inheritance
functions as normal, up to the root. If the data is not found along that path, then a second search is made, logically changing the element/attribute to the
alternate values.
For example, for the locale "en_US" the month data in
in "en", then in "root". If not found there, then it inherits from
in "root".
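The two-pass search described above can be sketched as follows. This is an illustration under stated assumptions: the function names, the path strings, and the sample data are all hypothetical; the only point shown is that the whole locale chain is searched for the original path before the whole chain is searched again for the alternate one.

```python
def truncation_chain(locale_id: str) -> list:
    """Return locale_id and its truncated ancestors, ending with root."""
    chain = []
    current = locale_id
    while current and current != "root":
        chain.append(current)
        current = current.rpartition("_")[0]
    chain.append("root")
    return chain

def lookup_with_alternate(data: dict, locale_id: str, path: str, alternate_path: str):
    """Search the chain for path; if nothing is found, search it again for alternate_path."""
    for candidate in (path, alternate_path):
        for locale in truncation_chain(locale_id):
            value = data.get(locale, {}).get(candidate)
            if value is not None:
                return value
    return None

# Hypothetical data: only Gregorian month names are present.
SAMPLE_DATA = {
    "root": {"gregorian/months/1": "M01"},
    "en": {"gregorian/months/1": "January"},
}
```

With SAMPLE_DATA, a request from "en_US" for a Buddhist month name finds nothing on the first pass and falls back to the Gregorian value in "en" on the second.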
4.1.1 Parent Locales
In some cases, the normal truncation inheritance does not function well. This happens when:
The child locale is of a different script. In this case, mixing elements from the parent into the child data results in a mishmash.
A large number of child locales behave similarly, and differently from the truncation parent.
The
parentLocale
element is used to override the normal inheritance when accessing CLDR data.
For case 1, the children are script locales, and the parent is "root". For example:
For case 2, the children and parent share the same primary language, but the region is changed. For example:
Collation data, however, is an exception. Since collation rules do not truly inherit
data from the parent, the parentLocale element is not necessary and not used for collation.
Thus, for a locale like zh_Hant in the example above, the parentLocale element would dictate the parent as "root"
when referring to main locale data, but for collation data, the parent locale would still be "zh",
even though the parentLocale element is present for that locale.
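A parent computation that honors parentLocale overrides, while ignoring them for collation, might look like the Python sketch below. The function name and the PARENT_LOCALES table are hypothetical; the single entry reflects only the zh_Hant case described in the text, not the actual supplemental data.

```python
# Hypothetical excerpt of parentLocale overrides; only the case named in the text.
PARENT_LOCALES = {"zh_Hant": "root"}

def parent_of(locale_id: str, for_collation: bool = False):
    """Return the parent locale id, or None for root."""
    if locale_id == "root":
        return None
    # parentLocale overrides apply to main locale data but not to collation.
    if not for_collation and locale_id in PARENT_LOCALES:
        return PARENT_LOCALES[locale_id]
    truncated = locale_id.rpartition("_")[0]
    return truncated or "root"
```

So parent_of("zh_Hant") is "root", while parent_of("zh_Hant", for_collation=True) is "zh".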
4.2 Inheritance and Validity
The following describes in more detail how to determine the exact inheritance of elements, and the validity of a given element in LDML.
4.2.1 Definitions
Blocking
elements are those whose subelements do not inherit from parent locales. For example, a
in a
For more information, see
Section 5.5 Valid Attribute Values
Attributes that serve to distinguish multiple elements at the same level are called distinguishing attributes.
For example, the type attribute distinguishes different elements in lists of
translations, such as:
Distinguishing attributes affect inheritance; two elements with different distinguishing
attributes are treated as different for purposes of inheritance. For more information, see
Section 5.5 Valid Attribute Values
Other attributes are called nondistinguishing (or informational) attributes. These carry
separate information, and do not affect inheritance.
For any element in an XML file,
an element chain
is a resolved [
XPath
] leading from the root to an element, with attributes on each element in alphabetical
order. So in, say,
we may have:
...
Which gives the following element chains (among others):
//ldml/identity/version[@number="1.1"]
//ldml/localeDisplayNames/languages/language[@type="ar"]
An element chain A is an
extension
of an element chain B if B is equivalent to an initial portion of A. For example, #2 below is an extension of #1.
(Equivalent, depending on the tree, may not be "identical to". See below for an example.)
//ldml/localeDisplayNames
//ldml/localeDisplayNames/languages/language[@type="ar"]
An LDML file can be thought of as an ordered list of
element pairs
:
(This works because of restrictions on the structure of LDML, including that it
does not allow mixed content.) The ordering is the ordering that the element
chains are found in the file, and thus determined by the DTD.
For example, some of those pairs would be the following. Notice that the first has the null string as element contents.
//ldml/identity/version[@number="1.1"]
""
//ldml/localeDisplayNames/languages/language[@type="ar"]
"Αραβικά"
Note:
There are two exceptions to this:
Blocking nodes and their contents are treated as a single end node.
In terms of computing inheritance, the element pair consists of the element chain
plus all distinguishing attributes; the value consists of the value (if any) plus any
nondistinguishing attributes.
Thus instead of the element pair being (a) below, it is (b):
//ldml/dates/calendars/calendar[@type='gregorian']/week/weekendStart[@day='sun'][@time='00:00']
"">
//ldml/dates/calendars/calendar[@type='gregorian']/week/weekendStart
[@day='sun'][@time='00:00']
Two LDML element chains are
equivalent
when they would be identical if all attributes and their values were removed — except
for distinguishing attributes. Thus the following are equivalent:
//ldml/localeDisplayNames/languages/language[@type="ar"]
//ldml/localeDisplayNames/languages/language[@type="ar"][@draft="unconfirmed"]
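The equivalence test can be sketched in a few lines of Python. This is only an illustration: the set of nondistinguishing attributes below is a hypothetical stand-in (the real list comes from the CLDR metadata), the function names are invented for this example, and the sketch assumes that attribute values contain no "/" characters.

```python
import re

# Hypothetical list for illustration; the real list of nondistinguishing
# attributes is defined by CLDR, not by this sketch.
NONDISTINGUISHING = {"draft", "references"}

# Matches one attribute predicate such as [@type="ar"] or [@day='sun'].
ATTR = re.compile(r'''\[@(\w+)=("[^"]*"|'[^']*')\]''')

def canonical_chain(chain):
    """Drop nondistinguishing attributes and sort the rest alphabetically."""
    steps = []
    for step in chain.split("/"):  # assumes no "/" inside attribute values
        name = step.split("[", 1)[0]
        attrs = sorted(
            (key, value.strip("\"'"))
            for key, value in ATTR.findall(step)
            if key not in NONDISTINGUISHING
        )
        steps.append(name + "".join(f'[@{k}="{v}"]' for k, v in attrs))
    return "/".join(steps)

def equivalent(chain_a, chain_b):
    """Two chains are equivalent if they agree on everything distinguishing."""
    return canonical_chain(chain_a) == canonical_chain(chain_b)
```

With this sketch, the two chains shown above compare as equivalent, while chains that differ in a type attribute do not.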
For any locale ID, a
locale chain
is an ordered list starting with the root and leading down to the ID. For example:
4.2.2 Resolved Data File
To produce a fully resolved locale data file from CLDR for a locale ID L, you start with L, and successively add unique items from the parent locales until
you get up to root. More formally, this can be expressed as the following procedure.
Let Result be initially L.
For each Li in the locale chain for L, starting at L and going up to root:
Let Temp be a copy of the pairs in the LDML file for Li
Replace each alias in Temp by the resolved list of pairs it points to.
The resolved list of pairs is obtained by recursively applying this procedure.
That alias now blocks any inheritance from the parent. (See
Section 5.1 Common Elements
for an example.)
For each element pair P in Temp:
If P does not contain a blocking element, and Result does not have an element pair Q with an equivalent element chain, add P to Result.
Notes:
When adding an element pair to a result, it has to go in the right order for it to be valid according to the DTD.
The identity element and its children are unaffected by resolution.
The LDML data must be constructed so as to avoid circularity in step 2.2.
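The resolution procedure can be sketched as follows. This is a simplified illustration rather than a reference implementation: each file is modeled as a dictionary from an already-canonical element chain to its value, alias replacement and DTD ordering are omitted, and the names locale_chain, resolve, parents, and is_blocking are assumptions made for this example.

```python
def locale_chain(locale, parents=None):
    """Return [locale, ..., "root"], using truncation unless an explicit
    parent override (as supplied by parentLocale data) is given."""
    parents = parents or {}
    chain = [locale]
    while chain[-1] != "root":
        current = chain[-1]
        if current in parents:
            chain.append(parents[current])
        elif "_" in current:
            chain.append(current.rsplit("_", 1)[0])
        else:
            chain.append("root")
    return chain

def resolve(locale, files, parents=None, is_blocking=lambda chain: False):
    """Merge element pairs from the locale chain; the most specific file wins.

    files maps a locale ID to {element chain: value}. Keys are assumed to be
    canonical, so equivalence reduces to key equality. Aliases are NOT handled.
    """
    result = dict(files.get(locale, {}))          # Result is initially L
    for ancestor in locale_chain(locale, parents)[1:]:
        temp = dict(files.get(ancestor, {}))      # copy of the pairs for Li
        for chain, value in temp.items():
            if not is_blocking(chain) and chain not in result:
                result[chain] = value
    return result
```

For instance, with files for "root", "de", and "de_CH", resolve("de_CH", files) keeps every pair from de_CH, then adds pairs from de and root only where no equivalent chain is already present.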
4.2.3 Valid Data
The attribute
draft="x"
in LDML means that the data has not been approved by the
subcommittee. (For more information, see
Process
). However, some data
that is not explicitly marked as
draft
may be implicitly
draft
, either because it inherits
it from a parent, or from an enclosing element.
Example 2.
Suppose that new locale data is added for af (Afrikaans). To indicate that all of the data is
unconfirmed
, the attribute can be added
to the top level.
Any data can be added to that file, and the status will all be draft=
unconfirmed
Once an item is vetted—
whether it is inherited or explicitly in
the file
—then its status can be changed to
approved
. This can be done either by leaving draft="unconfirmed" on the enclosing element and marking
the child with draft="approved", such as:
However, normally the draft attributes should be canonicalized, which means they are pushed down to leaf nodes as described in
Section 5.6 Canonical
Form
. If an LDML file does have draft attributes that are not on leaf nodes, the file should be interpreted as if it were the canonicalized version of that file.
The attribute
validSubLocales
allows sublocales in a given tree to be treated as though a file for them were present when there
is not one. It only has an effect for locales that inherit from the current file where a file is missing, and the elements would not otherwise be draft.
Example 1.
Suppose that in a particular LDML tree, there are no region locales for German,
for example, there is a de.xml file, but no files for de_AT.xml,
de_CH.xml, or de_DE.xml. Then no elements are valid for any of those region locales. If we want to mark one of those files as having valid elements, then we
introduce an empty file, such as the following.
With the
validSubLocales
attribute, instead of adding the empty files for de_AT.xml, de_CH.xml, and de_DE.xml, in the de file we can add to the parent
locale a list of the child locales that should behave as if files were present.
...
More formally, here is how to determine whether data for an element chain E is implicitly or explicitly draft, given a locale L. Items 1, 2, and 4 are
simply formalizations of what is in LDML already. Item 3 adds the new element.
4.2.4 Checking for Draft Status
Parent Locale Inheritance
Walk through the locale chain until you find a locale ID L' with a data file D. (L' may equal L).
Produce the fully resolved data file D' for D.
In D', find the first element pair whose element chain E' is either equivalent to or an extension of E.
If there is no such E', return
true
If E' is not equivalent to E, truncate E' to the length of E.
Enclosing Element Inheritance
Walk through the elements in E', from back to front.
If you ever encounter draft=
, return
If L' = L, return
false
Missing File Inheritance
Otherwise, walk again through the elements in E', from back to front.
If you encounter a validSubLocales attribute:
If L is in the attribute value, return
false
Otherwise return
true
Otherwise
Return
true
The validSubLocales in the most specific (farthest from root file) locale file "wins" through the full resolution step (data from more specific files replacing
data from less specific ones).
4.2.5 Keyword and Default Resolution
When accessing data based on keywords, the following process is used. Consider the following example:
The locale 'de' has collation types A, B, C, and no
The locale 'de_CH' has
Here are the searches for various combinations.
User Input            Lookup in Locale  For                      Comment
de_CH (no keyword)    de_CH             default collation type   finds "B"
                      de_CH             collation type=B         not found
                      de                collation type=B         found
de (no keyword)       de                default collation type   not found
                      root              default collation type   finds "standard"
                      de                collation type=standard  not found
                      root              collation type=standard  found
de_u_co_A             de                collation type=A         found
de_u_co_standard      de                collation type=standard  not found
                      root              collation type=standard  found
de_u_co_foobar        de                collation type=foobar    not found
                      root              collation type=foobar    not found, starts looking for default
                      de                default collation type   not found
                      root              default collation type   finds "standard"
                      de                collation type=standard  not found
                      root              collation type=standard  found
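The lookups for the 'de' and 'de_CH' example can be reproduced with a short Python sketch. It is an illustration under stated assumptions: the DATA dictionary is a hypothetical model of the collation data described in the example, the keyword is assumed to have been split from the locale ID already, and the function names are invented here. Plain truncation is used for the parent chain.

```python
# Hypothetical model of the example data: which collation types each locale
# file defines, and which default (if any) it declares.
DATA = {
    "root": {"default": "standard", "types": {"standard", "search"}},
    "de": {"types": {"A", "B", "C"}},
    "de_CH": {"default": "B"},
}

def parent_chain(locale):
    """Truncation chain, e.g. de_CH -> [de_CH, de, root]."""
    parts = locale.split("_")
    return ["_".join(parts[:i]) for i in range(len(parts), 0, -1)] + ["root"]

def find_type(locale, ctype, data):
    """Return the first locale in the chain that defines ctype, else None."""
    for loc in parent_chain(locale):
        if ctype in data.get(loc, {}).get("types", ()):
            return loc
    return None

def find_default(locale, data):
    """Return the first default collation type declared along the chain."""
    for loc in parent_chain(locale):
        default = data.get(loc, {}).get("default")
        if default is not None:
            return default
    return None

def resolve_collation(locale, requested, data=DATA):
    """Return (locale where found, collation type) for a request."""
    if requested is not None:
        found = find_type(locale, requested, data)
        if found is not None:
            return found, requested
    # No keyword, or the requested type does not exist: fall back to default.
    default = find_default(locale, data)
    return find_type(locale, default, data), default
```

A request for an unknown type such as "foobar" therefore ends at the root "standard" collator, matching the last group of rows in the table.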
Examples of "search" collator lookup; 'de' has a
language-specific version, but 'en' does not:
User Input            Lookup in Locale  For                      Comment
de_CH_u_co_search     de_CH             collation type=search    not found
                      de                collation type=search    found
en_US_u_co_search     en_US             collation type=search    not found
                      en                collation type=search    not found
                      root              collation type=search    found
Examples of lookup for Chinese collation types. Note:
All of the Chinese-specific collation types are provided in the 'zh' locale
For 'zh' the default element specifies "pinyin", while for 'zh_Hant' it specifies "stroke". However any of the available Chinese collation types can be
explicitly requested for any Chinese locale.
User Input              Lookup in Locale  For                      Comment
zh_Hant (no keyword)    zh_Hant           default collation type   finds "stroke"
                        zh_Hant           collation type=stroke    not found
                        zh                collation type=stroke    found
zh_Hant_HK_u_co_pinyin  zh_Hant_HK        collation type=pinyin    not found
                        zh_Hant           collation type=pinyin    not found
                        zh                collation type=pinyin    found
zh (no keyword)         zh                default collation type   finds "pinyin"
                        zh                collation type=pinyin    found
Note:
It is an invariant that the default in root for a given element must
always be a value that exists in root. So you can not have the following in root:
For identifiers, such as language codes, script codes, region codes, variant codes, types, keywords, currency symbols or currency display names, the default
value is the identifier itself whenever no value is found in the root. Thus if there is no display name for the region code 'QA' in root, then the display
name is simply 'QA'.
4.3 Likely Subtags
There are a number of situations where it is useful to be able to
find the most likely language, script, or region. For example, given the language "zh" and the
region "TW", what is the most likely script? Given the script "Thai" what is the most likely
language or region? Given the region TW, what is the most likely language and script?
Conversely, given a locale, it is useful to find out which fields
(language, script, or region) may be superfluous, in the sense that they contain the likely
tags. For example, "en_Latn" can be simplified down
to "en" since "Latn" is the likely script for "en"; "ja_Jpan_JP"
can be simplified down to "ja".
The
likelySubtag
supplemental data provides default information for
computing these values. This data is based on the default content data, the population data, and
the suppress-script data in [
BCP47
]. It is heuristically derived, and
may change over time. To look up data in the table, see if a locale matches one of the
from
attribute values. If so, fetch the corresponding
to
attribute value. For example, the
Chinese data looks like the following:
So looking up "zh_TW" returns "zh_Hant_TW", while looking up "zh"
returns "zh_Hans_CN". In the following text, the components of such a result will be
designated with language², region², and script².
The data is designed to be used in the following operations. It can
also be used with language tags using [
BCP47
] syntax, with a few changes.
Add Likely Subtags:
Given a locale, to fill in the most
likely other fields.
This operation is performed in the following way.
Canonicalize.
Make sure the
input locale is in canonical form: uses the right separator, and has the right casing.
Replace any
deprecated subtags with their canonical values using the
metadata. Use the first value in the replacement list, if it exists.
If the tag is grandfathered (see
Remove the script code 'Zzzz' and the region code 'ZZ' if they occur; change an empty language subtag to
'und'.
Get the components of the cleaned-up tag (language¹,
script¹, and region¹), plus any variants
if they exist
(including keywords).
Try each of
the following in order (where the fields exist). The notation field³
means field¹ if it exists, otherwise field².
Lookup
language¹ _ script¹
_ region¹. If in the table, return the language²
_ script² _ region²
+ variants.
Lookup
language¹ _ script¹.
If in the table, return language² _ script²
_ region³ + variants.
Lookup
language¹ _ region¹. If in the table, return language²
_ script³ _ region² + variants.
Lookup
language¹. If in the table, return language²
_ script³ _ region³ + variants.
If none of these succeed, signal an error.
Example:
Input is ZH-ZZZZ-SG.
Normalize to zh_SG.
Lookup in table. No match.
Remove SG, but remember it. Lookup zh, and get
the match (zh_Hans_CN). Substitute SG, and return zh_Hans_SG.
To find the most likely language for a country, or
language for a script, use "und" as the language subtag. For example, looking up "und_TW" returns zh_Hant_TW.
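The Add Likely Subtags steps can be sketched in Python as follows. This is a minimal illustration, not a conformant implementation: the LIKELY table holds only the handful of mappings mentioned in this section, replacement of deprecated and grandfathered tags is omitted, keywords are not handled, and the names parse and add_likely_subtags are assumptions of this example.

```python
# Tiny excerpt of likely-subtags mappings, limited to the cases named in
# this section; the real table is far larger.
LIKELY = {
    "zh": "zh_Hans_CN",
    "zh_TW": "zh_Hant_TW",
    "zh_Hant": "zh_Hant_TW",
    "und_TW": "zh_Hant_TW",
}

def parse(tag):
    """Split a tag into (language, script, region, variants), fixing
    separators and casing and dropping 'Zzzz' and 'ZZ'."""
    parts = tag.replace("-", "_").split("_")
    language = parts[0].lower() or "und"
    rest = parts[1:]
    script = region = ""
    if rest and len(rest[0]) == 4 and rest[0].isalpha():
        script = rest.pop(0).title()
    if rest and ((len(rest[0]) == 2 and rest[0].isalpha())
                 or (len(rest[0]) == 3 and rest[0].isdigit())):
        region = rest.pop(0).upper()
    if script == "Zzzz":
        script = ""
    if region == "ZZ":
        region = ""
    return language, script, region, rest

def add_likely_subtags(tag, table=LIKELY):
    language, script, region, variants = parse(tag)
    # Try lang_script_region, lang_script, lang_region, lang, in that order,
    # skipping any lookup whose fields are absent from the input.
    for use_script, use_region in ((True, True), (True, False),
                                   (False, True), (False, False)):
        if (use_script and not script) or (use_region and not region):
            continue
        key = "_".join(p for p in (language,
                                   script if use_script else "",
                                   region if use_region else "") if p)
        if key in table:
            lang2, script2, region2, _ = parse(table[key])
            out_script = script2 if use_script else (script or script2)
            out_region = region2 if use_region else (region or region2)
            fields = [lang2, out_script, out_region] + variants
            return "_".join(f for f in fields if f)
    raise LookupError(f"no likely subtags for {tag!r}")
```

Running the sketch on the example input ZH-ZZZZ-SG first normalizes it to zh_SG, finds no entry for that key, then matches "zh" and substitutes the remembered region to give zh_Hans_SG.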
Remove
Likely Subtags:
Given a locale,
remove any fields that Add Likely Subtags would add.
The reverse operation removes fields that would be added by the
first operation.
First get max
= AddLikelySubtags(inputLocale). If an error is signaled, return it.
Remove the
variants from max.
Then for
trial
in {language, language _ region, language _ script}
If
AddLikelySubtags(
trial
) = max, then return
trial
+ variants.
If you do not
get a match, return max + variants.
Example:
Input is zh_Hant. Maximize to get zh_Hant_TW.
zh => zh_Hans_CN. No match, so continue.
zh_TW => zh_Hant_TW. Matches, so return zh_TW.
A variant of this favors the script over the region, thus using
{language, language_script, language_region} in the above. If that variant is used, then the
result in this example would be zh_Hant instead of zh_TW.
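The Remove Likely Subtags operation can be sketched on top of any maximizing function. In this illustration the maximizer is passed in as a parameter; demo_maximize and its MAXIMIZED table are hypothetical stand-ins containing only the results needed for the examples, and input is assumed to be in canonical form already.

```python
def split(tag):
    """Split a canonical tag into (language, script, region, variants)."""
    parts = tag.split("_")
    language, rest = parts[0], parts[1:]
    script = region = ""
    if rest and len(rest[0]) == 4 and rest[0].isalpha():
        script = rest.pop(0)
    if rest and ((len(rest[0]) == 2 and rest[0].isalpha())
                 or (len(rest[0]) == 3 and rest[0].isdigit())):
        region = rest.pop(0)
    return language, script, region, rest

def remove_likely_subtags(locale, maximize, favor_script=False):
    """Return the shortest tag that maximizes to the same result as locale.

    maximize is any callable implementing Add Likely Subtags; it is expected
    to raise LookupError when it has no answer.
    """
    language, script, region, variants = split(locale)
    base = "_".join(p for p in (language, script, region) if p)
    max_tag = maximize(base)                  # an error here propagates
    max_lang, max_script, max_region, _ = split(max_tag)
    trials = [max_lang, f"{max_lang}_{max_region}", f"{max_lang}_{max_script}"]
    if favor_script:                          # variant ordering from the text
        trials[1], trials[2] = trials[2], trials[1]
    for trial in trials:
        try:
            if maximize(trial) == max_tag:
                return "_".join([trial] + variants)
        except LookupError:
            continue
    return "_".join([max_tag] + variants)

# Hypothetical precomputed results, only for demonstrating the function.
MAXIMIZED = {
    "zh": "zh_Hans_CN", "zh_TW": "zh_Hant_TW", "zh_Hant": "zh_Hant_TW",
    "zh_Hant_TW": "zh_Hant_TW", "en": "en_Latn_US", "en_Latn": "en_Latn_US",
    "en_US": "en_Latn_US", "en_Latn_US": "en_Latn_US",
}

def demo_maximize(tag):
    return MAXIMIZED[tag]  # KeyError is a LookupError
```

With these stand-in results, zh_Hant minimizes to zh_TW by default and stays zh_Hant when favor_script is set, as in the example above.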
4.4 Language Matching
Implementers are often faced with the issue of how to match the user's requested languages with their product's supported languages. For example, suppose that a product supports {ja-JP, de, zh-TW}. If the user understands written American English, German, French, Swiss German, and Italian, then
de
would be the best match; if s/he understands only Chinese (zh), then zh-TW would be the best match.
The standard truncation-fallback algorithm does not work well when faced with the complexities of natural language. The language matching data is designed to fill that gap. Stated in those terms, language matching can have the effect of a more complex fallback, such as:
sr-Cyrl-RS
sr-Cyrl
sr-Latn-RS
sr-Latn
sr
hr-Latn
hr
Language matching is used to find the best supported locale ID given a requested list of languages. The requested list could come from different sources, such as the user's list of preferred languages in the OS Settings, or from a browser Accept-Language list. For example, if my native
tongue is English, I can understand Swiss German and German, my French is
rusty but usable, and Italian basic, ideally an implementation would allow me to select {gsw, de, fr} as my preferred list
of languages, skipping Italian because my comprehension is not good enough
for arbitrary content.
Language Matching can also be used to get fallback data elements. In many cases, there may not be full data for a particular locale. For example,
for a Breton speaker, the best fallback if data is unavailable might be French. That is, suppose we have found a Breton bundle,
but it does not contain translation for the key "CN" (for the country China).
It is best to return "chine", rather than falling back to the value for a default
language such as Russian and getting "Кітай".
The language matching data can be used to get the closest fallback locales (of those supported) to a given language.
When such fallback is used for resource item
lookup, the normal order of inheritance is used for resource item
lookup, except that before using
any data from
root
, the data for the fallback locales would be used if available. Language matching does not interact with the fallback of resources
within the locale-parent chain
. For example, suppose that we are looking for the value for a particular path
in
nb-NO
. In the absence of aliases, normally the following lookup is used.
nb-NO
nb
root
That is, we first look in
nb-NO
. If there is no value for
there, then we look in
nb
. If there is no value for
there, we return the value for
in root (or a code value, if there is nothing there). Remember that if there is an alias element along this path, then the lookup may restart with a different path in
nb-NO
(or another locale).
However, suppose that
nb-NO
has the fallback values
[nn da sv en]
, derived from language matching. In that case, an implementation
may
progressively look up each of the listed locales, with the appropriate substitutions, returning the first value that is not found in
root
. This follows roughly the following pseudocode:
value = lookup(P, nb-NO); if (locationFound != root) return value;
value = lookup(P, nn-NO); if (locationFound != root) return value;
value = lookup(P, da-NO); if (locationFound != root) return value;
value = lookup(P, sv-NO); if (locationFound != root) return value;
value = lookup(P, en-NO); return value;
The locales in the fallback list are not used recursively. For example, for the lookup of a path in nb-NO, if
fr
were a fallback value for
da
, it would not matter for the above process. Only the original language matters.
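The pseudocode above can be made concrete with a small Python sketch. It is illustrative only: data is a hypothetical dictionary from locale ID to {path: value}, aliases are ignored, and lookup and lookup_with_fallbacks are names invented for this example.

```python
def lookup(path, locale, data):
    """Follow truncation inheritance; return (value, locale where found)."""
    parts = locale.split("-")
    chain = ["-".join(parts[:i]) for i in range(len(parts), 0, -1)] + ["root"]
    for loc in chain:
        if path in data.get(loc, {}):
            return data[loc][path], loc
    return None, None

def lookup_with_fallbacks(path, locale, fallbacks, data):
    """Try the locale, then each fallback language with the same trailing
    fields, and accept the first value that does not come from root."""
    _, _, rest = locale.partition("-")
    candidates = [locale] + [
        fallback + ("-" + rest if rest else "") for fallback in fallbacks
    ]
    value = None
    for candidate in candidates:
        value, found = lookup(path, candidate, data)
        if found not in (None, "root"):
            return value
    return value  # only root (or nothing) had a value
```

Because the fallback list is built once from the original locale, a fallback of a fallback language is never consulted, in line with the non-recursive rule stated above.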
The languageMatching data is interpreted as an ordered list. To find the match between any two languages, use the likely subtags to maximize each language, and perform the following steps.
Remove any trailing fields that are the same.
Traverse the list until a match is found. (If the oneway flag is false, then the match is symmetric.)
Record the match value.
Remove the final field from each, and if any fields are left, repeat these steps.
The end result is the product of the matched values.
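The steps above can be sketched as a short Python function. This is a hedged illustration: the RULES list is a hypothetical two-entry stand-in for the ordered languageMatching data (with percentages written as fractions), both inputs are assumed to be maximized already, and the function names are invented here.

```python
# Hypothetical ordered rules: (desired pattern, supported pattern, value, oneway).
RULES = [
    ("nn", "nb", 0.96, False),
    ("*-*-*", "*-*-*", 0.96, False),
]

def field_match(pattern, tag):
    """A pattern matches a tag field by field; "*" matches any field."""
    p, t = pattern.split("-"), tag.split("-")
    return len(p) == len(t) and all(a == "*" or a == b for a, b in zip(p, t))

def step_value(desired, supported, rules):
    """Return the value of the first rule matching the pair."""
    for d_pat, s_pat, value, oneway in rules:
        if field_match(d_pat, desired) and field_match(s_pat, supported):
            return value
        if (not oneway and field_match(d_pat, supported)
                and field_match(s_pat, desired)):
            return value  # symmetric match when oneway is false
    raise LookupError(f"no rule matches {desired} / {supported}")

def match_score(desired, supported, rules=RULES):
    """Multiply the matched values while truncating both maximized tags."""
    d, s = desired.split("-"), supported.split("-")
    score = 1.0
    while d and s:
        while d and s and d[-1] == s[-1]:   # remove identical trailing fields
            d.pop()
            s.pop()
        if not d or not s:
            break
        score *= step_value("-".join(d), "-".join(s), rules)
        d.pop()                             # remove the final field from each
        s.pop()
    return score
```

For nn-Latn-DE against nb-Latn-FR the sketch multiplies 0.96 by 0.96, giving roughly 0.92, which agrees with the worked example later in this section.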
There is one special case. Suppose we have the following situation:
desired languages: {und, it}
supported languages: {en, it}
resulting language: en
Part of this is because 'und' has a special function in BCP47; it stands in for 'no supplied base language'. To prevent this from happening, if the desired base language is und, the language matcher should not apply likely subtags to it.
Examples:
For example, suppose that nn-DE and nb-FR are being compared. They are first maximized to nn-Latn-DE and nb-Latn-FR, respectively. The list is searched. The first match is with "*-*-*", for a match of 96%. The languages are truncated to nn-Latn and nb-Latn, then to nn and nb. The first match is also for a value of 96%, so the result is 92%.
Note that language matching is orthogonal to how closely two languages are related linguistically. For example, Breton is more closely related to Welsh than to French, but French is the better match (because it is more likely that a Breton reader will understand French than Welsh). This also illustrates that the matches are often asymmetric: it is not likely that a French reader will understand Breton.
The "*" acts as a wild card, as shown in the following example:
5 XML Format
There are two kinds of data that can be expressed in LDML: language-dependent data and supplementary data. In either case, data can be split across multiple
files, which can be in multiple directory trees.
For example, the language-dependent data for Japanese in CLDR is present in the following files:
common/collation/ja.xml
common/main/ja.xml
common/rbnf/ja.xml
common/segmentations/ja.xml
The status of the data is the same, whether or not data is split. That is, for the purpose of validation and lookup, all of the data for the above ja.xml
files is treated as if it was in a single file.
Supplemental data relating to Japan or the Japanese writing system can be found in:
Files in common/supplemental/ such as supplementalData.xml
common/transforms/Hiragana-Katakana.xml
common/transforms/Hiragana-Latin.xml
...
The following sections describe the structure of the XML format for language-dependent data. The more precise syntax is in the DTD, listed at the top of
this document
; however, the DTD does not describe all the constraints on the structure.
To start with, the root element is
contextTransforms?,
characters?, delimiters?, measurement?, dates?, numbers?, units?,
listPatterns?, collations?, posix?, segmentations?, rbnf?,
metadata?,
references?, special*))) >
The XML structure is stable over releases. Elements and attributes may be deprecated: they are retained in the DTD but their usage is strongly discouraged.
In most cases, an alternate structure is provided for expressing the information.
In general, all translatable text in this format is in element contents, while attributes are reserved for types and non-translated information (such as
numbers or dates). The reason that attributes are not used for translatable text is that spaces are not preserved, and we cannot predict where spaces may be
significant in translated material.
There are two kinds of elements in LDML:
rule
elements and
structure
elements. For structure elements, there are restrictions to allow for
effective inheritance and processing:
There is no "mixed" content: if an element has textual content, then it cannot contain any elements.
The [
XPath
] leading to the content is unique; no two different pieces of textual content have the same [
XPath
].
Rule elements do not have this restriction, but also do not inherit, except as an entire block. The structure elements are listed in serialElements
in the supplemental metadata. See also
Section 4.2 Inheritance and Validity
For more technical details, see
Updating-DTDs
Note that the data in examples given below is purely illustrative, and
does not match any particular language. For a more detailed example of this format,
see [
Example
]. There is also a DTD for this format, but
remember that the DTD alone is not sufficient to understand the semantics, the
constraints, nor the interrelationships between the different elements and attributes
. You may wish to have copies of each of these to hand as you
proceed through the rest of this document.
In particular, all elements allow for draft versions to coexist in the file at the same time. Thus most elements are marked in the DTD as allowing multiple
instances. However, unless an element is listed as a serialElement, or has a distinguishing attribute, it can only occur once as a subelement of a given element.
Thus, for example, the following is illegal even though allowed by the DTD:
There must be only one instance of these per parent, unless there are other distinguishing attributes (such as an alt attribute).
In general, LDML data should be in NFC format. However, certain elements may need to contain characters that are not in NFC, including exemplars, transforms, segmentations, and p/s/t/i/pc/sc/tc/ic rules in collation. These elements must not be normalized (either to NFC or NFD), or their meaning may be changed. Thus LDML documents
must not be normalized as a whole. To prevent problems with normalization, no element value can start with a combining slash (U+0338 COMBINING LONG SOLIDUS OVERLAY).
Lists, such as
singleCountries
are space-delimited. That means that they are separated by one or more XML whitespace characters,
singleCountries
preferenceOrdering
references
validSubLocales
5.1 Common Elements
At any level in any element, two special elements are allowed.
5.1.1 Element special
This element is designed to allow for arbitrary additional annotation and data that is product-specific. It has one required attribute
xmlns
, which specifies the
XML
namespace
of the special data. For example, the following uses the version 1.0 POSIX special element.
">
%posix;
]>
...
Yes
No
^[Yy].*
^[Nn].*
5.1.1.1 Sample Special Elements
The elements in this section are
not
part of the Locale Data Markup Language 1.0 specification. Instead, they are special elements used for
application-specific data to be stored in the Common Locale Repository. They may change or be removed in future versions of this document, and are present here
more as examples of how to extend the format. (Some of these items may move into a future version of the Locale Data Markup Language specification.)
The above examples are old versions: consult the documentation for the specific application to see which should be used.
These DTDs use namespaces and the special element. To include one or more, use the following pattern to import the special DTDs that are used in the file:
1.0
" encoding="
UTF-8
" ?>
icu
SYSTEM "
">
openOffice
SYSTEM "
">
%icu;
%openOffice;
]>
Thus to include just the ICU DTD, one uses:
1.0
" encoding="
UTF-8
" ?>
">
%icu;
]>
Note:
A previous version of this document contained a special element for
ISO TR 14652
compatibility data. That element has been withdrawn, pending further
investigation, since
14652 is a Type 1 TR: "when the required support cannot be obtained for the publication of an International Standard,
despite repeated effort". See the ballot comments on
14652 Comments
for
details on the 14652 defects. For example, most of these patterns make little provision for substantial changes in format when elements are empty, so are
not particularly useful in practice. Compare, for example, the mail-merge capabilities of production software such as Microsoft Word or OpenOffice.
Note:
While the CLDR specification guarantees backwards compatibility, the definition of specials is up to other organizations. Any assurance
of backwards compatibility is up to those organizations.
A number of the elements above can have extra information for
openoffice.org
, such as the following example:
IGNORE_CASE
5.1.2 Element alias
The contents of any element in root can be replaced by an alias,
which points to the path where the data can be found.
Aliases will only ever appear in root with the form //ldml/.../alias[@source="locale"][@path="..."].
Consider the following example in root:
If the locale "de_DE" is being accessed for a month name for format/abbreviated, then a resource bundle at "de_DE" will be searched for a resource element at that path. If not found there, then the
resource bundle at "de" will be searched, and so on. When the alias is found in root, then the search is restarted, but searching for format/
wide
element instead of format/abbreviated.
If the
path
attribute is present, then its value is an [
XPath
] that points to a different node in the tree. For example:
The default value if the path is not present is the same position in the tree. All of the attributes in the [
XPath
] must be
distinguishing
elements.
For more details, see
Section 4.2 Inheritance and Validity
There is a special value for the source attribute, the constant
source="locale"
. This special value is equivalent to the
locale being resolved. For example, consider the following example, where locale data for 'de' is being resolved:
Inheritance with source="locale"
Root
de
Resolved
1
2
11
12
11
12
22
11
22
The first row shows the inheritance within the
whereby ,
but by elements in the 'target' locale.
For more details on data resolution, see
Section 4.2 Inheritance and Validity
Aliases must be resolved recursively. An alias may point to another path that results in another alias being found, and so on. For example, looking up Thai buddhist abbreviated months for the locale
xx-YY
may result in the following chain of aliases being followed:
../../calendar[@type="buddhist"]/months/monthContext[@type="format"]/monthWidth[@type="abbreviated"]
xx-YY → xx → root // finds alias that changes path to:
../../calendar[@type="gregorian"]/months/monthContext[@type="format"]/monthWidth[@type="abbreviated"]
xx-YY → xx → root // finds alias that changes path to:
../../calendar[@type="gregorian"]/months/monthContext[@type="format"]/monthWidth[@type="wide"]
xx-YY → xx // finds value here
It is an error to have a circular chain of aliases. That is, a collection of LDML XML documents must not have situations where a sequence of alias lookups
(including inheritance and multiple inheritance) can be followed indefinitely without terminating.
5.1.3 Element displayName
Many elements can have a display name. This is a translated name that can be presented to users when discussing the particular service. For example, a number
format, used to format numbers using the conventions of that locale, can have a translated name for presentation in GUIs.
Prozentformat
...
Where present, the display names must be unique; that is, two distinct codes would not get the same display name. (There is one exception to this: in
time zones, where parsing results would give the same GMT offset, the standard and daylight display names can be the same across different
time zone IDs.) Any
translations should follow customary practice for the locale in question. For more information, see [
Data Formats
].
5.1.4 Element cp
Unfortunately, XML does not have the capability to contain all Unicode code points. Due to this, in certain instances extra syntax is required to represent
those code points that cannot be otherwise represented in element content. These escapes are only allowed in certain elements, according to the DTD.
Escaping Characters
Code Point
XML Example
U+0000
5.2 Common Attributes
5.2.1 Attribute type
The attribute
type
is also used to indicate an alternate resource that can be selected with a matching type=option in the locale id modifiers, or
be referenced by a default element. For example:
...
...
...
5.2.2 Attribute draft
If this attribute is present, it indicates the status of all the data in this element and any subelements (unless they have a contrary
draft
value),
as per the following:
approved:
fully approved by the technical committee (equals the CLDR 1.3 value
of
false
, or an absent
draft
attribute). This does not mean that the data is guaranteed to be error-free—this is the best judgment of the
committee.
contributed
: partially approved by the technical committee.
provisional
: partially confirmed. Implementations may choose
to accept the provisional data, especially if there is no translated alternative.
unconfirmed
: no confirmation available.
For more information on precisely how these values are computed for any
given release, see
Data Submission and
Vetting Process
on the CLDR website.
Normally draft attributes should only occur on "leaf" elements. For a more formal description of how elements are inherited, and what their draft status
is, see
Section 4.2 Inheritance and Validity
5.2.3 Attribute alt
This attribute labels an alternative value for an element. The value is a
descriptor
that indicates what kind of alternative it is, and takes one of the following forms:
variantname
meaning that the value is a variant of the normal value, and may be used in its place in certain circumstances. If a variant value
is absent for a particular locale, the normal value is used. The variant mechanism should only be used when such a fallback is acceptable.
proposed
, optionally followed by a number, indicating that the value is a proposed replacement for an existing value.
variantname
-proposed
, optionally followed by a number, indicating that the value is a proposed replacement variant
value.
proposed
" should only be present if the draft status is not "approved". It indicates that the data is proposed replacement
data that has been added provisionally until the differences between it and the other data can be vetted. For example, suppose that the translation for September
for some language is "Settembru", and a bug report is filed stating that it should be "Settembro". The new data can be entered in, but marked as
alt="proposed"
until it is vetted.
...
Now assume another bug report comes in, saying that the correct form is actually "Settembre". Another alternative can be added:
...
...
The values for
variantname
at this time include "
variant
", "
list
", "
email
",
www
", "
short
", and "
secondary
".
Attribute validSubLocales
The attribute
validSubLocales
allows sublocales in a given tree to be treated as though a file for them were present when there
is not one. It can
be applied to any element. It only has an effect for locales that inherit from the current file where a file is missing, and the elements would
not otherwise
be draft.
For a more complete description of how draft applies to data, see
Section 4.2 Inheritance and Validity
Attribute
references
The value of this attribute is a token representing a reference for the information in the element, including standards
that it may conform to.
Example:
The reference element may be inherited. Thus, for example, R222 may be used in sv_SE.xml even though it is not defined there, if it is defined in sv.xml.
<... allow="verbatim" ...> (deprecated)
This attribute was originally intended for use in marking display names whose capitalization differed from what was indicated by the
now-deprecated
with the new
5.3 Common Structures
5.3.1 Date and Date Ranges
When attributes specify date ranges, this is usually done with
attributes
from
and
to
. The
from
attribute specifies the starting point,
and the
to
attribute specifies the end point. The deprecated
time
attribute was formerly used to specify time with the deprecated weekEndStart and weekEndEnd elements,
which were themselves inherently
from
or
to
.
The data format is a restricted ISO 8601 format, limited to the
fields
year, month, day, hour, minute,
and
second
in that order, with "-" used as
a separator between date fields, a space used as the separator between the date and the time
fields, and ":" used as a separator between the time fields. If the minute or minute and second
are absent, they are interpreted as zero. If the hour is also missing, then it is interpreted
based on whether the attribute is from or to:
from defaults to "00:00:00" (midnight at the start of the day).
to defaults to "24:00:00" (midnight at the end of the day).
That is, Friday at 24:00:00 is the same time as Saturday at 00:00:00.
Thus when the hour is missing, the
from and to
are interpreted inclusively: the range
includes all of the day mentioned.
For example, the following are equivalent:
If the from attribute is missing, the range is assumed to extend as far backwards in time as
there is data for; if the to attribute is missing, the range extends onwards from the starting point, with
no known end point.
The dates and times are specified in local time, unless otherwise
noted. (In particular, the metazone values are in UTC, also known as GMT.)
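The defaulting rules for from and to can be sketched in Python as follows. This is only an illustration, not part of LDML: the function name parse_range_endpoint is hypothetical, the sketch assumes that year, month, and day are always all present, and it does not deal with time zones.

```python
from datetime import datetime, timedelta


def parse_range_endpoint(value: str, kind: str) -> datetime:
    """Parse a restricted ISO 8601 value such as "2013-03-15 12:30".

    kind is "from" or "to" and only matters when the hour is missing.
    (Hypothetical helper; assumes year-month-day are always given.)
    """
    if kind not in ("from", "to"):
        raise ValueError("kind must be 'from' or 'to'")
    date_part, _, time_part = value.strip().partition(" ")
    year, month, day = (int(field) for field in date_part.split("-"))
    start_of_day = datetime(year, month, day)
    if not time_part:
        # Missing hour: "from" means 00:00:00, "to" means 24:00:00,
        # which is 00:00:00 on the following day.
        if kind == "from":
            return start_of_day
        return start_of_day + timedelta(days=1)
    fields = [int(field) for field in time_part.split(":")]
    fields += [0] * (3 - len(fields))  # absent minute/second count as zero
    hour, minute, second = fields
    # timedelta arithmetic also accepts an explicit hour of 24.
    return start_of_day + timedelta(hours=hour, minutes=minute, seconds=second)
```

With this interpretation, a to value of "2013-03-15" and a from value of "2013-03-16" denote the same instant, which is what makes a range with a missing hour inclusive of the whole day.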
5.3.2 Text Directionality
The content of certain elements, such as date or number formats, may consist of several sub-elements with an inherent order (for
example, the year, month, and day for dates). In some cases, the order of these sub-elements may be changed depending on the bidirectional context in which
the element is embedded.
For example, short date formats in languages such as Arabic may contain neutral or weak characters at the beginning or end of the
element content. In such a case, the overall order of the sub-elements may change depending on the surrounding text.
Element content whose display may be affected in this way should include an explicit direction mark, such as U+200E LEFT-TO-RIGHT
MARK or U+200F RIGHT-TO-LEFT MARK, at the beginning or end of the element content, or both.
5.3.3 Unicode Sets
Some attribute values or element contents use UnicodeSet notation. A UnicodeSet represents a set of Unicode characters (and possibly strings) determined by a pattern, following
UTS #18: Unicode Regular Expressions [UTS18],
Level 1 and RL2.5, including the syntax where given. For an example of a concrete implementation of this, see [
ICUUnicodeSet
].
Patterns are a series of characters bounded by square brackets that contain lists of characters and Unicode property sets. Lists are a sequence of characters
that may have ranges indicated by a '-' between two characters, as in "a-z". The sequence specifies the range of all characters from the left to the right,
in Unicode order. For example,
[a c d-f m]
is equivalent to
[a c d e f m]
. Whitespace can be freely used for clarity, as
[a c d-f m]
means
the same as
[acd-fm]
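As an illustration of how lists and ranges expand, the following Python sketch handles only the simplest patterns: literal characters, ranges, and ignorable whitespace. The function name expand_simple_list is hypothetical, and property sets, nested sets, quoting, escapes, and strings in curly braces are deliberately out of scope.

```python
def expand_simple_list(pattern: str) -> set:
    """Expand a pattern such as "[a c d-f m]" into a set of characters.

    Hypothetical helper: supports literals and ranges only.
    """
    if not (pattern.startswith("[") and pattern.endswith("]")):
        raise ValueError("pattern must be bounded by square brackets")
    # Whitespace is ignored, so drop it before scanning.
    chars = [c for c in pattern[1:-1] if not c.isspace()]
    result = set()
    i = 0
    while i < len(chars):
        if i + 2 < len(chars) and chars[i + 1] == "-":
            low, high = ord(chars[i]), ord(chars[i + 2])
            if low > high:
                raise ValueError("range is not in Unicode order")
            # A range covers every code point from left to right.
            result.update(chr(cp) for cp in range(low, high + 1))
            i += 3
        else:
            if chars[i] == "-":
                raise ValueError("incomplete range")
            result.add(chars[i])
            i += 1
    return result
```

Under these assumptions, "[a c d-f m]", "[a c d e f m]", and "[acd-fm]" all expand to the same six characters.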
Unicode property sets are specified by any Unicode property and a value of that property, such as
[:General_Category=Letter:]
. The property names
are defined by the PropertyAliases.txt file and the property values by the PropertyValueAliases.txt file. For more information, see
[UAX44].
The syntax for specifying the property sets is an extension of either POSIX or Perl syntax, by the addition of "=value".
For example, letters can be matched by using the POSIX-style syntax:
[:General_Category=Letter:]
or by using the Perl-style syntax
\p{General_Category=Letter}
Property names and values are case-insensitive, and whitespace, "-", and "_" are ignored. The property name can be omitted for the Category and Script properties,
but is required for other properties. If the property value is omitted, it is assumed to represent a boolean property with the value "true". Thus
[:Letter:]
is equivalent to
[:General_Category=Letter:]
, and
[:Wh-ite-s pa_ce:]
is equivalent to
[:Whitespace=true:]
The table below shows the two kinds of syntax: POSIX and Perl style. Also, the table shows the "Negative", which is a property that excludes all characters
of a given kind. For example,
[:^Letter:]
matches all characters that are not
[:Letter:]
                      Positive          Negative
POSIX-style Syntax    [:type=value:]    [:^type=value:]
Perl-style Syntax     \p{type=value}    \P{type=value}
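The loose matching of property names and values, and the optional negation, can be sketched for the POSIX-style syntax as follows. This Python sketch is an assumption-laden illustration: the names loose and split_posix_property are hypothetical, and the sketch does not decide whether a lone token such as Letter is a Category value, a Script value, or a boolean property, since that requires the property alias data.

```python
import re


def loose(token: str) -> str:
    """Ignore case, whitespace, "-", and "_" in a property name or value."""
    return re.sub(r"[\s\-_]", "", token).lower()


def split_posix_property(expr: str):
    """Split "[:type=value:]" or "[:^type=value:]" into its parts.

    Returns (negated, left, right); right is None when "=value" is omitted.
    Hypothetical helper covering the POSIX-style syntax only.
    """
    match = re.fullmatch(r"\[:(\^?)([^=:]+)(?:=([^:]+))?:\]", expr)
    if match is None:
        raise ValueError("not a POSIX-style property expression")
    negated = match.group(1) == "^"
    left = loose(match.group(2))
    right = loose(match.group(3)) if match.group(3) is not None else None
    return negated, left, right
```

For example, split_posix_property("[:Wh-ite-s pa_ce:]") and split_posix_property("[:Whitespace:]") return the same triple, so the two spellings can be looked up identically.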
The low-level lists and properties described above can then be freely combined with the normal set operations (union, inverse, difference, and intersection):
To union two sets, simply concatenate them. For example,
[[:letter:] [:number:]]
To intersect two sets, use the '&' operator. For example,
[[:letter:] & [a-z]]
To take the set-difference of two sets, use the '-' operator. For example,
[[:letter:] - [a-z]]
To invert a set, place a '^' immediately after the opening '['. For example,
[^a-z]
. In any other location, the '^' does not have a special meaning.
The binary operators '&', '-', and the implicit union have equal precedence and bind left-to-right. Thus
[[:letter:]-[a-z]-[\u0100-\u01FF]]
is equal
to
[[[:letter:]-[a-z]]-[\u0100-\u01FF]]
. Another example is the set
[[ace][bdf] - [abc][def]]
, which is not the empty set, but instead equal to
[[[[ace] [bdf]] - [abc]] [def]]
, which equals
[[[abcdef] - [abc]] [def]]
, which equals
[[def] [def]]
, which equals
[def]
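The left-to-right evaluation can be checked with ordinary Python sets. The sketch below is illustrative only: the function name evaluate_left_to_right and the operator spellings "union", "&", and "-" are assumptions made for the example, and the operands are given as already-parsed sets rather than parsed from a pattern.

```python
def evaluate_left_to_right(first, operations):
    """Apply (operator, operand) pairs strictly from left to right.

    All three binary operators have equal precedence, so no regrouping
    is performed. "union" stands for the implicit concatenation operator.
    """
    result = set(first)
    for operator, operand in operations:
        if operator == "union":
            result |= operand
        elif operator == "&":
            result &= operand
        elif operator == "-":
            result -= operand
        else:
            raise ValueError("unknown operator: " + operator)
    return result


# [[ace][bdf] - [abc][def]]
example = evaluate_left_to_right(
    set("ace"),
    [("union", set("bdf")), ("-", set("abc")), ("union", set("def"))],
)
```

Here example is the set containing 'd', 'e', and 'f', matching the step-by-step reduction in the text.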
One caution: the '&' and '-' operators operate between sets. That is, they must be immediately preceded and immediately followed by a set. For example, the
pattern
[[:Lu:]-A]
is illegal, since it is interpreted as the set
[:Lu:]
followed by the incomplete range
-A
. To specify the set of
upper case
letters except for 'A', enclose the 'A' in a set:
[[:Lu:]-[A]]
A multi-character string can be in a Unicode set, to represent a tailored grapheme cluster for a particular language. The syntax uses curly braces for that
case.
In Unicode Sets, there are two ways to quote syntax characters and whitespace:
5.3.3.1 Single Quote
Two single quotes represent a single quote, either inside or outside single quotes. Text within single quotes is not interpreted in any way (except for
two adjacent single quotes). It is taken as literal text (special characters become non-special).
5.3.3.2 Backslash Escapes
Outside of single quotes, certain backslashed characters have special meaning:
\uhhhh        Exactly 4 hex digits; h in [0-9A-Fa-f]
\Uhhhhhhhh    Exactly 8 hex digits
\xhh          1-2 hex digits
\ooo          1-3 octal digits; o in [0-7]
\a            U+0007 (BELL)
\b            U+0008 (BACKSPACE)
\t            U+0009 (HORIZONTAL TAB)
\n            U+000A (LINE FEED)
\v            U+000B (VERTICAL TAB)
\f            U+000C (FORM FEED)
\r            U+000D (CARRIAGE RETURN)
\\            U+005C (BACKSLASH)
\N{name}      The Unicode character named "name".
Anything else following a backslash is mapped to itself, except in an environment where it is defined to have some special meaning. For example,
\p{uppercase}
is the set of upper case letters in Unicode.
Any character formed as the result of a backslash escape loses any special meaning and is treated as a literal. In particular, note that \u and \U escapes
create literal characters. (In contrast, Java treats Unicode escapes as just a way to represent arbitrary characters in an ASCII source file, and any resulting
characters are
not
tagged as literals.)
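A minimal Python sketch of decoding the backslash escapes listed above is given here for illustration. The name unescape is hypothetical. The sketch does not track single-quoted text, does not mark the resulting characters as literals, and does not treat \p or \P specially, so it is not a complete UnicodeSet lexer.

```python
import re
import unicodedata

# Single-character escapes with a fixed meaning.
_SIMPLE = {
    "a": "\u0007", "b": "\u0008", "t": "\u0009", "n": "\u000A",
    "v": "\u000B", "f": "\u000C", "r": "\u000D", "\\": "\\",
}

_ESCAPE = re.compile(
    r"\\(?:u([0-9A-Fa-f]{4})"      # \uhhhh
    r"|U([0-9A-Fa-f]{8})"          # \Uhhhhhhhh
    r"|x([0-9A-Fa-f]{1,2})"        # \xhh
    r"|([0-7]{1,3})"               # \ooo
    r"|N\{([^}]+)\}"               # \N{name}
    r"|(.))",                      # anything else
    re.DOTALL,
)


def unescape(text: str) -> str:
    """Replace backslash escapes by the characters they denote."""
    def replace(match):
        hex4, hex8, hex2, octal, name, other = match.groups()
        if hex4 is not None:
            return chr(int(hex4, 16))
        if hex8 is not None:
            return chr(int(hex8, 16))
        if hex2 is not None:
            return chr(int(hex2, 16))
        if octal is not None:
            return chr(int(octal, 8))
        if name is not None:
            return unicodedata.lookup(name)
        # Any other backslashed character maps to itself.
        return _SIMPLE.get(other, other)

    return _ESCAPE.sub(replace, text)
```

A real implementation would additionally need to remember which characters came from escapes, so that an escaped '-' or '[' is not later interpreted as syntax.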
The following table summarizes the syntax that can be used.
Example            Description
[a]                The set containing 'a' alone.
[a-z]              The set containing 'a' through 'z' and all letters in between, in Unicode order. Thus it is the same as [\u0061-\u007A].
[^a-z]             The set containing all characters but 'a' through 'z'. Thus it is the same as [\u0000-\u0060 \u007B-\U0010FFFF].
[[pat1][pat2]]     The union of sets specified by pat1 and pat2.
[[pat1]&[pat2]]    The intersection of sets specified by pat1 and pat2.
[[pat1]-[pat2]]    The asymmetric difference of sets specified by pat1 and pat2.
[a {ab} {ac}]      The character 'a' and the multi-character strings "ab" and "ac".
[:Lu:]             The set of characters with a given property value, as defined by PropertyValueAliases.txt. In this case, these are the Unicode upper case letters. The long form for this is [:General_Category=Uppercase_Letter:].
[:L:]              The set of characters belonging to all Unicode categories starting with 'L', that is, [[:Lu:][:Ll:][:Lt:][:Lm:][:Lo:]]. The long form for this is [:General_Category=Letter:].
5.4 Identity Elements
The identity element contains information identifying the target locale for this data, and general information about the version of this data.
The version element provides, in an attribute, the version of this file. The contents of the element can contain textual notes about the changes between
this version and the last. For example:
Various notes and changes in version 1.1
This is not to be confused with the version attribute on the ldml element, which tracks the dtd version.
The generation element contains the last modified date for the data. This can be in two formats: ISO 8601 format, or CVS format (illustrated by the example
above).
The language code is the primary part of the specification of the locale id, with values as described above.
The script code
may be used in the identification of written languages, with values described above.
The territory code is a common part of the specification of the locale id, with values as described above.
The variant code is the tertiary part of the specification of the locale id, with values as described above.
When combined according to the rules described in
Section 3, Unicode Language and Locale Identifiers
the language element, along with any of the optional script, territory, and variant elements,
must identify a known, stable locale identifier. Otherwise, it is an error.
5.5 Valid Attribute Values
The valid attribute values, as well as other validity information, are contained in the supplementalMetadata.xml file.
(Some, but not all, of this information could have been represented in XML Schema or a DTD.) Most of this is primarily for internal tool use.
The following specify the ordering of elements / attributes in the file:
The suppress elements are those that are suppressed in canonicalization.
The serialElements are those that do not inherit, and may have ordering
first_secondary_ignorable first_tertiary_ignorable first_trailing first_variable i ic languagePopulation
last_non_ignorable last_primary_ignorable last_secondary_ignorable last_tertiary_ignorable last_trailing
last_variable optimize p pc reset rules s sc settings suppress_contractions t tRule tc variable x
The validity elements give the possible attribute values. They are in the format of a series of variables, followed by attributeValues.
buddhist coptic ethiopic ethiopic-amete-alem chinese gregorian hebrew indian islamic islamic-civil
japanese arabic civil-arabic thai-buddhist persian roc
The types indicate the style of match:
choice: for a list of possible values
regex: for a regular expression match
notDoneYet: for items without matching criteria
locale: for locale IDs
list: for a space-delimited list of values
path: for a valid [XPath]
If the attribute order="given" is supplied, it indicates the order of elements when canonicalizing (see below).
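To make the match styles more concrete, the following Python sketch checks a value against the choice and regex styles only. It is an illustration under assumptions: the function name matches is hypothetical, the choice list is taken to be space-delimited (as in the calendar list above), and a regex is assumed to have to match the whole value. The other styles are not covered.

```python
import re

# The calendar values listed above, assumed here to be a "choice" list.
CALENDAR_CHOICES = (
    "buddhist coptic ethiopic ethiopic-amete-alem chinese gregorian hebrew "
    "indian islamic islamic-civil japanese arabic civil-arabic thai-buddhist "
    "persian roc"
)


def matches(value: str, match_type: str, spec: str) -> bool:
    """Check value against a validity spec of the given type.

    Hypothetical helper: only "choice" and "regex" are sketched.
    """
    if match_type == "choice":
        return value in spec.split()
    if match_type == "regex":
        # Assumption: the expression must match the entire value.
        return re.fullmatch(spec, value) is not None
    raise NotImplementedError("match type not covered by this sketch")
```

For instance, matches("gregorian", "choice", CALENDAR_CHOICES) is true, while a value that is not in the list is rejected.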
The variable values are intended for internal testing, and the definition and usage may change between releases. They do not necessarily include all valid elements. For example, for primary language codes, they include the subset that occur in CLDR locale data. They are intended for a particular version of CLDR, and may omit codes that were present in earlier versions, such as deprecated codes.
The
then only the listed combinations are deprecated. Thus the following means not that the draft attribute is deprecated, but that the true and false values for
that attribute are:
Similarly, the following means that the
type
attribute is deprecated, but only for the listed elements:
The blockingItems indicate which elements (and their child elements) do not inherit. For
example, because supplementalData is a blocking item, all paths containing the element
supplementalData
do not inherit.
The distinguishing items indicate which combinations of elements and attributes (in unblocked
environments) are
distinguishing
in performing inheritance. For example, the attribute
type is distinguishing
except
in combination with certain elements, such as in:
elements="default measurementSystem mapping abbreviationFallback preferenceOrdering"
attributes="type"/>
5.6 Canonical Form
The following are restrictions on the format of LDML files to allow for easier parsing and comparison of files.
Peer elements have consistent order. That is, if the DTD or this specification requires the following order in an element foo:
It can never require the reverse order in a different element bar.
Note that there was one case that had to be corrected in order to make this true. For that reason, pattern occurs twice under currency:
decimal?, group?, special*)) >
XML
files can have a wide variation in textual form, while representing precisely the same data. By putting the
LDML files in the repository into a canonical form, simple and widely used diff tools (such as those in CVS) can be used to detect differences when vetting
changes, without those tools being confused. This is not a requirement on other uses of LDML; it is simply a way to manage repository data more easily.
5.6.1 Content
All start elements are on their own line, indented by
depth
tabs.
All end elements (except for leaf nodes) are on their own line, indented by
depth
tabs.
Any leaf node with empty content is in the form
There are no blank lines except within comments or content.
Spaces are used within a start element. There are no extra spaces within elements.
, not
, not
All attribute values use double quote ("), not single (').
There are no CDATA sections, and no escapes except those absolutely required.
no &apos; since it is not necessary
no numeric character reference such as &#x61;, since it would be just 'a'
All attributes with defaulted values are suppressed. See
Section 5.6.8 Defaulted Values Table
The draft and alt="proposed.*" attributes are only on leaf elements.
The tzid are canonicalized in the following way:
All tzids as of CLDR 1.1 (2004.06.08) in zone.tab are canonical.
After that point, the first time a tzid is introduced, that is the canonical form.
That is, new IDs are added, but existing ones keep the original form. The
TZ
timezone database keeps a set of equivalences in the "backward" file.
These are used to map other tzids to the canonical form. For example, when
America/Argentina/Catamarca
was introduced as the new name for the
previous
America/Catamarca
, a link was added in the backward file.
Link America/Argentina/Catamarca America/Catamarca
Example:
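The use of Link lines to reach the canonical tzid can be sketched in Python as follows. This is not the tooling used for CLDR: the names build_equivalences and canonical_tzid are hypothetical, the sample data contains only the single Link line quoted above, and CANONICAL_TZIDS is a stand-in for the set of IDs that are canonical under the rules above.

```python
# Sample "backward" data: only the Link line quoted in the text.
BACKWARD_LINES = [
    "Link America/Argentina/Catamarca America/Catamarca",
]

# Hypothetical stand-in for the IDs that are canonical under the rules above.
CANONICAL_TZIDS = {"America/Catamarca"}


def build_equivalences(lines):
    """Map every tzid on a Link line to its set of equivalent tzids."""
    groups = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3 or fields[0] != "Link":
            continue  # skip comments and anything that is not a Link line
        _, first, second = fields
        merged = groups.get(first, {first}) | groups.get(second, {second})
        for tzid in merged:
            groups[tzid] = merged
    return groups


def canonical_tzid(tzid, groups, canonical_ids):
    """Return the equivalent tzid that is canonical, or tzid itself if none."""
    for candidate in sorted(groups.get(tzid, {tzid})):
        if candidate in canonical_ids:
            return candidate
    return tzid
```

The sketch treats a Link line purely as an equivalence, so that the newer name America/Argentina/Catamarca maps back to the original form America/Catamarca, which stays canonical.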
5.6.2 Ordering
Element names are ordered by the
Element Order Table
Attribute names are ordered by the
Attribute Order
Table
Attribute value comparison is a bit more complicated, and may depend on the attribute and type. Compare two values by using the following steps:
If two values are in the
Value Order Table
compare according to the order in the table. Otherwise if just one is, it goes first.
If two values are numeric [0-9], compare numerically (2 < 12). Otherwise if just one is numeric, it goes first.
Otherwise, values are ordered alphabetically.
An attribute-value pair is ordered first by attribute name, and then if the attribute names are identical, by the value.
An element is ordered first by the element name, and then if the
element names are identical, by the sorted set of attribute-value pairs
(sorted by #4). For the latter, compare the first pair in each (in
sorted order by attribute pair). If not identical, go to the second
pair, and so on.
Any future additions to the DTD must be structured so as to allow compatibility with this ordering.
See also
Section 5.5 Valid Attribute Values
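The three-step comparison of attribute values can be expressed as a sort key. The Python sketch below is illustrative: the name value_sort_key is hypothetical, VALUE_ORDER is a made-up stand-in for an entry of the Value Order Table, and "alphabetically" is approximated by code point order.

```python
# Hypothetical stand-in for one list from the Value Order Table.
VALUE_ORDER = ["full", "long", "medium", "short"]


def value_sort_key(value: str, value_order=VALUE_ORDER):
    """Build a key that orders table values, then numbers, then the rest."""
    if value in value_order:
        # Step 1: values in the table come first, in table order.
        return (0, value_order.index(value), "")
    if value.isascii() and value.isdigit():
        # Step 2: numeric values [0-9] come next, compared numerically.
        return (1, int(value), "")
    # Step 3: everything else, by code point order (an approximation
    # of alphabetical ordering).
    return (2, 0, value)


ordered = sorted(["12", "2", "b", "a", "short", "full"], key=value_sort_key)
```

With the stand-in table, ordered places "full" and "short" first, then "2" before "12", then "a" and "b". Attribute-value pairs and elements can then be compared by building tuples from such keys.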
5.6.3 Comments
Comments are of the form <!-- comment text -->.
They are logically attached to a node. There are 4 kinds:
Inline comments always appear after a leaf node, at the end of the same line. They are a single line.
Preblock comments always precede the attachment node, and are indented on the same level.
Postblock comments always follow the attachment node, and are indented on the same level.
Final comment, after
Multiline comments (except the final comment) have each line after the first indented to one deeper level.
Examples:
...
...
5.6.4 Canonicalization
The process of canonicalization is fairly straightforward, except for comments. Inline comments will have any linebreaks replaced by a space. There may be
cases where the attachment node is not permitted, such as the following.
In those cases, the comment will be made into a block comment on the last previous leaf node, if it is at that level or deeper. (If there is one already,
it will be appended, with a line-break between.) If there is no place to attach the node (for example, as a result of processing that removes the attachment
node), the comment and its node's [
XPath
] will be appended to the final comment in the document.
Multiline comments will have leading tabs stripped, so any indentation should be done with spaces.
5.6.5 Element Order Table
The order of elements is given by the elementOrder table in the supplemental metadata.
5.6.6 Attribute Order Table
The order of attributes is given by the attributeOrder table in the supplemental metadata.
5.6.7 Value Order Table
The order of attribute values is given by the order of the values in the attributeValues elements that have the attribute order="given". Numeric values are
sorted in numeric order, while tzids are ordered by country, then longitude, then latitude.
5.6.8 Defaulted Values Table
The defaulted attributes are given by the
suppress
table in the supplemental metadata. There is one special value, _q, that is used internally on serial elements
to preserve ordering.
6 Property Data
Some data in CLDR does not use an XML format, but rather a semicolon-delimited format derived from that of the Unicode Character Database. That is because the data is more likely to be parsed by implementations that already parse UCD data. Those files are present in the common/properties directory.
Each file has a header that explains the format and usage of the data.
7 Lenient Parsing
7.1 Motivation
User input is frequently messy. Attempting to parse it by matching it exactly against a pattern is likely to be unsuccessful, even when the meaning of the
input is clear to a human being. For example, for a date pattern of "MM/dd/yy", the input "June 1, 2006" will fail.
The goal of lenient parsing is to accept user input whenever it is possible to decipher what the user intended. Doing so requires using patterns as data
to guide the parsing process, rather than an exact template that must be matched. This informative section suggests some heuristics that may be useful for lenient
parsing of dates, times, and numbers.
7.2 Loose Matching
Loose matching ignores attributes of the strings being compared that are not important to matching. It involves the following steps:
Remove "." from currency symbols and other fields used for matching, and also from the input string unless:
"." is in the decimal set, and
its position in the input string is immediately before a decimal digit
Ignore all format characters: in particular, ignore the RLM
and LRM used to control BIDI formatting.
Ignore all characters in [:Zs:] unless they occur between letters. (In the heuristics below, even those between letters are ignored except to delimit
fields)
Map all characters in [:Dash:] to U+002D HYPHEN-MINUS
Use the supplemental character fallback data to map equivalent characters
(for example, curly to straight apostrophes). Other apostrophe-like characters should also be treated as equivalent,
especially if the character actually used in a format may be unavailable on some keyboards. For example:
U+02BB MODIFIER LETTER TURNED COMMA (ʻ) might be typed instead as U+2018 LEFT SINGLE QUOTATION MARK (‘).
U+02BC MODIFIER LETTER APOSTROPHE (ʼ) might be typed instead as U+2019 RIGHT SINGLE QUOTATION MARK (’), U+0027 APOSTROPHE, etc.
U+05F3 HEBREW PUNCTUATION GERESH (׳) might be typed instead as U+0027 APOSTROPHE.
Apply mappings particular to the domain (i.e., for dates or for numbers, discussed in more detail below)
Apply case folding (possibly including language-specific mappings such as Turkish i)
Normalize to NFKC; thus
no-break space
will map to
space
; half-width
katakana
will map to full-width.
Loose matching involves (logically) applying the above transform to both the input text and to each of the field elements used in matching, before applying
the specific heuristics below. For example, if the input number text is " - NA f. 1,000.00", then it is mapped to "-naf1,000.00" before processing. The currency
signs are also transformed, so "NA f." is converted to "naf" for purposes of matching. As with other Unicode algorithms, this is a logical statement of the
process; actual implementations can optimize, such as by applying the transform incrementally during matching.
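The transform described above can be sketched in Python as follows. The sketch is informative only and makes several assumptions: the name loose_form is hypothetical, [:Dash:] is approximated by general category Pd plus U+2212 MINUS SIGN, the apostrophe table contains only the characters named above, all [:Zs:] characters are dropped (as in the heuristics), domain-specific mappings are omitted, and case folding is the language-neutral default.

```python
import unicodedata

# Apostrophe-like characters treated as equivalent to U+0027 (assumed table).
APOSTROPHE_LIKE = {
    "\u2018": "'", "\u2019": "'", "\u02BB": "'", "\u02BC": "'", "\u05F3": "'",
}


def loose_form(text: str, decimal_set=frozenset(".")) -> str:
    """Return a loosely normalized form of text for matching."""
    kept = []
    for index, char in enumerate(text):
        category = unicodedata.category(char)
        if char == ".":
            following = text[index + 1] if index + 1 < len(text) else ""
            # Keep "." only when it is a decimal separator placed
            # immediately before a decimal digit.
            if char in decimal_set and following.isdecimal():
                kept.append(char)
            continue
        if category in ("Cf", "Zs"):
            continue  # format characters (RLM, LRM, ...) and spaces
        if category == "Pd" or char == "\u2212":
            kept.append("-")  # approximation of [:Dash:]
            continue
        kept.append(APOSTROPHE_LIKE.get(char, char))
    folded = "".join(kept).casefold()
    return unicodedata.normalize("NFKC", folded)
```

Applying the same function to both the input and the field values lets them be compared directly; for the example in the text, loose_form(" - NA f. 1,000.00") yields "-naf1,000.00" and loose_form("NA f.") yields "naf".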
8 Deprecated Structure
The following structure was present in previous versions of CLDR. While valid LDML, it is discouraged, and no longer used in CLDR.
8.1 Element fallback
The fallback element is deprecated. Implementations should use instead the information in
Section 4.4 Language Matching
for doing language fallback.
8.2 BCP 47 Keyword Mapping
Note:
This structure is deprecated and replaced with
Section 3.7 Unicode BCP 47 Extension Data
This section defines mappings between old Unicode locale identifier key/type
values and their BCP 47 'u' extension subtag representations. The 'u' extension syntax described in
Section 3.7 Unicode BCP 47 Extension Data
restricts a key to two ASCII alphanumerics and
a type to three to eight ASCII alphanumerics. A key or a type which does not meet that syntax
requirement is converted according to the mapping data defined by the mapKeys or mapTypes elements. For example,
a keyword "collation=phonebook" is converted to BCP 47 'u' extension subtags "co-phonebk" by the mapping data below:
...
...
...
...
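The conversion can be sketched in Python as follows. This is an illustration only: the function name keyword_to_u_subtags is hypothetical, and KEY_MAP and TYPE_MAP are stand-ins for the mapKeys and mapTypes data that contain nothing beyond the single example given in the text.

```python
import re

# Stand-ins for the mapKeys and mapTypes data (only the example from the text).
KEY_MAP = {"collation": "co"}
TYPE_MAP = {("collation", "phonebook"): "phonebk"}

KEY_SYNTAX = re.compile(r"[0-9A-Za-z]{2}")      # two ASCII alphanumerics
TYPE_SYNTAX = re.compile(r"[0-9A-Za-z]{3,8}")   # three to eight alphanumerics


def keyword_to_u_subtags(keyword: str) -> str:
    """Convert an old-style "key=type" keyword to 'u' extension subtags."""
    key, separator, type_ = keyword.partition("=")
    if not separator:
        raise ValueError("expected a keyword of the form key=type")
    # Only keys and types that do not already meet the syntax are mapped.
    bcp_key = key if KEY_SYNTAX.fullmatch(key) else KEY_MAP.get(key)
    bcp_type = type_ if TYPE_SYNTAX.fullmatch(type_) else TYPE_MAP.get((key, type_))
    if bcp_key is None or bcp_type is None:
        raise KeyError("no mapping data for " + keyword)
    return (bcp_key + "-" + bcp_type).lower()
```

With this stand-in data, keyword_to_u_subtags("collation=phonebook") returns "co-phonebk".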
8.3 Choice Patterns
Note:
This structure is deprecated and replaced with count attributes.
A choice pattern is a string that chooses among a number of strings, based on numeric value. It has the following form:
'∞' | [0-9]+ ('.' [0-9]+)?)
≤'
The interpretation of a choice pattern is that given a number N, the pattern is scanned from right to left, for each choice evaluating
N. The first choice that matches results in the corresponding string. If no match is found, then the first string is used. For example:
Pattern
Result
0≤Rf|1≤Ru|1
-3, -1, -0.000001
Rf (defaulted to first string)
0, 0.01, 0.9999
Rf
Ru
1.00001, 5, 99,
Re
Quoting is done using ' characters, as in date or number formats.
8.4 Element default
Note:
This structure is deprecated except when used for
In some cases, a number of elements are present. The default element can be used to indicate which of them is the default, in the absence of other information.
The value of the choice attribute is to match the value of the type attribute for the selected item.
h:mm:ss a z
h:mm:ss a z
h:mm:ss a
...
Like all other elements, the
are present in fr, and that in fr_BE we have the following:
In that case, the default time format for fr_BE would be the inherited "long" resource from fr. Now suppose that we had in fr_CA:
...
In this case, the
8.5 Attribute standard
Note:
This attribute is deprecated. Instead, use a reference element with the attribute standard="true".
The value of this attribute is a list of strings representing standards: international, national, organization, or vendor standards. The presence of this
attribute indicates that the data in this element is compliant with the indicated standards. Where possible, for uniqueness, the string should be a URL that
represents that standard. The strings are separated by commas; leading or trailing spaces on each string are not significant. Examples:
...
9 Links to Other Parts
The LDML specification is split into several
parts
by topic,
with one HTML document per part.
The following tables provide redirects for links to specific topics.
Please update your links and bookmarks.
Part 1: Core specification (this document): No redirects needed.
Part 2:
General
(display names & transforms, etc.)
Old section
Section in new part
5.4
Display Name Elements
Display Name Elements
5.5
Layout Elements
Layout Elements
5.6
Character Elements
Character Elements
5.6.1
Exemplar Syntax
3.1
Exemplar Syntax
5.6.2 Restrictions
3.1
Exemplar Syntax
5.6.3 Mapping
3.2
Mapping
5.6.4
Index Labels
3.3
Index Labels
5.6.5 Ellipsis
3.4
Ellipsis
5.6.6 More Information
3.5
More Information
5.7
Delimiter Elements
Delimiter Elements
C.6
Measurement System Data
Measurement System Data
5.8
Measurement Elements (deprecated)
5.1
Measurement Elements (deprecated)
5.11
Unit Elements
Unit Elements
5.12
POSIX Elements
POSIX Elements
5.13
Reference Element
Reference Element
5.15
Segmentations
Segmentations
5.15.1
Segmentation Inheritance
9.1
Segmentation Inheritance
5.16
Transforms
10
Transforms
Transform Rules
10.1
Transform Rules
5.18
List Patterns
11
List Patterns
C.20
Gender of Lists
11.1
Gender of Lists
5.19
ContextTransform Elements
12
ContextTransform Elements
Part 3:
Numbers
(number & currency formatting)
Old section
Section in new part
C.13
Numbering Systems
Numbering Systems
5.10
Number Elements
Number Elements
5.10.1
Number Symbols
2.3
Number Symbols
Number Format Patterns
Number Format Patterns
5.10.2
Currencies
Currencies
C.1
Supplemental Currency Data
4.1
Supplemental Currency Data
C.11
Language Plural Rules
Language Plural Rules
5.17
Rule-Based Number Formatting
Rule-Based Number Formatting
Part 4:
Dates
(date, time, time zone formatting)
Old section
Section in new part
5.9 Date Elements
Overview: Dates Element, Supplemental Date and Calendar Information
5.9.1 Calendar Elements
Calendar Elements
2.1
2.2
2.3
2.4
2.5
2.6
5.9.2 Calendar Fields
Calendar Fields
5.9.3
Time Zone Names
Time Zone Names
C.5 Supplemental Calendar Data
Supplemental Calendar Data
C.7 Supplemental Time Zone Data
Supplemental Time Zone Data
C.15 Calendar Preference Data
4.2
Calendar Preference Data
C.17 DayPeriod Rules
4.5
Day Period Rules
Appendix F: Date Format Patterns
Date Format Patterns
Date Field Symbol Table
Date Field Symbol Table
F.1 Localized Pattern Characters (deprecated)
8.1
Localized Pattern Characters (deprecated)
Appendix J: Time Zone Display Names
Using Time Zone Names
fallbackRegionFormat:
fallbackFormat
fallbackFormat
O.4 Parsing Dates and Times
Parsing Dates and Times
Part 5:
Collation
(sorting, searching, grouping)
Old section
Section in new part
5.14
Collation Elements
Collation Tailorings
5.14.1
Version
3.1
Version
5.14.2
Collation Element
3.2
Collation Element
5.14.3
Setting Options
3.3
Setting Options
Table
Collation Settings
Table
Collation Settings
5.14.4
Collation Rule Syntax
3.4
Collation Rule Syntax
5.14.5
Orderings
3.5
Orderings
5.14.6
Contractions
3.6
Contractions
5.14.7
Expansions
3.7
Expansions
5.14.8
Context Before
3.8
Context Before
5.14.9
Placing Characters Before Others
3.9
Placing Characters Before Others
5.14.10
Logical Reset Positions
3.10
Logical Reset Positions
5.14.11
Special-Purpose Commands
3.11
Special-Purpose Commands
5.14.12
Collation Reordering
3.12
Collation Reordering
5.14.13
Case Parameters
3.13
Case Parameters
Definition:
UncasedExceptions
Definition:
UncasedExceptions
Definition:
LowerExceptions
Definition:
LowerExceptions
Definition:
UpperExceptions
Definition:
UpperExceptions
5.14.14
Visibility
3.14
Visibility
Part 6:
Related Information
(supplemental data)
Old section
Section in new part
Supplemental Data
Introduction
Supplemental Data
C.2
Supplemental Territory Containment
1.1
Supplemental Territory Containment
C.4
Supplemental Territory Information
1.2
Supplemental Territory Information
C.3
Supplemental Language Data
Supplemental Language Data
C.9
Supplemental Code Mapping
Supplemental Code Mapping
C.12
Telephone Code Data
Telephone Code Data
C.14
Postal Code Validation
Postal Code Validation
C.8
Supplemental Character Fallback Data
Supplemental Character Fallback Data
Coverage Levels
Coverage Levels
5.20
Metadata Elements
10
Locale Metadata Element
Supplemental Metadata
P.1
Supplemental Alias Information
P.2
Supplemental Deprecated Information
P.3
Default Content
Supplemental Metadata
9.1
Supplemental Alias Information
9.2
Supplemental Deprecated Information
9.3
Default Content
Part 7:
Keyboards
(keyboard mappings)
Old section
Section in new part
Keyboards
Keyboards
Goals and Nongoals
Goals and Nongoals
Definitions
Definitions
File and Directory Structure
File and Directory Structure
Element Hierarchy - Layout File
Element Hierarchy - Layout File
Element Hierarchy - Platform File
Element Hierarchy - Platform File
Invariants
Invariants
Data Sources
Data Sources
Keyboard IDs
Keyboard IDs
Platform Behaviors in Edge Cases
Platform Behaviors in Edge Cases
Element: keyboard
Element: keyboard
Element: version
Element: version
Element: generation
Element: generation
Element: names
Element: names
Element: name
Element: name
Element: settings
Element: settings
Element: keyMap
Element: keyMap
Element: map
Element: map
Element: transforms
Element: transforms
Element: transform
Element: transform
Element: platform
Element: platform
Element: hardwareMap
Element: hardwareMap
Element: map
Element: map
Principles for Keyboard Ids
Principles for Keyboard Ids
References
Ancillary Information
To properly localize, parse, and format data requires ancillary information, which is not expressed in Locale Data
Markup Language. Some of the formats for values used in Locale Data Markup Language are constructed according to external specifications. The sources
for this data and/or formats include the following:
Bugs
CLDR Bug Reporting form
Charts
The online code charts can be found at
An index
to character names with links to the corresponding chart is found at
DUCET
The Default Unicode Collation Element Table (DUCET)
For the base-level collation, of which all the collation tables in this document are tailorings.
FAQ
Unicode Frequently Asked Questions
For answers to common questions on technical issues.
FCD
As defined in UTN #5 Canonical Equivalences in Applications
Glossary
Unicode Glossary
For explanations of terminology used in this and other documents.
JavaChoice
Java ChoiceFormat
Olson
The
TZ
ID Database (aka Olson timezone database)
Time zone and daylight savings information.
ftp://www.iana.org/time-zones
For archived data, see
ftp://ftp.iana.org/tz/releases/
Reports
Unicode Technical Reports
For information on the status and development process for technical reports, and for a list of technical reports.
Unicode
The Unicode Consortium. The Unicode Standard, Version 6.2.0,
(Mountain View, CA: The Unicode Consortium, 2012. ISBN 978-1-936213-07-8)
Versions
Versions
of the Unicode Standard
For information on version numbering, and citing and referencing
the Unicode Standard, the Unicode Character Database, and Unicode
Technical Reports.
XPath
Other Standards
Various standards define codes that are used as keys or values in Locale Data Markup Language. These include:
BCP47
The Registry
ISO639
ISO Language Codes
Actual List
ISO1000
ISO 1000: SI units and recommendations for the use of their multiples and of certain other units, International Organization for Standardization, 1992.
ISO3166
ISO Region Codes
Actual List
ISO4217
ISO Currency Codes
(Note that as of this point, there are significant problems with this list. The supplemental data file contains the best compendium of currency
information available.)
ISO15924
ISO Script Codes
Actual List
LOCODE
United Nations Code for Trade and Transport Locations, commonly known as "UN/LOCODE"
Download at:
RFC6067
BCP 47 Extension U
RFC6497
BCP 47 Extension T - Transformed Content
UNM49
UN M.49: UN Statistics Division
Country or area & region codes
Composition of macro geographical (continental) regions, geographical sub-regions, and selected economic and other groupings
XML Schema
W3C XML Schema
General
The following are general references from the text:
ByType
CLDR Comparison Charts
Calendars
Calendrical Calculations: The Millennium Edition by Edward M. Reingold, Nachum Dershowitz; Cambridge University Press;
Book and CD-ROM edition (July 1, 2001); ISBN: 0521777526. Note that the algorithms given in this book are copyrighted.
Comparisons
Comparisons between locale data from different sources
CurrencyInfo
UNECE Currency Data
DataFormats
CLDR Data Formats
Example
A sample in Locale Data Markup Language
ICUCollation
ICU rule syntax
ICUTransforms
Transforms
Transforms Demo
ICUUnicodeSet
ICU UnicodeSet
API
ITUE164
International Telecommunication Union: List Of ITU Recommendation E.164 Assigned Country Codes
available at
LocaleExplorer
ICU Locale Explorer
LocaleProject
Common Locale Data Repository Project
NamingGuideline
OpenI18N Locale Naming Guideline
formerly at http://www.openi18n.org/docs/text/LocNameGuide-V10.txt
RBNF
Rule-Based Number Format
RBBI
Rule-Based Break Iterator
RFC5234
RFC5234 Augmented BNF for Syntax Specifications: ABNF
UCAChart
Collation Chart
UTCInfo
NIST Time and Frequency Division Home Page
U.S. Naval Observatory: What is Universal Time?
WindowsCulture
Windows Culture Info (with mappings from [
BCP47
]-style codes to LCIDs)
Acknowledgments
Special thanks to the following people for their continuing overall contributions to the CLDR
project, and for their specific contributions in the following areas.
These descriptions only touch on the many contributions that they have made.
Mark Davis for creating the initial version of LDML, and adding to and maintaining this specification, and for his work on the LDML code and tests, much of the supplemental data and overall structure, and transforms and keyboards.
John Emmons for the POSIX conversion tool and metazones.
Deborah Goldsmith for her contributions to LDML architecture and this specification.
Chris Hansten for coordinating and managing data submissions and vetting.
Erkki Kolehmainen and his team for their work on Finnish.
Steven R. Loomis for development of the survey tool and database management.
Peter Nugent for his contributions to the POSIX tool and from Open Office, and for coordinating and managing data submissions and vetting.
George Rhoten for his work on currencies.
Roozbeh Pournader (روزبه پورنادر) for his work on South Asian countries.
Ram Viswanadha (రఘురామ్ విశ్వనాధ) for all of his work on LDML code and data integration, and for coordinating and managing data submissions and vetting.
Vladimir Weinstein (Владимир Вајнштајн) for his work on collation.
Yoshito Umaoka (馬岡 由人) for his work on the timezone architecture.
Rick McGowan for his work gathering language, script and region data.
Xiaomei Ji (吉晓梅) for her work on time intervals and plural formatting.
David Bertoni for his contributions to the conversion tools.
Mike Tardif for reviewing this specification and for coordinating and vetting data submissions.
Peter Edberg for work on this specification, telephone code data, monthPatterns, cyclicNameSets and contextTransforms.
Raymond Wainman and Cibu Johny
for their work on keyboards.
Jennifer Chye for her contributions to the conversion tools.
Other contributors to CLDR are listed on the
CLDR Project Page
Modifications
The following summarizes modifications from the previous revision of this document. Some of the modification
notes have an associated bug ticket number, which may be used to look up additional information about the modification; for further information,
see
Revision 31
Updated for Version 23.
This document has been split into multiple
parts
UCA CollationAuxiliary.html has been merged into the
Collation
part of this document. The CLDR root collation data files have been moved from the UCA data directory
into the CLDR repository.
The und-u-co-ducet tailoring has been removed.
Index exemplar characters with multiple primary weights are now supported.
The
the
#5512
Updated time zone format patterns and description in
Date Format Patterns
and
Using Time Zone Names
to incorporate some new time
zone format types and related changes.
Documented the new supplemental
Time Data
[tickets
#5488
#5789
Documented the new supplemental
Primary Zones
[tickets
#5439
#5788
Updated
Parsing Dates and Times
to suggest that
matching of abbreviated symbols should be tolerant of the presence or absence of an abbreviation marker
such as period. [ticket
#5631
Updated
Locale Inheritance
to explain that certain data items are
derived directly from the region specified in a locale id (or the likely region if none specified),
using data in supplemental, instead of via inheritance through the locale bundle.
[ticket
#5606
Added new section Appendix
Q.1.2 Time Zone Identifiers
to explain
the short time zone IDs and canonical long time zone IDs. Also updated
Key/Type Definitions
table and
Using Time Zone Names
to refer to the
new section.
[ticket
#5793
Improved the description of collation type "search".
[ticket
#5331
Revision 30 being a proposed update, only changes between revisions 29 and 31 are summarized here.
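As an illustration of the Revision 31 note on Parsing Dates and Times, the following is a minimal sketch of matching abbreviated symbols while tolerating the presence or absence of an abbreviation marker. The function names, the sample month list, and the choice of a trailing period as the only marker are assumptions made for this example and are not taken from the specification.

```python
from typing import Sequence


def strip_marker(text: str, marker: str = ".") -> str:
    """Remove one trailing abbreviation marker, if present."""
    if marker and text.endswith(marker):
        return text[: -len(marker)]
    return text


def match_abbreviation(candidate: str, symbols: Sequence[str]) -> int:
    """Return the index of the symbol that matches candidate, or -1.

    Both sides are compared with the trailing marker removed, so
    "Feb" matches "Feb." and "Feb." matches "Feb".
    """
    wanted = strip_marker(candidate)
    for index, symbol in enumerate(symbols):
        if strip_marker(symbol) == wanted:
            return index
    return -1


# Hypothetical abbreviated month names, for illustration only.
months = ["Jan.", "Feb.", "Mar."]
print(match_abbreviation("Feb", months))
```

A real parser would likely combine this with the other lenient-matching behavior described in the specification; this sketch covers only the marker.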
Revision 29
Updated for Version 22.1.
Clarified the mechanism for producing timezone display names, removing fallbackRegionFormat.
Clarified the use and meaning of the index characters.
Revised the table for collation keyword lookup in Section I.5 Keyword and Default Resolution
In
Section 5.14,
Collation Elements
added information on
zhuyin collation index markers. [ticket
#5320
Clarify that text in a dateTimePattern other than {0}{1} is treated as part of a date pattern.
[ticket
#5398
Revision 28 being a proposed update, only changes between revisions 27 and 29 are summarized here.
Revision 27
Updated for Version 22.
Added new section
Section 5.14.13,
Case Parameters
Modified the table in
Section 5.14.3,
Setting Options
to include information from UCA, and to add clarifications.
Added 6-letter patterns for short weekday names to the
Date Field Symbol Table
[ticket
#4571
Updated deprecated status. [ticket
#5229
In
Section 5.9.1,
Calendar Elements
under
mentioned the new day width
short
and the new
numeric
and width
all.
[ticket
#5268
Fixed last example in table
Specifying Collation Ordering
. [ticket
#3108
Added descriptions of coverageVariable in Appendix M. [ticket
#5269
Documented the attribute value status="grouping" in Appendix C.2. [ticket
#5270
Added Appendix R, for property data.
Added Appendix S, for Keyboards.
Updated reference to [Olson]
Added clarification for removal of deprecated codes to Appendix P:
Supplemental Metadata
Disallowed isolated "n" in plural rules in C.11
Language Plural Rules
. Reworded some of the syntax notes for clarity.
Moved collation Key/Type information to
Section 5.14.3
Setting Options
Clarified that the BCP47 key/value identifiers are the canonical (and preferred) identifiers.
Added
Section Q.1.1
Numbering System Data
Added description of deprecated="true" in Appendix Q.
Revision 26 being a proposed update, only changes between revisions 25 and 27 are summarized here.
Revision 25
Updated for CLDR Version 21.0.1.
Fixed typo in 't' extension.
Added note on the special case of 'und' in language matching:
#3439
Added explanation of Empty Override:
#4012
Fix typo in Annex N:
#4110
Revision 24 being a proposed update, only changes between revisions 23 and 25 are summarized here.
Revision 23
Updated for CLDR Version 21.
Note the change in version numbering scheme beginning with this release.
Added information on distinctness requirements for names of eras and dayPeriods.
[ticket
#3831
Added descriptions of new
character for cyclic year names. Deprecated the 'l' (SMALL LETTER L) pattern character for leap month marker.
[tickets
#4230
#4231
#4232
Added documentation regarding the deprecation of the commonlyUsed element in formatting short time
zone names. [tickets
#4052
#4130
Added documentation for ordinal plural forms. [ticket
#4323
Added documentation for territoryContainment status="deprecated". [ticket
#4326
Added documentation for gender of lists. [tickets
#4125
#4357
Clarified use of hour pattern characters (h, H, K, k) in skeletons and associated patterns, in both
Section 5.9.1
Calendar Elements
and in the
Date Field Symbol Table
[ticket
#4061
Added
Section 5.19
ContextTransform Elements
and
Section 5.20
Metadata Elements
Deprecated the
Added guidance for capitalization of display names, and for consideration of grammar vs. capitalization in format vs. stand-alone calendar names.
[ticket
#4317
Added clarification of YY as fixed width week of year. [ticket
#3862
Clarified that the parentLocale element does not apply to collation. [ticket
#3897
Moved the information about the alias element into
Section 5.21
Alias Elements
and removed comments about whole-locale aliasing.
Updated
Section C.7
Supplemental Time Zone Data
to specify the details of the
Windows TZID mapping data extended by ticket
#4067
in this release.
[ticket
#4296
Added documentation for the new date format pattern "ZZZZZ" (ISO 8601 time zone format) in
Appendix F:
Date Format Patterns
and
Appendix J:
Time Zone Display Names
[ticket
#3995
Added an explanation of the use of the code 'UK', and pointers to the aliases for normalizing codes. [ticket #
4250
Clarified the use of
4013
Updated info on deprecated items. [ticket #
4360
Clarified the status of non-decimal numbering systems. [ticket #
4177
Described collation reordering. [ticket #
4194
Changed the section title of
Appendix Q
from "Locale Extension Key and Type Data" to "BCP 47 Extension Data" and updated the description.
[ticket #
4361
In
Appendix J:
Time Zone Display Names
, fold description of Localized GMT-zero format into that of
Localized GMT format. [ticket #
3695
Describe the 't' extension [ticket #
3976
In
Section 5.10.2
Currencies
, deprecated the "choice" attribute for currency symbols.
[ticket
#3934
In
Appendix Q:
BCP 47 Extension Data
, clarified valid/invalid use case of
type value with multiple subtags. [ticket
#4212
Misc editorial fixes (spelling etc.).
[ticket
#4378
Revision 22 being a proposed update, only changes between revisions 21 and 23 are summarized here.
Revision 21
Updated for CLDR 2.0.1.
In the Collation section of the
Key/Type Definitions
table,
added an entry for "searchjl". [ticket
#3560
In the Collation parameters section of the
Key/Type Definitions
table,
corrected misspelled "quarternary" to "quaternary". [ticket
#4031
In
Section 5.6
Character Elements
corrected the statements about when punctuation and symbols cannot be included in exemplar sets, and added a note
about use of exemplar sets and number systems to determine character repertoire requirements to support a language.
[ticket
#3498
In
Section 5.9.3
Time Zone Names
added a note recommending use of generic location format in user interfaces for timezone selection,
and referring to the Date Field Symbol Table and Appendix G (where this is also discussed).
[ticket
#914
Added documentation for territoryContainment grouping="true", for additional deprecatedItems, for coverageLevels, and for parentLocales. [ticket
#3938
Added clarifications of count=0/1 [ticket
#3988
Added descriptions of special index markers added to CJK collations for stroke, pinyin, and unihan, and the alternate alt="short" forms.
Revision 20 being a proposed update, only changes between revisions 19 and 21 are summarized here.
Revision 19
Updated for CLDR 2.0.
In the Collation section of the
Key/Type Definitions
table,
added an entry for "ducet" and corrected the information about which types are available in all locales.
[ticket
#3399
Added fallbackRegionFormat, localeKeyTypePattern, stopwords, count=0/1.
Enhanced plural rules to allow for explicit lists:
n in 1,3,5..14
Clarified normalization of LDML files.
Changed the description of coverage levels; it is now data-based.
Clarified the use of commonlyUsed flag for pattern "v" in the
Date Field Symbol Table
[ticket
#2700
Added calendar type "iso8601" and number type "tamldec" in the
Key/Type Definitions
table.
Restricted the use of the
Revision 18 being a proposed update, only changes between revisions 17 and 19 are summarized here.
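As an illustration of the explicit-list plural rule syntax noted under Revision 19 (n in 1,3,5..14), the following is a minimal sketch of evaluating such a range list. The function names are hypothetical, and the sketch assumes that "in" matches only integer values; it is not a full plural rule parser.

```python
def parse_range_list(spec):
    """Parse a range list such as '1,3,5..14' into inclusive (low, high) pairs."""
    ranges = []
    for part in spec.split(","):
        part = part.strip()
        if ".." in part:
            low, high = part.split("..")
            ranges.append((int(low), int(high)))
        else:
            value = int(part)
            ranges.append((value, value))
    return ranges


def n_in(n, spec):
    """Return True if n is an integer contained in the range list spec."""
    if n != int(n):
        return False  # assumption: "in" matches integer values only
    return any(low <= n <= high for low, high in parse_range_list(spec))


print(n_in(5, "1,3,5..14"))
print(n_in(2, "1,3,5..14"))
```

Each comma-separated item is either a single value or an inclusive range written with two dots, so the list above covers 1, 3, and every integer from 5 through 14.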
Revision 17
Updated for CLDR 1.9.
In the
Date Field Symbol Table
, changed the description of
pattern character 'S' to indicate that the corresponding field truncates, rather than rounds. [ticket
#2845
In the description of
dateFormats
, noted that they are intended
primarily for use by themselves in user interface elements. [ticket
#3048
Added (short) descriptions of transformNames, ellipsis, moreInformation, punctuation exemplars. [ticket #3360]
Updated the discussion of canonical TZ IDs [ticket #2899]
Described the use of numberSystem with symbols, decimalFormats, etc. [ticket #3361]
Documented the recommended fallback for transforms [ticket #2240]
Clarified some issues with plural rules [ticket #3061]
Documented the changes for UCA 6.0, and clarified some examples and the use of "basic syntax" [ticket #3060]
Highlighted where CLDR and LDML have different defaults than UCA/DUCET [ticket #2904]
Described the default subtype 'true' for keywords [ticket
#2958
] Clarified that the defaults are different from the attribute defaults for collation.
Deprecated
Language Matching
[ticket #1988]
Clarified that alt values are not limited to the list in the text. Also added "short". [ticket #1910]
Updated coverage levels [ticket #2591]
Updated the specification to match DTD in various places:
Section 5
XML Format
header;
Section 5.1
Common Elements
for default element use choice instead of deprecated type;
Section 5.4
Display Name Elements
header;
Section 5.6
Character Elements
header;
Section 5.9
Date Elements
header;
Section 5.10
Number Elements
header.
[ticket
#1925
Added a collation type 'search' in the Key/Type Definitions table. Also moved 'standard' to the beginning to match the description. [ticket #3375]
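As an illustration of the Revision 17 note that pattern character 'S' truncates rather than rounds, the following is a minimal sketch of formatting a fractional-second field. The function name and the use of microseconds as the input unit are assumptions made for this example.

```python
def format_fractional_second(microseconds, digits):
    """Format the fraction of a second with the given number of digits.

    The value is truncated, not rounded, so 999900 microseconds with
    three digits yields "999" and never carries into the seconds field.
    """
    if not 0 <= microseconds < 1_000_000:
        raise ValueError("microseconds must be in the range 0..999999")
    if digits < 1:
        raise ValueError("digits must be at least 1")
    fraction = f"{microseconds:06d}"
    if digits <= 6:
        return fraction[:digits]
    # Pad with zeros when more digits are requested than are available.
    return fraction + "0" * (digits - 6)


print(format_fractional_second(999_900, 3))
```

Truncation keeps the formatted field consistent with the other fields: rounding 0.9999 seconds up would require incrementing the seconds, and possibly the minutes and hours as well.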
Revision 16
Updated for CLDR 1.8.1. Fix TOC links to sections C.17, C.18. [ticket #2722]
Updated
Appendix Q
Locale Extension
Key and Type Data
to provide more information about valid "vt" (variableTop) value and versioning.
[tickets #2740, #2741]
Clarified the role of aliases and fallback elements [tickets
#2757
#2742
, and
#2762
].
Revision 15
In
Section C.5
Supplemental Calendar Data
explained the different interpretations of
“first day of the week” and their relationship to CLDR data. [ticket #2663]
Updated
Appendix K
Valid Attribute Values
[ticket #1504]
Added descriptions of yeartype and numberingSystem attributes [ticket #2712]
Explained syntax and meaning of vt and variableTop.
Added index exemplar sets and index label elements
Added dayPeriod and dayPeriod Rules
Added list patterns
Added language matching
Updated descriptions of currency and timezone codes in the Key/Type table.
Revised the description in
Section C.7
Supplemental Time Zone Data
to support the new time zone data organization. [ticket #2715]
Updated references to UAX/UTS/UTR documents and to The Unicode Standard. [ticket #2530]
Noted (in Calendar Elements and Lenient Parsing sections) that narrow month and day values need not be distinct. [ticket #1955]
In
Appendix O
Lenient Parsing
expanded the discussion of equivalences among apostrophe-like characters. [ticket #2629]
Revision 14 being a proposed update, only changes between revisions 13 and 15 are summarized here.
Revision 13
Updated 3.
Unicode Language and Locale Identifiers
to make the primary locale identifier syntax more BCP47 compatible. [ticket #2457]
Added Appendix Q.
Locale Extension Key and Type Data
[ticket #2457]
Note that the dateRangePattern element is deprecated, replaced by intervalFormats. Note that the replacement
for the deprecated measurement element is measurementData in supplemental. Be more clear that the hoursFormat, abbreviationFallback,
and preferenceOrdering elements are deprecated. [ticket #2369]
In
Appendix G.8
Number Elements
updated the description
of grouping separators to match Appendix G.2. [ticket #2317]
Fixed some typos. [ticket #2216]
Revision 12
In
Section 1
Introduction
clarified that LDML is an interchange format, not a runtime format. [ticket #1971]
In
Section 3
Unicode Language and Locale Identifiers
Clarified some entries in the Variant Definitions section. [ticket #1878]
Added calendar types ethiopic-amete-alem, indian, roc. [ticket #1960]
In
Section 5.6
Character Elements
Noted that LRM and RLM can be included in auxiliary or currency exemplars. [ticket #2049]
Clarified ordering of encodings in the mapping element for email usage. [ticket #2153]
In
Section 5.9.3
Time Zone Names
Note that "GMT", "UT", "UTC" are not allowed as translations of non-GMT timezones. [ticket #1949]
Describe new
Added 5.17
Rule-Based Number Formatting
Added C.13
Numbering Systems
Added C.14
Postal Code Validation
Added C.15
Calendar Preference Data
Added C.16
BCP 47 Keyword Mapping
Also modified 3.2 BCP 47 Tag Conversion
In
Appendix F
Date Format Patterns
clarified the usage of the 'j'
format character. [ticket #2098]
In
Appendix J
Time Zone Display Names
Added the fallback format used for generic location when
does not have country data for a zone. [ticket #1962]
In the Parsing section, clarified parsing of GMT/UT/UTC and localized GMT formats with/without numeric
offset, and inability to parse all RFC 788 date/time formats. [ticket #1949]
Added
numbers
attribute on date patterns, in 5.9
Date Elements
and
C.15
Calendar Preference Data
and in
Key_Type_Definitions
. Also
added bookmarks for
Unicode_language_identifier
Unicode_locale_identifier
Language_Locale_Field_Definitions
, Variant_Definitions,
Key_Type_Definitions
Added
defaultNumberingSystem
in 5.10
Number Elements
Added
tender
attribute in C.1
Supplemental Currency Data
Misc editing.
Revision 11
Made a number of changes as the result of a copy-edit pass by Julie.
Added clarifications in
Unicode Language and Locale Identifiers
Revision 10
In
Section 1.1
Conformance
added UAX35-C2 and information on referencing particular components of Unicode locale or language
identifiers. [ticket #1801]
In
Section 3
Unicode Language and Locale Identifiers
Clarified the syntax and usage of the language and locale identifiers. [ticket #1801]
Replaced the use of the comma (U+002C COMMA), as an options separator, with the semicolon (U+003B SEMICOLON).
This change was also reflected in the examples given. [ticket #1717]
Re-emphasized that key and type values are limited to ASCII letters and digits,
and that they have to be unique within the first 8 letters and digits. [ticket #1772]
In
Section 4
Locale Inheritance
explained the different fallback processes for resource bundle lookup and resource item lookup. [ticket #1763]
In
Section 5
XML Format
clarified that no element value can start with a combining slash U+0338 (not a combining backslash). [ticket #1223]
In
Section 5.2
Common Attributes
updated the draft attribute value descriptions; added "contributed".
In
Section 5.3.1
Fallback Elements
clarified examples, and explained
how
implementations can provide a mechanism for overriding the fallbacks. [ticket #1763]
In
Section 5.4
Display Name Elements
re-emphasized the potential uses of these elements [ticket #1665], and added new localeDisplayPattern element [ticket #1448].
In
Section 5.9.1
Calendar Elements
Clarified handling of availableFormat patterns for calendars that require an era field if a year field is present [ticket #1346].
Described the format of a date-time format skeleton used with the availableFormats element [ticket #1611].
Added descriptions of intervalFormats [ticket #1813].
In
Section 5.9.3
Time Zone Names
indicated that timezone IDs are not limited to city names, and corrected/augmented the fallbackFormat examples. [ticket #1604]
In
Section 5.10.1
Number Symbols
updated to match current DTD.
In
5.10.2
Currencies
updated to match current DTD. Added section explaining use of "count" to format currency values for particular numeric values. [ticket #1550]
Added new
Section 5.11
Unit Elements
(renumbered the 5.x sections after it) describing
#1807, #1821]
In
Appendix C.6
Measurement System Data
clarified the meaning of "metric" and its relation to ISO 1000 [ticket #481], and corrected the values for paperSize [ticket #1712].
Added
Appendix C.11
Language Plural Rules
[tickets #1550, #1703]
Added
Appendix C.12
Telephone Code Data
[ticket #1542]
In
Appendix F:
Date Format Patterns
clarified the usage of the YYYY for week of year calendars. [ticket #1605]
In Section G.8 of
Appendix G:
Number Format Patterns
changed currencySeparator, which is not a valid field, to currencyDecimal. [ticket #997]
In
Appendix J:
Time Zone Display Names
amended the fallback example and corrected a typo; corrected root fallbackFormat and all discussions and examples that ensue from that change. [ticket #1604]
In the Parsing section of
Appendix J:
Time Zone Display Names
updated step 3 to allow UTC and UT as synonyms for GMT. [ticket #1582]
Fixed validation errors and broken links [tickets #1606, #1619].
Added contributors [ticket #1835]. In this Modifications section, added bug ticket information [ticket #1630].
Revision 9
Extensive rewrite of
Appendix J:
Time Zone Display Names
primarily due to refinements to the metazone process. This also caused some changes in Appendix F:
Date Format Patterns
. [ticket #1508]
Made the date
range handling uniform, with new
Section 5.2.1
Dates and Date Ranges
and related changes particularly to C.1 and C.5 in Appendix C:
Supplemental Data
Added
Appendix C.10
Likely Subtags
Added missing date pattern symbol "l" for Chinese calendar. [ticket #1557]
Revision 8
Reserved 'j' in date formats for distinguishing 12-hour and
24-hour formats.
Added section 5.1.2
Text Directionality
Added new conformance section: 1.1
Conformance
Revised text on loose matching to include BIDI control
characters: Appendix O:
Lenient Parsing
Revised text on distinguishing and blocking elements
Added currency exemplar sets
Added dateRangePatterns
Added language
fallbacks: 5.3.1
Fallback Elements
Clarified use of transliterator names
Added matching options for collation
Added currency change policy
Added description of character fallbacks, changed ordering of
NFC and NFKC.
Added DTD headers for supplemental data
Added supplemental metadata descriptions: Appendix P:
Supplemental Metadata
Added mappings to alternate
language and country codes
Added substantial data on
language and script usage in different countries
Added default content data
Added metazones
Clarified the before
and after elements in currency formatting.
Minor edits
Revision 7
Point at bug database instead of Unicode reporting form.
Add "root" as valid locale identifier, and clarify that "locale" in CLDR is really essentially language.
Added the list of private use language & script subtag codes that
will not be used by CLDR.
Corrected the dateTimeFormat assignments for {0} and {1}.
Revision 6
Incorporated Corrigendum 1 (see
) into Appendix F:
Date Format Patterns
and Section 5.4
Revamped Appendix J:
Time Zone Display Names
. Also changed "Fallback" to "Display Names" in the title of the Appendix,
and "Olson" to "TZ" in other places in the document.
Yesstr/nostr/yesexpr/noexpr changes in Section 5.12
Added Section 5.15
Moved week, measurement data to Appendix C:
Supplemental Data
Added coverage levels in Appendix M:
Coverage Levels
Added rule-based number formats and transforms, in Section 5.16
Transforms
, Appendix N:
Transform
Rules
Added metadata, replacing the contents of Appendix K:
Valid Attribute Values
Added availableFormats, dateFormatItem, and appendItem in
to support more flexible date/time formatting
Added measurementSystemNames and measurementSystemName in
for localized names of measurement
systems
Added quarters, quarterContext, quarterWidth, and quarter to
for names of calendar quarters
Extended possible values for
alt tag
Added ethiopic calendar to
allowed calendar values
Clarified usage of
quotation marks and alternate marks
Corrected example
ISBN
Added eraNarrow to
for one-character version of era names
Added
Appendix O
, on lenient parsing
Added
3.1
Unknown or Invalid Identifiers
Other editing
Updated descriptions to final DTD and metadata.
Revision 5
The canonical form for variants is upper case
Addition of UN M.49 codes
Addition of persian and coptic calendar IDs
Clarification of alias inheritance
New XML references section
Modified revision and generation field format
Use of language display names for whole initial segments of locale IDs names, such as nl-BE
Addition of the inList element
Clarification of 'narrow'
Additional dateTimeFormat description
Names of calendar fields, and relative times.
New element currencySpacing
Descriptions of POSIX yes/no
New supplemental data elements/attributes (end of Appendix C)
currency to/from
languageData
timezoneData
territoryContainment
mapTimezones
alias
deprecated
characters
in dateExtension of era to 1..3
Clarification of year padding
Deprecation of localizedPatternChars
Use of the singleCountries list
Appendix L: canonical form
Misc editing
Revision 1
(2005-06-30): added link to Corrigenda.
Copyright © 2001-2013 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and
assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use
of the information or programs contained or accompanying this technical report. The Unicode
apply.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.