UAX #44: Unicode Character Database

Unicode® Standard Annex #44
Unicode Character Database
Version: Unicode 17.0.0
Editors: Ken Whistler
Date: 2025-08-27
Revision: 36
Summary
This annex provides the core documentation for the
Unicode Character Database (UCD). It describes the layout and organization of the Unicode
Character Database and how it specifies the formal definitions of the Unicode Character Properties.
Status
This document has been reviewed by Unicode members and other interested
parties, and has been approved for publication by the Unicode Consortium.
This is a stable document and may be used as reference material or cited as
a normative reference by other specifications.
A Unicode Standard Annex (UAX)
forms an integral part of the
Unicode Standard, but is published online as a separate document. The
Unicode Standard may require conformance to normative content in a Unicode
Standard Annex, if so specified in the Conformance chapter of that version
of the Unicode Standard. The version number of a UAX document corresponds to
the version of the Unicode Standard of which it forms a part.
Please submit corrigenda and other comments with the online reporting
form [Feedback].
Related information that is useful in understanding this annex is found in Unicode Standard Annex #41,
"Common References for Unicode Standard Annexes."
For the latest version of the Unicode Standard, see [Unicode].
For a list of current Unicode Technical Reports, see [Reports].
For more information about versions of the Unicode Standard, see [Versions].
For any errata which may apply to this annex, see [Errata].
Contents
Introduction
Conformance
2.1 Simple and Derived Properties
2.2 Use of Default Values
2.3 Stability of Releases
Documentation
3.1 Character Properties in the Standard
3.2 The Character Property Model
3.3 NamesList.html
3.4 StandardizedVariants.html
3.5 Emoji Variation Sequences
3.6 Unihan and UAX #38
3.7 UTC-Source Ideographs and UAX #45
3.8 Data File Comments
3.9 Obsolete Documentation Files
UCD Files
4.1 Directory Structure
4.2 File Format Conventions
4.3 File List
4.4 Zipped Files
4.5 UCD in XML
Properties
5.1 Property Index
5.2 About the Property Table
5.3 Property Definitions
5.4 Derived Extracted Properties
5.5 Contributory Properties
5.6 Case and Case Mapping
5.7 Property Value Lists
5.8 Property and Property Value Aliases
5.9 Matching Rules
5.10 Invariants
5.11 Validation
5.12 Deprecation
5.13 Property APIs
5.14 Character Age
Test Files
6.1 NormalizationTest.txt
6.2 Segmentation Test Files and Documentation
6.3 Bidirectional Test Files
UCD Change History
Acknowledgments
References
Modifications
Note: The information in
this annex is not intended as an exhaustive description of the use and
interpretation of Unicode character properties and behavior. It must be used in conjunction with
the data in the other files in the Unicode Character Database, and relies on the notation and
definitions supplied in The Unicode Standard. All chapter references are to Version
17.0.0 of the standard unless otherwise indicated.
Introduction
The Unicode Standard is far more than a simple encoding of characters.
The standard also associates a rich set of semantics with each encoded
character—properties that
are required for interoperability and correct behavior in
implementations, as well as for Unicode conformance.
These semantics are cataloged in the Unicode Character Database (UCD), a collection of data files
which contain the Unicode character code points and character names.
The data files define the Unicode character properties and mappings between
Unicode characters (such as case mappings).
This annex describes the UCD and provides a guide to the various
documentation files associated with it. Additional information
about character properties and their use is contained in the
Unicode Standard and its annexes. In particular, implementers should familiarize themselves
with the formal definitions and conformance requirements for properties detailed
in Section 3.5, Properties in [Unicode] and with the material in
Chapter 4, Character Properties in [Unicode].
Additional discussion about the Unicode
character property model can be found in [UTR23].
The latest version of the UCD is always located on the Unicode
website at:
The specific files for the UCD associated with this version of
the Unicode Standard (17.0.0) are located at:
Stable, archived versions of the UCD associated with all earlier
versions of the Unicode Standard can be accessed from:
For a description of the changes in the UCD for
this version and earlier versions, see the UCD Change History.
Conformance
The Unicode Character Database is an integral part of the Unicode Standard.
The UCD contains normative property and mapping information required for
implementation of various Unicode algorithms such as the Unicode Bidirectional
Algorithm, Unicode Normalization, and Unicode Casefolding. The data files also
contain additional informative and provisional character property information.
Each specification of a Unicode algorithm, whether specified in the text of
[Unicode] or in one of the Unicode
Standard Annexes, designates which data file(s) in the UCD are needed to
provide normative property information required by that algorithm.
For information on the meaning and application of the terms
normative, informative, contributory, and provisional, see
Section 3.5, Properties in [Unicode].
For information about the applicable terms of use for the
UCD, see the Unicode
2.1
Simple and Derived Properties
2.1.1
Simple Properties
Some character properties in the UCD are simple properties.
This status has no bearing on whether or not the properties are
normative, but merely indicates that their values
are not derived from some combination of other properties.
2.1.2
Derived Properties
Other character properties are derived. This means that
their values are derived by rule from some other
combination of properties. Generally such rules are
stated as set operations, and may or may not include
explicit exception lists for individual characters.
Certain simple properties are defined merely
to make the statement of the rule defining a derived
property more compact or general. Such properties are
known as contributory properties.
Sometimes these contributory properties are defined to
encapsulate the messiness inherent in exception
lists. At other times, a contributory property may
be defined to help stabilize the definition of
an important derived property which is subject to stability
guarantees.
Derived character properties are not considered
second-class citizens among Unicode character properties.
They are defined to make implementation of important
algorithms easier to state. Included among the
first-class derived properties important for such
implementations are: Uppercase, Lowercase, XID_Start,
XID_Continue, Math, and Default_Ignorable_Code_Point, all
defined in DerivedCoreProperties.txt, as well as derived
properties for the optimization of normalization, defined
in DerivedNormalizationProps.txt.
Implementations should simply use the derived properties,
and should not try to rederive them from lists of simple
properties and collections of rules, because of the
chances for error and divergence when doing so.
Definitions of property derivations are provided
for information only, typically in comment fields
in the data files. Such definitions may be refactored,
refined, or corrected over time. These
definitions are presented in a modified set notation, expressed
as set additions and/or subtractions of various other property
values. For example:
# Derived Property: ID_Start
# Characters that can start an identifier.
# Generated from:
# Lu + Ll + Lt + Lm + Lo + Nl
# + Other_ID_Start
# - Pattern_Syntax
# - Pattern_White_Space
When interpreting definitions of derived properties
of this sort, keep in mind that set subtraction is not a commutative
operation. Thus "Lo + Lm - Pattern_Syntax" defines a different set
than "Lo - Pattern_Syntax + Lm". The order of property set operations
stated in the definitions affects the composition of
the derived set.
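The effect of operation order can be checked with ordinary set arithmetic. The following Python sketch uses small made-up sets in place of the real property value sets (the members are placeholders, not UCD data) and assumes that a definition is read from left to right:

```python
# Toy stand-ins for property value sets; the members are placeholders,
# not real UCD data.
Lo = {"a", "b"}
Lm = {"c", "d"}
Pattern_Syntax = {"b", "d"}

# "Lo + Lm - Pattern_Syntax", read left to right
first = (Lo | Lm) - Pattern_Syntax

# "Lo - Pattern_Syntax + Lm", read left to right
second = (Lo - Pattern_Syntax) | Lm

print(sorted(first))   # ['a', 'c']
print(sorted(second))  # ['a', 'c', 'd']
```

In the second form, the members of Lm that are also in Pattern_Syntax are added back after the subtraction, so the two results differ.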
If there are any cases of mismatches
between the definition of a derived property as
listed in DerivedCoreProperties.txt or similar data
files in the UCD, and the definition of a derived
property as a set definition rule, the explicit
listing in the data file should always be taken
as the normative definition of the property. As described
in Stability of Releases, the property
listing in the data files for any given version
of the standard will never change for that version.
2.1.3
Properties Dependent on External Specifications
In limited cases, a Unicode character property defined in the Unicode Character Database
may have an external dependency on another specification which is not a part of the Unicode Standard,
and whose data is not formally part of the UCD. In such cases, version stability for the UCD is attained by
requiring that dependency to be based on a known, published version of the external specification.
Starting with Version 10.0 of the UCD and continuing through Version 12.1,
the clear example of such an external dependency was the
derivation of some segmentation-related character properties, in part based on emoji properties associated with
UTS #51, "Unicode Emoji" [UTS51]. The details of the
derivation were described in the respective annexes, [UAX14] and [UAX29], as well as in the documentation portions of
the associated UCD property files. See [Data14] and [Props].
The version of UTS #51 used for those segmentation properties
in each of the relevant versions of the UCD was clearly
identified in those annexes and data files. Starting with
Version 13.0 of the UCD, however, the emoji properties which the UCD previously
depended on have been formally incorporated
into the UCD, so that they no longer constitute an external dependency.
An external dependency may impact either a simple or a derived property.
2.2
Use of Default Values
Unicode character properties have default values. Default
values are the value or values that a character property takes
for an unassigned code point, or in some instances, for
designated subranges of code points, whether assigned or
unassigned. For example, the default value of a binary
Unicode character property is always "N".
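As an illustration of how a default value behaves in a lookup, the following Python sketch stores only the code points explicitly listed for a binary property and returns "N" for everything else. The function name and the in-memory representation are assumptions made for this example; they are not part of the UCD.

```python
# Hypothetical in-memory form of a binary property: only the code points
# explicitly listed with the value "Y" are stored. The set below holds the
# ASCII_Hex_Digit code points (0-9, A-F, a-f).
ASCII_HEX_DIGIT_YES = (
    set(range(0x30, 0x3A)) | set(range(0x41, 0x47)) | set(range(0x61, 0x67))
)


def binary_property_value(listed_yes: set[int], code_point: int) -> str:
    """Return "Y" for listed code points and the default "N" otherwise."""
    return "Y" if code_point in listed_yes else "N"


print(binary_property_value(ASCII_HEX_DIGIT_YES, 0x41))      # Y
print(binary_property_value(ASCII_HEX_DIGIT_YES, 0x10FFFF))  # N
```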
For the formal discussion of default values, see D26 in
Section 3.5, Properties in [Unicode].
For conventions related to default values in various data files
of the UCD and for documentation regarding the particular default values of
individual Unicode character properties, see Default Values.
2.3
Stability of Releases
Just as for the Unicode Standard as a whole, each version of the
UCD, once published, is absolutely stable and will never
change. Each released version is archived in a directory on
the Unicode website, with a directory number associated with
that version. URLs pointing to that version's directory are also
stable and will be maintained in perpetuity.
Any errors discovered for a released version of the UCD
are noted in [Errata],
and if appropriate will be corrected in a subsequent
version of the UCD.
Stability guarantees constraining how Unicode character
properties can (or cannot) change between releases of the UCD
are documented in the Unicode Consortium Stability
Policies [Stability].
2.3.1
Changes to Properties Between Releases
Updates to character properties in the Unicode Character Database may be required
for any of three reasons:
To cover new characters added to the standard
To add new character properties to the standard
To change the assigned values for a property for some characters already in the standard
While the Unicode Consortium endeavors to keep the values of all
character properties as stable as possible between versions, occasionally circumstances
may arise which require changing them. In particular, as less well-documented scripts, such
as those for minority languages, or historic scripts are added to the standard, the exact
character properties and behavior may not fully be known when the script is first encoded.
The properties for some of these characters may change as further information becomes
available or as implementations turn up problems in the initial property assignments.
As far as possible, any readjustment of property values based
on growing implementation experience is made to be compatible with established practice.
All changes to normative or informative property values, to the status
or type of a property, or to property or property value aliases, must be approved by
an explicit decision taken by the Unicode Technical Committee. Changes to provisional
property values are subject to less stringent oversight.
Occasionally, a character property value is changed to prevent incorrect generalizations
about a character's use based on its nominal property values. For example, U+200B ZERO
WIDTH SPACE was originally classified as a space character (General_Category=Zs), but
it was reclassified as a Format character (General_Category=Cf) to clearly distinguish it from space characters
in its function as a format control for line breaking.
There is no guarantee that a particular value for an enumerated
property will actually have characters associated with it. Also, because of
changes in property value assignments between versions of the standard, a
property value that once had characters associated with it may later have none.
Such conditions and changes are rare, but implementations must not
assume that all property values are associated with non-null
sets of characters. For example, currently the special Script property
value Katakana_Or_Hiragana has no characters associated with it.
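One way to avoid assuming that every property value has members is to create an entry for each declared value before any assignments are read. The Python sketch below does this; the function name, the input shapes, and the one-entry sample data are assumptions made for illustration only.

```python
def group_by_value(declared_values, assignments):
    """Map every declared property value to its (possibly empty) set of code points."""
    # Start with an empty set for each declared value, so that values with
    # no characters still appear in the result.
    groups = {value: set() for value in declared_values}
    for code_point, value in assignments.items():
        groups.setdefault(value, set()).add(code_point)
    return groups


# Illustrative sample: two declared Script values, one assigned code point.
groups = group_by_value(
    ["Latin", "Katakana_Or_Hiragana"],
    {0x0041: "Latin"},
)
print(len(groups["Katakana_Or_Hiragana"]))  # 0
```

Code that iterates over `groups` then handles the empty set like any other, instead of failing on a missing key.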
2.3.2
Obsolete Properties
An obsolete property is one whose original use
case no longer exists. The original use case may have been overtaken by other
developments, or the property may have been supplanted by a different property,
and so forth.
For example, the ISO_Comment property was once used to keep
track of annotations for characters used in the production of name lists for
ISO/IEC 10646 code charts. As of Unicode 5.2.0 that
functionality was dropped, and so the property became obsolete,
and its value is now defaulted to the null string for all Unicode code points.
An obsolete property is never removed from the UCD.
Obsolete properties are not recommended for use in APIs.
2.3.3
Deprecated Properties
Formally declaring a property to be deprecated
is an indication that the property is no longer recommended for
use, perhaps because its original intent has been replaced by another property
or because its specification was somehow defective. The general
practice of the UTC is to deprecate properties that have become obsolete, although
there may be exceptions.
See also the discussion of Deprecation.
A deprecated property is never removed from the UCD.
Deprecated properties are not recommended for use in APIs.
Table 1 lists the properties that are formally deprecated as of
this version of the Unicode Standard.
Table 1. Deprecated Properties
Property Name     Deprecation Version   Reason
Grapheme_Link     5.0.0                 Duplication of ccc=9
Hyphen            6.0.0                 Supplanted by Line_Break property values
ISO_Comment       6.0.0                 No longer needed for chart generation; otherwise not useful
Expands_On_NFC    6.0.0                 Less useful than UTF-specific calculations
Expands_On_NFD    6.0.0                 Less useful than UTF-specific calculations
Expands_On_NFKC   6.0.0                 Less useful than UTF-specific calculations
Expands_On_NFKD   6.0.0                 Less useful than UTF-specific calculations
FC_NFKC_Closure   6.0.0                 Supplanted in usage by NFKC_Casefold; otherwise not useful
2.3.4
Stabilized Properties
A stabilized property is one for which the Unicode Technical Committee has declared that it will no longer actively maintain the property or extend it for newly
encoded characters. The property values of a
stabilized property are frozen as of a particular release of the standard.
The stabilization of a property does not indicate that the property
should or should not be used. For example, if the property references a subset of
characters that is unaffected by future additions to the repertoire, it may be
stabilized without becoming useless. An example of a property which
could be stabilized without becoming useless is ASCII_Hex_Digit, as no more such
digits would ever be added to the standard.
A stabilized property is never removed from the UCD.
Table 2 lists the properties that are formally stabilized as of
this version of the Unicode Standard.
Table 2. Stabilized Properties
Property Name   Stabilization Version
Hyphen          4.0.0
ISO_Comment     6.0.0
2.3.5
Provisional Properties
A provisional property has no stability guarantees. It may
be changed arbitrarily or may be removed altogether.
Table 9, Property Table does not list any provisional properties;
however, [UAX38] documents a large number of provisional properties
specified in the Unihan Database. Provisional properties are used to collect
various information about Han characters, for review and testing. On occasion, a
provisional property's status may change to informative or normative, in which
case it then becomes subject to the same stability guarantees as other properties.
A provisional property may be removed in any subsequent
version of the UCD.
Provisional properties are not recommended for use in APIs.
Documentation
This annex provides the core documentation for the UCD, but
additional information about character properties is available in
other parts of the standard and in additional documentation files
contained within the UCD.
3.1
Character Properties in the Standard
The formal definitions related to character properties used
by the Unicode Standard are documented in
Section 3.5, Properties in [Unicode].
Understanding those definitions and related terminology is
essential to the appropriate use of Unicode character properties.
See Section 4.1, Unicode Character Database, in [Unicode] for a general
discussion of the UCD and its use in defining properties. The
rest of Chapter 4 provides important explanations regarding
the meaning and use of various normative character properties.
3.2
The Character Property Model
For a general discussion of the property model which underlies
the definitions associated with the UCD, see
Unicode Technical Report #23, "The Unicode Character Property Model" [UTR23].
That technical report is informative, but over the years various
content from it has been incorporated into normative portions
of the Unicode Standard, particularly for the definitions in
Chapter 3.
UTR #23 presents the important distinction
between properties defined for strings (in contrast to properties defined for
characters or code points) and character properties that have values that are strings.
The latter are referred to as string-valued properties in UTR #23
and in this annex. UTR #23 also discusses string functions and their relation to
character properties.
3.3
NamesList.html
NamesList.html formally describes the format of the NamesList.txt data file in BNF.
That data file is used to drive the PDF formatting
of the Unicode code charts and names list. See also
Section 24.1, Character Names List, in [Unicode]
for a detailed discussion of the conventions used in the Unicode names list as
formatted for the online code charts.
3.4
StandardizedVariants.html
StandardizedVariants.html has been obsoleted
as of Version 9.0 of the UCD. This file formerly
documented standardized variants, showing a
representative glyph for each. It was closely tied to the data file,
StandardizedVariants.txt, which defines those sequences normatively.
The function of StandardizedVariants.html to show representative
glyphs for standardized variants has been superseded. There are now better means
of illustrating the glyphs. Many standardized variation sequences are shown
in the Unicode code charts directly, in summary sections at the ends of the
names list for any block which contains them. Glyphs for standardized variants
of CJK compatibility ideographs are also shown directly in the Unicode
code charts.
3.5
Emoji Variation Sequences
Emoji variation sequences are a special class of variation
sequences involving emoji characters. They are divided into two subtypes:
an emoji presentation sequence, consisting of an emoji character base followed
by the variation selector U+FE0F, and a text presentation sequence,
consisting of an emoji character base followed by the variation selector U+FE0E.
Such sequences come in pairs: the text presentation sequence shown
with a black and white presentation, as seen in the Unicode code charts,
and the emoji presentation sequence shown with a colorful icon, as
usually seen in implementations on mobile devices and elsewhere.
Starting with Version 9.0.0, the following page in the Unicode emoji
subsite area shows appropriate representative glyphs for all emoji variation
sequences, with separate columns for text
presentation sequences and for emoji presentation sequences:
The data file which defines the exact list of emoji variation
sequences is emoji-variation-sequences.txt. That file is maintained in the
UCD, but emoji variation sequences are documented in
Unicode Technical Standard #51, "Unicode Emoji" [UTS51].
3.6
Unihan and UAX #38
Unicode Standard Annex #38, "Unicode Han Database (Unihan)" [UAX38] describes
the format and content of the Unihan Database [Unihan],
which collects together all property information
for CJK unified ideographs. That annex also specifies in detail
which of the Unihan character properties are normative,
informative, or provisional.
The Unihan Database contains extensive and detailed mapping
information for CJK unified ideographs encoded in the Unicode Standard,
but it is aimed only at those ideographs, not at other characters used in the East
Asian context in general.
In contrast, East Asian legacy character sets, including important
commercial and national character set standards, contain many non-CJK
characters. As a result, the Unihan Database must be supplemented from
other sources to establish mapping tables for those character sets.
The majority of the content of the Unihan Database is
released for each version of the Unicode Standard as a collection of Unihan data
files in the UCD. Because of their large size, these data files are released only as
a zipped file, Unihan.zip. The details of the particular data files in Unihan.zip
and the CJK properties each one contains are provided in [UAX38].
For versions of the UCD prior to Version 5.2.0, all of the CJK properties were
listed together in a very large, single file, Unihan.txt.
3.7
UTC-Source Ideographs and UAX #45
Unicode Standard Annex #45, "U-Source Ideographs" [UAX45] describes the format of USourceData.txt,
which lists all of the information for UTC-Source ideographs.
3.8
Data File Comments
In addition to the specific documentation files for the UCD, individual data
files often contain extensive header comments describing their content and any
special conventions used in the data.
In some instances, individual property
definition sections also contain comments with information about how the property
may be derived. Such comments are informative; while they are intended
to convey the intent of the derivation, in case of any mismatch between
a statement of a derivation in a comment field and the actual
listing of the derived property, the list is considered to be definitive.
See Simple and Derived Properties.
3.9
Obsolete Documentation Files
UCD.html was formerly the primary documentation file for the UCD. As of Version 5.2.0, its
content has been wholly incorporated into this document.
Unihan.html was formerly the primary documentation file for
the Unihan Database. As of Version 5.1.0, its
content has been wholly incorporated into [UAX38].
Versions of the Unicode Standard
prior to Version 4.0.0 contained small, focused
documentation files, UnicodeCharacterDatabase.html, PropList.html, and
DerivedProperties.html, which were later consolidated into UCD.html.
StandardizedVariants.html has been obsoleted as of Version 9.0.0.
See Section 3.4, StandardizedVariants.html.
UCD Files
The heart of the UCD consists of the data files themselves. This section
describes the directory structure for the UCD and the format conventions
for the data files, and provides documentation for data files not documented
elsewhere in this annex.
4.1
Directory Structure
Each version of the UCD is released in a separate, numbered directory
under the Public directory on the Unicode website. The content of that
directory is complete for that release. It is also stable: once released,
it will be archived permanently in that directory, unchanged, at a stable URL.
The specific files for the UCD associated with this version of
the Unicode Standard (17.0.0) are located at:
The UCD data files proper are located under the ucd/ subdirectory.
Other data files and charts associated with a release of the Unicode Standard are
located in other subdirectories. For details regarding the data files for other
UTSes synchronized with each release of the Unicode Standard, see
[UTS10], [UTS39], [UTS46], and [UTS51].
The latest released version of the UCD is always accessible via the
following stable URL:
A draft version of the UCD under development for a subsequent release is always accessible via the
following stable URL:
Prior to Version 6.3.0, access to the latest released version
of the UCD was via the following stable URL:
That "UNIDATA" URL will be maintained, but is no longer recommended, because
it points to the ucd subdirectory of the latest release, rather than to the parent
directory for the release. The "UNIDATA" naming convention is also very old, and does not follow
the directory naming conventions currently used for other data releases in the
Public directory on the Unicode website.
4.1.1
UCD Files Proper
The UCD proper is located in the ucd subdirectory of the numbered version
directory. That directory contains all of the documentation files and most
of the data files for the UCD, including some data files for derived properties.
Although all UCD data files are version-specific for a release and most contain
internal date and version stamps, the file names of the released data files do not
differ from version to version. When linking to a version-specific data file, the
version will be indicated by the version number of the directory for the release.
All files for derived extracted properties are in the
extracted subdirectory of the ucd subdirectory.
See Derived Extracted Properties for
documentation regarding those data files and their content.
A number of auxiliary properties are specified in files in the
auxiliary subdirectory of the ucd subdirectory. It contains
data files specifying properties associated with
Unicode Standard Annex #29, "Unicode Text Segmentation" [UAX29] and with
Unicode Standard Annex #14, "Unicode Line Breaking Algorithm" [UAX14],
as well as test data for those algorithms.
See Segmentation Test Files and Documentation
for more information about the test data.
Certain data files associated with emoji properties are maintained
in the emoji subdirectory of the ucd subdirectory. Those data
files define the simple character properties associated with emoji characters,
as well as the emoji variation sequences. Other data files associated with
emoji, including those which define
the RGI ("recommended for general interchange") sets of various
types of emoji sequences, as well as emoji test data, are maintained elsewhere,
and are not considered formally a part of the UCD.
See [UTS51] for documentation regarding those data files and their content.
4.1.2
UCD XML Files
The XML version of the UCD is located in the ucdxml subdirectory of the
numbered version directory. See UCD in XML for
more details.
4.1.3
Charts
The code charts specific to a version of Unicode are archived
as a single large PDF file in the charts subdirectory of the
numbered version directory. See the readme.txt in that subdirectory
and the general web page explaining the Unicode Code Charts for
more details.
4.1.4
Beta Review Considerations
Prior to the formal release of a version of the UCD, draft files
are made available for review in a subdirectory named draft, under the
/Public directory on the Unicode server. The files in this
directory may include temporary files, including documentation of differences between
draft versions. The number of reviews is not fixed: a beta review will
always take place, but an alpha review is optional.
Notices contained in a ReadMe.txt file in the draft directory during the
beta review period also make it clear that the directory contains
preliminary material under review, rather than a final, stable release.
4.1.5
File Directory Differences for Early Releases
The UCD in XML was introduced in Version 5.1.0,
so UCD directories prior to that do not contain the ucdxml subdirectory.
UCD directories prior to Version 13.0.0 do not contain the emoji subdirectory.
UCD directories prior to Version 4.1.0 do not contain the auxiliary subdirectory.
UCD directories prior to Version 3.2.0 do not contain the extracted subdirectory.
The general structure of the file directory for a released version of the UCD
described above applies to Versions 4.1.0 and later. Prior to Version 4.1.0,
versions of the UCD were not self-contained, complete sets of data files
for that version, but instead only contained any new data files or any data files
which had changed since the prior release.
Because of this, the property files for a given version
prior to Version 4.1.0 can be spread over several directories. Consult the
component listings at Enumerated Versions
to find out which files in which directories comprise a complete set of data
files for that version.
The directory naming conventions and the file naming conventions also
differed prior to Version 4.1.0. So, for example, Version 4.0.0 of the UCD
is contained in a directory named 4.0-Update, and Version 4.0.1 of
the UCD in a directory named 4.0-Update1. Furthermore, for these
earlier versions, the data file names do contain explicit version
numbers.
4.2
File Format Conventions
Files in the UCD use the format conventions described in
this section, unless otherwise specified.
4.2.1
Data Fields
Each line of data consists of fields separated by semicolons. The fields are numbered
starting with zero.
The first field (0) of each line in the Unicode Character Database files represents a code
point or range. The remaining fields (1..n) are properties associated with that code point.
Leading and trailing spaces within a field are not significant.
However, no leading or trailing spaces
are allowed in any field of UnicodeData.txt.
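The field conventions above can be sketched in a few lines of Python. The helper name `split_fields` is an assumption made for this example, and the sketch deliberately ignores comments and the tab-separated Unihan format:

```python
def split_fields(line: str) -> list[str]:
    """Split one data line into fields; the list index is the field number."""
    # Leading and trailing spaces within a field are not significant,
    # so each field is stripped.
    return [field.strip() for field in line.split(";")]


fields = split_fields("0000..007F; Basic Latin")
print(fields[0])  # 0000..007F
print(fields[1])  # Basic Latin
```

Stripping is harmless for UnicodeData.txt, where fields carry no leading or trailing spaces in the first place.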
The Unihan data files [Unihan] in the UCD have a separate format, using tab characters
instead of semicolons to separate fields. See [UAX38]
for the detailed specification of the format of the Unihan data files. The
data files TangutSources.txt and NushuSources.txt also use this format.
4.2.2
Code Points and Sequences
Code points are expressed as hexadecimal numbers with four to six digits.
(See Appendix A, Notational Conventions in [Unicode]
for a full, formal definition of this convention.)
They are written without the "U+" prefix in
all data files except the Unihan data files. The Unihan data files use the "U+" prefix for
all Unicode code points, to distinguish them from other decimal and hexadecimal
numerical references occurring in their data fields.
When a data field contains a sequence of code points, spaces separate
the code points.
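A minimal Python sketch of reading code points and space-separated sequences follows. The function names are hypothetical, and the sketch covers only the prefix-free notation used outside the Unihan data files:

```python
import re

# Four to six hexadecimal digits, with no "U+" prefix.
_CODE_POINT = re.compile(r"[0-9A-Fa-f]{4,6}")


def parse_code_point(text: str) -> int:
    """Convert one hexadecimal code point string to an integer."""
    if not _CODE_POINT.fullmatch(text):
        raise ValueError(f"not a four- to six-digit hex code point: {text!r}")
    return int(text, 16)


def parse_sequence(field: str) -> list[int]:
    """Convert a space-separated sequence of code points to a list of integers."""
    return [parse_code_point(token) for token in field.split()]


print([f"{cp:04X}" for cp in parse_sequence("0041 0301")])  # ['0041', '0301']
```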
4.2.3
Code Point Ranges
A range of code points is specified by the form "X..Y".
Each code point in a range has the
associated property value specified on that line of the data file. For example (from Blocks.txt):
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
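Reading the first field of such a line can be sketched as follows. The function name `parse_range` is an assumption for this example; it accepts either the "X..Y" form or a single code point:

```python
def parse_range(field: str) -> tuple[int, int]:
    """Return (first, last) for "X..Y", or (X, X) for a single code point."""
    first, separator, last = field.strip().partition("..")
    start = int(first, 16)
    end = int(last, 16) if separator else start
    if end < start:
        raise ValueError(f"range end precedes range start: {field!r}")
    return start, end


start, end = parse_range("0080..00FF")
print(f"{start:04X}..{end:04X}")  # 0080..00FF
print(start <= 0x00E9 <= end)     # True
```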
For backward compatibility, ranges in the file UnicodeData.txt
are specified by entries for the
start and end characters of the range, rather than by the form "X..Y".
The start character is indicated by a range identifier, followed by a comma
and the string "First", in angle brackets. This entry takes the
place of a regular character name in field 1 for that line.
The end character is indicated on the next line with the same range identifier,
followed by a comma and the string "Last", in angle brackets:
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FEF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
For character ranges using this convention, the names of all characters in the range
are algorithmically derivable.
See
Section 4.8, Name
in [
Unicode
] for more information on
derivation of character names for such ranges.
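The First/Last convention in UnicodeData.txt can be folded back into ranges as sketched below in Python. This is an illustrative sketch only: the function name expand_ranges is hypothetical, and the sample lines used with it are assumed to follow the UnicodeData.txt layout described in this section.

```python
FIRST_SUFFIX = ", First>"
LAST_SUFFIX = ", Last>"


def expand_ranges(lines):
    """Yield (start, end, fields) for each entry in UnicodeData.txt-style lines."""
    pending = None  # (start code point, range identifier) of an open range
    for line in lines:
        fields = line.rstrip("\n").split(";")
        cp, name = int(fields[0], 16), fields[1]
        if name.startswith("<") and name.endswith(FIRST_SUFFIX):
            # Start of a range: remember it until the matching end line.
            pending = (cp, name[1:-len(FIRST_SUFFIX)])
        elif name.startswith("<") and name.endswith(LAST_SUFFIX):
            identifier = name[1:-len(LAST_SUFFIX)]
            if pending is None or pending[1] != identifier:
                raise ValueError(f"unmatched range end: {name}")
            yield pending[0], cp, fields
            pending = None
        else:
            # An ordinary line describes a single code point.
            yield cp, cp, fields


sample = [
    "4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;",
    "9FEF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;",
]
spans = [(start, end) for start, end, _ in expand_ranges(sample)]
```

The fields yielded for a range are those of the end line; apart from the code point and the range marker in field 1, they apply to every code point in the range.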
4.2.4
Comments
U+0023 NUMBER SIGN ("#") is used to indicate comments: all
characters from the number sign to the end
of the line are considered part of the comment, and are disregarded when parsing data.
In many files, the comments on data
lines use a common format, as illustrated here (from Scripts.txt):
09B2 ; Bengali # Lo BENGALI LETTER LA
The first part of a comment using this common format is the General_Category value,
provided for information. This is followed by the character name for
the code point in the first field (0).
The printing of the General_Category value is suppressed in instances where
it would be redundant, as for DerivedGeneralCategory.txt, in which the
property value in the data field is already the General_Category value.
The symbol "L&"
indicates characters of General_Category Lu, Ll, or Lt (uppercase, lowercase,
or titlecase letter). For example:
0386 ; Greek # L& GREEK CAPITAL LETTER ALPHA WITH TONOS
L& as used in these comments is an alias for
the derived LC value (cased letter) for the General_Category property, as documented in
PropertyValueAliases.txt.
When the data line contains a range of code points, this common format
for a comment also indicates a range of character names, separated by "..", as
illustrated here (from DerivedNumericType.txt):
00BC..00BE ; Numeric # No [3] VULGAR FRACTION ONE QUARTER..VULGAR FRACTION THREE QUARTERS
Normally, consecutive characters with the same property value would be
represented by a single code point range. In data files using this
comment convention, such ranges are subdivided so that all
characters in a range also
have the same General_Category value (or LC).
While this convention results in more ranges than are strictly necessary, it
makes the contents of the ranges clearer.
When a code point range occurs, the number of items in the range is
included in the comment (in square brackets), immediately following the General_Category value.
The comments are purely informational, and may change format or be omitted in the
future. They should not be parsed for content. However, see Section 4.2.10,
@missing Conventions, for specially formatted comment lines that are intended to be machine-readable.
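Discarding comments before further parsing might look like the following Python sketch. The name strip_comment is hypothetical. A parser that also needs the "@missing" lines would have to inspect each line before discarding its comment, because this helper drops them along with every other comment.

```python
def strip_comment(line: str) -> str:
    """Drop everything from the first "#" to the end of the line."""
    data, _, _comment = line.partition("#")
    # Lines that consist only of a comment come back as an empty string,
    # which a caller can skip.
    return data.strip()
```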
4.2.5
Code Point Labels
Surrogate code points, private-use characters, control codes, noncharacters,
and unassigned code points have no names. When such code points are
listed in the data files, for example to list their General_Category
values, the comments use code point labels instead of character
names. For example (from DerivedCoreProperties.txt):
2065 ; Default_Ignorable_Code_Point # Cn <reserved-2065>
Although code point labels are not formally character names
and are not considered values of the Name property for characters, they are
designed to be maintained as unique values within the namespace for Unicode
character names. Hence, implementations can safely use them as identifiers
for code points without overlap with actual character names.
Code point labels use one of the tags as documented in
Section 4.8, Name
in [
Unicode
] and as shown in
Table 3
, followed by "-" and the code point expressed in hexadecimal. The
entire label is then enclosed in angle brackets when
listed in data files of the UCD.
Table 3. Code Point Label Tags
Tag            General_Category   Note
reserved       Cn                 Noncharacter_Code_Point=F
noncharacter   Cn                 Noncharacter_Code_Point=T
control        Cc
private-use    Co
surrogate      Cs
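The construction of a code point label from the tags in Table 3 can be sketched in Python as follows. The function names are hypothetical. The noncharacter test (U+FDD0..U+FDEF plus the last two code points of each plane) comes from the core specification rather than from this section, and padding the code point to at least four hexadecimal digits is assumed from the notational conventions for code points.

```python
TAGS = {"Cc": "control", "Co": "private-use", "Cs": "surrogate"}


def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus the last two code points of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def code_point_label(cp: int, general_category: str) -> str:
    """Build a label such as <reserved-2065> for a code point without a name."""
    if general_category == "Cn":
        tag = "noncharacter" if is_noncharacter(cp) else "reserved"
    else:
        # Raises KeyError for categories that have no label tag.
        tag = TAGS[general_category]
    # Tag, hyphen, code point in hexadecimal, all in angle brackets.
    return f"<{tag}-{cp:04X}>"
```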
4.2.6
Multiple Properties in One Data File
When a file contains the specification for multiple properties, the second field specifies the name
of the property and the third field specifies the property value. For example (from
DerivedNormalizationProps.txt):
03D2 ; FC_NFKC; 03C5 # L& GREEK UPSILON WITH HOOK SYMBOL
03D3 ; FC_NFKC; 03CD # L& GREEK UPSILON WITH ACUTE AND HOOK SYMBOL
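A line from a file that specifies multiple properties can be read as a triple, as in the following Python sketch. The name parse_multi_property_line is hypothetical. The sketch assumes that every data line passed to it has at least three fields and rejects shorter lines rather than guessing at their meaning.

```python
def parse_multi_property_line(line: str):
    """Return (code point field, property name, value), or None for non-data lines."""
    data = line.partition("#")[0].strip()   # discard the comment
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    if len(fields) < 3:
        raise ValueError(f"expected at least three fields: {line!r}")
    # Field 0: code point or range; field 1: property name; field 2: value.
    return fields[0], fields[1], fields[2]
```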
4.2.7
Binary Property Values
For binary properties, the second field specifies the name of the applicable property, with
the implied value of the property being "True". Only the ranges of characters with the binary
property value of "Y" (= True) are listed. For example (from PropList.txt):
1680 ; White_Space # Zs OGHAM SPACE MARK
2000..200A ; White_Space # Zs [11] EN QUAD..HAIR SPACE
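Because only the code points with the value True are listed, a binary property can be loaded as a set, with absence from the set standing for False. The following Python sketch illustrates this; the name load_binary_property is hypothetical, and the sample lines repeat the PropList.txt excerpt above.

```python
def load_binary_property(lines, property_name: str) -> set[int]:
    """Collect the code points for which a binary property is True."""
    members = set()
    for line in lines:
        data = line.partition("#")[0].strip()   # discard the comment
        if not data:
            continue
        cp_field, name = [field.strip() for field in data.split(";")][:2]
        if name != property_name:
            continue
        first, separator, last = cp_field.partition("..")
        start = int(first, 16)
        end = int(last, 16) if separator else start
        members.update(range(start, end + 1))
    return members


sample = [
    "1680 ; White_Space # Zs OGHAM SPACE MARK",
    "2000..200A ; White_Space # Zs [11] EN QUAD..HAIR SPACE",
]
white_space = load_binary_property(sample, "White_Space")
```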
4.2.8
Multiple Values for Properties
When a data file defines a property which may take multiple values for a single code
point, the multiple values are expressed in a space-delimited list. For example (from ScriptExtensions.txt):
0640 ; Adlm Arab Mand Mani Phlp Rohg Sogd Syrc # Lm ARABIC TATWEEL
In some cases—but not all—the order of multiple elements in a space-delimited
list may be significant. When the order of multiple elements is significant, it is documented
along with the property itself. For example (from Unihan_Readings.txt), for the tag kMandarin,
when there are two values for a code point, the first value is used to
indicate a preferred pronunciation for zh-Hans (CN) and the second a
preferred pronunciation for zh-Hant (TW).
For further discussion, see Section 5.7.6,
Properties Whose Values Are Sets of Values.
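A space-delimited list of values can be read as sketched below in Python. The name parse_value_list is hypothetical. The sketch returns a list rather than a set so that the order of elements is preserved for those properties where it is significant.

```python
def parse_value_list(line: str):
    """Return (code point field, list of values) for a multi-valued data line."""
    data = line.partition("#")[0]                       # discard the comment
    cp_field, values = (part.strip() for part in data.split(";", 1))
    # Values are separated by spaces; the original order is kept.
    return cp_field, values.split()


line = "0640 ; Adlm Arab Mand Mani Phlp Rohg Sogd Syrc # Lm ARABIC TATWEEL"
code_point, scripts = parse_value_list(line)
```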
4.2.9
Default Values
Entries for a code point may be omitted in a data file if the
code point has a default value for the property in question.
For most string-valued properties,
including the definition of foldings and mappings, the
default value is the code point of the character itself.
For some string-valued properties that
apply primarily to a small, defined set of code points, the default
value is <none>, which is interpreted to mean that no value is defined. (This
contrasts with specification of an actual value consisting of an
empty string. See
Section 4.2.11, Empty Fields
.) Current examples include
Bidi_Paired_Bracket
, as well as some Unihan-related properties.
For miscellaneous properties which take strings as values,
such as the Unicode Name property, the default value is an empty
string.
For binary properties, except for
Extended_Pictographic,
the default value is always "N" (= False)
and is always omitted.
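The simple default rules above can be expressed as lookups that fall back to the default when a code point has no entry. The following Python sketch assumes that the data has already been loaded into hypothetical in-memory tables (a dict for mappings and names, a set for a binary property); it does not cover Extended_Pictographic or other properties with complex defaults.

```python
def simple_mapping(table: dict[int, int], cp: int) -> int:
    # Foldings and mappings default to the code point itself.
    return table.get(cp, cp)


def character_name(names: dict[int, str], cp: int) -> str:
    # The Name property defaults to an empty string.
    return names.get(cp, "")


def has_binary_property(members: set[int], cp: int) -> bool:
    # Code points that are not listed have the value "N" (= False).
    return cp in members
```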
For enumerated and catalog properties, the default value is listed in a comment. For
example (from Scripts.txt):
# All code points not explicitly listed for Script
# have the value Unknown (Zzzz).
A few properties of the enumerated type have multiple default values. In
those cases, comments in the file explain the code point ranges for applicable values.
See also Table 4.
Default values are also listed in specially-formatted comment lines,
using the keyword "@missing". Parsers which extract and process
these lines can algorithmically determine the default values for all code points.
See Section 4.2.10,
@missing Conventions,
for details about the syntax and use of these lines.
Because of the legacy format constraints for UnicodeData.txt, that
file contains no specific information about default values for properties.
The default values for fields in UnicodeData.txt are documented
in
Table 4
below
if they cannot be derived from the general rules about default values
for properties.
The file ArabicShaping.txt is also exceptional, because it omits the listing
of many characters whose property value (jt=T) can be derived by rule. Adding an "@missing" line
to that file would result in the wrong interpretation of Joining_Type values for omitted characters.
The full explicit listing of Joining_Type values and the correct "@missing" line for
the default Joining_Type value (jt=U) can be found in the file DerivedJoiningType.txt instead.
The values of Joining_Type listed in DerivedJoiningType.txt should
be taken as definitive, because of the difficulty of deriving the correct values for all
characters based only on the entries in ArabicShaping.txt.
Default values for common catalog, enumeration, and
numeric properties are listed in
Table 4
, along
with the exceptional binary property, Extended_Pictographic.
Further explanation is provided below the table, in
those cases where the default values
are complex, as indicated in the third column.
Table 4. Default Values for Properties
Property Name               Default Value(s)                Complex?
Age                         Unassigned (= NA)               No
Bidi_Class                  L, AL, R, BN, ET                Yes
Block                       No_Block                        No
Canonical_Combining_Class   Not_Reordered (= 0)             No
Decomposition_Type          None                            No
East_Asian_Width            Neutral (= N), Wide (= W)       Yes
Extended_Pictographic       N (= False), Y (= True)         Yes
General_Category            Cn                              No
Line_Break                  Unknown (= XX), ID, PR          Yes
Numeric_Type                None                            No
Numeric_Value               NaN                             No
Script                      Unknown (= Zzzz)                No
Vertical_Orientation        Rotated (= R), Upright (= U)    Yes
4.2.9.1
Complex Default Values
Complex default values
are those which take multiple values, contingent on
code point ranges or other conditions. Complex default values other than those specified in the
"@missing" line are explicitly listed in the relevant property file, except for instances
noted in this section. This means that a parser extracting property values from
the UCD should never encounter an ambiguous condition for which the default value of a property
for a particular code point is unclear.
Bidi_Class
See
Unicode Standard Annex #9, "Unicode Bidirectional Algorithm" [
UAX9
] and DerivedBidiClass.txt for full details.
East_Asian_Width
This property defaults to Neutral for most code points, but defaults to Wide
for unassigned code points in blocks associated with CJK ideographs.
See Unicode Standard Annex #11, "East Asian Width" [
UAX11
] and
EastAsianWidth.txt for documentation of the default values
and DerivedEastAsianWidth.txt for the full listing of values.
Line_Break
This property defaults to Unknown for most code points, but defaults to ID
for unassigned code points in blocks associated with CJK ideographs, and
in blocks in the ranges U+1F000..U+1FAFF
and U+1FC00..U+1FFFD.
The property defaults to PR for unassigned code
points in the Currency Symbols block. See Unicode Standard Annex #14, "Unicode Line Breaking Algorithm" [
UAX14
] and LineBreak.txt for documentation of the default values
and DerivedLineBreak.txt for the full listing of values.
Extended_Pictographic
This property defaults to N (= False) for most code points, but defaults to
Y (= True) for unassigned code points in blocks in the ranges U+1F000..U+1FAFF and U+1FC00..U+1FFFD.
Those ranges are correlated with the ranges associated with default values for the Line_Break
property, and have the same rationale. They help future-proof the behavior of Unicode segmentation
algorithms for code point ranges most likely to be used for future assignment of new emoji characters.
Vertical_Orientation
This property defaults to Rotated (R) for most code points,
but defaults to Upright (U)
for unassigned code points in blocks associated with scripts that are themselves predominantly
Upright, in blocks for
some notational systems, and in blocks predominantly associated with pictographic
symbols and emoji.
See Unicode Standard Annex #50, "Unicode Vertical Text Layout" [
UAX50
] and VerticalOrientation.txt for full details.
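As one illustration of a complex default, the Extended_Pictographic rule stated above can be sketched in Python as a range test. The names below are hypothetical. The sketch answers only the question of which default applies to an unassigned code point, it treats the two stated ranges as plain code point ranges, and it is not a substitute for the values listed in the data files.

```python
# Ranges in which unassigned code points default to Y (= True).
# Python ranges exclude their end, so each end is one past the last code point.
EXTENDED_PICTOGRAPHIC_DEFAULT_TRUE = (
    range(0x1F000, 0x1FAFF + 1),   # U+1F000..U+1FAFF
    range(0x1FC00, 0x1FFFD + 1),   # U+1FC00..U+1FFFD
)


def default_extended_pictographic(cp: int) -> bool:
    """Default Extended_Pictographic value for an unassigned code point."""
    return any(cp in span for span in EXTENDED_PICTOGRAPHIC_DEFAULT_TRUE)
```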
4.2.10
@missing Conventions
Specially-formatted comment lines with the keyword "@missing" are
used to define default property values for ranges of code points not explicitly listed
in a data file. These lines follow regular conventions that make them
machine-readable.
An @missing line starts with the comment character "#", followed by
a space, then the "@missing" keyword, followed by a colon, another space, a code
point range, and a semicolon. Then the
line typically continues with a semicolon-delimited list of one or more
default property values. For example:
# @missing: 0000..10FFFF; Unknown
In general, the code point range and semicolon-delimited list follow
the same syntactic conventions as the data file in which the @missing line occurs, so
that any parser which interprets that data file can easily be adapted to also
parse and interpret an @missing line to pick up default property values for code points.
@missing lines are also supplied for many properties in the file
PropertyValueAliases.txt. In this case, because there are many @missing lines in that
single data file, each @missing line in that file
uses the syntactic pattern code_point_range; property_name; default_prop_val.
An @missing line is never provided for a binary property, because the
default value for binary properties is always "N" and need not be defined redundantly
for each binary property.
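An @missing line can be recognized and split with a small amount of code, as in the following Python sketch. The pattern and the name parse_missing are hypothetical and cover only the layout described above. The sketch returns the semicolon-delimited fields uninterpreted, because their meaning depends on the file: in PropertyValueAliases.txt the first field is a property name, while in most other files every field is a default value. The second sample line is constructed from the pattern described for PropertyValueAliases.txt.

```python
import re

# "# @missing: " followed by a code point range, a semicolon, and the rest.
MISSING_LINE = re.compile(
    r"^# @missing: ([0-9A-Fa-f]{4,6})\.\.([0-9A-Fa-f]{4,6});(.*)$"
)


def parse_missing(line: str):
    """Return (start, end, fields) for an @missing line, else None."""
    match = MISSING_LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    start, end = int(match.group(1), 16), int(match.group(2), 16)
    fields = [field.strip() for field in match.group(3).split(";")]
    return start, end, fields


single = parse_missing("# @missing: 0000..10FFFF; Unknown")
named = parse_missing("# @missing: 0000..10FFFF; Script; Unknown")
```

Ordinary comments and data lines yield None, so the helper can be applied to every line of a file before comments are discarded.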
Because of the
addition of property names when @missing lines are included in PropertyValueAliases.txt,
there are currently two syntactic patterns used for @missing lines, as
summarized schematically below:
code_point_range; default_prop_val
code_point_range; property_name; default_prop_val
In this schematic representation, "default_prop_val" stands in for
either an explicit property value or for a special tag such as <none> or <script>