Chapter 9 – Unicode 16.0.0

Chapter 9
Middle East-I
Modern and Liturgical Scripts
Most scripts in this chapter have a common origin in the ancient Phoenician alphabet.
The Hebrew script is used in Israel and for languages of the Diaspora. The Arabic script is used to write many languages throughout the Middle East, North Africa, and certain parts of Asia. The Syriac script is used to write a number of Middle Eastern languages. These three also function as major liturgical scripts, used worldwide by various religious groups.
The Samaritan script is used in small communities in Israel and the Palestinian Territories to write the Samaritan Hebrew and Samaritan Aramaic languages. The Mandaic script was used in southern Mesopotamia in classical times for liturgical texts by adherents of the Mandaean gnostic religion. The Classical Mandaic and Neo-Mandaic languages are still in limited current use in modern Iran and Iraq and in the Mandaean diaspora.
Unlike most of the other scripts discussed in this chapter, the Yezidi script is an alphabet. The script was used to write two religious texts which may date to the 12th or 13th centuries. The script was recently revived and is used by clergy in the Yezidi temple in Tbilisi.
The Middle Eastern scripts are mostly abjads, with small character sets. Words are demarcated by spaces. These scripts include a number of distinctive punctuation marks. In addition, the Arabic script includes traditional forms for digits, called “Arabic-Indic digits” in the Unicode Standard.
Text in these scripts is written from right to left. Implementations of these scripts must conform to the Unicode Bidirectional Algorithm (see Unicode Standard Annex #9, “Unicode Bidirectional Algorithm”). For more information about writing direction, see Section 2.10, Writing Direction. There are also special security considerations that apply to bidirectional scripts, especially with regard to their use in identifiers. For more information about these issues, see Unicode Technical Report #36, “Unicode Security Considerations.”
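The right-to-left behavior described above is driven by the Bidi_Class property of each character. The following is a minimal sketch, assuming Python's standard `unicodedata` module; the sample letters and the helper name `bidi_classes` are illustrative choices, not part of the standard.

```python
import unicodedata

# One representative letter per script (an illustrative selection).
SAMPLES = {
    "Hebrew": "\u05D0",     # HEBREW LETTER ALEF
    "Arabic": "\u0627",     # ARABIC LETTER ALEF
    "Syriac": "\u0710",     # SYRIAC LETTER ALAPH
    "Samaritan": "\u0800",  # SAMARITAN LETTER ALAF
    "Mandaic": "\u0840",    # MANDAIC LETTER HALQA
}

def bidi_classes(samples):
    """Map each script name to the Bidi_Class of its sample letter."""
    return {script: unicodedata.bidirectional(ch) for script, ch in samples.items()}

for script, bidi_class in bidi_classes(SAMPLES).items():
    print(f"{script}: {bidi_class}")
```

Each letter reports one of the strong right-to-left classes, R or AL, which is the input the Bidirectional Algorithm uses when resolving display order.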
Arabic, Syriac and Mandaic are cursive scripts even when typeset, unlike Hebrew and Samaritan, where letters are unconnected. Most letters in Arabic, Syriac and Mandaic assume different forms depending on their position in a word. Shaping rules for the rendering of text are specified in Section 9.2, Arabic; Section 9.3, Syriac; and Section 9.5, Mandaic. Shaping rules are not required for Hebrew because only five letters have position-dependent final forms, and these forms are separately encoded.
Historically, Middle Eastern scripts did not write short vowels. Nowadays, short vowels are represented by marks positioned above or below a consonantal letter. Vowels and other pronunciation (“vocalization”) marks are encoded as combining characters, so support for vocalized text necessitates use of composed character sequences. Yiddish and Syriac are normally written with vocalization; Hebrew, Samaritan, and Arabic are usually written unvocalized.
9.1 Hebrew
9.1.1 Hebrew: U+0590–U+05FF
The Hebrew script is used for writing the Hebrew language as well as Yiddish, Judezmo (Ladino), and a number of other languages. Vowels and various other marks are written as
points,
which are applied to consonantal base letters; these marks are usually omitted in Hebrew, except for liturgical texts and other special applications. Five Hebrew letters assume a different graphic form when they occur last in a word.
Directionality.
The Hebrew script is written from right to left. Conformant implementations of Hebrew script must use the Unicode Bidirectional Algorithm (see Unicode Standard Annex #9, “Unicode Bidirectional Algorithm”).
Cursive.
The Unicode Standard uses the term
cursive
to refer to writing where the letters of a word are connected. A handwritten form of Hebrew is known as cursive, but its rounded letters are generally unconnected, so the Unicode definition does not apply. Fonts based on cursive Hebrew exist. They are used not only to show examples of Hebrew handwriting, but also for display purposes.
Standards.
ISO/IEC 8859-8—Part 8.
Latin/Hebrew Alphabet
. The Unicode Standard encodes the Hebrew alphabetic characters in the same relative positions as in ISO/IEC 8859-8; however, there are no points or Hebrew punctuation characters in that ISO standard.
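The "same relative positions" relationship can be checked mechanically. The sketch below assumes Python's built-in `iso8859_8` codec and looks only at the letter range 0xE0..0xFA; the helper name `hebrew_offsets` is hypothetical.

```python
def hebrew_offsets():
    """Return the set of differences between each Unicode code point and its
    ISO/IEC 8859-8 byte value over the letter range 0xE0..0xFA."""
    return {ord(bytes([b]).decode("iso8859_8")) - b for b in range(0xE0, 0xFB)}

# A single offset means the letters keep their relative order and spacing.
print(sorted(hex(delta) for delta in hebrew_offsets()))
```

A one-element result indicates that every letter in the range is shifted by the same constant, so conversion between the two encodings for letters is a simple addition.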
Vowels and Other Pronunciation Marks.
These combining marks, generically called
points
in the context of Hebrew, indicate vowels or other modifications of consonantal letters. General rules for applying combining marks are given in
Section 2.11, Combining Characters
, and
Section 3.6, Combination
. Additional Hebrew-specific behavior is described below.
Hebrew points can be separated into four classes: dagesh, shin dot and sin dot, vowels, and other pronunciation marks.
Dagesh, U+05BC HEBREW POINT DAGESH OR MAPIQ, has the form of a dot that appears inside the letter that it affects. It is not a vowel but rather a diacritic that affects the pronunciation of a consonant. The same base consonant can also have a vowel and/or other diacritics. Dagesh is the only element that goes inside a letter.
The dotted Hebrew consonant
shin
is explicitly encoded as the sequence
U+05E9
HEBREW LETTER SHIN
followed by
U+05C1
HEBREW POINT SHIN DOT
. The
shin dot
is positioned on the upper-right side of the undotted base letter. Similarly, the dotted consonant
sin
is explicitly encoded as the sequence
U+05E9
HEBREW LETTER SHIN
followed by
U+05C2
HEBREW POINT SIN DOT
. The
sin dot
is positioned on the upper-left side of the base letter. The two dots are mutually exclusive. The base letter
shin
can also have a
dagesh
, a vowel, and other diacritics. The two dots are not used with any other base character.
Vowels all appear below the base character that they affect, except for holam, U+05B9 HEBREW POINT HOLAM, which appears above left. The following points represent vowels: U+05B0..U+05BB and U+05C7.
The remaining three points are pronunciation marks: U+05BD HEBREW POINT METEG, U+05BF HEBREW POINT RAFE, and U+FB1E HEBREW POINT JUDEO-SPANISH VARIKA. Meteg, also known as siluq, goes below the base character; rafe and varika go above it. The varika, used in Judezmo, is a glyphic variant of rafe.
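The four classes of points described above can be expressed as a small lookup. This is only a sketch of the classification given in the text; the function name `classify_hebrew_point` and the returned labels are hypothetical.

```python
def classify_hebrew_point(ch):
    """Return the class of a Hebrew point as described in the text,
    or None if the character is not one of the listed points."""
    cp = ord(ch)
    if cp == 0x05BC:                              # dagesh or mapiq
        return "dagesh"
    if cp in (0x05C1, 0x05C2):                    # shin dot, sin dot
        return "shin/sin dot"
    if 0x05B0 <= cp <= 0x05BB or cp == 0x05C7:    # vowel points
        return "vowel"
    if cp in (0x05BD, 0x05BF, 0xFB1E):            # meteg, rafe, varika
        return "pronunciation mark"
    return None

print(classify_hebrew_point("\u05B9"))  # holam
```

Cantillation marks and letters fall outside the four classes and yield `None` in this sketch.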
Shin and Sin.
Separate characters for the dotted letters
shin
and
sin
are not included in this block. When it is necessary to distinguish between the two forms, they should be encoded as
U+05E9
HEBREW LETTER SHIN
followed by the appropriate dot, either
U+05C1
HEBREW POINT SHIN DOT
or
U+05C2
HEBREW POINT SIN DOT
. (See preceding discussion.) This practice is consistent with Israeli standard encoding.
Final (Contextual Variant) Letterforms.
Variant forms of five Hebrew letters are encoded as separate characters in this block, as in Hebrew standards including ISO/IEC 8859-8. These variant forms are generally used in place of the nominal letterforms at the end of words. Certain words, however, are spelled with nominal rather than final forms, particularly names and foreign borrowings in Hebrew and some words in Yiddish. Because final form usage is a matter of spelling convention, software should not automatically substitute final forms for nominal forms at the end of words. The positional variants should be coded directly and rendered one-to-one via their own glyphs—that is, without contextual analysis.
Yiddish Digraphs.
The digraphs are considered to be independent characters in Yiddish. The Unicode Standard has included them as separate characters so as to distinguish certain letter combinations in Yiddish text—for example, to distinguish the digraph
double vav
from an occurrence of a consonantal
vav
followed by a vocalic
vav
. The use of digraphs is consistent with standard Yiddish orthography. Other letters of the Yiddish alphabet, such as
pasekh alef,
can be composed from other characters, although alphabetic presentation forms are also encoded.
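The distinction between digraphs and composable letters shows up under normalization. The following sketch assumes Python's `unicodedata` module; the constant names and the helper `same_under` are illustrative.

```python
import unicodedata

DOUBLE_VAV = "\u05F0"               # HEBREW LIGATURE YIDDISH DOUBLE VAV
TWO_VAVS = "\u05D5\u05D5"           # vav followed by vav
PASEKH_ALEF_FORM = "\uFB2E"         # HEBREW LETTER ALEF WITH PATAH (presentation form)
PASEKH_ALEF_SEQUENCE = "\u05D0\u05B7"  # alef followed by patah

def same_under(form, a, b):
    """Return True if a and b are identical after normalizing to form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# The digraph stays distinct from two separate vavs in every form.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, same_under(form, DOUBLE_VAV, TWO_VAVS))

# The presentation form is canonically equivalent to the composed sequence.
print(same_under("NFC", PASEKH_ALEF_FORM, PASEKH_ALEF_SEQUENCE))
```

The digraph therefore carries information that plain letter sequences do not, whereas pasekh alef can be stored as a sequence without loss.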
Punctuation.
Most punctuation marks used with the Hebrew script are not given independent codes (that is, they are unified with Latin punctuation) except for the few cases where the mark has a unique form in Hebrew, namely U+05BE HEBREW PUNCTUATION MAQAF, U+05C0 HEBREW PUNCTUATION PASEQ (also known as legarmeh), U+05C3 HEBREW PUNCTUATION SOF PASUQ, U+05F3 HEBREW PUNCTUATION GERESH, and U+05F4 HEBREW PUNCTUATION GERSHAYIM. For paired punctuation such as parentheses, the glyphs chosen to represent U+0028 LEFT PARENTHESIS and U+0029 RIGHT PARENTHESIS will depend on the direction of the rendered text. See Section 4.7, Bidi Mirrored, for more information. For additional punctuation to be used with the Hebrew script, see Section 6.2, General Punctuation.
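Whether a punctuation character changes its glyph with text direction is recorded in the Bidi_Mirrored property. A minimal sketch, assuming Python's `unicodedata` module (the helper name `is_mirrored` is hypothetical):

```python
import unicodedata

def is_mirrored(ch):
    """Return True if the character has the Bidi_Mirrored property."""
    return bool(unicodedata.mirrored(ch))

for ch in ("(", ")", "\u05BE"):  # parentheses and HEBREW PUNCTUATION MAQAF
    print(f"U+{ord(ch):04X}", is_mirrored(ch))
```

The unified parentheses are mirrored in right-to-left runs, while a Hebrew-specific mark such as maqaf has a single form.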
Cantillation Marks.
Cantillation marks are used in publishing liturgical texts, including the Bible. There are various historical schools of cantillation marking; the set of marks included in the Unicode Standard follows the Israeli standard SI 1311.2.
Positioning.
Marks may combine with vowels and other points, and complex typographic rules dictate how to position these combinations.
The vertical placement (meaning above, below, or inside) of points and marks is very well defined. The horizontal placement (meaning left, right, or center) of points is also very well defined. The horizontal placement of marks, by contrast, is not well defined, and convention allows for the different placement of marks relative to their base character.
When points and marks are located below the same base letter, the point always comes first (on the right) and the mark after it (on the left), except for the marks yetiv, U+059A HEBREW ACCENT YETIV, and dehi, U+05AD HEBREW ACCENT DEHI. These two marks come first (on the right) and are followed (on the left) by the point.
These rules are followed when points and marks are located above the same base letter:
If the point is holam, all cantillation marks precede it (on the right) except pashta, U+0599 HEBREW ACCENT PASHTA.
Pashta always follows (goes to the left of) points.
Holam on a sin consonant (shin base + sin dot) follows (goes to the left of) the sin dot. However, the two combining marks are sometimes rendered as a single assimilated dot.
Shin dot and sin dot are generally represented closer vertically to the base letter than other points and marks that go above it.
Meteg.
Meteg, U+05BD HEBREW POINT METEG, frequently co-occurs with vowel points below the consonant. Typically, meteg is placed to the left of the vowel, although in some manuscripts and printed texts it is positioned to the right of the vowel. The difference in positioning is not known to have any semantic significance; nevertheless, some authors wish to retain the positioning found in source documents.
The alternate vowel-meteg ordering can be represented in terms of alternate ordering of characters in encoded representation. However, because of the fixed-position canonical combining classes to which meteg and vowel points are assigned, differences in ordering of such characters are not preserved under normalization. The combining grapheme joiner can be used within a vowel-meteg sequence to preserve an ordering distinction under normalization. For more information, see the description of U+034F COMBINING GRAPHEME JOINER in Section 23.2, Layout Controls.
For example, to display meteg to the left of (after, for a right-to-left script) the vowel point sheva, U+05B0 HEBREW POINT SHEVA, the sequence of meteg following sheva can be used: <U+05B0 HEBREW POINT SHEVA, U+05BD HEBREW POINT METEG>

Because these marks are canonically ordered, this sequence is preserved under normalization. Then, to display meteg to the right of the sheva, the sequence with meteg preceding sheva with an intervening CGJ can be used: <U+05BD HEBREW POINT METEG, U+034F COMBINING GRAPHEME JOINER, U+05B0 HEBREW POINT SHEVA>
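The effect of normalization on the two orderings can be observed directly. The sketch below assumes Python's `unicodedata` module; the base letter bet is an arbitrary choice for illustration, and the variable names are hypothetical.

```python
import unicodedata

BET = "\u05D1"    # HEBREW LETTER BET, an arbitrary base letter
SHEVA = "\u05B0"  # HEBREW POINT SHEVA, canonical combining class 10
METEG = "\u05BD"  # HEBREW POINT METEG, canonical combining class 22
CGJ = "\u034F"    # COMBINING GRAPHEME JOINER, canonical combining class 0

def nfc(text):
    """Normalize text to Normalization Form C."""
    return unicodedata.normalize("NFC", text)

left = BET + SHEVA + METEG          # canonical order: meteg after sheva
unprotected = BET + METEG + SHEVA   # meteg first, no CGJ
right = BET + METEG + CGJ + SHEVA   # meteg first, protected by CGJ

print(nfc(left) == left)            # canonical order is kept
print(nfc(unprotected) == left)     # reordered to the canonical order
print(nfc(right) == right)          # CGJ blocks the reordering
```

Without the CGJ, the meteg-first ordering collapses into the canonical sequence, so the positional distinction is lost after any normalizing process.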

A further complication arises for combinations of meteg with hataf vowels: U+05B1 HEBREW POINT HATAF SEGOL, U+05B2 HEBREW POINT HATAF PATAH, and U+05B3 HEBREW POINT HATAF QAMATS. These vowel points have two side-by-side components. Meteg can be placed to the left or the right of a hataf vowel, but it also is often placed between the two components of the hataf vowel. A three-way positioning distinction is needed for such cases.
The combining grapheme joiner can be used to preserve an ordering that places meteg to the right of a hataf vowel, as described for combinations of meteg with non-hataf vowels, such as sheva.
Placement of
meteg
between the components of a
hataf
vowel can be conceptualized as a ligature of the
hataf
vowel and a nominally positioned
meteg
. With this in mind, the ligation-control functionality of
U+200D
ZERO WIDTH JOINER
and
U+200C
ZERO WIDTH NON-JOINER
can be used as a mechanism to control the visual distinction between a nominally positioned
meteg
to the left of a
hataf
vowel versus the medially positioned
meteg
within the
hataf
vowel. That is,
zero width joiner
can be used to request explicitly a medially positioned
meteg
, and
zero width non-joiner
can be used to request explicitly a left-positioned
meteg
. Just as different font implementations may or may not display an “fi” ligature by default, different font implementations may or may not display
meteg
in a medial position when combined with
hataf
vowels by default. As a result, authors who want to ensure left-position versus medial-position display of
meteg
with
hataf
vowels across all font implementations may use joiner characters to distinguish these cases.
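Thus the following encoded representations can be used for different positioning of meteg with a hataf vowel, such as hataf patah:
left-positioned meteg: <U+05B2 HEBREW POINT HATAF PATAH, U+200C ZERO WIDTH NON-JOINER, U+05BD HEBREW POINT METEG>
medially positioned meteg: <U+05B2 HEBREW POINT HATAF PATAH, U+200D ZERO WIDTH JOINER, U+05BD HEBREW POINT METEG>
right-positioned meteg: <U+05BD HEBREW POINT METEG, U+034F COMBINING GRAPHEME JOINER, U+05B2 HEBREW POINT HATAF PATAH>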
In no case is use of ZWNJ, ZWJ, or CGJ
required
for representation of
meteg
. These recommendations are simply provided for interoperability in those instances where authors wish to preserve specific positional information regarding the layout of a
meteg
in text.
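The three-way distinction for meteg with a hataf vowel can be sketched as a small helper that builds the corresponding sequence. It follows the roles the text assigns to ZWNJ (left), ZWJ (medial), and CGJ (right); the function name `hataf_with_meteg` and its parameters are hypothetical.

```python
import unicodedata

HATAF_PATAH = "\u05B2"  # HEBREW POINT HATAF PATAH
METEG = "\u05BD"        # HEBREW POINT METEG
ZWNJ = "\u200C"         # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"          # ZERO WIDTH JOINER
CGJ = "\u034F"          # COMBINING GRAPHEME JOINER

def hataf_with_meteg(base, position, hataf=HATAF_PATAH):
    """Build base + hataf vowel + meteg with an explicit position request."""
    if position == "left":
        return base + hataf + ZWNJ + METEG
    if position == "medial":
        return base + hataf + ZWJ + METEG
    if position == "right":
        return base + METEG + CGJ + hataf
    raise ValueError(f"unknown position: {position!r}")

for position in ("left", "medial", "right"):
    sequence = hataf_with_meteg("\u05D0", position)  # alef as an example base
    stable = unicodedata.normalize("NFC", sequence) == sequence
    print(position, [f"U+{ord(c):04X}" for c in sequence], stable)
```

All three sequences are unchanged by normalization, which is what makes them usable for interchange; whether a font honors the requested position remains an implementation matter.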
Atnah Hafukh and Qamats Qatan.
In some older versions of Biblical text, a distinction is made between the accents
U+05A2
HEBREW ACCENT ATNAH HAFUKH
and
U+05AA
HEBREW ACCENT YERAH BEN YOMO
. Many editions from the last few centuries do not retain this distinction, using only
yerah ben yomo
, but some users in recent decades have begun to reintroduce this distinction. Similarly, a number of publishers of Biblical or other religious texts have introduced a typographic distinction for the vowel point
qamats
corresponding to two different readings. The original letterform used for one reading is referred to as
qamats
or
qamats gadol
; the new letterform for the other reading is
qamats qatan
. Not all users of Biblical Hebrew use
atnah hafukh
and
qamats qatan
. If the distinction between accents
atnah hafukh
and
yerah ben yomo
is not made, then only
U+05AA
HEBREW ACCENT YERAH BEN YOMO
is used. If the distinction between vowels
qamats gadol
and
qamats qatan
is not made, then only
U+05B8
HEBREW POINT QAMATS
is used. Implementations that support Hebrew accents and vowel points may not necessarily support the special-usage characters U+05A2 HEBREW ACCENT ATNAH HAFUKH and U+05C7 HEBREW POINT QAMATS QATAN.
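For a process that does not make these distinctions, the text identifies which character is used in each case. One possible way to fold the special-usage characters is sketched below; the table and function names are hypothetical, and the fold discards information, so it is suitable only where the distinction is deliberately not retained.

```python
# Map each special-usage character to the character used when the
# distinction is not made, as stated in the text.
SPECIAL_USAGE_FOLD = {
    0x05A2: 0x05AA,  # ATNAH HAFUKH -> YERAH BEN YOMO
    0x05C7: 0x05B8,  # QAMATS QATAN -> QAMATS
}

def fold_special_usage(text):
    """Replace atnah hafukh and qamats qatan with their general counterparts."""
    return text.translate(SPECIAL_USAGE_FOLD)

sample = "\u05DB\u05C7\u05DC"  # kaf + qamats qatan + lamed
print([f"U+{ord(c):04X}" for c in fold_special_usage(sample)])
```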
Holam Male and Holam Haser.
The vowel point
holam
represents the vowel phoneme /o/. The consonant letter
vav
represents the consonant phoneme /w/, but in some words is used to represent a vowel, /o/. When the point
holam
is used on
vav
, the combination usually represents the vowel /o/, but in a very small number of cases represents the consonant-vowel combination /wo/. A typographic distinction is made between these two in many versions of Biblical text. In most cases, in which
vav + holam
together represents the vowel /o/, the point
holam
is centered above the
vav
and referred to as
holam male
. In the less frequent cases, in which the
vav
represents the consonant /w/, some versions show the point
holam
positioned above left. This is referred to as
holam haser
. The character
U+05BA
HEBREW POINT HOLAM HASER FOR VAV
is intended for use as
holam haser
only in those cases where a distinction is needed. When the distinction is made, the character U+05B9 HEBREW POINT HOLAM is used to represent the point holam male on vav. U+05BA HEBREW POINT HOLAM HASER FOR VAV is intended for use only on vav; results of combining this character with other base characters are not defined. Not all users distinguish between the two forms of holam, and not all implementations can be assumed to support U+05BA HEBREW POINT HOLAM HASER FOR VAV.
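Because U+05BA is intended only for use on vav, a text-checking tool might flag occurrences on other base letters. The following is one possible sketch, assuming Python's `unicodedata` module; the function name and the simple base-tracking rule (skip marks and format characters) are assumptions, not a defined algorithm.

```python
import unicodedata

VAV = "\u05D5"          # HEBREW LETTER VAV
HOLAM_HASER = "\u05BA"  # HEBREW POINT HOLAM HASER FOR VAV

def misplaced_holam_haser(text):
    """Return the indices of U+05BA marks whose base character is not vav."""
    misplaced = []
    base = None
    for index, ch in enumerate(text):
        category = unicodedata.category(ch)
        if category.startswith("M"):
            # A combining mark applies to the most recent base character.
            if ch == HOLAM_HASER and base != VAV:
                misplaced.append(index)
        elif category != "Cf":
            # Any non-mark, non-format character starts a new base.
            base = ch
    return misplaced

print(misplaced_holam_haser("\u05D5\u05BA"))  # on vav
print(misplaced_holam_haser("\u05D0\u05BA"))  # on alef
```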
Puncta Extraordinaria.
In the Hebrew Bible, dots are written in various places above or below the base letters that are distinct from the vowel points and accents. These dots are referred to by scholars as
puncta extraordinaria
, and there are two kinds. The
upper punctum
, the more common of the two, has been encoded since Unicode 2.0 as
U+05C4
HEBREW MARK UPPER DOT
. The
lower punctum
is used in only one verse of the Bible, Psalm 27:13, and is encoded as
U+05C5
HEBREW MARK LOWER DOT
. The puncta generally differ in appearance from dots that occur above letters used to represent numbers; the number dots should be represented using U+0307 COMBINING DOT ABOVE and U+0308 COMBINING DIAERESIS.
Nun Hafukha.
The
nun hafukha
is a special symbol that appears to have been used for scribal annotations, although its exact functions are uncertain. It is used a total of nine times in the Hebrew Bible, although not all versions include it, and there are variations in the exact locations in which it is used. There is also variation in the glyph used: it often has the appearance of a rotated or reversed
nun
and is very often called
inverted nun
; it may also appear similar to a
half tet
or have some other form. In pointed texts, the nun hafukha carries a dot above it. This dot should be represented using U+0307 COMBINING DOT ABOVE.
Currency Symbol.
The
NEW SHEQEL SIGN
(U+20AA) is encoded in the currency block.
9.1.2 Alphabetic Presentation Forms: U+FB00–U+FB4F
The Hebrew characters in this block are chiefly of two types: variants of letters and marks encoded in the main Hebrew block, and precomposed combinations of a Hebrew letter or digraph with one or more vowels or pronunciation marks. This block contains all of the vocalized letters of the Yiddish alphabet. The
alef lamed
ligature and a Hebrew variant of the plus sign are included as well. The Hebrew plus sign variant,
U+FB29
HEBREW LETTER ALTERNATIVE PLUS SIGN
, is used more often in handwriting than in print, but it does occur in school textbooks. It is used by those who wish to avoid cross symbols, which can have religious and historical connotations.
U+FB20
HEBREW LETTER ALTERNATIVE AYIN
is an alternative form of
ayin
that may replace the basic form
U+05E2
HEBREW LETTER AYIN
when there is a diacritical mark below it. The basic form of
ayin
is often designed with a descender, which can interfere with a mark below the letter. U+FB20 is encoded for compatibility with implementations that substitute the alternative form in the character data, as opposed to using a substitute glyph at rendering time.
Use of Wide Letters.
Wide letterforms are used in handwriting and in print to achieve even margins. The wide-form letters in the Unicode Standard are those that are most commonly “stretched” in justification. If Hebrew text is to be rendered with even margins, justification should be left to the text-formatting software.
These alphabetic presentation forms are included for compatibility purposes. For the preferred encoding, see the Hebrew presentation forms, U+FB1D..U+FB4F.
For letterlike symbols, see U+2135..U+2138.
9.2 Arabic
9.2.1 Arabic: U+0600–U+06FF
The Arabic script is used for writing the Arabic language and has been extended to represent a number of other languages, such as Persian, Urdu, Pashto, Sindhi, and Uyghur, as well as many African languages. Urdu is often written with the ornate Nastaliq script variety. Some languages, such as Indonesian/Malay, Turkish, and Ingush, formerly used the Arabic script but now employ the Latin or Cyrillic scripts. Other languages, such as Kurdish, Azerbaijani, Kazakh, and Uzbek have competing Arabic and Latin or Cyrillic orthographies in different countries.
The Arabic script is cursive, even in its printed form (see Figure 9-1). As a result, the same letter may be written in different forms depending on how it joins with its neighbors. Vowels and various other marks may be written as combining marks called tashkil, which are applied to consonantal base letters. In normal writing, however, these marks are omitted.
Figure 9-1. Directionality and Cursive Connection
Directionality.
The Arabic script is written from right to left. Conformant implementations of Arabic script must use the Unicode Bidirectional Algorithm to reorder the memory representation for display (see Unicode Standard Annex #9, “Unicode Bidirectional Algorithm”).
Standards.
ISO/IEC 8859-6—Part 6.
Latin/Arabic Alphabet
. The Unicode Standard encodes the basic Arabic characters in the same relative positions as in ISO/IEC 8859-6. ISO/IEC 8859-6, in turn, is based on ECMA-114, which was based on ASMO 449.
Encoding Principles.
The basic set of Arabic letters is well defined. Each letter receives only one Unicode character value in the basic Arabic block, no matter how many different contextual appearances it may exhibit in text. Each Arabic letter in the Unicode Standard may be said to represent the inherent semantic identity of the letter. A word is spelled as a sequence of these letters. The representative glyph shown in the Unicode character chart for an Arabic letter is usually the form of the letter when standing by itself. It is simply used to distinguish and identify the character in the code charts and does not restrict the glyphs used to represent it. See “Arabic Cursive Joining,” “Arabic Ligatures,” and “Arabic Joining Groups” in the following text for an extensive discussion of how cursive joining and positional variants of Arabic letters are handled by the Unicode Standard.
The following principles guide the encoding of the various types of marks which are applied to the basic Arabic letter skeletons:
Ijam
: Diacritical marks applied to basic letter forms to derive new (usually consonant) letters for extended Arabic alphabets are not separately encoded as combining marks. Instead, each letter plus
ijam
combination is encoded as a separate, atomic character. These letter plus
ijam
characters are never given decompositions in the standard.
Ijam
generally take the form of one-, two-, three- or four-dot markings above or below the basic letter skeleton, although other diacritic forms occur in extensions of the Arabic script in Central and South Asia and in Africa. In discussions of Arabic in Unicode,
ijam
are often also referred to as
nukta
, because of their functional similarity to the nukta diacritical marks which occur in many Indic scripts.
Tashkil
: Marks functioning to indicate vocalization of text, as well as other types of phonetic guides to correct pronunciation, are separately encoded as combining marks. These include several subtypes:
harakat
(short vowel marks),
tanwin
(postnasalized or long vowel marks),
shaddah
(consonant gemination mark), and
sukun
(to mark lack of a following vowel). A basic Arabic letter plus any of these types of marks is never encoded as a separate, precomposed character, but must always be represented as a sequence of letter plus combining mark. Additional marks invented to indicate non-Arabic vowels, used in extensions of the Arabic script, are also encoded as separate combining marks.
Maddah
: The
maddah
is a particular case of a
harakat
mark which has exceptional treatment in the standard. In most modern languages using the Arabic script, it occurs only above
alef
, and in that combination represents the sound /ʔaa/. In Quranic Arabic,
maddah
occurs above
waw
or
yeh
to note vowel elongation. For this reason, and because of the shared use of maddah between Arabic and Syriac scripts, the precomposed combination U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE is encoded; however, the combining mark U+0653 ARABIC MADDAH ABOVE is also encoded. U+0622 is given a canonical decomposition to the sequence of
alef
followed by the
combining maddah
. Some historical non-Arabic orthographies have also used
maddah
as an
ijam
. U+0653 should be used to represent those texts.
Hamza
: The
hamza
may occur above or below other letters. Its treatment in the Unicode Standard is also exceptional and rather complex. The general principle is that when such a
hamza
is used to indicate an actual glottal stop, the /je/ sound used in Persian and Urdu for
ezafe
, or the short vowels /ə/ and /ɨ/ in Kashmiri, it should be represented with a separate combining mark, either
U+0654
ARABIC HAMZA ABOVE
or
U+0655
ARABIC HAMZA BELOW
. However, when the
hamza
mark is used as a diacritic to derive a separate letter as an extension of the Arabic script, then the basic letter skeleton plus the
hamza
mark is represented by a single, precomposed character. See “Combining Hamza Above” later in this section for discussion of the complications for particular characters.
Annotation Marks
: Quranic annotation marks are always encoded as separate combining marks.
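The difference between atomic letter plus ijam characters and the exceptional treatment of maddah is visible in the character database. The sketch below assumes Python's `unicodedata` module; the helper name `canonical_decomposition` and the choice of sample letters are illustrative.

```python
import unicodedata

def canonical_decomposition(ch):
    """Return the canonical decomposition of ch as a list of code points,
    or an empty list if it has none (compatibility mappings are ignored)."""
    mapping = unicodedata.decomposition(ch)
    if not mapping or mapping.startswith("<"):
        return []
    return [int(code, 16) for code in mapping.split()]

# Letters formed with ijam are atomic: no decomposition is given.
for ch in ("\u0628", "\u067E"):  # BEH, PEH
    print(f"U+{ord(ch):04X}", canonical_decomposition(ch))

# Alef with madda above decomposes to alef + combining maddah.
print([f"U+{cp:04X}" for cp in canonical_decomposition("\u0622")])
print(unicodedata.normalize("NFC", "\u0627\u0653") == "\u0622")
```

A consequence for implementers is that dots cannot be separated from or added to a letter skeleton through normalization, whereas alef plus maddah and U+0622 must be treated as equivalent.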
Punctuation.
Most punctuation marks used with the Arabic script are not given independent codes (that is, they are unified with Latin punctuation), except for the few cases where the mark has a significantly different appearance in Arabic, namely U+060C ARABIC COMMA, U+061B ARABIC SEMICOLON, U+061E ARABIC TRIPLE DOT PUNCTUATION MARK, U+061F ARABIC QUESTION MARK, and U+066A ARABIC PERCENT SIGN. Persian and some other languages use rounded forms of U+00AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK and U+00BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK. Sindhi uses U+2E41 REVERSED COMMA and U+204F REVERSED SEMICOLON. Some fonts have used glyph variants of U+060C ARABIC COMMA and U+061B ARABIC SEMICOLON, although this is not recommended.
For paired punctuation such as parentheses, the glyphs chosen to display, for example, U+0028 LEFT PARENTHESIS and U+0029 RIGHT PARENTHESIS will depend on the direction of the rendered text. See “Paired Punctuation” in Section 6.2, General Punctuation, for more discussion.
The Non-joiner and the Joiner.
The Unicode Standard provides two user-selectable formatting codes:
U+200C
ZERO WIDTH NON-JOINER
and
U+200D
ZERO WIDTH JOINER
. The use of a joiner adjacent to a suitable letter permits that letter to form a cursive connection without a visible neighbor. This provides a simple way to encode some special cases, such as exhibiting a connecting form in isolation, as shown in Figure 9-2.
Figure 9-2. Using a Joiner
These connecting forms commonly occur in some abbreviations such as the marker for hijri dates, which consists of an initial form of heh.
The use of a non-joiner between two letters prevents those letters from forming a cursive connection with each other when rendered, as shown in Figure 9-3.
Figure 9-3. Using a Non-joiner
Examples requiring the use of a non-joiner include the Persian plural suffix, some Persian proper names, and Ottoman Turkish vowels. This use of non-joiners is important for representation of text in such languages, and ignoring or removing them will result in text with a different meaning, or in meaningless text.
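The consequence for text processing is that cleanup routines must not treat joiners as disposable. The sketch below contrasts a naive cleanup with one that retains them; the Persian word used (ketab "book" plus the plural suffix ha), the two deletion tables, and the function name `clean` are illustrative assumptions.

```python
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"   # ZERO WIDTH JOINER

KETAB = "\u06A9\u062A\u0627\u0628"  # Persian word for "book"
HA = "\u0647\u0627"                 # Persian plural suffix
BOOKS = KETAB + ZWNJ + HA           # suffix written unjoined from the stem

HIJRI_MARKER = "\u0647" + ZWJ       # heh shown in its initial (connecting) form

# Characters deleted by each cleanup; values of None mean "delete".
NAIVE_CLEANUP = dict.fromkeys([0x200B, 0x200C, 0x200D, 0xFEFF])
SAFER_CLEANUP = dict.fromkeys([0x200B, 0xFEFF])

def clean(text, table):
    """Delete the characters listed in table from text."""
    return text.translate(table)

print(clean(BOOKS, NAIVE_CLEANUP) == BOOKS)   # the non-joiner is lost
print(clean(BOOKS, SAFER_CLEANUP) == BOOKS)   # the spelling is preserved
print(clean(HIJRI_MARKER, NAIVE_CLEANUP) == HIJRI_MARKER)
```

The same caution applies to search indexing and identifier comparison: whether joiners are ignored is a policy decision that should be made explicitly rather than as a side effect of whitespace stripping.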
Joiners and non-joiners may also occur in combinations. The effects of such combinations are shown in Figure 9-4. For further discussion of joiners and non-joiners, see Section 23.2, Layout Controls.
Figure 9-4. Combinations of Joiners and Non-joiners
Tashkil Nonspacing Marks.
Tashkil
are marks that indicate vowels or other modifications of consonant letters. In English, these marks are often referred to as “points.” They may also be called
harakat
, although technically,
harakat
refers to the subset of
tashkil
which denote short vowels. The code charts depict these
tashkil
in relation to a dotted circle, indicating that this character is intended to be applied via some process
to the character that precedes it
in the text stream (that is, the base character). General rules for applying nonspacing marks are given in
Section 7.9, Combining Marks
. The few marks that are placed after (to the left of) the base character are treated as ordinary spacing characters in the Unicode Standard. For more information about the canonical ordering of nonspacing marks, see Section 2.11, Combining Characters, and Section 3.11, Normalization Forms.
Use of the Arabic Mark Transient Reordering Algorithm (AMTRA) during text display is recommended to correctly and consistently render Arabic combining mark sequences. This algorithm provides results that match user expectations and assures that canonically equivalent sequences are rendered identically, independent of the order of the combining marks in the text stream. For more information, see Unicode Technical Report #53, “Unicode Arabic Mark Rendering.”
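The canonical equivalence that AMTRA is designed to respect can be illustrated with shadda and kasra. The sketch below does not implement AMTRA; it only shows, using Python's `unicodedata` module, that the two input orders are canonically equivalent. The base letter and the helper name are illustrative.

```python
import unicodedata

BEH = "\u0628"     # ARABIC LETTER BEH, an arbitrary base letter
KASRA = "\u0650"   # ARABIC KASRA
SHADDA = "\u0651"  # ARABIC SHADDA

def canonically_equivalent(a, b):
    """Return True if a and b have the same canonical decomposition."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

typed_shadda_first = BEH + SHADDA + KASRA
typed_kasra_first = BEH + KASRA + SHADDA

print(unicodedata.combining(KASRA), unicodedata.combining(SHADDA))
print(canonically_equivalent(typed_shadda_first, typed_kasra_first))
```

Since either order may reach a renderer, depending on whether the text was normalized, a display process that positioned marks purely by stream order would show the same text in two different ways.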
The placement and rendering of vowel and other marks in Arabic strongly depend on the typographical environment or even the typographical style. For example, in the Unicode code charts, the default position of U+0651 ◌ّ ARABIC SHADDA is with the glyph placed above the base character, whereas for U+0650 ◌ِ ARABIC KASRA the glyph is placed below the base character, as shown in the first example in Figure 9-5. However, computer fonts often follow an approach that originated in metal typesetting and combine the kasra with shadda above the text, as shown in the second example in Figure 9-5. U+064D ◌ٍ ARABIC KASRATAN also follows this behavior.
Figure 9-5. Placement of Harakat
The shapes of the various tashkil marks may also depend on the style of writing. For example, dammatan can be written in at least four different styles, as shown in Figure 9-6.
Figure 9-6. Dammatan Styles
U+064C ARABIC DAMMATAN can be rendered in any of those four shapes. U+08F1 ARABIC OPEN DAMMATAN is an alternative dammatan character for use in Quran orthographies which have two distinct forms of dammatan that convey a semantic difference.
Arabic-Indic Digits.
The names for the forms of decimal digits vary widely across different languages. The decimal numbering system originated in India (Devanagari ०१२३…) and was subsequently adopted in the Arabic world with a different appearance (Arabic ٠١٢٣…). The Europeans adopted decimal numbers from the Arabic world, although once again the forms of the digits changed greatly (European 0123…). The European forms were later adopted widely around the world and are used even in many Arabic-speaking countries in North Africa. In each case, the interpretation of decimal numbers remained the same. However, the forms of the digits changed to such a degree that they are no longer recognizably the same characters. Because of the origin of these characters, the European decimal numbers are widely known as “Arabic numerals” or “Hindi-Arabic numerals,” whereas the decimal numbers in use in the Arabic world are widely known there as “Hindi numbers.”
The Unicode Standard includes
Indic
digits (including forms used with different Indic scripts),
Arabic
digits (with forms used in most of the Arabic world), and
European
digits (now used internationally). Because of this decision, the traditional names could not be retained without confusion. In addition, there are two main variants of the Arabic digits: those used in Afghanistan, India, Iran, and Pakistan (here called
Eastern Arabic-Indic
) and those used in other parts of the Arabic world. In summary, the Unicode Standard uses the names shown in Table 9-1. A different set of number forms, called Rumi, was used in historical materials from Egypt to Spain, and is discussed in the subsection on “Rumi Numeral Symbols” in Section 22.3, Numerals.
Table 9-1. Arabic Digit Names
Name                    Code Points       Forms
European                U+0030..U+0039    0123456789
Arabic-Indic            U+0660..U+0669    ٠١٢٣٤٥٦٧٨٩
Eastern Arabic-Indic    U+06F0..U+06F9    ۰۱۲۳۴۵۶۷۸۹
Indic (Devanagari)      U+0966..U+096F    ०१२३४५६७८९
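Although the forms differ, each digit set in Table 9-1 carries the same numeric values. A minimal sketch, assuming Python's `unicodedata` module and the built-in `int` conversion (the dictionary and helper names are illustrative):

```python
import unicodedata

# First code point of each digit set listed in Table 9-1.
DIGIT_ZERO = {
    "European": 0x0030,
    "Arabic-Indic": 0x0660,
    "Eastern Arabic-Indic": 0x06F0,
    "Indic (Devanagari)": 0x0966,
}

def digit_string(zero):
    """Return the ten digits 0 through 9 of the set starting at code point zero."""
    return "".join(chr(zero + value) for value in range(10))

for name, zero in DIGIT_ZERO.items():
    digits = digit_string(zero)
    values = [unicodedata.digit(ch) for ch in digits]
    print(name, values, int(digits))
```

Every set yields the values 0 through 9 in order, and Python's `int` parses a string of any of these digits as the same number.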
There are distinct glyph forms for Eastern Arabic-Indic digits for the digits four, five, six, and seven. Furthermore, for four, six, and seven, there is substantial variation between locales using the Eastern Arabic-Indic digits.
Table 9-2
illustrates this variation with some example glyphs for digits in languages of Afghanistan, India, Iran, and Pakistan. While some usage of the Persian glyph for
U+06F7
EXTENDED ARABIC-INDIC DIGIT SEVEN
can be documented for Sindhi, the form shown in
Table 9-2
is predominant.
Table 9-2.
Glyph Variation in Eastern Arabic-Indic Digits
Code Point
Digit
Persian
Sindhi
Urdu and Kashmiri
U+06F4
U+06F5
U+06F6
U+06F7
The Unicode Standard provides a single, complete sequence of digits for Persian, Sindhi, and Urdu to account for the differences in appearance and directional treatment when rendering them. The Kashmiri digits have the same appearance as those for Urdu. (For a complete discussion of directional formatting of numbers in the Unicode Standard, see Unicode Standard Annex #9, “Unicode Bidirectional Algorithm.”)
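The relationship between the digit sets in Table 9-1 can be inspected programmatically. The following is a minimal sketch that uses only Python's standard `unicodedata` module; the helper name `digit_set_info` and the dictionary `DIGIT_SETS` are hypothetical and exist only for this illustration. It reports the decimal values and the Bidi_Class of each set, as recorded in the Unicode data bundled with the Python build in use.

```python
import unicodedata

# Ranges taken from Table 9-1; the labels are just dictionary keys.
DIGIT_SETS = {
    "European": range(0x0030, 0x003A),
    "Arabic-Indic": range(0x0660, 0x066A),
    "Eastern Arabic-Indic": range(0x06F0, 0x06FA),
    "Indic (Devanagari)": range(0x0966, 0x0970),
}

def digit_set_info(name):
    """Return (decimal values, set of Bidi_Class values) for one digit set."""
    chars = [chr(cp) for cp in DIGIT_SETS[name]]
    values = [unicodedata.decimal(ch) for ch in chars]
    bidi = {unicodedata.bidirectional(ch) for ch in chars}
    return values, bidi

for name in DIGIT_SETS:
    values, bidi = digit_set_info(name)
    print(f"{name}: values={values} bidi={sorted(bidi)}")

# Python's int() accepts any characters that carry a decimal digit value.
print(int("\u0661\u0662\u0663"), int("\u06F1\u06F2\u06F3"))
```

Every set maps to the same values 0 through 9, which reflects the statement that the interpretation of decimal numbers stayed the same while the forms changed. The Arabic-Indic digits carry the Bidi_Class Arabic_Number (AN), whereas the European and Eastern Arabic-Indic digits carry European_Number (EN), which is one concrete aspect of the difference in directional treatment.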
Extended Arabic Letters.
Arabic script is used to write major languages, such as Persian and Urdu, but it has also been used to transcribe some lesser-used languages, such as Baluchi and Lahnda, which have little tradition of printed typography. As a result, the Unicode Standard encodes multiple forms of some Extended Arabic letters because the character forms and usages are not well documented for a number of languages. For additional extended Arabic letters, see the Arabic Supplement block, U+0750..U+077F, the Arabic Extended-A block, U+08A0..U+08FF, and the Arabic Extended-B block, U+0870..U+089F.
Quranic Annotation Signs.
These characters are used in the Quran to mark pronunciation and other annotation of the text. Several additional Quranic annotation signs are encoded in the Arabic Extended-A block, U+08A0..U+08FF, and the Arabic Extended-B block, U+0870..U+089F. The alternate forms of some of the marks are not merely decorative; they are used to show variations in pronunciation (
U+08F0
ARABIC OPEN FATHATAN
), or can indicate various pause points (
U+0615
ARABIC SMALL HIGH TAH
), extended pauses, mandatory pauses, or places where a breath should not occur. They are required to guide the reader in reciting the text correctly. Some marks, such as
U+08D4
ARABIC SMALL HIGH WORD AR-RUB
, may be positioned above
end of ayah.
Additional Vowel Marks.
When the Arabic script is adopted as the writing system for a language other than Arabic, it is often necessary to represent vowel sounds or distinctions not made in Arabic. In some cases, conventions such as the addition of small dots above and/or below the standard Arabic
fatha,
damma
, and
kasra
signs have been used.
Classical Arabic has only three canonical vowels (/a/, /i/, /u/), whereas languages such as Urdu and Persian include other contrasting vowels such as /o/ and /e/. For this reason, it is imperative that speakers of these languages be able to show the difference between /e/ and /i/ (
U+0656
ARABIC SUBSCRIPT ALEF
), and between /o/ and /u/ (
U+0657
ARABIC INVERTED DAMMA
). At the same time, the use of these two diacritics in Arabic is redundant, merely emphasizing that the underlying vowel is long.
U+065F
ARABIC WAVY HAMZA BELOW
is an additional vowel mark used in Kashmiri. It can appear in combination with many characters. The particular combination of an
alef
with this vowel mark should be written with the sequence <
U+0627
ARABIC LETTER ALEF,
U+065F
ARABIC WAVY HAMZA BELOW
>, rather than with the character
U+0673
ARABIC LETTER ALEF WITH WAVY HAMZA BELOW
, which has been deprecated and which is not canonically equivalent. However, implementations should be aware that there may be existing legacy Kashmiri data in which U+0673 occurs.
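Because the deprecated character and the recommended sequence are not canonically equivalent, normalization cannot be relied on to unify them. The following Python sketch illustrates this with the standard `unicodedata` module and shows one possible way to migrate legacy data; the function name `replace_deprecated_alef` is hypothetical, and whether such a rewrite is appropriate depends on the application.

```python
import unicodedata

ALEF = "\u0627"                # ARABIC LETTER ALEF
WAVY_HAMZA_BELOW = "\u065F"    # ARABIC WAVY HAMZA BELOW
DEPRECATED_ALEF = "\u0673"     # ARABIC LETTER ALEF WITH WAVY HAMZA BELOW

def replace_deprecated_alef(text):
    """Rewrite legacy U+0673 as the recommended <U+0627, U+065F> sequence."""
    return text.replace(DEPRECATED_ALEF, ALEF + WAVY_HAMZA_BELOW)

# U+0673 has no decomposition mapping, so normalization leaves both
# spellings untouched and never converts one into the other.
assert unicodedata.decomposition(DEPRECATED_ALEF) == ""
sequence = ALEF + WAVY_HAMZA_BELOW
assert unicodedata.normalize("NFC", sequence) == sequence
assert unicodedata.normalize("NFD", DEPRECATED_ALEF) == DEPRECATED_ALEF

legacy = "\u0673\u0628"
print([f"U+{ord(c):04X}" for c in replace_deprecated_alef(legacy)])
```

A search or comparison routine for Kashmiri text that needs to match both spellings would therefore have to apply an explicit mapping of this kind in addition to normalization.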
Honorifics.
Marks known as honorifics represent phrases expressing the status of a person and are in widespread use in the Arabic-script world. Most have a specifically religious meaning. In effect, these marks are combining characters at the word level, rather than being associated with a single base character. The normal practice is that such marks be used at the end of words. In manuscripts, depending on the letter shapes present in the name and the calligraphic style in use, the honorific mark may appear over a letter in the middle of the word. If an exact representation of a manuscript is desired, the honorific mark may be represented as following that letter. The normalization algorithm does not move such word-level combining characters to the end of the word.
Spacing honorifics are also in wide use both in the Arabic script and among Muslim communities writing in other scripts. See “Word Ligatures” under Arabic Presentation Forms-A later in this section for more information.
Arabic Mathematical Symbols.
A few Arabic mathematical symbols are encoded in this block. The Arabic mathematical radix signs,
U+0606
ARABIC-INDIC CUBE ROOT
and
U+0607
ARABIC-INDIC FOURTH ROOT
, differ from simple mirrored versions of
U+221B
CUBE ROOT
and
U+221C
FOURTH ROOT
, in that the digit portions of the symbols are written with Arabic-Indic digits and are not mirrored.
U+0608
ARABIC RAY
is a letterlike symbol used in Arabic mathematics.
Date Separator.
U+060D
ARABIC DATE SEPARATOR
is used in Pakistan and India between the numeric date and the month name when writing out a date. This sign is distinct from
U+002F
SOLIDUS
, which is used, for example, as a separator in currency amounts.
Full Stop.
U+061E
ARABIC TRIPLE DOT PUNCTUATION MARK
is encoded for traditional orthographic practice using the Arabic script to write African languages such as Hausa, Wolof, Fulani, and Mandinka. These languages use
ARABIC TRIPLE DOT PUNCTUATION MARK
as a full stop.
Currency Symbols.
U+060B
AFGHANI SIGN
is a currency symbol used in Afghanistan. The symbol is derived from an abbreviation of the name of the currency, which has become a symbol in its own right.
U+FDFC
RIAL SIGN
is a currency symbol used in Iran. Unlike the
AFGHANI SIGN,
U+FDFC
RIAL SIGN
is considered a compatibility character, encoded for compatibility with Iranian standards. Ordinarily in Persian “rial” is simply spelled out as the sequence of letters, <0631, 06CC, 0627, 0644>.
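The compatibility status of the rial sign is visible in its decomposition mapping. The sketch below, which assumes only Python's standard `unicodedata` module, shows that canonical normalization preserves the sign while compatibility normalization replaces it with the spelled-out letter sequence given above. The constant names are introduced here for illustration.

```python
import unicodedata

RIAL_SIGN = "\uFDFC"
SPELLED_OUT = "\u0631\u06CC\u0627\u0644"  # <0631, 06CC, 0627, 0644>

# The raw decomposition field carries a compatibility tag followed by the letters.
print(unicodedata.decomposition(RIAL_SIGN))

# Canonical normalization keeps the sign; compatibility normalization spells it out.
assert unicodedata.normalize("NFC", RIAL_SIGN) == RIAL_SIGN
assert unicodedata.normalize("NFKC", RIAL_SIGN) == SPELLED_OUT

# AFGHANI SIGN has no decomposition, so it survives NFKC unchanged.
assert unicodedata.normalize("NFKC", "\u060B") == "\u060B"
```

Applications that normalize currency data with NFKC or NFKD should expect the rial sign to disappear in favor of ordinary letters, whereas the afghani sign is unaffected.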
Signs Spanning Numbers.
Several other special signs are written in association with numbers in the Arabic script. All of these signs can span multiple-digit numbers, rather than just a single digit. They are not formally considered
combining marks
in the sense used by the Unicode Standard, although they clearly interact graphically with their associated sequence of digits. In the text representation they
precede
the sequence of digits that they span, rather than follow a base character, as would be the case for a combining mark. Their General_Category value is Cf (format character). Unlike most other format characters, however, they should be rendered with a visible glyph, even in circumstances where no suitable digit or sequence of digits follows them in logical order. The characters have the Bidi_Class value of Arabic_Number to make them appear in the same run as the numbers following them.
A few similar signs spanning numbers or letters are associated with scripts other than Arabic. See the discussion of
U+070F
SYRIAC ABBREVIATION MARK
in
Section 9.3, Syriac
, and the discussion of
U+110BD
KAITHI NUMBER SIGN
and
U+110CD
KAITHI NUMBER SIGN ABOVE
in
Section 15.2, Kaithi
. All of these prefixed format controls, including the non-Arabic ones, are given the property value Prepended_Concatenation_Mark = True, to identify them as a class. They also have special behavior in text segmentation. (See Unicode Standard Annex #29, “Unicode Text Segmentation.”)
U+0600
ARABIC NUMBER SIGN
signals the beginning of a number. It is followed by a sequence of one or more Arabic digits and is rendered below the digits of the number. The length of its rendered display may vary with the number of digits. The sequence terminates with the occurrence of any non-digit character.
U+0601
ARABIC SIGN SANAH
indicates a year (that is, as part of a date). This sign is also rendered below the digits of the number it precedes. Its appearance is a vestigial form of the Arabic word for
year
, /sanatu/ (
seen noon teh-marbuta
), but it is now a sign in its own right and is widely used to mark a numeric year even in non-Arabic languages where the Arabic word would not be known.
U+0602
ARABIC FOOTNOTE MARKER
is a specialized variant of number sign. Its use indicates that the number so marked represents a footnote number in the text.
U+0603
ARABIC SIGN SAFHA
is another specialized variant of number sign. It marks a page number.
U+0604
ARABIC SIGN SAMVAT
is a specialized variant of date sign used specifically to write dates of the Śaka era. The shape of the glyph is a stylized abbreviation of the word
samvat
, the name of this calendar. It is seen in the Urdu orthography, where it contrasts with conventions used to display dates from the Gregorian or Islamic calendars.
U+0605
ARABIC NUMBER MARK ABOVE
is a specialized variant of number sign. It is used in Arabic text with Coptic numbers, such as in early astronomical tables. Unlike the other Arabic number signs, it extends across the top of the sequence of digits, and is used with Coptic digits, rather than with Arabic digits. (See also the discussion of supralineation and the numerical use of letters in
Section 7.3, Coptic
.)
U+0890
ARABIC POUND MARK ABOVE
and
U+0891
ARABIC PIASTRE MARK ABOVE
are Egyptian currency signs which extend across the top of a sequence of digits. The shape of the
pound mark
is usually based on a dotless head of
jeem
above the amount. It is occasionally based on a dotted
jeem
instead. The shape of the
piastre mark
is written using a mirrored version of the pound mark. They are used in advertising and price tags, as well as in handwritten texts.
U+06DD
ARABIC END OF AYAH
is another sign used to span numbers, but its rendering is somewhat different. Rather than extending below the following digits, this sign
encloses
the digit sequence. This sign is used conventionally to indicate numbered verses in the Quran.
U+06DE
ARABIC START OF RUB EL HIZB
is an in-text marker. In printed Qurans, it appears in running text by itself, usually adjacent to an
end of ayah
marker. Although the original symbol for
rub el hizb
consists of octagonal overlaid squares, the actual glyph seen in various editions can be more ornate, as shown in the Unicode code charts. The
rub el hizb
indicates the boundaries of the parts of sections of the
hizb
. It can appear at the start or end of a section and is displayed without interaction with adjacent text.
U+08E2
ARABIC DISPUTED END OF AYAH
is a specialized variant of the
end of ayah
. It is seen occasionally in Quranic text to mark a verse for which there is scholarly disagreement about the location of the end of the verse.
Because digit choice is dependent on regional use, these marks may be used with European digits (U+0030..U+0039), Arabic-Indic digits (U+0660..U+0669) or with extended Arabic-Indic digits (U+06F0..U+06F9). Implementations should support up to three or four digits.
Figure 9-7
shows examples of how these are formatted with varying numbers of digits. In these examples, each instance is separated by an arbitrary letter
hamza
, to help visualize how the formatted sequences interact with the Arabic baseline.
Figure 9-7.
Arabic Signs Spanning Numbers
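The way these prefixed signs combine with a following digit run can be sketched in code. The Python example below is a simplified illustration, not a rendering implementation: the set `ARABIC_SPANNING_SIGNS` is hardcoded from the Arabic characters named in this section because Python's `unicodedata` module does not expose the Prepended_Concatenation_Mark property, and the function name `spanned_digits` is hypothetical. It collects, for each sign, the digits that follow it in logical order and stops at the first non-digit character.

```python
import unicodedata

# Hardcoded from the characters discussed in this section (an assumption of
# this sketch; PropList.txt holds the authoritative property data).
ARABIC_SPANNING_SIGNS = {
    0x0600, 0x0601, 0x0602, 0x0603, 0x0604, 0x0605,  # number signs
    0x0890, 0x0891,                                   # pound and piastre marks
    0x06DD, 0x08E2,                                   # end of ayah signs
}

def spanned_digits(text):
    """Yield (sign, digits) for each prefixed sign and the digit run after it."""
    i = 0
    while i < len(text):
        if ord(text[i]) in ARABIC_SPANNING_SIGNS:
            j = i + 1
            # The run ends at the first character that is not a decimal digit.
            while j < len(text) and unicodedata.category(text[j]) == "Nd":
                j += 1
            yield text[i], text[i + 1:j]
            i = j
        else:
            i += 1

sample = "\u06DD\u0661\u0662\u0663 \u0601\u06F1\u06F4\u06F0\u06F3x"
for sign, digits in spanned_digits(sample):
    print(f"U+{ord(sign):04X}",
          unicodedata.category(sign),
          unicodedata.bidirectional(sign),
          [unicodedata.decimal(d) for d in digits])
```

Because the category test accepts any decimal digit, the sketch works equally with European, Arabic-Indic, and Eastern Arabic-Indic digits. A sign followed by no digits yields an empty run, which corresponds to the case where the sign should still be rendered with a visible glyph.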
Poetic Verse Sign.
U+060E
ARABIC POETIC VERSE SIGN
is a special symbol often used to mark the beginning of a poetic verse. Although it is similar to
U+0602
ARABIC FOOTNOTE MARKER
in appearance, the poetic sign is simply a symbol. In contrast, the footnote marker is a format control character that has complex rendering in conjunction with following digits.
U+060F
ARABIC SIGN MISRA
is another symbol used in poetry.
9.2.2 Arabic Cursive Joining
Minimum Rendering Requirements.
A rendering or display process must convert between the logical order in which characters are placed in the backing store and the visual (or physical) order required by the display device. See Unicode Standard Annex #9, “Unicode Bidirectional Algorithm,” for a description of the conversion between logical and visual orders.
The cursive nature of the Arabic script imposes special requirements on display or rendering processes that are not typically found in Latin script-based systems. At a minimum, a display process must select an appropriate glyph to depict each Arabic letter according to its immediate
joining
context; furthermore, in almost every font style, it must substitute certain ligature glyphs for sequences of Arabic characters. The remainder of this section specifies a minimum set of rules that provide legible Arabic joining and ligature substitution behavior.
Joining Types.
Each Arabic letter must be depicted by one of a number of possible contextual glyph forms. The appropriate form is determined on the basis of the cursive joining behavior of that character as it interacts with the cursive joining behavior of adjacent characters. In the Unicode Standard, such cursive joining behavior is formally described in terms of values of a character property called Joining_Type. Each Arabic character falls into one of the types shown in
Table 9-3
. (See ArabicShaping.txt in the Unicode Character Database for a complete list.)
Table 9-3. Primary Arabic Joining Types
Joining_Type | Examples and Comments
Right_Joining (R) | ALEF, DAL, THAL, REH, ZAIN ...
Left_Joining (L) | None (in Arabic)
Dual_Joining (D) | BEH, TEH, THEH, JEEM ...
Join_Causing (C) | U+200D ZERO WIDTH JOINER and U+0640 ARABIC TATWEEL. These characters are distinguished from the dual-joining characters in that they do not change shape themselves.
Non_Joining (U) | U+200C ZERO WIDTH NON-JOINER and all spacing characters, except those explicitly mentioned as being one of the other joining types, are non-joining. These include U+0621 ARABIC LETTER HAMZA, U+0674 ARABIC LETTER HIGH HAMZA, spaces, digits, punctuation, non-Arabic letters, and so on. Also, U+0600 ARABIC NUMBER SIGN..U+0605 ARABIC NUMBER MARK ABOVE and U+06DD ARABIC END OF AYAH.
Transparent (T) | All nonspacing marks (General Category Mn or Me) and most format control characters (General Category Cf) are transparent to cursive joining. These include U+064B ARABIC FATHATAN and other Arabic tashkil, U+0655 ARABIC HAMZA BELOW, U+0670 ARABIC LETTER SUPERSCRIPT ALEF, combining Quranic annotation signs, and nonspacing marks from other scripts. Also U+070F SYRIAC ABBREVIATION MARK.
In
Table 9-3,
right
and
left
refer to visual order, so a Joining_Type value of Right_Joining indicates that a character cursively joins to a character displayed to its right in visual order. (For a discussion of the meaning of Joining_Type values in the context of a vertically rendered script, see “Cursive Joining” in
Section 14.4, Phags-pa
.) The Arabic characters with Joining_Type = Right_Joining are exemplified in more detail in
Table 9-8
, and those with Joining_Type = Dual_Joining are shown in
Table 9-7
. When characters do not join or cause joining (such as
DAMMA
), they are classified as transparent.
The Phags-pa and Manichaean scripts have a few Left_Joining characters, which are otherwise unattested in the Arabic and Syriac scripts. See
Section 10.5, Manichaean
. For a discussion of the meaning of Joining_Type values in the context of a vertically rendered script, see “Cursive Joining” in
Section 14.4, Phags-pa.
Table 9-4
defines derived superclasses of the primary Arabic joining types; those derived types are used in the cursive joining rules. In this table,
right
and
left
refer to visual order.
Table 9-4. Derived Arabic Joining Types
Description | Derivation
Right join-causing | Superset of dual-joining, left-joining, and join-causing
Left join-causing | Superset of dual-joining, right-joining, and join-causing
Joining Rules.
The following rules describe the joining behavior of Arabic letters in terms of their display (visual) order. In other words, the positions of letterforms in the included examples are presented as they would appear on the screen
after
the Bidirectional Algorithm has reordered the characters of a line of text.
An implementation may choose to restate the following rules according to logical order so as to apply them
before
the Bidirectional Algorithm’s reordering phase. In this case, the words
right
and
left
as used in this section would become
preceding
and
following.
In the following rules, if X refers to a character, then various glyph types representing that character are referred to as shown in
Table 9-5.
Table 9-5. Arabic Glyph Types
Glyph Type | Description
X_n | Non-joining glyph form that does not join on either side
X_r | Right-joining glyph form (both right-joining and dual-joining characters may employ this form)
X_l | Left-joining glyph form (both left-joining and dual-joining characters may employ this form)
X_m | Dual-joining (medial) glyph form that joins on both left and right (only dual-joining characters employ this form)
R1
Transparent characters do not affect the joining behavior of base (spacing) characters. For example:
MEEM_n + SHADDA_n + LAM_n → MEEM_r + SHADDA_n + LAM_l
R2
A right-joining character X that has a right join-causing character on the right will adopt the form X_r. For example:
ALEF_n + TATWEEL_n → ALEF_r + TATWEEL_n
R3
A left-joining character X that has a left join-causing character on the left will adopt the form X_l.
R4
A dual-joining character X that has a right join-causing character on the right and a left join-causing character on the left will adopt the form X_m. For example:
TATWEEL_n + MEEM_n + TATWEEL_n → TATWEEL_n + MEEM_m + TATWEEL_n
R5
A dual-joining character X that has a right join-causing character on the right and no left join-causing character on the left will adopt the form X_r. For example:
MEEM_n + TATWEEL_n → MEEM_r + TATWEEL_n
R6
A dual-joining character X that has a left join-causing character on the left and no right join-causing character on the right will adopt the form X_l. For example:
TATWEEL_n + MEEM_n → TATWEEL_n + MEEM_l
R7
If none of the preceding rules applies to a character X, then it will adopt the non-joining form X_n.
The cursive joining behavior described here for the Arabic script is also generally applicable to other cursive scripts such as Syriac. Specific circumstances may modify the application of the rules just described.
As noted earlier in this section, the
ZERO WIDTH NON-JOINER
may be used to prevent joining, as in the Persian plural suffix or Ottoman Turkish vowels.
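The joining rules R1 through R7 can be restated in logical order, as the text permits, and expressed compactly in code. The following Python sketch is an illustration under stated assumptions rather than a complete shaping engine: the `JOINING_TYPE` table covers only a handful of characters chosen for the examples (a real implementation would load ArabicShaping.txt), unlisted nonspacing marks are treated as transparent and everything else as non-joining, and the function names `joining_type`, `neighbor`, and `glyph_forms` are hypothetical. The result letters n, r, l, and m stand for the non-joining, right-joining, left-joining, and medial glyph forms of Table 9-5.

```python
import unicodedata

# Illustrative Joining_Type values for a few characters only.
JOINING_TYPE = {
    "\u0627": "R",  # ALEF
    "\u0628": "D",  # BEH
    "\u0644": "D",  # LAM
    "\u0645": "D",  # MEEM
    "\u0640": "C",  # TATWEEL
    "\u200D": "C",  # ZERO WIDTH JOINER
    "\u200C": "U",  # ZERO WIDTH NON-JOINER
}

def joining_type(ch):
    if ch in JOINING_TYPE:
        return JOINING_TYPE[ch]
    # Nonspacing and enclosing marks are transparent; anything else not
    # listed above is treated as non-joining in this sketch.
    return "T" if unicodedata.category(ch) in ("Mn", "Me") else "U"

def neighbor(types, i, step):
    """Joining type of the nearest non-transparent character (rule R1)."""
    i += step
    while 0 <= i < len(types):
        if types[i] != "T":
            return types[i]
        i += step
    return "U"  # the edge of the text behaves like a non-joining character

def glyph_forms(text):
    """Return one of 'n', 'r', 'l', 'm' per character, in logical order."""
    types = [joining_type(ch) for ch in text]
    forms = []
    for i, t in enumerate(types):
        # In logical order, "on the right" becomes "preceding" and
        # "on the left" becomes "following".
        joins_right = t in ("R", "D") and neighbor(types, i, -1) in ("D", "L", "C")
        joins_left = t in ("L", "D") and neighbor(types, i, +1) in ("D", "R", "C")
        if joins_right and joins_left:
            forms.append("m")   # R4
        elif joins_right:
            forms.append("r")   # R2, R5
        elif joins_left:
            forms.append("l")   # R3, R6
        else:
            forms.append("n")   # R7
    return forms

# Logical order <LAM, SHADDA, MEEM>, the R1 example read from right to left.
print(glyph_forms("\u0644\u0651\u0645"))
# ZERO WIDTH NON-JOINER between two BEH letters prevents joining.
print(glyph_forms("\u0628\u200C\u0628"))
```

Working in logical order lets the same function run before bidirectional reordering. Join-causing and transparent characters always report the form n here, since they do not change shape themselves.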
9.2.3 Arabic Ligatures
Ligature Classes.
The
lam-alef
type of ligatures are extremely common in the Arabic script. These ligatures occur in almost all font designs, except for a few modern styles. When supported by the style of the font,
lam-alef
ligatures are considered obligatory. This means that all character sequences rendered in that font, which match the rules specified in the following discussion, must form these ligatures.
In the majority of styles used for writing the Arabic script, including the predominant
Hafs
style, the
lam
is the right part of the ligature, and the
alef
is the left part of the ligature. However, in the
al-Dani
style of writing Arabic script, common in northern Africa, the practice is reversed: the
alef
is the right part and
lam
is the left part. This difference in the styles of writing Arabic is important for font developers to understand. Logical order would still be used in both styles: this means that in the
al-Dani
style of
lam-alef,
marks are positioned differently on the
lam-alef
ligature. See
Figure 9-8
for a comparison of rendering the sequence <
lam,
sukun,
alef-with-hamza-above,
damma
> in the two styles mentioned.
Figure 9-8.
Lam-alef
with Marks
The important thing to note in this figure is the placement of the marks over the parts of the ligature. The exact shapes of the ligature and the marks depend on the style in use.
In general, the
lam-alef
ligature will be formed by any character in the LAM joining group followed by any character from the ALEF joining group. Many
lam-alef
combinations with the specialized
alef
additions in the range U+0870..U+0882 are not attested in actual practice. In such cases, the
lam-alef
ligature should not be considered obligatory.
Many other Arabic ligatures are discretionary. Their use depends on the font design.
For the purpose of describing the obligatory Arabic ligatures, certain characters are subject to the same requirements as
lam
and
alef
. As described in the text that follows, these fall into the joining groups
LAM
and
ALEF
, respectively. Examples of these include
U+0644
ARABIC LETTER LAM,
U+06B5
ARABIC LETTER LAM WITH SMALL V,
U+0623
ARABIC LETTER ALEF WITH HAMZA ABOVE
, and
U+0622
ARABIC LETTER ALEF WITH MADDA ABOVE
. The complete list is available in ArabicShaping.txt in the Unicode Character Database.
Ligature Rules.
The following rules describe the formation of obligatory ligatures. They are applied after the preceding joining rules. As for the joining rules just discussed, the following rules describe ligature behavior of Arabic letters in terms of their display (visual) order.
In the ligature rules, if X and Y refer to characters, then various glyph types representing combinations of these characters are referred to as shown in
Table 9-6.
Table 9-6. Arabic Ligature Notation
Symbol | Description
(X-Y)_n | Nominal ligature glyph form representing a combination of an X_r (right-joining) form and a Y_l (left-joining) form
(X-Y)_r | Right-joining ligature glyph form representing a combination of an X_r (right-joining) form and a Y_m (medial) form
L1
Transparent characters do not affect the ligating behavior of base (nontransparent) characters. For example:
ALEF_r + FATHA_n + LAM_l → (LAM-ALEF)_n + FATHA_n
L2
Any sequence with ALEF_r on the left and LAM_m on the right will form the ligature (LAM-ALEF)_r. For example:
ALEF_r + LAM_m → (LAM-ALEF)_r
L3
Any sequence with ALEF_r on the left and LAM_l on the right will form the ligature (LAM-ALEF)_n. For example:
ALEF_r + LAM_l → (LAM-ALEF)_n
Optional Features.
Many other ligatures and contextual forms are optional, depending on the font and application. Some of these presentation forms are encoded in the ranges U+FB50..U+FDFF and U+FE70..U+FEFE. However, these forms should
not
be used in general interchange. Moreover, it is not expected that every Arabic font will contain all of these forms, nor that these forms will include all presentation forms used by every font.
More sophisticated rendering systems will use additional shaping and placement. For example, contextual placement of the nonspacing vowels such as
fatha
will provide better appearance. The justification of Arabic tends to stretch words instead of adding width to spaces. Basic stretching can be done by inserting
tatweel
between characters shaped by rules R2, R4, R5, R6, L2, and L3; the best places for inserting tatweel will depend on the font and rendering software. More powerful systems will choose different shapes for characters such as
kaf
to fill the space in justification.
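The obligatory lam-alef ligature rules described in this section can be sketched as a scan over text in logical order. The Python example below is a simplified illustration: `LAM_GROUP` and `ALEF_GROUP` contain only example members of the two joining groups (the complete groups are in ArabicShaping.txt), transparency is approximated by General Category, and the function names are hypothetical. It finds pairs of positions that would ligate, allowing transparent marks between the two letters as in rule L1.

```python
import unicodedata

# Example members of the LAM and ALEF joining groups; not the complete groups.
LAM_GROUP = {"\u0644", "\u06B5"}
ALEF_GROUP = {"\u0627", "\u0622", "\u0623"}

def is_transparent(ch):
    # Simplification: treat nonspacing and enclosing marks as transparent.
    return unicodedata.category(ch) in ("Mn", "Me")

def find_lam_alef(text):
    """Return (lam_index, alef_index) pairs, in logical order, that would ligate."""
    pairs = []
    for i, ch in enumerate(text):
        if ch not in LAM_GROUP:
            continue
        j = i + 1
        # Rule L1: transparent marks between the letters do not block ligation.
        while j < len(text) and is_transparent(text[j]):
            j += 1
        if j < len(text) and text[j] in ALEF_GROUP:
            pairs.append((i, j))
    return pairs

# <lam, sukun, alef-with-hamza-above, damma>, the sequence used in Figure 9-8.
print(find_lam_alef("\u0644\u0652\u0623\u064F"))
```

Whether a detected pair takes the right-joining or the nominal ligature form (rules L2 and L3) depends on whether the lam also joins to the character before it, which a joining pass would determine first. A ZERO WIDTH NON-JOINER between the letters is not transparent, so no pair is reported in that case.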
9.2.4 Arabic Joining Groups
The Arabic characters with the property values Joining_Type = Dual_Joining and Joining_Type = Right_Joining can each be subdivided into shaping groups, based on the behavior of their letter skeletons when shaped in context. The Unicode character property that specifies these groups is called Joining_Group.
The Joining_Type and Joining_Group values for all Arabic characters are explicitly specified in ArabicShaping.txt in the Unicode Character Database. For convenience in reference, the Joining_Type values are extracted and listed in DerivedJoiningType.txt and the Joining_Group values are extracted and listed in DerivedJoiningGroup.txt.
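A program that needs these property values typically reads them from the data file. The following Python sketch shows one way to parse lines in the semicolon-separated, four-field layout used by ArabicShaping.txt (code point, schematic name, joining type, joining group). The function name `parse_arabic_shaping` is hypothetical, and the embedded `SAMPLE` lines are written here for illustration in that layout; they should be checked against the actual file rather than treated as a copy of it.

```python
def parse_arabic_shaping(lines):
    """Map code point -> (joining_type, joining_group) from data-file lines."""
    table = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        code, _name, jtype, jgroup = [field.strip() for field in line.split(";")]
        table[int(code, 16)] = (jtype, jgroup)
    return table

# Illustrative sample lines in the file's four-field layout.
SAMPLE = """\
# Code point; schematic name; Joining_Type; Joining_Group
0627; ALEF; R; ALEF
0628; BEH; D; BEH
0640; TATWEEL; C; No_Joining_Group
06CC; FARSI YEH; D; FARSI YEH
0777; FARSI YEH WITH DIGIT FOUR BELOW; D; YEH
""".splitlines()

shaping = parse_arabic_shaping(SAMPLE)
print(shaping[0x0777])
```

Looking values up from the file, rather than inferring them from character names, avoids mistakes in cases where the joining group does not match the name.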
Dual-Joining.
Table 9-7
exemplifies dual-joining Arabic characters and illustrates the forms taken by the letter skeletons and their
ijam
marks in context. Dual-joining characters have four distinct forms, for isolated, final, medial, and initial contexts, respectively. The name for each joining group is based on the name of a representative letter that is used to illustrate the shaping behavior. All other Arabic characters are merely variations on these basic shapes, with diacritics added, removed, moved, or replaced. For instance, the
BEH
joining group applies not only to
U+0628
ARABIC LETTER BEH
, which has a single dot below the skeleton, but also to
U+062A
ARABIC LETTER TEH
, which has two dots above the skeleton, and to
U+062B
ARABIC LETTER THEH
, which has three dots above the skeleton, as well as to the Persian and Urdu letter
U+067E
ARABIC LETTER PEH
, which has three dots below the skeleton. The joining groups in the table are organized by shape and not by standard Arabic alphabetical order. Note that characters in some joining groups have dots in some contextual forms, but not others, or their dots may move to a different position. These joining groups include
NYA,
FARSI YEH
, and
BURUSHASKI YEH BARREE.
Table 9-7. Dual-Joining Arabic Characters
Joining Group | Notes
BEH | Includes TEH and THEH
NOON | Includes NOON GHUNNA
AFRICAN NOON |
NYA | Jawi NYA
YEH | Includes ALEF MAKSURA
FARSI YEH |
KASHMIRI YEH |
THIN YEH | Final and isolated forms are not attested.
BURUSHASKI YEH BARREE | Dual joining, as opposed to YEH BARREE
HAH | Includes KHAH and JEEM
SEEN | Includes SHEEN
SAD | Includes DAD
TAH | Includes ZAH
AIN | Includes GHAIN
FEH |
AFRICAN FEH |
QAF |
AFRICAN QAF |
MEEM |
HEH |
KNOTTED HEH | See Table 9-9 for more information on regional variation.
HEH GOAL | Includes HAMZA ON HEH GOAL
KAF |
SWASH KAF |
GAF | Includes KEHEH
LAM |
Right-Joining.
Table 9-8
exemplifies right-joining Arabic characters, illustrating the forms they take in context. Right-joining characters have only two distinct forms, for isolated and final contexts, respectively.
Table 9-8. Right-Joining Arabic Characters
Joining Group | Notes
ALEF |
WAW |
STRAIGHT WAW | Tatar STRAIGHT WAW
DAL | Includes THAL
REH | Includes ZAIN
TEH MARBUTA | Includes HAMZA ON HEH
TEH MARBUTA GOAL |
YEH WITH TAIL |
YEH BARREE |
ROHINGYA YEH |
Some characters occur only at the end of words or morphemes when correctly spelled; these are called
trailing characters
. Examples include
TEH MARBUTA
and
DAMMATAN
. When trailing characters are joining (such as
TEH MARBUTA
), they are classified as right-joining, even when similarly shaped characters are dual-joining. Other characters, such as
ALEF MAKSURA
, are considered trailing in modern Arabic, but are dual-joining in Quranic Arabic and languages like Uyghur. These are classified as dual-joining.
Letter heh.
In the case of
U+0647
ARABIC LETTER HEH
, the glyph
is shown in the code charts. This form is often used to reduce the chance of misidentifying
heh
as
U+0665
ARABIC-INDIC DIGIT FIVE
, which has a very similar shape. The isolated forms of
U+0647
ARABIC LETTER HEH
and
U+06C1
ARABIC LETTER HEH GOAL
both look like
U+06D5
ARABIC LETTER AE.
U+06BE
ARABIC LETTER HEH DOACHASHMEE
is used to represent any
heh
-like letter that appears with left stems in all contextual forms. All four forms should have two horizontal or vertical “eyes.” The exact contextual shapes of the letter depend on the language and the style of writing. Four variations for
KNOTTED HEH
are shown in
Table 9-9.
Table 9-9.
Letter
heh
Shapes
Code Points
Name
Joining Group
Notes
0647
FEE9..FEEC
HEH
HEH
Standard forms
06BE
FBAA..FBAD
HEH DOACHASHMEE
KNOTTED HEH
Standard forms, Uighur, Kazakh
Behdini Kurdish
Possibly used in Sindhi
Nastaliq
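The presentation-form ranges listed in Table 9-9 correspond to the contextual forms of the two letters, and that correspondence is recorded in the compatibility decomposition of each presentation form. The Python sketch below, which uses only the standard `unicodedata` module, lists the form tag and base letter for each code point in a range; the function name `contextual_forms` is hypothetical. It is meant for inspecting the data only, since presentation forms should not be used in general interchange.

```python
import unicodedata

def contextual_forms(first, last):
    """List (code point, form tag, base letter) for a presentation-form range."""
    rows = []
    for cp in range(first, last + 1):
        # The decomposition field has the shape "<tag> XXXX" for these characters.
        tag, base = unicodedata.decomposition(chr(cp)).split()
        rows.append((f"U+{cp:04X}", tag.strip("<>"), f"U+{base}"))
    return rows

for row in contextual_forms(0xFEE9, 0xFEEC) + contextual_forms(0xFBAA, 0xFBAD):
    print(row)
```

Each range holds the isolated, final, initial, and medial forms in that order, all mapping back to a single base letter (U+0647 or U+06BE).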
Letter yeh.
There are many complications in the shaping of the Arabic letter
yeh
. These complications have led to the encoding of several different characters for
yeh
in the Unicode Standard, as well as the definition of several different joining groups involving
yeh
. The relationships between those characters and joining groups for
yeh
are explained here.
U+06CC
ARABIC LETTER FARSI YEH
is used in Persian, Urdu, Pashto, Azerbaijani, Kurdish, and various minority languages written in the Arabic script, and also Quranic Arabic. It behaves differently from most Arabic letters, in a way surprising to some native Arabic language speakers. The letter has two horizontal dots below the skeleton in initial and medial forms, but no dots in final and isolated forms. Compared to the two Arabic language
yeh
forms,
FARSI YEH
is exactly like
U+0649
ARABIC LETTER ALEF MAKSURA
in final and isolated forms, but exactly like
U+064A
ARABIC LETTER YEH
in initial and medial forms, as shown in
Table 9-10
. However,
U+06CC
ARABIC LETTER FARSI YEH
followed by
U+0654
ARABIC HAMZA ABOVE
should retain its dots in initial and medial forms.
Table 9-10. Forms of the Arabic Letter yeh
Character | Joining Group
U+0649 ALEF MAKSURA | YEH
U+064A YEH | YEH
U+06CC FARSI YEH | FARSI YEH
U+0886 THIN YEH | THIN YEH
U+0777 YEH WITH DIGIT FOUR BELOW | YEH
U+0620 KASHMIRI YEH | KASHMIRI YEH
U+06D2 YEH BARREE | YEH BARREE
U+077A YEH BARREE WITH DIGIT TWO ABOVE | BURUSHASKI YEH BARREE
U+08AC ROHINGYA YEH | ROHINGYA YEH
Other characters of the joining group
FARSI YEH
follow the same pattern. These
YEH
forms appear with two dots aligned horizontally below them in initial and medial forms, but with no dots below them in final and isolated forms. Characters with the joining group
YEH
behave in a different manner. Just as
U+064A
ARABIC LETTER YEH
retains two dots below in all contextual forms, other characters in the joining group
YEH
retain whatever mark appears below their isolated form in all other contexts. For example,
U+0777
ARABIC LETTER FARSI YEH WITH EXTENDED ARABIC-INDIC DIGIT FOUR BELOW
carries an Urdu-style digit four as a diacritic below the
yeh
skeleton, and retains that diacritic in all positions, as shown in the row for U+0777 in
Table 9-10
. Note that the joining group cannot always be derived from the character name alone. The complete list of characters with the joining group
YEH
or
FARSI YEH
is available in ArabicShaping.txt in the Unicode Character Database.
In the orthographies of Arabic and Persian, the
yeh barree
has always been treated as a stylistic variant of
yeh
in final and isolated positions. When the Perso-Arabic writing system was adapted and extended for use with the Urdu language,
yeh barree
was adopted as a distinct letter to accommodate the richer vowel repertoire of Urdu. South Asian languages such as Urdu and Kashmiri use
yeh barree
to represent the /e/ vowel. This contrasts with the /i/ vowel, which is usually represented in those languages by
U+06CC
ARABIC LETTER FARSI YEH
. The encoded character
U+06D2
ARABIC LETTER YEH BARREE
is classified as a right-joining character, as shown in
Table 9-10
. On that basis, when the /e/ vowel needs to be represented in initial or medial positions with a
yeh
shape in such languages, one should use
U+06CC
ARABIC LETTER FARSI YEH
. In the unusual circumstances where one wishes to distinctly represent the /e/ vowel in word-initial or word-medial positions, a higher level protocol should be used.
For the Burushaski language, two characters that take the form of
yeh barree
with a diacritic,
U+077A
ARABIC LETTER YEH BARREE WITH EXTENDED ARABIC-INDIC DIGIT TWO ABOVE
and
U+077B
ARABIC LETTER YEH BARREE WITH EXTENDED ARABIC-INDIC DIGIT THREE ABOVE
, are classified as dual-joining. These characters have a separate joining group called
BURUSHASKI YEH BARREE
, as shown in the row for U+077A in
Table 9-10.
U+0620
ARABIC LETTER KASHMIRI YEH
is used in Kashmiri text to indicate that the preceding consonantal sound is palatalized. The letter has the form of a
yeh
with a diacritic small circle below in initial and medial forms, but its final and isolated forms appear as truncated
yeh
shapes without the diacritic ring. It has a joining group of its own,
KASHMIRI YEH
, with the shapes as shown in
Table 9-10
, as well as
Table 9-7
. (Prior to Version 16.0, the Unicode Standard had specified that when written in the Naskh style, the letter had different shapes than when written in Nastaliq style; that specification was erroneous.)
U+08AC
ARABIC LETTER ROHINGYA YEH
is used in the Arabic orthography for the Rohingya language of Myanmar. It represents a
medial ya
, corresponding to the use of
U+103B
MYANMAR CONSONANT SIGN MEDIAL YA
in the Myanmar script. It is a right-joining letter. It only occurs after certain consonants, forming a conjunct letter with those consonants.
U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE normally has the hamza positioned over the bowl of the glyph in isolate and final forms. For the Kyrgyz language the hamza is positioned at the top right of the glyph in isolate and final forms, as shown in Table 9-11.
Table 9-11. Glyph Variation for U+0626
[Table: yeh with hamza above, in Standard and Kyrgyz-style glyph forms]
Noon Ghunna.
The letter
noon ghunna
is used to mark nasalized vowels at the ends of words and some morphemes in Urdu, Balochi, and other languages of Southern Asia. It is represented by
U+06BA
ARABIC LETTER NOON GHUNNA
. The
noon ghunna
has the shape of a dotless
noon
and typically appears only in final and isolated contexts in these languages. In the middle of words and morphemes, the normal
noon,
U+0646
ARABIC LETTER NOON
, is used instead. To avoid ambiguity, sometimes a special mark,
U+0658
ARABIC MARK NOON GHUNNA
, is added to the dotted
noon
to indicate nasalization.
U+06BA
ARABIC LETTER NOON GHUNNA
is also used as a dotless
noon
for the
noon
skeleton in all four of its contextual forms. As such, it is used in representation of early Arabic and Quranic Arabic texts. Rendering systems should display U+06BA as a dual-joining letter, with all four contextual forms shown dotless, regardless of the language of the text.
Advanced text entry applications for Urdu, Balochi, and other languages using
noon ghunna
may include specialized logic for its handling. For example, they might detect mid-word usage of the
noon ghunna
key and emit the regular dotted
noon
character (U+0646) instead, as appropriate for spelling in that context.
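The kind of entry logic described above might look like the following minimal sketch. It assumes a post-processing pass over already-entered text rather than a real keyboard driver, and the helper names (`fix_midword_noon_ghunna`, `_is_arabic_letter`) are hypothetical. "Mid-word" is approximated as "followed by another Arabic letter, ignoring combining marks". A real input method would also need to respect morpheme-final uses, which this sketch does not detect.

```python
import unicodedata

NOON = "\u0646"         # ARABIC LETTER NOON
NOON_GHUNNA = "\u06BA"  # ARABIC LETTER NOON GHUNNA


def _is_arabic_letter(ch):
    # Assumption: an "Arabic letter" is any Lo character whose name
    # starts with "ARABIC LETTER".
    return (unicodedata.category(ch) == "Lo"
            and unicodedata.name(ch, "").startswith("ARABIC LETTER"))


def fix_midword_noon_ghunna(text):
    """Replace U+06BA with U+0646 when another Arabic letter follows."""
    out = []
    for i, ch in enumerate(text):
        if ch == NOON_GHUNNA:
            j = i + 1
            # Skip combining marks attached to the noon ghunna itself.
            while j < len(text) and unicodedata.category(text[j]) == "Mn":
                j += 1
            if j < len(text) and _is_arabic_letter(text[j]):
                ch = NOON
        out.append(ch)
    return "".join(out)
```

Because a space or U+200C ZERO WIDTH NON-JOINER is not a letter, a noon ghunna before either of them is left unchanged.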
Letters for Warsh.
There is a set of widespread orthographic conventions for Arabic writing in West and Northwest Africa known as
Warsh
. Among other differences from the better-known
Hafs
orthography of the Middle East, there are significant differences in Warsh regarding the placement of
ijam
dots on several important Arabic letters. Several “African” letters are encoded in the Arabic Extended-A block specifically to account for these differences in dot placement.
The specialized letters for Warsh share the skeleton with the corresponding, regular Arabic letters. However, they differ in the placement of dots. The Warsh letters have no dots in final or isolated positional contexts. This is illustrated by U+08BD ARABIC LETTER AFRICAN NOON. Unlike U+0646 ARABIC LETTER NOON, which displays a dot above in all positional contexts, african noon displays a dot above in initial and medial position, and no dot in final or isolated position. This contrast can be clearly seen in Table 9-7.
U+08BB ARABIC LETTER AFRICAN FEH and U+08BC ARABIC LETTER AFRICAN QAF also lose all dots in final or isolated position, but exhibit a somewhat different pattern for initial and medial position. The basic skeletons for feh and for qaf are identical for those letters in initial and medial position. In the Hafs orthography, the feh takes a single dot above in all positions, while the qaf takes two dots above in all positions. The Warsh orthography distinguishes the two letters differently: the feh takes a single dot below in initial or medial position, while the qaf takes a single dot above in initial or medial position. These contextual differences in the placement of the dots for these letters can also be seen in Table 9-7.
Letters for Ajami.
The Hausa and Wolof languages of West Africa use an Arabic-based orthography known as
ajami
. The
ajami
orthography contains additional, specialized letters with three dots above or below. This three-dot
ijam
is known as a
wagaf
in Hausa. In the Kano/Maghribi Arabic style used in this region, the
wagaf
is noticeably smaller than any other
ijam
that may also occur on these specialized characters or other Arabic letters. For example, when rendered in the Kano/Maghribi style in a Hausa font,
U+0751
ARABIC LETTER BEH WITH DOT BELOW AND THREE DOTS ABOVE
will show the dot below in a dark, large size, while the three dots above for the
wagaf
are distinctly smaller. This distinction in size tends to be much less noticeable when the same letters are rendered using a standard,
naskh
style Arabic font, as shown in the code charts.
Joining Groups in Other Scripts.
Other scripts besides Arabic also have cursive joining behavior and associated per-character values for Joining_Type and Joining_Group. Those values are also listed in ArabicShaping.txt in the Unicode Character Database, in sections devoted to each particular script. See the script descriptions for such scripts in the core specification—for example, Syriac and Manichaean—for detailed discussions of cursive joining behavior and tables of joining groups for those scripts.
For the Arabic script, Joining_Group values are assigned for each distinct letter skeleton in all instances—even for the small number of cases, such as
heh goal
, where only a single character is associated with that Joining_Group. This is appropriate for Arabic, because the script has cosmopolitan use, and many letters have been modified with various
nukta
diacritics to form new letters for non-Arabic languages using the script. This pattern of comprehensive assignment of Joining_Group values to all letter skeletons also applies for the Syriac and Manichaean scripts.
For other cursive joining scripts with less well-defined joining groups, all letters are simply assigned the value No_Joining_Group. This does not necessarily mean that no identifiable letter skeletons occur, but rather that no complete analysis has been done that would indicate more than one letterform uses a shared skeleton for cursive joining. Examples include: Mongolian, Phags-Pa, Psalter Pahlavi, Sogdian, and Adlam.
Starting with Unicode 11.0, even in cases where a newly encoded script with cursive joining behavior includes
some
characters which share letter skeletons, most characters are given the No_Joining_Group value. This applies, for example, to the Hanifi Rohingya script, which has a few explicit Joining_Group values, but for which all other characters have the No_Joining_Group value.
In the future, characters with the No_Joining_Group value in scripts with cursive joining behavior may end up being given explicit new Joining_Group values, where further analysis clearly demonstrates use of shared skeletons in cursive joining, or where new, diacritically modified letters are added to the encoding for that script.
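As an illustration of how the per-character Joining_Type and Joining_Group values in ArabicShaping.txt might be consumed, here is a minimal parsing sketch. It assumes the four-field, semicolon-separated layout of that file (code point; schematic name; joining type; joining group) with `#` comments. The function and type names are hypothetical, and the sample lines are illustrative rather than copied from the file: in particular the schematic-name field of the last line is a placeholder.

```python
from typing import NamedTuple


class JoiningInfo(NamedTuple):
    schematic_name: str
    joining_type: str   # e.g. "R" (right-joining), "D" (dual-joining)
    joining_group: str  # e.g. "BEH", or "No_Joining_Group"


def parse_arabic_shaping(lines):
    """Map code point -> JoiningInfo for lines in ArabicShaping.txt layout."""
    table = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) != 4:
            continue  # not a data line in the assumed layout
        table[int(fields[0], 16)] = JoiningInfo(*fields[1:])
    return table


# Illustrative sample only; the values for joining type and group follow
# the text above, the second field of the last line is a placeholder.
SAMPLE = """\
# code point; schematic name; joining type; joining group
0628; BEH; D; BEH
06D2; YEH BARREE; R; YEH BARREE
077A; PLACEHOLDER NAME; D; BURUSHASKI YEH BARREE
"""

table = parse_arabic_shaping(SAMPLE.splitlines())
```

With a table like this, a shaping engine could look up `table[0x06D2].joining_type` to decide which contextual forms a letter can take.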
9.2.5 Combining Hamza
Combining Hamza Above.
U+0654
ARABIC HAMZA ABOVE
is intended both for the representation of
hamza
semantics in combination with certain Arabic letters, and as a diacritical mark occasionally used in combinations to derive extended Arabic letters. There are a number of complications regarding its use, which interact with the rules for the rendering of Arabic letter
yeh
and which result from the need to keep Unicode normalization stable.
U+0654 ARABIC HAMZA ABOVE should not be used with U+0649 ARABIC LETTER ALEF MAKSURA. Instead, the precomposed U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE should be used to represent a yeh-shaped base with no dots in any positional form, and with a hamza above. Because U+0626 is canonically equivalent to the sequence <U+064A ARABIC LETTER YEH, U+0654 ARABIC HAMZA ABOVE>, when U+0654 is applied to U+064A ARABIC LETTER YEH, the yeh should lose its dots in all positional forms, even though yeh retains its dots when combined with other marks.
A separate, non-decomposable character,
U+08A8
ARABIC LETTER YEH WITH TWO DOTS BELOW AND HAMZA ABOVE
, is used to represent a
yeh
-shaped base with a
hamza
above, but with retention of dots in all positions. This letter is used in the Fulfulde language in Cameroon, to represent a palatal implosive.
In most other cases when a hamza is needed as a mark above for an Arabic letter, U+0654 ARABIC HAMZA ABOVE can be freely used in combination with basic Arabic letters. Three exceptions are the extended Arabic letters U+0681 ARABIC LETTER HAH WITH HAMZA ABOVE, U+076C ARABIC LETTER REH WITH HAMZA ABOVE, and U+08A1 ARABIC LETTER BEH WITH HAMZA ABOVE, where the hamza mark is functioning as an ijam (diacritic), rather than as a normal hamza. In those three cases, the extended Arabic letters have no canonical decompositions; consequently, the preference is to use those precomposed forms, rather than applying U+0654 ARABIC HAMZA ABOVE to hah, reh, or beh, respectively.
In Persian and Urdu, a
hamza above
is frequently used for the
ezafe
sound /je/. This should be represented using
U+0654
ARABIC HAMZA ABOVE
after the
heh
letter appropriate for the orthography, as opposed to the precomposed U+06C0 which decomposes to a
heh
form not used in Persian and Urdu.
In Kashmiri, a hamza above is used as a vowel to represent the sound /ə/ over various different letters. In cases where it appears over a beh, hah, or reh, the precomposed letters U+0681, U+076C, and U+08A1 mentioned above should not be used. Instead, such Kashmiri text must be represented using beh, hah, or reh followed by U+0654 ARABIC HAMZA ABOVE.
These interactions between various letters and the hamza are summarized in Table 9-12.
Table 9-12. Arabic Letters With Hamza Above

Code Point   Name                                     Decomposition
0623         alef with hamza above                    0627 0654
0624         waw with hamza above                     0648 0654
0626         yeh with hamza above                     064A 0654
06C2         heh goal with hamza above                06C1 0654
06D3         yeh barree with hamza above              06D2 0654
0681         hah with hamza above                     None
076C         reh with hamza above                     None
08A1         beh with hamza above                     None
08A8         yeh with 2 dots below and hamza above    None
The first five entries in
Table 9-12
show the cases where the
hamza above
can be freely used, and where there is a canonical equivalence to the precomposed characters. The last four entries show the exceptions, where use of the
hamza above
is inappropriate, and where only the precomposed characters should be used.
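The distinction drawn in Table 9-12 can be checked against the Unicode Character Database. The sketch below uses Python's standard `unicodedata` module and assumes its bundled database is recent enough to include U+08A1. The helper name `canonical_decomposition` is hypothetical.

```python
import unicodedata

# The nine precomposed letters listed in Table 9-12.
PRECOMPOSED = ["\u0623", "\u0624", "\u0626", "\u06C2", "\u06D3",
               "\u0681", "\u076C", "\u08A1", "\u08A8"]


def canonical_decomposition(ch):
    """Return the canonical decomposition as a list of code points, or []."""
    decomp = unicodedata.decomposition(ch)
    if not decomp or decomp.startswith("<"):
        return []  # no decomposition, or a compatibility-only one
    return [int(cp, 16) for cp in decomp.split()]


for ch in PRECOMPOSED:
    parts = canonical_decomposition(ch)
    shown = " ".join(f"{cp:04X}" for cp in parts) or "None"
    print(f"{ord(ch):04X}  {shown}")
```

One consequence is that normalization composes yeh plus hamza above into U+0626, but leaves hah plus hamza above as a two-character sequence that is not equivalent to U+0681.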
High Hamza. The characters U+0675 ARABIC LETTER HIGH HAMZA ALEF, U+0676 ARABIC LETTER HIGH HAMZA WAW, U+0677 ARABIC LETTER U WITH HAMZA ABOVE, and U+0678 ARABIC LETTER HIGH HAMZA YEH are not recommended for use. Their compatibility decompositions are anomalous: the decomposed sequences are pairs of letters with right-to-left bidirectional character properties, with U+0674 ARABIC LETTER HIGH HAMZA as the second letter. When the decomposed sequences are processed using the Unicode Bidirectional Algorithm, the hamza will appear to the left of the other letter, whereas in the composite characters the hamza appears on the right. Thus, the ordering of characters in the decomposition sequences is the reverse of what is expected. Accordingly, appropriately-ordered pairs of letters beginning with U+0674 ARABIC LETTER HIGH HAMZA should be used instead. For example, the sequence <U+0674 ARABIC LETTER HIGH HAMZA, U+0627 ARABIC LETTER ALEF> should be used rather than U+0675 ARABIC LETTER HIGH HAMZA ALEF. To facilitate correct text entry, input methods should be configured to generate the corresponding pairs of letters beginning with U+0674 ARABIC LETTER HIGH HAMZA.
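A text-cleanup step that follows this recommendation could be sketched as below. The approach is an assumption for illustration, not a prescribed algorithm: it reads each discouraged character's compatibility decomposition from Python's `unicodedata` module and moves the trailing high hamza to the front. The function names are hypothetical.

```python
import unicodedata

HIGH_HAMZA = "\u0674"                 # ARABIC LETTER HIGH HAMZA
DISCOURAGED = "\u0675\u0676\u0677\u0678"


def recommended_sequence(ch):
    """Return the high-hamza-first pair for one of U+0675..U+0678."""
    fields = unicodedata.decomposition(ch).split()
    # Drop the "<compat>" tag and keep the decomposed letters.
    letters = [chr(int(f, 16)) for f in fields if not f.startswith("<")]
    if len(letters) != 2 or letters[1] != HIGH_HAMZA:
        raise ValueError(f"unexpected decomposition for U+{ord(ch):04X}")
    return HIGH_HAMZA + letters[0]


def replace_discouraged(text):
    """Rewrite U+0675..U+0678 as pairs beginning with U+0674."""
    return "".join(
        recommended_sequence(ch) if ch in DISCOURAGED else ch for ch in text
    )
```

Note that this is deliberately different from applying NFKD, which would produce the anomalous order with the high hamza second.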
Use of these characters in identifier systems can be problematic and can present a potential security risk. For example, IDNA 2003 permits U+0675 ARABIC LETTER HIGH HAMZA ALEF to be used in a domain name, but requires that to be mapped to the compatibility decomposed sequence <U+0627, U+0674> before conversion to punycode. However, the sequence <U+0674, U+0627> could also be used in a domain name. Two domain names that differ only in using U+0675 versus <U+0674, U+0627> would map to distinct punycode sequences but would be visually identical. Under IDNA 2008, the four composed characters (U+0675..U+0678) would no longer be permitted in a registered domain name, but applications can still accept them and map them into punycode, so risks from ambiguity still exist.
Malay Jawi uses
U+0674
ARABIC LETTER HIGH HAMZA
. In Jawi, the letter is the same size as
U+0621
ARABIC LETTER HAMZA
; however, unlike U+0621, it is positioned above the baseline at three-quarters height of the
U+0627
ARABIC LETTER ALEF
. Font designers can use language tagging in order to support the preferred shapes for both Kazakh and Jawi in multilingual fonts.
Quranic Texts. Most traditions of writing the Quran keep the skeleton of words intact from earlier Quranic manuscripts, but add dots and diacritics, including hamzas. Thus, words spelled with the medial form of U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE in modern Arabic orthographies may appear in Quranic texts without the tooth typical of the letter. There is usually an elongation under the hamza, and the hamza may carry other diacritical marks, such as a fatha. This convention can be thought of as a modified version of yeh-hamza, and is represented with the sequence <U+0640 ARABIC TATWEEL, U+0654 ARABIC HAMZA ABOVE>. For example, in some Quranic traditions the word yasʾaluka is represented by the sequence <yeh, fatha, seen, sukun, tatweel, hamza above, fatha, lam, damma, kaf, fatha>.
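To make the yasʾaluka sequence concrete, the sketch below builds it from character names with Python's `unicodedata.lookup`. The choice of U+064A for yeh and U+0643 for kaf is an assumption (the plain Arabic letters); the text above names the letters only informally.

```python
import unicodedata

# Character names for <yeh, fatha, seen, sukun, tatweel, hamza above,
# fatha, lam, damma, kaf, fatha>. Yeh and kaf are assumed to be the
# basic Arabic letters U+064A and U+0643.
NAMES = [
    "ARABIC LETTER YEH", "ARABIC FATHA",
    "ARABIC LETTER SEEN", "ARABIC SUKUN",
    "ARABIC TATWEEL", "ARABIC HAMZA ABOVE", "ARABIC FATHA",
    "ARABIC LETTER LAM", "ARABIC DAMMA",
    "ARABIC LETTER KAF", "ARABIC FATHA",
]

yasaluka = "".join(unicodedata.lookup(name) for name in NAMES)

# Print the code points in logical (storage) order.
print(" ".join(f"{ord(ch):04X}" for ch in yasaluka))
```

The tatweel at index 4 serves as the base for the hamza above, standing in for the missing tooth.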
9.2.6 Other Letters for Extended Arabic
Jawi.
U+06BD
ARABIC LETTER NOON WITH THREE DOTS ABOVE
is used for Jawi, which is Malay written using the Arabic script. Malay users know the character as
Jawi Nya
. Contrary to what is suggested by its Unicode character name, U+06BD displays with the three dots
below
the letter pointing downward when it is in the initial or medial position, making it look exactly like the initial and medial forms of
U+067E
ARABIC LETTER PEH
. This is done to avoid confusion with
U+062B
ARABIC LETTER THEH
, which appears in words of Arabic origin, and which has the same base letter shapes in initial or medial position, but with three dots above in all positions.
Kurdish. The Kurdish language is written in several different orthographies, which use either the Latin, Cyrillic, or Arabic scripts. When written using the Arabic script, Kurdish uses a number of extended Arabic letters, for an alphabet known as Soraní. Some of those extensions are shared with Persian, Urdu, or other languages: for example, U+06C6 ARABIC LETTER OE, which represents the Kurdish vowel [o]. Soraní also makes other unusual adaptations of the Arabic script, including the use of a digraph waw+waw to represent the long Kurdish vowel [uː]. That digraph is represented by a sequence of two characters, <U+0648 ARABIC LETTER WAW, U+0648 ARABIC LETTER WAW>.
Among the extended Arabic characters used exclusively for Soraní are
U+0695
ARABIC LETTER REH WITH SMALL V BELOW
(for the Kurdish
trilled r
) and
U+06B5
ARABIC LETTER LAM WITH SMALL V
(for the Kurdish
velarized l
).
The Arabic block also includes several extended Arabic characters whose origin was to represent dialectal or other poorly attested alternative forms of the Soraní alphabet extensions.
U+0692
ARABIC LETTER REH WITH SMALL V
is a dialectal variant of U+0695 which places the
small v
diacritic above the letter rather than below it. U+0694 is another variant of U+0695. U+06B6 and U+06B7 are poorly attested variants of U+06B5, and U+06CA is a poorly attested variant of U+06C6. None of these alternative forms is required (or desired) for a regular implementation of the Kurdish Soraní orthography.
Sindhi Meem. In general, the distinction between a long tail and a short tail is stylistic. However, Sindhi specifically prefers the meem to have a short tail in isolate and final positions, as shown in Table 9-13.
Table 9-13. Glyph Variation for U+0645
[Table: meem, in Standard and Sindhi-style glyph forms]
9.2.7 Arabic Supplement: U+0750–U+077F
The Arabic Supplement block contains additional extended Arabic letters for the languages used in Northern and Western Africa, such as Fulfulde, Hausa, Songhoy, and Wolof. In the second half of the 20th century, the use of the Arabic script was actively promoted for these languages. This block also contains a number of letters used for the Khowar, Torwali, and Burushaski languages, spoken primarily in Pakistan. Characters used for other languages are annotated in the character names list. Additional vowel marks used with these languages are found in the main Arabic block.
Marwari. U+076A ARABIC LETTER LAM WITH BAR is used to represent a flapped retroflexed lateral in the Marwari language in southern Pakistan. It has also been suggested for use in the Gawri language of northern Pakistan, but it is unclear how widely it has been adopted there. Contextual shaping for this character is similar to that of U+0644 ARABIC LETTER LAM, including the requirement to form ligatures with characters of Joining_Group = ALEF.
9.2.8 Arabic Extended-A: U+08A0–U+08FF
The Arabic Extended-A block contains additional Arabic letters and vowel signs for use by a number of African languages from Chad, Senegal, Guinea, and Cameroon, and for languages of the Philippines. It also contains extended letters, vowel signs, and tone marks used by the Rohingya Fonna writing system for the Rohingya language in Myanmar, as well as several additional Quranic annotation signs. Characters used for other languages are annotated in the character names list.
One Quranic annotation sign, U+08D9 ◌ࣙ ARABIC SMALL LOW NOON WITH KASRA, was given a mistaken Canonical_Combining_Class value when it was encoded in this block, and that value cannot be changed, due to normalization stability policies. Section 5.8, “Workaround for Mistaken Canonical_Combining_Class Assignment” in Unicode Standard Annex #53, “Unicode Arabic Mark Rendering,” provides more details about this character and explains how the Arabic Mark Transient Reordering Algorithm can be applied to get correct rendering behavior.
9.2.9 Arabic Extended-B: U+0870–U+089F
The Arabic Extended-B block comprises Quranic characters, especially those used in Northwest Africa, and characters from other orthographies, such as Bosnian and Pegon in Indonesia. The block also includes currency symbols and an abbreviation mark.
9.2.10 Arabic Extended-C: U+10EC0–U+10EFF
The Arabic Extended-C block comprises Quranic characters and characters from other orthographies, such as Pegon in Indonesia.
9.2.11 Arabic Presentation Forms-A: U+FB50–U+FDFF
This block contains a list of Arabic presentation forms encoded as characters primarily for compatibility reasons. These characters have a preferred representation that makes use of a normal (noncompatibility) Arabic character, or in many cases a sequence of Arabic characters.
Presentation form
is a mostly obsolete term for a contextually shaped glyph (for a single character) or for a ligature glyph (for a sequence of characters).
The presentation forms in this block consist of contextual (positional) variants of Extended Arabic letters, contextual variants of Arabic letter ligatures, spacing forms of Arabic diacritic combinations, contextual variants of certain Arabic letter/diacritic combinations, and Arabic phrase ligatures, including honorific word ligatures. The ligatures include a large set of presentation forms. However, the set of ligatures appropriate for any given Arabic font will generally not match this set precisely. Fonts will often include only a subset of these glyphs, and they may also include glyphs outside of this set. The included glyphs are generally not accessible as characters and are used only by rendering engines.
Ornate Parentheses.
The alternative, ornate forms of parentheses (
U+FD3E
ORNATE LEFT PARENTHESIS
and
U+FD3F
ORNATE RIGHT PARENTHESIS
) for use with the Arabic script are considered traditional Arabic punctuation, rather than compatibility characters. These ornate parentheses are exceptional in rendering in bidirectional text; for legacy reasons, they do not have the Bidi_Mirrored property. Thus, unlike other parentheses, they do not automatically mirror when rendered in a bidirectional context.
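The property difference can be observed with Python's standard `unicodedata` module, whose `mirrored` function reports the Bidi_Mirrored property. The wrapper name `is_bidi_mirrored` is a hypothetical convenience.

```python
import unicodedata


def is_bidi_mirrored(ch):
    """True if the character has the Bidi_Mirrored property."""
    return bool(unicodedata.mirrored(ch))


# Compare the ASCII parentheses with the ornate parentheses.
for ch in "()\uFD3E\uFD3F":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"mirrored={is_bidi_mirrored(ch)}")
```

An implementation that needs visually matching ornate parentheses in right-to-left text therefore cannot rely on automatic mirroring and has to choose the intended character explicitly.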
Nuktas.
Various patterns of single or multiple dots or other small marks are used diacritically to extend the core Arabic set of letters to represent additional sounds in other languages written with the Arabic script. Such dot patterns are known as
ijam
or
nuktas
. In the Unicode Standard, extended Arabic characters with nuktas are simply encoded as fully-formed base characters. However, there is an occasional need in pedagogical materials about the Arabic script to exhibit the various nuktas in isolation. The range of characters U+FBB2..U+FBC1 provides a set of symbols for this purpose. These are ordinary, spacing symbols with right-to-left directionality. They are
not
combining marks, and are not intended for the construction of new Arabic letters by use in combining character sequences. The Arabic pedagogical symbols do not partake of any Arabic shaping behavior. Their Joining_Type is Non_Joining, so if used in juxtaposition with an Arabic letter skeleton, they will break the cursive connection and render after the letter, instead of above or below it.
For clarity in display, those with the names including the word “above” should have glyphs that render high above the baseline, and those with names including “below” should be at or below the baseline.
Word Ligatures.
The signs and symbols encoded at U+FD40..U+FD4F, U+FDCF, and U+FDF0..U+FDFF are word ligatures sometimes treated as a unit. Most of them are encoded for compatibility with older character sets and are rarely used, except the following:
U+FDF2 ARABIC LIGATURE ALLAH ISOLATED FORM is a very common ligature, used to display the name of God. When the formation of the allah ligature is desired, the recommended way to represent the word would be <alef, lam, lam, shadda, superscript alef, heh> <0627, 0644, 0644, 0651, 0670, 0647>. In non-Arabic languages, other forms of heh, such as heh goal (U+06C1), may also form the ligature. Extra care should be taken not to form the ligature in the absence of the shadda and the superscript alef, as the sequences <alef, lam, lam, heh> and <alef, lam, lam, shadda, heh> exist in Persian and other languages with different meanings or pronunciations, where the formation of the ligature would be incorrect and inappropriate.
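The condition for forming the allah ligature can be expressed as a simple pattern over the character sequence. In practice this decision is made inside a font's shaping rules. The following is only a text-level sketch under that caveat, and the names `ALLAH_SEQUENCE_RE` and `allah_ligature_spans` are hypothetical. It accepts U+0647 and U+06C1 as the final heh and requires both the shadda and the superscript alef.

```python
import re

# <alef, lam, lam, shadda, superscript alef, heh or heh goal>
ALLAH_SEQUENCE_RE = re.compile(
    "\u0627\u0644\u0644\u0651\u0670[\u0647\u06C1]"
)


def allah_ligature_spans(text):
    """Return (start, end) spans where the ligature may be formed."""
    return [match.span() for match in ALLAH_SEQUENCE_RE.finditer(text)]
```

Sequences that lack the shadda or the superscript alef do not match, so the Persian words mentioned above would be left unligated.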
U+FDFA
ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
and
U+FDFB
ARABIC LIGATURE JALLAJALALOUHOU
are honorifics, commonly used after the name of the prophet Muhammad or God. Other honorific ligatures include U+FD40..U+FD4F, U+FDCF, and U+FDFD..U+FDFF. Their usage is comparable to the honorifics found at U+0610..U+0613, except that these are spacing characters. The same characters are sometimes used by Muslims writing in other scripts such as Latin and Cyrillic.
U+FDFD
ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM
is a special ligated form of the
basmala
, a common opening phrase used by Muslims. The ligature is written in a multitude of ways. Its usage is common in writings by Muslims in non-Arabic scripts, even more than the honorifics mentioned above. It can be displayed as a unit above text in several different scripts, such as Bengali and Thaana. Unlike the other Arabic word ligatures, this character does not have a compatibility decomposition.
U+FDFC RIAL SIGN is a condensed version of the word rial, the Iranian currency. The character was invented by a typewriter standardization committee in 1973 and is encoded in the Unicode Standard as a compatibility character, as it continues to be specified in Iranian national standards for character sets and keyboard layouts, including ISIRI 9147:2007. Except for a short life during the typewriter era, it has not received widespread usage outside standards, as Iranians prefer to spell out the word as <reh, farsi yeh, alef, lam>.
9.2.12 Arabic Presentation Forms-B: U+FE70–U+FEFF
This block contains additional Arabic presentation forms consisting of spacing or
tatweel
forms of Arabic diacritics, contextual variants of primary Arabic letters, and some of the obligatory
LAM-ALEF
ligatures. They are included here for compatibility with preexisting standards and legacy implementations that use these forms as characters. Instead of these, letters from the Arabic block (U+0600..U+06FF) should be used for interchange. Implementations should handle contextual glyph shaping by rendering rules when accessing glyphs from fonts, rather than by encoding contextual shapes as characters.
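One way to migrate legacy data away from these presentation forms is to apply compatibility normalization to the characters of this block only. The sketch below assumes that approach and uses Python's standard `unicodedata` module. The function name and the block boundaries used in the range test are illustrative choices, and U+FEFF is deliberately left untouched.

```python
import unicodedata


def fold_presentation_forms_b(text):
    """Map Arabic Presentation Forms-B characters to their NFKC forms.

    Only U+FE70..U+FEFC are touched; U+FEFF and all other characters
    pass through unchanged.
    """
    out = []
    for ch in text:
        if "\uFE70" <= ch <= "\uFEFC":
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)


# Initial beh, final alef, isolated beh: all fold to plain letters.
legacy = "\uFE91\uFE8E\uFE8F"
print([f"{ord(ch):04X}" for ch in fold_presentation_forms_b(legacy)])
```

After folding, contextual shapes are again chosen by the rendering engine rather than being fixed in the stored text.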
Spacing and Tatweel Forms of Arabic Diacritics.
For compatibility with certain implementations, a set of spacing forms of the Arabic diacritics is provided here. The tatweel forms are combinations of the joining connector tatweel and a diacritic.
Zero Width No-Break Space. This character (U+FEFF), which is not an Arabic presentation form, is described in Section 23.8, Specials.
9.3 Syriac
9.3.1 Syriac: U+0700–U+074F
Syriac Language.
The Syriac language belongs to the Aramaic branch of the Semitic family of languages. The earliest datable Syriac writing dates from the year 6
CE
. Syriac is the active liturgical language of many communities in the Middle East (Syrian Orthodox, Assyrian, Maronite, Syrian Catholic, and Chaldaean) and Southeast India (Syro-Malabar and Syro-Malankara). It is also the native language of a considerable population in these communities.
Syriac is divided into two dialects. West Syriac is used by the Syrian Orthodox, Maronites, and Syrian Catholics. East Syriac is used by the Assyrians (that is, Ancient Church of the East) and Chaldaeans. The two dialects are very similar and have almost no differences in grammar and vocabulary. They differ in pronunciation and use different dialectal forms of the Syriac script.
Languages Using the Syriac Script.
A number of modern languages and dialects employ the Syriac script in one form or another. They include the following:
Literary Syriac. The primary usage of Syriac script.
Neo-Aramaic dialects. The Syriac script is widely used for modern Aramaic languages, next to Hebrew, Cyrillic, and Latin. A number of Eastern Modern Aramaic dialects known as Swadaya (also called vernacular Syriac, modern Syriac, modern Assyrian, and so on, and spoken mostly by the Assyrians and Chaldaeans of Iraq, Turkey, and Iran) and the Central Aramaic dialect, Turoyo (spoken mostly by the Syrian Orthodox of the Tur Abdin region in southeast Turkey), belong to this category of languages.
Garshuni (Arabic written in the Syriac script). It is currently used for writing Arabic liturgical texts by Syriac-speaking Christians. Garshuni employs the Arabic set of vowels and overstrike marks.
Christian Palestinian Aramaic (also known as Palestinian Syriac). This dialect is no longer spoken.
Other languages. The Syriac script was used in various historical periods for writing Armenian and some Persian dialects. Syriac speakers employed it for writing Arabic, Ottoman Turkish, and Malayalam. Six special characters used for Persian and Sogdian were added in Version 4.0 of the Unicode Standard.
Shaping.
The Syriac script is cursive and has shaping rules that are similar to those for Arabic. The Unicode Standard does not include any presentation form characters for Syriac.
Directionality.
The Syriac script is written from right to left. Conformant implementations of Syriac script must use the Unicode Bidirectional Algorithm (see Unicode Standard Annex #9, “Unicode Bidirectional Algorithm”).
Syriac Type Styles.
Syriac texts employ several type styles. Because all type styles use the same Syriac characters, even though their shapes vary to some extent, the Unicode Standard encodes only a single Syriac script.
Estrangela type style. Estrangela (a word derived from Greek strongulos, meaning “rounded”) is the oldest type style. Ancient manuscripts use this writing style exclusively. Estrangela is used today in West and East Syriac texts for writing headers, titles, and subtitles. It is the current standard in writing Syriac texts in Western scholarship.
Serto or West Syriac type style. This type style is the most cursive of all Syriac type styles. It emerged around the eighth century and is used today in West Syriac texts, Turoyo (Central Neo-Aramaic), and Garshuni.
East Syriac type style. Its early features appear as early as the sixth century; it developed into its own type style by the twelfth or thirteenth century. This type style is used today for writing East Syriac texts as well as Swadaya (Eastern Neo-Aramaic). It is also used today in West Syriac texts for headers, titles, and subtitles alongside the Estrangela type style.
Christian Palestinian Aramaic. Manuscripts of this dialect employ a script that is akin to Estrangela. It can be considered a subcategory of Estrangela.
The Unicode Standard provides for usage of the type styles mentioned above. It also accommodates letters and diacritics used in Neo-Aramaic, Christian Palestinian Aramaic, Garshuni, Persian, and Sogdian languages.
Examples are supplied in the Serto type style, except where otherwise noted.
Character Names. Character names follow the East Syriac convention for naming the letters of the alphabet. Diacritical points use descriptive names; for example, U+0743 SYRIAC TWO VERTICAL DOTS ABOVE.
Syriac Abbreviation Mark.
U+070F
SYRIAC ABBREVIATION MARK
(SAM) is a zero-width formatting code that has no effect on the shaping process of Syriac characters. The SAM specifies the beginning point of a
Syriac abbreviation
, which is a line drawn horizontally above one or more characters, at the end of a word or of a group of characters followed by a character other than a Syriac letter or diacritical mark. A Syriac abbreviation may contain Syriac diacritics.
Ideally, the Syriac abbreviation is rendered by a line that has a dot at each end and the center, as shown in the examples. While not preferable, it has become acceptable for computers to render the Syriac abbreviation as a line without the dots. The line is acceptable for the presentation of Syriac in plain text, but the presence of dots is recommended in liturgical texts.
The Syriac abbreviation is used for letter numbers and contractions. A Syriac abbreviation generally extends from the last tall character in the word until the end of the word. A common exception to this rule is found with letter numbers that are preceded by a preposition character, as seen in Figure 9-9.
Figure 9-9. Syriac Abbreviation
A SAM is placed before the character where the abbreviation begins. The Syriac abbreviation begins over the character following the SAM and continues until the end of the word. Use of the SAM is demonstrated in Figure 9-10.
Figure 9-10. Use of SAM
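The extent of an abbreviation line, as described above, could be computed along the following lines. This is a simplified sketch, not a rendering algorithm: "end of the word" is approximated as the first character that is neither in the range U+0710..U+074F nor a nonspacing mark, and the names `abbreviation_spans` and `_continues_abbreviation` are hypothetical.

```python
import unicodedata

SAM = "\u070F"  # SYRIAC ABBREVIATION MARK


def _continues_abbreviation(ch):
    # Assumption: Syriac letters and marks are approximated by the range
    # U+0710..U+074F, plus any nonspacing mark from another block.
    return "\u0710" <= ch <= "\u074F" or unicodedata.category(ch) == "Mn"


def abbreviation_spans(text):
    """Return (start, end) index pairs covered by each abbreviation line.

    Each span starts at the character after a SAM and runs up to, but
    not including, the first character that ends the word.
    """
    spans = []
    for i, ch in enumerate(text):
        if ch == SAM:
            j = i + 1
            while j < len(text) and _continues_abbreviation(text[j]):
                j += 1
            spans.append((i + 1, j))
    return spans
```

For a word stored as beth, SAM, gamal, dalath followed by a space, the overline would cover the gamal and the dalath but not the beth.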
Note:
Modern East Syriac texts employ a punctuation mark for contractions of this sort.
Ligatures and Combining Characters.
Only one ligature is included in the Syriac block:
U+071E
SYRIAC LETTER YUDH HE
. This combination is used as a unique character in the same manner as an “æ” ligature. A number of combining diacritics unique to Syriac are encoded, but combining characters from other blocks are also used, especially from the Arabic block.
Diacritical Marks and Vowels.
The function of the diacritical marks varies. They indicate vowels (as in Arabic and Hebrew), mark grammatical attributes (for example, verb versus noun, interjection), or guide the reader in the pronunciation and/or reading of the given text.
“The reader of the average Syriac manuscript or book is confronted with a bewildering profusion of points. They are large, of medium size and small, arranged singly or in twos and threes, placed above the word, below it, or upon the line.”
There are two vocalization systems. The first, attributed to Jacob of Edessa (633–708
CE
), utilizes letters derived from Greek that are placed above (or below) the characters they modify. The second is the more ancient dotted system, which employs dots in various shapes and locations to indicate vowels. East Syriac texts exclusively employ the dotted system, whereas West Syriac texts (especially later ones and in modern times) employ a mixture of the two systems.
Diacritical marks are nonspacing and are normally centered above or below the character. Exceptions to this rule follow:
U+0741
SYRIAC QUSHSHAYA
and
U+0742
SYRIAC RUKKAKHA
are used only with the letters
beth
gamal
(in its Syriac and Garshuni forms),
dalath
kaph
pe,
and
taw
The
qushshaya
indicates that the letter is pronounced hard and unaspirated.
The
rukkakha
indicates that the letter is pronounced soft and aspirated. When the
rukkakha
is used in conjunction with the
dalath
, it is printed slightly to the right of the
dalath
’s dot below.
In Modern Syriac usage, when a word contains a
rish
and a
seyame
, the dot of the
rish
and the
seyame
are replaced by a
rish
with two dots above it.
The
feminine dot
is usually placed to the left of a final
taw.
Punctuation.
Most punctuation marks used with Syriac are found in the Latin-1 and Arabic blocks. The other marks are encoded in this block.
Digits.
Modern Syriac employs European numerals, as does Hebrew. The ordering of digits follows the same scheme as in Hebrew.
Harklean Marks.
The Harklean marks are used in the Harklean translation of the New Testament.
U+070B
SYRIAC HARKLEAN OBELUS
and
U+070D
SYRIAC HARKLEAN ASTERISCUS
mark the beginning of a phrase, word, or morpheme that has a marginal note.
U+070C
SYRIAC HARKLEAN METOBELUS
marks the end of such sections.
Dalath and Rish.
Prior to the development of pointing, early Syriac texts did not distinguish between a
dalath
and a
rish
, which are distinguished in later periods with a dot below the former and a dot above the latter. Unicode provides
U+0716
SYRIAC LETTER DOTLESS DALATH RISH
as an ambiguous character.
Semkath.
Unlike other letters, the joining mechanism of
semkath
varies through the course of history from right-joining to dual-joining. It is necessary to enter a
U+200C
ZERO WIDTH NON-JOINER
character after the
semkath
to obtain the right-joining form where required. Two common variants of this character exist:
U+0723
SYRIAC LETTER SEMKATH
and
U+0724
SYRIAC LETTER FINAL SEMKATH
. They occur interchangeably in the same document, similar to the case of Greek sigma.
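As a rough illustration of the ZWNJ convention described above, the following Python sketch inserts U+200C after each non-final semkath so that a renderer would select the right-joining form. The function name and the choice to keep nonspacing marks attached to the semkath before the ZWNJ are assumptions of this sketch, not requirements stated in the text.

```python
import unicodedata

ZWNJ = "\u200C"     # ZERO WIDTH NON-JOINER
SEMKATH = "\u0723"  # SYRIAC LETTER SEMKATH


def with_right_joining_semkath(text: str) -> str:
    """Insert ZWNJ after every semkath that is followed by more text."""
    out = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        out.append(ch)
        i += 1
        if ch == SEMKATH:
            # Keep any nonspacing marks directly after the semkath they modify.
            while i < n and unicodedata.category(text[i]) == "Mn":
                out.append(text[i])
                i += 1
            # Add ZWNJ only if something follows and it is not already there.
            if i < n and text[i] != ZWNJ:
                out.append(ZWNJ)
    return "".join(out)
```

Because an existing ZWNJ is detected, applying the function twice gives the same result as applying it once.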
Vowel Marks.
The so-called Greek vowels may be used above or below letters. As West Syriac texts employ a mixture of the Greek and dotted systems, both versions are accounted for here.
Miscellaneous Diacritics.
Miscellaneous general diacritics are used in Syriac text. Their usage is explained in Table 9-14.
Table 9-14. Miscellaneous Syriac Diacritic Use
Code Points | Use
U+0303, U+0330 | These are used in Swadaya to indicate letters not found in Syriac.
U+0304, U+0320 | These are used for various purposes ranging from phonological to grammatical to orthographic markers.
U+0307, U+0323, U+1DF8, U+1DFA | These points are used for various purposes: grammatical, phonological, and otherwise. They differ typographically and semantically from the qushshaya, rukkakha points, and the dotted vowel points. If the point appears above or below a single letter, U+0307 or U+0323 should be used. In contrast, if the point appears between two letters (above or below), U+1DF8 or U+1DFA should be used following the first letter in the encoded character sequence.
U+0308 | This is the plural marker. It is also used in Garshuni for the Arabic teh marbuta.
U+030A, U+0325 | These are two other forms for the indication of qushshaya and rukkakha. They are used interchangeably with U+0741 SYRIAC QUSHSHAYA and U+0742 SYRIAC RUKKAKHA, especially in West Syriac grammar books.
U+0324 | This diacritical mark is found in ancient manuscripts. It has a grammatical and phonological function.
U+032D | This is one of the digit markers.
U+032E | This is a mark used in late and modern East Syriac texts as well as in Swadaya to indicate a fricative pe.
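The choice among the four general dot marks in Table 9-14 can be sketched as a small lookup in Python. The helper name and the "single"/"between" labels are hypothetical. The pairing of "above" with U+0307 and U+1DF8, and "below" with U+0323 and U+1DFA, follows the character names of those combining marks.

```python
# Mapping from (vertical position, placement) to the combining mark to encode.
DOT_MARKS = {
    ("above", "single"): "\u0307",   # COMBINING DOT ABOVE
    ("below", "single"): "\u0323",   # COMBINING DOT BELOW
    ("above", "between"): "\u1DF8",  # COMBINING DOT ABOVE LEFT
    ("below", "between"): "\u1DFA",  # COMBINING DOT BELOW LEFT
}


def add_point(letter: str, position: str, placement: str = "single") -> str:
    """Return the letter followed by the dot mark for the given placement.

    For placement="between", `letter` is the first of the two letters,
    since the mark follows the first letter in the encoded sequence.
    """
    return letter + DOT_MARKS[(position, placement)]
```

For example, `add_point("\u0712", "above", "between")` yields beth followed by U+1DF8; the second letter of the pair would then be appended by the caller.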
Use of Characters of the Arabic Block.
Syriac makes use of several characters from the Arabic block, including
U+0640
ARABIC TATWEEL
. Modern texts use
U+060C
ARABIC COMMA,
U+061B
ARABIC SEMICOLON
, and
U+061F
ARABIC QUESTION MARK
. The
shadda
(U+0651) is also used in the core part of literary Syriac on top of a
waw
in the word “O”. Arabic
harakat
are used in Garshuni to indicate the corresponding Arabic vowels and diacritics.
9.3.2 Syriac Shaping
Minimum Rendering Requirements.
Rendering requirements for Syriac are similar to those for Arabic. The remainder of this section specifies a minimum set of rules that provides legible Syriac joining and ligature substitution behavior.
Joining Types.
Each Syriac letter must be depicted by one of a number of possible contextual glyph forms. The appropriate form is determined on the basis of the cursive joining behavior of that character as it interacts with the cursive joining behavior of adjacent characters. The basic joining types are identical to those specified for the Arabic script, and are specified in the file ArabicShaping.txt in the Unicode Character Database. However, there are additional contextual rules which govern the shaping of
U+0710
SYRIAC LETTER ALAPH
in final position. The additional glyph types associated with final
alaph
are listed in Table 9-15.
Table 9-15. Syriac Final Alaph Glyph Types
Glyph Type | Description
fj | Final joining (alaph only)
fn | Final non-joining except following dalath and rish (alaph only)
fx | Final non-joining following dalath and rish (alaph only)
In the following rules,
alaph
refers to
U+0710
SYRIAC LETTER ALAPH
, which has Joining_Group = Alaph.
These rules are intended to augment joining rules for Syriac which would otherwise parallel the joining rules specified for Arabic in
Section 9.2, Arabic
. Characters with Joining_Type = Transparent are skipped over when applying the Syriac rules for shaping of
alaph
. In other words, the Syriac parallel for Arabic joining rule R1 would take precedence over the
alaph
joining rules.
S1 An alaph that has a left-joining character to its right and a non-joining character (or end of text) to its left will take the form of Afj.
S2 An alaph that has a non-left-joining character to its right, except for a character with Joining_Group = Dalath_Rish, and a non-joining character (or end of text) to its left will take the form of Afn.
S3 An alaph that has a character with Joining_Group = Dalath_Rish to its right and a non-joining character (or end of text) to its left will take the form of Afx.
The example in rule S3 is shown in the East Syriac font style.
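Rules S1 through S3 can be sketched in Python as a classifier for final alaph. This is a minimal illustration under several stated assumptions, not a conformant shaping engine. The joining data is a hand-copied excerpt covering only U+0710..U+072F (real code should read ArabicShaping.txt). Transparent characters are approximated by general category. "Left-joining character to its right" is read as a preceding dual-joining letter or ZWJ. Start of text is treated like a non-joining neighbor. All function names are hypothetical.

```python
import unicodedata

ALAPH = "\u0710"
ZWNJ, ZWJ = "\u200C", "\u200D"

# Assumed excerpt of joining data for the basic Syriac letters.
DALATH_RISH = set("\u0715\u0716\u072A\u072F")
RIGHT_JOINING = DALATH_RISH | set("\u0710\u0717\u0718\u0719\u071E\u0728\u072C")
DUAL_JOINING = {chr(c) for c in range(0x0712, 0x0730)} - RIGHT_JOINING


def is_transparent(ch: str) -> bool:
    """Approximate Joining_Type = Transparent by general category."""
    if ch in (ZWNJ, ZWJ):
        return False
    return unicodedata.category(ch) in ("Mn", "Me", "Cf")


def neighbor(text: str, i: int, step: int):
    """Nearest non-transparent character before (-1) or after (+1) index i."""
    j = i + step
    while 0 <= j < len(text) and is_transparent(text[j]):
        j += step
    return text[j] if 0 <= j < len(text) else None


def final_alaph_form(text: str, i: int):
    """Return 'fj', 'fn', or 'fx', or None when rules S1-S3 do not apply."""
    if text[i] != ALAPH:
        return None
    # In right-to-left text, the character "to the left" is the next one
    # in logical order; it must be non-joining (or end of text).
    left = neighbor(text, i, +1)
    if left is not None and (left in DUAL_JOINING or left in RIGHT_JOINING or left == ZWJ):
        return None
    right = neighbor(text, i, -1)  # previous character in logical order
    if right is not None and (right in DUAL_JOINING or right == ZWJ):
        return "fj"  # rule S1
    if right in DALATH_RISH:
        return "fx"  # rule S3
    return "fn"      # rule S2
```

Skipping transparent characters in `neighbor` mirrors the statement that characters with Joining_Type = Transparent are passed over when the alaph rules are applied.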
Malayalam LLA.
U+0868
SYRIAC LETTER MALAYALAM LLA
normally connects to the right, but because it joins on both sides in some manuscripts, it is designated dual-joining. To represent right-joining
lla
, the ZWNJ should be employed to make sure it does not connect to the left-side letter.
Syriac Character Joining Groups.
Syriac characters can be subdivided into shaping groups, based on the behavior of their letter skeletons when shaped in context. The Unicode character property that specifies these groups is called Joining_Group, and is specified in ArabicShaping.txt in the Unicode Character Database. It is described in the subsection on character joining groups in
Section 9.2, Arabic.
Table 9-16
exemplifies dual-joining Syriac characters and illustrates the forms taken by the letter skeletons in context. This table and the subsequent table use the Serto (West Syriac) font style, whereas the Unicode code charts are in the Estrangela font style.
Table 9-16. Dual-Joining Syriac Characters
Joining Group | Notes
BETH | Includes PERSIAN BHETH
GAMAL | Includes GAMAL GARSHUNI and PERSIAN GHAMAL
HETH |
TETH | Includes TETH GARSHUNI
YUDH |
KAPH |
KHAPH | Sogdian
LAMADH |
MIM |
NUN |
SEMKATH |
FINAL_SEMKATH |
PE |
REVERSED_PE |
FE | Sogdian
QAPH |
SHIN |
MALAYALAM_NGA | Suriyani Malayalam
MALAYALAM_NYA |
MALAYALAM_TTA |
MALAYALAM_NNA |
MALAYALAM_NNNA |
MALAYALAM_LLA |
The skeleton patterns shown in
Table 9-16
include six of the Garshuni characters encoded in the Syriac Supplement block (U+0860, U+0862..U+0865, U+0868) that are also dual-joining, and have their own joining group values.
U+0868
SYRIAC LETTER MALAYALAM LLA
, in particular, normally connects only to the right, but occasionally occurs connected on both sides. That letter is given the dual-joining property value. For instances when a right-joining
lla
occurs in a manuscript, it may be represented with the sequence <0868, ZWNJ>.
Table 9-17
exemplifies right-joining Syriac characters, illustrating the forms they take in context. Right-joining characters have only two distinct forms, for isolated and final contexts, respectively.
Table 9-17. Right-Joining Syriac Characters
Joining Group | Notes
DALATH_RISH | Includes RISH, DOTLESS DALATH RISH, and PERSIAN DHALATH
HE |
SYRIAC_WAW |
ZAIN |
ZHAIN | Sogdian
YUDH_HE |
SADHE |
TAW |
MALAYALAM_RA | Suriyani Malayalam
MALAYALAM_LLLA |
MALAYALAM_SSA |
Table 9-17
includes three of the Garshuni characters encoded in the Syriac Supplement block (U+0867, U+0869, U+086A) that are also right-joining, and have their own joining group values. The two other characters encoded in that block,
U+0861
SYRIAC LETTER MALAYALAM JA
and
U+0866
SYRIAC LETTER MALAYALAM BHA
, not shown in the tables above, do not connect either to the right or the left.
U+0710
SYRIAC LETTER ALAPH
has the Joining_Group = Alaph and is a right-joining character. However, as specified above in rules S1, S2, and S3, its glyph is subject to additional contextual shaping.
Table 9-18
illustrates all of the glyph forms for
alaph
in each of the three major Syriac type styles.
Table 9-18.
Syriac Alaph Glyph Forms
Type Style
fj
fn
fx
Estrangela
Serto (West Syriac)
East Syriac
Ligature Classes.
As in other scripts, ligatures in Syriac vary depending on the font style.
Table 9-19
identifies the principal valid ligatures for each font style. In some cases, the ligatures are obligatory; those cases are highlighted in bold italic in the table.
Table 9-19. Syriac Ligatures
Characters | Estrangela | Serto (West Syriac) | East Syriac | Sources
ALAPH LAMADH | N/A | Dual-joining | N/A | Beth Gazo
GAMAL LAMADH | N/A | Dual-joining | N/A | Armalah
GAMAL E | N/A | Dual-joining | N/A | Armalah
HE YUDH | N/A | N/A | Right-joining | Qdom
YUDH TAW | N/A | Right-joining | N/A | Armalah
KAPH LAMADH | N/A | Dual-joining | N/A | Shḥimo
KAPH TAW | N/A | Right-joining | N/A | Armalah
LAMADH SPACE ALAPH | N/A | Right-joining | N/A | Nomocanon
LAMADH ALAPH | Right-joining | Right-joining | Right-joining | BFBS
LAMADH LAMADH | N/A | Dual-joining | N/A | Shḥimo
NUN ALAPH | N/A | Right-joining | N/A | Shḥimo
SEMKATH TETH | N/A | Dual-joining | N/A | Qurobo
SADHE NUN | Right-joining | Right-joining | Right-joining | Mushḥotho
RISH SEYAME | Right-joining | Right-joining | Right-joining | BFBS
TAW ALAPH | Right-joining | N/A | Right-joining | Qdom
TAW YUDH | N/A | N/A | Right-joining |
9.3.3 Syriac Supplement: U+0860–U+086F
The Syriac Supplement block contains characters used to write a dialect of Malayalam called Suriyani Malayalam, which is also known as Garshuni (Karshoni) or Syriac Malayalam.
9.4 Samaritan
9.4.1 Samaritan: U+0800–U+083F
The Samaritan script is used today by small Samaritan communities in Israel and the Palestinian Territories to write the Samaritan Hebrew and Samaritan Aramaic languages, primarily for religious purposes. The Samaritan religion is related to an early form of Judaism, but the Samaritans did not leave Palestine during the Babylonian exile, so the script evolved from the linear Old Hebrew script, most likely directly descended from Phoenician (see
Section 10.3, Phoenician
). In contrast, the more recent square Hebrew script associated with Judaism derives from the Imperial Aramaic script (see
Section 10.4, Imperial Aramaic
) used widely in the region during and after the Babylonian exile, and thus well-known to educated Hebrew speakers of that time.
Like the Phoenician and Hebrew scripts, Samaritan has 22 consonant letters. The consonant letters do not form ligatures, nor do they have explicit final forms as some Hebrew consonants do.
Directionality.
The Samaritan script is written from right to left. Conformant implementations of Samaritan script must use the Unicode Bidirectional Algorithm. For more information, see Unicode Standard Annex #9, “Unicode Bidirectional Algorithm.”
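As a small illustration of how the Bidirectional Algorithm sees Samaritan text, the Python sketch below inspects Bidi_Class values with the standard `unicodedata` module and picks a paragraph direction from the first strong character. The `base_direction` helper is a simplified stand-in for rules P2 and P3 of UAX #9 (it ignores isolates and embeddings) and is an assumption of this sketch, not a full implementation.

```python
import unicodedata


def bidi_classes(text: str) -> list:
    """Return the Bidi_Class of each character in the string."""
    return [unicodedata.bidirectional(ch) for ch in text]


def base_direction(text: str) -> str:
    """Pick 'rtl' or 'ltr' from the first strong character (simplified)."""
    for ch in text:
        bc = unicodedata.bidirectional(ch)
        if bc == "L":
            return "ltr"
        if bc in ("R", "AL"):
            return "rtl"
    return "ltr"  # default when no strong character is present
```

Samaritan letters report class R and the combining marks report NSM, so a paragraph that opens with a Samaritan letter resolves to right-to-left under this sketch.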
Vowel Signs.
Vowel signs are optional in Samaritan, just as points are optional in Hebrew. Combining marks are used for vowels that follow a consonant, and are rendered above and to the left of the base consonant. With the exception of
o and
short a
, vowels may have up to three lengths (normal, long, and overlong), which are distinguished by the size of the corresponding vowel sign.
Sukun
is centered above the corresponding base consonant and indicates that no vowel follows the consonant.
Two vowels,
i and
short a
, may occur in a word-initial position preceding any consonant. In this case, the separate spacing versions
U+0828
SAMARITAN MODIFIER LETTER I
and
U+0824
SAMARITAN MODIFIER LETTER SHORT A
should be used instead of the normal combining marks.
When
U+0824
SAMARITAN MODIFIER LETTER SHORT A
follows a letter used numerically, it indicates thousands, similar to the use of
U+05F3
HEBREW PUNCTUATION GERESH
for the same purpose in Hebrew.
Consonant Modifiers.
The two marks,
U+0816
SAMARITAN MARK IN
and
U+0817
SAMARITAN MARK IN-ALAF
, are used to indicate a pharyngeal voiced fricative /ʕ/. These occur immediately following their base consonant and preceding any vowel signs, and are rendered above and to the right of the base consonant.
U+0818
SAMARITAN MARK OCCLUSION
“strengthens” the consonant, for example changing /w/ to /b/.
U+0819
SAMARITAN MARK DAGESH
indicates consonant gemination. The
occlusion
and
dagesh
marks may both be applied to the same consonant, in which case the
occlusion
mark should precede the
dagesh
in logical order, and the
dagesh
is rendered above the
occlusion
mark. The
occlusion
mark is also used to designate personal names to distinguish them from homographs.
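The logical ordering of the two marks can be checked mechanically. The Python sketch below swaps any dagesh that was typed directly before an occlusion mark on the same consonant, so that the occlusion mark comes first. The function name is hypothetical, and the sketch assumes the two marks are adjacent, with no other mark between them.

```python
OCCLUSION = "\u0818"  # SAMARITAN MARK OCCLUSION
DAGESH = "\u0819"     # SAMARITAN MARK DAGESH


def fix_occlusion_dagesh_order(text: str) -> str:
    """Put the occlusion mark before the dagesh wherever they are reversed."""
    return text.replace(DAGESH + OCCLUSION, OCCLUSION + DAGESH)
```

Text that is already in the recommended order passes through unchanged.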
Epenthetic yut
represents a kind of glide-vowel which interacts with another vowel. It was originally used only with the consonants
alaf,
iy,
it
, and
in
, in combination with a vowel sign. The combining
U+081B
SAMARITAN MARK EPENTHETIC YUT
should be used for this purpose. When
epenthetic yut
is not fixed to one of the four consonants listed above, a new behavior evolved in which the mark for the
epenthetic yut
behaves as a spacing character, capable of bearing its own diacritical mark.
U+081A
SAMARITAN MODIFIER LETTER EPENTHETIC YUT
should be used instead to represent the
epenthetic yut
in this context.
Punctuation.
Samaritan uses a large number of punctuation characters.
U+0830
SAMARITAN PUNCTUATION NEQUDAA
and
U+0831
SAMARITAN PUNCTUATION AFSAAQ
(“interruption”) are similar to the Hebrew
sof pasuq
and were originally used to separate sentences, and later to mark lesser breaks within a sentence. They have also been described respectively as “semicolon” and “pause.” Samaritan also uses a smaller dot as a word separator, which can be represented by
U+2E31
WORD SEPARATOR MIDDLE DOT.
U+083D
SAMARITAN PUNCTUATION SOF MASHFAAT
is equivalent to the full stop.
U+0832
SAMARITAN PUNCTUATION ANGED
(“restraint”) indicates a break somewhat less strong than an
afsaaq.
U+083E
SAMARITAN PUNCTUATION ANNAAU
(“rest”) is stronger than the
afsaaq
and indicates that a longer time has passed between actions narrated in the sentences it separates.
U+0839
SAMARITAN PUNCTUATION QITSA
is similar to the
annaau
but is used more frequently. The
qitsa
marks the end of a section, and may be followed by a blank line to further make the point. It has many glyph variants. One important variant,
U+0837
SAMARITAN PUNCTUATION MELODIC QITSA
, differs significantly from any of the others, and indicates the end of a sentence “which one should read melodically.”
Many of the punctuation characters are used in combination with each other, for example: afsaaq + nequdaa or nequdaa + afsaaq, qitsa + nequdaa, and so on.
U+0836
SAMARITAN ABBREVIATION MARK
follows an abbreviation.
U+082D
SAMARITAN MARK NEQUDAA
is an editorial mark which indicates that there is a variant reading of the word.
Other Samaritan punctuation characters mark some prosodic or performative attributes of the text preceding them, as summarized in Table 9-20.
Table 9-20. Samaritan Performative Punctuation Marks
Code Point | Name | Description
0833 | bau | request, prayer, humble petition
0834 | atmaau | expression of surprise
0835 | shiyyaalaa | question
0838 | ziqaa | shout, cry
083A | zaef | outburst indicating vehemence or anger
083B | turu | didactic expression, a “teaching”
083C | arkaanu | expression of submissiveness
9.5 Mandaic
9.5.1 Mandaic: U+0840–U+085F
The origins of the Mandaic script are unclear, but it is thought to have evolved between the second and seventh centuries
CE
from a cursivized form of the Aramaic script (as did the Syriac script) or from the Parthian chancery script. It was developed by adherents of the Mandaean gnostic religion of southern Mesopotamia to write the dialect of Eastern Aramaic they used for liturgical purposes, which is referred to as Classical Mandaic.
The religion has survived into modern times, with more than 50,000 Mandaeans in several communities worldwide (most having left what is now Iraq). In addition to the Classical Mandaic still used within some of these communities, a variety known as Neo-Mandaic or Modern Mandaic has developed and is spoken by a small number of people. Mandaeans consider their script sacred, with each letter having specific mystic properties, and the script has changed very little over time.
Letter It.
The character U+0847
MANDAIC LETTER IT
is a pharyngeal, pronounced [hu]. It can appear at the end of personal names or at the end of words to indicate the third person singular suffix.
Structure.
Mandaic is unusual among Semitic scripts in being a true alphabet; the letters
halqa,
ushenna,
aksa
, and
in
are used to write both long and short forms of vowels, instead of functioning as consonants also used to write long vowels (
matres lectionis
), in the manner characteristic of other Semitic scripts. This is possible because some consonant sounds represented by the corresponding letters in other Semitic scripts are not used in the Mandaic language.
The character
U+0856
MANDAIC LETTER DUSHENNA
, also called
adu
, has a morphemic function. It is used to write the relative pronoun and the genitive exponent
di. Dushenna
is a digraph derived from an old ligature for
ad +
aksa
. It is thus an addition to the usual Semitic set of 22 characters. The Mandaic alphabet is traditionally represented as the 23 letters
halqa
through
dushenna
, with
halqa
appended again at the end to form a symbolically-important cycle of 24 letters.
Two additional Mandaic characters are encoded in the Unicode Standard:
U+0858
MANDAIC LETTER AIN
is a borrowing from
U+0639
ARABIC LETTER AIN
. The second additional character,
U+0857
MANDAIC LETTER KAD
, is a digraph used to write the word
kd
, which means “when, as, like”. There are two ways to represent
kad
in Mandaic:
U+0857
MANDAIC LETTER KAD
or the sequence <
U+084A
MANDAIC LETTER AK,
U+0856
MANDAIC LETTER DUSHENNA
>.
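Because kad has two representations, a search or comparison routine may want to treat them alike. The Python sketch below folds U+0857 to the two-letter sequence before comparing. Both function names are hypothetical, and the folding is an application-level choice assumed here for illustration: it is not presented as an equivalence defined by the standard, and Unicode normalization is not expected to unify the two spellings.

```python
KAD = "\u0857"                # MANDAIC LETTER KAD
AK_DUSHENNA = "\u084A\u0856"  # <MANDAIC LETTER AK, MANDAIC LETTER DUSHENNA>


def fold_kad(text: str) -> str:
    """Rewrite every single-character kad as the ak + dushenna sequence."""
    return text.replace(KAD, AK_DUSHENNA)


def same_ignoring_kad_spelling(a: str, b: str) -> bool:
    """Compare two strings while ignoring which kad spelling each one uses."""
    return fold_kad(a) == fold_kad(b)
```

Folding toward the sequence rather than toward U+0857 keeps the helper a plain one-way replacement; the opposite direction would work equally well for comparison purposes.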
The Joining_Type values for
U+0856
MANDAIC LETTER DUSHENNA,
U+0857
MANDAIC LETTER KAD
, and
U+0858
MANDAIC LETTER AIN
were changed in Unicode Version 13.0 from Non_Joining to Right_Joining. See
Table 9-22
. In cases where the isolated form of
dushenna,
ain
, or
kad
following a right join-causing character is desired, a
U+200C
ZERO WIDTH NON-JOINER
should be employed to prevent joining with the previous character. (See
Table 9-4
for the definition of a right join-causing character.)
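The ZWNJ convention for these three letters can be sketched in Python as follows. The set of dual-joining letters is built from the names listed in Table 9-21 via `unicodedata.lookup`. The function name is hypothetical. Treating only dual-joining letters and ZWJ as right join-causing, and ignoring combining marks between the two letters, are simplifying assumptions of this sketch.

```python
import unicodedata

ZWNJ, ZWJ = "\u200C", "\u200D"

# Dual-joining Mandaic letters, taken from the names in Table 9-21.
DUAL_NAMES = "AB AG AD AH USHENNA ATT AK AL AM AN AS IN AP ASZ AQ AR AT".split()
DUAL_JOINING = {unicodedata.lookup("MANDAIC LETTER " + n) for n in DUAL_NAMES}
RIGHT_JOIN_CAUSING = DUAL_JOINING | {ZWJ}

# dushenna, kad, and ain: right-joining since Unicode 13.0.
TARGETS = {"\u0856", "\u0857", "\u0858"}


def force_isolated(text: str) -> str:
    """Insert ZWNJ before dushenna, kad, or ain when the previous
    character would otherwise cause them to join."""
    out = []
    for ch in text:
        if ch in TARGETS and out and out[-1] in RIGHT_JOIN_CAUSING:
            out.append(ZWNJ)
        out.append(ch)
    return "".join(out)
```

No ZWNJ is added after a right-joining letter such as halqa, since that letter does not cause a join on its left side in the first place.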
Three diacritical marks are used in teaching materials to differentiate vowel quality; they may be omitted from ordinary text.
U+0859
MANDAIC AFFRICATION MARK
is used to extend the character set for foreign sounds (whether affrication, lenition, or another sound).
U+085A
MANDAIC VOCALIZATION MARK
is used to distinguish vowel quality of
halqa,
ushenna
, and
aksa.
U+085B
MANDAIC GEMINATION MARK
is used to indicate what native writers call a “hard” pronunciation.
Punctuation.
Sentence punctuation is used sparsely. A single script-specific punctuation mark is encoded:
U+085E
MANDAIC PUNCTUATION
. It is used to start and end text sections, and is also used in colophons—the historical lay text added to the religious text—where it is typically displayed in a smaller size.
Directionality.
The Mandaic script is written from right to left. Conformant implementations of Mandaic script must use the Unicode Bidirectional Algorithm (see Unicode Standard Annex #9, “Unicode Bidirectional Algorithm”).
Shaping and Layout Behavior.
Mandaic has fully-developed joining behavior, with forms as shown in
Table 9-21
and
Table 9-22
. In these tables, Xn, Xr, Xm, and Xl designate the nominal, right-joining, dual-joining (medial), and left-joining forms respectively, just as in Table 9-6, Table 9-7, and Table 9-8.
Table 9-21.
Dual-Joining Mandaic Characters
Character
AB
AG
AD
AH
USHENNA
ATT
AK
AL
AM
AN
AS
IN
AP
ASZ
AQ
AR
AT
Table 9-22.
Right-Joining Mandaic Characters
Character
HALQA
AZ
IT
AKSA
ASH
DUSHENNA
KAD
AIN
Line Breaking.
Spaces provide the primary line break opportunity. When text is fully justified, words may be stretched as in Arabic.
U+0640
ARABIC TATWEEL
may be inserted for this purpose.
9.6 Yezidi
9.6.1 Yezidi: U+10E80–U+10EBF
The Yezidi script was used to write two religious texts,
Masḥafā Reš
and
Ketēbā Jelwa
, which may date to the 12th or 13th centuries. The history of the script between the creation of these texts and the current period is unclear; however, the Spiritual Council of Yezidis in Georgia decided to revive the script in 2013. As part of the revitalization, two specialists modified the script to represent the Yezidi language (called
Êzdîkî
in the vernacular), which is also referred to as the Kurmanji language. This language can also be written in the Latin, Cyrillic, and Arabic scripts. Today, clergy in the Yezidi temple in Tbilisi use the Yezidi script for prayers, sacred books, and other purposes.
Structure.
Yezidi is an alphabet, written right to left. Ligatures occur in the historical texts, but not in the modern version of the script.
Letters.
A set of ten letters has been added to the repertoire to represent the modern Kurmanji language. Two historic letters with diacritics are separately encoded as atomic characters:
U+10EB0
YEZIDI LETTER LAM WITH DOT ABOVE
and
U+10EB1
YEZIDI LETTER YOT WITH CIRCUMFLEX ABOVE
. The letters with diacritic marks have distinct pronunciation:
YEZIDI LETTER LAM WITH DOT ABOVE
is pronounced [ɫ], instead of [l], and
YEZIDI LETTER YOT WITH CIRCUMFLEX ABOVE
is pronounced [e], instead of [j].
Long
is indicated by a ligature of <
U+10EA3
YEZIDI LETTER UM,
U+10EA3
YEZIDI LETTER UM
>. This sequence of two
um
characters may appear kerned or unkerned, without difference in meaning.
Diacritics.
Two combining diacritics,
U+10EAB
YEZIDI COMBINING HAMZA MARK
and
U+10EAC
YEZIDI COMBINING MADDA MARK
, appear in words of Arabic origin. Additional diacritics appear in the
Masḥafā Reš
, but the meaning of the marks is unclear, so they are not currently encoded.
Punctuation.
U+10EAD
YEZIDI HYPHENATION MARK
may appear above the last letter in a line to indicate a word break. In historic texts, the hyphenation mark may appear at the beginning of a line or above the last letter in a line. Occasionally, the mark can be used to denote long phonemes within a word, but this usage does not apply to modern texts.
Yezidi also uses
U+060C
ARABIC COMMA,
U+061B
ARABIC SEMICOLON
, and
U+061F
ARABIC QUESTION MARK
, in addition to
U+002E
FULL STOP
and
U+003A
COLON.
Numbers.
Older texts employ Arabic-Indic numbers (U+0660..U+0669), but Western digits are preferred in modern usage.
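When older text is processed alongside modern text, the two digit sets may need to be reconciled. The Python sketch below maps Arabic-Indic digits (U+0660..U+0669) to Western digits with `str.translate`. The function name is hypothetical, and whether such a conversion is appropriate for a given text is left to the application.

```python
# Translation table: each Arabic-Indic digit maps to its ASCII counterpart.
ARABIC_INDIC_TO_WESTERN = {0x0660 + d: ord("0") + d for d in range(10)}


def to_western_digits(text: str) -> str:
    """Replace Arabic-Indic digits with Western digits; leave the rest as is."""
    return text.translate(ARABIC_INDIC_TO_WESTERN)
```

For purely numeric strings the conversion is not strictly needed in Python, because `int()` already accepts Unicode decimal digits.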