Internationalization Tag Set (ITS) Version 2.0
W3C Recommendation 29 October 2013
This version:
Latest version:
Previous version:
Editors:
David Filip, University of Limerick
Shaun McCance, Invited Expert
Dave Lewis, TCD
Christian Lieske, SAP AG
Arle Lommel, DFKI
Jirka Kosek, UEP
Felix Sasaki, DFKI / W3C Fellow
Yves Savourel, ENLASO
Please refer to the errata for this document, which may include some normative corrections.
See also translations.
This document is also available in these non-normative formats: ODD/XML document, self-contained zipped archive, and XHTML Diff markup to previous publication 2013-09-24.
Copyright © 2013 W3C (MIT, ERCIM, Keio, Beihang), All Rights Reserved. W3C liability, trademark and document use rules apply.
Abstract
The technology described in this document, “Internationalization Tag Set (ITS) 2.0”, enhances the foundation to integrate automated processing of human language into core Web technologies. ITS 2.0 bears many commonalities with its predecessor, ITS 1.0, but provides additional concepts that are designed to foster the automated creation and processing of multilingual Web content. ITS 2.0 focuses on HTML and on XML-based formats in general, and can leverage processing based on the XML Localization Interchange File Format (XLIFF), as well as the Natural Language Processing Interchange Format (NIF).
Status of this Document
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index.
This document was published by the MultilingualWeb-LT Working Group as a W3C Recommendation (see W3C document maturity levels). The Working Group has completed and approved this specification's Test Suite and created an Implementation Report that shows that two or more independent implementations pass each test.
This document has been reviewed by W3C Members, by software developers, and by other W3C groups and interested parties, and is endorsed by the Director as a W3C Recommendation. It is a stable document and may be used as reference material or cited from another document. W3C's role in making the Recommendation is to draw attention to the specification and to promote its widespread deployment. This enhances the functionality and interoperability of the Web.
The ITS 2.0 specification has a normative dependency on the HTML5 specification: it relies on the HTML5 Translate attribute. By publishing this Recommendation, W3C expects that the functionality specified in this ITS 2.0 Recommendation will not be affected by changes to HTML5 as that specification proceeds to Recommendation.
If you wish to make comments, please send them to public-i18n-its-ig@w3.org. The archives for this list are publicly available. See also issues discussed within the MultilingualWeb-LT Working Group and the list of changes since the previous publication.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
Table of Contents
1 Introduction
    1.1 Overview
    1.2 General motivation for going beyond ITS 1.0
    1.3 Usage Scenarios
    1.4 High-level differences between ITS 1.0 and ITS 2.0
    1.5 Extended implementation hints
2 Basic Concepts
    2.1 Data Categories
    2.2 Selection
        2.2.1 Local Approach
        2.2.2 Global Approach
    2.3 Overriding, Inheritance and Defaults
    2.4 Adding Information or Pointing to Existing Information
    2.5 Specific HTML support
        2.5.1 Global approach in HTML5
        2.5.2 Local approach
        2.5.3 HTML markup with ITS 2.0 counterparts
        2.5.4 Standoff markup in HTML5
        2.5.5 Version of HTML
    2.6 Traceability
    2.7 Mapping and conversion
        2.7.1 ITS and RDF/NIF
        2.7.2 ITS and XLIFF
    2.8 ITS 2.0 Implementations and Conformance
3 Notation and Terminology
    3.1 Notation
    3.2 Data category
    3.3 Selection
    3.4 ITS Local Attributes
    3.5 Rule Elements
    3.6 Usage of Internationalized Resource Identifiers in ITS
    3.7 The Term HTML
    3.8 The Term CSS Selectors
4 Conformance
    4.1 Conformance Type 1: ITS Markup Declarations
    4.2 Conformance Type 2: The Processing Expectations for ITS Markup
    4.3 Conformance Type 3: Processing Expectations for ITS Markup in HTML
    4.4 Conformance Type 4: Markup conformance for HTML5+ITS documents
5 Processing of ITS information
    5.1 Indicating the Version of ITS
    5.2 Locations of Data Categories
        5.2.1 Global, Rule-based Selection
        5.2.2 Local Selection in an XML Document
    5.3 Query Language of Selectors
        5.3.1 Choosing Query Language
        5.3.2 XPath 1.0
        5.3.3 CSS Selectors
        5.3.4 Additional query languages
        5.3.5 Variables in selectors
    5.4 Link to External Rules
    5.5 Precedence between Selections
    5.6 Associating ITS Data Categories with Existing Markup
    5.7 ITS Tools Annotation
6 Using ITS Markup in HTML
    6.1 Mapping of Local Data Categories to HTML
    6.2 Global rules
    6.3 Standoff Markup in HTML
    6.4 Precedence between Selections
7 Using ITS Markup in XHTML
8 Description of Data Categories
    8.1 Position, Defaults, Inheritance, and Overriding of Data Categories
    8.2 Translate
        8.2.1 Definition
        8.2.2 Implementation
    8.3 Localization Note
        8.3.1 Definition
        8.3.2 Implementation
    8.4 Terminology
        8.4.1 Definition
        8.4.2 Implementation
    8.5 Directionality
        8.5.1 Definition
        8.5.2 Implementation
    8.6 Language Information
        8.6.1 Definition
        8.6.2 Implementation
    8.7 Elements Within Text
        8.7.1 Definition
        8.7.2 Implementation
    8.8 Domain
        8.8.1 Definition
        8.8.2 Implementation
    8.9 Text Analysis
        8.9.1 Definition
        8.9.2 Implementation
    8.10 Locale Filter
        8.10.1 Definition
        8.10.2 Implementation
    8.11 Provenance
        8.11.1 Definition
        8.11.2 Implementation
    8.12 External Resource
        8.12.1 Definition
        8.12.2 Implementation
    8.13 Target Pointer
        8.13.1 Definition
        8.13.2 Implementation
    8.14 ID Value
        8.14.1 Definition
        8.14.2 Implementation
    8.15 Preserve Space
        8.15.1 Definition
        8.15.2 Implementation
    8.16 Localization Quality Issue
        8.16.1 Definition
        8.16.2 Implementation
    8.17 Localization Quality Rating
        8.17.1 Definition
        8.17.2 Implementation
    8.18 MT Confidence
        8.18.1 Definition
        8.18.2 Implementation
    8.19 Allowed Characters
        8.19.1 Definition
        8.19.2 Implementation
    8.20 Storage Size
        8.20.1 Definition
        8.20.2 Implementation
Appendices
    References
    Internationalization Tag Set (ITS) MIME Type
    Values for the Localization Quality Issue Type
    Schemas for ITS
    Informative References
    Conversion to NIF
    Conversion NIF2ITS
    Localization Quality Guidance
    List of ITS 2.0 Global Elements and Local Attributes
    Revision Log
    Acknowledgements
1 Introduction
This section is informative.
1.1 Overview
Content or software that is authored in one language (so-called source language) for one locale (e.g. the French-speaking part of Canada) is often made available in additional languages or adapted with regard to other cultural aspects. A prevailing paradigm for multilingual production in many cases encompasses three phases: internationalization, translation, and localization (see the W3C's Internationalization Q&A for more information related to these concepts).
From the viewpoints of feasibility, cost, and efficiency, it is important that the original material is suitable for downstream phases such as translation. This is achieved by appropriate design and development. The corresponding phase is referred to as internationalization. For example, a proprietary XML vocabulary may be internationalized by defining special markup to specify directionality in mixed direction text.
During the translation phase, the meaning of a source language text is analyzed,
and a target language text that is equivalent in meaning is determined. For example,
national or international laws may regulate linguistic dimensions like mandatory
terminology or standard phrases in order to promote or ensure a translation's
fidelity.
Although an agreed-upon definition of the localization phase is missing, this
phase is usually seen as encompassing activities such as creating locale-specific
content (e.g. adding a link for a country-specific reseller), or modifying functionality
(e.g. to establish a fit with country-specific regulations for financial reporting).
Sometimes, the insertion of special markup to support a local language or script is also
subsumed under the localization phase. For example, people authoring in languages such
as Arabic, Hebrew, Persian or Urdu need special markup to specify directionality in
mixed direction text.
The technology described in this document, the Internationalization Tag Set (ITS) 2.0, addresses some of the challenges and opportunities related to internationalization, translation, and localization. ITS 2.0 in particular contributes to concepts in the realm of metadata for internationalization, translation, and localization related to core Web technologies such as XML. For example, ITS assists in production scenarios in which parts of an XML-based document are to be excluded from translation. ITS 2.0 bears many commonalities with its predecessor, ITS 1.0, but provides additional concepts that are designed to foster enhanced automated processing of multilingual Web content (e.g. based on language technology such as entity recognition).
Like ITS 1.0, ITS 2.0 both identifies concepts (such as “Translate”) and defines implementations of these concepts (termed “ITS data categories”) as a set of elements and attributes called the Internationalization Tag Set (ITS). The definitions of ITS elements and attributes are provided in the form of RELAX NG [RELAX NG] (normative). Since one major step from ITS 1.0 to ITS 2.0 relates to coverage for HTML, ITS 2.0 also establishes a relationship between ITS markup and the various HTML flavors. Furthermore, ITS 2.0 suggests when and how to leverage processing based on the XML Localization Interchange File Format ([XLIFF 1.2] and [XLIFF 2.0]), as well as the Natural Language Processing Interchange Format [NIF].
As an introductory illustration, the following examples show how ITS can indicate that certain parts of a document are not intended for translation.
Example 1: Document in which some content has to be left untranslated
In this document it is difficult to distinguish between those string elements that are intended for translation and those that are not to be translated. Explicit metadata is needed to resolve the issue.

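<resources>
  <section id="Homepage">
    <arguments>
      <string>page</string>
      <string>childlist</string>
    </arguments>
    <variables>
      <string>POLICY</string>
      <string>Corporate Policy</string>
    </variables>
    <keyvalue_pairs>
      <string>Page</string>
      <string>ABC Corporation - Policy Repository</string>
      <string>Footer_Last</string>
      <string>Pages</string>
      <string>bgColor</string>
      <string>NavajoWhite</string>
      <string>title</string>
      <string>List of Available Policies</string>
    </keyvalue_pairs>
  </section>
</resources>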
[Source file: examples/xml/EX-motivation-its-1.xml]
ITS proposes several mechanisms, which differ, among other things, in the usage scenarios and user types for which each mechanism is most suitable.
Example 2: Document that uses two different ITS mechanisms to indicate that some parts have to be left untranslated.
ITS provides two mechanisms to explicitly associate metadata with one or more pieces of content (e.g. XML nodes): a global, rule-based approach as well as a local, attribute-based approach. Here, for instance, a translateRule first specifies that only every second element inside keyvalue_pairs is intended for translation; later, an ITS translate attribute specifies that one of these elements is not to be translated.
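<resources xmlns:its="http://www.w3.org/2005/11/its" its:version="2.0">
  <its:rules version="2.0">
    <its:translateRule selector="//arguments" translate="no"/>
    <its:translateRule selector="//keyvalue_pairs/string[(position() mod 2)=1]" translate="no"/>
  </its:rules>
  <section id="Homepage">
    <arguments>
      <string>page</string>
      <string>childlist</string>
    </arguments>
    <variables>
      <string its:translate="no">POLICY</string>
      <string>Corporate Policy</string>
    </variables>
    <keyvalue_pairs>
      <string>Page</string>
      <string>ABC Corporation - Policy Repository</string>
      <string>Footer_Last</string>
      <string>Pages</string>
      <string>bgColor</string>
      <string its:translate='no'>NavajoWhite</string>
      <string>title</string>
      <string>List of Available Policies</string>
    </keyvalue_pairs>
  </section>
</resources>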
[Source file: examples/xml/EX-motivation-its-2.xml]
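The interplay of the two mechanisms in Example 2 can be illustrated with a small, non-normative Python sketch. It is not an ITS processor: the two translateRule selectors are emulated by hand, because the standard library's ElementTree does not evaluate position() or mod, and the wrapper element names resources, section, and variables in the abridged sample are assumptions made for this illustration. The sketch assumes that local markup takes precedence over global rules, which in turn take precedence over inherited and default values.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"

# Abridged stand-in for the Example 2 content. The wrapper element names
# (resources, section, variables) are assumptions made for this sketch.
DOC = """\
<resources xmlns:its="http://www.w3.org/2005/11/its" its:version="2.0">
  <section id="Homepage">
    <arguments>
      <string>page</string>
      <string>childlist</string>
    </arguments>
    <variables>
      <string its:translate="no">POLICY</string>
      <string>Corporate Policy</string>
    </variables>
    <keyvalue_pairs>
      <string>Page</string>
      <string>ABC Corporation - Policy Repository</string>
      <string>bgColor</string>
      <string its:translate="no">NavajoWhite</string>
    </keyvalue_pairs>
  </section>
</resources>
"""


def globally_untranslatable(root):
    """Emulate the two translateRule selectors of Example 2 by hand."""
    selected = set()
    # selector="//arguments"
    selected.update(root.iter("arguments"))
    # selector="//keyvalue_pairs/string[(position() mod 2)=1]"
    for pairs in root.iter("keyvalue_pairs"):
        selected.update(pairs.findall("string")[0::2])
    return selected


def resolve_translate(root):
    """Return {element: "yes" | "no"} for every element under root."""
    global_no = globally_untranslatable(root)
    resolved = {}

    def walk(elem, inherited):
        value = inherited                 # start from the inherited/default value
        if elem in global_no:             # a global rule overrides inheritance
            value = "no"
        local = elem.get(f"{{{ITS_NS}}}translate")
        if local is not None:             # local markup overrides global rules
            value = local
        resolved[elem] = value
        for child in elem:
            walk(child, value)

    walk(root, "yes")                     # elements default to translate="yes"
    return resolved


root = ET.fromstring(DOC)
flags = {e.text: v for e, v in resolve_translate(root).items() if e.tag == "string"}
for text, value in flags.items():
    print(f"{value:3}  {text}")
```

In this sketch the content of arguments is untranslatable through inheritance, the odd-positioned string elements in keyvalue_pairs through the second rule, and POLICY and NavajoWhite through their local its:translate attributes.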
1.2 General motivation for going beyond ITS 1.0
The basics of ITS 1.0 are simple:
Provide metadata (e.g. “Do not translate”) to assist internationalization-related processes
Use XPath (so-called global approach) to associate metadata with specific XML nodes (e.g. all elements named uitext) or put the metadata straight onto the XML nodes themselves (so-called local approach)
Work with a well-defined set of metadata categories or values (e.g. only the values "yes" and "no" for certain data categories)
Take advantage of existing metadata (e.g. terms already marked up with HTML markup such as dt)
This conciseness made real-world deployment of ITS 1.0 easy. The deployments helped to identify additional metadata categories for internationalization-related processes. The ITS Interest Group, for example, compiled a list of additional data categories (see this related summary). Some of these were then defined in ITS 2.0: ID Value, local Elements Within Text, Preserve Space, and Locale Filter. Others are still discussed as requirements for possible future versions of ITS:
“Context” = What specific related information might be helpful?
“Automated Language” = Does this content lend itself to automatic processing?
The real-world deployments also showed that, for the Open Web Platform, the ITS 1.0 restriction to XML was an obstacle in quite a number of environments. What was missing was, for example, the following:
Applicability of ITS to formats such as HTML in general, and HTML5 in particular
Easy use of ITS in various Web-exposed (multilingual) Natural Language Processing contexts
Computer-supported linguistic quality assurance
Content Management and translation platforms
Cross-language scenarios
Content enrichment
Support for W3C provenance [PROV-DM], “information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness”
Provisions for extended deployment in Semantic Web/Linked Open Data scenarios
ITS 2.0 was created by an alliance of stakeholders who are involved in content for global use. Thus, ITS 2.0 was developed with input from/with a view towards the following:
Providers of content management and machine translation solutions who want to easily integrate for efficient content updates in multilingual production chains
Language technology providers who want to automatically enrich content (e.g. via term candidate generation, entity recognition or disambiguation) in order to facilitate human translation
Open standards endeavours (e.g. related to [XLIFF 1.2], [XLIFF 2.0] and [NIF]) that are interested, for example, in information sharing and lossless roundtrip of metadata in localization workflows
One example outcome of the resulting synergies is the ITS Tools Annotation mechanism. It addresses the provenance-related requirement by allowing ITS processors to leave a trace: ITS processors can basically say “It is me that generated this bit of information”. Another example is the [NIF]-related part of ITS 2.0, which provides a non-normative approach to couple Natural Language Processing with concepts of the Semantic Web.
1.3 Usage Scenarios
The [ITS 1.0] introduction states: “ITS is a technology to easily create XML, which is internationalized and can be localized effectively”. In order to make this tangible, ITS 1.0 provided examples for users and usages. Implicitly, these examples carried the information that ITS covers two areas: one that is related to the static dimension of mono-lingual content, and one that is related to the dynamic dimension of multilingual production.
Static mono-lingual (for example, the area of content authors): This part of the content has the directionality “right-to-left”.
Dynamic multilingual (for example, the area of machine translation systems): This part of the content has to be left untranslated.
Although ITS 1.0 made no assumptions about possible phases in a multilingual production
process chain, it was slanted towards a simple three-phase
“write→internationalize→translate” model. Even a bird's-eye view of ITS 2.0 shows
that ITS 2.0 explicitly targets a much more comprehensive model for multilingual
content production. The model comprises support for multilingual content production
phases such as:
Internationalization
Pre-production (e.g. related to marking terminology)
Automated content enrichment (e.g. automatic hyperlinking for entities)
Extraction/filtering of translation-relevant content
Segmentation
Leveraging (e.g. of existing translation-related assets such as translation memories)
Machine Translation (e.g. geared towards a specific domain)
Quality assessment or control of source language or target language content
Generation of translation kits (e.g. packages based on XLIFF)
Post-production
Publishing
The document [MLW US IMPL] lists a large variety of usage scenarios for ITS 2.0. Most of them are composed from the aforementioned phases.
In a similar vein, ITS 2.0 takes a much more comprehensive view of the actors that may participate in a multilingual content production process. ITS 1.0 annotations (e.g. local markup for the Terminology data category) were mostly conceived as being closely tied to human actors such as content authors or information architects. ITS 2.0 raises non-human actors such as word processors/editors, content management systems, machine translation systems, term candidate generators, and entity identifiers/disambiguators to the same level. This change is reflected, among other things, by the ITS Tools Annotation mechanism of ITS 2.0, which allows systems to record that they have processed a certain part of content.
1.4 High-level differences between ITS 1.0 and ITS 2.0
The differences between ITS 1.0 and ITS 2.0 can be summarized as follows.
Coverage of [HTML5]: ITS 1.0 can be applied to XML content. ITS 2.0 extends the coverage to [HTML5]. Explanatory details about ITS 2.0 and [HTML5] are given in Section 2.5: Specific HTML support.
Addition of data categories: ITS 2.0 provides additional data categories and modifies existing ones. A summary of all ITS 2.0 data categories is given in Section 2.1: Data Categories.
Modification of data categories:
ITS 1.0 provided the Ruby data category. ITS 2.0 does not provide ruby because at the time of writing the ruby model in HTML5 was still under development. Once these discussions are settled, the Ruby data category will possibly be reintroduced in a subsequent version of ITS.
The Directionality data category reflects directionality markup in [HTML 4.01]. The reason is that enhancements are being discussed in the context of HTML5 that are expected to change the approach to marking up directionality, in particular to support content whose directionality needs to be isolated from that of surrounding content. However, these enhancements are not finalized yet. They will be reflected in a future revision of ITS.
Additional or modified mechanisms: The following mechanisms have been modified from ITS 1.0 or added in ITS 2.0:
ITS 1.0 used only XPath as the mechanism for selecting nodes in global rules. ITS 2.0 allows for choosing the query language of selectors. The default is XPath 1.0. An ITS 2.0 processor is free to support other selection mechanisms, like CSS selectors or other versions of XPath.
In global rules it is now possible to set variables for the selectors (XPath expressions). The param element serves this purpose.
ITS 2.0 has an ITS Tools Annotation mechanism to associate processor information with the use of individual data categories. See Section 2.6: Traceability for details.
Mappings: ITS 2.0 provides a non-normative algorithm to convert ITS 2.0 information into [NIF] and links to guidance about how to relate ITS 2.0 to XLIFF. See Section 2.7: Mapping and conversion for details.
Changes to the conformance section: Section 4: Conformance tells implementers how to implement ITS. For ITS 2.0, the conformance statements related to Ruby have been removed. For [HTML5], a dedicated conformance section has been created. Finally, a conformance clause related to Non-ITS elements and attributes has been added.
1.5 Extended implementation hints
As general guidance, implementations of ITS 2.0 are encouraged to use a normalizing transcoder. It converts from a legacy encoding to a Unicode encoding form and ensures that the result is in Unicode Normalization Form C. Further information on the topic of Unicode normalization is provided in [Charmod Norm].
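As a non-normative illustration, a normalizing transcoder can be sketched in a few lines of Python. The function name transcode_to_nfc and the choice of UTF-8 as the target encoding form are assumptions of this sketch, and windows-1258 (Python codec cp1258) merely serves as a sample legacy encoding that contains combining marks.

```python
import unicodedata


def transcode_to_nfc(raw: bytes, legacy_encoding: str) -> bytes:
    """Decode legacy bytes, normalize to NFC, and re-encode as UTF-8."""
    text = raw.decode(legacy_encoding)
    return unicodedata.normalize("NFC", text).encode("utf-8")


# "Cafe" followed by a combining acute accent, stored in windows-1258,
# a legacy encoding that contains combining marks.
legacy = "Cafe\u0301".encode("cp1258")
print(transcode_to_nfc(legacy, "cp1258"))
```

Normalizing after decoding, rather than before, ensures that base characters and combining marks coming from the legacy encoding are composed into their precomposed forms where such forms exist.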
2 Basic Concepts
This section is informative.
The purpose of this section is to provide basic knowledge about how ITS 2.0 works. Detailed knowledge (including formal definitions) is given in the subsequent sections.
2.1 Data Categories
A key concept of ITS is the abstract notion of data categories. Data categories define the information that can be conveyed via ITS. An example is the Translate data category. It conveys information about translatability of content.
Section 8: Description of Data Categories defines data categories. It also describes their implementation, i.e. ways to use them, for example, in an XML context. The motivation for separating data category definitions from their implementation is to enable different implementations with the following characteristics:
For various types of content (XML in general or HTML).
For a single piece of content, e.g. a single element. This is the so-called local approach.
For several pieces of content in one document or even a set of documents. This is the so-called global approach.
For a complete markup vocabulary. This is done by adding ITS markup declarations to the schema for the vocabulary.
ITS 2.0 provides the following data categories:
Translate: expresses information about whether a selected piece of content is intended for translation or not.
Localization Note: communicates notes to localizers about a particular item of content.
Terminology: marks terms and optionally associates them with information, such as definitions or references to a term data base.
Directionality: specifies the base writing direction of blocks, embeddings and overrides for the Unicode bidirectional algorithm.
Language Information: expresses the language of a given piece of content.
Elements Within Text: expresses how content of an element is related to the text flow (constitutes its own segment like paragraphs, is part of a segment like emphasis marker etc.).
Domain: identifies the topic or subject of the annotated content for translation-related applications.
Text Analysis: annotates content with lexical or conceptual information (e.g. for the purpose of contextual disambiguation).
Locale Filter: specifies that a piece of content is only applicable to certain locales.
Provenance: communicates the identity of agents that have been involved in processing content.
External Resource: indicates reference points in a resource outside the document that need to be considered during localization or translation. Examples of such resources are external images and audio or video files.
Target Pointer: associates the markup node of a given source content (i.e. the content to be translated) and the markup node of its corresponding target content (i.e. the source content translated into a given target language). This is relevant for formats that hold the same content in different languages inside a single document.
ID Value: identifies a value that can be used as unique identifier for a given part of the content.
Preserve Space: indicates how whitespace is to be handled in content.
Localization Quality Issue: describes the nature and severity of an error detected during a language-oriented quality assurance (QA) process.
Localization Quality Rating: expresses an overall measurement of the localization quality of a document or an item in a document.
MT Confidence: indicates the confidence that MT systems provide about their translation.
Allowed Characters: specifies the characters that are permitted in a given piece of content.
Storage Size: specifies the maximum storage size of a given piece of content.
Most of the existing ITS 1.0 data categories are included and new ones have been added. Modifications of existing ITS 1.0 data categories are summarized in Section 1.4: High-level differences between ITS 1.0 and ITS 2.0.
2.2 Selection
Information (e.g. “translate this”) captured by an ITS data category always pertains to one or more XML or HTML nodes, primarily element and attribute nodes. In a sense, the relevant node(s) get “selected”. Selection may be explicit or implicit. ITS distinguishes two mechanisms for explicit selection: (1) local and (2) global (via rules). Both local and global approaches can interact with each other, and with additional ITS dimensions such as inheritance and defaults.
The mechanisms defined for ITS selection resemble those defined in [CSS 2.1]. The local approach can be compared to the style attribute in HTML/XHTML, and the global approach is similar to the style element in HTML/XHTML:
The local approach puts ITS markup in the relevant element of the host vocabulary (e.g. the author element in DocBook)
The global rule-based approach puts the ITS markup in elements defined by ITS itself (namely the rules element)
ITS usually uses XPath in rules for identifying nodes, although applications can additionally implement CSS Selectors and other query languages.
ITS 2.0 can be used with XML documents (e.g. a DocBook article), HTML documents,
document schemas (e.g. an XML Schema document for a proprietary document format), or
data models in RDF.
The following two examples provide more details about the distinction between the local and global approach, using the Translate data category as an example.
2.2.1 Local Approach
The document in Example 3 shows how a content author can use the ITS translate attribute to indicate that all content inside the author element is not intended for translation (i.e. has to be left untranslated). Translation tools that are aware of the meaning of the attribute can protect the relevant content from being translated (possibly still allowing translators to see the protected content as context information).
Example 3: ITS markup on elements in an XML document (local approach)
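<article xmlns="http://docbook.org/ns/docbook"
  xmlns:its="http://www.w3.org/2005/11/its"
  its:version="2.0" version="5.0" xml:lang="en">
  <info>
    <title>An example article</title>
    <author its:translate="no">
      <personname>
        <firstname>John</firstname>
        <surname>Doe</surname>
      </personname>
      <affiliation>
        <address>
          <email>foo@example.com</email>
        </address>
      </affiliation>
    </author>
  </info>
  <para>This is a short article.</para>
</article>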