Requirements for String Identity Matching and String Indexing
WD-charreq-19980710
World Wide Web Consortium Working Draft 10-July-1998
This version:
Latest version:
Public:
WG-internal: See overview at
Previous public version:
None
Previous WG-internal version:
Editor:
Martin J. Dürst (W3C)
Status of this document
This is a W3C Working Draft for use by W3C members and other parties. This
document has been subject to extensive review by the Internationalization
Working Group. This document may be updated, replaced, or obsoleted by other
documents at any time.
This document is being written as the first step towards a character model
for W3C specifications, to make sure that the requirements of other W3C Working
Groups (and of other interested parties) are understood and can be addressed.
This document itself is not intended to proceed to Proposed Recommendation
and Recommendation, but will serve as the base for the document that will
specify the character model. Comments are very welcome and should be sent
to the editor of this Working Draft as soon as possible.
For the current status of the Internationalization Activity, see
Abstract
This document describes the requirements for some important aspects of the
character model for W3C specifications. The two aspects discussed are string
identity matching and string indexing. Both aspects are considered to be
vital for the seamless interaction of many components of the current and
future web architecture.
Table of Contents
Introduction
Background
Potential users of the resulting specification
Structure of this Document
Scope
String identity matching
Problem
The string identity matching specification shall be defined
exactly
The string identity matching specification shall not expose
invisible encoding differences to the user
The string identity matching specification shall not treat
as equivalent characters that can usually be distinguished by the user
The string identity matching specification shall be
forward-compatible
The string identity matching specification shall be broadly
applicable
The string identity matching specification shall be workable
with opaque identifiers and data
The string identity matching specification shall allow you to
be conservative in what you send
The string identity matching specification shall be prepared quickly
Solutions for string identity matching
Early uniform normalization
Problem
The location of early uniform normalization shall be
specified
Early uniform normalization shall be based on widespread practice
Early uniform normalization shall be specified in collaboration
with the expert communities on character encoding
Early uniform normalization shall be feasible to implement
Reference software for early uniform normalization shall be
provided
Test cases for early uniform normalization shall be provided
String indexing
Problem Description
String indexing shall behave consistently across
implementations
String indexing shall take into account user expectations
String indexing shall be able to address "characters" at various
levels
String indexing shall be forward-compatible
String indexing shall be feasible to implement
The String indexing specification shall be prepared quickly
Appendix: Details about users of the resulting
specification
Glossary
References
1. Introduction
1.1 Background
Since [RFC 2070], [ISO 10646]/[Unicode] (hereafter denoted as UCS, Universal
Character Set) has served as a common reference for character encoding in
W3C specifications (see [HTML 4.0], [XML 1.0], and [CSS2]). This choice was
motivated by the fact that the UCS:
is the only universal character repertoire available
covers the widest possible repertoire
provides a way of referencing characters independent of the encoding of a
resource
is being updated/completed carefully
is widely accepted and implemented by industry.
As long as data transfer on the WWW was primarily unidirectional (from server
to browser), and the main purpose was rendering, the direct use of the UCS
as a common reference posed no problems.
However, from early on, the WWW included bidirectional data transfer (forms,...).
Recently, purposes other than rendering are becoming more and more important.
The WWW has traditionally been seen as a collection of applications exchanging
data based on protocols. It can however also be seen as a single, very large
application [Nicol]. The second view is becoming more and more important due
to the following developments:
The increase in data transfers among servers, proxies, and clients
The increase in places where non-ASCII characters are allowed
The increase in data transfers between different protocol/format elements
(such as element/attribute names, URI components, and textual content)
Definition of specifications for APIs (as opposed to protocol specifications
only)
In this context, some properties of the UCS become relevant and have to be
addressed. It should be noted that such properties also exist in legacy
encodings, and in many cases have been inherited by the UCS in one
way or another from such legacy encodings. In particular, these properties
are:
Choice of binary encoding forms (UTF-8, UTF-16, UCS-4)
Variable length encodings (e.g. due to the use of combining characters,
surrogates,...)
Duplicate encodings (e.g. precomposed vs. decomposed)
Control codes for various purposes (e.g. bidirectionality control, symmetric
swapping,...)
This means that in order to ensure consistent behavior on the WWW, some
additional specifications, based on the UCS, are necessary.
This document is written as part of the work of the I18N WG to provide
internationalization guidelines for the authors of W3C specifications. Because
of the importance of consistent behavior for the WWW, it should be expected
that the resulting guideline components will become mandatory for W3C
specifications.
1.2 Potential users of the resulting specification
The specification that will be developed based on this document has a very
wide range of potential users, which are listed below in three categories.
For some of the users listed here, a short description of what they do and
how the requirements described in this document are thought to apply to them
is given in the Appendix. A need for specifications in the areas addressed
by this document has been expressed directly (in particular at the "Query
Language Meeting" in April 1998 in Brisbane) by the following W3C Working
Groups or specifications:
DOM (Document Object Model)
The XML activity, for XPointer
XSL (eXtensible Style Language)
RDF (Resource Description Framework) Model and Syntax
Within the W3C, it may in addition be useful for:
XML element/attribute names
Work on digital signatures
Internationalization of URIs
Outside of the W3C, it may in addition be useful for things such as:
Identifiers in Java
String handling in ECMAScript
Filenames in FTP
Folder names in IMAP
Usenet newsgroup names
Identifiers in ACAP
1.3 Structure of this Document
The following sections 2-4 each discuss the requirements for a particular
aspect of the WWW character model. Each section in its first subsection briefly
describes the problem addressed. The following subsections then discuss the
various requirements. Section 2 is devoted to the requirements for string
identity matching. Section 3 expands on string identity matching and discusses
subrequirements for early uniform normalization, one way to address string
identity matching. Section 4 discusses the requirements for string indexing.
An appendix gives additional information about some of the users of the
specification resulting from this document. A glossary gives additional
explanations for some of the terms used in this document.
1.4 Scope
This document addresses only those parts of the character model that need
exact specification and are extremely time-critical. To see exactly which
parts are addressed, please see the first subsection of each of the following
sections. A more general model, e.g. in the sense of the reference processing
model in [RFC 2070], and general guidelines, e.g. similar to those in
[RFC 2130] and [RFC 2277] for the work of the IETF, are not discussed here.
Nevertheless, something like the reference processing model in [RFC 2070],
which requires applications to behave as if they used the UCS, is assumed as
a base.
For each problem, this document lists various requirements. Ideally, all
requirements would be met equally well, and the degree to which they are
being met could be measured equally well. However, some of the requirements
take the form of more general design objectives, for which it is difficult
to measure the degree to which they have been met. Also, some requirements
conflict with each other. Where such conflicts are known, the conflict and
a preference (i.e. which requirement has greater weight) is indicated.
2. String Identity Matching
2.1 Problem
String identity matching is a subset of the more general problem of string
matching. String matching in general can be done with various degrees of
specificity, from very approximate matching, such as regular expressions or
phonetic matching for English, to more specific matches, such as
case-insensitive or accent-insensitive matching. This document deals only
with string identity matching. Two strings match as identical if they contain
no user-identifiable distinctions. For more details on the meaning of
user-identifiable distinctions, see the following explanations as well as
subsection 2.3 and subsection 2.4. Any kind of less specific matching is not
discussed in this document.
At various places in the WWW infrastructure, strings, and in particular
identifiers, are compared for identity. If different places use different
definitions of string identity matching, this results in undesired
unpredictability. Such comparisons are unproblematic if the expectations
of the users and the results of a simple binary comparison coincide, or can
be made to coincide. For ASCII, such a coincidence is established and assumed,
including some degree of user education, e.g. about the differences between
the digit 0 and the uppercase letter O. For the full repertoire of the UCS,
however, the aforementioned coincidence between user expectations and binary
comparisons is not a priori guaranteed.
In order to ensure consistent behavior on the WWW, a character model for
W3C specifications must make sure that the gap between user expectations
and internal operation is bridged. A character model for W3C specifications
must therefore specify how the problem of string identity matching is
handled. The requirements for such a specification are listed in the following
subsections. Please note that with the exception of subsection 2.7 and
subsection 2.8, the following subsections assume the character processing
model of [RFC 2070], i.e. they assume that applications behave as if they
used the UCS internally. The section ends with subsection 2.10, which lays
out some alternatives and motivates section 3.
2.2 The string identity matching specification shall be defined
exactly
In order to fulfill its purpose, a specification of string identity matching
must not contain any ambiguities.
While in some cases, the addition of version numbers might help to make the
specification unambiguous, carrying version numbers as parameters is in many
cases highly undesirable and should therefore be avoided.
2.3 The string identity matching specification shall not expose
invisible encoding differences to the user
Typical examples where a gap between user expectations and internal operation
can occur in the UCS are the duplicate encodings defined as canonical
equivalences in [Unicode]. As an example, the UCS allows us to encode "ü"
either as a single codepoint (U+00FC, LATIN SMALL LETTER U WITH DIAERESIS),
or as the codepoint for "u" (U+0075, LATIN SMALL LETTER U) followed by the
codepoint U+0308 (COMBINING DIAERESIS). Such equivalences are artifacts of
the encoding method(s) chosen for the UCS.
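The "ü" example can be reproduced in a few lines. The following is only a
sketch: it uses Python's standard unicodedata module, and it uses Unicode
Normalization Form C as a stand-in for "one agreed representation". This
document does not select a normal form, and the helper name
canonically_equal is an assumption made for illustration.

```python
import unicodedata

precomposed = "\u00fc"   # U+00FC LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"   # U+0075 followed by U+0308 COMBINING DIAERESIS

# A simple binary comparison treats the two encodings as different strings,
# although a user sees the same "ü" in both cases.
binary_equal = precomposed == decomposed


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after mapping both to one normal form (NFC here)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


identity_equal = canonically_equal(precomposed, decomposed)
```

Here binary_equal is False while identity_equal is True, which is exactly the
gap between internal operation and user expectation described above.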
It is expected that the canonical equivalences specified in the Unicode standard
will be an excellent starting point for defining the range of things to be
identified as duplicate encodings. This will make sure that the experience
of the Unicode Technical Committee with respect to character equivalences
is fully leveraged. Whether any changes are necessary will have to be examined
more closely. If such changes consist only of additions of equivalences,
implementations of W3C specifications would collectively conform to conformance
clause C9 given in [Unicode, p. 3-2]: "A process shall not assume that the
interpretations of two canonical-equivalent character sequences are distinct."
Additions may include some presentation forms.
Another category where encoding differences are invisible to the user consists
of the various control codes. W3C standards mostly deal with structured text
(as opposed to plain text). It should therefore in most cases be possible
to rely on explicit markup rather than on in-stream control codes.
2.4 The string identity matching specification shall not treat
as equivalent characters that can usually be distinguished by the user
String identity matching shall not treat as equivalent cases that can clearly
be distinguished by a user because the difference may be significant in many
cases. Examples are:
Lower-case letters and upper-case letters (e.g. "ü" and "Ü")
Characters with and without diacritics such as accents or vowel marks (e.g.
"ü" and "u")
Half-width and full-width presentation variants (Even though one of the variants
is clearly only encoded for compatibility, users can distinguish them if
necessary. Depending on the individual specification and the protocol/format
element concerned, the use of such variants may be discouraged or forbidden.)
These differences can be handled by the (mainly native) users of the
characters in question, and can at least be identified by users not familiar
with the characters in question. Such similarities are explicitly not
considered for string identity matching, because they do not need a
coordinated solution for the entirety of the WWW.
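The three kinds of distinction listed above can be checked against a candidate
matching function. The sketch below assumes, purely for illustration, that
identity matching is approximated by comparing strings in Unicode Normalization
Form C; the function name identity_match and the sample pairs are not taken
from this document. Normalization Form KC is shown only as a contrast, because
it folds width variants and would therefore be too coarse.

```python
import unicodedata


def identity_match(a: str, b: str) -> bool:
    """Hypothetical identity match: equal after canonical normalization only."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Pairs that a user can tell apart and that must therefore NOT match.
distinguishable = {
    "case": ("\u00fc", "\u00dc"),    # "ü" vs "Ü"
    "diacritic": ("\u00fc", "u"),    # "ü" vs "u"
    "width": ("A", "\uff21"),        # "A" vs FULLWIDTH LATIN CAPITAL LETTER A
}

results = {name: identity_match(a, b) for name, (a, b) in distinguishable.items()}

# Contrast: compatibility normalization folds the full-width variant to "A".
folded_by_nfkc = unicodedata.normalize("NFKC", "\uff21") == "A"
```

All three entries of results are False, so the sketch keeps these pairs
distinct, whereas folded_by_nfkc is True.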
Various forms of equivalence testing are needed for operations such as searching
and sorting. But such operations will not be based on string identity
matching. Also, it is felt that such operations do not
need to behave uniformly across the web; that on the contrary, it is beneficial
to have competition (e.g. for search engines and their user interfaces),
that this has already been taken care of elsewhere (e.g. the work of ISO
and Unicode on default and tailorable sorting), and that the requirements
of language-dependence and user-configurability are stronger than the needs
for consistent behavior.
2.5 The string identity matching specification shall be
forward-compatible
It is impossible to predict what characters might be added to the UCS in
the future. String identity matching should be specified so as to try to
minimize the impact of future additions to the UCS on the specification and
its implementations.
One category of additions that warrants particular attention, both because
it has occurred relatively frequently in the past and because it affects
string identity matching directly, is the addition of new precomposed forms
for which decomposed equivalents are already available.
2.6 The string identity matching specification shall be broadly
applicable
Because of the increased integration of the WWW, selecting different ways
to solve the string identity matching problem for different components of
the WWW would produce a fragmentation of users' and implementors' expectations,
and the need for constant attention to minute differences that are rarely
visible. Applicability to a broad range of W3C specifications and the widest
number of components of the WWW means that a solution has to be feasible
for all kinds of different systems, and different subsystems of larger
applications, with different resources available. This in particular includes
very small systems, and systems that do not have continuous network access.
2.7 The string identity matching specification shall be workable
with opaque identifiers and data
Many components of the WWW have to work with data without access to the actual
characters. This includes all kinds of schemes that make use of encryption
techniques as well as schemes where the character encoding is in general
left undefined, such as URIs [URI]. For things such as
URIs, it should be possible to test two strings for identity even if their
character encoding is unknown, given of course that in both cases the same
character encoding has been chosen. Also, it should be possible to test two
strings for identity if the actual data cannot be accessed directly because
it is encrypted. Even in cases where the character encoding is known, and
the data is accessible, treating data as opaque is often desirable, because
an identity check might occur in an architectural component that has (or
the implementors of which have) completely different concerns than
internationalization. Examples of such components are firewalls and passwords.
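A small sketch may make the opaque-data case concrete. A cryptographic digest
stands in here for any component that sees only bytes (the choice of SHA-256,
the UTF-8 encoding, and the sample strings are assumptions for illustration).
Once the text has been turned into opaque bytes, equivalences can no longer be
taken into account; the only point at which they can be removed is before that
step.

```python
import hashlib
import unicodedata


def opaque(text: str) -> str:
    """Turn text into an opaque token; the characters are no longer accessible."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


a = "M\u00fcnchen"    # precomposed "ü"
b = "Mu\u0308nchen"   # "u" followed by COMBINING DIAERESIS

# A component that only sees the opaque tokens must compare them byte for byte.
tokens_match_unnormalized = opaque(a) == opaque(b)

# If both producers normalize before the data becomes opaque, the tokens agree.
tokens_match_normalized = (
    opaque(unicodedata.normalize("NFC", a)) == opaque(unicodedata.normalize("NFC", b))
)
```

In this sketch tokens_match_unnormalized is False and tokens_match_normalized
is True.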
2.8 The string identity matching specification shall allow
you to be conservative in what you send
An often cited maxim of Internet engineering is "be liberal in what you
accept; be conservative in what you send". The use of the appropriate kind
of equivalence at the receiving end easily allows you to "be liberal in what
you accept". However, without any kind of indication of the preferred way
of encoding or the preferred character variant, there is no way to "be
conservative in what you send". This means that potential benefits cannot
be realized.
2.9 The string identity matching specification shall be prepared
quickly
Several upcoming W3C specifications depend on a clear and uniform specification
for string identity matching. Therefore, no time should be lost in preparing
the string identity matching specification.
2.10 Solutions for string identity matching
For a specification for string identity matching, the following issues have
to be addressed:
1. Which representations to treat as equivalent (and which not)
2. Which components in the WWW architecture to make responsible for equivalences:
2.1 Each individual component that performs a string identity check has to take
equivalences into account (late normalization)
2.2 Duplicates and ambiguities are removed as close to their source as possible
(early normalization)
3. Which way to normalize (in the case that early normalization (2.2) is needed,
even if only in some cases)
The arguments for why early normalization may be needed, even if only in
some cases, can be listed as follows:
It is a prerequisite for "be conservative in what you send"
It is the only solution to deal with opaque data (see subsection 2.7)
Not all parts of the WWW may reasonably be expected to do normalization
There is less need for software updates to address forward-compatibility
issues
It may lead to more efficient implementations for string indexing (see
subsection 4.6)
With increased component integration, it becomes more and more difficult
to hide certain kinds of implementation details
It therefore seems appropriate to address the requirements of early normalization
in particular. This is done in the next section.
3. Early uniform normalization
3.1 Problem
As discussed in subsection 2.10, there is a high probability
that early normalization may become necessary, even if only for some selected
cases. Early normalization means that data is normalized as close to its
origin, or as close to its conversion to the UCS, as possible. This eliminates
duplicate representations and other ambiguities. The actual string identity
check can therefore be done without taking such ambiguities into account.
In order for this to work, however, early normalization has to be uniform,
i.e. all components of the WWW that normalize have to do so in one specific
way.
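The difference between late and early normalization, and the reason why early
normalization has to be uniform, can be sketched as follows. The function
names and the use of Normalization Forms C and D are assumptions made for
illustration only; this document does not prescribe a normal form.

```python
import unicodedata


def late_match(a: str, b: str) -> bool:
    """Late normalization: the comparing component removes duplicates itself."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def produce(text: str, form: str = "NFC") -> str:
    """Early normalization: the producer normalizes once, close to the source."""
    return unicodedata.normalize(form, text)


def early_match(a: str, b: str) -> bool:
    """With early normalization, the check itself is a plain binary comparison."""
    return a == b


raw_a = "\u00fc"    # precomposed
raw_b = "u\u0308"   # decomposed

late_ok = late_match(raw_a, raw_b)
early_ok = early_match(produce(raw_a), produce(raw_b))

# Non-uniform early normalization: two producers that normalize in different
# ways defeat the binary comparison again.
early_non_uniform = early_match(produce(raw_a, "NFC"), produce(raw_b, "NFD"))
```

Both late_ok and early_ok are True, but early_non_uniform is False: the binary
check only works if every normalizing component targets the same form.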
3.2 The location of early uniform normalization shall be
specified
In order for W3C specifications to attribute the responsibility for early
uniform normalization to specific components, guidelines on where early uniform
normalization should occur must be provided. Ideally, uniform normalization
would occur at the time of data creation, e.g. by a keyboard driver. However,
W3C specifications do not deal directly with things such as keyboard drivers.
This means that more appropriate locations for requiring early uniform
normalization have to be defined. As an example, it could be required that
text transmitted via certain protocols, or text exposed in certain APIs,
is normalized.
It should be noted that text is transmitted on the WWW in many encodings
not based on the UCS. In these cases, uniform normalization ideally occurs
when data is transcoded (or assumed to be transcoded according to the reference
processing model of [RFC 2070]) from legacy encodings (such as [ISO 8859] or
[ISO 6937]) to the UCS.
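As a sketch of normalization at the transcoding step, the hypothetical helper
below decodes incoming bytes and normalizes the result in the same operation.
The helper name, the choice of ISO 8859-1 and UTF-8 as sample encodings, and
the use of Normalization Form C are assumptions for illustration.

```python
import unicodedata


def transcode_and_normalize(data: bytes, charset: str) -> str:
    """Decode bytes from the given charset and normalize the result at once."""
    return unicodedata.normalize("NFC", data.decode(charset))


# The same word arriving in a legacy encoding and in a decomposed UCS form.
from_legacy = transcode_and_normalize(b"M\xfcnchen", "iso-8859-1")
from_ucs = transcode_and_normalize("Mu\u0308nchen".encode("utf-8"), "utf-8")

# After the boundary, later components can compare the strings directly.
same_after_boundary = from_legacy == from_ucs
```

With normalization located at the boundary, same_after_boundary is True and no
later component has to consider the two original representations.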
Ideally, early uniform normalization will spread out from the WWW to other
parts of the information infrastructure. For example, early uniform normalization
may only be specified for text actually sent out by a server, but the task
of normalization may be transferred from the server to the document provider,
and from there further to the editor tool and even to the keyboard driver.
Such a transfer is indeed highly desirable in many cases, because to avoid
generating unnormalized data is in many cases easier than to normalize such
data later.
3.3 Early uniform normalization shall be based on widespread
practice
A wide range of text on the WWW will have to be normalized. This is easier
to do if uniform normalization occurs towards the more popular representation
than if a not so widely used representation is used as the normal form. It
may also provide a bit more time, in that we are just defining what might
happen naturally anyway instead of having to fight uphill from day one. Existing
standards (such as the canonical ordering behavior for combining characters,
[Unicode, page 3-9]) should also be considered.
3.4 Early uniform normalization shall be specified in collaboration
with the expert communities on character encoding
The views of experts on character coding, especially of members of the Unicode
Technical Committee and of ISO/IEC JTC1/SC2/WG2 should be sought, with the
goal of achieving a broad consensus. This requirement cannot, however, take
precedence over all other requirements, especially Requirement 2.9, "The
string identity matching specification shall be prepared quickly".
3.5 Early uniform normalization shall be feasible to
implement
Where choices are available, early uniform normalization should be specified
in a way which permits easy and compact implementations. It should however
be remembered that the main benefit in terms of implementation simplification
is achieved due to the concept of early uniform normalization itself, by
relieving a large part of the WWW infrastructure of the need to consider
equivalences when making comparisons, and by locating normalization at those
places in the WWW architecture where most information on actually occurring
codepoint combinations and most internationalization implementation expertise
and concern are available.
3.6 Reference software for early uniform normalization shall
be provided
To help in developing, understanding, implementing, and testing early uniform
normalization, reference software shall be developed and provided to the
public under
W3C
. This software will cover all cases, whereas at a given point
in the infrastructure (e.g. a transcoder or a keyboard driver), only some
cases may have to be taken into account.
3.7 Test cases for early uniform normalization shall be
provided
To help in developing, understanding, implementing, and testing early uniform
normalization, test cases shall be developed and provided to the public under
W3C
4. String indexing
4.1 Problem Description
On many occasions, in order to access a substring or a character, it is necessary
to index characters in a string/sequence/array of characters. Where character
indices are exchanged between components of the WWW, there is a need for
a uniform definition of string indexing in order to ensure consistent behavior.
In the simplest cases, this boils down to questions such as "At which position
in a given string is a given character?", "Which character is at a given
position in a given string?", and even simpler, "What's the length of a given
string?"
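Even the simplest of these questions has several answers, depending on the
unit that is counted. The sketch below measures one sample string (chosen for
illustration; it is not taken from this document) in UCS code points, UTF-8
octets, and UTF-16 code units, and again after normalization to Form C.

```python
import unicodedata

# "u" + COMBINING DIAERESIS + one character outside the Basic Multilingual Plane
text = "u\u0308\U00010400"

length_code_points = len(text)                          # UCS code points
length_utf8_octets = len(text.encode("utf-8"))          # octets in UTF-8
length_utf16_units = len(text.encode("utf-16-be")) // 2  # 16-bit units in UTF-16

# Normalization changes the count as well: "u" + U+0308 becomes U+00FC.
length_after_nfc = len(unicodedata.normalize("NFC", text))
```

The four values are 3, 7, 4, and 2 respectively, so an index exchanged between
two components is meaningful only if both agree on the unit and on the
normalization state of the string.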
Note: In many cases, it is highly preferable to use non-numeric ways of
identifying substrings. The specification of string indexing for the WWW
should not be seen as a general recommendation for the use of string indexing
for substring identification. As an example, in the case of translation of
a document from one language to another, identification of substrings based
on document structure can be expected to be much more stable than identification
based on string indexing.
Note: Because of the wide variability of scripts and characters, different
operations may be required to work at different levels of aggregation or
subdivision. String indexing as discussed in this section is only intended
to provide a base for such operations; it cannot address all levels concurrently.
The issue of indexing origin, i.e. whether the first character in a string
is indexed as character number 0 or as character number 1, will not be addressed
here.
4.2 String indexing shall behave consistently across
implementations
This is the basic functional requirement for indexing. It means that the
specification has to be without options.
The basic consistency test is the following:
1. On system A, take any string of characters.
2. In that string, identify a substring by using appropriate indices.
3. Transmit the string (potentially undergoing transformations such as transcoding
and normalization) to system B.
4. Use the same indices as in step 2 to identify a substring in the received
string.
5. If the substring identified is the same as that identified in step 2, then
the test is successful.
The requirement is fulfilled if the test is successful for all strings of
characters and all combinations of systems.
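The consistency test can be written down as a small function. This is a sketch
under stated assumptions: indices are taken to be zero-based code point
offsets, the transmission step is modelled as a function applied to the
string, and the names consistency_test and decompose are hypothetical.

```python
import unicodedata
from typing import Callable


def consistency_test(text: str, start: int, end: int,
                     transmit: Callable[[str], str]) -> bool:
    """Apply the same indices on system A and, after transmission, on system B."""
    selected_on_a = text[start:end]       # identify a substring on system A
    received = transmit(text)             # transmit, possibly with transformations
    selected_on_b = received[start:end]   # reuse the same indices on system B
    return selected_on_a == selected_on_b


def decompose(s: str) -> str:
    """A transformation that may happen in transit: canonical decomposition."""
    return unicodedata.normalize("NFD", s)


text = "Z\u00fcrich"

passes_unchanged = consistency_test(text, 2, 6, lambda s: s)
passes_after_decomposition = consistency_test(text, 2, 6, decompose)
```

With code point indexing, passes_unchanged is True but
passes_after_decomposition is False: after decomposition the same indices
select a different stretch of the string, so the requirement is not met for
this combination of indexing unit and transformation.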
4.3 String indexing shall take into account user expectations
Tools and programs are supposed to hide most of the indexing values from
the end users. However, the fact that direct editing/manipulation was possible
was one of the (unexpected) reasons for the success of the WWW. Also,
in the complex infrastructure of the WWW, it is impossible to define a clear
and strict boundary between what is manipulated by programs and what is seen
and manipulated by the users. Therefore, it is highly desirable that something
seen as one single character by the user is indeed counted as one character.
However, there may be cases where for the same characters, there are differences
in the perceptions of users using various languages, or even of users using
one and the same language. In this case, an ideal solution is not possible.
Preference should be given to a solution which, although not corresponding
to user expectations, can be understood by as many users as possible (e.g.
a rule that a letter with a diacritic always occupies two index
positions).
This requirement may be in conflict with requirement 4.6
(because user expectations and actual encoding might be different). Because
neither requirement is absolute, no indication of relative priorities has
been given here.
4.4 String indexing shall be able to address "characters" at
various levels
Because of the variability of what a "character" can mean in different scripts
and to different people (for the same script), string indexing should permit
the designation of characters at various levels of resolution appropriate
for the task at hand. This can in principle be achieved by indexing on the
finest granularity possible, or by indexing of subelements. Although subelement
indexing might not be defined in the first version of the character model,
and might not be implemented everywhere, the necessary precautions for syntax
extensibility and fallbacks should be taken care of and defined up-front
wherever applicable.
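Two such levels can be sketched side by side: indexing by individual code
points (the finest granularity in this sketch) and indexing by a base
character together with the combining marks that follow it. The grouping rule
below is a deliberate simplification that relies only on the canonical
combining class reported by Python's unicodedata module; the function name
combining_sequences is hypothetical, and real scripts need more elaborate
rules.

```python
import unicodedata
from typing import List


def combining_sequences(text: str) -> List[str]:
    """Group each base character with the combining marks that follow it."""
    units: List[str] = []
    for ch in text:
        if units and unicodedata.combining(ch) != 0:
            units[-1] += ch     # attach the mark to the preceding unit
        else:
            units.append(ch)    # start a new unit
    return units


text = "u\u0308a"   # decomposed "ü" followed by "a"

by_code_point = list(text)
by_sequence = combining_sequences(text)
```

Index 1 addresses the combining diaeresis at the code point level but the
letter "a" at the combining sequence level, which is why a specification
needs to state the level an index refers to.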
4.5 String indexing shall be forward-compatible
It is impossible to predict what characters might be added to the UCS in
the future. String indexing should be specified so as to try to minimize
the impact of future additions to the UCS on the specification and its
implementations.
One category of additions that warrants particular attention, both because
it has occurred relatively frequently in the past and because it may affect
string indexing directly, is the addition of new precomposed forms for which
decomposed equivalents are already available.
4.6 String indexing shall be feasible to implement
Indexing into a string of characters is a very frequent operation. Ease of
implementation is therefore crucial. If string indexing is based on early
uniform normalization, then this may help to make implementation easier.
4.7 The String indexing specification shall be prepared
quickly
Several upcoming W3C specifications depend on a clear character model and
in particular on clear definitions for string indexing. It is therefore crucial
that no time is lost.
Appendix: Details about users of the resulting
specification
This appendix gives some additional details about users of the specification
that will result from the requirements in this document. This is intended
to give some very short background to readers not familiar with some of the
work of the W3C, as well as to make sure that the requirements of these groups
are well understood.
Note:
The specifications discussed below are still in progress. The
summaries are based on the current state, as publicly known. Changes may
occur at any time.
DOM (Document Object Model)
A series of API definitions to access and manipulate documents, both document
structure and textual content. Currently, APIs for basic functionality for
HTML and XML, with bindings to programming languages such as Java, ECMAScript,
and C. All string parameters in the APIs are defined as Unicode strings.
To assure consistent behavior of programs written in different languages
and running on different implementations, uniform normalization and string
indexing specifications are necessary.
XLL (eXtensible Linking Language)
Linking support for XML. XLL defines the #anchor syntax component of
URIs for XML. A syntax for identifying elements in a document tree (e.g.
based on element names that can contain arbitrary characters in XML), as
well as for identifying portions of text, is defined. For consistent
identification of portions of text, either or both of string identity matching
and string indexing are necessary.
RDF (Resource Description Framework)
A data model and streaming format for metadata, with search engines and inference
engines as potential users. Much metadata is textual, and a basic operation
is to decide whether two elements of metadata are the same or not. For consistent
behavior, string identity matching is necessary.
URIs
Web addresses, with various components; pivot point for much of the WWW.
How to encode arbitrary bytes into a restricted set of characters (using
%HH escapes) is well defined, but which character encoding to use to encode
arbitrary characters into bytes is not defined. In most cases, e.g. in proxies,
comparisons are strictly binary. Without some specification for uniform
normalization, some characters cannot reliably be used.
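The effect on URIs can be illustrated with the %HH escaping provided by
Python's urllib.parse.quote. The sample character and the two encodings are
chosen for illustration; nothing here implies that one of them is the
encoding a URI specification requires.

```python
from urllib.parse import quote

# The same user-visible "ü", escaped under different assumptions.
escaped_utf8_precomposed = quote("\u00fc", encoding="utf-8")
escaped_latin1 = quote("\u00fc", encoding="iso-8859-1")
escaped_utf8_decomposed = quote("u\u0308", encoding="utf-8")

# A proxy comparing the escaped forms byte for byte sees three different URIs.
all_distinct = len({escaped_utf8_precomposed,
                    escaped_latin1,
                    escaped_utf8_decomposed}) == 3
```

The escaped forms are "%C3%BC", "%FC", and "u%CC%88". Agreeing on the
character encoding removes the second variant; only uniform normalization
removes the third.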
Glossary
This glossary does not provide exact definitions of terms but gives some
background on how certain words are used in this document.
Character
Used in a loose sense to denote small units of text, where the exact definition
of these units is still open.
Early Normalization
Duplicates and ambiguities are removed as close to their source as possible.
This is done by normalizing them to a single representation. Because the
normalization is not done by the component that carries out the identity
check, normalization has to be done uniformly for all the components of the
WWW.
Late Normalization
Each individual component that performs a string identity check has to take
equivalences into account. This is usually done by normalizing each string
to a preferred representation that eliminates duplicates and ambiguities.
Because, with late normalization, normalization is done locally and on the
fly, there is no need to specify a web-wide uniform normalization.
String Identity Matching
Exact matching of strings, except for encoding duplicates indistinguishable
to the user. See section 2.
String Indexing
Indexing into a string to address a character or a sequence of characters.
See section 4.
UCS
Universal Character Set, the character repertoire defined in parallel by
[ISO 10646] and [Unicode].
WWW
World-wide Web, the collection of technologies built up starting with HTML,
HTTP, and URIs, the corresponding software (servers, browsers,...), and/or
the corresponding content.
References
[CSS2]
Bert Bos, Håkon Wium Lie, Chris Lilley, Ian Jacobs, Eds., Cascading Style
Sheets, level 2 (CSS2 Specification), W3C Recommendation 12-May-1998.
[ISO 6937]
ISO/IEC 6937:1994, Information technology -- Coded graphic character set for
text communication -- Latin alphabet.
[ISO 8859]
ISO/IEC 8859, Information technology -- 8-bit single-byte coded graphic
character sets (various parts and publication dates).
[ISO 10646]
ISO/IEC 10646-1:1993, Information technology -- Universal Multiple-Octet Coded
Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane, and
its amendments.
[HTML 4.0]
Dave Raggett, Arnaud Le Hors, Ian Jacobs, Eds., HTML 4.0 Specification, W3C
Recommendation 18-Dec-1997 (revised on 24-Apr-1998).
[Nicol]
Gavin Nicol, The Multilingual World Wide Web, Chapter 2: The WWW As A
Multilingual Application.
[RFC 2070]
F. Yergeau, G. Nicol, G. Adams, M. Dürst, Internationalization of the
Hypertext Markup Language, RFC 2070, January 1997,
ftp://ftp.isi.edu/in-notes/rfc2070.txt
[RFC 2130]
C. Weider, C. Preston, K. Simonsen, H. Alvestrand, R. Atkinson, M. Crispin,
P. Svanberg, The Report of the IAB Character Set Workshop held 29 February -
1 March, 1996, RFC 2130, April 1997,
ftp://ftp.isi.edu/in-notes/rfc2130.txt
[RFC 2277]
H. Alvestrand, IETF Policy on Character Sets and Languages, RFC 2277 / BCP 18,
January 1998,
ftp://ftp.isi.edu/in-notes/rfc2277.txt
[Unicode]
The Unicode Consortium, The Unicode Standard, Version 2.0, Addison-Wesley,
Reading, MA, 1996.
[URI]
T. Berners-Lee, R. Fielding, L. Masinter, Uniform Resource Identifiers (URI):
Generic Syntax, work in progress,
ftp://ftp.ietf.org/internet-drafts/draft-fielding-uri-syntax-03.txt
June 1998.
[XML 1.0]
Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eds., Extensible Markup Language
(XML) 1.0, W3C Recommendation 10-February-1998.
W3C (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark,
document use and software licensing rules apply.