Web addresses in HTML 5

Editors:
Dan Connolly
Midwest Web Sense LLC and W3C
connolly@w3.org
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
cmsmcq@blackmesatech.com
Editor's Draft
21 May 2009
W3C (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
Abstract
This specification defines the handling of Web addresses
for Hypertext Markup Language (HTML) 5, the fifth major revision
of the core language of the World Wide Web.
In this version, special attention has been
given to defining clear conformance criteria for user agents in an
effort to improve interoperability.
Status of this Document
This is a start at factoring out the URL material in the HTML 5 draft as a separate draft for consideration by the W3C HTML Working Group. (See ACTION-68.)
See also a URI desk calculator.
Pending feedback:
Subject: Web Addresses feedback
Date: Thu, 30 Apr 2009 23:50:25 +0000
Subject: Definition of absolute URL
Date: Fri, 15 May 2009 21:22:02 +0200
Subject: Re: a few URI/href issues captured with test cases
Date: Thu, 21 May 2009 19:05:52 +0200
Subject: Updating the IRI spec to include "web addresses"
Date: Sun, 31 May 2009 10:12:14 -0700
Subject: FYI: IRI issues wiki
Date: Wed, 3 Jun 2009 11:34:45 +1000
Subject: Web addresses feedback
Date: Sun, 14 Jun 2009 00:21:12 +0000 (UTC)
Table of Contents
Terminology
Parsing Web addresses
Resolving Web addresses
References
Old algorithm for parsing Web addresses
Introduction
This specification defines the term Web address, and defines various algorithms for dealing with Web addresses, because for historical reasons the rules defined by the URI and IRI specifications are not a complete description of what HTML user agents need to implement to be compatible with Web content.
Terminology
A Web address is a string used to identify a resource.
The term "Web address" in this specification is used to include not only Uniform Resource Identifiers (URIs) as they are defined by RFC 3986 and Internationalized Resource Identifiers (IRIs) as they are defined by RFC 3987, but also other strings of characters which can be used to identify Web resources when processed appropriately.
A Web address is a valid Web address if at least one of the following conditions holds:
The Web address is a valid URI reference (i.e. it matches the URI-reference grammar given in RFC 3986).
The Web address is a valid IRI reference (i.e. it matches the IRI-reference grammar given in RFC 3987), and it has no query component.
The Web address is a valid IRI reference and its query component contains no unescaped non-ASCII characters. [RFC3987]
The Web address is a valid IRI reference and the character encoding of the Web address's Document is UTF-8 or UTF-16. [RFC3987]
A Web address has an associated URL character encoding, determined as follows:
If the Web address came from a script (e.g. as an argument to a method)
The URL character encoding is the script's character encoding.
If the Web address came from a DOM node (e.g. from an element)
The node has a Document, and the URL character encoding is the document's character encoding.
If the Web address had a character encoding defined when the Web address was created or defined
The URL character encoding is as defined.
Parsing Web addresses
To parse a Web address into its component parts, the user agent must use the following steps:
Strip leading and trailing space characters from the Web address.
Percent-encode all non-URI characters in the Web address.
This 2nd step probably needs to be laid out in more detail.
Note: the 2nd step will replace all of the following characters with a percent-encoded equivalent:
all characters with codepoints less than or equal to U+0020 (i.e. the C0 control characters and U+0020 space)
all characters with codepoints greater than or equal to U+007F (i.e. U+007F and all non-ASCII characters)
U+0022 double quotation mark
U+0025 percent sign
U+003C less-than sign
U+003E greater-than sign
U+005C reverse solidus (backslash)
U+005E circumflex accent
U+0060 grave accent
U+007B left curly bracket
U+007C vertical line
U+007D right curly bracket
If the Web address begins with either of:
a string matching the scheme production, followed by "://"
the string "//"
then percent-encode any left or right square brackets (U+005B, U+005D, "[" and "]") following the first occurrence of "/", "?", or "#" which follows the first occurrence of "//".
Otherwise, percent-encode all left and right square brackets.
Percent-encode all occurrences of U+0023 (Number sign, "#") after the first.
Parse the Web address using the grammar in RFC 3986.
If the Web address doesn't match the URI-reference production, even after the above changes are made to it, then parsing the Web address fails with an error. [RFC3986]
Otherwise, parsing was successful; the components of the Web address are substrings of the modified Web address, defined as follows. First, the substring of the modified Web address which matched a particular production in RFC 3986 is identified; then any percent-encoded characters in that substring are decoded. The resulting string (called here the "decoded substring") is one of the named components of the Web address.

As a result of percent-encoding the percent sign, any
occurrences of percent-encoding in the Web address will be
double-encoded at this step.
As in the algorithm previously given, Web addresses
containing percent-encoded characters here have components which similarly
contain percent-encoded characters.

<scheme>
The decoded substring matched by the scheme production, if any.

<host>
The decoded substring matched by the host production, if any.

<port>
The decoded substring matched by the port production, if any.

<hostport>
If there is a <scheme> component and a <port> component and the port given by the <port> component is different from the default port defined for the protocol given by the <scheme> component, then <hostport> is the decoded substring that starts with the decoded substring matched by the host production and ends with the decoded substring matched by the port production, and includes the colon in between the two. Otherwise, it is the same as the <host> component.

<path>
The decoded substring matched by one of the following productions, if one of them was matched:
path-abempty, path-absolute, path-noscheme, path-rootless, path-empty

<query>
The decoded substring matched by the query production, if any.

<fragment>
The decoded substring matched by the fragment production, if any.

<host-specific>
The decoded substring that follows the decoded substring matched by the authority production, or the whole string if the authority production wasn't matched.
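The preparatory steps of the parsing algorithm (stripping, percent-encoding, square-bracket handling, and the "#" rule) can be sketched in Python as follows. This is an illustrative sketch, not a normative implementation. The function name `preprocess`, the exact set of space characters, and the use of UTF-8 octets when percent-encoding are assumptions, since the draft itself notes that the second step needs more detail. The sketch also assumes that the authority ends at the first "/", "?", or "#" after "//", as in RFC 3986.

```python
import re

# Characters from the note that are always percent-encoded by step 2.
ALWAYS_ENCODE = set('"%<>\\^`{|}')
# Assumption: the "space characters" stripped in step 1.
SPACE_CHARACTERS = " \t\n\f\r"
# A scheme (per RFC 3986) immediately followed by "://".
SCHEME_PREFIX = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*://")


def pct(ch: str) -> str:
    """Percent-encode one character via its UTF-8 octets (an assumption)."""
    return "".join(f"%{octet:02X}" for octet in ch.encode("utf-8"))


def preprocess(address: str) -> str:
    # Step 1: strip leading and trailing space characters.
    s = address.strip(SPACE_CHARACTERS)

    # Step 2: percent-encode non-URI characters. Note that "%" itself is
    # encoded, so existing escapes become double-encoded at this point.
    s = "".join(
        pct(ch) if ord(ch) <= 0x20 or ord(ch) >= 0x7F or ch in ALWAYS_ENCODE else ch
        for ch in s
    )

    # Step 3: percent-encode square brackets outside the authority.
    if SCHEME_PREFIX.match(s) or s.startswith("//"):
        after_slashes = s.index("//") + 2
        end_of_authority = re.search(r"[/?#]", s[after_slashes:])
        start = after_slashes + end_of_authority.start() if end_of_authority else len(s)
    else:
        start = 0
    s = s[:start] + s[start:].replace("[", "%5B").replace("]", "%5D")

    # Step 4: percent-encode every "#" after the first.
    first_hash = s.find("#")
    if first_hash != -1:
        s = s[: first_hash + 1] + s[first_hash + 1 :].replace("#", "%23")
    return s
```

For example, `preprocess("a b%20c")` yields `"a%20b%2520c"`, which illustrates the double-encoding of existing percent-escapes.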
N.B. the rules given above will parse not only valid Web addresses but
a variety of invalid ones as well. The point of making the algorithm
have a scope different from that of the definition of valid Web
address is not clear and needs to be discussed in the WG.
The parsing process described here should be more closely aligned with
the rules given in RFC 3987.
How does this compare to just
parsing using the IRI grammar of RFC 3987?
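As a rough illustration of how a preprocessed Web address could be split into decoded components, the sketch below uses the non-validating regular expression from RFC 3986, Appendix B, rather than the full ABNF grammar. The function name `split_components`, the dictionary keys, and the `DEFAULT_PORTS` table are hypothetical and chosen only for this example. Because the regular expression is lenient, this sketch does not report the parse errors that matching against the full grammar would.

```python
import re
from urllib.parse import unquote

# Component-splitting regular expression from RFC 3986, Appendix B.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?$", re.DOTALL
)
# Hypothetical default-port table, for illustration only.
DEFAULT_PORTS = {"http": "80", "https": "443", "ftp": "21"}


def split_components(address: str) -> dict:
    m = URI_REFERENCE.match(address)  # this pattern matches any string
    scheme, authority, path, query, fragment = m.group(2, 4, 5, 7, 9)

    host = port = None
    if authority is not None:
        # Drop any userinfo, then separate an optional ":port" suffix.
        # The first alternative keeps the colons of an IP literal like [::1].
        rest = authority.rsplit("@", 1)[-1]
        hm = re.fullmatch(r"(\[[^\]]*\]|[^:]*)(?::(\d*))?", rest)
        if hm is None:
            raise ValueError(f"cannot split authority {authority!r}")
        host, port = hm.group(1), hm.group(2)

    def decode(part):
        # Decode percent-encoded characters; absent components stay None.
        return None if part is None else unquote(part)

    parts = {
        "scheme": decode(scheme),
        "host": decode(host),
        "port": decode(port),
        "path": decode(path),
        "query": decode(query),
        "fragment": decode(fragment),
    }
    # The host-plus-port value includes the port only when it differs from
    # the default port of the scheme.
    hostport = parts["host"]
    if scheme and port and port != DEFAULT_PORTS.get(scheme.lower()):
        hostport = f"{parts['host']}:{parts['port']}"
    parts["hostport"] = hostport
    return parts
```

With this sketch, `split_components("http://example.com:80/")["hostport"]` is `"example.com"`, while a non-default port such as 8080 is kept in the result.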
Resolving Web addresses
To resolve a Web address to an absolute Web address relative to either another absolute Web address or an element, the user agent must use the following steps. Resolving a Web address can result in an error, in which case the Web address is not resolvable.
Take the Web address being resolved as the input to these steps.
Let encoding be the character encoding of the Web address.
If encoding is UTF-16, then change it to UTF-8.
If the algorithm was invoked with an absolute Web address to use as the base Web address, let base be that absolute Web address.
Otherwise, let base be the base URI of the element, as defined by the XML Base specification, with the base URI of the document entity being defined as the document base Web address of the Document that owns the element. [XMLBASE]
For the purposes of the XML Base specification, user agents must act as if all Document objects represented XML documents.
It is possible for xml:base attributes to be present even in HTML fragments, as such attributes can be added dynamically using script. (Such scripts would not be conforming, however, as xml:base attributes are not allowed in HTML documents.)
The document base Web address of a Document is the absolute Web address obtained by running these substeps:
Let fallback base url be the document's address.
If fallback base url is about:blank, and the Document's browsing context has a creator browsing context, then let fallback base url be the document base Web address of the creator Document instead.
If there is no base element that is both a child of the head element and has an href attribute, then the document base Web address is fallback base url.
Otherwise, take the value of the href attribute of the first such element.
Resolve that value relative to fallback base url (thus, the base href attribute isn't affected by xml:base attributes).
The document base Web address is the result of the previous step if it was successful; otherwise it is fallback base url.
Parse the Web address into its component parts.
If parsing the Web address resulted in a <host> component, then replace the matching substring of the Web address with the string that results from expanding any sequences of percent-encoded octets in that component that are valid UTF-8 sequences into Unicode characters as defined by UTF-8.
If any percent-encoded octets in that component are not valid UTF-8 sequences, then return an error and abort these steps.
Apply the IDNA ToASCII algorithm to the matching substring, with both the AllowUnassigned and UseSTD3ASCIIRules flags set. Replace the matching substring with the result of the ToASCII algorithm.
If ToASCII fails to convert one of the components of the string, e.g. because it is too long or because it contains invalid characters, then return an error and abort these steps. [RFC3490]
If parsing the Web address resulted in a <path> component, then replace the matching substring of the Web address with the string that results from applying the following steps to each character other than U+0025 PERCENT SIGN (%) that doesn't match the original path production defined in RFC 3986:
Encode the character into a sequence of octets as defined by UTF-8.
Replace the character with the percent-encoded form of those octets. [RFC3986]
For instance if the Web address was "//example.com/a^b☺c%FFd%z/?e", then the <path> component's substring would be "/a^b☺c%FFd%z/" and the two characters that would have to be escaped would be "^" and "☺". The result after this step was applied would therefore be that the Web address now had the value "//example.com/a%5Eb%E2%98%BAc%FFd%z/?e".
If parsing the Web address resulted in a <query> component, then replace the matching substring of the Web address with the string that results from applying the following steps to each character other than U+0025 PERCENT SIGN (%) that doesn't match the original query production defined in RFC 3986:
If the character in question cannot be expressed in the encoding encoding, then replace it with a single 0x3F octet (an ASCII question mark) and skip the remaining substeps for this character.
Encode the character into a sequence of octets as defined by the encoding encoding.
Replace the character with the percent-encoded form of those octets. [RFC3986]
Apply the algorithm described in RFC 3986 section 5.2 Relative Resolution, using the Web address as the potentially relative URI reference (R), and base as the base URI (Base). [RFC3986]
Apply any relevant conformance criteria of RFC 3986 and RFC 3987, returning an error and aborting these steps if appropriate. [RFC3986] [RFC3987]
For instance, if an absolute URI that would be returned by the above algorithm violates the restrictions specific to its scheme, e.g. a data: URI using the "//" server-based naming authority syntax, then user agents are to treat this as an error instead.
Let result be the target URI (T) returned by the Relative Resolution algorithm.
If result uses a scheme with a server-based naming authority, replace all U+005C REVERSE SOLIDUS (\) characters in result with U+002F SOLIDUS (/) characters.
Return result.
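The two escaping steps of the resolution algorithm, for the path (always UTF-8) and for the query (using the encoding associated with the Web address), can be sketched as below. The helper names `escape_path` and `escape_query` are hypothetical, and the sets of allowed characters are derived from the `pchar` and `query` rules of RFC 3986. Treating the UTF-16LE and UTF-16BE variants like UTF-16 is an assumption of this sketch.

```python
import codecs
import string

# Characters allowed unescaped by pchar in RFC 3986
# (unreserved, sub-delims, ":" and "@").
PCHAR = set(string.ascii_letters + string.digits + "-._~!$&'()*+,;=:@")
PATH_OK = PCHAR | {"/"}
QUERY_OK = PCHAR | {"/", "?"}


def _escape(component: str, allowed: set, encoding: str) -> str:
    out = []
    for ch in component:
        # "%" is left alone, as are characters the production already allows.
        if ch == "%" or ch in allowed:
            out.append(ch)
            continue
        try:
            octets = ch.encode(encoding)
        except UnicodeEncodeError:
            # Not expressible in the encoding: emit a single "?" (0x3F).
            out.append("?")
            continue
        out.append("".join(f"%{octet:02X}" for octet in octets))
    return "".join(out)


def escape_path(path: str) -> str:
    # The path is always escaped using UTF-8.
    return _escape(path, PATH_OK, "utf-8")


def escape_query(query: str, encoding: str) -> str:
    # UTF-16 is replaced by UTF-8 before the query is escaped.
    # Assumption: the LE/BE variants are treated the same way.
    if codecs.lookup(encoding).name.startswith("utf-16"):
        encoding = "utf-8"
    return _escape(query, QUERY_OK, encoding)
```

Applied to the path from the example in the text, `escape_path("/a^b☺c%FFd%z/")` gives `"/a%5Eb%E2%98%BAc%FFd%z/"`; the existing "%FF" and the malformed "%z" are passed through untouched.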
A Web address is an absolute Web address if resolving it results in the same Web address without an error.
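The overall shape of resolution, the document base Web address fallback, and the absolute Web address test can be approximated as in the following sketch. It uses Python's `urllib.parse.urljoin` as a stand-in for the Relative Resolution algorithm of RFC 3986 section 5.2, and it omits the host, path, and query rewriting steps. Both are simplifications, so results may differ from the algorithm above in edge cases. The names `resolve`, `document_base`, `is_absolute`, the `SERVER_BASED` set, and the placeholder base address are hypothetical.

```python
from urllib.parse import urljoin, urlsplit

# Assumption: a small sample of schemes with a server-based naming authority.
SERVER_BASED = {"http", "https", "ftp"}


def resolve(address: str, base: str) -> str:
    # Stand-in for RFC 3986 section 5.2 Relative Resolution.
    result = urljoin(base, address)
    # Backslashes are replaced only after resolution, and only for
    # schemes with a server-based naming authority.
    if urlsplit(result).scheme in SERVER_BASED:
        result = result.replace("\\", "/")
    return result


def document_base(document_address, base_href=None, creator_base=None):
    # Start from the document's address, falling back to the creator
    # document's base when the document is about:blank.
    fallback = document_address
    if fallback == "about:blank" and creator_base is not None:
        fallback = creator_base
    if base_href is None:
        return fallback
    try:
        return resolve(base_href, fallback)
    except ValueError:
        # Resolution failed: keep the fallback.
        return fallback


def is_absolute(address: str) -> bool:
    # An absolute Web address resolves to itself; the base used here is
    # an arbitrary placeholder that an absolute address never picks up.
    return resolve(address, "http://base.invalid/") == address
```

For example, `resolve("../g", "http://a/b/c/d;p?q")` returns `"http://a/b/g"`, matching the corresponding example in RFC 3986 section 5.4.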
A References
RFC 3490
P. Faltstrom, P. Hoffman, and A. Costello,
"Internationalizing Domain Names in Applications (IDNA)",
RFC 3490, March 2003.

RFC 3986
T. Berners-Lee, R. Fielding, and L. Masinter,
"Uniform Resource Identifier (URI): Generic Syntax",
RFC 3986, January 2005.

RFC 3987
M. Duerst and M. Suignard,
"Internationalized Resource Identifiers (IRIs)",
RFC 3987, January 2005.

XML Base
Jonathan Marsh and Richard Tobin, ed.,
"XML Base (Second Edition)",
W3C Recommendation 28 January 2009.