W3C Internationalization

This page is no longer maintained and may be inaccurate. For more up-to-date information, see the Internationalization Activity home page.

For worldwide interoperability, URIs have to be encoded uniformly. To map the wide range of characters used worldwide into the 60 or so allowed characters in a URI, a two-step process is used:

  • Convert the character string into a sequence of bytes using the UTF-8 encoding
  • Convert each byte that is not an ASCII letter or digit to %HH, where HH is the hexadecimal value of the byte

For example, the string

François

would be encoded as

Fran%c3%a7ois

(The "ç" is encoded in UTF-8 as two bytes, C3 (hex) and A7 (hex), which are then written as the three-character sequences "%c3" and "%a7" respectively.)
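
The two steps above can be sketched in Java roughly as follows. This is a minimal illustration, not the Java class linked from this page: the class name Utf8UrlSketch and the method name encode are made up for the example, and the sketch follows the rule exactly as stated above, so every byte that is not an ASCII letter or digit (including spaces and punctuation) is escaped.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical class name, chosen for this sketch only.
final class Utf8UrlSketch {

    /** Step 1: convert to UTF-8 bytes. Step 2: write each non-alphanumeric byte as %HH. */
    static String encode(String s) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xff; // treat the byte as unsigned
            boolean asciiLetterOrDigit =
                    (v >= 'a' && v <= 'z') || (v >= 'A' && v <= 'Z') || (v >= '0' && v <= '9');
            if (asciiLetterOrDigit) {
                out.append((char) v);
            } else {
                // Character.forDigit yields lowercase hex digits, matching the example above.
                out.append('%')
                   .append(Character.forDigit(v >> 4, 16))
                   .append(Character.forDigit(v & 0xf, 16));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00e7 is the c-cedilla of the example string.
        System.out.println(encode("Fran\u00e7ois")); // prints Fran%c3%a7ois
    }
}
```

The string is written with a \u00e7 escape so that the result does not depend on the encoding of the source file.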

This can make a URI rather long (up to 9 ASCII characters for a single Unicode character in the Basic Multilingual Plane, and 12 for a character beyond it), but the intention is that browsers only need to display the decoded form, and many protocols can send UTF-8 without the %HH escaping.

Program code

Here are some examples of program code for encoding and decoding:

  • Java class for encoding Unicode strings
  • Java function for decoding UTF8/URL encoded strings
  • Graphical interface in Java (from an idea by Gary Adams):
    1. Download URLUTF8Encoder.java and UTF8URL.java.
    2. Compile (on Unix: javac UTF8URL.java)
    3. Run (on Unix: java UTF8URL)

    As you type in the upper box, the second box shows the encoded version, and the bottom box shows the decoded version of the second box (which, of course, should be exactly the same as what you typed).

  • The above as an applet
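
The decoding direction can be sketched in the same way. Again, this is only an illustration under stated assumptions and not the Java function linked above: the names Utf8UrlDecodeSketch and decode are invented for the example, the input is assumed to contain only ASCII characters (anything else is rejected), and malformed UTF-8 byte sequences are simply replaced by U+FFFD by the String constructor rather than reported.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical class name, chosen for this sketch only.
final class Utf8UrlDecodeSketch {

    /** Turns each %HH back into a byte, then interprets the byte sequence as UTF-8. */
    static String decode(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '%') {
                if (i + 2 >= s.length()) {
                    throw new IllegalArgumentException("truncated %HH escape at index " + i);
                }
                bytes.write((hex(s.charAt(i + 1)) << 4) | hex(s.charAt(i + 2)));
                i += 3;
            } else if (c < 0x80) {
                bytes.write(c); // unescaped ASCII character is copied as-is
                i++;
            } else {
                // Assumption of this sketch: the escaped form is pure ASCII.
                throw new IllegalArgumentException("non-ASCII character at index " + i);
            }
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Value of one ASCII hex digit; upper and lower case are both accepted. */
    private static int hex(char c) {
        int v = (c < 0x80) ? Character.digit(c, 16) : -1;
        if (v < 0) {
            throw new IllegalArgumentException("not a hex digit: " + c);
        }
        return v;
    }

    public static void main(String[] args) {
        String decoded = decode("Fran%c3%a7ois");
        System.out.println(decoded.equals("Fran\u00e7ois")); // prints true
    }
}
```

Decoding the output of an encoder that follows the two-step process should give back exactly the original string, which is the round trip the graphical interface displays.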

Jigsaw

Jigsaw, the W3C demonstration server, is written in Java and could in principle serve resources with non-ASCII names. However, the current version (1.x) doesn't do so. Replacing the unescape routine in the file LookupState.java with the decoding function listed above fixes that omission. (However, it is currently difficult to create non-ASCII resources interactively; that won't be fixed until Jigsaw 2.0.)


W3C Bert Bos, i18n coordinator
Webmaster
Last updated $Date: 2008/05/07 17:58:25 $