A schema for serialized infosets

Richard Tobin and Henry Thompson, LTG, University of Edinburgh

This is a schema that describes an XML serialization of XML infosets. There are two main versions: one for the basic infoset, and one for the post-schema-validation (PSV) infoset.

Our main goal in defining this serialization was to allow comparison of the infosets generated by different processors (including parsers and schema validators). It has also proved useful for finding flaws in the infoset and schema specifications themselves, and the serializations can be converted to HTML by stylesheets for display.

The top-level schemas are:

XMLInfoset.xsd: The basic infoset
XMLInfoset-strict.xsd: "Strict" version of the basic infoset (see below)
PSVInfoset.xsd: PSV infoset (uses strict version of basic infoset)

Notes

All properties and infoitems are represented as elements. There are two reasons for this:

It avoids the need to decide each case individually.
It allows properties to be nulled with xsi:nil, to represent "no value" (or "absent", as it is called in the PSV infoset).
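As an illustration of the second point, here is a minimal Python sketch of how a consumer might tell a nilled property from one that was simply not reported. The element names (element, namespaceName, localName, prefix) are guessed from the naming convention described in these notes rather than copied from XMLInfoset.xsd, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
INFOSET_NS = "http://www.w3.org/2001/05/XMLInfoset"

# Hypothetical fragment: element names are assumptions based on the naming
# convention in these notes, not taken from the schema itself.
SAMPLE = f"""
<element xmlns="{INFOSET_NS}" xmlns:xsi="{XSI_NS}">
  <namespaceName>http://example.org/ns</namespaceName>
  <localName>para</localName>
  <prefix xsi:nil="true"/>
</element>
""".strip()


def is_nilled(prop):
    """True if the property element carries xsi:nil="true" (or "1")."""
    return prop.get(f"{{{XSI_NS}}}nil") in ("true", "1")


def property_state(item, name):
    """Return (state, value), where state is 'missing', 'nil' or 'value'."""
    prop = item.find(f"{{{INFOSET_NS}}}{name}")
    if prop is None:
        return ("missing", None)   # the processor did not report it
    if is_nilled(prop):
        return ("nil", None)       # reported, but with no value
    return ("value", prop.text or "")


item = ET.fromstring(SAMPLE)
prefix_state = property_state(item, "prefix")
local_name_state = property_state(item, "localName")
children_state = property_state(item, "children")
```

Because every property is an element, the same test works uniformly; an attribute-based representation would have no equivalent of xsi:nil.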

A type is declared for each info item and property. Type names are camel-case with an initial capital. Element names are camel-case with an initial lower-case letter.
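The naming convention can be sketched as a pair of small Python functions. This is only an illustration: the function names are hypothetical, and the handling of hyphens and of all-capital words such as URI is an assumption that should be checked against the schemas.

```python
import re


def _words(property_name):
    """Split an infoset property name such as '[namespace name]' into words."""
    stripped = property_name.strip("[] ")
    # Assumption: hyphens separate words in the same way as spaces.
    return [w for w in re.split(r"[\s\-]+", stripped) if w]


def element_name(property_name):
    """Camel-case name with an initial lower-case letter."""
    words = _words(property_name)
    first = words[0][:1].lower() + words[0][1:]
    return first + "".join(w[:1].upper() + w[1:] for w in words[1:])


def type_name(property_name):
    """Camel-case name with an initial capital."""
    return "".join(w[:1].upper() + w[1:] for w in _words(property_name))


namespace_element = element_name("[namespace name]")   # namespaceName
local_type = type_name("[local name]")                 # LocalName
```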

All properties are represented as elements whose name is the property name. These elements are globally declared (except where there are infoitems with the same name, in which case "Property" is appended to the name). A consequence is that properties with the same name must have the same type; this is true for both the basic and PSV infosets. Their types fall into several categories:

Atomic (strings, enumerations and booleans):
The property is an element with a simple type.
Lists of atoms:
If the atom cannot include spaces, the property is an element with a simple type which is a list of the appropriate simple type. If it can include spaces (e.g. XPaths), we create a dummy infoitem for it (XXX: we haven't done this for URIs, which can theoretically include spaces).
Lists of info items:
The property is an element containing a sequence of elements which represent the info items.
(Unordered) sets:
As for lists. The values are sorted into a canonical order. For attributes and namespaces this is the same as the order in Canonical XML. For other cases we will specify an order.
Single info items:
Surprisingly, there are none of these in the basic infoset, except in cases where a pointer is used (see below). They do occur in the PSV infoset. The property is an element containing an element which represents the info item. In several cases the property has the same name as the infoitem that is its value, resulting in a strange-looking repetition of the element name.
References to info items:
Where the very same info item appears in two or more places, we specify that one contains the real value and the others contain pointers. All info items that are pointed to have attributes named id. A property pointing to an info item contains an element named pointer which has an attribute named ref corresponding to the pointed-to item. Identity constraints in the schema enforce the correspondence (just in ID/IDREF style at present; this could be tightened up to ensure that pointers point to the right type).
When there is a natural home for the real definitions it is used. In particular, unparsed entities and notations reside in the [unparsed entities] and [notations] properties of the document info item. Global schema components reside in the [schema components] property of the schema information info item; others reside in the component in which they are defined (for example, a local element declaration will reside in a particle).
Odd cases:
The PSV infoset has some odd cases. Where the property is either an atom or a structure (e.g. [scope], which is either global or a complex type definition), we just use mixed content. Where the property has substructure (e.g. [value constraint], which is a pair of a string and default or fixed), we create a dummy infoitem.
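The pointer mechanism described under "References to info items" can be followed with a few lines of Python. The sketch below is not part of the schemas: only the id and ref attributes and the pointer element come from the description above, while the sample document, the other element names, the assumption that pointer lives in the basic infoset namespace, and the helper names are illustrative guesses.

```python
import xml.etree.ElementTree as ET

INFOSET_NS = "http://www.w3.org/2001/05/XMLInfoset"

# Hypothetical serialization: an ENTITY-typed attribute whose [references]
# property points at an unparsed entity stored on the document info item.
SAMPLE = f"""
<document xmlns="{INFOSET_NS}">
  <unparsedEntities>
    <unparsedEntity id="e1">
      <name>logo</name>
    </unparsedEntity>
  </unparsedEntities>
  <children>
    <element>
      <localName>img</localName>
      <attributes>
        <attribute>
          <localName>src</localName>
          <references><pointer ref="e1"/></references>
        </attribute>
      </attributes>
    </element>
  </children>
</document>
""".strip()


def q(name):
    """Qualify a local name with the basic infoset namespace."""
    return f"{{{INFOSET_NS}}}{name}"


def resolve_pointers(root):
    """Pair every pointer element with the info item whose id matches its ref."""
    by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}
    pairs = []
    for pointer in root.iter(q("pointer")):
        ref = pointer.get("ref")
        if ref not in by_id:
            raise KeyError(f"dangling pointer: {ref!r}")
        pairs.append((pointer, by_id[ref]))
    return pairs


doc = ET.fromstring(SAMPLE)
pairs = resolve_pointers(doc)
target = pairs[0][1]   # the unparsedEntity element carrying id="e1"
```

Like the ID/IDREF-style identity constraints, this lookup checks only that the target exists, not that it is an info item of the right type.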

Since there is no requirement for a processor to produce all infoitems or properties, in the basic infoset schema all properties are optional. In addition, to allow extensions of the infoset to be validated against the basic schema, the content of every infoitem ends with the wildcard:

<s:any namespace="##other" processContents="lax" minOccurs="0" maxOccurs="unbounded"/>

There is a "strict" version of the schema which requires all the properties (but still allows extra properties from other namespaces).
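The difference between the two versions can be sketched in Python as follows. This is an illustration rather than a substitute for schema validation: the list of required properties is a hypothetical subset, the extension namespace and element names are invented for the example, and the check ignores ordering.

```python
import xml.etree.ElementTree as ET

INFOSET_NS = "http://www.w3.org/2001/05/XMLInfoset"

# Hypothetical subset of the properties the strict schema would require
# on an element info item.
REQUIRED = ("namespaceName", "localName", "prefix")

# Hypothetical serialization with one extension property in another namespace.
SAMPLE = f"""
<element xmlns="{INFOSET_NS}" xmlns:ext="http://example.org/my-extension">
  <namespaceName>http://example.org/ns</namespaceName>
  <localName>para</localName>
  <ext:lineNumber>12</ext:lineNumber>
</element>
""".strip()


def split_children(item):
    """Separate infoset properties from extension elements in other namespaces."""
    prefix = "{" + INFOSET_NS + "}"
    core, extensions = [], []
    for child in item:
        if child.tag.startswith(prefix):
            core.append(child)
        elif child.tag.startswith("{"):
            extensions.append(child)
        # Unqualified children fall in neither list: in XML Schema 1.0 a
        # ##other wildcard does not admit elements with no namespace.
    return core, extensions


def missing_properties(item, required=REQUIRED):
    """List the required properties that the info item does not contain."""
    core, _ = split_children(item)
    present = {child.tag.split("}", 1)[1] for child in core}
    return [name for name in required if name not in present]


item = ET.fromstring(SAMPLE)
core, extensions = split_children(item)
missing = missing_properties(item)   # non-empty: acceptable to the basic schema only
```

Under these assumptions the sample would be acceptable to the basic schema, which makes every property optional, but not to the strict one; the extension element is tolerated by both.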

The serialization of the basic infoset uses the namespace http://www.w3.org/2001/05/XMLInfoset and corresponds to what is expected to be the CR draft of the Infoset spec.

The serialization of the PSV infoset uses the namespace http://www.w3.org/2001/05/PSVInfosetExtension for added properties and infoitems, and corresponds to the XML Schema Recommendation.

Future work

The schemas are incomplete in some respects, which we intend to rectify. In particular, no serialization has yet been defined for ID/IDREF or identity constraint tables. The schema could be tightened up in several places (facets, for example).

We intend to make the schemas compatible with the RDF schema for the basic infoset, so that a serialization can be valid according to both.

There are no doubt many bugs in these schemas, which we will attempt to correct. Please mail Richard Tobin (richard@cogsci.ed.ac.uk) with corrections.