Normal Form Conventions for XML Representations of Structured Data
Normal Form Conventions for XML Representations of Structured Data
Henry S. Thompson
October 1 2001
As submitted to and accepted by XML 2001: For unclear
reasons not published in their proceedings.
1.
Introduction
Starting with the work of Andrew Layman
[Lay98]
the issue of establishing conventions for the XML representation of structured
data has generated considerable interest. In this paper we present an
explicit definition not only of what might be called
Layman normal
form
, but also of three other normal forms.
2.
Motivation
The initial motivation for the work presented here is to serve as the
starting point for a declarative approach to XML data binding, i.e. to the use
of XML as a transfer medium for structured data (see
[ThomSwick99]
for an introduction to this viewpoint). Such an approach must accommodate both
marshalling application data into XML document, and unmarshalling XML documents
into application data.
However with hindsight it has become apparent that it is also a useful
point of departure for consideration of the general question of the semantics
of markup vocabularies in the wild, so to speak. That is, when we look at the
DTDs and other schemas which have been written for markup applications, what
generalisations can we make about how markup is used to convey or record domain
properties? It is my contention that in practice existing DTDs often can be
understood as employing one or another of the normal forms set out below as
their encoding strategies. Understanding these normal forms can thus
contribute to the analysis of DTD patterns of markup use.
3.
Terminology
The kind of data we will explore encodings for will not be specified in
detail. For the time being, we'll assume a very simple ontology of
individuals, atoms, unary types and binary relations. Individuals have
identifiers and atoms have notations. Both types and relations have names. Identity
is an issue for individuals, but not for atoms. That is, it may make sense to
assert that two distinct identifiers identify the same individual, but atoms
are assumed to be uniquely determined by their type and notation. A type is a named set of
individuals or atoms. Relations are named pairs of
individuals and (individuals or atoms). Relations are
not
restricted to being functional, that is, they may be one-to-many.
This ontology is intended to be a reasonable intermediary between XML and
any number of different actual modelling languages or actual application data
models: Entity-Relation, UML, RDF, C or Pascal structures,
OO class instances and variables, first-order logic.
4.
Example dataset
All the following sections will use the following dataset to illustrate
their approach.
Types
PurchaseOrder
P1
Address
A1, A2
Item
I1, I2
string, integer, decimal, date
The types of most atoms are
shown explicitly
inline below. The one exception is
string
, where we use italics
instead of the explicit typename. For example, we use
"Alice
Smith"
instead of
string("Alice Smith")
Relations
orderDate
[P1,
date("1999-10-20")
shipTo
[P1,A1]
billTo
[P1,A2]
comment
[P1,
"Hurry, my lawn is going wild!"
],
[I1,
"Confirm this is electric"
item
[P1,I1], [P1,I2]
country
[A1,
"US"
], [A2,
"US"
name
[A1,
"Alice Smith"
], [A2,
"Robert Smith"
street
[A1,
"123 Maple Street"
], [A2,
"8 Oak Avenue"
city
[A1,
"Mill Valley"
], [A2,
"Old Town"
state
[A1,
"CA"
], [A2,
"PA"
zip
[A1,
integer("90952")
], [A2,
integer("19144")
partNum
[I1,
"872-AA"
], [I2,
"926-AA"
productName
[I1,
"Lawnmower"
], [I2,
"Baby monitor"
quantity
[I1,
integer("1")
], [I2,
integer("1")
USPrice
[I1,
decimal("148.95")
], [I2,
decimal("39.98")
shipDate
[I2,
date("1999-05-21")
5.
What is a 'Normal Form'?
I use the phrase 'normal form' in this paper in two related ways, one
concrete and one abstract. By the concrete use, when talking about a
normal form for a particular dataset, that is a particular collection of individuals, atoms, types and relations, I mean a representation or encoding in
XML of that dataset. By the
general use, e.g. when giving the definition below of Alternating Normal Form,
I mean a set of principles for constructing and/or interpreting concrete normal
forms. The first use will be notated in lower case (normal form), the second capitalised
(Normal Form)
In other words, once you know the principles for a Normal
Form, you know how, given a particular dataset, to construct a normal form for it,
that is an XML document
which represents it in that Normal Form, and given an XML document known to be
a normal form, that is constructed according to the principles of a Normal Form, you know how to (re)construct the dataset it represents or encodes.
It follows that in defining a Normal Form, both the mapping from datasets
to XML and the mapping from XML to datasets must be specified. In practice
specifying this from the perspective of the XML, in a way which is manifestly
invertable, seems to work well, and it is this approach which is followed below.
The following four sections set out the principles of four distinct
Normal Forms, together with the corresponding normal form of the example dataset.
6.
Alternating Normal Form
Attributes are not used.
The document element encodes a typed individual.
The children of the document element encode relations.
Their
children encode atoms or typed individuals.
And so on -- that is, the children of any element encoding an individual
encode relations, the children of any element encoding a relation encode either
atoms or typed individuals.
The types of atoms are not encoded in the document -- they must come from
elsewhere, e.g. from an XML Schema.
The alternating normal form for our example is as follows:
This Normal Form is the simplest to specify, but leads to very verbose
documents. It is the basis for the XML reflection of Infosets and PSVIs
proposed by
[ThomTobin01]
. Of the four Normal
Forms presented here it requires
the least additional information to complete the XML-to-dataset mapping.
One could also have a fully explicit Alternating Normal Form, in which
there was one more level, so that the types of atoms were encoded by elements
as well, and then
no
out-of-band information would be required, e.g.
. . .
Note that here as elsewhere the approach to atomic types is short on
detail. Namespace-qualified type atomic type names would almost certainly
be needed in practice.
7.
Relation Normal Form
The basic idea of this Normal Form was first proposed
to me by Istvan Cseri
[Cseri98]
Elements encode relations (except for the document element, which is just
a placeholder).
Attributes (optionally) encode relations to atoms.
(Schema) types encode typed individuals or the types of atoms.
A Relation normal form for our example is as follows:
It is noticeable that this form is
very
close to the example
as originally encoded, in the XML Schema Primer
[Fallside01]
. The only substantive difference is the lack above of an
wrapper for the separate
elements.
This Normal Form requires out-of-band information, e.g. from an XML
Schema, to supply the types of individuals and atoms. When unmarshalling from
XML to data, schema-validation with an appropriate schema will supply the
necessary type information in the post schema-validation Infoset (PSVI).
8.
Layman Normal Form
Elements encode typed individuals.
Attributes encode relations.
Attributes of type IDREF or IDREFS encode relations to individuals,
namely those whose ID(s) is/are their value.
Attributes of any other type encode relations to typed atoms, encoded
by their value.
(Schema) types encode the types of atoms.
Out-of-band information is required to identify the attributes of type
ID, IDREF
and IDREFS. This could be provided by a minimal set of XML 1.0 attribute
declarations in the internal subset, or by reference to an external DTD, or by
an XML Schema.
A Layman normal form for our example is as follows:
city="Mill Valley" state="CA" zip="90952"/>
city="Old Town" state="PA" zip="19144"/>
This is the most removed from any kind of `traditional' approach of the
Normal Forms presented here. It does not require the use of nesting to convey
any information at all, although it does not preclude it: Although in this
example the
Address
and
Item
elements are only nested
within the
PurchaseOrder
to ensure the whole is well-formed, in
examples with richer structure nesting indicative of dependency could perfectly
well be used.
9.
Individual Normal Form
The basic idea of this Normal Form was first proposed
to me by Adam Bosworth
[Bosworth98]
Attributes are not used.
Elements encode typed individuals or typed atoms.
Relations are encoded by position or by schema
annotation.
This Normal Form is quite counter-intuitive from the perspective of the
document markup tradition, but is quite similar to some approaches to archive
or transfer syntaxes for relational databases. It requires out-of-band
information, e.g. in
appInfo
elements in an XML Schema, to
determine relations. Alternatively a further convention for ordering relations
is required.
The Bosworth normal form for our example is as follows:
10.
Open issues
Re-entrancy
Only one of the four Normal Forms above addresses re-entrancy (shared structure) and/or
circularity directly, namely Layman Normal Form. As presented here, the others
are constrained to encode only proper trees, and cannot handle directed graphs.
It is obviously possible to add the use of IDREF/ID connections to at least
Alternating and Property Normal Form to handle these cases. An alternative
which works in those cases and also for Individual normal form is to introduce
a distinguished (i.e. in a separated namespace)
pointer
element
with an IDREF attribute, along with a distinguished
normalID
attribute. None of these approaches are quite as clean as Layman Normal Form
in handling re-entrancy however, in that given any item which is the value of
multiple relations, nesting will be used to encode one of the relations, and
some form of indirection for the others, where in principle there is not
distinction between them.
Collections
The absence of collections from the simple
ontology used here is obviously a weakness. There is after all a difference
between a relation between, say, an individual and a set, on the one hand, and
multiple instances of a relation between an individual and various others.
This difference cannot be encoded as the proposals above stand. The best way
forward to fill this gap is not yet clear to me.
11.
Conclusions
Our understanding of how we have used markup in the past, and how we may
use it in the future, is surprisingly limited -- mostly people have just
used
markup, without thinking too hard about what they were doing
in the abstract (although see
[HuitfeldMcQueen00]
for a
notable exception). I hope the enumeration of Normal Forms presented here may
help others in redressing this deficit: we already have some evidence
[KrupnikovThompson01]
that it yields practical benefits in supporting an XML-Schema-based approach to the declarative specification of the relation between XML document types and the application data they encode.
12.
References
Bosworth98
Bosworth, Adam
, personal
communication. Microsoft Corporation, Redmond, WA.
Cseri98
Cseri, Istvan
, personal
communication. Microsoft Corporation, Redmond, WA.
HuitfeldMcQueen00
Huitfeld, Klaus and C. Michael
Sperberg-McQueen
, [to be filled in].
KrupnikovThompson01
Krupnikov, K. Ari and
Henry S. Thompson
Data Binding Using W3C XML Schema Annotations
, in this volume.
Lay98
Layman, Andrew
, "XML Syntax Recommendation for
Serializing Graphs of Data", 1998. In
QL'98 - The Query Languages
Workshop Position Papers
, World Wide Web Consortium, Cambridge, MA.
Available online at
ThomSwick99
Thompson, Henry S.
and Ralph Swick, eds.,
The Cambridge Communiqué
, 1999. World Wide Web Consortium, Cambridge, MA.
Available online at
ThomTobin01
Thompson, Henry S.
and Richard Tobin,
A Schema for Serialized Infosets
, 2001. World Wide Web Consortium, Cambridge, MA.
Available online at