An ascetic view of XML best practices

MonasticXML.org is a look at XML from a different angle, focusing on what markup is best at rather than what markup can do to solve a particular problem or set of problems. While XML is powerful, developers seem insistent on using XML in ways which seem convenient for a moment but which cause much greater trouble down the line to both their projects and to markup itself.

Looking at XML from the perspective of "what markup is good at" may seem dull, at least to those who thrive on extending technologies in creative ways. MonasticXML.org presents a deeply conservative view of markup in general and XML in particular, though it often turns out that discipline brings its own rewards. Paying attention to the details of how a technology works may not be as interesting as building large projects, but it may save time and energy over the long term.

The pieces here are short, clearly my own perspective, and not necessarily perfectly coherent (or finished yet), but hopefully they will illuminate aspects of solid XML practice.

Marking-up at the foundation

XML is a markup technology. Like other markup technologies, notably SGML and HTML, it starts with textual documents. In XML these documents are considered to be made up of Unicode characters, though other character encodings may be used. XML documents encode information as a combination of ordinary text and markup, using a small set of characters to separate the markup and the content.

In the document practice from which XML developed, there is a strong notion of a document as something separate from and possibly prior to the markup. It's sort of an "in the beginning was the word, not the markup" perspective. In some ways it makes the most sense with documents, but it's also pretty easy to do with data, especially if that data is already in some textual form (CSV, tab-delimited, etc.).

"Marking up" information means starting with information and adding markup to that information to add metadata - structures and annotation - to the information. This could mean starting from a plain text document and adding markup information, or it could be a more mixed process, with information creation and markup structuring performed simultaneously.
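As a rough illustration of starting with information and adding markup to it, the sketch below takes one line of comma-separated text and wraps each value in a labeled element. The sample record and the field names are invented for the example, and Python's standard xml.etree.ElementTree module is used only as a convenient way to write well-formed markup.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Invented sample data: the information exists as plain text before any markup.
plain_text = "Ada,Lyon,2002-09-16\n"
field_names = ["name", "city", "date"]  # labels chosen for this example only


def mark_up(line_text, labels):
    """Wrap each value of one delimited record in a labeled element."""
    row = next(csv.reader(io.StringIO(line_text)))
    record = ET.Element("record")
    for label, value in zip(labels, row):
        ET.SubElement(record, label).text = value
    return record


record = mark_up(plain_text, field_names)
marked_up = ET.tostring(record, encoding="unicode")
print(marked_up)
```

The text of the record is unchanged by this process; the markup only adds names and structure around it.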

Markup doesn't have to use angle brackets and XML rules. There are a lot of other forms of markup, from wikis to lexical date notations. Mixing markup with content is itself controversial from certain viewpoints, but remains a useful approach for many different projects.

A common view emerges from this process. The view holds that the "content" of a document is the textual content of its elements, with the markup - element structures and the attributes in those structures - merely providing additional information. Stripping out the markup will certainly remove structure and annotation, but in some sense the basic information is still there.
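A small sketch of what "stripping out the markup" can look like in practice, assuming Python's standard xml.etree.ElementTree module. Both sample documents are invented; the first keeps its information in element content, the second keeps it in attributes.

```python
import xml.etree.ElementTree as ET

# Invented samples. In the first, the information lives in element content
# and the attribute only annotates it. In the second, attributes hold it all.
content_doc = '<note lang="en">Meet <person>Ada</person> at <time>noon</time>.</note>'
attribute_doc = '<note lang="en" person="Ada" time="noon"/>'


def strip_markup(xml_text):
    """Return only the character data, discarding tags and attributes."""
    root = ET.fromstring(xml_text)
    return "".join(root.itertext())


stripped_content = strip_markup(content_doc)
stripped_attributes = strip_markup(attribute_doc)
print(repr(stripped_content))
print(repr(stripped_attributes))
```

With element content, the basic information survives the loss of the markup. With attributes holding the information, nothing is left.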

As it turns out, there are a huge number of advantages to using element content to represent the information in the document and reserving attributes for annotation. The most important reason is extensibility. It's far easier to create child elements to reflect complex content than to break an attribute into pieces. At the same time, you can use attributes to refine your understanding of that element, to annotate it with extra information. Creating attributes which talk about attributes is very difficult - there are no conventions.

A lot of developers use attribute structures to hold the content of the document because it seems to make their lives easier or corresponds more explicitly with the approaches they are used to seeing in their code. An attribute feels much like a local variable, making it an easy match for (simple) object properties. Attributes are of course less verbose, and they also turn out to be easier to collect as a group when doing SAX processing. There are real costs to this, however, much like the costs of creating "final" and unextensible code. In markup, making these decisions is often difficult because it's largely impossible to predict where documents will end up. Optimizing document size by using attributes rather than child elements for content is a short-term strategy with long-term consequences.
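The extensibility cost can be sketched with a made-up example: a price stored once as an attribute and once as element content, followed by a later requirement to record the currency. The element and attribute names here (item, price, currency, price-currency) are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Both documents are invented for illustration.
as_attribute = ET.fromstring('<item price="12.50"/>')
as_element = ET.fromstring('<item><price>12.50</price></item>')

# Later requirement: record the currency of the price.
# Element content can simply be annotated with an attribute of its own.
as_element.find("price").set("currency", "EUR")

# An attribute cannot carry attributes or children, so the only option is
# a sibling attribute tied to it by a naming convention made up on the spot.
as_attribute.set("price-currency", "EUR")

print(ET.tostring(as_element, encoding="unicode"))
print(ET.tostring(as_attribute, encoding="unicode"))
```

Nothing in the second document says that price-currency describes price; that connection exists only in the heads of the people who invented the convention.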

Naming things and reading names

Markup is fundamentally the identification of information. Marking up a document is effectively a naming process, identifying labels that are appropriate to particular pieces of information. SGML and XML 1.0 recognized that elements form the foundation of the markup type system, using element type declarations to describe element types, and then using the names inside the tags to identify which type was appropriate to the content contained by the tags.

For various reasons, some developers see the use of names as types to be inadequate. In some cases, attributes can provide additional useful information about how to interpret element content in a particular instance context, identifying, for instance, a date as written using a particular lexical format. In other cases, names act as hooks for processing, and applications use their own mappings from names to internal structures. A formal description of a model may be helpful in this mapping process.

Labeling textual content does not change the textual nature of that content, however. While some developers see that textual nature as a curse, bringing verbosity, redundancy, and sometimes ambiguity to the information, the textual nature of the names, the annotation, and the content keeps information accessible to a wide variety of systems and leaves open the possibility that these systems may interpret the information very differently.

Applying a structured naming system to information enriches the information. It does not necessarily lock that information into a particular set of rules. Processors (including human readers) can use the names and apply their own sets of understandings about whether a given named structure is sensible.

Namespaces as opportunity

Namespaces in XML is one of the greatest disasters that XML's creators have inflicted upon XML. Namespaces complicate matters for two reasons. First, the declarations for namespaces are made using scoped attribute values, making it much more difficult to move content from one part of a document to another while remaining certain that the labels remain the same. Second, namespaces use URIs as identifiers, a much more complicated set of identifiers than the simple names used on XML 1.0 elements and attributes, laden with the baggage of years of expectations for URLs that have never been standardized for their usage in XML.

Although the specifications have made little effort to contain the damage, these complexity issues can be controlled and namespaces can be a useful part of XML practice. Namespaces offer the chance to identify vocabularies with a label that offers additional information or tools for processing. Rather than treating the URI as an abstraction, you can embrace its concrete value as a locator. The Resource Directory Description Language (RDDL) provides an XML-based standard for such descriptions.

There are also several approaches to handling the scoping issue. Perhaps the most common approach is to put all namespace declarations that will be used in a document in the root element of the document, providing a single point of reference. An alternative approach puts namespace declarations as close to the elements and attributes that use them as possible. This creates a messy-looking document, but may be appropriate for occasional inclusions of elements from foreign namespaces.
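The scoping problem is easy to reproduce. In the invented document below, the prefix p is bound to two different URIs in two different parts of the document, so the identical-looking label p:item names two different things. The sketch assumes Python's standard xml.etree.ElementTree module, which reports element names in expanded {URI}local form.

```python
import xml.etree.ElementTree as ET

# Invented example: the same prefix is declared twice, in different scopes,
# with different (made-up) namespace URIs.
doc = """
<root>
  <a xmlns:p="http://example.com/one"><p:item/></a>
  <b xmlns:p="http://example.com/two"><p:item/></b>
</root>
"""

root = ET.fromstring(doc)

# Take the first child of <a> and of <b>: both were written as <p:item/>.
names = [child[0].tag for child in root]
for name in names:
    print(name)
```

Copying the text of one p:item element into the other scope would silently change its name. Declaring every namespace once, on the root element, removes that particular trap.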

Another nit in the specification leads to ambiguity about how to handle unprefixed attributes. The specification states "Note that default namespaces do not apply directly to attributes," but fails to provide further guidance, leaving unqualified attributes in no namespace at all. While there are a number of approaches to dealing with this situation, the safest approach is most likely for applications and vocabularies to treat unqualified attributes as being in the namespace of the element that contains them, prefixed or not. As attributes are metadata for their elements, this approach seems consistent. In general, mixing qualified and unqualified content (something the W3C XML Schema specification defines and SOAP uses) is a poor idea, substantially complicating the disambiguation that Namespaces in XML was supposed to accomplish.
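A sketch of the unprefixed-attribute situation, again assuming Python's xml.etree.ElementTree and an invented document. The helper attribute_namespace is hypothetical: it implements the convention suggested above, which is an application-level choice and not something the Namespaces in XML specification defines.

```python
import xml.etree.ElementTree as ET

# Invented document with a default namespace and an unprefixed attribute.
doc = '<order xmlns="http://example.com/orders" status="open"/>'
order = ET.fromstring(doc)

print(order.tag)     # the element name carries the default namespace
print(order.attrib)  # the attribute name carries no namespace at all


def attribute_namespace(element, attr_name):
    """Convention from the text, not from the specification: treat an
    unqualified attribute as belonging to the namespace of its element."""
    if attr_name.startswith("{"):
        # Already qualified: use the attribute's own namespace.
        return attr_name[1:].split("}", 1)[0]
    if element.tag.startswith("{"):
        # Unqualified: fall back to the element's namespace.
        return element.tag[1:].split("}", 1)[0]
    return None


status_namespace = attribute_namespace(order, "status")
print(status_namespace)
```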

Namespaces have led to the creation of another whole set of problems: namespace-qualified names (QNames) used as attribute or element content. The W3C XML Schema specification blesses this unfortunate practice, which drives the ambiguity and scoping problems of namespaces into content as well as into labels. XML 1.0 parsers have no mechanisms for understanding QNames on their own, and only higher levels of applications will have a chance of figuring out what QNames mean.
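To see why QNames in content push work up to the application, consider the sketch below. The document, the type attribute, and the resolve_qname_values helper are all invented for illustration. The parser hands back the attribute value as an ordinary string, and the application has to track prefix bindings itself in order to make sense of it.

```python
import io
import xml.etree.ElementTree as ET

# Invented document: the attribute value looks like a qualified name.
doc = ('<field xmlns:t="http://example.com/types" '
       'type="t:amount">12.50</field>')

# The parser does nothing special with the value; it is just a string.
raw_value = ET.fromstring(doc).get("type")


def resolve_qname_values(xml_text, attr_name):
    """Application-level resolution of QName-valued attributes.

    Simplification: prefix bindings are collected into one flat table, so
    this is only correct when no prefix is redeclared within the document.
    """
    bindings = {}
    resolved = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = payload
            bindings[prefix] = uri
        else:
            value = payload.get(attr_name)
            if value and ":" in value:
                prefix, local = value.split(":", 1)
                resolved.append((bindings.get(prefix), local))
    return resolved


resolved = resolve_qname_values(doc, "type")
print(raw_value)
print(resolved)
```

Handling redeclared prefixes properly would require a stack of bindings that follows element scope, which is exactly the scoping problem moving from labels into content.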

Senders and receivers

Conversations always have two sides - a sender and a receiver. They may be the same person or the same computer, and they may change roles on a regular basis, but sending and receiving information is at the heart of communication.

The roles of sender and receiver are asymmetrical. While senders (speakers) are often considered more important than receivers (listeners), in large part because speakers are often considered to be "commanders", the relationships are not that simple. Anyone who has shouted at people who can't hear or understand a message - or simply don't want to hear it - has experienced the power of the receiver.

Computing has long combined strong expectations that receivers will perfectly understand a given set of messages with a parallel understanding that receivers will fail ("Abort, retry, fail?" or some other behavior) when they receive messages beyond their ability to interpret. Programmers are very used to working with near-perfect communications inside a tightly constrained set of understandings. Changing those understandings requires explicit work - a type cast or inheritance structures, for instance.

Markup is certainly computerized communications, but it offers a very different set of possibilities than the models typically encountered in computing. While XML, for example, has severely constrained syntax, it has endless possibilities for describing labeled structures and contents.

Developers who are used to writing programs which deal with infinite possibilities stored in a limited number of containers may retreat rapidly from the infinite number of containers XML offers. In trying to return to the communications models with which they were previously acquainted, they discard most of the potential XML offered in the first place.

Markup offers a communications model that is subtly different from most programming. Messages are still crucial, and messages are still tightly structured. Responsibility for the structure of those messages is now distributed, however, and responsibility for the interpretation of those messages is now more explicitly local. Tight binding makes little sense in an environment designed for openness and extensibility.

In contrast to many other computing systems, this shift in the nature of the message opens up new possibilities in which the recipient of a message is the arbiter of its content, rather than the sender. There may or may not be agreements between senders and receivers about the structure and content of the messages; meaning is, in the end, up to the receiver and whatever obligations it chooses to uphold.

This feels much more like human conversation than method-calling, and suggests that developers may want to reconsider their style of communications to embrace a wider set of possibilities rather than rushing to restrain them.

How long will documents be used?

One of the most interesting and simultaneously most frustrating aspects of XML is that developers can rarely count on their documents having one and only one use. Because XML is such an open format, it's always possible to apply a different set of processing tools to the same information. While this may seem unappealing or unlikely to some developers, it's actually one of XML's greatest strengths. XML's harsh insistence on a common syntax also means that programmers can develop tools which use that syntax in a wide variety of ways.

Messages intended for use only on the wire may seem like transient bits with no chance of later (and different) processing, but there are all kinds of scenarios where documents with only one acceptable kind of processing and a brief lifespan turn into an important and minable information source which can be used by a wide variety of applications.

RPC calls and similar short messages may be processed and discarded, but the sender, recipient, or an intermediary may store copies of those messages for regulatory, administrative, or other reasons. Because the messages are XML, they are easily stored, reprocessed, and reused by anyone who takes the time to investigate them.

Developers may not want to consider the unknown future of their information, which may prove to be more permanent than their software. The risk of information permanence is real, however, and although XML wasn't designed explicitly to encourage it, it certainly makes it more likely. Thinking ahead, considering the prospect that other people will want or need to use your information and the document structures you create in circumstances you can't predict, should be yet another reason to be very careful in your markup design.

Approaching information

While it is very tempting to approach information from the perspective of a pre-defined structure, this seemingly easy approach creates its own difficulties. Markup offers the opportunity to take a much broader view of information structures, allowing the natural structures and contexts of the information to take precedence over a more limited view of what the information should be. Rather than cramming information into a pre-fab box, markup provides a more generous set of containers.

The long history of strictly constrained information in computing has roots in several different sets of problems. Computers lacked power, networks lacked bandwidth, storage was precious, and programmers wanted tools to catch their mistakes. While loosely-typed information is common in scripting languages, it is much less popular in environments which are closer to the hardware (notably C and C++) or derived from that family of languages.

For years, writing programs has meant identifying a set of structures and writing code around those structures. Information comes later, and is made to conform to those structures. (Non-conformant data is often simply rejected.) Working with markup offers developers a different option, though one that takes some getting used to: applying labels and structure to data, perhaps 'painting' on the structure, rather than breaking the data into strictly pre-defined pieces.

This is a large leap to make, and a path rather different from many of the data-binding and data-modeling tools offered for developers wanting to use XML. This approach sees the information and the markup as primary, and not as a mere serialization of an object, table, or other predefined container. This approach is more open to non-deterministic styles of markup, and requires a different style of working with information. Mapping information between one set of fixed structures and another is a start, but only a small step.

Perhaps most important, this perspective removes the programmer as controller of the information. The information arrives first, then it is marked up with labels and structures, and then it is a program's task to interpret that information.

Explicit, not implicit

While DTDs, schemas, and namespaces all rely on mechanisms which define content which is implicit rather than contained directly in the document, their sleight-of-hand is a key factor in making XML processing difficult and opaque. Some key labels - element structures - are always present directly in the document, but the namespace declaration, schema type, and attribute default information may be elsewhere. Namespace declarations may be hidden inside of DTDs as defaulted attributes, requiring multiple steps to figure out where namespace declarations originated, and making it easy to lose them in processors (XML 1.0 non-validating parsers) which don't necessarily read the external subset of the DTD. These declarations are lost to simple text-based processors which might well be useful in many markup-processing situations, and so is any information declared in a W3C XML Schema.
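The difference is easy to demonstrate with a deliberately naive text-level tool. In the sketch below, both invented documents are meant to carry the same namespace declaration and status attribute, but the first supplies them as attribute defaults in a DTD internal subset and the second writes them on the start tag. The regular expressions are a stand-in for any simple text-based processor and are not a real XML parser.

```python
import re

# Invented documents. The first relies on defaulted attributes declared in
# the DTD; the second states the same information explicitly.
implicit_doc = """<!DOCTYPE doc [
  <!ATTLIST doc xmlns CDATA "http://example.com/ns" status CDATA "draft">
]>
<doc>text</doc>"""

explicit_doc = '<doc xmlns="http://example.com/ns" status="draft">text</doc>'

START_TAG = re.compile(r"<doc\b([^>]*)>")
ATTRIBUTE = re.compile(r'([\w:.-]+)\s*=\s*"([^"]*)"')


def visible_attributes(xml_text):
    """Attributes a naive text-level tool can see on the <doc> start tag."""
    match = START_TAG.search(xml_text)
    return dict(ATTRIBUTE.findall(match.group(1))) if match else {}


seen_implicit = visible_attributes(implicit_doc)
seen_explicit = visible_attributes(explicit_doc)
print(seen_implicit)
print(seen_explicit)
```

The simple tool sees nothing on the first start tag, including the namespace the element is supposed to be in, and everything on the second.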

Notions of infosets may excite developers who prefer abstract representations to concrete syntax, and implicit information may well be available in strictly controlled circumstances, but explicit data is available to all. Every type of XML processor, including humans, has a decent chance of processing more explicitly presented labels and structures.

Meeting local expectations

XML 1.0's use of Unicode may have eased many internationalization problems, but localization - across countries, organizations, or people - is a separate issue. XML's foundations are flexible enough to support localization, but many of the layers built on top of XML are designed to remove or reduce that flexibility.

W3C XML Schema offers some obvious examples of this reduction, particularly in its numeric and date/time types. All content which uses these types must be in formats which are inconvenient for large parts of the world, and W3C XML Schema offers no option whatsoever for issues as simple as changing the decimal separator from a period to a comma, not to mention the complete lack of support for calendar systems other than the Gregorian.

Localization issues go beyond the classic problems of working in different countries and cultures, however. A particularly local view of localization might look at the different understandings held by different organizations, parts of an organization, or even individuals. The classic answer to information interchange with XML is to stamp out variation to the greatest extent possible, demanding conformity to particular vocabularies and a type system blessed by the W3C. While this may force everyone into a style which makes life easier for programmers, it also substantially reduces the flexibility of markup approaches and makes the stakes in standardization processes much much higher.

Is there a way out? There may be. Relying exclusively on the markup contained in a document to interpret its structures (rather than calling out to schemas for additional information) makes documents more portable, reducing the processing mismatches that are easily created with the many layers of XML specifications. Using markup to identify rather than constrain makes it simpler for recipients to interpret the information in a document as they need, in a context which reflects their needs and the relationship between the recipient and the document (and the document's sender) rather than simple obeisance to a fixed and often brittle set of rules. Formal descriptions of those structures can then be applied as local sanity-checking, rather than as global straitjackets.
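One way to picture markup that identifies rather than constrains is a recipient that reads an annotation about lexical format and applies its own interpretation. Everything in this sketch is hypothetical: the amount element, the separator attribute, and the read_amount function are invented here and are not part of any standard vocabulary.

```python
from decimal import Decimal
import xml.etree.ElementTree as ET

# Invented document: the amount is written with a decimal comma, and a
# hypothetical "separator" attribute annotates the convention in use.
doc = '<amount separator=",">1234,50</amount>'


def read_amount(xml_text):
    """Recipient-side interpretation using only what the document states."""
    element = ET.fromstring(xml_text)
    separator = element.get("separator", ".")
    # Normalize to the form Python's Decimal accepts (a period).
    return Decimal(element.text.strip().replace(separator, "."))


amount = read_amount(doc)
print(amount)
```

The sender writes the number the way its own context expects, the annotation travels with the content, and the decision about how to read it is made locally.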

Accepting the discipline of trees

Many of the problems that people have in using XML derive from XML's most basic structures. XML documents are series of bytes which conform to particular rules, rules which provide far more support for encapsulation than for cross-reference. This generally makes XML most amenable to representing tree structures, and often requires an in-memory representation if developers need to wander around those structures.

While tree structures have received enormous good press since XML's appearance, and apply very naturally to a wide variety of problems, many developers don't actually work in terms of tree structures. Object definition and typing structures may seem hierarchical, especially where single inheritance is required, but in practice objects rarely are stored or used in neat hierarchical structures. Relational databases deliberately break information into small pieces and discard the order, and pointers among pieces of relational information can take interpreters along a variety of paths. Other approaches, like RDF and Topic Maps, explicitly define their content as a graph of nodes, traversable and processable at will.

In many ways, XML's tree approach seems retrograde relative to the freedom and flexibility these other approaches offer. The fact that every component of an XML document comes in a particular sequence whether or not that sequence is valued seems constraining to developers who just want to get to the data model. That sequence issue also raises performance issues for developers who want to reach a particular piece of information quickly, but find it's at the very end of a document. Tools for navigating tree structures can fall into endless loops when confronted with data where nodes can have more than one parent. Even the relatively simple task of cross-references within a document receives only limited support in XML, through the somewhat broken ID, IDREF, and IDREFS.

Trees definitely have their limitations, and many of the most difficult problems in XML work emerge from trying to cram graphs into trees. At the same time, however, tree structures fit a certain class of problems - a class of problems on which markup has always focused - extremely well. Writing, even hypertext writing, has always been concerned with order, sequence, and structure. Children learn letters, words, paragraphs, then document structures. Structures of various kinds provide guideposts in nearly every form of writing, even when those guideposts are deliberately abused. These structures fit beautifully into markup environments.

If these tree structures fit your work naturally, XML is probably a good choice. XSLT and XML APIs offer powerful tools for working with information represented as trees, especially for converting from one tree to another. Basic cross-reference, both inside a document and beyond, isn't that difficult. Even if you need to represent overlapping trees, tools like LMNL and Just-In-Time-Trees offer mechanisms for working with such structures in an XML-like context.
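Basic cross-reference inside a document can be sketched in a few lines. The document below is invented, and the attribute names id and target are conventions assumed for this example; without a DTD, nothing tells the parser that they are IDs or IDREFs, so the application builds its own index.

```python
import xml.etree.ElementTree as ET

# Invented document with one internal cross-reference.
doc = """
<article>
  <section id="intro"><title>Introduction</title></section>
  <section id="trees"><title>Trees</title>
    <para>See <ref target="intro"/> for background.</para>
  </section>
</article>
"""

root = ET.fromstring(doc)

# Index every element that carries an "id" attribute (assumed convention).
by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}


def resolve(ref_element):
    """Follow a reference to the element it points at, or None if dangling."""
    return by_id.get(ref_element.get("target"))


target = resolve(root.find(".//ref"))
title = target.findtext("title")
print(title)
```

The tree stays a tree; the reference is a lookup layered on top of it, which is about as far as this kind of support comfortably goes.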

On the other hand, if you need to represent information which demands the flexibility of graphs, it may be a good idea to look for alternative approaches which don't impose the discipline (and processing expectations) of trees on your data. At the moment, a number of graph-based approaches are using XML for their serializations because of the usefulness of the syntactic agreement XML provides and its surrounding toolkit, while struggling with the actual representation. The more twisted the graph, the more likely that it's time to break free of the trees and look for alternative approaches to representing the information.

Optimizing markup for processing is always premature

Every now and then, an article will appear suggesting the best ways to reduce XML's verbosity or the processing that parsing XML documents requires. Common features include things like abbreviated names, reduced whitespace, and the ever-frequent use of attributes for content to avoid the overhead of an end tag. Using default values for attributes or avoiding namespace declarations can save some space. Some applications go so far as to combine multiple values in a single attribute or element space, to reduce the overhead of multiple labels.

All of these options have genuine costs and should be avoided. Abbreviated names and reduced whitespace make it much harder for humans to inspect a document should a system break down unexpectedly. The use of attributes for content makes it extremely and unnecessarily complicated to supply metadata for that content (or fragment that content into smaller pieces) if further information about the content is needed. Defaulting attributes hides them and generally makes them rely on unreliable processing (schemas and DTDs are not always processed). Skipping out on namespace declarations (by defaulting those attributes!) or avoiding the use of namespaces altogether removes a key part of what were supposed to be unambiguous labels. Choosing how best to normalize information into markup is always a difficult question, but deliberately combining pieces of information which clearly have a separate existence can be problematic.
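The cost of combining values can be seen in a small, invented example: the dimensions of a box packed into one attribute versus labeled separately. The names box, dims, width, height, depth, and unit are hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented documents carrying the same three values.
packed = ET.fromstring('<box dims="10 20 5"/>')
separate = ET.fromstring(
    "<box><width>10</width><height>20</height><depth>5</depth></box>"
)

# Packed: every consumer must know the delimiter and the order of the
# values, neither of which is stated anywhere in the document.
width, height, depth = packed.get("dims").split()

# Separate: each value is found by name and can be annotated on its own.
labeled_height = separate.findtext("height")
separate.find("height").set("unit", "cm")

print(height, labeled_height)
print(ET.tostring(separate, encoding="unicode"))
```

The packed form saves a few dozen characters and moves a small parsing problem, along with its undocumented rules, into every program that ever reads the document.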

While verbose markup may seem too large to some people, and perversely verbose markup (extra layers of elements which serve no clear purpose, for instance) should also be avoided, planning markup structures around the needs of a particular application's low-level code is a disservice to the information contained in those structures. Creating markup structures which are written to support one application as concisely as possible substantially limits the capabilities of those documents for creative, possibly constructive, and likely unpredicted reuse.

Developers who need to create optimized formats would be well-advised to look either to tools like gzip (simple file compression, well-suited to compressing typically redundant markup) or to alternatives like binary formats and ASN.1. Markup can represent information stored in most other formats, but that doesn't mean that markup should be used in every case or stretched to cover use cases where it's not well-suited to the performance demands.
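A quick sketch of the gzip option, using Python's standard gzip module and an invented, deliberately repetitive document. The element names and the repetition count are arbitrary, and the exact sizes will vary with the content, so no particular ratio is claimed here.

```python
import gzip

# Invented, verbose, repetitive markup of the kind that compresses well.
record = (
    "<measurement><temperature>21.5</temperature>"
    "<humidity>40</humidity></measurement>"
)
document = ("<measurements>" + record * 500 + "</measurements>").encode("utf-8")

compressed = gzip.compress(document)
restored = gzip.decompress(compressed)

print(len(document), "bytes as markup")
print(len(compressed), "bytes after gzip")
```

The readable names and explicit structure remain in the document itself; the size is dealt with at a separate layer, and decompression returns the original bytes unchanged.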

Uniform Resource Identifiers are not

One of the technologies that markup has had to integrate for its work on the Web is Uniform Resource Identifiers (URIs). Unfortunately, while simple use of URIs is quite simple, more sophisticated use can create new and difficult problems.

The only thing genuinely uniform about URIs is their syntax, described in RFC 2396 and its likely successors. (Internationalized Resource Identifiers, or IRIs, are in development.) Within the combination of scheme and scheme-specific content, as well as the relative URIs used for convenience and the additional possibility of fragment identifiers, there are an enormous number of possibilities that make uniform processing of URIs remarkably difficult. Even simple equality tests are difficult if not impossible to define, as comparison rules vary among schemes and contexts.
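The equality problem can be sketched with two invented URIs that many HTTP tools would treat as equivalent but which are different strings. The function same_after_http_normalization is a hypothetical, partial normalization written for this example; it covers only http URIs and only case and default-port differences, and it is one possible comparison rule among many.

```python
from urllib.parse import urlsplit

# Invented URIs that differ only in case and in an explicit default port.
first = "http://example.com/ns"
second = "HTTP://EXAMPLE.COM:80/ns"


def same_string(uri_a, uri_b):
    """Character-for-character comparison, as used for namespace names."""
    return uri_a == uri_b


def same_after_http_normalization(uri_a, uri_b):
    """One possible http-specific comparison: case-insensitive scheme and
    host, with a missing port treated as port 80."""
    def key(uri):
        parts = urlsplit(uri)
        return (
            parts.scheme.lower(),
            (parts.hostname or "").lower(),
            parts.port or 80,
            parts.path or "/",
            parts.query,
        )
    return key(uri_a) == key(uri_b)


print(same_string(first, second))
print(same_after_http_normalization(first, second))
```

Used as namespace names, these two strings identify different namespaces; used as addresses for an HTTP request, they would typically lead to the same place. Which answer is right depends entirely on context.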

Beyond syntax, the relationship of identifiers to resource is defined as a circle: resources are things which can be identified, while identifiers are things which identify resources. That simple circle conceals a wide range of ambiguities, as the interpretation of which resource an identifier identifies regularly demands more context. The naive understandings of 'resolving' a URI to retrieve content representing that URI work poorly when the URI is opaque, when multiple representations are available, or when multiple resources may be identified by the same identifier, as is common practice in XML namespaces.

Ensuring that URIs and URI references work consistently requires more work than simply specifying that a particular value must be a URI. Consistent use of URIs requires the definition of far more context than URIs provide by themselves and an explicit understanding of whether the URI processing applies to resources in the abstract or to representations in the concrete. URNs, for instance, fare poorly as hyperlink targets on the public Web. URI references with fragment identifiers may fail unexpectedly because fragment identifiers are tied to particular representation formats. In that case, the identifiers only have meaning in a given representation context. Efforts to retrieve information about namespace URIs (or other kinds of URIs used as identifiers, like SAX2 features and properties) are likely to fail, though these URIs frequently take forms that look retrievable on the surface, because the processing context is not typical.

Varying contexts make it difficult to make statements about URIs out of context. A statement about http://monasticxml.org/uris.html, for instance, would seem likely to describe this page, but it could also describe a namespace using that identifier or other resource behavior that might exist beyond the classic HTTP GET. The coherence of the resource is up to the behavior of the URI, for better or worse.

Markup can be useful for providing the extra context that bare URIs are lacking. The HTML form element has long had a method attribute, specifying whether information should be sent to the URI using GET or POST. Similarly, markup which needs to identify content in a particular representation of a resource might provide additional information, perhaps a MIME content-type identifier. These issues may force changes in your markup structures if they are relevant to the information with which you are working.

The Web has demonstrated that tolerating failures (the classic 404 not found, but also more generally) in URI processing can be a workable solution that simplifies the overall architecture. If you plan to use URIs, learn from this. Set your expectations for use as explicitly as possible, and expect less than your expectations would suggest. URIs are a marvelous abstraction but often need to be constrained and supplemented to function concretely.

Beyond the XML mirage

For various reasons, developers seem to expect that they can solve a wide variety of problems by simply using XML formats and using XML tools to manipulate that information. They cast their expectations on using XML itself to solve the problem, when in fact their problems need much more attention than a common syntax.

Developers who focus directly on creating and manipulating XML structures rather than using XML to represent the information they need to create and manipulate are often disappointed by the amount of work they create for themselves. XML can be an elegant syntax for representing information, but labeled structures are not themselves data models.

Making XML useful requires an understanding of the information, and a separation between the understanding of the information and the expectations for processing it. Treating the XML as the information and processing it directly makes it much harder for different people and organizations to share and process the same documents.

Unfortunately, an enormous amount of effort has gone into blurring that separation. The Document Object Model (DOM) and similar APIs are widely used by developers who just want to manipulate the XML without importing it into their own structures. Tools which treat XML purely as a serialization of objects or database tables often focus on the information as it appears in the objects or tables, but pay little attention to how that information might best be represented in XML. In both cases, the separation between XML and the information is barely respected, and it's not surprising that these approaches are both limiting and frustrating.

Making XML work - using XML syntax to share information - requires more than generic tools and frameworks. Building ever more abstract models which represent XML contents is useless without direct connections to ever more specific applications. Generic tools are powerful and useful, but only to the extent that they can solve specific problems.

Developers need to take responsibility for building conduits between their information and XML representations of it which take into account how XML works. This project requires a combination of generic tools and specific expertise, combining tools like XML parsers and XSLT processors with an understanding of how the information in the XML document connects to the information expectations of the application. Simple pathways connecting the two are possible but probably unusual, at least if documents are widely shared between applications with even slightly different information expectations.
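A minimal sketch of such a conduit, assuming a made-up application that cares about names and email addresses. The Contact class, the contact_from_xml function, and the person, name, email, and nickname elements are all hypothetical. The point is only that the mapping is written explicitly, reads what this application expects, and leaves the rest of the document alone.

```python
from dataclasses import dataclass
from typing import Optional
import xml.etree.ElementTree as ET


@dataclass
class Contact:
    """The application's own view of the information (invented here)."""
    name: str
    email: Optional[str]


def contact_from_xml(xml_text):
    """Explicit mapping from a document to this application's expectations.
    Elements the application does not understand are simply ignored."""
    root = ET.fromstring(xml_text)
    name = root.findtext("name")
    if name is None:
        raise ValueError("no <name> element found")
    email = root.findtext("email")
    return Contact(name=name.strip(), email=email.strip() if email else None)


# Invented document with an element this application knows nothing about.
doc = "<person><name> Ada </name><nickname>A.</nickname></person>"
contact = contact_from_xml(doc)
print(contact)
```

Another application with different expectations could read the same document and keep the nickname instead. Neither mapping changes the document, and neither is the document.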

All this mapping may seem like extra work to developers who just want to ship information from point A to point B, but it's at the heart of making markup useful. XML itself doesn't solve any of these problems - developers do.

The author

Simon St.Laurent is the author of too many computer books (notably XML: A Primer (three editions) and XML Elements of Style). He is currently an editor at O'Reilly & Associates, focused mostly on XML-related topics.
