Accepting the discipline of trees

Many of the problems that people have in using XML derive from XML's most basic structures. XML documents are series of bytes which conform to particular rules, and those rules provide far more support for encapsulation (nesting one piece of content inside another) than for cross-reference. This generally makes XML most amenable to representing tree structures, and often requires an in-memory representation if developers need to wander around those structures.
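As a rough illustration of why wandering around a document tends to require an in-memory representation, the sketch below parses a small document with Python's standard xml.etree.ElementTree module and builds an index of parents so that code can move upward as well as downward. The sample document and the names parent_of and path_to are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for illustration.
DOC = """<book>
  <chapter title="One">
    <para>First paragraph.</para>
    <para>Second paragraph.</para>
  </chapter>
  <chapter title="Two">
    <para>Third paragraph.</para>
  </chapter>
</book>"""

root = ET.fromstring(DOC)

# ElementTree elements know their children but not their parent, so
# moving upward requires an index built from the whole in-memory tree.
parent_of = {child: parent for parent in root.iter() for child in parent}


def path_to(element):
    """Return the element names from the root down to `element`."""
    names = []
    while element is not None:
        names.append(element.tag)
        element = parent_of.get(element)  # None once we pass the root
    return "/".join(reversed(names))


third = root.findall(".//para")[2]          # third para in document order
path = path_to(third)                       # walk upward to the root
chapter_title = parent_of[third].get("title")
```

Nothing here is specific to ElementTree; any API that holds the whole tree in memory allows this kind of navigation, at the cost of loading the entire document first.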

While tree structures have received enormous good press since XML's appearance, and apply very naturally to a wide variety of problems, many developers don't actually work in terms of tree structures. Object definition and typing structures may seem hierarchical, especially where single inheritance is required, but in practice objects are rarely stored or used in neat hierarchical structures. Relational databases deliberately break information into small pieces and discard the order, and pointers among pieces of relational information can take interpreters along a variety of paths. Other approaches, like RDF and Topic Maps, explicitly define their content as a graph of nodes, traversable and processable at will.

In many ways, XML's tree approach seems retrograde relative to the freedom and flexibility these other approaches offer. Every component of an XML document comes in a particular sequence, whether or not that sequence is valued, which seems constraining to developers who just want to get to the data model. Sequence also creates performance problems for developers who want to reach a particular piece of information quickly but find it at the very end of a document. Tools for navigating tree structures can fall into endless loops when confronted with data where nodes can have more than one parent or where links loop back on themselves. Even the relatively simple task of cross-references within a document receives only limited support in XML, through the somewhat broken ID, IDREF, and IDREFS attribute types.
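The looping hazard can be sketched in a few lines. The example below is only an illustration under stated assumptions: the id and ref attributes are plain attributes standing in for DTD-declared ID and IDREF attributes (ElementTree does not read DTD attribute types), and the document and the helper follow_refs are invented. The references form a cycle, so a naive follower would never stop; the list of visited identifiers is what ends the walk.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: `id` and `ref` are ordinary attributes here,
# standing in for DTD-declared ID and IDREF attributes.
DOC = """<tasks>
  <task id="a" ref="b"/>
  <task id="b" ref="c"/>
  <task id="c" ref="a"/>
</tasks>"""

root = ET.fromstring(DOC)

# Index every element that carries an id so references can be resolved.
by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}


def follow_refs(start_id):
    """Follow `ref` links from start_id, stopping when an id repeats."""
    seen = []
    current = by_id.get(start_id)
    while current is not None and current.get("id") not in seen:
        seen.append(current.get("id"))
        # A missing or dangling ref yields None and ends the walk.
        current = by_id.get(current.get("ref"))
    return seen


chain = follow_refs("a")
```

Without the check against seen, the loop would cycle through the three tasks forever, which is the kind of failure tree-oriented tools tend not to guard against.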

Trees definitely have their limitations, and many of the most difficult problems in XML work emerge from trying to cram graphs into trees. At the same time, however, tree structures fit a certain class of problems, the class on which markup has always focused, extremely well. Writing, even hypertext writing, has always been concerned with order, sequence, and structure. Children learn letters, words, paragraphs, then document structures. Structures of various kinds provide guideposts in nearly every form of writing, even when those guideposts are deliberately abused. These structures fit beautifully into markup environments.

If these tree structures fit your work naturally, XML is probably a good choice. XSLT and XML APIs offer powerful tools for working with information represented as trees, especially for converting from one tree to another. Basic cross-reference, both inside a document and beyond, isn't that difficult. Even if you need to represent overlapping trees, tools like LMNL and Just-In-Time-Trees offer mechanisms for working with such structures in an XML-like context.
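To give a flavor of tree-to-tree conversion, here is a minimal sketch that uses Python's standard xml.etree.ElementTree module as a stand-in for XSLT (the standard library does not include an XSLT processor). The source document, the RENAME mapping, and the transform function are assumptions made up for this example, not part of any real vocabulary or stylesheet.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document, invented for illustration.
DOC = "<article><title>Trees</title><para>One.</para><para>Two.</para></article>"

# Hypothetical mapping from source element names to HTML-like names.
RENAME = {"article": "div", "title": "h1", "para": "p"}


def transform(source):
    """Recursively copy `source` into a new tree, renaming elements."""
    target = ET.Element(RENAME.get(source.tag, source.tag), dict(source.attrib))
    target.text = source.text
    target.tail = source.tail
    for child in source:
        # Children are converted in document order, preserving sequence.
        target.append(transform(child))
    return target


result = ET.tostring(transform(ET.fromstring(DOC)), encoding="unicode")
```

The recursion mirrors the shape of the input, which is why this style of processing is comfortable when the data really is a tree: each element is visited once, in order, and its output lands in the corresponding place in the result.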

On the other hand, if you need to represent information which demands the flexibility of graphs, it may be a good idea to look for alternative approaches which don't impose the discipline (and processing expectations) of trees on your data. At the moment, a number of graph-based approaches use XML for their serializations, because the syntactic agreement XML provides and its surrounding toolkit are useful, even as they struggle with the actual representation. The more twisted the graph, the more likely that it's time to break free of the trees and look for alternative approaches to representing the information.
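One common compromise, sketched here under invented names, is to use XML purely as a container for flat lists of nodes and edges and to rebuild the graph in memory after parsing. The graph, node, and edge element names and the from and to attributes are hypothetical and do not follow any particular standard; the point is only that the XML tree no longer mirrors the shape of the data.

```python
import xml.etree.ElementTree as ET

# Hypothetical serialization: the XML tree is just two flat lists.
DOC = """<graph>
  <node id="a"/><node id="b"/><node id="c"/>
  <edge from="a" to="b"/>
  <edge from="a" to="c"/>
  <edge from="b" to="c"/>
  <edge from="c" to="a"/>
</graph>"""

root = ET.fromstring(DOC)

# Rebuild the graph as an adjacency mapping: node id -> list of targets.
adjacency = {node.get("id"): [] for node in root.findall("node")}
for edge in root.findall("edge"):
    adjacency[edge.get("from")].append(edge.get("to"))

# Count incoming edges; a count above one means "more than one parent",
# which a strict tree cannot express through nesting.
incoming = {node_id: 0 for node_id in adjacency}
for targets in adjacency.values():
    for target in targets:
        incoming[target] += 1
```

In this arrangement the tree-oriented toolkit sees only a list of siblings, so most of the real processing happens after parsing, in whatever graph structure the application builds.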