Optimizing markup for processing is always premature

Every now and then, an article appears suggesting the best ways to reduce XML's verbosity or the processing that parsing XML documents requires. Common recommendations include abbreviated names, reduced whitespace, and the ever-frequent use of attributes for content to avoid the overhead of an end tag. Using default values for attributes or avoiding namespace declarations can save some space. Some applications go so far as to combine multiple values in a single attribute or element, to reduce the overhead of multiple labels.

All of these options have genuine costs and should be avoided. Abbreviated names and reduced whitespace make it much harder for humans to inspect a document should a system break down unexpectedly. The use of attributes for content makes it extremely and unnecessarily complicated to supply metadata for that content (or to fragment that content into smaller pieces) when more information about it is needed later. Defaulting attributes hides them and makes them depend on processing that cannot be relied upon (schemas and DTDs are not always processed). Skipping namespace declarations (by defaulting those attributes!) or avoiding the use of namespaces altogether removes a key part of what were supposed to be unambiguous labels. Choosing how best to normalize information into markup is always a difficult question, but deliberately combining pieces of information which clearly have a separate existence can be problematic.
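The cost of combining values can be sketched in a few lines of Python. Everything here is a hypothetical illustration: the element and attribute names (`p`, `d`, `payment`, `date`, `amount`, `currency`), the `|` delimiter, and the sample values are invented for the example and are not drawn from any real vocabulary. The terse form only yields its data to a reader that already knows the delimiter and the field order, while the verbose form can be read with generic XML tools.

```python
import xml.etree.ElementTree as ET

# Hypothetical "optimized" record: abbreviated names, three values packed
# into one attribute, separated by a delimiter known only to one application.
TERSE = '<p d="2002-03-14|USD|19.95"/>'

# The same hypothetical information with one label per piece of information.
VERBOSE = (
    "<payment>"
    "<date>2002-03-14</date>"
    '<amount currency="USD">19.95</amount>'
    "</payment>"
)


def read_terse(xml_text):
    """Recover the fields; this needs out-of-band knowledge of '|' and the order."""
    root = ET.fromstring(xml_text)
    date, currency, amount = root.get("d").split("|")
    return {"date": date, "currency": currency, "amount": amount}


def read_verbose(xml_text):
    """Recover the same fields using nothing but the labels in the markup."""
    root = ET.fromstring(xml_text)
    amount = root.find("amount")
    return {
        "date": root.findtext("date"),
        "currency": amount.get("currency"),
        "amount": amount.text,
    }


if __name__ == "__main__":
    print(read_terse(TERSE))
    print(read_verbose(VERBOSE))
```

Both readers return the same dictionary, but only the verbose document leaves room for later additions, such as an attribute on `date` or child elements inside `amount`, without breaking the code that reads it. A second delimiter convention inside the `d` attribute would have to be invented and then taught to every consumer.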

While verbose markup may seem too large to some people, and perversely verbose markup (extra layers of elements which serve no clear purpose, for instance) should also be avoided, planning markup structures around the needs of a particular application's low-level code is a disservice to the information contained in those structures. Creating markup structures which are written to support one application as concisely as possible substantially limits the capabilities of those documents for creative, possibly constructive, and likely unpredicted reuse.

Developers who need to create optimized formats would be well-advised to look either to tools like gzip (simple file compression, well-suited to compressing typically redundant markup) or to alternatives like binary formats and ASN.1. Markup can represent information stored in most other formats, but that doesn't mean that markup should be used in every case or stretched to cover use cases where it's not well-suited to the performance demands.
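As a rough illustration of the gzip point, the following Python sketch builds a deliberately verbose document and compresses it with the standard library's `gzip` module. The element names and the record count are assumptions made up for the example, and the exact sizes will vary with the data; the sketch only shows that repeated long labels are the kind of redundancy a general-purpose compressor handles well.

```python
import gzip


def build_document(record_count):
    """Build a hypothetical, deliberately verbose XML document as UTF-8 bytes."""
    records = "\n".join(
        "  <customer-record>\n"
        f"    <customer-identifier>{i}</customer-identifier>\n"
        f"    <customer-display-name>Customer {i}</customer-display-name>\n"
        "  </customer-record>"
        for i in range(record_count)
    )
    return f"<customer-records>\n{records}\n</customer-records>\n".encode("utf-8")


verbose = build_document(500)
compressed = gzip.compress(verbose)

# Compression is lossless: the readable markup comes back byte for byte.
assert gzip.decompress(compressed) == verbose

if __name__ == "__main__":
    print(f"uncompressed: {len(verbose)} bytes")
    print(f"gzip:         {len(compressed)} bytes")
```

The document on disk or on the wire shrinks, while the markup a human or another application eventually reads keeps its full, descriptive labels.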

Monastic XML Copyright 2002 Simon St.Laurent.