Marking-up at the foundation

XML is a markup technology. Like other markup technologies, notably SGML and HTML, it starts with textual documents. In XML these documents are considered to be made up of Unicode characters, though they may be stored in other character encodings. XML documents encode information as a combination of ordinary text and markup, using a small set of characters (chiefly < and &) to separate the markup from the content.

In the document practice from which XML developed, there is a strong notion of a document as something separate from and possibly prior to the markup. It's sort of an "in the beginning was the word, not the markup" perspective. In some ways it makes the most sense with documents, but it's also pretty easy to do with data, especially if that data is already in some textual form (CSV, tab-delimited, etc.).

"Marking up" information means starting with information and adding markup to that information to add metadata - structures and annotation - to the information. This could mean starting from a plain text document and adding markup information, or it could be a more mixed process, with information creation and markup structuring performed simultaneously.

Markup doesn't have to use angle brackets and XML rules. There are a lot of other forms of markup, from wikis to lexical date notations. Mixing markup with content is itself controversial from certain viewpoints, but it remains a useful approach for many different projects.

A common view emerges from this process. The view holds that the "content" of a document is the textual content of its elements, with the markup - element structures and the attributes in those structures - merely providing additional information. Stripping out the markup will certainly remove structure and annotation, but in some sense the basic information is still there.
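As a rough illustration of that view, the sketch below strips the markup from a small document and keeps only the character data. The sample document and its element and attribute names are invented for this example, and Python's standard ElementTree module is just one convenient way to do it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the vocabulary is invented for illustration.
SAMPLE = (
    '<note priority="high"><to>Ada</to> meets <from>Grace</from>'
    ' at <time zone="UTC">noon</time>.</note>'
)


def strip_markup(xml_text: str) -> str:
    """Return only the text content, dropping all tags and attributes."""
    root = ET.fromstring(xml_text)
    # itertext() walks the tree in document order and yields every piece
    # of character data, including text that follows a child element.
    return "".join(root.itertext())


print(strip_markup(SAMPLE))  # prints: Ada meets Grace at noon.
```

The structure (who is the sender, which word is a time) and the annotations (priority, zone) are gone, but the sentence itself survives. Anything stored only in attributes would have disappeared along with the markup.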

As it turns out, there are a huge number of advantages to using element content to represent the information in the document and reserving attributes for annotation. The most important is extensibility. It's far easier to create child elements to reflect complex content than to break an attribute into pieces. At the same time, you can use attributes to refine your understanding of that element, to annotate it with extra information. Creating attributes which talk about attributes is very difficult - there are no conventions for it.
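One way to picture the extensibility argument is the sketch below. It assumes a hypothetical "person" vocabulary in two versions: the later version breaks the name into child elements and adds an annotating attribute. A reader written against the first version still recovers the same text, which would not hold if the name had started life as a single attribute value.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary, invented for illustration.
V1 = "<person><name>Ada Lovelace</name></person>"
# Later revision: the name gains child elements and an annotating attribute.
V2 = (
    '<person><name lang="en"><given>Ada</given> '
    "<family>Lovelace</family></name></person>"
)


def read_name(xml_text: str) -> str:
    """Older consumer: wants only the text content of <name>."""
    name = ET.fromstring(xml_text).find("name")
    return "".join(name.itertext())


def read_family(xml_text: str):
    """Newer consumer: uses the finer structure when it is present."""
    family = ET.fromstring(xml_text).find("name/family")
    return None if family is None else family.text


print(read_name(V1))    # prints: Ada Lovelace
print(read_name(V2))    # prints: Ada Lovelace
print(read_family(V2))  # prints: Lovelace
```

With name="Ada Lovelace" held in an attribute instead, the only way to add the given/family split would be to invent a convention inside the attribute string or to add more attributes, and there would be nowhere obvious to hang the lang annotation.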

A lot of developers use attribute structures to hold the content of the document because it seems to make their lives easier or corresponds more explicitly with the approaches they are used to seeing in their code. An attribute feels much like a local variable, making it an easy match for (simple) object properties. Attributes are of course less verbose, and they also turn out to be easier to collect as a group when doing SAX processing. There are real costs to this, however, much like the costs of creating "final" and unextensible code. In markup, these decisions are often difficult because it's largely impossible to predict where documents will end up. Optimizing document size by using attributes rather than child elements for content is a short-term strategy with long-term consequences.
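The SAX convenience mentioned above can be seen in a small sketch using Python's standard xml.sax module. The "book" documents and the handler classes are hypothetical, written only for this comparison. With attributes, all the values arrive together in a single startElement call. With child elements, the handler has to buffer character data and assemble the record across several events.

```python
import xml.sax

# Hypothetical sample data in two styles, invented for illustration.
ATTR_STYLE = b'<book title="Dune" author="Frank Herbert" year="1965"/>'
ELEM_STYLE = (
    b"<book><title>Dune</title><author>Frank Herbert</author>"
    b"<year>1965</year></book>"
)


class AttrHandler(xml.sax.ContentHandler):
    """Attribute style: every value is delivered in one event."""

    def __init__(self):
        super().__init__()
        self.record = {}

    def startElement(self, name, attrs):
        if name == "book":
            self.record = dict(attrs.items())


class ElemHandler(xml.sax.ContentHandler):
    """Element style: values must be buffered across several events."""

    def __init__(self):
        super().__init__()
        self.record = {}
        self._buffer = []

    def startElement(self, name, attrs):
        self._buffer = []  # start collecting text for this element

    def characters(self, content):
        self._buffer.append(content)  # text may arrive in several chunks

    def endElement(self, name):
        if name != "book":
            self.record[name] = "".join(self._buffer)


def parse(data: bytes, handler) -> dict:
    xml.sax.parseString(data, handler)
    return handler.record


print(parse(ATTR_STYLE, AttrHandler()) == parse(ELEM_STYLE, ElemHandler()))
# prints: True
```

The attribute handler is clearly shorter, which is the appeal. The element handler is the one that keeps working, with modest changes, if title later needs a subtitle child or author needs to repeat.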