Uniform Resource Identifiers are not

One of the technologies that markup has had to integrate for its work on the Web is Uniform Resource Identifiers (URIs). Unfortunately, while basic use of URIs is quite simple, more sophisticated use can create new and difficult problems.

The only thing genuinely uniform about URIs is their syntax, described in RFC 2396 and its likely successors. (Internationalized Resource Identifiers, or IRIs, are in development.) Within the combination of scheme and scheme-specific content, as well as the relative URIs used for convenience and the additional possibility of fragment identifiers, there are an enormous number of possibilities that make uniform processing of URIs remarkably difficult. Even simple equality tests are difficult if not impossible to define, as comparison rules vary among schemes and contexts.
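
The equality problem can be made concrete with a small sketch. The Python below is only an illustration, not a definitive comparison algorithm: the function names `normalize_http` and `same_http_uri` are hypothetical, and the rules it applies (case-insensitive scheme and host, default ports dropped, empty path treated as "/") are assumptions that suit http and https only. It deliberately refuses every other scheme, since each scheme would need its own rules.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports for the only schemes this sketch claims to understand.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_http(uri: str) -> str:
    """Return a comparison form for http(s) URIs; refuse other schemes.

    Simplifications (assumptions of this sketch): userinfo is dropped,
    IPv6 literals are not handled, and percent-encoding is left untouched.
    """
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    if scheme not in DEFAULT_PORTS:
        # Comparison rules vary by scheme; this sketch has none for others.
        raise ValueError(f"no comparison rules defined for scheme {scheme!r}")
    host = parts.hostname or ""  # urlsplit lowercases the hostname
    if parts.port is not None and parts.port != DEFAULT_PORTS[scheme]:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"  # paths stay case-sensitive
    return urlunsplit((scheme, host, path, parts.query, parts.fragment))


def same_http_uri(a: str, b: str) -> bool:
    """Compare two http(s) URIs under the assumed rules above."""
    return normalize_http(a) == normalize_http(b)


if __name__ == "__main__":
    a = "HTTP://MonasticXML.org:80/uris.html"
    b = "http://monasticxml.org/uris.html"
    print(a == b)               # plain string comparison
    print(same_http_uri(a, b))  # comparison under the assumed http rules
```

Plain string comparison treats the two spellings as different, while the scheme-aware comparison treats them as the same. Passing a URN such as `urn:isbn:0451450523` raises `ValueError`, which is the point: a rule set written for one scheme says nothing about another.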

Beyond syntax, the relationship of identifiers to resources is defined as a circle: resources are things which can be identified, while identifiers are things which identify resources. That simple circle conceals a wide range of ambiguities, as the interpretation of which resource an identifier identifies regularly demands more context. Naive understandings of 'resolving' a URI to retrieve content representing that URI work poorly when the URI is opaque, when multiple representations are available, or when multiple resources may be identified by the same identifier, as is common practice in XML namespaces.

Ensuring that URIs and URI references work consistently requires more work than simply specifying that a particular value must be a URI. Consistent use of URIs requires the definition of far more context than URIs provide by themselves and an explicit understanding of whether the URI processing applies to resources in the abstract or to representations in the concrete. URNs, for instance, fare poorly as hyperlink targets on the public Web. URI references with fragment identifiers may fail unexpectedly because fragment identifiers are tied to particular representation formats. In that case, the identifiers only have meaning in a given representation context. Efforts to retrieve information about namespace URIs (or other kinds of URIs used as identifiers, like SAX2 features and properties) are likely to fail, though these URIs frequently take forms that look retrievable on the surface, because the processing context is not typical.
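
The dependence of fragment identifiers on representation format can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function `fragment_resolves` and the fragment `#context` are hypothetical, and the only fragment semantics the sketch knows are "an element with a matching id attribute" for text/html. Any other media type is treated as giving the fragment no meaning.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class _IdCollector(HTMLParser):
    """Collect the values of id attributes seen in an HTML document."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def fragment_resolves(uri_ref: str, media_type: str, body: str) -> bool:
    """Report whether the fragment in uri_ref has a target in this representation."""
    _, fragment = urldefrag(uri_ref)
    if not fragment:
        return True  # no fragment, so nothing can fail to resolve
    if media_type == "text/html":
        collector = _IdCollector()
        collector.feed(body)
        return fragment in collector.ids
    # Assumption of this sketch: no fragment rules are known for other types.
    return False


if __name__ == "__main__":
    ref = "http://monasticxml.org/uris.html#context"  # hypothetical fragment
    print(fragment_resolves(ref, "text/html", '<h1 id="context">Context</h1>'))
    print(fragment_resolves(ref, "text/plain", "Context"))
```

The same URI reference succeeds against the HTML representation and fails against the plain-text one, even though both could be served for the same resource.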

Varying contexts make it difficult to make statements about URIs out of context. A statement about http://monasticxml.org/uris.html, for instance, would seem likely to describe this page, but it could also describe a namespace using that identifier or other resource behavior that might exist beyond the classic HTTP GET. The coherence of the resource is up to the behavior of the URI, for better or worse.

Markup can be useful for providing the extra context that bare URIs lack. The HTML form element has long had a method attribute, specifying whether information should be sent to the URI using GET or POST. Similarly, markup which needs to identify content in a particular representation of a resource might provide additional information, perhaps a MIME content-type identifier. These issues may force changes in your markup structures if they are relevant to the information with which you are working.
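
One way to picture markup carrying that extra context is to read it into a small structure that keeps the URI together with its intended use. The Python sketch below is illustrative only: the `ref` element, its `href`, `method`, and `type` attributes, and the names `ContextualReference` and `read_reference` are all hypothetical, chosen to mirror the HTML form's method attribute and a MIME content-type hint.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ContextualReference:
    """A URI plus the context a bare URI does not carry."""
    uri: str
    method: str = "GET"               # how the URI is meant to be used
    media_type: Optional[str] = None  # which representation is expected


def read_reference(xml_text: str) -> ContextualReference:
    """Read a hypothetical <ref href=... method=... type=.../> element."""
    elem = ET.fromstring(xml_text)
    return ContextualReference(
        uri=elem.attrib["href"],  # raises KeyError if the URI is missing
        method=elem.get("method", "GET").upper(),
        media_type=elem.get("type"),
    )


if __name__ == "__main__":
    markup = '<ref href="http://monasticxml.org/uris.html" type="text/html"/>'
    print(read_reference(markup))
```

Keeping the method and media type beside the URI means later processing does not have to guess whether the reference is to a resource in the abstract or to one particular representation.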

The Web has demonstrated that tolerating failures (the classic 404 Not Found, but also failures more generally) in URI processing can be a workable solution that simplifies the overall architecture. If you plan to use URIs, learn from this. Set your expectations for use as explicitly as possible, and expect less than your expectations would suggest. URIs are a marvelous abstraction but often need to be constrained and supplemented to function concretely.
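
Tolerating failure can be as simple as treating a failed retrieval as an ordinary outcome. The following Python sketch assumes a caller-supplied `fetch` function (hypothetical, injected so that no particular retrieval mechanism is implied) and a hypothetical helper named `try_dereference`; it is one possible shape for the idea, not a recommended implementation.

```python
from typing import Callable, Optional
from urllib.error import URLError


def try_dereference(uri: str, fetch: Callable[[str], bytes]) -> Optional[bytes]:
    """Return the fetched content, or None if retrieval fails.

    URLError also covers HTTPError (such as a 404); ValueError covers
    fetchers that reject a URI they cannot interpret.
    """
    try:
        return fetch(uri)
    except (URLError, ValueError):
        return None


if __name__ == "__main__":
    def always_missing(uri: str) -> bytes:
        # Stand-in fetcher that fails the way a broken link would.
        raise URLError("not found")

    content = try_dereference("http://monasticxml.org/uris.html", always_missing)
    print("no content available" if content is None else content)
```

A fetcher built on `urllib.request.urlopen` could be passed in the same way. The caller then decides what an absent result means, which keeps the expectation explicit.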

Monastic XML Copyright 2002 Simon St.Laurent.