XML schema

This article is about XML schemas in general. For information on the W3C recommendation, see XML Schema (W3C).

An XML schema is a description of a type of XML document, typically expressed in terms of constraints on the structure and content of documents of that type, above and beyond the basic syntactical constraints imposed by XML itself. These constraints are generally expressed using some combination of grammatical rules governing the order of elements, Boolean predicates that the content must satisfy, data types governing the content of elements and attributes, and more specialized rules such as uniqueness and referential integrity constraints.

There are languages developed specifically to express XML schemas. The Document Type Definition (DTD) language, which is native to the XML specification, is a schema language that is of relatively limited capability, but that also has other uses in XML aside from the expression of schemas. Two more expressive XML schema languages in widespread use are XML Schema (with a capital S) and RELAX NG.

The mechanism for associating an XML document with a schema varies according to the schema language. The association may be achieved via markup within the XML document itself, or via some external means.
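As an illustration of association via markup, W3C XML Schema defines the attributes xsi:schemaLocation and xsi:noNamespaceSchemaLocation, which a document may carry as hints. The following minimal Python sketch only reads those hints; the function name `schema_hints` and the sample file names are hypothetical, and nothing is fetched or validated.

```python
import xml.etree.ElementTree as ET

# Namespace of the XML Schema instance attributes.
XSI = "http://www.w3.org/2001/XMLSchema-instance"


def schema_hints(xml_text):
    """Return {namespace URI: schema location} hints found on the root element.

    The empty string is used as the key for the no-namespace hint.
    """
    root = ET.fromstring(xml_text)
    hints = {}
    no_ns = root.get(f"{{{XSI}}}noNamespaceSchemaLocation")
    if no_ns is not None:
        hints[""] = no_ns
    # schemaLocation holds whitespace-separated (namespace, location) pairs.
    pairs = (root.get(f"{{{XSI}}}schemaLocation") or "").split()
    for namespace, location in zip(pairs[0::2], pairs[1::2]):
        hints[namespace] = location
    return hints
```

A caller would typically treat the returned locations as untrusted hints and decide separately whether to resolve them.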

Validation

The process of checking to see if an XML document conforms to a schema is called validation, which is separate from XML's core concept of syntactic well-formedness. All XML documents must be well-formed, but it is not required that a document be valid unless the XML parser is "validating", in which case the document is also checked for conformance with its associated schema. DTD-validating parsers are most common, but some support W3C XML Schema or RELAX NG as well.

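Documents are only considered valid if they satisfy the requirements of the schema with which they have been associated. These requirements typically include constraints on which elements and attributes may appear and in what order, and on the data types to which their content must conform.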

Validation of an instance document against a schema can be regarded as a conceptually separate operation from XML parsing. In practice, however, many schema validators are integrated with an XML parser.
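The difference between well-formedness and validity can be sketched with Python's standard xml.etree.ElementTree module, which checks well-formedness but performs no schema validation. The function `is_valid_note` and its single rule are hypothetical stand-ins for a real schema, shown only to make the two steps visibly separate.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if `text` parses as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def is_valid_note(text):
    """Toy validity check (hypothetical rule, not a real schema language):
    the root must be <note> and must contain a <to> child."""
    if not is_well_formed(text):
        return False
    root = ET.fromstring(text)
    return root.tag == "note" and root.find("to") is not None
```

A document such as `<memo/>` passes the first check but fails the second: it is well-formed, yet not valid against the toy rule.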

Languages

There are several different languages available for specifying an XML schema. Each language has its strengths and weaknesses.

The primary purpose of a schema language is to specify what the structure of an XML document can be. This means which elements can reside in which other elements, which attributes are and are not legal to have on a particular element, and so forth. A schema is somewhat equivalent to a grammar for a language; a schema defines what the vocabulary for the language may be and what a valid "sentence" is.

There are historic and current XML schema languages:

| Language | Abbrev. | Versions | Authority |
|---|---|---|---|
| Constraint Language in XML | CLiX | 2004 | Independent[1] |
| Document Content Description facility for XML, an RDF framework[2] | DCD | v1.0 (1998) | W3C (Note) |
| Document Definition Markup Language | DDML | v0 (1999) | W3C (Note) |
| Document Structure Description | DSD | 2002, 2005 | BRICS (defunct) |
| Document Type Definition | DTD | 1986 (SGML) | ISO[3] |
| | | 2008 (XML) | ISO/IEC[3] |
| Namespace Routing Language | NRL | 2003 | Independent[4] |
| Namespace-based Validation Dispatching Language | NVDL | 2006 | ISO/IEC[5] |
| Content Assembly Mechanism | CAM | 2007 | OASIS |
| REgular LAnguage for XML Next Generation | RELAX NG, RelaxNG | 2001,[6] Compact Syntax (2002)[7] | OASIS |
| | | v1 (2003), v1 Compact Syntax (2006), v2 (2008) | ISO/IEC[5] |
| Schema for Object-Oriented XML | SOX | ? | ? |
| Schematron | N/A | 2006 | ISO/IEC[5] |
| XML-Data Reduced | XDR | ? | ? |
| ASN.1 XML Encoding Rules | XER | ? | ? |
| XML Schema | WXS, XSD | ? | W3C |

The main ones (see also the languages endorsed by ISO 19757) are described below.

Though there are a number of schema languages available, the primary three languages are Document Type Definitions, W3C XML Schema, and RELAX NG. Each language has its own advantages and disadvantages.

Document Type Definitions

Tool Support

DTDs are perhaps the most widely supported schema language for XML, in part because they are one of the earliest, defined before XML even had namespace support. Internal DTDs are supported by most XML processors; external DTDs are supported only slightly less often. Most large XML parsers, those that support multiple XML technologies, provide support for DTDs as well.

W3C XML Schema

Main article: XML Schema (W3C)

Advantages over DTDs

Compared to DTDs, W3C XML Schemas are considerably more expressive. They allow much more specific constraints than DTDs can, are namespace-aware, and provide support for data types.

W3C XML Schema is written in XML itself, and therefore has a schema of its own (appropriately, written in W3C XML Schema).

W3C XML Schema has a large number of built-in and derived data types. These are specified by the W3C XML Schema specification, so all W3C XML Schema validators and processors must support them.

Due to the nature of the schema language, after an XML document is validated, the entire XML document, both content and structure, can be expressed in terms of the schema itself. This functionality, known as Post-Schema-Validation Infoset (PSVI), can be used to transform the document into a hierarchy of typed objects that can be accessed in a programming language through a neutral interface.

Commonality with RELAX NG

RELAX NG and W3C XML Schema allow for similar mechanisms of specificity. Both allow for a degree of modularity, including the ability to split a schema into multiple files. Both are, or can be, defined in an XML language.

Advantages over RELAX NG

RELAX NG does not have any analog to PSVI. Unlike W3C XML Schema, RELAX NG was designed so that validation and augmentation (adding type information and default values) are separate.

W3C XML Schema has a formal mechanism for attaching a schema to an XML document, while RELAX NG intentionally avoids such mechanisms for security and interoperability reasons.

RELAX NG has no ability to apply default attribute data to an element's list of attributes (i.e., changing the XML info set), while W3C XML Schema does. Again, this design is intentional and is to separate validation and augmentation.[8]
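The augmentation step described above, as distinct from validation, can be sketched in Python. The `DEFAULTS` table is a hypothetical stand-in for default attribute values that a schema might declare; the sketch shows only how applying them changes the information a consumer sees.

```python
import xml.etree.ElementTree as ET

# Hypothetical defaults a schema might declare:
# element name -> {attribute name: default value}.
DEFAULTS = {"item": {"currency": "USD", "quantity": "1"}}


def augment(root):
    """Add declared default attributes wherever the document omits them."""
    for elem in root.iter():
        for name, value in DEFAULTS.get(elem.tag, {}).items():
            if name not in elem.attrib:
                elem.set(name, value)
    return root
```

A consumer that skips this step sees fewer attributes than one that performs it, which is why a document relying on schema defaults may not be usable without its schema.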

W3C XML Schema has a rich "simple type" system built in (xs:decimal, xs:date, etc., plus derivation of custom types), while RELAX NG has an extremely simple one because it is meant to use type libraries developed independently of RELAX NG, rather than grow its own. This is seen by some as a disadvantage. In practice, it is common for a RELAX NG schema to use the predefined "simple types" and "restrictions" (pattern, maxLength, etc.) of W3C XML Schema.

In W3C XML Schema, a specific number or range of repetitions of a pattern can be expressed more elegantly than in RELAX NG. For large repetition counts, it is practically impossible to specify them at all in RELAX NG.
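RELAX NG offers only the optional, zeroOrMore, and oneOrMore occurrence patterns, so a bounded repetition has to be written out explicitly. The following Python sketch generates one possible expansion in the RELAX NG XML syntax; the helper name `repeat_pattern` is hypothetical, and the output is meant to show how the schema grows with the bound, not to be a recommended authoring technique.

```python
def repeat_pattern(name, min_occurs, max_occurs):
    """Expand 'between min_occurs and max_occurs repetitions of a named
    pattern' into RELAX NG XML: required refs followed by nested optionals."""
    if not (0 <= min_occurs <= max_occurs) or max_occurs == 0:
        raise ValueError("expected 0 <= min_occurs <= max_occurs and max_occurs >= 1")
    ref = f'<ref name="{name}"/>'
    required = ref * min_occurs
    optional = ""
    # Each extra allowed occurrence wraps the remaining ones in <optional>.
    for _ in range(max_occurs - min_occurs):
        optional = f"<optional>{ref}{optional}</optional>"
    return f"<group>{required}{optional}</group>"
```

The generated pattern contains one ref per permitted occurrence, so a bound of 1000 yields 1000 references, whereas W3C XML Schema expresses the same constraint with the minOccurs and maxOccurs attributes on a single particle.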

Disadvantages

W3C XML Schema is complex and hard to learn, although that is partially because it tries to do more than mere validation (see PSVI).

Although being written in XML is an advantage, it is also a disadvantage in some ways. The W3C XML Schema language in particular can be quite verbose, while a DTD can be terse and relatively easily editable.

Likewise, WXS's formal mechanism for associating a document with a schema can pose a potential security problem. For WXS validators that will follow a URI to an arbitrary online location, there is the potential for reading something malicious from the other side of the stream.[9]

W3C XML Schema does not implement most of the DTD ability to provide data elements to a document.

Although W3C XML Schema's ability to add default attributes to elements is an advantage, it is a disadvantage in some ways as well. It means that an XML file may not be usable in the absence of its schema, even if the document would validate against that schema. In effect, all users of such an XML document must also implement the W3C XML Schema specification, thus ruling out minimalist or older XML parsers. It can also dramatically slow down processing of the document, as the processor must potentially download and process a second XML file (the schema).

Tool Support

WXS support exists in a number of large XML parsing packages. Xerces and the .NET Framework's Base Class Library both provide support for WXS validation.

RELAX NG

Main article: RELAX NG

RELAX NG provides most of the advantages that W3C XML Schema has over DTDs.

Advantages over W3C XML Schema

While the language of RELAX NG can be written in XML, it also has an equivalent form that is much more like a DTD, but with greater specifying power. This form is known as the compact syntax. Tools can easily convert between these forms with no loss of features or even commenting. Even arbitrary elements specified between RELAX NG XML elements can be converted into the compact form.

RELAX NG provides very strong support for unordered content. That is, it allows the schema to state that a sequence of patterns may appear in any order.

RELAX NG also allows for non-deterministic content models. What this means is that RELAX NG allows the specification of a sequence like the following:

<zeroOrMore>
  <ref name="odd" />
  <ref name="even" />
</zeroOrMore>
<optional>
  <ref name="odd" />
</optional>

When the validator encounters something that matches the "odd" pattern, it is unknown whether this is the optional last "odd" reference or simply one in the zeroOrMore sequence without looking ahead at the data. RELAX NG allows this kind of specification. W3C XML Schema requires all of its sequences to be fully deterministic, so mechanisms like the above must be either specified in a different way or omitted altogether.
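The ambiguity can be illustrated by analogy with regular expressions. In the Python sketch below, each child element is reduced to a single letter and the content model above becomes the expression (OE)*O?; Python's backtracking matcher then resolves the ambiguity by trying both readings. The encoding and the function name `matches_model` are assumptions made for illustration and do not reflect how any particular RELAX NG validator is implemented.

```python
import re

# "O" stands for a child matching "odd", "E" for one matching "even",
# and "X" for anything else. The model is: zeroOrMore(odd, even), optional(odd).
MODEL = re.compile(r"(?:OE)*O?")
CODES = {"odd": "O", "even": "E"}


def matches_model(children):
    """Return True if the sequence of pattern names fits the content model."""
    encoded = "".join(CODES.get(child, "X") for child in children)
    return MODEL.fullmatch(encoded) is not None
```

For the input odd, even, odd, the matcher cannot know on reading the final odd whether another even will follow; it commits only once the end of the input is reached.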

RELAX NG allows attributes to be treated as elements in content models. In particular, this means that one can provide the following:

<element name="some_element">
  <choice>
    <attribute name="has_name">
      <value>false</value>
    </attribute>
    <group>
      <attribute name="has_name">
        <value>true</value>
      </attribute>
      <element name="name"><text /></element>
    </group>
  </choice>
</element>

This block states that the element "some_element" must have an attribute named "has_name". This attribute can only take true or false as values, and if it is true, the first child element of the element must be "name", which stores text. If "name" did not need to be the first element, then the choice could be wrapped in an "interleave" element along with other elements. The order of the specification of attributes in RELAX NG has no meaning, so this block need not be the first block in the element definition.

W3C XML Schema cannot specify such a dependency between the content of an attribute and child elements.
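The dependency expressed by the RELAX NG block above can be restated procedurally. The Python sketch below hand-codes the same rule for illustration; the function name `check_some_element` is hypothetical, and the check is simplified in that it ignores stray text content directly inside some_element.

```python
import xml.etree.ElementTree as ET


def check_some_element(elem):
    """Check the has_name/name co-constraint on a <some_element> element."""
    if elem.tag != "some_element" or set(elem.attrib) != {"has_name"}:
        return False
    children = list(elem)
    if elem.get("has_name") == "false":
        # First branch of the choice: the attribute alone, no child elements.
        return len(children) == 0
    if elem.get("has_name") == "true":
        # Second branch: exactly one <name> child holding only text.
        return (
            len(children) == 1
            and children[0].tag == "name"
            and len(children[0]) == 0
        )
    return False
```

The value of the attribute selects which branch of the choice applies, which is exactly the kind of rule a grammar without attribute-aware content models cannot state.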

RELAX NG's specification only lists two built-in types (string and token), but it allows for the definition of many more. In theory, the lack of a specific list allows a processor to support data types that are very problem-domain specific.

Most RELAX NG schemas can be algorithmically converted into W3C XML Schemas and even DTDs (except when using RELAX NG features not supported by those languages, as above). The reverse is not true. As such, RELAX NG can be used as a normative version of the schema, and the user can convert it to other forms for tools that do not support RELAX NG.

Disadvantages

Most of RELAX NG's disadvantages are covered under the section on W3C XML Schema's advantages over RELAX NG.

Though RELAX NG's ability to support user-defined data types is useful, it comes at the disadvantage of only having two data types that the user can rely upon. In theory, this means that using a RELAX NG schema across multiple validators requires either providing those user-defined data types to each validator or using only the two basic types. In practice, however, most RELAX NG processors support the W3C XML Schema set of data types.

Tool Support

RELAX NG's tool support is significant, but it is less widespread than that of W3C XML Schema. The Mono Project's implementation of the .NET Framework includes a RELAX NG validator. The C library libxml2 provides RELAX NG support as well. Sun Microsystems's Multiple Schema Validator for Java also provides RELAX NG support.

Schematron

Main article: Schematron

Schematron is a fairly unusual schema language. Unlike the main three, it defines an XML file's syntax as a list of XPath-based rules. If the document passes these rules, then it is valid.

Advantages

Because of its rule-based nature, Schematron's specificity is very strong. It can require that the content of an element be controlled by one of its siblings. It can also request or require that the root element, regardless of what element that happens to be, have specific attributes. It can even specify required relationships between multiple XML files.
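The rule-based approach can be sketched in Python. Real Schematron schemas express both the rule context and the assertion as XPath expressions; because the standard xml.etree.ElementTree module supports only a subset of XPath, this sketch uses path strings for the context and plain Python functions for the assertions. The rules, element names, and messages are all hypothetical, and numeric text content is assumed.

```python
import xml.etree.ElementTree as ET


def has_version(elem):
    return elem.get("version") is not None


def has_price(elem):
    return elem.find("price") is not None


def discount_within_price(elem):
    # A sibling-controlled constraint: <discount> is limited by <price>.
    return float(elem.findtext("discount")) <= float(elem.findtext("price", "0"))


# Each rule: (context path, assertion on each matched element, message).
RULES = [
    (".", has_version, "root element must have a version attribute"),
    (".//item", has_price, "item must contain a price"),
    (".//item[discount]", discount_within_price, "discount must not exceed price"),
]


def failed_assertions(xml_text):
    """Return the messages of all assertions that fail for the document."""
    root = ET.fromstring(xml_text)
    failures = []
    for context, assertion, message in RULES:
        for elem in root.findall(context):
            if not assertion(elem):
                failures.append(message)
    return failures
```

The first rule applies to the root element whatever its name, and the third relates the content of one element to that of its sibling, two of the capabilities described above.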

Disadvantages

While Schematron is good at relational constructs, using it to specify the basic structure of a document, that is, which elements can go where, results in a very verbose schema.

The typical way to solve this is to combine Schematron with RELAX NG or W3C XML Schema. There are several schema processors available for both languages that support this combined form. This allows Schematron rules to specify additional constraints to the structure defined by W3C XML Schema or RELAX NG.

Tool Support

Schematron's reference implementation is an XSLT transformation that transforms the Schematron document into an XSLT stylesheet that validates the XML file. As such, Schematron's potential toolset is any XSLT processor, though libxml2 provides an implementation that does not require XSLT. Sun Microsystems's Multiple Schema Validator for Java has an add-on that allows it to validate RELAX NG schemas that have embedded Schematron rules.

Namespace Routing Language (NRL)

This is not technically a schema language. Its sole purpose is to direct parts of documents to individual schemas based on the namespace of the encountered elements. An NRL file is merely a list of XML namespaces, each paired with a path to the schema that it corresponds to. This allows each schema to be concerned with only its own language definition, while the NRL file routes the schema validator to the correct schema file based on the namespace of each element.

This XML format is schema-language agnostic and works for just about any schema language.
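The routing idea can be sketched in Python as a lookup table from namespace URI to schema path. This is not NRL syntax and omits most of what the language offers; the `ROUTES` table, the schema file names, and the function names are hypothetical. The sketch reports where each single-namespace section of a document begins and which schema it would be dispatched to.

```python
import xml.etree.ElementTree as ET

# Hypothetical routing table: namespace URI -> path of the schema to use.
ROUTES = {
    "http://www.w3.org/1999/xhtml": "xhtml.rng",
    "http://www.w3.org/2000/svg": "svg.rng",
}


def namespace_of(elem):
    """Extract the namespace URI from an ElementTree '{uri}local' tag."""
    tag = elem.tag
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""


def route(xml_text):
    """Return (local name, schema path) for each element that starts a
    section in a different namespace from its parent."""
    plan = []

    def visit(elem, parent_ns):
        ns = namespace_of(elem)
        if ns != parent_ns:
            plan.append((elem.tag.split("}")[-1], ROUTES.get(ns)))
        for child in elem:
            visit(child, ns)

    visit(ET.fromstring(xml_text), None)
    return plan
```

Because the table maps to paths rather than to schema constructs, each entry could point to a schema written in a different schema language.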

Terminology

Capitalization of the word "schema": there is some confusion as to when to use the capitalized spelling "Schema" and when to use the lowercase spelling. The lowercase form is a generic term and may refer to any type of schema, including DTD, XML Schema (aka XSD), RELAX NG, or others, and should always be written in lowercase except at the start of a sentence. The capitalized form "Schema", in common use in the XML community, always refers to W3C XML Schema.

Schema authoring choices

The focus of a schema definition is the structure and semantics of the content, but schemas can also vary greatly in their choice of tag and attribute names, in the syntax of these names, in the choice between tags and attributes, and in other respects.

Tag vs attribute choice
Some constraints govern whether a piece of information should be defined by a tag or by an attribute: content that itself contains tags, for instance, cannot be represented in an attribute. Attributes generally denote information that does not need to be repeated (for example, in HTML4 the class attribute can be used to define the font face once on the <p> tag, whereas the <font> tag can be used as a child wrapping all of the content or only some pieces of text). Even with such constraints and rules of good practice, there is some freedom of choice (and ambiguity about why a given choice was made) for information that can be represented by attributes, because it can always be represented by tags as well.
Hyphenation and capitalization
Tag names generally correspond to terms or words that represent the tag's concept. These terms and words can be lowercase, CamelCase, UPPERCASE, or hyphen-separated. There is no general preference, except for lowercase over uppercase (see also Canonical XML).
Semantic vs syntactic
Except for an RDF-related one, no schema language formally expresses semantics, only structure and data types. Although it would be ideal, the inclusion of RDF assumptions is very limited in practice and is not a recommendation of the schema development frameworks.
Schema reuse
A new XML schema can be developed from scratch or can reuse fragments of other XML schemas. All schema languages offer some tools for reuse (for example, include mechanisms and modularization control over namespaces) and recommend reuse in schema development.

References

  1. Marconi, Michael; Nentwich, Christian, eds. (31 January 2004). "CLiX Language Specification Version 1.0".
  2. Bray, Tim; Frankston, Charles; Malhotra, Ashok, eds. (31 July 1998). "Document Content Description for XML: Submission to the World Wide Web Consortium". World Wide Web Consortium.
  3. "Standards and projects under the direct responsibility of ISO/IEC JTC 1/SC 34 Secretariat". ISO Standards catalogue.
  4. Clark, James (13 June 2003). "Namespace Routing Language (NRL)". Thai Open Source Software Center, Ltd.
  5. "Freely Available Standards". ISO.
  6. Clark, James; Murata, Makoto, eds. (3 December 2001). "RELAX NG Specification". OASIS.
  7. Clark, James, ed. (21 November 2002). "RELAX NG Compact Syntax". OASIS.
  8. While annotations in RELAX NG can support default attribute values, the RELAX NG specification does not mandate that a validator provide this ability to modify an XML infoset as part of validation. The WXS specification does mandate this behavior. An additional specification associated with RELAX NG does provide this ability. See Relax NG DTD Compatibility (default value).
  9. Clark, James (co-creator of RELAX NG). "RELAX NG and W3C XML Schema".
