Overlapping markup

In markup languages, overlap occurs when a document has two or more structures that interact in a non-hierarchical manner; this is also known as concurrent markup. A document with overlapping markup cannot be represented as a single tree. Overlap arises, for instance, in poetry, where there may be a metrical structure of feet and lines; a linguistic structure of sentences and quotations; and a physical structure of volumes and pages, together with editorial annotations.[1][2]
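As a rough illustration of why such structures cannot share one tree, the following sketch models each structure as a set of character ranges over the same text and checks whether any two ranges cross. The `Span` type, the labels "line" and "sentence", and the sample text are assumptions made for this example only; they are not taken from any of the cited schemes.

```python
from typing import NamedTuple


class Span(NamedTuple):
    kind: str   # hypothetical label, e.g. "line" or "sentence"
    start: int  # inclusive character offset
    end: int    # exclusive character offset


def properly_overlap(a: Span, b: Span) -> bool:
    """True if a and b intersect but neither contains the other."""
    intersect = a.start < b.end and b.start < a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return intersect and not (a_in_b or b_in_a)


def is_tree(spans: list[Span]) -> bool:
    """True if the spans nest cleanly, i.e. no pair properly overlaps."""
    return not any(
        properly_overlap(a, b)
        for i, a in enumerate(spans)
        for b in spans[i + 1:]
    )


text = "I rose at dawn. The\nsun was low."
spans = [
    Span("line", 0, 19),       # "I rose at dawn. The"
    Span("line", 20, 32),      # "sun was low."
    Span("sentence", 0, 15),   # "I rose at dawn."
    Span("sentence", 16, 32),  # "The\nsun was low."
]

print(is_tree(spans))  # False: the second sentence crosses the line break
```

Each structure on its own nests cleanly; it is only the combination of lines and sentences that fails the check.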

Properties and types

A distinction should be drawn between schemes that allow non-contiguous overlap, and those which allow only contiguous overlap; often, what is meant by 'markup overlap' is strictly the latter. Contiguous overlap can always be represented as a linear document with milestones, without the need for fragmentation and pointers to fragments, but non-contiguous overlap may require document fragmentation. Another distinction in overlapping markup schemes is whether elements can overlap with other elements of the same kind (self-overlap).[2]
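The claim that contiguous overlap can be linearised with milestones can be sketched as follows. The function below is an assumption-laden illustration: the marker syntax (`<name-start/>` and `<name-end/>`) is invented for this example, and the text is not escaped in any way.

```python
def to_milestones(text: str, ranges: list[tuple[str, int, int]]) -> str:
    """Insert empty start/end markers for each (name, start, end) range.

    Offsets are character positions; `end` is exclusive.
    """
    events = []
    for name, start, end in ranges:
        events.append((start, 1, f"<{name}-start/>"))
        events.append((end, 0, f"<{name}-end/>"))
    # At equal offsets, emit end markers (0) before start markers (1).
    events.sort(key=lambda e: (e[0], e[1]))

    out, pos = [], 0
    for offset, _, marker in events:
        out.append(text[pos:offset])  # text since the previous marker
        out.append(marker)
        pos = offset
    out.append(text[pos:])
    return "".join(out)


# Two contiguous ranges that cross each other.
print(to_milestones("abcdef", [("x", 0, 4), ("y", 2, 6)]))
# <x-start/>ab<y-start/>cd<x-end/>ef<y-end/>
```

Because every range is a single contiguous run, one start marker and one end marker suffice and no fragments or pointers are needed. Self-overlap between two ranges of the same name would additionally need some identifier to pair each start with its end, which this sketch omits.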

A scheme may have a privileged hierarchy. Some XML-based schemes, for example, represent one hierarchy directly in the XML document tree, and represent other, overlapping, structures by another means; these are said to be non-privileged.

Approaches and implementations

DeRose (2004, Evaluation criteria) identifies several criteria for judging solutions to the overlap problem: readability and maintainability, tool support and compatibility with XML, possible validation schemes, and ease of processing.

Tag soup is, strictly speaking, not overlapping markup. It is malformed HTML, which is a non-overlapping language, and may be ill-defined. HTML5 defines how processors should deal with such mis-nested markup in the HTML syntax and turn it into a single hierarchy.[3] With XHTML and SGML-based HTML, however, mis-nested markup is a strict error and makes processing by standards-compliant systems impossible.[4]
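The strictness of XML-based syntaxes can be observed with any conforming XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` module as a stand-in for such a processor; the helper name `is_well_formed` and the two sample strings are assumptions for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the string parses as a well-formed XML document."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


nested = "<p><b>bold <i>both</i></b> plain</p>"
mis_nested = "<p><b>bold <i>both</b> italic</i></p>"

print(is_well_formed(nested))      # True
print(is_well_formed(mis_nested))  # False: </b> closes before </i>
```

The parser rejects the second string outright rather than attempting to repair it, which is the behaviour the paragraph above describes for XHTML.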

SGML, on which early versions of HTML were based, has a feature called CONCUR that allows multiple independent hierarchies to co-exist without privileging any of them. CONCUR has several drawbacks: DTD validation is a challenge, validation across hierarchies is hard if not impossible, self-overlap cannot be supported, and the feature interacts poorly with commonly used SGML features. It was poorly supported by tools and saw very little actual use; according to a commentary by the standard's editor, representing document overlap was not a recommended use of CONCUR.[5][6]

Within hierarchical languages

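There are several approaches to representing overlap in a non-overlapping language: multiple documents that each encode the same information under a different hierarchy, milestones (empty elements that mark the boundaries of a component), joins (fragmentation of a component into pieces that are later reconstituted), and stand-off markup.[7]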

The Text Encoding Initiative, as an XML-based markup scheme, cannot directly represent overlapping markup; its guidelines suggest all four of the above approaches.[1] The Open Scripture Information Standard is another XML-based scheme, designed to mark up the Bible. It uses empty milestone elements to encode non-privileged components.[17]
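To illustrate how empty milestone elements can carry a non-privileged structure inside an XML tree, the sketch below keeps verse lines as ordinary elements (the privileged hierarchy) and marks sentences with paired empty elements. The element names `poem`, `l`, `s-start` and `s-end` and the `id` attribute are hypothetical; they do not reproduce the actual TEI or OSIS vocabularies.

```python
import xml.etree.ElementTree as ET

# Lines are real elements; sentences are marked by empty milestones.
DOC = (
    "<poem>"
    '<l><s-start id="s1"/>I rose at dawn.<s-end id="s1"/> '
    '<s-start id="s2"/>The </l>'
    '<l>sun was low.<s-end id="s2"/></l>'
    "</poem>"
)


def milestone_spans(root):
    """Return (plain_text, {id: (start, end)}) for s-start/s-end pairs."""
    parts, length = [], 0
    starts, spans = {}, {}

    def add(text):
        nonlocal length
        if text:
            parts.append(text)
            length += len(text)

    def walk(elem):
        # Record the current text offset when a milestone is met.
        if elem.tag == "s-start":
            starts[elem.get("id")] = length
        elif elem.tag == "s-end":
            spans[elem.get("id")] = (starts.pop(elem.get("id")), length)
        add(elem.text)
        for child in elem:
            walk(child)
            add(child.tail)  # text following the child, in document order

    walk(root)
    return "".join(parts), spans


text, spans = milestone_spans(ET.fromstring(DOC))
print(spans)  # {'s1': (0, 15), 's2': (16, 32)}
```

The second sentence starts in one line element and ends in the next, yet the document remains well-formed XML because the sentence boundaries are empty elements rather than a containing element.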

New languages

Another approach is to design an entirely new markup language. Such languages forgo the tool support available for existing languages in exchange for a less complicated semantic model and a more convenient syntax.

Conversion and equivalence

Marinelli, Vitali and Zacchiroli provide algorithms to convert between restricted GODDAGs, ECLIX, LMNL, parallel documents in XML, contiguous stand-off markup and TexMECS.[26]

References

  1. Text Encoding Initiative.
  2. DeRose 2004, The problem types.
  3. W3 Consortium (16 September 2014). "HTML5 (Proposed Recommendation)". Retrieved 2014-10-14.
  4. Sperberg-McQueen & Huitfeldt 2000, 2.1. Non-SGML Notations.
  5. Sperberg-McQueen & Huitfeldt 2000, 2.2. CONCUR.
  6. DeRose 2004.
  7. Di Iorio, Peroni & Vitali 2009.
  8. Text Encoding Initiative, 20.1 Multiple Encodings of the Same Information.
  9. Text Encoding Initiative, 20.2 Boundary Marking with Empty Elements.
  10. Sperberg-McQueen & Huitfeldt 2000, 2.4. Milestones.
  11. DeRose 2004, TEI-style milestones.
  12. Text Encoding Initiative, 20.3 Fragmentation and Reconstitution of Virtual Elements.
  13. DeRose 2004, Segmentation.
  14. Sperberg-McQueen & Huitfeldt 2000, 2.5. Fragmentation.
  15. DeRose 2004, Joins.
  16. Sperberg-McQueen & Huitfeldt 2000, 2.6. Standoff Markup.
  17. Durusau, Patrick (2006). OSIS Users Manual (OSIS Schema 2.1.1) (PDF). Retrieved 2014-10-14.
  18. DeRose 2004, CLIX and LMNL.
  19. Piez, Wendell (August 2012). Luminescent: parsing LMNL by XSLT upconversion. Balisage: The Markup Conference 2012. Montréal. doi:10.4242/BalisageVol8.Piez01. Retrieved 2014-10-14.
  20. Sperberg-McQueen & Huitfeldt 2000, 2.7. MECS.
  21. Sperberg-McQueen & Huitfeldt 2000.
  22. Huitfeldt, Claus; Sperberg-McQueen, C M (2003). "TexMECS: An experimental markup meta-language for complex documents". Retrieved 2014-10-14.
  23. Hilbert, Schonefeld & Witt 2005.
  24. Witt et al. 2007.
  25. Schonefeld 2008.
  26. Marinelli, Vitali & Zacchiroli 2008.