Overlapping markup

In markup languages and the digital humanities, overlap occurs when a document has two or more structures that interact in a non-hierarchical manner. A document with overlapping markup cannot be represented as a tree. This is also known as concurrent markup. Overlap happens, for instance, in poetry, where there may be a metrical structure of feet and lines; a linguistic structure of sentences and quotations; and a physical structure of volumes and pages, together with editorial annotations.[1][2]

History

The problem of non-hierarchical structures in documents has been recognised since 1988; resolving this problem against the dominant paradigm of text as a single hierarchy (an ordered hierarchy of content objects or OHCO) was initially thought to be merely a technical issue, but has, in fact, proven much more difficult.[3] In 2008, Jeni Tennison identified markup overlap as "the main remaining problem area for markup technologists".[4]

Properties and types

A distinction exists between schemes that allow non-contiguous overlap, and those that allow only contiguous overlap. Often, 'markup overlap' strictly means the latter. Contiguous overlap can always be represented as a linear document with milestones, without the need for fragmentation and pointers to fragments, but non-contiguous overlap may require document fragmentation. Another distinction in overlapping markup schemes is whether elements can overlap with other elements of the same kind (self-overlap).[2]
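The distinctions above can be made concrete with a small sketch. It assumes a hypothetical model in which each marked-up element is a contiguous span of character offsets; the names Span, relation, fits_single_tree and has_self_overlap are illustrative and do not come from any cited scheme. Non-contiguous elements would need a list of ranges per element and are not covered here.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """A contiguous marked-up range over character offsets [start, end)."""
    kind: str
    start: int
    end: int


def relation(a: Span, b: Span) -> str:
    """Classify two spans as 'disjoint', 'nested', or 'overlap'."""
    if a.end <= b.start or b.end <= a.start:
        return "disjoint"
    if (a.start <= b.start and b.end <= a.end) or (b.start <= a.start and a.end <= b.end):
        return "nested"
    return "overlap"


def fits_single_tree(spans: list[Span]) -> bool:
    """True if no pair of spans overlaps, so all can share one hierarchy."""
    return all(
        relation(a, b) != "overlap"
        for i, a in enumerate(spans)
        for b in spans[i + 1:]
    )


def has_self_overlap(spans: list[Span]) -> bool:
    """True if two spans of the same kind overlap each other."""
    return any(
        a.kind == b.kind and relation(a, b) == "overlap"
        for i, a in enumerate(spans)
        for b in spans[i + 1:]
    )


# A verse line, a sentence that runs past the end of the line,
# and a quotation that sits entirely inside both.
line = Span("line", 0, 30)
sentence = Span("sentence", 20, 45)
quote = Span("quote", 22, 28)

print(relation(line, sentence))                   # the pair that breaks the tree
print(fits_single_tree([line, quote]))            # nesting alone is unproblematic
print(fits_single_tree([line, sentence, quote]))
```

In this model, a set of spans can be drawn as a single tree exactly when every pair is either disjoint or nested; one overlapping pair is enough to rule that out.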

A scheme may have a privileged hierarchy. Some XML-based schemes, for example, represent one hierarchy directly in the XML document tree, and represent other, overlapping, structures by another means; these are said to be non-privileged.

Approaches and implementations

DeRose (2004, Evaluation criteria) identifies several criteria for judging solutions to the overlap problem: readability and maintainability, tool support and compatibility with XML, possible validation schemes, and ease of processing.

Tag soup is, strictly speaking, not overlapping markup: it is malformed HTML, which is a non-overlapping language, and its interpretation may be ill-defined. HTML5 defines how processors should deal with such mis-nested markup in the HTML syntax and turn it into a single hierarchy.[5] With XHTML and SGML-based HTML, however, mis-nested markup is a strict error and makes processing by standards-compliant systems impossible.[6]
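The strictness of XML-based syntaxes can be seen with any conforming XML parser. The following minimal sketch uses Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample strings are illustrative. It shows only the XML side of the contrast and does not attempt HTML5's recovery rules.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the string as a single tree."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Properly nested: <i> opens and closes inside <b>.
nested = "<p><b>bold <i>both</i></b> plain</p>"

# Mis-nested: <b> is closed while <i> is still open.
mis_nested = "<p><b>bold <i>both</b> italic</i></p>"

print(is_well_formed(nested))
print(is_well_formed(mis_nested))
```

The second string is rejected outright, which is why overlapping structures cannot simply be written as crossing tags in an XML document.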

SGML, on which early versions of HTML were based, has a feature called CONCUR that allows multiple independent hierarchies to co-exist without privileging any. DTD validation is a challenge when using CONCUR, and validation across hierarchies is hard if not impossible. CONCUR could not support self-overlap, and it interacted poorly with commonly used SGML features. The feature was poorly supported by tools and saw very little actual use; according to a commentary by the standard's editor, representing document overlap was not a recommended use case for CONCUR.[7][8]

Within hierarchical languages

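There are several approaches to representing overlap in a non-overlapping language:[9] multiple encodings of the same information,[10] boundary marking with empty milestone elements,[11][12][13] fragmentation with joins that reconstitute virtual elements,[14][15][16][17] and standoff markup.[18]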

The Text Encoding Initiative, as an XML-based markup scheme, cannot directly represent overlapping markup; its guidelines suggest all four of the above approaches.[1] The Open Scripture Information Standard is another XML-based scheme, designed to mark up the Bible. It uses empty milestone elements to encode non-privileged components.[19]
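The milestone technique can be sketched as follows. Lines form the privileged hierarchy, and a sentence that crosses a line break is delimited by a pair of empty elements. The element name ms and its start and end attributes are hypothetical and are not taken from TEI or OSIS; the function milestone_ranges is likewise illustrative.

```python
import xml.etree.ElementTree as ET

# Lines (<l>) are real elements. The sentence that crosses the line break is
# marked only by two empty milestone elements (hypothetical names).
DOC = """<lg>
<l>I wandered far. <ms start="s2"/>The road</l>
<l>was long and cold.<ms end="s2"/> Night fell.</l>
</lg>"""


def milestone_ranges(root: ET.Element) -> dict[str, str]:
    """Collect the text lying between each start/end milestone pair."""
    open_ids: set[str] = set()
    collected: dict[str, list[str]] = {}

    def add_text(text):
        # Text belongs to every milestone range that is currently open.
        if text:
            for ident in open_ids:
                collected[ident].append(text)

    def walk(elem):
        if elem.tag == "ms":
            if "start" in elem.attrib:
                ident = elem.attrib["start"]
                open_ids.add(ident)
                collected[ident] = []
            elif "end" in elem.attrib:
                open_ids.discard(elem.attrib["end"])
        add_text(elem.text)
        for child in elem:
            walk(child)
            add_text(child.tail)

    walk(root)
    # Normalise whitespace so that line breaks inside a range become spaces.
    return {ident: " ".join("".join(parts).split()) for ident, parts in collected.items()}


root = ET.fromstring(DOC)   # the document is well-formed XML
print(len(root.findall("l")))
print(milestone_ranges(root))
```

The document stays well-formed because the overlapping sentence is never an element with content. The cost is that a generic XML tool sees only two unrelated empty elements, so recovering the sentence requires custom processing like the walk above.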

New languages

Another approach is to design an entirely new markup language. Such languages forgo the tool support available for existing languages in exchange for a less complicated semantic model and a more convenient syntax.

Graph-based formalisms

Rather than grounding markup information in a tree, standoff XML employs a data model based on directed graphs.[29] As an alternative to traditional markup, such graph-based data models can be represented with formalisms originally developed for generalized directed multigraphs, most notably the Resource Description Framework (RDF).[30][31] EARMARK is an early RDF/OWL representation that encompasses GODDAGs.[9]

RDF provides different linearizations, including an XML format that can be modeled to mirror conventional standoff XML, and a linearization that lets RDF be expressed in XML attributes (RDFa). Although RDF is semantically equivalent to standoff XML, it does not require special-purpose technology for storing, parsing and querying. Multiple interlinked RDF files representing a document or a corpus may constitute an example of Linguistic Linked Open Data.
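The graph view can be illustrated without any RDF tooling. In the sketch below, annotations are nodes described by (subject, predicate, object) triples that point into an unmodified text by character offset. The "ex:" vocabulary, the sample text and the helper functions are all invented for illustration; a real RDF dataset would use IRIs and would typically be queried with SPARQL.

```python
# The primary text is left untouched; annotations refer to it by offset.
TEXT = "The road was long. Night fell."

# Each annotation is a node whose properties are stated as triples.
TRIPLES = [
    ("ex:line1", "ex:type", "ex:Line"),
    ("ex:line1", "ex:begin", 0),
    ("ex:line1", "ex:end", 12),
    ("ex:sent1", "ex:type", "ex:Sentence"),
    ("ex:sent1", "ex:begin", 0),
    ("ex:sent1", "ex:end", 18),
    ("ex:line2", "ex:type", "ex:Line"),
    ("ex:line2", "ex:begin", 13),
    ("ex:line2", "ex:end", 30),
]


def value(subject, predicate):
    """Return the object of the first triple matching (subject, predicate)."""
    return next(o for s, p, o in TRIPLES if s == subject and p == predicate)


def covered_text(subject):
    """Return the stretch of TEXT that an annotation node refers to."""
    return TEXT[value(subject, "ex:begin"):value(subject, "ex:end")]


def overlapping(subject):
    """List the nodes whose range partially overlaps the given node's range."""
    b, e = value(subject, "ex:begin"), value(subject, "ex:end")
    result = []
    for other in sorted({s for s, _, _ in TRIPLES} - {subject}):
        ob, oe = value(other, "ex:begin"), value(other, "ex:end")
        crosses = b < oe and ob < e
        nested = (b <= ob and oe <= e) or (ob <= b and e <= oe)
        if crosses and not nested:
            result.append(other)
    return result


print(covered_text("ex:sent1"))
print(overlapping("ex:sent1"))
```

Because no annotation contains another syntactically, the sentence and the second line can overlap freely, and neither structure is privileged.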

References

  1. Text Encoding Initiative.
  2. DeRose 2004, The problem types.
  3. Renear, Mylonas & Durand 1993.
  4. Tennison 2008.
  5. W3 Consortium (16 September 2014). "HTML5 (Proposed Recommendation)". Retrieved 2014-10-14.
  6. Sperberg-McQueen & Huitfeldt 2000, 2.1. Non-SGML Notations.
  7. Sperberg-McQueen & Huitfeldt 2000, 2.2. CONCUR.
  8. DeRose 2004.
  9. Di Iorio, Peroni & Vitali 2009.
  10. Text Encoding Initiative, 20.1 Multiple Encodings of the Same Information.
  11. Text Encoding Initiative, 20.2 Boundary Marking with Empty Elements.
  12. Sperberg-McQueen & Huitfeldt 2000, 2.4. Milestones.
  13. DeRose 2004, TEI-style milestones.
  14. Text Encoding Initiative, 20.3 Fragmentation and Reconstitution of Virtual Elements.
  15. DeRose 2004, Segmentation.
  16. Sperberg-McQueen & Huitfeldt 2000, 2.5. Fragmentation.
  17. DeRose 2004, Joins.
  18. Sperberg-McQueen & Huitfeldt 2000, 2.6. Standoff Markup.
  19. Durusau, Patrick (2006). OSIS Users Manual (OSIS Schema 2.1.1) (PDF). Retrieved 2014-10-14.
  20. DeRose 2004, CLIX and LMNL.
  21. Piez, Wendell (August 2012). Luminescent: parsing LMNL by XSLT upconversion. Balisage: The Markup Conference 2012. Montréal. doi:10.4242/BalisageVol8.Piez01. Retrieved 2014-10-14.
  22. Sperberg-McQueen & Huitfeldt 2000, 2.7. MECS.
  23. Sperberg-McQueen & Huitfeldt 2000.
  24. Huitfeldt, Claus; Sperberg-McQueen, C M (2003). "TexMECS: An experimental markup meta-language for complex documents". Retrieved 2014-10-14.
  25. Hilbert, Schonefeld & Witt 2005.
  26. Witt et al. 2007.
  27. Schonefeld 2008.
  28. Marinelli, Vitali & Zacchiroli 2008.
  29. Ide & Suderman 2007.
  30. Cassidy 2010.
  31. Chiarcos 2012.
This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.