Delimiter

From Wikipedia, the free encyclopedia

A stylistic depiction of a fragment from a CSV-formatted text file. The commas (shown in red) are used as field delimiters.

A delimiter is a sequence of one or more characters used to specify the boundary between separate, independent regions in plain text. An example of a delimiter is the comma character in a sequence of comma-separated values.

An alternative to the use of field delimiters is declarative notation, which uses a length field at the start of a region to specify the boundary. For word delimiters used in the written form of human languages, see interword separation.
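The length-field alternative can be illustrated with a minimal Python sketch. The `length:content` layout and the function names `encode_fields` and `decode_fields` are assumptions made for this illustration, not a standard format named in the article.

```python
def encode_fields(fields):
    """Prefix each field with its length in characters, followed by a colon."""
    return "".join(f"{len(field)}:{field}" for field in fields)


def decode_fields(data):
    """Read back the fields written by encode_fields."""
    fields = []
    pos = 0
    while pos < len(data):
        colon = data.index(":", pos)      # the length field ends at the first colon
        length = int(data[pos:colon])
        start = colon + 1
        fields.append(data[start:start + length])
        pos = start + length              # the next length field starts here
    return fields


encoded = encode_fields(["nancy", "$30,000", "a:b"])
print(encoded)                 # 5:nancy7:$30,0003:a:b
print(decode_fields(encoded))  # ['nancy', '$30,000', 'a:b']
```

Because the reader knows how many characters to consume, a field may contain commas or colons without being split.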

Overview

Delimiters can be broken down into:

  • Field and Record delimiters
  • Bracket delimiters

Field and record delimiters

Field delimiters are used to separate data fields. Record delimiters separate groups of fields.[1]

For example, the comma-separated values (CSV) file format uses a comma as the delimiter between fields and an end-of-line indicator as the delimiter between records. For instance:

   fname,lname,age,salary
   nancy,davolio,33,$30000
   erin,borakova,28,$25250
   tony,raphael,35,$28700

specifies a simple flat-file database table using the CSV file format.
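As a rough illustration, the table above can be read in Python by splitting first on the record delimiter and then on the field delimiter. This is a naive sketch that assumes no field contains a comma or a line break; the variable names are chosen for this example only.

```python
text = (
    "fname,lname,age,salary\n"
    "nancy,davolio,33,$30000\n"
    "erin,borakova,28,$25250\n"
    "tony,raphael,35,$28700\n"
)

# Record delimiter: end of line. Field delimiter: comma.
records = [line.split(",") for line in text.splitlines()]
header, rows = records[0], records[1:]

print(header)     # ['fname', 'lname', 'age', 'salary']
print(rows[0])    # ['nancy', 'davolio', '33', '$30000']
print(len(rows))  # 3
```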

Bracket delimiters

Bracket delimiters (also block delimiters, region delimiters, balanced delimiters) are used to mark both the start and end of a region of text.[2] They are used in almost all programming languages, including Wikicode.

Some of the more common bracket delimiters include the following:

  • ( and )
  • { and }
  • [ and ]
  • < and >
  • " and " commonly used to denote string literals.
  • ' and ' commonly used to denote string literals.
  • <? and ?> used to indicate XML processing instructions.
  • /* and */ used to denote comments in some programming languages.
  • {| and |} used to indicate a table in Wikicode.
  • <% and %> used in some web templates to specify language boundaries.
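Because bracket delimiters come in pairs, a reader of the text can check that every opening delimiter is closed in the right order. The following Python sketch does this with a stack for the four single-character pairs listed above. The names `PAIRS` and `is_balanced` are hypothetical, and the sketch deliberately ignores quote marks (where the opening and closing characters are identical) and multi-character delimiters such as /* and */.

```python
# Maps each closing delimiter to the opening delimiter it must match.
PAIRS = {")": "(", "}": "{", "]": "[", ">": "<"}


def is_balanced(text):
    """Return True if every bracket delimiter in text is properly paired."""
    stack = []
    for ch in text:
        if ch in PAIRS.values():
            stack.append(ch)              # remember the open region
        elif ch in PAIRS:
            # A closer must match the most recently opened region.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack                      # nothing may be left open


print(is_balanced("f(a[0], {x: <y>})"))  # True
print(is_balanced("f(a[0)]"))            # False
```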

Conventions

Computing platforms historically use certain delimiters by convention.

Field and Record delimiters

                      End of Field            End of Record           End of File
Unix (and Mac OS X)   Tab                     LF                      none
Windows               Tab                     CRLF                    none
Classic Mac OS        Tab                     CR                      none
Unicode               UNIT SEPARATOR          RECORD SEPARATOR        FILE SEPARATOR
                      Position 31 (U+001F)    Position 30 (U+001E)    Position 28 (U+001C)
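The Unicode separator characters in the table can be used directly as delimiters. The Python sketch below joins and splits a small table with U+001F and U+001E. The sample rows and constant names are illustrative assumptions, and the approach only works if the data itself never contains these control characters.

```python
UNIT_SEPARATOR = "\x1f"    # U+001F, marks the end of a field
RECORD_SEPARATOR = "\x1e"  # U+001E, marks the end of a record

rows = [["nancy", "davolio", "$30,000"], ["erin", "borakova", "$25,250"]]

# Join fields with the unit separator, then records with the record separator.
encoded = RECORD_SEPARATOR.join(UNIT_SEPARATOR.join(row) for row in rows)

# Split in the opposite order to recover the table.
decoded = [record.split(UNIT_SEPARATOR) for record in encoded.split(RECORD_SEPARATOR)]

print(decoded == rows)  # True
```

Commas and line breaks in the field values need no special handling here, since neither acts as a delimiter.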


Programming languages

         String Literal                                       End of Statement
Pascal   singlequote                                          semicolon
C        doublequote (singlequote for character literals)     semicolon

Delimiter collision

Delimiter collision is a problem that occurs when an author or programmer introduces delimiters into text without actually intending them to be interpreted as boundaries between separate regions.[1] In the case of CSV files, for example, this can occur whenever an author attempts to include a comma as part of a field value (e.g., salary = "$30,000").
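The effect can be reproduced with a short Python sketch. The record below is a hypothetical variant of the earlier sample data with a thousands separator added to the salary; a naive split on the comma then yields five fields where four were intended.

```python
FIELD_NAMES = ["fname", "lname", "age", "salary"]

# The salary value contains the field delimiter.
line = "nancy,davolio,33,$30,000"
fields = line.split(",")

print(fields)                           # ['nancy', 'davolio', '33', '$30', '000']
print(len(fields) == len(FIELD_NAMES))  # False
```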

Because this is a very common problem, a number of methods for avoiding delimiter collision have been invented.

Escape character

One method for avoiding delimiter collision is to use escape characters. From a language design standpoint, these are adequate, but they have some drawbacks:

  • text can be rendered unreadable when littered with numerous escape characters
  • they require a mechanism to 'escape the escapes' when not intended as escape characters
  • although easy to type, they can be cryptic to someone unfamiliar with the language
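A minimal Python sketch of the escape-character approach follows, assuming a backslash as the escape character and a comma as the field delimiter. The function names `escape` and `split_escaped` are hypothetical. The sketch also shows the 'escape the escapes' step mentioned above: a literal backslash in the data is doubled before delimiters are escaped.

```python
def escape(field, delimiter=",", esc="\\"):
    """Escape the escape character first, then the delimiter."""
    return field.replace(esc, esc + esc).replace(delimiter, esc + delimiter)


def split_escaped(line, delimiter=",", esc="\\"):
    """Split on unescaped delimiters and remove the escape characters."""
    fields, current = [], []
    chars = iter(line)
    for ch in chars:
        if ch == esc:
            current.append(next(chars, ""))  # take the next character literally
        elif ch == delimiter:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


row = ["nancy", "$30,000", "C:\\temp"]
line = ",".join(escape(field) for field in row)

print(line)                        # nancy,$30\,000,C:\\temp
print(split_escaped(line) == row)  # True
```

The encoded line illustrates the readability drawback: even this short record picks up three extra backslashes.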

Escape sequence

Escape sequences are similar to escape characters, except they usually consist of some kind of mnemonic instead of just a single character. One use is in string literals that include a doublequote (") character. For example in Perl, the code:

   print "Nancy said \x22Hello World!\x22 to the crowd.";  ### use \x22

produces the same output as:

   print "Nancy said \"Hello World!\" to the crowd.";      ### use escape char

One drawback of escape sequences, when used by people, is the need to memorize the various codes that represent individual characters (see also: character entity reference, numeric character reference).
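The code that must be memorized is simply the character's numeric value. As a small Python sketch (Python is used here only for illustration and happens to accept the same hexadecimal notation), the doublequote has the value 22 in hexadecimal:

```python
# \x22 is the hexadecimal escape sequence for the doublequote character.
with_hex = "Nancy said \x22Hello World!\x22 to the crowd."
# The same string written with the backslash escape character.
with_escape = "Nancy said \"Hello World!\" to the crowd."

print(with_hex == with_escape)  # True
print(hex(ord('"')))            # 0x22
```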

Dual quoting delimiters

In contrast to escape sequences and escape characters, dual delimiters provide yet another way to avoid delimiter collision. Some languages, for example, allow the use of either a singlequote (') or a doublequote (") to specify a string literal. For example in Perl:

   print 'Nancy said "Hello World!" to the crowd.';

produces the desired output without requiring escapes. This approach, however, only works when the string does not contain both types of quotation marks.
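Python is another language that accepts either quotation mark as a string delimiter. The sketch below, with sentences invented for illustration, shows both the benefit and the limitation: once a string contains both kinds of quotation mark, one of them has to be escaped again.

```python
# Doublequotes need no escaping inside a singlequote-delimited literal.
inner_double = 'Nancy said "Hello World!" to the crowd.'

# An apostrophe needs no escaping inside a doublequote-delimited literal.
inner_single = "Nancy doesn't want to talk to the crowd."

# With both kinds present, one of them must still be escaped.
both = 'Nancy doesn\'t want to say "Hello World!" anymore.'

print(inner_double)
print(inner_single)
print(both)
```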

Multiple quoting delimiters

In contrast to dual delimiters, multiple delimiters are even more flexible for avoiding delimiter collision. For example in Perl:

   print qq^Nancy doesn't want to say "Hello World!" anymore.^;
   print qq@Nancy doesn't want to say "Hello World!" anymore.@;
   print qq§Nancy doesn't want to say "Hello World!" anymore.§;

all produce the desired output through use of the quotelike operator, which allows numerous different characters to act as delimiters. Although this method is more flexible, few languages support it. Perl and Ruby are two that do.

Content boundary

A content boundary is a special type of delimiter that is specifically designed to resist delimiter collision. It works by allowing the author to specify a long sequence of characters that is guaranteed to always indicate a boundary between regions in text, with no other possible interpretation.

This is usually done by specifying a random sequence of characters followed by an identifying mark such as a UUID, a timestamp, or some other distinguishing mark (see, e.g., MIME and here documents).
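A content boundary can be sketched in Python as follows. The `--boundary-` prefix, the use of a random UUID, and the function names are assumptions made for this example; they loosely resemble MIME multipart boundaries but do not implement that format. The sketch also checks that the chosen boundary does not occur in any part and picks another one if it does.

```python
import uuid


def join_with_boundary(parts):
    """Join text parts with a boundary line that occurs in none of them."""
    while True:
        boundary = f"--boundary-{uuid.uuid4().hex}"
        if not any(boundary in part for part in parts):
            break                         # retry only on an (unlikely) collision
    return boundary, f"\n{boundary}\n".join(parts)


def split_on_boundary(boundary, text):
    """Recover the parts, given the boundary that was used to join them."""
    return text.split(f"\n{boundary}\n")


parts = ["fname,lname\nnancy,davolio", "--- not a boundary ---", "a\n\nb"]
boundary, message = join_with_boundary(parts)

print(split_on_boundary(boundary, message) == parts)  # True
```

The boundary has to be communicated to the reader separately from the content, for example in a header, as `join_with_boundary` does by returning it alongside the message.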

ASCII armor

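Although principally used as a mechanism for text encoding of binary data, ASCII armoring is also a way to avoid delimiter collision in some circumstances. It differs from the other methods in that it is more complicated: rather than marking individual delimiter characters, it re-encodes the entire content with a special encoding scheme, such as base64, so that the delimiter cannot appear in the encoded text as long as it lies outside the encoding's alphabet.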
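As a hedged illustration, the Python sketch below armors each field with base64 before joining the fields with commas. The helper names `armor` and `unarmor` and the sample row are assumptions for this example. Since the standard base64 alphabet contains no comma, the encoded fields cannot collide with the field delimiter.

```python
import base64


def armor(field):
    """Encode a text field as base64; the output contains no commas or quotes."""
    return base64.b64encode(field.encode("utf-8")).decode("ascii")


def unarmor(token):
    """Decode a base64 token back into the original text field."""
    return base64.b64decode(token).decode("utf-8")


row = ["nancy", "$30,000", 'said "hi", twice']
line = ",".join(armor(field) for field in row)

print("," in armor("$30,000"))                            # False
print([unarmor(token) for token in line.split(",")] == row)  # True
```

The cost is that the armored line is no longer human-readable and is roughly a third longer than the original data.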

References

  1. de Moor, Georges J. (1993). Progress in Standardization in Health Care Informatics. IOS Press. ISBN 9051991142. p. 141
  2. Friedl, Jeffrey E. F. (2002). Mastering Regular Expressions: Powerful Techniques for Perl and Other Tools. O'Reilly. ISBN 0596002890. p. 319
