Delimiter
From Wikipedia, the free encyclopedia
A delimiter is a sequence of one or more characters used to specify the boundary between separate, independent regions in plain text or other data streams.[1] An example of a delimiter is the comma character, which acts as a field delimiter in a sequence of comma-separated values.
An alternative to the use of field delimiters is declarative notation, which uses a length field at the start of a region to specify the boundary. For word delimiters used in the written form of human languages, see interword separation.
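As a rough illustration of the length-field idea, the following Python sketch prefixes each region with its character count. The helper names `encode_regions` and `decode_regions` and the `<length>:<text>` layout are assumptions made for this example, not part of any particular standard.

```python
# Hypothetical helpers: each region is stored as "<length>:<text>", so the
# text itself may contain any character, including commas and colons.

def encode_regions(regions):
    """Concatenate regions, each preceded by its length and a colon."""
    return "".join(f"{len(region)}:{region}" for region in regions)


def decode_regions(data):
    """Read a length field, then exactly that many characters, and repeat."""
    regions = []
    pos = 0
    while pos < len(data):
        colon = data.index(":", pos)        # end of the length field
        length = int(data[pos:colon])
        start = colon + 1
        regions.append(data[start:start + length])
        pos = start + length                # jump past the region
    return regions


encoded = encode_regions(["nancy", "$30,000", ""])
decoded = decode_regions(encoded)
```

Because the decoder counts characters instead of searching for a separator, a comma inside a region is never mistaken for a boundary.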
Overview
Delimiters can be broken down into:
- Field and record delimiters; and
- Bracket delimiters.
Field and record delimiters
Field delimiters separate data fields. Record delimiters separate groups of fields.[2]
For example, the comma-separated values (CSV) file format uses a comma as the delimiter between fields, and an end-of-line indicator as the delimiter between records. For instance:
fname,lname,age,salary
nancy,davolio,33,$30000
erin,borakova,28,$25250
tony,raphael,35,$28700
specifies a simple flat file database table using the CSV file format.
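The way the two delimiters divide this table can be sketched in Python. This is a minimal illustration that assumes no field contains a comma or a line break; the variable names are chosen for the example only.

```python
# The sample table, with "\n" as the record delimiter and "," as the field delimiter.
sample = (
    "fname,lname,age,salary\n"
    "nancy,davolio,33,$30000\n"
    "erin,borakova,28,$25250\n"
    "tony,raphael,35,$28700\n"
)

records = sample.splitlines()                        # split on the record delimiter
header, *rows = [record.split(",") for record in records]  # split on the field delimiter
table = [dict(zip(header, row)) for row in rows]     # pair each value with its column name
```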
Conventions
Computing platforms historically use certain delimiters by convention.[3] The following tables depict just a few examples for comparison.
Programming languages (See also, Comparison of programming languages (syntax)).
Language | String literal | End of statement |
---|---|---|
Pascal | singlequote | semicolon |
C | doublequote, singlequote | semicolon |
Field and Record delimiters (See also, ASCII, Control character).
Platform | End of field | End of record | End of file |
---|---|---|---|
Unix (and Mac OS X) | Tab | LF | none |
Windows | Tab | CRLF | none |
Classic Mac OS | Tab | CR | none |
Unicode | UNIT SEPARATOR, position 31 (U+001F) | RECORD SEPARATOR, position 30 (U+001E) | FILE SEPARATOR, position 28 (U+001C) |
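As a sketch of how the dedicated separator characters from the table might be used, the following Python fragment joins fields with U+001F and records with U+001E. The sample rows are invented for illustration, and the approach assumes the data itself never contains these control characters.

```python
UNIT_SEPARATOR = "\x1f"     # U+001F, ends a field
RECORD_SEPARATOR = "\x1e"   # U+001E, ends a record

# Invented sample data containing commas, tabs, and line breaks.
rows = [["nancy", "$30,000"], ["erin", "line one\nline two\tindented"]]

encoded = RECORD_SEPARATOR.join(UNIT_SEPARATOR.join(fields) for fields in rows)
decoded = [record.split(UNIT_SEPARATOR) for record in encoded.split(RECORD_SEPARATOR)]
```

Commas, tabs, and line breaks pass through unchanged because none of them acts as a delimiter in this scheme.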
Bracket delimiters
Bracket delimiters (also block delimiters, region delimiters, balanced delimiters) mark both the start and end of a region of text.[4] They are used in almost all programming languages, including Wikicode.[5]
Common examples of bracket delimiters include:[6]
Delimiters | Description |
---|---|
( and ) | Parentheses. The Lisp programming language syntax is cited as recognizable primarily from its use of parentheses.[7] |
{ and } | Curly brackets.[8] |
[ and ] | Square brackets.[9] |
< and > | Angle brackets.[10] |
" and " | Commonly used to denote string literals. |
' and ' | Commonly used to denote string literals. |
<? and ?> | Used to indicate XML processing instructions. |
/* and */ | Used to denote comments in some programming languages. |
{| and |} | Used to indicate a table in Wikicode. |
<% and %> | Used in some web templates to specify language boundaries. These are also called template delimiters.[11] |
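Because bracket delimiters come in matched pairs, a program reading them typically has to check that every opening delimiter is closed in the right order. The following Python sketch shows one common way to do this with a stack; the function name `is_balanced` is hypothetical, and only three of the bracket pairs listed above are handled.

```python
# Map each closing bracket to the opening bracket it must match.
PAIRS = {")": "(", "]": "[", "}": "{"}


def is_balanced(text):
    """Return True if every bracket in text is closed in the right order."""
    stack = []
    for ch in text:
        if ch in PAIRS.values():          # opening bracket: remember it
            stack.append(ch)
        elif ch in PAIRS:                 # closing bracket: must match the latest opener
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack                      # leftover openers mean an unclosed region
```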
Delimiter collision
Delimiter collision is a problem that occurs when an author or programmer introduces delimiters into text without actually intending them to be interpreted as boundaries between separate regions.[2] In comma-separated values files, for example, this can occur whenever an author attempts to include a comma as part of a field value (e.g., salary = "$30,000"). In XML, it can occur whenever an author attempts to include an angle bracket character as literal text.
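The comma case can be reproduced in a few lines of Python. This sketch compares a naive split on commas with the standard library's `csv` reader, which treats a doublequoted field as a single value; the sample line is invented for illustration.

```python
import csv
import io

# One record whose salary field contains a comma inside doublequotes.
line = 'nancy,davolio,33,"$30,000"'

# Naive approach: every comma is treated as a field boundary.
naive_fields = line.split(",")

# csv.reader understands the quoting convention and keeps the field whole.
parsed_fields = next(csv.reader(io.StringIO(line)))
```

The naive split yields five fields instead of four, because the comma inside the salary collides with the field delimiter.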
Solutions
Because delimiter collision is a very common problem, various methods for avoiding it have been invented.
Escape character
One method for avoiding delimiter collision is to use escape characters. From a language design standpoint, these are adequate, but they have drawbacks:
- text can be rendered unreadable when littered with numerous escape characters;
- they require a mechanism to 'escape the escapes' when not intended as escape characters; and
- although easy to type, they can be cryptic to someone unfamiliar with the language.[12]
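A minimal Python sketch of the escape-character approach follows. It assumes a backslash as the escape character and a comma as the field delimiter; the helper names `escape_field` and `split_escaped` are invented for this example. It also shows the 'escape the escapes' step mentioned above.

```python
ESCAPE = "\\"   # assumed escape character
DELIM = ","     # assumed field delimiter


def escape_field(field):
    """Escape the escape character first, then the delimiter."""
    return field.replace(ESCAPE, ESCAPE + ESCAPE).replace(DELIM, ESCAPE + DELIM)


def split_escaped(line):
    """Split on unescaped delimiters and drop the escape characters."""
    fields, current = [], []
    chars = iter(line)
    for ch in chars:
        if ch == ESCAPE:
            current.append(next(chars, ""))   # take the next character literally
        elif ch == DELIM:
            fields.append("".join(current))   # unescaped delimiter ends the field
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


values = ["nancy", "$30,000", "C:\\temp"]
line = DELIM.join(escape_field(value) for value in values)
```

The encoded line is already harder to read than the original values, which illustrates the first drawback in the list.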
Escape sequence
Escape sequences are similar to escape characters, except they usually consist of some kind of mnemonic instead of just a single character. One use is in string literals that include a doublequote (") character. For example in Perl, the code:
print "Nancy said \x22Hello World!\x22 to the crowd."; ### use \x22 (hexadecimal code of the doublequote)
produces the same output as:
print "Nancy said \"Hello World!\" to the crowd."; ### use escape char
One drawback of escape sequences, when used by people, is the need to memorize the codes that represent individual characters (see also: character entity reference, numeric character reference).
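The same idea can be checked in Python, whose string literals also accept hexadecimal escape sequences. In this sketch the doublequote is written once by its character code (hexadecimal 22, decimal 34) and once with a backslash escape character; the variable names are chosen for illustration.

```python
# Doublequote written as a hexadecimal escape sequence.
with_hex_escape = "Nancy said \x22Hello World!\x22 to the crowd."

# Doublequote written with the backslash escape character.
with_escape_char = "Nancy said \"Hello World!\" to the crowd."

# Both literals denote the same string.
same_text = with_hex_escape == with_escape_char
```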
Dual quoting delimiters
In contrast to escape sequences and escape characters, dual delimiters provide yet another way to avoid delimiter collision. Some languages, for example, allow the use of either a singlequote (') or a doublequote (") to specify a string literal. For example in Perl:
print 'Nancy said "Hello World!" to the crowd.';
produces the desired output without requiring escapes. This approach, however, only works when the string does not contain both types of quotation marks.
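Python is another language that accepts either quotation mark as a string delimiter. The short sketch below shows the technique and its limit: once a string contains both kinds of quotation mark, one of them still has to be escaped.

```python
# Singlequote delimiters let doublequotes appear unescaped.
single_quoted = 'Nancy said "Hello World!" to the crowd.'

# Doublequote delimiters let an apostrophe appear unescaped.
double_quoted = "Nancy doesn't say hello."

# With both marks in the text, one of them needs an escape after all.
both = 'Nancy doesn\'t say "Hello World!" anymore.'
```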
Multiple quoting delimiters
In contrast to dual delimiters, multiple delimiters are even more flexible for avoiding delimiter collision.[13]
For example in Perl:
print qq^Nancy doesn't want to say "Hello World!" anymore.^;
print qq@Nancy doesn't want to say "Hello World!" anymore.@;
print qq§Nancy doesn't want to say "Hello World!" anymore.§;
all produce the desired output through use of the quotelike operator, which lets the author choose a delimiter character that does not occur in the string. Although this method is more flexible, few languages support it. Perl and Ruby are two that do.[14][15]
Content boundary
A content boundary is a special type of delimiter that is specifically designed to resist delimiter collision. It works by allowing the author to specify a long sequence of characters that is guaranteed to always indicate a boundary between parts in a multi-part message, with no other possible interpretation.[16]
This is usually done by specifying a random sequence of characters followed by an identifying mark such as a UUID, a timestamp, or some other distinguishing mark. (See e.g., MIME, Here documents).
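The idea can be sketched in Python as follows. This is not the MIME wire format; the helper names, the `--boundary-` prefix, and the line layout are assumptions made for the example. The boundary combines a fixed prefix with a random UUID and is regenerated in the unlikely case that it already occurs in one of the parts.

```python
import uuid


def make_boundary(parts):
    """Pick a random boundary string that occurs in none of the parts."""
    while True:
        boundary = f"--boundary-{uuid.uuid4().hex}"
        if not any(boundary in part for part in parts):
            return boundary


def join_parts(parts):
    """Return the chosen boundary and the parts joined by boundary lines."""
    boundary = make_boundary(parts)
    return boundary, f"\n{boundary}\n".join(parts)


def split_parts(boundary, message):
    """Recover the parts by splitting on the boundary lines."""
    return message.split(f"\n{boundary}\n")


parts = ["name,salary\nnancy,$30,000", 'She said "Hello World!"', "--boundary-"]
boundary, message = join_parts(parts)
recovered = split_parts(boundary, message)
```

The parts may contain commas, quotation marks, line breaks, and even the boundary prefix, since only the full random sequence is treated as a delimiter.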
Whitespace or indentation
Some programming and computer languages allow the use of whitespace delimiters or indentation[17] as a means of specifying boundaries between independent regions in text. Python and YAML are prominent examples.
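A small Python example shows indentation acting as the delimiter. No bracket or keyword closes the blocks; each region ends where the indentation returns to an outer level. The function is invented for illustration.

```python
def classify(n):
    if n % 2 == 0:
        label = "even"      # indented: inside the if block
    else:
        label = "odd"       # indented: inside the else block
    return label            # dedented: outside if/else, still inside the function
```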
Regular expression syntax
- See also: Regular expression examples
In specifying a regular expression, alternate delimiters may also be used to simplify the syntax for match and substitution operations in Perl.[18]
For example, a simple match operation may be specified in Perl with the following syntax:
$string1 = 'Nancy said "Hello World!" to the crowd.'; # specify a target string
print $string1 =~ m/[aeiou]+/; # match one or more vowels
The syntax is flexible enough to specify match operations with alternate delimiters, making it easy to avoid delimiter collision:
$string1 = 'Nancy said "httq://Hello/World.htm" is not a valid address.'; # target string
print $string1 =~ m@httq://@; # match using alternate regular expression delimiter
print $string1 =~ m{httq://}; # same as previous, but different delimiter
print $string1 =~ m!httq://!; # same as previous, but different delimiter
ASCII armor
Although principally used as a mechanism for text encoding of binary data, "ASCII armoring" is a programming and systems administration technique that also helps to avoid delimiter collision in some circumstances.[19][20] This technique differs from the other approaches described above because it is more complicated, and is therefore less suitable for small applications and simple data storage formats. The technique employs a special encoding scheme, such as base64, to ensure that delimiter characters do not appear in transmitted data.
This technique is used, for example, in Microsoft's ASP.NET web development technology, and is closely associated with the "VIEWSTATE" component of that system.[21]
Example
The following is a simplified example demonstrating how this technique works in practice.
The first code fragment shows a simple HTML tag in which the VIEWSTATE value contains characters that are incompatible with the delimiters of the HTML tag itself:
<input type="hidden" __VIEWSTATE="BookTitle:Nancy doesn't say "Hello World!" anymore." />
This first code fragment is not well-formed, and would therefore not work properly in a "real world" deployed system.
In contrast, the second code fragment shows the same HTML tag, except this time incompatible characters in the VIEWSTATE value are removed through the application of base64 encoding:
<input type="hidden" __VIEWSTATE="Qm9va1RpdGxlOk5hbmN5IGRvZXNuJ3Qgc2F5ICJIZWxsbyBXb3JsZCEiIGFueW1vcmUu" />
This prevents delimiter collision and ensures that incompatible characters will not appear inside the HTML code, regardless of what characters appear in the original (decoded) text.[21]
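The encoding step can be reproduced with Python's standard `base64` module. This sketch only illustrates the principle and makes no claim about how ASP.NET builds its VIEWSTATE value; the variable names are chosen for the example.

```python
import base64

# Text containing both kinds of quotation mark.
original = 'BookTitle:Nancy doesn\'t say "Hello World!" anymore.'

# Encode to base64, whose alphabet is A-Z, a-z, 0-9, "+", "/" and "=".
armored = base64.b64encode(original.encode("utf-8")).decode("ascii")

# The armored value can sit inside a doublequoted attribute safely.
tag = f'<input type="hidden" __VIEWSTATE="{armored}" />'

# Decoding restores the original text unchanged.
restored = base64.b64decode(armored).decode("utf-8")
```

Since the base64 alphabet contains neither quotation marks nor angle brackets, the armored value cannot collide with the delimiters of the surrounding tag.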
Notes and references
- ^ Federal Standard 1037C delimiter
- ^ a b de Moor, Georges J. (1993). Progress in Standardization in Health Care Informatics. IOS Press. ISBN 9051991142. p. 141
- ^ International Organization for Standardization (December 1, 1975). "The set of control characters for ISO 646". Internet Assigned Numbers Authority Registry. Alternate U.S. version: [1]. Accessed August 7, 2005.
- ^ Friedl, Jeffrey E. F. (2002). Mastering Regular Expressions: Powerful Techniques for Perl and Other Tools. O'Reilly. ISBN 0596002890. p. 319
- ^ Scott, Michael Lee (1999). Programming Language Pragmatics. Morgan Kaufmann. ISBN 1558604421.
- ^ Wall, Larry, Tom Christiansen and Jon Orwant (July 2000). Programming Perl, Third Edition. O'Reilly.
- ^ Kaufmann, Matt (2000). Computer-Aided Reasoning: An Approach. Springer. ISBN 0792377443. p. 3
- ^ See e.g., curly bracket programming language.
- ^ See e.g., Wikicode.
- ^ See e.g., HTML.
- ^ See e.g., Smarty template system documentation. Retrieved on 2007-07-19. Text-ScriptTemplate documentation on CPAN. Retrieved on 2007-07-19.
- ^ Kahrel, Peter (2006). Automating InDesign with Regular Expressions. O'Reilly. ISBN 0596529376. p. 11
- ^ Wall, supra p. 63
- ^ Matsumoto, Yukihiro (2001). Ruby in a Nutshell. O'Reilly. ISBN 0596002149. In Ruby, these are indicated as general delimited strings. p. 11
- ^ Wall, supra p. 63
- ^ Javvin Technologies, Incorporated (2005). Network Protocols Handbook. Javvin Technologies Inc. ISBN 0974094528. p. 26
- ^ (See also, Indent style, Programming_style#Indenting).
- ^ Friedl, supra.
- ^ Rhee, Man (2003). Internet Security: Cryptographic Principles, Algorithms and Protocols. John Wiley and Sons. ISBN 0470852852. (an example usage of ASCII armoring in encryption applications)
- ^ Gross, Christian (2005). Open Source for Windows Administrators. Charles River Media. ISBN 1584503475. (an example usage of ASCII armoring in encryption applications)
- ^ a b Kalani, Amit (2004). Developing and Implementing Web Applications with Visual C# .NET and Visual Studio .NET. Que. ISBN 0789729016. (describes the use of Base64 encoding and VIEWSTATE inside HTML source code)
See also
- Delimiter-separated values
- String literal
- CamelCase (used in WikiWikiWeb as an alternate method of link creation that does not require delimiters to indicate links)
- Federal Standard 1037C (contains a simple definition for "delimiter")
- Naming collision
- Sigil