S-expression

S-expressions or sexps (for "symbolic expression") are list-based data structures that represent semi-structured data. An S-expression may be a nested list of smaller S-expressions. S-expressions are probably best known for their use in the Lisp family of programming languages. They are typically represented in text by parenthesized, whitespace-separated sequences of character strings, as in (= 4 (+ 2 2)), which represents the Boolean expression commonly written as 4 == (2 + 2) in C and related programming languages.

In Lisp, the first element of every S-expression is an operator and any remaining elements are treated as data. This is called "prefix notation" or "Cambridge Polish notation". Due to the close association between S-expressions and written representations of data, the term "S-expression" in casual conversation often refers to the written representation of an S-expression.

Other uses of S-expressions are in Lisp-derived languages such as DSSSL, and as mark-up in communications protocols like IMAP and John McCarthy's CBCL. The details of the syntax and supported data types vary in the different languages, but the most common feature among these languages is the use of S-expressions and prefix notation.

In Lisp, S-expressions are used to store both source code and data (see McCarthy Recursive Functions of Symbolic Expressions). S-expressions were originally intended only for data to be manipulated by M-expressions, but the first implementation of Lisp was an interpreter of S-expression encodings of M-expressions, and Lisp programmers soon became accustomed to using S-expressions for both code and data. This means that Lisp is homoiconic, that is, the primary representation of programs is also a data structure in a primitive type of the language itself.

Contents

Definition

S-expressions are defined recursively as either single data objects called "atoms" or lists of S-expressions. Numbers, arrays, character strings, and symbols are all atoms.

An S-expression may be a list of other S-expressions, which may also be lists, so S-expressions may be arbitrarily nested to any depth. In Lisp, S-expression lists are constructed from small data structures called cons pairs, written as (x . y). The first element of the cons (x) is the first element of the list, and the second element (y) is the remainder of the list. Longer lists are made up of nested cons pairs, for example (1 . (2 . (3 . nil))). The special symbol nil marks the end of the list. Cons-based lists are usually written more compactly as (1 2 3). Nested lists can also be written as S-expressions: ((milk juice) (honey marmalade)) is a two-element S-expression whose elements are also two-element S-expressions. The whitespace-separated notation used in Lisp (and this article) is typical. Line breaks (newline characters) usually qualify as separators.

A non-atomic S-expression in which the second element is an atom other than nil is a "dotted pair" (e.g., (a . b)); a "dotted list" (e.g., (a b c . d)) is defined recursively as an s-expression in which the second element is either a dotted pair or another dotted list, while a "list" (e.g., (a b c), which is equivalent to (a b c . nil)) is defined as either the reserved atom nil, or a non-atomic s-expression in which the second element is a list.

Example

This is a simple context-free grammar written as an s-expression (Gazdar/Melish, Natural Language Processing in Lisp):

(((S) (NP) (VP))
 ((VP) (V))
 ((VP) (V) (NP))
 ((V) died)
 ((V) employed)
 ((NP) nurses)
 ((NP) patients)
 ((NP) Medicenter)
 ((NP) Dr Chan))

S-expressions in Lisp

Program code can be written in S-expressions, usually using prefix notation.

Example in Common Lisp:

(defun factorial (x)
   (if (zerop x)
       1
       (* x (factorial (- x 1)))))

S-expressions can be read in Lisp using the function READ. READ reads the textual representation of an s-expression and returns Lisp data. The function PRINT can be used to output an s-expression. The output then can be read with the function READ, when all printed data objects have a readable representation. Lisp has readable representations for numbers, strings, symbols, lists and many other data types. Program code can be formatted as pretty printed S-expressions using the function PPRINT.

Lisp programs are valid s-expressions, but not all s-expressions are valid Lisp programs. (1.0 + 3.1) is a valid s-expression, but not a valid Lisp program, since Lisp uses prefix notation and a floating point number (here 1.0) is not valid as an operation (the first element of the expression).

An S-expression preceded by a single quotation mark, as in 'x, is syntactic sugar for a quoted S-expression, in this case (quote x).

Standardization

Standards for some Lisp-derived programming languages include a specification for their S-expression syntax. These include Common Lisp (ANSI standard document ANSI INCITS 226-1994 (R2004)), Scheme (R5RS and R6RS[1]), and ISLISP.

In May 1997, Ron Rivest submitted an Internet-Draft[2] to be considered for publication as an RFC. The draft defined a syntax based on Lisp S-expressions but intended for general-purpose data storage and exchange (similar to XML) rather than specifically for programming. It was never approved as an RFC, but it has since been cited and used by other RFCs (e.g. RFC 2693) and several other publications.[3] It was originally intended for use in SPKI.

Rivest's format defines an S-expression as being either an octet-string (a series of bytes) or a finite list of other S-expressions. It describes three interchange formats for expressing this structure. One is the "advanced transport", which is very flexible in terms of formatting, and is syntactically similar to Lisp-style expressions, but they are not identical. The advanced transport, for example, allows octet-strings to be represented verbatim (the string's length followed by a colon and the entire raw string), a quoted form allowing escape characters, hexadecimal, Base64, or placed directly as a "token" if it meets certain conditions. (Rivest's tokens differ from Lisp tokens in that the former are just for convenience and aesthetics, and treated exactly like other strings, while the latter have specific syntactical meaning.) Another interchange format, intended to be more compact, easier to parse, and unique for any abstract S-expression, is the "canonical representation" which only allows verbatim strings, and prohibits whitespace as formatting outside strings. Finally there is the "basic transport representation", which is either the canonical form or the same encoded as Base64 and surrounded by braces, the latter intended to safely transport a canonically-encoded S-expression in a system which might change spacing (e.g. an email system which has 80-character-wide lines and wraps anything longer than that).

This format has not been widely adapted for use outside of SPKI. Rivest's S-expressions web page provides C source code for a parser and generator, which could theoretically be adapted and embedded into other programs, though licensing on these programs is unclear. However, there are no restrictions on independently implementing the format.

See also

References

  1. ^ [1] Sperber, Dybvig, Flatt, Van Straaten, Findler, Matthews, "Revised6 Report on the Algorithmic Language Scheme
  2. ^ S-Expressions, Network Working Group, Internet Draft, Expires November 4, 1997 - R. Rivest, May 4, 1997 draft-rivest-sexp-00.txt, Ronald L. Rivest, CSAIL MIT website
  3. ^ rivest sexp, Google Scholar (search)

External links

Free software implementations are available: