Perl Compatible Regular Expressions

Latest release:	7.0 / 2006-12-19
OS:	cross platform
Use:	pattern matching library
License:	BSD
Website:	http://www.pcre.org

Perl Compatible Regular Expressions (PCRE) is a regular expression C library inspired by Perl's external interface, written by Philip Hazel. PCRE is much richer than classic regular expression libraries, which is why it has been adopted by many modern programming languages. Its syntax is much more powerful and flexible than that of POSIX regular expressions. The name is something of a misnomer, because PCRE is "Perl Compatible" only if you consider a subset of PCRE's settings and a subset of Perl's regular expression facilities.

PCRE's settings also allow it to emulate regular expression libraries other than Perl's. For example, a setting selects whether a backslash enables (Emacs-like) or disables (Perl-like) the special meaning of characters such as the vertical bar. C and C++ interfaces are provided by the library itself.

Features

PCRE has developed a unique feature set, and the development of features such as recursive sub-patterns has continued in Perl and PCRE in parallel but in a coordinated way.^[1]

Currently, the following features are available:

\ is always an escape: Like Perl, PCRE always treats \ as an escape when it precedes a non-alphanumeric character. In POSIX regular expressions, a backslash sometimes escapes a non-alphanumeric character (e.g. \.) and sometimes introduces a special feature (e.g. \(\)).
Extended character classes: Single-letter character classes are supported in addition to the longer POSIX names. For example \d matches any digit exactly as [[:digit:]] would in POSIX regular expressions.
Minimal matching: A ? may be placed after any repeat count to indicate that the shortest match should be used. The default is to attempt the longest match first, and backtrack through shorter matches. For example, "a.*?b" would match "ab" in "ababab", where "a.*b" would match the entire string.
Unicode character properties: Unicode defines several properties for each character. Patterns in PCRE can match these properties. For example, \p{Ps}.*?\p{Pe} would match a string that is delimited by any "opening punctuation" and any "close punctuation", such as "[abc]".
Multiline matching: ^ and $ can match at the beginning and end of a string only, or at the start and end of each "line" within the string depending on what options are set.
Named subpatterns: A sub-pattern (surrounded by (...)) may be named by including a leading "?P<name>" after the open-paren. Named subpatterns are a feature that PCRE adopted from Python regular expressions.
Backreferences: A pattern may refer back to the results of a previous match. For example, (a|b)c\1 matches "aca" or "bcb": after "a" or "b" followed by "c", the \1 must match the same character (an "a" or a "b") that matched in the first subpattern.
Atomic grouping: Atomic grouping is a way of preventing backtracking into part of a pattern. For example, (?>a+)bc, which can also be written with a possessive quantifier as a++bc, will match as many "a"s as possible and never back up to try one fewer.
Lookahead and lookbehind assertions: Patterns may assert that previous text or subsequent text contains a pattern without consuming matched text (zero-width assertion). For example, /\w+(?=\t)/ matches a word followed by a tab, without including the tab.
Escape sequences for zero-width assertions: For example, \b matches a zero-width "word boundary", a position with a word character on exactly one side, roughly equivalent to (?<=\w)(?!\w)|(?<!\w)(?=\w).
Comments: A comment begins with (?# and ends at the next close-paren.
Recursive patterns: A pattern can refer back to itself recursively or to any subpattern. For example, the pattern "\((a*|(?R))*\)" will match any combination of balanced parentheses and "a"s.
Generic callouts: PCRE expressions can embed "(?Cn)" where n is some number. This will call out to an external, user-defined function through the PCRE API, and can be used to embed arbitrary code in a pattern.
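
Several of the features listed above can be tried out in a short sketch. The example below uses Python's standard re module rather than PCRE itself; this is an assumption made for illustration only, chosen because re happens to accept the same syntax for minimal matching, \d, multiline anchors, ?P<name> subpatterns, backreferences, lookahead, \b and (?#...) comments. It does not cover Unicode properties, recursion or callouts, which re does not provide. The function name pcre_style_demo is hypothetical and not part of any library.

```python
import re


def pcre_style_demo():
    """Exercise a few Perl-style regex features using Python's `re`.

    Assumption: `re` is only a stand-in here. It is not PCRE, but it
    accepts the same syntax for the features shown below.
    """
    results = {}

    # Minimal matching: ".*" takes the longest match, ".*?" the shortest.
    results["greedy"] = re.search(r"a.*b", "ababab").group()
    results["minimal"] = re.search(r"a.*?b", "ababab").group()

    # Single-letter character class: \d matches any digit.
    results["digits"] = re.findall(r"\d+", "tel 555 0199")

    # Multiline matching: ^ and $ match at each line only when requested.
    text = "one\ntwo\nthree"
    results["single_line"] = re.findall(r"^\w+$", text)
    results["multi_line"] = re.findall(r"^\w+$", text, re.MULTILINE)

    # Named subpatterns, using the ?P<name> form.
    m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "released 2006-12-19")
    results["year"] = m.group("year")
    results["month"] = m.group("month")

    # Backreference: \1 must repeat whatever the first group matched.
    results["backref_same"] = re.fullmatch(r"(a|b)c\1", "aca") is not None
    results["backref_diff"] = re.fullmatch(r"(a|b)c\1", "acb") is not None

    # Lookahead: require a following tab without consuming it.
    results["lookahead"] = re.search(r"\w+(?=\t)", "key\tvalue").group()

    # Word boundaries: "cat" as a whole word only, not inside "concat".
    results["boundary"] = re.findall(r"\bcat\b", "cat concat cat.")

    # Inline comment: (?#...) is ignored by the matcher.
    results["comment"] = re.search(r"ab(?#ignored text)c", "abc").group()

    return results


if __name__ == "__main__":
    for name, value in pcre_style_demo().items():
        print(f"{name}: {value!r}")
```

Running the script prints one line per feature. A program linked against PCRE itself would pass the same patterns through the library's C or C++ interface instead.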