Original author(s) | Philip Hazel |
---|---|
Stable release | 8.21 / 2011-12-12 |
Written in | C |
Operating system | Cross-platform |
Type | Pattern matching library |
License | BSD |
Website | http://www.pcre.org |
Perl Compatible Regular Expressions (PCRE) is a regular expression C library inspired by Perl's external interface, written by Philip Hazel. PCRE's syntax is much more powerful and flexible than either of the POSIX regular expression flavors and than that of many classic regular expression libraries. The name is misleading, because PCRE is Perl-compatible only if you consider a subset of PCRE's settings and a subset of Perl's regular expression facilities.
The PCRE library is incorporated into a number of prominent open-source programs, such as the Apache HTTP Server and the PHP and R scripting languages. Because it is distributed under the BSD license, it can also be incorporated into proprietary software. As of Perl 5.10, PCRE is also available as a replacement for Perl's default regular expression engine through the re::engine::PCRE module.
The library can be built using configure and make (typical of Unix-like environments), or using CMake on Unix, Windows and other environments. Numerous default settings are selected at build time. In addition to the PCRE library, the distribution includes a POSIX C wrapper, a Google-contributed native C++ wrapper, several test programs, and the utility program pcregrep, all of which are built in tandem with the library. The PCRE library provides matching only; the C++ wrapper, if used, adds multiple-match and replacement functionality.
Unless the "NoRecurse" PCRE build option (aka "--disable-stack-for-recursion") is chosen, adequate stack space must be allocated to PCRE by the calling application or operating system. The amount of stack needed varies from pattern to pattern. For example, completing the tests provided with pcretest requires 8 MB of stack space. While PCRE's documentation cautions that the "NoRecurse" build option makes PCRE slower than the alternative, using it avoids the issue of stack overflows entirely.
PCRE has developed an extensive and in some ways unique feature set. While it originally aimed at feature equivalence with Perl, over time a number of features were first implemented in PCRE and only much later added to Perl. During the PCRE 7.x and Perl 5.9.x (development track) phase, the two projects coordinated development and are, to the extent possible, feature-equivalent. In some cases PCRE has included in mainline releases features that originated with Perl 5.9.x, and in some cases Perl 5.9.x has included features that were previously available only in PCRE.[1]
PCRE includes the following features:
- Any non-alphanumeric character can be given its literal meaning by placing a `\` (backslash) before it; conversely, a backslash before an alphanumeric character typically gives that character a special meaning. Where such a sequence has not been defined as special, it is also treated as a literal; however, this usage is not forward compatible, as new versions of PCRE may give such patterns a special meaning. A good example of this is `\R`, which has no special meaning prior to PCRE 7. In POSIX regular expressions, a backslash sometimes escapes a non-alphanumeric character (e.g. `\.`) and sometimes introduces a special feature (e.g. `\(\)`).
- `\d` matches any digit exactly as `[[:digit:]]` would in POSIX regular expressions.
- `?`
may be placed after any repeat count to indicate that the shortest match should be used. The default is to attempt the longest match first, and backtrack through shorter matches. For example, `a.*?b` would match "ab" in "ababab", where `a.*b` would match the entire string.
- `\p{Ps}.*?\p{Pe}` would match a string beginning with any "opening punctuation" and ending with any "close punctuation", such as "[abc]". Since version 8.10, matching of certain "normal" metacharacters can be driven by Unicode properties when the compile option PCRE_UCP is set. The option can be set for a pattern by including `(*UCP)` at the start of the pattern. The option alters the behavior of the following metacharacters: `\B`, `\b`, `\D`, `\d`, `\S`, `\s`, `\W`, `\w`, and some of the POSIX character classes. For example, the set of characters matched by `\w` (word characters) is expanded to include letters and accented letters as defined by Unicode properties. Such matching is slower than the normal (ASCII-only) non-UCP alternative. Note that the UCP option requires the PCRE library to have been built with UTF-8 and Unicode property support. Support for UTF-16 is under development and is expected to be available in early 2012.
- `^`
and `$` can match at the beginning and end of a string only, or at the start and end of each "line" within the string, depending on what options are set.
- The newline option in effect affects where PCRE detects `^` line beginnings and `$` line ends (in multiline mode), as well as what matches dot (regardless of multiline mode, unless the dotall `(?s)` option is set). It also affects PCRE's matching procedure (since version 7.0): when an unanchored pattern fails to match at the start of a newline sequence, PCRE advances past the entire newline sequence before retrying the match. If the newline option in effect includes CRLF as one of the valid linebreaks, PCRE does not skip the `\n` in a CRLF if the pattern contains specific `\r` or `\n` references (since version 7.3). Since version 8.10, the metacharacter `\N` always matches any character other than linebreak characters. It has the same behavior as `.` when the dotall option `(?s)` is not in effect.
  - `(*LF)`: Newline is a linefeed character. Corresponding linebreaks can be matched with `\n`.
  - `(*CR)`: Newline is a carriage return. Corresponding linebreaks can be matched with `\r`.
  - `(*CRLF)`: Newline/linebreak is a carriage return followed by a linefeed. Corresponding linebreaks can be matched with `\r\n`.
  - `(*ANYCRLF)`: Any of the above encountered in the data will trigger newline processing. Corresponding linebreaks can be matched with `(?>\r\n|[\r\n])` or with `\R`. See below for configuration and options concerning what matches Backslash-R.
  - `(*ANY)`: Any of the above plus special Unicode linebreaks. When not in UTF-8 mode, corresponding linebreaks can be matched with `(?>\r\n|\n|\x0b|\f|\r|\x85)` or `\R`. In UTF-8 mode, two additional characters are recognized as line breaks with `(*ANY)`: LS (line separator, U+2028) and PS (paragraph separator, U+2029). On Windows, in non-Unicode data, some of the `ANY` linebreak characters have other meanings. For example, `\x85` can match a horizontal ellipsis, and if encountered while the `ANY` newline is in effect, it would trigger newline processing. See below for configuration and options concerning what matches Backslash-R.
- PCRE has a default for what matches `\R`. The default can be either to match the linebreaks associated with ANYCRLF or those corresponding to ANY. The default can be overridden when necessary by including `(*BSR_UNICODE)` or `(*BSR_ANYCRLF)` at the start of the pattern. When providing a `(*BSR..)` option, you can also provide a `(*newline)` option, e.g. `(*BSR_UNICODE)(*ANY)rest-of-pattern`. The Backslash-R options can also be changed with external options by the application calling PCRE, both when a pattern is compiled and when it is run.
- Options that can be given at the beginning of a pattern include newline options such as `(*LF)`,
documented above; Backslash-R options such as `(*BSR_ANYCRLF)`, documented above; the Unicode character properties option `(*UCP)`, documented above; and the `(*UTF8)` option, documented as follows: since version 7.9, if your PCRE library has been compiled with UTF-8 support, you can specify the `(*UTF8)` option at the beginning of a pattern instead of setting an external option to invoke UTF-8 mode.
- A subpattern (a parenthesized group such as `(...)`) may be named by including a leading `?P<name>` after the open-paren. Named subpatterns are a feature that PCRE adopted from Python regular expressions. Since PCRE 7.0, named groups can be defined using `(?<name>...)` or `(?'name'...)` as well as `(?P<name>...)`. Named groups can then be referred back to with, for example, `(?P=name)`.
- `(a|b)c\1`
would match "a" or "b" followed by a "c", and then the same character (an "a" or a "b") that matched in the first subpattern.
- `a++bc` will match as many "a"s as possible, and never back up to try one less.
- `/\w+(?=\t)/` matches a word followed by a tab, without including the tab.
- `\b` matches zero-width "word boundaries", similar to `(?<=\W)(?=\w)|(?<=\w)(?=\W)`.
- A comment begins with `(?#` and ends at the next close-paren.
- The pattern `\((a*|(?R))*\)` will match any combination of balanced parentheses and "a"s.
- A pattern can include `(?Cn)`, where n is some number. This will call out to an external, user-defined function through the PCRE API, and can be used to embed arbitrary code in a pattern.

PCRE has the following differences in external behaviour when compared to Perl's regular expressions (as of Perl 5.9.4):
- `"<<!>!>!>><>>!>!>!>" =~ /^(<(?:[^<>]+|(?3)|(?1))*>)()(!>!>!>)$/`
will match in Perl but not in PCRE.
- `"aba" =~ /^(a(b)?)+$/;` will result in `$1` containing 'a' and `$2` containing `undef` in Perl, but in PCRE will result in `$2` containing 'b'.
- PCRE supports neither `(??{...})`
(a callback whose return is evaluated as being part of the pattern) nor the `(?{})` construct, although the latter can be emulated using `(?Cn)`. Recursion control verbs added in the Perl 5.9.x series are also not supported. Support for experimental backtracking control verbs (added in Perl 5.10) is available in PCRE since version 7.3. They are `(*FAIL)`, `(*F)`, `(*PRUNE)`, `(*SKIP)`, `(*THEN)`, `(*COMMIT)`, and `(*ACCEPT)`. Perl's corresponding use of arguments with backtracking control verbs is not generally supported. Note, however, that since version 8.10 PCRE supports the following verbs with a specified argument: `(*MARK:markName)`, `(*SKIP:markName)`, `(*PRUNE:markName)`, and `(*THEN:markName)`.
- In PCRE, `"bbbbXcXaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa" =~ /.X(.+)+X/` will fail to match due to stack overflow, but Perl will match this correctly. Perl uses the heap for recursion and has no hard limit for recursion depth, whereas PCRE has a compile-time hard limit.

With the exception of the above points, PCRE is capable of passing the tests in the Perl 't/op/re_tests' file, one of the main syntax-level regression tests for Perl's regular expression engine.
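
A few of the features listed earlier (minimal matching, back-references, look-ahead assertions and named subpatterns) can be tried out with a short sketch. It uses Python's built-in `re` module purely as a convenient stand-in. That is an assumption of this example: `re` is a separate engine, not PCRE, and is used only because these particular constructs share the same syntax. The sample strings and the group names `year` and `month` are hypothetical, chosen for illustration.

```python
import re

# NOTE: Python's `re` is NOT PCRE. It is used here only because the
# constructs below are written the same way in both engines.

# Minimal ("ungreedy") matching: `?` after a repeat prefers the shortest match.
shortest = re.search(r"a.*?b", "ababab").group()
longest = re.search(r"a.*b", "ababab").group()

# Back-reference: \1 must repeat whatever the first group captured.
backref_ok = re.fullmatch(r"(a|b)c\1", "aca") is not None
backref_bad = re.fullmatch(r"(a|b)c\1", "acb") is not None

# Look-ahead: a tab must follow the word, but it is not part of the match.
word = re.search(r"\w+(?=\t)", "key\tvalue").group()

# Named subpattern, using the Python-style (?P<name>...) syntax.
# "year" and "month" are illustrative names, not anything PCRE defines.
date = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "2011-12-12")

print(shortest, longest)
print(backref_ok, backref_bad)
print(repr(word))
print(date.group("year"), date.group("month"))
```

The authoritative check for any of these patterns is to run them through PCRE itself, for example with the pcretest or pcregrep programs that ship with the distribution.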