User talk:Thore/regex
Additional quantifiers
- x+
- The plus sign indicates that the preceding character must be present at least once. For example, "goo+gle" matches the infinite set of strings google, gooogle, goooogle, etc. "x+" is equivalent to "xx*".
- x?
- The question mark indicates that the preceding character may be present at most once, that is, it is optional. For example, "colou?r" matches color and colour. "x?" is equivalent to "(x|ε)".
- x{2,5}
- Numbers in curly brackets bound the number of occurrences of the preceding character: at least the first number and at most the second. "x{2,5}" is equivalent to "(xx|xxx|xxxx|xxxxx)".
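The three quantifiers above can be tried out in any regex flavour that supports them. The following is a minimal sketch using Python's `re` module; the helper `matches` is a hypothetical name introduced only for this illustration, and other flavours may differ in detail.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True if the whole of text matches pattern (illustrative helper)."""
    return re.fullmatch(pattern, text) is not None


# x+ : the preceding character at least once
plus_results = [matches(r"goo+gle", s) for s in ["gogle", "google", "gooogle"]]

# x? : the preceding character at most once
optional_results = [matches(r"colou?r", s) for s in ["color", "colour", "colouur"]]

# x{2,5} : between two and five occurrences; tested on "x", "xx", ..., "xxxxxx"
range_results = [matches(r"x{2,5}", "x" * n) for n in range(1, 7)]

print(plus_results)      # [False, True, True]
print(optional_results)  # [True, True, False]
print(range_results)     # [False, True, True, True, True, False]
```

Using `re.fullmatch` rather than `re.search` keeps the comparison with the set-of-strings view of a regular expression: a string counts as matched only if the expression describes all of it.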
Character classes
Matching large numbers of possible characters by means of the alternation operator quickly becomes cumbersome. As a shorthand, most regex flavours support a concept of "character classes" or "character properties" to match whole sets of characters with short and mnemonic notation. For example, the decimal digits can be described by the regular expression "(0|1|2|3|4|5|6|7|8|9)", but various flavours support shorthands like "[0-9]", "\d", "[:digit:]" or even "\p{Decimal-Digit-Number}".
- .
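- The dot matches any single character. Equivalent to "(a1|a2|...|ak)", where A = {a1, a2, ..., ak} is the alphabet. In many flavours the dot does not match a newline unless a special mode is set.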
- [aeiou]
- Matches the five lowercase vowels. Equivalent to "(a|e|i|o|u)". More generally, a list of characters in square brackets matches any single character on the list. Entire ranges of characters can be specified with a hyphen: "[a-e]" is the same as "[abcde]". This is well-defined only for some ranges like "[a-f]" or "[0-5]" but not for ranges like "[a-5]" or "[æ-ø]". Lists and ranges can be mixed, as in "[26a-zA-Z]", which matches any upper- or lower-case English letter, a 2, or a 6.
- [^aeiou]
- Matches any single character except the five lowercase vowels. In general, if the first character after the opening bracket is a caret, the expression matches any character other than those in the brackets. Ranges can be used as above, so "[^0-9]" matches anything but a digit. The expression "[^a1a2...ak]" is equivalent to "(b1|b2|...|bm)", where {b1, b2, ..., bm} = A-{a1, a2, ..., ak}.
- \d
- Matches any digit, the same as [0-9]. Similarly, "\w" matches any "word" character (in most flavours letters, digits, and the underscore).
- \D
- Matches anything but a digit, the same as [^0-9]. Similarly, "\W" matches anything not matched by "\w".
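As an illustration of the classes listed above, the following sketch uses Python's `re` module to collect the characters of a short sample string that each class matches. The sample string and variable names are assumptions made for this example only.

```python
import re

# Sample text chosen for illustration: letters, digits, punctuation and a space.
sample = "a2-Z6 e!"

vowels = re.findall(r"[aeiou]", sample)       # characters on the list
not_vowels = re.findall(r"[^aeiou]", sample)  # negated list
mixed = re.findall(r"[26a-zA-Z]", sample)     # list mixed with ranges
digits = re.findall(r"\d", sample)            # shorthand for digits
non_digits = re.findall(r"\D", sample)        # complement of \d

# For this ASCII-only sample, the alternation and the bracket range agree with \d.
by_alternation = re.findall(r"(?:0|1|2|3|4|5|6|7|8|9)", sample)
by_range = re.findall(r"[0-9]", sample)

print(vowels)      # ['a', 'e']
print(not_vowels)  # ['2', '-', 'Z', '6', ' ', '!']
print(mixed)       # ['a', '2', 'Z', '6', 'e']
print(digits)      # ['2', '6']
print(non_digits)  # ['a', '-', 'Z', ' ', 'e', '!']
```

Note that a negated class such as "[^aeiou]" also matches spaces and punctuation, not only consonants.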
Many of these classes are sensitive to the underlying character set or the locale of the environment.
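This sensitivity can be observed in Python, taken here as one example flavour: for text patterns, "\d" and "\w" follow Unicode character properties by default and are restricted to ASCII only when the `re.ASCII` flag is given. The sketch below is illustrative; other flavours make different choices.

```python
import re

arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE, a decimal digit outside ASCII
e_acute = "\u00e9"       # LATIN SMALL LETTER E WITH ACUTE

# Default (Unicode) behaviour for str patterns in Python 3
unicode_digit = re.fullmatch(r"\d", arabic_three) is not None
unicode_word = re.fullmatch(r"\w", e_acute) is not None

# The same patterns restricted to the ASCII character set
ascii_digit = re.fullmatch(r"\d", arabic_three, flags=re.ASCII) is not None
ascii_word = re.fullmatch(r"\w", e_acute, flags=re.ASCII) is not None

print(unicode_digit, ascii_digit)  # True False
print(unicode_word, ascii_word)    # True False
```

In this flavour "\d" is therefore the same as "[0-9]" only when matching is restricted to ASCII.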
Anchors
Anchor | Meaning |
---|---|
^ | Matches only at the beginning of a string (or line, depending on setting) |
$ | Matches only at the end of a string (or line, depending on setting) |
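The "string or line" distinction can be sketched in Python, where the setting in question is the `re.MULTILINE` flag. The sample text and variable names are assumptions for this illustration, and other flavours expose the setting differently.

```python
import re

text = "first line\nsecond line"

# Default: ^ and $ anchor to the beginning and end of the whole string.
start_default = re.findall(r"^\w+", text)
end_default = re.findall(r"\w+$", text)

# With re.MULTILINE they also anchor at the beginning and end of each line.
start_multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)
end_multiline = re.findall(r"\w+$", text, flags=re.MULTILINE)

print(start_default)    # ['first']
print(start_multiline)  # ['first', 'second']
print(end_default)      # ['line']
print(end_multiline)    # ['line', 'line']
```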