Talk:Regular expression

From Wikipedia, the free encyclopedia

This article is within the scope of WikiProject Computer science.
??? This article has not yet received a rating on the assessment scale.
??? This article has not yet received an importance rating on the assessment scale.

Contents

[edit] Missing The Fundamental Algorithm

The regular expression article needs to have a link to or at least include the simple algorithm of converting a regular expression to a Deterministic Finite Automaton because regular expressions are inherently used for input string matching by a computer program.

Very few commonly used regex engines actually use either simulated DFA or full DFA engines. As long as this page is a merge of both the mathematical and jargon uses of the term "regular expression" i dont think the DFA construction algorithm should be included. If/when it is split into two then i think the strict term should include such discussion. Demerphq 12:45, 23 March 2007 (UTC)

[edit] First paragraph

I have added an introductory paragraph before the history section, explaining by means of example the basic regex operations. This used to be buried in the formal language theory section, which makes it opaque to a lay audience. This led to some minor edits in the formal language section.


[edit] Notation and style

The Manual of Style suggests that strings as strings should appear in italics. This convention doesn't make the current page harder to read, so I have edited some passages with that convention in mind. Some questions remain: how should regular expressions be formatted in this article? They currently appear mostly in double quotes (inch-marks), like this:

  1. the regular expression "gr(e|a)y" matches grey and gray.

I think that works well, provided we do not add many expressions that include quotation marks. Alternatives would be to use the same convention as for strings:

  1. the regular expression gr(e|a)y matches grey and gray

Alternatives include bold face, not least for readability

  1. the regular expression gr(e|a)y matches grey and gray

or a fixed width typeface, also for readability and proximity to program code

  1. the regular expression gr(a|e)y matches grey and gray

A alternative solution to the double quotes would be to always present them in slashes, to make it look like perl

  1. /gr(a|e)y/ matches grey and gray

Opinions are welcome.

On a related matter, I have replaced the union operator ∪ with the vertical bar even in the formal language theory section. I can count at least four different symbols for this in the literature (including the vertical bar), and suggest we use the vertical bar to keep the connection with programming tools and reduce the number of symbols in this article. Moreover, the choice of default typeface for the Wikipedia leads to a problem for this page, in that there is no distinction between the lowercase letter “l” and the vertical bar “|”. That makes some of the regexes really hard to read, and could be taken as a point in favour of displaying all regexes in the monospaced typeface. Here is an example of two lower-case Ls separated by a vertical bar: “l|l” in the default typeface, and “l|l” in the default monospace.

The article is quite hard to read because of the problem of “|”, “l” and “I” all looking somewhat similar. I think the change to a monospaced font for regexps would be a very good idea. I know when I read the first example, it took me a few seconds to realise that it wasn’t a lower case “L” that I was looking at… for a moment I thought it was using some new notation I hadn’t seen before. I suppose my only other comment is that the default choice of monospaced font isn’t itself that great, because of the similarity between “1” and “l”, but for an article on regular expressions, the “|” character is a much more relevant issue. — Ajhoughton 10:50, 2 March 2007 (UTC)

[edit] Complexity

Under Perl Regexp it states: ****

"This extra power comes at a cost. The worst-case complexity of matching a string against a Perl regular expression is exponential in the size of the input. That means the regular expression could take an extremely long time to process under just the right conditions of expression and input string, although it rarely happens in practice."

Isn't this also true for POSIX and all flavours using an NFA? --Weinzier 18:12, 12 Jul 2004 (UTC) Ludwig Weinzierl

NFAs can recognize strings in linear time (in the size of the input). I don't know POSIX well enough to comment on that. --Jan Hidders 20:00, 12 Jul 2004 (UTC)
LW is right. The exponential running time is not caused by nonregularity (instead it has to do with greediness and/or the ability to make backreferences). Indeed, the nonregularity does not "come at a cost", but is included pretty much because the engine might as well execute some Perl code while it's there (so you can implement a stack, or save substrings). So it would be more correct (but still misleading) to say it "comes for free". I have removed the whole paragraph. --User:Thore
I don't follow. The greediness and backreferences are what makes it non-regular, (by having them things become non-polynomial, and if we would not have them then it would stay polynomial) so if I understand you correctly you are in fact above confirming the very thing that you claim is not true. What am I missing? --Jan Hidders 20:41, 8 Dec 2004 (UTC)
Jan, let me try to explain this better. The thing that causes nonregularity are the backreferences. They are not specific to Perl. The thing that causes the nonlinear running time is the implementation often referred to as NFA, which is just a "backtracking" algorithm. (Rather than simulating a determinised automaton, which is sometimes called DFA, and always runs in linear time.) That's an inaccuracy in your first paragraph on this page, by the way. "NFA implementations" like Perl's are decidedly non-linear time. Friedl's book tries to explain this in Chapter 6. Let me see if I can write a few paragraphs about this on the main page, and then you can all shoot it down. --User:Thore
Its a mistake to equate NFA and backtracking. Most backtracking engines are not strictly speaking NFA since they implement leftmost-longest behaviour. This is an area where the commonly used terms, and the strict mathematical definitions contradict. To be properly NFA the engine must simulate the behaviour of a DFA matching the same pattern, so for instance 'aaaa'=~/a|aa|aaa|aaaa/ should match 'aaaa' and not 'a'. Of course this is extremely inefficient in a backtracking engine (as it means you must try every alternation), so most backtracing-nfa's dont do this, opting for the more efficient behaviour of matching 'a' instead. Demerphq 12:52, 23 March 2007 (UTC)

[edit] Error in example?

  • character classes, with negation: [a-z0-9^4-6] ('all lowercase letters plus the digits 0,1,2,3,7,8, and 9')

request verification: "[^" requests negation; a "^" not following the first "[" is taken literally; so this is all lowercase letters, all digits, and "^"

Correct. My 'vi' and 'grep' agree with you. I will remove it. --JanHidders

In the examples section of the Regular expressions in Formal Language Theory it states that

(a U b)* is equivalent to (ab*)*

However, the first set contains "b", the second does not.

Indeed, all non-empty strings in the second set start with "a" which is of course not the case in the first set. Fixed now. --Jan Hidders 12:55, 19 Sep 2003 (UTC)

Description and examples of alternation for regexes were uniformly incorrect. Alternation operates on single sub-expressions before and after the bar. If not a class, parenthesis group, or replication expression, only a single character is matched. The main page has been updated to conform. Turmoil 16:29, 14 March 2006 (UTC)

Not sure what planet you're from, but on Earth, /abc|def/ matches abc or def, not abcef and abdef. I reverted your changes --Randal L. Schwartz 20:57, 14 March 2006 (UTC)
This is a case in which mathematical usage and practical usage differ. Turmoil accurately describes mathematical usage. Alternation applies to atoms, so in the absence of a grouping construct the alternation operator applies only to single symbols. However, in practice, Randal L. Schwartz is right. Every matching engine that I can think of behaves as if a string of atoms were implicitly grouped. That is why you need parentheses around the entire alternation construct even if it isn't in the scope of a quantifier. /(abc|def)/ matches "abc" and "def", but /abc|def/ matches not only "abc" and "def" but also "abcdef", "abcef", and "abcdf" due to the ambiguity of the scope of the alternation operator. (If you don't believe me, try it. I just tested this in Perl, egrep, and Tcl to be sure my memory was correct.) Bill 21:24, 14 March 2006 (UTC)
Try that again with anchors. /abc|def/ matches "abcef" because it matches the abc part. Print out the value of $& to see the part that is matching. To get a good test, you have to say /^(abc|def)$/ which will demonstrate that the vertical bar is choosing either abc or def, and not just c or d. It's all about relative precedence... if you read the regex description, they describe it very precisely and predictably... the "implicit" sequence operator (two atoms next to each other) binds tighter than the vertical bar. --Randal L. Schwartz 14:36, 15 March 2006 (UTC)
I did check it with anchors but misled myself. /^abc|def$/ still matches "abcef" etc., which suggested that my explanation above was correct. The substitution test shows you are right. The reason that /^abc|def$/ matches "abcef" etc. is because it is equivalent to (/^abc)|(def$)/. (Personally, I've always thought that relying on detailed operator precedence was nuts, pace the Numerical Recipes folks, and prefer explicit grouping.) My point about mathematical usage versus application stands, though. The usual assumption in mathematical discussion of regular expressions is that you use explicit grouping and don't assume tight binding in sequences. (Incidentally, in a smallness of the world moment, I realize that I once knew your buddy Rich Cower.) Bill 20:50, 15 March 2006 (UTC)

[edit] Programmers' T-shirt pic

hmm... good one.. but Diet Coke is getting some free publicity! --Jay 13:26, 17 Jun 2004 (UTC)

I prefer the tee-shirt imaged at http://www.thinkgeek.com/tshirts/coder/57f0/ . I wonder if there is any way that we can use that image instead? Probably not, unless one of us orders one and snaps a picture while wearing it. --Bevo 14:46, 17 Jun 2004 (UTC)
Must admit, I hate Diet Coke, and this picture is not going to convert me — simply thought it looked very typical; and mimicking the FedEx logo is so obvious. (The Shakespeare quote is also cute.) --Palapala 16:05, 2004 Jun 17 (UTC)
I have no objection to the picturing of diet coke here, however it's hard to find an adequate picture for this article. --Ævar Arnfjörð Bjarmason 22:35, 2004 Jul 12 (UTC)
I think the pic should go altogether, or at least it shouldn't be at the very start. It doesn't add anything useful or essential to the article, but rather distracts attention. This is supposed to be about regular expressions, not geek culture or t-shirts or whatever. --Jonik 13:11, 22 Jan 2005 (UTC)
I agreee. --Edward 19:33, 2005 Jan 22 (UTC)


[edit] T shirt text

I'm sorry to say it so but that "skript kiddie" t-shirt regexp states for "Hack the planet" with some haxx0r variations: h@<</h4c</hax/hak, da/tha/de/ttha/tth@, planet/p1anet/pl4n3t/planett/pl4nettt. also any number of spaces between the words I do not understand, though, the /i part. --Tordek 19:34, 3 Sep 2004 (UTC)


[edit] Be more generic

Don't want to start any arguments, but where the page says "Ever since that time, regular expressions have been widely used in Unix and Unix-like utilities such as: expr, awk, Emacs, vim, lex, and Perl.", shouldn't it really mention vi, not vim. Agree? Celada 00:28, 2004 Nov 19 (UTC)

Agree! --Jan Hidders 16:14, 19 Nov 2004 (UTC)
OK, done! --Celada 03:31, 2004 Nov 20 (UTC)
I agree with this too. (81.11.186.58 23:30, 1 March 2007 (UTC))

[edit] Regex versus regular expression

Porge replaced a number of occurences of the term regex with regular expression. However, this may not be as straightforward as expanding an abbreviation with a longer form.

One of the big problems with this pages is that regular expression means different things to different people. Especially, not all regexes are formal language theory regular expressions. The previous version of this page tried to use the term regular expression only for those constructs that are regular in both senses of the word, using the term regex for irregular patterns. The article also contains a paragraph about this policy, and tried to follow that policy itself.

However, from Porge's edits that may not be a universally appreciated policy. Maybe we can discuss it here. There are certainly many, many ways to resolve the issue. (For example, by separating the pages.) However, I think this article has a lot of potential and actually tries to explain something, so I would be in favour of keeping both meanings under one hood, and retaining the subtle nomenclature. Other opinions are welcome.

(Larry Wall in one of his Perl 6 apocalypses suggests to just use pattern, and get rid of the connection to (Formal Language) regular expressions altogether.) --Arbor 07:25, 20 May 2005 (UTC)

I don't think this is a widely accepted distinction - I don't remember ever seeing it before. --Khendon 19:00, 20 May 2005 (UTC)
Sorry if I've removed a distinction, I was just trying to standardize the word across the article (in various places it had regexp, regex or regular expression). I'm not experienced enough in this area to comment on the formal versus implemented side of things, although I've never heard of the nomenclature being used to protray the difference. --porges 19:52, May 20, 2005 (UTC)
I never claimed the distinction was widely accepted. I introduced it when I tried to clean up the article; the idea is certainly up for debate. The solution had the advantage of not calling overmuch attention to itself (compared to, say, using phrases like non-regular regular expressions) but at the same time surviving a close examination (because nowhere was a regex called regular when it wasn't). No matter how we solve this, I would maintain the position that the present article should aim to minimise confusion, which includes being very precise about not using the term regular expression for patterns that don't conform to the term in all its meanings. (Alternatively, the language-theoretical use needs to be split off, which would make the article less interesting in my opinion.) The nomenclature is bad enough as it is (pointed out and commented on in both Schwartz's book and Wall's writings), and Wikipedia shouldn't add to the confusion. However, other suggestions are more than welcome. (I would prefer to just use pattern for all those regexes that aren't regular expressions.) --Arbor 20:09, 20 May 2005 (UTC)
I'd prefer "pattern" over "regex" for irregular regular expressions. The latter is too ambiguous. :) --porges 09:13, May 21, 2005 (UTC)


[edit] Javascript and escaping minus

Javascripts regex.test () do not match '-' (minus sign) for [-a-z] like, for example, Perl does. Javascript requires that '-' is escaped with [\-a-z]. (Tested in Firefox.)

  • The article says: "The '-' character should be literal only if it is the last or the first character within the brackets: [abc-] or [-abc].".

Suggestion: Note that this does not work in some implementations like Javascript and that '\-' is better (this works in all implementations?) and clearer.

  • The article also says: "Since the characters '(', ')', '[', ']', '.', '*', '?', '+', '^' and '$' are used as special symbols they have to be "escaped" somehow if they are meant literally. This is done by preceding them with '\'".

Suggestion: Mention that '-' should (not must) be escaped as well. It's a small detail, but it would have helped me a few moments ago. :) --Kricke 14:38, 17 june 2005 (CET).

I've just tested this in Firefox and IE, and escaping minus is not necessary. Shinobu 03:46, 9 October 2006 (UTC)

[edit] Alternation character and text appearance

On my computer (Mac running Safari) the alternation character as seen here is nearly indistinguishable from a lower-case L. (The font is Arial, I believe - if your theme matches mine, see for yourself: | l ) In some examples this makes comprehension difficult - for example, I read the first expression in the line

For example, "gr(a|e)y" is the same as "gray|grey"

as "gr(ALE)y", and since there's no direct reference to alternation in the explanation of the example, I had to rely on my knowledge of regexps to figure out that the character was an alternation operator.

One solution to this problem would be to use the TeX equivalent of the character. Another would be to put all regular expression examples in code form, as suggested above:

For example, gr(a|e)y is the same as gray|grey

I think the latter (code-style) is preferable, but there may be larger Wikipedia style issues at play. Does anyone have any comments or suggestions? Either way the problem should be resolved. --Adam Conover 17:23, Jun 20, 2005 (UTC)

I tried to discuss the issue further up. You correctly point out a big problem with this page. --Arbor 18:31, 20 Jun 2005 (UTC)


[edit] Splitting This Topic

Currently this topic is confusing at best. The problem is with the material relating to various programming tools that make use of regular expressions. Regular expressions are primarily a mathematical concept. As such, the most pertinent sections are those labeled "In formal language theory" and "Implementations and running times."

Other sections contain data that is inappropriate for the topic regular expressions. For instance, the "Basic concepts" section discusses the operation "?" which does not exist in regular expressions. Similarly "Traditional Unix regular expressions" mentions operations dealing with placement in a line of text such as "^" and "$" which also have little to do with regular expressions.

I understand that this is good and useful data in the context of regular expressions in programming and in Unix. I do not mean to say "This is wrong, we must delete it!" What I suggest is that there be a topic "Regular Expressions in Programming" and that information be moved there, leaving this topic to focus on the mathematical concept of regular expressions.

Agreed. Personally, I see the current article as three different things, mixed together.
  1. the theoretical CS perspective... eg. DFA's from the perspective of regular expressions.
  2. the practical use as general text-matching tools in compilers, editors, grep, and programming languages.
  3. an enumeration of all the possible libraries, blah blah, that implement regexps. In my mind, the third should be removed, according to m:When should I link externally.
In terms of enumerating some of the syntax features and differences of specific practical implementations... I don't think that should be done on this page. That's been done elsewhere [1], but if needed inside the wiki, it should be on a separate article (though it's terribly complicated IMHO, especially with things like less which can link with a variety of libraries, and with many libraries being slight derivatives of others). --Interiot 22:09, 21 September 2005 (UTC)
Like that edit for example. PCRE links with many applications (the PCRE homepage lists 13 notable examples). We don't need to list them all, especially not on this page. --Interiot 22:35, 21 September 2005 (UTC)
Object. The topic of this article good enough to make this a featured article at some time. It's a wonderful subject, and it's useful. What I don't see this article becoming is a look-up reference for Perl 5 compatible regular expressions or any other variant. That will never work—look at the pages upon pages in Mastering regular expressions needed to cover these many variants. Wikipedia is not a how-to, and it's certainly not a list of API's. But the central issues of Regular Expressions—how they are defined, how they came about, how they are used and in what contexts, how the algorithms work—that's a great topic for an encyclopedia, not really covered anywhere else in a coherent, readable fashion. We need to de-mystify the TCS section and remove some of the enumeration of syntax differences so that this article can have a cleaner focus on the central concepts. --Arbor 08:24, 22 September 2005 (UTC)
Later: Actually, we all seem to more or less agree, don't we? I also want to remove the proliferation of syntaxes, I just don't want to split the article—the practicalities are very important, as long as we don't get bogged down providing detailed syntax comparisons. (The Perl page has its own description of Perl's syntax, and that's exactly how it should be done.) But this page needs to explain ^ and $, and relate them to the formal definition. We need to explain that matching is defined differently in tools and CS (namely, that uvw tool-matches p iff v CS-matches p, or, in other words, w CS-matches p iff ^w$ tool-matches p). I am pretty fond of the explanation of ? and + in the TCS-section, explaining that the formal definition is purposefully parsimonious, but has the same expressive power. Such details are exactly missing from TCS text books and programmers' manuals. --Arbor 08:34, 22 September 2005 (UTC)
I am also in favour of splitting this article. This title should cover only true regular expressions as defined by computer science and mathematics, and contain a link to a seperate article on the software development use of the term, which does not strictly conform to the mathematical notion. By splitting the page we could show the true primitives of regular expressions without being distracted by non-primitives or non-regular constructs. And i disagree with the Arbor regarding Perl Compatible Regular Expressions, which have become more orless the standard syntax for most modern backtracking-nfa regular expression implementations. Indeed the PCRE standard of leftmost-longest matching sematics are more common than the strictly regular longest-token matching semantics, and afaik the former is not emulatable by the latter. Full disclosure: I am and have been working on the Perl regular expression engine for some time. Demerphq 14:37, 19 February 2007 (UTC)

[edit] Removed Regular expressions used when editing Wikipedia

Removed Regular expressions used when editing Wikipedia to Wikipedia:Avoid self-references. --Anon.

[edit] Color-coded ASCII chart

I've added a pointer to an ASCII chart color-coded for the POSIX classes that I made some time ago as part of the documentation for a regular expression tool I wrote as I think that it makes things a lot clearer and also makes precise some of the not so clear terminology (e.g. the difference between "graph" and "print"). Three questions: (a) is this useful? (b) if so, would it be better to incorporate it directly into the article? (c) if so, is there any utility for converting HTML to wiki markup? And is the color-coding even possible? Bill 19:44, 2 January 2006 (UTC)

It's somewhat useful. Yes, it's possible, some amount of HTML and all of CSS are supported, see Image:Mississippi Delta Lobes.jpg for one example, or as a reference m:Help:Table#Color; scope of parameters or m:Help:HTML in wikitext.
But I'm actually starting to think that a comparison table like this would be more helpful, as ugly as tables are, since people keep changing the \( on traditional regular expressions to (, perhaps having more experience with modern regexp patterns and not realizing how much various regexp syntaxes differ. Because I think tables are generally uncyclopedic, I'd prefer to add as few as necessary, and would prefer to add the syntax one instead, but they're probably both useful. --Interiot 20:17, 2 January 2006 (UTC)
Maybe a way of avoiding cluttering up the regexp article would be to create a separate article on character classes (which would require disambiguation as I see that there is already an article with this title that has to do with role-playing games) or POSIX character classes. That would make some sense since character classes have some uses outside of regular expressions.
As for the syntax comparison, the chart you point at is pretty good, but the variation in regexp syntax is so great and there are so many programs that differ in subtle ways that I'm skeptical that a table can really do a good job of it.Bill 06:19, 3 January 2006 (UTC)

[edit] Complement operators and Kleene algebras

In the section titled "In formal language theory", it says the following:

"Sometimes the complement operator ~ is added; ~R denotes the set of all strings over Σ that are not in R. In that case the resulting operators form a Kleene algebra. "

A complement operator is not required in Kleene algebra. Concatenation, choice, and Kleene star operators suffice. I think the above sentence is misleading.

I also would like to question the comment:

"The complement operator is redundant: it can always be expressed by only using the other operators."

Is this so? I have never seen a definition like this. It is generally defined as something like ~A = Σ - A , where '-' represent set minus. However, set minus is not part of the regular expression syntax, so it means the complement operator is "primitive". Or is there something I am missing?

tim 13:54, 25 January 2006 (UTC)

Your first comment is correct. Fix it. As to complementation, the easiest way to see that goes via DFA: switch the accepting and rejecting states: hey presto! Instant DFA for the complement. Arbor 14:02, 25 January 2006 (UTC)
Thanks for you (very fast!) response. The change has been made. I see what you mean about the complement operator. I had misunderstood and thought that it was saying that the complement operator could be defined in terms of other operators, rather than expressions containing complement operators could be rewritten to not use them. tim 14:31, 25 January 2006 (UTC)

[edit] How to negate a whole line?

Hello, I want to negate a whole line starting from a specific caracter, eg.:

"The text to be negated starts here: negated text"

I want to negate all the line starting from the two dots ":".

Is there a way to do this?

Thanks, Ian Liu.

You cannot complement anything other than character sets, because this would make the efficient generation of NFAs from regexps infeasible. It's conceivably possible to complement the entire regexp by complementing the final DFA generated from the NFA (for systems that generate such a DFA), but this isn't usually very useful. Deco 00:29, 27 January 2006 (UTC)

[edit] "In formal language theory" or "Informal language theory"?

Greetings,

What a difference a space character can make...

The section title "In formal language theory" puzzles me. Does it mean:

 In that discipline we call formal language theory

or does it mean:

 Here's some informal language theory

If the former, please lose the initial "In" and just call it "Formal Language Theory". If the latter, please lose the space between "In" and "formal" to make the title "Informal language theory".

In the case of the former, I don't mind using a preposition as the first word, but using "in" in this case is a little too close to the word "informal".

Thanks, David

Hi David. Why not fix it yourself? WP:BOLD Welcome to Wikipedia! Arbor 08:15, 4 March 2006 (UTC)
Hmm, I don't think being bold is the problem here. Anyway, the edit history [2] shows it comes from a heading "Regular Expressions in Formal Language Theory", so I'll remove the "In". Thanks for picking that out, it's good to know where there's confusion. --H2g2bob 03:07, 24 June 2006 (UTC)

[edit] Even/Odd Regular Expression replacement.

The original RE text listed was :
"((a|ba(aa)*b)(b(aa)*b)*(a|ba(aa)*b)|b(aa)*b)*(a|ba(aa)*b)(b(aa)*b)* denotes the set of all strings which contain an even number of bs and an odd number of as. Note that this regular expression is of the form (X Y*X U Y)*X Y* with X = a|ba(aa)*b and Y = b(aa)*b. "
If you trace through this, you'll see that the first part of the RE matches aa(an even number of a's). This is therefore incorrect. This is my replacement:
"(aa|ab(bb)*ba)*(b|ab(bb)*a)[a(bb)*a|(b|a(bb)*ba)(aa|ab(bb)*ba)*(b|ab(bb)*a)]* denotes the set of all strings which contain an even number of as and an odd number of bs."
I had to removed text stating what form it was is, since that form was wrong and does not apply to the correct solution. Also, due to my own original solving of this problem, I have even a's and odd b's(instead of the original odd a's, even b's) but they are interchangable. --MacGyver07 03:15, 9 March 2006 (UTC)

The right square bracket on the end is clearly wrong. I had presumed that a version without that was correct. But as it stands, it's not even a legal regular expression. I've pulled the entire paragraph out until someone can figure out what the real expression is. --Randal L. Schwartz 14:39, 9 March 2006 (UTC)

Ahh, there's a left bracket earlier. Well, that's clearly wrong then, because it's a legal regular expression, but matching any of the characters between that left bracket and right bracket. So it's gonna match a's, b's, left parens, right parens, vertical bars, and stars. I really don't think that's what you want. Again, it should stay removed until it gets fixed. --Randal L. Schwartz 14:46, 9 March 2006 (UTC)
Look, you seriously need to stop reverting edits on things you clearly can't understand.
[a(bb)*a|(b|a(bb)*ba)(aa|ab(bb)*ba)*(b|ab(bb)*a)]*
The asterisk clearly shows that it can match nothing, and if you trace through you'll see it won't match just a, but only even a's(which is the whole point of the RE).
|'s are " or's " and *'s mean zero, one, or more... this is unaffected by the bracket as it does not change meaning, but put there for clarity.... I am really at the end of my patience for this, I fixed a CLEARLY wrong RE with the right one and you have reverted the correct changes twice now with completely incorrect reasons.... please learn how Regular Expressions work before saying they are wrong.--MacGyver07 23:31, 9 March 2006 (UTC)
Ok... upon relooking at the article, I understand where you got off base... in UNIX regular expressions, you are correct in that brackets denote character classes.... however the ones listed were in a section talking about FA's and RE's in a general sense, and in that case the RE is seen as a mathematical expression rather than a functional language.... in other words the brackets are the same as a parentheseis(this is why my original algorithm listed '|'s as +'s(whereas +'s are one or many in functional RE's), In any case, I changed the brackets to parenthesis to better fit the other RE, but I still say that brackets denoting character classes is specific to unix RE's(an implementation) and doesnt affect them in the "examples" section...AFAIK the expressions given in the example area were meant to be general RE's and not language specific which is why I used brackets... however I as I said they've been removed and no further editing or reverting of the RE should be needed... please trace through and post here before reverting if you cant understand this.--MacGyver07 23:42, 9 March 2006 (UTC)

[edit] This doesn't seem right

"Sometimes the complement operator ~ is added; ~R denotes the set of all strings over Σ that are not in R. The complement operator is redundant: it can always be expressed by only using the other operators."

How? --Ihope127 02:31, 24 June 2006 (UTC)
It's true, but completely non-obvious. First, you translate the regexp to an NFA. Then, translate the NFA to a DFA using the powerset construction. Then, complement the set of final states. Then, translate back to a regexp. The resulting regexp will recognize the complement language, but may be an exponentially larger string. Deco 17:38, 26 June 2006 (UTC)

[edit] Missing information about locale

The section about Unicode should be preceded with something about locale and collating. Those issues rise before introducing unicode charachters.


[edit] search and replace

Many text editors have a "search and replace" that allows regexes in the "search" part.

This article emphasizes "searching", but only hints at the possibility of using regexes in the "replace" part.

For example, "\0" represents the entire thing matched, "\1", "\2", "\3", etc. represent various parts of the thing matched.

If the string I type in the "replace" part -- for example,

 <em>\0</em>

(intended to add literal "<em>" and "</em>" around what the regex found, whatever it may be) is not technically a regex, then what is it called? If this is not the appropriate encyclopedia article to discuss the "replace" part, which encyclopedia article is? --70.189.73.224 15:35, 17 September 2006 (UTC)

[edit] Google Analytics and RegEx

Google Analytis uses RegEx now and so the need to understand them is greater than before. But, all the examples here are tech-heavy Perl (or as one guy said on the Web Analytics forum, "To me, Perl is something that comes in an oyster." And that was someone who was writing his own RSS application...) Anyway, we need examples that are more relevant to that market. Rsteif 22:09, 23 September 2006 (UTC)

[edit] Mailing Lists

I had added 3 links to RegEx mailing lists and they were removed. Is there a reason for this?

Maybe because Wikipedia is not a list of links? There could be lots of other reasons. Arbor 17:35, 28 September 2006 (UTC)
While I was here, I removed more than half of the links in that section, arguably somewhat randomly. Editors should put back whatever they need is necessary for this article, but let's remember that this is not a repository of external references. Arbor 17:40, 28 September 2006 (UTC)
But looking at the links left shows that they are geared towards resources. Cheat sheets, learning sites, etc. A mailing list definitely falls under this definition and as there are almost none, giving people the ability to know where to go to ask questions is of great importance, at least in my thinking. I can understand not having a link to regexbuddy, but a learning resource is a learning resource.

If I'm wrong in this, please tell me the difference between a mailing list and a site with a tutorial.

A mailing list has no encyclopedic content, a site with a tutorial (hopefully) does. Martin 18:14, 28 September 2006 (UTC)
What do you mean by 'encyclopedic content'? The mailing lists have archives of the questions and answers if that's an issue.
If there is a well-organised mailing list archive somewhere that presents such answers in a coherent, comprehensive, and accessible manner, we could use it just as well as any of the other tutorials. But that's no longer a mailing list then. In any case, one such link, to one external site is quite enough. (Also, sign your contributions using four tildes—that will produce a signature like mine or Martin's. And why not get an account?) Arbor 19:00, 28 September 2006 (UTC)
So if the list owners did a 'best of' or compilation of the questions and answers on the list, it would be something that could be linked in without issue? Also, how do you determine which resource to link to? Jakarta has 2 different lists. House of Fusion has 1. There's even a yahoo groups list at http://tech.groups.yahoo.com/group/regex/ and a google group at http://groups.google.com/group/regex. Which 'one' gets listed? Mysticrpg 19:26, 28 September 2006 (UTC)
I would say none. That's the easiest way to solve this problem. As we said, this is an encyclopaedia, the idea is that important information is here. If a hypothetical book/site/list/whatever contained useful information, the idea is to incorporate it into this very article, not link to it. If the book/site/list/whatever does not contain useful information, it needn't be linked anyway. Arbor 20:12, 28 September 2006 (UTC)
Your argument means that there should be no external links at all as the data in the external link should be in the entry anyway. There's a problem with that logic. And where should people ask RegEx questions? Should there be a Q&A box at the bottom of the entry? Pointing people to where they can get more information is exactly what an encyclopaedia is for. Give the info and if the info is not present, point to where it can be found. Mysticrpg 20:28, 28 September 2006 (UTC)
DMOZ is designed to point people to the most useful websites for a specific topic. Wikipedia is not. We link to DMOZ, that should be good enough. --Interiot 20:52, 28 September 2006 (UTC)

OK, I added a single mailing list to the external links section so people know where to go to ask RegEx questions. Good enough? Mysticrpg 20:44, 28 September 2006 (UTC) OK, so the single link I added was taken down again. I quit. This is exactly why people hate wikipedia. Overediting and control with valid resources being rejected with a blanket statement of "not a link repository". So take down all the links then. Over controled BS. Enjoy your control. Mysticrpg 03:15, 29 September 2006 (UTC)

I think that the above overseers should've pointed to the rules instead of their private opinions: Wikipedia:External_links#Links_normally_to_be_avoided. It clearly states:
10. Links to social networking sites (such as MySpace), discussion forums or USENET.
Such a reference is much more descriptive (since rules are written much clearer than their interpretations) and a user (including me) feels far fairer after realizing he's violated a rule, not fallen a victim to arbitrarity. — Vano 01:08, 1 December 2006 (UTC)

[edit] External links that were removed en-masse

Someone in the past removed a bunch of link - here is the whole list from that awful day, so people can re-add the ones they find useful. -- Wriggleybum 08:27, 30 September 2006 (UTC)

That was me; I announced it here as well, see above. By the way, most WP editors should be able to use the "history" tab to find and bring back whatever they find necessary, so in general there is no need to "resurrect" material by copying it to the talk page. Arbor 09:15, 30 September 2006 (UTC)
A good way to point to an edit that you find problematic is to include a link to the diff, like this: [3]. That shows what I removed and when, including my edit summary. Having done that, I will remove the section of links you added to this talk page. Arbor 09:18, 30 September 2006 (UTC)
Much appreciated. I've been trying to keep it sparse. For anyone who thinks this is a bad idea, please review WP:NOT. -Harmil 19:06, 2 October 2006 (UTC)
Very Bad Idea. Don't have to think it. A link to site that was up for barely 24 hours generated a very large volume of traffic from this article. Strongly suggests this page is failing to provide answers for people coming here. WP NOT, is fine and good, but I believe you'll also find that WP is not about useless information for no purpose. Earnestly suggest you reconsider your pruning efforts or do a better job in the article. dmoz is also fine and good, but why are you making it hard for people? --Gavferr 04:14, 14 October 2006 (UTC)
More likely it was the multitude of editors who watchlist this article following the link for validation. — EncMstr 04:24, 14 October 2006 (UTC)
Gavferr, based on your comments, I take it that you are the maintainer of one of the sites that had been linked to from this article? If so, please do review WP:NOT, and if you decide that you have a justification for a particular external link, let us know. "a link ... generated a very large volume of traffic," doesn't actually address the concern. -Harmil 04:26, 14 October 2006 (UTC)
Personally I could care less if you link or don't. I derive zero benefit either way. Do keep in mind that those links keep growing. Ask yourselves why? People aren't finding what they want here, they find it somewhere else, and try and help by putting a link up. Then think of what fraction of people who would even *bother* doing the editing. QED. You can either learn from this or not. It's your call as the keepers of the page, not mine. The goal of WP is not to freeze articles in stasis but improve them. This one is frozen and has plenty of proof that it is failing. I care enough about WP succeeding to try and help by alerting you to the issue, but not enough to waste time arguing with pedants. Apologies if the truth offends. --Gavferr 09:04, 15 October 2006 (UTC)

I noticed someone added the Regex Coach to the external links; although I agree with 99% of the removed links, this one should stay; I've been tempted to add it myself. It is a stand-alone (not web-based) program, and it has literally saved me hours and hours in debugging complex REs. It has some nice "stepping" tools to let you watch what the parsing is doing. MeekMark 14:24, 5 October 2006 (UTC)

The problem with that sort of logic is that it's an excellent reason to keep a link on a Web directory service, which Wikipedia is not. Links in an article should further inform the reader, and allow them to perform further research on their own. If there's a link that you think regular expression users should know about, I suggest strongly that you submit it at DMOZ. -Harmil 06:11, 6 October 2006 (UTC)

[edit] Javascript?

Why is Javascript not mentioned in the article? I think to most users it is a much more familiar and accessible example of a language supporting regexen than the relatively speaking more obscure tcl. Shinobu 03:49, 9 October 2006 (UTC)

Oh, I agree. And I think that lots of people are trying to learn RegExen now that Google Analytics uses them. This is the same reason I think that the the RegEx Coach should be an external link (and let's face it, nobody goes to DMOZ except to get a directory listing to help their SEO.) We really need a whole section on RegEx for non-geeks. Rsteif 20:32, 12 October 2006 (UTC)

[edit] How to describe alternation?

Sorry I didn't ask before changing this sentence:

For example, "gray|grey" matches gray or grey, which can commonly be shortened to "gr(a|e)y".

to this sentence:

For example, "gray|grey", which is commonly shortened to the equivalent "gr(a|e)y", can match gray or grey.

I changed it because the original doesn't seem to me to say what it's supposed to say. That is, the original seems to say that it is "gray or grey" which can be shortened, not "gray|grey". While maybe my rewrite isn't best possible, surely some rewording is necessary to make it apparent what's being shortened? JadeNB 21:25, 10 October 2006 (UTC)

It looks good to me, although I prefer the more explicit:

For example, the regular expression "gray|grey", which is commonly shortened to the equivalent "gr(a|e)y", can match either "gray" or "grey".

Deco 23:26, 10 October 2006 (UTC)
Oops, of course you're right that I'm missing some quotation marks there. I'm not clear on the etiquette here -- if another user changed my changes, is it now reasonable for me to change them back after asking here, or am I supposed to address him or her specifically?
JadeNB 00:01, 11 October 2006 (UTC)
That person should be watching this page, and evidently prefers the original wording. You could give them a few hours to read this so that you don't end up in a revert war. However, I much prefer your wording. Deco 00:15, 11 October 2006 (UTC)
I too, like the new one better (with the quotes). Shinobu 09:37, 11 October 2006 (UTC)
This particular example could also (and quite possibly more often?) be expressed by the character class as "gr[ae]y". Perhaps anothe canonical example of the altnernation might be in order? Something that can "only" be expressed as an alternation? (Such as "jail|gaol" which would also be expressable as "(gao|jai)l".) QTJ 19:57, 11 October 2006 (UTC)


[edit] McCullough and Pitts

I have no idea what McCullough and Pitts have to do with regular expressions. I am removing the statement making that reference. If someone can dig up a reference that McCullough and Pitts had anything to do with the early history of regular expressions or finite-state languages, feel free to revert. Pexatus 01:06, 21 October 2006 (UTC)

[edit] "Regular Expressions Example (Visual Basic .NET)"

A new section entitled "Regular Expressions Example (Visual Basic .NET)" does not seem at all appropriate to me, as the example does not seem at all useful, and is more about visual basic than regular expressions. It also contains misleading statements, such as "Regular Expressions have been fully integrated into Visual Studio", I think it means integrated into the .NET framework, and "In particular, Visual Basic .NET", what bout c# etc. where regex implementation is pretty much identical. Martin 09:05, 23 October 2006 (UTC)

I agree with all that. Removed the section, and scattered the text into #Regular expression and the .NET framework and #Uses of regular expressions. Both could do with more work though. --h2g2bob 11:48, 23 October 2006 (UTC)
I've added an {{expert}} tag to the .net section. I have no knowledge of "lookaround", which looks to be a key concept in .net regexps. Plus, I have no particular desire to find out. --h2g2bob 12:15, 23 October 2006 (UTC)
"lookaround" is jargon for "positive/negative lookahead/behind assertion" --Demerphq 15:59, 23 December 2006 (UTC)

[edit] searchmonkey

Someone added searchmonkey to the "See also" section. This makes about as much sense as adding Ford F-100 to Fuel injection "See also" section. It certainly makes sense to disucss (in prose, with as much context as possible) any product which has made notable contributions to the real-world use of regular expressions, but the "See also" section isn't the place to do that. That section should be reserved for directly related or very similar topics that are not otherwise covered in the body of the article. -Harmil 21:25, 9 February 2007 (UTC)

[edit] remove regexlib.com

regexlib.com is totally wrong here, they have a crappy design and no good content, why is it still listed?

[edit] Perhaps a section on clarifying the asterisk is in order?

Speaking as a reader of this article I was quickly baffled in the very first section by the following:

"The asterisk indicates there are 0, 1 or any number of the previous expression. For example, "go*gle" matches ggle, gogle, google, gooogle, etc."

I had to do a double take because I was under the impression that the asterisk was used as a wildcard in regular expressions. Some quick googling led me to discover that I was mistaken. Through my googling though I quickly came to the conclusion that this mistaken impression is probably very common. Perhaps a section talking about common misconceptions like this might be in order? —The preceding unsigned comment was added by 64.126.82.205 (talk) 08:11, 14 February 2007 (UTC).

The * as a wildcard is non-Unix, it was DOS (maybe CP/M that was before my time). — RevRagnarok Talk Contrib 11:48, 14 February 2007 (UTC)
I don't know about CP/M, but the DOS implementation was flawed. The asterisk could only appear at the end of a field. That is, FILE*.* worked as expected, however *FILE.C would be interpreted as *.C. It seems the asterisk was expanded into the single-character wild card ? to fill out the field. As these were 8.3 filenames, FILE*.* was transformed into FILE????.??? while *FILE.C became ????????.C for matching purposes. —EncMstr 16:20, 14 February 2007 (UTC)
* wildcarding is very old, but Unix, DOS, VMS and various other systems supported this kind of wildcarding. Wildcarding is a subset of the capability of regular expressions, but is not directly related. The clarification should probably be made if it's not already between the two. Modern Unix wildcarding is VERY sophisticated, but the fundamental difference is that wildcarding expresses missing segments, where regular expressions express sequences. Thus .* in a regular expression is functionally equivalent to * in a wildcard, but you don't have to describe what it is that comes between in a wildcard (and cannot). Alternations are sometimes possible with {a,b,c} in wildcards, but not everything supports this, and it's limited in its capabilities (e.g. you cannot specifiy the repetition of an alternation, as you can with (a|b|c)*). -Harmil 17:49, 14 February 2007 (UTC)

[edit] Tone of the Formal Language Theory section

This section reads like a textbook. Further, it seems like it could be condensed quite a bit to make it more relevant to regular expressions (formal language theory does have its own article, after all). -- Super Aardvark 01:09, 31 March 2007 (UTC)

[edit] Suggest removing section "Uses of regular expressions"

The section "Uses of regular expressions" is not very useful and feel out of place. Regexes has many uses other than syntax highlighting. Let's just link to Regular expression examples instead. --Apoc2400 06:10, 2 April 2007 (UTC)