Wikipedia talk:AutoWikiBrowser/Typos

Reliable sources

Is dictionary.com a reliable source?--Andeh 06:04, 11 August 2006 (UTC)

Nope. See here. alphaChimp laudare 06:19, 11 August 2006 (UTC)
OK, what about the dictionary in Microsoft Word 2000 or later?--Andeh 06:25, 11 August 2006 (UTC)

This looks like a good source for misspellings: http://www.misspelled.com/common/a.htm --BillFlis 10:45, 27 August 2006 (UTC)

Poss reconsider

Bizarre as in Some Bizarre Records. Rich Farmbrough 22:53 11 August 2006 (GMT).

Attempt

If a fix for attemp is desired, "\b(A|a)tt?em(p|t)(|ed|ing|s)\b" --> "$1ttempt$3" seems to work for all cases. I don't think it matches any real words.—Mrkwcz 17:23, 12 August 2006 (UTC)
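One way to check the claim above is to run the pattern over a handful of variants. The sketch below uses Python's `re` module as a stand-in for AWB's own regex engine (an assumption; a pattern this simple should behave the same way) and writes `$1`/`$3` as `\1`/`\3`. The sample words are illustrative and are not taken from the typo list.

```python
import re

# Pattern and replacement from the comment above; Python writes $1 as \1.
FIND = r"\b(A|a)tt?em(p|t)(|ed|ing|s)\b"
REPLACE = r"\1ttempt\3"

def fix_attempt(text):
    """Apply the proposed 'attempt' rule to a piece of text."""
    return re.sub(FIND, REPLACE, text)

# Misspellings the rule is meant to catch.
for typo in ["attemp", "Atemt", "attemped", "attemping", "attemps"]:
    print(typo, "->", fix_attempt(typo))

# Correct spellings should pass through untouched.
for word in ["attempt", "Attempted", "attempts"]:
    assert fix_attempt(word) == word
```

The correct forms are skipped because, after "attemp", the next letter is "t", which none of the listed endings can consume before the closing `\b`.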

Opposites

What about alternative beginnings to words, as in opposites like accessible and inaccessible? Instead of having two separate entries to check and maintain, we could easily just have one:

<Typo word="Accessible/Inaccessible" find="\b(A|a|Ina|ina)ccessab(le|ility)\b" replace="$1ccessib$2" />

This would simply require a rule that opposites (starting with in-, un-, etc) should not be placed alphabetically, but placed with their root word, and in many cases in the same regex.

Another strategy would be a rule that any word covered like this outside its normal alphabetical order should have a comment line placed in the alphabetical list where it would have gone.

Euchiasmus 12:55, 20 August 2006 (UTC)

Sounds like a great idea; reducing duplication is always good. Thanks. Martin 18:58, 20 August 2006 (UTC)
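As a rough check of the combined Accessible/Inaccessible entry proposed in this thread, the sketch below applies the same find and replace strings with Python's `re` module (a stand-in for AWB's engine, which is an assumption here). The helper name `fix_accessible` is made up for the example.

```python
import re

# Find/replace strings from the combined entry; $1 and $2 become \1 and \2.
FIND = r"\b(A|a|Ina|ina)ccessab(le|ility)\b"
REPLACE = r"\1ccessib\2"

def fix_accessible(text):
    """Correct '-ccessab-' to '-ccessib-' for both the root word and its opposite."""
    return re.sub(FIND, REPLACE, text)

for typo in ["accessable", "Inaccessable", "inaccessability", "Accessability"]:
    print(typo, "->", fix_accessible(typo))
```

A single capture group carries the prefix through, so one entry covers both words without a second regex to maintain.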

Victuals and eke

I removed these new additions:

<Typo word="Victuals" find="\b(V|v)ittles\b" replace="$1ictuals" />

<Typo word="Eke" find="\b(E|e)e(ke|ked|kes|king)\b" replace="$1$2" />

<Typo word="Eke" find="\b(E|e)e(k)\b" replace="$1ke" />

<Typo word="Ekes" find="\b(E|e)eks\b" replace="$1kes" />

"Vittles" is so old a misspelling that it's kind of its own word now, not to mention the cat food Tender Vittles, etc. (see the Google search).

"Eek" is a really common onomatopoeia for screaming, among other things. There are a lot of false positives on this Google search, and the words on the list should have 0 false positives. "Eeks" seems the same (lots of legit uses), there seem to be two legit uses of "eeke", and there are only 4 mainspace results for "eeked", 2 for "eeking", and none for "eekes". --Galaxiaad 18:34, 25 August 2006 (UTC)

Ah, sorry. I even happen to know that Eek is a town in Alaska (you can't get there from here, or even from there—Google Maps fails!). But has "eek"-the-onomatopoeia been verbed? "The scream queen eeked out a living"? BTW, I've just made probably a few hundred changes to the list—I gather that you're genuinely interested, so you might want to take a gander.--BillFlis 23:36, 25 August 2006 (UTC)
Yeah, all the changes are impressive and a bit overwhelming. I definitely want to look though. I didn't mean to sound harsh in my previous comment; sometimes it's hard for me to sound human instead of just stating facts, heh. Hm, doesn't look like it's been verbed, but there is the plural in "Eeks and Squeaks". (The instances of "eeked" actually were typos for "eked" but there were only 4, which isn't enough to merit inclusion.) Hey, I'm just wondering and you'd probably know: what does the word="whatever" bit actually do? --Galaxiaad 13:58, 26 August 2006 (UTC)
Actually, I thought your points were well-taken. I figure the word="whatever" is just informational. My understanding of the AWB is that it's not a bot; it just helps someone make the same kind of edit over and over very quickly. I have a question about how it uses this typo list: I've noticed that some of the rules here have sort of the opposite of a false positive; that is, the correct spelling will trigger a change, back to the correct spelling. There's no harm done, but isn't this inefficient? Should I be stamping these cases out?--BillFlis 14:27, 26 August 2006 (UTC)
The word property means the entries can be sorted in proper alphabetical order (sorting by the typo itself was very difficult to deal with, as duplicates were not adjacent to each other). It also allows easy location of a specific word, which will hopefully avoid future duplication, and whose absence probably explains the enormous amount of duplication that previously existed. Matching the correct spelling is much more efficient than having 2 separate regexes (which is how it used to be) but not as efficient as having a single regex that manages to avoid the correct spelling, so yes, avoid them when possible, but if it is becoming complicated then it doesn't really matter. And thanks for all the work you have done on this! Martin 14:48, 26 August 2006 (UTC)

Airbourne

I got a false positive running AWB when Airbourne was changed to Airborne. Should this be removed? — Loudsox 16:46, 27 August 2006 (UTC)

I think it should be removed, or maybe changed. I think a more likely misspelling is airborn. What would be really nice is some way to tag a word within the encyclopedia as a deliberate misspelling, like adding "[sic]".--BillFlis 17:34, 27 August 2006 (UTC)

Regexes that match the correct spelling

Sometimes a regex, in providing matches for a variety of possible misspellings, matches the correct spelling. As best I can tell, AWB stops on an article when the regex matches the correct spelling and therefore makes no change.

Example: for "Apparel", the regex

(A|a)pp?arr?e(l|ls|ling|lling|led|lled) 

corrects "Aparrel", "Aparel", and "Apparrel". Unfortunately, those alternatives allow "Apparel" to match, so AWB stops on "Apparel" but shows no diff. Example article: Jones Apparel Group.

So, 1) is what I'm saying true; 2) is there a preference against such regexes; 3) is there a way to fix the regex (while keeping only one regex) to avoid this? (And/or, can AWB be programmed to realize that a null edit has occurred?) Thanks, –Outriggr § 01:23, 16 September 2006 (UTC)

Well, I just played with the "Skip article when no change made" setting (which I could swear was on by default, or that I have always had it on), and I see that AWB no longer stops in the above case. Not such an issue then? –Outriggr § 01:31, 16 September 2006 (UTC)
I've been told that the regex does make the "change" (to the same correct spelling, thus useless work) and is thus wasteful of resources. Have I been given some bad info? I've been trying to stamp out such cases, but maybe the program is smart enough to recognize (i.e., it checks) whether any real change is made, and I'm the one doing the useless work!--BillFlis 20:15, 16 September 2006 (UTC)
The program is smart enough to know if a change was actually made, but it is slightly preferable not to match the correct spelling, though not critical. I suppose it might be more critical in the future if some other software wanted to make use of this list though. Martin 09:31, 17 September 2006 (UTC)
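The null edit discussed in this thread is easy to reproduce outside AWB. The sketch below uses Python's `re` module as a stand-in for AWB's engine. The find pattern is the one quoted above; the replacement string is not given in the thread, so `REPLACE` is an assumption, and `GUARDED` is a hypothetical variant that uses a negative lookahead to skip text already starting with the correct stem.

```python
import re

# Find pattern as quoted above; the replacement is assumed for this sketch.
FIND = r"(A|a)pp?arr?e(l|ls|ling|lling|led|lled)"
REPLACE = r"\1ppare\2"

# Hypothetical guarded variant: refuse to match where the text already
# begins with the correctly spelled stem "Appare"/"appare".
GUARDED = r"(?![Aa]ppare)" + FIND

def apply_rule(pattern, text):
    """Return (new_text, number_of_matches) for one typo rule."""
    return re.subn(pattern, REPLACE, text)

title = "Jones Apparel Group"
print(apply_rule(FIND, title))     # matches once but leaves the text unchanged
print(apply_rule(GUARDED, title))  # no match at all
print(apply_rule(GUARDED, "Aparrel and Apparrel"))
```

Comparing the match count with the before/after text is one way a tool can tell a real change from a null edit, which is roughly what the "skip when no change made" setting is described as doing.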

Suggestion of a change

How about "alot" to "a lot"? But I am not sure how to program it.--Esprit15d 17:50, 27 September 2006 (UTC)

But it might be "allot".--BillFlis 19:46, 27 September 2006 (UTC)

I suppose:

<Typo word="Alot" find="\b(A|a)lot\b" replace="$1 lot" />

<Typo word="Allot" find="\b(A|a)llot\b" replace="$1 lot" />

Reedy Boy 17:10, 16 October 2006 (UTC)

Upon doing it manually with AWB find and replace, the words allotment and ballots came up, causing a problem with the search on Allot.

Would running those like that ensure that only that word is used? Or would it include words that include alot/allot?

Reedy Boy 17:11, 16 October 2006 (UTC)


Seems some people use allot instead of allocate...?

Reedy Boy 17:14, 16 October 2006 (UTC)

Reject. Allot comes from the sense of "assigning by lot" and therefore implies random allocation. Allotment has a specific political meaning of "to select by random selection" (aka "jury" selection and "sortition"). Allocation does not have any sense of chance; e.g. to allocate a person to a jury rather than allot them would imply they were chosen rather than selected at random (which would dramatically change their nature). The two words are very different, and in my view replacing "allot" with "a lot" was just vandalism. --Mike 16:10, 18 October 2006 (UTC)

I think what you intended was:

<Typo word="Alot" find="\b(A|a)lot\b" replace="$1 lot" />
<Typo word="Allot" find="\b(A|a)lot\b" replace="$1llot" />

Then you'd have to run AWB manually (isn't this always how it's run?), and decide which rule to accept: alot --> a lot or alot --> allot. Yes, allot means allocate, as "within the allotted time". This would be safe to add, I think:

<Typo word="Allot_" find="\b(A|a)lot(ted|ting|ments?|tees?)\b" replace="$1llot$2" />

where we add the low-line character (_) to signal that only certain endings are being treated.--BillFlis 17:22, 16 October 2006 (UTC)
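The question of whether `\b` keeps these rules away from words such as "allotment" and "ballots" can be checked directly. The sketch below runs the "Alot" rule and the "Allot_" endings rule from this thread through Python's `re` module (a stand-in for AWB's engine, which is an assumption); the helper `apply_rule` and the sample phrases are made up for the example.

```python
import re

# (find, replace) pairs taken from the entries discussed above.
ALOT = (r"\b(A|a)lot\b", r"\1 lot")
ALLOT_ENDINGS = (r"\b(A|a)lot(ted|ting|ments?|tees?)\b", r"\1llot\2")

def apply_rule(rule, text):
    """Apply one (find, replace) typo rule to the text."""
    find, replace = rule
    return re.sub(find, replace, text)

# The word boundaries stop the rule firing inside longer words.
print(apply_rule(ALOT, "alot of ballots for the allotment"))

# The endings rule only touches the listed inflections.
print(apply_rule(ALLOT_ENDINGS, "within the alotted time"))
print(apply_rule(ALLOT_ENDINGS, "alot"))
```

Whether a bare "alot" should become "a lot" or "allot" still needs a human decision, which is why only the endings rule is suggested as safe.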

Reconsider

Rich Farmbrough, 19:33 3 October 2006 (GMT).

    • I'm a bit concerned that people—both those who use AWB, and those who see bad edits—forget that this system is semi-automated. In conjunction with the fact that the AWB user is reviewing his edits, I don't see why it is necessary to get rid of a spelling correction rule even if there are very rare exceptions to that rule. I managed not to "correct" Garry Tallent (in another article) once. I'm not pressing for the removal of the spelling error "tallent". –Outriggr § 00:27, 4 October 2006 (UTC)
      • Simply because the stated aim is to have no false positives. "The lofty goal of RETF is to be completely automatic." It is a courtesy to the creator to report problems here. Rich Farmbrough, 21:58 7 October 2006 (GMT).

Two questions

  1. Is "first-hand" really bad? dictionary.com
  2. Comunal->Communal breaks Estadio Comunal de Aixovall, do we care?

Rich Farmbrough, 21:58 7 October 2006 (GMT).

Also, "first hand" can occur together. "I won the first hand."--BillFlis 12:01, 8 October 2006 (UTC)
Actually, "first hand" occurs in Canasta.
Each player is dealt a hand of 11 and a second hand of 13, sometimes referred to as the "hand" and the "foot", respectively. The hand with the lowest bottom card is played first. Once a player plays all cards from his first hand he picks up the second and continues normal play.
It has caused a false positive. Punainen Nörtti 18:15, 25 October 2006 (UTC)

Countries

I've added entries to convert names of countries to Title Case. My process was:

  • copy list of countries from List of countries
  • process to remove text in () or []
  • process "See * for *" lines
  • change lines with "1, 2" into "2 1" (eg "Congo, Republic of")
  • manually inspect and make special changes (eg Taiwan)
  • add to AutoWikiBrowser/Typos and test
  • remove duplicates that had already been put onto the list
  • remove country names that are also words that can be in lowercase (chad, guinea, jersey)

I guess that many of the lines could be manually tweaked to give greater coverage of variants - but this is a start, anyway...

Hope this doesn't generate too many erroneous matches that I haven't thought of...

Euchiasmus 07:40, 8 October 2006 (UTC)

"wale(s)" and "coco(s)" have uncapitalized meanings in http://www.m-w.com. "chile" is a valid spelling of "chili" (capsicum). "india" (occasionally before "ink" and "rubber") isn't always capitalized.--BillFlis 11:54, 8 October 2006 (UTC)

Thanks, Bill - I've removed those. I also realised about turkey and took that out too. Euchiasmus 19:51, 9 October 2006 (UTC)

Because this is an issue of capitalisation rather than spelling, I suggest that these entries are placed in a separate section rather than being distributed into the A, B, C, sections. Gaius Cornelius 13:21, 6 November 2006 (UTC)

Full stops, commas, colons, brackets and double spaces

I have felt that the following mistakes are too common (especially in stubs) to ignore:

  • c denotes any alphanumeric character
  • s denotes a space character

Mistake   Correction   Suggested code
c.c       c.sc         <Typofind="\b(a-zA-Z).(a-zA-Z)\b" replace="$1. $2" />
cs.c      c.sc         <Typofind="\b(a-zA-Z) .(a-zA-Z)\b" replace="$1. $2" />
cs.sc     c.sc         <Typofind="\b(a-zA-Z) . (a-zA-Z)\b" replace="$1. $2" />
c,c       c,sc         <Typofind="\b(a-zA-Z),(a-zA-Z)\b" replace="$1, $2" />
cs,c      c,sc         <Typofind="\b(a-zA-Z) ,(a-zA-Z)\b" replace="$1, $2" />
cs,sc     c,sc         <Typofind="\b(a-zA-Z) , (a-zA-Z)\b" replace="$1, $2" />
c;c       c;sc         <Typofind="\b(a-zA-Z);(a-zA-Z)\b" replace="$1; $2" />
cs;c      c;sc         <Typofind="\b(a-zA-Z) ;(a-zA-Z)\b" replace="$1; $2" />
cs;sc     c;sc         <Typofind="\b(a-zA-Z) ; (a-zA-Z)\b" replace="$1; $2" />
c(c       cs(c         And so forth
c(sc      cs(c         And so forth
cs(sc     cs(c         And so forth
c)c       c)sc         And so forth
cs)c      c)sc         And so forth
cs)sc     c)sc         And so forth
ss        s            And so forth

Note: Suggested code is based on my preliminary understanding of the pattern of the working code at Wikipedia:AutoWikiBrowser/Typos, and I am very sure it is wrong and needs to be corrected.

Szhaider 15:39, 9 October 2006 (UTC)

These are indeed common mistakes, but unfortunately, in my experience there are too many legitimate exceptions, such as ".NET". The other mistakes may not have so many exceptions, though. Martin 16:16, 9 October 2006 (UTC)
Yeah, and what about U.S.A.? Or T.S. Eliot? Also, semi-colon is part of many HTML entities, like "&mdash;" etc., which will butt right up against letters.--BillFlis 02:11, 10 October 2006 (UTC)
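The exceptions raised in the replies can be made concrete. The sketch below is not the code suggested in the table: it is a hypothetical corrected form of three of those rules (square brackets for the character class, an escaped dot, no `\b` anchors), run through Python's `re` module as a stand-in for AWB's engine. The rule names, the `apply_rule` helper and the sample sentences are assumptions made for the example.

```python
import re

# Hypothetical corrected forms of three of the suggested rules.
# These are illustrations only and are not entries from the typo list.
RULES = {
    "c.c":  (r"([a-zA-Z])\.([a-zA-Z])", r"\1. \2"),
    "cs.c": (r"([a-zA-Z]) \.([a-zA-Z])", r"\1. \2"),
    "c;c":  (r"([a-zA-Z]);([a-zA-Z])", r"\1; \2"),
}

def apply_rule(name, text):
    """Apply one of the named spacing rules to the text."""
    find, replace = RULES[name]
    return re.sub(find, replace, text)

# The intended fix works on a genuine mistake...
print(apply_rule("c.c", "It ended.Then it began."))

# ...but the legitimate exceptions mentioned above get damaged.
print(apply_rule("cs.c", "written for the .NET framework"))
print(apply_rule("c.c", "T.S. Eliot"))
print(apply_rule("c;c", "war&mdash;peace"))
```

Each of the last three calls changes text that was already correct, which is the reason given in the replies for keeping rules like these off a list that aims for no false positives.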

Predominately?

Suggested addition - replacing "predominately" (not a word) with "predominantly." | Mr. Darcy talk 20:22, 6 November 2006 (UTC)

Sorry, but "predominately" is indeed a word, meaning--guess what?--"predominantly". See here.--BillFlis 19:58, 10 November 2006 (UTC)

'Logical' punctuation in quotations

I'm changing punctuation at the end of quotations to 'logical' style, per Wikipedia:Manual of Style#Quotations by replacing <," > (comma-quote-space) with <", > (quote-comma-space) throughout (e.g. <"Yes," he said.> to <"Yes", he said.>). I haven't come across any false positives yet. A similar replacement might be possible for embedded full stops at the end of quotations, but that's more controversial and would produce too many false positives, I think, unless someone could suggest a clever method to exclude the case where an entire sentence, including its final punctuation, is being quoted. Colonies Chris 22:59, 6 November 2006 (UTC)
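The comma replacement described above is a plain three-character substitution, so it can be sketched without a regex at all. The function name `to_logical_quotes` is made up for this example, and the sketch only handles straight double quotes.

```python
def to_logical_quotes(text):
    """Move a comma from just inside a closing quote to just outside it."""
    # comma-quote-space becomes quote-comma-space, as described above
    return text.replace('," ', '", ')

print(to_logical_quotes('"Yes," he said.'))
```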

Orignal --> Original

There is a town in Ontario called L'Orignal, mentioned in a few articles, so the regex should exclude this if possible. Colonies Chris 08:23, 9 November 2006 (UTC)
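One possible way to exclude the place name is a negative lookbehind for the "L'" prefix. The rule on the list is not quoted in this thread, so both patterns below are hypothetical, and Python's `re` module stands in for AWB's engine (an assumption).

```python
import re

# Hypothetical form of the rule; the actual list entry is not shown here.
UNGUARDED = r"\b(O|o)rignal(|s|ly|ity)\b"
# Same rule, but refusing to match directly after "L'" (straight or curly
# apostrophe), so that L'Orignal is left alone.
GUARDED = r"(?<![Ll]['’])" + UNGUARDED
REPLACE = r"\1riginal\2"

def fix_original(pattern, text):
    """Apply the chosen variant of the rule to the text."""
    return re.sub(pattern, REPLACE, text)

print(fix_original(UNGUARDED, "L'Orignal, Ontario"))  # place name gets altered
print(fix_original(GUARDED, "L'Orignal, Ontario"))    # place name is kept
print(fix_original(GUARDED, "the orignal plan"))
```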

Problem with "definitions"

When presented with the misspelling "defintions" it tries to replace it with "definitons" which is still not the correct spelling. I took a look at the RegEx and I am not quite sure how to fix this problem, so if somebody with more experience can fix it, that would be great. --Maelnuneb (Talk) 19:49, 10 November 2006 (UTC)

OK, fixed, thanks.--BillFlis 19:58, 10 November 2006 (UTC)

Firsthand

I am getting a ton of false-positives with this one. Card game pages are a real big source of false-positives. I am going to remove it from the list due to this. Code for the RegEx was: <Typo word="Firsthand" find="\b(F|f)irst[ -]hand\b" replace="$1irsthand" /> Possible fix: only match first-hand, but I'm not positive that version isn't an acceptable spelling. Any comment on that would be great. --Maelnuneb (Talk) 20:59, 13 November 2006 (UTC)

After looking up first-hand on [1], it suggested firsthand, so I will add checking for "first-hand" back into the system, but not "first hand" as the possibility of a false positive for "first-hand" is non-existent. If people believe that "first hand" should be included still, please debate here. --Maelnuneb (Talk) 21:05, 13 November 2006 (UTC)
And the OED and Webster Unabridged, both more reliable dictionaries, have "first hand" and "first-hand". This is certainly not a typo, and at the very least is an acceptable alternative spelling, if not the better spelling. —Centrxtalk • 21:29, 14 November 2006 (UTC)
Given that, I would agree to not have firsthand in the list of typos. I personally didn't write the rule in the first place, just tweaked it to get rid of false positives and then did a quick search to see if "first-hand" was a correct spelling, running on the assumption that the original contributor that added the rule for firsthand was in fact correct. Centrx, thank you very much for finding evidence of the other spellings and bringing them here. --Maelnuneb (Talk) 17:46, 15 November 2006 (UTC)

Also, this list really does need to be restricted to typos, not bad usage, because quotations and normal sentences will be filled with cases that should not be "corrected". Also, with compound words there are common sentences (such as actually referring to the first hand of something, as in a game of cards or something about physiology) that would never warrant changing. —Centrxtalk • 06:34, 16 November 2006 (UTC)

Typos would still show up in those cases unfortunately. That is the entire reason that the process of fixing typos is not automated. Your point about "first hand" was exactly why I changed the rule to match only "first-hand" actually. I was getting tired of fixing false positives, so I changed the rule to prevent it. --Maelnuneb (Talk) 18:00, 17 November 2006 (UTC)
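The card-game false positive is easy to reproduce. The sketch below compares the rule as originally quoted with the hyphen-only version described in this thread, using Python's `re` module as a stand-in for AWB's engine (an assumption). It only illustrates the narrowing; the dictionaries cited above suggest the rule may not belong on the list at all.

```python
import re

ORIGINAL = r"\b(F|f)irst[ -]hand\b"  # rule as quoted at the top of the thread
HYPHEN_ONLY = r"\b(F|f)irst-hand\b"  # narrowed version described above
REPLACE = r"\1irsthand"

def apply_rule(pattern, text):
    """Apply one variant of the 'firsthand' rule to the text."""
    return re.sub(pattern, REPLACE, text)

sentence = "I won the first hand."
print(apply_rule(ORIGINAL, sentence))     # false positive
print(apply_rule(HYPHEN_ONLY, sentence))  # left alone
print(apply_rule(HYPHEN_ONLY, "a first-hand account"))
```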

referrences -> referencces

<Typo word="Reference" find="\b(R|r)efe(?:rr?a|rre)n(ce[ds]?|cing|ts?)\b" replace="$1eferenc$2" />
should likely be
<Typo word="Reference" find="\b(R|r)efe(?:rr?a|rre)n(ce[ds]?|cing|ts?)\b" replace="$1eferen$2" />
~ BigrTex 20:19, 15 November 2006 (UTC)

Thank you for your suggestion! When you feel an article needs improvement, please feel free to make those changes. Wikipedia is a wiki, so anyone can edit almost any article by simply following the Edit this page link at the top. You don't even need to log in (although there are many reasons why you might want to). The Wikipedia community encourages you to be bold in updating pages. Don't worry too much about making honest mistakes — they're likely to be found and corrected quickly. If you're not sure how editing works, check out how to edit a page, or use the sandbox to try out your editing skills. New contributors are always welcome. ~ BigrTex 20:00, 16 November 2006 (UTC)
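The doubled "c" reported in this thread comes from the replacement string, not the find pattern: the second capture group already begins with "c", so a replacement ending in "c" repeats it. The sketch below shows both replacement strings side by side, with Python's `re` module standing in for AWB's engine (an assumption).

```python
import re

FIND = r"\b(R|r)efe(?:rr?a|rre)n(ce[ds]?|cing|ts?)\b"
BUGGY = r"\1eferenc\2"  # current replacement: adds a second "c"
FIXED = r"\1eferen\2"   # suggested replacement

def apply_rule(replacement, text):
    """Apply the 'Reference' rule with the given replacement string."""
    return re.sub(FIND, replacement, text)

print(apply_rule(BUGGY, "referrences"))
print(apply_rule(FIXED, "referrences"))
print(apply_rule(FIXED, "Referance"))
```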

Society, abundant

  • Societ -> Society
  • abundandt -> abundant
  • abundandtly -> abundantly

I stumbled across "Societ" today, and I have a tendency to add an unnecessary d to abundant as well, but I don't know how to add these to the filters myself. --Lethargy 00:14, 16 November 2006 (UTC)

I have just added <Typo word="Abundant" find="\b(A|a)bundand(t|tly)\b" replace="$1bundan$2" /> Tankred 00:38, 16 November 2006 (UTC)

<Typo word="Oft(en)times" find="\b(O|o)ft(|en)[- ]times\b" replace="$1ft$2times" />

Often Times to Oftentimes ???

It might be me, but that seems like a use that would be sparsely used?

Or is it just me?

Reedy Boy 15:32, 19 November 2006 (UTC)

New additions section

Can we be more explicit about whether new additions should be put at the beginning or at the end of the "New additions" section? People put them in both places, which makes the chronology of the section a bit hard to follow. The section is fairly large now, and it would perhaps be a good idea to check the oldest additions again and then move them to the main body. Tankred 16:55, 19 November 2006 (UTC)

Increase

Suggested addition: While fixing other typos I stumbled upon 'increse' (missing a).

<Typo word="Increase" find="\b(I|i)ncres(e|ed|ing|ingly)\b" replace="$1ncreas$2" />

Thanks. ChrisCork 06:51, 28 November 2006 (UTC)

Added, with the handling of "Decrease" as well.--BillFlis 12:52, 28 November 2006 (UTC)
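For reference, the suggested rule can be exercised as below, with Python's `re` module standing in for AWB's engine (an assumption). The combined `IN_DECREASE` pattern is only a guess at how "Decrease" might be folded into the same entry; the rule actually added to the list is not shown in this thread.

```python
import re

# Rule as suggested above.
INCREASE = (r"\b(I|i)ncres(e|ed|ing|ingly)\b", r"\1ncreas\2")
# Hypothetical combined form that also covers "decrese" and friends.
IN_DECREASE = (r"\b([Ii]n|[Dd]e)cres(e|ed|ing|ingly)\b", r"\1creas\2")

def apply_rule(rule, text):
    """Apply one (find, replace) typo rule to the text."""
    find, replace = rule
    return re.sub(find, replace, text)

print(apply_rule(INCREASE, "an incresed rate"))
print(apply_rule(IN_DECREASE, "Decresing and incresingly"))
```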

Super Bowl

Superbowl -> Super Bowl. I see that one a lot, not just on the Wiki. I'm not sure how to add listings that split into two words, so I'm adding it here. --cholmes75 (chit chat) 20:56, 28 November 2006 (UTC)

Done!--BillFlis 21:02, 28 November 2006 (UTC)

Guerilla

<Typo word="Guerilla" find="\b(G|g)uer(?:r?i|ril?)l(as?)\b" replace="$1uerill$2" />

We are replacing Guerrilla with Guerilla, even though the article spells it the 'wrong' way. I have removed the line. ~ BigrTex 00:12, 1 December 2006 (UTC)
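The behaviour described above can be confirmed by running the removed rule over both spellings. The sketch uses Python's `re` module as a stand-in for AWB's engine (an assumption); the helper name is made up for the example.

```python
import re

# The rule that was removed, with $1/$2 written as \1/\2.
FIND = r"\b(G|g)uer(?:r?i|ril?)l(as?)\b"
REPLACE = r"\1uerill\2"

def apply_guerilla_rule(text):
    """Apply the removed 'Guerilla' rule to the text."""
    return re.sub(FIND, REPLACE, text)

for word in ["guerrilla", "Guerrillas", "guerilla", "guerila"]:
    print(word, "->", apply_guerilla_rule(word))
```

The double-r spelling matches through the `ril?` branch and is rewritten with a single r, while the single-r spelling is the only variant left untouched.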

Problem with kW, kJ, Hz

I'm getting problems with kW, kJ, Hz because AWB now changes (eg on the Bible page)

[[kw:Bibel]] to [[kW:Bibel]]
[[kj:Ombibeli]] to [[kJ:Ombibeli]]
[[hz:Ombeibela]] to [[Hz:Ombeibela]]

They then get moved out of sequence. I suggest the regex be amended to exclude situations where the word is preceded by square brackets and followed by a colon.

Sorry haven't got time to do it at present - I'm rushing off to work!

Cheers - Euchiasmus 07:08, 1 December 2006 (UTC)
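The suggested exclusion (skip the match when it directly follows "[[" and is directly followed by a colon) can be expressed with nested lookarounds. The unit rules themselves are not quoted in this thread, so the `UNITS` table and both patterns below are hypothetical, and Python's `re` module stands in for AWB's engine (an assumption).

```python
import re

# Hypothetical unit rules; the real list entries are not quoted in this thread.
UNITS = {"kw": "kW", "kj": "kJ", "hz": "Hz"}
NAMES = "|".join(UNITS)

NAIVE = re.compile(r"\b(%s)\b" % NAMES)
# Skip a match only when it sits directly after "[[" AND directly before ":",
# i.e. when it is an interwiki language prefix.
GUARDED = re.compile(r"\b(?!(?<=\[\[)(?:%s):)(%s)\b" % (NAMES, NAMES))

def fix_units(pattern, text):
    """Replace each matched unit with its correctly capitalised form."""
    return pattern.sub(lambda m: UNITS[m.group(1)], text)

print(fix_units(NAIVE, "[[kw:Bibel]]"))    # interwiki link gets altered
print(fix_units(GUARDED, "[[kw:Bibel]]"))  # interwiki link is kept
print(fix_units(GUARDED, "a 50 hz supply rated 2 kw"))
```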

Rule Problems

The rule as written changes governement to governmen. -- Saaber 04:07, 4 December 2006 (UTC)
The rule as written changes quanity to quantituanit. -- Saaber 11:02, 4 December 2006 (UTC)

Miniscule

... is cool, listed as a variant of "minuscule" here and here.--BillFlis 12:50, 9 December 2006 (UTC)

The misspelling has become so widespread that some authorities are listing it as an alternative. However, there is still a clear majority in favour of the correct spelling. I vote we go with the majority and stick to minuscule. Euchiasmus 16:07, 9 December 2006 (UTC)
Dictionary.com shows "miniscule" in three different sources here, which makes a total of at least four, since M-W isn't one of them. Given the policy against changing from one spelling of the same word to another, I don't think we should be automatically changing this. —Krellis 17:31, 11 December 2006 (UTC)
Whatever you do, don't change the occurrences of "miniscule" in the minuscule article. This article does indeed say that "miniscule" has been "traditionally regarded as a spelling mistake," although no reference is offered for this contention. Some discussion with references may be found here.--BillFlis 19:03, 11 December 2006 (UTC)

Changing ordinals to cardinals in dates

Please can we remove the ordinal to cardinal conversion in dates? Maybe the Americans don't habitually use dates like "1st May", but we British do use them and I can't see anything wrong with them. When I read "1 May" it looks very strange, especially in narrative prose.

Here in the UK the use of st|nd|rd|th is very common in dates. For example, glancing through filed correspondence I find that the majority of my documents (insurance policies, bank statements, nominet registration, etc) use ordinal numbers in dates. With other regional variations WP allows alternative forms - why not in dates?

Euchiasmus 14:18, 10 December 2006 (UTC)

I personally have mixed feelings about adding things to the typo list that aren't typos or misspellings, but the intention here was clearly to go with the Manual of Style guideline on ordinal suffixes in dates (relevant section here). So you'd really probably be better off bringing it up there. Hope this helps. --Galaxiaad 19:02, 10 December 2006 (UTC)
  • Here are a couple of points:
    • because WP:DATE is a guideline, consensus was reached about the date format to be used. While a guideline is not a rule, we should be striving towards the suggestions given unless there is a strong push for a change, which would mean that there is no longer consensus. Therefore, while consensus still exists, there is no reason to remove the rules removing ordinals from dates.
    • A note to users of WP:AWB/T: be careful not to remove ordinals in direct quotes. --Maelnuneb (Talk) 17:44, 12 December 2006 (UTC)