User talk:Lupin/badwords
A curious question
I may, perhaps, be harder to offend than the average American, but how is "all the pies" considered a "bad word"? :) - JustinWick 08:34, 31 January 2006 (UTC)
- See Who Ate All the Pies? - it's a "classic" playground football insult. Lupin|talk|popups 16:19, 31 January 2006 (UTC)
- Wow, an informative response! Thanks, I learned something! - JustinWick 05:41, 12 February 2006 (UTC)
Suggestion
You should add more variations of the bad words. I can think of some you may have missed. Evan Robidoux 09:11, 24 February 2006 (UTC)
-
-
- Image:Human_feces.jpg
-
- suks
- kill, especially with exclamation points.
- Variations of the word "die," especially with exclamation marks (e.g.: "Die!")
That's all I can think of right now. Evan Robidoux 09:42, 24 February 2006 (UTC)
Another suggestion
Terms you've missed are permutations of a, s, d, and f. On a QWERTY keyboard, if you mash the keys, most people end up writing "asdasdasdf" or similar. Vandal edits usually give an edit summary of mashed keys.-- Alfakim -- talk 18:02, 13 April 2006 (UTC)
- Actually, vandals usually give no edit summary, or only a section edit summary. This is probably because most of them are new users who haven't noticed the summary box.--Reverting 02:49, 6 June 2006 (UTC)
Regexp
Is it possible for this to support regexps? It seems to me that a good number of these edits and such could be used for good (see this diff, where the word vegan was added...)? -Mysekurity [m!] 21:17, 27 April 2006 (UTC)
- Yes, I've had a go at this. Note that ( is replaced with (?: - the idea is that all paren groups are transformed into non-capturing parens so that it doesn't mess up script internals. This means that backreferences aren't possible and also that you should avoid opening parens apart from using them for grouping at the moment. Also, each regexp is treated as if it's surrounded by word boundary markers, it is made case-insensitive, and flags aren't supported. To add a regexp, surround it by forward slashes and add it to badwords. I haven't tested this much, so let me know how you fare... Lupin|talk|popups 02:13, 28 April 2006 (UTC)
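The transformation described in this reply can be sketched roughly as follows. This is only an illustration of the stated rules (parens become non-capturing, word boundaries are added, matching is case-insensitive); the function name `compileBadword` and the handling of plain, non-regexp entries are assumptions, not taken from the actual script.

```javascript
// Hypothetical helper: mirrors the rules described above, not the tool's real code.
function compileBadword(entry) {
  // Entries surrounded by forward slashes are treated as regexps.
  const m = /^\/(.*)\/$/.exec(entry);
  let body;
  if (m) {
    // Every "(" becomes "(?:" so that no group captures anything.
    body = m[1].replace(/\(/g, "(?:");
  } else {
    // Assumption: a plain word is matched literally, so escape metacharacters.
    body = entry.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  }
  // Word-boundary markers on both sides, case-insensitive, no other flags.
  return new RegExp("\\b" + body + "\\b", "i");
}

const stupidRe = compileBadword("/stupid(ity|ly)?/");
console.log(stupidRe.source);                      // \bstupid(?:ity|ly)?\b
console.log(stupidRe.test("That was STUPIDLY done")); // true
console.log(stupidRe.test("the stupidest idea"));     // false (no word boundary after "stupid")
```

Because the replacement is a blind textual one, an escaped literal paren or an existing `(?:` would also be rewritten, which is consistent with the advice above to use opening parens only for grouping.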
More Bad Words
This is just a suggestion, I didn't add any of these.
REDIRECT--Maybe this will work against WoW, or redirect vandals.
chicks--as in "I like hot chicks."
stupid--"article is stupid"--I'm surprised you don't already have this.
Also, many vandals like to type in ALL CAPS, so maybe you can do something about this.
- I disagree with REDIRECT, as it will give a huge number of false-positives for every time someone moves a page, or creates a redirect. It's broad words like this that make the tool much less useful. I'm going to remove it. -Mysekurity [m!] 01:20, 10 May 2006 (UTC)
Wang
How is Wang a bad word? It is a common Chinese family name. Andrew_pmk | Talk 02:37, 2 May 2006 (UTC)
- It is also slang for penis, along with about a million other words to refer to genitalia (there's a certain stigma attached to private parts, as I understand). This is the type of situation where I think regexps (see above) would work well. Unfortunately, I'm not too good with them, so if you have any suggestions based on Lupin's post above, feel free to tell me and I'll change the page. -Mysekurity [m!] 02:45, 8 May 2006 (UTC)
I couldn't think of a title for this...
What about ____ on Wheels? And they aren't all bad words. Just words vandals like to use.-Gangsta-Easter-Bunny 20:09, 5 May 2006 (UTC)
- It's already there (see "On Wheels"). -Mysekurity [m!] 01:19, 10 May 2006 (UTC)
Case sensitive?
Are the "badwords" listed here case sensitive? By that I mean will a word, say "bitch" still be detected if it is written "BITCH", for example, without a separate entry for an all-caps version of the word having to exist?--Conrad Devonshire Talk 01:39, 9 May 2006 (UTC)
- They're all case-insensitive, so the answer to your second question is "yes". Lupin|talk|popups 02:32, 17 May 2006 (UTC)
- Here's the thing though... I've seen more than a few vandalous edits where the entire edit was done all in caps. Is there any way that we can filter for an "all caps" edit? Fbarton 00:19, 8 December 2006 (UTC)
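A case-insensitive word list cannot express "all caps" by itself, so a check like this would have to live outside the list. The following is a minimal sketch of one possible heuristic; the function name `looksAllCaps`, the minimum length and the 90% threshold are all assumptions chosen for illustration, and only ASCII letters are counted.

```javascript
// Hypothetical heuristic, not part of the tool: flags text whose letters
// are almost entirely upper case.
function looksAllCaps(text, minLetters = 10, threshold = 0.9) {
  const letters = text.match(/[A-Za-z]/g) || [];
  // Very short additions (e.g. an acronym) are too small to judge.
  if (letters.length < minLetters) return false;
  const upper = letters.filter((ch) => ch >= "A" && ch <= "Z").length;
  return upper / letters.length >= threshold;
}

console.log(looksAllCaps("THIS ARTICLE IS RUBBISH!!!"));        // true
console.log(looksAllCaps("NASA launched the probe in 1977."));  // false
console.log(looksAllCaps("OK"));                                // false
```

The minimum-length guard is there so that legitimate acronyms and initialisms are not flagged on their own.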
Removed "fist"
I decided to remove "fist" from the list, but if anyone disagrees with this decision, feel free to undo it.--Conrad Devonshire Talk 21:37, 9 May 2006 (UTC)
Removed "woody"
I have decided to remove "woody" from the list of vandal terms.--Conrad Devonshire Talk 01:37, 17 May 2006 (UTC)
Moravia?
Why is "Moravia" on the list of vandal terms?--Conrad Devonshire Talk 21:54, 28 May 2006 (UTC)
- No idea :) Here's the diff. Lupin|talk|popups 01:39, 30 May 2006 (UTC)
Linkspam
I've added three links to the list. I don't think they should be banned from Wikipedia outright, but they have been added a lot recently and I'd like to keep an eye on them. If this is not the kind of thing we want on this list, feel free to remove them. Tom Harrison Talk 14:50, 3 June 2006 (UTC)
Badwords fork
Rather than ask for consensus every time I wanted to remove a false positive, I've split off my own badwords list which is slightly more optimized. Anyone who is interested is welcome to use it: http://en.wikipedia.org/wiki/User:Can%27t_sleep%2C_clown_will_eat_me/badwords -- Can't sleep, clown will eat me 02:32, 5 June 2006 (UTC)
- Forking is fine of course, but I'd rather people were bold and changed the page as they saw false positives or missing bad words crop up instead of trying to come to some sort of consensus in advance. If there's controversy there can be discussion, but I don't want anyone to think that there's a requirement to discuss before making changes. Lupin|talk|popups 06:51, 6 June 2006 (UTC)
gabenwell.com and churnedfortaste.com
I have added these two sites to the list. If you see a link to either one of them posted, DO NOT CLICK IT. It will cause a window with an offensive image to appear and will attempt to open tons of Outlook Express and Instant Messenger windows and try to send e-mail to the GNAA. They were posted by now-blocked user Churnedfortaste. Another mirror of this site, hentai.net has also been spammed according to the Spam Blacklist but has since been blacklisted.--Conrad Devonshire Talk 03:06, 11 July 2006 (UTC)
- These two have now been added to the spam blacklist.--Conrad Devonshire Talk 21:30, 11 July 2006 (UTC)
Ho
Could someone please remove "ho" from the list? I looked for it myself, but couldn't find it.--The Count of Monte Cristo Parley 10:13, 1 August 2006 (UTC)
- Done. I couldn't find it either, so I wrote a script which I've included below for reference. Lupin|talk|popups 01:17, 2 August 2006 (UTC)
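#!/usr/bin/env perl
# usage: findbad.pl testword < badwords
# Prints every /regexp/ line of the badwords list that matches testword.
use strict;
use warnings;

my $test = $ARGV[0];
while (<STDIN>) {
    # Only lines of the form /.../ are regexp entries; skip everything else.
    next unless m{^/(.*)/$};
    my $re = $1;
    if ($test =~ /$re/i) {
        print "$.: $_";    # line number, then the matching entry
    }
}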
Triple
I have removed "triple", as it was giving lots of false-positives, and I can't imagine any bad use of it. -Goldom ‽‽‽ ⁂ 11:50, 5 August 2006 (UTC)
- Apparently, I haven't, cause it's still showing up. Not sure what I actually did there, in that case, so I reverted in case it was something bad. If someone else could remove it properly, unless there's a reason to keep it, that'd be great. -Goldom ‽‽‽ ⁂ 11:54, 5 August 2006 (UTC)
TTT
Why is TTT flagged as a bad word? -- Selmo 04:33, 18 August 2006 (UTC)
nigger
What do you think of the idea of adding nigger(s) to the black list? I saw it twice tonight Lucasbfr 02:18, 20 August 2006 (UTC)
- I'm sorry, racial slurs are terrible things, etc, but that's a fairly amusing (hopefully unintentional) pun. Yes, I am that insensitive.- JustinWick 09:32, 7 December 2006 (UTC)
queer
I've been using your tool (which I LOVE) and a few times "queer" came up because the TV show "Queer eye for the straight guy" was mentioned. Is it possible to make that an exception to the scan for that word? Lauren 18:56, 20 August 2006 (UTC)
Regular expression idioms
Wherever a space appears in a regular expression, it could be replaced with \s* to allow zero or more whitespace characters to match (or \s+ to require at least one). Also useful: (e?s|[e']?d|in[g']?|ers?)? to catch verb paradigms such as pick, picks, picked, picker, pick'd, picking, pickin', and so on. Peter O. (Talk) 02:53, 23 August 2006 (UTC)
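Both idioms can be tried out in any JavaScript console. The sketch below uses illustrative variable names and writes the suffix group in non-capturing form, which is what the list's paren rewriting would produce anyway; the "was here" phrase is borrowed from the list only as a sample.

```javascript
// Suffix idiom: optional verb endings after a stem.
const suffix = "(?:e?s|[e']?d|in[g']?|ers?)?";
const pickRe = new RegExp("^pick" + suffix + "$", "i");
const forms = ["pick", "picks", "picked", "picker", "pick'd", "picking", "pickin'"];
console.log(forms.every((w) => pickRe.test(w))); // true
console.log(pickRe.test("pickle"));              // false

// Whitespace idiom: \s* also matches when the space is missing entirely,
// while \s+ insists on at least one whitespace character.
const loose = /was\s*here/i;
const strict = /was\s+here/i;
console.log(loose.test("Bob washere"));     // true
console.log(strict.test("Bob washere"));    // false
console.log(strict.test("Bob was   here")); // true
```

Which of the two whitespace forms is preferable depends on whether run-together text such as "washere" should count as a hit.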
Noxious SPAMmer
Since "datasheet4u.com" has done NOTHING but SPAM datasheet, could someone add this to the list to prevent sneaky insertions (It's already on the SPAM blacklist, but they just don't link it instead)? Thanx. 68.39.174.238 23:26, 5 September 2006 (UTC)
Regex
How come these two rules I made to match vandalism, which often involves the use of more than 2 ?'s and !'s, don't seem to work? What is wrong with them, and what's the right way of matching multiple question marks and multiple exclamation marks?
/!{2,}/
/\?{2,}/
Sir Vicious 01:34, 1 November 2006 (UTC)
- Regular expressions are awful. They never do what you expect them to do (or what documentation says they should do); they work differently on each system, and what's more, the huge amount of the aforementioned documentation never seems to solve the problem. -Patstuart(talk)(contribs) 03:04, 1 November 2006 (UTC)
- Thanks for the comment. So, are there better ways of matching them? I've tried /!!+/ too but it did not seem to have the desired effect, it matched a single "!" too, weird. Sir Vicious 03:50, 1 November 2006 (UTC)
- Come to think of it, maybe I don't need to use regexp at all, I can just match ?? and !!, any case where more than 2 marks are used will also automatically be matched. Sir Vicious 03:59, 1 November 2006 (UTC)
- I've tried some stuff in the sandbox; it's picking up Niger (I actually added that as a regex to pick up nigar), but it's not picking up n00b, which is on the list and which I could have sworn it would pick up. *Sigh*. Patstuart(talk)(contribs) 04:08, 1 November 2006 (UTC)
- Ha! As I typed this, look at this edit: [1]. And I thought picking up niger was bad! Patstuart(talk)(contribs) 04:09, 1 November 2006 (UTC)
- Hehe, yes, there is always an idiot out there who can't even vandalize right =) Sir Vicious 04:13, 1 November 2006 (UTC)
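One possible explanation for the punctuation rules failing, assuming the word-boundary wrapping described under "Regexp" is applied to every entry: \b only matches between a word character and a non-word character, so a run of "!" at the end of a sentence never satisfies the trailing boundary. The sketch below demonstrates that effect in isolation; it does not reproduce the tool itself and does not account for the reported single "!" match.

```javascript
// The rule as written, and the same rule wrapped in \b...\b
// (the wrapping is an assumption based on the "Regexp" section).
const bare = /!{2,}/;
const wrapped = new RegExp("\\b" + "!{2,}" + "\\b", "i");

console.log(bare.test("This is great!!"));    // true
console.log(wrapped.test("This is great!!")); // false: no word character after the last "!"
console.log(wrapped.test("wow!!nice"));       // true: letters on both sides of the run
console.log(bare.test("Just one!"));          // false
```

Under that assumption, punctuation-only patterns would need to be exempt from the boundary wrapping in order to work as intended.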
Possible or impossible
I don't know if this would be possible, but I've seen a lot of vandalism today where the user put their own username into an article. I found them through the badwords filter, but I wonder how much "Graffiti" we're missing because of this. Is there a way to check if the added text is equivalent to the editor's username? Fbarton 19:01, 8 December 2006 (UTC)
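If the tool has access to both the added text and the editor's name, the comparison itself is simple. Here is a minimal sketch under that assumption; `mentionsOwnUsername` is a hypothetical name, and treating spaces and underscores as interchangeable is a guess at how usernames tend to be typed.

```javascript
// Hypothetical check: does the added text contain the editor's own username?
function mentionsOwnUsername(addedText, username) {
  if (!username) return false; // nothing to compare against
  // Escape regexp metacharacters so the name is matched literally.
  const escaped = username.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Allow "Evil Bob" and "Evil_Bob" to match each other.
  const pattern = escaped.replace(/[ _]+/g, "[ _]+");
  return new RegExp(pattern, "i").test(addedText);
}

console.log(mentionsOwnUsername("evil_bob was here", "Evil Bob")); // true
console.log(mentionsOwnUsername("A quiet copyedit", "Evil Bob"));  // false
```

IP editors and users legitimately signing talk pages would need to be excluded for this to be useful.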
Innovative vandalism
Just came across this. Not sure how to add <nowiki> and </nowiki> to this list. —Dylan Lake 02:00, 13 December 2006 (UTC)
"Chicken" and "Cum laude"
- Why is "Chicken" a bad word? The vandal tool has been flagging a lot of harmless edits about KFC recently.
- I think that "Cum laude" should not be considered a bad word, even though "cum" is obviously one.
repetitions of hi
I've had several vandals recently doing repetitions of hi, e.g. hihihihihi. Can this be added? BlankVerse 00:33, 11 January 2007 (UTC)
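A repeated group would cover this. The sketch below assumes the entry would be written as /(hi){3,}/ and then wrapped as described under "Regexp" (non-capturing parens, word boundaries, case-insensitive); the threshold of three repetitions is an arbitrary choice.

```javascript
// What the entry /(hi){3,}/ would look like after the assumed wrapping.
const hiRun = new RegExp("\\b(?:hi){3,}\\b", "i");

console.log(hiRun.test("hihihihihi"));       // true
console.log(hiRun.test("HiHiHi everyone"));  // true
console.log(hiRun.test("hi there"));         // false
console.log(hiRun.test("history of China")); // false
```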
Roland?
Why is "Roland" on the list... --Catz [T • C] 14:25, 13 January 2007 (UTC)
Another bad word?
MMM Commentaries - I've seen it inserted onto several pages (think petitiononline): 1 2 3 4 5 6 --science4sail talkcon 01:25, 23 January 2007 (UTC)
Sorted list
Folks, I am trying to use this list to scan for entries in the WP CD release - see Wikipedia talk:Version 0.5. To try to optimise this list, I sorted it, by the longest embedded string, and put the results at User_talk:Lupin/sorted_badwords. Could this please replace the parent page? Can people optimise the list? Wizzy…☎ 10:17, 7 February 2007 (UTC)
- Out of a list of 1991 articles, the following regular expressions were the most common to hit (and thus could use the most tailoring):
102 /(fried)?chicken/
94 /rap(e[sd]?|ers?|ing)/
53 /monkeys?/
53 /dumb?(ass|arse|o|m?y)?/
51 /fat(ty|ass)/
49 /lesbian(s|ism)?/
48 /sex(e[dr]?)s?/
44 /chi(ck|x)s? ?(with ?di(ck|x)s?)?/
40 /ma(de|ke[ds]?|king) out/
37 /s?su(c?k|x)(a|ing|e[rd]|y)?s?/
32 /stupid(ity|ness|er|head|ly)?s?/
32 /loo?sers?/
30 /s?su(c?k|x)(a|ing|e[rd]|y)?s? (my|your|his|her|its|their|our|each other|peter)?s?/
29 /[a@]([s$][s$]+|rse?|zz)(ban(ned)?|s?e|fuc?k|h[0o][l1][e3]|head|hat|juice|lick(e[rd])?|ram(mer|ma)?|raper?|rapper|wiper?)?[sz]?/
29 /cum(bucket|dumpster|felch(er|ing|ed)?)?s?/
26 /rect(al|ums?)/
26 /retard(s|ed(ly)?)?/
24 /sodom(i[zst](e[rd]|ing)|y)s?/
23 /butt-?(|breath|crack|fuck(e[dr]|ing)?|head|hole|lick(er|ing)|pirate|rape|sex|secks|wiper?)s?/
22 /vagina(l|s)?/
21 /an(us|al)(hole|tova|es)?/
20 /r[ai]m(job|me[dr]|ming)s?/
20 /c[o0]ck-?(|ass|bag|biter?|goggle|fucker|smok(a|e[dr]|ing|in|in')|head|face|nose|hole|suck(|a|e[dr]|ing|in|in')|thirsty?)?s?/
19 /fetish(es|ism)?/
18 /junk(ies?)?/
18 /jerk(ing|ed|y|wad)?([- ]?off)?s?/
17 /n[i1]gg?([e3]r|ar?|uh)(lover|ass)?[sz]?( stole)?/
17 /w[au][sz] here/
17 /d[a4]m[nm](it)?/
15 /beaver(juice|lick|suck|fuck)?(er|ing|ed|a)?s?/
15 /lam[eo](brain|er)?s?/
14 /testicles?/
14 /crackers?/
13 /p[3ei]n[1!iu]s(bit|lick|suck|head|fuck|face|hole(e|er|ing)?)?s?/
13 /Amerik+an?'?s?/
12 /sex(y|ier|iest) ?(babe|cunt|beast|bitche?|whore)?s?/
12 /(yo)+/
12 /nuk(e([dr])?s?|ing)/
11 /nipples?/
10 /bu(m|ng)(hole|lick(e[rd])?|wipe[rd]?|ming|chum)?s?/
10 /Japs?/
10 /((is a|are|is) )?homo(phobe)?s?/
10 /(f|ph)u(kc|c+k*|c*k+|x)(a|ass|e[rd]|ie|y|bitch|erino|head|hole|arse|arsed|face|queer|wit|in[g']?|inghell|[o0]r?|o|off|tard|wad)?s?/
10 /finger(ing|ed|pull(a|er)s?)/
Going to remove 'the'
I don't understand why 'the' is a 'bad word'.. it just floods the tool. SgeoTC 05:18, 11 February 2007 (UTC)
Major overhaul
Spent some time working on the list (as you can tell from the edit summary). Basically, instead of a straight alphabetical list, I made an attempt to categorize and prioritize it by level of offensiveness so that the most egregious vandals are more apparent when using the 'recent changes' tool. Also added quite a few phrases and sentence fragments based on the vandal patterns that I've been seeing. Hope it works out for everyone, and please let me know if I've either helped out or jacked something up. RJASE1 Talk 20:16, 18 February 2007 (UTC)
Punk
The punk string appears to me to be generating huge numbers of false positives, and I have yet to see it generate a true positive. IMHO the expression should be modified to only match punk with "asse" and perhaps "buc" (I'm not sure what the buc bit is for), so that fewer articles that are genuinely about punk rock are picked up. I don't know how the regular expressions work so I'm not sure what would be best. --Jon186 13:23, 4 March 2007 (UTC)
- Fixed. RJASE1 Talk 17:16, 4 March 2007 (UTC)
- Thanks for that :o) --Jon186 20:19, 11 March 2007 (UTC)
What regex does this use?
The syntax for regular expressions varies depending on the implementation used. Which regex is used here? Is there any documentation? -- kenb215 talk 21:35, 13 March 2007 (UTC)
- It is the syntax used by your browser's javascript engine, which is generally something like PCRE (see the ECMAScript spec for details). There are further restrictions, however, as parens (...) are replaced internally by (?:...) which means you can't use literal parens, \1, \2 etc. Lupin|talk|popups 22:35, 13 March 2007 (UTC)
What's the best way to test a regular expression that I wish to add to the list? Is there a way to test a portion of text against the existing list to see if the vandalism is already being caught? -- callred
- In theory: Make a user subpage, uncheck "Ignore my edits," open the "filter recent changes" page, add your test to the subpage, and see if it shows up (make sure you do everything in that order... except maybe the first one) In practice: There's probably a much better way to do this... maybe with the javascript: URI or something... --Thinboy00 @145, i.e. 02:29, 14 February 2008 (UTC)
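An offline alternative is to paste the list entries and a sample of text into a browser console or Node. The sketch below is only an approximation: `findMatches` is a hypothetical helper, and it assumes entries are compiled as described earlier on this page (parens made non-capturing, word boundaries added, case-insensitive).

```javascript
// Hypothetical offline tester: reports which /regexp/ entries hit a sample text.
function findMatches(text, entries) {
  const hits = [];
  entries.forEach((entry, i) => {
    const m = /^\/(.*)\/$/.exec(entry);
    if (!m) return; // skip anything that is not a /regexp/ entry
    const body = m[1].replace(/\(/g, "(?:");
    const re = new RegExp("\\b" + body + "\\b", "i");
    const found = re.exec(text);
    if (found) hits.push({ line: i + 1, entry: entry, matched: found[0] });
  });
  return hits;
}

// A few entries copied from the "Sorted list" section, plus one non-regexp line.
const sampleEntries = ["/monkeys?/", "/loo?sers?/", "some plain line"];
const sampleHits = findMatches("You are all LOSERS", sampleEntries);
console.log(sampleHits);
// [ { line: 2, entry: '/loo?sers?/', matched: 'LOSERS' } ]
```

This does the reverse of the Perl script under "Ho": that script tests one word against the list, whereas this reports every entry that fires on a passage of text.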
April Fool?
Should "April Fool" be added to this list? A lot of users have already started making April Fools' Day edits and a lot of them contain the text "April Fool". -Mschel 21:25, 31 March 2007 (UTC)
Bot
I'm curious, would it be allowable for a bot to use this as a secondary source for badwords when the bot is doing a different job? (e.g. newpage monitoring) Thanks! TheFearow 05:37, 15 May 2007 (UTC)
"Learn english"?
To counteract any Stephen Colbert-related vandalism, does it make sense to add "Learn English" (just like "librarians are hiding something" was added to the list) -- Amazins490 (talk) 20:37, 25 May 2007 (UTC)
- I agree, you should add that to the list. Make sure learn and english are capitalized though, there are probably a lot of instances in Wikipedia where it says "learn english".--eskimospy (talk • contribs) 03:05, 26 May 2007 (UTC)
Signatures
Hi! I don't know much about scripting, but would it be possible to stop filtering ~~~~ and ~~~ from the list of repeated characters? It's showing up a lot in my filter. Thanks. Smaug123 06:25, 25 June 2007 (UTC)
Jimmy Wales
What is Jimmy Wales doing on the list? I mean, just because he is the founder of Wikipedia, doesn't mean that any vandal would type it in.... Coastergeekperson04 06:56, 5 July 2007 (UTC)
- It may be that this is part of the MO of one or more vandals. I just added an e-mail address for the same reason - this specific e-mail address seems to have been used twice by the same vandal. The address I'm talking about is ignoreallrules@walla.com. Od Mishehu 09:27, 2 August 2007 (UTC)
Cum laude
I don't really know how this list works, but is there a way to make "exceptions", or a "whitelist"? The filter just showed a page with the words "cum laude" because it matched the word "cum". Melsaran 11:30, 17 August 2007 (UTC)
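Nothing on this page suggests the list format itself supports exceptions, so a whitelist would probably have to be applied by the script before matching. One way that could work, sketched with a hypothetical `matchesWithExceptions` function: blank out the whitelisted phrases first, then run the ordinary pattern on what is left.

```javascript
// Hypothetical whitelist step: remove known-harmless phrases, then test.
function matchesWithExceptions(text, badRe, exceptions) {
  let cleaned = text;
  for (const phrase of exceptions) {
    // Escape metacharacters so the phrase is matched literally.
    const escaped = phrase.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    cleaned = cleaned.replace(new RegExp("\\b" + escaped + "\\b", "gi"), " ");
  }
  // badRe must not carry the "g" flag, or test() would keep state between calls.
  return badRe.test(cleaned);
}

const cumRe = /\bcum\b/i;
const whitelist = ["cum laude"];
console.log(matchesWithExceptions("She graduated magna cum laude.", cumRe, whitelist)); // false
console.log(matchesWithExceptions("cum", cumRe, whitelist));                            // true
```

A negative lookahead such as `(?! laude)` inside the entry would be the usual regexp answer, but it would presumably be broken by the rewriting of every "(" into "(?:" described under "Regexp".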
!!
!! is wikisyntax for tables, if you want to list headers one after another. I'm not sure how to edit this list, but it would kill a lot of false positives. :-) Stwalkerster talk 14:40, 17 August 2007 (UTC)
hi
Another point: it is picking up his, white, history etc. because they contain 'hi'. :-) Stwalkerster talk 15:39, 17 August 2007 (UTC)
- I'm not sure that's quite right. The diff still has to contain the identifiable word 'hi' for this to happen. If it does, then all occurrences of the string 'hi' are highlighted. The false positives come from things like '.hi.' (which occurs inside some URLs) and 'hi:', the Hindi language tag. Philip Trueman 14:27, 24 August 2007 (UTC)
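The behaviour described in this reply follows from how word boundaries work, which can be checked directly. The sample strings below are made up for illustration, and the pattern assumes the entry is simply "hi" with the usual boundary wrapping.

```javascript
// "hi" wrapped in word boundaries, case-insensitive.
const hiWord = /\bhi\b/i;

console.log(hiWord.test("his white history")); // false: "hi" is always inside a longer word
console.log(hiWord.test("www.hi.example"));    // true: dots are non-word characters
console.log(hiWord.test("[[hi:Example]]"));    // true: "[" and ":" also form boundaries
```

Punctuation counts as a boundary just as whitespace does, which is why URLs and interlanguage links trigger the entry while ordinary words do not.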
New Word
I have seen "FOOKIN" used once or twice now that hasn't been picked up. DoyleyTalk 19:18, 9 October 2007 (UTC)
Recent alterations
Some recent alterations made to the word list broke AVT's filter recent changes page. I'm not sure which specific change broke it (though I suspect it was the fairly major changes by Rocket000 (t c)); but reverting to the Sept 29th version fixed the tool, and that's the important part. If you make changes to the word list, please double check that your changes didn't break the script—there are instructions at the top of the word list for forcing your browser to use the changes immediately. I'd suggest taking the time to make sure the script still works normally if you make a change, especially if you change a large number of entries all at once. --Darkwind (talk) 01:16, 21 October 2007 (UTC)
Jews did WTC
I think this should be rather anti-semitic, how is "Jews did WTC" considered a "vandal term"? --Blake3522 03:35, 3 November 2007 (UTC)
- An entry on this list is things we usually DON'T want in Wikipedia articles. It means vandals were writing "Jews did WTC" on Wikipedia pages, and because it's now on this list the Anti-Vandal Tool will catch that and help us remove it. --Darkwind (talk) 17:42, 3 November 2007 (UTC)
- And this makes reference to the redirect: 9/11 conspiracy claims regarding Jews or Israel. --Blake3522 04:00, 10 November 2007 (UTC)
"Ethiopian" string
The "ethiopian" string seems to be having big numbers of false positives. Even when the article matches this string, so it is not right. --Blake3522 (talk) 07:09, 24 November 2007 (UTC)
False positive: Rotten - Rotten Tomatoes
I don't know how to change the code, but could someone remove the false positive Rotten Tomatoes hits from the word "rotten"? Thanks! :) ~Eliz81(C)
- Rotten Tomatoes is a website, and we don't want spam, do we? —Coastergeekperson04's talk@12/09/2007 01:50
- I've seen it used as a source in movie articles. --Thinboy00 @759, i.e. 17:12, 21 January 2008 (UTC)
- It's a really popular website for reference, and I've gotten this false positive too. I would remove it, if I could find where it was. The Evil Spartan (talk) 07:25, 22 January 2008 (UTC)
- The line is /rotten[- ]?(ass|crotch)?e?s?/, which would evaluate as true for "rotten". It probably could be modified to evaluate as false for "rotten", but I would have to ask someone more informed about regex than me - i.e. User:Gracenotes >_> --Iamunknown 07:28, 22 January 2008 (UTC)
- Thanks. I've changed it so the second phrase must be part of the filter. No reason to go chiming off every time we get the word rotten. The Evil Spartan (talk) 08:37, 22 January 2008 (UTC)
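The difference between the optional and the required group can be shown directly. The "required" pattern below is a guess at what the change looked like, since the edited entry is not quoted in this thread; both patterns are wrapped the way entries are described as being compiled under "Regexp".

```javascript
// Original entry: the (ass|crotch) group is optional, so "rotten" alone matches.
const rottenOptional = new RegExp("\\brotten[- ]?(?:ass|crotch)?e?s?\\b", "i");
// Assumed fix: drop the "?" after the group so a second word is required.
const rottenRequired = new RegExp("\\brotten[- ]?(?:ass|crotch)e?s?\\b", "i");

console.log(rottenOptional.test("Rotten Tomatoes")); // true  (the false positive)
console.log(rottenRequired.test("Rotten Tomatoes")); // false
console.log(rottenRequired.test("rotten-ass"));      // true
```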
Cummings
The last name Cummings seems to be coming up a lot as a false positive. Would anyone with the knowledge be able to fix this? Thanks. The Evil Spartan (talk) 02:05, 20 January 2008 (UTC)
- Is that as in e e cummings? --Thinboy00 @639, i.e. 14:19, 24 January 2008 (UTC)
repeated dashes
I see a lot of <!-- this -------------> (with trailing dashes) in front of or above infoboxes. Since the repeated dash filter kept finding them, I removed it. --Thinboy00 @757, i.e. 17:10, 21 January 2008 (UTC)
Jig
I do not know why Jig is flagged as a bad word, since it can mean a lively traditional Celtic dance commonly used in Baroque music called Gigue. Johnny Au (talk) 21:28, 27 January 2008 (UTC)
no spaces
Any better way to catch Last revision of 193554783? Right now the only thing that catches that is the !!! filter. We need something to catch bad words without spaces. --Thinboy00 @914, i.e. 20:55, 23 February 2008 (UTC)
In the house
This is always something like 'in the house of commons' or similar, I've never seen it be used for vandalism. Keep 'in da house' though. Thought I'd better bring it up here first. George D. Watson (Dendodge).TalkHelp 18:37, 20 March 2008 (UTC)
repeated braces: }}}}}}}}}}}}}}}}}
Repeated curly braces are often used in templates, is there any way to remove them from the list without removing all repeated characters? George D. Watson (Dendodge).TalkHelp 13:54, 21 March 2008 (UTC)
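How the tool's repeated-character filter is actually written is not shown on this page, so the following is only a sketch of one way to exempt particular characters. The function `hasRepeatedRun`, the run length of four and the default ignore set are all assumptions. It relies on a backreference (\1), which the discussion above says list entries cannot use, so it would have to live in the script rather than in the list.

```javascript
// Hypothetical check: is there a run of the same non-space character,
// other than characters we have chosen to ignore?
function hasRepeatedRun(text, minRun = 4, ignore = "{}~") {
  // (\S) captures one character; \1{n,} requires it to repeat.
  const re = new RegExp("(\\S)\\1{" + (minRun - 1) + ",}", "g");
  for (const m of text.matchAll(re)) {
    if (!ignore.includes(m[1])) return true;
  }
  return false;
}

console.log(hasRepeatedRun("}}}}}}}}}}}}}}}}}"));    // false: braces are ignored
console.log(hasRepeatedRun("Thanks ~~~~"));          // false: signature tildes are ignored
console.log(hasRepeatedRun("looooool"));             // true
console.log(hasRepeatedRun("{{{{Infobox}}}} aaaa")); // true: the run of "a" still counts
```

Adding "-" to the ignore set would likewise spare the trailing dashes in HTML comments mentioned under "repeated dashes".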
One more suggestion
I suggest adding "Sieg Heil" to the list of bad words; I fear that some might use it on Israel-related or Nazism-related vandalism. Alexius08 is welcome to talk about his contributions. 01:16, 21 April 2008 (UTC)
Filter for "ard"
This filter is matching parts of ordinary words. Is this in fact a real mark of a vandal? In the meantime, I'm enclosing it in /s so it only matches at word boundaries. --Thinboy00's sockpuppet alternate account 23:33, 9 June 2008 (UTC)
swallow filter
/swal+ow(a|e[rd]|in[g']?)?[sz]?/