Wikipedia:Bots/Requests for approval/SpellCheckerBot
- The following discussion is an archived debate. Please do not modify it. Subsequent comments should be made in a new section. The result of the discussion was Approved.
SpellCheckerBot
Operator: —— Eagle101Need help?
Ok, before you folks freak out at the bot name, let me state first that it will not edit articles. Let me start off with the basic stuff. It's programmed in Perl with the assistance of perlwikipedia and my own framework that builds on top of it. In addition, the bot makes use of the Aspell library. The bot is operated by the programmer, that is, me. SpellCheckerBot will run daily, probably a category or two a day. It should not need to exceed about 2-3 edits a minute. As this proposal has come up before, let me explain the function details.
The bot will load pages from a category and do some preprocessing before applying the spellcheck. First it will find all words inside blue wikilinks (those that have articles) and add them to a temporary spelling database, to avoid flagging things like names, places, etc. The bot will also avoid anything in ALL CAPS, or inside <code> or other related tags. It will also ignore anything that is bolded or in italics (these words are also added to the temporary spelling database). All output will go to a main list of entries that looks something like: article || word -- suggestions. I have done some trials on articles on my computer, and it looks to be fairly decent, getting about 3 wrong for every 100 reports, though I'd like to improve this of course; this is only after about 4 hours of experimentation. :)
Depending on how successful it is, I might make up a signup page where users can sign up for X spelling errors to be delivered to them on a particular day of the week. There will also likely be a page where users who have signed up can submit common errors or words that the bot needs to have added to its database. —— Eagle101Need help? 08:22, 4 June 2007 (UTC)
- Also a request: any trial approval should be for about 250 detected spelling errors, placed on its own userpage. —— Eagle101Need help? 08:29, 4 June 2007 (UTC)
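The preprocessing described in the proposal above could be sketched roughly as follows. This is only an illustration in Python, not the bot's actual Perl code: the function names are hypothetical, a small word set and difflib stand in for Aspell, and every wikilink is treated as blue because checking whether an article exists would need a call to the wiki.

```python
import difflib
import re

WORD = re.compile(r"[A-Za-z]+")

def collect_exempt_words(wikitext):
    """Gather words from wikilinks and bold/italic spans (assumed correct)."""
    exempt = set()
    # [[Target]] or [[Target|label]]; assumption: every link is a blue link.
    for inner in re.findall(r"\[\[([^\[\]]+)\]\]", wikitext):
        exempt.update(w.lower() for w in WORD.findall(inner))
    # ''italic'', '''bold''' and '''''bold italic''''' spans.
    for inner in re.findall(r"'{2,5}(.+?)'{2,5}", wikitext):
        exempt.update(w.lower() for w in WORD.findall(inner))
    return exempt

def candidate_words(wikitext):
    """Return the words that should actually be spellchecked."""
    # Drop the contents of <code> and related tags entirely.
    text = re.sub(r"<(code|pre|nowiki)\b[^>]*>.*?</\1>", " ",
                  wikitext, flags=re.S | re.I)
    exempt = collect_exempt_words(text)
    return [w for w in WORD.findall(text)
            if not w.isupper() and w.lower() not in exempt]

def check_page(article, wikitext, dictionary):
    """Build report lines in the form: article || word -- suggestions."""
    report = []
    for word in candidate_words(wikitext):
        if word.lower() in dictionary:
            continue
        # difflib is a stand-in for Aspell's suggestion engine.
        suggestions = difflib.get_close_matches(word.lower(), dictionary, n=3)
        report.append(f"{article} || {word} -- {', '.join(suggestions)}")
    return report

if __name__ == "__main__":
    known = {"the", "is", "a", "in", "and", "city"}
    page = ("The [[Eiffel Tower]] is a '''landmrk''' in NASA "
            "<code>teh</code> and teh city.")
    for line in check_page("Example", page, known):
        print(line)
```

In this sketch "landmrk" is skipped because it is bolded, "NASA" because it is in all caps, and the first "teh" because it sits inside a code tag; only the last "teh" is reported.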
Discussion
So it edits in a userspace? E talk 08:24, 4 June 2007 (UTC)
- Yes, its own, and possibly in the future the talk pages of those who sign up for the bot's reports. —— Eagle101Need help? 08:25, 4 June 2007 (UTC)
- Not keen on the idea, you'd still get people following through on the different international spellings like -ise and -ize; though not directly the bot's fault, I'm sure some would see it as legitimising such "corrections" --pgk 18:32, 4 June 2007 (UTC)
- You'd have to be careful regarding international spellings - add common words like color/colour etc. to an exempt list. It's not possible to do every international spelling, but as pgk said above, you could exclude words ending in things like -ise and -ize. Also, if this does go through, you should add a feature to check for misspellings in RC and alert the editor - I know that I would immediately sign up for that, and from my understanding of bots it shouldn't be too hard to implement (I know RC are easy in the framework I'm using, but should be standard in any). Anyway, looks like a great bot, and I support it. Matt - TheFearow 21:09, 4 June 2007 (UTC)
- The bot will be using multiple libraries, including US, Canadian, and British, plus variants covering the -ise and -ize thing. This is not an attempt to convert spelling from one format to another. —— Eagle101Need help? 06:03, 5 June 2007 (UTC)
- Let me clarify, the bot is using some 10-12 libraries, including local variants of US, British, and Canadian spelling. I'm sure there will be some problems if some obscure spelling is reported as wrong, but this is not a British to US spelling bot. Any and all wrongly reported words and patterns will be added to a custom database, and treated as correct. —— Eagle101Need help? 06:06, 5 June 2007 (UTC)
- Don't forget Australian English on that list :P E talk 11:31, 5 June 2007 (UTC)
- Yes, that is one of the alternate libraries that I can and will use. I think I have all the major bases covered with internationalization. If you want to look up the software I'm using behind this, it's Aspell. Again, I really doubt the bot is going to have any major internationalization issues. This is not intended to convert spelling from any one form to another. —— Eagle101Need help? 12:01, 5 June 2007 (UTC)
- OK, the only other thing I'd suggest is that you mark the pages where the results are delivered to point out that anyone following them shouldn't do so blindly and is responsible for their own edits, if they get in a dispute about it, they can't blame the bot. --pgk 18:11, 5 June 2007 (UTC)
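The approach discussed in this thread (several regional dictionaries plus a custom database of wrongly reported words) might look something like the sketch below. It is a Python illustration under stated assumptions: the class, method names and tiny word sets are hypothetical stand-ins, and the real bot uses Aspell dictionaries from Perl.

```python
class VariantChecker:
    """Accept a word if any regional dictionary or the custom list has it."""

    def __init__(self, dictionaries, custom=None):
        # dictionaries: mapping of variant name -> set of lowercase words
        self.dictionaries = dictionaries
        # custom: words previously reported in error, now treated as correct
        self.custom = set(custom or ())

    def is_known(self, word):
        w = word.lower()
        return w in self.custom or any(
            w in words for words in self.dictionaries.values())

    def mark_false_positive(self, word):
        """Record a wrongly reported word so it is not flagged again."""
        self.custom.add(word.lower())


if __name__ == "__main__":
    checker = VariantChecker({
        "en_US": {"color", "organize"},
        "en_GB": {"colour", "organise"},
    })
    print(checker.is_known("colour"))     # accepted via the British list
    print(checker.is_known("Xian"))       # not in any list yet
    checker.mark_false_positive("Xian")
    print(checker.is_known("Xian"))       # accepted after being recorded
```

Because a word passes if any variant accepts it, the checker never prefers one regional spelling over another, which matches the stated goal of not converting spellings between forms.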
Approved for trial. - 250 spelling errors to be reported to a subpage under the bot's user space, with a big flashy warning at the top (keep edit rate < 2 per min). Martinp23 15:44, 5 June 2007 (UTC)
- Trial is done; several errors have come up that I did not see when running the bot the first time through during my own testing. As I took this from the cleanup category, there seem to be a lot of foreign words and names. Before running again I'm going to have to find a list of names and have the bot load that. In addition, I will be having the bot use foreign language libraries, in particular Greek and Latin, but I will add others to be on the safe side. Ideas on where to find a list of names are welcome. Output is at User:SpellCheckerBot/spellcheck. —— Eagle101Need help? 10:17, 6 June 2007 (UTC)
- Once that's done, do another 250 errors, but I better not see any flashing warnings on the userpage. --ST47Talk 01:00, 8 June 2007 (UTC)
- Interesting. A couple of suggestions: how about completely ignoring two-letter words? Those seem to generate a lot of false positives (particularly foreign language stuff, like the French 'de', 'du', 'et', 'tu', 'te', etc.) and are likely to be spelled correctly (I mean, 'the' and 'teh' is one thing...). Another thing to look into would be three-letter words that start with X or Y. It looks like those are more likely than not Chinese names or places. -- Seed 2.0 09:54, 9 June 2007 (UTC)
- The two-letter word false positives were due to some iffy HTML formatting on one of the pages. The bot now knows how to parse that. I will bring this bot back online shortly for a second trial to its own userpage. This time it will have libraries for about 10 foreign languages as well as English words. —— Eagle101Need help? 00:25, 16 June 2007 (UTC)
- The above discussion is preserved as an archive of the debate. Please do not modify it. Subsequent comments should be made in a new section.