User:Triddle/stubsensor
Stubsensor is a part of wpfsck that analyzes the English Wikipedia and tries to identify out-of-the-ordinary stub articles; the software never edits an article, only humans do. It currently uses statistical analysis and Bayesian filtering to detect improper stub tags.
Reports
- Jun 07, 2006: 2,000 more articles.
- Jun 23, 2005: 2,000 more articles are up for human review.
- May 16, 2005: Another 2,000 articles, generated by the first version of stubsensor to feature a cruft detector that eliminates some false positives.
- April 21, 2005: The top 2,000 articles with a high probability of being mislabeled as stubs, taken from the SQL dump of the same date. Completed in 7 days.
Stubsensor award
The coveted stubsensor award is given for bravery, commitment to service, and fearlessness in the face of even the longest stub. For a list of current award holders see the image page. To give the Stubsensor award to another Wikipedian, place a message on their talk page that includes the following wikicode:
[[Image:Stubsensor award.jpg|200px|frame|right|For your help with the Stubsensor cleanup project you are given the Stubsensor award.]]
Suggestions
Feel free to leave ideas here.
- Improper stub detection may work better if lists (defined as lines that start with # and *) and tables (defined with some other logic) are ignored in the byte count. That seems like it would help prevent false positives like Zdenek Moravec, Michel Serrault, Paul Wild (Swiss astronomer), Kirov Oblast, Fritillary. Triddle 19:08, 3 May 2005 (UTC)
- Perhaps there should be a place to list articles which are still stubs, even though stubsensor says that they may not be. Nereocystis 21:59, 3 May 2005 (UTC)
- Stubsensor incorrectly removed an info box from the end of Jeep Cherokee. Good job so far, but needs work. --SFoskett 13:43, May 4, 2005 (UTC)
- Stubsensor does not edit articles, only people do. Unfortunately that was a mistake by another contributor (see http://en.wikipedia.org/w/index.php?title=Jeep_Cherokee&diff=13219345&oldid=13217745); an honest mistake, I'm sure, and I'm also sure I accidentally removed a couple of valid tags too. Triddle 22:24, 4 May 2005 (UTC)
- The sensor treats every article with a table or a big list as eligible for stub removal; this is not the case if they are poorly organized, contain misleading/unverified information or are not given a meaningful explanation in the text of the article. So far, I feel the need to reinsert some stub messages that were removed by project contributors. IMHO, every article marked as a stub should be left as such unless it's above the average length of a quality article, just in case the topic should be described in far greater detail than your average two-paragraph non-stub article does. DmitryKo 21:12, 4 May 2005 (UTC)
- I believe you are correct in trying to address that problem. Articles that don't appear to be stubs but are missing information should be labeled as such; however, I don't think the stub tag is the right one. In this case it makes more sense to use the {{expand}} template. Instead of leaving a stub tag, which conveys little information, the expand tag specifically asks for information regarding expansion to be listed on the talk page. This way the shortcoming is known and a possible fix is present. Now anyone who wants to research the information can help. When the article is left marked only as a stub, what needs to be done to correct it is not readily apparent. Triddle 18:20, 21 May 2005 (UTC)
- This comment was moved from User Talk:Triddle/stubsensor. The stub was removed from free nerve ending, apparently because of stubsensor, but it is in fact a stub. Apart from a general definition, and an extremely short description of different kinds of nerve endings (which isn't even complete), this article contains virtually no substance. I'd be very cautious about listing articles for de-stubbing solely based on their lengths. People who participate in the project should be advised not to de-stub unless they are sure (i.e. they actually know something about the topic) that the article doesn't deserve a stub template. At the very least, you shouldn't count text in the references (or see also or external links) sections. --jag123 10:58, 5 May 2005 (UTC)
- The article may be incomplete yet not be a stub - I'd suggest that free nerve ending is the former but not the latter. Unsigned.
- "Stub" is not intended to be a "list of articles in this topic that need work" template. —Morven 03:21, May 7, 2005 (UTC)
- This is good advice and I have tried to incorporate your suggestions into the next project, see User:Triddle/stubsensor/working and feel free to leave comments about that page too.
- Personally, I'm highly skeptical that an automatic stub-sensor would work at all. Some topics cannot be described in anything less than a couple of paragraphs; others can make do with one sentence. Unless there is a rigid standard length for all articles, this thing does not make any sense at all. At the least, the suspected articles should be arranged by topic and not in plain, mechanical chunks of 20. --XF95.邪 06:58, 6 May 2005 (UTC)
- You are correct in many ways and certainly automatic stub-detection is not 100% accurate. However, the good news is that it does not have to be. That is why this is a human-powered cleanup project and no automated program ever edits or changes an article. Stubsensor merely identifies articles that are likely to be mislabeled as stubs and makes a list available to people so they can make decisions. This version of stubsensor had a problem because it went off the total size of the article. This wound up putting articles with only a couple of paragraphs and very long lists right at the top. Stubsensor has been adjusted so now it counts the size of the article minus any tables, or lists that start with # and *. This has made a dramatic difference in the number of false positives but unfortunately creates a few false negatives. Stubsensor is still imperfect and will certainly not identify all articles correctly, but again it does not have to. Even if stubsensor is only 60% accurate (I hope it passes that), that still means 1,200 false stub tags removed from a pool of 2,000. It may not be perfect but that is significant. The next project is also going to try to give better advice for deciding whether an article is a stub or a non-stub. Triddle 19:28, 7 May 2005 (UTC)
- It shouldn't be too difficult to add a Bayesian categorizer to stubsensor: there's plenty of ready-to-use code out there, e.g. the AI::Categorizer module. I should think the results would be pretty good. --- User:Chalst/128.36.233.100 29 June 2005 04:14 (UTC)
- Is it feasible to generate a list of pages that are not tagged as stubs but should be? This could also identify any false negatives in the stub detection, because they would become false positives. Susvolans (pigs can fly) 12:29, 1 Jun 2005
- I agree that this would be useful if it is feasible. DES 19:12, 20 July 2005 (UTC)
- I'm not sure if there's a more "live" discussion page someplace, so pardon me while I reanimate this year-old thread. I believe that User:Bluemoose used to construct lists of such candidates. I could probably do something similar, based on something very simple like length of articles, which is probably more reliable as a lower limit than it is as an upper one: how many non-stubs are shorter than 250 characters, say? That's probably not very useful for Bayesian learning, though. In order to get meaningful results for that, it'd be necessary to see how stub tags are added "in the wild". In theory this could be extracted from page histories, looking at whenever a stub tag is added to an article previously not in a stub category. That'd require a full duplicate database, and isn't something I'd be in a position to do anytime soon. A more feasible alternative is to compare stub category membership from one db dump to the next, which I should be able to do, in theory... Alai 18:15, 26 August 2006 (UTC)
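The list-and-table exclusion discussed in this thread can be sketched roughly as follows. This is only an illustration, not stubsensor's actual code: the function name `prose_size` is made up for the example, and since the thread leaves the table logic unspecified, the sketch assumes a table is everything between a line starting with `{|` and its matching `|}` line.

```python
def prose_size(wikitext: str) -> int:
    """Return the UTF-8 byte count of wikitext, ignoring lists and tables.

    Hypothetical helper: list lines are those starting with '#' or '*'
    (as described in the thread); the table rule is an assumption.
    """
    kept = []
    table_depth = 0  # > 0 while inside a (possibly nested) table
    for line in wikitext.splitlines():
        if line.startswith("{|"):
            # Table opener (assumed wikitext table syntax).
            table_depth += 1
            continue
        if table_depth > 0:
            if line.startswith("|}"):
                table_depth -= 1
            continue  # skip every line inside a table
        if line.startswith(("#", "*")):
            continue  # skip list items
        kept.append(line)
    return len("\n".join(kept).encode("utf-8"))


if __name__ == "__main__":
    sample = "Intro paragraph.\n* item one\n# item two\n{|\n| cell\n|}\nClosing line."
    print(prose_size(sample))
```

An article made up almost entirely of list items would then score near zero and stay tagged as a stub, which matches the reduction in false positives described above; text in sections such as references or external links is still counted in this sketch.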
Questions and observations
What is the current status of the stubsensor project? Is it dead, or just on hiatus?
Here are a couple of stub observations that I've made as I've tried to organize things for the Southern California WikiProject: 1) In looking at the local radio stations, all of them had {{US-bcast-stub}}, but at least a quarter of them were large enough that they didn't need it. 2) In looking at the User:Rambot-created articles on cities and communities, almost all of them did NOT have a stub template, although somewhere between 1/2 and 2/3 of them had fewer than 10 sentences that were not created by Rambot and therefore probably should have {{california-geo-stub}}. BlankVerse ∅ 15:55, 2 August 2005 (UTC)
- Stubsensor is going strong; it's been integrated into wpfsck and the next cleanup project will be organized shortly after the corresponding database dump. Can you create a list of the articles you are referring to? I'll go through the list manually and tidy things up; the new stubsensor runs off statistical analysis of the entire Wikipedia. If I can get those articles cleaned up before the next database dump, the stubsensor should be even more accurate. Triddle 01:00, August 3, 2005 (UTC)
- For all the LA-area radio stations, I've already gone and removed {{US-bcast-stub}} from the largest articles, but left it on the borderline articles that I know can be greatly expanded. I haven't yet looked at the San Diego-area radio stations, but I assume that I will probably find similar statistics: the broadcast stub on almost all of them, with a fair number of them already grown beyond stub size. The radio stubs are probably a big problem for the stub sensor since most (~80%) have infoboxes and other miscellanea which will inflate their size. You can find the rest of the articles with {{US-bcast-stub}} at Category:United States broadcasting stubs.
- As for the Rambot-generated articles, there are roughly 140 cities in Los Angeles County, roughly 60 unincorporated communities, and at least 250 districts and neighborhoods within the City of Los Angeles (although not all of them have articles yet). You can see my worksheet at [[1]]. BlankVerse ∅ 06:27, 3 August 2005 (UTC)