Wikipedia talk:WikiProject Vandalism studies/Study2
== some suggestions for this study, new perspectives ==
=== User:DavidCBryant's comments from the last study ===
- To do simple statistical analyses like the one you're after, spreadsheet software is appropriate. Has anyone entered the data into a spreadsheet program, like Excel, or Open Office? If not, I'll try to get it done in the next couple of days.
- Minitab would be even better. If someone has access to that it's possible to get some great regression analysis. --Spangineerws (háblame) 02:25, 28 March 2007 (UTC)
- Using the "random article" button to select articles for analysis makes a lot of sense. Selecting only the month of November for analysis makes less sense. Ideally, you might want to generate a random integer 1 through 12 (using a pseudorandom number generator) for each article selected for analysis, then analyze the edits for that month for that particular article. The problem with the procedure you used is that it may have introduced an unintentional systematic bias. Human behavioral patterns vary with the seasons, so it may be that you got an exceptionally high reading (or an unusually low reading) because people are grumpy (or benevolent?) in November, on average. Not that it's a big deal. Call it an opportunity for improvement.
- There are quite a few 'bot authors on Wikipedia. The process of extracting the raw data, at least, might be automated somehow. For example, a 'bot might select the random articles, select a month at random, then extract only the edit history records you're interested in and dump the whole thing into one page somewhere, where you guys could study the data without doing so much data collection. Just a thought.
- When you write your report, you might want to present the numbers two ways – both with and without the randomly selected articles that you discarded because no edits occurred in November. I'm only suggesting this because if you include that number of articles you can compute the likelihood that a randomly selected article is going to be edited in November, at least once in three years. OK, you'd probably want to dress it up a little and present it as the probability that a randomly selected article gets edited within one month. Anyway, it would just be good statistical practice to report how many articles you bypassed because there were no edits in November. It's part of full disclosure.
- You've divided the edits into two classes, "vandalism" and "not vandalism". I think three classes might be more appropriate: "vandalism", "revert vandalism", and "not related to vandalism". I think the distinction is meaningful, and probably not too hard to make. Anyway, I'm not sure how you counted reverts in your raw data, but maybe I didn't read the report closely enough.
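A minimal sketch of the random-month suggestion above, assuming Python; the article titles are placeholders, not articles from the study:

```python
# Minimal sketch: pair each randomly selected article with a pseudorandomly
# chosen month (1-12) instead of always analyzing November.
import random

def assign_random_months(articles, seed=None):
    """Return (title, month) pairs; a seed makes the draw reproducible."""
    rng = random.Random(seed)
    return [(title, rng.randint(1, 12)) for title in articles]

# Placeholder titles, purely for illustration
print(assign_random_months(["Economics", "Barack Obama", "Coffee"], seed=1))
```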
=== User:CMummert's comments from the last study ===
- This (last) study was very interesting and informative, but the small sample size (100) makes the final numbers subject to a large margin of error. With the sample of 100, the estimate of 5% of edits being vandalism has a margin of error of about 4% at 95% confidence; so I conclude from your numbers that there is a very high chance the real vandalism rate is less than 9%. In order to have a 2% margin of error with 95% confidence, if the real percentage of vandalism edits is 5%, you need to sample about 475 articles. Fortunately, the total number of WP articles doesn't matter, only the number that you sample. (A rough sketch of this arithmetic appears after this comment.)
- A second, more interesting, problem is that you are measuring the average percentage of edits per article that are vandalism. But there is another statistic that is equally valid: the average percentage of total WP edits that are vandalism. To see the difference, think about the extreme case where only one article on WP is ever vandalized, but it received 100% vandalism edits. Then your survey would show 0% average vandalism unless you were lucky enough to find that one article with the random article button. To measure overall averages, you would need to take a random sample of 1000 edits (I have no good idea how to do that without using a database dump) and determine how many of them are vandalism.
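A rough sketch of the margin-of-error arithmetic from the first point, using the standard normal-approximation formulas for a proportion; the inputs are assumptions of this illustration, and the result differs slightly from the ~475 quoted above, which presumably used slightly different rounding:

```python
# Normal-approximation formulas for the margin of error of a proportion
# and the sample size needed for a target margin of error.
from math import sqrt, ceil

z = 1.96        # z-value for 95% confidence
p = 0.05        # observed vandalism proportion

moe_at_100 = z * sqrt(p * (1 - p) / 100)
print(f"margin of error with n = 100: {moe_at_100:.3f}")   # ~0.043, i.e. about 4%

n_for_2pct = ceil(p * (1 - p) * (z / 0.02) ** 2)
print(f"sample size for a 2% margin: {n_for_2pct}")         # ~457 with these inputs
```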
=== User:Reywas92's comments from the last study ===
- In my opinion, the first vandalism study's group of articles was too small; much more than 100 should be used for this one. Multiple percentages were calculated for type of vandalism, reversion speed, and who made the edits, but only 31 acts of vandalism is much too small a sample to support such statements.
- For this one, perhaps the random article selection should be less random, purposely choosing some more popular articles to find more vandalism. Some random articles have so few links to them that vandals can hardly even find them: only 31 acts of vandalism in 100 articles. I doubt many random articles have ever been vandalized at all.
- Also, the time frame could be improved. Looking only at November edits is not comparable to now. We should look at current vandalism to understand the future, right? A historical look is good, but I think the present is more important. This would also make looking through histories simpler, since we wouldn't have to look so far back or skip the other months.
=== User:Xiner's comments from the last study ===
I second DavidCBryant's concern about focusing on any particular month. I just read the other day that the summer is a period of doldrums on Wikipedia, for example.
I second Reywas92's concern about the small sample size of acts of vandalism. While the number of 100 articles is itself small, the number of 32 vandalism edits definitely means that the conclusions regarding the authors of those edits (IP vs. registered) were invalid.
Perhaps Wikipedians knowledgeable about statistics can help ensure that your next study will be devoid of these deficiencies, because your efforts are definitely needed, and I don't want to see anyone's time wasted, especially when the topic is of such high importance. Thanks a lot, guys. Xiner (talk, email) 03:03, 29 March 2007 (UTC)
== structure ==
I was thinking that the Scientific method is something we should start following for study 2, seeing as how the structure has been tried and true. It also makes for a better understanding of where all the data is going - study 1 was kind of a data-before-the-study kind of deal, and I think this one should be more formal and easier to read in an intro/hypothesis/data/results/conclusion kind of way. How do others feel? JoeSmack Talk 04:05, 2 March 2007 (UTC)
== Non-vandalous Edit %s of registered and unregistered users ==
Background: The recent Wikipedia talk:WikiProject Vandalism studies/Study1#Draft conclusion for a sample of 100 articles states that 25% of vandalism reverting is done by unregistered editors and 75% is done by registered Wikipedians. It also states that "in a given month approximately 5% of edits are vandalism and 97% of that vandalism is done by anonymous editors."
For articles with a relatively high percentage of vandalism to total Edits or high volume of Edits, that's a good argument for semiprotect (administratively requested & possibly granted) status: It's discouraging for serious continuous editing to be distracted or deflected by a high volume of vandalism ("Why bother with this mess?"). The downside of semiprotection is no Edits (or reverts of vandalism) by unregistered editors.
In trying to decide whether semiprotect is worth the costs, here's a reasonable question to ask in construction of Study 2. Is the following ratio high or low for only nonminor Edits relative to total Edits?
(percent of nonvandalous Edits by unregistered users)/(percent of Edits by registered users)
If it is well below 1 (say, .25) and if the non-minor Edits improve the article more than minor Edits (big ifs), the ratio can be interpreted as indicating the article is not getting proportionate help from unregistered users, imposing higher Edit costs on registered editors. For articles with a low volume of vandalous Edits of course, semiprotect is not going to be so important (and conversely).
(The same question can be posed for minor Edits.)
If you have ever been interested in an article with a high volume of vandalous Edits and in need of big improvements, the stats and conjectures suggested above are likely to seem more pertinent to the desirability of semiprotect. I hope Study 2 could collect the above stats to assist in trying to decide how articles might be improved fastest. Vandalous edits are a kind of tax on nonvandalous Edits and article improvement. Study 2 might be helpful in suggesting the benefits relative to costs of semiprotect. Comments welcome. --Thomasmeeks 11:29, 25 March 2007 (UTC)
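A small worked illustration of the ratio proposed above; the edit counts below are invented for the example, not data from either study:

```python
# Hypothetical counts of non-minor Edits for a single article.
unregistered_nonvandal = 10   # non-vandalous non-minor Edits by unregistered users
registered = 80               # non-minor Edits by registered users
total = 100                   # all non-minor Edits examined

ratio = (unregistered_nonvandal / total) / (registered / total)
print(f"ratio = {ratio:.2f}")  # 0.12 here, i.e. well below 1 per the reading above
```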
- How exactly would you set up the procedure for this? I'm curious to how this could be run. JoeSmack Talk 16:08, 25 March 2007 (UTC)
- Well, the above refers only to data gathering, which would provide background, much as Study1 does. The semiprotect template by itself does nothing (not even deter, if my experience is representative). Clicking "request unprotection" in that template takes one to a section of Wikipedia:Requests for page protection. That project page is where one can request semiprotection. If the request is granted at that end, someone there puts up the template and activates it, so that only registered users can make Edits.
- As for a procedure for seeking semiprotect help, one could look at the last 50 Edits and count the number of reverts (checking that they were reverts of vandalism). The data could help in a semiprotect request. If I have misinterpreted your question, I'll try again. Thx. --Thomasmeeks 18:20, 25 March 2007 (UTC)
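That counting could likely be semi-automated. The sketch below is an assumption of this note, not part of the procedure described above: it pulls an article's last 50 revisions from the MediaWiki API and flags edit summaries that look like reverts. The keyword heuristic is rough, so flagged edits would still need manual inspection.

```python
# Pull the last 50 revisions of an article and count revert-looking summaries.
import requests

API = "https://en.wikipedia.org/w/api.php"

def count_recent_reverts(title, limit=50):
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvlimit": limit,
        "rvprop": "user|comment|timestamp",
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    revisions = page.get("revisions", [])
    # Heuristic: treat summaries mentioning "revert" or starting with "rv" as reverts
    reverts = [r for r in revisions
               if "revert" in r.get("comment", "").lower()
               or r.get("comment", "").lower().startswith("rv")]
    return len(reverts), len(revisions)

print(count_recent_reverts("Economics"))   # (revert-like edits, revisions examined)
```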
- So you're saying you want to count how many vandalisms there are before semi-protection is activated and a boilerplate is put up and how many vandalisms occur after? Basically you want to know how well semi-protection works? Interesting, I'd love to know how effective it is... JoeSmack Talk 21:46, 25 March 2007 (UTC)
- Well, I wasn't trying to be that specific. I witnessed something very dramatic for one article (Economics). Something close to 30 of the 100 Edits previous to semiprotect status on Feb. 17 were reverted. All the reverted Edits were by unregistered users. After semiprotect became effective (through an administrative request granted as per above), quick automatic reverts virtually disappeared. What Study1 found applied very well to that article: vandalism is overwhelmingly a problem from unregistered users. Only 3 Edits were reverted that I could detect in the 100 Edits that followed (rather than, say, 30). Those were not from vandalism and were accompanied by Edit summaries that well explained why there was a difference of editing opinion. So, semiprotect seems to work very well. --Thomasmeeks 23:51, 25 March 2007 (UTC)
- Interesting indeed. Well, how would you like to turn this into Study 2? I'd love to explore this area of vandalism, and I think it'd be valuable! JoeSmack Talk 00:34, 26 March 2007 (UTC)
- Oh and something like this proposal is currently running over at Obama article study. Check it out. Should we still make this study 2? JoeSmack Talk 00:38, 26 March 2007 (UTC)
- I'm sure this has relevance as well: Don't protect Main Page featured articles/December Main Page FA analysis. JoeSmack Talk 00:42, 26 March 2007 (UTC)
- I think that a big conclusion is highly likely to hold also for a sample larger than 1 (namely Economics). And data would be available for articles before and after semiprotect. Scanning the last article, what's missing is a comparison with registered users, which is what the above was getting at. --Thomasmeeks 02:05, 26 March 2007 (UTC)
- That might be one for statistics folks on their Talk page or that of Talk:Regression analysis. As small a sample of articles as would do the job (say 50 or fewer) would make it easier. High-volume articles subject to frequent reverts would add relevance. If there were a way of determining which general areas had the highest traffic volume (say by Wiki searches), that might be used. Clustering by different general topics (social sciences, philosophy, etc.) or by controversial subjects (Death and resurrection of Jesus, etc.) are possibilities. I think the object should be not randomization but clustering to address practical problems. Assignment to persons on those Talk pp. would spread the work load. --Thomasmeeks 12:26, 26 March 2007 (UTC)
(unindent) Well, we can't leave messages on Talk:Statistics and the like; talk pages for articles are about the articles, not experts in the field. What you can do is ask a statistician from Category:Wikipedian_statisticians to give some thoughts, but don't ask like 10 (no spamming 'round here). As for work load, nothing much needs to get spread around; we did the last study with three people and that was fine. So, what sort of initial mini-list of articles were you thinking of making? I'm curious as to how you would cluster articles, and I'm not sure how your highest-traffic-volume approach would work; do searches get a traffic score that I don't know about? JoeSmack Talk 12:44, 26 March 2007 (UTC)
- Points well taken. I'm assuming that Wiki searches via Google do give weight to traffic volume. I gave examples of clustering. I can't do better than that. If anyone thinks of a cluster that looks interesting, so be it. It was only a suggestion. --Thomasmeeks 13:23, 26 March 2007 (UTC)
- I guess Wikipedia has 100 most popular wikipedia articles, but I don't know of any Google traffic rating system. The examples you gave above peter out after about 4; I was wondering if you had a category list in mind, like the one WP:1.0 uses or something? I guess I'm wondering a) how generalizable we want to make this and b) how to accomplish that without getting sticky with bias. JoeSmack Talk 13:34, 26 March 2007 (UTC)
- Well, here's a way to address that. Consider a broad category (what I called "general topics" above) within which there are data points (called "observations" below) such as top 100, entertainment, social sciences, religion, etc. I'm assuming a search of Wiki within a category of Wiki using Google reflects traffic to some degree. The clustering (classifying) could be according to those categories, such as 10 from the top 100, 10 from entertainment, 10 from social sciences, 10 from religion, and the rest. The advantage is that for each category, there would be some measure of vandalism frequency for each observation in the sample. That would be the dependent variable. Independent variables would be the 0-1 dummy variables for each category ("no" = 0, "yes" = 1 for a given category). Then a linear regression could be run for vandalism frequency as a function of the categories to determine the statistical significance of any differences in vandalism rates across categories. --Thomasmeeks 18:44, 27 March 2007 (UTC)
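A minimal sketch of the dummy-variable regression described above, assuming Python/NumPy; the category labels and vandalism rates are invented placeholders, and a statistics package such as statsmodels would additionally report the significance tests mentioned:

```python
# Vandalism frequency regressed on 0-1 category dummies (invented data).
import numpy as np

# (vandalism rate, category) per sampled article -- placeholders only
observations = [
    (0.08, "top100"), (0.05, "entertainment"), (0.02, "social_sciences"),
    (0.01, "religion"), (0.06, "top100"), (0.04, "entertainment"),
    (0.03, "social_sciences"), (0.02, "religion"),
]
categories = ["top100", "entertainment", "social_sciences"]  # "religion" = baseline

y = np.array([rate for rate, _ in observations])
X = np.column_stack(
    [np.ones(len(observations))] +                           # intercept column
    [[1.0 if cat == c else 0.0 for _, cat in observations] for c in categories]
)

coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(dict(zip(["intercept"] + categories, np.round(coefs, 3))))
```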
- I'm still not clear on exactly how you intend to get the top 10 social sciences articles etc. via a Google traffic measure. That top 100 link is the only article popularity resource I know of...
- Also, what happens when one item falls clearly into two categories? JoeSmack Talk 04:59, 28 March 2007 (UTC)
- It could be that X different social sciences would more or less coincide with the top 10. A Wiki search even without Google, but with added terms to narrow the search in successive searches, might work, as might an advanced Google search of Wiki with more and more NOT terms in successive searches. Stating the method used would give transparency, even if subsequent bias were found. More than one category is fine, in fact good, if interaction effects are suspected. Then the combined effect might be significantly more (or less) than the sum of its parts, picked up by a multiplied-category variable. --Thomasmeeks 11:54, 28 March 2007 (UTC)
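Continuing the regression sketch above, a "multiplied-category" (interaction) variable is just the elementwise product of two dummy columns; the columns here are again invented for illustration:

```python
import numpy as np

top100        = np.array([1, 0, 1, 0, 1, 0], dtype=float)  # dummy: in the top 100?
controversial = np.array([1, 1, 0, 0, 1, 0], dtype=float)  # dummy: controversial topic?
interaction   = top100 * controversial   # 1 only where an article is in both categories
print(interaction)                        # [1. 0. 0. 0. 1. 0.]
```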
== Study3? ==
Once Study2 is wrapped up, let's look ahead. Actually, the following could be part of Study2:
- What's the effect of semiprotect on the rate of non-vandal-or-revert Edits?
In favor of an increase in that rate, registered users might be encouraged to do less reverting (since say 95 percent of vandalous Edits are by unregistered users), leaving more time for non-revert Edits. In favor of a decrease in that rate is that non-registered users would be blocked from any Edits with semi-protect. I'd guess that if the percentage of Edits by registered users is high, you'd get one result and vice versa, so that percentage would be a good control variable (an additional independent variable in the regression analysis). --Thomasmeeks 00:58, 29 March 2007 (UTC)
- Do you mean you'd like to make this Study 3, leaving Study 2 to the recent-changes vandalism approach? Either way I think it sounds like a terrific subject for a study - count me in. JoeSmack Talk 02:25, 29 March 2007 (UTC)
== Correlate page popularity with amount of vandalism ==
I'd be interested to know what sort of relationship there is between how many edits a page gets in a certain amount of time and what percentage of those edits are vandalism. For this study, using random article to find lesser-viewed articles might be appropriate, but more popular pages would also have to be selected, perhaps an article from the list of most-frequently vandalized pages. If I had to guess, I'd say that the graph of percentage of edits that were vandalism, as compared to useful edits, vs. page popularity would be exponential, and the graph of percentage of edits that were vandalism, as compared to total edits, vs. page popularity would be logarithmic, but that's just a hypothesis. Feel free to drop me a line on my talk page if you'd like my help with anything, I love numbers. shoy 16:00, 27 March 2007 (UTC)
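Once (popularity, vandalism-percentage) pairs were collected, the two hypothesized shapes could be compared with simple least-squares fits. The sketch below assumes Python/NumPy and uses invented numbers purely to show the mechanics:

```python
# Compare an exponential fit (log y linear in x) with a logarithmic fit
# (y linear in log x). The arrays are placeholders, not measurements.
import numpy as np

popularity = np.array([10.0, 50.0, 200.0, 1000.0, 5000.0])   # e.g. edits per month
vandal_pct = np.array([0.02, 0.05, 0.10, 0.18, 0.25])        # fraction of edits that are vandalism

exp_coefs = np.polyfit(popularity, np.log(vandal_pct), 1)
log_coefs = np.polyfit(np.log(popularity), vandal_pct, 1)

exp_sse = np.sum((np.exp(np.polyval(exp_coefs, popularity)) - vandal_pct) ** 2)
log_sse = np.sum((np.polyval(log_coefs, np.log(popularity)) - vandal_pct) ** 2)
print(f"exponential SSE: {exp_sse:.4f}   logarithmic SSE: {log_sse:.4f}")
```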
- This sounds something like the Obama article study. Popular page, frequently vandalized, comparing useful edits to vandalism ones. You might use that one as an extreme and a lesser-known one as another, and use those two as a pilot to see how you'd like to approach this more specifically. Interested? JoeSmack Talk 05:02, 28 March 2007 (UTC)
== Recent Changes? ==
Forgive me for asking a question that may have been asked before, but couldn't you also just go through the list of the Recent Changes that have been made? Yes, it is biased to the immediate, but if you analysed the last 50 edits at (say) three times during the day on three days of the week, wouldn't that work just as well? Iorek85 23:02, 27 March 2007 (UTC)
Yes, I came here to suggest something along the same line. I don't think that analyzing "random articles" for vandalism offers a good account. 80% of random articles are crud -- the long tail of the encyclopedia that have little attention paid to them by anyone, including vandals. I suggest a vandalism survey by "diff". Vandalism is per edit, not per article, so I think it makes sense to survey the edit pool, not the article pool; it will be naturally weighted toward what people are actually doing with (to!) Wikipedia. If you want to constrain the pool, you can survey some diff numbers to find the date range you want. If you randomly generate numbers within this range, and throw out the non-article-space diffs, you end up with a better sample of what's going on. –Outriggr § 03:42, 28 March 2007 (UTC)
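A sketch of the "survey by diff" idea, assuming Python and the MediaWiki API (neither is specified above): draw revision IDs at random within a chosen range and keep only those in the article namespace. The ID bounds are placeholders; mapping them to an actual date range would still need the kind of probing described above.

```python
# Randomly sample revision IDs and keep only article-namespace revisions.
import random
import requests

API = "https://en.wikipedia.org/w/api.php"

def sample_article_revisions(low_id, high_id, want=10, seed=0):
    rng = random.Random(seed)
    kept = []
    while len(kept) < want:
        revid = rng.randint(low_id, high_id)
        data = requests.get(API, params={
            "action": "query", "revids": revid, "format": "json"
        }).json()
        for page in data.get("query", {}).get("pages", {}).values():
            if page.get("ns") == 0:            # 0 = main (article) namespace
                kept.append((revid, page["title"]))
    return kept

# Placeholder ID bounds roughly bracketing the period of interest
print(sample_article_revisions(80_000_000, 90_000_000, want=5))
```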
- Wow, an approach that was right in front of my eyes! It's only a few items down in the sidebar, recent changes! ;) A much bigger-picture approach too, very different from how I was conceptualizing vandalism.
- When we were piecing together Study 1, I think the idea was to show how much vandalism you might bump into in the wild. No one likes reading through information about their favorite scientist only to find the end of the paragraph says "johhny is a total alcoholic, lLOL!!!". Vandalism from an 'integrity' standpoint: how much of 'good content' is bad content waiting to be discovered. Also it gave an opportunity to show how vandalism has increased or decreased over the years (although it appears not very much in either direction).
- I think thanks to recent changes patrol and a lot of hard work on the community's part, a shit-ton of vandalism and cruft articles never see more than 5 minutes of live time before being removed or deleted. The first 5 minutes of an article's or edit's life would be a completely different picture and indeed perspective on vandalism, and I think it'd be really cool to pursue this idea. Anyone up for it? JoeSmack Talk 05:08, 28 March 2007 (UTC)
- I can see where you are coming from with Study 1; article integrity is important, as is the time to revert, and both of those are missed with recent changes. But for a simple % of vandalism edits, RC would be the place to look, I think. As you say, with RC patrollers, the vast majority of vandalism (especially obvious vandalism reverted by bots) doesn't last long. You could work out a rough estimate of how much vandalism escapes by subtracting the 'revert' edits from the 'vandalism' ones, but one problem is that to look back five minutes would require manually checking about 500 edits, by the look of it.
- Another idea that just came to me would be to take the list of, say, 100 recent changes. Take down the names of the articles that are vandalised, and the time of vandalism. Then check back (at any time afterwards, which is the good part) and see how long it took for that vandalism to be reverted. It'd just be another way of gaining 'time to revert', but biased toward the more heavily edited articles.
- And if you're less masochistic, on the article integrity front, you could give a 'chance the page you are looking at is vandalised' statistic, which would be cool. At least it would be simple; just use the random button and note if the page you are viewing is vandalised or not vandalised. Iorek85 11:57, 28 March 2007 (UTC)
- Recent changes studies on a very small scale HAVE been done. For examples, see one I did: [1] and one done by Opabinia regalis (talk • contribs): [2]. These may serve as models for how to do study 2. There are LOTS of good possible studies that could be done by tracking recent changes:
- Correlation between anon/logged status and type of edit (Good/Test/Vandalism)
- Correlation between time of day and type of edit (Good/Test/Vandalism)
- Correlation between level of activity and type of edit
- Correlation between length of article and type of edit
- Correlation between type of article and type of edit
- Correlation between user experience and type of edit (do users with more edits make better edits, regardless of anon/logged status?)
- Just some ideas to mill over. There are LOTS of studies to be done, and we may not get to all of them, but they could ALL be useful in driving Wikipedia policy decisions, and helping to improve the encyclopedia. Again, check out the earlier RC studies. They could shed light on how this and future studies can be done. --Jayron32|talk|contribs 18:10, 28 March 2007 (UTC)
(unindent) I think that this recent changes approach would be a great method, both in its simplicity and its efficacy. I do think it is most susceptible, as already suggested by Jayron32 (thanks for those other studies, awesome), to time-of-day vandals. For instance, check out this graph made by User:Nick showing that at 10pm there are about twice as many external links being added as at 10am, all days of the week. A similar trend might be present in recent-changes vandalism. I think this kind of study would reveal that too, which would be interesting as hell; a strength and a weakness! JoeSmack Talk 02:37, 29 March 2007 (UTC)
== Volunteering ==
You guys did a terrific job on Study 1. I'm not sure how you divide up the duties, but when you have settled on a method for Study 2, please feel free to drop me a message assigning me a task (e.g., Data Points 20 - 30). Would love to be some small help, as time permits. Jonathan Stokes 04:30, 28 March 2007 (UTC)
- Absolutely! We'll rouse you for tasks, and of course value any input before and after tasks too. :) JoeSmack Talk 05:10, 28 March 2007 (UTC)
- I don't have much input on your methodology...I'm happy to be a workerbee. FYI, I just blogged your first study, hopefully with all appropriate credits and disclaimers. Tomorrow, I expect this should get picked up in this Wikipedia blog aggregator and this one, too. My hope is to help you draw attention and volunteers to the project. You're doing impressive work here, and it deserves recognition. Jonathan Stokes 05:56, 28 March 2007 (UTC)
- Thanks for the praise. Your volunteering made me realize that we should probably set up a volunteer section so I have done that below. Remember 12:40, 28 March 2007 (UTC)
== Volunteers section ==
Please place your name here if you are willing to help us gather or analyze data for our next study:
- Remember 12:38, 28 March 2007 (UTC)
- Jonathan Stokes 16:56, 28 March 2007 (UTC)
- JoeSmack Talk 02:26, 29 March 2007 (UTC)
- Xiner (talk, email) 02:56, 29 March 2007 (UTC)
== Random edit study ==
I also had an interesting discussion on my talk page that I thought I would move here. In order to get a random sampling of edits we could randomly choose numbers and then go to that specific edit number in the wikipedia edit index (which I had no idea existed). What do others think? Remember 12:37, 28 March 2007 (UTC)
- The only suggestion I would have for your vandalism study would be to take a random sample of edits instead of a sample of articles. The reason being, as it stands, heavily edited pages will have (what would seem) a disproportionate weight in the resulting statistics. If you are trying to estimate the true rate of vandalism in terms of vandalized edits per total edits, then the effect of this will be increased variance of your statistics in the best case (that vandalism rates are uniform across articles), and increased variance PLUS BIAS in the worst case (that some articles are systematically vandalized at a higher rate than others). So if the goal is to estimate the encyclopedia-wise rate of vandalizing edits, my recommendation would be to randomly sample edits, not articles. Btyner 22:35, 26 March 2007 (UTC)
- Yes, but how can we do that when there is no random edit button? Any ideas? Remember 22:43, 26 March 2007 (UTC)
- Do the randomization with your own software; randomly draw from the integers {a,...,b}, where a corresponds to (say) the first edit of the year and b corresponds to a fairly recent edit. Say your pseudorandom number generator tells you to check edit number 87310788. Then go to the corresponding edit and compare it to the previous version to see if it was vandalism. There may be a more automatic solution, but you'd have to ask someone more advanced in such things. Btyner 22:53, 26 March 2007 (UTC)
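A minimal sketch of this procedure, assuming Python; the ID bounds are placeholders standing in for the "first edit of the year" and "fairly recent edit" described above:

```python
# Draw a random edit number and build the diff URL that compares it to the
# previous revision, for manual inspection.
import random

first_edit_of_year = 70_000_000   # placeholder for bound "a"
fairly_recent_edit = 87_310_788   # placeholder for bound "b"

revid = random.randint(first_edit_of_year, fairly_recent_edit)
print(f"https://en.wikipedia.org/w/index.php?diff=prev&oldid={revid}")
# Open the printed URL and judge whether the edit was vandalism.
```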
- I had no idea that you could check edit number 87310788. Where can I find out about this edit index? Remember 22:57, 26 March 2007 (UTC)
- Try Wikipedia:Odometer which has information in this vein. Note that for some reason the link I gave above originally had /w/ instead of /wiki/ in the title but now I've fixed it. Btyner 23:02, 26 March 2007 (UTC)
- A random edit thing would require a datadump, I think. I'd like to point out, too, that I've tried RCP a few times, but have been discouraged because there seem to be so few vandals there. I've heard that you can only effectively find vandals on RCP with software tools. Xiner (talk, email) 03:05, 29 March 2007 (UTC)