Wikipedia talk:WikiProject Vandalism studies

From Wikipedia, the free encyclopedia

WP:WPVS (talk)
Study 1 (talk) (finished)
Study 2 (talk)
Obama Study (talk)
Related Projects/Pages
Wikipedia:Researching Wikipedia
Wikipedia:WikiProject Wikidemia
Wikipedia:Statistics
Wikipedia:The Motivation of a Vandal

To do list

Anyone should feel free to edit the to-do list below to come up with more ideas. Remember 14:17, 4 January 2007 (UTC)

  • Gather other editors of interest into the project
  • Come up with proposed studies to conduct and implement them
  • Revise project page to make it more accessible
  • Figure out what Study 2 will be
  • Finish Obama study

Typology

Note: I'm not putting my name on the items below; feel free to edit directly. If there are disagreements, let's note that in a discussion section. Or split this off into a subpage? John Broughton | Talk 00:22, 4 January 2007 (UTC)

Targets of vandalism

There are a variety of targets that vandalism can hit:

  • Articles:
    • Main page article
    • Other featured articles (FA)
    • Good articles (formally so categorized - GA)
    • Other articles (not GA, not FA)
  • Templates
  • Wikipedia namespace (including Wikipedia talk)
  • User namespace (including User talk)
  • Categories
  • Redirects
  • External links
  • References
  • Media (pictures, music files, etc.)

Sources of vandalism

Vandalism comes from:

  • Anonymous IP addresses
  • Newly registered users (typically vandal-only accounts)
  • Disruptive editors (limited but some constructive work)
  • Trolls, sock puppets, etc. - disgruntled "power users"

Types of vandalism

The IBM study identified five major types:

  • Articles:
    • Mass deletion: deletion of all contents on a page
    • Offensive copy: insertion of vulgarities or slurs
    • Phony copy: insertion of text unrelated to the page topic
    • Idiosyncratic copy: adding text that is related to the topic of the page but which is clearly one-sided, not of general interest, or inflammatory
  • Redirects:
    • Phony redirection (a redirect with a misleading pipe)

Methods of vandalism

How do we feel about a Methods of vandalism section? Will it do more harm than good to be specific? JoeSmack Talk 05:43, 4 January 2007 (UTC)
I think it would be helpful to create categories, but we don't necessarily have to state how one can do the more reckless types of vandalism. Remember 14:02, 4 January 2007 (UTC)
Alright, identifiable categories but not a diagram of how to do them; I think that is sound. JoeSmack Talk 14:17, 4 January 2007 (UTC)

Impact of vandalism

Proposed studies

In order to study vandalism and the response to vandalism on Wikipedia, I had a couple of ideas. First, we could somehow gather data on a random sample of vandalism that occurred during a particular time period or on particular pages. The second idea, which would be much more controversial, would be to engage in some small systematic vandalism on random pages and see what gets reverted first, what manages to stay on, and why. I believe that this idea would face lots of controversy, but I think it may yield interesting data, and it seems like a controlled way in which to study certain aspects of vandalism. Remember 14:07, 4 January 2007 (UTC)

I think the more random we can make the sample pool the better. The first idea should be tried before the second; many have done 'experiments' like the second, and I don't think we need to go that route unless other avenues prove not to be as useful. If we make the first study empirical, focused, rigorous and detailed, I think it could provide a lot of useful information. JoeSmack Talk 14:20, 4 January 2007 (UTC)
Here's a link to a tool that shows the most popular articles. Another way of selecting a sample would be random pages, of course.
As for the second study, that's absolutely a violation of WP:POINT as proposed. I for one am not particularly interested in being blocked for a week (or whatever) for screwing around with Wikipedia. An acceptable alternative would be to IDENTIFY a sample of vandalism from a stream of edits (rather than, say, from an article's history of reverts), but not revert the vandalism. We'd need to keep the data offline in order to avoid anyone tampering with it, but it's doable. John Broughton | Talk 22:43, 4 January 2007 (UTC)

Proposed Study 1

Each member randomly chooses a certain number of pages (via the random article link), then goes through the whole history of each article and categorizes what vandalism occurred, who was responsible for it, and how long each piece of vandalism remained. Remember 21:03, 4 January 2007 (UTC)

Any study should start small and be iterative; that is, avoid doing a lot of work and then realizing that it was done wrong. It's better to spend more time planning than a lot of time regretting.
So, first, we should restrict this to (say) the last three months of edits. Second, we should do a test run with a very small number of pages. Third, we should put together the table(s)/lists/whatever where we'll record the data before we actually start grabbing data. (For example, do we want to count non-vandalizing edits; and how do we want to categorize the type of vandalism?) Which means there probably should be a WikiProject Vandalism studies/Study1 subpage set up before anyone starts any type of counting. Assuming, of course, we have consensus on moving forward. John Broughton | Talk 22:43, 4 January 2007 (UTC)
I think this is a great idea. Move forward, by all means; you have my help. JoeSmack Talk 05:56, 6 January 2007 (UTC)

Discussion

First, we probably need to think about how to set up the data gathering. Here is a proposed table that others should feel free to edit. Remember 16:18, 6 January 2007 (UTC)

Page | Date of edits examined | Total number of vandalisms | Each vandalism with reference | Comments
Wiki page | Oct 2006–Dec 2006 | 15 | ?? | ??
...Let's choose what we are going to hit before we design the hammer. ;) Which articles are we going to monitor, how many, and for how long? And what is our workable definition of vandalism here? Anything reverted? Anyone with a history that doesn't look like a good contrib? We need a solid, solid definition first and foremost. JoeSmack Talk 17:36, 6 January 2007 (UTC)
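Before any counting starts, the proposed table can be pinned down as a concrete record format. The sketch below is only an illustration: the field names, category labels, and sample values are hypothetical, not anything the project has agreed on.

```python
from dataclasses import dataclass

@dataclass
class VandalismIncident:
    """One row of the proposed data table (all field names are hypothetical)."""
    page: str                 # article title
    date: str                 # when the vandal edit was made
    category: str             # e.g. "obvious", "deletion", "POV", "linkspam"
    diff_link: str            # reference to the specific edit
    minutes_to_revert: float  # how long the vandalism stayed visible

# A toy sample, not real data:
incidents = [
    VandalismIncident("Example A", "2006-11-03", "obvious", "diff=1", 14.0),
    VandalismIncident("Example A", "2006-11-20", "linkspam", "diff=2", 120.0),
    VandalismIncident("Example B", "2006-11-07", "obvious", "diff=3", 2.0),
]

# Tallies of the kind the study proposes:
per_page = {}
for inc in incidents:
    per_page[inc.page] = per_page.get(inc.page, 0) + 1

avg_revert = sum(i.minutes_to_revert for i in incidents) / len(incidents)
```

Agreeing on a record shape like this first means everyone's counts can be merged mechanically later.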

Proposal 2

Alright, how about this. 1. Which articles are we going to monitor, how many, and for how long?

We take a random sampling of articles as chosen by the random article link on Wikipedia. We look at all the edits for one month of the year for 2004, 2005, and 2006. That will help us see how vandalism has changed each year. Remember 17:48, 6 January 2007 (UTC)
I like this idea. For some reason I thought we were all going to put 100 articles on our talk pages and watch them for a month. Let's pick something like November of 2004, 2005, and 2006, look through the history for vandalism, and record it. How big should our sample size be? Should we go until we reach 1,000? More? Less?

2. And what is our workable definition of vandalism here? Anything reverted? Anyone with a history that doesn't look like a good contrib?

I vote that we split vandalism into several different categories, similar to the IBM study: obvious added vandalism (adding various curse words and other nonsense), deleted information, and subtle vandalism (other additions that are intended to push a POV or otherwise harm the article). Remember 17:48, 6 January 2007 (UTC)

Further discussion

I'm down with the IBM study's analysis; the only aspect I'd approach with caution is "Idiosyncratic copy: adding text that is related to the topic of the page but which is clearly one-sided, not of general interest, or inflammatory". We want clear boundaries between offensive (Max Tucker is a dickbox!), phony (Max Tucker is actually a computer programmer with too much time on his hands) and blatantly POV (Max Tucker has been demonstrated to be an inflexible human being throughout all walks of life). I think we should separate all these classes too, and keep the blatantly POV class (idiosyncratic); I believe this one will be the most, uh, subjectively defined class of vandalism. How does this sound?
We should also set up an examples page so we can get our definitions straight on some instances. This is where we can decide whether we want to classify vandalism by one or multiple classes. For instance, example.com vandalism should be defined as phony vandalism, but should it also be deletion vandalism when it replaces a section/page? Similarly for vandals who replace the 'criticisms' section of their favorite film director with the word 'shitcock': is this offensive vandalism or deletion? Or both? We need to decide whether we're going to run a multiple-class system on single instances or not, and then follow by having a subpage with examples of vandalism that are classed, for our sake and for everyone else looking in on the work. JoeSmack Talk 18:19, 6 January 2007 (UTC)
On another note, we should have other users go through and do a second classification for reliability. We don't want the results from the previous classification to be apparent because it'll taint the second assessment, but we can either work a second table out or maybe put data in black text over black background or something clever like that to help. If we can show our reliability is high, it'll really help the credibility of our study. JoeSmack Talk 18:27, 6 January 2007 (UTC)
And again, on another note: this won't include article creation vandalism. Those usually get reverted right off the bat. If the random article button lands on an obvious db-, it should be db-ed and moved on from, right? Also, in terms of linkspam: should this be counted as vandalism? Lots of links to places to get viagra etc. are added and removed all the time; should that be counted? And if so, what about less encyclopedic links, like YouTube copyvios and inappropriate user blogs? JoeSmack Talk 18:56, 6 January 2007 (UTC)
I vote not to count linkspam. I have set up the first study page at Wikipedia:WikiProject Vandalism studies/Study1. I think we can work through these issues as we go through the study there. I am going to try to start setting up a table there. Remember 18:22, 11 January 2007 (UTC)
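The double-classification idea raised above is an inter-rater reliability check. One standard way to quantify it, offered here purely as a suggestion (the discussion above doesn't name a statistic), is Cohen's kappa, which corrects raw agreement between two classifiers for chance agreement:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters' labels, corrected for chance agreement."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Toy example: two editors classifying the same six edits.
a = ["obvious", "obvious", "POV", "linkspam", "obvious", "deletion"]
b = ["obvious", "obvious", "POV", "obvious", "obvious", "deletion"]
kappa = cohens_kappa(a, b)  # 1.0 would mean perfect agreement
```

A kappa well above chance would support the credibility argument made above without needing tricks like black-on-black text to hide the first pass.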

Proposal 3: The viability of allowing anyone to edit

Cost-benefit analysis: try to weigh the benefits of allowing anyone to edit (readers fixing typos, new registered users who wouldn't have gotten interested if it weren't for anonymous editing, etc.) against the costs (vandalism, and good, particularly expert, contributors leaving because of that vandalism). Remember that two contributors aren't necessarily equal: you must weigh the cost of losing an expert against the gain of another contributor to TV show articles.

Further discussion

Although this proposal probably seems quite daunting, if the study were properly performed, it could make a significant difference to the continuing viability of the anonymous editing policy. --Seans Potato Business 02:24, 19 February 2007 (UTC)


Proposal for Study 2

It is common not to semi-protect Featured Articles while they're on the front page, the idea being that people new to Wikipedia who visit an FA will get a good sense that Wikipedia is open for anyone to edit.

I propose we conduct a study looking at a number of factors for FAs on the day they appear on the front page:

  • how many edits, vandalist edits, sorts of vandalist edits, reverts and time to revert
  • how many edits have been made by new users (use "User contributions")
  • Compare FA's by day of the week.

JackSparrow Ninja 11:43, 23 February 2007 (UTC)

This is what I started to do last December, although I had not seen this edit at the time, but it became too much of a chore (each article required about 8 hours of work), plus I became very disenchanted with the reactions to vandalism. I captured edits to a FA during its time as FA: the date and time of the edit; the editor; the nature of the edit (beneficial, neutral such as copyedits, or harmful); and, if the edit was harmful, the consequences of the edit: date and time of reversion, reverting editor, and personal consequence to the "vandal".

The disenchantment mainly came from the last item, the personal consequence to the "vandal". On almost every article that I studied there were multiple Final Warnings ("This is your final warning. If you continue to make destructive edits . . . you will be blocked from editing.") issued to the same editor for successive edits, but the editor was never blocked.

Another source of disenchantment was the edits that were constructive in that they provided additional information, sometimes with references, that, while pertinent, was contrary to some editor's POV and was reverted by that editor as violating NPOV. Somewhere around the third or fourth time I encountered this I realized that assumed ownership of articles was possibly a larger problem than vandalism. By that time I knew that this page was here and had read part of the entries, so I knew that vandalism was recognized as a problem. I have not been able to find any group that is concerned about editors who adopt articles or groups of articles and act as gatekeepers, reverting edits on the grounds of misapplied WP guidelines.

So I will continue to use WP as a resource, since there are a lot of very good articles on WP on subjects of serious academic interest, well sourced and either unbiased or with a very obvious bias, and do a little copy editing here and there where I find need for it. I will simply ignore those articles where I find an editor who reverts edits that are not personally acceptable to him. I read the talk page of any article that seems interesting; an editor who has assumed ownership of an article is easily discovered on the talk page.

Good luck with your efforts.

JimCubb (talk) 22:59, 20 April 2008 (UTC)

Categorizing vandalism

I've set up a page at Wikipedia:WikiProject Vandalism studies/Types of vandalism. If you have proposed changes, I suggest you just edit over what is there, rather than doing a threaded discussion, and post your comments here. John Broughton | Talk 19:33, 7 January 2007 (UTC)

Notifying the community

Any ideas how to go about letting people know that this project is going on? I have a feeling that there are other interested people out there, but that our project may not be the easiest to find. Remember 18:36, 7 January 2007 (UTC)

I've got some links, but no time to follow up on them: Wikipedia:WikiProject (for general info, I think); Wikipedia:WikiProject Council/Guide (best practices), and Wikipedia:WikiProject Council/Directory. If someone else would take a look, that would be great. John Broughton | Talk 18:50, 7 January 2007 (UTC)
Any talk pages where you find an active discussion of the problem of vandalism and the counter vandalism unit. --Seans Potato Business 02:25, 19 February 2007 (UTC)

Bayesian spam filtering

Hey folks, it just crossed my mind that we already have a solution to vandalism. Take how spam was dealt with in the world of email: Bayesian filtering. Theoretically speaking, we should be able to apply the same thing to vandalism. Even if we don't have access to the text people are putting into articles, we have enough input to make reasonable assertions (anonymous IP, number of lines changed, edit summary, article edited, etc). Over the next month I'm going to try to implement this into WikiGuard. I'll let you know how it goes.  :) --Brad Beattie (talk) 18:48, 7 January 2007 (UTC)

Bayesian and other methods for determining what is and isn't vandalism could be very useful as front-end tools for editors who are fighting vandalism, since they presumably reduce scanning work by human beings. I don't, however, think that they are particularly good at coming up with robust, defensible numbers on who is doing vandalism and what type of vandalism they're doing. So while I encourage your efforts, I don't think this WikiProject should wait to see what happens with them. John Broughton | Talk 18:56, 7 January 2007 (UTC)
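For what the Bayesian suggestion might look like in practice, here is a minimal naive Bayes sketch over boolean edit features. The feature names, training data, and labels are invented for illustration; this is not WikiGuard's actual design.

```python
from collections import defaultdict
from math import log

class NaiveBayes:
    """Tiny naive Bayes over sets of boolean edit features, with Laplace smoothing."""
    def __init__(self):
        self.class_counts = defaultdict(int)
        self.feature_counts = defaultdict(lambda: defaultdict(int))

    def train(self, features, label):
        self.class_counts[label] += 1
        for f in features:
            self.feature_counts[label][f] += 1

    def classify(self, features):
        total = sum(self.class_counts.values())
        best, best_score = None, float("-inf")
        for label, count in self.class_counts.items():
            score = log(count / total)  # log prior
            for f in features:
                # Laplace smoothing so unseen features don't zero everything out.
                score += log((self.feature_counts[label][f] + 1) / (count + 2))
            if score > best_score:
                best, best_score = label, score
        return best

# Toy training set (invented feature names):
nb = NaiveBayes()
nb.train({"anon", "no_summary", "blanked_section"}, "vandalism")
nb.train({"anon", "no_summary"}, "vandalism")
nb.train({"registered", "has_summary"}, "good")
nb.train({"registered", "has_summary", "added_ref"}, "good")
nb.train({"anon", "has_summary"}, "good")
```

A scorer like this could flag edits for a human to review, which fits Broughton's point: good as a front-end triage tool, not as a source of defensible statistics.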

How to conduct a study - an example

Here's a modest but useful example of what I think a useful study looks like. The following comes from here.

  • Study goal: To evaluate if it is true that positive contributions from anonymous users far outweigh vandalism.
  • Data sampling approach: an informal tally of anonymous contributions I came across in my daily vandal-cleaning efforts from June 12 to August 15, 2006
  • Results:
    • vandalism - 116
    • non-vandalism - 505

One could criticize the study for its non-random approach, for not clearly defining how vandalism was determined, and for not evaluating whether non-vandal edits were significantly useful, but still, the evaluation did provide useful information: it appears that anonymous IP edits are constructive far more often than not. And that in turn is actionable: since the benefits of allowing anonymous IP addresses to edit Wikipedia articles seem to outweigh the costs, the current approach should continue. John Broughton | Talk 20:00, 7 January 2007 (UTC)
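As a rough sketch of the arithmetic behind that tally (the confidence interval is an addition for illustration, using a textbook normal approximation; it is not part of the original study):

```python
from math import sqrt

vandal, clean = 116, 505           # counts from the informal tally above
n = vandal + clean
p = vandal / n                     # observed vandalism proportion, ~19%
se = sqrt(p * (1 - p) / n)         # standard error, normal approximation
low, high = p - 1.96 * se, p + 1.96 * se  # rough 95% interval
```

Even with the sampling caveats, an interval this far below 50% supports the conclusion that anonymous edits are constructive far more often than not.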

I think this is the general case, but the study should be more nuanced. Some articles are very heavily vandalized, so much so that the vandal's edits and the reverts take up more than half of the history. I've been working on The Simpsons for a long time and it's a daily struggle to keep it vandal-free. --Maitch 14:45, 11 January 2007 (UTC)

Started first study

Go to WikiProject Vandalism studies/Study1 to check it out and help make it better. Remember 22:40, 11 January 2007 (UTC)

I moved it to Wikipedia:WikiProject Vandalism studies/Study1; it was previously in article space. Trebor 22:54, 11 January 2007 (UTC)

Need help

Anybody that wants to help get 100 data points for our first study Wikipedia:WikiProject Vandalism studies/Study1 please let me know because we need all the help we can get. Below is a copy of the current results we have so far, which I think are interesting. Remember 17:30, 18 January 2007 (UTC)

Current cumulative tally

Total edits 2004, 2005, 2006 = 100
Total vandalism edits 2004, 2005, 2006 = 5
Percentage of vandalism to total edits = (5/100)= 5%

November 2004

Total edits in November 2004 = 15
Total vandalism edits in 2004 = 2
Percentage of vandalism to total edits = (2/15) = 13.33%

November 2005

Total edits in November 2005 = 45
Total vandalism edits in 2005 = 2
Percentage of vandalism to total edits = (2/45) = 4.444%

November 2006

Total edits in November 2006 = 40
Total vandalism edits in 2006 = 1
Percentage of vandalism to total edits = (1/40) = 2.5%

Percentage of overall vandalism that was

Obvious vandalism = (4/5) = 80%
Inaccurate vandalism = (0/5) = 0%
POV vandalism = (0/5) = 0%
Deletion vandalism = (0/5) = 0%
Linkspam = (1/5) = 20%

Percentage of overall vandalism that was done by

Anonymous editors = (4/5) = 80%
Editors with accounts = (1/5) = 20%
Bots = (0/5) = 0%

Reverting

Average time before reverting = (7991+14+6816+18+2561)/5= 3480 minutes
Percentage of reverting done by
Anonymous editors = (0/5) = 0 %
Editors with accounts = (5/5) = 100%
Bots = (0/5) = 0 %
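The tally above reduces to simple ratios; as a cross-check, this sketch recomputes the headline numbers from the counts and revert delays quoted above:

```python
# Monthly counts reported in the tally above.
edits = {"2004": 15, "2005": 45, "2006": 40}
vandal = {"2004": 2, "2005": 2, "2006": 1}

total_edits = sum(edits.values())          # 100
total_vandal = sum(vandal.values())        # 5
overall_rate = total_vandal / total_edits  # 5%

rate_2004 = vandal["2004"] / edits["2004"]  # 13.33%

# The five revert delays, in minutes, as quoted in the tally.
revert_minutes = [7991, 14, 6816, 18, 2561]
avg_revert = sum(revert_minutes) / len(revert_minutes)  # 3480 minutes
```

A script like this makes it easy to re-derive every percentage whenever new data points are added.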

Still need help

I still need people to help with the first study. We are now up to 40 points, but I would like to get northwards of 100. Here are the current results based on the first 40 points. Remember 17:12, 28 January 2007 (UTC)

Current cumulative tally

Total edits 2004, 2005, 2006 = 150
Total vandalism edits 2004, 2005, 2006 = 8
Percentage of vandalism to total edits = (8/150)= 5.3%

November 2004

Total edits in November 2004 = 22
Total vandalism edits in 2004 = 2
Percentage of vandalism to total edits = (2/22) = 9.09%

November 2005

Total edits in November 2005 = 59
Total vandalism edits in 2005 = 5
Percentage of vandalism to total edits = (5/59) = 8.47%

November 2006

Total edits in November 2006 = 69
Total vandalism edits in 2006 = 1
Percentage of vandalism to total edits = (1/69) = 1.45%

Percentage of overall vandalism that was

Obvious vandalism = (7/8) = 87.5%
Inaccurate vandalism = (0/8) = 0%
POV vandalism = (0/8) = 0%
Deletion vandalism = (0/8) = 0%
Linkspam = (1/8) = 12.5%

Percentage of overall vandalism that was done by

Anonymous editors = (7/8) = 87.5%
Editors with accounts = (1/8) = 12.5%
Bots = (0/8) = 0%

Reverting

Average time before reverting = (7991+14+6816+18+2561+4+11+11)/8 = 2,178.25 minutes
Percentage of reverting done by
Anonymous editors = (0/8) = 0 %
Editors with accounts = (8/8) = 100%
Bots = (0/8) = 0 %

Barack Obama

This widely watchlisted article has been running unprotected since 31 January. move=sysop protection was added on February 5. See the article's talk page for discussion about allowing IP edits. Would some volunteers here be willing to turn their analytical skills to this article? It is likely to be linked from the main page "In the news" box during the weekend of 10 February when Obama is expected to announce his plans for the 2008 presidential election. --HailFire 18:48, 8 February 2007 (UTC)

What sort of study were you thinking about? Remember 20:02, 8 February 2007 (UTC)
Something a lot like this. Such a study could inform the continuing talk page discussion mentioned here and help the article's editors to build greater consensus on IP edits. More broadly, the analysis may provide useful guidance on the relative merits of protection/unprotection for other high visibility political articles visited with frequent vandalism. --HailFire 21:08, 8 February 2007 (UTC)

Still thinking such a study might provide useful insights for managing vandalism on this and other closely watched articles with broad readership. The article is currently in unprotected status. --HailFire 11:02, 6 March 2007 (UTC)

I ran across this project, and this looks like a neat mini-project that I can take up. I'll create it at User:BuddingJournalist/ObamaAnalysis. BuddingJournalist 08:20, 9 March 2007 (UTC)
Feel free to set it up as a study under the wikiproject vandalism study section (e.g. Wikipedia:WikiProject Vandalism studies/Obama article study) so the whole group can help out. Remember 14:57, 9 March 2007 (UTC)
Indeed, this looks really interesting. JoeSmack Talk 15:11, 9 March 2007 (UTC)
Move completed! Feel free to contribute! BuddingJournalist 01:35, 10 March 2007 (UTC)

Lots of new data for the unprotected period between 12 and 17 March, for anyone who wants to give it another look. Would also be interesting to track time of day and geolocation data, perhaps putting this in a graphic. Some examples here. --HailFire 15:08, 19 March 2007 (UTC)

A study of my user page

I carried out a vandalism study on my own user page and found that 47% of the vandalism was made by registered users. Angela. 22:13, 29 March 2007 (UTC)

User box

I created a user box for this project. Let me know what you think. Remember 20:34, 30 March 2007 (UTC)

WVS This user is interested in studying vandalism.






WikiProject Wikidemia

I found this project through a recommendation from the FA protection discussion; I hadn't realized it existed before then, probably due to lack of searching on my own part. I'd like to know how this project should relate to the more general Wikipedia:WikiProject Wikidemia, which I've followed for a few months but haven't had much involvement in. That project is a more general research effort (though fairly inactive at present) and seems a logical parent for this one. I think all efforts related to vandalism research should be redirected here to avoid any repetition and to keep all of those involved together. It should also make this project easier to find.

My other question is whether there are any other related research pages or projects like this I may have missed. Does anyone know of any? Richard001 03:10, 11 April 2007 (UTC)

Please see WP:RW. And by all means add this project to it.-- Piotr Konieczny aka Prokonsul Piotrus | talk  07:25, 24 April 2007 (UTC)

Wikiversity

Would this project perhaps fall more under the scope of Wikiversity than Wikipedia? --Remi0o 06:37, 16 April 2007 (UTC)

Not really, it's more an internal thing, and not particularly relevant to Wikiversity. Richard001 22:57, 16 April 2007 (UTC)

Recent vandalism study done by User:Colonel_Chaos

I just saw User:Colonel_Chaos/study over at the village pump. I haven't had time to read it, but I thought I'd let the project here know about it. It looks like he vandalized things himself, very WP:POINT, but here it is. JoeSmack Talk 23:50, 1 May 2007 (UTC)

Interesting. That raises a semi-ethical question about our research here: are we allowed to partake in vandalism to further our understanding of it? If we cannot, it does place some restrictions on our research; for example, we have to rely on observation rather than true experimentation. If we were to revert all of the vandalism done, would that be acceptable? It's problematic to have a group of Wikipedians going around vandalizing things, but it also helps with research, so it's an open question whether or not it's acceptable. Richard001 00:05, 2 May 2007 (UTC)
I don't think that going against WP:POINT is a good idea; at least he knows he did. I don't think this project wants to walk into that minefield, to be honest. His sample size was pretty small, just like ours, so his average of 10 hours to revert probably isn't the strongest result. It does raise an open question: how long does vandalism typically remain visible? Right now this project has tended to lean more towards a per-edit incidence rate, but the question is still important nonetheless and we should keep it in the back of our heads. JoeSmack Talk 18:33, 2 May 2007 (UTC)
I agree with Joe at this point. I think we would just garner animosity towards this project if we condoned this behavior. But I do think we should add his study to the list of studies that have been conducted on wikipedia. Remember 18:44, 2 May 2007 (UTC)
P.S. He used a registered user to vandalize, which means these results are the reversion time of vandalism by a registered user and not users in general. JoeSmack Talk 18:36, 2 May 2007 (UTC)

Some general comments on my study. First of all, I admit that the sample size was rather small, but would you really want me to expand it? As for the strength of my result, I'm not sure that 10 hours is really here or there, but I think that my study clearly demonstrates that it takes a very long time to revert vandalism. I was dealing with Featured Articles here for crying out loud, not stubs. I'd wager that a similar study with stubs would generate an average revert time of never. You may not like my methods, but my conclusions warrant consideration. Colonel Chaos 21:49, 2 May 2007 (UTC)

Keep in mind that measuring is easy; knowing what you are measuring is the hard part. Maybe we're not measuring vandal reverting in FAs but vandalism in a small group of FAs that don't get reverted quickly. That'd turn the whole set of results on its head. That's why we want big sample sizes, to help reduce that possibility. The limit of vandalizing yourself is that you run into ethical issues like 'uh, should I really be expanding the sample size?'
Anyways, we're not here to pick a fight but to study vandalism, and while we might mince methods, your study is interesting and it does raise some interesting questions to consider for upcoming studies and their aims. How do you feel about where we're going with study 2 or the Obama study? JoeSmack Talk 22:26, 2 May 2007 (UTC)
Did you keep track of how many vandalism warnings, if any, were left on talk pages when your vandalism was reverted? Perhaps you can add that to the study. Also your picking the FA article to vandalize, and how, introduced bias into what was otherwise a very good concept for a study.--Chrisbak 04:16, 3 May 2007 (UTC)
I noticed that for each of the usernames you created, you added some text to your user and user talk pages so that they would not be red, eliminating any quick suspicion that you're a "newbie". Just thought I'd point that out to those who didn't catch it. I'd imagine if you hadn't done that, those times would be a lot less. Edit: Also, I think all of your edits were marked minor. Just trying to find all the variables here. :) Pizzachicken 05:40, 4 May 2007 (UTC)

Study of IP Vandalism

Hey guys, just found this page recently. A while back I did a survey at User:Cool3/Analysis, on the percentage of vandalism done by anonymous editors. My methodology may not have been perfect, but it does have the advantage of a very large sample size. Hope you don't mind that I listed it under previous studies. Cool3 15:01, 5 May 2007 (UTC)

Thanks for the link. I will add it to our list of individually done studies. Remember 21:00, 6 May 2007 (UTC)
whoops, you've already added it. Remember 21:01, 6 May 2007 (UTC)

Study of schools and universities

I'm interested in getting some statistics on the contributions of shared IP addresses, especially schools and universities. From my experience the contributions of these addresses are almost universally puerile vandalism and nonsense. I'm interested in seeing what the breakdown of the contributions actually is, as well as comparing the levels of vandalism coming from the two kinds of institution (one would hope there would be slightly less nonsense on average from universities, but I wouldn't be that surprised if it was the other way around either...)

This should help provide some empirical basis for discussion of shared IP addresses, and catalyze discussion of policies for interacting with these institutions. Frankly, I believe a shared IP address should be banned if it does not, on average, improve the quality of Wikipedia articles, regardless of whether there are good-faith edits there as well. Students who vandalize behind the shield of a shared address cannot be held responsible for their actions, as they are totally anonymous. This situation just encourages them to show off and damage articles. I believe a policy forcing users to create their own accounts would allow students to be held responsible for their edits and drastically reduce vandalism. Just today I have seen frequent vandalism from schools on several of the articles I watch, and I feel powerless to stop them. I'm not even going to do anything about it, because frankly I can't do anything about it. The constant but slow flow of vandalism is not enough to warrant a block most of the time, which leaves me little option but to revert the edit and feel a little embarrassed for Wikipedia, and a little sorry for those 10% or so of readers who encounter the vandalism and come away knowing only that 'james is gay'. I'm not sure I will have time to conduct a study myself for a while at least, but I'd at least like to propose it so we can discuss the subject, and if any editor or editors would like to undertake it themselves or work out the details, we can make a start.

For selection we could use Template:SharedIPEDU. Looking through a few entries I can't find any order to the list at all. It may be 'random' enough as it is to select from. Richard001 00:31, 25 May 2007 (UTC)

Vandalisms per article

Will the WVS be doing any vandalisms-per-article studies that track that number over time in the near future? Specifically, to see whether major events affect this number (e.g. natural disasters, terrorist attacks, movie releases, etc). --The Dark Side 02:38, 27 May 2007 (UTC)

Not that I know of. All active projects are shown at the top of the page. I take it you mean vandalism per unit time in relation to an external event relevant to the article? Richard001 03:01, 27 May 2007 (UTC)

I recommend proofreading the write-up of the study

I had to make some changes, because there were some glaring errors. Without the actual raw data I can't check the rest of it. I don't know where the raw data is. I do find it amusing that the modal time taken for vandalism to be reverted is instantly. 217.43.138.193 22:58, 12 June 2007 (UTC)

What exactly do you mean by "Because articles were randomly sampled and not edits, a ratio estimate must be used to calculate the percentage of edits that are vandalism."? I don't see any ratios in the write-up, nor evidence of their use. 217.43.138.193 23:07, 12 June 2007 (UTC)
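To clarify the terminology for anyone following along: a ratio estimate here just means pooling the edit counts across the sampled articles before dividing, rather than averaging each article's own vandalism percentage. A minimal sketch in Python, using invented per-article counts purely for illustration:

```python
# Hypothetical per-article counts from a random sample of articles:
# (total edits examined, edits judged to be vandalism)
sample = [(120, 6), (15, 0), (40, 4), (300, 9), (8, 1)]

# Ratio estimate: pool the counts, then divide. Heavily edited
# articles correctly carry more weight than rarely edited ones.
total_edits = sum(n for n, _ in sample)
vandal_edits = sum(v for _, v in sample)
ratio_estimate = vandal_edits / total_edits

# Naive alternative: average the per-article percentages. Because
# articles (not edits) were sampled, this over-weights rarely edited
# articles and is biased as an estimate of the population-wide
# percentage of edits that are vandalism.
naive_estimate = sum(v / n for n, v in sample) / len(sample)

print(f"ratio estimate: {ratio_estimate:.3f}")
print(f"naive per-article mean: {naive_estimate:.3f}")
```

With these made-up numbers the two estimates differ noticeably, which is exactly why the write-up's phrasing matters: sampling articles and sampling edits are not interchangeable.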

[edit] Answers

  • How effective are bots in curtailing vandalism?
    Much less effective than real users, but they can operate for longer periods of time.
  • Are editors any more likely to continue or desist vandalizing if warned by a bot instead of a person?
    Slightly less likely, but the difference is not significant; for the most part, warnings achieve absolutely nothing.
  • How long does vandalism typically remain visible?
    The typical time is, of course, inversely related to the obviousness of the vandalism and the visibility of the article. In most cases, no more than a few minutes.
  • Who is responsible for vandalism? What are the demographics of the vandal population?
    Mostly anonymous users, mostly in the United States, probably mostly children. More cannot really be determined with any degree of accuracy.
  • What proportion of vandals are on dynamic IP addresses, and hence very hard to block?
    The number is small, though not completely insignificant, and is certainly a lot less than it used to be. (Some dynamic ISPs now send X-Forwarded-For headers, allowing MediaWiki to record these users' real IP address rather than a dynamic one).
  • Who is responsible for reverting vandalism?
    A reasonably-sized group of regular contributors (and a bot) are responsible for dealing with most vandalism. Other regular contributors who do not devote their time to fixing vandalism but deal with it as they come across it, and the occasional new or anonymous user make up the rest.
  • How much time do editors waste cleaning up vandalism?
    Approximately 5% of all edits, though each of these edits takes a few seconds at most.
  • What effects does semi-protection have on the level of vandalism of protected articles?
    It dramatically reduces, but does not completely eliminate, vandalism to the semi-protected article.
    • Do vandals just choose another article to edit instead? How can we test this?
      Possibly. There is no way to reliably test this short of standing behind vandals and watching what they do.
  • What level of vandalism is considered acceptable before semi-protection or some other measure is needed? How should the 'level of vandalism' be measured? (See Wikipedia talk:Protection policy#A more explicit semi-protection policy for articles subject to vandalism)
    It will vary greatly depending on which administrator you ask. Generally, persistent vandalism to a page by multiple, unrelated anonymous and/or registered users at the rate of several incidents every few hours will merit temporary semi-protection of that page.
  • Are IP edits ever responsible for improving a featured article while on the Main Page?
    Very rarely. It has happened, though.
  • What motivates people to vandalize articles (See Wikipedia:The_Motivation_of_a_Vandal)? How can we minimize the satisfaction they get from doing it?
    Boredom. Integrate an anti-vandalism bot into MediaWiki itself so that vandal edits do not even save.
  • Why do certain articles attract more vandalism than others?
    Because the subject is more popular with Wikipedia's typical demographic. The level of vandalism can be roughly correlated with the popularity of the article, once the effect of semi-protection is ignored.
  • What types of vandalism are there? What message are they trying to get across? Why do vandals not fully realise that their actions are futile?
    Unless a specific message is inserted into the edit, they are not trying to get a message across. They are simply bored. Most of the time, they do realize their actions are futile, but they are bored and can't think of anything better to do.
  • What strategies can we employ to catch vandalism quickly?
    Connect to the recent changes IRC feed, apply an algorithm to pick out likely vandalism, monitor the output and revert where necessary. I'm not going to go into details on the algorithm because this method is far more effective if everyone uses their own. (Otherwise, everyone is going after the same edits).
    • How can we catch most of it at recent changes?
      The page itself isn't very useful, but can probably be made to work in an approximation of the above way using JavaScript, for those who do not have access to IRC.
    • How can we establish a situation where almost every article has someone responsible for maintaining it? Is this even a good idea? (See WP:OWN)
      It would be better to assign people to time periods, rather than articles, and have them watch recent changes in that period. However, this being a voluntary project, any attempt to assign people to things will always be less effective than simply allowing them to choose to do what they want to do. Being less harsh on people who spend most of their time dealing with vandalism would be a better strategy (right now, their contributions are frequently dismissed as "worthless", they are hassled over every mistake they make, and they are barred from seeking adminship).
  • What impact does vandalism have on the reputation of Wikipedia?
    A moderate negative impact, but one not so strong as the perceived general unreliability and inaccuracy of the project.
  • What sort of financial gains can be made from using Wikipedia to advertise - are spammers just wasting their time, or can it actually be profitable? Are our anti-spam measures adequate?
    Addition of external links to Wikipedia articles used to greatly increase search engine rankings, increasing site traffic. While it no longer does, it does lead to a smaller, more direct increase in site traffic; from there, financial gains arise, most obviously through advertising.
  • How good are editors at reverting vandalism? That is, is it reverted properly, or is it often dealt with poorly, e.g. removing a whole paragraph that the vandal has simply altered in meaning. Also, how often are vandals properly warned on their talk page after committing an offense?
    The word "revert", when used on Wikipedia, actually means 'to restore a page to a previous version'. Provided the correct previous version is chosen, all reverts remove the vandalism correctly. Removing a whole paragraph that a vandal has altered in meaning would have to be done manually and not by reverting, unless a recent version existed without that paragraph. The most common error is to revert vandalism by one user, but neglect to revert further vandalism by another user immediately preceding it. This error is most often made by the anti-vandalism bot, reducing its usefulness, as it is necessary to follow after it checking its edits. (Occasionally reverting a revert back to the vandalised version is another curious and very annoying 'feature' of the bot). "Properly warned" is rather an odd concept, especially since warnings are virtually useless. Contributors who do not warn vandals are being no less useful than those who do (indeed, you could argue they are saving server load and avoiding provocation) and do not deserve to be hassled for it.
  • What is the overall contribution from schools and universities like?
    Anonymous contributions from high schools (or equivalent) are usually mostly vandalism. Long-term anonymous-only blocking of the relevant IP addresses effectively deals with this while allowing established users at those institutions who wish to contribute to continue doing so. While vandalism can also originate from universities, it is rare for a long-term block to be necessary.
  • What happens to vandalism levels when edits won't show up in the current version of the article - a trial of something like stable versions, where the vandal cannot vandalize the actual article people see, or something functionally similar, is needed. Perhaps a small section (e.g. all articles in a certain category) could be tested out.
    They would be reduced, though only slightly. A more effective solution would be to not save the vandalism at all.
  • How does the rate of vandalism vary throughout the day?
    It correlates almost precisely with the overall level of site traffic. The 'vandalism information' template is imprecise, irregularly updated and on the whole useless, and I'm not entirely sure why people even use it, when they would get a far better indication of the level of vandalism by looking at the traffic graphs.
  • Angela suggests there would still be problems with vandalism if anonymous editing was blocked. How can we test this hypothesis? Certain categories could be experimentally altered to block anonymous editors, but then vandals could just choose an article that wasn't protected. We would have to block all IP editing, which would certainly be controversial, even just to gather a small sample of data. The blocks would also have to allow newly registered users to edit, otherwise there wouldn't be time to create an account and then wait 4 days. Perhaps we could use a comparative method by doing the experiments on another wiki instead?
    Of course there would still be problems; we get vandalism from registered users now, and the level could only increase if anonymous editing was prevented. Semi-protection is essentially a small-scale test of this, so it has already been tested. Why certain categories? Why not just continue what we are currently doing, and selectively disable anonymous editing to problematic articles by semi-protecting them?
  • Quantitatively, how are levels of vandalism affected (both in terms of percentage of edits and number of edits) when there is external attention drawn to an article (e.g. Slashdot or The Colbert Report)? Do levels of vandalism return to normal (e.g. in elephant) in all cases? How quickly?
    The level of vandalism to elephant is still far higher than it was before attention was drawn to it. However, such incidents affect only individual articles, or a small handful of them, and can usually be dealt with simply by semi-protecting for a few days; occasionally more long-term protection is needed.
  • How well does Wikipedia:Flagged revisions work in practice?
    Reasonably well on very small wikis, but would be absolutely and totally useless here. You asked above "how much time is wasted fixing vandalism"; multiply that by a factor of about 15 to get the amount of time that would be wasted on the even more useless task of flagging revisions.

All your questions have been answered. You may now go and do something useful. —Preceding unsigned comment added by Gurch (talkcontribs)
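The recent-changes monitoring strategy described above (connect to the feed, apply an algorithm to flag likely vandalism, review the flagged edits) can be illustrated with a toy scoring function. The word list, weights and threshold below are invented purely for illustration — Gurch deliberately did not share a real algorithm, and an actual tool would tune these against labelled data:

```python
import re

# Invented indicator list; a real anti-vandalism tool would use a
# much larger, tuned set (see the "Vandalism content list" idea below).
BAD_WORDS = re.compile(r"\b(poop|gay|sucks|lol)\b", re.IGNORECASE)

def vandalism_score(added_text, is_anonymous, size_change):
    """Score one edit from a recent-changes feed; higher = more suspect."""
    score = 0.0
    score += 2.0 * len(BAD_WORDS.findall(added_text))
    if is_anonymous:
        score += 1.0
    if size_change < -500:  # large removal of content (mass deletion)
        score += 2.0
    if added_text.isupper() and len(added_text) > 20:  # shouting
        score += 1.5
    return score

# Edits scoring above some chosen threshold get shown to a human patroller.
edit = {"added": "james is gay lol", "anon": True, "delta": +16}
print(vandalism_score(edit["added"], edit["anon"], edit["delta"]))
```

The design point made above holds for any such scorer: if every patroller runs the identical algorithm, they all chase the same edits, so diversity in heuristics increases overall coverage.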

[edit] Vandalism content list

For automated vandalism detection similar to Lupin's Anti-Vandal Tool it would be very important to have a list of the content added by bored vandals. To collect this information in an unbiased way (i.e. without the existing word lists) I thought that the people who conduct the ongoing vandalism studies might also copy the detected changes, either on a page here or into a file for download somewhere else. Cacycle 02:06, 20 June 2007 (UTC)

[edit] suggestions

I would like to suggest some other Research questions:

What proportion of people editing from IP addresses make constructive edits versus vandalism edits?

What proportion of people editing with usernames make constructive edits versus vandalism edits?

If IP users are responsible for the majority of vandalism edits, would it be best to allow only people registered with usernames to edit Wikipedia?

Z E U S 04:29, 6 July 2007 (UTC)

Just place your questions on the list itself; there's no formal procedure for doing so or anything. Just make sure they aren't already on there, of course. Richard001 05:00, 6 July 2007 (UTC)

[edit] Help needed?

Hi, I'd like to offer my services and experience to this study, if required. I've spent almost a year creating and working on the uw- series of warnings, but have begun to see these as 'first aid', so to speak, instead of looking for a cure to the problem. I, and others, created the uw- system with an idea of how these warnings were to be implemented: assuming good faith for new talk pages, a minimum of two warnings, etc. But more and more often I see at WP:AIV editors looking for blocks having jumped straight to a 4th level warning. I've already offered at WT:AIV to write an essay on how warnings should be issued, including some case study style examples, which was met with a lukewarm response. So if I can help out or sign up please just let me know. Regards Khukri 09:14, 19 August 2007 (UTC)

Right now it seems like the whole project is in the doldrums. No one has time to be director of the second study so if you are interested please take over. Remember 11:52, 19 August 2007 (UTC)
Any information on the effectiveness of warnings is welcome. Studying the response of vandals who are warned/not warned would be interesting. Recently a user suggested that warnings were a complete waste of time, and it would be more productive just to keep patrolling and reverting. It would also be interesting to see how warnings are used - how many instances of vandalism result in a warning, and which templates are people using? Which are the best to use in what circumstances?
I use uw-bv a lot myself, since it gives the flexibility to block quickly. I find going through the 1-2-3-4-block cycle a waste of time myself, and only use uw-1 when it could be a mistake or good faith edit, and only use 2 for intermediate cases. I only jump to uw-only if it's particularly bad, but I also get annoyed when people use uw-1 when someone plasters obscenities all over the page, and often adjust the warning to something a lot more stern. I warn most of the time.
The project has become fairly inactive, and most of the people who signed up haven't really done anything, but it can easily be kicked into action by anyone that wants to start something. Richard001 06:34, 20 August 2007 (UTC)

[edit] New essay section

I've added a new section to the essay Wikipedia:The motivation of a vandal. -- The Anome 11:35, 21 September 2007 (UTC)

[edit] Study Idea (School IPs)

I think a study should be done to determine whether anonymous edits from school IPs have a net benefit to Wikipedia or not. It seems like a lot of vandalism comes from school IPs, and we might be able to stop it by requiring all school IPs to sign up for an account to edit. However, data would be needed before such a policy could be proposed. Life, Liberty, Property 12:17, 16 October 2007 (UTC)

See above. I've had a brief look at the edits of my own university, and it's quite difficult to classify them into good, bad and evil, but it would definitely be worth looking into. Richard001 05:10, 17 October 2007 (UTC)

[edit] A thought

I know the proper procedure is to send a warning to a vandal's user talk page after any vandalism event, but has anyone done a study on vandals who do not get warnings sent to their talk pages? What I mean is: is sending countless warnings just feeding them attention, causing them to repeat the vandalism anyway? Just a thought.... Ottawa4ever (talk) 03:56, 11 December 2007 (UTC)

Just on a whim, because my interest was piqued by the discussion on Protecting the Feature Article, I have been collecting data for the FAs in this month. I have only collected the time that an edit was made, whether the edit was constructive or destructive (Most destructive edits are obviously so. Where it is not obvious, I lean towards AGF even when the edit is incorrect.), the user who made a destructive edit and the time that a destructive edit was corrected. I am less than 20% through the month (2:16 a.m. on 7 December) and have no expectation of collecting any day's article in real time.
What I think I will do is to start over at the middle of the month and try to collect as many of the types of information that I can. Besides capturing the users who made any changes, including bots, I will capture some of the editing history of those who make destructive edits and the consequences to the user for making such edits.
I have thought about this change in approach for a couple of days. It made much more sense to me after a bot reverted a slightly vandalized version to a very seriously vandalized version and, on a different article, two attempts to make an improvement that were both reverted.
I will let you know if I come up with anything interesting, assuming that at the end I am still functioning at a normal level. Of course, one could legitimately ask about the normal level of one who intends to examine twenty-four hours of edits on sixteen articles.
JimCubb (talk) 19:39, 25 December 2007 (UTC)
What quantities are you studying? Would this be of help? Voice-of-All 21:16, 25 December 2007 (UTC)
I WILL get back to you on that. I need some time to evaluate it.
JimCubb (talk) 06:32, 28 December 2007 (UTC)

[edit] "Compare and contrast"

Have other wikis been contacted about vandalism (and other similar issues - including "confused newbies" and "fingers in a twist") and how such matters are handled? There will be different profiles of activity for each, but we might as well get some consistency and avoid reinventing the wheel where possible. Jackiespeel (talk) 19:23, 10 January 2008 (UTC)

[edit] Impact Statistics

I think more analysis should be done on the impact of anon vandalism. Are statistics readily available per article for "the number of editors who have a particular popular article on their watch list" and the "rate at which people are reading this article"?

Since all editors who have an article on their watch list will be reading the vandalism, a measure of the impact on editors can be quantified per vandalism instance as:

(# users with this article on their watch list) * (time to read an instance of vandalism) +
(time to revert the change + time to post a notice on vandal's discussion page).

The impact on Wikipedia can be further quantified per vandalism instance as the number of lost quality edits as:

(time lost among all editors) / (time quality editors require to make a single edit).

Since vandals choose popular articles to vandalise, the impact on editors should be very large, since a large number of editors will have a popular page on their watch list.

The impact on readers can be quantified by:

(average read rate on article) * (average length of time an instance of vandalism goes uncorrected).
(average time and not median time should be used in this calculation)

BradMajors (talk) 15:22, 30 January 2008 (UTC)
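The impact formulas above are straightforward arithmetic once the inputs are measured. A back-of-envelope sketch in Python, where every number is an invented placeholder (watcher counts, read times and edit costs would all need to be measured or estimated before drawing conclusions):

```python
# Invented placeholder inputs -- not measured values.
watchers = 200              # users with the article on their watch list
read_time = 5               # seconds to read one instance of vandalism
revert_time = 30            # seconds to revert the change
warn_time = 60              # seconds to post a notice on the vandal's talk page

# Editor impact per vandalism instance (seconds), per the formula above
editor_impact = watchers * read_time + (revert_time + warn_time)

# Lost quality edits, assuming ~600 s of editor time per quality edit
seconds_per_quality_edit = 600
lost_edits = editor_impact / seconds_per_quality_edit

# Reader impact: average read rate times average time uncorrected
reads_per_minute = 3.5
mean_minutes_uncorrected = 20
readers_exposed = reads_per_minute * mean_minutes_uncorrected

print(editor_impact, lost_edits, readers_exposed)
```

Note the parenthetical above about using the mean rather than the median: since the time-to-revert distribution is heavily skewed (most vandalism is reverted almost instantly, a few instances linger for hours), the mean and median differ greatly, and total reader exposure depends on the mean.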

I think you are on to something here. I think the random sampling studies done every November grossly underestimate the vandalism that the average visitor to Wikipedia is likely to see. Why? Because the average user (and average vandal) are both likely to gravitate towards certain articles - namely, the top 100 most viewed articles. I propose that it would be more useful to gather a month's worth of statistics from a random sampling of the top 100 (or even top 500) articles, not a random sampling across the entire universe of articles, which includes topics so obscure that vandals themselves don't even know they exist.—Mrand T-C 21:05, 31 January 2008 (UTC)
Yes, vandals choose articles where their work will be read by the greatest number of people. I would first propose that data is gathered and analysed for one article (although not scientific) until the methodology is worked out. But is the data for the read rate of a particular article even available? If it is, I don't know where to find it. BradMajors (talk) 22:35, 31 January 2008 (UTC)
More or less a rant here... I see your point that vandals target popular pages, which is true, but I feel it necessary to remind people reading this that the quality of an encyclopedia is judged on the accuracy of its articles. Wikipedia's credibility is subject to its accuracy in any general article. And it is important to know that a scientific article, one that isn't viewed often, can be targeted for vandalism and not easily corrected, as the knowledge base needed to notice the vandalism is limited. There are people out there who attack pages such as politicians, sports teams, cities, etc.... And it is naive to think the most vandalism occurs on the most viewed pages; perhaps only the easily correctable vandalism does. The serious damage is done where people aren't looking every day, and that's where Wikipedia's credibility gets hurt the most. Just a thought to keep in mind when discussing where vandals choose to strike and the size of a sample study; I think a universal sampling technique is still preferred over a 'top 100'. Ottawa4ever (talk) 22:44, 13 February 2008 (UTC)
Encyclopaedias are NOT judged based upon the accuracy of the articles. They are judged based upon how often users see accurate versus inaccurate articles (which is a big difference). If vandalism occurs and is left uncorrected in a little-read article for, say, 12 hours, but only one reader sees the error, that is not as bad as vandalism left uncorrected for one minute in a popular article which 100 readers read. Until we can measure how frequently users see vandalism, I don't think it is possible to come up with meaningful statistics. BradMajors (talk) 07:39, 14 February 2008 (UTC)

I'm still compelled to re-emphasize my point: in academia, Wikipedia is not a credible source; in fact, it is common practice to fail a paper which cites Wikipedia. The accuracy of the articles is what Wikipedia is, and will be, judged upon. I agree that it's next to impossible to get accurate information on how often vandalism is seen, but it's important to understand that vandalism often goes unnoticed by those who are unfamiliar with the concepts of the topic, and this creates a more serious branch of vandalism that Wikipedia is vulnerable to. And often this is left alone in articles that don't receive enough 'hits'. To do any proper study on how many times an average user sees vandalism, you still need to consider articles not in the top 100 or so (which will still be viewed, and in my opinion are laced with misinformation designed to mislead a reader) just as much as frequently visited ones. Maybe this is a bit out of place in this discussion, but I think people should be aware of it when talking about just targeting frequently viewed pages. Still, your point is valid that more people will see the vandalism in a larger article and will likely recognize it, but we need to be aware that some vandalism goes unnoticed for some time. If you want an example, I recently fixed the Jacques Plante article (hockey) a month ago to include his career statistics, which had been deleted two years before the fix, and few had even noticed, yet in those two years the page would have received a large number of hits. We need to be aware of this issue too: are people aware that they are even reading vandalism and being misled, taking it as fact? That is serious too. I just think it's important to keep this in the back of the mind when deciding how to build up statistics, though it's not entirely directed at this idea, as this is about how frequently the user sees vandalism. Ottawa4ever (talk) 15:27, 14 February 2008 (UTC)

If a tree falls in a forest and there is no one to hear it does it make a noise? BradMajors (talk) 18:36, 14 February 2008 (UTC)
Yet its absence can be seen...... All articles in Wikipedia are just as important as one which registers in a 'top 100' list; otherwise, why have an encyclopedia in the first place? Why not just a top 100 list? Ottawa4ever (talk) 19:07, 14 February 2008 (UTC)
They are not as important. We have them for comprehensiveness, but there is a scale of importance from those that are vital topics and often viewed to those that are less important and seldom viewed. As Brad says, it is how often vandalism is seen, not how long it remains that is important. Studying the most viewed articles would be a good start, though we also need to include less viewed articles as well. It is just that some articles are more important than others. Studying vandalism in less viewed articles would also be a good research topic, and I think having at least one person watching every article is a goal we should strive for (though I've made absolutely no progress in convincing people of this).
Imagine we based our judgment of Wikipedia as a whole on the average article rating. Most articles would be start or stub class, but what if the majority of the important and most viewed articles were FA class? We would be wrong to conclude that Wikipedia was no good just because most of its articles were undeveloped and unreferenced. Richard001 (talk) 21:16, 14 February 2008 (UTC)
It may be useful to split the topics of statistics gathering and statistics analysis. We can discuss all the different useful statistics which can be gathered, and then with these various statistics we can come to conclusions. Both of the above types of statistics should be gathered; the difference is what conclusions we would draw from them. We currently don't know if a particular instance of vandalism is seen on average by one or one thousand people. BradMajors (talk) 23:30, 14 February 2008 (UTC)

[edit] Classification

I have made an attempt at classifying the items on the article page. BradMajors (talk) 21:24, 5 February 2008 (UTC)

[edit] Article Read Rate

There does not currently seem to be any way to obtain an article's read rate. There is an easy way to get the data by adding already existing third party links and services. Would there be support for trying to get permission from Wikipedia to temporarily use a third party service to get some statistics? Or is there any other way? BradMajors (talk) 23:03, 18 February 2008 (UTC)

It's definitely highly relevant in any case. Without this variable we would have to use some other estimate, like editing frequency, but some articles will draw more edits than others. Richard001 (talk) 04:22, 19 February 2008 (UTC)
These statistics can be obtained from a third party tool here: read rates BradMajors (talk) 11:16, 29 February 2008 (UTC)
A ballpark estimate for the Obama article is that each instance of IP vandalism was seen by 70 readers. The raw data this application is using is available here BradMajors (talk) 11:30, 29 February 2008 (UTC)

[edit] Vandalism count

I once saw a webpage that kept count of the number of vandalism acts on Wikipedia. Does anyone know of this webpage? ~QuasiAbstract (talk/contrib) 12:18, 2 April 2008 (UTC)