Wikipedia talk:WikiProject Vandalism studies

To do list

Anyone should feel free to edit the to-do list below to add more ideas. Remember 14:17, 4 January 2007 (UTC)

  • Gather other editors of interest into the project
  • Come up with proposed studies to conduct and implement them
  • Revise project page to make it more accessible
  • Figure out what Study 2 will be
  • Finish Obama study

Typology

Note: I'm not putting my name on the items below; feel free to edit directly. If there are disagreements, let's note that in a discussion section. Or split this off into a subpage? John Broughton | Talk 00:22, 4 January 2007 (UTC)

Targets of vandalism

There are a variety of targets that vandalism can hit:

  • Articles:
    • Main page article
    • Other featured articles (FA)
    • Good articles (formally so categorized - GA)
    • Other articles (not GA, not FA)
  • Templates
  • Wikipedia namespace (including Wikipedia talk)
  • User namespace (including User talk)
  • Categories
  • Redirects
  • External links
  • References
  • Media (pictures, music files, etc.)

Sources of vandalism

Vandalism comes from:

  • Anonymous IP addresses
  • Newly registered users (typically vandal-only accounts)
  • Disruptive editors (limited but some constructive work)
  • Trolls, sock puppets, etc. - disgruntled "power users"

Types of vandalism

The IBM study identified five major types:

  • Articles:
    • Mass deletion: deletion of all contents on a page
    • Offensive copy: insertion of vulgarities or slurs
    • Phony copy: insertion of text unrelated to the page topic
    • Idiosyncratic copy: adding text that is related to the topic of the page but which is clearly one-sided, not of general interest, or inflammatory
  • Redirects:
    • Phony redirection (a redirect with a misleading pipe)

Methods of vandalism

How do we feel about a Methods of vandalism section? Will it do more harm than good to be specific? JoeSmack Talk 05:43, 4 January 2007 (UTC)
I think it would be helpful to create categories, but we don't necessarily have to state how one can do the more reckless types of vandalism. Remember 14:02, 4 January 2007 (UTC)
Alright, identifiable but not a diagram on how to do them, I think that is sound. JoeSmack Talk 14:17, 4 January 2007 (UTC)

Proposed studies

In order to study vandalism and the response to vandalism on Wikipedia, I had a couple of ideas. First, we can try to gather data on a random sample of vandalism that occurred during a particular time period or on particular pages. The second idea, which would be much more controversial, would be to engage in some small, systematic vandalism on random pages and see what gets reverted first and what manages to stay on, and why. I believe that this idea would face a lot of controversy, but I think it may yield interesting data, and it seems like a controlled way in which to study certain aspects of vandalism. Remember 14:07, 4 January 2007 (UTC)

I think the more random we can make the sample pool the better. I think the first idea should be tried before the second; many have done 'experiments' like the second, and I don't think we need to go that route unless other avenues prove not to be just as useful. If we make the first study empirical, focused, vigorous and detailed, I think it could provide a lot of useful information. JoeSmack Talk 14:20, 4 January 2007 (UTC)
Here's a link to a tool that shows the most popular articles. Another way of selecting a sample would be random pages, of course.
As for the second study, that's absolutely a violation of WP:POINT as proposed. I for one am not particularly interested in being blocked for a week (or whatever) for screwing around with Wikipedia. An acceptable alternative would be to IDENTIFY a sample of vandalism from a stream of edits (rather than, say, from an article's history of reverts), but not revert the vandalism. We'd need to keep the data offline in order to avoid anyone tampering with it, but it's doable. John Broughton | Talk 22:43, 4 January 2007 (UTC)

Proposed Study 1

Each member randomly chooses a certain number of pages (as chosen by the random article link), then goes through the whole history of each article and categorizes what vandalism occurred, who was responsible for the vandalism, and how long each instance of vandalism remained. Remember 21:03, 4 January 2007 (UTC)

Any study should start small and be iterative - that is, avoid doing a lot of work and then realizing that it was done wrong. It's better to spend more time planning than a lot of time regretting.
So, first, we should restrict this to (say) the last three months of edits. Second, we should do a test run with a very small number of pages. Third, we should put together the table(s)/lists/whatever where we'll record the data before we actually start grabbing data. (For example, do we want to count non-vandalizing edits; and how do we want to categorize the type of vandalism?) Which means there probably should be a WikiProject Vandalism studies/Study1 subpage set up before anyone starts any type of counting. Assuming, of course, we have consensus on moving forward. John Broughton | Talk 22:43, 4 January 2007 (UTC)
I think this is a great idea. Move forward, by all means; you have my help. JoeSmack Talk 05:56, 6 January 2007 (UTC)

Discussion

First, we probably need to think about how to set up the data gathering. Here is a proposed table that others should feel free to edit. Remember 16:18, 6 January 2007 (UTC)

Page      | Date of edits examined | Total number of vandalisms | Each vandalism with reference | Comments
Wiki page | Oct 2006-Dec 2006      | 15                         | ??                            | ??
...Let's choose what we are going to hit before we design the hammer. ;) Which articles are we going to monitor, how many, and for how long? And what is our workable definition of vandalism here? Anything reverted? Anyone with a history that doesn't look like a good contrib? We need a solid, solid definition first and foremost. JoeSmack Talk 17:36, 6 January 2007 (UTC)

Proposal 2

Alright, how about this.

1. Which articles are we going to monitor, how many, and how long?

We take a random sampling of articles as chosen by the random article link on Wikipedia. We look at all the edits for one month in each of 2004, 2005, and 2006. That will help us see how vandalism has changed each year. Remember 17:48, 6 January 2007 (UTC)
I like this idea. For some reason I thought we were all going to put 100 articles on our talk pages and watch them for a month. Let's pick something like November of 2004, 2005, and 2006, look through the history for vandalism, and record it. How big should our sample size be? Should we go until we reach 1,000? More? Less?

2. And what is our workable definition of vandalism here? Anything reverted? Anyone with a history that doesn't look like a good contrib?

I vote that we split up vandalism into several different categories similar to the IBM study: obvious added vandalism (adding various curse words and other nonsense), deleted information, and subtle vandalism (other additions that are intended to push a POV or otherwise harm the article). Remember 17:48, 6 January 2007 (UTC)

Further discussion

I'm down with the IBM study's analysis; the only aspect I'd approach with caution is "Idiosyncratic copy: adding text that is related to the topic of the page but which is clearly one-sided, not of general interest, or inflammatory". We want to have clear boundaries between Offensive (Max tucker is a dickbox!), phony (Max tucker is actually a computer programmer with too much time on his hands) and blatantly POV (Max Tucker has been demonstrated to be an inflexible human being throughout all walks of life). I think we should separate all these classes too, including the blatantly POV (idiosyncratic) one; I believe this one will be the most, uh, subjectively defined class of vandalism. How does this sound?
We also should set up an examples page so we can get some of our definitions straight on some instances. This is where we can decide if we want to classify vandalism by one or multiple classes. For instance, example.com vandalism should be defined as Phony vandalism, but should it also be Deletion vandalism when it replaces a section/page? Similar for vandals who replace the 'criticisms' section of their favorite film director with the word 'shitcock'; is this Offensive vandalism or deletion? Or both? We need to decide if we're going to run a multiple class system on single instances or not, and then follow by having a subpage with examples of vandalisms that are classed for our sake and everyone else's who are looking in on the work. JoeSmack Talk 18:19, 6 January 2007 (UTC)
On another note, we should have other users go through and do a second classification for reliability. We don't want the results from the previous classification to be apparent because it'll taint the second assessment, but we can either work a second table out or maybe put data in black text over black background or something clever like that to help. If we can show our reliability is high, it'll really help the credibility of our study. JoeSmack Talk 18:27, 6 January 2007 (UTC)
And again, on another note: this won't include article creation vandalism. Those pages usually get reverted right off the bat. If the random article button lands on an obvious db-, it should be db-ed and we move on, right? Also, in terms of linkspam: should this be counted as vandalism? Lots of links to places to get Viagra etc. are added and removed all the time; should that be counted? And if so, what about less encyclopedic links that are added, like those to YouTube copyvios and inappropriate user blogs? JoeSmack Talk 18:56, 6 January 2007 (UTC)
I vote to not count linkspam. I have set up the first study page at Wikipedia:WikiProject Vandalism studies/Study1. I think we can work through these issues as we go through the study there. I am going to try to start setting up a table there. Remember 18:22, 11 January 2007 (UTC)
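
As a rough illustration of the second-classification idea raised above: one common way to put a number on agreement between two classifiers is Cohen's kappa. Nobody in this thread proposed that statistic, so treat it as an assumption; the class labels and the six sample edits below are made up for the example.

```python
from collections import Counter


def cohens_kappa(first, second):
    """Cohen's kappa for two raters who labelled the same items in the same order."""
    if len(first) != len(second) or not first:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(first)
    # Share of items on which the two raters chose the same class.
    observed = sum(a == b for a, b in zip(first, second)) / n
    # Agreement expected by chance, from each rater's own class frequencies.
    counts_a, counts_b = Counter(first), Counter(second)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical class for every item
    return (observed - expected) / (1 - expected)


# Hypothetical classifications of the same six edits by two project members.
rater_1 = ["offensive", "phony", "deletion", "offensive", "pov", "phony"]
rater_2 = ["offensive", "phony", "offensive", "offensive", "pov", "pov"]
kappa = cohens_kappa(rater_1, rater_2)
print(round(kappa, 3))
```

A value near 1 would suggest the class definitions are being applied consistently; a low value would suggest the examples page needs more work before the counts can be trusted.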

Proposal 3: The viability of allowing anyone to edit

Cost-benefit analysis: Try to weigh the benefits of allowing anyone to edit (readers fixing typos, new registered users who wouldn't have become interested if it weren't for anon editing, etc.) against the costs (vandalism, and good contributors, particularly experts, leaving because of that vandalism). Remember that two contributors aren't necessarily equal: you must evaluate the cost of losing an expert against the gain of another contributor to TV show articles.

Further discussion

Although this proposal probably seems quite daunting, if the study were properly performed, it could make a significant difference to the debate over the continuing viability of the anon editing policy. --Seans Potato Business 02:24, 19 February 2007 (UTC)


Proposal for Study 2

It is common not to semi-protect Featured Articles when they're on the front page, with the idea that people new to Wikipedia who visit an FA will get a good idea of how Wikipedia is open for anyone to edit.

I propose we conduct a study to look at a number of factors for FAs on the day they are on the front page:

  • how many edits, vandalism edits, types of vandalism edits, and reverts there were, and the time to revert
  • how many edits have been made by new users (use "User contributions")
  • how FAs compare by day of the week

JackSparrow Ninja 11:43, 23 February 2007 (UTC)

Categorizing vandalism

I've set up a page at Wikipedia:WikiProject Vandalism studies/Types of vandalism. If you have proposed changes, I suggest you just edit over what is there, rather than doing a threaded discussion, and post your comments here. John Broughton | Talk 19:33, 7 January 2007 (UTC)

Notifying the community

Any ideas how to go about letting people know that this project is going on? I have a feeling that there are other interested people out there, but that our project may not be the easiest to find. Remember 18:36, 7 January 2007 (UTC)

I've got some links, but not time to follow-up on them: Wikipedia:WikiProject (for general info, I think); Wikipedia:WikiProject Council/Guide (best practices), and Wikipedia:WikiProject Council/Directory. If someone else would take a look, that would be great. John Broughton | Talk 18:50, 7 January 2007 (UTC)
Any talk pages where you find an active discussion of the problem of vandalism and the counter vandalism unit. --Seans Potato Business 02:25, 19 February 2007 (UTC)

Bayesian spam filtering

Hey folks, it just crossed my mind that we already have a solution to vandalism. Take how spam was dealt with in the world of email: Bayesian filtering. Theoretically speaking, we should be able to apply the same thing to vandalism. Even if we don't have access to the text people are putting into articles, we have enough input to make reasonable assertions (anonymous IP, number of lines changed, edit summary, article edited, etc.). Over the next month I'm going to try to implement this in WikiGuard. I'll let you know how it goes. :) --Brad Beattie (talk) 18:48, 7 January 2007 (UTC)

Bayesian and other methods for determining what is and isn't vandalism could be very useful as front-end tools for editors who are fighting spam, since they presumably reduce scanning work by human beings. I don't, however, think that they are particularly good at coming up with robust, defensible numbers on who is doing vandalism and what type of vandalism they're doing. So while I encourage your efforts, I don't think this WikiProject should wait to see what happens with them. John Broughton | Talk 18:56, 7 January 2007 (UTC)
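
To make the filtering idea above a little more concrete, here is a minimal sketch of a naive Bayes classifier over edit metadata. It is not how WikiGuard works (nothing here is taken from it); the feature names ("anonymous", "summary", "size"), the seven training edits, and the choice of add-one smoothing are all assumptions made up for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Categorical naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)  # (class, feature) -> value counts
        self.feature_values = defaultdict(set)      # feature -> values seen so far

    def train(self, features, label):
        self.class_counts[label] += 1
        for name, value in features.items():
            self.feature_counts[(label, name)][value] += 1
            self.feature_values[name].add(value)

    def log_scores(self, features):
        """Unnormalised log-probability of each class given the features."""
        total = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # class prior
            for name, value in features.items():
                seen = self.feature_counts[(label, name)][value]
                n_values = len(self.feature_values[name] | {value})
                score += math.log((seen + 1) / (count + n_values))
            scores[label] = score
        return scores

    def predict(self, features):
        scores = self.log_scores(features)
        return max(scores, key=scores.get)


# Hypothetical hand-labelled edits: metadata only, no article text.
training = [
    ({"anonymous": True,  "summary": False, "size": "small"}, "vandalism"),
    ({"anonymous": True,  "summary": False, "size": "large"}, "vandalism"),
    ({"anonymous": False, "summary": False, "size": "large"}, "vandalism"),
    ({"anonymous": True,  "summary": True,  "size": "small"}, "ok"),
    ({"anonymous": False, "summary": True,  "size": "small"}, "ok"),
    ({"anonymous": False, "summary": True,  "size": "large"}, "ok"),
    ({"anonymous": False, "summary": False, "size": "small"}, "ok"),
]
model = NaiveBayes()
for features, label in training:
    model.train(features, label)

guess = model.predict({"anonymous": True, "summary": False, "size": "large"})
print(guess)
```

A filter like this only ranks edits for a human to look at; as noted above, it would not by itself give defensible counts of who vandalizes or how.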

How to conduct a study - an example

Here's a modest example of what I think a useful study looks like. The following comes from here.

  • Study goal: To evaluate if it is true that positive contributions from anonymous users far outweigh vandalism.
  • Data sampling approach: an informal tally of anonymous contributions I come across in my daily vandal-cleaning efforts from June 12 to August 15, 2006
  • Results:
    • vandalism - 116
    • non-vandalism - 505

One could criticize the study for its non-random approach, for not clearly defining how vandalism was determined, and for not evaluating whether non-vandal edits were significantly useful or not, but still, the evaluation did provide useful information - it appears that anonymous IP edits are constructive far more often than not. And that in turn is actionable: since the benefits of allowing anonymous IP addresses to edit Wikipedia articles seem to outweigh the costs, the current approach should continue. John Broughton | Talk 20:00, 7 January 2007 (UTC)

I think this is the general case, but the study should be more nuanced. Some articles are vandalized very heavily, so much so that the vandals' edits and the reverts take up more than half of the history. I've been working on The Simpsons for a long time and it's a daily struggle to keep it vandal-free. --Maitch 14:45, 11 January 2007 (UTC)
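
For what it's worth, the tally quoted in the example study above (116 vandalism, 505 non-vandalism) can be turned into a share with a rough uncertainty range. The sketch below uses a Wilson score interval, which is my choice and not something the original tally reported; it also assumes the edits behave like a random sample, which the non-random collection method does not guarantee.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


vandalism, non_vandalism = 116, 505  # counts from the tally above
total = vandalism + non_vandalism
share = vandalism / total
low, high = wilson_interval(vandalism, total)
print(f"{share:.1%} of {total} anonymous edits were vandalism "
      f"(95% interval roughly {low:.1%} to {high:.1%})")
```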

Started first study

Go to WikiProject Vandalism studies/Study1 to check it out and help make it better. Remember 22:40, 11 January 2007 (UTC)

I moved it to Wikipedia:WikiProject Vandalism studies/Study1; it was previously in article space. Trebor 22:54, 11 January 2007 (UTC)

Need help

Anybody who wants to help get 100 data points for our first study (Wikipedia:WikiProject Vandalism studies/Study1), please let me know, because we need all the help we can get. Below is a copy of the results we have so far, which I think are interesting. Remember 17:30, 18 January 2007 (UTC)

Current cumulative tally

Total edits 2004, 2005, 2006 = 100
Total vandalism edits 2004, 2005, 2006 = 5
Percentage of vandalism to total edits = (5/100)= 5%

November 2004

Total edits in November 2004 = 15
Total vandalism edits in November 2004 = 2
Percentage of vandalism to total edits = (2/15) = 13.33%

November 2005

Total edits in November 2005 = 45
Total vandalism edits in November 2005 = 2
Percentage of vandalism to total edits = (2/45) = 4.444%

November 2006

Total edits in November 2006 = 40
Total vandalism edits in November 2006 = 1
Percentage of vandalism to total edits = (1/40) = 2.5%

Percentage of overall vandalism that was

Obvious vandalism = (4/5) = 80%
Inaccurate vandalism = (0/5) = 0%
POV vandalism = (0/5) = 0%
Deletion vandalism = (0/5) = 0%
Linkspam = (1/5) = 20%

Percentage of overall vandalism that was done by

Anonymous editors = (4/5) = 80%
Editors with accounts = (1/5) = 20%
Bots = (0/5) = 0%

Reverting

Average time before reverting = (7991+14+6816+18+2561)/5= 3480 minutes
Percentage of reverting done by
Anonymous editors = (0/5) = 0 %
Editors with accounts = (5/5) = 100%
Bots = (0/5) = 0 %
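
The percentages in this first tally are easy to recompute as more data points come in. A minimal sketch, assuming the counts are kept as (total edits, vandalism edits) per sampled November; the dictionary layout and the `rate` helper are just illustrative, not how the Study1 page stores its data.

```python
# (total edits, vandalism edits) for each sampled November, copied from the tally above.
tally = {2004: (15, 2), 2005: (45, 2), 2006: (40, 1)}


def rate(vandalism, total):
    """Vandalism edits as a percentage of all edits (0.0 if there were no edits)."""
    return 100 * vandalism / total if total else 0.0


per_year = {year: round(rate(v, t), 2) for year, (t, v) in tally.items()}
total_edits = sum(t for t, _ in tally.values())
total_vandalism = sum(v for _, v in tally.values())
overall = round(rate(total_vandalism, total_edits), 2)
print(per_year, overall)
```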

Still need help

I still need people to help with the first study. We are now up to 40 points, but I would like to get north of 100. Here are the current results based on the first 40 points. Remember 17:12, 28 January 2007 (UTC)

Current cumulative tally

Total edits 2004, 2005, 2006 = 150
Total vandalism edits 2004, 2005, 2006 = 8
Percentage of vandalism to total edits = (8/150)= 5.3%

November 2004

Total edits in November 2004 = 22
Total vandalism edits in November 2004 = 2
Percentage of vandalism to total edits = (2/22) = 9.09%

November 2005

Total edits in November 2005 = 59
Total vandalism edits in November 2005 = 5
Percentage of vandalism to total edits = (5/59) = 8.47%

November 2006

Total edits in November 2006 = 69
Total vandalism edits in November 2006 = 1
Percentage of vandalism to total edits = (1/69) = 1.45%

Percentage of overall vandalism that was

Obvious vandalism = (7/8) = 87.5%
Inaccurate vandalism = (0/8) = 0%
POV vandalism = (0/8) = 0%
Deletion vandalism = (0/8) = 0%
Linkspam = (1/8) = 12.5%

Percentage of overall vandalism that was done by

Anonymous editors = (7/8) = 87.5%
Editors with accounts = (1/8) = 12.5%
Bots = (0/8) = 0%

Reverting

Average time before reverting = (7991+14+6816+18+2561+4+11+11)/8 = 2,178.25 minutes
Percentage of reverting done by
Anonymous editors = (0/8) = 0 %
Editors with accounts = (8/8) = 100%
Bots = (0/8) = 0 %
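
One thing worth noticing in the revert times above is that three very slow reverts dominate the average. A small sketch comparing the mean with the median for the same eight values; reporting the median is a suggestion of this example, not something the study page currently does.

```python
from statistics import mean, median

# Minutes before each of the eight vandalism edits was reverted (from the tally above).
revert_minutes = [7991, 14, 6816, 18, 2561, 4, 11, 11]

avg = mean(revert_minutes)    # pulled upward by the three slow reverts
mid = median(revert_minutes)  # the typical case: half were reverted faster than this
print(f"mean = {avg} minutes, median = {mid} minutes")
```

With so few data points either figure is fragile, but showing both makes it clearer that most of these reverts happened within minutes while a few took days.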

Barack Obama

This widely watchlisted article has been running unprotected since 31 January. move=sysop protection was added on February 5. See the article's talk page for discussion about allowing IP edits. Would some volunteers here be willing to turn their analytical skills to this article? It is likely to be linked from the main page "In the news" box during the weekend of 10 February when Obama is expected to announce his plans for the 2008 presidential election. --HailFire 18:48, 8 February 2007 (UTC)

What sort of study were you thinking about? Remember 20:02, 8 February 2007 (UTC)
Something a lot like this. Such a study could inform the continuing talk page discussion mentioned here and help the article's editors to build greater consensus on IP edits. More broadly, the analysis may provide useful guidance on the relative merits of protection/unprotection for other high visibility political articles visited with frequent vandalism. --HailFire 21:08, 8 February 2007 (UTC)

Still thinking such a study might provide useful insights for managing vandalism on this and other closely watched articles with broad readership. The article is currently in unprotected status. --HailFire 11:02, 6 March 2007 (UTC)

I ran across this project, and this looks like a neat mini-project that I can take up. I'll create it at User:BuddingJournalist/ObamaAnalysis. BuddingJournalist 08:20, 9 March 2007 (UTC)
Feel free to set it up as a study under the wikiproject vandalism study section (e.g. Wikipedia:WikiProject Vandalism studies/Obama article study) so the whole group can help out. Remember 14:57, 9 March 2007 (UTC)
Indeed, this looks really interesting. JoeSmack Talk 15:11, 9 March 2007 (UTC)
Move completed! Feel free to contribute! BuddingJournalist 01:35, 10 March 2007 (UTC)

Lots of new data for the unprotected period between 12 and 17 March, for anyone who wants to give it another look. Would also be interesting to track time of day and geolocation data, perhaps putting this in a graphic. Some examples here. --HailFire 15:08, 19 March 2007 (UTC)

A study of my user page

I carried out a vandalism study on my own user page and found that 47% of the vandalism was made by registered users. Angela. 22:13, 29 March 2007 (UTC)