Wikipedia talk:Wikipedia Signpost/2007-10-08/Vandalism study


Michael, great job.

I think the study does effectively provide an answer to Aaron Swartz's question of who writes Wikipedia vis-a-vis anons: "The total number of persistent word views was 34 trillion; or, excluding anonymous editors, 25 trillion."(p. 5). So immediately we know that registered users contribute about 74% of PWVs.

It's not clear (at least to me) whether the further analysis of editors by decile and PWV contributions is based on 34 or 25 trillion, but it looks like the deciles themselves are calculated by excluding anonymous editors. So it's either that the top 10% of registered editors (by edit count) contributed 86% of all PWVs, or (if anon PWVs are excluded from the base) 63% of all PWVs (i.e., 86% of 74%).

In either case, it strongly suggests that, at least when weighted by how popular content is, Swartz was largely wrong. (It might be the case, however, that registered editors are more likely to edit popular topics, while anons contribute large word counts to obscure topics.) Note further that the exclusion of anons from the editor decile rankings means that the 10% decile is actually a smaller proportion of total editors, when anons as well as registered users are considered as editors.--ragesoss 23:02, 8 October 2007 (UTC)
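The two readings of the 86% figure discussed above can be checked with a few lines of arithmetic. This is only a sketch: the 34 trillion and 25 trillion totals are the ones quoted from the paper, the 86% top-decile share is the figure cited in this thread, and the variable names are made up for illustration.

```python
# Totals quoted from the study (p. 5), in trillions of persistent word views (PWVs).
TOTAL_PWV = 34.0
REGISTERED_PWV = 25.0

# Share of all PWVs contributed by registered editors.
registered_share = REGISTERED_PWV / TOTAL_PWV

# Top-decile share as cited in this discussion.
TOP_DECILE_SHARE = 0.86

# Reading A: the 86% is already a share of all 34 trillion PWVs.
share_if_base_is_total = TOP_DECILE_SHARE

# Reading B: the 86% is a share of the 25 trillion registered-only PWVs,
# so it is scaled down to express it against the full total.
share_if_base_is_registered = TOP_DECILE_SHARE * registered_share

print(f"registered share of all PWVs: {registered_share:.0%}")
print(f"top decile, reading A: {share_if_base_is_total:.0%}")
print(f"top decile, reading B: {share_if_base_is_registered:.0%}")
```

Under these inputs the registered share rounds to 74% and reading B rounds to 63%, matching the figures given above.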

After a closer look at the graphs, particularly Figure 3, it looks like the percentiles are based on 25 trillion PWVs (i.e., excluding the anon PWVs), since I count 9 lines. This implies that the final, 100% decile line is coincident with the 100% PWV line. So the top 10% of registered editors (ca. 420,000 people) account for ca. 63% of all PWVs, the top 1% (ca. 42,000) account for around 51%, and the top .1% (ca. 4,200) account for 32%.
It's interesting that the distribution is flattening for all segments except the very top (the top 4,200, and probably somewhat above that), which is on the same order of magnitude as the persistent core community: according to Erik Zachte's stats (which use the same main dataset, the October 2006 dump), there were 4330 editors with >100 edits in October 2006, and about 10 times that many with >5 edits. Contra Swartz, the visible community is becoming more, not less significant (as measured by PWVs). However, anons could also be gaining in PWV share; the study doesn't give any indication there. It seems like the only segment that is losing PWV share is the 10%-1% segment, many of whom are no doubt formerly active Wikipedians who have left the project.--ragesoss 23:32, 8 October 2007 (UTC)
Thanks for the feedback and additional analysis. You may be right about how this stands up against Swartz's analysis, but in typical "we report, you decide" fashion I didn't want to advocate a conclusion about it, and without having examined their data closely I wasn't sure what considerations might remain unaccounted for. One issue that comes to mind is that in the past two years, with the restriction of article creation to registered users along with the increased use of semi-protection, over time the balances weigh increasingly in favor of editing with an account. The world this study looks at has changed, and while the changes were already underway, Swartz may not have recognized their full impact. His argument is more anecdotal than systematic anyway. With those trends in mind, it stands to reason that known personalities would be pulling more of the weight now, as you say. --Michael Snow 05:03, 9 October 2007 (UTC)
I would also add that the designations of "top x%" are sort of meaningless, since there are literally millions of accounts that have never made an edit and anons aren't counted in those calculations. If you look at it by numbers rather than percentile, the distribution does seem modestly level, at least compared to the idea of a small core contributing the significant majority of content, as Jimbo used to argue. 50% of the PWVs come from the top 42,000 editors. According to the abstract, "we show that an overwhelming majority of the viewed words were written by frequent editors and that this majority is increasing." I think this is incorrect, based on the rest of the paper. The PWV share of the 10% and 1% groups is decreasing, and the .1% share is increasing but does not account for a majority of PWV (only about 32%). And the 10% group, the top 420,000 editors responsible for 63% of total PWVs, can hardly be considered "frequent contributors", since only a tenth of that number have more than 5 edits per month.--ragesoss 05:53, 9 October 2007 (UTC)
After looking into it further, I think that I was mistaken in assuming that the graphs are based on only registered accounts (and only the 25 trillion PWVs associated with them). They say they analyze 4.2 million accounts, which is too high for just registered accounts ca. October 2006. This makes interpretation of the graphs and assessment of the "who writes Wikipedia" question much more complicated. 27% of PWVs come from anons, yet the top 10% of all editors is responsible for 86% of PWVs. So anons must be well-represented in the upper deciles. I've emailed the lead author and hopefully I can get some more clarification.--ragesoss 07:15, 9 October 2007 (UTC)
Hi folks,
I'm the lead author of the work, and I'll reply to ragesoss's email here. It's rare indeed that research papers generate such immediate interest among practitioners, so it's very exciting to receive his email and see this talk page. The five questions that seem to have arisen, and our thoughts, are:
1. The abstract makes a claim unsupported by data. Specifically, we claim that "we show that an overwhelming majority of the viewed words were written by frequent editors and that this majority is increasing". Indeed, this sentence is wrong, for the reasons you've noted. While the 10% and 1% cohorts contributed about 85% or 75% of the value, respectively, these cohorts' shares are not increasing. Only the 0.1% cohort's share is increasing. You could also reasonably argue that "overwhelming majority" was inappropriate, and perhaps "strong majority" would be better.
2. The term "frequent" is not well defined. We've used the term frequent informally to refer to editors with the higher edit counts, but we never defined the term carefully, which we should have. So under our terminology, someone who edits only a handful of times per month but still edits more than e.g. 90% of other editors would be labeled frequent.
3. Do figures 3 and 4 include anonymous editors? The short answer is yes, they do. They include all editors appearing at least once in the history dump that we analyzed. There was definitely an opportunity to be more clear in the paper about this. :) We chose to do this because we didn't see fundamental differences between the graphs as presented in the paper and considering only the non-anonymous universe (25T PWVs). We've posted the figures with anonymous editors (and their PWVs) excluded: [Figure 3] [Figure 4].
4. Will you publish a list of the top editors by PWV? We're happy to, provided there aren't any privacy issues. What are the privacy issues from the Wikipedia community perspective? Do you publish lists of editors ordered by different metrics?
5. It would be great to see these analyses run on more current dumps. We agree. :) We worked with what we had at the time: when this paper was being written in April and May, the Nov. 4, 2006 dump was the most current available.
It's too late to make changes to the paper, but this feedback will inform our presentation and discussions at GROUP. It's appreciated. We'll continue to watch this talk page, so feel free to direct additional questions to us here. --R27182818 20:03, 12 October 2007 (UTC)
In response to Question 4., if I understand correctly the PWV metric only uses publicly available data, in the sense that I could work out my own PWV value by trawling the (public) history of each article I've edited (also public). So there are no genuine privacy issues here; there are only issues of courtesy. We have had Wikipedia:List of Wikipedians by number of edits since June 2004, and there have never been any issues with that until a few months ago, when a handful of editors decided they would prefer not to appear on it, and removed their names. We also previously had Wikipedia:List of Wikipedians by number of most recent edits, not maintained since 2004; and Wikipedia:List of Wikipedians by number of recent edits, not maintained since 2005. Both lists were dropped only because no-one could be bothered maintaining them, not because of privacy issues - indeed, the data is still up, albeit grossly out of date. So I think there is sufficient precedent for these lists to be published, without any need to fret about privacy. Hesperian 05:38, 13 October 2007 (UTC)
Hi folks, sorry to keep you waiting so long. Please find a list of the top editors by PWV at http://www.cs.umn.edu/~reid/pwv-list-4200.txt. PWV scores are percentages. Enjoy! --R27182818 18:55, 14 November 2007 (UTC)

Why, oh why, don't we have a more current dump? Something about the technical problems with complete database dumps of en-wiki might be useful in the article.--ragesoss 00:20, 9 October 2007 (UTC)

Or we could just wait for more people to apply for funding to study Wikipedia, or for a large amount of funding for studies of Wikipedia. How much research is being done on Wikipedia, out of interest? Is there a way to um, study the study of Wikipedia? Carcharoth 16:22, 9 October 2007 (UTC)
There's a WikiProject Wikidemia that would essentially aim to be the forum for your last question, though its activity level is not that high. There's also the Wikimedia Research Network. To the extent that the study of Wikipedia is being done through data extraction and analysis, it's often going to be conducted outside of normal wiki activity, so it's not always easy to know what's currently happening. Greg Maxwell is the Wikimedia Foundation's Chief Research Officer, and he along with some of the people who work on the toolserver are probably who you'd want to talk to for more of a sense of this activity. --Michael Snow 16:37, 9 October 2007 (UTC)
Thanks! Carcharoth 17:35, 9 October 2007 (UTC)

Further questions

It might be worthwhile to submit some further questions to the authors, since there is probably a lot of interesting information that they could easily provide that isn't in the paper.--ragesoss 00:20, 9 October 2007 (UTC)

  • Who are the top 4,200 editors by PWV?
  • What would Figure 4 look like if IP edits were included? Have the relative PWV contributions of anonymous editors been increasing or decreasing over the period analyzed?
  • [add your questions here]

What is a false positive?

This story is excellent except for the last sentence. What is a false positive in the context of persistent vandalism? 1of3 21:41, 10 October 2007 (UTC)

It's not talking about "persistent" vandalism, it's talking about any vandalism. With this much data, the study has to apply mechanical tests to identify what it thinks is vandalism, which in this case is largely based on edits that got reverted. Since the test has no human input, it can pull in cases that upon further review do not actually involve vandalism, which are the false positives. --Michael Snow 21:58, 10 October 2007 (UTC)
But it is talking about persistent vandalism, saying that most of what the program flagged as vandalism which persisted for more than 100,000 page views was actually not vandalism.--ragesoss 22:01, 10 October 2007 (UTC)
Right, of course, sorry I was thinking the question involved "persistent" vandalism in the sense of repeated action (even though the story itself talks about persisting cases). Anyway, the explanation of what is a false positive is still applicable. --Michael Snow 22:13, 10 October 2007 (UTC)
Thanks kindly, that makes sense now. 1of3 23:15, 10 October 2007 (UTC)
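The idea of a mechanical, revert-based test and its false positives, as explained in this thread, can be illustrated with a toy example. This is a rough sketch and not the study's actual method: the `Revision` type, the `flag_reverted` function, and the `is_vandalism` label (standing in for a human judgment the test never sees) are all hypothetical, and "reverted" is assumed here to mean that the next revision restores exactly the text that preceded the edit.

```python
from dataclasses import dataclass


@dataclass
class Revision:
    text: str
    is_vandalism: bool  # hypothetical human judgment, invisible to the test


def flag_reverted(history):
    """Return indices of revisions that the following revision exactly undoes."""
    flagged = []
    for i in range(1, len(history) - 1):
        changed = history[i].text != history[i - 1].text
        undone = history[i + 1].text == history[i - 1].text
        if changed and undone:
            flagged.append(i)
    return flagged


history = [
    Revision("The cat sat.", False),
    Revision("The cat sat. LOL", True),        # vandalism, then reverted
    Revision("The cat sat.", False),
    Revision("The cat sat on a mat.", False),  # good-faith edit, reverted anyway
    Revision("The cat sat.", False),
]

flagged = flag_reverted(history)

# A false positive is an edit the mechanical test flags even though a human
# reviewer would not call it vandalism.
false_positives = [i for i in flagged if not history[i].is_vandalism]

print("flagged:", flagged)
print("false positives:", false_positives)
```

In this made-up history the test flags both reverted edits, but only the first is vandalism; the second is the kind of case that turns out on review to be a false positive.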