Wikipedia talk:Version 1.0 Editorial Team/Assessment rewrite

From Wikipedia, the free encyclopedia

Material was moved here from Wikipedia_talk:Version_1.0_Editorial_Team/Assessment#Overhaul_and_rewrite_of_the_assessment_scheme and related discussions on Wikipedia talk:Version 1.0 Editorial Team/Assessment. Walkerma (talk) 17:35, 12 May 2008 (UTC)

Fewer words?

For the assessment table, I'm wondering if we couldn't say more or less the same thing using fewer words; here's a suggestion. I've included a description of a "list", which was commented out of the table; feel free to delete it if it's not relevant.

Article progress grading scheme [  v  d  e  ]

B ({{B-Class}})
  Criteria: Anything that's definitely better than the "Start" category, but doesn't meet higher standards.
  Reader's experience: It gives the impression that a typical reader would learn something.
  Editor's experience: Improve the article by trying to meet higher standards.
  Example: Jammu_and_Kashmir (as of October 2007) has a lot of helpful material but needs more.

Start ({{Start-Class}})
  Criteria: A good article that is still weak in many areas. Has at least a particularly useful picture or graphic, or multiple links that help explain or give examples of the topic, or a subheading that covers one topic more deeply, or multiple subheadings that suggest material that could be added to complete the article.
  Reader's experience: Useful to some; provides more than a little information, but many readers will need more.
  Editor's experience: Major editing is needed; not a complete article.
  Example: Real analysis (as of November 2006)

Stub ({{Stub-Class}})
  Criteria: Either a very short article or a rough collection of information that needs a lot of work.
  Reader's experience: Possibly useful. It might be just a dictionary definition.
  Editor's experience: Any editing or additional material can be helpful.
  Example: Coffee table book (as of July 2005)

List ({{List-Class}})
  Criteria: An article that meets the definition of a Stand-alone List. It should contain many wikilinks, with descriptions.
  Reader's experience: There is no one way to make a list, but it should be logical and useful to the reader.
  Editor's experience: Lists can be anything from a stub to a Featured List.
  Example: List of aikidoka (as of June 2007)

- Dan Dank55 (talk) 19:45, 4 April 2008 (UTC)

Hm, no response. Let me add that one of my favorite essays, WP:KISS (linked from K.I.S.S., linked from WP:Instruction creep), says, in its entirety: "Keep policy, guideline and procedure pages short, or else people won't read them, more people will leave the project, and less people will join the project." That's my experience too ... the shorter the instructions, the more likely they are to get read. - Dan Dank55 (talk) 19:57, 5 April 2008 (UTC)
Sorry I didn't see this, I've been very busy offline; this proposal is well worth looking at. I'm a great supporter of KISS, but the wiki approach tends to mean that often people simply add rather than rewriting. There has also been a proposal for tightening up the assessment scheme, and it would make sense to do everything at once. If you're up for working on this, I'll try and recruit a few others for their input. Bearing in mind that this scheme is used by over 1000 projects, we need to make sure that any rewrite represents a consensus of several interested people. Thanks! Walkerma (talk) 07:15, 6 April 2008 (UTC)
Sure, I'll follow your lead, Martin. Note that there's a new way to notify wikiprojects if you like, here. I can't speak for WP:GAN or WP:FAC ... perhaps a link to those pages would be better than trying to sum up what's needed in a table ... and I didn't know what to say about A-class. I should add that I and others are putting major energy into the GA process. I know how A-class is defined, but I'm kind of wondering what the function of A-class is ... there might have been a feeling that the GA process should be avoided for a variety of reasons. This would be a good time for people to post a message at WT:WGA if they have been dissatisfied with the goals or output of the GA review; we're very much on it. - Dan Dank55 (talk) 11:49, 6 April 2008 (UTC)
P.S. I'd be happy to drop "Lists" if it's not needed, and it seems to me the "Editor's experience" column could be dropped, since it logically follows from everything else. - Dan Dank55 (talk) 12:04, 6 April 2008 (UTC)
Good! A-Class predates GA, and GA was added into the scheme later. It still fulfils a useful role for some projects, and there are some projects (such as WP:MILHIST) that have had a wariness concerning the GA process - I don't think that would change at this point, since they don't relate any more to specific points of procedure. I personally think that GA-Class is the one that doesn't belong in the scheme, but not because I don't like GA (I'm a big fan!). It's because it's not a project-based assessment. However, we need to work with the system that the consensus likes to use, and that will (for the foreseeable future) include both GA and A. I'll try to get things started this week from my side. Thanks again, Walkerma (talk) 05:41, 7 April 2008 (UTC)
Well, one great thing about A-class is that it's not work for me :) To the extent that it's useful, absolutely, keep it. I look forward to learning more about the individual review processes of the wikiprojects. - Dan Dank55 (talk) 12:42, 7 April 2008 (UTC)
The person interested in tightening up the assessment scheme looks to be quite busy elsewhere (he had warned me), so maybe we should make a start here anyway. However, I have an idea - I wonder if we should consider having a simple form, but if people want to know more detail they can click for more information? Bearing in mind the fact that thousands of Wikipedians need to consult this page (see these stats for proof), it would be perfectly reasonable to set up subpages as needed for this - especially if we add a range of examples, as planned. That's one thing I like about the wiki framework - you can keep it simple on the main page, but have more detail for those who want to go deeper. I think I'd like to write up an FAQ, because we do have some standard questions that keep getting asked. Does this sound like a reasonable idea? Walkerma (talk) 14:49, 10 April 2008 (UTC)
Sounds great. Or, if you'd like to put more information in the chart, based on the questions people ask, that's fine too. What didn't look right to me about it was the low meaning-to-words ratio, or maybe I just didn't get the meaning. - Dan Dank55 (talk) 20:22, 10 April 2008 (UTC)

Overhaul and rewrite of the assessment scheme

There have been two proposals recently relating to assessment, and both seem to be reasonable (IMHO). They would both involve some rewriting and recalibrating, and therefore I think we should consider both proposals at the same time. I'm adding a third proposal, which is in effect how I think the first two would best be implemented together. There's also a fourth, which came up in discussions, and which I'll throw in for good measure. Walkerma (talk) 18:01, 12 April 2008 (UTC)

Simplifying the descriptions

(Described right above here) We should simplify the basic definitions of each class. The descriptions are quite detailed, but that may mean simply that people don't bother to read them properly. We could simply do a copyedit and chop out a lot of wording; that would make them easier to follow, but we may lose some of the rigour if actual examples or nuances of meaning are lost. Hence my proposal for a "summary style" approach; this will allow us to have very clear, simple definitions for routine use. Walkerma (talk) 18:01, 12 April 2008 (UTC)

Comments
I want to help with the rewrite

Be happy to help - Dan Dank55 (talk) 02:00, 13 April 2008 (UTC)

I'll do what I can, anyway. John Carter (talk) 15:36, 20 April 2008 (UTC)
I do. Arman (Talk) 03:56, 8 May 2008 (UTC)

Refining the assessment scheme

See Wikipedia_talk:Version_1.0_Editorial_Team/Work_via_Wikiprojects#Assessment for the original proposal. We have a good scheme that works well, but there are variations in standards. It should be possible to sharpen the boundaries of the scheme by including additional examples to indicate specific detail about the levels (the lowest standard for Start-Class, vs. the highest standard for Start-Class). We may also be able to consider how we handle the different aspects of assessment (article length, quality, technical aspects, aesthetics, etc). We have one very knowledgeable contributor offering to help, and I think we should use this opportunity to make the scheme more rigorous. Any thoughts? Walkerma (talk) 18:01, 12 April 2008 (UTC)

Comments
I want to help with the refining process

Happy to help with style and language issues - Dan Dank55 (talk) 02:09, 13 April 2008 (UTC)

Good idea. A few ideas that come to mind:
  • Clearly defining the standards for Start and B, specifically regarding need for referencing, if any, for B
  • Deciding what if anything to do with the GA/A conundrum
  • And something that has arisen with a few Biography articles where even everything known about a notable, obscure subject still isn't much: maybe some sort of "bastard" A/Start grade for articles that are as complete as reasonably possible, but still so very short that many stubs are longer. John Carter (talk) 15:36, 20 April 2008 (UTC)

Converting the scheme to summary style

This was my suggestion for dealing with the first two proposals, which at first glance would appear to be irreconcilable. How can we make the scheme even more nuanced and rigorous, yet make it simpler to understand? I think we can accomplish this through use of the summary style approach: Have one short, succinct description of the scheme, but then have a sub-page (or sub-pages) to give more detail. That way, someone who just wants to "get the general idea" can do so, but the reviewer who is agonising over whether something is B or Start can look for some more detailed guidance. Is this a good approach to the problem?

Comments

Add an FAQ

The scheme is now well into its third year, and some of the standard questions and proposals keep coming up over and over again:

  • Why is A-Class above GA-Class?
  • Why is A-Class (or GA-Class) even needed?
  • Are citations required for B-Class?
  • How are articles promoted to A-Class?
  • How do I request use of the bot for our WikiProject?
  • I think we should have one more/fewer level in the assessment scheme!
  • Can our project use an extra level or categories in its assessment scheme?
  • Our project uses its own descriptor, "Foo-Class": Can this be added into the statistics table?
  • etc.

I think it's about time we wrote a simple FAQ to deal with these questions; for every one person who posts on one of these, there are probably ten who are simply baffled and leave.

Comments and suggestions for FAQs

We have Wikipedia:WikiProject Council/Assessment FAQ, which we can always expand for this purpose. Titoxd(?!? - cool stuff) 20:42, 11 May 2008 (UTC)

I'm willing to help write the FAQ page

Happy to help with style and language issues - Dan Dank55 (talk) 02:14, 13 April 2008 (UTC)

Happy to help address "I think we should have one more/fewer levels in the scheme", and there may be other things I can help with also. Holon (talk) 10:10, 15 April 2008 (UTC)

General comments

Our scheme has grown from around 2000 articles when the scheme was automated two years ago, to around 1.1 million today - that's more than the growth of Boston in 1776 to the Boston of today. The scheme is holding up remarkably well, IMHO, but I think we need to revamp the "architecture" a bit. Walkerma (talk) 18:01, 12 April 2008 (UTC)

On Simplifying

Hi all. On Dank55's suggested simplification. I would certainly keep the current version with the more detailed description. However, those who have become accustomed to it probably won't refer to it in detail often, and an abbreviated version could be used to complement (not replace) the more detailed version -- i.e. a kind of quick reference that people can go to if they prefer. An option to consider anyhow.

From experience, the examples tend to be the most powerful part of the process, and as I've said elsewhere I think it is excellent you have examples. The description of an article (like the description of most complex things) can be interpreted in different ways, and most importantly here, more or less strictly/harshly or leniently, and with different assumed interpretation of the various elements. Don't get me wrong, I think it's very important to describe, to orient to what features people need to look at, but then at the end of the day someone can always ask: so what does that actually look like? Just as a picture tells a thousand words, so does an example of an article!

So a quick reference with the same examples could be useful for those assessing a lot of stuff, or even those who assess just a few things after becoming familiar with the scheme.

The more detailed version is also likely to be important in cases where there is some dispute. Holon (talk) 09:26, 15 April 2008 (UTC)

I've just noticed my comments are similar to the summary style idea above. If many are familiar with the existing scheme though, I'd still argue that keeping it and adding a short version would be easiest, but either way the principle is the same. Holon (talk) 11:14, 15 April 2008 (UTC)

On examples

I'd strongly advise against using different examples/exemplars in different versions of the generic scheme (not that anyone has suggested it) because I have seen empirical cases in which exemplars are changed in a scheme that otherwise remains the same, and there is a severe impact on ratings (e.g. becomes much harder to be deemed in a category). The relevant research was thorough and very controlled. I can't say exactly to what extent it applies to this scheme, but in general it's better to keep things consistent as much as possible (provided of course they're sound and working!).

On that note, I'd also be careful changing the examples over time if you want consistent grading over time. Having said that, there are ways to link new to old if this is a must and I can advise and help. It takes some time and effort to make sure changing examples doesn't change the relative difficulties of the 'grades' though.

However, it may be very useful to have specific examples for more unusual kinds of article. In these cases, for the sake of comparability, I would advise people in the relevant projects to very carefully select examples that are as close to the same quality as those in the generic scheme as possible (for the same reason, they're very powerful). The basic principle is this. In cases where there are unique considerations and/or some of the generic considerations are not applicable (or less so), people have to take into account the considerations when deciding the grade of a given article. Now, assuming some effort goes into this, you may as well do it once and save having to do it every time thereafter. However, because the exemplars may have a strong impact on the assessments, ideally the first time a decision is made, an exemplar should be selected that is considered as close to the generic one as possible. Another thing worth considering.

Cheers Holon (talk) 09:26, 15 April 2008 (UTC)

On borderline examples

In keeping with the general principle of having a simple scheme with flexibility for cases that require or warrant special attention, I want to add that additional borderline examples could also be listed on a separate page, only to be used when necessary (e.g. if start vs B is a difficult call).

The scheme gives broad classifications, which is fine for many purposes. If people are interested in adding, I would recommend selecting candidate articles and having experienced assessors quickly do pairwise comparisons between the candidates and existing exemplars. Given relatively little data, I can analyse and report back scaled locations in order so a decision can be made about borderline below/above articles. If there is enough data, I can also advise which were most consistently judged, and so which are better to use.

Just as extra increments on a tape measure (or any instrument) provide additional precision in a region of a continuum, so additional exemplars provide additional precision in the region (border between adjacent classifications). So additional exemplars in selected regions allow greater precision when desired, provided they've been carefully selected and calibrated. Incidentally, this also answers the question in the FAQ about more or fewer grades. The thing people are generally thinking when they ask this is: I think there should be more precision (or less, though I wouldn't recommend less here). So this is one way to get the best of both worlds -- simplicity plus precision when needed. Anyhow, hope this background and explanation is useful but fire away with questions if not. Cheers Holon (talk) 11:07, 15 April 2008 (UTC)

What would it take to establish a first-class foundation for Wikipedia standards?

After looking through the discussion here, I think it might be instructive to describe an 'ideal' process, and to work back from there to what's doable. Pretty much every issue that has come up here is fairly common in assessment. I hope it will be easier to see why from the ideal. Please keep in mind that the work put into what I outline below overlaps with normal work on articles anyway, and in the long run would likely make that work far easier by helping to identify what needs to be done and when.

It may also turn out that something closer to the ideal is achievable with available skills than I realize. With some ingenuity, Wikipedia could be a first for online ratings en masse by developing a top-class process based on solid foundations! OK, not likely, but possible. It's already considerably better than the crude methods normally used, such as ratings of 1-10 plucked out of the air.

Ideal

Given the nature of articles, the following process in the ideal is what I would (and do) recommend.

  1. Compare (pairwise) a set or sets of exemplars and scale them.
  2. List in order from worst to best (by links). This provides an ordered set analogous to a ruler with many points of possible distinction.

This achieves two things:

  1. When the scheme is refined, it can be refined not only based on what 'should' distinguish better from worse, but what is seen to in a carefully calibrated and ordered set of examples.
  2. The set of examples can sit in the background and be used whenever anyone wishes to, for greater precision (you can always go from cm to meters).

The last point is key when articles are near a threshold for going from one "grade" to the next.

Probably more important than all else, an ordered set of examples provides a clear picture for editors of what it takes for an article to progress toward the highest standard.
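The two steps above (pairwise comparison, then scaling into an ordered "ruler") can be made concrete with a small sketch. Holon doesn't name a particular model, so this is only an illustration: a Bradley-Terry model fitted by the standard minorization-maximization iteration is one common way to turn "which of these two is better?" judgments into scaled scores. The article names and judgment data below are invented for the example.

```python
from collections import defaultdict

def bradley_terry(comparisons, n_iter=200):
    """Estimate a strength score per item from pairwise judgments.

    comparisons: list of (winner, loser) pairs, one per judgment
    (assumes no self-comparisons). Higher score = judged better.
    """
    items = {i for pair in comparisons for i in pair}
    wins = defaultdict(int)          # total wins per item
    pair_counts = defaultdict(int)   # judgments per unordered pair
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1

    strength = {i: 1.0 for i in items}
    for _ in range(n_iter):
        new = {}
        for i in items:
            denom = 0.0
            for pair, n in pair_counts.items():
                if i in pair:
                    (j,) = pair - {i}
                    denom += n / (strength[i] + strength[j])
            new[i] = wins[i] / denom if denom else strength[i]
        total = sum(new.values())                 # renormalize so the
        strength = {i: s * len(items) / total     # overall scale doesn't drift
                    for i, s in new.items()}
    return strength

# Invented judgments: exemplars from the scheme plus one candidate article.
comparisons = [
    ("Jammu and Kashmir", "Real analysis"),      # B exemplar beats Start
    ("Jammu and Kashmir", "Coffee table book"),  # B beats Stub
    ("Jammu and Kashmir", "Candidate article"),
    ("Real analysis", "Coffee table book"),      # Start beats Stub
    ("Real analysis", "Candidate article"),
    ("Candidate article", "Coffee table book"),
]
scores = bradley_terry(comparisons)
ranking = sorted(scores, key=scores.get, reverse=True)
```

With these judgments the candidate article scales between the Stub and Start exemplars, which is exactly the "where does it sit on the ruler?" information the ordered set is meant to provide.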

Common reaction

A common reaction to this is that it's too time consuming because most are used to easy, but poor, rating processes (e.g. pluck a number from 1 to 10 out of the air or a grade based on best guess).

I understand, but my standard response is that the payoffs outweigh the up-front time, often by a large factor, and of course anything worth doing takes some effort and coordination. The only reason most of us can buy a thermometer and easily, yet precisely, measure temperature at will is that a lot of work lies behind its development and construction. Like anything else, including articles on Wikipedia themselves, quality products require some work.

Good measurement instruments and procedures are a cornerstone of industry and technology -- without common standards, many things are impossible in industry. The same idea applies to Wikipedia as a whole. If editors can quickly, yet precisely, measure against calibrated standards as they work and assess articles, there are similar payoffs. There is a lot more clarity on standards and how to know where you are and what it takes to progress.

I believe around a million articles have been assessed -- is that right?

However, it's like everything, it does take time and coordination. Hopefully though, this helps in explaining various issues and how they all fit together in the bigger picture even if nobody actually ends up participating.

Small-scale test

I can offer to anyone who wishes to do a small scale test in their own project. I don't think I have yet encountered a case in assessment where people have not found the process informative and useful.

Send me a set of article labels, preferably 15 or more, and I'll send back a spreadsheet with a set of pairwise comparisons to be done: each to be compared with each other, and a judgment made about which is better. Do these and send me back the results. I will scale them, put them in order, and tell you how consistent you were overall and which articles were anomalous, if any. Include at least two or three of the articles in the scheme so you will be able to see how the rest scale in between. If you can organize more than one judge to make comparisons, even better, and I can give you feedback on each judge's consistency and the agreement between them.

This should be quite quick for someone who is reasonably familiar with the set of articles, if the assessor only needs to refer to them when it's hard to say which is better. Most judgments should be quick and only a portion take more time. The payoff -- for your project you get a much clearer picture of the way articles progress from worse to better quality, and you have a far more precise basis for judging when an article should move up a grade.
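The "how consistent you were" feedback can also be given a concrete form. Holon doesn't say which statistic he uses, but with complete pairwise data one standard check is to count circular triads -- sets of three articles judged A > B, B > C, yet C > A -- which by definition cannot fit any single worst-to-best ordering. A sketch, with invented article labels and judgments:

```python
from itertools import combinations

def circular_triads(comparisons):
    """Count intransitive triads (a > b, b > c, c > a) in a set of
    pairwise judgments, assuming every pair was judged exactly once.
    A count of 0 means the judgments fit one consistent ordering."""
    beats = {(w, l) for w, l in comparisons}
    items = sorted({i for pair in comparisons for i in pair})
    count = 0
    for a, b, c in combinations(items, 3):
        # Orient the three edges of the triangle as judged.
        edges = [(x, y) if (x, y) in beats else (y, x)
                 for x, y in ((a, b), (b, c), (a, c))]
        outdeg = {i: 0 for i in (a, b, c)}
        for winner, _ in edges:
            outdeg[winner] += 1
        # A transitive triad has out-degrees 2, 1, 0; a cycle has 1, 1, 1.
        if sorted(outdeg.values()) == [1, 1, 1]:
            count += 1
    return count

# Invented example: one judge, four articles, every pair judged once.
judgments = [
    ("B example", "Start example"),
    ("B example", "Stub example"),
    ("B example", "Candidate"),
    ("Start example", "Stub example"),
    ("Candidate", "Start example"),  # these two judgments together
    ("Stub example", "Candidate"),   # form one circular triad
]
```

The count (here 1, from the Candidate/Start/Stub cycle) points to exactly which articles a judge wavered on -- the kind of "which were anomalous" feedback described above.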

This can be extended across projects. This would simply require choosing a number of articles in your project as well as some from another project also doing a calibration exercise. All articles can be scaled jointly and tests conducted to see how successful the exercise was. It's preferable that the assessors have some knowledge of the other articles, but I doubt it would be necessary for them to be experts on the content to get worthwhile results.

Obviously, this requires coordination if it crosses editors and particularly projects. However, the result could be a nice list across projects of articles from the worst to higher quality that everyone can refer to, plus the benefits to the project mentioned.

So to reiterate, this process is beneficial for

  • refining the scheme by seeing what actual progression looks like, according to consistent judgments made by a methodical process.
  • providing a set of examples (behind the scenes) that includes the examples in the scheme, and can be used when the call between one grade and the next gets difficult, avoiding debate about the number of classifications (there is more precision if you want it, and editors would know more clearly when an article is getting close to progressing to the next grade).
  • giving editors a clear summary picture of what it takes to progress articles, which would probably also reveal things not anticipated up front.
  • founding refinements on this information to make the criteria more accurate, and so more efficient to use and more credible.

I know there's a lot, but I hope it gives a clear picture of the ideal, and it might spark ideas even if nobody elects to do a trial.

Don't hesitate to criticize -- believe me it's unlikely you'll raise anything I haven't heard many times, and if you do, I'll be grateful for the challenge.

Cheers all. Holon (talk) 10:45, 11 May 2008 (UTC)

I'm not entirely sure what you mean by all of the above. Having said that, the new Wikipedia:WikiProject Christianity/Christianity in China work group has about 400 pages total tagged to date in Category:Christianity in China work group articles. It might work for the purposes you're suggesting. I expect there to be a lot of deviation there, though, because many of the assessments seem to have been copies of preexisting assessments. John Carter (talk) 20:17, 12 May 2008 (UTC)