Talk:Elo rating system

From Wikipedia, the free encyclopedia

This is the talk page for discussing improvements to the Elo rating system article.

This article is within the scope of WikiProject Chess, a collaborative effort to improve Wikipedia's coverage of chess. For more information, visit the project page, where you can join the project and/or contribute to the discussion.
B This article has been rated as B-Class on the quality scale.
High This article has been rated as High-Importance on the importance scale.


[edit] Precise statistical model

For the ELO system, the precise statistical model and the estimation of its parameters are difficult to find on the internet. I would therefore much appreciate seeing them on this page, especially since they should take only a couple of lines.

Done, roughly speaking. It's not clear what the precise model is, since Elo himself waffled between the normal and logistic curves. Moreover, the implementation of the model varies significantly from one organization to the next. Finally, it should be noted that it is a stretch to label these adjustments of ratings up and down as statistical estimation. Yes, there is a model, but adding and subtracting points on a game-by-game basis is a klutzy way to estimate anything, and highly unlikely to be used in any real statistical application.
The rating systems in place today are a political compromise between mathematicians who would like to estimate hypothetical parameters accurately and players who want each game to be a fight over the rating points they win and lose. Players seem to prefer being able to say, "I beat that guy four games straight and took 45 points from him," as opposed to being able to say, "My rating is accurate to the third digit." They don't want accuracy, they want to win and lose points. That way they have something to fight for every single game, even if they are not in contention to win a given match or tournament. --Fritzlein 20:19 28 Jun 2003 (UTC)
Can't they fight over fractions, or floating points (:)), instead? lysdexia 17:12, 12 Nov 2004 (UTC)

[edit] Benefits

On a hypothetical basis, if someone was forced to play against someone who they knew they were going to lose to no matter what, would it be more beneficial for that person to win as many games as possible first, raising their ELO as high as possible before playing the game they would inevitably lose, OR would it be better not to play, keeping their rating lower so they would lose fewer points, and then recover afterwards? —Preceding unsigned comment added by 220.239.20.242 (talk) 06:04, August 25, 2007 (UTC)

[edit] His name

Tidbit: "élő" means "living" in Hungarian language. --grin 19:45, 2004 Apr 6 (UTC)

Please, would it be possible to mention his name correctly spelled at least once, maybe in parentheses? His name is Élő Árpád (beware the accents) in Hungarian, or Árpád Élő in the English order of names. ("Élő" is his family name and "Árpád" is the given name.)

And please, kill those acronym-like, all-caps references to "Elo".

91.120.127.14 22:05, 29 September 2007 (UTC)

  • His name, correctly spelled, is Arpad Elo. That's the way he chose to spell it himself. No evidence whatsoever has been provided that Elo ever spelled his name in any other way. Your complaint about the use of ELO would be a good point, except it's already explained in the second paragraph of the article and the capitalized usage you object to appears nowhere in the article except in a single external link of dubious value. Quale 00:01, 30 September 2007 (UTC)

There's no need for evidence, it's plain common sense for every Hungarian :D He was born in Hungary, and Árpád is an ancient and common Hungarian name. Élő is a common word in Hungarian as well, while Elo means nothing at all. There's no way someone could have spelled his name in Hungary as Elo Arpad, and he must have spelled it as Élő Árpád until his family moved to the US, where of course the accents wouldn't be understood. The French president is called Sarkozy, and not Sárközy, but his Hungarian ancestors were called as such before immigrating to France. Please don't remove information out of ignorance. Thank you. http://hu.wikipedia.org/wiki/%C3%89l%C5%91_%C3%81rp%C3%A1d http://www.google.hu/search?hl=hu&q=%22%C3%A9l%C5%91+%C3%A1rp%C3%A1d%22&btnG=Keres%C3%A9s&meta=cr%3DcountryHU —Preceding unsigned comment added by Fblodilovics (talkcontribs) 17:17, 30 January 2008 (UTC)

  • Please read and understand WP:V and WP:RS. There's no such thing as "plain common sense for every Hungarian" as a justification for an English wikipedia edit. "He must have" doesn't meet wikipedia policy. Also, you should take up your complaint at Arpad Elo where it belongs. Quale (talk) 17:28, 30 January 2008 (UTC)

Well, you claim authority on a Hungarian name while you don't speak the language and you don't believe Hungarians on this matter, but I understand you'd like to make sure the information added comes from a reliable source. So I've read WP:V and WP:RS. All right. Please check that the English Arpad Elo article, many Elo rating system Wikipedia articles in other languages, and this article itself clearly state his native name as Élő Árpád. If you make a Google search for Arpad Elo on the Hungarian Google site, you'll find nothing but Élő Árpád references. This reflects the general consensus of the people of the country he was born in: that he was born as Élő Árpád. Actually there is no known source which claims otherwise. In fact it is common knowledge who he is and what his name is among Hungarian people who have even a little interest in chess. For another reliable source as an example, here is a biographical encyclopaedia of famous people published by the Hungarian state's local government of the county Arpad Elo was born in: ÉLŐ Árpád Imre in Veszprém county's biographical encyclopaedia. Of course the name doesn't require a translation. Can National Geographic be called a 'third-party published source with a reputation for fact-checking and accuracy'? I guess it can. Here is an NG article about him. It's in Hungarian but the name doesn't require a translation. —Preceding unsigned comment added by Fblodilovics (talkcontribs) 21:08, 30 January 2008 (UTC)

I've seen those references, including the National Geographic reference. Over a year ago I asked it be cited in Arpad Elo as a reference if someone who reads Hungarian could check to see if it was higher quality than the references the article had. See the end of Talk:Arpad Elo#Reality is whatever some Wikipedia editor says it is. The main concern I have with those references is that it is not clear if they explicitly claim a spelling for Elo's birth name, or if they are instead back transliterations of his name from English to Hungarian. It is not obvious that a Hungarian name transliterated to English and then transliterated from English back to Hungarian will give the original name. (Not all transliterations map one-to-one this way.) If you insist that his birth name be given in this article on a rating system he developed over 40 years after he had immigrated to the U.S. as a child, you would have been really happy with the 30 May 2007 version. Quale (talk) 05:57, 31 January 2008 (UTC)
WP:MOS "For terms in common usage, use anglicized spellings; native spellings are an optional alternative if they use the Latin alphabet. Diacritics are optional, except where they are required for disambiguation (résumé). Where native spellings in non-Latin scripts (such as Greek and Cyrillic) are given, they appear in parentheses (except where the sense requires otherwise), and are not italicized, even where this is technically feasible. The choice between anglicized and native spellings should follow English usage "
WP:NC "Convention: Name your pages in English and place the native transliteration on the first line of the article unless the native form is more commonly recognized by readers than the English form. The choice between anglicized and native spellings should follow English usage " Bubba73 (talk), 02:13, 31 January 2008 (UTC)
I agree with WP:MOS and WP:NC that the name of the page and the common usage of his name throughout the article should be spelled anglicized. However, I don't see any problem with mentioning the 'born as' tidbit beside the dates of his birth and death. It's additional information and yes, I insist :) The pro/back transliterations might not always be obvious, but in this case it is only not obvious to you. So I've read the discussion of the mentioned article and saw there is some confusion about the reliability of Hungarian sources. No wonder if one doesn't speak the language. The first link mentioned is an online port of an originally print biographical encyclopaedia of famous people published by the Hungarian state's local government of the county Arpad Elo (Élő Árpád) was born in. It has nothing to do with Wikipedia; in fact the collection of data for the book ended in 1997 and some of its sources reach back to the 1920s. The website itself is a Cultural Ministry sponsored project for porting the more important books of Hungary's national library (with 8 million items in its catalogue) to the online world. It is as reliable a source as it can get. The other source is a joint project of KFKI (Central Research Institute for Physics) and the Hungarian Academy of Sciences. Also a very reliable source. Fblodilovics (talk) 13:45, 31 January 2008 (UTC)
No, I don't think you understand. Firstly, "Elo" isn't just his name Anglicized, it was Elo's name. He chose to spell it that way, and he published in English spelling his own name that way. I haven't seen any evidence that Elo ever used diacritics in his own name at all, and certainly not after age 13. The point is, it's entirely irrelevant how Hungarians choose to spell Elo's name today, unless we know that's how his family spelled it when they were in Hungary. The Hungarian references don't seem to ever give Elo's name spelled the way he chose to spell it as an adult, and I'm not sure they explicitly state that his birth name was spelled that way. For a comparison, an article on Elo in Russian might give his name in Cyrillic, but we don't provide the Cyrillic transliteration here. The Hungarian transliteration is only of interest if we are sure that that was his birth name. Your "insistence" seems wildly out of place in this article, since the Hungarian is given at Arpad Elo where it belongs. "Additional information" in this article should be about the rating system; we already have an entire article on Arpad Elo himself. I frankly don't see what point it has in this article. Elo was fully Americanized, living and working in the U.S. for over 40 years under the name "Arpad Elo" when he developed his rating system. We've had tedious tug-of-wars with Hungarian POV pushers in this and many other articles; it would be a shame if you were to do this too. Compare this with John von Neumann, a much more famous Hungarian-American. He immigrated to the U.S. at a much later age, but the absolute insistence on giving the Hungarian spelling of his name in every article in which he is mentioned, at every single opportunity, doesn't seem to be there. It's given at John von Neumann where it belongs. Quale (talk) 15:15, 31 January 2008 (UTC)
I'm not a Hungarian POV pusher; in fact, in my opinion the Ernő Rubik article should be called Erno Rubik, as this is the English Wikipedia. The article is indeed about the rating system, but there is a paragraph specifically about his name, clarifying that it is not an acronym. I think his pre-immigration name fits there perfectly. His family could never have given Arpad as his name, as there is no such name in Hungarian at all, while Árpád is known to be a common name from as early as the 9th century A.D. I also cited a government-published biographical encyclopaedia compiled by known academic researchers who cite their sources. You might argue about why his birth name doesn't belong there, but doubting Hungarians about Hungarian names is just plain silly. Fblodilovics (talk) 16:02, 31 January 2008 (UTC)

[edit] Depth of something ranked with ELO?

I removed the section below from the article, as I can't find any information about this concept elsewhere... can anyone provide a cite? -- The Anome 14:16, 12 Sep 2004 (UTC)

The ELO rating depth also says something about the "depth" of the game. The total depth of a game is defined by the two end points of the possible range of skills, from the total beginner to the theoretical best play by an infallible, omniscient player.
Neither is easy to establish: is someone already a beginner who has just heard the rules, thereby setting the lowest standard, or does it take several games until one has internalized the rules of a game and is able to play on one's own? At the other end of the range one simply has to take the best player at a given time. The total beginner, playing on his or her own according to the simple rules, can in Go safely be set at 30 kyu. Theoretical best play could correspond to the strength of an imaginable 13 dan, according to measurements of standard deviations among professional games.
Taking only 20 kyu and 9 dan as endpoints still makes Go a very deep game. A rating difference of 2900 ELO points from (Gu Li) down to a 20 kyu with 100 ELO points is a difference in insight into the game of 29 times the standard deviation (100 ELO points).
Chess in comparison has a similar upper endpoint (Garry Kasparov, once at 2851 points, see above), yet the standard deviation is set at 200 ELO points. Chess is more difficult to compare due to the draws; however, this results in a depth of chess of (only) 14 layers of standard deviation, if the total beginner in chess had a rating of zero ELO points (which s/he does not, AFAIK).

I remember reading something similar to this in Chess magazine (London) probably about eight or nine years ago, but I don't have a cite (I've a feeling it was in one of Fox and James' columns, but can't be sure). If I remember correctly, it reported a study which had counted the number of steps one needed to take in a number of games to get from the weakest player in the world to the strongest, where each intermediate player could score 75% against the one below. Go had the most steps by far (and so was considered the most "deep" or "difficult" game); chess was second; various other things were also considered (checkers I remember was in there, backgammon too, I think). But in any case, I'm not sure something like the above really belongs in this article: it's not about the Elo system per se; the Elo system is just being used as a tool to measure the "depth" of chess. Perhaps a mention could be made in the chess or Go articles or in some new comparison of chess and go article. --Camembert

Sorry I didn't chip in on this topic before. Yes, the ELO system has certainly been used to measure the depth of games in the manner described by the paragraphs which were removed from the article. By this measure go is a deeper game than chess, after which checkers, bridge, and poker follow in close succession. However, there is a serious problem in comparing chess to games like bridge and poker: how many hands of the latter are equal to one game of chess? The luck involved in cards means that it may take a whole evening for the superior skill of one player to manifest itself. Also there is a question of the margin of victory, as one big pot in poker can cover lots of small losses.
I think the appropriateness of this section for the article is marginal, because the fundamental concept is not really that of statistical estimation, but that of a "class interval" being a difference in skill such that the stronger player can win 75% of the time. For different games the statistical model may be different. I believe that for go tests have shown that the normal curve approximates performance better than the logistic curve. When two games use a different model it is a stretch to say that you are comparing the range of ELO ratings in each case. On the other hand, the notion of measuring the depth of a game by the number of class intervals is an interesting topic in its own right, and deserves to be covered somewhere in Wikipedia. Maybe it makes more sense for it to be attached to this article than to be put anywhere else?
Oh, and the explosion of scholastic chess in the U.S. has indeed given rise to ratings of zero. It shouldn't be too surprising that a random 6-year-old with no special gift for that game can play that badly. But if you include a zero rating in chess, you have to go down to something like 35 kyu or lower in go. Furthermore the tradition that 9-dan is the highest rank doesn't allow ratings on the upper end to expand as much as they should. Therefore, if we measure chess in a way that shows 15 class intervals, then a comparable measurement in go may show 45 or more class intervals. No matter how you slice it, the class interval measurement asserts that go is vastly deeper than chess. --Fritzlein 16:18, 14 Nov 2004 (UTC)
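The class-interval arithmetic in this thread can be sketched in a few lines of Python. This is only an illustration under the logistic-model assumption, where a 75% expected score corresponds to a gap of 400*log10(3), roughly 191 points:

```python
import math

def class_interval():
    # Rating gap at which the stronger player scores 75% under the
    # logistic model: solve 1 / (1 + 10 ** (-d / 400)) = 0.75 for d.
    return 400 * math.log10(3)

def depth_in_intervals(rating_range):
    # Number of 75%-dominance "class intervals" a rating range spans.
    return rating_range / class_interval()

print(round(class_interval()))          # -> 191
print(round(depth_in_intervals(2851)))  # -> 15
```

A range from a zero-rated beginner to Kasparov's 2851 thus spans roughly the 15 class intervals mentioned above.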

[edit] Glicko system?

Do we have an article about the Glicko rating system, which is gaining popularity? Apparently Glicko-2 could replace Elo one day.--Sonjaaa 02:26, Jan 31, 2005 (UTC)

Glickman's system has real advantages over the current clunky implementations of Elo's model, but that's not enough to make it a likely replacement. Are you suggesting that the USCF might adopt it any time soon? If so, you know more about USCF politics than I do. I was under the impression that the USCF ratings committee was a fairly conservative body. Or is ICC making the switch? Last I knew (and I confess to being out of date) only FICS was using Glicko ratings. Who else is jumping on the bandwagon? --Fritzlein
While the idea that some players have a better-determined rating than others is appealing, and may be useful in other sports, actual sports organizations penalize inactivity by taking away points over time, rather than by increasing the rating "uncertainty". The Elo system has theoretical underpinnings that make it a true statistical estimator, at least when K is set sufficiently low. But so far there has not been any indication that Glicko is actually an improvement in terms of its predictive ability. Glicko-2 is even less well motivated than Glicko: it has both a rating deviation, RD, and a rating volatility σ. I believe that both systems can probably be manipulated by a group of conspirators fixing games against each other in such a way as to drive the ratings up for one of the participants.--Kotika
Glickman is a statistician, so it isn't surprising that he thinks improvements in the rating system will come from doing better statistics on the same data. Unfortunately for his project, the underlying model IS NOT QUITE TRUE. Adding layers of refinement to the estimation technique is akin to finding the radius of the earth to the tenth digit: eventually you must face the fact that the earth is not truly spherical (It is wider at the equator than at the poles.), so extra digits of accuracy in the radius have no meaning.
The most compelling evidence that the Elo model doesn't hold true comes from the on-line chess servers. The blatant counter-example to the truth of the model is computer players, but subtler proof comes from the distortions of ratings that arise from players being able to select their opponents, favoring some and avoiding others. It is no coincidence that many ICC members consider the only accurate ratings on the server to be those from which computer players are barred and the games are paired randomly by the server rather than by choice of the participants themselves.
My opinion is that, since the underlying model is false, it is misguided to focus on more accurate estimation. Rather one should focus on the concern Kotika raises, namely rating manipulation. One's primary focus should be to minimize the opportunities for participants, either singly or in collusion, to distort their ratings, particularly opportunities to inflate their ratings. I suspect that Kotika's imputation is not quite right, i.e. I suspect the Glicko system is if anything slightly less vulnerable to manipulation than plain vanilla Elo ratings. But I do think Glicko's energy is somewhat misdirected. In practice, the biggest accuracy problems with the Elo system don't come from the klunky estimation technique, they come from the model being wrong, and from clever people exploiting the wrong model to cheat the system. --Fritzlein 16:35, 27 Mar 2005 (UTC)
The exploits you refer to would not be possible in OTB tournaments. --Malathion 07:36, 24 Jun 2005 (UTC)
Very true. It was self-selection of opponents on-line that first showed us the inadequacies of the USCF model. When you don't get to choose your opponents, it covers up 95% of the deficiencies of the model. If you are in an environment where players can't select their opponents, I guess it makes sense to focus on the 5% of the problem that remains, rather than focusing on the huge problem of rating manipulation that opponent-selection creates. --Fritzlein 20:26, 25 October 2005 (UTC)
Hello. Do you have a source for that figure (95%)? Or is it just an estimate from your experience? If there is a study on this topic somewhere, I'd be really interested in reading it. BTW, the Glicko model is different (but similar, I agree) from the Elo one. Moreover, could you explain briefly the strategy used by cheaters to increase their rating by selecting opponents? Finally, when you say that "the underlying model" is false, I understand you to mean that some aspects of real games are not well modeled. What are these aspects (at least the most important ones) to your mind? Thanks a lot. Dangauthier 16:47, 4 June 2007 (UTC)
The 95% figure was pulled from thin air based on my experience. I basically meant that, even if you don't allow self-selection of opponents, you can probably statistically prove that Elo's model is wrong, but if you do allow self-selection of opponents, it is glaringly obvious that the model is wrong.
As for aspects of reality that are not well-modeled, the simplest argument is a circle of dominance. Suppose I can find three players A, B, and C, any three in the world, such that A beats B more than 50% of the time, and likewise B beats C, and C beats A. If I find even one such triplet, I have proven the model false. Unfortunately, while it is intuitively obvious (at least to me) that such circles exist, it is very hard to accumulate enough evidence to prove it statistically, because during the number of games it takes to measure, the skill level of the participants will have changed!
A better hope statistically is to prove non-transitivity in some more general sense. If A beats B 75% of the time, and B beats C 75% of the time, the system demands that A beat C exactly 90% of the time. These percentages correspond to rating gaps of 191, 191, and 382 points respectively. If you can show that for rating gaps of 382 points the favorite only wins 88% of the time, plus or minus 1%, then the model has been busted. Actually, Mark Glickman has already proven something very like this to be true, but he chose to interpret it as a evidence of poor estimation of ratings, rather than taking it as evidence the model is wrong. He has a case.
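The percentages in the preceding paragraph can be checked numerically; this is a minimal sketch assuming the logistic Elo model:

```python
import math

def expected(gap):
    # Expected score for the stronger player at a given rating gap.
    return 1 / (1 + 10 ** (-gap / 400))

def gap_for(p):
    # Rating gap implied by an expected score p (inverse of expected).
    return 400 * math.log10(p / (1 - p))

print(round(gap_for(0.75)))                      # -> 191
# Two stacked 75% relationships imply a gap of about 382 points,
# at which the model demands roughly a 90% expectation:
print(round(100 * expected(2 * gap_for(0.75))))  # -> 90
```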
The best way to inflate your rating (without cheating!) is to pick a computer you know how to beat, and beat it over and over in essentially the same way. It will never catch on to your methods and stop you. Meanwhile a weaker player who can't beat that computer yet will lose to it over and over, because the computer will never blunder. The fish donates tons of points to the bot, which the bot then transfers to you. In the end you might end up rated 400 points above the computer, while the poor schmoe ends up 400 points below the computer, but in reality you are nowhere near 800 points better than the fish. This was historically the first obvious violation of transitivity of rating differences.
A secondary way to inflate your ratings is to play exclusively against humans who have inflated their ratings by the first method, and diligently avoid playing the underrated schmoes who gave all their points to computers. --Fritzlein 03:38, 5 June 2007 (UTC)
Thanks for these explanations. I'll think about your arguments and comment later. E.g. are some computers officially rated? However, this page is probably not the best place for that. Dangauthier 00:22, 9 June 2007 (UTC)

[edit] Elo for Multiplayer games??

Is there a version of Elo, or a different rating system that's ideal for rating multiplayer games like Scrabble or what not?--Sonjaaa 13:01, Feb 26, 2005 (UTC)

Scrabble is considered a two-player game by serious Scrabble players, because the multiplayer version is hugely influenced by the order of play, so much so that it seems impossible to make multiplayer Scrabble fair enough for tournament play. Nevertheless your question is valid for true multiplayer games like Diplomacy. There is a natural extension of Elo's basic formula for expected number of wins, which can be expressed on the same logarithmic scale Elo chose, i.e. 200 points for a class interval. If there are N players with ratings R1, R2, ... RN, then the expected wins for player I would be 10^(RI/400)/[10^(R1/400) + 10^(R2/400) + ... + 10^(RN/400)]. Based on this model, one can produce ratings estimates from game results in a variety of ways, including simple linear adjustments parallel to Elo's suggestion for chess.
The validity of this method for any given multiplayer game is very much open to question, but I have never heard of anything better. At least this extension of Elo is plausibly fair to all players. --Fritzlein 04:03, 27 Feb 2005 (UTC)
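The multiplayer extension described above is straightforward to implement; here is a minimal sketch (the ratings are purely illustrative):

```python
def multiplayer_expected_wins(ratings):
    # Each player's expected share of wins is 10^(R/400) of the total,
    # per the N-player generalization of Elo's two-player formula.
    weights = [10 ** (r / 400) for r in ratings]
    total = sum(weights)
    return [w / total for w in weights]

# Three players 400 points apart; the expectations always sum to 1.
probs = multiplayer_expected_wins([1200, 1600, 2000])
print([round(p, 3) for p in probs])  # -> [0.009, 0.09, 0.901]
```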
I missed something in there. In the main article it states that expected wins can be calculated as 1/(1 + 10^((R[a]-R[b])/400)). Where does the series you note above fit into that?--Nolesce
I apologize for not noticing your question when it was written, but I'll answer it now. Before generalizing the two-player formula to a multiplayer formula it pays to notice that 1/(1+10^((R_a - R_b)/400)) is equivalent to 10^(R_b/400)/(10^(R_a/400)+10^(R_b/400)). If you take chess ratings, divide by 400, and take the inverse logs, the expectancy formula is a simple proportion. For example, let R_a = 1102 and R_b = 1295. We calculate 10^(1102/400) = 569 and 10^(1295/400) = 1728. The odds of winning are therefore 569:1728. Player A's probability of winning is 569/(569+1728), while Player B's probability of winning is 1728/(569+1728).
Now we can easily generalize. If Player C has rating R_c = 1427, we calculate 10^(1427/400) = 3694. When the three players contest a multi-player game, the odds will be 569:1728:3694. Player A's probability of winning is 569/(569+1728+3694), while Player B's probability of winning is 1728/(569+1728+3694), and Player C's probability of winning is 3694/(569+1728+3694). Does this make more sense now? --Fritzlein 20:06, 25 October 2005 (UTC)
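The worked numbers in the two replies above can be verified directly:

```python
def weight(rating):
    # Proportional "odds" weight: divide the rating by 400 and take
    # the inverse logarithm, as described above.
    return 10 ** (rating / 400)

w_a, w_b, w_c = weight(1102), weight(1295), weight(1427)
print(round(w_a), round(w_b), round(w_c))  # -> 569 1728 3694

# Player C's winning probability in the three-player game:
print(round(w_c / (w_a + w_b + w_c), 3))   # -> 0.617
```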


First of all, I LOVE DIP TOO! I was actually thinking of using it for games like Settlers or Carcassonne or Ticket to Ride in our group of friends. But anyway, what about this idea suggested by a friend: if player A wins against B and C, then the Elo is calculated as if it were 2 games: A beats B, A beats C. Is that any mathematically better or worse than the one you mention?--Sonjaaa 08:18, Feb 27, 2005 (UTC)

Ah, your idea is also superficially reasonable, and in fact it is what Yahoo Games uses for hearts. The winner is assumed to have beaten all three opponents at individual games. However, it is not at all mathematically equivalent to what I propose, and I don't like it one bit, because your rating adjustment depends on who you lose to. This unbalances the incentives and places the players on an uneven footing in the meta-game of ratings.
Let's say we are playing Settlers. I am rated 1200, you are rated 1600, and Jughead is rated 2000. Now it turns out that late in the game I am about to win (lucky dice), Jughead is close behind, but you have slim chances yourself. You do a quick mental calculation and see that if I win you will lose 29 rating points to me, but if Jughead wins you will lose only 3 rating points to him. Therefore you abandon your own slim chances and give all of your resource cards to Jughead for free, and otherwise try in every way to help him win instead of me.
That shouldn't happen. When you sit down to play you should know that you win X points for winning and lose Y points for losing no matter how the other players fare, so you have no incentive to favor anyone. Buz Eddy realized this when he made his Maelstrom ratings for Diplomacy using the extension of Elo ratings I first mentioned, and I haven't seen it improved upon. --Fritzlein 17:02, 27 Feb 2005 (UTC)
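For what it's worth, the 29-point and 3-point figures in the Settlers example are consistent with the usual two-player update at K = 32 (the K-factor is an assumption here; the example doesn't state one):

```python
def points_lost(loser_rating, winner_rating, k=32):
    # Under the "A beats B, A beats C" scheme, each loss is scored as a
    # separate two-player game: the loser drops k times their expectation.
    expected = 1 / (1 + 10 ** ((winner_rating - loser_rating) / 400))
    return k * expected

# You are rated 1600: losing to the 1200-rated winner costs far more
# than losing to the 2000-rated Jughead.
print(round(points_lost(1600, 1200)))  # -> 29
print(round(points_lost(1600, 2000)))  # -> 3
```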


The above seems reasonable for multiplayer games with one winner. What about multi-player games with multiple winners, such as Mafia?

For Diplomacy, which may end in a draw including some of the players and excluding others, the ratings give the losers a score of zero each and split one point between the winners. For example, suppose the seven players in Diplomacy are rated 1200, 1300, 1400, 1500, 1600, 1700, 1800. Their expected scores would be 0.014, 0.025, 0.045, 0.079, 0.141, 0.251, 0.446 respectively. If the latter three share in a three-way draw, the actual scores would be 0, 0, 0, 0, 0.333, 0.333, 0.333. With a K factor of 100, the ratings adjustments would be -1, -3, -4, -8, +19, +8, -11 respectively. Note that expectations on the top-rated player are so high that a three-way draw is actually a sub-par performance that costs points. --Fritzlein 19:27, 10 March 2006 (UTC)
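The numbers in that Diplomacy example reproduce exactly under the scheme described (losers score zero, the draw participants split one point equally, K = 100); a sketch:

```python
def diplomacy_adjustments(ratings, draw_winners, k=100):
    # Expected scores come from the multiplayer Elo extension; actual
    # scores are 0 for losers and an equal split of 1.0 for the winners.
    weights = [10 ** (r / 400) for r in ratings]
    total = sum(weights)
    share = 1.0 / len(draw_winners)
    return [k * ((share if i in draw_winners else 0.0) - w / total)
            for i, w in enumerate(weights)]

ratings = [1200, 1300, 1400, 1500, 1600, 1700, 1800]
adj = diplomacy_adjustments(ratings, draw_winners={4, 5, 6})
print([round(a) for a in adj])  # -> [-1, -3, -4, -8, 19, 8, -11]
```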

In my opinion, the discussion about Elo for multiplayer (3+ players) games should be added to the main article. Or at least the generalized formula for multiplayer matches. What do you think? --Joaotorres (talk) 07:21, 2 April 2008 (UTC)

The ranked multiplayer part of Company of Heroes is using a rating system based upon the Elo system. --Fblodilovics (talk) 16:14, 21 April 2008 (UTC)

[edit] Confusion About The Confusion

Is there really any likelihood that the "ELO rating system" will be confused with the acronym for the 70's band "Electric Light Orchestra"? --BadSanta

It seems to me that the disambiguation at ELO should be enough, since to even get to this page you have to say something about rating systems. Does the Electric Light Orchestra page need a link to this one for people who want to know about chess ratings?

[edit] Formula for Ea, Eb

Is there a way to make the formula for calculating Ea and Eb more clear? When I read it the denominator looks like 1+10*(Ra-Rb)/400, which didn't work mathematically. I had to research some other sites before I found that it was actually 1+10^((Ra-Rb)/400). Did anyone else have this problem? PK9 03:54, 24 October 2005 (UTC)

Would parentheses around the exponent help? I think the formula is clear now, but of course I'm expecting the right answer, which makes it easier to see. I believe that for most readers the current layout is easier to comprehend than it was when it was in plain text, even though the plain text is unambiguous, as your paragraph above demonstrates. Please experiment with the math markup if you have any ideas. --Fritzlein 20:16, 25 October 2005 (UTC)

I also had the same problem. The ideal thing would be to superscript the exponent more. The parentheses around the exponent didn't help me but thanks for trying. Other text formulas use a caret for the exponent -- while it looks amateurish, it's actually clearer. erixoltan 11/9/2006.
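The ambiguity discussed in this thread disappears when the formula is written in code, where the exponent must be explicit. This sketch uses the common convention that player A's expected score puts Rb - Ra in the exponent:

```python
def expected_score(r_a, r_b):
    # Correct reading: 1 + 10 ** ((r_b - r_a) / 400) in the denominator,
    # NOT 1 + 10 * (r_b - r_a) / 400.
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

print(expected_score(1500, 1500))            # -> 0.5 for equal ratings
print(round(expected_score(1600, 1200), 3))  # -> 0.909 for a 400-point edge
```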

[edit] Jeff Sonas' site

chessmetrics.com for more info on his rating system, since it has changed a bit since 2002. 128.6.175.26 17:53, 2 February 2006 (UTC)

[edit] Elo or ELO?

I think all the instances of this word should be spelled in lower-case: "Elo". Chvsanchez 04:03, 4 April 2006 (UTC)

I also would prefer to always spell it "Elo". Given that it is not an acronym, I don't understand the capitalization. Unfortunately, for whatever reason, "ELO" seems to be standard. --Fritzlein 17:42, 4 April 2006 (UTC)
Fixed. Search4Lancer 01:49, 27 April 2007 (UTC)

[edit] The Hydra handle Zor_Champ

The Hydra team has always used the handle Zor_Champ in the Playchess server, this has been known for years. When you say "team," it makes it appear as if they use a commercial program or grandmaster advice along with their Hydra engine to decide on what moves to play, which is untrue, all moves are decided purely by Hydra. You can log into Playchess and ask Zor_Champ yourself. Dionyseus 21:35, 27 April 2006 (UTC)

I didn't say team; their website says team.WolfKeeper 21:51, 27 April 2006 (UTC)

But what they mean by "team" is that they as a team created Hydra, in other words they want some credit too. Log into Playchess and ask them yourself, they regularly test their engine modifications in the Engine room. Their entire goal is to prove to the world that Hydra is the strongest chess entity, it would make no sense for them to use the aid of other engines, or human aid during games. Dionyseus 22:03, 27 April 2006 (UTC)


And even if what you say is true (and I've seen contrary claims elsewhere); that doesn't prove that Hydra has the highest Elo; or establish what it is, they haven't played enough games yet; it takes more than a couple of matches.WolfKeeper 21:54, 27 April 2006 (UTC)

I'd also like to know why you insist on putting in the article that centaurs regularly outperform Hydra. Where is your proof of this? The recent 2006 PAL/CSS Freestyle Tournament clearly shows otherwise. Dionyseus

It lost in previous years. If you can find evidence that Hydra actually was playing alone in this 2006 competition (when the team was under no obligation to do that); add it or refer to it. Otherwise stop reverting; you're violating NPOV every single time.WolfKeeper 22:09, 27 April 2006 (UTC)
The main reason it was unable to qualify into the finals in the 2005 PAL/CSS Freestyle tournament was because of outright and obvious human errors. The fact that it was only using 32 nodes as opposed to the 64 nodes it uses now doesn't help either. I can provide you with a link where you can download the games from that tournament if you'd like. Dionyseus 22:18, 27 April 2006 (UTC)
Irrelevant as to your deletion. The fact that some people think centaurs or cyborgs play better than humans does not seem to be controversial; and probably should go in the article. The trick is not putting undue weight on it, or putting undue weight on the different idea that Hydra is inevitably stronger either (because zor_team won one match???). NPOV is about capturing the points of view, not trying to impose any supposedly correct view on the wikipedia.WolfKeeper 22:36, 27 April 2006 (UTC)
I can't off-hand remember how many ELO points twice as much speed gives you. Maybe 50 points; not necessarily decisive.WolfKeeper 22:36, 27 April 2006 (UTC)
It is obvious that centaurs perform better than humans, no one disputes that. However, there is no evidence that centaurs have outperformed Hydra, in fact the data available thus far indicates otherwise. By the way, where did you get the idea that doubling of speed equals 50 elo points? Do not dismiss the 2004 match between Hydra and Shredder 8, Hydra with just 16 nodes dominated the former computer world champion [1], made the former computer world champion look like an amateur program, sort of how it made Michael Adams, who at the time of the match in 2005 was ranked 7th in the world, appear as an amateur even though it only used 32 nodes. Now Hydra is using 64 nodes, this is 4 times the speed of the Hydra that dominated Shredder 8 in 2004, this is twice as fast as the Hydra that dominated Michael Adams. Dionyseus 23:25, 27 April 2006 (UTC)
Arno Nickel has beaten Hydra 2 games with computer assistance. In addition, humans do better at longer time schedules. Other engines are weaker than Hydra, but whether they are weaker with Human assistance is very much less clear. There's also the point that in Freestyle play in principle anyone can network enough iron together to outprocess Hydra. Hydra is inflexible, the owners have to buy nodes, rather than rent or borrow.WolfKeeper 17:08, 18 May 2006 (UTC)

[edit] I have requested mediation

I have requested mediation about the Hydra matter. I would appreciate it if you would stop reverting my edits and cooperate so that we can resolve this matter. Here's the page, http://en.wikipedia.org/wiki/Wikipedia:Mediation_Cabal/Cases/2006-04-27_Elo_rating_system Dionyseus 00:21, 28 April 2006 (UTC)

[edit] Other Gaming Mediums

It might be worth mentioning that Elo ratings have also been applied to videogames, specifically the game Age of Empires III with the cuetech ratings based on the Elo system. These ratings are often taken with the same seriousness as the chess ratings among players.

They've also been used in Unreal Tournament's online play rating system.WolfKeeper 17:02, 18 May 2006 (UTC)

[edit] Elo rating and Computer Programme

Many computer chess programmes are available which give ratings. FIDE or http://www.fide.com should develop a computer programme easily available to the world for rating. I request the reader of this discussion to forward an email to fide.com vkvora 18:47, 23 May 2006 (UTC)

[edit] Ratings Inflation

The article needs a section on ratings inflation. Rocksong 02:54, 7 August 2006 (UTC)

I agree. When I first wrote the article it seemed like too much detail to talk about rating inflation/deflation, but some of the sections that have been added since are arguably even less relevant, so the time is ripe to address the issue.
Unfortunately, all the different implementations of Elo's ideas mean that each implementation suffers from different problems. For example, the USCF implemented "rating floors" to combat sandbagging and deflation (both real problems), and as a result got ridiculous inflation of ratings within the chess-playing prison population, which is both more active and more insular than the general USCF population. How much space does USCF's failed experiment deserve?
Moreover, even if we restrict ourselves to talking about inflation of FIDE ratings, people mean two very different things by "rating inflation". Some people mean that the top ratings and average ratings are higher than they used to be. A 2600 FIDE rating used to make you a World Championship contender, and now it doesn't get you into the world top 100.
On the other hand, an equally powerful definition of inflation is that playing at the same absolute skill level now earns a higher FIDE rating than it used to. The intuition is that a rating of, say, 2400, should not necessarily place you at the same ranking in the world list as it used to, but instead it should mean a 50% chance of winning a game if you could go back in time to play someone rated 2400 decades ago.
By this second definition, FIDE ratings are probably not suffering inflation. Indeed, they are actually suffering deflation, in that you have to play much better chess now to be rated 2400 than they had to in the old days. You have to know more about openings, and be more accurate tactically, for example.
Given that FIDE ratings are gradually inflating according to one definition, and gradually deflating according to an equally valid definition, extending this article to cover rating inflation is a rather tricky project.  ;-) --Fritzlein 18:08, 9 August 2006 (UTC)
Nevermind, I did it. Edit away! --Fritzlein 19:54, 9 August 2006 (UTC)

[edit] Deliberately Misleading Information

Deep Junior did not win a match or even a game against Hydra. The article claims that as of 2006, Junior is the Computer Chess Champion, proving that Hydra's 32 processors are not superior to Junior on a dual AMD processor. This is misleading. Junior won a tournament that crowned it computer champion, but Hydra was not in that tournament. This piece of misleading information was inserted by Chessbase. Chessbase is the author of Junior and did so to advertise its product. They have a history of lying and being deceitful to promote their software. For example they refuse to acknowledge Rybka, which is a commercial engine vastly superior in playing strength to anything Chessbase has produced. It is well known among any computer chess enthusiast that Hydra would destroy Junior handily. This is not something that could be printed in the article because they have not had such a direct match. But what is currently in the article needs to be removed ASAP. It is misleading... and damnit I'm sick of Chessbase's lies.

More to the point, (a) arguing over which is the best chess program does not belong in Wikipedia, and (b) any comparisons belong in Computer chess, not here. I say delete the entire 2 paragraphs which discuss computer chess. p.s. Remember to sign your comments. Rocksong 12:29, 21 August 2006 (UTC)
The point is, the article is about ratings, so to the extent that we know the ratings, it is reasonable to discuss players (including computer players) ratings a little here.WolfKeeper 17:05, 21 August 2006 (UTC)
Fair enough. But how about this: we should explain the often-used term "performance rating" (which, surprisingly, the article doesn't do yet). Then we could list the best performance ratings of computers (and people). Also - I wanted to say this but I wasn't certain - computers don't have official ratings, probably because they don't play people often enough under tournament conditions, right? Rocksong 23:47, 21 August 2006 (UTC)
Mainly because it's just not allowed. I don't necessarily agree that we should remove the computer chess discussion as it ties in with ratings (once, as suggested by Rocksong, performance ratings are explained). Explaining why Hydra's domination of Adams only "proved" it had a rating of 2850 or higher is a very important concept.
I'm not happy about the paragraph about Rybka either. At least two of the 4 sources are rapid chess, and the results are all against other computers. Better, I think, to note that computers don't have official ratings, and link to some of these comparison sites; rather than single out Rybka (or any other program). Rocksong 01:56, 23 August 2006 (UTC)
There's more than one rating list for humans though as well.WolfKeeper 02:21, 23 August 2006 (UTC)
So? That doesn't affect my point: that a score of 2900 on these rating lists, generated solely from computer-versus-computer play, often in conditions completely different from tournament play, means (almost) nothing when compared to a FIDE rating. Don't some people have ratings over 3000 on ICC? Again, so what? Rocksong 06:13, 23 August 2006 (UTC)
Do you have a cite for the claim that it means almost nothing?WolfKeeper 07:48, 23 August 2006 (UTC)
I don't think RockSong needs a cite for his point. The point is that the article compares computer ratings to human ratings as though they are equivalent. Clearly they aren't. He doesn't need to cite that.
Do you have a cite that they correlate to FIDE ratings? Rocksong 08:06, 23 August 2006 (UTC)
I'm not making a positive claim, you are. The idea that they have '(almost) nothing' connecting them to the FIDE ratings seems to be highly unlikely, given that there *are* games played between humans and computers and they help keep the two rating scales in step, but I'll accept a good cite. So- cite please?WolfKeeper 08:18, 23 August 2006 (UTC)
See my comment below (dated 06:34, 23 August 2006 (UTC)). So long as there's a reasonable qualifier in the article, I don't care. The debate bores me. Rocksong 08:42, 23 August 2006 (UTC)
I've put "Ratings of Computers" in a separate section, and added a qualifying paragraph at the front. I think the qualifier is important. Beyond that, I've no interest in debates on the relative merits of different computers. Rocksong 06:34, 23 August 2006 (UTC)


[edit] Provisional period crude averaging

This section sounds extremely biased, including these quotes: "for some reason a crude averaging system" and "Apart from the obvious flawed logic". Although I see the point, and agree with it, it sounds extremely insulting to the sites that use this method. 24.237.198.91 05:58, 24 August 2006 (UTC)

That section is so poorly written, it doesn't even make clear what it is objecting to. I think I can guess what the author is upset about, but I don't know how anyone unfamiliar with the ratings ecosystem would be able to figure it out.
All rating systems have difficulty giving a roughly accurate rating to a previously unrated player. Many systems have a method of calculating "provisional" ratings for new players by some means radically different from Elo's standard formula of upward/downward adjustment. One such system, which I agree is literally "crude", is to calculate the "performance" of a player as equal to the rating of the opponent in case of a draw, 400 points higher than the opponent for a victory, or 400 points lower than the opponent for a loss. So if I beat someone rated 1400, draw someone rated 1500, and lose to someone rated 1750, that gives me "performances" of 1800, 1500, and 1350. My average performance would be 1550, which can serve as a provisional rating.
What makes this system objectionable is that a win against a low-rated player can lower my provisional rating, while a loss to high-rated player can raise my provisional rating. In the above example, suppose I lost my fourth game to a player rated 2150. That would give me a "performance" of 1750, and raise my provisional rating from 1550 to 1600. It is intuitively obviously unfair to be rewarded for any loss or punished for any victory. This provisional system effectively rewards selecting opponents who are rated as high as possible.
If the system simply adds an exception that "a win can't hurt you and a loss can't help you", it can actually make the problem worse. As an unrated player in that "fixed" system, I need only make sure to play my first game against someone rated way above my skill level, and the rest of my provisional games against players so weak I can easily beat them. Say I play a 2350-rated player first, and get a provisional rating of 1950 for the loss. Then I win nineteen games in a row against players rated 1000 or less, and since a win can't hurt me, I get to keep my provisional rating of 1950 all the way until it becomes a regular rating.
Based on my cursory reading of what the BCF does for provisional ratings, it goes even further than the "fixed" system. The BCF will not only ensure that you can't lose points for a win, it will actually ensure that you gain points for a win, even if you are already overrated in the provisional period. This addresses the intuitive issue of fairness in gaining/losing points on a per-game basis, but may actually result in less-accurate provisional ratings. The ECF system effectively rewards selecting opponents who are rated as low as possible. I would therefore add my voice to those questioning the neutrality of the section in question. However, I think a larger issue than NPOV is that the section needs to be re-written so that people can tell what the heck it is talking about. --Fritzlein 17:19, 24 August 2006 (UTC)
I agree it's hard to work out its point. I say delete that whole subsection. Rocksong 11:59, 25 August 2006 (UTC)
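For what it's worth, the crude-averaging scheme Fritzlein describes can be sketched as follows (a toy illustration, not any federation's actual code). It reproduces his numbers, including the pathology where a loss to a highly-rated fourth opponent raises the provisional rating:

```python
def crude_provisional(results):
    """Crude provisional rating: average of per-game 'performances'.

    results: list of (opponent_rating, score) with score 1 (win),
    0.5 (draw), or 0 (loss).
    """
    performances = []
    for opp_rating, score in results:
        if score == 1:
            performances.append(opp_rating + 400)   # win counts as opp + 400
        elif score == 0:
            performances.append(opp_rating - 400)   # loss counts as opp - 400
        else:
            performances.append(opp_rating)         # draw counts as opp rating
    return sum(performances) / len(performances)

games = [(1400, 1), (1500, 0.5), (1750, 0)]
print(crude_provisional(games))                 # 1550.0, as in the example
print(crude_provisional(games + [(2150, 0)]))   # 1600.0: the loss *raised* it
```

The second call shows exactly the objection raised above: adding a loss against a 2150 player lifts the provisional rating by 50 points.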

[edit] Tone

This is an informative and detailed article, so congratulations to those who have worked on it, but its tone is distinctly unencyclopaedic. In many places it has the hallmarks of text that has been reworked many times in different directions by different parties, and reviewing this talk page suggests that this is so. I have slapped the 'tone' tag on it for now, but please don't consider this an aggressive gesture. I would like to see that aspect of the article improved and would do it myself but for time constraints. Soo 23:00, 26 September 2006 (UTC)

I agree, the tone has a lot of problems that are obvious in several sections. Night Gyr (talk/Oy) 21:51, 4 November 2006 (UTC)

[edit] Geocities?

Geocities fails WP:V and WP:RS as a self-published source. I've removed the reference. If the information is present in a reliable source it can be referenced there, if it isn't, it can't be referenced.--Crossmr 07:00, 4 January 2007 (UTC)

And how are the other 3 refs any different? All of them appear to be self-published and unverifiable. Rocksong 22:46, 4 January 2007 (UTC)
If they are, feel free to remove the information or put a cite tag on it. I only had time to look at the geocities citation.--Crossmr 22:41, 15 January 2007 (UTC)
It isn't being used as the primary source, thus it should be ok as a secondary source. Mathmo Talk 10:01, 19 January 2007 (UTC)

[edit] Questions from an Uninformed Reader

For somebody who has no existing information about Elo, this page seems vague in some areas, especially regarding provisionally rated players. Can established players gain or lose rating points as the result of a match with a provisionally rated player? If so, does the increased K factor apply to the established player as well, or does she use her normal K factor?

I notice some discussion about provisional ratings on the talk page, but the information there hasn't been carried over into the article. I also agree that the formulas are confusing as formatted on the article. I was able to figure them out after seeing the ASCII versions on this discussion page. —The preceding unsigned comment was added by 70.184.146.67 (talk) 20:01, 9 February 2007 (UTC).

The problem with discussing provisional ratings is that every institution that implements Elo ratings does something different. It isn't even clear what type of provisional ratings count as "Elo" provisional ratings. Provisional rating changes often aren't linear adjustments, so the concept of K factor may not even apply to provisional players, although typically provisional ratings change more from game to game than established ratings do.
In general, an established player can gain or lose points from playing a provisionally rated player, although some implementations make that gain or loss less than it would be from playing an established player, in which case the established player effectively uses a lower-than-normal K factor.
How to properly rate newcomers is a very thorny issue. Folks are usually glad if provisional ratings are even approximately correct, and then hope that lots of games between established players will even everything out eventually. --Fritzlein 04:30, 10 February 2007 (UTC)

[edit] Rating and probability of a win (Player A vs. Player B)

Another question from an uninformed/unknowing reader... Is it possible to calculate the probability of a win/loss with the Elo system? That is, if I am rated at 2000, and Players A and B are 1800 and 2500, what probability do I have of winning/losing against either of them? This would seem non-trivial based on the inclusion of draws and the way points are allocated (maybe percentage of win/draw vs. loss?). But I would imagine this is a tremendously useful bit of information. As a beginner, if I'm rated at 1200, should I even waste my time against a 1400 opponent? Or should I expect to win enough of the time that the games will be both instructive and have a chance of reward greater than a slot machine? ;-) Thanks! 71.60.83.239 13:10, 22 October 2007 (UTC)

According to the current USCF formulas, your probabilities as a 2000 of winning against an 1800 and a 2500 are 75 percent (0.5 + (2000 − 1800) / 800 = 0.75) and zero, respectively. Your chance as a 1200 of beating a 1400 is 25 percent, so you shouldn't waste your time playing him once -- you should play him four times. --Mr. A. (talk) 01:08, 17 January 2008 (UTC)

The old "classical" formula for the win expectancy is W = 1/(1 + 10^(ΔR/400)), where ΔR is your opponent's rating minus your own. (See formula in the December 1999 issue of Chess Life.) If you set ΔR to −200, you get W = 0.7597. However, the USCF have revamped their rating formulas according to an October 2006 interview with Glickman in Chess Life. One problem with this formula is that it does not take into account that players do not play at the same strength in each and every game, and that the rating is merely an estimate. That inserts a random element into the win expectancy. Such random variations favor the underdog, and hence the "real" win expectancy for the lower rated player is higher than what the formula suggests. Sjakkalle (Check!) 07:39, 1 February 2008 (UTC)

Also, a rating difference of 500 is by no means a "certain win", the formula gives about a 5% win expectancy for the lower rated player. To illustrate the fact that 500 points is not certain at all, in 2004 I faced a 13-year old girl who was rated 578 points below me, and I thought that would be a fairly easy game. It wasn't, and I proceeded to lose that game (and the next game as well, this one to an 11-year old girl). Shows you what good ratings are... Sjakkalle (Check!) 07:50, 1 February 2008 (UTC)
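A rough sketch comparing the two expectancy formulas discussed above — the USCF-style linear approximation and the classical logistic curve — written as illustrative Python, not any federation's actual implementation:

```python
def linear_expectancy(own, opp):
    # Linear approximation quoted above, clamped into [0, 1]
    return min(1.0, max(0.0, 0.5 + (own - opp) / 800.0))

def logistic_expectancy(own, opp):
    # Classical Elo curve: 1 / (1 + 10^((opp - own)/400))
    return 1.0 / (1.0 + 10 ** ((opp - own) / 400.0))

print(linear_expectancy(2000, 1800))              # 0.75
print(round(logistic_expectancy(2000, 1800), 4))  # 0.7597
print(linear_expectancy(2000, 2500))              # 0.0 (clamped: "certain loss")
print(round(logistic_expectancy(2000, 2500), 4))  # 0.0532, about 5%
```

Note how the logistic curve keeps the roughly 5% chance for a 500-point underdog mentioned above, where the clamped linear rule rounds it all the way down to zero.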

[edit] recently-added paragraph

This paragraph

In addition, one major problem is the starting rating of players; the current average "provisional" player's rating is significantly lower than the "provisional" player of yesteryear. While several decades ago a beginning rating of 1200 was not uncommon, now young players tend to start with 400 ratings.

was recently added. I have some problems with it:

1. I don't see that it is a "problem". A few decades ago there were very few scholastic players in the USCF, especially below the high school level. Now there are tens of thousands of very young players in the USCF.

2. About starting with 400 ratings (a) that may be an accurate reflection of the playing strength of young players, (b) I think I remember that the starting rating is 100 points x the grade level, but I couldn't find that at the USCF website. Bubba73 (talk), 01:55, 31 January 2008 (UTC)

The entire "Ratings inflation and deflation" section, including its subsections, looks like WP:Original Research and is almost entirely unreferenced. It also appears to be an argument in one direction (against the general consensus that there is ratings inflation). Also note that the text from "A common misconception" was added in a single hit.[2] Whether or not those arguments are true, they are WP:Original Research. I believe the entire section should be deleted, or at least savagely reduced. Peter Ballard (talk) 02:12, 31 January 2008 (UTC)
The "common misconception" sentence makes little or no sense to me. FIDE ratings have inflated since 1985 - I saw data about that yesterday. But there are clearly problems with the rest of the section. Bubba73 (talk), 02:17, 31 January 2008 (UTC)
Perhaps I'm being a little harsh - we should say something on ratings inflation, but it should be referenced. BTW the sections "7.1 Game activity versus protecting one's rating", "7.2 Chess engines" and "7.3 Selective pairing" are also WP:OR and should be deleted. Peter Ballard (talk) 02:19, 31 January 2008 (UTC)
This is a personal website (so it may not be a WP:RS), but it does seem to have good research and data. Bubba73 (talk), 02:23, 31 January 2008 (UTC)
I'm a little suspicious because it is advocating an alternative ratings system, and there is nothing about his qualifications. It's better than nothing, but only just. Peter Ballard (talk) 02:51, 31 January 2008 (UTC)
Good point. In Elo's book, page 18, he considers players over 2600 to be "world champion contenders". I'm not knocking a 2600 player, but that probably wouldn't be a contender today. Perhaps that shows inflation. Bubba73 (talk), 03:11, 31 January 2008 (UTC)
I'm taking out that paragraph. But the article needs a lot more work too. Bubba73 (talk), 21:45, 12 February 2008 (UTC)

[edit] WikiProject Chess Importance

Upgraded to Top from High due to high linkage to article. ChessCreator (talk) 16:35, 17 February 2008 (UTC)

It is linked a lot, but on the other hand, you can play plenty of chess without ever having to know about the rating system. But I'm not going to change the rating. Bubba73 (talk), 02:56, 19 February 2008 (UTC)
Reduced to high, although not because you can NOT play chess without it (you can play chess without almost every top rated chess article), but because most times the link to ELO is not important to the linking article. ChessCreator (talk) 01:34, 7 March 2008 (UTC)

[edit] Practical issues section

This section has a lot of tags in it. On the Chess wikiproject page, I mentioned problems with this article. I don't think this section can be fixed - I think all or almost all of it should be removed. At best it doesn't really relate to the ELO system, but to rating systems in general. At worst it is unsubstantiated POV or O.R. Bubba73 (talk), 05:09, 13 March 2008 (UTC)

Someone else has tagged it too. It needs a lot of work or needs to be deleted. Bubba73 (talk), 14:40, 17 March 2008 (UTC)

[edit] Splitting the article in two

In my opinion this article talks a lot about the use of Elo ratings in Chess and not so much about the rating system itself. In my opinion, the article should be split in two, so that the "Elo rating system" article should focus on the workings of the rating system and another article would focus on its application to Chess. There's more than 120 links to this page and a lot of them are not related to Chess at all. Even though the main use of Elo ratings is related to Chess, in my opinion the article is pretty confusing in some areas and vague in others, splitting the article could help fix this. --Joaotorres (talk) 07:47, 2 April 2008 (UTC)

You have a point there. I'm not sure what to do. Bubba73 (talk), 17:36, 2 April 2008 (UTC)

I guess splitting would be a good solution. I'm just not sure how to do it. Guess we could outline which topics should go to each article and then create the new article, move the topics there and make the proper adjustments. What do you think? -- Joaotorres (talk) 19:36, 9 April 2008 (UTC)

[edit] I've lost the faith

I have been busy defending Elo to players on Scrabulous on Facebook. This has caused me to go back to the maths and think again - and I've lost the faith.

The probability stage of the calculation, where ratings are used to stand for actual strength, has no correction for the K factor, with the result that the "probabilities" vary wildly depending on the K factor chosen.

The probability estimate has a term 10^((A-B)/400). I wonder where the two constants 10 and 400 come from and if there’s a value of K that makes them work properly, or from which they were derived.

Is there anyone out there who can explain?

I play chess on http://64squar.es and Scrabulous (the Scrabble knock off) on http://Facebook.com The chess site uses Elo with a K factor of 32; Scrabulous uses a K factor of 120. Both sites attract masses of complaints. The chess players complain that it takes too long to reach the "right" rating and the Scrabulous players complain their ratings fluctuate wildly from day to day.

Tesspub (talk) 01:38, 10 April 2008 (UTC)

Edit to add: I think I'm saying much the same thing as the "Rating and Probability of a Win" para a few paras up. Tesspub (talk) 08:55, 10 April 2008 (UTC)
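To Tesspub's question about the constants: the 10 and 400 are conventional scaling choices (Elo picked them so that a 400-point gap corresponds to roughly 10:1 expected odds), and K is independent of them — K only sets how far each single result moves the rating, which is why a K of 32 feels sluggish and a K of 120 feels jumpy. A minimal sketch of the update rule (illustrative only, not either site's actual code):

```python
def update(rating, opp_rating, score, k):
    """One Elo update: rating + K * (actual score - expected score)."""
    expected = 1.0 / (1.0 + 10 ** ((opp_rating - rating) / 400.0))
    return rating + k * (score - expected)

# The same upset loss under two K factors: the larger K swings far more.
print(round(update(1600, 1400, 0, 32)))   # 1576 with a chess-style K=32
print(round(update(1600, 1400, 0, 120)))  # 1509 with a Scrabulous-style K=120
```

Both sites compute the same expected score from the same 10 and 400; only the step size differs, so neither constant "corrects for" K.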