Talk:Item response theory
From Wikipedia, the free encyclopedia
Negative language
Angela feels that the second to last paragraph, where tests are described as imprecise and containing error, is too negative. I feel it is a statement of fact that is often misunderstood by non-psychometricians. I think it follows directly from the psychometric material here on Wikipedia, particularly classical test theory.
It's worth noting that I, the author, am a psychometrician (i.e., not likely to have a negative view of testing). Maybe someone can suggest an alternative wording that appears balanced?
Amead 19:09, 5 Jan 2004 (UTC)
- Actually, I'm happy with the way it is now, as it makes clear you are talking about it in terms of standard error etc., rather than sounding like someone's opinion. Angela.
Amead, I reverted the article, at least for now, because I don't believe that, on balance, the changes enhanced it. Revert again if you wish -- I'd prefer we work toward a better article in a considered way if possible. The issues were:
- The definition introduced did not properly define IRT; rather, it described (i) what it is used for and (ii) what it is not. While I agree a definition should be as non-technical as possible, this can only be so within reason (see, e.g., probability theory), and efforts to make it so should not detract from the essential elements of a definition. Having said this, some of the changes were great and I agree the definition needs to be improved.
- IRT has been referred to as a body of related psychometric theory from early on, and I don't see good reason to say "it is not a theory per se". Doesn't that suggest the label itself is self-contradictory and therefore confused? I would also note that saying it is a body of theory does not suggest it is a theory (i.e., a particular theory).
- Extra spaces between 1st and 2nd para of overview look sloppy (minor point obviously)
- Introducing reliability versus information as a topic separate from information is, to me, inefficient (e.g., detail such as the information function being bell-shaped was repeated). Further, if we want to make this connection, it should be done properly. See below (**)
- The Rasch model had already been covered, and there was no connection to the newly introduced material. Also, the One Parameter Logistic Model (OPLM) is a model referred to by Verhelst & Glas (1995), which potentially makes the statement quite confusing. The comments on Rasch were (as you stated) from an American perspective -- European, Asian, etc. perspectives also need to be considered.
BTW, you're right about the link to discrimination; I removed it.
(**) On reliability, let θ̂ = θ + ε, where θ̂ is the estimate of a person's location θ and ε is measurement error. Then SE(θ) is an estimate of the standard deviation of ε for a person with a given weighted score, and

R = [Var(θ̂) − mean(SE(θ)²)] / Var(θ̂)

is analogous to Cronbach's alpha (indeed it is typically very close in value) and so analogous to the traditional concept of reliability. The mean squared standard error can be used as the estimate of the variance of the error across persons. Take care ... Stephenhumphry 03:22, 30 July 2005 (UTC)
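As a rough numerical illustration of the person-separation idea above (a minimal sketch with invented estimates and standard errors, not output from any particular program):

import statistics

# Hypothetical person location estimates and their standard errors
theta_hat = [-1.2, -0.4, 0.1, 0.7, 1.5, 2.0]
se = [0.45, 0.38, 0.36, 0.37, 0.42, 0.55]

obs_var = statistics.pvariance(theta_hat)        # observed variance of the estimates
error_var = sum(s**2 for s in se) / len(se)      # mean squared standard error
reliability = 1 - error_var / obs_var            # analogue of Cronbach's alpha
print(round(reliability, 3))                     # about 0.85 for these made-up numbers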
I removed two external links to sites related to the Rasch model. Both links were "Objective Measurement" links, not general IRT links, and thus more appropriately belong on the Rasch model wiki (which I added to the "See also" section). Bhabing 20:27, 15 April 2006 (UTC)
Amount of distinguishing from Rasch Measurement
Do we really need two paragraphs distinguishing IRT from Rasch? I would also prefer a more balanced set of references on the issue of distinguishing the two, and that the definition of measurement not be given from a strictly Rasch perspective.
--129.252.16.200 21:00, 25 September 2006 (UTC)
I think a couple of paragraphs about IRT and Rasch is in order. The definition of measurement is not from a Rasch perspective. The definition of measurement throughout the natural sciences is quite clear. See psychometrics for a brief account of the history of this definition. If you would like to propose a definition you think is widely accepted in IRT with a citation, be my guest. Please do not attempt to 'balance' by omitting a perspective. Balance on Wikipedia should be achieved by considered presentation of alternative perspectives. There were some quite fundamental problems with previous edits. For example, the reference to "easily computed sufficient statistics" seemed to imply other models have sufficient statistics but they're just not easily computed. This was misleading to say the least. Holon 00:58, 26 September 2006 (UTC)
In that case I would suggest starting a separate, later section of the IRT wiki dealing with the relationship between "model building" based IRT and the philosophical underpinnings of Rasch measurement, instead of putting it in what is ostensibly the "Overview" section for IRT. It would help the references to discrimination and the 2PL/3PL models make more sense. I think it would also be a better place for the "frame of reference" discussion.
- I agree it is better placed in another section. Let's do that. I'm pretty flat out -- if you want to have a go, great, and I'll look at it when I can. Holon 03:31, 26 September 2006 (UTC)
- I should have time to do a little mucking around the second week of October. ::crosses fingers:: --Bhabing 23:37, 26 September 2006 (UTC)
- Great, well I'll have a go if I get time also. Together, I'm sure we can improve it. Personally, I think quite a few parts of the article could be improved. Holon 01:20, 27 September 2006 (UTC)
- Cool. I'm trying to encourage a swath of IRT people I've worked with to contribute on their areas of specialization (MIRT, DIF, equating, unfolding models, etc.). --Bhabing 03:39, 27 September 2006 (UTC)
As far as the definition of "measurement", it strikes me as patently untrue that it has a single agreed-upon definition in psychometrics (regardless of what other wikis might say). The Thissen reference that you have removed twice deals with this from the IRT model-building perspective. In addition to stating his own opinion (he is a past president of the Psychometric Society), he also provides several references. That it is not decisively accepted by IRT practitioners at large is also attested by many of its staunchest proponents' use of "objective", "Rasch" and "fundamental" as modifiers, and by the plethora of articles defending it (why defend what no one attacks?). If the giants in the field don't agree, then it seems odd for the wiki to choose one side (as it does by use of the Andrich quote, as opposed to the Wright quote, which has a modifier).
- There is a miscommunication here. Let me be as clear as possible. My whole point in inviting you to give an agreed-upon definition of measurement in IRT is that it doesn't exist. Rasch explicitly showed the congruence of his models with measurement in physics in his 1960 book. What I actually said is that the definition of measurement in the natural sciences -- physics, chemistry, etc. -- is widely agreed. Indeed, it is implied by the definition of all SI units and the standard means of expressing magnitudes in physics (a number of units, where the number is a real number). See Reese's quote in the psychometrics article. Your edits seemed to me to suggest that Rasch and his proponents have 'created' some mysterious definition of measurement, which is patently untrue. What has actually occurred is that various people have created definitions of measurement that are incongruent with the definition throughout the most established sciences (physics, etc.). I have no problem with you presenting alternative definitions, but be clear about them so the article can be written from that basis. There is no need to labour the definition of measurement implied by Rasch models, because of the congruence with the rest of science. Holon 03:31, 26 September 2006 (UTC)
- To add to the above, I'm perplexed by your comments about articles "defending it". Defending what, exactly? By whom? Holon 05:38, 26 September 2006 (UTC)
As far as sufficient statistics, isn't the entire data set definitionally a sufficient statistic (albeit a not-very-useful one) for the model parameters in general, making the statement “has sufficient statistics” vacuous? (I would be interested in any references to mathematical statistics texts that restrict sufficient statistics from being the entire data set.)
- Person and item parameters have sufficient statistics (DATA only) in the Rasch model. There is no data reduction when the entire data set is called a statistic, and I would suggest you'd need to define the term statistic. So the answer is no, it is not at all vacuous to state that person and item parameters have sufficient statistics (total scores). Holon 03:31, 26 September 2006 (UTC)
- That the statistic (X1, X2, ..., Xn) is sufficient, regardless of whether it allows a reduction of the data or is a scalar, is stated in the mathematical statistics texts by Rohatgi (1976, pg. 339), Bickel and Doksum (1977, pg. 83) and Lehmann (1983, pg. 41), among others. Fischer and Molenaar (1995) manage by saying what the particular sufficient statistic is (the number-correct or sum score -- pages 10 and 16) or what other property is required for the given result (minimality and independence of some other statistic, or being from an exponential family -- pg. 25 and 222 respectively). --Bhabing 23:37, 26 September 2006 (UTC)
- Let's define a statistic as being sufficient for a parameter θ iff the probability distribution of the data conditional on the relevant statistic is not dependent on θ. Now let's define β_ni = x_ni for n = 1,...,N and i = 1,...,I. If we were to condition on the entire data matrix, there is no probability distribution -- the data are fully determined. Therefore, the entire matrix cannot be a sufficient statistic according to that definition of sufficiency if parameters are supposed to enter into a stochastic model (and any other model is clearly inferior as far as recovering the data is concerned, if that is the criterion). I'm not sure what you think Fischer and Molenaar "manage". It seems to me you think there is some problem with Rasch models and the epistemological case put forward for the models. As far as these models are concerned, the point is that the person and item parameters are separable, which leads to sufficiency of total scores (or sometimes vectors as in Andersen, 1977). I can refer to various articles to make these points (most importantly Rasch, 1961), but I fear we'll just go around in circles all day.
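A small numerical check of the conditional argument above (a hedged sketch using two hypothetical Rasch items): conditioning on a total score of 1, the probability of the response pattern (1, 0) comes out the same at every value of θ, which is exactly what sufficiency of the total score means under the definition given.

import math

def p_correct(theta, delta):
    # Rasch model: probability of a correct response to an item of difficulty delta
    return 1 / (1 + math.exp(-(theta - delta)))

delta1, delta2 = -0.5, 1.0   # hypothetical item difficulties

for theta in (-2.0, 0.0, 2.0):
    p10 = p_correct(theta, delta1) * (1 - p_correct(theta, delta2))
    p01 = (1 - p_correct(theta, delta1)) * p_correct(theta, delta2)
    # P(pattern (1, 0) | total score = 1) -- prints the same value for every theta
    print(theta, round(p10 / (p10 + p01), 6))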
- I'll do some library diving when I get time to see what the authoritative sources in sufficiency's home field (mathematical statistics) say on the matter beyond the references I gave above, and get back to you. As far as Rasch measurement, I am hoping my feelings about Rasch measurement (pro and con) don't harm my attempts to add to this wiki any more than yours stop you. In my experience, most IRT researchers appreciate both the philosophical and statistical properties of the Rasch models as well as the need to deal with a wide variety of actual data sets. --Bhabing 03:39, 27 September 2006 (UTC)
- Fair enough. Keep in mind though that the concept of sufficiency is due to Sir Ronald Fisher and Rasch studied and worked with Fisher directly. Keep in mind also it ceases to be a purely mathematical matter where it comes to models used for empirical data. There is a quote from Rasch about this. Would you mind e-mailing me using the wiki function? Couple of things I want to mention but don't want to congest the board. Thanks for the cooperative spirit. Holon 05:17, 27 September 2006 (UTC)
What works best for you in editing this part of the wiki? Should I post some proposed changes here in the discussion first for your modification, or would it be easier if I scanned and e-mailed you the two pages of Thissen and Wainer (if you don't have a copy available) and let you take the first go? --Bhabing 02:30, 26 September 2006 (UTC)
- The problem with your citation was that it was entirely unclear what point was being made. Could you please just clarify the point in light of this discussion? Be bold in editing -- let's just have a discussion if you want to actually remove points that are being made, rather than add counterpoints. I'm open to any alternatives for clarification. Holon 03:31, 26 September 2006 (UTC)
- Thanks! --Bhabing 23:37, 26 September 2006 (UTC)
Clarity
This article is not appropriate for an encyclopedia entry. I have a degree in psych and it is incomprehensible. It is jargon from beginning to end. I looked up the entry to find out what IRT meant. I haven't a clue. The people on this talk page are happy with it. They are evidently members of an esoteric circle. Talk plain English or give an example - or something.
- Pepper 150.203.227.130 06:30, 12 January 2007 (UTC)
- Did you use any quantitative methods in your degree? The reason I ask is that IRT is quite different from traditional quantitative methods taught in psych, and sometimes this makes it harder rather than easier to have some background. Whatever the case, though, I value your feedback. Some of the article needs to be technical -- it is by definition a body of theory. However, the basic purpose and concepts can be made clearer, and I for one am open to suggestion and input. In order to begin somewhere, does the first sentence not make some sense for you?
- Incidentally, if you respond, I'll move the discussion to the bottom to keep things in chronological order, so please look for any responses at the bottom. If you don't respond, I'll move it after a few days. I'd also ask you to keep in mind that constructive criticism and input are productive, whereas emotive language tends to obstruct productive communication. Cheers. Holon 10:57, 12 January 2007 (UTC)
- I agree. The language is quite simple for someone with a quantitative background, which is necessary to understand IRT. Having a degree in psych won't help much. That's like saying a BA in Biology will help you understand meta-analysis of public health studies. Iulus Ascanius (talk) 16:12, 17 April 2008 (UTC)
Technical language
- Kania 72.139.47.78 22:04, 24 February 2007 (UTC)
- I happened to be investigating computer image processing and the links brought me to this page. The first paragraph is really incomprehensible to someone outside of the field. I initially thought that it was related to determination of the size of objects in an image. While it doesn't apply to me, I just thought that I would provide some comment to highlight the confusion that a layman might encounter in trying to understand the content.
- In the following copy of the first paragraph from the article, the last and most technical sentence is actually the easiest to understand. I would explain what an "item" is because the term is too generic. I would replace scaling with rating if that is what is meant. In the first sentence alone, the use of "items" twice with potentially different meanings is particularly confusing, and scaling "items" based on their responses is just nonsensical.
- "Item response theory (IRT) is a body of related psychometric theory that provides a foundation for scaling persons and items based on responses to assessment items. The central feature of IRT models is that they relate item responses to characteristics of individual persons and assessment items. Expressed in somewhat more technical terms, IRT models are functions relating person and item parameters to the probability of a discrete outcome, such as a correct response to an item."
- Thanks for the comments, much appreciated. We'd better work on tightening up. It should be at least obvious what it is and what it is used for in the simplest possible terms. Holon 12:55, 27 February 2007 (UTC)
- BTW, just to clarify a couple of things -- scaling items is not nonsensical if you understand the process of scaling (estimating scale locations from responses to items). Rating is most certainly not the same as scaling. Holon 07:01, 28 February 2007 (UTC)
Equation
Is the equation given for the three-parameter logistic model correct? It contains four parameters if you include the D parameter. This parameter is not discussed in the article, nor does it appear in Baker's online book discussing the three-parameter model. Andrés (talk) 14:35, 17 April 2008 (UTC)
- D is not an estimated parameter. It is a constant fixed to 1.0 or 1.702 to determine the scale. Iulus Ascanius (talk) 16:12, 17 April 2008 (UTC)
This previous comment makes no mathematical sense. Whether one uses a_i or D*a_i makes no difference; all one is doing is rescaling the value of the a_i parameter. In other words, it cannot change the class of the function. The functions sin(a_i * x) and sin(D*a_i * x) both describe a sine function. For every solution to the characteristic curve that uses the fixed value of 1.0 for D, there exists an equivalent representation that uses the value 1.702. There is no mathematical way to distinguish the two. Can someone provide a reference to the significance of the D parameter/constant? It still seems incorrect to me. A randomly selected article from the web, such as "Comparison of Item-Fit Statistics for the Three-Parameter Logistic Model", makes no mention of this scaling constant. Andrés (talk) 05:52, 21 April 2008 (UTC)
- That's the point: it is nothing more than a small rescaling to help the logistic function more closely approximate a cumulative normal function. Here's a good reference, though it's not available through ERIC: Camilli, G. (1994). Origin of the scaling constant "D" = 1.7 in item response theory. Journal of Educational and Behavioral Statistics, 19(3), 293-295. If you need papers immediately available on the internet that mention but don't really explain it, see www.fcsm.gov/05papers/Cyr_Davies_IIIC.pdf, http://harvey.psyc.vt.edu/Documents/WagnerHarveySIOP2003.pdf, http://www2.hawaii.edu/~daniel/irtctt.pdf. Any IRT book will explain it.
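A quick numerical check of the points above (a minimal sketch; the item parameters shown are hypothetical): with D = 1.702 the logistic curve stays just under 0.01 away from the normal ogive everywhere, and D enters the 3PL only as a fixed constant rather than a fourth estimated parameter.

import math

def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def scaled_logistic(z, D=1.702):
    return 1 / (1 + math.exp(-D * z))

# Maximum gap between the scaled logistic and the normal ogive over a fine grid
zs = [i / 100 for i in range(-400, 401)]
print(max(abs(scaled_logistic(z) - normal_cdf(z)) for z in zs))   # just under 0.01

def three_pl(theta, a, b, c, D=1.702):
    # 3PL: c is the lower asymptote; D is a fixed scaling constant, not estimated
    return c + (1 - c) * scaled_logistic(a * (theta - b), D)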
Thank you for the references to the use of the D constant. I now understand the issues better. The purpose of the constant is to make the item's characteristic curve look like the CDF of the normal distribution by rescaling the ability scale. The entry in the article is still wrong, however. This rescaling is appropriate for the 2PL model but not for the 3PL model whenever c_i > 0. As stated later in the article, the use of D = 1.7 makes the characteristic curve of the 2PL differ by less than 0.1 from that of the normal CDF. This bound is broken by any fitted value of c_i > 0. For example, in a multiple-choice exam question even random guessing will sometimes pick the correct answer. As I understand IRT, that is why the 3PL is used. In a typical exam with five choices, one would expect the fitted value of c_i to be 0.2 or higher. For negative values of the responder's ability, any 3PL with c_i > 0 will not be close to the normal CDF no matter what value of D is picked. As support for my argument that the D constant does not belong in the 3PL equation, I point out that the Camilli reference given in the previous paragraph only discusses the 2PL model. Andrés (talk) 14:32, 22 April 2008 (UTC)