Test (student assessment)

From Wikipedia, the free encyclopedia

A test or examination (or "exam") is an assessment, often administered on paper or on a computer, intended to measure a test-taker's or respondent's (often a student's) knowledge, skills, aptitudes, or classification in other respects (e.g., beliefs). Tests are often used in education, professional certification, counseling, psychology (e.g., the MMPI), the military, and many other fields. The measurement that is the goal of testing is called a test score: "a summary of the evidence contained in an examinee's responses to the items of a test that are related to the construct or constructs being measured."[1] Test scores are interpreted with regard to a norm or criterion, or occasionally both. The norm may be established independently or by statistical analysis of a large number of subjects.
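
As a minimal sketch of norm-referenced interpretation (the scores, the norm group, and the function name below are invented for illustration, not drawn from any particular testing program), a raw score can be expressed as a standardized z-score relative to a norm group:

    import statistics

    def norm_referenced_score(raw_score, norm_sample):
        """Express a raw score as a z-score relative to a norm group."""
        mean = statistics.mean(norm_sample)
        sd = statistics.stdev(norm_sample)
        return (raw_score - mean) / sd

    # Hypothetical norm group of earlier examinees' raw scores.
    norm_group = [30, 35, 38, 40, 41, 44, 47, 50, 52, 55]
    print(round(norm_referenced_score(42, norm_group), 2))  # about -0.15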

A standardized test is one that is administered and scored in a consistent manner to ensure legal defensibility.[2] A large proportion of formal testing is standardized. A standardized test with important consequences for the individual examinee is referred to as a high-stakes test.

The basic component of a test is an item. Items are often colloquially referred to as "questions," but not every item is phrased as a question; an item may instead be, for example, a true/false statement or, in a performance test, a task that must be performed.

History

The earliest known standardized tests (which included both practical and written components) are the Chinese Imperial Examinations, which began in 587.[3]

In Europe, school examinations were traditionally conducted orally: students answered, in Latin, questions posed by their teachers, who graded the responses. The first written exams in Europe were held at Cambridge University, England, in 1792, by professors who were paid a piece rate and realized that written exams would earn them more money.

Types of items

Many possible item formats are available for test construction. These include: multiple-choice, free response, performance or simulation, true/false, and Likert-type. There is no "best" format to use; the applicability depends on the purpose and content of the test. For example, a test on a complex psychomotor task would be better served by a performance or simulation item than a true/false item.

Multiple-choice items

Main article: Multiple choice

A common type of test item is the multiple-choice question, in which the author of the test provides several possible answers (usually four or five) from which the test subject must choose.[4] There is one right answer, usually represented by a single answer option, though it is sometimes divided across two or more options, all of which the subject must identify correctly. Such a question may look like this:

The number of right angles in a square is:
  a) 2
  b) 3
  c) 4
  d) 5

Test authors generally create incorrect response options, often referred to as distracters, that correspond with likely errors.[5] For example, distracters may represent common misconceptions that arise during the developmental process. Constructing effective distracters is a key challenge in writing multiple-choice items with strong psychometric properties. Well-designed distracters, considered in combination, can attract considerably more than 25% of the weakest students, thereby reducing the effect of guessing on total scores. Writing such items can require considerable skill and experience on the part of the item developer.

Figure 1: Multiple-choice distracter analysis with an item characteristic curve

A graph depicting the functioning of a multiple-choice question is shown in Figure 1. The x-axis represents an ability continuum and the y-axis the probability of a given choice being selected by an examinee at a given level of ability. The y-axis runs from 0 to 1, while the x-axis represents standardized scores with a mean of 0 and a standard deviation of 1, which can be based on either the items or the examinees.

The grey line maps ability to the probability of a correct response according to the Rasch model, which is a psychometric model used to analyse test data. The correct response in the example shown in Figure 1 is E. The proportion of students along the ability continuum who chose the correct response is highlighted in pink. The graph shows the proportion of students opting for other choices along the range of the ability continuum, as shown in the legend. The proportion of students at about −1.5 on the scale (i.e., of very low ability) who responded correctly to this item is approximately 0.1, which is below the proportion expected if students were purely guessing.
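
The Rasch model mentioned above has a simple closed form: the probability of a correct response is exp(ability - difficulty) / (1 + exp(ability - difficulty)). A minimal sketch in Python follows; the difficulty value used here is an assumption for illustration, not a value read off Figure 1.

    import math

    def rasch_probability(ability, difficulty):
        """P(correct) = exp(ability - difficulty) / (1 + exp(ability - difficulty))."""
        return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

    # An examinee at -1.5 on the ability scale facing an item of average
    # difficulty (b = 0) is expected to answer correctly only rarely.
    print(round(rasch_probability(-1.5, 0.0), 2))  # about 0.18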

An attractive feature of multiple-choice questions is that they are particularly easy to score.[6] Scoring by machines such as the Scantron, or by the software of computer-based tests, can be performed automatically and instantly, which is particularly valuable when there are not enough graders available for a large class or a large-scale standardized test. Multiple-choice tests are also valuable when the test sponsor wants immediate score reporting available to the examinee; it is impossible to provide a score at the end of the test if the items are not actually scored until several weeks later.
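
The mechanics of automatic scoring amount to comparing each response against an answer key; here is a minimal sketch (the key and responses are invented for illustration):

    def score_multiple_choice(answer_key, responses):
        """Count the responses that match the answer key."""
        return sum(1 for key, resp in zip(answer_key, responses) if key == resp)

    key = ["c", "a", "d", "b", "c"]
    responses = ["c", "a", "b", "b", "c"]
    print(score_multiple_choice(key, responses))  # 4 of 5 correct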

This format is not, however, appropriate for assessing all types of skills and abilities. Poorly written multiple-choice questions often overemphasize simple memorization and deemphasize process and comprehension. They also leave no room for disagreement or alternative interpretation, making them particularly unsuitable for subjects in the humanities, such as literature and philosophy.

Free response items

Students taking a test at the University of Vienna, June 2005

Free-response questions pose less of a challenge to the test author than multiple-choice items do, but evaluating the responses is a different matter. Effective scoring involves reading each answer carefully and looking for the specific features, such as clarity and logic, that the item is designed to assess. Often, the best results are achieved by awarding scores according to explicit ordered categories reflecting increasing quality of response. Doing so may involve the construction of marking criteria and support materials, such as training materials for markers and samples of work that exemplify the categories of response. Typically, these questions are scored according to a uniform grading rubric for greater consistency and reliability.

At the other end of the spectrum, scores may be awarded according to superficial qualities of the response, such as the presence of certain important terms. In this case, it is easy for test subjects to fool scorers by writing a stream of generalizations or non sequiturs that incorporate the terms that the scorers are looking for. This, along with other factors that limit their reliability and cost/measurement ratio, has caused the usefulness of this item type to be questioned.[7]

While free-response items have disadvantages, they offer more power to differentiate between examinees.[8] However, this may be offset by the length of the item: if a free-response item provides twice as much measurement information as a multiple-choice item but takes as long to complete as three multiple-choice items, is it worth including?
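
To make that trade-off concrete, using the hypothetical numbers from the paragraph above:

    # One free-response item: 2 units of measurement information in the
    # time of 3 multiple-choice items. Three multiple-choice items: 3
    # units of information in the same time.
    free_response_rate = 2 / 3    # about 0.67 units of information per time unit
    multiple_choice_rate = 3 / 3  # 1.0 unit of information per time unit
    print(free_response_rate < multiple_choice_rate)  # True: less efficient here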

Performance test or practical examination

Knowledge of how to do something does not lend itself well to either free-response or multiple-choice questions; it may be demonstrated outright only by a performance test.[9] Art, music, and language fall into this category, as do non-academic disciplines such as sports and driving. Students of engineering are often required to present an original design or computer program developed over the course of days or even months.

A practical examination may be administered by an examiner in person (in which case it may be called an audition or a tryout) or by means of an audio or video recording. It may be administered on its own or in combination with other types of questions; for instance, many driving tests in the United States include a practical examination as well as a multiple-choice section regarding traffic laws.

Tests of the sciences may include laboratory experiments (practicals/laboratory sessions) to make sure that the student has learned not only the body of knowledge comprising the science but also the experimental methods through which it has been developed. Again, the use of explicit criteria is generally beneficial in the marking of practical examinations or performances.

Criticism

General aptitude tests, such as the SAT in the United States, are used in certain countries as a basis for entrance into colleges and universities. One criticism of this use is that such tests are known to be subject to practice effects and do not necessarily assess the accumulated learning of students during their schooling years. However, the goal of these tests is not to assess accumulated learning; they are designed to measure aptitude, not achievement.

Similarly, college entrance exams are criticized for not predicting first-year university grade point average (GPA) as accurately as high school GPA does.[10] However, the intent is for test scores to be used along with other measures in university selection; large-scale test scores are only one aspect of the selection process, and universities are free to place more emphasis on high school GPA or extracurricular activities. Such criticism might be better directed at a university than at the test itself, which most people consider fair.[11]

The content of an exam might not correspond with its intended use or representation. For example, an exam might feature a ratio of questions in geometry, calculus, and number theory dissimilar to the ratio of those topics in the environment for which the exam is intended to serve as a predictor of future performance. As an extreme and unrealistic example, a mathematics exam might ask solely about the names, birthdates, and countries of origin of various mathematicians, even though such knowledge is of little importance in a mathematics curriculum. For this reason, a legally defensible test must be demonstrated to be valid for its use, a requirement so important that it is Standard 1.1 for educational and psychological testing.[12] If a test is used for other than its intended purpose, the burden of proof of validity rests upon the user.[13]
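
A minimal sketch of how content representation might be checked against an intended blueprint (the topics, target ratios, and counts below are invented for illustration):

    # Intended share of items per topic versus the exam as written.
    intended = {"geometry": 0.40, "calculus": 0.40, "number theory": 0.20}
    observed_counts = {"geometry": 10, "calculus": 6, "number theory": 4}

    total = sum(observed_counts.values())
    for topic, target in intended.items():
        actual = observed_counts[topic] / total
        flag = "" if abs(actual - target) <= 0.05 else "  <-- off-blueprint"
        print(f"{topic}: intended {target:.0%}, observed {actual:.0%}{flag}")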

People vary in their susceptibility to stress. Some are virtually unaffected and excel on tests, while in extreme cases individuals become very nervous and forget large portions of the exam material. To counterbalance this, teachers and professors often do not grade their students on tests alone, placing considerable weight on homework, attendance, in-class discussion, and laboratory investigations where applicable. Conversely, in some high-stakes testing cases, the pressure induces examinees to rise to meet the exam's high expectations.

Through specialized training on material and techniques created specifically to suit the test, students can be "coached" to increase their scores without significantly increasing their knowledge of the subject matter. However, research on the effects of coaching remains inconclusive, and any increase might simply be due to practice effects.[14]

Although test organizers attempt to prevent it and impose strict penalties for it, academic dishonesty (cheating) can be used to obtain an advantage over other test-takers. On a multiple-choice test, lists of answers may be obtained beforehand. On a free-response test, the questions may be obtained beforehand, or the subject may write an answer that creates the illusion of knowledge. If students sit close to one another, it is also possible to copy answers from other students, especially from a student believed to know the material well. Despite such issues, tests are less susceptible to cheating than other tools of learning evaluation: laboratory results can be fabricated, and homework can be done by one student and copied by others. The presence of a responsible test administrator in a controlled environment helps to guard against cheating.

References

  1. ^ Thissen, D., & Wainer, H. (2001). Test Scoring. Mahwah, NJ: Erlbaum. Page 1, sentence 1.
  2. ^ North Central Regional Educational Laboratory.
  3. ^ Feng, Y. (1994). From the Imperial Examination to the National College Entrance Examination: The Dynamics of Political Centralism in China's Educational Enterprise. ASHE Annual Meeting Paper.
  4. ^ Haladyna, T. (2004). Developing and Validating Multiple-Choice Test Items. Mahwah, NJ: Erlbaum.
  5. ^ Kehoe, J. (1995). Writing multiple-choice test items. Practical Assessment, Research & Evaluation, 4(9). Retrieved February 26, 2008, from http://PAREonline.net/getvn.asp?v=4&n=9
  6. ^ Test Item Writing. University of Alabama at Birmingham.
  7. ^ Hollingworth, L., Beard, J.J., & Proctor, T.P. (2005). An Investigation of Item Type in a Standards-Based Assessment. Practical Assessment, Research & Evaluation, 12(18).
  8. ^ Vale, C.D., & Weiss, D.J. (1977). A Comparison of Information Functions of Multiple-Choice and Free-Response Vocabulary Items. Technical Report, University of Minnesota Psychometric Methods Laboratory.
  9. ^ Performance Testing Council. Why Performance Testing?
  10. ^ FairTest criticism of the SAT.
  11. ^ Domino, G., & Domino, M.L. (2006). Psychological Testing: An Introduction. Cambridge: Cambridge University Press. Page 342.
  12. ^ Standard 1.1. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.
  13. ^ Standard 1.4. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association.
  14. ^ Domino, G., & Domino, M.L. (2006). Psychological Testing: An Introduction. Cambridge: Cambridge University Press. Page 340.

Further reading

  • Airasian, P. (1994). Classroom Assessment, 2nd ed. New York: McGraw-Hill.
  • Cangelosi, J. (1990). Designing Tests for Evaluating Student Achievement. New York: Addison-Wesley.
  • Gronlund, N. (1993). How to Make Achievement Tests and Assessments, 5th ed. New York: Allyn and Bacon.
  • Haladyna, T.M., & Downing, S.M. (1989). Validity of a Taxonomy of Multiple-Choice Item-Writing Rules. Applied Measurement in Education, 2(1), 51-78.
  • Monahan, T. (1998). The Rise of Standardized Educational Testing in the U.S.: A Bibliographic Overview.
  • Wilson, N. (1997). Educational standards and the problem of error. Education Policy Analysis Archives, 6(10). http://olam.ed.asu.edu