PRINCIPLES OF LANGUAGE ASSESSMENT
PRACTICALITY
An effective test is practical. This means that it
• is not excessively expensive,
• stays within appropriate time constraints,
• is relatively easy to administer, and
• has a scoring/evaluation procedure that is specific and time-efficient.
A test that is prohibitively expensive is impractical. A test of language proficiency that takes a student five hours to complete is impractical: it consumes more time (and money) than necessary to accomplish its objective. A test that requires individual one-on-one proctoring is impractical for a group of several hundred test-takers and only a handful of examiners. A test that takes a few minutes for a student to take and several hours for an examiner to evaluate is impractical for most classroom situations. A test that can be scored only by computer is impractical if the test takes place a thousand miles away from the nearest computer.
RELIABILITY
A reliable test is consistent and dependable. If you give the same test to the same student or matched students on two different occasions, the test should yield similar results. The issue of reliability may best be addressed by considering a number of factors that can contribute to the unreliability of a test. Consider the following possibilities (adapted from Mousavi, 2002, p. 804): fluctuations in the student, in scoring, in test administration, and in the test itself.
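Although the discussion here is conceptual, reliability is often quantified as a correlation between two sets of scores. The short sketch below (invented scores, Python 3.10+) illustrates a test-retest reliability coefficient; none of the numbers come from an actual study.

# A minimal sketch of test-retest reliability: the Pearson correlation
# between two administrations of the same test to the same students.
# All scores are invented for illustration only.
from statistics import correlation  # requires Python 3.10+

first_administration = [72, 85, 64, 90, 78, 69, 83]
second_administration = [70, 88, 61, 92, 75, 72, 80]

r = correlation(first_administration, second_administration)
print(f"Test-retest reliability coefficient: {r:.2f}")  # near 1.0 = consistent

A coefficient near 1.0 suggests the test yields similar results on both occasions; a low coefficient points to one of the sources of unreliability discussed below.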
a. RATER RELIABILITY
Inter-rater unreliability occurs when two or more scorers yield inconsistent scores on the same test, possibly because of lack of attention to scoring criteria, inexperience, inattention, or even preconceived biases. In the story above about the placement test, the initial scoring plan for the dictations was found to be unreliable; that is, the two scorers were not applying the same standards.
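When scorers assign categorical ratings, agreement between them is commonly summarized with Cohen's kappa, which corrects raw agreement for chance. The sketch below uses hypothetical pass/borderline/fail ratings of the dictations; the function and the data are illustrative, not part of the original account.

# A minimal sketch of inter-rater agreement using Cohen's kappa.
# Ratings are hypothetical; kappa of 1.0 means perfect agreement,
# 0.0 means agreement no better than chance.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Expected chance agreement from each rater's category frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

rater_1 = ["pass", "fail", "pass", "borderline", "pass", "fail"]
rater_2 = ["pass", "fail", "borderline", "borderline", "pass", "pass"]
print(f"Cohen's kappa: {cohens_kappa(rater_1, rater_2):.2f}")  # prints 0.48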
b. TEST ADMINISTRATION RELIABILITY
Unreliability may also result from the conditions in which the test is administered. I once witnessed the administration of a test of aural comprehension in which a tape recorder played items for comprehension, but because of street noise outside the building, students sitting next to the windows could not hear the tape accurately.
c. TEST RELIABILITY
Sometimes the nature of the test itself can cause measurement errors. If a test is too long, test-takers may become fatigued by the time they reach the later items and hastily respond incorrectly.
VALIDITY
By far the most complex criterion of an effective test, and arguably the most important principle, is validity, “the extent to which inferences made from assessment results are appropriate, meaningful, and useful in terms of the purpose of the assessment” (Gronlund, 1998, p. 226).
a) CONTENT-RELATED EVIDENCE
If a test actually samples the subject matter about which conclusions are to be drawn, and if it requires the test-taker to perform the behavior that is being measured, it can claim content-related evidence of validity, often popularly referred to as content validity (e.g., Mousavi, 2002; Hughes, 2003).
b) CRITERION-RELATED EVIDENCE
The second form of evidence of the validity of a test may be found in what is called criterion-related evidence, also referred to as criterion-related validity, or the extent to which the “criterion” of the test has actually been reached.
A test has concurrent validity if its results are supported by other concurrent performance beyond the assessment itself. The predictive validity of an assessment becomes important in the case of placement tests, admissions assessment batteries, language aptitude tests, and the like.
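As a rough illustration (not from the original text), predictive validity is often reported as the correlation between test scores and a later criterion measure, such as end-of-course grades. All numbers below are invented, and a real validation study would require a much larger sample.

# A minimal sketch of a predictive validity coefficient: the correlation
# between placement-test scores and later course grades (invented data).
from statistics import correlation  # requires Python 3.10+

placement_scores = [55, 62, 70, 48, 81, 66, 74]
course_grades = [60, 65, 72, 50, 85, 63, 78]

validity = correlation(placement_scores, course_grades)
print(f"Predictive validity coefficient: {validity:.2f}")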
c) CONSTRUCT-RELATED EVIDENCE
A third kind of evidence that can support validity, but one that does not play as large a role for the classroom teacher, is construct-related validity, commonly referred to as construct validity. A construct is any theory, hypothesis, or model that attempts to explain observed phenomena in our universe of perceptions.
d) CONSEQUENTIAL VALIDITY
Messick (1989), Gronlund (1998), McNamara (2000), and Brindley (2001), among others, underscore the potential importance of consequential validity: it encompasses all the consequences of a test, including such considerations as its accuracy in measuring intended criteria, its impact on the preparation of test-takers, its effect on the learner, and the (intended and unintended) social consequences of a test’s interpretation and use.
e) FACE VALIDITY
An important facet of consequential validity is the extent to which “students view the assessment as fair, relevant, and useful for improving learning” (Gronlund, 1998, p. 210), or what is popularly known as face validity. “Face validity refers to the degree to which a test looks right, and appears to measure the knowledge or abilities it claims to measure, based on the subjective judgment of the examinees who take it, the administrative personnel who decide on its use, and other psychometrically unsophisticated observers” (Mousavi, 2002, p. 244).
AUTHENTICITY
A fourth major principle of language testing is authenticity, a concept that is a little slippery to define, especially within the art and science of evaluating and designing tests. Bachman and Palmer (1996, p. 23) define authenticity as “the degree of correspondence of the characteristics of a given language test task to the features of a target language task,” and then suggest an agenda for identifying those target language tasks and for transforming them into valid test items.
In a test, authenticity may be present in the following ways:
• The language in the test is as natural as possible.
• Items are contextualized rather than isolated.
• Topics are meaningful (relevant, interesting) for the learner.
• Some thematic organization to items is provided, such as through a story line or episode.
• Tasks represent, or closely approximate, real-world tasks.
WASHBACK
A facet of consequential validity, discussed above, is “the effect of testing on teaching and learning” (Hughes, 2003, p. 1), otherwise known among language-testing specialists as washback. In large-scale assessment, washback refers to the effect tests have on instruction in terms of how students prepare for the test.