Test Development Process Overview
Test development is an umbrella term for all that goes into the process of creating a test. A good test is the product of the thoughtful and sound application of established principles of test development.

The process of developing a test occurs in five stages:
1. test conceptualization;
2. test construction;
3. test tryout;
4. item analysis;
5. test revision

Test conceptualization - the stage in which the idea for a test is conceived.

Test construction - a stage in the process of test development that entails writing test items (or rewriting or revising existing items), as well as formatting items, setting scoring rules, and otherwise designing and building a test.

Test tryout - once a preliminary form of the test has been developed, it is administered to a representative sample of testtakers under conditions that simulate the conditions under which the final version of the test will be administered. The data from the tryout are collected, and testtakers' performance on the test as a whole and on each item is analyzed.

Item analysis - statistical procedures employed to assist in making judgments about which items are good as they are, which items need to be revised, and which items should be discarded.

Test revision - action taken to modify a test's content or format for the purpose of improving the test's effectiveness as a tool of measurement. This action is usually based on item analyses, as well as related information derived from the test tryout. The revised version of the test will then be tried out on a new sample of testtakers.

Test Conceptualization

Ex. A new test might be conceived to measure an emerging construct such as asexuality, which may be defined as a sexual orientation characterized by a long-term lack of interest in a sexual relationship with anyone or anything.

Some Preliminary Questions

■ What is the test designed to measure? This is a deceptively simple question. Its answer is closely linked to how the test developer defines the construct being measured and how that definition is the same as or different from the definitions used by other tests purporting to measure the same construct.

■ What is the objective of the test? In the service of what goal will the test be employed? In what way or ways is the objective of this test the same as or different from other tests with similar goals? What real-world behaviors would be anticipated to correlate with testtaker responses?

■ Is there a need for this test? Are there any other tests purporting to measure the same thing? In what ways will the new test be better than or different from existing ones? Will there be more compelling evidence for its reliability or validity? Will it be more comprehensive? Will it take less time to administer? In what ways would this test not be better than existing tests?

■ Who will use this test? Clinicians? Educators? Others? For what purpose or purposes would this test be used?

■ Who will take this test? Who is this test for? Who needs to take it? Who would find it desirable to take it? For what age range of testtakers is the test designed? What reading level is required of a testtaker? What cultural factors might affect testtaker response?

■ What content will the test cover? Why should it cover this content? Is this coverage different from the content coverage of existing tests with the same or similar objectives? How and why is the content area different? To what extent is this content culture-specific?

■ How will the test be administered? Individually or in groups? Is it amenable to both group and individual administration? What differences will exist between individual and group administrations of this test? Will the test be designed for or amenable to computer administration? How might differences between versions of the test be reflected in test scores?

■ What is the ideal format of the test? Should it be true–false, essay, multiple-choice, or in some other format? Why is the format selected for this test the best format?
■ Should more than one form of the test be developed? On the basis of a cost–benefit analysis, should alternate or parallel forms of this test be created?

■ What special training will be required of test users for administering or interpreting the test? What background and qualifications will a prospective user of data derived from an administration of this test need to have? What restrictions, if any, should be placed on distributors of the test and on the test's usage?

■ What types of responses will be required of testtakers? What kind of disability might preclude someone from being able to take this test? What adaptations or accommodations are recommended for persons with disabilities?

■ Who benefits from an administration of this test? What would the testtaker learn, or how might the testtaker benefit, from an administration of this test? What would the test user learn, or how might the test user benefit? What social benefit, if any, derives from an administration of this test?

■ Is there any potential for harm as the result of an administration of this test? What safeguards are built into the recommended testing procedure to prevent any sort of harm to any of the parties involved in the use of this test?

■ How will meaning be attributed to scores on this test? Will a testtaker's score be compared to those of others taking the test at the same time? To those of others in a criterion group? Will the test evaluate mastery of a particular content area?

Norm-referenced versus criterion-referenced tests: item development issues

When it comes to criterion-oriented assessment, being "first in the class" does not count and is often irrelevant. Although we can envision exceptions to this general rule, norm-referenced comparisons typically are insufficient and inappropriate when knowledge of mastery is what the test user requires.

Criterion-referenced testing and assessment are commonly employed in licensing contexts, be it a license to practice medicine or to drive a car. Criterion-referenced approaches are also employed in educational contexts in which mastery of particular material must be demonstrated before the student moves on to advanced material that conceptually builds on the existing base of knowledge, skills, or both.

Pilot Work

In the context of test development, terms such as pilot work, pilot study, and pilot research refer, in general, to the preliminary research surrounding the creation of a prototype of the test. Test items may be pilot studied (or piloted) to evaluate whether they should be included in the final form of the instrument.

Ex. Pilot research may involve open-ended interviews with research subjects believed for some reason (perhaps on the basis of an existing test) to be introverted or extraverted.

In pilot work, the test developer typically attempts to determine how best to measure a targeted construct. Pilot work is a necessity when constructing tests or other measuring instruments for publication and wide distribution.

Test Construction

Scaling

Measurement may be defined as the assignment of numbers according to rules. Scaling may be defined as the process of setting rules for assigning numbers in measurement. Stated another way, scaling is the process by which a measuring device is designed and calibrated and by which numbers (or other indices)—scale values—are assigned to different amounts of the trait, attribute, or characteristic being measured.

L. L. Thurstone is credited for being at the forefront of efforts to develop methodologically sound scaling methods. He adapted psychophysical scaling methods to the study of psychological variables such as attitudes and values (Thurstone, 1959; Thurstone & Chave, 1929). Thurstone's (1925) article entitled "A Method of Scaling Psychological and Educational Tests" introduced the notion of absolute scaling—a procedure for obtaining a measure of item difficulty across samples of testtakers who vary in ability.

Types of scales

Scales may also be conceived of as instruments used to measure. The something being measured is likely to be a trait, a state, or an ability.
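Referring back to the definition of scaling above, the short sketch below shows number assignment under an explicit rule. The response labels and the 1–5 coding rule are illustrative assumptions, not from the text:

```python
# Minimal sketch: scaling as "setting rules for assigning numbers in measurement."
# The agree-disagree labels and the 1..5 coding rule below are hypothetical.

SCALE_RULE = {
    "strongly disagree": 1,
    "disagree": 2,
    "neutral": 3,
    "agree": 4,
    "strongly agree": 5,
}

def scale_value(response_label: str, reverse_keyed: bool = False) -> int:
    """Assign a number to a raw response according to the stated rule."""
    value = SCALE_RULE[response_label.lower()]
    # For reverse-keyed items, more agreement means less of the trait.
    return 6 - value if reverse_keyed else value

print(scale_value("agree"))                      # 4
print(scale_value("agree", reverse_keyed=True))  # 2
```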
Scales can be meaningfully categorized along a continuum of level of measurement and be referred to as nominal, ordinal, interval, or ratio.

Age-based scale - used when the testtaker's test performance as a function of age is of critical interest.

Grade-based scale - used when the testtaker's test performance as a function of grade is of critical interest.

Stanine scale - used if all raw scores on the test are to be transformed into scores that can range from 1 to 9.

A scale may also be categorized as unidimensional as opposed to multidimensional, and as comparative as opposed to categorical. This is just a sampling of the various ways in which scales can be categorized.

Scaling methods

A testtaker is presumed to have more or less of the characteristic measured by a (valid) test as a function of the test score. The higher or lower the score, the more or less of the characteristic the testtaker presumably possesses.

Ex. Morally Debatable Behaviors Scale–Revised (MDBS-R; Katz et al., 1994). Developed to be "a practical means of assessing what people believe, the strength of their convictions, as well as individual differences in moral tolerance" (p. 15), the MDBS-R contains 30 items. Each item contains a brief description of a moral issue or behavior on which testtakers express their opinion by means of a 10-point scale that ranges from "never justified" to "always justified":

Cheating on taxes if you have a chance is:
1 2 3 4 5 6 7 8 9 10 (1 = never justified, 10 = always justified)

The MDBS-R is an example of a rating scale, which can be defined as a grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the testtaker. Rating scales can be used to record judgments of oneself, others, experiences, or objects, and they can take several forms.

Summative scale - the final test score is obtained by summing the ratings across all the items.

The Many Faces of Rating Scales

"Smiley" faces are used as rating scales in social-psychological research with young children and adults with limited language skills. The faces are used in lieu of words such as positive, neutral, and negative.

Likert scale - one type of summative rating scale, used extensively in psychology, usually to scale attitudes. Likert scales are relatively easy to construct, usually present items on an agree–disagree or approve–disapprove continuum, and are usually reliable.

Ex. Cheating on taxes if you have a chance. This is (check one):
never justified - rarely justified - sometimes justified - usually justified - always justified

The use of rating scales of any type results in ordinal-level data.

Unidimensional scale - only one dimension is presumed to underlie the ratings.

Multidimensional scale - more than one dimension is thought to guide the testtaker's responses.

Method of paired comparisons - testtakers are presented with pairs of stimuli (two photographs, two objects, two statements), which they are asked to compare.

Ex. Select the behavior that you think would be more justified:
a. cheating on taxes if one has a chance
b. accepting a bribe in the course of one's duties

One method of sorting, comparative scaling, entails judgments of a stimulus in comparison with every other stimulus on the scale. Another scaling system that relies on sorting is categorical scaling, in which stimuli are placed into one of two or more alternative categories that differ quantitatively with respect to some continuum.
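Looking back at the method of paired comparisons above, scoring typically credits a stimulus each time it is selected from a pair. A minimal sketch follows; the judgment data are hypothetical:

```python
# Minimal sketch: scoring the method of paired comparisons. Each time a
# stimulus is selected from a pair, it earns one point; the totals order the
# stimuli on the judged dimension. The judgment data are hypothetical.
from collections import Counter

# Each tuple is (pair presented, stimulus chosen) for one judgment.
judgments = [
    (("cheating on taxes", "accepting a bribe"), "cheating on taxes"),
    (("cheating on taxes", "accepting a bribe"), "cheating on taxes"),
    (("cheating on taxes", "accepting a bribe"), "accepting a bribe"),
]

points = Counter(choice for _, choice in judgments)
print(points.most_common())  # [('cheating on taxes', 2), ('accepting a bribe', 1)]
```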
Guttman scale (Guttman, 1944a, 1944b, 1947) - yet another scaling method that yields ordinal-level measures. Items on it range sequentially from weaker to stronger expressions of the attitude, belief, or feeling being measured. All respondents who agree with the stronger statements of the attitude will also agree with milder statements.

Ex. Do you agree or disagree with each of the following:
a. All people should have the right to decide whether they wish to end their lives.
b. People who are terminally ill and in pain should have the option to have a doctor assist them in ending their lives.
c. People should have the option to sign away the use of artificial life-support equipment before they become seriously ill.
d. People have the right to a comfortable life.

The resulting data are then analyzed by means of scalogram analysis, an item-analysis procedure and approach to test development that involves a graphic mapping of a testtaker's responses.

The method of equal-appearing intervals is an example of a scaling method of the direct estimation variety. In contrast to methods that involve indirect estimation, there is no need to transform the testtaker's responses into some other scale.

Writing Items

In the grand scheme of test construction, considerations related to the actual writing of the test's items go hand in hand with scaling considerations. The prospective test developer or item writer immediately faces three questions related to the test blueprint:
■ What range of content should the items cover?
■ Which of the many different types of item formats should be employed?
■ How many items should be written in total and for each content area covered?

An item pool is the reservoir or well from which items will or will not be drawn for the final version of the test. A comprehensive sampling of the content domain provides a basis for the content validity of the final version of the test. Because approximately half of these items will be eliminated from the test's final version, the test developer needs to ensure that the final version also contains items that adequately sample the domain.

Item format

Variables such as the form, plan, structure, arrangement, and layout of individual test items are collectively referred to as item format.

Items in a selected-response format require testtakers to select a response from a set of alternative responses. Three types of selected-response item formats: multiple-choice, matching, and true–false.

An item written in a multiple-choice format has three elements: (1) a stem, (2) a correct alternative or option, and (3) several incorrect alternatives or options variously referred to as distractors or foils.

In a matching item, the testtaker is presented with two columns: premises on the left and responses on the right. The testtaker's task is to determine which response is best associated with which premise.

A multiple-choice item that contains only two possible responses is called a binary-choice item. Perhaps the most familiar binary-choice item is the true–false item; other varieties pose choices such as agree or disagree, yes or no, right or wrong, or fact or opinion.

Items in a constructed-response format require testtakers to supply or to create the correct answer, not merely to select it. Three types of constructed-response items are the completion item, the short answer, and the essay.

A completion item requires the examinee to provide a word or phrase that completes a sentence, as in the following example:

The standard deviation is generally considered the most useful measure of __________.

A completion item may also be referred to as a short-answer item. It is desirable for completion or short-answer items to be written clearly enough that the testtaker can respond succinctly—that is, with a short answer.

An essay item is a test item that requires the testtaker to respond to a question by writing a composition, typically one that demonstrates recall of facts, understanding, analysis, and/or interpretation.

Writing items for computer administration

These programs typically make use of two advantages of digital media: the ability to store items in an item bank and the ability to individualize testing through a technique called item branching.
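A rough sketch of item branching follows; the item bank, answer key, and branching rule are all hypothetical assumptions for illustration:

```python
# Minimal sketch of item branching: the next item presented depends on the
# testtaker's response to the previous item. The item bank, answer key, and
# branching rule are all hypothetical.

ITEM_BANK = {1: "2 + 2 = ?", 2: "12 x 12 = ?", 3: "17 x 23 = ?"}  # by difficulty
ANSWER_KEY = {1: "4", 2: "144", 3: "391"}

def administer(responses: list[str]) -> list[str]:
    """Simulate a short branched administration; returns the items presented."""
    level, presented = 2, []  # start at medium difficulty
    for response in responses:
        presented.append(ITEM_BANK[level])
        correct = response == ANSWER_KEY[level]
        # Branch: a harder item after a correct response, an easier one after an error.
        level = min(level + 1, 3) if correct else max(level - 1, 1)
    return presented

print(administer(["144", "391", "4"]))  # ['12 x 12 = ?', '17 x 23 = ?', '17 x 23 = ?']
```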
An item bank is a relatively large and easily accessible collection of test questions. Ex. the Instructor Resources within Connect, in OOBAL-8-B1, "How to 'Fund' an Item Bank."

Computerized adaptive testing (CAT) refers to an interactive, computer-administered test-taking process wherein items presented to the testtaker are based in part on the testtaker's performance on previous items.

Scoring Items

Cumulative model - the model used most commonly, owing, in part, to its simplicity and logic. The rule in a cumulatively scored test is that the higher the score on the test, the higher the testtaker is on the ability, trait, or other characteristic that the test purports to measure.

Class scoring (also referred to as category scoring) - testtaker responses earn credit toward placement in a particular class or category with other testtakers whose pattern of responses is presumably similar in some way. This approach is used by some diagnostic systems wherein individuals must exhibit a certain number of symptoms to qualify for a specific diagnosis.
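The contrast between the two scoring models can be sketched in a few lines; the symptom threshold and response data are hypothetical:

```python
# Minimal sketch contrasting the two scoring models described above.
# The symptom threshold (>= 3) and the item data are hypothetical.

responses = [1, 0, 1, 1, 0]  # 1 = symptom endorsed / item correct

# Cumulative model: a higher total score means more of the measured trait.
cumulative_score = sum(responses)

# Class scoring: a response pattern places the testtaker into a category,
# e.g., "meets diagnostic threshold" if at least 3 symptoms are endorsed.
category = "meets threshold" if sum(responses) >= 3 else "does not meet threshold"

print(cumulative_score, category)  # 3 meets threshold
```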
Test Tryout

For each item, the tryout data can be used to calculate:
■ an index of the item's difficulty
■ an index of the item's reliability
■ an index of the item's validity
■ an index of item discrimination

The Item-Difficulty Index

An index of an item's difficulty is obtained by calculating the proportion of the total number of testtakers who answered the item correctly. Note that the larger the item-difficulty index, the easier the item. Because p refers to the percent of people passing an item, the higher the p for an item, the easier the item. The statistic referred to as an item-difficulty index in the context of achievement testing may be an item-endorsement index in other contexts, such as personality testing.
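A minimal computation of p from a 0/1 response matrix (the five-testtaker, three-item dataset is hypothetical):

```python
# Minimal sketch: the item-difficulty index p for each item is the proportion
# of testtakers answering that item correctly. The response data are hypothetical.

# Rows = testtakers, columns = items; 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
]

n_testtakers = len(responses)
p = [sum(row[j] for row in responses) / n_testtakers for j in range(len(responses[0]))]

print(p)  # [0.8, 0.8, 0.2] -> item 3 is the hardest (lowest p)
```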
The Item-Reliability Index

The item-reliability index provides an indication of the internal consistency of a test; the higher this index, the greater the test's internal consistency.
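One common formulation (an assumption here, consistent with standard treatments of this index) computes it as the product of the item-score standard deviation and the item-total correlation:

```python
# Sketch of one common formulation: item-reliability index = s_i * r_it,
# where s_i is the item-score standard deviation and r_it is the correlation
# between item score and total test score. Data are hypothetical.
import statistics

def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

item_scores = [1, 0, 1, 1, 0]
total_scores = [27, 14, 30, 22, 18]

s_i = statistics.pstdev(item_scores)         # item-score standard deviation
r_it = pearson_r(item_scores, total_scores)  # item-total correlation
print(s_i * r_it)                            # the item-reliability index
```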
Factor analysis and inter-item consistency

Factor analysis - a statistical tool useful in determining whether items on a test appear to be measuring the same thing(s). It is also useful in the test interpretation process, especially when comparing the constellation of responses to the items from two or more groups.

The Item-Validity Index

The item-validity index is a statistic designed to provide an indication of the degree to which a test is measuring what it purports to measure. The higher the item-validity index, the greater the test's criterion-related validity. The item-validity index can be calculated once the following two statistics are known: the item-score standard deviation and the correlation between the item score and the criterion score.

The Item-Discrimination Index

Measures of item discrimination indicate how adequately an item separates or discriminates between high scorers and low scorers on an entire test. A multiple-choice item on an achievement test is a good item if most of the high scorers answer correctly and most of the low scorers answer incorrectly. If most of the high scorers fail a particular item, these testtakers may be making an alternative interpretation of a response intended to serve as a distractor. The item-discrimination index is a measure of item discrimination, symbolized by a lowercase italic d.

Analysis of item alternatives

The quality of each alternative within a multiple-choice item can be readily assessed with reference to the comparative performance of upper and lower scorers. No formulas or statistics are necessary here. By charting the number of testtakers in the U and L groups who chose each alternative, the test developer can get an idea of the effectiveness of a distractor by means of a simple eyeball test.
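A sketch of both ideas, assuming the common definition of d as the proportion of upper-group scorers answering correctly minus the proportion of lower-group scorers answering correctly; the group data are hypothetical:

```python
# Minimal sketch: item-discrimination index d as the difference between the
# proportions of upper (U) and lower (L) scorers answering correctly, plus a
# simple chart of which alternatives each group chose. Data are hypothetical.

upper_choices = ["b", "b", "b", "c", "b", "b"]  # keyed answer: "b"
lower_choices = ["a", "c", "b", "d", "c", "a"]

def proportion_correct(choices: list[str], key: str) -> float:
    return sum(c == key for c in choices) / len(choices)

d = proportion_correct(upper_choices, "b") - proportion_correct(lower_choices, "b")
print(round(d, 2))  # 0.67 -> item separates high and low scorers well

# "Eyeball test" for distractors: count choices per alternative in each group.
for option in "abcd":
    print(option, upper_choices.count(option), lower_choices.count(option))
```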
Item-Characteristic Curves

Item response theory (IRT) can be a powerful tool not only for understanding how test items perform but also for creating or modifying individual test items, building new tests, and revising existing tests. Item-characteristic curves (ICCs) can play a role in decisions about which items are working well and which items are not. Recall that an item-characteristic curve is a graphic representation of item difficulty and discrimination.
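One common way to draw such a curve is the two-parameter logistic (2PL) IRT model, in which b locates the item's difficulty and a sets its discrimination (slope). The model choice and parameter values here are illustrative assumptions:

```python
# Sketch of an item-characteristic curve under the two-parameter logistic
# (2PL) IRT model: P(correct) = 1 / (1 + exp(-a * (theta - b))).
# a = discrimination (slope), b = difficulty (location). Values are hypothetical.
import math

def icc(theta: float, a: float = 1.5, b: float = 0.0) -> float:
    """Probability of a correct response at ability level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A steeper curve (larger a) discriminates better near its difficulty b.
for theta in (-2, -1, 0, 1, 2):
    print(theta, round(icc(theta), 2))  # rises from ~0.05 to ~0.95
```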
Other Considerations in Item Analysis

Guessing

In achievement testing, the problem of how to handle testtaker guessing is one that has eluded any universally acceptable solution. The following are three criteria that any correction for guessing must meet, as well as the other interacting issues that must be addressed:

1. A correction for guessing must recognize that, when a respondent guesses at an answer on an achievement test, the guess is not typically made on a totally random basis. It is more reasonable to assume that the testtaker's guess is based on some knowledge of the subject matter and the ability to rule out one or more of the distractor alternatives. However, the individual testtaker's amount of knowledge of the subject matter will vary from one item to the next.

2. A correction for guessing must also deal with the problem of omitted items. Sometimes, instead of guessing, the testtaker will simply omit a response to an item. Should the omitted item be scored "wrong"? Should the omitted item be excluded from the item analysis? Should the omitted item be scored as if the testtaker had made a random guess? Exactly how should the omitted item be handled?

3. Just as some people may be luckier than others in front of a Las Vegas slot machine, so some testtakers may be luckier than others in guessing the choices that are keyed correct. Any correction for guessing may seriously underestimate or overestimate the effects of guessing for lucky and unlucky testtakers.
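For reference, the traditional correction for guessing scores a test as R minus W/(k - 1), where R is the number right, W the number wrong, and k the number of options per item. It is shown here only as the classic approach; as the criteria above suggest, no correction is fully satisfactory:

```python
# Sketch of the classic correction-for-guessing formula: score = R - W / (k - 1),
# with R = items right, W = items wrong (omits excluded), k = options per item.
# Shown as the traditional approach only; the text stresses that no correction
# satisfies all three criteria above.

def corrected_score(right: int, wrong: int, k: int) -> float:
    """Penalize wrong answers by the expected gain from random guessing."""
    return right - wrong / (k - 1)

# 40 right, 10 wrong on 4-option multiple-choice items:
print(round(corrected_score(40, 10, 4), 2))  # 36.67
```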
A biased test item is an item that favors one particular group of examinees in relation to another when differences in group ability are controlled. Item-characteristic curves can be used to identify biased items.

Speed tests - item analyses of tests taken under speed conditions yield misleading or uninterpretable results. The closer an item is to the end of the test, the more difficult it may appear to be. This is because testtakers simply may not get to items near the end of the test before time runs out.

How can items on a speed test be analyzed? One approach is to restrict the item analysis of a speed test only to the items completed by the testtaker. However, this solution is not recommended, for at least three reasons: (1) item analyses of the later items would be based on a progressively smaller number of testtakers, yielding progressively less reliable results; (2) if the more knowledgeable examinees reach the later items, then part of the analysis is based on all testtakers and part is based on a selected sample; and (3) because the more knowledgeable testtakers are more likely to score correctly, their performance will make items occurring toward the end of the test appear to be easier than they are.

Qualitative Item Analysis

Qualitative item analysis is a general term for various nonstatistical procedures designed to explore how individual test items work. The analysis compares individual test items to each other and to the test as a whole.
Qualitative methods involve exploration of the issues through verbal means such as interviews and group discussions conducted with testtakers and other relevant parties.

"Think aloud" test administration - a qualitative research tool designed to shed light on the testtaker's thought processes during the administration of a test. On a one-to-one basis with an examiner, examinees are asked to take a test, thinking aloud as they respond to each item. If the test is designed to measure personality or some aspect of it, the "think aloud" technique may also yield valuable insights regarding the way individuals perceive, interpret, and respond to the items.

Expert panels - provide qualitative analyses of test items. One such analysis is the sensitivity review, a study of test items, typically conducted during the test development process, in which items are examined for fairness to all prospective testtakers and for the presence of offensive language, stereotypes, or situations.

Possible forms of content bias that may find their way into any achievement test were identified as follows (Stanford Special Report, 1992, pp. 3–4):

Status: Are the members of a particular group shown in situations that do not involve authority or leadership?

Stereotype: Are the members of a particular group portrayed as uniformly having certain (1) aptitudes, (2) interests, (3) occupations, or (4) personality characteristics?

Familiarity: Is there greater opportunity on the part of one group to (1) be acquainted with the vocabulary or (2) experience the situation presented by an item?

Offensive Choice of Words: (1) Has a demeaning label been applied, or (2) has a male term been used where a neutral term could be substituted?

Other: Panel members were asked to be specific regarding any other indication of bias they detected.

Test revision - the development process that a test undergoes as it is modified and revised; it occurs both as a stage in the development of a new test and in the life cycle of an existing test.

Test Revision as a Stage in New Test Development

Test Revision in the Life Cycle of an Existing Test

Cross-validation - revalidation of a test on a sample of testtakers other than those on whom test performance was originally found to be a valid predictor of some criterion. The decrease in item validities that inevitably occurs after cross-validation of findings is referred to as validity shrinkage.

Co-validation - a test validation process conducted on two or more tests using the same sample of testtakers. When used in conjunction with the creation of norms or the revision of existing norms, this process may also be referred to as co-norming.

Quality assurance during test revision

If there were discrepancies in scoring, the discrepancies were resolved by yet another scorer, referred to as a resolver. According to the manual, "The resolvers were selected based on their demonstration of exceptional scoring accuracy and previous scoring experience."

Another mechanism for ensuring consistency in scoring is the anchor protocol. An anchor protocol is a test protocol scored by a highly authoritative scorer that is designed as a model for scoring and a mechanism for resolving scoring discrepancies. A discrepancy between scoring in an anchor protocol and the scoring of another protocol is referred to as scoring drift.
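A minimal sketch of how scoring drift might be flagged against an anchor protocol; the scores and the item-by-item comparison rule are hypothetical:

```python
# Minimal sketch: flag scoring drift by comparing a scorer's item scores with
# the anchor protocol's scores. The data and comparison rule are hypothetical.

anchor_scores = [2, 3, 1, 4, 2]  # anchor protocol (authoritative scoring)
scorer_scores = [2, 3, 2, 4, 1]  # the scorer being checked

drift = [i for i, (a, s) in enumerate(zip(anchor_scores, scorer_scores)) if a != s]
print(drift)  # [2, 4] -> items whose scoring departs from the anchor
```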
The Use of IRT in Building and Revising Tests

Item response theory (IRT) could be applied in the evaluation of the utility of tests and testing programs. One such application is the study of differential item functioning (DIF), in which items are flagged when respondents from different groups, at the same level of the underlying trait, have different probabilities of endorsing them as a function of their group membership.

Developing item banks - each of the items assembled as part of an item bank, whether taken from an existing test (with appropriate permissions, if necessary) or written especially for the item bank, has undergone rigorous qualitative and quantitative evaluation.
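As a rough illustration of the DIF idea (a sketch only; the use of a total-score band as a proxy for trait level, and the data, are assumptions rather than the text's procedure):

```python
# Minimal sketch of the DIF idea: at the SAME level of the underlying trait,
# do two groups endorse an item at different rates? Data are hypothetical, and
# a total-score band is used as a crude proxy for trait level.

# (group, total_score_band, endorsed_item)
records = [
    ("A", "mid", 1), ("A", "mid", 1), ("A", "mid", 0), ("A", "mid", 1),
    ("B", "mid", 0), ("B", "mid", 0), ("B", "mid", 1), ("B", "mid", 0),
]

def endorsement_rate(group: str, band: str) -> float:
    hits = [r[2] for r in records if r[0] == group and r[1] == band]
    return sum(hits) / len(hits)

# A large gap at a matched trait level suggests the item may function
# differently across groups (a candidate DIF item).
print(endorsement_rate("A", "mid") - endorsement_rate("B", "mid"))  # 0.5
```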