Reliability and Validity
These two terms, reliability and validity, are often used interchangeably when they are not related to statistics. When critical readers of statistics use these terms, however, they refer to different properties of the statistical or experimental method.
Reliability is another term for consistency. If one person takes the same personality test several times and always receives the same results, the test is reliable.
A test is valid if it measures what it is supposed to measure. If the results of the personality test claimed that a very shy person was in fact outgoing, the test would be invalid.
Reliability and validity are independent of each other. A measurement may be valid but not reliable, or reliable but not valid. Suppose your bathroom scale was reset to read 10 pounds lighter. The weight it reads will be reliable (the same every time you step on it) but will not be valid, since it is not reading your actual weight.
Types of Validity
Martyn Shuttleworth
Here is an overview on the main types of validity used for the scientific method.
"Any research can be affected by different kinds of factors which, while extraneous to the concerns of
the research, can invalidate the findings" (Seliger & Shohamy 1989, 95).
Let's take a look at the most frequent uses of validity in the scientific method:
External Validity
External validity is about generalization: To what extent can an effect found in research be generalized to populations, settings, treatment variables, and measurement variables?
External validity is usually split into two distinct types, population validity and ecological validity, both of which are essential elements in judging the strength of an experimental design.
Internal Validity
Internal validity is a measure of whether a researcher's experimental design closely follows the principle of cause and effect.
“Could there be an alternative cause, or causes, that explain my observations and results?”
Test Validity
Test validity is an indicator of how much meaning can be placed upon a set of test results.
Criterion Validity
Criterion validity assesses whether a test reflects a certain set of abilities.
Concurrent validity measures the test against a benchmark test, and a high correlation indicates that the test has strong criterion validity.
Predictive validity is a measure of how well a test predicts abilities. It involves testing a group of subjects
for a certain construct and then comparing them with results obtained at some point in the future.
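As a rough sketch of how these checks might be computed in practice, the short Python example below correlates a new test against a benchmark test (concurrent validity) and against an outcome measured later (predictive validity). All scores and variable names are invented for illustration only.

```python
# A rough sketch of checking criterion validity with correlations.
# All scores below are invented for illustration.
from scipy.stats import pearsonr

new_test      = [52, 61, 70, 45, 88, 66, 73, 59]          # scores on the new test
benchmark     = [50, 65, 72, 40, 90, 62, 75, 55]          # established benchmark test (concurrent)
later_outcome = [3.1, 3.8, 4.2, 2.7, 4.9, 3.6, 4.4, 3.2]  # outcome measured later (predictive)

r_concurrent, _ = pearsonr(new_test, benchmark)
r_predictive, _ = pearsonr(new_test, later_outcome)

print(f"Concurrent validity (vs. benchmark): r = {r_concurrent:.2f}")
print(f"Predictive validity (vs. later outcome): r = {r_predictive:.2f}")
```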
Content Validity
Content validity is the estimate of how much a measure represents every single element of a construct.
Construct Validity
Construct validity defines how well a test or experiment measures up to its claims. A test designed to
measure depression must only measure that particular construct, not closely related ideas such as
anxiety or stress.
Convergent validity tests that constructs that are expected to be related are, in fact, related.
Discriminant validity (also referred to as divergent validity) tests that constructs that should have no relationship do, in fact, have no relationship.
Face Validity
Face validity is a measure of how representative a research project is 'at face value,' and whether it
appears to be a good project.
Online Threats to Quality
Reliability is a measure of the consistency of a metric or a method.
Every metric or method we use, including things like methods for uncovering usability problems in an
interface and expert judgment, must be assessed for reliability.
In fact, before you can establish validity, you need to establish reliability.
Here are the four most common ways of measuring reliability for any empirical method or metric:
inter-rater reliability
test-retest reliability
parallel forms reliability
internal consistency reliability
Because reliability comes from a history in educational measurement (think standardized tests), many of
the terms we use to assess reliability come from the testing lexicon. But don’t let bad memories of
testing lead you to dismiss their relevance to measuring the customer experience.
Inter-Rater Reliability
The extent to which raters or observers respond the same way to a given phenomenon is one measure
of reliability. Where there's judgment, there's disagreement.
Even highly trained experts disagree among themselves when observing the same phenomenon. Kappa
and the correlation coefficient are two common measures of inter-rater reliability. Some examples
include:
Evaluators identifying interface problems
Experts rating the severity of a problem
For example, we found that the average inter-rater reliability of usability experts rating the severity
of usability problems was r = .52. You can also measure intra-rater reliability, whereby you correlate
multiple scores from one observer. In that same study, we found that the average intra-rater reliability
when judging problem severity was r = .58 (which is generally low reliability).
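For illustration, here is a minimal Python sketch of both measures, using invented severity ratings from two hypothetical evaluators; scipy and scikit-learn are assumed here simply as one convenient way to compute a correlation and Cohen's kappa.

```python
# A minimal sketch (with made-up severity ratings) of two common inter-rater
# reliability measures: Cohen's kappa for agreement and a correlation coefficient.
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

# Two hypothetical evaluators rating the severity of the same 10 usability
# problems on a 1 (cosmetic) to 4 (critical) scale.
rater_a = [1, 3, 2, 4, 2, 3, 1, 4, 3, 2]
rater_b = [1, 2, 2, 4, 3, 3, 1, 4, 2, 2]

kappa = cohen_kappa_score(rater_a, rater_b)  # agreement corrected for chance
r, _ = pearsonr(rater_a, rater_b)            # linear association of the ratings

print(f"Cohen's kappa = {kappa:.2f}, correlation r = {r:.2f}")
```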
Test-Retest Reliability
Do customers provide the same set of responses when nothing about their experience or their attitudes
has changed? You don’t want your measurement system to fluctuate when all other things are static.
Have a set of participants answer a set of questions (or perform a set of tasks). Later (by at least a few
days, typically), have them answer the same questions again. When you correlate the two sets of
measures, look for very high correlations (r > 0.7) to establish retest reliability.
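A minimal sketch of that correlation step, using invented scores for eight hypothetical participants, might look like this in Python:

```python
# A minimal sketch of test-retest reliability: correlate the same participants'
# scores from two administrations of the same questionnaire. Data are invented.
from scipy.stats import pearsonr

time_1 = [72, 65, 80, 55, 90, 68, 77, 61]  # scores at the first administration
time_2 = [70, 66, 82, 58, 88, 70, 75, 63]  # same participants, a few days later

r, _ = pearsonr(time_1, time_2)
print(f"Test-retest reliability: r = {r:.2f}")  # look for r > 0.7
```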
As you can see, there's some effort and planning involved: you need participants to agree to answer
the same questions twice. Few questionnaires measure test-retest reliability (mostly because of the
logistics), but with the proliferation of online research, we should encourage more of this type of
measure.
Parallel Forms Reliability
Getting the same or very similar results from slight variations on the question or evaluation method also
establishes reliability. One way to achieve this is to have, say, 20 items that measure one construct
(satisfaction, loyalty, usability), to administer one set of 10 items and then the other 10 to the same group of participants, and to correlate the results. You're looking for high correlations and no systematic difference in scores between the two forms.
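As an illustration, the short sketch below correlates invented total scores on the two 10-item halves for the same hypothetical participants.

```python
# A minimal sketch of parallel-forms reliability: each hypothetical participant
# answers both 10-item halves of a 20-item scale, and the two half-scores are
# correlated. All numbers are invented.
from scipy.stats import pearsonr

form_a = [38, 41, 29, 45, 33, 40, 36, 42]  # total score on the first 10 items
form_b = [36, 43, 31, 44, 35, 38, 37, 40]  # total score on the other 10 items

r, _ = pearsonr(form_a, form_b)
print(f"Parallel forms reliability: r = {r:.2f}")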
Internal Consistency Reliability
This is by far the most commonly used measure of reliability in applied settings. It’s popular because it’s
the easiest to compute using software—it requires only one sample of data to estimate the internal
consistency reliability. This measure of reliability is described most often using Cronbach’s alpha
(sometimes called coefficient alpha).
It measures how consistently participants respond to one set of items. You can think of it as a sort of
average of the correlations between items. Cronbach’s alpha ranges from 0.0 to 1.0 (a negative alpha
means you probably need to reverse some items). Since the late 1960s, the minimally acceptable
measure of reliability has been 0.70; in practice, though, for high-stakes questionnaires, aim for greater
than 0.90. For example, the SUS has a Cronbach’s alpha of 0.92.
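To make the computation concrete, here is a minimal sketch of Cronbach's alpha calculated from its standard formula on a small, invented response matrix (rows are respondents, columns are items):

```python
# A minimal sketch of Cronbach's alpha computed from item-level responses.
# Rows are respondents, columns are questionnaire items; data are invented.
import numpy as np

responses = np.array([
    [4, 5, 4, 4],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [2, 3, 2, 3],
    [4, 4, 5, 5],
])

k = responses.shape[1]                         # number of items
item_vars = responses.var(axis=0, ddof=1)      # variance of each item
total_var = responses.sum(axis=1).var(ddof=1)  # variance of the summed scores

alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")       # aim for >= 0.70 (0.90+ for high stakes)
```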
The more items you have, the more internally reliable the instrument, so to increase internal consistency
reliability, you would add items to your questionnaire. Since there’s often a strong need to have few
items, however, internal reliability usually suffers. When you have only a few items, and therefore usually
lower internal reliability, having a larger sample size helps offset the loss in reliability.
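The trade-off between the number of items and internal consistency can be quantified with the standard Spearman-Brown prediction formula; the small helper below is an illustrative sketch added here for context, not something from the article itself.

```python
# Spearman-Brown prediction: expected reliability after lengthening (or
# shortening) a test by a given factor. Example numbers are invented.
def spearman_brown(current_reliability: float, length_factor: float) -> float:
    """Predicted reliability after changing test length by `length_factor`."""
    r = current_reliability
    return (length_factor * r) / (1 + (length_factor - 1) * r)

# e.g. doubling a scale that currently has alpha = 0.60 predicts roughly 0.75
print(f"{spearman_brown(0.60, 2):.2f}")
```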
In Summary
Here are a few things to keep in mind about measuring reliability:
Reliability is the consistency of a measure or method over time.
Reliability is necessary but not sufficient for establishing a method or metric as valid.
There isn’t a single measure of reliability, instead there are four common measures of consistent
responses.
You’ll want to use as many measures of reliability as you can (although in most cases one is sufficient to
understand the reliability of your measurement system).
Even if you can’t collect reliability data, be aware of the ways in which low reliability may affect the
validity of your measures, and ultimately the veracity of your decisions.