Course: Educational Assessment and Evaluation
Code: 8602
Semester: Spring, 2021
Assignment # 2
Level: B. Ed
Q.1 What is the relationship between validity and reliability of a test?
ANS:
Reliability and validity are two different standards used to gauge the usefulness of a test.
Though different, they work together. It would not be beneficial to design a test with good
reliability that did not measure what it was intended to measure. The inverse is impossible:
we cannot accurately measure what we want to measure with a test so flawed that its results
are not reproducible. Reliability is therefore a necessary requirement for validity: a test must
have good reliability in order to have validity. Reliability puts a cap or limit on validity, and
if a test is not reliable, it cannot be valid. Establishing good reliability, however, is only the
first part of establishing validity; validity has to be established separately. Having good
reliability does not mean we have good validity; it just means we are measuring something
consistently. We must then establish what it is that we are measuring consistently. The main
point here is that reliability is necessary but not sufficient for validity. In short, reliability
means nothing when the problem is validity.
Validity
Educational assessment should always have a clear purpose. Nothing will be gained from
assessment unless the assessment has some validity for the purpose. For that
reason, validity is the most important single attribute of a good test.
The validity of an assessment tool is the extent to which it measures what it was designed to
measure, without contamination from other characteristics. For example, a test of reading
comprehension should not require mathematical ability.
There are several different types of validity:
• Face validity: do the assessment items appear to be appropriate?
• Content validity: does the assessment content cover what you want to assess?
• Criterion-related validity: how well does the test measure what you want it to?
• Construct validity: are you measuring what you think you're measuring?
It is fairly obvious that a valid assessment should have a good coverage of the criteria
(concepts, skills and knowledge) relevant to the purpose of the examination. The important
notion here is the purpose. For example:
• The PROBE test is a form of reading running record which measures reading behaviours
and includes some comprehension questions. It allows teachers to see the reading strategies
that students are using, and potential problems with decoding. The test would not,
however, provide in-depth information about a student’s comprehension strategies across a
range of texts.
• STAR (Supplementary Test of Achievement in Reading) is not designed as a
comprehensive test of reading ability. It focuses on assessing students’ vocabulary
understanding, basic sentence comprehension and paragraph comprehension. It is most
appropriately used for students who don’t score well on more general testing (such as PAT
or e-asTTle) as it provides a more fine-grained analysis of basic comprehension strategies.
There is an important relationship between reliability and validity. An assessment that has
very low reliability will also have low validity; clearly a measurement with very poor
accuracy or consistency is unlikely to be fit for its purpose. But, by the same token, the things
required to achieve a very high degree of reliability can impact negatively on validity. For
example, consistency in assessment conditions leads to greater reliability because it reduces
'noise' (variability) in the results. On the other hand, one of the things that can improve
validity is flexibility in assessment tasks and conditions. Such flexibility allows assessment to
be set appropriate to the learning context and to be made relevant to particular groups of
students. Insisting on highly consistent assessment conditions to attain high reliability will
result in little flexibility, and might therefore limit validity.
The Overall Teacher Judgment balances these notions, weighing the reliability of a formal
assessment tool against the flexibility to use other evidence to make a judgment.
Reliability
The reliability of an assessment tool is the extent to which it consistently and accurately
measures learning.
When the results of an assessment are reliable, we can be confident that repeated or
equivalent assessments will provide consistent results. This puts us in a better position to
make generalized statements about a student’s level of achievement, which is especially
important when we are using the results of an assessment to make decisions about teaching
and learning, or when we are reporting back to students and their parents or caregivers. No
results, however, can be completely reliable. There is always some random variation that may
affect the assessment, so educators should always be prepared to question results.
Factors which can affect reliability:
• The length of the assessment – a longer assessment generally produces more reliable
results.
• The suitability of the questions or tasks for the students being assessed.
• The phrasing and terminology of the questions.
• The consistency in test administration – for example, the length of time given for the
assessment, instructions given to students before the test.
• The design of the marking schedule and moderation of marking procedures.
• The readiness of students for the assessment – for example, a hot afternoon or straight after
physical activity might not be the best time for students to be assessed.
Q.2 Define scoring criteria for essay-type test items for 8th grade?
ANS:
A rubric or scoring criteria is developed to evaluate/score an essay type item. A rubric is a
scoring guide for subjective assessments. It is a set of criteria and standards linked to learning
objectives that are used to assess a student's performance on papers, projects, essays, and
other assignments. Rubrics allow for standardized evaluation according to specified criteria,
making grading simpler and more transparent. A rubric may vary from a simple checklist to
an elaborate combination of checklists and rating scales. How elaborate your rubric is depends
on what you are trying to measure. If your essay item is a restricted-response item simply
assessing mastery of factual content, a fairly simple listing of essential points will be
sufficient. An example of a rubric for a restricted-response item is given below.
Test Item: Name and describe five of the most important factors of unemployment in
Pakistan.
Rubric/Scoring Criteria:
(i) 1 point for each of the factors named, to a maximum of 5 points
(ii) One point for each appropriate description of the factors named, to a maximum of 5
points
(iii) No penalty for spelling, punctuation, or grammatical errors
(iv) No extra credit for more than five factors named or described.
(v) Extraneous information will be ignored.
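To make the mechanics of this rubric concrete, here is a minimal sketch of how its scoring rules could be applied in code; the function and the sample factor lists are hypothetical illustrations, not part of any prescribed marking scheme.

def score_response(named_factors, described_factors):
    # 1 point per factor named and 1 per factor appropriately described,
    # each capped at 5 points (rules i, ii, and iv).
    naming_points = min(len(named_factors), 5)
    description_points = min(len(described_factors), 5)
    # Rules (iii) and (v): spelling/grammar and extraneous material are
    # simply never counted, so no deduction logic is needed.
    return naming_points + description_points

# Example: a student names six factors but adequately describes only four.
print(score_response(["inflation", "low investment", "energy crisis",
                      "population growth", "poor skills", "automation"],
                     ["inflation", "low investment", "energy crisis",
                      "population growth"]))   # -> 9 out of 10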
However, when essay items measure the higher-order thinking skills of the cognitive domain,
more complex rubrics are required. An example of a rubric for a writing test in language is
given below.
Scoring Criteria (Rubrics) for Essay Type Item for 8th grade.
Advantages of Essay Type Items
The main advantages of essay type tests are as follows:
(i) They can measure complex learning outcomes which cannot be measured by other means.
(ii) They emphasize integration and application of thinking and problem solving skills.
(iii) They can be easily constructed.
(iv) They give examinees freedom to respond within broad limits.
(v) The students cannot guess the answer because they have to supply it rather than select it.
(vi) Practically, it is more economical to use essay-type tests if the number of students is small.
(vii) They require less time for typing, duplicating or printing. They can also be written on the
blackboard if the number of students is not large.
(viii) They can measure divergent thinking.
(ix) They can be used as a device for measuring and improving the language and expression
skills of examinees.
(x) They are more helpful in evaluating the quality of the teaching process.
(xi) Studies have shown that when students know that essay-type questions will be asked,
they focus on learning broad concepts, articulating relationships, and contrasting and
comparing.
(xii) They set higher standards of professional ethics for teachers because they demand more
of teachers' time in assessing and scoring.
Limitations of Essay Type Items
The essay type tests have the following serious limitations as a measuring instrument:
(i) A major problem is the lack of consistency in judgments even among competent
examiners.
(ii) They are subject to halo effects. If the examiner is measuring one characteristic, he can be
influenced in scoring by another characteristic. For example, a well-behaved student may
score more marks partly on account of his good behaviour.
(iii) They have a question-to-question carry-over effect. An examinee who has answered
satisfactorily at the beginning of a question or set of questions is likely to score more than one
who did not do well at the beginning but did well later on.
(iv) They have an examinee-to-examinee carry-over effect. A particular examinee gets marks
not only on the basis of what he has written but also on the basis of whether the previous
answer book examined by the examiner was good or bad.
(v) They have limited content validity because only a sample of questions can be asked in an
essay-type test.
(vi) They are difficult to score objectively because the examinee has wide freedom of
expression and writes long answers.
(vii) They are time-consuming both for the examiner and the examinee.
(viii) They generally emphasize the lengthy enumeration of memorized facts.
Suggestions for Writing Essay Type Items
I. Ask questions or establish tasks that will require the student to demonstrate command of
essential knowledge. This means that students should not be asked merely to reproduce
material heard in a lecture or read in a textbook. To "demonstrate command" requires that the
question be somewhat novel or new. The substance of the question should be essential
knowledge rather than trivia that might be a good board game question.
II. Ask questions that are determinate, in the sense that experts (colleagues in the field) could
agree that one answer is better than another. Questions that contain phrases such as "What do
you think..." or "What is your opinion about..." are indeterminate. They can be used as a
medium for assessing skill in written expression, but because they have no clearly right or
wrong answer, they are useless for measuring other aspects of achievement.
III. Define the examinee's task as completely and specifically as possible without interfering
with the measurement process itself. It is possible to word an essay item so precisely that
there is one and only one very brief answer to it. The imposition of such rigid bounds on the
response is more limiting than it is helpful. Examinees do need guidance, however, to judge
how extensive their response must be to be considered complete and accurate.
IV. Generally give preference to specific questions that can be answered briefly. The more
questions used, the better the test constructor can sample the domain of knowledge covered
by the test. And the more responses available for scoring, the more accurate the total test
scores are likely to be. In addition, brief responses can be scored more quickly and more
accurately than long, extended responses, even when there are fewer of the latter type.
V. Use enough items to sample the relevant content domain adequately, but not so many that
students do not have sufficient time to plan, develop, and review their responses. Some
instructors use essay tests rather than one of the objective types because they want to
encourage and provide practice in written expression. However, when time pressures become
great, the essay test is one of the most unrealistic and negative writing experiences to which
students can be exposed. Often there is no time for editing, for rereading, or for checking
spelling. Planning time is short-changed so that writing time will not be. There are few, if
any, real writing tasks that require such conditions. And there are few writing experiences
that discourage the use of good writing habits as much as essay testing does.
VI. Avoid giving examinees a choice among optional questions unless special circumstances
make such options necessary. The use of optional items destroys the strict comparability
between student scores because not all students actually take the same test. Student A may
have answered items 1-3 and Student B may have answered 3-5. In these circumstances the
variability of scores is likely to be quite small because students were able to respond to items
they knew more about and ignore items with which they were unfamiliar. This reduced
variability contributes to reduced test score reliability. That is, we are less able to identify
individual differences in achievement when the test scores form a very homogeneous
distribution. In sum, optional items restrict score comparability between students and
contribute to low score reliability due to reduced test score variability.
VII. Test the question by writing an ideal answer to it. An ideal response is needed eventually
to score the responses. If it is prepared early, it permits a check on the wording of the item,
the level of completeness required for an ideal response, and the amount of time required to
furnish a suitable response. It even allows the item writer to determine if there is any
"correct" response to the question.
VIII. Specify the time allotment for each item and/or specify the maximum number of points
to be awarded for the "best" answer to the question. Both pieces of information provide
guidance to the examinee about the depth of response expected by the item writer. They also
represent legitimate pieces of information a student can use to decide which of several items
should be omitted when time begins to run out. Often the number of points attached to the
item reflects the number of essential parts to the ideal response. Of course if a definite
number of essential parts can be determined, that number should be indicated as part of the
question.
IX. Divide a question into separate components when there are obvious multiple questions or
pieces to the intended responses. The use of parts helps examinees organizationally and,
hence, makes the process more efficient. It also makes the grading process easier because it
encourages organization in the responses. Finally, if multiple questions are not identified,
some examinees may inadvertently omit some parts, especially when time constraints are
great.
Q.3 Write a note on Mean, Median and Mode. Also discuss their
importance in interpreting test scores?
ANS:
The terms mean, median, mode, and range describe properties of statistical distributions. In
statistics, a distribution is the set of all possible values for terms that represent defined events.
The value of a term, when expressed as a variable, is called a random variable.
There are two major types of statistical distributions. The first type contains discrete random
variables. This means that every term has a precise, isolated numerical value. The second
major type of distribution contains a continuous random variable. A continuous random
variable is a random variable whose data can take infinitely many values. When a term can
take any value within an unbroken interval or span, the distribution is described by a
probability density function.
IT professionals need to understand the definition of mean, median, mode and range to plan
capacity and balance load, manage systems, perform maintenance and troubleshoot
issues. Furthermore, understanding of statistical terms is important in the growing
field of data science.
Understanding the definition of mean, median, mode and range is important for IT
professionals in data center management. Many relevant tasks require the administrator to
calculate mean, median, mode or range, or often some combination, to show a statistically
significant quantity, trend or deviation from the norm. Finding the mean, median, mode and
range is only the start. The administrator then needs to apply this information to investigate
root causes of a problem, accurately forecast future needs or set acceptable working
parameters for IT systems.
When working with a large data set, it can be useful to represent the entire data set with a
single value that describes the "middle" or "average" value of the entire set. In statistics, that
single value is called the central tendency and mean, median and mode are all ways to
describe it. To find the mean, add up the values in the data set and then divide by the number
of values that you added. To find the median, list the values of the data set in numerical order
and identify which value appears in the middle of the list. To find the mode, identify which
value in the data set occurs most often. Range, which is the difference between the largest
and smallest value in the data set, describes how well the central tendency represents the data.
If the range is large, the central tendency is not as representative of the data as it would be if
the range was small.
Mean
The most common expression for the mean of a statistical distribution with a discrete random
variable is the mathematical average of all the terms. To calculate it, add up the values of all
the terms and then divide by the number of terms. The mean of a statistical distribution with a
continuous random variable, also called the expected value, is obtained by integrating the
product of the variable with its probability as defined by the distribution. The expected value
is denoted by the lowercase Greek letter mu (µ).
Median
The median of a distribution with a discrete random variable depends on whether the number
of terms in the distribution is even or odd. If the number of terms is odd, then the median is
the value of the term in the middle. This is the value such that the number of terms having
values greater than or equal to it is the same as the number of terms having values less than or
equal to it. If the number of terms is even, then the median is the average of the two terms in
the middle, such that the number of terms having values greater than or equal to it is the same
as the number of terms having values less than or equal to it.
The median of a distribution with a continuous random variable is the value m such that the
probability is at least 1/2 (50%) that a randomly chosen point on the function will be less than
or equal to m, and the probability is at least 1/2 that a randomly chosen point on the function
will be greater than or equal to m.
Mode
The mode of a distribution with a discrete random variable is the value of the term that occurs
the most often. It is not uncommon for a distribution with a discrete random variable to have
more than one mode, especially if there are not many terms. This happens when two or more
terms occur with equal frequency, and more often than any of the others.
A distribution with two modes is called bimodal. A distribution with three modes is called
trimodal. The mode of a distribution with a continuous random variable is the maximum
value of the function. As with discrete distributions, there may be more than one mode.
Importance in interpreting test scores
To calculate mean, add together all of the numbers in a set and then divide the sum by the
total count of numbers. For example, in a data center rack, five servers consume 100 watts,
98 watts, 105 watts, 90 watts and 102 watts of power, respectively. The mean power use of
that rack is calculated as (100 + 98 + 105 + 90 + 102 W)/5 servers = a calculated mean of 99
W per server. Intelligent power distribution units report the mean power utilization of the
rack to systems management software.
In the data center, means and medians are often tracked over time to spot trends, which
inform capacity planning or power cost predictions. The statistical median is the middle
number in a sequence of numbers. To find the median, organize each number in order by
size; the number in the middle is the median. For the five servers in the rack, arrange the
power consumption figures from lowest to highest: 90 W, 98 W, 100 W, 102 W and 105 W.
The median power consumption of the rack is 100 W. If there is an even set of numbers,
average the two middle numbers. For example, if the rack had a sixth server that used 110 W,
the new number set would be 90 W, 98 W, 100 W, 102 W, 105 W and 110 W. Find the
median by averaging the two middle numbers: (100 + 102)/2 = 101 W.
The mode is the number that occurs most often within a set of numbers. For the server power
consumption examples above, there is no mode because each element is different. But
suppose the administrator measured the power consumption of an entire network operations
center (NOC) and the set of numbers is 90 W, 104 W, 98 W, 98 W, 105 W, 92 W, 102 W,
100 W, 110 W, 98 W, 210 W and 115 W. The mode is 98 W since that power consumption
measurement occurs most often amongst the 12 servers. Mode helps identify the most
common or frequent occurrence of a characteristic. It is possible to have two modes
(bimodal), three modes (trimodal) or more modes within larger sets of numbers.
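The worked figures above are easy to verify with the same statistics module; multimode is the safer call when a set may be bimodal, trimodal, or have no repeated value at all.

import statistics

rack = [100, 98, 105, 90, 102]                 # watts for the five servers
print(statistics.mean(rack))                    # 99.0 W, matching the calculation above
print(statistics.median(rack))                  # 100 W
print(statistics.median(rack + [110]))          # 101.0 W once a sixth server is added

noc = [90, 104, 98, 98, 105, 92, 102, 100, 110, 98, 210, 115]
print(statistics.mode(noc))                     # 98 W, the most frequent measurement
print(statistics.multimode(rack))               # every value occurs once: no single mode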
Q.4 Write the procedure of assigning letter grades to test scores?
ANS:
Assigning letter grades
The letter grade system is the most popular in the world, including in Pakistan. Most teachers
face problems while assigning grades. There are four core problems or issues in this regard:
1) what should be included in a letter grade? 2) how should achievement data be combined in
assigning letter grades? 3) what frame of reference should be used in grading? and 4) how
should the distribution of letter grades be determined?
1. Determining what to include in a grade
Letter grades are likely to be most meaningful and useful when they represent achievement
only. If they are combined with other factors or aspects such as effort, amount of work
completed, personal conduct, and so on, their interpretation will become hopelessly confused.
For example, a letter grade C may represent average achievement with extraordinary effort
and excellent conduct and behavior or vice versa. If letter grades are to be valid indicators of
achievement, they must be based on valid measures of achievement. This involves defining
objectives as intended learning outcomes and developing or selecting tests and assessments
which can measure these learning outcomes.
2. Combining data in assigning grades
One of the key concerns while assigning grades is to be clear what aspects of a student are to
be assessed or what will be the tentative weightage to each learning outcome. For example, if
we decide that 35 percent weightage is to be given to the mid-term assessment, 40 percent to
the final-term test or assessment, and 25 percent to assignments, presentations, classroom
participation, and conduct and behavior, we have to combine all the elements by assigning the
appropriate weight to each, and then use these composite scores as a basis for grading.
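A minimal sketch of that weighting step, assuming each component score is already expressed as a percentage on a common 0-100 scale; the component names and the student's scores are hypothetical.

# Weights from the example above: 35% mid-term, 40% final term, 25% other work.
weights = {"mid_term": 0.35, "final_term": 0.40, "coursework": 0.25}

def composite_score(component_scores):
    # Weighted sum of the components; the weights must add up to 1.
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[part] * score for part, score in component_scores.items())

student = {"mid_term": 68.0, "final_term": 75.0, "coursework": 82.0}
print(composite_score(student))   # 0.35*68 + 0.40*75 + 0.25*82 = 74.3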
3. Selecting the proper frame of reference for grading
Letter grades are typically assigned on the basis of one of the following frames of reference.
a) Performance in relation to other group members (relative grading)
b) Performance in relation to specified standards (absolute grading)
c) Performance in relation to learning ability (amount of improvement)
Assigning grades on relative basis involves comparing a student’s performance with that of a
reference group, mostly class fellows. In this system, the grade is determined by the student’s
relative position or ranking in the total group. Although relative grading has a disadvantage
of a shifting frame of reference (i.e. grades depend upon the group’s ability), it is still widely
used in schools, as most of the time our system of testing is ‘norm-referenced’.
Assigning grades on an absolute basis involves comparing a student’s performance to
specified standards set by the teacher. This is what we call 'criterion-referenced' testing. If
all students show a low level of mastery consistent with the established performance
standard, all will receive low grades. Grading performance in relation to learning ability is
inconsistent with a standards-based system of evaluating and reporting student performance,
and improvement over a short time span is difficult to judge. The lack of reliability in judging
achievement in relation to ability, and in judging degree of improvement, results in grades of
low dependability. Therefore such grades are used only as a supplement to other grading
systems.
4. Determining the distribution of grades
The assigning of relative grades is essentially a matter of ranking the student in order of
overall achievement and assigning letter grades on the basis of each student’s rank in the
group. This ranking might be limited to a single classroom group or might be based on the
combined distribution of several classroom groups taking the same course. If grading on the
curve is to be done, the most sensible approach in determining the distribution of letter grades
in a school is to have the school staff set general guidelines for introductory and advanced
courses. All staff members must understand the basis for assigning grades, and this basis
must be clearly communicated to users of the grades. If the objectives of a course are clearly
mentioned and the standards for mastery appropriately set, the letter grades in an absolute
system may be defined as the degree to which the objectives have been attained, as follows.
A = Outstanding (90 to 100%)
B = Very Good (80-89%)
C = Satisfactory (70-79%)
D = Very Weak (60-69%)
F = Unsatisfactory (Less than 60%)
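Under such an absolute system, converting a composite percentage into a letter grade is a simple threshold lookup, as in this sketch based on the boundaries above.

def letter_grade(percentage):
    # Map a percentage to a letter grade using the absolute boundaries above.
    if percentage >= 90:
        return "A"   # Outstanding
    if percentage >= 80:
        return "B"   # Very Good
    if percentage >= 70:
        return "C"   # Satisfactory
    if percentage >= 60:
        return "D"   # Very Weak
    return "F"       # Unsatisfactory

print(letter_grade(74.3))   # C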
A standard score is also derived from the raw scores, using the norm information gathered
when the test was developed. Instead of indicating a student's rank compared to others,
standard scores indicate how far above or below the average (Mean) an individual score falls,
using a common scale, such as one with an average of 100. Basically standard scores express
test performance in terms of standard deviation (SD) from the Mean. Standard scores can be
used to compare individuals of different grades or age groups because all are converted into
the same numerical scale. There are various forms of standard scores such as z-score, T-
score, and stanines. Z-score expresses test performance simply and directly as the number of
SD units a raw score is above or below the Mean. A z-score is always negative when the raw
score is smaller than the Mean. Symbolically, z = (X − M)/SD. T-score refers to any set of
normally distributed standard scores that has a Mean of 50 and an SD of 10. Symbolically,
T = 50 + 10(z). Stanines are the simplest form
of normalized standard scores that illustrate the process of normalization. Stanines are single
digit scores ranging from 1 to 9. These are groups of percentile ranks with the entire group of
scores divided into nine parts, with the largest number of individuals falling in the middle
stanines, and fewer students falling at the extremes (Linn & Gronlund, 2000).
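The three standard scores can be computed directly from these definitions. In the sketch below the raw scores are hypothetical, and the stanine conversion uses the common approximation of half-SD-wide bands centred on stanine 5 rather than exact percentile groupings.

import statistics

raw_scores = [42, 55, 61, 48, 70, 52, 58, 66, 45, 63]   # hypothetical raw scores

M = statistics.mean(raw_scores)        # 56.0
SD = statistics.pstdev(raw_scores)     # population SD, about 8.79

def z_score(x):
    return (x - M) / SD                # negative when the raw score is below the Mean

def t_score(x):
    return 50 + 10 * z_score(x)        # a scale with Mean 50 and SD 10

def stanine(x):
    # Approximation: half-SD bands centred on 5, clipped to the range 1-9.
    return max(1, min(9, round(z_score(x) * 2 + 5)))

for x in (42, 56, 70):
    print(x, round(z_score(x), 2), round(t_score(x), 1), stanine(x))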
The traditional letter-grade system is the easiest and most popular way of grading and
reporting. It is generally based on grades A to F, and this rating is generally reflected as:
Grade A (Excellent), B (Very Good), C (Good), D (Satisfactory/Average), E (Unsatisfactory/
Below Average), and F (Fail). This system does not truly assess a student's progress in
different learning domains. The first shortcoming is that results under this system are difficult
to interpret. Second, a student's performance is linked with achievement, effort, work habits,
and good behaviour, and the traditional letter-grade system is unable to assess all these
domains of a student. Third, the proportion of students assigned each letter grade generally
varies from teacher to teacher. Fourth, it does not indicate patterns of strengths and
weaknesses in the students (Linn & Gronlund, 2000). In spite of these shortcomings, this
system is popular in schools, colleges and universities.
The pass-fail system is a popular way of reporting students' progress, particularly at the
elementary level. In the context of Pakistan, where the majority of parents are illiterate or
barely literate, their chief concern is simply whether their children pass or fail in school. This
system is mostly used for courses taught under a pure mastery learning approach, i.e.
criterion-referenced testing. This system also has many shortcomings. First, as students are
declared just pass or fail (successful or unsuccessful), many students do not work hard, and
hence their actual learning remains unsatisfactory or below the desired level. Second, this
two-category system provides less information to the teacher, student and parents than the
traditional letter grade (A, B, C, D) system. Third, it provides no indication of the level of
learning.
Q.5 Discuss the difference between measures of central tendency and
measures of reliability?
ANS:
Measures of central tendency
A measure of central tendency (also referred to as a measure of centre or central location) is a
summary measure that attempts to describe a whole set of data with a single value that
represents the middle or centre of its distribution. There are three main measures of central
tendency: the mode, the median and the mean. Each of these measures gives a different
indication of the typical or central value in the distribution.
What is the mode?
The mode is the most commonly occurring value in a distribution. Consider this dataset
showing the retirement age of 11 people, in whole years: 54, 54, 54, 55, 56, 57, 57,
58, 58, 60, 60. The table below shows a simple frequency distribution of the retirement age data.
Age Frequency
54 3
55 1
56 1
57 2
58 2
60 2
The most commonly occurring value is 54; therefore the mode of this distribution is 54 years.
Advantage of the mode:
The mode has an advantage over the median and the mean as it can be found for both
numerical and categorical (non-numerical) data.
Limitations of the mode:
There are some limitations to using the mode. In some distributions, the mode may not reflect
the centre of the distribution very well. When the distribution of retirement age is ordered
from lowest to highest value, it is easy to see that the centre of the distribution is 57 years, but
the mode is lower, at 54 years. 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
It is also possible for there to be more than one mode for the same distribution of data (bi-
modal or multi-modal). The presence of more than one mode can limit the ability of the
mode in describing the centre or typical value of the distribution because a single value to
describe the centre cannot be identified. In some cases, particularly where the data are
continuous, the distribution may have no mode at all (i.e. if all values are different).
In cases such as these, it may be better to consider using the median or mean, or to group the
data into appropriate intervals and find the modal class.
What is the median?
The median is the middle value in distribution when the values are arranged in ascending or
descending order. The median divides the distribution in half (there are 50% of observations
on either side of the median value). In a distribution with an odd number of observations, the
median value is the middle value. Looking at the retirement age distribution (which has 11
observations), the median is the middle value, which is 57 years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
When the distribution has an even number of observations, the median value is the mean of
the two middle values. In the following distribution, the two middle values are 56 and 57,
therefore the median equals 56.5 years:
52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
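Python's statistics.median implements exactly this rule, returning the middle value for an odd number of observations and the mean of the two middle values for an even number:

import statistics

ages_odd = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
print(statistics.median(ages_odd))      # 57: the middle of 11 values

ages_even = [52] + ages_odd             # now 12 values
print(statistics.median(ages_even))     # 56.5: the mean of 56 and 57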
Advantage of the median:
The median is less affected by outliers and skewed data than the mean, and is usually the
preferred measure of central tendency when the distribution is not symmetrical.
Limitation of the median:
The median cannot be identified for categorical nominal data, as it cannot be logically
ordered.
What is the mean?
The mean is the sum of the value of each observation in a dataset divided by the number of
observations. This is also known as the arithmetic average. Looking at the retirement age
distribution again: 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
The mean is calculated by adding together all the values
(54+54+54+55+56+57+57+58+58+60+60 = 623) and dividing by the number of observations
(11) which equals 56.6 years.
Advantage of the mean:
The mean can be used for both continuous and discrete numeric data.
Limitations of the mean:
The mean cannot be calculated for categorical data, as the values cannot be summed. As the
mean includes every value in the distribution, it is influenced by outliers and skewed
distributions.
What else do I need to know about the mean?
The population mean is indicated by the Greek symbol µ (pronounced ‘mu’). When the mean
is calculated on a distribution from a sample it is indicated by the symbol x̅ (pronounced X-
bar).
Measure of Reliability
Reliability is one of the most important elements of test quality. It has to do with the
consistency, or reproducibility, of an examinee's performance on the test. It's not possible to
calculate reliability exactly. Instead, we have to estimate reliability, and this is always an
imperfect attempt. Here, we introduce the major reliability estimators and talk about their
strengths and weaknesses.
There are six general classes of reliability estimates, each of which estimates reliability in a
different way. They are:
i) Inter-Rater or Inter-Observer Reliability:
To assess the degree to which different raters/observers give consistent estimates of the same
phenomenon. That is, if two teachers mark the same test and the results are similar, this
indicates inter-rater or inter-observer reliability.
ii) Test-Retest Reliability:
To assess the consistency of a measure from one time to another: when the same test is
administered twice and the results of both administrations are similar, this constitutes test-
retest reliability. The fact that students may remember the items, and may have matured since
the first administration, creates a problem for test-retest reliability.
iii) Parallel-Form Reliability:
To assess the consistency of the results of two tests constructed in the same way from the
same content domain. Here the test designer develops two tests of a similar kind; if the results
after administration are similar, this indicates parallel-form reliability.
iv) Internal Consistency Reliability:
To assess the consistency of results across items within a test; it is the correlation of the
individual item scores with the entire test.
v) Split-Half Reliability:
To assess the consistency of results by comparing two halves of a single test; these halves
may be the even- and odd-numbered items of the test.
vi) Kuder-Richardson Reliability:
To assess the consistency of the results using all the possible split halves of a test.
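As one concrete illustration, the sketch below estimates split-half reliability from a small matrix of item scores, using an even-odd split and the Spearman-Brown correction to project the half-test correlation to full test length; the response data are hypothetical 0/1 item scores.

import statistics

def pearson_r(x, y):
    # Pearson correlation between two lists of half-test scores.
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def split_half_reliability(item_scores):
    # Split each examinee's items into odd- and even-numbered halves,
    # correlate the half scores, then apply the Spearman-Brown correction.
    odd_half = [sum(row[0::2]) for row in item_scores]
    even_half = [sum(row[1::2]) for row in item_scores]
    r_half = pearson_r(odd_half, even_half)
    return 2 * r_half / (1 + r_half)    # Spearman-Brown prophecy formula

# Hypothetical scores: one row per examinee, 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]
print(round(split_half_reliability(responses), 2))   # about 0.93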