Course: Educational Assessment and Evaluation Code: 8602 Semester: Spring, 2021 Assignment # 2 Level: B. Ed


Q.1 What is the relationship between validity and reliability of test?

ANS:

Reliability and validity are two different standards used to gauge the usefulness of a test.

Though different, they work together. It would not be beneficial to design a test with good

reliability that did not measure what it was intended to measure. The inverse, accurately

measuring what we desire to measure with a test that is so flawed that results are not

reproducible, is impossible. Reliability is a necessary requirement for validity. This means

that you have to have good reliability in order to have validity. Reliability actually puts a cap

or limit on validity, and if a test is not reliable, it cannot be valid. Establishing good reliability

is only the first part of establishing validity. Validity has to be established separately. Having

good reliability does not mean we have good validity; it just means we are measuring

something consistently. We must then establish what it is that we are measuring consistently.

The main point here is that reliability is necessary but not sufficient for validity. In short, a

test can be reliable and yet mean nothing if it lacks validity.


Validity

Educational assessment should always have a clear purpose. Nothing will be gained from

assessment unless the assessment has some validity for the purpose. For that

reason, validity is the most important single attribute of a good test.

The validity of an assessment tool is the extent to which it measures what it was designed to

measure, without contamination from other characteristics. For example, a test of reading

comprehension should not require mathematical ability.

There are several different types of validity:

• Face validity: do the assessment items appear to be appropriate?

• Content validity: does the assessment content cover what you want to assess?

• Criterion-related validity: how well does the test measure what you want it to?

• Construct validity: are you measuring what you think you're measuring?

It is fairly obvious that a valid assessment should have a good coverage of the criteria

(concepts, skills and knowledge) relevant to the purpose of the examination. The important

notion here is the purpose. For example:

• The PROBE test is a form of reading running record which measures reading behaviours

and includes some comprehension questions. It allows teachers to see the reading strategies

that students are using, and potential problems with decoding. The test would not,

however, provide in-depth information about a student’s comprehension strategies across a

range of texts.

• STAR (Supplementary Test of Achievement in Reading) is not designed as a

comprehensive test of reading ability. It focuses on assessing students’ vocabulary

understanding, basic sentence comprehension and paragraph comprehension. It is most

appropriately used for students who don’t score well on more general testing (such as PAT

or e-asTTle), as it provides a more fine-grained analysis of basic comprehension strategies.


There is an important relationship between reliability and validity. An assessment that has

very low reliability will also have low validity; clearly a measurement with very poor

accuracy or consistency is unlikely to be fit for its purpose. But, by the same token, the things

required to achieve a very high degree of reliability can impact negatively on validity. For

example, consistency in assessment conditions leads to greater reliability because it reduces

'noise' (variability) in the results. On the other hand, one of the things that can improve

validity is flexibility in assessment tasks and conditions. Such flexibility allows assessment to

be set appropriate to the learning context and to be made relevant to particular groups of

students. Insisting on highly consistent assessment conditions to attain high reliability will

result in little flexibility, and might therefore limit validity.

The Overall Teacher Judgment balances these notions, weighing the reliability of a formal

assessment tool against the flexibility to use other evidence to make a judgment.

Reliability

The reliability of an assessment tool is the extent to which it consistently and accurately

measures learning.

When the results of an assessment are reliable, we can be confident that repeated or

equivalent assessments will provide consistent results. This puts us in a better position to

make generalized statements about a student’s level of achievement, which is especially

important when we are using the results of an assessment to make decisions about teaching

and learning, or when we are reporting back to students and their parents or caregivers. No

results, however, can be completely reliable. There is always some random variation that may

affect the assessment, so educators should always be prepared to question results.

Factors which can affect reliability:

• The length of the assessment – a longer assessment generally produces more reliable

results.
• The suitability of the questions or tasks for the students being assessed.

• The phrasing and terminology of the questions.

• The consistency in test administration – for example, the length of time given for the

assessment, instructions given to students before the test.

• The design of the marking schedule and moderation of marking procedures.

• The readiness of students for the assessment – for example, a hot afternoon or straight after

physical activity might not be the best time for students to be assessed.

Q.2 Define a scoring criteria for essay type test items for 8th grade?

ANS:

A rubric or scoring criteria is developed to evaluate/score an essay type item. A rubric is a

scoring guide for subjective assessments. It is a set of criteria and standards linked to learning

objectives that are used to assess a student's performance on papers, projects, essays, and

other assignments. Rubrics allow for standardized evaluation according to specified criteria,

making grading simpler and more transparent. A rubric may vary from simple checklists to

elaborate combinations of checklist and rating scales. How elaborate your rubric is depends

on what you are trying to measure. If your essay item is a restricted-response item simply

assessing mastery of factual content, a fairly simple listing of essential points will be

sufficient. An example of a rubric for a restricted-response item is given below.

Test Item: Name and describe five of the most important factors of unemployment in

Pakistan.

Rubric/Scoring Criteria:

(i) One point for each factor named, to a maximum of 5 points.

(ii) One point for each appropriate description of a named factor, to a maximum of 5 points.

(iii) No penalty for spelling, punctuation, or grammatical errors.

(iv) No extra credit for more than five factors named or described.

(v) Extraneous information will be ignored.
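As an illustration only, the rubric above could be applied mechanically once a marker has counted the factors correctly named and appropriately described (the function name is hypothetical):

```python
def score_restricted_response(factors_named, factors_described):
    """Score the unemployment-factors item using the rubric above.

    One point per factor named and one per appropriate description,
    each capped at 5 points; spelling and grammar are not penalised,
    and extra factors earn no credit. The counts are assumed to have
    been judged by the marker already.
    """
    name_points = min(factors_named, 5)              # rules (i) and (iv)
    description_points = min(factors_described, 5)   # rules (ii) and (iv)
    return name_points + description_points          # maximum score: 10

print(score_restricted_response(7, 4))  # 9: named points capped at 5
```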

However, when essay items measure higher-order thinking skills of the cognitive domain,

more complex rubrics are needed. An example of a rubric for a language writing test is

given below.

Scoring Criteria (Rubrics) for Essay Type Item for 8th grade.
Advantages of Essay Type Items

The main advantages of essay type tests are as follows:

(i) They can measures complex learning outcomes which cannot be measured by other

means.

(ii) They emphasize integration and application of thinking and problem solving skills.

(iii) They can be easily constructed.

(iv) They give examinees freedom to respond within broad limits.

(v) The students cannot guess the answer because they have to supply it rather than select it.

(vi) Practically, it is more economical to use essay type tests when the number of students is small.

(vii) They require less time for typing, duplicating or printing; they can even be written on the

blackboard if the number of students is not large.

(viii) They can measure divergent thinking.

(ix) They can be used as a device for measuring and improving language and expression skill

of examinees.

(x) They are more helpful in evaluating the quality of the teaching process.

(xi) Studies have shown that when students know that essay type questions will be asked,

they focus on learning broad concepts, articulating relationships, and comparing and

contrasting.

(xii) They set higher standards of professional ethics for teachers because they demand more

of the teacher's time for assessing and scoring.

Limitations of Essay Type Items

The essay type tests have the following serious limitations as a measuring instrument:

(i) A major problem is the lack of consistency in judgments even among competent

examiners.
(ii) They are subject to halo effects. If the examiner is measuring one characteristic, he can be

influenced in scoring by another characteristic. For example, a well-behaved student may

score more marks partly on account of his good behaviour.

(iii) They have a question-to-question carry-over effect. If the examinee has answered satisfactorily

at the beginning of the paper, he is likely to score more than one who did not do well at the

beginning but did well later on.

(iv) They have an examinee-to-examinee carry-over effect. A particular examinee gets marks not only

on the basis of what he has written but also on the basis of whether the previous answer

book examined by the examiner was good or bad.

(v) They have limited content validity because only a small sample of questions can be asked in

an essay type test.

(vi) They are difficult to score objectively because the examinee has wide freedom of

expression and he writes long answers.

(vii) They are time consuming both for the examiner and the examinee.

(viii) They generally emphasize the lengthy enumeration of memorized facts.

Suggestions for Writing Essay Type Items

I. Ask questions or establish tasks that will require the student to demonstrate command of

essential knowledge. This means that students should not be asked merely to reproduce

material heard in a lecture or read in a textbook. To "demonstrate command" requires that the

question be somewhat novel or new. The substance of the question should be essential

knowledge rather than trivia that might be a good board game question.

II. Ask questions that are determinate, in the sense that experts (colleagues in the field) could

agree that one answer is better than another. Questions that contain phrases such as "What do

you think..." or "What is your opinion about..." are indeterminate. They can be used as a
medium for assessing skill in written expression, but because they have no clearly right or

wrong answer, they are useless for measuring other aspects of achievement.

III. Define the examinee's task as completely and specifically as possible without interfering

with the measurement process itself. It is possible to word an essay item so precisely that

there is one and only one very brief answer to it. The imposition of such rigid bounds on the

response is more limiting than it is helpful. Examinees do need guidance, however, to judge

how extensive their response must be to be considered complete and accurate.

IV. Generally give preference to specific questions that can be answered briefly. The more

questions used, the better the test constructor can sample the domain of knowledge covered

by the test. And the more responses available for scoring, the more accurate the total test

scores are likely to be. In addition, brief responses can be scored more quickly and more

accurately than long, extended responses, even when there are fewer of the latter type.

V. Use enough items to sample the relevant content domain adequately, but not so many that

students do not have sufficient time to plan, develop, and review their responses. Some

instructors use essay tests rather than one of the objective types because they want to

encourage and provide practice in written expression. However, when time pressures become

great, the essay test is one of the most unrealistic and negative writing experiences to which

students can be exposed. Often there is no time for editing, for rereading, or for checking

spelling. Planning time is short changed so that writing time will not be. There are few, if

any, real writing tasks that require such conditions. And there are few writing experiences

that discourage the use of good writing habits as much as essay testing does.

VI. Avoid giving examinees a choice among optional questions unless special circumstances

make such options necessary. The use of optional items destroys the strict comparability

between student scores because not all students actually take the same test. Student A may

have answered items 1-3 and Student B may have answered 3-5. In these circumstances the
variability of scores is likely to be quite small because students were able to respond to items

they knew more about and ignore items with which they were unfamiliar. This reduced

variability contributes to reduced test score reliability. That is, we are less able to identify

individual differences in achievement when the test scores form a very homogeneous

distribution. In sum, optional items restrict score comparability between students and

contribute to low score reliability due to reduced test score variability.

VII. Test the question by writing an ideal answer to it. An ideal response is needed eventually

to score the responses. If it is prepared early, it permits a check on the wording of the item,

the level of completeness required for an ideal response, and the amount of time required to

furnish a suitable response. It even allows the item writer to determine if there is any

"correct" response to the question.

VIII. Specify the time allotment for each item and/or specify the maximum number of points

to be awarded for the "best" answer to the question. Both pieces of information provide

guidance to the examinee about the depth of response expected by the item writer. They also

represent legitimate pieces of information a student can use to decide which of several items

should be omitted when time begins to run out. Often the number of points attached to the

item reflects the number of essential parts to the ideal response. Of course if a definite

number of essential parts can be determined, that number should be indicated as part of the

question.

IX. Divide a question into separate components when there are obvious multiple questions or

pieces to the intended responses. The use of parts helps examinees organizationally and,

hence, makes the process more efficient. It also makes the grading process easier because it

encourages organization in the responses. Finally, if multiple questions are not identified,

some examinees may inadvertently omit some parts, especially when time constraints are

great.
Q.3 Write a note on Mean, Median and Mode. Also discuss their

importance in interpreting test scores?

ANS:

The terms mean, median, mode, and range describe properties of statistical distributions. In

statistics, a distribution is the set of all possible values for terms that represent defined events.

The value of a term, when expressed as a variable, is called a random variable.

There are two major types of statistical distributions. The first type contains discrete random

variables. This means that every term has a precise, isolated numerical value. The second

major type of distribution contains a continuous random variable, a random variable that can

take infinitely many values within an unbroken interval or span. Such a distribution is

described by a probability density function.

IT professionals need to understand the definition of mean, median, mode and range to plan

capacity and balance load, manage systems, perform maintenance and troubleshoot

issues. Furthermore, understanding of statistical terms is important in the growing

field of data science.

Understanding the definition of mean, median, mode and range is important for IT

professionals in data center management. Many relevant tasks require the administrator to

calculate mean, median, mode or range, or often some combination, to show a statistically

significant quantity, trend or deviation from the norm. Finding the mean, median, mode and

range is only the start. The administrator then needs to apply this information to investigate

root causes of a problem, accurately forecast future needs or set acceptable working

parameters for IT systems.

When working with a large data set, it can be useful to represent the entire data set with a

single value that describes the "middle" or "average" value of the entire set. In statistics, that
single value is called the central tendency and mean, median and mode are all ways to

describe it. To find the mean, add up the values in the data set and then divide by the number

of values that you added. To find the median, list the values of the data set in numerical order

and identify which value appears in the middle of the list. To find the mode, identify which

value in the data set occurs most often. Range, which is the difference between the largest

and smallest value in the data set, describes how well the central tendency represents the data.

If the range is large, the central tendency is not as representative of the data as it would be if

the range was small.
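The steps just described can be sketched with Python's standard statistics module; the data set below is invented purely for illustration.

```python
import statistics

scores = [4, 8, 6, 5, 3, 4]           # a small illustrative data set

print(statistics.mean(scores))        # (4+8+6+5+3+4)/6 = 5
print(statistics.median(scores))      # middle of 3,4,4,5,6,8 -> 4.5
print(statistics.mode(scores))        # 4 occurs most often
print(max(scores) - min(scores))      # range: 8 - 3 = 5
```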

Mean

The most common expression for the mean of a statistical distribution with a discrete random

variable is the mathematical average of all the terms. To calculate it, add up the values of all

the terms and then divide by the number of terms. The mean of a statistical distribution with a

continuous random variable, also called the expected value, is obtained by integrating the

product of the variable with its probability as defined by the distribution. The expected value

is denoted by the lowercase Greek letter mu (µ).

Median

The median of a distribution with a discrete random variable depends on whether the number

of terms in the distribution is even or odd. If the number of terms is odd, then the median is

the value of the term in the middle. This is the value such that the number of terms having

values greater than or equal to it is the same as the number of terms having values less than or

equal to it. If the number of terms is even, then the median is the average of the two terms in

the middle, such that the number of terms having values greater than or equal to it is the same

as the number of terms having values less than or equal to it.

The median of a distribution with a continuous random variable is the value m such that the

probability is at least 1/2 (50%) that a randomly chosen point on the function will be less than
or equal to m, and the probability is at least 1/2 that a randomly chosen point on the function

will be greater than or equal to m.

Mode

The mode of a distribution with a discrete random variable is the value of the term that occurs

the most often. It is not uncommon for a distribution with a discrete random variable to have

more than one mode, especially if there are not many terms. This happens when two or more

terms occur with equal frequency, and more often than any of the others.

A distribution with two modes is called bimodal. A distribution with three modes is called

trimodal. The mode of a distribution with a continuous random variable is the maximum

value of the function. As with discrete distributions, there may be more than one mode.

Importance in interpreting test scores

To calculate the mean, add together all of the numbers in a set and then divide the sum by the

total count of numbers. For example, in a data center rack, five servers consume 100 watts,

98 watts, 105 watts, 90 watts and 102 watts of power, respectively. The mean power use of

that rack is calculated as (100 + 98 + 105 + 90 + 102 W)/5 servers = a calculated mean of 99

W per server. Intelligent power distribution units report the mean power utilization of the

rack to systems management software.

In the data center, means and medians are often tracked over time to spot trends, which

inform capacity planning or power cost predictions. The statistical median is the middle

number in a sequence of numbers. To find the median, organize each number in order by

size; the number in the middle is the median. For the five servers in the rack, arrange the

power consumption figures from lowest to highest: 90 W, 98 W, 100 W, 102 W and 105 W.

The median power consumption of the rack is 100 W. If there is an even set of numbers,

average the two middle numbers. For example, if the rack had a sixth server that used 110 W,
the new number set would be 90 W, 98 W, 100 W, 102 W, 105 W and 110 W. Find the

median by averaging the two middle numbers: (100 + 102)/2 = 101 W.

The mode is the number that occurs most often within a set of numbers. For the server power

consumption examples above, there is no mode because each element is different. But

suppose the administrator measured the power consumption of an entire network operations

center (NOC) and the set of numbers is 90 W, 104 W, 98 W, 98 W, 105 W, 92 W, 102 W,

100 W, 110 W, 98 W, 210 W and 115 W. The mode is 98 W since that power consumption

measurement occurs most often amongst the 12 servers. Mode helps identify the most

common or frequent occurrence of a characteristic. It is possible to have two modes

(bimodal), three modes (trimodal) or more modes within larger sets of numbers.
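The worked examples above can be reproduced with Python's statistics module (all figures are taken from the text):

```python
import statistics

rack = [100, 98, 105, 90, 102]   # watts drawn by the five servers
print(statistics.mean(rack))     # 99 W per server
print(statistics.median(rack))   # 100 W: middle of the sorted list

rack.append(110)                 # add the sixth server
print(statistics.median(rack))   # 101.0 W: average of 100 and 102

noc = [90, 104, 98, 98, 105, 92, 102, 100, 110, 98, 210, 115]
print(statistics.mode(noc))      # 98 W occurs most often
```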

Q.4 Write the procedure of assigning letter grades to test score?

ANS:

Assigning letter grades

The letter grade system is the most popular grading system in the world, including in Pakistan.

Most teachers face problems while assigning grades. There are four core problems or issues

in this regard: 1) what should be included in a letter grade, 2) how should achievement data

be combined in assigning letter grades, 3) what frame of reference should be used in

grading, and 4) how should the distribution of letter grades be determined?

1. Determining what to include in a grade

Letter grades are likely to be most meaningful and useful when they represent achievement

only. If they are communicated with other factors or aspects such as effort of work

completed, personal conduct, and so on, their interpretation will become hopelessly confused.

For example, a letter grade C may represent average achievement with extraordinary effort
and excellent conduct and behavior or vice versa. If letter grades are to be valid indicators of

achievement, they must be based on valid measures of achievement. This involves defining

objectives as intended learning outcomes and developing or selecting tests and assessments

which can measure these learning outcomes.

2. Combining data in assigning grades

One of the key concerns while assigning grades is to be clear what aspects of a student are to

be assessed or what will be the tentative weightage to each learning outcome. For example, if

we decide that 35 percent weightage is to be given to the mid-term assessment, 40 percent to the

final term test or assessment, and 25 percent to assignments, presentations, classroom participation, and

conduct and behavior; we have to combine all elements by assigning appropriate weights to

each element, and then use these composite scores as a basis for grading.
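This combination step can be sketched as follows, assuming the example weights above and percentage scores (0-100) for each component; the function name and score keys are illustrative only.

```python
# Hypothetical weights from the example: 35% mid-term, 40% final term,
# 25% assignments/presentations/participation/conduct.
WEIGHTS = {"midterm": 0.35, "final": 0.40, "other": 0.25}

def composite_score(scores):
    """Combine component percentage scores into one weighted composite."""
    return sum(WEIGHTS[part] * scores[part] for part in WEIGHTS)

# A student with 70% on the mid-term, 80% on the final, 90% elsewhere:
print(round(composite_score({"midterm": 70, "final": 80, "other": 90}), 2))  # 79.0
```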

3. Selecting the proper frame of reference for grading

Letter grades are typically assigned on the basis of one of the following frames of reference.

a) Performance in relation to other group members (relative grading)

b) Performance in relation to specified standards (absolute grading)

c) Performance in relation to learning ability (amount of improvement)

Assigning grades on relative basis involves comparing a student’s performance with that of a

reference group, mostly class fellows. In this system, the grade is determined by the student’s

relative position or ranking in the total group. Although relative grading has a disadvantage

of a shifting frame of reference (i.e. grades depend upon the group’s ability), it is still widely

used in schools, as most of the time our system of testing is ‘norm-referenced’.

Assigning grades on an absolute basis involves comparing a student’s performance to

specified standards set by the teacher. This is what we call ‘criterion-referenced’ testing. If

all students show a low level of mastery consistent with the established performance

standard, all will receive low grades. Student performance in relation to learning

ability is inconsistent with a standards-based system of evaluating and reporting student

performance, and improvement over a short time span is difficult to measure. Thus the lack of reliability in

judging achievement in relation to ability and in judging degree of improvement will result in

grades of low dependability. Therefore such grades are used as supplementary to other

grading systems.

4. Determining the distribution of grades

The assigning of relative grades is essentially a matter of ranking the student in order of

overall achievement and assigning letter grades on the basis of each student’s rank in the

group. This ranking might be limited to a single classroom group or might be based on the

combined distribution of several classroom groups taking the same course. If grading on the

curve is to be done, the most sensible approach in determining the distribution of letter grades

in a school is to have the school staff set general guidelines for introductory and advanced

courses. All staff members must understand the basis for assigning grades, and this basis

must be clearly communicated to users of the grades. If the objectives of a course are clearly

mentioned and the standards for mastery appropriately set, the letter grades in an absolute

system may be defined as the degree to which the objectives have been attained, as follows:

A = Outstanding (90-100%)

B = Very Good (80-89%)

C = Satisfactory (70-79%)

D = Very Weak (60-69%)

F = Unsatisfactory (below 60%)
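The band boundaries listed above can be expressed as a small function (a sketch only; the function name is illustrative):

```python
def letter_grade(percentage):
    """Map a percentage score to a letter grade using the bands above."""
    if percentage >= 90:
        return "A"   # Outstanding
    elif percentage >= 80:
        return "B"   # Very Good
    elif percentage >= 70:
        return "C"   # Satisfactory
    elif percentage >= 60:
        return "D"   # Very Weak
    return "F"       # Unsatisfactory

print(letter_grade(84))  # B
```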

A standard score is also derived from the raw scores using the normal information gathered

when the test was developed. Instead of indicating a student’s rank compared to others,

standard scores indicate how far above or below the average (Mean) an individual score falls,

using a common scale, such as one with an average of 100. Basically standard scores express
test performance in terms of standard deviation (SD) from the Mean. Standard scores can be

used to compare individuals of different grades or age groups because all are converted into

the same numerical scale. There are various forms of standard scores, such as the z-score, the

T-score, and stanines. The z-score expresses test performance simply and directly as the number of

SD units a raw score is above or below the Mean. A z-score is negative whenever the raw

score is smaller than the Mean. Symbolically, z-score = (X − M) / SD.

The T-score refers to any set of normally distributed standard scores that has a Mean of 50 and an

SD of 10. Symbolically, T-score = 50 + 10(z). Stanines are the simplest form

of normalized standard scores that illustrate the process of normalization. Stanines are single

digit scores ranging from 1 to 9. These are groups of percentile ranks with the entire group of

scores divided into nine parts, with the largest number of individuals falling in the middle

stanines, and fewer students falling at the extremes (Linn & Gronlund, 2000).
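A minimal sketch of these conversions, assuming the definitions above. Note that the stanine formula used here (2z + 5, rounded and clipped to the range 1-9) is a common approximation, not the exact percentile-based grouping Linn and Gronlund describe.

```python
def z_score(raw, mean, sd):
    """Number of SD units a raw score lies above or below the Mean."""
    return (raw - mean) / sd

def t_score(raw, mean, sd):
    """Normally distributed standard score with Mean 50 and SD 10."""
    return 50 + 10 * z_score(raw, mean, sd)

def stanine(raw, mean, sd):
    """Single-digit score from 1 to 9 (approximation: 2z + 5)."""
    return max(1, min(9, round(2 * z_score(raw, mean, sd) + 5)))

# A raw score of 65 on a test with Mean 50 and SD 10:
print(z_score(65, 50, 10))   # 1.5
print(t_score(65, 50, 10))   # 65.0
print(stanine(65, 50, 10))   # 8
```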

The traditional letter-grade system is the easiest and most popular way of grading and reporting. It

is generally based on grades A to F. This rating is generally reflected as: Grade A (Excellent),

B (Very Good), C (Good), D (Satisfactory/Average), E (Unsatisfactory/Below Average), and

F (Fail). This system does not truly assess a student's progress in different learning domains.

The first shortcoming is that it is difficult to interpret the results under this system. Second, a

student’s performance is linked with achievement, effort, work habits, and good behaviour;

traditional letter-grade system is unable to assess all these domains of a student. Third, the

proportion of students assigned each letter grade generally varies from teacher to teacher.

Fourth, it does not indicate patterns of strengths and weaknesses in the students (Linn &

Gronlund, 2000). In spite of these shortcomings, this system is popular in schools, colleges

and universities.

The pass-fail system is another popular way of reporting students’ progress, particularly at the

elementary level. In the context of Pakistan, as the majority of parents are illiterate or barely

literate, they are mainly concerned with whether their children pass or fail in school. This system is

mostly used for courses taught under a pure mastery learning approach, i.e. criterion-referenced

testing. This system also has many shortcomings. First, as students are declared

just pass or fail (successful or unsuccessful) so many students do not work hard and hence

their actual learning remains unsatisfactory or below desired level. Second, this two-category

system provides less information to the teacher, student and parents than the traditional letter

grade (A, B, C, D) system. Third, it provides no indication of the level of learning.

Q.5 Discuss the difference between measures of central tendency and

measures of reliability?

ANS:

Measures of central tendency

A measure of central tendency (also referred to as measures of center or central location) is a

summary measure that attempts to describe a whole set of data with a single value that

represents the middle or centre of its distribution. There are three main measures of central

tendency: the mode, the median and the mean. Each of these measures describes a different

indication of the typical or central value in the distribution.

What is the mode?

The mode is the most commonly occurring value in a distribution. Consider this dataset

showing the retirement age of 11 people, in whole years: 54, 54, 54, 55, 56, 57, 57,

58, 58, 60, 60. This table shows a simple frequency distribution of the retirement age data.

Age Frequency
54 3

55 1

56 1

57 2

58 2

60 2

The most commonly occurring value is 54; therefore the mode of this distribution is 54 years.

Advantage of the mode:

The mode has an advantage over the median and the mean as it can be found for both

numerical and categorical (non-numerical) data.

Limitations of the mode:

There are some limitations to using the mode. In some distributions, the mode may not reflect

the centre of the distribution very well. When the distribution of retirement age is ordered

from lowest to highest value, it is easy to see that the centre of the distribution is 57 years, but

the mode is lower, at 54 years. 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

It is also possible for there to be more than one mode for the same distribution of data (bi-modal

or multi-modal). The presence of more than one mode can limit the ability of the

mode in describing the centre or typical value of the distribution because a single value to

describe the centre cannot be identified. In some cases, particularly where the data are

continuous, the distribution may have no mode at all (i.e. if all values are different).

In cases such as these, it may be better to consider using the median or mean, or group the

data into appropriate intervals, and find the modal class.

What is the median?


The median is the middle value in a distribution when the values are arranged in ascending or

descending order. The median divides the distribution in half (there are 50% of observations

on either side of the median value). In a distribution with an odd number of observations, the

median value is the middle value. Looking at the retirement age distribution (which has 11

observations), the median is the middle value, which is 57 years:

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

When the distribution has an even number of observations, the median value is the mean of

the two middle values. In the following distribution, the two middle values are 56 and 57,

therefore the median equals 56.5 years:

52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
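Both cases can be verified with statistics.median, which sorts the data and returns the middle value (or the mean of the two middle values when the count is even); a minimal sketch:

```python
from statistics import median

odd_count = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]       # 11 values
even_count = [52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]  # 12 values

print(median(odd_count))   # middle of 11 ordered values: 57
print(median(even_count))  # mean of the two middle values (56 and 57): 56.5
```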

Advantage of the median:

The median is less affected by outliers and skewed data than the mean, and is usually the

preferred measure of central tendency when the distribution is not symmetrical.

Limitation of the median:

The median cannot be identified for categorical (nominal) data, as such data cannot be logically

ordered.

What is the mean?

The mean is the sum of the value of each observation in a dataset divided by the number of

observations. This is also known as the arithmetic average. Looking at the retirement age

distribution again: 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

The mean is calculated by adding together all the values

(54+54+54+55+56+57+57+58+58+60+60 = 623) and dividing by the number of observations

(11), which gives 56.6 years.
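The same calculation, both by hand and with statistics.mean, as a quick sketch:

```python
from statistics import mean

ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

total = sum(ages)               # 623
count = len(ages)               # 11
print(round(total / count, 1))  # 56.6
print(round(mean(ages), 1))     # 56.6, via the statistics module
```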

Advantage of the mean:


The mean can be used for both continuous and discrete numeric data.

Limitations of the mean:

The mean cannot be calculated for categorical data, as the values cannot be summed.

Because the mean includes every value in the distribution, it is influenced by outliers and

skewed distributions.

What else do I need to know about the mean?

The population mean is indicated by the Greek symbol µ (pronounced ‘mu’). When the mean

is calculated on a distribution from a sample it is indicated by the symbol x̅ (pronounced X-

bar).

Measure of Reliability

Reliability is one of the most important elements of test quality. It has to do with the

consistency, or reproducibility, of an examinee's performance on the test. It's not possible to

calculate reliability exactly. Instead, we have to estimate reliability, and this is always an

imperfect attempt. Here, we introduce the major reliability estimators and talk about their

strengths and weaknesses.

There are six general classes of reliability estimates, each of which estimates reliability in a

different way. They are:

i) Inter-Rater or Inter-Observer Reliability

To assess the degree to which different raters/observers give consistent estimates of the same

phenomenon. That is, if two teachers mark the same test and their results are similar, this

indicates inter-rater or inter-observer reliability.
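One simple way to quantify this is the percent agreement between the two raters. The marks below are hypothetical, and in practice a chance-corrected index such as Cohen's kappa is often preferred; this is only a minimal sketch:

```python
# Hypothetical pass/fail marks given by two teachers to the same ten scripts.
teacher_a = ["P", "P", "F", "P", "F", "P", "P", "F", "P", "P"]
teacher_b = ["P", "P", "F", "P", "P", "P", "P", "F", "P", "F"]

# Percent agreement: the fraction of scripts both raters marked the same.
agreed = sum(a == b for a, b in zip(teacher_a, teacher_b))
print(agreed / len(teacher_a))  # 0.8
```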

ii) Test-Retest Reliability:

To assess the consistency of a measure from one time to another: when the same test is

administered twice and the results of both administrations are similar, this constitutes test-

retest reliability. Students may remember items, or may mature, between administrations,

which creates a problem for test-retest reliability.


iii) Parallel-Form Reliability:

To assess the consistency of the results of two tests constructed in the same way from the

same content domain. Here the test designer develops two tests of a similar kind; if, after

administration, the results are similar, this indicates parallel-form reliability.

iv) Internal Consistency Reliability:

To assess the consistency of results across items within a test; it is the correlation of each

individual item's score with the score on the entire test.

v) Split half Reliability:

To assess the consistency of results by comparing two halves of a single test; these halves

may be the even- and odd-numbered items of the test.

vi) Kuder-Richardson Reliability:

To assess the consistency of the results using all the possible split halves of a test. Let's

discuss each of these in turn.
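For dichotomously scored items, KR-20 can be written as (k / (k - 1)) * (1 - sum(p*q) / var), where p is the proportion answering an item correctly, q = 1 - p, k is the number of items, and var is the variance of the total scores. The data below are hypothetical, and conventions differ on sample versus population variance (population variance is used here); a minimal sketch:

```python
from statistics import pvariance

# Hypothetical item scores (1 = correct, 0 = wrong) for five students
# on a four-item test.
scores = [
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]

k = len(scores[0])                     # number of items
n = len(scores)                        # number of students
totals = [sum(row) for row in scores]  # each student's total score

# Sum of p*q over items: p = proportion correct, q = 1 - p.
pq_sum = sum(
    (p := sum(row[i] for row in scores) / n) * (1 - p)
    for i in range(k)
)

kr20 = (k / (k - 1)) * (1 - pq_sum / pvariance(totals))
print(round(kr20, 2))  # 0.44
```

A low value like this on a short, hypothetical test is expected; reliability generally increases with test length.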
