Course: Educational Assessment and Evaluation Code: 8602 Semester: Spring, 2021 Assignment # 2 Level: B. Ed
Q.1 Discuss the relationship between reliability and validity of a test.
ANS:
Reliability and validity are two different standards used to gauge the usefulness of a test.
Though different, they work together. It would not be beneficial to design a test with good
reliability that did not measure what it was intended to measure. The inverse, accurately
measuring what we desire to measure with a test that is so flawed that the results are not
consistent, would be equally unhelpful. What this means is
that you have to have good reliability in order to have validity. Reliability actually puts a cap
or limit on validity, and if a test is not reliable, it cannot be valid. Establishing good reliability
is only the first part of establishing validity. Validity has to be established separately. Having
good reliability does not mean we have good validity; it just means we are measuring
something consistently. Now we must establish what it is that we are measuring consistently.
The main point here is that reliability is necessary but not sufficient for validity. In short, we can
say that a test can be reliable without being valid, but it cannot be valid without being reliable.
Educational assessment should always have a clear purpose. Nothing will be gained from
assessment unless the assessment has some validity for the purpose. For that reason, validity
is the most important single attribute of a good test.
The validity of an assessment tool is the extent to which it measures what it was designed to
measure, without contamination from other characteristics. For example, a test of reading
comprehension should not require mathematical ability. There are several different types of validity:
• Content validity: does the assessment content cover what you want to assess?
• Criterion-related validity: how well does the test measure what you want it to?
• Construct validity: are you measuring what you think you're measuring?
It is fairly obvious that a valid assessment should have a good coverage of the criteria
(concepts, skills and knowledge) relevant to the purpose of the examination. The important
notion here is the purpose. For example:
• The PROBE test is a form of reading running record which measures reading behaviours
and includes some comprehension questions. It allows teachers to see the reading strategies
that students are using, and potential problems with decoding. The test would not, however,
provide in-depth information about a student's comprehension across a range of texts.
• Some diagnostic reading assessments, by contrast, are most appropriately used for students
who don’t score well on more general testing (such as PAT), to provide more detailed
information about their particular difficulties.
An assessment with very low reliability will also have low validity; clearly a measurement with very poor
accuracy or consistency is unlikely to be fit for its purpose. But, by the same token, the things
required to achieve a very high degree of reliability can impact negatively on validity. For
example, demanding highly consistent assessment conditions leads to greater reliability because it reduces
'noise' (variability) in the results. On the other hand, one of the things that can improve
validity is flexibility in assessment tasks and conditions. Such flexibility allows assessment to
be set appropriate to the learning context and to be made relevant to particular groups of
students. Insisting on highly consistent assessment conditions to attain high reliability will
result in little flexibility, and might therefore limit validity.
The Overall Teacher Judgment balances these notions: it combines the reliability of a formal
assessment tool with the flexibility to use other evidence to make a judgment.
Reliability
The reliability of an assessment tool is the extent to which it consistently and accurately
measures learning.
When the results of an assessment are reliable, we can be confident that repeated or
equivalent assessments will provide consistent results. This puts us in a better position to
make generalised statements about a student's level of achievement, which is especially
important when we are using the results of an assessment to make decisions about teaching
and learning, or when we are reporting back to students and their parents or caregivers. No
results, however, can be completely reliable. There is always some random variation that may
affect the assessment, so teachers should always be prepared to question results. Factors
which can affect reliability include:
• The length of the assessment – a longer assessment generally produces more reliable
results.
• The suitability of the questions or tasks for the students being assessed.
• The consistency in test administration – for example, the length of time given for the
assessment, or the instructions given to students before the test.
• The readiness of students for the assessment – for example, a hot afternoon or straight after
physical activity might not be the best time for students to be assessed.
Q.2 Define scoring criteria for essay type test items for 8th grade?
ANS:
A rubric is a scoring guide for subjective assessments. It is a set of criteria and standards linked to learning
objectives that are used to assess a student's performance on papers, projects, essays, and
other assignments. Rubrics allow for standardized evaluation according to specified criteria,
making grading simpler and more transparent. A rubric may vary from simple checklists to
elaborate combinations of checklist and rating scales. How elaborate your rubric is depends
on what you are trying to measure. If your essay item is a restricted-response item simply
assessing mastery of factual content, a fairly simple listing of essential points would be
sufficient, as in the example and the short scoring sketch below.
Test Item: Name and describe five of the most important factors of unemployment in
Pakistan.
Rubric/Scoring Criteria:
(i) One point for each of the five factors correctly named, up to a maximum of five points.
(ii) One point for each factor correctly described, up to a maximum of five points.
(iii) A complete and correct answer earns a maximum of ten points.
(iv) No extra credit for more than five factors named or described.
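The same idea can be expressed as a short scoring routine. The following is a minimal sketch in Python; the function name and the point caps are illustrative only and mirror the point-per-factor scheme above rather than any official marking key.

```python
# Minimal sketch: scoring a restricted-response essay item with a point-based rubric.
# The criteria and point caps below are illustrative, not an official marking scheme.

def score_response(factors_named, factors_described, max_factors=5):
    """Award one point per factor named and one per factor described,
    capped at five each, with no extra credit beyond five."""
    named_points = min(factors_named, max_factors)
    described_points = min(factors_described, max_factors)
    return named_points + described_points  # maximum 10 points

# Example: a student names 6 factors but describes only 4 of them.
print(score_response(factors_named=6, factors_described=4))  # -> 9 (5 + 4)
```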
However, when essay items are measuring higher order thinking skills of the cognitive domain,
more complex rubrics are required. An example of a rubric for a language writing test is
given below.
Scoring Criteria (Rubrics) for Essay Type Item for 8th grade.
Advantages of Essay Type Items
(i) They can measure complex learning outcomes which cannot be measured by other
means.
(ii) They emphasize integration and application of thinking and problem solving skills.
(v) The students cannot guess the answer because they have to supply it rather than select it.
(vi) Practically, it is more economical to use essay type tests if the number of students is small.
(vii) They require less time for typing, duplicating or printing. They can be written on the
blackboard as well.
(ix) They can be used as a device for measuring and improving language and expression skill
of examinees.
(x) They are more helpful in evaluating the quality of the teaching process.
(xi) Studies have shown that when students know that essay type questions will be
asked, they focus on learning broad concepts and articulating relationships, contrasting and
comparing.
(xii) They set better standards of professional ethics for teachers because they demand more
careful reading and professional judgment in marking.
The essay type tests have the following serious limitations as a measuring instrument:
(i) A major problem is the lack of consistency in judgments even among competent
examiners.
(ii) They have halo effects. If the examiner is measuring one characteristic, he can be
influenced in scoring by another characteristic. For example, a well behaved student may
be awarded higher marks than the quality of the answer deserves.
(iii) They have a question-to-question carry-over effect. If the examinee has answered satisfactorily
in the beginning of the question or questions, he is likely to score more than the one who did
not answer well at the start.
(iv) They have an examinee-to-examinee carry-over effect. A particular examinee gets marks not only
on the basis of what he has written but also on the basis of whether the previous examinee,
whose answer book was marked just before, was good or bad.
(v) They have limited content validity because only a small sample of questions can be asked in a single test.
(vi) They are difficult to score objectively because the examinee has wide freedom of
expression in writing the answer.
(vii) They are time consuming both for the examiner and the examinee.
(viii) They generally yield less reliable scores than objective type items.
Keeping these limitations in view, the following suggestions are helpful in constructing essay type items:
I. Ask questions or establish tasks that will require the student to demonstrate command of
essential knowledge. This means that students should not be asked merely to reproduce
material heard in a lecture or read in a textbook. To "demonstrate command" requires that the
question be somewhat novel or new. The substance of the question should be essential
knowledge rather than trivia that might be a good board game question.
II. Ask questions that are determinate, in the sense that experts (colleagues in the field) could
agree that one answer is better than another. Questions that contain phrases such as "What do
you think..." or "What is your opinion about..." are indeterminate. They can be used as a
medium for assessing skill in written expression, but because they have no clearly right or
wrong answer, they are useless for measuring other aspects of achievement.
III. Define the examinee's task as completely and specifically as possible without interfering
with the measurement process itself. It is possible to word an essay item so precisely that
there is one and only one very brief answer to it. The imposition of such rigid bounds on the
response is more limiting than it is helpful. Examinees do need guidance, however, to judge
the scope and depth of response that is expected.
IV. Generally give preference to specific questions that can be answered briefly. The more
questions used, the better the test constructor can sample the domain of knowledge covered
by the test. And the more responses available for scoring, the more accurate the total test
scores are likely to be. In addition, brief responses can be scored more quickly and more
accurately than long, extended responses, even when there are fewer of the latter type.
V. Use enough items to sample the relevant content domain adequately, but not so many that
students do not have sufficient time to plan, develop, and review their responses. Some
instructors use essay tests rather than one of the objective types because they want to
encourage and provide practice in written expression. However, when time pressures become
great, the essay test is one of the most unrealistic and negative writing experiences to which
students can be exposed. Often there is no time for editing, for rereading, or for checking
spelling. Planning time is short-changed so that writing time will not be. There are few, if
any, real writing tasks that require such conditions. And there are few writing experiences
that discourage the use of good writing habits as much as essay testing does.
VI. Avoid giving examinees a choice among optional questions unless special circumstances
make such options necessary. The use of optional items destroys the strict comparability
between student scores because not all students actually take the same test. Student A may
have answered items 1-3 and Student B may have answered 3-5. In these circumstances the
variability of scores is likely to be quite small because students were able to respond to items
they knew more about and ignore items with which they were unfamiliar. This reduced
variability contributes to reduced test score reliability. That is, we are less able to identify
individual differences in achievement when the test scores form a very homogeneous
distribution. In sum, optional items restrict score comparability between students and
reduce the reliability of the test scores.
VII. Test the question by writing an ideal answer to it. An ideal response is needed eventually
to score the responses. If it is prepared early, it permits a check on the wording of the item,
the level of completeness required for an ideal response, and the amount of time required to
furnish a suitable response. It even allows the item writer to determine if there is any
ambiguity in the wording of the item that needs to be removed.
VIII. Specify the time allotment for each item and/or specify the maximum number of points
to be awarded for the "best" answer to the question. Both pieces of information provide
guidance to the examinee about the depth of response expected by the item writer. They also
represent legitimate pieces of information a student can use to decide which of several items
should be omitted when time begins to run out. Often the number of points attached to the
item reflects the number of essential parts to the ideal response. Of course if a definite
number of essential parts can be determined, that number should be indicated as part of the
question.
IX. Divide a question into separate components when there are obvious multiple questions or
pieces to the intended responses. The use of parts helps examinees organizationally and,
hence, makes the process more efficient. It also makes the grading process easier because it
encourages organization in the responses. Finally, if multiple questions are not identified,
some examinees may inadvertently omit some parts, especially when time constraints are
great.
Q.3 Write a note on Mean, Median and Mode. Also discuss their uses.
ANS:
The terms mean, median, mode, and range describe properties of statistical distributions. In
statistics, a distribution is the set of all possible values for terms that represent defined events.
There are two major types of statistical distributions. The first type contains discrete random
variables. This means that every term has a precise, isolated numerical value. The second
type contains continuous random variables, where the data can take infinitely many values. When a term
can acquire any value within an unbroken interval or span, its distribution is described by a
probability density function.
IT professionals need to understand the definition of mean, median, mode and range to plan
capacity and balance load, manage systems, perform maintenance and troubleshoot issues.
Understanding the definition of mean, median, mode and range is important for IT
professionals in data center management. Many relevant tasks require the administrator to
calculate mean, median, mode or range, or often some combination, to show a statistically
significant quantity, trend or deviation from the norm. Finding the mean, median, mode and
range is only the start. The administrator then needs to apply this information to investigate
root causes of a problem, accurately forecast future needs or set acceptable working parameters.
When working with a large data set, it can be useful to represent the entire data set with a
single value that describes the "middle" or "average" value of the entire set. In statistics, that
single value is called the central tendency and mean, median and mode are all ways to
describe it. To find the mean, add up the values in the data set and then divide by the number
of values that you added. To find the median, list the values of the data set in numerical order
and identify which value appears in the middle of the list. To find the mode, identify which
value in the data set occurs most often. Range, which is the difference between the largest
and smallest value in the data set, describes how well the central tendency represents the data.
If the range is large, the central tendency is not as representative of the data as it would be if
the range were small.
Mean
The most common expression for the mean of a statistical distribution with a discrete random
variable is the mathematical average of all the terms. To calculate it, add up the values of all
the terms and then divide by the number of terms. The mean of a statistical distribution with a
continuous random variable, also called the expected value, is obtained by integrating the
product of the variable with its probability density as defined by the distribution, E(X) = ∫ x f(x) dx.
The expected value is denoted by the lowercase Greek letter mu (µ).
Median
The median of a distribution with a discrete random variable depends on whether the number
of terms in the distribution is even or odd. If the number of terms is odd, then the median is
the value of the term in the middle. This is the value such that the number of terms having
values greater than or equal to it is the same as the number of terms having values less than or
equal to it. If the number of terms is even, then the median is the average of the two terms in
the middle, such that the number of terms having values greater than or equal to it is the same
as the number of terms having values less than or equal to it.
The median of a distribution with a continuous random variable is the value m such that the
probability is at least 1/2 (50%) that a randomly chosen point on the function will be less than
or equal to m, and the probability is at least 1/2 that a randomly chosen point on the function
will be greater than or equal to m.
Mode
The mode of a distribution with a discrete random variable is the value of the term that occurs
the most often. It is not uncommon for a distribution with a discrete random variable to have
more than one mode, especially if there are not many terms. This happens when two or more
terms occur with equal frequency, and more often than any of the others.
A distribution with two modes is called bimodal. A distribution with three modes is called
trimodal. The mode of a distribution with a continuous random variable is the maximum
value of the function. As with discrete distributions, there may be more than one mode.
To calculate mean, add together all of the numbers in a set and then divide the sum by the
total count of numbers. For example, in a data center rack, five servers consume 100 watts,
98 watts, 105 watts, 90 watts and 102 watts of power, respectively. The mean power use of
that rack is calculated as (100 + 98 + 105 + 90 + 102 W)/5 servers = a calculated mean of 99
W per server. Intelligent power distribution units can report the mean power utilization of the
rack in this way.
In the data center, means and medians are often tracked over time to spot trends, which
inform capacity planning or power cost predictions. The statistical median is the middle
number in a sequence of numbers. To find the median, organize each number in order by
size; the number in the middle is the median. For the five servers in the rack, arrange the
power consumption figures from lowest to highest: 90 W, 98 W, 100 W, 102 W and 105 W.
The median power consumption of the rack is 100 W. If there is an even set of numbers,
average the two middle numbers. For example, if the rack had a sixth server that used 110 W,
the new number set would be 90 W, 98 W, 100 W, 102 W, 105 W and 110 W. Find the
median by averaging the two middle values, 100 W and 102 W, which gives 101 W.
The mode is the number that occurs most often within a set of numbers. For the server power
consumption examples above, there is no mode because each element is different. But
suppose the administrator measured the power consumption of an entire network operations
centre containing 12 servers, and 98 W appeared several times among readings that otherwise
included values such as 100 W, 110 W, 210 W and 115 W. The mode is 98 W since that power consumption
measurement occurs most often amongst the 12 servers. Mode helps identify the most
common or frequent value in a set. It is possible to have two modes
(bimodal), three modes (trimodal) or more modes within larger sets of numbers.
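The rack example above can be reproduced with Python's built-in statistics module. This is a minimal sketch using the wattage figures quoted in the text; the small list used to illustrate the mode is made up, because the five rack readings contain no repeated value.

```python
import statistics

# Power draw (watts) of the five servers in the rack described above.
rack = [100, 98, 105, 90, 102]

print(statistics.mean(rack))    # 99   -> mean watts per server
print(statistics.median(rack))  # 100  -> middle value of the sorted list
print(max(rack) - min(rack))    # 15   -> range: largest minus smallest value

# Adding a sixth server (110 W) gives an even-sized set, so the median
# is the average of the two middle values (100 and 102).
rack.append(110)
print(statistics.median(rack))  # 101.0

# Mode: most frequent value. The rack data has no repeats, so we use a
# small made-up set in which 98 W occurs most often.
noc = [98, 100, 98, 110, 98, 115]
print(statistics.mode(noc))     # 98
```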
Q.4 Discuss different methods of assigning grades and reporting students' performance.
ANS:
Assigning Letter Grades: The letter grade system is the most popular grading system in the world, including Pakistan.
Most teachers face problems while assigning grades. There are four core problems or issues
in this regard: 1) what should be included in a letter grade, 2) how should achievement data
be combined in assigning letter grades, 3) what frame of reference should be used in grading,
and 4) how should the distribution of letter grades be determined.
Letter grades are likely to be most meaningful and useful when they represent achievement
only. If they are combined with other factors or aspects such as effort, amount of work
completed, personal conduct, and so on, their interpretation will become hopelessly confused.
For example, a letter grade C may represent average achievement with extraordinary effort
and excellent conduct and behavior or vice versa. If letter grades are to be valid indicators of
achievement, they must be based on valid measures of achievement. This involves defining
objectives as intended learning outcomes and developing or selecting tests and assessments
that actually measure these outcomes.
One of the key concerns while assigning grades is to be clear what aspects of a student are to
be assessed or what will be the tentative weightage to each learning outcome. For example, if
we decide to give 75% weightage to the end-of-term test or assessment, and 25% to assignments, presentations, classroom participation and
conduct and behavior; we have to combine all elements by assigning appropriate weights to
each element, and then use these composite scores as a basis for grading.
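A minimal sketch of how such a weighted composite might be computed is given below. The 75%/25% split follows the example above; the weights and the input scores are illustrative, and a teacher would substitute the weightings actually agreed for the course.

```python
# Minimal sketch: combining assessment elements into a weighted composite score.
# The 75% / 25% weights follow the example above; both are illustrative choices.

def composite_score(term_test, coursework, w_test=0.75, w_coursework=0.25):
    """Both inputs are percentages (0-100); the result is a weighted composite."""
    return w_test * term_test + w_coursework * coursework

# Example: 68% on the term test and 85% on assignments/participation.
print(composite_score(68, 85))  # -> 72.25
```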
Letter grades are typically assigned on the basis of one of the following frames of reference.
Assigning grades on relative basis involves comparing a student’s performance with that of a
reference group, mostly class fellows. In this system, the grade is determined by the student’s
relative position or ranking in the total group. Although relative grading has a disadvantage
of a shifting frame of reference (i.e. grades depend upon the group’s ability), it is still widely
used in schools and colleges. In contrast, assigning grades on an absolute basis involves comparing
a student’s performance with specified standards set by the teacher. This is what we call ‘criterion-referenced’ testing. If
all students show a low level of mastery consistent with the established performance
standard, all will receive low grades. Grading student performance in relation to learning
ability is inconsistent with a standards-based system of evaluating and reporting student
performance. Judging improvement over a short time span is also difficult. Thus lack of reliability in
judging achievement in relation to ability and in judging degree of improvement will result in
grades of low dependability. Therefore such grades are used as supplementary to other
grading systems.
The assigning of relative grades is essentially a matter of ranking the student in order of
overall achievement and assigning letter grades on the basis of each student’s rank in the
group. This ranking might be limited to a single classroom group or might be based on the
combined distribution of several classroom groups taking the same course. If grading on the
curve is to be done, the most sensible approach in determining the distribution of letter grades
in a school is to have the school staff set general guidelines for introductory and advanced
courses. All staff members must understand the basis for assigning grades, and this basis
must be clearly communicated to users of the grades. If the objectives of a course are clearly
mentioned and the standards for mastery appropriately set, the letter grades in an absolute
system may be defined as the degree to which the objectives have been attained, as follows:
A = Outstanding (90-100%)
B = Very Good (80-89%)
C = Satisfactory (70-79%)
D = Weak (60-69%)
F = Unsatisfactory (below 60%)
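Assuming the decade-based cut-offs listed above (which are illustrative rather than prescribed), converting a percentage into a letter grade in an absolute system can be sketched as follows.

```python
# Minimal sketch of an absolute (criterion-referenced) letter grade scale.
# The cut-off percentages mirror the decade-based scale above and are illustrative.

def letter_grade(percent):
    if percent >= 90:
        return "A"   # Outstanding
    elif percent >= 80:
        return "B"   # Very Good
    elif percent >= 70:
        return "C"   # Satisfactory
    elif percent >= 60:
        return "D"   # Weak
    else:
        return "F"   # Unsatisfactory

print(letter_grade(72.25))  # -> 'C'
```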
A standard score is also derived from the raw scores using the norm information gathered
when the test was developed. Instead of indicating a student’s rank compared to others,
standard scores indicate how far above or below the average (Mean) an individual score falls,
using a common scale, such as one with an average of 100. Basically standard scores express
test performance in terms of standard deviation (SD) from the Mean. Standard scores can be
used to compare individuals of different grades or age groups because all are converted into
the same numerical scale. There are various forms of standard scores such as z-score, T-
score, and stanines. Z-score expresses test performance simply and directly as the number of
SD units a raw score is above or below the Mean. A z-score is always negative when the raw
score is smaller than the Mean. Symbolically it can be shown as: z-score = (X - M) / SD. T-
score refers to any set of normally distributed standard scores that has a Mean of 50 and SD of
10. Symbolically it can be represented as: T-score = 50+10(z). Stanines are the simplest form
of normalized standard scores that illustrate the process of normalization. Stanines are single
digit scores ranging from 1 to 9. These are groups of percentile ranks with the entire group of
scores divided into nine parts, with the largest number of individuals falling in the middle
stanines, and fewer students falling at the extremes (Linn & Gronlund, 2000).
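The following minimal sketch applies these conversions. The z-score and T-score formulas are the ones given above; the stanine rule (cut-points every half standard deviation, centred on 5) is the conventional one and is assumed here, as the text does not spell it out. The raw score, Mean and SD in the example are made up.

```python
# Minimal sketch of the standard-score conversions described above.
# z = (X - M) / SD and T = 50 + 10z come from the text; the stanine rule is assumed.

def z_score(raw, mean, sd):
    return (raw - mean) / sd

def t_score(z):
    return 50 + 10 * z

def stanine(z):
    # Conventional stanine: bands half an SD wide, centred on 5, clipped to 1-9.
    return max(1, min(9, round(2 * z + 5)))

# Example: a raw score of 62 on a test with Mean 50 and SD 8.
z = z_score(62, 50, 8)                # 1.5 SD above the mean
print(z, t_score(z), stanine(z))      # -> 1.5 65.0 8
```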
It is the easiest and most popular grading and reporting system. The traditional system
is generally based on grades A to F. This rating is generally reflected as: Grade A (Excellent),
B (Good), C (Average), D (Below Average), and
F (Fail). This system does not truly assess a student’s progress in different learning domains.
First shortcoming is that using this system it is difficult to interpret the results. Second, a
student’s performance is linked with achievement, effort, work habits, and good behaviour;
traditional letter-grade system is unable to assess all these domains of a student. Third, the
proportion of students assigned each letter grade generally varies from teacher to teacher.
Fourth, it does not indicate patterns of strengths and weaknesses in the students (Linn &
Gronlund, 2000). In spite of these shortcomings, this system is popular in schools, colleges
and universities.
The pass-fail system is another common way of reporting. In the context of Pakistan, as the
majority of parents are illiterate or barely literate, they are mainly concerned with whether
their children ‘pass or fail’ in school. This system is
mostly used for courses taught under a pure mastery learning approach i.e. criterion-
referenced testing. This system has also many shortcomings. First, as students are declared
just pass or fail (successful or unsuccessful) so many students do not work hard and hence
their actual learning remains unsatisfactory or below desired level. Second, this two-category
system provides less information to the teacher, student and parents than the traditional letter
grade system.
Q.5 Discuss the measures of central tendency and the measures of reliability?
ANS:
A measure of central tendency is a summary measure that attempts to describe a whole set of data with a single value that
represents the middle or centre of its distribution. There are three main measures of central
tendency: the mode, the median and the mean. Each of these measures describes a different
indication of the typical or central value in the distribution.
The mode is the most commonly occurring value in a distribution. Consider this dataset
showing the retirement age of 11 people, in whole years: 54, 54, 54, 55, 56, 57, 57,
58, 58, 60, 60. This table shows a simple frequency distribution of the retirement age data.
Age Frequency
54 3
55 1
56 1
57 2
58 2
60 2
The most commonly occurring value is 54; therefore the mode of this distribution is 54 years.
The mode has an advantage over the median and the mean as it can be found for both
numerical and categorical (non-numerical) data.
There are some limitations to using the mode. In some distributions, the mode may not reflect
the centre of the distribution very well. When the distribution of retirement age is ordered
from lowest to highest value, it is easy to see that the centre of the distribution is 57 years, but
the mode is lower, at 54 years. 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
It is also possible for there to be more than one mode for the same distribution of data, (bi-
modal, or multi-modal). The presence of more than one mode can limit the ability of the
mode in describing the centre or typical value of the distribution because a single value to
describe the centre cannot be identified. In some cases, particularly where the data are
continuous, the distribution may have no mode at all (i.e. if all values are different).
In cases such as these, it may be better to consider using the median or mean, or group the
data into appropriate intervals and find the modal class.
The median is the middle value in a distribution when the values are arranged in ascending or
descending order. The median divides the distribution in half (there are 50% of observations
on either side of the median value). In a distribution with an odd number of observations, the
median value is the middle value. Looking at the retirement age distribution (which has 11
observations), the median is the middle value, which is 57 years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
When the distribution has an even number of observations, the median value is the mean of
the two middle values. In the following distribution, the two middle values are 56 and 57,
therefore the median equals 56.5 years:
52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
The median is less affected by outliers and skewed data than the mean, and is usually the
preferred measure of central tendency when the distribution is not symmetrical.
The median cannot be identified for categorical nominal data, as it cannot be logically
ordered.
The mean is the sum of the value of each observation in a dataset divided by the number of
observations. This is also known as the arithmetic average. Looking at the retirement age
distribution again: 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60, the mean is 623/11 = 56.6 years.
Limitations of the mean: The mean cannot be calculated for categorical data, as the values
cannot be summed. As the mean includes every value in the distribution the mean is
influenced by outliers and skewed distributions. What else do I need to know about the
mean?
The population mean is indicated by the Greek symbol µ (pronounced ‘mu’). When the mean
is calculated on a sample of data, it is indicated by the symbol x̄ (pronounced ‘x-bar’).
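Using the retirement-age data above, the three measures can be checked with a few lines of Python. This is only an illustrative sketch; the frequency count reproduces the table given earlier.

```python
from collections import Counter

# Retirement ages of the 11 people in the example above.
ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

# Mode from the frequency table: the age with the highest count.
freq = Counter(ages)
print(freq.most_common(1))              # [(54, 3)] -> mode is 54 years

# Median: middle value of the sorted list (11 values, so the 6th one).
print(sorted(ages)[len(ages) // 2])     # 57

# Mean: sum of the values divided by the number of observations.
print(round(sum(ages) / len(ages), 1))  # 56.6  (623 / 11)
```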
Measure of Reliability
Reliability is one of the most important elements of test quality. It has to do with the
consistency, or reproducibility, of an examinee's performance on the test. In practice it is not possible to
calculate reliability exactly. Instead, we have to estimate reliability, and this is always an
imperfect attempt. Here, we introduce the major reliability estimators and talk about their
strengths and weaknesses.
There are six general classes of reliability estimates, each of which estimates reliability in a
different way.
Inter-rater reliability: To assess the degree to which different raters/observers give consistent estimates of the same
phenomenon. That is, if two teachers mark the same test and the results are similar, this indicates
inter-rater reliability.
Test-retest reliability: To assess the consistency of a measure from one time to another. When the same test is
administered twice and the results of both administrations are similar, this constitutes test-
retest reliability. Students may remember items, or may mature, after the first administration,
which can affect their scores on the second administration.
Parallel-forms reliability: To assess the consistency of the results of two tests constructed in the same way from the
same content domain. Here the test designer develops two tests of a similar kind; if the results
after administration are similar, this indicates parallel-forms reliability.
Internal consistency reliability: To assess the consistency of results across items within a test; it is based on the correlation
of the individual items with one another or with the total test score.
Split-half reliability: To assess the consistency of results by comparing two halves of a single test; these halves may be
formed by an odd-even split or by dividing the items at random.
Cronbach's alpha: To assess the consistency of the results using all the possible split halves of a test; it is
equivalent to the average of all possible split-half estimates.
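As an illustration, the last two estimates can be computed from a small table of item scores. The sketch below assumes a made-up matrix of five students by four items; the split-half estimate uses an odd-even split with the Spearman-Brown correction, and alpha is computed from the usual formula alpha = k/(k-1) x (1 - sum of item variances / variance of total scores).

```python
import statistics

def pearson(x, y):
    """Pearson correlation between two score lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def split_half_reliability(items_by_student):
    """Correlate odd-item and even-item half scores, then apply the
    Spearman-Brown correction for full test length."""
    odd = [sum(row[0::2]) for row in items_by_student]
    even = [sum(row[1::2]) for row in items_by_student]
    r = pearson(odd, even)
    return 2 * r / (1 + r)

def cronbach_alpha(items_by_student):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(items_by_student[0])
    items = list(zip(*items_by_student))           # one tuple per item
    item_vars = sum(statistics.variance(i) for i in items)
    total_var = statistics.variance([sum(row) for row in items_by_student])
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Hypothetical scores: 5 students x 4 items, each item marked 0-5.
scores = [
    [4, 5, 4, 5],
    [3, 4, 3, 3],
    [5, 5, 4, 5],
    [2, 3, 2, 2],
    [4, 4, 5, 4],
]
print(round(split_half_reliability(scores), 2))  # -> 0.93
print(round(cronbach_alpha(scores), 2))          # -> 0.95
```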