
International Journal of Teaching and Learning in Higher Education 2022, Volume 33, Number 3, 407-428

[Link] ISSN 1812-9129

Development and Validation of a Formative Evaluation Instrument for College Teaching
Ren Liu, Xiufeng Liu, and Lara Hutson
University at Buffalo

This study aims to provide preliminary validation for a newly designed instrument to evaluate teaching effectiveness through student classroom engagement and learning gains. The instrument is titled the Middle Semester Classroom Survey of Student Engagement and Learning (MS-CSSEL); it consists of 31 items to measure student classroom engagement in three dimensions and 19 items to measure student learning gains in three dimensions. To validate the instrument, 634 undergraduate students at a four-year research university participated in this study. The multidimensional Rasch model was used to conduct the analysis. The findings indicated that (a) items displayed a good fit to the Rasch model, (b) dimensions were distinct from each other, and (c) the items displayed high reliability. This instrument measures teaching effectiveness from a new perspective and provides college teachers with a new tool to gauge student engagement and learning gains for conducting evidence-based teaching improvement.

Student ratings of instruction (SRI) are among the most predominant approaches adopted by higher education institutions to measure overall teaching performance or effectiveness (Berk, 2005; Chen & Hoshower, 2003; Yao & Grady, 2005; Zabaleta, 2007). Typically, SRI are administered near the end of each semester, and students are asked to rate characteristics of teachers and courses, such as course organization, teachers' enthusiasm, and clarity of explanation (Uttl et al., 2017).

Although SRI was initially intended to provide formative feedback to improve teaching quality, since the 1970s it has commonly been used for making high-stakes decisions about faculty members (Berk, 2005; Clayson, 2013; Clayson & Haley, 2011; Galbraith et al., 2012; Nasser & Fresko, 2002; Stowell et al., 2012). However, the original formative purpose of SRI has not been fully achieved. There is little evidence to support the usefulness of SRI for improving and shaping the quality of teaching. Faculty members rarely make changes in their teaching styles or course content based on students' ratings, especially not senior faculty (Yao & Grady, 2005). Many instructors are not formally trained in pedagogy and do not have the necessary skills to interpret students' ratings (Gravestock & Gregor-Greenleaf, 2008; Yao & Grady, 2005). Further undermining the use of summative evaluation for teaching improvement, faculty have doubts about the validity and reliability of the measurement instruments used for SRI because of diverse interpretations of effective teaching (Spooren et al., 2013). While instructors may be willing to accept students' feedback, they hold negative attitudes towards the summative use of SRI (Spooren et al., 2013). Therefore, it is necessary to develop a separate teaching evaluation for formative purposes, which is the purpose of this study.

What is Teaching Effectiveness?

Teaching effectiveness is not a new concept in higher education. However, there is no commonly agreed-upon definition or universal criterion for what effective teaching is (Benton & Cashin, 2012; Lehrer-Knafo, 2019; Shevlin, Banyard et al., 2000). When thinking about teaching effectiveness, it is necessary to consider the desired goals of teaching and learning in different contexts (Atkins & Brown, 2002). In other words, the meaning of effective in one context may not be the same in another (Atkins & Brown, 2002).

Nevertheless, the concept of teaching effectiveness is stakeholder relative. Students, teachers, and evaluating agencies may have different understandings of the meaning of effectiveness (Fauth et al., 2020; Henard & Leprince-Ringuet, 2008). For example, exemplary teachers may be concerned with effective teaching through lesson organization, lesson clarity, interest in the lesson, and a positive classroom environment (Hativa et al., 2001). Delaney and colleagues (2010) conducted a study with 17,000 graduate and undergraduate students to explore the key factors that students perceived as essential for effective teaching. Students identified nine behavioral characteristics: respectful, knowledgeable, approachable, engaging, communicative, organized, responsive, professional, and humorous (Delaney et al., 2010).
“teaching that brings about effective and successful the classroom was problematic since the questions on the
student learning that is deep and meaningful.” In SRI failed to capture what happened in the classroom
addition, Darling-Hammond and colleagues (2012) settings (Pounder, 2007). Worthington (2002) conducted
considered effective teaching as the instruction that a case study in a finance major course to investigate the
enabled students to learn. Effective teaching should meet effects of students’ characteristics and their perceptions
the demands of discipline, instructional goals, and of the usage of SRI. The results suggest that the expected
students’ needs in the teaching and learning environment grade in the subject, student age, race, gender, and their
(Darling-Hammond et al., 2012). “Effective teaching is perceptions of the evaluation process all have significant
about reaching achievement goals; it is about students impacts on the ratings given to the instructors.
learning what they are supposed to in a particular Third, SRI may discriminate instructors based on
context, grade, or subject” (Berliner, 2005, p, 207). their background characteristics, especially for female
Carnell (2007) conducted a qualitative study with eight instructors. Centra and Gaubatz (2000) conducted a
instructors teaching in higher education to examine study to investigate the relationship between students’
university teachers’ conceptions of effective teaching. gender and instructors’ gender across classes regarding
Although the instructors have different teaching instruction ratings. The results indicate that both in the
experiences, they all consider effective teaching to same class and across all classes, there is a significant
enable students’ learning (Carnell, 2007). difference between male and female students when
rating female instructors, but no significant difference is
Student Ratings of Instruction (SRI) detected for male instructions (Centra & Gaubatz, 2000).
Female instructors are more likely to receive lower
The use of SRI to measure and interpret teaching ratings from male students, even controlling for the
effectiveness has increased in higher education effects of class size (Centra & Gaubatz, 2000).
institutions since the 1900s. However, SRI has Similarly, Young and colleagues (2009) conducted
limitations on measuring teaching quality (Clayson & a study to explore the gender bias when rating the
Haley, 2011; Pounder, 2007; Shevlin et al., 2000; instructor and instructions, and the interaction between
Spooren et al., 2013; Uttl et al., 2017). First, both the students’ and instructors’ characteristics, especially for
content validity and construct validity of the commonly the effects of gender. The results show a potential
used teaching evaluation measurement instruments have gender-preference while rating instructors on
been questioned (Spooren et al., 2013). Due to the lack pedagogical characteristics and course content
of consensus of effective teachers' characteristics, there characteristics (Young et al., 2009). A more recent
is a large variation in scope for the instruments used to qualitative study conducted by Mitchell and Martin
measure teaching effectiveness, especially in the defined (2018) aimed to investigate the underlying relationship
dimensions of teaching effectiveness (Spooren et al., between gender and student evaluation of teaching.
2013). In addition, the majority of measurement Based on analysis of the student comments, the authors
instruments used currently are designed by found that students did evaluate their professors with
administrators without consideration of other essential gender bias. Students tend to comment on female
stakeholders’ views of effective teaching, which raises instructors’ appearance and personality more than male
the question of content validity for the design of the instructors and show less professional respect to woman
instruments (Spooren et al., 2013). instructors (Mitchell & Martin, 2018).
Second, students’ ratings can be affected by a Considering the limitation of traditional SRI, the
variety of factors other than teaching practices, which purpose of this study is to develop and validate a
raises the issues of accuracy of using SRI’s results for formative evaluation measurement instrument titled
high-stake decisions (Pounder, 2007; Shevlin et al., Middle Semester Classroom Survey of Student
2000; Worthington, 2002). Shevlin et al. (2000) Engagement and Learning (MS-CSSEL) that can be
conducted a study with 213 undergraduate students used in college teaching for instructional improvement
within a social science department at a UK University through the lens of students’ engagement and learning
exploring the potential relationship between students’ gains. The measures of engagement and learning gains
perception of the lecturer and their ratings for intend to provide comprehensive information for
instruction. The results indicate that charisma factors instructors to adjust teaching strategies during the course
account for 69% and 37% of the variation in lecturer of instruction.
ability and module attributes respectively (Shevlin et al.,
2000). In addition, in a comprehensive review of the Theoretical Framework
literature, Pounder (2007) synthesized a variety of
potential student-level, course-level, and teacher-level Reliability and validity are two essential
factors that affected student ratings, concluding that psychometric properties for any measurement
relying only on SRI to measure teaching and learning in instrument to be used. According to the Standards for
Liu et al. Development and Validation of a Formative Evaluation Instrument 409

Educational and Psychological Testing (Joint Committee, 2014), validity is defined as "the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests. Validity is, therefore, the most fundamental consideration in developing tests and evaluating tests" (p. 11). The concept of validity has evolved over the past century, from considering criterion validity to content validity (Strauss & Smith, 2009). Around 1980, construct validity was developed and accepted by scholars (Strauss & Smith, 2009). Today, it is commonly accepted that validity claims can be established based on content-related evidence, alignment of items with the defined theory, internal structure of items, response process evidence, criterion-related evidence, and consequence-related evidence.

Reliability refers to the "consistency of scores across replications of a testing procedure, regardless of how this consistency is estimated or reported" (Joint Committee, 2014). Reliability has been evaluated by a variety of coefficients depending on the measurement model being used, such as the Cronbach's alpha reliability coefficient, the generalizability coefficient, and Item Response Theory (IRT) information functions (Joint Committee, 2014).

In this study, we employed Rasch modeling to validate the psychometric properties of the newly developed measurement instrument. The item dimensionality estimates, the correlations between dimensions, fit statistics, item-person maps, and threshold estimates were generated to make claims about the instrument's construct validity. The Expected A Posteriori (EAP) reliability was calculated to represent the degree of consistency of the instrument.

The research questions of this study are:

1) What evidence supports the multidimensional construct assumption of student engagement and learning gains?
2) Does the MS-CSSEL survey produce valid and reliable measures to assess student classroom engagement and learning gains?

Method

Purpose and Population of the Measurement Instrument

The purpose of the MS-CSSEL survey is to measure college student engagement and learning gains in the middle of the semester. The measures of engagement and learning gains will be used to infer whether the current instruction is effective and to help instructors adjust teaching for the rest of the course. The survey's target population is college students, and the setting for the survey is face-to-face college classrooms.

Student Engagement

While conceptualizations of student engagement are diverse among researchers, there is agreement that student engagement is a multidimensional concept with three coherent dimensions: behavioral, cognitive, and affective (Appleton et al., 2006; Fredricks et al., 2004; Kahu, 2013). The definition of behavioral engagement primarily relies on the idea of active involvement in academic, social, and extracurricular activities and the absence of negative behaviors for accomplishing positive learning outcomes (Fredricks et al., 2004; Fredricks & McColskey, 2012; Trowler, 2010). According to Kahu's (2013) framework of engagement, there are three specific subscales attached to behavioral engagement: time and effort allocated to educational activities, interactions with peers and instructors for educational purposes, and the extent of participation in learning activities (see Figure 1).

Cognitive engagement incorporates the idea of willingness to invest effort beyond the requirements of the course to understand complicated concepts and master skills (Fredricks et al., 2004; Fredricks & McColskey, 2012; Trowler, 2010). Cognitive engagement includes many latent factors that cannot be observed directly in the classroom, such as self-regulation, values and beliefs about learning, cognitive and metacognitive strategies for learning (e.g., memorizing, synthesizing, understanding, evaluating), and personal goals and autonomy (Appleton et al., 2006; Fredricks et al., 2004; see Figure 1).

Emotional engagement focuses on the affective reactions to teachers, classmates, academics, and the institution, which impact the willingness to participate in school work (Fredricks et al., 2004; Fredricks & McColskey, 2012; Trowler, 2010). Kahu (2013) considered emotional engagement as a kind of attachment to schools or classes, while others considered it as enjoyment of or interest in the learning activities. This study adopted the conceptual framework of Kahu (2013), which decomposes emotional engagement into enthusiasm for the courses, interest in the courses, and the sense of belonging to the classes (see Figure 1).

Learning Gains

After conducting a systematic literature review of learning gains and how they were measured, Rogaten et al. (2019) categorized learning gains into three different types following the commonly used ABC (i.e., Affective, Behavioral, and Cognitive) classification. Behavioral learning gains refer to skills, including study skills, leadership skills, teamwork skills, and critical thinking skills (Rogaten et al., 2019). Cognitive learning gains are defined as understanding and cognitive
abilities, such as analyzing, memorizing, synthesizing, evaluating, and applying abilities (Rogaten et al., 2019). Affective learning gains are mainly measured as the change of attitude during a course, such as confidence, motivation, and interest (Rogaten et al., 2019). In addition, how much students have learned about content knowledge is also an important indicator of students' learning gains. For this study, questions about content knowledge are incorporated into the behavioral dimension. Figure 1 presents the conceptual framework of the constructs measured in this study.

Figure 1
Conceptual Framework

Instrument Construction

Based on Kahu's (2013) framework, the MS-CSSEL survey utilized five-point Likert-scale questions to assess student engagement. Behavioral engagement was assessed using 11 questions relating to how much time and effort students allocated to educational tasks, interactions with peers and instructors both inside and outside the classroom, and attendance and participation. Cognitive engagement was assessed using 14 questions relating to how much extra work students did for the course, beliefs and values related to the course, cognitive strategies used to study the course content, and self-regulation strategies. Affective engagement was assessed using six items relating to sense of belonging in the course, enthusiasm, and interest in the subject.

Regarding the items measuring student learning gains, learning gains in skills and knowledge were assessed using seven questions that covered gains in content knowledge, critical thinking, problem-solving, communication, and cooperation. Cognitive learning gains were assessed using six questions related to understanding, analyzing and synthesizing, evaluating, and applying the course content. Learning gains in attitude were assessed using six questions relating to enthusiasm, interest, and confidence in the content and the course.

In addition to the 50 items above, the MS-CSSEL survey includes two sets of questions asking to what extent a student's skills and knowledge gains, cognitive gains, and affective gains have been affected by particular instructional components, such as the lecture, assigned class activities, and graded assignments.

At the end of the engagement section, two open-ended questions asked students how they had participated and engaged in the course and how the instructor could maintain and improve their engagement for the rest of the semester. Another two open-ended questions were attached to the learning gains section, which let students describe what they had learned so far and provide suggestions to the instructor for the rest of the semester (see Appendix for the entire survey).

Data Collection
The MS-CSSEL survey was administered to a target audience at one research university. In total, 634 undergraduate students in two introductory biology courses participated in this study (see Table 1). The survey was administered in the middle of the Spring 2019 semester via a commercial online survey platform, SelectSurvey, which is similar to common platforms such as SurveyMonkey and Qualtrics. The online survey platform collected students' responses and generated a descriptive report automatically. Students were invited by email with a survey link and were given two weeks to complete the survey. The survey was conducted anonymously. The instructor provided 5 extra credit points (0.75% of the final grade) to students to encourage their participation in the survey. In total, 713 students were invited by email and 634 of them completed the survey, a response rate of 88.9%. As recommended by Wright and Douglas (1975), in order to obtain stable item calibration within ±½ logit with a 95% confidence interval, the minimum sample size is between 64 and 144. Thus, it is reasonable to assume that this study has an appropriate sample size for stable item calibration.

Table 1
Descriptive Statistics of the Sample (N = 634)

Category                              N
Gender           Male                211
                 Female              391
                 Missing              32
Classification   Freshmen            424
                 Sophomore           104
                 Junior               39
                 Senior               36
                 Missing              31
Race             White               275
                 Hispanic/Latino      43
                 African American     54
                 Native American       0
                 Asian               189
                 Others               40
                 Missing              33

Traditionally, many Likert-scale evaluation instruments used in higher education have been developed according to Classical Test Theory (CTT), which assumes that all items on the survey have the same standard error and that threshold estimates between categories are equal for all items (Van Zile-Tamsen, 2017). Researchers have demonstrated limitations of using CTT to analyze rating-scale or Likert-scale data (Bode & Wright, 1999).

Based on validity and reliability theory, "using Rasch models to develop measurement instruments, or Rasch modeling, is a systematic process in which items are purposefully constructed according to theory and empirically tested through Rasch models in order to produce a set of items that define a linear measurement scale" (Liu, 2020, p. 34). In order to address the limitations of CTT, this study applied the Rasch modeling approach to provide more accurate statistical evidence for the reliability and validity claims of the MS-CSSEL survey.

According to the literature, student engagement and learning gains are two multidimensional constructs. Thus, this study applied the multidimensional rating scale Rasch model for data analysis. All the items measuring classroom engagement adopted the same Likert-scale categories from strongly agree to strongly disagree. Because all item statements are worded positively, strongly agree was coded as 5 and strongly disagree was coded as 1. The items measuring learning gains adopted a 5-point rating scale from 5 to 1. Items measuring students' classroom engagement and learning gains were analyzed separately, following the same modeling procedures.

We used the TAM (Robitzsch et al., 2019) and WrightMap (Torres Irribarra & Freund, 2014) packages in RStudio (RStudio Team, 2018) to conduct the Rasch analysis. Item fit statistics, EAP reliability coefficients, and item-person maps were generated. Additionally, we used Winsteps 4.5.4 (Linacre, 2020) to test the appropriateness of the category structure. The item probability curves for student engagement and learning gains were drawn separately.
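The analysis described above can be reproduced in R. The following is a minimal sketch (not the authors' script): it assumes a hypothetical data frame `engagement` holding the 31 engagement item responses coded 1-5, and an item-to-dimension assignment matching the instrument description (items 1-11 behavioral, 12-25 cognitive, 26-31 affective); object names and exact TAM options are illustrative assumptions.

```r
# Sketch: multidimensional rating scale Rasch model with TAM.
# `engagement` is a hypothetical data frame of 31 items coded 1-5.
library(TAM)

# Q-matrix: rows = items, columns = dimensions
Q <- matrix(0, nrow = 31, ncol = 3,
            dimnames = list(paste0("item", 1:31),
                            c("behavioral", "cognitive", "affective")))
Q[1:11, 1]  <- 1   # behavioral engagement items
Q[12:25, 2] <- 1   # cognitive engagement items
Q[26:31, 3] <- 1   # affective engagement items

# Unidimensional and 3-dimensional rating scale models
# (TAM expects the lowest category to be 0, hence the recoding)
mod1 <- tam.mml(resp = engagement - 1, irtmodel = "RSM")
mod3 <- tam.mml(resp = engagement - 1, irtmodel = "RSM", Q = Q)
summary(mod3)
```

The same steps, with a 19-item data set and a corresponding Q-matrix, would apply to the learning gains section.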
Results

Dimension Structure

Although student classroom engagement and learning gains are widely believed to be multidimensional constructs, there is little statistical evidence to support the dimensional structure of these concepts. To test the necessity of using multidimensional models, we ran both unidimensional and 3-dimensional rating scale Rasch models and then compared the results. Because the multidimensional model hierarchically subsumes the unidimensional model, the two models can be compared by testing the significance of the change in deviance, which describes the difference between the estimated model and the true model of the concept (Baghaei, 2012). Briggs and Wilson (2003) indicated that the difference in deviance between two estimated models approximately follows a chi-square distribution with degrees of freedom equal to the difference in the number of parameters estimated in the two models. Janssen and De Boeck (1999) recommended selecting the model with the significantly smaller deviance among the estimated models.
Table 2
Model Comparison

Deviance Npars Chi-square df p


Engagement
1-dimension 47697.30 35 570.76 5 <.05
3-dimension 47126.54 40
Learning gains
1-dimension 25527.94 23 339.20 5 <.05
3-dimension 25188.75 28

Table 3
Correlation Matrix for the Dimensions of Classroom Engagement

Behavioral Cognitive Affective


Behavioral 1
Cognitive 0.78 1
Affective 0.71 0.94 1

Table 4
Correlation Matrix for the Dimensions of Learning Gains

Skills and Knowledge Cognitive Attitude


Skills and Knowledge 1
Cognitive 0.95 1
Attitude 0.84 0.88 1

As shown in Table 2, the 3-dimensional Rasch model had significantly smaller deviance than the unidimensional model for student engagement, χ2(5) = 570.76, p < .05, suggesting that the 3-dimensional Rasch model was a better solution for modeling student engagement than the unidimensional model.

The same modeling process was applied to test the difference in deviance between the unidimensional and 3-dimensional models used to analyze learning gains. The results indicated that the 3-dimensional Rasch modeling approach had a better fit to the true model of learning gains than the unidimensional model, with a significant change in deviance, χ2(5) = 339.20, p < .05. Thus, we selected the 3-dimensional Rasch model for testing the quality of the measures of engagement and learning gains.
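The deviance comparison reported in Table 2 amounts to a likelihood-ratio test. A small sketch, assuming `mod1` and `mod3` from the earlier fitting sketch (the `$deviance` and `$ic$Npars` slots are how TAM typically exposes deviance and parameter counts; verify against your TAM version):

```r
# Likelihood-ratio (deviance) comparison between nested Rasch models
lr_test <- function(dev_uni, npars_uni, dev_multi, npars_multi) {
  chi_sq <- dev_uni - dev_multi        # difference in deviance
  df     <- npars_multi - npars_uni    # difference in number of parameters
  p      <- pchisq(chi_sq, df = df, lower.tail = FALSE)
  c(chi_square = chi_sq, df = df, p = p)
}

# From the fitted models:
lr_test(mod1$deviance, mod1$ic$Npars, mod3$deviance, mod3$ic$Npars)

# Or directly from the engagement values reported in Table 2:
lr_test(47697.30, 35, 47126.54, 40)   # chi-square = 570.76, df = 5, p < .05
```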
In addition to the comparison of deviance, the correlations between dimensions of student engagement and learning gains were used to provide additional information about the precision of the measurement instrument (see Tables 3 and 4). When the multidimensional approach is used, "the higher the correlations, the greater the number of latent traits, and the shorter the subtests," the more the multidimensional approach improves measurement precision (Wang et al., 2004, p. 125). As shown in Table 3, students' behavioral engagement had a high correlation with students' cognitive engagement (r = 0.78) and affective engagement (r = 0.71). In addition, students' cognitive engagement had a high correlation with students' affective engagement (r = 0.94). Although smaller correlation estimates help to differentiate dimensions, a higher correlation estimate does not necessarily imply an identical dimension (Baghaei, 2012). Table 4 presents the correlation matrix for learning gains. Overall, the dimensions of learning gains show a high degree of correlation (i.e., ranging from 0.84 to 0.95), which suggests that the measurement instrument has a high degree of precision.
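If the models are fitted with TAM as sketched earlier, the latent correlations reported in Tables 3 and 4 can be recovered from the estimated variance-covariance matrix of the dimensions; a hedged sketch, assuming `mod3` from above:

```r
# Latent correlations between engagement dimensions
dim_cor <- cov2cor(mod3$variance)    # $variance is the latent (co)variance matrix
dimnames(dim_cor) <- list(c("Behavioral", "Cognitive", "Affective"),
                          c("Behavioral", "Cognitive", "Affective"))
round(dim_cor, 2)
```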
Item Fit Statistics

The Rasch modeling approach provides four indices for determining how well the data fit the expected Rasch model. The mean square fit statistics (MNSQs) indicate how much misfit is observed between the Rasch model's expected item performance and the actual performance in the data matrix (Bond & Fox, 2015). For the mean square statistics, the closer to 1, the better the model-data fit. For Likert-scale and rating scale questions, the commonly accepted range of the mean square statistics is from 0.6 to 1.4 (Bond & Fox, 2015; Linacre, 2019). The standardized fit statistics (ZSTDs) indicate how likely the degree of misfit expressed by the mean square statistics is to be observed (Bond & Fox, 2015). When the sample size is between 30 and 300, the acceptable range for the standardized fit statistic is from -2.0 to 2.0 (Bond & Fox, 2015). Typically, decisions about model-data fit depend on those four indices equally, but it is reasonable to make decisions according to some indices for a particular purpose (Bond & Fox, 2015). For example, if the sample size is larger than 300, the ZSTDs are more likely to be too sensitive (i.e., with many items failing to fit the model; Linacre, 2019). The standardized fit statistics depend highly on the sample size, which inflates putative Type I error rates; however, the mean square statistics are comparatively insensitive to sample size (Smith et al., 2008).

Table 5
Fit Statistics for Classroom Engagement

            Outfit Statistics            Infit Statistics
            MNSQ   ZSTD   p              MNSQ   ZSTD   p
Behavioral
Item 1 1.25 4.55 0.00 1.23 4.35 0.00
Item 2 1.32 5.56 0.00 1.31 5.49 0.00
Item 3* 1.42 6.04 0.00 1.48 6.78 0.00
Item 4 0.96 -0.76 0.45 0.95 -0.89 0.37
Item 5* 1.54 9.03 0.00 1.53 8.91 0.00
Item 6 1.03 0.68 0.50 1.04 0.82 0.41
Item 7 0.97 -0.41 0.68 1.00 0.00 1.00
Item 8 0.82 -3.43 0.00 0.84 -3.18 0.00
Item 9 0.85 -3.21 0.00 0.84 -3.39 0.00
Item 10 1.37 6.09 0.00 1.37 6.22 0.00
Item 11* 2.93 23.58 0.00 2.92 23.46 0.00
Cognitive
Item 12 1.23 3.81 0.00 1.19 3.21 0.00
Item 13 1.26 4.42 0.00 1.21 3.65 0.00
Item 14 1.08 1.32 0.19 1.13 2.07 0.04
Item 15 0.69 -5.70 0.00 0.72 -5.17 0.00
Item 16 0.80 -4.10 0.00 0.78 -4.52 0.00
Item 17 0.74 -5.43 0.00 0.69 -6.41 0.00
Item 18 0.78 -4.34 0.00 0.77 -4.54 0.00
Item 19 0.66 -6.72 0.00 0.67 -6.67 0.00
Item 20 0.63 -7.49 0.00 0.63 -7.61 0.00
Item 21 0.76 -4.61 0.00 0.74 -5.15 0.00
Item 22 0.74 -5.06 0.00 0.72 -5.29 0.00
Item 23 0.73 -5.17 0.00 0.73 -5.22 0.00
Item 24 0.70 -5.81 0.00 0.68 -6.19 0.00
Item 25 0.91 -1.56 0.12 0.90 -1.72 0.09
Affective
Item 26 0.78 -4.22 0.00 0.78 -4.17 0.00
Item 27 0.75 -5.06 0.00 0.73 -5.45 0.00
Item 28 0.79 -3.93 0.00 0.78 -4.20 0.00
Item 29 0.65 -6.70 0.00 0.63 -7.22 0.00
Item 30 0.65 -6.50 0.00 0.69 -5.57 0.00
Item 31 0.83 -2.92 0.00 0.86 -2.40 0.02

Over 600 students participated in this study; therefore, decisions about model-data fit were based primarily on the MNSQs. Overall, most of the 31 items measuring student classroom engagement fit the expected Rasch model well (Table 5). For classroom behavioral engagement, 8 out of 11 items demonstrated a good model-data fit, with MNSQs ranging from 0.82 to 1.37; the MNSQs of items 3, 5, and 11 were outside the acceptable range (see Table 5). For all 14 items measuring cognitive engagement, MNSQs ranged from 0.63 to 1.23, which indicated an acceptable level of model-data fit. All six items measuring student affective engagement also fit the expected Rasch model, with MNSQs ranging from 0.63 to 0.86.
Table 6
Fit Statistics of Learning Gains

Outfit Statistics Infit Statistics


MNSQ ZSTD p MNSQ ZSTD p
Skills and knowledge
1 Item 32 0.87 -2.18 0.03 0.90 -1.66 0.10
2 Item 33 1.00 0.01 1.00 0.92 -1.28 0.20
3 Item 34 0.91 -1.62 0.11 0.95 -0.76 0.45
4 Item 35 0.78 -4.09 0.00 0.82 -3.29 0.00
5 Item 36 1.28 4.57 0.00 1.30 4.85 0.00
6 Item 37 1.06 0.99 0.32 1.06 1.05 0.29
7 Item 38 1.54* 7.83 0.00 1.60* 8.54 0.00
Cognitive
8 Item 39 1.37 5.94 0.00 1.37 5.90 0.00
9 Item 40 0.70 -5.52 0.00 0.73 -4.79 0.00
10 Item 41 0.65 -6.60 0.00 0.68 -5.94 0.00
11 Item 42 0.72 -5.06 0.00 0.74 -4.67 0.00
12 Item 43 0.69 -5.81 0.00 0.72 -5.11 0.00
13 Item 44 0.87 -2.14 0.03 0.90 -1.74 0.08
Attitude
14 Item 45 0.95 -0.76 0.44 1.02 0.39 0.70
15 Item 46 1.12 1.99 0.05 1.15 2.33 0.02
16 Item 47 0.84 -2.77 0.01 0.93 -1.24 0.21
17 Item 48 0.88 -2.13 0.03 0.88 -2.08 0.04
18 Item 49 0.71 -5.48 0.00 0.73 -5.02 0.00
19 Item 50 1.70* 9.47 0.00 1.70* 9.48 0.00

As presented in Table 6, 6 out of 7 items measuring learning gains in skills and knowledge (i.e., items 32 to 38) had good model-data fit, with MNSQs ranging from 0.78 to 1.30. Item 38 (the seventh item in this subscale), which asked how much students learned to communicate and work with peers to improve their learning, showed misfit (Outfit MNSQ = 1.54; Infit MNSQ = 1.60). All items that measured learning gains in cognition (items 39-44) had acceptable MNSQs, ranging from 0.65 to 1.37. Five out of the six items that measured learning gains in attitude (items 45-50) fit the expected model well, with MNSQs ranging from 0.84 to 1.15. Item 50, which asked whether students were willing to seek help from others when necessary, failed to fit the model based on both the Infit and Outfit MNSQ.
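The infit and outfit statistics reported in Tables 5 and 6 correspond to what TAM's `tam.fit()` returns; a sketch, again assuming `mod3` from the earlier fitting sketch and the column names used by current TAM versions (an assumption worth verifying):

```r
# Item fit statistics (MNSQ and standardized fit) for the fitted model
fit <- tam.fit(mod3)
fit_table <- fit$itemfit[, c("parameter", "Outfit", "Outfit_t", "Infit", "Infit_t")]

# Flag items whose mean squares fall outside the 0.6-1.4 range used in the text
misfit <- subset(fit_table, Outfit < 0.6 | Outfit > 1.4 | Infit < 0.6 | Infit > 1.4)
misfit
```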
learning gains in skills and knowledge (i.e., items 32 to distribution (Liu, 2020, p. 40). In this study, person
38) had good model-data-fit, with MNSQs ranging from ability estimates represent levels of student engagement
0.78 to 1.30. Item seven, which asked how much and extent of learning gains. On the Wright map, items
students learned to communicate and work with peers to were arranged by difficulty estimates from easier to
improve their learning, had more misinformation (Outfit agree with at the bottom and the harder to agree with on
MNSQ = 1.54; Infit MNSQ = 1.60). All items that the top of the map. Individuals were arranged based on
measured learning gains in cognition (items 39-44) had levels of engagement and learning gains from higher at
acceptable MNSQs, ranging from 0.65 to 1.37. Five out the top to lower at the bottom.
of the six items that measured learning gains in attitude Evidence from the combined Wright Map for
(items 45-50) fit the expected model well, with the student classroom engagement (see Figure 2) indicated
MNSQs ranging from 0.84 to 1.15. Item 50, which asked that levels of student classroom engagement in
Liu et al. Development and Validation of a Formative Evaluation Instrument 415
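A Wright map like the ones described here can be drawn by combining TAM person and threshold estimates with the WrightMap package; a sketch under the same assumptions as before (`mod3` fitted with three dimensions; the WLE column naming is an assumption about current TAM output):

```r
# Sketch: item-person (Wright) map for the engagement model
library(WrightMap)

persons    <- tam.wle(mod3)          # person estimates (one theta column per dimension)
thresholds <- tam.threshold(mod3)    # Thurstonian item-by-category thresholds

theta <- as.matrix(persons[, grep("^theta", names(persons)), drop = FALSE])
wrightMap(theta, thresholds)
```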

Evidence from the combined Wright map for student classroom engagement (see Figure 2) indicated that levels of student classroom engagement in the behavioral, cognitive, and affective dimensions were normally distributed and that the spread of student engagement estimates was sufficient, with logits ranging from -2.5 to 3.5. The mean logit for student engagement was slightly higher than the average of the item estimates, suggesting that this set of questions was relatively easy for most students to agree with.

Figure 2
Combined Wright Map for Students' Classroom Engagement

Separated Wright maps were produced for all dimensions of student engagement. Student behavioral engagement was determined from the time and effort put into the course, interactions with peers and instructors, and participation in the course. The threshold Wright map (Figure 3) indicated that there were sufficient items to measure student classroom behavioral engagement. The results also indicated that the higher end of student engagement continued higher than the highest "difficult" item thresholds, which suggests that more items should be developed to assess the top end of student engagement precisely.

As shown in Figure 3, overall, the distribution of cognitive engagement estimates was acceptable, with a range of -2.0 to 4.5 logits. The threshold map indicated that the spread of items was sufficient to measure most levels of cognitive engagement. However, the higher end of cognitive engagement continued higher than the most "difficult" item, which suggests that more "difficult" questions were needed to capture higher cognitive engagement. Additionally, the efficiency of this set of questions was not ideal, with five items (items 1, 8, 11, 12, 13) found to be at the same level of difficulty for this population. Thus, for further development, some of these questions should be combined or deleted in order to improve the efficiency of the survey.

As shown in Figure 3, the spread of students' affective engagement was acceptable, with affective engagement estimates ranging from -2.5 to 3.5 logits. The mean logit of the six items was slightly lower than the mean logit of students' affective engagement, which suggested that the items were relatively easy for students to agree with. The
distribution of the threshold estimates had a sufficient spread to measure most levels of affective engagement, with the exception of the higher end. Thus, for further improvement, some more difficult questions should be considered to make this dimension more comprehensive.

Figure 3
Separated Wright Maps for Behavioral, Cognitive, and Affective Engagement

For the items measuring student learning gains, the spread of person estimates was acceptable, ranging from -2.5 to 4 logits, and the shape of the distribution was nearly normal (see Figure 4). The mean logit of the items was slightly smaller than the average logit of students' learning gains, which
indicates that students gained more through classroom learning than what was measured by the 19 items together, in terms of the learning gains in skills and knowledge, cognition, and attitude.

Figure 4
Combined Wright Map for Learning Gains

The threshold map (Figure 5) of learning gains in skills and knowledge indicates that the separation of item difficulty is acceptable but not sufficient to measure all levels of student learning gains in skills and knowledge. Two gaps, between items 6 and 3 and between items 7 and 1, indicate that more questions should be added.

The threshold map of learning gains in cognition (Figure 5) indicates that the six items measured some levels of cognitive learning gains, but the items were not sufficient to capture all levels of cognitive learning gains. The gap between items 1 and 3 suggests that more items should be added in this area.

Results from the threshold map for learning gains in attitude (Figure 5) indicate that the spread of the threshold difficulties is acceptable, though there was a gap between items 4 and 6. Additionally, none of the questions successfully measured the highest and lowest levels of learning gains in attitude. This should also be addressed in future improvement of the survey.

Figure 5
Separated Wright Maps for Learning Gains in Skills and Knowledge, Cognition, and Attitude

Reliability

The Expected A Posteriori (EAP) measures were calculated to test the reliability of the MS-CSSEL survey. As presented in Table 7, the results indicated a great extent of consistency for every sub-dimension of classroom engagement and learning gains, with reliability coefficients ranging from 0.83 to 0.92. In particular, the reliability was 0.83 for behavioral engagement, 0.87 for cognitive engagement, and 0.83 for affective engagement. Regarding the section measuring students' learning gains, the reliability was 0.91 for learning gains in skills and knowledge, 0.92 for learning gains in cognition, and 0.90 for learning gains in attitude.

Table 7
Reliability of Engagement and Learning Gains

                                  EAP Reliability
Classroom engagement
  Behavioral engagement                0.83
  Cognitive engagement                 0.87
  Affective engagement                 0.83
Learning gains
  Skills and knowledge                 0.91
  Cognitive                            0.92
  Attitude                             0.90
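The EAP reliability coefficients in Table 7 are stored directly on the fitted TAM object; a one-line sketch, assuming `mod3` (engagement) from the earlier fitting sketch. An analogous learning gains model would be handled the same way.

```r
# EAP reliability, one coefficient per dimension of the multidimensional model
eap_rel <- round(mod3$EAP.rel, 2)
names(eap_rel) <- c("Behavioral", "Cognitive", "Affective")
eap_rel
```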
Category Threshold Estimates

For the rating scale Rasch model, it is required that the average person estimates advance monotonically from lower-level categories to higher ones and that the difference between adjacent threshold estimates be at least 1.4 logits for a 5-point scale (Bond & Fox, 2015). The acceptable range of MNSQs for each category is from 0.6 to 1.4 (Bond & Fox, 2015).
In this study, none of the MNSQs of the categories were outside the acceptable range. The category Andrich threshold estimates suggested that the observed average for each category in both student engagement and learning gains increased monotonically. However, the difference of the Andrich threshold between "disagree" and "neutral" was 0.88, and the difference between "neutral" and "agree" was 0.74, which suggested that students had difficulty distinguishing neutral from disagree and agree. As shown in Figure 6, the highest probability peak of "neutral" was less than .5, which indicated that this category was not functioning well. The results suggest that the response option neutral should be investigated further or deleted.

Figure 6
Item Probability Curve for Student Engagement

In terms of the category probability statistics for learning gains, as presented in Table 8, the majority of threshold estimates were acceptable. However, the difference between the second and third threshold estimates was 0.75. Because this is less than the minimum threshold of 1.0, it suggests that students might not be able to differentiate categories 2 and 3. As shown in Figure 7, the highest probability peak for the second category was less than 0.5, which suggested that this category was not functioning well. For further improvement, it is reasonable to consider using a 4-point rating-scale category structure for the items measuring learning gains.
Table 8
Category Threshold Estimates

Category Label MNSQs Observed Average Andrich Threshold


Infit Outfit
Student engagement
Strongly Disagree 1.17 1.30 -.55 None
Disagree 1.00 1.03 -.10 -1.41
Neutral 0.93 0.94 .36 -.53
Agree 0.89 0.86 .90 .21
Strongly Agree 1.03 1.02 1.52 1.73
Learning Gains
1 1.22 1.36 -.89 None
2 1.00 1.06 -.35 -1.80
3 0.90 0.96 .34 -1.05
4 0.89 0.86 1.30 .43
5 1.06 1.04 2.60 2.42

Figure 7
Item Probability Curve for Learning Gains
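One way to act on the category diagnostics above is to check the spacing of adjacent Andrich thresholds and, where a step is too small, collapse the offending categories before refitting the model. A sketch using the learning gains thresholds reported in Table 8 and a hypothetical data frame `gains` holding the 19 learning gains items coded 1-5:

```r
# Adjacent Andrich threshold differences for the learning gains scale (Table 8)
thresholds <- c(-1.80, -1.05, 0.43, 2.42)
diff(thresholds)        # 0.75, 1.48, 1.99: the first step is too small

# Collapse categories 2 and 3 into one category before refitting (5 -> 4 categories)
gains_collapsed <- as.data.frame(lapply(gains, function(x) {
  ifelse(x >= 3, x - 1, x)    # 1,2,3,4,5 -> 1,2,2,3,4
}))
```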
Discussion

Validity and Reliability of the MS-CSSEL Survey

Overall, the Rasch analysis results suggest that the MS-CSSEL survey is a valid and reliable tool that can provide useful information to the instructor about student classroom engagement and learning gains in the middle of the semester. Regarding reliability, the EAP reliability estimates for each sub-dimension suggest that the survey has a large degree of consistency. In addition, the validity of the MS-CSSEL survey is supported by the following aspects: (a) this study supports the need to treat student engagement and learning gains as two multidimensional constructs by comparing the 3-dimensional model with the unidimensional model; (b) the moderately high correlation coefficients between sub-dimensions suggest a good internal relationship between dimensions; (c) the item fit statistics indicate that 47 out of 50 items contribute to the defined constructs; (d) the item-person maps show a good match between "person ability" and "item difficulty" for both the measures of classroom engagement and learning gains, which suggests a good internal structure of items; and (e) the category threshold estimates for the measures of student engagement and learning gains indicate that each category adequately measured the constructs.

Although most items in the MS-CSSEL survey functioned as expected, a few items have been identified for further improvement. First, the question regarding student attendance showed poor model-data fit. According to the definition of student behavioral engagement, attendance is an essential aspect that reflects how students engage in classroom learning. However, although the survey was conducted anonymously, students might not report their attendance accurately. Thus, this question should remain but needs revision. Second, the questions "During class time, I regularly work with other students on work assigned by the instructor" and "Learning to communicate and work with peers to improve my learning" also showed misfit to the Rasch model. These two questions address important aspects of engagement and learning gains. However, because most current large-enrollment university classes are lecture-based, collaborative learning between and among students occurs rarely. Thus, the instructor should consider how to incorporate collaborative learning in large classes, or whether these items are meaningful to the course, when using this survey. Third, the item-person maps suggest that the items effectively measure students' engagement and learning gains, but some apparent gaps between items need to be taken into consideration for further improvement. Fourth, the results of the category probability estimates implied that the response option "Neutral" could be deleted for the items measuring students' classroom engagement, and the response options "2" and "3" for the rating scale used to measure learning gains could be collapsed. These items may be improved in continuous development and validation of the instrument.

Use of Results for Teaching Improvement

Instructors can use data produced by the MS-CSSEL survey for teaching improvement. Instructors do not need to conduct the statistical analysis reported in this paper; instead, they can rely on each item's descriptive statistics to plan for teaching improvement. Any online survey platform, such as SurveyMonkey, can generate descriptive statistics for each item automatically. The mean scores of students' responses to each question provide detailed information on students' average performance regarding each aspect of engagement and learning gains. Since the MS-CSSEL survey adopts a 5-point category structure, we recommend that instructors pay attention to items that have mean scores below 3. For example, the results for behavioral engagement in this pilot study showed that students did not always ask questions in class (M = 2.71) and had limited interaction with the instructor outside the classroom (M = 2.25). Thus, for teaching improvement, the instructor may consider providing more opportunities for students to ask questions while teaching in the classroom and encourage them to interact with the instructor after class.
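The kind of descriptive screening recommended here requires only base R (or the survey platform's own report); a sketch, assuming `engagement` holds the raw 1-5 responses with one column per item:

```r
# Item-level means, flagging items below the scale midpoint of 3
item_means <- colMeans(engagement, na.rm = TRUE)
needs_attention <- sort(item_means[item_means < 3])
round(needs_attention, 2)
```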
Additionally, information on how the designed instructional components affect students' learning gains can provide further evidence for the instructor to consider in teaching improvement. For example, in this study, the average scores of "interactions with the instructor about learning" were lower than those of other instructional components in terms of their contributions to students' learning gains. Aligning with the findings identified through the Likert-type questions, the interaction between faculty and students is an essential aspect for further teaching improvement.

Researchers who use the instrument should conduct Rasch analysis to obtain interval measures of student engagement and learning gains. First, the individual Rasch scores of students' engagement and learning gains will help researchers understand the relationship between students' academic performance and their engagement and learning gains, which suggests better instructional strategies to teach students, especially low-performing students. For example, the individual "ability" scores could help researchers reach low-performing students to find out their strengths and weaknesses in engagement and learning gains, and then prepare more targeted mentoring and suggestions for improving their academic performance in the rest of the course.
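A sketch of how such person-level interval measures might be pulled from the fitted model, assuming `mod3` from the earlier sketches and TAM's usual naming of the per-dimension EAP columns (an assumption to verify on your TAM version):

```r
# Per-dimension EAP person scores; flag the lowest scorers on the first dimension
eap <- mod3$person[, grep("^EAP\\.Dim", names(mod3$person)), drop = FALSE]
head(eap[order(eap[, 1]), , drop = FALSE], 10)   # ten lowest behavioral engagement scores
```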
Second, by looking at the item-person map and the average scores of each item, researchers can identify the aspects with which the majority of students struggle. For example, the separated item-person map of behavioral engagement suggests that taking notes was the question that students most often agreed with, which means that most students took notes during classroom learning. However, the item that asked whether students always discussed their learning process with the instructor outside the classroom was less commonly agreed with. Thus, to maximize students' engagement, the instructor should think about how to provide more opportunities for student-faculty interactions.

Finally, this survey can benefit students in terms of self-regulation and train them to be self-directed learners. By taking this survey, students have an opportunity to monitor their engagement and reflect on what they have learned. Instructors can provide students with information on the average engagement and learning gains, as well as how high-performing students engage in classroom learning, as examples for other students to adjust their learning strategies. In this way, students will
have an opportunity to learn their strengths and weaknesses in the course.

Administration of the MS-CSSEL Survey

The MS-CSSEL survey is intended to measure student engagement and learning gains in the middle of the semester. Thus, it is appropriate to administer the survey around the middle of the semester in order to collect data for instructors to consider potential improvement while teaching the course.

The purpose of the MS-CSSEL survey is to help instructors identify the strengths and weaknesses of students' engagement and learning gains for adjusting teaching strategies. Thus, although the survey demonstrated high internal consistency and good construct validity, it should be used for formative evaluation of teaching instead of high-stakes faculty evaluation. To protect students' privacy and encourage them to take the survey without concerns, we recommend administering the survey anonymously. Finally, because the survey is quite long, having students complete it on a purely voluntary basis will likely decrease the response rate. Thus, we recommend that the instructor provide incentives to encourage student participation. For example, in this study, the instructor provided five extra credit points (~0.75% of the final grade) to the students who completed the survey on time; in this way, 88.9% of the students responded to the survey.

In conclusion, assessing teaching effectiveness through measuring student classroom engagement and learning gains is a viable way to address the lack of a unified definition of teaching effectiveness. Incorporating the MS-CSSEL survey in the middle of the semester will help instructors (a) monitor and understand students' engagement and learning process, (b) gauge whether students' learning gains match the instructor's expectations, and (c) adjust instruction for the remainder of the course. Furthermore, it will prepare students to be self-directed learners.

References

American Educational Research Association, American Psychological Association, Joint Committee on Standards for Educational and Psychological Testing (US), & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
Appleton, J. J., Christenson, S. L., Kim, D., & Reschly, A. L. (2006). Measuring cognitive and psychological engagement: Validation of the Student Engagement Instrument. Journal of School Psychology, 44(5), 427-445.
Atkins, M., & Brown, G. (2002). Effective teaching in higher education. Routledge.
Baghaei, P. (2012). The application of multidimensional Rasch models in large scale assessment and validation: An empirical example. Electronic Journal of Research in Educational Psychology, 10(1), 233-252.
Benton, S. L., & Cashin, W. E. (2012). IDEA paper #50: Student ratings of teaching: A summary of research and literature.
Berk, R. A. (2005). Survey of 12 strategies to measure teaching effectiveness. International Journal of Teaching and Learning in Higher Education, 17(1), 48-62.
Berliner, D. (2005). The near impossibility of testing for teacher quality. Journal of Teacher Education, 56(3), 205-214.
Bode, R. K., & Wright, B. D. (1999). In Higher education: Handbook of theory and research. Springer.
Bond, T., & Fox, C. M. (2015). Applying the Rasch model: Fundamental measurement in the human sciences. Routledge.
Briggs, D. C., & Wilson, M. (2003). An introduction to multidimensional measurement using Rasch models. Journal of Applied Measurement, 4(1), 87-100.
Carnell, E. (2007). Conceptions of effective teaching in higher education: Extending the boundaries. Teaching in Higher Education, 12(1), 25-40.
Centra, J. A., & Gaubatz, N. B. (2000). Is there gender bias in student evaluations of teaching? The Journal of Higher Education, 71(1), 17-33.
Chen, Y., & Hoshower, L. B. (2003). Student evaluation of teaching effectiveness: An assessment of student perception and motivation. Assessment & Evaluation in Higher Education, 28(1), 71-88.
Clayson, D. E. (2013). Initial impressions and the student evaluation of teaching. Journal of Education for Business, 88(1), 26-35.
Clayson, D. E., & Haley, D. A. (2011). Are students telling us the truth? A critical look at the student evaluation of teaching. Marketing Education Review, 21(2), 101-112.
Darling-Hammond, L., Jaquith, A., & Hamilton, M. (2012). Creating a comprehensive system for evaluating and supporting effective teaching. Stanford Center for Opportunity Policy in Education.
Delaney, J. G., Johnson, A., Johnson, T. D., & Treslan, D. (2010). Students' perceptions of effective teaching in higher education. [Link]
Fauth, B., Göllner, R., Lenske, G., Praetorius, A., & Wagner, W. (2020). Who sees what? Conceptual
considerations on the measurement of teaching Rogaten, J., Rienties, B., Sharpe, R., Cross, S.,
quality from different perspectives. Zeitschrift Für Whitelock, D., Lygo-Baker, S., & Littlejohn, A.
Pädagogik, 66, 138–155. (2019). Reviewing affective, behavioural and
Fredricks, J. A., & McColskey, W. (2012). Handbook of research on student engagement. Springer.
Galbraith, C. S., Merrill, G. B., & Kline, D. M. (2012). Are student evaluations of teaching effectiveness valid for measuring student learning outcomes in business related classes? A neural network and Bayesian analyses. Research in Higher Education, 53(3), 353-374.
Gravestock, P., & Gregor-Greenleaf, E. (2008). Student course evaluations: Research, models and trends. Higher Education Quality Council of Ontario.
Handelsman, J., Miller, S., & Pfund, C. (2007). Scientific teaching. Macmillan.
Hativa, N., Barak, R., & Simhi, E. (2001). Exemplary university teachers: Knowledge and beliefs regarding effective teaching dimensions and strategies. The Journal of Higher Education, 72(6), 699-729.
Henard, F., & Leprince-Ringuet, S. (2008). The path to quality teaching in higher education. [Link]
Janssen, R., & De Boeck, P. (1999). Confirmatory analyses of componential test structure using multidimensional item response theory. Multivariate Behavioral Research, 34(2), 245-268.
Kahu, E. R. (2013). Framing student engagement in higher education. Studies in Higher Education, 38(5), 758-773.
Lehrer-Knafo, O. (2019). How to improve the quality of teaching in higher education? The application of the feedback conversation for the effectiveness of interpersonal communication. EDUKACJA Quarterly, 149(2).
Linacre, J. M. (2020). Winsteps® Rasch measurement computer program. [Link]
Liu, X. (2020). Using and developing measurement instruments in science education: A Rasch modeling approach (2nd ed.). IAP.
Mitchell, K. M., & Martin, J. (2018). Gender bias in student evaluations. PS: Political Science & Politics, 51(3), 648-652.
Nasser, F., & Fresko, B. (2002). Faculty views of student evaluation of college teaching. Assessment & Evaluation in Higher Education, 27(2), 187-198.
Pounder, J. (2007). Is student evaluation of teaching worthwhile? An analytical framework for answering the question. Quality Assurance in Education, 15(2), 178-191.
Robitzsch, A., Kiefer, T., & Wu, M. (2019). TAM: Test analysis modules. R package version 3.3-10. [Link]
… cognitive learning gains in higher education. Assessment & Evaluation in Higher Education, 44(3), 321-337.
RStudio Team (2018). RStudio: Integrated development for R. RStudio, Inc. [Link]
Shevlin, M., Banyard, P., Davies, M., & Griffiths, M. (2000). The validity of student evaluation of teaching in higher education: Love me, love my lectures? Assessment & Evaluation in Higher Education, 25(4), 397-405.
Smith, A. B., Rush, R., Fallowfield, L. J., Velikova, G., & Sharpe, M. (2008). Rasch fit statistics and sample size considerations for polytomous data. BMC Medical Research Methodology, 8(1), 33.
Spooren, P., Brockx, B., & Mortelmans, D. (2013). On the validity of student evaluation of teaching: The state of the art. Review of Educational Research, 83(4), 598-642.
Stowell, J. R., Addison, W. E., & Smith, J. L. (2012). Comparison of online and classroom-based student evaluations of instruction. Assessment & Evaluation in Higher Education, 37(4), 465-473.
Strauss, M. E., & Smith, G. T. (2009). Construct validity: Advances in theory and methodology. Annual Review of Clinical Psychology, 5, 1-25.
Torres Irribarra, D., & Freund, R. (2014). WrightMap: IRT item-person map with ConQuest integration. [Link]
Trowler, V. (2010). Student engagement literature review. The Higher Education Academy, 11(1), 1-15.
Uttl, B., White, C. A., & Gonzalez, D. W. (2017). Meta-analysis of faculty's teaching effectiveness: Student evaluation of teaching ratings and student learning are not related. Studies in Educational Evaluation, 54, 22-42.
Van Zile-Tamsen, C. (2017). Using Rasch analysis to inform rating scale development. Research in Higher Education, 58(8), 922-933.
Wang, W. C., Chen, P. H., & Cheng, Y. Y. (2004). Improving measurement precision of test batteries using multidimensional item response models. Psychological Methods, 9(1), 116-136.
Worthington, A. C. (2002). The impact of student perceptions and characteristics on teaching evaluations: A case study in finance education. Assessment & Evaluation in Higher Education, 27(1), 49-64.
Yao, Y., & Grady, M. L. (2005). How do faculty make formative use of student evaluation feedback? A multiple case study. Journal of Personnel Evaluation in Education, 18(2), 107.
Young, S., Rush, L., & Shaw, D. (2009). Evaluating gender bias in ratings of university instructors' teaching effectiveness. International Journal for the Scholarship of Teaching and Learning, 3(2), n2.
Zabaleta, F. (2007). The use and misuse of student evaluations of teaching. Teaching in Higher Education, 12(1), 55-76.
____________________________

REN LIU is a Ph.D. candidate in the Department of Learning and Instruction, Graduate School of Education, University at Buffalo, State University of New York. He conducts research in the areas of college teaching improvement, measurement development using the Rasch model, program evaluation, and the application of Rasch measurement in higher education.

XIUFENG LIU is a Professor in the Department of Learning and Instruction, Graduate School of Education, University at Buffalo, State University of New York. He conducts research in the closely related areas of measurement and evaluation of STEM education, applications of Rasch measurement, and STEM teaching and learning. He was the inaugural director of the Center for Educational Innovation, with a mission of improving university teaching and student learning. Among the books he has published is Using and Developing Measurement Instruments in Science Education: A Rasch Modeling Approach (Second Edition, 2020, Information Age Publishing).

LARA HUTSON is a Clinical Associate Professor and Director of Undergraduate Studies in the Department of Biological Sciences at the University at Buffalo (SUNY). She is also coordinator and instructor of the second-semester introductory majors’ biology course (Cell Biology) and teaches biochemistry. Dr. Hutson’s most recent projects aim to reduce D/R/F/W rates in introductory STEM courses, including Attendance Tracking and Intervention (funded by the UB President’s Circle); SUNY Excels, a collaboration between STEM instructors at the SUNY Centers (funded by the State of New York); and UB Excite, a course redesign program (also funded by the State of New York).
Appendix

Part I: Classroom engagement

Response options for items 1-31: Strongly agree / Agree / Neutral / Disagree / Strongly disagree

Behavioral Engagement
1. I always ask content-related questions in class. □ □ □ □ □
2. I always complete assigned readings before coming to class. □ □ □ □ □
3. I always take notes during class. □ □ □ □ □
4. I always review my notes of previous classes before coming to the next class. □ □ □ □ □
5. During class time, I regularly work with other students on work assigned by the instructor. □ □ □ □ □
6. I always discuss my learning progress (e.g., grades, assignments, learning difficulties) with the instructor outside of the classroom. □ □ □ □ □
7. I always pay attention to the instruction during class time. □ □ □ □ □
8. If I have difficulty understanding something, I always seek additional help. □ □ □ □ □
9. I always contribute to class discussions. □ □ □ □ □
10. I always discuss assignments with other students. □ □ □ □ □
11. I have been absent from class fewer than 2 times so far. □ □ □ □ □

Cognitive Engagement
12. I spend more time and effort on this course (e.g., assignments, studying, reviewing notes) than on other courses. □ □ □ □ □
13. I spend much time and effort finding additional resources to help me complete the course work. □ □ □ □ □
14. I clearly understand the value and the importance of this course for my future learning and career. □ □ □ □ □
15. I work hard in order to meet the instructor’s expectations. □ □ □ □ □
16. I fully understand the course content. □ □ □ □ □
17. I memorize the course content (e.g., definitions, facts, ideas, or methods) and can recall it well. □ □ □ □ □
18. I always try to decompose an idea or theory to identify its components or elements. □ □ □ □ □
19. I synthesize knowledge (e.g., ideas, information, experiences) into more comprehensive interpretations and relationships. □ □ □ □ □
20. I critically examine the information, arguments, or methods learned from this course in order to make judgments about their value. □ □ □ □ □
21. I apply new knowledge and skills learned in this course to solve practical problems. □ □ □ □ □
22. Before beginning a task, I plan appropriate strategies and allocate sufficient time. □ □ □ □ □
23. I monitor my learning progress during the course. □ □ □ □ □
24. I consider the instructor’s feedback on my learning performance carefully and adjust my learning accordingly. □ □ □ □ □
25. I fully understand my strengths and weaknesses in learning in this course. □ □ □ □ □

Affective Engagement
26. I feel I belong to this course as a learning community. □ □ □ □ □
27. I feel I have a voice in this course. □ □ □ □ □
28. I feel comfortable talking to the instructor. □ □ □ □ □
29. I feel supported by the instructor. □ □ □ □ □
30. I am enthusiastic about learning new things. □ □ □ □ □
31. I am interested in the course content. □ □ □ □ □

1. Please describe how you have participated and engaged in this course so far.
2. Please suggest how the instructor may maintain and further improve your participation and engagement during the rest of the course.
Part II: Learning gains: How much have you learned so far in this course?

Please rate your learning gains in the following aspects from 1 (lowest) to 5 (highest).

Skills and Knowledge
32. Gaining factual knowledge (terminology, classifications, methods, trends). □ □ □ □ □
33. Learning fundamental principles, generalizations, or theories. □ □ □ □ □
34. Learning how to find and use resources for answering questions or solving problems. □ □ □ □ □
35. Developing specific skills, competencies, and points of view needed by professionals in the field most closely related to this course. □ □ □ □ □
36. Developing skill in expressing yourself orally or in writing. □ □ □ □ □
37. Learning to communicate with the instructor to improve my learning. □ □ □ □ □
38. Learning to communicate and work with peers to improve my learning. □ □ □ □ □

Cognitive
39. Developing creative capacities (writing, inventing, designing). □ □ □ □ □
40. Gaining a broader understanding and appreciation of key concepts of this course. □ □ □ □ □
41. Learning to analyze and critically evaluate ideas, arguments, and points of view related to the key topics in this course. □ □ □ □ □
42. Learning to synthesize and organize new knowledge in a more complex and comprehensive way. □ □ □ □ □
43. Learning to apply course material (to improve thinking, problem solving, and decision making). □ □ □ □ □
44. Learning to apply ideas from this class to ideas encountered in other classes within this subject area. □ □ □ □ □

In general, how much has each of the following aspects of the course helped your cognitive learning gains (Q32-Q44)?
• Lecture presentation □ □ □ □ □
• Assigned class activities (e.g., discussions, problem solving, case studies) □ □ □ □ □
• Graded assignments □ □ □ □ □
• Feedback from instructors on my work on class assignments and exams □ □ □ □ □
• Course materials (e.g., textbooks and supplementary readings) □ □ □ □ □
• Examinations and quizzes □ □ □ □ □
• Online notes or resources posted by the instructor □ □ □ □ □
• Interactions with the instructor about your learning □ □ □ □ □

Attitude
45. Acquiring an interest in learning more knowledge and skills from this course, and having an interest in taking additional classes in this field. □ □ □ □ □
46. Developing a clearer understanding of, and commitment to, personal values. □ □ □ □ □
47. Enthusiasm for the subject. □ □ □ □ □
48. Confidence that you understand the course materials. □ □ □ □ □
49. Feeling comfortable working with complex ideas or tasks in this field. □ □ □ □ □
50. Willingness to seek help from others (e.g., teacher, peers, TA) when necessary. □ □ □ □ □

In general, how much has each of the following aspects of the course helped your affective learning gains (Q45-Q50)?
• Lecture presentation □ □ □ □ □
• Assigned class activities (e.g., discussions, problem solving, case studies) □ □ □ □ □
• Graded assignments □ □ □ □ □
• Feedback from instructors on my work on class assignments and exams □ □ □ □ □
• Course materials (e.g., textbooks and supplementary readings) □ □ □ □ □
• Examinations and quizzes □ □ □ □ □
• Online notes or resources posted by the instructor □ □ □ □ □
• Interactions with the instructor about your learning □ □ □ □ □

1. Please describe what you have learned in this course so far.
2. Please suggest how the instructor may maintain and further improve instruction in order to maximize your learning gains during the rest of the course.
Part III: Demographics

If you do not want to respond to any of the questions below, you may choose to skip it.

What is your gender? Male □  Female □
What is your race? (please choose all that apply) White □  Hispanic/Latino □  African American □  Native American □  Asian □  Others, please specify: --------
Are you a first-generation college student? Yes □  No □
What is your classification? Freshman □  Sophomore □  Junior □  Senior □
What are your reasons for registering in this course? Required □  Elective □  Others: --------
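
As a purely illustrative sketch (not part of the article or the instrument), the R code below shows how responses to the Part I engagement items could be fitted with a multidimensional Rasch (partial credit) model using the TAM package (Robitzsch et al., 2019) and displayed as a Wright map with the WrightMap package (Torres Irribarra & Freund, 2014), both cited above. The file name ms_cssel_part1.csv, the 0-4 scoring of the five response categories, and the estimation defaults are assumptions; only the assignment of items 1-11, 12-25, and 26-31 to the Behavioral, Cognitive, and Affective Engagement dimensions follows the section headings in Part I.

# Hypothetical illustration only; file name, scoring, and options are assumptions.
library(TAM)        # Robitzsch, Kiefer, & Wu (2019)
library(WrightMap)  # Torres Irribarra & Freund (2014)

# Part I responses, one row per student and one column per item,
# scored 0 (strongly disagree) to 4 (strongly agree) -- assumed coding.
resp <- read.csv("ms_cssel_part1.csv")

# Between-item Q-matrix mapping the 31 items onto the three Part I dimensions.
Q <- matrix(0, nrow = 31, ncol = 3,
            dimnames = list(NULL, c("Behavioral", "Cognitive", "Affective")))
Q[1:11, 1]  <- 1   # Behavioral Engagement (items 1-11)
Q[12:25, 2] <- 1   # Cognitive Engagement (items 12-25)
Q[26:31, 3] <- 1   # Affective Engagement (items 26-31)

# Multidimensional Rasch (partial credit) model.
mod <- TAM::tam.mml(resp = resp, Q = Q)

summary(mod)              # item parameters and EAP reliabilities
fit <- TAM::tam.fit(mod)  # infit / outfit mean-square statistics
wle <- TAM::tam.wle(mod)  # person ability (WLE) estimates, one set per dimension

# Wright map: person distributions plotted against Thurstonian item thresholds.
thresholds <- TAM::tam.threshold(mod)
theta <- as.matrix(wle[, grep("^theta", names(wle)), drop = FALSE])
WrightMap::wrightMap(theta, thresholds)

The between-item Q-matrix, in which each item loads on exactly one dimension, mirrors the three Part I subscales; other structures would require a different Q-matrix.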
