Teaching Effectiveness Questionnaire
This study provides preliminary validation of a newly designed instrument that evaluates teaching effectiveness through student classroom engagement and learning gains. The instrument, titled the Middle Semester Classroom Survey of Student Engagement and Learning (MS-CSSEL), consists of 31 items measuring student classroom engagement in three dimensions and 19 items measuring student learning gains in three dimensions. To validate the instrument, 634 undergraduate students at a four-year research university participated in this study. The multidimensional Rasch model was used to conduct the analysis. The findings indicated that (a) items displayed a good fit to the Rasch model; (b) dimensions were distinct from each other; and (c) the items displayed high reliability. This instrument measures teaching effectiveness from a new perspective and provides college teachers with a new tool to gauge student engagement and learning gains for evidence-based teaching improvement.
“teaching that brings about effective and successful student learning that is deep and meaningful.” In addition, Darling-Hammond and colleagues (2012) considered effective teaching to be instruction that enables students to learn. Effective teaching should meet the demands of the discipline, instructional goals, and students’ needs in the teaching and learning environment (Darling-Hammond et al., 2012). “Effective teaching is about reaching achievement goals; it is about students learning what they are supposed to in a particular context, grade, or subject” (Berliner, 2005, p. 207). Carnell (2007) conducted a qualitative study with eight instructors teaching in higher education to examine university teachers’ conceptions of effective teaching. Although the instructors had different teaching experiences, they all considered effective teaching to be teaching that enables students’ learning (Carnell, 2007).

Student Ratings of Instruction (SRI)

The use of SRI to measure and interpret teaching effectiveness has increased in higher education institutions since the 1900s. However, SRI has limitations for measuring teaching quality (Clayson & Haley, 2011; Pounder, 2007; Shevlin et al., 2000; Spooren et al., 2013; Uttl et al., 2017). First, both the content validity and the construct validity of commonly used teaching evaluation instruments have been questioned (Spooren et al., 2013). Due to the lack of consensus on effective teachers' characteristics, there is large variation in the scope of the instruments used to measure teaching effectiveness, especially in the defined dimensions of teaching effectiveness (Spooren et al., 2013). In addition, the majority of instruments currently in use were designed by administrators without considering other essential stakeholders’ views of effective teaching, which raises questions about the content validity of the instruments (Spooren et al., 2013).

Second, students’ ratings can be affected by a variety of factors other than teaching practices, which raises concerns about the accuracy of using SRI results for high-stakes decisions (Pounder, 2007; Shevlin et al., 2000; Worthington, 2002). Shevlin et al. (2000) conducted a study with 213 undergraduate students in a social science department at a UK university, exploring the potential relationship between students’ perceptions of the lecturer and their ratings of instruction. The results indicate that charisma factors account for 69% and 37% of the variation in lecturer ability and module attributes, respectively (Shevlin et al., 2000). In addition, in a comprehensive review of the literature, Pounder (2007) synthesized a variety of potential student-level, course-level, and teacher-level factors that affect student ratings, concluding that relying only on SRI to measure teaching and learning in the classroom was problematic because the questions on the SRI failed to capture what happened in classroom settings (Pounder, 2007). Worthington (2002) conducted a case study in a finance course to investigate the effects of students’ characteristics and their perceptions of the use of SRI. The results suggest that the expected grade in the subject, student age, race, gender, and perceptions of the evaluation process all have significant impacts on the ratings given to instructors.

Third, SRI may discriminate against instructors based on their background characteristics, especially female instructors. Centra and Gaubatz (2000) conducted a study to investigate the relationship between students’ gender and instructors’ gender across classes with regard to instruction ratings. The results indicate that both within the same class and across all classes, there is a significant difference between male and female students when rating female instructors, but no significant difference is detected for male instructors (Centra & Gaubatz, 2000). Female instructors are more likely to receive lower ratings from male students, even after controlling for the effects of class size (Centra & Gaubatz, 2000). Similarly, Young and colleagues (2009) conducted a study to explore gender bias in ratings of instructors and instruction, and the interaction between students’ and instructors’ characteristics, especially the effects of gender. The results show a potential gender preference when rating instructors on pedagogical characteristics and course content characteristics (Young et al., 2009). A more recent qualitative study by Mitchell and Martin (2018) investigated the underlying relationship between gender and student evaluation of teaching. Based on an analysis of student comments, the authors found that students did evaluate their professors with gender bias. Students tend to comment on female instructors’ appearance and personality more than male instructors’, and show less professional respect to women instructors (Mitchell & Martin, 2018).

Considering the limitations of traditional SRI, the purpose of this study is to develop and validate a formative evaluation instrument, titled the Middle Semester Classroom Survey of Student Engagement and Learning (MS-CSSEL), that can be used in college teaching for instructional improvement through the lens of students’ engagement and learning gains. The measures of engagement and learning gains are intended to provide comprehensive information for instructors to adjust teaching strategies during the course of instruction.

Theoretical Framework

Reliability and validity are two essential psychometric properties for any measurement instrument. According to the Standards for
abilities, such as analyzing, memorizing, synthesizing, evaluating, and applying (Rogaten et al., 2019). Affective learning gains are mainly measured as the change in attitude during a course, such as confidence, motivation, and interest (Rogaten et al., 2019). In addition, how much students have learned about content knowledge is also an important indicator of students’ learning gains. For this study, questions about content knowledge are incorporated into the behavioral dimension. Figure 1 presents the conceptual framework of the constructs measured in this study.

Figure 1
Conceptual Framework

Instrument Construction

Based on Kahu’s (2013) framework, the MS-CSSEL survey used five-point Likert-scale questions to assess student engagement. Behavioral engagement was assessed using 11 questions relating to how much time and effort students allocated to educational tasks, interactions with peers and instructors both inside and outside the classroom, and attendance and participation. Cognitive engagement was assessed using 14 questions relating to how much extra work students did for the course, beliefs and values related to the course, cognitive strategies used to study the course content, and self-regulation strategies. Affective engagement was assessed using six items relating to a sense of belonging in the course, enthusiasm, and interest in the subject.

Regarding the items measuring student learning gains, learning gains in skills and knowledge were assessed using seven questions that covered gains in content knowledge, critical thinking, problem-solving, communication, and cooperation. Cognitive learning gains were assessed using six questions related to understanding, analyzing and synthesizing, evaluating, and applying the course contents. Learning gains in attitude were assessed using six questions relating to enthusiasm, interest, and confidence in the content and the course.

In addition to the 50 items above, the MS-CSSEL survey includes two sets of questions asking to what extent a student’s skills and knowledge gains, cognitive gains, and affective gains have been affected by particular instructional practices, such as the lecture, assigned class activities, and graded assignments.
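For readers who later fit the multidimensional model described in the next section, the item-to-dimension assignment above can be encoded as a Q-matrix. The sketch below is illustrative only: it assumes the 31 engagement items are ordered as described (items 1-11 behavioral, 12-25 cognitive, 26-31 affective), which may not match the published item order.

    # Illustrative Q-matrix: 31 engagement items (rows) by 3 dimensions (columns).
    # The item ordering is assumed, not taken from the published survey.
    Q_engage <- matrix(0, nrow = 31, ncol = 3,
                       dimnames = list(paste0("item", 1:31),
                                       c("behavioral", "cognitive", "affective")))
    Q_engage[1:11,  "behavioral"] <- 1   # 11 behavioral items
    Q_engage[12:25, "cognitive"]  <- 1   # 14 cognitive items
    Q_engage[26:31, "affective"]  <- 1   # 6 affective items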
At the end of the engagement section, two open-ended questions asked students how they had participated and engaged in the course and how the instructor could maintain and improve their engagement for the rest of the semester. Another two open-ended questions were attached to the learning gains section, asking students to describe what they had learned so far and to provide suggestions to the instructor for the rest of the semester (see Appendix for the entire survey).

Data Collection

The MS-CSSEL survey was administered to a target audience at one research university. In total, 634 undergraduate students in two introductory biology courses participated in this study (see Table 1).
Table 1
Descriptive Statistics of the Sample (N = 634)

Category                        N
Gender
  Male                        211
  Female                      391
  Missing                      32
Classification
  Freshmen                    424
  Sophomore                   104
  Junior                       39
  Senior                       36
  Missing                      31
Race
  White                       275
  Hispanic/Latino              43
  African American             54
  Native American               0
  Asian                       189
  Others                       40
  Missing                      33

The MS-CSSEL survey was administered in the middle of the Spring 2019 semester via a commercial online survey platform, SelectSurvey, which is similar to common platforms such as SurveyMonkey and Qualtrics. The platform collected students’ responses and generated descriptive reports automatically. Students were invited by email with a survey link and were given two weeks to complete the survey. The survey was conducted anonymously. The instructor provided 5 extra credit points (0.75% of the final grade) to students to encourage participation. In total, 713 students were invited by email and 634 of them completed the survey, a response rate of 88.9%. As recommended by Wright and Douglas (1975), to obtain item calibrations stable within ± ½ logit at the 95% confidence level, the minimum sample size is between 64 and 144. Thus, it is reasonable to assume that this study has an appropriate sample size for stable item calibration.

Traditionally, many Likert-scale evaluation instruments used in higher education have been developed according to Classical Test Theory (CTT), which assumes that all items on the survey have the same standard error and that threshold estimates between categories are equal for all items (Van Zile-Tamsen, 2017). Researchers have demonstrated limitations of using CTT to analyze rating-scale or Likert-scale data (Bode & Wright, 1999).

Based on validity and reliability theory, “using Rasch models to develop measurement instruments, or Rasch modeling, is a systematic process in which items are purposefully constructed according to theory and empirically tested through Rasch models in order to produce a set of items that define a linear measurement scale” (Liu, 2020, p. 34). To address the limitations of CTT, this study applied the Rasch modeling approach to provide more accurate statistical evidence for the reliability and validity claims of the MS-CSSEL survey.

According to the literature, student engagement and learning gains are two multidimensional constructs. Thus, this study applied the multidimensional rating scale Rasch model for data analysis. All items measuring classroom engagement adopted the same Likert-scale categories from strongly agree to strongly disagree. Because all item statements are positively worded, strongly agree was coded as 5 and strongly disagree as 1. The items measuring learning gains adopted a 5-point rating scale from 5 to 1. Items measuring students’ classroom engagement and learning gains were analyzed separately, following the same modeling procedures.
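For reference, one standard way to write the rating scale Rasch model underlying this analysis is given below; the notation is ours, following common Rasch texts, not a formula reproduced from the article. The multidimensional extension simply locates each person on the dimension to which the item belongs.

\[
\ln \frac{P_{nik}}{P_{ni(k-1)}} = \theta_{n d(i)} - \delta_i - \tau_k
\]

Here \(P_{nik}\) is the probability that person \(n\) responds in category \(k\) of item \(i\); \(\theta_{nd(i)}\) is person \(n\)'s location on the dimension \(d(i)\) to which item \(i\) is assigned; \(\delta_i\) is the item difficulty; and \(\tau_k\) is the threshold between categories \(k-1\) and \(k\), shared by all items on the scale (this sharing is what distinguishes the rating scale model from the partial credit model).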
We used the “TAM” (Robitzsch et al., 2019) and “WrightMap” (Torres Irribarra & Freund, 2014) packages in RStudio (RStudio Team, 2018) for the Rasch analysis. Item fit statistics, EAP reliability coefficients, and item-person maps were generated. Additionally, we used Winsteps 4.5.4 (Linacre, 2020) to test the appropriateness of the category structure. The item probability curves for student engagement and learning gains were drawn separately.

Results

Dimension Structure

Although student classroom engagement and learning gains are widely believed to be multidimensional constructs, there is little statistical evidence supporting the dimensional structure of these concepts. To test the necessity of using multidimensional models, we ran both unidimensional and 3-dimensional rating scale Rasch models and then compared the results. Because the unidimensional model is hierarchically nested within the multidimensional model, the two models can be compared by testing the significance of the change in their deviance, which describes the difference between the estimated model and the true model of the concept (Baghaei, 2012). Briggs and Wilson (2003) indicated that the difference in deviance between two estimated models approximately follows a chi-square distribution, with degrees of freedom equal to the difference in the number of parameters estimated in the two models. Janssen and De Boeck (1999) recommended selecting the model with significantly smaller deviance. As shown in Table 2, the 3-dimensional Rasch model had significantly smaller deviance than the unidimensional model for student engagement, χ2(5) = 570.76, p < .05, suggesting that the 3-dimensional Rasch model was a better solution for modeling student engagement than the unidimensional model.
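The comparison just described can be sketched in R with the TAM package cited in the Methods section. This is an illustrative sketch, not the authors' script: the response object resp (634 rows by 31 engagement items scored 1-5) and the model names are hypothetical.

    library(TAM)

    # Q-matrix assigning the 31 engagement items to 3 dimensions
    # (items 1-11 behavioral, 12-25 cognitive, 26-31 affective; assumed order)
    Q <- matrix(0, nrow = 31, ncol = 3)
    Q[1:11, 1]  <- 1
    Q[12:25, 2] <- 1
    Q[26:31, 3] <- 1

    # TAM codes categories from 0, so shift the 1-5 responses to 0-4
    resp0 <- resp - 1

    # Unidimensional vs. 3-dimensional rating scale Rasch models
    mod1 <- tam.mml(resp0, irtmodel = "RSM")
    mod3 <- tam.mml(resp0, Q = Q, irtmodel = "RSM")

    # Likelihood-ratio (deviance) comparison of the nested models
    anova(mod1, mod3)

    # EAP reliability for each dimension of the retained model
    mod3$EAP.rel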
Table 2
Model Comparison

The same modeling process was applied to test the difference in deviance between the unidimensional and 3-dimensional models used to analyze learning gains. The results indicated that the 3-dimensional Rasch modeling approach fit the true model of learning gains better than the unidimensional model, with a significant change in deviance, χ2(5) = 339.20, p < .05. Thus, we selected the 3-dimensional Rasch model for testing the quality of the measures of engagement and learning gains.

In addition to the comparison of deviance, the correlations between the dimensions of student engagement and of learning gains were used to provide additional information on the precision of the measurement instrument (see Table 3 and Table 4). When the multidimensional approach is used, “the higher the correlations, the greater the number of latent traits, and the shorter the subtests,” the more it improves measurement precision (Wang et al., 2004, p. 125). As shown in Table 3, students’ behavioral engagement had a significantly high correlation with students’ cognitive engagement (r = 0.78) and with affective engagement (r = 0.71). In addition, students’ cognitive engagement had a high correlation with students’ affective engagement (r = 0.94). Although smaller correlation estimates help to differentiate dimensions, a higher correlation estimate does not necessarily imply an identical dimension (Baghaei, 2012). Table 4 presents the correlation matrix for learning gains. Overall, the dimensions of learning gains show a high degree of correlation (ranging from 0.84 to 0.95), which suggests that the measurement instrument has a high degree of precision.

Table 3
Correlation Matrix for the Dimensions of Classroom Engagement

Table 4
Correlation Matrix for the Dimensions of Learning Gains
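In TAM, latent correlations such as those reported in Tables 3 and 4 can be read off the estimated covariance matrix of the dimensions. A small sketch, assuming mod3 is the 3-dimensional model fitted in the earlier sketch:

    # Latent covariance matrix of the three dimensions, converted to
    # a correlation matrix with base R's cov2cor()
    round(cov2cor(mod3$variance), 2)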
Item Fit Statistics

The Rasch modeling approach provides four indices for determining how well the data fit the expected Rasch model. The mean square fit statistics (MNSQs) indicate how much misfit is observed between the Rasch model’s expected item performance and the actual performance in the data matrix (Bond & Fox, 2015). For the mean square statistics, the closer to 1, the better the model-data fit. For Likert-scale and rating scale questions, the commonly accepted range of the mean square statistics is from 0.6 to 1.4 (Bond & Fox, 2015; Linacre, 2019).
The standardized fit statistics (ZSTDs) indicate how likely the degree of misfit expressed by the mean square statistics is to be observed (Bond & Fox, 2015). When the sample size is between 30 and 300, the acceptable range for the standardized fit statistic is from -2.0 to 2.0 (Bond & Fox, 2015).

Typically, decisions about model-data fit depend on those four indices equally, but it is reasonable to make decisions according to particular indices for a particular purpose (Bond & Fox, 2015). For example, if the sample size is larger than 300, the ZSTDs are more likely to be too sensitive (i.e., with many items failing to fit the model; Linacre, 2019). The standardized fit statistics depend strongly on the sample size, which inflates putative Type I error rates; the mean square statistics, however, are comparatively insensitive to sample size (Smith et al., 2008).
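For readers unfamiliar with these indices, the outfit and infit mean squares are conventionally defined as follows (standard Rasch definitions, not formulas quoted from this article). With \(x_{ni}\) person \(n\)'s response to item \(i\), \(E_{ni}\) its model expectation, and \(W_{ni}\) its model variance:

\[
z_{ni} = \frac{x_{ni} - E_{ni}}{\sqrt{W_{ni}}}, \qquad
\mathrm{Outfit}_i = \frac{1}{N} \sum_{n=1}^{N} z_{ni}^2, \qquad
\mathrm{Infit}_i = \frac{\sum_{n=1}^{N} W_{ni}\, z_{ni}^2}{\sum_{n=1}^{N} W_{ni}}
\]

Outfit weights all standardized residuals equally and is therefore sensitive to outlying responses; infit weights them by information and is more sensitive to misfit among persons located near the item's difficulty.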
Over 600 students participated in this study; therefore, decisions about model-data fit were made primarily on the basis of the MNSQs. Overall, all 31 items measuring student classroom engagement fit the expected Rasch model well (Table 5). For classroom behavioral engagement, 8 of the 11 items demonstrated good model-data fit, with MNSQs ranging from 0.82 to 1.37; the MNSQs of items 3, 5, and 11 were outside the acceptable range (see Table 5). For all 14 items measuring cognitive engagement, MNSQs ranged from 0.63 to 1.23, indicating an acceptable level of model-data fit. All six items measuring student affective engagement also fit the expected Rasch model, with MNSQs ranging from 0.63 to 0.86.

Table 5
Fit Statistics for Classroom Engagement

As presented in Table 6, six of the seven items measuring learning gains in skills and knowledge (i.e., items 32 to 38) had good model-data fit, with MNSQs ranging from 0.78 to 1.30. The seventh item, which asked how much students had learned to communicate and work with peers to improve their learning, showed greater misfit (Outfit MNSQ = 1.54; Infit MNSQ = 1.60). All items measuring learning gains in cognition (items 39-44) had acceptable MNSQs, ranging from 0.65 to 1.37. Five of the six items measuring learning gains in attitude (items 45-50) fit the expected model well, with MNSQs ranging from 0.84 to 1.15. Item 50, which asked whether students were willing to seek help from others when necessary, failed to fit the model on both the Infit and Outfit MNSQ.

Table 6
Fit Statistics of Learning Gains

Internal Structure of Items

The item-person map, also called the “Wright map,” places the person and item estimates on the same logit scale, with the person ability estimates distributed on the left and the item difficulty estimates on the right. Generally, a good measurement instrument should match the sample’s ability distribution with the items’ difficulty distribution (Liu, 2020, p. 40). In this study, person ability estimates represent levels of student engagement and extent of learning gains. On the Wright map, items were arranged by difficulty estimates, from easier to agree with at the bottom to harder to agree with at the top. Individuals were arranged by level of engagement and learning gains, from higher at the top to lower at the bottom.
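A sketch of how such a map and the fit statistics above can be produced from a fitted TAM model with the WrightMap package named in the Methods; the extractor calls are standard TAM functions, but the object names (and the layout of the person table, from which the EAP columns are pulled) are our assumptions:

    library(TAM)
    library(WrightMap)

    # Item fit statistics (Infit/Outfit MNSQ and their standardized forms)
    fit3 <- tam.fit(mod3)
    fit3$itemfit

    # Person EAP estimates (one column per dimension) and Thurstonian
    # thresholds for every item-category step
    theta <- mod3$person[, grep("^EAP", colnames(mod3$person))]
    thr   <- tam.threshold(mod3)

    # Draw the item-person ("Wright") map: persons left, items right
    wrightMap(theta, thr)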
Evidence from the combined Wright map for student classroom engagement (see Figure 2) indicated that levels of student classroom engagement in the behavioral, cognitive, and affective dimensions were normally distributed and that the spread of the student engagement estimates was sufficient, with logits ranging from -2.5 to 3.5. The mean logit for student engagement was slightly higher than the average of the item estimates, suggesting that this set of questions was relatively easy for most students to agree with.

Figure 2
Combined Wright Map for Students' Classroom Engagement

Separate Wright maps were produced for each dimension of student engagement. Student behavioral engagement was determined from the time and effort put into the course, interactions with peers and instructors, and participation in the course. The threshold Wright map (Figure 3) indicated that there were sufficient items to measure student classroom behavioral engagement. The results also indicated that the higher end of student engagement continued above the highest “difficult” item thresholds, which suggests that more items should be developed to assess the top end of student engagement precisely.
As shown in Figure 3, the distribution of cognitive engagement estimates was overall acceptable, with a range of -2.0 to 4.5 logits. The threshold map indicated that the spread of items was sufficient to measure most levels of cognitive engagement. However, the higher end of cognitive engagement continued above the most “difficult” item, which suggests that more “difficult” questions are needed to capture higher cognitive engagement. Additionally, the efficiency of this set of questions was not ideal, with five items (items 1, 8, 11, 12, and 13) found to be at the same level of difficulty for this population. Thus, for further development, some of these questions should be combined or deleted in order to improve the efficiency of the survey.

Figure 3
Separated Wright Maps for Behavioral, Cognitive, and Affective Engagement

As also shown in Figure 3, the spread of students’ affective engagement was acceptable, with affective engagement estimates ranging from -2.5 to 3.5 logits. The mean logit of the six items was slightly lower than the mean logit of students’ affective engagement, which suggests that the items were relatively easy for students to agree with.
The distribution of the threshold estimates had sufficient spread to measure most levels of affective engagement, with the exception of the higher end. Thus, for further improvement, some more difficult questions should be considered to make this dimension more comprehensive.

For the items measuring student learning gains, the results suggested that the spread of person estimates was acceptable, ranging from -2.5 to 4 logits, and that the shape of the distribution was nearly normal (see Figure 4). The mean logit of the items was slightly smaller than the average logit of students’ learning gains, which indicates that students gained more through classroom learning than was measured by the 19 items together, in terms of learning gains in skills and knowledge, cognition, and attitude.

Figure 4
Combined Wright Map for Learning Gains

Figure 5
Separated Wright Maps for Learning Gains in Skills and Knowledge, Cognition, and Attitude
The threshold map (Figure 5) of learning gains in skills and knowledge indicates that the separation of item difficulty is acceptable but not sufficient to measure all levels of student learning gains in skills and knowledge. Two gaps, between items 6 and 3 and between items 7 and 1, indicate that more questions should be added.

The threshold map of learning gains in cognition (Figure 5) indicates that the six items measured some levels of cognitive learning gains, but the items were not sufficient to capture all levels of cognitive learning gains. The gap between items 1 and 3 suggests that more items should be added in this area.

Results from the threshold map for learning gains in attitude (Figure 5) indicate that the spread of the threshold difficulties is acceptable, though there was a gap between items 4 and 6. Additionally, none of the questions successfully measured the highest and lowest levels of learning gains in attitude. This should also be addressed in future improvement of the survey.

The Expected A Posteriori (EAP) reliability measures were calculated to test the reliability of the MS-CSSEL survey. As presented in Table 7, the results indicated a great extent of consistency for every sub-dimension of classroom engagement and learning gains, with reliability coefficients ranging from 0.83 to 0.92. In particular, the reliability was 0.83 for behavioral engagement, 0.87 for cognitive engagement, and 0.83 for affective engagement. For the section measuring students’ learning gains, the reliability was 0.91 for learning gains in skills and knowledge, 0.92 for learning gains in cognition, and 0.90 for learning gains in attitude.

Table 7
Reliability of Engagement and Learning Gains

                            EAP Reliability
Classroom engagement
  Behavioral engagement                0.83
  Cognitive engagement                 0.87
  Affective engagement                 0.83
Learning gains
  Skills and knowledge                 0.91
  Cognitive                            0.92
  Attitude                             0.90

Category Threshold Estimates

For the rating scale Rasch model, it is required that the average person estimates advance monotonically from lower-level categories to higher ones, and that neighboring threshold estimates advance by at least 1.4 logits for a 5-point scale (Bond & Fox, 2015). The acceptable range of MNSQs for each category is from 0.6 to 1.4 (Bond & Fox, 2015). In this study, none of the MNSQs of the categories were outside the acceptable range. The category Andrich threshold estimates suggested that the observed average for each category in both student engagement and learning gains advanced monotonically.
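The monotonic-advance check can be automated; a small sketch, assuming thr is an items-by-categories matrix of threshold estimates (e.g., exported from Winsteps or computed in R) and using the 1.4-logit criterion cited above:

    # For each item (row), check that successive category thresholds
    # increase by at least 1.4 logits
    advances_ok <- apply(thr, 1, function(row) all(diff(row) >= 1.4))
    which(!advances_ok)  # items whose category structure needs review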
Figure 6
Item Probability Curve for Student Engagement
Table 8
Category Threshold Estimates
Figure 7
Item Probability Curve for Learning Gains
engagement and learning gains, which suggests a good internal structure of items; and (e) the category threshold estimates for the measures of student engagement and learning gains indicate that each category adequately measured the constructs.

Although most items in the MS-CSSEL survey functioned as expected, a few items were identified for further improvement. First, the question regarding student attendance showed poor model-data fit. According to the definition of student behavioral engagement, attendance is an essential aspect that reflects how students engage in classroom learning. However, although the survey was conducted anonymously, students might not have reported their attendance accurately. Thus, this question should remain but needs revision. Second, the questions “During class time, I regularly work with other students on work assigned by the instructor” and “Learning to communicate and work with peers to improve my learning” also showed misfit to the Rasch model. These two questions address important aspects of engagement and learning gains. However, because most current large-enrollment university classes are lecture-based, collaborative learning between and among students rarely occurs. Thus, instructors should consider how to incorporate collaborative learning in large classes, or whether these items are meaningful for the course, when using this survey. Third, the item-person map suggests that the items effectively measure students’ engagement and learning gains, but some apparent gaps between items need to be taken into consideration for further improvement. Fourth, the category probability estimates implied that the response option “Neutral” could be deleted for the items measuring students’ classroom engagement, and that response options "2" and "3" for the rating scale used to measure learning gains could be collapsed. These items may be improved in continued development and validation of the instrument.

Use of Results for Teaching Improvement

Instructors can use data produced by the MS-CSSEL survey for teaching improvement. Instructors do not need to conduct the statistical analysis reported in this paper; instead, they can rely on each item's descriptive statistics to plan for teaching improvement. Any online survey platform, such as SurveyMonkey, can generate descriptive statistics for each item automatically. The mean scores of students' responses to each question provide detailed information on students’ average performance on each aspect of engagement and learning gains. Since the MS-CSSEL survey adopts a 5-point category structure, we recommend that instructors pay attention to items with mean scores below 3. For example, the behavioral engagement results in this pilot study showed that students did not always ask questions in class (M = 2.71) and had limited interaction with the instructor outside the classroom (M = 2.25). Thus, for teaching improvement, the instructor may consider providing more opportunities for students to ask questions in class and encourage them to interact with the instructor after class.
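This screening rule is easy to reproduce outside a survey platform. A sketch in R, assuming responses holds one column per item scored 1-5 (the object name is hypothetical):

    # Item means on the 1-5 scale; flag items below the scale midpoint
    # as candidates for teaching adjustments
    item_means <- colMeans(responses, na.rm = TRUE)
    sort(item_means[item_means < 3])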
Additionally, information on how the designed instructional components affect students’ learning gains can provide further evidence for the instructor to consider in teaching improvement. For example, in this study, the average scores of “interactions with the instructor about learning” were lower than those of other instructional components in terms of their contributions to students’ learning gains. In line with the findings from the Likert-type questions, interaction between faculty and students is an essential aspect for further teaching improvement.

Researchers who use the instrument should conduct Rasch analysis to obtain interval measures of student engagement and learning gains. First, students’ individual Rasch scores for engagement and learning gains will help researchers understand the relationship between students’ academic performance and their engagement and learning gains, which suggests better instructional strategies, especially for low-performing students. For example, the individual “ability” scores could help researchers reach low-performing students to identify their strengths and weaknesses in engagement and learning gains, and then prepare more targeted mentoring and suggestions for improving their academic performance in the rest of the course. Second, by looking at the item-person map and the average scores of each item, researchers can identify the aspects with which the majority of students struggle. For example, the separated item-person map of behavioral engagement suggests that taking notes was the question students most often agreed with, meaning that most students took notes during classroom learning. However, the item asking whether students regularly discussed their learning process with the instructor outside the classroom was less commonly agreed with. Thus, to maximize students’ engagement, the instructor should think about how to provide more opportunities for student-faculty interaction.

Finally, this survey can benefit students in terms of self-regulation and train them to be self-directed learners. By taking the survey, students have an opportunity to monitor their engagement and reflect on what they have learned. Instructors can provide students with information on average engagement and learning gains, as well as on how high-performing students engage in classroom learning, as examples to help other students adjust their learning strategies. In this way, students will have an opportunity to learn their strengths and weaknesses in the course.
Administration of the MS-CSSEL Survey

The MS-CSSEL survey is intended to measure student engagement and learning gains in the middle of the semester. Thus, it is appropriate to administer the survey around mid-semester in order to collect data for instructors to consider potential improvements while teaching the course.

The purpose of the MS-CSSEL survey is to help instructors identify the strengths and weaknesses of students’ engagement and learning gains in order to adjust teaching strategies. Thus, although the survey showed high internal consistency and good construct validity, it should be used for formative evaluation of teaching rather than for high-stakes faculty evaluation. To protect students’ privacy and encourage them to take the survey without concerns, we recommend administering the survey anonymously. Finally, because the survey is quite long, having students complete it on a purely voluntary basis will likely decrease the response rate. Thus, we recommend that instructors provide incentives to encourage student participation. For example, in this study, the instructor provided five extra credit points (~0.75% of the final grade) to students who completed the survey on time. In this way, 88.9% of the invited students responded to the survey.

In conclusion, assessing teaching effectiveness by measuring student classroom engagement and learning gains is a viable way to address the lack of a unified definition of teaching effectiveness. Incorporating the MS-CSSEL survey in the middle of the semester will help instructors (a) monitor and understand students’ engagement and learning processes, (b) gauge whether students’ learning gains match the instructor’s expectations, and (c) adjust instruction for the remainder of the course. Furthermore, it will prepare students to be self-directed learners.

References

American Educational Research Association, American Psychological Association, Joint Committee on Standards for Educational and Psychological Testing (US), & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
Appleton, J. J., Christenson, S. L., Kim, D., & Reschly, A. L. (2006). Measuring cognitive and psychological engagement: Validation of the Student Engagement Instrument. Journal of School Psychology, 44(5), 427-445.
Atkins, M., & Brown, G. (2002). Effective teaching in higher education. Routledge.
Baghaei, P. (2012). The application of multidimensional Rasch models in large scale assessment and validation: An empirical example. Electronic Journal of Research in Educational Psychology, 10(1), 233-252.
Benton, S. L., & Cashin, W. E. (2012). IDEA paper #50: Student ratings of teaching: A summary of research and literature.
Berk, R. A. (2005). Survey of 12 strategies to measure teaching effectiveness. International Journal of Teaching and Learning in Higher Education, 17(1), 48-62.
Berliner, D. (2005). The near impossibility of testing for teacher quality. Journal of Teacher Education, 56(3), 205-214.
Bode, R. K., & Wright, B. D. (1999). In Higher education: Handbook of theory and research. Springer.
Bond, T., & Fox, C. M. (2015). Applying the Rasch model: Fundamental measurement in the human sciences. Routledge.
Briggs, D. C., & Wilson, M. (2003). An introduction to multidimensional measurement using Rasch models. Journal of Applied Measurement, 4(1), 87-100.
Carnell, E. (2007). Conceptions of effective teaching in higher education: Extending the boundaries. Teaching in Higher Education, 12(1), 25-40.
Centra, J. A., & Gaubatz, N. B. (2000). Is there gender bias in student evaluations of teaching? The Journal of Higher Education, 71(1), 17-33.
Chen, Y., & Hoshower, L. B. (2003). Student evaluation of teaching effectiveness: An assessment of student perception and motivation. Assessment & Evaluation in Higher Education, 28(1), 71-88.
Clayson, D. E. (2013). Initial impressions and the student evaluation of teaching. Journal of Education for Business, 88(1), 26-35.
Clayson, D. E., & Haley, D. A. (2011). Are students telling us the truth? A critical look at the student evaluation of teaching. Marketing Education Review, 21(2), 101-112.
Darling-Hammond, L., Jaquith, A., & Hamilton, M. (2012). Creating a comprehensive system for evaluating and supporting effective teaching. Stanford Center for Opportunity Policy in Education.
Delaney, J. G., Johnson, A., Johnson, T. D., & Treslan, D. (2010). Students' perceptions of effective teaching in higher education. [Link]
Fauth, B., Göllner, R., Lenske, G., Praetorius, A., & Wagner, W. (2020). Who sees what? Conceptual considerations on the measurement of teaching quality from different perspectives. Zeitschrift Für Pädagogik, 66, 138-155.
Fredricks, J. A., & McColskey, W. (2012). Handbook of research on student engagement. Springer.
Galbraith, C. S., Merrill, G. B., & Kline, D. M. (2012). Are student evaluations of teaching effectiveness valid for measuring student learning outcomes in business related classes? A neural network and Bayesian analyses. Research in Higher Education, 53(3), 353-374.
Gravestock, P., & Gregor-Greenleaf, E. (2008). Student course evaluations: Research, models and trends. Higher Education Quality Council of Ontario.
Handelsman, J., Miller, S., & Pfund, C. (2007). Scientific teaching. Macmillan.
Hativa, N., Barak, R., & Simhi, E. (2001). Exemplary university teachers: Knowledge and beliefs regarding effective teaching dimensions and strategies. The Journal of Higher Education, 72(6), 699-729.
Henard, F., & Leprince-Ringuet, S. (2008). The path to quality teaching in higher education. [Link]
Janssen, R., & De Boeck, P. (1999). Confirmatory analyses of componential test structure using multidimensional item response theory. Multivariate Behavioral Research, 34(2), 245-268.
Kahu, E. R. (2013). Framing student engagement in higher education. Studies in Higher Education, 38(5), 758-773.
Lehrer-Knafo, O. (2019). How to improve the quality of teaching in higher education? The application of the feedback conversation for the effectiveness of interpersonal communication. EDUKACJA Quarterly, 149(2).
Linacre, J. M. (2020). Winsteps® Rasch measurement computer program. [Link]
Liu, X. (2020). Using and developing measurement instruments in science education: A Rasch modeling approach (2nd ed.). IAP.
Mitchell, K. M., & Martin, J. (2018). Gender bias in student evaluations. PS: Political Science & Politics, 51(3), 648-652.
Nasser, F., & Fresko, B. (2002). Faculty views of student evaluation of college teaching. Assessment & Evaluation in Higher Education, 27(2), 187-198.
Pounder, J. (2007). Is student evaluation of teaching worthwhile? An analytical framework for answering the question. Quality Assurance in Education, 15(2), 178-191.
Robitzsch, A., Kiefer, T., & Wu, M. (2019). TAM: Test analysis modules. R package version 3.3-10. [Link]
Rogaten, J., Rienties, B., Sharpe, R., Cross, S., Whitelock, D., Lygo-Baker, S., & Littlejohn, A. (2019). Reviewing affective, behavioural and cognitive learning gains in higher education. Assessment & Evaluation in Higher Education, 44(3), 321-337.
RStudio Team. (2018). RStudio: Integrated development for R. RStudio, Inc. [Link]
Shevlin, M., Banyard, P., Davies, M., & Griffiths, M. (2000). The validity of student evaluation of teaching in higher education: Love me, love my lectures? Assessment & Evaluation in Higher Education, 25(4), 397-405.
Smith, A. B., Rush, R., Fallowfield, L. J., Velikova, G., & Sharpe, M. (2008). Rasch fit statistics and sample size considerations for polytomous data. BMC Medical Research Methodology, 8(1), 33.
Spooren, P., Brockx, B., & Mortelmans, D. (2013). On the validity of student evaluation of teaching: The state of the art. Review of Educational Research, 83(4), 598-642.
Stowell, J. R., Addison, W. E., & Smith, J. L. (2012). Comparison of online and classroom-based student evaluations of instruction. Assessment & Evaluation in Higher Education, 37(4), 465-473.
Strauss, M. E., & Smith, G. T. (2009). Construct validity: Advances in theory and methodology. Annual Review of Clinical Psychology, 5, 1-25.
Torres Irribarra, D., & Freund, R. (2014). WrightMap: IRT item-person map with ConQuest integration. [Link]
Trowler, V. (2010). Student engagement literature review. The Higher Education Academy, 11(1), 1-15.
Uttl, B., White, C. A., & Gonzalez, D. W. (2017). Meta-analysis of faculty's teaching effectiveness: Student evaluation of teaching ratings and student learning are not related. Studies in Educational Evaluation, 54, 22-42.
Van Zile-Tamsen, C. (2017). Using Rasch analysis to inform rating scale development. Research in Higher Education, 58(8), 922-933.
Wang, W. C., Chen, P. H., & Cheng, Y. Y. (2004). Improving measurement precision of test batteries using multidimensional item response models. Psychological Methods, 9(1), 116-136.
Worthington, A. C. (2002). The impact of student perceptions and characteristics on teaching evaluations: A case study in finance education. Assessment & Evaluation in Higher Education, 27(1), 49-64.
Yao, Y., & Grady, M. L. (2005). How do faculty make formative use of student evaluation feedback? A multiple case study. Journal of Personnel Evaluation in Education, 18(2), 107.
Young, S., Rush, L., & Shaw, D. (2009). Evaluating gender bias in ratings of university instructors' teaching effectiveness. International Journal for the Scholarship of Teaching and Learning, 3(2), n2.
Zabaleta, F. (2007). The use and misuse of student evaluations of teaching. Teaching in Higher Education, 12(1), 55-76.
____________________________

REN LIU is a Ph.D. candidate in the Department of Learning and Instruction, Graduate School of Education, University at Buffalo, State University of New York. He conducts research in the areas of college teaching improvement, measurement development using the Rasch model, program evaluation, and applications of Rasch measurement in higher education.

XIUFENG LIU is a Professor in the Department of Learning and Instruction, Graduate School of Education, University at Buffalo, State University of New York. He conducts research in the closely related areas of measurement and evaluation of STEM education, applications of Rasch measurement, and STEM teaching and learning. He was the inaugural director of the Center for Educational Innovation, with a mission of improving university teaching and student learning. Among the books he has published is Using and Developing Measurement Instruments in Science Education: A Rasch Modeling Approach (Second Edition, 2020, Information Age Publishing).

LARA HUTSON is a Clinical Associate Professor and Director of Undergraduate Studies in the Department of Biological Sciences at the University at Buffalo (SUNY). She is also coordinator and instructor of the second-semester introductory majors' biology course (Cell Biology) and teaches biochemistry. Dr. Hutson's most recent projects aim to reduce D/R/F/W rates in introductory STEM courses, including Attendance Tracking and Intervention (funded by the UB President's Circle); SUNY Excels, a collaboration between STEM instructors at the SUNY Centers (funded by the State of New York); and UB Excite, a course redesign program (also funded by the State of New York).
Appendix
Cognitive Engagement
Affective Engagement
1. Please describe how you have participated and engaged in this course so far.
2. Please suggest how the instructor may maintain and further improve your participation and engagement
during the rest of the course.
Part II: Learning gains: How much have you learned so far in this course?
Cognitive
Attitude