Issues in The Design of Randomized Field Trials To Examine The Impact of Teacher Professional Development Interventions
Andrew Wayne
Kwang Suk Yoon
Stephanie Cronen
Michael S. Garet
Pei Zhu
MDRC
New York, New York
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
INTRODUCTION
One strategy for improving student achievement is to provide teachers with specialized
training. In addition to training provided prior to the start of the teaching career – i.e.,
preservice training – teachers also receive training during their careers. This strategy is
called in-service training or professional development (PD). Today, virtually all public
school teachers participate in professional development programs each year (Choy, Chen,
and Bugarin, 2006, p. 47). Professional development programs are also a focus in
NCLB, through which states and districts spent approximately $1.5 billion on teacher
professional development in 2004-05 (Birman et al., 2007).
Although NCLB provides resources for teacher professional development, the broader
effect of NCLB has been to promote accountability for student achievement. NCLB also
encourages school districts to adopt programs and practices that have been proven by
scientifically based research. There is thus a significant need for rigorous studies of the
effect of teacher professional development on student achievement. Sponsors of
professional development initiatives, such as the National Science Foundation, are
particularly eager to find ways to evaluate the effects that their professional development
programs have on student achievement.
This paper provides a discussion of issues that must be confronted in designing studies of
the impact of teacher professional development interventions using randomized
controlled trials. Although not the only method for studying impact, randomized
controlled trials address the problem of selection bias – the problem that those “selected”
to receive an intervention are often different from those not receiving it. Selection bias is
particularly problematic for those studying professional development. Many districts
have professional development offerings that they expect only a small cadre of ambitious
teachers will join. To complicate matters further, professional development is often
mandated for all teachers in schools that have been identified under accountability
systems. Clearly those participating in the mandated intervention are likely to be
different from those not participating.
The focus of this paper is on the design issues that arise when applying experimental methods to the study
of teacher professional development in particular.
Kennedy’s literature review (1998) was perhaps the first widely circulated review of
empirical evidence on the effectiveness of professional development on student
achievement. She analyzed a number of studies of math and science PD programs and
their effects on student outcomes. In order to identify features of effective professional
development programs, she categorized the programs in several ways. She found that the
relevance of the content of the PD was particularly important. She classified in-service
programs into four groups according to the levels of prescriptiveness and specificity of
the content they provide to teachers.1 Based on her analysis of effect sizes, Kennedy
(1998) concluded that the substance of the programs, that is, their content emphasis, mattered more for student achievement than their form or structure.
Kennedy's literature review shed light on the crucial role of content emphasis in high-
quality, effective PD. Her seminal work prompted others to test this hypothesis in
subsequent studies (cf. Desimone, Porter, Garet, Yoon, & Birman, 2002; Garet et al.,
2001; Yoon, Garet, & Birman, 2007).
1 In Kennedy's review, Group 1 professional development programs focused on activities that prescribed a set of teaching behaviors expected to apply generically to all school subjects (e.g., Stevens & Slavin, 1995). Group 2 PD activities also prescribed a set of generic teaching behaviors, but were proffered as applying to one particular school subject (e.g., Good, Grouws, & Ebmeier, 1983). Group 3 PD activities provided general guidance on both curriculum and pedagogy for teaching a particular subject and justified their recommended practices with reference to knowledge about how students learn that subject (e.g., Wood & Sellers, 1996). Lastly, Group 4 PD programs provided knowledge about how students learn particular subject matter but did not provide specific guidance on the practices that should be used to teach that subject (e.g., Carpenter et al., 1989).
Building on the literature reviews by Kennedy and by Clewell, Campbell, and Perlman
(2004), Yoon et al. (2007) recently conducted a more systematic and comprehensive
review. Yoon et al. examined studies of impacts in three core academic subjects (reading,
mathematics, and science). They focused the review on types of studies presumed to
provide strong and valid causal inferences: that is, randomized controlled trials (RCT)
and quasi-experimental design (QED) studies. Nine studies emerged as meeting the
What Works Clearinghouse evidence standards, from 132 manuscripts identified as
relevant.2 All nine studies were focused on elementary school teachers and their students.
Five studies were RCTs that met evidence standards without reservations. The remaining
four studies met evidence standards with reservations (one RCT with a group equivalence
problem and three QEDs).
Pooling across the three content areas included in the review, the average overall effect
size was 0.54. Notably, the effect size was reasonably consistent across studies in which
the PD took different forms and had different content. However, the small number of
qualifying studies limited the variability in the PD that was represented. For example, all
PD interventions in the nine studies were provided directly to teachers (rather than using
a train-the-trainer approach) and were delivered in the form of workshops or summer
institutes by the author(s) or their affiliated researchers. In addition, the studies involved
a small number of teachers, ranging from 5 to 44, often clustered in a few schools. In
general, these might be viewed as efficacy trials, testing the impact of PD in small,
controlled settings, in contrast to effectiveness trials, which test interventions on a larger
scale, in more varied settings.
Given the modest variation in the features of the studies that qualified to be included in
their review, Yoon et al. were unable to draw any strong conclusions about the
distinguishing features of effective professional development – and especially about the
role of content focus. The studies they reviewed did suggest that the duration or
“dosage” of PD may be related to impact. The average number of contact hours was
about 48 hours across the nine studies, and the three studies that provided the least
intensive PD (ranging from 5 to 14 hours over the span of two to three and a half months)
produced no statistically significant effect.3
Several larger scale studies of the impact of PD on student achievement are currently
underway, and, when their results become available, they should add appreciably to the
available research base. The ongoing studies include a study of the impact of PD in
2 At the initial stage, over 1,300 manuscripts were screened for relevance in terms of topic, study design, sample, and recency. After the prescreening process, only 132 studies were determined to be relevant for a subsequent thorough coding process. In this coding stage, each study under review was given one of three possible ratings in accordance with the WWC's technical guidelines: "Meets Evidence Standards" (e.g., randomized controlled trials that provided the strongest evidence of causal validity), "Meets Evidence Standards with Reservations" (e.g., quasi-experimental studies or randomized controlled trials that had problems with randomization, attrition, teacher-intervention confound, or disruption), and "Does Not Meet Evidence Screens" (e.g., studies that did not provide strong evidence of causal validity). The WWC's technical working papers are available at http://ies.ed.gov/ncee/wwc/twp.asp.
3 The remaining six studies provided more intensive PD, with contact hours ranging from 30 to more than 100. With the exception of one study with 83 contact hours, all of them resulted in significant effects on student achievement.
second grade reading involving about 270 teachers and a study of the impact of PD in
seventh grade mathematics involving about 200 teachers, both conducted by the authors
of the current paper and their colleagues. Other ongoing studies include a study of the
impact of science PD in Los Angeles, conducted by Gamoran, Borman, and their
colleagues at the University of Wisconsin, and a study of science PD conducted in Detroit
by Fishman and his colleagues at the University of Michigan.
In sum, while there has been some progress since Borko’s 2004 address, much remains to
be done. While evidence is accumulating that PD can be effective in different content
areas, more work is needed, especially on some thorny design issues. We summarize
some of these issues broadly in the next section, and then turn to a more detailed
discussion of four issues that we believe are particularly important.
"Although good curriculum materials can provide rich tasks and activities
that support students’ mathematical investigations, such materials may not
be sufficient to enable deep changes in instructional practice…
professional development strategies are designed to support teachers’
efforts to transform their practices” (p. 56).
A second key design decision involves the context in which the study will be carried out,
and this appears in the “Study Context” box at the right of the diagram. Among other
things, the setting determines the level and type of PD that occurs “naturally” among
teachers in the sample. The setting also determines the curricula that will be in use in the
sample classrooms, which is important since PD interventions may interact with
curricula. (Note that when curricula are part of the treatment itself, they would be
represented in the Intervention box rather than the Study Context box.)
[Exhibit 1. Conceptual framework: within the study context, the PD intervention (its focus, structure, and other features; its unit of intervention; and any non-PD components) is delivered to the treatment group but not to the control group in the study sample. Teacher knowledge and teacher practice at Time 1 and Time 2 mediate effects on student achievement, with teacher and student retention between the two time points.]
The determination of the study sample is the third issue we consider in some detail.
Exhibit 1 depicts the Study Sample within the Study Context as two boxes – one for the
treatment group, in the foreground, and one for the control group, in the background.
Retention of teachers and students in the sample, as shown in Exhibit 1, is a significant
consideration.
The selection of measures is the final design issue we consider. The selection of
measures of course depends on the assumed causal chain leading from participation in
PD to improved student achievement. Exhibit 1 shows one potential causal chain,
leading from teacher knowledge through teacher practice to student achievement. Presumably,
participating in PD will influence teachers’ knowledge and instructional practice, which
in turn will improve student achievement.4 We argue that it is critical to measure
4 As Kennedy (1998) argues, PD may operate primarily by improving teacher content knowledge, which will improve teachers' selection and use of practices, even though the practices are not a direct focus of the PD. Or, the PD may operate primarily by improving teachers' skills in implementing specific instructional practices.
variables at each stage in the causal chain. In addition, the effects of the PD on teacher
and student outcomes will unfold over time, and thus a key measurement decision
concerns the timing and frequency of measurement. For simplicity, Exhibit 1 shows two
time points. It is quite likely that the number of waves of measurement will be greater
than two.
DESIGN ISSUES
Researchers designing randomized controlled trials focused on teacher professional
development interventions face a common set of methodological issues. In this section,
we discuss the tradeoffs inherent in each issue. The resolution of these issues depends in
part on the particular intervention selected and the resources available, so the discussion
is meant to raise points to consider rather than provide definitive answers. We organize
the discussion under four broad subheadings.
One challenge in studying the impact of PD is that an intervention rests on at least two
theories: a theory about the features of PD that will promote change in teacher knowledge
and/or teacher practice (a "theory of teacher change") and a theory about how particular
classroom instructional practices will lead to improvements in student achievement (a
"theory of instruction"). (For a related argument, see Supovitz, 2001.)
With respect to a theory of the features of PD that promote teacher change, there is a
relatively large literature (Garet, Porter, Desimone, Birman, & Yoon, 2001; Guskey,
2003; Hawley & Valli, 1998; Kennedy, 1998; Little, 1993; Loucks-Horsley, Hewson,
Love, & Stiles, 1998; National Commission on Teaching and America’s Future, 1996;
Showers, Joyce, & Bennett, 1987; Wilson & Berne, 1999), but little evidence on the
impact of specific features on teacher knowledge, teacher practice, or student
achievement. Nevertheless, over the years, a consensus has been built on promising “best
practices.” For example, it is generally accepted that intensive, sustained, job-embedded
professional development focused on the content of the subject that teachers teach is
more likely to improve teacher knowledge, classroom instruction, and student
achievement. Further, active learning, coherence, and collective participation have also
been suggested to be promising “best practices” in professional development (Garet et al.,
2001). While these features have only limited evidence to support them, they provide a
starting point for the design of the form and delivery of PD, and many recent studies of
PD interventions appear to draw on features of this sort.
In contrast to the emerging consensus on the features of PD worth testing, there is much
less consensus on the theories of instruction. The PD interventions tested in recent
studies differ widely in the theory of instruction being tested. Some interventions focus
on a specific set of instructional practices. For example, Sloan (1993) randomly assigned
teachers to a treatment that sought to elicit teaching behaviors associated with the direct instruction
model—specifically, Madeline Hunter’s “seven steps of the teaching act.” Participating
elementary teachers were expected to use these practices in teaching all subjects.
All of this is to say that studies of PD interventions are tests of a package—a package that
inevitably draws on both a theory of teacher change and a theory of instruction. The fact
that PD packages draw on two theories can make negative results difficult to interpret. If
Sloan (1993) had found no effect, we would not know whether the flaw lay in the direct
instruction model or in the teacher change model, which was a 30-hour treatment that
included summer sessions and seven follow-up meetings, spread out over a period of 6
months.5
Such “three-arm” studies are a promising way to add to the knowledge base, since one
can test the importance of specific PD components. Alternatively, researchers can define
programs that share the same underlying theory of instruction but use very different PD
delivery mechanisms. The most serious constraint in selecting the treatments for such
“three-arm” studies is that the PD received by the two treatment groups needs to differ
enough to result in measurable differences in student achievement. (See issue #3,
below.)
Apart from specifying the features of the PD intervention, a related issue concerns the
available evidence in support of the intervention. If no prior research has been done on
the proposed intervention, it seems most useful to engage in a small-scale trial, testing
whether the intervention can be implemented as anticipated and can achieve results in
controlled settings.
et al., 1989; Good, Grouws, & Ebmeier, 1983; Saxe, Gearhart, & Nasir, 2001). Such
“replication trials” are likely to be especially valuable when previous trials were
implemented by a single trainer, with small numbers of volunteer teachers, or in a single
district context.
Or, if there is reasonable evidence on efficacy, one may move to an effectiveness trial,
testing the impact of the intervention across a wider set of contexts. Here, the main
design challenges are likely to involve ways of scaling up the PD so that it can be
delivered with fidelity in multiple settings with different facilitators. In many of the
available rigorous efficacy trials, the PD was delivered by a single individual, often the
primary researcher.
First, it is necessary to consider the curricular context and find locations with curricula
for which the PD is suited.6 Curricula, like PD programs, embed specific theories of
instruction. Ideally, it is sensible to seek locations that use curricula that align with the
underlying theory of instruction embedded in the PD. At the very least, it is necessary to
choose locations where the curricula would not discourage the practices promoted by the
PD.
Second, it is important to examine the PD that already exists in districts being considered.
It is inevitable that, during the study, teachers will receive other PD. As noted
earlier, teachers participate in a variety of professional development activities each year,
either due to mandates, incentives, or personal initiative. Teachers may be part of
informal groups at their schools that serve professional development needs. Teachers will
presumably continue to participate in all of these PD experiences regardless of their
status in the study, except to the extent researchers are able to negotiate special
arrangements.
One concern is that ambient PD may reduce the contrast between the treatment group and the control group. If the ambient PD contains elements in common with the study PD, the
impact will appear to be smaller. For instance, suppose some researchers want to study a
10-day program of content-focused professional development for geometry teachers. If
the district in which the study took place provided all geometry teachers with two days of
workshops on geometry, the teachers in the treatment group would probably not learn as
much during their 10-day PD program as they would have otherwise. Assuming that it is
impossible to dissuade the district from providing its 2-day PD, the only way to avoid
such a problem is to conduct one’s study in districts where the ambient PD has no
elements in common with the PD to be studied.
Another potential concern with ambient PD is whether teachers can attend both the
ambient PD and the study PD. Ideally, the teachers in the treatment group will receive all
of the study PD and will continue to receive the ambient PD, such that the study will
show the value added of the PD treatment that is being randomly assigned. But if the
treatment is time-intensive or if a district has a lot of ambient PD, scheduling conflicts
could occur, or teachers could begin to feel “overloaded” and selectively not attend some
PD events. Thus, treatment group teachers might not get some PD that they were
supposed to get. The treatment-control difference would be distorted, since the treatment
group would be receiving less ambient PD than the control group. Alternatively,
treatment group teachers might decide to attend the ambient PD in lieu of attending the
study PD.
Thus, it is important to select locations where the chance of interference with the PD under
study is low; alternatively, it may be wise to construct treatments that can be
delivered without interfering significantly with other PD. It is also necessary to find
locations where the ambient PD does not share elements in common with the study PD.
Regardless, the extent and variability of ambient PD clearly necessitates the measurement
of teacher participation in the ambient PD – to verify that treatment and control teachers
participated equally and that the measured impact is roughly the impact of the treatment
PD. These measures can characterize the service contrast between treatment and control
teachers very clearly (e.g., dosage in terms of contact hours or duration).
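To make this concrete, the brief sketch below (hypothetical data and variable names, not drawn from any of the studies discussed here) tallies study PD and ambient PD contact hours for treatment and control teachers and reports the resulting service contrast in hours:

```python
# A minimal sketch with hypothetical teacher-level records of PD contact hours.
from statistics import mean

# (group, hours of study PD received, hours of ambient/district PD received)
teachers = [
    ("treatment", 40, 6), ("treatment", 38, 8), ("treatment", 42, 5),
    ("control", 0, 7), ("control", 0, 9), ("control", 0, 6),
]

def mean_hours(records, group):
    """Mean study-PD and ambient-PD hours for one experimental group."""
    study = [s for g, s, a in records if g == group]
    ambient = [a for g, s, a in records if g == group]
    return mean(study), mean(ambient)

t_study, t_ambient = mean_hours(teachers, "treatment")
c_study, c_ambient = mean_hours(teachers, "control")

# The service contrast is the difference in total PD (study plus ambient) between groups.
contrast = (t_study + t_ambient) - (c_study + c_ambient)
print(f"Treatment group total PD hours: {t_study + t_ambient:.1f}")
print(f"Control group total PD hours:   {c_study + c_ambient:.1f}")
print(f"Service contrast (hours):       {contrast:.1f}")
```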
Another design issue concerns the determination of the sample that will be needed for the
intended study. Randomized controlled trials of professional development interventions
differ from many other educational experiments in the sense that the professional
development interventions target teachers, while the eventual outcome of interest—
student achievement—is measured and collected at the student level. Trials like these pose
some new challenges and difficulties for sample design (Donner, 1998). This section
discusses some of these challenges that are specific to the evaluation of professional
development interventions. In particular, it discusses issues related to unit of random
assignment, unit of analysis, precision of the estimators, and teacher/student mobility
during the program.
Unit of Random Assignment
The choice of unit of assignment has important implications for the set of teachers and
their students for whom impact is examined. If the teacher is selected as the unit, then
each teacher included in the study would be assigned to treatment or control at the start
of the study, and the teacher (and the teacher’s students) would be followed up during
each wave of data collection. If, on the other hand, the school is selected as the unit, then
each school included in the study would be assigned to treatment or control at the start of
the study, and all relevant teachers currently teaching in each school (and their students)
would be the target of data collection at each wave.
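The sketch below illustrates the two options with a hypothetical roster of schools and teachers; the roster, seed, and helper function are illustrative assumptions rather than procedures from any particular study:

```python
# A minimal sketch of the two assignment options; the roster, seed, and helper are hypothetical.
import random

random.seed(20240101)  # fixed seed so the assignment can be reproduced and audited

# Hypothetical roster: schools and the relevant teachers in each.
schools = {
    "School A": ["T01", "T02", "T03"],
    "School B": ["T04", "T05"],
    "School C": ["T06", "T07", "T08"],
    "School D": ["T09", "T10"],
}

def split_randomly(units, p_treat=0.5):
    """Shuffle the units and assign the first share to treatment, the rest to control."""
    units = list(units)
    random.shuffle(units)
    n_treat = round(len(units) * p_treat)
    return {u: ("treatment" if i < n_treat else "control") for i, u in enumerate(units)}

# School-level design: every relevant teacher inherits his or her school's assignment.
school_assignment = split_randomly(schools.keys())
teachers_school_design = {t: school_assignment[s]
                          for s, roster in schools.items() for t in roster}

# Teacher-level design: teachers are assigned individually, here within each school,
# so most schools end up containing both treatment and control teachers.
teachers_teacher_design = {}
for roster in schools.values():
    teachers_teacher_design.update(split_randomly(roster))

print(teachers_school_design)
print(teachers_teacher_design)
```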
In the absence of teacher turnover, both the teacher-level and school-level designs focus on
teachers identified for inclusion in the study at the time the study begins. But, in the
presence of teacher turnover, the two choices for the unit of assignment lead to different
teacher samples over time. The teacher-level design involves following teachers as long
as they remain in teaching, and collecting data on their knowledge and their students'
achievement at the time points specified in the design. Teachers who exit teaching or
who change grade levels or subjects must be dropped from the study, because data on
classroom instruction and student achievement could not be collected from such teachers,
even if the teachers could be located. Thus, in the teacher-level design, if the PD
treatment affects teacher turnover rates, mobility could lead to selection bias, and this
bias would need to be taken into account in the analysis.
In the school-level design, if a teacher leaves a school in the sample over the course of
the study and another teacher enters, then the new replacement teacher would be included
in the study sample from that point forward. Thus, in a school-level design, mobility can
7 The use of group randomization to study the impacts of social policies has a long history. Harold F. Gosnell (1927) studied ways to increase voter turnout by randomly assigning one part of each of twelve local districts in Chicago as targets for a series of hortatory mailings and the other part as a control group. In recent years, the use of group randomization has spread to many fields (for a review and discussion of the key issues, see Boruch and Foley 2000). Over the past decades, it has been used to evaluate "whole-school" reforms (Cook, Hunt, and Murphy 2000), school-based teacher training programs (Blank et al. 2002), community health promotion campaigns (Murray, Hannan, et al. 1994), community employment initiatives (Bloom and Riccio 2002), and group medical practice interventions (Leviton et al. 1999). The two textbooks on cluster randomization published to date, both of which focus on evaluating health programs, are by Allan Donner and Neil Klar (2000) and by David M. Murray (1998).
cause the impact of the PD to be diluted. For example, if a treatment teacher left in the
middle of a program year, after the PD treatment was complete, and his/her class was
handed over to a new teacher who had not been exposed to the professional development
intervention, then the amount of exposure to treatment that the students in this class had
would be cut in half. The treatment impact estimated based on the achievement data
from this class of students would reflect the program impact of only half of the intended
treatment, and is likely to differ from what the students would have experienced
had the treatment teacher stayed for the whole program year. This suggests that in a
school-level design, it may be desirable to incorporate components in the intervention to
provide support for new teachers who enter treatment schools over the course of the study
(i.e., supplemental PD).8
Apart from the implications for the analysis of teacher mobility, there are other pros and
cons associated with choosing the school or the teacher as the unit of assignment. Using
the school as the random assignment unit helps to reduce control-group contamination. One of
the greatest threats to the methodological integrity of a random-assignment research
design is the possibility that some control-group members will be exposed to the
program, thus reducing the service contrast – the difference between the group receiving
the intervention (or the treatment group) and the “business as usual” group (or the control
group). Such contamination of the control group is especially likely in the case of
professional development interventions. Since many of these interventions incorporate
collaboration among teachers at a given grade level, if some teachers in a school are
randomly assigned to an intervention, they are likely to share some of the information
provided through the intervention with peers who have been randomly assigned to the
control group. This second-hand exposure will attenuate the service contrast and make it
difficult to interpret impact estimates. By randomly assigning schools to treatment or
control conditions, one separates the treatment group from the control group spatially, thereby
blocking some potential paths of information flow and reducing control-group
contamination.
Using the school as the unit of assignment may also help to deliver the services more
effectively by capitalizing on economies of spatial concentration. Spatial concentration of
the target-group members may reduce the time and money costs of transportation to and
from the program, and may enable staff to operate the program more effectively by
exposing staff directly to the problems that arise on site. For example, when a professional
development intervention involves intensive coaching, it clearly reduces costs if a coach can
travel to one school to work with four teachers in that school rather than travel to two
different schools to work with two teachers in each. School-level randomization
enables the coach to spend more time in the school, gain exposure to the common
problems in that school, and thus adjust the delivery of services to fit the school's needs
more effectively.
8 In a school-level design, turnover would not ordinarily lead to selection bias even if there is differential turnover between treatment and control schools, because all teachers teaching in the schools would be included in the analysis; the estimated impact would then reflect a combination of the impact of the treatment on achievement for teachers who stay and its impact on turnover.
Another very different reason for randomly assigning schools is to facilitate political
acceptance of randomization by offsetting ethical concerns about "equal treatment".
Even though random assignment of teachers treats individual teachers equally in the
sense that each one of them has an equal chance of being offered the program, this fact is
often overlooked because after randomization treatment group teachers have access to the
intervention while control group teachers do not. This perception is especially acute if,
within a school, some teachers receive the intervention and some do not. Therefore, school-
level randomization is generally easier to "sell" than teacher-level randomization.
On the other hand, using teachers as the unit of assignment requires fewer participating
schools and teachers, for a given level of precision. It is well documented that estimates
based on cluster randomization have less statistical precision than those based on
individual randomization because possible correlations of impacts across individuals
within a cluster have to be accounted for in cluster randomization (Bloom 2005; Murray et al.
1994). By analogy, for a given data structure, estimates based on school-level
(cluster) randomization have less statistical precision than those based on teacher-level
(individual) randomization because possible correlations of impacts across teachers
within a school have to be accounted for if schools are being randomized.9 In other
words, fewer teachers in total are needed to detect an impact of a given size with a
given level of precision if the randomization is done at the teacher level instead of the school
level. Note, however, that the difference in precision between these two options depends
on the specific outcomes and analytical methods used in the evaluation and can vary from
program to program. Nonetheless, a reduction in required sample size saves money and
effort in the evaluation process.
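As a rough illustration of this precision difference, the sketch below applies the design effect implied by the footnote 9 formulas (with no covariates, the variance of the school-randomized impact estimator exceeds that of the teacher-randomized estimator by a factor of 1 + (n − 1) × ICC); all numerical inputs are hypothetical:

```python
# A sketch of the precision penalty implied by the footnote 9 formulas (no covariates);
# all numerical inputs are illustrative, not estimates from any study.

def design_effect(n_per_school, icc):
    """Variance inflation from randomizing schools rather than teachers:
    SE_CL^2 / SE_IN^2 = 1 + (n - 1) * ICC."""
    return 1.0 + (n_per_school - 1) * icc

n_per_school, icc = 4, 0.15              # hypothetical: 4 study teachers per school, ICC = 0.15
deff = design_effect(n_per_school, icc)

teachers_needed_teacher_level = 120      # illustrative requirement under teacher-level randomization
teachers_needed_school_level = teachers_needed_teacher_level * deff

print(f"Design effect: {deff:.2f}")
print(f"Teachers needed, teacher-level randomization: {teachers_needed_teacher_level}")
print(f"Teachers needed, school-level randomization:  {teachers_needed_school_level:.0f}")
```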
Overall, whether to use the school or the teacher as the randomization unit depends on the specific
features of the intervention being tested. If control-group contamination is not a major
concern, because of the nature of the intervention or because there are other effective ways to prevent
contamination, and monetary constraints are important, then using the teacher as the
randomization unit might serve the purposes of the study better. On the other hand, if the
nature of the intervention makes it prone to contamination and the budget for the program
is sufficient to support an evaluation with adequate power, then using the school as the
randomization unit would be preferable.
9 The standard error of the impact estimator for a cluster-randomized program (when no covariates other than treatment status are included) is $SE_{CL} = \sqrt{\tfrac{1}{P(1-P)}\left(\tfrac{\tau^2}{J} + \tfrac{\sigma^2}{nJ}\right)}$, while the standard error of the impact estimator for an individually randomized program is $SE_{IN} = \sqrt{\tfrac{1}{P(1-P)}\left(\tfrac{\tau^2}{nJ} + \tfrac{\sigma^2}{nJ}\right)}$. Here $P$ is the proportion of clusters (or individuals) that are randomly assigned to the treatment group; $J$ is the number of clusters; $n$ is the number of individuals within a cluster; $\tau^2$ is the cross-cluster variance; and $\sigma^2$ is the within-cluster variance (Bloom 2005). The proportion of the total population variance across clusters as opposed to within clusters, $\tau^2/(\tau^2+\sigma^2)$, is usually called the intra-class correlation (Fisher 1925).
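For concreteness, the sketch below is a direct transcription of these two formulas; the parameter values are arbitrary illustrations rather than estimates from any study:

```python
# Direct transcription of the two standard-error formulas above (Bloom 2005 notation);
# the parameter values below are arbitrary illustrations, not estimates from any study.
import math

def se_cluster(P, J, n, tau2, sigma2):
    """Standard error of the impact estimator when clusters (e.g., schools) are randomized."""
    return math.sqrt((1.0 / (P * (1.0 - P))) * (tau2 / J + sigma2 / (n * J)))

def se_individual(P, J, n, tau2, sigma2):
    """Standard error of the impact estimator when individuals (e.g., teachers) are randomized."""
    return math.sqrt((1.0 / (P * (1.0 - P))) * (tau2 / (n * J) + sigma2 / (n * J)))

P, J, n = 0.5, 40, 4           # half the units treated, 40 clusters, 4 teachers per cluster
tau2, sigma2 = 0.15, 0.85      # cross-cluster and within-cluster variance (ICC = 0.15)

print(f"Intra-class correlation: {tau2 / (tau2 + sigma2):.2f}")
print(f"SE, school-level randomization:  {se_cluster(P, J, n, tau2, sigma2):.3f}")
print(f"SE, teacher-level randomization: {se_individual(P, J, n, tau2, sigma2):.3f}")
```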
Precision of the Impact Estimates
Another important design element is the precision of the estimated impacts of a study.
The precision of impact is often expressed in terms of the smallest program effect that
could be detected with confidence, or minimum detectable effect (MDE). A minimum
detectable effect is the smallest true program effect that has a certain probability of
producing an impact estimate that is statistically significant at a given level (Bloom
1995). This parameter, which is a multiple of the impact estimator's standard error (see
the first formula in footnote 9), depends on the following factors:
• The type of test to be performed: a one-tailed t-test is used for program impact in
the predicted direction; a two-tailed t-test can be used for any program effects;
• The level of statistical significance to which the result of this test will be
compared (α);
• The desired statistical power (1- β)—the probability of detecting a true effect of a
given size or larger;
• The degrees of freedom of the test, which depend on the number of clusters (J)
and the size of the clusters;
• The intra-class correlation—the proportion of the total population variance across
clusters as opposed to within clusters;
• The explanatory power of potential covariates, such as students' prior
achievement, teacher characteristics, etc. (Bloom 2005).
The first three factors have their conventional values and are relatively easy to pin
down.10 The minimum detectable effect size declines in roughly inverse proportion to the
square root of the number of clusters randomized, while the size of the clusters
randomized often makes far less difference to the precision of program impact estimators
than does the number of clusters (for more details, see Schochet 2005).
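As a rough worked example of how these factors combine, the sketch below pairs the cluster-randomization standard error from footnote 9 with a normal approximation to the usual multiplier (so the degrees-of-freedom correction is ignored); all inputs are hypothetical:

```python
# Rough minimum-detectable-effect calculation combining the factors listed above.
# A sketch only: it uses a normal approximation to the multiplier (degrees of freedom
# are ignored) and the cluster-randomization standard error from footnote 9.
import math
from statistics import NormalDist

def mde(P, J, n, icc, alpha=0.05, power=0.80, two_tailed=True):
    """Minimum detectable effect (in standard-deviation units) for a school-randomized
    design with no covariates and total outcome variance normalized to 1."""
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / (2 if two_tailed else 1)) + z.inv_cdf(power)
    tau2, sigma2 = icc, 1 - icc
    se = math.sqrt((1 / (P * (1 - P))) * (tau2 / J + sigma2 / (n * J)))
    return multiplier * se

# Hypothetical design: 40 schools, 4 study teachers per school, ICC = 0.15.
print(f"MDE with 40 schools: {mde(P=0.5, J=40, n=4, icc=0.15):.2f}")
# Doubling the number of schools shrinks the MDE by roughly 1/sqrt(2).
print(f"MDE with 80 schools: {mde(P=0.5, J=80, n=4, icc=0.15):.2f}")
```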
The last two factors, on the other hand, are hard to determine. These factors vary from
outcome to outcome and from sample to sample. The conventional practice is to use
similar measures from similar past studies as proxies for these factors in the precision
calculation at the design phase of a study. However, it is often difficult to find good
proxies, and judgment must be exercised in choosing values for these factors. This is
especially true for evaluations of professional development interventions that use teacher-
level randomization because measures of teacher-level intra-class correlation and the
explanatory power of teacher-level covariates are not common in past studies.
To assess the minimum detectable effect size for a research design, one needs a basis for
deciding how much precision is needed. From a programmatic perspective, it might be
whether the study can detect an effect that, judging from the performance of similar
programs, is likely to be attainable. This “attainable effect” is especially difficult to
determine for professional development interventions. Unlike other types of educational
interventions, professional development programs target teachers directly and affect
student outcomes only indirectly, through teachers. The existing literature on
professional development shows a wide range of effect sizes depending on the nature of
the interventions and the study designs, and thus gives very little guidance
about how much precision one should expect from a professional development
intervention.
• Determining the timing of measurement of outcomes.
An important design decision concerns how much of the study’s resources to devote to
the measurement of anticipated mediating factors, such as implementation levels
achieved and proximal outcomes (e.g., teacher knowledge and practice), in addition to
distal outcomes (student achievement). While it is tempting in randomized field trials to
focus only on the ultimate outcome of student achievement, excluding measures of
proximal outcomes and other potential moderators and mediators can ultimately be costly
in conceptual terms. Measurement of mediating variables is especially critical in making
use of study results to draw conclusions about the “theory of teacher change” and “theory
of instruction” on which the PD intervention is based.
For example, if a study of professional development does not find an overall impact on
student achievement, and no teacher outcomes are measured, it is impossible to know
where and to what extent the causal model broke down (see Exhibit 1, above). It may be
that the professional development was effective in increasing teachers’ knowledge or
practices (so the theory of teacher change receives support), but these teacher changes
did not result in higher student achievement (so the theory of instruction is not
supported). Without a measure of the proximal outcomes of the professional
development, the model cannot be fully explored or understood.
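As an illustration, the sketch below uses fabricated numbers (and ignores clustering, covariates, and statistical inference) to compute the treatment-control contrast at each stage of the hypothesized chain:

```python
# Illustrative sketch with fabricated numbers: estimating the treatment-control contrast at
# each link of the hypothesized causal chain, so that a null achievement result can be traced
# to the theory of teacher change or to the theory of instruction. Clustering, covariates,
# and significance testing are omitted for brevity.
from statistics import mean, stdev

def standardized_difference(treatment, control):
    """Treatment-control difference in means, scaled by the SD of the combined sample."""
    return (mean(treatment) - mean(control)) / stdev(treatment + control)

# Hypothetical outcomes for treatment (t) and control (c) groups at each stage.
knowledge_t, knowledge_c = [0.8, 0.9, 1.1, 0.7], [0.3, 0.4, 0.2, 0.5]
practice_t, practice_c = [0.6, 0.7, 0.9, 0.5], [0.2, 0.3, 0.4, 0.1]
achievement_t, achievement_c = [0.1, 0.0, 0.2, 0.1], [0.1, 0.1, 0.0, 0.2]

outcomes = {
    "teacher knowledge": (knowledge_t, knowledge_c),
    "teacher practice": (practice_t, practice_c),
    "student achievement": (achievement_t, achievement_c),
}
for label, (t, c) in outcomes.items():
    print(f"Impact on {label}: {standardized_difference(t, c):+.2f} SD")
# Positive impacts on knowledge and practice but not on achievement would support the
# theory of teacher change while casting doubt on the theory of instruction.
```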
PD interventions will vary in the specificity of the intended outcomes for teacher
knowledge, teacher practice, and student achievement. In particular, as discussed earlier,
some approaches to PD may be designed to improve teachers’ skills in implementing a
highly specified set of instructional practices; other PD may be designed to strengthen
teachers’ content knowledge and pedagogical content knowledge, with desired changes in
practice less clearly articulated. Similarly, some PD interventions may be designed to
produce changes in relatively narrowly defined aspects of student achievement (e.g.,
student understanding of a particular concept in mathematics), while other PD
interventions may be designed to produce broad-gauge changes in achievement.
Clearly, the outcome measures must be chosen to reflect the intended outcomes of the
PD; but the desired degree of alignment can be difficult to establish, and it may be wise
to choose multiple instruments that are more or less closely aligned with the specific
focus of the PD, to provide some information on the generalizability and robustness of the
findings.
Two extremes can exemplify the need for balance between alignment and generalizability
in outcome measure selection. Assume two groups of researchers have conducted studies
to determine whether a professional development program on early reading content and instruction leads
to higher achievement. In both scenarios, the intervention was focused primarily on
providing teachers with an understanding of phonemic awareness, phonics, and fluency
instruction in early reading, and the researchers chose a teacher outcome measure that
was created by the professional development team to assess teachers’ understanding of
the content just learned. The first group of researchers chose the language and examples
in the assessment directly from the PD training materials, similar to an “end-of-chapter”
test found in many textbooks. In addition, the researchers created a similar student
achievement measure that exclusively contained the types of decoding items that the
teachers had learned explicitly to address in their reading instruction. Both measures
included a range of easy to difficult items, to ensure that ability at all levels was captured.
Upon conducting the impact analyses, the researchers obtained an effect size of 1.20 for
student achievement, and concluded that professional development in early reading
affects reading achievement.
The second group of researchers studying the same intervention decided to rely on
existing standardized assessments of teacher and student ability. These measures had
been widely tested and validated, and were commonly used in the education domain as
accountability tests, so the results were policy relevant. In addition, the data collection
was low-burden for study participants because the tests were district-administered and
data were available from secondary sources. The tests measured a variety of early
reading and language arts outcomes, but did not provide subscores on decoding, which
was a focus of the PD. Upon conducting an impact analysis, this group obtained null
results.
The differences in results across the two groups of researchers may have little to do with
the PD being studied and more to do with the alignment of measures with the
intervention. The first group of researchers chose a strongly aligned test; the second
group of researchers chose a weakly aligned test. If the intended outcomes of the PD are
very narrowly and specifically defined, it may be sufficient to focus on a highly aligned
set of measures; in most cases, though, it appears that a mix of measures varying in
specificity will be required, to permit an examination of the generalizability of the results
across outcome measures.
Timing of Outcome Measurement
Issues of timing are central to the study of the impact of professional development. As
shown in Exhibit 1, moving from the provision of PD to obtaining an impact on
achievement involves traversing a number of causal links, and each of these may take
time to unfold. How long do teachers need to think about what they have learned in order
to put it effectively into practice? Once practices are put into place, are they sustained
over time? And how long does improved instruction take to produce detectable
increases in students' learning?
The answer to these questions will likely vary by the intensity, specificity and form of the
professional development received, and by the alignment of the outcome measure to the
professional development. A focused week-long institute on phonics with a lot of
modeling and training on specific instructional practices may begin to affect teachers’
practices as soon as the teachers get back in the classroom. If the student test used to
measure achievement is very sensitive to these practices, achievement gains might occur
relatively quickly.
In most studies, however, these conditions are not likely to be met, suggesting that a
realistic study will require multiple waves of data collection, with the timing determined
by features of both the PD and the measures.
CONCLUSION
For researchers interested in understanding the impact of professional development,
randomized controlled trials overcome the perennial problem of selection bias, caused by
the fact that districts typically make PD available to some teachers on a voluntary basis,
at the same time they mandate participation in other PD.
Given the current public investment in professional development and the demonstrated
potential for professional development to improve student achievement, it is not
surprising that increasing attention is being given to the design of rigorous studies of the
impact of PD. To maximize what is learned from these studies, it is critical to give more
attention to the vexing methodological issues involved.
References
Ball, D. L. (1996). Teacher learning and the mathematics reforms: What we think we
know and what we need to learn. Phi Delta Kappan, 77(7), 500-08.
Birman, B., Le Floch, K. C., Klekotka, A., Ludwig, M., Taylor, J., Walters, K., Wayne, A., &
Yoon, K. (2007). State and Local Implementation of the No Child Left Behind Act,
Volume II—Teacher Quality Under NCLB: Interim Report. Washington, D.C.: U.S.
Department of Education, Office of Planning, Evaluation and Policy Development,
Policy and Program Studies Service.
Blank, Rolf K., Diana Nunnaley, Andrew Porter, John Smithson, and Eric Osthoff. (2002)
Experimental Design to Measure Effects of Assisting Teachers in Using Data on Enacted
Curriculum to Improve Effectiveness in Instruction in Mathematics and Science
Education. Washington, DC.: National Science Foundation.
Bloom, Howard S. (1995) “Minimum Detectable Effects: A simple way to Report the
Statistical Power of Experimental Designs.” Evaluation review 19(5): 547-56.
Bloom, Howard S., and James A. Riccio. (2002) Using Place-Based Random Assignment
and Comparative Interrupted Time-Series Analysis to Evaluate the Jobs-Plus
Employment Program for Public Housing Residents. New York: MDRC.
Borko, H. (2004). Professional development and teacher learning: Mapping the terrain.
Educational Researcher, 33(8), 3–15.
Boruch, Robert F., and Ellen Foley (2000) “The Honestly Experimental Society: Sites
and Other Entities as the Units of Allocation and Analysis in Randomized Trials.” In
Validity and Social Experimentation: Donald Campbell’s Legacy, Edited by Leonard
Bickman. Volume 1. Thousand Oaks, Calif.: Sage Publications.
Campbell, D. T. & Stanley, J. C. (1963). Experimental and quasi-experimental designs for
research. Dallas, TX: Houghton Mifflin.
Carpenter, T. P., Fennema, E., Peterson, P.L., Chiang, C. P., & Loef, M. (1989). Using
knowledge of children’s mathematics thinking in classroom teaching: An experimental
study. American Educational Research Journal, 26(4), 499–531.
Choy S.P., Chen X., Bugarin R. (2006). Teacher Professional Development in 1999–
2000: What Teachers, Principals, and District Staff Report. Washington, DC: U.S.
Department of Education, National Center for Education Statistics.
Clewell, B. C., Campbell, P. B., & Perlman, L. (2004). Review of evaluation studies of
mathematics and science curricula and professional development models. Submitted to
the GE Foundation. Washington, DC: Urban Institute.
Cohen, D. K., & Hill, H. C. (1998). Instructional policy and classroom performance:
The mathematics reform in California (RR-39). Philadelphia: Consortium for Policy
Research in Education.
Cohen, D. K., & Hill, H. C. (2000). Instructional policy and classroom performance: The
mathematics reform in California. Teachers College Record, 102(2), 294–343.
Cohen, D. K., & Hill, H. C. (2001). Learning policy: When state education reform
works.
New Haven, CT: Yale University Press.
Confrey, J. (2006). Comparing and contrasting the National Research Council report on
Evaluating Curricular Effectiveness with the What Works Clearinghouse approach. Educational
Evaluation and Policy Analysis, 28(3), 195-213.
Cook, Thomas H., David Hunt, and Robert F. Murphy. (2000) “Comer’s School
Development Program in Chicago: A Theory-Based Evaluation.” American Educational
Research Journal (Summer).
Desimone, L., Porter, A. C., Garet, M., Yoon, K. S., & Birman, B. (2002). Does
professional development change teachers’ instruction? Results from a three-year study.
Educational Evaluation and Policy Analysis, 24(2), 81–112.
Donner, Allan (1998) “Some Aspects of the Design and Analysis of Cluster
Randomization Trials.” Applied Statistics 47(1): 95-113.
Donner, Allan, and Neil Klar (2000) Design and Analysis of Group Randomization Trials
in Health Research. London: Arnold.
Elmore, R. (2002). Bridging the gap between standards and achievement: The
imperative for professional development in education. [Online.] Available:
http://www.ashankerinst.org/Downloads/Bridging_Gap.pdf.
Garet, M., Porter, A., Desimone, L., Birman, B., & Yoon, K. S. (2001). What makes
professional development effective? Results from a national sample of teachers.
American Education Research Journal, 38(4), 915-945.
Good, T. L., Grouws, D. A., & Ebmeier, H. (1983). Active mathematics teaching. New
York: Longman, Inc.
Gosnell, Harold F. (1927) Getting Out the Vote: An Experiment in the Stimulation of
Voting. Chicago: University of Chicago Press.
Fisher, Ronald A. (1925) Statistical Methods for Research Workers. Edinburgh: Oliver &
Boyd.
Hawley, W. D., & Valli, L. (1998). The essentials of effective professional
development: A new consensus. In L. Darling-Hammond & G. Sykes (Eds.), The
heart of the matter: Teaching as a learning profession (pp. 86-124). San Francisco: Jossey-
Bass.
Hiebert, J. & Grouws, D. (2007). The effects of classroom mathematics teaching on
students’ learning. In F. K. Lester (Ed.), Second handbook of research on mathematics
teaching and learning (pp. 371-404). Charlotte, NC: Information Age Publishing.
Kennedy, M. (1998). Form and substance of inservice teacher education (Research
Monograph No. 13). Madison, WI: National Institute for Science Education, University
of Wisconsin–Madison.
Knapp, M. S. (1997). Between systemic reforms and the mathematics and science
classroom: The dynamics of innovation, implementation, and professional learning.
Review of Educational Research, 67(2), 227-266.
Leviton, Laura C., Robert L. Goldenberg, C. Suzanne Baker, and M. C. Freda. (1999)
“Randomized Controlled Trial of Methods to Encourage the Use of Antenatal
Corticosteroid Therapy for Fetal Maturation.” Journal of the American Medical
Association 281(1): 46-52.
Loucks-Horsley, S., Hewson, P. W., Love, N., & Stiles, K. E. (1998). Designing
professional development for teachers of science and mathematics. Thousand Oaks, CA:
Corwin Press, Inc.
Loucks-Horsley, S., Stiles, K., & Hewson, P. (1996). Principles of effective professional
development for mathematics and science education: A synthesis of standards (NISE
Brief, Vol. 1). Madison, WI: National Institutes for Science Education.
McCutchen, D., Abbott, R. D., Green, L. B., Beretvas, S. N., Cox, S., Potter, N. S.,
Quiroga, T., & Gray, A. L. (2002). Beginning literacy: Links among teacher knowledge,
teacher practice, and student learning. Journal of Learning Disabilities, 35(1), 69–86.
McGill-Franzen, A., Allington, R. L., Yokoi, L., & Brooks, G. (1999). Putting books in
the classroom seems necessary but not sufficient. Journal of Educational Research,
93(2), 67–74.
McKinlay S. M., E. J. Stone, and D. M. Zucker (1989) Research Design and Analysis
Issues. Health Education Quarterly, 16(2), 307-313.
Murray, David M. (1998) Design and Analysis of Group-Randomized Trials. New York:
Oxford University Press.
Murray, David M., Peter J. Hannan, David R. Jacobs, Paul J. McGovern, Linda Schmid,
William L. Baker, and Clifton Gray. (1994) "Assessing Intervention Effects in the
Minnesota Heart Health Program." American Journal of Epidemiology 139(1): 91-103.
National Commission on Teaching and America’s Future (1996). What matters most:
Teaching for America’s future. New York, NY: Author.
Richardson, V., & Placier, P. (2001). Teacher change. In V. Richardson (Ed.), Handbook
of research on teaching (4th ed., pp. 905-947). New York: Macmillan.
Saxe, G. B., Gearhart, M., & Nasir, N. S. (2001). Enhancing students’ understanding of
mathematics: A study of three contrasting approaches to professional support. Journal of
Mathematics Teacher Education, 4, 55–79.
Showers, B., Joyce, B., & Bennett, B. (1987). Synthesis of research on staff development:
A framework for future study and a state-of the-art analysis. Educational Leadership,
45(3), 77–87.
Sloan, H. A. (1993). Direct instruction in fourth and fifth grade classrooms. Unpublished
doctoral dissertation. Dissertation Abstracts International, 54(08), 2837A. (UMI No.
9334424)
Stullich, S., L. Eisner, J. McCrary, & C. Roney. (2006). National Assessment of Title I Interim
Report: Volume I: Implementation of Title I. Washington, DC: U.S. Department of Education,
Institute of Education Sciences, National Center for Education Evaluation and Regional
Assistance.
Supovitz, J. A. (2001). Translating teaching practice into improved student achievement.
In S. Fuhrman (Ed.), National Society for the Study of Education Yearbook. Chicago, IL:
University of Chicago Press.
Yoon, K. S., Garet, M., Birman, B., & Jacobson, R. (2006). Examining the effects of
mathematics and science professional development on teachers’ instructional practice: Using
professional development activity log. Washington, DC: Council of Chief State School
Officers.
Yoon, K. S., Duncan, T., Lee, S. W.-Y., Scarloss, B., & Shapley, K. (2007). Reviewing the
evidence on how teacher professional development affects student achievement (Issues &
Answers Report, REL 2007–No. 033). Washington, DC: U.S. Department of Education,
Institute of Education Sciences, National Center for Education Evaluation and Regional
Assistance, Regional Educational Laboratory Southwest.