Issues in The Design of Randomized Field Trials To Examine The Impact of Teacher Professional Development Interventions
Andrew Wayne
Kwang Suk Yoon
Stephanie Cronen
Michael S. Garet
Pei Zhu
MDRC
New York, New York
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
INTRODUCTION
One strategy for improving student achievement is to provide teachers with specialized
training. In addition to training provided prior to the start of the teaching career – i.e.,
preservice training – teachers also receive training during their careers. This strategy is
called in-service training or professional development (PD). Today, virtually all public
school teachers participate in professional development programs each year (Choy, Chen,
and Bugarin, 2006, p. 47). Professional development programs are also a focus in
NCLB, through which states and districts spent approximately $1.5 billion on teacher
professional development in 2004-05 (Birman et al., 2007).
Although NCLB provides resources for teacher professional development, the broader
effect of NCLB has been to promote accountability for student achievement. NCLB also
encourages school districts to adopt programs and practices that have been proven by
scientifically based research. There is thus a significant need for rigorous studies of the
effect of teacher professional development on student achievement. Sponsors of
professional development initiatives, such as the National Science Foundation, are
particularly eager to find ways to evaluate the effects that their professional development
programs have on student achievement.
This paper provides a discussion of issues that must be confronted in designing studies of
the impact of teacher professional development interventions using randomized
controlled trials. Although not the only method for studying impact, randomized
controlled trials address the problem of selection bias – the problem that those “selected”
to receive an intervention are often different from those not receiving it. Selection bias is
particularly problematic for those studying professional development. Many districts
have professional development offerings that they expect only a small cadre of ambitious
teachers will join. To complicate matters further, professional development is often
mandated for all teachers in schools that have been identified under accountability
systems. Clearly those participating in the mandated intervention are likely to be
different from those not participating.
The focus of this paper is on the design issues that arise when applying experimental methods to the study
of teacher professional development in particular.
Kennedy’s literature review (1998) was perhaps the first widely circulated review of
empirical evidence on the effectiveness of professional development on student
achievement. She analyzed a number of studies of math and science PD programs and
their effects on student outcomes. In order to identify features of effective professional
development programs, she categorized the programs in several ways. She found that the
relevance of the content of the PD was particularly important. She classified in-service
programs into four groups according to the levels of prescriptiveness and specificity of
the content they provide to teachers.1 Based on her analysis of effect sizes, Kennedy
(1998) concluded that the substance of the programs, that is, their content emphasis, mattered more for student achievement than their form or structure.
Kennedy's literature review shed light on the crucial role of content emphasis in high-
quality, effective PD. Her seminal work prompted others to test this hypothesis in
subsequent studies (cf. Desimone, Porter, Garet, Yoon, & Birman, 2002; Garet et al.,
2001; Yoon, Garet, & Birman, 2007).
1 In Kennedy's review, Group 1 professional development programs focused on activities that prescribed a set of teaching behaviors expected to apply generically to all school subjects (e.g., Stevens & Slavin, 1995). Group 2 PD activities also prescribed a set of generic teaching behaviors, but were proffered as applying to one particular school subject (e.g., Good, Grouws, & Ebmeier, 1983). Group 3 PD activities provided general guidance on both curriculum and pedagogy for teaching a particular subject and justified their recommended practices with reference to knowledge about how students learn that subject (e.g., Wood & Sellers, 1996). Lastly, Group 4 PD programs provided knowledge about how students learn particular subject matter but did not provide specific guidance on the practices that should be used to teach that subject (e.g., Carpenter et al., 1989).
Building on the literature reviews by Kennedy and by Clewell, Campbell, and Perlman
(2004), Yoon et al. (2007) recently conducted a more systematic and comprehensive
review. Yoon et al. examined studies of impacts in three core academic subjects (reading,
mathematics, and science). They focused the review on types of studies presumed to
provide strong and valid causal inferences: that is, randomized controlled trials (RCT)
and quasi-experimental design (QED) studies. Nine studies emerged as meeting the
What Works Clearinghouse evidence standards, from 132 manuscripts identified as
relevant.2 All nine studies were focused on elementary school teachers and their students.
Five studies were RCTs that met evidence standards without reservations. The remaining
four studies met evidence standards with reservations (one RCT with a group equivalence
problem and three QEDs).
Pooling across the three content areas included in the review, the average overall effect
size was 0.54. Notably, the effect size was reasonably consistent across studies in which
the PD took different forms and had different content. However, the small number of
qualifying studies limited the variability in the PD that was represented. For example, all
PD interventions in the nine studies were provided directly to teachers (rather than using
a train-the-trainer approach) and were delivered in the form of workshops or summer
institutes by the author(s) or their affiliated researchers. In addition, the studies involved
a small number of teachers, ranging from 5 to 44, often clustered in a few schools. In
general, these might be viewed as efficacy trials, testing the impact of PD in small,
controlled settings, in contrast to effectiveness trials, which test interventions on a larger
scale, in more varied settings.
Given the modest variation in the features of the studies that qualified to be included in
their review, Yoon et al. were unable to draw any strong conclusions about the
distinguishing features of effective professional development – and especially about the
role of content focus. The studies they reviewed did suggest that the duration or
“dosage” of PD may be related to impact. The average number of contact hours was
about 48 hours across the nine studies, and the three studies that provided the least
intensive PD (ranging from 5 to 14 hours over the span of two to three and a half months)
produced no statistically significant effect.3
Several larger scale studies of the impact of PD on student achievement are currently
underway, and, when their results become available, they should add appreciably to the
available research base. The ongoing studies include a study of the impact of PD in
2 At the initial stage, over 1,300 manuscripts were screened for relevance in terms of topic, study design, sample, and recency. After the prescreening process, only 132 studies were determined to be relevant for a subsequent thorough coding process. In this coding stage, each study under review was given one of three possible ratings in accordance with the WWC's technical guidelines: "Meets Evidence Standards" (e.g., randomized controlled trials that provided the strongest evidence of causal validity), "Meets Evidence Standards with Reservations" (e.g., quasi-experimental studies or randomized controlled trials that had problems with randomization, attrition, teacher-intervention confound, or disruption), and "Does Not Meet Evidence Screens" (e.g., studies that did not provide strong evidence of causal validity). The WWC's technical working papers are available at http://ies.ed.gov/ncee/wwc/twp.asp.
3 The remaining six studies provided more intensive PD, with contact hours ranging from 30 to more than 100. With the exception of one study with 83 contact hours, all of them resulted in significant effects on student achievement.
second grade reading involving about 270 teachers and a study of the impact of PD in
seventh grade mathematics involving about 200 teachers, both conducted by the authors
of the current paper and their colleagues. Other ongoing studies include a study of the
impact of science PD in Los Angeles, conducted by Gamoran, Borman, and their
colleagues at the University of Wisconsin, and a study of science PD conducted in Detroit
by Fishman and his colleagues at the University of Michigan.
In sum, while there has been some progress since Borko’s 2004 address, much remains to
be done. While evidence is accumulating that PD can be effective in different content
areas, more work is needed, especially on some thorny design issues. We summarize
some of these issues broadly in the next section, and then turn to a more detailed
discussion of four issues that we believe are particularly important.
"Although good curriculum materials can provide rich tasks and activities
that support students’ mathematical investigations, such materials may not
be sufficient to enable deep changes in instructional practice…
professional development strategies are designed to support teachers’
efforts to transform their practices” (p. 56).
A second key design decision involves the context in which the study will be carried out,
and this appears in the “Study Context” box at the right of the diagram. Among other
things, the setting determines the level and type of PD that occurs “naturally” among
teachers in the sample. The setting also determines the curricula that will be in use in the
sample classrooms, which is important since PD interventions may interact with
curricula. (Note that when curricula are part of the treatment itself, they would be
represented in the Intervention box rather than the Study Context box.)
[Exhibit 1. Conceptual framework: within the study context, the PD intervention (its focus, structure, and other features; its unit of intervention; and any non-PD components) is delivered to the treatment group but not to the control group in the study sample. Teacher knowledge and teacher practice at Time 1 and Time 2 mediate effects on student achievement, with teacher and student retention between the two time points.]
The determination of the study sample is the third issue we consider in some detail.
Exhibit 1 depicts the Study Sample within the Study Context as two boxes – one for the
treatment group, in the foreground, and one for the control group, in the background.
Retention of teachers and students in the sample, as shown in Exhibit 1, is a significant
consideration.
The selection of measures is the final design issue we consider. The selection of
measures of course depends on the assumed causal chain leading from participation in
PD to improved student achievement. Exhibit 1 shows one potential causal chain,
leading from teacher knowledge through teacher practice to student achievement. Presumably,
participating in PD will influence teachers’ knowledge and instructional practice, which
in turn will improve student achievement.4 We argue that it is critical to measure
4 As Kennedy (1998) argues, PD may operate primarily by improving teacher content knowledge, which will improve teachers' selection and use of practices, even though the practices are not a direct focus of the PD. Or, the PD may operate primarily by improving teachers' skills in implementing specific instructional practices.
variables at each stage in the causal chain. In addition, the effects of the PD on teacher
and student outcomes will unfold over time, and thus a key measurement decision
concerns the timing and frequency of measurement. For simplicity, Exhibit 1 shows two
time points. It is quite likely that the number of waves of measurement will be greater
than two.
DESIGN ISSUES
Researchers designing randomized controlled trials focused on teacher professional
development interventions face a common set of methodological issues. In this section,
we discuss the tradeoffs inherent in each issue. The resolution of these issues depends in
part on the particular intervention selected and the resources available, so the discussion
is meant to raise points to consider rather than provide definitive answers. We organize
the discussion under four broad subheadings.
One challenge in studying the impact of PD is that an intervention rests on at least two
theories: a theory about the features of PD that will promote change in teacher knowledge
and/or teacher practice (a "theory of teacher change") and a theory about how particular
classroom instructional practices will lead to improvements in student achievement (a
"theory of instruction"). (For a related argument, see Supovitz, 2001.)
With respect to a theory of the features of PD that promote teacher change, there is a
relatively large literature (Garet, Porter, Desimone, Birman, & Yoon, 2001; Guskey,
2003; Hawley & Valli, 1998; Kennedy, 1998; Little, 1993; Loucks-Horsley, Hewson,
Love, & Stiles, 1998; National Commission on Teaching and America’s Future, 1996;
Showers, Joyce, & Bennett, 1987; Wilson & Berne, 1999), but little evidence on the
impact of specific features on teacher knowledge, teacher practice, or student
achievement. Nevertheless, over the years, a consensus has been built on promising “best
practices.” For example, it is generally accepted that intensive, sustained, job-embedded
professional development focused on the content of the subject that teachers teach is
more likely to improve teacher knowledge, classroom instruction, and student
achievement. Further, active learning, coherence, and collective participation have also
been suggested to be promising “best practices” in professional development (Garet et al.,
2001). While these features have only limited evidence to support them, they provide a
starting point for the design of the form and delivery of PD, and many recent studies of
PD interventions appear to draw on features of this sort.
In contrast to the emerging consensus on the features of PD worth testing, there is much
less consensus on the theories of instruction. The PD interventions tested in recent
studies differ widely in the theory of instruction being tested. Some interventions focus
on a specific set of instructional practices. For example, Sloan (1993) randomly assigned
teachers to a treatment that sought to elicit teaching behaviors associated with the direct instruction
model—specifically, Madeline Hunter’s “seven steps of the teaching act.” Participating
elementary teachers were expected to use these practices in teaching all subjects.
All of this is to say that studies of PD interventions are tests of a package—a package that
inevitably draws on both a theory of teacher change and a theory of instruction. The fact
that PD packages draw on two theories can make negative results difficult to interpret. If
Sloan (1993) had found no effect, we would not know whether the flaw lay in the direct
instruction model or in the teacher change model, which was a 30-hour treatment that
included summer sessions and seven follow-up meetings, spread out over a period of 6
months.5
Such “three-arm” studies are a promising way to add to the knowledge base, since one
can test the importance of specific PD components. Alternatively, researchers can define
programs that share the same underlying theory of instruction but use very different PD
delivery mechanisms. The most serious constraint in selecting the treatments for such
“three-arm” studies is that the PD received by the two treatment groups needs to differ
enough to result in measurable differences in student achievement. (See issue #3,
below.)
Apart from specifying the features of the PD intervention, a related issue concerns the
available evidence in support of the intervention. If no prior research has been done on
the proposed intervention, it seems most useful to engage in a small-scale trial, testing
whether the intervention can be implemented as anticipated and can achieve results in
controlled settings.
et al., 1989; Good, Grouws, & Ebmeier, 1983; Saxe, Gearhart, & Nasir, 2001). Such
“replication trials” are likely to be especially valuable when previous trials were
implemented by a single trainer, with small numbers of volunteer teachers, or in a single
district context.
Or, if there is reasonable evidence on efficacy, one may move to an effectiveness trial,
testing the impact of the intervention across a wider set of contexts. Here, the main
design challenges are likely to involve ways of scaling up the PD so that it can be
delivered with fidelity in multiple settings with different facilitators. In many of the
available rigorous efficacy trials, the PD was delivered by a single individual, often the
primary researcher.
First, it is necessary to consider the curricular context and find locations with curricula
for which the PD is suited.6 Curricula, like PD programs, embed specific theories of
instruction. Ideally, it is sensible to seek locations that use curricula that align with the
underlying theory of instruction embedded in the PD. At the very least, it is necessary to
choose locations where the curricula would not discourage the practices promoted by the
PD.
Second, it is important to examine the PD that already exists in districts being considered.
It is inevitable that, during the study, teachers will receive other PD. As noted
earlier, teachers participate in a variety of professional development activities each year,
either due to mandates, incentives, or personal initiative. Teachers may be part of
informal groups at their schools that serve professional development needs. Teachers will
presumably continue to participate in all of these PD experiences regardless of their
status in the study, except to the extent researchers are able to negotiate special
arrangements.
One concern is that ambient PD may reduce the contrast between the treatment group and the control group. If the ambient PD contains elements in common with the study PD, the
impact will appear to be smaller. For instance, suppose some researchers want to study a
10-day program of content-focused professional development for geometry teachers. If
the district in which the study took place provided all geometry teachers with two days of
workshops on geometry, the teachers in the treatment group would probably not learn as
much during their 10-day PD program as they would have otherwise. Assuming that it is
impossible to dissuade the district from providing its 2-day PD, the only way to avoid
such a problem is to conduct one’s study in districts where the ambient PD has no
elements in common with the PD to be studied.
Another potential concern with ambient PD is whether teachers can attend both the
ambient PD and the study PD. Ideally, the teachers in the treatment group will receive all
of the study PD and will continue to receive the ambient PD, such that the study will
show the value added of the PD treatment that is being randomly assigned. But if the
treatment is time-intensive or if a district has a lot of ambient PD, scheduling conflicts
could occur, or teachers could begin to feel “overloaded” and selectively not attend some
PD events. Thus, treatment group teachers might not get some PD that they were
supposed to get. The treatment-control difference would be distorted, since the treatment
group would be receiving less ambient PD than the control group. Alternatively,
treatment group teachers might decide to attend the ambient PD in lieu of attending the
study PD.
Thus, it is important to select locations where the chance of interference with the PD under
study is low; alternatively, it may be wise to construct treatments that can be
delivered without interfering significantly with other PD. It is also necessary to find
locations where the ambient PD does not share elements in common with the study PD.
Regardless, the extent and variability of ambient PD clearly necessitates the measurement
of teacher participation in the ambient PD – to verify that treatment and control teachers
participated equally and that the measured impact is roughly the impact of the treatment
PD. These measures can characterize the service contrast between treatment and control
teachers very clearly (e.g., dosage in terms of contact hours or duration).
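To make this concrete, the brief sketch below (hypothetical data and variable names, not drawn from any of the studies discussed here) tallies study PD and ambient PD contact hours for treatment and control teachers and reports the resulting service contrast in hours:

```python
# A minimal sketch with hypothetical teacher-level records of PD contact hours.
from statistics import mean

# (group, hours of study PD received, hours of ambient/district PD received)
teachers = [
    ("treatment", 40, 6), ("treatment", 38, 8), ("treatment", 42, 5),
    ("control", 0, 7), ("control", 0, 9), ("control", 0, 6),
]

def mean_hours(records, group):
    """Mean study-PD and ambient-PD hours for one experimental group."""
    study = [s for g, s, a in records if g == group]
    ambient = [a for g, s, a in records if g == group]
    return mean(study), mean(ambient)

t_study, t_ambient = mean_hours(teachers, "treatment")
c_study, c_ambient = mean_hours(teachers, "control")

# The service contrast is the difference in total PD (study plus ambient) between groups.
contrast = (t_study + t_ambient) - (c_study + c_ambient)
print(f"Treatment group total PD hours: {t_study + t_ambient:.1f}")
print(f"Control group total PD hours:   {c_study + c_ambient:.1f}")
print(f"Service contrast (hours):       {contrast:.1f}")
```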
Another design issue concerns the determination of the sample that will be needed for the
intended study. Randomized controlled trials of professional development interventions
differ from many other educational experiments in the sense that the professional
development interventions target teachers, while the eventual outcome of interest—
student achievement—is measured and collected at the student level. Trials like these pose
some new challenges and difficulties for sample design (Donner, 1998). This section
discusses some of these challenges that are specific to the evaluation of professional
development interventions. In particular, it discusses issues related to unit of random
assignment, unit of analysis, precision of the estimators, and teacher/student mobility
during the program.
Unit of Random Assignment
The choice of unit of assignment has important implications for the set of teachers and
their students for whom impact is examined. If the teacher is selected as the unit, then
each teacher included in the study would be assigned to treatment or control at the start
of the study, and the teacher (and the teacher’s students) would be followed up during
each wave of data collection. If, on the other hand, the school is selected as the unit, then
each school included in the study would be assigned to treatment or control at the start of
the study, and all relevant teachers currently teaching in each school (and their students)
would be the target of data collection at each wave.
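The sketch below illustrates the two options with a hypothetical roster of schools and teachers; the roster, seed, and helper function are illustrative assumptions rather than procedures from any particular study:

```python
# A minimal sketch of the two assignment options; the roster, seed, and helper are hypothetical.
import random

random.seed(20240101)  # fixed seed so the assignment can be reproduced and audited

# Hypothetical roster: schools and the relevant teachers in each.
schools = {
    "School A": ["T01", "T02", "T03"],
    "School B": ["T04", "T05"],
    "School C": ["T06", "T07", "T08"],
    "School D": ["T09", "T10"],
}

def split_randomly(units, p_treat=0.5):
    """Shuffle the units and assign the first share to treatment, the rest to control."""
    units = list(units)
    random.shuffle(units)
    n_treat = round(len(units) * p_treat)
    return {u: ("treatment" if i < n_treat else "control") for i, u in enumerate(units)}

# School-level design: every relevant teacher inherits his or her school's assignment.
school_assignment = split_randomly(schools.keys())
teachers_school_design = {t: school_assignment[s]
                          for s, roster in schools.items() for t in roster}

# Teacher-level design: teachers are assigned individually, here within each school,
# so most schools end up containing both treatment and control teachers.
teachers_teacher_design = {}
for roster in schools.values():
    teachers_teacher_design.update(split_randomly(roster))

print(teachers_school_design)
print(teachers_teacher_design)
```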
In the absence of teacher turnover, both the teacher-level and school-level designs focus on
teachers identified for inclusion in the study at the time the study begins. But, in the
presence of teacher turnover, the two choices for the unit of assignment lead to different
teacher samples over time. The teacher-level design involves following teachers as long
as they remain in teaching, and collecting data on their knowledge and their students'
achievement at the time points specified in the design. Teachers who exit teaching or
who change grade levels or subjects must be dropped from the study, because data on
classroom instruction and student achievement could not be collected from such teachers,
even if the teachers could be located. Thus, in the teacher-level design, if the PD
treatment affects teacher turnover rates, mobility could lead to selection bias, and this
bias would need to be taken into account in the analysis.
In the school-level design, if a teacher leaves a school in the sample over the course of
the study and another teacher enters, then the new replacement teacher would be included
in the study sample from that point forward. Thus, in a school-level design, mobility can
7 The use of group randomization to study the impacts of social policies has a long history. Harold F. Gosnell (1927) studied ways to increase voter turnout by randomly assigning one part of each of twelve local districts in Chicago as targets for a series of hortatory mailings and the other part as a control group. In recent years, the use of group randomization has spread to many fields (for a review and discussion of the key issues, see Boruch and Foley 2000). Over the past decades, it has been used to evaluate "whole-school" reforms (Cook, Hunt, and Murphy 2000), school-based teacher training programs (Blank et al. 2002), community health promotion campaigns (Murray, Hannan, et al. 1994), community employment initiatives (Bloom and Riccio 2002), and group medical practice interventions (Leviton et al. 1999). The two textbooks on cluster randomization published to date, both of which focus on evaluating health programs, are by Allan Donner and Neil Klar (2000) and by David M. Murray (1998).
cause the impact of the PD to be diluted. For example, if a treatment teacher left in the
middle of a program year, after the PD treatment was complete, and his/her class was
handed over to a new teacher who had not been exposed to the professional development
intervention, then the amount of exposure to treatment that the students in this class had
would be cut in half. The treatment impact estimated based on the achievement data
from this class of students would reflect the program impact of only half of the intended
treatment, and is likely to differ from what the students would have experienced
had the treatment teacher stayed for the whole program year. This suggests that in a
school-level design, it may be desirable to incorporate components in the intervention to
provide support for new teachers who enter treatment schools over the course of the study
(i.e., supplemental PD).8
Apart from the implications for the analysis of teacher mobility, there are other pros and
cons associated with choosing the school or the teacher as the unit of assignment. Using
the school as the random assignment unit helps to reduce control-group contamination. One of
the greatest threats to the methodological integrity of a random-assignment research
design is the possibility that some control-group members will be exposed to the
program, thus reducing the service contrast – the difference between the group receiving
the intervention (or the treatment group) and the “business as usual” group (or the control
group). Such contamination of the control group is especially likely in the case of
professional development interventions. Since many of these interventions incorporate
collaboration among teachers at a given grade level, if some teachers in a school are
randomly assigned to an intervention, they are likely to share some of the information
provided through the intervention with peers who have been randomly assigned to the
control group. This second-hand exposure will attenuate the service contrast and make it
difficult to interpret impact estimates. By randomly assigning schools to treatment or
control conditions, one separates the treatment group from the control group spatially, thereby
blocking some potential paths of information flow and reducing control-group
contamination.
Using the school as the unit of assignment may also help to deliver the services more
effectively by capitalizing on economies of spatial concentration. Spatial concentration of
the target-group members may reduce the time and money costs of transportation to and
from the program, and may enable staff to operate the program more effectively by
exposing staff directly to the problems that arise on site. For example, when a professional
development intervention involves intensive coaching, it clearly reduces costs if a coach can
travel to one school to work with four teachers in that school rather than travel to two
different schools to work with two teachers in each. School-level randomization
enables the coach to spend more time in the school, gain exposure to the common
problems in that school, and thus adjust the delivery of services to fit the school's needs
more effectively.
8 In a school-level design, turnover would not ordinarily lead to selection bias even if there is differential turnover between treatment and control schools, because all teachers teaching in the schools would be included in the analysis; the estimated impact would then reflect a combination of the impact of the treatment on achievement for teachers who stay and its impact on turnover.
Another very different reason for randomly assigning schools is to facilitate political
acceptance of randomization by offsetting ethical concerns about "equal treatment".
Even though random assignment of teachers treats individual teachers equally in the
sense that each one of them has an equal chance of being offered the program, this fact is
often overlooked because after randomization treatment group teachers have access to the
intervention while control group teachers do not. This perception is especially acute if,
within a school, some teachers receive the intervention and some do not. Therefore, school-
level randomization is generally easier to "sell" than teacher-level randomization.
On the other hand, using teachers as the unit of assignment requires fewer participating
schools and teachers, for a given level of precision. It is well documented that estimates
based on cluster randomization have less statistical precision than those based on
individual randomization because possible correlations of impacts across individuals
within a cluster have to be accounted for in cluster randomization (Bloom 2005; Murray et al.
1994). By analogy, for a given data structure, estimates based on school-level
(cluster) randomization have less statistical precision than those based on teacher-level
(individual) randomization because possible correlations of impacts across teachers
within a school have to be accounted for if schools are being randomized.9 In other
words, fewer teachers in total are needed to detect an impact of a given size with a
given level of precision if the randomization is done at the teacher level instead of the school
level. Note, however, that the difference in precision between these two options depends
on the specific outcomes and analytical methods used in the evaluation and can vary from
program to program. Nonetheless, a reduction in required sample size saves money and
effort in the evaluation process.
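As a rough illustration of this precision difference, the sketch below applies the design effect implied by the footnote 9 formulas (with no covariates, the variance of the school-randomized impact estimator exceeds that of the teacher-randomized estimator by a factor of 1 + (n − 1) × ICC); all numerical inputs are hypothetical:

```python
# A sketch of the precision penalty implied by the footnote 9 formulas (no covariates);
# all numerical inputs are illustrative, not estimates from any study.

def design_effect(n_per_school, icc):
    """Variance inflation from randomizing schools rather than teachers:
    SE_CL^2 / SE_IN^2 = 1 + (n - 1) * ICC."""
    return 1.0 + (n_per_school - 1) * icc

n_per_school, icc = 4, 0.15              # hypothetical: 4 study teachers per school, ICC = 0.15
deff = design_effect(n_per_school, icc)

teachers_needed_teacher_level = 120      # illustrative requirement under teacher-level randomization
teachers_needed_school_level = teachers_needed_teacher_level * deff

print(f"Design effect: {deff:.2f}")
print(f"Teachers needed, teacher-level randomization: {teachers_needed_teacher_level}")
print(f"Teachers needed, school-level randomization:  {teachers_needed_school_level:.0f}")
```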
Overall, whether to use the school or the teacher as the randomization unit depends on the specific
features of the intervention being tested. If control-group contamination is not a major
concern, because of the nature of the intervention or because there are other effective ways to prevent
contamination, and monetary constraints are important, then using the teacher as the
randomization unit might serve the purposes of the study better. On the other hand, if the
nature of the intervention makes it prone to contamination and the budget for the program
is sufficient to support an evaluation with adequate power, then using the school as the
randomization unit would be preferable.
9 The standard error of the impact estimator for a cluster-randomized program (when no covariates other than treatment status are included) is $SE_{CL} = \sqrt{\tfrac{1}{P(1-P)}\left(\tfrac{\tau^2}{J} + \tfrac{\sigma^2}{nJ}\right)}$, while the standard error of the impact estimator for an individually randomized program is $SE_{IN} = \sqrt{\tfrac{1}{P(1-P)}\left(\tfrac{\tau^2}{nJ} + \tfrac{\sigma^2}{nJ}\right)}$. Here $P$ is the proportion of clusters (or individuals) that are randomly assigned to the treatment group; $J$ is the number of clusters; $n$ is the number of individuals within a cluster; $\tau^2$ is the cross-cluster variance; and $\sigma^2$ is the within-cluster variance (Bloom 2005). The proportion of the total population variance across clusters as opposed to within clusters, $\tau^2/(\tau^2+\sigma^2)$, is usually called the intra-class correlation (Fisher 1925).
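For concreteness, the sketch below is a direct transcription of these two formulas; the parameter values are arbitrary illustrations rather than estimates from any study:

```python
# Direct transcription of the two standard-error formulas above (Bloom 2005 notation);
# the parameter values below are arbitrary illustrations, not estimates from any study.
import math

def se_cluster(P, J, n, tau2, sigma2):
    """Standard error of the impact estimator when clusters (e.g., schools) are randomized."""
    return math.sqrt((1.0 / (P * (1.0 - P))) * (tau2 / J + sigma2 / (n * J)))

def se_individual(P, J, n, tau2, sigma2):
    """Standard error of the impact estimator when individuals (e.g., teachers) are randomized."""
    return math.sqrt((1.0 / (P * (1.0 - P))) * (tau2 / (n * J) + sigma2 / (n * J)))

P, J, n = 0.5, 40, 4           # half the units treated, 40 clusters, 4 teachers per cluster
tau2, sigma2 = 0.15, 0.85      # cross-cluster and within-cluster variance (ICC = 0.15)

print(f"Intra-class correlation: {tau2 / (tau2 + sigma2):.2f}")
print(f"SE, school-level randomization:  {se_cluster(P, J, n, tau2, sigma2):.3f}")
print(f"SE, teacher-level randomization: {se_individual(P, J, n, tau2, sigma2):.3f}")
```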
Precision of the Impact Estimates
Another important design element is the precision of the estimated impacts of a study.
The precision of impact is often expressed in terms of the smallest program effect that
could be detected with confidence, or minimum detectable effect (MDE). A minimum
detectable effect is the smallest true program effect that has a certain probability of
producing an impact estimate that is statistically significant at a given level (Bloom
1995). This parameter, which is a multiple of the impact estimator's standard error (see
the first formula in footnote 9), depends on the following factors:
• The type of test to be performed: a one-tailed t-test is used for program impact in
the predicted direction; a two-tailed t-test can be used for any program effects;
• The level of statistical significance to which the result of this test will be
compared (α);
• The desired statistical power (1- β)—the probability of detecting a true effect of a
given size or larger;
• The degrees of freedom of the test, which depend on the number of clusters (J)
and the size of the clusters;
• The intra-class correlation—the proportion of the total population variance across
clusters as opposed to within clusters;
• The explanatory power of potential covariates, such as students' prior
achievement, teacher characteristics, etc. (Bloom 2005).
The first three factors have their conventional values and are relatively easy to pin
down.10 The minimum detectable effect size declines in roughly inverse proportion to the
square root of the number of clusters randomized, while the size of the clusters
randomized often makes far less difference to the precision of program impact estimators
than does the number of clusters (for more details, see Schochet 2005).
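As a rough worked example of how these factors combine, the sketch below pairs the cluster-randomization standard error from footnote 9 with a normal approximation to the usual multiplier (so the degrees-of-freedom correction is ignored); all inputs are hypothetical:

```python
# Rough minimum-detectable-effect calculation combining the factors listed above.
# A sketch only: it uses a normal approximation to the multiplier (degrees of freedom
# are ignored) and the cluster-randomization standard error from footnote 9.
import math
from statistics import NormalDist

def mde(P, J, n, icc, alpha=0.05, power=0.80, two_tailed=True):
    """Minimum detectable effect (in standard-deviation units) for a school-randomized
    design with no covariates and total outcome variance normalized to 1."""
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / (2 if two_tailed else 1)) + z.inv_cdf(power)
    tau2, sigma2 = icc, 1 - icc
    se = math.sqrt((1 / (P * (1 - P))) * (tau2 / J + sigma2 / (n * J)))
    return multiplier * se

# Hypothetical design: 40 schools, 4 study teachers per school, ICC = 0.15.
print(f"MDE with 40 schools: {mde(P=0.5, J=40, n=4, icc=0.15):.2f}")
# Doubling the number of schools shrinks the MDE by roughly 1/sqrt(2).
print(f"MDE with 80 schools: {mde(P=0.5, J=80, n=4, icc=0.15):.2f}")
```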
The last two factors, on the other hand, are hard to determine. These factors vary from
outcome to outcome and from sample to sample. The conventional practice is to use
similar measures from similar past studies as proxies for these factors in the precision
calculation at the design phase of a study. However, it is often difficult to find good
proxies, and judgment must be exercised in choosing values for these factors. This is
especially true for evaluations of professional development interventions that use teacher-
level randomization because measures of teacher-level intra-class correlation and the
explanatory power of teacher-level covariates are not common in past studies.
To assess the minimum detectable effect size for a research design, one needs a basis for
deciding how much precision is needed. From a programmatic perspective, it might be
whether the study can detect an effect that, judging from the performance of similar
programs, is likely to be attainable. This “attainable effect” is especially difficult to
determine for professional development interventions. Unlike other types of educational
interventions, professional development programs target teachers directly and affect
student outcomes only indirectly, through teachers. The existing literature on
professional development shows a wide range of effect sizes depending on the nature of
the interventions and the study designs, and thus gives very little guidance
about how much precision one should expect from a professional development
intervention.
• Determining the timing of measurement of outcomes.
An important design decision concerns how much of the study’s resources to devote to
the measurement of anticipated mediating factors, such as implementation levels
achieved and proximal outcomes (e.g., teacher knowledge and practice), in addition to
distal outcomes (student achievement). While it is tempting in randomized field trials to
focus only on the ultimate outcome of student achievement, excluding measures of
proximal outcomes and other potential moderators and mediators can ultimately be costly
in conceptual terms. Measurement of mediating variables is especially critical in making
use of study results to draw conclusions about the “theory of teacher change” and “theory
of instruction” on which the PD intervention is based.
For example, if a study of professional development does not find an overall impact on
student achievement, and no teacher outcomes are measured, it is impossible to know
where and to what extent the causal model broke down (see Exhibit 1, above). It may be
that the professional development was effective in increasing teachers’ knowledge or
practices (so the theory of teacher change receives support), but these teacher changes
did not result in higher student achievement (so the theory of instruction is not
supported). Without a measure of the proximal outcomes of the professional
development, the model cannot be fully explored or understood.
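As an illustration, the sketch below uses fabricated numbers (and ignores clustering, covariates, and statistical inference) to compute the treatment-control contrast at each stage of the hypothesized chain:

```python
# Illustrative sketch with fabricated numbers: estimating the treatment-control contrast at
# each link of the hypothesized causal chain, so that a null achievement result can be traced
# to the theory of teacher change or to the theory of instruction. Clustering, covariates,
# and significance testing are omitted for brevity.
from statistics import mean, stdev

def standardized_difference(treatment, control):
    """Treatment-control difference in means, scaled by the SD of the combined sample."""
    return (mean(treatment) - mean(control)) / stdev(treatment + control)

# Hypothetical outcomes for treatment (t) and control (c) groups at each stage.
knowledge_t, knowledge_c = [0.8, 0.9, 1.1, 0.7], [0.3, 0.4, 0.2, 0.5]
practice_t, practice_c = [0.6, 0.7, 0.9, 0.5], [0.2, 0.3, 0.4, 0.1]
achievement_t, achievement_c = [0.1, 0.0, 0.2, 0.1], [0.1, 0.1, 0.0, 0.2]

outcomes = {
    "teacher knowledge": (knowledge_t, knowledge_c),
    "teacher practice": (practice_t, practice_c),
    "student achievement": (achievement_t, achievement_c),
}
for label, (t, c) in outcomes.items():
    print(f"Impact on {label}: {standardized_difference(t, c):+.2f} SD")
# Positive impacts on knowledge and practice but not on achievement would support the
# theory of teacher change while casting doubt on the theory of instruction.
```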
PD interventions will vary in the specificity of the intended outcomes for teacher
knowledge, teacher practice, and student achievement. In particular, as discussed earlier,
some approaches to PD may be designed to improve teachers’ skills in implementing a
highly specified set of instructional practices; other PD may be designed to strengthen
teachers’ content knowledge and pedagogical content knowledge, with desired changes in
practice less clearly articulated. Similarly, some PD interventions may be designed to
produce changes in relatively narrowly defined aspects of student achievement (e.g.,
student understanding of a particular concept in mathematics), while other PD
interventions may be designed to produce broad-gauge changes in achievement.
Clearly, the outcome measures must be chosen to reflect the intended outcomes of the
PD; but the desired degree of alignment can be difficult to establish, and it may be wise
to choose multiple instruments that are more or less closely aligned with the specific
focus of the PD, to provide some information on the generalizability and robustness of the
findings.
Two extremes can exemplify the need for balance between alignment and generalizability
in outcome measure selection. Assume two groups of researchers have conducted studies
to determine whether a professional development program on early reading content and instruction leads
to higher achievement. In both scenarios, the intervention was focused primarily on
providing teachers with an understanding of phonemic awareness, phonics, and fluency
instruction in early reading, and the researchers chose a teacher outcome measure that
was created by the professional development team to assess teachers’ understanding of
the content just learned. The first group of researchers chose the language and examples
in the assessment directly from the PD training materials, similar to an “end-of-chapter”
test found in many textbooks. In addition, the researchers created a similar student
achievement measure that exclusively contained the types of decoding items that the
teachers had learned explicitly to address in their reading instruction. Both measures
included a range of easy to difficult items, to ensure that ability at all levels was captured.
Upon conducting the impact analyses, the researchers obtained an effect size of 1.20 for
student achievement, and concluded that professional development in early reading
affects reading achievement.
The second group of researchers studying the same intervention decided to rely on
existing standardized assessments of teacher and student ability. These measures had
been widely tested and validated, and were commonly used in the education domain as
accountability tests, so the results were policy relevant. In addition, the data collection
was low-burden for study participants because the tests were district-administered and
data were available from secondary sources. The tests measured a variety of early
reading and language arts outcomes, but did not provide subscores on decoding, which
was a focus of the PD. Upon conducting an impact analysis, this group obtained null
results.
The differences in results across the two groups of researchers may have little to do with
the PD being studied and more to do with the alignment of measures with the
intervention. The first group of researchers chose a strongly aligned test; the second
group of researchers chose a weakly aligned test. If the intended outcomes of the PD are
very narrowly and specifically defined, it may be sufficient to focus on a highly aligned
set of measures; in most cases, though, it appears that a mix of measures varying in
specificity will be required, to permit an examination of the generalizability of the results
across outcome measures.
Timing of Outcome Measurement
Issues of timing are central to the study of the impact of professional development. As
shown in Exhibit 1, moving from the provision of PD to obtaining an impact on
achievement involves traversing a number of causal links, and each of these may take
time to unfold. How long do teachers need to think about what they have learned in order
to put it effectively into practice? Once practices are put into place, are they sustained
over time? And how long does improved instruction take to produce detectable
increases in students' learning?
The answer to these questions will likely vary by the intensity, specificity and form of the
professional development received, and by the alignment of the outcome measure to the
professional development. A focused week-long institute on phonics with a lot of
modeling and training on specific instructional practices may begin to affect teachers’
practices as soon as the teachers get back in the classroom. If the student test used to
measure achievement is very sensitive to these practices, achievement gains might occur
relatively quickly.
In most studies, however, these conditions are not likely to be met, suggesting that a
realistic study will require multiple waves of data collection, with the timing determined
by features of both the PD and the measures.
CONCLUSION
For researchers interested in understanding the impact of professional development,
randomized controlled trials overcome the perennial problem of selection bias, caused by
the fact that districts typically make PD available to some teachers on a voluntary basis,
at the same time they mandate participation in other PD.
Given the current public investment in professional development and the demonstrated
potential for professional development to improve student achievement, it is not
surprising that increasing attention is being given to the design of rigorous studies of the
impact of PD. To maximize what is learned from these studies, it is critical to give more
attention to the vexing methodological issues involved.
References
Ball, D. L. (1996). Teacher learning and the mathematics reforms: What we think we
know and what we need to learn. Phi Delta Kappan, 77(7), 500-08.
Birman, B., Le Floch, K. C., Klekotka, A., Ludwig, M., Taylor, J., Walters, K., Wayne, A., &
Yoon, K. (2007). State and Local Implementation of the No Child Left Behind Act,
Volume II—Teacher Quality Under NCLB: Interim Report. Washington, D.C.: U.S.
Department of Education, Office of Planning, Evaluation and Policy Development,
Policy and Program Studies Service.
Blank, Rolf K., Diana Nunnaley, Andrew Porter, John Smithson, and Eric Osthoff. (2002)
Experimental Design to Measure Effects of Assisting Teachers in Using Data on Enacted
Curriculum to Improve Effectiveness in Instruction in Mathematics and Science
Education. Washington, DC.: National Science Foundation.
Bloom, Howard S. (1995) “Minimum Detectable Effects: A simple way to Report the
Statistical Power of Experimental Designs.” Evaluation review 19(5): 547-56.
Bloom, Howard S., and James A. Riccio. (2002) Using Place-Based Random Assignment
and Comparative Interrupted Time-Series Analysis to Evaluate the Jobs-Plus
Employment Program for Public Housing Residents. New York: MDRC.
Borko, H. (2004). Professional development and teacher learning: Mapping the terrain.
Educational Researcher, 33(8), 3–15.
Boruch, Robert F., and Ellen Foley (2000) “The Honestly Experimental Society: Sites
and Other Entities as the Units of Allocation and Analysis in Randomized Trials.” In
Validity and Social Experimentation: Donald Campbell’s Legacy, Edited by Leonard
Bickman. Volume 1. Thousand Oaks, Calif.: Sage Publications.
Campbell, D. T. & Stanley, J. C. (1963). Experimental and quasi-experimental designs for
research. Dallas, TX: Houghton Mifflin.
Carpenter, T. P., Fennema, E., Peterson, P.L., Chiang, C. P., & Loef, M. (1989). Using
knowledge of children’s mathematics thinking in classroom teaching: An experimental
study. American Educational Research Journal, 26(4), 499–531.
Choy S.P., Chen X., Bugarin R. (2006). Teacher Professional Development in 1999–
2000: What Teachers, Principals, and District Staff Report. Washington, DC: U.S.
Department of Education, National Center for Education Statistics.
Clewell, B. C., Campbell, P. B., & Perlman, L. (2004). Review of evaluation studies of
mathematics and science curricula and professional development models. Submitted to
the GE Foundation. Washington, DC: Urban Institute.
Cohen, D. K., & Hill, H. C. (1998). Instructional policy and classroom performance:
The mathematics reform in California (RR-39). Philadelphia: Consortium for Policy
Research in Education.
Cohen, D. K., & Hill, H. C. (2000). Instructional policy and classroom performance: The
mathematics reform in California. Teachers College Record, 102(2), 294–343.
Cohen, D. K., & Hill, H. C. (2001). Learning policy: When state education reform
works.
New Haven, CT: Yale University Press.
Confrey, J. (2006). Comparing and contrasting the National Research Council report on
Evaluating Curricular Effectiveness with the What Works Clearinghouse approach. Educational
Evaluation and Policy Analysis, 28(3), 195-213.
Cook, Thomas H., David Hunt, and Robert F. Murphy. (2000) “Comer’s School
Development Program in Chicago: A Theory-Based Evaluation.” American Educational
Research Journal (Summer).
Desimone, L., Porter, A. C., Garet, M., Yoon, K. S., & Birman, B. (2002). Does
professional development change teachers’ instruction? Results from a three-year study.
Educational Evaluation and Policy Analysis, 24(2), 81–112.
Donner, Allan (1998) “Some Aspects of the Design and Analysis of Cluster
Randomization Trials.” Applied Statistics 47(1): 95-113.
Donner, Allan, and Neil Klar (2000) Design and Analysis of Group Randomization Trials
in Health Research. London: Arnold.
Elmore, R. (2002). Bridging the gap between standards and achievement: The
imperative for professional development in education. [Online.] Available:
http://www.ashankerinst.org/Downloads/Bridging_Gap.pdf.
Garet, M., Porter, A., Desimone, L., Birman, B., & Yoon, K. S. (2001). What makes
professional development effective? Results from a national sample of teachers.
American Education Research Journal, 38(4), 915-945.
Good, T. L., Grouws, D. A., & Ebmeier, H. (1983). Active mathematics teaching. New
York: Longman, Inc.
Gosnell, Harold F. (1927) Getting Out the Vote: An Experiment in the Stimulation of
Voting. Chicago: University of Chicago Press.
Fisher, Ronald A. (1925) Statistical Methods for Research Workers. Edinburgh: Oliver &
Boyd.
Hawley, W. D., & Valli, L. (1998). The essentials of effective professional
development: A new consensus. In L. Darling-Hammond & G. Sykes (Eds.), The
heart of the matter: Teaching as a learning profession (pp. 86-124). San Francisco: Jossey-
Bass.
Hiebert, J. & Grouws, D. (2007). The effects of classroom mathematics teaching on
students’ learning. In F. K. Lester (Ed.), Second handbook of research on mathematics
teaching and learning (pp. 371-404). Charlotte, NC: Information Age Publishing.
Kennedy, M. (1998). Form and substance of inservice teacher education (Research
Monograph No. 13). Madison, WI: National Institute for Science Education, University
of Wisconsin–Madison.
Knapp, M. S. (1997). Between systemic reforms and the mathematics and science
classroom: The dynamics of innovation, implementation, and professional learning.
Review of Educational Research, 67(2), 227-266.
Leviton, Laura C., Robert L. Goldenberg, C. Suzanne Baker, and M. C. Freda. (1999)
“Randomized Controlled Trial of Methods to Encourage the Use of Antenatal
Corticosteroid Therapy for Fetal Maturation.” Journal of the American Medical
Association 281(1): 46-52.
Loucks-Horsley, S., Hewson, P. W., Love, N., & Stiles, K. E. (1998). Designing
professional development for teachers of science and mathematics. Thousand Oaks, CA:
Corwin Press, Inc.
Loucks-Horsley, S., Stiles, K., & Hewson, P. (1996). Principles of effective professional
development for mathematics and science education: A synthesis of standards (NISE
Brief, Vol. 1). Madison, WI: National Institutes for Science Education.
McCutchen, D., Abbott, R. D., Green, L. B., Beretvas, S. N., Cox, S., Potter, N. S.,
Quiroga, T., & Gray, A. L. (2002). Beginning literacy: Links among teacher knowledge,
teacher practice, and student learning. Journal of Learning Disabilities, 35(1), 69–86.
McGill-Franzen, A., Allington, R. L., Yokoi, L., & Brooks, G. (1999). Putting books in
the classroom seems necessary but not sufficient. Journal of Educational Research,
93(2), 67–74.
McKinlay S. M., E. J. Stone, and D. M. Zucker (1989) Research Design and Analysis
Issues. Health Education Quarterly, 16(2), 307-313.
Murray, David M. (1998) Design and Analysis of Group-Randomized Trials. New York:
Oxford University Press.
Murray, David M., Peter J. Hannan, David R. Jacobs, Paul J. McGovern, Linda Schmid,
William L. Baker, and Clifton Gray. (1994) "Assessing Intervention Effects in the
Minnesota Heart Health Program." American Journal of Epidemiology 139(1): 91-103.
National Commission on Teaching and America’s Future (1996). What matters most:
Teaching for America’s future. New York, NY: Author.
Richardson, V., & Placier, P. (2001). Teacher change. In V. Richardson (Ed.), Handbook
of research on teaching (4th ed., pp. 905-947). New York: Macmillan.
Saxe, G. B., Gearhart, M., & Nasir, N. S. (2001). Enhancing students’ understanding of
mathematics: A study of three contrasting approaches to professional support. Journal of
Mathematics Teacher Education, 4, 55–79.
Showers, B., Joyce, B., & Bennett, B. (1987). Synthesis of research on staff development:
A framework for future study and a state-of the-art analysis. Educational Leadership,
45(3), 77–87.
Sloan, H. A. (1993). Direct instruction in fourth and fifth grade classrooms. Unpublished
doctoral dissertation. Dissertation Abstracts International, 54(08), 2837A. (UMI No.
9334424)
Stullich, S., L. Eisner, J. McCrary, & C. Roney. (2006). National Assessment of Title I Interim
Report: Volume I: Implementation of Title I. Washington, DC: U.S. Department of Education,
Institute of Education Sciences, National Center for Education Evaluation and Regional
Assistance.
Supovitz, J. A. (2001). Translating teaching practice into improved student achievement.
In S. Fuhrman (Ed.), National Society for the Study of Education Yearbook. Chicago, IL:
University of Chicago Press.
Yoon, K. S., Garet, M., Birman, B., & Jacobson, R. (2006). Examining the effects of
mathematics and science professional development on teachers’ instructional practice: Using
professional development activity log. Washington, DC: Council of Chief State School
Officers.
Yoon, K. S., Duncan, T., Lee, S. W.-Y., Scarloss, B., & Shapley, K. (2007). Reviewing the
evidence on how teacher professional development affects student achievement (Issues &
Answers Report, REL 2007–No. 033). Washington, DC: U.S. Department of Education,
Institute of Education Sciences, National Center for Education Evaluation and Regional
Assistance, Regional Educational Laboratory Southwest.