USING RANDOMIZED CONTROL DESIGNS IN EVALUATING SOCIAL SECTOR PROGRAMS IN DEVELOPING COUNTRIES

John Newman, Laura Rawlings, and Paul Gertler

The World Bank Research Observer, vol. 9, no. 2 (July 1994), pp. 181-201. © 1994 The International Bank for Reconstruction and Development / The World Bank.

Seven case studies, from Bolivia, Colombia, Indonesia, Mexico, Nicaragua, Taiwan (China), and Turkey, demonstrate the feasibility of conducting rigorous impact evaluations in developing countries using randomized control designs. This experience, covering a wide variety of settings and social programs, offers lessons for task managers and policymakers interested in evaluating social sector investments.

The main conclusions are: first, policymakers interested in assessing the effectiveness of a project ought to consider a randomized control design because such evaluations not only are feasible but also yield the most robust results. Second, the acute resource constraints common in developing countries that often make program rationing unavoidable also present opportunities for adopting randomized control designs. Policymakers and program managers need to be alert to the opportunities for building randomized control designs into development programs right from the start of the project cycle because they, more than academic researchers or evaluation experts, are in the best position to ensure that opportunities for rigorous evaluations are exploited.

Despite the importance of knowing whether social programs work as intended, evaluations of social sector investments are still uncommon in developing countries. This neglect of evaluation handicaps the development community's ability to demonstrate what has been achieved and so to win political support, design more effective projects, and set priorities for resource allocation. Today, as more money than ever is flowing to the social sectors, governments and lending institutions are demanding value from that money. Evaluations can help make that happen by answering the critical question of how effective a particular social sector intervention is relative to other possible interventions.

This article and the companion article by Grossman lay out the issues of which policymakers and task managers need to be aware to build successful evaluation designs into their projects. Grossman's article describes the advantages, disadvantages, and limitations of the three main types of evaluation strategy, two quasi-experimental (reflexive and matched comparison) and one experimental (randomized control; see table 1), and reviews their use in social sector programs in the United States. Examples can be found in developing countries for each type of evaluation strategy discussed in Grossman's article.1 Grossman expresses the view, generally shared by evaluation experts, that randomized control designs are the best evaluation strategy in technical terms but that in many situations it is not possible or appropriate to apply them.

This article examines the use of randomized control designs in developing countries and reaches two main conclusions. First, whenever a project is of sufficient interest to policymakers to warrant an impact evaluation, program designers ought to consider a randomized control design because this methodology yields the most robust results. Second, rigorous randomized control designs can often be built into a social sector program when acute resource constraints make rationing of services unavoidable.
The second point is not new (Blum and Feachem 1983), but it may be salutary to remind policymakers and program managers that randomized control designs can often be built into a social sector program at relatively low cost. Program managers, rather than academic researchers or evaluation experts, are in the best position to ensure that the opportunities for rigorous evaluation are exploited.

Table 1. Evaluation Strategies

| Type | Control group selection criteria | Pros | Cons | Frequency of use |
|---|---|---|---|---|
| None | None | Very cheap | Nothing is learned | Very common |
| Reflexive | Program participants' behavior before the intervention | Cheap | Change in outcome may be due to other factors | Occasional |
| Matched comparison | Judgmental pairing | Better than random when target population is small | Results may not be generalizable | Occasional |
| Random | Random | Statistical inferences can be drawn from results | Can be expensive | Rare |

These opportunities present themselves whenever, for administrative or budgetary reasons, the number of eligible candidates exceeds the number of participants that the program is capable of serving. In developing countries there may not be enough resources to provide the program to all potential beneficiaries at once or even to all members of a high-priority group. Program managers frequently allocate scarce services by spreading resources evenly but thinly among eligible participants or by tightening the eligibility criteria until the number of people eligible matches the resources available. A common procedure is to rank each individual, community, or geographical area according to priorities set by the program, on the basis of such criteria as per capita income or the percentage of households with substandard housing. The cutoff point is then determined according to available funds. Tests are rarely done for the statistical significance of the differences in the indicators used in the ranking. Thus, it is entirely possible that individuals or communities that are observationally equivalent and equally eligible would be assigned different probabilities of receiving the program.

If all potential beneficiaries are equally eligible, a random draw can be used to select among them, and those who are not selected can serve as controls for those who are. This procedure need not be incompatible with targeting, since eligibility can be restricted to members of a high-priority group. The element of randomization ensures both equity in the allocation process and equivalence in the treatment and control groups.

Often, policymakers and program managers believe that conducting an impact evaluation of any type, especially one using a rigorous experimental design, would be too difficult or too costly in a developing country. In this article, we present a series of case studies that demonstrate that randomized control designs have been used successfully in developing countries and that no insurmountable barriers of knowledge, experience, or cost stand in the way of conducting such evaluations. We also point out some of the design and implementation issues that task managers may face when they try to implement rigorous evaluations in developing countries and note that such evaluations are not always warranted. In some cases, after weighing what could be learned from an evaluation against the costs of carrying it out, it may make sense to decide not to conduct an evaluation.
Most published impact evaluations pay little attention to costs, both the costs of carrying out the intervention and those of conducting the evaluation. Whether the evaluations themselves share this shortcoming or whether the published reports merely fail to provide the information, the outcome is a dearth of published data on the cost of evaluation. In the conclusion, we discuss some issues related to costs and provide some practical suggestions on setting up randomized control evaluations in developing countries.

Randomized Control Designs Work in Developing Countries

This article presents seven success stories. The seven cases used randomized control designs to evaluate the impact of social sector projects ranging from family planning to radio education and mass communication. Randomized control designs have been applied successfully in many diverse settings and programs in developing countries, although they have been used much less often than other evaluation methodologies and much less often than they have been in industrial countries. Boruch, McSweeny, and Soderstrom (1978) found that of 400 documented cases of randomized control designs in settings outside of laboratories, less than 5 percent were conducted in developing countries. A review by Cuca and Pierce (1977) found that only twelve of ninety-six family planning program evaluations used a randomized control design.

Few impact evaluation studies of any type, but particularly those using randomized control designs, have been carried out in developing countries in recent years. This scarcity is reflected in the fact that few of our examples are drawn from the 1980s. This seeming reluctance to conduct evaluations sometimes appears to stem from a sense that such studies are too expensive and too complicated to justify their use. The real problem, however, may be that evaluations have been inappropriately applied. Policymakers and program managers may have been discouraged by efforts to evaluate program impacts when the programs themselves were suffering from severe implementation problems.2 An impact evaluation is not the appropriate tool for monitoring whether a program is functioning as it was designed to function. That is the purpose of a monitoring system, which provides inexpensive and timely information on the program and beneficiaries and on whether the program is being implemented as intended. To determine whether a program, properly implemented, has the desired effect requires an evaluation strategy that, in addition, collects data from an appropriate comparison group. Monitoring programs can be simple and cheap; indeed, multilateral lending institutions are recommending that monitoring information be produced routinely in all projects that they finance. Evaluation is harder.

The advantage of a technically sound impact evaluation is that it can provide convincing evidence of program effectiveness for policymakers. That involves collecting information on a comparison group as well as the treatment group and applying a rigorous design to ensure that differences in outcomes result from the impact of the program rather than from measured or unmeasured differences between the treatment and control groups. The technical soundness of the design can be instrumental in convincing policymakers of the reliability of a study's findings.
The first two case studies, from Nicaragua and Turkey, illustrate how the use of a randomized control design convinced policymakers of the effectiveness of new approaches to learning. The right design can also help policymakers choose among alternative program options, as illustrated by the Colombia and Taiwan (China) case studies.

The implementation of an evaluation in a developing country can be as important as its design. A program manager setting out to conduct an impact evaluation in a developing country is also something of a pioneer. Typically, there are no consulting firms to call on to carry out the evaluation, as there are in the United States. Political support for the evaluation may be weak or absent. Further, many of the same factors that can make implementing a project difficult (the rapid turnover of staff, political change, sporadic interruptions in cash flow) can make conducting an impact evaluation difficult.

At the same time, the budgetary and administrative constraints in developing countries that often make it impossible to reach all potential beneficiaries at once create opportunities for using randomization that are less often encountered in established market economies. The need to ration services and benefits means that a randomized control design can be built into a program's first implementation phases, as happened in the case of the education upgrading program in Bolivia. Evaluating the first part of a phased-in program presents an alternative to a pilot program, which may not accurately predict the effect of the full-fledged program because of differences in the way pilot and full programs are implemented, as illustrated by the experience with the "Sesame Street" program in Mexico. In addition, using a randomized control design in the first part of a program can build up valuable experience in conducting evaluations in developing countries, making it in many cases a more useful exercise than promoting expensive pilot programs.

It is noteworthy that in six of the seven case studies, the programs delivered services to a community rather than directly to individuals, a common practice in developing countries. The experimental conditions required for a randomized control group design are less likely to be contaminated in a society in which communities are relatively self-contained, as they tend to be in developing countries. (See Grossman in this volume for a discussion of contamination of the control group.)

Even when a program is delivered to communities, indicators at both the individual and community levels may be used to measure its impact. The individual comparisons provide more accurate measurements of the program's impact, but they are statistically more demanding. When programs directed at communities are evaluated using community-level variables, unbiased estimates of the impact of the availability of the program on measured community outcomes can be obtained without controlling statistically for the correlation between an individual's decision to participate in the program and the outcome. (See Grossman in this volume for a discussion of the problem of disentangling participation and treatment effects.) The use of communitywide averages combines the outcomes for individuals in the treatment community who choose not to participate in the program with those for individuals who do participate. Provided that a sufficiently large number of communities are included in the program and control groups, the measured differences in community-level indicators between the program and control areas would yield estimates of the expected effect of extending the program to similar, unserved communities. The community-level differences would not, however, yield estimates of the potential impact of extending the program benefits to all individuals or to a target group of individuals.
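A stylized identity, in notation of our own rather than the article's, may help fix the distinction. Let \( \bar{y}_T \) and \( \bar{y}_C \) be the communitywide average outcomes in program and control areas, \( p \) the share of residents who take up the program, and \( \Delta \) the average effect on those who participate. Then, assuming the program does not affect residents who choose not to participate,

\[
\bar{y}_T - \bar{y}_C \;=\; p \times \Delta .
\]

The left side, the effect of program availability, is what community-level comparisons recover directly; the effect on participants, \( \Delta \), can be backed out only by dividing by, or explicitly modeling, take-up \( p \).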
A related problem is that most programs that require rationing are not assigned randomly to eligible communities, as they were in the Bolivia education upgrading project. Thus, differences in outcomes across communities may reflect a combination of the program's impact and an explicit or implicit allocation rule that may incorporate measured or unmeasured differences across communities. Failure to account for unmeasured differences that are related both to program allocation and to outcomes can yield biased estimates of a program's impact. In projects that require communities or individuals to apply for services, it is especially important that the evaluation be designed to analyze both the decision to apply for services and the impact of the project.3 For example, the Indonesia National Family Planning Coordination Board allocates more family planning resources to communities in which contraceptive prevalence is low. One study (Lerman and others 1989) reported a negative correlation between family planning program inputs and contraceptive prevalence using least squares cross-section multivariate regressions. However, this result says more about the effect of past contraceptive choices on the way the government allocates program inputs than it does about the effect of those inputs on couples' contraceptive choices.

Rosenzweig and Wolpin (1986) have pointed out that most of the economic studies that have attempted to evaluate social sector interventions have ignored this problem and have implicitly assumed that program managers randomly allocate programs across communities. They demonstrate that information over time on the spatial distribution of programs and program characteristics can be used to yield unbiased estimates of the effects of changes in local programs on changes in local population characteristics. Working with changes eliminates the influence that unmeasured, fixed characteristics of the community could have on the outcome.4
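The logic of working with changes can be written out in a stylized two-period notation of our own (a simplification of Rosenzweig and Wolpin's more general treatment):

\[
y_{ct} = \alpha_c + \beta P_{ct} + \varepsilon_{ct}
\qquad\Longrightarrow\qquad
y_{c2} - y_{c1} = \beta\,(P_{c2} - P_{c1}) + (\varepsilon_{c2} - \varepsilon_{c1}),
\]

where \( y_{ct} \) is the outcome in community \( c \) at time \( t \), \( P_{ct} \) measures local program presence, and \( \alpha_c \) captures fixed, possibly unmeasured community characteristics. Because \( \alpha_c \) drops out of the differenced equation, an allocation rule that favors communities with particular fixed characteristics no longer biases the estimate of \( \beta \); bias remains only if program placement also responds to the time-varying shocks \( \varepsilon_{ct} \).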
Using repeated observations of program interventions and household outcomes in ex post matched comparisons is a promising approach that is worth pursuing. However, substantial improvements would have to be made in national information systems to generate and then link adequate information on program interventions (typically collected from community surveys, provider surveys, and administrative records) with household outcomes (obtained from household surveys) before any useful results could be realized. Even good national information systems are not yet designed so that these links can be made easily. The World Bank's Living Standards Measurement Study is encouraging further efforts along these lines.

Not all evaluations in developing countries will focus on the impact of expanding services to other groups or individuals. Evaluations have also been used to test the feasibility of introducing changes in the price of services delivered, as is illustrated by the case from Indonesia. The Indonesian case also underscores some of the political constraints that can be encountered when applying randomized control designs and the tradeoffs that must often be made between these political constraints and the reliability of the evaluation design.

Showing That Radio Education Works: The Radio Mathematics Project in Nicaragua

This project used a randomized control evaluation design to assess and demonstrate the effectiveness of a new approach to learning: radio education. The positive findings of the evaluation led to the expansion of the radio education program to classrooms throughout Nicaragua and to further use of randomized control in evaluating the effectiveness of radio education compared with that of new textbooks.

The Radio Mathematics Project was launched in 1974 by Stanford University through the Ministry of Public Education, with the support of the U.S. Agency for International Development (USAID). The aim was to develop and implement a prototype system of radio-delivered mathematics instruction for elementary school students. The project was implemented in four phases: research, pilot-level field tests, standardized tests, and the main field test. The first two years, 1974 and 1975, were dedicated to establishing the project, developing lessons, and conducting pilot tests of the program in first grade classrooms in California and in Masaya, Nicaragua. In 1976 and 1977 schools in the provinces of Masaya, Carazo, and Granada were randomly selected to receive the revised mathematics program. In 1978 the Province of Rio San Juan was also included in the project.

School populations were categorized by grade and by rural and urban areas in each province so that the effect of the program on different groups within the population could be assessed. Within each province each qualifying school (any school with at least fifteen first graders) had an equal chance of being in the treatment group or in the control group. Each year, depending on the grade being evaluated, schools were chosen from the list of randomly assigned treatment and control groups using a three-step process. First, the number of classes to be chosen from each group was determined. Next, a list of eligible classes was drawn up for each cell in each category (for example, rural control schools in Masaya). Finally, the appropriate number of classes was selected from each list. From 1975 to 1978, this process generated a total of 145 control classes and 257 treatment classes for the evaluation.
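The stratified assignment logic just described can be sketched in a few lines of code. This is a minimal sketch, not the project's actual procedure: the school roster, stratum sizes, and the even treatment/control split within each stratum are assumptions for illustration.

```python
import random

# Illustrative sketch of Nicaragua-style stratified randomization: within
# each stratum (province x urban/rural), every qualifying school has the
# same chance of landing in the treatment or the control pool.
random.seed(1976)  # a fixed seed makes the draw reproducible and auditable

# Hypothetical roster: (school_id, province, area, number_of_first_graders)
schools = [
    ("S01", "Masaya", "urban", 32), ("S02", "Masaya", "urban", 19),
    ("S03", "Masaya", "rural", 18), ("S04", "Masaya", "rural", 16),
    ("S05", "Carazo", "rural", 15), ("S06", "Carazo", "rural", 24),
    ("S07", "Granada", "urban", 41), ("S08", "Granada", "urban", 22),
]

# Only schools with at least fifteen first graders qualify.
eligible = [s for s in schools if s[3] >= 15]

# Group the eligible schools into strata.
strata = {}
for school in eligible:
    strata.setdefault((school[1], school[2]), []).append(school)

# Shuffle each stratum and split it between treatment and control
# (an odd school out, if any, falls to the control pool).
assignment = {}
for stratum, members in strata.items():
    random.shuffle(members)
    half = len(members) // 2
    for school in members[:half]:
        assignment[school[0]] = "treatment"
    for school in members[half:]:
        assignment[school[0]] = "control"

for school_id, group in sorted(assignment.items()):
    print(school_id, group)
```

Publishing the seed and the roster in advance is one simple way to make such a draw transparent to participating schools.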
The radio education program for the first through fourth grades consisted of an hour of mathematics instruction daily throughout the school year, divided into radio instruction and teacher-assisted exercises. During the period when the program was being fully implemented (1976-78), project personnel administered tests to students in the control and treatment groups both before and after the program aired.

Quantitative evaluations showed statistically significant improvements on mathematics tests for students in the first through third grades who received the radio education program (Friend, Searle, and Suppes 1980). For the first grade the mean correct score on the tests was 65.5 percent for the treatment classes, but only 38.8 percent for the control classes. In the second grade the scores were 66.1 and 58.4 percent, and in the third grade, 51.7 and 43.2 percent. For all three grades these differences in scores were statistically significant at the 1 percent level; that is, if the program had had no effect, differences this large would have arisen by chance less than 1 percent of the time. Scores for fourth graders were not statistically different for the treatment and control classes, but this grade was tested during a period of revolutionary turmoil, when many schools dismissed children before the daily broadcast of the fourth grade lesson, an extreme example of how failure to implement a project as planned precludes meaningful evaluation.

Qualitative evaluations based on classroom observation and weekly tests also constituted an important part of the overall evaluation. These activities allowed teaching methods to be assessed and refined rapidly and provided valuable feedback to teachers. The qualitative evaluations found students to be attentive and able to keep pace with the worksheets and to learn new skills. Teachers reported satisfaction with the program, which they said reduced their workload and introduced students to new concepts.

Explicit efforts were made to build political support for the evaluation. Two advisory committees, with representatives from the Ministry of Public Education and participating schools, were established to explain the objectives of the program and the evaluation. Briefing sessions were conducted to explain the use of randomized control design and to reassure teachers that the program, not their teaching, was being evaluated. Each eligible school had the same probability of being selected to receive the program, lessening the chance of any school or individual developing feelings of animosity toward a program that had "rejected" them, which could have influenced the results.

This success led to a second evaluation using a randomized control design, which confirmed the greater effectiveness of the radio education program in increasing children's learning ability compared with a program that provided additional textbooks (Jamison and others 1981). Since this trial run in Nicaragua, the number of interactive radio-based education programs in developing countries has grown steadily. During the 1980s radio mathematics programs were introduced in Bolivia, Costa Rica, the Dominican Republic, Ecuador, Guatemala, Honduras, Lesotho, Nepal, and Thailand. Interactive radio instruction programs in science, health, Spanish, English as a Second Language (ESL), and teacher training have also spread across the developing world (USAID 1990).

Testing for Lasting Effects: Early Childhood Education in Turkey

In 1982 a pilot program headed by the Psychology Department at Bogazici University in Istanbul, Turkey, was initiated to test whether educating lower-income mothers of three- and five-year-olds improves the children's learning abilities. Because the beneficial effects of early-childhood interventions provided directly to children had often been found to dissipate with time, the program managers hoped that, by educating mothers instead of children, the program would have a lasting effect on children's cognitive abilities. The hypothesis was that the mothers' training would constitute a permanent change in the children's environment.
This program was evaluated twice: once at the time of the project, to assess immediate effects, and again nine years later, to find out whether the effects were lasting, an ambitious follow-up effort.

A series of assessments, tests, and interviews was used to establish a baseline for the project. Three categories of mothers were then selected to receive training: those whose children were attending an educational preschool, those whose children were attending a custodial daycare center, and those who were caring for their children at home. Treatment and control groups were established through random selection. The treatment group began a two-year, two-part training program that consisted of a cognitive development program for children, implemented through a series of exercises completed by mother and child working together, and an enrichment program that educated mothers about their children's health and education needs.

An initial impact assessment was conducted at the end of the two-year training program. The children of mothers who had gone through the program scored significantly higher in measures of IQ, analytical training, and classification tasks than children in the control group. They also had higher grades, most notably in Turkish and mathematics. Because the initial evaluation showed such positive results, a revised version of the program was extended to other areas of the country, with the support of nongovernment organizations and private industry. A television version of the enrichment training for mothers was also developed and broadcast in a series of eleven short programs.

A long-term impact evaluation was recently completed for 217 of the original 255 participants in the training program, a follow-up rate of 85 percent (Kagitcibasi, Sunar, and Bekman 1993). The evaluation included interviews with the children, now twelve to fifteen years old, and their parents. The results of this study confirmed the hypothesis that changing the environment in which children learn can lead to sustainable improvements in education. One of the most striking long-term impacts of the training is the much higher school retention rate for the children whose mothers participated in the program: 86 percent, compared with 67 percent for the children of mothers in the control group. Throughout the first five years of primary school, the academic performance and vocabulary test scores of children whose mothers had received the training were consistently superior to those of children whose mothers had not. In addition, both the children and the mothers who had benefited from the training program had significantly different scores for answers on questions that demonstrated self-confidence, attitudes toward academics, and expectations about educational achievement.

Testing Alternative Service Delivery Modes: The Taichung Family Planning Program in Taiwan, China

In 1962 the Taiwan Provincial Health Department began what was at the time the largest intensive family planning program ever carried out in a city the size of Taichung, which had a population of 325,000. The decision to extend the program to the entire city was prompted by the results of a series of surveys in 1961-62 that revealed a strong demand for family planning services and a readiness to use a new form of birth control, the intrauterine device (IUD). Information services and supplies were offered for a wide variety of contraceptive methods.
Program officials chose to test the effectiveness of different combinations of services and information by randomly assigning treatments by lin, a neighborhood unit averaging twenty households. In all, some 36,000 married couples of childbearing age (couples in which the wife was between the ages of twenty and thirty-nine) were included in control and treatment groups. Four types of treatment were designed, ranging from more intensive and more costly to less intensive and less costly:

- Treatment 1: Everything, husband and wife. Personal visits to husbands and wives by trained health workers providing family planning information and services; mailings to newlyweds and couples with at least two children detailing family planning methods and benefits and identifying the location of clinics; and neighborhood family planning meetings offering information about family planning.
- Treatment 2: Everything, wife only. Same as treatment 1, but without the visits to the husband by the health workers.
- Treatment 3: Mailings. Only informational mailings, as detailed in treatment 1.
- No treatment.

In addition, the city was divided into three "density" sectors, which differed, insofar as possible, only in the proportion of lins receiving more intensive or less intensive treatments. The density variation was introduced to determine to what extent the beneficiary population could be depended on to spread the desired innovation and to establish how many households within a given area needed to be contacted to stimulate diffusion of the innovation. Differences among the three density sectors were minimized by constructing sectors that were as similar as possible on the basis of measurable characteristics such as fertility, occupational composition, and education. In the sector designated to receive high-density treatment (928 lins), half the couples were randomly chosen to receive an "everything" treatment (treatment 1 or 2). In the sector designated to receive low-density treatment (730 lins), only 20 percent of the couples received an "everything" treatment. In the medium-density sector, 34 percent of couples received an "everything" treatment. Each lin within each sector had the same probability of receiving a treatment because the treatments were allocated randomly by lin. However, the probability of being selected into each treatment category varied according to the treatment density to which the sector was assigned (Freedman and Takeshita 1969).

Table 2. Cumulative Acceptance Rates per 100 Married Women Aged 20-39 for All Methods of Birth Control in Taichung (percentage)

| Treatment | Heavy density | Medium density | Light density | All sectors |
|---|---|---|---|---|
| Treatment 1 | 20 | 12 | 14 | 17 |
| Treatment 2 | 18 | 14 | 14 | 17 |
| Treatment 3 | 8 | 7 | 8 | 8 |
| Nothing | 9 | 7 | 7 | 8 |
| Total | 14 | 9 | 8 | 11 |

Source: Authors' calculations, from Freedman and Takeshita (1969).

During the experimental period of the program, from February 1963 to March 1964, the contraceptive acceptance rate was significantly higher in the high-density sector than in the medium- or low-density sectors (table 2). The variation between medium- and low-density sectors was slight. The experiment suggested that the marginal effect of approaching the husbands (treatment 1) in addition to the wives (treatment 2) was negligible and that the mail campaign (treatment 3) was largely ineffective.
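The two-stage allocation used in the experiment can be sketched as follows. The heavy- and light-sector lin counts and all three "everything" shares come from the text; the medium-sector lin count and the split of the remaining lins between the mailing and no-treatment arms are our own assumptions for illustration.

```python
import random

# Sketch of the Taichung two-stage allocation: each sector fixes the share
# of lins that receive an "everything" treatment, and every lin within a
# sector faces the same assignment probabilities.
random.seed(1963)

# sector: (number of lins, share assigned to an "everything" treatment)
sectors = {
    "heavy": (928, 0.50),
    "medium": (850, 0.34),  # lin count assumed for illustration
    "light": (730, 0.20),
}

allocation = {}
for sector, (n_lins, share_everything) in sectors.items():
    lins = [f"{sector}-lin-{i:04d}" for i in range(n_lins)]
    random.shuffle(lins)
    n_everything = round(n_lins * share_everything)
    everything, rest = lins[:n_everything], lins[n_everything:]
    # Split the "everything" pool between husband-and-wife visits
    # (treatment 1) and wife-only visits (treatment 2).
    for i, lin in enumerate(everything):
        allocation[lin] = "treatment 1" if i % 2 == 0 else "treatment 2"
    # Assumed for illustration: remaining lins split evenly between
    # mailings only (treatment 3) and no treatment.
    for i, lin in enumerate(rest):
        allocation[lin] = "treatment 3" if i % 2 == 0 else "no treatment"

print({t: sum(1 for v in allocation.values() if v == t)
       for t in ("treatment 1", "treatment 2", "treatment 3", "no treatment")})
```

The essential design feature is that the randomization is at the lin level within sectors, so treatment probabilities can vary across sectors (to study diffusion) while remaining identical for all lins within a sector.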
In 1964, the elements of the Taichung program that were considered the most promising, notably house visits by fieldworkers, were replicated throughout Taiwan, and greater emphasis was placed on the availability of IUDs as a method of family planning.

Targeting and Random Assignment: The Cognitive Abilities of Malnourished Children in Colombia

A pilot program in Cali, Colombia, in 1971-75 was designed to determine what levels of education, nutrition, and health services for preschool children and parents from low-income families would reduce malnutrition and whether these actions could produce improvements in children's intellectual functioning (McKay and others 1978). Medical practitioners had long asserted that inadequate nutrition impairs a child's cognitive development, perhaps permanently, but these claims had never been systematically investigated. This case shows that, when program services are to be phased in, a randomized control design can be used even in a program that aims eventually to cover all eligible participants. Random assignment is used simply to determine which groups or individuals receive the program first. This case also shows that achieving an efficient randomized control design may require that the target group be identified first.

The program was run by the staff of the Human Ecology Research Station, with the support of the Colombian Ministry of Education, the Ford Foundation, the National Institute for Child Health and Human Development, and a number of private industries in Colombia. The first step was a multiphase screening survey to identify a target group of malnourished children from among households with four-year-old children. The survey identified general nutritional levels, gathered demographic data, and screened for malnutrition. The 333 malnourished children identified through this process were classified into twenty sectors by neighborhood. Each sector of thirteen to nineteen children was randomly assigned to one of four treatment groups that differed only in the duration of the treatments, which were staggered over time. Two other groups of children of the same age were formed to allow for qualitative comparisons with the treatment groups. One group consisted of children from high-income families living in Cali, and the other of children from low-income families who exhibited no signs of malnutrition but who lived in the same neighborhoods and participated in the screening process that had identified the children who qualified for the program.

The children in the treatment groups participated in six hours of health- and nutrition-related and educational activities a day, five days a week. The nutritional component provided 75 percent of recommended daily protein and calorie intake, along with mineral and vitamin supplements. Health care services included daily observations of all children and immediate pediatric attention as warranted. The educational component focused on developing cognitive processes and language, social, and psychomotor skills.

Because one of the objectives of the study was to assess how long such a program should last, time-sequencing of treatments formed a crucial part of the pilot program. A randomly selected subgroup from the larger pool of malnourished children was assigned to treatment 4, the longest treatment period of 4,170 hours.
Over staggered eight-month periods, other randomly selected subgroups received treatments 3, 2, and 1; the last was the shortest, lasting only 990 hours. The children's development was traced over the forty-four months of the program by measuring each child's cognitive ability at equally spaced intervals five times during the study period. The tests measured such indicators of cognitive ability as use of language, spatial relations, quantitative concepts, logical thinking, and manual dexterity and motor control. One problem, however, is that different tests were administered at each measurement point, making it difficult to compare the test results.
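The staggered design amounts to randomly assigning sectors to entry cohorts. In the sketch below, only the twenty sectors, the four treatment arms, and the two extreme durations (4,170 and 990 hours) come from the text; the middle durations and the even five-sectors-per-arm split are assumptions for illustration.

```python
import random

# Sketch of Cali-style time-sequencing: neighborhood sectors are randomly
# assigned to one of four entry cohorts, so that treatment duration, not
# any characteristic of the children, is the only systematic difference
# among arms. Sectors that start later serve as controls for earlier ones.
random.seed(1971)

sectors = list(range(1, 21))  # the twenty neighborhood sectors
random.shuffle(sectors)

# Durations for treatments 4 and 1 come from the text; the middle two
# values are interpolated purely for illustration.
arms = [
    ("treatment 4", 4170),
    ("treatment 3", 3110),  # assumed
    ("treatment 2", 2050),  # assumed
    ("treatment 1", 990),
]

assignment = {}
for i, sector in enumerate(sectors):
    arm, hours = arms[i % len(arms)]  # cycling yields five sectors per arm
    assignment[sector] = (arm, hours)

for sector in sorted(assignment):
    arm, hours = assignment[sector]
    print(f"sector {sector:2d} -> {arm} ({hours} hours)")
```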
Because children were assigned randomly to the four treatment groups, differences among the groups could be attributed to differences in the duration of the program. Children who received the longest treatment showed the greatest gains. For children eight years old, results on the Stanford-Binet intelligence tests (reported as mental age minus chronological age) were as follows for the different groups: treatment 1, -15 months; treatment 2, -11 months; treatment 3, -9 months; and treatment 4, -5 months. The treatment groups differed from one another in the expected direction (the longer the treatment, the greater the gain), but the differences between adjacent treatment groups were not statistically significant. (It should be noted, however, that the sample sizes were small.) Even with the maximum treatment, none of the groups ever reached the average level of ability shown by children from the nonrandomly selected high socioeconomic group, who had a mental age minus chronological age of +10 months as measured by the Stanford-Binet test.

No member of the target group was denied treatment, a factor that facilitated acceptance of the randomized control design, particularly in the sensitive case of a study of the effects of malnutrition on intellectual development.

Testing the Whole Program: The Impact of "Sesame Street" in Mexico

A new version of the children's television program "Sesame Street," in Spanish and adapted to Latin American culture, was introduced in Mexico in 1971. Policymakers in the communications and education fields were interested in exploring the effect of the program on children's cognitive skills. The evaluation was designed to assess the effectiveness of the entire program, rather than the relative effectiveness of different strategies, as in some of the other case studies. This case illustrates some of the problems that can occur in moving from a pilot program to broader implementation of the project.

A randomized control design was applied to a pilot test carried out in daycare centers serving low-income families in Mexico City in 1971. Two hundred and twenty-one children three to five years old from three daycare centers were divided by age and gender and then randomly assigned to treatment or control groups. Children in the treatment group watched "Plaza Sesamo" for fifty minutes a day five days a week for six months. Children in the control group watched cartoons. To make sure that children in the control group did not watch "Plaza Sesamo" at home in the evening (it was shown again at 6:00 p.m.), children in that group were kept at the daycare centers until 7:00 p.m.; children in the treatment group left earlier (Hoole 1978).

Nine cognitive development tests were administered to the randomly selected control and experimental groups before and after the pilot program. Statistically significant differences were found for four of the nine cognitive tests administered after the program. The greatest differences were in the tests of letters and words, general knowledge, and numbers, the topics most closely related to the objectives of "Plaza Sesamo." (The differences in adjusted mean test scores between the experimental and control groups for four- and five-year-olds, respectively, were 7.3 and 4.8 in general knowledge; 4.5 and 5.1 in letters and words; and 7.8 and 6.2 in numbers, all statistically significant at the 1 percent level.)

The encouraging results of this pilot test prompted a larger field test. Control and treatment groups were randomly selected from lower- and middle-class preschool children in daycare centers in urban and rural areas. The impact of "Plaza Sesamo" was not as clear in the broad field test, which used a slightly different methodology (the tests were revised, and a rural component was added). The field test was also ultimately less rigorous because of the larger number of dropouts and the contamination that occurred because some children had watched an earlier version of "Plaza Sesamo" at home. However, the evaluators suggested that the difference between the pilot test and the field experiment resulted less from the difference in methodology than from important differences in the social environments in which the children watched the program (Diaz-Guerrero and others 1976). Essentially, they hypothesized that the presence of a greater number of adults in the laboratory-type setting of the pilot project created a slightly different environment that was more conducive to learning. Because the laboratory-type setting was not replicated when the program was expanded, the nature of the intervention changed. Although the results of the field test were less conclusive, the results of the pilot test helped to generate broad interest in "Sesame Street," not only in Mexico but throughout Latin America.

Assigning Services by Lottery: Educational Investments in the El Chaco Region of Bolivia

This case study and the malnourished children project in Colombia both illustrate that the targeting of project interventions does not have to rule out the use of randomized control designs for evaluation. When resources are limited, it may be preferable to group individuals or communities on the basis of some rough classification criteria, treat them as observationally equivalent, and conduct a lottery to distribute limited resources, rather than spend funds on more costly information collection activities to target services more narrowly. This approach was followed in a pilot program recently introduced in the El Chaco region of Bolivia to upgrade physical facilities and teacher training in rural public schools (Coa 1992). The program is one of several activities financed by the Social Investment Fund (SIF), an institution set up by the Government of Bolivia to finance education and health projects in low-income areas, and is also supported by the World Bank and Kreditanstalt für Wiederaufbau.
To direct interventions to the neediest cases, project managers assigned schools in the region to one of three priority groups on the basis of community characteristics and assessments of the current state of their infrastructure. Recognizing that funds spent making finer distinctions among schools could be better spent on program activities, project managers made no attempt to measure subtle differences among the schools or to rank them in order of priority. All eight schools in the highest priority group were upgraded under the project. The next highest priority group contained 120 schools, but funds were available to upgrade only 54 of them. These schools were selected randomly. This group is of particular interest to policymakers because schools in this category are the hardest hit by current budget stringencies.

Because the allocation rule assigned services to all the schools in the top priority group, the effect of the intervention on that group will be measured using a reflexive comparison design (see the Grossman article for a discussion of this type of design). For the medium-priority group, conditions are right for using a randomized control group design. Baseline information for the evaluation was collected between May and June 1993 using household, community, and school facility questionnaires. A follow-up survey will be conducted one year later, after the project interventions have been completed.5
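The lottery itself is the simplest design in this article, and a minimal sketch makes the point that it costs almost nothing to implement. The school identifiers below are invented; only the counts (120 schools, 54 slots) come from the text.

```python
import random

# Sketch of the El Chaco lottery: the 120 medium-priority schools are
# treated as observationally equivalent, 54 upgrade slots are allocated by
# a simple random draw, and the schools left out form the control group.
random.seed(1993)  # fixed seed so the draw can be reproduced and audited

medium_priority = [f"school-{i:03d}" for i in range(1, 121)]  # invented ids
upgraded = set(random.sample(medium_priority, 54))
controls = [s for s in medium_priority if s not in upgraded]

assert len(upgraded) == 54 and len(controls) == 66
```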
Combining Randomized Control and Matched Comparisons: The Indonesia Resource Mobilization Study

Sometimes evaluations combine randomized control group and matched comparison designs, as in the Indonesia Resource Mobilization Study. This study was designed in 1991 to ascertain the potential impact of user fees on health system revenues, health care utilization, patients' choice of medical care, provider services, and health outcomes, and to assess the willingness of patients to pay for improvements in the health care system. The Resource Mobilization Study is one component of the Third Health Project, a set of health care initiatives implemented by the government in the provinces of Kalimantan Timur and Nusa Tenggara Barat to increase the availability and improve the quality of medical services, primarily through resource investments (such as new facilities, additional personnel, and more drugs and other medical supplies; see Indonesia 1992). The project has so far been funded by a World Bank loan, but unless the government finds other sources of financing once the loan is expended, the improvements in health care services will not be sustained. For that reason, the government wanted to take advantage of the opportunity presented by the health project to experiment with increases in user fees in the two provinces before extending the scheme nationwide. The increases were likely to be less unpopular if they came at the same time as an overall improvement in the quality of services under the health project.

The interaction between the evaluation team and government policymakers led to several important and practical compromises in the design of the experiment. The government initially planned to increase fees uniformly across the two provinces, while expanding mechanisms to exempt the poor from having to pay the new fees. The evaluation team argued for delaying some of the fee increases so that experimental control and treatment groups could be studied. Random assignment of the fee increases at the individual level was clearly not practical because health care services are priced at the provider level. Applying the fee increase at the facility level would be difficult as well, because health care prices are set at the district level. Although local officials were eager to increase fees to generate additional revenue, they were reluctant to set different prices at different facilities within the same district for fear of political backlash. In the end, differences in fees were applied only at the district level, in six districts randomly selected from among the twelve in the two provinces. The new fees were set at one and a half times prevailing rates. Price variations were also introduced among levels of care (such as hospital and health center or inpatient and outpatient care).

The small size of the sample of districts subjected to fee increases created statistical problems, so a matched comparison was introduced to strengthen the evaluation design. Treatment and comparison villages were matched not directly on a village-by-village basis, but by comparing the distributions of socioeconomic characteristics of treatment and control villages as groups. First, 110 treatment villages were selected randomly from among the six randomly selected treatment districts. Next, the same number of control villages was selected randomly from among the control districts, and the distribution of their socioeconomic characteristics (income level, family size, access to medical care, and other data from national household surveys) was compared with that of the treatment villages. The control village that was the least similar to the treatment villages was dropped in favor of another randomly selected replacement village drawn from the control districts, and the process was repeated until the comparability of the two groups could no longer be improved. This iterative process, made possible by the availability of national survey and census data on household- and village-level socioeconomic characteristics, substantially improved the fit of the match in one of the two provinces.6
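The iterative matching procedure lends itself to a short sketch. This is a stylized version under strong simplifying assumptions: villages are reduced to a single simulated socioeconomic score and "fit" is the gap in group means, whereas the actual study compared the distributions of several characteristics drawn from national survey data.

```python
import random
import statistics

# Stylized sketch of the iterative matching described above: drop the
# least-similar control village, redraw a random replacement, and keep the
# swap only while it improves the fit.
random.seed(1991)

# Simulated one-dimensional "socioeconomic scores" standing in for the
# richer village profiles used in the study.
treatment = [random.gauss(100, 15) for _ in range(110)]
pool = [random.gauss(103, 15) for _ in range(1000)]  # control-district villages

controls = random.sample(pool, 110)
target = statistics.mean(treatment)

def misfit(group):
    """Gap between the group mean and the treatment-group mean (lower is better)."""
    return abs(statistics.mean(group) - target)

while True:
    # Identify the control village least similar to the treatment group...
    worst = max(controls, key=lambda v: abs(v - target))
    candidate = controls.copy()
    candidate.remove(worst)
    # ...and replace it with a fresh random draw from the control pool.
    candidate.append(random.choice([v for v in pool if v not in controls]))
    if misfit(candidate) < misfit(controls):
        controls = candidate
    else:
        break  # comparability can no longer be improved

print(f"final gap in group means: {misfit(controls):.3f}")
```

A production version would retry several replacement draws before concluding that the match cannot be improved; the sketch stops at the first non-improving swap for brevity.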
Baseline information was collected in 1991 on the matched treatment and control villages using household, community, and health provider questionnaires. Follow-up surveys of the same households and providers were conducted in 1993, some eighteen to twenty months after the fee increases. Results of the analysis are expected in the summer of 1994. The collection of data both before and after the fee increases is intended to isolate the effect of the policy reforms from other factors that may have influenced people's use of medical services over time. The control-and-treatment-group design controls for other influences, such as changes in weather, morbidity patterns, and income, that cannot be controlled for in a reflexive design, which tests the same group before and after the intervention.

Conclusion

These cases demonstrate the feasibility of conducting impact evaluations using randomized control designs in developing countries. They also demonstrate that there is no single blueprint for conducting evaluations. The evaluation designs explored in this article were tailored to the question of interest in the social sector project or treatment being evaluated. Such evaluations are most effective when they seek to answer a clear question of interest to policymakers and when the intervention itself can be precisely defined and measured: Can radio education improve learning? What is the most effective level of intensity in the provision of family planning services? How will people react to an increase in prices for medical services?

More effort needs to be devoted to collecting and reporting information on the costs of carrying out specific interventions. Having that information would allow the outcomes of different kinds of interventions to be expressed in terms of how much they cost to implement rather than in terms of outcome indicators that are not directly comparable. Because initial conditions and service delivery levels are often very poor in developing countries, an impact evaluation might easily find a sizable absolute improvement in the outcome indicators for given inputs. But it is important to remember that the relevant factor in deciding resource allocation is the opportunity cost of investing in one project rather than another; that is, the expected gains from investing in one project compared with the expected gains from investing in another.

More effort also needs to be devoted to collecting information on the costs of conducting evaluation studies. The critical question in deciding whether to conduct an evaluation is whether the expected value of the information obtained is greater than the cost of collecting it. Again, the relevant cost is the opportunity cost of using the funds. If the project to be evaluated is only one of a group of projects expected to have high returns with low risk, the opportunity cost of financing an impact evaluation instead of investing in another project might be high. If the level of uncertainty about what can be gained from the project is appreciable, however, spending the money on an evaluation study probably makes sense. The opportunity cost of investing in a project with low returns can be considerable when other investments could yield higher returns.

In addition to concern about the costs of conducting impact evaluations, policymakers and program managers need to be aware of some of the issues involved in setting up an impact evaluation study within a project. Some of the decisions made early on in the design of a project can make an impact evaluation easier or harder to conduct later: Who is eligible to participate in the project? How are the project activities rationed among eligible beneficiaries if resources do not permit delivery of services to all who are eligible? How is the project being phased in? Policymakers and program managers should be alert to opportunities for introducing randomization into program implementation, thus building in possibilities for generating randomized control designs. Randomization can be used to allocate a limited number of spaces among equally eligible potential participants, as in the radio education project in Nicaragua. The education upgrading project in Bolivia shows that such opportunistic randomization need not be incompatible with targeting interventions to high-priority groups. Randomization may also be built into the plans for expanding a program: the last groups of participants to receive the program's benefits can serve as controls for the first groups.
This approach is particularly appropriate in situations where it is ethically untenable to generate a control group that will be denied access to the program altogether. The Colombia case study of the malnourished children project is a good example of the use of this type of randomized control design.

In developing countries, the task of organizing an impact evaluation usually falls on program managers. It is rare to find either government agencies that have the capacity to conduct evaluations or local consulting firms that can be contracted to do the work. One way around some of these problems is for program managers to establish a small evaluation unit, preferably within the project unit. Household data collection can usually be subcontracted to a national statistical institute or a private company. Data on the internal operation of the project, including cost data and monitoring indicators, should be collected as part of the project's management information system. The evaluation unit should ensure that data on households, which will provide information on the outcomes, can be easily linked with the data on project inputs. Freeing personnel in the evaluation unit from direct data collection tasks allows them to concentrate on analyzing the data and bringing the results to the attention of program managers.

For some tasks, such as designing the evaluation and analyzing the data, the evaluation unit may need to call on consultants or technical assistance from lending institutions.7 As the Indonesia case illustrates, there are often tradeoffs in the evaluation design that need to be analyzed by experts. The evaluation unit will also require support in addressing some of the conceptual issues involved in analyzing the data, particularly if the evaluation design relies on statistically controlling for differences between participants and nonparticipants in measuring impacts. The wide availability of powerful and cheap microcomputers and of user-friendly statistical software makes the task of processing the data much easier and cheaper than in the past.

By demonstrating a project's benefits, impact evaluations can help to build political support for a project. Impact evaluations can also identify the best ways to carry out particular kinds of interventions and provide convincing evidence for changing or eliminating unsuccessful programs or components, thereby improving the cost-effectiveness of project interventions. As the development community embarks on a major increase in social sector spending, it should reconsider the role that impact evaluations can play in ensuring the continual improvement of the quality of social sector investments. Only policymakers have the power to draw together all the parties involved in a planned intervention, allowing them to debate the merits of conducting an evaluation and how best to proceed should they decide that evaluation is warranted. Policymakers and program managers need to be aware of the tradeoffs and feasibility of the various evaluation options before they can make an informed judgment.

Notes

John Newman is senior economist in the World Bank's Human Resources Division for Mexico and Latin America. Laura Rawlings is a consultant to the Poverty and Human Resources Division of the World Bank's Policy Research Department. Paul Gertler is senior economist at the RAND Corporation.
1. The most common form of evaluation in developing countries, as in industrial countries, is the matched comparison study. Examples of influential matched comparisons conducted in developing countries include those of television-based educational reform in El Salvador (Mayo, Hornick, and McAnany 1976), the Dacca family planning project in Pakistan, the Rajastan applied nutrition program in India (UNESCO 1984), and the Matlab family planning project in Bangladesh (Nag 1992; Balk and others 1988). A recent matched comparison is Revenga, Riboud, and Tan (1994) on employment programs in Mexico.

2. Berg (1987) and Binnendijk (1989) discuss some common concerns voiced about impact evaluation studies.

3. For further discussion of the problems involved in disentangling participation and impact, see the Grossman paper in this volume, Heckman (1992), and Manski and Garfinkel (1992).

4. Programs in Indonesia have been the subject of several evaluations that statistically control for the nonrandom placement of programs. Pitt, Rosenzweig, and Gibbons (1993) evaluated the impact of health and education programs on illness rates and school enrollment; Frankenberg (1993) evaluated the impact of health infrastructure on infant mortality; and Gertler and Molyneaux (1994) evaluated the impact of family planning programs on contraceptive prevalence and fertility.

5. The cost of collecting the data for the baseline and follow-up surveys in the El Chaco area is roughly US$300,000, about 0.4 percent of the total SIF budget of $74.5 million as of May 1993.

6. Both the sample size and the size of the fee increases were selected to obtain a statistical power of more than 80 percent. Power calculations used the national household survey data on health care utilization.

7. For practical information on conducting evaluations, see the "Program Evaluation Kit" put out by Sage Publications, Newbury Park, California, in 1987, which includes books on designing and implementing evaluations. Hoole (1978), Dennis and Boruch (1989), North (1988), and Freeman, Rossi, and Wright (1980) provide useful sources for exploring the developing-country context. For general information on evaluations, Evaluation Review may be consulted. For information on evaluation designs, the classic work by Campbell and Stanley (1963) is recommended. Fitz-Gibbon and Lyons Morris (1987) also provide practical information on designing evaluations. Riecken and Boruch (1984) provide further discussion of experimental designs in evaluating social programs.

References

The word "processed" describes informally reproduced works that may not be commonly available through library systems.

Balk, Deborah, Khodezatul Faiz, Ubaidur Rob, J. Chakraborty, and George Simmons. 1988. "An Analysis of Costs and Cost-Effectiveness of the Family Planning-Health Services Project in Matlab, Bangladesh." International Center for Diarrheal Research, Bangladesh. Processed.

Berg, Alan. 1987. Malnutrition: What Can Be Done? Lessons from World Bank Experience. Baltimore, Md.: Johns Hopkins University Press.

Binnendijk, Annette L. 1989. "Donor Agency Experience with the Monitoring and Evaluation of Development Projects." Evaluation Review 13(3):206-22.

Blum, Deborah, and Richard Feachem. 1983. "Measuring the Impact of Water Supply and Sanitation Investments on Diarrhoeal Diseases: Problems of Methodology." International Journal of Epidemiology 12(3):357-65.

Boruch, Robert, John McSweeny, and John Soderstrom. 1978. "Randomized Field Experiments for Program Planning, Development, and Evaluation: An Illustrative Bibliography." Evaluation Quarterly 2(4):655-95.

Campbell, Donald, and Julian Stanley. 1963. Experimental and Quasi-Experimental Designs for Research. Chicago, Ill.: Rand McNally.

Coa, Ramiro. 1992. "Diseño para la Evaluación de Impacto de las Intervenciones FIS." Fondo de Inversión Social, La Paz, Bolivia. Processed.

Cuca, Roberto, and Catherine Pierce. 1977. Experiments in Family Planning: Lessons from the Developing World. Baltimore, Md.: Johns Hopkins University Press.

Dennis, Michael, and Robert Boruch. 1989. "Randomized Experiments for Planning and Testing Projects in Developing Countries: Threshold Conditions." Evaluation Review 13(3):292-309.

Diaz-Guerrero, R., Isabel Reyes-Lagunes, Donald Witzke, and Wayne Holtzman. 1976. "Plaza Sesamo in Mexico: An Evaluation." Journal of Communication (Spring):145-54.

Fitz-Gibbon, Carol Taylor, and Lynn Lyons Morris. 1987. How to Design a Program Evaluation. Newbury Park, Calif.: Sage Publications.

Frankenberg, Elizabeth. 1993. "The Effect of Access to Health Care on Infant Mortality in Indonesia: 1980-87." Dorothy S. Thomas Award Paper presented at the 1993 Population Association of America (PAA) meetings, Cincinnati, Ohio. Processed.

Freedman, Ronald, and John Y. Takeshita. 1969. Family Planning in Taiwan: An Experiment in Social Change. Princeton, N.J.: Princeton University Press.

Freeman, Howard, Peter Rossi, and Sonia Wright. 1980. Evaluating Social Projects in Developing Countries. Paris: Organization for Economic Cooperation and Development (OECD).

Friend, Jamesine, Barbara Searle, and Patrick Suppes, eds. 1980. Radio Mathematics in Nicaragua. Stanford, Calif.: Stanford University Press.

Gertler, Paul, and John Molyneaux. 1994. "How Economic Development and Family Planning Programs Combined to Reduce Indonesian Fertility." Demography 31(1):33-64.

Heckman, James J. 1992. "Randomization and Social Policy Evaluation." In Charles Manski and Irwin Garfinkel, eds., Evaluating Welfare and Training Programs. Cambridge, Mass.: Harvard University Press.

Hoole, Francis W. 1978. Evaluation Research and Development Activities. Beverly Hills, Calif.: Sage Publications.

Indonesia, Ministry of Health. 1992. "Health Care Resource Needs and Mobilization in KalTim and NTB: Interim Results, Health Project III." WD-6281-1-MoH/RI. Jakarta.

Jamison, Dean, Barbara Searle, Klaus Galda, and Stephen P. Heyneman. 1981. "Improving Elementary Mathematics Education in Nicaragua: An Experimental Study of the Impact of Textbooks and Radio on Achievement." Journal of Educational Psychology 73(4):556-67.

Kagitcibasi, Cigdem, Diane Sunar, and Sevda Bekman. 1993. "Long-Term Effects of Early Intervention." Department of Education, Bogazici University, Istanbul, Turkey. Processed.

Lerman, Charles, John Molyneaux, Soetedjo Moeljodihardjo, and Sahala Pandjaitan. 1989. "The Correlation between Family Planning Program Inputs and Contraceptive Use in Indonesia." Studies in Family Planning 20(1):26-37.

Manski, Charles, and Irwin Garfinkel, eds. 1992. Evaluating Welfare and Training Programs. Cambridge, Mass.: Harvard University Press.

Mayo, J. K., R. C. Hornick, and E. G. McAnany. 1976. Educational Reform with Television: The El Salvador Experience. Palo Alto, Calif.: Stanford University Press.

McKay, H., A. McKay, L. Sinisterra, H. Gomez, and P. Lloreda. 1978. "Improving Cognitive Ability in Chronically Deprived Children." Science 200(21):270-78.

Nag, Moni. 1992. "Family Planning Success Stories in Bangladesh and India." Policy Research Working Paper 1041. World Bank, Population and Human Resources Department, Washington, D.C. Processed.

North, W. Haven. 1988. Evaluation in Developing Countries: A Step in Dialogue. Paris: OECD.

Pitt, Mark M., Mark R. Rosenzweig, and Donna M. Gibbons. 1993. "The Determinants and Consequences of the Placement of Government Programs in Indonesia: 1980-86." The World Bank Economic Review 7(3):319-48.

Revenga, Ana, Michelle Riboud, and Hong Tan. 1994. "The Impact of Mexico's Retraining Program on Employment and Wages." The World Bank Economic Review 8(2):247-77.

Riecken, Henry, and Robert Boruch. 1984. Social Experimentation: A Method for Planning and Evaluating Social Intervention. New York: Academic Press.

Rosenzweig, M., and K. Wolpin. 1986. "Evaluating the Effects of Optimally Distributed Public Programs: Child Health and Family Planning Interventions." American Economic Review 76(3):470-82.

Rossi, Peter, Howard Freeman, and Sonia Wright. 1979. Evaluation: A Systematic Approach. Newbury Park, Calif.: Sage Publications.

UNESCO (United Nations Educational, Scientific and Cultural Organization). 1984. Project Evaluation: Problems of Methodology. Paris.

USAID (U.S. Agency for International Development). 1990. Interactive Radio Instruction: Confronting Crisis in Basic Education. AID Science and Technology in Development Series. Washington, D.C.