89278 v2 APPENDIX SET OF TECHNICAL PAPERS

Economics of Education Review 30 (2011) 394–403

Teacher opinions on performance pay: Evidence from India

Karthik Muralidharan (UC San Diego, NBER, and J-PAL, La Jolla, CA 92093-0508, United States) and Venkatesh Sundararaman (South Asia Human Development Unit, World Bank, United States)

Article history: Received 24 December 2010; Accepted 1 February 2011. JEL classification: I21, J45, O15. Keywords: Teacher performance pay; Teacher incentives; Merit pay; Teacher opinions; Education; Education policy; India.

Abstract: The practical viability of performance-based pay programs for teachers depends critically on the extent of support the idea will receive from teachers. We present evidence on teacher opinions with regard to performance-based pay from teacher interviews conducted in the context of an experimental evaluation of a program that provided performance-based bonuses to teachers in the Indian state of Andhra Pradesh. We report four main findings in this paper: (1) over 80% of teachers had a favorable opinion about the idea of linking a component of pay to measures of performance, (2) exposure to an actual incentive program increased teacher support for the idea, (3) teacher support declines with age, experience, training, and base pay, and (4) the extent of teachers' stated ex ante support for performance-linked pay (over a series of mean-preserving spreads of pay) is positively correlated with their ex post performance as measured by estimates of teacher value addition. This suggests that teachers are aware of their own effectiveness and that implementing a performance-linked pay program could not only have broad-based support among teachers but also attract more effective teachers into the teaching profession. © 2011 Elsevier Ltd. All rights reserved.

1. Introduction

Education policy makers around the world have been showing growing interest in directly measuring and rewarding schools and teachers based on student learning outcomes.1 The idea of paying teachers based on direct measures of performance has attracted particular attention since teacher salaries are the largest component of education budgets and an increasing body of research shows that teacher characteristics rewarded under the status quo in most school systems (such as experience and master's degrees in education) are poor predictors of better

Author notes: We thank Julian Betts, Julie Cullen, Eric Hanushek, and Richard Murnane for useful comments and discussions. This is a revised version of a paper originally prepared for the Conference on "Merit Pay: Will it Work? Is it Politically Viable?" sponsored by Harvard's Program on Education Policy and Governance, Taubman Center on State and Local Government, Harvard's Kennedy School, 2010. This paper is based on a project known as the Andhra Pradesh Randomized Evaluation Study (AP RESt), which is a partnership between the Government of Andhra Pradesh, the Azim Premji Foundation, and the World Bank. Financial assistance for the project has been provided by the Government of Andhra Pradesh, the UK Department for International Development (DFID), the Azim Premji Foundation, and the World Bank. We are especially grateful to D.D. Karopady, M.
Srinivasa Rao, and staff student outcomes (Gordon, Kane, & Staiger 2006; Rivkin, of the Azim Premji Foundation for their leadership and meticulous work Hanushek, & Kain 2005; Rockoff, 2004). International evi- in implementing this project. Vinayak Alladi provided excellent research dence suggests that introducing performance-linked pay assistance. The findings, interpretations, and conclusions expressed in this paper are those of the authors and do not necessarily represent the views of the Government of Andhra Pradesh, the Azim Premji Foundation, or the World Bank. 1 Prominent policy initiatives in this regard include the “Race ∗ Corresponding author. Tel.: +1 858 534 2425. to the Top” initiative in the US, as well as similar initiatives E-mail addresses: kamurali@ucsd.edu (K. Muralidharan), in Australia (http://alp.org.au/agenda/school-reform/performance-pay/), vsundararaman@worldbank.org (V. Sundararaman). the UK (Atkinson et al., 2009), Chile (Contreras and Rau, 2009). 0272-7757/$ – see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.econedurev.2011.02.001 K. Muralidharan, V. Sundararaman / Economics of Education Review 30 (2011) 394–403 395 (PLP) programs for teachers leads to significant improve- and base pay. Fourth and finally, the extent of teachers’ ments in student learning outcomes (Lavy, 2002, 2009 in stated ex ante support for PLP (over a series of mean- Israel; Muralidharan & Sundararaman, 2011 in India) and preserving spreads of pay) is positively correlated with cross-sectional evidence in the US suggests that schools their ex post performance as measured by estimates of with individual teacher compensation systems that reward teacher value addition. This correlation continues to be performance are more likely to have high performing stu- positive and significant even after controlling for several dents (Figlio & Kenny, 2007). observable teacher characteristics, suggesting that teach- While empirical evidence suggests that PLP for teachers ers are aware of their own effectiveness (based on traits may improve student learning2 and education adminis- unobservable to the econometrician or policy maker) and trators are increasingly interested in implementing such that their support for PLP is positively correlated with these programs, a critical factor in the success of scaled up unobservables. PLP programs is the extent of support they receive from The last finding is important because Lazear (2000) teachers. Several studies in the US have examined teacher shows that around half the gains from performance-pay attitudes towards PLP and have reported mixed findings in the company he studied were due to more produc- depending on the specific questions asked and the type of tive workers being attracted to join the company under differential pay considered (see Ballou & Podgursky, 1993; a performance-pay system. Similarly, Hoxby and Leigh Jacob & Springer, 2007; Goldhaber et al., 2010 for some (2005) argue that compression of teacher wages in the US illustrative studies in the US). In general, teachers tend to be is an important reason for the decline in teacher quality, most supportive of higher pay for teachers accepting addi- with higher-ability teachers exiting the teacher labor mar- tional responsibilities or accepting positions in schools that ket. 
Our results suggest that teachers are aware of their own are difficult to staff, and least supportive of proposals to link effectiveness and that implementing a performance-linked pay to student test scores (Farkas, Johnson, Duffet, Moye, & pay program could not only have broad-based support Vine, 2003). among teachers, but also attract more effective teachers This paper adds to the literature on teacher opinions into the teaching profession over time. with regard to performance-linked pay by looking at evi- Qualitative evidence based on detailed interviews of dence from the Indian state of Andhra Pradesh. In addition teachers conducted by field enumerators suggests a few to being the first paper to study the levels and correlates of possible reasons for the popularity of the program. First, teacher support for performance pay in the Indian context several teachers reported being de-motivated by the sta- (and to our knowledge, the first in any developing country), tus quo where there is no differentiation of teacher career we also make two original contributions to the global lit- prospects on the basis of performance, which in turn leads erature on teacher opinions on performance pay. First, our to an erosion of motivation over time (over 75% of teach- evidence is based on teacher interviews conducted in both ers in incentive schools report that their motivation levels treatment and control schools in the context of an experi- increased as a result of the performance-based incen- mental implementation of a performance pay program.3 We tive program). Second, teachers also report trusting the therefore present evidence not only on the levels of teacher integrity of the program as run by the Azim Premji Founda- support in general, but can also provide an experimental tion and over 90% of teachers reported a favorable opinion answer to the question of how exposure to an actual perfor- of the Foundation’s program. Finally, teachers report being mance pay program influences teacher opinions. Second, satisfied that the content of the assessment tools provided we can also link teachers’ ex ante opinions on performance an appropriate measure of student learning, with over 85% pay to their actual ex post performance (as measured by saying that the tests used were either good or very good. estimates of teacher value addition), which has not been In parallel work (Muralidharan & Sundararaman, 2011), possible in the literature to date. we find that both group-level and individual-level perfor- We report four main findings in this paper. First, over mance pay programs led to significant improvement in 80% of teachers had a favorable opinion about the idea of student test scores. The results presented in this paper sug- linking a component of pay to measures of performance gest that scaling up such a program may not only improve with over 45% of teachers having a very favorable opinion. learning outcomes in Andhra Pradesh (and India), but also Second, exposure to an actual incentive program increased be popular among teachers. Section 2 of the paper dis- teacher support for PLP with teachers in schools that were cusses some theoretical considerations that may affect how randomly assigned to the incentive program reporting sig- incentive pay schemes may be perceived by teachers. Sec- nificantly higher levels of support. Third, teacher support tion 3 describes the context, the experiment, and the data. 
for PLP declines significantly with age, experience, training, Section 4 presents the main results, and Section 5 con- cludes. 2 2. Theoretical considerations However, a recent experimental study in Tennessee found no impact of a performance-linked teacher bonus program on student test scores (Springer et al., 2010), suggesting that context and program details may There are several reasons for why teachers may not be produce different outcomes in various programs and locations. in favor of a system that paid bonuses to teachers on the 3 The program was implemented by the Azim Premji Foundation (a basis of gains in student test scores. First, evidence from leading non-profit organization working to improve primary education in India) in partnership with the Government of Andhra Pradesh, with tech- psychological studies suggests that monetary incentives nical support from the World Bank. See Muralidharan and Sundararaman can sometimes crowd out intrinsic motivation and lead to (2011) for details. inferior outcomes on the task that is being monitored and 396 K. Muralidharan, V. Sundararaman / Economics of Education Review 30 (2011) 394–403 rewarded (Deci & Ryan, 1985; Fehr & Falk, 2002). Teaching rewarding excellence in teaching and had no negative may be especially susceptible to this concern since many implications for poor performance beyond the non-receipt teachers are thought to enter the profession due to strong of a bonus. Thus, all communications to teachers described intrinsic motivation. Second, teachers may feel that test the program as one that aimed to provide recognition to scores are only one component of a good education and that outstanding teachers as opposed to framing the program being evaluated solely on test scores would limit their func- in terms of “school and teacher accountability”. The Azim tioning as teachers and induce activities such as “teaching Premji Foundation is also a well-regarded entity in India to the test” that may be detrimental to longer-term learning with a reputation among teachers for aiming to improve outcomes. Third, even if test scores represented learning the quality of education in India. It is therefore likely that accurately, the teacher is only one input into the determi- the teachers trusted the integrity of the program. Finally, nation of learning, with crucial inputs being required from testing and coaching for high stakes tests is such an integral the household and from the student as well. Thus, being component of the Indian education system,5 that assess- held accountable for an outcome that is not fully within a ment and evaluation on the basis of improvements in teacher’s “locus of control” may also be de-motivating to student test scores probably seemed like a fair and trans- teachers (a related concern may be measurement error in parent way to assess teacher impact. Thus, a combination of test scores, which may lead to bonuses being determined contextual and program design factors probably helped to mostly by luck). Fourth, depending on the specific structure mitigate the concerns that teachers might have otherwise of the bonus program, the incentive for teachers to cooper- had about such a program. ate among themselves may be affected, which in turn may reduce collegiality in the workplace. Finally, teachers may 3. Context, experimental design and data not trust administrators and head teachers to implement the program fairly and may resist changes to the status 3.1. 
Context quo.4 On the other hand, a system that does not differen- While India has made substantial progress in improving tiate among high and low-performing teachers may also access to primary schooling and primary school enrolment be de-motivating to teachers and reduce effort if higher rates, the average levels of learning remain very low. The effort and effectiveness is not rewarded in any way. The most recent Annual Status of Education Report found that context in India and Andhra Pradesh suggested that this nearly 60% of children aged 6–14 in an all-India sample of may have been a valid concern. Kremer, Muralidharan, over 300,000 rural households could not read at the sec- Chaudhury, Hammer, and Rogers (2005) show that in ond grade level, though over 95% of them were enrolled Indian government schools, teachers reporting high levels in school (Pratham, 2010). Public spending on education of job satisfaction are more likely to be absent. In subsequent has been rising as part of the “Education for All” campaign, focus group discussions with teachers, it was suggested but there are substantial inefficiencies in public delivery of that this was because teachers who were able to get by education services. A study using a nationally representa- with low effort were quite satisfied, while hard-working tive dataset of primary schools in India found that 25% of teachers were dissatisfied because there was no difference teachers were absent on any given day, and that less than in professional outcomes between them and those who half of them were engaged in any teaching activity (Kremer shirked. Thus, it is also possible that the lack of external et al., 2005). reinforcement for performance can erode intrinsic motiva- Andhra Pradesh (AP) is the 5th most populous state in tion and teacher satisfaction (Mullainathan, 2006). In such India, with a population of over 80 million, 73% of whom a context, the provision of external incentives based on live in rural areas. AP is close to the all-India average on objective measures of performance that are transparently measures of human development such as gross enrollment and fairly applied could increase intrinsic motivation, and in primary school, literacy, and infant mortality, as well as teacher satisfaction, which may lead to teachers favoring on measures of service delivery such as teacher absence. such a system. There are a total of over 60,000 such schools in AP and over In summary, the psychological literature on incentives 70% of children in rural AP attend government-run schools suggests that extrinsic incentives that are perceived by (Pratham, 2010). All regular government-school teachers workers as a means of exercising control over them and are employed by the state, and their salary is mostly deter- interfering with norms of professional behavior are more mined by experience and rank, with minor adjustments likely to crowd out intrinsic motivation, while those that based on assignment location, but no component based on are seen as reinforcing norms of professional behavior can any measure of performance. The average salary of regular enhance intrinsic motivation (Fehr & Falk, 2002). Thus, the teachers is over Rs. 10,000/month and total compensation way an incentive program is designed and framed can influ- including benefits is even higher (per capita income in AP ence its effectiveness as well as teacher opinions. is around Rs. 2500/month; 1 US Dollar ≈ 45 Indian Rupees). 
The teacher incentive program implemented in Andhra Pradesh was designed with a view to recognizing and 5 The centrality of testing to the Indian education experience is attested to by the proliferation of coaching classes for high-stakes entrance tests to selective colleges and universities. The best known coaching classes are 4 These points are made in various forms in the several papers that study in turn so selective that there is a large industry of coaching classes for the teacher attitudes towards performance pay in the US including Goldhaber entrance exam for the coaching classes for the entrance exam for highly et al. (2010) and Jacob and Springer (2007). selective institutes such as the Indian Institute of Technology (IIT). K. Muralidharan, V. Sundararaman / Economics of Education Review 30 (2011) 394–403 397 Teacher unions are strong and disciplinary action for non- 4. Results performance is rare.6 4.1. Teacher opinions on performance pay 3.2. Experimental design and data We focus on two main variables of interest. The first is teacher response to the question: “What is your over- The data used in this paper come from an experimental all opinion about the idea of providing high-performing evaluation of the impact of providing performance-linked teachers with bonus payments on the basis of objective bonuses to teachers in Andhra Pradesh (AP). We stud- measures of student performance improvement?” We find ied two types of teacher performance pay (group bonuses that over 80% of the teachers in control schools who were based on school performance, and individual bonuses interviewed report having a somewhat or very favorable based on teacher performance), with the average bonus opinion about such an idea (Table 1, Panel A). Teachers who calibrated to be around 3% of a typical teacher’s annual were exposed to the incentive program report even higher salary (or 35% of a month’s pay). The incentive program support for the idea of performance-linked pay (PLP), with was designed to minimize the likelihood of undesired con- teachers in the individual incentive program showing the sequences (see Muralidharan & Sundararaman, 2011 for highest extent of support (over 88%). Averaged across all details on the incentive design) and the study was con- teachers, over 85% of teachers were supportive of PLP. ducted by randomly allocating the incentive programs Since teacher bonuses were paid over and above the across a representative sample of 300 government-run base salary, the high level of support indicated might partly schools in rural AP with 100 schools each in the group reflect the fact that there was nothing to lose for any and individual incentive treatment groups and 100 schools teacher from the program. Our second variable of interest serving as the control group. is therefore the extent of self-reported teacher preference The school year in AP runs from mid-June to mid- over a schedule of mean-preserving spreads of pay. Since April, and the experiment was carried out in the school pay revisions for public employees in India typically fol- years 2005–2006, and 2006–2007. Baseline tests were low the recommendations of decennial Pay Commissions conducted in June–July 2005 and end of year tests were appointed by the Government of India, the specific ques- conducted in March–April 2006 and 2007. 
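The two ways of stating the bonus size above are consistent; as a quick check of the arithmetic, using the roughly Rs. 10,000/month regular-teacher salary reported in Section 3.1 purely as an illustrative base (the exact bonus formula is given in Muralidharan & Sundararaman, 2011):

\[
0.03 \times 12 \times \text{Rs. }10{,}000 \;=\; \text{Rs. }3{,}600 \;\approx\; 0.36 \text{ of one month's pay},
\]

which matches the "35% of a month's pay" figure quoted above.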
Measures of tion asked was: “The 6th Pay Commission has just been teacher value addition are constructed using this panel set up and is going to consider the amount and structure data on test scores using a standard teacher fixed effect of pay increases in the next 2 years. Suppose that the total specification. The data on teacher opinions used in this budget for increases in teacher salaries is 15%. How would paper comes from interviews conducted with teachers you want this money to be allocated?” The choices ranged in July–August 2006 and 2007 respectively. These inter- from an across the board increase of 15% for all teachers views were conducted after the teachers had exposure to allocating all of the extra money for performance-linked to the program, but before they knew their own results bonuses.8 (and bonus amounts to be received) because the bonuses We find that over 70% of teachers in control schools based on performance in each school year were paid out a report a preference for at least some component of total few months into the next school year (usually in Septem- pay being linked to performance (Table 1, Panel B). Again, ber). Teachers in all three treatment groups (control, group the level of support is higher in the schools that were part incentive, and individual incentive schools) were inter- of the incentive experiment, with over 78% of teachers viewed and the interviews included questions on teaching in individual incentive schools expressing such a prefer- practice, activities during the school year, and opinions ence. Across all teachers, over 75% expressed a preference on teacher performance pay. The control schools were for some PLP. If teachers are risk-averse and have ratio- not exposed to the details of the performance pay treat- nal expectations about the distribution of their abilities, ments, but were probably aware that the Foundation was we would expect less than 50% to support expected-wage- conducting pilot programs involving performance-linked neutral performance pay since there was no risk premium bonuses in other schools.7 We report the main results offered in the set of options. The 75% positive response regarding teacher opinions on performance-linked pay could reflect several factors including over-optimism about both separately by treatment as well as pooled across their own abilities, a belief that it will be politically more treatments. feasible to secure funds for salary increases if these are linked to performance, or a sense that such a system could bring more professional respect to teachers, would be fair 6 See Kingdon and Muzammil (2001) for an illustrative case study of to high-performing teachers, and could enhance teacher the power of teacher unions in India. Kremer et al. (2005) find that 25% of motivation across the board. teachers are absent across India, but only 1 head teacher in their sample of The rest of the analyses in this paper use the answers 3000 government schools had ever fired a teacher for repeated absence. 7 There was no formal communication to any school about programs to these two questions as the main dependent variable of being offered to other schools, but field reports suggest that teachers gen- erally knew about the programs offered in other schools through informal channels. 
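The teacher value-addition measures described above are estimated from the two-year panel of student test scores with a standard teacher fixed-effect specification; the estimating equation itself is not reproduced in this excerpt. A minimal sketch of one conventional formulation, in our notation rather than necessarily the authors' exact specification, is:

\[
y_{ijt} \;=\; \alpha\, y_{ij,t-1} \;+\; \delta_t \;+\; \mu_j \;+\; \varepsilon_{ijt},
\]

where \(y_{ijt}\) is the normalized test score of student \(i\) taught by teacher \(j\) in year \(t\), \(\delta_t\) is a year effect, and the estimated teacher fixed effect \(\hat{\mu}_j\), averaged over the two project years, serves as the value-added measure used later in Table 4.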
We cannot rule out the possibility that there may have been 8 some spillovers from the incentive program in other schools to teachers’ The questions were asked of teachers in 2006 and 2007, while the 6th opinions in control schools, but we think that this is unlikely since direct Pay Commission was set up in 2006 and made its recommendations in interaction between teachers in control and incentive schools was very 2008. This timing made the phrasing of the question salient in a way that limited (due to the geographical dispersion of the schools). could be understood by all teachers. 398 K. Muralidharan, V. Sundararaman / Economics of Education Review 30 (2011) 394–403 Table 1 Teacher opinions on performance pay (summary statistics). (Panel A): Summary of teacher favorability towards performance pay Distribution of answers to the question: “What is your overall opinion about the idea of providing high-performing teachers with bonus payments on the basis of objective measures of student performance improvement?” Very unfa- Somewhat Neutral Somewhat Very Total of somewhat vorable unfavor- favorable favorable and very favorable able Control (n = 458) 3.1% 4.4% 11.6% 36.0% 45.0% 81.0% Group incentive schools (n = 508) 2.2% 4.1% 7.7% 36.8% 49.2% 86.0% Individual incentive schools (n = 540) 1.5% 5.0% 4.8% 32.4% 56.3% 88.7% All teachers (n = 1506) 2.2% 4.5% 7.8% 35.0% 50.5% 85.5% (Panel B): Summary of teacher opinions on mean preserving spreads of pay Distribution of answers to the question: “The 6th Pay Commission has just been set up and is going to consider the amount and structure of pay increases in the next 2 years. Suppose that the total budget for increases in teacher salaries is 15%. How would you want this money to be allocated?” a. Flat increase of b. Flat increase of c. Flat increase of d. Flat increase of Fraction of teachers 15% for all teachers, 10% for all 5% for all teachers, 0% for all teachers, who would like no performance teachers, rest based rest based on rest based on some component based component on performance performance performance of salary increase (range of salary (range of salary (range of salary to be based on increase from 10% increase from 5% to increase from 0% to performance to 20% based on 25% based on 30% based on performance) performance) performance) Control (n = 465) 29.7% 47.7% 10.3% 12.3% 70.3% Group Incentive Schools 24.1% 47.5% 15.5% 12.9% 75.9% (n = 503) Individual Incentive 21.6% 49.7% 13.6% 15.1% 78.4% Schools (n = 537) All teachers (n = 1505) 24.9% 48.4% 13.2% 13.5% 75.1% interest (the exact questions and distributions of answers out the controls (column 2). Columns 3 and 4 break down are shown in Table 1). Table 2 presents ordered probit and the results by treatment as well as by year and we see that OLS estimates of teacher responses to these questions as in general there was no significant difference between the a function of the treatment status of their school and the two years for any of the treatment groups. Columns 5–8 project year (Panels A and B show the two different ques- present the results from the OLS specification and we see tions, which correspond exactly to those in Panels A and B the same pattern of results, with teachers in both types in Table 1). 
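To see why the options in Table 1, Panel B are described as mean-preserving spreads, consider a back-of-the-envelope illustration (ours, not the authors' derivation) under the simplifying assumption that the performance-linked share of the 15% budget is paid out in full and averages out across teachers. Under option (b), a teacher who expects to be of average measured performance faces

\[
\mathbb{E}[\text{raise}] \;=\; \underbrace{10\%}_{\text{flat component}} \;+\; \underbrace{5\%}_{\text{expected bonus}} \;=\; 15\%,
\]

the same expected raise as the sure 15% of option (a), but with realized raises ranging from 10% to 20%. A risk-averse teacher with no private information about her own effectiveness should therefore weakly prefer option (a), which is why the observed 75% support for some performance-linked component is informative.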
The ordered probit specifications make the best of incentive schools showing significantly higher levels of use of the information contained in the teacher surveys support for PLP, and the highest support being in individual because the answers to the questions in Table 1 are only incentive schools (though not significantly different rela- ordinal and not cardinal. However, since these coefficients tive to group incentive schools in the OLS specification). are difficult to interpret, we also present OLS specifications Panel B (opinions on mean-preserving spreads of pay) that use a binary dependent variable indicating favorability show the same patterns as Panel A, but the difference towards PLP (using the same classification tabulated in the between group and individual incentives is not significant last column of Table 1). The OLS coefficients can be directly in any of the specifications though the point estimates con- interpreted as the change in probability of being favorable tinue to indicate greater support in individual incentive towards PLP. schools. Breaking down the results by year and treat- We see that teachers in both the group and the indi- ment suggests that support in individual incentive schools vidual incentive treatment groups are more likely to be in increased at the end of the second year (in the ordered favor of performance pay than those in the control schools probit specification with demographic controls), while it (the omitted category) with teachers in individual incen- decreased in group incentive schools (OLS specifications tive schools significantly more in favor than those in group based on binary indicator of support). Since we have only incentive schools (Table 2, Panel A, column 1). Since these two years worth of data, it is difficult to generalize from treatments are randomly assigned, the results suggest that these results as to how the long-run attitude towards PLP exposure to the programs increased teacher preference programs may evolve among teachers. However, it is worth for performance-linked bonuses. We control for several noting that the overall level of support does not seem to teacher demographic characteristics (the ones shown in change much over the two years of the program. Table 3) and find that this result is unchanged, but the number of observations falls by a quarter due to lack of 4.2. Demographic correlates of teacher opinions demographic data on all teachers and as a result the coeffi- cient on the group incentive schools is no longer significant, While the overall level of teacher support for though the point estimate is essentially the same as with- performance-linked bonuses is high, there may be variation K. Muralidharan, V. Sundararaman / Economics of Education Review 30 (2011) 394–403 399 Table 2 Teacher opinions by treatment and year. 
Panel A: Favorability towards performance pay (PP) Ordered probit: favorability towards PP OLS: favorable or very favorable towards PP [1] [2] [3] [4] [5] [6] [7] [8] Control schools 0.810*** 0.813*** (0.019) (0.026) Group incentive (GI): A 0.126* 0.128 0.095 0.136 0.050** 0.050* 0.057* 0.062* (0.073) (0.084) (0.104) (0.108) (0.025) (0.027) (0.034) (0.034) Individual incentive (II): B 0.273*** 0.323*** 0.202* 0.225** 0.077*** 0.094*** 0.068** 0.076** (0.074) (0.088) (0.106) (0.110) (0.024) (0.026) (0.032) (0.032) Control*Year 2 −0.170* −0.129 −0.006 −0.013 (0.103) (0.123) (0.035) (0.042) GI*Year 2 −0.107 −0.15 −0.019 −0.043 (0.094) (0.120) (0.031) (0.039) II*Year 2 −0.031 0.121 0.012 0.033 (0.099) (0.118) (0.027) (0.029) Teacher demographic controls No Yes No Yes No Yes No Yes Test: A = B 0.047 0.019 0.197 0.051 Observations 1506 1137 1506 1137 1506 1137 1506 1137 R2 0.008 0.029 0.008 0.032 Panel B: Extent of mean-preserving spread of pay desired Ordered Probit: extent of OLS: Would like to have some component of total pay mean-preserving spread of pay desired be based on performance [1] [2] [3] [4] [5] [6] [7] [8] Control schools 0.703*** 0.717*** (0.022) (0.030) Group incentive (GI): A 0.151** 0.134 0.231** 0.256*** 0.056* 0.071** 0.092** 0.107*** (0.076) (0.086) (0.094) (0.098) (0.030) (0.033) (0.039) (0.040) Individual incentive (II): B 0.209*** 0.261*** 0.195** 0.217** 0.081*** 0.103*** 0.074* 0.081** (0.075) (0.087) (0.093) (0.099) (0.029) (0.032) (0.039) (0.040) Control*Year 2 0.085 0.16 −0.028 −0.027 (0.097) (0.125) (0.041) (0.052) GI*Year 2 −0.074 −0.143 −0.097*** −0.123*** (0.088) (0.108) (0.035) (0.044) II*Year 2 0.114 0.283*** −0.016 0.025 (0.084) (0.104) (0.033) (0.038) Teacher demographic controls No Yes No Yes No Yes No Yes Test: A = B 0.419 0.122 0.379 0.312 Observations 1505 1138 1505 1138 1505 1138 1505 1138 R2 0.006 0.023 0.011 0.03 Notes: 1. All regressions are clustered at the teacher level. Significance levels are as follows: *10%, **5%, and ***1%. 2. Columns [1]–[4] report ordered probits based on the full range of responses shown in Table 1. 3. For columns [5]–[8] of panels A and B, the dependent variable is a binary indicator of teacher favorability towards performance pay and willingness to accept a performance-based component in total compensation (these are based on the last column in Table 1). 4. Teacher demographic controls included in the even columns are the same set reported in Table 3. in support by teacher demographic characteristics. Table 3 teachers are usually locally-hired on fixed-term renew- presents bivariate correlations between several teacher able contracts, are typically not professionally trained, characteristics and their attitudes towards performance- and are paid much lower salaries than those of regular linked pay in general (columns 1 and 2) and the extent of teachers—often less than one-fifth as much. Since these mean preserving spreads of PLP that they would prefer to teachers do not obtain any of the benefits of civil-service see (columns 3 and 4). As in Table 2, we report both ordered job security or pay, and appear to be as effective as regular probit (columns 1 and 3) and OLS specifications (columns teachers (see Muralidharan & Sundararaman, 2010 for fur- 2 and 4) with incrementally coded and binary responses ther details) it is not surprising that they were supportive respectively (the results from these 2 specifications almost of the idea of performance-linked pay. never differ in terms of which covariates are significant). 
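The estimates in Tables 2 and 3 come from ordered probits over the full five-point response scale and from OLS on a binary favorability indicator, with standard errors clustered at the teacher level (Table 2, note 1). The sketch below shows how such estimates could be produced for the Table 2 treatment comparison; the file, data-frame, and column names are our own illustrative assumptions, not the study's actual variables, and clustering of the ordered-probit standard errors is omitted for brevity.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Hypothetical layout: one row per teacher-year interview with
#   opinion_code          1 (very unfavorable) ... 5 (very favorable)
#   favorable             1 if somewhat or very favorable, else 0
#   group_incentive,
#   individual_incentive  treatment dummies (control schools omitted)
#   teacher_id            identifier used for clustering
df = pd.read_csv("teacher_opinions.csv", dtype={"teacher_id": str})

# Ordered probit on the full response scale (Table 2, columns 1-4 style)
df["opinion_cat"] = pd.Categorical(df["opinion_code"], ordered=True)
oprobit = OrderedModel(
    df["opinion_cat"],
    df[["group_incentive", "individual_incentive"]],
    distr="probit",
).fit(method="bfgs", disp=False)
print(oprobit.summary())

# Linear probability model on the binary indicator, clustered by teacher
# (Table 2, columns 5-8 style)
lpm = smf.ols(
    "favorable ~ group_incentive + individual_incentive", data=df
).fit(cov_type="cluster", cov_kwds={"groups": df["teacher_id"]})
print(lpm.summary())
```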
Finally, we also run these regressions with a linear The main results here are teachers with higher levels of interaction between each characteristic and a dummy for training, greater experience, higher base pay, and teachers ‘incentive school’ status to test for differential response who are older are significantly less likely to support the by covariates across treatment and control schools, and idea of PLP. find that in most cases, we cannot reject the null of sim- One group of teachers who strongly support PLP are ilar response by covariates across treatments. Notable contract teachers (also known as para-teachers). Contract exceptions include teachers who have completed college 400 Table 3 Correlates of teacher opinions on performance pay. K. Muralidharan, V. Sundararaman / Economics of Education Review 30 (2011) 394–403 Panel A: Bivariate regressions of teacher opinion on each covariate Panel B: Multiple regression of teacher opinion on all covariates Teacher characteristic Favorability towards PP Extent of mean-preserving Favorability towards PP Extent of mean-preserving spread of pay desired spread of pay desired Ordered Probit OLS Ordered Probit OLS Ordered Probit OLS Ordered Probit OLS [1] [2] [3] [4] [5] [6] [7] [8] 0.013 −0.017 0.003 −0.011 0.094 −0.008 0.085 0.015 Male (0.069) (0.020) (0.069) (0.026) (0.073) (0.022) (0.073) (0.028) −0.051 −0.011 −0.022 −0.025 0.2 0.047 −0.022 −0.04 College degree (0.098) (0.029) (0.096) (0.036) (0.146) (0.038) (0.129) (0.047) Bachelor’s in education or higher −0.241*** −0.051** −0.045 −0.011 −0.352*** −0.065*** −0.023 0.018 level teacher training (0.079) (0.021) (0.074) (0.028) (0.109) (0.024) (0.098) (0.034) 0.135 0.009 0.099 0.046 −0.006 −0.017 −0.039 0.008 From same village (0.115) (0.032) (0.109) (0.041) (0.123) (0.039) (0.119) (0.048) −0.016*** −0.002* −0.015*** −0.004*** −0.004 0 −0.003 0.002 Teacher experience (0.004) (0.001) (0.005) (0.002) (0.009) (0.003) (0.011) (0.004) −0.265*** −0.032* −0.223*** −0.069*** 0.1 0.055 0.034 −0.003 Log salary (0.075) (0.018) (0.067) (0.023) (0.152) (0.063) (0.179) (0.072) −0.822*** −0.142*** −0.596*** −0.212*** −0.867*** −0.196** −0.512* −0.253** Log age (0.156) (0.042) (0.153) (0.055) (0.288) (0.079) (0.303) (0.104) Somewhat or very active in −0.056 −0.022 0.116 0.026 −0.1 −0.032 0.066 0.003 teacher unions (0.077) (0.022) (0.075) (0.030) (0.081) (0.023) (0.078) (0.031) 0.807*** 0.115*** 0.548*** 0.131*** 0.401 0.146 0.355 0.049 Contract teachers (0.194) (0.026) (0.174) (0.046) (0.391) (0.118) (0.372) (0.142) Observations 1137 1137 1138 1138 Notes: 1. Columns 1–4 present results from individual bivariate regressions of teacher opinion/preference for performance pay on several teacher characteristics, while columns 5–8 present results from a multiple regression with each of the covariates included. Significance levels are as follows: *10%, **5%, and ***1%. 2. All ordered probit specifications use the full range of responses recorded in Table 1, while all OLS specifications use binary dependent variables coded as in the last column of Table 1. 3. The number of observations for each bivariate regression (in columns 1–4) is not too different from the number of observations reported in columns 5–8 since all questions come from the same set of teacher interviews and item-level non-response is very low. K. Muralidharan, V. Sundararaman / Economics of Education Review 30 (2011) 394–403 401 Table 4 Correlations of teacher preferences with measures of value addition. 
Panel A: Favorability towards performance pay (Year 1 only) All schools Control All incentive schools All schools Control All incentive schools [1] [2] [3] [4] [5] [6] Teacher value added (averaged 0.349*** 0.191 0.386*** 0.284** 0.048 0.341** across 2 years) (0.104) (0.178) (0.127) (0.112) (0.207) (0.143) Teacher demographic No No No Yes Yes Yes controls Observations 730 224 506 681 208 473 Panel B: Extent of mean-preserving spread of pay desired (Year 1 only) All schools Control All incentive schools All schools Control All incentive schools [1] [2] [3] [4] [5] [6] Teacher value added (averaged 0.390*** 0.422** 0.345*** 0.372*** 0.319 0.338*** across 2 years) (0.099) (0.198) (0.116) (0.104) (0.204) (0.126) Teacher demographic No No No Yes Yes Yes controls Observations 730 224 506 681 208 473 Notes: 1. All regressions are ordered probits. The dependent variable is the teacher opinion from year 1 and the main right-hand side variable is the teacher value added averaged across both years of the project. Significance levels are as follows: *10%, **5%, and ***1%. 2. The dependent variable in panel A is the one tabulated in panel A of Table 1, while the dependent variable in panel B is the one tabulated in panel B of Table 1. 3. Teacher demographic controls used in columns 4, 5 and 6 are the full set shown in Table 3. education or formal teacher training, who appear to be change from the status quo desired by teachers (the results even less likely to support PLP when they were in incentive are not very different though). schools. We see that there is a significant positive correla- Since several teacher characteristics (such as age, expe- tion between the extent of performance-linked mean- rience, and base pay) are correlated with each other, we preserving spread of pay that teachers would support and also run a multiple regression on the correlates of teacher a measure of their own effectiveness. The result holds in opinions (using ordered probits and OLS) and present these both the pooled sample across all schools (column 1) as results in columns 5–8 of Table 3. The two main predic- well as in the samples that are disaggregated by treatment tors of teacher preferences are teacher training and age, status (columns 2 and 3). We also test for whether this cor- both of which are significantly negatively correlated with relation can be explained by teacher demographics that teacher preferences for PLP. We also see that the coefficient are correlated with both their opinions on performance on teacher salary is no longer significant in the multiple pay and their actual performance by including as controls regression suggesting that opposition to PLP may be more all the demographic variables shown in Table 3. We find a function of age than that of high base pay under the status that the results are robust to the inclusion of all these quo (though the two are highly correlated). controls and that the magnitudes of the effects are only slightly lower (columns 4–6). If we assume that teacher 4.3. Performance-related correlates of teacher opinions responses would be consistent with their self-interest, then this result suggests that there are aspects of teacher A unique feature of this paper is that our data allows effectiveness that are unobservable to the econometrician us to match teacher opinions on PLP with not only (and to policy makers), but which teachers themselves are demographic characteristics, but with actual measures of aware of. 
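Table 4 relates each teacher's year-1 stated opinion to that teacher's value added averaged over the two project years, again using ordered probits. A compact sketch of the two-step logic (first recover teacher fixed effects from the student test-score panel, then run an ordered probit of opinion on the estimated effect) follows; all file and variable names are illustrative assumptions, and the sketch skips the normalization and averaging details of the actual value-added construction.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Step 1: teacher fixed effects from the student test-score panel
# (one row per student-year: endline score, lagged score, year, teacher_id)
scores = pd.read_csv("student_scores.csv", dtype={"teacher_id": str})
va_fit = smf.ols("score ~ lagged_score + C(year) + C(teacher_id)", data=scores).fit()
teacher_va = (
    va_fit.params.filter(like="C(teacher_id)")               # keep the teacher dummies
    .rename(lambda label: label.split("T.")[1].rstrip("]"))  # recover teacher_id
    .rename("value_added")
)  # the omitted base teacher is dropped in this simplified sketch

# Step 2: ordered probit of the year-1 opinion on estimated value added
opinions = pd.read_csv("teacher_opinions_year1.csv", dtype={"teacher_id": str})
merged = opinions.merge(teacher_va, left_on="teacher_id", right_index=True)
merged["opinion_cat"] = pd.Categorical(merged["opinion_code"], ordered=True)
result = OrderedModel(
    merged["opinion_cat"], merged[["value_added"]], distr="probit"
).fit(method="bfgs", disp=False)
print(result.params)
```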
performance since we can calculate the “value added” to Since unobservable quality traits of teachers are not test scores by each teacher in our sample. We also conduct compensated under the status quo, our finding a positive the interviews on teacher opinions after the school year, correlation between teachers’ ex ante preference for PLP but before teachers know their own performance and bonus and their actual performance (which would be a measure figures. Table 4 shows the correlation (based on ordered of quality) suggests that a system of teacher compensa- probits) between teachers ex ante stated opinions regard- tion that included a performance-linked component may ing PLP after the first year of the program and their actual be able to attract higher-quality teachers into the teaching ex post performance in improving student test scores (aver- profession. Of course, if teacher learning about their own age of each teacher’s ‘value added’ estimate across the two aptitude for teaching mostly takes place only after enter- years). We present results on both dependent variables ing teaching, then the impact of PLP is more likely to be on (Panels A and B), but focus our discussion on the extent of the retention margin, and this may have less of an overall mean-preserving spreads of pay desired by teachers (Panel effect in India given that the rents accruing to government B), since this is the more direct measure of the extent of school teachers are quite large and that few government 402 K. Muralidharan, V. Sundararaman / Economics of Education Review 30 (2011) 394–403 school teachers ever leave their jobs (see Muralidharan & schools about the performance based bonus program was Sundararaman, 2010). careful to frame the bonus program as designed to recog- Finally, we also measure the correlation between nize and reward excellent teaching as opposed to holding changes in teacher opinion between year 1 and year 2 teachers accountable for student performance. The ques- and measures of the actual bonus received and find that tions that were asked to the teachers (including those in this correlation is positive and significant. This suggests the control schools) also used a similar framing. that while teachers are aware of some of their unobserved Finally, testing (and dedicated coaching of students for quality, the scores and bonuses also provide them with high-stakes testing) is such an integral component of the additional information about their effectiveness, which Indian education system, that being evaluated on the basis probably affects their level of support for PLP programs. of improving student performance on tests perhaps does Thus, it is likely that teacher opinions over time will be not seem unfair to teachers. This is in contrast with edu- influenced by their actual performance and that support cation systems like those in the US with a limited history may fall among teachers who receive no or low bonuses. of high-stakes testing. As discussed earlier, evaluation sys- tems that conform to teachers’ own sense of professional 5. Conclusion standards are more likely to be acceptable to teachers and the relative centrality of testing in different education sys- This paper presents the first evidence on teacher opin- tems may be an important factor in explaining differential ions regarding performance-linked pay from a developing teacher attitudes towards performance pay systems that country and is also unique in being able to study these are linked to student test scores. 
opinions in the context of a multi-year experimental eval- The results in this paper suggest some straightforward uation of the impact of PLP on student learning outcomes. policy implications. Linking a component of teacher pay We report four main results in this paper: (1) over 80% of to objective measures of performance is not only likely teachers had a favorable opinion about the idea of link- to improve student learning outcomes (Muralidharan & ing a component of pay to measures of performance, (2) Sundararaman, 2011), but is also likely to be popular exposure to an actual incentive program increased teacher among teachers. The results and discussion presented support for the idea, (3) teacher support declines with age, in this paper suggest that some of the key design fea- experience, training, and base pay, and (4) the extent of tures of a performance-pay system that may be broadly teachers’ stated ex-ante support for performance-linked accepted by teachers include: framing the program less pay (over a series of mean-preserving spreads of pay) is in terms of “school accountability” and more in terms of positively correlated with their ex post performance as “teacher recognition”, fair and transparent administration, measured by estimates of teacher value addition. and being seen as rewarding aspects of teacher behavior It is worth reflecting on why our findings (especially that are consistent with teachers’ own notions of good pro- the high levels of teacher support for performance bonuses fessional conduct. linked to improvements in student test scores) may be dif- ferent from the typically low levels of support for PLP based on student test scores found in other studies—especially References in the US. We can think of three possible reasons for the divergence. Atkinson, A., Burgess, S., Croxson, B., Gregg, P., Propper, C., Slater, H., & Wilson, D. (2009). Evaluating the impact of performance-related pay First, most papers that study teacher opinions on for teachers in England. Labour Economics, 16, 251–261. performance-linked pay are based on responses to ques- Ballou, D., & Podgursky, M. (1993). Teachers’ attitudes toward merit tions about the general concept as opposed to specific pay: Examining conventional wisdom. Industrial and Labor Relations Review, 47, 50–61. well-defined schemes. It is possible that the absence of Contreras, D. G., & Rau, T. B. (2009). Tournaments, gift exchanges, and the specifics may lead a risk-averse teacher to be wary of effect of monetary incentives for teachers: The case of Chile. University changes and oppose the suggestion. Our data is based on of Chile, Department of Economics., p. 42. teacher responses in the context of an actual program, Deci, E. L., & Ryan, R. M. (1985). Intrinsic Motivation and Self-Determination in Human Behavior. New York: Plenum. which they could see was transparently designed and fairly Farkas, S., Johnson, J., Duffet, A., Moye, L., & Vine, J. (2003). Stand by me: implemented by an independent NGO. Thus, though the What teachers really think about unions, merit pay, and other professional questions asked were about teachers’ opinion on PLP in matters. Washington, DC: Public Agenda. Fehr, E., & Falk, A. (2002). Psychological Foundations of Incentives. Euro- general, the answers probably considered the specific pro- pean Economic Review, 46, 687–724. gram as a prototype.9 Figlio, D. N., & Kenny, L. (2007). Individual teacher incentives and student Second, since PFP programs in other parts of the world performance. 
Journal of Public Economics, 91, 901–914.

(and especially the US) are often associated with school accountability measures, the framing of these programs can often connote an adversarial relationship between teachers and administrators, which may also explain lower levels of support in such contexts. The communication to

9 Of course, this would not explain the high levels of support in the control schools as well, but may explain why exposure to the program increased the support for the idea of PLP.

Goldhaber, D., DeArmond, M., & DeBurgomaster, S. (2010). Teacher attitudes about compensation reform: Implications for reform implementation. (Urban Institute Working Paper 50).
Gordon, R., Kane, T., & Staiger, D. (2006). Identifying effective teachers using performance on the job. Washington, DC: The Brookings Institution.
Hoxby, C. M., & Leigh, A. (2005). Pulled away or pushed out? Explaining the decline of teacher aptitude in the United States. American Economic Review, 94, 236–240.
Jacob, B., & Springer, M. (2007). Teacher attitudes towards pay for performance: Evidence from Hillsborough County, Florida. National Center for Performance Incentives.
Kingdon, G. G., & Muzammil, M. (2001). A political economy of education in India: The case of U.P. Economic and Political Weekly, 36.
Kremer, M., Muralidharan, K., Chaudhury, N., Hammer, J., & Rogers, F. H. (2005). Teacher absence in India: A snapshot. Journal of the European Economic Association, 3, 658–667.
Lavy, V. (2002). Evaluating the effect of teachers' group performance incentives on pupil achievement. Journal of Political Economy, 110, 1286–1317.
Lavy, V. (2009). Performance pay and teachers' effort, productivity, and grading ethics. American Economic Review, 99, 1979–2011.
Lazear, E. (2000). Performance pay and productivity. American Economic Review, 90, 1346–1361.
Mullainathan, S. (2006). Development economics through the lens of psychology. Harvard University.
Muralidharan, K., & Sundararaman, V. (2010). Contract teachers: Experimental evidence from India. San Diego: University of California.
Muralidharan, K., & Sundararaman, V. (2011). Teacher performance pay: Experimental evidence from India. Journal of Political Economy, forthcoming.
Pratham. (2010). Annual status of education report.
Rivkin, S. G., Hanushek, E. A., & Kain, J. F. (2005). Teachers, schools, and academic achievement. Econometrica, 73, 417–458.
Rockoff, J. E. (2004). The impact of individual teachers on student achievement: Evidence from panel data. American Economic Review, 94, 247–252.
Springer, M. G., Ballou, D., Hamilton, L., Le, V.-N., Lockwood, J. R., McCaffrey, D., et al. (2010). Teacher pay for performance: Experimental evidence from the project on incentives in teaching. Nashville, TN: National Center for Performance Incentives at Vanderbilt University.

The Economic Journal, 120 (August), F187–F203. doi: 10.1111/j.1468-0297.2010.02373.x.
THE IMPACT OF DIAGNOSTIC FEEDBACK TO TEACHERS ON STUDENT LEARNING: EXPERIMENTAL EVIDENCE FROM INDIA* Karthik Muralidharan and Venkatesh Sundararaman We present experimental evidence on the impact of a programme that provided low-stakes diagnostic tests and feedback to teachers, and low-stakes monitoring of classroom processes across a repres- entative set of schools in the Indian state of Andhra Pradesh. We find teachers in treatment schools exerting more effort when observed in the classroom but students in these schools do no better on independently-administered tests than students in schools that did not receive the programme. This suggests that though teachers in the programme schools worked harder while being observed, there was no impact of the feedback and monitoring on student learning outcomes. Policy initiatives to improve the quality of education increasingly involve the use of high-stakes tests to measure progress in student learning.1 While proponents of high- stakes testing claim that they are a necessary (if imperfect) tool for measuring school and teacher effectiveness, opponents argue that high-stakes tests induce distortions of teacher activity such as teaching to the test that not only reduce the validity of the test scores (and any inferences made on their basis), but also lead to negative outcomes.2 An alternative use that is suggested for tests that would preserve their usefulness, while being less susceptible to distortion is to use tests in a low-stakes environment to * We are grateful to Caroline Hoxby, Michael Kremer and Michelle Riboud for their support, advice and encouragement at all stages of this project. We thank Julian Betts, Julie Cullen, Gordon Dahl, Dan Goldhaber, Nora Gordon, Richard Murnane and various seminar participants for useful comments and discussions. The project that this article is based on was conducted by the Azim Premji Foundation on behalf of the Government of Andhra Pradesh with technical support from the World Bank and financial support from the UK Department for International Development (DFID) and the Government of Andhra Pradesh. We thank officials of the Department of School Education in Andhra Pradesh for their continuous support and long- term vision for this research. We are especially grateful to DD Karopady, M Srinivasa Rao and staff of the Azim Premji Foundation for their meticulous work in implementing this project. Sridhar Rajagopalan and Vyjayanthi Sankar of Education Initiatives led the test design and preparation of diagnostic reports on learning. Vinayak Alladi provided outstanding research assistance. The findings, interpretations and con- clusions expressed in this article are those of the authors and do not necessarily represent the views of the World Bank, its Executive Directors, or the governments they represent. 1 The high-stakes for teachers and schools associated with student testing range from public provision of information on school performance to rewards and sanctions for school management and teachers on the basis of these tests. The best known example of high-stakes tests are those associated with school account- ability laws such as No Child Left Behind. 2 See Koretz (2008) for a discussion of the complexities of testing and the difficulty in interpreting test score gains. Holmstrom and Milgrom (1991) and Baker (1992) discuss the problem of multi-task moral hazard, with test-based incentives for teachers being a well-known example of this problem. 
Examples of counter-productive teacher behaviour in response to high-powered incentives include rote 'teaching to the test' and neglecting higher-order skills (Glewwe et al., 2003), manipulating performance by short-term strategies like boosting the caloric content of meals on the day of the test (Figlio and Winicki, 2005), excluding weak students from testing (Jacob, 2005), focusing only on some students in response to 'threshold effects' embodied in the structure of the incentives (Neal and Schanzenbach, 2007) or even outright cheating (Jacob and Levitt, 2003).

provide teachers and school administrators with detailed data on student performance as a diagnostic tool to understand areas of student weakness and to focus their teaching efforts better. The channels posited for the possible effectiveness of low-stakes tests include the benefits of better information in improving teaching practice and increases in teacher intrinsic motivation by focusing attention on student learning levels and improving their ability to set and work towards goals.3 A useful way to distinguish these two approaches is to think of high-stakes tests as 'assessments of learning' and low-stakes tests as 'assessments for learning'.

While the idea of such low-stakes testing is promising, there is very little rigorous evidence on its effectiveness.4 Also, in practice, systems that provide feedback on student performance to teachers are accompanied by varying degrees of training and coaching of teachers on the implications of the feedback for modifying teaching practices. Thus, it is difficult to distinguish the impact of diagnostic testing from the varying levels of training and follow up action that typically accompany such diagnostic feedback. Finally, data that are generated to provide feedback to teachers can also be used for external systems of accountability and it is often difficult to distinguish the channels of impact.5 Visscher and Coe (2003) provide a good review of the literature6 on school performance feedback systems (SPFS) and conclude that: 'Given the complexity of the kinds of feedback that can be given to schools about their performance, the varying contexts of school performance, and the range of ways feedback can be provided, it is extremely difficult to make any kind of generalised predictions about its likely effects'.

In this article, we present experimental evidence on the impact of a programme that provided teachers with written diagnostic feedback on their students' performance (both absolute and relative) at the beginning of the school year, along with suggestions on ways to improve learning levels of students in low achievement areas. Focusing on written feedback reports that are provided directly to teachers allows us to estimate the impact of diagnostic feedback without the confounding effects of different types of training or structured teacher group work that typically accompany such feedback. Instead, our estimates are most relevant for thinking about the impact of programmes that aim to improve teacher performance by making student learning outcomes salient and by providing information that can be used to teach more effectively and to set goals and targets.7

3 See Boudet et al. (2005) for a summary of this approach of using assessment data to improve teaching practices and learning outcomes.
The programme we study was implemented by the Azim Premji Foundation8 during the school year 2005–6, on behalf of the Government of the Indian state of Andhra Pradesh, across 100 randomly selected rural primary schools from a representative sample of such schools in the state.9 The programme received by the 'feedback' schools consisted of an independently administered baseline test at the start of the school year, a detailed written diagnostic feedback report on the performance of students on the baseline test, a note on how to read and use the performance reports and benchmarks, an announcement that students would be tested again at the end of the year to monitor progress in student performance, and low-stakes monitoring of classrooms during the school year to observe teaching processes and activity. It was made clear to schools and teachers that no individually attributable information would be made public, and that there were no negative consequences whatsoever of poor performance on either the baseline or the end-of-year tests. Thus, the programme was designed to focus on the intrinsic motivation of teachers to be better teachers, as opposed to any extrinsic incentives or pressure (monetary or non-monetary).

We find at the end of one year of the programme that teachers in the feedback schools appear to perform better on measures of teaching activity when measured by classroom observations compared to teachers in the control schools. However, there was no difference in test scores between students in the feedback schools and the comparison schools at the end of the year. This suggests that though teachers in the treatment schools worked harder while being observed, there was no impact of the diagnostic feedback and low-stakes monitoring on student learning outcomes.

In a parallel initiative, the Azim Premji Foundation provided teachers in another randomly selected set of schools with the opportunity to obtain performance-linked bonuses in addition to the same diagnostic feedback described above. We find that though the diagnostic feedback on its own had no significant impact on student test scores, the combination of feedback and teacher performance pay had a significant positive effect on student test scores.10 Teachers in both types of schools report similar levels of usefulness of the reports.

8 The Azim Premji Foundation is a leading non-profit organisation in India that works with several state governments to improve the quality of primary education.
9 This study was conducted as part of a larger project known as the Andhra Pradesh Randomised Evaluation Study (AP RESt). The AP RESt studies several interventions to improve education outcomes that provided diagnostic feedback in addition to other programmes such as performance-linked pay for teachers, an extra contract teacher, and cash block grants to schools.
10 The details of the performance pay programme are provided in a companion paper. See Muralidharan and Sundararaman (2009).
However, we find that teachers' self-reported usefulness of the feedback reports does not predict student test scores in the feedback-only schools, but does in the incentive schools. This suggests that the diagnostic feedback did contain useful information, but that teachers were less likely to make effective use of it in the absence of external incentives to do so. While our results do not speak to the potential effectiveness of such feedback when combined with teacher training and targeted follow-up, they do suggest that diagnostic feedback to teachers by itself may not be enough to improve student learning outcomes, especially in the absence of improved incentives to make effective use of the additional inputs.

This article presents the first experimental evaluation of a low-stakes diagnostic testing and feedback intervention and contributes to a small but emerging literature on measuring the impact of low-stakes feedback on student learning. The closest related study is Betts et al. (2010), who use panel data to study the impact of California's Mathematics Diagnostic Testing Project (MDTP) and find positive effects of mandated use of MDTP but no effects of voluntary use by teachers. In a complementary paper, Tyler (2010) studies the extent to which teachers in Cincinnati use data on student-level performance and finds 'relatively low levels of teacher interaction with pages on the web tool that contain student test information that could potentially inform practice'.

The rest of this article is organised as follows: Section 1 describes the experimental intervention and data collection, Section 2 presents the main results of the article and Section 3 discusses policy implications and concludes.

1. Experimental Design

1.1. Context

Andhra Pradesh (AP) is the 5th largest state in India, with a population of over 80 million, of whom around 70% live in rural areas. AP is close to the all-India average on various measures of human development such as gross enrolment in primary school, literacy and infant mortality, as well as on measures of service delivery such as teacher absence (Figure 1a). The state consists of three historically distinct socio-cultural regions (Figure 1b) and a total of 23 districts. Each district is divided into three to five divisions, and each division is composed of ten to fifteen mandals, which are the lowest administrative tier of the government of AP. A typical mandal has around 25 villages and 40 to 60 government primary schools. There are a total of over 60,000 such schools in AP, and around 80% of children in rural AP attend government-run schools (Pratham, 2008).
The average rural primary school is quite small, with total enrolment of around 80 to 100 students and an average of 3 teachers across grades one to five.11 One teacher typically teaches all subjects for a given grade (and often teaches more than one grade simultaneously). All regular teachers12 are employed by the state, are well qualified, and are paid well (the average salary of regular teachers is over four times per capita income in AP). However, incentives for teacher attendance and performance are weak, with teacher absence rates of over 25% (Kremer et al., 2005). Teacher unions are strong and disciplinary action for non-performance is rare.13

11 This is a consequence of the priority placed on providing all children with access to a primary school within a distance of 1 kilometre from their homes.
12 Regular civil-service teachers who are employed by the state government comprise the majority of teachers (around 90%) in government rural schools, with the rest consisting of contract teachers who are hired locally at the school level on annually renewable contracts.
13 Kremer et al. (2005) find that on any given working day, 25% of teachers are absent from schools across India, but only 1 head teacher in their sample of 3,000 government schools had ever fired a teacher for repeated absence. The teacher absence rate in AP is almost exactly equal to the all-India average.

[Fig. 1. (a) Andhra Pradesh (AP) compared with all-India averages: Gross Enrolment, ages 6–11 (%): India 95.9, AP 95.3; Literacy (%): India 64.8, AP 60.5; Teacher Absence (%): India 25.2, AP 25.3; Infant Mortality (per 1,000): India 63, AP 62. (b) District Sampling (Stratified by Socio-cultural Region of AP): Nizamabad and Medak (Telangana); Vishakapatnam and East Godavari (Coastal Andhra); Kadapa (Rayalseema).]

1.2. The Diagnostic Feedback Intervention

Regular government teachers are quite well qualified, with around 85% of teachers in our (representative) sample of teachers having a college degree and 98% having a formal teacher training certificate or degree. However, student learning levels continue to be very low, with a recent all-India survey finding that over 58% of children aged 6 to 14 in an all-India sample of over 300,000 rural households could not read at the second grade level, though over 95% of them were enrolled in school (Pratham, 2008). Education planners and policy makers often posit that an important reason for this is that teachers (though qualified on paper) are not equipped to deal effectively with the classroom situations that they face.14 One area of teacher preparedness that is believed to be lacking is detailed knowledge of the learning levels of their students. For instance, teachers are believed to simply teach from the textbooks without any mapping from the content in the textbook to conceptual learning objectives. This in turn would mean that the teacher is not able to measure or judge the progress made by students against learning objectives.

14 Almost every strategy paper for education issued by the Ministry of Human Resource Development (MHRD) emphasises the need for better teacher training. Examples of both review papers and strategy papers are available at the MHRD website at: http://www.education.nic.in
Another limitation is that many of the children are first-generation learners with illiterate parents, and teachers have very low expectations of what such children can be expected to learn.15 Finally, there is no standardised testing across Indian schools till the completion of 10th grade, which means that teachers have very limited information on the performance of their students against either absolute measures of learning targets or benchmarks of relative performance across comparable schools.16

In response to this lack of information on student learning levels (which is believed to be a problem in both public and private schools), private-sector providers of education services have created products that provide detailed information on student learning levels and customised feedback to teachers. The intervention studied in this article was developed by Education Initiatives (one of India's leading private-sector providers of assessment tools to schools) and consisted of low-stakes tests followed by reports to teachers on the levels of learning of their students and suggestions on how to use these reports to improve their teaching. These reports provide information about student learning by grade-appropriate competence and include sub-district, district and state averages against which performance can be benchmarked. Based on their prior experience with private schools that had sought out and paid for this 'diagnostic assessment' product, Education Initiatives had a strong prior belief that the programme would be able to improve student learning outcomes.

The provision of detailed diagnostic feedback is posited to improve teacher effectiveness through two channels. The first channel is the provision of new information and knowledge, which allows teachers to understand the relative strengths and weaknesses in learning of their students and to realign their efforts to bridge the gaps in student learning. This information can also be used to target their efforts more effectively (for instance, by grouping together students with similar areas of strengths and weaknesses). The second channel posited is that provision of feedback on student performance can increase the intrinsic motivation (defined as an individual's desire to do a task for its own sake (Benabou and Tirole, 2003)) of teachers. Malone and Lepper (1987) integrate several aspects of motivational theory to identify characteristics of tasks that make them more desirable in and of themselves. Some of the factors they highlight include: setting challenges that are neither too difficult nor too easy, being able to set meaningful goals, receiving performance feedback and relating goals to self-esteem. Deci and Ryan (1985) provide another overview of theories about intrinsic motivation and reach very similar conclusions. They suggest that the intrinsic motivation of employees is positively linked to the extent of 'goal orientation' in the task and the extent to which completion of the task enhances professional 'self perception'. Seen in this theoretical light, the components of the treatment studied in this article can be thought of as an attempt to increase the intrinsic motivation of teachers.

15 It is well established in the education literature that the level of teachers' expectations for their students is positively correlated with actual learning levels of students; see, for example, Good (1987) and Ferguson (2003).
16 The lack of credible testing till the 10th grade is partly attributable to the 'no detention' policy in place in Indian government schools. Thus, while schools do conduct internal annual exams for students, promotion to higher grades is automatic and there is no external record or benchmarking of the internal tests.
Thus, if the provision of performance reports to teachers can increase their 'self perception' as professionals who ought to be able to help their students achieve adequate learning standards, then this would be a possible channel of positive impact. Similarly, if the reports help teachers to set goals and direct their efforts towards achieving these goals, the provision of feedback reports could again increase intrinsic motivation through improving 'goal orientation'. Coe (1998) summarises the literature on the effectiveness of feedback on performance in general (not just in education) and concludes that 'feedback is found to enhance performance when it focuses attention on, or increases the saliency of, desired outcomes, or when the information it conveys helps to diagnose shortcomings in performance'. Both of these features are found in the intervention studied in this article.

The contents of the intervention comprised a baseline test given at the start of the school year, followed by detailed diagnostic feedback to schools and teachers that provided each student's test score by individual question and aggregated by skill/competence, as well as performance benchmarks for the school, district and state. The communication to the schools emphasised that the first step to improving learning outcomes was to have a good understanding of current levels of learning, and that the aim of these feedback reports was to help teachers improve student learning outcomes.17 The treatment schools were also told that there would be another external assessment of learning conducted at the end of the school year to monitor the progress of students in the school. Finally, enumerators from the Azim Premji Foundation also made six rounds of unannounced tracking surveys to each of the programme schools during September 2005 to February 2006 (averaging one visit/month) to collect data on process variables including student attendance, teacher attendance and activity, and classroom observation of teaching processes.

Thus, the components of the 'feedback' treatment (that were not provided to comparison schools) included a baseline test, written feedback on performance, the announcement of an end-of-year test and regular low-stakes classroom observations of teaching processes. Since the treatment and control schools differed not only in the receipt of feedback, but also in the extent of ongoing visits to collect process data, the treatment effects described in this article are the effects of 'low-stakes feedback and monitoring' as opposed to 'feedback' alone, though we continue to refer to the treatment schools as 'feedback' schools for expositional ease. However, schools and teachers were also told by the project coordinators from the Foundation that no individually-identifiable information would be made public.

17 Samples of communication letters to schools are provided in Appendix A, and samples of the class reports and the feedback reports are provided in Appendix B. Both are available as supporting information online.
Thus, the focus of the intervention was on targeting the intrinsic motivation of teachers to be better teachers, as opposed to external incentives (monetary or non-monetary).

1.3. Sampling, Randomisation and Data Collection

The school sample was drawn as follows: 5 districts were sampled across the 3 socio-cultural regions of AP, in proportion to population (Figure 1b). One division was sampled in each of the 5 districts, following which 10 mandals were randomly sampled in the selected division. In each of the 50 mandals, 2 randomly selected schools were provided with the feedback intervention, making for a total of 100 treatment schools that were a representative sample of rural primary schools in Andhra Pradesh.18 The school year in AP starts in the middle of June, and baseline tests were conducted in these schools during late June and early July, 2005.19 After the tests were scored and school and class reports generated (in July 2005), field coordinators from the Azim Premji Foundation (APF) personally went to each of the 100 schools selected for the feedback intervention in the first week of August 2005 to provide them with student, class and school performance reports, along with oral and written communication that the Foundation was providing the schools with feedback and reports to help them improve learning outcomes. The Foundation also informed them that it would be conducting another assessment at the end of the year to track the progress of students.

In each of the 50 mandals above, an additional six schools were randomly sampled, and these 300 schools served as the comparison schools for the evaluation of the feedback intervention. Since conducting independent external assessments was a part of the treatment, these 300 schools did not receive a baseline test and had no contact with project staff during the school year, except for a single unannounced visit during which enumerators collected similar data on teacher attendance and classroom behaviour as were collected in the 100 feedback schools. At the end of the school year 2005–6, 100 out of these 300 schools (2 in each mandal) were randomly selected to be given the same end-of-year learning assessments that were given to the 100 feedback schools. These 100 schools were given only a week's notice before being tested (whereas the 100 feedback schools knew about the tests from the beginning of the year and were reminded of it by the repeated tracking surveys). The tests were conducted in mathematics and language and consisted of two rounds of tests conducted around two weeks apart.20

Thus, the measures of teacher classroom behaviour in the treatment schools are constructed from six observations over 100 schools over the course of the school year, while the same measures for the control schools are constructed from one observation over 300 schools during the school year. While each individual visit is unannounced, schools in the feedback treatment knew that they were in a study (having received the communications in Appendix A), while the single such visit among the 300 control schools was likely to have been a surprise.

18 As mentioned earlier, this study was conducted in the context of a larger study that evaluated several policy options to improve the quality of primary education in Andhra Pradesh, including group and individual teacher performance pay, the use of contract teachers and the provision of cash block grants to schools, in addition to the provision of diagnostic feedback to schools. The total study was conducted across 500 schools, which were made up of 10 randomly sampled schools in each of 50 randomly sampled mandals. In each mandal, 2 schools were randomly allocated to each of five treatments, one of which was the diagnostic feedback intervention.
19 The selected schools were informed by the government that an external assessment of learning would take place in this period, but there was no communication to any school about any potential intervention at this stage.
20 The first test covered competencies up to that of the previous school year, while the second test (conducted two weeks later) tested skills from the current school year's syllabus. Doing two rounds of testing at the end of each year allowed the testing of more material, improved power by allowing the smoothing of measurement errors specific to the day of testing, and helped to reduce the extent of sample attrition due to student absence on the day of the test.
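The per-mandal sampling described in this section can be summarised in a short sketch. The code below is illustrative only (it is not the study's code, and the school identifiers are hypothetical): within one mandal it draws 2 feedback schools, 6 'business as usual' comparison schools, and then 2 of those comparison schools for the end-of-year assessment.

```python
# Illustrative sketch of the per-mandal sampling described above
# (not the study's actual code; school identifiers are hypothetical).
import random

def sample_mandal_design(mandal_schools, rng):
    """Draw the feedback, comparison and endline-tested comparison schools for one mandal."""
    draw = rng.sample(mandal_schools, 8)           # 8 distinct schools from this mandal
    feedback = draw[:2]                            # 2 schools receive the feedback intervention
    comparison = draw[2:]                          # 6 additional comparison schools
    tested_comparison = rng.sample(comparison, 2)  # 2 of the 6 get the end-of-year test
    return feedback, comparison, tested_comparison

rng = random.Random(2005)
schools_in_mandal = [f"school_{i:02d}" for i in range(1, 26)]  # a hypothetical mandal with 25 schools
fb, comp, tested = sample_mandal_design(schools_in_mandal, rng)
```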
Measures of student learning outcomes are obtained from the end-of-year assessments conducted in the 100 feedback schools and in 100 of the 300 control schools. So the comparison schools are as close to 'business as usual' schools as possible, since they comprise a representative set of schools that were not formally aware of being part of the study during the course of the school year.

2. Results

2.1. Impact of Feedback and Monitoring on Observed Teacher Behaviour

The data on teacher behaviour are collected from classroom observations conducted by enumerators, who sat in classrooms for 20–30 minutes and coded whether various indicators of effective and engaged teaching took place during the time they observed the classroom. Table 1(a) compares the feedback schools with the comparison schools on these measures of teacher behaviour, and we find that the feedback schools (which were also subject to repeated observation) show significantly higher levels of effort on several measures of effective and engaged teaching and do not do significantly worse on any of these measures. Teachers in the feedback schools were found to be significantly more likely to be actively teaching, to be reading from a textbook, to be making students read from their textbook, to address questions to students and to be actively using the blackboard. They were also more likely to assign homework and to provide guidance on homework to students in the classroom. Since the treatment schools were observed six times during the school year and the control schools were observed only once, the differences in observed teacher behaviour could be partly due to being in the treatment group and partly due to the repeated nature of the observations, which might have led teachers to improve their observed performance over time.
We distinguish between these possibilities by running a regression of an index of teacher activity21 on treatment status and the survey round.22 We find that teachers in treatment schools show a 0.11 standard deviation higher level of activity, and that the impact of the survey round is not significant (Table 2, Column 1). Since the first (and only) round of visits in the 300 control schools took place around the same time as the last three visits in the treatment schools (December 2005 to February 2006), we also restrict the analysis to only the last three survey rounds to ensure comparability of the time of year. The results do not change much, but now the survey round is significant at the 10% level, suggesting that teacher behaviour was affected both by the treatment and by the repeated observation (Table 2, Column 2). These superior measures of observed teaching activity in treatment schools could reflect either a genuine increase in teaching activity throughout the school year in response to the treatment, or a temporary increase in teaching activity when under observation by enumerators due to teachers' knowledge that they were in a study (Hawthorne effects). One way of distinguishing between the two possibilities is to study the impact of the programme on student learning outcomes.

21 The index is an average of the 15 measures of teacher activity coded from the classroom observation conducted by enumerators (these are all the measures in Table 1 except teacher absence and activity, which were measured by scanning the teachers and were not based on the classroom observation instrument). Each individual activity is normalised to have a mean of zero and a standard deviation of one in the control schools, and the index is the mean of the 15 normalised individual activities.
22 Coded from 1 to 6 for the treatment schools and coded 1 for the control schools (since each school was only visited once).

Table 1
Process Variables (Based on Classroom Observation)
All figures are in %; activities are performed by teachers unless recorded otherwise. Panel (a) compares feedback schools with comparison schools; panel (b) compares 'feedback and incentive' schools with feedback schools.

Process Variable | (a) Feedback Schools | (a) Comparison Schools | (a) p-value (H0: Diff = 0) | (b) 'Feedback and Incentive' Schools | (b) Feedback Schools | (b) p-value (H0: Diff = 0)
Teacher Absence | 22.5 | 20.6 | 0.342 | 24.9 | 22.5 | 0.21
Actively Teaching | 49.9 | 40.9 | 0.012** | 47.5 | 49.9 | 0.46
Clean & Orderly Classroom | 59.5 | 53.5 | 0.124 | 60.5 | 59.5 | 0.772
Giving a Test | 26.6 | 27.6 | 0.790 | 26.6 | 26.6 | 0.993
Calls Students by Name | 78.1 | 78.6 | 0.865 | 78.5 | 78.1 | 0.878
Addresses Questions to Students | 63.2 | 58.1 | 0.087* | 62.8 | 63.2 | 0.871
Provides Individual/Group Help | 35.7 | 31.9 | 0.263 | 37.1 | 35.7 | 0.625
Encourages Participation | 37.0 | 37.0 | 0.996 | 37.6 | 37.0 | 0.835
Reads from Textbook | 56.1 | 41.9 | 0.000*** | 52.8 | 56.1 | 0.299
Makes Children Read From Textbook | 60.0 | 45.6 | 0.000*** | 57.8 | 60.0 | 0.43
Active Blackboard Usage | 49.1 | 40.9 | 0.014** | 50.0 | 49.1 | 0.764
Assigned Homework | 37.2 | 29.2 | 0.034** | 39.5 | 37.2 | 0.518
Provided Homework Guidance | 32.9 | 18.0 | 0.000*** | 33.6 | 32.9 | 0.849
Provided Feedback on Homework | 27.0 | 13.1 | 0.000*** | 24.7 | 27.0 | 0.478
Children were Using a Textbook | 67.4 | 60.8 | 0.026** | 66.0 | 67.4 | 0.559
Children Asked Questions in Class | 37.0 | 42.6 | 0.069* | 37.1 | 37.0 | 0.958
Teacher Was in Control of the Class | 52.4 | 51.2 | 0.706 | 51.2 | 52.4 | 0.694

Notes. 1. The feedback and 'feedback plus incentive' schools were each visited by a project coordinator around once a month for a total of 6 visits between September 2005 and March 2006, and the measures of teacher behaviour reported here were recorded during classroom observations conducted during these visits. To construct the teacher behaviour variables for 'business as usual' comparison schools, 300 extra schools were randomly sampled (6 in each mandal) and the same surveys were conducted to measure processes in a 'typical' school. Each of these schools was visited only once (at an unannounced date) during the entire year. 2. Each round of classroom observation is treated as one observation, and the standard errors for the t-tests are clustered at the school level (i.e. correlations across visits and classrooms are accounted for in the standard errors). * significant at 10%; ** significant at 5%; *** significant at 1%.

Table 2
Differences in Classroom Observation Process Variables Between Feedback and Control Schools
Dependent Variable = Normalised Index of Classroom Activities

 | All Rounds [1] | Last 3 Rounds Only [2]
Feedback Schools | 0.107 (0.053)** | 0.104 (0.044)**
Rounds | 0.013 (0.010) | 0.04 (0.022)*
Observations | 4,132 | 2,758
R-squared | 0.02 | 0.02

Notes. The dependent variable is the normalised index of classroom process variables. The index is the mean of fifteen normalised process variables from classroom observation in Table 1 (all except the first two, which are measured differently). The normalisation of the index is with respect to the distribution in the control schools during the first visit. The reason for the distinction between 'All Rounds' and 'Last 3 Rounds Only' is that the timing of data collection in the control schools corresponded to the last 3 rounds of data collection in the treatment schools. Thus, column 2 represents data collected in a comparable time of the year in both treatment and control schools. All regressions include standard errors clustered at the school level. * significant at 10%; ** significant at 5%; *** significant at 1%.
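Footnote 21 and the notes to Table 2 describe how the classroom-activity index is constructed. The sketch below illustrates that construction in Python; it is not the study's code, and the data frame and column names are hypothetical.

```python
# Illustrative sketch of the classroom-activity index (footnote 21): each of the
# fifteen observed activities is normalised against the control-school
# observations and the index is the mean of the normalised activities.
# The DataFrame `obs` and its column names are hypothetical.
import pandas as pd

ACTIVITY_COLS = [
    "clean_classroom", "giving_test", "calls_by_name", "questions_to_students",
    "individual_group_help", "encourages_participation", "reads_from_textbook",
    "children_read_textbook", "uses_blackboard", "assigned_homework",
    "homework_guidance", "homework_feedback", "children_using_textbook",
    "children_ask_questions", "teacher_in_control",
]  # 15 activity indicators, coded 0/1 in each classroom observation

def activity_index(obs: pd.DataFrame) -> pd.Series:
    """Mean of the 15 activities, each normalised to mean 0 / sd 1 in control schools."""
    control = obs.loc[obs["treatment"] == 0, ACTIVITY_COLS]
    z = (obs[ACTIVITY_COLS] - control.mean()) / control.std()
    return z.mean(axis=1)

# obs["activity_index"] = activity_index(obs)
```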
2.2. Impact of Feedback and Monitoring on Student Test Scores

To study the impact of the low-stakes diagnostic feedback and monitoring on student learning outcomes, we estimate the equation:

T_ijkm = α + δ·Feedback_k + β·Z_m + ε_k + ε_jk + ε_ijk

The main dependent variable of interest is T_ijkm, the normalised student test score on the mathematics and language tests (at the end of the school year 2005–6), where i, j, k, m denote the student, grade, school and mandal respectively. All regressions include a set of mandal-level dummies (Z_m), and the standard errors are clustered at the school level. Since the randomisation is stratified and balanced by mandal, including mandal fixed effects increases the efficiency of the estimate. The 'Feedback' variable is a dummy at the school level indicating whether the school was in the feedback treatment, and the parameter of interest is δ, the effect of being in a feedback school on normalised test scores.
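The estimating equation above can be implemented directly with standard regression software. The following sketch, assuming a student-level DataFrame with hypothetical column names (norm_score, feedback, mandal, school_id), uses mandal fixed effects and clusters standard errors at the school level; it is illustrative rather than the authors' code.

```python
# Illustrative implementation of the specification above (not the authors' code):
# normalised endline score on a feedback dummy with mandal fixed effects and
# standard errors clustered at the school level. Column names are hypothetical.
import statsmodels.formula.api as smf

def estimate_feedback_effect(df):
    model = smf.ols("norm_score ~ feedback + C(mandal)", data=df)
    result = model.fit(cov_type="cluster", cov_kwds={"groups": df["school_id"]})
    return result.params["feedback"], result.bse["feedback"]  # delta and its clustered SE
```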
The random assignment of treatment ensures that the 'Feedback' variable in the equation above is not correlated with the error term, and the estimate is therefore unbiased.23

The main result we find is that there is no significant effect of the diagnostic feedback and monitoring on student test scores (Table 3). Not only is the effect insignificant, but the magnitude of the effect is very close to zero in both mathematics and language tests. The large sample size and multiple rounds of tests meant that the experiment had adequate power to detect an effect as low as 0.075 standard deviations at the 10% level and 0.09 standard deviations at the 5% level.24 Thus, the non-effect is quite precisely estimated.

23 Since the conduct of external tests and the salience of the test score were a part of the treatment, it was important that the control schools did not get a baseline test. However, the random assignment also means that a baseline test is not needed for this analysis.
24 Experiments in education typically lack power to identify effects below 0.10 SD (for instance, the treatment effects estimated in the education experiments surveyed in Glewwe et al. (2008) mostly have standard errors above 0.07, and would not have adequate power to detect an effect below 0.10 SD).

Table 3
Impact of Diagnostic Feedback and Low-Stakes Monitoring on Student Test Score Performance
Dependent Variable = Normalised End-of-Year Student Test Scores

 | Combined [1] | Mathematics [2] | Telugu (Language) [3]
Feedback Schools | 0.002 (0.045) | −0.018 (0.048) | 0.022 (0.044)
Observations | 48,791 | 24,386 | 24,405
R-squared | 0.108 | 0.112 | 0.111

Notes. The sample includes the feedback schools and the 100 comparison schools that also received the same test as the feedback schools at the end of the school year 2005–6. The former had a baseline test, diagnostic feedback on the baseline test, regular low-stakes monitoring to measure classroom processes, and advance notice about the end-of-year assessments. The comparison schools had none of these. All regressions include mandal (sub-district) level fixed effects and standard errors clustered at the school level. * significant at 10%; ** significant at 5%; *** significant at 1%.

It is possible that there were heterogeneous treatment effects among students even though there was no mean programme effect (for instance, teachers may have used the feedback reports to focus on lower performing students). Figure 2 plots the quantile treatment effects of the feedback programme on student test scores, defined for each quantile s as

δ(s) = G_n^{-1}(s) − F_m^{-1}(s),

where G_n and F_m represent the empirical distributions of the treatment and control test scores, with n and m observations respectively. Plotting these differences with bootstrapped 95% confidence intervals, we see that the treatment effect is close to zero at every percentile of final test scores.

[Fig. 2. Quantile (Percentile) Treatment Effects: the difference between treatment and control normalised test scores, with a 95% confidence interval, plotted against the percentile of the endline score.]
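The quantile treatment effects plotted in Figure 2 can be computed directly from the two empirical score distributions. The sketch below is a simple illustration of δ(s) with a percentile bootstrap for the confidence band; it resamples students independently and therefore ignores clustering by school, which a full replication would need to handle.

```python
# Illustrative computation of the quantile treatment effects in Figure 2:
# delta(s) = G_n^{-1}(s) - F_m^{-1}(s), with a naive percentile bootstrap for the
# 95% band (students resampled independently; school clustering is ignored here).
import numpy as np

def quantile_te(treat, control, quantiles):
    return np.quantile(treat, quantiles) - np.quantile(control, quantiles)

def bootstrap_band(treat, control, quantiles, reps=1000, seed=0):
    rng = np.random.default_rng(seed)
    draws = np.empty((reps, len(quantiles)))
    for r in range(reps):
        t = rng.choice(treat, size=len(treat), replace=True)
        c = rng.choice(control, size=len(control), replace=True)
        draws[r] = quantile_te(t, c, quantiles)
    return np.percentile(draws, [2.5, 97.5], axis=0)

# Example usage with hypothetical score arrays `treat` and `control`:
# qs = np.linspace(0.05, 0.95, 19); delta = quantile_te(treat, control, qs)
```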
Thus, not only did the programme have no impact on average, but it also had no significant impact on any part of the student achievement distribution.25 We also test for differential effects by student gender and caste and find no evidence of any such differences.

The lack of any impact of the treatment on student test scores (at any point in the achievement distribution) suggests that the superior measures of teacher effort found during the classroom observations are likely to have been a temporary response to the presence of enumerators in the classroom on a repeated basis and the knowledge that the schools were part of a study (confirming the presence of a Hawthorne effect). Field reports from enumerators anecdotally confirm that teachers typically became more attentive when the enumerators entered the school, and also suggest that most teachers in the feedback schools briefly glanced at the reports at the beginning of the school year but did not actively use them in their teaching.

25 The lack of baseline scores and limited data on student characteristics in the control schools mean that we can look at quantile treatment effects in terms of the end-of-year scores but cannot compute heterogeneous effects by initial scores. However, given the almost identical distributions of test scores in treatment and control schools and the random allocation of schools to treatment and control categories, it is highly unlikely that there would have been differential effects by baseline score.

2.3. Comparing the Effect of Feedback With and Without External Incentives

As mentioned earlier, the evaluation of low-stakes diagnostic feedback and monitoring was carried out in the context of a larger randomised evaluation of several policy interventions to improve the quality of primary education in Andhra Pradesh (AP). Two of these policies consisted of the provision of performance-linked bonuses26 to teachers in randomly selected schools, in addition to the feedback and regular low-stakes monitoring that were provided to the 'feedback' schools. These schools received everything that the feedback schools did, but were also eligible to receive performance-linked bonus payments to teachers, and are referred to hereafter as 'incentive' schools. The incentive schools received exactly the same amount of measurement, feedback and monitoring as the feedback schools and only differ from the feedback schools in that they are also eligible for performance-linked bonuses.

We compare teacher behaviour in incentive and feedback schools and find that there was no difference in teacher behaviour as measured by classroom observations across the two types of schools (Table 1b). However, we find that student test scores are significantly higher in the incentive schools compared to the feedback schools.27 These apparently paradoxical results are summarised in Table 4, where we see that evaluating school performance based on observed teacher behaviour would suggest that the incentives had no impact at all, but that the feedback programme had a large positive effect on teacher behaviour.
However, if we were to evaluate school performance on the basis of student learning outcomes, the conclusion would be reversed, since it is the incentive schools that do much better than the feedback schools, while the feedback schools do not score any better than the comparison schools that did not receive the baseline test, diagnostic tests and regular monitoring.

Table 4
Summary of Incentive, Feedback, and Comparison Schools on Teacher Behaviour and Student Outcomes

Teacher Effort and Behaviour (Measured by Classroom Observations): Incentives + Feedback + Monitoring = Feedback + Monitoring > Comparison Schools
Student Learning Outcomes (Measured by Test Scores): Incentives + Feedback + Monitoring > Feedback + Monitoring = Comparison Schools

The most likely explanation for this apparent paradox is that teachers were able to change their behaviour under observation, and that they were particularly likely to do so under repeated observation by (usually) the same enumerator over the course of the year. If behaviour is affected by being part of a study and by being observed repeatedly (as suggested by Table 2), it would explain why we find no difference in teacher behaviour between the incentive and feedback schools (where each school was observed six times over the course of the school year and where all schools knew they were in a study), while we do find a difference between these schools and the control schools (which were observed only once during the year and were never revisited for classroom observations). This interpretation is supported by the fact that there is no difference between feedback and comparison schools in teacher absence or classroom cleanliness (measures which cannot be affected after the enumerator arrives in the school), but there is a significant difference in actions that a teacher is likely to believe constitute good teaching and which can be modified in the presence of an observer (such as using the blackboard, reading from the textbook, making children read from the textbook and assigning homework). However, the fact that there is no effect of feedback and monitoring on test scores suggests that while the teachers in the feedback schools worked harder while under observation, the low-stakes feedback and monitoring did not induce enough change in teacher effort over the entire year to influence student learning outcomes.28

26 One treatment provided the opportunity to receive performance-based bonuses at the school level (group incentives), while the other provided the opportunity at the teacher level (individual incentives).
27 The details of the results of the performance-pay interventions are presented in a companion paper (Muralidharan and Sundararaman, 2009), but the summary result is discussed here to enable the comparison between feedback with and without incentives.
28 Teachers in the incentive schools appear to have increased efforts on dimensions that were not well captured by the classroom observations, such as conducting extra classes beyond regular hours (see Muralidharan and Sundararaman, 2009).

The lack of impact of the feedback on test scores raises the question of whether the diagnostic feedback itself was of any use at all to the teachers. Table 5 shows teachers' self-reports on how useful they found the diagnostic feedback reports (this was reported before they knew how well they had performed and is therefore not biased by actual performance).

Table 5
Summary of Usefulness of Feedback

 | Very Useful (%) | Somewhat Useful (%) | Not Useful (%) | Correlation with student outcomes: Method 1 | Method 2
Incentives + Feedback + Monitoring | 55.8 | 33 | 11.2 | 0.098* | 0.098**
Feedback + Monitoring | 43.5 | 44.5 | 12 | 0.029 | 0.064

Notes. Teachers in incentive and feedback schools were interviewed after the school year 2005–6 and asked how useful they found the feedback reports. The summary statistics on stated usefulness are reported here, along with the correlations of stated usefulness with student learning outcomes. In Method 1, 'very useful' is coded as 1 and the other responses are coded as 0. In Method 2, the responses are coded continuously from 0 (not useful) to 2 (very useful). All regressions include mandal (sub-district) level fixed effects and standard errors clustered at the school level. * significant at 10%; ** significant at 5%; *** significant at 1%.
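The two codings of teachers' stated usefulness used in Table 5 are simple to reproduce. The sketch below shows one way to construct them (variable names are hypothetical); the resulting variables would then enter the same mandal-fixed-effects regressions of student test scores with school-clustered standard errors.

```python
# Illustrative coding of the stated-usefulness responses behind Table 5
# (not the authors' code; names are hypothetical).
import pandas as pd

def code_usefulness(responses: pd.Series) -> pd.DataFrame:
    scale = {"not useful": 0, "somewhat useful": 1, "very useful": 2}
    method2 = responses.str.lower().map(scale)   # Method 2: continuous 0-2 coding
    method1 = (method2 == 2).astype(int)         # Method 1: 1 if 'very useful', else 0
    return pd.DataFrame({"useful_m1": method1, "useful_m2": method2})
```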
The same fraction of teachers (around 88%) in both feedback and incentive schools mention finding the feedback reports to be either somewhat or very useful.29 But correlating teachers' stated usefulness of the reports with the learning outcomes of their students (Table 5, columns 4 and 5) shows that the stated usefulness of the reports was a significant predictor of student test scores only in the incentive schools and not in the feedback schools. This does not mean that the reports caused the better performance in the incentive schools, but it does suggest that the written diagnostic feedback reports contained content that teachers perceived to be useful and could have used effectively if they had wanted to. Since the stated usefulness of the reports positively predicts test scores only in the incentive schools, it appears that only the teachers in the incentive schools actually made such use of them. This is consistent with the finding in our companion paper that the interaction between inputs and incentives is positive and that the presence of incentives can increase the effectiveness of school inputs (including pedagogical materials such as diagnostic feedback reports).

29 Though a significantly larger fraction of teachers in incentive schools report finding the reports 'very useful' (56% vs. 44%).

3. Conclusion

Critics of high-stakes testing in schools point to the potential distortions in teacher behaviour induced by such testing and suggest that low-stakes tests that provide teachers with feedback on the performance of their students can be more effective in improving student learning. Such low-stakes diagnostic tests and school performance feedback are key components of several school improvement initiatives, but the empirical evidence to date on their effectiveness is very limited. A limitation in the literature to date is the varying degrees to which feedback is combined with coaching and training of teachers, which makes it difficult to isolate the impact of feedback alone.
A second limitation is the lack of rigorous evidence on the causal impact of such diagnostic feedback. We present experimental evidence of the impact of a programme that provided 100 randomly selected rural primary schools in the Indian state of Andhra Pradesh with a 'feedback' intervention that consisted of an externally administered baseline test, detailed score reports of students and diagnostic feedback on student performance, an announcement that the schools would be tested again at the end of the year, and ongoing low-stakes monitoring through the school year.

There are three main results in this article. First, the feedback reports had no impact on student test scores at any percentile of the achievement distribution. Second, evaluating the impact of the programme based on observed classroom behaviour would be biased, since we find strong evidence for Hawthorne effects. Third, the feedback reports had useful content but were used more effectively by teachers when combined with performance-linked bonuses for teachers, which provided an incentive for improving student learning.

Our results do not imply that diagnostic feedback on school and student performance cannot be useful in improving learning outcomes. Both the self-reports of the teachers regarding the usefulness of the reports and the positive correlations between these reports and student outcomes in the incentive schools suggest that there was useful content in the reports. Similarly, the experience of Education Initiatives (the firm that designed the tests and diagnostic feedback) suggests that schools that demanded and paid for the diagnostic reports benefited from them (and continued to pay for the reports in subsequent years). However, our results do suggest that simply following a supply-side policy of providing such feedback reports may not be enough to improve student learning outcomes in the absence of reforms that increase the demand for such tools from teachers, followed by changes in teaching practice that use these tools effectively.

The experiment studied here focused on the use of performance measurement and feedback as a way of improving teachers' intrinsic motivation and was careful not to confound this effect with the extrinsic incentives that may have arisen from making such assessment information public. However, the results presented in this article, combined with those in our companion paper on teacher performance pay, suggest that modifying the incentive structures under which teachers operate may induce them to make better use of educational inputs such as diagnostic feedback reports on student learning. Studying the relative effectiveness of monetary and non-monetary incentives (such as those created by publicising school performance data, or a strong group- or peer-driven coaching programme to respond to such data) in inducing teachers to make more effective use of inputs such as diagnostic feedback reports is an open question for future research.

University of California, San Diego
World Bank

Additional Supporting information may be found in the online version of this article:
Appendix A. Details of Communication Letter to Schools
Appendix B.1. Sample of Class Report
Appendix B.2. Extracts from the Note Accompanying the Class Reports
Appendix B.3. Template for Diagnostic Feedback Letters
Please note: The RES and Wiley-Blackwell are not responsible for the content or functionality of any supporting materials supplied by the author. Any queries (other than missing material) should be directed to the author of the article.

References

Baker, G. (1992). 'Incentive contracts and performance measurement', Journal of Political Economy, vol. 100, pp. 598–614.
Benabou, R. and Tirole, J. (2003). 'Intrinsic and extrinsic motivation', Review of Economic Studies, vol. 70, pp. 489–520.
Betts, J., Hahn, Y. and Zau, A. (2010). 'The effect of diagnostic testing in math on student outcomes', mimeo, University of California, San Diego.
Boudett, K.P., City, E. and Murnane, R. (2005). Data Wise: A Step-by-Step Guide to Using Assessment Results to Improve Teaching and Learning, Cambridge, MA: Harvard Education Press.
Coe, R. (1998). 'Feedback, value added and teachers' attitudes: models, theories and experiments', unpublished PhD thesis, Durham: University of Durham.
Deci, E.L. and Ryan, R.M. (1985). Intrinsic Motivation and Self-Determination in Human Behavior, New York: Plenum.
Ferguson, R.F. (2003). 'Teachers' perceptions and expectations and the black-white test score gap', Urban Education, vol. 38, pp. 460–507.
Figlio, D.N. and Winicki, J. (2005). 'Food for thought: the effects of school accountability plans on school nutrition', Journal of Public Economics, vol. 89, pp. 381–94.
Glewwe, P., Holla, A. and Kremer, M. (2008). 'Teacher incentives in the developing world', mimeo, Harvard University.
Glewwe, P., Ilias, N. and Kremer, M. (2003). 'Teacher incentives', Cambridge, MA: National Bureau of Economic Research, Working Paper.
Good, T.L. (1987). 'Two decades of research on teacher expectations: findings and future directions', Journal of Teacher Education, vol. 38, pp. 32–47.
Holmstrom, B. and Milgrom, P. (1991). 'Multitask principal-agent analyses: incentive contracts, asset ownership, and job design', Journal of Law, Economics, and Organization, vol. 7, pp. 24–52.
Jacob, B.A. (2005). 'Accountability, incentives and behavior: the impact of high-stakes testing in the Chicago public schools', Journal of Public Economics, vol. 89, pp. 761–96.
Jacob, B.A. and Levitt, S.D. (2003). 'Rotten apples: an investigation of the prevalence and predictors of teacher cheating', Quarterly Journal of Economics, vol. 118, pp. 843–77.
Koretz, D.M. (2008). Measuring Up: What Educational Testing Really Tells Us, Cambridge, MA: Harvard University Press.
Kremer, M., Muralidharan, K., Chaudhury, N., Rogers, F.H. and Hammer, J. (2005). 'Teacher absence in India: a snapshot', Journal of the European Economic Association, vol. 3, pp. 658–67.
Malone, T.W. and Lepper, M.R. (1987). 'Making learning fun: a taxonomy of intrinsic motivations for learning', in (R.E. Snow and M.J. Farr, eds), Aptitude, Learning, and Instruction, pp. 223–53, Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Muralidharan, K. and Sundararaman, V. (2009). 'Teacher performance pay: experimental evidence from India', National Bureau of Economic Research Working Paper No. 15323.
Neal, D. and Schanzenbach, D. (2007). 'Left behind by design: proficiency counts and test-based accountability', National Bureau of Economic Research Working Paper No. 13293.
Pratham (2008). Annual Status of Education Report, available from http://www.asercentre.org/asersurvey.php.
Tyler, J. (2010). 'Evidence based teaching? Using student test data to improve classroom instruction', research paper, Brown University.
Tymms, P. and Wylde, M. (2003). 'Baseline assessments and monitoring in primary schools', research paper, Bamberg, Germany.
Visscher, A.J. and Coe, R. (2003). 'School performance feedback systems: conceptualization, analysis, and reflection', School Effectiveness and School Improvement, vol. 14, pp. 321–49.

Teacher Performance Pay: Experimental Evidence from India
Karthik Muralidharan (University of California, San Diego, and National Bureau of Economic Research) and Venkatesh Sundararaman (World Bank)
Journal of Political Economy, Vol. 119, No. 1 (February 2011), pp. 39–77. Published by The University of Chicago Press. Stable URL: http://www.jstor.org/stable/10.1086/659655

We present results from a randomized evaluation of a teacher performance pay program implemented across a large representative sample of government-run rural primary schools in the Indian state of Andhra Pradesh. At the end of 2 years of the program, students in incentive schools performed significantly better than those in control schools by 0.27 and 0.17 standard deviations in math and language tests, respectively. We find no evidence of any adverse consequences of the program. The program was highly cost effective, and incentive schools performed significantly better than other randomly chosen schools that received additional schooling inputs of a similar value.

This paper is based on a project known as the Andhra Pradesh Randomized Evaluation Study, which is a partnership between the government of Andhra Pradesh, the Azim Premji Foundation, and the World Bank. Financial assistance for the project has been provided by the government of Andhra Pradesh, the U.K. Department for International Development, the Azim Premji Foundation, and the World Bank. We thank Dileep Ranjekar, Michelle Riboud, Amit Dar, Samuel C. Carlson, and officials of the government of Andhra Pradesh and the government of India (particularly I. V. Subba Rao, Vindra Sarup, P. Krishnaiah, and K. Ramakrishna Rao) for their continuous support and long-term vision for this research. We are especially grateful to D. D. Karopady, M. Srinivasa Rao, and staff of the Azim Premji Foundation for their leadership and meticulous work in implementing this project. Sridhar Rajagopalan, Vyjyanthi Shankar, and staff of Education Initiatives led the test design. We thank Vinayak Alladi, Gokul Madhavan, and Ketki Sheth for outstanding research assistance. The findings, interpretations, and conclusions expressed in this paper are those of the authors and do not necessarily represent the views of the government of Andhra Pradesh, the Azim Premji Foundation, or the World Bank. We are grateful to Caroline Hoxby and Michael Kremer for their support, advice, and encouragement at all stages of this project. We thank the editor Derek Neal, two anonymous referees, George Baker, Damon Clark, Julie Cullen, Gordon Dahl, Jishnu Das, Shanta Devarajan, Martin Feldstein, Richard Freeman, Robert Gibbons, Edward Glaeser, Roger Gordon, Sangeeta Goyal, Gordon Hanson, Richard Holden, Asim Khwaja, David Levine, Jens Ludwig, Sendhil Mullainathan, Ben Olken, Lant Pritchett, Halsey Rogers, Richard Romano, and various seminar participants for useful comments and discussions.

[Journal of Political Economy, 2011, vol. 119, no. 1] © 2011 by The University of Chicago. All rights reserved.

I. Introduction

A fundamental question in education policy around the world is that of the relative effectiveness of input-based and incentive-based policies in improving the quality of schools. While the traditional approach to
improving schools has focused on providing them with more resources, there has been growing interest in directly measuring and rewarding schools and teachers on the basis of student learning outcomes. The idea of paying teachers on the basis of direct measures of performance has attracted particular attention since teacher salaries are the largest component of education budgets, and recent research shows that teacher characteristics rewarded under the status quo in most school systems—such as experience and master's degrees in education—are poor predictors of better student outcomes (see Rockoff 2004; Rivkin, Hanushek, and Kain 2005; Gordon, Kane, and Staiger 2006). However, while the idea of using incentive pay schemes for teachers as a way of improving school performance is increasingly making its way into policy,1 the empirical evidence on the effectiveness of such policies is quite limited—with identification of the causal impact of teacher incentives being the main challenge. In addition, several studies have highlighted the possibility of perverse outcomes from teacher incentive and accountability programs (Jacob and Levitt 2003; Jacob 2005; Cullen and Reback 2006; Neal and Schanzenbach 2010), suggesting the need for caution and better evidence before expanding teacher incentive programs based on student test scores.

In this paper, we contribute toward filling this gap with evidence from a large-scale randomized evaluation of a teacher performance pay program implemented in the Indian state of Andhra Pradesh (AP). We studied two types of teacher performance pay (group bonuses based on school performance and individual bonuses based on teacher performance), with the average bonus calibrated to be around 3 percent of a typical teacher's annual salary. The incentive program was designed to minimize the likelihood of undesired consequences (see design details later), and the study was conducted by randomly allocating the incentive programs across a representative sample of 300 government-run schools in rural AP, with 100 schools each in the group and individual incentive treatment groups and 100 schools serving as the comparison group.

1 Teacher performance pay is being considered and implemented in several U.S. states including Colorado, Florida, Tennessee, and Texas, and additional federal resources have been dedicated to such programs under the recent Race to the Top fund created by the U.S. Department of Education in 2009. Other countries that have attempted to tie teacher pay to performance include Australia, Brazil, Chile, Israel, and the United Kingdom.
Department of Education in 2009. Other countries that have attempted to tie teacher pay to performance include Australia, Brazil, Chile, Israel, and the United Kingdom. This content downloaded from 190.81.46.114 on Wed, 5 Jun 2013 18:28:32 PM All use subject to JSTOR Terms and Conditions teacher performance pay 41 tails later), and the study was conducted by randomly allocating the incentive programs across a representative sample of 300 government- run schools in rural AP with 100 schools each in the group and individual incentive treatment groups and 100 schools serving as the comparison group. This large-scale experiment allows us to answer a comprehensive set of questions with regard to teacher performance pay as follows: (i) Can teacher performance pay based on test scores improve student achieve- ment? (ii) What, if any, are the negative consequences of teacher in- centives based on student test scores? (iii) How do school-level group incentives compare with teacher-level individual incentives? (iv) How does teacher behavior change in response to performance pay? (v) How cost effective are teacher incentives relative to other uses for the same money? We find that the teacher performance pay program was effective in improving student learning. At the end of 2 years of the program, stu- dents in incentive schools performed significantly better than those in comparison schools by 0.27 and 0.17 standard deviations (SD) in math and language tests, respectively. The mean treatment effect of 0.22 SD is equal to 9 percentage points at the median of a normal distribution. We find a minimum average treatment effect of 0.1 SD at every per- centile of baseline test scores, suggesting broad-based gains in test scores as a result of the incentive program. We find no evidence of any adverse consequences as a result of the incentive programs. Students in incentive schools do significantly better not only in math and language (for which there were incentives) but also in science and social studies (for which there were no incentives), suggesting positive spillover effects. There was no difference in student attrition between incentive and control schools and no evidence of any adverse gaming of the incentive program by teachers. School-level group incentives and teacher-level individual incentives perform equally well in the first year, but the individual incentive schools outperformed the group incentive schools after 2 years of the program. At the end of 2 years, the average treatment effect was 0.28 SD in the individual incentive schools compared to 0.15 SD in the group incentive schools, with this difference being significant at the 10 percent level. We measure changes in teacher behavior in response to the program with both teacher interviews and direct physical observation of teacher activity. Our results suggest that the main mechanism for the impact of the incentive program was not increased teacher attendance but greater (and more effective) teaching effort conditional on being present. We find that performance-based bonus payments to teachers were a significantly more cost-effective way of increasing student test scores compared to spending a similar amount of money unconditionally on This content downloaded from 190.81.46.114 on Wed, 5 Jun 2013 18:28:32 PM All use subject to JSTOR Terms and Conditions 42 journal of political economy additional schooling inputs. 
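As a quick check of the conversion quoted earlier in this section (a mean treatment effect of 0.22 SD corresponding to about 9 percentage points at the median), the following one-liner assumes normalized scores follow a standard normal distribution; it is purely illustrative.

```python
from scipy.stats import norm

# A student at the median of a standard normal score distribution who gains
# 0.22 SD moves up by norm.cdf(0.22) - norm.cdf(0.0), roughly 9 percentile points.
print(round(norm.cdf(0.22) - norm.cdf(0.0), 3))  # ~0.087
```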
In a parallel initiative, two other sets of 100 randomly chosen schools were provided with an extra contract teacher and with a cash grant for school materials, respectively. At the end of 2 years, students in schools receiving the input programs scored 0.08 SD higher than those in comparison schools. However, the incentive programs had a significantly larger impact on learning outcomes (0.22 vs. 0.09 SD) over the same period, even though the total cost of the bonuses was around 25 percent lower than the amount spent on the inputs. Our results contribute to a growing literature on the effectiveness of performance-based pay for teachers.2 The best identified studies outside the United States on the effect of paying teachers on the basis of student test outcomes are Lavy (2002, 2009) and Glewwe, Ilias, and Kremer (2010), but their evidence is mixed. Lavy uses a combination of re- gression discontinuity, difference in differences, and matching methods to show that both group and individual incentives for high school teach- ers in Israel led to improvements in student outcomes (in the 2002 and 2009 papers, respectively). Glewwe et al. report results from a random- ized evaluation that provided primary school teachers (grades 4–8) in Kenya with group incentives based on test scores and find that, while test scores went up in program schools in the short run, the students did not retain the gains after the incentive program ended. They in- terpret these results as being consistent with teachers expending effort toward short-term increases in test scores but not toward long-term learning.3 Two recent experimental evaluations of performance pay in the United States both reported no effect of performance-based pay for teachers on student learning outcomes (Goodman and Turner [2010] in New York and Springer et al. [2010] in Tennessee). There are several unique features in the design of the field experiment presented in this paper. We conduct the first randomized evaluation of teacher performance pay in a representative sample of schools.4 We take 2 Previous studies include Ladd (1999) in Dallas, Atkinson et al. (2009) in the United Kingdom, and Figlio and Kenny (2007) using cross-sectional data across multiple U.S. states. See Umansky (2005) and Podgursky and Springer (2007) for reviews on teacher performance pay and incentives. The term “teacher incentives” is used very broadly in the literature. We use the term to refer to financial bonus payments based on student test scores. 3 It is worth noting though that evidence from several contexts and interventions sug- gests that the effect of almost all education interventions appears to decay when the programs are discontinued (see Jacob, Lefgren, and Sims 2008; Andrabi et al. 2009), and so this inference should be qualified. 4 The random assignment of treatment provides high internal validity, whereas the random sampling of schools into the universe of the study provides greater external validity than typical experiments by avoiding the “randomization bias,” whereby entities that are in the experiment are atypical relative to the population that the result is sought to be extrapolated to (Heckman and Smith 1995). This content downloaded from 190.81.46.114 on Wed, 5 Jun 2013 18:28:32 PM All use subject to JSTOR Terms and Conditions teacher performance pay 43 incentive theory seriously and design the incentive program to reward gains at all points in the student achievement distribution and to min- imize the risk of perverse outcomes. 
The study design also allows us to test for a wide range of possible negative outcomes. We study group (school-level) and individual (teacher-level) incentives in the same field experiment. We measure changes in teacher behavior with both direct observations and teacher interviews. Finally, we study both input- and incentive-based policies in the same field experiment to enable a direct comparison of their effectiveness.

While set in the context of schools and teachers, this paper also contributes to the broader literature on performance pay in organizations in general and public organizations in particular.5

5 See Gibbons (1998) and Prendergast (1999) for general overviews of the theory and empirics of incentives in organizations. Dixit (2002) provides a discussion of these themes as they apply to public organizations.

True experiments in compensation structure with contemporaneous control groups are rare (Bandiera, Barankay, and Rasul [2007] is a recent exception), and our results may be relevant to answering broader questions regarding performance pay in organizations.

The rest of this paper is organized as follows: Section II provides a theoretical framework for thinking about teacher incentives. Section III describes the experimental design and the treatments, and Section IV discusses the test design. Sections V and VI present results on the impact of the incentive programs on test score outcomes and teacher behavior. Section VII discusses the cost effectiveness of the performance pay programs. Section VIII presents conclusions.

II. Theoretical Framework

A. Multitask Moral Hazard

While basic incentive theory suggests that teacher incentives based on improved test scores should have a positive impact on test scores, multitasking theory cautions that such incentives may increase the likelihood of undesired outcomes (Holmstrom and Milgrom 1991; Baker 1992, 2002). The challenge of optimal compensation design in the presence of multitasking is illustrated by a simple model (based on Baker [2002] and Neal [2010]). Suppose that teachers (agents) engage in two types of tasks in the classroom, T1 and T2, where T1 represents teaching using curricular best practices and T2 represents activities designed to increase scores on exams (such as drilling, coaching on items likely to be on the test, and perhaps even cheating). Let t1 and t2 represent the time spent on these two types of tasks, and let the technology of human capital production (in gains) be given by H = f1·t1 + f2·t2 + ε, where f1 and f2 are the marginal products of time spent on T1 and T2 on human capital production, and ε is random noise in H representing all factors outside the teacher's control that also influence H. The social planner (principal) cannot observe any of H, t1, or t2 but can observe only an imperfect performance measure P (such as test scores) that is given by P = g1·t1 + g2·t2 + φ, where g1 and g2 are the marginal products of time spent on T1 and T2 on test scores, and φ is random noise in P outside the teacher's control. The principal offers a wage contract as a function of P, such as w = s + b·P, where w is the total wage, s is the salary, and b is the bonus rate paid per unit of P.
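To make the mechanics of this setup concrete, here is a minimal numerical sketch of the model. The cost specification (including the effort norm discussed in the next paragraph), the assumption that test-oriented effort is slightly costlier per unit, and all parameter values are illustrative choices, not values taken from the paper.

```python
import numpy as np

# Illustrative parameter values (assumptions, not taken from the paper):
f1, f2 = 1.0, 0.6   # marginal products of t1, t2 in human capital H
g1, g2 = 0.5, 1.0   # marginal products of t1, t2 in the measured score P
t_norm = 0.5        # effort norm: psychic cost if t1 + t2 falls below it

def cost(t1, t2):
    # Convex effort cost (test-prep time t2 assumed slightly costlier per unit)
    # plus a penalty for letting total effort fall below the norm.
    return 0.5 * (t1 + 1.1 * t2) ** 2 + 2.0 * max(t_norm - t1 - t2, 0.0) ** 2

def teacher_choice(b, grid=np.linspace(0, 1, 101)):
    # The teacher picks (t1, t2) to maximize E(w) - C = s + b*(g1*t1 + g2*t2) - C;
    # the salary s does not affect the argmax, so it is dropped.
    best, best_u = (0.0, 0.0), -np.inf
    for t1 in grid:
        for t2 in grid:
            u = b * (g1 * t1 + g2 * t2) - cost(t1, t2)
            if u > best_u:
                best_u, best = u, (t1, t2)
    return best

for b in (0.0, 0.5):
    t1, t2 = teacher_choice(b)
    print(f"b={b:.1f}: t1={t1:.2f}, t2={t2:.2f}, "
          f"E[H]={f1*t1 + f2*t2:.2f}, E[P]={g1*t1 + g2*t2:.2f}")
```

Under these assumed parameters, the optimizing teacher shifts time from t1 to t2 once b is positive, so measured performance P rises even though human capital H falls, which is the multitask concern described above; different parameter values can instead deliver the effort-raising case discussed below.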
The teacher's utility function is given by U = E(w) − C(t1, t2), where E(w) is the expected wage (we abstract away from risk aversion to focus on multitasking), and C(t1, t2; t̄) is the cost associated with any combination of t1 and t2. Here, we follow Holmstrom and Milgrom (1991) in allowing the cost of effort to depend on an effort norm, t̄. Teachers may suffer psychic costs if their total effort levels fall below this norm (i.e., t1 + t2 < t̄). The optimal bonus rate, b*, depends on the functional form of this cost function, but if t1 and t2 are substitutes, it is easy to construct cases (typically when f1 > f2 and g2 > g1, as is believed to be the case by most education experts) in which the optimal contract involves no incentive pay (b* = 0). In these scenarios, it is optimal for the social planner to simply accept the output generated by the norm t̄ because incentive provision can reduce human capital accumulation by causing teachers to reduce t1 and increase t2.

However, Neal (2010) notes that even when t1 and t2 are substitutes, the introduction of incentive pay may well be welfare improving in environments where t̄ is small. When t̄ is small, the gains from increasing total effort are more likely to exceed the costs from distorting the allocation of effort between t1 and t2. In addition, it is clear that incentive pay is more attractive when f1/f2 is not much greater than one because, in these cases, substitution from t1 to t2 is less costly.

There is evidence to suggest that t̄ may be quite low in India. A study using a nationally representative data set of primary schools in India found that 25 percent of teachers were absent on any given day and that less than half of them were engaged in any teaching activity (Kremer et al. 2005). There are also reasons to believe that f1/f2 may be close to one in India. The centrality of exam preparation in Indian and other Asian education systems may mean that the "best practices" in the education system may not be very different from teaching practices meant to increase test scores. There is also evidence to suggest that the act of frequent test taking can increase comprehension and retention even of nontested materials (Chan, McDermott, and Roediger 2006).

So, it is possible that setting b > 0 will not only increase test scores (P) but also increase underlying human capital of students (H), especially in contexts such as India for the reasons mentioned above. Whether or not this is true is an empirical question and is the focus of our research design and empirical analysis (Secs. IV and V).

B. Group versus Individual Incentives

The theoretical prediction of the relative effectiveness of individual and group teacher incentives is ambiguous. Group (school-level) incentives could induce free riding and thus normally be lower powered than individual (teacher-level) incentives (Holmstrom 1982). However, social norms and peer monitoring (which may be feasible in the small groups of teachers in our setting) may enable community enforcement of the first-best level of effort, in which case the costs of free riding may be mitigated or eliminated (Kandel and Lazear 1992; Kandori 1992).
Fi- nally, if there are gains to cooperation or complementarities in pro- duction, then it is possible that group incentives might yield better results than individual incentives (Itoh 1991; Hamilton, Nickerson, and Owan 2003). The relative effectiveness of group and individual teacher performance pay is therefore an empirical question, and we study both types of incentives in the same field experiment over two full academic years. III. Experimental Design A. Context While India has made substantial progress in improving access to pri- mary schooling and primary school enrollment rates, the average levels of learning remain very low. The most recent Annual Status of Education Report found that nearly 60 percent of children aged 6–14 in an all- India survey of rural households could not read at the second-grade level, though over 95 percent of them were enrolled in school (Pratham 2010). Public spending on education has been rising as part of the “Education for All” campaign, but there are substantial inefficiencies in public delivery of education services. As mentioned earlier, a study using a representative sample of Indian schools found that 25 percent of teachers were absent on any given day and that less than half of them were engaged in any teaching activity (Kremer et al. 2005). This content downloaded from 190.81.46.114 on Wed, 5 Jun 2013 18:28:32 PM All use subject to JSTOR Terms and Conditions 46 journal of political economy Fig. 1.—A, Andhra Pradesh (AP); B, district sampling (stratified by sociocultural regions of AP). Andhra Pradesh is the fifth most populous state in India, with a pop- ulation of over 80 million (70 percent rural). AP is close to the all-India average on measures of human development such as gross enrollment in primary school, literacy, and infant mortality, as well as on measures of service delivery such as teacher absence (fig. 1A). The state consists of three historically distinct sociocultural regions and a total of 23 dis- tricts (fig. 1B). Each district is divided into three to five divisions, and each division is composed of 10–15 mandals, which are the lowest ad- ministrative tier of the government of AP. A typical mandal has around 25 villages and 40–60 government primary schools. This content downloaded from 190.81.46.114 on Wed, 5 Jun 2013 18:28:32 PM All use subject to JSTOR Terms and Conditions teacher performance pay 47 The average rural primary school is quite small, with a total enroll- ment of around 80 students and an average of three teachers across grades 1–5. One teacher typically teaches all subjects for a given grade (and often teaches more than one grade simultaneously). All regular teachers are employed by the state, and their salary is mostly determined by experience and rank, with minor adjustments based on assignment location but no component based on any measure of performance. The average salary of regular teachers at the time of the study was around Rs. 8,000 per month (US$1 ≈ 45 Indian rupees [Rs.]), and total com- pensation including benefits was over Rs. 10,000 per month (per capita income in AP is around Rs. 2,000 per month). Teacher unions are strong, and disciplinary action for nonperformance is rare.6 B. Sampling We sampled five districts across each of the three sociocultural regions of AP in proportion to population (fig. 1B).7 In each of the five districts, we randomly selected one division and then randomly sampled 10 man- dals in the selected division. 
In each of the 50 mandals, we randomly sampled 10 schools using probability proportional to enrollment. Thus, the universe of 500 schools in the study was representative of the schooling conditions of the typical child attending a government-run primary school in rural AP.

6 Kremer et al. (2005) find that 25 percent of teachers are absent across India, but only one head teacher in their sample of 3,000 government schools had ever fired a teacher for repeated absence. See Kingdon and Muzammil (2009) for an illustrative case study of the power of teacher unions in India.
7 The districts were chosen so that districts within a region would be contiguous for logistical reasons.

C. Design Overview

The performance pay experiments were conducted as part of a larger research project implemented by the Azim Premji Foundation to evaluate the impact of policy options to improve the quality of primary education in AP. Four interventions were studied, with two being based on providing schools with additional inputs (an extra contract teacher and a cash block grant) and two being based on providing schools and teachers with incentives for better performance (group and individual bonus programs for teachers based on student performance).

The overall design of the project is represented in table 1. As the table shows, the input treatments (see Sec. VII) were provided unconditionally to the selected schools at the beginning of the school year, and the incentive treatments consisted of an announcement that bonuses would be paid at the beginning of the next school year conditional on average improvements in test scores during the current school year.

TABLE 1
                          Incentives (Conditional on Improvement in Student Learning)
Inputs                    None                     Group Bonus      Individual Bonus
None                      Control (100 schools)    100 schools      100 schools
Extra contract teacher    100 schools
Extra block grant         100 schools

No school received more than one treatment, which allows the treatments to be analyzed independently of each other. The school year in AP starts in the middle of June, and the baseline tests were conducted in the 500 sampled schools during late June and early July 2005.8 After the baseline tests were scored, two out of the 10 project schools in each mandal were randomly allocated to each of five cells (four treatments and one control). Since 50 mandals were chosen across five districts, there were a total of 100 schools (spread out across the state) in each cell (table 1). The geographic stratification implies that every mandal was an exact microcosm of the overall study, which allows us to estimate the treatment impact with mandal-level fixed effects and thereby net out any common factors at the lowest administrative level of government.

Table 2 (panel A) shows summary statistics of baseline school characteristics and student performance variables by treatment (control schools are also referred to as a "treatment" for expositional ease).
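For concreteness, the mandal-stratified assignment described above (two of the ten sampled schools in each mandal allocated to each of the five cells) can be sketched as follows; the school identifiers and random seed are hypothetical and this is not the study's actual assignment code.

```python
import random
from collections import Counter

# Two schools per cell within each mandal (10 schools / 5 cells).
CELLS = ["control", "group incentive", "individual incentive",
         "extra contract teacher", "block grant"]

def assign_within_mandal(school_ids, rng):
    ids = list(school_ids)
    rng.shuffle(ids)
    return {sid: CELLS[i // 2] for i, sid in enumerate(ids)}

rng = random.Random(2005)  # hypothetical seed
mandals = {f"mandal_{m:02d}": [f"school_{m:02d}_{s}" for s in range(10)]
           for m in range(1, 51)}

assignment = {}
for mandal, schools in mandals.items():
    assignment.update(assign_within_mandal(schools, rng))

# Each cell ends up with 100 schools (2 per mandal x 50 mandals).
print(Counter(assignment.values()))
```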
Column 4 provides the p-value of the joint test of equality, showing that the null of equality across treatment groups cannot be rejected for any of the variables.9 After the randomization, program staff from the foundation person- ally went to each of the schools in the first week of August 2005 to provide them with student, class, and school performance reports and with oral and written communication about the intervention that the school was receiving. They also made several rounds of unannounced 8 The selected schools were informed by the government that an external assessment of learning would take place in this period, but there was no communication to any school about any of the treatments at this time (since that could have led to gaming of the baseline test). 9 Table 2 shows sample balance across control, group incentive, and individual incentive schools, which are the focus of the analysis in this paper. The randomization was done jointly across all five treatments shown in table 1, and the sample was also balanced on observables across the other treatments. This content downloaded from 190.81.46.114 on Wed, 5 Jun 2013 18:28:32 PM All use subject to JSTOR Terms and Conditions TABLE 2 Sample Balance across Treatments p-Value Group Individual (Equality of Control Incentive Incentive All Groups) (1) (2) (3) (4) A. Means of Baseline Variables School-level variables: 1. Total enrollment (baseline: grades 1–5) 113.2 111.3 112.6 .82 2. Total test takers (baseline: grades 2–5) 64.9 62.0 66.5 .89 3. Number of teachers 3.07 3.12 3.14 .58 4. Pupil-teacher ratio 39.5 40.6 37.5 .66 5. Infrastructure index (0–6) 3.19 3.14 3.26 .84 6. Proximity to facilities index (8–24) 14.65 14.66 14.72 .98 Baseline test performance: 7. Math (raw %) 18.5 18.0 17.5 .69 8. Math (normalized; in SD) .032 .001 Ϫ.032 .70 9. Telugu (raw %) 35.1 34.9 33.5 .52 10. Telugu (normalized; in SD) .026 .021 Ϫ.046 .53 B. Means of End Line Variables Teacher turnover and attrition: Year 1 (relative to year 0): 11. Teacher attrition (%) .30 .34 .30 .54 12. Teacher turnover (%) .34 .34 .32 .82 Year 2 (relative to year 0): 13. Teacher attrition (%) .35 .38 .34 .57 14. Teacher turnover (%) .34 .36 .33 .70 Student turnover and attrition: Year 1 (relative to year 0): 15. Student attrition from baseline to end-of-year tests .081 .065 .066 .15 16. Baseline math test score of attrit- ors (equality of all groups) Ϫ.17 Ϫ.13 Ϫ.22 .77 17. Baseline Telugu test score of attritors (equality of all groups) Ϫ.26 Ϫ.17 Ϫ.25 .64 Year 2 (relative to year 0): 18. Student attrition from baseline to end-of-year tests .219 .192 .208 .23 19. Baseline math test score of attrit- ors (equality of all groups) Ϫ.13 Ϫ.05 Ϫ.14 .56 20. Baseline Telugu test score of attritors (equality of all groups) Ϫ.18 Ϫ.11 Ϫ.21 .64 Note.—The infrastructure index is the sum of six binary variables showing the existence of a brick building, a playground, a compound wall, a functioning source of water, a functional toilet, and functioning electricity. The proximity index is the sum of eight variables (each coded from 1 to 3) indicating proximity to a paved road, a bus stop, a public health clinic, a private health clinic, public telephone, bank, post office, and the mandal educational resource center. 
Teacher attrition refers to the fraction of teachers in the school who left the school during the year, and teacher turnover refers to the fraction of new teachers in the school at the end of the year (both are calculated relative to the list of teachers in the school at the start of the year). The p-values for the baseline test scores and attrition are computed by treating each student/teacher as an observation and clustering the standard errors at the school level (grade 1 did not have a baseline test). The other p-values are computed treating each school as an observation.

tracking surveys to each of the schools during the school year to collect data on process variables including student attendance, teacher attendance and activity, and classroom observation of teaching processes.10 All schools operated under identical conditions of information and monitoring and differed only in the treatment they received. This ensures that Hawthorne effects are minimized and that a comparison between treatment and control schools can accurately isolate the treatment effect.11

10 Six visits were made per school in the first year (2005–6) and four were made in the second year (2006–7).
11 An independent question of interest is that of the impact on teacher behavior and learning outcomes of the diagnostic feedback reports and low-stakes monitoring that were provided to all schools (including the control schools). We study this by comparing the "control" schools in this paper with another "pure control" group that did not receive any of the baseline test, feedback reports, or regular low-stakes monitoring and find that there was no impact of low-stakes measurement and monitoring on test scores (see Muralidharan and Sundararaman 2010b).

End-of-year assessments were conducted in March and April 2006 in all project schools. The results were provided to the schools in the beginning of the next school year (July–August 2006), and all schools were informed that the program would continue for another year.12 Bonus checks based on first-year performance were sent to qualifying teachers by the end of September 2006, following which the same processes were repeated for a second year.

12 The communication to teachers with respect to the length of the program was that the program would continue as long as the government continued to support the project. The expectation conveyed to teachers during the first year was that the program was likely to continue but was not guaranteed to do so.

D. Description of Incentive Treatments

Teachers in incentive schools were offered bonus payments on the basis of the average improvement in test scores (in math and language) of students taught by them subject to a minimum improvement of 5 percent. The bonus formula was

Bonus = Rs. 500 × (% gain in average test scores − 5%)   if gain > 5%
      = Rs. 0                                            otherwise.

All teachers in group incentive schools received the same bonus based on average school-level improvement in test scores, whereas the bonus for teachers in individual incentive schools was based on the average test score improvement of students taught by the specific teacher.13
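As a concrete reading of the bonus rule just stated, here is a minimal sketch of the year 1 formula. The function and variable names are illustrative rather than the program's actual implementation, and the handling of students who miss the end-of-year test (described later in the paper) is not modeled here.

```python
# Year 1 bonus: Rs. 500 per percentage point of average gain above a 5% threshold.
def year1_bonus(mean_gain_pct: float, slope_rs: float = 500.0,
                threshold_pct: float = 5.0) -> float:
    """Bonus in rupees given the relevant average % gain (the teacher's own
    students under the individual scheme, the school under the group scheme)."""
    if mean_gain_pct > threshold_pct:
        return slope_rs * (mean_gain_pct - threshold_pct)
    return 0.0

# Example: a 12-point average gain earns Rs. 500 x (12 - 5) = Rs. 3,500;
# a 4-point gain earns nothing.
print(year1_bonus(12.0))  # 3500.0
print(year1_bonus(4.0))   # 0.0
```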
We use a (piecewise) linear formula for the bonus contract, for both ease of communication and implementation and also because it is the most resistant to gaming across periods (the end-of-year score in the first year determined the target score for the subsequent year).14 The “slope” of Rs. 500 per percentage point gain in average scores was set so that the expected incentive payment per school would be approximately equal to the additional spending in the input treatments (based on calibrations from the project pilot).15 The threshold of 5 percent average improvement was introduced to account for the fact that the baseline tests were in June/July and the end-of year-tests would be in March/April, and so the baseline score might be artificially low because students forget material over the summer vacation. There was no minimum threshold in the second year of the program because the first year’s end-of-year score was used as the second year’s baseline and the testing was conducted at the same time of the school year on a 12- month cycle.16 The bonus formula was designed to minimize potentially undesirable “threshold” effects, where teachers focus only on students near a per- formance target, by making the bonus payment a function of the average 13 First-grade students were not tested in the baseline, and so their “target” score for a bonus (above which the linear schedule above would apply) was set to be the mean baseline score of the second-grade students in the school. The target for the second-grade students was equal to their baseline score plus the 5 percent threshold described above. Schools selected for the incentive programs were given detailed letters and verbal communications explaining the incentive formula. Sample communication letters are available from the authors on request. 14 Holmstrom and Milgrom (1987) show the theoretical optimality of linear contracts in a dynamic setting (under assumptions of exponential utility for the agent and normally distributed noise). Oyer (1998) provides empirical evidence of gaming in response to nonlinear incentive schemes. 15 The best way to set expected incentive payments to be exactly equal to Rs. 10,000 per school would have been to run a tournament with predetermined prize amounts. Our main reason for using a contract as opposed to a tournament was that contracts were more transparent to the schools in our experiment since the universe of eligible schools was spread out across the state. Individual contracts (without relative performance mea- surement) also dominate tournaments for risk-averse agents when specific shocks (at the school or class level) are more salient for the outcome measure than aggregate shocks (across all schools), which is probably the case here (see Kane and Staiger 2002). See Lazear and Rosen (1981) and Green and Stokey (1983) for a discussion of tournaments and when they dominate contracts. 16 The convexity in reward schedule in the first year due to the threshold could have induced some gaming, but the distribution of mean class- and school-level gains at the end of the first year of the program did not have a gap below the threshold of 5 percent. If there is no penalty for a reduction in scores, there is convexity in the payment schedule even if there is no threshold (at a gain of zero). 
To reduce the incentives for gaming in subsequent years, we use the higher of the baseline and year-end scores as the target for the next year, and so a school/class whose performance deteriorates does not have its target reduced for the next year. This content downloaded from 190.81.46.114 on Wed, 5 Jun 2013 18:28:32 PM All use subject to JSTOR Terms and Conditions 52 journal of political economy 17 improvement of all students. If the function transforming teacher ef- fort into test score gains is concave (convex) in the baseline score, teachers would have an incentive to focus on weaker (stronger) students, but no student is likely to be wholly neglected since each contributes to the class average. In order to discourage teachers from excluding students with weak gains from taking the end-of-year test, we assigned a zero improvement score to any child who took the baseline test but not the end-of-year test.18 To make cheating as difficult as possible, the tests were conducted by external teams of five evaluators in each school (one for each grade), the identities of the students taking the test were verified, and the grading was done at a supervised central location at the end of each day’s testing. IV. Test Design A. Test Construction and Normalization We engaged India’s leading education testing firm, Educational Initia- tives, to design the tests to our specifications. The baseline test (June– July 2005) tested math and language (Telugu) and covered competen- cies up to that of the previous school year. At the end of the school year (March–April 2006), schools had two rounds of tests in each subject with a gap of 2 weeks between the rounds. The first test (referred to as the “lower-end line” or LEL) covered competencies up to that of the previous school year, whereas the second test (referred to as the “higher- end line” or HEL) covered materials from the current school year’s syllabus. The same procedure was repeated at the end of the second year. Doing two rounds of testing at the end of each year allows for the inclusion of more materials across years of testing, reduces the impact of measurement errors specific to the day of the test, and also reduces sample attrition due to student absence on the day of the test. 17 Many of the negative consequences of incentives discussed in Jacob (2005) are a response to the threshold effects created by the targets in the program he studied. Neal and Schanzenbach (2010) discuss the impact of threshold effects in the No Child Left Behind Act on teacher behavior and show that teachers do in fact focus more on students on the “bubble” and relatively neglect students far above or below the thresholds. We anticipated this concern and designed the incentive schedule accordingly. 18 In the second year (when there was no threshold), students who took the test at the end of year 1 but not at the end of year 2 were assigned a score of Ϫ5. Thus, the cost of a dropping-out student to the teacher was always equal to a Ϫ5 percent score for the student concerned. A higher penalty would have been difficult since most cases of attrition are out of the teacher’s control. The penalty of 5 percent was judged to be adequate to avoid explicit gaming of the test-taking population. We also cap negative gains at the student level at Ϫ5 percent for the calculation of teacher bonuses. 
Thus, putting a floor on the extent to which a poor-performing student brought down the class/school average at Ϫ5 percent ensured that a teacher/school could never do worse than having a student drop out to eliminate any incentive to get weak students to not appear for the test. This content downloaded from 190.81.46.114 on Wed, 5 Jun 2013 18:28:32 PM All use subject to JSTOR Terms and Conditions teacher performance pay 53 For the rest of this paper, year 0 refers to the baseline tests in June– July 2005, year 1 refers to both rounds of tests conducted at the end of the first year of the program in March–April 2006, and year 2 refers to both rounds of tests conducted at the end of the second year of the program in March–April 2007. Scores in year 0 are normalized relative to the distribution of scores across all schools for the same test (pre- treatment), and scores in years 1 and 2 are normalized with respect to the score distribution in the control schools for the same test.19 B. Use of Repeat and Multiple-Choice Questions At the student level, there were no identically repeated questions be- tween year 0 and year 1. Between year 2 and year 1, 6 percent of ques- tions were repeated in math (12 out of 205) and 1.5 percent in language (three out of 201). At the school level, 13 percent and 18 percent of questions were repeated in years 1 and 2 in math and 14 percent and 10 percent in years 1 and 2 in language.20 The fraction of multiple- choice questions on any given test ranged from 22 percent to 28 percent in math and 32 percent to 43 percent in language. C. Basic versus Higher-Order Skills To distinguish between rote and conceptual learning, we asked the test design firm to design the tests to include both “mechanical” and “con- ceptual” questions within each skill category on the test. Specifically, a mechanical question was considered to be one that conformed to the format of the standard exercises in the textbook, whereas a conceptual one was defined as a question that tested the same underlying knowledge or skill in an unfamiliar way.21 19 Student test scores on each round (LEL and HEL), which are conducted 2 weeks apart, are first normalized relative to the score distribution in the control schools on that test and then averaged across the two rounds to create the normalized test score for each student at each point in time. So a student can be absent on one testing day and still be included in the analysis without bias because the included score would have been nor- malized relative to the distribution of all control school students on the same test that the student took. 20 A student-level repeated question is one that the same student would have seen in a previous round of testing. A school-level repeated question is one that any student in any grade could have seen in a previous test (this is therefore a better representation of the set of questions that the teacher may have been able to coach the students on using previous exams for test practice). 21 See the working paper version of this paper (Muralidharan and Sundararaman 2009) for more details and examples. The percentage split between mechanical and conceptual questions on the tests was roughly 70–30. Koretz (2002) points out that test score gains are meaningful only if they generalize from the specific test to other indicators of mastery of the domain in question. 
While there is no easy solution to this problem given the impracticality of assessing every domain beyond the test, our inclusion of both mechanical and conceptual questions in each test attempts to address this concern. This content downloaded from 190.81.46.114 on Wed, 5 Jun 2013 18:28:32 PM All use subject to JSTOR Terms and Conditions 54 journal of political economy D. Incentive versus Nonincentive Subjects Another dimension on which incentives can induce distortions is on the margin between incentive and nonincentive subjects. We study the extent to which this is a problem by conducting additional tests at the end of each year in science and social studies on which there was no incentive.22 Since these subjects are introduced only in grade 3 in the school curriculum, these additional tests were administered in grades 3–5. V. Results A. Teacher Turnover and Student Attrition Regular civil service teachers in AP are transferred once every 3 years on average. While this could potentially bias our results if more teachers chose to stay in or tried to transfer into the incentive schools, it is unlikely that this was the case since the treatments were announced in August 2005 whereas the transfer process typically starts earlier in the year. There was no statistically significant difference between any of the treat- ment groups in the extent of teacher turnover or attrition, and the transfer rate was close to 33 percent, which is consistent with the rotation of teachers once every 3 years (table 2, panel B, rows 11 and 12). As part of the agreement between the government of AP and the Azim Premji Foundation, the government agreed to minimize transfers into and out of the sample schools for the duration of the study. The average teacher turnover in the second year was only 5 percent, and once again, there was no significant difference in the 2-year teacher attrition and turnover rates across the various treatments (table 2, panel B, rows 13 and 14). The average student attrition rate in the sample (defined as the frac- tion of students in the baseline tests who did not take a test at the end of each year) was 7.1 percent and 20.6 percent in year 1 and year 2, respectively, but there is no significant difference in attrition across the treatments (rows 17 and 20). Beyond confirming sample balance, this is an important result in its own right because one of the concerns of teacher incentives based on test scores is that weaker children might be induced to drop out of testing in incentive schools (Jacob 2005). Attrition is higher among students with lower baseline scores, but this is true across all treatments, and we find no significant difference in 22 In the first year of the project, schools were not told about these additional subject tests till a week prior to the tests and were told that these tests were only for research purposes. In the second year, the schools knew that these additional tests would be con- ducted but also knew from the first year that these tests would not be included in the bonus calculations. This content downloaded from 190.81.46.114 on Wed, 5 Jun 2013 18:28:32 PM All use subject to JSTOR Terms and Conditions teacher performance pay 55 mean baseline test scores across treatment categories among the stu- dents who drop out from the test-taking sample (table 2, panel B, rows 16, 17, 19, and 20).23 B. 
Specification

We first discuss the impact of the incentive program as a whole by pooling the group and individual incentive schools and considering this to be the "incentive" treatment. All estimation and inference are done with the sample of 300 control and incentive schools unless stated otherwise. Our default specification uses the form

Tijkm(Yn) = α + γ·Tijkm(Y0) + δ·Incentives + β·Zm + εk + εjk + εijk.   (1)

The main dependent variable of interest is Tijkm, which is the normalized test score on the specific subject, where i, j, k, and m denote the student, grade, school, and mandal, respectively. The term Y0 indicates the baseline tests, and Yn indicates a test at the end of n years of the program. Including the normalized baseline test score improves efficiency as a result of the autocorrelation between test scores across multiple periods.24 All regressions include a set of mandal-level dummies (Zm), and the standard errors are clustered at the school level. We also run the regressions with and without controls for household and school variables. The Incentives variable is a dummy at the school level indicating treatment status, and the parameter of interest is δ, which is the effect on test scores of being in an incentive school. The random assignment of the incentive program ensures that this is an unbiased and consistent estimate of the 1-year and 2-year treatment effects.

C. Impact of Incentives on Test Scores

As an average across both math and language, students in incentive schools scored 0.15 SD higher than those in comparison schools at the end of the first year of the program and 0.22 SD higher at the end of the second year (table 3, panel A, cols. 1 and 3). The impact of the incentives at the end of 2 years is 0.27 SD in math and 0.17 SD in language (panels B and C of table 3). The addition of school and

23 We estimate a model of student attrition using baseline scores and observable characteristics and cannot reject that the same model predicts attrition in both treatment and control schools. We also estimate treatment effects by reweighting the sample by the inverse of the probability of continuing in the sample, and the results are unchanged.
24 Since grade 1 students did not have a baseline test, we set the normalized baseline score to zero for these students (similarly for students in grade 2 at the end of 2 years of the treatment).

TABLE 3
Impact of Incentives on Student Test Scores
Dependent Variable: Normalized End-of-Year Test Score
                                  Year 1 on Year 0       Year 2 on Year 0
                                  (1)        (2)         (3)        (4)
A. Combined (Math and Language)
Normalized lagged test score      .503***    .498***     .452***    .446***
                                  (.013)     (.013)      (.015)     (.015)
Incentive school                  .149***    .165***     .219***    .224***
                                  (.042)     (.042)      (.047)     (.048)
School and household controls     No         Yes         No         Yes
Observations                      42,145     37,617      29,760     24,665
R2                                .31        .34         .24        .28
B. Math
Normalized lagged test score      .492***    .491***     .414***    .408***
                                  (.016)     (.016)      (.022)     (.022)
Incentive school                  .180***    .196***     .273***    .280***
                                  (.049)     (.049)      (.055)     (.056)
School and household controls     No         Yes         No         Yes
Observations                      20,946     18,700      14,797     12,255
R2                                .30        .33         .25        .28
C.
Telugu (Language) Normalized lagged test score .52*** .510*** .49*** .481*** (.014) (.014) (.014) (.014) Incentive school .118*** .134*** .166*** .168*** (.040) (.039) (.045) (.044) School and household con- trols No Yes No Yes Observations 21,199 18,917 14,963 12,410 R2 .33 .36 .26 .30 Note.—All regressions include mandal (subdistrict) fixed effects and standard errors clustered at the school level. School controls include an infrastructure and proximity index (as defined in table 2). Household controls include student caste, parental education, and affluence (as defined in panel A of table 6). * Significant at 10 percent. ** Significant at 5 percent. *** Significant at 1 percent. household controls does not significantly change the estimated value of d in any of the regressions, confirming the validity of the randomi- zation (cols. 2 and 4). We verify that teacher transfers do not affect the results by estimating equation (1) across different durations of teacher presence in the school, and there is no significant difference across these estimates. The testing process was externally proctored at all stages; we had no reason This content downloaded from 190.81.46.114 on Wed, 5 Jun 2013 18:28:32 PM All use subject to JSTOR Terms and Conditions teacher performance pay 57 TABLE 4 Impact of Incentives by Repeat and Nonrepeat Questions Dependent Variable: Percentage Score Combined Math Telugu Year 1 Year 2 Year 1 Year 2 Year 1 Year 2 Percentage score on non- .335*** .328*** .256*** .257*** .414*** .397*** repeat questions (.007) (.007) (.007) (.008) (.008) (.007) Percentage score on re- .352*** .42*** .252*** .386*** .452*** .468*** peat questions (.006) (.005) (.007) (.006) (.007) (.007) Incremental score in in- .030*** .039*** .033*** .046*** .027*** .033*** centive schools for non- (.009) (.009) (.009) (.010) (.010) (.010) repeats Incremental score in in- .043*** .043*** .042*** .044*** .043*** .041*** centive schools for re- (.011) (.011) (.013) (.012) (.011) (.013) peats Test for equality of treat- ment effect for repeat and nonrepeat questions (F -statistic, p -value) .141 .584 .374 .766 .076 .354 Observations 62,872 54,972 31,225 29,594 31,647 25,378 R2 .24 .18 .26 .23 .29 .18 Note.—Repeat questions are questions that at the time of administering the particular test had appeared identically on any earlier test (across grades). * Significant at 10 percent. ** Significant at 5 percent. *** Significant at 1 percent. to believe that cheating was a problem in the first year, but there were two cases of cheating in the second year. The concerned schools/teach- ers were declared ineligible for bonuses, and both these cases were dropped from the analysis presented here. D. Robustness of Treatment Effects An important concern with interpreting these results is whether they represent real gains in learning or merely reflect drilling on past exams and better test-taking skills. We use question-level data to examine this issue further. We first break down the treatment effect by repeat and nonrepeat questions. A question is classified as a repeat if it had ap- peared in any previous test in the project (for any grade and at any time).25 Table 4 shows the percentage score obtained by students in control and incentive schools by repeat and nonrepeat questions. We see that students in incentive schools score significantly higher on both repeat and nonrepeat questions (rows 3 and 4). 
The incremental score 25 This includes questions that appear in an LEL test for grade n and then appear 2 weeks later in the HEL test for grade n Ϫ 1. The idea is to classify any question that a teacher could have seen before and drilled the students on as a repeat question. This content downloaded from 190.81.46.114 on Wed, 5 Jun 2013 18:28:32 PM All use subject to JSTOR Terms and Conditions 58 journal of political economy TABLE 5 Impact of Incentives by Multiple Choice and Non-Multiple-Choice Questions Dependent Variable: Percentage Score Combined Math Telugu Year 1 Year 2 Year 1 Year 2 Year 1 Year 2 Percentage score on non- .311*** .311*** .258*** .278*** .364*** .344*** multiple-choice ques- (.007) (.007) (.007) (.008) (.008) (.008) tions Percentage score on multi- .379*** .391*** .227*** .284*** .529*** .497*** ple-choice questions (.004) (.004) (.005) (.004) (.005) (.005) Incremental score on non- .028*** .037*** .032*** .047*** .023** .027** multiple-choice ques- (.009) (.010) (.010) (.010) (.010) (.011) tions in incentive schools Incremental score on mul- .034*** .042*** .034*** .041*** .034*** .042*** tiple-choice questions in (.009) (.009) (.009) (.009) (.011) (.009) incentive schools Test for equality of treat- ment effect for multiple- choice questions and non-multiple-choice questions (F -statistic p - value) .168 .282 .671 .341 .119 .025 Observations 84,290 59,520 41,892 29,594 42,398 29,926 R2 .197 .187 .213 .178 .302 .289 * Significant at 10 percent. ** Significant at 5 percent. *** Significant at 1 percent. on repeat questions is higher in the incentive schools, but this is not significantly higher than the extent to which they score higher on non- repeat questions, suggesting that the main treatment effects are not being driven by improved student performance on repeated questions. We calculate the treatment effects estimated in table 3 using only the nonrepeat questions and find that the estimate is essentially unchanged. We also break down the questions into multiple-choice and non- multiple-choice questions, where performance on the former is more likely to be amenable to being improved by better test-taking skills. Table 5 presents a breakdown similar to that of table 4, and we see that in- centive schools do significantly better on both multiple-choice and free- response questions, with no significant difference in performance across the two types of questions (in five of the six comparisons). Finally, we also separately analyze student performance on both “me- chanical” and “conceptual” parts of the test (as described in Sec. IV.C) and find that incentive schools do significantly better on both the me- chanical and conceptual components of the test, with no significant difference in improvement between the two types of questions (tables available on request). This content downloaded from 190.81.46.114 on Wed, 5 Jun 2013 18:28:32 PM All use subject to JSTOR Terms and Conditions teacher performance pay 59 Fig. 2.—Quantile treatment effects of the performance pay program on student test scores. E. Distribution of Treatment Effects Figure 2 plots the quantile treatment effects of the performance pay program on student test scores (defined for each quantile t as d(t) p Gn Ϫ1 (t) Ϫ FmϪ1(t), where Gn and Fm represent the empirical distributions of the treatment and control distributions with n and m observations, respectively), with bootstrapped 95 percent confidence intervals, and shows that the quantile treatment effects are positive at every percentile and increasing. 
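A minimal sketch of the quantile treatment effect calculation behind figure 2 is given below: the gap between empirical quantiles of the incentive and control score distributions, with a bootstrap for the confidence band. The arrays are simulated placeholders, and the simple resampling shown here ignores the within-school clustering of scores.

```python
import numpy as np

def qte(treated, control, taus):
    # delta(tau) = G_n^{-1}(tau) - F_m^{-1}(tau): empirical quantile gap.
    return np.quantile(treated, taus) - np.quantile(control, taus)

def qte_bootstrap_ci(treated, control, taus, reps=1000, seed=0):
    rng = np.random.default_rng(seed)
    draws = np.empty((reps, len(taus)))
    for r in range(reps):
        t = rng.choice(treated, size=len(treated), replace=True)
        c = rng.choice(control, size=len(control), replace=True)
        draws[r] = qte(t, c, taus)
    return np.percentile(draws, [2.5, 97.5], axis=0)  # pointwise 95% band

taus = np.linspace(0.05, 0.95, 19)
treated = np.random.default_rng(1).normal(0.22, 1.05, 5000)  # illustrative scores
control = np.random.default_rng(2).normal(0.00, 1.00, 5000)  # illustrative scores
print(np.round(qte(treated, control, taus), 3))
print(np.round(qte_bootstrap_ci(treated, control, taus), 3))
```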
Note that this figure does not plot the treatment effect at different quantiles (since student rank order is not preserved between the baseline and end line tests even within the same treatment group). It simply plots the gap at each percentile of the treatment and control distributions after 2 years of the program and shows that test scores in incentive schools are higher at every percentile of the end line distribution and that the program also increased the variance of test scores.

We next test for heterogeneity of the incentive treatment effect across baseline student, school, and teacher characteristics by testing whether δ3 is significantly different from zero:

Tijkm(Yn) = α + γ·Tijkm(Y0) + δ1·Incentives + δ2·Characteristic + δ3·(Incentives × Characteristic) + β·Zm + εk + εjk + εijk.   (2)

Table 6 (panel A) shows the results of these regressions on several school and household characteristics (each column in table 6 represents one regression testing for heterogeneous treatment effects along the characteristic mentioned). We find very limited evidence of differential treatment effects by school characteristics such as total number of students, school infrastructure, or school proximity to facilities. We also find no evidence of a significant difference in the effect of the incentives by most of the student demographic variables, including an index of household literacy, the caste of the household, the student's gender, and the student's baseline score. The only evidence of heterogeneous treatment effects is across levels of family affluence, with students from more affluent families showing a better response to the teacher incentive program.

The lack of heterogeneous treatment effects by baseline score is an important indicator of broad-based gains since the baseline score is probably the best summary statistic of prior inputs into education. To see this more clearly, figure 3 plots nonparametric treatment effects by percentile of baseline score,26 and we see that there is a minimum treatment effect of 0.1 SD for students regardless of where they were in the initial test score distribution.

The lack of heterogeneous treatment effects by initial scores suggests that the increase in the variance of test scores in incentive schools (fig. 2) may be reflecting the variance in teacher responsiveness to the incentive program as opposed to the variance in student responsiveness to the treatment by initial learning levels. We test this by estimating teacher value addition (measured as teacher fixed effects in a regression of current test scores on lagged scores) and plotting the difference in teacher fixed effects at each percentile of the control and treatment distributions. We find that both the mean and variance of teacher value addition are significantly higher in the incentive schools (fig. 4).

Having established that there is variation in teacher responsiveness to the incentive program, we test for differential responsiveness by observable teacher characteristics (table 6, panel B). We find that the interaction of teachers' education and training with incentives is positive and significant, whereas education and training by themselves are not significant predictors of value addition (cols. 1 and 2).
This suggests that teacher qualifications by themselves are not associated with better 26 The figure plots a kernel-weighted local polynomial regression of end line scores (after 2 years) on the percentile of baseline score separately for the incentive and control schools and also plots the difference at each percentile of baseline scores. The confidence intervals of the treatment effects are constructed by drawing 1,000 bootstrap samples of data that preserve the within-school correlation structure in the original data and plotting the 95 percent range for the treatment effect at each percentile of baseline scores. This content downloaded from 190.81.46.114 on Wed, 5 Jun 2013 18:28:32 PM All use subject to JSTOR Terms and Conditions teacher performance pay 61 learning outcomes under the status quo but that they could matter more if teachers had incentives to exert more effort (see Hanushek 2006). We also find that teachers with higher base pay as well as teachers with more experience respond less well to the incentives (cols. 3 and 4). This suggests that the magnitude of the incentive mattered because the potential bonus (which was similar for all teachers) would have been a larger share of base pay for lower-paid teachers. However, teachers with higher base pay are also more experienced, and so we cannot distinguish the impact of the incentive amount from that of other teacher characteristics that influence base pay.27 F. Impact on Nonincentive Subjects The impact of incentives on the performance in nonincentive subjects such as science and social studies is tested using a slightly modified version of specification (1) in which lagged scores on both math and language are included to control for initial learning levels. We find that students in incentive schools also performed significantly better on non- incentive subjects at the end of each year of the program, scoring 0.11 and 0.18 SD higher than students in control schools in science and social studies at the end of 2 years of the program (table 7, panel A). These results suggest that, in the context of primary education in a developing country with very low levels of learning, teacher efforts aimed at increasing test scores in math and language may also contribute to superior performance on nonincentive subjects, suggesting comple- mentarities among the measures and positive spillover effects between them. We probe the possibility of spillovers further as follows: for each student we generate a predicted math and language score at each point in time as well as the residual test score formed by taking the difference between the actual score and the predicted score (this residual is therefore an estimate of the “innovation” in learning that took place over the school year—and in light of table 3 would be larger for students in incentive schools). We then run regressions of science and social studies scores on the predicted math and language scores, the residuals as defined above, a dummy for treatment status, and interactions of the residuals and treat- ment status and present the results in table 7 (panel B). There are three noteworthy results here: (a) the coefficients on the residuals are highly significant, with the coefficient on the language residual typically being larger than that on the math residual; (b) the 27 Of course, this is a caution that applies to any interpretation of interactions in an experiment since the covariate is not randomly assigned and could be correlated with other omitted variables. 
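The teacher value-added measure used in figure 4 and in the discussion above is a regression-based construct (teacher fixed effects in a regression of current on lagged test scores). The following is a minimal sketch of how such fixed effects could be estimated; the data frame, column names, and centering are hypothetical choices, not the paper's actual code, which uses its own specification and controls.

```python
import pandas as pd
import statsmodels.formula.api as smf

def teacher_value_added(df: pd.DataFrame) -> pd.Series:
    """df: one row per student with columns score, lagged_score, teacher_id."""
    fit = smf.ols("score ~ lagged_score + C(teacher_id) - 1", data=df).fit()
    fe = fit.params[fit.params.index.str.startswith("C(teacher_id)")]
    return fe - fe.mean()  # value added relative to the average teacher

# Usage idea: estimate separately for incentive and control schools and compare
# the two distributions (fig. 4 plots their percentile-by-percentile gap).
```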
TABLE 6
Heterogeneous Treatment Effects

A. Household and School Characteristics
Columns: (1) Log Enrollment; (2) School Proximity (8–24); (3) School Infrastructure (0–6); (4) Household Affluence (0–7); (5) Parental Literacy (0–4); (6) Scheduled Caste/Tribe; (7) Male; (8) Normalized Baseline Score

Two-Year Effect
  Incentive      −.198 (.354)    −.019 (.199)    .28** (.130)    .09 (.073)      .224*** (.054)   .226*** (.049)   .233*** (.049)   .219*** (.047)
  Covariate      −.065 (.058)    −.005 (.010)    .025 (.038)     .017 (.014)     .068*** (.015)   −.066 (.042)     .029 (.027)      .448*** (.024)
  Interaction    .083 (.074)     .018 (.014)     −.02 (.040)     .038** (.019)   −.003 (.019)     −.013 (.056)     −.02 (.034)      .006 (.031)
  Observations   29,760          29,760          29,760          25,231          25,226           29,760           25,881           29,760
  R2             .244            .244            .243            .272            .273             .244             .266             .243

One-Year Effect
  Incentive      −.36 (.381)     −.076 (.161)    .032 (.110)     .004 (.060)     .166*** (.047)   .164*** (.045)   .157*** (.044)   .149*** (.042)
  Covariate      −.128** (.061)  −.016* (.008)   −.001 (.025)    .017 (.013)     .08*** (.012)    .007 (.035)      .016 (.020)      .502*** (.021)
  Interaction    .103 (.081)     .017 (.011)     .041 (.031)     .042** (.017)   −.013 (.016)     −.06 (.048)      .002 (.025)      .000 (.026)
  Observations   42,145          41,131          41,131          38,545          38,525           42,145           39,540           42,145
  R2             .31             .32             .32             .34             .34              .31              .33              .31

B. Teacher Characteristics (Pooled Regression Using Both Years of Data)
Columns: (1) Education; (2) Training; (3) Years of Experience; (4) Salary (Log); (5) Male; (6) Teacher Absence; (7) Active Teaching; (8) Active or Passive Teaching

  Incentive      −.113 (.163)    −.224 (.176)    .258*** (.059)  1.775** (.828)  .031 (.091)      .15*** (.050)    .084 (.054)      .118 (.074)
  Covariate      .003 (.032)     −.051 (.041)    −.001 (.003)    −.034 (.066)    −.084 (.057)     −.149 (.137)     .055 (.078)      .131 (.093)
  Interaction    .086* (.050)    .138** (.061)   −.009** (.004)  −.179* (.091)   .09 (.069)       .013 (.171)      .164* (.098)     .064 (.111)
  Observations   53,737          53,890          54,142          53,122          54,142           53,609           53,383           53,383
  R2             .29             .29             .29             .29             .29              .29              .29              .29

Note.—For panel A: The infrastructure and proximity indices are defined as in table 2. The household affluence index sums seven binary variables including ownership of land, ownership of current residence, residing in a pucca house (house with four walls and a cement and concrete roof), and having each of electricity, water, toilet, and a television at home. Parental education ranges from 0 to 4; a point is added for each of the following: father’s literacy, mother’s literacy, father having completed tenth grade, and mother having completed tenth grade. Scheduled castes and tribes are considered the most socioeconomically backward groups in India. For panel B: Teacher education is coded from 1 to 4 indicating tenth grade, twelfth grade, college degree, and master’s or higher degree. Teacher training is coded from 1 to 4 indicating no training, a diploma, a bachelor’s degree in education, and a master’s degree in education. Teacher absence and active teaching are determined from direct observations four to six times a year. All regressions include mandal (subdistrict) fixed effects, lagged normalized test scores, and standard errors clustered at the school level.
* Significant at 10 percent. ** Significant at 5 percent. *** Significant at 1 percent.
Fig. 3.—Nonparametric treatment effects by percentile of baseline score

Fig. 4.—Teacher fixed effects

TABLE 7
Impact of Incentives on Nonincentive Subjects
Dependent Variable: Normalized End Line Score
Columns: Year 1 (Science, Social Studies); Year 2 (Science, Social Studies)

A. Reduced-Form Impact
  Normalized baseline math score        .215*** (.019)   .224*** (.018)   .156*** (.023)   .167*** (.024)
  Normalized baseline language score    .209*** (.019)   .289*** (.019)   .212*** (.023)   .189*** (.024)
  Incentive school                      .112** (.052)    .141*** (.048)   .113** (.044)    .18*** (.050)
  Observations                          11,786           11,786           9,143            9,143
  R2                                    .26              .31              .19              .18

B. Mechanism of Impact
  Normalized math predicted score                          .382*** (.032)   .340*** (.027)   .274*** (.041)   .330*** (.044)
  Normalized Telugu predicted score                        .298*** (.028)   .487*** (.026)   .429*** (.036)   .360*** (.036)
  Normalized math residual score                           .319*** (.025)   .276*** (.024)   .232*** (.032)   .247*** (.035)
  Normalized Telugu residual score                         .343*** (.024)   .425*** (.025)   .399*** (.032)   .341*** (.036)
  Incentive school                                         −.01 (.031)      .011 (.027)      −.054* (.030)    .009 (.033)
  Incentive school × normalized math residual score        .048 (.035)      .045 (.031)      −.007 (.038)     .014 (.042)
  Incentive school × normalized Telugu residual score      −.006 (.029)     .024 (.031)      .058 (.039)      .099** (.043)
  Test for equality of math and Telugu residuals           .548             .001             .002             .128
  Observations                                             11,228           11,228           8,949            8,949
  R2                                                       .48              .54              .41              .39

Note.—Social studies and science tests were administered only to grades 3–5. Predicted and residual scores in panel B are generated from a regression of the normalized test score (by subject and year) on baseline test score and other school and household characteristics in the control schools. All regressions include mandal (subdistrict) fixed effects and standard errors clustered at the school level.
* Significant at 10 percent. ** Significant at 5 percent. *** Significant at 1 percent.

TABLE 8
Group versus Individual Incentives
Dependent Variable: Normalized End-of-Year Test Score
Columns: Year 1 on Year 0: (1) Combined, (2) Math, (3) Telugu; Year 2 on Year 0: (4) Combined, (5) Math, (6) Telugu

  Individual incentive school    .156*** (.050)   .184*** (.059)   .130*** (.045)   .283*** (.058)   .329*** (.067)   .239*** (.054)
  Group incentive school         .141*** (.050)   .175*** (.057)   .107** (.047)    .154*** (.057)   .216*** (.068)   .092* (.052)
  F-statistic p-value (testing group incentive school = individual incentive school)
                                 .765             .889             .610             .057             .160             .016
  Observations                   42,145           20,946           21,199           29,760           14,797           14,963
  R2                             .31              .299             .332             .25              .25              .26

Note.—All regressions include mandal (subdistrict) fixed effects and standard errors clustered at the school level.
* Significant at 10 percent. ** Significant at 5 percent. *** Significant at 1 percent.
In turn, these suggest that (a) improvements in language were more relevant for improved performance in other subjects, especially social studies; (b) the mechanism for the improved performance in science and social studies in the incentive schools was the improved performance in math and language, since the treatment dummy is close to zero after including the residuals; and (c) an innovation in math or language did not typically have a differential impact in incentive schools. Taken together, these results suggest that incentive schools did not do anything different with respect to nonincentive subjects but that positive spillovers from improvements in math and especially language led to improved scores in nonincentive subjects as well.

G. Group versus Individual Incentives

Both the group and the individual incentive programs had significantly positive treatment effects at the end of each year of the program (table 8, cols. 1 and 4). In the first year of the program, students in individual incentive schools performed slightly better than those in group incentive schools, but the difference was not significant. By the end of the second year, students in individual incentive schools scored 0.28 SD higher than those in comparison schools, whereas those in group incentive schools scored 0.15 SD higher, with this difference being significant at the 10 percent level (col. 4).

We find no significant impact of the number of teachers in the school on the relative performance of group and individual incentives (both linear and quadratic interactions of school size with the group incentive treatment are insignificant). However, the variation in school size is small, with 92 percent of group incentive schools having between two and five teachers. The limited range of school size makes it difficult to precisely estimate the impact of group size on the relative effectiveness of group incentives. We repeat all the analysis presented above (in Secs. V.C–V.F) treating group and individual incentive schools separately and find that the individual incentive schools always outperform the group incentive schools, though the difference in point estimates is not always significant (tables available on request).

VI. Teacher Behavior and Classroom Processes

We measure changes in teacher behavior in response to the incentive program with both direct observation and teacher interviews. As described in Section III.C, enumerators conducted several rounds of unannounced tracking surveys during the two school years across all schools in the project. To code classroom processes, an enumerator typically spent between 20 and 30 minutes at the back of a classroom (during each visit) without disturbing the class and coded whether specific actions took place during the period of observation. In addition to these observations, the enumerators also interviewed teachers about their teaching practices and methods, asking identical sets of questions in both incentive and control schools. These interviews were conducted in August 2006, around 4 months after the end-of-year tests but before any results were announced, and a similar set of interviews was conducted in August 2007 after the second full year of the program.

There was no difference in either student or teacher attendance between control and incentive schools.
We also find no significant difference between incentive and control schools on any of the various indicators of classroom processes as measured by direct observation.28 This is similar to the results of Glewwe et al. (2010), who find no difference in either teacher attendance or measures of teacher activity between treatment and control schools from similar surveys, and it raises the question of how the outcomes are significantly different when there do not appear to be any differences in observed processes between the schools.

28 These include measures of teacher activity, such as using the blackboard, reading from the textbook, asking students questions, encouraging classroom participation, assigning homework, and helping students individually, and measures of student activity, such as using textbooks and asking questions.

TABLE 9
Teacher Behavior (Observation and Interviews)
Columns: (1) Incentive Schools (%); (2) Control Schools (%); (3) p-Value of Difference; (4) Correlation with Student Test Score Gains

  Teacher absence (%)                                              .25    .23    .199      −.103
  Actively teaching at point of observation (%)                    .42    .43    .391      .135***
  Did you do any special preparation for the end-of-year tests?
  (% yes)                                                          .64    .32    .000***   .095**
  What kind of preparation did you do? (unprompted; % mentioning):
    Extra homework                                                 .42    .20    .000***   .061
    Extra classwork                                                .47    .23    .000***   .084**
    Extra classes/teaching beyond school hours                     .16    .05    .000***   .198***
    Gave practice tests                                            .30    .14    .000***   .105**
    Paid special attention to weaker children                      .20    .07    .000***   .010

Note.—All teacher response variables from the teacher interviews are binary; col. 4 reports the correlation between a teacher’s stated response and the test scores of students taught by that teacher (controlling for lagged test scores as in the default specifications throughout the paper).
* Significant at 10 percent. ** Significant at 5 percent. *** Significant at 1 percent.

The teacher interviews provide another way of testing for differences in behavior. Teachers in both incentive and control schools were asked unprompted questions about what they did differently during the school year, at the end of each school year but before they knew the results of their students. The interviews indicate that teachers in incentive schools are significantly more likely to have assigned more homework and class work, conducted extra classes beyond regular school hours, given practice tests, and paid special attention to weaker children (table 9). While self-reported measures of teacher activity might be considered less credible than observations, we find a positive (and mostly significant) correlation between the reported activities of teachers and the performance of their students (table 9, col. 4), suggesting that these self-reports were credible (especially since less than 50 percent of teachers in the incentive schools report doing any of the activities described in table 9).
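One way (among others) to operationalize the association reported in table 9, col. 4, is to relate a teacher's binary self-report to the end-of-year scores of that teacher's students while controlling for lagged scores. The sketch below is illustrative only and is not the calculation used in the study; the column names (endline_score, baseline_score, gave_practice_tests, school_id) are hypothetical.

```python
# Hedged sketch of a table 9, col. 4-style association (illustrative only):
# students of teachers who report an activity vs. their test score gains.
import pandas as pd
import statsmodels.formula.api as smf

def report_score_association(df: pd.DataFrame, report_col: str = "gave_practice_tests"):
    # `report_col` is a 0/1 teacher interview response merged onto student rows.
    model = smf.ols(f"endline_score ~ baseline_score + {report_col}", data=df)
    result = model.fit(cov_type="cluster", cov_kwds={"groups": df["school_id"]})
    # A positive, significant coefficient indicates that students of teachers
    # reporting the activity gained more, conditional on baseline scores.
    return result.params[report_col], result.pvalues[report_col]
```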
The interview responses suggest reasons for why salient dimensions of changes in teacher behavior might not have been captured in the classroom observations. An enumerator sitting in classrooms during the school day is unlikely to observe the extra classes conducted after school. Similarly, if the increase in practice tests occurred closer to the end of the school year (in March), this would not have been picked up by the tracking surveys conducted between September and February. Finally, while our survey instruments recorded whether various activities took place, they did not have a way to capture the intensity of teacher efforts, which may be an important channel of impact.

One way to see this is to notice that there is no difference between treatment and control schools in the fraction of teachers coded as “actively teaching” when observed by the enumerator (table 9, row 2), but the interaction of active teaching and being in an incentive school is significantly positively correlated with measures of teacher value addition (table 6, panel B, col. 7). This suggests that teachers changed the effectiveness of their teaching in response to the incentives in ways that would not be easily captured even by observing the teacher. In summary, it appears that the incentive program based on end-of-year test scores did not change the teachers’ cost-benefit calculations on the attendance margin during the school year but that it probably made them exert more effort when present.29

29 Duflo, Hanna, and Ryan (2010) provide experimental estimates of the impact of teacher attendance on student learning in the Indian state of Rajasthan and estimate the effect on student learning to be roughly 0.1 SD for every 10-percentage-point reduction in teacher absence. If we use this as a benchmark and assume that (a) the unit of 1 SD is comparable in their sample and ours and (b) the effects are linear over the relevant ranges of absence, then our treatment effect of 0.11 SD per year would require an increase in teacher attendance (at status quo levels of effort) of 11 percentage points. So we could interpret our results in terms of teacher attendance and argue that the increase in intensity of effort was equivalent to reducing teacher absence by over 40 percent, from 25 percent to 14 percent.

VII. Comparison with Input Treatments and Cost-Benefit Analysis

As mentioned earlier, a parallel component of this study provided two other sets of 100 randomly chosen schools with an extra contract teacher and with a cash block grant for school materials, respectively.30 These interventions were calibrated so that the expected spending on the input and the incentive programs was roughly equal.

30 See our companion paper (Muralidharan and Sundararaman 2010a) for more details on the contract teacher program and its impact on student learning. We discuss the block grant intervention in Das et al. (2011). These input programs represented two out of the three most common input-based interventions (infrastructure, teachers, and materials). We did not conduct a randomized evaluation of infrastructure both because of practical difficulties and because the returns would have to be evaluated over the depreciation life cycle of the infrastructure. Thus, the set of interventions studied here all represent “flow” expenditures that would be incurred annually and are therefore comparable to the flow spending on a teacher incentive program.
TABLE 10
Impact of Inputs versus Incentives on Learning Outcomes
Dependent Variable: Normalized End-of-Year Test Score
Columns: Year 1 on Year 0: (1) Combined, (2) Math, (3) Language; Year 2 on Year 0: (4) Combined, (5) Math, (6) Language

  Normalized lagged score                   .512*** (.010)   .494*** (.012)   .536*** (.011)   .458*** (.012)   .416*** (.016)   .499*** (.012)
  Incentives                                .15*** (.041)    .179*** (.048)   .121*** (.039)   .218*** (.049)   .272*** (.057)   .164*** (.046)
  Inputs                                    .102*** (.038)   .117*** (.042)   .086** (.037)    .085* (.046)     .089* (.052)     .08* (.044)
  F-statistic p-value (inputs = incentives) .178             .135             .298             .003             .000             .044
  Observations                              69,157           34,376           34,781           49,503           24,628           24,875
  R2                                        .30              .29              .32              .225             .226             .239

Note.—These regressions pool data from all 500 schools in the study: group and individual incentive treatments are pooled together as incentives, and the extra contract teacher and block grant treatments are pooled together as inputs. All regressions include mandal (subdistrict) fixed effects and standard errors clustered at the school level.
* Significant at 10 percent. ** Significant at 5 percent. *** Significant at 1 percent.

To compare the effects across treatment types, we pool the two incentive treatments, the two input treatments, and the control schools and run the regression

$$T_{ijkm}(Y_n) = \alpha + \gamma \cdot T_{ijkm}(Y_0) + \delta_1 \cdot \text{Incentives} + \delta_2 \cdot \text{Inputs} + \beta \cdot Z_m + \varepsilon_k + \varepsilon_{jk} + \varepsilon_{ijk} \quad (3)$$

using the full sample of 500 schools. Both categories of treatments had a positive and significant impact on learning outcomes, but at the end of 2 years, the incentive schools scored 0.13 SD higher than the input schools, and the difference is highly significant (table 10, col. 4). The incentive schools perform better than input schools in both math and language, and both these differences are significant at the end of 2 years.

The total amount spent on each intervention was calibrated to be roughly equal, but the group incentive program ended up spending a lower amount per school. The average annual spending on each of the input treatments was Rs. 10,000 per school, and the group and individual incentives programs cost roughly Rs. 6,000 per school and Rs. 10,000 per school, respectively.31 Both the incentive programs were more cost effective than the input programs. The individual incentive program spent the same amount per school as the input programs but produced gains in test scores that were three times larger than those in the input schools (0.28 SD vs. 0.09 SD). The group incentive program had a smaller treatment effect than the individual incentive program (0.15 SD vs. 0.27 SD) but was equally cost effective because smaller bonuses were paid.

31 The bonus payment in the group incentive schools was lower than that in the individual incentive schools both because the treatment effect was smaller and also because classes with scores below their target brought down the average school gain in the group incentive schools, whereas teachers with negative gains (relative to targets) did not hurt teachers with positive gains in the individual incentive schools. So, even conditional on the same distribution of scores, the individual incentive payout would be higher as long as there are some classes with negative gains relative to the target because of truncation of teacher-level bonuses at zero in the individual incentive calculations.
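One way to see the cost-effectiveness comparison implied by these figures is to compute test score gains per Rs. 1,000 of annual spending per school. The sketch below uses only the rounded numbers reported above and is purely illustrative, not the paper's exact calculation.

```python
# Back-of-the-envelope cost effectiveness using the rounded figures above
# (illustrative only; not the paper's exact calculation).
programs = {
    # name: (two-year treatment effect in SD, approx. annual cost per school in Rs.)
    "individual incentive": (0.28, 10_000),
    "group incentive": (0.15, 6_000),
    "inputs (contract teacher / block grant)": (0.09, 10_000),
}

for name, (effect_sd, annual_cost) in programs.items():
    sd_per_1000 = effect_sd / (annual_cost / 1000)
    print(f"{name}: {sd_per_1000:.3f} SD per Rs. 1,000 per school per year")

# The two incentive programs come out roughly equally cost effective, and both
# are about three times as cost effective as the input programs.
```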
A different way of thinking about the cost of the incentive program is to not consider the incentive payments as a cost at all, because they are simply a way of reallocating salary spending. For instance, if salaries were increased by 3 percent every year for inflation, then it might be possible to introduce a performance-based component with an expected payout of 3 percent of base pay in lieu of a standard increase across the board (using the formulation in Sec. II, an increase in b could be offset by a reduction in s without violating the participation constraint). Under this scenario, the incentive cost would be only the risk premium needed to keep expected utility constant compared to the guaranteed increase of 3 percent. This is a very small number, with an upper bound of 0.1 percent of base pay if teachers’ coefficient of absolute risk aversion (CARA) is 2 and 0.22 percent of base pay even if the CARA is as high as 5.32

32 The risk premium here is the value of ε such that 0.5[u(0.97w + ε) + u(1.03w + ε)] = u(w) and is easily estimated for various values of CARA using a Taylor expansion around w. This is a conservative upper bound since the incentive program is modeled as an even lottery between the extreme outcomes of a bonus of 0 percent and 6 percent. In practice, the support of the incentive distribution would be nonzero everywhere on [0, 6], and the risk premium would be considerably lower.
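As a rough numerical check, the condition in note 32 can be solved directly. The sketch below assumes CARA utility u(x) = −exp(−ax) and normalizes annual base pay to w = 1; both are assumptions made here for illustration rather than details spelled out in the text.

```python
# Rough numerical check of the risk premium bound in note 32 (illustrative).
# Assumes CARA utility u(x) = -exp(-a*x) and base pay normalized to w = 1.
# The condition 0.5*[u(0.97 + eps) + u(1.03 + eps)] = u(1) then solves to
# eps = ln(cosh(0.03*a)) / a.
import math

def risk_premium(a: float, spread: float = 0.03) -> float:
    """Risk premium (as a share of base pay) for an even lottery between a
    0% and a 6% bonus, relative to a sure 3% raise."""
    return math.log(math.cosh(spread * a)) / a

for a in (2, 5):
    print(f"CARA = {a}: risk premium ~ {100 * risk_premium(a):.2f}% of base pay")

# Prints roughly 0.09% and 0.22%, consistent with the bounds cited above.
```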
Finally, if performance pay programs are designed on the basis of multiple years of performance, differences in compensation across teachers would be less because of random variation and more because of heterogeneity in ability. This will not only reduce the risk of performance pay but could also attract higher-ability teachers into the profession and reduce the rents paid to less effective teachers (see Muralidharan and Sundararaman 2011).

A full discussion of cost effectiveness should include an estimate of the cost of administering the program. The main cost outside the incentive payments is that of independently administering and grading the tests. The approximate cost of each annual round of testing was Rs. 5,000 per school, which includes the cost of two rounds of independent testing and data entry but not the additional costs borne for research purposes. The incentive program would be more cost effective than the input programs even after adding these costs and even more so if we take the long-run view that the fiscal cost of performance pay can be lower than the amount of the bonus if implemented in lieu of a scheduled across-the-board increase in pay.

Finally, we attempt a more speculative back-of-the-envelope estimate of the absolute rate of return of the program by looking at the labor market returns to improved test scores. Recent cross-sectional estimates of the returns to cognitive achievement in India suggest returns of 16 percent for scoring 1 SD higher on a standardized math test and 20 percent for scoring 1 SD higher on a standardized language test (Aslam et al. 2011). Assuming that the test score gains in this program correspond to a similar long-term difference in human capital accumulation,33 the 2-year treatment effect would correspond to a 7.7 percent increase in wages (0.27 SD × 0.16 + 0.17 SD × 0.20). Depending on assumptions on the rate of wage growth and discount rates, we obtain estimates of an internal rate of return (IRR) ranging from 1,600 percent to 18,500 percent (or a return ranging from 16 to 185 times the initial cost).34 These estimates are large enough that even if the estimates of the labor market returns to test scores were to be substantially lower or the program costs much higher, the program would still have a very high rate of return. An important reason for this is that the cost of the incentive program was very low, and combining estimates from our companion papers suggests that the performance pay program would be 10 times more cost effective than reducing class size by hiring another civil service teacher.35 Thus, the optimal wage contract for teachers probably has a nonzero weight on student test score gains in this context.

33 Chetty et al. (2010) show that there were significant long-term benefits to the class size reductions under the Tennessee Student/Teacher Achievement Ratio program even though the test score gains faded away a few years into the program. Deming (2009) shows similar long-term gains to Head Start, though the test score gains fade away here as well. Of course, these studies are only suggestive about the long-term effects of programs that produce test score gains because there is no precise measure of the extent to which test score gains in school translate into higher long-term wages.

34 The minimum wage for agricultural labor in AP is Rs. 112 per day. Assuming 250 working days per year yields an annual income of Rs. 28,000, and a 7.7 percent increase in wage would translate into additional income of Rs. 2,156 per year. We treat this as a 40-year stream of fixed additional earnings (which is very conservative since we do not assume wage growth) and discount at 10 percent a year to obtain a present value of Rs. 21,235 per student at the time of entering the labor market. Since the average student in our project is 8 years old, we assume that he or she will enter the labor market at age 20 and further discount the present value by 10 percent annually for another 12 years to obtain a present value of Rs. 6,750 per student. The average school had 65 students who took the tests, which provides an estimate of the total present value of Rs. 438,750. The cost of the program per school for 2 years was Rs. 27,500 (including both bonus and administrative costs), which provides an IRR estimate of 1,600 percent. If we were to assume that wages would grow at the discount rate, the calculation yields an IRR estimate of 18,500 percent.
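The back-of-the-envelope calculation described in note 34 can be reproduced step by step; the sketch below is illustrative and uses only the rounded figures given there, so the outputs differ slightly from the reported rupee amounts because of rounding.

```python
# Illustrative replication of the back-of-the-envelope calculation in note 34.
daily_wage = 112          # Rs., minimum agricultural wage in AP
working_days = 250
wage_gain_share = 0.077   # 7.7% wage increase from the 2-year treatment effect
discount_rate = 0.10
years_working = 40
years_to_entry = 12       # from age ~8 to assumed labor-market entry at 20
students_per_school = 65
program_cost_per_school = 27_500  # Rs., bonuses plus administration, 2 years

extra_income = daily_wage * working_days * wage_gain_share  # ~Rs. 2,156 per year
# Present value at labor-market entry of a 40-year constant annuity.
pv_at_entry = extra_income * (1 - (1 + discount_rate) ** -years_working) / discount_rate
pv_today = pv_at_entry / (1 + discount_rate) ** years_to_entry
pv_per_school = pv_today * students_per_school

print(f"PV per student at entry: ~Rs. {pv_at_entry:,.0f}")   # roughly 21,000
print(f"PV per student today:    ~Rs. {pv_today:,.0f}")      # roughly 6,700
print(f"Benefit/cost per school: ~{pv_per_school / program_cost_per_school:.0f}x")  # ~16x
```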
VIII. Conclusion

Performance pay for teachers is an idea with strong proponents, as well as opponents, and the evidence to date on its effectiveness has been mixed. In this paper, we present evidence from a randomized evaluation of a teacher incentive program in a representative sample of government-run rural primary schools in the Indian state of Andhra Pradesh and show that teacher performance pay led to significant improvements in student test scores, with no evidence of any adverse consequences of the program. Additional school inputs were also effective in raising test scores, but the teacher incentive programs were three times as cost effective.

The longer-term benefits to performance pay include not only greater teacher effort but also potentially the attraction of better teachers into the profession (Lazear 2000, 2003; Hoxby and Leigh 2004). We find a positive and significant correlation between teachers’ ex ante reported support for performance pay and their actual ex post performance (as measured by value addition). This suggests that effective teachers know who they are and that teacher compensation systems that reward effectiveness may attract higher-ability teachers (see Muralidharan and Sundararaman [2011] for further details on teacher opinions regarding the program and their correlates).

While certain features of our experiment may be difficult to replicate in other settings and certain aspects of the Indian context (such as low average levels of learning and low norms for teacher effort) may be most relevant to developing countries, our results suggest that performance pay for teachers could be an effective policy tool in India and perhaps in other similar contexts as well. Input- and incentive-based policies for improving school quality are not mutually exclusive, but our results suggest that, conditional on the status quo patterns of spending in India, the marginal returns to spending additional resources on performance-linked bonuses for teachers may be higher than additional spending on unconditionally provided school inputs. Finally, the finding that more educated and better-trained teachers responded better to the incentives (while teacher education and training were not correlated with learning outcomes in comparison schools) highlights the potential for incentives to be a productivity-enhancing measure that can improve the effectiveness of other school inputs (including teacher human capital).

35 The performance pay intervention was twice as cost effective as providing schools with an extra contract teacher. We also find that the contract teacher was no less effective than a regular civil service teacher in spite of being paid a five times lower salary (Muralidharan and Sundararaman 2010a). Combining the results would suggest that introducing a performance pay program would be 10 times more effective at increasing test scores than reducing class size with an extra civil service teacher.

However, there are several unresolved issues and challenges that need to be addressed before scaling up teacher performance pay programs. One area of uncertainty is the optimal ratio of base and bonus pay. Setting the bonus too low might not provide adequate incentives to induce higher effort, whereas setting it too high increases both the risk premium and the probability of undesirable distortions. We have also not devised or tested the optimal long-term formula for teacher incentive payments. While the formula used in this project avoided the most common pitfalls of performance pay from an incentive design perspective, its accuracy was limited by the need for the bonus formula to be transparent to all teachers (most of whom were encountering a performance-based bonus for the first time in their careers).
A better formula for teacher bonuses would net out home inputs to estimate a more precise measure of teachers’ value addition. It would also try to account for the fact that the transformation function from teacher effort into student outcomes is likely to be different at various points in the achievement distribution. A related concern is measurement error and the potential lack of reliability of test scores and estimates of teacher value addition at the class and school levels.

The incentive formula can be improved with teacher data over multiple years and by drawing on the growing literature on estimating teacher value-added models (see the essays in Herman and Haertel [2005] and the special issue of Education Finance and Policy [Fall 2009]) as well as papers complementary to ours that focus on the theoretical properties of optimal incentive formulas for teachers (see Barlevy and Neal [2010] and Neal [2010] for recent contributions). However, there may be a practical trade-off between the accuracy and precision of the bonus formula on the one hand and the transparency of the system to teachers on the other. Teachers accepted the intuitive “average gain” formula and trusted the procedure used and communicated by the Azim Premji Foundation. If such a program were to become policy, it is likely that teachers will start getting more sophisticated about the formula, at which point the decision regarding where to locate on the accuracy-transparency frontier can be made in consultation with teachers. At the same time, it is possible that there may be no satisfactory resolution of the tension between accuracy and transparency.36

36 Murnane and Cohen (1986) point out that one of the main reasons why merit pay plans fail is that it is difficult for principals to clearly explain the basis of evaluations to teachers. However, Kremer and Chen (2001) show that performance incentives, even for something as objective as teacher attendance, did not work when implemented through head teachers in schools in Kenya. The head teacher marked all teachers present often enough for all of them to qualify for the prize. These results suggest that the bigger concern is not complexity but rather human mediation, and so a sophisticated algorithm might be acceptable as long as it is clearly objective and based on transparently established ex ante criteria.

While the issue of the optimal formula for teacher performance pay has not been resolved and implementation concerns are very real, this paper presents rigorous experimental evidence that even modest amounts of performance-based pay for teachers can lead to substantial improvements in student learning outcomes, with limited negative consequences (when implemented in a transparent and credible way). As school systems around the world consider adopting various forms of performance pay for teachers, attempts should be made to build in rigorous impact evaluations of these programs. A related point is that the details of the design of teacher incentive systems matter and should be informed by economic theory to improve the likelihood of their success (see Neal 2010). Programs and studies could also attempt to vary the magnitude of the incentives to estimate outcome elasticity with respect to the extent of variable pay and thereby gain further insights not only on performance pay for teachers but on performance pay in organizations in general.

References

Andrabi, T., J. Das, A. Khwaja, and T. Zajonc. 2009. “Do Value-Added Estimates Add Value? Accounting for Learning Dynamics.” Manuscript, Harvard Univ.
Aslam, M., A. De, G. Kingdon, and R. Kumar. 2011.
“Economic Returns to Schooling and Skills—an Analysis of India and Pakistan.” In Education Outcomes and Poverty in the South, edited by C. Colclough. London: Routledge.
Atkinson, A., et al. 2009. “Evaluating the Impact of Performance-Related Pay for Teachers in England.” Labour Econ. 16:251–61.
Baker, G. 1992. “Incentive Contracts and Performance Measurement.” J.P.E. 100:598–614.
———. 2002. “Distortion and Risk in Optimal Incentive Contracts.” J. Human Resources 37:728–51.
Bandiera, O., I. Barankay, and A. Rasul. 2007. “Incentives for Managers and Inequality among Workers: Evidence from a Firm Level Experiment.” Q.J.E. 122:729–73.
Barlevy, G., and D. Neal. 2010. “Pay for Percentile.” Manuscript, Univ. Chicago.
Chan, J. C. K., K. B. McDermott, and H. L. Roediger III. 2006. “Retrieval-Induced Facilitation: Initially Nontested Material Can Benefit from Prior Testing of Related Material.” J. Experimental Psychology: General 135:553–71.
Chetty, R., J. N. Friedman, N. Hilger, E. Saez, D. W. Schanzenbach, and D. Yagan. 2010. “How Does Your Kindergarten Classroom Affect Your Earnings? Evidence from Project STAR.” Working Paper no. 16381, NBER, Cambridge, MA.
Cullen, J. B., and R. Reback. 2006. “Tinkering towards Accolades: School Gaming under a Performance Accountability System.” In Advances in Applied Microeconomics, vol. 14, Improving School Accountability, edited by T. J. Gronberg and D. W. Jansen, 1–34. Bingley, U.K.: Emerald.
Das, J., S. Dercon, P. Krishnan, J. Habyarimana, K. Muralidharan, and V. Sundararaman. 2011. “School Inputs, Household Substitution, and Test Scores.” Working Paper no. 16830, NBER, Cambridge, MA.
Deming, D. 2009. “Early Childhood Intervention and Life-Cycle Skill Development: Evidence from Head Start.” American Econ. J.: Appl. Econ. 1:111–34.
Dixit, A. 2002. “Incentives and Organizations in the Public Sector: An Interpretative Review.” J. Human Resources 37:696–727.
Duflo, E., R. Hanna, and S. Ryan. 2010. “Monitoring Works: Getting Teachers to Come to School.” Manuscript, Massachusetts Inst. Tech.
Figlio, D. N., and L. Kenny. 2007. “Individual Teacher Incentives and Student Performance.” J. Public Econ. 91:901–14.
Gibbons, R. 1998. “Incentives in Organizations.” J. Econ. Perspectives 12:115–32.
Glewwe, P., N. Ilias, and M. Kremer. 2010. “Teacher Incentives.” American Econ. J.: Appl. Econ. 2:205–27.
Goodman, S., and L. Turner. 2010. “Teacher Incentive Pay and Educational Outcomes: Evidence from the NYC Bonus Program.” Manuscript, Columbia Univ.
Gordon, R., T. Kane, and D. Staiger. 2006. “Identifying Effective Teachers Using Performance on the Job.” Manuscript, Brookings Inst., Washington, DC.
Green, J. R., and N. L. Stokey. 1983. “A Comparison of Tournaments and Contracts.” J.P.E. 91:349–64.
Hamilton, B. H., J. A. Nickerson, and H. Owan. 2003. “Team Incentives and Worker Heterogeneity: An Empirical Analysis of the Impact of Teams on Productivity and Participation.” J.P.E. 111:465–97.
Hanushek, E. 2006.
“School Resources.” In Handbook of the Economics of Education, vol. 2, edited by E. Hanushek and F. Welch. Amsterdam: North-Holland. Heckman, J., and J. Smith. 1995. “Assessing the Case of Social Experiments.” J. Econ. Perspectives 9:85–110. Herman, J. L., and E. H. Haertel. 2005. Uses and Misuses of Data for Educational Accountability and Improvement. Malden, MA: Blackwell Synergy. Holmstrom, B. 1982. “Moral Hazard in Teams.” Bell J. Econ. 13:324–40. Holmstrom, B., and P. Milgrom. 1987. “Aggregation and Linearity in the Pro- vision of Intertemporal Incentives.” Econometrica 55:303–28. ———. 1991. “Multitask Principal-Agent Analyses: Incentive Contracts, Asset Ownership, and Job Design.” J. Law, Econ., and Org. 7:24–52. Hoxby, C. M., and A. Leigh. 2004. “Pulled Away or Pushed Out? Explaining the Decline of Teacher Aptitude in the United States.” A.E.R. Papers and Proc. 94: 236–40. Itoh, H. 1991. “Incentives to Help in Multi-agent Situations.” Econometrica 59: 611–36. Jacob, B. A. 2005. “Accountability, Incentives and Behavior: The Impact of High- Stakes Testing in the Chicago Public Schools.” J. Public Econ. 89:761–96. Jacob, B. A., L. Lefgren, and D. Sims. 2008. “The Persistence of Teacher-Induced Learning Gains.” Working Paper no. 14065, NBER, Cambridge, MA. Jacob, B. A., and S. D. Levitt. 2003. “Rotten Apples: An Investigation of the Prevalence and Predictors of Teacher Cheating.” Q.J.E. 118:843–77. Kandel, E., and E. Lazear. 1992. “Peer Pressure and Partnerships.” J.P.E. 100: 801–17. Kandori, M. 1992. “Social Norms and Community Enforcement.” Rev. Econ. Stud- ies 59:63–80. This content downloaded from 190.81.46.114 on Wed, 5 Jun 2013 18:28:32 PM All use subject to JSTOR Terms and Conditions teacher performance pay 77 Kane, T. J., and D. O. Staiger. 2002. “The Promise and Pitfalls of Using Imprecise School Accountability Measures.” J. Econ. Perspectives 16:91–114. Kingdon, G. G., and M. Muzammil. 2009. “A Political Economy of Education in India: The Case of UP.” Oxford Development Studies 37 (2): 123–44. Koretz, D. M. 2002. “Limitations in the Use of Achievement Tests as Measures of Educators’ Productivity.” J. Human Resources 37:752–77. Kremer, M., and D. Chen 2001. “An Interim Program on a Teacher Attendance Incentive Program in Kenya.” Manuscript, Harvard Univ. Kremer, M., K. Muralidharan, N. Chaudhury, F. H. Rogers, and J. Hammer. 2005. “Teacher Absence in India: A Snapshot.” J. European Econ. Assoc. 3:658–67. Ladd, H. F. 1999. “The Dallas School Accountability and Incentive Program: An Evaluation of Its Impacts on Student Outcomes.” Econ. Educ. Rev. 18:1–16. Lavy, V. 2002. “Evaluating the Effect of Teachers’ Group Performance Incentives on Pupil Achievement.” J.P.E. 110:1286–1317. ———. 2009. “Performance Pay and Teachers’ Effort, Productivity, and Grading Ethics.” A.E.R. 99:1979–2011. Lazear, E. 2000. “Performance Pay and Productivity.” A.E.R. 90:1346–61. ———. 2003. “Teacher Incentives.” Swedish Econ. Policy Rev. 10:179–214. Lazear, E., and S. Rosen. 1981. “Rank-Order Tournaments as Optimum Labor Contracts.” J.P.E. 89:841–64. Muralidharan, K., and V. Sundararaman. 2009. “Teacher Performance Pay: Ex- perimental Evidence from India.” Working Paper no. 15323, Cambridge, MA. ———. 2010a. “Contract Teachers: Experimental Evidence from India.” Man- uscript, Univ. California, San Diego. ———. 2010b. “The Impact of Diagnostic Feedback to Teachers on Student Learning: Experimental Evidence from India.” Econ. J. 120:F187–F203. ———. 2011. 
“Teacher Opinions on Performance Pay: Evidence from India.” Econ. Educ. Rev. 30, forthcoming. Murnane, R. J., and D. K. Cohen. 1986. “Merit Pay and the Evaluation Problem: Why Most Merit Pay Plans Fail and a Few Survive.” Harvard Educ. Rev. 56:1– 17. Neal, D. 2010. “The Design of Performance Pay in Education.” Manuscript, Univ. Chicago. Neal, D., and D. Schanzenbach. 2010. “Left Behind by Design: Proficiency Counts and Test-Based Accountability.” Rev. Econ. and Statis. 92:263–83. Oyer, P. 1998. “Fiscal Year Ends and Nonlinear Incentive Contracts: The Effect on Business Seasonality.” Q.J.E. 113:149–85. Podgursky, M., and M. Springer. 2007. “Teacher Performance Pay: A Review.” J. Policy Analysis and Management 26:909–50. Pratham. 2010. Annual Status of Education Report. New Delhi: Pratham. Prendergast, C. 1999. “The Provision of Incentives in Firms.” J. Econ. Literature 37:7–63. Rivkin, S. G., E. A. Hanushek, and J. F. Kain. 2005. “Teachers, Schools, and Academic Achievement.” Econometrica 73:417–58. Rockoff, J. E. 2004. “The Impact of Individual Teachers on Student Achievement: Evidence from Panel Data.” A.E.R. 94:247–52. Springer, M. G., et al. 2010. “Teacher Pay for Performance: Experimental Evi- dence from the Project on Incentives in Teaching.” Manuscript, Nat. Center Performance Incentives, Vanderbilt Univ. Umansky, I. 2005. “A Literature Review of Teacher Quality and Incentives: The- ory and Evidence.” In Incentives to Improve Teaching: Lessons from Latin America, edited by E. Vegas, 21–61. Washington, DC: World Bank. This content downloaded from 190.81.46.114 on Wed, 5 Jun 2013 18:28:32 PM All use subject to JSTOR Terms and Conditions Working Draft – Comments Welcome Contract Teachers: Experimental Evidence from India Karthik Muralidharan † Venkatesh Sundararaman ‡ 24 May 2010 * Abstract: The large-scale expansion of primary education in developing countries has led to the increasing use of locally-hired teachers on fixed-term renewable contracts who are not professionally trained and who are paid much lower salaries than regular civil service teachers. This has been a very controversial policy, and there is limited evidence about the effectiveness of such contract teachers. We present experimental evidence from a program that provided an extra contract teacher to 100 randomly-chosen government-run rural primary schools in the Indian state of Andhra Pradesh. At the end of two years, students in schools with an extra contract teacher performed significantly better than those in comparison schools by 0.15 and 0.13 standard deviations, in math and language tests respectively. While all students gain from the program, the extra contract teacher was particularly beneficial for students in their first year of school and students in remote schools. Contract teachers were significantly less likely to be absent from school than civil-service teachers (16% vs. 27%). We also find using four different non-experimental estimation procedures that contract teachers are no less effective in improving student learning than regular civil-service teachers who are more qualified, better trained, and paid five times higher salaries. JEL Classification: I21, M55, O15 Keywords: contract teachers, teacher incentives, teacher pay, class size, primary education in developing countries, public and private schools, India † UC San Diego, NBER, and J-PAL; E-mail: kamurali@ucsd.edu ‡ South Asia Human Development Unit, World Bank. 
E-mail: vsundararaman@worldbank.org * We are grateful to Caroline Hoxby, Michael Kremer, and Michelle Riboud for their support, advice, and encouragement at all stages of this project. We thank Eli Berman, James Berry, Julie Cullen, Gordon Dahl, Jishnu Das, Nora Gordon, Gordon Hanson, and various seminar participants for useful comments and discussions. This paper is based on a project known as the Andhra Pradesh Randomized Evaluation Study (AP RESt), which is a partnership between the Government of Andhra Pradesh, the Azim Premji Foundation, and the World Bank. Financial assistance for the project has been provided by the Government of Andhra Pradesh, the UK Department for International Development (DFID), the Azim Premji Foundation, and the World Bank. We thank Dileep Ranjekar, Amit Dar, Samuel C. Carlson, and officials of the Department of School Education in Andhra Pradesh (particularly Dr. I.V. Subba Rao, Dr. P. Krishnaiah, and K. Ramakrishna Rao), for their continuous support and long-term vision for this research. We are especially grateful to DD Karopady, M Srinivasa Rao, and staff of the Azim Premji Foundation for their leadership and meticulous work in implementing this project. We thank Vinayak Alladi, and Ketki Sheth for outstanding research assistance. The findings, interpretations, and conclusions expressed in this paper are those of the authors and do not necessarily represent the views of the World Bank, its Executive Directors, or the governments they represent. 1. Introduction The large scale expansion of primary education in developing countries over the past two decades to achieve the Millennium Development Goal of universal primary education has led to significant improvements in school access and enrollment, but has also created difficulties with regards to maintaining and improving school quality. 1 A particularly challenging problem has been recruiting enough teachers and posting them in areas where they are needed. The challenge is both fiscal (since teacher salaries account for the largest component of education spending 2) and logistical (since qualified civil-service teachers are less willing to be deployed to underserved and remote areas where their need is the greatest). Governments in several developing countries have responded to this challenge by staffing teaching positions with locally-hired teachers on fixed-term renewable contracts, who are not professionally trained, and who are paid much lower salaries than those of regular teachers (often less than one fifth as much). 3 The increasing use of contract teachers has been one of the most significant policy innovations 4 in providing primary education in developing countries over the last two decades, but it has also been highly controversial. Supporters consider the use of contract teachers to be an efficient way of expanding education access and quality to a large number of first-generation learners, and argue that contract teachers face superior incentives compared to tenured civil-service teachers. Opponents argue that using under-qualified and untrained teachers may staff classrooms but will not produce learning outcomes, and that the use of contract teachers de-professionalizes teaching, reduces the prestige of the entire profession, and reduces motivation of all teachers. 5 1 See Pritchett (2004) for a detailed overview showing very low levels of learning (conditional on years of school completed) across several developing countries. 
2 Typically over 80% and often over 90% of education spending in many developing countries is on teacher salaries (education spending data by country available at http://www.uis.unesco.org/en/stats/stats0.htm) 3 Contract teacher schemes have been used in several developing countries including Cambodia, Indonesia, Kenya, Mali, Nicaragua, Niger, Togo, and several other African countries (see Duthilleul (2005) for a review of contract teacher programs in several countries; Table 2 in Boudon et al (2007) reviews contract teacher programs in several African countries). They have also been widely employed in several states of India (under different names such as Shiksha Karmi in Madhya Pradesh and Rajasthan, Shiksha Mitra in Uttar Pradesh, Vidya Sahayak in Gujarat and Himachal Pradesh, and Vidya Volunteers in Andhra Pradesh). 4 For example, over 25% of the primary school teachers in the large Indian states of Uttar Pradesh, Bihar, and Madhya Pradesh are contract teachers (as calculated from the State Report Cards issued by the Ministry of Human Resource Development in India – see Mehta (2007)). Table 1 in Boudon et al (2007) shows the distribution of teachers by contractual status in 12 African countries and finds that on average, that nearly a third of public-school teachers are contract teachers. 5 See Kumar et al (2005) for an example of these criticisms. 1 We present experimental evidence on the impact of contract teachers from a program that was designed to mimic an expansion of the current contract teacher policy of the government of the Indian state of Andhra Pradesh (AP). The study was conducted across a representative sample of 200 government-run schools in rural AP with 100 of these schools being selected by lottery to receive an extra contract teacher over and above their usual allocation of regular and contract teachers. This paper presents the first experimental evaluation of an “as is” expansion of an existing contract teacher policy anywhere in the world. The study also features random assignment in a representative sample of schools in AP, thereby providing estimates of program impact that are directly applicable to scaling up. 6 At the end of two years of the program, we find that students in schools with an extra contract teacher perform significantly better than those in comparison schools by 0.15 and 0.13 standard deviations (SD) in math and language tests respectively, showing that even untrained teachers with less education and much lower levels of training than regular civil-service teachers were able to improve student learning outcomes. Students in remote schools benefit more from the extra contract teacher and we also find that the largest gains in test scores in treatment schools are for students in the first grade (averaging 0.23 and 0.25 SD in math and language). We find evidence to suggest that the mechanism for this result is that class-size reductions enabled by hiring an additional contract teacher were of greatest benefit to children in younger grades. Finally, we also find that contract teachers were significantly less likely to be absent from school than regular teachers (16% versus 27%), suggesting that they have superior incentives for effort. While the experiment establishes that the marginal product of contract teachers is positive, 7 it does not directly compare the effectiveness of regular and contract teachers. 
We use our rich panel data on student learning and data on teacher assignment to classrooms to construct four different non-experimental estimates of the relative effectiveness of contract and regular teachers (two using only within-school variation, and two using only between-school variation). We find using all four methods that we cannot reject the null hypothesis that contract teachers are as 6 The random assignment of treatment provides high internal validity, while the random sampling of schools into the universe of the study provides greater external validity than typical experiments by avoiding the “randomization bias”, whereby entities that are in the experiment (treatment or control) are atypical relative to the population that the result is sought to be extrapolated to (Heckman and Smith (1995)). 7 This is not obvious given the lack of training and the lower qualifications of contract teachers. For instance, the well-known Tennessee STAR experiment found a positive effect on test scores of reducing class sizes with a regular teacher, but found no additional impact of providing less-qualified teacher-aides. 2 effective as regular teachers (who cost five times as much) in improving student learning outcomes and the null is never rejected even under several robustness checks. To understand the broader conditions of teacher labor markets in rural India, we collect data on rural private school teachers in the same districts where the contract teacher experiment was carried out and find that private school teacher characteristics are closer to those of contract teachers than civil-service teachers. We also find that private school teacher salaries are even lower than those of contract teachers and so much lower than regular teacher salaries, that there is almost no common support in the two distributions (Figure 2). The results on equal effectiveness of contract and regular teachers and the market salary benchmarks suggest that the large wage differential between regular and contract teachers is unlikely to reflect differences in productivity and mostly represents rents accruing to unionized civil-service teachers. Our results contribute to an emerging literature on understanding the impact of contract teachers in developing countries. In addition to several descriptive studies regarding the use of contract teachers, 8 recent papers that use observational data to study the effect of contract teachers include De Laat and Vegas (2005) in Togo, Bourdon et al (2006) in Niger, and Bourdon et al (2007) in Niger, Mali, and Togo. 9 Duflo et al (2009) conduct an experimental evaluation of a contract teacher program in Kenya and find that students randomly assigned to contract teachers (and whose class size was halved) do significantly better than students in comparison schools, while students assigned to regular teachers in program schools (where the class size was also halved) do no better than those in the comparison schools. 8 Notable among these are Duthilleul (2005) describing experiences with contract teachers in Cambodia, India, and Nicaragua, and Govinda and Josephine (2004) who conduct a detailed review of contract teachers (also known as para-teachers) in India and summarize the key arguments for and against the use of contract teachers in India. The three case studies in Pritchett and Pande (2006) also provide a good discussion on locally-hired contract teachers in India. 
Kingdon and Sipahimalani-Rao (2010) provide a recent overview that summarizes several descriptive studies on para-teachers across India. 9 Using a data set from Togo, De Laat and Vegas (2005) control for observable differences in student and teacher characteristics and find that students of regular teachers perform better than those of contract teachers. Bourdon et al (2006) use data from Niger and conclude that after controlling for confounding factors, contract teachers do not perform much worse than regular teachers. Bourdon et al (2007) use data from Togo, Niger, and Mali and find differential effects across these countries (positive effects in Mali, mixed effects in Togo, and negative effects in Niger) and suggest that these may be explained by differences in how contract teacher programs were implemented in these countries, with positive effects found where the contract teachers were managed through local communities and negative effects where contract teacher hiring was centralized. A related paper is Banerjee et al (2007), who conduct an experimental evaluation of a remedial education program staffed by untrained informal teachers in two Indian cities and find that the program was highly effective in improving learning outcomes. But the program focused on remedial instruction and removed weak children from the classroom, and is thus quite different from the typical contract teacher policies implemented in several Indian states over the past two decades. 3 Beyond the direct implications for education policy in developing countries, our results on contract teachers are also relevant to the literature on decentralization and accountability in the provision of public services, and the literature on addressing the trade-off between access and scale in service delivery in remote areas (in both education and health). 10 We also contribute to the extensive class-size literature in developed and developing countries, 11 and show that the benefits of class-size reduction can be obtained even with less-trained contract teachers (for primary education in developing countries). Finally, while set in the context of schools and teachers, the results in this paper also contribute to our understanding of the consequences of restricting entry into professions based on credentials (either by law or by convention). 12 More broadly, Bandiera et al (2009) analyze public procurement in Italy and show that over 80% of wastage in public spending can be attributed to passive waste (attributable to inefficiencies resulting from limited incentives for public sector officials to be efficient or from regulatory constraints) as opposed to active waste (attributable to bribes and private pay-offs). Our results showing that contract and regular teachers are equally effective even though the latter cost five times more suggest similar magnitudes of passive waste in public production of primary education in India. Since education is one of the largest components of public expenditure in many countries, our results also contribute to the broader literature on the cost effectiveness of publicly-produced services. 13 10 On decentralization and service delivery, see Bardhan (2002) for a theoretical discussion, Sawada and Ragatz (2005) on the EDUCO program in El Salvador, Pritchett and Murgai (2007) on education decentralization in India, and Duflo et al (2010) on school committees in Kenya. 
On service delivery in remote areas see Jacob, Kochar, and Reddy (2008) on the impact of sub-scale schools and multi-grade teaching on learning outcomes in India. There is a large corresponding literature on the use of community health workers with limited training in improving health outcomes in underserved rural areas (see Haines et al (2007) and the references there for a guide to the literature). 11 References based on US evidence include Krueger (1999, 2003) and Hanushek (1999, 2003). Angrist and Lavy (1999) and Urquiola (2006) provide international evidence. Krueger (1999) reports results from the Tennessee STAR project, which is probably the most well known class-size reduction experiment. 12 There is a vast literature in the US on the effects of teacher certification and of policies requiring school districts to hire certified or qualified teachers. See Kane et al (2008) for a recent study in New York and references to the US literature on certification. In related work in higher education, Bettinger and Long (forthcoming) show that adjunct faculty (who are less qualified and paid much less than tenure-track faculty) perform slightly better than regular faculty when measured by whether students taught introductory courses by adjuncts are more likely to take more courses in the same subject or choose to major in the subject. Kleiner (2000) presents a general overview of the economics of occupational licensing. 13 On teacher personnel policies, Ballou (1996) shows that public school administrators typically do not hire the best applicants; on education spending more broadly, Hanushek (2002) reviews several studies showing the lack of a relation between public spending on education and learning outcomes; and on public sector management in general, Bloom and Van Reenen (2010) collect detailed data on management practices and show that government-owned firms are typically managed “extremely badly”. 4 There are large welfare implications of taking our results seriously. The recently passed Right to Education Act in India calls for eliminating the use of untrained teachers and increasing education spending to replace them with regular teachers over the next three years. The Act also calls for a reduction of the pupil-teacher ratio from 40:1 to 30:1 and the combination of these clauses is expected to cost over USD 5 Billion annually if fulfilled through the recruiting of additional regular teachers. 14 Since it is possible to hire several contract teachers for every one regular teacher not hired our results suggest that a substantially larger improvement in education outcomes may be obtained for any given increase in education spending, by allocating these additional resources to hiring more contract teachers as opposed to fewer regular teachers. 15 The rest of this paper is organized as follows: section 2 describes the experimental intervention and data collection and section 3 presents the results of the extra contract teacher program. Section 4 presents non-experimental comparisons of the effectiveness of regular and contract teachers, while section 5 provides comparisons to private school teachers. Section 6 discusses policy implications and concludes. 2. Experimental Design 2.1. Context While India has made substantial progress in improving access to primary schooling and primary school enrollment rates, the average levels of learning remain very low. 
The most recent Annual Status of Education Report found that around 60% of children aged 6 to 14 in an all-India sample of rural households could not read at the second grade level, though over 96% of them were enrolled in school (Pratham, 2010). Public spending on education has been rising as part of the “Education for All” campaign, but there are substantial inefficiencies in public delivery of education services. A recent study using a nationally representative dataset of 14 The Right to Education Act was passed in 2010 and estimates suggest that an additional 1.2 million teachers will need to be recruited to satisfy the provisions in the Act. The full annual cost of a regular teacher (including benefits) is estimated at USD 4,000/year. 15 This is in contrast to Tamura (2001), who develops of a model of human capital accumulation where there is a trade-off between teacher quality and class-size and his estimates based on historical US data suggest that returns to investing in teacher quality were higher than returns to reducing class size for poor school districts. Our results suggest that the status quo of education production is so far within the efficient frontier that there is no trade off on the current policy margin because class size can be reduced substantially for the same cost by hiring more contract teachers, who are no less effective on the margin than regular teachers. 5 primary schools in India found that 25% of teachers were absent on any given day, and that less than half of them were engaged in any teaching activity (Kremer et al (2005)). Andhra Pradesh (AP) is the 5th largest state in India, with a population of over 80 million, 73% of who live in rural areas. AP is close to the all-India average on various measures of human development such as gross enrollment in primary school, literacy, and infant mortality, as well as on measures of service delivery such as teacher absence (Figure 1a). The state consists of three historically distinct socio-cultural regions (Figure 1b) and a total of 23 districts. Each district is divided into three to five divisions, and each division is composed of ten to fifteen mandals, which are the lowest administrative tier of the government of AP. A typical mandal has around 25 villages and 40 to 60 government primary schools. There are a total of over 60,000 such schools in AP and around 80% of children in rural AP attend government-run schools (Pratham, 2010). The average rural government primary school is quite small, with total enrollment of around 80 to 100 students and an average of 2 to 3 teachers across grades one through five. 16 One teacher typically teaches all subjects for a given grade (and often teaches more than one grade simultaneously). All regular teachers are employed by the state, and their salary is mostly determined by experience and rank, with minor adjustments based on assignment location, but no component based on any measure of performance. In 2006, the average salary of regular teachers was over Rs. 8,000/month and total compensation (including benefits) was over Rs. 10,000/month (per capita income in AP was around Rs. 2,000/month). Regular teachers' salaries and benefits comprise over 90% of non-capital expenditure on primary education in AP. Teacher unions are strong and disciplinary action for non-performance is rare. 
17 2.2 The Extra Contract Teacher Intervention Contract teachers (also known as para-teachers) are generally hired at the school level by school committees and have usually completed either high school or college but typically have no formal teacher training. Their contracts are renewed annually and they are not protected by 16 This is a consequence of the priority placed on providing all children with access to a primary school within one kilometer from their homes. The median of the number of teachers per school was three and the mode was two. 17 Kremer et al (2005) find that on any given working day, 25% of teachers are absent from schools across India, but only 1 head teacher in their sample of 3000 government schools had ever fired a teacher for repeated absence. The teacher absence rate in AP is almost exactly equal to the all-India average. See Kingdon and Muzammil (2002) for a descriptive study of the strength of teacher unions in India’s largest state. 6 any civil-service rules. Their typical salary of around Rs. 1000 - 1500/month is less than one fifth of the average salary of regular government teachers. 18 They are also much more likely to be younger, to be female, to be from the same village, and live closer to the school they teach in (Table 1 – Panel A). Contract teachers usually teach their own classes and are not 'teacher-aides' who support a regular teacher in the same classroom. The process by which contract teachers are typically hired in Andhra Pradesh is that schools apply to the district education administration for permission to hire a contract teacher based on their enrollment and teacher strength at the start of the school year. Thus contract teachers can be appointed both against vacant sanctioned posts (that may have been filled by a regular teacher) and as additional resources to meet the needs of growing enrollment. If the permission (and fiscal allotment) is given, a contract teacher will be hired by the school committee. The authorization of the position is not guaranteed for subsequent years, but once a position is approved, it is usually continued unless there are significant changes in enrollment patterns. But since renewal is not guaranteed, the appointment of contract teachers is typically for a 10-month period. 19 New hires are supposed to go through a brief accelerated training program prior to starting to teach, but this is imperfectly implemented in practice. The extra contract teacher intervention studied in this paper was designed to resemble the typical process of contract teacher hiring and use as closely as possible. Schools that were selected for the program by a lottery were informed in a letter from the district administration that they had been authorized to hire an additional contract teacher, and that they were expected to follow the same procedures and guidelines for hiring a contract teacher as they would normally do. The additional contract teachers were allocated to the school and not to a specific grade or pre-specified role, which is also how teachers (regular and contract) are typically allocated to primary schools. Most schools (~80%) reported starting the process of hiring the extra contract teacher within a week of receiving the notification and the modal selection committee consisted of three members (the head teacher, a member of the local elected body, and another teacher). The most important stated criterion for hiring the contract teacher was qualification (62%), followed by 18 The salary of contract teachers was Rs. 
1,000/month in the first year of the project (2005 – 06) and was raised to Rs. 1,500/month in the second year (2006 – 07). 19 See Govinda and Yazali (2004) for a more detailed description of contract teacher appointment procedures across Indian states. 7 experience and distance from the school (20% each). The additional contract teachers hired under this program had the same average characteristics as typical contract teachers in the comparison schools (Table 1 – Panel B), and so the intervention mimicked an expansion of the existing contract teacher program in AP to 100 randomly selected schools. 2.3. Sampling and Randomization We sampled 5 districts across each of the 3 socio-cultural regions of AP in proportion to population (Figure 1b). In each of the 5 districts, we randomly selected one division and then randomly sampled 10 mandals in the selected division. In each of the 50 mandals, we randomly sampled 10 schools using probability proportional to enrollment. Thus, the universe of 500 schools in the study was representative of the schooling conditions of the typical child attending a government-run primary school in rural AP. Experimental results in this sample can therefore be credibly extrapolated to the full state of Andhra Pradesh. The extra contract teacher program was one of four policy options evaluated as part of a larger education research initiative known as the Andhra Pradesh Randomized Evaluation Studies (AP RESt),20 with 100 schools being randomly assigned to each of four treatment and one control groups. The school year in AP starts in mid June, and baseline tests were conducted in the 500 sampled schools during late June and early July, 2005. 21 After the baseline tests were evaluated, the Azim Premji Foundation randomly allocated 2 out of the 10 project schools in each mandal to one of 5 cells (four treatments and one control). Since 50 mandals were chosen across 5 districts, there were a total of 100 schools (spread out across the state) in each cell. The geographic stratification allows us to estimate the treatment impact with mandal-level fixed effects and thereby net out any common factors at the lowest administrative level of government. Since no school received more than one treatment, we can analyze the impact of each program independently with respect to the control schools without worrying about any confounding interactions. This analysis in this paper is based on the 200 schools that comprise the 100 schools randomly chosen for the extra contract teacher (ECT) program and the 100 that 20 The AP RESt is a partnership between the government of AP, the Azim Premji Foundation (a leading non-profit organization working to improve primary education in India), and the World Bank to rigorously evaluate the effectiveness of several policy options to improve the quality of primary education in developing countries. The Azim Premji Foundation (APF) was the main implementing agency for the study. 21 The selected schools were informed by the government that an external assessment of learning would take place in this period, but there was no communication to any school about any of the treatments at this time. 8 were randomly assigned to the comparison group. Table 2 (Panel A) shows summary statistics of baseline school and student characteristics for both treatment and comparison schools, and we see that the null of equality across treatment groups cannot be rejected for any of the variables. 22 2.4. 
Data Collection
The data used in this paper comprise independent learning assessments in math and language (Telugu) conducted at the beginning of the study and at the end of each of the two years of the experiment. We also use data from regular unannounced "tracking surveys" made by staff of the Azim Premji Foundation to measure process variables such as teacher attendance and teaching activity. 23 The treatment and comparison schools operated under identical conditions of information and monitoring and only differed in the treatment that they received. This ensures that Hawthorne effects are minimized and that a comparison between treatment and control schools can accurately isolate the treatment effect.

The tests used for this study were designed by India's leading education testing firm, and the difficulty level of questions was calibrated in a pilot exercise to ensure adequate statistical discrimination on the tests. The baseline test (June-July, 2005) covered competencies up to that of the previous school year. At the end of the school year (March-April, 2006), schools had two rounds of tests with a gap of two weeks between them. The first test covered competencies up to that of the previous school year, while the second test covered materials from the current school year's syllabus. The same procedure was repeated at the end of the second year, with two rounds of testing. Doing two rounds of testing at the end of each year allows for the inclusion of more overlapping materials across years of testing, reduces the impact of measurement errors specific to the day of testing by having multiple tests around two weeks apart, and also reduces sample attrition due to student absence on the day of the test.

For the rest of this paper, Year 0 (Y0) refers to the baseline tests in June-July 2005; Year 1 (Y1) refers to both rounds of tests conducted at the end of the first year of the program in March-April, 2006; and Year 2 (Y2) refers to both rounds of tests conducted at the end of the second year of the program in March-April, 2007. All analysis is carried out with normalized test scores, where individual test scores are converted to z-scores by normalizing them with respect to the distribution of scores in the control schools on the same test. 24

22 Table 2 shows sample balance between the comparison schools and those that received an extra contract teacher, which is the focus of the analysis in this paper. The randomization was done jointly across all 5 treatments shown in Table 3.1, and the sample was also balanced on observables across the other treatments.

23 Six visits were made to each school in the first year (2005-06), while four visits were made in the second year (2006-07).

3. Experimental Results
3.1. Teacher and Student Turnover and Attrition
Regular civil-service teachers in AP are transferred once every three years on average. While this could potentially bias our results if more teachers chose to stay in or tried to transfer into the ECT schools, it is unlikely that this was the case since the treatments were announced in August '05, while the transfer process typically starts earlier in the year. There was no statistically significant difference between the treatment and comparison groups in the extent of teacher turnover, and the turnover rate was close to 33%, which is consistent with rotation of teachers once every 3 years (Table 2 – Panel B, rows 11-12). 
As part of the agreement between the Government of AP and the Azim Premji Foundation, the Government agreed to minimize transfers into and out of the sample schools for the duration of the study. The average teacher turnover in the second year was only 6%, and once again, there was no significant difference in teacher transfer rates across the various treatments (Table 2 – Panel B, rows 13-16). 25

The average student attrition rate in the sample (defined as the fraction of students in the baseline tests who did not take a test at the end of each year) was 7.3% and 25% in year 1 and year 2 respectively, but there is no significant difference in attrition across the treatments (rows 17 and 20). Attrition is higher among students with lower baseline scores, but this is true across all treatments, and we find no significant difference in mean baseline test scores across treatment categories among the students who drop out of the test-taking sample (Table 1 – Panel B, rows 18, 19, 21, 22).

24 Since all analysis is done with normalized test scores (relative to the control school distribution), a student can be absent on one testing day and still be included in the analysis without bias, because the included score is normalized relative to the control school distribution for the same test that the student took.

25 There was also a court order to restrict teacher transfers in response to litigation complaining that teacher transfers during the school year were disruptive to students. This may have also helped to reduce teacher transfers during the second year of the project.

3.2. Specification
Our default specification uses the form:

T_{ijkm}(Y_n) = \alpha + \gamma_j \cdot T_{ijkm}(Y_0) + \delta \cdot ECT + \beta \cdot Z_m + \varepsilon_k + \varepsilon_{jk} + \varepsilon_{ijk} \quad (3.1)

The main dependent variable of interest is T_{ijkm}, which is the normalized test score on the specific test (normalized with respect to the score distribution of the comparison schools for each test and grade separately), where i, j, k, m denote the student, grade, school, and mandal respectively. Y_0 indicates the baseline tests, while Y_n indicates a test at the end of n years of the treatment. Including the normalized baseline test score improves efficiency due to the autocorrelation between test scores across multiple periods. 26 All regressions include a set of mandal-level dummies (Z_m) and the standard errors are clustered at the school level. Since the treatments are stratified by mandal, including mandal fixed effects increases the efficiency of the estimate. We also run the regressions with and without controls for household and school variables. The 'ECT' variable is a dummy at the school level indicating if it was selected to receive the extra contract teacher (ECT) program, and the parameter of interest is δ, which is the effect on normalized test scores of being in an ECT school. The random assignment of treatment ensures that the 'ECT' variable in the equation above is not correlated with the error term, and the estimates of the one-year and two-year treatment effects are therefore unbiased.

3.3. Impact of ECT program on Test Scores
Averaging across both math and language, students in program schools scored 0.09 standard deviations (SD) higher than those in comparison schools at the end of the first year of the program, and 0.14 SD higher at the end of the second year (Table 3 – Panel A, columns 1 and 5). The benefits of an extra contract teacher are similar in math (0.15 SD) and language (0.13 SD), as seen in Panels B and C of Table 3. 
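As an illustration only, here is a minimal sketch of how the normalization (footnote 24 above) and specification (3.1) might be taken to data: raw scores are converted to z-scores relative to the control-school distribution for the same test, and the endline z-score is then regressed on the baseline z-score (with grade-specific slopes standing in for γ_j), the ECT dummy, and mandal fixed effects, with standard errors clustered at the school level. The dataframe layout and column names (raw_score, round, subject, grade, control, z_endline, z_baseline, ect, mandal_id, school_id) are hypothetical placeholders, not the study's actual variable names.

```python
import pandas as pd
import statsmodels.formula.api as smf

def add_z_scores(df: pd.DataFrame) -> pd.DataFrame:
    """Convert raw scores to z-scores relative to the control-school
    distribution for the same test (round, subject, and grade).

    Hypothetical columns: raw_score, round, subject, grade,
    control (1 for comparison schools, 0 otherwise).
    """
    def _z(group: pd.DataFrame) -> pd.Series:
        control = group.loc[group["control"] == 1, "raw_score"]
        return (group["raw_score"] - control.mean()) / control.std()

    out = df.copy()
    out["z_score"] = out.groupby(
        ["round", "subject", "grade"], group_keys=False
    ).apply(_z)
    return out

def estimate_ect_effect(students: pd.DataFrame):
    """OLS version of specification (3.1): endline z-score on baseline
    z-score (grade-specific slopes stand in for gamma_j), the ECT dummy,
    and mandal fixed effects, with school-clustered standard errors.

    Hypothetical columns: z_endline, z_baseline, ect, grade,
    mandal_id, school_id.
    """
    model = smf.ols(
        "z_endline ~ z_baseline:C(grade) + ect + C(mandal_id)",
        data=students,
    )
    return model.fit(
        cov_type="cluster", cov_kwds={"groups": students["school_id"]}
    )

# Example usage with a hypothetical student-level dataframe `students`:
# fit = estimate_ect_effect(students)
# print(fit.params["ect"], fit.bse["ect"])  # delta and its clustered SE
```

Because the ECT dummy is randomly assigned, such a regression recovers the experimental treatment effect δ; the baseline score and mandal dummies serve only to improve precision.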
The addition of school and household controls does not significantly change the estimated value of δ, confirming the validity of the randomization (columns 2 and 6). Column 3 of Table 3 shows the results of estimating equation (3.1) for the second-year effect (with Y1 scores on the right-hand side). This is not an experimental estimate since the Y1 scores are a post-treatment outcome, but the point estimates suggest that the effect of the 26 Since grade 1 children did not have a baseline test, we set the normalized baseline score to zero for these children (similarly for children in grade 2 at the end of two years of the treatment). 11 program was almost identical across both years (0.09 SD in both years). 27 However, the two- year treatment effect of 0.14 SD is not the sum of these two effects because of depreciation of prior gains. A more detailed discussion of depreciation (or the lack of full persistence) of test score gains is beyond the scope of this paper, but the important point to note is that calculating the average treatment effect by dividing the “n” year treatment effect by “n” years, will typically underestimate the impact of the treatment beyond the first year relative to the counterfactual of discontinuation of the treatment. On the other hand, if the effects of most educational interventions fade out, then it is likely that extrapolating one-year treatment effects will typically overstate the long-term impact of programs, which highlights the importance of carrying out long-term follow ups of even experimental evaluations in order to do better cost-benefit calculations. 28 3.4. Heterogeneous treatment effects by grade Disaggregating the treatment effects by grade, we find that children in the first grade in treatment schools show the largest gains, scoring 0.20 SD and 0.29 SD better in the first and second year respectively (Table 4 – columns 1 and 2). Given sampling variation, we must exercise caution in inferring heterogeneous treatment effects, unless the same pattern is repeated over multiple years. Finding the same results (of highest treatment effect for grade one) in both years, therefore gives us confidence in the inference that the program had the greatest benefits for students in grade 1. Note, however, that the extra contract teacher is assigned to the school as opposed to a specific class. Thus, the choice of how to assign the teacher is made at the school level and it could have been possible that schools chose to reduce class sizes the most in grade 1. Table 5 shows the effective class size 29 experienced by students in each grade in both treatment and 27 Specifically the estimate of the “second year” treatment effect requires an unbiased estimate of γ, which cannot be consistently estimated in the above specification due to downward bias from measurement error and upward bias from omitted individual ability. Andrabi et al (2008) show that these biases roughly cancel out each other in their data from a similar context (primary education in Pakistan), and so we present the results of this specification as illustrative while focusing our discussion on the experimental estimates of one and two-year treatment effects. 28 The issue of persistence/depreciation of learning has only recently received attention in the literature on the effects of education interventions on test scores over multiple years. 
See Andrabi et al (2008) and Jacob et al (2008) for a more detailed discussion of issues involved with estimating the extent of persistence of interventions, and the implications for cost-benefit analysis. 29 We use the term “effective class size” because of the common prevalence of multi-grade teaching whereby a single teacher simultaneously teaches more than one grade. Thus ECS in any school-grade combination is defined 12 comparison schools. We see that in most cases, there was a significant reduction in effective class size for all grades in both years of the program, with the largest reductions being achieved in grade 3. In the second year, we cannot reject the null hypothesis that the effective class size reduction was the same in all grades. In the first year, we do reject the null and it appears that most of the effective class size reductions were in grades 1 to 3 and not 4 and 5. To better understand the mechanism for the results in Table 4, we estimate the correlation between effective class size and student test score gains separately by grade, and find that the impact of class size steadily declines as the grades increase (Table A1). These point estimates are correlations and should not be interpreted as the causal effect of class size on learning gains, but the results suggest that the there is a declining effect of class size on learning gains at higher grades. The results in Table 5 and Table A1 suggest that the most likely mechanism for the results in Table 4 is the possibility that the class-size reductions brought about by having an extra contract teacher matter most for younger students. This result is consistent with the education production function proposed in Lazear (2001), where the key insight is that classroom production of education is a public good where having a disruptive child produces a negative spillover effect for the rest of the class. Thus, small classes have greater benefits when the probability of having a disruptive child is higher. This is likely to be the case for younger children – especially those who are coming to school for the first time, as is the case with first grade students in Andhra Pradesh. Our results therefore provide empirical support for the theory of education production proposed in Lazear (2001) if we assume that the youngest children in primary school are likely to be most disruptive relative to their older peers who have been acclimatized to the schooling environment. Our findings are also consistent with Krueger (1999), who finds the largest benefits from small classes for students in grade 1 in the Tennessee STAR class-size reduction experiment. 3.5. Heterogeneous treatment effects by other school/student characteristics We test for heterogeneity of the ECT program effect across student, and school characteristics by testing if δ 3 is significantly different from zero in: as the number of other students that a student in that school-grade simultaneously shares his/her teacher with. For example, consider a school with enrollment of 15, 20, 25, 15, and 15 in the five grades and with three teachers, with one teacher teaching grades 1 and 2, one teaching grade 3, and the last one teaching grades 4 and 5. In this case, the ECS in this school would be 35 in grades 1 and 2, 25 in grade 3, and 30 in grades 4 and 5. 
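The effective class size definition in the footnote above can be made concrete with a short sketch that reproduces its worked example; the helper function and its argument names are illustrative.

```python
def effective_class_size(enrollment, teacher_assignments):
    """Effective class size (ECS) by grade: the total number of students
    taught simultaneously by the teacher covering that grade (matching the
    worked example in the footnote above).

    enrollment          : dict mapping grade -> number of students
    teacher_assignments : list with one set of grades per teacher
    """
    ecs = {}
    for grades in teacher_assignments:
        pooled = sum(enrollment[g] for g in grades)
        for g in grades:
            ecs[g] = pooled
    return ecs

# Worked example: enrollments of 15, 20, 25, 15, 15 in grades 1-5,
# with three teachers covering grades {1, 2}, {3}, and {4, 5}.
enrollment = {1: 15, 2: 20, 3: 25, 4: 15, 5: 15}
teachers = [{1, 2}, {3}, {4, 5}]
print(effective_class_size(enrollment, teachers))
# {1: 35, 2: 35, 3: 25, 4: 30, 5: 30}
```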
T_{ijkm}(EL) = \alpha + \gamma \cdot T_{ijkm}(BL) + \delta_1 \cdot ECT + \delta_2 \cdot Characteristic + \delta_3 \cdot (ECT \times Characteristic) + \beta \cdot Z_m + \varepsilon_k + \varepsilon_{jk} + \varepsilon_{ijk} \quad (3.2)

Table 6 shows the results of these regressions on several school and household characteristics, and each column represents one regression testing for heterogeneous treatment effects along the characteristic mentioned (indicated by the coefficients on the interactions). The main result is that schools in more remote areas consistently benefit more from the addition of an extra contract teacher. The school proximity index ranges from 8-24, with 24 representing a school that is far from basic amenities. 30 The strong and significant positive coefficient on this interaction in both years shows that the marginal benefit of the extra contract teacher was highest in the most remote areas. A related (but weaker) result is that schools with poorer infrastructure and with fewer students also benefit more from the extra contract teacher (the interactions with infrastructure and number of students are negative and significant after two years, and negative though not significant after the first year). The other interesting result is the lack of heterogeneous treatment effects by several household and child-level characteristics. In particular, if we consider the baseline test score to be a summary statistic of all prior inputs into the child's education, then the lack of any significance on the interaction of the program with baseline scores suggests that all children benefited equally from the program regardless of their initial level of learning and that the gains from the program were quite broad. Similarly, there was no difference in program effectiveness based on household affluence, parental literacy, caste, and gender of the child.

30 The index aggregates 8 indicators coded from 1-3 indicating proximity to a paved road, a bus stop, a public health clinic, a private health clinic, public telephone, bank, post office, and the mandal educational resource center. The coding roughly corresponds to the nearest third, middle third, and furthest third of the schools on each metric. Converting to a common code based on the distribution of the raw distance allows the units to be standardized.

3.6. Differences in Teacher Effort by Contract Status
Table 7 – Panel A shows that contract teachers had significantly lower levels of absence compared to regular teachers (16.3% versus 26.8% on average over two years), with the difference being higher in the second year (12%) compared to the first year (9%). Contract teachers also had higher rates of teaching activity compared to regular teachers (49% versus 43%), though these numbers are easier to manipulate than the absence figures, because it is easier for an idle teacher to start teaching when he/she sees an enumerator coming to the school than for an absent teacher to materialize during a surprise visit to the school. These differences in rates of absence and teaching activity are even higher with school fixed effects, suggesting that the presence of the contract teachers may have induced regular teachers to shirk a little more. 
We can test this directly by comparing the absence rates of regular teachers in comparison schools with those in program schools and we see that regular teachers in program schools do have higher rates of absence and lower rates of teaching activity than their counterparts in comparison schools (Table 7 – Panel B), and that these differences are significant when aggregated across both years of the program. Thus, our estimate of the impact of an additional contract teacher is a composite estimate that includes the reduction in effort of regular teachers induced by the presence of the extra contract teacher. The pure education production function effect of an additional contract teacher to schools is likely to be even higher. The superior performance of contract teachers on measures such as attendance and teaching activity is most likely due to a combination of factors. These include being from the local area and feeling more connected to the community, living much closer to the school and therefore having lower marginal costs of attendance, or the superior incentives from being on annually renewable contracts without the job security of civil-service tenure. 31 We do not attempt to decompose the relative importance of these factors, since contract teachers in most Indian states share all these characteristics and our aim is to evaluate an “as is” expansion of their use. 4. Comparing Contract and Regular Teachers The experimental results establish that the marginal product of contract teachers is positive and that expanding contract teacher programs as currently implemented in India is likely to improve student learning outcomes. However, the broader question is that of the relative effectiveness of regular and contract teachers and the optimal ratio in which they should be used. Economic theory suggests that optimal production of education would use expensive better- 31 We find some suggestive evidence of the last point by looking at the correlates of renewal of contract teachers and find that teachers with lower absence rates are more likely to have their contracts renewed. However, we do not observe the “decision to renew” an individual teacher’s contract conditional on re-application for the job, but only observe if the same contract teacher is still in the school the next year. Hence, the correlation is only suggestive of incentive effects of the renewable contract and should not be interpreted causally (for instance, teachers who know that they plan to leave next year may be more absent). 15 qualified regular teachers and inexpensive less-qualified contract teachers in the proportion where the ratio of marginal costs equals the ratio of marginal productivity. Since the ratio of costs is known, what is needed is an estimate of the ratios of marginal productivity from adding an additional teacher of each type. We use our rich panel dataset to construct four different non- experimental estimates of the relative effectiveness of contract teachers and regular teachers on student learning gains (two using within-school variation and two using between-school variation), and also conduct several robustness checks on each of these estimates. 4.1. School Fixed Effects Since we can match students in each year to their teacher and know the teacher type, we first estimate the effect on gains in student learning of being taught by a contract teacher as opposed to a regular teacher (note that the same teacher teaches all subjects in primary schools in Andhra Pradesh). 
The specification used is:

T_{ijkm}(Y_n) = \alpha + \gamma_j \cdot T_{ijkm}(Y_0) + \delta \cdot CT + \beta \cdot X_i + \varepsilon_k + \varepsilon_{jk} + \varepsilon_{ijk} \quad (4.1)

where the test score variables and error terms are defined as in (3.1), CT is a dummy for whether the student in question was taught by a contract teacher, and X_i includes a rich set of school and household controls that are progressively added to verify the robustness of the results. Our main result is that there is no differential effect on learning gains for students taught by contract teachers relative to those taught by regular teachers (Table 8 - Column 1).

The result above uses variation in student assignment to teacher type between schools as well as variation within schools, both of which raise identification issues. The concern with using between-school variation is that there are omitted variables correlated with the presence of contract teachers in a school as well as the rate of learning growth. The concern with using within-school variation is that the assignment of teachers to students is endogenous to the school. While these concerns are substantially mitigated by our controlling for baseline scores (which are the best summary statistic of cumulative education inputs prior to the start of the study), 32 we do more to address these concerns below.

32 Trying to infer teacher quality without controlling for baseline scores is quite problematic because such a specification would attribute to a given teacher the cumulative contributions to learning of all past teachers.

We first shut down the between-school variation by estimating the equation above with school fixed effects (i.e., we use only within-school variation) and find that the point estimate is 
Note also that the large sample size and the inclusion of baseline scores means that the zero effects are quite precisely estimated and it is not the case that we are refusing to reject the null because of wide confidence intervals. While the result of no differential effect by teacher type is robust to the procedures above, we cannot rule out the possibility that there may still be omitted variables correlated with teacher assignment to cohorts and potential test score gains. 4.2. Student Fixed Effects One way of mitigating this concern is to consider the sample of students who switch from one teacher type to the other during their regular progression through school. We do this and estimate the differential impact of teacher type using student fixed-effects and still find that there 33 See Rothstein (2010) for an illustration of this concern in value-added modeling. 34 These regressions are available from the authors on request. 17 is no difference between regular and contract teachers (Table 9 - Column 1). The results are robust to including class size and a dummy for multi-grade teaching (Table 9 - Column 2). Finally, given that teachers get re-assigned on a periodic basis, a further robustness check is to restrict the estimation sample to cases where the same teacher was assigned to the same grade in both years (i.e. the identifying variation comes from a cohort of students moving across teachers who are fixed in specific grades, and thus teachers in this sample cannot be getting re-assigned on the basis of cohort-level unobservables) and we again find no difference between teacher types (Table 9 - Column 3). As in Table 8, we conduct a final robustness check by carrying out all three estimations in the treatment schools only (Columns 4-6) and find the same result. While truncating the sample may increase the probability of not rejecting the null, note again that the use of student fixed effects and the inclusion of baseline scores means that we have very precisely estimated zero effects. While, the estimates in Tables 8 and 9 use within-school variation, we can also estimate the impact of contract teachers using only between-school variation. The first advantage of this approach is that we don't have to worry about endogenous assignment of teachers to grades and the second one is that policy makers can only assign a teacher to a school and cannot typically prevent schools from reassigning additional teacher resources as they see fit. Thus, the most relevant policy question is the relative impact of adding a contract teacher to a school versus that of adding a regular teacher to a school. We address this question in two ways below. 4.3. Fraction of Contract Teachers in the School We first look at the correlation between gains in student learning and the fraction of contract teachers in a school in a specification similar to (4.1) that includes dummies for the number of teachers and controls for enrollment, and find that the fraction of contract teachers is positively correlated with gains in learning, though this is not significant (Table 10, Panel A, Column 1). Including controls for linear, quadratic, and cubic terms of student enrollment, school infrastructure, and school proximity does not change this result and neither does including controls for household affluence and parental literacy (Columns 2 and 3). 
Including a quadratic term in the fraction of contract teachers also does not change the result that student learning gains across schools are not affected by the fraction of contract teachers, while holding the total number of teachers constant (Columns 4-6).

The main identification concern here is whether there are omitted variables across schools that could be correlated with both student learning trajectories and the prevalence of contract teachers. We address this partially by re-estimating all the equations in Panel A with mandal (sub-district) fixed effects in Panel B and report that there is still no correlation between the fraction of contract teachers in a school and student learning gains. 35 Recall from the discussion of sampling that the total sample of 200 schools consists of 4 schools in each of 50 mandals across the state (with two treatment and two control schools in each mandal). Thus, estimating with mandal fixed effects eliminates concerns of omitted variables across administrative jurisdictions and is identified using variation within the lowest administrative unit in the state (sub-districts). Finally, note that the addition of school fixed effects in Table 8 did not change the estimate of the "contract teacher effect", which suggests that the between-school variation in the prevalence of contract teachers is not correlated with omitted variables that may also account for differential learning growth trajectories across schools.

35 Note that the combined effect of the linear and quadratic terms yields a positive point estimate for the correlation between the percentage of contract teachers and student learning gains for most values of the percentage of contract teachers, but this positive estimate is not significant (this is also true for the specification with mandal fixed effects in Table 10, Panel B, Column 4, where even though the individual coefficients are significant, the combined effect is insignificant at all values of the percentage of contract teachers).

4.4. School-level Pupil-Teacher Ratio (PTR)
Finally, we consider the impact of the school-level pupil-teacher ratio (PTR) on learning outcomes and study the differential impact of reducing PTR with a regular teacher and with a contract teacher. We first restrict our analysis sample to the control schools that don't have any contract teacher and the treatment schools with exactly one contract teacher (i.e., schools that would not have had a contract teacher but for the experiment). 36 We calculate PTR (and log_PTR) in these schools using regular teachers only (i.e., for treatment schools this is the counterfactual PTR had they not received the treatment), and then calculate the reduction in log_PTR in the treatment schools caused by the addition of the extra (randomly assigned) contract teacher. We include both the original log_PTR (using regular teachers only) and the reduction in log_PTR (red_log_PTR) induced by the provision of the extra contract teacher as regressors in the specification:

T_{ijkm}(Y_n) = \alpha + \gamma_j \cdot T_{ijkm}(Y_{n-1}) + \beta_1 \cdot \text{log\_PTR} + \beta_2 \cdot \text{red\_log\_PTR} + \delta \cdot X_i + \varepsilon_k + \varepsilon_{jk} + \varepsilon_{ijk} \quad (4.2)

Thus, unlike in section 3 where the treatment is a binary variable, the treatment indicator here (red_log_PTR) is allowed to vary to reflect the marginal impact of the extra contract teacher on PTR, which will vary depending on the initial PTR in the school and be zero for the control schools.

36 This includes around 70% of the sample, since around 30% of the schools had a contract teacher to begin with. 
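A minimal sketch of how the two PTR regressors in specification (4.2) might be constructed and used follows; the dataframe and column names (enrollment, n_regular_teachers, ect, z_endline, z_lagged, mandal_id, school_id) are illustrative assumptions rather than the study's actual variable names.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def add_ptr_regressors(schools: pd.DataFrame) -> pd.DataFrame:
    """Construct log_PTR (regular teachers only) and red_log_PTR (the
    reduction induced by the randomly assigned extra contract teacher).

    Hypothetical columns: enrollment, n_regular_teachers,
    ect (1 if the school received the extra contract teacher).
    """
    out = schools.copy()
    # Counterfactual pupil-teacher ratio, counting regular teachers only.
    out["log_ptr"] = np.log(out["enrollment"] / out["n_regular_teachers"])
    # PTR after adding the experimental contract teacher
    # (unchanged for comparison schools, where ect == 0).
    log_ptr_with_ect = np.log(
        out["enrollment"] / (out["n_regular_teachers"] + out["ect"])
    )
    out["red_log_ptr"] = out["log_ptr"] - log_ptr_with_ect
    return out

# Student-level regression corresponding to (4.2), with school-clustered
# standard errors (hypothetical column names):
# fit = smf.ols(
#     "z_endline ~ z_lagged + log_ptr + red_log_ptr + C(mandal_id)",
#     data=students,
# ).fit(cov_type="cluster", cov_kwds={"groups": students["school_id"]})
```

By construction, red_log_ptr equals log((regular teachers + 1) / regular teachers) in treatment schools and zero in comparison schools, so the coefficient on it captures the experimentally induced class-size reduction.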
The point estimates suggest that reducing PTR with an extra contract teacher is almost twice as effective in improving student learning as reducing PTR with an extra regular teacher (0.34 versus 0.18), though this difference is not significant (Table 11, Panel A, Column 2). We also estimate (4.2) with log_PTR and with red_log_PTR one at a time and verify that the point estimates of β1 and β2 are unchanged, confirming the validity of the experiment (β1 is unchanged between columns 1 and 2, and β2 is unchanged between columns 2 and 3 of Table 11). Since we have an unbiased experimental estimate of β2, the identification concerns are with respect to β1, which is estimated using non-experimental between-school variation. We apply the same robustness checks as in Table 10 and include the same rich set of school and household controls, and find that β1 is close to unchanged (β2 of course remains unchanged as it is an experimental estimate). Finally, we extend the analysis to the full sample of schools in Panel B of Table 11, where the only difference is that the regressors include log_PTR based on regular teachers only (in all schools), the reduction in log_PTR due to the non-experimental contract teacher (in schools that already had one prior to the experimental intervention), and the reduction in log_PTR induced by the experimentally-provided extra contract teacher. The results in Table 11, Panel B are similar to those in Panel A, and we find again that reducing PTR with an extra contract teacher is around twice as effective in improving student learning as reducing PTR with an extra regular teacher (0.33 versus 0.15), though this difference is not significant. We run all the specifications in Table 11 with mandal (sub-district) fixed effects and the results on relative teacher effectiveness are unchanged (tables available on request).

While the results in Table 11 don't eliminate all identification concerns, they provide the most convincing evidence that regular teachers are no more effective than contract teachers. First, β2 is unbiased because it is an experimental estimate using the school-level random assignment of the extra contract teacher. Second, since the estimated magnitude of β1 is around half that of β2, the true estimate of β1 would have to more than triple in magnitude for us to conclude that β1 is significantly greater than β2 (given the standard errors on both estimates). This is extremely unlikely since including a full set of controls barely changes the estimate of β1.

While identification concerns are not fully eliminated, finding the same result with four different estimation methods (Tables 8 to 11), and finding the result to be robust to the inclusion of rich school and household covariates as well as school and student fixed effects, gives us confidence in concluding that contract and regular teachers are equally effective in improving primary school learning outcomes at the current margin. One limitation of this analysis is that there are several ways in which contract and regular teachers are different (see Table 1), and we do not decompose the relative importance of these factors in teacher effectiveness, since there is no identifying variation for the individual components in Table 1. Thus, we focus on the overall comparison of the two different types of teachers being used in the status quo and conclude that they appear to be equally effective. 
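Connecting this back to the optimality condition stated at the start of this section, a stylized summary (using the paper's cost ratio of roughly five and treating the estimated marginal products of the two teacher types as equal; MP and w are our shorthand for marginal product and wage, not the paper's notation) is:

\frac{MP_{contract}}{MP_{regular}} \approx 1 \quad \text{while} \quad \frac{w_{regular}}{w_{contract}} \approx 5 \quad \Rightarrow \quad \frac{MP_{contract}}{w_{contract}} \approx 5 \cdot \frac{MP_{regular}}{w_{regular}}

In other words, at the current margin a rupee spent on contract teachers appears to buy several times the learning gain of a rupee spent on regular teachers, which is why the estimated mix is far from the point where the ratio of marginal products equals the ratio of marginal costs.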
One additional concern in making this comparison is that we are comparing the marginal contract teacher with the average regular teacher (since the majority of contract teachers in our sample are hired as result of the intervention). Thus, the relevant comparison for teacher hiring is between a contract teacher and a new regular teacher. We address this by re-estimating (4.1) in three further estimation samples, restricting the regular teacher sample to those who have been teaching for less than three, five, and ten years respectively, and again do not reject the null of equal effectiveness in all three estimation samples. Since regular teachers cost around five times more than contract teachers 37, our results suggest that expanding the use of contract teachers may be a highly cost effective way of improving learning outcomes. 5. Public and Private Production of Education A prominent feature of primary education in India over the past ten years has been the rapid increase in the number of private schools (Muralidharan and Kremer, 2008) catering to an increasing number of students with nearly 20% of primary school students in rural Andhra Pradesh attending a fee-charging private school (Pratham, 2010). Since fee-charging private 37 Reasons for this wage premium are likely to include higher education (and corresponding outside opportunities), a compensating differential to locate to remote areas (since most regular teachers live in cities), a union/civil-service premium, and other inefficiencies in the wage-setting process for public employees. We don’t aim to decompose the wage premium in this discussion, but focus on the optimal ratio of expensive highly-qualified and inexpensive less- qualified teachers. 21 schools need to compete against free public schools as well as other fee-charging schools for students and also need to compete for teachers (and their characteristics), they are likely to face better incentives than public schools to operate close to the efficient frontier of education production, where the desired quality of education is produced at the lowest possible cost. As part of an ongoing study of school vouchers and choice, we also collected detailed data on teachers in private schools in the same five districts where the current study was conducted, and Table 12 compares regular teachers, contract teachers, and private school teachers (sampled from the same villages) 38 on a range of characteristics. The age and gender profile of private school teachers are similar to those of contract teachers (younger and more likely to be female than regular teachers). Private school teachers have higher levels of general education, but even lower levels of teacher training than contract teachers. They live much closer to the school and are more likely to be from the same village relative to regular teachers (though less so than contract teachers). But, the most relevant comparison is that the salaries of private school teachers are even lower than those of contract teachers and only around an eighth of regular teacher salaries. Figure 2 plots the salary distribution of teachers in government and private schools, and we see that the distribution of salaries in private schools is around the range of the contract teachers’ salaries, and there is almost no common support between distributions of private and regular public school teacher salaries. 
Finally, private school teachers and contract teachers have similarly low rates of absence, which are around half that of the regular teachers in spite of being paid much lower salaries. The private school data helps clarify the context of teacher labor markets in rural India and provides important guidance for thinking about expanding the use of contract teachers in government schools. First, the employment terms of contract teachers are not ‘exploitative’ as believed by opponents of their use, but in line with the market clearing wage paid by private schools. While their terms might seem exploitative when working side by side with regular 38 Note that this is a different sample from that used in Table 1. The sample in Table 1 is representative of rural government-run schools, which is the focus of this paper; the sample in Table 12 is from a sample of villages that have private schools (which tend to be larger). The data for Table 12 was also collected 3 years later than the data used for Table 1. AP government policies on contract teacher salaries now provides for some differentiation by education and experience, which accounts for the distribution in Figure 2. The lower absence rates of regular teachers in Table 12 as opposed to in Table 7 are also likely to be because the sample used for Table 12 is drawn from larger villages that are less remote. 22 teachers and doing the same work for a fraction of the salary, the distortion is not the ‘low’ contract teacher salaries but rather the large rents accruing to regular teachers. Second, the policy-relevant question is not the comparison of one regular teacher to one contract teacher (which is what the literature as well as the policy discussions have focused on), but rather the comparison of one regular teacher to several contract teachers. In earlier work by one of the authors, we find that while private schools pay much lower teacher salaries than what the government pays regular teachers, they also hire many more teachers per student and have pupil teacher ratios that are around a third that of the public schools in the same village (Muralidharan and Kremer, 2008). Thus it appears that a politically unconstrained producer of primary education services would pay salaries that are close to that of contract teachers, but hire many more teachers. To the extent that the input combination used by private schools is likely to be closer to the efficient frontier of education production, expanding the use of contract teachers in government-run schools may be a way of moving public production of education closer to the efficient frontier. Third, since private schools are able to fill their teacher positions with salaries that are even lower than those of contract teachers, an expansion of contract teacher hiring is unlikely to hit a supply constraint at current salary levels. 39 Also, none of the 100 treatment schools in our experiment reported any difficulty in filling the position and the majority of positions were filled within 2 weeks from the start of the search. More broadly, the pool of educated but unemployed rural high-school and college graduates from which contract and private school teachers are hired appears to be large enough for the labor supply of contract teachers to be fairly elastic (Kingdon and Sipahimalani-Rao, 2010). 40 6. 
Conclusion Regular teachers in India are well qualified, but command a substantial wage premium (greater than a factor of five) over the market clearing wage of private school (and contract) teachers that can be explained partly by their better education and outside opportunities, partly 39 One caveat is that equally qualified teachers may be willing to accept a lower salary in private schools if there are other compensating differentials like being able to teach more motivated students. 40 Another contributing factor may be that limited job opportunities for educated rural women (who have cultural and family preferences for working in the same village) within the village may be providing a subsidy to the teaching sector (Andrabi et al, 2007). Similar patterns have been documented in the history of education in developed countries. 23 by a compensating differential to locate to rural and remote areas, and partly by a union and civil-service premium/rent. The hiring of contract teachers can be a much more cost-efficient way of adding teachers to schools because none of these three sources of wage premiums are applicable for them. However, since locally-hired contract teachers are not as qualified or trained as civil-service teachers, opponents of the use of the contract teachers have posited that the use of contract teachers will not lead to improved learning. We present experimental evidence from an “as is” expansion of the existing contract teacher policy of the government of Andhra Pradesh, implemented in a randomly selected subset of 100 schools among a representative sample of schools in rural AP. We find that adding a contract teacher significantly improved average learning outcomes in treatment schools, and especially benefited the children in the first grade (the first year of formal schooling since there is no kindergarten) and those in more remote areas. We also find using four different non- experimental estimation procedures that contract teachers are no less effective in improving student learning than regular teachers who are more qualified, better trained, and paid five times higher salaries. The combination of low cost, superior performance measures than regular teachers on attendance and teaching activity, and positive program impact suggest that expanding the use of contract teachers could be a highly cost effective way of improving primary education outcomes in developing countries. In particular, expensive policy initiatives to get highly qualified teachers to remote areas may be much less cost effective than hiring several local contract teachers to provide much more attention to students at a similar cost. Observing the input choices of private schools suggests that this is what a politically unconstrained producer of rural education services would do. Another way of thinking about the inefficiency in the status quo is to consider the teacher hiring choices that a locally-elected body responsible for delivering primary education would make. Informal interviews with elected village leaders suggest that they would almost always choose to hire several local teachers as opposed to one or two civil- service teachers who are not connected to the community (though this is the de facto choice made for them under the status quo). 
Opponents of the use of contract teachers worry that their expanded use may lead to a permanent second-class citizenry of contract teachers, which in the long run will erode the professional spirit of teaching and shift the composition of the teacher stock away from trained teachers towards untrained teachers. Thus, even if expanding the use of contract teachers is beneficial in the short run, it might be difficult to sustain a two-tier system of teachers in the long run. Finally, the political economy concern is that hiring larger numbers of contract teachers will lead to demands to be regularized into civil-service status, which may be politically difficult to resist given the strength of teacher unions; if such regularization were to happen, it would defeat the purpose of hiring a large number of contract teachers in the first place.

One possible course of action is to hire all new teachers as contract teachers at the school level, and to create a system to measure their performance over a period of time (six to eight years, for example) that would include inputs from parents, senior teachers, and measures of value addition using independent data on student performance.41 These measures of performance could be used in the contract-renewal decision at the end of each fixed-term contract (or to pay bonuses), and consistently high-performing contract teachers could be promoted to regular civil-service rank at the end of a fixed period of time. In other words, contract teachers need not be like permanent adjunct faculty, but can be part of a performance-linked tenure track. Continuous training and professional development could be a natural component of this career progression, and integrating contract and regular teachers into a single career path should help to address most of the concerns above, including the political economy ones.

41 Gordon et al. (2006) provide a similar recommendation for the US (as part of the Hamilton Project) on identifying effective teachers by measuring their on-the-job performance. In related work, we show that even small amounts of performance-linked pay for teachers based on measures of value addition led to substantial improvements in student learning, with no negative consequences (Muralidharan and Sundararaman, 2009).

The perception that contract teachers are of inferior quality and that their use is a stop-gap measure to be eliminated by raising education spending enough to hire regular teachers is deeply embedded in the status quo education policy discourse (and has been formalized in the recently passed "Right to Education" Act of the Indian Parliament).42 The results in this paper suggest that this view is not supported by the evidence. The use of locally-hired teachers on fixed-term renewable contracts can be a highly effective policy for improving student learning outcomes (especially since many more such teachers can be hired for a given budget). While there are valid concerns about the long-term consequences of expanding contract teacher programs, many of these can be addressed by placing the increased use of contract teachers in the context of a
long-term professional career path that allows for continuous training and professional development, and rewards effort and effectiveness at all stages of a teaching career. Pritchett and Murgai (2007) provide a practical discussion of how such a system might be implemented in practice, and their paper is an excellent policy-focused complement to this one.43

42 This belief is not limited to India and is widespread in education policy discourse in most countries. For example, the Indonesian government passed a law in 2005 requiring all teachers to get certified and offering a doubling of salary for certified teachers. The law also provides for a 100% salary supplement to certified teachers who serve in remote and underserved areas.

43 Pritchett and Murgai (2007) discuss how such a structured career ladder for teachers can be embedded within a more decentralized education system that gives local communities more autonomy in managing schools. Pritchett and Pande (2006) provide a related discussion on decomposing education management into components and suggesting appropriate levels of decentralization for each component based on theoretical principles of fiscal federalism. The recommendation for a career ladder is also made by Kingdon and Sipahimalani-Rao (2010).

REFERENCES:

ANDRABI, T., J. DAS, and A. KHWAJA (2007): "Students Today, Teachers Tomorrow? Identifying Constraints on the Provision of Education," Harvard University.
ANDRABI, T., J. DAS, A. KHWAJA, and T. ZAJONC (2008): "Do Value-Added Estimates Add Value? Accounting for Learning Dynamics," Harvard University.
ANGRIST, J. D., and V. LAVY (1999): "Using Maimonides' Rule to Estimate the Effect of Class Size on Scholastic Achievement," Quarterly Journal of Economics, 114, 533-575.
BALLOU, D. (1996): "Do Public Schools Hire the Best Applicants?," Quarterly Journal of Economics, 111, 97-133.
BANDIERA, O., A. PRAT, and T. VALLETTI (2009): "Active and Passive Waste in Government Spending: Evidence from a Policy Experiment," American Economic Review, 99, 1278-1308.
BANERJEE, A., S. COLE, E. DUFLO, and L. LINDEN (2007): "Remedying Education: Evidence from Two Randomized Experiments in India," Quarterly Journal of Economics, 122, 1235-1264.
BARDHAN, P. (2002): "Decentralization of Governance and Development," Journal of Economic Perspectives, 16, 185-205.
BETTINGER, E., and B. T. LONG (Forthcoming): "Does Cheaper Mean Better? The Impact of Using Adjunct Instructors on Student Outcomes," Review of Economics and Statistics.
BLOOM, N., and J. VAN REENEN (2010): "Why Do Management Practices Differ across Firms and Countries?," Journal of Economic Perspectives, 24, 203-224.
BOURDON, J., M. FRÖLICH, and K. MICHAELOWA (2006): "Broadening Access to Primary Education: Contract Teacher Programs and Their Impact on Education Outcomes in Africa – an Econometric Evaluation for Niger," in Pro-Poor Growth: Issues, Policies, and Evidence, ed. by L. Menkhoff. Berlin: Duncker & Humblot, 117-149.
— (2007): "Teacher Shortages, Teacher Contracts and Their Impact on Education in Africa," Institute for the Study of Labor (IZA), Berlin.
DE LAAT, J., and E. VEGAS (2005): "Do Differences in Teacher Contracts Affect Student Performance? Evidence from Togo," World Bank.
DUFLO, E., P. DUPAS, and M. KREMER (2010): "Pupil-Teacher Ratios, Teacher Management, and Education Quality: Experimental Evidence from Kenya," MIT.
DUTHILLEUL, Y. (2005): "Lessons Learnt in the Use of 'Contract' Teachers," International Institute for Educational Planning, UNESCO.
GORDON, R., T. KANE, and D. STAIGER (2006): "Identifying Effective Teachers Using Performance on the Job," Washington DC: The Brookings Institution.
GOVINDA, R., and J. YAZALI (2004): "Para-Teachers in India: A Review," New Delhi: National Institute of Educational Planning and Administration.
HAINES, A., D. SANDERS, U. LEHMANN, A. K. ROWE, J. E. LAWN, S. JAN, D. G. WALKER, and Z.
BHUTTA (2007): "Achieving Child Survival Goals: Potential Contribution of Community Health Workers," The Lancet, 369, 2121-2131. HANUSHEK, E. A. (1999): "The Evidence on Class Size," in Earning and Learning: How Schools Matter, ed. by S. Mayer, and P. Peterson. Washington DC: Brookings Institution. — (2002): "Publicly Provided Education," in Handbook of Public Economics, ed. by A. J. Auerbach, and M. S. Feldstein. Amsterdam: North-Holland, 2045-2141. — (2003): "The Failure of Input-Based Schooling Policies," Economic Journal, 113, F64-98. 27 HECKMAN, J., and J. SMITH (1995): "Assessing the Case of Social Experiments," Journal of Economic Perspectives, 9, 85-110. JACOB, V., A. KOCHAR, and S. REDDY (2008): "School Size and Schooling Inequalities," Stanford. KANE, T. J., J. E. ROCKOFF, and D. O. STAIGER (2008): "What Does Certification Tell Us About Teacher Effectiveness? Evidence from New York City," Economics of Education Review, 27, 615-631. KINGDON, G. G., and M. MUZAMMIL (2001): "A Political Economy of Education in India: The Case of U.P.," Economic and Political Weekly, 36. KINGDON, G. G., and V. SIPAHIMALANI-RAO (2010): "Para-Teachers in India: Status and Impact," Economic and Political Weekly, XLV, 59-67. KLEINER, M. (2000): "Occupational Licensing," Journal of Economic Perspectives, 14, 189-202. KRUEGER, A. (1999): "Experimental Estimates of Education Production Functions," Quarterly Journal of Economics, 114, 497-531. — (2003): "Economic Considerations and Class Size," Economic Journal, 113, 34-63. KUMAR, K., M. PRIYAM, and S. SAXENA (2005): "The Trouble with Para-Teachers," Frontline, 18. LAZEAR, E. (2001): "Educational Production," Quarterly Journal of Economics, 116, 777-803. MEHTA, A. (2007): "Elementary Education in India: Where Do We Stand? State Report Cards 2005-06," New Delhi: National University of Education Planning and Administration. MURALIDHARAN, K., and M. KREMER (2008): "Public and Private Schools in Rural India," in School Choice International, ed. by P. Peterson, and R. Chakrabarti. Cambridge: MIT. MURALIDHARAN, K., and V. SUNDARARAMAN (2009): "Teacher Performance Pay: Experimental Evidence from India," National Bureau of Economic Research Working Paper 15323. PRATHAM (2010): Annual Status of Education Report. PRITCHETT, L. (2004): "Access to Education," in Global Crises, Global Solutions, ed. by B. Lomborg, 175-234. PRITCHETT, L., and R. MURGAI (2007): "Teacher Compensation: Can Decentralization to Local Bodies Take India from Perfect Storm through Troubled Waters to Clear Sailing?," in India Policy Forum 2006-07, ed. by S. Bery, B. Bosworth, and A. Panagariya: Sage Publications. PRITCHETT, L., and V. PANDE (2006): "Making Primary Education Work for India's Rural Poor: A Proposal for Effective Decentralization," New Delhi: World Bank. ROTHSTEIN, J. (2010): "Teacher Quality in Educational Production: Tracking, Decay, and Student Achievement," Quarterly Journal of Economics, 125, 175-214. SAWADA, Y., and A. RAGATZ (2005): "Decentralization of Education, Teacher Behavior, and Outcomes: The Case of El Salvador’s Educo Program," in Incentives to Improve Teaching: Lessons from Latin America, ed. by E. Vegas. Washington DC: The World Bank. TAMURA, R. (2001): "Teachers, Growth, and Convergence," Journal of Political Economy, 109, 1021-1059. URQUIOLA, M. (2006): "Identifying Class Size Effects in Developing Countries: Evidence from Rural Bolivia," Review of Economics and Statistics, 88, 171-177. 
28 Table 1: Characteristics By Teacher Type Panel A: Regular versus Contract Teachers (Control Schools) Regular Teachers Contract Teachers P-value (H0: Diff=0) Male 65.7% 28.1% 0.000*** Age 39.13 24.45 0.000*** College Degree or Higher 84.5% 46.9% 0.000*** Formal Teacher Training Degree or Certificate 98.7% 12.5% 0.000*** Received any Training in last twelve months 91.8% 59.4% 0.000*** From the same village 9.0% 81.3% 0.000*** Distance from home to school (km) 12.17 0.844 0.000*** Teacher Salary (Rs./month) 9013.6 1000 (1500) 0.000*** Panel B: Comparison of Contract teacher characteristics in Control and Treatment schools Contract Teachers in Contract Teachers in P-value (H0: Diff=0) Treatment Schools Control Schools Male 31.9% 28.1% 0.70 Age 26.05 24.45 0.14 College Degree or Higher 47.2% 46.9% 0.98 Formal Teacher Training Degree or Certificate 16.0% 12.5% 0.60 Received any Training in last twelve months 44.4% 59.4% 0.11 From the same village 88.2% 81.3% 0.37 Distance from home to school (km) 0.646 0.844 0.50 Teacher Salary (Rs./month) 1000 (1500) 1000 (1500) 0.45 Notes: 1. Table reports summary statistics from the first year of the project (2005 - 06). The teacher characteristics were similar in the second year as well (2006 - 07). The only difference was that contract teacher salary was Rs. 1000/month in the first year, but increased to Rs. 1,500 across the entire state in the second year * significant at 10%; ** significant at 5%; *** significant at 1% Table 2: Sample Balance Across Treatment and Comparison Groups Panel A (Mean Pre-program Characteristics) [1] [2] [3] Extra Contract Comparison Schools P-value (H0: Diff=0) Teacher Schools School-level Variables 1 Total Enrollment (Baseline: Grades 1-5) 113.2 104.6 0.41 2 Total Test-takers (Baseline: Grades 2-5) 64.9 62.0 0.59 3 Number of Teachers 3.07 2.83 0.24 4 Pupil-Teacher Ratio 39.5 39.8 0.94 5 Infrastructure Index (0-6) 3.19 3.13 0.84 6 Proximity to Facilities Index (8-24) 14.65 14.97 0.55 Baseline Test Performance 7 Math (Raw %) 18.47 17.27 0.34 8 Math (Normalized - in Std. deviations) 0.041 -0.043 0.29 9 Telugu (Raw %) 35.1 34.27 0.63 10 Telugu (Normalized - in Std. deviations) 0.019 -0.020 0.62 Panel B (Mean Turnover/Attrition During Program) Teacher Turnover and Attrition [1] [2] [3] Extra Contract Comparison Schools P-value (H0: Diff=0) Teacher Schools Year 1 on Year 0 11 Teacher Attrition (%) 0.30 0.31 0.80 12 Teacher Turnover (%) 0.34 0.33 0.85 Year 2 on Year 1 13 Teacher Attrition (%) 0.04 0.07 0.14 14 Teacher Turnover (%) 0.05 0.05 0.94 Year 2 on Year 0 15 Teacher Attrition (%) 0.32 0.36 0.35 16 Teacher Turnover (%) 0.37 0.37 0.99 Student Turnover and Attrition Year 1 on Year 0 17 Student Attrition from baseline to end of year tests 0.08 0.07 0.28 18 Baseline Maths test score of attritors -0.15 -0.19 0.73 19 Baseline Telugu test score of attritors -0.26 -0.28 0.89 Year 2 on Year 0 20 Student Attrition from baseline to end of year tests 0.26 0.24 0.50 21 Baseline Maths test score of attritors -0.12 -0.06 0.53 22 Baseline Telugu test score of attritors -0.20 -0.16 0.69 Notes: 1. The school infrastructure index sums 6 binary variables (coded from 0 - 6) indicating the existence of a brick building, a playground, a compound wall, a functioning source of water, a functional toilet, and functioning electricity. 2. 
The school proximity index ranges from 8-24 and sums 8 variables (each coded from 1-3) indicating proximity to a paved road, a bus stop, a public health clinic, a private health clinic, public telephone, bank, post office, and the mandal educational resource center. 3. Teacher attrition refers to the fraction of teachers in the school who left the school during the year, while teacher turnover refers to the fraction of new teachers in the school at the end of the year (both are calculated relative to the list of teachers in the school at the start of the year) 4. The t-statistics for the baseline test scores and attrition are computed by treating each student/teacher as an observation and clustering the standard errors at the school level (Grade 1 did not have a baseline test). The other t-statistics are computed treating each school as an observation. Table 3: Impact of Extra Contract Teacher on Student Test Scores Panel A: Combined Dependent Variable = Normalized End of Year Test Score Year 1 on Year 0 Year 2 on Year 1 Year 2 on Year 0 [1] [2] [3] [4] [5] [6] Extra Contract Teacher School 0.092 0.09 0.09 0.095 0.141 0.146 (0.035)*** (0.034)*** (0.037)** (0.039)** (0.044)*** (0.047)*** School and Household Controls No Yes No Yes No Yes Observations 44168 40557 41624 37219 41927 36499 R-squared 0.337 0.361 0.311 0.319 0.196 0.216 Panel B: Maths Dependent Variable = Normalized End of Year Test Score Year 1 on Year 0 Year 2 on Year 1 Year 2 on Year 0 [1] [2] [3] [4] [5] [6] Extra Contract Teacher School 0.11 0.105 0.096 0.107 0.153 0.165 (0.039)*** (0.039)*** (0.044)** (0.048)** (0.050)*** (0.053)*** School and Household Controls No Yes No Yes No Yes Observations 21951 20157 20781 18590 20878 18170 R-squared 0.316 0.339 0.276 0.28 0.185 0.2 Panel C: Telugu Dependent Variable = Normalized End of Year Test Score Year 1 on Year 0 Year 2 on Year 1 Year 2 on Year 0 [1] [2] [3] [4] [5] [6] Extra Contract Teacher School 0.075 0.074 0.086 0.085 0.128 0.126 (0.035)** (0.034)** (0.033)*** (0.035)** (0.041)*** (0.044)*** School and Household Controls No Yes No Yes No Yes Observations 22217 20400 20843 18629 21049 18329 R-squared 0.372 0.396 0.362 0.377 0.221 0.246 Notes: 1. All regressions include mandal (sub-district) fixed effects and standard errors clustered at the school level. They also include lagged normalized test scores interacted with grade, where the normalised lagged test score is set to 0 for students in grade 1 or for students in grade 2 in the 2-year regressions. All test scores are normalized relative to the distribution of scores in the control schools in the same grade, test, and year. 2. The two year treatment effect regressions (Year 2 on Year 0) include students who entered grade 1 in the second year of the program and who were there in the schools at end of two years of the program, but who have only been exposed to the program for one year at the end of two years of the program. 3. School controls include infrastructure and proximity indices as defined in Table 2. Household controls include a household asset index, parent education index (both defined as in Table 6), child gender an indicator for being from a disadvantaged caste/tribe. 4. Constants are insignificant in all specifications and are not shown. 
* significant at 10%; ** significant at 5%; *** significant at 1% Table 4: Impact of Extra Contract Teacher (ECT) by Grade Dependent Variable = Normalized End of Year Test Score Combined Math Telugu (Language) Y1 on Y0 Y2 on Y0 Y1 on Y0 Y2 on Y0 Y1 on Y0 Y2 on Y0 [1] [3] [4] [6] [7] [9] ECT * Grade 1 0.204 0.286 0.245 0.227 0.165 0.345 (0.077)*** (0.069)*** (0.082)*** (0.071)*** (0.078)** (0.077)*** ECT * Grade 2 0.18 0.137 0.194 0.113 0.169 0.162 (0.058)*** (0.070)* (0.064)*** -0.079 (0.060)*** (0.066)** ECT * Grade 3 0.04 0.207 0.071 0.273 0.009 0.141 (0.050) (0.074)*** -0.055 (0.086)*** -0.053 (0.069)** ECT * Grade 4 0.122 0.014 0.167 0.03 0.073 -0.005 (0.045)*** (0.055) (0.055)*** (0.066) (0.044)* (0.054) ECT * Grade 5 -0.019 0.115 -0.049 0.169 0.012 0.056 (0.050) (0.050)** (0.055) (0.065)*** (0.056) (0.049) Observations 44168 41927 21951 20878 22217 21049 F-Test (Equality Across Grades) 0.028 0.002 0.006 0.009 0.171 0.001 R-squared 0.341 0.199 0.322 0.188 0.374 0.227 Notes (Same as in Table 3) * significant at 10%; ** significant at 5%; *** significant at 1% Table 5: Effective Class Size in ECT Schools versus Comparison Schools Year 1 Year 2 Control Treatment Difference Control Treatment Difference Class 1 41.07 35.90 5.17* 37.82 29.46 8.36*** Class 2 40.49 33.52 6.97*** 40.87 33.07 7.8*** Class 3 36.62 29.35 7.27*** 36.16 26.01 10.15*** Class 4 35.02 33.68 1.34 33.54 27.28 6.26*** Class 5 35.91 32.40 3.51 34.53 28.99 5.55*** p-value of F-test testing equality of ECS 0.048** 0.26 reduction across grades 1. All regressions include mandal (sub-district) fixed effects and standard errors clustered at the school level. 2. ECS stands for Effective Class Size, and ECT stands for Extra Contract Teacher * significant at 10%; ** significant at 5%; *** significant at 1% Table 6: Heterogeneous Impacts of the Extra Contract Teacher Program [1] [2] [3] [4] [5] [6] [7] [8] Household Parental SC or ST Log Number Proximity Infrastructure Baseline Affluence (0 - Literacy (lower Male of Students (8 - 24) (0 - 6) Score 7) (0 - 4) caste) Year 2 on Year 0 Extra Contract Teacher 0.89 -0.863 0.501 0.082 0.104 0.114 0.116 0.123 (0.416)** (.227)*** (.152)*** (0.077) (0.054)* (0.049)** (0.051)** (0.045)*** Covariate -0.049 -0.010 0.007 0.031 0.104 -0.058 0.01 0.447 (0.055) (0.010) (0.034) (0.011)*** (0.054)* (0.037) (0.026) (0.023)*** Interaction -0.146 0.072 -0.120 0.009 0.01 0.053 0.009 0.019 (0.081)* (017)*** (.045)*** (0.018) (0.022) (0.061) (0.039) (0.037) Observations 30626 32894 32894 30209 30209 32747 30221 32747 R-squared 0.242 0.25 0.24 0.244 0.245 0.234 0.242 0.233 Year 1 on Year 0 Extra Contract Teacher 0.178 -0.211 0.202 -0.001 0.066 0.091 0.076 0.086 (0.330) (0.148) (.117)* (0.063) (0.041) (0.037)** (0.039)* (0.035)** Covariate -0.05 -0.008 0.018 0.019 0.057 -0.034 0.005 0.497 (0.042) (0.007) (0.021) (0.009)** (0.011)*** (0.031) (0.017) (0.023)*** Interaction -0.017 0.021 -0.038 0.022 0.011 -0.008 0.016 0.016 (0.063) (.011)** (0.035) (0.015) (0.016) (0.046) (0.029) (0.029) Observations 44168 43209 43209 41706 41706 44314 41718 44314 R-squared 0.334 0.34 0.34 0.344 0.346 0.33 0.342 0.33 Notes: 1. Each column in each panel reports the result of a regression that includes the covariate in the column title, a binary treatment indicator, and a linear interaction term testing for heterogeneous effects of the treatment along the covariate concerned. 2. All regressions include mandal (sub-district) fixed effects and standard errors clustered at the school level. 
All regressions include lagged test scores interacted by grade. 3. The school infrastructure and proximity index are as defined in Table 2 4. The household asset index ranges from 0 to 7 and is the sum of seven binary variables indicating whether the household has an electricity connection, has a water source at home, has a toilet at home, owns any land, owns their home, has a brick home, and t l i i 5. Parental education is scored from 0 to 4 in which a point is added for each of the following: father's literacy, mother's literacy, father having completed 10th grade, and mother having completed 10th grade * significant at 10%; ** significant at 5%; *** significant at 1% Table 7: Effort Comparison Across Teacher Types Panel A : Contract Teachers versus Regular Teachers Teacher Absence Contract Regular Difference with Difference (%) Teachers (%) Teachers (%) School Fixed Effects Year 1 16.1% 25.0% -9.0%*** -11.4%*** Year 2 16.5% 28.5% -12.0%*** -16.7%*** Combined 16.3% 26.6% -10.3%*** -13.3%*** Teachers Observed Actively Teaching Contract Regular Difference with Difference (%) Teachers (%) Teachers (%) School Fixed Effects Year 1 53.4% 49.2% 4.2% 6.8%** Year 2 43.4% 35.4% 8.0%*** 8.4%*** Combined 48.5% 42.8% 5.7%** 6.7%*** * significant at 10%; ** significant at 5%; *** significant at 1% Panel B : Regular Teachers in ECT Schools versus those in Control Schools Teacher Absence Regular Regular Difference with teachers in ECT teachers in non- Difference (%) Mandal fixed effects schools ECT schools Year 1 26.0% 24.1% 1.9% 1.1% Year 2 30.7% 26.4% 4.3% 4.7%** Combined 28.2% 25.2% 3.0% 2.7%* Teachers Observed Actively Teaching Regular Regular Difference with teachers in ECT teachers in non- Difference (%) mandal fixed effects schools ECT school Year 1 45.4% 52.8% -7.4%** -6.5%*** Year 2 35.1% 35.5% -0.4% -0.8% Combined 40.7% 44.8% -4.1% -3.6%** 1. All regressions include mandal (sub-district) fixed effects and standard errors clustered at the school level. * significant at 10%; ** significant at 5%; *** significant at 1% Table 8: Contract vs. Regular Teacher using School Fixed Effects Dependent Variable = Normalized End of Year Test Score Full Sample Only Treatment Schools [1] [2] [3] [4] [4] [6] Taught by a Contract Teacher 0.017 -0.016 -0.018 -0.023 -0.014 -0.021 (0.04) (0.03) (0.03) (0.04) (0.04) (0.04) School Fixed Effects No Yes Yes No Yes Yes Controls No No Yes No No Yes Observations 82154 82154 76421 39970 39970 37091 R-squared 0.27 0.34 0.35 0.25 0.33 0.34 Notes: 1. All Regressions include lagged normalized test scores interacted with grade (this is set to 0 for grade 1 students), with standard errors clustered at the school level 2. Controls include household controls and classroom-level controls. Household controls include a household asset index, parent education index (both coded as in Table 6), child gender an indicator for being from a disadvantaged caste/tribe. Classroom controls include the log of Effective Class Size (ECS) which measures the total number of students simultaneously taught by the teacher of the class, and an indicator for multigrade teaching. * significant at 10%; ** significant at 5%; *** significant at 1% Table 9:Contract vs. 
Regular Teacher using Student Fixed Effects Dependent Variable = Normalized End of Year Test Score Full Sample Only Treatment Schools [1] [2] [3] [4] [5] [6] Taught by a Contract Teacher 0.004 0.015 0.069 -0.002 -0.008 -0.009 (0.02) (0.02) (0.04) (0.02) (0.03) (0.06) Controls No Yes Yes No Yes Yes Stable Sample (No Change in teacher class r No No Yes No No Yes Observations 16154 16154 6595 11711 11711 4768 R-squared 0.66 0.66 0.70 0.66 0.66 0.70 Notes: 1. Same as (1) in Table 8 2. Controls include the log of Effective Class Size (ECS) which measures the total number of students simultaneously taught by the teacher of the class, and an indicator for multigrade teaching. 3. The Stable sample refers to the sample of students who had one teacher in year 1, who continued teaching the SAME class (in the same school) in year 2, and who had a teacher in year 2 who taught the SAME class (in the same school) in year 1. Thus, the identifying variation comes from students moving across teachers who are fixed in specific grades, and so teachers in this sample cannot be getting re-assigned on the basis of cohort-level unobservables. * significant at 10%; ** significant at 5%; *** significant at 1% Table 10: Impact on Test Score Growth of the Fraction of Contract Teachers (CTs) in the School Panel A : Without Mandal (sub-district) Fixed Effects Dependent Variable = Normalized End of Year Test Score [1] [2] [3] [4] [5] [6] Percentage of CT's 0.152 0.113 0.102 0.295 0.205 0.113 (0.117) (0.115) (0.113) (0.252) (0.262) (0.274) Percentage of CT's squared -0.246 -0.158 -0.019 (0.436) (0.436) (0.478) R-squared 0.27 0.27 0.29 0.27 0.27 0.29 Observations 85792 84495 78281 85792 84495 78281 Panel B : With Mandal (sub-district) Fixed Effects 3. Differences in teacher characteristics Dependent Variable = Normalized End of Year Test Score [1] [2] [3] [4] [5] [6] Percentage of CT's 0.072 0.038 0.032 0.48 0.339 0.266 (0.085) (0.079) (0.082) (0.222)** (0.218) (0.237) Percentage of CT's squared -0.715 -0.526 -0.415 (0.375)* (0.364) (0.420) R-squared 0.30 0.31 0.32 0.30 0.31 0.32 Observations 85792 84495 78281 85792 84495 78281 Dummies for Number of Teachers Yes Yes Yes Yes Yes Yes School Level Controls No Yes Yes No Yes Yes Household Level Controls No No Yes No No Yes Notes: 1. All Regressions include lagged normalized test scores interacted with grade (this is set to 0 for grade 1 students) 2. All Regressions include controls for school enrollment, and dummies indicating the number of teachers in the school, and have standard errors clustered at the school level 3. School controls include linear, quadratic, and cubic terms in school enrollment, school infrastructure and school proximity (defined as in Table 6). Household controls include a household asset index, parent education index (both defined as in Table 6), child gender an indicator for being from a disadvantaged caste/tribe. * significant at 10%; ** significant at 5%; *** significant at 1% Table 11: Estimating Impact of Contract Teacher (CT) vs. 
Regular Teacher (RT) using School-level Pupil Teacher Ratios (PTR) Panel A : Treatment Schools with one Contract Teacher and Control Schools with None Dependent Variable = Normalized End of Year Test Score [1] [2] [3] [4] [5] [6] Log_School_PTR using only Regular Teachers [B1] -0.171 -0.176 -0.177 -0.187 (0.053)*** (0.053)*** (0.061)*** (0.059)*** Reduction in Log_School_PTR induced by extra -0.337 -0.326 -0.324 -0.302 (experimental) Contract Teacher [B2] (0.112)*** (0.116)*** (0.112)*** (0.116)*** School and Household Controls No No No Yes Yes Yes Observations 60317 60317 60317 55320 55320 55320 R-squared 0.27 0.28 0.27 0.29 0.29 0.29 P-value (H0: B1 = B2) 0.20 0.29 Panel B : All Control and Treatment Schools Dependent Variable = Normalized End of Year Test Score [1] [2] [3] [4] [5] [6] Log_School_PTR using only Regular Teachers [B1] -0.135 -0.147 -0.141 -0.159 (0.057)** (0.056)*** (0.060)** (0.058)*** Reduction in Log_School_PTR induced by additional -0.184 -0.198 -0.167 -0.187 (pre-existing) Contract Teacher [B2] (0.105)* (0.105)* -0.108 (0.108)* Reduction in Log_School_PTR induced by additional -0.329 -0.294 -0.32 -0.289 (experimental) Contract Teacher [B3] (0.115)*** (0.115)** (0.109)*** (0.110)*** School and Household Controls No No No Yes Yes Yes Observations 81547 81547 81652 74225 74225 74287 R-squared 0.27 0.27 0.27 0.29 0.29 0.29 P-value (H0: B1 = B3) 0.36 0.35 Notes: 1. All Regressions include lagged normalized test scores interacted with grade (this is set to 0 for grade 1 students) 2. School controls include infrastructure and proximity (defined as in Table 6). Household controls include a household asset index, parent education index (both defined as in Table 6), child gender an indicator for being from a disadvantaged caste/tribe. * significant at 10%; ** significant at 5%; *** significant at 1% Table 12 : Comparing Regular, Contract, and Private School Teachers P-value (Null Hypothesis: Private School Regular Teacher Contract Teachers Contract Teacher = Private Teachers School Teacher Female =1 62.5% 80.9% 88.4% 0.047 Age of Teacher 38.39 26.95 26.57 0.626 Teacher Passed College =1 87.0% 31.3% 52.4% 0.000 Received Any Teacher Training =1 99.2% 21.2% 14.1% 0.095 Received Training Within Past Yr =1 78.1% 43.5% 2.8% 0.000 Teacher from the Same Village =1 19.4% 80.2% 54.0% 0.000 Distance to School (km) 11.73 1.01 2.48 0.000 Gross Montly Salary (Rs.) 12,162 1,910 1,527 0.000 Percentage of Absent Teachers 20.7% 11.3% 9.7% 0.487 Notes 1. Robust standard errors clustered at the school level were used to obtain the p-values for the null hypothesis 2. The data used for this table comes from an ongoing study on school vouchers and school choice in different sub-districts of the SAME districts. This data was collected based on teacher interviews in early 2009. 3. Differences in teacher characteristics relative to Table 1 reflect (a) the time gap between the 2 sets of data collection of around 3 years, and (b) the fact that the data used for Table 12 comes from villages that had a private school, which tend to be larger than the typical village in AP. The sample in Table 1 is from a representative set of rural government run schools, while the sample in Table 12 is from a sample of villages that have private schools (though the public school data in Table 12 is from the same villages as the private schools in Table 12). 
Table A1: Effect of Class-level Effective Class Size on Learning Outcomes
Dependent Variable = Normalized Endline Test Score

Year 1
                               Class 1     Class 2     Class 3     Class 4     Class 5
                               [1]         [2]         [3]         [4]         [5]
(Log) Effective Class Size     -0.354      -0.261      -0.221      -0.074      -0.084
                               (0.096)***  (0.079)***  (0.059)***  (0.064)     (0.058)
Observations                   5951        7586        8935        10621       11221
R-squared                      0.293       0.403       0.405       0.482       0.371

Year 2
                               Class 1     Class 2     Class 3     Class 4     Class 5
                               [1]         [2]         [3]         [4]         [5]
(Log) Effective Class Size     -0.336      -0.078      -0.159      -0.068      -0.108
                               (0.101)***  (0.077)     (0.063)**   (0.058)     (0.058)*
Observations                   6122        6383        8577        9451        10924
R-squared                      0.149       0.277       0.39        0.419       0.477

Year 1 and Year 2 Combined
                               Class 1     Class 2     Class 3     Class 4     Class 5
                               [1]         [2]         [3]         [4]         [5]
(Log) Effective Class Size     -0.335      -0.229      -0.187      -0.087      -0.063
                               (0.075)***  (0.053)***  (0.048)***  (0.044)**   (0.045)
Observations                   12073       13969       17512       20072       22145
R-squared                      0.138       0.287       0.331       0.414       0.395

Notes: 1. All regressions include mandal (sub-district) fixed effects and standard errors clustered at the school level.
* significant at 10%; ** significant at 5%; *** significant at 1%

Figure 1a: Andhra Pradesh (AP)
Figure 1b: District Sampling (Stratified by Socio-cultural Region of AP)
Figure 2: Salary Distribution by School and Teacher Type
[Top panel: Salary Distribution by Teacher Type in Government Schools — percentage of teachers in each Rs. 500/month salary range (0 to 24 thousand Rs. per month), shown separately for contract teachers and regular teachers.]
[Bottom panel: Salary Distribution by Teacher Type in Private Schools — percentage of private school teachers in each Rs. 500/month salary range.]

Long-Term Effects of Teacher Performance Pay: Experimental Evidence from India

Karthik Muralidharan†

10 April 2012*

Abstract: We present results from a five-year long randomized evaluation of group and individual teacher performance pay programs implemented across a large representative sample of government-run rural primary schools in the Indian state of Andhra Pradesh. We find consistently positive and significant impacts of the individual teacher incentive program on student learning outcomes across all durations of program exposure. Students who completed their full five years of primary school under the program performed significantly better than those in control schools by 0.54 and 0.35 standard deviations in math and language tests respectively. These students also scored 0.52 and 0.3 standard deviations higher in science and social studies tests even though there were no incentives on these subjects. The group teacher incentive program also had positive (and mostly significant) effects on student test scores, but the effect sizes were always smaller than that of the individual incentive program, and were not significant at the end of primary school for the cohort exposed to the program for five years. These results suggest that reforming the compensation structure of public sector employees could play an important role in enhancing the capacity of governments in developing countries to provide more effective services.
JEL Classification: C93, I21, M52, O15 Keywords: teacher performance pay, teacher incentives, education, education policy, field experiments, public sector labor markets, compensation, India † UC San Diego, NBER, BREAD, and J-PAL; E-mail: kamurali@ucsd.edu * I am especially grateful to Venkatesh Sundararaman for the long-term collaboration that has enabled this paper and to the World Bank for the long-term support that has enabled the research program that this paper is based on. I thank Jim Andreoni, Prashant Bharadwaj, Julie Cullen, Jishnu Das, Gordon Dahl, and Andres Santos for comments. This paper is based on a project known as the Andhra Pradesh Randomized Evaluation Study (AP RESt), which is a partnership between the Government of Andhra Pradesh, the Azim Premji Foundation, and the World Bank. Financial assistance for the project has been provided by the Government of Andhra Pradesh, the UK Department for International Development (DFID), the Azim Premji Foundation, the Spanish Impact Evaluation Fund (SIEF), and the World Bank. We thank Dileep Ranjekar, Amit Dar, and officials of the Department of School Education in Andhra Pradesh for their continuous support and long-term vision for this research. We are especially grateful to DD Karopady, M Srinivasa Rao, Sripada Ramamurthy, and staff of the Azim Premji Foundation for their leadership and meticulous work in implementing this project. Vinayak Alladi, Jayash Paudel, and Ketki Sheth provided outstanding research assistance. 1. Introduction Improving governance in developing countries requires improved policy making as well as enhancements in the capacity of governments to effectively implement policies. This point is reflected in a recent theoretical literature that has highlighted the centrality of investments in state capacity in long-term growth and development (Besley and Persson 2009, 2010), as well as a parallel empirical literature that has pointed out the extent to which developing countries find it challenging to ensure even basic levels of service delivery such as regular attendance of teachers and health care workers in rural communities (World Bank 2003; Chaudhury et al. 2006; Muralidharan et al. 2012).1 Since the effort exerted by public sector employees is a key determinant of state effectiveness, a natural set of policy options to enhance state capacity would be to consider linking public sector worker compensation to measures of performance. However, high- powered incentives have typically not been used for public sector workers for several reasons including concerns of multi-tasking (Holmstrom and Milgrom 1991) and multiple principals (Dixit 2002); concerns of implementation (Murnane and Cohen 1986); greater unionization of public sector workers and the opposition of unions to differentiated pay schemes based on performance (Ehrenberg and Schwarz 1986; Gregory and Borland 1999); and perhaps also because decision-makers in public bureaucracies are typically not residual claimants of improved productive efficiency (Bandiera, Prat, and Valletti 2009). At the same time, an increasing fraction of private sector jobs now link some component of employee pay to performance (Lemieux, Macleod, and Parent 2009), prompting increasing interest among policy makers in using performance-linked pay as a way of improving public sector productivity. 
This trend is particularly visible in education, where the idea of linking a component of teacher compensation to measures of student performance or gains has received growing attention from policy makers, and several countries as well as states in the US have attempted to implement reforms to teacher compensation structure to do this.2 Since teachers typically 1 Indeed, the large disconnect between policy formation and implementation in countries like India has led to the coining of the term ‘flailing state’ to describe countries with a weak capacity for effective policy implementation (Pritchett 2009). 2 Countries that have attempted teacher performance pay programs include Australia, Israel, Mexico, the United Kingdom and Chile (which has a fully scaled up national teacher performance pay program called SNED). In the US, states that have implemented state-wide programs to link teacher pay to measures of student achievement and/or gains include Colorado, Florida, Michigan, Minnesota, Texas, and Tennessee. In addition, the US Federal -1- comprise one of the largest groups of public-sector workers in most countries, understanding the impact of performance-based pay in education is especially relevant to the broader question of the role of compensation reforms in improving public sector productivity. In this paper, we contribute towards this understanding with results from a five-year long randomized evaluation of group and individual teacher performance pay programs implemented across a large representative sample of government-run rural primary schools in the Indian state of Andhra Pradesh (AP). Results at the end of two years of this experiment were presented in Muralidharan and Sundararaman (2011) and half of the schools originally assigned to each of the group and individual incentive programs (50 out of 100) were chosen by lottery to continue being eligible for the performance-linked bonuses for a total of five years. Since primary school in AP consists of five grades (1-5), the five-year long experiment allows us to measure the impact of these programs on a critical outcome for education in developing countries – the learning levels for a cohort of students at the end of their entire primary school education. There are three main results in this paper. First, the individual teacher performance pay program had a large and significant impact on student learning outcomes over all durations of student exposure to the program. Students who had completed their entire five years of primary school education under the program scored 0.54 and 0.35 standard deviations (SD) higher than those in control schools in math and language tests respectively. These are large effects corresponding to approximately 20 and 14 percentile point improvements at the median of a normal distribution, and are larger than the effects found in most other education interventions in developing countries (see Dhaliwal et al. 2011). Second, the results suggest that these test score gains represent genuine additions to human capital as opposed to reflecting only ‘teaching to the test’. Students in individual teacher incentive schools score significantly better on both non-repeat as well as repeat questions; on both multiple-choice and free-response questions; and on questions designed to test conceptual understanding as well as questions that could be answered through rote learning. 
Most importantly, these students also perform significantly better on subjects for which there were no incentives – scoring 0.52 SD and 0.30 SD higher than students in control schools on tests in science and social studies (though the bonuses were paid only for gains in math and language). Government has encouraged states to adopt performance-linked pay for teachers through the “Race to the Top” fund that provides states that innovate in these areas with additional funding. -2- There was also no differential attrition of students across treatment and control groups and no evidence to suggest any adverse consequences of the programs. Third, we find that individual teacher incentives significantly outperform group teacher incentives over the longer time horizon though they were equally effective in the first year of the experiment (the point estimates suggest that the individual incentive program was more effective than the group incentive program at every time horizon in both math and language). Students in group incentive schools score better than those in control schools over most durations of exposure, but these are not always significant and students who complete five years of primary school under the program do not score significantly higher than those in control schools. We measure changes in teacher behavior and the results suggest that the main mechanism for the improved outcomes in incentive schools is not reduced teacher absence, but increased teaching activity conditional on presence. We also measure household responses to the program – for the cohort that was exposed to five years of the program, at the end of five years – and find that there is no significant difference across treatment and control groups in either household spending on education or on time spent studying at home, suggesting that the estimated effects are unlikely to be confounded by differential household responses across treatment and control groups over time. Finally, our estimates suggest that the individual teacher bonus program was 15-20 times more cost effective at raising test scores than the default ‘education quality improvement’ policy of the Government of India, which is reducing class size from 40 to 30 students per teacher (Govt. of India, 2009). The central questions in the literature on teacher performance pay to date have been whether teacher performance pay based on test scores can improve student achievement, and whether there are negative consequences of teacher incentives based on student test scores? On the first question, two recent sets of experimental studies in the US have found no impact of teacher incentive programs on student achievement (see Fryer 2011, and Goodman and Turner 2010 for evidence based on an experiment in New York City, and Springer et al 2010 for evidence based on an experiment in Tennessee). However, other well-identified studies in developing countries have found positive effects of teacher incentives on student test scores (see Lavy 2002 and 2009 in Israel; Glewwe et al. 2010 in Kenya; and Muralidharan and Sundararaman 2011 in India). Also, Rau and Contreras (2011) conduct a non-experimental evaluation of a nationally scaled up teacher incentive program in Chile (called SNED) and find positive effects on student learning. -3- On the second question, there is a large literature showing strategic behavior on the part of teachers in response to features of incentive programs, which may have led to unintended (and sometimes negative) consequences. 
Examples include 'teaching to the test' and neglecting higher-order skills (Koretz 2002, 2008), manipulating performance by short-term strategies like boosting the caloric content of meals on the day of the test (Figlio and Winicki, 2005), excluding weak students from testing (Jacob, 2005), re-classifying more students as special needs to alter the test-taking population (Cullen and Reback 2006), focusing only on some students in response to "threshold effects" embodied in the structure of the incentives (Neal and Schanzenbach, 2010) or even outright cheating (Jacob and Levitt, 2003). The literature on both of these questions highlight the importance of not just evaluating teacher incentive programs that are designed by administrators, but of using economic theory to design systems of teacher performance pay that are likely to induce higher effort from teachers towards improving human capital and less likely to be susceptible to gaming (see Neal 2011). The program analyzed in this paper takes incentive theory seriously and the incentives are designed to reward gains at all points in the student achievement distribution, and to penalize attempts to strategically alter the test-taking population. The study design also allows us to test for a wide range of possible negative outcomes, and to carefully examine whether increases in test scores are likely to represent increases in human capital. This experiment is also the first one that studies both group and individual teacher incentives in the same context and time period.3 Finally, to our knowledge, five years is the longest time horizon over which an experimental evaluation of a teacher performance pay program has been carried out and this is the first paper that is able to study the impact on a cohort of students of completing their entire primary education under a system of teacher performance pay. While set in the context of schools and teachers, this paper also contributes to the broader literature on performance pay in organizations in general and public organizations in particular.4 There has been a recent increase in compensation experiments in firms (see Bandiera et al. 2011 3 There is a vast theoretical literature on optimal incentive design in teams (Holmstrom 1982 and Itoh 1992 provide a good starting point). Kandel and Lazear (1992) show how peer pressure can sustain first best effort in group incentive situations. Hamilton, Nickerson, and Owan (2003) present empirical evidence showing that group incentives for workers improved productivity relative to individual incentives (over the 2 year study period). Lavy (2002 and 2009) studies group and individual teacher incentives in Israel but over different time periods and with different non-experimental identification strategies. 4 See Lazear and Oyer (2009) for a recent review of the literature in personnel economics (which includes a detailed discussion of worker incentives), and Dixit (2002) for a discussion of these themes applied to public organizations. -4- for a review), but these are typically short-term studies (often lasting just a few months). 5 The results in this paper are based (to our knowledge) on the longest running experimental evaluation of group and individual-level performance pay in any sector. 
More broadly, in the absence of high-powered incentives for public sector workers, the literature on determinants of governance quality has highlighted the role of bureaucratic culture (Wilson 1989), professionalism in the bureaucracy (Evans and Rauch 1999) and selection of workers who are motivated by the public interest (Besley and Ghatak 2005). A parallel literature has looked at the role of compensation levels on the characteristics of workers who join the public sector and usually finds that improving levels of pay helps attract more educated workers, but this literature has typically not looked at whether this leads to improved outcomes, or if across the board pay increases are cost effective ways of doing so (Dolton 2006; Dal Bo, Finan, and Rossi 2011). Our results highlight the potential for well-designed changes in compensation structure (as opposed to levels) to improve public sector productivity in a cost-effective way. The rest of this paper is organized as follows: section 2 describes the experimental design; section 3 discusses data and attrition; section 4 presents the main results of the paper; section 5 discusses changes in teacher and household behavior in response to the programs, and section 6 concludes. 2. Experimental Design 2.1 Theoretical Considerations Standard agency theory suggests that having employee compensation depend on measures of output will increase the marginal return to effort and therefore increase effort and output. However, two broad sets of concerns have been raised about introducing performance-linked pay for teachers. First, there is the possibility that external incentives can crowd out intrinsic motivation and reduce effort – especially in jobs such as teaching that attract intrinsically motivated workers (Deci and Ryan 1985; Fehr and Falk 2002). The second set of concerns is based on multi-tasking theory which cautions that rewarding agents on measurable aspects of their efforts may divert effort away from non-measured outputs, leading to inferior outcomes 5 One limitation of short-term compensation experiments is the inter-temporal substitutability of leisure, which may cause the impact of a temporary change in wage structure to be different from the impact of a long-term change. -5- relative to a scenario with no performance-pay at all (Holmstrom and Milgrom 1991; Baker 1992). Muralidharan and Sundararaman (2009) discuss the first concern and suggest that a transparently administered performance-linked pay program for teachers may actually increase intrinsic motivation in contexts (like India) where there is no differentiation of career prospects based on effort. Muralidharan and Sundararaman (2011) discuss the second concern in detail and show that the social costs of the potential diversion of teacher effort from ‘curricular best practice’ to ‘maximizing test scores’ may be limited in contexts like India where (a) ‘best practice teaching’ is typically not very different from teaching to maximize scores on high-stakes tests (which are ubiquitous in India), and (b) norms of teacher effort in the public sector are quite low (which is also true in India, with 25% of public school teachers being absent on any given day – see Kremer et al. 2005). So, it is possible that linking teacher pay to improvements in student test scores will not only increase test scores, but also increase underlying human capital of students, especially in contexts such as India. 
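To spell out the logic behind the statement that performance-linked pay raises the marginal return to effort, the following is an illustrative textbook piece-rate problem, not a model taken from this paper:

% Illustrative piece-rate problem (an assumption-laden textbook sketch, not
% the paper's model): a risk-neutral teacher chooses effort e, is paid a
% bonus rate b per unit of measured output, and bears a strictly convex
% effort cost c(e).
\[
  \max_{e \ge 0}\; b\,e - c(e)
  \qquad\Longrightarrow\qquad
  c'(e^{*}) = b .
\]
% With c'' > 0, optimal effort e* is increasing in the bonus rate b; the
% multi-tasking concern arises when e is a vector and only some of its
% components are measured and rewarded.

This is only the benchmark intuition; the crowding-out and multi-tasking concerns discussed above are precisely the reasons this benchmark may fail in practice.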
Whether or not this is true is an empirical question and is the focus of our research design and empirical analysis. 2.2 Background The details of the experimental design (sampling, randomization, incentive program design, and data collected) are discussed in detail in Muralidharan and Sundararaman (2011) – hereafter referred to as MS 2011, and are only summarized briefly here. The original experiment was conducted across a representative sample of 300 government-run primary schools in the Indian state of Andhra Pradesh (AP), with 100 schools each being randomly assigned to an “individual teacher incentive program”, a “group teacher incentive program”, and a control group. The study universe was spread across 5 districts, with 10 mandals (sub-districts) being randomly sampled from each of the 5 study districts, and 6 schools being randomly sampled from each of the 50 mandals. The randomization was stratified at the mandal level and so 2 of the 6 schools in each mandal were assigned to each treatment and to the control group. The bonus formula provided teachers with Rs. 500 for every percentage point of mean improvement in test scores of their students. The teachers in group incentive (GI) schools received the same bonus based on average school-level improvement in test scores, while the bonus for teachers in individual incentive (II) schools was based on the average test score -6- improvement of students taught by the specific teacher. Teachers/schools with negative improvements did not get a bonus (there was no negative bonus). The main features of the incentive design were: (i) the bonus was based on a linear piece-rate – which provided a continuous incentive for effort, since a larger test score gain led to a larger bonus; (ii) there were limited threshold effects, since all students contributed to the bonus calculation; (iii) the incentive amounts were not large, with the expected value of the bonus being around 3% of annual teacher pay. See MS 2011 for further details of the incentive formula and the rationale for each of the design features. 2.3 Changes in Experimental Design The design details were unchanged for the first two years of the experiment (up to the point reported in MS 2011), and the experiment was initially only expected to last for two years. Renewed funding for the project allowed the experiment to continue for a third year, at which point, a small change was made to the bonus formula. In the first two years, student gains were calculated using their previous test score as a baseline. While this was an intuitive way of communicating the details of the system to teachers, it had an important limitation. Since there is substantial mean reversion in student scores, the formula unfairly penalized teachers who had an incoming cohort of high-scoring students and rewarded those who had an incoming cohort of low-scoring students. Once we had two years of data in control schools, we were able to calculate a ‘predicted’ score for each student using lagged scores and use this predicted score (predicted using only the control schools) as the ‘target’ for each student in the incentive school to cross to be eligible for the bonus. The final bonus was calculated at the student level and then aggregated across students for the teacher/school. The formula used to calculate the bonus at the individual student level was: Student level bonus = Rs. 
20 × (Actual Score – Target Score).6    (1)

In cases where the actual score was below the target score, a student could contribute a 'negative amount' to the teacher's bonus, but this was capped at -5%, or -Rs. 100 (even if the actual score was more than 5% below the target score). Cases of drop-outs (or non-test-taking by students who should have taken the test) were automatically assigned a score of -5% and so contributed a reduction of the bonus by Rs. 100. Thus, a student could never hurt a teacher's or school's bonus by more than by not taking the test, and it was therefore not possible to increase the 'average' score by having weak students drop out. While it was possible for an individual student to contribute a negative amount to a teacher's bonus, the final bonus received by teachers was zero and not negative in cases where the total bonus was negative after aggregating (1) across all the students taught by the teacher/school.

6 The scores are defined in terms of the percentage score on the test. A typical teacher taught around 25 students, and so a bonus of Rs. 500 per percentage point improvement in average scores in the class was equivalent to a bonus of Rs. 20 per student per percentage point improvement in student-level scores. Thus, the change in formula was not meant to change the expected amount of bonuses paid, but rather to reduce the role of mean reversion in the award of bonuses.

At the end of the third year, uncertainty regarding funding required a reduction in the sample size of the project. It was decided that it would be valuable to continue the experiment for at least a subset of the original treatment group for five years, to study the impact of the programs on a cohort of students who had completed their entire primary school education (grades 1-5) under the teacher incentive programs. Hence, both group and individual incentive programs were continued in 50 of the 100 schools where they started, and discontinued in the other 50. The selection of schools to continue or discontinue was done by lottery, stratified at the mandal level, and so each of the 50 mandals in the project had 1 school that continued with each treatment for 5 years, 1 school that had each treatment for 3 years and was then discontinued from the treatment, and 2 schools that served as control schools throughout the 5 years of the project (see Figure 1). Since the focus of this paper is on the effects of extended exposure to the teacher incentive treatments, most of the analysis will be based on the schools that continued with the treatments for 5 years (when treatment effects over 3 years or more are being considered).

2.4 Cohort and Grade Composition of Students in Estimation Sample

Primary school in AP covers grades 1 through 5 and the project lasted 5 years, which meant that a total of 9 cohorts of students spent some portion of their primary school experience under the teacher incentive treatments. We refer to the oldest cohort as "cohort 1" (the cohort that was in grade 5 in the first year of the project and graduated from primary school after the first year) and the youngest cohort as "cohort 9" (the cohort that entered grade 1 in the fifth year of the project). Figure 2 shows the passage of various cohorts through the program, the duration of exposure they had to the treatments, and the grades in which each cohort was exposed. Cohort 5 is the one that spent its entire time in primary school under the incentive treatments.
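The following is a minimal sketch of the year 3-5 bonus rule described in equation (1) and the surrounding text; the function names and the example scores are purely illustrative, and the actual bonus computation was of course run on the project's administrative test data rather than hand-entered numbers.

    # Sketch of the year 3-5 bonus rule: Rs. 20 per percentage point by which a
    # student's actual score exceeds the predicted ("target") score, each
    # student's contribution floored at -Rs. 100 (i.e., -5 percentage points),
    # non-test-takers counted at that floor, and the teacher's total bonus
    # floored at zero. Scores are in percent.

    def student_contribution(actual, target):
        """Rupee contribution of one student; `actual` is None if the student
        did not take the test (treated as -5 percentage points)."""
        if actual is None:
            return -100.0
        return max(20.0 * (actual - target), -100.0)

    def teacher_bonus(class_scores):
        """`class_scores` is a list of (actual, target) pairs for the students
        taught by one teacher (for the group incentive, pool the whole school)."""
        total = sum(student_contribution(a, t) for a, t in class_scores)
        return max(total, 0.0)

    # Example: three students, one of whom missed the test.
    # Contributions: 20*(62-55)=140, 20*(50-48)=40, missing -> -100; total 80.
    print(teacher_bonus([(62.0, 55.0), (50.0, 48.0), (None, 50.0)]))  # 80.0

The floor at -Rs. 100 per student is what removes the incentive to push weak students out of the test-taking population: a student who scores very poorly can never cost the teacher more than a student who does not appear at all.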
Cohorts 4 and 6 spent 4 years in the project, cohorts 3 and 7 spent 3 years, cohorts 2 and 8 spent 2 years, and finally cohorts 1 and 9 spent only 1 year in the project. -8- 2.5 Validity of Randomization The validity of the initial randomization between treatment and control groups was shown in MS 2011. Table 1 (Panel A) shows the equality on key variables between the schools that were continued and discontinued in each of the individual and group teacher incentive programs. We first show balance on school-level variables (infrastructure, proximity), and then show balance on student test scores at the end of the third year (which is the time when the randomization was done). We show this in two ways: first, we include all the students in cohorts 4, 5, 6, and 7 – these are the cohorts in the project at the end of the third year that will be included in subsequent analysis (see Figure 2); second, we only include students in cohort 5 since this is the only cohort with which we can estimate the five-year treatment effects. Table 1 (Panel B) shows that the existence of the treatments did not change the size or socio-economic characteristics composition of new incoming cohorts of students in years 2 to 5, suggesting that cohorts 6-9 also constitute valid cohorts for the experimental analysis of the impact of the teacher incentive programs. 3. Data, Estimating Equations, and Attrition 3.1 Data Data on learning outcomes is generated from annual assessments administered by the Azim Premji Foundation to all schools in the study. Students were tested on math and language (which the incentives were based on) in all grades, and also tested on science and social studies (for which there were never any incentives) in grades 3-5.7 The school year runs from mid-June to mid-April, and the baseline test in the first year of the project was conducted in June-July 2005. Five subsequent rounds of tests were conducted at the end of each academic year, starting March-April 2006 and ending in March-April 2010.8 For the rest of this paper, Year 0 (Y0) refers to the baseline tests in June-July 2005; Year 1 (Y1) refers to the tests conducted at the end 7 Science and social studies are tested only from grade 3 onwards because they are introduced in the curriculum only in the third grade. In the first year of the project, these tests were surprise tests that the schools did not know would take place till a few days prior to the test. In the subsequent years, schools knew these tests would take place – but also knew from the official communications and previous year’s bonus calculations that these subjects were not included in the bonus calculations. 8 Each of these rounds of testing featured 2 days of testing typically 2 weeks apart. Math and language were tested on both days, and the first test (called the “lower end line” or LEL) covered competencies up to that of the previous school year, while the second test (called the “higher end line” or HEL) covered materials from the current school year's syllabus. Doing two rounds of testing at the end of each year allows for the inclusion of more materials across years of testing, reduces the impact of measurement errors specific to the day of the test, and also reduces sample attrition due to student absence on the day of the test. -9- of the first year of the program in March-April, 2006, and so on with Year 5 (Y5) referring to the tests conducted at the end of the fifth year of the program in March-April, 2010. 
Scores in Y0 are normalized relative to the distribution of scores across all schools for the same test (pre-treatment), while scores in subsequent years are normalized with respect to the score distribution in the control schools for the same test.9

Footnote 9: Student test scores on each round (LEL and HEL), which are conducted two weeks apart, are first normalized relative to the score distribution in the control schools on that test, and then averaged across the 2 rounds to create the normalized test score for each student at each point in time. So a student can be absent on one testing day and still be included in the analysis without bias, because the included score would have been normalized relative to the distribution of all control-school students on the same test that the student took.

Enumerators also made several unannounced visits to all treatment and control schools in each year of the project and collected data on teacher attendance and activity during these visits. In addition, detailed interviews were conducted with teachers at the start of each school year to collect data on teaching practices during the previous school year (these interviews were conducted prior to the bonuses being announced, to ensure that responses are not affected by the actual bonus received). Finally, a set of household interviews was conducted in August 2010 (after the end of the program) across treatment and control group students in cohort 5 who had spent the full five years in the study. Data was collected on household expenditure, student time allocation, the use of private tuitions, and parental perceptions of school quality.

3.2 Estimating Equations

Our main estimating equation takes the form:

T_ijkm(Y_n) = α + γ_j(Y_0)·T_ijkm(Y_0) + δ_II·II + δ_GI·GI + β·Z_m + ε_k + ε_jk + ε_ijk    (2)

The dependent variable of interest is T_ijkm(Y_n), the normalized test score on the specific subject at the end of n years of the program (i, j, k, and m denote the student, grade, school, and mandal respectively). Including the normalized baseline test score, T_ijkm(Y_0), improves efficiency due to the autocorrelation of test scores across multiple periods.10 All regressions include a set of mandal-level dummies (Z_m), and the standard errors are clustered at the school level. II and GI are dummy variables at the school level corresponding to the “Individual Incentive” and “Group Incentive” treatments respectively, and the parameters of interest are δ_II and δ_GI, which estimate the effect on test scores of being in an individual or group incentive school.

Footnote 10: Since cohorts 5-9 (those that enter the project in grade 1 in years 1 through 5 respectively) did not have a baseline test, we set the normalized baseline score to zero for the students in these cohorts. Note that the coefficient on the baseline test score is allowed to be flexible by grade, to ensure that including a normalized baseline score of zero does not influence the estimate of γ_j(Y_0) for the cohorts where we do have a baseline score.

We first estimate treatment effects over durations ranging from 1 to 5 years using the cohorts that were present in our sample from the start of the project (cohorts 1 to 5), using progressively fewer cohorts (all 5 cohorts were exposed to the first year of the program, while only one cohort was exposed to all 5 years – the estimation sample for the n-year treatment effect can be visualized by considering the lower triangular matrix in Figure 2 and moving across the columns as n increases).
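As one illustration of how a specification like (2) can be taken to data, the following is a minimal sketch using Python and statsmodels; the column names (endline and baseline scores, grade, treatment dummies, mandal and school identifiers) and the file name are hypothetical, and this is not the authors' estimation code.

    import pandas as pd
    import statsmodels.formula.api as smf

    # df: one row per student, with hypothetical columns:
    #   score_yn   - normalized test score at the end of year n
    #   score_y0   - normalized baseline score (set to 0 for cohorts with no baseline)
    #   grade      - student's grade (baseline slope allowed to vary by grade)
    #   ii, gi     - school-level dummies for the individual / group incentive treatments
    #   mandal     - mandal identifier (entered as fixed effects)
    #   school_id  - school identifier (used to cluster standard errors)
    df = pd.read_csv("ap_rest_students.csv")  # hypothetical file name

    spec = "score_yn ~ score_y0:C(grade) + ii + gi + C(mandal)"
    res = smf.ols(spec, data=df).fit(
        cov_type="cluster", cov_kwds={"groups": df["school_id"]}
    )
    print(res.params[["ii", "gi"]])  # estimates of delta_II and delta_GI

The grade-specific interaction term mirrors the flexible-by-grade baseline coefficient described in footnote 10, and the cluster-robust fit mirrors the school-level clustering of standard errors described above.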
We can also use the incoming cohorts after the start of the project (cohorts 6-9) to estimate treatment effects, because there is no systematic difference in these cohorts across treatment and control schools (Table 1 – Panel B). Thus, we can estimate average treatment effects at the end of first grade using 5 cohorts (cohorts 5-9), average treatment effects at the end of second grade using 4 cohorts (cohorts 5-8), and so on (the estimation sample for the n-year treatment effect starting from grade 1 can be visualized by considering the upper triangular matrix in Figure 2 and moving down the rows as n increases). These are estimated using:

T_ijkm(Y_n) = α + δ_II·II + δ_GI·GI + β·Z_m + ε_k + ε_jk + ε_ijk    (3)

with the only difference from (2) being the lack of a baseline score to control for in cohorts 5-9.

Finally, a key advantage of estimating treatment effects over 5 years and 9 cohorts of students is that the estimated effects are more robust to fluctuations due to cohort or year effects. We therefore also estimate n-year treatment effects by pooling all cohorts for whom an experimental n-year effect can be estimated. Thus, we estimate 1-year effects using all 9 cohorts (cohorts 1-5 in Y1, and cohorts 6-9 in Y2-Y5; i.e., using the first column and first row of Figure 2); 2-year effects using 7 cohorts (cohorts 2-5 in Y2, and cohorts 6-8 in Y3-Y5); 3-year effects using 5 cohorts (cohorts 3-5 in Y3, and cohorts 6-7 in Y4-Y5); 4-year effects using 3 cohorts (cohorts 4-5 in Y4, and cohort 6 in Y5); and 5-year effects using 1 cohort. In other words, we pool the samples used for (2) and (3), with cohort 5 counted only once to avoid double counting. This is the largest sample we can use for estimating n-year treatment effects experimentally, and we refer to it as the “full sample”.

3.3 Attrition

While randomization ensures that treatment and control groups are balanced on observables at the start of the experiment (Table 1), the validity of the experiment can still be compromised if there is differential attrition of students or teachers across the treatment and control groups. The average student attrition rate in the control group (defined as the fraction of students in the baseline tests who did not take a test at the end of each year) was 14.0% in Y1, 29.3% in Y2, 40.6% in Y3, 47.4% in Y4, and 55.6% in Y5 (Table 2 – Panel A). This reflects a combination of students dropping out of school, switching schools in the same village (including moving to private schools), migrating away from the village over time, and being absent on the day of the test.11 Attrition rates were slightly lower in the incentive schools, but there was no significant difference in student attrition rates across treatment and control groups. There was also no significant difference in baseline test scores across treatment categories among the students who drop out of the test-taking sample (though attrition is higher among students with lower baseline scores). Similarly, we see that while attrition rates are high among the cohorts used to estimate (3), there is no significant difference in the attrition rates across treatment and control groups in these cohorts either (Table 2 – Panel B).
Note that no baseline scores exist for cohorts 5-9 and so we only show attrition rates here and not test scores.12 Finally, we estimate a model of attrition using all observable characteristics of students in our data set (including baseline scores, household affluence, and parental education) and cannot reject the null hypothesis that the same model predicts attrition in both treatment and control groups over the five years.13 The other challenge to experimental validity is the fact that teachers get transferred across schools every few years. As described in MS 2011, around a third of the teachers were transferred in the few first months of the project, but there was no significant difference in teacher transfers across treatment and control schools (Table 2 – Panel C – Column 1). The annual rate of teachers being transferred was much lower in Y2, Y3, and Y4 (averaging under 5% per year, with no significant difference across treatment groups). Since the teacher transfers 11 Note that the estimation sample does not include students who transferred into the school during the 5 years of the project, since the aim is to show the treatment effects on students who have been exposed to the program for n years. The attrition numbers are presented relative to the initial set of students in the project, who are the only ones we use in our estimation of treatment effects. 12 Since the only test scores available for cohorts 5-9 are after they have spent a year in the treatment schools, it is not meaningful to compare the test scores of attritors in this sample. However, we compare the average score percentiles (based on scores after completing 1st grade) of attritors in treatment and control groups and find no difference in this either over time. 13 We estimate this model separately at the end of each year, and for group and individual incentive schools relative to the control group. We reject the null of equality only once out of ten tests (five years each for GI and II schools respectively). - 12 - in the first year took place within a few months of the start of the school year (and were scheduled to take place before any news of the interventions was communicated to schools), the teacher composition in the studied schools was quite stable between Y1 and Y4 – with less than 10% teacher attrition in this period. However, there was a substantial round of teacher transfers in Y5, with nearly 70% of teachers being transferred out of study schools. While there was no significant difference in transfer rates across treatment and control schools, the transfers imply that a vast majority of teachers in treatment schools in Y5 had no prior experience of the incentive programs. It is therefore likely that our estimates of 5-year effects are a lower bound on the true effect, since the effects may have been higher if teachers with 4 years of experience of the incentive program had continued in Y5 (we discuss this further in the next section). 4. Results 4.1 Impact of Incentives on Test Scores Table 3 presents the results from estimating equation (2) for cohorts 1-5 for each year of exposure to the treatments (panel A combines math and language, while Panels B and C show the results separated out by subject). The table also indicates the estimation sample (cohorts, year, and grades) corresponding to each column (common across panels) and includes tests for equality of group and individual incentive treatments. 
We find that students in individual incentive schools score significantly more than students in control schools in math and language tests over all durations of program exposure. The cohort of students exposed to the program for 5 years scored 0.54 SD and 0.35 SD higher in math and language tests respectively (corresponding to approximately 20 and 14 percentile point improvements at the median of a normal distribution). Turning to the group incentive program, we see that students in these schools also attained higher test scores than those in control schools and that this difference is significant in the first 4 years, though it is not so for cohort 5 at the end of 5 years of the program. The point estimates of the impact of the individual incentive program are always higher than those of the group incentive programs (for both subjects), and the difference is significant at the end of Y2, Y4, and Y5 (when combined across math and language as in Panel A). The addition of school and household controls does not significantly change the estimated treatment effects in any of the regressions, as would be expected in an experimental setting (results available on request). - 13 - Table 4 presents results from estimating equation (3) for cohorts 5-9 and shows the mean treatment effects at the end of each grade for students who start primary school under the teacher incentive programs (note that column 5 is identical to that in Table 3 since they are both based on cohort 5 at the end of 5 years). Again, the impact of the individual incentive program is positive and significant for all durations of exposure for math as well as language. However, the group incentive program is less effective and test scores are not significantly different from those in the control schools for either math or for language for any duration of exposure. The effects of the individual incentive program are significantly greater for all durations of exposure greater than 1 year. The key difference between the samples used to estimate (2) and (3) is that the former is weighted towards the early years of the project, while the latter is weighted towards the later years (see Figure 2 and the discussion in 3.2 to see this clearly). The differences between Table 3 and 4 thus point towards declining effectiveness of the group incentive treatments over time. Finally, table 5 presents results using all the cohorts and years of data that we can use to construct an experimental estimate of  II and  GI and is based on the “full sample” as discussed in section 3.2. Each column also indicates the cohort/year/grade of the students in the estimation sample. The broad patterns of the results are the same as in the previous tables – the effects of individual teacher incentives are positive and significant at all lengths of program exposure; while the effects of the group teacher incentives are positive but not always significant, and mostly significantly below those of the individual incentives. The rest of the paper uses the “full sample” of data for further analysis, unless mentioned otherwise. 
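As a quick check of the percentile conversions quoted at the start of this section (0.54 SD ≈ 20 percentile points and 0.35 SD ≈ 14 percentile points for a student starting at the median), the arithmetic under a standard normal distribution, with Φ denoting the normal CDF, is:

    Φ(0.54) − Φ(0) ≈ 0.705 − 0.500 ≈ 0.21  (about 20 percentile points)
    Φ(0.35) − Φ(0) ≈ 0.637 − 0.500 ≈ 0.14  (about 14 percentile points)

This is only a back-of-the-envelope translation of the normalized effect sizes, not an additional result.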
We check the robustness of the results to teacher transfers by re-estimating the results in Tables 3-5 on the sample restricted to teachers who had remained in the project from the beginning, and find no significant difference in the estimates relative to those in Tables 3-5.14 The testing process was externally proctored at all stages and we had no reason to believe that cheating was a problem.15

Footnote 14: The point estimates of the impact of the individual incentive program on cohort 5 at the end of Y5 are larger in this restricted sample, but they are (a) not significantly different from the estimates in Table 3 (column 5), and (b) estimated with just 16% of the teachers who started the program.

Footnote 15: As reported in MS 2011, there were 2 cases of cheating discovered in Y2. These schools were disqualified from receiving bonuses that year (and dropped from the 2-year analysis), but were not disqualified from the program in subsequent years.

4.2 Test Scores Versus Broader Measures of Human Capital

A key concern in the interpretation of the above results is whether these test score gains represent real improvements in children’s human capital or merely reflect drilling on past exams and better test-taking skills. We probe this issue more deeply below using data at the individual question level.

First, we consider differences in student performance in incentive schools on repeat versus non-repeat questions.16 Table 6 – Panel A shows the breakdown of scores by treatment status and by whether the question was repeated (using raw percentage scores on the tests as opposed to normalized scores). We see (as may be expected) that performance on repeated questions is typically higher than on non-repeat questions, even in the control schools. Individual incentive schools perform significantly better than control schools on both repeat and non-repeat questions, whereas group incentive schools only do better on repeat questions and do not do better on non-repeat questions at any point after the first year. This table also lets us see the treatment effects in raw (as opposed to normalized) scores: at the end of 5 years, students in individual incentive schools score 9.2 percentage points higher than control schools on non-repeat questions (on a base of 27.4%) and 10.3 percentage points higher on repeat questions (on a base of 32.2%) in math; and 7.3 and 5.4 percentage points higher on non-repeat and repeat questions in language (on bases of 42.7% and 45.1% respectively).

Footnote 16: Around 16% of questions in math and 10% of questions in language are repeated across years to enable vertical linking of items over time.

Next, we look at differential performance on multiple-choice questions (MCQ) and non-MCQ items on the test, since the former are presumably more susceptible to improvements due to test-taking skills such as not leaving items blank. These results are presented in Table 6 – Panel B, and are quite similar to Panel A. Student performance is higher on MCQs; students in individual incentive schools score significantly higher than those in control schools on both MCQs and non-MCQs (though typically more so on MCQs); group incentive schools are more likely to do better on MCQs and typically do no better on non-MCQs than control schools (after the first year).
We adjust for these two considerations and recalculate the treatment effects shown in Tables 3-5 using only non-repeat and non-MCQ questions, but find that there is hardly any change in the estimated treatment effects.17

Footnote 17: There are two likely reasons for this. First, MCQ and repeat questions constitute a small component of the test. Second, even though the performance of incentive schools is higher on MCQ and repeat questions in percentage point terms, the standard deviations of scores on those components of the test are also larger, which reduces the impact of removing these questions from the calculation of normalized test scores (which is the unit of analysis for Tables 3-5).

Next, as discussed in detail in Muralidharan and Sundararaman (2009), the tests were designed to include both ‘mechanical’ and ‘conceptual’ questions, where the former resembled questions in the textbook, while the latter tested the same underlying idea in unfamiliar ways. We analyze the impact of the incentive programs by whether questions were ‘mechanical’ or ‘conceptual’ and find that the main results of Tables 3-5 hold regardless of the component of the test on this dimension (tables available on request).

Finally, Table 7 shows the impact of the teacher incentive programs on science and social studies, subjects on which no incentives were paid to teachers in any of the five years.18 Students in schools with the individual teacher incentive program scored significantly higher on both science and social studies at all durations of program exposure, and students in cohort 5 scored 0.52 SD higher in science and 0.30 SD higher in social studies at the end of primary school after spending their entire schooling experience under the program. However, while students in group incentive schools also score better on science and social studies than students in control schools, the treatment effect is not significant for cohort 5 after five years, and is significantly lower than that of the individual incentive program.

Footnote 18: Since these tests were only conducted in grades 3-5, we have fewer cohorts of students with which to estimate treatment effects. Table 7 indicates the cohort/year/grade combination of students in each estimation sample.

4.3 Heterogeneity and Distribution of Treatment Effects

We conduct extensive analysis of differential impacts of the teacher incentive programs along several school, student, and teacher characteristics. The default analysis uses a linear functional form as follows:

T_ijkm(Y_n) = α + γ·T_ijkm(Y_0) + β_1·II + β_2·Char + β_3·(II × Char) + δ·Z_m + ε_k + ε_jk + ε_ijk    (4)

where II (or GI) represents the treatment dummy, Char is a particular school or student characteristic, and (II × Char) is an interaction term, with β_3(II) / β_3(GI) being the term of interest, indicating whether there are differential treatment effects (for II/GI) as a function of the characteristic. Table 8 (Panel A) shows the results of these regressions for several school and household characteristics: the columns represent increasing durations of treatment exposure, the rows indicate the characteristic, and the entries in the table correspond to the estimates of β_3(II) and β_3(GI) – columns 1-5 show β_3(II), while columns 6-10 show β_3(GI).
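As with (2), the interaction specification in (4) maps directly into a regression formula. Continuing with the hypothetical data frame and statsmodels import from the earlier sketch (again, not the authors' code), one could interact the individual incentive dummy with a numeric characteristic such as class enrollment:

    # 'char' stands in for the characteristic being interacted (e.g., enrollment).
    spec4 = "score_yn ~ score_y0:C(grade) + ii * char + C(mandal)"
    res4 = smf.ols(spec4, data=df).fit(
        cov_type="cluster", cov_kwds={"groups": df["school_id"]}
    )
    print(res4.params["ii:char"])  # beta_3: differential treatment effect per unit of char

The formula term ii * char expands to the main effects plus the interaction, so the reported "ii:char" coefficient corresponds to β_3 in (4).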
- 16 - Given sampling variation in these estimates, we are cautious to not claim evidence of heterogeneous treatment effects unless the result is consistent across several time horizons. Overall, we find limited evidence of consistently differential treatment effects by school and student characteristics. The main heterogeneity worth highlighting is that teacher incentives appear to be more effective in schools with larger enrolments and for students with lower levels of parental literacy. Since the linear functional form for heterogeneity may be restrictive, we also show non- parametric estimates of treatment effects to better understand the distributional effects of the teacher incentive programs. Figures 3A-3D plot the quantile treatment effects of the performance pay program on student test scores (averaged across math and language) for cohort 5 at the end of 5 years. Figure 3A plots the test score distribution by treatment as a function of the percentile of the test score distribution at the end of Y5, while Figures 3B-3D show the pair- wise comparisons (II vs. control; GI vs. control; II vs. GI) with bootstrapped 95% confidence intervals. We see that students in II schools do better than those in control schools at almost every percentile of the Y5 test score distribution. However, the variance of student outcomes is also higher in these schools, with much larger treatment effects at the higher end of the Y5 distribution (in fact, while mean treatment effects are positive and significant in Table 3 – Column 5, the non-parametric plot suggests that II schools do significantly better only above the 40th percentile of the Y5 outcome distribution). Students in GI schools do better throughout the Y5 distribution, but these differences are typically not significant (as would be expected from Table 3 – column 5). However, there is no noticeable increase in variance in GI schools relative to the control schools. Finally, directly comparing GI and II schools suggests that the GI schools may have been marginally more effective at increasing scores at the lowest end of the Y5 distribution (though not significantly so), while the II schools did much better at raising scores at the high end of the Y5 distribution. We also test for differential responsiveness by observable teacher characteristics (Table 8 – Panel B). The main result we find is that the interaction of teacher training with incentives is positive and significant (for both II and GI schools), while training by itself is not a significant predictor of value addition, suggesting that teaching credentials by themselves may not add much - 17 - value under the status quo but may do so if teachers had incentives to exert more effort (see Hanushek (2006)). 4.4 Group versus Individual Incentives A key feature of our experimental design is the ability to compare group and individual teacher incentives over time and the results discussed above highlight a few broad patterns. First, II and GI schools did equally well in the first year, but the II schools typically did better over time, with the GI schools often not doing significantly better than control schools. Second, outcomes in GI schools appear to have lower variance than those in II schools, with II schools being especially effective for students at the high end of the learning distribution. The low impact of group incentives over time is quite striking given that the typical schools has 3 teachers and peer monitoring of effort should have been relatively easy. 
We test whether the effectiveness of GI declines with school size, and do not find any significant effect of either school enrollment or number of teachers on the relative impact of GI versus II. These results suggest that there may be (a) limited complementarity across teachers in teaching, and (b) that it may be difficult even for teachers in the same school to effectively observe and enforce intensity of effort. 4.5 Teacher Behavior Our results on the impact of the programs on teacher behavior are mostly unchanged from those reported in MS 2011. Particularly, over 5 years of measurement through unannounced visits to schools, we find no difference in teacher attendance between control and incentive schools (Table 9). We also find no significant difference between incentive and control schools on any of the various indicators of classroom processes as measured by direct observation. However, the teacher interviews, where teachers in both incentive and control schools were asked unprompted questions about what they did differently during the school year, indicate that teachers in incentive schools are significantly more likely to have assigned more homework and class work, conducted extra classes beyond regular school hours, given practice tests, and paid special attention to weaker children (Table 9). Teachers in both GI and II schools report significantly higher levels of these activities than teachers in control schools (Table 9 – columns 4 and 5). Teachers in II schools report higher levels of each of these activities than those in GI schools as well, but these differences are not always significant (column 6). While self-reported measures of teacher activity might be - 18 - considered less credible than observations, we find a positive (and mostly significant) correlation between the reported activities of teachers and the performance of their students (column 7) suggesting that these self-reports were credible (especially since less than 40% of teachers in the incentive schools report doing any one of the activities described in Table 9). In summary, it appears that the incentive program based on end of year test scores did not change the teachers' cost-benefit calculations on the attendance margin during the school year, but that it probably made them exert more effort when present. 4.6 Household Responses A key consideration in evaluating the impact of education policy interventions over a longer time horizon is the extent to which the effect of the intervention is attenuated or amplified by changes in behavior of other agents (especially households) reflecting re-optimization in light of the intervention (see Das et al. 2011 for a theoretical treatment of this issue combined with empirical evidence from Zambia and India, and Pop-Eleches and Urquiola 2011 for an application in Romania). We therefore conduct household surveys at the end of Y5 of the program and collect data on household expenditure on education, student time allocation, and household perceptions of school quality across both treatment and control groups for students in cohort 5. We find no significant differences on any of these measures across the II, GI, and control schools. Point estimates suggest lower rates of household expenditure, and greater time spent on studying at home for children in incentive schools, but none of these are significant. 
Overall, the results suggest that improvements in school quality resulting from greater teacher effort do not appear to be salient enough to parents for them to adjust their own inputs into their child’s education (unlike say in the case of books and materials provided through the school – see Das et al. 2011). 5. Test Score Fade Out and Cost Effectiveness 5.1 Test Score Fade Out and Net vs. Gross Treatment Effects It is well established in the education literature that test scores decay over time, and that the test score gains obtained from education interventions typically do not persist over time – with substantial fade out observed even over one year (see Andrabi et al. 2011; Rothstein 2010; Jacob, Lefgren, and Sims 2010; and Deming 2009 for examples). Applying an analogy of physical capital to human capital, the n-year ‘net’ treatment effect consists of the sum of each of - 19 - the previous n-1 years’ ‘gross’ treatment effects, the depreciation of these effects, and the n’th year ‘gross’ treatment effect. The experimental estimates presented in this paper are therefore estimates of ‘net’ treatment effects at the end of ‘n’ years that are likely to understate the impact of the treatments relative to the counterfactual of discontinuing the programs. The experimental discontinuation of the performance-pay treatments in half the originally treated schools after three years allows us to see this more directly. Table 10 presents outcomes for students who were exposed to the treatment for a full five years, as well as those who were exposed to the program for three years but did not have the program in the last 2 years. We see that there is no significant difference among these students at the end of three years – as would be expected given that the schools to be discontinued were chosen by lottery (Table 10 – column 1). However, while the scores in the individual incentive schools that continued in the program rise in years 4 and 5, the scores in the discontinued schools fall by about 40% in each of years 4 and 5 and are no longer significant at the end of five years. Thus, estimating the impact of continuing the program by comparing the 5-year TE to the 3-year TE would considerably understate the impact of the incentive programs in the last two years. The small sample sizes in Table 10 mean that our estimates of the rate of decay are not very precise (column 2 can only be estimated with cohorts 4 and 5, while column 3 can only be estimated with cohort 5), and so we only treat these estimates as suggestive of the fact that the estimated net TE at the end of n years may be lower than the sum of the gross TE’s at the same point in time (note for instance that there does not appear to be any decay in the TE’s in the GI schools where the program was discontinued, but the standard errors are large and we cannot rule out a point estimate that would be consistent with substantial decay). An important question to consider is whether what we care about in evaluating the long-term impact of education interventions is the sum of the annual ‘gross’ treatment effects or the ‘net’ TE at the end of n-years. There is growing evidence to suggest that interventions that produce test score gains lead to long-term gains in outcomes such as school completion and wages even though the test score gains themselves may fade away shortly after the interventions are stopped (see Deming 2009 for evidence from Head Start, and Chetty et al (forthcoming) for evidence from the Tennessee Star program). 
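One simple way to make the physical-capital analogy above concrete is the following sketch, in our own notation rather than the paper's, under the assumption (also maintained in Section 5.2 below) that a constant fraction δ of the accumulated effect depreciates each year:

    Net_n = Gross_n + (1 − δ)·Net_(n−1),  with Net_0 = 0,
    so that Net_n = Σ_{s=1..n} (1 − δ)^(n−s)·Gross_s ≤ Σ_{s=1..n} Gross_s whenever δ > 0.

The experimentally estimated n-year effects correspond to Net_n, which is why they understate the cumulative gross impact of the treatments whenever there is decay.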
Even more relevant is the evidence in Chetty, Friedman, and Rockoff (2011), who find that the extent of ‘gross’ value-addition of teachers in grades 4-8 is correlated with the long-term wages of the affected students. They also find evidence of decay in test scores over time, and find that estimating teachers’ impact on long-term wages using measures of net value addition (i.e., using the extent of the value addition that persists after a few years) would considerably underestimate their impact. Thus, it seems plausible that cost effectiveness calculations of multi-year experimental interventions should be based on estimates of gross treatment effects.

5.2 Estimating “Gross” Treatment Effects

The main challenge in doing this, however, is that the specification in (2) can be used to consistently estimate the n-year effect of the programs, but not the n’th-year effect (with the n’th-year test scores as the dependent variable, controlling for (n-1)’th-year scores), because the (n-1)’th-year scores are a post-treatment outcome that will be correlated with the treatment dummies. The literature estimating experimental treatment effects in education therefore typically estimates only the n-year effect. However, since most experiments in education do not last more than two or three years, the distinction between gross and net treatment effects, and the importance of isolating the former, has not been addressed before.

We propose two approaches towards estimating the gross treatment effect. The first is to estimate a standard test-score value-added model of the form:

T_ijkm(Y_n) = α + γ_j(Y_{n-1})·T_ijkm(Y_{n-1}) + β·Z_m + ε_k + ε_jk + ε_ijk    (5)

using only the control schools, and obtain the estimate γ̂_j(Y_{n-1}). We then use γ̂_j(Y_{n-1}) to estimate a transformed version of (2) in which the dependent variable corresponds to an estimate of the gross value-added:

T_ijkm(Y_n) − γ̂_j(Y_{n-1})·T_ijkm(Y_{n-1}) = α + δ_II·II + δ_GI·GI + β·Z_m + ε_k + ε_jk + ε_ijk    (6)

using all 25 possible 1-year comparisons (i.e., using all 25 cells in Figure 2). The main point of this transformation is that γ_j(Y_{n-1}) is not estimated jointly with δ_II and δ_GI, and the estimates of δ_II and δ_GI obtained from (6) will be consistent estimates of the average annual gross treatment effect as long as γ̂_j is consistently estimated in (5).21

Footnote 21: While this is a standard assumption in the literature on test-score value addition, it need not hold true in general, since measurement error in test scores would bias the estimate downwards, while unobserved heterogeneity in student learning rates would bias it upwards. However, Andrabi et al. (2011) show that when both sources of bias are corrected for in data from Pakistan (a similar South Asian context), the corrected estimate is not significantly different from the OLS estimate used in the literature, suggesting that the bias in this approach is likely to be small. There are two other assumptions necessary for (6) to consistently estimate gross treatment effects. The first is that test scores decay at a constant rate and that the level of decay only depends on the current test score (and does not vary based on the inputs that produced these scores). The second is that the rate of decay is constant at all levels of learning (as implied by the linear functional form). Both of these assumptions are standard in the education production function and value-added literature (see Todd and Wolpin 2003).
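A minimal sketch of this two-step procedure, again in Python with the hypothetical column names used in the earlier sketches (the data frame here stacks all the one-year comparisons, with score_yn_1 denoting the lagged score), would be as follows; it illustrates the logic of (5) and (6) rather than reproducing the authors' implementation.

    import statsmodels.formula.api as smf

    # Step 1 (equation 5): grade-specific value-added coefficients gamma_j,
    # estimated on control schools only.
    controls = df[(df["ii"] == 0) & (df["gi"] == 0)]
    gamma = {}
    for g, sub in controls.groupby("grade"):
        fit = smf.ols("score_yn ~ score_yn_1 + C(mandal)", data=sub).fit()
        gamma[g] = fit.params["score_yn_1"]

    # Step 2 (equation 6): residualize the outcome with the control-school gammas
    # and regress on the treatment dummies, clustering at the school level.
    df["gross_gain"] = df["score_yn"] - df["grade"].map(gamma) * df["score_yn_1"]
    res6 = smf.ols("gross_gain ~ ii + gi + C(mandal)", data=df).fit(
        cov_type="cluster", cov_kwds={"groups": df["school_id"]}
    )
    print(res6.params[["ii", "gi"]])  # average annual gross treatment effects

The key design point, as in the text, is that the lagged-score coefficient is pinned down from the control schools alone before the treatment effects are estimated.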
Estimating equation (6), we find that the annual gross treatment effect of the individual incentive program was 0.164 SD in math and 0.105 SD in language (Table 11 – Panel A). The sum of these gross treatment effects would be 0.82 SD for math and 0.53 SD for language over the five years of primary school, suggesting that not accounting for decay would considerably understate the impact of the treatments (comparing these estimates to those in Table 3). For the group incentive schools, we find smaller effects of 0.086 SD in math and 0.043 SD in language (with the latter not being significant). These estimates suggest that the presence of decay may partly be responsible for not finding a significant effect on test scores at the end of five years for cohort 5 in the GI schools, even though the impact of the GI program was typically positive (albeit smaller than that of the II program).

Second, we estimate an average non-parametric treatment effect of the incentive programs in each year of the program by comparing the Y(n) scores of treatment and control students who start at the same Y(n-1) score. The average non-parametric treatment effect (ATE) is the difference between the treatment and control plots of Y(n) scores, integrated over the density of the control-school distribution, and is implemented as follows (shown for II schools):

ATE = (1/100) × Σ_{i=1..100} [ T(Y_n(II)) − T(Y_n(C)) | T(Y_{n-1}(II)) = T(Y_{n-1}(C)) = P_{i,n-1}(C) ]    (7)

where P_{i,n-1}(C) is the i’th percentile of the distribution of control-school scores in Y(n-1), and T(Y_n(II)), T(Y_n(C)), T(Y_{n-1}(II)), and T(Y_{n-1}(C)) are the test scores at the end of Y(n) and Y(n-1) in the treatment (II) and control (C) schools respectively.

The intuition behind this estimate is straightforward. If test scores decay at a constant rate, then the absolute test score decay will be higher in the treatment schools in the second year and beyond (because test scores in these schools are higher after the first year), and calculating the n’th-year treatment effect as the difference between the n-year and (n-1)-year net treatment effects (based on equation 2) will under-estimate the n’th-year treatment effect. By matching treatment and control students on test scores at the end of Y(n-1) and measuring the additional gains in Y(n), we eliminate the role of decay: because the treatment and control students being compared have the same Y(n-1) score, they will have the same absolute decay, and the difference in scores between these students at the end of Y(n) will be an estimate of the n’th-year treatment effect that is not confounded by differential decay of test scores across treatment and control schools. The treatment effects estimated at each percentile of the control-school distribution are then integrated over the density of the control distribution to compute an average non-parametric treatment effect.22 The main advantage of this approach is that it does not require a consistent estimate of γ̂_j, as required in the estimates from equation (6). A further advantage is that it does not require γ̂_j to be the same at all points in the test score distribution.

Footnote 22: Note that the treatment distribution in Y1 and beyond will be to the right of the control distribution. Thus, integrating over the density of the control distribution adjusts for the fact that there are more students with higher Y(n-1) scores in treatment schools and that the test scores of these students will decay more (in absolute terms) than those with lower scores. In other words, treatment effects are calculated at every percentile of the control distribution and then averaged across these percentiles, regardless of the number of treatment students in each percentile of the control distribution at the end of Y(n-1). Also, the estimate only uses students in the common support of the distribution of Y1 scores between treatment and control schools (less than 0.1% of students are dropped as a result of this).
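The following is a minimal sketch of this matching estimator in Python (pandas/numpy), assuming, as in the earlier sketches, a stacked data frame of one-year comparisons with hypothetical column names; students are binned by percentiles of the control-school Y(n-1) distribution, and the treatment-control difference in mean Y(n) scores is averaged with equal weight on each percentile bin, in the spirit of (7).

    import numpy as np
    import pandas as pd

    def nonparametric_ate(df, treat_col="ii"):
        """Average non-parametric one-year treatment effect in the spirit of equation (7).

        df has one row per student per one-year comparison, with hypothetical
        columns score_yn (current score), score_yn_1 (lagged score), and 0/1
        treatment dummies ii and gi; control rows have ii == gi == 0.
        """
        control = df[(df["ii"] == 0) & (df["gi"] == 0)]
        treated = df[df[treat_col] == 1]

        # Percentile cut points of the control-school lagged-score distribution.
        cuts = np.unique(np.percentile(control["score_yn_1"], np.arange(0, 101)))

        effects = []
        for lo, hi in zip(cuts[:-1], cuts[1:]):
            c = control[(control["score_yn_1"] >= lo) & (control["score_yn_1"] < hi)]
            t = treated[(treated["score_yn_1"] >= lo) & (treated["score_yn_1"] < hi)]
            if len(c) and len(t):  # common support only
                effects.append(t["score_yn"].mean() - c["score_yn"].mean())
        # Equal weight on each percentile bin of the control distribution.
        return float(np.mean(effects))

    # Example usage: print(nonparametric_ate(df, "ii"))

Bootstrap confidence intervals of the kind reported below can be obtained by re-running this function on resampled data (resampling schools, to respect the clustered design).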
The main assumption required for (7) to yield consistent estimates of the average 1-year treatment effect (beyond the first year of the program) is that the effect of the treatment is the same at all points of the distribution of unobservables (since the treatment distribution is to the right of the control distribution after Y1, students who are matched on scores will typically not be matched on unobservables). While this assumption cannot be tested, we find limited evidence of heterogeneity of treatment effects along several observable school and household characteristics, suggesting that there may be limited heterogeneity in treatment effects across unobservables as well.23

Footnote 23: Of course, this procedure also assumes that test scores decay at a constant rate and that the level of decay only depends on the current test score (and does not vary based on the inputs that produced these scores). As discussed earlier, this is a standard assumption in the estimation of education production functions (see Todd and Wolpin 2003).

Estimating equation (7) across all 25 1-year comparisons, we find that the annual gross treatment effect of the individual incentive program was 0.181 SD in math and 0.119 SD in language, with both being significant (Table 11 – Panel B). The corresponding numbers for the group incentive program are 0.065 SD and 0.032 SD, and neither is significant at the 5% level (95% confidence intervals are constructed by drawing 1,000 bootstrap samples and estimating the average non-parametric treatment effect in each sample). Figures 4A–4D show the plots used to calculate the average non-parametric treatment effects across the 25 possible 1-year comparisons, and these show remarkably constant treatment effects at every percentile of initial test scores. We also see that the point estimates for both the II and GI programs in Panel B of Table 11 are quite similar to those estimated in Panel A, and they again suggest that not accounting for decay would considerably understate the impact of the treatments (the estimates here imply that the cumulative 5-year impact of the II program would be 0.9 SD and 0.6 SD for math and language respectively, compared with the 0.54 SD and 0.35 SD estimated in Table 3).

5.3 Cost Effectiveness

The recently passed Right to Education (RtE) Act in India calls for reducing class sizes by one third, and the vast majority of the budgetary allocations for implementing the RtE is earmarked for teacher salaries. Muralidharan and Sundararaman (2010) estimate that halving school-level pupil-teacher ratios by hiring more regular civil-service teachers will increase test scores by 0.2 – 0.25 SD annually.
The typical government-run rural school has 3 teachers who are paid around Rs. 150,000/year for an annual salary bill of approximate Rs. 450,000/year per school. These figures suggest that reducing class size by a third will cost Rs. 150,000/year and increase test scores by 0.07 - 0.08 SD annually (per school). The individual incentive program cost Rs. 10,000/year per school in bonus costs and another Rs. 5,000/year per school to administer the program. The estimates from Table 11, suggests that the program cost Rs. 15,000/year for annual test score gains of 0.135 – 0.15 SD. Combining these numbers suggests that scaling up the individual teacher incentive program would be 15 to 20 times more cost effective in raising student test scores than pursuing the default policy of reducing class size by hiring additional civil-service teachers. 6. Conclusion We present evidence from the longest-running experimental evaluation of a teacher performance pay program in the world, and find that students who completed their entire primary school education under a system where their teachers received individual-level bonuses based on the performance of their students, scored 0.54 and 0.35 standard deviations higher on math and language tests respectively. We find no evidence to suggest that these gains represent only narrow gains in test scores as opposed to broader gains in human capital. In particular, we find that students in these schools also scored 0.52 and 0.30 SD higher on science and social studies even though there were no incentives paid to teachers on the basis of performance on these subjects. - 24 - An important concern among skeptics of performance-linked pay for teachers based on student test scores is that improvements in performance on highly tested components of a curriculum (as would be likely if a teacher were ‘teaching to the test’) do not typically translate into improvements in less tested components of the same underlying class of skills/knowledge (Koretz 2002, 2008). Our findings of positive effects on non-incentive subjects suggest substantial positive spillovers between improvements in math and reading and performance on other subjects (whose content is beyond the domain that the incentive was provided on), and help to negate this concern in the context of Indian primary education. The long-term results also highlight that group and individual based performance pay for teachers may have significantly different outcomes – especially over time. The low (and often indistinguishable from zero) impact of group incentives is quite striking given that the typical schools has 3 teachers and peer monitoring of effort should have been relatively easy. One possible interpretation of this result is that it is difficult for teachers even in small groups to effectively monitor the intensity of effort of their peers. The results also suggest that it may be challenging for group-based incentive programs with much larger groups of teachers (as are being tried in many states in the US) to deliver increases in student learning. While our specific findings (and point estimates of program impact) are likely to be context- specific, many features of the Indian education system (like low average levels of learning, low norms for teacher effort in government-run schools, and an academic and pedagogic culture that highly values performance on high-stakes tests), are found in other developing countries as well. 
Our results therefore suggest that performance pay for teachers could be an effective policy tool in India and perhaps in other similar contexts as well. The impact of performance pay estimated in this paper has been restricted to the gains in student test scores attributable to greater effort from teachers currently in schools. However, in the long run, the benefits to performance pay include not only greater teacher effort, but also potentially attracting more effective teachers into the profession (Lazear 2000, 2003; Hoxby and Leigh 2005). In this case, the estimates presented in this paper are likely to be a lower bound on the long-term impact of introducing systems of individual teacher performance pay. Finally, Muralidharan and Sundararaman (2011a) report high levels of teacher support for the idea of performance linked pay, with 85% of teachers reporting a favorable opinion about the idea and - 25 - 68% mentioning that the government should try to scale up programs of the sort implemented under this project. The main challenge to scaling up teacher performance pay programs of the type studied in this paper is likely to be administrative capacity to maintain the integrity of the testing procedures. However, the results reported in this paper over five years, suggest that it may be worth investing in the administrative capacity (perhaps using technology for testing) to implement such a program at a local scale (such as a district or comparably sized jurisdiction in India) and learn if such implementation is feasible. Combining scale ups with credible evaluation strategies will help answer whether teacher performance pay programs can continue to deliver benefits when administered at scale. - 26 - References: ANDRABI, T., J. DAS, A. KHWAJA, and T. ZAJONC (2011): "Do Value-Added Estimates Add Value? Accounting for Learning Dynamics," American Economic Journal: Applied Economics, 3, 29-54. BAKER, G. (1992): "Incentive Contracts and Performance Measurement," Journal of Political Economy, 100, 598-614. BANDIERA, O., I. BARANKAY, and I. RASUL (2011): "Field Experiments with Firms," Journal of Economic Perspectives, 25, 63-82. BANDIERA, O., A. PRAT, and T. VALLETTI (2009): "Active and Passive Waste in Government Spending: Evidence from a Policy Experiment," American Economic Review, 99, 1278- 1308. BESLEY, T., and M. GHATAK (2005): "Competition and Incentives with Motivated Agents," American Economic Review, 95, 616-636. CHETTY, R., J. N. FRIEDMAN, N. HILGER, E. SAEZ, D. W. SCHANZENBACH, and D. YAGAN (Forthcoming): "How Does Your Kindergarten Classroom Affect Your Earnings: Evidence from Project Star," Quarterly Journal of Economics. CHETTY, R., J. N. FRIEDMAN, and J. E. ROCKOFF (2011): "The Long-Term Impact of Teachers: Teacher Value-Added and Student Outcomes in Adulthood," Harvard. CULLEN, J. B., and R. REBACK (2006): "Tinkering Towards Accolades: School Gaming under a Performance Accountability System," in Advances in Applied Microeconomics, Volume 14: Elsiever, 1-34. DAL BO, E., F. FINAN, and M. ROSSI (2011): "Strengthening State Capabilities: The Role of Financial Incentives in the Call to Public Service," UC Berkeley. DAS, J., S. DERCON, J. HABYARIMANA, P. KRISHNAN, K. MURALIDHARAN, and V. SUNDARARAMAN (2011): "School Inputs, Household Substitution, and Test Scores," National Bureau of Economic Research Working Paper 16830. DECI, E. L., and R. M. RYAN (1985): Intrinsic Motivation and Self-Determination in Human Behavior. New York: Plenum. DEMING, D. 
(2009): "Early Childhood Intervention and Life-Cycle Skill Development: Evidence from Head Start," American Economic Journal: Applied Economics, 1, 111-34. DHALIWAL, I., E. DUFLO, R. GLENNERSTER, and C. TULLOCH (2011): "Comparative Cost- Effectiveness Analysis to Inform Policy in Developing Countries: A General Framework with Applications for Education," MIT. DIXIT, A. (2002): "Incentives and Organizations in the Public Sector: An Interpretative Review," Journal of Human Resources, 37, 696-727. DOLTON, P. (2006): "Teacher Supply," in Handbook of the Economics of Education, ed. by E. Hanushek, and F. Welch: North-Holland. EHRENBERG, R. G., and J. L. SCHWARZ (1986): "Public-Sector Labor Markets," in Handbook of Labor Economics, ed. by O. Ashenfelter, and R. Layard: Elsiever. FEHR, E., and A. FALK (2002): "Psychological Foundations of Incentives," European Economic Review, 46, 687-724. FIGLIO, D. N., and J. WINICKI (2005): "Food for Thought: The Effects of School Accountability Plans on School Nutrition," Journal of Public Economics, 89, 381-94. FRYER, R. G. (2011): "Teacher Incentives and Student Achievement: Evidence from New York City Public Schools," National Bureau of Economic Research Working Paper 16850. - 27 - GLEWWE, P., N. ILIAS, and M. KREMER (2010): "Teacher Incentives," American Economic Journal: Applied Economics, 2, 205-227. GOODMAN, S., and L. TURNER (2010): "Teacher Incentive Pay and Educational Outcomes: Evidence from the Nyc Bonus Program," Columbia University. GREGORY, R. G., and J. BORLAND (1999): "Recent Developments in Public Sector Labor Markets," in Handbook of Labor Economics, Vol 3, ed. by O. Ashenfelter, and D. Card. HAMILTON, B. H., J. A. NICKERSON, and H. OWAN (2003): "Team Incentives and Worker Heterogeneity: An Empirical Analysis of the Impact of Teams on Productivity and Participation," Journal of Political Economy 111, 465-97. HOLMSTROM, B., and P. MILGROM (1991): "Multitask Principal-Agent Analyses: Incentive Contracts, Asset Ownership, and Job Design," Journal of Law, Economics, and Organization, 7, 24-52. HOXBY, C. M., and A. LEIGH (2005): "Pulled Away or Pushed Out? Explaining the Decline of Teacher Aptitude in the United States," American Economic Review, 94, 236-40. ITOH, H. (1991): "Incentives to Help in Multi-Agent Situations," Econometrica, 59, 611-36. JACOB, B., L. LEFGREN, and D. SIMS (2010): "The Persistence of Teacher-Induced Learning Gains," Journal of Human Resources, 45, 915-943. JACOB, B. A. (2005): "Accountability, Incentives and Behavior: The Impact of High-Stakes Testing in the Chicago Public Schools," Journal of Public Economics, 89, 761-96. JACOB, B. A., and S. D. LEVITT (2003): "Rotten Apples: An Investigation of the Prevalence and Predictors of Teacher Cheating," Quarterly Journal of Economics 118, 843-77. KANDEL, E., and E. LAZEAR (1992): "Peer Pressure and Partnerships," Journal of Political Economy, 100, 801-17. KORETZ, D. M. (2002): "Limitations in the Use of Achievement Tests as Measures of Educators' Productivity," Journal of Human Resources, 37, 752-77. — (2008): Measuring Up: What Educational Testing Really Tells Us. Harvard University Press. KREMER, M., K. MURALIDHARAN, N. CHAUDHURY, F. H. ROGERS, and J. HAMMER (2005): "Teacher Absence in India: A Snapshot," Journal of the European Economic Association, 3, 658-67. LAVY, V. (2002): "Evaluating the Effect of Teachers' Group Performance Incentives on Pupil Achievement," Journal of Political Economy, 110, 1286-1317. 
— (2009): "Performance Pay and Teachers' Effort, Productivity, and Grading Ethics," American Economic Review, 99, 1979 - 2011. LAZEAR, E. (2000): "Performance Pay and Productivity," American Economic Review, 90, 1346- 61. — (2003): "Teacher Incentives," Swedish Economic Policy Review, 10, 179-214. LAZEAR, E., and P. OYER (2009): "Personnel Economics," Stanford University. LEMIEUX, T., W. B. MACLEOD, and D. PARENT (2009): "Performance Pay and Wage Inequality," Quarterly Journal of Economics, 124, 1-49. MURALIDHARAN, K., and V. SUNDARARAMAN (2009): "Teacher Performance Pay: Experimental Evidence from India," National Bureau of Economic Research Working Paper 15323. — (2011): "Teacher Opinions on Performance Pay: Evidence from India," Economics of Education Review, 30, 394-403. — (2011): "Teacher Performance Pay: Experimental Evidence from India," Journal of Political Economy, 119, 39-77. — (2012): "Contract Teachers: Experimental Evidence from India," UC San Diego. - 28 - MURNANE, R. J., and D. K. COHEN (1986): "Merit Pay and the Evaluation Problem: Why Most Merit Pay Plans Fail and a Few Survive," Harvard Educational Review, 56, 1-17. NEAL, D. (2011): "The Design of Performance Pay in Education," University of Chicago. NEAL, D., and D. SCHANZENBACH (2010): "Left Behind by Design: Proficiency Counts and Test- Based Accountability," The Review of Economics and Statistics, 92, 263-283. POP-ELECHES, C., and M. URQUIOLA (2011): "Going to a Better School: Effects and Behavioral Responses," Columbia University. RAU, T. B., and D. G. CONTRERAS (2011): "Tournaments Incentives for Teachers: The Case of Chile," University of Chile, Department of Economics, 42. ROTHSTEIN, J. (2010): "Teacher Quality in Educational Production: Tracking, Decay, and Student Achievement," Quarterly Journal of Economics, 125, 175-214. SPRINGER, M. G., D. BALLOU, L. HAMILTON, V.-N. LE, J. R. LOCKWOOD, D. MCCAFFREY, M. PEPPER, and B. STECHER (2010): "Teacher Pay for Performance: Experimental Evidence from the Project on Incentives in Teaching," Nashville, TN: National Center for Performance Incentives at Vanderbilt University. TODD, P. E., and K. I. WOLPIN (2003): "On the Specification and Estimation of the Production Function for Cognitive Achievement," Economic Journal, 113, F3-33. WILSON, J. Q. (1989): Bureaucracy. New York: Basic Books. WORLD BANK (2003): World Development Report 2004: Making Services Work for Poor People. Washington, DC: Oxford University Press for the World Bank. - 29 - Figure 1: Experiment Design over 5 Years Treatment Year 1 Year 2 Year 3 Year 4 Year 5 Control 100 100 100 100 100 Individual Incentive 100 100 100 50 50 Group Incentive 100 100 100 50 50 Individual Incentive Discontinued 0 0 0 50 50 Group Incentive Discontinued 0 0 0 50 50 Notes: 1. Number of schools in the overall project indicated in each treatment/year cell 2. 
Figure 2: Nine Distinct Cohorts Exposed to the Interventions

            Year 1   Year 2   Year 3   Year 4   Year 5
Grade 1        5        6        7        8        9
Grade 2        4        5        6        7        8
Grade 3        3        4        5        6        7
Grade 4        2        3        4        5        6
Grade 5        1        2        3        4        5

Cell entries are cohort numbers: one cohort (5) is exposed for five years; two cohorts (4, 6) for four years; two cohorts (3, 7) for three years; two cohorts (2, 8) for two years; and two cohorts (1, 9) for one year.

Figures 3A-3D: Year 5 endline score (y-axis) plotted against the percentile of the endline score (x-axis) for the Individual Incentive, Group Incentive, and Control groups, with pairwise differences and 95% confidence bands. [Graphical content not reproducible in this text version.]

Figures 4A-4D: [Graphical content not reproducible in this text version.]

Table 1: Sample Balance Across Treatments

Panel A: Validity of Randomization for Continuation/Discontinuation of Treatments
(Columns: [1] II Discontinued; [2] II Continued; p-value (H0: [1] = [2]); [3] GI Discontinued; [4] GI Continued; p-value (H0: [3] = [4]))

                        [1]       [2]     p-value     [3]       [4]     p-value
Infrastructure         2.780     3.000     0.68      2.720     2.640     0.88
Proximity             13.920    13.694     0.93     14.500    13.680     0.73
Cohorts 4-7 Maths      0.048     0.166     0.42      0.036    -0.068     0.34
Cohorts 4-7 Telugu     0.039     0.120     0.53      0.051    -0.077     0.28
Cohort 5 Maths        -0.017     0.100     0.47     -0.063     0.051     0.40
Cohort 5 Telugu        0.036     0.027     0.95     -0.070    -0.028     0.75

Panel B: Balance of Incoming Cohorts (6-9) Across Treatment/Control Groups
(Columns: [1] Control; [2] II; [3] GI; p-value (H0: [1] = [2] = [3]))

Cohort 6
  Class Enrollment     29.039    27.676    26.566    0.364
  Household Affluence   3.342     3.334     3.265    0.794
  Parent Literacy       1.336     1.295     1.250    0.539
Cohort 7
  Class Enrollment     22.763    21.868    19.719    0.433
  Household Affluence   3.308     3.227     3.173    0.678
  Parent Literacy       1.164     1.133     1.205    0.687
Cohort 8
  Class Enrollment     21.119    21.075    19.118    0.604
  Household Affluence   3.658     3.407     3.470    0.536
  Parent Literacy       1.128     1.208     1.243    0.155
Cohort 9
  Class Enrollment     19.659    18.979    18.356    0.804
  Household Affluence   3.844     3.626     3.627    0.165
  Parent Literacy       1.241     1.143     1.315    0.414

Notes: 1. The infrastructure index is the sum of six binary variables indicating the existence of a brick building, a playground, a compound wall, a functioning source of water, a functional toilet, and functioning electricity. 2. The proximity index is the sum of 8 variables (each coded from 1 to 3) indicating proximity to a paved road, a bus stop, a public health clinic, a private health clinic, a public telephone, a bank, a post office, and the mandal educational resource center. 3. The p-values for the student-level variables are computed by treating each student as one observation and clustering the standard errors at the school level; the p-values for school-level variables are computed treating each school as an observation.
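For illustration only, a balance test of the kind described in note 3 could be computed along the following lines. The data file and column names (treat_II, treat_GI, school_id, household_affluence) are hypothetical, and this is a minimal sketch rather than the study's actual code.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical student-level baseline file; all column names are assumptions.
df = pd.read_csv("baseline_students.csv")

# Regress a baseline characteristic on the two treatment dummies (control is
# the omitted category), clustering the standard errors at the school level.
res = smf.ols("household_affluence ~ treat_II + treat_GI", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["school_id"]})

# Joint test of H0: Control = II = GI, analogous to the p-value column in Panel B.
print(res.f_test("treat_II = 0, treat_GI = 0"))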
Table 2: Student and Teacher Attrition
(Columns: Control; Individual Incentive; Group Incentive; p-value)

Panel A: Student Attrition, Cohorts 1-5 (corresponds to Table 3)
 1  Y1/Y0  Fraction attrited         0.140    0.133    0.138    0.75
 2         Baseline score, maths    -0.163   -0.136   -0.138    0.96
 3         Baseline score, telugu   -0.224   -0.197   -0.253    0.87
 4  Y2/Y0  Fraction attrited         0.293    0.276    0.278    0.58
 5         Baseline score, maths    -0.116   -0.03    -0.108    0.61
 6         Baseline score, telugu   -0.199   -0.113   -0.165    0.71
 7  Y3/Y0  Fraction attrited         0.406    0.390    0.371    0.32
 8         Baseline score, maths    -0.102   -0.038   -0.065    0.83
 9         Baseline score, telugu   -0.165   -0.086   -0.093    0.75
10  Y4/Y0  Fraction attrited         0.474    0.450    0.424    0.24
11         Baseline score, maths    -0.134    0.015    0.006    0.50
12         Baseline score, telugu   -0.126    0.104   -0.004    0.25
13  Y5/Y0  Fraction attrited         0.556    0.511    0.504    0.28

Panel B: Student Attrition, Cohorts 5-9 (corresponds to Table 4)
14  Grade 1   0.154    0.143    0.153    0.38
15  Grade 2   0.36     0.32     0.323    0.14
16  Grade 3   0.443    0.421    0.403    0.23
17  Grade 4   0.507    0.457    0.435    0.06
18  Grade 5   0.556    0.511    0.504    0.28

Panel C: Teacher Attrition
19  Y1/Y0   0.335    0.372    0.304    0.21
20  Y2/Y0   0.349    0.375    0.321    0.40
21  Y3/Y0   0.371    0.375    0.324    0.35
22  Y4/Y0   0.385    0.431    0.371    0.31
23  Y5/Y0   0.842    0.840    0.783    0.17

Notes: 1. Panel A shows student attrition relative to the population that started in the sample at the baseline (Y0); this is the relevant attrition to consider alongside the results in Table 3 (cohorts 1-5). 2. Panel B shows student attrition relative to initial enrollment for cohorts 5-9. Grade 1 attrition is the average attrition of all 5 cohorts by the end of grade 1; grade 2 attrition is the average attrition of cohorts 5-8 at the end of grade 2; and so on. This is the relevant attrition to consider alongside the results in Table 4 (each row corresponds to the attrition associated with the estimation in the corresponding column of Table 4). 3. Panel C shows teacher attrition (due to transfers) relative to the initial sample of teachers who started in the project in Y0. Teacher headcount stayed roughly constant through the 5 years, so (1 - attrition) gives the share of the Y0 teachers still in the project schools, and the attrition rate correspondingly gives the share of teachers who are new relative to Y0.
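As a minimal illustration of how the attrition shares in Panel A could be tabulated, the sketch below uses hypothetical file and column names (student_id, treatment, baseline_math); it is not the study's code.

import pandas as pd

# Hypothetical baseline roster and Y5 test file; names are assumptions.
roster = pd.read_csv("y0_roster.csv")
tested_y5 = pd.read_csv("y5_tests.csv")["student_id"]

roster["attrited_y5"] = ~roster["student_id"].isin(tested_y5)

# Fraction of the Y0 sample no longer tested in Y5, by treatment arm (row 13).
print(roster.groupby("treatment")["attrited_y5"].mean())

# Mean baseline maths score of the attriters, by arm (rows such as row 11).
print(roster[roster["attrited_y5"]].groupby("treatment")["baseline_math"].mean())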
Table 3: Impact of teacher performance pay by years of exposure to the program, for cohorts starting in Y0 (Cohorts 1-5)
(Columns: One Year / Two Years / Three Years / Four Years / Five Years of exposure)

Panel A: Combined
  Individual Incentive    0.156      0.273      0.203      0.448      0.444
                         (0.050)*** (0.058)*** (0.064)*** (0.092)*** (0.101)***
  Group Incentive         0.142      0.159      0.140      0.185      0.129
                         (0.050)*** (0.057)*** (0.057)**  (0.084)**  (0.085)
  Observations            42145      26936      16765      6915       3456
  R-squared               0.312      0.265      0.229      0.268      0.323
  P-value (II = GI)       0.78       0.10       0.35       0.02       0.00

Panel B: Maths
  Individual Incentive    0.184      0.319      0.252      0.573      0.538
                         (0.059)*** (0.067)*** (0.075)*** (0.117)*** (0.129)***
  Group Incentive         0.175      0.224      0.176      0.197      0.119
                         (0.057)*** (0.069)*** (0.066)*** (0.098)**  (0.106)
  Observations            20946      13385      8343       3442       1728
  R-squared               0.300      0.268      0.238      0.316      0.370
  P-value (II = GI)       0.90       0.25       0.35       0.00       0.00

Panel C: Telugu
  Individual Incentive    0.129      0.229      0.155      0.325      0.350
                         (0.045)*** (0.053)*** (0.057)*** (0.077)*** (0.087)***
  Group Incentive         0.108      0.095      0.106      0.173      0.139
                         (0.047)**  (0.052)*   (0.055)*   (0.079)**  (0.080)*
  Observations            21199      13551      8422       3473       1728
  R-squared               0.336      0.283      0.234      0.244      0.298
  P-value (II = GI)       0.64       0.03       0.42       0.10       0.02

Estimation sample (by column): Cohorts 1-5 / 2-5 / 3-5 / 4-5 / 5; Year 1 / 2 / 3 / 4 / 5; Grades 1-5 / 2-5 / 3-5 / 4-5 / 5.

Notes: 1. All regressions include mandal (sub-district) fixed effects, with standard errors clustered at the school level. 2. The estimation sample of cohort/year/grade cells should be read in conjunction with Figure 2, which shows the cohorts, years, and grades used in estimating each treatment effect.
* significant at 10%; ** significant at 5%; *** significant at 1%

Table 4: Impact of teacher performance pay by years of exposure to the program, for cohorts starting in Grade 1 (Cohorts 5-9)
(Columns (1)-(5): One Year / Two Years / Three Years / Four Years / Five Years of exposure)

Panel A: Combined
  Individual Incentive    0.130      0.118      0.135      0.279      0.444
                         (0.055)**  (0.058)**  (0.058)**  (0.068)*** (0.101)***
  Group Incentive         0.061     -0.066     -0.000      0.088      0.129
                         (0.053)    (0.057)    (0.057)    (0.070)    (0.085)
  Observations            36903      22197      13876      7811       3456
  R-squared               0.076      0.128      0.188      0.261      0.323
  P-value (II = GI)       0.21       0.00       0.03       0.02       0.00

Panel B: Maths
  Individual Incentive    0.133      0.116      0.157      0.356      0.538
                         (0.059)**  (0.061)*   (0.061)**  (0.085)*** (0.129)***
  Group Incentive         0.062     -0.064      0.013      0.099      0.119
                         (0.055)    (0.058)    (0.056)    (0.081)    (0.106)
  Observations            18345      11092      6941       3906       1728
  R-squared               0.078      0.132      0.194      0.290      0.370
  P-value (II = GI)       0.220      0.00967    0.0296     0.0106     0.00473

Panel C: Telugu
  Individual Incentive    0.126      0.121      0.114      0.203      0.350
                         (0.056)**  (0.057)**  (0.060)*   (0.062)*** (0.087)***
  Group Incentive         0.060     -0.067     -0.014      0.077      0.139
                         (0.056)    (0.059)    (0.064)    (0.067)    (0.080)*
  Observations            18558      11105      6935       3905       1728
  R-squared               0.081      0.130      0.191      0.247      0.298
  P-value (II = GI)       0.243      0.00354    0.0552     0.0779     0.0199

Estimation sample (by column): Cohorts 5-9 / 5-8 / 5-7 / 5-6 / 5; Years 1-5 / 2-5 / 3-5 / 4-5 / 5; Grade 1 / 2 / 3 / 4 / 5.

Notes: 1. All regressions include mandal (sub-district) fixed effects, with standard errors clustered at the school level. 2. The estimation sample of cohort/year/grade cells should be read in conjunction with Figure 2, which shows the cohorts, years, and grades used in estimating each treatment effect.
* significant at 10%; ** significant at 5%; *** significant at 1%
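Purely as an illustrative sketch of the specification described in note 1 above (a regression with mandal fixed effects and school-clustered standard errors), and not the authors' code, the column estimates could be obtained along these lines; the file and variable names are hypothetical and the paper's exact set of controls may differ.

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("endline_scores.csv")   # hypothetical analysis file

# Normalized endline score on baseline score, treatment dummies, and mandal
# fixed effects, with standard errors clustered at the school level.
spec = "endline_score ~ baseline_score + treat_II + treat_GI + C(mandal)"
res = smf.ols(spec, data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["school_id"]})

print(res.params[["treat_II", "treat_GI"]])   # analogues of the II and GI rows
print(res.f_test("treat_II = treat_GI"))      # analogue of the P-value (II = GI) row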
The "Estimation Sample" of Cohort/Year/Grade should be seen in conjunction with Figure 2 to clearly see the cohorts, years, and grades used in the estimation of treatment effects * significant at 10%; ** significant at 5%; *** significant at 1% Table 5 : Mean treatment effect after 'N' years of exposure Using the "Full Sample" (9, 7, 5, 3, and 1 cohorts for 1, 2, 3, 4, 5 years of exposure) Panel A : Combined One Year Two Years Three Years Four Years Five Years (1) (2) (3) (4) (5) Individual Incentive 0.154 0.204 0.191 0.331 0.444 (0.045)*** (0.050)*** (0.056)*** (0.072)*** (0.101)*** Group Incentive 0.106 0.061 0.089 0.123 0.129 (0.044)** (0.049) (0.051)* (0.067)* (0.085) Observations 70030 42201 24774 10961 3456 R-squared 0.183 0.197 0.209 0.246 0.323 Pvalue II = GI 0.29 0.02 0.09 0.01 0.00 Panel B : Maths One Year Two Years Three Years Four Years Five Years (1) (2) (3) (4) (5) Individual Incentive 0.175 0.229 0.227 0.425 0.538 (0.051)*** (0.055)*** (0.062)*** (0.089)*** (0.129)*** Group Incentive 0.127 0.098 0.109 0.137 0.119 (0.048)*** (0.055)* (0.055)** (0.077)* (0.106) Observations 34796 21014 12349 5465 1728 R-squared 0.177 0.192 0.213 0.28 0.37 Pvalue II = GI 0.35 0.05 0.08 0.01 0.00 Panel C : Telugu One Year Two Years Three Years Four Years Five Years (1) (2) (3) (4) (5) Individual Incentive 0.133 0.180 0.155 0.237 0.350 (0.043)*** (0.047)*** (0.053)*** (0.062)*** (0.087)*** Group Incentive 0.085 0.024 0.069 0.108 0.139 (0.044)* (0.048) (0.052) (0.063)* (0.080)* Observations 35234 21187 12425 5496 1728 R-squared 0.20 0.21 0.22 0.23 0.30 Pvalue II = GI 0.26 0.00 0.13 0.07 0.02 115 214 313 412 Cohort/Year/Grade 225 324 423 522 335 434 533 643 511 621 731 841 445 544 654 555 (CYG) Indicator 632 742 852 753 951 Notes: 1. All regressions include mandal (sub-district) fixed effects and standard errors clustered at the school level. 2. 
The "Estimation Sample" of Cohort/Year/Grade should be seen in conjunction with Figure 2 to clearly see the cohorts, years, and grades used in the estimation of treatment effects * significant at 10%; ** significant at 5%; *** significant at 1% Table 6A : "N" Year Impact of Performance Pay by Repeat and Non-Repeat Questions Dependent Variable : Percentage Test Score Maths Telugu One Year Two Years Three Years Four Years Five Years One Year Two Years Three Years Four Years Five Years Percentage Score on Non-repeats 0.201 0.285 0.247 0.257 0.274 0.342 0.413 0.412 0.388 0.427 (0.006)*** (0.006)*** (0.006)*** (0.008)*** (0.011)*** (0.008)*** (0.007)*** (0.008)*** (0.009)*** (0.011)*** Incremental Score on Repeats 0.031 0.071 -0.002 -0.011 0.048 -0.017 0.040 -0.033 -0.147 0.024 (0.004)*** (0.004)*** (0.005) (0.005)** (0.010)*** (0.005)*** (0.005)*** (0.005)*** (0.006)*** (0.012)** Incremental Score on non-repeats II schools 0.035 0.029 0.031 0.063 0.092 0.035 0.030 0.033 0.043 0.073 (0.010)*** (0.011)*** (0.010)*** (0.014)*** (0.023)*** (0.010)*** (0.011)*** (0.011)*** (0.013)*** (0.020)*** Incremental Score on repeats II schools 0.047 0.046 0.049 0.089 0.103 0.060 0.048 0.038 0.058 0.054 (0.012)*** (0.013)*** (0.013)*** (0.018)*** (0.024)*** (0.012)*** (0.013)*** (0.013)*** (0.018)*** (0.021)** Incremental Score on non-repeats GI schools 0.028 0.002 0.007 0.019 0.020 0.025 -0.006 0.016 0.022 0.024 (0.009)*** (0.011) (0.010) (0.013) (0.017) (0.011)** (0.011) (0.012) (0.014) (0.017) Incremental Score on repeats GI schools 0.029 0.014 0.023 0.026 0.034 0.055 0.009 0.023 0.016 0.020 (0.012)** (0.012) (0.011)** (0.014)* (0.020)* (0.012)*** (0.013) (0.012)* (0.014) (0.020) Observations 56828 38058 22584 10178 3166 57555 34486 22747 10178 3176 R-squared 0.144 0.167 0.132 0.214 0.289 0.142 0.142 0.151 0.242 0.185 Fraction of Repeat Questions 15.41% 19.67% 16.37% 16.67% 12.00% 13.95% 12.84% 10.36% 6.52% 6.25% Test For Equality of Treatment Effect for Repeat and 0.14 0.01 0.01 0.01 0.59 0.00 0.02 0.42 0.31 0.39 Non-repeat Questions in II Schools (F-stat p-value) Test For Equality of Treatment Effect for Repeat and 0.89 0.07 0.02 0.38 0.41 0.00 0.02 0.43 0.61 0.82 Non-repeat Questions in GI Schools (F-stat p-value) Table 6B : "N" Year Impact of Performance Pay by multiple choice and Non-multiple choice questions Dependent Variable : Percentage Test Score Maths Telugu One Year Two Years Three Years Four Years Five Years One Year Two Years Three Years Four Years Five Years Percentage Score on Non-mcq 0.201 0.292 0.233 0.241 0.295 0.308 0.371 0.348 0.325 0.424 (0.007)*** (0.007)*** (0.006)*** (0.008)*** (0.010)*** (0.007)*** (0.007)*** (0.008)*** (0.009)*** (0.012)*** Incremental Score on Mcq 0.028 0.021 0.020 0.057 -0.049 0.107 0.121 0.151 0.119 0.008 (0.005)*** (0.005)*** (0.006)*** (0.008)*** (0.007)*** (0.005)*** (0.005)*** (0.007)*** (0.008)*** (0.008) Incremental Score on non-mcq II schools 0.037 0.031 0.031 0.068 0.087 0.032 0.028 0.031 0.053 0.075 (0.010)*** (0.012)*** (0.011)*** (0.016)*** (0.020)*** (0.010)*** (0.012)** (0.012)** (0.015)*** (0.018)*** Incremental Score on mcq II schools 0.037 0.034 0.052 0.095 0.118 0.052 0.044 0.045 0.052 0.079 (0.011)*** (0.012)*** (0.012)*** (0.019)*** (0.023)*** (0.012)*** (0.010)*** (0.013)*** (0.016)*** (0.021)*** Incremental Score on non-mcq GI schools 0.027 0.003 0.009 0.016 0.021 0.024 -0.013 0.013 0.010 0.036 (0.010)*** (0.011) (0.010) (0.013) (0.018) (0.010)** (0.012) (0.013) (0.014) (0.017)** Incremental Score on mcq GI schools 0.027 0.012 0.029 0.037 
Table 6B: "N" Year Impact of Performance Pay by Multiple-Choice and Non-Multiple-Choice Questions
Dependent variable: percentage test score
(Columns within each subject: One Year / Two Years / Three Years / Four Years / Five Years of exposure)

Maths
  Percentage score on non-MCQs                  0.201      0.292      0.233      0.241      0.295
                                               (0.007)*** (0.007)*** (0.006)*** (0.008)*** (0.010)***
  Incremental score on MCQs                     0.028      0.021      0.020      0.057     -0.049
                                               (0.005)*** (0.005)*** (0.006)*** (0.008)*** (0.007)***
  Incremental score on non-MCQs, II schools     0.037      0.031      0.031      0.068      0.087
                                               (0.010)*** (0.012)*** (0.011)*** (0.016)*** (0.020)***
  Incremental score on MCQs, II schools         0.037      0.034      0.052      0.095      0.118
                                               (0.011)*** (0.012)*** (0.012)*** (0.019)*** (0.023)***
  Incremental score on non-MCQs, GI schools     0.027      0.003      0.009      0.016      0.021
                                               (0.010)*** (0.011)    (0.010)    (0.013)    (0.018)
  Incremental score on MCQs, GI schools         0.027      0.012      0.029      0.037      0.025
                                               (0.010)*** (0.011)    (0.010)*** (0.017)**  (0.018)
  Observations                                  63763      36110      21037      8279       3153
  R-squared                                     0.137      0.141      0.165      0.270      0.338
  Fraction of MCQs                              22.62%     24.33%     26.99%     27.08%     28%
  Equality of treatment effect for MCQ and
    non-MCQ questions, II schools (F-stat p-value)  0.96   0.66   0.01   0.06   0.01
  Equality of treatment effect for MCQ and
    non-MCQ questions, GI schools (F-stat p-value)  0.96   0.27   0.01   0.12   0.84

Telugu
  Percentage score on non-MCQs                  0.308      0.371      0.348      0.325      0.424
                                               (0.007)*** (0.007)*** (0.008)*** (0.009)*** (0.012)***
  Incremental score on MCQs                     0.107      0.121      0.151      0.119      0.008
                                               (0.005)*** (0.005)*** (0.007)*** (0.008)*** (0.008)
  Incremental score on non-MCQs, II schools     0.032      0.028      0.031      0.053      0.075
                                               (0.010)*** (0.012)**  (0.012)**  (0.015)*** (0.018)***
  Incremental score on MCQs, II schools         0.052      0.044      0.045      0.052      0.079
                                               (0.012)*** (0.010)*** (0.013)*** (0.016)*** (0.021)***
  Incremental score on non-MCQs, GI schools     0.024     -0.013      0.013      0.010      0.036
                                               (0.010)**  (0.012)    (0.013)    (0.014)    (0.017)**
  Incremental score on MCQs, GI schools         0.037      0.012      0.029      0.041      0.015
                                               (0.012)*** (0.011)    (0.013)**  (0.016)**  (0.018)
  Observations                                  64686      36313      21176      8330       3168
  R-squared                                     0.192      0.232      0.269      0.260      0.256
  Fraction of MCQs                              39.20%     39.19%     37.84%     38.41%     37.50%
  Equality of treatment effect for MCQ and
    non-MCQ questions, II schools (F-stat p-value)  0.00   0.03   0.17   0.95   0.82
  Equality of treatment effect for MCQ and
    non-MCQ questions, GI schools (F-stat p-value)  0.08   0.00   0.12   0.02   0.14

Cohort/Year/Grade (CYG) indicator cells, by column (same for both subjects): One Year: 115, 214, 313, 412, 511, 621, 731, 841, 951; Two Years: 225, 324, 423, 522, 632, 742, 852; Three Years: 335, 434, 533, 643, 753; Four Years: 445, 544, 654; Five Years: 555.

Notes: 1. All regressions include mandal (sub-district) fixed effects, with standard errors clustered at the school level.
* significant at 10%; ** significant at 5%; *** significant at 1%
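Tables 6A and 6B decompose the treatment effects by question type. As a rough, purely illustrative sketch (not the authors' code, and the rows reported in the tables may in practice be linear combinations of the estimated coefficients), a decomposition of this general form could be obtained by stacking scores by question type and interacting the type dummy with the treatment dummies; all file and column names below are assumptions.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical wide file with one row per student and percentage scores by
# question type.
wide = pd.read_csv("scores_by_question_type.csv")

long = wide.melt(
    id_vars=["student_id", "school_id", "mandal", "treat_II", "treat_GI"],
    value_vars=["pct_score_repeat", "pct_score_nonrepeat"],
    var_name="qtype", value_name="pct_score")
long["repeat_q"] = (long["qtype"] == "pct_score_repeat").astype(int)

# Type dummy, treatment dummies, and their interactions, with mandal fixed
# effects and school-clustered standard errors.
spec = ("pct_score ~ repeat_q + treat_II + treat_GI"
        " + repeat_q:treat_II + repeat_q:treat_GI + C(mandal)")
res = smf.ols(spec, data=long).fit(
    cov_type="cluster", cov_kwds={"groups": long["school_id"]})
print(res.summary())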
Table 7: "N" Year Impact of Performance Pay on Non-Incentive Subjects
(Columns within each subject: One Year / Two Years / Three Years / Four Years / Five Years of exposure; baseline scores are not available for the five-year column, which consists only of cohort 5)

Science (columns (1)-(5))
  Baseline maths           0.215      0.155      0.213      0.087      n.a.
                          (0.019)*** (0.022)*** (0.031)*** (0.051)*
  Baseline telugu          0.209      0.220      0.178      0.164      n.a.
                          (0.019)*** (0.023)*** (0.035)*** (0.055)***
  Individual Incentives    0.108      0.186      0.114      0.232      0.520
                          (0.063)*   (0.057)*** (0.056)**  (0.068)*** (0.125)***
  Group Incentives         0.114      0.035      0.076      0.168      0.156
                          (0.061)*   (0.055)    (0.054)    (0.067)**  (0.099)
  Observations             11765      9081       11133      4997       1592
  R-squared                0.259      0.189      0.127      0.160      0.306
  P-value (II = GI)        0.93       0.03       0.48       0.41       0.01

Social Science (columns (6)-(10))
  Baseline maths           0.225      0.165      0.150      0.130      n.a.
                          (0.018)*** (0.023)*** (0.033)*** (0.040)***
  Baseline telugu          0.288      0.191      0.222      0.139      n.a.
                          (0.019)*** (0.024)*** (0.036)*** (0.049)***
  Individual Incentives    0.126      0.223      0.159      0.198      0.299
                          (0.057)**  (0.061)*** (0.057)*** (0.066)*** (0.113)***
  Group Incentives         0.155      0.131      0.085      0.139      0.086
                          (0.059)*** (0.061)**  (0.057)    (0.065)**  (0.095)
  Observations             11765      9081       11133      4997       1592
  R-squared                0.308      0.181      0.134      0.148      0.211
  P-value (II = GI)        0.67       0.20       0.19       0.44       0.08

Cohort/Year/Grade (CYG) indicator cells, by column (same for both subjects): One Year: 115, 214, 313; Two Years: 225, 324, 423; Three Years: 335, 434, 533, 643, 753; Four Years: 445, 544, 654; Five Years: 555.

Notes: 1. All regressions include mandal (sub-district) fixed effects, with standard errors clustered at the school level.
* significant at 10%; ** significant at 5%; *** significant at 1%

Table 8A: "N" Year Heterogeneous Treatment Effects by School and Student Characteristics ("Full Sample")
(Columns within each treatment: One Year / Two Years / Three Years / Four Years / Five Years of exposure)

Individual Incentives
  Enrollment               0.197      0.100      0.120      0.095      0.266
                          (0.072)*** (0.075)    (0.093)    (0.111)    (0.148)*
  Infrastructure          -0.016     -0.095     -0.022      0.009     -0.032
                          (0.033)    (0.037)**  (0.043)    (0.062)    (0.097)
  Proximity                0.008      0.014      0.004      0.011      0.214
                          (0.013)    (0.015)    (0.019)    (0.029)    (0.030)***
  Household Affluence      0.023      0.004      0.032      0.013     -0.017
                          (0.020)    (0.021)    (0.021)    (0.031)    (0.044)
  Parental Literacy       -0.019     -0.024      0.017     -0.076     -0.108
                          (0.018)    (0.021)    (0.023)    (0.038)**  (0.059)*
  Male                     0.006     -0.042      0.044      0.060      0.066
                          (0.028)    (0.032)    (0.045)    (0.057)    (0.102)
  Baseline Test Score      0.002      0.048      0.031      0.006      0.015

Group Incentives
  Enrollment               0.120      0.059      0.063     -0.067      0.156
                          (0.076)    (0.072)    (0.084)    (0.101)    (0.132)
  Infrastructure           0.041     -0.003      0.007     -0.076     -0.070
                          (0.027)    (0.039)    (0.044)    (0.052)    (0.085)
  Proximity                0.009      0.020     -0.016      0.003      0.156
                          (0.013)    (0.015)    (0.017)    (0.026)    (0.037)***
  Household Affluence      0.032      0.031      0.040      0.008      0.032
                          (0.018)*   (0.020)    (0.021)*   (0.037)    (0.049)
  Parental Literacy       -0.008     -0.011     -0.012     -0.035     -0.005
                          (0.019)    (0.022)    (0.024)    (0.036)    (0.054)
  Male                     0.018      0.022      0.004     -0.021     -0.020
                          (0.028)    (0.038)    (0.045)    (0.083)    (0.120)
  Baseline Test Score     -0.002      0.044     -0.063     -0.032     -0.037
  (standard errors for the Baseline Test Score rows are incomplete in the source: -0.045, -0.076, -0.03, -0.038, -0.046, -0.078)

Cohort/Year/Grade (CYG) indicator cells (same for both treatment blocks): One Year: 115, 214, 313, 412, 511, 621, 731, 841, 951; Two Years: 225, 324, 423, 522, 632, 742, 852; Three Years: 335, 434, 533, 643, 753; Four Years: 445, 544, 654; Five Years: 555.

Table 8B: Heterogeneous Treatment Effects by Teacher Characteristics
Dependent variable: teacher value added (using all cohorts and years)
(Columns: covariate used in each regression)

  Covariate:              Teacher     Teacher     Teacher      Teacher     Teacher
                          Education   Training    Experience   Salary      Absence
  II                      -0.022     -0.120       0.221        0.082       0.132
                          (0.134)    (0.129)     (0.113)*     (0.482)     (0.037)***
  GI                      -0.065     -0.211       0.225        0.573       0.064
                          (0.136)    (0.137)     (0.093)**    (0.518)     (0.035)*
  Covariate               -0.006     -0.052      -0.027       -0.036      -0.119
                          (0.025)    (0.029)*    (0.020)      (0.029)     (0.044)***
  II * Covariate           0.049      0.091      -0.035        0.005       0.019
                          (0.041)    (0.046)**   (0.044)      (0.052)     (0.078)
  GI * Covariate           0.038      0.098      -0.070       -0.056      -0.020
                          (0.044)    (0.050)**   (0.037)*     (0.056)     (0.066)
  Observations             108560     108560      106592       106674      138594
  R-squared                0.057      0.057       0.059        0.058       0.052

Notes: 1. All regressions include mandal (sub-district) fixed effects, with standard errors clustered at the school level.
* significant at 10%; ** significant at 5%; *** significant at 1%
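As a hedged, illustrative sketch of an interaction specification of the kind reported in Table 8B (the variable names are hypothetical, the construction of the teacher value-added measure itself is not shown, and this is not the authors' code):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("teacher_value_added.csv")   # hypothetical analysis file

# Treatment dummies, one teacher covariate, and their interactions, with
# mandal fixed effects and school-clustered standard errors.
spec = ("value_added ~ treat_II + treat_GI + teacher_training"
        " + treat_II:teacher_training + treat_GI:teacher_training + C(mandal)")
res = smf.ols(spec, data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["school_id"]})

# The interaction terms correspond to the "II * Covariate" and "GI * Covariate" rows.
print(res.params.filter(like="teacher_training"))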
Table 9: Teacher Behavior (Observation and Interviews), Incentive versus Control Schools (all figures are proportions)
(Columns: [1] Control Schools; [2] Individual Incentive Schools; [3] Group Incentive Schools; [4] p-value (H0: II = Control); [5] p-value (H0: GI = Control); [6] p-value (H0: II = GI); [7] correlation with student test score gains)

                                                  [1]     [2]     [3]     [4]     [5]     [6]     [7]
Based on school observation
  Teacher absence                                 0.28    0.27    0.28    0.15    0.55    0.47   -0.109***
  Actively teaching at point of observation       0.39    0.42    0.40    0.18    0.42    0.58    0.114***
Based on teacher interviews
  Did you do any special preparation for the
    end-of-year tests? (share answering yes)      0.22    0.61    0.56    0.00    0.00    0.06    0.108***
  What kind of preparation did you do? (unprompted; share mentioning)
    Extra homework                                0.12    0.35    0.32    0.00    0.00    0.12    0.066**
    Extra classwork                               0.15    0.39    0.34    0.00    0.00    0.04    0.108***
    Extra classes/teaching beyond school hours    0.03    0.12    0.11    0.00    0.00    0.65    0.153***
    Gave practice tests                           0.10    0.29    0.25    0.00    0.00    0.04    0.118***
    Paid special attention to weaker children     0.06    0.18    0.15    0.00    0.00    0.20   -0.004

Notes: 1. Each "teacher-year" is treated as one observation, with t-tests clustered at the school level. 2. Teacher absence and active teaching are coded as means over the year (and then averaged across the 5 years). 3. All teacher response variables from the teacher interviews are binary; column [7] reports the correlation between a teacher's stated response and the value added by that teacher in that year.
* significant at 10%; ** significant at 5%; *** significant at 1%

Table 10: Long-Term Impact of Teacher Incentive Programs on Continued and Discontinued Cohorts
(Columns: Y3 on Y0 / Y4 on Y0 / Y5 on Y0)

  GI * discontinued        0.133      0.158      0.132
                          (0.070)*   (0.067)**  (0.082)
  GI * continued           0.029      0.167      0.117
                          (0.073)    (0.089)*   (0.087)
  II * discontinued        0.224      0.149      0.098
                          (0.082)*** (0.087)*   (0.095)
  II * continued           0.166      0.443      0.458
                          (0.078)**  (0.095)*** (0.111)***
  Observations             10707      9794       4879
  R-squared                0.196      0.233      0.249
  P-value (H0: II continued = II discontinued)    0.56    0.01    0.01
  P-value (H0: GI continued = GI discontinued)    0.24    0.93    0.89

Estimation sample (by column): Cohorts 4, 5 / 4, 5 / 5; Year 3 / 4 / 5; Grades 3, 4 / 4, 5 / 5.

Table 11: Average "Gross" One-Year Treatment Effect of Teacher Incentive Programs

Panel A: OLS with estimated gamma
                      Combined           Maths              Telugu
  II                   0.135              0.164              0.105
                      (0.031)***         (0.036)***         (0.027)***
    95% CI            [0.074, 0.196]     [0.093, 0.235]     [0.052, 0.158]
  GI                   0.064              0.086              0.043
                      (0.028)**          (0.031)***         (0.026)
    95% CI            [0.009, 0.119]     [0.0252, 0.147]    [-0.008, 0.094]
  Constant            -0.030             -0.029             -0.032
                      (0.018)            (0.021)            (0.017)*
  Observations         165300             82372              82928
  R-squared            0.046              0.054              0.041
  P-value (II = GI)    0.0288             0.0364             0.0299

Panel B: Average non-parametric treatment effect (based on Figure 4)
                      Combined           Maths              Telugu
  II                   0.150              0.181              0.119
    95% CI            [0.037, 0.264]     [0.051, 0.301]     [0.009, 0.228]
  GI                   0.048              0.065              0.032
    95% CI            [-0.058, 0.149]    [-0.047, 0.176]    [-0.083, 0.145]

Notes: 1. All regressions in Panel A include mandal (sub-district) fixed effects, with standard errors clustered at the school level.
* significant at 10%; ** significant at 5%; *** significant at 1%
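Referring back to note 3 of Table 9, column [7] there is the correlation between a teacher's binary interview response and that teacher's value added in the same year. A minimal illustration of such a correlation, using a hypothetical teacher-year file and column names (not the study's code):

import pandas as pd

# Hypothetical teacher-year panel; column names are assumptions.
ty = pd.read_csv("teacher_year_panel.csv")

# did_special_prep = 1 if the teacher reported special test preparation that
# year; value_added is the estimated teacher value added for the same year.
print(ty["did_special_prep"].corr(ty["value_added"]))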