Nonparametric Tests for Treatment Effect Heterogeneity

A large part of the recent literature on program evaluation has focused on estimation of the average effect of the treatment under assumptions of unconfoundedness or ignorability, following the seminal work by Rubin (1974) and Rosenbaum and Rubin (1983). In many cases, however, researchers are interested in the effects of programs beyond estimates of the overall average or the average for the subpopulation of treated individuals. It may be of substantive interest to investigate whether there is any subpopulation for which a program or treatment has a nonzero average effect, or whether there is heterogeneity in the effect of the treatment. The hypothesis that the average effect of the treatment is zero for all subpopulations is also important for researchers interested in assessing assumptions concerning the selection mechanism. In this paper we develop two nonparametric tests. The first is of the null hypothesis that the treatment has a zero average effect for every subpopulation defined by covariates. The second is of the null hypothesis that the average effect conditional on the covariates is identical for all subpopulations, in other words, that there is no heterogeneity in average treatment effects by covariates. By sacrificing some generality and focusing on these two specific null hypotheses, we derive tests that are straightforward to implement.


Introduction
A large part of the recent literature on program evaluation focuses on estimation of the average effect of the treatment under assumptions of unconfoundedness or ignorability, following the seminal work by Rubin (1974) and Rosenbaum and Rubin (1983). 1 This literature has typically allowed for general heterogeneity in the effect of the treatment. The literature on testing for the presence of treatment effects in this context is much smaller. An exception is the paper by Abadie (2002) in the context of instrumental variables models. 2 In many cases, however, researchers are interested in the effects of programs beyond point estimates of the overall average or the average for the subpopulation of treated individuals. For example, it may be of substantive interest to investigate whether there is any subpopulation for which a program or treatment has a nonzero average effect, or whether there is heterogeneity in the effect of the treatment. Such questions are particularly relevant for policy makers interested in extending the program or treatment to other populations. Some of this interest in treatment effect heterogeneity has motivated the development of estimators for quantile treatment effects in various settings. 3 The hypothesis that the average effect of the treatment is zero for all subpopulations is also important for researchers interested in assessing assumptions concerning selection mechanisms. In their discussion of specification tests as a tool for obtaining better estimators of average treatment effects, Heckman and Hotz (1989) introduced an important class of specification tests. These tests can be interpreted as tests of the null hypothesis of zero causal effects on lagged outcomes. Heckman and Hotz focused on methods that specifically test the hypothesis of a zero effect under the maintained assumption that the effect is constant.
However, the motivation for these tests suggests that the fundamental null hypotheses of interest are ones of zero average effects for all subpopulations. Similarly, Rosenbaum (1997) discusses the use of multiple control groups to investigate the plausibility of unconfoundedness. He shows that if both control groups satisfy an unconfoundedness or exogeneity assumption, differences in average outcomes between the control groups, adjusted for differences in covariates, should be zero in expectation. Again the hypothesis of interest can be formulated as one of zero causal effects for all subpopulations, not just a zero average effect.
In this paper we develop two nonparametric tests. The first test is for the null hypothesis that the treatment has a zero average effect for any subpopulation defined by covariates. The second test is for the null hypothesis that the average effect conditional on the covariates is identical for all subpopulations, in other words, that there is no heterogeneity in average treatment effects by covariates. Sacrificing some generality by focusing on these two specific null hypotheses, we derive tests that are straightforward to implement. They are based on a series or sieve approach to nonparametric estimation of average treatment effects (e.g., Hahn, 1998; Imbens, Newey and Ridder, 2006; Chen, Hong, and Tarozzi, 2004; Chen, 2005). Given the particular choice of the sieve, the null hypotheses of interest can be formulated as equality restrictions on subsets of the (expanding set of) parameters. The tests can then be implemented using standard parametric methods. In particular, the test statistics are quadratic forms in the differences in the parameter estimates, with critical values from a chi-squared distribution. We provide conditions on the sieves that guarantee that in large samples the tests are valid without the parametric assumptions.

Footnotes:
1 See Angrist and Krueger (2000), Heckman and Robb (1984), Heckman, Lalonde and Smith (2000), Rosenbaum (2001), Wooldridge (2002), Imbens (2004), Lechner (2002) and Lee (2005) for surveys of this literature.
2 There is also a large literature on testing in the context of randomized experiments using the randomization distribution. See Rosenbaum (2001).
3 See, for example, Lehmann (1974), Doksum (1974), Firpo (2004), Abadie, Angrist and Imbens (2002), Chernozhukov and Hansen (2005), and Bitler, Gelbach and Hoynes (2002).
There is a large literature on the related problem of testing parametric restrictions on regression functions against nonparametric alternatives. Eubank and Spiegelman (1990), Härdle and Mammen (1993), Bierens (1982, 1990), Hong and White (1995), and Horowitz and Spokoiny (2001), among others, focus on tests of parametric models for regression functions against nonparametric alternatives. However, the focus in this paper is on two specific tests, of zero and of constant conditional average treatment effects, rather than on general parametric restrictions. As a result, the proposed tests are particularly easy to implement compared to the Härdle-Mammen and Horowitz-Spokoiny tests. For example, p-values for our proposed tests can be obtained from chi-squared or normal tables, whereas Härdle and Mammen (1993) require the use of a variation of the bootstrap they call the wild bootstrap, and Horowitz and Spokoiny (2001) require simulation to calculate the p-value. Our proposed tests are closer in spirit to those suggested by Eubank and Spiegelman (1990) and Hong and White (1995), who also use series estimation for the unknown regression function, and who obtain a test statistic with a standard normal distribution. In particular, Eubank and Spiegelman (1990) also base their test statistic on the estimated coefficients in the series regression. The general approach behind our testing procedure is also related to the strategy of testing conditional moment restrictions by using an expanding set of marginal moment conditions; see, for example, Bierens (1990) and De Jong and Bierens (1994). In those papers, as in Eubank and Spiegelman (1990), the testing procedures are standard given the number of moment conditions or terms in the series, but remain valid as the number of moment conditions or series terms increases with the sample size. In contrast, the validity of our tests requires that the number of terms in the series increases with the sample size.
The papers closest in focus to the current paper are those by Härdle and Marron (1990), Neumeyer and Dette (2003) and Pinkse and Robinson (1995). Härdle and Marron study tests of parametric restrictions on comparisons of two regression functions. Their formal analysis is restricted to the case with a single regressor, although it is likely that their kernel methods can be adapted (in particular by using higher-order kernels) to the case with multivariate covariates. Their proposed testing procedure leads to a test statistic with a bias term involving the form of the kernel. In contrast, the tests proposed here have a standard asymptotic distribution. Neumeyer and Dette (2003) use empirical process methods to test equality of two regression functions, again in the context of a single regressor. Pinkse and Robinson focus on efficient estimation of the nonparametric functions and investigate the efficiency gains from pooling the two data sets in settings where the two regression functions differ by a transformation indexed by a finite number of parameters.
We apply these tests to two sets of experimental evaluations of the effects of welfare-to-work programs. In both cases the new tests lead to substantively different conclusions regarding the effects of the programs than have been reached in previous analyses of these data that focused solely on average treatment effects. We first analyze data from the MDRC experimental evaluation of California's Greater Avenues for INdependence (GAIN) program that was conducted during the 1990s. These welfare-to-work programs were designed to assist welfare recipients in finding employment and improving their labor market earnings. The programs were implemented at the county level, and counties had a great deal of discretion in the designs of their programs. We analyze data for four of these counties. The tests we develop in this paper suggest a very different picture of the efficacy of the programs in these counties compared to conclusions drawn from standard tests of zero average treatment effects. In particular, tests that the average effect of the program on labor market earnings is equal to zero are rejected in only one of the four counties. However, using the tests developed in this paper, we find that for three out of the four counties we can decisively reject the hypothesis of a zero average effect on earnings for all subpopulations of program participants, where subpopulations are defined by covariates. We also reject the hypothesis of a constant average treatment effect across these subpopulations. Taken together, the results using these new tests strongly suggest that, in general, these programs were effective in changing the earnings of participants, even though they may not have improved, and may even have lowered, the earnings of some participants. Second, we analyze data from the MDRC experimental evaluations of Work INcentive (WIN) programs in Arkansas, Baltimore, Virginia and San Diego.
Again, we find that we cannot reject the null hypothesis of a zero average effect for two out of the four locations. At the same time, we can clearly reject the null hypothesis of a zero average effect for all values of the covariates.

The remainder of the paper is organized as follows. In Section 2, we lay out the framework for analyzing treatment effects and characterize the alternative sets of hypotheses we consider in this paper. We also provide a detailed motivation for conducting tests of zero average treatment effects and of constant treatment effects. In Section 3, we characterize these tests in parametric and nonparametric regression settings. We then lay out the conditions required for the validity of both the zero conditional and the constant conditional average treatment effect tests in the nonparametric setting. In Section 4, we apply these tests to the GAIN and WIN data and report our findings, contrasting the results of our nonparametric tests of zero and constant conditional average treatment effects for these programs on labor market earnings. Finally, we offer some concluding remarks.

Set Up
Our basic framework uses the motivating example of testing zero conditional average treatment effects in a program evaluation setting. We note, however, that our tests can be used more generally to test the hypotheses of constant or zero differences between regression functions estimated on separate samples. The set up we use is standard in the program evaluation literature and based on the potential outcome notation popularized by Rubin (1974). See Angrist and Krueger (2000), Heckman, Lalonde and Smith (2000), Blundell and Costa-Dias (2002), and Imbens (2004) for general surveys of this literature. We have a random sample of size N from a large population. For each unit i in the sample, let W_i indicate whether the active treatment was received, with W_i = 1 if unit i receives the active treatment, and W_i = 0 if unit i receives the control treatment. Let Y_i(0) denote the outcome for unit i under control and Y_i(1) the outcome under treatment. We observe W_i and Y_i, where Y_i is the realized outcome:

Y_i = Y_i(W_i) = W_i Y_i(1) + (1 - W_i) Y_i(0).

In addition, we observe a vector of pre-treatment variables, or covariates, denoted by X_i. Define the two conditional means

µ_w(x) = E[Y(w) | X = x], for w = 0, 1,

and the conditional average treatment effect τ(x) = µ_1(x) - µ_0(x). To solve the identification problem, we maintain throughout the paper the unconfoundedness assumption (Rosenbaum and Rubin, 1983), which asserts that conditional on the pre-treatment variables, the treatment indicator is independent of the potential outcomes. Formally:

W ⊥ (Y(0), Y(1)) | X. (2.1)

In addition we assume there is overlap in the covariate distributions:

0 < Pr(W = 1 | X = x) < 1, for all x.

Later we also impose smoothness conditions on the two regression functions µ_w(x) and the conditional variances σ²_w(x). Various estimators have been proposed for the average treatment effect in this setting, e.g., Hahn (1998), Heckman, Ichimura and Todd (1998), Hirano, Imbens and Ridder (2003), Chen, Hong, and Tarozzi (2004), and Abadie and Imbens (2006).

Hypotheses
In this paper we focus on two null hypotheses concerning the conditional average treatment effect τ(x) = µ_1(x) - µ_0(x). The first pair of hypotheses we consider is

H_0: τ(x) = 0 for all x, against H_a: τ(x) ≠ 0 for some x. (2.2)

Under the null hypothesis the average effect of the treatment is zero for all values of the covariates, whereas under the alternative there are some values of the covariates for which the effect of the treatment differs from zero. The second pair of hypotheses is

H'_0: τ(x) = τ for some τ and all x, against H'_a: τ(x) ≠ τ(x') for some x, x'. (2.3)

We refer to this pair as the null hypothesis of no treatment effect heterogeneity. Strictly speaking this label is not entirely accurate, as we only require the average effect of the treatment to be equal to τ for all values of the covariates, allowing for distributional effects that average out to zero. We want to contrast these hypotheses with the pair of hypotheses corresponding to a zero average effect,

H''_0: E[τ(X)] = 0, against H''_a: E[τ(X)] ≠ 0. (2.4)

Tests of the null hypothesis of a zero average effect are more commonly carried out, either explicitly, or implicitly through estimating the average treatment effect and its standard error. This null hypothesis is obviously much less restrictive than the null hypothesis of a zero conditional average effect.
To clarify the relation between these hypotheses and the hypotheses typically considered in the nonparametric testing literature, it is useful to write the former in terms of restrictions on the conditional mean of Y given X and W. Because W is binary we can write this conditional expectation as

E[Y | X = x, W = w] = h_0(x) + w · h_1(x),

where h_0(x) = µ_0(x) and h_1(x) = µ_1(x) - µ_0(x). The nonparametric testing literature has largely focused on hypotheses that restrict both h_0(x) and h_1(x) to parametric forms (e.g., Eubank and Spiegelman, 1990; Härdle and Marron, 1990; Hong and White, 1995; Horowitz and Spokoiny, 2001). In contrast, the first null hypothesis we are interested in is h_1(x) = 0 for all x, with no restriction on h_0(x). The second null hypothesis in this representation is h_1(x) = τ for some τ and all x, again with no restriction on h_0(x). This illustrates that the hypotheses in (2.2) and (2.3) generalize the setting considered in the nonparametric testing literature to one where we allow for nuisance functions in the regression function under the null hypothesis.

Motivation
The motivation for considering the two pairs of hypotheses beyond the hypothesis of a zero average effect consists of three parts. The first is substantive. In many cases the primary interest of the researcher may be in establishing whether the average effect of the program differs from zero. However, even if it is zero on average, there may well be subpopulations for which the effect is substantively and statistically significant. As a first step towards establishing such a conclusion, it would be useful to test whether there is any statistical evidence against the hypothesis that the effect of the program is zero on average for all subpopulations (the pair of hypotheses H_0 and H_a). If one finds compelling evidence that the program has a nonzero effect for some subpopulations, one may then further investigate which subpopulations these are, and whether the effects for these subpopulations are substantively important. As an alternative strategy one could directly estimate average effects for substantively interesting subpopulations. However, there may be many such subpopulations, and it can be difficult to control size when testing many null hypotheses. Our proposed strategy of an initial single test for zero conditional average treatment effects avoids such problems.
Second, irrespective of whether one finds evidence in favor or against a zero average treatment effect, one may be concerned with the question of whether there is heterogeneity in the average effect conditional on the observed covariates. If there is strong evidence in favor of heterogeneous effects, one may be more reluctant to recommend extending the program to populations with different distributions of the covariates.
The third motivation is very different. In much of the economic literature on program evaluation, there is concern about the validity of the unconfoundedness assumption. If individuals choose whether or not to participate in the program based on information that is not all observed by the researcher, it may well be that conditional on observed covariates there is some remaining correlation between potential outcomes and the treatment indicator. Such correlation is ruled out by the unconfoundedness assumption. The unconfoundedness assumption is not directly testable. Nevertheless, there are two specific sets of tests available that are suggestive of the plausibility of this assumption. Both are based on testing the effect of a pseudo treatment that is known to have no effect. The first set of tests was originally suggested by Heckman and Hotz (1989). See also the discussion in Imbens (2004). Let us partition the vector of covariates X into two parts, a scalar V and the remainder Z, so that X = (V, Z). The idea is to take the data (V, W, Z) and analyze them as if V is the outcome, W is the treatment indicator, and as if unconfoundedness holds conditional on Z. Since V is a pre-treatment variable or covariate, we are certain that the effect of the treatment on V is zero for all units. If we find statistical evidence in favor of an effect of the treatment on V, it must therefore be the case that the assumption of unconfoundedness conditional on Z is incorrect. Of course, this is not direct evidence against unconfoundedness conditional on X = (V, Z). But, at the very least, it suggests that unconfoundedness is a delicate assumption in this case, with the presence of V essential. Moreover, such tests can be particularly effective if the researcher has data on a number of lagged values of the outcome. In that case one can choose V to be the one-period-lagged value of the outcome.
If conditional on further lags and individual characteristics one finds differences in lagged outcome distributions for those who will be treated in the future and those who will not be, it calls into question whether conditioning on all lagged outcome values will be sufficient to eliminate differences between control and treatment groups. Heckman and Hotz (1989) implement these tests by testing whether the average effect of the treatment is equal to zero, testing the pair of hypotheses in (2.4). Clearly, in this setting it would be stronger evidence in support of the unconfoundedness assumption to find that the effect of the treatment on the lagged outcome is zero for all values of Z. This corresponds to implementing tests of the pairs of hypotheses (2.2).
A similar set of issues comes up in Rosenbaum's (1997) discussion of the use of multiple control groups. Rosenbaum considers a setting with two distinct potential control groups. He suggests that if the biases one might be concerned about would likely differ between the two groups, then evidence that the two control groups lead to similar estimates is suggestive that unconfoundedness may be appropriate. One can implement this idea by comparing the two control groups.
If we find evidence that this pseudo treatment has a systematic effect on the outcome, it must be that for at least one of the two control groups unconfoundedness is violated. As in the Heckman-Hotz setting, the pair of hypotheses to test is that of a zero conditional average treatment effect, (2.2).
In the next section we discuss implementing the two tests in a parametric framework. In Section 3.2, we then provide conditions under which these tests can be interpreted as nonparametric tests.

Tests in Parametric Models
Here we discuss parametric versions of the tests in (2.2) and (2.3). For notational convenience we assume here that N_0 = N_1 = N. This can be relaxed easily, as we will do in the nonparametric case. Suppose the regression functions are specified as

µ_w(x) = α_w + h(x)'β_w, for w = 0, 1,

for some vector of functions of the covariates h(x), with dimension K - 1. The simplest case is h(x) = x, where we just estimate a linear model. We can estimate α_w and β_w using least squares:

(α̂_w, β̂_w) = argmin_{α,β} Σ_{i: W_i = w} (Y_i - α - h(X_i)'β)².

Under general heteroskedasticity, with V(Y(w)|X) = σ²_w(X), the normalized covariance matrix of (α̂_w, β̂_w) is

Ω_w = (E[h̃(X)h̃(X)'])⁻¹ E[σ²_w(X) h̃(X)h̃(X)'] (E[h̃(X)h̃(X)'])⁻¹, where h̃(x) = (1, h(x)')'. (3.6)

In large samples, √N((α̂_w, β̂_w)' - (α_w, β_w)') is then approximately normally distributed with mean zero and covariance matrix Ω_w. Let Ω̂_0 and Ω̂_1 be consistent estimators for Ω_0 and Ω_1. In this parametric setting the first pair of null and alternative hypotheses is

H_0: α_0 = α_1 and β_0 = β_1, against H_a: α_0 ≠ α_1 or β_0 ≠ β_1.

This can be tested using the quadratic form

T = N ((α̂_1, β̂_1) - (α̂_0, β̂_0))' (Ω̂_0 + Ω̂_1)⁻¹ ((α̂_1, β̂_1) - (α̂_0, β̂_0)).

Under the null hypothesis this test statistic has in large samples a chi-squared distribution with K degrees of freedom. The second test is similar. The original null and alternative hypotheses in (2.3) translate into H_0: β_0 = β_1 and H_a: β_0 ≠ β_1.
Partition Ω_w into the part corresponding to the variance of α̂_w and the part corresponding to the variance of β̂_w:

Ω_w = [ Ω_w,00  Ω_w,01 ]
      [ Ω_w,10  Ω_w,11 ],

and partition Ω̂_0 and Ω̂_1 similarly. The test statistic is now

T' = N (β̂_1 - β̂_0)' (Ω̂_0,11 + Ω̂_1,11)⁻¹ (β̂_1 - β̂_0), (3.10)

which under the null hypothesis has in large samples a chi-squared distribution with K - 1 degrees of freedom.
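As an illustration, both parametric tests can be computed with ordinary least squares in each group plus a quadratic form in the coefficient differences. The sketch below is ours, not code from the paper: the function name is hypothetical, the covariance is estimated with an HC0 sandwich, and the statistic is formed directly from finite-sample covariance estimates, which is algebraically equivalent to the normalized form above.

```python
import numpy as np

def quadratic_form_test(y0, h0, y1, h1, drop_intercept=False):
    """Quadratic-form test comparing least-squares fits in two groups.

    y_w: outcomes for group w; h_w: (N_w, K-1) array of covariate
    functions h(x). Returns the chi-squared statistic and its degrees
    of freedom. With drop_intercept=True only the slope coefficients
    are compared (constant-conditional-effect test); otherwise the full
    parameter vectors are compared (zero-conditional-effect test).
    """
    fits = []
    for y, h in ((y0, h0), (y1, h1)):
        X = np.column_stack([np.ones(len(y)), h])   # prepend intercept
        XtX_inv = np.linalg.inv(X.T @ X)
        theta = XtX_inv @ X.T @ y                   # OLS estimates
        e = y - X @ theta                           # residuals
        # heteroskedasticity-robust (HC0 sandwich) covariance of theta-hat
        meat = X.T @ (X * (e ** 2)[:, None])
        V = XtX_inv @ meat @ XtX_inv
        fits.append((theta, V))
    (th0, V0), (th1, V1) = fits
    s = slice(1, None) if drop_intercept else slice(None)
    d = th1[s] - th0[s]
    T = float(d @ np.linalg.inv((V0 + V1)[s, s]) @ d)
    return T, d.size
```

The statistic is compared with chi-squared critical values with the returned degrees of freedom (K for the first test, K - 1 for the second).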

Nonparametric Estimation of Regression Functions
In order to develop nonparametric extensions of the tests developed in Section 3.1, we need nonparametric estimators for the two regression functions. We use the particular series estimator for the regression function µ w (x) developed by Imbens, Newey and Ridder (2006) and Chen, Hong and Tarozzi (2004). See Chen (2005) for a general discussion of sieve methods. Let K denote the number of terms in the series. As the basis we use power series.
Let R_kK(x) be the kth element of the vector R_K(x). It will be convenient to work with this sequence of basis functions R_K(x). The nonparametric series estimator of the regression function µ_w(x), given K terms in the series, is

µ̂_w,K(x) = R_K(x)'γ̂_w,K, where γ̂_w,K = ( Σ_{i: W_i = w} R_K(X_i)R_K(X_i)' )⁻ ( Σ_{i: W_i = w} R_K(X_i)Y_i ),

and A⁻ denotes a generalized inverse of A. Define the N_w × K matrix R_w,K with rows equal to R_K(X_i)' for units with W_i = w, and Y_w to be the N_w-vector with elements equal to Y_i for the same units, so that γ̂_w,K = (R'_w,K R_w,K)⁻ R'_w,K Y_w. Given the estimator µ̂_w,K(x) we estimate the error variance σ²_w as

σ̂²_w = (1/N_w) Σ_{i: W_i = w} (Y_i - µ̂_w,K(X_i))².

Let V_w,K denote the normalized covariance matrix of γ̂_w,K as the sample size increases for fixed K. We estimate this variance as

V̂_w,K = σ̂²_w (R'_w,K R_w,K / N_w)⁻.

In addition to Assumptions 2.2 and 2.3 we make the following assumptions.
The density of X is bounded away from zero on X.

Assumption 3.3 (Rates for Series Estimators)
We assume homoskedasticity, although this assumption is not essential and can be relaxed to allow the conditional variance to depend on x, as long as it is bounded from above and below.
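A minimal sketch of the series estimator just described, under the homoskedasticity assumption. The helper names are ours, the basis is a simple power series, and `np.linalg.pinv` plays the role of the generalized inverse A⁻.

```python
import numpy as np
from itertools import combinations_with_replacement

def power_series_basis(X, degree):
    """All monomials of the columns of X up to the given total degree,
    including the constant term; a simple power-series sieve R_K(x)."""
    X = np.atleast_2d(X)
    n, d = X.shape
    cols = [np.ones(n)]
    for deg in range(1, degree + 1):
        for idx in combinations_with_replacement(range(d), deg):
            cols.append(np.prod(X[:, idx], axis=1))
    return np.column_stack(cols)

def series_fit(X, y, degree):
    """Series coefficients gamma-hat and a homoskedastic covariance
    estimate (sigma^2-hat times the generalized inverse of R'R, i.e.
    the estimated finite-sample covariance of gamma-hat)."""
    R = power_series_basis(X, degree)
    gram = R.T @ R
    gamma = np.linalg.pinv(gram) @ R.T @ y   # pinv = generalized inverse
    resid = y - R @ gamma
    sigma2 = np.mean(resid ** 2)
    V = sigma2 * np.linalg.pinv(gram)
    return gamma, V
```

In the paper's notation, the returned `V` corresponds to V̂_w,K/N_w, the estimated covariance of γ̂_w,K itself.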

Nonparametric Tests: Zero Conditional Average Treatment Effect
In this section, we show how the tests discussed in Section 3.1 based on parametric regression functions can be used to test the null hypothesis against the alternative hypothesis given in (2.2) without the parametric model. Essentially, we are going to provide conditions under which we can apply a sequence of parametric tests identical to those discussed in Section 3.1 and obtain a test that is valid without the parametric specification.
First, we focus on tests of the null hypothesis that the conditional average treatment effect τ(x) is zero for all values of the covariates, (2.2). To test this hypothesis, we compare estimators for µ_1(x) and µ_0(x). Given our use of series estimators, we can compare the estimated parameters γ̂_0,K and γ̂_1,K. Specifically, we use as the test statistic for the test of the null hypothesis H_0 the normalized quadratic form

T = ( (γ̂_1,K - γ̂_0,K)' (V̂_0,K/N_0 + V̂_1,K/N_1)⁻ (γ̂_1,K - γ̂_0,K) - K ) / √(2K). (3.14)

To gain some intuition for this result, it is useful to decompose the difference γ̂_1,K - γ̂_0,K into three parts. Define the pseudo-true values γ*_w,K, for w = 0, 1 and K = 1, 2, . . ., as the coefficients of the population least squares projection of µ_w(X) on R_K(X) in treatment group w, so that

γ̂_1,K - γ̂_0,K = (γ*_1,K - γ*_0,K) + (γ̂_1,K - γ*_1,K) - (γ̂_0,K - γ*_0,K).

For fixed K, in large samples, the last two terms are normally distributed and centered around zero. The asymptotic distribution of T is based on this approximate normality. This approximation ignores the first term, the difference γ*_1,K - γ*_0,K. For fixed K this difference is not equal to zero even if µ_0(x) = µ_1(x), because the covariate distributions differ in the two treatment groups. In large samples, however, with large K, we can ignore this difference. Recall that under the null hypothesis µ_0(x) = µ_1(x) for all x. Hence, for large enough K, the series approximation error to the common regression function is close to zero for all x, implying that γ*_0,K and γ*_1,K are close. The formal result then shows that we can increase K fast enough to make this difference small, while at the same time increasing K slowly enough to maintain the close approximation of the distribution of γ̂_w,K - γ*_w,K by a normal one. A key result here is Theorem 1.1 in Bentkus (2005), which ensures that convergence to multivariate normality is fast enough to hold even with the dimension of the vector increasing.
In large samples, the test statistic has a standard normal distribution if the null hypothesis is correct. However, we only want to reject the null hypothesis if the two regression functions are far apart, which corresponds to large positive values of the test statistic. Hence, we recommend using one-sided critical values for the test, as in De Jong and Bierens (1994).
In practice, we may wish to modify the testing procedure slightly. Instead of calculating T we can calculate the quadratic form

Q = (γ̂_1,K - γ̂_0,K)' (V̂_0,K/N_0 + V̂_1,K/N_1)⁻ (γ̂_1,K - γ̂_0,K),

and compare this to the critical values of a chi-squared distribution with K degrees of freedom. In large samples this leads to approximately the same decision rule, since (Q - K)/√(2K) is approximately standard normal if Q has a chi-squared distribution with K degrees of freedom and K is large. This modification makes the testing procedure identical to the one discussed in Section 3.1, which is what one would do if the parametric model were correctly specified. This makes the tests particularly simple to apply. However, in large samples the tests do not rely on the correct specification, relying instead on the increasingly flexible specification as K increases with the sample size.
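Given series coefficient estimates and their covariance estimates for the two groups, the statistics Q and T are simple quadratic forms. This sketch is ours: the function name is hypothetical, and the inputs `V0`, `V1` are assumed to be the estimated covariances of γ̂_0,K and γ̂_1,K already scaled by the respective sample sizes (V̂_w,K/N_w in the notation above).

```python
import numpy as np

def zero_conditional_effect_test(g0, V0, g1, V1):
    """Chi-squared form Q and normalized statistic T = (Q - K)/sqrt(2K)
    for the test of a zero conditional average treatment effect.

    g_w: series coefficient estimates for group w;
    V_w: their estimated covariance matrices (scaled by 1/N_w).
    """
    K = g0.size
    d = g1 - g0
    # pinv serves as the generalized inverse of the combined covariance
    Q = float(d @ np.linalg.pinv(V0 + V1) @ d)
    T = (Q - K) / np.sqrt(2.0 * K)
    return Q, T
```

One rejects for Q above the chi-squared(K) critical value, or equivalently for large positive T against one-sided normal critical values.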
Next, we analyze the properties of the test when the null hypothesis is false. We consider local alternatives. For the test of the null hypothesis of a zero conditional average treatment effect, the alternative is

τ(x) = ρ_N · ∆(x),

for some sequence ρ_N → 0 and a fixed function ∆(x) such that |∆(x_0)| > 0 for some x_0.
The theorem implies that we cannot necessarily detect alternatives to the null hypothesis that are N −1/2 from the null hypothesis. We can, however, detect alternatives whose distance to the null hypothesis is arbitrarily close to N −1/2 given sufficient smoothness relative to the dimension of the covariates (so that ν can be close to zero).

Nonparametric Tests: Constant Conditional Average Treatment Effect
Next, we consider tests of the null hypothesis against the alternative hypothesis given in (2.3). Suppose, without loss of generality, that R_1K(x) = 1 for all K. For this test we partition γ̂_w,K as

γ̂_w,K = (γ̂_w0,K, γ̂'_w1,K)',

with γ̂_w0,K a scalar (the coefficient on the constant term) and γ̂_w1,K a (K - 1)-dimensional vector, and partition the matrix V̂ = V̂_0,K/N_0 + V̂_1,K/N_1 conformably as

V̂ = [ V̂_00  V̂_01 ]
     [ V̂_10  V̂_11 ].

The test statistic is then

T' = ( (γ̂_11,K - γ̂_01,K)' V̂_11⁻ (γ̂_11,K - γ̂_01,K) - (K - 1) ) / √(2(K - 1)).

Proof: See supplementary materials on website.
In practice we may again wish to use the chi-squared approximation. Now we calculate the quadratic form

Q' = (γ̂_11,K - γ̂_01,K)' V̂_11⁻ (γ̂_11,K - γ̂_01,K),

where V̂_11 is the sub-block of V̂ = V̂_0,K/N_0 + V̂_1,K/N_1 corresponding to the coefficients other than the constant term, and compare it to the critical values of a chi-squared distribution with K - 1 degrees of freedom. Proof: See supplementary materials on website.
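The constant-effect version differs from the zero-effect version only in dropping the coefficient on the constant term before forming the quadratic form. As before, this is an illustrative sketch with our own naming, assuming `V0` and `V1` are covariance estimates of the full coefficient vectors scaled by the respective sample sizes.

```python
import numpy as np

def constant_effect_test(g0, V0, g1, V1):
    """Chi-squared form and normalized statistic for the test of a
    constant conditional average treatment effect: compare all series
    coefficients except the one on the constant term (assumed first)."""
    d = (g1 - g0)[1:]                        # discard the constant term
    W = np.linalg.pinv((V0 + V1)[1:, 1:])    # conformable sub-block
    Q = float(d @ W @ d)
    K1 = d.size                              # degrees of freedom, K - 1
    T = (Q - K1) / np.sqrt(2.0 * K1)
    return Q, T, K1
```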

Application
In this section we apply the tests developed in this paper to data from two sets of experimental evaluations of welfare-to-work training programs. We first re-analyze data from the MDRC evaluations of California's Greater Avenues for INdependence (GAIN) programs. These experimental evaluations of job training and job search assistance programs took place in the 1990s in several different counties in California. 4 The second set consists of four experimental Work INcentive (WIN) demonstration programs implemented in the mid-eighties in different locations in the U.S. The WIN programs also were welfare-to-work programs that examined different strategies for improving the employment and earnings of welfare recipients. 5 The design of both evaluations entailed random assignment of welfare recipients to a treatment group that received training and job assistance services and a control group that did not. Thus, estimating the average effect from these data is straightforward. While the effects of the treatments were analyzed for a number of different outcomes, we focus here on the labor market earnings of participants in the first year after random assignment for both sets of evaluations.

Treatment Effect Tests for the GAIN Data
In this section, we present the results of tests concerning the effects of the GAIN programs in four of California's counties, namely Los Angeles (LA), Riverside (RI), Alameda (AL) and San Diego (SD) counties, on participants' labor market earnings in the first year after random assignment. The sample sizes for the treatment and control groups in each of these counties are provided at the top of Table 1. For each county, we conducted tests for zero and constant conditional average treatment effects, where we condition on measures of participants' background characteristics, including gender, age, ethnicity (Hispanic, black, or other), an indicator for high school graduation, an indicator for the presence of exactly one child (all individuals have at least one child), and an indicator for the presence of children under the age of 6, as well as on the quarterly earnings of participants in the ten quarters prior to random assignment. Descriptive statistics (means and standard deviations) for these conditioning covariates, as well as for the earnings outcome variable, are found in Table 1, separately by county. All the conditioning data on earnings are in thousands of dollars per quarter. For all of the tests, we controlled for all seven individual characteristics linearly, plus a quadratic term for age, plus all ten quarterly earnings variables and ten indicators for zero earnings in each quarter. This leads to a total of 28 covariates (listed in Table 1) in the regressions, plus an intercept. The results for the various tests we consider are reported in Table 2. (The degrees of freedom for the chi-squared versions of the tests are recorded in this table under the "dof" heading.) We first consider the test of the null hypothesis that τ(x) = 0 for all x against the alternative that τ(x) ≠ 0 for some x ("Zero Cond. Ave. TE").
For this test, we get a clear rejection of a zero conditional average treatment effect at the 5% level for three of the four GAIN counties, with only the test statistic for Los Angeles County being smaller than conventional critical values. (For all of the tests, we also include the normal-distribution-based version of the tests.) Results for the second test, of the null hypothesis that τ(x) = τ for all x against the alternative that τ(x) ≠ τ for some x ("Constant Cond. Ave. TE"), also are presented in Table 2. Again, we reject this null hypothesis at conventional levels for three of the four counties. Finally, for comparison purposes, we include the simple test of the null hypothesis that the average effect of the treatment is equal to zero ("Zero Ave. TE"). This is the traditional test that is typically reported when testing treatment effects in the program evaluation literature. It is based on the statistic calculated as the difference in average outcomes for the treatment and control groups divided by the standard error of this difference. Based on this traditional test, we cannot reject the null hypothesis of no treatment effect in three of the four counties. In particular, only for the Riverside data is there a clear rejection of a zero average treatment effect on earnings.
This latter finding, namely that only Riverside County's GAIN program showed significant effects on earnings (and other outcomes) in the initial periods after random assignment, is what was reported in the MDRC analysis of this evaluation (Riccio, Friedlander and Freedman, 1994). It has been widely cited as evidence that the strategy used in Riverside County's GAIN program, namely an emphasis on job search assistance rather than the basic skills training emphasized by the other GAIN county programs, was the preferred strategy for moving welfare recipients from welfare to work (see also Hotz, Imbens and Klerman, 2006, for an explicit analysis of the relative effectiveness of the alternative treatment strategies based on these same GAIN data). However, as the results for the other two tests presented in Table 2 make clear, these conclusions are not robust. The findings from the two tests developed in this paper clearly suggest that some subgroups in counties other than Riverside benefited from the GAIN treatments in those counties. Moreover, there is clear evidence of treatment effect heterogeneity across subgroups in all but Los Angeles County.

Treatment Effect Tests for the WIN Data
In this section, we present results for the same set of tests using data from the Work INcentive (WIN) experiments in Baltimore, Maryland (MD), Arkansas (AK), San Diego County (SD) and Virginia (VA). Here we have data on four binary indicators of individual characteristics: an indicator for having exactly one child, an indicator for having a high school diploma, an indicator for never having been married, and an indicator for being nonwhite. In addition, we have four quarters of earnings data. Table 3 presents summary statistics for the 12 covariates and the outcome variable, annual earnings in the first year after random assignment, for the four locations.
Results of the tests for the four WIN evaluation locations are presented in Table 4, which has the same format as Table 2. With respect to the test of zero conditional average treatment effects, we can reject this null hypothesis in three of the four locations of the WIN experiments at the 5% level. For two of those three locations, we also reject the hypothesis of constant treatment effects. In contrast, testing the null hypothesis of a zero average treatment effect results in a rejection for only one of the four locations. Overall, the conclusion is again that a researcher who relied only on the traditional test of a zero average effect would have missed the presence of treatment effects in two of the four locations analyzed in this set of evaluations.
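The series-based tests reported in Tables 2 and 4 can be caricatured as Wald tests on the difference between arm-specific regression coefficient vectors. The sketch below is a deliberately simplified stand-in for the procedure developed in the paper: it uses a fixed set of regressors rather than a growing power series, a homoskedastic OLS variance estimate, and a function name and interface of our own invention.

```python
import numpy as np

def heterogeneity_test(y, w, X):
    """Stylized test of tau(x) = 0: regress y on [1, X] separately in
    the control (w = 0) and treatment (w = 1) arms, form the difference
    of the two coefficient vectors, and evaluate a Wald-type quadratic
    form.  Under the null, the statistic T is approximately chi-squared
    with K degrees of freedom (K = number of regressors including the
    intercept); (T - K)/sqrt(2K) gives a normal-distribution-based
    version of the test."""
    Z = np.column_stack([np.ones(len(y)), X])
    K = Z.shape[1]
    gammas, covs = [], []
    for arm in (0, 1):
        Za, ya = Z[w == arm], y[w == arm]
        g, *_ = np.linalg.lstsq(Za, ya, rcond=None)
        resid = ya - Za @ g
        s2 = resid @ resid / (len(ya) - K)   # homoskedastic simplification
        gammas.append(g)
        covs.append(s2 * np.linalg.inv(Za.T @ Za))
    d = gammas[1] - gammas[0]
    V = covs[0] + covs[1]
    T = d @ np.linalg.solve(V, d)
    return T, K, (T - K) / np.sqrt(2 * K)

# Simulated illustration with tau(x) = x, i.e. a heterogeneous effect
rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
w = rng.integers(0, 2, size=n)
T, K, T_norm = heterogeneity_test(x + w * x + rng.normal(size=n), w, x)
```

The chi-squared version compares T with chi-squared(K) critical values; the normal version uses (T − K)/√(2K), mirroring the studentization in Lemma A.5 of the appendix.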

Conclusion
In this paper, we develop and apply tools for testing for the presence of, and heterogeneity in, treatment effects in settings with selection on observables (unconfoundedness). In these settings, researchers have largely focused on inference for the average effect or the average effect for the treated. Although researchers have typically allowed for general treatment effect heterogeneity, there has been little formal investigation of the presence of such heterogeneity or of more complex patterns of treatment effects that could not be detected with traditional tests concerning average treatment effects. At best, researchers have estimated average effects for subpopulations defined by categorical individual characteristics. Here, we develop simple-to-apply tools for testing for both the presence of nonzero treatment effects and treatment effect heterogeneity. Analyzing data from eight experimental evaluations of welfare-to-work training programs, we find considerable evidence of treatment effect heterogeneity and of nonzero treatment effects that were missed by testing strategies focused solely on inferences concerning average treatment effects.
We note that there is a related issue concerning the presence of heterogeneity when estimating average treatment effects. In particular, allowing for general forms of heterogeneity can lead to imprecise estimates of such effects. To address this issue, Crump, Hotz, Imbens and Mitnik (2006) explore the potential gains from focusing on the estimation of average effects for subpopulations with more overlap in their covariate distributions. They provide a systematic treatment of the choice of these subpopulations and develop estimators of treatment effects that have optimal asymptotic properties with respect to their precision.

Appendix
Before proving Theorem 3.1 we present a couple of preliminary results, Lemmas A.1 and A.2. Proof: See supplementary materials on the website.

We follow Newey (1994) and define γ̃_{w,K} accordingly. Then we can write √N_w (γ̂_{w,K} − γ*_{w,K}) in terms of S_{w,K}, a normalized sum of N_w independent random vectors with expectation 0 and variance-covariance matrix I_K. Denote the distribution of S_{w,K} by Q_{N_w} and define β₃ ≡ Σ_{i: W_i = w} E‖Z_i/√N_w‖³. Then, by Theorem 1.1 of Bentkus (2005),

sup_{A ∈ A_K} |Q_{N_w}(A) − Φ(A)| ≤ C K^{1/4} β₃,

where A_K is the class of all measurable convex sets in K-dimensional Euclidean space, C is an absolute constant, and Φ is the multivariate standard Gaussian distribution.

Lemma A.3 Suppose Assumptions 2.1-2.3 and 3.1-3.3 hold, and in particular let K(N) = N^ν with ν < 2/19. Then

sup_{A ∈ A_K} |Q_{N_w}(A) − Φ(A)| → 0.

Proof: λ_min(Ω_{w,K}) is bounded away from zero by Lemma A.1. Next, the third moment of ε_{w,i} is bounded by Assumption 3.2, so that the corresponding factor is O(K³). Since σ²_w is also bounded by Assumption 3.2, β₃ is O(K^{9/2} N^{−1/2}). Thus K^{1/4} β₃ = O(K^{19/4} N^{−1/2}) = O(N^{19ν/4 − 1/2}), which converges to zero for ν < 2/19, and the result follows.
We may proceed further and detail conditions under which the quadratic form S_{w,K}'S_{w,K}, properly normalized, converges to a univariate standard Gaussian distribution. The quadratic form can be written as

S_{w,K}'S_{w,K} = Σ_{j=1}^{K} ( N_w^{−1/2} Σ_{i: W_i = w} Z_{ij} )²,

where Z_{ij} is the jth element of the vector Z_i. Thus, S_{w,K}'S_{w,K} is a sum of K uncorrelated, squared random variables, with each underlying random variable converging to a standard Gaussian distribution by the previous result. Intuitively, this sum should converge to a chi-squared random variable with K degrees of freedom.

Lemma A.4 Under Assumptions 2.1-2.3 and 3.1-3.3,
The proper normalization of the quadratic form yields the studentized version, (S_{w,K}'S_{w,K} − K)/√(2K). This converges to a standard Gaussian distribution by the following lemma.

Lemma A.5 Under Assumptions 2.1-2.3 and 3.1-3.3,
The first term goes to zero by Lemma A.4. For the second term we may apply the Berry-Esséen theorem, which yields a bound whose right-hand side converges to zero for any ν > 0, and the result is established.
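Both approximations, the chi-squared limit for the quadratic form and the Gaussian limit for its studentized version, are easy to check by simulation. A minimal sketch, substituting exact standard normal draws for the normalized scores S_{w,K}:

```python
import numpy as np

# Monte Carlo check: for S ~ N(0, I_K), the quadratic form S'S is
# chi-squared with K degrees of freedom (mean K, variance 2K), and the
# studentized form (S'S - K)/sqrt(2K) is approximately standard normal
# once K is large.
rng = np.random.default_rng(42)
K, reps = 200, 20000
S = rng.standard_normal((reps, K))
Q = (S ** 2).sum(axis=1)          # quadratic form S'S, one draw per row
Z = (Q - K) / np.sqrt(2 * K)      # studentized version

print(Q.mean(), Q.var())          # close to K and 2K respectively
print(Z.mean(), Z.std())          # close to 0 and 1 respectively
```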
In order to proceed we need the following selected results from Imbens, Newey and Ridder (2006). These results establish convergence rates for the estimators of the regression function.
Lemma A.6 (Imbens, Newey and Ridder): Suppose Assumptions 3.1-3.3 hold. Then,

The following lemma describes the limiting distribution of the infeasible test statistic.

Lemma A.7 Under Assumptions 2.1-2.3 and 3.1−3.3,
Proof: We need only show the displayed condition. First, notice that we can rewrite γ̂_{w,K} as shown. Then, consider first equation (A.4), where the consistency of the sample variance follows by Lemma B.2 in the supplementary materials on the website.
by Markov's inequality. (A.7) follows from the fact that the relevant matrix is a projection matrix and is thus positive semi-definite. (A.8) follows from Lemma A.6 (i) and (ii).
Next, consider equation (A.5). We will work first with the second factor, which is bounded at the stated rate by Lemma A.1 and the continuous mapping theorem; the first factor is controlled by Assumption 3.2, Lemma A.1 (ii) and Markov's inequality. Combining these results yields an expression that is o_p(1) under Assumptions 3.2 and 3.3.
So then, all three terms are o_p(1) under Assumptions 3.2 and 3.3, and the result follows.

Proof of Theorem 3.2 First, note that
Because s/d > 25/4 by Assumption 3.2 and 1/(2s/d + 3) < ν < 2/19 by Assumption 3.3, it follows that the inequality above holds with probability going to one as N → ∞, and thus the corresponding bound holds with probability going to one as N → ∞. Since the normalizing sequence goes to infinity with the sample size, it follows that, for any M, the test statistic exceeds M with probability going to one. Next, we show what this implies. Let λ_min(A) denote the minimum eigenvalue of a matrix A. Denote λ_min(V^{−1}) by λ and note that, by Lemma A.2, λ is bounded away from zero.

Additional Proofs for: Crump, Hotz, Imbens and Mitnik, "Nonparametric Tests for Treatment Effect Heterogeneity"
Proof of Lemma A.1: We generalize the proof in Imbens, Newey and Ridder (2006). For (i) we will show that the expectation of the relevant term converges to zero, so that the result follows by Markov's inequality.
The second term and the first term are bounded in turn; we can then partition the latter expression into terms with i = j and terms with i ≠ j. Combining equations (B.1), (B.2) and (B.3) yields the desired bound, where (B.11) follows from the preceding bound and (B.12) follows since the maximum eigenvalue of Ω_{w,K} is O(1) (see below).
For (ii), let f_w(x) = f_{X|W}(x|W = w) and recall the definition of Ω_{w,K}, where Ω_{1,K} is normalized to equal I_K. Next, note that by Assumptions 2.3 and 3.1 the ratio of the two densities is bounded and bounded away from zero. Thus we may define q(x) ≡ q̲ + q̃(x) so that Q̃ is a positive semi-definite matrix, which implies that Ω_{0,K} ≥ q̲ · Ω_{1,K} in a positive semi-definite sense. Thus by (B.13), λ_min(Ω_{0,K}) ≥ q̲ · λ_min(Ω_{1,K}) = q̲, and the minimum eigenvalue of Ω_{0,K} is bounded away from zero. Also, since q̄ · Ω_{1,K} ≥ Q̃ in a positive semi-definite sense, using (B.14) we have d'Ω_{0,K}d ≤ q̲ + q̄ for any unit vector d, and the maximum eigenvalue of Ω_{0,K} is bounded. Both the minimum and maximum eigenvalues of Ω_{1,K} are bounded away from zero and bounded, respectively, by construction.
For (iii), consider the minimum eigenvalue of Ω̂_{w,K}, where (B.19) follows from the corresponding bound for a symmetric matrix A, since the norm is nonnegative for all values of λ_min(A), and (B.20) follows by part (i).

Next, consider the maximum eigenvalue of Ω̂_{w,K}, where (B.25) follows by the discussion above and (B.26) follows by part (i).

Proof of Lemma A.2:
where the last line follows by (B.26). Thus, λ_max(V̂) is bounded with probability going to one by part (i) and Assumption 3.3.
Before proving Theorem 3.3 we need the following lemma.

Lemma B.1 Recall that we partitioned V̂ as
where V̂_{00} and V_{00} are scalars, V̂_{01} and V_{01} are 1 × (K − 1) vectors, V̂_{10} and V_{10} are (K − 1) × 1 vectors, and V̂_{11} and V_{11} are (K − 1) × (K − 1) matrices.

Proof The proof follows from the eigenvalue interlacing theorem (see Li and Mathias, 2002): if A is an n × n positive semi-definite Hermitian matrix with eigenvalues λ_1 ≥ … ≥ λ_n, and B is a k × k principal submatrix of A with eigenvalues λ̃_1 ≥ … ≥ λ̃_k, then λ_i ≥ λ̃_i ≥ λ_{i+n−k} for i = 1, …, k. In our case, V̂ and V are symmetric and positive definite, and thus positive definite Hermitian, so the result follows from the interlacing theorem.

Proof of Theorem 3.3: When the conditional average treatment effect is constant, we may choose the two approximating sequences, γ^0_{0,K} and γ^0_{1,K}, to differ only in the first element (the coefficient on the constant term in the approximating sequence). In other words, if μ_1(x) − μ_0(x) = τ_0 for all x ∈ X, then the coefficients on the power series terms involving x^r with r > 0 should be identical for w = 0, 1, so that their difference no longer varies with x.
Thus, a natural strategy for testing the null hypothesis of a constant conditional average treatment effect is to compare the last K − 1 elements of γ̂_{1,K} and γ̂_{0,K} and to reject the null hypothesis when these elements are sufficiently different.
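In the notation of Lemma B.1's partition of V̂, this comparison can be written compactly. The display below is our paraphrase of the construction, not an equation reproduced from the paper:

```latex
% Under a constant effect the approximating coefficients can be chosen so
% that  \gamma^0_{1,K} - \gamma^0_{0,K} = (\tau_0, 0, \ldots, 0)'.
% Write \hat\gamma_{w,K} = (\hat\gamma_{w,0}, \tilde\gamma_{w}')', where
% \tilde\gamma_{w} collects the last K-1 elements.  A Wald-type statistic
% for the null of a constant conditional average treatment effect is then
\[
  T \;=\; \bigl(\tilde\gamma_{1} - \tilde\gamma_{0}\bigr)'\,
          \hat V_{11}^{-1}\,
          \bigl(\tilde\gamma_{1} - \tilde\gamma_{0}\bigr),
\]
% to be compared with chi-squared critical values with K-1 degrees of
% freedom, \hat V_{11} being the (K-1) x (K-1) block of \hat V.
```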

Lemma B.2 Suppose Assumptions 2.1-2.3 and 3.1-3.3 hold. Then,
where the last line follows from Lemma A.6 (iv), since ζ(K) = O(K), and by Assumption 3.3. Finally, consider equation (B.45). Note first that, We will first work with equation (B.47). Note that the individual summands have mean zero conditional on X. Thus,