Dealing with Limited Overlap in Estimation of Average Treatment Effects

Estimation of average treatment effects under unconfoundedness or exogenous treatment assignment is often hampered by lack of overlap in the covariate distributions. This lack of overlap can lead to imprecise estimates and can make commonly used estimators sensitive to the choice of specification. In such cases researchers have often used informal methods for trimming the sample. In this paper we develop a systematic approach to addressing such lack of overlap. We characterize optimal subsamples for which the average treatment effect can be estimated most precisely, as well as optimally weighted average treatment effects. Under some conditions the optimal selection rules depend solely on the propensity score. For a wide range of distributions a good approximation to the optimal rule is provided by the simple selection rule to drop all units with estimated propensity scores outside the range [0.1, 0.9].


Introduction
There is a large literature on estimating average treatment effects (ATE) under assumptions of unconfoundedness, ignorability, or exogeneity following the seminal work by Rubin (1974, 1978) and Rosenbaum and Rubin (1983a). Researchers have developed estimators based on regression methods (e.g., Hahn, 1998; Heckman, Ichimura and ...), matching (e.g., Rosenbaum, 1989; Abadie and ...), and methods based on the propensity score (e.g., Rosenbaum and Rubin, 1983a; Hirano, Imbens and Ridder, 2003). Related methods for missing data problems are discussed in Robins, Rotnitzky and Zhao (1995). 1 An important practical concern in implementing these methods is that one needs overlap between covariate distributions in the two subpopulations, i.e., there must be common support in the covariates across the two subpopulations. Even if there is overlap (common support), there may be parts of this common covariate space with few observations for one or the other treatment group. Such areas of limited overlap can lead to poor finite sample properties for many estimators of average treatment effects. In such cases, many of these estimators can have substantial bias, large variances, and considerable sensitivity to the exact specification of the treatment effect regression functions or of the propensity score. LaLonde (1986), Heckman, Ichimura and Todd (1997) and Dehejia and Wahba (1999) discuss the empirical relevance of this overlap issue. 2 One strand of the literature has focused on assessing the robustness of existing estimators to a variety of potential problems, including limited overlap. 3 A second strand focuses on developing new matching estimators of treatment effects or modifying existing ones to reduce their sensitivity and improve their precision in the face of the overlap problem.
For example, Rubin (1977) and Lee (2005b), in situations where there is a single discrete covariate, suggest simply discarding all units with covariate values with either no treated or no control units. Alternatively, Cochran and Rubin (1973) suggest caliper matching, where potential matches are dropped if the within-match difference in propensity scores exceeds some threshold level. LaLonde (1986) creates subsamples of the control group by conditioning on covariate values lying in ranges with substantial overlap. Ho, Imai, King and Stuart (2005) propose preprocessing the data by first matching units and carrying out parametric inferences using only the matched data. Heckman, Ichimura and Todd (1997), Heckman, Ichimura, Smith and Todd (1998), and Smith and Todd (2005), who focus on estimating the average treatment effect for the treated (ATT), discard all observations in both the treated and non-treated groups at values of the estimated propensity score where the estimated density is zero or very low. Dehejia and Wahba (1999), who also focus on estimating the ATT, discard those non-treated group observations for propensity scores that are less than the smallest value of the propensity score for those in the treated group.
1 See Rosenbaum (2001), Heckman, LaLonde and Smith (1999), Wooldridge (2002), Blundell and Costa-Diaz (2002), Imbens (2004) and Lee (2005a) for surveys of this literature.
2 Dehejia and Wahba (1999) write: "... our methods succeed for a transparent reason: They only use the subset of the comparison group that is comparable to the treatment group, and discard the complement." Heckman, Ichimura and Todd (1997) write "A major finding of this paper is that comparing the incomparable - i.e., violating the common support condition for the matching variables - is a major source of evaluation bias as conventionally measured."
3 See, for example, Rosenbaum and Rubin (1983b), Rosenbaum (2001), Imbens (2003), and Ichino, Mealli, and Nannicini (2005).
Although there are differences across these alternative strategies, they have several things in common. First, all of them discard observations for which there is no overlap between the treated and non-treated groups based on either the propensity score or the covariate distribution. As a result, each strategy focuses, in essence, on average treatment effect estimands that are defined for subsets of the sample observations and, thus, differ from either the typical ATE or ATT, which are defined over the full (population) covariate distribution. Second, each of these strategies is somewhat arbitrary, i.e., the strategies used to discard or reweight observations in forming new estimators are based on criteria with unknown properties.
In this paper, we propose a systematic approach to dealing with samples with limited overlap in the covariates. The approach has optimality properties with respect to the precision of estimating treatment effects and is straightforward to implement in practice. As with the previous methods, our approach is based on characterizing estimands that differ from the traditional ATE or ATT. We return below to the implications of and some justifications for this latter feature of our approach.
We consider the following two strategies. In the first, we focus on average treatment effects within a selected subpopulation defined in terms of covariate values. Inevitably, conditioning on a subpopulation based on any selection criterion reduces the effective sample size, which, all else the same, increases the variance of the estimated average treatment effect. However, if the subpopulation is chosen appropriately, it may be possible to estimate the average treatment within this subpopulation more precisely than the average effect for the entire population despite the smaller sample size. As we establish below, this tradeoff is, in general, well-defined and, under some conditions, leads to discarding units with propensity scores outside an interval [α, 1 − α], where the optimal cutoff value of α is solely determined by the distribution of the propensity score. Our approach is consistent with the practice noted above of researchers dropping units with extreme values of the propensity score, with two important distinctions. First, the role of the propensity score in our procedure is not imposed from the outset; rather, it emerges as a consequence of the criterion of variance minimization. Second, we have a systematic way of choosing the cutoff point, α. We refer to the resulting estimand as the Optimal Subpopulation Average Treatment Effect (OSATE). We note that the determination of the subset of observations that characterize a particular OSATE is based solely on the joint distribution of covariates and the treatment indicator and not on the outcome data. As a result, we avoid introducing deliberate bias with respect to the treatment effects being analyzed.
In the second strategy, we formulate weighted average treatment effects, where the weights depend only on covariates. Note that the OSATE can be viewed as a special case of these weighted treatment effects, where the weight function is restricted to be an indicator function. Within a broad class, we characterize the weight function that leads to the most precisely estimated average treatment effect. We note that this class of estimands includes the average treatment effect for the treated, where the weight function is proportional to the propensity score. Under the same conditions as before, the optimal weight function turns out to be a function of the propensity score; in fact, it is proportional to the product of the propensity score and one minus the propensity score. We refer to this as the Optimally Weighted Average Treatment Effect (OWATE).
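The three weight functions mentioned so far can be written down in a few lines. The following is only a sketch: the function names and the example scores are hypothetical, and the cutoff alpha = 0.1 is the rule-of-thumb value from the abstract rather than the optimal cutoff, which depends on the distribution of the propensity score.

```python
def osate_weight(e, alpha=0.1):
    """Indicator weight: 1 if the propensity score lies in [alpha, 1 - alpha].

    alpha = 0.1 is the rule-of-thumb cutoff, not the optimal value.
    """
    return 1.0 if alpha <= e <= 1.0 - alpha else 0.0


def att_weight(e):
    """Weight proportional to the propensity score (effect for the treated)."""
    return e


def owate_weight(e):
    """Weight proportional to e * (1 - e), the variance-minimizing choice."""
    return e * (1.0 - e)


def normalize(weights):
    """Rescale weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical estimated propensity scores for five units.
scores = [0.02, 0.15, 0.50, 0.85, 0.97]
kept = [e for e in scores if osate_weight(e) == 1.0]
owate = normalize([owate_weight(e) for e in scores])
```

The indicator weight drops the two units with extreme scores outright, whereas the OWATE weights keep every unit but give those units very little weight.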
Although both strategies we consider are similar to the more informal ones noted above, both remain somewhat uncommon in econometric analyses, precisely because they entail focusing on estimands that depend on sample data. 4 Typically, econometric analyses of treatment effects focus on estimands that are defined a priori for populations of interest, as is the case with the population average treatment effect or the average treatment effect for the treated subpopulation. In these cases, estimates are produced that turn out to be more or less precise, depending on the actual sample data. In contrast, we focus on average effects for a statistically defined (weighted) subpopulation. 5 This change of focus is not motivated, per se, by an intrinsic interest in the subpopulation for which we ultimately estimate the average causal effect. Rather, it acknowledges and addresses the difficulties in making inferences about the population of primary interest.
In our view this approach has several justifications. First, our approach of achieving precision in the estimation of treatment effects has analogues in the statistics literature. In particular, it is similar to the traditional motivation for medians rather than means as more precise measures of central tendency. In particular, by changing the sample from one that was potentially representative of the population of interest, we can gain greater internal validity, although, in doing so, we may sacrifice some of the external validity of the resulting estimates. 6 Furthermore, our proposed approach of placing greater stress on internal versus external validity is similar to that found in the design of randomized experiments which are often carried out on populations unrepresentative of the population of interest in order to improve the precision of the inferences to be drawn. More generally, the relative primacy of internal validity over external validity is advocated in many discussions of causal inference (see, for example, Shadish, Cook, and Campbell, 2002).
Second, our approach may be well-suited to situations where the primary interest is to determine whether a treatment may harm or benefit at least some group in a broader population. For example, one may be interested in whether there is any evidence that a particular drug could harm or have side effects for some group of patients in a well-defined population. In this context, obtaining greater precision in the estimation of a treatment effect, even if it is not for the entire population, is warranted. We note that the subpopulation for which these estimands are valid is defined in terms of the observed covariate values, so that one can determine, for each individual, whether they are in the relevant subpopulation or not.
Third, our approach can provide useful, albeit auxiliary, information when making inferences about the treatment effects for fixed populations. Thus, instead of only reporting the potentially imprecise estimate for the population average treatment effect, one can also report the estimates for the subpopulations where we can make more precise inferences.
4 We note that the local average treatment effect introduced by Imbens and Angrist (1994) represents another example in which a new estimand is introduced (one in which the average effect of the treatment is defined for the subpopulation of compliers) to deal with a phenomenon quite similar to limited overlap.
5 This is also true for the method proposed by ... .
6 A separate issue is that in practice in many cases even the original sample is not representative of the population of interest. For example, we are often interested in policies that would extend small pilot versions of job training programs to different locations and times.
Fourth, focusing on estimands that discard or reweight observations from the treated and non-treated group subsamples in order to improve precision tends to produce more balance in the distribution of the covariates across these groups. As has been noted elsewhere (Rosenbaum and Rubin, 1984; Heckman, Ichimura and Todd, 1998, among others), increasing the balance in the covariate distributions tends to reduce the sensitivity of treatment effect estimates to changes in the specification. In the extreme case, where the selected sample is completely balanced in covariates in the two treatment arms, one can simply use the average difference in outcomes between treated and control units.
At the same time, our focus on strategies for improving the precision of treatment effect estimators in the face of limited overlap has its limitations. For example, one might seek to devise strategies to deal with limited overlap of the covariate distributions that balance the representativeness of that distribution with precision. While exploring how to achieve such objectives is desirable, we see the results in this paper as an important first step in formulating strategies to deal with the problem of limited overlap that have well-defined properties and that can be implemented on real data.
Finally, it is important to note that the properties we derive below concerning the precision associated with both the OSATE and OWATE estimands are not tied to a specific estimator. Rather, we focus on differences in the efficiency bounds for different subpopulations. As a consequence, a range of efficient estimators (including the ones proposed by Hahn (1998), Hirano, Imbens and Ridder (2003), Imbens, Newey and Ridder (2006), and Robins, Rotnitzky and Zhao (1995)) can potentially be used to estimate these estimands, especially the OWATE. However, as we make clear below, these standard estimators are not readily applicable to the estimation of the OSATE, due to the complications that arise from having to estimate the optimal subsets of the covariate distribution for this estimand. Accordingly, we develop a new estimator that deals with this case and derive its large sample properties.
We illustrate these methods using data from the non-experimental part of a data set on labor market programs previously used by LaLonde (1986), Heckman and Hotz (1989), Dehejia and Wahba (1999), Smith and Todd (2005) and others. In this data set the overlap issue is a well-known problem, with the control and treatment groups far apart on some of the most important covariates, including lagged values for the outcome of interest, yearly earnings. Here our OSATE method suggests dropping 2363 out of 2675 observations (leaving only 312 observations, or just 12% of the original sample) in order to minimize the variance. Calculations suggest that this lowers the variance by a factor of 1/160,000, reflecting the fact that most of the controls are very different from the treated and that it is essentially impossible to estimate the population average treatment effect. More relevant, given that most of the researchers analyzing this data set have focused on the average effect for the treated, is that the variance for the optimal subsample is only 40% of that for the propensity score weighted sample (which estimates the effect on the treated).
The remainder of the paper is organized as follows. In Section 2, we present a simple example in which there is a single scalar covariate used in the estimation of the average treatment effect. This example allows us to illustrate how the precision of the estimates varies with changes in the estimand. Section 3 develops the general setup we use throughout the paper. Section 4 reviews the previous approaches to dealing with limited overlap when estimating treatment effects. In Section 5, we develop new estimands and discuss their precision gains. We also show that for a wide class of distributions the optimal set is well approximated by the set of observations with propensity scores in the interval [0.1, 0.9]. In Section 6, we discuss the properties of estimators for the OSATE and OWATE estimands. In Section 7, we present the application to the LaLonde data. Section 8 concludes.

A Simple Example
To set the stage for the issues to be discussed in this paper, consider the following simplified treatment effect example in which the covariate of interest, $X$, is a scalar taking on one of two values. In particular, suppose that $X = f$ (female) or $X = m$ (male), so that the covariate space is $\mathbb{X} = \{f, m\}$. For $x = f, m$, let $N_x$ be the sample size for the subsample with $X = x$, and let $N = N_f + N_m$ be the total sample size. Let $W \in \{0, 1\}$ denote the indicator for the treatment. Also, let $p = \Pr(X = m)$ be the population share of men, and let $\hat{p} = N_m/N$ be the share of men in the sample. We denote the average treatment effect, conditional on $X = x$, as $\tau_x$. Let $N_{xw}$ be the number of observations with covariate $X_i = x$ and treatment indicator $W_i = w$. It follows that $e_x = N_{x1}/N_x$ is the propensity score for $x = f, m$. Finally, let
\[ \bar{y}_{xw} = \frac{1}{N_{xw}} \sum_{i=1}^{N} Y_i \cdot 1\{X_i = x, W_i = w\} \]
be the average outcome within each of the four subsamples. We assume that the distribution of the outcomes is homoskedastic, i.e., the variance of $Y(w)$ given $X_i = x$ is $\sigma^2$ for all $x = f, m$ and $w = 0, 1$.
At the outset, consider the following two average treatment effects that differ in somewhat subtle ways. In particular, consider the average effect that is averaged over the sample distribution of the covariates,
\[ \tau_S = \hat{p} \cdot \tau_m + (1 - \hat{p}) \cdot \tau_f, \]
versus the average treatment effect for the full population,
\[ \tau_P = p \cdot \tau_m + (1 - p) \cdot \tau_f. \]
It is immediately obvious that for either $\tau_S$ or $\tau_P$ the natural estimator is
\[ \hat{\tau}_X = \hat{p} \cdot \hat{\tau}_m + (1 - \hat{p}) \cdot \hat{\tau}_f. \]
However, as we develop below, which estimand is the object of interest makes a difference in terms of the variance for this estimator, and this fact plays a crucial role in the results derived in this paper.
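To make things very simple, suppose that subjects are randomly assigned to one of the treatment statuses, $W_i = 0$ or $1$, conditional on $X$. In this case, the natural, unbiased estimators for the average treatment effects for each of the two subpopulations are
\[ \hat{\tau}_f = \bar{y}_{f1} - \bar{y}_{f0} \quad \text{and} \quad \hat{\tau}_m = \bar{y}_{m1} - \bar{y}_{m0}, \]
with variances (conditional on the covariates)
\[ V(\hat{\tau}_f) = \sigma^2 \left( \frac{1}{N_{f0}} + \frac{1}{N_{f1}} \right) = \frac{\sigma^2}{N (1 - \hat{p})} \cdot \frac{1}{e_f (1 - e_f)} \quad \text{and} \quad V(\hat{\tau}_m) = \sigma^2 \left( \frac{1}{N_{m0}} + \frac{1}{N_{m1}} \right) = \frac{\sigma^2}{N \hat{p}} \cdot \frac{1}{e_m (1 - e_m)}, \]
respectively. The estimator for the sample average treatment effect, $\tau_S$, is
\[ \hat{\tau}_X = \hat{p} \cdot \hat{\tau}_m + (1 - \hat{p}) \cdot \hat{\tau}_f. \]
Because the two estimates, $\hat{\tau}_f$ and $\hat{\tau}_m$, are independent, it follows that the variance of this estimator is
\[ V_S(\hat{\tau}_X) = \hat{p}^2 \cdot V(\hat{\tau}_m) + (1 - \hat{p})^2 \cdot V(\hat{\tau}_f) = \frac{\sigma^2}{N} \left( \frac{\hat{p}}{e_m (1 - e_m)} + \frac{1 - \hat{p}}{e_f (1 - e_f)} \right). \]
It follows that the asymptotic variance of $\sqrt{N} (\hat{\tau}_X - \tau_S)$ converges to
\[ AV\left( \sqrt{N} (\hat{\tau}_X - \tau_S) \right) = \sigma^2 \cdot E\left[ \frac{1}{e_X (1 - e_X)} \right]. \]
Note, however, that the asymptotic variance of $\sqrt{N} (\hat{\tau}_X - \tau_P)$ converges to
\[ AV\left( \sqrt{N} (\hat{\tau}_X - \tau_P) \right) = \sigma^2 \cdot E\left[ \frac{1}{e_X (1 - e_X)} \right] + E\left[ (\tau_X - \tau_P)^2 \right], \]
where the extra term in this second variance arises because of the difference between the average treatment effect conditional on the sample distribution of $X$ and the one for the full population.
The first formal result of the paper concerns the comparison of $V_S(\hat{\tau}_X)$, $V(\hat{\tau}_f)$, and $V(\hat{\tau}_m)$ according to a variance minimization criterion. In particular, the optimal subset $A^* \subset \mathbb{X}$, the one that attains $\min(V_S(\hat{\tau}_X), V(\hat{\tau}_f), V(\hat{\tau}_m))$, is given by
\[ A^* = \begin{cases} \{f\} & \text{if } \dfrac{e_m (1 - e_m)}{e_f (1 - e_f)} \le \dfrac{1 - \hat{p}}{2 - \hat{p}}, \\ \mathbb{X} & \text{if } \dfrac{1 - \hat{p}}{2 - \hat{p}} \le \dfrac{e_m (1 - e_m)}{e_f (1 - e_f)} \le \dfrac{1 + \hat{p}}{\hat{p}}, \\ \{m\} & \text{if } \dfrac{1 + \hat{p}}{\hat{p}} \le \dfrac{e_m (1 - e_m)}{e_f (1 - e_f)}. \end{cases} \qquad (2.1) \]
Note that which estimator has the smallest variance depends crucially on the ratio of the product of the propensity score and one minus the propensity score, $e_m (1 - e_m)/(e_f (1 - e_f))$. If the propensity score for women is close to zero or one, we cannot estimate the average treatment effect for women precisely. In that case the ratio $e_m (1 - e_m)/(e_f (1 - e_f))$ will be high and we may be able to estimate the average treatment effect for men more accurately than the average effect for the sample as a whole, even though we may well lose a substantial number of observations by discarding women. Similarly, if the propensity score for men is close to zero or one, the ratio $e_m (1 - e_m)/(e_f (1 - e_f))$ is close to zero, and we may be able to estimate the average treatment effect for the women more accurately than for the sample as a whole.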
If the ratio is close to one, we can estimate the average treatment effect for the population as a whole more accurately than for either of the two subpopulations. Put differently, based on the data, and more specifically the distribution of $(X, W)$, one might prefer to estimate $\tau_f$ (or $\tau_m$), rather than the overall average $\tau$, if, a priori, it is clear that $\tau$ cannot be estimated precisely, and $\tau_f$ (or $\tau_m$) can be estimated with accuracy. In this case there is a second obvious advantage of focusing on subpopulation average treatment effects. Within the two subpopulations, we can estimate the within-subpopulation average treatment effect without bias by simply differencing average treatment and control outcomes. As a result, our results are not sensitive to the choice of estimator for the within-subpopulation treatment effects. This need not be the case for the population as a whole, where there is potentially substantial bias from simply differencing average outcomes.
Note that we did not define $A^*$ so as to attain $\min(AV(\sqrt{N} (\hat{\tau}_X - \tau_P))/N, V(\hat{\tau}_f), V(\hat{\tau}_m))$. While doing so is, in principle, possible, it has two drawbacks, given our desire to determine the estimator which has the smallest variance and that is implementable in practice. First, using $\min(AV(\sqrt{N} (\hat{\tau}_X - \tau_P))/N, V(\hat{\tau}_f), V(\hat{\tau}_m))$ as the criterion for estimator selection would require one to evaluate $E[(\tau_X - \tau_P)^2]$, which would necessarily be difficult to do. Second, this criterion depends on the value of the treatment effect, and, as such, would require analyzing outcome data, $Y$, before selecting the sample. This would open the door to introducing deliberate biases of the sort avoided by a selection criterion that depends solely on the treatment and covariate data.
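The comparison of the three variances in this two-group example can be checked numerically. The sketch below assumes the homoskedastic variance expressions of this section, written in terms of the sample share of men and the two propensity scores; the function names and the numbers in the example are hypothetical.

```python
def subsample_variances(n, p_hat, e_f, e_m, sigma2=1.0):
    """Variances of the estimators for women, men, and the full sample.

    n: total sample size; p_hat: sample share of men;
    e_f, e_m: propensity scores for women and men.
    """
    a_f = e_f * (1.0 - e_f)
    a_m = e_m * (1.0 - e_m)
    v_f = sigma2 / (n * (1.0 - p_hat) * a_f)
    v_m = sigma2 / (n * p_hat * a_m)
    v_all = (sigma2 / n) * (p_hat / a_m + (1.0 - p_hat) / a_f)
    return {"f": v_f, "m": v_m, "all": v_all}


def optimal_subset(p_hat, e_f, e_m):
    """Selection rule based on the ratio e_m(1 - e_m) / (e_f(1 - e_f))."""
    ratio = (e_m * (1.0 - e_m)) / (e_f * (1.0 - e_f))
    if ratio <= (1.0 - p_hat) / (2.0 - p_hat):
        return "f"
    if ratio >= (1.0 + p_hat) / p_hat:
        return "m"
    return "all"


# Women are almost never treated, so their effect is poorly estimated.
variances = subsample_variances(n=1000, p_hat=0.5, e_f=0.01, e_m=0.5)
choice = optimal_subset(p_hat=0.5, e_f=0.01, e_m=0.5)
```

Note that the rule uses only the covariate shares and propensity scores, not the outcomes, which is what allows the subsample to be chosen before the outcome data are examined.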
A second issue concerns knowledge of $A^*$. In an actual data set, one typically does not know $A^*$ and it would have to be estimated, using estimated values for the propensity score and the covariate distribution. Call this estimate $\hat{A}$. In cases with continuous covariates, the uncertainty stemming from the difference between $\hat{A}$ and $A^*$ is not negligible. As a result, in our discussion of statistical inference below, we focus on the distribution of $\sqrt{N} (\hat{\tau}_{\hat{A}} - \tau_{\hat{A}})$, rather than on the distribution of $\sqrt{N} (\hat{\tau}_{\hat{A}} - \tau_{A^*})$. That is, we focus on the deviation of the estimated average effect relative to the average effect in the selected subsample, not relative to the average effect in the subset that would be optimal in the population. To be clear, focusing on $\tau_{\hat{A}}$ rather than $\tau_{A^*}$ has consequences. For example, suppose that our estimate is $\hat{A} = \{m\}$, so that we estimate the average treatment effect using only data for the male subpopulation. It may well be that, in fact, $A^* = \mathbb{X}$, so that the average treatment effect should be estimated over the population of men and women. Nevertheless, we focus on the distribution of $\hat{\tau}_{\hat{A}} - \tau_{\hat{A}} = \hat{\tau}_m - \tau_m$, rather than on the asymptotic distribution of $\hat{\tau}_{\hat{A}} - \tau_{A^*} = \hat{\tau}_{\hat{A}} - \tau_{\hat{A}} + (\tau_{\hat{A}} - \tau_{A^*})$. Given that $\hat{A}$ is known, and $A^*$ is not, the estimates would seem more interpretable that way.
The second result of the paper takes account of the fact that one need not limit the choice of average treatment effects to the three discussed so far. In particular, one may wish to consider a weighted average treatment effect of the form
\[ \tau_\lambda = \lambda \cdot \tau_m + (1 - \lambda) \cdot \tau_f, \]
for fixed $\lambda$. It follows that $\tau_\lambda$ can be estimated by
\[ \hat{\tau}_\lambda = \lambda \cdot \hat{\tau}_m + (1 - \lambda) \cdot \hat{\tau}_f, \]
where the variance for this weighted average treatment effect is given by
\[ V(\hat{\tau}_\lambda) = \lambda^2 \cdot V(\hat{\tau}_m) + (1 - \lambda)^2 \cdot V(\hat{\tau}_f) = \lambda^2 \cdot \frac{\sigma^2}{N \hat{p}} \cdot \frac{1}{e_m (1 - e_m)} + (1 - \lambda)^2 \cdot \frac{\sigma^2}{N (1 - \hat{p})} \cdot \frac{1}{e_f (1 - e_f)}. \]
It follows that the variance of this estimator is minimized by choosing $\lambda$ to be
\[ \lambda^* = \frac{V(\hat{\tau}_f)}{V(\hat{\tau}_m) + V(\hat{\tau}_f)} = \frac{\hat{p} \cdot e_m (1 - e_m)}{\hat{p} \cdot e_m (1 - e_m) + (1 - \hat{p}) \cdot e_f (1 - e_f)}, \qquad (2.2) \]
with the minimum value for the variance equal to
\[ V(\hat{\tau}_{\lambda^*}) = \frac{\sigma^2}{N} \cdot \frac{1}{\hat{p} \cdot e_m (1 - e_m) + (1 - \hat{p}) \cdot e_f (1 - e_f)} = \frac{\sigma^2}{N} \cdot \frac{1}{E[e_X (1 - e_X)]}. \qquad (2.3) \]
The ratio of the variance for the population average to the variance for the optimally weighted average treatment effect is
\[ \frac{V_S(\hat{\tau}_X)}{V(\hat{\tau}_{\lambda^*})} = E\left[ \frac{1}{e_X (1 - e_X)} \right] \Big/ \frac{1}{E[e_X (1 - e_X)]} = E\left[ \frac{1}{V(W|X)} \right] \Big/ \frac{1}{E[V(W|X)]}. \]
By Jensen's inequality this is greater than one if $V(W|X)$ varies with $X$.
So what are some of the implications of focusing on a criterion of variance minimization for selecting among alternative treatment effect estimators derived from this simplified example? Suppose one is interested in the sample average treatment effect, $\tau_S$. One may find that the efficient estimator for this average effect is likely to be imprecise, even before looking at the outcome data. This would be consistent with two states of the world that correspond to very different sets of information about treatment effects. In one state, the average effects for both of the subpopulations are imprecisely estimable, and, in effect, one cannot say much about the effect of the treatment at all. In the other state of the world it is still possible to learn something about the effect of the treatment because one of the subpopulation average treatment effects can be estimated precisely. In that case, which corresponds to the propensity score for one of the two subpopulations being close to zero or one, it may be useful to report also the estimator for the precisely estimable average treatment effect to convey the information the data contain about the effect of the treatment. It is important to stress that the message of the paper is not that one should report only $\hat{\tau}_m$ or $\hat{\tau}_f$ in place of $\hat{\tau}$. Rather, in cases where $\hat{\tau}_m$ or $\hat{\tau}_f$ are precisely estimable and $\hat{\tau}$ is not, we propose one should report both.
In the remainder of the paper, we generalize the above analysis to the case with a vector of potentially continuously distributed covariates. We study the existence and characterization of a partition of the covariate space $\mathbb{X}$ into two subsets, $A^*$ and $\mathbb{X} \setminus A^*$. For $A^*$, the average treatment effect is at least as accurately estimable as that for any other subset of the covariate space.
This leads to a generalization of (2.1). Under a certain set of assumptions, this problem has a well-defined solution and, under homoskedasticity, these subpopulations have a very simple characterization, namely the set of covariates such that the propensity score is in the interval $[\alpha, 1 - \alpha]$. The optimal value of the boundary point, $\alpha$, is determined by the distribution of the propensity score and its calculation is straightforward. Compared to the binary covariate case just considered, it will be difficult to argue in the general setting that this subpopulation is of intrinsic or substantive interest. We will not attempt to do so. Instead, we view it as an interesting average treatment effect because of its statistical properties and, in particular, as a convenient summary measure of the full distribution of conditional treatment effects $\tau(x)$. In addition, we characterize the optimally weighted average treatment effect and its variance, the generalization of (2.2) and (2.3).
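The optimal weight in the two-group example, and the variance ratio to which Jensen's inequality applies, can be illustrated with a short calculation. This is a sketch under the homoskedastic setup of this section: the weight on men is taken to be the inverse-variance weight implied by minimizing the variance of a weighted combination of two independent estimators, and all function names and numbers are hypothetical.

```python
def optimal_lambda(p_hat, e_f, e_m):
    """Variance-minimizing weight on the effect for men."""
    a_f = e_f * (1.0 - e_f)
    a_m = e_m * (1.0 - e_m)
    return p_hat * a_m / (p_hat * a_m + (1.0 - p_hat) * a_f)


def weighted_variance(lam, n, p_hat, e_f, e_m, sigma2=1.0):
    """Variance of lam * tau_hat_m + (1 - lam) * tau_hat_f."""
    v_m = sigma2 / (n * p_hat * e_m * (1.0 - e_m))
    v_f = sigma2 / (n * (1.0 - p_hat) * e_f * (1.0 - e_f))
    return lam ** 2 * v_m + (1.0 - lam) ** 2 * v_f


def variance_ratio(p_hat, e_f, e_m):
    """Sample-average variance divided by the optimally weighted variance.

    Equals E[1 / V(W|X)] * E[V(W|X)], with expectations taken over the
    sample distribution of X; it is at least one by Jensen's inequality.
    """
    a_f = e_f * (1.0 - e_f)
    a_m = e_m * (1.0 - e_m)
    mean_inverse = p_hat / a_m + (1.0 - p_hat) / a_f
    mean_variance = p_hat * a_m + (1.0 - p_hat) * a_f
    return mean_inverse * mean_variance


lam_star = optimal_lambda(p_hat=0.5, e_f=0.1, e_m=0.5)
v_star = weighted_variance(lam_star, n=1000, p_hat=0.5, e_f=0.1, e_m=0.5)
ratio = variance_ratio(p_hat=0.5, e_f=0.1, e_m=0.5)
```

When the two propensity scores coincide, the ratio equals one and reweighting brings no gain; the gain grows as the conditional treatment variances of the two groups diverge.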

Setup
The framework we use is standard in this literature. 7 We have a random sample of size $N$ from a large population. For each unit $i$ in the sample, let $W_i$ indicate whether the treatment of interest was received, with $W_i = 1$ if unit $i$ receives the treatment of interest, and $W_i = 0$ if unit $i$ receives the control treatment. Using the potential outcome notation popularized by Rubin (1974), let $Y_i(0)$ denote the outcome for unit $i$ under control and $Y_i(1)$ the outcome under treatment. We observe $W_i$ and $Y_i$, where
\[ Y_i \equiv Y_i(W_i) = W_i \cdot Y_i(1) + (1 - W_i) \cdot Y_i(0). \]
In addition, we observe a vector of pre-treatment variables, or covariates, denoted by $X_i$. Define the two conditional mean functions, $\mu_w(x) = E[Y(w) \mid X = x]$ for $w = 0, 1$, and the propensity score, the probability of selection into the treatment, $e(x) = \Pr(W = 1 \mid X = x) = E[W \mid X = x]$. Initially, we focus on two average treatment effects. The first is the (super-)population average treatment effect
\[ \tau_P = E[Y(1) - Y(0)]. \qquad (3.4) \]
We also consider the sample average treatment effect
\[ \tau_S = \frac{1}{N} \sum_{i=1}^{N} \tau(X_i), \quad \text{with } \tau(x) = \mu_1(x) - \mu_0(x), \]
where we condition on the observed set of covariates. The reason for focusing on the second one is twofold. First, it is analogous to the conditioning on covariates commonly used in regression analysis. Second, it can be estimated more precisely if there is variation in the treatment effect by covariates.
To solve the identification problem, we maintain throughout the paper the unconfoundedness assumption (Rubin, 1978; Rosenbaum and Rubin, 1983a), which asserts that conditional on the pre-treatment variables, the treatment indicator is independent of the potential outcomes:
Assumption 3.1 (Unconfoundedness) $W_i \perp (Y_i(0), Y_i(1)) \mid X_i$.
This assumption is widely used in this literature. See discussions in Hahn (1998), Hirano, Imbens, and Ridder (2003), Lechner (2002a), and others. In addition, we assume there is overlap in the covariate distributions:
Assumption 3.2 (Overlap) For some $c > 0$, $c \le e(x) \le 1 - c$ for all $x \in \mathbb{X}$.
In addition, one often needs smoothness conditions on the two regression functions $\mu_w(x)$ and the propensity score $e(x)$ for estimation. We make those assumptions explicit in Section 6.

Previous Approaches to Dealing with Limited Overlap
In empirical applications, there is often concern about the overlap assumption (e.g., Dehejia and Wahba, 1999; Heckman, Ichimura, and Todd, 1997). As noted in the Introduction, researchers have sometimes trimmed their sample by excluding observations with propensity scores close to zero or one in order to ensure that there is sufficient overlap. Cochran and Rubin (1973) suggest using caliper matching, where units whose match quality is too low according to the distance in terms of the propensity score are left unmatched. Dehejia and Wahba (1999) focus on the average effect for the treated. They suggest dropping all control units with an estimated propensity score lower than the smallest value for the estimated propensity score among the treated units. Formally, they first estimate the propensity score. Let the estimated propensity score for unit $i$ be $\hat{e}(X_i)$. Then let $e_1$ be the minimum of the $\hat{e}(X_i)$ among treated units. Dehejia and Wahba drop all control units such that $\hat{e}(X_i) < e_1$.
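The Dehejia and Wahba rule described above is simple enough to state in code. The following is a minimal sketch that takes the estimated propensity scores as given; the function name and the example data are hypothetical, and the estimation of the propensity score itself is not shown.

```python
def dehejia_wahba_trim(scores, treated):
    """Flag the units kept under the rule described in the text.

    scores: estimated propensity scores, one per unit.
    treated: 1 for treated units, 0 for controls.
    Returns (keep, e_1), where e_1 is the smallest estimated propensity
    score among treated units and keep[i] is False only for control
    units whose score is below e_1.
    """
    e_1 = min(e for e, w in zip(scores, treated) if w == 1)
    keep = [w == 1 or e >= e_1 for e, w in zip(scores, treated)]
    return keep, e_1


# Hypothetical data: three controls followed by three treated units.
scores = [0.05, 0.20, 0.60, 0.10, 0.30, 0.80]
treated = [0, 0, 0, 1, 1, 1]
keep, e_1 = dehejia_wahba_trim(scores, treated)
```

Because only controls are dropped, the treated sample is unchanged, which matches the rule's focus on the average effect for the treated.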
Heckman, Ichimura and Todd (1997), Heckman, Ichimura, Smith and Todd (1998) and Smith and Todd (2005) also focus on the average effect for the treated. They propose discarding units with covariate values at which the estimated density is below some threshold. The precise method is as follows. 8 First, they estimate the propensity score $\hat{e}(x)$. Next, they estimate the density of the estimated propensity score in both treatment arms. Let $\hat{f}_w(e)$ denote the estimated density of the estimated propensity score. The specific estimator they use is a kernel estimator,
\[ \hat{f}_w(e) = \frac{1}{N_w \cdot h} \sum_{i : W_i = w} K\left( \frac{\hat{e}(X_i) - e}{h} \right), \]
with bandwidth $h$, where $N_w$ is the number of units with $W_i = w$. 9 First, Heckman, Ichimura and Todd discard observations with $\hat{f}_0(\hat{e}(X_i))$ or $\hat{f}_1(\hat{e}(X_i))$ exactly equal to zero, leaving $J$ observations. 10 Next, they fix a quantile $q$ (Smith and Todd use $q = 0.02$). Using the $J$ observations with positive densities, they rank the $2J$ values of $\hat{f}_0(\hat{e}(X_i))$ and $\hat{f}_1(\hat{e}(X_i))$. They then drop units $i$ with $\hat{f}_0(\hat{e}(X_i))$ or $\hat{f}_1(\hat{e}(X_i))$ less than or equal to $c_q$, where $c_q$ is the largest real number such that
\[ \frac{1}{2J} \sum_{i=1}^{J} \left( 1\{\hat{f}_0(\hat{e}(X_i)) < c_q\} + 1\{\hat{f}_1(\hat{e}(X_i)) < c_q\} \right) \le q. \]
Ho, Imai, King and Stuart (2005) propose combining any specific parametric procedure that the researcher may wish to employ with a nonparametric first stage. In this first stage, all treated units are matched to the closest control unit. Only the treated units and their matches are then used in the second stage. The first stage leads to a data set that is more balanced in terms of covariate distributions between treated and control. It thus reduces sensitivity of the parametric model to specific modelling decisions such as the inclusion of covariates or functional form assumptions.
All these methods tend to make the estimators more robust to specification decisions. However, few formal results are available on the properties of these procedures. They typically also depend on arbitrarily selected values for the "trimming" parameters.

Alternative Estimands
This section contains the main results of the paper. First, in Subsection 5.1, we review some results on efficiency bounds and present one new result. These efficiency bounds are used to motivate the estimands that we propose in the remainder of this section. In Subsection 5.2, we discuss the choice of criteria for selecting estimands that have optimal properties with respect to the estimation of average treatment effects. In Subsection 5.3, we derive the optimal subset of covariates over which to define the estimand, and in Subsection 5.4 we derive the optimal weights. Finally we provide some numerical calculations based on the Beta distribution.

Efficiency Bounds
In this subsection, we discuss some results on efficiency bounds for average treatment effects that will be used to motivate the estimands proposed in this paper. In addition, we present a new result on efficiency bounds.
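Various bounds have been derived in the literature. 11 Hahn (1998) derived the semiparametric efficiency bound for $\tau_P = E[Y(1) - Y(0)]$ under unconfoundedness and overlap (and some regularity conditions). 12 The efficiency bound for $\tau_P$ is
$$E\left[\frac{\sigma_1^2(X)}{e(X)} + \frac{\sigma_0^2(X)}{1 - e(X)} + (\tau(X) - \tau_P)^2\right].$$
A generalization of $\tau_P$ to the case of the weighted average treatment effect, given by
$$\tau_{P,\omega} = \frac{E[\omega(X) \cdot \tau(X)]}{E[\omega(X)]},$$
with the weight function $\omega : X \to R$ known, is considered in Hirano, Imbens and Ridder (2003), where they establish that the efficiency bound for $\tau_{P,\omega}$ is
$$V_{P,\omega} = \frac{1}{E[\omega(X)]^2} \cdot E\left[\omega(X)^2 \cdot \left(\frac{\sigma_1^2(X)}{e(X)} + \frac{\sigma_0^2(X)}{1 - e(X)} + (\tau(X) - \tau_{P,\omega})^2\right)\right]. \quad (5.7)$$
Hirano, Imbens and Ridder (2003) propose the efficient estimator
$$\hat\tau_\omega = \sum_{i=1}^N \omega(X_i) \left(\frac{W_i Y_i}{\hat e(X_i)} - \frac{(1 - W_i) Y_i}{1 - \hat e(X_i)}\right) \Big/ \sum_{i=1}^N \omega(X_i)$$
for $\tau_{P,\omega}$. The influence function for this estimator is
$$\psi_\omega(y, w, x) = \frac{\omega(x)}{E[\omega(X)]} \left( w \cdot \frac{y - \mu_1(x)}{e(x)} - (1 - w) \cdot \frac{y - \mu_0(x)}{1 - e(x)} + \mu_1(x) - \mu_0(x) - \tau_{P,\omega} \right).$$
Note that $\hat\tau_\omega$ also can be interpreted as an estimator of the weighted sample average treatment effect,
$$\tau_{S,\omega} = \sum_{i=1}^N \omega(X_i) \cdot \tau(X_i) \Big/ \sum_{i=1}^N \omega(X_i).$$
As an estimator for $\tau_{S,\omega}$, $\hat\tau_\omega$ satisfies
$$\sqrt{N} (\hat\tau_\omega - \tau_{S,\omega}) \to_d N(0, V_{S,\omega}), \qquad V_{S,\omega} = \frac{1}{E[\omega(X)]^2} \cdot E\left[\omega(X)^2 \cdot \left(\frac{\sigma_1^2(X)}{e(X)} + \frac{\sigma_0^2(X)}{1 - e(X)}\right)\right]. \quad (5.9)$$
9 In their application Smith and Todd (2005) use Silverman's rule of thumb to choose the bandwidth.
10 Observations with the estimated density exactly equal to zero may exist when the kernel has finite support. For example, Smith and Todd (2005) use a quadratic kernel with $K(u) = (u^2 - 1)^2$ for $|u| \le 1$ and zero elsewhere.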
Comparing the efficiency bound for τ P,ω in (5.7) with the asymptotic variance in (5.9), it follows that we can estimate τ S,ω more accurately than τ P,ω , so long as there is variation (with X ) in the treatment effect τ (x).
Next we consider the case where the weights depend on the propensity score: $\omega(x) = \lambda(e(x))$, with $\lambda : [0, 1] \to R$ known and the propensity score unknown. (If the propensity score is known, this is a special case of the previous result.) If the propensity score is unknown, the efficiency bound changes. We establish what it is in the following theorem:
Theorem 5.1 (Weighted Average Treatment Effects with Weights Depending on the Propensity Score) Suppose Assumptions 3.1 and 3.2 hold, and suppose that the weights are a function of the propensity score: $\omega(x) = \lambda(e(x))$ with $\lambda(e)$ known and $e(x)$ unknown. Then the semiparametric efficiency bound for $\tau_{P,\lambda}$ is
$$V_{P,\lambda} = \frac{1}{E[\lambda(e(X))]^2} \cdot E\left[\lambda(e(X))^2 \cdot \left(\frac{\sigma_1^2(X)}{e(X)} + \frac{\sigma_0^2(X)}{1 - e(X)} + (\tau(X) - \tau_{P,\lambda})^2\right)\right] + \frac{1}{E[\lambda(e(X))]^2} \cdot E\left[e(X)(1 - e(X)) \cdot \left(\frac{\partial \lambda}{\partial e}(e(X))\right)^2 \cdot (\tau(X) - \tau_{P,\lambda})^2\right]. \quad (5.10)$$
The difference between the known weight function case (5.7) and the case with the weight function depending on the unknown propensity score established in Theorem 5.1 is the last term in (5.10). The fact that this term depends on the derivative of the weight function with respect to the propensity score will give rise to problems in formulating an implementable estimator. We address this issue in Section 6 below.
12 See also Robins, Rotnitzky and Zhao (1995) for a related result in a missing data setting.

A Criterion for Choosing the Estimand
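We now consider the problem of selecting the estimand that minimizes the asymptotic variance in (5.9). Formally, we choose an estimand $\tau_{S,\omega}$ by choosing the weight function $\omega(x)$ that minimizes
$$\frac{E\left[\omega(X)^2 \cdot \left(\frac{\sigma_1^2(X)}{e(X)} + \frac{\sigma_0^2(X)}{1 - e(X)}\right)\right]}{E[\omega(X)]^2}. \quad (5.11)$$
Under the assumption that the distribution of Y is homoskedastic, the criterion is slightly modified and one minimizes
$$\frac{E\left[\omega(X)^2 \cdot \left(\frac{1}{e(X)} + \frac{1}{1 - e(X)}\right)\right]}{E[\omega(X)]^2}. \quad (5.12)$$
However, before implementing this approach, we offer several comments in support of using these variance-minimization criteria.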
As we have discussed, one of the consequences of limited overlap in the covariate distributions is the imprecision with which average treatment effects are estimated. Suppose, for now, that the propensity score is known. In that case, no matter how unbalanced the sample is, there is an estimator that is exactly unbiased as long as the propensity score is strictly between zero and one. However, if the sample is severely imbalanced, the efficiency bound (and, in this case, the exact variance of the associated estimator) will be large. Because of this imprecision, it is desirable to try to modify the estimator to improve its precision. One possibility is to utilize a mean-squared-error type criterion for this modification. Unfortunately, implementing such a criterion would be very difficult in practice, as the biases of alternative estimators would be difficult to estimate. This is because these biases will depend on the entire function $\tau(x)$, which is likely to be difficult to estimate for some values of x. More generally, alternative estimators for $\tau_P$ will tend to suffer from this same problem. To get around this problem, we focus on alternative estimands to $\tau_P$. In doing so, two questions naturally arise: What is the class of estimands, and what is the criterion for choosing an estimand within that class? A natural class of estimands would seem to be $\tau_{P,\omega} = E[\omega(X) \cdot \tau(X)]/E[\omega(X)]$, with the asymptotic variance of the associated estimators as the criterion, in order to reduce the imprecision associated with limited overlap. We do not pursue this class of estimands because evaluating the asymptotic variance of estimators for such estimands requires estimation of $\tau(x)$ over the entire support to deal with the term $E[\omega^2(X) \cdot (\tau(X) - \tau_{P,\omega})^2]$ in the expression of the asymptotic variance in (5.7), making it difficult to implement in practice. 
Furthermore, the resulting variance-minimization criterion associated with this estimand would depend on values of the treatment effect, introducing the potential bias discussed in Section 2.
For these two reasons, we limit ourselves to the class of estimands given by $\tau_{S,\omega} = \sum_{i=1}^N \omega(X_i) \cdot \tau(X_i) / \sum_{i=1}^N \omega(X_i)$. We note, however, that this is not the only possible approach. There may be alternative classes of estimands or alternative criteria that would lead to effective solutions. The key, however, is to use a systematic approach characterized by a class of estimands and a formal criterion for choosing an estimand within that class, both of which are easy to implement.

The Optimal Subpopulation Average Treatment Effect
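In this section, we characterize the Optimal Subpopulation Average Treatment Effect (OSATE). We do so by restricting attention to weight functions that are indicator functions: $\omega(x) = 1\{x \in A\}$, where A is some closed subset of the covariate space X. For a given set A, we define the corresponding population and sample average treatment effects $\tau_{P,A}$ and $\tau_{S,A}$ as
$$\tau_{P,A} = E[\tau(X) \mid X \in A], \qquad \tau_{S,A} = \frac{1}{N_A} \sum_{i: X_i \in A} \tau(X_i),$$
where $N_A$ is the number of units with $X_i \in A$. With this class of weight functions, the criterion given in (5.11) can be written as
$$\frac{1}{q(A)} \cdot E\left[\frac{\sigma_1^2(X)}{e(X)} + \frac{\sigma_0^2(X)}{1 - e(X)} \,\Big|\, X \in A\right], \quad (5.13)$$
where $q(A) = \Pr(X \in A)$. We look for an optimal A, denoted by $A^*$, that minimizes the asymptotic variance (5.13) among all closed subsets A.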
As noted in the Introduction, focusing on estimands that discard observations to reduce the variance of average treatment effect estimators has two opposing effects. First, by excluding units with covariate values outside the set A, one reduces the effective sample size from N to $N \cdot q(A)$. This will increase the asymptotic variance by a factor $1/q(A)$. Second, by discarding units with high values for $\sigma_1^2(X)/e(X) + \sigma_0^2(X)/(1 - e(X))$ (that is, units with covariate values x such that it is difficult to estimate the average treatment effect $\tau(x)$), one can lower the conditional expectation $E[\sigma_1^2(X)/e(X) + \sigma_0^2(X)/(1 - e(X)) \mid X \in A]$. Optimally choosing A involves balancing these two effects. The following theorem gives the formal result for the optimal $A^*$ that minimizes the asymptotic variance. Define $k(x) = \sigma_1^2(x)/e(x) + \sigma_0^2(x)/(1 - e(x))$.
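Theorem 5.2 (Optimal Overlap for the Average Treatment Effect) Suppose Assumptions 3.1-3.2 hold. Then the OSATE is $\tau_{S,A^*}$, where $A^* = X$ if
$$\sup_{x \in X} k(x) \le 2 \cdot E[k(X)],$$
and $A^* = \{x \in X \mid k(x) \le \alpha\}$ otherwise, where $\alpha$ is a solution to
$$\alpha = 2 \cdot E[k(X) \mid k(X) < \alpha].$$
Proof: See Appendix. The result in this theorem simplifies in an interesting way under homoskedasticity.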
Corollary 5.1 (Optimal Overlap for the Average Treatment Effect Under Homoskedasticity) Suppose that the conditions of Theorem 5.2 hold, and that $\sigma_w^2(x) = \sigma^2$ for all $w \in \{0, 1\}$ and $x \in X$. Then the OSATE under homoskedasticity is $\tau_{S,A^*_H}$, where $A^*_H = X$ if
$$\sup_{x \in X} \frac{1}{e(x)(1 - e(x))} \le 2 \cdot E\left[\frac{1}{e(X)(1 - e(X))}\right],$$
and $A^*_H = \{x \in X \mid \alpha \le e(x) \le 1 - \alpha\}$ otherwise, where $\alpha$ is a solution to
$$\frac{1}{\alpha(1 - \alpha)} = 2 \cdot E\left[\frac{1}{e(X)(1 - e(X))} \,\Big|\, \frac{1}{e(X)(1 - e(X))} < \frac{1}{\alpha(1 - \alpha)}\right].$$
In Section 6, we focus on an estimator for the optimal estimand based on choosing the weight function that minimizes the criterion in (5.12), which corresponds to the case where the distribution of Y is homoskedastic, rather than using the criterion in (5.11). We use this criterion even though we do not presume that the true distribution of Y is homoskedastic and, in fact, derive the asymptotic properties of this estimator under heteroskedasticity. We use the homoskedastic criterion in (5.12) for three distinct reasons. The first, a principled reason, is that estimators of $A^*_H$ do not require using outcome data. The entire analysis of selecting the sample can be carried out without using the outcome data, thus avoiding any deliberate bias that may result from selecting the sample based on outcome data. The second, a practical reason, is that the entire analysis is motivated by the difficulty of estimating $\tau(x)$ for covariates in some subset of the covariate space. For those values, it is even less likely that one can estimate the conditional variances $\sigma_w^2(x)$ accurately, and, hence, methods that rely on nonparametrically estimating these conditional variances are unlikely to be effective using sample sizes found in practice. Note that the imbalance that precludes accurate estimation of $\tau(x)$ and $\sigma_w^2(x)$ does not necessarily preclude accurate estimation of the propensity score $e(x)$. In fact, when it is impossible to estimate $\tau(x)$ because there are no treated or no control units for a particular value of X, it need not be difficult at all to estimate the propensity score accurately. The third reason is that it is rare in applications to find conditional variances that differ by an order of magnitude. 
In contrast, it is common to find considerable variation in the propensity score, so that the dependence of the optimal region on the conditional variances is likely to be less important. For these reasons, we focus on $A^*_H$, the optimal set under homoskedasticity, even though none of the estimation procedures used in making inferences will maintain this assumption.
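As a rough illustration of how the homoskedastic rule could be applied to a vector of estimated propensity scores, the following Python sketch minimizes the sample analogue of the criterion (5.13) with constant conditional variances directly over candidate cutoffs. It is a sketch under stated assumptions rather than the estimator analyzed in Section 6: the function name `optimal_cutoff` is hypothetical, the scores are assumed to lie strictly between zero and one, and the search over the order statistics of $1/(e(1 - e))$ is an implementation choice made here for simplicity.

```python
import math


def optimal_cutoff(scores):
    """Return alpha such that units with alpha <= e <= 1 - alpha are kept.

    Minimizes the sample analogue of (1 / q(A)) * E[1 / (e (1 - e)) | A]
    over sets of the form A = {1 / (e (1 - e)) <= gamma}.
    Returns 0.0 when keeping every unit is optimal.
    """
    g = sorted(1.0 / (e * (1.0 - e)) for e in scores)
    n = len(g)
    best_j, best_v, running = n, float("inf"), 0.0
    for j, g_j in enumerate(g, start=1):
        running += g_j
        # Keeping the j units with the smallest g gives q = j / n and
        # criterion (running / n) / (j / n)^2 = n * running / j^2.
        v = n * running / (j * j)
        if v <= best_v:
            best_j, best_v = j, v
    if best_j == n:
        return 0.0  # no trimming needed
    gamma = g[best_j - 1]
    # Invert gamma = 1 / (alpha (1 - alpha)) for the root below 1/2.
    return 0.5 - math.sqrt(max(0.0, 0.25 - 1.0 / gamma))


# Illustration: scores spread evenly over (0, 1).
grid_scores = [(i + 0.5) / 2000 for i in range(2000)]
alpha_grid = optimal_cutoff(grid_scores)
```

For evenly spread scores the resulting cutoff is close to 0.1, in line with the rule of thumb discussed in the text. The selection uses only the propensity scores, so no outcome data enter the choice of the sample.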
The final result in this section concerns the case where we are interested only in the average treatment effect for the treated. In this case, it makes sense to limit the estimand to the average over the subpopulation of the treated with sufficient overlap. Formally, we are interested in the set A that minimizes
$$\frac{E\left[e(X)^2 \cdot \left(\frac{\sigma_1^2(X)}{e(X)} + \frac{\sigma_0^2(X)}{1 - e(X)}\right) \cdot 1\{X \in A\}\right]}{E[e(X) \cdot 1\{X \in A\}]^2}. \quad (5.14)$$
We only present the result under homoskedasticity.

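Theorem 5.3 (Optimal Overlap for the Average Effect for the Treated Under Homoskedasticity) Suppose Assumptions 3.1-3.2 hold, and that $\sigma_w^2(x) = \sigma^2$ for all $w \in \{0, 1\}$ and $x \in X$. Then the set that minimizes (5.14) is $A^*_t = X$ if
$$\sup_{x \in X} \frac{1}{1 - e(x)} \le 2 \cdot \frac{E[e(X)/(1 - e(X))]}{E[e(X)]},$$
and $A^*_t = \{x \in X \mid e(x) \le 1 - \alpha_t\}$ otherwise, where $\alpha_t$ is a solution to
$$\frac{1}{\alpha_t} = 2 \cdot \frac{E[e(X)/(1 - e(X)) \mid e(X) < 1 - \alpha_t]}{E[e(X) \mid e(X) < 1 - \alpha_t]}.$$
Proof: See Appendix.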

The Optimally Weighted Average Treatment Effect
In this section, we consider weighted average treatment effects of the form
$$\tau_{S,\omega} = \sum_{i=1}^N \omega(X_i) \cdot \tau(X_i) \Big/ \sum_{i=1}^N \omega(X_i),$$
without requiring $\omega(x)$ to be an indicator function. The following theorem gives the most precisely estimable weighted average treatment effect.
Theorem 5.4 (Optimally Weighted Average Treatment Effect) Suppose Assumptions 3.1-3.2 hold. Let $\underline{f} \le f_X(x) \le \overline{f}$, and $\sigma_w^2(x) \le \overline{\sigma}^2$ for w = 0, 1 and all $x \in X$. Then the Optimal Weighted Average Treatment Effect (OWATE) is $\tau_{S,\omega^*}$, where
$$\omega^*(x) = \left(\frac{\sigma_1^2(x)}{e(x)} + \frac{\sigma_0^2(x)}{1 - e(x)}\right)^{-1}.$$
Again the result simplifies under homoskedasticity to an estimand in which the weight function depends only on the propensity score: $\omega^*_H(x) = e(x) \cdot (1 - e(x))$.
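To make the comparison between weighting schemes concrete, the sketch below evaluates the sample analogue of the homoskedastic criterion (5.12) for three weight functions: equal weights, the indicator for scores in [0.1, 0.9], and the weights $e(x)(1 - e(x))$. The function name `criterion`, the simulated uniform scores and the seed are assumptions made for this illustration only. By the Cauchy-Schwarz inequality, the weights $e(1 - e)$ attain the smallest value of the sample criterion among all weight functions, including indicators.

```python
import random


def criterion(weights, scores):
    """Sample analogue of (5.12): mean(w^2 * k) / mean(w)^2, k = 1/e + 1/(1-e)."""
    n = len(scores)
    numerator = sum(w * w / (e * (1.0 - e)) for w, e in zip(weights, scores)) / n
    denominator = (sum(weights) / n) ** 2
    return numerator / denominator


# Simulated propensity scores (illustrative assumption, not data from the paper).
rng = random.Random(0)
scores = [rng.uniform(0.001, 0.999) for _ in range(5000)]

v_all = criterion([1.0] * len(scores), scores)                 # no trimming
v_trim = criterion([1.0 if 0.1 <= e <= 0.9 else 0.0 for e in scores], scores)
v_owate = criterion([e * (1.0 - e) for e in scores], scores)   # optimal weights
```

With the optimal weights the criterion reduces to one over the sample mean of $e(1 - e)$, which shows directly that units with scores near zero or one contribute little to the estimand.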

Numerical Simulations for Optimal Estimands when the Propensity Score follows a Beta Distribution
In this section, we assess the implications of the results derived in the previous sections. We do so by presenting numerical calculations for the optimal estimands when the true propensity score follows a Beta distribution. We study the homoskedastic case, where the optimal cutoff value as well as the ratio of the variances depends only on the (true) marginal distribution of the propensity score. The Beta distribution is characterized by two parameters, here denoted by β and γ, both positive. For a Beta distribution with parameters β and γ, denoted by B(β, γ), the mean is β/(γ + β), ranging from zero to one. The corresponding variance is $\beta\gamma/((\gamma + \beta)^2 (\gamma + \beta + 1))$, which lies between zero and 1/4. The variance approaches its largest value, 1/4, as γ = β tends to zero, in which case the distribution approaches a two-point distribution with probability 1/2 for both zero and one. We focus on distributions for the true propensity score where β ∈ {0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4} and γ ∈ {β, . . . , 4}. 13 For a given pair of values (β, γ), let $V_P(\beta, \gamma)$ denote the asymptotic variance of the efficient estimator for the sample average treatment effect, $\tau_S$, which is given (up to the factor $\sigma^2$, which cancels in the ratios reported below) by
$$V_P(\beta, \gamma) = E\left[\frac{1}{e(X)} + \frac{1}{1 - e(X)} \,\Big|\, e(X) \sim B(\beta, \gamma)\right].$$
In addition, let $V_{P,\alpha}(\beta, \gamma)$ denote the asymptotic variance for the sample average treatment effect, where we drop observations with the propensity score outside the interval [α, 1 − α]. This variance is given by
$$V_{P,\alpha}(\beta, \gamma) = \frac{1}{\Pr(\alpha \le e(X) \le 1 - \alpha \mid e(X) \sim B(\beta, \gamma))} \cdot E\left[\frac{1}{e(X)} + \frac{1}{1 - e(X)} \,\Big|\, \alpha \le e(X) \le 1 - \alpha,\ e(X) \sim B(\beta, \gamma)\right].$$
13 There is no difference from our perspective between a Beta distribution with parameters γ and β and one with parameters β and γ.
Finally, let α(β, γ) denote the optimal cutoff point for the case where the true propensity score has a Beta distribution with parameters γ and β. We calculate the resulting variances, $V_{P,\alpha}(\beta, \gamma)$, for the optimal cutoff point and two fixed cutoff values, 0.01 and 0.1. For each of the Beta distributions we report the three ratios
$$\frac{V_P(\beta, \gamma)}{V_{P,\alpha(\beta,\gamma)}(\beta, \gamma)}, \qquad \frac{V_{P,0.01}(\beta, \gamma)}{V_{P,\alpha(\beta,\gamma)}(\beta, \gamma)}, \qquad \text{and} \qquad \frac{V_{P,0.10}(\beta, \gamma)}{V_{P,\alpha(\beta,\gamma)}(\beta, \gamma)}.$$
Table 1 presents results for this case. There are two main findings. First, the gain from trimming the sample can be substantial, reducing the asymptotic variance of the average treatment effect estimand by a factor of up to ten for some of the Beta distributions considered for the propensity score. Second, discarding observations with a propensity score outside the interval [0.1, 0.9] produces variances that are extremely close to those produced with optimally chosen cutoff values. In particular, the ratio of the asymptotic variance when using a cutoff value of 0.1 to the variance based on the optimal cutoff value is never larger than 1.04 over the range of distributions we investigate. In contrast, using the smaller fixed cutoff value of 0.01 can lead to considerably larger variances than using the optimal cutoff value.
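The kind of calculation summarized in Table 1 can be approximated with a short numerical sketch. The code below is illustrative only and does not reproduce the table: it assumes the homoskedastic form of the trimmed variance, uses a simple midpoint rule for the integrals and a grid search for the optimal cutoff, and the function names `beta_pdf`, `trimmed_variance` and `optimal_alpha` are hypothetical.

```python
import math


def beta_pdf(e, b, g):
    """Density of a Beta(b, g) distribution at e, for 0 < e < 1."""
    const = math.gamma(b + g) / (math.gamma(b) * math.gamma(g))
    return const * e ** (b - 1.0) * (1.0 - e) ** (g - 1.0)


def trimmed_variance(alpha, b, g, steps=2000):
    """Variance after dropping scores outside [alpha, 1 - alpha], up to sigma^2.

    Computes E[(1/e + 1/(1-e)) 1{alpha <= e <= 1-alpha}] / Pr(alpha <= e <= 1-alpha)^2
    with a midpoint rule on `steps` subintervals.
    """
    width = (1.0 - 2.0 * alpha) / steps
    numerator = mass = 0.0
    for i in range(steps):
        e = alpha + (i + 0.5) * width
        f = beta_pdf(e, b, g) * width
        numerator += f / (e * (1.0 - e))
        mass += f
    return numerator / (mass * mass)


def optimal_alpha(b, g, grid=500):
    """Grid search for the cutoff in (0, 1/2) minimizing the trimmed variance."""
    candidates = [i / grid for i in range(1, grid // 2)]
    return min(candidates, key=lambda a: trimmed_variance(a, b, g))


# Uniform propensity score, i.e. Beta(1, 1).
alpha_star = optimal_alpha(1.0, 1.0)
ratio_010 = trimmed_variance(0.10, 1.0, 1.0) / trimmed_variance(alpha_star, 1.0, 1.0)
ratio_001 = trimmed_variance(0.01, 1.0, 1.0) / trimmed_variance(alpha_star, 1.0, 1.0)
```

For the uniform case the optimal cutoff found this way is roughly 0.10, so the fixed cutoff 0.1 loses almost nothing, while the cutoff 0.01 yields a noticeably larger variance.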

Estimands
In this section, we discuss inference for the estimands introduced in the previous sections. Two issues arise with respect to the tractability of forming estimators for some of these estimands. First, as we have noted at the end of Section 5.1, there is an important difference in the efficiency bounds for population average treatment effects ($\tau_P$) and sample average treatment effects ($\tau_S$) that complicates the formation of estimators for the former estimand. In particular, the efficiency bound for $\tau_P$ requires one to evaluate $\tau(X)$ over its population distribution, which implies that one must know this distribution or be able to estimate it nonparametrically in order to determine this bound. 14 In general, the distribution of $\tau(X)$ is not known, and nonparametrically estimating it with any precision is complicated precisely because of the limited overlap problem. Such complications do not arise if we focus on sample average treatment effects, $\tau_S$. Accordingly, we restrict our attention to characterizing estimators for the latter class of OSATE and OWATE estimands.
A second issue arises in the case of making inferences concerning Optimal Subpopulation Average Treatment Effects (OSATE). In particular, for this class of estimands, the optimal set, $A^*$, is generally unknown and must be estimated. Moreover, the efficiency bound in Theorem 5.1 implies that, in some cases, the average effect over any subset of the covariate space defined in terms of the propensity score cannot be estimated at root-N rate. For example, suppose that the set of interest is $A = \{x \in X \mid e(x) \le p\}$. Note that this corresponds to using the weight function $\lambda(e(x)) = 1\{e(x) \le p\}$ in defining the associated estimand. But the efficiency bound for this estimand given in (5.10) is a function of $E[(\frac{\partial}{\partial e} \lambda(e(X)))^2]$, which diverges when $\lambda(e(x))$ is an indicator function, so that the variance $V_{P,\lambda}$ is unbounded and cannot be attained. In contrast, such problems do not plague estimation if we focus on the subset $\hat{A}$, even though, as noted in the discussion of our simple motivating example in Section 2, the sets $A^*$ and $\hat{A}$ can be quite different. Accordingly, we also restrict our attention to the subset $\hat{A}$ when characterizing estimators for OSATE estimands.
14 The same problems arise in the estimation of $\tau_{P,\omega}$.

Nonparametric Estimates for Regression Functions
The proposed estimators for the average treatment effects rely on preliminary estimates of the propensity score and the two conditional regression functions. For these conditional means, various estimators have been proposed (Hahn, 1998;Hirano, Imbens and Ridder, 2003;Imbens, Newey and Ridder, 2006;Chen, Hong, and Tarozzi, 2005). None of them exactly fits the setting we consider here. Specifically, the previously developed estimators do not allow for estimation of the set over which the treatment effect is averaged. It is possible to modify these estimators to allow for this complication, although doing so would not be trivial. However, it is easier to use the generalized partial mean framework developed by Newey (1994) and extended by .
For simplicity, we use the same type of estimator for both conditional means, namely kernel estimators, although it would be possible to use series estimators for the propensity score as in Hirano, Imbens and Ridder (2003). Let $K : [-1, 1]^L \to R$ be the kernel and b > 0 be the bandwidth. Then the standard kernel estimators for the propensity score, the regression and the variance functions are given by
$$\tilde e(x) = \frac{\sum_{i=1}^N W_i \cdot K\left(\frac{X_i - x}{b}\right)}{\sum_{i=1}^N K\left(\frac{X_i - x}{b}\right)}, \qquad \tilde\mu_w(x) = \frac{\sum_{i: W_i = w} Y_i \cdot K\left(\frac{X_i - x}{b}\right)}{\sum_{i: W_i = w} K\left(\frac{X_i - x}{b}\right)}, \qquad \tilde\sigma_w^2(x) = \frac{\sum_{i: W_i = w} (Y_i - \tilde\mu_w(x))^2 \cdot K\left(\frac{X_i - x}{b}\right)}{\sum_{i: W_i = w} K\left(\frac{X_i - x}{b}\right)},$$
respectively, for w = 0, 1. To deal with technical boundary issues and to avoid trimming, it is useful to modify this estimator close to the boundary of the covariate space, using the boundary correction suggested by . The key idea behind this boundary modification is to modify the standard estimator for values of x that are close to the boundary, relative to the bandwidth, by using a Taylor series expansion around the nearest point that is sufficiently far away from the boundary. Details for this modification are presented in the Appendix. The resulting estimators will be denoted by $\hat e_{m,b}(x)$ and $\hat\mu_{w,m,b}(x)$, where m stands for the degree of the Taylor series expansion.
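The standard kernel estimators (before any boundary correction) can be sketched as follows for a single covariate. This is a simplified illustration, not the estimator used in the paper: it omits the boundary correction, uses an Epanechnikov kernel chosen here only for concreteness, and the names `epanechnikov`, `kernel_regression` and `kernel_fits` are hypothetical.

```python
def epanechnikov(u):
    """Second-order kernel with support [-1, 1] (an illustrative choice)."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kernel_regression(x, xs, ys, b):
    """Kernel-weighted average of ys at the point x with bandwidth b.

    Returns None when no observation receives positive weight.
    """
    weights = [epanechnikov((x_i - x) / b) for x_i in xs]
    total = sum(weights)
    if total == 0.0:
        return None
    return sum(w * y for w, y in zip(weights, ys)) / total


def kernel_fits(x, X, W, Y, b):
    """Return (propensity score, control mean, treated mean) estimates at x."""
    e_tilde = kernel_regression(x, X, W, b)
    x0 = [x_i for x_i, w in zip(X, W) if w == 0]
    y0 = [y_i for y_i, w in zip(Y, W) if w == 0]
    x1 = [x_i for x_i, w in zip(X, W) if w == 1]
    y1 = [y_i for y_i, w in zip(Y, W) if w == 1]
    return e_tilde, kernel_regression(x, x0, y0, b), kernel_regression(x, x1, y1, b)


# Toy data (made up): alternating treatment, constant outcome within each arm.
X = [i / 100 for i in range(101)]
W = [i % 2 for i in range(101)]
Y = [2.0 if w == 1 else 1.0 for w in W]
e_hat, mu0_hat, mu1_hat = kernel_fits(0.5, X, W, Y, b=0.1)
```

The `None` return value makes explicit the situation discussed in the text: at a covariate value with no nearby treated (or control) units the conditional mean cannot be estimated, while the propensity score estimate remains well defined.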

Assumptions
Here we list three technical assumptions that will be used to control the convergence rate of the nonparametric estimators. These are closely related to the assumptions used in . The first assumption restricts the kernel.
(iii) K is r times continuously differentiable, with the r-th derivative bounded on the interior of U, (iv) K is a kernel of order s, so that $\int_U K(u)\,du = 1$ and $\int_U u^\lambda K(u)\,du = 0$ for all λ such that 0 < |λ| < s, for some s ≥ 1.
The second assumption requires sufficient smoothness of the distribution of (Y, W, X).
Assumption 6.2 (Distribution) (i) $(Y_1, W_1, X_1), (Y_2, W_2, X_2), \ldots$, are independent and identically distributed, (ii) the support of $X_i$ is $X \subset R^L$, $X = \prod_{l=1}^L [\underline{x}_l, \overline{x}_l]$, $\underline{x}_l < \overline{x}_l$ for all l = 1, . . . , L, (iii) $X_i$ is a random vector with probability density function $f_X(x)$, which is q times continuously differentiable on the interior of X, with the q-th derivative bounded, $\mu_w(x)$ is q times continuously differentiable on the interior of X with the q-th derivative bounded for w = 0, 1, $\sigma_w^2(x)$ is q times continuously differentiable on the interior of X with the q-th derivative bounded for w = 0, 1, (vii) $e(x) = E[W \mid X = x]$ is q times continuously differentiable on the interior of X with the q-th derivative bounded, (viii) $e(X)$ has a continuous distribution on [0, 1] with the probability density function bounded and continuously differentiable.
The third assumption puts restrictions on the bandwidth and the smoothness of the kernel and the conditional mean functions.

The Optimally Selected Average Treatment Effect
Define, for a given set A ⊂ X, the estimator
$$\hat\tau_A = \frac{1}{N_A} \sum_{i: X_i \in A} \left(\hat\mu_1(X_i) - \hat\mu_0(X_i)\right),$$
where $N_A$ is the number of units with $X_i \in A$. In this expression we drop the indexing of the kernel estimators $\hat\mu_1(x)$ and $\hat\mu_0(x)$ on the bandwidth $b_N$ and the degree of the Taylor series expansion in the boundary correction. The latter will be assumed to be equal to s, the degree of the kernel, and subject to conditions given in Assumption 6.3. We first characterize some preliminary results in order to define the estimator for the optimal set $A^*_H$. This involves first estimating the propensity score, and then estimating the optimal cutoff value α. First, define By the support and smoothness conditions, $\underline{\hat\gamma}$ and $\overline{\hat\gamma}$ exist in large enough samples. For Γ = [0, ∞), define the function $\hat r : \Gamma \to R$: for $\gamma > \underline{\hat\gamma}$, and 0 for $0 \le \gamma \le \underline{\hat\gamma}$. We are interested in the maximizer of $\hat r(\gamma)$. To deal with nonuniqueness, we define the maximizer as $\hat\Gamma = \{\gamma \in \Gamma \mid \hat r(\gamma) = \sup_{\gamma \in \Gamma} \hat r(\gamma)\}$, and $\hat\gamma = \sup_{\gamma \in \hat\Gamma} \gamma$.
A key step in the proof is that which allows us to deal with, or rather avoid dealing with, the uncertainty in the estimated set $\hat{A}$. Next we consider estimation of the OWATE, based on the weight function $\omega^*_H(x) = e(x) \cdot (1 - e(x))$. Define, for all functions $\omega : X \to R$, the estimator
$$\hat\tau_\omega = \sum_{i=1}^N \omega(X_i) \cdot \left(\hat\mu_1(X_i) - \hat\mu_0(X_i)\right) \Big/ \sum_{i=1}^N \omega(X_i).$$
The estimator we actually consider is $\hat\tau_{\hat\omega}$, where $\hat\omega(x) = \hat e(x) \cdot (1 - \hat e(x))$. We present two results for this estimator. First, we consider the normalized difference between $\hat\tau_{\hat\omega}$ and $\tau_{S,\omega}$. Second, we consider the normalized difference between $\hat\tau_{\hat\omega}$ and $\tau_{P,\omega^*_H}$. Theorem 6.2 (OWATE) Suppose that Assumptions 3.1-3.2 and 6.1-6.3 hold. Then .
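Given fitted values of the propensity score and the two regression functions at the sample points, a weighted estimator of this kind takes only a few lines. The sketch below assumes the regression-imputation form (a weighted average of the differences between the two fitted regression functions), which is an assumption made for illustration; the names `weighted_ate` and `owate` are hypothetical, and the fitted values are taken as given.

```python
def weighted_ate(mu1_hat, mu0_hat, weights):
    """Weighted average of fitted treatment effects mu1_hat[i] - mu0_hat[i]."""
    total = sum(weights)
    return sum(w * (m1 - m0) for w, m1, m0 in zip(weights, mu1_hat, mu0_hat)) / total


def owate(e_hat, mu1_hat, mu0_hat):
    """Weighted estimate with estimated weights e_hat * (1 - e_hat)."""
    weights = [e * (1.0 - e) for e in e_hat]
    return weighted_ate(mu1_hat, mu0_hat, weights)


# Two-unit toy example (made-up numbers): the unit with a score of 0.1
# receives weight 0.09, against 0.25 for the unit with a score of 0.5.
estimate = owate([0.5, 0.1], [3.0, 10.0], [1.0, 0.0])
```

Because the weights shrink smoothly toward zero as the estimated score approaches zero or one, no discrete decision about which units to drop is needed.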

Estimating the Asymptotic Variance
In this subsection, we propose consistent estimators for the asymptotic variances. Define, for all sets A, the following estimators Theorem 6.4 Suppose that Assumptions 6.1-6.3 hold. Then Proof: See Appendix. Next, let Theorem 6.5 Suppose that Assumptions 6.1-6.3 hold. Then .

Some Illustrations Based on Real Data
In this section we apply the methods developed in this paper to data from a labor market program. The data set we use was originally constructed by LaLonde (1986) and subsequently used by, among others, Heckman and Hotz (1989), Dehejia and Wahba (1999) and Smith and Todd (2005). The particular sample we use here is the one used by Dehejia and Wahba (1999). The treatment of interest is a job training program. The trainees are drawn from an experimental evaluation of this program. The control group is a sample drawn from the Panel Study of Income Dynamics (PSID). The control and treatment groups are very unbalanced. Given that the standard deviation is 13.88, this is a very large difference of 1.26 standard deviations, suggesting that simple covariance adjustments are unlikely to lead to credible inferences. For these data, we compute and compare nine different estimands. The first is the sample average treatment effect, $\tau_S$ (ATE). We then examine average treatment effects derived over three subsamples. In the first, we drop all observations with an estimated propensity score outside of the interval [0.01, 0.99] (ATE 0.01). In the second, we drop all observations with an estimated propensity score outside of the interval [0.10, 0.90] (ATE 0.10). Finally, we calculate the estimate of the OSATE with optimal cutoff point, α, using the results in Corollary 5.1. The estimated optimal cutoff point is $\hat\alpha = 0.0660$. For these calculations, we estimate the propensity score using a logistic model with all nine covariates displayed in Table 2 entered linearly. We also estimate the optimally weighted average treatment effect (OWATE), with weights $\hat e(x) \cdot (1 - \hat e(x))$. The final four estimates we consider are all versions of the average treatment effect for the treated. We first estimate the conventional average effect for the treated (ATT). 
We then form ATT estimates similar to those in Dehejia and Wahba (1999) by dropping observations which have estimated propensity scores greater than 0.99 (ATT 0.01) and 0.90 (ATT 0.10), respectively. Finally, we form estimates of the optimal subpopulation average treatment effect on the treated (OSATT) by dropping those observations with an estimated propensity score greater than the optimal cutoff point of 0.73. For each of these cases, we display, in Table 3, estimates of the associated estimands and their asymptotic standard errors. Note that the standard errors are calculated separately for each estimator, implying that the implicit estimates of the conditional variance $\sigma^2$ are different. Hence, the optimal estimators need not have smaller estimated asymptotic variances than the suboptimal ones.
For both the average treatment effect and the average effect for the treated estimands, it makes a substantial difference to the standard errors of the estimators if we drop observations with propensity scores close to their extreme values. For the average treatment effects, the gain in precision is huge. This is not surprising. There are many control observations whose covariate values are so far from those for the treated that it makes little sense to attempt to estimate the treatment effect for those covariate values. Even for the average effect for the treated however, there is a substantial gain to discarding observations with outlying values for the propensity score. This reduces the asymptotic standard error from 2.58 (with no sample selection) to 1.82 (for the fixed cutoff point of 0.10).
The number of observations that should be discarded according to the OSATE is substantial. We report the number of observations dropped for this estimand in Table 4. Out of the original 2675 observations (2490 controls and 185 treated), only 312 are used in estimation (183 controls and 129 treated). We also report in Table 4 the number of observations dropped in the various categories for this criterion and for the suboptimal criteria based on the fixed cutoff points 0.01 (ATE 0.01 ) and 0.10 (ATE 0.10 ), respectively, in the subsequent two panels of this table.
While not the primary focus of our analysis, we also note that the estimates of the various estimands, themselves, vary substantially. This is not surprising, given that the definitions of the underlying estimands are varying. They even differ in sign. At the same time, we make two observations about these estimates. First, the standard errors relative to the estimates tend to be large for all of the alternative estimates, implying that the inferences drawn from them would not differ across the estimates. Second, the OSATE, OWATE and OSATT estimates are all negative and tend to be closer in magnitude to one another compared to the other estimators. One should not draw strong conclusions from either of these observations, given that the theoretical results established in this paper are focused primarily on the precision of alternative estimands.

Conclusion
Estimation of average treatment effects under unconfoundedness or selection on observables is often hampered by lack of overlap in the covariate distributions. This lack of overlap can lead to imprecise estimates and can make commonly used estimators sensitive to the choice of specification. In such cases, researchers have often used informal methods for trimming the sample. In this paper, we develop a systematic approach to addressing such lack of overlap in which we sacrifice some external validity in exchange for improved internal validity. We characterize optimal subsamples where the average treatment effect can be estimated most precisely, as well as optimally weighted average treatment effects. Under some simplifying assumptions, the optimal rules depend solely on the propensity score. We find that the precision for average treatment effects for the optimally selected samples can be much higher than for the overall sample. In addition, we find that a simple ad hoc selection rule based on discarding all units with an estimated propensity score outside the interval [0.1, 0.9] can capture most of the precision gains from selecting the sample optimally for a wide range of distributions.

Appendix A: The Kernel Estimator with Boundary Correction
In this appendix we present the details of the boundary correction we use for the kernel estimator. This boundary correction was developed by . We refer to this paper for more details on the estimator. Let $g(x) = E[Y \mid X = x]$ be the regression function of interest, and let $f_X(x)$ be the probability density function of X, with the dimension of X equal to L. Then we can write $g(x) = h_1(x)/h_2(x)$, where $h_1(x) = g(x) \cdot f_X(x)$ and $h_2(x) = f_X(x)$. Let ∂X be the boundary of X, and let $X_I$ be the "internal" region, more than $b_N$ away from the boundary in all directions, $X_I = \{x \in X \mid \min_{l=1,\ldots,L} \inf_{y \in \partial X} |y_l - x_l| \ge b_N\}$. Then let $r_b(x)$ be the projection of x onto the set $X_I$: $r_b(x) = \arg\min_{y \in X_I} \|x - y\|$. Let λ denote an L-vector of nonnegative integers, with $|\lambda| = \sum_{l=1}^L \lambda_l$ and $\lambda! = \prod_{l=1}^L \lambda_l!$. Define, for a given m − 1 times differentiable function $g : R^L \to R$, a point $y \in R^L$ and an integer m, the (m − 1)-th order polynomial function $t : R^L \to R$ based on the Taylor series expansion of order m − 1 of g(·) around the point y:
$$t(x; g, y, m) = \sum_{0 \le |\lambda| \le m - 1} \frac{1}{\lambda!} \cdot \frac{\partial^{|\lambda|} g}{\partial x^\lambda}(y) \cdot (x - y)^\lambda.$$
Now we define the boundary corrected estimators for $h_k(x)$: with $\tilde h_k(x)$ the standard kernel estimator of $h_k(x)$, let $\hat h_{k,m,b}(x) = \tilde h_k(x)$ if $x \in X_I$, and $\hat h_{k,m,b}(x) = t(x; \tilde h_k, r_b(x), m)$ otherwise. Finally the boundary corrected estimator for g(x) is $\hat g_{m,b}(x) = \hat h_{1,m,b}(x)/\hat h_{2,m,b}(x)$.

For the special case of λ(e(x)) = e(x) (a case considered by Hahn, 1998) the semiparametric efficiency bound is, For the special case of λ(e(x)) = e(x)(1 − e(x)) the semiparametric efficiency bound is, for nonnegative functions ω(·). For estimands of this type consider the criterion that encompasses Theorems 5.2 and 5.3: .
We are interested in the choice of set A that minimizes (B.2) among the set of all closed subsets of X. The following theorem provides the characterization.
Theorem B.1 The set $A^*$ that minimizes (B.2) is equal to X if and otherwise, where γ is a positive solution to Proof: Define $k(x) = \sigma_1^2(x)/e(x) + \sigma_0^2(x)/(1 - e(x))$, $\tilde f_X(x) = f_X(x) \cdot \omega(x) / \int_z f_X(z) \cdot \omega(z)\,dz$, and $\tilde\omega(x) = \omega(x) / \int_z f_X(z) \cdot \omega(z)\,dz$, so that k(x) is bounded, bounded away from zero, and continuously differentiable on X. Let $\tilde X$ be a random vector with probability density function $\tilde f_X(x)$ on X, and let $\tilde q(A) = \Pr(\tilde X \in A)$. 15 Then and similarly, Because multiplying ω(x) by a constant does not change the value of the objective function in (B.2), we have
$$V_{S,\omega}(A) = \frac{1}{\tilde q(A)} \cdot E\left[\tilde\omega(\tilde X) \cdot k(\tilde X) \mid \tilde X \in A\right]. \quad (B.3)$$
Thus the question now concerns the set A that minimizes (B.3).
15 Note that $\int \tilde f_X(x)\,dx = 1$ by construction, so that $\tilde f_X(x)$ is a valid probability density function.
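We do the remainder of the proof of Theorem B.1 in two stages. First, suppose there is a closed set A such that $x \in \mathrm{int}(A)$, $z \notin A$, and $\tilde\omega(z) \cdot k(z) < \tilde\omega(x) \cdot k(x)$. Then we will construct a closed set $\tilde A$ such that $V_{S,\omega}(\tilde A) < V_{S,\omega}(A)$. This implies that the optimal set has the form $A_\gamma = \{x \in X \mid \tilde\omega(x) \cdot k(x) \le \gamma\}$ for some γ. The second step consists of deriving the optimal value for γ. For the first step define a ball around x with volume ν,
$$B_\nu(x) = \left\{ z \in R^L \,\Big|\, \|z - x\| \le \frac{\nu^{1/L} \cdot \Gamma(1 + L/2)^{1/L}}{\pi^{1/2}} \right\},$$
where $\Gamma(a) = \int_0^\infty x^{a-1} \exp(-x)\,dx$ is the gamma function. Let $A^c$ be the complement of A in X, and for sets A and B let $A/B = A \cap B^c$. Let $\nu_0$ be small enough so that for $\nu \le \nu_0$ we have . Also, because the volume of the sets $B_{\nu/\tilde f_X(x)}(x)$ and $B_{\nu/\tilde f_X(z)}(z)$ is $\nu/\tilde f_X(x)$ and $\nu/\tilde f_X(z)$ respectively, it follows that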

Now we construct the set
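The objective function for this set is so that the difference relative to the value of the objective function for the original set A is negative for small enough ν, which finishes the first part of the proof. The question now is to determine the optimal value for γ given that the optimal set has the form $A_\gamma = \{x \in X \mid \tilde\omega(x) \cdot k(x) \le \gamma\}$. Let $Y = \tilde\omega(\tilde X) \cdot k(\tilde X)$, with probability density function $f_Y(y)$. Then
$$V_{S,\omega}(A_\gamma) = \frac{\int_0^\gamma y \cdot f_Y(y)\,dy}{\left(\int_0^\gamma f_Y(y)\,dy\right)^2}.$$
Denote the minimum and maximum value of the function k(x) over the set X by $\underline{k}$ and $\overline{k}$. By assumption $\underline{k} > 0$ and $\overline{k} < \infty$. Then $\lim_{\gamma \downarrow \underline{k}} V_{S,\omega}(A_\gamma) = \infty$. Because $V_{S,\omega}(A_{\overline{k}}) = V_{S,\omega}(X)$, which is finite by assumption, and because $V_{S,\omega}(A_\gamma)$ is continuous as a function of γ, it follows that either $V_{S,\omega}(A_\gamma)$ is minimized at $\gamma = \overline{k}$, or there is an interior minimum where the first order conditions are satisfied. Let γ denote the optimum. The first derivative with respect to γ is
$$\frac{f_Y(\gamma) \cdot \left(\gamma \int_0^\gamma f_Y(y)\,dy - 2 \int_0^\gamma y \cdot f_Y(y)\,dy\right)}{\left(\int_0^\gamma f_Y(y)\,dy\right)^3},$$
so that an interior optimum satisfies $\gamma = 2 \cdot E[\tilde\omega(\tilde X) \cdot k(\tilde X) \mid \tilde\omega(\tilde X) \cdot k(\tilde X) < \gamma]$. Because $\tilde\omega(x) = \omega(x) / \int_z \omega(z) \cdot f_X(z)\,dz$, this implies $\gamma' = 2 \cdot E[\omega(\tilde X) \cdot k(\tilde X) \mid \omega(\tilde X) \cdot k(\tilde X) < \gamma']$, for $\gamma' = \gamma \cdot \int \omega(x) \cdot f_X(x)\,dx$. This in turn implies
$$\gamma' = 2 \cdot \frac{E\left[\omega(\tilde X) \cdot k(\tilde X) \cdot 1\{\omega(\tilde X) \cdot k(\tilde X) < \gamma'\}\right]}{\Pr\left(\omega(\tilde X) \cdot k(\tilde X) < \gamma'\right)}.$$
Substituting back $k(x) = \sigma_1^2(x)/e(x) + \sigma_0^2(x)/(1 - e(x))$ this implies Proof of Theorem 5.2: Substituting ω(x) = 1 into Theorem B.1 implies that the optimal set $A^*$ is equal to X if , and otherwise, where γ is a positive solution to
$$\gamma = 2 \cdot E\left[\frac{\sigma_1^2(X)}{e(X)} + \frac{\sigma_0^2(X)}{1 - e(X)} \,\Big|\, \frac{\sigma_1^2(X)}{e(X)} + \frac{\sigma_0^2(X)}{1 - e(X)} < \gamma\right].$$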