Discreteness causes bias in percentage-based comparisons: A case study from educational testing

Darrick Yee and Andrew Ho
Harvard Graduate School of Education
January 22, 2015

Discretizing continuous distributions can lead to bias in parameter estimates. We present a case study from educational testing that illustrates dramatic consequences of discreteness when discretizing partitions differ across distributions. The percentage of test-takers who score above a certain cutoff score (percent above cutoff, or “PAC”) often describes overall performance on a test. Year-over-year changes in PAC, or ΔPAC, have gained prominence under recent U.S. education policies, with public schools facing sanctions if they fail to meet targets. In this paper, we describe how test score distributions act as continuous distributions that are discretized inconsistently over time. We show that this can propagate considerable bias to PAC trends, where positive ΔPACs appear negative, and vice versa, for a substantial number of actual tests. A simple model shows that this bias applies to any comparison of PAC statistics in which values for one distribution are discretized differently from values for the other.

Keywords: Education, Estimation, Testing

Darrick S. Yee is a doctoral candidate, Harvard Graduate School of Education, Cambridge, MA 02138 (e-mail: dsy783@mail.harvard.edu). Andrew D. Ho is Professor of Education, 455 Gutman Library, 6 Appian Way, Cambridge, MA 02138 (e-mail: andrew_ho@harvard.edu). This research was supported in part by a grant from the Institute of Education Sciences (R305D110018). The authors thank Sean Reardon, Judith Singer, and Richard Murnane for their helpful feedback. We claim responsibility for any errors.

1. INTRODUCTION

Discretization of continuous distributions is ubiquitous in practice. Previous studies have explored discretization in a variety of forms, including rounding, where data are discretized evenly to integers or decimal places; heaping, when some data are discretized coarsely and others finely; and interval censoring, when data are known to exist within some interval and are assigned the value of the interval endpoint (Heitjan, 1989; Heitjan and Rubin, 1991). The terms binning, grouping, and coarsening are less specific and refer broadly to a process that sacrifices precision for simplicity by assuming similar observations are equal in value. Generally, we desire that estimates derived from discretized data will recover parameters of the continuous data.

The consequences of discretization depend upon the nature of the discretization and the target parameter. Sheppard’s correction (1897) is a well-known adjustment for bias in moments of an evenly discretized normal distribution, with implications that extend to parameters from least squares regression models (e.g., Dempster and Rubin, 1983; Schneeweiss, Komlos, and Ahmad, 2010). Horton, Lipsitz, and Parzen (2003) describe how rounding to prevent implausible values in multiple imputation procedures can impart bias to results. In this paper, we show that discreteness can cause considerable bias when we compare two or more distributions by a particular summary statistic: the cumulative proportion, or its complement, the “percent above cutoff” (PAC). The PAC-based comparison is a staple of descriptive reporting. The poverty rate, for example, is a PAC statistic that uses an income-based cutoff, while the obesity rate is a PAC statistic with a cutoff based on the body-mass index.
Comparisons of poverty and obesity rates – for example, across regions, subpopulations, or time periods – rely on the assumption that the cutoff is the same for each comparison group. We focus on a case study in education where discretization leads to a consequential violation of this assumption. In this context, the PAC often represents a passing rate, or a percentage of students considered “Proficient” in a particular academic subject, such as mathematics or English. Examples include licensure and certification tests (e.g., American Institute of CPAs, 2013; National Conference of Bar Examiners, 2012), Advanced Placement exams (The College Board, 2013), and the U.S. Department of Education’s National Assessment of Educational Progress (U.S. Department of Education, n.d.). The change in PAC, or ΔPAC, is thus a measure of educational progress or improvement.

In 2002, this metric gained newfound importance with the signing of the No Child Left Behind (NCLB) Act, an ambitious piece of U.S. federal legislation that set the goal of 100% student “Proficiency” by 2014. The policy required U.S. states to administer standardized tests in multiple subjects to the vast majority of public school students; set cutoff scores on the tests such that students achieving or exceeding the cutoff would be considered “Proficient” in the tested subject; calculate percentages of Proficient students at various levels of aggregation; and increase these percentages to 100% by 2014. Schools with insufficient percentages of Proficient students faced sanctions, including possible school restructuring and closure. Although federal policies have since allowed some flexibility, in particular for the 100% goal in 2014 that no state ultimately met, percentages of Proficient students remain a central metric for reporting and incentivizing educational progress (U.S. Department of Education, 2012).

We model and address a significant source of bias associated with the ΔPAC metric that, to our knowledge, has not previously been described formally. Previous authors have observed that the relationship between ΔPAC and changes in average test scores is nonlinear and determined by the shape of the distributions and the magnitude of the initial and final percentages (Holland, 2002; Ho, 2008). This relationship is smooth and generally predictable. In contrast, we examine a source of bias attributable to unpredictable changes in discretization over time. We model this process by varying the “discretizing partitions” applied to each comparison group. We show that year-over-year changes in these partitions impart severe volatility to the ΔPAC metric that threatens trend interpretations, leads to sign reversals, and overshadows conventional sampling variability for most large-scale (district- and state-level) applications.

2. TEST SCORE DISCRETIZATION AND LINKING

First, we review the two steps in educational test score construction that ultimately impart this unpredictable bias to PAC-based trends: discretization and linking. Later, we will describe a general model and show how it may arise in other situations.

2.1 Discretization

Although classroom intuition holds that test scores are simple counts of correctly answered questions (“number-correct scores”), large-scale testing programs generally convert these counts to “scale scores,” which span ranges such as the 200-800 SAT scale and the 1-36 ACT scale used in many U.S. college admissions decisions.
The state of the art for scale score construction is Item Response Theory (IRT; Lord, 1980; Yen and Fitzpatrick, 2006), a modeling framework that allows test “items” (the formal term for test questions) to differ across examinees and over time while still providing comparable scores on a continuous latent scale. For the purpose of this presentation, IRT is only important in that it reflects a commonsense intuition about a continuous score scale for academic proficiency – one on which reported scores may be restricted to integers, but that nonetheless allows the theoretical possibility of intermediate scores and, importantly, intermediate cutoff scores. In principle, this is analogous to situations in which, for example, weights may be reported in kilograms or heights in inches, but finer-grained differences exist between individuals who are reported as having the same discrete weight or height. Test score reporting similarly involves, in part, the discretization of a theoretically continuous distribution of academic proficiency among test-takers.

Longstanding tenets of score reporting for individual test-takers maintain that no more than 30 to 60 score points should distinguish among them (Flanagan, 1951; Kolen and Brennan, 2004). These rules of thumb are based on the standard errors of individual scores and were developed to discourage distinctions among individuals that the precision of scores could not support. Adhering to this rule of thumb can result in the “binning” of multiple number-correct scores to the same scale score. In practice, then, scale scores are typically number-correct scores that are transformed, binned, and rounded to integers for individual score reports. This discretization is one of the two elements that imparts unpredictable bias to PAC statistics.

2.2 Linking

Operational testing programs generally require replacement of test items to discourage cheating and sensitization. The resulting tests are not identical and will naturally differ in difficulty to a degree that cannot (and need not) be eliminated (Holland and Dorans, 2006). Linking functions transform scores from one test to be comparable to scores from the other, relying on common items, common examinees, or randomly equivalent groups across tests (Kolen and Brennan, 2004). If items are more difficult, the linking function will map the same number-correct scores to higher scale scores; if items are less difficult, the linking function will map the same number-correct scores to lower scale scores. The key observation that supports the remainder of this paper is that, unless the items appearing on two tests are identical, the linking function from number-correct scores to scale scores will be different for each test. This difference in linking functions effectively produces a different discretization of continuous scores for each test. In the next section, we model this changing discretization formally and illustrate the consequences for PAC-based comparisons.
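To make this mechanism concrete, the sketch below (Python) uses two invented linking functions for a hypothetical 50-item test; none of the numbers correspond to a real testing program. Because the year-1 form is assumed to be slightly harder, the two forms produce different sets of attainable scale scores, and the lowest reportable score at or above a fixed Proficient cutoff differs across years.

    import numpy as np

    # Hypothetical linking functions for two forms of a 50-item test.
    # Form B is assumed to be slightly harder, so the same number-correct
    # score maps to a slightly higher scale score than on Form A.
    number_correct = np.arange(0, 51)
    scale_form_a = np.round(300 + 4.0 * number_correct)   # year-0 form (invented)
    scale_form_b = np.round(305 + 4.1 * number_correct)   # year-1 form (invented, harder)

    cutoff = 400  # common "Proficient" scale score (hypothetical)

    # Lowest attainable scale score at or above the cutoff on each form.
    effective_a = scale_form_a[scale_form_a >= cutoff].min()
    effective_b = scale_form_b[scale_form_b >= cutoff].min()

    # The two forms discretize the scale differently, so these effective
    # cutoffs need not coincide (here, 400 versus 403).
    print(effective_a, effective_b)

This is only a toy illustration of the point above: identical cutoff policies can imply different effective cutoffs once each form's attainable scores are taken into account.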
3. MODEL

We start by examining student proficiencies in a baseline year 0 and a comparison year 1. We assume latent proficiencies in year t can be represented by continuous (real-valued) scores, with probability density function f_t(·) and cumulative distribution function F_t(·). A student is considered “Proficient” or “above the cutoff” if her continuous score exceeds a cutoff score, x*, which is common to both years. The quantity 1 - F_t(x*) is then the proportion (or, trivially, the percentage/100) of students who are Proficient in year t. We call this the “percentage above cutoff in year t,” or PAC_t.

We model discretization as a two-step process. First, the continuous scale is partitioned into a finite number of intervals, or “cells.” Second, a single “discrete score” is assigned to each cell. For example, in order to discretize a continuous scale ranging from 0 to 3, one might partition the interval [0, 3] into four cells: [0, 0.5), [0.5, 1.5), [1.5, 2.5), [2.5, 3]. Assigning the discrete scores 0, 1, 2, and 3 to the first through fourth cells, respectively, would produce a discretization equivalent to rounding to the nearest integer. We will refer to this as “integer-rounding.” Alternatively, one might discretize the same continuous scale by partitioning it into cells [0, 1), [1, 2), [2, 3] and assigning discrete scores 0.5, 1.5, and 2.5 to the corresponding cells. This rounds to the nearest “midpoint” between integers; we will call this “midpoint-rounding.”

Note that, in any discretization, each cell has an upper and lower bound. Additionally, for any discretization of continuous scores, there exists a single discrete score that is the minimum discrete score exceeding the cutoff score, x*. We use x*_t to denote the lower bound of the cell associated with this discrete score under the discretization in year t. To expand on the previous example, suppose integer-rounding is applied in year 0, while midpoint-rounding is applied in year 1. Suppose that the Proficient cutoff score in both years is x* = 1.2. In year 0, the lowest discrete score that exceeds x* is 2. The cell associated with this discrete score is the interval [1.5, 2.5), for which the lower bound is 1.5. Therefore, x*_0 = 1.5. Similarly, in year 1, the lowest discrete score exceeding x* is 1.5, and its corresponding cell is [1, 2); thus, x*_1 = 1.

For any pair of continuous score distributions F_0(·) and F_1(·), the proportion above cutoff (PAC) in year 0 is 1 - F_0(x*), while in year 1 it is 1 - F_1(x*). The year-over-year difference in PAC, or ΔPAC, is then

    ΔPAC = [1 - F_1(x*)] - [1 - F_0(x*)] = F_0(x*) - F_1(x*).    (1)

However, if discrete scores are used to compute the PAC in each year, then we have PAC_0 = 1 - F_0(x*_0) and PAC_1 = 1 - F_1(x*_1). To see why, consider integer-rounding. In year 0, the set of discrete scores is {0, 1, 2, 3}. Since the cutoff score is x* = 1.2, students with discrete scores of 0 and 1 are considered non-Proficient, while students with a discrete score of 2 (the lowest discrete score exceeding 1.2) or higher are considered Proficient. The “discrete PAC” in year 0 is therefore the proportion of students who receive discrete scores of 2 and higher. Since continuous scores in the interval [1.5, 2.5) are all assigned a discrete score of 2, all continuous scores in this interval (and higher) are considered “above the cutoff,” while all scores below this interval are “below the cutoff.” The discrete PAC for year 0 is therefore 1 - F_0(1.5). Similarly, the discrete PAC for year 1 is 1 - F_1(1).

More formally, any discretization of continuous scores with PDF f_t(·) and CDF F_t(·) implies a probability mass function g_t(·) and corresponding CDF G_t(·), in which G_t(x*) = F_t(x*_t). The year-over-year change in PAC using discrete scores is then

    ΔPAC_d = [1 - G_1(x*)] - [1 - G_0(x*)] = [1 - F_1(x*_1)] - [1 - F_0(x*_0)] = F_0(x*_0) - F_1(x*_1).    (2)

To continue our above example, suppose that continuous scores in both years were normally distributed with mean 1.5 and variance 1; recall that scores range from 0 to 3, and integer-rounding is applied in year 0, while midpoint-rounding is applied in year 1. The continuous and discrete ΔPACs are then given by the expressions in (1) and (2), respectively:

    ΔPAC   = F_0(1.2) - F_1(1.2) = Φ(-0.3) - Φ(-0.3) = 0,
    ΔPAC_d = F_0(1.5) - F_1(1)   = Φ(0) - Φ(-0.5)    ≈ 0.19,

where Φ(·) is the standard normal CDF. In this example, continuous scores would produce a ΔPAC of 0, but the ΔPAC resulting from the use of discrete scores would suggest a 19 percentage-point increase in the percentage of Proficient students, from 50% to 69%.
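The arithmetic in this example can be checked directly. The sketch below (Python, using scipy) reproduces the continuous and discrete ΔPAC for the N(1.5, 1) example, with integer-rounding in year 0 and midpoint-rounding in year 1; the effective cutoffs 1.5 and 1.0 are the values x*_0 and x*_1 derived above.

    from scipy.stats import norm

    mu, sigma = 1.5, 1.0          # same continuous distribution in both years
    cutoff = 1.2                  # common Proficient cutoff x*

    # Continuous PAC in each year is 1 - F_t(x*), so the continuous change is zero.
    pac_continuous = 1 - norm.cdf(cutoff, mu, sigma)
    delta_pac_continuous = pac_continuous - pac_continuous            # = 0

    # Effective cutoffs implied by the two discretizations:
    # integer-rounding (year 0): lowest discrete score above 1.2 is 2, cell [1.5, 2.5)
    x_star_0 = 1.5
    # midpoint-rounding (year 1): lowest discrete score above 1.2 is 1.5, cell [1, 2)
    x_star_1 = 1.0

    # Discrete PACs: 1 - F_t(x*_t)
    pac_0 = 1 - norm.cdf(x_star_0, mu, sigma)   # 1 - Phi(0)    = 0.50
    pac_1 = 1 - norm.cdf(x_star_1, mu, sigma)   # 1 - Phi(-0.5) ≈ 0.69
    delta_pac_discrete = pac_1 - pac_0          # ≈ +0.19

    print(round(delta_pac_continuous, 3), round(pac_0, 3),
          round(pac_1, 3), round(delta_pac_discrete, 3))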
In effect, when the discretization pattern changes in this way, students in year 0 are subjected to a higher cutoff score than students in year 1. On the other hand, if the discretizations were reversed, then the opposite would hold, resulting in ΔPAC_d = F_0(1) - F_1(1.5) = Φ(-0.5) - Φ(0) ≈ -0.19. In general, this “misalignment” of discrete score partitions, where x*_0 ≠ x*_1, causes students in one year to be subjected to a different cutoff score from students in the other.

The critical issue for policy is that test administrators cannot control the discrete score partitions that are applied to continuous scores. As noted in Section 2.2, these partitions depend primarily on the properties of the items that appear on each test (and, to a lesser extent, the error involved in estimation of item parameters). Thus, in any pair of years, x*_0 ≠ x*_1 almost surely, and students in one year face a different cutoff score from students in the other. Figure 1 illustrates the consequences of this problem.

Figure 1. Observed and estimated change in the proportion of students scoring above cutoff from 2010 to 2011, Washington state Grade 8 Mathematics (N = 150,875). ΔPAC denotes the observed change in the proportion of students scoring above the cutoff score. For any cutoff score on the horizontal axis, the solid line indicates the observed change in the proportion of students scoring above that cutoff score from 2010 to 2011. Dotted vertical line indicates the “Proficient” cutoff score, for which the actual reported year-over-year change was -1.36 percentage points. Curve represents ΔPAC estimated using smoothed continuous CDFs; see Section 4 for details. Source: Washington Office of Superintendent of Public Instruction (http://www.k12.wa.us/assessment/StateTesting/TestStatistics.aspx)

In Figure 1, given any cutoff score on the horizontal axis, the plot’s value on the vertical axis indicates the observed year-over-year change (that is, the sample estimate of ΔPAC) in the proportion of students scoring above that cutoff score. For instance, at the Proficient cutoff score of 400 (indicated by the dotted vertical line), the observed change between 2010 and 2011 was -1.36 percentage points. Similarly, at a cutoff score of 401, the observed ΔPAC was -1.16 percentage points. Qualitatively, these represent small-to-moderate declines in the percentage of Proficient students. However, at a cutoff score of 402, the observed ΔPAC was 2.85 percentage points. In other words, if one were to assess year-over-year progress using a cutoff score only 2 points higher than the “official” Proficient score of 400, one would conclude that there had been a moderate increase in the percentage of Proficient students – the opposite of the conclusion when observing ΔPAC at 400 and 401. Figure 1 shows that the observed ΔPAC would swing wildly back and forth if the cutoff score were changed. Such a result would not occur if scores were measured on a continuous scale. However, this is exactly the behavior one would expect if changes in discretization caused students in different years to be subjected to different “effective” cutoff scores, x*_0 and x*_1.
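Curves like the solid line in Figure 1 can be traced directly from published frequency tables of scale scores. The sketch below (Python; the score frequencies are invented placeholders, not Washington data) computes the observed ΔPAC at each candidate cutoff score from two discrete score distributions.

    # Hypothetical scale-score frequency tables for two years
    # (score value -> number of students); real tables come from state reports.
    freq_2010 = {396: 120, 399: 150, 401: 160, 404: 170, 407: 180}
    freq_2011 = {395: 130, 398: 140, 402: 175, 404: 165, 406: 190}

    def pac(freq, cutoff):
        """Proportion of students with a reported score at or above the cutoff."""
        total = sum(freq.values())
        above = sum(n for score, n in freq.items() if score >= cutoff)
        return above / total

    # Observed change in PAC at every candidate cutoff in a window around 400.
    for c in range(394, 409):
        delta = pac(freq_2011, c) - pac(freq_2010, c)
        print(c, round(100 * delta, 2))  # percentage points

Because the attainable scores differ across the two years, the computed ΔPAC jumps abruptly whenever the cutoff crosses a score that is present in one year but not the other, which is the sawtooth behavior visible in Figure 1.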
3.1 Bias and Inconsistency of the ΔPAC Estimator

We formalize the results above to show that, for any sample of students, the observed ΔPAC is a biased and inconsistent estimator of the change in the percentage of students with continuous scores above a given cutoff score. Our parameter of interest is the change in the proportion of students whose continuous scores exceed the cutoff score, x*: ΔPAC = F_0(x*) - F_1(x*), where F_0(·) and F_1(·) are the continuous CDFs of student scores in years 0 and 1, respectively. Given a sample of n_t students in year t, the discrete-score estimator of the year-t PAC is

    PAC-hat_t = (1/n_t) Σ_i I(x_it ≥ x*_t),    (3)

where I(·) is an indicator function, x_it is the continuous score for student i in year t, and x*_t is the lower bound of the partition cell corresponding to the lowest discrete score above x*, as defined previously. Continuous student scores in year t are distributed according to F_t(·); thus, the probability that any given student will be observed as having a score above x*_t is 1 - F_t(x*_t). The numerator of PAC-hat_t is therefore binomially distributed with count parameter n_t and probability parameter 1 - F_t(x*_t). Using ΔPAC-hat = PAC-hat_1 - PAC-hat_0 as an estimator for ΔPAC produces a bias of

    B = E[ΔPAC-hat] - ΔPAC
      = {[1 - F_1(x*_1)] - [1 - F_0(x*_0)]} - {[1 - F_1(x*)] - [1 - F_0(x*)]}
      = [F_0(x*_0) - F_0(x*)] - [F_1(x*_1) - F_1(x*)]
      ≠ 0, in general, whenever x*_0 or x*_1 differs from x*.    (4)

Similarly, we can show that ΔPAC-hat is an inconsistent estimator of ΔPAC. For simplicity, we assume that sample sizes in both years are equal. Let B be the bias of ΔPAC-hat, and let σ²/n be the variance of ΔPAC-hat, where n is the sample size in each year and σ² is a constant that does not depend on n. Recall that the numerator of PAC-hat_t is binomially distributed; then, by a straightforward central-limit argument, PAC-hat_t is asymptotically normally distributed with mean 1 - F_t(x*_t). Thus, ΔPAC-hat is asymptotically normal with mean ΔPAC + B. For large n, ΔPAC-hat is approximately distributed N(ΔPAC + B, σ²/n), and therefore √n(ΔPAC-hat - ΔPAC - B)/σ has a standard normal distribution. Then, for any ε > 0,

    P(|ΔPAC-hat - ΔPAC| > ε) = P(ΔPAC-hat - ΔPAC > ε) + P(ΔPAC-hat - ΔPAC < -ε)
                             = 1 - Φ(√n(ε - B)/σ) + Φ(√n(-ε - B)/σ).

Choosing any ε ∈ (0, |B|) then produces

    lim_{n→∞} P(|ΔPAC-hat - ΔPAC| > ε) = lim_{n→∞} [1 - Φ(√n(ε - B)/σ) + Φ(√n(-ε - B)/σ)] = 1.

Thus, ΔPAC-hat is an inconsistent estimator of ΔPAC. This latter result implies that, in practice, ΔPAC estimates are likely to be incorrect even with very large sample sizes, perhaps contrary to typical intuition.

3.2 Partition Misalignment and Volatility of the ΔPAC Estimator

Next, we show that “misalignment” of discrete score partitions is primarily responsible for the volatility, or “sawtooth” pattern, observed in Figure 1. We refer to the expected value of the biased, discrete-score ΔPAC estimator as ΔPAC_d. Recall that ΔPAC_d = ΔPAC + B, where ΔPAC is the value of the continuous ΔPAC and B is the bias of the estimator. The bias given in (4) can be written

    B = [F_0(x*_0) - F_0(x*)] - [F_1(x*_1) - F_1(x*)]
      = ∫_{x*}^{x*_0} [f_0(x) - f_1(x)] dx - ∫_{x*_0}^{x*_1} f_1(x) dx.    (5)

When the same discretization is applied to continuous scores in both years, x*_0 = x*_1, and the second term of (5) evaluates to zero. The remaining term then imparts a small bias to the discrete estimator, causing the ΔPAC to be evaluated at x*_0 = x*_1 rather than at x*. We refer to this component of the bias as rounding bias, since it is commonly described as “rounding error”: its primary effect is to produce a discrete, step-function version of the continuous ΔPAC curve. On the other hand, when different discretizations are applied to scores in each year, x*_0 ≠ x*_1, and the second term of (5) is nonzero. Furthermore, because its integrand consists only of a single density function, rather than a difference between densities, its magnitude is relatively large. We refer to this component as misalignment bias, since it results from misalignment of the discrete-score partition cells in each year.
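To make the relative magnitudes concrete, the short sketch below (Python) evaluates the two terms of (5) by numerical integration, using assumed normal densities and effective cutoffs corresponding to integer-rounding in year 0 and midpoint-rounding in year 1 at a nominal cutoff of 4.8; all values are illustrative, not taken from any real test.

    from scipy.stats import norm
    from scipy.integrate import quad

    # Assumed continuous densities in years 0 and 1 (toy values).
    f0 = norm(loc=5.0, scale=1.0).pdf
    f1 = norm(loc=5.2, scale=1.0).pdf

    x_star = 4.8     # nominal cutoff x*
    x_star_0 = 4.5   # effective cutoff under year-0 (integer) discretization
    x_star_1 = 5.0   # effective cutoff under year-1 (midpoint) discretization

    # First term of (5): rounding bias, an integral of a *difference* of densities.
    rounding_term, _ = quad(lambda x: f0(x) - f1(x), x_star, x_star_0)

    # Second term of (5): misalignment component, an integral of a single density,
    # so it is typically much larger in magnitude.
    misalignment_term, _ = quad(f1, x_star_0, x_star_1)

    total_bias = rounding_term - misalignment_term   # B in equation (5)
    print(round(rounding_term, 4), round(misalignment_term, 4), round(total_bias, 4))

With these illustrative values, the misalignment term is more than an order of magnitude larger than the rounding term, consistent with the discussion above.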
We present visual examples of each bias component in Figure 2. In Figure 2, scores range from 0 to 10, and the dotted lines depict the ΔPAC values for continuous scores. In panel A, scores are normally distributed as N(5, 1) in the first year and N(5.2, 2) in the second year. Integer-rounding is applied to continuous scores in both years; thus, x*_0 = x*_1 for all cutoff scores, and misalignment bias is zero. The ΔPAC_d at each cutoff score x* is then equal to the continuous ΔPAC evaluated at the lower bound of the discrete-score partition cell associated with the lowest integer equal to or exceeding x*. For example, the lower bound of the cell associated with a discrete score of 5 is x*_0 = x*_1 = 4.5; thus, for any x* ∈ (4, 5], we have ΔPAC_d = F_0(4.5) - F_1(4.5), the continuous ΔPAC evaluated at a score of 4.5. In effect, ΔPAC_d gives the continuous ΔPAC based on scores rounded to the nearest integer.

Panel B, on the other hand, isolates the effect of misalignment bias. Continuous scores in both years have the same distribution (N(5, 1)), and therefore f_0(x) = f_1(x) for all x, so that rounding bias is zero. However, integer-rounding is applied in the first year, while midpoint-rounding is applied in the second; thus, x*_0 ≠ x*_1 for all cutoff scores, and ΔPAC_d suffers from misalignment bias. The “sawtooth” volatility observed in Figure 1 is clearly visible in panel B but absent from panel A, confirming that it results from the application of different discretizations in each year. For any cutoff score x* ∈ (4.5, 5], the next-highest discrete score is 5 in the first year and 5.5 in the second year, giving x*_0 = 4.5 and x*_1 = 5. As shown in (5), the ΔPAC_d thus has a bias of -∫_{4.5}^{5} f_1(x) dx, effectively subjecting scores in the second year to a higher cutoff score than in the first.

Figure 2. Change in percent above cutoff score (ΔPAC) under two discretization scenarios. A. Different distributions, same discretizations. B. Same distributions, different discretizations. Baseline scores in both panels are normally distributed with mean 5 and variance 1 (N(5, 1)). Comparison scores have distribution N(5.2, 2) in Panel A and N(5, 1) in Panel B. ΔPAC for each cutoff score is the proportion of comparison scores equal to or exceeding the cutoff minus the proportion of baseline scores equal to or exceeding the cutoff. Solid lines indicate ΔPAC values computed from discrete scores; dotted lines indicate values computed from continuous scores. In Panel A, both baseline and comparison scores are rounded to the nearest integer. In Panel B, baseline scores are rounded to the nearest integer, while comparison scores are rounded to the nearest “midpoint” between integers (e.g., continuous scores in the half-open interval [1, 2) are assigned a discrete score of 1.5).

In short, the use of different discretizing partitions introduces bias beyond what would be expected when discretizations are identical. Below, we estimate this bias using state test data and show that, where state testing is concerned, its magnitude is likely to be substantial in many cases.

4. ESTIMATION OF BIAS IN STATE TESTS CAUSED BY DISCRETIZATION

Our model suggests a number of methods for reducing the effect of bias in the ΔPAC estimator in large samples. We employ a relatively simple method via OLS regression, in which we estimate continuous CDF values in each year by fitting polynomials of increasing degree to discrete scores near each test’s official Proficient cutoff score, stopping when the estimated PAC at the cutoff score changes by less than 1.5 percentage points (varying the stopping criterion does not substantially change results). We then use these “smoothed” estimates to produce estimates of the continuous ΔPAC and the empirical ΔPAC bias across 107 state test score trends. We acknowledge that more sophisticated methods – such as those that incorporate weighting, explicit modeling of the error term, or improved stopping criteria, among many others – should produce more robust results than this procedure. However, this relatively simple continuizing procedure is sufficient to illustrate the bias that is our interest.
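As a rough illustration of this kind of continuizing procedure, the sketch below (Python) fits polynomials of increasing degree to the empirical CDF near the cutoff and stops when the estimated PAC at the cutoff stabilizes. The window width, maximum degree, and toy frequencies are arbitrary choices for illustration; they are not the exact specification behind the results in Section 4.2.

    import numpy as np

    def smoothed_pac(scores, counts, cutoff, window=15, tol=0.015, max_degree=6):
        """Estimate a 'continuous' PAC at the cutoff by fitting polynomials of
        increasing degree to the empirical CDF near the cutoff, stopping when
        the estimated PAC changes by less than tol (1.5 percentage points)."""
        scores = np.asarray(scores, dtype=float)
        counts = np.asarray(counts, dtype=float)
        order = np.argsort(scores)
        scores, counts = scores[order], counts[order]

        ecdf = np.cumsum(counts) / counts.sum()        # empirical CDF at each score
        near = np.abs(scores - cutoff) <= window       # scores near the cutoff

        previous = None
        for degree in range(1, max_degree + 1):
            x = scores[near] - cutoff                  # center for numerical stability
            coefs = np.polyfit(x, ecdf[near], degree)  # OLS polynomial fit
            pac = 1.0 - np.polyval(coefs, 0.0)         # smoothed PAC at the cutoff
            if previous is not None and abs(pac - previous) < tol:
                return pac
            previous = pac
        return previous

    # Toy usage with hypothetical score frequencies:
    scores = np.arange(380, 421)
    counts = np.random.default_rng(0).integers(50, 200, size=scores.size)
    print(smoothed_pac(scores, counts, cutoff=400))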
4.1. Dataset

Our data consist of frequency distributions for individual student scale scores in 2010 and 2011 for Math and Reading/English Language Arts (ELA) tests, all gathered from publicly available technical reports for 13 states, with one “state” consisting of a four-state consortium using a common testing program.1 For each testing program in our dataset, discrete scale scores and the number of students achieving each score were compiled, where available, for students in grades 3 through 8. For the analyses below, we exclude tests for which data appear incomplete, as well as tests that were not comparable from 2010 to 2011 due to large-scale changes in test design, state policy, or similar factors. Our final dataset includes 60 Math and 47 Reading/ELA test score distributions, representing approximately 22 million student scores. Sample sizes for each test range from about 9,000 to more than 300,000 students per year, with a mean slightly over 100,000; thus, the impact of sampling error on our results is likely to be negligible.

1 States included in the final dataset were Alaska, Arizona, Idaho, Maine, Nebraska, New Hampshire, New Jersey, New York, North Carolina, Oklahoma, Pennsylvania, Rhode Island, South Dakota, Texas, Vermont, and Washington.

4.2. Empirical Results

We summarize the scope of this problem, as well as the results of our smoothing algorithm, in Table 1. For each test, we compute the ΔPAC using the official state Proficient cutoff score, x*. We then find the lowest discrete score exceeding x* (across both years), compute the ΔPAC at this score, and subtract the original ΔPAC at x*. The result is the change in the ΔPAC attributable to increasing the cutoff score by one discrete score increment. We repeat this with the highest discrete score lower than x* to produce the change attributable to a -1 score increment; a sketch of this computation appears after Table 1.

Table 1. Summary of changes in year-over-year increases in percent-Proficient (ΔPAC) from 2010 to 2011 attributable to changing the Proficient cutoff score to the next-highest or next-lowest discrete score for state NCLB tests in 13 states.

                        Math (n = 60)                          Reading/English Language Arts (n = 47)
             Std. dev.   Min    Max   No. of sign changes      Std. dev.   Min    Max   No. of sign changes
  Observed      1.5     -4.6    4.0        14 (23%)               1.8     -4.7    4.3         4 (9%)
  Smoothed      0.2     -0.9    0.5         0 (0%)                0.3     -1.0    0.9         2 (4%)

NOTE: All decimal values in percentage points. Values were computed by increasing and decreasing the Proficient cutoff score by one scaled score increment, computing the observed change in percent above the new cutoff score (ΔPAC), and subtracting the observed ΔPAC at the original Proficient cutoff score from the result. A “sign change” indicates that the “incremented” ΔPAC values include at least one value whose sign is the opposite of the ΔPAC at the Proficient cutoff score. Observed values are those actually reported by states. Smoothed values were computed using polynomial smoothing of CDFs in each year. Test data were compiled from state technical reports.
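The one-increment computation described above can be sketched as follows (Python; the helper functions and example frequencies are hypothetical, and actual values come from each state's reported score distributions).

    def delta_pac(freq0, freq1, cutoff):
        """Observed change in the proportion at or above the cutoff between two years."""
        def pac(freq, c):
            total = sum(freq.values())
            return sum(n for s, n in freq.items() if s >= c) / total
        return pac(freq1, cutoff) - pac(freq0, cutoff)

    def one_increment_changes(freq0, freq1, cutoff):
        """Change in the observed delta-PAC when the cutoff is moved to the
        next-highest or next-lowest reported score in either year."""
        scores = sorted(set(freq0) | set(freq1))
        base = delta_pac(freq0, freq1, cutoff)
        up = min(s for s in scores if s > cutoff)      # lowest discrete score above the cutoff
        down = max(s for s in scores if s < cutoff)    # highest discrete score below the cutoff
        return (delta_pac(freq0, freq1, up) - base,
                delta_pac(freq0, freq1, down) - base)

    # Hypothetical usage with invented frequencies:
    freq_2010 = {396: 120, 399: 150, 401: 160, 404: 170}
    freq_2011 = {396: 110, 398: 140, 404: 180, 406: 175}
    print(one_increment_changes(freq_2010, freq_2011, cutoff=400))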
For example, in Washington state, the official Proficient cutoff score was 400, at which the observed ΔPAC was -1.36 percentage points. The lowest discrete score exceeding 400 was 401 in 2010 and 404 in 2011; therefore, we compute the observed ΔPAC at 401, which we find to be -1.16 percentage points. Incrementing the cutoff score by one discrete score (that is, increasing the cutoff score to the next-highest discrete score in either year) thus produces a change in the observed ΔPAC of 0.2 percentage points. Meanwhile, the highest discrete score below 400 in either year was 396, at which the observed ΔPAC was 2.11 percentage points. Decrementing the cutoff score by one discrete score thus increases the ΔPAC by 3.5 percentage points. For this test, the maximum change in ΔPAC attributable to a one-increment change in cutoff score is therefore 3.5 percentage points, while the minimum is 0.2 percentage points. Additionally, since the official ΔPAC was negative, while the ΔPAC at 396 was positive, the ΔPAC for this distribution changed sign. We repeat this for all tests in our dataset, using both observed and “smoothed” values, and summarize the results in Table 1.

The “Observed” row in Table 1 reports results using officially reported data and shows that the pattern in Figure 1 is not unique to Grade 8 Math in Washington state. Small changes in the cutoff score produce swings in ΔPAC of between -4.6 and 4.0 percentage points in Math and between -4.7 and 4.3 percentage points in Reading/ELA. More disturbingly, a sign change is observed in roughly 1 out of every 6 tests. For these tests, a small change in the cutoff score would produce a year-over-year “improvement” result that was qualitatively the opposite of what was officially reported.

The “Smoothed” row presents results after application of the smoothing algorithm to the CDFs in each year; a visual example for Washington state appears in Figure 1. Across all tests in our sample, the estimated ΔPACs exhibit far greater stability across cutoffs when distributions are smoothed. This is unsurprising under our model and suggests that smoothing reduces the effect of misalignment bias.

We use the empirical variance of the differences between observed and smoothed ΔPACs as an approximation of the true variance of the discrete-score bias values. In Figure 3, we present the empirical distributions of these estimated bias values. The average absolute values of the observed ΔPACs in our sample were 1.67 percentage points for Math and 2.93 for Reading/ELA. We estimate standard deviations of 1.11 and 1.28 percentage points for the Math and Reading/ELA bias values, respectively. We consider these standard deviations to be large. In short, eliminating the bias caused by discretization would likely change interpretations of the magnitude of year-over-year improvement for a large proportion of the tests in our sample.

Figure 3. Estimated bias in reported change in percent-Proficient (ΔPAC) from 2010 to 2011 attributable to changes in score discretizations for NCLB tests in 13 states. Math (N = 60); Reading/ELA (N = 47). Bias estimated by using observed (discrete) cumulative distributions for scores in 2010 and 2011 to estimate continuous CDF values and ΔPACs for each test. Standard deviations of estimated bias values are 1.11 for Math and 1.28 for Reading/English Language Arts (ELA). Average absolute magnitudes of reported ΔPACs were 1.67 for Math and 2.93 for Reading/ELA. Test data compiled from state technical reports.

5. DISCUSSION

We have shown above that bias in the observed ΔPAC caused by discretization may lead to incorrect substantive conclusions regarding year-over-year educational progress. Changes in discretizing partitions effectively subject students to different cutoff scores over time. In this section, we discuss four additional issues: bias at the district and school levels, the relative contribution of sampling variability, solutions implementable by state testing programs, and implications of this case study for the general problem of comparing discretized distributions.
First, at the district and school levels, where similarities among students typically cause scores to be more concentrated (non-zero intraclass correlations; see Hedges and Hedberg, 2007), bias is likely to be larger, particularly when the state cutoff score happens to be close to the modal score for the district or school. To see why, recall that the second term of (5) depends only on the continuous-score density in a single year. Because test score distributions are generally unimodal, and because the values of x*_0 and x*_1 apply to all schools in the state, smaller variances in scores imply larger values of the misalignment term ∫_{x*_0}^{x*_1} f_1(x) dx when x*_0 and x*_1 lie near the mode of a district or school’s score distribution. Our model thus implies that bias is almost certainly larger than estimated in Section 4 in schools or districts in which large proportions of students score near the cutoff score.

Second, at the school and subgroup level, this bias is likely to be overshadowed by sampling error. In smaller samples, such as the “minimum subgroup size” that ranges from 5 to 100 across states (Fulton, 2006), sampling variability will typically overwhelm any useful information that might be gleaned from the observed ΔPAC from one year to the next. The observed PAC in each year, given in (3), follows a scaled binomial distribution, and thus the observed ΔPAC has standard error √(P_0(1 - P_0)/n_0 + P_1(1 - P_1)/n_1), where P_t = 1 - F_t(x*_t) and n_t is the sample size in year t. With P_0 = P_1 = 0.5 and n_0 = n_1 = 20, for example, this results in a standard error of nearly 16 percentage points – far too large to draw reliable inferences regarding year-over-year improvement using only a single ΔPAC observation. At samples of 5,000 students per year, it declines to one percentage point, commensurate with the magnitudes of bias that we report here. Although we have demonstrated that this bias will always be a factor, it will be most salient at the level of reporting for states and large districts.

Third, our results have practical implications for state testing programs that report trends using PAC metrics. Using finer-grained partitions will reduce the bias caused by misalignment of discrete score cells, all else equal. States wishing to minimize ΔPAC volatility should estimate the statistic using the most fine-grained data available at each time point (e.g., through the use of scale score estimates from a two- or three-parameter IRT model). If concerns remain about continuous or fine-grained scores imparting a false sense of precision, states could use coarse values when comparing individual observations, while retaining finer values for the purposes of PAC reporting. As a last resort, states or researchers could employ “smoothing” to estimate continuous PACs. A broader recommendation, following Ho (2008), is to dispense with PAC-based metrics in favor of average-based metrics. Average-based trend metrics are not cutoff-score dependent and are more robust to discretization and changes in discretizing partitions.
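The effect of finer score increments can be illustrated directly. The sketch below (Python; assumed normal scores, a hypothetical cutoff, and arbitrary cell widths, not calibrated to any real test) computes the misalignment term of (5) when the two years’ partitions are offset by half a cell, for progressively finer score increments.

    import math
    from scipy.stats import norm

    def effective_cutoff(cutoff, width, offset):
        """Lower bound of the cell containing the lowest discrete score above the
        cutoff, for a partition with cells [offset + k*width, offset + (k+1)*width)
        whose discrete score is the cell midpoint."""
        k = math.floor((cutoff - offset) / width)
        lower = offset + k * width
        midpoint = lower + width / 2
        # If this cell's discrete score does not exceed the cutoff, the lowest
        # qualifying discrete score sits in the next cell up.
        return lower if midpoint > cutoff else lower + width

    cutoff, f1 = 0.372, norm(0, 1).cdf   # assumed cutoff and year-1 score distribution

    for width in (1.0, 0.1, 0.01):       # coarser versus finer score increments
        x0 = effective_cutoff(cutoff, width, offset=0.0)          # year-0 partition
        x1 = effective_cutoff(cutoff, width, offset=width / 2)    # year-1 partition, misaligned
        print(width, round(abs(f1(x1) - f1(x0)), 4))              # misalignment term of (5)

Under these assumptions, each tenfold refinement of the score increment shrinks the misalignment term by roughly an order of magnitude, consistent with the recommendation above.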
Finally, although we focus on PAC statistics in educational testing, our model applies generally to any percentage- or proportion-based comparison in which discretizing partitions differ between groups. Such comparisons appear in a wide variety of fields of study, and discretizations of the underlying data (including height, weight, temperature, and currency) often differ depending on their source. Consider the problem of comparing poverty rates in the U.K. and the U.S., where incomes in each country are rounded to the nearest thousand pounds or dollars, respectively. Applying an exchange rate reveals that discretizing partitions differ across countries, effectively holding U.K. residents to a different poverty standard than U.S. residents. Similarly, rate comparisons in which data for one group are rounded to British units and the other rounded to metric, such as low-birth-weight comparisons, can be expected to suffer from the same bias. Discreteness, different partitions, and the PAC metric interact to produce the volatile pattern in Figure 1, and addressing any one of these (by “continuizing” distributions, using aligned partitions, or using average-based metrics, for example) will reduce potential bias in percentage-based comparisons.

References

American Institute of CPAs (2013), “CPA Examination Passing Rates,” retrieved from http://www.aicpa.org/BECOMEACPA/CPAEXAM/PSYCHOMETRICSANDSCORING/PASSINGRATES/Pages/default.aspx
Bandeira de Mello, V. (2011), Mapping State Proficiency Standards onto the NAEP Scales: Variation and Change in State Standards for English and Mathematics, 2005–2009 (NCES 2011-458), National Center for Education Statistics, Institute of Education Sciences, U.S. Department of Education, Washington, DC: Government Printing Office.
Braun, H., and Mislevy, R. (2005), “Intuitive test theory,” Phi Delta Kappan, 86, 489-497.
Bryant, M. J., Hammond, K. A., Bocain, K. M., Rettig, M. F., Miller, C. A., and Cardullo, R. A. (2008), “School performance will fail to meet legislated benchmarks,” Science, 321, 1781-1782.
College Board, The (2013), The 9th Annual AP Report to the Nation, New York, NY: The College Board.
Dempster, A. P., and Rubin, D. B. (1983), “Rounding error in regression: The appropriateness of Sheppard’s corrections,” Journal of the Royal Statistical Society, Series B (Methodological), 45(1), 51-59.
Flanagan, J. C. (1951), “Units, scores, and norms,” in E. F. Lindquist (ed.), Educational Measurement, Washington, DC: American Council on Education, pp. 695-763.
Fulton, M. (2006), “Minimum subgroup size for Adequate Yearly Progress (AYP): State trends and highlights,” Denver, CO: Education Commission of the States.
Glass, G. V., McGaw, B., and Smith, M. L. (1981), Meta-Analysis in Social Research, Beverly Hills, CA: Sage.
Hedges, L. V., and Hedberg, E. C. (2007), “Intraclass correlation values for planning group-randomized trials in education,” Educational Evaluation and Policy Analysis, 29(1), 60-87.
Heitjan, D. F. (1989), “Inference from grouped continuous data: A review,” Statistical Science, 4(2), 164-179.
Heitjan, D. F., and Rubin, D. B. (1991), “Ignorability and coarse data,” Annals of Statistics, 19(4), 2244-2253.
Ho, A. D. (2008), “The problem with ‘proficiency’: Limitations of statistics and policy under No Child Left Behind,” Educational Researcher, 37(6), 351-360.
Holland, P. (2002), “Two measures of change in the gaps between the CDFs of test-score distributions,” Journal of Educational and Behavioral Statistics, 34, 201-228.
Holland, P. W., and Dorans, N. J. (2006), “Linking and equating,” in R. Brennan (ed.), Educational Measurement (4th ed.), Westport, CT: American Council on Education / Praeger Publishers, pp. 187-220.
Horton, N. J., Lipsitz, S. R., and Parzen, M. (2003), “A potential for bias when rounding in multiple imputation,” The American Statistician, 57(4), 229-232.
Kolen, M. J., and Brennan, R. L. (2004), Test Equating, Scaling, and Linking: Methods and Practices (2nd ed.), New York: Springer-Verlag.
Lord, F. (1980), Applications of Item Response Theory to Practical Testing Problems, Hillsdale, New Jersey: Erlbaum.
McClarty, K. L., Way, W. D., Porter, A. C., Beimers, J. N., and Miles, J. A. (2013), “Evidence-based standard setting: Establishing a validity framework for cutoff scores,” Educational Researcher, 42, 78-88.
National Conference of Bar Examiners (2012), “2011 statistics,” The Bar Examiner, 81(1), 6-41.
Rasch, G. (1960), Probabilistic Models for Some Intelligence and Attainment Tests, Chicago: University of Chicago Press (1981).
Schneeweiss, H., Komlos, J., and Ahmad, A. S. (2010), “Symmetric and asymmetric rounding: A review and some new results,” Advances in Statistical Analysis, 94(3), 247-271.
Sheppard, W. F. (1897), “On the calculation of the most probable values of frequency constants for data arranged according to equidistant divisions of a scale,” Proceedings of the London Mathematical Society, 1(1), 353-380.
U.S. Department of Education (n.d.), NAEP Data Explorer, Washington, DC: National Center for Education Statistics, Institute of Education Sciences.
U.S. Department of Education (2012), ESEA Flexibility, available at http://www.ed.gov/esea/flexibility/documents/esea-flexibility-acc.doc
Wallis, W., and Steptoe, S. (2007), “How to fix No Child Left Behind,” Newsweek, 169(23), 34-41.
Yen, W. M., and Fitzpatrick, A. R. (2006), “Item response theory,” in R. Brennan (ed.), Educational Measurement (4th ed.), Westport, CT: American Council on Education / Praeger Publishers, pp. 111-154.