The Dependence of Growth-Model Results on Proficiency Cut Scores

States participating in the Growth Model Pilot Program reference individual student growth against “proficiency” cut scores that conform with the original No Child Left Behind Act (NCLB). Although achievement results from conventional NCLB models are also cut-score dependent, the functional relationships between cut-score location and growth results are more complex and are not currently well described. We apply cut-score scenarios to longitudinal data to demonstrate the dependence of state- and school-level growth results on cut-score choice. This dependence is examined along three dimensions: 1) rigor, as states set cut scores largely at their discretion, 2) across-grade articulation, as the rigor of proficiency standards may vary across grades, and 3) the time horizon chosen for growth to proficiency. Results show that the selection of plausible alternative cut scores within a growth model can change the percentage of students “on track to proficiency” by more than 20 percentage points and reverse accountability decisions for more than 40% of schools. We contribute a framework for predicting these dependencies, and we argue that the cut-score dependence of large-scale growth statistics must be made transparent, particularly for comparisons of growth results across states.


Growth Models and Cut Scores 4
McLaughlin & Bandeira de Mello, 2005). Holland (2002) and Ho (2007) have shown that proficiency-based trends and gap trends are subject to surprising changes in magnitude and sign under alternative proficiency cut scores. In addition, NCLB's proficiency-based accountability framework may encourage the disproportionate allocation of school resources to students who are just below the proficiency cut score (Booher-Jennings, 2005; Neal & Schanzenbach, 2007).
Although the GMPP allows for alternative approaches to determining AYP, it may also add a layer of complexity to proficiency-based calculations and increase the likelihood of misinterpreting student, school, and state progress toward proficiency.
In this paper, we describe how GMPP-based accountability results depend on a) attributes of the cut scores adopted by states and b) attributes of growth model policies themselves. Attributes of cut scores include rigor (more rigor implies higher cut scores across all grades, and less rigor implies lower cut scores across all grades) and articulation (the rigor of cut scores may vary systematically across grades). We also review attributes of growth model policies, focusing specifically on the impact of the chosen time horizon to proficiency. For example, schools may receive credit for nonproficient students who are on track to proficiency in 3 years but not for those who are on track to proficiency in 4 or more years. These two sets of attributes are responses to two different waves of federal policy.
Cut-score decisions were made largely in response to the basic requirements of NCLB, and growth model policy decisions were made largely in response to the GMPP. Before describing the impact of these attributes on growth model results, we discuss each of these sets of attributes and their policy contexts in turn.

Attributes of Cut Scores: Rigor and Across-Grade Articulation
When growth is referenced to cut scores, growth model results will depend upon both the rigor of cut scores and the patterns of cut-score articulation across multiple grades. The decision-making process that leads to these attributes has many motivations. Prior to NCLB implementation, most states tested at several benchmark grades, and only a few states tested contiguously across all or even most elementary and middle school grades (Olson, 2002). The contiguous-grade testing paradigm that emerged under NCLB necessitated setting cut scores for the newly required assessments and, in many cases, resetting cut scores previously set without regard to NCLB consequences. State policymakers addressed two issues as they considered designs for setting or resetting cut scores on their NCLB assessments: the rigor of the cut scores within each grade and the articulation of cut scores across the grades.
Efforts to foster consistency of cut scores across grades were only moderately successful under the benchmark-grade testing paradigm, where standard setting tended to be a grade-by-grade activity. Substantial differences between cut scores for Grades 4 and 8, for example, are easier to reconcile than inconsistencies between cut scores for adjacent grades. The contiguous-grade testing paradigm impelled policymakers and standard-setting researchers to develop methods to vertically moderate, or articulate, cut scores across grades (Cizek, 2005). These methods were guided by the principle that the pattern of proficiency rates across grades should appear rational to the various constituents who use or interpret test results. The pattern of cut scores should consider the scope and sequence of content across the grades and should reflect the regular progress that students tend to achieve from grade to grade.
There are no prescribed standards for what constitutes consistent across-grade results in terms of the percentage of students at or above a given performance standard. However, several patterns of well-articulated performance standards are frequently observed in state assessment programs as a result of their explicit consideration in the standard-setting process. These models describe articulated, cross-grade performance standards in terms of the percentage of proficient students in each grade. Lewis and Haug (2005) present three general models and interpretations of articulated across-grade performance standards.
The decreasing model reflects a smoothly decreasing percentage of proficient students across grades. There are multiple interpretations of this pattern of across-grade proficiency. It may reflect the changing nature of the domain across grades and, with it, a decrease in students' ability to meet the goals of the grade. For example, mathematics becomes increasingly complex, moving from the reasonably simple and concrete notions of counting, addition, and subtraction to the complex and abstract foundations of algebra and calculus. Lewis and Haug (2005) also observe that standard-setting participants in upper grades are often content-area experts who may act as gatekeepers of the domain, whereas elementary teachers tend to be more student-centered, considering what is best for the student and hesitating to label younger students as less than proficient.
The equal-percentage model reflects an equal or approximately equal percentage of proficient students in each grade. This pattern reflects an attribute of proficiency that is inherent in many performance-level descriptors: that proficient students are well prepared to meet the challenges of the next grade. Thus, proficient students tend to meet the challenges necessary to demonstrate proficiency in their next grade, resulting in similar percentages of proficient students from one grade to the next.
The increasing model reflects a smoothly increasing percentage of proficient students across grades. There are also multiple interpretations of this pattern of across-grade proficiency.
It may be that a somewhat higher bar is set at the lower grades to increase the likelihood that students will be prepared for the more challenging material, and possibly higher-stakes assessments, that come in subsequent grades. Additionally, as students' strengths and weaknesses are better understood through appropriate longitudinal record keeping, teachers and parents may provide better educational opportunities customized to individual learning styles, leading to an accelerated success rate over time.
This paper frames both the rigor and the across-grade consistency of cut scores as factors influencing the results of growth models under the GMPP. Rigor is investigated by raising cut scores in all grades from those that would set proficiency rates near 90% in all grades (low rigor) to those that would set proficiency rates near 30% in all grades (high rigor). Articulation is investigated by setting cut scores whose proficiency rates increase across grades (decreasing rigor) by 10 percentage points per grade, then by 9, then by 8, and so on. This rate of change can be set to 0, yielding an equal-percentage model, after which patterns of decreasing across-grade proficiency rates (increasing rigor) are investigated. By varying cut-score attributes within a plausible range and holding other factors constant, we demonstrate that rigor and articulation can change growth percentages by up to 20 percentage points and that school-level accountability results can be even more dramatically affected. In the next section, we discuss attributes of growth model policies that have similarly large influences on growth results under the GMPP.

Attributes of Growth Model Policies
Under conventional NCLB accountability models, a student is classified as "proficient" at Time t if the student's test score, $X_t$, is greater than or equal to a cut score designating proficiency at that time, $c_t$. Table 1 displays the conditions for student proficiency as well as other classifications introduced below. The cornerstone statistic of NCLB is the percentage of proficient students, denoted here as PPS. The PPS is calculated for each sufficiently large subgroup in a school and compared to a benchmark called the Annual Measurable Objective (AMO). While so-called safe-harbor provisions and, for some states, idiosyncratic confidence-interval procedures may complicate matters, the baseline NCLB rule is that a school's PPS statistics must meet or exceed the AMO for all valid subgroups in order to avoid sanctions.
Schools whose subgroups all have PPS statistics at or above the AMO are described as making Adequate Yearly Progress (AYP, Table 1). In a simplified scenario with only one subgroup, this decision process may be represented by the statement: If PPS ≥ AMO, then AYP.
Using this terminology, proficiency describes a student, PPS is a school-level percentage, and AYP describes a school. At the state level, relevant statistics include the PPS, which can be calculated for a state as well as for a school, and the percentage of schools making AYP (PAYP).
The PPS at both the state and school levels is inversely related to the rigor of a cut score: the higher the cut score, the lower the PPS. The PAYP is more complex, as it depends on the AMO. However, for fixed AMOs, PAYP will also decline with increasing cut-score rigor. This should seem fairly straightforward: School PPS will decline as cut-score rigor increases, and fewer school PPS will surpass the AMO. We will demonstrate that the cut-score dependencies of growth statistics are just as predictable but much less straightforward.
To date, 11 growth models have received full or conditional approval from the Department of Education (U.S. Department of Education, 2008). Many of these growth models allow students to be classified as "on track" to proficiency at some point in the future. We denote the percentage of "on track" students in a school or a state as POT. A state's growth model policy may count "on track" students as "proficient" for the sake of school accountability decisions; we adopt this policy throughout our analyses. We identify schools whose accountability classification is reversed by using growth models: We count the schools that are not making AYP, that is, PPS < AMO, but have surpassed the AMO through their "on track" students, that is, PPS + POT ≥ AMO. We distinguish these schools from conventional AYP schools and describe them as having made Adequate Yearly Growth (AYG, Table 1). A state's growth model policies can dictate the definition of POT and the form of the inequality that decides AYG. For some states, current status is removed from the equation entirely, and students must be predicted to be on track to proficiency whether they are currently proficient or not. Under this system, there is no PPS; there is only POT, and a school makes AYP if POT ≥ AMO. States may also use models that predict future scores from prior test scores using regression-type methods. These models use longitudinal data from prior cohorts of students to estimate prediction equations. Other states may award fractional credit to on-track students based on their starting point, the size of their gains, or both. In order to focus on the functional relationships among selected variables, we use a simplified model that avoids confounding the multiple policy factors that make up AYP and AYG decisions in practice. Dunn (2008, this issue) describes the effects of the policies of approved growth model states as they work in vivo, whereas we demonstrate how results in any state can be expected to change under systematic manipulation of selected factors.
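The single-subgroup decision rule just described can be sketched in a few lines of Python. This is a minimal illustration under the simplifying assumptions stated above (one subgroup, no safe-harbor or confidence-interval provisions), with PPS, POT, and AMO given as proportions; the function name `classify_school` and the example schools are hypothetical.

```python
def classify_school(pps: float, pot: float, amo: float) -> str:
    """Classify a single-subgroup school (hypothetical helper).

    pps: proportion of students proficient now (PPS)
    pot: proportion of students who are nonproficient but "on track" (POT)
    amo: Annual Measurable Objective, as a proportion
    """
    if pps >= amo:
        return "AYP"      # conventional status rule: PPS >= AMO
    if pps + pot >= amo:
        return "AYG"      # reversed by the growth model: PPS < AMO <= PPS + POT
    return "neither"


# Hypothetical schools: name -> (PPS, POT), judged against an AMO of 65%
amo = 0.65
schools = {"A": (0.70, 0.05), "B": (0.60, 0.08), "C": (0.55, 0.05)}
for name, (pps, pot) in schools.items():
    print(name, classify_school(pps, pot, amo))
```

School B is the case the growth model changes: it misses the AMO on status alone but clears it once its on-track students are counted.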
The model we use for illustration, described in Table 1, is commonly referred to as a gain-score or trajectory growth model. Student scores must be located on a vertical scale that spans many grades, and cut scores are also mapped onto this scale. The model assumes that a student's gain over a past unit of time will equal that student's gain over each similar unit of time in the future. For example, a nonproficient student whose Time 1 and Time 2 scores are 425 and 475, respectively, is projected to score 525 at Time 3, 575 at Time 4, and so on. The growth model policy, then, defines this nonproficient student as "on track" to proficiency at Time 3 if the projected score of 525 is at or above the Time 3 proficiency cut score. This can be represented by the following conditions: $X_2 < c_2$ and $X_1 + 2(X_2 - X_1) \ge c_3$ (Table 1).
In a straightforward extension of this logic, a student is defined at Time 2 as "on track in N years" if $X_2 < c_2$ and $X_1 + (N+1)(X_2 - X_1) \ge c_{N+2}$.
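The trajectory rule in Table 1 can be sketched as a small function that returns the shortest horizon at which a nonproficient student is on track. This is an illustrative sketch, not any state's implementation; the function names are hypothetical, and the third student (scores of 400 and 470) is an invented example. The cut scores are the Figure 1 values (450 at Time 1, rising by 50 points per year).

```python
def projected_score(x1: float, x2: float, years_ahead: int) -> float:
    """Trajectory projection: the Time 1 -> Time 2 gain repeats every year."""
    return x2 + years_ahead * (x2 - x1)


def years_to_on_track(x1, x2, cuts, horizon):
    """Smallest n in 1..horizon for which the student is "on track in n years".

    cuts maps the time index t to the cut score c_t and must contain
    c_2 through c_{horizon + 2}. Returns None if the student is already
    proficient at Time 2 or is not on track within the horizon.
    """
    if x2 >= cuts[2]:
        return None  # proficient now, so not eligible for growth credit
    for n in range(1, horizon + 1):
        # Table 1 condition: X1 + (n + 1)(X2 - X1) >= c_{n+2}
        if x1 + (n + 1) * (x2 - x1) >= cuts[n + 2]:
            return n
    return None


# Figure 1 cut scores: c_1 = 450, c_2 = 500, ..., c_5 = 650
cuts = {t: 400 + 50 * t for t in range(1, 6)}
print(projected_score(425, 475, 1), projected_score(425, 475, 2))  # 525 575
print(years_to_on_track(350, 475, cuts, horizon=3))  # triangle student: 1
print(years_to_on_track(425, 475, cuts, horizon=3))  # square student: None
print(years_to_on_track(400, 470, cuts, horizon=3))  # invented student: 2
```

Because the square student's gain of 50 equals the 50-point yearly rise in cut scores, no horizon puts that student on track; the invented student, with a gain of 70, becomes on track once the horizon reaches 2 years.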
This paper investigates how a particular growth model policy attribute, the time horizon to proficiency N, affects accountability decisions. We demonstrate that cut-score attributes can interact with growth model policy attributes to dramatically affect growth statistics for states (POT) and for schools (PAYG, the percentage of schools making AYG). We show that increasing the time horizon to proficiency can more than double the percentage of students who are classified as "on track." In addition, we explain how changing cut scores can interact with AMOs, with the potential to reduce school failure rates by more than 35 percentage points. These policy decisions are among many that states must make in implementing a growth model, but we demonstrate that these particular decisions have systematic relationships with outcome variables and are easy to predict. We conclude this paper with an argument for the generalizability of these dependencies to all growth models that reference growth to NCLB-type proficiency cut scores.

Methods: A Theoretical Framework for Evaluating Cut-Score Dependencies
Visualization and prediction of the cut-score dependence of growth results can be assisted by a theoretical framework. The cornerstone of the framework is a bivariate scatterplot with student scores at Time 2 ($X_2$) plotted against student scores at Time 1 ($X_1$). Figure 1 illustrates observed test score data generated from a bivariate normal distribution. The bivariate normal distribution was chosen for convenience and because many tests have distributions that are either scaled to be, or happen to be, unimodal and roughly symmetrical. Departures from bivariate normality will certainly change the findings presented. However, our purpose here is illustrative, and the framework easily accommodates alternative distributional choices or, as we will show, real data.
Figure 1 shows a sample of 1000 students drawn from this bivariate normal distribution and plotted as small gray dots. The scale is arbitrary up to a linear transformation; the mean of the Time 1 distribution is set to 500, and the standard deviations of the Time 1 and Time 2 distributions are set to 100. The Time 2 mean is set to 550, resulting in an average gain of 0.5 standard deviation units, a level commonly seen in practice. The correlation between Time 1 and Time 2 scores is set to a similarly realistic value of 0.75.
To establish a reference point, this framework assumes that the current year is Time 2 and that the first year of data, Time 1, came from the previous year. The centroid is indicated by a black circle. The $X_2 = X_1$ diagonal is plotted for reference. Points above this diagonal represent students whose scores increased from Time 1 to Time 2, and points below the diagonal represent students whose scores decreased from Time 1 to Time 2. For illustration, Figure 1 flags two students' data; one student is represented as a triangle and one as a square. The triangle identifies a student who scored 350 at Time 1 and 475 at Time 2, for a gain of 125. The gain can be visualized as the vertical or, equivalently, horizontal distance from any point to the diagonal. The square identifies a student who has a Time 1 score of 425 and a Time 2 score of 475, for a gain of 50. This student has a smaller gain and is closer to the diagonal. We will demonstrate how this framework readily identifies which students are classified as "on track." Five reference lines have been drawn across each axis. Each line marks the cut scores set at Times 1, 2, 3, 4, and 5; these are labeled $c_1$, $c_2$, $c_3$, $c_4$, and $c_5$, respectively. In Figure 1, the cut scores are set at 450 at Time 1, 500 at Time 2, 550 at Time 3, and so on. Of the two flagged students, the triangle is below the Time 1 cut score on the horizontal axis and also below the Time 2 cut score on the vertical axis. In other words, this student is below proficient in both years. The student has also made a gain of 125 points from Time 1 to Time 2. If the student were to make the same gain from Time 2 to Time 3, the student would score 600, which is above the Time 3 cut score. The student represented by the black triangle is therefore on track to proficiency in 1 year.
The student represented by the black square has a Time 1 score of 425 and a Time 2 score of 475, for a gain of 50. Note that this student is also below proficient in both years. If the student makes the same gain from Time 2 to Time 3, the student will score 525, which is not proficient at Time 3. This student is not on track to proficiency in 1 year. The usefulness of this framework is that students who are "on track" to proficiency can be easily identified by the area of the graph in which they are located. The lightest shaded area identifies the students who are "on track in 1 year" and is bordered by the horizontal line $X_2 = c_2$ and the diagonal line $X_2 = (X_1 + c_3)/2$. The area below the horizontal line identifies nonproficient students at Time 2. The area above the diagonal line identifies students who have made gains that place them on track to proficiency by Time 3. The equation for the diagonal line follows directly from the equation in Table 1 after solving for $X_2$: the condition $X_1 + 2(X_2 - X_1) \ge c_3$ is equivalent to $X_2 \ge (X_1 + c_3)/2$. The percentage of students in this area is what the gain-score model in Table 1 would describe as $\mathrm{POT}_1$. Note that the shaded triangle has no border on its left side and is therefore semi-infinite.
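The region-based reading of Figure 1 can be checked by simulation. The sketch below draws scores from the bivariate normal distribution described above (means 500 and 550, standard deviations 100, correlation 0.75), counts the students who fall inside the lightest shaded triangle bounded by $X_2 = c_2$ and $X_2 = (X_1 + c_3)/2$ with $c_2 = 500$ and $c_3 = 550$, and so gives a Monte Carlo estimate of $\mathrm{POT}_1$. The sample size, seed, and function names are arbitrary choices for illustration; with 200,000 draws the estimate should land near the 4.71% quadrature value reported in this section, varying slightly with the seed.

```python
import math
import random


def simulate_scores(n, mu1=500.0, mu2=550.0, sd1=100.0, sd2=100.0, rho=0.75, seed=2009):
    """Draw n (X1, X2) pairs from a bivariate normal distribution."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # Correlate a second standard normal with the first (2 x 2 Cholesky factor)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        pairs.append((mu1 + sd1 * z1, mu2 + sd2 * z2))
    return pairs


def in_first_triangle(x1, x2, c2, c3):
    """Below X2 = c2 (nonproficient now) and on or above X2 = (X1 + c3) / 2."""
    return x2 < c2 and x2 >= (x1 + c3) / 2


pairs = simulate_scores(200_000)
share = sum(in_first_triangle(x1, x2, 500, 550) for x1, x2 in pairs) / len(pairs)
print(f"Monte Carlo estimate of POT_1: {share:.4f}")
```

The triangle student (350, 475) falls inside the region and the square student (425, 475) falls outside it, matching the classifications given in the text.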
Figure 1 also shows the successively darker shaded triangles that identify the students on track in 2 years (but not 1) and 3 years (but not 2 or 1), respectively. The full percentage of students on track within 3 years, $\mathrm{POT}_3$, is the sum of the proportions calculated from all three shaded triangles. The equation of each diagonal line can be found by solving the general equation in Table 1 for $X_2$ given a particular time horizon of N years, which gives $X_2 = (N X_1 + c_{N+2})/(N+1)$. The figure shows that increasing the time horizon to proficiency allows for greater and greater proportions of "on track" students under this growth model.
If the bivariate normal model adequately describes the distribution of observed scores, POT statistics can be calculated as the volume under the density function over the semi-infinite triangles shown in Figure 1. The double integral for the first triangle, representing the percentage of students on track to proficiency in 1 year, follows from the conditions $X_2 < c_2$ and $X_1 \le 2X_2 - c_3$:

$$\mathrm{POT}_1 = \int_{-\infty}^{c_2} \int_{-\infty}^{2X_2 - c_3} f(X_1, X_2)\, dX_1\, dX_2. \qquad (1)$$

Here, $f(X_1, X_2)$ is the usual bivariate normal density function with five parameters, the means $\mu_1$ and $\mu_2$, the standard deviations $\sigma_1$ and $\sigma_2$, and the correlation $\rho$:

$$f(X_1, X_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{-\frac{1}{2(1-\rho^2)}\left[\frac{(X_1-\mu_1)^2}{\sigma_1^2} - \frac{2\rho(X_1-\mu_1)(X_2-\mu_2)}{\sigma_1\sigma_2} + \frac{(X_2-\mu_2)^2}{\sigma_2^2}\right]\right\}.$$

In the scenario shown in Figure 1, the means would be 500 and 550, the standard deviations would both be 100, the correlation would be 0.75, and $c_2$ and $c_3$ would be 500 and 550, respectively. We calculate the integral using numerical quadrature in Matlab and find that $\mathrm{POT}_1$ under these parameters is 4.71%. From here, it is straightforward to adjust cut scores in Equation 1 and recalculate POT under alternative cut-score selections. For example, the cut scores shown in Figure 1 could all be shifted much lower or much higher to represent less or more rigor, respectively. A generalized version of Equation 1 allows evaluation of the percentage of students on track to proficiency within N years. Because each year's condition $X_1 + (n+1)(X_2 - X_1) \ge c_{n+2}$ bounds $X_1$ from above, the union of the N shaded triangles is bounded by the highest of their diagonal lines:

$$\mathrm{POT}_N = \int_{-\infty}^{c_2} \int_{-\infty}^{u_N(X_2)} f(X_1, X_2)\, dX_1\, dX_2, \qquad u_N(X_2) = \max_{1 \le n \le N} \frac{(n+1)X_2 - c_{n+2}}{n}. \qquad (2)$$

For N = 1, Equation 2 reduces to Equation 1.
The score scale shown in Figure 1 is hypothetical, and the implication of a Time 2 cut score of 500, for example, is difficult to discern on its own. A more interpretable approach references cut scores by the proficiency statistics they generate. For example, a Time 1 cut score of 450 and a Time 2 cut score of 500, as shown in Figure 1, result in PPS of around 69% in both grades. To explain the effects of cut-score rigor on growth statistics, we begin by assuming that cut scores follow an equal-percentage model across grades; that is, the percentage of proficient students is the same in all grades. For convenience in calculating cut scores, distributions in higher grades are assumed to have standard deviations of 100 and average gains of 50 per year. We then shift the entire set of cut scores so that the percentage of proficient students in all grades ranges from 90% (low cut scores and less rigor) to 30% (high cut scores and more rigor), a plausible range that spans the PPS statistics seen in practice (Swanson, 2008). Visually, this is akin to sliding the grid in Figure 1 up and down the diagonal of the graph and calculating the proportion of students in the areas of the triangles as the grid moves.
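The calculation behind the rigor sweep can be sketched numerically. This is not the authors' Matlab code: it is a standard-library Python sketch that assumes bivariate normal scores with the Figure 1 parameters and higher-grade distributions with a standard deviation of 100 and gains of 50 per year. It integrates over $X_1$ in closed form using the conditional normal distribution of $X_1$ given $X_2$, and integrates over $X_2$ with composite Simpson's rule, a quadrature method chosen here for simplicity. The on-track-within-N region is treated as the union of the N semi-infinite triangles; because each year's on-track condition is an upper bound on $X_1$, the union is bounded by the largest of those bounds. Truncating the lower limit at 8 standard deviations below the Time 2 mean is an assumption that discards a negligible tail. The function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def pot_n(cuts, horizon, mu1=500.0, mu2=550.0, sd1=100.0, sd2=100.0,
          rho=0.75, steps=4000):
    """Proportion of students on track within `horizon` years.

    cuts maps t -> c_t and must contain c_2 .. c_{horizon + 2}.
    Inner integral over X1: conditional normal CDF. Outer integral over X2:
    composite Simpson's rule on [mu2 - 8 sd2, c_2].
    """
    c2 = cuts[2]
    cond_sd = sd1 * math.sqrt(1.0 - rho ** 2)          # SD of X1 given X2

    def x1_upper(x2):
        # On track in n years  <=>  X1 <= ((n + 1) X2 - c_{n+2}) / n
        return max(((n + 1) * x2 - cuts[n + 2]) / n for n in range(1, horizon + 1))

    def integrand(x2):
        z2 = (x2 - mu2) / sd2
        cond_mean = mu1 + rho * sd1 * z2               # E[X1 | X2 = x2]
        density_x2 = STD_NORMAL.pdf(z2) / sd2
        return density_x2 * STD_NORMAL.cdf((x1_upper(x2) - cond_mean) / cond_sd)

    lower = mu2 - 8.0 * sd2                            # stands in for -infinity
    if c2 <= lower:
        return 0.0
    h = (c2 - lower) / steps                           # steps must be even
    total = integrand(lower) + integrand(c2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(lower + i * h)
    return total * h / 3.0


def equal_percentage_cuts(pps, n_times=7, mu1=500.0, gain=50.0, sd=100.0):
    """Cut scores giving the same PPS at every time point (equal-percentage model)."""
    z = STD_NORMAL.inv_cdf(1.0 - pps)
    return {t: mu1 + gain * (t - 1) + sd * z for t in range(1, n_times + 1)}


figure1_cuts = {t: 400 + 50 * t for t in range(1, 8)}
print(f"POT_1 for the Figure 1 cuts: {pot_n(figure1_cuts, 1):.4f}")

for pps in (0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3):
    cuts = equal_percentage_cuts(pps)
    row = "  ".join(f"N={n}: {pot_n(cuts, n):.3f}" for n in (1, 3, 5))
    print(f"PPS {pps:.0%}  {row}")
```

For the Figure 1 cut scores, the first printed value should be close to 4.71%, and the table should show the qualitative pattern described for Figure 2: POT rises as rigor rises and as the horizon lengthens.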
Figure 2 shows results from calculations that vary both overall rigor of cut scores from 90% (less rigor) to 30% (more rigor) and the time horizon to proficiency from N = 1 to 5 years.
To maintain consistency with the framework presented in Figure 1, we order the horizontal axis by the increased rigor that comes from raising cut scores, so proficiency rates decrease from left to right. Figure 2 shows that low levels of rigor lead to low POT statistics of around 2%, whereas increasing rigor can increase POT statistics to 10% when the time horizon is 1 year and to over 20% when the time horizon is 5 years.
Most GMPP states use a 3-year time horizon, though time horizons may be shorter for students near graduation from a particular school or as the year approaches 2014, the deadline for 100% proficiency. States that have equal time horizons but dramatically different levels of rigor may report POT differences of more than 15 percentage points simply because of cut-score selection. It may seem counterintuitive that increasing rigor leads to greater percentages of students who are on track. The simple explanation is that increasing rigor results in more nonproficient students and thus more students eligible to qualify for growth calculations.
Visually, this can be pictured in Figure 1, where increasing cut scores will result in greater and greater proportions of students bordered by the semi-infinite shaded triangles. This is a reminder that the percentage of students who are "on track" (POT) is confounded with cut-score selection. States reporting high PPS are expected to have relatively lower POT statistics, and states with more rigorous cut scores are expected to experience the greatest benefit from the GMPP under projection growth models. These benefits would increase if the time horizon to proficiency were extended, though the differences between 3-year and 4-year time horizons are not as dramatic as those between 1-year and 2-year time horizons.

Real Data Confirmation of Theoretical Relationships
The relevance of the theoretical framework was evaluated by applying similar cut-score scenarios to longitudinal student data from a mid-sized state. The dataset consists of a single-grade cohort of almost 70,000 students who have four years of test scores from Grades 3 through 6 […] span is approximately 84%. The test is an English Language Arts test that is vertically scaled, with increasing grade means and decreasing variability over the four grades.
To calculate the cut-score dependence of POT 1 , the percentage of students on track to proficiency in 1 year, Grade 4 and Grade 5 scores are defined as X 1 and X 2 respectively.The cut scores c 1 , c 2 , c 3 , are set by a similar equal-percentage model as the one that generated Figure 2.
For example, for a target PPS of 80%, empirical cut scores are identified that result in 80% of students classified as proficient at each grade. The cut scores associated with a given target PPS are denoted $c_1$, $c_2$, and $c_3$, and the target PPS is then varied to investigate cut-score dependencies. To calculate $\mathrm{POT}_2$, Grades 3 and 4 are defined as $X_1$ and $X_2$, respectively, and $c_1$, $c_2$, $c_3$, and $c_4$ are calculated as percentiles of the empirical distributions of Grades 3, 4, 5, and 6. With a four-year dataset, only two time horizons can be evaluated: a 1-year horizon using Grade 4-5 growth projected to Grade 6 and a 2-year horizon using Grade 3-4 growth projected to Grade 6. Similar results could be obtained for a 1-year horizon using Grade 3-4 growth projected to Grade 5 but were not included for the sake of parsimony.
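Setting empirical equal-percentage cut scores amounts to reading percentiles off each grade's observed score distribution. The sketch below uses one simple convention, which is an assumption rather than the procedure used in this study: sort the scores and take the value at position round((1 - PPS) × n) as the cut. The toy score lists and the function names are hypothetical. With tied scores, the realized PPS can overshoot the target, the kind of discreteness that makes empirical curves bumpy.

```python
def empirical_cut(scores, target_pps):
    """Observed score used as the cut so that about target_pps of students are at or above it.

    Convention (an assumption): sort ascending and take the score at position
    round((1 - target_pps) * n), clamped to the valid index range.
    """
    ordered = sorted(scores)
    k = min(len(ordered) - 1, max(0, round((1.0 - target_pps) * len(ordered))))
    return ordered[k]


def share_at_or_above(scores, cut):
    """Realized PPS for a given cut score."""
    return sum(s >= cut for s in scores) / len(scores)


def cuts_for_horizon(scores_by_grade, first_grade, horizon, target_pps):
    """c_1 .. c_{horizon+2} from grades first_grade .. first_grade + horizon + 1.

    POT_1 in the text: first_grade=4, horizon=1 (Grades 4, 5, 6).
    POT_2 in the text: first_grade=3, horizon=2 (Grades 3, 4, 5, 6).
    """
    return {t: empirical_cut(scores_by_grade[first_grade + t - 1], target_pps)
            for t in range(1, horizon + 3)}


scores = [1, 1, 1, 2, 2, 3, 3, 3, 3, 4]   # hypothetical scores with ties
for target in (0.5, 0.6):
    cut = empirical_cut(scores, target)
    print(target, cut, share_at_or_above(scores, cut))
```

For the 60% target, the ties at score 2 push the realized PPS to 70%, so the curve for that target would sit off the intended value.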
Figure 3 shows the results of varying cut scores to reflect equal cross-grade PPS from 90% to 30%, in a manner similar to Figure 2. As in Figure 2, the $\mathrm{POT}_1$ line runs from near 1% for less rigorous cut scores to near 10% for more rigorous cut scores. The $\mathrm{POT}_2$ line is slightly higher than the corresponding line in Figure 2 and shows greater cut-score dependence. Looking at Figure 1, this can be explained by a greater density of students in the second semi-infinite triangle. Because the real data have a slightly higher correlation and a slightly lower gain than the theoretical data, there is indeed a greater density of students in the region of the second triangle, leading to the observed results. The curves in Figure 3 are bumpy because percentiles calculated from discrete score data cannot reproduce target percentages exactly. The similarities between Figures 2 and 3 support an argument for consistent functional dependencies between cut-score choice and POT statistics.

Effects of Articulation of Cut Scores: Theoretical and Real Data Results
Cut scores may decline or increase in rigor across grades. To model the effects of the articulation of cut scores across grades, we introduce a series of plausible PPS patterns in Table 2. These patterns all have an average PPS of 60% across all grades, but higher patterns in the table show declining rigor across grades (increasing PPS), and lower patterns show increasing rigor across grades (decreasing PPS). The central pattern is an equal-percentage pattern that generates 60% proficiency at each grade. The cut scores that would establish these PPS statistics are calculated for the theoretical score distributions underlying Figure 1. For example, for the first pattern in Table 2, the cut scores that generate the listed percentages are approximately 513, 537, 561, and 583. The integral in Equation 1 or 2 can then be evaluated for the cut scores generated from each of the patterns in Table 2. In Figure 1, this exercise can be visualized by contracting the distances between the cut scores (for decreasing rigor) and then expanding them (for increasing rigor). As the distances between cut scores increase, the diagonal lines in Figure 1 are pulled up and squeezed against the horizontal line $X_2 = c_2$. As a result, the proportion of students in the area of the triangles decreases if rigor increases across grades. Moving from the top rows of Table 2 to the bottom rows can be seen as an exercise in increasing the distances between cut scores.
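The articulation patterns can be generated and converted to cut scores as follows. This sketch assumes that the patterns in Table 2 change linearly by a fixed number of points per grade around a 60% average (for a 10-point step, 45%, 55%, 65%, and 75%), which reproduces the cut scores of approximately 513, 537, 561, and 583 quoted above under the theoretical distributions (Time 1 mean of 500, gains of 50 per year, standard deviation of 100). The function names and the linear form are assumptions for illustration; the actual rows of Table 2 may differ.

```python
from statistics import NormalDist


def articulated_pattern(step, mean_pps=60.0, n_grades=4):
    """Linear PPS pattern (percent) averaging mean_pps, changing by `step` points per grade.

    Positive step: PPS rises across grades (declining rigor).
    Negative step: PPS falls across grades (increasing rigor).
    """
    center = (n_grades - 1) / 2
    return [mean_pps + step * (g - center) for g in range(n_grades)]


def cuts_for_pattern(pps_pattern, mu1=500.0, gain=50.0, sd=100.0):
    """Cut score for each time point so that pps_pattern[t] percent of
    N(mu1 + gain * t, sd) lies at or above it."""
    std = NormalDist()
    return [mu1 + gain * t + sd * std.inv_cdf(1.0 - p / 100.0)
            for t, p in enumerate(pps_pattern)]


for step in (10, 0, -10):
    pattern = articulated_pattern(step)
    cuts = cuts_for_pattern(pattern)
    gaps = [round(b - a, 1) for a, b in zip(cuts, cuts[1:])]
    print(step, pattern, [round(c) for c in cuts], gaps)
```

The printed gaps show the geometry described above: rising PPS across grades squeezes the cut scores closer together than the 50-point yearly gain, while falling PPS spreads them farther apart.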
The results of evaluating these integrals for time horizons of 1 and 2 years are shown in Figure 4. Simply put, raising future goals will decrease the number of students who are on track to these goals. Figure 5 confirms these findings with the real dataset previously described. Similar to the contrast between Figures 2 and 3, Figures 4 and 5 are very similar for a 1-year horizon, but the empirical data for the 2-year horizon show a greater dependence on cut-score attributes than the theoretical data. This may also be explained by a disproportionate weight of students in the "on-track-in-2-years" triangle.
The magnitudes of the dependencies in Figures 4 and 5 are slightly less dramatic than those shown in Figures 2 and 3. Further, the extremes shown in Figures 4 and 5 are slightly less realistic, as declines or increases in proficiency rates of 30 percentage points across four grades are not common in practice. In contrast, PPS ranges between 90% and 30%, as shown in Figures 2 and 3, represent the actual PPS variation currently seen across states (Swanson, 2008). This suggests that the practical range of cut-score rigor has a greater impact than the practical range of cut-score articulation. It is nonetheless impossible to fully disentangle these two factors in longitudinal analysis, as increasing the distances between cut scores naturally affects rigor in each grade. These interactions can be visualized in Figure 1 through the stretching and shifting of the grid of cut scores over the density of data in the scatterplot. Together, Figures 2-5 demonstrate that increasing the overall rigor of cut scores increases POT statistics, whereas increasing rigor from grade to grade decreases POT statistics.

School-Level Accountability Results
To this point, we have described the effects of cut-score rigor, cut-score articulation, and time horizons on state-level results for students, as represented by the percentage of on-track students (POT). A more relevant statistic for some policymakers may be the percentage of schools for which growth models make a difference in accountability decisions. A school-level version of the theoretical framework in Figure 1 exists; however, the number of variables and interdependencies becomes too large for the framework to support helpful visualizations. Instead, we simply show the empirical results for the dependence of the proportion of AYG schools on the rigor of cut scores.
The matched dataset contains information for approximately 70,000 students in over 900 elementary schools. The data are longitudinal and span four years for a single-grade cohort from Grade 3 to Grade 6. For stability, and to mimic the minimum subgroup size for this state and many others, we exclude all schools with fewer than 30 matched students. This leaves around 640 schools, for a 71% school-inclusion rate.
AYP decisions are made on multiple subgroups within schools, where every subgroup's PPS must meet or exceed the AMO. For simplicity, we consider the single-grade cohort as the only subgroup in the school. Additionally, we use only the 2-year time horizon, where the "current" Grade 4 includes growth results from Grades 3 to 4 and credits students on track to proficiency by Grades 5 or 6. Finally, we did not include safe-harbor or confidence-interval provisions, whose interactions with growth models further complicate dependencies. The implications of these findings for schools with multiple subgroups can be derived by referencing the PPS of the lowest-scoring subgroup, as schools are essentially accountable to this subgroup alone. The school-level accountability decision, which is effectively a model for Grade 4 in our scenario, follows from Table 1: If PPS ≥ AMO, then AYP, and, for growth model decisions for non-AYP schools: If PPS + POT ≥ AMO, then AYG.
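The school-level decision above, together with the exclusion of schools with fewer than 30 matched students, can be sketched as a small aggregation over student records. This sketch assumes each record carries a school identifier, a current proficiency flag, and a precomputed 2-year on-track flag; the record layout, the function names, and the toy schools are hypothetical, and only the single-subgroup rule described in the text is applied.

```python
from collections import defaultdict


def school_decisions(records, amo, min_n=30):
    """Single-subgroup AYP/AYG decisions from matched student records (hypothetical layout).

    records: iterable of (school_id, proficient_now, on_track) tuples.
    Schools with fewer than min_n matched students are excluded.
    """
    tallies = defaultdict(lambda: [0, 0, 0])  # [n, proficient, on track]
    for school, proficient, on_track in records:
        t = tallies[school]
        t[0] += 1
        t[1] += bool(proficient)
        t[2] += bool(on_track and not proficient)  # growth credit only for nonproficient students

    decisions = {}
    for school, (n, n_prof, n_track) in tallies.items():
        if n < min_n:
            continue  # below the minimum subgroup size
        pps, pot = n_prof / n, n_track / n
        if pps >= amo:
            decisions[school] = "AYP"
        elif pps + pot >= amo:
            decisions[school] = "AYG"
        else:
            decisions[school] = "neither"
    return decisions


def payg(decisions):
    """Percentage of included schools that make AYG through the growth model."""
    return 100.0 * sum(d == "AYG" for d in decisions.values()) / len(decisions)


def make_school(sid, n, n_prof, n_track):
    """Build toy records for one hypothetical school."""
    return ([(sid, True, False)] * n_prof
            + [(sid, False, True)] * n_track
            + [(sid, False, False)] * (n - n_prof - n_track))


records = (make_school("A", 40, 30, 2) + make_school("B", 40, 20, 6)
           + make_school("C", 20, 5, 10) + make_school("D", 50, 20, 2))
decisions = school_decisions(records, amo=0.60)
print(decisions, round(payg(decisions), 1))
```

School C is dropped for having only 20 matched students, so PAYG here is computed over the three remaining schools.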
AMOs are essentially cut scores used to determine whether individual schools have met their AYP goals. As such, their effect on growth-model results is comparable to that of the student-level proficiency cut-score choice. When NCLB took effect, the AMO was tied to PPS so that states were discouraged from setting excessively low standards for schools.

Our purpose is to demonstrate the dependence of PAYG on proficiency cut scores while holding other factors fixed, but it is unrealistic to model shifting cut scores without a corresponding shift in the AMO. To model how the AMO corresponds to a given cut score, we follow the federal formula that originally linked a state's AMO to its PPS. NCLB set the minimum AMO at the PPS of "the school at the 20th percentile in the State, based on enrollment, among all schools ranked by the percentage of students at the proficient level" (Pub. L. No. 107-110, 2002). This amounts to the following algorithm. For each grade and its associated proficiency cut score, 1) rank all schools by their PPS; 2) calculate the cumulative enrollment as a percentage of statewide enrollment, from the lowest-ranked school upward; and 3) set the AMO equal to the PPS of the school at which 20 percent of statewide enrollment is reached.
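A minimal sketch of the three-step AMO algorithm, assuming school-level PPS values and enrollment counts are already in hand. The function name and the reading of "reached" as the first school whose cumulative enrollment share is at least 20 percent are our assumptions; other tie-handling conventions are possible.

```python
from typing import Sequence


def linked_amo(pps: Sequence[float], enrollment: Sequence[int],
               percentile: float = 20.0) -> float:
    """Enrollment-weighted AMO rule as paraphrased in the text above.

    1) Rank schools by PPS, lowest first.
    2) Accumulate enrollment as a percentage of statewide enrollment.
    3) Return the PPS of the school at which `percentile` percent of
       statewide enrollment is first reached (we read "reached" as >=).
    """
    if not pps or len(pps) != len(enrollment):
        raise ValueError("need one enrollment count per school")
    total = sum(enrollment)
    cumulative = 0
    for school_pps, n in sorted(zip(pps, enrollment)):  # ascending PPS
        cumulative += n
        if 100.0 * cumulative / total >= percentile:
            return school_pps
    raise ValueError("percentile must be between 0 and 100")


# Hypothetical state with four schools.
pps = [40.0, 70.0, 55.0, 30.0]
enrollment = [100, 300, 400, 50]
print(linked_amo(pps, enrollment))  # 55.0
```

Raising a cut score lowers every school's PPS, so under this rule the linked AMO falls with it; this coupling is why the analysis shifts the AMO together with the cut scores.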
As before, we generate sets of cut scores that lead to equal PPS across grades. Following the algorithm above, each cut score sets a PPS, which in turn determines an AMO. In Figure 6, the vertical axis represents PAYG, the percentage of schools that do not make AYP but do make AYG through the growth model. The darker, black line shows the dependence of PAYG on cut-score rigor. We find that increasing rigor increases the proportion of AYG schools. This parallels the student-level results shown in Figure 3, where increasing rigor also increases the proportion of "on track" students. Reducing the proficiency rate from 90% to 30% increases PAYG from 10% to over 20%.
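The procedure in this paragraph (cut scores set PPS, PPS sets the AMO, and the AMO together with POT sets PAYG) can be sketched end to end as follows. Everything here is hypothetical: the simulated scores are invented for illustration and are not the state's data, the 2-year trajectory rule is an assumed extension of the Figure 1 line, and the equal-PPS cut scores rely on an assumed 20-point mean gain per grade with a common standard deviation. The sketch reproduces the mechanics only, not the values reported in Figure 6.

```python
import random
from statistics import NormalDist
from typing import Dict, List, Tuple

Cohort = List[Tuple[float, float]]  # (Grade 3, Grade 4) score pairs


def school_rates(cohort: Cohort, c2: float, c3: float,
                 c4: float) -> Tuple[float, float]:
    """(PPS, POT) in percent; POT counts nonproficient on-track students."""
    n = len(cohort)
    prof = sum(x2 >= c2 for _, x2 in cohort)
    # Assumed 2-year horizon: X1 + 2*(X2-X1) >= c3 or X1 + 3*(X2-X1) >= c4.
    pot = sum(x2 < c2 and (2 * x2 - x1 >= c3 or 3 * x2 - 2 * x1 >= c4)
              for x1, x2 in cohort)
    return 100.0 * prof / n, 100.0 * pot / n


def payg(schools: Dict[str, Cohort], c2: float, c3: float, c4: float) -> float:
    """Percent of schools missing AYP but making AYG, with a PPS-linked AMO."""
    stats = [school_rates(c, c2, c3, c4) + (len(c),) for c in schools.values()]
    # AMO: PPS of the school at which 20% of enrollment is reached.
    total = sum(n for _, _, n in stats)
    cumulative, amo = 0, 0.0
    for pps, _, n in sorted(stats):
        cumulative += n
        if 100.0 * cumulative / total >= 20.0:
            amo = pps
            break
    growth_only = sum(pps < amo <= pps + pot for pps, pot, _ in stats)
    return 100.0 * growth_only / len(stats)


if __name__ == "__main__":
    # Hypothetical data: 200 schools, school effects plus student noise;
    # Grade 4 scores average 20 points above Grade 3 scores.
    rng = random.Random(0)
    schools: Dict[str, Cohort] = {}
    for s in range(200):
        effect = rng.gauss(0, 15)
        cohort = []
        for _ in range(rng.randint(30, 120)):
            x1 = rng.gauss(300 + effect, 40)
            cohort.append((x1, x1 + rng.gauss(20, 20)))
        schools[f"school{s}"] = cohort
    # Assumption: grade means rise 20 points per grade with SD about 47,
    # so cut scores at a common z give roughly equal PPS in every grade.
    for target in (90, 70, 50, 30):
        z = NormalDist().inv_cdf(1 - target / 100)
        c2, c3, c4 = (320 + 20 * k + 47 * z for k in range(3))
        print(f"PPS {target}%: PAYG {payg(schools, c2, c3, c4):.1f}%")
```

Because the AMO is recomputed from the PPS at every cut score, the sweep isolates the effect of rigor in the same spirit as the black line in Figure 6; adding a fixed offset to `amo` inside `payg` would mimic the gray line.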
The lighter, gray line reflects the result of adding 5 percentage points to the AMO, effectively raising the minimum standard for schools while keeping all other data constant.

Under NCLB and the GMPP, all AMOs are required to rise from their baseline values to 100% by 2014. The gray line illustrates the results of an increase in the AMO with no corresponding change in state proficiency rates. While this would decrease PAYP, Figure 6 shows that it would increase PAYG by 5 to 10 percentage points. The position of the gray line above the black line shows that, if AMOs increase over time while proficiency rates stagnate, the impact of growth models on school accountability decisions may become even greater.
It may seem surprising that PAYG never dips below 10%. Figure 6 seems to suggest that all growth-model states should have shown differences for at least 10% of schools after implementation. Instead, many states observed changes at only a handful of schools (Klein, 2007). A more complete picture of the interaction between AMOs and cut-score rigor may help to resolve the apparent conflict between Figure 6 and real-world findings. Although the AMO was federally mandated at the advent of NCLB, it has since become uncoupled from PPS. PPS statistics track observed student achievement annually, whereas AMO trajectories are set by state policy, increasing from their baseline values to 100% in 2014. When AMOs and PPS become uncoupled, even more dramatic dependencies can emerge.
Figure 7 displays the results for fixed AMOs of 50% and 70%, levels that represent points along most state AMO trajectories toward 100% by 2014. The results show that the impact of a growth model on a state's schools can be both very large and strongly dependent on cut-score attributes. A solid line at the 65% mark illustrates a scenario in which a state sets cut scores such that 65% of its students are proficient in every grade. For this state, the GMPP would positively affect 11% of schools if the state AMO were 50% and 31% of schools if the state AMO were 70%. Figure 7 shows that states will have particularly large growth-model benefits when the PPS is just below the AMO. In these cases, a large number of schools will be on the AMO bubble, and adding POT to the calculation results in a larger proportion of AYG schools. If proficiency rates rise faster than the baseline AMO, Figure 7 shows that PAYG is expected to be quite low. For example, if 75% of students are proficient and the AMO is set to 50%, the empirical results from this state show only 5% of schools making AYG.
Together, Figures 6 and 7 support the following observations about the potential impact of the GMPP on states as a function of their PPS levels and AMOs. First, states whose PPS far exceeds their AMO are likely to experience little benefit from the GMPP. Second, the impact of the GMPP will be greatest, as measured by the peaks in Figure 7, when the PPS and AMO are a) similar and b) in the middle range of percentages. When both the PPS and AMO are large, the high PPS suppresses POT (see Figures 2 and 3) and thus suppresses PAYG. Third, for states whose AMO trajectories rise to meet and then surpass their PPS levels, the impact of the GMPP will rise and then fall. This last pattern will be realized by many states should NCLB reach its endgame. All of these observations are best described as straightforward consequences of a cut-score-based growth model, not as meaningful differences in amounts of growth across states or over time.

Generalizing Findings to Alternative Growth Model Policies
The number of variables involved in an operational state growth model is far too large to explore all possible interactions between cut-score attributes and policy decisions. To this point, we have described how cut-score attributes affect one particular growth model approach, the gain-score or trajectory growth model, in combination with a time-horizon factor. In this section, we briefly discuss whether the cut-score dependencies of this model generalize to three alternative implementations of growth models.

The growth model described to this point can only help students and schools. The inequalities displayed in Table 1 leave all proficient students in place and can only add "on track" students to school accountability calculations. An alternative formulation may choose to penalize proficient students who are not on track to proficiency. Under this formulation, all students, regardless of their status, must be making gains that show them as "on track." The student-level and school-level rules reduce to: if $X_1 + 2(X_2 - X_1) \geq c_3$, then "on track"; and, if POT ≥ AMO, then AYG. In this model, proficiency and AYP are irrelevant as long as students have growth data. This model can be visualized in Figure 1 by extending the diagonal lines through to the other side of the $X_2 = X_1$ diagonal. Everyone above these lines is "on track in N years," and everyone below these lines is not. The net change between this growth model and a conventional status model must incorporate both the addition of the shaded triangles already highlighted in Figure 1 and the subtraction of new triangles bordered by $X_2 = c_2$ and the diagonal lines on the right side of the graph. These new triangles contain currently proficient students who are not on track to proficiency and who would be classified as effectively nonproficient under this growth model.
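The netting of triangles described above can be made concrete with a short sketch that uses only the 1-year line. The function names and example scores are hypothetical; a student counts as "gained" when on track but not proficient (a shaded triangle in Figure 1) and as "lost" when proficient but not on track (one of the new triangles).

```python
from collections import Counter
from typing import List, Tuple


def region(x1: float, x2: float, c2: float, c3: float) -> str:
    """Locate a student in Figure 1 relative to the 1-year trajectory line."""
    proficient = x2 >= c2
    on_track = x1 + 2 * (x2 - x1) >= c3
    if on_track and not proficient:
        return "gained"  # shaded triangle: credited only through growth
    if proficient and not on_track:
        return "lost"    # new triangle: proficient but not on track
    return "same"        # classification unchanged by the growth rule


def net_change(scores: List[Tuple[float, float]], c2: float, c3: float) -> float:
    """Percentage credited under "growth for all" minus under status (PPS)."""
    counts = Counter(region(x1, x2, c2, c3) for x1, x2 in scores)
    return 100.0 * (counts["gained"] - counts["lost"]) / len(scores)


scores = [(340, 330), (345, 325), (260, 300), (200, 205)]
print(net_change(scores, c2=320, c3=340))  # -25.0
```

With these four hypothetical students, the model that requires every student to be on track credits one fewer student than the status model, so the net change is negative.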
The growth model would clearly no longer have a purely positive effect, but we argue that cut-score dependencies would remain. Keeping the triangles on both sides of Figure 1 in mind, we can see that low cut scores would actually lead to a net negative impact on states, as many proficient students would be classified as "not on track." Higher cut scores would allow the positive effects of growth models to become more salient. Thus, growth models that apply to all students regardless of their current proficiency status should be expected to have a less positive impact but to remain dependent on cut scores.

The second alternative policy model that we consider is regression-based. Regression-based models use data from previous longitudinal cohorts to generate prediction equations. If a student's scores from a current cohort are substituted into the prediction equation and the equation returns a score above a future cut score, that student may be deemed "on track." Regression-based models remain interpretable within the framework of Figure 1. The diagonal lines in Figure 1, for example the "on track in 1 year" line, $X_1 + 2(X_2 - X_1) = c_3$, are of exactly the same form as a regression-based prediction line. In fact, if all of the parameters of the multivariate normal model stayed the same, and the Time 1 to Time 3 correlation were set to 0.3 (admittedly a low value), the regression-based model would return a line with identical slope and an intercept around 15 points below the "on track in 1 year" line in Figure 1. The resulting cut-score dependencies would take a form similar to the ones we have shown here. The impact of a regression-based growth model on POT depends critically upon the parameters of the distributions and thus on the slope of this prediction line. Under different parameters, the regression line will change in slope, but its intercept will remain referenced to the future proficiency cut score. The framework in Figure 1 still applies: raising cut scores still leaves more nonproficient students eligible to be classified as "on track," and raising future cut scores (increasing cut-score variability) will continue to decrease POT by raising the intercept of the regression line. Cut-score dependencies may therefore take a different form but are not rendered negligible by regression-based growth models.
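A minimal sketch of a regression-based projection, assuming (as one common choice, not necessarily any state's operational model) that a prior cohort's Time 3 score is regressed on its Time 1 and Time 2 scores by ordinary least squares with NumPy. The helper names and data are hypothetical. The demonstration constructs a prior cohort whose Time 3 scores lie exactly on the "on track in 1 year" line, so the fitted equation reproduces that line.

```python
import numpy as np


def fit_projection(x1, x2, x3) -> np.ndarray:
    """OLS coefficients (b0, b1, b2) for X3_hat = b0 + b1*X1 + b2*X2,
    estimated on a prior longitudinal cohort (hypothetical helper)."""
    design = np.column_stack([np.ones(len(x1)), x1, x2])
    coef, *_ = np.linalg.lstsq(design, np.asarray(x3, dtype=float), rcond=None)
    return coef


def on_track_regression(coef: np.ndarray, x1, x2, c3: float) -> np.ndarray:
    """A current-cohort student is "on track" if the prediction reaches c3."""
    pred = coef[0] + coef[1] * np.asarray(x1, float) + coef[2] * np.asarray(x2, float)
    return pred >= c3


# If the prior cohort followed X3 = X1 + 2*(X2 - X1) exactly, OLS recovers
# b0 = 0, b1 = -1, b2 = 2: the "on track in 1 year" line of Figure 1.
x1 = np.array([300.0, 310.0, 280.0, 295.0])
x2 = np.array([320.0, 315.0, 300.0, 330.0])
coef = fit_projection(x1, x2, x1 + 2 * (x2 - x1))
print(np.round(coef, 6))
print(on_track_regression(coef, [250.0, 200.0], [300.0, 205.0], c3=340.0))
```

Setting the fitted prediction equal to c3 gives a line in the (X1, X2) plane of Figure 1; the future cut score enters only through that line's intercept, which is why raising c3 shifts the boundary without changing its slope.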
The third alternative policy we consider is the value-table, or categorical, growth model. Such models use multiple cut scores within a given grade to classify students, for example, into Below Basic, Basic, Proficient, and Advanced categories. Student transitions across category boundaries may receive some form of credit for schools. This model may also be visualized in Figure 1. Instead of diagonal lines, categorical growth models add vertical and horizontal lines corresponding to the cut scores separating, for example, the Below Basic and Basic categories in each grade. Instead of shaded triangles above the main diagonal that identify students receiving credit, categorical growth models will have shaded rectangles that are weighted by certain values.
Categorical growth models will therefore exhibit patterns of cut-score dependence similar to those of their gain-score counterparts. Time horizons generally do not apply in categorical growth models. Cut-score articulation becomes a more complex concept as multiple cut scores interact within and across grades; however, raising higher-grade standards relative to lower-grade standards will still decrease POT. Finally, raising cut scores increases the number of nonproficient students who fall within the rectangles. Across all of these alternative policy models, Figure 1 helps to demonstrate that cut scores will have a strong and often confounding effect on growth-based classifications and decisions.
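A short sketch of a value-table model. The four categories follow the example in the text, but the cut scores and the credit weights in `VALUE` are invented for illustration; operational value tables differ by state. Each (prior category, current category) pair corresponds to one rectangle in the Figure 1 grid.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple

CATEGORIES = ("Below Basic", "Basic", "Proficient", "Advanced")

# Hypothetical value table: VALUE[prior][current] is the credit a school
# receives for one student moving between categories (weights invented).
VALUE = (
    (0.0, 0.5, 1.0, 1.0),   # prior Below Basic
    (0.0, 0.25, 1.0, 1.0),  # prior Basic
    (0.0, 0.0, 1.0, 1.0),   # prior Proficient
    (0.0, 0.0, 1.0, 1.0),   # prior Advanced
)


def category(score: float, cuts: Sequence[float]) -> int:
    """Index into CATEGORIES; a score equal to a cut falls in the higher one."""
    return bisect_right(cuts, score)


def value_table_index(scores: List[Tuple[float, float]],
                      cuts_prior: Sequence[float],
                      cuts_current: Sequence[float]) -> float:
    """Mean credit per student, in percent, for one school's cohort."""
    total = sum(VALUE[category(x1, cuts_prior)][category(x2, cuts_current)]
                for x1, x2 in scores)
    return 100.0 * total / len(scores)


scores = [(240, 280), (280, 330), (310, 300), (360, 380)]
print(value_table_index(scores, [250, 300, 350], [270, 320, 370]))  # 62.5
print(value_table_index(scores, [250, 300, 350], [280, 340, 390]))  # 43.75
```

Raising the Grade 4 cut scores in the second call moves two students into lower-credit cells and lowers the index from 62.5 to 43.75, with the same score data.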

Discussion and Conclusions
This paper has provided a framework for quantifying the cut-score dependence of growth-model results. We point out that the proportion of students credited by growth models should be larger for states with more rigorous cut scores, primarily because of the increased proportion of nonproficient students eligible for growth calculations. We also show that this credited proportion is smaller for states whose distances between cut scores increase up the vertical scale (declining PPS across grades) and larger for states whose distances between cut scores decrease up the vertical scale (increasing PPS across grades). We show that the expected increase in credited students under different time horizons is much larger when moving from a 1-year to a 2-year horizon than it would be when moving from a 4-year to a 5-year horizon. Finally, we demonstrate that the proportion of schools benefited by a growth model will generally increase with more rigorous cut scores. However, the effects ultimately depend on an interaction between cut scores and the state AMO: states with PPS close to their AMO will have the greatest proportion of benefited schools, particularly when that AMO is not too high.
Under the GMPP, more rigorous cut scores result in greater proportions of students and schools credited as "on track" and meeting federal goals. In this way, the GMPP provides a modicum of balance to the challenges states face under NCLB. Policymakers adopted cut scores amid a tension between national reform efforts demanding high standards and NCLB's challenging goal of universal proficiency by 2014; more rigor satisfies the former, while less rigor supports achievement of the latter. The result has been a wide range of proficiency results across states that is partially, if not mostly, explained by differences in rigor (Braun & Qian, 2005; McLaughlin & Bandeira de Mello, 2003). States adopting more rigorous cut scores face a greater baseline challenge than states adopting less rigorous cut scores. Our results indicate that the use of this type of growth model offsets the difference between the challenges facing these states.
However, the findings of this study indicate that interpreting GMPP statistics as reflecting "growth," per se, is inaccurate or at least incomplete. Two hypothetical states with the same baseline level of student achievement, that adopt the same growth model at the same time, and that experience the same increase in student achievement over time will nonetheless show different proportions of students and schools classified as "on track" as a direct result of different levels of rigor, different patterns of vertical articulation of performance standards, or different time horizons.
Consequently, growth results are only as comparable across states as their cut-score rigor, articulation, and time horizons are comparable across states.

The interpretation of NCLB results would have been enhanced had cut scores been more consistent across states in the baseline year; relative progress toward meeting standards would then have had a common frame of reference. This did not occur, and across-state comparisons of results that rely on this assumption may encourage unwarranted substantive conclusions about state differences. Instead, these differences are more appropriately attributed to the cut-score features that generated them. Interpreting the differences illustrated in Figures 2 through 7 as meaningful differences in student growth for different states would be a mistake, given that the underlying bivariate score distributions were exactly the same. It is the dependence of the growth-to-proficiency metric on cut-score attributes that produced these differences. This degree of conflation between cut scores and growth results may be an undesirable attribute for models adopted under federal educational accountability systems.
Policymakers and standard setters should be aware that decisions they make about proficiency standards have a direct, dramatic, and, as we show, predictable impact on growth results in a growth-to-proficiency framework. For example, in 2001, Colorado adopted cut scores with an approximately equal percentage of proficient students across grades for its NCLB Reading assessment, whereas, in 2004, Indiana adopted Reading cut scores such that rigor increased and the percentage of proficient students decreased across grades. The results of this paper indicate that the adoption of a projection growth model would likely result in more schools classified as making AYG in Colorado than in Indiana, although the relative rigor of the cut scores would moderate the effects.
To conclude, our concerns are twofold. First, the cut-score dependencies demonstrated here are substantial. Statistics of interest such as POT and PAYG can swing by dramatic amounts over plausible alternative cut scores and time horizons. These findings should be seen as parallel to, but contrasting with, those of Allen, Briggs, Weeks, and Wiley (2008, this issue). They address the impact of scaling and linking methods, whereas we address cut-score and policy attributes. The factors they investigate can be seen in Figure 1 as influencing the bivariate scatterplot, whereas the factors we investigate influence the grid overlaying the bivariate scatterplot. Clearly, both sets of factors will affect policy-relevant outcomes.
Second, reporting growth from within a proficiency-based, and therefore cut-score-dependent, framework makes it difficult to satisfy two important requirements: transparency and parsimony. The degree of cut-score dependence is too substantial to disentangle growth from cut-score attributes while still allowing growth-related interpretations. This does not mean that growth-to-proficiency models should be discarded as a possible tool of policy. However, it does require an honest recasting of results, not as measures of growth but as indicators of progress toward a very particular standard.
One strategy for defensible interpretations involves separating growth-based accountability and growth-based reporting onto two parallel tracks. Toward the latter goal, a more straightforward method of encouraging accurate growth interpretations may follow from setting norm-referenced standards for growth (Betebenner, 2008, this issue). As these efforts progress, this paper stands as a reminder that the impact of the GMPP on state accountability metrics will be heavily moderated by cut-score attributes in systematic and predictable ways.

Figure 1. Theoretical growth framework with shaded areas indicating students "on track" to

Figure 2. Theoretical dependence of the percentage of "on track" students on cut-score choice. Figure 2 has striking implications for comparisons of growth-model results across states.

Figure 3. Theoretical dependence of "on track" students on the articulation of standards as indexed by the change in proficiency rates by grade, ordered from decreasing to increasing rigor.

Figure 5. Empirical dependence of "on track" students on the articulation of standards as indexed by the decrease or increase of proficiency rates across grades.

Figure 6. The empirical dependence of the percentage of "growth" schools on cut-score rigor with a fixed AMO and with an AMO raised by 5 percentage points. The figure shows results for a time horizon of 2 years and a minimum grade size of 30.

Figure 7. The empirical dependence of the percentage of "growth" schools on cut-score rigor with two fixed AMOs of 50% and 70%. The vertical axis is the Percentage of Schools Avoiding Sanctions by Growth (PAYG). A state with 65% proficiency is marked for illustration.