Private and Public Performance Reports as Drivers of Performance and Determinants of Performance Measure Information Content A thesis presented by Henry Christian Eyring In partial fulfillment of the requirements for the degree of Doctor of Business Administration Harvard University Graduate School of Business Administration Cambridge, Massachusetts April 2017 © 2017 Henry Eyring All Rights Reserved Professor Dennis Campbell Henry Eyring Private and Public Performance Reports as Drivers of Performance and Determinants of Performance Measure Information Content Abstract This dissertation addresses how private and public performance reports affect performance and the information content of performance measures. First, I show how disclosing consumer ratings to the general public affects performance and biases raters. Using data from a health care system, I find that publicly disclosing patient ratings of physicians leads to: 1) performance improvement by the ratings and by objective measures of quality, and 2) a bias among raters, who positively weight a physician’s published average rating in deriving subsequent ratings for the physician. To understand the moderating effects of public attention, I use variation in web traffic to a physician’s disclosed rating. I find evidence consistent with public attention reinforcing raters' bias toward concurring with a physician’s published average rating, thus impeding rating improvement. Within a national distribution of ratings, the disclosure leads to an improvement in ratings by 17 percentile points and a bias in a given physician’s ratings toward his or her published average rating by 24 percentile points. These findings demonstrate that consumer- rating disclosure is a means of performance management, and that resulting bias is a reason to interpret subsequent trends in ratings as understated signals of trends in service. The second section of the dissertation shows an understudied and low-cost way of customizing private performance reports to best drive reported performance, and warns that the private reporting causes reported performance to diverge from unreported performance. A field experiment reveals the performance benefit of customizing a private performance report to include the peer-performance reference point that will most motivate improvement. The below- iii. average performers improve most when shown the median as a reference point. The 50th-75th percentile performers improve most when shown the top-quartile as a reference point. The top- quartile performers improve most when shown the top-quartile as a reference point, but only when reported performance is outcome-based as opposed to process-based. Neither the median nor top-quartile reference point has a more positive performance effect overall. With regard to the performance measure’s content, privately reporting a measure causes the measure to become less correlated with unreported performance. These findings have the following implications. First, the optimal reference point for peer performance comparison depends on 1) an individual’s initial performance relative to each reference point, and 2) whether the performance measure regards an outcome or process. Second, a performance measure, once reported, becomes a less informative signal of unreported performance. iv. Acknowledgements   I am profoundly grateful for my mentors Dennis Campbell, V.G. Narayanan, Srikant Datar, and Ananth Raman. I have never associated with finer minds or more dedicated advocates. 
They have honed my rough ideas to inform challenges in the effective application of accounting and management. They have offered patient, detailed instruction, and have gone to extraordinary lengths in providing me with opportunities. I have often watched a mentor go out on a limb to persuade organizations and colleagues to provide access to resources for research. In asserting my value, they both risked their reputations and provided a model for mentorship. They saw me as someone I had yet to become, but that they were sure I could, and then guided me amid the inevitable starts and stops along the way. I drew my writing style from Dennis Campbell, and have walked through each sentence of dissertation chapters with his guidance. He has methodically provided me with traction in developing as a researcher since I arrived in the program. V.G. has treated me like a son. He has carried me on his shoulders at points in navigating interactions with field sites, journals, and colleagues. I could go to him for anything, and I feel that I have. Srikant Datar has introduced me to many executives in health care, providing me access to the majority of the data used in my research. He also connected me to Bob Kaplan, who paved a relation that afforded the data for my job market paper. Srikant and Bob are renowned among business executives in part because of the applicability of their scholarship, and they have each met with me many times to help me target my research toward the most challenging dilemmas in health care. Ananth Raman has left me with a solid vision of what I want to do as an academic and as a provider for my family. That comes in part through his direct advice, which he has generously v. offered since my first day of class at Harvard Business School, and just as much through observing him. Numerous organizations have invested heavily in providing me access to data and the ability to test interventions through field experiments. Vivian Lee and her colleagues at University of Utah Health Care, and Heather Sternshein and her colleagues at HarvardX, are foremost among these. Each organization has served as an institution of my education to as great a degree as has Harvard. They have extended the greatest patience in allowing me to learn and conduct interventions and analyses. They are angel investors in my research and I would not have any of the results in this dissertation or nearly any other projects without their willingness to give me multiple chances. I have relied heavily on the guidance of Kim Clark, Clark Gilbert, and Clayton Christensen since before my program began. They directed and contributed to my preparation for the program to an immeasurable degree. They have also provided invaluable counsel on developing a pipeline that forms a cohesive line of inquiry. My sincerest thanks go to my father, mother, and my wife. If my father were a DBA, he would have accomplished more than me. My work is in many regards a watered-down version of his insight put in writing. We speak by phone for hours every week. He has treated me as a peer and equal since I was a toddler, and so we never had to transition to being colleagues. My mother is first-rate in her interpersonal skills. I try to treat people the way that she does and taught me to. That has formed a basis for friendships and collaborations in the program. From a distance and on visits she has carefully tended to my temporal and spiritual needs. My wife has walked each step of the way with me, through hills and valleys. 
When I am sleepless, she is too. When something works out well for me, she seems happier than I am. She has vicarious relationships with all of my colleagues through me. I will strive to be as selflessly invested in her efforts to fulfill her dreams as she is in mine.

Table of Contents
1. Introduction....1
2. Disclosing physician ratings: performance effects and the difficulty of altering ratings consensus....3
2.1 Introduction....3
2.2 Theory and motivating literature....8
2.3 Setting....15
2.4 Data....18
2.5 Analysis....24
2.6 Conclusion....47
3. Performance effects of setting a high reference point for peer performance comparison....50
3.1 Introduction....50
3.2 Theory and hypothesis development....55
3.3 Methodology....65
3.4 Analysis....67
3.5 Discussion....88
3.6 Conclusion....102
4. Conclusion....104
Appendix A....113
Appendix B....113
Appendix C....114
Appendix D....118

CHAPTER 1
INTRODUCTION

Though health care costs are soaring, so is the availability of cost measures for tracking and managing costs. Though structural gaps between job growth and unemployed workers' skillsets are expanding, so is the availability of student behavior measures for tracking, customizing, and improving performance toward acquiring a skill. In education, health care, and other industries with vexing problems, then, the value of aptly applied measurement has perhaps never been greater. As "specialists of measurement," accounting scholars have an instrumental role to play in solving the great economic conundrums of our time (Van der Stede [2015]). Measurement, though, is not a one-dimensional management tool that can simply be ramped up to meet management problems in the same way that supply can be ramped up to meet demand. Measures can be ill-suited to a goal, backfiring by misdirecting effort. They can even discourage effort altogether, or worse, invite cheating. Sad examples range from multinational frauds, such as those at Toshiba and Wells Fargo, to cheating in school districts such as Atlanta and Washington, D.C. Those examples punctuate the need for informed measurement. This dissertation is an effort to inform the producers and users of measurement systems so that they capture the most upside and the least downside. It begins, in Chapter 2, with an analysis of physician-rating disclosure. I look at performance effects as well as a cost in terms of information content: performance by both reported and unreported measures improves, but the subjective ratings, once disclosed, become sticky around their previously published values. This suggests a reason for companies to disclose their employees' customer ratings, as well as a reason to then interpret changes in ratings as understated due to their stickiness toward past values. Without reporting the ratings, organizations pass up a low-cost opportunity to drive performance.
Without updating the interpretation of ratings, organizations and consumers lose an opportunity to identify and reward changes in performance.

Chapter 3 draws on a study with V.G. Narayanan and shows a performance benefit of privately disclosing to online education students their performance relative to standards of peer performance. Though a number of studies have found that displaying such reference points drives performance, little to no research informs which reference point to display. We provide such evidence, finding that the optimal performance reference point to display to an individual depends on that individual's initial performance relative to the reference point. We also find a cost in terms of information content. The reported measure becomes less correlated with unreported measures of performance. This is a cost in terms of performance inference—the reported measure can no longer be seen as so indicative of unreported, and perhaps unmeasured, performance. It is also a cost if performance by the reported measure is desirable when it comes along with important related performance—such as a grade coming with thorough engagement in the course, which we find is less common after privately reporting to individuals their grades relative to a peer standard. We thereby offer guidance for organizations to maximize performance by a reported measure through providing the appropriate reference point to each individual. We also raise the caveat that organizations should then monitor unreported performance to make sure that it does not lag behind reported performance in a way that undermines the value of improvement by the reported measure.

CHAPTER 2
DISCLOSING PHYSICIAN RATINGS: PERFORMANCE EFFECTS AND THE DIFFICULTY OF ALTERING RATINGS CONSENSUS

2.1 INTRODUCTION

This study investigates the effects of a health care system disclosing patient ratings of its physicians to the public. Health care systems, hotel chains, and universities are among the many organizations disclosing consumer ratings.1 Though studies find that other types of disclosure drive performance, consumer-rating disclosure is distinct in that it reveals subjective ratings to subsequent raters.2 Behavioral economic mechanisms free to operate under those conditions may bias ratings toward the published consensus (Furnham and Boo [2011], Tversky and Kahneman [1975]), which would dull the sensitivity of ratings to effort and hamper their improvement (Banker and Datar [1989]). Each end of the resulting theoretical tradeoff, whereby consumer-rating disclosure may elicit improvement despite weighting consumer ratings toward the published consensus, is relevant to research on performance disclosure and on the effective use of consumer ratings.3 Evidence regarding this tradeoff, though, is lacking.

I assess the predicted tradeoff and its dynamics empirically using data from the disclosure of physician ratings at University of Utah Health Care (UUHC).4 UUHC offered research access to visit-level data from millions of patient visits occurring over more than three-and-a-half years. The tests herein exploit variation in whether and when physicians became subject to the

1 For examples of consumer-rating disclosure by such institutions, see Cleveland Clinic [2016], Stanford Health Care [2016], Starwood [2016], Marriott [2016], Holiday Inn [2016], Columbia University [2016], and Texas Tech University [2016].
2 Bennear and Olmstead [2008], Jin and Leslie [2003], Lu [2012], and Chatterji and Toffel [2010] address public performance disclosure by third-party evaluators, and do not indicate that the disclosure alters the information available to evaluators. Further, with the exception of Chatterji and Toffel [2010], the disclosed measures are objective in nature (though Jin and Leslie [2003] address restaurant hygiene grades, they state that the grades’ “subjective component has been removed” since before effect estimation). 3 See Leuz and Wysocki [2016] for a survey of performance disclosure literature, and Kaplan and Norton [2005] and Luca [2016] regarding uses of consumer ratings. 4  “Physician ratings” is one of the popular terms used to describe patient ratings of physicians (e.g., Glover [2014]), and I adopt this terminology.   3 disclosure. A generalized difference-in-differences approach pools estimates from the disclosure intervention’s staggered implementation. Such pooling, used in prior research on health care disclosure, increases the estimates’ precision and robustness (Dranove, Kessler, McClellan, and Satterthwaite [2003], Duflo [2002]). The sample’s substantial time range allows validating the assumption of parallel dependent-variable trends. Physician fixed-effects control for static differences among physicians included in and excluded from the disclosure. Robustness tests, including propensity-score matching, suggest that demographic differences among physicians do not drive results. Data on patient characteristics, including multiple measures of underlying patient health, allow controlling for patient mix through means consistent with leading health economics research (e.g., Chandra, Gruber, and McKnight [2010], Dafny [2005], Doyle Jr. [2011]). Isolating changes in performance from changes in patient mix is critical in assessing health care disclosure’s performance effects (Dranove et al. [2003]). With this identification strategy, I address the noted theoretical tradeoff of consumer-rating disclosure. On one end of the tradeoff, consumer-rating disclosure may elicit performance improvement. This result is not theoretically straightforward. Bias among raters after viewing disclosed ratings could dull the sensitivity of ratings as performance measures, deterring and/or misdirecting service providers’ effort toward rating improvement. Regarding effort deterrence, the economically optimal level of effort to exert toward improving a measure declines with a reduced sensitivity of the measure to effort (Banker and Datar [1989]). Regarding effort misdirection, rater bias would obscure responses of ratings to fundamental changes in service, inhibiting physicians’ ability to learn through trial-and-error (Campbell, Epstein, and Martinez- Jerez [2011]). Extant disclosure literature shows performance effects in settings wherein a bias in performance measures toward published values, and related performance-impeding forces, are 4 not mentioned and are relatively unlikely.5 The current study is unique in testing consumer-rating disclosure’s performance effects, and establishing their persistence among ratings’ bias toward the published consensus. A bias in consumer ratings toward the published consensus rating constitutes the other end of the predicted tradeoff. 
I term this a “consensus-bias” effect of disclosing consumer- ratings, and provide the first substantiating evidence.6 Raters subject to consensus bias would positively weight the published consensus rating in forming their own ratings. The effect may result from the anchoring and adjusting heuristic among raters, whereby an initially displayed reference point attracts subsequent estimates toward itself. However, anchoring has primarily been established under the conditions of an arbitrary reference point and estimation regarding a topic that an individual has limited familiarity with (Furnham and Boo [2011], Tversky and Kahneman [1975]). Whether anchoring applies when informative reference points – actual prior consumer ratings – are available to an individual who has certainly experienced the subject they are evaluating is unclear. The relevance of prior consumer ratings would plausibly make them more salient, and thereby influential, as subconscious reference points. Direct and recent experience with the subject of rating, though, may facilitate forming one’s independent rating (Muchnik et al. [2013]). In addition to anchoring, social herding may yield consensus bias. However, related theory hinges on individuals feeling uncertain and assuming that others are 5 Bennear and Olmstead [2008] assess disclosure of water safety violations, Jin and Leslie [2003] restaurant hygiene grades whose “subjective component has been removed” since before effect estimation, and Lu [2012] percentages of nursing home patrons with various health problems. The objective nature of these measures counters the use of subjectivity among evaluators to produce biased evaluations. Further, in each of those studies and in Chatterji and Toffel [2010], regarding subjective rating disclosure, the disclosures are from third-party evaluators to the public. The studies do not mention the disclosure altering the information available to evaluators, nor potentially biasing performance evaluations toward published values. 6 A study of Amazon reviews estimates unbiased scores, but does not posit or assess whether bias is toward or away from the published consensus (Sikora and Chuahan [2012]), a lab study finds mixed evidence that consumers who disagree with a consensus rating amplify their rating in the direction of disagreement (Eryarsoy and Piramuthu [2014]), and a third study shows a net effect of an arbitrarily assigned “thumbs-up,” but not of a “thumbs-down,” on average ratings of web content (Muchnik, Aral, and Taylor [2013]). 5 better informed, which may not transfer to the case of a direct consumer evaluating their consumption experience (Baddeley [2010], Keynes [1930]). Further, theories of consumer behavior suggest that some consumers who disagree with published rating exaggerate their ratings in the direction of disagreement, and a lab study finds mixed evidence of this (Eryarsoy and Piramuthu [2014]). The current study extends three streams of research. The first is research on performance effects of disclosure (e.g., Bennear and Olmstead [2008], Chatterji and Toffel [2010], Jin and Leslie [2013]). I extend this stream to the growing realm of consumer-rating disclosure. I find that consumer-rating disclosure is an effective tool for lifting performance, and that it must overcome consensus bias among raters in that process. The performance effects include improvement by consumer ratings and by objective quality measures. 
Improvement by those nonfinancial performance measures has been shown to drive financial performance, and also to be difficult to achieve through financial contracts.7, 8 I provide the first evidence of the tradeoffs of an alternative approach to incentivizing nonfinancial performance – that of disclosing consumer ratings. Second, this paper speaks particularly to the stream of economic and medical research on the performance effects of health care disclosure. Extant literature in this stream has focused on outcome and safety measures, and has found limited performance effects.9 Physician ratings reportedly garner greater interest among patients than do these other types of health care 7 Ittner, Larcker, and Meyer [2003] report that supervisors tasked with administering balanced-scorecard-based compensation shifted weight away from nonfinancial performance measures, and Bol, Keune, Matsumura, and Shin [2010] report that supervisors are adversely influenced by political concerns in applying nonfinancial performance ratings within compensation. 8  See Chevalier and Mayzlin [2006], Hu, Liu, and Zhang [2008], and Luca [2016] regarding effects of online reviews on revenue, Ittner and Larcker [1998], Banker et al. [2000], and Nagar and Rajan [2005] regarding effects of customer satisfaction on financial performance, and Balasubramanian, Mathur, and Thakur [2005], and Banker et al. [2000] regarding effects of quality on financial performance.     9 See Dranove et al. [2003], Epstein [2006], Ryan, Nallamothu, and Dimick [2012], and Shukla [2013] finding small, inconclusive, or no effects of such health care disclosure initiatives. 6 performance (Brown, Clarke, and Oakley [2012], Dafny and Dranove [2008], Hanauer, Zheng, Singer, Gebremariam, and Davis [2014]). Public attention to disclosure should, in theory, facilitate the disclosure’s performance effects (Kolstad [2013], Parker and Nielsen [2011], Weil et al. [2006]), which raises the possibility that disclosing physician ratings will yield significant performance effects. This paper’s assessment of those performance effects is pertinent in light of the many health care systems, including industry leaders such as Stanford Health Care and Cleveland Clinic, that have recently disclosed physician ratings. This paper extends a third stream of literature, on consumer ratings. Consensus bias would delay ratings in depicting a new service level while the published consensus rating, toward which the ratings would be biased, updated to reflect the new service level. Managers and researchers aware of such a lag in ratings depicting a new service level could better assess the sequence of service improvement, online consumer ratings, and financial performance. Accounting and economic studies note the importance of this sequence to managers (Banker et al. [2000], Banker and Mashruwala [2009], Luca [2016]). Also, principals could account for a lag in ratings’ responsiveness to service-level changes in using the ratings to infer agents’ effort (Lyu, Wick, Housman, Freischlag, and Makary [2013], Ubel [2015]). Such inference is a step in mitigating moral hazard (Hölmstrom [1979]). Finally, consensus bias is relevant to literature on employee evaluation biases in two regards (Bol [2011], Moers [2005], Prendergast and Topel [1993]). First, consensus bias among consumer raters would affect employee evaluations that incorporate consumer ratings. 
Second, supervisors who are aware of a past evaluation consensus regarding an employee may be prone to consensus bias in their own evaluations of the employee (Grote [2005]).

To further advance the noted streams of research, I explore the way in which public attention to disclosure moderates its performance and bias effects. Using data on web traffic to individual physicians' disclosed ratings, I assess evidence for the theory that public attention to disclosure strengthens its performance effects (Kolstad [2013], Parker and Nielsen [2011], Weil et al. [2006]). Further, I show how consumer-rating improvement and the theoretically countering force of consensus bias vary with public attention to disclosed ratings. The results enable managers to account for public attention to disclosure in predicting its performance and bias effects.

2.2 THEORY AND MOTIVATING LITERATURE

Economic theory offers reasons to expect positive performance effects of consumer-rating disclosure. Consumer ratings influence purchasing decisions in many industries, including health care (Chevalier and Mayzlin [2006], Dafny and Dranove [2008], Hanauer et al. [2014], Luca [2016]). If consumer-rating disclosure creates a more competitive market in which favorable ratings lead to revenue, the disclosure should enhance incentives to perform well as measured by the ratings. Jin and Leslie [2003] provide evidence of disclosure's performance effects and attribute the effects to revenue-based incentives. Consumer-rating disclosure may further incentivize performance through forces of social comparison. By facilitating comparison of oneself to peers, disclosure bolsters the self-image of individuals who perform well relative to others (Brown et al. [2012], Smith [2000]). Tafkov [2013] provides lab evidence that disclosure elicits performance improvement via self-image incentives.

However, consumer-rating disclosure may impede performance improvement by biasing raters toward the published consensus rating in forming their own ratings ("consensus bias"). Consensus bias would dull the sensitivity of ratings to effort. Reduced sensitivity of a performance measure diminishes the economically optimal amount of effort to exert toward improvement by that measure (Banker and Datar [1989]). By obscuring the ratings' responsiveness to changes in service practices, consensus bias would also inhibit physicians' ability to detect the success or failure of an attempt to improve service. Accounting literature suggests that this inhibition of trial-and-error learning would have negative performance consequences (Campbell et al. [2011]). Prior literature on disclosure's performance effects has not addressed a disclosure of subjective ratings to raters, nor explored the existence of potentially resultant consensus bias and its implications for performance (e.g., Bennear and Olmstead [2008], Jin and Leslie [2003], Lu [2012]).

Behavioral economic theories inform the likelihood of consumer-rating disclosure yielding consensus bias. The anchoring and adjusting heuristic from behavioral economics, if applicable to consumer ratings, would contribute to such an effect. Tversky and Kahneman [1975] established evidence of this heuristic, whereby exposure to an arbitrary number draws subsequent estimations and predictions toward itself.
Many studies have shown similar results whereby an arbitrary number (e.g., one that an individual sees upon spinning a "wheel of fortune") sways estimation or prediction regarding a subject of limited familiarity to the individual (e.g., the number of nations in Africa) (Furnham and Boo [2011], Tversky and Kahneman [1975]). Consumer ratings, though, are distinct from the reference points in those studies. A published rating consensus is informative, rather than arbitrary, which may strengthen its salience as a subconscious reference point for anchoring. On the other hand, consumers are relatively familiar with the service being rated, and may thereby rely less on a subconscious anchor to arrive at a rating. Indeed, Muchnik et al. [2013] find evidence that arbitrarily assigning a single negative rating to user-contributed online content does not affect the average favorability of subsequent ratings, although it is unclear whether this holds for representative consensus ratings. The theoretical tension and extant empirical evidence do not allow a strong prediction of the application of anchoring to consumer-rating disclosure.

Social herding may also contribute to a consensus-bias effect. Social herding theory explains how members of a peer group converge toward the same decision by observing and conforming to others' decisions (Baddeley [2010]). Keynes [1930] applied this theory to understand speculation-driven financial market events. He argued that individuals conform to others' decisions because they 1) feel uncertain about the decision, and 2) assume others are better informed. The influence of consumer ratings on purchasing decisions speaks to consumers' likelihood of assuming others are well informed (Chevalier and Mayzlin [2006], Hanauer et al. [2014], Luca [2016]). However, a consumer may not assume that others are better informed than he or she is, especially if the consumer views the rating as regarding his or her personal experience. Models of consumer behavior, and mixed lab evidence, suggest that consumers may exaggerate a rating after seeing one that they disagree with (Eryarsoy and Piramuthu [2014]). As with anchoring, it is not ex-ante clear whether social herding will apply to consumer-rating disclosure.

This paper advances six hypotheses. Hypotheses 1-3 state the predicted tradeoff whereby physician-rating disclosure yields performance improvement despite creating a consensus bias. Hypotheses 1 and 2 regard performance improvement, and Hypothesis 3 regards consensus bias.

H1: Physician-rating disclosure positively affects ratings.

H2: Physician-rating disclosure positively affects objectively measured care quality.

H3: Physician-rating disclosure biases subsequent raters toward the published consensus.

Testing these hypotheses extends three streams of literature. First, it informs economics-based research that assesses the performance effects of disclosure. Although consumer-rating disclosure is quickly spreading in service industries, little to no research has addressed its performance effects. H1 and H2 predict a performance improvement effect of physician-rating disclosure. The testing of H3 will indicate whether that effect occurs despite a consensus-bias effect.10 Further, I test for performance effects on customer satisfaction as well as on objective quality measures.
Customer satisfaction and product/service quality generally lead to better financial performance, and accounting research documents difficulty in incentivizing such nonfinancial performance measures through financial contracts. 11 For instance, supervisors manipulate evaluations to avoid political costs, and exhibit an aversion to basing compensation on nonfinancial measures (Bol et al. [2010], Ittner et al. [2003]). The current study assesses consumer-rating disclosure as an alternative means of incentivizing nonfinancial performance measure improvement. A second stream of literature that this study extends, through the tests of H1 and H2, is that of health care disclosure research, which has so far found generally disappointing effects 10 Positive time trends in the consumer ratings, shown in Table 3 Column 3, cause consensus bias to counter rating improvement in actuality and as estimated via difference-in-differences.       11 Ittner and Larcker [1998] find evidence of customer satisfaction leading to financial performance and higher market valuations in a nonlinear manner, Nagar and Rajan [2005] find a strong relationship between customer satisfaction and future profits, and Banker et al. [2000] find positive effects of both quality and customer satisfaction on financial performance. Also, high levels of customer satisfaction that are publicly visible via awards and/or online ratings have been shown to drive financial performance and market returns (Balasubramanian et al. [2005], Chevalier and Mayzlin [2006]). 11 (Dranove et al. [2003], Epstein [2006], Shukla [2013], Ryan et al. [2012]). Coronary artery bypass graft (CABG) mortality rate disclosure, now mandated in 15 states, is the longest- standing and one of the most widely studied variants (Shukla [2013]). Some studies find declining mortality rates following disclosure (e.g., Hannan, Sarrazin, Doran, and Rosenthal [2013], Peterson, DeLong, Jollis, Muhlbaier, and Mark [1998]), but are unable to distinguish this from the results of contemporaneous initiatives (Shukla [2013]). Dranove et al. [2003] explain that observed declines in mortality rates may be due to changes in patient characteristics rather than in care quality, and show that disclosure leads physicians to select against unhealthy patients. A metastudy of mortality-rate disclosure finds generally inconclusive performance effects, and recent studies find little to no performance effects (Epstein [2006], Ryan et al. [2012], Shukla [2013]). Scholars also report adverse effects of disclosing hospital-level costs, including insurers’ using the information to negotiate away discounts for competitors, and hospitals’ manipulating nominal charges to obscure actual relative cost-efficiency (Christensen, Floyd, and Maffett [2014], Cutler and Dafny [2011]). The practice of disclosing ratings of individual physicians is relatively recent. In 2012, UUHC became the first academic health care system to make such a disclosure online. Others including Stanford Health Care, Wake Forest Baptist Health, and Cleveland Clinic have followed. Studies suggest that a lack of patient and referring-provider attention to mortality-rate disclosure weakens its impact on competition and resulting performance improvement (Brown et al. [2012], Epstein [2006], Kolstad [2013]). Survey and field evidence indicates that physician ratings receive significantly more attention (Brown et al. [2012], Hanauer et al. [2014]). 
Further, a study of health insurance markets reported that, in selecting health plans, individuals respond to others’ subjective ratings of care, but not to objective quality measures (Dafny and Dranove 12 [2008]). Physician-rating disclosure may, by virtue of its reported influence on patient choice, be more likely to result in performance improvement than the objective quality measure disclosures assessed in the referenced studies. I provide evidence of a performance effect. By testing H3, regarding consensus bias, this study extends a third stream of literature, regarding the interpretation and application of consumer ratings. Ratings biased toward a published consensus rating would lag a move to a new service level in depicting that new service level. The ratings would, in part, reiterate prior ratings that regarded a different service level. The ratings would more fully depict the new service level as ratings from the period of updated service became a greater portion of the published consensus rating. Such a lag is relevant to research on the sequence of service improvement, online consumer ratings, and financial performance (Banker et al. [2000], Banker and Mashruwala [2009], Luca [2016]). Managers who account for the lag could more accurately anticipate the timeline for a service improvement to yield its full effect on consumer ratings. Researchers could similarly view online consumer ratings as lagged, rather than coincident, indicators of service improvement. This would help researchers to avoid measurement error in the timing of service improvement and financial performance and to thereby more accurately estimate their causal relationship (Banker and Mashruwala [2009]). Evidence of consensus bias also informs attempts to mitigate moral hazard. A variety of managers and regulators incentivize performance by consumer ratings that are visible to subsequent raters (Flaherty [2014], Lyu et al. [2013], Ubel [2015]). By accounting for a lag in those ratings’ updating, principals could more precisely measure effort and provide rewards accordingly. For instance, academic promotion review boards could interpret student-evaluation improvement after an assistant professor received a negative consensus evaluation that was made 13 visible to his or her subsequent students as requiring greater effort than if the consensus were kept private. If such effort went underestimated, the assistant professor would be incrementally subject to moral hazard, specifically, the tendency to avoid exerting unobservable effort (Banker and Datar [1989], Hölmstrom [1979]). Finally, evidence of consensus bias is relevant to research on bias in employee evaluations. Extant research on employee evaluation bias addresses issues of leniency, centrality, race, and favoritism, but has not addressed a bias toward past consensus evaluations (Bol [2011], Moers [2005], Prendergast and Topel [1993]). Consensus bias from consumers could affect employee evaluations that incorporate consumer ratings that are visible to subsequent raters, as in the academic promotion example. Consensus bias may also affect employee evaluations through supervisors who view past consensuses and are biased toward them in evaluating an employee (Grote [2015]). To extend this paper’s contribution to performance-disclosure and consumer-rating research, I address how the performance-improvement and consensus-bias effects vary with recent public attention to consumer-rating disclosure. 
Public attention would, in theory, reinforce disclosure’s performance effects by increasing the financial and self-image consequences of performing well by disclosed measures (Graham [2000], Parker and Nielsen [2011], Weil et al. [2006]). Hypotheses 4 and 5 address whether stronger rating improvement and objective-quality- measure improvement effects follow greater public attention to disclosed consumer ratings. Consensus bias is also predictably stronger following recent public attention to consumer-rating disclosure. In particular, anchoring and social herding would be more likely to occur if a larger number of individuals viewed the published consensus and each viewed it 14 enough times to recall it. Hypothesis 6 addresses whether stronger consensus bias follows greater recent public attention to disclosed consumer ratings. H4: The positive effect of physician-rating disclosure on ratings is greater under conditions of greater recent public attention to the disclosure. H5: The positive effect of physician-rating disclosure on objectively measured care quality is greater under conditions of greater recent public attention to the disclosure. H6: Physician-rating disclosure biases subsequent raters toward the published consensus to a greater degree under conditions of greater recent public attention to the disclosure. Testing H4-H5 offers some of the first evidence of a correlation between recent public attention to disclosure and performance effects. A positive correlation would support theory that public attention strengthens performance incentives tied to disclosure (Parker and Nielsen [2011], Weil et al. [2006]). The test of H6, regarding consensus bias under greater public attention, makes two additional contributions. First, it identifies public attention as a contextual factor that warrants more significantly adjusting for consensus bias in interpreting and applying publicly visible consumer ratings. Second, it assesses a potential impediment to a positive correlation between public attention and subsequent performance improvement as measured by ratings. Specifically, public attention to a consensus rating may strengthen consensus bias and relatedly counter efforts to lift ratings above the published consensus. 2.3 SETTING 15 The field research site, UUHC, is an academic medical system comprised of four hospitals, eleven community clinics, and several specialty centers. The system receives over 1.4 million outpatient and over 30,000 inpatient visits annually, offering services ranging from primary care to the most advanced types of cancer treatment. UUHC is a leader in health care quality and safety, placing in the top 10 on the University Health Consortium’s ranking of approximately 120 U.S. academic medical centers for six years in a row. It has recently prioritized patient satisfaction improvement, climbing from the bottom quartile to above the 80th percentile of peer hospitals by satisfaction measures from 2008 to 2014. 2.3.1 Physician ratings as a performance measure Government, insurers, and health care systems increasingly use patient ratings of physicians and care as performance measures. Private insurers and Medicare have, over the past decade, incorporated the ratings into reimbursement calculations and public performance reports at the hospital level (Lyu et al. [2013]). UUHC and other health care systems have created departments tasked with improving patient experience as measured by physician ratings (Daniels and Lee [2014], Merlino and Raman [2013]). 
2.3.2 Physician ratings at UUHC UUHC had launched a patient satisfaction improvement initiative several years before disclosing physician ratings. Letters from unsatisfied patients were often addressed to UUHC’s CEO, who remarked at the start of a patient satisfaction initiative in 2008, “If we are doing the right things, why are the patients so unhappy? I want it to feel better here” (Daniels and Miller [2014]). As part of the initiative, UUHC subscribed to Press Ganey Inc., the nation’s largest patient satisfaction survey vendor. Press Ganey began automatically emailing a patient satisfaction survey following each UUHC patient visit, including during the entirety of the current study’s 16 sample period. The survey asks patients to rate physicians on several criteria, listed in Appendix A. As UUHC’s patient satisfaction improvement efforts progressed, commercial online physician rating systems challenged the organization to consider disclosing its own physician ratings publicly. In December 2012, UUHC became the first academic health care system to make such a disclosure. 2.3.3 Publicly disclosing physician ratings Physician profile web pages, already integrated within the UUHC official website, served as the venue for physician ratings disclosure.12 The only criterion for a physician’s inclusion in disclosure was that he or she had received 30 or more ratings in the 12 months preceding a rating posting. Physicians who met that criterion in December 2012 were the first to have their ratings posted on their online profile. In July 2013, any physicians who failed to meet the criterion in December 2012, but met it in July 2013, similarly had their ratings disclosed. Disclosed ratings were the physicians’ 12 month average prior to the most recent posting that the physician met the survey count criterion for. Along with quantitative ratings, UUHC posted all comments regarding the physician that did not identify the patient or contain slander or profanity. The effects of consumer-rating disclosure herein refer to a disclosure that included such comments, which are a common element of consumer-rating disclosures (e.g., Cleveland Clinic [2016], Columbia University [2016], Starwood [2016]). Example comments appear in Appendix B. Though comments are not used as dependent variables, the example comments illustrate the type of qualitative information disclosed along with quantitative ratings. 12 This paper refers to UUHC employees who conduct patient-visits as “physicians,” although some do not hold a doctoral degree. Alternative degrees fall in nursing, assistant, and specialist categories. 17 UUHC administrators reported heightened physician interest in ratings after the ratings were disclosed. For example, a hospital executive not involved in proposing or designing the disclosure initiative remarked that the disclosure “had a dramatic impact on the culture of the physicians and their engagement [in patient satisfaction]”. 2.3.4 Generalizability I am limited to data from one organization. The results may, though, be relatively generalizable for a few reasons. First, a number of health care systems, including Stanford Health Care and Cleveland Clinic, have disclosed physician ratings on physicians’ official online profiles in the same manner as UUHC. The intervention is thus similar to that spreading in health care. 
The same format of emailing surveys to gather responses that are periodically posted online is also similar to rating disclosures in other industries, such as higher education (e.g., Columbia University [2016]). Second, the data span tens of thousands of patients, hundreds of physicians, and nearly four years. They also span a large patient and physician geographic area. UUHC consists of four hospitals, ten community clinics, and several specialty centers that serve a referral area encompassing five surrounding states and over 10% of the continental U.S. A reason that the results may be limited, even relative to other studies regarding individual institutions, is that the disclosure was made in a culture that supported and offered training resources for rating improvement. The performance effects may partly depend on these contextual factors. 2.4 DATA The sample consists of proprietary data regarding patient visits, patient satisfaction survey responses, and web traffic to physicians’ official online profiles. I restrict the sample to 18 physicians present with UUHC in both the period before the first instance of disclosure and after the last. That restriction: 1) affords including physician fixed-effects in difference-in-differences models, and 2) orients the models to test for changes in physician behavior rather than changes in physician group composition. The test of consensus bias further restricts the sample to physicians who had received ratings more than one year before disclosure. This allows constructing the consensus from which to measure bias toward during the year prior to disclosure. The consensus- bias results are robust to stipulating that the consensuses consist of various numbers of survey responses, as described in Section 5.3. Table 1 contains descriptive statistics of patient, visit, procedure, and physician characteristics. Appendix C contains variable definitions. 2.4.1 Patient-visit characteristics Patient-visit data fields include gender, age (with ages above 89 treated as 90, in compliance with privacy standards), whether the insurance provider was Medicare or Medicaid, whether the patient speaks English, and whether the patient was visiting the physician for his or her first time. I incorporate patient age as indicator variables to account for nonlinearity in the relationship between age and the models’ dependent variables. 13 In the analyses of ratings and bias, this is a set from psychology research that represents differences in emotion and cognition that would plausibly influence formation of a rating (Newman and Newman [2014]). In the analysis of objectively measured quality, this is a set outlined by the Centers for Medicare and Medicaid (CMS) to adjust quality measures for patient health risks (CMS [2016]). Additional patient-visit characteristics are two measures of severity and complexity of the patient’s condition. One is the Medicare reimbursement weighting associated with the visit, 13 Surveys regarding visits of patients who are too young or otherwise incapable of responding are sent to the caretaker whose email is associated with the patient’s medical record. The age of respondents who are caretakers of the patient is not available in the data, and so is represented indirectly by the patient’s age (e.g., an infant age group would control for the typical age range of individuals who are caretakers of infants). 
19 which reflects the severity of the case and the related complexity of treatment.14 The second is the Charlson Comorbidity Index (CCI), a weighted score that represents the disease burden of the patient.15 The CCI takes a value of one, two, three, or six, in general proportion to the likelihood of mortality within one year associated with the comorbid condition. CCI conditions range from ulcers to cancer. The conditions are recorded at the time of a procedure. Objectively measured quality analyses regard procedures, and are thus able to include CCI as measured at the given visit. Ratings may occur as part of a visit whether or not it involves a procedure, and thereby whether or not CCI is measured. For rating analyses, I thus include CCI as its value for the patient in UUHC visits during the six-month window centered at the rated visit. The results are robust to narrowing this window to three months or expanding it to one year. 2.4.2 Physician characteristics The data identify a physician’s gender, possession of an MD, age, number of years employed by UUHC, and whether he or she is a tenure-track physician. The tenure track is available through UUHC’s affiliation with the University of Utah Medical School. Generalized difference-in- differences analysis of an intervention with multiple implementation dates utilizes individual (herein physician) fixed effects. Physician fixed effects subsume time-invariant physician characteristics. The noted data on these characteristics, though, allow comparing the physician groups included in and excluded from disclosure, as well as constructing matched samples of physicians for robustness tests. 2.4.3 Web traffic 14 See Brown [2003] and Evans, Hwang, and Nagarajan [2001] for similar application of this measure. 15 See Sundararajan, Henderson, Perry, Muggivan, Quan, and Ghali [2004] for a description of the index, and Dafny [2005], Chandra, Gruber, and McKnight [2010], Doyle Jr. [2011] for examples of its use.   
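To make the comorbidity control described in Section 2.4.1 concrete, the following is a minimal sketch in Python of a Charlson-style score and of the six-month window used to attach it to rated visits. The condition names, the particular weights shown, the summation convention, and the column names (patient_id, procedure_date, condition) are illustrative assumptions rather than UUHC's actual coding.

```python
# A minimal sketch (assumed field names, not UUHC's actual pipeline) of the
# comorbidity control described in Section 2.4.1: each recorded comorbid
# condition carries a Charlson weight of 1, 2, 3, or 6, and for rating
# analyses the score is taken from visits in a six-month window centered
# on the rated visit.
import pandas as pd

# Illustrative subset of Charlson condition weights (the full index covers
# conditions ranging from ulcers to cancer, as noted in the text).
CHARLSON_WEIGHTS = {
    "ulcer_disease": 1,
    "diabetes_with_complications": 2,
    "moderate_or_severe_liver_disease": 3,
    "metastatic_cancer": 6,
}

def charlson_score(conditions):
    """Sum the weights of the comorbid conditions recorded for a patient."""
    return sum(CHARLSON_WEIGHTS.get(c, 0) for c in conditions)

def cci_for_rated_visit(procedures: pd.DataFrame, patient_id, visit_date,
                        half_window_days: int = 91) -> int:
    """CCI attached to a rated visit: score the conditions recorded for the
    same patient within +/- half_window_days of the visit (91 days on each
    side approximates the six-month centered window described in the text)."""
    visit_date = pd.Timestamp(visit_date)
    lo = visit_date - pd.Timedelta(days=half_window_days)
    hi = visit_date + pd.Timedelta(days=half_window_days)
    in_window = procedures[(procedures["patient_id"] == patient_id)
                           & procedures["procedure_date"].between(lo, hi)]
    return charlson_score(in_window["condition"])
```

Setting half_window_days to roughly 45 or 182 would correspond to the three-month and one-year robustness windows mentioned above.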
TABLE 1: SAMPLE SELECTION AND DESCRIPTIVE STATISTICS

Panel A - Sample selection

Ratings
  Initial observations                                                        178,334
  Exclude physicians who exit the sample before or enter the sample after
    the first rating posting                                                  (69,154)
  Sample for ratings                                                          109,150
  Exclude physicians who enter the sample more recently than a year before
    the first rating posting                                                   (9,446)
  Restrict sample to one year before first rating posting and to before
    the second rating posting                                                 (59,228)
  Sample for absolute difference                                               40,476

Procedures
  Initial observations                                                         48,839
  Exclude physicians who exit the sample before or enter the sample after
    the first rating posting                                                   (7,232)
  Sample for quality deductions                                                41,607

Panel B - Ratings descriptive statistics
Columns: physicians included in disclosure (N, Mean, SD) | physicians excluded from disclosure (N, Mean, SD)

Patient Visit
  Gender                          106,171   0.60     0.48      |   2,979   0.50     0.50
  Age                             106,171   49.84    19.80     |   2,979   44.64    23.92
  English speaking                106,171   0.98     0.13      |   2,979   0.98     0.11
  Charges ($)                     106,171   281.56   1,181.75  |   2,979   300.25   862.08
  Severity/complexity             106,171   1.94     2.95      |   2,979   2.08     8.06
  Comorbidity                     106,171   0.01     0.14      |   2,979   0.01     0.14
  Medicare or Medicaid            106,171   0.17     0.37      |   2,979   0.16     0.37
  First visit                     106,171   0.25     0.43      |   2,979   0.31     0.46
  Physician-week's visit count    106,171   56.12    41.55     |   2,979   37.45    36.21

Physician
  Gender                          295   0.36    0.48    |   99   0.37    0.48
  MD                              295   0.79    0.40    |   99   0.84    0.36
  Age                             295   44.61   10.03   |   99   43.50   9.21
  Years with UUHC                 295   9.34    5.05    |   99   7.85    4.11
  Tenure track                    295   0.31    0.46    |   99   0.36    0.48

Physician-website Month
  Web traffic                     9,962   79.12   75.29   |   807   51.35   45.11

Satisfaction Survey
  Rating                          106,171   4.70   0.52   |   2,979   4.74   0.48
  Absolute difference             106,171   0.36   0.35   |   2,979   0.31   0.33

Physician ages are as of January 1, 2011, and patient ages are treated as 90 if above 89 in compliance with privacy standards.

TABLE 1: SAMPLE SELECTION AND DESCRIPTIVE STATISTICS (CONTINUED)

Panel C - Procedures descriptive statistics
Columns: physicians included in disclosure (N, Mean, SD) | physicians excluded from disclosure (N, Mean, SD)

Patient Visit
  Gender                          36,827   0.54        0.49        |   4,780   0.55        0.49
  Age                             36,827   49.28       21.51       |   4,780   43.80       23.66
  Charges ($)                     36,827   44,194.52   101,574.10  |   4,780   44,957.01   139,024.80
  Severity/complexity             36,827   2.12        3.40        |   4,780   2.04        3.38
  Comorbidity                     36,827   0.12        0.46        |   4,780   0.25        0.72
  Medicare or Medicaid            36,827   0.04        0.21        |   4,780   0.09        0.29
  Physician-week's visit count    36,827   36.10       25.69       |   4,780   13.71       12.41

Physician
  Gender                          158   0.30    0.46   |   36   0.31    0.46
  MD                              158   0.94    0.22   |   36   0.86    0.35
  Age                             158   43.52   8.79   |   36   39.05   6.74
  Years with UUHC                 158   8.87    4.27   |   36   6.55    3.74
  Tenure track                    158   0.45    0.49   |   36   0.27    0.45

Physician-website Month
  Web traffic                     3,970   88.79   84.54   |   712   41.47   39.19

Quality
  Quality deduction               36,827   0.02   0.15   |   4,780   0.02   0.16

Physician ages are as of January 1, 2011, and patient ages above 89 are treated as 90 in compliance with privacy standards.

Available data regarding web traffic are the number of page views of a physician's online profile each month during the sample period. A page view is an instance of a web browser loading the physician's online profile.

2.4.4 Physician ratings

Physician ratings are patient responses to automated emails sent by a third-party company, Press Ganey Inc. The survey content and distribution are not subject to the discretion of the physician conducting the visit. Patients answer the questions on a Likert scale of 1-5, with 1 indicating "very poor" and 5 indicating "very good". Rating components are listed in Appendix A.
The average rating for an individual visit by all nine questions is the current study’s dependent variable rating. The average of rating over the 12 months prior to a date of disclosure is made visible atop his or her profile upon rating disclosure. Rating is generally very high, with an average in the sample for this study of 4.7 out of 5. This raises the possibility that institutional factors lead only satisfied patients to respond to surveys. However, this high average is representative of hospitals nationally. For UUHC’s peer group of 120 academic hospitals that use Press Ganey surveys, the average is roughly 4.6 out of 5. The automated emailing system that directs surveys only to confirmed patients would also make it difficult for physicians to manipulate the respondent pool relative to traditional online reviews or to surveys that a physician distributes. 2.4.5 Procedures and quality The objectively measured quality analysis applies to visits during which the physician performed a procedure, regardless of whether the patient subsequently responded to a survey regarding the visit. The proxy for quality is quality deduction, an indicator for whether the visit resulted in either a patient readmission to the emergency department within 30 days, a hospital acquired 23 condition, or both. A decrease in quality deduction corresponds to an increase in quality. This variable’s component measures—30-day readmissions and hospital acquired conditions (conditions acquired along with treatment at the hospital)—are measured by health care regulators and widely studied as adverse and avoidable events that indicate quality failures.16 UUHC provided these component measures dating back to one month after the start of the patient satisfaction dataset, affording nearly two years of quality observations before physician- rating disclosure. The procedure data that UUHC provided included all of the variables described in Sections 4.1 and 4.2, except whether the visit was a patient’s first to the provider and whether the patient indicated that they speak English. Those two fields are gathered and made available by Press Ganey to UUHC only along with survey response data 2.5 ANALYSIS This study utilizes the criterion determining whether a physician’s ratings were posted online, as well as the timing of the postings, in its generalized-difference-in-differences tests. The single criterion was that a physician received 30 or more survey responses in the 12-month period prior to a rating posting. Rating postings occurred twice during this study’s sample, once in December 2012, and again in July 2013. Posted ratings stayed constant on the physician’s profile until a subsequent posting that the physician met the survey count criterion for inclusion in. The estimation employs generalized difference-in-differences, which has been used in a prior study regarding health care disclosure (Dranove et al. [2003], Duflo [2002]). In addition to physician and time fixed effects, the models also include extensive controls for patient mix, including multiple measures of patients’ health risks. 16 See Andel, Davidow, Hollander, and Moreno [2012], and Joynt, Orav, and Jha [2011] for descriptions and applications. 24 A key identifying assumption is that each dependent variable would trend parallel for physicians included in and those excluded from disclosure were it not for the disclosure. The analysis includes placebo tests and graphical illustrations of these trends prior to the start of disclosure at UUHC. 
A second key assumption is a lack of contemporaneous changes at the time of disclosure that would otherwise alter the parallel trends. Propensity-score matching of physicians allows testing for robustness to contemporaneous changes that would arise due to differences in demographics of physicians included in and excluded from disclosure. Additional tests show that the effects are robust after excluding, and are of comparable size among, physicians who first met the criterion number of returned surveys within one year prior to their inclusion in disclosure. This suggests that potential contemporaneous changes in dependent variable trends due to a physician's recently meeting the criterion for disclosure do not drive the results.

2.5.1 Rating improvement

Model 1, specified as follows, tests for the effect of physician-rating disclosure on subsequent ratings:

(1) Rating_{pytv} = α + δ physician_p + λ year_y + ω period_t + ς controls_v + β disclosed_{pt} + ε_{pytv} ,

where p indexes physicians, y indexes years, t indexes time periods segmented by disclosure events, v indexes individual patient visits, and disclosed is an indicator for the time period following which a physician's ratings were disclosed, if ever, on their online profile. The dimensions of the coefficient vectors are δ (1 × 394), λ (1 × 4), ω (1 × 3), ς (1 × 15), and β (1 × 1). Physician fixed effects control for physician characteristics, including membership in or exclusion from the treated group. Time fixed effects control for time trends common to the entire sample, including differences between the before and after periods. The time variables are year and period, the latter of which controls for static variation prior to and after disclosure events occurring mid-calendar-year. β captures the difference-in-differences estimate of the effect of physician-rating disclosure on ratings.

The unit of observation is the patient-visit. Data at the visit level allow controlling for observable patient and visit characteristics. These are the monetary charges for the visit, the severity/complexity of the case, the patient's comorbidities, whether the visit's insurer was Medicare or Medicaid, the patient's gender, a set of psychometric indicator variables for patient age, and an indicator variable for whether the visit was the patient's first to the physician. In estimating this model and all others in this study, standard errors are clustered at the physician level to correct for autocorrelation among multiple observations within physicians.17

Table 2 supports the parallel trends assumption. Column 1 displays results of estimating Model 1 using a placebo disclosure, timed one year prior to the date on which a physician's ratings were actually disclosed, and confined to the pre-disclosure period. The coefficient on placebo disclosed indicates no significant difference in rating trends in the pre-disclosure period. Figure 2 illustrates rating trends in the pre-disclosure period for physicians included in and excluded from disclosure.18

17 The number of years is too few for reliable clustering by the standards of Petersen [2009]. Clustering standard errors by both years and physicians, though accordingly incorrect, slightly strengthens the results' statistical significance.

18 The variance of scatter points in Figures 2-4 is greater for physicians excluded from, relative to those included in, disclosure. This is attributable to the smaller sample size for the former group. When I scale the variance to equate the number of observations in the former group with that of the latter, the variances do not statistically significantly differ. Widely cited difference-in-differences studies use treatment and control groups that differ in sample size by a large multiple – at least as large as 18 (Card and Krueger [1994], Dynarski [2002]). A sample size difference between treated and control groups reduces effective sample size, but these studies do not mention it as a threat to difference-in-differences estimation. The results in Tables 3-5 hold, as shown in Table 7, after propensity-score matching reduces the multiple by which the sample sizes differ for physicians excluded from and included in disclosure to less than 13.
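For readers who want to see the estimation mechanics, the following is a minimal sketch of Model 1 in Python with statsmodels. The column names (physician_id, rating, disclosed, and the abbreviated control set) are illustrative assumptions rather than the actual UUHC variable names, and the age indicators are collapsed into a single categorical for brevity.

```python
import pandas as pd
import statsmodels.formula.api as smf

visits = pd.read_csv("visits.csv")  # hypothetical visit-level dataset

# Model 1: physician, year, and period fixed effects, visit-level controls,
# and the difference-in-differences indicator 'disclosed'.
formula = (
    "rating ~ disclosed + C(physician_id) + C(year) + C(period)"
    " + charges + severity + comorbidity + medicare_medicaid"
    " + gender + C(age_group) + first_visit + week_visit_count"
)
fit = smf.ols(formula, data=visits).fit(
    cov_type="cluster", cov_kwds={"groups": visits["physician_id"]}  # physician-clustered SEs
)
print(fit.params["disclosed"], fit.bse["disclosed"])
```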
FIGURE 1: PHYSICIAN-RATING DISCLOSURE TIMELINE
[Timeline figure spanning 2011-2014, marking Rating Posting 1 in December 2012, and Rating Update 1 and Rating Posting 2 in July 2013.]
Rating Posting 1: Administrators posted, on the official webpage of each physician who had received at least 30 ratings during the prior 12 months, the physician's average rating for the nine questions in Appendix A during that period, along with individual patient comments.
Rating Update 1: Administrators updated the ratings posted for physicians included in Rating Posting 1 who had received at least 30 ratings during the prior 12 months, to display those physicians' rating averages from that more recent 12-month period.
Rating Posting 2: Administrators posted, on the official webpage of each physician who had received at least 30 ratings during the prior 12 months but who had not reached the threshold number of ratings for Rating Posting 1, the physician's average rating for the nine questions in Appendix A during that period, along with individual patient comments.

FIGURE 2: PRE-DISCLOSURE RATING TRENDS BY PHYSICIAN'S EVENTUAL DISCLOSURE STATUS
This figure displays trends of rating, unadjusted for covariates, for physicians included in and excluded from ratings disclosure prior to the disclosure. Scatter points are aggregated at the week level. The scatter point variance for the group with fewer visit-level observations is reduced by the component of variance attributable to the smaller number of observations. Trend lines illustrate that rating was trending similarly for these groups of physicians prior to rating disclosure.

FIGURE 3: PRE-DISCLOSURE QUALITY DEDUCTION TRENDS BY PHYSICIAN'S EVENTUAL DISCLOSURE STATUS
This figure displays trends of quality deduction, unadjusted for covariates, for physicians included in and excluded from ratings disclosure prior to the disclosure. The variable's availability extends from February 2011. Scatter points are aggregated at the week level. The scatter point variance for the group with fewer visit-level observations is reduced by the component of variance attributable to the smaller number of observations. Trend lines illustrate that quality deduction was trending similarly for these groups of physicians prior to rating disclosure.
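The placebo specifications summarized in Table 2 below shift each physician's disclosure date back one year (seven months in the case of absolute difference) and re-estimate the models on pre-disclosure observations only. The following is a minimal sketch of that construction, under the same hypothetical schema as above and simplified to a single posting date; it is an illustration, not the thesis's exact code.

```python
import pandas as pd
import statsmodels.formula.api as smf

visits = pd.read_csv("visits.csv", parse_dates=["visit_date", "actual_disclosure_date"])

# Keep only pre-disclosure observations (simplified to the first posting date).
pre = visits[visits["visit_date"] < pd.Timestamp("2012-12-01")].copy()

# Placebo treatment turns on one year before the physician's actual disclosure date;
# comparisons against NaT (never-disclosed physicians) evaluate to False.
shifted = pre["actual_disclosure_date"] - pd.DateOffset(years=1)
pre["placebo_disclosed"] = (pre["visit_date"] >= shifted).astype(int)

placebo_fit = smf.ols(
    "rating ~ placebo_disclosed + C(physician_id) + C(year) + charges + severity"
    " + comorbidity + medicare_medicaid + gender + C(age_group) + first_visit",
    data=pre,
).fit(cov_type="cluster", cov_kwds={"groups": pre["physician_id"]})
print(placebo_fit.params["placebo_disclosed"])  # an insignificant coefficient supports parallel trends
```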
TABLE 2: TESTS OF PARALLEL TRENDS
(Coefficients with t statistics in brackets; columns: (1) Rating, (2) Quality deduction, (3) Absolute difference)
Placebo disclosed: -0.020 [-0.83] | 0.003 [0.67] | -0.003 [-0.19]
Gender: -0.018** [-2.24] | 0.001 [0.56] | 0.007 [1.58]
Charges: 0.000 [0.78] | -0.000* [-1.71] | 0.000 [1.35]
Severity/complexity: 0.001 [1.19] | 0.001 [1.62] | -0.000 [-0.50]
Comorbidity: 0.037*** [3.05] | -0.002 [-1.59] | -0.033*** [-4.46]
Medicare or Medicaid: 0.012 [1.55] | 0.002 [0.49] | 0.005 [0.92]
Physician-week's visit count: 0.000 [0.22] | -0.000 [-0.41] | -0.000 [-0.59]
First visit: -0.041*** [-5.42] | – | 0.025*** [4.00]
English speaking: 0.087*** [4.73] | – | -0.067*** [-3.52]
Contemporary std. dev.: – | – | 0.315*** [11.22]
Consensus count: – | – | 0.045 [0.49]
Rating trend: – | – | 0.059*** [3.38]
Age dummies°: Yes | Yes | Yes
Year dummy 2012: 0.014 [1.09] | 0.008 [0.26] | -0.006 [-0.70]
Period dummy 2: 0.039 [1.36] | -0.004 [-0.76] | 0.015 [0.76]
Period dummy 3: 0.049 [1.66] | -0.000 [-0.11] | –
Physician dummies: Yes | Yes | Yes
This table presents effect estimates of a placebo physician rating disclosure to assess, in the pre-disclosure period, parallel trends in ratings, quality deductions, and the absolute difference of ratings from prior consensus ratings. Placebo disclosure is timed one year before a physician's actual rating disclosure for the results in Columns 1-2. Data for constructing absolute difference are limited to one year before the actual disclosure. Given that constraint, the placebo test for Column 3 times its placebo disclosure seven months prior to actual disclosure, which provides a post-treatment period for analysis of the placebo disclosure as long as that available for the actual disclosure, and provides a six-month pre-treatment period for the placebo disclosure analysis. °Age dummies for Columns 1 and 3 are the psychometric set in Newman and Newman [2014], and for Column 2 are the health-risk set delineated by CMS [2016], with one group omitted in each case. Standard errors are clustered at the physician level. *,**,*** denote significance at the .1, .05, and .01 levels respectively. Rating N = 65,286, Quality deduction N = 27,456, Absolute difference N = 32,413.

Table 3 displays the results of estimating Model 1. The coefficient on disclosed in Columns 1 and 2 shows an estimated positive and statistically significant effect of the disclosure on improvement by patient satisfaction ratings. Column 3 shows positive time trends in rating. These trends are key to estimating improvement in rating that occurs in spite of, rather than due to, consensus bias. Consensus bias would draw rating toward its prior published value and thereby counter rating's improvement in actuality and in difference-in-differences estimation with non-negative trends. The results from Columns 1 and 2 support H1 – that disclosing physician ratings positively affects ratings. The estimated effect is also economically significant in that it raises a physician's rank by the ratings among the national University Health Consortium peer group by 17 percentile points on average.19 UUHC and other health care systems report these ranks to individual physicians privately and sometimes publicly in aggregate (Daniels and Miller [2014], Merlino and Raman [2013]).

2.5.2 Quality improvement

Model 2, specified as follows, measures the effect of physician-rating disclosure on the occurrence of objectively measured quality deductions including readmissions and hospital-acquired conditions:

(2) Quality deduction_{pytv} = α + δ physician_p + λ year_y + ω period_t + ς controls_v + β disclosed_{pt} + ε_{pytv} .

The model's subscripts and the right-hand variables, other than those contained in the controls vector, are the same as those in Model 1. The coefficient vector dimensions are δ (1 × 194), λ (1 × 4), ω (1 × 3), ς (1 × 20), and β (1 × 1). The control vector comprises charges, the severity/complexity of the case, the patient's CCI comorbidity score, whether the visit's insurer was Medicare or Medicaid, the patient's gender, and the indicators for age as outlined by CMS for health risk adjustment.

19 The University Health Consortium percentile point conversions were produced using UUHC's internal data provided by Press Ganey that maps physician ratings to the consortium's peer group distribution.
TABLE 3: EFFECT OF PHYSICIAN RATING DISCLOSURE ON RATING
(Dependent variable: rating; coefficients with t statistics in brackets; columns (1)-(3))
Disclosed: 0.032** [2.26] | 0.034*** [2.57] | –
Gender: -0.016*** [-3.73]
Charges: 0.000 [0.91]
Severity/complexity: 0.002*** [3.08]
Comorbidity: 0.018* [1.69]
Medicare or Medicaid: 0.014*** [2.70]
Physician-week's visit count: -0.000 [-0.59]
First visit: -0.048*** [-8.17]
English speaking: 0.089*** [5.41]
Age dummies: 11-17: 0.011 [0.54]; 18-24: -0.113*** [-4.49]; 25-34: -0.072*** [-3.23]; 35-59: 0.012 [0.58]; 59-74: 0.090 [4.12]; +74: 0.082 [3.66]
Year dummy 2012: 0.043*** [7.23] | 0.037*** [6.30] | 0.044*** [6.31]
Year dummy 2013: 0.052*** [3.45] | 0.046*** [3.12] | 0.088*** [11.05]
Year dummy 2014: 0.065*** [4.16] | 0.060*** [3.90] | 0.101*** [11.87]
Period dummy 2: 0.003 [0.19] | -0.003 [-0.20] | –
Period dummy 3: 0.006 [0.29] | -0.003 [-0.17] | –
Physician dummies: Yes | Yes | No
This table presents effect estimates of physician rating disclosure on physician ratings. Columns 1-2 vary the controls included, and Column 3 presents isolated time trends. Standard errors are clustered at the physician level. Age dummies for Columns 1 and 3 are the psychometric set in Newman and Newman [2014]. *,**,*** denote significance at the .1, .05, and .01 levels respectively. N = 109,050.

The results in Table 2 support the parallel trends assumption for quality deduction.
Column 2 displays results of estimating Model 2 using a placebo disclosure, timed one year prior to the date on which a physician's ratings were actually disclosed, and confined to the pre-disclosure period. The coefficient on placebo disclosed indicates no significant difference in quality deduction trends in the pre-disclosure period. Figure 3 illustrates the quality trends in the pre-disclosure period for physicians included in and excluded from disclosure.

Table 4 displays the results of the test of Model 2. I use ordinary least squares regression for estimating effects on a binary outcome variable, in line with prior health economics research (Dranove [2003]). This approach prohibits interpreting the results as point estimates, but avoids the Type I error prone to result from applying logit or probit regression to difference-in-differences estimation (Blundell and Dias [2009]). The coefficient on disclosed in Columns 1-2 shows an estimated negative and statistically significant effect of the disclosure on quality deductions. The results support H2 – that disclosing physician ratings positively affects the quality of care that the physician provides.

2.5.3 Consensus bias

Model 3, specified as follows, measures the effect of physician-rating disclosure on consensus bias in subsequent ratings:

TABLE 4: EFFECT OF PHYSICIAN RATING DISCLOSURE ON QUALITY DEDUCTION
(Dependent variable: quality deduction; coefficients with t statistics in brackets; columns (1)-(2))
Disclosed: -0.010*** [-2.61] | -0.010*** [-2.62]
Gender: 0.004*** [2.80]
Charges: -0.000 [-1.11]
Severity/complexity: 0.000 [0.56]
Comorbidity: -0.002* [-1.74]
Medicare or Medicaid: 0.006 [1.50]
Physician-week's visit count: 0.000 [0.12]
Age dummies°: 1-9: 0.002 [0.18]; 9-20: 0.034** [2.49]; 21-29: 0.021* [1.77]; 30-39: 0.021* [1.74]; 40-49: 0.021* [1.72]; 50-59: 0.014 [1.16]; +59: 0.009 [0.76]
Year dummy 2012: 0.002 [1.57] | 0.002 [1.63]
Year dummy 2013: 0.002 [0.51] | 0.002 [0.58]
Year dummy 2014: -0.003 [-0.67] | -0.003 [-0.55]
Period dummy 2: 0.007 [1.41] | 0.007 [1.42]
Period dummy 3: 0.016** [2.56] | 0.016** [2.56]
Physician dummies: Yes | Yes
This table presents effect estimates of physician rating disclosure on quality deduction. Columns 1-2 vary the controls included. °Age dummies are the 14 categories delineated by CMS [2016] as capturing age-dependent health risk, with the first omitted, and the subsequent 12 paired with their adjacent category for concise display. Decoupling the paired categories and including them as separate dummies in the regression does not alter the level of significance of the estimated effect of disclosure. Standard errors are clustered at the physician level. *,**,*** denote significance at the .1, .05, and .01 levels respectively. N = 41,607.

(3) Absolute difference_{pytv} = α + δ physician_p + λ year_y + ω period_t + ς controls_v + β disclosed_{pt} + ε_{pytv} .

The dependent variable, absolute difference, is the absolute distance of rating for the visit from the physician's consensus rating as calculated for rating disclosure. A reduction in absolute difference, controlling for a change in the overall standard deviation of a physician's ratings, is the proxy for consensus bias. Though UUHC did not calculate the measure prior to disclosure, the necessary information is calculable. For the year immediately preceding disclosure, I construct this measure as the absolute difference of rating from the physician's 12-month consensus as would have been calculated for disclosure one year before the disclosure's actual start.
The variable's measurement, which requires establishing a 12-month consensus from which to measure absolute distance, begins as early as UUHC's data allows – in December 2011. In this test, the sample is also truncated at the time of the second disclosure, in July 2013, when physicians were added and initial ratings were updated. Beyond that point, the combination of consensus ratings for a physician that a patient rater may have been exposed to makes the appropriate standard for measuring bias unclear.

The model's subscripts and the right-hand variables other than controls are the same as in Model 1. The coefficient vector dimensions are δ (1 × 273), λ (1 × 1), ω (1 × 1), ς (1 × 18), and β (1 × 1). The control vector includes all controls used in Model 1 and three additional controls.

The first of these additional controls teases out the effect of a change in the overall standard deviation of rating for a given physician following disclosure. This control, contemporary standard deviation, is the standard deviation of rating for a given physician in the period, relative to the December 2012 posting—the only instance of disclosure used to test Model 3—in which the rating occurred. This control captures a physician becoming more or less consistent in the level of service he or she provides, or any other reasons for a decrease in the variance of his or her ratings not attributable to other observable covariates.

The second additional control, consensus count, teases out the effect of a difference in the rating sample size for a physician's consensus rating as calculated at December 2011 relative to December 2012. Absolute difference is measured relative to the former in the pre-disclosure period and relative to the latter in the post-disclosure period. Consensus count is the inverse square root of the sample size of ratings that constitute the physician's consensus rating from which absolute difference is measured. The sample mean of an i.i.d. random variable converges to the population mean at a rate of 1/√n (Vives p. 386 [2010]). Consensus count uses that transformation to capture an effect on absolute difference of a physician's December 2012 consensus rating being closer to or farther from his or her population mean rating than his or her December 2011 consensus rating was.

The third additional control, rating trend, teases out the effect of a physician's mean change in rating from the variance of rating around the published score. This control is the physician-specific trend for rating in the period, relative to the December 2012 posting, in which an observed rating occurred. Given that disclosure yields a positive effect on rating and that rating time trends are positive, excluding this control would cause absolute difference to relatedly grow more for physicians included in disclosure, and thereby understate consensus bias as measured by a decline in absolute difference.

Table 2 Column 3 displays results of a test of parallel trends of the dependent variable absolute difference. The start of the variable's measurement one year prior to disclosure does not allow estimating the effects of a placebo disclosure occurring that far in advance of actual disclosure, given the necessity of a pre-period for the placebo disclosure. The placebo test instead uses a placebo disclosure date seven months prior to the date on which a physician's ratings were actually disclosed, which affords as long a period for assessment after placebo disclosure as is available for assessment after actual disclosure.
The placebo test's sample is, as with the other placebo tests, confined to the pre-disclosure period. I update the consensus rating to the 12-month average for each physician at the time of the placebo disclosure, as would have occurred if the disclosure were actual. The coefficient on placebo disclosed indicates no significant difference in absolute difference trends. Attempting to depict parallel trends in attraction to the consensus score by displaying absolute difference, unadjusted for covariates, should be done with caution given that the control for contemporary standard deviation loads heavily. Figure 4, though, illustrates the trends for absolute difference as calculated for the test of Model 3 during the pre-disclosure period.

Table 5 displays the results of the test of Model 3. The coefficients on disclosed in Columns 1-3 show an estimated negative and statistically significant effect of disclosure on absolute difference.20 Columns 2 and 3 show stronger effects, as predicted, after controlling for physician-specific rating trends. The results support H3 – that disclosing physician ratings biases raters toward the published consensus. The estimated effect is economically significant in that the bias results in ratings that are an average of 24 percentile points closer to the physician's published consensus rating in the University Health Consortium peer group distribution.21 Figure 5 illustrates the effect of eventual inclusion in disclosure on a physician's rating level and spread. Relative to physicians excluded from disclosure, the ratings for physicians in this group begin trending upward and become more tightly distributed after the first rating posting.

20 The result is statistically significant at at least the .05 level under various thresholds for a physician's inclusion in the sample based on the number of surveys comprising their consensus ratings, including >5, >10, >15, >20, and >25 surveys.

21 The University Health Consortium percentile point conversions were produced using UUHC's internal data provided by Press Ganey that maps physician ratings to the consortium's peer group distribution.

FIGURE 4: PRE-DISCLOSURE ABSOLUTE DIFFERENCE TRENDS BY PHYSICIAN'S EVENTUAL DISCLOSURE STATUS
This figure displays trends of absolute difference, unadjusted for covariates, for physicians included in and excluded from ratings disclosure prior to the disclosure. Scatter points are aggregated at the week level. The scatter point variance for the group with fewer visit-level observations is reduced by the component of variance attributable to the smaller number of observations. Absolute difference is measured from the physician's 12-month consensus rating calculated at December 2011, the earliest point of that measure's availability and one year prior to ratings disclosure. Trend lines illustrate that absolute difference was trending similarly for these groups of physicians prior to ratings disclosure.

FIGURE 5: RATING PREMIUM FOR PHYSICIANS EVENTUALLY INCLUDED IN DISCLOSURE
This figure displays the effect of eventual inclusion in disclosure on a physician's rating level and spread. Trend lines show the average rating premium over time for a physician's belonging to the group eventually included in disclosure, after controlling for all covariates in Model 1. The lines are segmented to show a change in trend at Rating Posting 1. The variance of the scatter points in each quarter is the average variance of all physician ratings in the given quarter plus the estimated effect of consensus bias in that quarter on the variance of those ratings.
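To make the variables of Model 3 concrete, the following is a minimal sketch of constructing absolute difference and the three additional controls defined in Section 2.5.3. The file and column names (ratings.csv, consensus_vintage, period) are hypothetical, and the rating-trend slope shown is one simple proxy for the physician-specific trend described above rather than necessarily the exact construction used.

```python
import numpy as np
import pandas as pd

ratings = pd.read_csv("ratings.csv", parse_dates=["visit_date"])  # hypothetical schema

# Consensus rating: the physician's 12-month average for the relevant vintage
# (December 2011 for pre-disclosure observations, December 2012 for post-disclosure ones).
consensus = ratings.groupby(["physician_id", "consensus_vintage"])["rating"].agg(["mean", "count"])
consensus.columns = ["consensus_rating", "consensus_n"]
ratings = ratings.join(consensus, on=["physician_id", "consensus_vintage"])

# Dependent variable and the three additional controls.
ratings["absolute_difference"] = (ratings["rating"] - ratings["consensus_rating"]).abs()
ratings["consensus_count"] = 1 / np.sqrt(ratings["consensus_n"])  # inverse square root of sample size
ratings["contemporary_std"] = ratings.groupby(["physician_id", "period"])["rating"].transform("std")

# Rating trend: physician-specific slope of rating on time within the period (a simple proxy).
ratings["t"] = (ratings["visit_date"] - ratings["visit_date"].min()).dt.days
slopes = (
    ratings.groupby(["physician_id", "period"])
    .apply(lambda g: np.polyfit(g["t"], g["rating"], 1)[0] if len(g) > 1 else 0.0)
    .rename("rating_trend")
    .reset_index()
)
ratings = ratings.merge(slopes, on=["physician_id", "period"])
```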
TABLE 5: EFFECT OF PHYSICIAN RATING DISCLOSURE ON ABSOLUTE DIFFERENCE
(Dependent variable: absolute difference; coefficients with t statistics in brackets; columns (1)-(3))
Disclosed: -0.040*** [-2.92] | -0.046*** [-3.46] | -0.047*** [-3.61]
Gender: 0.006 [1.44]
Charges: 0.000** [2.07]
Severity/complexity: -0.001** [-2.32]
Comorbidity: -0.017** [-2.05]
Medicare or Medicaid: 0.003 [0.66]
First visit: 0.024*** [4.87]
Physician-week's visit count: 0.000 [1.39]
English speaking: -0.074*** [-4.29]
Contemporary std. dev.: 0.374*** [13.89] | 0.377*** [14.23] | 0.374*** [13.91]
Consensus count: 0.137 [1.11] | 0.130 [1.13] | 0.129 [1.09]
Rating trend: – | 0.050*** [2.85] | 0.052*** [2.95]
Age dummies: 11-17: 0.006 [0.33]; 18-24: 0.076*** [3.72]; 25-34: 0.049*** [2.85]; 35-59: 0.016 [1.07]; 59-74: -0.016 [-1.05]; +74: -0.024 [-1.53]
Year dummy 2012: -0.007 [-0.68] | -0.006 [-0.67] | -0.003 [-0.38]
Year dummy 2013: -0.013 [-1.00] | -0.013 [-1.02] | -0.012 [-0.93]
Period dummy 2: 0.035** [2.22] | 0.051*** [3.11] | 0.056*** [3.44]
Physician dummies: Yes | Yes | Yes
This table presents effect estimates of physician rating disclosure on the absolute distance of ratings from prior consensus ratings as were calculated for online disclosure. Columns 1-3 vary the controls included. Age dummies are the psychometric set in Newman and Newman [2014]. Standard errors are clustered at the physician level. *,**,*** denote significance at the .1, .05, and .01 levels respectively. N = 39,159.

2.5.4 Robustness tests

Table 7 demonstrates the robustness of the effects in Sections 2.5.1-2.5.3 to estimation after matching physicians included in and excluded from disclosure by observable physician covariates: age, MD, gender, years with UUHC, and tenure track. The analyses so far have controlled for underlying physician differences in two regards. First, physician fixed effects control for the time-invariant effects of both observable and unobservable physician characteristics. Second, the placebo tests establish that time-variant effects of those characteristics on the dependent variables do not impede generally parallel trends. Propensity-score matching on the noted observable physician covariates further evidences the results' robustness to observable physician characteristics by showing that the effects persist when differences in these characteristics between treated and control physicians are not statistically significant.

The sample matching utilizes a probit model that estimates propensity scores based on the observable characteristics. Subsequently, each member of the smaller group (physicians excluded from disclosure) is matched, without replacement, to the member of the larger group (physicians included in disclosure) with the most similar propensity score that has not yet been matched. For the test of absolute difference, the group of physicians excluded from disclosure comprises all those who did not meet the criterion for the December 2012 posting. Table 6 shows the covariate balance after matching. Table 7 shows that the results for rating, quality deduction, and absolute difference hold using these matched, and much smaller, samples. Each result is statistically significant at the .05 level or lower.
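A minimal sketch of the matching procedure just described, in Python with statsmodels: a probit propensity model on the five physician covariates, followed by one-to-one nearest-neighbor matching without replacement. The 0.05 caliper echoes the one noted with Table 6; the file and column names are hypothetical stand-ins.

```python
import pandas as pd
import statsmodels.api as sm

physicians = pd.read_csv("physicians.csv")  # hypothetical physician-level data
covariates = ["age", "md", "gender", "years_with_uuhc", "tenure_track"]

# Probit model of inclusion in disclosure on the observable covariates.
X = sm.add_constant(physicians[covariates])
physicians["pscore"] = sm.Probit(physicians["included"], X).fit(disp=0).predict(X)

control = physicians[physicians["included"] == 0]          # smaller group: excluded physicians
treated = physicians[physicians["included"] == 1].copy()   # larger group: included physicians

pairs = []
for _, row in control.iterrows():
    distances = (treated["pscore"] - row["pscore"]).abs()
    best = distances.idxmin()
    if distances[best] <= 0.05:                 # caliper check
        pairs.append((row["physician_id"], treated.loc[best, "physician_id"]))
        treated = treated.drop(index=best)      # without replacement
matched = pd.DataFrame(pairs, columns=["excluded_id", "included_id"])
print(len(matched), "matched pairs")
```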
As an additional robustness test to help rule out contemporaneous changes in dependent variable trends, Table 8 shows effect estimates partitioned by whether the physician met the criterion for disclosure within the year prior to his or her inclusion in disclosure. The coefficients on disclosure in Columns 1, 3, and 5, which exclude physicians who met the criterion within a year prior to disclosure, are statistically significant at the .05 level or lower. The estimated effects among physicians who met the criterion within a year prior to disclosure, shown in Columns 2, 4, and 6, are of similar magnitude or lower and in no case more statistically significant. This test suggests that the effects of disclosure on rating, quality deduction, and absolute difference are not explained by a contemporaneous change in dependent variable trends upon a physician meeting the criterion number of surveys for inclusion in disclosure.

2.5.5 Public attention's moderation of disclosure's effects

Table 9 shows the disclosure effect estimates from Models 1-3 partitioned by recent public attention.22 The proxy for recent public attention is web traffic – the number of page views of a physician's profile in the calendar month prior to a given patient visit.23 Column 1 reports the partitioned effects on rating. χ2 tests of differences in coefficients indicate a difference in the effects on rating, but in the opposite direction from that predicted in H4; the disclosure's positive effect on rating is strongest at lower levels of recent public attention. This runs counter to theory from the disclosure literature that public attention should strengthen disclosure's performance effects (Graham [2000], Parker and Nielsen [2011], Weil et al. [2006]). It may be partly explained by public attention to a disclosed rating reinforcing consensus bias and relatedly impeding rating improvement.

22 Physician fixed effects subsume a physician's 12-month consensus rating at the time of disclosure, which is potentially correlated with web traffic as well as the disclosure's effects. Including this control accordingly does not affect the results of χ2 tests for comparing effect size between partitions.

23 The statistically significant results of χ2 tests for comparing effect size between partitions are robust to using 3-calendar-month and 6-calendar-month-lagged web traffic in the partitioning. The level of significance in some cases differs, depending on the length of the lag used, but remains below 0.1.

TABLE 6: PROPENSITY-SCORE-MATCHED PHYSICIAN SAMPLE DESCRIPTIVE STATISTICS
(N / Mean / SD for physicians included in disclosure | N / Mean / SD for physicians excluded from disclosure)
Rating Analysis
Gender: 99 / 0.33 / 0.47 | 99 / 0.37 / 0.48
MD: 99 / 0.84 / 0.36 | 99 / 0.84 / 0.36
Age: 99 / 42.62 / 9.41 | 99 / 43.50 / 9.21
Years with UUHC: 99 / 7.77 / 4.33 | 99 / 7.85 / 4.11
Tenure: 99 / 0.38 / 0.48 | 99 / 0.36 / 0.48
Quality Deduction Analysis
Gender: 36 / 0.27 / 0.45 | 36 / 0.33 / 0.47
MD: 36 / 0.88 / 0.31 | 36 / 0.86 / 0.35
Age: 36 / 38.11 / 7.26 | 36 / 39.05 / 6.74
Years with UUHC: 36 / 6.35 / 3.47 | 36 / 6.55 / 3.74
Tenure: 36 / 0.33 / 0.47 | 36 / 0.27 / 0.45
Absolute Difference Analysis
Gender: 115 / 0.36 / 0.48 | 115 / 0.36 / 0.48
MD: 115 / 0.81 / 0.38 | 115 / 0.82 / 0.38
Age: 115 / 43.52 / 10.27 | 115 / 43.33 / 8.98
Years with UUHC: 115 / 7.65 / 3.74 | 115 / 7.61 / 4.13
Tenure: 115 / 0.32 / 0.46 | 115 / 0.33 / 0.47
This table shows descriptive statistics of physicians comprising the samples used for testing the robustness of physician-rating disclosure effect estimates to matching physicians on gender, possession of an MD, age, number of years employed by UUHC, and status as a tenure track employee. The rating and quality deduction samples were produced using one-to-one propensity score matching, applied to match physicians included in either of the two rating postings with those excluded from disclosure.
The sample for absolute difference was produced by using the same matching procedure, applied to match physicians included in the first upload to those included in a later upload or excluded from disclosure. Physician ages are as of January 1, 2011. The covariates in the resulting samples do not exhibit statistically significant differences between matched groups, and all are within a 0.05 propensity-score caliper, as used in prior research pairing propensity-score matching with difference-in-differences estimation (e.g., Sandino and Murphy [2010]).

TABLE 7: ROBUSTNESS TEST USING PROPENSITY-SCORE-MATCHED PHYSICIAN SAMPLES
(Coefficients with t statistics in brackets; columns: (1) Rating, (2) Quality deduction, (3) Absolute difference)
Disclosed: 0.033** [2.36] | -0.012** [-2.24] | -0.044*** [-2.97]
Gender: -0.013** [-2.15] | 0.001 [0.76] | 0.008 [1.21]
Charges: 0.000 [0.42] | 0.000 [0.41] | 0.000** [2.46]
Severity/complexity: 0.001 [1.80] | -0.000*** [-3.19] | -0.009 [-1.19]
Comorbidity: 0.022 [1.17] | -0.003 [-1.28] | -0.021 [-1.15]
Medicare or Medicaid: 0.017** [2.56] | 0.005 [0.77] | 0.003 [0.52]
Physician-week's visit count: -0.000 [-0.11] | -0.000 [-0.14] | 0.000* [1.96]
First visit: -0.038*** [-4.29] | – | 0.014*** [1.98]
English speaking: 0.069** [2.39] | – | -0.036** [-2.43]
Contemporary std. dev.: – | – | 0.373*** [8.90]
Consensus count: – | – | 0.164 [1.03]
Rating trend: – | – | 0.056* [1.97]
Age dummies°: Yes | Yes | Yes
Year dummy 2012: 0.033*** [3.46] | -0.000 [-0.23] | -0.001 [-0.07]
Year dummy 2013: 0.051** [2.27] | -0.000 [-0.03] | -0.003 [0.18]
Year dummy 2014: 0.077*** [3.23] | -0.007 [-0.74] | –
Period dummy 2: -0.008 [-0.39] | 0.008 [1.03] | 0.042** [2.18]
Period dummy 3: -0.009 [-0.41] | 0.017 [1.61] | –
Physician dummies: Yes | Yes | Yes
This table presents effect estimates of physician rating disclosure on ratings, quality deductions, and the absolute difference of ratings from prior consensus ratings, using propensity-score-matched samples of physicians included in and excluded from the assessed disclosure events. °Age dummies for Columns 1 and 3 are the psychometric set in Newman and Newman [2014], and for Column 2 are the set delineated by CMS [2016] for use in risk adjustment, with one group omitted in each case. Standard errors are clustered at the physician level. *,**,*** denote significance at the .1, .05, and .01 levels respectively. Rating N = 37,741, Quality deduction N = 9,861, Absolute difference N = 15,501.

TABLE 8: ROBUSTNESS TESTS FOR PHYSICIAN'S RECENCY OF MEETING CRITERION FOR DISCLOSURE
(Coefficients with t statistics in brackets; columns: (1)-(2) Rating, (3)-(4) Quality deduction, (5)-(6) Absolute difference; within each pair, the first column is physicians who met the disclosure criterion >= 1 year before disclosure and the second is those who met it < 1 year before)
Disclosed: 0.037*** [2.64] | 0.024 [1.09] | -0.010** [-2.41] | -0.013** [-2.33] | -0.049*** [-3.73] | -0.048*** [-2.30]
Gender: -0.017*** [-3.67] | -0.003 [-0.22] | 0.003* [1.82] | 0.005** [2.03] | 0.008* [1.87] | -0.006 [-0.49]
Charges: 0.000 [0.35] | 0.000*** [3.58] | -0.000 [-0.43] | -0.000 [-0.42] | 0.000** [2.11] | -0.000 [-0.01]
Severity/complexity: 0.002*** [2.85] | 0.000 [0.52] | 0.001 [1.04] | -0.001*** [-2.78] | -0.001** [-2.20] | 0.000 [0.04]
Comorbidity: 0.021* [1.82] | -0.001 [-0.02] | -0.003* [-1.76] | -0.003 [-1.34] | -0.022*** [-2.66] | 0.025 [0.60]
Medicare or Medicaid: 0.014** [2.41] | 0.024** [2.05] | 0.007 [1.44] | 0.001 [0.22] | 0.004 [0.80] | -0.012 [-0.89]
Physician-week's visit count: -0.000 [-0.49] | -0.000 [-0.43] | 0.000 [0.27] | -0.000 [-0.33] | 0.000 [1.13] | 0.006* [1.75]
First visit: -0.047*** [-7.60] | -0.048*** [-3.05] | – | – | 0.023*** [4.55] | 0.033** [2.36]
English speaking: 0.089*** [5.15] | 0.090* [1.80] | – | – | -0.074*** [-4.21] | -0.076 [-1.26]
Contemporary std. dev.: – | – | – | – | 0.363*** [12.13] | 0.396*** [8.33]
Consensus count: – | – | – | – | 0.432** [2.53] | -0.020 [-0.13]
Rating trend: – | – | – | – | 0.044** [2.25] | 0.061* [1.66]
Age dummies°: Yes | Yes | Yes | Yes | Yes | Yes
Year dummy 2012: 0.037*** [5.83] | 0.053*** [3.70] | 0.003 [1.62] | 0.002 [0.50] | -0.000 [-0.04] | -0.050 [-1.69]
Year dummy 2013: 0.055*** [3.41] | 0.010 [0.33] | -0.002 [-0.35] | 0.005 [0.60] | -0.013 [-1.00] | -0.016 [-0.52]
Year dummy 2014: 0.072*** [4.36] | 0.003 [0.09] | -0.009 [-1.32] | 0.003 [0.31] | – | –
Period dummy 2: -0.014 [-0.71] | 0.050* [1.67] | 0.011** [1.99] | 0.005 [0.61] | 0.062*** [3.62] | 0.019 [0.78]
Period dummy 3: -0.018 [-0.89] | 0.075* [1.95] | 0.021*** [3.03] | 0.012 [1.19] | – | –
Physician dummies: Yes | Yes | Yes | Yes | Yes | Yes
This table presents effect estimates of physician rating disclosure on ratings, quality deductions, and the absolute difference of ratings from prior consensus ratings, with samples partitioned by the length of time (>= 1 year or < 1 year) that the physicians included in disclosure first met the criterion for inclusion (at least 30 survey responses in a 12-month period) before their ratings were disclosed. °Age dummies for Columns 1-2 and 5-6 are the psychometric set in Newman and Newman [2014], and for Columns 3-4 are the set delineated by CMS [2016] for use in risk adjustment, with one group omitted in each case. Standard errors are clustered at the physician level. *,**,*** denote significance at the .1, .05, and .01 levels respectively. Rating N (>= 1 year) = 97,507 | (< 1 year) = 13,755; Quality deduction N (>= 1 year) = 33,565 | (< 1 year) = 12,809; Absolute difference N (>= 1 year) = 35,409 | (< 1 year) = 4,939.

TABLE 9: ESTIMATED EFFECTS OF PHYSICIAN RATING DISCLOSURE PARTITIONED BY PRIOR MONTH WEB TRAFFIC TO DISCLOSED INFORMATION
(Columns: (1) Rating, (2) Quality deduction, (3) Absolute difference; standard errors in parentheses)
Cutoff at bottom quartile
Lower partition: 0.137***,††† (0.040) | -0.014 (0.009) | 0.030 (0.042)
Upper partition: 0.016 (0.015) | -0.009 (0.006) | -0.056***,†† (0.009)
χ2 test z score: [2.83] | [-0.46] | [2.00]
Cutoff at median
Lower partition: 0.058** (0.023) | -0.008*** (0.004) | 0.015 (0.031)
Upper partition: 0.034 (0.028) | -0.012 (0.011) | -0.057***,†† (0.015)
χ2 test z score: [0.66] | [0.34] | [2.09]
Cutoff at top quartile
Lower partition: 0.042** (0.016) | -0.015*** (0.004) | -0.056 (0.022)
Upper partition: 0.053 (0.043) | -0.029***,†† (0.004) | -0.239***,††† (0.021)
χ2 test z score: [-0.23] | [2.47] | [5.95]
This table presents effect estimates of physician rating disclosure on ratings, quality deductions, and the absolute difference of ratings from prior consensus ratings, with samples partitioned at the physician-month level by one-calendar-month-lagged web traffic to the disclosed information. The estimates for Columns 1, 2, and 3 are from the models specified with full controls in Tables 3, 4, and 5, respectively. Below each coefficient is the corresponding standard error. Below each pair of effect estimates is a z score reported from the χ2 test of whether the lower partition effect estimate is significantly more positive than the corresponding upper partition effect estimate. Standard errors are clustered at the physician level. *,**,*** denote the estimate's significance at the .1, .05, and .01 levels, respectively. †,††,††† denote significant results for the χ2 tests at the .1, .05, and .01 levels, respectively, and are displayed next to the estimate of greatest magnitude included in the corresponding test. Significant results displayed from χ2 tests remain significant at at least the .1 level, and are either significant at the same level or one level greater or less, after partitioning by 3-calendar-month and 6-calendar-month-lagged web traffic as opposed to 1-calendar-month-lagged web traffic.
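One way to reproduce the partition comparisons reported in Table 9 is to estimate the fully specified model separately within each web-traffic partition and compare the disclosed coefficients; a z statistic built from the two clustered standard errors is a common approximation to such a test, though the thesis's exact χ2 implementation may differ. A minimal sketch under the same hypothetical schema as above, using the median cutoff:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

visits = pd.read_csv("visits.csv")  # 'web_traffic_lag1' = prior-calendar-month page views (hypothetical)

cutoff = visits["web_traffic_lag1"].median()
formula = (
    "rating ~ disclosed + C(physician_id) + C(year) + C(period) + charges"
    " + severity + comorbidity + medicare_medicaid + gender + C(age_group) + first_visit"
)

estimates = {}
for label, part in visits.groupby(visits["web_traffic_lag1"] >= cutoff):
    fit = smf.ols(formula, data=part).fit(
        cov_type="cluster", cov_kwds={"groups": part["physician_id"]}
    )
    estimates["upper" if label else "lower"] = (fit.params["disclosed"], fit.bse["disclosed"])

(b_lo, se_lo), (b_hi, se_hi) = estimates["lower"], estimates["upper"]
z = (b_lo - b_hi) / np.sqrt(se_lo**2 + se_hi**2)  # is the lower-partition effect more positive?
print(round(z, 2))
```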
FIGURE 6: EFFECTS PARTITIONED BY WEB-TRAFFIC PERCENTILE
This figure displays effect estimates for rating, quality, and consensus bias, partitioned by the physician's percentile rank by prior-calendar-month web traffic to his or her disclosed ratings. "Rating" displays the estimated effect of rating disclosure on rating in each partition. "Consensus bias" displays the positive-signed estimated effect of rating disclosure on absolute difference in each partition. "Rating" and "consensus bias" are both within the 1-5 rating scale. "Quality" displays the positive-signed estimated effect of rating disclosure on quality deduction in each partition, and is within the 0-100 range of the percent of a physician's procedures with no quality deduction.

The tests of H5 and H6 show evidence consistent with that reasoning. Consensus bias, which would make lifting a consumer rating difficult, appears more sensitive to increases in web traffic past low levels than does objective quality improvement. Those low levels are the only area in which the rating-improvement effect declines with increased web traffic.

Table 9 Column 2 shows results of the quality deduction analysis partitioned by recent public attention. The results support H5, that the effect of patient satisfaction disclosure on objectively measured quality improvement is greater following greater public attention. The correlation appears only beyond the top quartile of web traffic, though. This suggests that physicians only realize greater public attention once it reaches high levels, or that they are aware but not incrementally incentivized to improve, as manifested by objective quality measures, until the public attention reaches high levels.

Table 9 Column 3 shows results of the absolute difference analysis partitioned by recent public attention. All tested cutoffs for partitioning allow rejecting the null in favor of H6, that consensus bias is greater under greater recent public attention.24 The results show steady increases in consensus bias with greater recent public attention, beginning from the bottom quartile of web traffic. Figure 6 is a graph of the effect estimates for rating, quality, and consensus bias vis-à-vis recent public attention.

24 The models include rating trend in order to estimate consensus bias for each partition of web traffic holding rating trend constant.

2.6 CONCLUSION

This chapter provides evidence of a theoretical tradeoff of publicly disclosing consumer ratings whereby real and positive performance effects persist despite the disclosure creating a bias among raters. Specifically, a health care system's disclosure of patient ratings of its physicians elicited performance improvement by ratings and by objectively measured quality, but generated a bias in the ratings toward the given physician's published consensus. The rating improvement effect was weaker, the objective quality improvement stronger, and the consensus bias effect stronger following greater recent public attention to a physician's disclosed ratings.

Providing this evidence makes three main contributions. First, it forwards research on performance effects of disclosure.
The results establish consumer-rating disclosure as a means of driving performance. They also show that performance effects are able to persist in spite of a bias that draws ratings toward prior published consensus ratings. In the case of objectively measured performance, the results are consistent with public attention strengthening disclosure's performance effects. I also help to extend theory regarding that relationship by showing evidence consistent with public attention to consumer-rating disclosure strengthening consensus bias and relatedly impeding rating improvement.

Second, the results are particularly relevant to research on health care disclosure. Long-standing variants of health care disclosure have shown relatively little influence on consumer markets and performance effects. Physician-rating disclosure is relatively recent and is spreading. I find economically and statistically significant positive performance effects of this type of disclosure. This chapter's public attention analysis also provides some of the first direct evidence to test the suggestion that health care disclosure's performance effects are greater under greater public attention to the disclosure. The results support that notion, albeit only in the case of objectively measured performance.

Third, and finally, by establishing evidence of consensus bias, I inform the interpretation and application of consumer ratings. Consumer ratings are increasingly publicly visible. Noting a bias in ratings toward published consensus, and that the bias is stronger when web traffic is greater, informs the use of ratings to measure underlying service. In particular, consensus bias offers reason for managers to interpret a deviation of subsequent ratings from the published consensus as a dampened signal of recent trends in service. Managers who adjust for the signal dampening would be better able to use consumer ratings to infer service, which is of value in tracing the effects of service on financial performance and in effectively evaluating and rewarding employees.

CHAPTER 3
PERFORMANCE EFFECTS OF SETTING A HIGH REFERENCE POINT FOR PEER-PERFORMANCE COMPARISON

3.1 INTRODUCTION

This study addresses performance effects of setting a reference point for peer comparison above peer-median performance. Providing individuals with relative performance information (RPI), or information for comparing one's own performance to that of one's peers, elicits performance improvement in a variety of compensation settings. These include settings in which performance is not linked to pay or made visible to others (Allcott [2011], Hannan, Krishnan, and Newman [2008], Tafkov [2013]). Theories of social comparison, reference points, expectancy, and goals could guide inquiry into the performance effects of the height of reference points for peer-performance comparison that are commonly displayed along with RPI.24, 25 Such inquiry could inform the many government, non-profit, and corporate administrators who are using RPI reference points to influence constituents' performance.26 However, evidence of performance effects of RPI reference point height is lacking.

We provide such evidence through a field experiment in online education. We compare the performance effects of providing the peer top quartile as opposed to the peer median as a reference point for peer comparison within RPI. Each RPI display includes a reference point correctly labeled as one of those two alternatives.
The data include measures of a range of actions taken in online courses, as well as a log of each instance of a student who receives RPI accessing it, over multiple months. The experimental setting and intervention do not involve explicit incentives for performance, allowing for the identification of the distinct information effect of RPI reference point height. In identifying this effect, we contribute to accounting literature on how RPI affects performance (Murthy [2010], Hannan et al. [2008, 2013], Tafkov [2013]). We also contribute to economic literature on reference points for peer comparison by looking at reference points set through anonymous reports rather than through an introduction to an identifiable peer (Hanushek et al. [2003], Lavy, Silva, and Weinhardt [2012]).

24 Examples of RPI that include reference points are Gorman [2015] and Daniels and Miller [2011] in hospitals, Allcott [2011] and Schultz et al. [2007] in energy consumption, Harper et al. [2013] regarding a user-generated-content website, and Blanes i Vidal and Nossol [2011] from wholesale and retail.

25 We study reference points that are percentiles of the peer performance distribution. We use the term "reference point height" in referring to how high a percentile of peer performance is provided as a reference point.

26 Allcott [2011] and Blanes i Vidal and Nossol [2011] are examples from corporations, Gorman [2015] and Harper et al. [2013] from non-profits, and Kettle et al. [2015] and Hallsworth, List, Metcalfe and Vlaev [2014] from governments.

Multiple disciplines offer insight into how RPI and reference points increase performance apart from the rate at which performance is compensated. First, social comparison theory applied in accounting research asserts that RPI facilitates a reward for performance in the form of favorable comparison to one's peers (Brown et al. [2007], Garcia and Tor [2007], Tafkov [2013]). Second, economic research shows that providing higher reference points for total pay increases willingness to exert effort at a given piece rate, lending evidence to reference-dependent preferences for effort provision (Abeler et al. [2011]). Third, RPI reference points that are portrayed as standards for success, as in our study, may exhibit characteristics of goals. Goal theory states that goals energize and focus effort so as to increase performance (Locke and Latham [2002]). Finally, expectancy theory notes that belief in the attainability of a performance level reinforces motivation to work toward it (Atkinson [1957]). We draw from the referenced theories in predicting the effect of providing a higher RPI reference point than the median.27

Our predictions and tests add to the insight from a growing number of field studies that address the performance effects of providing RPI reference points (Azmat and Iriberri [2010], Allcott [2011], Harper et al. [2013], Schultz [2007]). These studies report heterogeneity in performance effects of RPI reference points: individuals underperforming the RPI reference point exhibit a positive performance effect, while those outperforming are less positively or even negatively affected.

27 Festinger [1954] and Smith [2000] address foundational theory regarding social comparison, Kahneman and Tversky [1979] regarding reference points and loss-aversion, Locke and Latham [2002] regarding goals, and Atkinson [1957] and Vroom [1964] regarding expectancy-based motivation.
These studies do not, though, test the performance effects of varying the height of the RPI reference point provided. Theory suggests that providing the higher of two reference points will most positively affect the performance of individuals whose performance initially lies between the two. This implies a concave relationship between the positive performance effect of providing a high reference point and an individual's initial performance. The predicted concavity includes both negative and positive effects in partitions of initial performance, and so we do not ex-ante predict the sign of the average effect.

We find the predicted concave relationship. Relative to displaying the median reference point, displaying the top-quartile reference point negatively affects the performance of initially below-median performers and positively affects the performance of initially 50th-75th percentile performers. We find that the effect among initially top quartile performers depends on the measure of performance. In our experiments, we show students either the outcome-based measure Grade or the process-based measure Activity Level. Grade is the percent of course problems correctly answered. Activity Level is a weighted sum of course actions such as video views and discussion forum posts. In the case of Grade, surveyed interest in outperforming peers persists at high levels, and top-quartile performers improve more when shown the top-quartile reference point. In the case of Activity Level, interest in outperforming peers is weaker at higher levels, and top-quartile performers improve less when shown the top-quartile reference point.

A few analyses shed further light on the effect of a high reference point for peer-performance comparison. A survey shows that individuals are significantly less confident in their ability to reach the higher, rather than lower, reference point. Combined with expectancy theory, these data suggest that the negative performance effect when below-median performers view a relatively high reference point is partially due to self-doubt. In terms of demographics, we assess gender as an effect moderator. Prior research shows gender differences in the performance effect of peer comparison when the comparison occurs through an introduction to an identifiable, high performing peer (Eagly [1978], Cross and Madson [1997]). These studies find that women exhibit more positive performance responses to such comparison, and suggest that the result arises from women being more prone to cooperate with and learn from the high-performing peer. We test whether the same result holds when comparison occurs through viewing a percentile of peer performance in a graphical display. In our setting, we do not find that gender moderates the performance effect, consistent with the moderating effect depending on introduction to a peer whom one can choose to cooperate with.

Our study makes three main contributions. First, we show effects of RPI reference point height that operate through the private display of anonymous performance information. This speaks to the growing body of economic, psychology, accounting, and management research on such performance information display as a tool for influencing performance and behavior.
This research spans the private and public sector, with outcomes including retail service, educational attainment, energy consumption, web-site content contribution, and taxpaying.28 Research on the importance of the reference point displayed along with RPI could accordingly have significant policy implications for a variety of corporate and other societal settings.

28 See Blanes i Vidal and Nossol [2011] for evidence from the retail and wholesale industry, Azmat and Iriberri [2010] regarding education, Allcott [2011] and Schultz et al. [2007] regarding energy consumption, Harper et al. [2013] regarding web-site contributions, and Hallsworth et al. [2014] regarding taxpaying.

Second, analysis of the effects of RPI reference points in isolation informs theory and empirical work in a variety of other accounting and economic streams of research.29 Research on RPI-related accounting and economic mechanisms might draw from the current paper's results in a number of ways. For example, we find weaker returns to reference point height in the presence of RPI than have been shown for targets of similar height when RPI is hidden (Erez, Early, and Hulin [1985], Locke and Latham [2002]). This offers a partial explanation for the prevalence of easy targets given that targets sometimes derive from or communicate RPI (Aranda et al. [2014], Fisher, Peffer, and Sprinkle [2003], Merchant and Manzoni [1989]).30 Also, tournament literature raises the problem of motivating those who are very far below or above a rewarded cutoff (Asch [1990], Casas-Arce and Martinez-Jerez [2009]). Our analysis shows an alternative means of performance management for these parts of the performance distribution—providing a lower reference point to low performers, and, in the case of outcome-based performance, a higher reference point to high performers. Further, evidence on the performance returns to reference point height could inform predictions of the effects of supervisor discretion in setting targets that communicate RPI (Bol et al. [2010]). Lastly, RPI reference points, especially the median and top quartile, are commonly used in measuring corporate performance and evaluating employees.31 Research in those settings can use our results to account for behavioral responses to the display of these two RPI reference points.

29 For examples of such RPI use, see Aranda, Arellano, and Davila [2014], Bol et al. [2010], and Murphy [2000] regarding target setting, Securities and Exchange Commission [2015] regarding disclosure and monitoring, and Gibbons and Roberts [2013] regarding contracting.

30 See Merchant and Manzoni [1989] for a description of the contrast between theory in favor of targets with less than 50% chance of being attained and the prevalence of much more frequently attainable targets found in practice. Bouwens and Kroos [2011], and Leone and Rock [2002] also find frequently attainable targets in practice.

31 See Bebchuk and Fried [2005], Bizjak, Lemmon, and Nguyen [2011], and Securities and Exchange Commission [2015] regarding comparison of executive compensation and corporate performance to peer group percentiles, and Berger, Harbring, and Sliwka [2013] and Grote [2005] regarding employee evaluation involving peer group percentiles.
Our third contribution is to the substantial amount of accounting research that shows that the format of information display influences decisions ranging from stock trading to assigning employee bonuses (Bloomfield, Nelson, and Smith [2006], Dilla and Steinbart [2005], Maines and McDaniel [2000]). Our study extends this research by showing how the height of the reference point included in RPI influences performance. Given the common use of RPI and associated percentiles in information displays, the height of reference points set for the purpose of peer-performance comparison is a salient feature to provide evidence on (Song et al. [2015], Bizjak et al. [2011], Gibbons and Roberts [2012] p. 67).

3.2 THEORY AND HYPOTHESIS DEVELOPMENT

Economics-based research addresses multiple functions of RPI that account for its widespread use (Gibbons and Roberts [2012] p. 67). Incorporating RPI in incentive contracts reduces the risk imposed on agents by filtering out common noise from a performance measure linked to incentives (Holmstrom [1982], Lazear and Rosen [1981]). The performance measure is then a more precise signal of effort, and attaching incentives to it imposes less risk on the agent in the form of uncontrollable events that also influence the measure (Banker and Datar [1989]). RPI is also of use to agents in forming expectations of pay for marginal effort in nonlinear incentive schemes. When pay is contingent upon reaching certain levels of relative performance, agents can use information on their proximity to those levels to form such expectations. Empirical studies show that agents subject to nonlinear-incentive schemes expend effort according to their updated rational expectations of pay for marginal effort (Asch [1990], Hannan et al. [2008], Casas-Arce and Martinez-Jerez [2009]).
Although RPI and reference points both exhibit power in motivating effort, little is known regarding the effect of RPI combined with reference points of varying heights. A number of field studies report generally positive effects of RPI performance display, but do not test the performance effects of varying the height of included reference points (Allcott [2011], Allcott and Rogers [2014], Azmat and Iriberri [2010], Harper et al. [2013], Schultz et al. [2007]). These studies show heterogeneous performance effects that depend on initial performance relative to the reference point; individuals initially underperforming a reference point exhibit a more positive effect than those initially outperforming it. Proffered explanations include the difficulty of achieving beyond an already high level, as well as a downward psychological attractive power of reference points for those performing above them (Allcott [2011], Schultz et al. [2007]). A potential implication of the observed heterogeneity is that a higher reference point, relative to which individuals would be situated differently, would yield different performance effects.

Streams of research on social comparison, reference points, goals, and expectancy all help in understanding how RPI and reference points influence performance. We address each of these streams of research to provide context for our prediction of the performance effects of displaying a relatively high RPI reference point.

Social comparison theory explains how displaying RPI creates performance incentives. RPI allows peer comparison and activates the associated incentives to outperform peers and attain a more positive self-image (Smith [2000], Brown et al. [2007]). Empirical evidence shows that RPI drives performance when peers are identifiable and when they are anonymous, in the presence and absence of performance-based pay, and when one's performance is and is not visible to others (Klar and Giladi [1998], Hannan et al. [2008], Murthy [2010], Tafkov [2013], Xiao and Lucking [2008]).

Behavioral economic research offers insight into the role of reference points, subtly implied numbers that individuals positively weight in economic decisions, in influencing effort. In particular, providing a reference point for expectations of total pay lifts effort toward the level necessary to achieve the given level of pay (Abeler et al. [2011]). Individuals anticipate feelings of loss aversion from receiving pay below the reference point. Utility in the form of a reduced negative deviation from the reference point acts as a performance incentive (Abeler et al. [2011], Farber [2008]). Reference points for effort provision come in forms other than levels of pay. For example, recent research on marathons finds that runners feel a sense of loss from exceeding round-number finishing times that serve as reference points (Markle et al. [2015]). Runners exert effort near the end of the marathon in order to finish a few seconds before the round-number time (Allen et al. [2016], Markle et al. [2015]).

The combination of RPI and reference points may also influence performance through a similar mechanism as goals. Goals are explicitly set standards for performance or outcome achievement, and may be suggested by others or set completely by personal volition. RPI reference points may assume properties of goals to the extent that individuals accept them as standards for success. The goal theory literature has established the ability of assigned goals to improve performance, without rewards for their achievement, through forces including the following: 1) directing attention, 2) energizing activity, 3) affecting persistence, and 4) leading to the arousal, discovery, and/or use of task-relevant capabilities (Locke and Bryan [1969], Locke and Latham [2002]).

Expectancy theory suggests a qualification on the power of RPI reference points to drive performance.
It states that motivation to achieve an outcome, such as performing at or above the level of a displayed RPI reference point, depends on perceived attainability of the outcome. Motivation is increasing in "expectancy," or the belief that effort will lead to the performance necessary to achieve the outcome (Atkinson [1957], Lawler and Suttle [1973], Vroom [1964]). Goal theory similarly states that belief that a goal is attainable is essential to its motivational effect (Erez and Zidon [1984], Locke et al. [1986], Locke and Latham [2002]). Perceived attainability is particularly relevant to the provision of the RPI reference points addressed in this study, which are inherently unattainable for either a large portion (in the case of median performance) or the majority (in the case of top-quartile performance) of individuals.

An additional motivating force present in our study and in a variety of corporate and public-sector contexts is a visual indication of approval for performing well relative to a peer reference point (Allcott [2011], Campbell [2002], Vanek Smith [2015]). An example of a visual used in practice is color-coding, with green indicating high and red indicating low performance. Another example is a smiley face for outperforming a reference point. We adopt smiley faces, as used in field experiments from psychology and economics, in order to test the performance effects of RPI reference points when outperforming them is visually congratulated (Allcott [2011], Schultz et al. [2007]).

The development of Hypotheses 2-4, regarding RPI reference point height, takes into account the motivating forces of RPI and reference points described above. Hypothesis 1 addresses the effect of providing RPI with a reference point in our study to test the intervention's validity as a performance management tool. All of our hypotheses are stated in the alternative form, and all are assessed using two-tailed tests.

H1: Providing relative performance information with a congratulated descriptive norm reference point for peer comparison positively affects performance.

We predict a concave relationship between an individual's performance before RPI reference point provision and the performance returns to providing a higher RPI reference point. That hypothesized concave relationship is based in part on our expectation that the top-quartile reference point will not have an incrementally positive effect relative to the median reference point for initially below-median performers. In fact, expectancy theory and the forces of reference points and social comparison even suggest the possibility of negative returns to a higher reference point for this group. The higher reference point would impose a lower value on expectancy, a construct positively related to motivation, for individuals performing beneath the lower reference point (Atkinson [1957]). Also, loss aversion felt through negative deviation from a reference point is greater the closer one is to the reference point (Kahneman and Tversky [1979]). Loss aversion is an effort-motivating force that would be weaker with a more distant reference point as long as the individual is similarly interested in surpassing either (Abeler [2011], Heath et al. [1999], Markle et al. [2014]). In terms of utility from congratulations for exceeding an RPI reference point, the lower reference point would offer greater returns to marginal effort.
In our study, individuals can anticipate the congratulations using the RPI display legend, which showed a smiley face associated with exceeding the RPI reference point. From a social comparison theory standpoint, raising the reference point increases the height of the upward social comparison for this group. Social comparison of performance involving great upward distance has been shown to cause discouragement and hurt performance (Rogers and Feller [2015]). Individuals who are below median because they struggle to interact effectively with the course might also resort to ineffective or unsustainable strategies for doing so (Hannan et al. [2008]). Research on tournaments and rank-based pay also suggests that individuals are discouraged to the point of giving up when they feel that an explicitly rewarded reference point is too high (Bandiera et al. [2013]). A similar force may apply to peer comparison reference points. Further, the upward attractive force of the median reference point may be uniquely powerful due to its relevance to comparison of oneself to what is average, given the common desire to consider oneself above average (Dolan et al. [2012], Larrick et al. [2007]). On the other hand, goal and reference point research implies some advantage of the higher reference point among initially below-median performers. Goal theory shows that, subject to goal commitment, a higher goal elicits greater performance (Locke and Latham [2002]). Also, individuals who surpass the median might be drawn higher still by the higher reference point. This concept is in line with utility from eliminating feelings of loss aversion for performance below a reference point, and with desires to reach a congratulated level of performance (Abeler [2011], Heath et al. [1999], Reno, Cialdini, and Kallgren [1993]). Although theoretical implications for the effect of providing the top-quartile as opposed to median reference point for individuals initially below both are mixed, we predict a slightly negative performance effect among these individuals.

In the partition between the two alternative reference points, forces from goal theory and expectancy theory are more aligned with a positive effect of a higher reference point than in the case of below-median performers. The 50th-75th percentile performers are more likely to view the higher reference point as attainable, and therefore to be committed to it as a goal (Locke and Latham [2002]). In terms of expectancy theory, the belief that effort will lead to performance necessary to reach the reference point would be higher for these individuals than for those who are below median (Atkinson [1957]). Also, for 50th-75th percentile performers, the higher reference point introduces the motivating force of "valence," or satisfaction from reaching a rewarded level of performance (Atkinson [1957], Lawler [1968]). The lower reference point, by contrast, offers valence from rising above a rewarded performance level only to those in this partition of initial performance who fall below the median. Other than potentially greater innate interest in and resulting peer-comparison engagement from the median reference point, theory suggests a uniformly positive effect of providing the top-quartile point instead to the 50th-75th percentile performers (Dolan et al. [2012], Larrick et al. [2007]).

Individuals in the top quartile of performance view a reference point below their position in the distribution in the case of both median and top-quartile RPI reference point display.
Neither the top-quartile nor the median reference point, then, offers valence in the form of changing one's initial performance relative to the RPI reference point. A few qualified forces might work toward a positive performance effect of providing the higher reference point to this group. Top-quartile performers are more likely to fall beneath the higher than the lower reference point. The greater prospect and instance of that event when the reference point is higher might yield stronger performance incentives. Also, to the extent that peer-comparison reference points carry downward attractive power, a higher reference point might act as a bulwark to mitigate a resulting performance decline (Schultz et al. [2007]). However, evidence of a downward attractive force of RPI reference points is mixed (Allcott [2011], Harper et al. [2013]). Further, greater innate interest in the median may drive social comparison and related performance among top-quartile performers (Dolan et al. [2012], Larrick et al. [2007]). Finally, the higher reference point reveals to top-quartile performers that they are toward the positive tail of the distribution. This could create concern that one's behavior is economically suboptimal and discourage effort (Schultz et al. [2007]). We predict a positive effect of providing the top-quartile as opposed to the median reference point among initially top-quartile performers despite the noted limitations of forces working toward that effect. We predict that performers initially between the median and top quartile, those theoretically more uniformly benefitted by viewing the top-quartile reference point, will exhibit greater positive effects from its display than will performers initially in the top quartile.

Hypotheses 2a-c predict the effect of providing the top-quartile as opposed to median reference point in each partition of initial performance addressed above. These predictions include effect directions despite the noted possibility of forces working in the opposing direction. Identifying the direction and significance of effects among partitions helps in understanding the nature of the concavity that Hypotheses 3a-c outline. Identifying the nature of any observed concavity, in turn, deepens understanding of the overall effect of the higher reference point, which Hypothesis 4 addresses.

H2a: Presenting the peer top quartile, as opposed to the peer median, as a congratulated reference point for performance negatively affects the performance of individuals who are initially below both reference points.

H2b: Presenting the peer top quartile, as opposed to the peer median, as a congratulated reference point for performance positively affects the performance of individuals who are initially between both reference points.

H2c: Presenting the peer top quartile, as opposed to the peer median, as a congratulated reference point for performance positively affects the performance of individuals who are initially above both reference points.

H3a-c address a concave relationship between initial performance and returns to a higher reference point, which H2a-c collectively imply. H3a predicts a relatively more positive effect of the top-quartile as opposed to the median reference point in the partition of performers initially between these alternative reference points than in the outer two partitions.
H3b (H3c) predicts a more positive effect in the in-between partition than in the lower (higher) partition, helping to further define the shape of performance returns to a higher reference point along the scale of initial performance.

H3a: The performance effect of presenting the peer top quartile, as opposed to the peer median, as a congratulated reference point for performance is more positive for those who are initially between median and top-quartile performance than for those who are not.

H3b: The performance effect of presenting the peer top quartile, as opposed to the peer median, as a congratulated reference point for performance is more positive for those who are initially between median and top-quartile performance than for those who are initially below median performance.

H3c: The performance effect of presenting the peer top quartile, as opposed to the peer median, as a congratulated reference point for performance is more positive for those who are initially between median and top-quartile performance than for those who are initially above top-quartile performance.

H4 addresses the average effect of providing the top-quartile as opposed to median reference point. H2a predicted a negative effect among the half of individuals initially below median. H2b predicted a positive effect among the quarter of individuals who are initially between the alternative reference points. H2c and H3c collectively predicted a positive effect for the quarter of individuals above both alternative reference points that is less positive than it is for the quarter of individuals initially in between the alternatives. H2a-c and H3a-c might involve effects in segments of initial performance that balance out, or alternatively that yield a directional net effect. We test for a net effect, but do not predict its direction given the directional opposition of predicted effects along the scale of initial performance.

H4: Presenting the peer top quartile, as opposed to the peer median, as a congratulated reference point affects performance.

3.3 FIELD SETTING AND DATA

3.3.1 Field Setting

In 2012, MIT and Harvard University jointly founded edX, a nonprofit organization offering free online courses, assessments, and certificates for higher-education courses. HarvardX, our study's field site, is the constituent organization of edX that offers courses from Harvard University faculty members. Enrollment is open globally, with no prerequisites or application process. All instruction occurs online. Course topics range from literature to statistics, and courses are open for periods ranging from a few weeks to a full year. We conducted experiments in four statistics courses that ranged in enrollment from roughly 6,000 to 25,000.

3.3.2 Experiment Design

Our study's 1x3 experiment design consisted of a control group, which received no RPI display, and two treatment groups, one that received an RPI display with the peer median reference point, and one that received an RPI display with the peer top-quartile reference point. The reference point in a display was labeled either "classmate median" or "classmate top quartile" to correctly reflect the point in the peer distribution that it represented. A student viewing a display was thereby informed which of those two standards he or she was being compared to. We provided students access for a period of over two months to a personal RPI display that included a reference point for peer comparison: either median or top-quartile performance.
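The following minimal sketch illustrates the kind of three-arm assignment the design implies, with each student mapped to one condition and to the reference-point label shown in that condition. It is an illustration only; the group names, labels, and the hash-based mechanism are our assumptions, not the study's implementation.

```python
# Illustrative three-arm assignment (hypothetical identifiers and mechanism).
import hashlib

ARMS = {
    "control": None,                       # default progress chart, no RPI
    "rpi_median": "classmate median",      # RPI display with peer-median reference point
    "rpi_top_quartile": "classmate top quartile",  # RPI display with peer top-quartile reference point
}

def assign_arm(student_id: str) -> str:
    """Deterministically map a student id to one of the three conditions so that the
    same student receives the same condition wherever he or she is enrolled."""
    digest = hashlib.sha256(student_id.encode("utf-8")).hexdigest()
    return list(ARMS)[int(digest, 16) % len(ARMS)]

arm = assign_arm("student_000123")
print(arm, "->", ARMS[arm])
```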
We delivered the RPI to each of the two treatment groups using weekly emails with a link to the RPI data displays. To increase exposure to the intervention, we placed links reading "Check your progress" within the course platform. Control group students clicking the link were directed to the default HarvardX progress chart for the course, showing completion status of individual assignments with no RPI. Treatment group students clicking the link were directed to their personal RPI display, below which sat a link to the default HarvardX progress chart for the course. The displays were updated and available daily. Appendix D contains images of the graphs and survey instruments.

In the main experiment we used "Activity Level" (defined in Appendix C and described in section 3.3.3 Data) as opposed to grade as the measure of performance. This choice was a matter of data availability; edX could not provide us daily access to grades at the time our proposal was approved, and in some courses approved for the experiment, grades were self-assigned and so not a verifiable indicator of performance. However, this study's referee observed that individuals may respond differently to RPI regarding effort, which Activity Level captures, than they do to RPI regarding an outcome, such as grade. We developed the technology necessary to acquire daily access to objective grades. We ran the experiment both as registered for the conference, with Activity Level as the dependent variable, and as advised by the referee, with grade as the dependent variable. We refer to the former as our "main experiment," and the latter as the "supplemental experiment."

3.3.3 Data

Our study benefits from rich student-course-level data. Quantitative data include each student's number of clicks on course content, number of days on which they were active, number of video views, number of discussion forum posts, grade, and several other measures of activity in the course. Qualitative data include student demographics and responses to surveys, which we incorporate in the additional analyses.

Activity Level is an aggregate measure of how active the student is in the course. It is a weighted sum of activities that represent course engagement: accessing the course, clicking on course content, watching videos, interacting in the discussion forum, and attempting problems. The weight applied to each summand approximately scales the summand's historical mean to the historical mean of video views. The historical means were measured using data from past iterations of the host courses. Grade represents the percent of the course's total problems that the student has answered correctly. Problems can be completed asynchronously. This facilitates a flexible, modular learning environment that students can use to suit their particular educational needs. The low mean Grade for students who have attempted problems does not reflect low accuracy, but rather students selecting which material they will complete and having less-than-perfect accuracy in completing the material. Success by Grade in our setting could be thought of as similar to success measured by Academy Awards: a director's number of awards is a function both of the number of movies the director chooses to make and the director's success rate in making awarded movies. Appendix C contains a full list of variable definitions.
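The sketch below illustrates the measure construction described above: each engagement component is weighted so that its historical mean is rescaled to the historical mean of video views, and Grade is the percent of the course's total problems answered correctly. The component names and the historical means are placeholders for illustration, not values from the study.

```python
# Illustrative sketch of the Activity Level and Grade measures (hypothetical inputs).

# Historical per-student means from past course iterations (placeholder numbers).
historical_means = {
    "course_accesses": 20.0,
    "content_clicks": 150.0,
    "video_views": 30.0,
    "forum_interactions": 2.0,
    "problem_attempts": 25.0,
}

# Weight each component so that its historical mean is rescaled to the video-view mean.
video_mean = historical_means["video_views"]
weights = {name: video_mean / mean for name, mean in historical_means.items()}

def activity_level(student_counts: dict) -> float:
    """Weighted sum of engagement counts for one student."""
    return sum(weights[name] * student_counts.get(name, 0) for name in weights)

def grade(correct_answers: int, total_course_problems: int) -> float:
    """Percent of the course's total problems answered correctly."""
    return 100.0 * correct_answers / total_course_problems

example = {"course_accesses": 12, "content_clicks": 90, "video_views": 18,
           "forum_interactions": 1, "problem_attempts": 10}
print(round(activity_level(example), 1), round(grade(30, 200), 1))
```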
The dependent variable in the main experiment is Δ Activity Level, and in the supplemental experiment is Δ Grade.32 The other variables are used in the same manner across both the main and supplemental experiments.

32 We calculate Activity Level and Grade in both experiments, but only calculate Δ Activity Level and Δ Grade for the respective experiments wherein each is the dependent variable. This is due to limitations on the longitudinal history in our raw data source for each when it is not the dependent variable. It is outside the scope of this paper to show how Δ Activity Level and Δ Grade correlate during our experiment. Including Activity Level and Grade in descriptive statistics, though, shows how the two correlate over the span of a course.

3.4 ANALYSIS

3.4.1 Analytical Approach

To make best use of the data, including null results, our study draws both from Null Hypothesis Significance Testing (NHST) and Bayesian analysis. We conduct NHST using ordinary least squares regressions for each hypothesis. When the data fail to reject a null hypothesis, or when our alternative hypothesis predicts no relation, we conduct Bayesian analysis indicating how much more probable we can expect a significant relation to be than we could before the realization of the data.

In selecting our sample, we follow precedent from past economic research in similar field settings, and apply guidance from a methodology study on experiments in online courses. One empirical challenge is that of zero-inflation from the large percentage of individuals who enroll and then do not participate in these courses (Lamb et al. [2010]). Enrolling in a course is free, so individuals often enroll in a noncommittal manner. We exclude from our study students who enroll in, but do not access, the course. Of those who access the course, a large number do not try graded content. We exclude these individuals in the supplemental experiment to avoid zero-inflating the displayed standard of performance. A similar restriction of a sample to the portion of active online community members can be seen in Harper et al. [2013]. This provides individuals with a comparison to others that they deem as similar; such similarity is key to engagement in social comparison and the related effectiveness of RPI (Harper et al. [2013], Tafkov [2013]). Another empirical challenge is the influence of a small number of outliers who utilize an online course approximately ten times more than the 99th percentile (Lamb et al. [2010]). We winsorize values for Activity Level and its component Problem Attempts (number of problems attempted) at their 99th percentiles before using either as a dependent variable to ensure that a small number of extreme outliers do not drive or offset results. This is not necessary for Grade, which is inherently capped at 100. We cluster standard errors at the student level to correct for auto-correlation from students who enroll in more than one course hosting the experiment. Any students present in more than one course hosting the experiment are included in the same experimental group in all courses, and their experimental group membership was, as with all students, set through random assignment.
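The following sketch shows the kind of specification this approach implies: winsorizing the dependent variable at its 99th percentile, estimating OLS with course fixed effects and student-clustered standard errors, and, for null results, one common BIC-based approximation to a Bayes factor. The column names and data source are hypothetical placeholders, and the chapter does not state how its Bayes factors are computed, so the last step is only an illustration.

```python
# Sketch of the estimation approach described above (hypothetical column names).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("student_course_panel.csv")  # placeholder data source

# Winsorize the dependent variable at its 99th percentile to keep a handful of
# extreme users from driving or offsetting the estimates.
cap = df["delta_activity_level"].quantile(0.99)
df["delta_activity_level_w"] = df["delta_activity_level"].clip(upper=cap)

# OLS with course fixed effects; standard errors clustered by student because a
# student can appear in more than one course hosting the experiment.
alt = smf.ols("delta_activity_level_w ~ C(treatment) + C(course_id)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["student_id"]}
)
null = smf.ols("delta_activity_level_w ~ C(course_id)", data=df).fit()
print(alt.summary())

# Illustrative large-sample approximation of a Bayes factor from the two models' BICs.
bf_10 = np.exp((null.bic - alt.bic) / 2.0)
print("Approximate Bayes factor in favor of a treatment effect:", bf_10)
```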
3.4.2 Descriptive Statistics

Tables 10 and 11 show the sample selection and descriptive statistics for the main and supplemental experiments, respectively. The courses attract individuals who are on average in their 30s. The majority are male, have at least a bachelor's degree, and live in a developed country. Of those who responded to the pre-course survey, the average student is somewhat to very familiar with the course content and intends to complete at least some course content. The baseline means for Activity Level and Grade are 96.5 and 19.02, allowing students an opportunity to reveal their level of initial performance before the experiment began. Activity Level rises by 17.56 and Grade by 1.98 in the control groups during the main and supplemental experiments, respectively, indicating substantial activity during the experiments separate from the display of RPI.

Tables 12 and 13 show the correlation matrices for each experiment. Both Age and Developed Country are positively correlated with our study's measures of performance. Survey responses indicating prior experience with online courses and commitment to complete the course are also generally positively correlated with performance. Female gender is negatively correlated with prior experience and commitment. Likely through those mediators, it is also negatively correlated with performance.

Tables 14 and 15 show the geographic distribution of students. Students are widely dispersed around the globe. Europe and North America are about equally represented. Asia is close behind, at about 80% of the number from Europe. Africa and South America each account for about 10-30% of the better-represented continents.

Table 10: Sample Selection and Descriptive Statistics for Main Experiment

Panel A: Sample Selection
Total Enrollment: 24,554
Exclude Students who did not Access the Course: 9,375
Final Sample: 15,179

Panel B: Descriptive Statistics (N, Mean, Std. Dev., 25%, 75%)
Gender: 13,260, 0.32, 0.46, 0, 1
Age: 10,013, 32.03, 9.75, 25, 37
Level of Education: 10,264, 0.84, 0.36, 1, 1
Developed Country: 11,599, 0.62, 0.48, 0, 1
Familiarity With Subject: 532, 1.46, 0.91, 1, 2
Commitment to Complete Course: 512, 2.57, 0.61, 2, 3
Number of Online Courses Previously Enrolled-In: 503, 5.97, 4.55, 2, 5
Number of Online Courses Previously Completed: 526, 3.33, 3.82, 0, 4
Grade: 10,169, 9.31, 23.67, 0, 10
Activity Level: 15,177, 117.73, 247.74, 5, 70
Δ Activity Level: 15,177, 21.23, 89.43, 0, 5

This table shows the sample selection and descriptive statistics for the main experiment. Activity Level is the displayed measure of performance in RPI. Demographic data are missing for students who did not fill it in when asked in the registration process and within the course. Grade is missing for students whose records were no longer in the course after we completed development of a code for accessing grade data. Grade is only provided in the main experiment for descriptive purposes. "Level of Education" is an indicator variable for an individual holding a bachelor's or higher degree. "Familiarity With Subject" is on an increasing scale of 0-4. "Commitment to Complete Course" is on an increasing scale of 1-3.

Table 11: Sample Selection and Descriptive Statistics for Supplemental Experiment

Panel A: Sample Selection
Total Enrollment: 28,057
Exclude Students who did not try Graded Content: 23,597
Final Sample: 4,460

Panel B: Descriptive Statistics (N, Mean, Std. Dev., 25%, 75%)
Gender: 3,902, 0.30, 0.45, 0, 1
Age: 3,772, 30.62, 28, 24, 35
Level of Education: 3,857, 0.81, 0.38, 1, 1
Developed Country: 4,367, 0.64, 0.47, 0, 1
Familiarity With Subject: 2,242, 1.57, 0.97, 1, 2
Commitment to Complete Course: 2,404, 2.16, 0.92, 2, 3
Number of Online Courses Previously Enrolled-In: 2,221, 3.44, 3.75, 1, 5
Number of Online Courses Previously Completed: 2,221, 1.86, 2.86, 0, 2
Activity Level: 4,460, 222.43, 230.17, 11, 354
Problem Attempts: 4,460, 94.05, 122.40, 10, 146
Δ Problem Attempts: 4,460, 9.40, 36.74, 0, 0
Problem-Attempt Accuracy: 588, 0.28, 0.33, 0.12, 0.38
Grade: 4,460, 21.65, 28.95, 1, 34
Δ Grade: 4,460, 2.63, 11.22, 0, 0

This table shows the sample selection and descriptive statistics for the supplemental experiment. Grade is the displayed measure of performance in RPI. Demographic data are missing for students who did not fill it in when asked in the registration process and within the course. Level of Education is an indicator variable for an individual holding a bachelor's or higher degree. "Familiarity With Subject" is on an increasing scale of 0-4. "Commitment to Complete Course" is on an increasing scale of 1-3.

Table 12: Correlation of Descriptive Statistics for Main Experiment
Each row lists the variable's correlations with the variables above it, in order (I, II, III, ...).
I Gender
II Age: -0.101***
III Level of Education: 0.022**, 0.320***
IV Developed Country: 0.059***, 0.166***, 0.106***
V Familiarity With Subject: -0.192***, 0.093*, 0.080, 0.011
VI Commitment to Complete Course: 0.004, -0.031, -0.063, -0.126***, 0.073
VII Num. Online Courses Prev. Enrolled-In: -0.125*, 0.288***, -0.016, 0.089*, 0.228***, 0.032
VIII Num. Online Courses Prev. Completed: -0.143***, 0.325***, 0.027, 0.088*, 0.343***, 0.123**, 0.727***
IX Grade: -0.052***, 0.046***, -0.029***, 0.052***, 0.042, 0.082, -0.006, 0.070
X Activity Level: -0.035***, 0.058***, 0.005, 0.041***, 0.031, 0.062, -0.010, 0.041, 0.775***
XI Δ Activity Level: -0.015***, 0.034***, -0.018*, 0.040***, 0.060, 0.046, -0.021, 0.001, 0.376***, 0.594***
This table shows a correlation matrix for the variables in the main experiment. *, **, *** denote significance at the .1, .05, and .01 levels, respectively.

Table 13: Correlation of Descriptive Statistics for Supplemental Experiment
Each row lists the variable's correlations with the variables above it, in order (I, II, III, ...).
I Gender
II Age: -0.128***
III Level of Education: 0.016, 0.344***
IV Developed Country: 0.074***, 0.144***, 0.092***
V Familiarity With Subject: 0.027, 0.084***, 0.158***, 0.066***
VI Commitment to Complete Course: -0.013, -0.016, -0.032, -0.086***, 0.014
VII Num. Online Courses Prev. Enrolled-In: -0.123***, 0.211***, 0.037*, 0.062***, 0.054**, -0.015
VIII Num. Online Courses Prev. Completed: -0.135***, 0.253***, 0.025, 0.078***, 0.082***, 0.032, 0.789***
IX Activity Level: -0.017, 0.058***, -0.010, 0.027*, 0.016, 0.094***, 0.062***, 0.100***
X Problem Attempts: -0.022, 0.017, 0.006, 0.018, 0.0435**, 0.102***, 0.054**, 0.098***, 0.888***
XI Δ Problem Attempts: -0.017, 0.014, 0.029*, 0.022, -0.020, 0.042**, 0.034, 0.026, 0.292***, 0.230***
XII Problem-Attempt Accuracy: -0.055, 0.029, -0.012, -0.008, -0.013, 0.020, 0.066, 0.103*, 0.039, 0.003, 0.045
XIII Grade: -0.045***, 0.034***, 0.010*, 0.050***, 0.047**, 0.094***, 0.102***, 0.148***, 0.856***, 0.766***, 0.277***, 0.231***
XIV Δ Grade: -0.020, 0.046***, -0.028*, 0.023, -0.020, 0.038*, 0.042**, 0.031, 0.271***, 0.209***, 0.922***, 0.275***, 0.315***
This table shows a correlation matrix for the variables in the supplemental experiment. *, **, *** denote significance at the .1, .05, and .01 levels, respectively.

Table 14: Number of Students by Continent in Main Experiment (Developed Country, Developing Country)
Africa: 0, 626
Asia: 113, 2,629
Europe: 3,329, 1
North America: 3,533, 554
Oceania: 0, 78
South America: 0, 578
Total: 6,975, 4,466
This table shows the distribution of individuals in the main experiment across the developed and developing world, subcategorized by continent.

Table 15: Number of Students by Continent in Supplemental Experiment (Developed Country, Developing Country)
Africa: 0, 107
Asia: 36, 963
Europe: 1,277, 2
North America: 1,398, 163
Oceania: 114, 1
South America: 0, 306
Total: 2,825, 1,542
This table shows the distribution of individuals in the supplemental experiment across the developed and developing world, subcategorized by continent.

Tables 16 and 17 show RPI access across treatments. The grade RPI attracts more attention than the Activity Level RPI. In both cases, only a small portion of the sample chooses to view the RPI. Future research can explore whether RPI could have a much stronger effect in practice were it delivered with explicit incentives for opening it and thereby viewed more broadly. Our study speaks to settings in which viewing RPI is voluntary and a relatively small number of individuals opt to do so.

Tables 18 and 19 show the distribution of the dependent variable for each experiment (Δ Activity Level for the main experiment, and Δ Grade for the supplemental experiment) by treatment and level of initial performance. Figures 7 and 8 provide a graphical representation of the distribution of the dependent variable for each experiment.

3.4.3 Results

In both the main and supplemental experiments, we find evidence to support H1, that providing RPI positively affects performance. The effect estimates for Activity Level and Grade appear in Column 1 of Table 20 Panel A and Table 21 Panel A. Both experiments also provide evidence to support H2a, that the lower reference point more positively affects performance than the higher reference point among initially low performers. These results are shown in Column 2 of Table 20 Panel A and Table 21 Panel A. In the main experiment, we find statistically significant evidence in support of H2b. Specifically, the higher reference point more positively affects performance than the lower reference point among individuals initially in between these two reference points. Column 3 of Table 20 Panel A shows this result.
We find similar evidence, although not at a statistically significant level, in the supplemental field experiment, shown in Column 3 of Table 21 Panel A. The Bayes Factor of 2.04 warrants updating a prior of no relationship in favor of the model's estimate of a positive relationship to a level twice the probability before the realization of the data. Among individuals initially in the top quartile of performance, the estimate in Column 4 of Table 21 Panel A supports H2c, that displaying the higher rather than lower reference point for Grade yields a more positive performance effect. This result does not hold when the performance measure is Activity Level, as is visible in Column 4 of Table 20 Panel A. In fact, the substantial Bayes Factor of over 20 suggests that providing the higher rather than lower RPI reference point yields a negative effect among initially top-quartile performers. Our hypothesis development noted reasons why this might be the case, including a lack of desire to be above the top quartile of peers by a process measure, and even concerns that one is putting in more than the optimal amount of effort. We address this result conceptually in Section 3.5: Discussion.

The results in each partition of initial performance form the basis for a concave relationship between initial performance and the positive effect of displaying the higher rather than lower reference point, as predicted in H3a. In Table 20 Panel B and in Table 21 Panel B, Column 1 shows a concave relationship, captured by the positive coefficient on the interaction of RPI_T and Initially Third Quartile. Third quartile refers to the 50th-75th percentile. Column 2 in each of those two panels reveals that the concavity is in part due to the higher reference point yielding greater performance effects among the initially third quartile performers than among low performers, as predicted in H3b. Column 3 in each of those two panels tests H3c, that the concavity is in part due to the higher reference point yielding greater performance effects among the third quartile than the top quartile of performers. This is the case in the main experiment, with Activity Level as the measure of performance. In the supplemental experiment, with Grade as the measure of performance, the higher reference point does not yield significantly better performance results among third quartile than top quartile performers. The Bayes Factor of 1.93 for that test indicates that such a relation is roughly twice as likely in light of the data.

Table 16: RPI Access in Main Experiment
RPI: Never Accessed RPI 9,752; Accessed RPI Once 213; Accessed RPI More than Once 367
RPI_M: Never Accessed RPI 4,890; Accessed RPI Once 108; Accessed RPI More than Once 176
RPI_T: Never Accessed RPI 4,862; Accessed RPI Once 105; Accessed RPI More than Once 191
This table describes the distribution of RPI graph access in the main experiment. The table categorizes individuals by experimental condition. The differences in RPI access between the RPI_M and RPI_T groups are not statistically significant.

Table 17: RPI Access in Supplemental Experiment
RPI: Never Accessed RPI 2,995; Accessed RPI Once 409; Accessed RPI More than Once 372
RPI_M: Never Accessed RPI 1,503; Accessed RPI Once 203; Accessed RPI More than Once 180
RPI_T: Never Accessed RPI 1,503; Accessed RPI Once 206; Accessed RPI More than Once 192
This table describes the distribution of RPI graph access in the supplemental experiment. The table categorizes individuals by experimental condition. The differences in RPI access between the RPI_M and RPI_T groups are not statistically significant.

Table 18: Distribution of Δ Activity Level in Main Experiment (partition, N, mean Δ Activity Level)
Control: Initially Below Average, 2,659, 10.78; Initially Third Quartile, 1,124, 16.02; Initially Top Quartile, 1,056, 36.27; All, 4,839, 17.56
RPI_M: Initially Below Average, 2,966, 16.43; Initially Third Quartile, 1,068, 16.48; Initially Top Quartile, 1,140, 45.33; All, 5,174, 22.81
RPI_T: Initially Below Average, 2,916, 12.54; Initially Third Quartile, 1,094, 24.96; Initially Top Quartile, 1,148, 35.00; All, 5,158, 20.18
This table shows the mean Δ Activity Level for individuals in the main experiment. The table categorizes individuals by experimental condition and by initial performance level.

Table 19: Distribution of Δ Grade in Supplemental Experiment (partition, N, mean Δ Grade)
Control: Initially Below Average, 995, 2.16; Initially Third Quartile, 234, 1.94; Initially Top Quartile, 236, 1.22; All, 1,465, 1.98
RPI_M: Initially Below Average, 1,032, 3.95; Initially Third Quartile, 234, 1.73; Initially Top Quartile, 237, 0.65; All, 1,503, 3.09
RPI_T: Initially Below Average, 1,023, 2.95; Initially Third Quartile, 233, 3.04; Initially Top Quartile, 236, 1.98; All, 1,492, 2.81
This table shows the mean Δ Grade for individuals in the supplemental experiment. The table categorizes individuals by experimental condition and by initial performance level.

Figure 7: Main Experiment Outcomes by Treatment and Initial Performance
This figure shows the mean Δ Activity Level for each treatment and in each partition of initial performance. These data are from the main experiment, with Activity Level as the displayed measure of performance in RPI.

Figure 8: Supplemental Experiment Outcomes by Treatment and Initial Performance
This figure shows the mean Δ Grade for each treatment and in each partition of initial performance. These data are from the supplemental experiment, with grade as the displayed measure of performance in RPI.

Table 20: RPI and Reference Point Effects in Main Experiment
Panel A. RPI Effect and Reference Point Partitioned Effects (dependent variable: Δ Activity Level; t-statistics in brackets)
Column (1), sample All, N = 15,126: RPI 2.77** [2.46]
Column (2), sample RPI & Init. Below Average, N = 5,882: RPI_T -3.71** [-2.21]
Column (3), sample RPI & Init. Third Quartile, N = 2,168: RPI_T 7.43** [2.24]
Column (4), sample RPI & Init. Top Quartile, N = 2,288: RPI_T -6.57 [-1.62]; Bayes Factor 20.81
Course fixed effects included and standard errors clustered by student in all columns.
This panel shows effect estimates for the main experiment. Column 1 shows the effect of RPI. Columns 2-4 show, for individuals in the RPI condition, the effect of displaying RPI with the top quartile as opposed to the median reference point. These columns are partitioned by an individual's initial level of performance. T-statistics are in brackets. *, **, *** denote statistical significance at the .1, .05, and .01 levels, respectively.

Table 20: RPI and Reference Point Effects in Main Experiment (Continued)
Panel B. Reference Point Effect Concavity and Net Effect (dependent variable: Δ Activity Level; t-statistics in brackets)
Column (1), sample RPI, N = 10,332: RPI_T -4.77*** [-2.70]; Init. Third Quartile -3.31 [-1.63]; RPI_T x Init. Third Quartile 12.46*** [3.29]
Column (2), sample RPI & (Init. Below Average or Init. Third Quartile), N = 8,044: RPI_T -3.52** [-2.09]; Init. Third Quartile 4.49** [2.19]; RPI_T x Init. Third Quartile 11.04*** [2.87]
Column (3), sample RPI & (Init. Third Quartile or Init. Top Quartile), N = 4,450: RPI_T -6.57 [-1.62]; Init. Third Quartile -22.61*** [-6.73]; RPI_T x Init. Third Quartile 13.69*** [2.70]
Column (4), sample RPI, N = 10,332: RPI_T -2.14 [-1.32]; Bayes Factor 3.79
Course fixed effects included and standard errors clustered by student in all columns.
This panel shows effect estimates for the main experiment. Columns 1-3 illustrate, with the coefficient on RPI_T x Init. Third Quartile, the concavity in initial performance of providing the top quartile as opposed to the median reference point. The effect is concave in initial performance in that it is most positive for individuals in the third quartile of initial performance. Column 4 shows the average effect of providing the top-quartile as opposed to median reference point. T-statistics are in brackets. *, **, *** denote statistical significance at the .1, .05, and .01 levels, respectively.

Table 21: RPI and Reference Point Effects from Supplemental Experiment
Panel A. RPI Net Effect and Reference Point Partitioned Effects (dependent variable: Δ Grade; t-statistics in brackets)
Column (1), sample All, N = 4,460: RPI 0.97*** [3.06]
Column (2), sample RPI & Init. Below Average, N = 2,055: RPI_T -1.00* [-1.67]
Column (3), sample RPI & Init. Third Quartile, N = 467: RPI_T 1.31 [1.41]; Bayes Factor 2.04
Column (4), sample RPI & Init. Top Quartile, N = 473: RPI_T 1.32** [2.14]
Course fixed effects included in all columns.
This panel shows effect estimates for the supplemental experiment. Column 1 shows the effect of RPI. Columns 2-4 show, for individuals in the RPI condition, the effect of displaying RPI with the top quartile as opposed to the median reference point. These columns are partitioned by an individual's initial level of performance. T-statistics are in brackets. *, **, *** denote statistical significance at the .1, .05, and .01 levels, respectively.

Table 21: RPI and Reference Point Effects from Supplemental Experiment (Continued)
Panel B. Reference Point Effect Concavity and Net Effect (dependent variable: Δ Grade; t-statistics in brackets)
Column (1), sample RPI, N = 2,995: RPI_T -0.56 [-1.13]; Init. Third Quartile -1.61** [-2.53]; RPI_T x Init. Third Quartile 1.88* [1.77]
Column (2), sample RPI & (Init. Below Average or Init. Third Quartile), N = 2,522: RPI_T -1.00* [-1.67]; Init. Third Quartile -2.22*** [-3.23]; RPI_T x Init. Third Quartile 2.31** [2.09]
Column (3), sample RPI & (Init. Third Quartile or Init. Top Quartile), N = 923: RPI_T 1.32** [2.14]; Init. Third Quartile 1.13* [1.86]; RPI_T x Init. Third Quartile -0.062 [-0.05]; Bayes Factor 1.93
Column (4), sample RPI, N = 2,995: RPI_T -0.27 [-0.61]; Bayes Factor 2.79
This panel shows effect estimates for the supplemental experiment. Columns 1-3 illustrate, with the coefficient on RPI_T x Init. Third Quartile, the concavity in initial performance of the effect of providing the top quartile rather than the median reference point. The effect is concave in initial performance in that it is most positive for individuals in the third quartile of initial performance. Column 4 shows the average effect of providing the top-quartile as opposed to median reference point. T-statistics are in brackets. *, **, *** denote statistical significance at the .1, .05, and .01 levels, respectively.
Figures 9 and 10 provide a graphical representation of the concave relationship between the performance effects of providing a relatively high reference point and an individual's initial performance. The tendency of high performers to respond with little-to-no competitiveness upon viewing the high reference point for Activity Level produces a hill shape in the main experiment, shown in Figure 9. The tendency of high performers to respond competitively to viewing the high reference point for Grade produces a plateau shape in the supplemental experiment, shown in Figure 10.

We test for an average performance effect with H4. We do not find an average effect of providing the top-quartile as opposed to median reference point. The tests of this hypothesis are shown in Column 4 of Table 20 Panel B and Table 21 Panel B for the main and supplemental experiment, respectively. The Bayes Factor of 3.79 for the main experiment, with Activity Level as the performance measure, suggests that we can update a prior of no relationship in favor of the model's estimate of a negative effect to a level of 3.79 times the probability before the realization of the data. In the supplemental experiment, with Grade as the performance measure, the Bayes Factor is 2.79 in support of updating a prior of no relationship to that of a negative effect.

3.4.4 Additional Analyses

To illustrate drivers and determinants of the main effects, we present the results of a few additional analyses. The first set of additional analyses addresses differences in performance effects by gender. Studies of social interaction point to gender differences in cooperativeness and in the benefit derived from interactions (Eagly [1978], Cross and Madson [1997]). Recent evidence suggests females exhibit more positive performance effects from interaction with high-performing peers than do males (Lavy et al. [2008]). We test whether females similarly benefit more from comparison to a high reference point of peer performance, or whether such gender-based heterogeneity is weaker when the element of social interaction in peer comparison is weaker. Three other background variables that the majority of students provide information on are age, residence in a developed country, and level of education. We include these, as well as original performance, in a fully interacted model with RPI reference point height in testing the moderating effects of gender. The results are shown in Table 22. We do not find that women benefit more than men from the provision of RPI, or from the provision of the relatively higher reference point within RPI. The lack of a differential effect is in line with theory that the gender differential in performance response to peer comparison derives from the associated opportunities to interact and cooperate with an individual revealed to be a high performer, which anonymous reporting does not provide.

In Table 23, we look at the source of the improvement in Grade in the supplemental experiment. The result may arise through a quantity, quality, or joint quantity-quality mechanism. Specifically, RPI may motivate students to attempt more problems, to answer problems with greater accuracy, or some combination of the two. We find that Grade RPI display led to a higher quantity of problems attempted, but not a statistically significant increase in the accuracy of problem attempts.
This suggests that the RPI led to increased effort as opposed to aptitude. This result is in line with theories of reference-dependent preferences for effort provision (Abeler et al. [2011]).

A final set of additional analyses draws on a post-experiment survey that measured interest in the alternative reference points and confidence in their attainability. Tables 28 and 29 tabulate the survey responses and related tests. In the case of Activity Level, interest in viewing and outperforming the lower reference point is higher than it is for the higher reference point. Some individuals are interested in outperforming the median, but less interested in outperforming the top quartile. In the case of Grade, we found equal interest in viewing the two reference points. We also found interest in outperforming peers that persists in its strength even at levels above the higher reference point. These results help to understand the drop-off in the performance effect of the higher reference point for those who have exceeded it when performance is process based (Activity Level), but not when it is outcome based (Grade). In the case of Activity-Level RPI, interest in the high reference point is lower, and those who have exceeded it may feel complacent and even wonder whether their level of process performance is suboptimal. In the case of Grade RPI, by contrast, the survey evidence suggests a desire to outperform the higher reference point, a goal that the top-quartile RPI treatment facilitates by showing individuals that reference point. In the case of both outcome and process performance, confidence in the achievability of median performance is significantly greater than confidence in the achievability of top-quartile performance. We find that the lower performers, those for whom the top quartile is particularly distant, perform better if shown the relatively lower reference point. This result is in line with the predictions of expectancy theory that motivation to achieve an outcome is increasing in its perceived attainability.

3.4.5 Alternative Model Specification

In analyzing data and receiving workshop feedback, we have come across a more parsimonious display of our hypothesis tests than breaking our sample into several partitions for subsample comparisons as we did in Tables 18-21. Given that our tests are comparing means, a cell means model allows displaying each mean of interest in each subsample in a grid. This is shown in Tables 24 and 26 for the main and supplemental experiments, respectively. F tests then allow comparing means within the model. Finally, χ2 tests allow comparing effect estimates, necessary to test whether the effect of the relatively high reference point is indeed statistically significantly stronger among individuals initially between as opposed to outside the two alternative reference points. The F and χ2 test results are in Tables 25 and 27, which address the main and supplemental experiments, respectively. This specification yields the same rejections of null hypotheses at the same levels of statistical significance as do the specifications in Tables 20-21.

The last additional analysis looks at the effect of RPI regarding a reported measure (Grade) on the correlation between grade and an unreported measure (Activity Level). Table 30 shows a decrease in correlation after reporting grade, as captured by the negative coefficient on the interaction of Activity Level at Course End and RPI.
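In simplified notation, the specification underlying Table 30 appears to be the following; the symbols are ours, not the chapter's, and are offered only to make the interaction term explicit.

\[
\text{Grade}^{\text{end}}_i = \beta_0 + \beta_1\,\text{RPI}_i + \beta_2\,\text{ActivityLevel}^{\text{end}}_i + \beta_3\big(\text{ActivityLevel}^{\text{end}}_i \times \text{RPI}_i\big) + \varepsilon_i,
\]

where RPI_i indicates assignment to an RPI treatment and both measures are taken at course end. Under this reading, the negative and significant estimate of \(\beta_3\) (-0.104 in Table 30) is what indicates that, once Grade is privately reported, it tracks the unreported Activity Level measure less closely.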
3.5 DISCUSSION This study most directly contributes to a growing body of economic, psychology, accounting and management literature on the distinct effects of RPI, or those separate from pay for or visibility of relative performance (Blanes i Vidal and Nossol [2011], Hannan et al. [2008], Harper et al. [2013], Tafkov [2013]). While such applications of RPI often include a reference point for peer comparison, little, if any, empirical evidence has established the effects of RPI reference point height (Allcott [2011], Harper et al. [2013]). Principally, we offer some of the first evidence that the performance effect of providing a relatively high RPI reference point depends on initial performance. Our study indicates the 88 Table 22: RPI and Reference Point Effects by Demographics 1 2 3 4 Δ  Activity  Level Δ  Grade RPI 1.00 0.04 [0.26] [0.04] RPI_T 8.71* -­‐‑2.40 [1.74] [-­‐‑1.46] Gender -­‐‑4.67*** -­‐‑0.95 -­‐‑0.20 -­‐‑0.77 [2.97] [-­‐‑0.47] [-­‐‑0.37] [-­‐‑1.11] Level  of  Education 3.27 0.02 0.94 -­‐‑0.16 [1.57] [0.01] [1.62] [-­‐‑0.16] Age 0.02 0.30** -­‐‑0.02 0.03 [0.25] [2.56] [-­‐‑0.87] [0.66] Developed  Country 5.62 5.05*** 0.39 -­‐‑0.72 [3.67] [2.63] [0.71] [-­‐‑0.93] RPI  x  Gender 2.30 -­‐‑0.363 [1.12] [-­‐‑0.48] RPI  x  Level  of  Education -­‐‑2.75 -­‐‑0.11 [-­‐‑1.03] [-­‐‑0.13] RPI  x  Age 0.13 0.03 [1.11] [0.90] RPI  x  Developed  Country -­‐‑2.12 -­‐‑0.05 [-­‐‑1.06] [-­‐‑0.07] RPI_T  x  Gender -­‐‑2.73 0.29 [-­‐‑1.02] [0.29] RPI_T  x  Level  of  Education 0.97 2.17* [0.29] [1.90] RPI_T  x  Age -­‐‑0.27* -­‐‑0.03 [-­‐‑1.85] [-­‐‑0.71] RPI_T  x  Developed  Country -­‐‑3.08 2.18** [-­‐‑1.20] [2.20] N 9,598 6,450 3,611 2,437 All RPI All   RPI   Sample Main               Main                 Supplemental   Supplemental   Experiment                E    x  periment             Experiment Experiment This   table   shows  effect   estimates   for  providing  RPI   and  providing   the   top-quartile   rather   than  median  reference  point  by  the  study's  demographic  variables.  The  interaction  terms  in   Columns  1-2  illustrate  any  dependence  of  the  effect  of  RPI  and  the  higher  reference  point  on   Δ  Activity  Level  in  the  main  experiment.  The  interaction  terms  in  Columns  3-4  illustrate  any   dependence   of   the   effect   of   RPI   and   the   higher   reference   point   on   Δ   Grade   in   the   supplemental  experiment.  T-statistics  are   in  brackets.   *,**,***  denote  statistical   significance   at  the  .1,  .05,  and  .01  levels,  respectively.   89 Table 23: Effect of Grade RPI on Problem-Attempt Quantity and Accuracy 1 2 Problem-­‐‑Attempt   Δ  Problem  Attempts Accuracy RPI 2.28** 3.27 [2.06] [1.41] N 4,460 588 Sample All Attempted  Problems This  table  shows  effect  estimates  for  the  supplemental  experiment  of  the  effect   of  RPI  on  problem-attempt  quantity  and  accuracy.  Problem-Attempt  Quantity   is   captured   by   Δ   Problem   Attempts.   Problem-Attempt   Accuracy   is   only   calculable   for   the  subsample  of   students  who  attempted  a  problem  during   the   experiment.   T-statistics are in brackets. *,**,***   denote   statistical   significance   at   the   .1,   .05,   and   .01   levels,   respectively.       
90 Table 24: Cell Means Model for Main Experiment Control RPI_M RPI_T c1 c4 c7 10.78                         16.43               12.54               Below  Median (0.943) (0.463) (0.969) c2 c5 c8 16.02                   16.48                     24.96               Third  Quartile (1.815) (1.772) (3.452) c3 c6 c9 36.27               45.33                   35                     Above  Top  Quartile (2.514) (3.376) (2.980) c1...3 c4...6 c7...9 1 7.56 2 2.81 2 0.18 (0.934) (1.289) (1.207) c4...9 2 1.49 (0.884) This table shows cell means for Δ Activity Level from the main experiment. Cells are the nine categories from the matrix of initial performance (Below Median, Third Quartile, Above Top Quartile) and experimental condition (Control, RPI_M, RPI_T). Each cell contains a coefficient from an OLS regression on Δ Activity Level of a categorical variable representing an individual's belonging to the cell. Standard errors are in parentheses. All coefficients are statistically significant at the .01 level. 91 Table 25: Hypothesis Tests for Main Experiment Cells Coefficient Intercept Hypothesis Test  Statistic P-­‐‑Value c1…3 17.56 H0:  c1…3  =  c4…9                           None F  =  5.12** 0.023 c4...9 21.49 HA:  c1…3  <  c4…9 c4 16.43 H0:  c4  =  c7                           None F  =  4.79** 0.028 c7 12.54 HA:  c4  >  c7 c5 16.48 H0:  c5  =  c8                           None F  =  4.84** 0.027 c8 24.96 HA:  c5  <  c8 c6 45.33 H0:  c6  =  c9                           None F  =  2.45 0.117 c9 35.00 HA:  c6  <  c9 c7  &  9 -­‐‑4.86 c4  &  6 H0:  c7  &  9  =  c8                  2        χ  =  11.13*** 0.000 c8 7.43 c5 HA:  c7  &  9  <  c8 c7 -­‐‑3.71 c4 H0:  c7  =  c8                        χ2    =  9.09*** 0.002 c8 7.43 c5 HA:  c7  <  c8 c8 7.43 c5 H0:  c8  =  c9                        2  χ  =  7.46*** 0.006 c9 -­‐‑6.57 c6 HA:  c8  >  c9 c4…6 22.81 None H0:  c4…6  =  c7...9                    F        =  1.75 0.185 c7…9 20.18 This  table  shows  the  hypothesis  tests  for  the  main  experiment,  with  Δ  Activity  Level   as   the   dependent   variable,   based   on   the   cell   means   model   in   Table   16.   F   tests   compare  coefficients  within  the  cell  means  model  as  displayed  in  Table  16  with  the   intercept   suppressed.   χ2   tests   compare   coefficients   between   variants   of   the   cell   means   model   with   different   cells   ommitted   to   serve   as   the   intercept.   The   F   tests   compare   the   RPI   to   to   the   control   condition,   and   the   top-quartile   to   the   median   reference  point  condition.  The  χ2  tests  show  concavity  in  initial  performance  of  the   performance  effect  of  providing  the  top-quartile  rather  than  median  reference  point.   92 Table 26: Cell Means Model for Supplemental Experiment Control RPI_M RPI_T c1 c4 c7 2.16                         3.95               2.95               Below  Median (0.297) (0.463) (0.381) c2 c5 c8 1.94                   1.73                     3.04               Third  Quartile (0.484) (0.509) (0.783) c3 c6 c9 1.22               0.65                   1.98               Above  Top  Quartile (0.463) (0.304) (0.539) c1...3 c4...6 c7...9 1.98 3.09 2 .81 (0.225) (0.332) (0.300) c4...9 2.95 (0.224) This table shows cell means for Δ Grade from the supplemental experiment. Cells are the nine categories from the matrix of initial performance (Below Median, Third Quartile, Above Top Quartile) and experimental condition (Control, RPI_M, RPI_T). 
Table 26: Cell Means Model for Supplemental Experiment

                                 Control            RPI_M              RPI_T
Below Median                     c1: 2.16           c4: 3.95           c7: 2.95
                                 (0.297)            (0.463)            (0.381)
Third Quartile                   c2: 1.94           c5: 1.73           c8: 3.04
                                 (0.484)            (0.509)            (0.783)
Above Top Quartile               c3: 1.22           c6: 0.65           c9: 1.98
                                 (0.463)            (0.304)            (0.539)
All initial-performance levels   c1...3: 1.98       c4...6: 3.09       c7...9: 2.81
                                 (0.225)            (0.332)            (0.300)
RPI_M and RPI_T pooled                          c4...9: 2.95
                                                (0.224)

This table shows cell means for Δ Grade from the supplemental experiment. Cells are the nine categories from the matrix of initial performance (Below Median, Third Quartile, Above Top Quartile) and experimental condition (Control, RPI_M, RPI_T). Each cell contains a coefficient from an OLS regression of Δ Grade on an indicator for the individual's membership in the cell. Standard errors are in parentheses. All coefficients are statistically significant at the .01 level.

Table 27: Hypothesis Tests for Supplemental Experiment

Cells (coefficients)                 Intercept     Hypothesis                                  Test statistic    P-value
c1...3 (1.98), c4...9 (2.95)         None          H0: c1...3 = c4...9; HA: c1...3 < c4...9    F = 9.36***       0.002
c4 (3.95), c7 (2.95)                 None          H0: c4 = c7; HA: c4 > c7                    F = 2.79*         0.095
c5 (1.73), c8 (3.04)                 None          H0: c5 = c8; HA: c5 < c8                    F = 1.98          0.159
c6 (0.65), c9 (1.98)                 None          H0: c6 = c9; HA: c6 < c9                    F = 4.60**        0.032
c7&9 (-0.56), c8 (1.31)              c4&6, c5      H0: c7&9 = c8; HA: c7&9 < c8                χ2 = 3.15*        0.075
c7 (-1.00), c8 (1.31)                c4, c5        H0: c7 = c8; HA: c7 < c8                    χ2 = 4.36**       0.036
c8 (1.31), c9 (1.32)                 c5, c6        H0: c8 = c9; HA: c8 > c9                    χ2 = 0.00         0.991
c4...6 (3.09), c7...9 (2.81)         None          H0: c4...6 = c7...9                         F = 0.37          0.542

This table shows the hypothesis tests for the supplemental experiment, with Δ Grade as the dependent variable, based on the cell means model in Table 26. F tests compare coefficients within the cell means model as displayed in Table 26 with the intercept suppressed. χ2 tests compare coefficients between variants of the cell means model with different cells omitted to serve as the intercept. The F tests compare the RPI conditions to the control condition, and the top-quartile to the median reference point condition. The χ2 tests show concavity, in initial performance, of the performance effect of providing the top-quartile rather than the median reference point.

Table 28: Survey Responses and Wilcoxon Signed-Rank Comparisons of Survey Responses for Main Experiment

Panel A: Survey Questions (the number of students selecting a response sits beside the response in parentheses)

1. Are you interested in seeing how your activity in the course compares to the...
   classmate median: no (16), somewhat (20), yes (21)
   classmate top quartile: no (22), somewhat (14), yes (21)
2. How important is it to you to be more active in the course than...
   50% of your classmates: not at all important (28), somewhat important (15), important (16)
   75% of your classmates: not at all important (31), somewhat important (15), important (13)
3. How confident are you in your ability to be more active in the course than...
   50% of your classmates: not at all confident (7), somewhat confident (20), confident (29)
   75% of your classmates: not at all confident (11), somewhat confident (20), confident (25)

Panel B: Wilcoxon Signed-Rank Comparisons of Survey Responses

Interest in viewing reference point: z score of 2.12 in favor of median reference point; p-value 0.033; N = 57
Importance of reaching reference point: z score of 1.89 in favor of median reference point; p-value 0.057; N = 59
Confidence in ability to reach reference point: z score of 2.82 in favor of median reference point; p-value 0.004; N = 56

This table shows survey questions and responses regarding individuals' opinions of the peer median and top-quartile reference points, as well as a comparison of responses. In comparing responses for each question, the least affirmative response is coded as 1, the intermediate response as 2, and the most affirmative response as 3.
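The comparisons in Panel B of Tables 28 and 29 are Wilcoxon signed-rank tests on each student's paired responses, coded 1 to 3 from least to most affirmative. A minimal sketch follows, using simulated responses; the variable names and data are assumptions for illustration, and the thesis reports the normal-approximation z statistic rather than the raw rank sum.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical paired responses: each student's interest in viewing the median
# versus the top-quartile reference point, coded 1 (least affirmative) to 3
# (most affirmative). Simulated for illustration only.
rng = np.random.default_rng(2)
interest_median = rng.integers(1, 4, 57)
interest_top_quartile = rng.integers(1, 4, 57)

# Signed-rank test of whether paired responses systematically favor one
# reference point; ties (zero differences) are dropped by default.
stat, p_value = wilcoxon(interest_median, interest_top_quartile)
print(stat, p_value)
```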
Table 29: Survey Responses and Wilcoxon Signed-Rank Comparisons of Survey Responses for Supplemental Experiment

Panel A: Survey Questions (the number of students selecting a response sits beside the response in parentheses)

1. Are you interested in seeing how your grade in the course compares to the...
   classmate median: no (6), somewhat (13), yes (22)
   classmate top quartile: no (6), somewhat (13), yes (22)
2. How important is it to you to get a higher grade in the course than...
   50% of your classmates: not at all important (13), somewhat important (10), important (16)
   75% of your classmates: not at all important (14), somewhat important (11), important (14)
3. How confident are you in your ability to get a higher grade in the course than...
   50% of your classmates: not at all confident (5), somewhat confident (12), confident (21)
   75% of your classmates: not at all confident (8), somewhat confident (15), confident (16)

Panel B: Wilcoxon Signed-Rank Comparisons of Survey Responses

Interest in viewing reference point: z score of 0.00; p-value 1.000; N = 41
Importance of reaching reference point: z score of 0.70 in favor of median reference point; p-value 0.479; N = 42
Confidence in ability to reach reference point: z score of 2.49 in favor of median reference point; p-value 0.012; N = 39

This table shows survey questions and responses regarding individuals' opinions of the peer median and top-quartile reference points, as well as a comparison of responses. In comparing responses for each question, the least affirmative response is coded as 1, the intermediate response as 2, and the most affirmative response as 3.

Table 30: Effect of Grade RPI on Correlation Between Grade and Activity Level

                                         (1)
Dependent variable:                  Grade at Course End
RPI                                   10.09*** [3.51]
Activity Level at Course End           0.14*** [11.39]
Activity Level at Course End x RPI    -0.104*** [-3.20]
N                                      4,460
Sample                                 All

This table shows effect estimates, from the supplemental experiment, of the effect of RPI on the correlation between grade and activity level. This effect is captured by the interaction of Activity Level at Course End and RPI. T-statistics are in brackets. *, **, *** denote statistical significance at the .1, .05, and .01 levels, respectively.
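In the Table 30 specification, the coefficient on the interaction between RPI and end-of-course Activity Level captures how much the grade-activity slope changes when grade RPI is provided; the negative estimate is the sense in which the reported measure becomes less correlated with the unreported one. A minimal sketch of such an interaction regression, on simulated data with illustrative names, follows; it is not the study's code.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for the supplemental-experiment data (illustrative only).
rng = np.random.default_rng(3)
n = 4460
df = pd.DataFrame({
    "rpi": rng.integers(0, 2, n),             # grade RPI provided vs. not
    "activity_end": rng.normal(50, 20, n),    # Activity Level at course end
})
df["grade_end"] = 0.14 * df["activity_end"] + rng.normal(0, 10, n)

# The interaction coefficient estimates the change in the grade-activity slope
# when RPI is provided (estimated at -0.104 in Table 30).
fit = smf.ols("grade_end ~ rpi * activity_end", data=df).fit()
print(fit.params["rpi:activity_end"])
```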
Our study identifies the quantiles of initial performance in which the performance effect of viewing the top-quartile as opposed to the median reference point is negative and those in which it is positive. The effect is negative for initially below-median performers. It is positive for individuals who start between the two reference points. It is also positive for top-quartile performers in the case of the outcome-based measure Grade; however, it is nearly statistically significantly negative in the case of the process-based measure Activity Level. In developing hypotheses, we noted a few reasons why this might occur, and these seem to apply in the case of Activity Level. One was that individuals might be more interested in checking their comparison to the median than to the top quartile. Tables 28 and 29 show that individuals report more interest in viewing, and in comparing favorably to, median performance than top-quartile performance. This result holds in untabulated tests in which we restrict the survey to the responses of top-quartile performers. A second, related reason was that individuals might interpret their standing above the higher reference point as a sign of suboptimal behavior. If so, this plausibly applies more to Activity Level than to Grade, given that individuals feel a greater sense of optimality in performance that reveals high skill or intellect (Tafkov [2013]). Overall, our study shows that performance effects of providing a higher rather than a lower reference point exist within partitions segmented by the reference points even though they do not appear on average.

A second contribution is to isolate the effects of peer comparison through reporting from those of comparison through social interaction. A body of economic research addresses comparison through social interaction (Hanushek et al. [2003], Lavy et al. [2012], Lin [2010], Lyle and Smith [2014]). Analyses of plausibly exogenous changes in peer-group composition show that exposure to high-performing peers improves one's own performance, with some evidence that very high or very low performing peers carry disproportionate weight in that influence (Lavy et al. [2012], Lazear [2001], Hoxby and Weingarth [2006]). Fundamental to the performance effects in these studies are social interactions involving assistance from high performers, networking, and learning through observation (Lavy et al. [2012], Lyle and Smith [2014]). Our study instead displays an anonymous standard of peer performance without altering one's peer group or mentors. We do not find a positive relationship on average between high peer performance and own performance through reporting alone. This suggests that the interaction component of peer-performance comparison is fundamental to the positive effect of exposure to high-performing peers.

By showing the distinct effects of RPI reference point height, we also inform theory and empirical work in economics and accounting research regarding RPI-related mechanisms. First, we contribute to the literature on target setting, which notes the apparent disconnect between prescriptions that targets should be attainable only infrequently and the prevalence of frequently attainable targets in organizations (Ioannou, Serafeim, and Li [2014], Merchant and Manzoni [1989]). Targets often either explicitly contain relative performance information or allow one to infer relative performance (Aranda et al. [2014], Bol et al. [2010], Merchant and Manzoni [1989], Murphy [2000]). In such settings, targets plausibly assume the role of RPI reference points by serving as a standard of peer performance against which to compare oneself. A partial explanation for the prevalence of highly attainable targets, then, might be that RPI reference points have their most positive performance effects at highly attainable levels. We find no performance benefit to making RPI reference points attainable less than half the time. Our findings thus suggest agreement between behavioral responses to RPI reference points and the prevalence of attainable targets found in practice.

Second, we show performance implications of supervisors' use of discretion in setting targets. For instance, supervisors set lower RPI-based targets to mitigate fairness concerns and to avoid confrontation costs with higher-level managers (Bol et al. [2010]). Information on the initial relative performance of these groups, paired with the results of the current study on RPI reference point height, indicates that a lower reference point is likely to improve the performance of individuals in the bottom half of the distribution, while a higher reference point is better able to do so in the top half of the distribution.
Our results also suggest that a better approach would be to customize the reference point to the individual depending on his or her initial performance. When we show each student the more effective of the two reference points given his or her initial performance, the control-group baseline average of 1.98 (17.56) for Δ Grade (Δ Activity Level) rises to 3.5 (24.56). By contrast, showing students the median regardless of initial performance produces a Δ Grade of 3.04 (a Δ Activity Level of 22.81). The customized approach thus yields a 43% (33%) larger treatment effect relative to the baseline (an increase of 1.52 versus 1.06 for Δ Grade, and of 7.00 versus 5.25 for Δ Activity Level).

Third, our results provide insight into motivating performance in segments of the performance distribution that are problem areas for tournaments. The literature on tournaments and rank-based pay shows that those performing far above or far below a rewarded relative-performance mark do worse when they notice the distance (Asch [1990], Hannan et al. [2008], Casas-Arce and Martinez-Jerez [2009]). Our findings show that displaying the median RPI reference point motivates performance among below-median performers. We also show that, when performance is outcome-based, displaying the top-quartile reference point motivates performance among top-quartile performers. We thus offer a nonfinancial performance tool, the selection of RPI reference point height, for motivating groups that a tournament would tend to leave discouraged (if low performers) or complacent (if high performers).

Fourth, our study examines behavioral responses to RPI reference points that are prevalent in employee and executive evaluations and compensation decisions. In a Corporate Executive Board survey, 29% of corporations reported using forced-curve employee rankings for performance management (McGregor [2013]). Some of these systems include peer quartiles as RPI reference points; we test behavioral responses to two such reference points (Grote [2005]). As mentioned, our results suggest that behavioral responses to viewing comparison to the median (top quartile) will be most positive among low (high) performers. If managers must choose one or the other, they could pick the reference point that motivates the group they consider most critical to have performing well. Our results suggest, however, that customizing the reference point based on initial performance is preferable. At the executive level, financial statements list peer-group composition along with executive pay relative to target percentiles of the peer group (Bebchuk and Fried [2004], Bizjak et al. [2011]). The SEC has proposed requiring companies to disclose both the percentile of the CEO's pay and a standardized measure of the company's performance relative to the compensation peer group (Securities and Exchange Commission [2015]). While the disproportionate prevalence of earnings and executive pay targets above the peer median has drawn widespread criticism (Bizjak et al. [2011]), we find behavioral responses suggesting that relative performance targets set above the peer top quartile might elicit performance improvement for individuals in the top half of the distribution. Future research can weigh this dynamic along with financial and career concerns in assessing the value of high targets for performance and compensation.

Finally, we contribute to the accounting literature on the format of performance information reports.
Studies show that order, categorization, visibility, and other display characteristics of information included in performance reports influence decisions made in stock trading and in employee evaluation (Bloomfield et al. [2006], DeBusk, Brown, and Killough [2003], Dilla and Steinbart [2005], Maines and McDaniel [2000]). Although performance reports are also often provided to individuals to aid in improving their own performance, little evidence shows how formatting the information differently alters performance effects (Yigitbasioglu and Velcu [2012]). Our results suggest a performance benefit of customizing reference point height to an individual's initial performance.

3.6 CONCLUSION

Our study provides some of the first evidence on the effect of reference point height within RPI. We also show how the effect depends on an individual's initial performance relative to the two alternatives. Further, we address the moderating role of performance-measure type by testing a process-based and an outcome-based performance measure.

We find that the effect of providing a relatively high reference point in RPI depends on one's initial performance relative to the alternatives. We test the peer top quartile and the peer median. The effect of providing the higher rather than the lower reference point is concave in initial performance: it is negative among below-median performers and positive among those who start between the median and the top quartile. In the case of an outcome-based performance measure, it is also positive for those in the top quartile of performance. Collectively, our findings inform the selection of a reference point to drive performance in the desired segment of initial performance. The findings also suggest that, when reports are private as in our setting, customizing the reference point based on an individual's initial performance is preferable.

Managers and government regulators can incorporate these results when selecting RPI reference points to yield desired behavior. RPI reference points are playing a growing role in settings including retail, education, energy consumption, and tax compliance. The results also reveal dynamics of social comparison that help in identifying the optimal application of RPI reference points within oft-studied systems for measuring, managing, and reporting performance.

CHAPTER 4
CONCLUSION

The implications of this dissertation are twofold. First, public and private performance reporting drive performance and can be fine-tuned to deliver the strongest performance effect. Second, the act of reporting performance, whether to the performers themselves or to the general public, alters the nature of the reported measure. The finding is akin to Heisenberg's Uncertainty Principle. In the same way that measuring a particle in quantum physics requires affecting its state, reporting performance in a management setting requires affecting its state. This can be advantageous, in the form of performance improvement. However, it also changes the nature of the measure, so that a one-unit increase means something different for the organization than it did in the absence of measurement.

In particular, the results from Chapter 2 show that publicly disclosing physician ratings drives improvement by the ratings and by undisclosed measures of quality. At the same time, the ratings become anchored to their prior values.
These effects are moderated by increased web traffic to the disclosed ratings, which both drive unreported performance and strengthen the stickiness of ratings at their prior values. The results from Chapter 3 show that privately disclosing students’ relative performance drives performance to a differential degree depending on the reference point for comparison. The returns to providing a higher reference point are concave in an individual’s initial performance relative to the two tested alternatives: median and top-quartile performance. The study also shows some evidence that the reported performance measure, though, becomes a weaker indication of improvement by unreported measures. 104 References ABELER, J., A. FALK; L. GOETTE, and D. HUFFMAN. 'Reference Points and Effort Provision.' The American Economic Review 101(2) (2011): 470–492. ALLCOTT, H. 'Social Norms and Energy Conservation.' Journal of Public Economics 95(9–10) (2011): 1082–1095. ALLCOTT, H. and T. ROGERS. 'The Short-Run and Long-Run Effects of Behavioral Interventions: Experimental Evidence from Energy Conservation.' American Economic Review 104(10) (2014): 3003-3037. ANDEL, C., S. L. DAVIDOW, M. HOLLANDER, and D. A. MORENO. "The Economics of Health Care Quality and Medical Errors." Journal of Health Care Finance 39 (2012). ARANDA, C., J. ARELLANO, and A. DAVILA. 'Ratcheting and the Role of Relative Target Setting.' The Accounting Review 89(4) (2014): 1197-1226. ASCH, B. J. 'Do Incentives Matter? The Case of Navy Recruiters.' Industrial & Labor Relations Review 43(3) (1990): 89S–106S. ATKINSON, J. W. 'Motivational Determinants of Risk-Taking Behavior.' Psychological Review 64(6p1) (1957). AZMAT, G. and N. IRIBERRI. 'The Importance of Relative Performance Feedback Information: Evidence from a Natural Experiment Using High School Students.' Journal of Public Economics 94(7) (2010): 435–452. BADDELEY, M. "Herding, Social Influence and Economic Decision-Making: Socio- Psychological and Neuroscientific Analyses." Philosophical Transactions of the Royal Society B: Biological Sciences 365 (2010): 281–290. BALASUBRAMANIAN, S. K., I. MATHUR, and R. THAKUR. "The Impact of High-Quality Firm Achievements on Shareholder Value: Focus on Malcolm Baldrige and JD Power and Associates Awards." Journal of the Academy of Marketing Science 33 (2005): 413–422. BANDIERA, O, I. BARANKAY, and I. RASUL. 'Team Incentives: Evidence from a Firm Level Field Experiment.' Journal of the European Economic Association 11(5) (2013): 1079-1114. BANKER, R. D., and S. M. DATAR. "Sensitivity, Precision, and Linear Aggregation of Signals for Performance Evaluation." Journal of Accounting Research 27 (1989): 21–21. BANKER, R. D., G. POTTER, and D. SRINIVASAN. "An Empirical Investigation of an Incentive Plan that Includes Nonfinancial Performance Measures." The Accounting Review 75 (2005): 65–92. BANKER, R. D., and R. MASHRUWALA. "Simultaneity in the Relationship Between Sales Performance and Components of Customer Satisfaction. Journal of Consumer Satisfaction, Dissatisfaction and Complaining Behavior 22 (2009). BEBCHUK, L. A. and J. M. FRIED, 'Pay without Performance: Overview of the Issues.' Journal of Applied Corporate Finance 17(4) (2005): 8-23. BENNEAR, L. S., and S. M. OLMSTEAD. "The Impacts of the “Right to Know”: Information Disclosure and the Violation of Drinking Water Standards." Journal of Environmental Economics and Management 56 (2008): 117–130. BERGER, J., C. HARBRING, and D. SLIWKA. 
'Performance Appraisals and the Impact of Forced Distribution-An Experimental Investigation.' Management Science 59(1) (2013): 54- 68. 105 BIZJAK, J., M. LEMMON, and T. NGUYEN. 'Are All CEOs Above Average? An Empirical Analysis of Compensation Peer Groups and Pay Design.' Journal of Financial Economics 100(3) (2011): 538-555. BLANES I VIDAL, J. and M. NOSSOL. 'Tournaments Without Prizes: Evidence from Personnel Records.' Management Science 57(10) (2011): 1721-1736. BLOOMFIELD, R.; M. NELSON; and S. SMITH. 'Feedback Loops, Fair Value Accounting and Correlated Investments.' Review of Accounting Studies 11(2-3) (2006): 377-416. BLUNDELL, R., and M. C. DIAS. "Alternative Approaches to Evaluation in Empirical Microeconomics." Journal of Human Resources 44 (2009): 565-640. BOL, J. C., T. M. KEUNE, E. M. MATSUMURA, and J. Y. SHIN. 'Supervisor Discretion in Target Setting: An Empirical Investigation.' The Accounting Review 85(6) (2010): 1861- 1886. BOL, J. C. "The Determinants and Performance Effects of Managers' Performance Rating Biases. The Accounting Review 86 (2011): 1549–1575. BROWN, S. H. "Managed Care and Technical Efficiency." Health Economics 12 (2003): 149– 158. BROWN, D., L. FERRIS, D. HELLER, and L. KEEPING. 'Antecedents and Consequences of the Frequency of Upward and Downward Social Comparison at Work.' Organizational Behavior and Human Decision Processes 102 (2007): 59-75. BROWN, D. L., S. CLARKE, and J. OAKLEY. "Cardiac Surgeon Report Cards, Referral for Cardiac Surgery, and the Ethical Responsibilities of Cardiologists.: Journal of the American College of Cardiology 59 (2010), 2378–2382. BUTTERS, J. 'Earnings Insight S&P 500.' Earnings Insight. Factset. (2015) CASAS-ARCE, P. and F. A. MARTINEZ-JEREZ. 'Relative Performance Compensation, Contests, and Dynamic Incentives.' Management Science 55(8) (2009): 1306–1320. CAMPBELL, D., M. J. EPSTEIN, and F. A. MARTINEZ-JEREZ. "The Learning Effects of Monitoring." The Accounting Review 86 (2011): 1909–1934. CARD, D., and A. KRUEGER. "Minimum Wages and Employment: A Case Study of the Fast- Food Industry in New Jersey and Pennsylvania." The American Economic Review 84 (1994): 772-793. CHANDRA A., J. GRUBER, and R. McKnight. "Patient Cost-Sharing and Hospitalization Offsets in the Elderly. The American Economic Review 100 (2010): 193–213. CHATTERJI, A. K., and M. W. TOFFEL. "How Firms Respond to Being Rated." Strategic Management Journal 31 (2010): 917–945. CHEVALIER, J. A., and D. MAYZLIN. "The Effect of Word of Mouth on Sales: Online Book Reviews." Journal of Marketing Research 43 (2006), 345–354. CHRISTENSEN, H. B., E. FLOYD, and M. G. MAFFETT. "The Effects of Price Transparency Regulation on Prices in the Healthcare Industry." Chicago Booth Research Paper, 2015. Available at http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2343367 CLEVELAND CLINIC. "Cleveland Clinic | Find a Doctor." 2016. Retrieved from http://my.clevelandclinic.org/staff_directory CMS. "HHS-Operated Risk Adjustment Methodology Meeting Discussion Paper." 2016. Retrieved from https://www.cms.gov/CCIIO/Resources/Forms-Reports-and-Other- Resources/Downloads/RA-March-31-White-Paper-032416.pdf COLUMBIA UNIVERSITY. "Courses@CU Principles of Economics." (2016). Retrieved from 106 http://www.coursesatcu.com/courses/1940 CROSS, S. and L. MADSON. 'Models of the Self: Self-construals and Gender.' Psychological Bulletin 12 (1997): 5-37. CUTLER, D., and L. S. DAFNY. "Designing Transparency Systems for Medical Care Prices." New England Journal of Medicine 364 (2011): 894–895. 
DAFNY, L. S. "How do Hospitals Respond to Price Changes?" The American Economic Review, 95 (2005): 1525–1547. DAFNY, L. S., and D. DRANOVE. "Do Report Cards Tell Consumers Anything They Don't Already Know? The Case of Medicare HMOs." The Rand Journal of Economics 39 (2008): 790–821. DANIELS, C. and T. MILLER. 'After the Architect, Building Culture Calls for a Contractor.' Leadership Case Studies (2014). DEBUSK, G. K., R. M. BROWN, and L. N. KILLOUGH. 'Components and Relative Weights in Utilization of Dashboard Measurement Systems Like the Balanced Scorecard.' The British Accounting Review 35(3) (2003): 215-231. DILLA, W. N. and P. J. STEINBART. 'The Effects of Alternative Supplementary Display Formats on Balanced Scorecard Judgments.' International Journal of Accounting Information Systems 6(3) (2005): 159–176. DOLAN, P., M. HALLSWORTH, D. HALPERN, D. KING, R. METCALFE, and I. VLAEV. 'Influencing Behaviour: The Mindspace Way.' Journal of Economic Psychology 33(1) (2012): 264–277. DOYLE, J. J., Jr. "Returns to Local-Area Health Care Spending: Evidence from Health Shocks to Patients Far from Home." American Economic Journal: Applied Economics 3 (2011): 221–243. DRANOVE, D., D. KESSLER, M. MCCLELLAN, and M. SATTERTHWAITE. "Is More Information Better? The Effects of “Report Cards” on Health Care Providers." The Journal of Political Economy 111 (2003): 555–588. DUFLO, E. Empirical Methods. Massachusetts Institute of Technology. (2002). Retrieved from http://dspace.mit.edu/bitstream/handle/1721.1/49516/14-771Fall- 2002/NR/rdonlyres/Economics/14-771Development-Economics--Microeconomic-Issues- and-Policy-ModelsFall2002/2494CA2C-D025-40A6-B167- F8A5662520DB/0/emp_handout.pdf DYNARSKI, S. "Hope for Whom? Financial Aid for the Middle Class and Its Impact on College Attendance." National Tax Journal (2000): 629-661. EAGLY, A. H. 'Sex Differences in Influenceability.' Psychological Bulletin 85 (1978): 86-116. EPSTEIN, A. J. "Do Cardiac Surgery Report Cards Reduce Mortality? Assessing the Evidence. Medical Care Research and Review 63 (2006): 403–426. EREZ, M. and I. ZIDON. 'Effects of Goal Acceptance on the Relationship of Goal Setting and Task Performance.' Journal of Applied Psychology 69 (1984): 69-78. EREZ, M., P. C. EARLEY, and C. L. HULIN. 'The Impact of Participation on Goal Acceptance and Performance: A Two Step Model.' Academy of Management Journal 28 (1985): 50-66. ERYARSOY, E., and S. PIRAMUTHU. "Experimental Rating of Sequential Bias in Online Customer Reviews." Information and Management, 51 (2014): 964–971. EVANS, I., Y. HWANG, and N. J. NAGARAJAN. "Management Control and Hospital Cost Reduction: Additional Evidence." Journal of Accounting and Public Policy 20 (2001): 73– 88. 107 FARBER, H. S. 'Reference-Dependent Preferences and Labor Supply: The Case of New York City Taxi Drivers.' American Economic Review (2008) 98(3): 1069-1082. FESTINGER, L. 'A Theory of Social Comparison Processes.' Human relations 7(2) (1954): 117- 140. FISHER, J., S. PEFFER, and G. SPRINKLE. 'Budget-Based Contracts, Budget Levels, and Group Performance.' Journal of Management Accounting Research 15(1) (2003): 51-74. FLAHERTY, C. "Evaluating Evaluations." (2014). Retrieved from https://www.insidehighered.com/news/2014/05/20/study-suggests-research-plays-bigger- role-faculty-evaluations-student-evaluations FURNHAM, A., and H. C. BOO. "A Literature Review of the Anchoring Effect." The Journal of Socio-Economics 40 (2011): 35–42. GARCIA, S. and A. TOR. 
'Rankings, Standards, and Competition: Task Versus Scale Comparisons.' Organizational Behavior and Human Decision Processes 102 (2007): 95-108. GIBBONS, R and J. ROBERTS. The Handbook of Organizational Economics. Princeton University Press, 2012. GLOVER, L. "Are Online Physician Ratings any Good?" (2014). Retrieved from http://health.usnews.com/health-news/patient-advice/articles/2014/12/19/are-online- physician-ratings-any-good GORMAN, A. 'How One Hospital Reduced Unnecessary C-Sections.' The Atlantic, 2015. Available at: http://www.theatlantic.com/health/archive/2015/05/how-one-hospital-reduced- unnecessary-c-sections/392924/ GRAHAM, M. "Regulation by Shaming." (2000). Retrieved from http://www.theatlantic.com/magazine/archive/2000/04/regulation-by-shaming/378126/ GROTE, R. C. Forced Curve: Making Performance Management Work. Harvard Business School Press, 2005. HALLSWORTH, M., J. A. LIST, R. D. METCALFE, and I. VLAEV. 'The Behavioralist as a Tax Collector: Using Natural Field Experiments to Enhance Tax Compliance.' NBER Working Paper Series, 2014. Available at: http://www.nber.org/papers/w20007.pdf HANAUER, D. A., K. ZHENG, D. C. SINGER, A. GEBREMARIAM, and M. M. DAVIS. "Public Awareness, Perception, and Use of Online Physician Rating Sites." JAMA 311 (2014): 734. HANNAN, E. L., M. S. SARRAZIN, D. DORAN, and G. ROSENTHAL. "Provider Profiling and Quality Improvement Efforts in Coronary Artery Bypass Graft Surgery: The Effect on Short-Term Mortality Among Medicare Beneficiaries." Medical Care 41 (2003). HANNAN, L., R. KRISHNAN, and A. NEWMAN. 'The Effects of Disseminating Relative Performance Feedback in Tournament and Individual Performance Compensation Plans.' The Accounting Review 83(4) (2008): 893–913. HANNAN, L., G. MCPHEE, A. NEWMAN, and I. TAFKOV. 'The Effect of Relative Performance Information on Performance and Effort Allocation in a Multi-Task Environment.' The Accounting Review 88(2) (2013): 553-575. HANUSHEK, E. A.; J.F. KAIN; J. M. MARKMAN; and S. G. RIVKIN. 'Does Peer Ability Affect Student Achievement? Empirical Analysis of Social Interactions.' Journal of Applied Econometrics 18(5) (2003): 527-544. HARPER, F. M., J. KONSTAN, Y. CHEN, and S. X. LI. 'Social Comparisons and Contributions to Online Communities: A Field Experiment on MovieLens.' The American Economic Review 100(4) (2010): 1358–1398. 108 HEATH, C; R. LARRICK; and G. WU. 'Goals as reference points." Cognitive psychology 38(1) (1999): 79-109. HÖLMSTROM, B. "Moral Hazard and Observability." The Bell Journal of Economics (1979): 74–91. HU, N., L. LIU, and J. J. Zhang. "Do Online Reviews Affect Product Sales? The Role of Reviewer Characteristics and Temporal Effects." Information Technology and Management 9 (2008): 201–214. IOANNOU, I; S. LI; and G. SERAFEIM. 'The Effect of Target Difficulty on Target Completion: The Case of Reducing Carbon Emissions.' Unpublished Paper (forthcoming in The Accounting Review), 2014. Available at: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2133004 ITTNER, C. D., and D. F. LARCKER, "Are Nonfinancial Measures Leading Indicators of Financial Performance?" An Analysis of Customer Satisfaction." Journal of Accounting Research 36 (1998): 1–35. ITTNER, C. D., D. F. LARCKER, and M. W. MEYER. "Subjectivity and the Weighting of Performance Measures: Evidence from a Balanced Scorecard." The Accounting Review 78 (2003): 725–758. JIN, G. Z., and P. LESLIE. "The Effect of Information on Product Quality: Evidence from Restaurant Hygiene Grade Cards." 
The Quarterly Journal of Economics, (2003): 409–451. JOHN, P.; M. SANDERS; and J. WANG. 'The Use of Descriptive Norms in Public Administration: A Panacea for Improving Citizen Behaviours?' Unpublished paper, 2014. Available at http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2514536. JOYNT, K. E., E. J. ORAV, and A. K. JHA. 'Thirty-day Readmission Rates for Medicare Beneficiaries by Race and Site of Care. JAMA, 305 (2011), 675–681. KAHNEMAN D. and A. TVERSKY. 'Prospect Theory: An Analysis of Decision Under Risk." Econometrica 47(2) (1979): 263-292. KAPLAN, R. S., and D. P. NORTON. "The Balanced Scorecard: Measures that Drive Performance." Harvard Business Review 83 (2005): 172. KEYNES, J. M. "A Treatise on Money." (1930). KLAR Y. and E. E. GILADI. 'No One in my Group can be Below the Group's Average: a Robust Positivity Bias in Favor of Anonymous Peers.' Journal of Personality and Social Psychology 73(5) (1997): 885-901. KETTLE, S., M. HERNANDEZ, S. RUDA, and M. SANDERS. 'Behavioural Insights to Improve Tax Compliance: Short-Term Impacts from a Randomised Experiment in Guatemala.' (2015) CMPO Working Paper. KOLSTAD, J. T. "Information and Quality when Motivation is Intrinsic: Evidence from Surgeon Report Cards." The American Economic Review, 103 (2003): 2875–2910. LAMB, A., J. SMILACK, A. D. HO, and J. REICH. 'Addressing Common Challenges in Randomized Experiments in MOOCs: A Study of Encouraging Discussion in JusticeX'. Proceedings of the Second ACM Conference on Learning@Scale (2015). Available at: http://harvardx.harvard.edu/files/harvardx/files/mooc_analytic_challenges_harvardx_wp.pdf LARRICK, R. P., K. A. BURSON, and J. B. SOLL. 'Social Comparison and Confidence: When Thinking You’re Better than Average Predicts Overconfidence (and when it does not).' Organizational Behavior and Human Decision Processes 102(1) (2007): 76–94. LAVY, V., O. SILVA, and F. WEINHARDT. 'The Good, the Bad, and the Average: Evidence on Ability Peer Effects in Schools.' Journal of Labor Economics 30(2) (2012): 367-414. 109 LAWLER, E. E. 'A Correlational-Causal Analysis of the Relationship Between Expectancy Attitudes and Job Performance.' Journal of Applied Psychology 52(6p1) (1968). LAWLER, E. E. and J. L. SUTTLE. 'Expectancy Theory and Job Behavior.' Organization Behavior and Human Performance 9 (1973): 482-503. LAZEAR, E. 'Educational Production.' Quarterly Journal of Economics 116(3) (2001): 777-803. LAZEAR, E. and S. ROSEN. 'Rank-order Tournaments as Optimum Labor Contracts.' Journal of Political Economy 89(5) (1981): 841-864. LEUZ, C., and P. WYSOCKI. "The Economics of Disclosure and Financial Reporting Regulation: Evidence and Suggestions for Future Research." Journal of Accounting Research 54 (2016): 525–622. LIN, XU. 'Identifying Peer Effects in Student Academic Achievement by Spatial Autoregressive Models with Group Unobservables.' Journal of Labor Economics 28(4) (2010): 825-860 LOCKE, E. A. and J. BRYAN. 'The Directing Function of Goals in Task Performance.' Organizational Behavior and Human Performance 4 (1969): 35-42. LOCKE, E. A. and G. P. LATHAM. "Building a Practically Useful Theory of Goal Setting and Task Motivation: A 35-year Odyssey." American Psychologist 57(9) (2002). LOCKE, E. A., S. J. MOTOWIDLO, and P. BOBKO, 'Using Self-Efficacy Theory to Resolve the Conflict Between Goal-Setting and Expectancy Theory in Organizational Behavior and Industrial/Organizational Psychology.' Journal of Social and Clinical Psychology 4 (1986): 328-338. LYLE, D. S., and J. Z. SMITH. 
'The Effect of High-Performing Mentors on Junior Officer Promotion in the US Army' Journal of Labor Economics 32(2) (2014): 229–258. LU, F. S. "Multitasking, Information Disclosure, and Product Quality: Evidence from Nursing Homes." Journal of Economics and Management Strategy 21 (2012): 673–705. LUCA, M. "Reviews, Reputation, and Revenue: the Case of Yelp.Com." SSRN Electronic Journal (2016). Available at: http://doi.org/10.2139/ssrn.1928601 LYU, H., E. C. WICK, M. HOUSMAN, J. A. FREISCHLAG, and M. A. MAKARY. "Patient Satisfaction as a Possible Indicator of Quality Surgical Care." JAMA Surgery 148 (2013): 362. MAINES, L. A. and L. S. MCDANIEL. 'Effects of Comprehensive-Income Characteristics on Nonprofessional Investors' Judgments: The Role of Financial-Statement Presentation Format.' The Accounting Review 75(2) (2000): 179-207. MARRIOTT. "Reviews." 2016. Retrieved from http://www.marriott.com/hotels/hotel- reviews/chijw-jw-marriott-chicago/ MCGREGOR. 'For Whom the Bell Curve Tolls.' The Washington Post, 2013. Available at: https://www.washingtonpost.com/news/on-leadership/wp/2013/11/20/for-whom-the-bell- curve-tolls/ MERCHANT, K. A. and J. F. MANZONI.' The Achievability Of Budget Targets In Profit Centers: A Field Study.' The Accounting Review 64(3) (1989). MERLINO, J. I., and A. RAMAN. "Health Care's Service Fanatics." Harvard Business Review 91 (2013): 108-16. MOERS, F. "Discretion and Bias in Performance Rating: the Impact of Diversity and Subjectivity." Accounting, Organizations and Society 30 (2005): 67–80. MUCHNIK, L., S. ARAL, and S. J. TAYLOR. "Social Influence Bias: A Randomized Experiment." Science 341 (2013): 647–651. MURPHY, K. J. 'Performance Standards in Incentive Contracts.' Journal of Accounting and 110 Economics 30(3) (2000): 245-278. MURPHY, K. J., and T. SANDINO. "Executive Pay and “Independent” Compensation Consultants." Journal of Accounting and Economics, 49 (2010): 247-262. MURTHY, U. 'The Effect of Relative Performance Information under Different Incentive Schemes on Performance in a Production Task.' AAA 2011 Management Accounting Section (MAS) Meeting Paper, 2010. Available at: http://ssrn.com/abstract=1632663 NAGAR, V., and M. V. RAJAN. "Measuring Customer Relationships: The Case of the Retail Banking Industry." Management Science 51 (2005): 904-919. NEWMAN, B. M., and P. R. NEWMAN. "Development Through Life: A Psychosocial Approach. Cengage Learning, 2014. PARKER, C., and V. L. NIELSEN. "Explaining Compliance: Business Responses to Regulation." Edward Elgar Publishing, 2011. PERKINS, H. W., M. P. HAINES, and R. RICE. 'Misperceiving the College Drinking Norm and Related Problems: A Nationwide Study of Exposure to Information, Perceived Norms, and Student Alcohol Misuse.' Journal of Studies on Alcohol and Drugs 66(4) (2005): 470-478. PETERSEN, M. A. Estimating Standard Errors in Finance Panel Data Sets: Comparing Approaches. Review of Financial Studies 22 (2009): 435–480. PETERSON, E. D., E. R. DELONG, J. G. JOLLIS, L. H. MUHLBAIER, D. B. MARK, "The Effects of New York’s Bypass Surgery Provider Profiling on Access to Care and Patient Outcomes in the Elderly." Journal of the American College of Cardiology 32 (1998): 993– 999. PRENDERGAST, C., and R. TOPEL. "Discretion and Bias in Performance Rating." European Economic Review 37 (2003): 355–365. RENO, R. R., R. B. CIALDINI, and C. A. KALLGREN. 'The Transsituational Influence of Social Norms.' Journal of Personality and Social Psychology 64(1) (1993): 104–112. ROGERS, T. and A. FELLER. 
'The Threat of Excellence: Exposure to Peers' Exemplary Work Undermines Performance and Success.' Presented at the Society for Judgement and Decision Making 2015 36th Annual Conference (2015). RYAN, A. M., B. K. NALLAMOTHU, and J. B. DIMICK. "Medicare’s Public Reporting Initiative on Hospital Quality had Modest or no Impact on Mortality from Three Key Conditions." Health Affairs, 31 (2012): 585–592. SECURITIES AND EXCHANGE COMMISSION, 'SEC Proposes Rules to Require Companies to Disclose the Relationship Between Executive Pay and a Company's Financial Performance.' Press Release. SEC. 2015. SEDIKIDES, C, L. GAERTNER, and Y. TOGUCHI. 'Pancultural self-enhancement.' Journal of Personality and Social Psychology 84(1) (2003): 60–79. SCHULTZ, P. W.; J. M. NOLAN; R. B. CIALDINI; N. J. GOLDSTEIN; and V. GRISKEVICIUS. 'The Constructive, Destructive, and Reconstructive Power of Social Norms.' Psychological Science 18(5) (2007): 429–434. SIKORA, R. T., and K. CHUAHAN. "Estimating Sequential Bias in Online Reviews: A Kalman Filtering Approach." Knowledge-Based Systems, 27 (2012): 314–321. SMITH, R. "Assimilative and Contrastive Emotional Reactions to Upward and Downward Social Comparisons." In Handbook of Social Comparison, ed. J. Suls and L. Wheeler, 2000. SONG, H., A. TUCKER, K. MURRELL, and D. VINSON, 'Public Relative Performance Feedback in Complex Service Systems: Improving Productivity through the Adoption of 111 Best Practices.' Unpublished paper, 2015. Available at http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2673829 SHUKLA, M. "Long-Term Impact of Coronary Artery Bypass Graft Surgery (CABG) Report Cards on CABG Mortality and Provider Market Share and Volume." The George Washington University, 2013. Ed. A. Dor. Retrieved from http://media.proquest.com/media/pq/classic/doc/3085535221/fmt/ai/rep/NPDF?_s=TO9PEY nJkbXRA%2F8fETo9otvBzqw%3D SVENSON, O. 'Are we all Less Risky and More Skillful than our Fellow Drivers?' Acta Psychologica, 47(2) (1981): 143–148. STANFORD HEALTH CARE. "Find a Doctor | Stanford Health Care." 2016. Retrieved from https://stanfordhealthcare.org/search-results.doctors.html STARWOOD. "Guest Ratings and Reviews | Sheraton New York Times Square Hotel." 2016. Retrieved from http://www.starwoodhotels.com/sheraton/property/reviews/index.html?propertyID=421 SUNDARARAJAN, V., T. HENDERSON, C. PERRY, A MUGGIVAN, H. QUAN, and W. A. GHALI. "New ICD-10 Version of the Charlson Comorbidity Index Predicted In-Hospital Mortality. Journal of Clinical Epidemiology, 57 (2004): 1288–1294. TAFKOV, I. D. "Private and Public Relative Performance Information Under Different Compensation Contracts." The Accounting Review 88 (2013): 327–350. TEXAS TECH UNIVERSITY. "Student Ratings of Courses and Instructors." (2016). Retrieved from http://www.ttu.edu/courseinfo/evals/ TVERSKY, A., and D. KAHNEMAN. "Judgment Under Uncertainty: Heuristics and Biases." In Utility, Probability, and Human Decision Making, eds. D. Wendt and C. Vlek (1995). UBEL, P. "Paying for Patient Satisfaction Harms Hospitals that Care for Poor People." (2015). Retrieved from http://www.forbes.com/sites/peterubel/2015/10/16/paying-for-patient- satisfaction-harms-hospitals-that-care-for-poor-people/#39a7355b62f7 VAN DER STEDE, W. "Management Accounting: From Where, Where Now, Where To?" Journal of Management Accounting Research 17 (2015). 171-176. VANEK SMITH, S. 'I, Waiter' NPR. (2015) Available at http://www.npr.org/templates/transcript/transcript.php?storyID=407086723 VIVES, X. 
"Information and Learning in Markets: the Impact of Market Microstructure." Princeton University Press, 2010 VROOM, V. H. Work and Motivation. Wiley, 1964. WEIL, D., A. FUNG, M. GRAHAM, and E. FAGOTTO. "The Effectiveness of Regulatory Disclosure Policies." Journal of Policy Analysis and Management 25 (2006), 155–181. WINDSCHITL, P. D., J. KRUGER, and E. N. SIMMS. 'The Influence of Egocentrism and Focalism on People's Optimism in Competitions: When What Affects Us Equally Affects Me More.' Journal of Personality and Social Psychology 85(3) (2003): 389–408. XIAO, Y. and R. LUCKING. 'The Impact of Two Types of Peer Assessment on Stuents' Performance and Satisfaction within a Wiki Environment'. Internet and Higher Education 11(3-4) (2008): 186-193. YIGITBASIOGLU, O., and O. VELCU. 'A Review of Dashboards in Performance Management: Implications for Design and Research.' International Journal of Accounting Information Systems 13(1) (2012): 41–59. 112 Appendix A: Patient Satisfaction Survey Questions Used in Disclosure Physician (referred to as “care provider” in the survey questions) 1) Friendliness/courtesy of the care provider 2) Explanations the care provider gave you about your problem or condition 3) Concern the care provider showed for your questions or worries 4) Care provider’s efforts to include you in decisions about your treatment 5) Degree to which care provider talked with you using words you could understand 6) Amount of time the care provider spent with you 7) Your confidence in this care provider 8) Likelihood of your recommending this care provider to others 9) Length of wait time at clinic © 2016 Press Ganey Associates, Inc. Appendix B: Example Patient Comments Each patient satisfaction survey contains a text box under the Care Provider section of a survey with the prompt: “Comments (Describe Good or Bad Experience)” that patients can choose to fill in. Comments regarding physicians were posted in their entirety on the official online profiles of included in disclosure, except when administrators, who were not physicians, screened comments that contained slander or personally identifiable information about the patient. The following are example comments: # Selected Comment 1 The only complaint I had was that he didn't tell me how many precancer spots I had on my face so I wasn't prepared for as many as there were. I wasn't quite mentally prepared for getting sprayed with the liquid nitrogen that many times. 2 Excellent service all the way around. 3 Dr. Salari is the best physician there in my opinion. I haven't really seen any others but I trust him to give me an honest opinion and he shows he cares. Apparently everyone likes him because at times he is hard to get an appointment with but that I guess is a good thing 4 I would have liked to have heard a bit about the down-side of “prednisone” – such as it can sometimes cause mood swings or depression. I wasn’t prepared for “feeling so down”. 5 The symptoms I had were frightening and the physician was very good at explaining what was going on and alleviating my fears. 113 Appendix C: Variable Definitions Chapter 2 Variables Dependent Variables Description Rating The average of the nine component ratings regarding a physician in a Press Ganey survey returned following a patient visit. Each component rating is on a Likert scale of one to five, with five as the most favorable rating. 
Quality deduction An indicator variable equal to one if the visit resulted in a readmission to the emergency department within 30 days of discharge, a hospital- acquired condition, or both. Absolute difference The absolute distance of an individual physician rating from the consensus rating as calculated for disclosure. For physicians included in the December 2012 posting, the consensus was calculated and disclosed, and for those excluded I retrospectively calculate the consensus. For all physicians, I retrospectively calculate the consensus as would have been disclosed in December 2011 had the disclosure occurred then. Absolute difference is measured relative to the December 2011 consensus prior to disclosure, and to the December 2012 consensus following disclosure and prior to the July 2013 rating posting/update. Treatment Variable Description Disclosed An indicator for the time period following which a physician’s ratings were disclosed, if ever. Placebo disclosed For the tests of Models 1 and 2, regarding performance effects, this is an indicator for the time period beginning one year prior to the first time at which a physician’s rating were disclosed, if ever. For the tests of Model 3, regarding consensus bias, this is an indicator for the time period beginning seven months prior to the December 2012 disclosure for physicians included in the disclosure. Partitioning Variable Description Web traffic The number of page views of the physician’s official online profile in the calendar month prior to the observed visit. Control Variables Description Age Patient age at the time of the visit, with ages above 89 treated as 90. For the tests of Models 1 and 3, related to physician ratings and bias toward ratings, ages are included as dummies using the psychometric categories of Newman and Newman (2014): 0-11, 11-17, 18-24, 25- 34, 45-59, 59-74, +74. For the tests of Model 2, regarding quality, ages are included as outlined by CMS (2016) for use in risk adjustment: 0-1, 1-4, 5-9, 10-14, 15-20, 21-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59, 60+. Gender An indicator variable equal to 1 if the physician is female. Medicare or Medicaid An indicator variable equal to 1 if Medicare or Medicaid was the primary insurance used for the visit. 114 Severity/complexity A component of Medicare reimbursement formulas that accounts for the patient’s case severity and the complexity of care provided. Comorbidity The Charlson Comorbidity Index, which takes a value of one, two, three, or six in proportion to the likelihood of mortality within one year associated with the comorbid condition. Comorbid conditions include heart disease, aids, and cancer among the 22-condition set. The conditions are recorded at the time of a procedure. Thus, for regressions on quality, the value is included as assigned to the given visit. For regressions on patient satisfaction ratings or their derivatives, which may occur before or after a procedure, the value is included as the six-month rolling window within the sample centered at the patient visit. The results are robust to narrowing the window to three months or expanding it to one year. Charges The dollar value of charges assigned to the visit. First visit An indicator variable equal to 1 if the visit is the patient’s first to the physician conducting the visit. Physician week’s visit The total number of visits conducted by the physician conducting a count visit in the same week. 
English speaking An indicator variable equal to 1 if a patient indicated in a survey response that they speak English. Contemporary The standard deviation of the physician’s ratings in the period in standard deviation which the rating occurred relative to the December 2012 rating posting. Consensus count The inverse square root of the count of observations that comprise a physician’s consensus rating as calculated for disclosure and used in this study in measuring absolute difference. Rating trend The rating trend for a given physician in the period in which the rating occurred relative to the December 2012 rating posting. Year A categorical variable for the calendar year in which the visit occurred: 2011, 2012, 2013, or 2014. Period A categorical variable for the period, segmented by disclosure events (i.e., the December 2012 and the July 2013 rating postings), in which the visit occurred: 1 for before the first posting, 2 for after the first and before the second posting, and 3 for after both postings. Physician dummies An indicator variable for the physician conducting the visit. Physician Variables Description Age The physician’s age as of January 1, 2011. Gender An indicator variable equal to 1 if the physician is a female. MD An indicator variable equal to 1 if the physician holds an MD. Years with UUHC The number of years that UUHC has employed the physician. Tenure track An indicator variable equal to 1 if the physician has a tenure-track appointment. 115 Chapter 3 Variables Dependent Variables Description Activity Level The following weighted sum, that approximately scales each type of action’s historical mean to the historical mean of video views in the experiment host courses: video views + 1.5 x problem attempts + 20 x forum posts + 2.5 x other forum actions + 5 x number of days active in the course + 0.1 x total actions Δ Activity Level Activity Level at the experiment’s end minus Activity Level at the experiment’s beginning Grade Grade in the course Δ Grade Grade at the experiment’s end minus Grade at the experiment’s beginning Dependent Variable Components Description Video Views The number of times a student started watching a video Problem Attempts The number of times a student entered an answer to any problem Forum Posts The number of posts a student made in discussion forums Other Forum Actions The number of actions (e.g., voting for a post, responding with a comment to an original post) a student took in discussion forums Number of Days Active in the The number of calendar days on which a student Course accessed the course Total Actions (component of All actions in the course that are recorded electronically; Activity Level) these include video views, problem attempts, forum posts, and other forum actions, but are not limited to them Independent Variables Description Control An indicator variable equal to one if the individual is a member of the control group, which does not receive RPI displays. 
Median Reference Point (RPI_M) An indicator variable equal to one if the individual is a member of the treatment group that receives an RPI display with the peer median reference point; the peer median is the median activity level of individuals who have accessed the course Top-quartile Reference Point An indicator variable equal to one if the individual is a (RPI_T) member of the treatment group that receives an RPI display with the peer top-quartile reference point; the peer top-quartile is the top-quartile activity level of individuals with who have accessed the course Relative Performance Information An indicator variable equal to one if either Top-quartile (RPI) Reference Point = 1 or Median Reference Point = 1 Moderator Variables Description Initially Below Average An indicator variable equal to one if the individual’s 116 activity level was less than or equal to the median of all individuals who had accessed the course at the experiment’s start Initially Third Quartile An indicator variable equal to one if the individual’s activity level was greater than the median and less than the top-quartile of all individuals who had accessed the course at the experiment’s start Initially Top Quartile An indicator variable equal to one if the individual’s activity level was greater than or equal to the top-quartile of all individuals who had accessed the course at the experiment’s start Gender An indicator variable equal to one if the individual indicated their gender in the course registration process and chose Female, and equal to zero if the individual indicated their gender in the course registration process and chose Male. Developed Country An indicator variable equal to one if the individual indicated their country of residence and the country is of UN Developed Nation status, and equal to zero if the individual indicated their country of residence and the country is of UN Developing Nation Status Level of Education An indicator variable equal to one if the individual indicated their level of education and has a bachelor’s degree or higher degree. Age The age, if any, that an individual indicated during registration, truncated at 5 and 100. Descriptive Variables Description Familiarity With Subject Response to survey question, “How familiar are you with [course]?” 0 = Not at all Familiar; 1 = Slightly Familiar; 2 = Somewhat Familiar; 3 = Very Familiar; 4 = Extremely Familiar. Commitment to Complete Course Response to survey question, “People register for HarvardX courses for different reasons. Which of the following best describes you?” 1 = Here to brows the materials, but not planning on completing any course activities (watching videos, reading text, answering problems, etc.); 2 = Planning on completing some course activities, but not planning on earning a certificate; 3 = Planning on completing enough course activities to earn a certificate. Number of Online Courses Response to survey question, “How many online courses Previously Enrolled-In have you registered for in the past?” Number of Online Courses Response to survey question, “How many online courses Previously Completed have you completed in the past?” 117 Appendix D: Relative Performance Information Displays, Emails and Surveys 1. Example Email with Link to RPI Display In the supplemental experiment, the “Grade” replaces any reference to “Activity” 2. 
Example In-Course Link to RPI Display The link titled, “Check your progress (link will open in a new tab)” takes control-group students to the standard course progress chart for HarvardX courses. The same link takes treatment group students directly to the proposed experiment’s RPI display that is customized to their activity in the course. The RPI display webpage has a link at the 118 bottom titled, “Click here for a more detailed progress chart”, which takes treatment- group students to the standard HarvardX course progress chart. 3. Activity Level Displays Peer-Median Reference Point 119 Peer-Top-Quartile Reference Point In the main experiment, the graphs dynamically scale to show levels of activity above 150, and start with a default height of 150. The “Click here for a more detailed progress chart” link takes treatment group students to the default HarvardX course progress chart. In the supplemental experiment, the “Grade” replaces any reference to activity or activity level. In that experiment, the graph has a fixed scale from 0-100. 120 4. Registration Form (Required of all HarvardX Students) 121 5. HarvardX Standard Pre-course Survey (example from a course called “Statistics and R for the Life Sciences”) Red asterisks, which will not be visible to survey participants, are placed next to questions that determine a value of a predicted moderator variable. These are followed by the associated coding for analysis. * Commitment to Complete Course = 1, 2, or 3 corresponding to the first three options in the order that they are listed 122 * Familiarity with Subject = 0, 1, 2, 3, or 4 corresponding to the options in the order that they are listed 123 * Number of Online Courses Previously Enrolled In = the number entered for the first question above * Number of Online Courses Completed = the number entered for the second question above 124 6. Field Experiment-specific Survey We distributed this survey in the last two weeks of the each of the main and supplemental experiments. 1.# a.# b.# 2.# a.# b.# 3.# a.# b.# Answer Coding We code a student’s response to each question as a one, two, or three in order from negative to affirmative. In Table 23, we provide charts of the percentage of respondents 125 selecting each response and test for differences in the distribution of answers regarding the median and top quartile respectively. 126