Differential Privacy: A Primer for a Non-Technical Audience

Differential privacy is a formal mathematical framework for quantifying and managing privacy risks. It provides provable privacy protection against a wide range of potential attacks, including those currently unforeseen. Differential privacy is primarily studied in the context of the collection, analysis, and release of aggregate statistics. These range from simple statistical estimations, such as averages, to machine learning. Tools for differentially private analysis are now in early stages of implementation and use across a variety of academic, industry, and government settings. Interest in the concept is growing among potential users of the tools, as well as within legal and policy communities, as it holds promise as a potential approach to satisfying legal requirements for privacy protection when handling personal information. In particular, differential privacy may be seen as a technical solution for analyzing and sharing data while protecting the privacy of individuals in accordance with existing legal or policy requirements for de-identification or disclosure limitation.

This primer seeks to introduce the concept of differential privacy and its privacy implications to non-technical audiences. It provides a simplified and informal, but mathematically accurate, description of differential privacy. Using intuitive illustrations and limited mathematical formalism, it discusses the definition of differential privacy, how differential privacy addresses privacy risks, how differentially private analyses are constructed, and how such analyses can be used in practice. A series of illustrations is used to show how practitioners and policymakers can conceptualize the guarantees provided by differential privacy. These illustrations are also used to explain related concepts, such as composition (the accumulation of risk across multiple analyses), privacy loss parameters, and privacy budgets.
This primer aims to provide a foundation that can guide future decisions when analyzing and sharing statistical data about individuals, informing individuals about the privacy protection they will be afforded, and designing policies and regulations for robust privacy protection.

Differential privacy is not a single tool, but rather a criterion, which many tools for analyzing sensitive personal information have been devised to satisfy. It provides a mathematically provable guarantee of privacy protection against a wide range of privacy attacks, defined as attempts to learn private information specific to individuals from a data release. Privacy attacks include re-identification, record linkage, and differencing attacks, but may also include other attacks currently unknown or unforeseen. These concerns are separate from security attacks, which are characterized by attempts to exploit vulnerabilities in order to gain unauthorized access to a system.
Computer scientists have developed a robust theory for differential privacy over the last fifteen years, and major commercial and government implementations are starting to emerge.
The differential privacy guarantee (Part III). Differential privacy mathematically guarantees that anyone viewing the result of a differentially private analysis will essentially make the same inference about any individual's private information, whether or not that individual's private information is included in the input to the analysis.
The privacy loss parameter (Section IV.B). What can be learned about an individual as a result of her private information being included in a differentially private analysis is limited and quantified by a privacy loss parameter, usually denoted epsilon (ε). Privacy loss can grow as an individual's information is used in multiple analyses, but the increase is bounded as a known function of ε and the number of analyses performed.
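The accumulation described above is known in the literature as basic composition. Stated informally (this is a standard result from the differential privacy literature, included here only as a reference), it reads:

```latex
\text{If each of } k \text{ analyses is } \varepsilon\text{-differentially private, then}
\quad
\text{the combined privacy loss of releasing all } k \text{ results is at most } k \cdot \varepsilon .
```

This worst-case bound is what makes it possible to manage cumulative risk with a privacy budget, as discussed later in the primer.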
Interpreting the guarantee (Section VI.C). The differential privacy guarantee can be understood in reference to other privacy concepts:

• Differential privacy protects an individual's information essentially as if her information were not used in the analysis at all, in the sense that the outcome of a differentially private algorithm is approximately the same whether the individual's information was used or not.

• Differential privacy ensures that using an individual's data will not reveal essentially any personally identifiable information that is specific to her, or even whether the individual's information was used at all. Here, specific refers to information that cannot be inferred unless the individual's information is used in the analysis.

As these statements suggest, differential privacy is a new way of protecting privacy that is more quantifiable and comprehensive than the concepts of privacy underlying many existing laws, policies, and practices around privacy and data protection. The differential privacy guarantee can be interpreted in reference to these other concepts, and can even accommodate variations in how they are defined across different laws. In many settings, data holders may be able to use differential privacy to demonstrate that they have complied with applicable legal and policy requirements for privacy protection.
Differentially private tools (Part VII). Differential privacy is currently in initial stages of implementation and use in various academic, industry, and government settings, and the number of practical tools providing this guarantee is continually growing. Multiple implementations of differential privacy have been deployed by corporations such as Google, Apple, and Uber, as well as federal agencies like the US Census Bureau. Additional differentially private tools are currently under development across industry and academia.
Some differentially private tools utilize an interactive mechanism, enabling users to submit queries about a dataset and receive corresponding differentially private results, such as custom-generated linear regressions. Other tools are non-interactive, enabling static data or data summaries, such as synthetic data or contingency tables, to be released and used.
In addition, some tools rely on a curator model, in which a database administrator has access to and uses private data to generate differentially private data summaries. Others rely on a local model, which does not require individuals to share their private data with a trusted third party, but rather requires individuals to answer questions about their own data in a differentially private manner. In a local model, each of these differentially private answers is not useful on its own, but many of them can be aggregated to perform useful statistical analysis.
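A classic illustration of the local model is randomized response, in which each individual randomizes her own yes-or-no answer before reporting it. The sketch below is only illustrative: the survey question, the coin-flip probabilities, and the population fraction are hypothetical, and real deployments calibrate the randomization to a chosen privacy loss parameter.

```python
import random

def randomized_response(truthful_answer: bool) -> bool:
    """Report the true answer only half the time; otherwise report a
    uniformly random coin flip. No single report is reliable on its
    own, which is what protects the individual."""
    if random.random() < 0.5:
        return truthful_answer
    return random.random() < 0.5

def estimate_fraction(reports):
    """Aggregate many randomized reports to estimate the true fraction
    of 'yes' answers. A reported 'yes' occurs with probability
    0.5 * p + 0.25 when the true fraction is p, so invert that."""
    reported_yes = sum(reports) / len(reports)
    return (reported_yes - 0.25) / 0.5

# Hypothetical population in which 30% would truthfully answer 'yes'.
random.seed(42)
truths = [random.random() < 0.3 for _ in range(100_000)]
reports = [randomized_response(t) for t in truths]
estimate = estimate_fraction(reports)
# `estimate` lands close to 0.3, even though no individual report
# can be trusted to reflect that person's true answer.
```

This mirrors the point made above: each differentially private answer is not useful on its own, but aggregating many of them supports useful statistical analysis.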
Benefits of differential privacy (Part VIII). Differential privacy is supported by a rich and rapidly advancing theory that enables one to reason with mathematical rigor about privacy risk. Adopting this formal approach to privacy yields a number of practical benefits for users:

• Systems that adhere to strong formal definitions like differential privacy provide protection that is robust to a wide range of potential privacy attacks, including attacks that are unknown at the time of deployment. An analyst using differentially private tools need not anticipate particular types of privacy attacks, as the guarantees of differential privacy hold regardless of the attack method that may be used.

• Differential privacy provides provable privacy guarantees with respect to the cumulative risk from successive data releases and is the only existing approach to privacy that provides such a guarantee.

• Differentially private tools also have the benefit of transparency, as it is not necessary to maintain secrecy around a differentially private computation or its parameters. This feature distinguishes differentially private tools from traditional de-identification techniques, which often conceal the extent to which the data have been transformed, thereby leaving data users with uncertainty regarding the accuracy of analyses on the data.
• Differentially private tools can be used to provide broad, public access to data or data summaries while preserving privacy. They can even enable wide access to data that cannot otherwise be shared due to privacy concerns. An important example is the use of differentially private synthetic data generation to produce public-use microdata. Differentially private tools can, therefore, help enable researchers, policymakers, and businesses to analyze and share sensitive data, while providing strong guarantees of privacy to the individuals in the data.
Keywords: differential privacy, data privacy, social science research

I. INTRODUCTION
Businesses, government agencies, and research institutions often use and share data containing sensitive or confidential information about individuals. 1 Improper disclosure of such data can have adverse consequences for a data subject's reputation, finances, employability, and insurability, as well as lead to civil liability, criminal penalties, or physical or emotional injuries. 2 Due to these issues and other related concerns, a large body of laws, regulations, ethical codes, institutional policies, contracts, and best practices has emerged to address potential privacy-related harms associated with the collection, use, and release of personal information. 3 The following discussion provides an overview of the broader data privacy landscape that has motivated the development of formal privacy models like differential privacy.

A. Introduction to Legal and Ethical Frameworks for Data Privacy
The legal framework for privacy protection in the United States has evolved as a patchwork of highly sector- and context-specific federal and state laws. 4 For instance, Congress has enacted federal information privacy laws to protect certain categories of personal information found in health, 5 education, 6 financial, 7 and government records, 8 among others. These laws often expressly protect information classified as personally identifiable information (PII), which generally refers to information that can be linked to an individual's identity or attributes. 9 Some laws also incorporate de-identification provisions, which provide for the release of information that has been stripped of PII. 10 State data protection and breach notification laws prescribe specific data security and breach reporting requirements when managing certain types of personal information. 11 In addition, federal regulations generally require researchers conducting studies involving human subjects to secure approval from an institutional review board and fulfill ethical obligations to the participants, such as disclosing the risks of participation, obtaining their informed consent, and implementing specific measures to protect privacy. 12 It is also common for universities and other research institutions to adopt policies that require their faculty, staff, and students to abide by certain ethical and professional responsibility standards and set forth enforcement procedures and penalties for mishandling data. 13 Further restrictions apply when privacy-sensitive data are shared under the terms of a data sharing agreement, which will often strictly limit how the recipient can use or redisclose the data received. 14 Organizations may also require privacy measures set forth by technical standards, such as those specifying information security controls to protect personally identifiable information. 15 In addition, laws such as the EU General Data Protection Regulation are in place to protect personal data about European citizens regardless of where the data reside. 16 International privacy guidelines, such as the privacy principles developed by the Organisation for Economic Co-operation and Development, have also been adopted by governments across the world. 17 Moreover, the right to privacy is also protected by various international treaties and national constitutions. 18 Taken together, the safeguards required by these legal and ethical frameworks are designed to protect the privacy of individuals and ensure they fully understand both the scope of personal information to be collected and the associated privacy risks. They also help data holders avoid administrative, civil, and criminal penalties, as well as maintain the public's trust and confidence in commercial, government, and research activities involving personal data.

5. See, e.g., Health Insurance Portability and Accountability Act (HIPAA), Pub. L. No. 104-191, 110 Stat. 1936 (1996).

B. Traditional Statistical Disclosure Limitation Techniques
A number of technical measures for disclosing data while protecting the privacy of individuals have been produced within the context of these legal and ethical frameworks. 19 In particular, statistical agencies, data analysts, and researchers have widely adopted a collection of statistical disclosure limitation (SDL) techniques to analyze and share privacy-sensitive data with the aim of making it more difficult to learn personal information pertaining to an individual. 20 This category of techniques encompasses a wide range of methods for suppressing, aggregating, perturbing, and generalizing attributes of individuals in the data. 21 Such techniques are often applied with the explicit goal of de-identification, namely, making it difficult to link an identified person to a record in a data release by redacting or coarsening data. 22 Advances in analytical capabilities, increases in computational power, and the expanding availability of personal data from a wide range of sources are eroding the effectiveness of traditional SDL techniques. 23 Since the 1990s, and with increasing frequency, privacy and security researchers have demonstrated that data that have been de-identified can often be successfully re-identified via a technique such as record linkage. 24 Re-identification via record linkage, or a linkage attack, refers to the re-identification of one or more records in a de-identified dataset by uniquely linking a record in a de-identified dataset with identified records in a publicly available dataset, such as a voter registration list. 25 As described in Example 1 below, in the late 1990s, Latanya Sweeney famously applied such an attack to a dataset containing de-identified hospital records. 26 Sweeney observed that the dataset included the date of birth, gender, and ZIP code of patients; that many of the patients had a unique combination of these three attributes; and that these three attributes were listed alongside individuals' names and addresses in publicly available voting records. 27 Sweeney used this information to re-identify records in the de-identified dataset. 28 Subsequent attacks on protected data have demonstrated weaknesses in other traditional approaches to privacy protection, and understanding the limits of these traditional techniques is the subject of ongoing research. 29

C. The Emergence of Formal Privacy Models
Re-identification attacks are becoming increasingly sophisticated over time, as are other types of attacks that seek to infer characteristics of individuals based on information about them in a data set. 30 Successful attacks on de-identified data illustrate that traditional technical measures for privacy protection may be particularly vulnerable to attacks devised after a technique's deployment and use. 31 Some de-identification techniques, for example, require the specification of attributes in the data as identifying (e.g., names, dates of birth, or addresses) or non-identifying (e.g., movie ratings or hospital admission dates). 32 Data providers may later discover that attributes initially believed to be non-identifying can in fact be used to re-identify individuals. 33 Similarly, de-identification procedures may require a careful analysis of present and future data sources that could potentially be linked with the de-identified data and enable re-identification of the data. Anticipating the types of attacks and resources an attacker could leverage is a challenging exercise and ultimately will fail to address all potential attacks, as unanticipated sources of auxiliary information that can be used for re-identification may become available in the future. 34 Issues such as these underscore the need for privacy technologies that are immune not only to linkage attacks, but to any potential attack, including those currently unknown or unforeseen. 35 They also demonstrate that privacy technologies must provide meaningful privacy protection in settings where extensive external information may be available to potential attackers, such as employers, insurance companies, relatives, and friends of an individual in the data. 36

Real-world attacks further illustrate that ex post remedies, such as simply "taking the data back" when a vulnerability is discovered, are ineffective because many copies of a set of data typically exist, and copies often persist online indefinitely. 37 In response to the accumulated evidence of weaknesses with respect to traditional approaches, a new privacy paradigm has emerged from the computer science literature: differential privacy. 38 Differential privacy is primarily studied in the context of the collection, analysis, and release of aggregate statistics. Such analyses range from simple statistical estimations, such as averages, to machine learning. 39 Contrary to common intuition, aggregate statistics such as these are not always safe to release because, as Part III explains, they can often be combined to reveal sensitive information about individual data subjects.
First presented in 2006, 40 differential privacy is the subject of ongoing research to develop privacy technologies that provide robust protection against a wide range of potential attacks. 41 Importantly, differential privacy is not a single tool but a definition or standard for quantifying and managing privacy risks for which many technological tools have been devised. 42 Analyses performed with differential privacy differ from standard statistical analyses, such as the calculation of averages, medians, and linear regression equations, in that random noise 43 is added in the computation. 44 Tools for differentially private analysis are now in early stages of implementation and use across a variety of academic, industry, and government settings. 45 This Article provides a simplified and informal, yet mathematically accurate, description of differential privacy. 46 Using intuitive illustrations and limited mathematical formalism, it describes the definition of differential privacy, how it addresses privacy risks, how differentially private analyses are constructed, and how such analyses can be used in practice. This discussion intends to help non-technical audiences understand the guarantees provided by differential privacy. It can help guide practitioners as they make decisions regarding whether to use differential privacy and, if so, what types of promises they should make to data subjects about the guarantees differential privacy provides. In addition, these illustrations intend to help legal scholars and policymakers consider how current and future legal frameworks and instruments should apply to tools based on formal privacy models such as differential privacy.
Random noise refers to uncertainty introduced into a computation by the addition of values sampled from a random process. For example, consider a computation that first calculates the number n of individuals in the dataset who suffer from diabetes, then samples a value X from a normal distribution with a mean of 0 and variance of 1, and outputs C = n + X. In this example, the random noise X is added in the computation to the exact count n to produce the noisy output C. For a more detailed explanation of random noise, see infra Part IV.
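The noisy-count illustration above can be sketched in a few lines of Python. The dataset and attribute name are hypothetical, and the normal distribution with mean 0 and variance 1 simply mirrors the example: a real differentially private release would calibrate the noise distribution and scale to the privacy loss parameter, as discussed in Part IV.

```python
import random

def noisy_count(records, predicate):
    """Count the records satisfying `predicate`, then mask the exact
    count with a value drawn from a normal distribution with mean 0
    and variance 1, as in the example above."""
    exact_count = sum(1 for r in records if predicate(r))  # the exact count n
    noise = random.gauss(0, 1)                             # X ~ Normal(0, 1)
    return exact_count + noise                             # C = n + X

# Hypothetical dataset: each record notes whether the person has diabetes.
records = [{"has_diabetes": True}, {"has_diabetes": False},
           {"has_diabetes": True}]
result = noisy_count(records, lambda r: r["has_diabetes"])
# `result` falls near the exact count of 2, but is rarely exactly 2.
```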

II. PRIVACY: A PROPERTY OF THE ANALYSIS-NOT ITS OUTPUT
This Article seeks to explain how data containing personal information can be shared in a form that ensures the privacy of the individuals in the data will be protected. The formal study of privacy in the theoretical computer science literature has yielded insights into this problem and revealed why so many traditional privacy-preserving techniques have failed to adequately protect privacy in practice. First, many traditional approaches to privacy failed to acknowledge that attackers could use information obtained from outside the system (i.e., auxiliary information) in their attempts to learn private individual information from a data release. 47 As the amount of detailed auxiliary information continues to grow and become more widely available over time, any privacy-preserving method must take auxiliary information into account in order to provide a reasonable level of privacy protection in light of any auxiliary information that an attacker may hold. 48 Furthermore, traditional approaches treated privacy as a property of the output of an analysis, whereas it is now understood that privacy should be viewed as a property of the analysis itself. 49 Any privacy-preserving method, including differential privacy, must adhere to this general principle in order to guarantee privacy protection.
The following discussion provides an intuitive explanation of these principles, beginning with a cautionary tale about the re-identification of anonymized records released by the Massachusetts Group Insurance Commission. 50

Example 1
In the late 1990s, the Group Insurance Commission, an agency providing health insurance to Massachusetts state employees, allowed researchers to access anonymized records summarizing information about all hospital visits made by state employees. The agency anticipated that the analysis of these records would lead to recommendations for improving healthcare and controlling costs. 47 Massachusetts Governor William Weld reassured the public that steps would be taken to protect the privacy of patients in the data. Before releasing the records to researchers, the agency removed names, addresses, Social Security numbers, and other pieces of information that could be used to identify individuals in the records.
Viewing this as a challenge, Professor Latanya Sweeney, then a graduate student at MIT, set out to identify Governor Weld's record in the dataset. She obtained demographic information about Governor Weld, including his ZIP code and date of birth, by requesting a copy of voter registration records made available to the public for a small fee. Finding just one record in the anonymized medical claims dataset that matched Governor Weld's gender, ZIP code, and date of birth enabled her to mail the Governor a copy of his personal medical records.
As Example 1 illustrates, in many cases, a dataset that appears to be anonymous may nevertheless be used to learn sensitive information about individuals. In her demonstration, Professor Sweeney used voter registration records as auxiliary information in an attack. This re-identification demonstrates the importance of using privacy-preserving methods that are robust to auxiliary information that may be exploited by an adversary. Following Professor Sweeney's famous demonstration, a long series of attacks has been carried out against different types of data releases anonymized using a wide range of techniques and auxiliary information. 51 These attacks have shown that risks remain even if additional pieces of information, such as those that were leveraged in Professor Sweeney's attack (gender, date of birth, and ZIP code), are removed from a dataset prior to release. 52 Risks also remain when using some traditional SDL techniques, such as k-anonymity, which is satisfied for a dataset in which the identifying attributes that appear for each person are identical to those of at least k − 1 other individuals in the dataset. 53 Research has continually demonstrated that privacy measures that treat privacy as a property of the output, such as k-anonymity and other traditional statistical disclosure limitation techniques, will fail to protect privacy.
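For concreteness, the k-anonymity condition just described can be checked mechanically. The table and the choice of quasi-identifier columns below are hypothetical, and, as the surrounding text emphasizes, satisfying this check does not by itself guarantee privacy:

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """A table satisfies k-anonymity when every combination of
    quasi-identifier values appears in at least k rows."""
    combos = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in combos.values())

rows = [
    {"zip": "02138", "age_range": "30-40", "diagnosis": "flu"},
    {"zip": "02138", "age_range": "30-40", "diagnosis": "asthma"},
    {"zip": "02139", "age_range": "20-30", "diagnosis": "flu"},
]
# The (zip, age_range) pair "02139"/"20-30" appears only once, so the
# full table is not 2-anonymous; the first two rows alone are.
print(is_k_anonymous(rows, ["zip", "age_range"], 2))      # False
print(is_k_anonymous(rows[:2], ["zip", "age_range"], 2))  # True
```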
The Authors offer a brief note on terminology before proceeding. The discussions throughout this Article use the terms "analysis" and "computation" interchangeably to refer to any transformation, usually performed by a computer program, of input data into some output.
As an example, consider an analysis on data containing personal information about individuals. The analysis may be as simple as determining the average age of the individuals in the data, or it may be more complex and utilize sophisticated modeling and inference techniques.
In any case, the analysis involves performing a computation on input data and outputting the result. Figure 1 illustrates this notion of an analysis.

Figure 1. An Analysis
This primer focuses, in particular, on analyses for transforming sensitive personal data into an output that can be released publicly. For example, an analysis may involve the application of techniques for aggregating or de-identifying a set of personal data in order to produce a sanitized version of the data that is safe to release. The data provider will want to ensure that publishing the output of this computation will not unintentionally leak information from the privacy-sensitive input data-but how?
A key insight from the theoretical computer science literature is that privacy is a property of the informational relationship between the input and output, not a property of the output alone. 54 The following discussion illustrates why this is the case through a series of examples.

Example 2
Anne, a staff member at a high school, would like to include statistics about student performance in a presentation. She considers publishing the fact that the GPA of a representative ninth-grade student is 3.5. Because the law protects certain student information held by educational institutions, she must ensure that the statistic will not inappropriately reveal student information, such as the GPA of any particular student.

54. This insight follows from a series of papers demonstrating privacy breaches enabled by leakages of information resulting from decisions made by the computation.
One might naturally think that Anne could examine the statistic itself and determine that it is unlikely to reveal private information about an individual student. However, although the publication of this statistic might seem harmless, Anne needs to know how the statistic was computed to make that determination. For instance, if the representative ninth-grade GPA was calculated by taking the GPA of the alphabetically first student in the school, then the statistic completely reveals the GPA of that student. 55

Example 3
Alternatively, Anne considers calculating a representative statistic based on average features of the ninth graders at the school. She takes the most common first name, the most common last name, the average age, and the average GPA for the ninth-grade class. What she produces is "John Smith, a fourteen-year-old in the ninth grade, has a 3.1 GPA." Anne includes this statistic and the method used to compute it in her presentation. In an unlikely turn of events, a new ninth-grade student named John Smith joins the class the following week.
Although the output of Anne's analysis looks like it reveals private information about the new ninth grader John Smith, it actually does not, because the analysis itself was not based on his student records in any way. While Anne might decide to present the statistic differently to avoid confusion, using it would not reveal private information about John. It may seem counterintuitive that releasing a "representative" GPA violates privacy (as shown by Example 2), while releasing a GPA attached to a student's name would not (as shown by Example 3). Yet these examples illustrate that the key to preserving privacy is the informational relationship between the private input and the public output, and not the output itself. Furthermore, not only is it necessary to examine the analysis itself to determine whether a statistic can be published while preserving privacy, but it is also sufficient. In other words, if one knows whether the process used to generate a statistic preserves privacy, the output statistic does not need to be considered at all.

55. One might object that the student's GPA is not traceable back to that student unless an observer knows how the statistic was produced. However, a basic principle of modern cryptography (known as Kerckhoffs' principle) holds that a system is not secure if its security depends on its inner workings being a secret. See AUGUSTE KERCKHOFFS, LA CRYPTOGRAPHIE MILITAIRE [MILITARY CRYPTOGRAPHY] 8 (1883). As applied in this example, this means that it is taken as an assumption that the algorithm behind a statistical analysis is public (or could potentially be public).

III. WHAT IS THE DIFFERENTIAL PRIVACY GUARANTEE?
The previous Part illustrates why privacy should be thought of as a property of a computation-but how does one know whether a particular computation has this property?
Intuitively, a computation protects the privacy of individuals in the data if its output does not reveal any information that is specific to any individual data subject. Differential privacy formalizes this intuition as a mathematical definition. 56 Just as we can show that an integer is even by demonstrating that it is divisible by two, we can show that a computation is differentially private by proving it meets the constraints of the definition of differential privacy. In turn, if a computation can be proven to be differentially private, we can rest assured that using the computation will not unduly reveal information specific to any data subject. 57 Here, the term specific refers to information that cannot be inferred unless the individual's information is used in the analysis. For example, the information released by Anne in Example 3 is not specific to the new ninth grader John Smith because it is computed without using his information.
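For readers who wish to see the formal requirement that a computation must satisfy, the standard definition from the computer science literature can be stated compactly; it is included here only as a reference, and the prose discussion that follows does not depend on it. A randomized computation M is ε-differentially private if, for every pair of datasets D and D′ that differ in the data of a single individual, and for every set S of possible outputs,

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S].
```

Smaller values of ε force the two output distributions to be closer together, corresponding to stronger privacy protection.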
The following example illustrates how differential privacy formalizes this intuitive privacy requirement as a definition.

Example 4
Researchers have selected a sample of individuals across the United States to participate in a survey exploring the relationship between socioeconomic status and health outcomes.
The participants were asked to complete a questionnaire covering topics concerning their residency, their finances, and their medical history.

56.
See Dwork.

One of the participants, John, is aware that individuals have been re-identified in previous releases of de-identified data and is concerned that personal information he provides about himself, such as his medical history or annual income, could one day be revealed in de-identified data released from this study. If leaked, this information could lead to a higher life insurance premium or an adverse decision with respect to a future mortgage application. 58 Differential privacy can be used to address John's concerns. If the researchers promise they will only share survey data after processing the data with a differentially private computation, John is guaranteed that any data the researchers release will disclose essentially nothing that is specific to him, even though he participated in the study. 59 To understand what this means, consider the thought experiment, illustrated in Figure 2 and referred to as John's opt-out scenario. In John's opt-out scenario, an analysis is performed using data about the individuals in the study, except that information about John is omitted. His privacy is protected in the sense that the outcome of the analysis does not depend on his specific information, because his information was not used in the analysis at all.

Figure 2. John's Opt-Out Scenario
John's opt-out scenario differs from the real-world scenario depicted in Figure 1, where John's information is part of the input of the analysis along with the personal information of the other study participants. In contrast to his opt-out scenario, the real-world scenario involves some potential risk to John's privacy. Some of his personal information could be revealed by the outcome of the analysis because his information was used as input to the computation. 60

58. Note that these examples are introduced for the purposes of illustrating a general category of privacy-related risks relevant to this discussion, not as a claim that life insurance and mortgage companies currently engage in this practice.

59. Intuitively, the opt-out scenario and real-world scenario are very similar, and the difference between the two scenarios is measurable and small, as described in more detail in Part IV.

A. Examples Illustrating What Differential Privacy Protects
Differential privacy aims to protect John's privacy in the real-world scenario in a way that mimics the privacy protection he is afforded in his opt-out scenario. 61 In other words, what can be learned about John from a differentially private computation is essentially limited to what could be learned about him from everyone else's data without his own data being included in the computation. Crucially, this same guarantee is made not only with respect to John, but also with respect to every other individual contributing her information to the analysis.
A precise description of the differential privacy guarantee requires using formal mathematical language, as well as technical concepts and reasoning that are beyond the scope of this Article. In lieu of the mathematical definition, this Article offers a few illustrative examples to discuss various aspects of differential privacy in a way designed to be intuitive and generally accessible. The scenarios in this Section illustrate the types of information disclosures that are addressed when using differential privacy.

Example 5
Alice and Bob are professors at Private University. They both have access to a database that contains personal information about students at the university, including information related to the financial aid each student receives. Because it contains personal information, access to the database is restricted. To gain access, Alice and Bob were required to demonstrate they planned to follow the university's protocols for handling personal data by undergoing confidentiality training and signing data use agreements proscribing their use and disclosure of personal information obtained from the database.

61. See generally Dwork, Differential Privacy, supra note 46. It is important to note that the use of differentially private analysis is not equivalent to the traditional use of opting out. On the privacy side, differential privacy does not require an explicit opt-out; in comparison, traditional use of opt-out may cause privacy harms by calling attention to individuals who choose to opt out. On the utility side, there is no general expectation that using differential privacy would yield the same outcomes as adopting the policy of opt-out.
In March, Alice publishes an article based on the information in this database and writes that "the current freshman class at Private University is made up of 3,005 students, 202 of whom are from families earning over $350,000 per year." Alice reasons that, because she published an aggregate statistic computed over 3,005 people, no individual's personal information will be exposed. The following month, Bob publishes a separate article containing these statistics: "201 students in Private University's freshman class of 3,004 have household incomes exceeding $350,000 per year." Neither Alice nor Bob is aware that they have both published similar information.
A clever student, Eve, reads both of these articles and makes an observation. From the published information, Eve concludes that between March and April one freshman withdrew from Private University and that the student's parents earn over $350,000 per year. Eve asks around and is able to determine that a student named John dropped out around the end of March. Eve then informs her classmates that John's family probably earns over $350,000 per year.
John hears about this and is upset that his former classmates learned about his family's financial status. He complains to the university, and Alice and Bob are asked to explain. In their defense, both Alice and Bob argue that they published only information that had been aggregated over a large population and does not identify any individuals.
Example 5 illustrates how, in combination, the results of multiple analyses using information about the same people may enable one to draw conclusions about individuals in the data. Alice and Bob each published information that, in isolation, seems innocuous. However, when combined, the information they published compromised John's privacy. This type of privacy breach is difficult for Alice or Bob to prevent individually, as neither knows what information others have already revealed or will reveal in the future. This is referred to as the problem of composition. 62
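The differencing attack Eve performs reduces to a single subtraction. The sketch below is a hypothetical illustration, using only the figures from Example 5, of how two accurate aggregate releases, each harmless alone, compose into a disclosure about one student:

```python
# Alice's March release and Bob's April release, as exact statistics.
march = {"class_size": 3005, "high_income": 202}
april = {"class_size": 3004, "high_income": 201}

# Eve subtracts the two releases; the differences describe a single person.
withdrew = march["class_size"] - april["class_size"]            # one student left
from_high_income = march["high_income"] - april["high_income"]  # from a high-income family

print(withdrew, from_high_income)  # prints "1 1"
```

Because both releases are exact, the subtraction is exact too; no background knowledge beyond the two articles is needed.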

62. See Cynthia Dwork et al., Calibrating Noise to Sensitivity in Private Data Analysis, 7 J. PRIVACY & CONFIDENTIALITY 17, 28 (2016) (note that this article shares a title with, and is a later version of, the authors' prior paper, supra note 38); Srivatsava Ranjit Ganta, Shiva Prasad

Suppose, instead, that the institutional review board at Private University only allows researchers to access student records by submitting queries to a special data portal. This portal responds to every query with an answer produced by running a differentially private computation on the student records. As explained in Part IV, differentially private computations introduce a carefully tuned amount of random noise to the statistics they output. 63 This means that the computation gives an approximate answer to every question asked through the data portal. 64 As Example 6 illustrates, the use of differential privacy prevents the privacy leakage that occurred in Example 5.

Example 6
In March, Alice queries the data portal for the number of freshmen who come from families with a household income exceeding $350,000. The portal returns the noisy count of 204, leading Alice to write in her article that "the current freshman class at Private University includes approximately 200 students from families earning over $350,000 per year." In April, Bob asks the same question and gets the noisy count of 199 students. Bob publishes in his article that "approximately 200 families in Private University's freshman class have household incomes exceeding $350,000 per year." The publication of these noisy figures prevents Eve from concluding that one student, with a household income greater than $350,000, withdrew from the university in March. The risk that John's personal information could be uncovered based on these publications is thereby reduced.
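The portal's behavior can be sketched in a few lines of Python. The code below is not from the article; it assumes a standard Laplace-mechanism count, in which a counting query has sensitivity 1 (one person's record changes the count by at most 1) so that noise of scale 1/ε yields ε-differential privacy. The function name `noisy_count` is illustrative:

```python
import math
import random

def noisy_count(true_count, epsilon):
    """Return an epsilon-differentially private count (a sketch).

    A counting query has sensitivity 1, so Laplace noise with scale
    1/epsilon suffices; smaller epsilon means more noise, more privacy.
    """
    scale = 1.0 / epsilon
    u = random.random() - 0.5  # uniform on [-0.5, 0.5)
    # Inverse-CDF sampling of a Laplace(0, scale) random variable.
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return round(true_count + noise)

# The portal would answer Alice's and Bob's queries with independent
# noisy counts near the true values of 202 and 201:
print(noisy_count(202, epsilon=0.1))  # a noisy answer near 202
print(noisy_count(201, epsilon=0.1))  # a fresh, independent noisy answer near 201
```

Because each answer carries independent noise, subtracting the two published figures no longer isolates a single student, which is exactly what defeats Eve's differencing attack.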
Example 6 hints at one of the most important properties of differential privacy: it is robust under composition. 65 If multiple analyses are performed on data describing the same set of individuals, then, as long as each of the analyses satisfies differential privacy, it is guaranteed that all of the information released, when taken together, will still be differentially private. 66 This is markedly different from Example 5, in which Alice and Bob do not use differentially private analyses and inadvertently release two statistics that, when combined, lead to the full disclosure of John's personal information. The use of differential privacy rules out the possibility of such a complete breach of privacy. This is because differential privacy enables one to measure and bound the cumulative privacy risk from multiple analyses of information about the same individuals. 67

It is important to note, however, that every analysis, regardless of whether it is differentially private or not, results in some leakage of information about the individuals whose information is being analyzed. This is a well-established principle within the statistical community, as evidenced by a 2005 report that concluded "[t]he release of statistical data inevitably reveals some information about individual data subjects." 68 Furthermore, this leakage accumulates with each analysis, potentially to a point where an attacker may infer the underlying data. 69 This is true for every release of data, including releases of aggregate statistics. 70 In particular, releasing too many aggregate statistics too accurately inherently leads to severe privacy loss. 71 For this reason, there is a limit to how many analyses can be performed on a specific dataset while providing an acceptable guarantee of privacy. 72 This is why it is critical to measure privacy loss and to understand quantitatively how risk accumulates across successive analyses, as Sections IV.E and VI.A describe below.

B. Examples Illustrating What Differential Privacy Does Not Protect
The following examples illustrate the types of information disclosures differential privacy does not seek to address.

Example 7
Suppose Ellen is a friend of John's and knows some of his habits, such as that he regularly consumes several glasses of red wine with dinner. Ellen learns that John took part in a large research study, and that this study found a positive correlation between drinking red wine and the likelihood of developing a certain type of cancer. She might therefore conclude, based on the results of this study and her prior knowledge of John's drinking habits, that he has a heightened risk of developing cancer.
It may seem at first that the publication of the results from the research study enabled a privacy breach by Ellen. After all, learning about the study's findings helped her infer new information about John that he himself may be unaware of (i.e., his elevated cancer risk). However, notice that Ellen would be able to infer this information about John even if John had not participated in the medical study (i.e., it is a risk that exists in both John's opt-out scenario and the real-world scenario). 73 Risks of this nature apply to everyone, regardless of whether they shared personal data through the study or not. Consider another example:

Example 8
Ellen knows that her friend John is a public school teacher with five years of experience and that he is about to start a job in a new school district. She later comes across a local news article about a teachers' union dispute, which includes salary figures for the public school teachers in John's new school district. Ellen is able to approximately determine John's salary at his new job, based on the district's average salary for a teacher with five years of experience.
Note that, as in the previous example, Ellen can determine information about John (i.e., his new salary) from the published information, even though the published information was not based on John's information. In both examples, John could be adversely affected by the discovery of the results of an analysis, even in his opt-out scenario. In both John's opt-out scenario and in a differentially private real-world scenario, it is therefore not guaranteed that no information about John can be revealed. The use of differential privacy limits the revelation of information specific to John.

73. Ellen's inference would rely on factors such as the size of the study sample, whether the sampling was performed at random, and whether John comes from the same population as the sample, among others.

These examples suggest, more generally, that any useful analysis carries a risk of revealing some information about individuals. One might observe, however, that such risks are largely unavoidable. In a world in which data about individuals are collected, analyzed, and published, John cannot expect better privacy protection than is offered by his opt-out scenario because he has no ability to prevent others from participating in a research study or appearing in public records.
Moreover, the types of information disclosures enabled in John's opt-out scenario often result in individual and societal benefits. For example, the discovery of a causal relationship between red wine consumption and elevated cancer risk can lead to new public health recommendations, support future scientific research, and inform John about possible changes he could make in his habits that would likely have positive effects on his health. Similarly, the publication of public school teacher salaries may be seen as playing a critical role in transparency and public policy, as it can help communities make informed decisions regarding appropriate salaries for their public employees.

IV. HOW DOES DIFFERENTIAL PRIVACY LIMIT PRIVACY LOSS?
The previous Part explains that the only things that can be learned about a data subject from a differentially private data release are essentially what could have been learned if the analysis had been performed without that individual's data.
How do differentially private analyses achieve this goal? And what is meant by "essentially" when stating that the only things that can be learned about a data subject are essentially those things that could be learned without the data subject's information? The answers to these two questions are related. Differentially private analyses protect the privacy of individual data subjects by introducing carefully tuned random noise when producing statistics. 74 Differentially private analyses are also allowed to leak some small amount of information specific to individual data subjects. 75

A. Differential Privacy and Randomness
Example 6 shows that differentially private analyses introduce random noise to the statistics they produce. Intuitively, this noise masks the differences between the real-world computation and the opt-out scenario of each individual in the dataset. This means that the outcome of a differentially private analysis is not exact, but rather an approximation. In addition, a differentially private analysis may, if performed twice on the same dataset, return different results because it intentionally introduces random noise.
Therefore, analyses performed with differential privacy differ from standard statistical analyses, such as the calculation of averages, medians, and linear regression equations, in which one gets the same answer when a computation is repeated twice on the same dataset.

Example 9
Consider a differentially private analysis that computes the number of students in a sample with a GPA of at least 3.0. Say that there are 10,000 students in the sample, and exactly 5,603 of them have a GPA of at least 3.0. An analysis that added no random noise would report that 5,603 students had a GPA of at least 3.0.
A differentially private analysis, however, introduces random noise to protect the privacy of the data subjects. For instance, a differentially private analysis might report an answer of 5,521 when run on the student data; when run a second time on the same data, it might report an answer of 5,586. 77 Although a differentially private analysis might produce many different answers given the same dataset, it is usually possible to calculate accuracy bounds for the analysis measuring how much an output of the analysis is expected to differ from the noiseless answer. 78 Section VI.B discusses how the random noise introduced by a differentially private analysis affects statistical accuracy. Appendix A.1 provides more information about the role randomness plays in the construction of differentially private analyses.

77. Note that, if an analyst is allowed to repeat this computation multiple times, she could average out the noise and get the exact answer.
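The caveat in footnote 77, that unrestricted repetition lets an analyst average the noise away, can be seen in a short simulation. This sketch is illustrative and assumes a standard Laplace-mechanism count; the names, the value of ε, and the number of repetitions are our own choices:

```python
import math
import random

def noisy_count(true_count, epsilon):
    """epsilon-differentially private count via the Laplace mechanism (sketch)."""
    u = random.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

TRUE_COUNT = 5603  # exact number of students with a GPA of at least 3.0 (Example 9)
answers = [noisy_count(TRUE_COUNT, epsilon=0.1) for _ in range(1000)]

# Each individual answer deviates from 5,603, much as the reported 5,521
# and 5,586 do, but the average of many repetitions concentrates around
# the exact count, which is why unlimited repetition would defeat the noise.
print(min(answers), max(answers))   # individual answers vary widely
print(sum(answers) / len(answers))  # the average lands near 5,603
```

This is one reason repeated queries must be charged against a privacy budget rather than answered without limit.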

B. The Privacy Loss Parameter
An essential component of a differentially private computation is the privacy loss parameter, which determines how well each individual's information needs to be hidden and, consequently, how much noise needs to be introduced. 79 It can be thought of as a tuning knob for balancing privacy and accuracy. Each differentially private analysis can be tuned to provide more or less privacy-resulting in less or more accuracy, respectively-by changing the value of this parameter. The parameter can be thought of as limiting how much a differentially private computation is allowed to deviate from the opt-out scenario of each individual in the data.
Consider the opt-out scenario for a certain computation, such as estimating the number of HIV-positive individuals in a surveyed population. Ideally, this estimate should remain exactly the same whether or not a single individual, such as John discussed above, is included in the survey. However, as described above, ensuring that the estimate is exactly the same would require the total exclusion of John's information from the real-world analysis. It would also require excluding the information of other individuals (e.g., that of Gertrude, Peter, and so forth) in order to provide perfect privacy protection for them as well. Continuing this line of argument, one can conclude that the personal information of every single surveyed individual must be removed in order to satisfy each individual's opt-out scenario. Thus, the analysis cannot rely on any person's information and is completely useless.
To avoid this dilemma, differential privacy requires only that the output of the analysis remain approximately the same, whether John participates in the survey or not. That is, differential privacy allows for a deviation between the output of the real-world analysis and that of each individual's opt-out scenario. A parameter quantifies and limits the extent of the deviation between the opt-out and real-world scenarios. 80 As Figure 3 illustrates below, this parameter is usually denoted by the Greek letter ε (epsilon) and referred to as the privacy parameter or, more accurately, the privacy loss parameter. 81 The parameter ε measures the effect of each individual's information on the output of the analysis. It can also be viewed as a measure of the additional privacy risk an individual could incur beyond the risk incurred in the opt-out scenario. Note that Figure 3 replaces John with an arbitrary individual to emphasize that the differential privacy guarantee is made simultaneously to all individuals in the sample, not just John.

Figure 3. Differential Privacy
Moreover, it can be shown that the deviation between the real-world and opt-out scenarios cannot be increased by any further processing of the output of a differentially private analysis. Hence, the guarantees of differential privacy, described below, hold regardless of how an attacker may try to manipulate the output. In this sense, differential privacy is robust to a wide range of potential privacy attacks, including attacks that are unknown at the time of deployment. 82

Choosing a value for ε can be thought of as setting the desired level of privacy protection. This choice also affects the utility or accuracy that can be obtained from the analysis. 83 A smaller value of ε results in a smaller deviation between the real-world analysis and each opt-out scenario and is therefore associated with stronger privacy protection but less accuracy. 84 For example, when ε is set to zero, the real-world differentially private analysis mimics the opt-out scenario of each individual perfectly and simultaneously. However, an analysis that perfectly mimics the opt-out scenario of each individual would require ignoring all information from the input and, accordingly, could not provide any meaningful output. Yet, when ε is set to a small number such as 0.1, the deviation between the real-world computation and each individual's opt-out scenario will be small, providing strong privacy protection, while also enabling an analyst to derive useful statistics based on the data.

Accepted guidelines for choosing ε have not yet been developed. 85 The increasing use of differential privacy in real-life applications will likely shed light on how to reach a reasonable compromise between privacy and accuracy, and the accumulated evidence from these real-world decisions will likely contribute to the development of future guidelines. 86 As discussed in Section IV.D, the Authors of this Article recommend that, when possible, ε be set to a small number, such as a value less than 1. 87 As Figure 3 illustrates, the maximum deviation between the opt-out scenario and the real-world computation should hold simultaneously for each individual X whose information is included in the input.

82. The property that differential privacy is preserved under arbitrary further processing is referred to as (resilience to) post-processing.

84. See infra.

86. Setting the privacy loss parameter ε is a policy decision to be informed by normative and technical considerations. Companies and governments experimenting with practical implementations of differential privacy have selected various values for ε. Some of these implementations have adopted values of ε exceeding 1 due to the difficulty of meeting utility requirements using lower values of ε. To date, these choices of ε have not led to known vulnerabilities. For example, the US Census Bureau reportedly chose a value of ε = 8.9 for OnTheMap, a public interface which allows users to explore American commuting patterns using a variant of differential privacy. Although differential privacy is an emerging concept and has been deployed in limited applications to date, best practices may emerge over time as values for ε are selected for implementations of differential privacy in a wide range of settings. With this in mind, researchers have proposed that a registry be created to document details of differential privacy implementations, including the value of ε chosen and the factors that led to its selection.
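One way to build intuition for the tradeoff that ε controls is to look at the expected error of a Laplace-mechanism count for several values of ε. The sketch below is illustrative, not from the article; it relies on the standard fact that Laplace noise of scale 1/ε has an expected absolute error of exactly 1/ε counts:

```python
# Smaller epsilon -> stronger privacy -> larger expected error, and vice versa.
# The 8.9 entry echoes the value reportedly used for the Census Bureau's OnTheMap.
for epsilon in (0.01, 0.1, 0.5, 1.0, 8.9):
    expected_error = 1.0 / epsilon
    print(f"epsilon = {epsilon:>4}: expected error of about {expected_error:.1f} counts")
```

Read as a table, this makes the "tuning knob" concrete: moving from ε = 1.0 to ε = 0.01 buys much stronger privacy at the cost of answers that are off by roughly one hundred counts on average.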

C. Bounding Risk
The previous Section discusses how the privacy loss parameter limits the deviation between the real-world computation and each data subject's opt-out scenario. However, it might not be clear how this abstract guarantee relates to the privacy concerns individuals face in the real world. To help ground the concept, this Section discusses a practical interpretation of the privacy loss parameter. It describes how the parameter can be understood as a bound on the financial risk incurred by an individual participating in a research study.
Any useful analysis carries the risk that it will reveal information about the individuals in the data. 88 An individual whose information is used in an analysis may be concerned that a potential leakage of her personal information could result in reputational, financial, or other costs. Examples 10 and 11 below introduce a scenario in which an individual participating in a research study worries that an analysis on the data collected in the research study may leak information that could lead to a substantial increase in her life insurance premium. Example 12 illustrates that, while differential privacy cannot fully eliminate this risk, it can guarantee that the risk will be limited by quantitative bounds that depend on ε. 89

Example 10
Gertrude, a sixty-five-year-old woman, is considering whether to participate in a medical research study. While she can envision many potential personal and societal benefits resulting in part from her participation in the study, she is concerned that the personal information she discloses over the course of the study could lead to an increase in her life insurance premium in the future.
For example, Gertrude is concerned that the tests she would undergo as part of the research study would reveal that she is predisposed to suffer a stroke and is significantly more likely to die in the coming year than the average person of her age and gender. If such information related to Gertrude's increased risk of morbidity and mortality is discovered by her life insurance company, it will likely increase the premium for her annual renewable term policy substantially.

87. See discussion following.
Before she opts to participate in the study, Gertrude wishes to be assured that privacy measures are in place to ensure that her participation will have, at most, a limited effect on her life insurance premium.

A Baseline: Gertrude's Opt-Out Scenario
It is important to note that Gertrude's life insurance company may raise her premium based on something it learns from the medical research study, even if Gertrude does not herself participate in the study. The following example is provided to illustrate such a scenario. 90

Example 11
Gertrude holds a $100,000 life insurance policy. Her life insurance company has set her annual premium at $1,000, i.e., 1% of $100,000, based on actuarial tables showing that someone of Gertrude's age and gender has a 1% chance of dying in the next year.
Suppose Gertrude opts out of participating in the medical research study. Regardless, the study reveals that coffee drinkers are more likely to suffer a stroke than non-coffee drinkers. Gertrude's life insurance company may update its assessment and conclude that, as a sixty-five-year-old woman who drinks coffee, Gertrude has a 2% chance of dying in the next year. The company decides to increase Gertrude's annual premium from $1,000 to $2,000 based on the findings of the study. 91

90. Figures in this example are based on data from Actuarial Life

91. Note that there may be legal, policy, or other reasons why a company would not raise Gertrude's insurance premium based on the outcome of this study. Also, this is not a claim that insurance companies engage in this practice. Example 11 is introduced for the purposes of illustrating a general category of privacy-related risks relevant to this discussion. This example assumes that the insurance company updates its belief about Gertrude's chances of dying next year based on the outcome of this study using a Bayesian analysis. Furthermore, it assumes that Gertrude's premium is then updated in proportion to this change in belief. Differential privacy also allows one to reason (in a different manner) about a more general case where no assumptions are made regarding how the insurance company updates Gertrude's premium, but that analysis is omitted from this discussion for simplicity.

In this example, the results of the study led to an increase in Gertrude's life insurance premium, even though she did not contribute any personal information to the study. A potential increase of this nature is unavoidable to Gertrude in this scenario because she cannot prevent other people from participating in the study. This example illustrates that Gertrude can experience a financial loss even in her opt-out scenario. Because, as presented in this example, Gertrude cannot avoid this type of risk on her own, 92 in the following discussion this opt-out scenario will serve as a baseline for measuring potential increases in her privacy risk above this threshold.

Reasoning About Gertrude's Risk
Next consider the increase in risk, relative to Gertrude's opt-out scenario, that is due to her participation in the study.

Example 12
Suppose Gertrude decides to participate in the research study. Based on the results of medical tests performed on Gertrude over the course of the study, the researchers conclude that Gertrude has a 50% chance of dying from a stroke in the next year. If the data from the study were to be made available to Gertrude's insurance company, it might decide to increase her insurance premium to $50,000 in light of this discovery.

Fortunately for Gertrude, this does not happen. Rather than releasing the full dataset from the study, the researchers release only a differentially private summary of the data they collected. Differential privacy guarantees that, if the researchers use a value of ε = 0.01, then the insurance company's estimate of the probability that Gertrude will die in the next year can increase from the opt-out scenario's estimate of 2% to at most 2% ⋅ (1 + 0.01) = 2.02%.
92. Although Gertrude, acting as an individual, cannot avoid this risk, society or groups of individuals may collectively act to avoid such a risk. For example, the researchers could be prohibited from running the study, or the data subjects could collectively decide not to participate. Therefore, the use of differential privacy does not completely eliminate the need to make policy decisions regarding the value of allowing data collection and analysis in the first place.
Thus, Gertrude's insurance premium can increase from $2,000 to, at most, $2,020. Gertrude's first-year cost of participating in the research study, in terms of a potential increase in her insurance premium, is at most $20.
Note that this does not mean that the insurance company's estimate of the probability that Gertrude will die in the next year will necessarily increase as a result of her participation in the study, nor that if the estimate increases it must increase to 2.02%. What the analysis shows is that if the estimate were to increase it would not exceed 2.02%.
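The arithmetic behind Example 12's bound can be written out directly. The variable names in the sketch below are our own; the figures all come from the example:

```python
# Gertrude's worst-case exposure in Example 12, written out as arithmetic.
policy_value = 100_000   # value of Gertrude's life insurance policy ($)
baseline_risk = 0.02     # insurer's belief in her opt-out scenario (2%)
epsilon = 0.01           # privacy loss parameter used by the researchers

worst_case_risk = baseline_risk * (1 + epsilon)      # at most 2.02%
baseline_premium = baseline_risk * policy_value      # $2,000
worst_case_premium = worst_case_risk * policy_value  # at most $2,020

# Her first-year cost of participating is bounded by the difference, at most $20.
print(worst_case_premium - baseline_premium)
```

Note that the computation bounds the increase over the opt-out baseline; it says nothing about what the insurer would have believed absent the study.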
In this example, Gertrude is aware of the fact that the study could indicate that her risk of dying in the next year exceeds 1%. She happens to believe, however, that the study will not indicate more than a 2% risk of dying in the next year, in which case the potential cost to her of participating in the research will be at most $20. Based on her belief, Gertrude may decide that she considers the potential cost of $20 to be too high and that she cannot afford to participate with this value of ε and this level of risk. Alternatively, she may decide that it is worthwhile. Perhaps she is paid more than $20 to participate in the study, or the information she learns from the study is worth more than $20 to her. The key point is that differential privacy allows Gertrude to make a more informed decision based on the worst-case cost of her participation in the study.
It is worth noting that, should Gertrude decide to participate in the study, her risk might increase even if her insurance company is not aware of her participation. Gertrude might actually have a higher chance of dying in the next year, and that could affect the study results. In turn, her insurance company might decide to raise her premium because she fits the profile of the studied population, even if it does not believe her data were included in the study. Differential privacy guarantees that, even if the insurance company knows that Gertrude did participate in the study, it can only make inferences about her that it could have essentially made if she had not participated in the study.

D. A General Framework for Reasoning About Privacy Risk
Gertrude's scenario illustrates how differential privacy is a general framework for reasoning about the increased risk that is incurred when an individual's information is included in a data analysis. Differential privacy guarantees that an individual will be exposed to essentially the same privacy risk, whether or not her data are included in a differentially private analysis. 93 In this context, one can think of the privacy risk associated with a release of the output of a data analysis as the potential harm that an individual might incur because of a belief that an observer forms based on that data release.
In particular, when ε is set to a small value, an observer's posterior belief can change, relative to the case where the data subject is not included in the dataset, by a factor of at most approximately 1 + ε based on a differentially private data release. 94 For example, if ε is set to 0.01, then the privacy risk to an individual resulting from participation in a differentially private computation grows by at most a multiplicative factor of 1.01.
As Examples 11 and 12 illustrate, there is a risk to Gertrude that the insurance company will see the study results, update its beliefs about the mortality of Gertrude, and charge her a higher premium. If the insurance company infers from the study results that Gertrude has probability p of dying in the next year and her insurance policy is valued at $100,000, her premium will increase to p × $100,000. This risk exists, even if Gertrude does not participate in the study. Recall how, in Example 11, the insurance company's belief that Gertrude will die in the next year doubles from 1% to 2%, increasing her premium from $1,000 to $2,000, based on general information learned from the individuals who did participate. Recall also that if Gertrude does decide to participate in the study (as in Example 12), differential privacy limits the change in this risk relative to her opt-out scenario. In financial terms, her risk increases by at most $20, since the insurance company's beliefs about her probability of death change from 2% to at most 2% ⋅ (1 + ε) = 2.02%, where ε = 0.01.
Note that the above calculation requires certain information that may be difficult to determine in the real world. In particular, the 2% baseline in Gertrude's opt-out scenario (i.e., Gertrude's insurer's belief about her chance of dying in the next year) is dependent on the results from the medical research study, which Gertrude does not know at the time she makes her decision whether to participate. Fortunately, differential privacy provides guarantees relative to every baseline risk. 95

93.
See Dwork. In general, the guarantee made by differential privacy is that the probabilities differ by at most a factor of e^(±ε), which is approximately 1 ± ε when ε is small. Say that, without her participation, the study results would lead the insurance company to believe that Gertrude has a 3% chance of dying in the next year (instead of the 2% chance hypothesized earlier). This means that Gertrude's insurance premium would increase to $3,000. Differential privacy guarantees that, if Gertrude had instead decided to participate in the study, the insurer's estimate for Gertrude's mortality would have been at most 3% ⋅ (1 + ε) = 3.03% (assuming an ε of 0.01), which means that her premium would not increase beyond $3,030.
Calculations like those used in the analysis of Gertrude's privacy risk can be performed by referring to Table 1. For example, the value of ε used in the research study Gertrude considered participating in was 0.01, and the baseline privacy risk in her opt-out scenario was 2%. As shown in Table 1, these values correspond to a worst-case privacy risk of 2.02% in her real-world scenario. Notice also how the calculation of risk would change with different values. For example, if the privacy risk in Gertrude's opt-out scenario were 5% rather than 2% and the value of ε remained the same, then the worst-case privacy risk in her real-world scenario would be 5.05%.

Table 1. Maximal Difference Between Posterior Beliefs in Gertrude's Opt-Out and Real-World Scenarios
The notation A(x′) refers to the application of the analysis A on the dataset x′, which does not include Gertrude's information. As this table shows, the use of differential privacy provides a quantitative bound on how much one can learn about an individual from a computation. The fact that the differential privacy guarantee applies to every privacy risk means that Gertrude can know for certain how participating in the study might increase her risks relative to opting out, even if she does not know a priori all the privacy risks posed by the data release. This enables Gertrude to make a more informed decision about whether to take part in the study. For instance, perhaps with the help of the researcher obtaining her informed consent, Gertrude can use this framework to better understand how the additional risk she may incur by participating in the study is bounded. By considering the bound with respect to a range of possible baseline risk values, she may 96 decide whether she is comfortable with taking on the risks entailed by these different scenarios. Table 1 demonstrates how significant changes in posterior belief compared to the opt-out baseline can be for different values of ε. Notice how, at ε = 1, a belief that Gertrude has a certain condition with 1% probability in the opt-out scenario would become 2.67%, which is quite a large factor increase (more than double), and a 50% belief would become nearly a 75% belief (also a very significant change). For ε = 0.2 and ε = 0.5, the changes are more modest, but could still be considered too large, depending on how sensitive the data are. For ε = 0.1 and below, the changes in beliefs may be deemed small enough for most applications.

96. For p, the posterior belief given A(x′), and privacy parameter ε, the bound on the posterior belief given A(x) is p ⋅ e^ε / (1 − p + p ⋅ e^ε).
Also note that the entries in Table 1 are the worst-case bounds that are guaranteed by a given setting of ε. An adversary's actual posterior beliefs given A(x) may be smaller in a given practical application, depending on the distribution of the data, the specific differentially private algorithms used, and the adversary's prior beliefs and auxiliary information.
That is, in a real-world application, a particular choice of ε may turn out to be safer than Table 1 indicates, but it can be difficult to quantify how much safer.
The exact choice of ε is a policy decision that should depend on the sensitivity of the data, with whom the output will be shared, the intended data analysts' accuracy requirements, and other technical and normative factors. Table 1, and explanations interpreting it such as the examples provided in this Section, can help provide the kind of information needed to make such a policy decision.
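The worst-case figures discussed above can be reproduced with a short Python function. This is an illustrative sketch (the function name is ours, not from the Article); it uses the bound p ⋅ e^ε / (1 − p + p ⋅ e^ε) on the real-world posterior belief, where p is the posterior belief in the opt-out scenario:

```python
import math

def worst_case_posterior(p_opt_out: float, epsilon: float) -> float:
    """Worst-case posterior belief in the real-world scenario, given the
    posterior belief in the opt-out scenario and the privacy loss parameter."""
    scaled = p_opt_out * math.exp(epsilon)
    return scaled / (1 - p_opt_out + scaled)

# Gertrude's case: a 2% opt-out belief and epsilon = 0.01
print(round(worst_case_posterior(0.02, 0.01) * 100, 2))  # → 2.02

# A 5% opt-out belief with the same epsilon
print(round(worst_case_posterior(0.05, 0.01) * 100, 2))  # → 5.05

# With epsilon = 1, a 1% belief can more than double, and a 50% belief
# can grow to nearly 75%
print(round(worst_case_posterior(0.01, 1.0) * 100, 2))   # → 2.67
print(round(worst_case_posterior(0.50, 1.0) * 100, 2))   # → 73.11
```

Note that for small ε the bound is close to the simpler approximation p ⋅ (1 + ε) used in the running examples.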

E. Composition
Privacy risk accumulates with multiple analyses on an individual's data, and this is true whether or not any privacy-preserving technique is applied. 97 One of the most powerful features of differential privacy is its robustness under composition. 98 One can reason about, and bound, the privacy risk that accumulates when multiple differentially private computations are performed on an individual's data. 99

97.
See DWORK & ROTH, supra note 25, at 5. Note that this observation is not unique to differentially private analyses. It is true for any use of information, and, therefore, for any approach to preserving privacy. However, the fact that the cumulative privacy risk from multiple analyses can be bounded is a distinguishing property of differential privacy.
The parameter ε quantifies how privacy risk accumulates across multiple differentially private analyses. Imagine that two differentially private computations are performed on datasets about the same individuals. If the first computation uses a parameter of ε₁ and the second uses a parameter of ε₂, then the cumulative privacy risk resulting from these computations is no greater than the risk associated with an aggregate parameter of ε₁ + ε₂. 100 In other words, the privacy risk from running the two analyses is bounded by the privacy risk from running a single differentially private analysis with a parameter of ε₁ + ε₂.

Example 14
Suppose that Gertrude decides to opt into the medical study because it is about heart disease, an area of research she considers critically important. The study leads to a published research paper, which includes results from the study produced by a differentially private analysis with a parameter of ε₁ = 0.01. A few months later, the researchers decide that they want to use the same study data for another paper. This second paper would explore a hypothesis about acid reflux disease and would require calculating new statistics based on the original study data. Like the analysis results in the first paper, these statistics would be computed using differential privacy, but this time with a parameter of ε₂ = 0.02.
Because she only consented to her data being used in research about heart disease, the researchers must obtain Gertrude's permission to reuse her data for the paper on acid reflux disease. Gertrude is concerned that her insurance company could compare the results from both papers and learn something negative about Gertrude's life expectancy and drastically raise her insurance premium. She is not particularly interested in participating in a research study about acid reflux disease and is concerned the risks of participation might outweigh the benefits to her.
Because the statistics from each study are produced using differentially private analyses, Gertrude can precisely bound the privacy risk that would result from contributing her data to the second study. The combined analyses can be thought of as a single analysis with a privacy loss parameter of ε₁ + ε₂ = 0.01 + 0.02 = 0.03. Say that, without her participation in either study, the insurance company would believe that Gertrude has a 2% chance of dying in the next year, leading to a premium of $2,000. If Gertrude participates in both studies, the insurance company's estimate of Gertrude's mortality would increase to at most 2% ⋅ (1 + 0.03) = 2.06%.
This corresponds to a premium increase of $60 over the premium that Gertrude would pay if she had not participated in either study.
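Example 14's arithmetic can be checked directly. The sketch below uses the approximate bound p ⋅ (1 + ε) from the text (the variable names are ours):

```python
# Composition in Example 14: the two studies together behave like a single
# analysis with privacy loss parameter eps1 + eps2.
eps_total = 0.01 + 0.02
baseline_belief = 0.02      # insurer's belief absent Gertrude's participation
policy_value = 100_000      # value of Gertrude's policy, in dollars

worst_case_belief = baseline_belief * (1 + eps_total)
premium_increase = (worst_case_belief - baseline_belief) * policy_value

print(f"{worst_case_belief:.2%}")   # → 2.06%
print(round(premium_increase, 2))   # → 60.0
```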
This means that, while it cannot get around the fundamental law that privacy risk increases when multiple analyses are performed on the same individual's data, differential privacy guarantees that privacy risk accumulates in a bounded way. 101 Despite the accumulation of risk, two differentially private analyses cannot be combined in a way that leads to a privacy breach that is disproportionate to the privacy risk associated with each analysis in isolation. To the Authors' knowledge, differential privacy is currently the only known framework with quantifiable guarantees with respect to how risk accumulates across multiple analyses.

V. WHAT TYPES OF ANALYSES ARE PERFORMED WITH DIFFERENTIAL PRIVACY?
A large number of analyses can be performed with differential privacy guarantees. Differentially private algorithms are known to exist for a wide range of statistical analyses such as count queries, histograms, cumulative distribution functions, and linear regression; techniques used in statistics and machine learning such as clustering and classification; and statistical disclosure limitation techniques like synthetic data generation, among many others.
For the purposes of illustrating that broad classes of analyses can be performed using differential privacy, the discussion in this Part provides a brief overview of each of these types of analyses and how they can be performed with differential privacy guarantees. 102
102.
The discussion in this Part provides only a brief introduction to a number of statistical and machine learning concepts. For a more detailed introduction to these concepts, see

• Count queries: The most basic statistical tool, a count query, returns an estimate of the number of individual records in the data satisfying a specific predicate. 103 For example, a count query could be used to return the number of records corresponding to HIV-positive individuals in a sample. Differentially private answers to count queries can be obtained through the addition of random noise, as demonstrated in the detailed example found in Appendix A.1.

• Histograms: A histogram contains the counts of data points as they are classified into disjoint categories. 104 For example, in the case of numerical data, a histogram shows how data are classified within a series of consecutive nonoverlapping intervals. A contingency table (or cross tabulation) is a special form of histogram representing the interrelation between two or more variables. 105 The categories of a contingency to be included in future privacy-preserving tool kits for social scientists.
• Classification: In machine learning and statistics, classification is the problem of identifying or predicting which of a set of categories a data point belongs in, based on a training set of examples for which category membership is known. 117 Data scientists often utilize data samples that are pre-classified (e.g., by experts or from historical data) to train a classifier, which can later be used for labeling newly acquired data samples. 118 Theoretical work has shown that it is possible to construct differentially private classification algorithms for a large collection of classification tasks. 119

• Synthetic data: Synthetic data are data sets generated from a statistical model estimated using the original data. 120 The records in a synthetic data set have no one-to-one correspondence with the individuals in the original data set, yet the synthetic data can retain many of the statistical properties of the original data. Synthetic data resemble the original sensitive data in format, and, for a large class of analyses, results are similar whether performed on the synthetic or original data. 121 Theoretical work has shown that differentially private synthetic data can be generated for a large variety of tasks. 122 A significant benefit is that, once a differentially private synthetic data set is generated, it can be analyzed any number of times, without any further implications for privacy. 123

This Part discusses some of the practical challenges to using differentially private computations such as those outlined in the previous Part. When making a decision regarding whether to implement differential privacy, one must consider the relevant privacy and utility requirements associated with the specific use case in mind. This Article provides many examples illustrating scenarios in which differentially private computations could be used.
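The count-query mechanism described in the previous Part can be sketched in a few lines of Python. This is an illustrative example with hypothetical data and our own function names, not code from any production library:

```python
import math
import random

def dp_count(records, predicate, epsilon, rng):
    """Return an epsilon-differentially private count of the records that
    satisfy predicate, using the Laplace mechanism. A count query has
    sensitivity 1 (adding or removing one person's record changes the count
    by at most 1), so Laplace noise with scale 1/epsilon suffices."""
    true_count = sum(1 for r in records if predicate(r))
    # Sample Laplace(scale = 1/epsilon) noise via inverse transform sampling.
    u = rng.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    noise = -(1.0 / epsilon) * sign * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

# Hypothetical records: 1 marks an HIV-positive individual.
sample = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]
noisy = dp_count(sample, lambda r: r == 1, epsilon=0.5, rng=random.Random(0))
print(noisy)  # true count is 4; the released answer is randomized around it
```

Because the noise has mean zero, repeated releases would average around the true count, which is exactly why composition must be tracked: each release consumes part of the privacy budget.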
However, if an analysis is being performed at the individual level, for instance in order to identify individual patients who would be good candidates for a clinical trial or to identify instances of bank fraud, differential privacy would not apply, as it disallows learning information specific to an individual.
Additionally, because implementation and use of differential privacy is in its early stages, there is a current lack of easy-to-use, general-purpose, production-ready tools, though progress is being made on this front, as Part VII discusses below. The literature identifies a number of other practical limitations, emphasizing the need for additional differentially private tools tailored to specific applications, such as the data products released by federal statistical agencies; subject matter experts trained in the practice of differential privacy; tools for communicating the features of differential privacy to the general public, users, and other stakeholders; and guidance on setting the privacy loss parameter ε. 126 This Part focuses on a selection of practical considerations, including (A) challenges due to the degradation of privacy that results from composition, (B) challenges related to the accuracy of differentially private statistics, and (C) challenges related to analyzing and sharing personal data while protecting privacy in accordance with applicable regulations and policies for privacy protection. It is important to note that the challenges of producing accurate statistics, while protecting privacy and addressing composition, are not unique to differential privacy. 127 It is a fundamental law of information that privacy risk grows with the repeated use of data, and hence this risk applies to any disclosure limitation technique. 128 Traditional SDL techniques, such as suppression, aggregation, and generalization, often reduce accuracy and are vulnerable to loss in privacy due to composition. 129 The impression that these techniques do not suffer accumulated degradation in privacy is merely due to the fact that these techniques have not been analyzed with the high degree of rigor that differential privacy has been. 130 A rigorous analysis of the effect of composition is important for establishing a robust and realistic understanding of how multiple statistical computations affect privacy. 131

124. For an example of public use synthetic microdata, see

A. The "Privacy Budget"
As Section IV.B explains, one can think of the parameter ε as determining the overall privacy protection provided by a differentially private analysis.
Intuitively, ε determines "how much" of an individual's privacy an analysis may utilize, or, alternatively, by how much the risk to an individual's privacy can increase. A smaller value for ε implies better protection (i.e., less risk to privacy). 132 Conversely, a larger value for ε implies worse protection (i.e., higher potential risk to privacy). 133 In particular, ε = 0 implies perfect privacy (i.e., the analysis does not increase any individual's privacy risk at all). 134 Unfortunately, analyses that satisfy differential privacy with ε = 0 must completely ignore their input data and therefore are useless. 135 Section IV.B also explains that the choice of ε depends on various normative and technical considerations, and best practices are likely to emerge over time as practitioners gain experience from working with real-world implementations of differential privacy. As a starting point, experts have suggested that ε be thought of as a small value ranging from approximately 0.01 to 1. 136 Based on the analysis following Table 1, the Authors of this Article believe that adopting a global value of ε = 0.1, when feasible, provides sufficient protection. In general, setting ε involves making a compromise between privacy protection and accuracy. The consideration of both utility and privacy is challenging in practice and, in some of the early implementations of differential privacy, has led to choosing a higher value for ε. 137 As the accuracy of differentially private analyses improves over time, it is likely that lower values of ε will be chosen.

127. See Dwork
The privacy loss parameter ε can be thought of as a "privacy budget" to be spent by different analyses of individuals' data. If a single analysis is expected to be performed on a given set of data, then one might allow this analysis to exhaust the entire privacy budget ε. However, a more typical scenario is that several analyses are expected to be run on a dataset, and, therefore, one needs to calculate the total utilization of the privacy budget by these analyses. 138 Fortunately, as Section IV.E discusses, a number of composition theorems have been developed for differential privacy. In particular, these theorems state that the composition of two differentially private analyses results in a privacy loss that is bounded by the sum of the privacy losses of each of the analyses. 139 To understand how overall privacy loss is accounted for in this framework, consider the following example.

Example 15
Suppose a data analyst using a differentially private analysis tool is required to do so while maintaining differential privacy with an overall privacy loss parameter ε = 0.1. This requirement for the overall privacy loss parameter may be guided by an interpretation of a regulatory standard, institutional policy, or best practice, among other possibilities. It means that all of the analyst's analyses, taken together, must have a value of ε that is at most 0.1.

136.
See, e.g., Dwork

Consider how this requirement would play out within the following scenarios:

One-query scenario: The data analyst performs a differentially private analysis with a privacy loss parameter ε₁ = 0.1. In this case, the analyst would not be able to perform a second analysis over the data without risking a breach of the policy limiting the overall privacy loss to ε = 0.1.

Multiple-query scenario:
The data analyst first performs a differentially private analysis with ε₁ = 0.01, which falls below the limit of ε = 0.1. This means that the analyst can also apply a second differentially private analysis, say with ε₂ = 0.02. After the second analysis, the overall privacy loss amounts to ε₁ + ε₂ = 0.01 + 0.02 = 0.03, which is still less than ε = 0.1, and therefore allows the analyst to perform additional analyses before exhausting the budget.
The multiple-query scenario can be thought of as if the data analyst has a privacy budget of ε = 0.1 that is consumed incrementally as she performs differentially private analyses, until the budget has been exhausted. 140 Performing additional analyses after the overall budget has been exhausted may result in a privacy parameter that is larger (i.e., worse) than ε. 141 Any data use exceeding the privacy budget would result in a privacy risk that is too significant.
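The budget accounting in Example 15 follows basic composition and can be sketched as a small tracker. This is illustrative only (the class and method names are ours); real deployments use more sophisticated accountants:

```python
class PrivacyBudget:
    """Tracks cumulative privacy loss under basic composition: the total
    loss is bounded by the sum of the per-analysis epsilons."""

    def __init__(self, total_epsilon: float):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon: float) -> None:
        # Refuse any analysis that would push the total loss past the budget.
        if self.spent + epsilon > self.total:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon

    @property
    def remaining(self) -> float:
        return self.total - self.spent

budget = PrivacyBudget(total_epsilon=0.1)
budget.charge(0.01)              # first analysis
budget.charge(0.02)              # second analysis
print(round(budget.spent, 2))     # → 0.03
print(round(budget.remaining, 2)) # → 0.07
```

A request for a further analysis with ε = 0.08 would be rejected, since 0.03 + 0.08 exceeds the overall budget of 0.1.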
Note that, in the sample calculation for the multiple-query example, the accumulated privacy risk was bounded simply by adding the privacy parameters of each analysis. It is in fact possible to obtain better bounds on the accumulation of the privacy loss parameter than suggested by this example. 142

B. Accuracy
This Section discusses the relationship between differential privacy and accuracy. The accuracy of an analysis is a measure of how much its outcome can deviate from the true quantity or model it attempts to estimate. 144 There is no single measure of accuracy, as measures of deviation differ across applications. 145 Multiple factors have an effect on the accuracy of an estimate, including measurement and sampling errors. 146 The random noise introduced in differentially private computations similarly affects accuracy. 147 For most statistical analyses, the inaccuracy coming from sampling error decreases as the number of samples grows, 148 and the same is true for the inaccuracy coming from the random noise in most differentially private analyses. In fact, it is often the case that the inaccuracy due to the random noise vanishes more quickly than the sampling error. 149 This means that, in theory, for very large datasets (with records for very many individuals), differential privacy comes essentially "for free." However, for datasets of the sizes that occur in practice, the amount of noise that is introduced for differentially private analyses can have a noticeable impact on accuracy. For small datasets, for very high levels of privacy protection (i.e., small ε), or for complex analyses, the noise introduced for differential privacy can severely impact utility. 150 In general, almost no utility can be obtained from datasets containing 1/ε or fewer records. 151 As Section VI.A discusses, this is exacerbated by the fact that the privacy budget usually needs to be partitioned among many different queries or analyses, and thus the value of ε used for each query needs to be much smaller. Much of the ongoing research on differential privacy is focused on understanding and improving the tradeoff between privacy and utility (i.e., obtaining the maximum possible utility from data while preserving differential privacy). 152 Procedures for estimating the accuracy of certain types of analyses have been developed. 153 These procedures take as input the number of records, a value for ε, and the ranges of numerical and categorical fields, among other parameters, and produce guaranteed accuracy bounds. 154 Alternatively, a desired accuracy may be given as input instead of ε, and the computation results in a value for ε that would provide this level of accuracy. 155

145. For example, a researcher interested in estimating the average income of a given population may care about the absolute error of this estimate (i.e., the difference between the real average and the estimate), whereas a researcher interested in the median income may care about the difference between the number of respondents whose income is below the estimate and the number of respondents whose income is above the estimate.

146. Measurement error is the difference between the measured value of a quantity and its true value (e.g., an error in measuring an individual's height or weight), and sampling error is error caused by observing a sample rather than the entire population (e.g., the fraction of people with diabetes in the sample is likely to be different from the fraction with diabetes in the population).

151. This rule of thumb follows directly from the definition of differential privacy. See Dwork et al., supra note 62, at 17, 18. Specifically, the parameter ε bounds the distance between the probability distributions resulting from a differentially private computation on two datasets that differ on one entry. Datasets containing only 1/ε entries can differ on at most this number of entries. Summing the differences over just 1/ε entries reveals that, for any two datasets of this size, the differentially private mechanism produces distributions that are at distance at most ε ⋅ (1/ε) = 1.

Figure 4 illustrates the outcome of a differentially private computation of the CDF of income in fictional District Q.
Graph (a) presents the original CDF (without noise), and the subsequent graphs show the result of applying differentially private computations of the CDF with ε values of (b) 0.005, (c) 0.01, and (d) 0.1. Notice that, as smaller values of ε imply better privacy protection, they also imply less accuracy due to noise addition compared to larger values of ε.
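The observation that privacy noise often vanishes faster than sampling error can be illustrated with rough orders of magnitude. For a mean over n records bounded in [0, 1], sampling error shrinks roughly like 1/√n, while the Laplace noise needed for ε-differential privacy has scale 1/(εn). The sketch below uses these back-of-the-envelope formulas (illustrative assumptions, not calculations from the Article):

```python
import math

def sampling_error(n: float) -> float:
    # Rough order of magnitude of the standard error of a mean of
    # values bounded in [0, 1]: about 1 / sqrt(n).
    return 1 / math.sqrt(n)

def dp_noise_error(n: float, epsilon: float) -> float:
    # Scale of Laplace noise for an epsilon-DP mean of values in [0, 1]:
    # the sensitivity of the mean is 1/n, so the noise scale is
    # (1/n) / epsilon = 1 / (epsilon * n).
    return 1 / (epsilon * n)

for n in (100, 10_000, 1_000_000):
    s, d = sampling_error(n), dp_noise_error(n, epsilon=0.1)
    print(n, round(s, 5), round(d, 5))

# For small n the privacy noise dominates (at n = 1/epsilon its scale is as
# large as the whole range of the mean); for large n it becomes negligible
# relative to sampling error, so differential privacy comes almost for free.
```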
Another concept related to accuracy is truthfulness. This term has appeared regularly, if infrequently, in the statistical disclosure limitation literature since the mid-1970s, though it does not have a well-recognized formal definition. 157 Roughly speaking, the SDL literature recognizes a privacy-protecting method as truthful if one can determine unambiguously which types of statements, when semantically correct as applied to the protected data (i.e., data transformed by a privacy technique such as k-anonymity), are also semantically correct when applied to the original sample data. 158 This concept has an intuitive appeal. For data protected via suppressing some of the cells in the database, statements of the form "there are records with characteristics X and Y" are correct in the original data if they are correct in the protected data. For example, one might definitively state, using only the protected data, that "some plumbers earn over $50,000." One cannot make this same statement definitively for data that have been synthetically generated. 159 One must be careful, however, to identify and communicate the types of true statements a protection method supports. For instance, neither suppression nor synthetic data support truthful nonexistence claims at the microdata level. Even if all Wisconsin residents are included in the data, a statement such as "there are no plumbers in the dataset who earn over $50,000" cannot be made definitively by examining the protected data alone if income or occupation values have been suppressed or synthetically generated. Moreover, protection methods may, in general, preserve truth at the individual record level, but not at the aggregate level (or vice versa). 160

159. Synthetic data generation, by definition, uses a statistical model built from one set of data to generate new data. This preserves some of the statistical characteristics of the data, but not the original records themselves. See Fung et al., supra note 157, at 4. As a result, any measurement made on the synthetic dataset is related only probabilistically to measurements made on the original data and is associated with a measure of uncertainty.
160. Local recoding and suppression, global recoding, and privacy criteria such as k-anonymity that use these operations in their implementation cannot produce reliably truthful statements about most aggregate computations. As an example, statements such as "the median income of a plumber in Wisconsin is $45,000" or "the correlation between income and education in Wisconsin is .50" will not be correct. 161

Assessing the truthfulness of modern privacy protection methods requires generalizing notions of truthfulness to apply to statements about the population from which the sample is drawn. Scientific research and the field of statistics are primarily concerned with making correct statements about the population. 162 Statistical estimates inherently involve uncertainty and, as mentioned above, there are many individual sources of error that contribute to the total uncertainty in a calculation. These are traditionally grouped by statisticians into the categories of sampling and nonsampling errors. 163 Correct assertions about a statistical statement accurately communicate the uncertainty of the estimated value. 164 Thus, a statement is statistically truthful of protected data if it accurately communicates the uncertainty, inclusive of sampling and nonsampling errors, of the estimated population value. Methods such as local suppression and global recoding are not always capable of producing statistically truthful statements. 165 Fortunately, privacy-protecting methods such as synthetic data generation, record swapping, and differential privacy are capable of producing statements about statistical estimates that are truthful. 166 For example, all of these methods could produce truthful statements such as "with a confidence level of 99%, the median income of a plumber is $45,000 ± $2,000." 167 When produced by a truthful method, this statement correctly communicates the uncertainty of the statement and would, roughly speaking, 168 turn out to be true of the population in 99 out of 100 independent trials.

161. See generally
Correctly calculating and truthfully reporting the uncertainty induced by suppression would require revealing the full details of the suppression algorithm and its parameterization. Revealing these details allows information to be inferred about individuals. Traditional SDL techniques require that the mechanism itself be kept secret in order to protect against this type of attack.

162. In general terms, the goal of statistics is to make reliable inferences about a population or distribution based on characteristics calculated from a sample of data drawn from that population. For a mathematically detailed definition, see Allan Birnbaum, On the Foundations of Statistical Inference, 57 J. AM. STAT. ASS'N 269, 273 (1962). In similarly general terms, the goal of science is to yield reliable generalized knowledge about the world, such as knowledge about populations, general predictions, or natural laws. A widely recognized example capturing this distinction is the regulatory definition of scientific research found in the Federal Policy for the Protection of Human Subjects. See

165. For instance, Willenborg and de Waal note specifically that suppression of local values (i.e., cells, when used in the context of microdata) induces missing-data bias. Generalization takes many forms, and these forms are associated with different sources of statistical bias. For example, range generalization (e.g., topcoding) involves collapsing the observed distribution of values, which statisticians recognize as yielding truncation bias, whereas global recoding to suppress an entire measure may induce
Generally, differentially private methods introduce uncertainty. However, it is a property of differential privacy that the method itself does not need to be kept secret. This means the amount of noise added to the computation can be taken into account in the measure of accuracy and, therefore, lead to correct statements about the population of interest. This can be contrasted with many traditional SDL techniques, which only report sampling error and keep the information needed to estimate the "privacy error" secret. Any privacy-preserving method, if misused or misinterpreted, can produce incorrect statements. Additionally, the truthfulness of some methods, such as suppression and synthetic data generation, is inherently limited to particular levels of computations (e.g., to existence statements on microdata, or statements about selected aggregate statistical properties, respectively). Differential privacy may be used truthfully for a broader set of computations, so long as the uncertainty of each calculation is estimated and reported.
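As an illustration of how the publicly known noise distribution can be folded into reported uncertainty, consider the following sketch for the mean of values bounded in [0, 1]. The formulas and parameter choices are ours and approximate (a normal-approximation interval combining sampling variance with the variance of Laplace noise), not a method prescribed by the Article:

```python
import math

def dp_mean_confidence_interval(noisy_mean, n, epsilon, sample_std, z=2.576):
    """Approximate 99% confidence interval for a population mean, accounting
    for both sampling error and Laplace privacy noise.

    Assumes values bounded in [0, 1], so the sensitivity of the mean is 1/n
    and the Laplace noise scale is b = 1/(epsilon * n); the variance of
    Laplace(b) noise is 2 * b**2. z = 2.576 is the normal quantile for 99%.
    """
    sampling_var = sample_std**2 / n
    b = 1 / (epsilon * n)
    noise_var = 2 * b**2
    half_width = z * math.sqrt(sampling_var + noise_var)
    return noisy_mean - half_width, noisy_mean + half_width

# Hypothetical released mean with its honest, privacy-aware interval.
low, high = dp_mean_confidence_interval(
    noisy_mean=0.45, n=10_000, epsilon=0.1, sample_std=0.2)
print(round(low, 4), round(high, 4))
```

Because the noise scale is public, the reported interval can be widened to cover it, which is precisely the contrast with SDL techniques whose "privacy error" must stay secret.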

C. Complying with Legal Requirements for Privacy Protection
Statistical agencies, companies, researchers, and others who collect, process, analyze, store, or share data about individuals must take steps to protect the privacy of the data subjects in accordance with various laws, institutional policies, contracts, ethical codes, and best practices. 169 In some settings, tools that satisfy differential privacy can be used to analyze and share data, while both complying with legal obligations and providing strong mathematical guarantees of privacy protection for the individuals in the data. 170 Privacy regulations and related guidance do not directly answer the question of whether the use of differentially private tools is sufficient to satisfy existing regulatory requirements for protecting privacy when sharing statistics based on personal data. 171 This issue is complex because privacy laws are often context dependent, and there are significant gaps between differential privacy and the concepts underlying regulatory approaches to privacy protection. 172 Different regulatory requirements are applicable depending on the jurisdiction, sector, actors, and types of information involved. 173 As a result, datasets held by an organization may be subject to different requirements. In some cases, similar or even identical datasets may be subject to different requirements when held by different organizations. 174 In addition, many legal standards for privacy protection are, to a large extent, open to interpretation and therefore require a case-specific legal analysis by an attorney. 175 Other challenges arise as a result of differences between the concepts appearing in privacy regulations and those underlying differential privacy. For instance, many laws focus on the presence of "personally identifiable information" or the ability to "identify" an individual's personal information in a release of records. 176 Such concepts do not have precise definitions, 177 and their meaning in the context of differential privacy applications is especially unclear.

167. From this statement, we can derive other conclusions, such as that, with 99% confidence, at least half of all plumbers earn over $43,000 annually. And if existence statements such as these are the main concern, one could use other differentially private algorithms to support making similar statements with near certainty, not merely 99% confidence.

168.
178 In addition, many privacy regulations emphasize particular requirements for protecting privacy when disclosing individual-level data, such as removing personally identifiable information, which are arguably difficult to interpret and apply when releasing aggregate statistics. 179 While in some cases it may be clear whether a regulatory standard has been met by the use of differential privacy, in other cases-particularly 169.
See supra Section I.A (discussing legal and ethical frameworks for data privacy along the boundaries of a standard-there may be considerable uncertainty. 180 Regulatory requirements relevant to issues of privacy in computation rely on an understanding of a range of different concepts, such as personally identifiable information, de-identification, linkage, inference, risk, consent, opt out, and purpose and access restrictions. The following discussion explains how the definition of differential privacy can be interpreted to address each of these concepts while accommodating differences in how these concepts are defined across various legal and institutional contexts. Personally identifiable information (PII) and de-identification are central concepts in information privacy law. 181 Regulatory protections typically extend only to personally identifiable information; information not considered personally identifiable is not protected. 182 Although definitions of personally identifiable information vary, they are generally understood to refer to the presence of pieces of information that are linkable to the identity of an individual or to an individual's personal attributes. 183 PII is also related to the concept of de-identification, which refers to a collection of techniques devised for transforming identifiable information into non-identifiable information while also preserving some utility of the data. In principle, it is intended that de-identification, if performed successfully, can be used as a tool for removing PII, or transforming PII into non-PII. 184 When differential privacy is used, it can be understood as ensuring that using an individual's data will not reveal essentially any personally identifiable information specific to her. 185 Here, the use of the term "specific" refers to information that is unique to the individual 180.
See id. at 710.
For a survey of various definitions of personally identifiable information, see id. at 1829-36. The Government Accountability Office also provides a general definition of personally identifiable information. See U.S. GOV'T ACCOUNTABILITY OFFICE, GAO-08-536, ALTERNATIVES EXIST FOR ENHANCING PROTECTION OF PERSONALLY IDENTIFIABLE INFORMATION (2008) ("For purposes of this report, the terms personal information and personally identifiable information are used interchangeably to refer to any information about an individual maintained by an agency, including (1) any information that can be used to distinguish or trace an individual's identity, such as name, Social Security number, date and place of birth, mother's maiden name, or biometric records; and (2) any other information that is linked or linkable to an individual, such as medical, educational, financial, and employment information."), https://www. e-identified records and information," which permits the release of education records "after the removal of all personally identifiable information provided that the educational agency or institution or other party has made a reasonable determination that a student's identity is not personally identifiable, whether through single or multiple releases, and taking into account other reasonably available information").
185. Note that the reference to "using an individual's data" in this statement means the inclusion of an individual's data in an analysis.
[Vol. 21:1:209 and cannot be inferred unless the individual's information is used in the analysis.
Linkage is a mode of privacy loss recognized, implicitly or explicitly, by a number of privacy regulations. 186 As illustrated in Example 1, linkage typically refers to the matching of information in a database to a specific individual, often by leveraging information from external sources. 187 Linkage is also closely related to the concept of identifying an individual in a data release, as identifying an individual is often accomplished via a successful linkage. 188 Linkage has a concrete meaning when data are published as a collection of individual-level records, often referred to as microdata. 189 However, what is considered a successful linkage when a publication is made in other formats, such as statistical models or synthetic data, has not been defined and is open to interpretation.
Despite this ambiguity, it can be argued that differential privacy addresses record linkage in the following sense. Differentially private statistics provably hide the influence of every individual, and even small groups of individuals. 190 Although linkage has not been precisely defined, linkage attacks seem to inherently result in revealing that specific individuals participated in an analysis. Because differential privacy protects against learning whether or not an individual participated in an analysis, it can therefore be understood to protect against linkage. Furthermore, differential privacy provides a robust guarantee of privacy protection that is independent of the auxiliary information available to an attacker. 191 Indeed, under differential privacy, even an attacker utilizing arbitrary auxiliary information cannot learn much more about an individual in a database than she could if that individual's information were not in the database at all. 192 Inference is another mode of privacy loss that is implicitly or explicitly referenced by some privacy regulations and related guidance. For example, some laws protect information that enables the identity of an individual to be "reasonably inferred," 193 and others protect information that enables one to determine an attribute about an individual with "reasonable certainty." 194 When discussing inference as a mode of privacy loss, it is important to distinguish between two types-inferences about individuals and inferences about large groups of individuals. Although privacy regulations and related guidance generally do not draw a clear distinction between these two types of inference, 195 the distinction is key to understanding which privacy safeguards would be appropriate in a given setting.
Differential privacy can be understood as essentially protecting an individual from inferences about attributes that are specific to her, that is, information that is unique to the individual and cannot be inferred unless the individual's information is used in the analysis. Interventions other than differential privacy may be necessary in contexts in which inferences about large groups of individuals, such as uses of data that result in discriminatory outcomes by race or sex, are a concern. 196

Risk is another concept that appears in various ways throughout regulatory standards for privacy protection and related guidance. For example, some regulatory standards include a threshold level of risk that an individual's information may be identified in a data release. 197 Similarly, some regulations also acknowledge, implicitly or explicitly, that any disclosure of information carries privacy risks, and therefore the goal is to minimize, rather than eliminate, such risks. 198 Differential privacy can readily be understood in terms of risk. 199 Specifically, differential privacy enables a formal quantification of risk. 200 It guarantees that the risk to an individual is essentially the same with or without her participation in the dataset, 201 and this is likely true for most notions of risk adopted by regulatory standards or institutional policies. In this sense, differential privacy can be interpreted as essentially guaranteeing that the risk to an individual is minimal or very small. Moreover, the privacy loss parameter can be tuned according to different requirements for minimizing risk. 202

Consent and opt out are concepts underlying common provisions set forth in information privacy laws. 203 Consent and opt-out provisions enable individuals to choose to allow, or not to allow, their information to be used by or redisclosed to a third party.
204 Such provisions are premised on the assumption that providing individuals with an opportunity to opt in or out gives them control over the use of their personal information and effectively protects their privacy. 205 However, this assumption warrants a closer look. Providing consent or opt-out mechanisms as a means of providing individuals with greater control over their information is an incomplete solution as long as individuals are not fully informed about the consequences of uses or disclosures of their information. 206 In addition, allowing individuals the choice to opt in or out can create new privacy concerns. For example, an individual's decision to opt out may-often unintentionally-be reflected in a data release or analysis and invite scrutiny into whether the choice to opt out was motivated by the need to hide compromising information. 207 The differential privacy guarantee can arguably be interpreted as providing stronger privacy protection than a consent or opt-out mechanism. This is because differential privacy can be understood as automatically providing all individuals in the data with essentially the same protection that opting out is intended to provide. 208 Moreover, differential privacy provides all individuals with this privacy guarantee. 209 Therefore, differential privacy can be understood to prevent the possibility that individuals who choose to opt out would, by doing so, inadvertently reveal a sensitive attribute about themselves or attract attention as individuals who are potentially hiding sensitive facts about themselves.
Purpose and access provisions often appear in privacy regulations as restrictions on the use or disclosure of personal information to specific parties or for specific purposes.
Legal requirements reflecting purpose and access restrictions can be divided into two categories. The first category includes restrictions, such as those governing confidentiality for statistical agencies, 210 prohibiting the use of identifiable information except for statistical purposes. The second category broadly encompasses other types of purpose and access provisions, such as those permitting the use of identifiable information for legitimate educational purposes. 211 Restrictions limiting use to statistical purposes, including statistical purposes involving population-level rather than individual-level analyses or statistical computations, are in many cases consistent with the use of differential privacy. This is because, as Part IV explains, differential privacy protects information specific to an individual while allowing population-level analyses to be performed. Therefore, tools that satisfy differential privacy may be understood to restrict uses to only those that are for statistical purposes, as that term is defined, for example, in the Confidential Information Protection and Statistical Efficiency Act of 2002 (CIPSEA). 212 However, other use and access restrictions, such as provisions limiting use to legitimate educational purposes, are orthogonal to differential privacy and require alternative privacy safeguards. 213

The foregoing interpretations of the differential privacy guarantee can be used to demonstrate that, in many cases, a differentially private mechanism would prevent the types of disclosures of personal information that privacy regulations have been designed to address. Moreover, in many cases, differentially private tools provide privacy protection that is more robust than that provided by techniques commonly used to satisfy regulatory requirements for privacy protection. However, further research to develop methods for proving that differential privacy satisfies legal requirements and setting the privacy loss parameter based on such requirements is needed. 214 In practice, data providers should consult with legal counsel when considering whether differential privacy tools, potentially in combination with other tools for protecting privacy and security, are appropriate within their specific institutional settings. 215

VII. TOOLS FOR DIFFERENTIALLY PRIVATE ANALYSIS
At the time of this writing, differential privacy is transitioning from a purely theoretical mathematical concept to one that underlies software tools for practical use by analysts of privacy-sensitive data. The first real-world implementations of differential privacy have been deployed by companies such as Google, 216 Apple, 217 and Uber, 218 and government agencies such as the US Census Bureau. 219 Researchers in industry and academia are currently building and testing additional tools for differentially private statistical analysis. This Part briefly reviews some of these newly emerging tools, with a particular focus on the tools that inspired the drafting of this primer.

A. Government and Commercial Applications of Differential Privacy
Since 2006, the US Census Bureau has published an online interface enabling the exploration of the commuting patterns of workers across the United States, based on confidential data collected by the Bureau through the Longitudinal Employer-Household Dynamics program. 220 Through this interface, members of the public can interact with synthetic datasets generated from confidential survey records. 221 Beginning in 2008, the computations used to synthesize the data accessed through the interface have provided formal privacy guarantees that satisfy a variant of differential privacy. 222 In 2017, the Census Bureau announced that it was prototyping a system that would protect the full set of publication products from the 2020 decennial Census using differential privacy. 223

Google, Apple, and Uber have also experimented with differentially private implementations. 224 For instance, Google developed the RAPPOR system, which applies differentially private computations in order to gather aggregate statistics from consumers who use the Chrome web browser. 225 This tool allows analysts at Google to monitor the wide-scale effects of malicious software on the browser settings of Chrome users, while providing strong privacy guarantees to individuals. 226

The current differentially private implementations by the Census Bureau and Uber rely on a curator model, the model serving as the focus of most of this Article, in which a database administrator has access to and uses private data to generate differentially private data summaries. 227 In contrast, the implementations in Google's RAPPOR and in Apple's macOS 10.12 and iOS 10 rely on a local model of privacy, which does not require individuals to share their private data with a trusted third party; rather, individuals answer questions about their own data in a differentially private manner. 228 Each of these differentially private answers is not useful on its own, but many of them can be aggregated to perform useful statistical analysis.
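The local model just described can be illustrated with classic randomized response, the textbook local-model technique. The sketch below is illustrative only: it is not Google's or Apple's actual algorithm, and the function names and the choice of ε = 1 are assumptions for the example.

```python
import math
import random

def randomized_response(true_bit: bool, epsilon: float) -> bool:
    """Report the true bit with probability e^eps / (e^eps + 1), and the
    flipped bit otherwise; this satisfies epsilon-DP in the local model."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return true_bit if random.random() < p_truth else not true_bit

def estimate_fraction(reports, epsilon: float) -> float:
    """Debias the aggregated noisy reports. If f is the true fraction of 1s,
    then E[observed] = p*f + (1-p)*(1-f), so solve for f."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    observed = sum(reports) / len(reports)
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)

random.seed(0)
true_bits = [random.random() < 0.3 for _ in range(100_000)]   # 30% have the attribute
reports = [randomized_response(b, epsilon=1.0) for b in true_bits]
print(round(estimate_fraction(reports, epsilon=1.0), 2))      # typically close to 0.3
```

Each individual report is heavily randomized and reveals little on its own, yet debiasing the aggregate recovers the population fraction, which is exactly the aggregation idea described above.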

B. Research and Development Towards Differentially Private Tools
Several experimental systems from academia and industry enable data analysts to construct privacy-preserving analyses without requiring an understanding of the subtle technicalities of differential privacy. Systems such as Privacy Integrated Queries (PINQ), 229 Airavat, 230 GUPT, 231 Fuzz, 232 DFuzz, 233 and Ektelo 234 aim to provide user-friendly tools for writing programs that are guaranteed to be differentially private, through the use of differentially private building blocks 235 or general frameworks such as "partition-and-aggregate" or "subsample-and-aggregate" 236 for transforming non-private programs into differentially private ones. 237 These systems rely on a common approach: they keep the data safely stored and allow users to access them only via a programming interface which guarantees differential privacy. 238 They also afford generality, enabling one to design many types of differentially private programs that are suitable for a wide range of purposes. 239 However, it can be challenging for a lay user with limited expertise in programming to make effective use of these systems. 240

The Authors of this Article are collaborators on the Harvard Privacy Tools Project, which develops tools to help social scientists collect, analyze, and share data while providing privacy protection for individual research subjects. 241 To this end, the project seeks to incorporate definitions and algorithmic tools from differential privacy into a private data-sharing interface (PSI) which facilitates data exploration and analysis using differential privacy. 242 PSI is intended to be integrated into research data repositories, such as Dataverse. 243 It will provide researchers depositing datasets into a repository with guidance on how to partition a limited privacy budget among the many statistics to be produced or analyses to be run.
244 It will also provide researchers seeking to explore a dataset available on the repository with guidance on how to interpret the noisy results produced by a differentially private algorithm. 245 Through the differentially private access enabled by PSI, researchers will be able to perform rough preliminary analyses of privacy-sensitive datasets that currently cannot be safely shared. 246 Such access will help researchers determine whether it is worth the effort to apply for full access to the raw data. 247

C. Tools for Specific Data Releases or Specific Algorithms
There have been a number of successful applications of differential privacy with respect to specific types of data, including data from genome-wide association studies, 248 location history data, 249 data on commuter patterns, 250 mobility data, 251 client-side software data, 252 and data on usage patterns for phone technology. 253 For differentially private releases of each of these types of data, experts in differential privacy have taken care to choose algorithms and allocate privacy budgets with the aim of maximizing utility with respect to the particular dataset. 254 Therefore, each of these tools is specific to the type of data it is designed to handle, and such tools cannot be applied in contexts in which the collection of data sources and the structure of the datasets are too heterogeneous to be compatible with such optimizations. 255 Thus, there remains a need for more general-purpose tools such as those described in the previous Section.

Beyond these examples, a wide literature on the design of differentially private algorithms describes approaches to performing specific data analysis tasks, including work comparing and optimizing such algorithms across a wide range of datasets. For example, the recent development of DPBench, 256 a framework for standardized evaluation of the accuracy of privacy algorithms, provides a way to compare different algorithms and ways of optimizing them. 257

VIII. SUMMARY
As the previous Part illustrates, differential privacy is in initial stages of implementation in limited academic, commercial, and government settings, and research is ongoing to develop tools that can be deployed in new applications. As differential privacy is increasingly applied in practice, interest in the topic is growing among legal scholars, policymakers, and other practitioners. This Article provides an introduction to the key features of differential privacy, using illustrations that are intuitive and accessible to these audiences.
Differential privacy provides a formal, quantifiable measure of privacy. It is established by a rich and rapidly evolving theory that enables one to reason with mathematical rigor about privacy risk. Quantification of privacy is achieved by the privacy loss parameter ε, which controls, simultaneously for every individual contributing to the analysis, the deviation between one's opt-out scenario and the actual execution of the differentially private analysis.
This deviation can grow as an individual participates in additional analyses, but the overall deviation can be bounded as a function of ε and the number of analyses performed. This amenability to composition, or the ability to provide provable privacy guarantees with respect to the cumulative risk from successive data releases, is a unique feature of differential privacy. 258 While it is not the only framework that quantifies a notion of risk for a single analysis, it is currently the only framework with quantifiable guarantees on the risk resulting from a composition of several analyses.
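The bookkeeping behind composition can be sketched as a toy budget tracker. This is illustrative code, not a tool from the Article; it uses the basic composition property that the privacy losses of successive ε-differentially private analyses add up, and all names are assumptions.

```python
class PrivacyBudget:
    """Toy tracker for an overall privacy budget, illustrating basic
    (sequential) composition: the total privacy loss of several analyses,
    each eps_i-differentially private, is at most the sum of the eps_i."""

    def __init__(self, total_epsilon: float):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon: float) -> None:
        """Record an analysis; refuse it if it would exceed the budget."""
        if self.spent + epsilon > self.total:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon

budget = PrivacyBudget(total_epsilon=1.0)
budget.charge(0.25)   # e.g., a differentially private mean
budget.charge(0.25)   # a differentially private histogram
budget.charge(0.5)    # a differentially private regression
print(budget.spent)   # 1.0 -- any further analysis would now be refused
```

Real systems use refinements (so-called advanced composition theorems) that give tighter bounds than the simple sum, but the sum is always a valid bound on the cumulative privacy loss.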

The parameter ε can be interpreted as bounding the excess risk to an individual resulting from her data being used in an analysis (compared to her risk when her data are not being used). Indirectly, the parameter ε also controls the accuracy to which a differentially private computation can be performed. For example, researchers making privacy-sensitive data available through a differentially private tool may, through the interface of the tool, choose to produce a variety of summary statistics with formal privacy guarantees while maintaining a desired level of privacy, quantified by an accumulated privacy loss parameter.
Systems that adhere to strong formal definitions like differential privacy provide protection that is robust to a wide range of potential privacy attacks, including attacks that are unknown at the time of deployment. 259 An analyst designing a differentially private data release need not anticipate particular types of privacy attacks, such as the likelihood that one could link particular fields with other data sources that may be available. Differential privacy automatically provides a robust guarantee of privacy protection that is independent of the methods and resources used by a potential attacker.
Differentially private tools also have the benefit of transparency, as it is not necessary to maintain secrecy around a differentially private computation or its parameters.
This feature distinguishes differentially private tools from traditional de-identification techniques, which often require concealment of the extent to which the data have been transformed, thereby leaving data users with uncertainty regarding the accuracy of analyses on the data.
Differentially private tools can be used to provide broad, public access to data or data summaries in a privacy-preserving way. Differential privacy can help enable researchers, policymakers, and businesses to analyze and share sensitive data that cannot otherwise be shared due to privacy concerns. Further, it ensures that they can do so with a guarantee of privacy protection that substantially increases their ability to protect the individuals in the data. This, in turn, can further the progress of scientific discovery and innovation.

APPENDIX A. ADVANCED TOPICS
This Article concludes with some advanced topics for readers interested in exploring differential privacy further. This Appendix explores how differentially private analyses are constructed, explains how the noise introduced by differential privacy compares to statistical sampling error, and discusses the protection differential privacy can provide for small groups of individuals.

259. Here, the term "privacy attacks" refers to attempts to learn private information specific to individuals from a data release.

A.1. How Are Differentially Private Analyses Constructed?
As indicated in Part IV, the construction of differentially private analyses relies on the careful introduction of uncertainty in the form of random noise. This Section provides a simple example illustrating how a carefully calibrated amount of random noise can be added to the outcome of an analysis in order to provide privacy protection.

Example 16
Consider computing an estimate of the number of HIV-positive individuals in a sample, where the sample contains n = 10,000 individuals, of whom k = 38 are HIV-positive. In a differentially private version of the computation, random noise Y is introduced into the count so as to hide the contribution of a single individual. That is, the result of the computation would be k′ = k + Y = 38 + Y instead of k = 38.
The magnitude of the random noise Y affects both the level of privacy protection provided and the accuracy of the count. 260 Generally, greater uncertainty requires a larger noise magnitude and therefore results in worse accuracy, and vice versa. In designing a release mechanism like the one described in Example 16, the magnitude of Y should depend on the privacy loss parameter ε: a smaller value of ε is associated with a larger noise magnitude. When choosing the noise distribution, one possibility is to sample the random noise from a normal distribution with zero mean and standard deviation 1/ε. 261 Because the choice of the value of ε is inversely related to the magnitude of the noise introduced by the analysis, the mechanism is designed to provide a quantifiable tradeoff between privacy and utility. 262 Consider the following example.

260. See supra note 84 and accompanying text. The term "magnitude" refers to the magnitude of the random noise distribution as measured in parameters like the standard deviation or variance. This is not necessarily referring to the magnitude of the actual random noise sampled from the noise distribution. Generally, greater uncertainty requires a larger noise magnitude.

261. More accurately, the noise is sampled from the Laplace distribution with a mean of 0 and standard deviation of √2/ε. The exact shape of the noise distribution is important for proving that outputting k + Y preserves differential privacy, but can be ignored for the current discussion.

Example 17
A researcher uses the estimate k′, as defined in the previous example, to approximate the fraction of HIV-positive people in the population. The computation would result in the estimate p′ = k′/n = (38 + Y)/10,000.
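The computations in Examples 16 and 17 can be sketched as follows. This is an illustrative sketch, not code from the Article: it uses the Laplace noise described in note 261 (scale 1/ε, standard deviation √2/ε), and ε = 0.1 is an assumed value.

```python
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) noise as the difference of two independent
    exponentials with rate 1/scale; the std of the result is sqrt(2)*scale."""
    rate = 1.0 / scale
    return random.expovariate(rate) - random.expovariate(rate)

def dp_count(k: int, epsilon: float) -> float:
    """Example 16: release k' = k + Y. A count changes by at most 1 when one
    person is added or removed, so Laplace noise with scale 1/epsilon suffices."""
    return k + laplace_noise(1.0 / epsilon)

def dp_fraction(k: int, n: int, epsilon: float) -> float:
    """Example 17: estimate the population fraction as p' = k'/n."""
    return dp_count(k, epsilon) / n

random.seed(0)
print(dp_count(38, epsilon=0.1))            # 38 plus noise with std of about 14
print(dp_fraction(38, 10_000, epsilon=0.1))
```

Averaged over many hypothetical releases, the noisy count is centered on the true value of 38; any single release deviates by an amount governed by ε.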

A.2 Two Sources of Error: Sampling Error and Added Noise
This Section continues with the example from the previous Section. Note that there are two sources of error in estimating p: sampling error and added noise. The first source, sampling error, would cause k to differ from the expected p ⋅ n by an amount of roughly |k − p ⋅ n| ≈ √(p ⋅ n). 263 For instance, consider how the researcher from the example above would calculate the sampling error associated with her estimate.

262. Note that this means that, when the sample size is small, the accuracy can be significantly reduced. For instance, if the sample size is similar in magnitude to 1/ε, the amount of noise that is added can even be larger than the sample size. Differential privacy works best when the sample size is large, specifically when it is significantly larger than 1/ε.

Example 18
The researcher reasons that k is expected to differ from p ⋅ 10,000 by roughly √(p ⋅ 10,000) ≈ √38 ≈ 6.
Hence, the estimate 38/10,000 = 0.38% is expected to differ from the true p by approximately 6/10,000 = 0.06%, even prior to the addition of the noise by the differentially private mechanism.
The second source of error is the addition of random noise Y in order to achieve differential privacy. This noise would cause k′ and k to differ by an amount of roughly |k′ − k| ≈ 1/ε. 264 The researcher in the example would calculate this error as follows: with a privacy loss parameter of ε = 0.1, the noise error is roughly 1/ε = 10, or 10/10,000 = 0.1% of the sample.
Taking both sources of noise into account, the researcher calculates that the difference between the noisy estimate p′ and the true p is at most roughly 0.06% + 0.1% = 0.16%.

264.
The expectation of k′ is exactly k because the Laplace distribution has zero mean. The standard deviation of the difference k′ − k is exactly the standard deviation of Y, which was chosen to be 1/ε.
The two sources of noise are statistically independent, 265 so the researcher can use the fact that their variances add to produce a slightly better bound: |p′ − p| ≈ √((0.06%)² + (0.1%)²) ≈ 0.12%.
Generalizing from this example, we find that the standard deviation of the estimate p′ (hence the expected difference between p′ and p) is of magnitude roughly √(p/n) + 1/(εn). Notice that for a large enough sample size n, the noise added for privacy protection (1/(εn)) will be much smaller than the sampling error (√(p/n)), due to the difference between having n and √n in the denominator, and thus privacy comes essentially "for free" in this regime.
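The back-of-the-envelope comparison of the two error sources can be reproduced numerically. This is an illustrative sketch (the function name is an assumption), using ε = 0.1, the value consistent with the 0.1% noise error used in the example above.

```python
import math

def error_breakdown(p: float, n: int, epsilon: float):
    """Rough error of the DP estimate p' = (k + Y)/n from the examples above:
    sampling error ~ sqrt(p*n)/n = sqrt(p/n), privacy noise ~ (1/epsilon)/n,
    combined by adding variances since the two sources are independent."""
    sampling = math.sqrt(p * n) / n
    noise = (1.0 / epsilon) / n
    combined = math.sqrt(sampling ** 2 + noise ** 2)
    return sampling, noise, combined

# The numbers from Example 18, with the assumed epsilon = 0.1:
s, z, c = error_breakdown(p=0.0038, n=10_000, epsilon=0.1)
print(f"sampling ~ {s:.2%}, noise ~ {z:.2%}, combined ~ {c:.2%}")
# sampling ~ 0.06%, noise ~ 0.10%, combined ~ 0.12%

# With a much larger sample, the privacy noise shrinks faster than the
# sampling error (1/n versus 1/sqrt(n)) and becomes negligible:
s2, z2, _ = error_breakdown(p=0.0038, n=1_000_000, epsilon=0.1)
print(f"n = 1,000,000: sampling ~ {s2:.3%}, noise ~ {z2:.3%}")
```

This is the "privacy for free" regime described above: at n = 1,000,000, the privacy noise contributes roughly a sixth of the sampling error.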
Note also that the literature on differentially private algorithms has identified many other noise introduction techniques that can result in better accuracy guarantees than the simple technique used in the examples above. 266 Such techniques are especially important for more complex analyses, for which the simple noise addition technique discussed in this Section is often far from optimal in terms of accuracy.

A.3 Group Privacy
By holding individuals' opt-out scenarios as the relevant baseline, the definition of differential privacy directly addresses disclosures of information localized to a single individual. However, in many cases, information may be shared between multiple individuals. For example, relatives may share an address or certain genetic attributes.
How does differential privacy protect information of this nature? Consider the opt-out scenario for a group of individuals. This is the scenario in which the personal information of all individuals in the group is omitted from the input to the analysis. For instance, John and Gertrude's opt-out scenario (a group of size k = 2) is the scenario in which both John's and Gertrude's information is omitted from the input to the analysis. Recall that the parameter ε controls how much the real-world scenario can differ from any individual's opt-out scenario. It can be shown that the difference between the differentially private real-world scenario and the opt-out scenario of a group of k individuals grows to at most

k ⋅ ε. 267

This means that the privacy guarantee degrades moderately as the size of the group increases. Effectively, a meaningful privacy guarantee can be provided to groups of individuals of a size of up to about k ≈ 1/ε.

265. Events are said to be statistically independent when the probability of occurrence of each event does not depend on whether the other event occurs.
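The group-privacy bound can be made concrete for the simple noisy-count mechanism of Appendix A.1. The sketch below is illustrative (the function names and the counting-query setting are assumptions): for a count released with Laplace(1/ε) noise, removing a group of k people shifts the true count by at most k, so the likelihood of any fixed output changes by a factor of at most e^(k·ε).

```python
import math

def laplace_density(x: float, scale: float) -> float:
    """Density of the Laplace(0, scale) distribution at x."""
    return math.exp(-abs(x) / scale) / (2.0 * scale)

def worst_case_ratio(group_size: int, epsilon: float, output: float = 0.0) -> float:
    """Likelihood ratio of observing `output` from a Laplace(1/epsilon)-noised
    count, before versus after a group of group_size people is removed
    (shifting the true count by group_size, the worst case)."""
    scale = 1.0 / epsilon
    return laplace_density(output, scale) / laplace_density(output + group_size, scale)

epsilon = 0.1
for k in (1, 2, 10):
    print(k, round(worst_case_ratio(k, epsilon), 3), round(math.exp(k * epsilon), 3))
# each ratio equals the group-privacy bound e^(k*epsilon):
# 1 1.105 1.105 / 2 1.221 1.221 / 10 2.718 2.718
```

With ε = 0.1, a group of 10 (k = 1/ε) already sees its likelihood ratio grow to e ≈ 2.7, which matches the observation above that meaningful protection extends to groups of size up to about 1/ε.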