Upper and Lower Bounds for Privacy and Adaptivity in Algorithmic Data Analysis
MetadataShow full item record
CitationSteinke, Thomas Alexander. 2016. Upper and Lower Bounds for Privacy and Adaptivity in Algorithmic Data Analysis. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.
AbstractThe increasing collection and use of sensitive personal data raises important privacy concerns. Another concern arising from the use of data in empirical sciences is the danger of producing results that are not statistically valid due to failing to account for the influence of previous exploration of the data. This thesis studies formalisations of these issues and the relationship between them.
(i) We give an alternative definition of differential privacy, which is a formal privacy standard for protecting sensitive data. Our definition strikes a balance between the mathematical elegance of so-called pure differential privacy and the power of approximate differential privacy.
(ii) We prove tighter upper and lower bounds for differential privacy. Namely, we bound the minimum size of a dataset that permits accurately answering simple one-way marginal queries subject to differential privacy. Our bounds are tight up to constant or log log factors.
(iii) We show fundamental limits for privacy by exhibiting a privacy attack that, given the aggregate statistics of a suitable dataset and the data of an individual, can determine whether or not the individual's data is part of the dataset. Our attack is particularly applicable to genetic data and is robust to errors in the aggregate statistics. This attack is very similar to our lower bounds for differential privacy and demonstrates that differential privacy achieves the fundamental limit of privacy in this setting.
(iv) We simplify, tighten, and extend the connection between differential privacy and generalisation established by Dwork et al. (STOC 2015). In particular, when data is analysed adaptively -- that is, multiple analyses are performed and each may depend on the outcome of previous analyses -- differentially private algorithms produce statistically valid results in the setting where the data is a sample from some larger population. Our results establish the tight connection between differential privacy and generalisation.
(v) We prove lower bounds in the adaptive data analysis setting that nearly match the upper bounds given by differential privacy. Namely, we show that, given $n$ samples from an unknown distribution, we cannot answer more than O(n^2) adaptive statistical queries about that distribution while guaranteeing statistical accuracy.
(vi) We show that adaptivity also poses a problem in differential privacy. We show that, for certain classes of queries, it is much harder to answer queries in a differentially private manner if the queries are posed adaptively than if the queries are provided all at once.
All of our results rely on understanding the information-theoretic relationship between the input and output of a randomised algorithm. The criterion for protecting privacy or ensuring generalisation is that changing a single input point of the data analysis algorithm does not affect the output too much in a probabilistic sense.
Citable link to this pagehttp://nrs.harvard.edu/urn-3:HUL.InstRepos:33840662
- FAS Theses and Dissertations