Publication: The Algorithmic Foundations of Private Computational Social Science
Open/View Files
Date
Authors
Published Version
Published Version
Journal Title
Journal ISSN
Volume Title
Publisher
Citation
Abstract
Social scientists, political scientists, economists, and healthcare researchers crucially rely on statistical methods to further the study of individuals, society, and human behavior via inferential analysis. Unfortunately, the naive application and release of resulting statistics could reveal sensitive information about the individuals in the dataset. As a result, we propose and analyze rigorous privacy-preserving methods while showing the limits of such methods either mathematically and/or empirically. The main definition of privacy we will rely on is Differential Privacy (DP). In this thesis, we make algorithmic and analytic contributions to fundamental computational social science methods (such as linear regression and rank aggregation):
-
DP Statistical Prediction: We prove finite-sample convergence rates for a range of DP ordinary least squares methods. Then, we propose methods, based on nonparametric estimators, that perform well---in terms of minimizing prediction loss---when the dataset size is small. We show, experimentally, that the optimal method interpolates between a nonparametric estimator and a parametric one. In particular, for at least one of the methods proposed, the noise due to privacy will be less than the sampling error for common settings of parameters.
-
DP Uncertainty Quantification: We prove that a differentially private F-statistic converges to the asymptotic distribution of the non-private F-statistic, the chi-squared distribution. In addition, we obtain the optimal 1/sqrt(n) asymptotic convergence rates for resulting DP sufficient statistics. Using the DP F-statistic and a parametric bootstrap, we construct DP hypothesis tests for testing for a linear relationship and testing for the presence of mixtures. Through experiments, we also show when the DP F-statistic outperforms alternative methods for DP hypothesis testing.
-
DP Rank Aggregation: In social choice theory, (Kemeny) rank aggregation is a well-studied problem where the goal is to combine rankings from multiple voters into a single ranking on the same set of items. Since rankings can reveal preferences of voters (which a voter might like to keep private), it is important to aggregate preferences in such a way to preserve privacy. We present DP algorithms for rank aggregation in the pure and approximate settings along with distribution-independent utility upper and lower bounds. In addition to bounds in the central model, we also present utility bounds for the local model of differential privacy.