dc.contributor.advisor: Rubin, Donald B.
dc.contributor.author: Loong, Bronwyn
dc.date.accessioned: 2012-09-07T20:57:32Z
dc.date.issued: 2012-09-07
dc.date.submitted: 2012
dc.identifier.citation: Loong, Bronwyn. 2012. Topics and Applications in Synthetic Data. Doctoral dissertation, Harvard University. [en_US]
dc.identifier.other: http://dissertations.umi.com/gsas.harvard:10323 [en]
dc.identifier.uri: http://nrs.harvard.edu/urn-3:HUL.InstRepos:9527319
dc.description.abstract: Releasing synthetic data in place of observed values is a method of statistical disclosure control for the public dissemination of survey data collected by national statistical agencies. The overall goal is to limit the risk of disclosure of survey respondents' identities or sensitive attributes, but simultaneously retain enough detail in the synthetic data to preserve the inferential conclusions drawn on the target population, in potential future legitimate statistical analyses. This thesis presents three new research contributions in the analysis and application of synthetic data. Firstly, to understand differences in types of input between the imputer, typically an agency, and the analyst, we present a definition of congeniality in the context of multiple imputation for synthetic data. Our definition is motivated by common examples of uncongeniality, specifically ignorance of the original survey design in analysis of fully synthetic data, and situations when the imputation model and analysis procedure condition upon different sets of records. We conclude that our definition provides a framework to assist the imputer to identify the source of a discrepancy between observed and synthetic data analytic results. Motivated by our definition, we derive an alternative approach to synthetic data inference, to recover the observed data set sampling distribution of sufficient statistics given the synthetic data. Secondly, we address the problem of negative method-of-moments variance estimates given fully synthetic data, which may be produced with the current inferential methods. We apply the adjustment for density maximization (ADM) method to variance estimation, and demonstrate using ADM as an alternative approach to produce positive variance estimates. Thirdly, we present a new application of synthetic data techniques to confidentialize survey data from a large-scale healthcare study. To date, application of synthetic data techniques to healthcare survey data is rare. We discuss identification of variables for synthesis, specification of imputation models, and working measures of disclosure risk assessment. Following comparison of observed and synthetic data analytic results based on published studies, we conclude that use of synthetic data for our healthcare survey is best suited for exploratory data analytic purposes. [en_US]
dc.description.sponsorship: Statistics [en_US]
dc.language.iso: en_US [en_US]
dash.license: LAA
dc.subject: data confidentiality [en_US]
dc.subject: data utility [en_US]
dc.subject: disclosure risk [en_US]
dc.subject: multiple imputation [en_US]
dc.subject: synthetic [en_US]
dc.subject: uncongeniality [en_US]
dc.subject: statistics [en_US]
dc.title: Topics and Applications in Synthetic Data [en_US]
dc.type: Thesis or Dissertation [en_US]
dc.date.available: 2012-09-07T20:57:32Z
thesis.degree.date: 2012 [en_US]
thesis.degree.discipline: Statistics [en_US]
thesis.degree.grantor: Harvard University [en_US]
thesis.degree.level: doctoral [en_US]
thesis.degree.name: Ph.D. [en_US]
dc.contributor.committeeMember: Morris, Carl [en_US]
dc.contributor.committeeMember: Zaslavsky, Alan [en_US]

