Publication: Discriminative Sequence Models Extract Personally Identifiable Information from Public Gene Expression Datasets
Date
2022-05-25
Authors
Sadhuka, Shuvom
Citation
Sadhuka, Shuvom. 2022. Discriminative Sequence Models Extract Personally Identifiable Information from Public Gene Expression Datasets. Bachelor's thesis, Harvard College.
Abstract
The growing scale of functional genomics datasets is enabling researchers to better understand the genetic determinants of gene expression, for example through expression quantitative trait loci (eQTL) studies.
With an improving understanding of the link between genotypes and gene expression comes a greater concern that one's gene expression profile could leak more personally identifiable information about their genotypes than is currently known.
Prior studies have shown the feasibility of attacks linking a gene expression profile in one dataset to a genotype profile in another based on eQTL associations, yet the full extent to which such an inference could be made is incompletely understood.
In this thesis, I explore the extent to which personally identifiable genotype information can be extracted from gene expression datasets. I present a novel machine learning algorithm for genotype prediction based on discriminative sequence models. The model newly incorporates genotype correlations across nearby positions in the genome, using tools borrowed from population genetics that exploit haplotype-based genetic recombination. I also introduce a new metric, which I term empirical information gain (EIG), to quantify the extraction of sensitive genotype information from auxiliary datasets.
This work provides an enhanced understanding of genetic privacy risks in sharing gene expression datasets.
In addition, I discuss genomic privacy more broadly and survey the state-of-the-art understanding of protecting and exploiting patient genomes.
Keywords
Gene Expression, Genomics, Privacy, Statistics, Computer science
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service