Publication:
Discriminative Sequence Models Extract Personally Identifiable Information from Public Gene Expression Datasets

Date

2022-05-25

Citation

Sadhuka, Shuvom. 2022. Discriminative Sequence Models Extract Personally Identifiable Information from Public Gene Expression Datasets. Bachelor's thesis, Harvard College.

Abstract

The growing scale of functional genomics datasets is enabling researchers to better understand the genetic determinants of gene expression, for example through expression quantitative trait loci (eQTL) studies. As understanding of the link between genotypes and gene expression improves, so does the concern that an individual's gene expression profile could leak more personally identifiable information about their genotypes than is currently known. Prior studies have shown the feasibility of attacks that link a gene expression profile in one dataset to a genotype profile in another based on eQTL associations, yet the full extent to which such inferences can be made is incompletely understood. In this thesis I explore the extent to which personally identifiable genotype information can be extracted from gene expression datasets. I present a novel machine learning algorithm for genotype prediction based on discriminative sequence models. The model newly incorporates genotype correlations across nearby positions in the genome, using tools borrowed from population genetics that exploit haplotype-based genetic recombination. I also introduce a new metric, which I term empirical information gain (EIG), to quantify the extraction of sensitive genotype information from auxiliary datasets. This work provides an enhanced understanding of the genetic privacy risks of sharing gene expression datasets. In addition, I discuss genomic privacy more broadly and survey the state-of-the-art understanding of protecting and exploiting patient genomes.

Keywords

Gene Expression, Genomics, Privacy, Statistics, Computer Science

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
