dc.contributor.author  Zhao, Jessica
dc.date.accessioned  2020-08-28T10:31:04Z
dc.date.created  2020-05
dc.date.issued  2020-06-17
dc.date.submitted  2020
dc.identifier.citation  Zhao, Jessica. 2020. Identifying Interpretable Word Vector Subspaces With Principal Component Analysis. Bachelor's thesis, Harvard College.
dc.identifier.uri  https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37364694
dc.description.abstract  Over the past decade, machine learning has become an integral part of our lives by enabling both day-to-day applications (e.g., product recommendations) and critical ones (e.g., health care treatment recommendations). In particular, the intersection of machine learning and natural language processing (NLP) has been a very active area of research and has played a key role in enabling impactful applications such as question-answering systems and personal assistants (e.g., Alexa, Siri). Several NLP tasks rely on learning high-dimensional word vector representations that capture the essence of the underlying textual data and can conveniently be used for downstream prediction tasks. However, such representations may also capture undesirable biases inherent in the text, which in turn can cause catastrophic effects such as discrimination based on protected attributes. It is therefore important to identify those subspaces of the vector representations that correspond to protected attributes so that they can be appropriately neutralized via debiasing techniques, thus preventing the biases from percolating into critical downstream tasks. While existing research on this topic has leveraged Principal Component Analysis (PCA) to identify certain specific subspaces, such as those corresponding to gender, it fails to provide a principled methodology that can easily be generalized to other kinds of subspaces. This thesis develops a novel framework for reasoning about existing PCA-based methods, proposes multiple theoretical and experimental criteria for choosing hyperparameters, and finally presents a novel algorithm that applies PCA more effectively to find a subspace representing any given topic of interest. Experimental evaluation on widely used word vector representations and comparison with prior work demonstrate the efficacy and generalizability of our approach. [A minimal illustrative sketch of this kind of PCA-based subspace identification and neutralization appears after the record fields below.]
dc.description.sponsorship  Computer Science
dc.format.mimetype  application/pdf
dc.language.iso  en
dash.license  LAA
dc.title  Identifying Interpretable Word Vector Subspaces With Principal Component Analysis
dc.type  Thesis or Dissertation
dash.depositing.author  Zhao, Jessica
dc.date.available  2020-08-28T10:31:04Z
thesis.degree.date  2020
thesis.degree.grantor  Harvard College
thesis.degree.level  Undergraduate
thesis.degree.name  AB
dc.type.material  text
thesis.degree.department  Computer Science
dash.identifier.vireo
dc.identifier.orcid  0000-0003-4000-157X
dash.author.email  jessijzhao@gmail.com
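
For context, the following is a minimal illustrative sketch (not taken from the thesis) of the kind of PCA-based subspace identification and neutralization described in the abstract, in the spirit of prior work on gender subspaces. The word pairs, the load_vectors helper, and the embedding file name are assumptions made purely for illustration.

    import numpy as np

    def pca_subspace(vectors, pairs, k=1):
        """Top-k principal directions of the pair-difference vectors."""
        # Stack the difference vectors of the concept-defining word pairs.
        diffs = np.array([vectors[a] - vectors[b] for a, b in pairs])
        diffs -= diffs.mean(axis=0)                # center before PCA
        # Right singular vectors of the centered differences are the principal axes.
        _, _, vt = np.linalg.svd(diffs, full_matrices=False)
        return vt[:k]                              # shape (k, dim), orthonormal rows

    def neutralize(v, subspace):
        """Remove the component of v that lies in the identified subspace."""
        return v - subspace.T @ (subspace @ v)

    # Gender-definitional pairs whose differences should span the subspace of interest.
    pairs = [("he", "she"), ("man", "woman"), ("king", "queen"), ("father", "mother")]
    # vectors = load_vectors("glove.6B.300d.txt")  # hypothetical embedding loader
    # gender_subspace = pca_subspace(vectors, pairs, k=1)
    # vectors["doctor"] = neutralize(vectors["doctor"], gender_subspace)

The thesis generalizes this idea beyond a single hand-picked direction; the sketch above only illustrates the basic PCA-plus-projection mechanism the abstract refers to.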

