dc.contributor.author  Zhao, Jessica
dc.date.accessioned  2020-08-28T10:31:04Z
dc.date.created  2020-05
dc.date.issued  2020-06-17
dc.date.submitted  2020
dc.identifier.citation  Zhao, Jessica. 2020. Identifying Interpretable Word Vector Subspaces With Principal Component Analysis. Bachelor's thesis, Harvard College.
dc.identifier.uri  https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37364694
dc.description.abstract  Over the past decade, machine learning has become an integral part of our lives by enabling both day-to-day applications (e.g., product recommendations) and critical ones (e.g., health care treatment recommendations). In particular, the intersection of machine learning and natural language processing (NLP) has been a very active area of research and has played a key role in enabling impactful applications such as question-answering systems and personal assistants (e.g., Alexa, Siri). Several NLP tasks rely on learning high-dimensional word vector representations that capture the essence of the underlying textual data and can conveniently be used for downstream prediction tasks. However, such representations may also capture undesirable biases inherent in the text, which in turn can cause catastrophic effects such as discrimination based on protected attributes. It is therefore important to identify those subspaces of the vector representations that correspond to protected attributes so that they can be appropriately neutralized via debiasing techniques, thus preventing the biases from percolating into critical downstream tasks. While existing research on this topic has leveraged Principal Component Analysis (PCA) to identify certain specific subspaces, such as those corresponding to gender, it fails to provide a principled methodology that can easily be generalized to other kinds of subspaces. This thesis develops a novel framework for reasoning about existing PCA-based methods, proposes multiple theoretical and experimental criteria for choosing hyperparameters, and finally presents a novel algorithm that applies PCA more effectively to find a subspace representing any given topic of interest. Experimental evaluation on widely used word vector representations and comparison with prior work demonstrate the efficacy and generalizability of our approach. [A minimal illustrative sketch of this kind of PCA-based subspace identification and neutralization appears after the record fields below.]
dc.description.sponsorship  Computer Science
dc.format.mimetype  application/pdf
dc.language.iso  en
dash.license  LAA
dc.title  Identifying Interpretable Word Vector Subspaces With Principal Component Analysis
dc.type  Thesis or Dissertation
dash.depositing.author  Zhao, Jessica
dc.date.available  2020-08-28T10:31:04Z
thesis.degree.date  2020
thesis.degree.grantor  Harvard College
thesis.degree.level  Undergraduate
thesis.degree.name  AB
dc.type.material  text
thesis.degree.department  Computer Science
dash.identifier.vireo
dc.identifier.orcid  0000-0003-4000-157X
dash.author.email  jessijzhao@gmail.com
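
For context, the following is a minimal illustrative sketch (not taken from the thesis) of the kind of PCA-based subspace identification and neutralization described in the abstract, in the spirit of prior work on gender subspaces. The word pairs, the load_vectors helper, and the embedding file name are assumptions made purely for illustration.

    import numpy as np

    def pca_subspace(vectors, pairs, k=1):
        """Top-k principal directions of the pair-difference vectors."""
        # Stack the difference vectors of the concept-defining word pairs.
        diffs = np.array([vectors[a] - vectors[b] for a, b in pairs])
        diffs -= diffs.mean(axis=0)                # center before PCA
        # Right singular vectors of the centered differences are the principal axes.
        _, _, vt = np.linalg.svd(diffs, full_matrices=False)
        return vt[:k]                              # shape (k, dim), orthonormal rows

    def neutralize(v, subspace):
        """Remove the component of v that lies in the identified subspace."""
        return v - subspace.T @ (subspace @ v)

    # Gender-definitional pairs whose differences should span the subspace of interest.
    pairs = [("he", "she"), ("man", "woman"), ("king", "queen"), ("father", "mother")]
    # vectors = load_vectors("glove.6B.300d.txt")  # hypothetical embedding loader
    # gender_subspace = pca_subspace(vectors, pairs, k=1)
    # vectors["doctor"] = neutralize(vectors["doctor"], gender_subspace)

The thesis generalizes this idea beyond a single hand-picked direction; the sketch above only illustrates the basic PCA-plus-projection mechanism the abstract refers to.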

