Publication: Identifying Interpretable Word Vector Subspaces With Principal Component Analysis
Date
2020-06-17
Authors
Zhao, Jessica
Published Version
The Harvard community has made this article openly available.
Citation
Zhao, Jessica. 2020. Identifying Interpretable Word Vector Subspaces With Principal Component Analysis. Bachelor's thesis, Harvard College.
Abstract
Over the past decade, machine learning has become an integral part of our lives by enabling several day-to-day (e.g. product recommendations) as well as critical (e.g. health care treatment recommendations) applications. In particular, the intersection of machine learning and natural language processing (NLP) has been a very active area of research, one that has played a key role in enabling impactful applications such as question answering systems and personal assistants (e.g. Alexa, Siri). Several NLP tasks rely on learning high-dimensional word vector representations that capture the essence of the underlying textual data and can conveniently be used for downstream prediction tasks. However, such representations may also capture undesirable biases inherent in the text, which in turn can cause catastrophic effects such as discrimination based on protected attributes. Therefore, it is important to identify those subspaces of the vector representations that correspond to protected attributes so that they can be appropriately neutralized via debiasing techniques, thus preventing the biases from percolating into critical downstream tasks. While existing research on this topic has leveraged Principal Component Analysis (PCA) to identify certain specific subspaces such as those corresponding to gender, it fails to provide a principled methodology that can easily be generalized to other kinds of subspaces. This thesis develops a novel framework for reasoning about existing PCA-based methods, proposes multiple theoretical and experimental criteria for choosing hyperparameters, and finally presents a novel algorithm that applies PCA more effectively to find a subspace representing any given topic of interest. Experimental evaluation on widely used word vector representations and comparison with prior work demonstrate the efficacy and generalizability of our approach.
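The existing PCA-based approach the abstract alludes to is typically illustrated by the gender-subspace method of prior work: take a set of definitional word pairs (e.g. he/she, man/woman), center each pair, and run PCA on the resulting difference vectors; the top principal components span the candidate subspace. The following is a minimal NumPy sketch of that prior technique under stated assumptions, not the thesis's own algorithm; the function name, toy word pairs, and embedding dictionary are all illustrative.

```python
import numpy as np

def identify_subspace(word_pairs, embeddings, k=1):
    """Sketch of the standard PCA-based subspace method: for each
    definitional pair, subtract the pair's midpoint from both vectors,
    then take the top-k principal directions of the stacked differences."""
    diffs = []
    for a, b in word_pairs:
        va, vb = embeddings[a], embeddings[b]
        center = (va + vb) / 2.0          # per-pair centering
        diffs.append(va - center)
        diffs.append(vb - center)
    D = np.stack(diffs)
    D = D - D.mean(axis=0)                # center before PCA
    # Right singular vectors of D are the principal directions.
    _, _, vt = np.linalg.svd(D, full_matrices=False)
    return vt[:k]                         # shape (k, embedding_dim)

# Toy example: embeddings whose pairs differ only along the first axis,
# so the recovered direction should align with that axis.
rng = np.random.default_rng(0)
dim = 5
g = np.zeros(dim); g[0] = 1.0
emb = {}
for a, b in [("he", "she"), ("man", "woman"), ("king", "queen")]:
    base = rng.normal(size=dim)
    emb[a] = base + g
    emb[b] = base - g
subspace = identify_subspace([("he", "she"), ("man", "woman"), ("king", "queen")], emb, k=1)
```

With real pretrained vectors (e.g. GloVe or word2vec) the top component explains only part of the variance, which is one motivation the abstract gives for more principled hyperparameter choices such as the number of components k.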
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service