Publication:
Identifying Interpretable Word Vector Subspaces With Principal Component Analysis

Date

2020-06-17

The Harvard community has made this article openly available.

Citation

Zhao, Jessica. 2020. Identifying Interpretable Word Vector Subspaces With Principal Component Analysis. Bachelor's thesis, Harvard College.

Abstract

Over the past decade, machine learning has become an integral part of our lives by enabling several day-to-day (e.g. product recommendations) as well as critical (e.g. health care treatment recommendations) applications. In particular, the intersection of machine learning and natural language processing (NLP) has been a very active area of research, which has played a key role in enabling impactful applications such as question-answering systems and personal assistants (e.g. Alexa, Siri). Several NLP tasks rely on learning high-dimensional word vector representations that capture the essence of the underlying textual data and can conveniently be used for downstream prediction tasks. However, such representations may also capture undesirable biases inherent in the text, which in turn can cause harmful effects such as discrimination based on protected attributes. Therefore, it is important to identify those subspaces of the vector representations that correspond to protected attributes so that they can be appropriately neutralized via debiasing techniques, thus preventing the biases from percolating into critical downstream tasks. While existing research on this topic has leveraged Principal Component Analysis (PCA) to identify certain specific subspaces such as those corresponding to gender, it fails to provide a principled methodology that can easily be generalized to other kinds of subspaces. This thesis develops a novel framework for reasoning about existing PCA-based methods, proposes multiple theoretical and experimental criteria for choosing hyperparameters, and finally presents a novel algorithm that applies PCA more effectively to find a subspace representing any given topic of interest. Experimental evaluation on widely used word vector representations and comparison with prior work demonstrate the efficacy and generalizability of our approach.
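To make the general idea concrete, here is a minimal sketch (not the thesis's exact algorithm) of the common PCA-based recipe the abstract refers to: identify a one-dimensional "gender" subspace from definitional word pairs by running PCA on pair-difference vectors, then neutralize a word vector by removing its projection onto that subspace. All vectors below are synthetic toy data with a hypothetical bias direction injected, so the recovered subspace can be checked against it.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50

# Hypothetical ground-truth bias direction injected into toy embeddings.
true_dir = rng.normal(size=dim)
true_dir /= np.linalg.norm(true_dir)

def make_pair():
    """Return a toy vector pair that shares a base meaning but
    differs along the injected bias direction (plus small noise)."""
    base = rng.normal(size=dim)
    noise = 0.05 * rng.normal(size=(2, dim))
    return base + 0.5 * true_dir + noise[0], base - 0.5 * true_dir + noise[1]

pairs = [make_pair() for _ in range(10)]

# PCA on pair differences: center each pair, stack both halves,
# and take the top principal component as the candidate direction.
diffs = []
for a, b in pairs:
    center = (a + b) / 2
    diffs.extend([a - center, b - center])
D = np.array(diffs)

# The top right singular vector of D is the first principal component.
_, _, vt = np.linalg.svd(D, full_matrices=False)
gender_dir = vt[0]  # unit norm by construction

def neutralize(v, direction):
    """Remove the component of v lying in the identified subspace."""
    return v - np.dot(v, direction) * direction

w = rng.normal(size=dim) + 0.3 * true_dir
w_neutral = neutralize(w, gender_dir)

# The recovered direction should align with the injected one (up to
# sign), and the neutralized vector should be orthogonal to it.
print(abs(np.dot(gender_dir, true_dir)))   # close to 1
print(abs(np.dot(w_neutral, gender_dir)))  # close to 0
```

The same projection-removal step generalizes to a k-dimensional subspace by keeping the top k principal components and subtracting the projection onto each; choosing k is one of the hyperparameter questions the thesis addresses.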

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
