Publication:
Analyzing Easy Data Augmentation Techniques for Text Classification

Date

2021-06-04

Published Version

Citation

Wong, Carolyn. 2021. Analyzing Easy Data Augmentation Techniques for Text Classification. Bachelor's thesis, Harvard College.

Abstract

In natural language processing, text classification is the task of assigning a category to a given text example. Text classification has a variety of applications, ranging from automated processing of customer reviews to spam detection. Current state-of-the-art approaches to text classification use neural language models. These models are resource-intensive and require large amounts of labeled training data. However, training data may not always be available in large quantities, especially for low-resource languages, and labeled data is often laborious to obtain. Consequently, it is desirable to understand the factors that contribute to the performance of text classification models. I address several questions about which factors contribute to the high performance achieved by current state-of-the-art neural models. To do so, I analyze traditional and neural methods on a diverse range of text classification tasks. I study properties such as model assumptions and word vector representations to determine the effect of each on text classification performance. Using the best-performing models identified in this analysis, I evaluate the existing data augmentation techniques for text classification proposed by Wei and Zou (2019), which perform simple text editing operations to generate new training examples. However, these techniques require external datasets or knowledge about the semantic properties of words. To address this limitation, I propose and assess a novel length-based method that does not require external linguistic knowledge. This method replaces words with other words of similar length, as word length closely reflects the average information content and conceptual complexity of words in English (Piantadosi, Tily, and Gibson, 2011; Lewis and Frank, 2016). I demonstrate that this length-based technique yields consistent gains on several of the evaluated text classification tasks.
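
The length-based idea described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the abstract does not say how replacement candidates are chosen, so the function names (`build_length_index`, `length_based_replace`), the replacement probability `p`, the `tolerance` on length, and the choice to draw candidates from the training corpus itself are all assumptions made here for illustration.

```python
import random
from collections import defaultdict


def build_length_index(corpus):
    """Group the corpus vocabulary by word length (hypothetical helper)."""
    index = defaultdict(set)
    for sentence in corpus:
        for word in sentence.split():
            index[len(word)].add(word)
    # Sort each bucket so that sampling is reproducible for a seeded RNG.
    return {length: sorted(words) for length, words in index.items()}


def length_based_replace(sentence, index, p=0.1, tolerance=0, rng=None):
    """Return a copy of `sentence` with some words swapped for others of similar length.

    Assumptions (not taken from the thesis): each word is replaced
    independently with probability `p`, and candidates are corpus words whose
    length differs from the original by at most `tolerance` characters.
    """
    if rng is None:
        rng = random.Random()
    augmented = []
    for word in sentence.split():
        if rng.random() < p:
            candidates = [
                w
                for length in range(len(word) - tolerance, len(word) + tolerance + 1)
                for w in index.get(length, [])
                if w != word
            ]
            # Keep the original word when no other word of similar length exists.
            if candidates:
                word = rng.choice(candidates)
        augmented.append(word)
    return " ".join(augmented)


if __name__ == "__main__":
    corpus = ["the cat sat on a mat", "big dog ran to my hut"]
    index = build_length_index(corpus)
    print(length_based_replace("the cat sat on a mat", index, p=0.5, rng=random.Random(0)))
```

Because the candidates come only from the training text, a sketch like this needs no thesaurus or pretrained embeddings, which matches the abstract's stated motivation of avoiding external linguistic knowledge. The augmented sentence would keep the label of the original example.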

Keywords

Computer science

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
