Publication: Analyzing Easy Data Augmentation Techniques for Text Classification
Date
2021-06-04
Citation
Wong, Carolyn. 2021. Analyzing Easy Data Augmentation Techniques for Text Classification. Bachelor's thesis, Harvard College.
Abstract
In natural language processing, text classification is the task of assigning a category to a given text example. Text classification has a variety of applications, ranging from automated processing of customer reviews to spam detection. Current state-of-the-art approaches for text classification tasks use neural language models. These models are resource-intensive, requiring large amounts of labeled training data. However, training data may not always be available in large quantities, especially for low-resource languages, and labeled data is often laborious to obtain. Consequently, it is desirable to understand the factors contributing to text classification models' performance. I address several questions about which factors contribute to the high performance achieved by the current state-of-the-art neural models. To do so, I analyze traditional and neural methods on a diverse range of text classification tasks. I study properties such as model assumptions and word vector representations to determine the effect of each on text classification performance. Using the best-performing models identified in this analysis, I evaluate the data augmentation techniques for text classification proposed by Wei and Zou (2019), which perform simple text editing operations to generate new training examples. However, such existing data augmentation techniques require external datasets or knowledge about the semantic properties of words. To address this limitation, I propose and assess a novel length-based method that does not require external linguistic knowledge. This method replaces words with other words of similar length, as word length closely reflects the average information content and conceptual complexity of words in English (Piantadosi, Tily, and Gibson, 2011; Lewis and Frank, 2016). I demonstrate that this length-based technique yields consistent gains on several of the evaluated text classification tasks.
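The length-based replacement idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how replacement candidates are selected, so everything below is an assumption made for illustration: the function names `build_length_index` and `length_based_replace`, the replacement probability `p`, the `tolerance` parameter, whitespace tokenization, and the choice to draw candidates from the training corpus itself rather than from any external resource. It should not be read as the thesis's actual implementation.

```python
import random
from collections import defaultdict


def build_length_index(corpus):
    """Group the distinct (lowercased) words of a corpus by character length.

    Assumption: the candidate vocabulary comes from the training sentences
    themselves, so no external linguistic resource is needed.
    """
    index = defaultdict(set)
    for sentence in corpus:
        for word in sentence.split():
            index[len(word)].add(word.lower())
    # Sort each bucket so sampling is reproducible under a fixed seed.
    return {length: sorted(words) for length, words in index.items()}


def length_based_replace(sentence, index, p=0.1, tolerance=0, rng=None):
    """Replace each word, with probability p, by a different word whose
    length is within `tolerance` characters of the original word's length.

    Words with no eligible candidate are left unchanged.
    """
    rng = rng or random.Random()
    augmented = []
    for word in sentence.split():
        if rng.random() < p:
            low, high = len(word) - tolerance, len(word) + tolerance
            candidates = [
                w
                for length in range(low, high + 1)
                for w in index.get(length, [])
                if w != word.lower()
            ]
            if candidates:
                word = rng.choice(candidates)
        augmented.append(word)
    return " ".join(augmented)


if __name__ == "__main__":
    train = ["the cat sat on a mat", "a dog ran to the hut"]
    idx = build_length_index(train)
    print(length_based_replace("the cat sat on a mat", idx, p=0.5,
                               rng=random.Random(0)))
```

Each call produces one augmented copy of a training sentence that keeps its original label. In this sketch, setting `tolerance=0` restricts replacements to words of exactly the same length, while a larger value admits words of merely similar length.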
Keywords
Computer science
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service