Analyzing Easy Data Augmentation Techniques for Text Classification
Access Status: Full text of the requested work is not available in DASH at this time ("dark deposit").
Citation: Wong, Carolyn. 2021. Analyzing Easy Data Augmentation Techniques for Text Classification. Bachelor's thesis, Harvard College.
Abstract: In natural language processing, text classification is the task of assigning a category to a given text example. Its applications range from automated processing of customer reviews to spam detection. Current state-of-the-art approaches to text classification use neural language models, which are resource-intensive and require large amounts of labeled training data. However, training data may not always be available in large quantities, especially for low-resource languages, and labeled data is often laborious to obtain. Consequently, it is desirable to understand the factors contributing to text classification models' performance. I address several questions about which factors contribute to the high performance achieved by current state-of-the-art neural models. To do so, I analyze traditional and neural methods on a diverse range of text classification tasks, studying properties such as model assumptions and word vector representations to determine the effect of each on classification performance. Using the best-performing models identified by this analysis, I evaluate the existing data augmentation techniques for text classification proposed by Wei and Zou (2019), which perform simple text editing operations to generate new training examples. However, such techniques require external datasets or knowledge about the semantic properties of words. To address this limitation, I propose and assess a novel length-based method that does not require external linguistic knowledge. This method replaces words with other words of similar length, as word length closely reflects the average information content and conceptual complexity of words in English (Piantadosi, Tily, and Gibson, 2011; Lewis and Frank, 2016). I demonstrate that this length-based technique adds consistent gains for several of the evaluated text classification tasks.
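The length-based replacement described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation, since the full text is unavailable. The function names, the `tolerance` and `n_replacements` parameters, and the choice to draw candidates from a caller-supplied vocabulary are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_length_index(vocabulary):
    """Group vocabulary words by character length (hypothetical helper)."""
    index = defaultdict(list)
    for word in vocabulary:
        index[len(word)].append(word)
    return index


def length_based_replace(tokens, length_index, n_replacements=1,
                         tolerance=0, rng=None):
    """Return a copy of `tokens` with up to `n_replacements` words swapped
    for different words whose length is within `tolerance` characters.

    Assumption: candidates come only from `length_index`; no synonym
    dictionary or other external linguistic resource is consulted.
    """
    rng = rng or random.Random()
    augmented = list(tokens)

    # Visit positions in random order so replacements are spread out.
    positions = list(range(len(augmented)))
    rng.shuffle(positions)

    replaced = 0
    for pos in positions:
        if replaced >= n_replacements:
            break
        original = augmented[pos]
        low = len(original) - tolerance
        high = len(original) + tolerance
        candidates = [
            word
            for length in range(low, high + 1)
            for word in length_index.get(length, [])
            if word != original
        ]
        if not candidates:
            continue  # no similar-length word available; leave token as-is
        augmented[pos] = rng.choice(candidates)
        replaced += 1
    return augmented


# Example usage with a toy vocabulary (illustrative data only).
vocab = ["good", "fine", "movie", "story", "the", "was"]
index = build_length_index(vocab)
sentence = "the movie was good".split()
print(length_based_replace(sentence, index, n_replacements=2,
                           rng=random.Random(0)))
```

In this sketch the replacement vocabulary could simply be the set of words seen in the training data, which would keep the procedure free of external resources, in the spirit of the abstract's claim. How the thesis actually selects candidate words is not stated on this page.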
Citable link to this page: https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37368592