IndoLib: A Natural Language Processing Toolkit for Low-Resource South Asian Languages

Date

2022-10-11

Citation

Timalsina, Nitya. 2022. IndoLib: A Natural Language Processing Toolkit for Low-Resource South Asian Languages. Master's thesis, Harvard University Division of Continuing Education.

Abstract

Of the world's 7,151 living languages, 665 (9.299%) are spoken by nearly 2 billion people across Southern Asia. Of these, 251 (37.74%) are endangered, and the vast majority remain underrepresented in language systems. This thesis presents IndoLib, a new toolkit designed to support natural language processing (NLP) research in South Asian languages, here covering the Indo-Aryan, Dravidian, and Sino-Tibetan language families. IndoLib includes four primary components: (i) monolingual and multilingual datasets to expand language modeling and language detection for thirty-one Indic languages, (ii) fine-tuned multilingual models for named entity recognition (NER) and summarization, (iii) a bilingual dataset of Sanskrit-English and English-Sanskrit parallel sentences, and (iv) a fine-tuned machine translation model for translation in both directions between Sanskrit and English. In evaluation, the fine-tuned multilingual NER model and the bilingual translation model outperform current benchmark models. The thesis is intended to aid researchers who want to apply transfer learning to develop or optimize transformer-based models for South Asian languages.

Keywords

Low-Resource Language Models, Natural Language Processing (NLP), Software Engineering, South Asian Languages, Transfer Learning, Transformers, Computer science, Artificial intelligence, Information technology

Terms of Use

This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service
