Abstract
Corpora are crucial for improving the accuracy of many Natural Language Processing (NLP) tasks. Tamil is a Dravidian language classified as low-resource. It is also morphologically rich, with an agglutinative and complex inflectional structure. A balanced corpus for Tamil, TamilCorp, comprising 1.69 billion tokens across 17 genres, was constructed. An exploratory data analysis (EDA) was performed on the type-token data from all 17 genres, using correlation heatmaps, scatter plots and other visualizations, to understand type-token trends in Tamil. A uniformly random sample of about 1 million tokens (with 615 KB of text per genre) was generated. This randomly sampled balanced Tamil corpus served as the experimental dataset and was compared with similar-sized samples from well-balanced English corpora such as COCA, COHA and Brown. Lexical diversity was measured with two metrics: TTR (type-token ratio) and MTLD (measure of textual lexical diversity). The analysis revealed that Tamil has much higher lexical diversity than English, owing to its morphologically rich and agglutinative structure. The observed differences were statistically significant (t-tests, with Cohen’s d for effect size). This experiment emphasizes the importance of understanding variation in lexical diversity across languages and contributes to the fields of NLP and corpus linguistics. The entire code for the EDA and t-test experiments is available as an open-source repository on GitHub, along with the word embeddings and a Tamil dictionary of around 8 million words obtained from TamilCorp, to support other researchers in the Tamil NLP community.
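For readers unfamiliar with the measures named in the abstract, the following is a minimal Python sketch of TTR, MTLD and Cohen's d. It is not the authors' released code. The function names are hypothetical, the MTLD threshold of 0.72 is the conventional default from the MTLD literature rather than a value confirmed by this record, and the handling of texts with no repeated tokens (returning infinity) is an assumption made for illustration.

```python
import math
from statistics import mean, variance


def ttr(tokens):
    """Type-token ratio: number of distinct tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def _mtld_one_pass(tokens, threshold):
    """Count MTLD factors in a single left-to-right pass."""
    factors = 0.0
    types = set()
    count = 0
    for tok in tokens:
        types.add(tok)
        count += 1
        # A factor is complete once the running TTR drops below the threshold.
        if len(types) / count < threshold:
            factors += 1.0
            types = set()
            count = 0
    if count > 0:
        # Credit the leftover segment as a partial factor.
        factors += (1.0 - len(types) / count) / (1.0 - threshold)
    # Assumption: a text with no repetition at all yields infinity.
    return len(tokens) / factors if factors > 0 else float("inf")


def mtld(tokens, threshold=0.72):
    """Average of the forward and backward MTLD passes."""
    forward = _mtld_one_pass(tokens, threshold)
    backward = _mtld_one_pass(list(reversed(tokens)), threshold)
    return (forward + backward) / 2.0


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


if __name__ == "__main__":
    # Hypothetical toy input; real use would tokenize each genre sample first.
    sample = "the cat saw the dog and the dog saw the cat".split()
    print("TTR:", ttr(sample))
    print("MTLD:", mtld(sample))
```

Because TTR falls as a text grows longer, it is only comparable across samples of similar size, which is consistent with the similar-sized samples described in the abstract. MTLD is designed to be less sensitive to text length.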
Original language | English |
---|---|
Title of host publication | 2025 International Conference on Data Science, Agents & Artificial Intelligence (ICDSAAI) |
Publisher | IEEE |
Number of pages | 6 |
ISBN (Electronic) | 9798331537555 |
ISBN (Print) | 9798331537562 |
Publication status | Published - 29 May 2025 |
Event | 3rd International Conference on Data Science, Agents, and Artificial Intelligence - Chennai Institute of Technology, Chennai, India; Duration: 28 Mar 2025 → 29 Mar 2025; https://www.citchennai.edu.in/icdsaai/ |
Conference
Conference | 3rd International Conference on Data Science, Agents, and Artificial Intelligence |
---|---|
Abbreviated title | ICDSAAI 2025 |
Country/Territory | India |
City | Chennai |
Period | 28/03/25 → 29/03/25 |
Internet address | https://www.citchennai.edu.in/icdsaai/ |
Keywords
- Tamil
- English
- Corpus Analysis
- TTR
- MTLD
- EDA
- Word Embeddings
Fingerprint
Dive into the research topics of 'Building a balanced Tamil Corpus: EDA and Lexical Diversity Comparison with English'. Together they form a unique fingerprint.
Prizes
- Best paper award
  Vasuki Murugesan, Y. V. M. (Recipient), 29 Mar 2025
  Prize: Prize (including medals and awards)