Building a balanced Tamil Corpus: EDA and Lexical Diversity Comparison with English

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Corpora are crucial for improving the accuracy of many Natural Language Processing (NLP) tasks. Tamil is a Dravidian language classified as low-resource. It is also morphologically rich, with an agglutinative and complex inflectional structure. A balanced corpus for Tamil, TamilCorp, comprising 1.69 billion tokens across 17 genres, was constructed. An exploratory data analysis (EDA) was performed on the type-token data obtained from all 17 genres, using correlation heatmaps, scatter plots and other visualizations, to understand type-token trends in Tamil. A uniformly random sample of about 1 million tokens (with 615 KB of text per genre) was generated. This randomly sampled, balanced Tamil corpus was used as the experimental dataset and compared with similar-sized samples from well-balanced English corpora such as COCA, COHA and Brown. The statistical analysis used two measures of lexical diversity: the type-token ratio (TTR) and the measure of textual lexical diversity (MTLD). The analysis revealed that Tamil has much higher lexical diversity than English, owing to its morphologically rich and agglutinative structure. The observed differences were statistically significant (t-tests, with Cohen's d for effect size). This experiment emphasizes the significance of understanding variation in lexical diversity across languages and contributes to the fields of NLP and corpus linguistics. The entire code for the EDA and t-test experiments is available as an open-source repository on GitHub, along with the word embeddings and a Tamil dictionary of around 8 million words obtained from TamilCorp, so that it may help other researchers in the Tamil NLP community.
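As a rough illustration of the measures named in the abstract, the following Python sketch computes TTR, a simplified MTLD, Cohen's d and a t statistic. It is not the authors' released code. The function names, the 0.72 MTLD threshold (the value commonly used in the literature) and the choice of Welch's unequal-variance t statistic are all assumptions made for this example; the abstract does not say which t-test variant was used.

```python
import math
from statistics import mean, variance


def ttr(tokens):
    """Type-token ratio: number of distinct types divided by number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def _mtld_one_pass(tokens, threshold):
    """One directional MTLD pass: mean length of stretches whose TTR stays at or above threshold."""
    factors = 0.0
    types = set()
    count = 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count < threshold:
            # Running TTR dropped below the threshold: close this factor.
            factors += 1.0
            types.clear()
            count = 0
    if count > 0:
        # Partial credit for the unfinished final stretch.
        running = len(types) / count
        factors += (1.0 - running) / (1.0 - threshold)
    # If no factor was ever completed, fall back to the text length.
    return len(tokens) / factors if factors > 0 else float(len(tokens))


def mtld(tokens, threshold=0.72):
    """Simplified MTLD: average of a forward and a backward pass (threshold is an assumption)."""
    if not tokens:
        return 0.0
    forward = _mtld_one_pass(tokens, threshold)
    backward = _mtld_one_pass(tokens[::-1], threshold)
    return (forward + backward) / 2.0


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled = math.sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled


def welch_t(a, b):
    """Welch's t statistic for two independent samples (test variant assumed, not from the paper)."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))
```

In practice each function would be applied per genre or per sample, for example to lists of per-sample MTLD scores for Tamil and English. A p-value for the t statistic could be obtained with `scipy.stats.ttest_ind(a, b, equal_var=False)` if SciPy is available.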
Original language: English
Title of host publication: 2025 International Conference on Data Science, Agents & Artificial Intelligence (ICDSAAI)
Publisher: IEEE
Number of pages: 6
ISBN (Electronic): 9798331537555
ISBN (Print): 9798331537562
Publication status: Published - 29 May 2025
Event: 3rd International Conference on Data Science, Agents, and Artificial Intelligence - Chennai Institute of Technology, Chennai, India
Duration: 28 Mar 2025 → 29 Mar 2025
https://www.citchennai.edu.in/icdsaai/

Conference

Conference: 3rd International Conference on Data Science, Agents, and Artificial Intelligence
Abbreviated title: ICDSAAI 2025
Country/Territory: India
City: Chennai
Period: 28/03/25 → 29/03/25

Keywords

  • Tamil
  • English
  • Corpus Analysis
  • TTR
  • MTLD
  • EDA
  • Word Embeddings

Prizes

  • Best paper award

    Vasuki Murugesan, Y. V. M. (Recipient), 29 Mar 2025

    Prize: Prize (including medals and awards)
