A Neuro-Symbolic Approach to Address Sandhi and Context-Sensitive Errors in Tamil: Evaluating Deep Learning Models for a Highly Inflectional Language

    Student thesis: Doctoral Thesis (Doctor of Philosophy)

    Abstract

    Spell checkers play a critical role in enhancing the performance of numerous natural language processing (NLP) applications, acting as foundational components in machine translation, information retrieval, and information extraction tasks. Their integration into word processors and search engines has provided indispensable assistance to proofreaders, writers, students, and individuals with language-related disabilities, such as dyslexia. Spell checkers are particularly vital for languages that are agglutinative, morphologically rich, and highly inflectional.

    Tamil, a major Dravidian language, holds official status in Tamil Nadu and Puducherry in India, as well as in Sri Lanka and Singapore. Spoken by over 69 million people in India alone, Tamil presents unique challenges for computational processing due to its extensive alphabet (247 characters), complex morphology, and highly inflectional nature. Despite its classical status, Tamil remains a low-resource language relative to extensively studied languages such as English.

    A systematic literature review of spell checkers for the major Dravidian languages—Tamil, Telugu, Kannada, and Malayalam—identified a lack of robust spell checkers for Tamil. Of the 61 relevant studies initially identified, 48 were analysed systematically using twelve review questions. The analysis highlighted the absence of a spell checker capable of addressing the full spectrum of spelling errors: non-word errors (invalid or non-existent words), real-word errors (contextually incorrect but valid words), and sandhi errors (phonological modifications at word boundaries).

    To overcome these gaps, the study first focused on creating two essential resources, without which robust model development and evaluation would not have been possible. Existing datasets are insufficient in size, genre variety, or quality of error annotation. Therefore, we developed TamilCorp, a balanced written text corpus consisting of 1.7 billion tokens across 17 diverse genres (e.g., literature, news, scientific writing, and social media). TamilCorp provides comprehensive linguistic coverage to ensure that models are trained on the full spectrum of Tamil language use.

    In parallel, we designed TamilSpell, a spelling error corpus comprising over 10 million annotated entries. It includes isolated errors, two-way combinations, and three-way combinations covering non-word, real-word, and sandhi errors. TamilSpell is the first error corpus of this scale for Tamil and was purpose-built to address the critical need for supervised learning data and standardised evaluation benchmarks.

    The deep learning models and rule-based systems developed in the neuro-symbolic research were rigorously evaluated using the test split data from TamilCorp and TamilSpell. This approach ensured that the models were assessed on realistic and diverse samples of both clean language data (from TamilCorp) and systematically categorised error data (from TamilSpell). Using these independent test sets was crucial for guaranteeing that the model performance metrics reflect true generalisation capabilities rather than overfitting to training data. It also allowed for a fair comparison of model performance across different tasks, especially for complex sandhi error correction.

    Exploratory data analysis of TamilCorp further revealed that Tamil exhibits significantly higher lexical diversity than English, as evidenced by metrics such as the type-token ratio (TTR) and the measure of textual lexical diversity (MTLD). This linguistic richness underlines the challenge of building highly accurate spell-checking systems for Tamil.
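The two diversity metrics can be made concrete with a brief sketch (an illustration under stated assumptions, not the thesis's evaluation code): TTR is the number of distinct word forms divided by the total token count, while MTLD counts how many sequential "factors" of text occur before the running TTR falls below a conventional threshold of 0.72, then divides the token count by that factor count. The version below is a simplified forward-only pass; the standard MTLD averages forward and backward passes.

```python
def type_token_ratio(tokens):
    """Type-token ratio: distinct word forms divided by total tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

def mtld_forward(tokens, threshold=0.72):
    """Simplified forward-only MTLD: token count divided by factor count."""
    factors = 0.0
    types = set()
    count = 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count < threshold:
            # Running TTR dropped below the threshold: close one factor.
            factors += 1.0
            types.clear()
            count = 0
    if count > 0:
        # Remaining partial factor, scaled by how far the TTR has fallen.
        ttr = len(types) / count
        factors += (1.0 - ttr) / (1.0 - threshold)
    return len(tokens) / factors if factors else float(len(tokens))
```

A higher MTLD means longer stretches of text sustain a high TTR, which is why it is less sensitive to text length than raw TTR.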

    Several cross-lingual language models, such as mBART, mT5, and NLLB, were trained and evaluated; XLM-R was excluded due to its limitations in generation-based tasks. To tackle sandhi errors, a neuro-symbolic AI framework was employed. A rule base was constructed to handle 20 sandhi rules (12 for the addition of a sandhi letter and 8 for the deletion of one), and it was published as an open-source Python library. While these twenty rules were addressed using rule-based approaches, others necessitated complex machine learning and deep learning techniques. A particularly challenging rule (Panbu Thokai) was modelled as a sequence-to-sequence (Seq2Seq) task, with mBART outperforming the other models, achieving a BLEU score of 99.9% and an Exact Match Accuracy (EMA) of 97.9%.

    For context-sensitive error correction, the same set of cross-lingual language models (mT5, mBART, and NLLB) was evaluated. Across all evaluations, mT5 consistently outperformed the competing models, achieving a BLEU score of 99.28% and an Exact Match Accuracy (EMA) of 93.38%. Following systematic hyperparameter tuning, the best-performing models will be released as part of an open-source Tamil spell-checking toolkit, which will include both the real-word and sandhi error correction models and will be publicly available for research and application.
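For readers unfamiliar with the second metric: EMA is simply the fraction of model outputs that match the gold correction exactly. A minimal sketch (BLEU itself is normally computed with an established implementation such as sacreBLEU rather than by hand):

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions identical to their reference corrections."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must align one-to-one")
    if not references:
        return 0.0
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)
```

EMA is the stricter of the two scores: a single wrong character in an otherwise fluent correction counts as a miss, which is why the reported EMA figures sit below the corresponding BLEU scores.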

    Through this research, we aim to push the boundaries of Tamil NLP not only by developing a comprehensive spell-checking toolkit but also by contributing valuable language resources and exploring cutting-edge deep learning techniques. This work is expected to enhance the efficiency and accuracy of Tamil language processing tools and to serve as a foundational framework for further linguistic research in Tamil and potentially other low-resource languages.
    Date of Award: 2026
    Original language: English
    Awarding Institution:
    • University of Dundee
    Supervisors: Annalu Waller & Jacky Visser
