A curated catalog of open-source resources for Tamil NLP & AI.
The estimated worldwide Tamiḻ-speaking population is around 80-85 million, close to the population of Germany. It is therefore crucial to work on natural language processing for தமிழ் (Tamiḻ) and to develop tools that keep the language well represented digitally.
This list will serve as a catalog for all resources related to Tamil NLP.
Note:
- Please use GitHub Issues for queries/feedback or to contribute resources/links.
- If you find this useful, please star it on GitHub to help keep this list active.
- To follow the latest updates to this catalog, press the "Watch" button at the top-right of this repo.
- Share this awesome website if you liked it! :-)
- Tools, Libraries, Models
- Datasets
- Other Important Resources
- iNLTK (Tools for processing and trained models)
- Indic NLP Library (Script-processing tools)
Also check Ezhil Foundation's Awesome-Tamil for many more resources!
- FastText
- Wikipedia-based - {2016}
- CommonCrawl+Wikipedia - {2017}
- AI4Bharat IndicFT - {2020}
- Multilingual Aligned - {2017}
- ConceptNet
- BPEmb: Subword Embeddings - {2017, Aligned Multilingual}
- PolyGlot
- Facebook MUSE
- GeoMM
- TranKit
- Multilingual Text2Text
- iNLTK (ULMFit and TransformerXL) - Tamil | Tanglish
- Multilingual BERT
- XLM-RoBERTa
- AI4Bharat: ALBERT, BART
- Google ELECTRA - TaMillion - {2020, Code}
- Google Multilingual T5, mT6 and DeltaLM
- Google MuRIL - {2020, TF-Hub, HuggingFace}
- NMT
- AI4Bharat IndicTrans - {2021, Paper}
- not-AI-Tech Anuvaad - {2020, mT5 model fine-tuned on public datasets}
- IIIT-H IndicMulti
- EasyNMT - Collection of open source multilingual NMT models
- Moses SMT
- AI4Bharat Xlit
- notAI.tech DeepTranslit
- Indic Transliteration
- AksharaMukha - API
- LibIndic - Rule-based and Model-based | English words
- PolyGlot Transliteration
- EpiTran - IPA Transliteration
- Word Phonemizer
- WikTra - Tamil Romanizer
- Tamilinaiya Spell Checker
- Tamil Language Model and Tokenizer - {2018}
- Indic POS Tagger
- Punctuation Restoration & Indic-Punct
- Number To Words
- CommonCrawl
- OSCAR Corpus 2019 - Deduplicated Corpus {226M Tokens, 5.1GB}
- WMT Raw 2017 - CC crawls from 2012-2016
- CC-100 - CC crawls from Jan-Dec 2018
- AI4Bharat IndicCorp - {582M}
- WikiDumps
- WMT News Crawl
- Kaggle Tamil Articles Corpus
- Dinamalar News Corpus - {2009-19, 120k articles}
- TamilMurasu News Articles - {2011-19, 127k articles}
- Leipzig Corpora
- Cholloadai, 2021 - 72M phrases (not sentences)
- LDCIL Standard Text Corpus - Free for students/faculty {11M tokens}
- EMILLE Corpus - {20M Tokens, developed in collaboration with CIIL}
- Project Madurai
- AI4Bharat Samān-Antar {Paper}
- Also contains most open-source datasets as of March 2021
- OPUS Corpus (Search en->ta)
- Contains MultiCC Aligned, JW300, Tanzil, bible-corpus, WikiMatrix, and more...
- Note: CC-Aligned overlaps with CommonCrawl-Matrix
- MultiIndicMT - WAT2021 / WMT20 NEWS MT Task
- MTurks Crowd-sourced - {2012}
- EkStep Anuvaad
- Parallel Corpora
- Synthetic Corpus - Translations generated using Google
- Tatoeba Wiki Back-translated data
- IndoWordNet
- VPT-IL-FIRE2018 - 3k verb phrases, available on request
Note: You can also use the MTData library to automatically download parallel data from many of the above sources.
- Indian Language Corpora Initiative - Available only on request
- TDIL EILMT
- Tourism, Agriculture, Health
- Mirrored at NPLT
- Hindi-Tamil ILCI
- Telugu-Tamil General Text Corpus
- Sinhala-Tamil Parallel Corpus - {Paper1, Paper2, Data available on request?, Test set}
- cEnTam: Creation of a New English-Tamil Corpus, 2020 - Uses OPUS+WMT20 data
- MIDAS-NMT, 2018 - Uses OPUS+EnTam data
- Google Dakshina Dataset
- NEWS2018 Dataset
- Microsoft Multi-Indic Mined Corpus - {2021, Paper}
- TRANSLIT: A Large-scale Name Transliteration Resource - {2020, Paper}
- ICTA English-Sinhala-Tamil Names - {2009, 10k triplets, SQL format}
- Thirukkural Transliteration (Old Tamil)
- Ek-Step ULCA ASR dataset
- Microsoft Speech Corpus
- OpenSLR - {2020, 9 hours, Paper}
- IARPA Babel - {2017, 350 hours}
- Mozilla CommonVoice - {2020, 20 hours}
- Facebook CoVoST - {2019, 2-4 hours}
- Spoken Tutorial - TODO: Scrape from here
- Prabhupadavani - {2022, Paper}
- CVSS - CommonVoice-based S2S - {2022, ~3 hours}
- IIT Madras TTS database - {2020, Competition}
- WikiPron - Word Pronunciations from Wiki
- LinguaLibre - Wiktionary-based word corpus
- SLR65 - Crowdsourced high-quality Tamil multi-speaker speech dataset
- VoxLingua107 - Language Identification dataset
- Abuse Detection In Multilingual Audio - {2022, Paper}
- A classification dataset for Tamil music - {2020, Paper}
- Chatbot NER
- FIRE2014
- FIRE2015 Social Media Text - Tweets
- WikiAnn - (Latest Download Link)
- University of Moratuwa NER - {2019}
- Tamil Noun Classifier
- IndicGLUE Classification Benchmark
- Headline Classification
- Wikipedia Section Title Classification
- Wiki Cloze-style Question Answering
- Thirukkural Dataset - {Aṟam, Poruḷ, Inbam} classification
- Movie Review Dataset
- News Classification
- Offensive Language Identification in Dravidian Languages - {2020, Dataset}
- LipiTK Isolated Handwritten Tamil Character Dataset - {156 characters, 500 samples per char}
- Tamil Vowels - Scanned Handwritten - {12 vowels, 18 images each}
- AcchuTamil Printed Characters Dataset - {MNIST format}
- Jaffna University Datasets of printed Tamil characters and documents
- Kalanjiyam: Unconstrained Offline Tamil Handwritten Database - {2016, Paper}
- SynthText - {2019}
- IIIT-H OCR benchmark and synthetic data - {2021, Available on request}
- AUKBC-TamilPOSCorpus2016v1
- ThamizhiPOSt
- Treebanks from Universal Dependencies
- SentiWordNet - SAIL
- Dravidian-CodeMix: Offensive Language Identification - FIRE2020 - {Competition, Paper, TamilMixSentiment}
- Implementations: Theedhum Nandrum
- Twitter Keyword based Emotion Corpus - {2019}
- ACTSEA: Annotated Corpus for Tamil & Sinhala Emotion Analysis
- Tamil 1k Tweets For Binary Sentiment Analysis
- Hope Speech Dataset, 2020 (Competition)
- IIIT-D Abusive Comment Identification, 2021
- Multilingual Abusive Comment Detection - ShareChatAI - 30k samples
- DravidianLangTech 2022
- IndoWordNet
- AU-KBC WordNet
- IIIT-H Word Similarity Database
- AI4Bharat Word Frequency Lists
- MTurks Bilingual Dictionary - {2014}
- XQA: A Cross-lingual Open-domain Question Answering Dataset - {2019, Paper}
- XAlign: Cross-lingual Fact-to-Text Alignment and Generation - {2022, Paper}
- XL-Sum: Abstractive Summarization
- XTREME - Multi-task Benchmark for Cross-lingual Generalization
- XTREME-S: Evaluating Cross-lingual Speech Representations - {Paper}
- IndicGLUE
- MASSIVE - NLU Benchmark - Slot filling, Intent classification, Virtual assistant evaluation
- Vyākarana - Syntactic evaluation of language models - {2021}
- Natural Language Inference
- XNLI 2019 - Request via email
- AI4Bharat Cross-Lingual Sentence Retrieval
- AI4Bharat Cross-lingual Semantic Textual Similarity - {2020}
- Multilingual Entity-Linking from WikiNews - {2020}
- IndicLink - Multilingual Fact Linking - {2022}
- Dialogue
- Information Extraction (can also be event extraction or entity extraction)
- Misc
- Paraphrase Identification - Amrita University-DPIL Corpus
- Anaphora Resolution from Social Media Text - FIRE2020
- MMDravi - Image Captioning and Translation Benchmark, 2019 - Contains manually annotated data for dev & tests from Flickr30k dataset
- WIT: Wikipedia-based Image Text Dataset, 2021
- AllNewLyrics Dataset - Tamil Song Lyrics - {2021, Paper}
- TamilPaa Song-Lyrics Dataset, 2020
- Reasoning
- MorphAnalysis
- Pure Tamil
- IndicNLP Catalog by AI4Bharat
- The Big Bad NLP Database