Tamil Deep Learning Awesome List

A curated catalog of open-source resources for Tamil NLP & AI.

The estimated worldwide Tamiḻ-speaking population is around 80-85 million, close to the population of Germany. Hence it is crucial to work on natural language processing for தமிழ் (Tamiḻ) and to develop tools that ensure the language is digitally well represented.

This list will serve as a catalog for all resources related to Tamil NLP.

Note:

Please use GitHub Issues for queries/feedback or to contribute resources/links.
If you find this useful, please star this on GitHub to help keep this list active.
If you want to follow the latest updates to this catalog, press the "Watch" button at the top-right of this repo.
Share this awesome website if you liked it! :-)

Tools, Libraries, Models

General

iNLTK (Tools for processing and trained models)
Indic NLP Library (Script-processing tools)

Also check Ezhil Foundation's Awesome-Tamil for a lot more resources!

Word Embeddings

FastText
- Wikipedia-based - {2016}
- CommonCrawl+Wikipedia - {2017}
- AI4Bharat IndicFT - {2020}
- Multilingual Aligned - {2017}
ConceptNet
BPEmb: Subword Embeddings - {2017, Aligned Multilingual}
PolyGlot
Facebook MUSE
GeoMM

Transformers, BERT

TranKit
Multilingual Text2Text
iNLTK (ULMFit and TransformerXL) - Tamil | Tanglish
Multilingual BERT
XLM-RoBERTa
AI4Bharat: ALBERT, BART
Google ELECTRA - TaMillion - {2020, Code}
Google Multilingual T5, mT6 and DeltaLM
Google MuRIL - {2020, TF-Hub, HuggingFace}

Translation

NMT
- AI4Bharat IndicTrans - {2021, Paper}
- notAI.tech Anuvaad - {2020, mT5 model fine-tuned on public datasets}
- IIIT-H IndicMulti
- EasyNMT - Collection of open source multilingual NMT models
Moses SMT
- IIT-B Śata-Anuva̅dak

Online translation libraries

Python Translators

Transliteration

AI4Bharat Xlit
notAI.tech DeepTranslit
Indic Transliteration
AksharaMukha - API
LibIndic - Rule-based and Model-based | English words
PolyGlot Transliteration
EpiTran - IPA Transliteration
Word Phonemizer
WikTra - Tamil Romanizer

OCR

Speech

Grammar

Miscellaneous

Datasets

Monolingual Corpus

CommonCrawl
- OSCAR Corpus 2019 - Deduplicated Corpus {226M Tokens, 5.1GB}
- WMT Raw 2017 - CC crawls from 2012-2016
- CC-100 - CC crawls from Jan-Dec 2018
AI4Bharat IndicCorp - {582M}
WikiDumps
WMT News Crawl
Kaggle Tamil Articles Corpus
Dinamalar News Corpus - {2009-19, 120k articles}
TamilMurasu News Articles - {2011-19, 127k articles}
Leipzig Corpora
Cholloadai, 2021 - 72M phrases (not sentences)

Government Raw Text

LDCIL Standard Text Corpus - Free for students/faculty {11M tokens}
EMILLE Corpus - {20M Tokens, developed in collaboration with CIIL}
Project Madurai

Translation

AI4Bharat Samān-Antar {Paper}
- Also contains most open-source datasets as of March 2021
OPUS Corpus (Search en->ta)
- Contains MultiCC Aligned, JW300, Tanzil, bible-corpus, WikiMatrix, and more...
- Note: CC-Aligned overlaps with CommonCrawl-Matrix
MultiIndicMT - WAT2021 / WMT20 NEWS MT Task
- Contains PM India Corpus, Manathin Kural (CVIT-MkB), NLPC-UoM Corpus, Wiki Titles, Charles University EnTam v2.0 Corpus
MTurks Crowd-sourced - {2012}
EkStep Anuvaad
- Parallel Corpora
- Synthetic Corpus - Translations generated using Google
Tatoeba Wiki Back-translated data
IndoWordNet
VPT-IL-FIRE2018 - 3k verb phrases, available on request

Note: You can also use the MTData library to automatically download parallel data from many of the above sources.

Speech, Audio

Speech-To-Text

EkStep ULCA ASR dataset
Microsoft Speech Corpus
OpenSLR - {2020, 9 hours, Paper}
IARPA Babel - {2017, 350 hours}
Mozilla CommonVoice - {2020, 20 hours}
Facebook CoVoST - {2019, 2-4 hours}
Spoken Tutorial - TODO: Scrape from here

Speech Translation

Prabhupadavani - {2022, Paper}
CVSS - CommonVoice-based S2S - {2022, ~3 hours}

Text-to-Speech (TTS)

IIT Madras TTS database - {2020, Competition}
WikiPron - Word Pronunciations from Wiki
LinguaLibre - Wiktionary-based word corpus
SLR65 - Crowdsourced high-quality Tamil multi-speaker speech dataset

Audio

VoxLingua107 - Language Identification dataset
Abuse Detection In Multilingual Audio - {2022, Paper}
A classification dataset for Tamil music - {2020, Paper}

Named Entity Recognition

Text Classification

IndicGLUE Classification Benchmark
- Headline Classification
- Wikipedia Section Title Classification
- Wiki Cloze-style Question Answering
AI4Bharat News Article Classification
iNLTK News Articles Classification
TamilMurasu News Articles Classification
Indic Tamil NLP 2018
- Thirukkural Dataset - {Aṟam, Poruḷ, Inbam} classification
- Movie Review Dataset
- News Classification
A Dataset for Troll Classification of TamilMemes, 2020
Offensive Language Identification in Dravidian Languages - {2020, Dataset}

OCR

Character-level datasets

LipiTK Isolated Handwritten Tamil Character Dataset - {156 characters, 500 samples per char}
Tamil Vowels - Scanned Handwritten - {12 vowels, 18 images each}
AcchuTamil Printed Characters Dataset - {MNIST format}
Jaffna University Datasets of printed Tamil characters and documents
Kalanjiyam: Unconstrained Offline Tamil Handwritten Database - {2016, Paper}

Scene-Text Detection / Recognition

SynthText - {2019}
IIIT-H OCR benchmark and synthetic data - {2021, Available on request}

Document OCR

Anuvaad OCR Corpus

Part-Of-Speech (POS) Tagging

Sentiment and Abuse Analysis

SentiWordNet - SAIL
Dravidian-CodeMix: Offensive Language Identification - FIRE2020 - {Competition, Paper, TamilMixSentiment}
- Implementations: Theedhum Nandrum
Twitter Keyword based Emotion Corpus - {2019}
ACTSEA: Annotated Corpus for Tamil & Sinhala Emotion Analysis
Tamil 1k Tweets For Binary Sentiment Analysis
Hope Speech Dataset, 2020 (Competition)
IIIT-D Abusive Comment Identification, 2021
Multilingual Abusive Comment Detection - ShareChatAI - 30k samples
DravidianLangTech 2022

Lexical Resources

Natural Language Generation

Benchmarks

XTREME - Multi-task Benchmark for Cross-lingual Generalization
XTREME-S: Evaluating Cross-lingual Speech Representations - {Paper}
IndicGLUE
MASSIVE - NLU Benchmark - Slot filling, Intent classification, Virtual assistant evaluation
Vyākarana - Syntactic evaluation of language models - {2021}

Miscellaneous NLP Datasets

Natural Language Inference
- XNLI 2019 - Request via email
- AI4Bharat Cross-Lingual Sentence Retrieval
- AI4Bharat Cross-lingual Semantic Textual Similarity - {2020}
- Multilingual Entity-Linking from WikiNews - {2020}
- IndicLink - Multilingual Fact Linking - {2022}
Dialogue
- Code-Mixed-Dialog 2018
Information Extraction
(Can also be event extraction or entity extraction)
Misc
- Paraphrase Identification - Amrita University-DPIL Corpus
- Anaphora Resolution from Social Media Text - FIRE2020
- MMDravi - Image Captioning and Translation Benchmark, 2019 - Contains manually annotated data for dev & tests from Flickr30k dataset
- WIT: Wikipedia-based Image Text Dataset, 2021
- AllNewLyrics Dataset - Tamil Song Lyrics - {2021, Paper}
- TamilPaa Song-Lyrics Dataset, 2020
Reasoning
- Cross-lingual Choice of Plausible Alternatives (XCOPA)
MorphAnalysis
- AI4Bharat MorphAnalyzer
- ThamizhiMorph
Pure Tamil

Other Important Resources

IndicNLP Catalog by AI4Bharat
The Big Bad NLP Database


narVidhai/tamil-nlp-catalog


