I have been learning about semantic search and somehow found my way back to wondering how to train a Hindi or Malayalam language model. I did this back in the LSTM days, and while I’ve kept up with some of the research, I haven’t kept up with the engineering. Anyway, if you want to do language modelling in Indic languages, your best bet is to begin with something the AI4Bharat team has come up with. This is a language modelling and NLP group at IIT Madras that has released what seem to be the latest and largest models for Indic languages. Because of the sad, sad paucity of monolingual corpora in Indian languages, they have generally done multilingual training.
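If you just want to poke at their models, the IndicBERT checkpoint is on the Hugging Face Hub. Here is a minimal sketch, assuming the ai4bharat/indic-bert model name and that transformers, sentencepiece, and torch are installed:

```python
# Minimal sketch: load IndicBERT from the Hugging Face Hub and get
# contextual embeddings for a Hindi sentence.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")
model = AutoModel.from_pretrained("ai4bharat/indic-bert")

# "India is a country"
inputs = tokenizer("भारत एक देश है", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden_size)
```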
To get an idea of just how bad things are, this is how many tokens per language the largest Indic models have to train on:
Let’s compare that with English: the C4 dataset, built from Common Crawl, runs to about 1.4 trillion tokens.
Several orders of magnitude larger.
Even in terms of variety, most English LLMs are trained on a mix of scientific articles, blogs, news, legal documents, books, novels and so on, whereas the AI4Bharat corpus, the largest collated for Indic languages, is largely just news and Wikipedia.
Not only are the datasets small, the benchmarking tasks are also several orders of magnitude smaller. This means we can never really compare the performance of an Indic model with an English one.
Another problem: we don’t have good tokenizers for the different Indian languages. The IndicBERT team uses the same BERT tokenizer, designed for English, with some modification. Malayalam is an agglutinative language; it seems silly to represent it using a tokenizer made for English or Hindi. Several projects have tried to build language-specific tokenizers, but none seem to connect to larger models or to the transformers ecosystem, and not many people seem to be using them.
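To make the tokenizer point concrete, here is a rough sketch of training a Malayalam-only subword tokenizer with SentencePiece and comparing it against a multilingual BERT vocabulary. The corpus path and vocabulary size are placeholders, not anything AI4Bharat actually uses:

```python
# Sketch: train a Malayalam-specific unigram tokenizer with SentencePiece.
# "malayalam_corpus.txt" (one sentence per line, UTF-8) and vocab_size
# are placeholder assumptions.
import sentencepiece as spm
from transformers import AutoTokenizer

spm.SentencePieceTrainer.train(
    input="malayalam_corpus.txt",
    model_prefix="ml_unigram",
    vocab_size=32000,
    model_type="unigram",
    character_coverage=1.0,  # keep the full Malayalam script
)

sp = spm.SentencePieceProcessor(model_file="ml_unigram.model")

# Compare how a general multilingual vocabulary fragments the same sentence.
mbert = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

sentence = "മലയാളം ഒരു ഭാഷയാണ്"  # "Malayalam is a language"
print(sp.encode(sentence, out_type=str))
print(mbert.tokenize(sentence))
```

The point of the comparison is that a tokenizer trained only on Malayalam tends to produce fewer, more meaningful subwords for agglutinative word forms than a vocabulary shared across a hundred languages.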
What all this means is that unless we really rally and make the datasets for our languages much, much larger, we are destined to have our languages represented exclusively in relation to English, which really is the approach Google and others suggest for dealing with “low resource” languages. This is appalling.
I deeply appreciate the work the IITM team is doing, but someone needs to fund them, and a dozen other labs like them, at several orders of magnitude above current levels.
Citation for IndicBERT:
@inproceedings{kakwani2020indicnlpsuite,
  title={{IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages}},
  author={Divyanshu Kakwani and Anoop Kunchukuttan and Satish Golla and Gokul N.C. and Avik Bhattacharyya and Mitesh M. Khapra and Pratyush Kumar},
  year={2020},
  booktitle={Findings of EMNLP},
}