If we want to deploy language detection to maximum effect on wikis besides enwiki, we need to know which languages are most often used there (in poorly-performing queries), and limit language detection to "valuable" languages for a given wiki. E.g., on enwiki, there aren't that many French queries, and many more queries are incorrectly identified as French than correctly identified, making it a net loss. Obviously, we'd need French on frwiki. We can generally work this out to within a few percent with a sample of 500-1,000 queries.
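As a minimal sketch of the reasoning here (the data layout and function names are hypothetical, not our actual tooling): given a manually tagged sample of poorly-performing queries and the detector's guesses, a language is only worth enabling on a wiki if correct detections outnumber incorrect ones, and the binomial margin of error tells us how far a sample of 500-1,000 gets us.

```python
import math
from collections import Counter

def net_effect_per_language(tagged_sample):
    """tagged_sample: list of (actual_lang, detected_lang) pairs for
    poorly-performing queries; actual_lang comes from manual tagging,
    detected_lang from the candidate detector (None = no detection)."""
    correct = Counter()
    incorrect = Counter()
    for actual, detected in tagged_sample:
        if detected is None:
            continue  # detector abstained; no effect either way
        if detected == actual:
            correct[detected] += 1
        else:
            incorrect[detected] += 1
    # A language is only "valuable" on this wiki if correct detections
    # outnumber incorrect ones (e.g., French on enwiki fails this test).
    return {lang: correct[lang] - incorrect[lang]
            for lang in set(correct) | set(incorrect)}

def margin_of_error(p, n, z=1.96):
    """95% binomial margin of error for a proportion p estimated from n queries."""
    return z * math.sqrt(p * (1 - p) / n)
```

For example, margin_of_error(0.5, 500) ≈ 0.044 and margin_of_error(0.5, 1000) ≈ 0.031 in the worst case, which is where "within a few percent" comes from.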
Work on the top N languages and determine the best mix of languages to use for each of them. Each evaluation set would be 500+ poorly-performing queries from the given wiki, manually tagged by language. Tagging takes half a day to a day if you are familiar with the wiki's main language, and up to two days if not; evaluation against a given set of language models takes a couple of hours at most. (This depends on T121539, to make sure we aren't wasting time on a main language that does not perform well.)
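A rough sketch of what "evaluating a mix of languages" could look like, assuming a hypothetical detect(query, allowed_langs) wrapper around whichever language identification tool is under test; the harness below is illustrative, not our actual scripts:

```python
from itertools import combinations

def evaluate_mix(tagged_queries, allowed_langs, detect):
    """Score one mix of enabled languages against a manually tagged
    evaluation set of poorly-performing queries.

    tagged_queries: list of (query_text, actual_lang) pairs
    allowed_langs:  set of language codes the detector may return
    detect:         callable(query_text, allowed_langs) -> lang code or None
    Returns (helped, hurt): queries correctly vs. incorrectly redirected."""
    helped = hurt = 0
    for query, actual in tagged_queries:
        guess = detect(query, allowed_langs)
        if guess is None:
            continue  # no detection -> query handled as before
        if guess == actual:
            helped += 1
        else:
            hurt += 1
    return helped, hurt

def best_mix(tagged_queries, candidate_langs, detect, max_size=6):
    """Brute-force search over small subsets of candidate languages,
    keeping the mix with the largest net gain (helped - hurt)."""
    best = (set(), 0)
    for size in range(1, max_size + 1):
        for mix in combinations(sorted(candidate_langs), size):
            helped, hurt = evaluate_mix(tagged_queries, set(mix), detect)
            if helped - hurt > best[1]:
                best = (set(mix), helped - hurt)
    return best
```

In practice the mix could also be read directly off the per-language helped/hurt counts rather than found by brute-force search; the point is just that each candidate mix is scored against the same tagged set.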
Based on the search metrics dashboard, the top 12 languages by volume* are English, German, Spanish, Portuguese, Russian, French, Italian, Japanese, Polish, Arabic, Chinese, and Dutch, so I'm re-aligning the remaining work to match this list.
[* For now, N = 12 and that accounts for just over 90% of search volume.]
The estimate is roughly two days per wiki to generate an evaluation set, evaluate it against our current best language identification tools, and select the right mix of languages for that tool set.
Done:
- Italian, German, Spanish, and French (T132466)
- English (the older enwiki corpus we've been using is very different and should be re-done so it is more comparable) (T138315)
- Russian, Japanese, Portuguese (also T138315)
- Dutch (T142140)
To Do (mostly in sets of 4, which works out to about 2 weeks in calendar time):
- Polish, Arabic, Chinese