
Feature request: Language detection #1172

Closed
mollerhoj opened this issue Jul 3, 2017 · 15 comments · May be fixed by baby636/spaCy#55
Labels
enhancement (Feature requests and improvements), help wanted (Contributions welcome!)

Comments

@mollerhoj
Contributor

mollerhoj commented Jul 3, 2017

Are there any plans to add language detection to spaCy? If the goal of the project is to be 'the Ruby on Rails of NLP' then I think it would make sense to include this feature.

(Just imagine how magical it will be - we don't even need to call spacy.load(...); this can be done lazily after the language has been detected ;-). OK, maybe not a good idea, but a useful feature nonetheless)

A quick Google search has led me to believe the approach taken by the cld2 project is state-of-the-art in this field. Essentially, it uses naive Bayes on quadgrams.

I would like to hear your thoughts on implementing this feature. Would you want to use existing libraries, or should I try to train a classifier from scratch?

I'm about to develop a project where I need high-accuracy language detection on a corpus that has Danish and English sentences mixed together. So I need this feature, and I would love not to have to rely on external libraries.
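[Editor's note: the quadgram approach mentioned above can be sketched in pure Python. This is a toy illustration with hand-made one-sentence training samples and add-one smoothing, not cld2's actual model or data.]

```python
import math
from collections import Counter

def quadgrams(text):
    """Character 4-grams of a text, including spaces."""
    return [text[i:i + 4] for i in range(len(text) - 3)]

def train(samples):
    """samples: {language: training text} -> per-language quadgram counts."""
    return {lang: Counter(quadgrams(text)) for lang, text in samples.items()}

def classify(text, models, alpha=1.0):
    """Pick the language whose smoothed quadgram model gives the text the
    highest log-likelihood (naive Bayes with a uniform prior)."""
    scores = {}
    for lang, counts in models.items():
        total = sum(counts.values())
        vocab = len(counts) + 1  # +1 for the unseen-gram bucket
        scores[lang] = sum(
            math.log((counts[g] + alpha) / (total + alpha * vocab))
            for g in quadgrams(text)
        )
    return max(scores, key=scores.get)

models = train({
    "en": "the quick brown fox jumps over the lazy dog and runs away",
    "da": "den hurtige brune ræv hopper over den dovne hund og løber væk",
})
```

With realistic training corpora the same scoring loop works unchanged; only the counts get larger.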

@honnibal honnibal added the enhancement Feature requests and improvements label Jul 6, 2017
@honnibal
Member

honnibal commented Jul 6, 2017

This is planned, yes :).

The simple solution can be derived from the .prob values, which give the unigram log probabilities. If each language has these set in the vocab, you should be able to do:

# `languages` is a list of loaded pipelines, one per candidate language
docs = [nlp(text) for nlp in languages]
probs = [(sum(tok.prob for tok in doc), doc) for doc in docs]
prob, doc = max(probs, key=lambda pair: pair[0])

This selects the language under which the unigram probabilities are maximised --- i.e. it's unigram Naive Bayes. It's also important to include a prior on the languages, and it's very useful to have this prior adapt to context. For instance, if you're processing a sequence of texts, you probably want to let the language prediction of the previous text influence the language prediction of the next text.

Depending on the application, it might also be important to pay attention to pre-processing decisions. For instance, let's say you have a Faroese sample with poor HTML cleaning. Then when you process noisy text, suddenly you're tagging it all as Faroese!

In short: improved language detection usually flows from making smart decisions specific to your application. Building more complicated language models is usually counter-productive, because it increases the risk that your language model will unexpectedly produce a very confident decision, overwhelming your contextual priors.
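[Editor's note: the unigram-plus-prior idea can be illustrated without loading any spaCy models. The log-probability tables and the OOV fallback below are made-up numbers; in spaCy the per-token values would come from `tok.prob` after running each language's pipeline.]

```python
import math

# Hypothetical per-language unigram log probabilities.
LOGPROB = {
    "en": {"the": -3.0, "dog": -8.0, "hund": -16.0},
    "da": {"den": -3.5, "hund": -7.5, "dog": -15.0},
}
OOV = -20.0  # fallback log probability for out-of-vocabulary tokens

def detect(tokens, prior):
    """Unigram naive Bayes: sum of token log probs plus a log prior.
    `prior` maps language -> prior probability; e.g. skew it toward the
    language predicted for the previous text in a sequence."""
    scores = {
        lang: math.log(prior[lang])
        + sum(probs.get(tok, OOV) for tok in tokens)
        for lang, probs in LOGPROB.items()
    }
    return max(scores, key=scores.get)
```

A contextual prior is then just a non-uniform `prior` dict, e.g. `{"en": 0.9, "da": 0.1}` after an English-looking previous text.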

@mollerhoj
Contributor Author

Thank you honnibal, that's a nice little hack - and definitely enough for my use case!

@bittlingmayer
Contributor

bittlingmayer commented Jul 13, 2017

Lang id is non-trivial; the existing libraries are not great and, as @honnibal said, very application-specific.

The list of probabilities he suggested is great, much better than a set of probabilities that adds up to one or less, because often the question is whether the probability for a given language is greater than some threshold. For one, it deals with the case where the text is in none of the languages in the list.

But averaging or summing token probabilities has inherent drawbacks. (For comparing against a threshold, e.g., you will want to average by token count.)

  1. Most lang id approaches use char-level probabilities for a reason: space and performance, but also handling out-of-vocabulary tokens.

  2. Parsing information is key. 'Conoce Jim el Google Cloud Platform en Python o JavaScript?' is Spanish, but 'You know Hoy Tengo Ganas De Ti by Miguel Gallardo?' is English.

It sounds like your use case could involve a lot of (2); in any case, many non-English texts contain a lot of English. But no major lang id library uses parsing as an input, to my knowledge.

A simple hack could be to remove entities or boost stop words. Parse probability would be useful too.

@tiru1930

Can you please provide the exact code for this use case? I am trying to load each language and find the probabilities, which is both time- and memory-consuming. Can you please help me?

@honnibal
Member

honnibal commented Sep 29, 2017

@tiru1930 At the moment I would recommend using an external language identification package, unfortunately. We still really want to provide this, but we don't have it yet.

@tiru1930

@honnibal thank you, will check on this

@diegow88

diegow88 commented Oct 15, 2017

@honnibal it worked for the English, Spanish and German models, but I couldn't make it work for the French model. I always get zeros. Apparently prob returns zero for every word. I've tried Python 2.7 and 3.5. I am running spaCy v1.9.

Thanks!

@ines
Member

ines commented Nov 9, 2017

Update: This might be a good use case for the new custom pipeline components in spaCy v2.0! https://spacy.io/usage/processing-pipelines#custom-components

@ines ines closed this as completed Nov 9, 2017
@ines ines reopened this Nov 9, 2017
@ines ines added the help wanted Contributions welcome! label Nov 9, 2017
@bittlingmayer
Contributor

Just to follow up on earlier comments about the drawbacks of simple averaging, one old-school approach is to use the stopwords only.
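[Editor's note: the stopword-only approach can be sketched in a few lines. The stopword sets below are tiny and purely illustrative; real lists, e.g. spaCy's per-language stop_words, run to a few hundred words per language.]

```python
# Tiny illustrative stopword lists, not spaCy's real ones.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "you", "that"},
    "da": {"og", "i", "det", "at", "en", "den", "til", "er"},
}

def stopword_lang(text):
    """Score each language by the fraction of tokens that are stopwords
    in that language, and return the best match (None for empty input)."""
    tokens = text.lower().split()
    if not tokens:
        return None
    hits = {
        lang: sum(tok in sw for tok in tokens) / len(tokens)
        for lang, sw in STOPWORDS.items()
    }
    return max(hits, key=hits.get)
```

Because stopwords are closed-class and high-frequency, this sidesteps the out-of-vocabulary problem for content words, at the cost of failing on very short texts that contain no stopwords at all.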

@SandeepNaidu

Loading all language models for language detection and iterating through them might bloat the process's memory. We already have #1600 for v2, which requires some coding effort and additional processing time to break up the document (paragraph by paragraph) and analyze it.

@nickdavidhaynes

Following up here - I wanted to play around with building an extension, so I put together a little pipeline component that integrates the CLD project (https://github.com/nickdavidhaynes/spacy-cld). Since it's tied to the NLP pipeline, it won't work as magically as @mollerhoj originally envisioned. But if you need "good enough" language detection as part of your processing pipeline, this should be relatively easy to incorporate.

cc @ines

@clesleycode

@nickdavidhaynes is this being incorporated into spaCy?

@nickdavidhaynes

@lesley2958 Not as far as I know. It's fairly simple to use in conjunction with spaCy (although let me know if the README isn't clear), but there aren't any plans to bring that package in particular directly into the main spaCy codebase.

@ines
Member

ines commented Jan 14, 2018

@lesley2958 One of the main reasons we've decided to open up the processing pipeline API in v2.0 is to make it easier to implement features like this as plugins – like @nickdavidhaynes' package, for example. Users who want to add those features to their pipeline can do so easily by installing the plugin and adding it via nlp.add_pipe. Developers who prefer a different approach, or want to integrate a different library or model, can do so by writing their own plugin, without having to worry about the core library.

We also prefer features that we ship with spaCy to be self-contained within the library, instead of adding more third-party dependencies. We might want to add a language detection model to spaCy in the future – but if we do so, it will be its own implementation. In the meantime, we think the plugin ecosystem is a good solution to allow users to add any features they like using any other library – no matter how specific they are.

@lock

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018