
Feature request: Language detection #1172

Closed
mollerhoj opened this issue Jul 3, 2017 · 15 comments · May be fixed by baby636/spaCy#55
Labels
enhancement (Feature requests and improvements), help wanted (Contributions welcome!)

Comments

@mollerhoj
Contributor

mollerhoj commented Jul 3, 2017

Are there any plans to add language detection to spaCy? If the goal of the project is to be 'the Ruby on Rails of NLP' then I think it would make sense to include this feature.

(Just imagine how magical it will be - we don't even need to call spacy.load(...); this can be done lazily after the language has been detected ;-). OK, maybe not a good idea, but a useful feature nonetheless)

A quick Google search has led me to believe the approach taken by the cld2 project is state-of-the-art in this field. Essentially, it uses naive Bayes on quadgrams.

I would like to hear your thoughts on implementing this feature. Would you want to use existing libraries, or should I try to train a classifier from scratch?

I'm about to develop a project where I need high-accuracy language detection on a corpus that has Danish and English sentences mixed together. So I need this feature, and I would love not to have to rely on external libraries.
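[Editor's note: the quadgram approach mentioned above can be sketched in pure Python. This is a toy illustration with hand-made one-sentence training samples and add-one smoothing, not cld2's actual model or data.]

```python
import math
from collections import Counter

def quadgrams(text):
    """Character 4-grams of a text, including spaces."""
    return [text[i:i + 4] for i in range(len(text) - 3)]

def train(samples):
    """samples: {language: training text} -> per-language quadgram counts."""
    return {lang: Counter(quadgrams(text)) for lang, text in samples.items()}

def classify(text, models, alpha=1.0):
    """Pick the language whose smoothed quadgram model gives the text the
    highest log-likelihood (naive Bayes with a uniform prior)."""
    scores = {}
    for lang, counts in models.items():
        total = sum(counts.values())
        vocab = len(counts) + 1  # +1 for the unseen-gram bucket
        scores[lang] = sum(
            math.log((counts[g] + alpha) / (total + alpha * vocab))
            for g in quadgrams(text)
        )
    return max(scores, key=scores.get)

models = train({
    "en": "the quick brown fox jumps over the lazy dog and runs away",
    "da": "den hurtige brune ræv hopper over den dovne hund og løber væk",
})
```

With realistic training corpora the same scoring loop works unchanged; only the counts get larger.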

@honnibal honnibal added the enhancement Feature requests and improvements label Jul 6, 2017
@honnibal
Member

honnibal commented Jul 6, 2017

This is planned, yes :).

The simple solution can be derived from the .prob values, which give the unigram log probabilities. If each language has these set in the vocab, you should be able to do:

# `languages` is a list of loaded pipelines, one per candidate language
docs = [nlp(text) for nlp in languages]
probs = [(sum(tok.prob for tok in doc), doc) for doc in docs]
prob, doc = max(probs, key=lambda pair: pair[0])

This selects the language under which the unigram probabilities are maximised --- i.e. it's unigram Naive Bayes. It's also important to include a prior on the languages, and it's very useful to have this prior adapt to context. For instance, if you're processing a sequence of texts, you probably want to let the language prediction of the previous text influence the language prediction of the next text.

Depending on the application, it might also be important to pay attention to pre-processing decisions. For instance, let's say you have a Faroese sample with poor HTML cleaning. Then when you process noisy text, suddenly you're tagging it all as Faroese!

In short: improved language detection usually flows from making smart decisions specific to your application. Building more complicated language models is usually counter-productive, because it increases the risk that your language model will unexpectedly produce a very confident decision, overwhelming your contextual priors.
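[Editor's note: the unigram-plus-prior idea can be illustrated without loading any spaCy models. The log-probability tables and the OOV fallback below are made-up numbers; in spaCy the per-token values would come from `tok.prob` after running each language's pipeline.]

```python
import math

# Hypothetical per-language unigram log probabilities.
LOGPROB = {
    "en": {"the": -3.0, "dog": -8.0, "hund": -16.0},
    "da": {"den": -3.5, "hund": -7.5, "dog": -15.0},
}
OOV = -20.0  # fallback log probability for out-of-vocabulary tokens

def detect(tokens, prior):
    """Unigram naive Bayes: sum of token log probs plus a log prior.
    `prior` maps language -> prior probability; e.g. skew it toward the
    language predicted for the previous text in a sequence."""
    scores = {
        lang: math.log(prior[lang])
        + sum(probs.get(tok, OOV) for tok in tokens)
        for lang, probs in LOGPROB.items()
    }
    return max(scores, key=scores.get)
```

A contextual prior is then just a non-uniform `prior` dict, e.g. `{"en": 0.9, "da": 0.1}` after an English-looking previous text.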

@mollerhoj
Contributor Author

Thank you honnibal, that's a nice little hack - and definitely enough for my use case!

@bittlingmayer
Contributor

bittlingmayer commented Jul 13, 2017

Lang id is non-trivial; the existing libraries are not great and, as @honnibal said, very application-specific.

The list of probabilities he suggested is great, much better than a set of probabilities that adds up to one or less, because often the question is whether the probability for a given language is greater than some threshold. For one, it deals with the case where the text is in none of the languages in the list.

But averaging or summing token probabilities has inherent drawbacks. (For comparing against a threshold, e.g., you will want to average by token count.)

  1. Most lang id approaches use char-level probabilities for a reason: space and performance, but also handling out-of-vocabulary tokens.

  2. Parsing information is key. 'Conoce Jim el Google Cloud Platform en Python o JavaScript?' is Spanish, but 'You know Hoy Tengo Ganas De Ti by Miguel Gallardo?' is English.

It sounds like your use case could involve a lot of (2); in any case, many non-English texts contain a lot of English. But no major lang id library uses parsing as an input, to my knowledge.

A simple hack could be to remove entities or boost stop words. Parse probability would be useful too.

@tiru1930

Can you please provide the exact code for this use case? I am trying to load each language and find the probabilities, which is both time- and memory-consuming. Can you please help me?

@honnibal
Member

honnibal commented Sep 29, 2017

@tiru1930 At the moment I would recommend using an external language identification package, unfortunately. We still really want to provide this, but we don't have it yet.

@tiru1930

@honnibal thank you, will check on this

@diegow88

diegow88 commented Oct 15, 2017

@honnibal it worked for the English, Spanish and German models, but I couldn't make it work for the French model. I always get zeros. Apparently prob returns zero for every word. I've tried Python 2.7 and 3.5. I am running spaCy v1.9.

Thanks!

@ines
Member

ines commented Nov 9, 2017

Update: This might be a good use case for the new custom pipeline components in spaCy v2.0! https://spacy.io/usage/processing-pipelines#custom-components

@ines ines closed this as completed Nov 9, 2017
@ines ines reopened this Nov 9, 2017
@ines ines added the help wanted Contributions welcome! label Nov 9, 2017
@bittlingmayer
Contributor

Just to follow up on earlier comments about the drawbacks of simple averaging, one old-school approach is to use the stopwords only.
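[Editor's note: the stopword-only approach can be sketched in a few lines. The stopword sets below are tiny and purely illustrative; real lists, e.g. spaCy's per-language stop_words, run to a few hundred words per language.]

```python
# Tiny illustrative stopword lists, not spaCy's real ones.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "you", "that"},
    "da": {"og", "i", "det", "at", "en", "den", "til", "er"},
}

def stopword_lang(text):
    """Score each language by the fraction of tokens that are stopwords
    in that language, and return the best match (None for empty input)."""
    tokens = text.lower().split()
    if not tokens:
        return None
    hits = {
        lang: sum(tok in sw for tok in tokens) / len(tokens)
        for lang, sw in STOPWORDS.items()
    }
    return max(hits, key=hits.get)
```

Because stopwords are closed-class and high-frequency, this sidesteps the out-of-vocabulary problem for content words, at the cost of failing on very short texts that contain no stopwords at all.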

@SandeepNaidu

Loading all language models for language detection and iterating through them might bloat the process's memory. We already have #1600 for v2, which requires some coding effort and additional processing time to break up the document (paragraph by paragraph) and analyze it.

@nickdavidhaynes

Following up here - I wanted to play around with building an extension, so I put together a little pipeline component that integrates the CLD project (https://github.com/nickdavidhaynes/spacy-cld). Since it's tied to the NLP pipeline, it won't work as magically as @mollerhoj originally envisioned. But if you need "good enough" language detection as part of your processing pipeline, this should be relatively easy to incorporate.

cc @ines

@clesleycode

@nickdavidhaynes is this being incorporated into spaCy?

@nickdavidhaynes

@lesley2958 Not as far as I know. It's fairly simple to use in conjunction with spaCy (although let me know if the README isn't clear), but there aren't any plans to bring that package in particular directly into the main spaCy codebase.

@ines
Member

ines commented Jan 14, 2018

@lesley2958 One of the main reasons we've decided to open up the processing pipeline API in v2.0 is to make it easier to implement features like this as plugins – like @nickdavidhaynes' package, for example. Users who want to add those features to their pipeline can do so easily by installing the plugin and adding it via nlp.add_pipe. Developers who prefer a different approach, or want to integrate a different library or model, can do so by writing their own plugin, without having to worry about the core library.

We also prefer features that we ship with spaCy to be self-contained within the library, instead of adding more third-party dependencies. We might want to add a language detection model to spaCy in the future – but if we do so, it will be its own implementation. In the meantime, we think the plugin ecosystem is a good solution to allow users to add any features they like using any other library – no matter how specific they are.

@lock

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018