Feature request: Language detection #1172
This is planned, yes :). The simple solution can be derived from the unigram probabilities each language's vocab already stores; see the sketch below.
This selects the language under which the unigram probabilities are maximised, i.e. it's unigram Naive Bayes. It's important to include a prior over the languages as well, and it's very useful to have this prior adapt to context. For instance, if you're processing a sequence of texts, you probably want to let the language prediction of the previous text influence the language prediction of the next text. Depending on the application, it might also be important to pay attention to pre-processing decisions. For instance, say you have a Faroese sample with poor HTML cleaning: then when you process noisy text, suddenly you're tagging it all as Faroese! In short: improved language detection usually flows from making smart decisions specific to your application. Building more complicated language models is usually counter-productive, because it increases the risk that your language model will unexpectedly produce a very confident decision, overwhelming your contextual priors.
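A minimal sketch of the approach described above: sum each token's unigram log-probability under every candidate model, add a log-prior, and take the argmax. The model names and the uniform prior are only illustrative, and the sketch assumes the installed models populate token.prob with unigram log-probabilities, as the spaCy v1.x / early v2.x models discussed in this thread did.

```python
import math
import spacy

# Candidate models; the package names are placeholders for whatever you have installed.
MODELS = {
    "en": spacy.load("en_core_web_sm"),
    "de": spacy.load("de_core_news_sm"),
    "es": spacy.load("es_core_news_sm"),
}

# Uniform log-prior; in practice, adapt this to context (e.g. the previous text's language).
LOG_PRIOR = {lang: math.log(1.0 / len(MODELS)) for lang in MODELS}


def detect_language(text):
    """Return (best_language, scores) under unigram Naive Bayes with a prior."""
    scores = {}
    for lang, nlp in MODELS.items():
        # Tokenise only; token.prob is a vocab (lexeme) attribute, so no tagging/parsing is needed.
        doc = nlp.tokenizer(text)
        scores[lang] = sum(token.prob for token in doc) + LOG_PRIOR[lang]
    return max(scores, key=scores.get), scores
```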
Thank you honnibal, that's a nice little hack - and definitely enough for my use case!
Language ID is non-trivial; the existing libraries are not great and, as @honnibal said, it's very application-specific. The list of probabilities he suggested is great, much better than a set of probabilities that add up to one or less, because often the question is whether the probability for a given language is greater than some threshold. For one, it deals with the case when the text is in none of the languages in the list. But averaging or summing token probabilities has inherent drawbacks. (For comparing against a threshold, for example, you will want to average by token count.)
It sounds like your use case could have a lot of (2); in any case, many non-English texts contain a lot of English. But no major language ID library uses parsing as an input, to my knowledge. A simple hack could be to remove entities or boost stop words. Parse probability would be useful too.
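To make the thresholding point concrete, here is a hedged sketch: instead of normalising scores across languages, average the per-token log-probability under one candidate model and compare it against a cutoff. The threshold value is made up and would need tuning per model and domain.

```python
def is_probably_language(nlp, text, threshold=-9.0):
    """Average per-token log-probability under one model, compared to a tunable cutoff."""
    tokens = [t for t in nlp.tokenizer(text) if not t.is_space and not t.is_punct]
    if not tokens:
        return False
    avg_log_prob = sum(t.prob for t in tokens) / len(tokens)
    return avg_log_prob > threshold
```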
Can you please provide the exact code for this use case? I am trying to load each language and find the probabilities, which is time-consuming and memory-consuming. Can you please help me?
@tiru1930 At the moment I would recommend using an external language identification package, unfortunately. We still really want to provide this, but we don't have it yet.
@honnibal thank you, will check on this.
@honnibal it worked for the English, Spanish, and German models, but I couldn't make it work for the French model. I always get zeros; apparently prob returns zero for every word. I've tried on Python 2.7 and 3.5. I am running spaCy v1.9. Thanks!
Update: This might be a good use case for the new custom pipeline components in spaCy v2.0! https://spacy.io/usage/processing-pipelines#custom-components
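For instance, a hedged sketch of wrapping detection in a v2.0 custom component. The extension name is illustrative, the add_pipe call uses the v2.x callable-based signature, and it assumes the illustrative detect_language() helper from the earlier sketch is in scope.

```python
import spacy
from spacy.tokens import Doc

# Register a custom extension attribute to hold the detected language.
Doc.set_extension("language", default=None)

def language_detector(doc):
    # Assumes the illustrative detect_language() helper sketched earlier in this thread.
    doc._.language, _ = detect_language(doc.text)
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(language_detector, first=True)  # spaCy v2.x signature: pass the callable itself

doc = nlp("Dette er en dansk sætning.")
print(doc._.language)
```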
Just to follow up on earlier comments about the drawbacks of simple averaging, one old-school approach is to use the stopwords only. |
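As a rough illustration of the stopword-only idea (the module paths are for spaCy v2+, and the two languages are just examples):

```python
from spacy.lang.da.stop_words import STOP_WORDS as DA_STOP
from spacy.lang.en.stop_words import STOP_WORDS as EN_STOP

STOP_LISTS = {"da": DA_STOP, "en": EN_STOP}

def stopword_guess(words):
    """Pick the language whose stop-word list covers the most tokens; naive on short texts."""
    counts = {
        lang: sum(w.lower() in stops for w in words)
        for lang, stops in STOP_LISTS.items()
    }
    return max(counts, key=counts.get)

print(stopword_guess("og det er jo ikke så godt".split()))  # expected: "da"
```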
Loading all language models for language detection and iterating through them might bloat the memory of the process. We already have #1600 for v2, which requires some coding effort and additional processing time to break up the document (paragraph-wise) and analyse it.
Following up here - I wanted to play around with building an extension, so I put together a little pipeline component that integrates the CLD project (https://github.com/nickdavidhaynes/spacy-cld). Since it's tied to the NLP pipeline, it won't work as magically as @mollerhoj originally envisioned. But if you need "good enough" language detection as part of your processing pipeline, this should be relatively easy to incorporate. cc @ines
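For reference, usage looks roughly like the following, assuming the LanguageDetector component and the doc._.languages / doc._.language_scores extensions described in the spacy-cld README (check the README for the current API; the model name and scores shown are illustrative):

```python
import spacy
from spacy_cld import LanguageDetector

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(LanguageDetector())  # spaCy v2.x: add the component instance to the pipeline

doc = nlp("This is some English text.")
print(doc._.languages)        # e.g. ['en']
print(doc._.language_scores)  # e.g. {'en': 0.96}
```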
@nickdavidhaynes is this being incorporated into spaCy?
@lesley2958 Not as far as I know. It's fairly simple to use in conjunction with spaCy (although let me know if the README isn't clear), but there aren't any plans to bring that package in particular directly into the main spaCy codebase.
@lesley2958 One of the main reasons we've decided to open up the processing pipeline API in v2.0 is to make it easier to implement features like this as plugins – like @nickdavidhaynes' package, for example. Users who want to add those features to their pipeline can do so easily by installing the plugin and adding it via nlp.add_pipe().
We also prefer features that we ship with spaCy to be self-contained within the library, instead of adding more third-party dependencies. We might want to add a language detection model to spaCy in the future – but if we do so, it will be its own implementation. In the meantime, we think the plugin ecosystem is a good solution to allow users to add any features they like using any other library – no matter how specific they are.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Are there any plans to add language detection to spaCy? If the goal of the project is to be 'the Ruby on Rails of NLP' then I think it would make sense to include this feature.
(Just imagine how magical it would be - we wouldn't even need to call spacy.load(...); this could be done lazily once the language has been detected ;-). Ok, maybe not a good idea, but a useful feature nonetheless.)
A quick Google search has led me to believe that the approach taken by the cld2 project is state-of-the-art in this field. Essentially, it uses Naive Bayes on quadgrams.
I would like to hear your thoughts on implementing this feature. Would you want to use existing libraries, or should I try to train a classifier from scratch?
I'm about to develop a project where I need high accuracy on language detection for a corpus that has Danish and English sentences mixed together. So I need this feature, and I would love to not have to rely on external libraries.