Strange behaviour in using unique character information for filtering languages #87

Closed
dl1ely opened this issue Jan 8, 2021 · 1 comment
Labels
bug Something isn't working

dl1ely commented Jan 8, 2021

I am not very proficient at reading Kotlin code, so I cannot pinpoint the cause of this behaviour:

  • I am limiting the languages to GERMAN, ENGLISH, SPANISH, FRENCH, ITALIAN, DUTCH, PORTUGUESE
  • I am looking at the confidence values only
  • The detection text "Hätte gerne das" yields the confidence values GERMAN=1.0
  • The detection text "Hätte gerne das Angebot" yields the confidence values GERMAN=1.0, DUTCH=0.7305489529112811, FRENCH=0.6533180937401983, ITALIAN=0.5924134102645501, SPANISH=0.582455145441379, ENGLISH=0.5545393891315643, PORTUGUESE=0.5411208670641964

The detection still returns the correct result, but I am wondering why, in the second case, the library calculates confidence values at all for languages whose alphabets do not contain the letter "ä".

Is this a bug?

dl1ely changed the title from "Strangce behaviour in using unique character information for filtering languages" to "Strange behaviour in using unique character information for filtering languages" on Jan 8, 2021
pemistahl added the label "bug" (Something isn't working) on Jan 9, 2021
pemistahl added this to the Lingua 1.1.0 milestone on Jan 9, 2021
pemistahl (Owner) commented

Hi Stefan,

thanks for using my library and for discovering this strange behavior. Indeed, this is a bug in a calculation step in the rule-based language filter. The bug only occurs for an odd number of words as input. I've just fixed it in the commit referenced above. A nice side effect now is that accuracies go up a little for certain languages.

Greetings to Aachen. :) Closed.
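
For readers wondering how a filter on unique characters can depend on whether the word count is odd or even, here is a minimal, self-contained sketch. It is not taken from Lingua's source code: the class `UniqueCharFilterSketch`, the `UNIQUE_CHARS` table, and the "at least half of the words" threshold are all assumptions made for illustration. The sketch shows one way such a bug could arise, namely computing "half of the words" with integer division, which rounds down for odd counts.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UniqueCharFilterSketch {

    // The seven languages from the report.
    static final List<String> CANDIDATES = List.of(
        "GERMAN", "ENGLISH", "SPANISH", "FRENCH", "ITALIAN", "DUTCH", "PORTUGUESE");

    // Hypothetical table of characters assumed to occur in only one candidate
    // language: \u00e4 is "a with diaeresis", \u00df is "sharp s", \u00f1 is "n with tilde".
    static final Map<Character, String> UNIQUE_CHARS = Map.of(
        '\u00e4', "GERMAN",
        '\u00df', "GERMAN",
        '\u00f1', "SPANISH");

    // Returns a single language if at least "half" of the words contain one of its
    // unique characters; otherwise returns all candidates for statistical scoring.
    // integerHalf = true reproduces the assumed rounding problem.
    static List<String> filter(String text, boolean integerHalf) {
        String[] words = text.trim().split("\\s+");
        // Integer division rounds 3 / 2 down to 1; floating-point division gives 1.5.
        double threshold = integerHalf ? words.length / 2 : words.length / 2.0;

        Map<String, Integer> votes = new HashMap<>();
        for (String word : words) {
            for (char c : word.toCharArray()) {
                String language = UNIQUE_CHARS.get(c);
                if (language != null) {
                    votes.merge(language, 1, Integer::sum);
                    break; // count each word at most once
                }
            }
        }
        for (Map.Entry<String, Integer> entry : votes.entrySet()) {
            if (entry.getValue() >= threshold) {
                return List.of(entry.getKey());
            }
        }
        return CANDIDATES;
    }

    public static void main(String[] args) {
        String three = "H\u00e4tte gerne das";
        String four = "H\u00e4tte gerne das Angebot";
        System.out.println(filter(three, true));   // integer half, odd word count
        System.out.println(filter(three, false));  // floating-point half, odd word count
        System.out.println(filter(four, true));    // even word count
    }
}
```

Under these assumptions, the three-word input is narrowed to German only when the threshold is rounded down, while the four-word input keeps all seven candidates either way. That would match the confidence values reported above, but whether it matches the actual fix can only be confirmed from the referenced commit.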
