Strange behaviour in using unique character information for filtering languages #87

dl1ely · 2021-01-08T18:09:29Z

I am not very proficient at reading Kotlin code, so I cannot really pinpoint the cause of this behaviour:

I am limiting the languages to GERMAN, ENGLISH, SPANISH, FRENCH, ITALIAN, DUTCH, PORTUGUESE.
I am looking at the confidence values only.
Detection text "Hätte gerne das" leads to confidence values GERMAN=1.0
Detection text "Hätte gerne das Angebot" leads to confidence values GERMAN=1.0, DUTCH=0.7305489529112811, FRENCH=0.6533180937401983, ITALIAN=0.5924134102645501, SPANISH=0.582455145441379, ENGLISH=0.5545393891315643, PORTUGUESE=0.5411208670641964

The detection still returns the correct result, but I am wondering why, in the second case, the library even calculates confidence values for languages that do not contain the letter "ä" in their alphabets.

Is this a bug?
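
To make the expectation concrete, here is a minimal, self-contained sketch of a character-based pre-filter. It is not Lingua's code or its algorithm: the `Lang` enum, the `extraLetters` table and `filterByCharacters` are hypothetical names, and the letter sets are illustrative and incomplete. It only shows the behaviour the report expects, namely that a single "ä" should narrow the seven selected languages to German regardless of how many words the text has.

```kotlin
// Hypothetical stand-in for the seven languages selected in the report.
enum class Lang { GERMAN, ENGLISH, SPANISH, FRENCH, ITALIAN, DUTCH, PORTUGUESE }

// Illustrative, incomplete table of non-ASCII letters per language.
// These sets are assumptions for this sketch, not Lingua's internal data.
val extraLetters: Map<Lang, Set<Char>> = mapOf(
    Lang.GERMAN to "äöüß".toSet(),
    Lang.ENGLISH to emptySet(),
    Lang.SPANISH to "áéíóúüñ".toSet(),
    Lang.FRENCH to "àâçéèêëîïôùûüÿ".toSet(),
    Lang.ITALIAN to "àèéìòù".toSet(),
    Lang.DUTCH to "éëïóöü".toSet(),
    Lang.PORTUGUESE to "áâãàçéêíóôõú".toSet()
)

// Keeps only the candidates whose letter set covers every non-ASCII
// letter that occurs in the text. Text without such letters is not filtered.
fun filterByCharacters(text: String, candidates: Set<Lang>): Set<Lang> {
    val special = text.lowercase()
        .filter { it.isLetter() && it > '\u007F' }
        .toSet()
    if (special.isEmpty()) return candidates
    return candidates
        .filter { lang -> extraLetters.getValue(lang).containsAll(special) }
        .toSet()
}

fun main() {
    val selected = Lang.values().toSet()
    println(filterByCharacters("Hätte gerne das", selected))         // prints [GERMAN]
    println(filterByCharacters("Hätte gerne das Angebot", selected)) // prints [GERMAN]
}
```

With a filter of this kind, both example texts would leave German as the only candidate, so no confidence values would be computed for the other six languages.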


pemistahl · 2021-01-09T17:13:29Z

Hi Stefan,

thanks for using my library and for discovering this strange behavior. Indeed, this is a bug in a calculation step in the rule-based language filter. The bug only occurs for an odd number of words as input. I've just fixed it in the commit referenced above. A nice side effect now is that accuracies go up a little for certain languages.

Greetings to Aachen. :) Closed.

dl1ely changed the title ~~Strangce behaviour in using unique character information for filtering languages~~ Strange behaviour in using unique character information for filtering languages Jan 8, 2021

pemistahl added the bug label Jan 9, 2021

pemistahl added this to the Lingua 1.1.0 milestone Jan 9, 2021

pemistahl added a commit that referenced this issue Jan 9, 2021

Fix calculation bug in rule-based language filter (#87)

6a6d284

pemistahl closed this as completed Jan 9, 2021
