Confidence scoring #11

mfasanya · 2019-08-24T20:26:43Z

Awesome library! It works really well and performs best compared to all the other open source alternatives.

I'm looking to replace the Google detection API with your project, but the one thing it's missing is a confidence score similar to Google's:

{"language":"te","confidence":0.4294964}

Is this something you are thinking about adding in?

pemistahl · 2019-08-26T21:32:29Z

Hi Mitchell (@mfasanya), thank you for using my library and for this feature request.

I'm generally open to some kind of confidence metric. I'll think about a way to implement this.

MaciejGorczyca · 2019-08-28T15:28:51Z

This would be great and very helpful. In my opinion, two methods are missing to make this lib perfect: one that returns a list of all languages with their confidence values, and one that checks whether the language passed as an argument has a confidence higher than the value passed as an argument.

I used this code to return confidence:

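    // Variant of the library's detection method that returns the score of every
    // candidate language instead of a single result. It calls helpers that are not
    // shown here (filterLanguagesByRules, addNgramProbabilities, countUnigramsOfInputText).
    fun detectAllProbabilities(text: String): Map<Language, Double> {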
        val trimmedText = text
            .trim()
            .toLowerCase()
            .replace(PUNCTUATION, "")
            .replace(NUMBERS, "")
            .replace(MULTIPLE_WHITESPACE, " ")

        if (trimmedText.isEmpty() || NO_LETTER.matches(trimmedText)) return hashMapOf()

        val words = if (trimmedText.contains(' ')) trimmedText.split(" ") else listOf(trimmedText)

        val languagesSequence = filterLanguagesByRules(words)

        val textSequence = trimmedText.lineSequence()
        val allProbabilities = mutableListOf<Map<Language, Double>>()
        val unigramCountsOfInputText = mutableMapOf<Language, Int>()

        if (trimmedText.length >= 1) {
            val unigramLanguageModel = LanguageModel.fromTestData(textSequence, Unigram::class)
            addNgramProbabilities(allProbabilities, languagesSequence, unigramLanguageModel)
            countUnigramsOfInputText(unigramCountsOfInputText, unigramLanguageModel, languagesSequence)
        }
        if (trimmedText.length >= 2) {
            addNgramProbabilities(allProbabilities, languagesSequence, LanguageModel.fromTestData(textSequence, Bigram::class))
        }
        if (trimmedText.length >= 3) {
            addNgramProbabilities(allProbabilities, languagesSequence, LanguageModel.fromTestData(textSequence, Trigram::class))
        }
        if (trimmedText.length >= 4) {
            addNgramProbabilities(allProbabilities, languagesSequence, LanguageModel.fromTestData(textSequence, Quadrigram::class))
        }
        if (trimmedText.length >= 5) {
            addNgramProbabilities(allProbabilities, languagesSequence, LanguageModel.fromTestData(textSequence, Fivegram::class))
        }

        val summedUpProbabilities = hashMapOf<Language, Double>()
        for (language in languagesSequence) {
            summedUpProbabilities[language] = allProbabilities.sumByDouble { it[language] ?: 0.0 }

            if (unigramCountsOfInputText.containsKey(language)) {
                summedUpProbabilities[language] = summedUpProbabilities.getValue(language) / unigramCountsOfInputText.getValue(language)
            }
        }
        return summedUpProbabilities
    }

It's almost the same as the method detectLanguagesOf, but it returns a map of probabilities per language instead of a single language.

To calculate percentage confidences relative to the most confident language, I used a quick and dirty trick of comparing all values to the highest value.

            // Sort by probability, highest first; the LinkedHashMap keeps that order.
            allProbabilities = allProbabilities.entrySet().stream()
                .sorted(Collections.reverseOrder(Map.Entry.comparingByValue()))
                .collect(Collectors.toMap(Entry::getKey, Entry::getValue, (oldValue, newValue) -> oldValue, LinkedHashMap::new));
            // Shift all values so that the lowest probability becomes 0.
            Double minimumValue = Collections.min(allProbabilities.entrySet(), Map.Entry.comparingByValue()).getValue();
            allProbabilities.replaceAll((k, v) -> v - minimumValue);

            // Scale the shifted values so that the highest one becomes 100.
            final Map<Language, Double> allProbabilitiesPercentage = new LinkedHashMap<>(allProbabilities);
            Double maximumValue = Collections.max(allProbabilitiesPercentage.entrySet(), Map.Entry.comparingByValue()).getValue();
            allProbabilitiesPercentage.replaceAll((k, v) -> v / maximumValue * 100);

It will not tell you the real confidence; it tells you how far a language is from the most likely one. I also assume that the lowest probability value corresponds to 0%. I use it to make sure that short sequences (1-3 words) are indeed in the language I think they are. I take a list of sentences that I believe are in language X and check whether language X scores 90% or more relative to the language with the highest probability, which means that language X probably IS correct. Sometimes short sentences aren't classified correctly but the probability score is very close, so I'm simply checking for that.

It is very simple. First it computes:

value - minimumValue

for all values, and then:

value / maximumValue * 100

for all values to get a percentage.
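
The two steps above, together with the 90% check, can be sketched as a small self-contained Java program. This is only an illustration: the class name `RelativeScores`, the method names `toPercentages` and `isCloseToBest`, and the string language keys are made up for the example, and the raw scores are placeholder numbers, not output of the library. Unlike the snippet above, it sets every entry to 100 when all scores are equal, to avoid dividing by zero.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RelativeScores {

    // Rescales raw scores so that the lowest becomes 0 and the highest becomes 100.
    // Assumption: the map is non-empty. If all scores are equal, every entry gets 100.
    static Map<String, Double> toPercentages(Map<String, Double> rawScores) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double value : rawScores.values()) {
            min = Math.min(min, value);
            max = Math.max(max, value);
        }
        double range = max - min;
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Double> entry : rawScores.entrySet()) {
            double percent = range == 0.0 ? 100.0 : (entry.getValue() - min) / range * 100.0;
            result.put(entry.getKey(), percent);
        }
        return result;
    }

    // Returns true if the given language reaches at least minPercent
    // relative to the best-scoring language.
    static boolean isCloseToBest(Map<String, Double> percentages, String language, double minPercent) {
        Double value = percentages.get(language);
        return value != null && value >= minPercent;
    }

    public static void main(String[] args) {
        // Placeholder scores, invented for this example.
        Map<String, Double> raw = new LinkedHashMap<>();
        raw.put("FRENCH", 14.0);
        raw.put("ENGLISH", 13.8);
        raw.put("GERMAN", 11.0);

        Map<String, Double> percentages = toPercentages(raw);
        System.out.println(percentages);
        System.out.println(isCloseToBest(percentages, "ENGLISH", 90.0));
    }
}
```

With these placeholder numbers, the best language ends up at 100 and the worst at 0, so a language that scores close to the best one passes the 90% check even if it is not ranked first.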

That said, a method returning a confidence score (or even a probability score like the one I shared) would definitely be helpful! The probability score is some weird number; for me it was something between 11 and 14 for all languages, so I'm totally lost as to how to get a confidence out of that, and I'm not familiar with Kotlin.

chasetec · 2020-05-04T16:28:50Z

I'm also interested in getting a list of all languages found in a text value.

pemistahl · 2020-05-27T08:10:13Z

@mfasanya @MaciejGorczyca Better late than never: Yesterday, I implemented some kind of confidence metric. Here is an example:

val detector = LanguageDetectorBuilder.fromLanguages(
    GERMAN, ENGLISH, FRENCH, ITALIAN, SPANISH
).build()

println(detector.computeLanguageConfidenceValues(
    text = "des langues sont chouettes"
))

This prints the following sorted map:

{
    FRENCH=1.0, 
    ENGLISH=0.7814724667756149, 
    ITALIAN=0.7694094052498642, 
    GERMAN=0.713475791039237, 
    SPANISH=0.7063690485486268
}

The value 1.0 is always assigned to the first entry in the map. It simply means that the first language in the map, French in this case, is the most likely language for the given input. It does not necessarily mean that the first language is the correct one!

The confidence values for the other languages express the relative distance to the most likely language. So, for example, it is 1.0 - 0.78 = 0.22 = 22% less likely for English to be the correct language, compared to French. So this is not an absolute confidence metric but a relative one. This is very important to understand. An absolute metric is not possible.
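
As a rough illustration of how such relative values might be consumed, here is a minimal Java sketch that uses the numbers printed above. It does not call the library at all; the class name `ConfidenceGap`, the methods `gapToBest` and `isUnambiguous`, the string keys, and the margin of 0.1 are all assumptions made for the example.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

public class ConfidenceGap {

    // Distance between a language's relative confidence and the top value 1.0.
    static double gapToBest(Map<String, Double> confidences, String language) {
        Double value = confidences.get(language);
        if (value == null) {
            throw new IllegalArgumentException("unknown language: " + language);
        }
        return 1.0 - value;
    }

    // Returns true if the runner-up is at least `margin` below the first entry.
    // Assumption: the map is iterated in descending order of confidence,
    // as in the printed output above.
    static boolean isUnambiguous(Map<String, Double> confidences, double margin) {
        Iterator<Double> values = confidences.values().iterator();
        if (!values.hasNext()) {
            return false; // nothing was detected
        }
        double best = values.next();
        if (!values.hasNext()) {
            return true; // only one candidate language
        }
        return best - values.next() >= margin;
    }

    public static void main(String[] args) {
        // Values copied from the example output above.
        Map<String, Double> confidences = new LinkedHashMap<>();
        confidences.put("FRENCH", 1.0);
        confidences.put("ENGLISH", 0.7814724667756149);
        confidences.put("ITALIAN", 0.7694094052498642);
        confidences.put("GERMAN", 0.713475791039237);
        confidences.put("SPANISH", 0.7063690485486268);

        System.out.println(gapToBest(confidences, "ENGLISH"));
        System.out.println(isUnambiguous(confidences, 0.1));
    }
}
```

Because the metric is relative, a check like this only says how clearly the first language stands out from the others, not whether it is actually correct.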

Are you satisfied with this approach? I think it's quite useful. Please give me some feedback if possible. Thank you.

pemistahl · 2020-06-03T12:52:30Z

I'm closing this issue now as it has been resolved and there has not been any additional feedback.

pemistahl added the enhancement New feature or request label Aug 26, 2019

pemistahl added this to the Lingua 0.7.0 milestone Jan 5, 2020

violine1101 mentioned this issue Apr 11, 2020

Automatically detect tickets in other languages mojira/arisa-kt#60

Closed

pemistahl modified the milestones: Lingua 0.7.0, Lingua 1.0.0 May 25, 2020

pemistahl added a commit that referenced this issue May 26, 2020

Add method for computing confidence values (#11)

f79b6c0

pemistahl closed this as completed Jun 3, 2020

pemistahl added new feature and removed enhancement New feature or request labels Jun 16, 2020
