Confidence scoring #11

Closed
mfasanya opened this issue Aug 24, 2019 · 5 comments

@mfasanya

Hey @pemistahl

Awesome library. It works really well and performs better than all the other open source options.

I'm looking to replace Google's detection API with your project, but the one thing it's missing is a confidence score similar to Google's:

{"language":"te","confidence":0.4294964}

Is this something you are thinking about adding in?

@pemistahl added the enhancement label (New feature or request) on Aug 26, 2019
@pemistahl
Owner

Hi Mitchell (@mfasanya), thank you for using my library and for this feature request.

I'm generally open to some kind of confidence metric. I'll think about a way to implement this.

@MaciejGorczyca

MaciejGorczyca commented Aug 28, 2019

This would be great and very helpful. In my opinion, two methods are missing to make this library perfect: one that returns a list of all languages with their confidence values, and one that checks whether the confidence of a language passed as an argument is higher than a threshold passed as an argument.

I used this code to return confidence:

    fun detectAllProbabilities(text: String): Map<Language, Double> {
        val trimmedText = text
            .trim()
            .toLowerCase()
            .replace(PUNCTUATION, "")
            .replace(NUMBERS, "")
            .replace(MULTIPLE_WHITESPACE, " ")

        if (trimmedText.isEmpty() || NO_LETTER.matches(trimmedText)) return hashMapOf()

        val words = if (trimmedText.contains(' ')) trimmedText.split(" ") else listOf(trimmedText)

        val languagesSequence = filterLanguagesByRules(words)

        val textSequence = trimmedText.lineSequence()
        val allProbabilities = mutableListOf<Map<Language, Double>>()
        val unigramCountsOfInputText = mutableMapOf<Language, Int>()

        if (trimmedText.length >= 1) {
            val unigramLanguageModel = LanguageModel.fromTestData(textSequence, Unigram::class)
            addNgramProbabilities(allProbabilities, languagesSequence, unigramLanguageModel)
            countUnigramsOfInputText(unigramCountsOfInputText, unigramLanguageModel, languagesSequence)
        }
        if (trimmedText.length >= 2) {
            addNgramProbabilities(allProbabilities, languagesSequence, LanguageModel.fromTestData(textSequence, Bigram::class))
        }
        if (trimmedText.length >= 3) {
            addNgramProbabilities(allProbabilities, languagesSequence, LanguageModel.fromTestData(textSequence, Trigram::class))
        }
        if (trimmedText.length >= 4) {
            addNgramProbabilities(allProbabilities, languagesSequence, LanguageModel.fromTestData(textSequence, Quadrigram::class))
        }
        if (trimmedText.length >= 5) {
            addNgramProbabilities(allProbabilities, languagesSequence, LanguageModel.fromTestData(textSequence, Fivegram::class))
        }

        val summedUpProbabilities = hashMapOf<Language, Double>()
        for (language in languagesSequence) {
            summedUpProbabilities[language] = allProbabilities.sumByDouble { it[language] ?: 0.0 }

            if (unigramCountsOfInputText.containsKey(language)) {
                summedUpProbabilities[language] = summedUpProbabilities.getValue(language) / unigramCountsOfInputText.getValue(language)
            }
        }
        return summedUpProbabilities
    }

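It is almost the same as the method detectLanguagesOf, but it returns a map with the summed probability for every candidate language instead of only the detected language.

To calculate percentage confidences relative to the most confident language, I used the quick and dirty trick of comparing all values to the highest value.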

            // Sort entries by value in descending order, keeping that order in a LinkedHashMap.
            allProbabilities = allProbabilities.entrySet().stream()
                .sorted(Collections.reverseOrder(Map.Entry.comparingByValue()))
                .collect(Collectors.toMap(Entry::getKey, Entry::getValue, (oldValue, newValue) -> oldValue, LinkedHashMap::new));

            // Shift all values so that the lowest one becomes 0.
            Double minimumValue = Collections.min(allProbabilities.entrySet(), Map.Entry.comparingByValue()).getValue();
            allProbabilities.replaceAll((k, v) -> v - minimumValue);

            // Scale the shifted values so that the highest one becomes 100 (percent).
            final Map<Language, Double> allProbabilitiesPercentage = new LinkedHashMap<>(allProbabilities);
            Double maximumValue = Collections.max(allProbabilitiesPercentage.entrySet(), Map.Entry.comparingByValue()).getValue();
            allProbabilitiesPercentage.replaceAll((k, v) -> v / maximumValue * 100);

It will not tell you the real confidence; it tells you how far each language is from the most likely one. I also assume that the lowest probability value corresponds to 0%. I use it to make sure that short sequences (1-3 words) really are in the language I think they are: I take a list of sentences that I believe are in language X and check whether language X scores 90% or more relative to the most probable language, which means language X probably is correct. Sometimes short sentences aren't classified correctly but the probability scores are very close, so I simply check for that.

It is very simple. It computes:

value - minimumValue

for all values and then do:

value / maximumValue * 100

for all values to get a percentage.
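The two steps above can be sketched as a small standalone Java class. This is only an illustration, not code from the library: the class name `RelativeScores`, the method names, the plain `String` language keys, and the sample scores are all made up for this example, and treating the case where all scores are equal as 100% is an assumption.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the library.
public class RelativeScores {

    // Rescales raw scores so that the lowest becomes 0 and the highest becomes 100.
    // Entries are returned in descending order of score.
    public static Map<String, Double> rescale(Map<String, Double> rawScores) {
        Map<String, Double> result = new LinkedHashMap<>();
        if (rawScores.isEmpty()) {
            return result;
        }
        double min = Collections.min(rawScores.values());
        double max = Collections.max(rawScores.values());
        double range = max - min;

        rawScores.entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .forEach(e -> result.put(
                e.getKey(),
                // Assumption: if all scores are equal, report 100 for every language
                // instead of dividing by zero.
                range == 0.0 ? 100.0 : (e.getValue() - min) / range * 100.0));
        return result;
    }

    // Returns true if the given language reaches at least thresholdPercent
    // on the rescaled 0-100 scale.
    public static boolean isCloseToBest(Map<String, Double> rescaled,
                                        String language,
                                        double thresholdPercent) {
        Double value = rescaled.get(language);
        return value != null && value >= thresholdPercent;
    }

    public static void main(String[] args) {
        // Made-up raw scores in the 11-14 range, for illustration only.
        Map<String, Double> raw = new LinkedHashMap<>();
        raw.put("GERMAN", 11.0);
        raw.put("FRENCH", 14.0);
        raw.put("ENGLISH", 13.8);

        Map<String, Double> rescaled = rescale(raw);
        System.out.println(rescaled);
        System.out.println(isCloseToBest(rescaled, "ENGLISH", 90.0));
    }
}
```

Because the lowest score is pinned to 0%, the result depends on which languages are in the candidate set, so the percentages are only comparable within a single call.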

That said, a method returning a confidence score (or even a probability score like the one I shared) would definitely be helpful! The probability score is some odd number; for me it was somewhere between 11 and 14 for all languages, so I'm totally lost on how to get a confidence out of that, and I'm not familiar with Kotlin.

@chasetec

chasetec commented May 4, 2020

I'm also interested in getting a list of all languages found in a text value.

@pemistahl modified the milestones: Lingua 0.7.0, Lingua 1.0.0 on May 25, 2020
@pemistahl
Owner

@mfasanya @MaciejGorczyca Better late than never: Yesterday, I implemented some kind of confidence metric. Here is an example:

val detector = LanguageDetectorBuilder.fromLanguages(
    GERMAN, ENGLISH, FRENCH, ITALIAN, SPANISH
).build()

println(detector.computeLanguageConfidenceValues(
    text = "des langues sont chouettes"
))

This prints the following sorted map:

{
    FRENCH=1.0, 
    ENGLISH=0.7814724667756149, 
    ITALIAN=0.7694094052498642, 
    GERMAN=0.713475791039237, 
    SPANISH=0.7063690485486268
}

The value 1.0 is always assigned to the first entry in the map. It simply means that the first language in the map, French in this case, is the most likely language for the given input. It does not necessarily mean that the first language is the correct one!

The confidence values for the other languages express the relative distance to the most likely language. So, for example, it is 1.0 - 0.78 = 0.22 = 22% less likely for English to be the correct language, compared to French. So this is not an absolute confidence metric but a relative one. This is very important to understand. An absolute metric is not possible.
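As a rough illustration of how such relative values might be consumed, here is a standalone Java sketch that works on a plain map holding the numbers printed above. It does not call the library; the class name `ConfidenceGap`, its methods, the `String` keys, and the 0.25 cutoff are hypothetical and chosen only for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper operating on a copy of the printed values.
public class ConfidenceGap {

    // Relative distance of a language to the most likely one (which has value 1.0).
    public static double gapToBest(Map<String, Double> confidences, String language) {
        return 1.0 - confidences.get(language);
    }

    // Languages whose relative distance to the top entry is at most maxGap,
    // in the iteration order of the input map.
    public static List<String> candidatesWithin(Map<String, Double> confidences, double maxGap) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Double> entry : confidences.entrySet()) {
            if (1.0 - entry.getValue() <= maxGap) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    // The values from the example output above, in the same order.
    public static Map<String, Double> exampleValues() {
        Map<String, Double> values = new LinkedHashMap<>();
        values.put("FRENCH", 1.0);
        values.put("ENGLISH", 0.7814724667756149);
        values.put("ITALIAN", 0.7694094052498642);
        values.put("GERMAN", 0.713475791039237);
        values.put("SPANISH", 0.7063690485486268);
        return values;
    }

    public static void main(String[] args) {
        Map<String, Double> values = exampleValues();
        // Roughly 0.22 for English, matching the calculation in the text.
        System.out.println(gapToBest(values, "ENGLISH"));
        // Languages within an arbitrary 0.25 distance of the top entry.
        System.out.println(candidatesWithin(values, 0.25));
    }
}
```

A small gap between the first and second entry suggests the input is ambiguous, but since the metric is relative, a large gap still does not prove the top language is correct.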

Are you satisfied with this approach? I think it's quite useful. Please give me some feedback if possible. Thank you.

@pemistahl
Owner

I'm closing this issue now as it has been resolved and there has not been any additional feedback.

@pemistahl added the new feature label and removed the enhancement label (New feature or request) on Jun 16, 2020