Support disabling loading of quadrigram and fivegram models #136

Marcono1234 · 2022-05-22T21:43:01Z

Relates to #101

Adds the function LanguageDetectorBuilder.withoutQuadrigramAndFivegramModels(), which disables loading of quadrigram and fivegram models. These models account for the majority of memory use at runtime: if my measurements are correct, preloading all language models requires ~1783 MB, whereas the unigram, bigram and trigram models alone require ~110 MB. However, for longer texts LanguageDetector does not actually use the quadrigram and fivegram models.
Therefore, for use cases where it is known beforehand that most or all texts will be longer than ~120 chars, it should be relatively safe to disable loading of quadrigram and fivegram models.
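
To make the trade-off concrete, here is a rough sketch of the reasoning above. It is not code from this PR or from Lingua: the constant and function names are hypothetical, the ~120 char cutoff and the memory figures are simply the approximate values quoted in this description, and the real detector may decide differently.

```kotlin
// Hypothetical sketch; none of these names exist in Lingua.

// Approximate text length above which, per this PR description,
// quadrigram and fivegram models are not consulted (assumed value).
const val LONG_TEXT_THRESHOLD = 120

// Memory figures measured by the PR author, in MB.
const val ALL_MODELS_MB = 1783
const val UP_TO_TRIGRAM_MODELS_MB = 110

/** Returns true if a text of this length would still need quadrigram and fivegram models. */
fun needsHigherOrderModels(textLength: Int): Boolean =
    textLength <= LONG_TEXT_THRESHOLD

/** Memory that is not allocated when quadrigram and fivegram models are never loaded. */
fun savedMemoryMb(): Int = ALL_MODELS_MB - UP_TO_TRIGRAM_MODELS_MB

fun main() {
    println("Estimated saving: ${savedMemoryMb()} MB")
    println("40-char text needs higher-order models: ${needsHigherOrderModels(40)}")
    println("500-char text needs higher-order models: ${needsHigherOrderModels(500)}")
}
```

If nearly every input falls on the long side of the cutoff, the higher-order models are loaded but rarely or never consulted, which is the case the new builder function targets.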

Any feedback, especially regarding the builder function name and documentation, is appreciated.

Marcono1234 · 2022-05-22T21:45:02Z

src/main/kotlin/com/github/pemistahl/lingua/api/LanguageDetectorBuilder.kt

+     * language detection. This affects both dynamically loaded models as well as
+     * [preloaded models][withPreloadedLanguageModels].


Maybe the wording "dynamically loaded models as well as preloaded models" is a bit misleading. It might not be clear enough what "dynamically loaded" means, and it might sound as if even a detector with preloaded models could still load models dynamically.

Any suggestions for alternative wordings, or is this sentence fine?

Marcono1234 · 2022-05-23T01:31:05Z

Marked this as draft again because the usage of ngrams, and specifically the usage of quadrigrams and fivegrams, is probably an implementation detail and I am not sure if it is a good idea to expose this in the public API.

Would also be interesting to know how useful such a method would be to users of this library.

pemistahl · 2022-06-05T19:10:52Z

Hi @Marcono1234, thank you for this useful idea. I've actually implemented this a bit differently than you. There is now LanguageDetectorBuilder.withoutHighAccuracyMode() which loads only trigrams and nothing else. Trigrams are enough for longer sentences, so we can ignore unigrams and bigrams in addition to quadrigrams and fivegrams.

In the accuracy reports and plots, you now find separate statistics for what I call low accuracy mode (trigrams only) and high accuracy mode (all ngrams).
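
For readers skimming this thread, the difference between the two modes can be pictured with a toy builder. This is only an illustration under stated assumptions: DetectorConfig and DetectorConfigBuilder are hypothetical stand-ins, not Lingua's classes, and only the flag name and the "trigrams only" versus "all ngrams" behaviour are taken from the comment above.

```kotlin
// Toy model of the builder flag; these classes are NOT Lingua's real API.
data class DetectorConfig(val ngramOrdersToLoad: Set<Int>)

class DetectorConfigBuilder {
    // High accuracy mode is assumed to be the default.
    private var highAccuracyMode = true

    // Mirrors the flag name mentioned in the comment above.
    fun withoutHighAccuracyMode(): DetectorConfigBuilder {
        highAccuracyMode = false
        return this
    }

    fun build(): DetectorConfig =
        if (highAccuracyMode) {
            DetectorConfig(setOf(1, 2, 3, 4, 5)) // high accuracy mode: all ngram models
        } else {
            DetectorConfig(setOf(3)) // low accuracy mode: trigrams only
        }
}

fun main() {
    val high = DetectorConfigBuilder().build()
    val low = DetectorConfigBuilder().withoutHighAccuracyMode().build()
    println("High accuracy mode loads ngram orders ${high.ngramOrdersToLoad}")
    println("Low accuracy mode loads ngram orders ${low.ngramOrdersToLoad}")
}
```

Compared with the original proposal in this PR, the low accuracy mode in this sketch also skips unigram and bigram models, so the memory saving would be at least as large as the one estimated in the description.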

Support disabling loading of quadrigram and fivegram models

461a243

Marcono1234 marked this pull request as draft May 23, 2022 01:28

pemistahl added a commit that referenced this pull request Jun 2, 2022

Add flag to disable high accuracy mode (#101 #136)

a845fe4

pemistahl closed this Jun 5, 2022

Marcono1234 deleted the marcono1234/builder-disable-quadri-fivegram-loading branch June 7, 2022 20:20
