Support disabling loading of quadrigram and fivegram models #136

Marcono1234
Contributor

Relates to #101

Adds the function LanguageDetectorBuilder.withoutQuadrigramAndFivegramModels(), which disables loading of quadrigram and fivegram models. These models account for the majority of memory usage at runtime; if my measurements are correct, preloading all language models requires ~1783 MB, whereas the unigram, bigram and trigram models alone require ~110 MB. However, for larger texts LanguageDetector does not actually use the quadrigram and fivegram models.
Therefore, for use cases where it is known beforehand that most or all texts will be longer than ~120 chars, it should be relatively safe to disable loading of quadrigram and fivegram models.
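
To make the trade-off concrete, here is a minimal, self-contained Java sketch. It is not Lingua code: the class `HighOrderModelSavings` and its methods are hypothetical names used only for illustration, the memory figures are the approximate measurements quoted above, and the 120-character threshold is the rough value mentioned above. It takes the strictest reading of "most or all texts" and requires every sample text to be longer than the threshold.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lingua: models the trade-off described above.
final class HighOrderModelSavings {
    // Approximate measurements quoted in this pull request, in MB.
    static final int ALL_MODELS_MB = 1783;
    static final int UP_TO_TRIGRAM_MB = 110;
    // Rough text length above which quadrigrams and fivegrams are reported to be unused.
    static final int LONG_TEXT_THRESHOLD = 120;

    /** Memory saved by not loading quadrigram and fivegram models, in MB. */
    static int savedMegabytes() {
        return ALL_MODELS_MB - UP_TO_TRIGRAM_MB;
    }

    /** Saved memory as a fraction of the memory needed for all models. */
    static double savedFraction() {
        return (double) savedMegabytes() / ALL_MODELS_MB;
    }

    /** True only if every sample text is longer than the threshold (strictest reading). */
    static boolean safeToDisable(List<String> sampleTexts) {
        return sampleTexts.stream().allMatch(t -> t.length() > LONG_TEXT_THRESHOLD);
    }

    public static void main(String[] args) {
        System.out.println("Saved: ~" + savedMegabytes() + " MB"); // prints "Saved: ~1673 MB"

        String longText = new String(new char[200]).replace('\0', 'x'); // 200 characters
        System.out.println(safeToDisable(Arrays.asList(longText)));          // prints "true"
        System.out.println(safeToDisable(Arrays.asList(longText, "short"))); // prints "false"
    }
}
```

With these numbers, skipping the two largest model types would save roughly 1673 MB, assuming the measurements hold for other environments as well.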

Any feedback, especially regarding the builder function name and documentation, is appreciated.

Comment on lines +69 to +70
* language detection. This affects both dynamically loaded models as well as
* [preloaded models][withPreloadedLanguageModels].
Contributor Author

@Marcono1234 Marcono1234 May 22, 2022

Maybe the wording "dynamically loaded models as well as preloaded models" is a bit misleading. It might not be clear enough what this means by "dynamically loaded", and it might sound as if even a detector with preloaded models could dynamically load models.

Any suggestions for alternative wordings, or is this sentence fine?

@Marcono1234 Marcono1234 marked this pull request as draft May 23, 2022 01:28
@Marcono1234
Contributor Author

Marked this as draft again because the usage of ngrams, and specifically the usage of quadrigrams and fivegrams, is probably an implementation detail and I am not sure if it is a good idea to expose this in the public API.

Would also be interesting to know how useful such a method would be to users of this library.

@pemistahl
Owner

Hi @Marcono1234, thank you for this useful idea. I've actually implemented this a bit differently than you. There is now LanguageDetectorBuilder.withoutHighAccuracyMode() which loads only trigrams and nothing else. Trigrams are enough for longer sentences, so we can ignore unigrams and bigrams in addition to quadrigrams and fivegrams.

In the accuracy reports and plots, you now find separate statistics for what I call low accuracy mode (trigrams only) and high accuracy mode (all ngrams).
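
As a rough illustration of the two modes, the following self-contained Java sketch lists which ngram orders each mode would need and shows what the trigrams of a short string look like. It does not use Lingua's classes: `AccuracyModeSketch`, `Mode`, `ngramOrdersToLoad` and `ngrams` are hypothetical names, and the real implementation may well differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of the two modes; not Lingua's actual implementation.
final class AccuracyModeSketch {
    enum Mode { LOW, HIGH }

    /** Ngram orders needed per mode: trigrams only (low) or unigrams to fivegrams (high). */
    static List<Integer> ngramOrdersToLoad(Mode mode) {
        if (mode == Mode.LOW) {
            return Collections.singletonList(3);
        }
        return Arrays.asList(1, 2, 3, 4, 5);
    }

    /** All contiguous substrings of length n, counted in UTF-16 code units. */
    static List<String> ngrams(String text, int n) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            result.add(text.substring(i, i + n));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(ngramOrdersToLoad(Mode.LOW));  // prints "[3]"
        System.out.println(ngramOrdersToLoad(Mode.HIGH)); // prints "[1, 2, 3, 4, 5]"
        System.out.println(ngrams("hello", 3));           // prints "[hel, ell, llo]"
    }
}
```

A text shorter than three characters yields no trigrams at all in this sketch, which hints at why a trigram-only mode is better suited to longer input.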

@pemistahl pemistahl closed this Jun 5, 2022
@Marcono1234 Marcono1234 deleted the marcono1234/builder-disable-quadri-fivegram-loading branch June 7, 2022 20:20
2 participants