Fix running tests on Windows and exclude corrupted characters from character-to-language mapping for Latvian #92
On Windows, the value of the file.encoding system property depends on the user's regional settings and is not set to UTF-8 by default. As a result, tests that include strings with non-ASCII characters fail on Windows. The solution is to pass file.encoding=UTF-8 as a JVM parameter in gradle.properties.
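The failure mode can be reproduced in isolation. The following is a minimal sketch, not code from this pull request: the class `EncodingDemo` and the helper `reinterpret` are hypothetical names, and ISO-8859-1 stands in for whatever regional code page a Windows machine might use. It shows that UTF-8 bytes decoded with a different charset no longer compare equal to the original string.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingDemo {

    // Hypothetical helper: encodes the text as UTF-8 and decodes the bytes
    // with the given charset, imitating a JVM whose default charset differs
    // from the encoding of the source or resource file.
    static String reinterpret(String text, Charset assumed) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, assumed);
    }

    public static void main(String[] args) {
        // U+0101 is the Latvian letter "a with macron"; the escape keeps this
        // source file pure ASCII so the compiler's encoding does not matter.
        String letter = "\u0101";

        // The charset this JVM uses when none is given explicitly.
        System.out.println("default charset: " + Charset.defaultCharset());

        // Round trip with the matching charset preserves the text.
        System.out.println(reinterpret(letter, StandardCharsets.UTF_8).equals(letter));

        // Decoding with a single-byte charset yields two unrelated characters.
        System.out.println(reinterpret(letter, StandardCharsets.ISO_8859_1).equals(letter));
    }
}
```

Setting file.encoding for the build JVM addresses this at the configuration level. Passing an explicit charset in the code that reads or writes text is a complementary option that does not depend on the machine's settings.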
When a text file is written, its newline characters depend on the OS: Windows uses a carriage return followed by a line feed (\r\n), whereas Linux and macOS use only a line feed (\n). The newline characters in a developer's source code depend on IDE and/or Git settings. Therefore, tests with multiline strings may fail unless newline characters are normalized.
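One way to normalize newlines before comparing strings is sketched below. The class `NewlineDemo` and the helper `normalizeNewlines` are hypothetical and only illustrate the idea; the tests in this pull request may normalize differently.

```java
final class NewlineDemo {

    // Hypothetical helper: converts Windows (\r\n) and lone carriage-return
    // (\r) line endings to a single line feed (\n).
    static String normalizeNewlines(String text) {
        // Replace the two-character sequence first so that the second step
        // only sees carriage returns that were not part of \r\n.
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) {
        String windowsText = "first line\r\nsecond line";
        String unixText = "first line\nsecond line";

        // The raw strings differ only in their line endings.
        System.out.println(windowsText.equals(unixText));

        // After normalization the comparison no longer depends on the OS,
        // IDE, or Git settings that produced the text.
        System.out.println(normalizeNewlines(windowsText).equals(normalizeNewlines(unixText)));
    }
}
```

Applying such a helper to both the expected and the actual value in an assertion keeps the test independent of how the source file was checked out.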
The text corpus (or corpora) used to build the language model for Latvian was not clean and contained corrupted characters, presumably due to encoding errors. These characters were removed from the character-to-language mappings, and the tests were fixed for words that contain such non-native characters.