Fix running tests on Windows and exclude corrupted characters from character-to-language mapping for Latvian #92

janissl · 2021-04-30T16:39:39Z

On Windows, the value of file.encoding depends on the user's regional settings and is not UTF-8 by default. As a result, tests that include strings with non-ASCII characters fail on Windows. The solution is to pass file.encoding as a JVM parameter in gradle.properties.
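
The failure mode can be reproduced in isolation. The following is a minimal sketch, not code from this pull request: the class and method names are hypothetical, and ISO-8859-1 merely stands in for whatever non-UTF-8 code page a Windows machine might use by default. It shows that UTF-8 bytes decoded with a different charset no longer match the original string.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Encodes `text` as UTF-8 and decodes the bytes with `decodeWith`.
    // Hypothetical helper for illustration only.
    static String decodeUtf8BytesWith(String text, Charset decodeWith) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, decodeWith);
    }

    public static void main(String[] args) {
        // Unicode escape for "a with macron" keeps this source file pure ASCII.
        String word = "J\u0101nis";

        // The JVM default depends on the platform and on -Dfile.encoding.
        System.out.println("default charset: " + Charset.defaultCharset());

        // Decoding with the matching charset restores the string.
        System.out.println(decodeUtf8BytesWith(word, StandardCharsets.UTF_8).equals(word)); // true

        // Decoding with a single-byte charset (assumed stand-in) garbles it.
        System.out.println(decodeUtf8BytesWith(word, StandardCharsets.ISO_8859_1).equals(word)); // false
    }
}
```

Pinning the encoding for the JVM, as described above, removes the dependency on the machine's regional settings.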

When writing a text file, the newline characters depend on the OS: Windows uses a carriage return followed by a line feed (\r\n), whereas Linux and macOS use just a line feed (\n). The newline characters in a developer's source code depend on IDE and/or Git settings. Therefore, tests with multiline strings may fail unless newline characters are normalized.
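
One way to make such comparisons OS-independent is to normalize both sides before asserting equality. This is a minimal sketch under that assumption, not the exact change made in this pull request; the class and method names are hypothetical.

```java
class NewlineDemo {
    // Converts Windows (\r\n) and bare carriage return (\r) line endings
    // to a single line feed (\n). Hypothetical helper for illustration.
    static String normalizeNewlines(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) {
        String expected = "first line\nsecond line\n";       // as written in the test source
        String readFromFile = "first line\r\nsecond line\r\n"; // as written to disk on Windows

        // Compare only after both sides use the same line endings.
        System.out.println(
            normalizeNewlines(expected).equals(normalizeNewlines(readFromFile))); // true
    }
}
```

Normalizing both the expected content and the content read from file means the test does not depend on which side happens to contain \r\n.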

The text corpus (or corpora) used to build the language model for Latvian was not clean and contained corrupted characters, presumably due to encoding errors. These corrupted characters were removed from the character-to-language mappings, and the tests were fixed for words that contain such non-native characters.
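
The idea behind this cleanup can be sketched as follows. This is an illustration only and not Lingua's actual code: the map shape, the helper name, and the language labels are all hypothetical. Given a mapping from characters to the languages they hint at, the sketch removes a language from every character that does not belong to that language's alphabet.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class CharMappingDemo {
    // Lowercase Latvian letters with diacritics, written as Unicode escapes
    // so that the source file stays pure ASCII.
    static final String LATVIAN_DIACRITICS =
        "\u0101\u010D\u0113\u0123\u012B\u0137\u013C\u0146\u0161\u016B\u017E";

    // Returns a copy of `mapping` in which `language` is removed from every
    // character that is not listed in `allowed`. Entries left without any
    // language are dropped. Hypothetical helper for illustration.
    static Map<Character, Set<String>> dropLanguageForForeignChars(
            Map<Character, Set<String>> mapping, String language, String allowed) {
        Map<Character, Set<String>> cleaned = new LinkedHashMap<>();
        for (Map.Entry<Character, Set<String>> entry : mapping.entrySet()) {
            Set<String> languages = new TreeSet<>(entry.getValue());
            if (allowed.indexOf(entry.getKey().charValue()) < 0) {
                languages.remove(language);
            }
            if (!languages.isEmpty()) {
                cleaned.put(entry.getKey(), languages);
            }
        }
        return cleaned;
    }
}
```

With such a mapping, a word containing a character that is foreign to Latvian would no longer produce Latvian as a candidate, which is the behavior the fixed tests check for.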

pemistahl · 2021-05-02T10:12:55Z

Hello Jānis,

Thank you very much for this pull request. I develop on macOS, so I wasn't aware of the Windows problems you mentioned. I have merged your changes and will soon release Lingua 1.1.0, which will include them. :)

janissl added 4 commits April 29, 2021 22:53

Exclude corrupted characters from Latvian chars-to-language mapping

88cd672

Ensure the required encoding for JVM on Windows

43c9dc8

Exclude Latvian from results for input with corrupted non-native characters

f770161

Ensure that newline characters are identical in both the expected content and the content read from file

2ba9763

pemistahl changed the base branch from master to v1.1.0-wip May 1, 2021 09:33

Merge branch 'v1.1.0-wip' into master

166197d

pemistahl added a commit that referenced this pull request May 2, 2021

Fix errors in rule engine for Latvian (#92)

6e8a24c

pemistahl merged commit 3f77dac into pemistahl:v1.1.0-wip May 2, 2021

janissl deleted the master branch May 3, 2021 09:21
