Detect, preserve encoding when revising files #62

Open · wants to merge 6 commits into main
Conversation

@falquaddoomi (Collaborator) commented Oct 9, 2024

This PR addresses issue #61, in which a user reported errors when reading GBK-encoded files (e.g., files containing Chinese characters). To address the issue, I first run chardet.detect() on each input file, then use the resulting encoding when reading and writing the file. The PR includes a test verifying that a few GBK-encoded characters make it through the revision process.

This PR introduces a dependency on chardet in order to detect the encodings of input files.

EDIT: In the process of having the PR reviewed, we decided to switch to charset_normalizer, which offers similar functionality to chardet but with greater accuracy and speed. Thanks @d33bs for the suggestion!

Closes #61.
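For illustration, here is a compact sketch of the pattern described above: detect once, then pass the detected encoding to both `open()` calls. The `revise_file` wrapper and the pass-through body are hypothetical, not the actual editor code.

```python
import chardet


def revise_file(input_filepath, output_filepath):
    # detect the source encoding from the raw bytes of the input file;
    # chardet.detect() may report None (e.g. for an empty file), so fall back to UTF-8
    src_encoding = chardet.detect(input_filepath.read_bytes())["encoding"] or "utf-8"

    # read and write with the detected encoding so non-UTF-8 content
    # (e.g. GBK-encoded Chinese text) survives the revision round-trip
    with open(input_filepath, "r", encoding=src_encoding) as infile, open(
        output_filepath, "w", encoding=src_encoding
    ) as outfile:
        outfile.write(infile.read())  # placeholder for the actual revision step
```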

```python
with open(input_filepath, "r") as infile, open(output_filepath, "w") as outfile:
# detect the input file encoding using chardet
# maintain that encoding when reading and writing files
src_encoding = chardet.detect(input_filepath.read_bytes())["encoding"]
```
@falquaddoomi (Collaborator, Author) commented Oct 9, 2024:

FYI, I'm currently looking into how much of each file we need to read. Reading the entire thing is the safest choice, but chardet might be able to accurately detect the encoding with less data.

Alternatively, I may switch to using UniversalDetector (https://chardet.readthedocs.io/en/latest/usage.html#advanced-usage); I'll have to experiment with the confidence level we should use before stopping. (FWIW, this would all be a lot easier to decide if we had the failing markdown files, so hopefully we'll get those soon.)
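For reference, the incremental API mentioned above looks roughly like the following; the 4 KiB chunk size is an arbitrary illustration, not a recommendation.

```python
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
with open(input_filepath, "rb") as fh:
    # feed the file in chunks and stop as soon as the detector is confident,
    # rather than reading the whole file up front
    for chunk in iter(lambda: fh.read(4096), b""):
        detector.feed(chunk)
        if detector.done:
            break
detector.close()

src_encoding = detector.result["encoding"]    # may be None if detection failed
confidence = detector.result["confidence"]
```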

Reply:

OK, understood. I have revised line 283:

```python
with open(input_filepath, "r", encoding='utf-8') as infile, open(output_filepath, "w", encoding='utf-8') as outfile:
```

@falquaddoomi requested a review from @d33bs on November 6, 2024
@d33bs (Collaborator) left a comment:

Nice job! This looks like a good change to address the issue that was brought up. I left a few comments about various considerations. Additionally, you might consider pulling in the most recent changes from main, which I believe would allow this PR to observe and document passing tests prior to a merge.

Resolved review threads:
- setup.py (outdated)
- tests/test_editor.py (three outdated threads, one current)
- libs/manubot_ai_editor/editor.py
(This thread was opened on the same chardet.detect() lines quoted above.)
@d33bs (Collaborator):

I'm unsure how large the files passed in here might be. Would it make sense to consider reading only a portion of the file if it were very large to help conserve time?

@falquaddoomi (Collaborator, Author):

I thought about that (see #62 (comment)), but these files are rarely larger than a few kilobytes, since they're the text of sections of a paper, not binary files. I concluded that the (small, granted) extra engineering effort wasn't worth a difference of milliseconds, especially when it could decrease the accuracy of the encoding detection. Also, the revision process itself takes on the order of seconds to complete since it relies on an external API, so improvements here would IMHO go unnoticed.

I'm of course willing to revise my opinion if we really do have large files that need to be detected or if the difference would be significant. I think your suggestion of using charset_normalizer would also improve both speed and accuracy, so perhaps it'll be enough to switch to it.
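For completeness, the partial-read alternative discussed in this thread would look something like the sketch below; the 64 KiB cap is purely illustrative.

```python
import chardet

# detect from only the first 64 KiB instead of the whole file; cheaper for
# very large inputs, at the possible cost of detection accuracy
with open(input_filepath, "rb") as fh:
    head = fh.read(64 * 1024)
src_encoding = chardet.detect(head)["encoding"]
```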

@falquaddoomi (Collaborator, Author):

FYI, I ended up switching to charset_normalizer in e31679e, which seems to be working well and is definitely faster.

This statement in their README gave me brief pause, though:

> I don't care about the originating charset encoding, because two different tables can produce two identical rendered string. What I want is to get readable text, the best I can.

My goal is to preserve the source encoding, not just get readable text, although perhaps those two goals aren't so different. I added a somewhat more comprehensive test of Chinese character encoding and didn't find any issues, so I'm going to assume things are ok for now.
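For reference, encoding detection with charset_normalizer can look like the sketch below; this is an illustration of the library's `from_path` API, not necessarily the exact code in e31679e.

```python
from charset_normalizer import from_path

# pick the best-guess encoding for the input file; best() returns None
# when detection fails, so fall back to UTF-8 in that case
match = from_path(input_filepath).best()
src_encoding = match.encoding if match is not None else "utf-8"

with open(input_filepath, "r", encoding=src_encoding) as infile, open(
    output_filepath, "w", encoding=src_encoding
) as outfile:
    ...
```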
