Detect and preserve encoding when revising files #62
base: main
Conversation
…revising files, preserving it in the output files.
… for characters in the result.
libs/manubot_ai_editor/editor.py (Outdated)
```python
with open(input_filepath, "r") as infile, open(output_filepath, "w") as outfile:
    # detect the input file encoding using chardet
    # maintain that encoding when reading and writing files
    src_encoding = chardet.detect(input_filepath.read_bytes())["encoding"]
```
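For context, here is a minimal sketch of how the detected encoding could then be applied when opening the files; the hunk above cuts off before that point, so the ordering, the utf-8 fallback, and the pass-through write are my assumptions rather than the PR's exact code:

```python
import chardet

# detect the encoding up front, falling back to utf-8 if detection is inconclusive
src_encoding = chardet.detect(input_filepath.read_bytes())["encoding"] or "utf-8"

with open(input_filepath, "r", encoding=src_encoding) as infile, open(
    output_filepath, "w", encoding=src_encoding
) as outfile:
    outfile.write(infile.read())  # placeholder for the actual revision step
```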
FYI, I'm currently looking into how much of each file we need to read. Reading the entire thing is the safest choice, but chardet might be able to accurately detect the encoding with less data.

Alternatively, I may switch to using UniversalDetector (https://chardet.readthedocs.io/en/latest/usage.html#advanced-usage); I'll have to experiment with the confidence level we should use before stopping. (FWIW, this would all be a lot easier to decide if we had the failing markdown files, so hopefully we'll get those soon.)
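For reference, the incremental pattern from chardet's advanced-usage docs looks roughly like this (the chunk size is arbitrary and the variable names are illustrative):

```python
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
with open(input_filepath, "rb") as fh:
    # feed the file in chunks and stop as soon as the detector is confident
    for chunk in iter(lambda: fh.read(4096), b""):
        detector.feed(chunk)
        if detector.done:
            break
detector.close()
src_encoding = detector.result["encoding"]
confidence = detector.result["confidence"]
```

Whether this is worth it depends on how confident we need the result to be before stopping early.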
Nice job! This looks like a good change to address the issue that was raised. I left a few comments about various considerations. Additionally, you might consider pulling in the most recent changes from main, which I believe would allow this PR to demonstrate passing tests prior to a merge.
libs/manubot_ai_editor/editor.py (Outdated)
I'm unsure how large the files passed in here might be. Would it make sense to consider reading only a portion of the file if it were very large to help conserve time?
I thought about that (see #62 (comment)), but these files are rarely larger than a few kilobytes, since they're the text of sections of a paper, not binary files. I concluded that the (small, granted) extra engineering effort wasn't worth a difference of milliseconds, especially when it could decrease the accuracy of the encoding detection. Also, the revision process itself takes on the order of seconds to complete since it relies on an external API, so improvements here would IMHO go unnoticed.
I'm of course willing to revise my opinion if we really do have large files that need to be detected or if the difference would be significant. I think your suggestion of using charset_normalizer would also improve both speed and accuracy, so perhaps it'll be enough to switch to it.
FYI, I ended up switching to charset_normalizer in e31679e, which seems to be working well and is definitely faster.
This statement in their README gave me brief pause, though:
> I don't care about the originating charset encoding, because two different tables can produce two identical rendered string. What I want is to get readable text, the best I can.
My goal is to preserve the source encoding, not just get readable text, although perhaps those two goals aren't so different. I added a somewhat more comprehensive test of Chinese character encoding and didn't find any issues, so I'm going to assume things are ok for now.
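For reference, a minimal sketch of how the detected encoding can be pulled out of charset_normalizer; the utf-8 fallback is my assumption, not necessarily what e31679e does:

```python
from charset_normalizer import from_path

# inspect the raw bytes and keep the best-guess encoding, if any
best = from_path(input_filepath).best()
src_encoding = best.encoding if best is not None else "utf-8"
```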
…nv vars to specify src/dest encoding manually. Other minor touchups.
…g specification vs. autodetection.
This PR addresses issue #61, in which a user reported errors when reading GBK-encoded files, e.g. files containing Chinese characters. To address the issue, I first run chardet.detect() on each input file, then use the resulting encoding when reading and writing that file. The PR includes a test that a few GBK-encoded characters make it through the revision process. This PR introduces a dependency on chardet in order to detect the encodings of input files.
EDIT: In the process of having the PR reviewed, we decided to switch to charset_normalizer, which offers similar functionality to chardet but with greater accuracy and speed. Thanks @d33bs for the suggestion! Closes #61.
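As a rough illustration of the round-trip behavior the test is after, here is a minimal sketch, assuming charset_normalizer's from_path, a pytest tmp_path fixture, and illustrative file names and sample text; it is not the actual test in the PR, and on a short sample the detector may report a GBK superset such as gb18030, which decodes these bytes the same way:

```python
from charset_normalizer import from_path


def test_gbk_characters_survive_roundtrip(tmp_path):
    src = tmp_path / "input.md"
    dst = tmp_path / "output.md"
    text = "这是一个用于测试编码检测的中文段落，其中包含若干常见汉字。"
    src.write_bytes(text.encode("gbk"))

    # detect the source encoding from the raw bytes, then reuse it for reading and writing
    best = from_path(src).best()
    encoding = best.encoding if best is not None else "utf-8"
    dst.write_text(src.read_text(encoding=encoding), encoding=encoding)

    # the Chinese characters should make it through unchanged
    assert dst.read_text(encoding=encoding) == text
```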