Detect and preserve encoding when revising files #62

Open
wants to merge 6 commits into
base: main
9 changes: 8 additions & 1 deletion README.md
@@ -70,7 +70,8 @@ export OPENAI_API_KEY=ABCD1234

You can also provide other environment variables that will change the behavior of the editor (such as revising certain files only).
For example, to specify the temperature parameter of OpenAI models, you can set the variable `export AI_EDITOR_TEMPERATURE=0.50`.
[See the complete list of supported variables](https://github.com/manubot/manubot-ai-editor/blob/main/libs/manubot_ai_editor/env_vars.py) documents.
See [the complete list of supported variables](https://github.com/manubot/manubot-ai-editor/blob/main/docs/env-vars.md) for
more information.

Then, from the root directory of your Manubot manuscript, run the following:

@@ -97,6 +98,12 @@ When it finishes, check out your manuscript files.
This will allow you to detect whether the editor is identifying paragraphs correctly.
If you find a problem, please [report the issue](https://github.com/manubot/manubot-ai-editor/issues).

Manubot AI Editor will make a best effort to detect and preserve the encoding of your input files when creating revised files.
If you prefer to have your files read or written using a different encoding, you can specify it with the
`AI_EDITOR_SRC_ENCODING` and `AI_EDITOR_DEST_ENCODING` environment variables. See
[these variables' help docs](https://github.com/manubot/manubot-ai-editor/blob/main/docs/env-vars.md#encodings)
for more information.
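
To make concrete what reading with one encoding and writing with another involves, here is a minimal sketch that uses only the Python standard library. The `transcode` helper and the file names are hypothetical and are not part of Manubot AI Editor; the editor applies the same idea while revising paragraphs between the read and the write.

```python
import tempfile
from pathlib import Path


def transcode(src_path: Path, dest_path: Path, src_encoding: str, dest_encoding: str) -> None:
    """Read src_path as src_encoding and write the same text to dest_path as dest_encoding."""
    text = src_path.read_text(encoding=src_encoding)
    dest_path.write_text(text, encoding=dest_encoding)


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "01.abstract.md"
    dest = Path(tmp) / "01.abstract.utf16.md"

    # Create a small GBK-encoded input file (hypothetical content).
    src.write_bytes("你好,世界\n".encode("gbk"))

    transcode(src, dest, "gbk", "utf-16")

    # The bytes on disk differ, but the decoded text is identical.
    same_text = dest.read_text(encoding="utf-16") == "你好,世界\n"
    different_bytes = dest.read_bytes() != src.read_bytes()
```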

## Using the Python API

You can also use the functions of the editor directly from Python.
72 changes: 72 additions & 0 deletions docs/env-vars.md
@@ -0,0 +1,72 @@
# Manubot AI Editor Environment Variables

Manubot AI Editor provides a variety of options to customize the revision
process. These options are exposed as environment variables, all of which are
prefixed with `AI_EDITOR_`.

The following environment variables are supported, organized into categories:

## Model Configuration

- `AI_EDITOR_LANGUAGE_MODEL`: Language model to use. For example,
"text-davinci-003", "gpt-3.5-turbo", "gpt-3.5-turbo-0301", etc. The tool
currently supports the "chat/completions", "completions", and "edits" endpoints,
and you can check compatible models here:
https://platform.openai.com/docs/models/model-endpoint-compatibility
- `AI_EDITOR_MAX_TOKENS_PER_REQUEST`: Model parameter: `max_tokens`
- `AI_EDITOR_TEMPERATURE`: Model parameter: `temperature`
- `AI_EDITOR_TOP_P`: Model parameter: `top_p`
- `AI_EDITOR_PRESENCE_PENALTY`: Model parameter: `presence_penalty`
- `AI_EDITOR_FREQUENCY_PENALTY`: Model parameter: `frequency_penalty`
- `AI_EDITOR_BEST_OF`: Model parameter: `best_of`
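
As an illustration of how a few of these variables could be turned into model parameters, here is a minimal sketch. The `PARAM_VARS` table and the `model_params_from_env` helper are hypothetical, and the converters (`float`, `int`) are assumptions based on the usual types of these parameters; this is not the editor's actual implementation.

```python
import os

# Hypothetical mapping: environment variable -> (parameter name, converter).
# Only a subset of the variables listed above is shown.
PARAM_VARS = {
    "AI_EDITOR_TEMPERATURE": ("temperature", float),
    "AI_EDITOR_TOP_P": ("top_p", float),
    "AI_EDITOR_PRESENCE_PENALTY": ("presence_penalty", float),
    "AI_EDITOR_FREQUENCY_PENALTY": ("frequency_penalty", float),
    "AI_EDITOR_BEST_OF": ("best_of", int),
}


def model_params_from_env(environ=os.environ):
    """Collect the model parameters that are set in the given environment mapping."""
    params = {}
    for var, (name, convert) in PARAM_VARS.items():
        value = environ.get(var)
        if value is not None:
            # Environment values are always strings, so convert them explicitly.
            params[name] = convert(value)
    return params


# Usage with an explicit mapping instead of the real process environment.
example = model_params_from_env({"AI_EDITOR_TEMPERATURE": "0.50", "AI_EDITOR_BEST_OF": "2"})
```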

## Prompt and Query Control

- `AI_EDITOR_FILENAME_SECTION_MAPPING`: Allows the user to specify a JSON
string, where keys are filenames and values are section names. For example:
`{"01.intro.md": "introduction"}` Possible values for section names are:
"abstract", "introduction", "results", "discussion", "conclusions", "methods",
and "supplementary material". Take a look at function `get_prompt()` in
[libs/manubot_ai_editor/models.py](https://github.com/manubot/manubot-ai-editor/blob/main/libs/manubot_ai_editor/models.py#L256)
to see which prompts are used for each section. Although the AI Editor tries to
infer the section name from the filename, sometimes filenames are not
descriptive enough (e.g., "01.intro.md" or "02.review.md" might indicate an
introduction). Mapping filenames to section names is useful to provide more
context to the AI model when revising a paragraph. For example, for the
introduction, prompts contain sentences to preserve most of the citations to
other papers.
- `AI_EDITOR_RETRY_COUNT`: Sometimes the AI model returns an empty paragraph.
Usually, this is resolved by running the model again. The AI Editor will try
five times in these cases. This variable allows you to override the number of
retries from its default of 5.
- `AI_EDITOR_FILENAMES_TO_REVISE`: If specified, only these file names will be
revised. Multiple files can be specified, separated by commas. For example:
"01.intro.md,02.review.md"
- `AI_EDITOR_CUSTOM_PROMPT`: Allows the user to specify a single, custom prompt
for all sections. For example: "proofread and revise the following paragraph";
in this case, the tool will automatically append the characters ':\n\n' followed
by the paragraph. It is also possible to include placeholders in the prompt,
which will be replaced by the corresponding values. For example, "proofread and
revise the following paragraph from the section {section_name} of a scientific
manuscript with title '{title}'". The complete list of placeholders is:
`{paragraph_text}`, `{section_name}`, `{title}`, `{keywords}`.
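
The placeholder behavior described for `AI_EDITOR_CUSTOM_PROMPT` can be sketched as follows. The `build_prompt` helper is hypothetical; in particular, joining keywords with a comma and appending the paragraph whenever `{paragraph_text}` is absent are assumptions, not confirmed details of the editor.

```python
def build_prompt(custom_prompt, paragraph_text, section_name, title, keywords):
    """Fill placeholders in a custom prompt (illustrative sketch only)."""
    values = {
        "paragraph_text": paragraph_text,
        "section_name": section_name,
        "title": title,
        "keywords": ", ".join(keywords),  # assumption: keywords joined by ", "
    }

    prompt = custom_prompt
    for name, value in values.items():
        # Replace only the known placeholders; other braces are left untouched.
        prompt = prompt.replace("{" + name + "}", value)

    # Assumption: if the template does not place the paragraph itself,
    # append ':\n\n' followed by the paragraph, as described above.
    if "{paragraph_text}" not in custom_prompt:
        prompt = f"{prompt}:\n\n{paragraph_text}"
    return prompt


plain = build_prompt(
    "proofread and revise the following paragraph",
    "Some text.", "introduction", "My Title", ["encoding"],
)
templated = build_prompt(
    "revise this paragraph from the section {section_name} of '{title}'",
    "Some text.", "introduction", "My Title", ["encoding"],
)
```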

## Encodings

These variables specify the source and destination encodings of the input and
output markdown files. Behavior is as follows:
- If neither `AI_EDITOR_SRC_ENCODING` nor `AI_EDITOR_DEST_ENCODING` is specified,
the tool will attempt to identify the encoding of each input file using the
charset_normalizer library and use that encoding to both read the input file
and write the output file.
- If only `AI_EDITOR_SRC_ENCODING` is specified, it will be used to both read
the input files and write the output files.
- If only `AI_EDITOR_DEST_ENCODING` is specified, it will be used to write the
output files, and the input files will be read using the encoding identified by
[charset_normalizer](https://github.com/jawah/charset_normalizer).
- If both are specified, `AI_EDITOR_SRC_ENCODING` is used to read the input
files and `AI_EDITOR_DEST_ENCODING` is used to write the output files.

The variables:

- `AI_EDITOR_SRC_ENCODING`: the encoding of the input markdown files; if unset,
defaults to auto-detection using [charset_normalizer](https://github.com/jawah/charset_normalizer).
- `AI_EDITOR_DEST_ENCODING`: the encoding to use when writing the output markdown
files; if unset, defaults to the source encoding (either `AI_EDITOR_SRC_ENCODING`
or, if that is also unset, the detected encoding).
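
The resolution rules above can be sketched as a small function. This is a minimal illustration, not the editor's code: `resolve_encodings` is a hypothetical helper, and detection is passed in as a callable so the sketch runs without extra dependencies. In the editor itself, detection is done with `charset_normalizer.detect(...)["encoding"]`.

```python
SRC_ENCODING = "AI_EDITOR_SRC_ENCODING"
DEST_ENCODING = "AI_EDITOR_DEST_ENCODING"


def resolve_encodings(environ, raw_bytes, detect):
    """Return (source, destination) encodings following the rules above.

    environ:   mapping of environment variables (e.g. os.environ)
    raw_bytes: raw content of the input file, used only for detection
    detect:    callable taking bytes and returning an encoding name
    """
    src = environ.get(SRC_ENCODING)
    dest = environ.get(DEST_ENCODING)

    # No explicit source encoding: fall back to detection.
    if src is None:
        src = detect(raw_bytes)

    # No explicit destination encoding: preserve the source encoding.
    if dest is None:
        dest = src

    return src, dest


# Stand-in detector for the example; a real one would inspect the bytes.
def fake_detect(raw_bytes):
    return "gb18030"


neither = resolve_encodings({}, b"", fake_detect)
only_src = resolve_encodings({SRC_ENCODING: "gbk"}, b"", fake_detect)
only_dest = resolve_encodings({DEST_ENCODING: "UTF-16"}, b"", fake_detect)
both = resolve_encodings({SRC_ENCODING: "gbk", DEST_ENCODING: "UTF-16"}, b"", fake_detect)
```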
22 changes: 21 additions & 1 deletion libs/manubot_ai_editor/editor.py
@@ -2,6 +2,8 @@
import os
from pathlib import Path

import charset_normalizer

from manubot_ai_editor import env_vars
from manubot_ai_editor.prompt_config import ManuscriptPromptConfig, IGNORE_FILE
from manubot_ai_editor.models import ManuscriptRevisionModel
@@ -280,7 +282,25 @@ def revise_file(
        if section_name is None:
            section_name = self.get_section_from_filename(input_filename)

        with open(input_filepath, "r") as infile, open(output_filepath, "w") as outfile:
        # apply encoding settings via the env vars AI_EDITOR_SRC_ENCODING and AI_EDITOR_DEST_ENCODING,
        # if specified; otherwise, detect the encoding using charset_normalizer
        src_encoding = os.environ.get(env_vars.SRC_ENCODING)
        dest_encoding = os.environ.get(env_vars.DEST_ENCODING)

        # detect the input file encoding using charset_normalizer
        # maintain that encoding when reading and writing files
        if src_encoding is None:
            src_encoding = charset_normalizer.detect(input_filepath.read_bytes())["encoding"]

        # ensure that we have a valid encoding for the output file
        if dest_encoding is None:
            dest_encoding = src_encoding

        print("Detected encoding:", src_encoding, flush=True)

        with open(input_filepath, "r", encoding=src_encoding) as infile, \
                open(output_filepath, "w", encoding=dest_encoding) as outfile:

            # Initialize a temporary list to store the lines of the current paragraph
            paragraph = []


13 changes: 13 additions & 0 deletions libs/manubot_ai_editor/env_vars.py
@@ -70,3 +70,16 @@
# The complete list of placeholders is: {paragraph_text}, {section_name},
# {title}, {keywords}.
CUSTOM_PROMPT = "AI_EDITOR_CUSTOM_PROMPT"

# Specifies the source and destination encodings of input and output markdown
# files. Behavior is as follows:
# - If neither SRC_ENCODING nor DEST_ENCODING is specified, the tool will
#   attempt to identify the encoding using the charset_normalizer library and
#   use that encoding to both read the input files and write the output files.
# - If only SRC_ENCODING is specified, it will be used to both read the input
#   files and write the output files.
# - If only DEST_ENCODING is specified, it will be used to write the output
#   files, and the input files will be read using the encoding identified by
#   charset_normalizer.
# - If both are specified, SRC_ENCODING is used to read the input files and
#   DEST_ENCODING is used to write the output files.
SRC_ENCODING = "AI_EDITOR_SRC_ENCODING"
DEST_ENCODING = "AI_EDITOR_DEST_ENCODING"
1 change: 1 addition & 0 deletions setup.py
@@ -27,6 +27,7 @@
    install_requires=[
        "openai==0.28",
        "pyyaml",
        "charset_normalizer==3.4.0",
    ],
    classifiers=[
        "Programming Language :: Python :: 3",
33 changes: 33 additions & 0 deletions tests/manuscripts/gbk_encoded/01.abstract.md
@@ -0,0 +1,33 @@
# Abstract

[GBK-encoded Chinese text; not renderable in this diff view]

Hello, world!

# Lorem Ipsum

(source: https://pinkylam.me/generator/chinese-lorem-ipsum/)

[Seven paragraphs of GBK-encoded Chinese lorem ipsum text; not renderable in this diff view]

11 changes: 11 additions & 0 deletions tests/manuscripts/gbk_encoded/metadata.yaml
@@ -0,0 +1,11 @@
---
title: "A Chinese hello, world; 你好,世界"
keywords:
  - encoding
lang: zh-CN
authors:
  - name: 姓名
    initials: 姓名
    orcid: 0000-0000-0000-0000
    twitter: 姓名
    email: 姓名@姓名.edu
146 changes: 146 additions & 0 deletions tests/test_editor.py
@@ -3,9 +3,12 @@

import pytest

import charset_normalizer

from manubot_ai_editor import env_vars
from manubot_ai_editor.editor import ManuscriptEditor
from manubot_ai_editor.models import (
    ManuscriptRevisionModel,
    RandomManuscriptRevisionModel,
    DummyManuscriptRevisionModel,
    VerboseManuscriptRevisionModel,
@@ -1195,3 +1198,146 @@ def test_revise_entire_manuscript_non_standard_filenames_with_empty_custom_promp

output_md_files = list(output_folder.glob("*.md"))
assert len(output_md_files) == 0

@pytest.mark.parametrize(
    "model",
    [
        DummyManuscriptRevisionModel(),
    ],
)
def test_revise_gbk_encoded_manuscript(tmp_path: Path, model: ManuscriptRevisionModel):
    """
    Tests that the editor can revise a manuscript that contains GBK-encoded
    characters, and that those characters can still be read as GBK from the
    output file.
    """
    print(f"\n{str(tmp_path)}\n")

    me = ManuscriptEditor(
        content_dir=MANUSCRIPTS_DIR / "gbk_encoded",
    )

    model.title = me.title
    model.keywords = me.keywords

    assert tmp_path.exists()

    me.revise_manuscript(tmp_path, model)

    output_md_files = list(tmp_path.glob("*.md"))
    assert len(output_md_files) == 1

    # try to find the GBK-encoded text in the resulting file
    with open(output_md_files[0], "r", encoding="gbk") as f:
        text = f.read()
        print(text)

    # finds "hello, world"
    assert "你好,世界" in text
    # finds some lorem ipsum text
    assert "圓跳樹乞土點見央" in text



@mock.patch.dict(
    "os.environ",
    {env_vars.SRC_ENCODING: "gbk", env_vars.DEST_ENCODING: "UTF-16"},
)
@pytest.mark.parametrize(
    "model",
    [
        DummyManuscriptRevisionModel(),
    ],
)
def test_revise_gbk_specd_manuscript_into_utf16(tmp_path: Path, model: ManuscriptRevisionModel):
    """
    Tests that the editor can revise a manuscript that is specified as being
    GBK-encoded, and produces a UTF-16 encoded output file per the DEST_ENCODING
    environment variable.
    """
    print(f"\n{str(tmp_path)}\n")

    me = ManuscriptEditor(
        content_dir=MANUSCRIPTS_DIR / "gbk_encoded",
    )

    model.title = me.title
    model.keywords = me.keywords

    assert tmp_path.exists()

    me.revise_manuscript(tmp_path, model)

    output_md_files = list(tmp_path.glob("*.md"))
    assert len(output_md_files) == 1

    # detect the encoding of the output file, ensure it's what we
    # set DEST_ENCODING to (charset_normalizer is imported at module level)
    encoding = charset_normalizer.detect(output_md_files[0].read_bytes())["encoding"]
    assert encoding == "UTF-16"

@mock.patch.dict(
    "os.environ",
    {env_vars.DEST_ENCODING: "UTF-16"},
)
@pytest.mark.parametrize(
    "model",
    [
        DummyManuscriptRevisionModel(),
    ],
)
def test_revise_gbk_detected_manuscript_into_utf16(tmp_path: Path, model: ManuscriptRevisionModel):
    """
    Tests that the editor can revise a GBK-encoded manuscript where the input
    encoding wasn't specified and is thus auto-detected. Tests that the
    resulting output file is written in UTF-16 as specified by the DEST_ENCODING
    environment variable.
    """
    print(f"\n{str(tmp_path)}\n")

    me = ManuscriptEditor(
        content_dir=MANUSCRIPTS_DIR / "gbk_encoded",
    )

    input_md_files = list(me.content_dir.glob("*.md"))

    # detect the encoding of the input file
    with open(input_md_files[0], "rb") as f:
        text = f.read()
        input_encoding = charset_normalizer.from_bytes(text).best().encoding

    print(text)

    # note that we get back gb18030, which is a superset of GBK and is fine
    # to use for reading/writing those files.
    # see the following for details:
    # https://www.ibm.com/docs/en/i/7.4?topic=applications-gb18030-chinese-standard

    assert input_encoding == "gb18030"

    model.title = me.title
    model.keywords = me.keywords

    assert tmp_path.exists()

    # actually revise the manuscript, producing UTF-16 output files
    me.revise_manuscript(tmp_path, model)

    output_md_files = list(tmp_path.glob("*.md"))
    assert len(output_md_files) == 1

    # detect the encoding of the output file, ensure it's what we
    # set DEST_ENCODING to
    output_encoding = charset_normalizer.from_path(output_md_files[0]).best().encoding
    assert output_encoding == "utf_16"

    # try to find some known text in the resulting file
    with open(output_md_files[0], "r", encoding=output_encoding) as f:
        text = f.read()
        print(text)

    # finds "hello, world"
    assert "你好,世界" in text
    # finds some lorem ipsum text
    assert "圓跳樹乞土點見央" in text
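
The comment in `test_revise_gbk_detected_manuscript_into_utf16` notes that gb18030 is a superset of GBK, which is why a detected encoding of gb18030 is acceptable for GBK input. A minimal standard-library sketch of that relationship, using a sample string rather than the actual fixture file:

```python
# Sample text that GBK can represent (same phrase the tests look for).
text = "你好,世界"

# Encode as GBK, then decode with the gb18030 codec.
gbk_bytes = text.encode("gbk")
decoded = gbk_bytes.decode("gb18030")

# The text survives the round trip through the wider codec.
round_trip_ok = decoded == text

# For this sample, both codecs also produce identical bytes.
same_bytes = text.encode("gb18030") == gbk_bytes
```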