Description
FIXME: whitespace tokenizing does not work on Chinese and Japanese
Implementing an NLP whitespace tokenizer for Japanese and Chinese is harder than it looks: unlike English, Italian, French, etc., these languages do not separate words with whitespace, so splitting on spaces cannot work.
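To see the problem, compare str.split() on a space-delimited sentence with the same call on a Chinese one (a minimal sketch; the Chinese sample sentence is just arbitrary text):

# whitespace splitting works for space-delimited languages...
print("Wikipedia is a free encyclopedia".split())
# ['Wikipedia', 'is', 'a', 'free', 'encyclopedia']

# ...but Chinese/Japanese text carries no spaces between words, so the
# whole sentence comes back as a single "word"
print("維基百科是一個多語言的線上百科全書".split())
# ['維基百科是一個多語言的線上百科全書']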
Steps/Code to Reproduce
In /doc/tutorial/text_analytics/data/languages/fetch_data.py we currently have:
# split the paragraph into fake smaller paragraphs to make the
# problem harder e.g. more similar to tweets
if lang in ('zh', 'ja'):
    # FIXME: whitespace tokenizing does not work on chinese and japanese
    continue
words = content.split()
n_groups = len(words) / n_words_per_short_text
if n_groups < 1:
    continue
groups = np.array_split(words, n_groups)
...etc...
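For reference, here is roughly what that whitespace path does for a space-delimited language, as a standalone sketch (the sample paragraph and the value of n_words_per_short_text are placeholders; np.array_split truncates the float group count to an int):

import numpy as np

n_words_per_short_text = 5  # placeholder value, for illustration only

content = ("Wikipedia is a free online encyclopedia created and maintained "
           "by a community of volunteer editors")

words = content.split()
n_groups = len(words) / n_words_per_short_text
if n_groups >= 1:
    for group in np.array_split(words, n_groups):
        print(" ".join(group))
# Wikipedia is a free online
# encyclopedia created and maintained by
# a community of volunteer editors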
Simple proposal (for now):

- A symbol dictionary for the Chinese and Japanese languages will be used to extract tokens and split the paragraph into fake smaller paragraphs. The implementation looks like:
# split the paragraph into fake smaller paragraphs to make the
# problem harder e.g. more similar to tweets
n_words_per_short_text_zh_ja = 3
if lang in ("zh", "ja"):
    # iterate character by character, keep only characters listed in
    # zh_letters, and group them into fixed-size fake "words"
    words = []
    string_of_words = ''
    for word in content:
        if word in zh_letters:
            string_of_words += word
            if len(string_of_words) > n_words_per_short_text_zh_ja:
                words.append(string_of_words)
                string_of_words = ''
    n_groups = len(words) / n_words_per_short_text_zh_ja
else:
    words = content.split()
    n_groups = len(words) / n_words_per_short_text
if n_groups < 1:
    continue
groups = np.array_split(words, n_groups)
for group in groups:
    small_content = " ".join(group)
...etc...

- For now, we have a helper function to load the Chinese/Japanese letters:
zh_letters = []

def read_zh_letters(filename) -> list:
    # the letters file is expected to contain one character per line
    # (note: reading stops at the first blank line)
    with open(filename, encoding="utf8") as file:
        while line := file.readline().strip():
            zh_letters.append(line)
    return zh_letters

read_zh_letters('PATH_TO_zh_ja_letter.txt')
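Putting the proposal together, here is a minimal standalone sketch of the character-grouping step (the sample sentence is arbitrary and zh_letters is built inline here instead of being loaded from PATH_TO_zh_ja_letter.txt):

n_words_per_short_text_zh_ja = 3

# placeholder input and letter set; punctuation is deliberately excluded
# so it gets dropped by the `in zh_letters` check
content = "維基百科，是一個自由內容的多語言線上百科全書。"
zh_letters = set(content) - set("，。")

words = []
string_of_words = ''
for char in content:
    if char in zh_letters:
        string_of_words += char
        if len(string_of_words) > n_words_per_short_text_zh_ja:
            words.append(string_of_words)
            string_of_words = ''

print(words)
# ['維基百科', '是一個自', '由內容的', '多語言線', '上百科全']

With n_words_per_short_text_zh_ja = 3 and the strict > comparison, every fake "word" ends up with 4 characters and trailing characters that do not fill a full group are dropped; the resulting words are then joined a few at a time by the np.array_split / " ".join step above, which gives lines like those in the expected output below.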
in "short_paragraphs/zh" is:
- zh_0000.txt: 維基百科 英語或是 维基媒体
- zh_0001.txt: 基金会运 营的一个 多语言的
- zh_0002.txt: 線上百科 全并以创 建和维护
- zh_0003.txt: 作为开放 式协同合 作项目特
- zh_0004.txt: 点是自由 容自由编 辑自由版
- zh_0005.txt: 权目前是 全球網絡 上最大且
...etc...
Actual Results
None (zh and ja are currently skipped via the continue under the FIXME, so no short paragraphs are generated for those languages).
Versions
System:
python: 3.8.6 (tags/v3.8.6:db45529, Sep 23 2020, 15:52:53) [MSC v.1927 64 bit (AMD64)]
Python dependencies:
pip: 21.2.4
setuptools: 49.2.1
sklearn: 1.1.dev0
numpy: 1.21.2
scipy: 1.7.1
Cython: None
pandas: None
matplotlib: 3.4.3
joblib: 1.0.1
threadpoolctl: 2.2.0
Built with OpenMP: True