
Whitespace tokenizing does not work on Chinese and Japanese #20992

@DeanChugall

Description


# FIXME: whitespace tokenizing does not work on chinese and japanese

Implementing an NLP whitespace tokenizer for Japanese and Chinese is not straightforward: unlike English, Italian, French, etc., these languages do not separate words with whitespace, so the text cannot simply be split on spaces.
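
To see the problem concretely, here is a quick comparison (illustrative strings, not taken from the dataset):

    # str.split() relies on whitespace, which Chinese/Japanese running
    # text does not use between words
    english = "Wikipedia is a free online encyclopedia"
    chinese = "維基百科是一個自由的網路百科全書"

    print(english.split())  # 6 tokens, one per word
    print(chinese.split())  # 1 token: the whole sentence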


Steps/Code to Reproduce

In /doc/tutorial/text_analytics/data/languages/fetch_data.py we currently have:

        # split the paragraph into fake smaller paragraphs to make the
        # problem harder e.g. more similar to tweets
        if lang in ('zh', 'ja'):
            # FIXME: whitespace tokenizing does not work on chinese and japanese
            continue
        words = content.split()
        n_groups = len(words) / n_words_per_short_text
        if n_groups < 1:
            continue
        groups = np.array_split(words, n_groups)
        ...etc...
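
Note that even without this guard, the generic code path would skip zh/ja documents anyway; a sketch of why (hypothetical sample string and an illustrative threshold):

    # without word-delimiting whitespace, split() returns one giant "word",
    # so n_groups < 1 and the document is dropped
    content = "維基百科是一個自由的網路百科全書"
    words = content.split()        # ['維基百科...'] -> len(words) == 1
    n_words_per_short_text = 5     # illustrative value
    n_groups = len(words) / n_words_per_short_text   # 0.2, i.e. < 1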

Simple proposal (for now):

  • A symbol dictionary for Chinese and Japanese will be used to extract tokens when splitting the paragraph into fake smaller paragraphs. Its implementation looks like this:
        # split the paragraph into fake smaller paragraphs to make the
        # problem harder e.g. more similar to tweets

        n_words_per_short_text_zh_ja = 3

        if lang in ("zh", "ja"):
            # accumulate consecutive zh/ja characters into fixed-size
            # pseudo-words, since the text has no whitespace to split on
            words = []
            string_of_words = ''
            for char in content:
                if char in zh_letters:
                    string_of_words += char
                    if len(string_of_words) > n_words_per_short_text_zh_ja:
                        words.append(string_of_words)
                        string_of_words = ''

            n_groups = len(words) // n_words_per_short_text_zh_ja

        else:
            words = content.split()
            n_groups = len(words) // n_words_per_short_text

        if n_groups < 1:
            continue

        groups = np.array_split(words, n_groups)

        for group in groups:
            small_content = " ".join(group)

        ...etc...
  • For now, we have a helper function to load the zh/ja letters:
    def read_zh_letters(filename) -> set:
        # load the zh/ja character set, one character per line; a set makes
        # the per-character membership test above O(1) and, unlike the
        # readline loop, does not stop at the first blank line
        with open(filename, encoding="utf8") as file:
            return {line.strip() for line in file if line.strip()}

    zh_letters = read_zh_letters('PATH_TO_zh_ja_letter.txt')
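
An alternative worth considering (a sketch only, not part of the proposal above) is to derive membership from Unicode block ranges instead of shipping a letter file; the is_cjk name and the choice of blocks here are assumptions:

    def is_cjk(char: str) -> bool:
        # standard Unicode blocks covering Han ideographs and Japanese kana
        code = ord(char)
        return (
            0x4E00 <= code <= 0x9FFF      # CJK Unified Ideographs
            or 0x3040 <= code <= 0x309F   # Hiragana
            or 0x30A0 <= code <= 0x30FF   # Katakana
        )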

Expected Results

in "short_paragraphs/zh" is:

  • zh_0000.txt
    維基百科 英語或是 维基媒体

  • zh_0001.txt
    基金会运 营的一个 多语言的

  • zh_0002.txt
    線上百科 全并以创 建和维护

  • zh_0003.txt
    作为开放 式协同合 作项目特

  • zh_0004.txt
    点是自由 容自由编 辑自由版

  • zh_0005.txt
    权目前是 全球網絡 上最大且
    ...etc...
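
For illustration, a self-contained sketch that reproduces the listing above; the sample string is just the characters from the expected files re-joined, and the Han-ideograph range check stands in for the zh_letters lookup:

    import numpy as np

    content = (
        "維基百科英語或是维基媒体基金会运营的一个多语言的"
        "線上百科全并以创建和维护作为开放式协同合作项目特"
        "点是自由容自由编辑自由版权目前是全球網絡上最大且"
    )

    n_words_per_short_text_zh_ja = 3

    # accumulate consecutive CJK characters into 4-character pseudo-words
    words = []
    string_of_words = ''
    for char in content:
        if '\u4e00' <= char <= '\u9fff':  # CJK Unified Ideographs block
            string_of_words += char
            if len(string_of_words) > n_words_per_short_text_zh_ja:
                words.append(string_of_words)
                string_of_words = ''

    # 18 pseudo-words split into 6 groups of 3
    n_groups = len(words) // n_words_per_short_text_zh_ja
    groups = np.array_split(words, n_groups)

    for i, group in enumerate(groups):
        print(f"zh_{i:04d}.txt:", " ".join(group))
    # zh_0000.txt: 維基百科 英語或是 维基媒体
    # zh_0001.txt: 基金会运 营的一个 多语言的
    # ...etc...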


Actual Results

None: zh and ja documents are currently skipped by the FIXME guard, so no files are produced under short_paragraphs/zh.


Versions

System:
    python: 3.8.6 (tags/v3.8.6:db45529, Sep 23 2020, 15:52:53) [MSC v.1927 64 bit (AMD64)]

Python dependencies:
          pip: 21.2.4
   setuptools: 49.2.1
      sklearn: 1.1.dev0
        numpy: 1.21.2
        scipy: 1.7.1
       Cython: None
       pandas: None
   matplotlib: 3.4.3
       joblib: 1.0.1
threadpoolctl: 2.2.0

Built with OpenMP: True

