
Whitespace tokenizing does not work on Chinese and Japanese #20992

@DeanChugall

Description


# FIXME: whitespace tokenizing does not work on chinese and japanese

Implementing an NLP whitespace tokenizer for Japanese and Chinese is not straightforward: unlike English, Italian, French, etc., these languages do not separate words with whitespace, so the text cannot simply be split on spaces.
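
To see the problem concretely, here is a quick comparison (illustrative strings, not taken from the dataset):

    # str.split() relies on whitespace, which Chinese/Japanese running
    # text does not use between words
    english = "Wikipedia is a free online encyclopedia"
    chinese = "維基百科是一個自由的網路百科全書"

    print(english.split())  # 6 tokens, one per word
    print(chinese.split())  # 1 token: the whole sentence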


Steps/Code to Reproduce

In /doc/tutorial/text_analytics/data/languages/fetch_data.py we currently have:

        # split the paragraph into fake smaller paragraphs to make the
        # problem harder e.g. more similar to tweets
        if lang in ('zh', 'ja'):
            # FIXME: whitespace tokenizing does not work on chinese and japanese
            continue
        words = content.split()
        n_groups = len(words) / n_words_per_short_text
        if n_groups < 1:
            continue
        groups = np.array_split(words, n_groups)
        ...etc...
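
Note that even without this guard, the generic code path would skip zh/ja documents anyway; a sketch of why (hypothetical sample string and an illustrative threshold):

    # without word-delimiting whitespace, split() returns one giant "word",
    # so n_groups < 1 and the document is dropped
    content = "維基百科是一個自由的網路百科全書"
    words = content.split()        # ['維基百科...'] -> len(words) == 1
    n_words_per_short_text = 5     # illustrative value
    n_groups = len(words) / n_words_per_short_text   # 0.2, i.e. < 1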

Simple proposal (for now):

  • A symbol dictionary for Chinese and Japanese will be used to extract tokens when splitting the paragraph into fake smaller paragraphs. Its implementation looks like this:
        # split the paragraph into fake smaller paragraphs to make the
        # problem harder e.g. more similar to tweets

        n_words_per_short_text_zh_ja = 3

        if lang in ("zh", "ja"):
            # accumulate consecutive zh/ja characters into fixed-size
            # pseudo-words, since the text has no whitespace to split on
            words = []
            string_of_words = ''
            for char in content:
                if char in zh_letters:
                    string_of_words += char
                    if len(string_of_words) > n_words_per_short_text_zh_ja:
                        words.append(string_of_words)
                        string_of_words = ''

            n_groups = len(words) // n_words_per_short_text_zh_ja

        else:
            words = content.split()
            n_groups = len(words) // n_words_per_short_text

        if n_groups < 1:
            continue

        groups = np.array_split(words, n_groups)

        for group in groups:
            small_content = " ".join(group)

        ...etc...
  • For now, we have a helper function to load the zh/ja letters:
    def read_zh_letters(filename) -> set:
        # load the zh/ja character set, one character per line; a set makes
        # the per-character membership test above O(1) and, unlike the
        # readline loop, does not stop at the first blank line
        with open(filename, encoding="utf8") as file:
            return {line.strip() for line in file if line.strip()}

    zh_letters = read_zh_letters('PATH_TO_zh_ja_letter.txt')
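
An alternative worth considering (a sketch only, not part of the proposal above) is to derive membership from Unicode block ranges instead of shipping a letter file; the is_cjk name and the choice of blocks here are assumptions:

    def is_cjk(char: str) -> bool:
        # standard Unicode blocks covering Han ideographs and Japanese kana
        code = ord(char)
        return (
            0x4E00 <= code <= 0x9FFF      # CJK Unified Ideographs
            or 0x3040 <= code <= 0x309F   # Hiragana
            or 0x30A0 <= code <= 0x30FF   # Katakana
        )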

Expected Results

in "short_paragraphs/zh" is:

  • zh_0000.txt
    維基百科 英語或是 维基媒体

  • zh_0001.txt
    基金会运 营的一个 多语言的

  • zh_0002.txt
    線上百科 全并以创 建和维护

  • zh_0003.txt
    作为开放 式协同合 作项目特

  • zh_0004.txt
    点是自由 容自由编 辑自由版

  • zh_0005.txt
    权目前是 全球網絡 上最大且
    ...etc...
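
For illustration, a self-contained sketch that reproduces the listing above; the sample string is just the characters from the expected files re-joined, and the Han-ideograph range check stands in for the zh_letters lookup:

    import numpy as np

    content = (
        "維基百科英語或是维基媒体基金会运营的一个多语言的"
        "線上百科全并以创建和维护作为开放式协同合作项目特"
        "点是自由容自由编辑自由版权目前是全球網絡上最大且"
    )

    n_words_per_short_text_zh_ja = 3

    # accumulate consecutive CJK characters into 4-character pseudo-words
    words = []
    string_of_words = ''
    for char in content:
        if '\u4e00' <= char <= '\u9fff':  # CJK Unified Ideographs block
            string_of_words += char
            if len(string_of_words) > n_words_per_short_text_zh_ja:
                words.append(string_of_words)
                string_of_words = ''

    # 18 pseudo-words split into 6 groups of 3
    n_groups = len(words) // n_words_per_short_text_zh_ja
    groups = np.array_split(words, n_groups)

    for i, group in enumerate(groups):
        print(f"zh_{i:04d}.txt:", " ".join(group))
    # zh_0000.txt: 維基百科 英語或是 维基媒体
    # zh_0001.txt: 基金会运 营的一个 多语言的
    # ...etc...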


Actual Results

None: zh and ja documents are currently skipped by the FIXME guard, so no files are produced under short_paragraphs/zh.


Versions

System:
    python: 3.8.6 (tags/v3.8.6:db45529, Sep 23 2020, 15:52:53) [MSC v.1927 64 bit (AMD64)]

Python dependencies:
          pip: 21.2.4
   setuptools: 49.2.1
      sklearn: 1.1.dev0
        numpy: 1.21.2
        scipy: 1.7.1
       Cython: None
       pandas: None
   matplotlib: 3.4.3
       joblib: 1.0.1
threadpoolctl: 2.2.0

Built with OpenMP: True

