prepare-data

English | 简体中文

To help users complete entity recognition tasks with DeepKE more easily, we provide an easy-to-use automatic annotation tool for NER based on dictionary matching.

Dict

  • The format of the Dict is csv, with one entity and its label per row (see the example at the end of this section).

  • Two example entity Dicts (one Chinese and one English) are provided in advance, and samples are automatically tagged using the entity dictionary plus jieba part-of-speech tagging (a minimal sketch of this idea follows the dictionary format example below).

    • For the Chinese example dict, we adopt the People's Daily dataset, an NER dataset that focuses on three types of named entities: persons (PER), locations (LOC), and organizations (ORG).

    • For the English example dict, we adopt the CoNLL dataset, an NER dataset that focuses on four types of named entities: persons (PER), locations (LOC), organizations (ORG), and others (MISC). You can get the CoNLL dataset with the following command.

    wget 120.27.214.45/Data/ner/few_shot/data.tar.gz
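
    The downloaded archive is a standard .tar.gz file and can be unpacked in the usual way, for example:

    tar -xzvf data.tar.gz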
  • If you need to build your own domain-specific dictionary, please follow the format of the pre-provided dictionaries (csv):

    Entity       Label
    Washington   LOC
    ...          ...
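
Below is a minimal, illustrative Python sketch of the dictionary matching + jieba tagging idea described above. It is not the actual prepare_weaksupervised_data.py implementation: the sample sentence and the character-level BIO scheme are assumptions for illustration only.

    import csv
    import jieba

    def load_dict(path):
        # Load an entity dictionary csv with two columns: Entity, Label.
        entity2label = {}
        with open(path, encoding="utf-8") as f:
            reader = csv.reader(f)
            next(reader, None)              # skip the header row
            for row in reader:
                if len(row) < 2:
                    continue
                entity, label = row[0], row[1]
                entity2label[entity] = label
                jieba.add_word(entity)      # keep dictionary entities as whole words
        return entity2label

    def annotate(sentence, entity2label):
        # Tag every character: B-/I-<label> for dictionary hits, O otherwise.
        tags = []
        for word in jieba.lcut(sentence):
            if word in entity2label:
                label = entity2label[word]
                tags += ["B-" + label] + ["I-" + label] * (len(word) - 1)
            else:
                tags += ["O"] * len(word)
        return list(zip(sentence, tags))

    # Hypothetical usage:
    # entity2label = load_dict("vocab_dict_cn.csv")
    # for char, tag in annotate("华盛顿位于美国", entity2label):
    #     print(char, tag)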

Source File

  • The input dictionary format is csv, containing two columns: entities and their corresponding labels.

  • Data to be automatically annotated (txt format, one sample per line) should be placed under the source_data path; the script traverses all txt files in this folder and annotates them line by line (see the sketch after this list).

  • The output files (the split ratio of the training, validation, and test sets can be customized) can be used directly as training data in DeepKE.
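
A minimal sketch of this traversal, assuming the default source_data folder and the annotate helper from the sketch above:

    import glob
    import os

    source_dir = "source_data"
    for path in glob.glob(os.path.join(source_dir, "*.txt")):
        with open(path, encoding="utf-8") as f:
            for line in f:
                sentence = line.strip()
                if not sentence:
                    continue          # skip empty lines
                # annotate(sentence, entity2label)  # tag as in the Dict sketch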

Environment

Implementation Environment:

  • jieba = 0.42.1
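
For example, jieba can be installed with pip:

pip install jieba==0.42.1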

Args Description

  • language: cn or en
  • source_dir: Corpus path (all txt files under this folder are traversed and annotated line by line; defaults to source_data)
  • dict_dir: Entity dict path (defaults to vocab_dict.csv)
  • train_rate, dev_rate, test_rate: The split ratio of the training, validation, and test sets (please make sure the three values sum to 1; defaults to 0.8:0.1:0.1). A minimal split sketch follows this list.
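
As a reference, a ratio-based split could look like the following minimal sketch; the actual script's shuffling and rounding may differ:

    import random

    def split(samples, train_rate=0.8, dev_rate=0.1, test_rate=0.1, seed=42):
        assert abs(train_rate + dev_rate + test_rate - 1.0) < 1e-6
        samples = list(samples)
        random.Random(seed).shuffle(samples)
        n_train = int(len(samples) * train_rate)
        n_dev = int(len(samples) * dev_rate)
        return (samples[:n_train],                    # training set
                samples[n_train:n_train + n_dev],     # validation set
                samples[n_train + n_dev:])            # test set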

Run

  • Chinese
python prepare_weaksupervised_data.py --language cn --dict_dir vocab_dict_cn.csv
  • English
python prepare_weaksupervised_data.py --language en --dict_dir vocab_dict_en.csv
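  • Custom split ratios (assuming the rate arguments listed under Args Description; the three values must sum to 1)
python prepare_weaksupervised_data.py --language en --dict_dir vocab_dict_en.csv --train_rate 0.8 --dev_rate 0.1 --test_rate 0.1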