To help users complete entity recognition tasks with DeepKE, we provide an easy-to-use automatic annotation tool for entity recognition based on dictionary matching.
-
The format of Dict:
-
Two entity dictionaries (one Chinese and one English) are provided in advance. Samples are tagged automatically using the entity dictionary together with jieba part-of-speech tagging.
-
For the Chinese example dictionary, we adapt the People's Daily dataset, an NER dataset covering three types of named entities: persons (PER), locations (LOC), and organizations (ORG).
-
For the English example dictionary, we adapt the CoNLL dataset, an NER dataset covering four types of named entities: persons (PER), locations (LOC), organizations (ORG), and others (MISC). You can get the CoNLL dataset with the following command:
wget 120.27.214.45/Data/ner/few_shot/data.tar.gz
- Pre-provided dict from Google Drive:
- From BaiduNetDisk :
-
If you need to build your own domain-specific dictionary, follow the format of the pre-provided dictionary (csv):

| Entity | Label |
| --- | --- |
| Washington | LOC |
| ... | ... |
-
The input dictionary is a csv file with two columns: entities and their corresponding labels.
-
Data to be annotated automatically (txt format, one sample per line, as shown in the figure below) should be placed under the `source_data` path. The script traverses all txt files in this folder and annotates them line by line.
-
The output files (the split ratio of the `training set`, `validation set`, and `test set` can be customized) can be used directly as training data in DeepKE.
Implementation Environment:
- jieba = 0.42.1
- `language`: `cn` or `en`
- `source_dir`: Corpus path (all txt files under this folder are traversed and annotated line by line; defaults to `source_data`)
- `dict_dir`: Entity dictionary path (defaults to `vocab_dict.csv`)
- `train_rate, dev_rate, test_rate`: The ratio of the training set, validation set, and test set (make sure they sum to `1`; defaults to `0.8:0.1:0.1`)
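As a rough illustration of how the three ratios could partition the annotated samples, here is a minimal sketch. The function `split_by_ratio`, the shuffling, and the fixed seed are assumptions for illustration; the script's own splitting logic may differ.

```python
import random


def split_by_ratio(samples, train_rate=0.8, dev_rate=0.1, test_rate=0.1, seed=0):
    """Shuffle samples and split them into train/dev/test lists by ratio."""
    if abs(train_rate + dev_rate + test_rate - 1.0) > 1e-9:
        raise ValueError("train_rate + dev_rate + test_rate must sum to 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(len(shuffled) * train_rate)
    n_dev = round(len(shuffled) * dev_rate)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # the remainder goes to the test set
    return train, dev, test


train, dev, test = split_by_ratio(range(100))
print(len(train), len(dev), len(test))
```

Giving the remainder to the test set keeps every sample in exactly one split even when the ratios do not divide the sample count evenly.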
- Chinese
python prepare_weaksupervised_data.py --language cn --dict_dir vocab_dict_cn.csv
- English
python prepare_weaksupervised_data.py --language en --dict_dir vocab_dict_en.csv
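To run either command with a self-built dictionary, the csv file has to exist first. The sketch below writes one with Python's standard `csv` module. The file name `vocab_dict_custom.csv`, the helper `write_entity_dict`, the sample entries, and the presence of an `Entity,Label` header row are assumptions; compare against the pre-provided dictionaries before relying on it.

```python
import csv


def write_entity_dict(path, entries):
    """Write (entity, label) pairs to a two-column csv file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Entity", "Label"])  # assumed header row
        writer.writerows(entries)


# Hypothetical entries for illustration only.
entries = [("Washington", "LOC"), ("Marie Curie", "PER")]
write_entity_dict("vocab_dict_custom.csv", entries)
```

The resulting file could then be passed through the existing `--dict_dir` argument, for example `--dict_dir vocab_dict_custom.csv`.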