A deep learning model to predict named entities, triggers, and nested events from biomedical texts.
- The model and results are reported in our paper:
DeepEventMine: End-to-end Neural Nested Event Extraction from Biomedical Texts, Bioinformatics, 2020.
- Based on pre-trained BERT
- Predict nested entities and nested events
- Provide our trained models on the seven biomedical tasks
- Reproduce the results reported in our Bioinformatics paper
- Predict on new data given raw text input or a PubMed ID
- Visualize the predicted entities and events in brat
- DeepEventMine has been trained and evaluated on the following tasks (six BioNLP shared tasks and MLEE).
- cg: Cancer Genetics (CG), 2013
- ge11: GENIA Event Extraction (GENIA), 2011
- ge13: GENIA Event Extraction (GENIA), 2013
- id: Infectious Diseases (ID), 2011
- epi: Epigenetics and Post-translational Modifications (EPI), 2011
- pc: Pathway Curation (PC), 2013
- mlee: Multi-Level Event Extraction (MLEE)
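When scripting over several tasks, it can help to keep the task codes above in a small lookup table. The following is a minimal sketch; the `TASKS` dictionary and the `describe_task` helper are illustrative names and are not part of DeepEventMine.

```python
# Task codes, names, and years copied from the list above.
TASKS = {
    "cg": ("Cancer Genetics (CG)", 2013),
    "ge11": ("GENIA Event Extraction (GENIA)", 2011),
    "ge13": ("GENIA Event Extraction (GENIA)", 2013),
    "id": ("Infectious Diseases (ID)", 2011),
    "epi": ("Epigenetics and Post-translational Modifications (EPI)", 2011),
    "pc": ("Pathway Curation (PC)", 2013),
    "mlee": ("Multi-Level Event Extraction (MLEE)", None),  # no year listed above
}


def describe_task(code):
    """Return a readable label for a task code; raise ValueError if unknown."""
    if code not in TASKS:
        raise ValueError(f"unknown task {code!r}; expected one of {sorted(TASKS)}")
    name, year = TASKS[code]
    return name if year is None else f"{name}, {year}"
```

Validating the code before calling the shell scripts gives an early, readable error instead of a failure inside a download or prediction step.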
- Python 3.6.5
- PyTorch (torch==1.1.0 torchvision==0.3.0, cuda92)
virtualenv -p python3 pytorch-env
source pytorch-env/bin/activate
CUDA_PATH=/usr/local/cuda pip install torch==1.1.0 torchvision==0.3.0
- Install Python packages
sh install.sh
- Download the SciBERT model (PyTorch version, from AllenNLP)
sh download.sh bert
- Download the pre-trained DeepEventMine model for a given task
- [task] = cg (or pc, ge11, epi, etc)
sh download.sh deepeventmine [task]
sh download.sh brat
- Install brat following the brat instructions
cd brat/brat-v1.3_Crunchy_Frog/
./install.sh -u
python2 standalone.py
- Download corpora
- Download the original data sets from the BioNLP shared tasks.
- [task] = cg, pc, ge11, etc
sh download.sh bionlp [task]
- Preprocess data
- Tokenize texts and prepare data for prediction
sh preprocess.sh bionlp
- Generate configs
- If using GPU: [gpu] = 0, otherwise: [gpu] = -1
- [task] = cg, pc, etc
sh run.sh config [task] [gpu]
- For development and test sets (given gold entities)
- CG task: [task] = cg
- PC task: [task] = pc
- Similarly for: ge11, ge13, epi, id, mlee
sh run.sh predict [task] gold dev
sh run.sh predict [task] gold test
- Check the output in the path
- Retrieve the original offsets and create zip format
sh run.sh offset [task] gold dev
sh run.sh offset [task] gold test
- Submit the zipped file to the shared task evaluation sites:
- Evaluate events
- Evaluate event prediction for PC and CG tasks on the development sets using the shared task scripts.
- Evaluation options: s (softboundary), p (partialrecursive)
sh run.sh eval [task] gold dev sp
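The shared-task steps above (config, predict, offset and, for PC and CG, eval) can be scripted for a given task. The sketch below only builds the command lines as they appear in this README and does not run them. The function name `shared_task_commands` and the assumption that the steps run in the order listed are illustrative, not taken from the repository.

```python
import shlex


def shared_task_commands(task, gpu=0, splits=("dev", "test")):
    """Build the run.sh command lines for one task (commands are not executed)."""
    # gpu follows the README convention: 0 for GPU, -1 for CPU.
    commands = [["sh", "run.sh", "config", task, str(gpu)]]
    for split in splits:
        commands.append(["sh", "run.sh", "predict", task, "gold", split])
    for split in splits:
        commands.append(["sh", "run.sh", "offset", task, "gold", split])
    # The README describes local evaluation only for PC and CG on the dev set.
    if task in ("pc", "cg") and "dev" in splits:
        commands.append(["sh", "run.sh", "eval", task, "gold", "dev", "sp"])
    return commands


if __name__ == "__main__":
    for argv in shared_task_commands("cg", gpu=-1):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

To execute the steps, each list can be passed to `subprocess.run(argv, check=True)` from the repository root, so that a failing step stops the sequence.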
- Abstract
sh pubmed.sh e2e pmid 1370299 cg 0
- Full text
sh pubmed.sh e2e pmcid PMC4353630 cg 0
- Input: PMID: 1370299, PMCID: PMC4353630 (a single PubMed ID to get raw text)
- Model used for prediction: DeepEventMine trained on cg (Cancer Genetics 2013); other options: pc, ge11, etc.
- GPU: 0 (if CPU: -1)
- Output: in brat format and brat visualization
T24 Organism 1248 1254 bovine
T25 Gene_or_gene_product 1255 1259 u-PA
T55 Positive_regulation 1107 1116 increased
T57 Localization 1170 1179 migration
T58 Negative_regulation 1260 1267 blocked
T23 Gene_or_gene_product 1184 1188 u-PA
T56 Positive_regulation 1157 1166 increases
E9 Positive_regulation:T56 Theme:T23
T26 Gene_or_gene_product 1320 1325 c-src
T62 Gene_expression 1326 1336 expression
E10 Gene_expression:T62 Theme:T26
T61 Positive_regulation 1310 1319 increased
E24 Positive_regulation:T61 Theme:E10
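The output above uses brat standoff annotations: `T` lines are text-bound annotations (entities and triggers) with character offsets, and `E` lines are events that point to a trigger and to arguments, which may themselves be events. As a minimal sketch of how such output could be read, the following parser resolves nested events into a readable string. The names `parse_ann` and `render` are illustrative and not part of DeepEventMine, and the sketch assumes contiguous spans only (no discontinuous `;` offsets).

```python
from collections import namedtuple

Span = namedtuple("Span", "id type start end text")
Event = namedtuple("Event", "id type trigger args")


def parse_ann(lines):
    """Parse brat T (text-bound) and E (event) lines into two dictionaries."""
    spans, events = {}, {}
    for line in lines:
        line = line.strip()
        if line.startswith("T"):
            # id, type, start, end, then the covered text (may contain spaces)
            tid, ttype, start, end, text = line.split(None, 4)
            spans[tid] = Span(tid, ttype, int(start), int(end), text)
        elif line.startswith("E"):
            eid, head, *rest = line.split()
            etype, trigger = head.split(":")
            args = [tuple(arg.split(":")) for arg in rest]
            events[eid] = Event(eid, etype, trigger, args)
    return spans, events


def render(eid, spans, events):
    """Render an event recursively, expanding arguments that are events."""
    event = events[eid]
    parts = []
    for role, ref in event.args:
        value = render(ref, spans, events) if ref in events else spans[ref].text
        parts.append(f"{role}={value}")
    return f"{event.type}({spans[event.trigger].text}; {', '.join(parts)})"


SAMPLE = """\
T26 Gene_or_gene_product 1320 1325 c-src
T62 Gene_expression 1326 1336 expression
E10 Gene_expression:T62 Theme:T26
T61 Positive_regulation 1310 1319 increased
E24 Positive_regulation:T61 Theme:E10
""".splitlines()

spans, events = parse_ann(SAMPLE)
print(render("E24", spans, events))
# Positive_regulation(increased; Theme=Gene_expression(expression; Theme=c-src))
```

Here `E24` is a nested event: its Theme is the event `E10`, whose own Theme is the entity `T26`.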
- Given an arbitrary name for your raw text data, for example "my-pubmed"
- Prepare a list of PMID and PMCID in the path
sh pubmed.sh e2e pmids my-pubmed cg 0
- Given an arbitrary name for your raw text data, for example "my-pubmed"
- Prepare your raw text files in the path
sh pubmed.sh e2e rawtext my-pubmed cg 0
- Input: your own raw text or PubMed ID
- Output: predicted entities and events in brat format
- Given an arbitrary name for your raw text data, for example "my-pubmed"
- Prepare your own raw text in the following path
- Or, you can automatically get raw text given PubMed ID or PMC ID
- PubMed ID list
- To get full text from a PMC ID, the article must be available in ePub format (a limitation of the current version).
- Prepare your list of PubMed ID and PMC ID in the path
- Get text from the PubMed ID
sh pubmed.sh pmids my-pubmed
- PubMed ID
- You can also get text by directly entering a PubMed or PMC ID
sh pubmed.sh pmid 1370299
sh pubmed.sh pmcid PMC4353630
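The two commands above differ only in the subcommand, which depends on the kind of identifier. As a small sketch, the helper below picks `pmid` or `pmcid` from the identifier's shape. It assumes that PMIDs are purely numeric and PMC IDs are `PMC` followed by digits, as in the examples; the function name `pubmed_command` is hypothetical and not part of DeepEventMine.

```python
import re


def pubmed_command(identifier):
    """Return the pubmed.sh command line (as a list) for a PMID or PMC ID."""
    ident = identifier.strip()
    if re.fullmatch(r"\d+", ident):
        return ["sh", "pubmed.sh", "pmid", ident]
    if re.fullmatch(r"PMC\d+", ident):
        return ["sh", "pubmed.sh", "pmcid", ident]
    raise ValueError(f"not a recognized PMID or PMC ID: {identifier!r}")


print(pubmed_command("1370299"))
print(pubmed_command("PMC4353630"))
```

Rejecting malformed identifiers before calling the script avoids a network request for an ID that cannot exist.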
sh pubmed.sh preprocess my-pubmed
- Generate config
- Generate config for prediction
- The data name to predict: my-pubmed
- The trained model used for prediction: cg (or pc, ge11, etc)
- If using GPU: [gpu] = 0, otherwise: [gpu] = -1
sh pubmed.sh config my-pubmed cg 0
- Predict
sh pubmed.sh predict my-pubmed
- Retrieve the original offsets
sh pubmed.sh offset my-pubmed
- Check the output in
- Copy the predicted data into the brat folder to visualize
- For the raw text prediction:
sh pubmed.sh brat my-pubmed cg
- Or for the shared task
sh run.sh brat [task] gold dev
sh run.sh brat [task] gold test
- The data to visualize is located in
This work is based on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO). This work is also supported by PRISM (Public/Private R&D Investment Strategic Expansion PrograM).
author = {Trieu, Hai-Long and Tran, Thy Thy and Duong, Khoa N A and Nguyen, Anh and Miwa, Makoto and Ananiadou, Sophia},
title = "{DeepEventMine: End-to-end Neural Nested Event Extraction from Biomedical Texts}",
journal = {Bioinformatics},
year = {2020},
month = {06},
issn = {1367-4803},
doi = {10.1093/bioinformatics/btaa540},
url = {https://doi.org/10.1093/bioinformatics/btaa540},
note = {btaa540},
eprint = {https://academic.oup.com/bioinformatics/article-pdf/doi/10.1093/bioinformatics/btaa540/33399046/btaa540.pdf},