This repo contains the implementation of several Machine Learning algorithms for Named Entity Recognition. We build, train and evaluate them on several datasets, considering three aspects: prediction quality, memory consumption, and inference latency.
See environment.yml. In general, I used tensorflow.keras and scikit-learn for my ML experiments 🔮.
conda env create -f environment.yml
conda activate ner-suite
You can now play with the notebooks!
- data/: directory containing all the datasets used in the notebooks. The datasets are:
  - CoNLL03;
  - Annotated Corpus for NER;
  - WikiNER (English and Italian);
- embeddings/: directory containing the word embeddings:
  - glove.6B.100d.txt for English;
  - w2v.itWac.128d.txt for Italian;
- utils: a package that I made in order to increase code modularity, reusability and readability;
- <algo>-<dataset>.ipynb: the notebooks with the experiments that we made;
- environment.yml: conda environment file used to replicate the environment on your machine and reproduce the experiments;
- results.xlsx: results of the experiments;
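
The GloVe file listed under embeddings/ is a plain-text file with one token per line followed by its vector components, separated by spaces. Below is a minimal sketch of how such a file could be read without any third-party dependency. The function names `parse_embeddings` and `load_embeddings` are hypothetical and are not taken from the utils package; I also assume, without having checked, that the itWac file uses a compatible layout.

```python
import io
from typing import Dict, Iterable, List


def parse_embeddings(lines: Iterable[str], dim: int) -> Dict[str, List[float]]:
    """Parse GloVe-style text lines: a token followed by `dim` floats."""
    vectors: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            # Skip rows that do not match the expected width
            # (for example a header line or a malformed entry).
            continue
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def load_embeddings(path: str, dim: int) -> Dict[str, List[float]]:
    """Read an embedding file from disk (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        return parse_embeddings(handle, dim)


# Tiny in-memory demo with 3-dimensional vectors instead of the real file.
sample = io.StringIO("the 0.1 0.2 0.3\ncat -0.5 0.0 1.5\n")
toy = parse_embeddings(sample, dim=3)
```

With the real file, the call would look like `load_embeddings("embeddings/glove.6B.100d.txt", dim=100)`, assuming the file sits at that path.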
- Conditional Random Fields: a traditional Machine Learning algorithm which can deal with sequences. Refer to the original paper and the implementation of the sklearn wrapper;
- LSTM: a widely used recurrent neural network for modeling sequences. We also use it in combination with pre-trained embeddings like GloVe and itWac;
- End-to-end model: this paper proposes a model that combines a CNN to extract morphological features from the characters of each word, GloVe embeddings to represent word-level features, a bidirectional LSTM to model the context, and finally a CRF layer to decode the best sequence of labels. We implemented it, building on the work already done in this repo.
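
To give an idea of what "decoding the best sequence of labels" means for a CRF, here is a minimal pure-Python sketch of Viterbi decoding, the standard dynamic-programming algorithm for linear-chain CRFs. It is not the code used in the notebooks: the function name `viterbi_decode`, the three-label tag set and all the scores are made up for illustration.

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label sequence.

    emissions[t][j]   : score of label j at position t
    transitions[i][j] : score of moving from label i to label j
    """
    if not emissions:
        return []
    n_labels = len(emissions[0])
    scores = list(emissions[0])          # best score ending in each label so far
    backpointers: List[List[int]] = []   # best previous label for each step
    for t in range(1, len(emissions)):
        new_scores = []
        pointers = []
        for j in range(n_labels):
            best_i = max(range(n_labels),
                         key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_i] + transitions[best_i][j]
                              + emissions[t][j])
            pointers.append(best_i)
        scores = new_scores
        backpointers.append(pointers)
    # Follow the backpointers from the best final label to the start.
    best = max(range(n_labels), key=lambda j: scores[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


# Illustrative (made-up) scores for a three-token sentence.
labels = ["O", "B-PER", "I-PER"]
emissions = [
    [0.1, 2.0, 0.3],
    [0.2, 0.1, 1.5],
    [1.0, 0.0, 0.9],
]
transitions = [
    [0.0, 0.0, -10.0],  # O -> I-PER is heavily penalized
    [0.0, 0.0, 1.0],    # B-PER -> I-PER is encouraged
    [0.0, 0.0, 0.5],    # I-PER -> I-PER is mildly encouraged
]
decoded = [labels[k] for k in viterbi_decode(emissions, transitions)]
```

In this toy example, picking the best label for each token independently would tag the last token as O, while the transition scores make the decoder keep it inside the entity. That is the reason for putting a CRF layer on top of the BiLSTM outputs.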
- Improve documentation;