evidence-features

Linguistic feature extraction for German (lang: de) as 8-bit integer representations.

Install a virtual environment for CPU

# Ensure that the Python packages are available
sudo apt install python3-venv

# install virtualenv
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip

# install other packages
pip install --use-pep517 -e .
# pip install --use-pep517 -r requirements.txt --no-cache-dir
pip install --use-pep517 -r requirements-dev.txt --no-cache-dir
pip install --use-pep517 -r requirements-demo.txt --no-cache-dir

# reinstall TF for better Intel-CPU support
# pip install intel-tensorflow

Alternatively, install the Python package evidence-features directly from GitHub.

pip install git+ssh://git@github.com/satzbeleg/evidence-features.git

Install MiniConda for GPU

To ensure compatible CUDA drivers, use Conda to install them (Nvidia does not maintain PyPI packages).

conda install -y pip
conda create -y --name gpu-venv-evidence-features python=3.9 pip
conda activate gpu-venv-evidence-features

conda install -y cudatoolkit=11.3.1 cudnn=8.3.2 -c conda-forge
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
pip install torch==1.12.1+cu113 torchvision torchaudio -f https://download.pytorch.org/whl/torch_stable.html

# install other packages
pip install -e .
# pip install -r requirements.txt --no-cache-dir
pip install -r requirements-dev.txt --no-cache-dir
pip install -r requirements-demo.txt --no-cache-dir
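
After the installation, it can be useful to verify that PyTorch actually sees the GPU. The following is a minimal sketch, not part of this repository; `cuda_status` is a hypothetical helper name. It only uses `torch.cuda.is_available()`, `torch.cuda.device_count()`, and `torch.version.cuda`, and falls back to a message if PyTorch is not installed.

```python
import importlib.util


def cuda_status():
    """Return a short, human-readable description of the PyTorch/CUDA setup."""
    # Avoid a hard failure when PyTorch is not installed in this environment.
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch

    if torch.cuda.is_available():
        # torch.version.cuda is the CUDA version PyTorch was built against.
        return f"CUDA {torch.version.cuda} available, {torch.cuda.device_count()} device(s)"
    return "torch is installed, but no CUDA device is visible"


print(cuda_status())
```

If the last message appears inside the Conda environment, checking `LD_LIBRARY_PATH` as exported above is a reasonable first step.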

Install MiniConda if it is not installed yet

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# prevent conda autostart in shell
# conda config --set auto_activate_base false

Download pretrained models and statistics

Download from original sources

The software uses pretrained NLP models and statistics.

# Ensure Debian packages are available
sudo apt install unzip p7zip p7zip-full

# activate the environment, because the download script calls some Python packages
conda activate gpu-venv-evidence-features
# source .venv/bin/activate

# set the location for pretrained models and other lists
export MODELFOLDER="$(pwd)/models"
# download
bash download-models.sh

# run tests
pytest
# check time measurement
python test/check_timer.py
# run example
python test/check_todisk.py
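
Before running the tests, a quick pre-flight check of `MODELFOLDER` can save a confusing error later. This is a sketch under assumptions: `resolve_model_folder` is a hypothetical helper and does not reflect how evidence-features itself reads the variable. It only checks that the variable is set and points to an existing directory.

```python
import os
from pathlib import Path


def resolve_model_folder(environ=os.environ):
    """Return the MODELFOLDER path, or raise if it is unset or missing."""
    value = environ.get("MODELFOLDER")
    if not value:
        raise RuntimeError("MODELFOLDER is not set; run `export MODELFOLDER=...` first")
    folder = Path(value).expanduser()
    if not folder.is_dir():
        raise FileNotFoundError(f"MODELFOLDER does not exist: {folder}")
    return folder
```

Calling `resolve_model_folder()` in the shell where `pytest` will run shows whether the export above took effect.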

Download via DVC Backend

If you have access to ZDL's DVC backend, run

dvc pull -r zdl

Features Overview

Currently, 1024 binary and 157 floating-point features are extracted, which can be stored as 293 int8 elements in a database.

ID	Language level	Used models & statistics	Metrics	Memory
1	semantics	SBert, `paraphrase-multilingual-MiniLM-L12-v2`; Hashed random projection	Contextual sentence embeddings	1024-bit or 128x Int8 (128 bytes)
2	morphosyntax	Trankit, `german-hdt`, CoNLL-U UPOS	Distribution of Part-of-Speech (PoS) tags of a sentence	16+1x Int8 (17 bytes)
3	morphosyntax	Trankit, `german-hdt`, CoNLL-U Universal Features	Distribution of other lexical and grammatical properties in a sentence	47+1x Int8 (48 bytes)
4	syntax	Trankit, `german-hdt`, dependency parser; node-distance	The distribution of the shortest paths between all nodes (word tokens) within the dependency tree of a sentence; adjusted by the visual distance between words.	21x Int8 (21 bytes)
5	phonetics	epitran, `deu-Latn`; ipasymbols	The number of IPA-based consonant clusters within a sentence	3+1x Int16 (8 bytes)
6, 7	morphology	DeReChar	Distribution of character and character-level bi-gram frequencies	6+1 + 10+1 Int16 (36 bytes)
8	lexicology	COW	Distribution of lemmata frequencies	6+1x Int8 (7 bytes)
9	morphology	SMOR	Occurrence of a) all possible parsed variants (syntactic ambivalence), b) all possible unique lexemes (lexeme ambivalence), c) the longest possible lexeme (working memory for compound comprehension)	14+1x Int8 (15 bytes)
12	-	-	Other statistics, e.g., text length	2x Int16 (4 bytes)
13	semantics	FastText language detection	Probability of language or dialect (de, nds, als, bar) or language groups (Franconian, North Germanic, Anglo-Frisian, Romance, Slavic)	10x Int8 (10 bytes)
14	semantics	Emoji Sentiment	Distribution of emoji frequencies, pos., neg., and neutral sentiment for all emojis within a sentence	22+1 Int8 (23 bytes)

The following features are not included in .to_float(); only the function .to_int() returns them.

ID	Language level	Used models & statistics	Metrics	Dim (as int8)
15	syntax	Uses Trankit, `german-hdt`, dependency parser; datasketch.MinHash, mmh3, and treesimi	MinHash/mmh3 hashes for syntactic similarity	32x Int32 (128 bytes)
16	-	Uses datasketch.MinHash, mmh3, and kshingle	MinHash/mmh3 hashes for Near Duplicate Detection	32x Int32 (128 bytes)
17	-	-	List of headwords (lemmata) that are NOUN, VERB, or ADJ	List
18	-	same as 16; For hashing bibliographic information as simple string	same as 16	32x Int32 (128 bytes)
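
To illustrate how 32 Int32 hashes can support near-duplicate detection, here is a minimal MinHash sketch over character k-shingles. It is not the package's implementation: it swaps in `hashlib.blake2b` with a per-function salt instead of datasketch and mmh3, and the names `kshingles`, `minhash_int32`, and `estimated_jaccard` as well as the shingle size `k=5` are assumptions for illustration.

```python
import hashlib


def kshingles(text, k=5):
    """Return the set of all character k-grams (shingles) of a text."""
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def minhash_int32(text, num_perm=32, k=5):
    """Return num_perm signed 32-bit MinHash values for a text."""
    shingles = [s.encode("utf-8") for s in kshingles(text, k)]
    signature = []
    for seed in range(num_perm):
        # The salt turns blake2b into a family of num_perm different hash functions.
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s, digest_size=4, salt=salt).digest(),
                "little", signed=True)
            for s in shingles
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Share of matching positions, an estimate of the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


a = minhash_int32("Die Kuh macht muh, der Hund wufft aber lauter.")
b = minhash_int32("Die Kuh macht muh, der Hund wufft aber viel lauter.")
print(estimated_jaccard(a, b))
```

Two sentences that share most of their shingles agree in most of the 32 positions, so comparing 128-byte signatures can stand in for comparing the full texts.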

Int8 vs floating-point features

All features are encoded as Int8 features. Most features are count data or naturally integer numbers that are transformed to ratios later on, i.e., we save 8-bit integers instead of 32-bit floating-point numbers. In the case of SBert, we compress the floating-point features with hashed random projections to bit values that are stored as Int8 representations. The storage requirement can be reduced by a factor of 12 to 16.
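
The compression step for the SBert features can be sketched as follows. This is an illustrative pure-Python version of a hashed random projection, not the code used by the package; the function name, the Gaussian hyperplanes, and the seed are assumptions. Each output bit records on which side of a random hyperplane the embedding lies, and eight bits are packed into one signed 8-bit integer.

```python
import random


def hashed_random_projection(vec, n_bits=1024, seed=42):
    """Compress a float vector to n_bits sign bits, packed as signed int8 values."""
    if n_bits % 8:
        raise ValueError("n_bits must be a multiple of 8")
    rng = random.Random(seed)  # fixed seed: every vector sees the same hyperplanes
    packed, byte = [], 0
    for i in range(n_bits):
        # One random hyperplane per output bit; the bit is the side the vector lies on.
        dot = sum(x * rng.gauss(0.0, 1.0) for x in vec)
        byte = (byte << 1) | int(dot > 0)
        if i % 8 == 7:
            packed.append(byte - 256 if byte > 127 else byte)  # uint8 -> int8
            byte = 0
    return packed


embedding = [0.01 * i - 1.0 for i in range(384)]  # stand-in for a 384-dim float32 vector
code = hashed_random_projection(embedding)
print(len(code), 384 * 4 / len(code))
```

A 384-dimensional float32 embedding, the output size of `paraphrase-multilingual-MiniLM-L12-v2`, occupies 1536 bytes, while the 1024-bit code occupies 128 bytes. That ratio of 12 matches the lower end of the reduction factor stated above.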

Demo Scripts and Notebooks

Correlation among features

# source .venv/bin/activate
# or start conda, and set path to conda's CUDA
conda activate gpu-venv-evidence-features 
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/

# limit how much GPU RAM (in MB) PyTorch can reserve (e.g. trankit, sbert)
#export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512,garbage_collection_threshold:0.7

# assign Sbert and Tensorflow to other GPU device
# export BERT_GPUID=1

export MODELFOLDER="$(pwd)/models"
cd demo/corr
bash download-corpora.sh
python3 preprocess.py
jupyter lab

Benchmarking

Sentence embedding evaluation for German

# source .venv/bin/activate
# or start conda, and set path to conda's CUDA
conda activate gpu-venv-evidence-features 
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/

# limit how much GPU RAM (in MB) PyTorch can reserve (e.g. trankit, sbert)
#export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512,garbage_collection_threshold:0.7

export MODELFOLDER="$(pwd)/models"
cd demo/benchmark
bash download-datasets.sh
nohup python3 run.py > log.log &
# CUDA_LAUNCH_BLOCKING=1  python3 run.py
tail -f log.log
watch -n 0.5 nvidia-smi

Balanced F1 scores on the test sets. EV Feats uses hashed random projections of the SBert features instead of the original floating-point embeddings, so its F1 scores are expected to decrease, but only slightly.

Task	SBert	EV Feats
FCLAIM	0.672	0.632
VMWE	0.751	0.730
OL19-C	0.611	0.598
ABSD-2	0.521	0.528
MIO-P	0.820	0.844
ARCHI	0.374	0.357
LSDC	0.396	0.406

Compute Scores with QUAXA

import evidence_features as evf
import json
import quaxa

sentences = [
    "Dieser Satz ist ein Beispiel, aber eher kurz.",
    "Die Kuh macht muh, der Hund wufft aber lauter."
]

(
    feats1, feats2, feats3, feats4, feats5, feats6, feats7, feats8,
    feats9, feats12, feats13, feats14, hashes15, hashes16,
    sentences_sbd, lemmata17, spans, annotations
) = evf.to_int(sentences, measure_time=True, sbert_masking=True)

# convert `annotations` (JSON strings from Trankit) to CoNLL-U style token dicts
def format_trankit_to_conllu(batch_annot):
    batch_result = []
    for annot in batch_annot:
        result = []
        for t in json.loads(annot):
            tmp_feats = t.get("feats")
            if isinstance(tmp_feats, str):
                # split "Key=Value|Key=Value"; skip entries without "=" such as "_"
                tmp_feats = dict(f.split("=", 1) for f in tmp_feats.split("|") if "=" in f)
            result.append({
                "id": t.get("id"),
                "form": t.get("text"),
                "lemma": t.get("lemma"),
                "upos": t.get("upos"),
                "xpos": t.get("xpos"),
                "feats": tmp_feats,
                "head": t.get("head"),
                "deprel": t.get("deprel"),
                "deps": t.get("deps"),
                "misc": t.get("misc"),
                "span": t.get("span"),
                "ner": t.get("ner")
            })
        batch_result.append(result)
    return batch_result

conll_annot = format_trankit_to_conllu(annotations)

# quaxa
for lemmas, sent, annot in zip(lemmata17, sentences_sbd, conll_annot):
    for headword in lemmas:
        score = quaxa.total_score(headword=headword, txt=sent, annotation=annot)
        print(score, headword, sent)
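
The feats handling inside `format_trankit_to_conllu` can be tried in isolation. The helper below is a hypothetical stand-alone variant, not part of evidence-features or quaxa. It assumes the usual CoNLL-U conventions: features are written as `Key=Value` pairs separated by `|`, and `_` marks a token without features.

```python
def parse_feats(feats):
    """Convert a CoNLL-U FEATS string such as "Case=Nom|Number=Sing" to a dict."""
    if not isinstance(feats, str):
        return feats  # already a dict, or None when the token has no features
    if feats in ("", "_"):
        return None  # "_" is the CoNLL-U placeholder for "no features"
    # Split each pair only at the first "=" and ignore malformed entries.
    return dict(pair.split("=", 1) for pair in feats.split("|") if "=" in pair)


print(parse_feats("Case=Nom|Gender=Fem|Number=Sing"))
# prints {'Case': 'Nom', 'Gender': 'Fem', 'Number': 'Sing'}
```

Whether quaxa expects `None` or an empty dict for tokens without features is not stated here, so that choice is an assumption.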

Appendix

Python commands

Jupyter for the examples: jupyter lab
Check syntax: flake8 --ignore=F401 --exclude=demo,$(grep -v '^#' .gitignore | xargs | sed -e 's/ /,/g')
Run Unit Tests: PYTHONPATH=. pytest

Clean up

find . -type f -name "*.pyc" | xargs rm
find . -type d -name "__pycache__" | xargs rm -r
rm -r .pytest_cache
rm -r .venv

Citation

You can cite the following paper if you want to use this repository in your research work.

@inproceedings{hamster-2022-everybody,
    title = "Everybody likes short sentences - A Data Analysis for the Text Complexity {DE} Challenge 2022",
    author = "Hamster, Ulf A.",
    booktitle = "Proceedings of the GermEval 2022 Workshop on Text Complexity Assessment of German Text",
    month = sep,
    year = "2022",
    address = "Potsdam, Germany",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.germeval-1.2",
    pages = "10--14",
}

Support

Please open an issue for support.

Contributing

Please contribute using GitHub Flow. Create a branch, add commits, and open a pull request.
