Computing robust explainable clusters for a set of entities. Clusters' explanations are produced based on the facts surrounding these entities in the KG. Furthermore, ExCut reuses the explanations to enhance the clustering quality.
Technical and experimental details can be found in ExCut's Technical Report: https://mpi-inf.mpg.de/~gadelrab/downloads/ExCut/ExCut_TR.pdf
The `envs` folder contains dumps of the required dependencies for pip and conda. We provide two dumps: `excut_env.yml` (TensorFlow, CPU only) and `excut-gpu_env.yml` (TensorFlow with GPU support).
Install the environment:

```shell
conda env create -f excut_env.yml
```

or, if GPU support is required:

```shell
conda env create -f excut-gpu_env.yml
```
Activate the environment (if it is not the default) before running the code:

```shell
conda activate <env>
```
A triple store with a SPARQL endpoint is required. For that we used Virtuoso 7.2 Community Edition. An easy way to run Virtuoso is via the Docker container provided in `docker-compose.yml`, as follows:

- Edit the data-volume location in the `docker-compose.yml` file (optional: only if you would like to persist the KGs).
- Run the command:

```shell
docker compose up
```

Alternatively, Virtuoso 7.2 Community Edition can be downloaded and installed directly. In that case, the `SPARQL_UPDATE` permission should be assigned to the SPARQL endpoint; details can be found in the Virtuoso documentation.
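Once the endpoint is up, a quick sanity check is to send a trivial SPARQL `ASK` query. The sketch below only builds the request URL using the standard library; the host/port `localhost:8890` is Virtuoso's default and may differ in your setup. Uncomment the last lines to actually hit a running endpoint.

```python
from urllib.parse import urlencode
from urllib.request import urlopen  # only needed to actually send the request

# Virtuoso's default SPARQL endpoint (adjust host/port to your setup)
endpoint = "http://localhost:8890/sparql"

# A trivial ASK query: does the store contain any triple at all?
params = urlencode({"query": "ASK { ?s ?p ?o }", "format": "application/json"})
request_url = f"{endpoint}?{params}"
print(request_url)

# To actually query the endpoint (requires a running Virtuoso):
# with urlopen(request_url) as resp:
#     print(resp.read().decode())
```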
ExCut supports the Multicut algorithm, which is implemented in C++ and shipped as a binary inside the `excut/excut/clustering/multicut/` folder. If it will be used, the correct permissions should be granted by running:

```shell
chmod +x excut/excut/clustering/multicut/find-clustering
```

Note: this binary might not work under macOS.
Run the tests via:

```shell
pytest ./tests
```
The `cli/main.py` file is the main entry point of the explainable clustering approach. The file `run_yago.sh` shows an example invocation.

The parameters are:
usage: python -m cli.main [-h] [-t TARGET_ENTITIES] [-kg KG] [-o OUTPUT_FOLDER] [-steps]
[-itrs MAX_ITERATIONS] [-e EMBEDDING_DIR] [-Skg]
[-en ENCODING_DICT_DIR] [-ed EMBEDDING_ADAPTER]
[-em EMBEDDING_METHOD] [-host HOST] [-index INDEX] [-index_d]
[-id KG_IDENTIFIER] [-dp DATA_PREFIX] [-dsafe]
[-q OBJECTIVE_QUALITY] [-expl_cc EXPL_C_COVERAGE]
[-pr_q PREDICTION_MIN_Q] [-us UPDATE_STRATEGY]
[-um UPDATE_MODE] [-ud UPDATE_DATA_MODE]
[-uc UPDATE_CONTEXT_DEPTH] [-ucf CONTEXT_FILEPATH]
[-uh UPDATE_TRIPLES_HISTORY] [-ulr UPDATE_LEARNING_RATE]
[-c CLUSTERING_METHOD] [-k NUMBER_OF_CLUSTERS]
[-cd CLUSTERING_DISTANCE] [-cp CUT_PROB] [-comm COMMENT]
[-rs SEED] [-ll MAX_LENGTH] [-ls LANGUAGE_STRUCTURE]
optional arguments:
-h, --help show this help message and exit
-t TARGET_ENTITIES, --target_entities TARGET_ENTITIES
Target entities file
-kg KG, --kg KG Triple format file <s> <p> <o>
-o OUTPUT_FOLDER, --output_folder OUTPUT_FOLDER
Folder to write output to
-steps, --save_steps Save intermediate results
-itrs MAX_ITERATIONS, --max_iterations MAX_ITERATIONS
Maximum iterations
-e EMBEDDING_DIR, --embedding_dir EMBEDDING_DIR
Folder of initial embedding
-Skg, --sub_kg Only use subset of the KG to train the base embedding
-en ENCODING_DICT_DIR, --encoding_dict_dir ENCODING_DICT_DIR
Folder containing the encoding of the KG
-ed EMBEDDING_ADAPTER, --embedding_adapter EMBEDDING_ADAPTER
Adapter used for embedding
-em EMBEDDING_METHOD, --embedding_method EMBEDDING_METHOD
Embedding method
-host HOST, --host HOST
SPARQL endpoint host and port (host_ip:port)
-index INDEX, --index INDEX
Index input KG (memory | remote)
-index_d, --drop_index
Drop old index
-id KG_IDENTIFIER, --kg_identifier KG_IDENTIFIER
KG identifier URL; default:
http://exp-<start_time>.org
-dp DATA_PREFIX, --data_prefix DATA_PREFIX
Data prefix
-dsafe, --data_safe_urls
Fix the urls (id) of the entities
-q OBJECTIVE_QUALITY, --objective_quality OBJECTIVE_QUALITY
Objective quality function
-expl_cc EXPL_C_COVERAGE, --expl_c_coverage EXPL_C_COVERAGE
Minimum per cluster explanation coverage ratio
-pr_q PREDICTION_MIN_Q, --prediction_min_q PREDICTION_MIN_Q
Minimum prediction quality
-us UPDATE_STRATEGY, --update_strategy UPDATE_STRATEGY
Strategy for update
-um UPDATE_MODE, --update_mode UPDATE_MODE
Embedding Update Mode
-ud UPDATE_DATA_MODE, --update_data_mode UPDATE_DATA_MODE
Embedding Adaptation Data Mode
-uc UPDATE_CONTEXT_DEPTH, --update_context_depth UPDATE_CONTEXT_DEPTH
The depth of the Subgraph surrounding target entities
-ucf CONTEXT_FILEPATH, --context_filepath CONTEXT_FILEPATH
File with context triples for the target entities
-uh UPDATE_TRIPLES_HISTORY, --update_triples_history UPDATE_TRIPLES_HISTORY
Number of iterations of feedback triples to consider
in the progressive update
-ulr UPDATE_LEARNING_RATE, --update_learning_rate UPDATE_LEARNING_RATE
Update Learning Rate
-c CLUSTERING_METHOD, --clustering_method CLUSTERING_METHOD
Clustering Method
-k NUMBER_OF_CLUSTERS, --number_of_clusters NUMBER_OF_CLUSTERS
Number of clusters
-cd CLUSTERING_DISTANCE, --clustering_distance CLUSTERING_DISTANCE
Clustering Distance Metric
-cp CUT_PROB, --cut_prob CUT_PROB
Cutting Probability
-comm COMMENT, --comment COMMENT
A simple comment to be stored
-rs SEED, --seed SEED
Randomization Seed for experiments
-ll MAX_LENGTH, --max_length MAX_LENGTH
Maximum length of the description
-ls LANGUAGE_STRUCTURE, --language_structure LANGUAGE_STRUCTURE
Structure of the learned description
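A minimal example invocation using some of the options above. The file paths and the endpoint are placeholders; see `run_yago.sh` in the repository for a complete, working example.

```shell
# Placeholder paths and endpoint; adapt them to your setup.
python -m cli.main \
    -t data/target_entities.tsv \
    -kg data/kg_triples.nt \
    -o output/ \
    -host localhost:8890 \
    -id http://yago-expr.org \
    -k 5 \
    -itrs 10
```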
The following explains the code in the example file `examples/simple_clustering_pipeline.py`:
- Load the KG triples:

  ```python
  from excut.kg.kg_triples_source import load_from_file

  kg_triples = load_from_file('<file_path>')
  ```

  Note: a) a prefix is required if the data does not have valid URIs; b) when loading Yago data, the `safe_url` argument should be set to `True`, as Yago URIs have special characters.
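For reference, the expected input is one `<s> <p> <o>` triple per line (as in the `-kg` option above). A minimal, illustrative reader for this format might look as follows; ExCut's `load_from_file` is the real loader, this sketch only shows the format.

```python
def parse_triple_line(line):
    """Split one whitespace-separated '<s> <p> <o>' line into a (s, p, o) tuple."""
    parts = line.strip().split(None, 2)  # split on whitespace into at most 3 fields
    if len(parts) != 3:
        raise ValueError(f"Malformed triple line: {line!r}")
    s, p, o = parts
    # Strip the trailing '.' that N-Triples-style files often carry after the object
    o = o.rstrip(" .")
    return s, p, o

print(parse_triple_line("<http://yago-expr.org/Einstein> <hasWonPrize> <NobelPrize> ."))
# → ('<http://yago-expr.org/Einstein>', '<hasWonPrize>', '<NobelPrize>')
```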
- Index the KG triples: the current explanation mining requires the KG triples to be indexed either in a remote SPARQL endpoint (e.g., Virtuoso) or in memory.

  ```python
  from excut.kg.kg_indexing import Indexer

  kg_indexer = Indexer(store='remote', endpoint='<vos endpoint url>',
                       identifier='http://yago-expr.org')
  kg_indexer.index_triples(kg_triples, drop_old=False)
  ```

  The KG identifier is a URL-like name for the graph (`http://yago-expr.org`); it is required in the mining process.
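Under the hood, indexing into a named graph amounts to SPARQL UPDATE statements of roughly the following shape. This is an illustrative sketch of the query text only; ExCut's `Indexer` issues the actual updates.

```python
def build_insert_query(graph_iri, triples):
    """Build a SPARQL INSERT DATA statement adding triples to a named graph.

    `triples` is an iterable of (s, p, o) strings already formatted as terms,
    e.g. ('<http://ex.org/a>', '<http://ex.org/p>', '<http://ex.org/b>').
    """
    body = " .\n        ".join(f"{s} {p} {o}" for s, p, o in triples)
    return (
        f"INSERT DATA {{\n"
        f"    GRAPH <{graph_iri}> {{\n"
        f"        {body} .\n"
        f"    }}\n"
        f"}}"
    )

query = build_insert_query(
    "http://yago-expr.org",
    [("<http://ex.org/Einstein>", "<http://ex.org/hasWonPrize>", "<http://ex.org/NobelPrize>")],
)
print(query)
```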
- Load the clustered entities: after clustering, the results should be loaded into one of the implementations of `EntityLabelsInterface` in the module `clustering.target_entities`. Loading methods are provided in the module. Example:

  ```python
  from excut.clustering.target_entities import load_from_file

  clustering_results_as_triples = load_from_file('<file_path>')
  ```
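Conceptually, the clustering results are entity-to-cluster assignments represented as triples. The self-contained illustration below shows this representation; the predicate name `belongsTo` is an assumption made for the example, not ExCut's actual vocabulary.

```python
from collections import defaultdict

# Cluster assignments as (entity, predicate, cluster) triples
assignments = [
    ("<Einstein>", "belongsTo", "cluster_0"),
    ("<Curie>",    "belongsTo", "cluster_0"),
    ("<Mozart>",   "belongsTo", "cluster_1"),
]

# Group entities by cluster label
clusters = defaultdict(list)
for entity, _, cluster in assignments:
    clusters[cluster].append(entity)

print(dict(clusters))
# → {'cluster_0': ['<Einstein>', '<Curie>'], 'cluster_1': ['<Mozart>']}
```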
- Explain the clusters. Example:

  ```python
  # a) The explaining engine requires two interfaces: one to index the labels,
  #    and one to query the whole KG triples as well as the labels.
  from excut.kg.kg_indexing import Indexer
  from excut.kg.kg_query_interface_extended import EndPointKGQueryInterfaceExtended

  query_interface = EndPointKGQueryInterfaceExtended(
      sparql_endpoint='<vos endpoint url>',
      identifiers=['http://yago-expr.org', 'http://yago-expr.org.extension'],
      labels_identifier='http://yago-expr.org.labels')

  # b) Create the explaining engine
  from excut.explanations_mining.explaining_engines_extended import PathBasedClustersExplainerExtended

  explaining_engine = PathBasedClustersExplainerExtended(query_interface,
                                                        quality_method=objective_measure,
                                                        min_coverage=0.5)

  # c) Explain the clusters
  explanations_dict = explaining_engine.explain(clustering_results_as_triples,
                                                '<output file path>')

  # d) Compute the aggregate quality of the explanations
  import excut.evaluation.explanations_metrics as explm
  # Evaluate the rules' quality
  explm.aggregate_explanations_quality(explanations_dict)
  ```
Note: `QueryInterfaceExtended` is the interface to the indexed KG triples and the labels of the target entities. It requires as input the identifiers of the KGs to mine over. A single KG can be stored in several subgraphs, each with a different identifier; in that case, all of them should be listed, as shown in the code above. We recommend using a fresh identifier for the entity labels, different from the one used for the original KG, e.g. `http://yago-expr.org.labels` (where `.labels` was appended).
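To illustrate what such an aggregate quality might compute, the self-contained sketch below averages a per-cluster score (here simply coverage: the fraction of a cluster's entities matched by its best explanation). The data structure and field names are assumptions made for the example, not ExCut's actual data model.

```python
# Hypothetical per-cluster explanation stats: entities covered by the best
# explanation vs. cluster size (field names are illustrative only).
explanations = {
    "cluster_0": {"covered": 8,  "size": 10},
    "cluster_1": {"covered": 45, "size": 50},
    "cluster_2": {"covered": 3,  "size": 10},
}

def aggregate_coverage(expl_dict):
    """Average the per-cluster coverage ratio over all clusters."""
    ratios = [e["covered"] / e["size"] for e in expl_dict.values()]
    return sum(ratios) / len(ratios)

print(round(aggregate_coverage(explanations), 3))
# → 0.667
```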
These are some important modules when developing in ExCut:

- Package `explanations_mining`
  - Module `.explanations_quality_functions`: contains the quality functions used to score explanations.
- Package `evaluation`
  - Module `.clustering_metrics`: traditional clustering quality measures.
  - Module `.explanations_metrics`: aggregation of explanations quality.
  - Module `.eval_utils`: useful scripts for evaluation, such as exporting to CSV and plotting.
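As an illustration of what an explanation quality function can look like, the sketch below scores an explanation by weighted relative accuracy (WRA), a common rule-quality measure. The formulation here is generic and not necessarily the one implemented in `explanations_quality_functions`.

```python
def weighted_relative_accuracy(matched, matched_in_cluster, cluster_size, total):
    """Generic weighted relative accuracy of an explanation for one cluster.

    matched            -- entities matched by the explanation (in any cluster)
    matched_in_cluster -- matched entities that belong to the target cluster
    cluster_size       -- number of entities in the target cluster
    total              -- total number of target entities
    """
    if matched == 0:
        return 0.0
    p_rule = matched / total                         # P(rule fires)
    p_cluster_given_rule = matched_in_cluster / matched  # precision of the rule
    p_cluster = cluster_size / total                 # base rate of the cluster
    return p_rule * (p_cluster_given_rule - p_cluster)

# Explanation matches 12 entities, 10 of them in a cluster of 20, out of 100 entities.
print(round(weighted_relative_accuracy(12, 10, 20, 100), 4))
# → 0.076
```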
Please contact gadelrab [at] mpi-inf.mpg.de for further questions.
ExCut is open-sourced under the Apache 2.0 license. See the LICENSE file for details.