This repository contains source code and RML mappings used for creating MLSea-KG, a declaratively constructed and regularly updated machine learning KG with more than 1.44 billion RDF triples containing metadata about machine learning:
- Datasets
- Tasks
- Implementations and related hyper-parameters
- Experiment executions, their configuration settings and evaluation results
- Code notebooks and repositories
- Algorithms
- Publications
- Models
- Scientists and practitioners
The data were gathered and integrated from OpenML, Kaggle and Papers with Code.
Resource code directory contains resource code used for collecting, pre-processing, sampling and declaratively generating RDF triples, using the declarative mappings included. The input data sources used are the OpenML data extracted from the OpenML API, the Meta Kaggle CSVs and the Papers with Code dumps, which are not included in this repository.
OpenML CSV dumps are also generated, to store data retrieved from the OpenML API.
The RML mapping that were used for each platform are also provided, demonstrating the rules used to declaratively construct MLSea-KG. Both common RML mappings and the corresponding in-memory RML mappings used to generate RDF from in-memory samples are provided, complemented by their YARRRML serialization.
MLSea-KG is accessible through our SPARQL endpoint. The sparql_examples folder contains example queries for traversing MLSea-KG.
MLSea-KG snapshots are available at MLSea-KG's Zenodo repository.
Clone the repository:
git clone https://github.com/dtai-kg/MLSea-KGC.git
Install dependencies:
pip install requirements.txt
Import Original Data Sources
-
Edit 'config.py' to set the target locations where imported data sources will be stored.
-
Download Kaggle metadata through the Meta Kaggle dataset.
-
Download Papers with Code metadata through the Papers with Code dump files.
-
Download OpenML metatadata and store them as CSV backups through the OpenML API service with 'openml_data_collector.py':
python openml_data_collector.py
Process RDF Mappings
View and explore the RDF mappings. Make necessary changes to the input sources paths to point to the location of your local data sources.
Generate RDF dumps
Generate the RDF dumps of MLSea-KG by running:
python data_integration_openml.py
python data_integration_kaggle.py
python data_integration_pwc.py
Thank you for reading! To cite our resource:
@InProceedings{dasoulas2024mlsea,
author = {Dasoulas, Ioannis and Yang, Duo and Dimou, Anastasia},
booktitle = {The Semantic Web},
title = {{MLSea: A Semantic Layer for Discoverable Machine Learning}},
year = {2024}
}