This project is a cooperation between the Sphere Project at the Max Planck Institute for the History of Science and the Machine Learning Group at TU Berlin within the framework of BIFOLD. The driving force behind this project is our shared belief that AI applications in the humanities should be transparent and explainable. On the one hand, this allows the validation of humanities hypotheses that would otherwise be directly tied to a model's output; on the other, such approaches open the door towards a more "engaging" and "interrogative" interaction with an AI model in the humanities.
To implement such an approach, we relied heavily on the S-VED dataset and enriched it with some extra classes to achieve our objective. We proposed three different case studies - based on this extended dataset - with the aim of generating a 'new' definition of three types of illustration in Early Modern Editions of the Sphaera Corpus, namely, 'Mathematical Instruments', 'Machines', and 'Scientific Illustrations'. These 'new' definitions, derived with XAI methods, help us better hypothesize what the relevant visual features of each of these classes were in the eyes of the early modern printer. The image below shows an example of an LRP heatmap of a mathematical instrument produced by our method. Regions in red, such as the graduated elements and the base, contribute positively to the classification as 'mathematical instrument' and play a major role in the definition of what a mathematical instrument is in our dataset.
While our method is based on the well-curated Sphaera Corpus, this approach is easily transferable to other [historical] datasets. To apply it to your own dataset, see the Getting Started section below.
The recent advancements in the field of Artificial Intelligence (AI) have translated into an increased adoption of AI technology in the humanities, which is often challenged by the limited amount of annotated data, as well as its heterogeneity. Despite the scarcity of data, it has become common practice to design increasingly complex AI models, usually at the expense of human readability, explainability, and trust. This in turn has led to an increased need for tools to help humanities scholars better explain and validate their models as well as their hypotheses. In this paper, we discuss the importance of employing Explainable AI (XAI) methods within the humanities to gain insights into historical processes as well as to ensure model reproducibility and trustworthy scientific results. To drive our point home, we present several representative case studies from the Sphaera project where we analyze a large, well-curated corpus of early modern textbooks using an AI model, and rely on the XAI explanatory outputs to generate historical insights concerning their visual content. More specifically, we show that XAI can be used as a partner when investigating debated subjects in the history of science, such as what strategies were used in the early modern period to showcase mathematical instruments and machines.
@article{ElHajjEberle2023,
author = {Hassan El-Hajj and Oliver Eberle and Anika Merklein and Anna Siebold and Noha Shlomi and Jochen Buettner and Julius Martinetz and Klaus-Robert Mueller and Grégoire Montavon and Matteo Valleriani},
title = {Explainability and Transparency in the Realm of Digital Humanities: Towards a Historian XAI},
year = {2023},
journal = {International Journal of Digital Humanities},
publisher = {Springer}
}
To get a local copy up and running, follow these steps.
To get started, clone this repository:
git clone https://github.com/oeberle/xai_digital_humanities.git
To make sure you have all required packages installed, we provide a requirements.txt that you can install using:
pip3 install -r requirements.txt
After installing these requirements, you should be ready to use the code.
Get started: To reproduce the results found in our paper, you can execute the xai_analysis.ipynb notebook. Our trained model weights model.pt can be downloaded by executing git lfs pull in the project repository.
Data: We have provided sample data in the sample_data/pages directory. The accompanying CSV file sample_data/data_subset.csv contains further information about the source and label of each sample.
XAI: Our explanation code for the computation and visualization of LRP heatmaps can be found in xai.py. An in-depth tutorial on LRP can be found here.
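To illustrate the core idea behind LRP independently of our codebase, here is a minimal sketch of the epsilon-rule on a toy two-layer ReLU network. The weights and input are arbitrary hand-picked values for illustration only; the project's actual implementation in xai.py operates on the full VGG model.

```python
import numpy as np

# Toy two-layer ReLU network (3 inputs -> 4 hidden units -> 2 classes).
W1 = np.array([[ 1.0, -0.5,  0.2],
               [ 0.3,  0.8, -0.1],
               [-0.4,  0.1,  0.6],
               [ 0.2,  0.2,  0.2]])
W2 = np.array([[ 0.5, -0.3,  0.8,  0.1],
               [-0.2,  0.6,  0.4,  0.7]])
x = np.array([1.0, 2.0, -1.0])

# Forward pass (ReLU hidden layer, linear output).
a1 = np.maximum(0, W1 @ x)
out = W2 @ a1

# Backward relevance pass: start from the logit of the predicted class.
eps = 1e-6
k = out.argmax()
R_out = np.zeros_like(out)
R_out[k] = out[k]

# Epsilon-rule: redistribute relevance proportionally to each unit's
# contribution z = w * a to the layer's pre-activation.
z2 = W2 * a1
R_hidden = (z2 / (z2.sum(axis=1, keepdims=True) + eps)).T @ R_out

z1 = W1 * x
R_input = (z1 / (z1.sum(axis=1, keepdims=True) + eps)).T @ R_hidden

# Relevance is (approximately) conserved across layers: the input
# relevances sum back to the explained logit.
print(np.allclose(R_input.sum(), out[k], atol=1e-4))  # True
```

In the image setting, R_input is computed per pixel and rendered as the red/blue heatmap shown above; positive relevance marks regions that support the predicted class.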
If you want to adapt this code to your own research project, we provide some further details below.
Data Collection: If you want to work with your own data, you can follow our data structure, which enables you to directly use the provided dataloader. For this, we assume that data is organized as follows:
<data_root>/<book_name>/raw/<book_name>_0001.jpg
For example:
sample_data/pages/2257_fine_sphaera_1551/raw/2257_fine_sphaera_1551_p51.jpg
with
data_root = sample_data/pages
book_name = 2257_fine_sphaera_1551
page_id = 2257_fine_sphaera_1551_p51.jpg
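As a quick sanity check, the layout above can be reproduced in a few lines of Python, using the sample values just shown:

```python
from pathlib import Path

# Rebuild the expected image path from its components
# (sample values from the layout above).
data_root = Path("sample_data/pages")
book_name = "2257_fine_sphaera_1551"
page_id = f"{book_name}_p51.jpg"

image_path = data_root / book_name / "raw" / page_id
print(image_path.as_posix())
# sample_data/pages/2257_fine_sphaera_1551/raw/2257_fine_sphaera_1551_p51.jpg
```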
In the accompanying CSV file, you need to provide the following fields:
- page_id: identifier composed of book_name and page number (see above), e.g. 2257_fine_sphaera_1551_p51.jpg
- label: the class label as an integer, from 0, 1, ..., #classes
- xywh: if available, the bounding box coordinates, e.g. "[78,128,307,416]", otherwise just None
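A minimal sketch of producing such a CSV with Python's standard csv module follows; the rows are hypothetical examples, and the exact column names your dataloader expects should be checked against sample_data/data_subset.csv.

```python
import csv

# Hypothetical annotation rows following the field description above.
rows = [
    {"page_id": "2257_fine_sphaera_1551_p51.jpg", "label": 1,
     "xywh": "[78,128,307,416]"},
    {"page_id": "2257_fine_sphaera_1551_p52.jpg", "label": 0,
     "xywh": "None"},  # no bounding box available for this page
]

# Write the annotations to a CSV file with a header row.
with open("my_data.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["page_id", "label", "xywh"])
    writer.writeheader()
    writer.writerows(rows)
```

Note that the xywh string contains commas, so the csv writer quotes it automatically.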
Model: To adapt the basic VGG model to predict the correct number of classes, you have to set the number of classes n_classes = ... in run_training.py.
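As an illustration, adapting a VGG-style classifier head to a new number of classes amounts to swapping its final linear layer. The snippet below is a sketch using the standard VGG-16 layer sizes; run_training.py may wire this up differently, and the class count here is hypothetical.

```python
import torch.nn as nn

n_classes = 3  # hypothetical: e.g. instruments, machines, scientific illustrations

# VGG-16-style classifier head: 512*7*7 flattened conv features -> 1000 ImageNet classes.
classifier = nn.Sequential(
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(),
    nn.Linear(4096, 1000),
)

# Replace the last layer so the model predicts n_classes classes instead.
classifier[-1] = nn.Linear(4096, n_classes)
print(classifier[-1].out_features)  # 3
```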
Training: Specify the correct data_root and the location of your CSV file, and start optimization by executing python run_training.py. Results will be saved into your specified savedir=.... To track training performance, we use the TensorboardX framework. You can access the results overview by calling tensorboard --logdir <savedir> --port 6077 --bind_all.
XAI: After initializing your model with the resulting model weights in <savedir>/.../model_train.pt, you can follow the XAI analysis in xai_analysis.ipynb. For a full XAI corpus analysis, you can also adapt the run_xai_analysis.py script.
Distributed under the MIT License.
Oliver Eberle - @oeberle Hassan El-Hajj - @hassanhajj910
Project Link: https://github.com/oeberle/xai_digital_humanities.git