The Multimodal Universe dataset is a large-scale collection of multimodal astronomical data, including images, spectra, and light curves, which aims to enable research into foundation models for astrophysics and beyond.
All datasets can be previewed directly from our Hugging Face hub and accessed via load_dataset('MultimodalUniverse/dataset_name').
Preview datasets include ~1k examples from each survey.
from datasets import load_dataset
dset = load_dataset('MultimodalUniverse/plasticc',
                    split='train', streaming=True)
example = next(iter(dset))
You can try this out with our getting started notebook!
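Because a streamed dataset is an iterable, a few examples can be pulled without fetching the whole preview. The sketch below uses only the standard library. `take` is a hypothetical helper name, and the generator standing in for `dset` (with a made-up `object_id` field) is an assumption so that the snippet runs offline.

```python
from itertools import islice


def take(iterable, n):
    """Return the first n items of any iterable as a list."""
    return list(islice(iterable, n))


# Stand-in for a streamed dataset (assumption: each example is a dict).
fake_stream = ({"object_id": i} for i in range(1000))
first_three = take(fake_stream, 3)
print(first_three)
```

With a real stream, the same call would be `take(dset, 3)`.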
To access the full dataset, we recommend downloading the data locally. This is necessary for using the provided cross-matching utilities.
The full dataset content is hosted at the Flatiron Institute and available either through HTTPS or through GLOBUS:
- https://users.flatironinstitute.org/~polymathic/data/MultimodalUniverse
- https://app.globus.org/file-manager?origin_id=57136152-fc1d-418e-b74e-75ca52bddd21
GLOBUS is strongly preferred when downloading large amounts of data or a large number of files. Local download of the full data in its native HDF5 format is necessary for using the provided cross-matching utilities.
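For a handful of files, the HTTPS endpoint can also be scripted. The following is a minimal sketch using only the standard library, not an official download tool. `remote_url` and `download` are hypothetical helper names, and the directory layout under the base URL is not described here, so the relative path of each file has to be looked up in the directory listing first.

```python
import shutil
import urllib.request
from pathlib import Path

# Base URL as listed above.
BASE_URL = "https://users.flatironinstitute.org/~polymathic/data/MultimodalUniverse"


def remote_url(relative_path):
    """Join the base URL with a path relative to it."""
    return f"{BASE_URL}/{relative_path.lstrip('/')}"


def download(relative_path, dest_dir="."):
    """Stream one remote file into dest_dir, mirroring its relative path."""
    target = Path(dest_dir) / relative_path.lstrip("/")
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(remote_url(relative_path)) as response:
        with open(target, "wb") as out:
            # Copy in chunks so large files are never held fully in memory.
            shutil.copyfileobj(response, out)
    return target
```

Nothing is downloaded until `download(...)` is called with a path that exists on the server.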
After downloading the data, you can use Hugging Face's datasets library to load the data directly from your local copy. For example, to load the PLAsTiCC dataset:
from datasets import load_dataset
dset = load_dataset('path/to/downloaded/plasticc',
                    split='train', streaming=True)
dset = dset.with_format('numpy')
example = next(iter(dset))
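Once an example is loaded, it helps to see which fields it carries. Here is a small sketch for summarising an example's structure; `describe_example` is a hypothetical helper, and the field names in the stand-in example are made up. It assumes an example is a dict whose values are scalars, nested dicts, or array-like objects exposing `dtype` and `shape` (as NumPy arrays do).

```python
from types import SimpleNamespace


def describe_example(example):
    """Map each field name to a short description of its value."""
    summary = {}
    for key, value in example.items():
        if isinstance(value, dict):
            # Nested field: describe its sub-fields recursively.
            summary[key] = describe_example(value)
        elif hasattr(value, "shape") and hasattr(value, "dtype"):
            # Array-like value: record dtype and shape.
            summary[key] = f"{value.dtype}{tuple(value.shape)}"
        else:
            summary[key] = type(value).__name__
    return summary


# Stand-in example (assumption); SimpleNamespace mimics an array's attributes.
fake_example = {
    "object_id": "123",
    "lightcurve": {"flux": SimpleNamespace(dtype="float32", shape=(5,))},
}
print(describe_example(fake_example))
```

With a real dataset, the call would be `describe_example(example)`.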
The Multimodal Universe currently contains data from the following surveys/modalities:
Survey | Modality | Science Use Case | # samples |
---|---|---|---|
Legacy Surveys DR10 | Images | Galaxies | 124M |
Legacy Surveys North | Images | Galaxies | 15M |
HSC | Images | Galaxies | 477k |
BTS | Images | Supernovae | 400k |
JWST | Images | Galaxies | 300k |
Gaia BP/RP | Spectra | Stars | 220M |
SDSS-II | Spectra | Galaxies, Stars | 4M |
DESI | Spectra | Galaxies | 1M |
APOGEE SDSS-III | Spectra | Stars | 716k |
GALAH | Spectra | Stars | 325k |
Chandra | Spectra | Galaxies, Stars | 129k |
VIPERS | Spectra | Galaxies | 91k |
MaNGA SDSS-IV | Hyperspectral Image | Galaxies | 12k |
PLAsTiCC | Time Series | Time-varying objects | 3.5M |
TESS | Time Series | Exoplanets | 160k |
CfA Sample | Time Series | Supernovae | 1k |
YSE | Time Series | Supernovae | 2k |
PS1 SNe Ia | Time Series | Supernovae | 369 |
DES Y3 SNe Ia | Time Series | Supernovae | 248 |
SNLS | Time Series | Supernovae | 239 |
Foundation | Time Series | Supernovae | 180 |
CSP SNe Ia | Time Series | Supernovae | 134 |
Swift SNe Ia | Time Series | Supernovae | 117 |
Gaia | Tabular | Stars | 220M |
PROVABGS | Tabular | Galaxies | 221k |
Galaxy10 DECaLS | Tabular | Galaxies | 15k |
We are accepting new datasets! Check out our contribution guidelines for more details.
We openly distribute the Multimodal Universe dataset under the Creative Commons Attribution (CC BY) 4.0 license. Note, however, that when using specific subsets, the license and conditions of use of those subsets must be respected.
Illustration of the methodology behind the Multimodal Universe. Domain scientists with expertise in a given astronomical survey provide data download and formatting scripts through Pull Requests. All datasets are then downloaded from their original source and made available as Hugging Face datasets sharing a common data schema for each modality and associated metadata. End users can then combine any subsets into multimodal datasets using the provided cross-matching utilities.
Please see the Design Document for more context about the project.
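The cross-matching utilities themselves are not described in this section. As a rough illustration of the general idea of positional cross-matching (pairing objects from two surveys whose sky positions agree within a tolerance), here is a brute-force sketch using only the standard library. It is not the project's implementation: the function names, the `(ra, dec)` tuples in degrees, the 1 arcsecond default radius, and the toy coordinates are all assumptions.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def cross_match(left, right, radius_arcsec=1.0):
    """Pair each left position with its nearest right position within the radius.

    Returns a list of (left_index, right_index) tuples.
    """
    radius_deg = radius_arcsec / 3600.0
    pairs = []
    for i, (ra_l, dec_l) in enumerate(left):
        best_j, best_sep = None, radius_deg
        for j, (ra_r, dec_r) in enumerate(right):
            sep = angular_separation_deg(ra_l, dec_l, ra_r, dec_r)
            if sep <= best_sep:
                best_j, best_sep = j, sep
        if best_j is not None:
            pairs.append((i, best_j))
    return pairs


# Toy catalogues (made-up coordinates in degrees).
images = [(150.0, 2.0), (10.0, -5.0)]
spectra = [(10.0001, -5.0001), (150.0, 2.0002), (200.0, 30.0)]
print(cross_match(images, spectra))
```

The double loop scales as the product of the two catalogue sizes, so it only suits small samples; catalogues of the sizes listed above would need a spatial index.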
If you make use of all or part of the Multimodal Universe dataset, please cite the individual datasets accordingly. The relevant BibTeX citations and text acknowledgement instructions for datasets can be generated with the info.py script (run python scripts/info.py --help for details).
It allows you to retrieve all of the dataset information, or just the acknowledgement and citation information, for some or all datasets. If you do not specify a dataset, it returns results for all datasets. If you specify neither --cite nor --acknowledge, it returns all of the information (including license, homepage, etc.).
python scripts/info.py --cite --data <datasets>
python scripts/info.py --acknowledge --data <datasets>
For example, to get the citations for the APOGEE and SDSS datasets and save them to info_citation.bib, run:
python scripts/info.py --cite --data apogee sdss -o info_citation.bib
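When citations are generated as part of a larger workflow, the same command line can be assembled from Python. This is a sketch that only uses the flags shown in this section (`--cite`, `--acknowledge`, `--data`, `-o`); `info_command` and `run_info` are hypothetical helpers, and running the script assumes the current directory is the repository root.

```python
import subprocess
import sys


def info_command(datasets=(), cite=False, acknowledge=False, output=None):
    """Build the argument list for scripts/info.py."""
    cmd = [sys.executable, "scripts/info.py"]
    if cite:
        cmd.append("--cite")
    if acknowledge:
        cmd.append("--acknowledge")
    if datasets:
        cmd += ["--data", *datasets]
    if output:
        cmd += ["-o", output]
    return cmd


def run_info(**kwargs):
    """Run scripts/info.py with the given options (needs the repository checkout)."""
    return subprocess.run(info_command(**kwargs), check=True)


cmd = info_command(["apogee", "sdss"], cite=True, output="info_citation.bib")
print(" ".join(cmd[1:]))
```

The script itself is only executed when `run_info(...)` is called.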
To get all citations and acknowledgements, run:
```sh
python scripts/info.py --cite --acknowledge
```
You can always specify an output file for easy transcription to your bibliography or acknowledgements section with the --output flag:
python scripts/info.py --cite --output full_citations.txt
python scripts/info.py --acknowledge --output full_acknowledgements.txt
Acknowledgement instructions are returned alongside citations to encourage attribution. The acknowledgement lines are commented with % to make the citations easy to add to your bibliography.
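If only the BibTeX entries are wanted, the %-commented lines can be filtered out afterwards. This is a minimal sketch, assuming each acknowledgement line begins with `%` (the exact layout of the file written by info.py is not shown here). `split_citations` is a hypothetical helper and the sample text is made up.

```python
def split_citations(text):
    """Separate %-commented lines from the remaining (BibTeX) lines."""
    comments, bibtex = [], []
    for line in text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("%"):
            # Drop the leading comment marker(s) and surrounding whitespace.
            comments.append(stripped.lstrip("%").strip())
        else:
            bibtex.append(line)
    return "\n".join(comments), "\n".join(bibtex).strip()


# Made-up sample in the assumed layout.
sample = """% Please acknowledge the example survey.
@article{example2024,
  title = {An Example},
}"""
acknowledgements, bibtex = split_citations(sample)
print(acknowledgements)
print(bibtex)
```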