MultimodalUniverse/MultimodalUniverse

Multimodal Universe: Enabling Large-Scale Machine Learning with 100TBs of Astronomical Scientific Data

Dataset on Hugging Face | NeurIPS | Demo on Colab | Test | License: MIT | All Contributors

Overview

The Multimodal Universe dataset is a large-scale collection of multimodal astronomical data, including images, spectra, and light curves, which aims to enable research into foundation models for astrophysics and beyond.

Quick Start

All datasets can be previewed directly from our Hugging Face hub and accessed via load_dataset('MultimodalUniverse/dataset_name'). Preview datasets include ~1k examples from each survey.

from datasets import load_dataset

dset = load_dataset('MultimodalUniverse/plasticc', 
                    split='train', streaming=True)

example = next(iter(dset))

You can try this out with our getting started notebook!

Data Access

To access the full dataset, we recommend downloading the data locally. A local copy of the full data in its native HDF5 format is required to use the provided cross-matching utilities.

The full dataset content is hosted at the Flatiron Institute and is available through either HTTPS or GLOBUS. GLOBUS is strongly preferable when downloading large amounts of data or a large number of files.

After downloading the data, you can use Hugging Face's datasets library to load the data directly from your local copy. For example, to load the PLAsTiCC dataset:

from datasets import load_dataset

dset = load_dataset('path/to/downloaded/plasticc', 
                    split='train', streaming=True)
dset = dset.with_format('numpy')

example = next(iter(dset))
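
Because a streaming dataset is consumed as an iterator, a small helper can pull a fixed number of examples without reading the whole split. The sketch below uses only the standard library; `take_examples` and `fake_stream` are hypothetical names for illustration, and the `object_id` field is made up rather than taken from the actual PLAsTiCC schema.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def take_examples(stream: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Return the first n examples of an iterable without consuming the rest."""
    return list(islice(stream, n))


def fake_stream() -> Iterator[Dict[str, Any]]:
    """Stand-in for a streaming dataset; the 'object_id' field is made up."""
    i = 0
    while True:  # endless, like a very large streaming split
        yield {"object_id": i}
        i += 1


first_three = take_examples(fake_stream(), 3)
print([ex["object_id"] for ex in first_three])  # [0, 1, 2]
```

With a real dataset, `take_examples(dset, 3)` should behave the same way, since `dset` is iterated exactly as in `next(iter(dset))`.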

Datasets

The Multimodal Universe currently contains data from the following surveys/modalities:

| Survey | Modality | Science Use Case | # samples |
| --- | --- | --- | --- |
| Legacy Surveys DR10 | Images | Galaxies | 124M |
| Legacy Surveys North | Images | Galaxies | 15M |
| HSC | Images | Galaxies | 477k |
| BTS | Images | Supernovae | 400k |
| JWST | Images | Galaxies | 300k |
| Gaia BP/RP | Spectra | Stars | 220M |
| SDSS-II | Spectra | Galaxies, Stars | 4M |
| DESI | Spectra | Galaxies | 1M |
| APOGEE SDSS-III | Spectra | Stars | 716k |
| GALAH | Spectra | Stars | 325k |
| Chandra | Spectra | Galaxies, Stars | 129k |
| VIPERS | Spectra | Galaxies | 91k |
| MaNGA SDSS-IV | Hyperspectral Image | Galaxies | 12k |
| PLAsTiCC | Time Series | Time-varying objects | 3.5M |
| TESS | Time Series | Exoplanets | 160k |
| CfA Sample | Time Series | Supernovae | 1k |
| YSE | Time Series | Supernovae | 2k |
| PS1 SNe Ia | Time Series | Supernovae | 369 |
| DES Y3 SNe Ia | Time Series | Supernovae | 248 |
| SNLS | Time Series | Supernovae | 239 |
| Foundation | Time Series | Supernovae | 180 |
| CSP SNe Ia | Time Series | Supernovae | 134 |
| Swift SNe Ia | Time Series | Supernovae | 117 |
| Gaia | Tabular | Stars | 220M |
| PROVABGS | Tabular | Galaxies | 221k |
| Galaxy10 DECaLS | Tabular | Galaxies | 15k |

We are accepting new datasets! Check out our contribution guidelines for more details.

Data License

We openly distribute the Multimodal Universe dataset under the Creative Commons Attribution (CC BY) 4.0 license. Note, however, that when using specific subsets, the license and conditions of use of those subsets must be respected.

Architecture

Illustration of the methodology behind the Multimodal Universe. Domain scientists with expertise in a given astronomical survey provide data download and formatting scripts through Pull Requests. All datasets are then downloaded from their original source and made available as Hugging Face datasets sharing a common data schema for each modality and associated metadata. End users can then combine any subsets with the provided cross-matching utilities to build multimodal datasets.

Please see the Design Document for more context about the project.
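
To make the cross-matching step more concrete, here is a minimal sketch that assumes matching is done on sky position (RA/Dec in degrees) within a fixed angular radius. It is not the project's cross-matching utility, which may work differently; `angular_separation_deg`, `cross_match`, and the toy coordinates are all hypothetical. The search is brute force for clarity; catalogs of realistic size would need a spatial index.

```python
import math
from typing import List, Optional, Tuple

Coord = Tuple[float, float]  # (ra_deg, dec_deg)


def angular_separation_deg(a: Coord, b: Coord) -> float:
    """Great-circle separation between two sky positions (haversine formula)."""
    ra1, dec1 = map(math.radians, a)
    ra2, dec2 = map(math.radians, b)
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cross_match(left: List[Coord], right: List[Coord],
                radius_arcsec: float = 1.0) -> List[Optional[int]]:
    """For each left source, the index of the nearest right source
    within the radius, or None if there is no such source."""
    radius_deg = radius_arcsec / 3600.0
    matches: List[Optional[int]] = []
    for src in left:
        best_idx, best_sep = None, radius_deg
        for j, cand in enumerate(right):
            sep = angular_separation_deg(src, cand)
            if sep <= best_sep:
                best_idx, best_sep = j, sep
        matches.append(best_idx)
    return matches


# Toy catalogs with made-up coordinates (degrees).
images = [(150.0000, 2.0000), (150.1000, 2.1000)]
spectra = [(150.1001, 2.1000), (150.00001, 2.00001), (10.0, -5.0)]
print(cross_match(images, spectra, radius_arcsec=1.0))  # [1, 0]
```

Each returned index pairs an entry in the first catalog with its counterpart in the second, which is the information needed to join two modalities into one multimodal example.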

Citations & Acknowledgements

If you make use of all or part of the Multimodal Universe dataset, please cite the individual datasets accordingly. The relevant BibTeX citations and text acknowledgement instructions for each dataset can be generated with the info.py script (run python scripts/info.py --help for details).

The script retrieves either the full dataset information or just the citation and acknowledgement information, for some or all datasets. If no dataset is specified, it returns results for all datasets. If neither --cite nor --acknowledge is specified, it returns all of the information (including license, homepage, etc.).

python scripts/info.py --cite --data <datasets>
python scripts/info.py --acknowledge --data <datasets>

For example, to get the citations for the APOGEE and SDSS datasets and save them to info_citation.bib, run:

python scripts/info.py --cite --data apogee sdss -o info_citation.bib

To get all citations and acknowledgements, run:

```sh
python scripts/info.py --cite --acknowledge
```

You can always specify an output file for easy transcription to your bibliography or acknowledgements section with the --output flag:

python scripts/info.py --cite --output full_citations.txt
python scripts/info.py --acknowledge --output full_acknowledgements.txt
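
When these calls are scripted, for instance as part of a paper build, it can help to assemble the argument list programmatically. The sketch below uses only the flags that appear in the commands above (`--cite`, `--acknowledge`, `--data`, `--output`); the helper name `build_info_command` is hypothetical and not part of the repository.

```python
import shlex
from typing import List, Optional, Sequence


def build_info_command(cite: bool = False, acknowledge: bool = False,
                       datasets: Sequence[str] = (),
                       output: Optional[str] = None) -> List[str]:
    """Assemble the argument list for scripts/info.py from the flags shown above."""
    cmd = ["python", "scripts/info.py"]
    if cite:
        cmd.append("--cite")
    if acknowledge:
        cmd.append("--acknowledge")
    if datasets:
        cmd += ["--data", *datasets]
    if output:
        cmd += ["--output", output]
    return cmd


cmd = build_info_command(cite=True, datasets=["apogee", "sdss"],
                         output="info_citation.bib")
print(shlex.join(cmd))
# python scripts/info.py --cite --data apogee sdss --output info_citation.bib
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.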

Acknowledgement instructions are returned alongside citations to encourage attribution. The acknowledgement lines are commented with % to make the citations easy to add to your bibliography.
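
If you want the acknowledgement text separately from the BibTeX entries, the %-commented lines can be split off after the fact. This is a minimal sketch that assumes every acknowledgement line starts with % and everything else is BibTeX; `split_acknowledgements` is a hypothetical helper, and the sample text is made up rather than real output of info.py.

```python
from typing import List, Tuple


def split_acknowledgements(text: str) -> Tuple[List[str], List[str]]:
    """Separate %-commented lines from the remaining (BibTeX) lines."""
    comments: List[str] = []
    entries: List[str] = []
    for line in text.splitlines():
        if line.lstrip().startswith("%"):
            # Drop the leading % markers and surrounding whitespace.
            comments.append(line.lstrip().lstrip("%").strip())
        else:
            entries.append(line)
    return comments, entries


# Made-up sample in the shape described above; not real output of info.py.
sample = """% Please acknowledge the Example Survey.
@article{example2024,
  title = {An Example Entry}
}"""
notes, bib = split_acknowledgements(sample)
print(notes)  # ['Please acknowledge the Example Survey.']
```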

Contributors

Full Contribution List

- Francois Lanusse: 📆 💡 💻
- Liam Parker: 📆 💡 💻
- Micah Bowles: 📆 💡 💻
- mhuertascompany: 📆 💡 💻
- Mike Smith: 📆 💡 💻
- Helen Qu: 📆 💡 💻
- Aaron: 💡 💻
- Ben Boyd: 💡 💻
- Brian Cherinka: 💻
- Connor Stone, PhD: 💡
- David Chemaly: 💡 💻
- Erin Hayes: 💡 💻
- Henry Leung: 💻
- Ioana Ciucă: 🖋
- Jeff Shen: 💻
- jeraud: 💡 💻
- John F. Wu: 🖋
- CambridgeAstroStat: 🧑‍🏫
- Kartheik Iyer: 💻
- Lucas Meyer: 💻
- Matthew Grayling: 💡 💻
- Maja Jabłońska: 💻
- Mike Walmsley: 💡 💻
- Miles Cranmer: 🖋
- Peter Melchior: 💻
- Rafael Martínez-Galarza: 💻
- Tom Hehir: 💡 💻
- Shirley Ho: 🔍 🖋