Multimodal Universe: Enabling Large-Scale Machine Learning with 100TBs of Astronomical Scientific Data

Overview

The Multimodal Universe dataset is a large-scale collection of multimodal astronomical data, including images, spectra, and light curves. It aims to enable research into foundation models for astrophysics and beyond.

Quick Start

All datasets can be previewed directly from our HuggingFace hub and accessed via load_dataset('MultimodalUniverse/dataset_name')! Preview datasets include ~1k examples from each survey.

from datasets import load_dataset

dset = load_dataset('MultimodalUniverse/plasticc', 
                    split='train', streaming=True)

example = next(iter(dset))

You can try this out with our getting started notebook!
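Because the dataset is streamed, examples are produced lazily, so it helps to read only a handful when exploring. The sketch below shows one way to do that with the standard library. The `peek` helper and the `fake_stream` generator (including its `object_id` field) are illustrative placeholders, not part of the Multimodal Universe code base.

```python
from itertools import islice


def peek(stream, n=3):
    """Return the first n items of an iterable, reading no more than n."""
    return list(islice(stream, n))


def fake_stream():
    """Hypothetical stand-in for a streaming dataset: yields dict examples lazily."""
    object_id = 0
    while True:  # endless, to mimic a very large survey
        yield {"object_id": object_id}
        object_id += 1


first = peek(fake_stream(), n=3)
print([example["object_id"] for example in first])  # prints [0, 1, 2]
```

With the streaming dataset from the snippet above, `peek(dset, 5)` would return the first five examples in the same way. Recent versions of the `datasets` library also provide `IterableDataset.take(n)` for this purpose.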

Data Access

To access the full dataset, we recommend downloading the data locally. This is necessary for using the provided cross-matching utilities.

The full dataset content is hosted at the Flatiron Institute and available either through HTTPS or through GLOBUS:

GLOBUS is strongly preferred when downloading large amounts of data or a large number of files. The cross-matching utilities operate on the data in its native HDF5 format, which is why a full local download is required to use them.

After downloading the data, you can use Hugging Face's datasets library to load the data directly from your local copy. For example, to load the PLAsTiCC dataset:

from datasets import load_dataset

dset = load_dataset('path/to/downloaded/plasticc', 
                    split='train', streaming=True)
dset = dset.with_format('numpy')

example = next(iter(dset))
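Once an example is loaded, a quick summary of its fields shows what each survey provides. The following is a minimal sketch, assuming that examples are (possibly nested) dictionaries and that array-like values expose `shape` and `dtype` attributes, as NumPy arrays do. The `describe` helper and the field names in the toy example are hypothetical.

```python
def describe(example):
    """Summarise each field of an example as a short string."""
    summary = {}
    for name, value in example.items():
        if isinstance(value, dict):
            # Nested records (for instance several arrays grouped together) recurse.
            summary[name] = describe(value)
        elif hasattr(value, "shape") and hasattr(value, "dtype"):
            # Duck-typed check that matches NumPy arrays and scalars.
            summary[name] = f"array shape={tuple(value.shape)} dtype={value.dtype}"
        elif isinstance(value, (list, tuple)):
            summary[name] = f"{type(value).__name__} len={len(value)}"
        else:
            summary[name] = type(value).__name__
    return summary


# Toy example with made-up field names, using plain Python values.
toy_example = {"object_id": 7, "lightcurve": {"time": [0.0, 1.5], "flux": [1.2, 0.9]}}
print(describe(toy_example))
```

Calling `describe(example)` on the example loaded above would list the actual fields of the survey in question.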

Datasets

The Multimodal Universe currently contains data from the following surveys/modalities:

Survey	Modality	Science Use Case	# samples
Legacy Surveys DR10	Images	Galaxies	124M
Legacy Surveys North	Images	Galaxies	15M
HSC	Images	Galaxies	477k
BTS	Images	Supernovae	400k
JWST	Images	Galaxies	300k
Gaia BP/RP	Spectra	Stars	220M
SDSS-II	Spectra	Galaxies, Stars	4M
DESI	Spectra	Galaxies	1M
APOGEE SDSS-III	Spectra	Stars	716k
GALAH	Spectra	Stars	325k
Chandra	Spectra	Galaxies, Stars	129k
VIPERS	Spectra	Galaxies	91k
MaNGA SDSS-IV	Hyperspectral Image	Galaxies	12k
PLAsTiCC	Time Series	Time-varying objects	3.5M
TESS	Time Series	Exoplanets	160k
CfA Sample	Time Series	Supernovae	1k
YSE	Time Series	Supernovae	2k
PS1 SNe Ia	Time Series	Supernovae	369
DES Y3 SNe Ia	Time Series	Supernovae	248
SNLS	Time Series	Supernovae	239
Foundation	Time Series	Supernovae	180
CSP SNe Ia	Time Series	Supernovae	134
Swift SNe Ia	Time Series	Supernovae	117
Gaia	Tabular	Stars	220M
PROVABGS	Tabular	Galaxies	221k
Galaxy10 DECaLS	Tabular	Galaxies	15k

We are accepting new datasets! Check out our contribution guidelines for more details.

Data License

We openly distribute the Multimodal Universe dataset under the Creative Commons Attribution (CC BY) 4.0 license. Note, however, that when using specific subsets, the license and conditions of use of each subset must be respected.

Architecture

Illustration of the methodology behind the Multimodal Universe. Domain scientists with expertise in a given astronomical survey provide data download and formatting scripts through Pull Requests. All datasets are then downloaded from their original source and made available as Hugging Face datasets sharing a common data schema for each modality and associated metadata. End users can then combine any subsets with the provided cross-matching utilities to build multimodal datasets.
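To make the idea of cross-matching concrete, the sketch below pairs sources from two catalogs by sky position. It is a dependency-free, brute-force illustration and not the project's own utilities: the function names, the `(ra, dec)` tuple layout, and the 1 arcsecond default radius are all assumptions made for this example.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    a = sin_ddec**2 + math.cos(dec1) * math.cos(dec2) * sin_dra**2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a))))


def cross_match(left, right, radius_arcsec=1.0):
    """Pair each left source with its nearest right source within the radius.

    left, right: sequences of (ra_deg, dec_deg) tuples.
    Returns a list of (left_index, right_index, separation_arcsec).
    """
    matches = []
    for i, (ra_l, dec_l) in enumerate(left):
        best_j, best_sep = None, float("inf")
        for j, (ra_r, dec_r) in enumerate(right):
            sep = angular_separation_deg(ra_l, dec_l, ra_r, dec_r) * 3600.0
            if sep < best_sep:
                best_j, best_sep = j, sep
        if best_j is not None and best_sep <= radius_arcsec:
            matches.append((i, best_j, best_sep))
    return matches


# Made-up coordinates: the second image source lies 0.5 arcsec from the first spectrum.
images = [(150.0, 2.0), (10.0, -30.0)]
spectra = [(10.0, -30.0 + 0.5 / 3600.0), (200.0, 45.0)]
matches = cross_match(images, spectra)
print([(i, j) for i, j, _ in matches])  # prints [(1, 0)]
```

The double loop scales with the product of the catalog sizes, so it is only suitable for small samples. For survey-scale catalogs, a spatial index is the usual choice, for example `SkyCoord.match_to_catalog_sky` in astropy.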

Please see the Design Document for more context about the project.

Citations & Acknowledgements

If you make use of all or part of the Multimodal Universe dataset, please cite the individual datasets accordingly. The relevant BibTeX citations and text acknowledgement instructions for each dataset can be generated with the info.py script (run python scripts/info.py --help for details).

The script can return all of the dataset information, or only the acknowledgement and citation information, for some or all datasets. If no dataset is specified, it returns results for all datasets. If neither --cite nor --acknowledge is specified, it returns all of the information (including license, homepage, etc.).

python scripts/info.py --cite --data <datasets>
python scripts/info.py --acknowledge --data <datasets>

For example, to get the citations for the APOGEE and SDSS datasets and save them to info_citation.bib, run:

python scripts/info.py --cite --data apogee sdss -o info_citation.bib

To get all citations and acknowledgements, run:

```sh
python scripts/info.py --cite --acknowledge
```

You can always specify an output file for easy transcription to your bibliography or acknowledgements section with the --output flag:

python scripts/info.py --cite --output full_citations.txt
python scripts/info.py --acknowledge --output full_acknowledgements.txt

Acknowledgement instructions are returned alongside citations to encourage attribution. The acknowledgement lines are commented with % to make the citations easy to add to your bibliography.

Contributors

Full Contribution List

Francois Lanusse 📆 💡 💻	Liam Parker 📆 💡 💻	Micah Bowles 📆 💡 💻	mhuertascompany 📆 💡 💻	Mike Smith 📆 💡 💻	Helen Qu 📆 💡 💻	Aaron 💡 💻
Ben Boyd 💡 💻	Brian Cherinka 💻	Connor Stone, PhD 💡	David Chemaly 💡 💻	Erin Hayes 💡 💻	Henry Leung 💻	Ioana Ciucă 🖋
Jeff Shen 💻	jeraud 💡 💻	John F. Wu 🖋	CambridgeAstroStat 🧑‍🏫	Kartheik Iyer 💻	Lucas Meyer 💻	Matthew Grayling 💡 💻
Maja Jabłońska 💻	Mike Walmsley 💡 💻	Miles Cranmer 🖋	Peter Melchior 💻	Rafael Martínez-Galarza 💻	Tom Hehir 💡 💻	Shirley Ho 🔍 🖋
