A curated list of awesome healthcare datasets for machine learning, research, and exploration.
- MIMIC-III Clinical Database - Deidentified health data associated with ~40,000 critical care patients. Includes demographics, vital signs, laboratory tests, medications, and more.
- eICU Collaborative Research Database - A multi-center database comprising deidentified health data associated with over 200,000 admissions to ICUs across the United States between 2014 and 2015.
- MIMIC-IV - An update to MIMIC-III, containing deidentified data associated with patients admitted to a tertiary academic medical center in Boston, MA, USA from 2008 to 2019.
- AmsterdamUMCdb - A database containing deidentified health data from the Amsterdam University Medical Center, including structured and unstructured data from patient records.
- MIMIC-IV-ED - Emergency department data from the MIMIC-IV database.
- MIMIC-IV-Note - Deidentified free-text clinical notes from the MIMIC-IV database.
- MIMIC-III Waveform Database - Waveform data from the MIMIC-III database.
- MIMIC-IV Waveform Database - Waveform data from the MIMIC-IV database.
- MIMIC-II Clinical Database - An older version of the MIMIC database, containing data from 2001 to 2008.
- MIMIC-IV-ECHO - Echocardiogram data from the MIMIC-IV database.
- AMR-UTI - Antimicrobial Resistance in Urinary Tract Infections dataset.
- Abdominal and Direct Fetal ECG Database - Multichannel fetal electrocardiogram recordings obtained from 5 different women in labor.
- OpenPrescribing - A database of all medicines and appliances that are prescribed by GPs and other NHS prescribers in England.
- TCIA (The Cancer Imaging Archive) - A large archive of medical images of cancer accessible for public download.
- Chest X-Ray Dataset - A dataset of 5,863 chest X-ray images, annotated for the presence of pneumonia.
- RSNA Intracranial Hemorrhage Detection - A dataset of head CT scans, annotated with intracranial hemorrhage labels.
- MICCAI 2015 Challenge on Multimodal Brain Tumor Segmentation - Brain tumor segmentation dataset.
- Non-Small Cell Lung Cancer CT Scan Dataset - CT scans of non-small cell lung cancer patients.
- PROSTATEx - Prostate MRI scans with segmentations and annotations.
- Labeled Optical Coherence Tomography - Retinal OCT images with layer segmentations and fluid labels.
- MosMedData: Chest CT Scans with COVID-19 Related Findings - Chest CT scans of COVID-19 patients.
- LUng Nodule Analysis (LUNA16) - Chest CT scans with annotated lung nodules.
- NIH Chest X-ray Dataset of 14 Common Thorax Disease Categories - Chest X-ray images with disease labels.
- DeepLesion - A large-scale dataset of CT images with annotated lesions.
- Medical Segmentation Decathlon Datasets - Various medical imaging datasets for segmentation tasks.
- cataracts-2018-train - Cataract images dataset.
- dHCP 2nd data release -- sourcedata - Developmental Human Connectome Project dataset.
- dHCP 2nd data release -- fMRI pipeline - Developmental Human Connectome Project dataset (fMRI pipeline).
- PADCHEST_SJ - Chest X-ray images with multiple labels in Spanish.
- CAMELYON17 breast cancer - Lymph node sections annotated with metastases.
- A multimodal dental dataset facilitating machine learning research and clinic services - Dental X-rays, CBCT scans, and dental records.
- MIMIC-IV-ECG - Diagnostic electrocardiogram data from the MIMIC-IV database.
- MURA (musculoskeletal radiographs) - Bone X-rays labeled for abnormalities.
- National COVID-19 Chest Image Database (NCCID) - Chest X-rays, CT scans, and MRIs of COVID-19 patients in the UK.
- Cell Painting Gallery - A collection of cell images for drug discovery and basic research.
- International Neuroimaging Data-Sharing Initiative (INDI) - Neuroimaging datasets from various sources.
- Cancer Imaging Archive - A large archive of cancer imaging data.
- Open Access Series of Imaging Studies (OASIS) - MRI data in young, middle-aged, and elderly adults.
- Allen Cell Imaging Collections - 3D cell imaging data for basic research and computational tool development.
- BossDB Open Neuroimagery Datasets - Various neuroimaging datasets.
- Clinical Proteomic Tumor Analysis Consortium 3 (CPTAC-3) - Proteomic data from cancer samples.
- IBL Neuropixels Reproducible Ephys Data on AWS - Electrophysiological recordings from the International Brain Laboratory.
- NYU Langone & FAIR FastMRI Dataset - Knee MRIs for accelerated MRI reconstruction research.
- The Human Connectome Project - A collection of neuroimaging and behavioral data.
- RadGraph - Radiology reports annotated with entities and relations.
- RadNLI - A natural language inference dataset for radiology reports.
- RadQA - A question-answering dataset for radiology reports.
- UK Biobank Brain Imaging - Detailed MRI scans of the brain, heart, abdomen, bones and carotid arteries of over 100,000 UK Biobank participants.
- Allen Brain Atlas - A growing collection of online public resources integrating extensive gene expression and neuroanatomical data.
- ADNI (Alzheimer's Disease Neuroimaging Initiative) - A longitudinal multicenter study designed to develop clinical, imaging, genetic, and biochemical biomarkers for the early detection and tracking of Alzheimer's disease.
- TCGA (The Cancer Genome Atlas) - A landmark cancer genomics program that molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types.
- GTEx (Genotype-Tissue Expression) - A resource to study tissue-specific gene expression and regulation, with data from 54 non-diseased tissue sites across nearly 1000 individuals.
- 1000 Genomes Project - A catalog of human genetic variation, including SNPs and structural variants, based on the genomes of 2,504 individuals from 26 populations.
- Cancer Cell Line Encyclopedia (CCLE) - Detailed genetic and pharmacologic characterization of a large panel of human cancer cell lines.
- Genome Aggregation Database - Aggregated and harmonized sequence data from large-scale sequencing projects.
- Open Bioinformatics Reference Data for Galaxy - Bioinformatics reference data for the Galaxy platform.
- CoMMpass from the Multiple Myeloma Research Foundation - Genomic and clinical data from multiple myeloma patients.
- NIH NCBI Sequence Read Archive (SRA) on AWS - Next-generation sequencing data from various studies.
- Basic Local Alignment Sequences Tool (BLAST) Databases - Sequence databases for use with the BLAST tool.
- Encyclopedia of DNA Elements (ENCODE) - Data from the ENCODE project, which aims to identify all functional elements in the human genome.
- Genome in a Bottle on AWS - Reference genomes and benchmarking data for genome sequencing and assembly.
- OpenCell on AWS - 3D images and meshes of cells and organelles.
- Refgenie reference genome assets - A standardized, versioned, and programmatically accessible collection of reference genome assets.
- Gene Expression Omnibus (GEO) - A public repository that archives and freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomics data.
- ArrayExpress - A database of functional genomics experiments including gene expression, methylation, and protein data.
- Protein Data Bank (PDB) - A database of 3D structural data of large biological molecules, such as proteins and nucleic acids.
- Human Protein Atlas - A Sweden-based program that maps all human proteins in cells, tissues, and organs by integrating various omics technologies.
- cBioPortal - A web resource for exploring, visualizing, and analyzing multidimensional cancer genomics data.
- Human Cell Atlas - An international collaborative consortium that aims to create comprehensive reference maps of all human cells to describe and define the cellular basis of health and disease.
- Tox21 - A database of compounds for toxicity testing to better understand how chemicals affect human health and the environment.
- GDC (Genomic Data Commons) - A unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.
- CTRP (Cancer Therapeutics Response Portal) - A public database that links genetic, lineage, and other cellular features of cancer cell lines to small-molecule sensitivity.
- UniProt - A comprehensive resource for protein sequence and annotation data.
- European Nucleotide Archive (ENA) - A comprehensive record of the world's nucleotide sequencing information, covering raw sequencing data, sequence assembly information and functional annotation.
- UMLS (Unified Medical Language System) - A compendium of many controlled vocabularies in the biomedical sciences, providing a mapping structure among these vocabularies.
- SNOMED CT - A comprehensive, multilingual clinical healthcare terminology for clinical documentation and reporting.
- RxNorm - A normalized naming system for generic and branded drugs.
- LOINC (Logical Observation Identifiers Names and Codes) - A database and universal standard for identifying medical laboratory observations.
- MeSH (Medical Subject Headings) - A controlled vocabulary thesaurus used for indexing articles in PubMed.
- DrugBank - A comprehensive, freely accessible, online database containing information on drugs and drug targets.
- Orphanet Rare Disease Ontology - A vocabulary for rare diseases, capturing relationships between diseases, genes, and other relevant features.
- GWAS Catalog - A catalog of published genome-wide association studies (GWAS) and their findings.
- ICD-10 (International Classification of Diseases, 10th Revision) - A medical classification list by the World Health Organization (WHO).
- ICD-9 (International Classification of Diseases, 9th Revision) - An older version of the ICD medical classification list.
- CPT (Current Procedural Terminology) - A medical code set maintained by the American Medical Association (AMA).
- Gene Ontology - A bioinformatics resource that provides information about gene product function using ontologies.
- Disease Ontology - An ontology that provides a standardized description of human disease terms, phenotype characteristics, and related medical vocabulary.
- RxMix - A database of prescription drugs and their ingredients.
- RxTerms - A drug interface terminology based on RxNorm.
- DailyMed - A database of marketed drugs and their labels.
- Experimental Factor Ontology - An ontology for describing experimental variables in biomedical experiments.
- UBERON anatomy - A cross-species anatomy ontology.
- Open Targets - A platform for accessing and analyzing drug target data.
- Genetic and Rare Diseases - Information on rare diseases and their associated genes.
- International Classification of Diseases for Oncology - A domain-specific extension of the International Classification of Diseases for tumor diseases.
- Kyoto Encyclopedia of Genes and Genomes - A resource for understanding high-level functions and utilities of the biological system.
- Medical Dictionary for Regulatory Activities Terminology - A standardised medical terminology for regulatory communication.
- Online Mendelian Inheritance in Man - A catalog of human genes and genetic disorders.
- DisGeNET - A discovery platform containing publicly available collections of genes and variants associated with human diseases.
- PharmGKB - A pharmacogenomics knowledge resource that encompasses clinical information including dosing guidelines and drug labels, potentially clinically actionable gene-drug associations, and genotype-phenotype relationships.
- Global Health Observatory (GHO) - World Health Organization's data repository for global health data, including data on various health topics and SDGs.
- CDC WONDER - Wide-ranging Online Data for Epidemiologic Research from the Centers for Disease Control and Prevention (CDC).
- Medicare.gov Data - Official U.S. government site for Medicare data, including data on hospitals, nursing homes, physicians, and more.
- World Bank Health Data - A collection of World Bank datasets on various health indicators and related data.
- Global Burden of Disease (GBD) - A comprehensive regional and global assessment of mortality and disability from major diseases, injuries, and risk factors.
- UNICEF Data - Global data on the situation of children worldwide.
- OECD Health Statistics - Comprehensive source of comparable statistics on health and health systems across OECD countries.
- Humanitarian Data Exchange - An open platform for sharing data across crises and organisations.
- PubMed Central Open Access Subset - A subset of PubMed Central that contains full-text open access articles.
- CORD-19 - A dataset of scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses.
- LitCovid - A curated literature hub for tracking up-to-date scientific information about COVID-19.
- PubMed - A database of more than 33 million citations for biomedical literature from MEDLINE, life science journals, and online books.
- Europe PMC - An open science platform that enables access to a worldwide collection of life science publications and preprints from trusted sources.
- Microsoft Academic Graph - A heterogeneous graph containing scientific publication records, citation relationships between those publications, as well as authors, institutions, journals, conferences, and fields of study.
- Semantic Scholar Open Research Corpus - A large corpus of scientific papers with rich metadata, paper abstracts, resolved bibliographic references, and structured full text.
- PhysioNet - A large and growing archive of physiological data, including datasets on ECG, EEG, and more.
- HealthData.gov - Dedicated to making high value health data more accessible to entrepreneurs, researchers, and policy makers in the hopes of better health outcomes for all.
- Human Mortality Database - Provides detailed mortality and population data to those interested in the history of human longevity.
- Global Health Observatory (GHO) Data Repository - WHO's gateway to health-related statistics for more than 1000 indicators for its 194 Member States.
- Medicare Provider Utilization and Payment Data - Data on services and procedures provided to Medicare beneficiaries.
- OpenNeuro - A free and open platform for sharing MRI, MEG, EEG, iEEG, and ECoG data.
- National Health and Nutrition Examination Survey (NHANES) - A program of studies designed to assess the health and nutritional status of adults and children in the United States.
- All of Us Research Program - An effort to gather data from one million or more people living in the United States to accelerate research and improve health.
- UK Biobank - A large-scale biomedical database and research resource containing in-depth genetic and health information from half a million UK participants.
- Canadian Open Neuroscience Platform (CONP) - A platform for sharing neuroscience data and tools.
- Pharmaceuticals and Medical Devices Agency Japan - Japan's agency for pharmaceuticals and medical devices safety and effectiveness.
- European Medicines Agency - EU agency for medicine safety and effectiveness.
- PubChem - A database with information on the biological activities of small molecules.
- SIDER - A resource that contains information on marketed medicines and their recorded adverse drug reactions.
- STITCH - A database of known and predicted interactions between chemicals and proteins.
- Reactome - A free, open-source, curated and peer-reviewed pathway database.
- ChEMBL - A manually curated database of bioactive molecules with drug-like properties.
- Human Metabolome Database - A freely available electronic database containing detailed information about small molecule metabolites found in the human body.
- ZINC - A free database of commercially-available compounds for virtual screening.
This list is released into the public domain. See the license file for more details.