geniusrise/awesome-healthcare-datasets

Awesome Healthcare Datasets

A curated list of awesome healthcare datasets for machine learning, research, and exploration.

Contents

Clinical Data

  1. MIMIC-III Clinical Database - Deidentified health data associated with ~40,000 critical care patients. Includes demographics, vital signs, laboratory tests, medications, and more.
  2. eICU Collaborative Research Database - A multi-center database comprising deidentified health data associated with over 200,000 admissions to ICUs across the United States between 2014 and 2015.
  3. MIMIC-IV - An update to MIMIC-III, containing deidentified data associated with patients admitted to a tertiary academic medical center in Boston, MA, USA from 2008 to 2019.
  4. AmsterdamUMCdb - A database containing deidentified health data from the Amsterdam University Medical Center, including structured and unstructured data from patient records.
  5. MIMIC-IV-ED - Emergency department data from the MIMIC-IV database.
  6. MIMIC-IV-Note - Deidentified free-text clinical notes from the MIMIC-IV database.
  7. MIMIC-III Waveform Database - Waveform data from the MIMIC-III database.
  8. MIMIC-IV Waveform Database - Waveform data from the MIMIC-IV database.
  9. MIMIC-II Clinical Database - An older version of the MIMIC database, containing data from 2001 to 2008.
  10. MIMIC-IV-ECHO - Echocardiogram data from the MIMIC-IV database.
  11. AMR-UTI - Antimicrobial Resistance in Urinary Tract Infections dataset.
  12. Abdominal and Direct Fetal ECG Database - Multichannel fetal electrocardiogram recordings obtained from 5 different women in labor.
  13. OpenPrescribing - A database of all medicines and appliances that are prescribed by GPs and other NHS prescribers in England.

Imaging Data

  1. TCIA (The Cancer Imaging Archive) - A large archive of medical images of cancer accessible for public download.
  2. Chest X-Ray Dataset - A dataset consisting of 5,863 chest X-Ray images, annotated with the presence of pneumonia.
  3. RSNA Intracranial Hemorrhage Detection - A dataset of head CT scans, annotated with intracranial hemorrhage labels.
  4. MICCAI 2015 Challenge on Multimodal Brain Tumor Segmentation - Brain tumor segmentation dataset.
  5. Non-Small Cell Lung Cancer CT Scan Dataset - CT scans of non-small cell lung cancer patients.
  6. PROSTATEx - Prostate MRI scans with segmentations and annotations.
  7. Labeled Optical Coherence Tomography - Retinal OCT images with layer segmentations and fluid labels.
  8. MosMedData: Chest CT Scans with COVID-19 Related Findings - Chest CT scans of COVID-19 patients.
  9. LUng Nodule Analysis (LUNA16) - Chest CT scans with annotated lung nodules.
  10. NIH Chest X-ray Dataset of 14 Common Thorax Disease Categories - Chest X-ray images with disease labels.
  11. DeepLesion - A large-scale dataset of CT images with annotated lesions.
  12. Medical Segmentation Decathlon Datasets - Various medical imaging datasets for segmentation tasks.
  13. cataracts-2018-train - Cataract images dataset.
  14. dHCP 2nd data release -- sourcedata - Developmental Human Connectome Project dataset.
  15. dHCP 2nd data release -- fMRI pipeline - Developmental Human Connectome Project dataset (fMRI pipeline).
  16. PADCHEST_SJ - Chest X-ray images with multiple labels in Spanish.
  17. CAMELYON17 breast cancer - Lymph node sections annotated with metastases.
  18. A multimodal dental dataset facilitating machine learning research and clinic services - Dental X-rays, CBCT scans, and dental records.
  19. MIMIC-IV-ECG - Diagnostic electrocardiogram data from the MIMIC-IV database.
  20. MURA (musculoskeletal radiographs) - Bone X-rays labeled for abnormalities.
  21. National COVID-19 Chest Image Database (NCCID) - Chest X-rays, CT scans, and MRIs of COVID-19 patients in the UK.
  22. Cell Painting Gallery - A collection of cell images for drug discovery and basic research.
  23. International Neuroimaging Data-Sharing Initiative (INDI) - Neuroimaging datasets from various sources.
  24. Cancer Imaging Archive - A large archive of cancer imaging data.
  25. Open Access Series of Imaging Studies (OASIS) - MRI data in young, middle-aged, and elderly adults.
  26. Allen Cell Imaging Collections - 3D cell imaging data for basic research and computational tool development.
  27. BossDB Open Neuroimagery Datasets - Various neuroimaging datasets.
  28. Clinical Proteomic Tumor Analysis Consortium 3 (CPTAC-3) - Proteomic data from cancer samples.
  29. IBL Neuropixels Reproducible Ephys Data on AWS - Electrophysiological recordings from the International Brain Laboratory.
  30. NYU Langone & FAIR FastMRI Dataset - Knee MRIs for accelerated MRI reconstruction research.
  31. The Human Connectome Project - A collection of neuroimaging and behavioral data.
  32. RadGraph - Radiology reports annotated with entities and relations.
  33. RadNLI - A natural language inference dataset for radiology reports.
  34. RadQA - A question-answering dataset for radiology reports.
  35. UK Biobank Brain Imaging - Detailed MRI scans of the brain, heart, abdomen, bones and carotid arteries of over 100,000 UK Biobank participants.
  36. Allen Brain Atlas - A growing collection of online public resources integrating extensive gene expression and neuroanatomical data.
  37. ADNI (Alzheimer's Disease Neuroimaging Initiative) - A longitudinal multicenter study designed to develop clinical, imaging, genetic, and biochemical biomarkers for the early detection and tracking of Alzheimer's disease.

Omics Data

  1. TCGA (The Cancer Genome Atlas) - A landmark cancer genomics program that molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types.
  2. GTEx (Genotype-Tissue Expression) - A resource to study tissue-specific gene expression and regulation, with data from 54 non-diseased tissue sites across nearly 1000 individuals.
  3. 1000 Genomes Project - A catalog of human genetic variation, including SNPs and structural variants, based on the genomes of 2,504 individuals from 26 populations.
  4. Cancer Cell Line Encyclopedia (CCLE) - Detailed genetic and pharmacologic characterization of a large panel of human cancer cell lines.
  5. Genome Aggregation Database - Aggregated and harmonized sequence data from large-scale sequencing projects.
  6. Open Bioinformatics Reference Data for Galaxy - Bioinformatics reference data for the Galaxy platform.
  7. CoMMpass from the Multiple Myeloma Research Foundation - Genomic and clinical data from multiple myeloma patients.
  8. NIH NCBI Sequence Read Archive (SRA) on AWS - Next-generation sequencing data from various studies.
  9. Basic Local Alignment Sequences Tool (BLAST) Databases - Sequence databases for use with the BLAST tool.
  10. Encyclopedia of DNA Elements (ENCODE) - Data from the ENCODE project, which aims to identify all functional elements in the human genome.
  11. Genome in a Bottle on AWS - Reference genomes and benchmarking data for genome sequencing and assembly.
  12. OpenCell on AWS - Confocal fluorescence microscopy images of endogenously tagged human proteins, collected to map protein localization and interactions.
  13. Refgenie reference genome assets - A standardized, versioned, and programmatically accessible collection of reference genome assets.
  14. Gene Expression Omnibus (GEO) - A public repository that archives and freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomics data.
  15. ArrayExpress - A database of functional genomics experiments including gene expression, methylation, and protein data.
  16. Protein Data Bank (PDB) - A database of 3D structural data of large biological molecules, such as proteins and nucleic acids.
  17. Human Protein Atlas - A Sweden-based program that aims to map all human proteins in cells, tissues, and organs by integrating various omics technologies.
  18. cBioPortal - A web resource for exploring, visualizing, and analyzing multidimensional cancer genomics data.
  19. Human Cell Atlas - An international consortium that aims to create comprehensive reference maps of all human cells to describe and define the cellular basis of health and disease.
  20. Tox21 - A database of compounds for toxicity testing to better understand how chemicals affect human health and the environment.
  21. GDC (Genomic Data Commons) - A unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.
  22. CTRP (Cancer Therapeutics Response Portal) - A public database that links genetic, lineage, and other cellular features of cancer cell lines to small-molecule sensitivity.
  23. UniProt - A comprehensive resource for protein sequence and annotation data.
  24. European Nucleotide Archive (ENA) - A comprehensive record of the world's nucleotide sequencing information, covering raw sequencing data, sequence assembly information and functional annotation.

Biomedical Knowledge Graphs

  1. UMLS (Unified Medical Language System) - A compendium of many controlled vocabularies in the biomedical sciences, providing a mapping structure among these vocabularies.
  2. SNOMED CT - A comprehensive, multilingual clinical healthcare terminology for clinical documentation and reporting.
  3. RxNorm - A normalized naming system for generic and branded drugs.
  4. LOINC (Logical Observation Identifiers Names and Codes) - A database and universal standard for identifying medical laboratory observations.
  5. MeSH (Medical Subject Headings) - A controlled vocabulary thesaurus used for indexing articles in PubMed.
  6. DrugBank - A comprehensive, freely accessible, online database containing information on drugs and drug targets.
  7. Orphanet Rare Disease Ontology - A vocabulary for rare diseases, capturing relationships between diseases, genes, and other relevant features.
  8. GWAS Catalog - A catalog of published genome-wide association studies (GWAS) and their findings.
  9. ICD-10 (International Classification of Diseases, 10th Revision) - A medical classification list by the World Health Organization (WHO).
  10. ICD-9 (International Classification of Diseases, 9th Revision) - An older version of the ICD medical classification list.
  11. CPT (Current Procedural Terminology) - A medical code set maintained by the American Medical Association (AMA).
  12. Gene Ontology - A bioinformatics resource that provides information about gene product function using ontologies.
  13. Disease Ontology - An ontology that provides a standardized description of human disease terms, phenotype characteristics, and related medical vocabulary.
  14. RxMix - A web application for combining functions from the RxNorm, RxTerms, and related drug terminology APIs.
  15. RxTerms - A drug interface terminology based on RxNorm.
  16. DailyMed - A database of marketed drugs and their labels.
  17. Experimental Factor Ontology - An ontology for describing experimental variables in biomedical experiments.
  18. UBERON anatomy - A cross-species anatomy ontology.
  19. Open Targets - A platform for accessing and analyzing drug target data.
  20. Genetic and Rare Diseases - Information on rare diseases and their associated genes.
  21. International Classification of Diseases for Oncology - A domain-specific extension of the International Classification of Diseases for tumor diseases.
  22. Kyoto Encyclopedia of Genes and Genomes - A resource for understanding high-level functions and utilities of the biological system.
  23. Medical Dictionary for Regulatory Activities Terminology - A standardised medical terminology for regulatory communication.
  24. Online Mendelian Inheritance in Man - A catalog of human genes and genetic disorders.
  25. DisGeNET - A discovery platform containing publicly available collections of genes and variants associated with human diseases.
  26. PharmGKB - A pharmacogenomics knowledge resource that encompasses clinical information including dosing guidelines and drug labels, potentially clinically actionable gene-drug associations, and genotype-phenotype relationships.

Public Health Data

  1. Global Health Observatory (GHO) - World Health Organization's data repository for global health data, including data on various health topics and SDGs.
  2. CDC WONDER - Wide-ranging Online Data for Epidemiologic Research from the Centers for Disease Control and Prevention (CDC).
  3. Medicare.gov Data - Official U.S. government site for Medicare data, including data on hospitals, nursing homes, physicians, and more.
  4. World Bank Health Data - A collection of World Bank datasets on various health indicators and related data.
  5. Global Burden of Disease (GBD) - A comprehensive regional and global assessment of mortality and disability from major diseases, injuries, and risk factors.
  6. UNICEF Data - Global data on the situation of children worldwide.
  7. OECD Health Statistics - Comprehensive source of comparable statistics on health and health systems across OECD countries.
  8. Humanitarian Data Exchange - An open platform for sharing data across crises and organisations.

Biomedical Literature

  1. PubMed Central Open Access Subset - A subset of PubMed Central that contains full-text open access articles.
  2. CORD-19 - A dataset of scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses.
  3. LitCovid - A curated literature hub for tracking up-to-date scientific information about COVID-19.
  4. PubMed - A database of more than 33 million citations for biomedical literature from MEDLINE, life science journals, and online books.
  5. Europe PMC - An open science platform that enables access to a worldwide collection of life science publications and preprints from trusted sources.
  6. Microsoft Academic Graph - A heterogeneous graph containing scientific publication records, citation relationships between those publications, as well as authors, institutions, journals, conferences, and fields of study.
  7. Semantic Scholar Open Research Corpus - A large corpus of scientific papers with rich metadata, paper abstracts, resolved bibliographic references, and structured full text.

Miscellaneous

  1. PhysioNet - A large and growing archive of physiological data, including datasets on ECG, EEG, and more.
  2. HealthData.gov - Dedicated to making high-value health data more accessible to entrepreneurs, researchers, and policy makers in the hope of better health outcomes for all.
  3. Human Mortality Database - Provides detailed mortality and population data to those interested in the history of human longevity.
  4. Global Health Observatory (GHO) Data Repository - WHO's gateway to health-related statistics for more than 1000 indicators for its 194 Member States.
  5. Medicare Provider Utilization and Payment Data - Data on services and procedures provided to Medicare beneficiaries.
  6. OpenNeuro - A free and open platform for sharing MRI, MEG, EEG, iEEG, and ECoG data.
  7. National Health and Nutrition Examination Survey (NHANES) - A program of studies designed to assess the health and nutritional status of adults and children in the United States.
  8. All of Us Research Program - An effort to gather data from one million or more people living in the United States to accelerate research and improve health.
  9. UK Biobank - A large-scale biomedical database and research resource containing in-depth genetic and health information from half a million UK participants.
  10. Canadian Open Neuroscience Platform (CONP) - A platform for sharing neuroscience data and tools.
  11. Pharmaceuticals and Medical Devices Agency Japan - Japan's agency for pharmaceuticals and medical devices safety and effectiveness.
  12. European Medicines Agency - EU agency for medicine safety and effectiveness.
  13. PubChem - A database with information on the biological activities of small molecules.
  14. SIDER - A resource that contains information on marketed medicines and their recorded adverse drug reactions.
  15. STITCH - A database of known and predicted interactions between chemicals and proteins.
  16. Reactome - A free, open-source, curated and peer-reviewed pathway database.
  17. ChEMBL - A manually curated database of bioactive molecules with drug-like properties.
  18. Human Metabolome Database - A freely available electronic database containing detailed information about small molecule metabolites found in the human body.
  19. ZINC - A free database of commercially available compounds for virtual screening.

License

CC0

This list is released into the public domain. See the license file for more details.