Public Data 2

This is a curated list of computational biology datasets that have been pre-processed for machine learning. This list is a work in progress.


The Cancer Genome Atlas Variety of cancer data; most cancer types have 100-1000 samples
NIH GDC Cancer, many types of genomic data
UK Biobank
European Genome-Phenome Archive
METABRIC The genomic profiles (somatic mutations, copy number alterations, and gene expression) of 2509 breast cancers
23andMe 2280 public domain curated genotypes
Mice SNPs, 2000+ samples, 4 generations, might be possible to learn a family structure
Arabidopsis SNPs, 100+ phenotypes

Promoter-Enhancer Pairs

TargetFinder ~100,000 DNA-DNA interaction pairs

Gene/Protein Expression

GEO Main place for NCBI data
ENCODE Variety of assays to identify functional elements
ArrayExpress DNA sequencing, gene/protein expression, epigenetics
Cytometry Continuous Flow cytometry data of 11 proteins+phospholipids, discretized and cleaned data available offline
Transcription Factor Binding ChIP-Seq data on 12 TFs
GTEx Landmark study for eQTL analysis
PharmacoGenomics DB

Single-cell Data

Single-cell Expression Atlas

Regulatory Networks

TRRUST Manually curated database of human transcriptional regulatory network
Yeast Network 23 million yeast 2-hybrid experiments to investigate genetic interactions
Perturb-Seq Integrated model of perturbations, single cell phenotypes, and epistatic interactions
KEGG Metabolic Regulatory Network (Undirected) 65554 instances, 29 attributes each
KEGG Metabolic Regulatory Network (Directed) 53414 instances, 24 attributes each


The Cancer Imaging Archive Extracts the images from the TCGA data
Multiple Myeloma DREAM Challenge Challenge to identify Multiple Myeloma Patients
Breast Cancer Wisconsin (Diagnostic) Data Set Predict whether the cancer is benign or malignant
DDSM Mammogram Database
Kaggle Soft Tissue Sarcomas Preprocessed subset of the TCIA study “Soft Tissue Sarcoma”
Kaggle Cervical Cancer Screening Classify cervix type from images
CAMELYON17 Pathology challenge – automated detection and classification of breast cancer metastases in whole-slide images of histological lymph node sections
Grand Challenges Datasets from biomedical image analysis competitions


ENIGMA Cerebellum Goal: Examine the relationships between regional atrophy and motor and cognitive dysfunction
Seizure Prediction Goal: Classify EEG time series into pre-seizure vs. interictal (i.e., not preceding a seizure).
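
For the seizure prediction task above, a common first step is to cut each EEG recording into fixed-length windows and compute simple per-window features that a classifier can then label as pre-seizure or interictal. The following is a minimal sketch under stated assumptions, not the method used by the dataset authors: the signal is synthetic, and the names `split_windows` and `window_features`, the 100 Hz sampling rate, and the choice of features (variance and line length) are all illustrative.

```python
import math
import random
import statistics


def split_windows(signal, window_size):
    """Split a 1-D signal into non-overlapping windows, dropping any remainder."""
    n_windows = len(signal) // window_size
    return [signal[i * window_size:(i + 1) * window_size] for i in range(n_windows)]


def window_features(window):
    """Return (population variance, line length) for one window."""
    variance = statistics.pvariance(window)
    # Line length: sum of absolute differences between consecutive samples.
    line_length = sum(abs(b - a) for a, b in zip(window, window[1:]))
    return variance, line_length


# Synthetic stand-in for a single EEG channel (hypothetical, NOT real data):
# a 10 Hz sine wave plus Gaussian noise, sampled at an assumed 100 Hz.
rng = random.Random(0)
sampling_rate = 100
signal = [
    math.sin(2 * math.pi * 10 * t / sampling_rate) + rng.gauss(0, 0.1)
    for t in range(1000)
]

windows = split_windows(signal, window_size=200)   # 2-second windows
features = [window_features(w) for w in windows]   # one feature pair per window
```

Each entry of `features` could then be paired with a pre-seizure or interictal label and passed to any standard classifier; real recordings would have many channels and need per-channel handling.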

Electronic Medical Records

MIMIC 59,000 EHRs
UCI Diabetes Data from 130 US hospitals for 1999-2008
i2b2 Clinical notes only, designed for NLP tasks
Metadata Acquired from Clinical Case Reports (MACCRs) 3,100 curated clinical case reports spanning 15 disease groups and more than 750 reports of rare diseases
eICU 200k EHRs


CheXpert 200k chest radiographs, with an associated competition and leaderboard
MIMIC-CXR ~400k chest x-rays, 14 labels
PadChest 160k chest x-rays, 174 different findings

Protein-Protein Interactions

HINT (High-quality INTeractomes) curated compilation of high-quality protein-protein interactions from 8 interactome resources

Longitudinal Studies

National Population Health Survey Longitudinal survey that collects health information every two years.

Protein Structure

ProteinNet Standardized dataset for learning protein structure. Includes sequences, structures, alignments, PSSMs, and standardized train/test/valid splits.

Natural Language Data

BioASQ Abstracts of medical articles (from PubMed); ontologies of medical concepts.
Cases Articles from medical case studies
UPMC Pathology UPMC Pathology case studies


If you have any recommendations for additional data sets, please email: