Public Data 2 – Center for Artificial Intelligence Driven Health Data Systems and Analytics

This is a curated list of computational biology datasets that have been pre-processed for machine learning. This list is a work in progress.

Genotyping

The Cancer Genome Atlas	Variety of cancer data; most cancer types have 100-1000 samples
NIH GDC	Cancer, many types of genomic data
UK Biobank
European Genome-Phenome Archive
METABRIC	The genomic profiles (somatic mutations, copy number alterations, and gene expression) of 2509 breast cancers
HapMap
23andMe	2280 public domain curated genotypes
Mice	SNPs, 2000+ samples, 4 generations, might be possible to learn a family structure
Arabidopsis	SNPs, 100+ phenotypes

~100,000 DNA-DNA interaction pairs

GEO	Main place for NCBI data
ENCODE	Variety of assays to identify functional elements
ArrayExpress	DNA sequencing, gene/protein expression, epigenetics
Cytometry Continuous	Flow cytometry data on 11 proteins and phospholipids; discretized and cleaned data available offline
Transcription Factor Binding	ChIP-Seq data on 12 TFs
GTEx	Landmark study for eQTL analysis
PharmacoGenomics DB
ProteomeXChange

TRRUST	Manually curated database of human transcriptional regulatory networks
Yeast Network	23 million yeast two-hybrid experiments to investigate genetic interactions
Perturb-Seq	Integrated model of perturbations, single cell phenotypes, and epistatic interactions
KEGG Metabolic Regulatory Network (Undirected)	65554 instances, 29 attributes each
KEGG Metabolic Regulatory Network (Directed)	53414 instances, 24 attributes each

The Cancer Imaging Archive	Images associated with the TCGA data
Multiple Myeloma DREAM Challenge	Challenge to identify Multiple Myeloma Patients
Breast Cancer Wisconsin (Diagnostic) Data Set	Predict whether the cancer is benign or malignant
DDSM	Mammogram Database
Kaggle Soft Tissue Sarcomas	Preprocessed subset of the TCIA study “Soft Tissue Sarcoma”
Kaggle Cervical Cancer Screening	Classify cervix type from images
CAMELYON17	Pathology challenge – automated detection and classification of breast cancer metastases in whole-slide images of histological lymph node sections
Grand Challenges	Datasets from biomedical image analysis competitions

ENIGMA Cerebellum	Goal: Examine the relationships between regional atrophy and motor and cognitive dysfunction
Seizure Prediction	Goal: Classify EEG time series into pre-seizure vs. interictal (i.e., not preceding a seizure).

MIMIC	59,000 EHRs
UCI Diabetes	Data from 130 US hospitals, 1999-2008
i2b2	Clinical notes only, designed for NLP tasks
PhysioNet
Metadata Acquired from Clinical Case Reports (MACCRs)	3,100 curated clinical case reports spanning 15 disease groups and more than 750 reports of rare diseases
eICU	200k EHRs

CheXpert	200k chest radiographs; associated competition and leaderboard
MIMIC-CXR	~400k chest x-rays, 14 labels
PadChest	160k chest x-rays, 174 different findings

Curated compilation of high-quality protein-protein interactions from 8 interactome resources

Longitudinal survey that collects health information every two years.

Standardized dataset for learning protein structure. Includes sequences, structures, alignments, PSSMs, and standardized train/test/valid splits.

BioASQ	Abstracts of medical articles (from PubMed); ontologies of medical concepts.
Cases	Articles from medical case studies
UPMC Pathology	UPMC Pathology case studies

If you have any recommendations for additional datasets, please email msvogt2@illinois.edu