This is a curated list of computational biology datasets that have been pre-processed for machine learning. This list is a work in progress.
Genotyping
The Cancer Genome Atlas | Variety of cancer data; most cancer types have 100-1000 samples |
NIH GDC | Cancer, many types of genomic data |
UK Biobank | |
European Genome-Phenome Archive | |
METABRIC | The genomic profiles (somatic mutations, copy number alterations, and gene expression) of 2509 breast cancers |
HapMap | |
23andMe | 2280 public domain curated genotypes |
Mice | SNPs, 2000+ samples, 4 generations, might be possible to learn a family structure |
Arabidopsis | SNPs, 100+ phenotypes |
Promoter-Enhancer Pairs
TargetFinder | ~100,000 DNA-DNA interaction pairs |
Gene/Protein Expression
GEO | Main place for NCBI data |
ENCODE | Variety of assays to identify functional elements |
ArrayExpress | DNA sequencing, gene/protein expression, epigenetics |
Cytometry Continuous | Flow cytometry data for 11 proteins and phospholipids; discretized and cleaned data available offline |
Transcription Factor Binding | ChIP-Seq data on 12 TFs |
GTEx | Landmark study for eQTL analysis |
PharmacoGenomics DB | |
ProteomeXchange | |
Single-cell Data
Single-cell Expression Atlas | |
Regulatory Networks
TRRUST | Manually curated database of human transcriptional regulatory networks |
Yeast Network | 23 million yeast 2-hybrid experiments to investigate genetic interactions |
Perturb-Seq | Integrated model of perturbations, single cell phenotypes, and epistatic interactions |
KEGG Metabolic Regulatory Network (Undirected) | 65554 instances, 29 attributes each |
KEGG Metabolic Regulatory Network (Directed) | 53414 instances, 24 attributes each |
Images
The Cancer Imaging Archive | Images corresponding to the TCGA data |
Multiple Myeloma DREAM Challenge | Challenge to identify Multiple Myeloma Patients |
Breast Cancer Wisconsin (Diagnostic) Data Set | Predict whether the cancer is benign or malignant |
DDSM | Mammogram Database |
Kaggle Soft Tissue Sarcomas | Preprocessed subset of the TCIA study “Soft Tissue Sarcoma” |
Kaggle Cervical Cancer Screening | Classify cervix type from images |
CAMELYON17 | Pathology challenge – automated detection and classification of breast cancer metastases in whole-slide images of histological lymph node sections |
Grand Challenges | Datasets from biomedical image analysis competitions |
fMRI
ENIGMA Cerebellum | Goal: Examine the relationships between regional atrophy and motor and cognitive dysfunction |
Seizure Prediction | Goal: Classify EEG time series into pre-seizure vs. interictal (i.e., not preceding a seizure). |
Electronic Medical Records
MIMIC | 59,000 EHRs |
UCI Diabetes | Data from 130 US hospitals, 1999-2008 |
i2b2 | Clinical notes only, designed for NLP tasks |
PhysioNet | |
Metadata Acquired from Clinical Case Reports (MACCRs) | 3,100 curated clinical case reports spanning 15 disease groups and more than 750 reports of rare diseases |
eICU | 200k EHRs |
Radiographs
CheXpert | 200k chest radiographs, with an associated competition and leaderboard |
MIMIC-CXR | ~400k chest x-rays, 14 labels |
PadChest | 160k chest x-rays, 174 different findings |
Protein-Protein Interactions
HINT (High-quality INTeractomes) | Curated compilation of high-quality protein-protein interactions from 8 interactome resources |
Longitudinal Studies
National Population Health Survey | Longitudinal survey that collects health information every two years. |
Protein Structure
ProteinNet | Standardized dataset for learning protein structure. Includes sequences, structures, alignments, PSSMs, and standardized train/test/valid splits. |
Natural Language Data
BioASQ | Abstracts of medical articles (from PubMed); ontologies of medical concepts. |
Cases | Articles from medical case studies |
UPMC Pathology | UPMC Pathology case studies |
If you have any recommendations for additional data sets, please email: msvogt2@illinois.edu