Who / When / Where
Instructor: Zhiyu Wan
Teaching Assistants: Sihan Xie, Hongzhu Jiang
Semester: Fall 2025
Time: Wednesdays & Fridays (odd weeks), 15:00-15:45 and 15:55-16:40
Location: School of Life Science and Technology, Room A103
Office Hours: Upon request, Location: BME Building, Room 228
Public Datasets
(Note: This page is tentative and subject to change.)
| Voice1 (PDF, by Sihan) | Voice2 (PDF, by Jiarui) | Text (PDF, by Jiayue) | Drug (PDF, by Huicong) | Histopathology (PDF, by Haoyun) | Human Face (PDF, by Haoyun) |
| Dermatology (PDF, by Yuhang) | Brain (PDF, by Yuhang) | Head & Neck (PDF, by Hongzhu) | Internal Organ (PDF, by Jiayi) |
1. Voice Datasets (PDF, collected by Sihan)
(A) General-Purpose Voice Datasets
| Main uses | Full Name | Language | Data Scale | Identity Attributes | Institution | Data Availability | URL |
| ASR (Automatic Speech Recognition) | LibriSpeech ASR Corpus (LibriSpeech) | English | ~1000 hours | Speaker ID, Gender | Johns Hopkins University | Publicly available | https://www.openslr.org/12/ |
| ASR (Automatic Speech Recognition) | Mozilla Common Voice (Common Voice) | Multilingual | 30,000 hours | Gender, Age group, Accent | Mozilla Foundation | Publicly available | https://commonvoice.mozilla.org/datasets |
| ASR (Automatic Speech Recognition) | Switchboard Corpus (Switchboard) | English | ~260 hours, >500 speakers | Gender, Age group, Education level, Dialect region | Texas Instruments, LDC | Publicly available | https://catalog.ldc.upenn.edu/LDC97S62 |
| ASR (Automatic Speech Recognition) | Open-Source Mandarin Speech Corpus (AISHELL-1) | Chinese | 178 hours | Speaker ID, Gender | Beijing Shell Technology Co., Ltd. | Publicly available | https://www.openslr.org/33/ |
| SER (Speech Emotion Recognition) | Interactive Emotional Dyadic Motion Capture Database (IEMOCAP) | English | 12 hours of audiovisual data, including video, speech, motion capture of face, text transcriptions. | Speaker ID, Gender | University of Southern California (USC) – Signal Analysis and Interpretation Laboratory (SAIL) | Publicly available | https://sail.usc.edu/iemocap/ |
| SER (Speech Emotion Recognition) | Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) | English | 7,356 files (24 professional actors) | Speaker ID, Gender, Emotion | Toronto Metropolitan University | Publicly available | https://zenodo.org/records/1188976 |
| TTS (Text-to-Speech) | The LJ Speech Dataset (LJSpeech) | English | ~24 hours (13,100 short audio clips) | Speaker ID, Gender | Curated by Keith Ito (from LibriVox public domain data) | Publicly available | https://keithito.com/LJ-Speech-Dataset/ |
| TTS (Text-to-Speech) | CSTR VCTK Corpus (Voice Cloning Toolkit) | English | ~44 hours (109 native English speakers, ~400 sentences per speaker) | Speaker ID, Gender, Accent | University of Edinburgh | Publicly available | https://datashare.ed.ac.uk/handle/10283/2651 |
| Speaker Recognition | VoxCeleb Large-Scale Speaker Recognition Datasets (VoxCeleb 1 & 2) | English | 2000 hours, >7000 speakers | Gender (Age and Nationality can be inferred from external sources) | University of Oxford | Publicly available | http://www.robots.ox.ac.uk/~vgg/data/voxceleb/ |
| Acoustic-Phonetic | TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT) | English | ~5 hours, 630 speakers | Gender, Age, Ethnicity, Dialect region, Education level | Texas Instruments, SRI, MIT | Publicly available | https://catalog.ldc.upenn.edu/LDC93s1 |
(B) Pathological Voice Datasets
| No. | Category | Full Name | Related Disease(s) | Language | Data Scale |
| 1 | VOICED Database | VOICED Database | Various voice disorders | Italian | 208 participants (150 pathological individuals and 58 healthy individuals) |
| 2 | VOICED Database | Saarbruecken Voice Database (SVD) | Various pathologies | German | 2225 participants (869 healthy individuals, 1356 pathological individuals) |
| 3 | Neurodegenerative | CMU PITT Corpus (CMU PITT) | Alzheimer’s Disease, Dementia | English | 397 participants (104 healthy individuals, 208 individuals with dementia, 85 individuals with unknown diagnosis) |
| 4 | Neurodegenerative | Parkinson’s Speech with Multiple Types of Sound Recordings (PD Database) | Parkinson’s Disease | Turkish | 252 participants (188 people with Parkinson’s disease, 64 healthy individuals) |
| 5 | Neurodegenerative | Alzheimer’s Dementia Recognition through Spontaneous Speech (ADReSSo) | Alzheimer’s Disease | English | AD Classification Task: 237 participants (166 training + 71 test); Cognitive Decline Task: 105 recordings |
| 6 | Mental & Emotional Health | Distress Analysis Interview Corpus (DAIC-WOZ) | Depression, Anxiety | English | 189 participants (47 depressed individuals, 142 healthy individuals) |
| 7 | Mental & Emotional Health | ASDBank Dutch Asymmetries Corpus | Autism Spectrum Disorder (ASD) | Dutch | 192 participants (109 healthy individuals, 83 pathological individuals) |
| 8 | Speech, Language & Developmental | TORGO Database | Dysarthria, Cerebral Palsy (CP), Amyotrophic Lateral Sclerosis (ALS) | English | 15 speakers (8 with dysarthria, 7 healthy controls) |
| 9 | Speech, Language & Developmental | UASpeech | Cerebral Palsy | English | 19 pathological individuals |
| 10 | Infectious & Respiratory | Cambridge COVID-19 Sound Database (CSSD) | COVID-19, Asthma, Cough | Multilingual (crowdsourced) | 53,449 audio samples (over 552 hours in total) crowdsourced from 36,116 participants (2,106 samples from participants who tested positive) |
| No. (Continued) | Identity Attributes | Size | Institution | Data Availability | URL |
| 1 | Gender, Age, Lifestyle habits, Occupation | 110.1MB | School of Medicine, University of Naples “Federico II” | Publicly available | https://physionet.org/content/voiced/1.0.0/ |
| 2 | Gender, Age | Total 38.1GB | Saarland University, Germany | Publicly available | https://stimmdb.coli.uni-saarland.de/ |
| 3 | Gender, Age | Total 17.7GB | Carnegie Mellon University, University of Pittsburgh | Publicly available | https://talkbank.org/dementia/access/English/Pitt.html |
| 4 | Gender, Age | 2 MB | Department of Neurology, Cerrahpaşa Faculty of Medicine, Istanbul University | Publicly available | https://archive.ics.uci.edu/dataset/470/parkinson+s+disease+classification |
| 5 | Gender, Age | / | ADReSS Challenge Organizers | First join as a DementiaBank member (free). | https://talkbank.org/dementia/ADReSSo-2021/index.html |
| 6 | Gender, Age | Total 52.82GB | University of Southern California (USC) Institute for Creative Technologies | Publicly available | https://dcapswoz.ict.usc.edu/ |
| 7 | Gender, Age | 152MB | University of Groningen | Publicly available | https://talkbank.org/asd/access/Dutch/Asymmetries.html |
| 8 | Speaker ID, Gender, Age | 18GB (uncompressed) | University of Toronto, Holland Bloorview Kids Rehab Hospital | Publicly available | https://www.cs.toronto.edu/~complingweb/data/TORGO/torgo.html |
| 9 | Gender, Age | / | University of Illinois | Publicly available | https://speechtechnology.web.illinois.edu/uaspeech/ |
| 10 | Gender, Age, Smoker status | / | University of Cambridge | Publicly available, DTA needs to be signed | https://www.covid-19-sounds.org/zh/blog/neurips_dataset.html |
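Several of the pathological voice datasets above are far from balanced between pathological and healthy participants, which matters when choosing evaluation metrics. The following is a minimal sketch, not part of the original course material, that computes the pathological fraction from the counts listed in the "Data Scale" column. The dictionary `COUNTS` and both function names are illustrative assumptions; only rows that report an explicit pathological/healthy split are included.

```python
# Counts copied from the "Data Scale" column above, as (pathological, healthy).
# Only datasets with an explicit two-way split are listed here.
COUNTS = {
    "VOICED": (150, 58),
    "SVD": (1356, 869),
    "PD Database": (188, 64),
    "DAIC-WOZ": (47, 142),
    "ASDBank Dutch Asymmetries": (83, 109),
}


def pathological_fraction(pathological: int, healthy: int) -> float:
    """Return the share of participants labelled as pathological."""
    total = pathological + healthy
    if total == 0:
        raise ValueError("dataset has no participants")
    return pathological / total


def summarize(counts: dict) -> dict:
    """Map each dataset name to its pathological fraction, rounded to 3 decimals."""
    return {
        name: round(pathological_fraction(p, h), 3)
        for name, (p, h) in counts.items()
    }


if __name__ == "__main__":
    for name, frac in summarize(COUNTS).items():
        print(f"{name}: {frac:.1%} pathological")
```

A fraction far from 50% suggests reporting balanced accuracy or per-class recall rather than plain accuracy.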
2. Voice Datasets (PDF, collected by Jiarui)
| No. | Dataset Name | Purpose / Use Case | Total Duration | Number of Speakers | Sampling Rate |
| 1 | LibriSpeech | Large-scale read English speech, commonly used for training Automatic Speech Recognition (ASR) systems. | Approx. 1000 hours | 2484 | 16 kHz |
| 2 | RAVDESS | Audiovisual emotional speech dataset, commonly used for Emotion Recognition research. | Approx. 25 minutes (pure speech) | 24 | 48 kHz |
| 3 | SAVEE | Audiovisual expressed emotion dataset, commonly used for Emotion Recognition. | Approx. 10-15 minutes (480 segments) | 4 | 44.1 kHz |
| 4 | EMO-DB | Berlin emotional speech database, commonly used for small-scale, high-quality Emotion Recognition training. | Approx. 3 minutes (535 segments) | 10 | 16 kHz |
| 5 | VoxCeleb-1 & 2 | Large-scale face and voice recognition dataset, mainly used for speaker identification/verification. | Approx. 2000 hours | 7000+ (VoxCeleb2) | 16 kHz |
| 6 | CMU-MOSEI | Large-scale multimodal emotion dataset, commonly used for multimodal sentiment analysis. | Approx. 23 hours (dialogue) | 1000+ | 16 kHz |
| 7 | MUSAN | Multi-purpose background, music, and noise dataset, commonly used for audio data augmentation and recognition. | Approx. 109 hours | N/A (Non-speaker) | 16 kHz |
| 8 | RIR | Room Impulse Response dataset, used for model training and audio enhancement. | N/A | N/A (Non-speaker) | 16 kHz |
| 9 | CN-Celeb | Chinese speaker recognition database, focused on speaker identification/verification tasks. | 271.72 hours | 997 | 16 kHz |
| 10 | CommonVoice | Large-scale multilingual public speech recognition database, used for ASR training. | Over 24,000 hours | 300,000+ | 16 kHz |
| 11 | IEMOCAP | Interactive emotional motion capture database, used for dialogue and emotion recognition. | Approx. 12 hours (audio/video) | 10 | 16 kHz |
| 12 | EMO-DB | Berlin emotional speech database (duplicate of No. 4), used for emotion recognition. | Approx. 3 minutes (535 segments) | 10 | 16 kHz |
| 13 | Toronto Emotional Speech Database | Emotional speech database, focused on emotion recognition research. | Approx. 7 hours (movie clips) | 20 (Actors) | 16 kHz |
| 14 | LIRIS-ACCEDE | Audiovisual content annotation and emotion database, used for audiovisual emotion analysis. | Approx. 98 hours (movie clips) | N/A | N/A |
| 15 | AESDD | Greek emotional speech database, used for emotion recognition research. | Approx. 700 segments (short) | 4 (Actors) | 44.1 kHz |
| 16 | VoxForge | Public speech recognition corpus, used for training ASR systems. | N/A (Crowdsourced) | N/A (Crowdsourced) | 16 kHz |
| 17 | TED-LIUM3 | Large English speech recognition corpus, extracted from TED talks, used for ASR. | Approx. 450 hours | Approx. 2300 | 16 kHz |
| 18 | AISHELL-1 | Chinese speech recognition dataset, used for Chinese ASR. | Approx. 178 hours | 400 | 16 kHz (downsampled) |
| 19 | AISHELL-WakeUp-1 | Chinese wake-up word database, used for wake-up word recognition. | 1561.12 hours | 254 | Six 16 kHz, 16-bit; one 44.1 kHz, 16-bit |
| 20 | CMU-MOSEI | Large-scale multimodal emotion dataset (duplicate of No. 6), used for multimodal sentiment analysis. | Approx. 23 hours (dialogue) | 1000+ | 16 kHz |
| 21 | VoxBlink2 | Face liveness detection and speaker verification dataset. | N/A | N/A | N/A |
| 22 | Emotional Voices Database (EmoV-DB) | Emotional speech database, used for emotion recognition and speech synthesis. | Approx. 3 hours | 4 (Professional) | 24 kHz |
| 23 | DEMOS | Italian emotional speech database, used for Italian emotion recognition. | Approx. 2 hours | 4 (Professional) | 48 kHz |
| 24 | Multilingual LibriSpeech (MLS) | Large-scale multilingual speech recognition dataset, used for training multilingual ASR. | Approx. 44,500 hours | N/A | 16 kHz |
| No. (Continued) | Gender Distribution | Accent/Language | Environment | Link |
| 1 | Balanced (51.65% male and 48.35% female) | English | “Clean” and “Other” | https://www.openslr.org/12 |
| 2 | Balanced (12 Male, 12 Female) | North American English | Sound Studio (Controlled) | https://datasets.activeeon.ai/ml/datasets/ravdess-dataset |
| 3 | Male-only (4 Male) | British English | Visual Media Lab | http://kahlan.eps.surrey.ac.uk/savee |
| 4 | Balanced (5 Male, 5 Female) | German | Anechoic Chamber | http://emodb.bilderbar.info/download |
| 5 | Imbalanced (61% Male, 39% Female) | Broad English/Multilingual | YouTube/Natural Environment | https://www.robots.ox.ac.uk/~vgg/data/voxceleb/index.html#about |
| 6 | Imbalanced | Broad English (American English) | YouTube/Natural Environment | http://multicomp.cs.cmu.edu/resources/cmu-mosei-dataset |
| 7 | N/A | N/A (Noise/Music) | Various Sources (Background/Music) | https://www.openslr.org/17 |
| 8 | N/A | N/A | Various Environments (Simulated) | https://www.openslr.org/28 |
| 9 | N/A | Mandarin/Chinese | Web/Natural Environment | https://cnceleb.org |
| 10 | Imbalanced (Crowdsourced) | Multilingual/Broad Spectrum | Crowdsourced/Natural Environment | https://commonvoice.mozilla.org/en/datasets |
| 11 | Balanced (5 Male, 5 Female) | American English | Motion Capture Lab | https://sail.usc.edu/iemocap/iemocap_info.htm |
| 12 | Balanced (5 Male, 5 Female) | German | Anechoic Chamber | http://emodb.bilderbar.info/index_1280.html |
| 13 | Balanced (10 Male, 10 Female) | Broad English | Movie Clips/Natural | https://tspace.library.utoronto.ca/handle/1807/24487 |
| 14 | N/A | Broad/Multilingual | Movie Clips/Natural | http://liris-accede.ec-lyon.fr/database.php |
| 15 | Balanced (2 Male, 2 Female) | Greek | Sound Studio | http://mcl.ece.uth.gr/research/speech/speech-emotion-recognition |
| 16 | Imbalanced (Crowdsourced) | Multilingual (Crowdsourced) | Crowdsourced/Various Environments | http://www.voxforge.org |
| 17 | Imbalanced | Broad English (American English) | Lecture Venue | https://www.openslr.org/30 |
| 18 | Balanced (186 male, 214 female) | Mandarin, different accent areas in China | Quiet Indoor | https://www.openslr.org/33 |
| 19 | N/A | Mandarin | Real Home Environment | https://www.aishelltech.com/wakeup_data |
| 20 | Balanced | Broad English (American English) | YouTube/Natural Environment | http://multicomp.cs.cmu.edu/resources/cmu-mosei-dataset |
| 21 | N/A | N/A | N/A | https://voxblink2.github.io |
| 22 | Balanced (2 Male, 2 Female) | English | Sound Studio | https://github.com/mmerlart/EmoV-DB |
| 23 | Balanced (2 Male, 2 Female) | Italian | Sound Studio | https://zenodo.org/record/2544029 |
| 24 | N/A | 10 European Languages | Read Audiobooks | https://www.openslr.org/94 |
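The sampling rates in the table above range from 16 kHz to 48 kHz, so mixing datasets usually requires checking each file's rate first. The sketch below is an illustration under stated assumptions rather than part of the course material: it assumes the audio is stored as uncompressed PCM WAV (some corpora ship FLAC or MP3, which the standard-library `wave` module cannot read), and the helper names `wav_info`, `needs_resampling`, and `write_silence_wav` are hypothetical.

```python
import os
import tempfile
import wave


def wav_info(path: str) -> dict:
    """Read sample rate, channel count, and duration from a PCM WAV header."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        frames = wf.getnframes()
        return {
            "sample_rate": rate,
            "channels": wf.getnchannels(),
            "duration_s": frames / rate,
        }


def needs_resampling(path: str, target_rate: int = 16000) -> bool:
    """Return True when the file's sample rate differs from target_rate."""
    return wav_info(path)["sample_rate"] != target_rate


def write_silence_wav(path: str, sample_rate: int, seconds: float) -> None:
    """Write a mono 16-bit WAV of silence (used only to demo the helpers)."""
    n_frames = int(sample_rate * seconds)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(b"\x00\x00" * n_frames)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo = os.path.join(tmp, "demo_48k.wav")
        write_silence_wav(demo, 48000, 0.5)
        print(wav_info(demo))
        print("needs resampling to 16 kHz:", needs_resampling(demo))
```

The actual resampling step is left out on purpose; it would typically be done with a dedicated audio library.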
3. Medical Text Datasets (PDF, collected by Jiayue)
| No. | Dataset Name | Data Content | Access | Size | Link |
| 1 | MIMIC-III | Electronic health records, clinical notes | Training required | 40,000 patients | https://physionet.org/content/mimiciii/1.4/ |
| 2 | MIMIC-IV | Electronic health records, clinical notes | Training required | 364,627 unique individuals, 546,028 hospital admissions and 94,458 distinct ICU stays | https://physionet.org/content/mimiciv/3.1/ |
| 3 | eICU Collaborative Research Database | Vital signs, laboratory test results, medications, admission diagnoses, patient history, timestamped diagnoses, etc. | Training required | 200,000 ICU admissions | https://physionet.org/content/eicu-crd/2.0/ |
| 4 | UK Biobank | Imaging data, biomarker data, genetic data, medical records, questionnaire data, physical measurements, demographic and environmental data | Application and payment required | 500,000 participants | https://www.ukbiobank.ac.uk/about-us/how-we-work/access-to-uk-biobank-data/ |
| 5 | i2b2 (or n2c2: National NLP Clinical Challenges) | Extends the 2014 i2b2 dataset with new PHI categories and more complex contexts. | Requires signing a data use agreement | ~1,000 records | https://n2c2.dbmi.hms.harvard.edu/data-sets |
| 6 | AmsterdamUMCdb | ICU database | Registration and ethical approval | 23,106 ICU admissions (20,109 patients) | https://github.com/AmsterdamUMC/AmsterdamUMCdb?tab=readme-ov-file |
| 7 | PubMedQA | Biomedical research–oriented question answering dataset | Publicly available | 1,000 expert-annotated QA pairs, 61.2k unlabeled examples, and 211.3k automatically generated QA instances | https://aclanthology.org/D19-1259/ |
| 8 | emrQA | A QA dataset generated using i2b2 annotated corpora | Publicly available | Over one million question–logical form pairs and more than 400,000 question–answer pairs | https://aclanthology.org/D18-1258/ |
| 9 | MediTOD | Clinical history–taking conversations | Publicly available on GitHub | 22,503 annotated utterances | https://aclanthology.org/2024.emnlp-main.936.pdf |
| 10 | MTS-Dialog | Doctor–patient conversations paired with structured clinical summaries | Publicly available on Github | 1,701 short dialogues | https://github.com/abachaa/MTS-Dialog |
| 11 | PubMed Database | Web pages of citations and abstracts | PubMed API | More than 39 million citations | https://pubmed.ncbi.nlm.nih.gov/about/ |
| 12 | CORD-19 (COVID-19 Open Research Dataset) | Scientific papers related to COVID-19 and other coronavirus research | Publicly available on Github | Over 1 million indexed papers | https://github.com/allenai/cord19 |
3. Drug Datasets (PDF, collected by Huicong)
| No. | Dataset Name | Description | URL |
| 1 | ChEMBL | 2.3 M chemical structures, 16,000 target sites | https://www.ebi.ac.uk/chembl |
| 2 | DrugBank | 16,000 drugs | https://go.drugbank.com |
| 3 | PubChem | The world’s largest collection of freely accessible chemical information | https://pubchem.ncbi.nlm.nih.gov |
| 4 | The Cancer Genome Atlas (TCGA) | 33 cancer types, omics data for about 20,000 samples | https://www.cancer.gov/ccg/research/genome-sequencing/tcga |
| 5 | cBioPortal | TCGA, ICGC visualization | https://www.cbioportal.org |
| 6 | Gene Expression Omnibus (GEO) | A public functional genomics data repository | https://www.ncbi.nlm.nih.gov/geo |
| 7 | FDA Adverse Event Reporting System (FAERS) | The FDA's database of adverse event reports | https://www.fda.gov/drugs/questions-and-answers-fdas-adverse-event-reporting-system-faers/fda-adverse-event-reporting-system-faers-public-dashboard |
| 8 | Side Effect Resource (SIDER) | An adverse event database | http://sideeffects.embl.de |
| 9 | MIMIC-IV | / | https://mimic.mit.edu |
| 10 | UK Biobank | / | https://www.ukbiobank.ac.uk |
4. Histopathology Datasets (PDF, collected by Haoyun)
| No. | Dataset Name | Related Disease(s) | Data Scale | Identity Attributes | URL |
| 1 | The Cancer Genome Atlas Program | All kinds of cancer | / | / | https://www.cancer.gov/ccg/research/genome-sequencing/tcga |
| 2 | Lung and Colon Histopathological Image Dataset | Lung & colon cancer | 25,000 images | / | https://github.com/tampapath/lung_colon_image_set |
| 3 | 100,000 histological images of human colorectal cancer and healthy tissue | Colorectal cancer | 100,000 images | / | https://opendatalab.com/OpenDataLab/CRC100K/cli/main |
| 4 | Multi-class texture analysis in colorectal cancer histology | Colorectal cancer | 5000 images | / | https://zenodo.org/records/53169 |
| 5 | Breast-Cancer-Detection Dataset | Breast cancer | 5000 images | Breast density | https://github.com/marcos-jimenez-larroy/Breast-cancer-detection-CNN |
| 6 | EndoNuke | Endometrium | 1600 images | Stroma, epithelium, other | https://endonuke.ispras.ru/ |
5. Human Face Datasets (PDF, collected by Haoyun)
| No. | Dataset Name | Data Scale | Identity Attributes | URL |
| 1 | Flickr-Faces-HQ (FFHQ) | 70,000 images | / | https://github.com/NVlabs/ffhq-dataset |
| 2 | CelebFaces Attributes (CelebA) | 202,599 images | 40 attributes (e.g., hair, glasses, expression) | https://aistudio.baidu.com/datasetdetail/224771 |
| 3 | Labeled Faces in the Wild | 13,233 images | Name | https://opendatalab.org.cn/OpenDataLab/LFW/cli/main |
| 4 | CASIA-WebFace | 494,414 images | / | https://gitcode.com/Premium-Resources/abe76 |
| 5 | MegaFace | 1,000,000 + 690,572 images | / | https://megaface.cs.washington.edu/dataset/download.html |
| 6 | MS-Celeb-1M | 1,000,000 images | Name | https://opendatalab.org.cn/OpenDataLab/MS-Celeb-1M/explore/main |
6. Dermatology Datasets (PDF, by Yuhang)
| No. | Dataset Name | Source | Collectors | Domain | Data Type |
| 1 | Diverse Dermatology Images (DDI) dataset | Stanford Clinics | Daneshjou R, Vodrahalli K, Novoa R A, et al. (Stanford) | Dermatology: skin lesion diagnosis (benign vs malignant) | Clinical photographs |
| 2 | Fitzpatrick 17k | Two dermatology atlases — DermaAmin and Atlas Dermatologico | Groh M, Harris C, Soenksen L, et al. Annotated by Scale AI and Centaur Labs | Dermatology: skin disease classification, fairness evaluation | Clinical images |
| 3 | PAD-UFES-20 | Federal University of Espírito Santo (UFES) | Daniel C. C. Barata, Allan C. P. Oliveira, André L. B. Medeiros, et al. | Automated skin cancer detection | Clinical images |
| 4 | HAM10000 (ISIC2018) | Office of the skin cancer practice of Cliff Rosendahl, ViDIR Group | Tschandl P., Rosendahl C., Kittler H. | Skin lesion detection | Dermatoscopic images |
| 5 | ISIC 2019 | BCN_20000 dataset; HAM10000 dataset; MSK dataset | International Skin Imaging Collaboration (ISIC) Challenge archive | Skin lesion detection | Dermatoscopic images |
| 6 | ACNE04 | China | Xiaoping Wu et al. | Facial acne severity grading and lesion counting | Color face photographs |
| 7 | ISIC 2020 | Hospital Clínic de Barcelona; Medical University of Vienna; Memorial Sloan Kettering Cancer Center; etc. | International Skin Imaging Collaboration (ISIC) Challenge archive | Skin lesion classification | Dermatoscopic images |
| 8 | ISIC 2024 | Memorial Sloan Kettering Cancer Center (USA); Hospital Clínic de Barcelona (Spain); The University of Queensland (Australia); etc. | International Skin Imaging Collaboration (ISIC) Challenge archive | Skin lesion classification | Clinical skin lesion crops derived from 3D total-body photography (TBP) |
| No. (Continued) | Data Size | Labels | Demographics | Data Collection Period | Reference / URL |
| 1 | 656 images; 228 MB | 2 malignant labels | Skin tone | 2010-2020 | https://doi.org/10.1126/sciadv.abq6147 |
| 2 | 16,577 images; 1.36 GB | 114 disease labels; 9 clinical labels; 3 malignant labels; 6 skin tone labels | Skin tone | Pre-2021 | https://github.com/mattgroh/fitzpatrick17k |
| 3 | 2,298 images (1,373 patients, 1,641 skin lesions); 3.35 GB | 6 disease labels | Age, gender, Fitzpatrick skin type (skin tone), and region | 2018-2019 | https://data.mendeley.com/datasets/zr7vgbcyr2/1 |
| 4 | 10,015 images; 2.9 GB | 7 skin lesion labels | Age, gender, localization | ~1998-2017 (about 20 years) | https://www.kaggle.com/datasets/kmader/skin-cancer-mnist-ham10000 |
| 5 | 25,331 training images (9.1 GB); 8,238 test images (3.6 GB) | 9 skin lesion labels | Age, gender, localization | Pre-2019 | https://www.kaggle.com/datasets/andrewmvd/isic-2019 |
| 6 | ~1,457 images; 1.05 GB | Lesion count; acne severity | / | / | https://github.com/xpwu95/ldl |
| 7 | 33,126 training images (23 GB); 10,982 test images (6.7 GB) | Binary classification: malignant (mainly melanoma) or benign | Age, gender, localization | / | https://challenge.isic-archive.com/data/#2020 |
| 8 | 401,059 images; 1.2 GB | Binary classification: malignant (mainly melanoma) or benign | Age, gender, localization | 2015-2024 | https://challenge.isic-archive.com/data/#2024 |
7. Brain Datasets (PDF, by Yuhang)
| No. | Dataset Name | Source | Collectors | Domain | Data Type |
| 1 | Brain Tumor MRI Dataset | Combination of three datasets: figshare, SARTAJ dataset and Br35H | / | Brain tumor classification and segmentation | MRI |
| 2 | ADNI | U.S. and Canada | / | Neuroimaging and clinical research focused on Alzheimer’s Disease (AD) and cognitive aging | Imaging: Structural MRI, functional MRI (fMRI), diffusion MRI (DTI), PET (FDG-PET, amyloid PET), etc. |
| 3 | BR35H | Kaggle / Radiopaedia & Figshare collections | / | Brain tumor detection | T1-weighted MRI (axial slices, grayscale images) |
| 4 | OASIS | Washington University in St. Louis, Massachusetts General Hospital | Marcus et al., OASIS Brain Imaging Consortium | Aging and dementia research | Structural MRI (T1-weighted), some versions include PET and clinical data |
| 5 | AIBL | Commonwealth Scientific and Industrial Research Organisation (CSIRO), Australia | AIBL Research Group | Alzheimer’s disease and cognitive aging research | MRI, PET (PiB, FDG), blood biomarkers, neuropsychological and lifestyle data |
| 6 | BraTS | MICCAI Challenge | CBICA and international collaborators | Brain tumor segmentation and classification | Multimodal MRI |
| No. (Continued) | Data Size | Labels | Demographics | Data Collection Period | Reference / URL |
| 1 | 7,023 images; 158.6 MB | 4 tumor classes | / | / | https://www.kaggle.com/datasets/masoudnickparvar/brain-tumor-mri-dataset |
| 2 | 2,000+ participants | Cognitive status | Age, gender and education | Began in 2004, ongoing | http://adni.loni.usc.edu/ |
| 3 | 3,060 images (1,500 tumor / 1,560 non-tumor) 91.77 MB | Binary: Tumor / No Tumor | / | / | https://www.kaggle.com/datasets/ahmedhamada0/brain-tumor-detection |
| 4 | OASIS-4: >1,000 subjects with MRI, PET, and cognitive measures | Cognitive status: Healthy control, Mild Cognitive Impairment (MCI), Alzheimer’s Disease (AD) | Age and gender | 1999–2018 (OASIS-1 to OASIS-3) | https://www.oasis-brains.org |
| 5 | ~1,100 participants with longitudinal imaging and clinical assessments | Cognitive status: Healthy Control (HC), Mild Cognitive Impairment (MCI), Alzheimer’s Disease (AD) | Age and gender | 2006–present (ongoing longitudinal study) | https://aibl.org.au/ |
| 6 | Typically >1,000 cases; ~100 GB | Tumor subregions: enhancing tumor, non-enhancing tumor core, and peritumoral edema | Adult patients with gliomas from multiple international institutions | Annual updates since 2012 (BraTS 2012–2024) | https://www.med.upenn.edu/cbica/brats2021/ |
8. Head and Neck Datasets (PDF, by Hongzhu)
| No. | Dataset Name | Source | Collection Period | Image modality | Disease type |
| 1 | OtoscopeData | Özel Van Akdamar Hospital in Turkey | 2018.10-2019.01 | Otoscope (.png) | Normal, acute otitis media, chronic suppurative otitis media, ear ventilation tube, excessive earwax, foreign object in ear, otitis externa, pseudo eardrums, tympanosclerosis |
| 2 | Otoscopic Image Dataset | / | / | Otoscope (.jpg) | Acute Otitis Media, Cerumen Impaction, Chronic Otitis Media, Myringosclerosis, Normal (Healthy Ear) |
| 3 | otitis-media-master | Department of Otorhinolaryngology at the Clinical Hospital of the Universidad de Chile | / | Otoscope (.jpg) | Chronic Otitis Media, Myringosclerosis, Earwax Plug, Normal |
| 4 | Ocular Disease Recognition | Shanggong Medical Technology Company | / | Color fundus image (.jpg) | Normal, Diabetes, Glaucoma, Cataract, Age-related Macular Degeneration, Hypertension, Pathological Myopia, Other diseases/abnormalities |
| 5 | High-Resolution Fundus (HRF) Image Database | The Pattern Recognition Lab (CS5), the Department of Ophthalmology, etc. | 2010.12-2011.06 | Color fundus image (.jpg) | Normal, Diabetic Retinopathy, Glaucoma |
| 6 | Retinal Fundus Multi-Disease Image Dataset (RFMiD) | Center of Excellence in Signal and Image Processing | / | Color fundus image (.jpg) | 46 diseases |
| 7 | Retinal OCT Images | The Shiley Eye Institute of the University of California San Diego, etc. | 2013.07.01-2017.03.01 | Optical Coherence Tomography | Normal, CNV, DME, Drusen |
| 8 | CE-NBI Laryngeal Dataset | The Department of Otorhinolaryngology, Head and Neck Surgery, Magdeburg University Hospital, Germany | 2015.01.01-2021.12.31 | Laryngoscope (.jpg) | Cyst, Polyp, Reinke’s edema, Hemangioma, Granuloma, Bamboo node, etc. |
| 9 | Laryngeal dataset | / | / | Laryngoscope (.png) | Le (Leukoplakia), He (Healthy Epithelium), Hbv (Hypertrophic Blood Vessels), IPCL (Intra-papillary Capillary Loops) |
| No. (Continued) | Number of images | Number of patients | Demographics | File size | URL |
| 1 | 956 | / | / | 191 MB | https://www.kaggle.com/datasets/omduggineni/otoscopedata |
| 2 | 3000 | / | / | 86.8 MB | https://www.kaggle.com/datasets/ucimachinelearning/otoscopic-image-dataset |
| 3 | 880 | 180 | / | 10.5 MB | https://github.com/zcomert/otitis-media/tree/master |
| 4 | 6392 | 5000 | Gender, age | 2 GB | https://www.kaggle.com/datasets/andrewmvd/ocular-disease-recognition-odir5k |
| 5 | 45 | / | / | 73 MB | https://www5.cs.fau.de/research/data/fundus-images/ |
| 6 | 3200 | / | / | 8 GB | https://riadd.grand-challenge.org/download-all-classes/ |
| 7 | 84,495 | / | / | 12 GB | https://www.kaggle.com/datasets/paultimothymooney/kermany2018 |
| 8 | 11144 | 210 | / | 1.4 GB | https://zenodo.org/records/6674034 |
| 9 | 1320 | 33 | / | 8 MB | https://www.kaggle.com/datasets/mahdiehhajian/laryngeal-dataset |
9. Internal Organ Datasets (PDF, collected by Jiayi)
| No. | Dataset | Modality | Organ | Disease | Image format | Size | URL |
| 1 | Open Kidney Dataset | B-mode abdominal ultrasound | Kidney | Diabetes mellitus, immunoglobulin A (IgA) nephropathy, hypertension, and transplanted kidneys | PNG | 534 images in total (514 unique B-mode images plus 20 additional copies, two sets of 10, repeated from these 514); 130 MB | https://github.com/rsingla92/kidneyUS/tree/main/labels/reviewed_masks_1 |
| 2 | KiTS19 | Contrast-enhanced CT | Kidney | Kidney cancer | .nii.gz; .dcm (DICOM) | 300 cases (210 public cases for training and validation, 90 private cases for testing); 70 GB | https://github.com/neheller/kits19/tree/master/data |
| 3 | CT-ORG | 3D CT | Liver, lungs, bladder, kidneys, bones, and brain | Kidney cancer | .nii.gz | 140 cases (119 for training, 21 for testing); 11 GB | https://www.cancerimagingarchive.net/collection/ct-org/ |
| 4 | AbdomenCT-1K | 3D CT | Liver, Spleen, Pancreas, Kidneys, Gallbladder, Stomach | Abdominal organs and tumors | .nii.gz | 1,112 cases; 120 GB | https://github.com/JunMa11/AbdomenCT-1K |
| 5 | FLARE 2023 | CT | Various abdominal organs | Abdominal organs and tumors | .nii.gz | 4500 images | https://codalab.lisn.upsaclay.fr/competitions/12239 |
