BME2133: Medical Data Privacy and Ethics in the Age of Artificial Intelligence (Public Datasets)

Syllabus

Who / When / Where
Instructor: Zhiyu Wan
Teaching Assistants: Sihan Xie, Hongzhu Jiang
Semester: Fall 2025
Time: Wednesdays & Fridays (Odd Weeks), 15:00-15:45, 15:55-16:40
Location: School of Life Science and Technology, Room A103
Office Hours: Upon request, Location: BME Building, Room 228

Public Datasets

(Note: This page is tentative and subject to change.)

Voice1 (PDF, by Sihan) | Voice2 (PDF, by Jiarui) | Text (PDF, by Jiayue) | Drug (PDF, by Huicong) | Histopathology (PDF, by Haoyun) | Human Face (PDF, by Haoyun)
Dermatology (PDF, by Yuhang) | Brain (PDF, by Yuhang) | Head & Neck (PDF, by Hongzhu) | Internal Organ (PDF, by Jiayi)

1. Voice Datasets (PDF, collected by Sihan)

(A) General-Purpose Voice Datasets

Main Uses | Full Name | Language | Data Scale | Identity Attributes | Institution | Data Availability | URL
ASR (Automatic Speech Recognition) | LibriSpeech ASR Corpus (LibriSpeech) | English | ~1000 hours | Speaker ID, Gender | Johns Hopkins University | Publicly available | https://www.openslr.org/12/
ASR (Automatic Speech Recognition) | Mozilla Common Voice (Common Voice) | Multilingual | 30,000 hours | Gender, Age group, Accent | Mozilla Foundation | Publicly available | https://commonvoice.mozilla.org/datasets
ASR (Automatic Speech Recognition) | Switchboard Corpus (Switchboard) | English | ~260 hours, >500 speakers | Gender, Age group, Education level, Dialect region | Texas Instruments, LDC | Publicly available | https://catalog.ldc.upenn.edu/LDC97S62
ASR (Automatic Speech Recognition) | Open-Source Mandarin Speech Corpus (AISHELL-1) | Chinese | 178 hours | Speaker ID, Gender | Beijing Shell Technology Co., Ltd. | Publicly available | https://www.openslr.org/33/
SER (Speech Emotion Recognition) | Interactive Emotional Dyadic Motion Capture Database (IEMOCAP) | English | 12 hours of audiovisual data, including video, speech, facial motion capture, and text transcriptions | Speaker ID, Gender | University of Southern California (USC), Signal Analysis and Interpretation Laboratory (SAIL) | Publicly available | https://sail.usc.edu/iemocap/
SER (Speech Emotion Recognition) | Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) | English | 7,356 files (24 professional actors) | Speaker ID, Gender, Emotion | Toronto Metropolitan University | Publicly available | https://zenodo.org/records/1188976
TTS (Text-to-Speech) | The LJ Speech Dataset (LJSpeech) | English | ~24 hours (13,100 short audio clips) | Speaker ID, Gender | Curated by Keith Ito (from LibriVox public domain data) | Publicly available | https://keithito.com/LJ-Speech-Dataset/
TTS (Text-to-Speech) | CSTR VCTK Corpus (Voice Cloning Toolkit) | English | ~44 hours (109 native English speakers, ~400 sentences per speaker) | Speaker ID, Gender, Accent | University of Edinburgh | Publicly available | https://datashare.ed.ac.uk/handle/10283/2651
Speaker Recognition | VoxCeleb Large-Scale Speaker Recognition Datasets (VoxCeleb 1 & 2) | English | 2000 hours, >7000 speakers | Gender (Age and Nationality can be inferred from external sources) | University of Oxford | Publicly available | http://www.robots.ox.ac.uk/~vgg/data/voxceleb/
Acoustic-Phonetic | TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT) | English | ~5 hours, 630 speakers | Gender, Age, Ethnicity, Dialect region, Education level | Texas Instruments, SRI, MIT | Publicly available | https://catalog.ldc.upenn.edu/LDC93s1

(B) Pathological Voice Datasets

No. | Category | Full Name | Related Disease(s) | Language | Data Scale
1 | VOICED Database | VOICED Database | Various voice disorders | Italian | 208 participants (150 pathological individuals and 58 healthy individuals)
2 | VOICED Database | Saarbruecken Voice Database (SVD) | Various pathologies | German | 2225 participants (869 healthy individuals, 1356 pathological individuals)
3 | Neurodegenerative | CMU PITT Corpus (CMU PITT) | Alzheimer’s Disease, Dementia | English | 397 participants (104 healthy individuals, 208 pathological individuals, 85 unknown individuals)
4 | Neurodegenerative | Parkinson’s Speech with Multiple Types of Sound Recordings (PD Database) | Parkinson’s Disease | Turkish | 252 participants (188 people with Parkinson’s (PWP), 64 healthy individuals)
5 | Neurodegenerative | Alzheimer’s Dementia Recognition through Spontaneous Speech (ADReSSo) | Alzheimer’s Disease | English | AD Classification Task: 237 participants (166 training + 71 test); Cognitive Decline Task: 105 recordings
6 | Mental & Emotional Health | Distress Analysis Interview Corpus (DAIC-WOZ) | Depression, Anxiety | English | 189 participants (47 depressed individuals, 142 healthy individuals)
7 | Mental & Emotional Health | ASDBank Dutch Asymmetries Corpus | Autism Spectrum Disorder (ASD) | Dutch | 192 participants (109 healthy individuals, 83 pathological individuals)
8 | Speech, Language & Developmental | TORGO Database | Dysarthria, Cerebral Palsy (CP), Amyotrophic Lateral Sclerosis (ALS) | English | 15 speakers (8 with dysarthria, 7 healthy controls)
9 | Speech, Language & Developmental | UASpeech | Cerebral Palsy | English | 19 pathological individuals
10 | Infectious & Respiratory | Cambridge COVID-19 Sound Database (CSSD) | COVID-19, Asthma, Cough | Multilingual (crowdsourced) | 53,449 audio samples (over 552 hours in total) crowdsourced from 36,116 participants (2,106 samples tested positive)

No. (Continued) | Identity Attributes | Size | Institution | Data Availability | URL
1 | Gender, Age, Lifestyle habits, Occupation | 110.1 MB | School of Medicine, University of Naples “Federico II” | Publicly available | https://physionet.org/content/voiced/1.0.0/
2 | Gender, Age | Total 38.1 GB | Saarland University, Germany | Publicly available | https://stimmdb.coli.uni-saarland.de/
3 | Gender, Age | Total 17.7 GB | Carnegie Mellon University, University of Pittsburgh | Publicly available | https://talkbank.org/dementia/access/English/Pitt.html
4 | Gender, Age | 2 MB | Department of Neurology, Cerrahpaşa Faculty of Medicine, Istanbul University | Publicly available | https://archive.ics.uci.edu/dataset/470/parkinson+s+disease+classification
5 | Gender, Age | / | ADReSS Challenge Organizers | Requires joining DementiaBank as a member first (free) | https://talkbank.org/dementia/ADReSSo-2021/index.html
6 | Gender, Age | Total 52.82 GB | University of Southern California (USC) Institute for Creative Technologies | Publicly available | https://dcapswoz.ict.usc.edu/
7 | Gender, Age | 152 MB | University of Groningen | Publicly available | https://talkbank.org/asd/access/Dutch/Asymmetries.html
8 | Speaker ID, Gender, Age | 18 GB (uncompressed) | University of Toronto, Holland Bloorview Kids Rehab Hospital | Publicly available | https://www.cs.toronto.edu/~complingweb/data/TORGO/torgo.html
9 | Gender, Age | / | University of Illinois | Publicly available | https://speechtechnology.web.illinois.edu/uaspeech/
10 | Gender, Age, Smoker status | / | University of Cambridge | Publicly available; a DTA must be signed | https://www.covid-19-sounds.org/zh/blog/neurips_dataset.html

2. Voice Datasets (PDF, collected by Jiarui)

No. | Dataset Name | Purpose / Use Case | Total Duration | Number of Speakers | Sampling Rate
1 | LibriSpeech | Large-scale read English speech, commonly used for training Automatic Speech Recognition (ASR) systems. | Approx. 1000 hours | 2484 | 16 kHz
2 | RAVDESS | Audiovisual emotional speech dataset, commonly used for Emotion Recognition research. | Approx. 25 minutes (pure speech) | 24 | 48 kHz
3 | SAVEE | Audiovisual expressed emotion dataset, commonly used for Emotion Recognition. | Approx. 10-15 minutes (480 segments) | 4 | 44.1 kHz
4 | EMO-DB | Berlin emotional speech database, commonly used for small-scale, high-quality Emotion Recognition training. | Approx. 3 minutes (535 segments) | 10 | 16 kHz
5 | VoxCeleb-1 & 2 | Large-scale face and voice recognition dataset, mainly used for speaker identification/verification. | Approx. 2000 hours | 7000+ (VoxCeleb2) | 16 kHz
6 | CMU-MOSEI | Large-scale multimodal emotion dataset, commonly used for multimodal sentiment analysis. | Approx. 23 hours (dialogue) | 1000+ | 16 kHz
7 | MUSAN | Multi-purpose background, music, and noise dataset, commonly used for audio data augmentation and recognition. | Approx. 109 hours | N/A (Non-speaker) | 16 kHz
8 | RIR | Room Impulse Response dataset, used for model training and audio enhancement. | N/A | N/A (Non-speaker) | 16 kHz
9 | CN-Celeb | Chinese speaker recognition database, focused on speaker identification/verification tasks. | 271.72 hours | 997 | 16 kHz
10 | CommonVoice | Large-scale multilingual public speech recognition database, used for ASR training. | Over 24,000 hours | 300,000+ | 16 kHz
11 | IEMOCAP | Interactive emotional motion capture database, used for dialogue and emotion recognition. | Approx. 12 hours (audio/video) | 10 | 16 kHz
12 | Emo-DB | Berlin emotional speech database, same as No. 4, used for emotion recognition. | Approx. 3 minutes (535 segments) | 10 | 16 kHz
13 | Toronto Emotional Speech Database | Emotional speech database, focused on emotion recognition research. | Approx. 7 hours (movie clips) | 20 (Actors) | 16 kHz
14 | LIRIS-ACCEDE | Audiovisual content annotation and emotion database, used for audiovisual emotion analysis. | Approx. 98 hours (movie clips) | N/A | N/A
15 | AESDD | Greek emotional speech database, used for emotion recognition research. | Approx. 700 segments (short) | 4 (Actors) | 44.1 kHz
16 | VoxForge | Public speech recognition corpus, used for training ASR systems. | N/A (Crowdsourced) | N/A (Crowdsourced) | 16 kHz
17 | TED-LIUM3 | Large English speech recognition corpus, extracted from TED talks, used for ASR. | Approx. 450 hours | Approx. 2300 | 16 kHz
18 | AISHELL-1 | Chinese speech recognition dataset, used for Chinese ASR. | Approx. 178 hours | 400 | 16 kHz (downsampled)
19 | AISHELL-WakeUp-1 | Chinese wake-up word database, used for wake-up word recognition. | 1561.12 hours | 254 | Six 16 kHz, 16-bit and one 44.1 kHz, 16-bit
20 | CMU-MOSEI | Large-scale multimodal emotion dataset, same as No. 6, used for multimodal sentiment analysis. | Approx. 23 hours (dialogue) | 1000+ | 16 kHz
21 | VoxBlink2 | Face liveness detection and speaker verification dataset. | N/A | N/A | N/A
22 | Emotional Voices Database (EmoV-DB) | Emotional speech database, used for emotion recognition and speech synthesis. | Approx. 3 hours | 4 (Professional) | 24 kHz
23 | DEMOS | Italian emotional speech database, used for Italian emotion recognition. | Approx. 2 hours | 4 (Professional) | 48 kHz
24 | Multilingual LibriSpeech (MLS) | Large-scale multilingual speech recognition dataset, used for training multilingual ASR. | Approx. 44,500 hours | N/A | 16 kHz

No. (Continued) | Gender Distribution | Accent/Language | Environment | Link
1 | Balanced (51.65% male and 48.35% female) | English | “Clean” and “Other” | https://www.openslr.org/12
2 | Balanced (12 Male, 12 Female) | North American English | Sound Studio (Controlled) | https://datasets.activeeon.ai/ml/datasets/ravdess-dataset
3 | Male-only (4 Male) | British English | Visual Media Lab | http://kahlan.eps.surrey.ac.uk/savee
4 | Balanced (5 Male, 5 Female) | German | Anechoic Chamber | http://emodb.bilderbar.info/download
5 | Imbalanced (61% Male, 39% Female) | Broad English/Multilingual | YouTube/Natural Environment | https://www.robots.ox.ac.uk/~vgg/data/voxceleb/index.html#about
6 | Imbalanced | Broad English (American English) | YouTube/Natural Environment | http://multicomp.cs.cmu.edu/resources/cmu-mosei-dataset
7 | N/A | N/A (Noise/Music) | Various Sources (Background/Music) | https://www.openslr.org/17
8 | N/A | N/A | Various Environments (Simulated) | https://www.openslr.org/28
9 | N/A | Mandarin/Chinese | Web/Natural Environment | https://cnceleb.org
10 | Imbalanced (Crowdsourced) | Multilingual/Broad Spectrum | Crowdsourced/Natural Environment | https://commonvoice.mozilla.org/en/datasets
11 | Balanced (5 Male, 5 Female) | American English | Motion Capture Lab | https://sail.usc.edu/iemocap/iemocap_info.htm
12 | Balanced (5 Male, 5 Female) | German | Anechoic Chamber | http://emodb.bilderbar.info/index_1280.html
13 | Balanced (10 Male, 10 Female) | Broad English | Movie Clips/Natural | https://tspace.library.utoronto.ca/handle/1807/24487
14 | N/A | Broad/Multilingual | Movie Clips/Natural | http://liris-accede.ec-lyon.fr/database.php
15 | Balanced (2 Male, 2 Female) | Greek | Sound Studio | http://mcl.ece.uth.gr/research/speech/speech-emotion-recognition
16 | Imbalanced (Crowdsourced) | Multilingual (Crowdsourced) | Crowdsourced/Various Environments | http://www.voxforge.org
17 | Imbalanced | Broad English (American English) | Lecture Venue | https://www.openslr.org/30
18 | Balanced (186 male, 214 female) | Mandarin, different accent areas in China | Quiet Indoor | https://www.openslr.org/33
19 | N/A | Mandarin | Real Home Environment | https://www.aishelltech.com/wakeup_data
20 | Balanced | Broad English (American English) | YouTube/Natural Environment | http://multicomp.cs.cmu.edu/resources/cmu-mosei-dataset
21 | N/A | N/A | N/A | https://voxblink2.github.io
22 | Balanced (2 Male, 2 Female) | English | Sound Studio | https://github.com/mmerlart/EmoV-DB
23 | Balanced (2 Male, 2 Female) | Italian | Sound Studio | https://zenodo.org/record/2544029
24 | N/A | 10 European Languages | Read Audiobooks | https://www.openslr.org/94

3. Medical Text Datasets (PDF, collected by Jiayue)

No. | Dataset Name | Data Content | Access | Size | Link
1 | MIMIC-III | Electronic health records, clinical notes | Training required | 40,000 patients | https://physionet.org/content/mimiciii/1.4/
2 | MIMIC-IV | Electronic health records, clinical notes | Training required | 364,627 unique individuals, 546,028 hospital admissions and 94,458 distinct ICU stays | https://physionet.org/content/mimiciv/3.1/
3 | eICU Collaborative Research Database | Vital signs, laboratory test results, medications, admission diagnoses, patient history, timestamped diagnoses, etc. | Training required | 200,000 ICU admissions | https://physionet.org/content/eicu-crd/2.0/
4 | UK Biobank | Imaging data, biomarker data, genetic data, medical records, questionnaire data, physical measurements, demographic and environmental data | Application and payment required | 500,000 participants | https://www.ukbiobank.ac.uk/about-us/how-we-work/access-to-uk-biobank-data/
5 | i2b2 (or n2c2: National NLP Clinical Challenges) | Extends the 2014 i2b2 dataset with new PHI categories and more complex contexts. | Requires signing a data use agreement | ~1,000 records | https://n2c2.dbmi.hms.harvard.edu/data-sets
6 | AmsterdamUMCdb | ICU database | Registration and ethical approval | 23,106 ICU admissions (20,109 patients) | https://github.com/AmsterdamUMC/AmsterdamUMCdb?tab=readme-ov-file
7 | PubMedQA | Biomedical research-oriented question answering dataset | Publicly available | 1,000 expert-annotated QA pairs, 61.2k unlabeled examples, and 211.3k automatically generated QA instances | https://aclanthology.org/D19-1259/#:~:text=We%20introduce%20PubMedQA%2C%20a%20novel,of%20the%20abstract%20and%2C%20presumably
8 | emrQA | A QA dataset generated using i2b2 annotated corpora | Publicly available | Over one million question-logical form pairs and more than 400,000 question-answer pairs | https://aclanthology.org/D18-1258/#:~:text=annotations%20on%20clinical%20notes%20for,and%20question%20to%20answer%20mapping
9 | MediTOD | Clinical history-taking conversations | Publicly available on GitHub | 22,503 annotated utterances | https://aclanthology.org/2024.emnlp-main.936.pdf#:~:text=1,designed%20for%20the%20medical%20domain
10 | MTS-Dialog | Doctor-patient conversations paired with structured clinical summaries | Publicly available on GitHub | 1,701 short dialogues | https://github.com/abachaa/MTS-Dialog
11 | PubMed Database | Web pages of citations and abstracts | PubMed API | 39 million | https://pubmed.ncbi.nlm.nih.gov/about/#:~:text=PubMed%20contains%20more%20than%2039,PMC
12 | CORD-19 (COVID-19 Open Research Dataset) | Scientific papers related to COVID-19 and other coronavirus research | Publicly available on GitHub | Over 1 million indexed papers | https://github.com/allenai/cord19

3. Drug Datasets (PDF, collected by Huicong)

No. | Dataset Name | Description | URL
1 | ChEMBL | 2.3M chemical structures, 16,000 target sites | https://www.ebi.ac.uk/chembl
2 | DrugBank | 16,000 drugs | https://go.drugbank.com
3 | PubChem | The world’s largest collection of freely accessible chemical information | https://pubchem.ncbi.nlm.nih.gov
4 | The Cancer Genome Atlas (TCGA) | 33 cancer types, 20,000 omics data | https://www.cancer.gov/ccg/research/genome-sequencing/tcga
5 | cBioPortal | TCGA, ICGC visualization | https://www.cbioportal.org
6 | Gene Expression Omnibus (GEO) | A public functional genomics data repository | https://www.ncbi.nlm.nih.gov/geo
7 | FDA Adverse Event Reporting System (FAERS) | The FDA’s adverse event database | https://www.fda.gov/drugs/questions-and-answers-fdas-adverse-event-reporting-system-faers/fda-adverse-event-reporting-system-faers-public-dashboard
8 | Side Effect Resource (SIDER) | An adverse event database | http://sideeffects.embl.de
9 | MIMIC-IV | / | https://mimic.mit.edu
10 | UK Biobank | / | https://www.ukbiobank.ac.uk

4. Histopathology Datasets (PDF, collected by Haoyun)

No. | Dataset Name | Related Disease(s) | Data Scale | Identity Attributes | URL
1 | The Cancer Genome Atlas Program | All kinds of cancer | / | / | https://www.cancer.gov/ccg/research/genome-sequencing/tcga
2 | Lung and Colon Histopathological Image Dataset | Lung & colon cancer | 25,000 images | / | https://github.com/tampapath/lung_colon_image_set
3 | 100,000 histological images of human colorectal cancer and healthy tissue | Colorectal cancer | 100,000 images | / | https://opendatalab.com/OpenDataLab/CRC100K/cli/main
4 | Multi-class texture analysis in colorectal cancer histology | Colorectal cancer | 5000 images | / | https://zenodo.org/records/53169
5 | Breast-Cancer-Detection Dataset | Breast Cancer | 5000 images | Breast density | https://github.com/marcos-jimenez-larroy/Breast-cancer-detection-CNN
6 | EndoNuke | Endometrium | 1600 images | Stroma, epithelium, other | https://endonuke.ispras.ru/

5. Human Face Datasets (PDF, collected by Haoyun)

No. | Dataset Name | Data Scale | Identity Attributes | URL
1 | Flickr-Faces-HQ | 70,000 images | / | https://github.com/NVlabs/ffhq-dataset
2 | CelebFaces Attributes | 202,599 images | 40 attributes (e.g., hair, glasses, expression) | https://aistudio.baidu.com/datasetdetail/224771
3 | Labeled Faces in the Wild | 13,233 images | Name | https://opendatalab.org.cn/OpenDataLab/LFW/cli/main
4 | CASIA-WebFace | 494,414 images | / | https://gitcode.com/Premium-Resources/abe76
5 | MegaFace | 1,000,000 + 690,572 images | / | https://megaface.cs.washington.edu/dataset/download.html
6 | MS-Celeb-1M | 1,000,000 images | Name | https://opendatalab.org.cn/OpenDataLab/MS-Celeb-1M/explore/main

6. Dermatology Datasets (PDF, by Yuhang)

No. | Dataset Name | Source | Collectors | Domain | Data Type
1 | Diverse Dermatology Images (DDI) dataset | Stanford Clinics | Daneshjou R, Vodrahalli K, Novoa R A, et al. (Stanford) | Dermatology: skin lesion diagnosis (benign vs malignant) | Clinical photographs
2 | Fitzpatrick 17k | Two dermatology atlases: DermaAmin and Atlas Dermatologico | Groh M, Harris C, Soenksen L, et al.; annotated by Scale AI and Centaur Labs | Dermatology: skin disease classification, fairness evaluation | Clinical images
3 | PAD-UFES-20 | Federal University of Espírito Santo (UFES) | Daniel C. C. Barata, Allan C. P. Oliveira, André L. B. Medeiros, et al. | Automated skin cancer detection | Clinical images
4 | HAM10000 (ISIC2018) | Office of the skin cancer practice of Cliff Rosendahl, ViDIR Group | Tschandl P., Rosendahl C., Kittler H. | Skin lesion detection | Dermatoscopic images
5 | ISIC 2019 | BCN_20000 dataset; HAM10000 dataset; MSK dataset | International Skin Imaging Collaboration (ISIC) Challenge archive | Skin lesion detection | Dermatoscopic images
6 | ACNE04 | China | Xiaoping Wu et al. | Facial acne severity grading and lesion counting | Color face photographs
7 | ISIC 2020 | Hospital Clínic de Barcelona; Medical University of Vienna; Memorial Sloan Kettering Cancer Center; etc. | International Skin Imaging Collaboration (ISIC) Challenge archive | Skin lesion classification | Dermatoscopic images
8 | ISIC 2024 | Memorial Sloan Kettering Cancer Center (USA); Hospital Clínic de Barcelona (Spain); The University of Queensland (Australia); etc. | International Skin Imaging Collaboration (ISIC) Challenge archive | Skin lesion classification | Clinical skin lesion crops derived from 3D total-body photography (TBP)

No. (Continued) | Data Size | Labels | Demographics | Data Collection Period | Reference / URL
1 | 656 images; 228 MB | 2 malignant labels | Skin tone | 2010-2020 | https://doi.org/10.1126/sciadv.abq6147
2 | 16,577 images; 1.36 GB | 114 disease labels; 9 clinical labels; 3 malignant labels; 6 skin tone labels | Skin tone | Pre-2021 | https://github.com/mattgroh/fitzpatrick17k
3 | 2298 images (1373 patients, 1641 skin lesions); 3.35 GB | 6 disease labels | Age, gender, Fitzpatrick (skin tone) and region | 2018-2019 | https://data.mendeley.com/datasets/zr7vgbcyr2/1
4 | 10015 images; 2.9 GB | 7 skin lesion labels | Age, gender, localization | ~1998-2017 (≈ 20 years) | https://www.kaggle.com/datasets/kmader/skin-cancer-mnist-ham10000
5 | 25,331 training images (9.1 GB); 8238 test images (3.6 GB) | 9 skin lesion labels | Age, gender, localization | Pre-2019 | https://www.kaggle.com/datasets/andrewmvd/isic-2019
6 | ~1457 images; 1.05 GB | Lesion count; Acne severity | / | / | https://github.com/xpwu95/ldl
7 | 33,126 training images (23 GB); 10,982 test images (6.7 GB) | Binary classification: malignant (mainly melanoma) or benign | Age, gender, localization | / | https://challenge.isic-archive.com/data/#2020
8 | 401,059 images; 1.2 GB | Binary classification: malignant (mainly melanoma) or benign | Age, gender, localization | 2015-2024 | https://challenge.isic-archive.com/data/#2024

7. Brain Datasets (PDF, by Yuhang)

No. | Dataset Name | Source | Collectors | Domain | Data Type
1 | Brain Tumor MRI Dataset | Combination of three datasets: figshare, SARTAJ dataset and Br35H | / | Brain tumor classification and segmentation | MRI
2 | ADNI | U.S. and Canada | / | Neuroimaging and clinical research focused on Alzheimer’s Disease (AD) and cognitive aging | Imaging: Structural MRI, functional MRI (fMRI), diffusion MRI (DTI), PET (FDG-PET, amyloid PET); etc.
3 | BR35H | Kaggle / Radiopaedia & Figshare collections | / | Brain tumor detection | T1-weighted MRI (axial slices, grayscale images)
4 | OASIS | Washington University in St. Louis, Massachusetts General Hospital | Marcus et al., OASIS Brain Imaging Consortium | Aging and dementia research | Structural MRI (T1-weighted), some versions include PET and clinical data
5 | AIBL | Commonwealth Scientific and Industrial Research Organisation (CSIRO), Australia | AIBL Research Group | Alzheimer’s disease and cognitive aging research | MRI, PET (PiB, FDG), blood biomarkers, neuropsychological and lifestyle data
6 | BraTS | MICCAI Challenge | CBICA and international collaborators | Brain tumor segmentation and classification | Multimodal MRI

No. (Continued) | Data Size | Labels | Demographics | Data Collection Period | Reference / URL
1 | 7023 images; 158.6 MB | 4 tumor classes | / | / | https://www.kaggle.com/datasets/masoudnickparvar/brain-tumor-mri-dataset
2 | 2,000+ participants | Cognitive status | Age, gender and education | Began in 2004, ongoing | http://adni.loni.usc.edu/
3 | 3,060 images (1,500 tumor / 1,560 non-tumor); 91.77 MB | Binary: Tumor / No Tumor | / | / | https://www.kaggle.com/datasets/ahmedhamada0/brain-tumor-detection
4 | OASIS-4: >1,000 subjects with MRI, PET, and cognitive measures | Cognitive status: Healthy control, Mild Cognitive Impairment (MCI), Alzheimer’s Disease (AD) | Age and gender | 1999-2018 (OASIS-1 to OASIS-3) | https://www.oasis-brains.org
5 | ~1,100 participants with longitudinal imaging and clinical assessments | Cognitive status: Healthy Control (HC), Mild Cognitive Impairment (MCI), Alzheimer’s Disease (AD) | Age and gender | 2006-present (ongoing longitudinal study) | https://aibl.org.au/
6 | Typically >1,000 cases; ~100 GB | Tumor subregions: enhancing tumor, non-enhancing tumor core, and peritumoral edema | Adult patients with gliomas from multiple international institutions | Annual updates since 2012 (BraTS 2012-2024) | https://www.med.upenn.edu/cbica/brats2021/

8. Head and Neck Datasets (PDF, by Hongzhu)

No. | Dataset Name | Source | Collection Period | Image Modality | Disease Type
1 | OtoscopeData | Özel Van Akdamar Hospital in Turkey | 2018.10-2019.01 | Otoscope (.png) | Normal, acute otitis media, chronic suppurative otitis media, ear ventilation tube, excessive amount of earwax, foreign object in ear, otitis externa, pseudo eardrums, tympanosclerosis
2 | Otoscopic Image Dataset | / | / | Otoscope (.jpg) | Acute Otitis Media, Cerumen Impaction, Chronic Otitis Media, Myringosclerosis, Normal (Healthy Ear)
3 | otitis-media-master | Department of Otorhinolaryngology at the Clinical Hospital of the Universidad de Chile | / | Otoscope (.jpg) | Chronic Otitis Media, Myringosclerosis, Earwax Plug, Normal
4 | Ocular Disease Recognition | Shanggong Medical Technology Company | / | Color fundus image (.jpg) | Normal, Diabetes, Glaucoma, Cataract, Age-related Macular Degeneration, Hypertension, Pathological Myopia, Other diseases/abnormalities
5 | High-Resolution Fundus (HRF) Image Database | The Pattern Recognition Lab (CS5), the Department of Ophthalmology, etc. | 2010.12-2011.06 | Color fundus image (.jpg) | Normal, Diabetic Retinopathy, Glaucoma
6 | Retinal Fundus Multi-Disease Image Dataset (RFMiD) | Center of Excellence in Signal and Image Processing | / | Color fundus image (.jpg) | 46 diseases
7 | Retinal OCT Images | The Shiley Eye Institute of the University of California San Diego, etc. | 2013.07.01-2017.03.01 | Optical Coherence Tomography | NORMAL, CNV, DME, DRUSEN
8 | CE-NBI Laryngeal Dataset | The Department of Otorhinolaryngology, Head and Neck Surgery in Magdeburg University Hospital, Germany | 2015.01.01-2021.12.31 | Laryngoscope (.jpg) | Cyst, Polyp, Reinke’s edema, Hemangioma, Granuloma, Bamboo node, etc.
9 | Laryngeal dataset | / | / | Laryngoscope (.png) | Le (Leukoplakia), He (Healthy Epithelium), Hbv (Hypertrophic Blood Vessels), IPCL (Intra-papillary Capillary Loops)

No. (Continued) | Number of Images | Number of Patients | Demographics | File Size | URL
1 | 956 | / | / | 191 MB | https://www.kaggle.com/datasets/omduggineni/otoscopedata
2 | 3000 | / | / | 86.8 MB | https://www.kaggle.com/datasets/ucimachinelearning/otoscopic-image-dataset
3 | 880 | 180 | / | 10.5 MB | https://github.com/zcomert/otitis-media/tree/master
4 | 6392 | 5000 | Gender, age | 2 GB | https://www.kaggle.com/datasets/andrewmvd/ocular-disease-recognition-odir5k
5 | 45 | / | / | 73 MB | https://www5.cs.fau.de/research/data/fundus-images/
6 | 3200 | / | / | 8 GB | https://riadd.grand-challenge.org/download-all-classes/
7 | 84,495 | / | / | 12 GB | https://www.kaggle.com/datasets/paultimothymooney/kermany2018
8 | 11144 | 210 | / | 1.4 GB | https://zenodo.org/records/6674034
9 | 1320 | 33 | / | 8 MB | https://www.kaggle.com/datasets/mahdiehhajian/laryngeal-dataset

9. Internal Organ Datasets (PDF, collected by Jiayi)

No. | Dataset | Modality | Organ | Disease | Image Format | Size | URL
1 | Open Kidney Dataset | B-mode abdominal ultrasound | Kidney | Diabetes mellitus, immunoglobulin A (IgA) nephropathy, hypertension and transplanted kidneys | PNG | Total 534: 514 unique B-mode images with 20 additional copies (two sets of 10) repeated from these 514; 130 MB | https://github.com/rsingla92/kidneyUS/tree/main/labels/reviewed_masks_1
2 | KiTS19 | Contrast-enhanced CT | Kidney | Kidney cancer | .nii.gz; .dcm (DICOM) | 300 kidneys (210 public cases (training + validation), 90 private cases (testing)); 70 GB | https://github.com/neheller/kits19/tree/master/data
3 | CT-ORG | 3D CT | Liver, lungs, bladder, kidneys, bones, and brain | Kidney cancer | .nii.gz | 140 cases (119 cases for training and 21 cases for testing); 11 GB | https://www.cancerimagingarchive.net/collection/ct-org/
4 | AbdomenCT-1K | 3D CT | Liver, Spleen, Pancreas, Kidneys, Gallbladder, Stomach | Abdominal organs and tumors | .nii.gz | 1112 cases; 120 GB | https://github.com/JunMa11/AbdomenCT-1K
5 | FLARE 2023 | CT | Various abdominal organs | Abdominal organs and tumors | .nii.gz | 4500 images | https://codalab.lisn.upsaclay.fr/competitions/12239