Data Scientist Dataset Finder Blog: Speech Datasets

Friday, 17 April 2020

Speech Datasets

2000 HUB5 English: English-only speech data, used in the Deep Speech paper from Baidu.
LibriSpeech: Audiobook dataset of text and speech. Nearly 500 hours of clean speech from various audiobooks read by multiple speakers, organized by book chapter, with both the text and the speech for each chapter.
VoxForge: Clean speech dataset of accented English. Useful when you expect to need robustness to different accents or intonations.
TIMIT: English-only speech recognition dataset.
CHiME: Noisy speech recognition challenge dataset. It contains real, simulated, and clean voice recordings. The real set is actual recordings of 4 speakers, nearly 9000 recordings across 4 noisy locations; the simulated set is generated by combining multiple environments with speech utterances; the clean set is non-noisy recordings.
TED-LIUM: Audio transcription of TED talks. 1495 TED talk audio recordings along with full text transcriptions of those recordings.
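
The CHiME entry above mentions simulated recordings made by combining noisy environments with speech utterances. As a rough illustration of that idea, here is a minimal sketch that adds a noise signal to a clean utterance at a chosen signal-to-noise ratio. This is not the CHiME pipeline: the function names (rms, mix_at_snr), the use of plain Python lists as audio samples, and the SNR-based scaling are all assumptions made for the example.

```python
import math


def rms(samples):
    """Root-mean-square level of a sequence of audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus noise, with the noise scaled to the given SNR in dB.

    Hypothetical helper for illustration; both inputs are lists of floats
    assumed to share the same sample rate.
    """
    speech = list(speech)
    noise = list(noise)
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")

    # Loop the noise if it is shorter than the utterance, then trim to length.
    repeats = len(speech) // len(noise) + 1
    noise = (noise * repeats)[:len(speech)]

    noise_level = rms(noise)
    if noise_level == 0:
        raise ValueError("noise is silent, cannot scale it to a target SNR")

    # SNR(dB) = 20 * log10(rms(speech) / rms(scaled noise))
    target_noise_level = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_level / noise_level

    return [s + gain * n for s, n in zip(speech, noise)]
```

For example, calling mix_at_snr(clean, cafe_noise, 5) would return a noisier copy of the utterance than mix_at_snr(clean, cafe_noise, 20), since a lower SNR means relatively louder noise. Real challenge data typically also accounts for room acoustics and microphone characteristics, which this sketch ignores.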
