Data Scientist Dataset Finder Blog: April 2020

Tuesday, 21 April 2020

Naver sentiment movie corpus v1.0 (korean)

Monday, 20 April 2020

GAN, image segmentation dataset

dataset	example
python tools/download-dataset.py facades 400 images from CMP Facades dataset. (31MB) Pre-trained: BtoA
python tools/download-dataset.py cityscapes 2975 images from the Cityscapes training set. (113M) Pre-trained: AtoB BtoA
python tools/download-dataset.py maps 1096 training images scraped from Google Maps (246M) Pre-trained: AtoB BtoA
python tools/download-dataset.py edges2shoes 50k training images from UT Zappos50K dataset. Edges are computed by HED edge detector + post-processing. (2.2GB) Pre-trained: AtoB
python tools/download-dataset.py edges2handbags 137K Amazon Handbag images from iGAN project. Edges are computed by HED edge detector + post-processing. (8.6GB) Pre-trained: AtoB

satellite segmentation image dataset list

image segmentation dataset list

Image Segmentation Keras : Implementation of Segnet, FCN, UNet, PSPNet and other models in Keras.

image segmentation dataset
github : https://github.com/divamgupta/image-segmentation-keras
google drive : https://drive.google.com/uc?id=0B0d9ZiqAgFkiOHR1NTJhWVJMNEU&export=download

Sunday, 19 April 2020

COVID-CT

CT images with clinical findings of COVID-19

The COVID-CT-Dataset has 275 CT images containing clinical findings of COVID-19. The images are collected from medRxiv and bioRxiv papers about COVID-19. CTs containing COVID-19 abnormalities are selected by reading the figure captions in the papers. All copyrights of the data belong to medRxiv and bioRxiv.

🏡 GitHub : https://www.visualdata.io/?fbclid=IwAR2fwIhpd27Fvk7uVQ4FVroV52Fmy7u2m-7hcAT1-7TdWa1-6PmWe-NIXaM

Covid19 Challenge Dataset

Open research on large Covid-19 imaging datasets

Medical imaging is potentially well suited for Covid-19 diagnosis. This challenge is about connecting the best brains to support doctors with artificial intelligence systems.

🏡 : https://www.covid19challenge.eu

Saturday, 18 April 2020

An Open Pan-Cancer Histology Dataset for Nuclei Instance Segmentation and Classification

Semi automatically generated nuclei instance segmentation and classification dataset with exhaustive nuclei labels across 19 different tissue types. The dataset consists of 481 visual fields, of which 312 are randomly sampled from more than 20K whole slide images at different magnifications, from multiple data sources. In total the dataset contains 205,343 labeled nuclei, each with an instance segmentation mask. Models trained on pannuke can aid in whole slide image tissue type segmentation, and generalise to new tissues. PanNuke demonstrates one of the first succesfully semi-automatically generated datasets.

🏡: https://jgamper.github.io/PanNukeDataset/

The ORNL Overhead Vehicle Dataset (OOVD)

This data set was created to understand the potential for machine learning, computer vision, and HPC to improve the energy efficiency aspects of traffic control by leveraging GRIDSMART traffic cameras as sensors for adaptive traffic control, with a sensitivity to the fuel consumption characteristics of the traffic in the camera’s visual field. GRIDSMART cameras—an existing, fielded commercial product—sense the presence of vehicles at intersections and replace more conventional sensors (such as inductive loops) to issue calls to traffic control. These cameras, which have horizon-to-horizon view, offer the potential for an improved view of the traffic environment which can be used to generate better control algorithms.

link : https://www.ornl.gov/project/ornl-overhead-vehicle-dataset-oovd

3d fashion dataset

Deep Fashion3D: A Dataset and Benchmark for 3D Garment Reconstruction from Single Images

We present Deep Fashion3D, a large-scale repository of 3D clothing models reconstructed from real garments. It contains over 2000 3D garment models, spanning 10 different cloth categories. Each model is richly labeld with groundtruth point cloud, multi-view real images, 3D body pose and a novel annotation named feature lines. With Deep Fashion3D, inferring the garment geometry from a single image becomes possible.

official :

https://kv2000.github.io/2020/03/25/deepFashion3DRevisited/

https://ai4ce.github.io/SPARE3D/

github:

https://github.com/ai4ce/SPARE3D

COVIDx

COVID-Net Open Source Initiative

https://github.com/lindawangg/COVID-Net

Friday, 17 April 2020

Dataset Finders

Dataset Finders

Google Dataset Search Introductory blog post
Kaggle Datasets Page: A data science site that contains a variety of externally contributed interesting datasets. You can find all kinds of niche datasets in its master list, from ramen ratings to basketball data to and even Seattle pet licenses.
UCI Machine Learning Repository: One of the oldest sources of datasets on the web, and a great first stop when looking for interesting datasets. Although the data sets are user-contributed, and thus have varying levels of cleanliness, the vast majority are clean. You can download data directly from the UCI Machine Learning repository, without registration.
VisualData: Discover computer vision datasets by category, it allows searchable queries.

Government & Statistics Data

Government & Statistics Data

Data.gov: This site makes it possible to download data from multiple US government agencies. Data can range from government budgets to school performance scores. Be warned though: much of the data requires additional research.
Data USA: The most comprehensive visualization of US public data
Food Environment Atlas: Contains data on how local food choices affect diet in the US.
School system finances: A survey of the finances of school systems in the US.
The US National Center for Education Statistics: Data on educational institutions and education demographics from the US and around the world.
The UK Data Service: The UK’s largest collection of social, economic and population data.
EU Gender statistics database
The Netherlands’ National Georegister (Dutch)
United Nations Development Programme Projects

Health & Biology Data

Health & Biology Data

EU Surveillance Atlas of Infectious Diseases
Merck Molecular Activity Challenge
Musk dataset: The Musk database describes molecules occurring in different conformations. Each molecule is either musk or non-musk and one of the conformations determines this property.

Miscellaneous Datasets

Miscellaneous Datasets

CMU Motion Capture Database
Brodatz dataset: texture modeling
300 terabytes of high-quality data from the Large Hadron Collider (LHC) at CERN
NYC Taxi dataset: NYC taxi data obtained as a result of a FOIA request, led to privacy issues.
Uber FOIL dataset: Data for 4.5M pickups in NYC from an Uber FOIL request.
Criteo click stream dataset: Large Internet advertisement dataset from a major EU retargeter.

Symbolic Music Datasets

Symbolic Music Datasets

Speech Datasets

Speech Datasets

2000 HUB5 English: English-only speech data used most recently in the Deep Speech paper from Baidu.
LibriSpeech: Audio books data set of text and speech. Nearly 500 hours of clean speech of various audio books read by multiple speakers, organized by chapters of the book containing both the text and the speech.
VoxForge: Clean speech dataset of accented english. Useful for instances in which you expect to need robustness to different accents or intonations.
TIMIT: English-only speech recognition dataset.
CHIME: Noisy speech recognition challenge dataset. Dataset contains real simulated and clean voice recordings. Real being actual recordings of 4 speakers in nearly 9000 recordings over 4 noisy locations, simulated is generated by combining multiple environments over speech utterances and clean being non-noisy recordings.
TED-LIUM: Audio transcription of TED talks. 1495 TED talks audio recordings along with full text transcriptions of those recordings.

Networks and Graphs

Networks and Graphs

Amazon Co-Purchasing: Amazon Reviews crawled data from “the users who bought this also bought…” section of Amazon, as well as Amazon review data for related products. Good for experimenting with recommendation systems in networks.
Friendster Social Network Dataset: Before their pivot as a gaming website, Friendster released anonymized data in the form of friends lists for 103,750,348 users.

Recommendation and ranking systems

Recommendation and ranking systems

Movielens: Movie ratings dataset from the Movielens website, in various sizes ranging from demo to mid-size.
Million Song Dataset: Large, metadata-rich, open source dataset on Kaggle that can be good for people experimenting with hybrid recommendation systems.
Last.fm: Music recommendation dataset with access to underlying social network and other metadata that can be useful for hybrid systems.
Book-Crossing dataset:: From the Book-Crossing community. Contains 278,858 users providing 1,149,780 ratings about 271,379 books.
Jester: 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users.
Netflix Prize:: Netflix released an anonymized version of their movie rating dataset; it consists of 100 million ratings, done by 480,000 users who have rated between 1 and all of the 17,770 movies. First major Kaggle style data challenge. Only available unofficially, as privacy issues arose.

Sentiment

Sentiment

Multidomain sentiment analysis dataset An older, academic dataset.
IMDB: An older, relatively small dataset for binary sentiment classification. Fallen out of favor for benchmarks in the literature in lieu of larger datasets.
Stanford Sentiment Treebank: Standard sentiment dataset with fine-grained sentiment annotations at every node of each sentence’s parse tree.

Question answering

Question answering

Maluuba News QA Dataset: 120K Q&A pairs on CNN news articles.
Quora Question Pairs: first dataset release from Quora containing duplicate / semantic similarity labels.
CMU Q/A Dataset: Manually-generated factoid question/answer pairs with difficulty ratings from Wikipedia articles.
Maluuba goal-oriented dialogue: Procedural conversational dataset where the dialogue aims at accomplishing a task or taking a decision. Often used to work on chat bots.
bAbi: Synthetic reading comprehension and question answering datasets from Facebook AI Research (FAIR).
The Children’s Book Test: Baseline of (Question + context, Answer) pairs extracted from Children’s books available through Project Gutenberg. Useful for question-answering (reading comprehension) and factoid look-up.

Text Datasets

Text Datasets

20 newsgroups: Classification task, mapping word occurences to newsgroup ID. One of the classic datasets for text classification) usually useful as a benchmark for either pure classification or as a validation of any IR / indexing algorithm.
Reuters News dataset: (Older) purely classification-based dataset with text from the newswire. Commonly used in tutorial.
Penn Treebank: Used for next word prediction or next character prediction.
UCI’s Spambase: (Older) classic spam email dataset from the famous UCI Machine Learning Repository. Due to details of how the dataset was curated, this can be an interesting baseline for learning personalized spam filtering.
Broadcast News: Large text dataset, classically used for next word prediction.
Text Classification Datasets: From; Zhang et al., 2015; An extensive set of eight datasets for text classification. These are the benchmark for new text classification baselines. Sample size of 120K to 3.6M, ranging from binary to 14 class problems. Datasets from DBPedia, Amazon, Yelp, Yahoo! and AG.
WikiText: A large language modeling corpus from quality Wikipedia articles, curated by Salesforce MetaMind.
SQuAD: The Stanford Question Answering Dataset — broadly useful question answering and reading comprehension dataset, where every answer to a question is posed as a segment of text.
Billion Words dataset: A large general-purpose language modeling dataset. Often used to train distributed word representations such as word2vec.
Common Crawl: Petabyte-scale crawl of the web — most frequently used for learning word embeddings. Available for free from Amazon S3. Can also be useful as a network dataset for it’s a crawl of the WWW.
Google Books Ngrams: Successive words from Google books. Offers a simple method to explore when a word first entered wide usage.
Yelp Open Dataset: The Yelp dataset is a subset of Yelp businesses, reviews, and user data for use in NLP.

Video Datasets

Video Datasets

Youtube-8M: A large and diverse labeled video dataset for video understanding research.

Facial Datasets

Facial Datasets

Labelled Faces in the Wild: 13,000 cropped facial regions (using; Viola-Jones that have been labeled with a name identifier. A subset of the people present have two images in the dataset — it’s quite common for people to train facial matching systems here.
UMD Faces Annotated dataset of 367,920 faces of 8,501 subjects.
CASIA WebFace Facial dataset of 453,453 images over 10,575 identities after face detection. Requires some filtering for quality.
MS-Celeb-1M 1 million images of celebrities from around the world. Requires some filtering for best results on deep networks.
Olivetti: A few images of several different people.
Multi-Pie: The CMU Multi-PIE Face Database
Face-in-Action
JACFEE: Japanese and Caucasian Facial Expressions of Emotion
FERET: The Facial Recognition Technology Database
mmifacedb: MMI Facial Expression Database
IndianFaceDatabase
The Yale Face Database and The Yale Face Database B).
[Mut1ny Face/Head segmentation dataset] (http://www.mut1ny.com/face-headsegmentation-dataset) Over 16k pixel-level segmented images of faces/head images

Data Scientist Dataset Finder Blog