Friday 17 April 2020

Text Datasets

  • 20 newsgroups: Classification task, mapping word occurrences to newsgroup ID. One of the classic datasets for text classification, usually useful as a benchmark for either pure classification or as a validation of any IR / indexing algorithm.
  • Reuters News dataset: (Older) purely classification-based dataset with text from the newswire. Commonly used in tutorials.
  • Penn Treebank: Used for next word prediction or next character prediction.
  • UCI’s Spambase: (Older) classic spam email dataset from the famous UCI Machine Learning Repository. Due to details of how the dataset was curated, this can be an interesting baseline for learning personalized spam filtering.
  • Broadcast News: Large text dataset, classically used for next word prediction.
  • Text Classification Datasets: From Zhang et al., 2015. An extensive set of eight datasets for text classification. These are the standard benchmarks for new text classification baselines. Sample sizes range from 120K to 3.6M, and the tasks range from binary to 14-class problems. Datasets from DBPedia, Amazon, Yelp, Yahoo! and AG.
  • WikiText: A large language modeling corpus from quality Wikipedia articles, curated by Salesforce MetaMind.
  • SQuAD: The Stanford Question Answering Dataset, a broadly useful question answering and reading comprehension dataset, where the answer to every question is a segment of text from the corresponding passage.
  • Billion Words dataset: A large general-purpose language modeling dataset. Often used to train distributed word representations such as word2vec.
  • Common Crawl: Petabyte-scale crawl of the web, most frequently used for learning word embeddings. Available for free from Amazon S3. Can also be useful as a network dataset, since it is a crawl of the WWW.
  • Google Books Ngrams: Successive words from Google Books. Offers a simple method to explore when a word first entered wide usage.
  • Yelp Open Dataset: The Yelp dataset is a subset of Yelp businesses, reviews, and user data for use in NLP.
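
To make the "mapping word occurrences to newsgroup ID" task from the 20 newsgroups entry concrete, here is a minimal sketch of a bag-of-words classifier. It uses multinomial Naive Bayes with add-one smoothing, which is just one common baseline for this kind of task, not a method prescribed by any of the datasets above. The four-sentence corpus and all function names are made up for illustration. Only the two label strings are borrowed from real 20 newsgroups category names.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus standing in for newsgroup posts: (text, label) pairs.
# The real dataset has roughly 20,000 posts across 20 groups.
TRAIN = [
    ("the team won the hockey game", "rec.sport.hockey"),
    ("great goal in the hockey playoffs", "rec.sport.hockey"),
    ("the shuttle launch reached orbit", "sci.space"),
    ("nasa plans a new orbit mission", "sci.space"),
]


def tokenize(text):
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()


def train_naive_bayes(examples):
    """Count word occurrences per label and documents per label."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    doc_counts = Counter()              # label -> number of documents
    vocab = set()
    for text, label in examples:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return the label with the highest smoothed log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        # Log prior: fraction of training documents with this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_naive_bayes(TRAIN)
print(predict(model, "hockey game tonight"))
print(predict(model, "orbit mission"))
```

On the real dataset the same idea applies, but a library vectorizer and classifier would normally replace the hand-rolled counting.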
