Data Scientist Dataset Finder Blog: June 2023

Thursday, 29 June 2023

**Paper:**

https://arxiv.org/abs/1908.08345

**Dataset:**

1) The CNN/DailyMail news highlights dataset: somewhat extractive

- News articles with associated highlights, which give a brief overview of each article

- Input document: limited to 512 tokens

- https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail

2) The New York Times Annotated Corpus (NYT): somewhat extractive

- Contains 110,540 articles with abstractive summaries

- Input document: limited to 800 tokens

- https://research.google/resources/datasets/ny-times-annotated-corpus/

3) XSum: abstractive

- 226,711 news articles, each with a one-sentence summary answering the question ‘What is this article about?’

- Input document: limited to 512 tokens

- https://github.com/google-research-datasets/xsum_hallucination_annotations
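
The per-dataset input limits listed above (512, 800 and 512 tokens) can be applied with a simple truncation step. The following is a minimal sketch, not the paper's preprocessing code: the dictionary keys and the `truncate_document` helper are hypothetical names, and tokens are assumed to be whitespace-separated words. A real pipeline would more likely count tokens with the model's own tokenizer.

```python
# Token limits as listed in this post; the keys are made-up labels.
MAX_TOKENS = {"cnn_dailymail": 512, "nyt": 800, "xsum": 512}


def truncate_document(text: str, dataset: str) -> str:
    """Keep only the first MAX_TOKENS[dataset] tokens of `text`.

    Assumption: a token is a whitespace-separated word. Swap in a
    model-specific tokenizer if the limit should count subword tokens.
    """
    limit = MAX_TOKENS[dataset]
    tokens = text.split()
    return " ".join(tokens[:limit])


if __name__ == "__main__":
    # Synthetic 1,000-word "article" just to exercise the function.
    article = " ".join(f"w{i}" for i in range(1000))
    for name in MAX_TOKENS:
        kept = truncate_document(article, name).split()
        print(name, len(kept))
```

Documents shorter than the limit pass through unchanged, since slicing past the end of a list is safe in Python.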
