Data Scientist Dataset Finder Blog: 2023

Friday, 7 July 2023

List of Finnish Datasets for NLP Projects

https://metatext.io/datasets-list/finnish-language

FI News Corpus

Dataset is a collection of news headlines and short text summaries, organized by date. The news articles were published between 2012 and 2020.

CC100-Finnish

This dataset is one of the 100 monolingual corpora that were processed from the January-December 2018 Commoncrawl snapshots using the CC-Net repository. The size of this corpus is 15 GB.

FinChat

Dataset contains conversations with message timestamps, sender IDs, and metadata. It comprises 86 conversations with 3,630 messages and 22,210 words, with an average word length of 5.6 and an average of 14 turns per conversation.

Finnish News Corpus for Named Entity Recognition

Dataset contains 953 articles (193,742 word tokens) with 6 named entity classes: organization, location, person, product, event, and date.
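For a token-classification model, the six classes above are usually expanded into a tag set. The sketch below assumes a BIO tagging scheme and builds tag names by upper-casing the class names; both are illustrative assumptions, and the corpus itself may use different tag abbreviations.

```python
# The six entity classes as listed in the dataset description.
ENTITY_CLASSES = ["organization", "location", "person", "product", "event", "date"]


def bio_labels(classes):
    """Build a BIO label list: one 'O' tag plus B-/I- tags per class."""
    labels = ["O"]
    for name in classes:
        tag = name.upper()  # hypothetical tag naming, not the corpus's own
        labels.append(f"B-{tag}")
        labels.append(f"I-{tag}")
    return labels


LABELS = bio_labels(ENTITY_CLASSES)
LABEL_TO_ID = {label: i for i, label in enumerate(LABELS)}
```

With six classes this gives 13 labels in total (two per class plus the outside tag).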

Finlex

Dataset is a collection of legislative and other judicial information from Finland, available in Finnish and Swedish.

Fiskmö

Dataset is a parallel corpus of the Finnish and Swedish languages.

Thursday, 29 June 2023

Text summarization datasets

**Paper:**

https://arxiv.org/abs/1908.08345

**Dataset:**

1) The CNN/DailyMail news highlights dataset: somewhat extractive

- News articles with associated highlights, which give a brief overview of each article

- Input documents: truncated to 512 tokens

- https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail

2) The New York Times Annotated Corpus (NYT): somewhat extractive

- Contains 110,540 articles with abstractive summaries

- Input documents: truncated to 800 tokens

- https://research.google/resources/datasets/ny-times-annotated-corpus/

3) XSum: Abstractive

- 226,711 news articles, each paired with a one-sentence summary answering the question ‘What is this article about?’

- Input documents: truncated to 512 tokens

- https://github.com/google-research-datasets/xsum_hallucination_annotations
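The input limits listed above (512, 800, and 512 tokens) can be applied with a simple truncation step. The sketch below is a minimal illustration that assumes whitespace-separated tokens; the paper's models work on subword tokens, so real counts would differ. The function name and the dataset keys are hypothetical.

```python
def truncate_document(text: str, max_tokens: int = 512) -> str:
    """Keep only the first max_tokens whitespace-separated tokens."""
    tokens = text.split()
    return " ".join(tokens[:max_tokens])


# Limits from the list above, keyed by hypothetical dataset labels.
TOKEN_LIMITS = {"cnn_dailymail": 512, "nyt": 800, "xsum": 512}

# A dummy 1,000-token article, truncated once per dataset limit.
article = "word " * 1000
truncated = {
    name: truncate_document(article, limit)
    for name, limit in TOKEN_LIMITS.items()
}
```

Documents shorter than the limit pass through unchanged, apart from whitespace normalization.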
