Data Scientist Dataset Finder Blog: July 2023

Friday, 7 July 2023

List of Finnish Datasets for NLP Projects

https://metatext.io/datasets-list/finnish-language

FI News Corpus

This dataset is a collection of news headlines and short text summaries, organized by date. The articles were published between 2012 and 2020.

CC100-Finnish

This dataset is one of the 100 monolingual corpora that were processed from the January-December 2018 Common Crawl snapshots using the CC-Net repository. The size of this corpus is 15 GB.

FinChat

This dataset contains conversations with message timestamps, sender IDs, and metadata. It comprises 86 conversations with 3,630 messages and 22,210 words, with an average word length of 5.6 and an average of 14 turns per conversation.
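The totals quoted above also imply a few per-conversation and per-message averages that the description does not state. A minimal sketch of that arithmetic follows; it uses only the three counts given in the description, and the variable names are illustrative.

```python
# Counts quoted in the FinChat description above.
conversations = 86
messages = 3630
words = 22210

# Derived averages (not stated in the description; computed from the counts).
messages_per_conversation = messages / conversations
words_per_message = words / messages
words_per_conversation = words / conversations

print(f"{messages_per_conversation:.1f} messages per conversation")  # 42.2
print(f"{words_per_message:.1f} words per message")                  # 6.1
print(f"{words_per_conversation:.1f} words per conversation")        # 258.3
```

With roughly 42 messages but only 14 turns per conversation, a turn in this dataset presumably spans several consecutive messages from the same sender.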

Finnish News Corpus for Named Entity Recognition

The dataset contains 953 articles (193,742 word tokens) annotated with 6 named entity classes: organization, location, person, product, event, and date.
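For a token-classification model, entity classes like these are commonly expanded into BIO tags (a "begin" and an "inside" tag per class, plus "O" for tokens outside any entity). The sketch below shows that expansion for the six classes listed above. The BIO scheme and the upper-cased tag names are assumptions made for illustration; the corpus itself may use a different scheme or different abbreviations.

```python
# The six entity classes named in the description above.
ENTITY_CLASSES = ["organization", "location", "person", "product", "event", "date"]


def bio_labels(classes):
    """Expand class names into a BIO label list (assumed tagging scheme)."""
    labels = ["O"]  # tokens that are not part of any entity
    for name in classes:
        tag = name.upper()  # hypothetical tag name, not taken from the corpus
        labels.append(f"B-{tag}")  # first token of an entity
        labels.append(f"I-{tag}")  # subsequent tokens of the same entity
    return labels


LABELS = bio_labels(ENTITY_CLASSES)
label2id = {label: i for i, label in enumerate(LABELS)}

print(len(LABELS))  # 13: two tags per class plus "O"
```

A mapping such as label2id is the form most training libraries expect, so the tag names would need to be aligned with whatever labels the downloaded files actually contain.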

Finlex

This dataset is a collection of legislative and other judicial information from Finland, available in Finnish and Swedish.

Fiskmö

This dataset is a parallel corpus of the Finnish and Swedish languages.
