Friday 7 July 2023

List of Finnish Datasets for NLP Projects

 https://metatext.io/datasets-list/finnish-language


FI News Corpus
Dataset is a collection of news headlines and short summaries of text, organized by date. The news articles were published between 2012-2020.

CC100-Finnish
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 15G.

FinChat
Dataset contains conversations with message timestamps, sender’s id, and metadata information. It contains 86 conversations with 3,630 messages, 22,210 words with the average word length of 5.6, and on the average 14 turns per each conversation.

Finnish News Corpus for Named Entity Recognition
Dataset contains 953 articles (193,742 word tokens) with 6 named entity classes: organization, location, person, product, event, and date.

Finlex
Dataset is a collection of legislative and other judicial information of Finland, which is available in Finnish and Swedish.

Fiskmö
Dataset is a parallel corpus of Finnish and Swedish Languages.

FinChat
Dataset contains conversations with message timestamps, sender’s id, and metadata information. It contains 86 conversations with 3,630 messages, 22,210 words with the average word length of 5.6, and on the average 14 turns per each conversation.

FI News Corpus
Dataset is a collection of news headlines and short summaries of text, organized by date. The news articles were published between 2012-2020.

CC100-Finnish
This dataset is one of the 100 corpora of monolingual data that was processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository. The size of this corpus is 15G.

FinChat
Dataset contains conversations with message timestamps, sender’s id, and metadata information. It contains 86 conversations with 3,630 messages, 22,210 words with the average word length of 5.6, and on the average 14 turns per each conversation.

Finnish News Corpus for Named Entity Recognition
Dataset contains 953 articles (193,742 word tokens) with 6 named entity classes: organization, location, person, product, event, and date.

Finlex
Dataset is a collection of legislative and other judicial information of Finland, which is available in Finnish and Swedish.

Fiskmö
Dataset is a parallel corpus of Finnish and Swedish Languages.

FinChat
Dataset contains conversations with message timestamps, sender’s id, and metadata information. It contains 86 conversations with 3,630 messages, 22,210 words with the average word length of 5.6, and on the average 14 turns per each conversation.



Popular Posts