Friday, 7 July 2023

List of Finnish Datasets for NLP Projects

FI News Corpus
This dataset is a collection of news headlines and short text summaries, organized by date. The news articles were published between 2012 and 2020.

This dataset is one of the 100 monolingual corpora that were processed from the January-December 2018 Common Crawl snapshots from the CC-Net repository. The size of this corpus is 15 GB.

The dataset contains conversations with message timestamps, sender IDs, and metadata. It comprises 86 conversations with 3,630 messages and 22,210 words, with an average word length of 5.6 and an average of 14 turns per conversation.
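Summary figures like the ones above (conversation, message, and word counts, plus average word length) can be recomputed once the data is loaded. The following is only a minimal sketch: the in-memory layout (a list of conversations, each a list of message dicts) and the field names `sender_id`, `timestamp`, and `text` are assumptions made for illustration, not the dataset's actual schema, and words are split on whitespace. Since the source does not say how a "turn" is defined, the sketch reports messages per conversation instead.

```python
from statistics import mean


def conversation_stats(conversations):
    """Compute basic corpus statistics.

    `conversations` is assumed to be a list of conversations, where each
    conversation is a list of message dicts with a "text" field
    (hypothetical layout, adjust to the real file format).
    """
    # Flatten all messages across conversations.
    messages = [msg for conv in conversations for msg in conv]
    # Naive whitespace tokenization; a real tokenizer may count differently.
    words = [word for msg in messages for word in msg["text"].split()]
    return {
        "conversations": len(conversations),
        "messages": len(messages),
        "words": len(words),
        "avg_word_length": mean(len(w) for w in words) if words else 0.0,
        "avg_messages_per_conversation": (
            len(messages) / len(conversations) if conversations else 0.0
        ),
    }


# Tiny made-up sample, only to show the expected input shape.
sample = [
    [
        {"sender_id": "a", "timestamp": "2020-01-01T10:00:00", "text": "Hei maailma"},
        {"sender_id": "b", "timestamp": "2020-01-01T10:01:00", "text": "Moi"},
    ],
    [
        {"sender_id": "a", "timestamp": "2020-01-02T09:00:00", "text": "Kiitos paljon"},
    ],
]
stats = conversation_stats(sample)
```

Numbers computed this way may differ slightly from the published ones, depending on how the original authors tokenized the text.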

Finnish News Corpus for Named Entity Recognition
The dataset contains 953 articles (193,742 word tokens) with six named entity classes: organization, location, person, product, event, and date.

This dataset is a collection of Finnish legislation and other judicial information, available in Finnish and Swedish.

This dataset is a parallel corpus of Finnish and Swedish.


