Friday, 7 July 2023

List of Finnish Datasets for NLP Projects

 https://metatext.io/datasets-list/finnish-language


FI News Corpus
Dataset is a collection of news headlines and short text summaries, organized by date. The news articles were published between 2012 and 2020.

CC100-Finnish
This dataset is one of the 100 corpora of monolingual data processed from the January-December 2018 Common Crawl snapshots using the CC-Net repository. The size of this corpus is 15G.

FinChat
Dataset contains conversations with message timestamps, sender IDs, and metadata. It contains 86 conversations with 3,630 messages and 22,210 words, with an average word length of 5.6 and an average of 14 turns per conversation.

Finnish News Corpus for Named Entity Recognition
Dataset contains 953 articles (193,742 word tokens) with 6 named entity classes: organization, location, person, product, event, and date.

Finlex
Dataset is a collection of legislative and other judicial information of Finland, which is available in Finnish and Swedish.

Fiskmö
Dataset is a parallel corpus of the Finnish and Swedish languages.




Thursday, 29 June 2023

text summarisation datasets

 **Paper:**

https://arxiv.org/abs/1908.08345


**Dataset:**

1) the CNN/DailyMail news highlights dataset: somewhat Extractive

- News Articles & Related Highlights: Provides a brief overview of articles

- Input document: limited to 512 tokens

- https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail


2) the New York Times Annotated Corpus (NYT): somewhat Extractive

- Contains 110,540 articles with abstract summaries

- Input document: limited to 800 tokens

- https://research.google/resources/datasets/ny-times-annotated-corpus/


3) XSum: Abstractive

- 226,711 news articles, each paired with a one-sentence summary answering the question ‘What is this article about?’

- Input document: limited to 512 tokens

- https://github.com/google-research-datasets/xsum_hallucination_annotations
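
The input limits noted for the three datasets (512 tokens for CNN/DailyMail and XSum, 800 for NYT) mean each document is cut off before it reaches the model. Here is a minimal sketch of that truncation step. It assumes plain whitespace tokens as a stand-in for whatever subword tokenizer the model actually uses, and `TOKEN_LIMITS` and `truncate_document` are hypothetical names chosen for illustration.

```python
# Input limits copied from the dataset notes; the dictionary keys are
# illustrative labels, not official dataset identifiers.
TOKEN_LIMITS = {"cnn_dailymail": 512, "nyt": 800, "xsum": 512}


def truncate_document(text: str, dataset: str) -> str:
    """Keep only the first N whitespace-separated tokens of a document.

    Assumption: whitespace splitting stands in for the real tokenizer,
    so the resulting lengths are only approximate for a subword model.
    """
    limit = TOKEN_LIMITS[dataset]
    tokens = text.split()
    return " ".join(tokens[:limit])


if __name__ == "__main__":
    # A synthetic 1000-token document, truncated to the XSum limit.
    doc = " ".join(f"w{i}" for i in range(1000))
    print(len(truncate_document(doc, "xsum").split()))
```

Documents shorter than the limit pass through unchanged, since slicing past the end of a list is harmless in Python.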

Sunday, 6 December 2020

Tuesday, 19 May 2020

RecSys Challenge 2015 dataset

The goal:
Given a sequence of click events performed by a user during a typical session on an e-commerce website, the goal is to predict whether the user is going to buy something and, if so, which items they are going to buy. The task can therefore be divided into two sub-goals:


  1. Is the user going to buy items in this session? Yes|No
  2. If yes, what are the items that are going to be bought?
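
The two sub-goals can be turned into per-session training targets. The following is a rough sketch only: it assumes click and buy events are available as `(session_id, item_id)` pairs, which is a simplification and not the actual file layout of the challenge data, and `build_session_targets` is a hypothetical helper name.

```python
from collections import defaultdict


def build_session_targets(clicks, buys):
    """Derive both sub-goal targets for every session that has clicks.

    clicks: iterable of (session_id, item_id) click events (assumed layout)
    buys:   iterable of (session_id, item_id) buy events (assumed layout)
    Returns {session_id: (will_buy, bought_items)}.
    """
    # Collect the set of purchased items per session.
    bought = defaultdict(set)
    for session_id, item_id in buys:
        bought[session_id].add(item_id)

    targets = {}
    for session_id, _item_id in clicks:
        items = bought.get(session_id, set())
        # Sub-goal 1: yes/no label. Sub-goal 2: the items bought.
        targets[session_id] = (bool(items), set(items))
    return targets


if __name__ == "__main__":
    clicks = [(1, "a"), (1, "b"), (2, "c")]
    buys = [(1, "b")]
    print(build_session_targets(clicks, buys))
```

Sessions without any buy event get `(False, set())`, so the second sub-goal only applies to sessions where the first label is true.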

Website:
https://2015.recsyschallenge.com/challenge.html

dataset link:
https://s3-eu-west-1.amazonaws.com/yc-rdata/yoochoose-data.7z

Monday, 20 April 2020

GAN, image segmentation dataset

datasets:

python tools/download-dataset.py facades
- 400 images from CMP Facades dataset. (31MB)
- Pre-trained: BtoA

python tools/download-dataset.py cityscapes
- 2975 images from the Cityscapes training set. (113MB)
- Pre-trained: AtoB BtoA

python tools/download-dataset.py maps
- 1096 training images scraped from Google Maps. (246MB)
- Pre-trained: AtoB BtoA

python tools/download-dataset.py edges2shoes
- 50k training images from UT Zappos50K dataset. Edges are computed by HED edge detector + post-processing. (2.2GB)
- Pre-trained: AtoB

python tools/download-dataset.py edges2handbags
- 137K Amazon Handbag images from iGAN project. Edges are computed by HED edge detector + post-processing. (8.6GB)
- Pre-trained: AtoB

satellite segmentation image dataset list

image segmentation dataset list

Image Segmentation Keras : Implementation of Segnet, FCN, UNet, PSPNet and other models in Keras.


image segmentation dataset
github : https://github.com/divamgupta/image-segmentation-keras
google drive : https://drive.google.com/uc?id=0B0d9ZiqAgFkiOHR1NTJhWVJMNEU&export=download


Sunday, 19 April 2020

COVID-CT

CT images with clinical findings of COVID-19
The COVID-CT-Dataset has 275 CT images containing clinical findings of COVID-19. The images are collected from medRxiv and bioRxiv papers about COVID-19. CTs containing COVID-19 abnormalities are selected by reading the figure captions in the papers. All copyrights of the data belong to medRxiv and bioRxiv.




Covid19 Challenge Dataset

Open research on large Covid-19 imaging datasets
Medical imaging is potentially well suited for Covid-19 diagnosis. This challenge is about connecting the best brains to support doctors with artificial intelligence systems.


Saturday, 18 April 2020

An Open Pan-Cancer Histology Dataset for Nuclei Instance Segmentation and Classification

Semi-automatically generated nuclei instance segmentation and classification dataset with exhaustive nuclei labels across 19 different tissue types. The dataset consists of 481 visual fields, of which 312 are randomly sampled from more than 20K whole slide images at different magnifications, from multiple data sources. In total the dataset contains 205,343 labeled nuclei, each with an instance segmentation mask. Models trained on PanNuke can aid in whole slide image tissue type segmentation and generalise to new tissues. PanNuke demonstrates one of the first successfully semi-automatically generated datasets.


The ORNL Overhead Vehicle Dataset (OOVD)

This data set was created to understand the potential for machine learning, computer vision, and HPC to improve the energy efficiency aspects of traffic control by leveraging GRIDSMART traffic cameras as sensors for adaptive traffic control, with a sensitivity to the fuel consumption characteristics of the traffic in the camera’s visual field. GRIDSMART cameras—an existing, fielded commercial product—sense the presence of vehicles at intersections and replace more conventional sensors (such as inductive loops) to issue calls to traffic control. These cameras, which have horizon-to-horizon view, offer the potential for an improved view of the traffic environment which can be used to generate better control algorithms.


