Data Scientist Dataset Finder Blog

Friday 7 July 2023

List of Finnish Datasets for NLP Projects

https://metatext.io/datasets-list/finnish-language

FI News Corpus

A collection of news headlines and short text summaries, organized by date. The articles were published between 2012 and 2020.

CC100-Finnish

This dataset is one of the 100 monolingual corpora that were processed from the January-December 2018 Common Crawl snapshots using the CC-Net repository. The size of this corpus is 15G.

FinChat

The dataset contains conversations with message timestamps, sender IDs, and metadata. It has 86 conversations with 3,630 messages and 22,210 words, with an average word length of 5.6 and an average of 14 turns per conversation.

Finnish News Corpus for Named Entity Recognition

Dataset contains 953 articles (193,742 word tokens) with 6 named entity classes: organization, location, person, product, event, and date.

Finlex

A collection of legislative and other judicial information from Finland, available in Finnish and Swedish.

Fiskmö

A parallel corpus of Finnish and Swedish.

Thursday 29 June 2023

Text summarization datasets

**Paper:**

https://arxiv.org/abs/1908.08345

**Dataset:**

1) the CNN/DailyMail news highlights dataset: somewhat Extractive

- News Articles & Related Highlights: Provides a brief overview of articles

- Input document: limited to 512 tokens

- https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail

2) the New York Times Annotated Corpus (NYT): somewhat Extractive

- Contains 110,540 articles with abstract summaries

- Input document: limited to 800 tokens

- https://research.google/resources/datasets/ny-times-annotated-corpus/

3) XSum: abstractive

- 226,711 news articles, each with a one-sentence summary answering the question ‘What is this article about?’

- Input document: limited to 512 tokens

- https://github.com/google-research-datasets/xsum_hallucination_annotations
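The input limits listed above mean that longer articles are cut off before they reach the model. A minimal sketch of that truncation follows. It assumes plain whitespace tokens, whereas the linked paper most likely counts subword tokens, so real lengths will differ. `truncate_tokens` and the short dataset keys are hypothetical names chosen for illustration.

```python
def truncate_tokens(text: str, max_tokens: int = 512) -> str:
    """Keep only the first `max_tokens` whitespace-separated tokens."""
    tokens = text.split()
    return " ".join(tokens[:max_tokens])


# Input limits from the list above; the keys are made-up short names.
INPUT_LIMITS = {"cnn_dailymail": 512, "nyt": 800, "xsum": 512}

# Toy article of 1000 identical words, just to show the clipping.
article = "word " * 1000
for name, limit in INPUT_LIMITS.items():
    clipped = truncate_tokens(article, limit)
    print(name, len(clipped.split()))
```

Texts shorter than the limit pass through unchanged, so the helper can be applied to every article without a length check first.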

Sunday 6 December 2020

pix2pix dataset

Go to this page:

https://people.eecs.berkeley.edu/~tinghuiz/projects/pix2pix/datasets/

or download one of the archives listed there directly:

cityscapes.tar.gz	2016-12-02 23:15	99M
edges2handbags.tar.gz	2016-12-02 23:16	8.0G
edges2shoes.tar.gz	2016-12-02 23:17	2.0G
facades.tar.gz	2016-12-02 23:17	29M
maps.tar.gz	2016-12-02 23:17	239M
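The archives in the listing can also be fetched from a script. The sketch below assumes that each file sits directly under the page URL given above (for example `.../datasets/facades.tar.gz`); that layout is an assumption, not something the page confirms. `archive_url` and `download_and_extract` are hypothetical helper names.

```python
import tarfile
import urllib.request
from pathlib import Path

# Assumption: archives are served directly under the listing page's URL.
BASE_URL = "https://people.eecs.berkeley.edu/~tinghuiz/projects/pix2pix/datasets/"
ARCHIVES = ["cityscapes", "edges2handbags", "edges2shoes", "facades", "maps"]


def archive_url(name: str) -> str:
    """Build the assumed download URL for one of the listed archives."""
    if name not in ARCHIVES:
        raise ValueError(f"unknown dataset: {name}")
    return f"{BASE_URL}{name}.tar.gz"


def download_and_extract(name: str, dest: str = "datasets") -> Path:
    """Download `<name>.tar.gz` into `dest` and unpack it there."""
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive_path = dest_dir / f"{name}.tar.gz"
    urllib.request.urlretrieve(archive_url(name), archive_path)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)  # only do this for archives you trust
    return dest_dir


# Example (needs network access; facades is the smallest archive at 29M):
# download_and_extract("facades")
```

Starting with `facades` keeps the first download small; `edges2handbags` is 8.0G according to the listing.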

Tuesday 19 May 2020

RecSys Challenge 2015 dataset

The goal:
Given a sequence of click events performed by a user during a typical session on an e-commerce website, the goal is to predict whether the user is going to buy something and, if so, which items. The task can therefore be divided into two sub-goals:

Is the user going to buy items in this session? Yes|No
If yes, what are the items that are going to be bought?
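The two sub-goals can be turned into training labels by joining click sessions with purchases. The sketch below uses made-up `(session_id, item_id)` tuples rather than the real file layout, and `label_sessions` is a hypothetical helper name.

```python
from collections import defaultdict


def label_sessions(clicks, buys):
    """Map each clicked session to the sorted list of items bought in it.

    An empty list answers sub-goal 1 with "No"; a non-empty list
    answers it with "Yes" and doubles as the target for sub-goal 2.
    """
    bought = defaultdict(set)
    for session_id, item_id in buys:
        bought[session_id].add(item_id)

    labels = {}
    for session_id, _item_id in clicks:
        labels[session_id] = sorted(bought.get(session_id, ()))
    return labels


# Toy data (invented for illustration): (session_id, item_id) pairs.
clicks = [(1, 101), (1, 102), (2, 201), (3, 301), (3, 302)]
buys = [(1, 102), (3, 301), (3, 302)]

labels = label_sessions(clicks, buys)
for session_id, items in labels.items():
    print(session_id, "Yes" if items else "No", items)
```

Here session 2 has clicks but no purchase, so it is a negative example for the first sub-goal and contributes nothing to the second.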

Website:
https://2015.recsyschallenge.com/challenge.html

dataset link:
https://s3-eu-west-1.amazonaws.com/yc-rdata/yoochoose-data.7z

Sunday 10 May 2020

Network (GML format graph) Dataset

The most interesting dataset there is the football network.

refer to this page: http://www-personal.umich.edu/~mejn/netdata/
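For readers who have not seen GML before, here is a rough sketch of what such a file looks like and how its nodes and edges can be read. The sample graph and team names are invented, and the regex-based `parse_simple_gml` is a hypothetical helper that only handles this simple flat layout (`id` then `label` in nodes, `source` then `target` in edges). For real files a full parser, such as `read_gml` in NetworkX, is a safer choice.

```python
import re

# Invented three-node sample in GML syntax (not taken from the real data).
SAMPLE_GML = '''
graph
[
  directed 0
  node
  [
    id 0
    label "TeamA"
  ]
  node
  [
    id 1
    label "TeamB"
  ]
  node
  [
    id 2
    label "TeamC"
  ]
  edge
  [
    source 0
    target 1
  ]
  edge
  [
    source 1
    target 2
  ]
]
'''

NODE_RE = re.compile(r'node\s*\[\s*id\s+(\d+)\s+label\s+"([^"]*)"')
EDGE_RE = re.compile(r'edge\s*\[\s*source\s+(\d+)\s+target\s+(\d+)')


def parse_simple_gml(text):
    """Return ({node_id: label}, [(source, target), ...]) for simple GML."""
    nodes = {int(i): label for i, label in NODE_RE.findall(text)}
    edges = [(int(s), int(t)) for s, t in EDGE_RE.findall(text)]
    return nodes, edges


nodes, edges = parse_simple_gml(SAMPLE_GML)

# Count how many edges touch each node (undirected degree).
degree = {node_id: 0 for node_id in nodes}
for source, target in edges:
    degree[source] += 1
    degree[target] += 1

print(nodes)
print(degree)
```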

Tuesday 21 April 2020

Naver sentiment movie corpus v1.0 (Korean)

https://github.com/e9t/nsmc

Monday 20 April 2020

GAN, image segmentation dataset

command	description
python tools/download-dataset.py facades	400 images from the CMP Facades dataset (31MB). Pre-trained: BtoA
python tools/download-dataset.py cityscapes	2975 images from the Cityscapes training set (113M). Pre-trained: AtoB, BtoA
python tools/download-dataset.py maps	1096 training images scraped from Google Maps (246M). Pre-trained: AtoB, BtoA
python tools/download-dataset.py edges2shoes	50k training images from the UT Zappos50K dataset; edges are computed by the HED edge detector plus post-processing (2.2GB). Pre-trained: AtoB
python tools/download-dataset.py edges2handbags	137K Amazon handbag images from the iGAN project; edges are computed by the HED edge detector plus post-processing (8.6GB). Pre-trained: AtoB

satellite segmentation image dataset list

image segmentation dataset list

Image Segmentation Keras : Implementation of Segnet, FCN, UNet, PSPNet and other models in Keras.

image segmentation dataset
github : https://github.com/divamgupta/image-segmentation-keras
google drive : https://drive.google.com/uc?id=0B0d9ZiqAgFkiOHR1NTJhWVJMNEU&export=download

Sunday 19 April 2020

COVID-CT

CT images with clinical findings of COVID-19

The COVID-CT-Dataset has 275 CT images containing clinical findings of COVID-19. The images are collected from medRxiv and bioRxiv papers about COVID-19. CTs containing COVID-19 abnormalities are selected by reading the figure captions in the papers. All copyrights of the data belong to medRxiv and bioRxiv.

🏡 GitHub : https://www.visualdata.io/?fbclid=IwAR2fwIhpd27Fvk7uVQ4FVroV52Fmy7u2m-7hcAT1-7TdWa1-6PmWe-NIXaM

Covid19 Challenge Dataset

Open research on large Covid-19 imaging datasets

Medical imaging is potentially well suited for Covid-19 diagnosis. This challenge is about connecting the best brains to support doctors with artificial intelligence systems.

🏡 : https://www.covid19challenge.eu

Saturday 18 April 2020

An Open Pan-Cancer Histology Dataset for Nuclei Instance Segmentation and Classification

A semi-automatically generated nuclei instance segmentation and classification dataset with exhaustive nuclei labels across 19 different tissue types. The dataset consists of 481 visual fields, of which 312 are randomly sampled from more than 20K whole slide images at different magnifications, from multiple data sources. In total the dataset contains 205,343 labeled nuclei, each with an instance segmentation mask. Models trained on PanNuke can aid in whole slide image tissue type segmentation, and generalise to new tissues. PanNuke demonstrates one of the first successfully semi-automatically generated datasets.

🏡: https://jgamper.github.io/PanNukeDataset/

The ORNL Overhead Vehicle Dataset (OOVD)

This data set was created to understand the potential for machine learning, computer vision, and HPC to improve the energy efficiency aspects of traffic control by leveraging GRIDSMART traffic cameras as sensors for adaptive traffic control, with a sensitivity to the fuel consumption characteristics of the traffic in the camera’s visual field. GRIDSMART cameras—an existing, fielded commercial product—sense the presence of vehicles at intersections and replace more conventional sensors (such as inductive loops) to issue calls to traffic control. These cameras, which have horizon-to-horizon view, offer the potential for an improved view of the traffic environment which can be used to generate better control algorithms.

link : https://www.ornl.gov/project/ornl-overhead-vehicle-dataset-oovd
