Data Scientist Dataset Finder Blog: Datasets for English Named Entity Recognition

Thursday, 16 April 2020

Datasets for English Named Entity Recognition

Annotated Corpus for Named Entity Recognition: A corpus for entity classification, enriched with commonly used features derived by applying natural language processing to the data set.

i2b2 Challenges: Clinical datasets for named entity recognition, created by the Informatics for Integrating Biology & the Bedside (i2b2) center.

CoNLL 2003: Dataset that contains 1,393 English news articles with annotated entities LOC (location), ORG (organization), PER (person) and MISC (miscellaneous).

NLPBA 2004: Medical data tagged with protein/DNA/RNA/cell line/cell type (2,404 MEDLINE abstracts).

Resume Entities for NER: Document annotation dataset to be used to perform NER on resumes from indeed.com.

Enron Emails: Over 500,000 email messages tagged with names, dates and times.

MIT Movie Corpus: A semantically tagged training and test corpus in BIO format. The eng corpus consists of simple queries, and the trivia10k13 corpus consists of more complex queries.

Annotated GMB Corpus: An annotated corpus built on the GMB (Groningen Meaning Bank) corpus for entity classification, enriched with commonly used features derived by applying natural language processing to the data set.

Best Buy E-Commerce NER Dataset: A dataset containing Best Buy search queries labeled with entities such as Brand, Model Name, and Category Name.

WNUT 17 Emerging Entities Dataset: Text from YouTube, Stack Overflow, Twitter and Reddit comments filtered to prefer text that is likely to contain named entities.
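
Some of the corpora above, such as the MIT Movie Corpus, are distributed in BIO format, where each token carries a tag: B-TYPE for the first token of an entity, I-TYPE for a continuation, and O for tokens outside any entity. As a rough illustration of how such tags can be turned back into entities, here is a minimal sketch in Python. The function name bio_to_spans, the sample query, and the tag names DIRECTOR and YEAR are assumptions made for the example, not the file layout or label set of any particular dataset, so check each corpus's own documentation before relying on it.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs.

    Hypothetical helper for illustration; not part of any dataset's tooling.
    """
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity,
            # closes it. Stray I- tags are simply ignored in this sketch.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

    # Close an entity that runs to the end of the sentence.
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Made-up example query with illustrative tags.
tokens = ["films", "directed", "by", "steven", "spielberg", "from", "1993"]
tags = ["O", "O", "O", "B-DIRECTOR", "I-DIRECTOR", "O", "B-YEAR"]
print(bio_to_spans(tokens, tags))
# [('steven spielberg', 'DIRECTOR'), ('1993', 'YEAR')]
```

Datasets differ in column order, separators, and tagging variants (for example IOB1 versus IOB2), so the file-reading step is deliberately left out here.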
