Thursday 16 April 2020

Datasets for English Named Entity Recognition

Annotated Corpus for Named Entity Recognition: A corpus for entity classification, enriched with commonly used features obtained by applying natural language processing to the data set.

i2b2 Challenges: Clinical datasets for named entity recognition, created by the Informatics for Integrating Biology & the Bedside (i2b2) center.

CoNLL 2003: Dataset that contains 1,393 English news articles with annotated entities LOC (location), ORG (organization), PER (person) and MISC (miscellaneous).

NLPBA 2004: Medical data tagged with protein/DNA/RNA/cell line/cell type (2,404 MEDLINE abstracts).

Resume Entities for NER: An annotated document dataset for performing NER on resumes from indeed.com.

Enron Emails: Over 500,000 email messages tagged with names, dates and times.

MIT Movie Corpus: A semantically tagged training and test corpus in BIO format. The eng corpus contains simple queries, and the trivia10k13 corpus contains more complex queries.

Annotated GMB Corpus: An annotated corpus built on the GMB (Groningen Meaning Bank) corpus for entity classification, enriched with commonly used features obtained by applying natural language processing to the data set.

Best Buy E-Commerce NER Dataset: A dataset containing Best Buy search queries labeled with entities such as Brand, Model Name, Category Name, etc.

WNUT 17 Emerging Entities Dataset: Text from YouTube, Stack Overflow, Twitter and Reddit comments filtered to prefer text that is likely to contain named entities.
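
Several of the corpora above, such as the MIT Movie Corpus, are distributed in BIO format, where each token carries a tag: B-X opens an entity of type X, I-X continues it, and O marks a token outside any entity. As a rough illustration, here is a minimal Python sketch that groups BIO tags back into entity spans. The function name bio_to_spans, the sample query, and the labels DIRECTOR and YEAR are made up for this example and are not taken from any of the datasets; actual file layouts and label sets vary by corpus.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (label, text) entity spans.

    Assumption for this sketch: an I- tag that does not continue the
    current entity is treated leniently as the start of a new entity.
    """
    spans = []
    current_tokens, current_label = [], None

    def flush():
        # Emit the entity collected so far, if there is one.
        if current_tokens:
            spans.append((current_label, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # B- always opens a new entity.
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # I- with a matching label extends the open entity.
            current_tokens.append(token)
        else:
            # "O" or a non-matching I- closes the open entity.
            flush()
            if tag.startswith("I-"):
                current_tokens, current_label = [token], tag[2:]
            else:
                current_tokens, current_label = [], None

    flush()
    return spans


# Hypothetical movie-style query with made-up labels.
tokens = ["show", "me", "films", "by", "steven", "spielberg", "from", "1993"]
tags = ["O", "O", "O", "O", "B-DIRECTOR", "I-DIRECTOR", "O", "B-YEAR"]
print(bio_to_spans(tokens, tags))
```

For evaluation on real data, an established scoring tool is usually a safer choice than a hand-rolled decoder like this one, since tagging schemes differ in how they handle edge cases.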
