Datasets for English Named Entity Recognition
Annotated Corpus for Named Entity Recognition: A corpus for entity classification, with additional features produced by applying natural language processing to the data set.
i2b2 Challenges: Clinical datasets created for named entity recognition by the Informatics for Integrating Biology & the Bedside (i2b2) center.
CoNLL 2003: A dataset of 1,393 English news articles annotated with the entity types LOC (location), ORG (organization), PER (person), and MISC (miscellaneous).
NLPBA 2004: Medical data tagged with protein/DNA/RNA/cell line/cell type (2,404 MEDLINE abstracts).
Resume Entities for NER: A document annotation dataset for performing NER on resumes from indeed.com.
Enron Emails: Over 500,000 email messages tagged with names, dates and times.
MIT Movie Corpus: A semantically tagged training and test corpus in BIO format. The eng corpus consists of simple queries, and the trivia10k13 corpus consists of more complex queries.
Annotated GMB Corpus: An annotated corpus built from the Groningen Meaning Bank (GMB) for entity classification, with additional features produced by applying natural language processing to the data set.
Best Buy E-Commerce NER Dataset: A dataset containing Best Buy search queries labeled with entities such as Brand, Model Name, Category Name, etc.
WNUT 17 Emerging Entities Dataset: Text from YouTube, Stack Overflow, Twitter and Reddit comments filtered to prefer text that is likely to contain named entities.
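The BIO format mentioned for the MIT Movie Corpus marks the first token of an entity with a B- prefix, continuation tokens with an I- prefix, and everything else with O. As a rough illustration of how such tags can be turned back into entities, here is a minimal Python sketch. The function name bio_to_spans, the sample sentence, and the choice to drop a stray I- tag that does not continue an open entity are all assumptions made for this example; the PER and LOC labels are borrowed from the CoNLL 2003 tag set listed above, and real datasets may need extra handling for their own file layouts.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, entity_text) pairs.

    Hypothetical helper for illustration; not part of any dataset's tooling.
    """
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the entity that is currently open, if any.
        nonlocal current_type, current_tokens
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity.
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity
            # (assumption: such stray tokens are dropped).
            flush()
    flush()
    return spans


# Made-up sample sentence, not taken from any of the datasets above.
tokens = ["Angela", "Merkel", "visited", "Paris", "on", "Monday"]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "O"]
print(bio_to_spans(tokens, tags))
# prints [('PER', 'Angela Merkel'), ('LOC', 'Paris')]
```

Some corpora use related schemes (for example IOB1 or BILOU) rather than this exact convention, so it is worth checking a dataset's documentation before reusing the sketch.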