Named-entity recognition

Named-entity recognition (NER) (also known as entity identification and entity extraction) is a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Most research on NER systems has been structured as taking an unannotated block of text, such as this one:

Jim bought 300 shares of Acme Corp. in 2006.

And producing an annotated block of text that highlights where the named entities are, such as this one:

<ENAMEX TYPE="PERSON">Jim</ENAMEX> bought <NUMEX TYPE="QUANTITY">300</NUMEX> shares of <ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in <TIMEX TYPE="DATE">2006</TIMEX>.

In this example, the annotations are marked using the XML elements ENAMEX, NUMEX, and TIMEX, following the format developed for the Message Understanding Conference in the 1990s.
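As a rough illustration of how such markup can be consumed, the Python sketch below pulls the annotated spans out of a string and recovers the plain text. The function names and the regular expression are assumptions made for this example; they are not part of the MUC specification, and the pattern only handles the flat, non-nested elements shown above.

```python
import re

# Matches one MUC-style element, e.g. <ENAMEX TYPE="PERSON">Jim</ENAMEX>.
# The backreference \1 requires the closing tag to match the opening tag.
ENTITY_RE = re.compile(r'<(ENAMEX|NUMEX|TIMEX) TYPE="([^"]+)">(.*?)</\1>')


def extract_entities(annotated):
    """Return (surface text, element name, TYPE value) for each annotation."""
    return [(m.group(3).strip(), m.group(1), m.group(2))
            for m in ENTITY_RE.finditer(annotated)]


def strip_annotations(annotated):
    """Recover the plain text by replacing each element with its content."""
    return ENTITY_RE.sub(lambda m: m.group(3), annotated)


annotated = ('<ENAMEX TYPE="PERSON">Jim</ENAMEX> bought '
             '<NUMEX TYPE="QUANTITY">300</NUMEX> shares of '
             '<ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in '
             '<TIMEX TYPE="DATE">2006</TIMEX>.')

entities = extract_entities(annotated)
plain = strip_annotations(annotated)

for surface, element, entity_type in entities:
    print(f"{element:7} {entity_type:13} {surface}")
print(plain)
```

A regular expression is enough here because the elements do not nest; markup with nested or malformed tags would call for a real XML parser.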

State-of-the-art NER systems for English produce near-human performance. For example, the best system entering MUC-7 achieved an F-measure of 93.39%, while human annotators scored 97.60% and 96.95%.^[1]^[2]
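The F-measure combines precision and recall as their harmonic mean. The Python sketch below computes it under a strict exact-match assumption, where a predicted entity counts as correct only if its boundaries and type both agree with the reference annotation. This is a simplification for illustration and is not the official MUC scoring procedure; the spans are hypothetical character offsets into the example sentence above.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match scoring over sets of (start, end, type) entity spans."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical reference annotation for
# "Jim bought 300 shares of Acme Corp. in 2006."
gold = {
    (0, 3, "PERSON"),
    (11, 14, "QUANTITY"),
    (25, 35, "ORGANIZATION"),
    (39, 43, "DATE"),
}

# Hypothetical system output: misses the quantity and truncates the
# organization to "Acme", so only two of its three spans are correct.
predicted = {
    (0, 3, "PERSON"),
    (25, 29, "ORGANIZATION"),
    (39, 43, "DATE"),
}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With two correct spans out of three predicted and four expected, precision is 2/3, recall is 1/2, and the F-measure is 4/7.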

Approaches

NER systems have been created that use linguistic grammar-based techniques as well as statistical models. Hand-crafted grammar-based systems typically obtain better precision, but at the cost of lower recall and months of work by experienced computational linguists. Statistical NER systems typically require a large amount of manually annotated training data.
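The precision and recall trade-off of hand-crafted rules can be seen even in a toy tagger. The Python sketch below uses two hypothetical patterns invented for this example: one for capitalized words followed by a corporate suffix and one for four-digit years. Real grammar-based systems are far more elaborate.

```python
import re

# Hypothetical hand-written rules; each pairs an entity type with a pattern.
RULES = [
    # One or more capitalized words followed by a corporate suffix.
    ("ORGANIZATION", re.compile(r"\b(?:[A-Z][a-z]+ )+(?:Corp\.|Inc\.|Ltd\.)")),
    # A four-digit year in the 1900s or 2000s.
    ("DATE", re.compile(r"\b(?:19|20)\d{2}\b")),
]


def tag(text):
    """Return (start, end, type, surface) for every rule match, sorted by position."""
    spans = []
    for label, pattern in RULES:
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), label, m.group()))
    return sorted(spans)


found = tag("Jim bought 300 shares of Acme Corp. in 2006.")
for start, end, label, surface in found:
    print(start, end, label, surface)
```

Both matches are correct, but "Jim" and "300" go untagged because no rule covers them: the rules that exist are precise, and recall grows only as more rules are written by hand.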

Problem domains

Research indicates that even state-of-the-art NER systems are brittle, meaning that NER systems developed for one domain do not typically perform well on other domains.^[3] Considerable effort is involved in tuning NER systems to perform well in a new domain; this is true for both rule-based and trainable statistical systems.

Early work on NER systems in the 1990s was aimed primarily at extraction from journalistic articles. Attention then turned to processing of military dispatches and reports. Later stages of the Automatic Content Extraction (ACE) evaluation also included several types of informal text, such as weblogs and transcripts of conversational telephone speech. Since about 1998, there has been a great deal of interest in entity identification in the molecular biology, bioinformatics, and medical natural language processing communities. The most common entities of interest in that domain have been the names of genes and gene products.

Named entity types

In the expression named entity, the word named restricts the task to those entities for which one or many rigid designators, as defined by Kripke, stand for the referent. For instance, the automotive company created by Henry Ford in 1903 is referred to as Ford or Ford Motor Company. Rigid designators include proper names as well as certain natural kind terms like biological species and substances.

There is general agreement to include temporal expressions and some numerical expressions (e.g., money and percentages) as instances of named entities in the context of the NER task. While some instances of these types are good examples of rigid designators (e.g., the year 2001), there are also many invalid ones (e.g., I take my vacations in “June”). In the first case, the year 2001 refers to the 2001st year of the Gregorian calendar. In the second case, the month June may refer to the month of an undefined year (past June, next June, June 2020, etc.). It is arguable that the named entity definition is loosened in such cases for practical reasons. The definition of the term named entity is therefore not strict and often has to be explained in the context in which it is used.^[4]

At least two hierarchies of named entity types have been proposed in the literature. The BBN categories, proposed in 2002, are used for question answering and consist of 29 types and 64 subtypes.^[5] Sekine's extended hierarchy, proposed in 2002, is made up of 200 subtypes.^[6]

Current challenges and research

Despite the high F1 numbers reported on the MUC-7 dataset, the problem of named-entity recognition is far from being solved. The main efforts are directed at reducing the annotation labor by employing semi-supervised learning,^[7]^[8]^[9] achieving robust performance across domains,^[10]^[11] and scaling up to fine-grained entity types.^[12]^[13] In recent years, many projects have turned to crowdsourcing, which is a promising way to obtain high-quality aggregate human judgments for supervised and semi-supervised machine learning approaches to NER.^[14]

A recently emerging task of identifying "important expressions" in text and cross-linking them to Wikipedia^[15]^[16]^[17] can be seen as an instance of extremely fine-grained named entity recognition, where the types are the actual Wikipedia pages describing the (potentially ambiguous) concepts. Below is an example output of a Wikification system:

<ENTITY url="http://en.wikipedia.org/wiki/Michael_I._Jordan"> Michael Jordan </ENTITY> is a professor at <ENTITY url="http://en.wikipedia.org/wiki/University_of_California,_Berkeley"> Berkeley </ENTITY>
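Output in this form can be read back into (mention, page) pairs with a standard XML parser. The Python sketch below is one way to do so, assuming the fragment is well-formed once wrapped in a root element; the <doc> wrapper and the function name are hypothetical and chosen only for this example.

```python
import xml.etree.ElementTree as ET

output = (
    '<ENTITY url="http://en.wikipedia.org/wiki/Michael_I._Jordan">'
    ' Michael Jordan </ENTITY>'
    ' is a professor at '
    '<ENTITY url="http://en.wikipedia.org/wiki/University_of_California,_Berkeley">'
    ' Berkeley </ENTITY>'
)


def extract_links(fragment):
    """Return (mention, url) pairs for every ENTITY element in the fragment."""
    # Wrap in a hypothetical <doc> root so the fragment parses as one document.
    root = ET.fromstring("<doc>" + fragment + "</doc>")
    return [(e.text.strip(), e.get("url")) for e in root.iter("ENTITY")]


links = extract_links(output)
for mention, url in links:
    print(f"{mention} -> {url}")
```

The example shows why the task is fine-grained: the mention "Michael Jordan" is linked to the page for the Berkeley professor rather than to any other person of that name. Text containing unescaped characters such as & or < would need escaping before parsing.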

Software

Apache OpenNLP includes rule-based and statistical named-entity recognition
GATE supports NER across many languages and domains out of the box, usable via a graphical interface and a Java API
Stanford NLP Tools includes a Java-based CRF named entity recognition tool

References

↑ Elaine Marsh, Dennis Perzanowski, "MUC-7 Evaluation of IE Technology: Overview of Results", 29 April 1998.
↑ MUC-07 Proceedings (Named Entity Tasks).
↑ Poibeau, Thierry and Kosseim, L. (2001) Proper Name Extraction from Non-Journalistic Texts. Proc. Computational Linguistics in the Netherlands.
↑ Named Entity Definition. Webknox.com. Retrieved on 2013-07-21.
↑ BBN's Proposed Answer Categories for Question Answering. Ldc.upenn.edu. Retrieved on 2013-07-21.
↑ Sekine's Extended Named Entity Hierarchy. Nlp.cs.nyu.edu. Retrieved on 2013-07-21.
↑ Lin, Dekang; Wu, Xiaoyun (2009). "Phrase clustering for discriminative learning". Annual Meeting of the ACL and IJCNLP. pp. 1030–1038.
↑ Word representations: A simple and general method for semi-supervised learning.
↑ Phrase Clustering for Discriminative Learning.
↑ Design Challenges and Misconceptions in Named Entity Recognition.
↑ Frustratingly Easy Domain Adaptation.
↑ Fine-Grained Named Entity Recognition Using Conditional Random Fields for Question Answering.
↑ Sekine's Extended Named Entity Hierarchy. Nlp.cs.nyu.edu. Retrieved on 2013-07-21.
↑ Web 2.0-based crowdsourcing for high-quality gold standard development in clinical Natural Language Processing.
↑ Linking Documents to Encyclopedic Knowledge.
↑ Learning to link with Wikipedia.
↑ Local and Global Algorithms for Disambiguation to Wikipedia.

External links

SMILE NER - free online NER service, supporting more than 250 categories.
Named entity recognition for Arabic – Issues and challenges in morphologically rich languages such as Arabic
Farhad Abedini, Fariborz Mahmoudi, and Amir Hossein Jadidinejad, "From Text to Knowledge: Semantic Entity Extraction using YAGO Ontology," International Journal of Machine Learning and Computing vol. 1, no. 2, pp. 113-119, 2011.
Farhad Abedini, Fariborz Mahmoudi, and Seyedeh Masoumeh Mirhashem, "Using Semantic Entity Extraction Method for a New Application," International Journal of Machine Learning and Computing vol. 2, no. 2, pp. 178-182, 2012.
CoNLL Language-independent NER shared tasks (2002) and (2003): NER data sets and methods for Spanish, Dutch, English and German

This article is issued from Wikipedia. The text is available under the Creative Commons Attribution/Share Alike; additional terms may apply for the media files.