InfoXtract: A customizable intermediate level information extraction engine

Bibliographic Details
Published in	Natural Language Engineering, Vol. 14, no. 1, pp. 33-69
Main Authors	SRIHARI, ROHINI K.; LI, WEI; CORNELL, THOMAS; NIU, CHENG
Format	Journal Article
Language	English
Published	Cambridge, UK: Cambridge University Press, 01.01.2008
Subjects	Anaphora; Computer Applications; Computer Generated Language Analysis; Computer Software; Data mining; Information Retrieval; Machine Learning; Natural Language Processing; Beijing China; United States > US; China
ISSN	1351-3249 1469-8110
DOI	10.1017/S1351324906004116

More Information
Summary:	Information Extraction (IE) systems help analysts assimilate information from electronic documents. This paper focuses on IE tasks designed to support information discovery applications. Because information discovery implies examining large volumes of heterogeneous documents for situations that cannot be anticipated a priori, these applications require IE systems with breadth as well as depth. This implies the need for a domain-independent IE system that can easily be customized for specific domains: end users must be given tools to customize the system on their own. It also implies the need to define new intermediate-level IE tasks that are richer than the subject-verb-object (SVO) triples produced by shallow systems, yet not as complex as the domain-specific scenarios defined by the Message Understanding Conference (MUC). This paper describes InfoXtract, a robust, scalable, intermediate-level IE engine that can be ported to various domains. It describes new IE tasks, such as synthesis of entity profiles and extraction of concept-based general events, which represent realistic near-term goals focused on deriving useful, actionable information. Entity profiles consolidate information about an entity (a person, organization, location, etc.) within a document and across documents into a single template; this takes into account aliases and anaphoric references as well as key relationships and events pertaining to that entity. Concept-based events attempt to normalize information such as time expressions (e.g., yesterday) as well as ambiguous location references (e.g., Buffalo). These new tasks facilitate the correlation of output from an IE engine with structured data to enable text mining. InfoXtract's hybrid architecture, which combines grammatical processing and machine learning, is described in detail. Benchmarking results for the core engine and applications utilizing the engine are presented.
Bibliography:	istex:D103455880874F9161AA9C94FD6B90132422F762 ArticleID:00411 ark:/67375/6GQ-K10VJL6S-1 PII:S1351324906004116
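
The summary mentions normalizing time expressions such as "yesterday". As a rough illustration of what that kind of normalization can involve, the sketch below resolves a few relative day expressions against a document date. It is not InfoXtract's method, which the record does not describe at this level; the lookup table `RELATIVE_DAY_OFFSETS` and the function `normalize_time_expression` are hypothetical names introduced only for this example.

```python
from datetime import date, timedelta
from typing import Optional

# Hypothetical lookup table (an assumption for illustration):
# relative expression -> offset in days from the document's date.
RELATIVE_DAY_OFFSETS = {
    "today": 0,
    "yesterday": -1,
    "tomorrow": 1,
}


def normalize_time_expression(expression: str, document_date: date) -> Optional[str]:
    """Return an ISO 8601 date string for a known relative expression, else None."""
    offset = RELATIVE_DAY_OFFSETS.get(expression.strip().lower())
    if offset is None:
        # Unknown expressions are left unresolved rather than guessed.
        return None
    return (document_date + timedelta(days=offset)).isoformat()


if __name__ == "__main__":
    # "yesterday" in a document dated 2008-01-01 resolves to 2007-12-31.
    print(normalize_time_expression("yesterday", date(2008, 1, 1)))
```

Anchoring the expression to the document date is the key design choice here: the same word "yesterday" maps to different calendar dates in different documents, so a normalized value is what allows extracted events to be correlated with structured data.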