InfoXtract: A customizable intermediate level information extraction engine

Bibliographic Details
Published in Natural language engineering Vol. 14; no. 1; pp. 33-69
Main Authors SRIHARI, ROHINI K., LI, WEI, CORNELL, THOMAS, NIU, CHENG
Format Journal Article
Language English
Published Cambridge, UK: Cambridge University Press, 01.01.2008
ISSN 1351-3249, 1469-8110
DOI 10.1017/S1351324906004116

Summary: Information Extraction (IE) systems help analysts assimilate information from electronic documents. This paper focuses on IE tasks designed to support information discovery applications. Because information discovery involves examining large volumes of heterogeneous documents for situations that cannot be anticipated a priori, these applications require IE systems to have breadth as well as depth. This implies the need for a domain-independent IE system that can easily be customized for specific domains: end users must be given tools to customize the system on their own. It also implies the need to define new intermediate-level IE tasks that are richer than the subject-verb-object (SVO) triples produced by shallow systems, yet not as complex as the domain-specific scenarios defined by the Message Understanding Conference (MUC). This paper describes InfoXtract, a robust, scalable, intermediate-level IE engine that can be ported to various domains. It describes new IE tasks, such as the synthesis of entity profiles and the extraction of concept-based general events, which represent realistic near-term goals focused on deriving useful, actionable information. Entity profiles consolidate information about an entity (a person, organization, location, etc.) within a document and across documents into a single template; this takes into account aliases and anaphoric references as well as key relationships and events pertaining to that entity. Concept-based events attempt to normalize information such as time expressions (e.g., yesterday) as well as ambiguous location references (e.g., Buffalo). These new tasks facilitate the correlation of output from an IE engine with structured data to enable text mining. InfoXtract's hybrid architecture, which combines grammatical processing and machine learning, is described in detail. Benchmarking results for the core engine and for applications that use the engine are presented.
Bibliography:istex:D103455880874F9161AA9C94FD6B90132422F762
ArticleID:00411
ark:/67375/6GQ-K10VJL6S-1
PII:S1351324906004116