Classifier-based acronym extraction for business documents

Bibliographic Details
Published in: Knowledge and Information Systems, Vol. 29, No. 2, pp. 305-334
Main Authors: Ménard, Pierre André; Ratté, Sylvie
Format: Journal Article
Language: English
Published: London: Springer-Verlag, 01.11.2011
ISSN: 0219-1377, 0219-3116
DOI: 10.1007/s10115-010-0341-9

Summary: Acronym extraction for business documents has been neglected in favor of acronym extraction for biomedical documents. Although there are overlapping challenges, the semi-structured and non-predictive nature of business documents hinders the effectiveness of the extraction methods used on biomedical documents, causing them to fall short of the expected performance. A classifier-based extraction subsystem is presented as part of a wider project, Binocle, for the analysis of French business corpora. Explicit and implicit acronym presentation cases are identified using textual and syntactical hints. Among the 7 features extracted from each candidate instance, we introduce "similarity" features, which compare a candidate's characteristics with average length-related values calculated from a generic acronym repository. Commonly used rules for evaluating the candidate (matching first letters, ordered instances, etc.) are scored and aggregated in a single composite feature that permits a flexible classification. One hundred and thirty-eight French business documents from 14 public organizations were used for the training and evaluation corpora, yielding a recall of 90.9% at a precision level of 89.1% for a search space size of 3 sentences.
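The abstract's idea of scoring common candidate-evaluation rules and aggregating them into one composite feature can be illustrated with a minimal sketch. The rules below (first-letter coverage, in-order matching of word initials, length plausibility) and the weights are illustrative assumptions, not the paper's actual feature set:

```python
def rule_scores(acronym, expansion):
    """Score an (acronym, expansion) candidate pair with simple heuristic rules.

    Returns a list of per-rule scores in [0, 1]. These rules are a guess at
    the kind of checks the abstract mentions, not the published feature set.
    """
    words = expansion.lower().split()
    letters = acronym.lower()
    initials = [w[0] for w in words]

    # Rule 1: fraction of acronym letters that appear among the word initials.
    first_letter = sum(c in initials for c in letters) / len(letters)

    # Rule 2: do the acronym letters match word initials in order?
    it = iter(initials)
    ordered = 1.0 if all(c in it for c in letters) else 0.0

    # Rule 3: length plausibility -- the acronym should not have more
    # letters than the expansion has words.
    length_ok = 1.0 if len(letters) <= len(words) else 0.0

    return [first_letter, ordered, length_ok]


def composite(scores, weights=None):
    """Aggregate per-rule scores into a single composite feature value."""
    weights = weights or [1.0] * len(scores)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
```

A good pair such as `("ETS", "Ecole de technologie superieure")` scores 1.0 on all three rules, while an unrelated pair scores much lower; a classifier can then treat the composite value as one soft feature rather than applying each rule as a hard filter.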