Expanding a multilingual media monitoring and information extraction tool to a new language: Swahili

Bibliographic Details
Published in: Language Resources and Evaluation, Vol. 45, no. 3, pp. 311-330
Main Authors: Steinberger, Ralf; Ombuya, Sylvia; Kabadjov, Mijail; Pouliquen, Bruno; Rocca, Leo Della; Belyaeva, Jenya; de Paola, Monica; Ignat, Camelia; van der Goot, Erik
Format: Journal Article
Language: English
Published: Dordrecht, Springer, 01.09.2011
Springer Netherlands
Springer Nature B.V
ISSN: 1574-020X
1572-8412
1574-0218
DOI: 10.1007/s10579-011-9155-y

Summary: The Europe Media Monitor (EMM) family of applications is a set of multilingual tools that currently gather, cluster and classify news in fifty languages and that extract named entities and quotations (reported speech) from twenty languages. In this paper, we describe the recent effort of adding the African Bantu language Swahili to EMM. EMM is designed in an entirely modular way, so that a new language can be plugged in by providing its language-specific resources. We thus describe the type of language-specific resources needed, the effort involved, and ways of bootstrapping the generation of these resources in order to keep the effort of adding a new language to a minimum. The text analysis applications pursued in our efforts include clustering, classification, recognition and disambiguation of named entities (persons, organisations and locations), recognition and normalisation of date expressions, as well as the identification of reported speech quotations by and about people.