Cross-lingual extreme summarization of scholarly documents

Bibliographic Details
Published in: International Journal on Digital Libraries, Vol. 25, No. 2, pp. 249–271
Main Authors: Takeshita, Sotaro; Green, Tommaso; Friedrich, Niklas; Eckert, Kai; Ponzetto, Simone Paolo
Format: Journal Article
Language: English
Published: Berlin, Heidelberg: Springer, 01.06.2024
ISSN: 1432-5012 (print); 1432-1300 (electronic)
DOI: 10.1007/s00799-023-00373-2

Summary: The number of scientific publications nowadays is rapidly increasing, causing information overload for researchers and making it hard for scholars to keep up to date with current trends and lines of work. Recent work has tried to address this problem by developing methods for automated summarization in the scholarly domain, but has so far concentrated only on monolingual settings, primarily English. In this paper, we consequently explore how state-of-the-art neural abstractive summarization models based on a multilingual encoder–decoder architecture can be used to enable cross-lingual extreme summaries of scholarly texts. To this end, we compile a new abstractive cross-lingual summarization dataset for the scholarly domain in four different languages, which enables us to train and evaluate models that process English papers and generate summaries in German, Italian, Chinese and Japanese. We present our new X-SCITLDR dataset for multilingual summarization and thoroughly benchmark different models based on a state-of-the-art multilingual pre-trained model, including a two-stage pipeline approach that independently summarizes and translates, as well as a direct cross-lingual model. We additionally explore the benefits of intermediate-stage training using English monolingual summarization and machine translation as intermediate tasks and analyze performance in zero- and few-shot scenarios. Finally, we investigate how to make our approach more efficient on the basis of knowledge distillation methods, which make it possible to shrink the size of our models and thus reduce the computational cost of summarization inference.
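The abstract contrasts two setups: a two-stage pipeline that first summarizes in English and then translates, and a direct cross-lingual model that reads an English paper and generates the target-language summary in one pass. The sketch below is not the authors' released code; it illustrates both strategies with the Hugging Face transformers library, and the checkpoint names (facebook/bart-large-cnn, Helsinki-NLP/opus-mt-en-de, facebook/mbart-large-50) are placeholder assumptions. For the direct approach, a multilingual encoder-decoder such as mBART would first have to be fine-tuned on X-SCITLDR; the base checkpoint here only demonstrates the inference API.

```python
# Sketch of the two cross-lingual TLDR strategies described in the abstract.
# All checkpoint names are illustrative assumptions, not the paper's models.
from transformers import (
    MBart50TokenizerFast,
    MBartForConditionalGeneration,
    pipeline,
)

paper_text = (
    "The number of scientific publications nowadays is rapidly increasing, "
    "causing information overload for researchers..."
)

# Approach 1: two-stage pipeline -- summarize in English, then translate.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

english_tldr = summarizer(paper_text, max_length=40, min_length=5)[0]["summary_text"]
german_tldr_pipeline = translator(english_tldr)[0]["translation_text"]

# Approach 2: direct cross-lingual model -- one multilingual encoder-decoder
# reads the English input and generates the German summary directly. In
# practice this model would be fine-tuned on X-SCITLDR first (optionally
# after intermediate-stage training on English summarization and/or machine
# translation, as the abstract describes).
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")
tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50", src_lang="en_XX"
)

inputs = tokenizer(paper_text, return_tensors="pt", truncation=True)
output_ids = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["de_DE"],  # force German output
    max_length=40,
)
german_tldr_direct = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]

print(german_tldr_pipeline)
print(german_tldr_direct)
```

The knowledge-distillation step mentioned at the end of the abstract would shrink such a model, for example by keeping only a subset of decoder layers before re-fine-tuning, but the record gives no detail on the exact method used.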