Supervised Learning Approach for Section Title Detection in PDF Scientific Articles

Bibliographic Details
Published in Advances in Computational Intelligence Vol. 13067; pp. 44-54
Main Authors Guedes, Gustavo Bartz; da Silva, Ana Estela Antunes
Format Book Chapter
Language English
Published Switzerland Springer International Publishing AG 2021
Springer International Publishing
Series Lecture Notes in Computer Science
ISBN 9783030898168
3030898164
ISSN 0302-9743
1611-3349
DOI 10.1007/978-3-030-89817-5_3

Summary: Most scientific articles are available in Portable Document Format (PDF). Although PDF has the advantage of preserving layout across platforms, it does not maintain the original metadata structure, which makes further text processing difficult. Despite layouts that differ depending on the applied template, articles have a hierarchical structure and are divided into sections, which represent topics of specific subjects, such as methodology and results. Hence, section segmentation is an important step for contextualized text processing of scientific articles. This work therefore applies binary classification, a supervised learning task, to section title detection in PDF scientific articles. To train the classifiers, a large dataset (more than 5 million samples from 7,302 articles) was created through an automated feature extraction approach comprising 17 features, 4 of which were introduced in this work. Ten different classifiers were trained and tested, and the best F1 score reached 0.94. Finally, we evaluated our results against CERMINE, an open-source system that extracts metadata from scientific articles, and obtained an absolute improvement of 0.19 in F1 score for section detection.
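
As an illustration of the metric reported in the summary, the following minimal Python sketch computes the F1 score for a binary section-title classifier. The function name `f1_score` and the toy labels are hypothetical and are not taken from the chapter; label 1 is assumed to mark a section title and 0 any other sample.

```python
def f1_score(y_true, y_pred):
    """Return the F1 score for binary labels (1 = section title, 0 = other)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # titles found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # titles missed
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined),
        # so the score is reported as 0.0 by convention.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels, for illustration only (not data from the chapter).
y_true = [1, 0, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0]
score = f1_score(y_true, y_pred)
```

Because F1 ignores true negatives, it may be a more informative measure than accuracy when one class is much rarer than the other. The absolute improvement of 0.19 quoted in the summary is a difference between two such scores.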