Supervised Learning Approach for Section Title Detection in PDF Scientific Articles
The majority of scientific articles is available in Portable Document Format (PDF). Although PDF format has the advantage of preserving layout across platforms it does not maintain the original metadata structure, making it difficult further text processing. Despite different layouts, depending on t...
        Saved in:
      
    
          | Published in | Advances in Computational Intelligence Vol. 13067; pp. 44 - 54 | 
|---|---|
| Main Authors | , | 
| Format | Book Chapter | 
| Language | English | 
| Published | 
        Switzerland
          Springer International Publishing AG
    
        2021
     Springer International Publishing  | 
| Series | Lecture Notes in Computer Science | 
| Subjects | |
| Online Access | Get full text | 
| ISBN | 9783030898168 3030898164  | 
| ISSN | 0302-9743 1611-3349  | 
| DOI | 10.1007/978-3-030-89817-5_3 | 
Cover
| Summary: | The majority of scientific articles is available in Portable Document Format (PDF). Although PDF format has the advantage of preserving layout across platforms it does not maintain the original metadata structure, making it difficult further text processing. Despite different layouts, depending on the applied template, articles have a hierarchical structure and are divided into sections, which represent topics of specific subjects, such as methodology and results. Hence, section segmentation serves as an important step for a contextualized text processing of scientific articles. Therefore, this work applies binary classification, a supervised learning task, for section title detection in PDF scientific articles. To train the classifiers, a large dataset (more than 5 millions samples from 7,302 articles) was created through an automated feature extraction approach, comprised by 17 features, where 4 were introduced in this work. Training and testing were made for ten different classifiers for which the best F1 score reached 0.94. Finally, we evaluated our results against CERMINE, an open-source system that extracts metadata from scientific articles, having an absolute improvement in section detection of 0.19 in F1 score. | 
|---|---|
| ISBN: | 9783030898168 3030898164  | 
| ISSN: | 0302-9743 1611-3349  | 
| DOI: | 10.1007/978-3-030-89817-5_3 |