Self-supervised learning framework for efficient classification of endoscopic images using pretext tasks


Bibliographic Details
Published in PloS one Vol. 20; no. 5; p. e0322028
Main Authors Nezhad, Shima Ayyoubi; Tajeddin, Golnaz; Khatibi, Toktam; Sohrabi, Masoudreza
Format Journal Article
Language English
Published United States Public Library of Science 08.05.2025
ISSN 1932-6203
DOI 10.1371/journal.pone.0322028

Summary: Identifying anatomical landmarks in endoscopic video frames is essential for the early diagnosis of gastrointestinal diseases. However, this task remains challenging due to variability in visual characteristics across different regions and the limited availability of annotated data. In this study, we propose a novel self-supervised learning (SSL) framework that integrates three complementary pretext tasks (colorization, jigsaw puzzle solving, and patch prediction) to enhance feature learning from unlabeled endoscopic images. By leveraging these tasks, our model extracts rich, meaningful representations, improving the downstream classification of Z-line, esophageal, and antrum/pylorus regions. To further enhance feature extraction and model interpretability, we incorporate attention mechanisms, transformer-based architectures, and Grad-CAM visualization. The integration of attention layers and transformers strengthens the model’s ability to learn discriminative and generalizable features, while Grad-CAM improves explainability by highlighting critical decision-making regions. These enhancements make our approach more suitable for clinical deployment, ensuring both high accuracy and interpretability. We evaluate our proposed framework on a comprehensive dataset, demonstrating substantial improvements in classification accuracy, precision, recall, and F1-score compared to conventional models trained without SSL. Specifically, our combined model achieves a classification accuracy of 98%, with high precision and recall across all classes, as reflected in ROC curves and confusion matrices. These results underscore the effectiveness of pretext-task-based SSL, attention mechanisms, and transformers for anatomical landmark identification in endoscopic video frames.
Our work introduces a scalable and interpretable methodology for improving endoscopic image classification, reducing reliance on large annotated datasets while enhancing model performance in real-world clinical applications. Future research will explore validation on diverse datasets, real-time diagnostic integration, and scalability to further advance medical image analysis using SSL.
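Illustrative note: the summary names jigsaw puzzle solving as one of the three pretext tasks but does not give its configuration. The following is a minimal sketch of how a jigsaw training sample is commonly built from an unlabeled frame, not the authors' implementation. The 3x3 grid, the permutation set of size 10, the toy 6x6 image, and all function names are assumptions made for illustration only.

```python
import random

# All names below are hypothetical; the record does not specify the grid size
# or permutation set used in the paper.

def split_into_tiles(image, grid=3):
    """Split a 2D image (list of rows) into grid*grid equal tiles, row-major."""
    height, width = len(image), len(image[0])
    tile_h, tile_w = height // grid, width // grid
    tiles = []
    for r in range(grid):
        for c in range(grid):
            tile = [row[c * tile_w:(c + 1) * tile_w]
                    for row in image[r * tile_h:(r + 1) * tile_h]]
            tiles.append(tile)
    return tiles

def build_permutation_set(num_tiles, count, rng):
    """Draw `count` distinct tile orderings to serve as classification targets."""
    seen = set()
    while len(seen) < count:
        perm = list(range(num_tiles))
        rng.shuffle(perm)
        seen.add(tuple(perm))
    return sorted(seen)

def make_jigsaw_sample(image, permutations, rng, grid=3):
    """Return (shuffled_tiles, label); label is the index of the permutation used."""
    tiles = split_into_tiles(image, grid)
    label = rng.randrange(len(permutations))
    shuffled = [tiles[i] for i in permutations[label]]
    return shuffled, label

rng = random.Random(0)                                      # fixed seed for repeatability
image = [[y * 6 + x for x in range(6)] for y in range(6)]   # toy 6x6 stand-in for a frame
perms = build_permutation_set(num_tiles=9, count=10, rng=rng)
tiles, label = make_jigsaw_sample(image, perms, rng)
```

In this setup a network would receive the shuffled tiles and be trained to predict `label`, so the supervision signal comes from the image itself rather than from manual annotation.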
Competing Interests: The authors have declared that no competing interests exist.