X-SSL: Self-supervised X-ray threat detection with zero-shot and multi-modal learning

Bibliographic Details
Published in	Information processing & management Vol. 63; no. 1; p. 104339
Main Authors	Michael, Yonathan, Alansari, Mohamad, Ahmed, Abdelfatah, Werghi, Naoufel, Henschel, Andreas
Format	Journal Article
Language	English
Published	Elsevier Ltd 01.01.2026
Subjects	Instance segmentation; Knowledge distillation; Multi-modal clustering; Self-supervised learning (SSL); X-ray threat detection; Zero-shot object proposal
ISSN	0306-4573
DOI	10.1016/j.ipm.2025.104339

Summary:	Automated X-ray threat detection is challenged by cluttered baggage scans, severe object occlusions, and the scarcity of annotated datasets. Traditional supervised approaches are impractical because they require large amounts of labeled data, which are difficult to obtain given the rarity of threat objects. Unsupervised methods, on the other hand, often fail to differentiate between threat and nonthreat items due to the complex grayscale nature and high object overlap inherent in X-ray imagery. To overcome these limitations, we propose X-SSL, a novel self-supervised learning framework for threat localization that eliminates the need for manual annotations. Our approach integrates three components: (1) spatial region extraction using MaskCut for zero-shot object proposal generation; (2) contrastive multi-modal clustering that leverages both image and text encoders to cluster and label proposals into threat and nonthreat categories; and (3) self-supervised knowledge distillation, in which a teacher–student model refines multiscale features from global and local image crops for improved representation learning. We evaluated X-SSL on two benchmark datasets, PIDray (39,000+ images) and CLCXray (14,000+ images), and it improves on previous state-of-the-art (SOTA) methods on both. On the PIDray Hidden subset, X-SSL improves detection AP to 18.16 (+4.41 AP) and segmentation AP to 13.76 (+1.01 AP) over previous methods. On CLCXray, it achieves 39.26 detection AP (+5.93 AP) and 38.67 segmentation AP (+10.20 AP). For classification, X-SSL achieves an accuracy of 43% on PIDray Hidden and 65% on CLCXray, outperforming existing weakly supervised and unsupervised methods. Code will be available at https://github.com/yonathan-kiflom/X-SSL.
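
The labeling step in the summary, where proposals are sorted into threat and nonthreat categories using image and text encoders, can be illustrated with a toy sketch. This is not the authors' implementation: it replaces contrastive clustering with a simple nearest-text-prompt assignment by cosine similarity, and the function names (`cosine`, `label_proposals`) and the 3-dimensional embeddings are hypothetical stand-ins for real encoder outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def label_proposals(proposal_embeddings, text_embeddings):
    """Give each proposal the label whose text embedding is most similar to it."""
    labels = []
    for emb in proposal_embeddings:
        scores = {name: cosine(emb, t) for name, t in text_embeddings.items()}
        labels.append(max(scores, key=scores.get))
    return labels

# Hypothetical toy embeddings; a real pipeline would take these from
# an image encoder (per object proposal) and a text encoder (per category prompt).
text_embeddings = {"threat": [1.0, 0.0, 0.2], "nonthreat": [0.0, 1.0, 0.1]}
proposals = [[0.9, 0.1, 0.3], [0.2, 0.8, 0.0]]
labels = label_proposals(proposals, text_embeddings)
print(labels)
```

With these toy vectors the first proposal lies closer to the "threat" prompt and the second to the "nonthreat" prompt. The sketch only shows how a shared image-text embedding space lets proposals be labeled without manual annotations; the clustering and distillation stages described in the summary are not modeled here.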