Self-supervised Vision Transformers with Data Augmentation Strategies Using Morphological Operations for Writer Retrieval
| Published in | Frontiers in Handwriting Recognition Vol. 13639; pp. 122 - 136 |
|---|---|
| Main Authors | , , |
| Format | Book Chapter |
| Language | English |
| Published | Switzerland: Springer International Publishing AG, 2022 |
| Series | Lecture Notes in Computer Science |
| Subjects | |
| ISBN | 3031216474 9783031216473 |
| ISSN | 0302-9743 1611-3349 |
| DOI | 10.1007/978-3-031-21648-0_9 |
| Summary: | This paper introduces a self-supervised approach using vision transformers for writer retrieval based on knowledge distillation. We propose morphological operations as a general data augmentation method for handwriting images to learn discriminative features independent of the pen. Our method operates on binarized 224×224 patches extracted from the documents’ writing region, and we generate two different views based on randomly sampled kernels for erosion and dilation to learn a representative embedding space invariant to different pens. Our evaluation shows that morphological operations outperform data augmentation commonly used in retrieval tasks, e.g., flipping, rotation, and translation, by up to 8%. Additionally, we compare our data augmentation strategy with existing approaches such as networks trained with triplet loss. We achieve a mean average precision of 66.4% on the Historical-WI dataset, competing with methods using algorithms like SIFT for patch extraction or computationally expensive encodings, e.g., mVLAD, NetVLAD, or E-SVM. Finally, we show by visualizing the attention mechanism that the heads of the vision transformer focus on different parts of the handwriting, e.g., loops or specific characters, enhancing the explainability of our writer retrieval. |
|---|---|
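The augmentation described in the abstract — generating two views of a binarized patch by applying erosion or dilation with randomly sampled kernels — can be sketched as follows. This is not the authors' implementation: the square kernels, the candidate kernel sizes, and the pure-NumPy morphology helpers are assumptions made for illustration.

```python
import numpy as np


def binary_dilate(img, k):
    """Dilate a binary image (ink = 1) with a k x k square structuring
    element by OR-ing all k*k shifted copies of the image."""
    pad = k // 2
    padded = np.pad(img, pad, mode="constant", constant_values=0)
    out = np.zeros_like(img)
    h, w = img.shape
    for dy in range(k):
        for dx in range(k):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def binary_erode(img, k):
    """Erode by duality: erosion of the ink is dilation of the background."""
    return 1 - binary_dilate(1 - img, k)


def two_views(patch, rng, kernel_sizes=(1, 2, 3)):
    """Generate two augmented views of one binarized patch, each with an
    independently sampled operation (erode/dilate) and kernel size.
    kernel_sizes is a hypothetical choice; k = 1 leaves the patch unchanged."""
    ops = [binary_erode, binary_dilate]
    views = []
    for _ in range(2):
        k = int(kernel_sizes[rng.integers(len(kernel_sizes))])
        op = ops[rng.integers(len(ops))]
        views.append(op(patch, k))
    return views
```

In a self-supervised setup such as the knowledge-distillation scheme the abstract mentions, the two views would be fed to the student and teacher branches so the learned embedding becomes invariant to stroke width, i.e., to the pen.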