Self-supervised Vision Transformers with Data Augmentation Strategies Using Morphological Operations for Writer Retrieval
| Published in | Frontiers in Handwriting Recognition Vol. 13639; pp. 122 - 136 |
|---|---|
| Main Authors | , , |
| Format | Book Chapter |
| Language | English |
| Published | Switzerland: Springer International Publishing AG, 2022 |
| Series | Lecture Notes in Computer Science |
| Subjects | |
| ISBN | 3031216474 9783031216473 |
| ISSN | 0302-9743 1611-3349 |
| DOI | 10.1007/978-3-031-21648-0_9 |
| Summary: | This paper introduces a self-supervised approach using vision transformers for writer retrieval based on knowledge distillation. We propose morphological operations as a general data augmentation method for handwriting images to learn discriminative features independent of the pen. Our method operates on binarized 224×224 patches extracted from the documents’ writing region, and we generate two different views based on randomly sampled kernels for erosion and dilation to learn a representative embedding space invariant to different pens. Our evaluation shows that morphological operations outperform data augmentation commonly used in retrieval tasks, e.g., flipping, rotation, and translation, by up to 8%. Additionally, we compare our data augmentation strategy with existing approaches such as networks trained with triplet loss. We achieve a mean average precision of 66.4% on the Historical-WI dataset, competing with methods using algorithms like SIFT for patch extraction or computationally expensive encodings, e.g., mVLAD, NetVLAD, or E-SVM. Finally, we show by visualizing the attention mechanism that the heads of the vision transformer focus on different parts of the handwriting, e.g., loops or specific characters, enhancing the explainability of our writer retrieval. |
|---|---|
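The augmentation described in the abstract — generating two views of a binarized patch by applying erosion or dilation with randomly sampled kernels — can be sketched as follows. This is not the authors' implementation: the square kernels, the candidate kernel sizes, and the pure-NumPy morphology helpers are assumptions made for illustration.

```python
import numpy as np


def binary_dilate(img, k):
    """Dilate a binary image (ink = 1) with a k x k square structuring
    element by OR-ing all k*k shifted copies of the image."""
    pad = k // 2
    padded = np.pad(img, pad, mode="constant", constant_values=0)
    out = np.zeros_like(img)
    h, w = img.shape
    for dy in range(k):
        for dx in range(k):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def binary_erode(img, k):
    """Erode by duality: erosion of the ink is dilation of the background."""
    return 1 - binary_dilate(1 - img, k)


def two_views(patch, rng, kernel_sizes=(1, 2, 3)):
    """Generate two augmented views of one binarized patch, each with an
    independently sampled operation (erode/dilate) and kernel size.
    kernel_sizes is a hypothetical choice; k = 1 leaves the patch unchanged."""
    ops = [binary_erode, binary_dilate]
    views = []
    for _ in range(2):
        k = int(kernel_sizes[rng.integers(len(kernel_sizes))])
        op = ops[rng.integers(len(ops))]
        views.append(op(patch, k))
    return views
```

In a self-supervised setup such as the knowledge-distillation scheme the abstract mentions, the two views would be fed to the student and teacher branches so the learned embedding becomes invariant to stroke width, i.e., to the pen.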