Exploiting Unlabelled Photos for Stronger Fine-Grained SBIR

Abstract

This paper advances the fine-grained sketch-based image retrieval (FG-SBIR) literature by putting forward a strong baseline that surpasses the prior state of the art by ≈11%. It does so not through complicated design, but by addressing two critical issues facing the community: (i) the gold-standard triplet loss does not enforce holistic latent space geometry, and (ii) there are never enough sketches to train a high-accuracy model. For the former, we propose a simple modification to the standard triplet loss that explicitly enforces separation amongst photo/sketch instances. For the latter, we put forward a novel knowledge distillation module that can leverage photo data for model training. Both modules are then plugged into a novel plug-n-playable training paradigm that allows for more stable training. More specifically, for (i) we employ an intra-modal triplet loss amongst sketches to bring sketches of the same instance closer together and away from others, and one more amongst photos to push away different photo instances while bringing closer a structurally augmented version of the same photo (offering a gain of ≈4-6%). To tackle (ii), we first pre-train a teacher on the large set of unlabelled photos with the aforementioned intra-modal photo triplet loss. Then we distill the contextual similarity present amongst the instances in the teacher’s embedding space to that in the student’s embedding space, by matching the distribution over inter-feature distances of respective samples in both embedding spaces (delivering a further gain of ≈4-5%). Apart from significantly outperforming prior art, our model also yields satisfactory results on generalising to new classes.
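The loss design described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the margin value, the unit weights on the intra-modal terms, and the use of plain Euclidean distance on toy vectors are all assumptions made for illustration.

```python
import math

def l2(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet(anchor, positive, negative, margin=0.2):
    """Standard triplet hinge: pull the positive in, push the negative out."""
    return max(0.0, margin + l2(anchor, positive) - l2(anchor, negative))

def combined_triplet_loss(sketch, sketch_same, sketch_other,
                          photo, photo_aug, photo_other,
                          margin=0.2, w_sketch=1.0, w_photo=1.0):
    """Cross-modal triplet plus two intra-modal triplets.

    The margin and the weights w_sketch / w_photo are hypothetical values.
    """
    # Cross-modal: a sketch should be closer to its paired photo than to another photo.
    cross = triplet(sketch, photo, photo_other, margin)
    # Intra-modal (sketch): two sketches of the same instance vs. a sketch of another instance.
    intra_sketch = triplet(sketch, sketch_same, sketch_other, margin)
    # Intra-modal (photo): a photo and its structurally augmented copy vs. a different photo.
    intra_photo = triplet(photo, photo_aug, photo_other, margin)
    return cross + w_sketch * intra_sketch + w_photo * intra_photo

if __name__ == "__main__":
    loss = combined_triplet_loss(
        sketch=[0.0, 0.0], sketch_same=[0.1, 0.0], sketch_other=[5.0, 5.0],
        photo=[0.0, 0.1], photo_aug=[0.0, 0.15], photo_other=[5.0, 5.0],
    )
    print("combined loss:", loss)
```

Each term vanishes once its negative is farther from the anchor than its positive by at least the margin, so the intra-modal terms add a penalty only when instances within one modality are not yet separated.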

Motivation

A study shows existing models (orange) to have low training stability, unlike our proposed model (green).

Study on variable data-size (left) and cross-category generalisation (right).

Architecture

Using PVT as a backbone, a teacher pre-trained on unlabelled photos distills the contextual similarity among features in its latent space to a student, which also learns a discriminative latent space via cross- and intra-modal triplet losses. Distillation occurs via a learnable distillation token (shown as inset) introduced in the student’s PVT backbone, in three ways: (i) from unlabelled photos, (ii) from labelled photos, and (iii) by aligning sketches to their paired photos in student-space by distilling contextual similarity of labelled photos in teacher-space.
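One way to read "distilling contextual similarity" is sketched below, under stated assumptions. For each sample in a batch, its distances to the other samples are turned into a probability distribution, and the student's distribution is matched to the teacher's. The softmax over negative distances, the temperature, the KL divergence as the matching criterion, and all function names are assumptions here, not details confirmed by the text above.

```python
import math

def pairwise_distances(feats):
    """Matrix of Euclidean distances between every pair of feature vectors."""
    return [[math.dist(a, b) for b in feats] for a in feats]

def row_distribution(dists, i, temperature):
    """Softmax over negative distances from sample i to all other samples."""
    logits = [-d / temperature for j, d in enumerate(dists[i]) if j != i]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contextual_distill_loss(teacher_feats, student_feats, temperature=1.0):
    """Mean KL(teacher || student) over per-sample neighbour distributions.

    Needs at least 3 samples; with 2, every distribution is trivially [1.0].
    """
    t_dists = pairwise_distances(teacher_feats)
    s_dists = pairwise_distances(student_feats)
    n = len(teacher_feats)
    loss = 0.0
    for i in range(n):
        p = row_distribution(t_dists, i, temperature)
        q = row_distribution(s_dists, i, temperature)
        # Clamp q to avoid division by zero if a student probability underflows.
        loss += sum(pi * math.log(pi / max(qi, 1e-12))
                    for pi, qi in zip(p, q) if pi > 0)
    return loss / n

if __name__ == "__main__":
    teacher = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
    student = [[0.0, 0.0], [3.0, 0.0], [0.0, 1.0]]
    print("distillation loss:", contextual_distill_loss(teacher, student))
```

Because only inter-feature distances are compared, the teacher and student embeddings do not need to share a dimensionality in this formulation.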

BibTeX

@Inproceedings{sain2023exploiting,
  title={{Exploiting Unlabelled Photos for Stronger Fine-Grained SBIR}},
  author={Aneeshan Sain and Ayan Kumar Bhunia and Subhadeep Koley and Pinaki Nath Chowdhury and Soumitri Chattopadhyay and Tao Xiang and Yi-Zhe Song},
  booktitle={CVPR},
  year={2023}
}


(left) A strong FG-SBIR baseline with a PVT backbone, trained with an intra-modal triplet loss and stabilised by EMA. (right) It additionally leverages unlabelled photos to enrich its latent space by distilling instance-wise discriminative knowledge of a teacher pre-trained on unlabelled photos.
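Assuming EMA here refers to an exponential moving average of model weights, the stabilising update can be sketched as follows. The function name, the flat list of weights, and the decay value are hypothetical choices for illustration only.

```python
def ema_update(ema_weights, weights, decay=0.999):
    """Blend the current weights into a slowly moving average copy.

    A decay close to 1 makes the averaged copy change slowly, which
    smooths out step-to-step fluctuations of the trained weights.
    """
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]

if __name__ == "__main__":
    ema = [1.0, 0.0]
    for step_weights in ([0.0, 1.0], [0.0, 1.0], [0.0, 1.0]):
        ema = ema_update(ema, step_weights, decay=0.9)
    print("averaged weights:", ema)
```

In such a setup the averaged copy, rather than the raw weights, would typically be the one used for evaluation.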