Title: CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition with Variational Alignment

Mar 23, 2023

Jiangbin Zheng, Yile Wang, Cheng Tan, Siyuan Li, Ge Wang, Jun Xia, Yidong Chen, Stan Z. Li

Abstract: Sign language recognition (SLR) is a weakly supervised task that annotates sign videos with textual glosses. Recent studies show that insufficient training, caused by the lack of large-scale sign datasets, is the main bottleneck for SLR. Most SLR works therefore adopt pretrained visual modules and follow one of two mainstream approaches. Multi-stream architectures extend multi-cue visual features, yielding the current SOTA performance, but they require complex designs and may introduce noise. Alternatively, advanced single-cue SLR frameworks that use explicit cross-modal alignment between the visual and textual modalities are simple and effective, and potentially competitive with multi-cue frameworks. In this work, we propose a novel contrastive visual-textual transformation for SLR, CVT-SLR, to fully explore the pretrained knowledge of both the visual and language modalities. Building on the single-cue cross-modal alignment framework, we propose a variational autoencoder (VAE) for pretrained contextual knowledge while introducing the complete pretrained language module. The VAE implicitly aligns the visual and textual modalities while benefiting from pretrained contextual knowledge as the traditional contextual module. Meanwhile, a contrastive cross-modal alignment algorithm is designed to explicitly enhance the consistency constraints. Extensive experiments on public datasets (PHOENIX-2014 and PHOENIX-2014T) demonstrate that the proposed CVT-SLR consistently outperforms existing single-cue methods and even outperforms SOTA multi-cue methods.
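
The abstract does not spell out the contrastive cross-modal alignment algorithm, so the following is only a minimal sketch of a generic symmetric contrastive (InfoNCE-style) loss between paired visual and textual feature vectors, not the authors' implementation. The function name `contrastive_alignment_loss`, the `temperature` parameter, and the use of cosine similarity are all assumptions made for illustration.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_alignment_loss(visual, textual, temperature=0.1):
    """Hypothetical symmetric contrastive loss over paired features.

    visual[i] and textual[i] are assumed to describe the same sample;
    every other pairing in the batch is treated as a negative.
    """
    if not visual or len(visual) != len(textual):
        raise ValueError("expected two non-empty batches of equal size")
    v = [_normalize(x) for x in visual]
    t = [_normalize(x) for x in textual]
    # Cosine similarity of every visual/textual pair, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
        for vi in v
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    # Average the visual-to-text and text-to-visual directions.
    return 0.5 * (
        _diagonal_cross_entropy(logits)
        + _diagonal_cross_entropy(logits_transposed)
    )
```

Under these assumptions, the loss is small when each visual feature is most similar to its own textual feature and grows when the pairing is scrambled, which is the kind of consistency constraint the abstract refers to.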

* Accepted to CVPR 2023 (Highlight Paper Top 2.5%)
