Hyeongwoo Jeon

DiffSLT: Enhancing Diversity in Sign Language Translation via Diffusion Model

Nov 26, 2024

JiHwan Moon, Jihoon Park, Jungeun Kim, Jongseong Bae, Hyeongwoo Jeon, Ha Young Kim

Abstract: Sign language translation (SLT) is challenging, as it involves converting sign language videos into natural language. Previous studies have prioritized accuracy over diversity. However, diversity is crucial for handling lexical and syntactic ambiguities in machine translation, suggesting it could similarly benefit SLT. In this work, we propose DiffSLT, a novel gloss-free SLT framework that leverages a diffusion model, enabling diverse translations while preserving sign language semantics. DiffSLT transforms random noise into the target latent representation, conditioned on the visual features of the input video. To enhance visual conditioning, we design a Guidance Fusion Module, which fully utilizes the multi-level spatiotemporal information of the visual features. We also introduce DiffSLT-P, a DiffSLT variant that conditions on pseudo-glosses and visual features, providing key textual guidance and reducing the modality gap. As a result, DiffSLT and DiffSLT-P significantly improve diversity over previous gloss-free SLT methods and achieve state-of-the-art performance on two SLT datasets, thereby markedly improving translation quality.

* Project page: https://diffslt.github.io/

Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation

Nov 25, 2024

Jungeun Kim, Hyeongwoo Jeon, Jongseong Bae, Ha Young Kim

Abstract: Sign language translation (SLT) is a challenging task that involves translating sign language images into spoken language. For SLT models to perform this task successfully, they must bridge the modality gap and identify subtle variations in sign language components to understand their meanings accurately. To address these challenges, we propose a novel gloss-free SLT framework called Multimodal Sign Language Translation (MMSLT), which leverages the representational capabilities of off-the-shelf multimodal large language models (MLLMs). Specifically, we generate detailed textual descriptions of sign language components using MLLMs. Then, through our proposed multimodal-language pre-training module, we integrate these description features with sign video features to align them within the spoken sentence space. Our approach achieves state-of-the-art performance on the benchmark datasets PHOENIX14T and CSL-Daily, highlighting the potential of MLLMs to be effectively utilized in SLT.
