Topic: Sign Language Recognition
What is Sign Language Recognition? Sign language recognition is a computer vision and natural language processing task that involves automatically recognizing and translating sign language gestures into written or spoken language. The goal of sign language recognition is to develop algorithms that can understand and interpret sign language, enabling people who use sign language as their primary mode of communication to communicate more easily with non-signers.
Papers and Code
Jul 08, 2024
Abstract:Recent multimodal large language models (MLLMs) such as GPT-4o and GPT-4V have shown great potential in autonomous driving. In this paper, we propose a cross-domain few-shot in-context learning method based on the MLLM for enhancing traffic sign recognition (TSR). We first construct a traffic sign detection network based on Vision Transformer Adapter and an extraction module to extract traffic signs from the original road images. To reduce the dependence on training data and improve the performance stability of cross-country TSR, we introduce a cross-domain few-shot in-context learning method based on the MLLM. To enhance the MLLM's fine-grained recognition of traffic signs, the proposed method generates corresponding description texts from template traffic signs. These description texts contain key information about the shape, color, and composition of traffic signs, which helps the MLLM perceive fine-grained traffic sign categories. By using the description texts, our method reduces the cross-domain differences between template and real traffic signs. Our approach requires only simple and uniform textual prompts, without the need for large-scale traffic sign images and labels. We perform comprehensive evaluations on the German traffic sign recognition benchmark dataset, the Belgium traffic sign dataset, and two real-world datasets taken from Japan. The experimental results show that our method significantly enhances TSR performance.
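For illustration, here is a minimal sketch of the in-context prompting idea described above: attribute descriptions of template signs (shape, color, composition) are assembled into a few-shot prompt for an MLLM. The prompt layout and the attribute fields are assumptions for illustration; the paper does not publish its exact template here.

```python
# Hypothetical sketch: build a few-shot, description-augmented prompt for an
# MLLM. Field names and wording are illustrative, not the paper's template.

def build_description(shape: str, color: str, composition: str) -> str:
    """Turn template-sign attributes into the textual cue fed to the MLLM."""
    return (f"This sign is {shape}-shaped, predominantly {color}, "
            f"and shows {composition}.")

def build_fewshot_prompt(examples: list[dict], query_note: str) -> str:
    """Assemble an in-context prompt from template signs and their texts.

    Template images would be interleaved by the MLLM API; text only here.
    """
    lines = ["You are classifying traffic signs into fine-grained categories."]
    for ex in examples:
        desc = build_description(ex["shape"], ex["color"], ex["composition"])
        lines.append(f"[template image] {desc} Category: {ex['category']}")
    lines.append(f"[query image] {query_note} Category:")
    return "\n".join(lines)

if __name__ == "__main__":
    demo = [{"category": "stop", "shape": "octagon", "color": "red",
             "composition": "the word STOP in white capitals"}]
    print(build_fewshot_prompt(demo, "Describe and classify this cropped sign."))
```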

Jul 22, 2024
Abstract:Progress in machine understanding of sign languages has been slow and hampered by limited data. In this paper, we present FSboard, an American Sign Language fingerspelling dataset situated in a mobile text entry use case, collected from 147 paid and consenting Deaf signers using Pixel 4A selfie cameras in a variety of environments. Fingerspelling recognition is an incomplete solution that is only one small part of sign language translation, but it could provide some immediate benefit to Deaf/Hard of Hearing signers as more broadly capable technology develops. At >3 million characters in length and >250 hours in duration, FSboard is the largest fingerspelling recognition dataset to date by a factor of >10x. As a simple baseline, we finetune ByT5-Small on 30 Hz MediaPipe Holistic landmark inputs and achieve an 11.1% Character Error Rate (CER) on a test set with unique phrases and signers. This quality degrades gracefully when decreasing frame rate and excluding face/body landmarks: plausible optimizations to help models run on device in real time.
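For reference, a minimal sketch of the Character Error Rate metric used to score the baseline: character-level Levenshtein edit distance normalized by reference length. This is illustrative, not the paper's evaluation code.

```python
# Illustrative CER: edit distance over characters / reference length.

def edit_distance(ref: str, hyp: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (r != h))  # substitution or match
            prev = cur
    return dp[-1]

def cer(refs: list[str], hyps: list[str]) -> float:
    """Corpus-level CER: total edits divided by total reference characters."""
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return edits / sum(len(r) for r in refs)

print(cer(["hello world"], ["helo world"]))  # 1 edit / 11 chars ~ 0.091
```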

Apr 17, 2024
Abstract:In sign language, the conveyance of human body trajectories predominantly relies upon the coordinated movements of hands and facial expressions across successive frames. Despite the recent advancements of sign language understanding methods, they often solely focus on individual frames, inevitably overlooking the inter-frame correlations that are essential for effectively modeling human body trajectories. To address this limitation, this paper introduces a spatial-temporal correlation network, denoted as CorrNet+, which explicitly identifies body trajectories across multiple frames. Specifically, CorrNet+ employs a correlation module and an identification module to build human body trajectories. A temporal attention module then adaptively evaluates the contributions of different frames. The resultant features offer a holistic perspective on human body movements, facilitating a deeper understanding of sign language. As a unified model, CorrNet+ achieves new state-of-the-art performance on two extensive sign language understanding tasks, including continuous sign language recognition (CSLR) and sign language translation (SLT). Notably, CorrNet+ surpasses previous methods equipped with resource-intensive pose-estimation networks or pre-extracted heatmaps for hand and facial feature extraction. Compared with CorrNet, CorrNet+ achieves a significant performance boost across all benchmarks while halving the computational overhead. A comprehensive comparison with previous spatial-temporal reasoning methods verifies the superiority of CorrNet+. Code is available at https://github.com/hulianyuyy/CorrNet_Plus.
* arXiv admin note: substantial text overlap with arXiv:2303.03202
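As a rough illustration of the temporal attention idea, the sketch below reweights frame-wise features with learned per-frame scores. The scoring MLP and dimensions are assumptions, not the released CorrNet+ architecture (see the linked repository for the real code).

```python
# Hypothetical temporal attention: score each frame, softmax over time,
# and reweight the frame features before aggregation.

import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim // 4), nn.ReLU(),
                                   nn.Linear(dim // 4, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) frame-wise features
        w = torch.softmax(self.score(x), dim=1)  # (batch, time, 1)
        return x * w                             # frames reweighted by contribution

feats = torch.randn(2, 16, 256)                  # 2 clips, 16 frames, 256-d features
print(TemporalAttention(256)(feats).shape)       # torch.Size([2, 16, 256])
```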

May 06, 2024
Abstract:This paper addresses a critical flaw in MediaPipe Holistic's hand Region of Interest (ROI) prediction, which struggles with non-ideal hand orientations, affecting sign language recognition accuracy. We propose a data-driven approach to enhance ROI estimation, leveraging an enriched feature set including additional hand keypoints and the z-dimension. Our results demonstrate better estimates, with higher Intersection-over-Union compared to the current method. Our code and optimizations are available at https://github.com/sign-language-processing/mediapipe-hand-crop-fix.
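For reference, a minimal sketch of the Intersection-over-Union metric used to compare predicted hand crops against reference ROIs; boxes are (x_min, y_min, x_max, y_max). Illustrative only, not the repository's code.

```python
# Illustrative IoU between two axis-aligned boxes.

def iou(a, b) -> float:
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7 ~ 0.143
```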

Mar 21, 2024
Abstract:Sign language recognition (SLR) has recently achieved a breakthrough in performance thanks to deep neural networks trained on large annotated sign datasets. Of the many different sign languages, such annotated datasets are available for only a select few. Since acquiring gloss-level labels on sign language videos is difficult, learning by transferring knowledge from existing annotated sources is useful for recognition in under-resourced sign languages. This study provides a publicly available cross-dataset transfer learning benchmark built from two existing public Turkish SLR datasets. We use a temporal graph convolution-based sign language recognition approach to evaluate five supervised transfer learning approaches and experiment with closed-set and partial-set cross-dataset transfer learning. Experiments demonstrate that improvement over finetuning-based transfer learning is possible with specialized supervised transfer learning methods.
* Accepted to the 18th IEEE International Conference on Automatic Face and Gesture Recognition 2024. Code available at https://github.com/alpk/tid-supervised-transfer-learning-dataset
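For illustration, a minimal PyTorch sketch of the finetuning baseline that the specialized transfer methods are compared against: reuse the backbone trained on the source SLR dataset and re-initialize only the classification head for the target vocabulary. The model and all dimensions are placeholders, not the benchmark's temporal graph convolutional network.

```python
# Hypothetical closed-set finetuning baseline: keep the pretrained backbone,
# reset only the classifier head for the target dataset's classes.

import torch
import torch.nn as nn

class SLRClassifier(nn.Module):
    """Stand-in for a temporal-graph-convolution SLR model."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = nn.Linear(75 * 3, feat_dim)  # placeholder for the GCN
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        return self.classifier(torch.relu(self.backbone(x)))

def adapt_for_target(model: SLRClassifier, num_target_classes: int):
    """Finetuning transfer: backbone weights kept, head re-initialized."""
    model.classifier = nn.Linear(model.classifier.in_features,
                                 num_target_classes)
    return model

source = SLRClassifier(feat_dim=256, num_classes=744)  # illustrative source vocab
target = adapt_for_target(source, num_target_classes=226)
print(target(torch.randn(4, 225)).shape)               # torch.Size([4, 226])
```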

Aug 15, 2024
Abstract:Hand gesture recognition has recently emerged as a focal point due to its wide range of applications, including sign language comprehension, factory automation, hands-free devices, and robot guidance. Many researchers have attempted to develop more effective techniques for recognizing hand gestures, but challenges remain: dataset limitations, variations in hand shape, external environments, and inconsistent lighting conditions. To address these challenges, we propose a novel three-stream hybrid model that combines RGB pixel and skeleton-based features to recognize hand gestures. We preprocess the dataset, including augmentation, to make the system invariant to rotation, translation, and scaling. The three-stream hybrid model extracts a multi-feature fusion using deep learning modules. In the first stream, we extract initial features with a pretrained ImageNet module and enhance them with multiple layers of GRU and LSTM modules. In the second stream, we extract initial features with a pretrained ResNet module and enhance them with various combinations of GRU and LSTM modules. In the third stream, we extract hand pose keypoints using MediaPipe and enhance them with a stacked LSTM to produce hierarchical features. We then concatenate the three features to produce the final feature vector, and a classification module produces a probabilistic map to generate the predicted output. By combining pixel-based deep learning features with pose-estimation-based stacked deep learning features, including a pretrained model alongside a model trained from scratch, we produce a powerful feature vector for robust gesture recognition.
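For illustration, a minimal PyTorch sketch of the three-stream fusion described above: two CNN-feature streams refined by recurrent layers plus a stacked-LSTM pose stream, concatenated and classified. All dimensions, backbones, and layer counts are assumptions, not the paper's exact configuration.

```python
# Hypothetical three-stream fusion. CNN feature extraction is assumed to
# happen upstream; this sketch covers the recurrent refinement and fusion.

import torch
import torch.nn as nn

class ThreeStreamFusion(nn.Module):
    def __init__(self, cnn_dim=512, pose_dim=63, hidden=128, classes=30):
        super().__init__()
        self.stream1 = nn.GRU(cnn_dim, hidden, batch_first=True)   # ImageNet-CNN feats
        self.stream2 = nn.LSTM(cnn_dim, hidden, batch_first=True)  # ResNet feats
        self.stream3 = nn.LSTM(pose_dim, hidden, num_layers=2,     # MediaPipe keypoints
                               batch_first=True)                   # (21 joints x 3 coords)
        self.head = nn.Linear(3 * hidden, classes)

    def forward(self, f1, f2, pose):
        # Each input: (batch, time, dim); keep the last time step per stream.
        h1 = self.stream1(f1)[0][:, -1]
        h2 = self.stream2(f2)[0][:, -1]
        h3 = self.stream3(pose)[0][:, -1]
        return self.head(torch.cat([h1, h2, h3], dim=-1))  # softmax gives the map

m = ThreeStreamFusion()
logits = m(torch.randn(2, 20, 512), torch.randn(2, 20, 512), torch.randn(2, 20, 63))
print(logits.shape)  # torch.Size([2, 30])
```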

Apr 12, 2024
Abstract:The increase in web-scale, weakly labelled image-text pairs has greatly facilitated the development of large-scale vision-language models (e.g., CLIP), which show impressive generalization across a series of downstream tasks. However, the massive model size and the scarcity of available data make it impractical to fine-tune the whole model on downstream tasks. Besides, fully fine-tuning the model easily forgets the generic essential knowledge acquired in the pretraining stage and overfits the downstream data. To adapt these large vision-language models (e.g., CLIP) to continuous sign language recognition (CSLR) efficiently while preserving their generalizability, we propose a novel strategy (AdaptSign). Specifically, CLIP is adopted as the visual backbone to extract frame-wise features with its parameters fixed, and a set of learnable modules is introduced to model spatial sign variations and capture temporal sign movements. The introduced modules are quite lightweight, adding only 3.2% extra computation. The generic knowledge acquired in the pretraining stage is well preserved in the frozen CLIP backbone. Extensive experiments show that despite being efficient, AdaptSign demonstrates superior performance across a series of CSLR benchmarks, including PHOENIX14, PHOENIX14-T, CSL-Daily and CSL, compared to existing methods. Visualizations show that AdaptSign learns to dynamically pay attention to the informative spatial regions and cross-frame trajectories in sign videos.
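For illustration, a minimal PyTorch sketch of the frozen-backbone-plus-adapter pattern AdaptSign builds on: the visual backbone is frozen and only small residual bottleneck modules train. The backbone here is a placeholder linear layer standing in for CLIP's visual encoder; the paper's spatial and temporal modules are more elaborate.

```python
# Hypothetical sketch: frozen backbone + lightweight trainable adapter.

import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck; the only trainable piece in this sketch."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

backbone = nn.Linear(768, 768)       # stand-in for CLIP's visual encoder
for p in backbone.parameters():
    p.requires_grad = False          # pretraining knowledge stays frozen

adapter = Adapter(768)
frames = torch.randn(2, 16, 768)     # (batch, time, dim) frame features
out = adapter(backbone(frames))

trainable = sum(p.numel() for p in adapter.parameters())
total = trainable + sum(p.numel() for p in backbone.parameters())
print(f"trainable fraction: {trainable / total:.1%}")  # small vs. the backbone
```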

May 02, 2024
Abstract:This paper introduces TVB-HKSL-News, a new Hong Kong sign language (HKSL) dataset collected from a TV news program over a period of 7 months. The dataset is collected to enrich resources for HKSL and support research in large-vocabulary continuous sign language recognition (SLR) and translation (SLT). It consists of 16.07 hours of sign videos of two signers with a vocabulary of 6,515 glosses (for SLR) and 2,850 Chinese characters or 18K Chinese words (for SLT). One signer has 11.66 hours of sign videos and the other has 4.41 hours. One objective in building the dataset is to investigate how well large-vocabulary continuous sign language recognition/translation can be done for a single signer given a (relatively) large amount of his/her training data, which could potentially lead to the development of new modeling methods. In addition, most of the data collection pipeline is automated with little human intervention; we believe our collection method can be scaled up easily to collect more sign language data for SLT in the future, for any sign language for which such sign-interpreted videos are available. We also run a SOTA SLR/SLT model on the dataset and obtain a baseline SLR word error rate of 34.08% and a baseline SLT BLEU-4 score of 23.58 for benchmarking future research on the dataset.
* Accepted by LREC-COLING 2024
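For reference, a minimal sketch of word error rate (WER), the SLR metric reported above: token-level edit distance between hypothesis and reference gloss sequences, normalized by reference length. The gloss strings are made up for the demo; this is not the benchmark's scoring script.

```python
# Illustrative WER over gloss tokens (same DP as CER, but on tokens).

def wer(ref_glosses: list[str], hyp_glosses: list[str]) -> float:
    r, h = ref_glosses, hyp_glosses
    dp = list(range(len(h) + 1))  # edits to reach first j hyp tokens
    for i in range(1, len(r) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(h) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1,
                        prev + (r[i - 1] != h[j - 1]))
            prev = cur
    return dp[-1] / len(r)

# Hypothetical gloss sequences: one deleted gloss out of four -> WER 0.25.
print(wer("NOW WEATHER HK RAIN".split(), "NOW WEATHER RAIN".split()))
```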

Mar 18, 2024
Abstract:A key challenge in continuous sign language recognition (CSLR) is to efficiently capture long-range spatial interactions over time from the video input. To address this challenge, we propose TCNet, a hybrid network that effectively models spatio-temporal information from Trajectories and Correlated regions. TCNet's trajectory module transforms frames into aligned trajectories composed of continuous visual tokens, and, for a query token, self-attention is learned along the trajectory. As such, our network can also focus on fine-grained spatio-temporal patterns, such as finger movements, of a specific region in motion. TCNet's correlation module uses a novel dynamic attention mechanism that filters out irrelevant frame regions. Additionally, it assigns dynamic key-value tokens from correlated regions to each query. Both innovations significantly reduce computation and memory costs. We perform experiments on four large-scale datasets: PHOENIX14, PHOENIX14-T, CSL, and CSL-Daily. Our results demonstrate that TCNet consistently achieves state-of-the-art performance. For example, we improve over the previous state of the art by 1.5% and 1.0% word error rate on PHOENIX14 and PHOENIX14-T, respectively.
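As a rough illustration of self-attention along a trajectory, the sketch below gathers the tokens of one tracked region across frames and lets them attend only to each other rather than to the full spatial grid. The tracking step and TCNet's dynamic key-value selection are omitted; all shapes are assumptions.

```python
# Hypothetical sketch: restrict self-attention to tokens of one trajectory.

import torch
import torch.nn as nn

T, D = 16, 128                        # frames, token dimension
attn = nn.MultiheadAttention(embed_dim=D, num_heads=4, batch_first=True)

tokens = torch.randn(1, 10, T, D)     # (batch, trajectories, time, dim)
traj = tokens[:, 0]                   # one aligned trajectory: (1, T, D)
out, weights = attn(traj, traj, traj) # attention stays within the trajectory
print(out.shape, weights.shape)       # (1, 16, 128) (1, 16, 16)
```

Attending over T trajectory tokens instead of the full frame grid is what keeps the cost low: attention is quadratic in sequence length, and a trajectory is far shorter than the set of all spatial tokens across frames.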

Mar 19, 2024
Abstract:Skeleton-aware sign language recognition (SLR) has gained popularity due to its ability to remain unaffected by background information and its lower computational requirements. Current methods utilize spatial graph modules and temporal modules to capture spatial and temporal features, respectively. However, their spatial graph modules are typically built on fixed graph structures such as graph convolutional networks or a single learnable graph, which only partially explore joint relationships. Additionally, a simple temporal convolution kernel is used to capture temporal information, which may not fully capture the complex movement patterns of different signers. To overcome these limitations, we propose a new spatial architecture consisting of two concurrent branches, which build input-sensitive joint relationships and incorporate specific domain knowledge for recognition, respectively. These two branches are followed by an aggregation process to distinguish important joint connections. We then propose a new temporal module to model multi-scale temporal information and capture complex human dynamics. Our method achieves state-of-the-art accuracy compared to previous skeleton-aware methods on four large-scale SLR benchmarks. Moreover, our method demonstrates superior accuracy compared to RGB-based methods in most cases while requiring much fewer computational resources, yielding a better accuracy-computation trade-off. Code is available at https://github.com/hulianyuyy/DSTA-SLR.
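For illustration, a minimal PyTorch sketch of a multi-scale temporal module: parallel 1-D temporal convolutions with different dilations, concatenated channel-wise so the output keeps the input width. Branch count and kernel sizes are assumptions; see the linked repository for the released architecture.

```python
# Hypothetical multi-scale temporal module over per-frame skeleton features.

import torch
import torch.nn as nn

class MultiScaleTemporal(nn.Module):
    def __init__(self, channels: int, dilations=(1, 2, 4)):
        super().__init__()
        branch_ch = channels // len(dilations)
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, branch_ch, kernel_size=3,
                      padding=d, dilation=d)  # padding=d keeps length with k=3
            for d in dilations)

    def forward(self, x):
        # x: (batch, channels, time), features pooled over joints for simplicity
        return torch.cat([b(x) for b in self.branches], dim=1)

x = torch.randn(2, 96, 64)              # 2 clips, 96 channels, 64 frames
print(MultiScaleTemporal(96)(x).shape)  # torch.Size([2, 96, 64])
```

Each dilation widens a branch's temporal receptive field (3, 5, and 9 frames here), so fast finger motion and slower arm movement are captured at matching time scales.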
