Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zezhi Tang

SignVLA: Real-Time Sign Language-Guided Robotic Manipulation via Attention LSTM and Vision-Language-Action Models

Jun 18, 2026

Ningwei Bai, Xinyu Tan, Harry Gardner, Zhengyang Zhong, Liuhaichen Yang, Luoyu Zhang, Zhekai Duan, Monkgogi Galeitsiwe, Zezhi Tang

Abstract:Vision-Language-Action (VLA) models enable robots to execute manipulation tasks from natural-language instructions grounded in visual observations. However, existing VLA interfaces primarily rely on speech or text input, limiting accessibility for deaf, hard-of-hearing, and speech-impaired users. We present SignVLA, a real-time sign-language-guided VLA framework for accessible human-robot interaction. The system introduces a modular sign-to-text interface that converts visual sign gestures into semantic instructions compatible with downstream VLA policies. Given video streams, SignVLA extracts hand landmark features and employs an attention-enhanced Long Short-Term Memory (LSTM) network to capture temporal gesture dynamics for alphabet- and command-level sign recognition. A temporal stabilization module further improves prediction consistency in real-time interaction settings.The generated instruction sequence is then passed to a downstream VLA policy for sign-conditioned robotic manipulation. Experimental results demonstrate stable real-time sign recognition and successful execution of manipulation tasks driven by sign-language inputs. Our findings suggest that lightweight temporal sign recognition can serve as an effective and practical accessibility layer for multimodal embodied intelligence.

Via

Access Paper or Ask Questions

SignVLA: A Gloss-Free Vision-Language-Action Framework for Real-Time Sign Language-Guided Robotic Manipulation

Feb 26, 2026

Xinyu Tan, Ningwei Bai, Harry Gardener, Zhengyang Zhong, Luoyu Zhang, Liuhaichen Yang, Zhekai Duan, Monkgogi Galeitsiwe, Zezhi Tang

Abstract:We present, to our knowledge, the first sign language-driven Vision-Language-Action (VLA) framework for intuitive and inclusive human-robot interaction. Unlike conventional approaches that rely on gloss annotations as intermediate supervision, the proposed system adopts a gloss-free paradigm and directly maps visual sign gestures to semantic instructions. This design reduces annotation cost and avoids the information loss introduced by gloss representations, enabling more natural and scalable multimodal interaction. In this work, we focus on a real-time alphabet-level finger-spelling interface that provides a robust and low-latency communication channel for robotic control. Compared with large-scale continuous sign language recognition, alphabet-level interaction offers improved reliability, interpretability, and deployment feasibility in safety-critical embodied environments. The proposed pipeline transforms continuous gesture streams into coherent language commands through geometric normalization, temporal smoothing, and lexical refinement, ensuring stable and consistent interaction. Furthermore, the framework is designed to support future integration of transformer-based gloss-free sign language models, enabling scalable word-level and sentence-level semantic understanding. Experimental results demonstrate the effectiveness of the proposed system in grounding sign-derived instructions into precise robotic actions under diverse interaction scenarios. These results highlight the potential of the framework to advance accessible, scalable, and multimodal embodied intelligence.

* 7 pages, 2 figures

Via

Access Paper or Ask Questions

A temporal scale transformer framework for precise remaining useful life prediction in fuel cells

Apr 08, 2025

Zezhi Tang, Xiaoyu Chen, Xin Jin, Benyuan Zhang, Wenyu Liang

Abstract:In exploring Predictive Health Management (PHM) strategies for Proton Exchange Membrane Fuel Cells (PEMFC), the Transformer model, widely used in data-driven approaches, excels in many fields but struggles with time series analysis due to its self-attention mechanism, which yields a complexity of the input sequence squared and low computational efficiency. It also faces challenges in capturing both global long-term dependencies and local details effectively. To tackle this, we propose the Temporal Scale Transformer (TSTransformer), an enhanced version of the inverted Transformer (iTransformer). Unlike traditional Transformers that treat each timestep as an input token, TSTransformer maps sequences of varying lengths into tokens at different stages for inter-sequence modeling, using attention to capture multivariate correlations and feed-forward networks (FFN) to encode sequence representations. By integrating a one-dimensional convolutional layer into the multivariate attention for multi-level scaling of K and V matrices, it improves local feature extraction, captures temporal scale characteristics, and reduces token count and computational costs. Experiments comparing TSTransformer with models like Long Short-Term Memory, iTransformer, and Transformer demonstrate its potential as a powerful tool for advancing PHM in renewable energy, effectively addressing the limitations of pure Transformer models in data-driven time series tasks.

* 7 figs, 10 pages

Via

Access Paper or Ask Questions

Real-time object detection and robotic manipulation for agriculture using a YOLO-based learning approach

Jan 28, 2024

Hongyu Zhao, Zezhi Tang, Zhenhong Li, Yi Dong, Yuancheng Si, Mingyang Lu, George Panoutsos

Abstract:The optimisation of crop harvesting processes for commonly cultivated crops is of great importance in the aim of agricultural industrialisation. Nowadays, the utilisation of machine vision has enabled the automated identification of crops, leading to the enhancement of harvesting efficiency, but challenges still exist. This study presents a new framework that combines two separate architectures of convolutional neural networks (CNNs) in order to simultaneously accomplish the tasks of crop detection and harvesting (robotic manipulation) inside a simulated environment. Crop images in the simulated environment are subjected to random rotations, cropping, brightness, and contrast adjustments to create augmented images for dataset generation. The you only look once algorithmic framework is employed with traditional rectangular bounding boxes for crop localization. The proposed method subsequently utilises the acquired image data via a visual geometry group model in order to reveal the grasping positions for the robotic manipulation.

* 7 pages, 9 figures

Via

Access Paper or Ask Questions

Discrete-Time Stress Matrix-Based Formation Control of General Linear Multi-Agent Systems

Jan 10, 2024

Okechi Onuoha, Suleiman Kurawa, Zezhi Tang, Yi Dong

Abstract:This paper considers the distributed leader-follower stress-matrix-based affine formation control problem of discrete-time linear multi-agent systems with static and dynamic leaders. In leader-follower multi-agent formation control, the aim is to drive a set of agents comprising leaders and followers to form any desired geometric pattern and simultaneously execute any required manoeuvre by controlling only a few agents denoted as leaders. Existing works in literature are mostly limited to the cases where the agents' inter-agent communications are either in the continuous-time settings or the sampled-data cases where the leaders are constrained to constant (or zero) velocities or accelerations. Here, we relax these constraints and study the discrete-time cases where the leaders can have stationary or time-varying velocities. We propose control laws in the study of different situations and provide some sufficient conditions to guarantee the overall system stability. Simulation study is used to demonstrate the efficacy of our proposed control laws.

Via

Access Paper or Ask Questions