Min Wang

SEDS: Semantically Enhanced Dual-Stream Encoder for Sign Language Retrieval

Jul 23, 2024

Longtao Jiang, Min Wang, Zecheng Li, Yao Fang, Wengang Zhou, Houqiang Li

Abstract: Unlike traditional video retrieval, sign language retrieval depends more on understanding the semantic information of the human actions contained in video clips. Previous works typically encode only RGB videos to obtain high-level semantic features, so local action details are drowned in a large amount of redundant visual information. Furthermore, existing RGB-based sign retrieval works suffer from the huge memory cost of dense visual data embedding in end-to-end training and adopt an offline RGB encoder instead, leading to suboptimal feature representation. To address these issues, we propose a novel sign language representation framework called Semantically Enhanced Dual-Stream Encoder (SEDS), which integrates Pose and RGB modalities to represent the local and global information of sign language videos. Specifically, the Pose encoder embeds the coordinates of keypoints corresponding to human joints, effectively capturing detailed action features. For better context-aware fusion of the two video modalities, we propose a Cross Gloss Attention Fusion (CGAF) module to aggregate adjacent clip features with similar semantic information, both intra-modality and inter-modality. Moreover, a Pose-RGB Fine-grained Matching Objective is developed to enhance the aggregated fusion feature by contextual matching of fine-grained dual-stream features. Apart from the offline RGB encoder, the whole framework contains only learnable lightweight networks, which can be trained end-to-end. Extensive experiments demonstrate that our framework significantly outperforms state-of-the-art methods on various datasets.

* Accepted to ACM International Conference on Multimedia (MM) 2024

DTR: A Unified Deep Tensor Representation Framework for Multimedia Data Recovery

Jul 07, 2024

Ting-Wei Zhou, Xi-Le Zhao, Jian-Li Wang, Yi-Si Luo, Min Wang, Xiao-Xuan Bai, Hong Yan

Abstract: Recently, the transform-based tensor representation, which consists of two indispensable components (transform and characterization), has attracted increasing attention in multimedia data (e.g., image and video) recovery problems. Previously, the development of transform-based tensor representation has mainly focused on the transform aspect. Although several attempts use shallow matrix factorization (e.g., singular value decomposition and nonnegative matrix factorization) to characterize the frontal slices of the transformed tensor (termed the latent tensor), faithful characterization remains underexplored. To address this issue, we propose a unified Deep Tensor Representation (DTR) framework that synergistically combines a deep latent generative module and a deep transform module. In particular, the deep latent generative module can generate the latent tensor more faithfully than shallow matrix factorization. The new DTR framework not only allows us to better understand the classic shallow representations, but also leads us to explore new representations. To examine the representation ability of the proposed DTR, we consider the representative multi-dimensional data recovery task and suggest an unsupervised DTR-based multi-dimensional data recovery model. Extensive experiments demonstrate that DTR achieves superior performance compared to state-of-the-art methods both quantitatively and qualitatively, especially in recovering fine details.

MASA: Motion-aware Masked Autoencoder with Semantic Alignment for Sign Language Recognition

May 31, 2024

Weichao Zhao, Hezhen Hu, Wengang Zhou, Yunyao Mao, Min Wang, Houqiang Li

Abstract: Sign language recognition (SLR) has long been plagued by insufficient model representation capabilities. Although current pre-training approaches have alleviated this dilemma to some extent and yielded promising performance by employing various pretext tasks on sign pose data, these methods still suffer from two primary limitations: 1) Explicit motion information is usually disregarded in previous pretext tasks, leading to partial information loss and limited representation capability. 2) Previous methods focus on the local context of a sign pose sequence, without incorporating the guidance of the global meaning of lexical signs. To this end, we propose a Motion-Aware masked autoencoder with Semantic Alignment (MASA) that integrates rich motion cues and global semantic information in a self-supervised learning paradigm for SLR. Our framework contains two crucial components, i.e., a motion-aware masked autoencoder (MA) and a momentum semantic alignment module (SA). Specifically, in MA, we introduce an autoencoder architecture with a motion-aware masked strategy to reconstruct motion residuals of masked frames, thereby explicitly exploring dynamic motion cues among sign pose sequences. Moreover, in SA, we embed our framework with global semantic awareness by aligning the embeddings of different augmented samples from the input sequence in the shared latent space. In this way, our framework can simultaneously learn local motion cues and global semantic features for comprehensive sign language representation. Furthermore, we conduct extensive experiments to validate the effectiveness of our method, achieving new state-of-the-art performance on four public benchmarks.

* Accepted by TCSVT 2024

Timeliness of Status Update System: The Effect of Parallel Transmission Using Heterogeneous Updating Devices

May 27, 2024

Zhengchuan Chen, Kang Lang, Nikolaos Pappas, Howard H. Yang, Min Wang, Zhong Tian, Tony Q. S. Quek

Abstract: Timely status updating is the premise of emerging interaction-based applications in the Internet of Things (IoT). Using redundant devices to update the status of interest is a promising method to improve the timeliness of information. However, parallel status updating leads to out-of-order arrivals at the monitor, significantly challenging timeliness analysis. This work studies the Age of Information (AoI) of a multi-queue status update system where multiple devices monitor the same physical process. Specifically, two systems are considered: the Basic System, which only has type-1 devices that are ad hoc devices located close to the source, and the Hybrid System, which, compared to the Basic System, additionally contains type-2 devices that are infrastructure-based devices located at fixed points. Using the Stochastic Hybrid Systems (SHS) framework, a mathematical model that combines discrete and continuous dynamics, we derive closed-form expressions for the average AoI of the two systems. Numerical results verify the accuracy of the analysis. It is shown that when the number and parameters of the type-1 devices/type-2 devices are fixed, the logarithm of the average AoI decreases linearly with the logarithm of the total arrival rate of type-2 devices or that of the number of type-1 devices under specific conditions. It is also demonstrated that the proposed systems can significantly outperform the FCFS M/M/N status update system.
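
To make the out-of-order issue concrete, here is a minimal sketch that computes the empirical time-average AoI along one sample path. It is not the closed-form SHS analysis from the abstract; it only uses the standard definition of age at the monitor (current time minus the generation time of the freshest update delivered so far). The names `average_aoi`, `updates`, `horizon`, and `initial_age` are hypothetical and chosen for illustration.

```python
def average_aoi(updates, horizon, initial_age=0.0):
    """Time-average Age of Information over [0, horizon].

    updates: iterable of (generation_time, delivery_time) pairs, in any order.
    initial_age: assumed age at time 0 (a hypothetical parameter).
    """
    if horizon <= 0:
        raise ValueError("horizon must be positive")
    freshest = -initial_age  # generation time of the freshest delivered update
    now = 0.0
    area = 0.0
    for gen, dlv in sorted(updates, key=lambda u: u[1]):
        if gen > dlv or dlv < 0:
            raise ValueError("need 0 <= delivery time and generation <= delivery")
        if dlv > horizon:
            break
        # Age grows linearly between deliveries: integrate (t - freshest) on [now, dlv].
        area += 0.5 * ((now - freshest) + (dlv - freshest)) * (dlv - now)
        now = dlv
        # A stale (out-of-order) update does not reduce the age.
        freshest = max(freshest, gen)
    # Remaining segment from the last processed delivery up to the horizon.
    area += 0.5 * ((now - freshest) + (horizon - freshest)) * (horizon - now)
    return area / horizon


in_order = [(1.0, 2.0), (3.0, 4.0)]
out_of_order = [(1.0, 5.0), (2.0, 3.0)]  # the older update arrives last
print(average_aoi(in_order, horizon=4.0))      # 1.5
print(average_aoi(out_of_order, horizon=6.0))  # 2.0
```

In the second example the update generated at time 1 arrives after the one generated at time 2, so it leaves the age unchanged; this is the effect that complicates the analysis of parallel updating.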

Autoencoder-assisted Feature Ensemble Net for Incipient Faults

Apr 22, 2024

Mingxuan Gao, Min Wang, Maoyin Chen

Abstract: Deep learning has shown great power in the field of fault detection. However, for incipient faults with tiny amplitude, the detection performance of current deep learning networks (DLNs) is not satisfactory. Even if prior information about the faults is utilized, DLNs cannot successfully detect faults 3, 9 and 15 in the Tennessee Eastman process (TEP). These faults are notoriously difficult to detect, and the field lacks effective technologies for detecting them. In this work, we propose Autoencoder-assisted Feature Ensemble Net (AE-FENet): a deep feature ensemble framework that uses an unsupervised autoencoder to conduct the feature transformation. Compared with the principal component analysis (PCA) technique adopted in the original Feature Ensemble Net (FENet), the autoencoder can mine more exact features of incipient faults, which results in the better detection performance of AE-FENet. With the same kinds of basic detectors, AE-FENet achieves a state-of-the-art average accuracy over 96% on faults 3, 9 and 15 in TEP, which represents a significant enhancement in performance compared to other methods. Extensive experiments have been conducted to extend our framework, showing that DLNs can be utilized efficiently within this architecture.

Token-Efficient Leverage Learning in Large Language Models

Apr 01, 2024

Yuanhao Zeng, Min Wang, Yihang Wang, Yingxia Shao

Abstract: Large Language Models (LLMs) have excelled in various tasks but perform better in high-resource scenarios, which presents challenges in low-resource scenarios. Data scarcity and the inherent difficulty of adapting LLMs to specific tasks compound the challenge. To address the twin hurdles, we introduce Leverage Learning. We present a streamlined implementation of this methodology called Token-Efficient Leverage Learning (TELL). TELL showcases the potential of Leverage Learning, demonstrating effectiveness across various LLMs and low-resource tasks, ranging from $10^4$ to $10^6$ tokens. It reduces task data requirements by up to nearly an order of magnitude compared to conventional Supervised Fine-Tuning (SFT) while delivering competitive performance. With the same amount of task data, TELL leads in improving task performance compared to SFT. We discuss the mechanism of Leverage Learning, suggesting that it aligns with the quantization hypothesis, and explore its promising potential through empirical testing.

* 15 pages, 16 figures

GaussNav: Gaussian Splatting for Visual Navigation

Mar 20, 2024

Xiaohan Lei, Min Wang, Wengang Zhou, Houqiang Li

Abstract: In embodied vision, Instance ImageGoal Navigation (IIN) requires an agent to locate a specific object depicted in a goal image within an unexplored environment. The primary difficulty of IIN stems from the necessity of recognizing the target object across varying viewpoints and rejecting potential distractors. Existing map-based navigation methods largely adopt Bird's Eye View (BEV) maps as their representation, which, however, lack detailed scene textures. To address the above issues, we propose a new Gaussian Splatting Navigation (abbreviated as GaussNav) framework for the IIN task, which constructs a novel map representation based on 3D Gaussian Splatting (3DGS). The proposed framework enables the agent to not only memorize the geometry and semantic information of the scene, but also retain the textural features of objects. Our GaussNav framework demonstrates a significant leap in performance, evidenced by an increase in Success weighted by Path Length (SPL) from 0.252 to 0.578 on the challenging Habitat-Matterport 3D (HM3D) dataset. Our code will be made publicly available.
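
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how Success weighted by Path Length is commonly computed: each successful episode contributes the ratio of the shortest-path length to the longer of the shortest-path length and the path actually taken, failed episodes contribute zero, and the result is averaged over all episodes. The function name `spl` and the episode tuple layout are assumptions made for illustration; this is not the GaussNav evaluation code.

```python
def spl(episodes):
    """Success weighted by Path Length, averaged over episodes.

    episodes: iterable of (success, shortest_len, path_len) where
      success      -- True if the agent reached the goal,
      shortest_len -- length of the shortest path from start to goal (> 0),
      path_len     -- length of the path the agent actually travelled.
    """
    total = 0.0
    count = 0
    for success, shortest_len, path_len in episodes:
        if shortest_len <= 0:
            raise ValueError("shortest_len must be positive")
        count += 1
        if success:
            # An optimal path scores 1; longer paths are penalized proportionally.
            total += shortest_len / max(path_len, shortest_len)
    if count == 0:
        raise ValueError("at least one episode is required")
    return total / count


episodes = [
    (True, 10.0, 10.0),   # success along the shortest path -> contributes 1.0
    (True, 10.0, 20.0),   # success, but path twice as long  -> contributes 0.5
    (False, 10.0, 5.0),   # failure                          -> contributes 0.0
]
print(spl(episodes))  # 0.5
```

Because failures score zero and detours are penalized, SPL rewards agents that both reach the goal and do so efficiently.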

* conference

Motion-aware 3D Gaussian Splatting for Efficient Dynamic Scene Reconstruction

Mar 18, 2024

Zhiyang Guo, Wengang Zhou, Li Li, Min Wang, Houqiang Li

Abstract: 3D Gaussian Splatting (3DGS) has become an emerging tool for dynamic scene reconstruction. However, existing methods focus mainly on extending static 3DGS into a time-variant representation, while overlooking the rich motion information carried by 2D observations, thus suffering from performance degradation and model redundancy. To address the above problem, we propose a novel motion-aware enhancement framework for dynamic scene reconstruction, which mines useful motion cues from optical flow to improve different paradigms of dynamic 3DGS. Specifically, we first establish a correspondence between 3D Gaussian movements and pixel-level flow. Then a novel flow augmentation method is introduced with additional insights into uncertainty and loss collaboration. Moreover, for the prevalent deformation-based paradigm that presents a harder optimization problem, a transient-aware deformation auxiliary module is proposed. We conduct extensive experiments on both multi-view and monocular scenes to verify the merits of our work. Compared with the baselines, our method shows significant superiority in both rendering quality and efficiency.

DEMOS: Dynamic Environment Motion Synthesis in 3D Scenes via Local Spherical-BEV Perception

Mar 04, 2024

Jingyu Gong, Min Wang, Wentao Liu, Chen Qian, Zhizhong Zhang, Yuan Xie, Lizhuang Ma

Abstract: Motion synthesis in real-world 3D scenes has recently attracted much attention. However, the static environment assumption made by most current methods usually cannot be satisfied, especially for real-time motion synthesis in scanned point cloud scenes where multiple dynamic objects exist, e.g., moving persons or vehicles. To handle this problem, we propose the first Dynamic Environment MOtion Synthesis framework (DEMOS) to predict future motion instantly according to the current scene, and use it to dynamically update the latent motion for final motion synthesis. Concretely, we propose a Spherical-BEV perception method to extract local scene features that are specifically designed for instant scene-aware motion prediction. Then, we design a time-variant motion blending to fuse the newly predicted motions into the latent motion, and the final motion is derived from the updated latent motions, benefitting from both motion-prior and iterative methods. We unify the data format of two prevailing datasets, PROX and GTA-IM, and use them for motion synthesis evaluation in 3D scenes. We also assess the effectiveness of the proposed method in dynamic environments from GTA-IM and Semantic3D to check its responsiveness. The results show that our method significantly outperforms previous works and performs well in handling dynamic environments.

Image2Sentence based Asymmetrical Zero-shot Composed Image Retrieval

Mar 03, 2024

Yongchao Du, Min Wang, Wengang Zhou, Shuping Hui, Houqiang Li

Abstract: The task of composed image retrieval (CIR) aims to retrieve images based on a query image and text describing the user's intent. Existing methods have made great progress on the CIR task with advanced large vision-language (VL) models. However, they generally suffer from two main issues: a lack of labeled triplets for model training and the difficulty of deploying the large vision-language model in resource-restricted environments. To tackle the above problems, we propose Image2Sentence based Asymmetric zero-shot composed image retrieval (ISA), which takes advantage of the VL model and relies only on unlabeled images for composition learning. In the framework, we propose a new adaptive token learner that maps an image to a sentence in the word embedding space of the VL model. The sentence adaptively captures discriminative visual information and is further integrated with the text modifier. An asymmetric structure is devised for flexible deployment, in which a lightweight model is adopted on the query side while the large VL model is deployed on the gallery side. Global contrastive distillation and local alignment regularization are adopted to align the lightweight model with the VL model for the CIR task. Our experiments demonstrate that the proposed ISA copes better with real retrieval scenarios and further improves retrieval accuracy and efficiency.

* ICLR 2024 spotlight
