Li Tang

Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks

Nov 09, 2023
Haoyi Duan, Yan Xia, Mingze Zhou, Li Tang, Jieming Zhu, Zhou Zhao

In recent years, the deployment of large-scale pre-trained models in audio-visual downstream tasks has yielded remarkable outcomes. However, these models, primarily trained on single-modality unconstrained datasets, still encounter challenges in feature extraction for multi-modal tasks, leading to suboptimal performance. This limitation arises due to the introduction of irrelevant modality-specific information during encoding, which adversely affects the performance of downstream tasks. To address this challenge, this paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism. This mechanism leverages audio and visual modalities as soft prompts to dynamically adjust the parameters of pre-trained models based on the current multi-modal input features. Specifically, the DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders, allowing adaptive extraction of crucial information from the current modality across spatial, channel, and temporal dimensions, while preserving the frozen parameters of large-scale pre-trained models. Experimental evaluations demonstrate that our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA. Furthermore, our model exhibits promising performance in challenging few-shot and zero-shot scenarios. The source code and pre-trained models are available at https://github.com/haoyi-duan/DG-SCT.

* Accepted to NeurIPS 2023 
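To make the cross-modal prompting idea concrete, the following is a minimal, illustrative sketch (not the authors' DG-SCT implementation; the module name AudioGuidedSCT, the feature shapes, and the gating layers are all assumptions): a small trainable adapter uses audio features to re-weight frozen visual features along the channel, spatial, and temporal dimensions.

```python
# Minimal sketch (not the authors' code): audio-guided attention that re-weights
# frozen visual features along channel, spatial, and temporal dimensions.
import torch
import torch.nn as nn

class AudioGuidedSCT(nn.Module):
    """Hypothetical adapter: audio prompts modulate frozen visual features."""
    def __init__(self, vis_dim: int, aud_dim: int):
        super().__init__()
        self.channel_gate = nn.Sequential(nn.Linear(aud_dim, vis_dim), nn.Sigmoid())
        self.spatial_query = nn.Linear(aud_dim, vis_dim)
        self.temporal_gate = nn.Sequential(nn.Linear(aud_dim, 1), nn.Sigmoid())

    def forward(self, vis, aud):
        # vis: [B, T, C, H, W] frozen visual features; aud: [B, T, D] audio features
        B, T, C, H, W = vis.shape
        ch = self.channel_gate(aud).view(B, T, C, 1, 1)                     # channel attention
        q = self.spatial_query(aud).view(B, T, C, 1, 1)                     # audio query per frame
        sp = torch.softmax((vis * q).sum(2, keepdim=True).flatten(3), dim=-1).view(B, T, 1, H, W)
        tm = self.temporal_gate(aud).view(B, T, 1, 1, 1)                    # temporal attention
        return vis * ch * sp * tm                                           # modulated features

vis = torch.randn(2, 10, 256, 7, 7)
aud = torch.randn(2, 10, 128)
print(AudioGuidedSCT(256, 128)(vis, aud).shape)  # torch.Size([2, 10, 256, 7, 7])
```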

Gloss Attention for Gloss-free Sign Language Translation

Jul 14, 2023
Aoxiong Yin, Tianyun Zhong, Li Tang, Weike Jin, Tao Jin, Zhou Zhao

Most sign language translation (SLT) methods to date require gloss annotations to provide additional supervision; however, glosses are not easy to acquire. To address this problem, we first analyze existing models to understand how gloss annotations make SLT easier. We find that they provide the model with two kinds of information: 1) they help the model implicitly learn the location of semantic boundaries in continuous sign language videos, and 2) they help the model understand the sign language video globally. We then propose \emph{gloss attention}, which enables the model to keep its attention within video segments that share the same local semantics, just as glosses help existing models do. Furthermore, we transfer knowledge of sentence-to-sentence similarity from a natural language model to our gloss attention SLT network (GASLT) to help it understand sign language videos at the sentence level. Experimental results on multiple large-scale sign language datasets show that our proposed GASLT model significantly outperforms existing methods. Our code is provided in \url{https://github.com/YinAoXiong/GASLT}.
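A rough sketch of the locality idea behind gloss attention, under assumed shapes and a hand-picked window size (this is not the GASLT code; the real mechanism is learned rather than a fixed band): a banded attention mask keeps each frame attending to a local segment, mimicking the boundaries that gloss annotations would otherwise supply.

```python
# Minimal sketch (assumptions, not the GASLT implementation): a banded attention
# mask that keeps each frame's attention within a local temporal window.
import torch

def local_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    """True where attention is allowed: |i - j| <= window."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window

def gloss_style_attention(q, k, v, window: int = 4):
    # q, k, v: [B, T, D] frame-level features
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5      # [B, T, T]
    mask = local_attention_mask(q.shape[1], window).to(q.device)
    scores = scores.masked_fill(~mask, float("-inf"))          # block distant frames
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 16, 64)
print(gloss_style_attention(q, k, v).shape)  # torch.Size([2, 16, 64])
```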

Connecting Multi-modal Contrastive Representations

May 22, 2023
Zehan Wang, Yang Zhao, Xize Cheng, Haifeng Huang, Jiageng Liu, Li Tang, Linjun Li, Yongqi Wang, Aoxiong Yin, Ziang Zhang, Zhou Zhao

Multi-modal Contrastive Representation (MCR) learning aims to encode different modalities into a semantically aligned shared space. This paradigm shows remarkable generalization ability on numerous downstream tasks across various modalities. However, the reliance on massive amounts of high-quality data pairs limits its further development for more modalities. This paper proposes a novel training-efficient method for learning MCR without paired data, called Connecting Multi-modal Contrastive Representations (C-MCR). Specifically, given two existing MCRs pre-trained on (A, B) and (B, C) modality pairs, we project them into a new space and use the data from the overlapping modality B to align the two MCRs in the new space. Meanwhile, since the modality pairs (A, B) and (B, C) are already aligned within each MCR, the connection learned from the overlapping modality can also be transferred to the non-overlapping modality pair (A, C). To unleash the potential of C-MCR, we further introduce a semantic-enhanced inter- and intra-MCR connection method. We first enhance the semantic consistency and completeness of embeddings across different modalities for more robust alignment. Then we utilize the inter-MCR alignment to establish the connection, and employ the intra-MCR alignment to better maintain the connection for inputs from non-overlapping modalities. We take the field of audio-visual contrastive learning as an example to demonstrate the effectiveness of C-MCR. We connect pre-trained CLIP and CLAP models via texts to derive audio-visual contrastive representations. Remarkably, without using any paired audio-visual data or further tuning, C-MCR achieves state-of-the-art performance on six datasets across three audio-visual downstream tasks.

* Demos are available at \url{https://c-mcr.github.io/C-MCR/} 
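A minimal sketch of the connection step, assuming 512-dimensional text embeddings from the two frozen models and a simple InfoNCE-style objective (the projection sizes and function names are illustrative, not the released C-MCR code): only the two small projection heads are trained, using the shared text modality.

```python
# Minimal sketch (illustrative only): connect two frozen contrastive spaces via
# their shared text modality with small projection heads and an InfoNCE loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

proj_a = nn.Linear(512, 256)   # projects CLIP-side embeddings (dim assumed 512)
proj_b = nn.Linear(512, 256)   # projects CLAP-side embeddings (dim assumed 512)

def connect_loss(text_emb_clip, text_emb_clap, temperature=0.07):
    """Inter-MCR alignment: the same caption, embedded by both frozen models,
    should land close together in the new shared space."""
    za = F.normalize(proj_a(text_emb_clip), dim=-1)
    zb = F.normalize(proj_b(text_emb_clap), dim=-1)
    logits = za @ zb.t() / temperature
    targets = torch.arange(za.shape[0])
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Dummy batch of paired text embeddings from the two frozen encoders
loss = connect_loss(torch.randn(8, 512), torch.randn(8, 512))
loss.backward()  # only proj_a / proj_b receive gradients
```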

Time Interval-enhanced Graph Neural Network for Shared-account Cross-domain Sequential Recommendation

Jun 16, 2022
Lei Guo, Jinyu Zhang, Li Tang, Tong Chen, Lei Zhu, Hongzhi Yin

The Shared-account Cross-domain Sequential Recommendation (SCSR) task aims to recommend the next item by leveraging the mixed user behaviors in multiple domains. It is gaining immense research attention as more and more users tend to sign up on different platforms and share accounts with others to access domain-specific services. Existing works on SCSR mainly rely on mining sequential patterns via Recurrent Neural Network (RNN)-based models, which suffer from the following limitations: 1) RNN-based methods overwhelmingly target discovering sequential dependencies in single-user behaviors and are not expressive enough to capture the relationships among multiple entities in SCSR. 2) All existing methods bridge two domains via knowledge transfer in the latent space and ignore the explicit cross-domain graph structure. 3) No existing studies consider the time interval information among items, which is essential in sequential recommendation for characterizing different items and learning discriminative representations for them. In this work, we propose a new graph-based solution, namely TiDA-GCN, to address the above challenges. Specifically, we first link users and items in each domain as a graph. Then, we devise a domain-aware graph convolution network to learn user-specific node representations. To fully account for users' domain-specific preferences on items, two effective attention mechanisms are further developed to selectively guide the message passing process. Moreover, to further enhance item- and account-level representation learning, we incorporate the time interval into the message passing, and design an account-aware self-attention module for learning items' interactive characteristics. Experiments demonstrate the superiority of our proposed method from various aspects.

* 15 pages, 6 figures 
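As an illustration of how a time interval can enter message passing (hypothetical module and shapes, not the TiDA-GCN code): each item-to-account message is modulated by an embedding of the bucketed time gap of the interaction before aggregation.

```python
# Minimal sketch (hypothetical, not the TiDA-GCN code): messages on item->user
# edges are scaled by an embedding of the bucketed time interval.
import torch
import torch.nn as nn

class TimeIntervalMessagePassing(nn.Module):
    def __init__(self, dim: int, num_buckets: int = 64):
        super().__init__()
        self.interval_emb = nn.Embedding(num_buckets, dim)   # bucketed time gaps
        self.lin = nn.Linear(dim, dim)

    def forward(self, item_feats, edge_src, edge_dst, interval_bucket, num_users):
        # item_feats: [N_items, D]; edges go item(edge_src) -> user(edge_dst)
        msg = self.lin(item_feats[edge_src]) * self.interval_emb(interval_bucket)
        user_feats = torch.zeros(num_users, item_feats.shape[1])
        user_feats.index_add_(0, edge_dst, msg)              # aggregate messages per user
        return user_feats

mp = TimeIntervalMessagePassing(dim=32)
items = torch.randn(100, 32)
src = torch.randint(0, 100, (500,))
dst = torch.randint(0, 10, (500,))
gaps = torch.randint(0, 64, (500,))
print(mp(items, src, dst, gaps, num_users=10).shape)  # torch.Size([10, 32])
```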

DA-GCN: A Domain-aware Attentive Graph Convolution Network for Shared-account Cross-domain Sequential Recommendation

May 07, 2021
Lei Guo, Li Tang, Tong Chen, Lei Zhu, Quoc Viet Hung Nguyen, Hongzhi Yin

Shared-account Cross-domain Sequential Recommendation (SCSR) is the task of recommending the next item based on a sequence of recorded user behaviors, where multiple users share a single account and their behaviors are available in multiple domains. Existing work on SCSR mainly relies on mining sequential patterns via RNN-based models, which are not expressive enough to capture the relationships among multiple entities. Moreover, all existing algorithms try to bridge two domains via knowledge transfer in the latent space, leaving the explicit cross-domain graph structure unexploited. In this work, we propose a novel graph-based solution, namely DA-GCN, to address the above challenges. Specifically, we first link users and items in each domain as a graph. Then, we devise a domain-aware graph convolution network to learn user-specific node representations. To fully account for users' domain-specific preferences on items, two novel attention mechanisms are further developed to selectively guide the message passing process. Extensive experiments on two real-world datasets demonstrate the superiority of our DA-GCN method.

* The 30th International Joint Conference on Artificial Intelligence, 2021  
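A minimal sketch of domain-aware attentive aggregation under assumed shapes (the names are illustrative, not the DA-GCN implementation): a per-domain attention head weights an account's item neighbors before they are aggregated into the account representation.

```python
# Minimal sketch (illustrative): one attention head per domain selectively
# weights an account's item neighbors during aggregation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DomainAwareAttention(nn.Module):
    def __init__(self, dim: int, num_domains: int = 2):
        super().__init__()
        self.attn = nn.ModuleList([nn.Linear(2 * dim, 1) for _ in range(num_domains)])

    def forward(self, account_vec, neighbor_feats, domain_id):
        # account_vec: [D]; neighbor_feats: [K, D] items the account interacted with
        pair = torch.cat([account_vec.expand_as(neighbor_feats), neighbor_feats], dim=-1)
        alpha = F.softmax(self.attn[domain_id](pair).squeeze(-1), dim=0)   # [K] edge weights
        return (alpha.unsqueeze(-1) * neighbor_feats).sum(0)               # attended aggregate

att = DomainAwareAttention(dim=16)
print(att(torch.randn(16), torch.randn(5, 16), domain_id=0).shape)  # torch.Size([16])
```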

Improving the generalization of network based relative pose regression: dimension reduction as a regularizer

Oct 24, 2020
Xiaqing Ding, Yue Wang, Li Tang, Yanmei Jiao, Rong Xiong

Visual localization occupies an important position in many areas such as augmented reality, robotics, and 3D reconstruction. State-of-the-art visual localization methods perform pose estimation using geometry-based solvers within the RANSAC framework. However, these methods require accurate pixel-level matching at high image resolution, which is hard to satisfy under significant changes in appearance, dynamics, or viewpoint. End-to-end learning-based regression networks circumvent the requirement for precise pixel-level correspondences but generalize poorly across scenes. In this paper, we explicitly add a learnable matching layer within the network to isolate the pose regression solver from absolute image feature values, and apply dimension regularization on both the correlation feature channels and the image scale to further improve generalization and robustness to large viewpoint changes. We implement this dimension regularization strategy within a two-layer pyramid-based framework to regress the localization results from coarse to fine. In addition, depth information is fused for absolute translational scale recovery. Through experiments on real-world RGB-D datasets, we validate the effectiveness of our design in improving both generalization performance and robustness to viewpoint change, and also show the potential of regression-based visual localization networks in challenging cases that are difficult for geometry-based visual localization methods.
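A small illustrative sketch of the two ideas, with assumed feature shapes and a quaternion-based pose output (not the paper's network): a correlation layer isolates the regressor from absolute feature values, and a 1x1 convolution reduces the correlation channel dimension before the relative-pose head.

```python
# Minimal sketch (illustrative, assumed shapes): correlation layer + channel
# "dimension reduction" regularizer + relative pose regression head.
import torch
import torch.nn as nn

class CorrPoseRegressor(nn.Module):
    def __init__(self, feat_hw: int = 16, reduced: int = 32):
        super().__init__()
        # Project the H*W correlation channels down to a small dimension
        self.reduce = nn.Conv2d(feat_hw * feat_hw, reduced, kernel_size=1)
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(reduced, 7))   # translation (3) + quaternion (4)

    def forward(self, f1, f2):
        # f1, f2: [B, C, H, W] conv features of the two images
        B, C, H, W = f1.shape
        corr = torch.einsum('bci,bcj->bij', f1.flatten(2), f2.flatten(2)) / C ** 0.5
        corr = corr.view(B, H * W, H, W)      # all-pairs correlation, channels = H*W
        return self.head(self.reduce(corr))   # relative pose: [B, 7]

f1, f2 = torch.randn(2, 64, 16, 16), torch.randn(2, 64, 16, 16)
print(CorrPoseRegressor()(f1, f2).shape)  # torch.Size([2, 7])
```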

Radar-on-Lidar: metric radar localization on prior lidar maps

May 10, 2020
Huan Yin, Yue Wang, Li Tang, Rong Xiong

Radar and lidar are two different range sensors, each with pros and cons for various perception tasks on mobile robots or in autonomous driving. In this paper, a Monte Carlo system is used to localize a robot carrying a rotating radar sensor on 2D lidar maps. We first train a conditional generative adversarial network to transfer raw radar data to lidar-like data and obtain reliable radar points from the generator. An efficient radar odometry is then included in the Monte Carlo system. Combined with the initial guess from odometry, a measurement model is proposed to match the radar data against the prior lidar maps for final 2D positioning. We demonstrate the effectiveness of the proposed localization framework on a public multi-session dataset. The experimental results show that our system can achieve high accuracy for long-term localization in outdoor scenes.

* Accepted by IEEE RCAR 2020 
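A minimal sketch of the measurement step of such a Monte Carlo system (illustrative only; the radar points are assumed to have already been made lidar-like by the generator): each particle pose is weighted by how well the points fall onto occupied cells of the prior lidar occupancy grid.

```python
# Minimal sketch (not the paper's code): particle weights updated by agreement
# between transformed sensor points and a prior lidar occupancy grid.
import numpy as np

def measurement_update(particles, weights, points, grid, resolution):
    """particles: [N, 3] (x, y, yaw); points: [M, 2] in sensor frame;
    grid: 2D occupancy array; resolution: metres per cell."""
    for i, (x, y, yaw) in enumerate(particles):
        c, s = np.cos(yaw), np.sin(yaw)
        world = points @ np.array([[c, -s], [s, c]]).T + np.array([x, y])
        cells = np.floor(world / resolution).astype(int)
        valid = ((cells >= 0) & (cells < np.array(grid.shape))).all(axis=1)
        hits = grid[cells[valid, 0], cells[valid, 1]].sum()
        weights[i] *= 1e-6 + hits / max(len(points), 1)    # likelihood ~ map agreement
    return weights / weights.sum()

grid = np.zeros((100, 100)); grid[40:60, 40:60] = 1.0      # toy prior lidar map
particles = np.column_stack([np.random.uniform(0, 10, (50, 2)),
                             np.random.uniform(-np.pi, np.pi, 50)])
weights = measurement_update(particles, np.ones(50) / 50, np.random.randn(200, 2), grid, 0.1)
print(weights.sum())  # ~1.0
```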

DeepGoal: Learning to Drive with driving intention from Human Control Demonstration

Nov 28, 2019
Huifang Ma, Yue Wang, Rong Xiong, Sarath Kodagoda, Li Tang

Recent research on autonomous driving has developed an efficient end-to-end learning mode that directly maps visual input to control commands. However, it models distinct driving variations in a single network, which increases learning complexity and is less adaptive for modular integration. In this paper, we re-investigate human driving style and propose to learn an intermediate driving-intention region to ease the difficulties of the end-to-end approach. The intention region follows both the road structure in the image and the direction towards the goal given by a public route planner, so it handles visual variations only and figures out where to go without conventional precise localization. The learned visual intention is then projected onto the vehicle's local coordinate frame and fused with reliable obstacle perception to render a navigation score map of the kind widely used for motion planning. The core of the proposed system is a weakly supervised cGAN-LSTM model trained to learn driving intention from human demonstration. The adversarial loss learns from limited demonstration data containing a single locally planned route and enables reasoning about multi-modal behavior with diverse routes at test time. Comprehensive experiments are conducted with real-world datasets. The results show that the proposed paradigm produces motion commands more consistent with human demonstration, and indicate better reliability and robustness to environmental change.
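A rough sketch of how such a navigation score map could be fused and consumed (the grid representation, penalty weight, and arc-based command selection are assumptions, not the paper's planner): the intention map is combined with obstacle perception, and candidate steering arcs are scored on the result.

```python
# Minimal sketch (assumed representation): fuse a bird's-eye-view intention map
# with obstacles into a score map, then pick the best constant-curvature arc.
import numpy as np

def navigation_score_map(intention_bev, obstacle_bev, obstacle_penalty=5.0):
    """Both inputs: [H, W] grids in the vehicle's local frame, values in [0, 1]."""
    return intention_bev - obstacle_penalty * obstacle_bev

def pick_steering(score_map, curvatures, resolution=0.1, length=8.0):
    """Score a few constant-curvature arcs on the map and return the best curvature."""
    best_k, best_score = None, -np.inf
    s = np.linspace(0.1, length, 40)                     # arc-length samples
    for k in curvatures:
        if abs(k) < 1e-6:
            x, y = s, np.zeros_like(s)
        else:
            x, y = np.sin(k * s) / k, (1.0 - np.cos(k * s)) / k
        rows = (x / resolution).astype(int)
        cols = (y / resolution).astype(int) + score_map.shape[1] // 2   # lateral offset
        ok = (rows >= 0) & (rows < score_map.shape[0]) & (cols >= 0) & (cols < score_map.shape[1])
        total = score_map[rows[ok], cols[ok]].sum()
        if total > best_score:
            best_k, best_score = k, total
    return best_k

score = navigation_score_map(np.random.rand(100, 100), np.zeros((100, 100)))
print(pick_steering(score, curvatures=[-0.2, 0.0, 0.2]))
```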

Towards navigation without precise localization: Weakly supervised learning of goal-directed navigation cost map

Jun 06, 2019
Huifang Ma, Yue Wang, Li Tang, Sarath Kodagoda, Rong Xiong

Autonomous navigation based on precise localization has been widely developed in both academic research and practical applications. The high demand for localization accuracy has been essential for safe robot planning and navigation, yet it makes current geometric solutions less robust to environmental changes. Recent research on end-to-end methods handles raw sensory data together with navigation instructions and directly outputs commands for robot control. However, the lack of intermediate semantics makes such systems more rigid and unstable for practical use. To explore these issues, this paper proposes an innovative navigation framework based on GPS-level localization, which takes raw perception data together with publicly accessible navigation maps to produce an intermediate navigation cost map that allows subsequent flexible motion planning. A deterministic conditional adversarial network is adopted in our method to generate visual goal-directed paths under diverse navigation conditions. The adversarial loss avoids pixel-level annotation and enables a weakly supervised training strategy that implicitly learns both the traffic semantics in image perception and the planning intentions in navigation instructions. The navigation cost map is then rendered from the goal-directed path and the concurrently collected laser data, indicating the way towards the destination. Comprehensive experiments have been conducted with a real vehicle running on our campus, and the results verify the proposed navigation system's robustness to localization error.
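A minimal sketch of rendering such a cost map from a predicted goal-directed path and a laser-derived obstacle mask (illustrative only; the grid sizes, inflation radius, and use of SciPy's distance transform are assumptions): cost grows with distance from the path and becomes infinite near obstacles.

```python
# Minimal sketch (illustrative): cost map = distance-to-path, with an inflated
# no-go zone around laser-detected obstacles.
import numpy as np
from scipy.ndimage import distance_transform_edt  # assumed available

def render_cost_map(path_mask, obstacle_mask, resolution=0.1, inflate_radius=0.5):
    """path_mask, obstacle_mask: boolean [H, W] grids in the local frame."""
    dist_to_path = distance_transform_edt(~path_mask) * resolution
    dist_to_obst = distance_transform_edt(~obstacle_mask) * resolution
    cost = dist_to_path.copy()
    cost[dist_to_obst < inflate_radius] = np.inf       # forbid inflated obstacle zone
    return cost

path = np.zeros((80, 80), bool); path[:, 40] = True    # straight path up the grid
obst = np.zeros((80, 80), bool); obst[30, 35:38] = True
cost = render_cost_map(path, obst)
print(np.isinf(cost).sum(), cost[0, 40])               # blocked cells, cost on the path (0.0)
```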

Spatiotemporal Decoupling Based LiDAR-Camera Calibration under Arbitrary Configurations

Mar 14, 2019
Bo Fu, Yue Wang, Yanmei Jiao, Xiaqing Ding, Li Tang, Rong Xiong

LiDAR-camera calibration is a precondition for many heterogeneous systems that fuse data from LiDAR and camera. However, the constraint of a common field of view and the requirement for strict time synchronization make the calibration a challenging problem. In this paper, we propose a hybrid LiDAR-camera calibration method aiming to solve these two difficulties. The configuration between LiDAR and camera is freed from the common-field-of-view constraint, as we move the camera to cover the scenario observed by the LiDAR. A 3D visual reconstruction of the environment can be obtained from the sequential images captured by the moving camera, which is later aligned with the single 3D laser scan captured when both the scene and the equipment are stationary. Under this design, our method further removes the influence of time synchronization between LiDAR and camera. Moreover, the extended field of view obtained by the moving camera improves the calibration accuracy. We derive the conditions of minimal observability for our method and discuss how different chessboard placements influence calibration accuracy, which can serve as a guideline for designing high-accuracy calibration procedures. We validate our method on both a simulation platform and real-world datasets. Experiments show that our method achieves higher accuracy than other comparable methods.
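The core alignment step, estimating the rigid transform between the visual reconstruction and the laser scan given 3D-3D correspondences, can be sketched with the standard Kabsch/Umeyama solution (a textbook routine, not the paper's full pipeline; the correspondences and scale recovery are assumed given).

```python
# Minimal sketch (standard Kabsch step): rigid alignment of corresponding 3D points
# from the visual reconstruction (src) and the LiDAR scan (dst).
import numpy as np

def rigid_align(src, dst):
    """src, dst: [N, 3] corresponding points. Returns R (3x3), t (3,) with dst ~= src @ R.T + t."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    U, _, Vt = np.linalg.svd(src_c.T @ dst_c)
    d = np.sign(np.linalg.det(Vt.T @ U.T))               # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst.mean(0) - R @ src.mean(0)
    return R, t

# Synthetic check: recover a known extrinsic transform
rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 3))
angle = 0.3
R_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                   [np.sin(angle),  np.cos(angle), 0],
                   [0, 0, 1]])
t_true = np.array([0.5, -0.2, 1.0])
R_est, t_est = rigid_align(pts, pts @ R_true.T + t_true)
print(np.allclose(R_est, R_true), np.allclose(t_est, t_true))  # True True
```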
