Xuxin Cheng

SSVMR: Saliency-based Self-training for Video-Music Retrieval

Feb 18, 2023
Xuxin Cheng, Zhihong Zhu, Hongxiang Li, Yaowei Li, Yuexian Zou

With the rise of short videos, the demand for selecting appropriate background music (BGM) for a video has increased significantly, and the video-music retrieval (VMR) task has gradually drawn much attention from the research community. Like other cross-modal learning tasks, existing VMR approaches usually attempt to measure the similarity between the video and music in a shared feature space. However, they (1) neglect the inevitable label noise and (2) do not enhance the ability to capture critical video clips. In this paper, we propose a novel saliency-based self-training framework, termed SSVMR. Specifically, we first make full use of the information contained in the training dataset by applying a semi-supervised method to suppress the adverse impact of label noise, adopting a self-training approach. In addition, we propose to capture the saliency of the video by mixing two videos at the span level while preserving the locality of the two original videos. Inspired by back translation in NLP, we also conduct back retrieval to obtain more training data. Experimental results on the MVD dataset show that SSVMR achieves state-of-the-art performance by a large margin, obtaining a relative improvement of 34.8% over the previous best model in terms of R@1.
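The span-level mixing idea lends itself to a compact sketch. Below is a minimal, hypothetical PyTorch version operating on frame-feature sequences; the (T, D) tensor layout and the span_ratio parameter are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch of span-level video mixing: a contiguous span of
# frames from one clip replaces the aligned span in another, so each
# original clip keeps its temporal locality.
import torch

def span_mix(video_a: torch.Tensor, video_b: torch.Tensor, span_ratio: float = 0.3):
    """video_a, video_b: frame-feature sequences of shape (T, D).
    Returns the mixed sequence and lambda, the label weight for video_a."""
    T = video_a.size(0)
    span_len = max(1, int(T * span_ratio))
    start = torch.randint(0, T - span_len + 1, (1,)).item()
    mixed = video_a.clone()
    mixed[start:start + span_len] = video_b[start:start + span_len]
    lam = 1.0 - span_len / T  # fraction of frames still from video_a
    return mixed, lam
```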

* Accepted by ICASSP 2023 

Generating Templated Caption for Video Grounding

Jan 15, 2023
Hongxiang Li, Meng Cao, Xuxin Cheng, Zhihong Zhu, Yaowei Li, Yuexian Zou

Video grounding aims to locate a moment of interest that matches a given query sentence in an untrimmed video. Previous works ignore the sparsity dilemma in video annotations, which fails to provide the context information between potential events and query sentences in the dataset. In this paper, we contend that providing easily available captions describing general actions, i.e., the templated captions defined in our paper, significantly boosts performance. To this end, we propose a Templated Caption Network (TCNet) for video grounding. Specifically, we first introduce dense video captioning to generate dense captions and then obtain templated captions by Non-Templated Caption Suppression (NTCS). To better utilize templated captions, we propose Caption Guided Attention (CGA) to project the semantic relations between templated captions and query sentences into the temporal space and fuse them into visual representations. Considering the gap between templated captions and the ground truth, we propose Asymmetric Dual Matching Supervised Contrastive Learning (ADMSCL) to construct more negative pairs and maximize cross-modal mutual information. Without bells and whistles, extensive experiments on three public datasets (i.e., ActivityNet Captions, TACoS, and ActivityNet-CG) demonstrate that our method significantly outperforms state-of-the-art methods.
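As a rough illustration of the contrastive component, a symmetric cross-modal InfoNCE objective in which every non-matching pair in the batch acts as a negative can be sketched as follows; the batch layout and temperature are illustrative assumptions, not ADMSCL's exact asymmetric formulation.

```python
# Hedged sketch of a cross-modal contrastive loss in the spirit of ADMSCL.
import torch
import torch.nn.functional as F

def cross_modal_infonce(video_emb: torch.Tensor, text_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """video_emb, text_emb: (B, D) embeddings; row i of each is a matched pair."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature              # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric over video->text and text->video retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))
```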

M3ST: Mix at Three Levels for Speech Translation

Dec 07, 2022
Xuxin Cheng, Qianqian Dong, Fengpeng Yue, Tom Ko, Mingxuan Wang, Yuexian Zou

How can we solve the data scarcity problem for end-to-end speech-to-text translation (ST)? It is well known that data augmentation is an efficient way to improve performance on many tasks by enlarging the dataset. In this paper, we propose the Mix at Three Levels for Speech Translation (M^3ST) method to increase the diversity of the augmented training corpus. Specifically, we conduct two phases of fine-tuning based on a model pre-trained with external machine translation (MT) data. In the first phase, we mix the training corpus at three levels, namely the word, sentence, and frame levels, and fine-tune the entire model on the mixed data. In the second phase, we feed the original speech sequences and the original text sequences into the model in parallel and use the Jensen-Shannon divergence to regularize their outputs. Experiments and analysis on the MuST-C speech translation benchmark show that M^3ST outperforms current strong baselines and achieves state-of-the-art results on eight directions with an average BLEU of 29.9.
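The Jensen-Shannon regularizer from the second fine-tuning phase can be written down directly. The sketch below assumes (B, V) output logits for the speech and parallel text inputs; the reduction choice is an assumption.

```python
# Minimal JS-divergence regularizer between two output distributions.
import torch
import torch.nn.functional as F

def js_regularizer(logits_speech: torch.Tensor, logits_text: torch.Tensor) -> torch.Tensor:
    p = F.softmax(logits_speech, dim=-1)
    q = F.softmax(logits_text, dim=-1)
    log_m = (0.5 * (p + q)).log()               # log of the mixture distribution
    # F.kl_div(log_m, p) computes KL(p || m) given log-probs as the first arg.
    return 0.5 * (F.kl_div(log_m, p, reduction="batchmean")
                  + F.kl_div(log_m, q, reduction="batchmean"))
```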

* Submitted to ICASSP 2023 

A Dynamic Graph Interactive Framework with Label-Semantic Injection for Spoken Language Understanding

Nov 08, 2022
Zhihong Zhu, Weiyuan Xu, Xuxin Cheng, Tengtao Song, Yuexian Zou

Joint models for multi-intent detection and slot filling are gaining increasing traction since they are closer to complicated real-world scenarios. However, existing approaches (1) focus on identifying implicit correlations between utterances and one-hot encoded labels in both tasks while ignoring explicit label characteristics; and (2) directly incorporate multi-intent information for each token, which can lead to incorrect slot predictions due to the introduction of irrelevant intents. In this paper, we propose a framework termed DGIF, which first leverages the semantic information of labels to give the model additional signals and enriched priors. Then, a multi-grain interactive graph is constructed to model the correlations between intents and slots. Specifically, we propose a novel approach to construct the interactive graph based on the injection of label semantics, which can automatically update the graph to better alleviate error propagation. Experimental results show that our framework significantly outperforms existing approaches, obtaining a relative improvement of 13.7% over the previous best model on the MixATIS dataset in overall accuracy.
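One way to picture label-semantic injection is tokens attending over label embeddings, as in the hypothetical PyTorch module below. Here the label embeddings are freely learned parameters, an assumption made for brevity; in DGIF the label semantics come from the label names themselves.

```python
import torch
import torch.nn as nn

class LabelSemanticInjection(nn.Module):
    """Sketch: each token attends over label embeddings so the model sees
    explicit label semantics rather than only one-hot targets."""
    def __init__(self, hidden: int, num_labels: int, num_heads: int = 4):
        super().__init__()
        self.label_emb = nn.Embedding(num_labels, hidden)  # assumed learned, not name-derived
        self.attn = nn.MultiheadAttention(hidden, num_heads, batch_first=True)

    def forward(self, token_states: torch.Tensor) -> torch.Tensor:
        # token_states: (B, T, H); label embeddings serve as keys and values.
        labels = self.label_emb.weight.unsqueeze(0).expand(token_states.size(0), -1, -1)
        injected, _ = self.attn(token_states, labels, labels)
        return token_states + injected  # residual fusion of label priors
```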

* Submitted to ICASSP 2023 

Deep Whole-Body Control: Learning a Unified Policy for Manipulation and Locomotion

Oct 18, 2022
Zipeng Fu, Xuxin Cheng, Deepak Pathak

An attached arm can significantly increase the applicability of legged robots to mobile manipulation tasks that are not possible for their wheeled or tracked counterparts. The standard hierarchical control pipeline for such legged manipulators decouples the controller into separate manipulation and locomotion modules. However, this is ineffective: it requires immense engineering to coordinate the arm and legs, and errors can propagate across modules, causing non-smooth, unnatural motions. It is also biologically implausible given the evidence for strong motor synergies across limbs. In this work, we propose to learn a unified policy for whole-body control of a legged manipulator using reinforcement learning. We propose Regularized Online Adaptation to bridge the Sim2Real gap for high-DoF control, and Advantage Mixing, which exploits the causal dependency in the action space to overcome local minima when training the whole-body system. We also present a simple design for a low-cost legged manipulator and find that our unified policy demonstrates dynamic and agile behaviors across several task setups. Videos are at https://maniploco.github.io
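Advantage Mixing admits a short sketch. The version below follows only the high-level description: arm actions are credited mainly with the manipulation advantage and leg actions with the locomotion advantage, with a coefficient beta annealed toward 1 so the two objectives merge; names and shapes are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def advantage_mixing_loss(logp_arm, logp_leg, adv_manip, adv_loco, beta: float):
    """Policy-gradient surrogate with mixed advantages; all args are (B,) tensors.
    beta in [0, 1]: 0 = fully decoupled credit, 1 = shared whole-body credit."""
    arm_term = logp_arm * (adv_manip + beta * adv_loco)
    leg_term = logp_leg * (adv_loco + beta * adv_manip)
    return -(arm_term + leg_term).mean()
```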

* CoRL 2022 (Oral). Project website at https://maniploco.github.io 

Reinforcement Learning for Robust Parameterized Locomotion Control of Bipedal Robots

Mar 26, 2021
Zhongyu Li, Xuxin Cheng, Xue Bin Peng, Pieter Abbeel, Sergey Levine, Glen Berseth, Koushil Sreenath

Developing robust walking controllers for bipedal robots is a challenging endeavor. Traditional model-based locomotion controllers require simplifying assumptions and careful modelling; any small errors can result in unstable control. To address these challenges for bipedal locomotion, we present a model-free reinforcement learning framework for training robust locomotion policies in simulation, which can then be transferred to a real bipedal Cassie robot. To facilitate sim-to-real transfer, domain randomization is used to encourage the policies to learn behaviors that are robust across variations in system dynamics. The learned policies enable Cassie to perform a set of diverse and dynamic behaviors, while also being more robust than traditional controllers and prior learning-based methods that use residual control. We demonstrate this on versatile walking behaviors such as tracking a target walking velocity, walking height, and turning yaw.
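Domain randomization of this kind typically resamples dynamics parameters at every episode. The ranges below are illustrative assumptions, not the values used for Cassie.

```python
import numpy as np

# Hypothetical dynamics-randomization ranges (not the paper's actual values).
RANDOMIZATION = {
    "ground_friction":      (0.6, 1.2),
    "link_mass_scale":      (0.9, 1.1),
    "motor_strength_scale": (0.9, 1.1),
    "action_delay_steps":   (0, 3),    # integer-valued range
}

def sample_dynamics(rng: np.random.Generator) -> dict:
    """Draw one randomized set of simulator dynamics per training episode."""
    return {name: (rng.uniform(lo, hi) if isinstance(lo, float)
                   else int(rng.integers(lo, hi + 1)))
            for name, (lo, hi) in RANDOMIZATION.items()}
```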

* To appear in the 2021 International Conference on Robotics and Automation (ICRA 2021) 

Automated Lane Change Strategy using Proximal Policy Optimization-based Deep Reinforcement Learning

Feb 07, 2020
Fei Ye, Xuxin Cheng, Pin Wang, Ching-Yao Chan

Lane-change maneuvers are commonly executed by drivers to follow a routing plan, overtake a slower vehicle, adapt to a merging lane ahead, and so on. However, improper lane-change behavior can be a major cause of traffic-flow disruptions and even crashes. While many rule-based methods have been proposed to solve the lane-change problem for autonomous driving, they tend to exhibit limited performance due to the uncertainty and complexity of the driving environment. Machine learning offers an alternative, as deep reinforcement learning (DRL) has shown promising success in many application domains, including robotic manipulation, navigation, and playing video games. However, applying DRL to autonomous driving still faces practical challenges: slow learning rates, sample inefficiency, and non-stationary trajectories. In this study, we propose an automated lane-change strategy based on proximal policy optimization, which shows a clear advantage in learning efficiency while maintaining stable performance. The trained agent learns a smooth, safe, and efficient driving policy for lane-change decisions (i.e., when and how to change lanes), even in dense traffic. The effectiveness of the proposed policy is validated by task success rate and collision rate, demonstrating that lane-change maneuvers can be learned and executed safely, smoothly, and efficiently.
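For reference, the clipped surrogate objective at the heart of PPO, which the lane-change agent optimizes, looks like this; tensor shapes are illustrative.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps: float = 0.2):
    """Standard PPO clipped objective; all inputs are (B,) tensors."""
    ratio = torch.exp(logp_new - logp_old)      # importance-sampling ratio
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    # Take the pessimistic bound so updates stay within the trust region.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```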

Driving Decision and Control for Autonomous Lane Change based on Deep Reinforcement Learning

Apr 23, 2019
Tianyu Shi, Pin Wang, Xuxin Cheng, Ching-Yao Chan

We apply a Deep Q-network (DQN) that takes safety into consideration to decide whether to conduct the lane-change maneuver. Furthermore, we design two similar deep Q-learning frameworks with a quadratic approximator to decide how to select a comfortable gap or simply follow the preceding vehicle. Finally, a polynomial lane-change trajectory is generated and Pure Pursuit Control is implemented for path tracking. We demonstrate the effectiveness of this framework in simulation at both the decision-making and control layers. The proposed architecture also has the potential to be extended to other autonomous driving scenarios.
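The Pure Pursuit tracking step is standard and can be sketched directly; the bicycle-model wheelbase and the choice of lookahead point are assumptions here.

```python
import math

def pure_pursuit_steering(x, y, yaw, target_x, target_y, wheelbase):
    """Steering angle (rad) that drives a bicycle model toward a lookahead
    point on the planned polynomial lane-change trajectory."""
    alpha = math.atan2(target_y - y, target_x - x) - yaw  # heading error
    lookahead = math.hypot(target_x - x, target_y - y)
    # Classic pure-pursuit curvature-to-steering conversion.
    return math.atan2(2.0 * wheelbase * math.sin(alpha), lookahead)
```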

* Submitted to ITSC 2019 