Qihang Zhang

Hundreds Guide Millions: Adaptive Offline Reinforcement Learning with Expert Guidance

Sep 04, 2023
Qisen Yang, Shenzhi Wang, Qihang Zhang, Gao Huang, Shiji Song

Offline reinforcement learning (RL) optimizes the policy on a previously collected dataset without any interactions with the environment, yet usually suffers from the distributional shift problem. To mitigate this issue, a typical solution is to impose a policy constraint on the policy improvement objective. However, existing methods generally adopt a "one-size-fits-all" practice, i.e., keeping only a single improvement-constraint balance for all the samples in a mini-batch or even the entire offline dataset. In this work, we argue that different samples should be treated with different policy constraint intensities. Based on this idea, a novel plug-in approach named Guided Offline RL (GORL) is proposed. GORL employs a guiding network, along with only a few expert demonstrations, to adaptively determine the relative importance of the policy improvement and policy constraint for every sample. We theoretically prove that the guidance provided by our method is rational and near-optimal. Extensive experiments on various environments suggest that GORL can be easily plugged into most offline RL algorithms with statistically significant performance improvements.
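The per-sample balance described above can be pictured with a short, hypothetical sketch: a small guiding network outputs a weight for each transition that trades off the Q-value term (policy improvement) against a behavior-cloning term (policy constraint) in a TD3+BC-style actor loss. The network architecture, the sigmoid weight range, and the exact loss form below are illustrative assumptions, not the paper's formulation.

```python
# Hypothetical sketch of per-sample balancing between policy improvement and a
# behavior-cloning constraint; the guiding network and loss form are assumed.
import torch
import torch.nn as nn

class GuidingNet(nn.Module):
    """Maps a (state, action) pair to a per-sample weight in (0, 1)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def guided_actor_loss(actor, critic, guide, state, action_data):
    action_pi = actor(state)
    q = critic(state, action_pi)                                  # policy improvement term
    bc = ((action_pi - action_data) ** 2).mean(-1, keepdim=True)  # constraint term
    w = guide(state, action_data)                                 # adaptive per-sample weight
    # Larger w -> trust policy improvement; smaller w -> stay close to the data.
    return (-w * q + (1.0 - w) * bc).mean()
```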

Learning Modulated Transformation in GANs

Aug 29, 2023
Ceyuan Yang, Qihang Zhang, Yinghao Xu, Jiapeng Zhu, Yujun Shen, Bo Dai

The success of style-based generators largely benefits from style modulation, which helps take care of the cross-instance variation within data. However, the instance-wise stochasticity is typically introduced via regular convolution, where kernels interact with features at fixed locations, limiting its capacity for modeling geometric variation. To alleviate this problem, we equip the generator in generative adversarial networks (GANs) with a plug-and-play module, termed the modulated transformation module (MTM). This module predicts spatial offsets under the control of latent codes, based on which the convolution operation can be applied at variable locations for different instances, and hence offers the model an additional degree of freedom to handle geometric deformation. Extensive experiments suggest that our approach can be faithfully generalized to various generative tasks, including image generation, 3D-aware image synthesis, and video generation, and remains compatible with state-of-the-art frameworks without any hyper-parameter tuning. Notably, for human generation on the challenging TaiChi dataset, we improve the FID of StyleGAN3 from 21.36 to 13.60, demonstrating the efficacy of learning modulated geometry transformation.
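As a rough illustration of the mechanism, the hypothetical module below predicts a coarse offset field from the latent code and resamples features at the shifted locations before a regular convolution. The layer sizes, the 4x4 offset resolution, and the use of grid_sample in place of a true deformable convolution are all assumptions made for clarity, not the paper's implementation.

```python
# Illustrative stand-in for a latent-modulated spatial transformation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedTransform(nn.Module):
    def __init__(self, channels, latent_dim):
        super().__init__()
        self.to_offset = nn.Linear(latent_dim, 2 * 4 * 4)    # coarse 4x4 offset field
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x, latent):
        b, c, h, w = x.shape
        # Predict per-instance offsets from the latent code and upsample them.
        offset = self.to_offset(latent).view(b, 2, 4, 4)
        offset = F.interpolate(offset, size=(h, w), mode="bilinear", align_corners=False)
        # Base sampling grid in [-1, 1], shifted by the predicted offsets.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=x.device),
            torch.linspace(-1, 1, w, device=x.device),
            indexing="ij",
        )
        grid = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        grid = grid + 0.1 * offset.permute(0, 2, 3, 1)        # small learned shifts
        x = F.grid_sample(x, grid, align_corners=False)       # sample at variable locations
        return self.conv(x)
```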

* Technical report 
Towards Better 3D Knowledge Transfer via Masked Image Modeling for Multi-view 3D Understanding

Mar 20, 2023
Jihao Liu, Tai Wang, Boxiao Liu, Qihang Zhang, Yu Liu, Hongsheng Li

Multi-view camera-based 3D detection is a challenging problem in computer vision. Recent works leverage a pretrained LiDAR detection model to transfer knowledge to a camera-based student network. However, we argue that there is a major domain gap between LiDAR BEV features and camera-based BEV features, as they have different characteristics and are derived from different sources. In this paper, we propose Geometry Enhanced Masked Image Modeling (GeoMIM) to transfer the knowledge of the LiDAR model in a pretrain-finetune paradigm and improve multi-view camera-based 3D detection. GeoMIM is a multi-camera vision transformer with Cross-View Attention (CVA) blocks that uses LiDAR BEV features encoded by the pretrained BEV model as learning targets. During pretraining, GeoMIM's decoder has a semantic branch completing dense perspective-view features and a geometry branch reconstructing dense perspective-view depth maps. The depth branch is designed to be camera-aware by taking the camera parameters as input for better transfer capability. Extensive results demonstrate that GeoMIM outperforms existing methods on the nuScenes benchmark, achieving state-of-the-art performance for camera-based 3D object detection and 3D segmentation.
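For intuition, the pretraining objective described above can be summarized as a two-branch regression loss, sketched below. The tensor shapes, loss types, and weights are placeholders; the actual GeoMIM recipe (masking, projection of LiDAR BEV features to the cameras, the camera-aware depth branch) is more involved.

```python
# Hedged sketch of a two-branch masked-pretraining loss: semantic features
# supervised by frozen LiDAR-derived targets plus dense depth reconstruction.
import torch.nn.functional as F

def geomim_style_loss(sem_pred, depth_pred, lidar_feat_target, depth_target,
                      w_sem=1.0, w_depth=1.0):
    """sem_pred / lidar_feat_target: (B, N_cam, C, H, W) perspective-view features.
    depth_pred / depth_target: (B, N_cam, H, W) dense depth maps."""
    sem_loss = F.mse_loss(sem_pred, lidar_feat_target)    # semantic branch
    depth_loss = F.l1_loss(depth_pred, depth_target)      # geometry (depth) branch
    return w_sem * sem_loss + w_depth * depth_loss
```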

Towards Smooth Video Composition

Dec 14, 2022
Qihang Zhang, Ceyuan Yang, Yujun Shen, Yinghao Xu, Bolei Zhou

Video generation requires synthesizing consistent and persistent frames with dynamic content over time. This work investigates modeling the temporal relations for composing videos of arbitrary length, from a few frames to effectively infinite, using generative adversarial networks (GANs). First, towards composing adjacent frames, we show that the alias-free operation for single-image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising per-frame quality. Second, by incorporating the temporal shift module (TSM), originally designed for video understanding, into the discriminator, we manage to advance the generator in synthesizing more consistent dynamics. Third, we develop a novel B-spline based motion representation that ensures temporal smoothness for infinite-length video generation, going beyond the number of frames used in training. A low-rank temporal modulation is also proposed to alleviate repetitive content in long video generation. We evaluate our approach on various datasets and show substantial improvements over video generation baselines. Code and models will be publicly available at https://genforce.github.io/StyleSV.
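To make the B-spline motion idea concrete, here is a small, self-contained sketch: a handful of control motion codes are blended with uniform cubic B-spline basis functions to produce a smooth code for every frame, for as many frames as desired. The shapes and the uniform-knot assumption are mine; the paper's actual representation and its low-rank temporal modulation are not reproduced here.

```python
# Uniform cubic B-spline interpolation of motion control codes over time.
import torch

def cubic_bspline_basis(u):
    """Uniform cubic B-spline basis weights for local parameters u in [0, 1); returns (..., 4)."""
    return torch.stack([
        (1 - u) ** 3 / 6,
        (3 * u ** 3 - 6 * u ** 2 + 4) / 6,
        (-3 * u ** 3 + 3 * u ** 2 + 3 * u + 1) / 6,
        u ** 3 / 6,
    ], dim=-1)

def spline_motion_codes(control, num_frames):
    """control: (K, D) control codes with K >= 4; returns (num_frames, D) smooth per-frame codes."""
    k, _ = control.shape
    assert k >= 4, "cubic B-splines need at least 4 control codes"
    t = torch.linspace(0, k - 4, num_frames)        # spline parameter for each frame
    i = t.floor().long().clamp(max=k - 4)           # first control index of the local segment
    u = t - i.float()                               # local parameter in [0, 1)
    basis = cubic_bspline_basis(u)                  # (num_frames, 4)
    idx = i.unsqueeze(-1) + torch.arange(4)         # (num_frames, 4) control indices
    return (basis.unsqueeze(-1) * control[idx]).sum(dim=1)
```

For example, spline_motion_codes(torch.randn(8, 512), num_frames=300) yields 300 smoothly varying codes from only 8 control codes, which is what allows sampling more frames than were seen during training.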

Noise-resilient approach for deep tomographic imaging

Nov 22, 2022
Zhen Guo, Zhiguang Liu, Qihang Zhang, George Barbastathis, Michael E. Glinsky

We propose a noise-resilient deep reconstruction algorithm for X-ray tomography. Our approach shows strong noise resilience without requiring noisy training examples. The advantages of our framework may further enable low-photon tomographic imaging.

* 2022 CLEO (the Conference on Lasers and Electro-Optics) conference submission 
Generative Category-Level Shape and Pose Estimation with Semantic Primitives

Oct 03, 2022
Guanglin Li, Yifeng Li, Zhichao Ye, Qihang Zhang, Tao Kong, Zhaopeng Cui, Guofeng Zhang

Empowering autonomous agents with 3D understanding of daily objects is a grand challenge in robotics applications. When exploring an unknown environment, existing methods for object pose estimation are still not satisfactory due to the diversity of object shapes. In this paper, we propose a novel framework for category-level object shape and pose estimation from a single RGB-D image. To handle the intra-category variation, we adopt a semantic primitive representation that encodes diverse shapes into a unified latent space, which is the key to establishing reliable correspondences between observed point clouds and estimated shapes. Then, by using a SIM(3)-invariant shape descriptor, we gracefully decouple the shape and pose of an object, thus supporting latent shape optimization of target objects in arbitrary poses. Extensive experiments show that the proposed method achieves SOTA pose estimation performance and better generalization on the real-world dataset. Code and video are available at https://zju3dv.github.io/gCasp
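The latent shape optimization mentioned above can be sketched, under assumptions, as a simple gradient-descent fit of a latent code through a pretrained shape decoder against the observed points in a canonical frame. The stand-in Chamfer distance, optimizer, and step counts below are illustrative; they are not the paper's actual components.

```python
# Hedged sketch: refine a shape latent so a decoder matches observed points.
import torch

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets a: (N, 3) and b: (M, 3)."""
    d = torch.cdist(a, b)                           # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def optimize_shape_latent(decoder, observed_pts, latent_dim=64, steps=200, lr=1e-2):
    """decoder: callable mapping a (latent_dim,) code to (M, 3) canonical-frame points."""
    z = torch.zeros(latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = chamfer(decoder(z), observed_pts)    # fit the observation in canonical space
        loss.backward()
        opt.step()
    return z.detach()
```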

* CoRL 2022, 17 pages, 13 figures 
F3A-GAN: Facial Flow for Face Animation with Generative Adversarial Networks

May 13, 2022
Xintian Wu, Qihang Zhang, Yiming Wu, Huanyu Wang, Songyuan Li, Lingyun Sun, Xi Li

Formulated as a conditional generation problem, face animation aims at synthesizing continuous face images from a single source image driven by a set of conditional face motions. Previous works mainly model the face motion as conditions with a 1D or 2D representation (e.g., action units, emotion codes, landmarks), which often leads to low-quality results in complicated scenarios such as continuous generation and large-pose transformation. To tackle this problem, the conditions should meet two requirements: preserving motion information and maintaining geometric continuity. To this end, we propose a novel representation based on a 3D geometric flow, termed facial flow, to represent the natural motion of the human face at any pose. Compared with previous conditions, the proposed facial flow well controls the continuous changes to the face. To utilize the facial flow for face editing, we build a synthesis framework that generates continuous images with conditional facial flows. To take full advantage of the motion information of facial flows, a hierarchical conditional framework is designed to combine the extracted multi-scale appearance features from images and motion features from flows in a hierarchical manner. The framework then decodes the multiple fused features back to images progressively. Experimental results demonstrate the effectiveness of our method compared to other state-of-the-art methods.
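The hierarchical conditioning can be pictured with the minimal sketch below: appearance features from the source image and motion features from the facial flow are fused at matching resolutions before progressive decoding. The channel counts and the concatenation-plus-1x1-convolution fusion rule are assumptions for illustration, not the paper's architecture.

```python
# Illustrative multi-scale fusion of appearance and flow (motion) features.
import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    def __init__(self, channels=(64, 128, 256)):
        super().__init__()
        self.fuse = nn.ModuleList(
            [nn.Conv2d(2 * c, c, kernel_size=1) for c in channels]
        )

    def forward(self, appearance_feats, flow_feats):
        """Both inputs: lists of (B, C_i, H_i, W_i) tensors at matching scales."""
        return [f(torch.cat([a, m], dim=1))
                for f, a, m in zip(self.fuse, appearance_feats, flow_feats)]
```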

* IEEE Transactions on Image Processing (2021)  
From Laser Speckle to Particle Size Distribution in drying powders: A Physics-Enhanced AutoCorrelation-based Estimator (PEACE)

Apr 20, 2022
Qihang Zhang, Janaka C. Gamekkanda, Wenlong Tang, Charles Papageorgiou, Chris Mitchell, Yihui Yang, Michael Schwaerzler, Tolutola Oyetunde, Richard D. Braatz, Allan S. Myerson, George Barbastathis

Extracting quantitative information about highly scattering surfaces from an imaging system is challenging because the phase of the scattered light undergoes multiple folds upon propagation, resulting in complex speckle patterns. One specific application is the drying of wet powders in the pharmaceutical industry, where quantifying the particle size distribution (PSD) is of particular interest. A non-invasive, real-time monitoring probe for the drying process is required, but no suitable candidate exists for this purpose. In this report, we develop a theoretical relationship from the PSD to the speckle image and describe a physics-enhanced autocorrelation-based estimator (PEACE) machine learning algorithm for speckle analysis to measure the PSD of a powder surface. This method solves the forward and inverse problems together and enjoys increased interpretability, since the machine learning approximator is regularized by the physical law.
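The physics-motivated part of such an estimator can be illustrated with the short sketch below: the centered autocorrelation of a speckle image, computed via the Wiener-Khinchin theorem, whose width relates to the characteristic speckle (and hence particle) scale. The normalization choices are assumptions, and the paper's full estimator additionally couples this physics with a learned model.

```python
# 2D speckle autocorrelation via the Wiener-Khinchin theorem (FFT of the power spectrum).
import numpy as np

def speckle_autocorrelation(img):
    """img: 2D array of speckle intensities; returns the centered, peak-normalized autocorrelation."""
    x = img - img.mean()                    # remove the DC component
    power = np.abs(np.fft.fft2(x)) ** 2     # power spectrum
    ac = np.fft.ifft2(power).real           # autocorrelation (Wiener-Khinchin)
    ac = np.fft.fftshift(ac)                # move the zero-lag peak to the center
    return ac / ac.max()
```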

Action-Conditioned Contrastive Policy Pretraining

Apr 05, 2022
Qihang Zhang, Zhenghao Peng, Bolei Zhou

Deep visuomotor policy learning achieves promising results in control tasks such as robotic manipulation and autonomous driving, where the action is generated from the visual input by the neural policy. However, it requires a huge number of online interactions with the training environment, which limits its real-world application. Compared to the popular unsupervised feature learning for visual recognition, feature pretraining for visuomotor control tasks is much less explored. In this work, we aim to pretrain policy representations for driving tasks using hours-long uncurated YouTube videos. A new contrastive policy pretraining method is developed to learn action-conditioned features from video frames with action pseudo labels. Experiments show that the resulting action-conditioned features bring substantial improvements to the downstream reinforcement learning and imitation learning tasks, outperforming the weights pretrained from previous unsupervised learning methods. Code and models will be made publicly available.
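One way to picture action-conditioned contrastive pretraining is the hedged sketch below: frame embeddings whose action pseudo-labels agree are treated as positives in an InfoNCE-style loss, while the rest act as negatives. Discretizing actions into pseudo-label ids and the temperature value are simplifying assumptions, not the paper's exact recipe.

```python
# InfoNCE-style loss where positives share the same action pseudo-label.
import torch
import torch.nn.functional as F

def action_conditioned_infonce(emb, action_ids, temperature=0.1):
    """emb: (B, D) frame embeddings; action_ids: (B,) integer action pseudo-labels."""
    z = F.normalize(emb, dim=-1)
    logits = z @ z.t() / temperature                       # scaled cosine similarities
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    logits = logits.masked_fill(eye, -1e9)                 # exclude self-pairs
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos = ((action_ids.unsqueeze(0) == action_ids.unsqueeze(1)) & ~eye).float()
    has_pos = pos.sum(dim=1) > 0                           # anchors with at least one positive
    loss = -(pos * log_prob).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return loss[has_pos].mean()
```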

MetaDrive: Composing Diverse Driving Scenarios for Generalizable Reinforcement Learning

Sep 26, 2021
Quanyi Li, Zhenghao Peng, Zhenghai Xue, Qihang Zhang, Bolei Zhou

Driving safely requires multiple capabilities from human and intelligent agents, such as generalizability to unseen environments, decision making in complex multi-agent settings, and safety awareness of the surrounding traffic. Despite the great success of reinforcement learning (RL), most RL research studies each capability separately due to the lack of integrated interactive environments. In this work, we develop a new driving simulation platform called MetaDrive for the study of generalizable reinforcement learning algorithms. MetaDrive is highly compositional and can generate an infinite number of diverse driving scenarios from both procedural generation and real traffic data replay. Based on MetaDrive, we construct a variety of RL tasks and baselines in both single-agent and multi-agent settings, including benchmarking generalizability across unseen scenes, safe exploration, and learning multi-agent traffic. We open-source this simulator and maintain its development at https://github.com/decisionforce/metadrive
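As a usage-level sketch, the simulator exposes a gym-style interface along the lines below. The class name MetaDriveEnv matches the repository linked above, but the specific config keys (e.g. "environment_num", "start_seed") and the reset/step signatures are recalled from memory and may differ across versions, so treat this as an assumption and consult the repository for the authoritative API.

```python
# Hypothetical gym-style interaction loop; config keys and return signatures are assumed.
from metadrive import MetaDriveEnv

env = MetaDriveEnv({
    "environment_num": 100,   # number of procedurally generated scenes (assumed key)
    "start_seed": 0,          # seed of the first generated scene (assumed key)
})
obs = env.reset()
for _ in range(1000):
    action = env.action_space.sample()            # random policy, for illustration only
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
env.close()
```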

* MetaDrive: https://github.com/decisionforce/metadrive 