Xiaokang Yang

SAM-PARSER: Fine-tuning SAM Efficiently by Parameter Space Reconstruction

Aug 31, 2023
Zelin Peng, Zhengqin Xu, Zhilin Zeng, Xiaokang Yang, Wei Shen

Segment Anything Model (SAM) has received remarkable attention as it offers a powerful and versatile solution for object segmentation in images. However, fine-tuning SAM for downstream segmentation tasks under different scenarios remains a challenge, as the varied characteristics of different scenarios naturally require diverse model parameter spaces. Most existing fine-tuning methods attempt to bridge the gaps among different scenarios by introducing a set of new parameters to modify SAM's original parameter space. Unlike these works, in this paper, we propose fine-tuning SAM efficiently by parameter space reconstruction (SAM-PARSER), which introduces nearly zero trainable parameters during fine-tuning. In SAM-PARSER, we assume that SAM's original parameter space is relatively complete, so that its bases are able to reconstruct the parameter space of a new scenario. We obtain the bases by matrix decomposition and fine-tune their coefficients to reconstruct the parameter space tailored to the new scenario through an optimal linear combination of the bases. Experimental results show that SAM-PARSER exhibits superior segmentation performance across various scenarios, while reducing the number of trainable parameters by $\approx 290$ times compared with current parameter-efficient fine-tuning methods.
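
The abstract does not spell out which decomposition SAM-PARSER uses or which layers it reconstructs, so the following is only a minimal sketch of the idea under assumed choices: an SVD of a single linear layer's weight, in which the singular-vector bases are frozen and only the singular-value coefficients are fine-tuned. The class name `SVDCoefficientTuning` and the 768-dimensional layer are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

class SVDCoefficientTuning(nn.Module):
    """Freeze the bases of W = U @ diag(s) @ Vh and fine-tune only the
    coefficients s, so the weight is reconstructed as a linear
    combination of fixed bases."""
    def __init__(self, weight: torch.Tensor):
        super().__init__()
        U, s, Vh = torch.linalg.svd(weight, full_matrices=False)
        self.register_buffer("U", U)      # frozen left singular vectors
        self.register_buffer("Vh", Vh)    # frozen right singular vectors
        self.s = nn.Parameter(s.clone())  # trainable coefficients only

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        W = self.U @ torch.diag(self.s) @ self.Vh  # reconstructed weight
        return x @ W.T

# usage: wrap a pretrained linear layer's weight (768x768 here)
layer = nn.Linear(768, 768)
tuned = SVDCoefficientTuning(layer.weight.detach())
print(sum(p.numel() for p in tuned.parameters()))  # 768 trainable coefficients
```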

Dual Aggregation Transformer for Image Super-Resolution

Aug 11, 2023
Zheng Chen, Yulun Zhang, Jinjin Gu, Linghe Kong, Xiaokang Yang, Fisher Yu

Transformers have recently gained considerable popularity in low-level vision tasks, including image super-resolution (SR). These networks utilize self-attention along different dimensions, spatial or channel, and achieve impressive performance. This inspires us to combine the two dimensions in a Transformer for a more powerful representation capability. Based on this idea, we propose a novel Transformer model, Dual Aggregation Transformer (DAT), for image SR. Our DAT aggregates features across spatial and channel dimensions in an inter-block and intra-block dual manner. Specifically, we alternately apply spatial and channel self-attention in consecutive Transformer blocks. This alternating strategy enables DAT to capture the global context and realize inter-block feature aggregation. Furthermore, we propose the adaptive interaction module (AIM) and the spatial-gate feed-forward network (SGFN) to achieve intra-block feature aggregation. AIM complements the two self-attention mechanisms from their corresponding dimensions. Meanwhile, SGFN introduces additional non-linear spatial information into the feed-forward network. Extensive experiments show that our DAT surpasses current methods. Code and models are available at https://github.com/zhengchen1999/DAT.

* Accepted to ICCV 2023. Code is available at https://github.com/zhengchen1999/DAT 
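
As a rough illustration of the alternating strategy only (the paper's AIM and SGFN modules are not reproduced), the sketch below applies spatial self-attention over tokens followed by a simple dot-product channel self-attention over the transposed feature map. The class name `AlternatingAttention`, the head count, and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class AlternatingAttention(nn.Module):
    """Spatial self-attention over tokens, then a simple channel
    self-attention over the transposed feature map (channels as tokens)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) -- flattened image patches
        h = self.norm1(x)
        x = x + self.spatial_attn(h, h, h, need_weights=False)[0]
        # channel attention: each channel attends over all other channels
        h = self.norm2(x).transpose(1, 2)                    # (batch, dim, tokens)
        scores = h @ h.transpose(1, 2) / h.shape[-1] ** 0.5  # (batch, dim, dim)
        x = x + (torch.softmax(scores, dim=-1) @ h).transpose(1, 2)
        return x

# usage: 2 images, 16x16 patches, 64 channels
block = AlternatingAttention(dim=64)
out = block(torch.randn(2, 16 * 16, 64))
```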

Vid2Act: Activate Offline Videos for Visual RL

Jun 07, 2023
Minting Pan, Yitao Zheng, Wendong Zhang, Yunbo Wang, Xiaokang Yang

Pretraining RL models on offline video datasets is a promising way to improve their training efficiency in online tasks, but is challenging due to the inherent mismatch in tasks, dynamics, and behaviors across domains. A recent model, APV, sidesteps the accompanying action records in offline datasets and instead focuses on pretraining a task-irrelevant, action-free world model within the source domains. We present Vid2Act, a model-based RL method that learns to transfer valuable action-conditioned dynamics and potentially useful action demonstrations from offline to online settings. The main idea is to use the world models not only as simulators for behavior learning but also as tools to measure domain relevance for both dynamics representation transfer and policy transfer. Specifically, we train the world models to generate a set of time-varying task similarities using a domain-selective knowledge distillation loss. These similarities serve two purposes: (i) adaptively transferring the most useful source knowledge to facilitate dynamics learning, and (ii) learning to replay the most relevant source actions to guide the target policy. We demonstrate the advantages of Vid2Act over the action-free visual RL pretraining method in both Meta-World and the DeepMind Control Suite.
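
A hedged sketch of the similarity-weighted transfer: per-source distillation losses are combined with softmax-normalized, time-varying similarity logits. The function `domain_weighted_distillation`, the tensor shapes, and the MSE distillation term are assumptions; the paper applies its loss inside the world-model latent space rather than on raw features.

```python
import torch
import torch.nn.functional as F

def domain_weighted_distillation(target_feat, source_feats, similarity_logits):
    """Combine per-source distillation losses with softmax-normalized,
    time-varying task-similarity weights (one logit per source domain)."""
    # target_feat: (batch, dim); source_feats: list of (batch, dim) tensors
    # similarity_logits: (num_sources,) learnable logits for this time step
    weights = torch.softmax(similarity_logits, dim=0)
    per_source = torch.stack(
        [F.mse_loss(target_feat, s.detach()) for s in source_feats])
    return (weights * per_source).sum(), weights

# usage: three source domains, batch of 8 latent features
target = torch.randn(8, 32)
sources = [torch.randn(8, 32) for _ in range(3)]
logits = torch.zeros(3, requires_grad=True)
loss, w = domain_weighted_distillation(target, sources, logits)
```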

Neural LerPlane Representations for Fast 4D Reconstruction of Deformable Tissues

May 31, 2023
Chen Yang, Kailing Wang, Yuehao Wang, Xiaokang Yang, Wei Shen

Reconstructing deformable tissues from endoscopic stereo videos in robotic surgery is crucial for various clinical applications. However, existing methods relying only on implicit representations are computationally expensive and require dozens of hours, which limits further practical applications. To address this challenge, we introduce LerPlane, a novel method for fast and accurate reconstruction of surgical scenes under a single-viewpoint setting. LerPlane treats surgical procedures as 4D volumes and factorizes them into explicit 2D planes of static and dynamic fields, leading to a compact memory footprint and significantly accelerated optimization. The efficient factorization is accomplished by fusing features obtained through linear interpolation on each plane and enables the use of lightweight neural networks to model surgical scenes. Moreover, LerPlane shares static fields, significantly reducing the workload of dynamic tissue modeling. We also propose a novel sampling scheme to boost optimization and improve performance in regions with tool occlusion and large motions. Experiments on DaVinci robotic surgery videos demonstrate that LerPlane accelerates optimization by over 100$\times$ while maintaining high quality across various non-rigid deformations, showing significant promise for future intraoperative surgery applications.

* 11 pages, 3 figures 
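
To make the plane factorization concrete, here is a minimal sketch that interpolates features for (x, y, z, t) samples from six learnable 2D planes (xy/xz/yz as the static field, xt/yt/zt as the dynamic field) and fuses them by concatenation. The class name `PlaneFactorization`, the resolutions, the feature width, and the concatenation-based fusion are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PlaneFactorization(nn.Module):
    """Six learnable 2D feature planes: xy/xz/yz (static) and xt/yt/zt
    (dynamic). Features for each (x, y, z, t) sample are bilinearly
    interpolated from every plane and fused by concatenation."""
    def __init__(self, feat_dim: int = 16, res: int = 64):
        super().__init__()
        self.planes = nn.ParameterDict({
            k: nn.Parameter(0.1 * torch.randn(1, feat_dim, res, res))
            for k in ["xy", "xz", "yz", "xt", "yt", "zt"]})

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (N, 4) with x, y, z, t normalized to [-1, 1]
        x, y, z, t = coords.unbind(-1)
        pairs = {"xy": (x, y), "xz": (x, z), "yz": (y, z),
                 "xt": (x, t), "yt": (y, t), "zt": (z, t)}
        feats = []
        for name, (u, v) in pairs.items():
            grid = torch.stack([u, v], dim=-1).view(1, -1, 1, 2)  # (1, N, 1, 2)
            f = F.grid_sample(self.planes[name], grid, align_corners=True)
            feats.append(f.view(f.shape[1], -1).T)                # (N, feat_dim)
        return torch.cat(feats, dim=-1)                           # (N, 6 * feat_dim)

# usage: query fused features for 1024 random space-time samples
model = PlaneFactorization()
features = model(torch.rand(1024, 4) * 2 - 1)
```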

Collaborative World Models: An Online-Offline Transfer RL Approach

May 25, 2023
Qi Wang, Junming Yang, Yunbo Wang, Xin Jin, Wenjun Zeng, Xiaokang Yang

Training visual reinforcement learning (RL) models on offline datasets is challenging due to overfitting issues in representation learning and overestimation problems in value functions. In this paper, we propose a transfer learning method called Collaborative World Models (CoWorld) to improve the performance of visual RL under offline conditions. The core idea is to use an easy-to-interact, off-the-shelf simulator to train an auxiliary RL model as an online "test bed" for the offline policy learned in the target domain, which provides a flexible constraint for the value function. Intuitively, we want to mitigate the overestimation problem of value functions outside the offline data distribution without impeding the exploration of actions with potential advantages. Specifically, CoWorld performs domain-collaborative representation learning to bridge the gap between online and offline hidden state distributions. Furthermore, it performs domain-collaborative behavior learning that enables the source RL agent to provide target-aware value estimation, allowing for effective offline policy regularization. Experiments show that CoWorld significantly outperforms existing methods on offline visual control tasks in DeepMind Control and Meta-World.
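
The value-regularization idea can be sketched as a critic loss that penalizes deviation of the target-domain value estimates from those supplied by the online source agent; the function `regularized_value_loss`, the squared-error form, and the coefficient `beta` are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def regularized_value_loss(target_q, td_target, source_q, beta=1.0):
    """Critic loss for the offline (target-domain) agent: a standard TD
    term plus a regularizer that keeps its value estimates close to the
    target-aware estimates produced by the online source agent."""
    td_loss = F.mse_loss(target_q, td_target.detach())
    regularizer = F.mse_loss(target_q, source_q.detach())
    return td_loss + beta * regularizer

# usage with dummy value predictions
q = torch.randn(32, requires_grad=True)
loss = regularized_value_loss(q, torch.randn(32), torch.randn(32))
```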

Object-Centric Voxelization of Dynamic Scenes via Inverse Neural Rendering

Apr 30, 2023
Siyu Gao, Yanpeng Zhao, Yunbo Wang, Xiaokang Yang

Understanding the compositional dynamics of the world in unsupervised 3D scenarios is challenging. Existing approaches either fail to make effective use of time cues or ignore the multi-view consistency of scene decomposition. In this paper, we propose DynaVol, an inverse neural rendering framework that provides a pilot study for learning time-varying volumetric representations for dynamic scenes with multiple entities (such as objects). Our work makes two main contributions. First, it maintains a time-dependent 3D grid, which dynamically and flexibly binds spatial locations to different entities, thus encouraging the separation of information at a representational level. Second, our approach jointly learns grid-level local dynamics, object-level global dynamics, and the compositional neural radiance fields in an end-to-end architecture, thereby enhancing the spatiotemporal consistency of object-centric scene voxelization. We present a two-stage training scheme for DynaVol and validate its effectiveness on various benchmarks with multiple objects, diverse dynamics, and real-world shapes and textures. Visualizations are available at https://sites.google.com/view/dynavol-visual.
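
A minimal sketch of the time-dependent grid that softly binds spatial locations to entities, assuming a dense tensor of per-voxel, per-object logits normalized by a softmax over the object axis; DynaVol additionally couples such a grid with learned dynamics and compositional neural radiance fields, which are omitted here. The class name `TimeVaryingObjectGrid` and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class TimeVaryingObjectGrid(nn.Module):
    """A dense grid of per-voxel, per-object logits for each time step;
    a softmax over the object axis softly binds every spatial location
    to one of K entities at time t."""
    def __init__(self, num_objects: int = 4, res: int = 32, steps: int = 8):
        super().__init__()
        self.logits = nn.Parameter(
            torch.zeros(steps, num_objects, res, res, res))

    def forward(self, t: int) -> torch.Tensor:
        # (num_objects, res, res, res) occupancies summing to 1 per voxel
        return torch.softmax(self.logits[t], dim=0)

# usage: object occupancy volumes at the third time step
grid = TimeVaryingObjectGrid()
occupancy = grid(2)
```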

HyperStyle3D: Text-Guided 3D Portrait Stylization via Hypernetworks

Apr 19, 2023
Zhuo Chen, Xudong Xu, Yichao Yan, Ye Pan, Wenhan Zhu, Wayne Wu, Bo Dai, Xiaokang Yang

Portrait stylization is a long-standing task enabling extensive applications. Although 2D-based methods have made great progress in recent years, real-world applications such as the metaverse and games often demand 3D content. On the other hand, the requirement for 3D data, which is costly to acquire, significantly impedes the development of 3D portrait stylization methods. In this paper, inspired by the success of 3D-aware GANs that bridge the 2D and 3D domains with 3D fields as the intermediate representation for rendering 2D images, we propose a novel method, dubbed HyperStyle3D, based on 3D-aware GANs for 3D portrait stylization. At the core of our method is a hyper-network learned to manipulate the parameters of the generator in a single forward pass. It not only offers a strong capacity to handle multiple styles with a single model, but also enables flexible fine-grained stylization that affects only the texture, shape, or local parts of the portrait. While the use of 3D-aware GANs bypasses the requirement for 3D data, we further alleviate the need for style images by using the CLIP model as the stylization guidance. We conduct an extensive set of experiments across style, attribute, and shape manipulations, and meanwhile measure the 3D consistency. These experiments demonstrate the superior capability of our HyperStyle3D model in rendering 3D-consistent images in diverse styles, deforming the face shape, and editing various attributes.
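
As a rough illustration of the hyper-network idea, the sketch below maps a style embedding (a stand-in for a CLIP text feature) to additive offsets for one generator layer's weights in a single forward pass; the MLP architecture, the 0.1 scaling factor, and the choice of layer are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class HyperNetwork(nn.Module):
    """Map a style embedding (e.g., a CLIP text feature) to additive
    offsets for one generator layer's weight in a single forward pass."""
    def __init__(self, style_dim: int, out_features: int, in_features: int):
        super().__init__()
        self.out_shape = (out_features, in_features)
        self.mlp = nn.Sequential(
            nn.Linear(style_dim, 256), nn.ReLU(),
            nn.Linear(256, out_features * in_features))

    def forward(self, style: torch.Tensor) -> torch.Tensor:
        return self.mlp(style).view(*self.out_shape)

# usage: shift one generator layer toward a style without retraining it
gen_layer = nn.Linear(512, 512)
hyper = HyperNetwork(style_dim=512, out_features=512, in_features=512)
style_emb = torch.randn(512)                 # stand-in for a CLIP embedding
stylized_weight = gen_layer.weight + 0.1 * hyper(style_emb)
```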
