Gang Yu

Department of Biomedical Engineering, School of Basic Medical Sciences, Central South University, Changsha, China

StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data

Aug 20, 2023

Yanda Li, Chi Zhang, Gang Yu, Zhibin Wang, Bin Fu, Guosheng Lin, Chunhua Shen, Ling Chen, Yunchao Wei

Abstract: The remarkable multimodal capabilities demonstrated by OpenAI's GPT-4 have sparked significant interest in the development of multimodal Large Language Models (LLMs). A primary research objective of such models is to align visual and textual modalities effectively while comprehending human instructions. Current methodologies often rely on annotations derived from benchmark datasets to construct image-dialogue datasets for training purposes, akin to instruction tuning in LLMs. However, these datasets often exhibit domain bias, potentially constraining the generative capabilities of the models. In an effort to mitigate these limitations, we propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning. This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models to yield a diverse and controllable dataset with varied image content. This not only provides greater flexibility compared to existing methodologies but also significantly enhances several model capabilities. Our research includes comprehensive experiments conducted on various datasets using the open-source LLaVA model as a testbed for our proposed pipeline. Our results underscore marked enhancements across more than ten commonly assessed capabilities,

* Project page: https://github.com/icoz69/StableLLAVA

Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image

Jul 20, 2023

Wei Yin, Chi Zhang, Hao Chen, Zhipeng Cai, Gang Yu, Kaixuan Wang, Xiaozhi Chen, Chunhua Shen

Abstract: Reconstructing accurate 3D scenes from images is a long-standing vision task. Due to the ill-posedness of the single-image reconstruction problem, most well-established methods are built upon multi-view geometry. State-of-the-art (SOTA) monocular metric depth estimation methods can only handle a single camera model and are unable to perform mixed-data training due to the metric ambiguity. Meanwhile, SOTA monocular methods trained on large mixed datasets achieve zero-shot generalization by learning affine-invariant depths, which cannot recover real-world metrics. In this work, we show that the key to a zero-shot single-view metric depth model lies in the combination of large-scale data training and resolving the metric ambiguity from various camera models. We propose a canonical camera space transformation module, which explicitly addresses the ambiguity problems and can be effortlessly plugged into existing monocular models. Equipped with our module, monocular models can be stably trained with over 8 million images with thousands of camera models, resulting in zero-shot generalization to in-the-wild images with unseen camera settings. Experiments demonstrate SOTA performance of our method on 7 zero-shot benchmarks. Notably, our method won the championship in the 2nd Monocular Depth Estimation Challenge. Our method enables the accurate recovery of metric 3D structures on randomly collected internet images, paving the way for plausible single-image metrology. The potential benefits extend to downstream tasks, which can be significantly improved by simply plugging in our model. For example, our model relieves the scale drift issues of monocular SLAM (Fig. 1), leading to high-quality metric-scale dense mapping. The code is available at https://github.com/YvanYin/Metric3D.

* Accepted to ICCV 2023. Won the championship in the 2nd Monocular Depth Estimation Challenge. The code is available at https://github.com/YvanYin/Metric3D
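
The abstract does not spell out how the canonical camera space transformation works. One simple reading, assuming a pinhole camera in which the metric ambiguity comes from differing focal lengths, is to rescale depth labels by the ratio of a fixed canonical focal length to the true focal length, and to undo that scaling at inference. The sketch below illustrates only that idea; the function names and the 1000-pixel canonical focal length are assumptions made for illustration and are not taken from the Metric3D code.

```python
# Illustrative sketch only: focal-length-based depth rescaling.
# Function names and the canonical focal length are hypothetical.

def to_canonical_depth(depth_m, focal_px, canonical_focal_px=1000.0):
    """Scale a metric depth map (nested lists, metres) into canonical camera space."""
    scale = canonical_focal_px / focal_px
    return [[d * scale for d in row] for row in depth_m]


def from_canonical_depth(canonical_depth, focal_px, canonical_focal_px=1000.0):
    """Invert the scaling to recover metric depth for the real camera."""
    scale = focal_px / canonical_focal_px
    return [[d * scale for d in row] for row in canonical_depth]


if __name__ == "__main__":
    depth = [[2.0, 4.0]]                      # metres, camera with a 500 px focal length
    canonical = to_canonical_depth(depth, focal_px=500.0)
    restored = from_canonical_depth(canonical, focal_px=500.0)
    print(canonical, restored)
```

Under this assumption, depth maps from cameras with different focal lengths are expressed as if captured by one shared camera, which is what would make mixed-data training consistent.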

Michelangelo: Conditional 3D Shape Generation based on Shape-Image-Text Aligned Latent Representation

Jul 03, 2023

Zibo Zhao, Wen Liu, Xin Chen, Xianfang Zeng, Rui Wang, Pei Cheng, Bin Fu, Tao Chen, Gang Yu, Shenghua Gao

Abstract: We present a novel alignment-before-generation approach to tackle the challenging task of generating general 3D shapes based on 2D images or texts. Directly learning a conditional generative model from images or texts to 3D shapes is prone to producing results inconsistent with the conditions because 3D shapes have an additional dimension whose distribution significantly differs from that of 2D images and texts. To bridge the domain gap among the three modalities and facilitate multi-modal-conditioned 3D shape generation, we explore representing 3D shapes in a shape-image-text-aligned space. Our framework comprises two models: a Shape-Image-Text-Aligned Variational Auto-Encoder (SITA-VAE) and a conditional Aligned Shape Latent Diffusion Model (ASLDM). The former model encodes the 3D shapes into the shape latent space aligned to the image and text and reconstructs the fine-grained 3D neural fields corresponding to given shape embeddings via the transformer-based decoder. The latter model learns a probabilistic mapping function from the image or text space to the latent shape space. Our extensive experiments demonstrate that our proposed approach can generate higher-quality and more diverse 3D shapes that better semantically conform to the visual or textual conditional inputs, validating the effectiveness of the shape-image-text-aligned space for cross-modality 3D shape generation.

* Project Website: https://neuralcarver.github.io/michelangelo

MotionGPT: Human Motion as a Foreign Language

Jun 26, 2023

Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, Tao Chen

Abstract: Although pre-trained large language models continue to advance, building a unified model for language and other multi-modal data, such as motion, remains challenging and largely unexplored. Fortunately, human motion displays a semantic coupling akin to human language, often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks. Specifically, we employ discrete vector quantization for human motion and convert 3D motion into motion tokens, similar to the generation process of word tokens. Building upon this "motion vocabulary", we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performance on multiple motion tasks including text-driven motion generation, motion captioning, motion prediction, and motion in-betweening.

* https://github.com/OpenMotionLab/MotionGPT
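
To make the "motion vocabulary" idea concrete, the sketch below shows the lookup step of vector quantization: each motion feature vector is replaced by the index of its nearest codebook entry, giving a sequence of discrete tokens. This is a minimal illustration, not MotionGPT's implementation; in practice the feature vectors would come from a learned encoder and the codebook would be learned, whereas the tiny codebook, the 2D features, and the function names here are made up for the example.

```python
# Illustrative sketch only: nearest-neighbour codebook lookup.
# The codebook, feature vectors, and function names are hypothetical.

def quantize(frames, codebook):
    """Map each feature vector to the index of its closest codebook entry."""
    tokens = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every code vector.
        dists = [sum((a - b) ** 2 for a, b in zip(frame, code)) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def dequantize(tokens, codebook):
    """Map token indices back to their code vectors."""
    return [codebook[t] for t in tokens]


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
    frames = [[0.1, -0.1], [0.9, 1.2], [1.8, 0.1]]
    tokens = quantize(frames, codebook)
    print(tokens, dequantize(tokens, codebook))
```

Once motion is expressed as integer tokens like these, it can share a vocabulary with text tokens and be modeled by the same language model.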

STAR Loss: Reducing Semantic Ambiguity in Facial Landmark Detection

Jun 05, 2023

Zhenglin Zhou, Huaxia Li, Hong Liu, Nanyang Wang, Gang Yu, Rongrong Ji

Abstract: Recently, deep learning-based facial landmark detection has achieved significant improvement. However, the semantic ambiguity problem degrades detection performance. Specifically, semantic ambiguity causes inconsistent annotation and negatively affects the model's convergence, leading to lower accuracy and unstable predictions. To solve this problem, we propose a Self-adapTive Ambiguity Reduction (STAR) loss by exploiting the properties of semantic ambiguity. We find that semantic ambiguity results in an anisotropic predicted distribution, which inspires us to use the predicted distribution to represent semantic ambiguity. Based on this, we design the STAR loss that measures the anisotropy of the predicted distribution. Compared with the standard regression loss, STAR loss is encouraged to be small when the predicted distribution is anisotropic and thus adaptively mitigates the impact of semantic ambiguity. Moreover, we propose two kinds of eigenvalue restriction methods that could avoid both abnormal changes in the distribution and premature convergence of the model. Finally, comprehensive experiments demonstrate that STAR loss outperforms the state-of-the-art methods on three benchmarks, i.e., COFW, 300W, and WFLW, with negligible computation overhead. Code is at https://github.com/ZhenglinZhou/STAR.

* 14 pages, 7 figures, accepted by CVPR 2023
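
The abstract ties semantic ambiguity to anisotropy of the predicted distribution and mentions eigenvalues, but does not give the loss itself. As a rough illustration of how anisotropy can be measured, and not the STAR loss, the sketch below treats a heatmap as a 2D distribution, computes its covariance, and returns the two eigenvalues in closed form; a large gap between them indicates an elongated, anisotropic distribution. All function names are hypothetical.

```python
# Illustrative sketch only: measuring anisotropy of a 2D heatmap.
# This is not the STAR loss; function names are hypothetical.
import math


def heatmap_covariance(heatmap):
    """Return (cxx, cxy, cyy) of a non-negative heatmap treated as a distribution."""
    total = sum(sum(row) for row in heatmap)
    cells = [(x, y, p / total) for y, row in enumerate(heatmap) for x, p in enumerate(row)]
    mean_x = sum(x * w for x, _, w in cells)
    mean_y = sum(y * w for _, y, w in cells)
    cxx = sum(w * (x - mean_x) ** 2 for x, _, w in cells)
    cyy = sum(w * (y - mean_y) ** 2 for _, y, w in cells)
    cxy = sum(w * (x - mean_x) * (y - mean_y) for x, y, w in cells)
    return cxx, cxy, cyy


def eigenvalues_2x2(cxx, cxy, cyy):
    """Eigenvalues of the symmetric matrix [[cxx, cxy], [cxy, cyy]], largest first."""
    half_trace = (cxx + cyy) / 2.0
    root = math.sqrt(((cxx - cyy) / 2.0) ** 2 + cxy ** 2)
    return half_trace + root, half_trace - root


if __name__ == "__main__":
    elongated = [[0, 0, 0], [1, 1, 1], [0, 0, 0]]   # mass spread along x only
    print(eigenvalues_2x2(*heatmap_covariance(elongated)))
```

The eigenvectors of the same matrix would give the directions of greatest and least spread, which is where an ambiguity-aware loss could down-weight errors.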

Synchro-Transient-Extracting Transform for the Analysis of Signals with Both Harmonic and Impulsive Components

Jun 02, 2023

Yunlong Ma, Gang Yu, Tianran Lin, Qingtang Jiang

Abstract: Time-frequency analysis (TFA) techniques play an increasingly important role in the field of machine fault diagnosis owing to their superiority in dealing with nonstationary signals. Synchroextracting transform (SET) and transient-extracting transform (TET) are two newly emerging techniques that can produce energy-concentrated representations for nonstationary signals. However, SET and TET are only suitable for processing harmonic signals and impulsive signals, respectively. This poses a challenge for each of these two techniques when a signal contains both harmonic and impulsive components. In this paper, we propose a new TFA technique to solve this problem. The technique aims to combine the advantages of SET and TET to generate energy-concentrated representations for both harmonic and impulsive components of the signal. Furthermore, we theoretically demonstrate that the proposed technique retains the signal reconstruction capability. The effectiveness of the proposed technique is verified using numerical and real-world signals.

Disentangled Pre-training for Image Matting

Apr 03, 2023

Yanda Li, Zilong Huang, Gang Yu, Ling Chen, Yunchao Wei, Jianbo Jiao

Abstract: In recent literature, image matting requires high-quality pixel-level human annotations to support the training of a deep model. Such annotation, however, is costly and hard to scale, significantly holding back the development of the research. In this work, we make the first attempt towards addressing this problem, by proposing a self-supervised pre-training approach that can leverage an unlimited amount of data to boost the matting performance. The pre-training task is designed in a manner similar to image matting, where a random trimap and alpha matte are generated to achieve an image disentanglement objective. The pre-trained model is then used as an initialisation of the downstream matting task for fine-tuning. Extensive experimental evaluations show that the proposed approach outperforms both the state-of-the-art matting methods and other alternative self-supervised initialisation approaches by a large margin. We also show the robustness of the proposed approach over different backbone architectures. The code and models will be publicly available.

Capturing the motion of every joint: 3D human pose and shape estimation with independent tokens

Mar 01, 2023

Sen Yang, Wen Heng, Gang Liu, Guozhong Luo, Wankou Yang, Gang Yu

Abstract: In this paper we present a novel method to estimate 3D human pose and shape from monocular videos. This task requires directly recovering pixel-aligned 3D human pose and body shape from monocular images or videos, which is challenging due to its inherent ambiguity. To improve precision, existing methods rely heavily on the initialized mean pose and shape as prior estimates and regress parameters through iterative error feedback. In addition, video-based approaches model the overall change over the image-level features to temporally enhance the single-frame feature, but fail to capture the rotational motion at the joint level, and cannot guarantee local temporal consistency. To address these issues, we propose a novel Transformer-based model with a design of independent tokens. First, we introduce three types of tokens independent of the image feature: joint rotation tokens, shape token, and camera token. By progressively interacting with image features through Transformer layers, these tokens learn to encode the prior knowledge of human 3D joint rotations, body shape, and position information from large-scale data, and are updated to estimate SMPL parameters conditioned on a given image. Second, benefiting from the proposed token-based representation, we further use a temporal model to focus on capturing the rotational temporal information of each joint, which is empirically conducive to preventing large jitters in local parts. Despite being conceptually simple, the proposed method attains superior performance on the 3DPW and Human3.6M datasets. Using ResNet-50 and Transformer architectures, it obtains 42.0 mm error on the PA-MPJPE metric of the challenging 3DPW, outperforming state-of-the-art counterparts by a large margin. Code will be publicly available at https://github.com/yangsenius/INT_HMR_Model

* 17 pages, 12 figures. ICLR 2023 (spotlight)

SeaFormer: Squeeze-enhanced Axial Transformer for Mobile Semantic Segmentation

Feb 09, 2023

Qiang Wan, Zilong Huang, Jiachen Lu, Gang Yu, Li Zhang

Abstract: Since the introduction of Vision Transformers, the landscape of many computer vision tasks (e.g., semantic segmentation), which had been overwhelmingly dominated by CNNs, has recently been significantly revolutionized. However, the computational cost and memory requirement render these methods unsuitable for mobile devices, especially for the high-resolution per-pixel semantic segmentation task. In this paper, we introduce a new method, the squeeze-enhanced Axial TransFormer (SeaFormer), for mobile semantic segmentation. Specifically, we design a generic attention block characterized by the formulation of squeeze Axial and detail enhancement. It can be further used to create a family of backbone architectures with superior cost-effectiveness. Coupled with a light segmentation head, we achieve the best trade-off between segmentation accuracy and latency on ARM-based mobile devices on the ADE20K and Cityscapes datasets. Critically, we beat both the mobile-friendly rivals and Transformer-based counterparts with better performance and lower latency without bells and whistles. Beyond semantic segmentation, we further apply the proposed SeaFormer architecture to the image classification problem, demonstrating its potential to serve as a versatile mobile-friendly backbone.

* ICLR 2023

Executing your Commands via Motion Diffusion in Latent Space

Dec 08, 2022

Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, Jingyi Yu, Gang Yu

Abstract: We study a challenging task, conditional human motion generation, which produces plausible human motion sequences according to various conditional inputs, such as action classes or textual descriptors. Since human motions are highly diverse and follow a distribution quite different from that of conditional modalities, such as textual descriptors in natural languages, it is hard to learn a probabilistic mapping from the desired conditional modality to the human motion sequences. Besides, the raw motion data from the motion capture system might be redundant in sequences and contain noise; directly modeling the joint distribution over the raw motion sequences and conditional modalities would need a heavy computational overhead and might result in artifacts introduced by the captured noise. To learn a better representation of the various human motion sequences, we first design a powerful Variational AutoEncoder (VAE) and arrive at a representative and low-dimensional latent code for a human motion sequence. Then, instead of using a diffusion model to establish the connections between the raw motion sequences and the conditional inputs, we perform a diffusion process on the motion latent space. Our proposed Motion Latent-based Diffusion model (MLD) could produce vivid motion sequences conforming to the given conditional inputs and substantially reduce the computational overhead in both the training and inference stages. Extensive experiments on various human motion generation tasks demonstrate that our MLD achieves significant improvements over the state-of-the-art methods, while being two orders of magnitude faster than previous diffusion models on raw motion sequences.

* 18 pages, 11 figures, conference
