Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ping Luo

Learning Humanoid Locomotion with Perceptive Internal Model

Nov 21, 2024

Junfeng Long, Junli Ren, Moji Shi, Zirui Wang, Tao Huang, Ping Luo, Jiangmiao Pang

Figure 1 for Learning Humanoid Locomotion with Perceptive Internal Model

Figure 2 for Learning Humanoid Locomotion with Perceptive Internal Model

Figure 3 for Learning Humanoid Locomotion with Perceptive Internal Model

Figure 4 for Learning Humanoid Locomotion with Perceptive Internal Model

Abstract:In contrast to quadruped robots that can navigate diverse terrains using a "blind" policy, humanoid robots require accurate perception for stable locomotion due to their high degrees of freedom and inherently unstable morphology. However, incorporating perceptual signals often introduces additional disturbances to the system, potentially reducing its robustness, generalizability, and efficiency. This paper presents the Perceptive Internal Model (PIM), which relies on onboard, continuously updated elevation maps centered around the robot to perceive its surroundings. We train the policy using ground-truth obstacle heights surrounding the robot in simulation, optimizing it based on the Hybrid Internal Model (HIM), and perform inference with heights sampled from the constructed elevation map. Unlike previous methods that directly encode depth maps or raw point clouds, our approach allows the robot to perceive the terrain beneath its feet clearly and is less affected by camera movement or noise. Furthermore, since depth map rendering is not required in simulation, our method introduces minimal additional computational costs and can train the policy in 3 hours on an RTX 4090 GPU. We verify the effectiveness of our method across various humanoid robots, various indoor and outdoor terrains, stairs, and various sensor configurations. Our method can enable a humanoid robot to continuously climb stairs and has the potential to serve as a foundational algorithm for the development of future humanoid control methods.

* submitted to ICRA2025

Via

Access Paper or Ask Questions

Autoregressive Models in Vision: A Survey

Nov 08, 2024

Jing Xiong, Gongye Liu, Lun Huang, Chengyue Wu, Taiqiang Wu, Yao Mu, Yuan Yao, Hui Shen, Zhongwei Wan, Jinfa Huang(+10 more)

Figure 1 for Autoregressive Models in Vision: A Survey

Figure 2 for Autoregressive Models in Vision: A Survey

Figure 3 for Autoregressive Models in Vision: A Survey

Figure 4 for Autoregressive Models in Vision: A Survey

Abstract:Autoregressive modeling has been a huge success in the field of natural language processing (NLP). Recently, autoregressive models have emerged as a significant area of focus in computer vision, where they excel in producing high-quality visual content. Autoregressive models in NLP typically operate on subword tokens. However, the representation strategy in computer vision can vary in different levels, \textit{i.e.}, pixel-level, token-level, or scale-level, reflecting the diverse and hierarchical nature of visual data compared to the sequential structure of language. This survey comprehensively examines the literature on autoregressive models applied to vision. To improve readability for researchers from diverse research backgrounds, we start with preliminary sequence representation and modeling in vision. Next, we divide the fundamental frameworks of visual autoregressive models into three general sub-categories, including pixel-based, token-based, and scale-based models based on the strategy of representation. We then explore the interconnections between autoregressive models and other generative models. Furthermore, we present a multi-faceted categorization of autoregressive models in computer vision, including image generation, video generation, 3D generation, and multi-modal generation. We also elaborate on their applications in diverse domains, including emerging domains such as embodied AI and 3D medical AI, with about 250 related references. Finally, we highlight the current challenges to autoregressive models in vision with suggestions about potential research directions. We have also set up a Github repository to organize the papers included in this survey at: \url{https://github.com/ChaofanTao/Autoregressive-Models-in-Vision-Survey}.

Via

Access Paper or Ask Questions

MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of Low-rank Experts

Oct 30, 2024

Jie Zhu, Yixiong Chen, Mingyu Ding, Ping Luo, Leye Wang, Jingdong Wang

Figure 1 for MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of Low-rank Experts

Figure 2 for MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of Low-rank Experts

Figure 3 for MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of Low-rank Experts

Figure 4 for MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of Low-rank Experts

Abstract:Text-to-image diffusion has attracted vast attention due to its impressive image-generation capabilities. However, when it comes to human-centric text-to-image generation, particularly in the context of faces and hands, the results often fall short of naturalness due to insufficient training priors. We alleviate the issue in this work from two perspectives. 1) From the data aspect, we carefully collect a human-centric dataset comprising over one million high-quality human-in-the-scene images and two specific sets of close-up images of faces and hands. These datasets collectively provide a rich prior knowledge base to enhance the human-centric image generation capabilities of the diffusion model. 2) On the methodological front, we propose a simple yet effective method called Mixture of Low-rank Experts (MoLE) by considering low-rank modules trained on close-up hand and face images respectively as experts. This concept draws inspiration from our observation of low-rank refinement, where a low-rank module trained by a customized close-up dataset has the potential to enhance the corresponding image part when applied at an appropriate scale. To validate the superiority of MoLE in the context of human-centric image generation compared to state-of-the-art, we construct two benchmarks and perform evaluations with diverse metrics and human studies. Datasets, model, and code are released at https://sites.google.com/view/mole4diffuser/.

* Published at NeurIPS 2024

Via

Access Paper or Ask Questions

CompGS: Unleashing 2D Compositionality for Compositional Text-to-3D via Dynamically Optimizing 3D Gaussians

Oct 28, 2024

Chongjian Ge, Chenfeng Xu, Yuanfeng Ji, Chensheng Peng, Masayoshi Tomizuka, Ping Luo, Mingyu Ding, Varun Jampani, Wei Zhan

Figure 1 for CompGS: Unleashing 2D Compositionality for Compositional Text-to-3D via Dynamically Optimizing 3D Gaussians

Figure 2 for CompGS: Unleashing 2D Compositionality for Compositional Text-to-3D via Dynamically Optimizing 3D Gaussians

Figure 3 for CompGS: Unleashing 2D Compositionality for Compositional Text-to-3D via Dynamically Optimizing 3D Gaussians

Figure 4 for CompGS: Unleashing 2D Compositionality for Compositional Text-to-3D via Dynamically Optimizing 3D Gaussians

Abstract:Recent breakthroughs in text-guided image generation have significantly advanced the field of 3D generation. While generating a single high-quality 3D object is now feasible, generating multiple objects with reasonable interactions within a 3D space, a.k.a. compositional 3D generation, presents substantial challenges. This paper introduces CompGS, a novel generative framework that employs 3D Gaussian Splatting (GS) for efficient, compositional text-to-3D content generation. To achieve this goal, two core designs are proposed: (1) 3D Gaussians Initialization with 2D compositionality: We transfer the well-established 2D compositionality to initialize the Gaussian parameters on an entity-by-entity basis, ensuring both consistent 3D priors for each entity and reasonable interactions among multiple entities; (2) Dynamic Optimization: We propose a dynamic strategy to optimize 3D Gaussians using Score Distillation Sampling (SDS) loss. CompGS first automatically decomposes 3D Gaussians into distinct entity parts, enabling optimization at both the entity and composition levels. Additionally, CompGS optimizes across objects of varying scales by dynamically adjusting the spatial parameters of each entity, enhancing the generation of fine-grained details, particularly in smaller entities. Qualitative comparisons and quantitative evaluations on T3Bench demonstrate the effectiveness of CompGS in generating compositional 3D objects with superior image quality and semantic alignment over existing methods. CompGS can also be easily extended to controllable 3D editing, facilitating scene generation. We hope CompGS will provide new insights to the compositional 3D generation. Project page: https://chongjiange.github.io/compgs.html.

Via

Access Paper or Ask Questions

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

Oct 17, 2024

Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan(+1 more)

Figure 1 for Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

Figure 2 for Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

Figure 3 for Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

Figure 4 for Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

Abstract:In this paper, we introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. Prior research often relies on a single visual encoder for both tasks, such as Chameleon. However, due to the differing levels of information granularity required by multimodal understanding and generation, this approach can lead to suboptimal performance, particularly in multimodal understanding. To address this issue, we decouple visual encoding into separate pathways, while still leveraging a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder's roles in understanding and generation, but also enhances the framework's flexibility. For instance, both the multimodal understanding and generation components can independently select their most suitable encoding methods. Experiments show that Janus surpasses previous unified model and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus make it a strong candidate for next-generation unified multimodal models.

* Technical Report

Via

Access Paper or Ask Questions

Analysis and Benchmarking of Extending Blind Face Image Restoration to Videos

Oct 15, 2024

Zhouxia Wang, Jiawei Zhang, Xintao Wang, Tianshui Chen, Ying Shan, Wenping Wang, Ping Luo

Figure 1 for Analysis and Benchmarking of Extending Blind Face Image Restoration to Videos

Figure 2 for Analysis and Benchmarking of Extending Blind Face Image Restoration to Videos

Figure 3 for Analysis and Benchmarking of Extending Blind Face Image Restoration to Videos

Figure 4 for Analysis and Benchmarking of Extending Blind Face Image Restoration to Videos

Abstract:Recent progress in blind face restoration has resulted in producing high-quality restored results for static images. However, efforts to extend these advancements to video scenarios have been minimal, partly because of the absence of benchmarks that allow for a comprehensive and fair comparison. In this work, we first present a fair evaluation benchmark, in which we first introduce a Real-world Low-Quality Face Video benchmark (RFV-LQ), evaluate several leading image-based face restoration algorithms, and conduct a thorough systematical analysis of the benefits and challenges associated with extending blind face image restoration algorithms to degraded face videos. Our analysis identifies several key issues, primarily categorized into two aspects: significant jitters in facial components and noise-shape flickering between frames. To address these issues, we propose a Temporal Consistency Network (TCN) cooperated with alignment smoothing to reduce jitters and flickers in restored videos. TCN is a flexible component that can be seamlessly plugged into the most advanced face image restoration algorithms, ensuring the quality of image-based restoration is maintained as closely as possible. Extensive experiments have been conducted to evaluate the effectiveness and efficiency of our proposed TCN and alignment smoothing operation. Project page: https://wzhouxiff.github.io/projects/FIR2FVR/FIR2FVR.

* IEEE Trans Image Process. 2024;33:5676-5687. Epub 2024 Oct 9. PMID: 39316481
* Accepted by TIP'2024; Project page: https://wzhouxiff.github.io/projects/FIR2FVR/FIR2FVR

Via

Access Paper or Ask Questions

Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping

Oct 11, 2024

Yue Yang, Shuibai Zhang, Wenqi Shao, Kaipeng Zhang, Yi Bin, Yu Wang, Ping Luo

Figure 1 for Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping

Figure 2 for Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping

Figure 3 for Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping

Figure 4 for Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping

Abstract:Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across multimodal tasks such as visual perception and reasoning, leading to good performance on various multimodal evaluation benchmarks. However, these benchmarks keep a static nature and overlap with the pre-training data, resulting in fixed complexity constraints and data contamination issues. This raises the concern regarding the validity of the evaluation. To address these two challenges, we introduce a dynamic multimodal evaluation protocol called Vision-Language Bootstrapping (VLB). VLB provides a robust and comprehensive assessment for LVLMs with reduced data contamination and flexible complexity. To this end, VLB dynamically generates new visual question-answering samples through a multimodal bootstrapping module that modifies both images and language, while ensuring that newly generated samples remain consistent with the original ones by a judge module. By composing various bootstrapping strategies, VLB offers dynamic variants of existing benchmarks with diverse complexities, enabling the evaluation to co-evolve with the ever-evolving capabilities of LVLMs. Extensive experimental results across multiple benchmarks, including SEEDBench, MMBench, and MME, show that VLB significantly reduces data contamination and exposes performance limitations of LVLMs.

Via

Access Paper or Ask Questions

DCP: Learning Accelerator Dataflow for Neural Network via Propagation

Oct 09, 2024

Peng Xu, Wenqi Shao, Mingyu Ding, Ping Luo

Figure 1 for DCP: Learning Accelerator Dataflow for Neural Network via Propagation

Figure 2 for DCP: Learning Accelerator Dataflow for Neural Network via Propagation

Figure 3 for DCP: Learning Accelerator Dataflow for Neural Network via Propagation

Figure 4 for DCP: Learning Accelerator Dataflow for Neural Network via Propagation

Abstract:Deep neural network (DNN) hardware (HW) accelerators have achieved great success in improving DNNs' performance and efficiency. One key reason is dataflow in executing a DNN layer, including on-chip data partitioning, computation parallelism, and scheduling policy, which have large impacts on latency and energy consumption. Unlike prior works that required considerable efforts from HW engineers to design suitable dataflows for different DNNs, this work proposes an efficient data-centric approach, named Dataflow Code Propagation (DCP), to automatically find the optimal dataflow for DNN layers in seconds without human effort. It has several attractive benefits that prior arts do not have. (i) We translate the HW dataflow configuration into a code representation in a unified dataflow coding space, which can be optimized by backpropagating gradients given a DNN layer or network. (ii) DCP learns a neural predictor to efficiently update the dataflow codes towards the desired gradient directions to minimize various optimization objectives e.g., latency and energy. (iii) It can be easily generalized to unseen HW configurations in a zero-shot or few-shot learning manner. For example, without using additional training data, DCP surpasses the GAMMA method that performs a full search using thousands of samples. Extensive experiments on several representative models such as MobileNet, ResNet, and ViT show that DCP outperforms its counterparts in various settings.

Via

Access Paper or Ask Questions

PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs

Oct 07, 2024

Mengzhao Chen, Yi Liu, Jiahao Wang, Yi Bin, Wenqi Shao, Ping Luo

Figure 1 for PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs

Figure 2 for PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs

Figure 3 for PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs

Figure 4 for PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs

Abstract:Quantization is essential for deploying Large Language Models (LLMs) by enhancing memory efficiency and inference speed. Existing methods for activation quantization mainly address channel-wise outliers, often neglecting token-wise outliers, leading to reliance on costly per-token dynamic quantization. To address this, we introduce PrefixQuant, a novel technique that isolates outlier tokens offline without re-training. Specifically, PrefixQuant identifies high-frequency outlier tokens and prefixes them in the KV cache, preventing the generation of outlier tokens during inference and simplifying quantization. To our knowledge, PrefixQuant is the first to enable efficient per-tensor static quantization to outperform expensive per-token dynamic quantization. For instance, in W4A4KV4 (4- bit weight, 4-bit activation, and 4-bit KV cache) Llama-3-8B, PrefixQuant with per-tensor static quantization achieves a 7.43 WikiText2 perplexity and 71.08% average accuracy on 5 common-sense reasoning tasks, outperforming previous per-token dynamic quantization methods like QuaRot with 0.98 perplexity improvement and +5.98 points accuracy. Additionally, the inference speed of W4A4 quantized models using PrefixQuant is 1.60x to 2.81x faster than FP16 models and exceeds QuaRot models by 1.2x to 1.3x. Our code is available at \url{https://github.com/ChenMnZ/PrefixQuant}.

* A PTQ method to significantly boost the performance of static activation quantization

Via

Access Paper or Ask Questions

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

Oct 07, 2024

Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, Ping Luo

Abstract:Text-to-video (T2V) models like Sora have made significant strides in visualizing complex prompts, which is increasingly viewed as a promising path towards constructing the universal world simulator. Cognitive psychologists believe that the foundation for achieving this goal is the ability to understand intuitive physics. However, the capacity of these models to accurately represent intuitive physics remains largely unexplored. To bridge this gap, we introduce PhyGenBench, a comprehensive \textbf{Phy}sics \textbf{Gen}eration \textbf{Ben}chmark designed to evaluate physical commonsense correctness in T2V generation. PhyGenBench comprises 160 carefully crafted prompts across 27 distinct physical laws, spanning four fundamental domains, which could comprehensively assesses models' understanding of physical commonsense. Alongside PhyGenBench, we propose a novel evaluation framework called PhyGenEval. This framework employs a hierarchical evaluation structure utilizing appropriate advanced vision-language models and large language models to assess physical commonsense. Through PhyGenBench and PhyGenEval, we can conduct large-scale automated assessments of T2V models' understanding of physical commonsense, which align closely with human feedback. Our evaluation results and in-depth analysis demonstrate that current models struggle to generate videos that comply with physical commonsense. Moreover, simply scaling up models or employing prompt engineering techniques is insufficient to fully address the challenges presented by PhyGenBench (e.g., dynamic scenarios). We hope this study will inspire the community to prioritize the learning of physical commonsense in these models beyond entertainment applications. We will release the data and codes at https://github.com/OpenGVLab/PhyGenBench

* Project Page: https://phygenbench123.github.io/

Via

Access Paper or Ask Questions