Kaiming He

Fractal Generative Models

Feb 25, 2025

Tianhong Li, Qinyi Sun, Lijie Fan, Kaiming He

Abstract: Modularization is a cornerstone of computer science, abstracting complex functions into atomic building blocks. In this paper, we introduce a new level of modularization by abstracting generative models into atomic generative modules. Analogous to fractals in mathematics, our method constructs a new type of generative model by recursively invoking atomic generative modules, resulting in self-similar fractal architectures that we call fractal generative models. As a running example, we instantiate our fractal framework using autoregressive models as the atomic generative modules and examine it on the challenging task of pixel-by-pixel image generation, demonstrating strong performance in both likelihood estimation and generation quality. We hope this work could open a new paradigm in generative modeling and provide a fertile ground for future research. Code is available at https://github.com/LTH14/fractalgen.
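To make the idea of recursively invoking an atomic generative module concrete, here is a toy sketch. It is not the authors' implementation (that lives in the linked repository and uses learned autoregressive models); `atomic_generator`, `fractal_generate`, and the `depth` and `branching` parameters are hypothetical names chosen for illustration, and the "module" is only a random recurrence standing in for a trained network.

```python
import random


def atomic_generator(condition, num_outputs, rng):
    """Toy stand-in for an atomic generative module (assumption: a real
    module would be a learned autoregressive model). It emits num_outputs
    values one at a time, each depending on the previously emitted value."""
    outputs = []
    prev = condition
    for _ in range(num_outputs):
        # Autoregressive step: the next value depends on the previous one.
        prev = 0.5 * prev + 0.5 * rng.random()
        outputs.append(prev)
    return outputs


def fractal_generate(condition, depth, branching, rng):
    """Recursively invoke the same module at every level. Each output of one
    level becomes the condition for a child call at the next level, so the
    result is a flat list of branching ** depth leaf values ("pixels")."""
    if depth == 0:
        return [condition]
    leaves = []
    for child_condition in atomic_generator(condition, branching, rng):
        leaves.extend(fractal_generate(child_condition, depth - 1, branching, rng))
    return leaves


rng = random.Random(0)
pixels = fractal_generate(0.5, depth=3, branching=4, rng=rng)
print(len(pixels))
```

With `depth=3` and `branching=4`, three nested levels of the same four-output module yield 4 ** 3 = 64 leaf values, which illustrates how a small module reused recursively can cover a large output.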

Is Noise Conditioning Necessary for Denoising Generative Models?

Feb 18, 2025

Qiao Sun, Zhicheng Jiang, Hanhong Zhao, Kaiming He

Abstract: It is widely believed that noise conditioning is indispensable for denoising diffusion models to work successfully. This work challenges this belief. Motivated by research on blind image denoising, we investigate a variety of denoising-based generative models in the absence of noise conditioning. To our surprise, most models exhibit graceful degradation, and in some cases, they even perform better without noise conditioning. We provide a theoretical analysis of the error caused by removing noise conditioning and demonstrate that our analysis aligns with empirical observations. We further introduce a noise-unconditional model that achieves a competitive FID of 2.23 on CIFAR-10, significantly narrowing the gap to leading noise-conditional models. We hope our findings will inspire the community to revisit the foundations and formulations of denoising generative models.

Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens

Oct 17, 2024

Lijie Fan, Tianhong Li, Siyang Qin, Yuanzhen Li, Chen Sun, Michael Rubinstein, Deqing Sun, Kaiming He, Yonglong Tian

Abstract: Scaling up autoregressive models in vision has not proven as beneficial as in large language models. In this work, we investigate this scaling problem in the context of text-to-image generation, focusing on two critical factors: whether models use discrete or continuous tokens, and whether tokens are generated in a random or fixed raster order using BERT- or GPT-like transformer architectures. Our empirical results show that, while all models scale effectively in terms of validation loss, their evaluation performance -- measured by FID, GenEval score, and visual quality -- follows different trends. Models based on continuous tokens achieve significantly better visual quality than those using discrete tokens. Furthermore, the generation order and attention mechanisms significantly affect the GenEval score: random-order models achieve notably better GenEval scores compared to raster-order models. Inspired by these findings, we train Fluid, a random-order autoregressive model on continuous tokens. The Fluid 10.5B model achieves a new state-of-the-art zero-shot FID of 6.16 on MS-COCO 30K, and a 0.69 overall score on the GenEval benchmark. We hope our findings and results will encourage future efforts to further bridge the scaling gap between vision and language models.

* Tech report

Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers

Sep 30, 2024

Lirui Wang, Xinlei Chen, Jialiang Zhao, Kaiming He

Abstract: One of the roadblocks for training generalist robotic models today is heterogeneity. Previous robot learning methods often collect data to train with one specific embodiment for one task, which is expensive and prone to overfitting. This work studies the problem of learning policy representations through heterogeneous pre-training on robot data across different embodiments and tasks at scale. We propose Heterogeneous Pre-trained Transformers (HPT), which pre-train a large, shareable trunk of a policy neural network to learn a task- and embodiment-agnostic shared representation. This general architecture aligns the specific proprioception and vision inputs from distinct embodiments to a short sequence of tokens and then processes such tokens to map to control robots for different tasks. Leveraging the recent large-scale multi-embodiment real-world robotic datasets as well as simulation, deployed robots, and human video datasets, we investigate pre-training policies across heterogeneity. We conduct experiments to investigate the scaling behaviors of training objectives, to the extent of 52 datasets. HPTs outperform several baselines and enhance the fine-tuned policy performance by over 20% on unseen tasks in multiple simulator benchmarks and real-world settings. See the project website (https://liruiw.github.io/hpt/) for code and videos.

* NeurIPS 2024

Autoregressive Image Generation without Vector Quantization

Jun 17, 2024

Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, Kaiming He

Abstract: Conventional wisdom holds that autoregressive models for image generation are typically accompanied by vector-quantized tokens. We observe that while a discrete-valued space can facilitate representing a categorical distribution, it is not a necessity for autoregressive modeling. In this work, we propose to model the per-token probability distribution using a diffusion procedure, which allows us to apply autoregressive models in a continuous-valued space. Rather than using categorical cross-entropy loss, we define a Diffusion Loss function to model the per-token probability. This approach eliminates the need for discrete-valued tokenizers. We evaluate its effectiveness across a wide range of cases, including standard autoregressive models and generalized masked autoregressive (MAR) variants. By removing vector quantization, our image generator achieves strong results while enjoying the speed advantage of sequence modeling. We hope this work will motivate the use of autoregressive generation in other continuous-valued domains and applications.

* Tech report
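
The abstract above describes replacing categorical cross-entropy with a per-token Diffusion Loss. The sketch below shows one plausible shape for such a loss, assuming the standard noise-prediction objective from denoising diffusion models; it is not taken from the paper's code. The names `diffusion_loss`, `toy_predictor`, and `alphas_cumprod`, the linear noise schedule, and the three-dimensional token are all illustrative assumptions.

```python
import math
import random


def diffusion_loss(x, z, noise_predictor, alphas_cumprod, rng):
    """Single-sample estimate of a noise-prediction loss for one token.

    x: continuous-valued token (list of floats).
    z: conditioning vector, assumed to come from the autoregressive model.
    noise_predictor: callable (x_t, t, z) -> predicted noise (hypothetical).
    """
    t = rng.randrange(len(alphas_cumprod))     # random diffusion timestep
    a_bar = alphas_cumprod[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x]     # Gaussian noise to be predicted
    # Forward process: blend the clean token with noise.
    x_t = [math.sqrt(a_bar) * xi + math.sqrt(1.0 - a_bar) * ei
           for xi, ei in zip(x, eps)]
    eps_hat = noise_predictor(x_t, t, z)
    # Mean squared error between true and predicted noise.
    return sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat)) / len(x)


def toy_predictor(x_t, t, z):
    # Placeholder only: a real predictor would be a small learned network.
    return [xi - zi for xi, zi in zip(x_t, z)]


# Assumed linear beta schedule with 100 steps.
betas = [1e-4 + (0.02 - 1e-4) * i / 99 for i in range(100)]
alphas_cumprod = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alphas_cumprod.append(running)

rng = random.Random(0)
loss = diffusion_loss([0.3, -1.2, 0.7], [0.1, 0.0, -0.2],
                      toy_predictor, alphas_cumprod, rng)
print(loss)
```

Because the loss depends on the token only through its continuous value and the conditioning vector, nothing in it requires a discrete vocabulary, which is the property the abstract emphasizes.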

Physically Compatible 3D Object Modeling from a Single Image

Jun 03, 2024

Minghao Guo, Bohan Wang, Pingchuan Ma, Tianyuan Zhang, Crystal Elaine Owens, Chuang Gan, Joshua B. Tenenbaum, Kaiming He, Wojciech Matusik

Abstract: We present a computational framework that transforms single images into 3D physical objects. The visual geometry of a physical object in an image is determined by three orthogonal attributes: mechanical properties, external forces, and rest-shape geometry. Existing single-view 3D reconstruction methods often overlook this underlying composition, presuming rigidity or neglecting external forces. Consequently, the reconstructed objects fail to withstand real-world physical forces, resulting in instability or undesirable deformation -- diverging from their intended designs as depicted in the image. Our optimization framework addresses this by embedding physical compatibility into the reconstruction process. We explicitly decompose the three physical attributes and link them through static equilibrium, which serves as a hard constraint, ensuring that the optimized physical shapes exhibit desired physical behaviors. Evaluations on a dataset collected from Objaverse demonstrate that our framework consistently enhances the physical realism of 3D models over existing methods. The utility of our framework extends to practical applications in dynamic simulations and 3D printing, where adherence to physical compatibility is paramount.

TetSphere Splatting: Representing High-Quality Geometry with Lagrangian Volumetric Meshes

May 30, 2024

Minghao Guo, Bohan Wang, Kaiming He, Wojciech Matusik

Abstract: We present TetSphere splatting, an explicit, Lagrangian representation for reconstructing 3D shapes with high-quality geometry. In contrast to conventional object reconstruction methods which predominantly use Eulerian representations, including both neural implicit (e.g., NeRF, NeuS) and explicit representations (e.g., DMTet), and often struggle with high computational demands and suboptimal mesh quality, TetSphere splatting utilizes an underused but highly effective geometric primitive -- tetrahedral meshes. This approach directly yields superior mesh quality without relying on neural networks or post-processing. It deforms multiple initial tetrahedral spheres to accurately reconstruct the 3D shape through a combination of differentiable rendering and geometric energy optimization, resulting in significant computational efficiency. Serving as a robust and versatile geometry representation, TetSphere splatting seamlessly integrates into diverse applications, including single-view 3D reconstruction and image-/text-to-3D content generation. Experimental results demonstrate that TetSphere splatting outperforms existing representations, delivering faster optimization speed, enhanced mesh quality, and reliable preservation of thin structures.

Dynamic Inhomogeneous Quantum Resource Scheduling with Reinforcement Learning

May 25, 2024

Linsen Li, Pratyush Anand, Kaiming He, Dirk Englund

Abstract: A central challenge in quantum information science and technology is achieving real-time estimation and feedforward control of quantum systems. This challenge is compounded by the inherent inhomogeneity of quantum resources, such as qubit properties and controls, and their intrinsically probabilistic nature. This leads to stochastic challenges in error detection and probabilistic outcomes in processes such as heralded remote entanglement. Given these complexities, optimizing the construction of quantum resource states is an NP-hard problem. In this paper, we address the quantum resource scheduling issue by formulating the problem and simulating it within a digitized environment, allowing the exploration and development of agent-based optimization strategies. We employ reinforcement learning agents within this probabilistic setting and introduce a new framework utilizing a Transformer model that emphasizes self-attention mechanisms for pairs of qubits. This approach facilitates dynamic scheduling by providing real-time, next-step guidance. Our method significantly improves the performance of quantum systems, achieving more than a 3× improvement over rule-based agents, and establishes an innovative framework that improves the joint design of physical and control systems for quantum applications in communication, networking, and computing.

A Decade's Battle on Dataset Bias: Are We There Yet?

Mar 13, 2024

Zhuang Liu, Kaiming He

Abstract: We revisit the "dataset classification" experiment suggested by Torralba and Efros a decade ago, in the new era with large-scale, diverse, and hopefully less biased datasets as well as more capable neural network architectures. Surprisingly, we observe that modern neural networks can achieve excellent accuracy in classifying which dataset an image is from: e.g., we report 84.7% accuracy on held-out validation data for the three-way classification problem consisting of the YFCC, CC, and DataComp datasets. Our further experiments show that such a dataset classifier could learn semantic features that are generalizable and transferable, which cannot be simply explained by memorization. We hope our discovery will inspire the community to rethink the issue involving dataset bias and model capabilities.

Deconstructing Denoising Diffusion Models for Self-Supervised Learning

Jan 25, 2024

Xinlei Chen, Zhuang Liu, Saining Xie, Kaiming He

Abstract: In this study, we examine the representation learning abilities of Denoising Diffusion Models (DDM) that were originally purposed for image generation. Our philosophy is to deconstruct a DDM, gradually transforming it into a classical Denoising Autoencoder (DAE). This deconstructive procedure allows us to explore how various components of modern DDMs influence self-supervised representation learning. We observe that only a very few modern components are critical for learning good representations, while many others are nonessential. Our study ultimately arrives at an approach that is highly simplified and to a large extent resembles a classical DAE. We hope our study will rekindle interest in a family of classical methods within the realm of modern self-supervised learning.

* Technical report, 10 pages
