Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ke Li

Jack

NeRF Revisited: Fixing Quadrature Instability in Volume Rendering

Oct 31, 2023

Mikaela Angelina Uy, Kiyohiro Nakayama, Guandao Yang, Rahul Krishna Thomas, Leonidas Guibas, Ke Li

Figure 1 for NeRF Revisited: Fixing Quadrature Instability in Volume Rendering

Figure 2 for NeRF Revisited: Fixing Quadrature Instability in Volume Rendering

Figure 3 for NeRF Revisited: Fixing Quadrature Instability in Volume Rendering

Figure 4 for NeRF Revisited: Fixing Quadrature Instability in Volume Rendering

Abstract:Neural radiance fields (NeRF) rely on volume rendering to synthesize novel views. Volume rendering requires evaluating an integral along each ray, which is numerically approximated with a finite sum that corresponds to the exact integral along the ray under piecewise constant volume density. As a consequence, the rendered result is unstable w.r.t. the choice of samples along the ray, a phenomenon that we dub quadrature instability. We propose a mathematically principled solution by reformulating the sample-based rendering equation so that it corresponds to the exact integral under piecewise linear volume density. This simultaneously resolves multiple issues: conflicts between samples along different rays, imprecise hierarchical sampling, and non-differentiability of quantiles of ray termination distances w.r.t. model parameters. We demonstrate several benefits over the classical sample-based rendering equation, such as sharper textures, better geometric reconstruction, and stronger depth supervision. Our proposed formulation can be also be used as a drop-in replacement to the volume rendering equation of existing NeRF-based methods. Our project page can be found at pl-nerf.github.io.

* Neurips 2023

Via

Access Paper or Ask Questions

InstOptima: Evolutionary Multi-objective Instruction Optimization via Large Language Model-based Instruction Operators

Oct 26, 2023

Heng Yang, Ke Li

Abstract:Instruction-based language modeling has received significant attention in pretrained language models. However, the efficiency of instruction engineering remains low and hinders the development of instruction studies. Recent studies have focused on automating instruction generation, but they primarily aim to improve performance without considering other crucial objectives that impact instruction quality, such as instruction length and perplexity. Therefore, we propose a novel approach (i.e., InstOptima) that treats instruction generation as an evolutionary multi-objective optimization problem. In contrast to text edition-based methods, our approach utilizes a large language model (LLM) to simulate instruction operators, including mutation and crossover. Furthermore, we introduce an objective-guided mechanism for these operators, allowing the LLM to comprehend the objectives and enhance the quality of the generated instructions. Experimental results demonstrate improved fine-tuning performance and the generation of a diverse set of high-quality instructions.

* Accepted by EMNLP Findings

Via

Access Paper or Ask Questions

Woodpecker: Hallucination Correction for Multimodal Large Language Models

Oct 24, 2023

Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, Enhong Chen

Abstract:Hallucination is a big shadow hanging over the rapidly evolving Multimodal Large Language Models (MLLMs), referring to the phenomenon that the generated text is inconsistent with the image content. In order to mitigate hallucinations, existing studies mainly resort to an instruction-tuning manner that requires retraining the models with specific data. In this paper, we pave a different way, introducing a training-free method named Woodpecker. Like a woodpecker heals trees, it picks out and corrects hallucinations from the generated text. Concretely, Woodpecker consists of five stages: key concept extraction, question formulation, visual knowledge validation, visual claim generation, and hallucination correction. Implemented in a post-remedy manner, Woodpecker can easily serve different MLLMs, while being interpretable by accessing intermediate outputs of the five stages. We evaluate Woodpecker both quantitatively and qualitatively and show the huge potential of this new paradigm. On the POPE benchmark, our method obtains a 30.66%/24.33% improvement in accuracy over the baseline MiniGPT-4/mPLUG-Owl. The source code is released at https://github.com/BradyFU/Woodpecker.

* 16 pages, 7 figures. Code Website: https://github.com/BradyFU/Woodpecker

Via

Access Paper or Ask Questions

Solving Expensive Optimization Problems in Dynamic Environments with Meta-learning

Oct 19, 2023

Huan Zhang, Jinliang Ding, Liang Feng, Kay Chen Tan, Ke Li

Abstract:Dynamic environments pose great challenges for expensive optimization problems, as the objective functions of these problems change over time and thus require remarkable computational resources to track the optimal solutions. Although data-driven evolutionary optimization and Bayesian optimization (BO) approaches have shown promise in solving expensive optimization problems in static environments, the attempts to develop such approaches in dynamic environments remain rarely unexplored. In this paper, we propose a simple yet effective meta-learning-based optimization framework for solving expensive dynamic optimization problems. This framework is flexible, allowing any off-the-shelf continuously differentiable surrogate model to be used in a plug-in manner, either in data-driven evolutionary optimization or BO approaches. In particular, the framework consists of two unique components: 1) the meta-learning component, in which a gradient-based meta-learning approach is adopted to learn experience (effective model parameters) across different dynamics along the optimization process. 2) the adaptation component, where the learned experience (model parameters) is used as the initial parameters for fast adaptation in the dynamic environment based on few shot samples. By doing so, the optimization process is able to quickly initiate the search in a new environment within a strictly restricted computational budget. Experiments demonstrate the effectiveness of the proposed algorithm framework compared to several state-of-the-art algorithms on common benchmark test problems under different dynamic characteristics.

Via

Access Paper or Ask Questions

CAPro: Webly Supervised Learning with Cross-Modality Aligned Prototypes

Oct 15, 2023

Yulei Qin, Xingyu Chen, Yunhang Shen, Chaoyou Fu, Yun Gu, Ke Li, Xing Sun, Rongrong Ji

Figure 1 for CAPro: Webly Supervised Learning with Cross-Modality Aligned Prototypes

Figure 2 for CAPro: Webly Supervised Learning with Cross-Modality Aligned Prototypes

Figure 3 for CAPro: Webly Supervised Learning with Cross-Modality Aligned Prototypes

Figure 4 for CAPro: Webly Supervised Learning with Cross-Modality Aligned Prototypes

Abstract:Webly supervised learning has attracted increasing attention for its effectiveness in exploring publicly accessible data at scale without manual annotation. However, most existing methods of learning with web datasets are faced with challenges from label noise, and they have limited assumptions on clean samples under various noise. For instance, web images retrieved with queries of tiger cat (a cat species) and drumstick (a musical instrument) are almost dominated by images of tigers and chickens, which exacerbates the challenge of fine-grained visual concept learning. In this case, exploiting both web images and their associated texts is a requisite solution to combat real-world noise. In this paper, we propose Cross-modality Aligned Prototypes (CAPro), a unified prototypical contrastive learning framework to learn visual representations with correct semantics. For one thing, we leverage textual prototypes, which stem from the distinct concept definition of classes, to select clean images by text matching and thus disambiguate the formation of visual prototypes. For another, to handle missing and mismatched noisy texts, we resort to the visual feature space to complete and enhance individual texts and thereafter improve text matching. Such semantically aligned visual prototypes are further polished up with high-quality samples, and engaged in both cluster regularization and noise removal. Besides, we propose collective bootstrapping to encourage smoother and wiser label reference from appearance-similar instances in a manner of dictionary look-up. Extensive experiments on WebVision1k and NUS-WIDE (Web) demonstrate that CAPro well handles realistic noise under both single-label and multi-label scenarios. CAPro achieves new state-of-the-art performance and exhibits robustness to open-set recognition. Codes are available at https://github.com/yuleiqin/capro.

* Accepted at NeurIPS2023

Via

Access Paper or Ask Questions

Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model

Sep 22, 2023

Jiamin Xie, Ke Li, Jinxi Guo, Andros Tjandra, Yuan Shangguan, Leda Sari, Chunyang Wu, Junteng Jia, Jay Mahadeokar, Ozlem Kalinli

Figure 1 for Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model

Figure 2 for Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model

Figure 3 for Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model

Figure 4 for Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model

Abstract:Neural network pruning offers an effective method for compressing a multilingual automatic speech recognition (ASR) model with minimal performance loss. However, it entails several rounds of pruning and re-training needed to be run for each language. In this work, we propose the use of an adaptive masking approach in two scenarios for pruning a multilingual ASR model efficiently, each resulting in sparse monolingual models or a sparse multilingual model (named as Dynamic ASR Pathways). Our approach dynamically adapts the sub-network, avoiding premature decisions about a fixed sub-network structure. We show that our approach outperforms existing pruning methods when targeting sparse monolingual models. Further, we illustrate that Dynamic ASR Pathways jointly discovers and trains better sub-networks (pathways) of a single multilingual model by adapting from different sub-network initializations, thereby reducing the need for language-specific pruning.

Via

Access Paper or Ask Questions

Masked Autoencoders are Efficient Class Incremental Learners

Aug 24, 2023

Jiang-Tian Zhai, Xialei Liu, Andrew D. Bagdanov, Ke Li, Ming-Ming Cheng

Figure 1 for Masked Autoencoders are Efficient Class Incremental Learners

Figure 2 for Masked Autoencoders are Efficient Class Incremental Learners

Figure 3 for Masked Autoencoders are Efficient Class Incremental Learners

Figure 4 for Masked Autoencoders are Efficient Class Incremental Learners

Abstract:Class Incremental Learning (CIL) aims to sequentially learn new classes while avoiding catastrophic forgetting of previous knowledge. We propose to use Masked Autoencoders (MAEs) as efficient learners for CIL. MAEs were originally designed to learn useful representations through reconstructive unsupervised learning, and they can be easily integrated with a supervised loss for classification. Moreover, MAEs can reliably reconstruct original input images from randomly selected patches, which we use to store exemplars from past tasks more efficiently for CIL. We also propose a bilateral MAE framework to learn from image-level and embedding-level fusion, which produces better-quality reconstructed images and more stable representations. Our experiments confirm that our approach performs better than the state-of-the-art on CIFAR-100, ImageNet-Subset, and ImageNet-Full. The code is available at https://github.com/scok30/MAE-CIL .

* Accepted at ICCV 2023

Via

Access Paper or Ask Questions

MonoNeRD: NeRF-like Representations for Monocular 3D Object Detection

Aug 18, 2023

Junkai Xu, Liang Peng, Haoran Cheng, Hao Li, Wei Qian, Ke Li, Wenxiao Wang, Deng Cai

Figure 1 for MonoNeRD: NeRF-like Representations for Monocular 3D Object Detection

Figure 2 for MonoNeRD: NeRF-like Representations for Monocular 3D Object Detection

Figure 3 for MonoNeRD: NeRF-like Representations for Monocular 3D Object Detection

Figure 4 for MonoNeRD: NeRF-like Representations for Monocular 3D Object Detection

Abstract:In the field of monocular 3D detection, it is common practice to utilize scene geometric clues to enhance the detector's performance. However, many existing works adopt these clues explicitly such as estimating a depth map and back-projecting it into 3D space. This explicit methodology induces sparsity in 3D representations due to the increased dimensionality from 2D to 3D, and leads to substantial information loss, especially for distant and occluded objects. To alleviate this issue, we propose MonoNeRD, a novel detection framework that can infer dense 3D geometry and occupancy. Specifically, we model scenes with Signed Distance Functions (SDF), facilitating the production of dense 3D representations. We treat these representations as Neural Radiance Fields (NeRF) and then employ volume rendering to recover RGB images and depth maps. To the best of our knowledge, this work is the first to introduce volume rendering for M3D, and demonstrates the potential of implicit reconstruction for image-based 3D perception. Extensive experiments conducted on the KITTI-3D benchmark and Waymo Open Dataset demonstrate the effectiveness of MonoNeRD. Codes are available at https://github.com/cskkxjk/MonoNeRD.

* Accepted by ICCV 2023

Via

Access Paper or Ask Questions

LiDAR Meta Depth Completion

Aug 16, 2023

Wolfgang Boettcher, Lukas Hoyer, Ozan Unal, Ke Li, Dengxin Dai

Figure 1 for LiDAR Meta Depth Completion

Figure 2 for LiDAR Meta Depth Completion

Figure 3 for LiDAR Meta Depth Completion

Figure 4 for LiDAR Meta Depth Completion

Abstract:Depth estimation is one of the essential tasks to be addressed when creating mobile autonomous systems. While monocular depth estimation methods have improved in recent times, depth completion provides more accurate and reliable depth maps by additionally using sparse depth information from other sensors such as LiDAR. However, current methods are specifically trained for a single LiDAR sensor. As the scanning pattern differs between sensors, every new sensor would require re-training a specialized depth completion model, which is computationally inefficient and not flexible. Therefore, we propose to dynamically adapt the depth completion model to the used sensor type enabling LiDAR adaptive depth completion. Specifically, we propose a meta depth completion network that uses data patterns derived from the data to learn a task network to alter weights of the main depth completion network to solve a given depth completion task effectively. The method demonstrates a strong capability to work on multiple LiDAR scanning patterns and can also generalize to scanning patterns that are unseen during training. While using a single model, our method yields significantly better results than a non-adaptive baseline trained on different LiDAR patterns. It outperforms LiDAR-specific expert models for very sparse cases. These advantages allow flexible deployment of a single depth completion model on different sensors, which could also prove valuable to process the input of nascent LiDAR technology with adaptive instead of fixed scanning patterns.

* Accepted at IROS 2023, v2 has updated author list and fixed a figure caption

Via

Access Paper or Ask Questions

Prompting Large Language Models with Speech Recognition Abilities

Jul 21, 2023

Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Ke Li, Jinxi Guo, Wenhan Xiong, Jay Mahadeokar, Ozlem Kalinli(+2 more)

Figure 1 for Prompting Large Language Models with Speech Recognition Abilities

Figure 2 for Prompting Large Language Models with Speech Recognition Abilities

Figure 3 for Prompting Large Language Models with Speech Recognition Abilities

Figure 4 for Prompting Large Language Models with Speech Recognition Abilities

Abstract:Large language models have proven themselves highly flexible, able to solve a wide range of generative tasks, such as abstractive summarization and open-ended question answering. In this paper we extend the capabilities of LLMs by directly attaching a small audio encoder allowing it to perform speech recognition. By directly prepending a sequence of audial embeddings to the text token embeddings, the LLM can be converted to an automatic speech recognition (ASR) system, and be used in the exact same manner as its textual counterpart. Experiments on Multilingual LibriSpeech (MLS) show that incorporating a conformer encoder into the open sourced LLaMA-7B allows it to outperform monolingual baselines by 18% and perform multilingual speech recognition despite LLaMA being trained overwhelmingly on English text. Furthermore, we perform ablation studies to investigate whether the LLM can be completely frozen during training to maintain its original capabilities, scaling up the audio encoder, and increasing the audio encoder striding to generate fewer embeddings. The results from these studies show that multilingual ASR is possible even when the LLM is frozen or when strides of almost 1 second are used in the audio encoder opening up the possibility for LLMs to operate on long-form audio.

Via

Access Paper or Ask Questions