Zelin Zhao

Grounding and Enhancing Grid-based Models for Neural Fields

Apr 06, 2024

Zelin Zhao, Fenglei Fan, Wenlong Liao, Junchi Yan

Abstract:Many contemporary studies utilize grid-based models for neural field representation, but a systematic analysis of grid-based models is still missing, hindering the improvement of those models. Therefore, this paper introduces a theoretical framework for grid-based models. This framework points out that these models' approximation and generalization behaviors are determined by grid tangent kernels (GTK), which are intrinsic properties of grid-based models. The proposed framework facilitates a consistent and systematic analysis of diverse grid-based models. Furthermore, the introduced framework motivates the development of a novel grid-based model named the Multiplicative Fourier Adaptive Grid (MulFAGrid). The numerical analysis demonstrates that MulFAGrid exhibits a lower generalization bound than its predecessors, indicating its robust generalization performance. Empirical studies reveal that MulFAGrid achieves state-of-the-art performance in various tasks, including 2D image fitting, 3D signed distance field (SDF) reconstruction, and novel view synthesis, demonstrating superior representation ability. The project website is available at https://sites.google.com/view/cvpr24-2034-submission/home.

* Accepted in CVPR24 as an oral presentation. Pre-rebuttal scores: 555. Post-rebuttal scores: 555

CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model

Oct 10, 2023

Peng Di, Jianguo Li, Hang Yu, Wei Jiang, Wenting Cai, Yang Cao, Chaoyu Chen, Dajun Chen, Hongwei Chen, Liang Chen(+28 more)

Abstract:Code Large Language Models (Code LLMs) have gained significant attention in the industry due to their wide applications in the full lifecycle of software engineering. However, the effectiveness of existing models in understanding non-English inputs for multi-lingual code-related tasks is still far from well studied. This paper introduces CodeFuse-13B, an open-sourced pre-trained code LLM. It is specifically designed for code-related tasks with both English and Chinese prompts and supports over 40 programming languages. CodeFuse achieves its effectiveness by utilizing a high-quality pre-training dataset that is carefully filtered by program analyzers and optimized during the training process. Extensive experiments are conducted using real-world usage scenarios, the industry-standard benchmark HumanEval-x, and the specially designed CodeFuseEval for Chinese prompts. To assess the effectiveness of CodeFuse, we actively collected valuable human feedback from AntGroup's software development process, where CodeFuse has been successfully deployed. The results demonstrate that CodeFuse-13B achieves a HumanEval pass@1 score of 37.10%, positioning it as one of the top multi-lingual code LLMs with similar parameter sizes. In practical scenarios, such as code generation, code translation, code comments, and test case generation, CodeFuse performs better than other models when confronted with Chinese prompts.

* 10 pages with 2 pages for references
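
For readers unfamiliar with the pass@1 figure quoted in this abstract, the sketch below shows the unbiased pass@k estimator that is commonly used with HumanEval-style benchmarks. The abstract does not state how the 37.10% was computed, so this is only an illustration under that assumption; the function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem as 1 - C(n - c, k) / C(n, k).

    n: number of samples generated for the problem
    c: number of those samples that pass the unit tests
    k: sampling budget being evaluated
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k: int) -> float:
    """Average the per-problem estimate over (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to c / n, the fraction of samples that pass, averaged over all problems.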

End-to-end View Synthesis via NeRF Attention

Aug 01, 2022

Zelin Zhao, Jiaya Jia

Abstract:In this paper, we present a simple seq2seq formulation for view synthesis where we take a set of ray points as input and output colors corresponding to the rays. Directly applying a standard transformer on this seq2seq formulation has two limitations. First, the standard attention cannot successfully fit the volumetric rendering procedure, and therefore high-frequency components are missing in the synthesized views. Second, applying global attention to all rays and pixels is extremely inefficient. Inspired by the neural radiance field (NeRF), we propose the NeRF attention (NeRFA) to address the above problems. On the one hand, NeRFA considers the volumetric rendering equation as a soft feature modulation procedure. In this way, the feature modulation enhances the transformers with the NeRF-like inductive bias. On the other hand, NeRFA performs multi-stage attention to reduce the computational overhead. Furthermore, the NeRFA model adopts the ray and pixel transformers to learn the interactions between rays and pixels. NeRFA demonstrates superior performance over NeRF and NerFormer on four datasets: DeepVoxels, Blender, LLFF, and CO3D. Besides, NeRFA establishes a new state-of-the-art under two settings: the single-scene view synthesis and the category-centric novel view synthesis. The code will be made publicly available.

* Fixed reference formatting issues
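
The volumetric rendering procedure mentioned in this abstract can be illustrated with the standard discrete quadrature used by NeRF. The sketch below is that generic compositing rule, not the NeRFA attention mechanism itself; the function name `render_ray` and its argument layout are assumptions made for illustration.

```python
import math


def render_ray(densities, colors, deltas):
    """Composite samples along one ray, front to back.

    densities: volume density sigma_i at each sample
    colors:    (r, g, b) tuple at each sample
    deltas:    distance between consecutive samples
    Returns the rendered colour and the per-sample weights.
    """
    transmittance = 1.0  # fraction of light that reaches the current sample
    rgb = [0.0, 0.0, 0.0]
    weights = []
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        weights.append(weight)
        rgb = [acc + weight * channel for acc, channel in zip(rgb, color)]
        transmittance *= 1.0 - alpha
    return rgb, weights
```

The per-sample weights act as a soft, data-dependent mixing of sample features, which is the sense in which the abstract describes the rendering equation as a feature modulation procedure.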

Tracking Objects as Pixel-wise Distributions

Jul 15, 2022

Zelin Zhao, Ze Wu, Yueqing Zhuang, Boxun Li, Jiaya Jia

Abstract:Multi-object tracking (MOT) requires detecting and associating objects through frames. Unlike tracking via detected bounding boxes or tracking objects as points, we propose tracking objects as pixel-wise distributions. We instantiate this idea on a transformer-based architecture, P3AFormer, with pixel-wise propagation, prediction, and association. P3AFormer propagates pixel-wise features guided by flow information to pass messages between frames. Furthermore, P3AFormer adopts a meta-architecture to produce multi-scale object feature maps. During inference, a pixel-wise association procedure is proposed to recover object connections through frames based on the pixel-wise prediction. P3AFormer yields 81.2% in terms of MOTA on the MOT17 benchmark, the first among all transformer networks to reach 80% MOTA in the literature. P3AFormer also outperforms state-of-the-art methods on the MOT20 and KITTI benchmarks.

* Accepted in ECCV22 as an oral presentation paper. The code and project page are at https://github.com/dvlab-research/ECCV22-P3AFormer-Tracking-Objects-as-Pixel-wise-Distributions
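
The MOTA score reported in this abstract follows the CLEAR MOT definition, which subtracts the ratio of all tracking errors to the number of ground-truth objects from one. The sketch below shows that arithmetic; the function name `mota` and the per-frame list inputs are assumptions, and the counts in the usage note are made up rather than taken from the paper.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multiple Object Tracking Accuracy from per-frame error counts.

    Each argument is a sequence with one integer per frame:
    missed objects, spurious detections, identity switches,
    and ground-truth objects present in that frame.
    """
    total_gt = sum(num_gt)
    if total_gt <= 0:
        raise ValueError("need at least one ground-truth object")
    errors = sum(false_negatives) + sum(false_positives) + sum(id_switches)
    # MOTA can be negative when errors outnumber ground-truth objects.
    return 1.0 - errors / total_gt
```

For example, two frames with 10 ground-truth objects each and 5 errors in total give `mota([2, 1], [1, 0], [0, 1], [10, 10])`, which evaluates to 0.75.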

Learning Temporal Rules from Noisy Timeseries Data

Feb 11, 2022

Karan Samel, Zelin Zhao, Binghong Chen, Shuang Li, Dharmashankar Subramanian, Irfan Essa, Le Song

Abstract:Events across a timeline are a common data representation, seen in different temporal modalities. Individual atomic events can occur in a certain temporal ordering to compose higher level composite events. Examples of a composite event are a patient's medical symptom or a baseball player hitting a home run, caused by distinct temporal orderings of patient vitals and player movements, respectively. Such salient composite events are provided as labels in temporal datasets and most works optimize models to predict these composite event labels directly. We focus on uncovering the underlying atomic events and their relations that lead to the composite events within a noisy temporal data setting. We propose Neural Temporal Logic Programming (Neural TLP) which first learns implicit temporal relations between atomic events and then lifts logic rules for composite events, given only the composite event labels for supervision. This is done by efficiently searching through the combinatorial space of all temporal logic rules in an end-to-end differentiable manner. We evaluate our method on video and healthcare datasets where it outperforms the baseline methods for rule discovery.

* 19 pages, 5 figures

ProTo: Program-Guided Transformer for Program-Guided Tasks

Oct 16, 2021

Zelin Zhao, Karan Samel, Binghong Chen, Le Song

Abstract:Programs, consisting of semantic and structural information, play an important role in the communication between humans and agents. Towards learning general program executors to unify perception, reasoning, and decision making, we formulate program-guided tasks which require learning to execute a given program on the observed task specification. Furthermore, we propose the Program-guided Transformer (ProTo), which integrates both semantic and structural guidance of a program by leveraging cross-attention and masked self-attention to pass messages between the specification and routines in the program. ProTo executes a program in a learned latent space and enjoys stronger representation ability than previous neural-symbolic approaches. We demonstrate that ProTo significantly outperforms the previous state-of-the-art methods on GQA visual reasoning and 2D Minecraft policy learning datasets. Additionally, ProTo demonstrates better generalization to unseen, complex, and human-written programs.

* Accepted in NeurIPS 2021

How to Design Sample and Computationally Efficient VQA Models

Mar 22, 2021

Karan Samel, Zelin Zhao, Binghong Chen, Kuan Wang, Robin Luo, Le Song

Abstract:In multi-modal reasoning tasks, such as visual question answering (VQA), many modeling and training paradigms have been tested. Previous models propose different methods for the vision and language tasks, but which ones perform the best while being sample and computationally efficient? Based on our experiments, we find that representing the text as probabilistic programs and images as object-level scene graphs best satisfies these desiderata. We extend existing models to leverage these soft programs and scene graphs to train on question-answer pairs in an end-to-end manner. Empirical results demonstrate that this differentiable end-to-end program executor is able to maintain state-of-the-art accuracy while being sample and computationally efficient.

* 20 pages, 5 figures

Augmenting Policy Learning with Routines Discovered from a Demonstration

Dec 24, 2020

Zelin Zhao, Chuang Gan, Jiajun Wu, Xiaoxiao Guo, Joshua B. Tenenbaum

Abstract:Humans can abstract prior knowledge from very little data and use it to boost skill learning. In this paper, we propose routine-augmented policy learning (RAPL), which discovers routines composed of primitive actions from a single demonstration and uses discovered routines to augment policy learning. To discover routines from the demonstration, we first abstract routine candidates by identifying grammar over the demonstrated action trajectory. Then, the best routines measured by length and frequency are selected to form a routine library. We propose to learn policy simultaneously at primitive-level and routine-level with discovered routines, leveraging the temporal structure of routines. Our approach enables imitating expert behavior at multiple temporal scales for imitation learning and promotes reinforcement learning exploration. Extensive experiments on Atari games demonstrate that RAPL improves the state-of-the-art imitation learning method SQIL and reinforcement learning method A2C. Further, we show that discovered routines can generalize to unseen levels and difficulties on the CoinRun benchmark.

* To appear in AAAI-21. Code is available at https://github.com/sjtuytc/-AAAI21-RoutineAugmentedPolicyLearning-RAPL-
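
The idea of selecting routines by length and frequency can be sketched in a few lines. Note that the abstract describes abstracting candidates by identifying a grammar over the action trajectory; the sketch below swaps in plain contiguous n-gram counting as a simpler stand-in, and the score (length times frequency), the function name `discover_routines`, and its parameters are all hypothetical choices made for illustration.

```python
from collections import Counter


def discover_routines(actions, min_len=2, max_len=4, top_k=3):
    """Return up to top_k repeated action subsequences from one trajectory.

    Candidates are contiguous n-grams of primitive actions (an assumed
    simplification, not grammar induction). Each candidate is scored by
    length * frequency (an assumed score), and only candidates that occur
    at least twice are kept.
    """
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for start in range(len(actions) - n + 1):
            counts[tuple(actions[start:start + n])] += 1

    repeated = {seq: freq for seq, freq in counts.items() if freq >= 2}
    # Sort by descending score, breaking ties by the sequence itself.
    ranked = sorted(repeated, key=lambda seq: (-len(seq) * repeated[seq], seq))
    return ranked[:top_k]
```

Calling `discover_routines(["a", "b", "a", "b", "c"])` returns `[("a", "b")]`, since that pair is the only subsequence that repeats. The selected tuples could then be treated as extra macro-actions alongside the primitive ones.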

Estimating 6D Pose From Localizing Designated Surface Keypoints

Dec 04, 2018

Zelin Zhao, Gao Peng, Haoyu Wang, Hao-Shu Fang, Chengkun Li, Cewu Lu

Abstract:In this paper, we present an accurate yet effective solution for 6D pose estimation from an RGB image. The core of our approach is that we first designate a set of surface points on the target object model as keypoints and then train a keypoint detector (KPD) to localize them. Finally, a PnP algorithm can recover the 6D pose according to the 2D-3D relationship of keypoints. Different from recent state-of-the-art CNN-based approaches that rely on a time-consuming post-processing procedure, our method can achieve competitive accuracy without any refinement after pose prediction. Meanwhile, we obtain a 30% relative improvement in terms of ADD accuracy among methods without using refinement. Moreover, we succeed in handling heavy occlusion by selecting the most confident keypoints to recover the 6D pose. For the sake of reproducibility, we will make our code and models publicly available soon.
