
Zelin Zhao

CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model

Oct 10, 2023
Peng Di, Jianguo Li, Hang Yu, Wei Jiang, Wenting Cai, Yang Cao, Chaoyu Chen, Dajun Chen, Hongwei Chen, Liang Chen, Gang Fan, Jie Gong, Zi Gong, Wen Hu, Tingting Guo, Zhichao Lei, Ting Li, Zheng Li, Ming Liang, Cong Liao, Bingchang Liu, Jiachen Liu, Zhiwei Liu, Shaojun Lu, Min Shen, Guangpei Wang, Huan Wang, Zhi Wang, Zhaogui Xu, Jiawei Yang, Qing Ye, Gehao Zhang, Yu Zhang, Zelin Zhao, Xunjin Zheng, Hailian Zhou, Lifu Zhu, Xianying Zhu

Code Large Language Models (Code LLMs) have gained significant attention in industry due to their wide applications across the full software engineering lifecycle. However, how well existing models understand non-English inputs for multi-lingual code-related tasks remains under-studied. This paper introduces CodeFuse-13B, an open-source pre-trained code LLM. It is specifically designed for code-related tasks with both English and Chinese prompts and supports over 40 programming languages. CodeFuse achieves its effectiveness by utilizing a high-quality pre-training dataset that is carefully filtered by program analyzers and optimized during the training process. Extensive experiments are conducted using real-world usage scenarios, the industry-standard benchmark HumanEval-X, and the specially designed CodeFuseEval for Chinese prompts. To assess the effectiveness of CodeFuse, we actively collected valuable human feedback from Ant Group's software development process, where CodeFuse has been successfully deployed. The results demonstrate that CodeFuse-13B achieves a HumanEval pass@1 score of 37.10%, positioning it as one of the top multi-lingual code LLMs of similar parameter size. In practical scenarios such as code generation, code translation, code commenting, and test case generation, CodeFuse outperforms other models when given Chinese prompts.

* 10 pages with 2 pages for references 
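
The pass@1 number above follows the standard pass@k evaluation protocol for code generation. As background, here is a minimal sketch of the widely used unbiased pass@k estimator (n samples per problem, c of which pass the unit tests); it is generic background, not CodeFuse-specific code, and the example numbers are hypothetical.

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator: n samples generated, c of them correct."""
        if n - c < k:
            return 1.0
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

    # Hypothetical example: 200 samples per problem, 74 pass the tests.
    print(pass_at_k(n=200, c=74, k=1))  # 0.37, i.e. pass@1 of 37%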

End-to-end View Synthesis via NeRF Attention

Aug 01, 2022
Zelin Zhao, Jiaya Jia

In this paper, we present a simple seq2seq formulation for view synthesis, where we take a set of ray points as input and output the colors corresponding to the rays. Directly applying a standard transformer to this seq2seq formulation has two limitations. First, standard attention cannot adequately fit the volumetric rendering procedure, so high-frequency components are missing in the synthesized views. Second, applying global attention to all rays and pixels is extremely inefficient. Inspired by the neural radiance field (NeRF), we propose NeRF attention (NeRFA) to address these problems. On the one hand, NeRFA treats the volumetric rendering equation as a soft feature modulation procedure; in this way, the feature modulation endows the transformer with a NeRF-like inductive bias. On the other hand, NeRFA performs multi-stage attention to reduce the computational overhead. Furthermore, the NeRFA model adopts ray and pixel transformers to learn the interactions between rays and pixels. NeRFA demonstrates superior performance over NeRF and NerFormer on four datasets: DeepVoxels, Blender, LLFF, and CO3D. Moreover, NeRFA establishes a new state of the art under two settings: single-scene view synthesis and category-centric novel view synthesis. The code will be made publicly available.

* Fixed reference formatting issues 
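
For context on the "volumetric rendering as soft feature modulation" idea, below is a minimal NumPy sketch of the standard NeRF compositing weights that the modulation mimics; the array names (sigma, delta, rgb) are illustrative, and this is background math rather than the NeRFA implementation.

    import numpy as np

    def composite_ray(sigma, delta, rgb):
        """Standard NeRF compositing along one ray.
        sigma: (N,) densities, delta: (N,) sample spacings, rgb: (N, 3) colors."""
        alpha = 1.0 - np.exp(-sigma * delta)                              # per-sample opacity
        trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha]))[:-1]     # transmittance T_i
        weights = trans * alpha                                           # rendering weights
        color = (weights[:, None] * rgb).sum(axis=0)                      # composited pixel color
        return color, weights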

Tracking Objects as Pixel-wise Distributions

Jul 15, 2022
Zelin Zhao, Ze Wu, Yueqing Zhuang, Boxun Li, Jiaya Jia

Multi-object tracking (MOT) requires detecting and associating objects across frames. Unlike tracking via detected bounding boxes or tracking objects as points, we propose tracking objects as pixel-wise distributions. We instantiate this idea in a transformer-based architecture, P3AFormer, with pixel-wise propagation, prediction, and association. P3AFormer propagates pixel-wise features guided by flow information to pass messages between frames. Furthermore, P3AFormer adopts a meta-architecture to produce multi-scale object feature maps. During inference, a pixel-wise association procedure is proposed to recover object connections across frames based on the pixel-wise predictions. P3AFormer achieves 81.2% MOTA on the MOT17 benchmark -- the first transformer network to reach 80% MOTA in the literature. P3AFormer also outperforms state-of-the-art methods on the MOT20 and KITTI benchmarks.

* Accepted in ECCV22 as an oral presentation paper. The code and project page are available at https://github.com/dvlab-research/ECCV22-P3AFormer-Tracking-Objects-as-Pixel-wise-Distributions
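
As a rough sketch of the association step described above -- a generic bipartite matching between existing tracks and new detections, not P3AFormer's exact pixel-wise procedure -- one can match predicted object centers with the Hungarian algorithm; the Euclidean-distance cost and the threshold below are illustrative assumptions.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def associate(track_centers, det_centers, max_dist=50.0):
        """Match existing tracks to new detections by center distance."""
        cost = np.linalg.norm(track_centers[:, None] - det_centers[None, :], axis=-1)
        rows, cols = linear_sum_assignment(cost)
        return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]

    tracks = np.array([[100.0, 50.0], [200.0, 80.0]])
    dets = np.array([[102.0, 51.0], [400.0, 300.0], [198.0, 83.0]])
    print(associate(tracks, dets))  # [(0, 0), (1, 2)]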

Learning Temporal Rules from Noisy Timeseries Data

Feb 11, 2022
Karan Samel, Zelin Zhao, Binghong Chen, Shuang Li, Dharmashankar Subramanian, Irfan Essa, Le Song

Events across a timeline are a common data representation, seen in different temporal modalities. Individual atomic events can occur in a certain temporal ordering to compose higher-level composite events. Examples of composite events are a patient's medical symptom or a baseball player hitting a home run, caused by distinct temporal orderings of patient vitals and player movements, respectively. Such salient composite events are provided as labels in temporal datasets, and most works optimize models to predict these composite event labels directly. We focus on uncovering the underlying atomic events and their relations that lead to the composite events within a noisy temporal data setting. We propose Neural Temporal Logic Programming (Neural TLP), which first learns implicit temporal relations between atomic events and then lifts logic rules for composite events, given only the composite event labels for supervision. This is done by efficiently searching the combinatorial space of all temporal logic rules in an end-to-end differentiable manner. We evaluate our method on video and healthcare datasets, where it outperforms the baseline methods for rule discovery.

* 19 pages, 5 figures 
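
To make the idea of learning temporal relations under noise concrete, here is a hedged toy example of a differentiable "A before B" score, computed as a sigmoid of the time gap; this is an illustrative stand-in, not the Neural TLP formulation.

    import numpy as np

    def soft_before(t_a, t_b, tau=1.0):
        """Soft score in (0, 1) that event A (time t_a) precedes event B (time t_b)."""
        return 1.0 / (1.0 + np.exp(-(t_b - t_a) / tau))

    # Example: A at t=2.0, B at t=5.0 -> score close to 1 (A very likely before B).
    print(soft_before(2.0, 5.0))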

ProTo: Program-Guided Transformer for Program-Guided Tasks

Oct 16, 2021
Zelin Zhao, Karan Samel, Binghong Chen, Le Song

Programs, consisting of semantic and structural information, play an important role in the communication between humans and agents. Towards learning general program executors to unify perception, reasoning, and decision making, we formulate program-guided tasks which require learning to execute a given program on the observed task specification. Furthermore, we propose the Program-guided Transformer (ProTo), which integrates both semantic and structural guidance of a program by leveraging cross-attention and masked self-attention to pass messages between the specification and routines in the program. ProTo executes a program in a learned latent space and enjoys stronger representation ability than previous neural-symbolic approaches. We demonstrate that ProTo significantly outperforms the previous state-of-the-art methods on GQA visual reasoning and 2D Minecraft policy learning datasets. Additionally, ProTo demonstrates better generalization to unseen, complex, and human-written programs.

* Accepted in NeurIPS 2021 
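
A minimal PyTorch sketch of the message passing described above -- program tokens attending to the task specification via cross-attention -- is shown below; the dimensions, tensor names, and single attention layer are illustrative assumptions, not the released ProTo model.

    import torch
    import torch.nn as nn

    d, heads = 256, 8
    cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)

    program_tokens = torch.randn(1, 12, d)   # embedded program routines (queries)
    spec_features = torch.randn(1, 49, d)    # observed task specification (keys/values)

    # Program tokens attend to the specification, gathering evidence for each routine.
    fused, attn_weights = cross_attn(program_tokens, spec_features, spec_features)
    print(fused.shape)  # torch.Size([1, 12, 256])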

How to Design Sample and Computationally Efficient VQA Models

Mar 22, 2021
Karan Samel, Zelin Zhao, Binghong Chen, Kuan Wang, Robin Luo, Le Song

In multi-modal reasoning tasks such as visual question answering (VQA), many modeling and training paradigms have been tested. Previous models propose different methods for the vision and language components, but which perform best while being sample- and computationally efficient? Based on our experiments, we find that representing the text as probabilistic programs and the images as object-level scene graphs best satisfies these desiderata. We extend existing models to leverage these soft programs and scene graphs to train on question-answer pairs in an end-to-end manner. Empirical results demonstrate that this differentiable end-to-end program executor maintains state-of-the-art accuracy while being sample and computationally efficient.

* 20 pages, 5 figures 
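
To illustrate what executing a soft program over an object-level scene graph can look like, here is a hedged toy sketch of a single "filter by attribute" step that rescales object attention by predicted attribute probabilities; the data structure and operation are illustrative assumptions, not the paper's executor.

    import numpy as np

    # Toy scene graph: per-object probability of each attribute (one entry per object).
    attr_probs = {"red": np.array([0.9, 0.1, 0.3]), "metal": np.array([0.2, 0.8, 0.7])}

    def soft_filter(attention, attribute):
        """Soft 'filter(attribute)' step: rescale object attention, keep it normalized."""
        scores = attention * attr_probs[attribute]
        return scores / (scores.sum() + 1e-8)

    attention = np.ones(3) / 3                 # start attending to all objects equally
    attention = soft_filter(attention, "red")  # execute filter(red)
    print(attention)                           # most mass on the first (reddest) object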

Augmenting Policy Learning with Routines Discovered from a Demonstration

Dec 24, 2020
Zelin Zhao, Chuang Gan, Jiajun Wu, Xiaoxiao Guo, Joshua B. Tenenbaum

Humans can abstract prior knowledge from very little data and use it to boost skill learning. In this paper, we propose routine-augmented policy learning (RAPL), which discovers routines composed of primitive actions from a single demonstration and uses the discovered routines to augment policy learning. To discover routines from the demonstration, we first abstract routine candidates by identifying a grammar over the demonstrated action trajectory. Then, the best routines, measured by length and frequency, are selected to form a routine library. We propose to learn the policy simultaneously at the primitive level and the routine level with the discovered routines, leveraging the temporal structure of routines. Our approach enables imitating expert behavior at multiple temporal scales for imitation learning and promotes exploration in reinforcement learning. Extensive experiments on Atari games demonstrate that RAPL improves the state-of-the-art imitation learning method SQIL and the reinforcement learning method A2C. Furthermore, we show that the discovered routines can generalize to unseen levels and difficulties on the CoinRun benchmark.

* To appear in AAAI-21. Code is available at https://github.com/sjtuytc/-AAAI21-RoutineAugmentedPolicyLearning-RAPL- 
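
A hedged sketch of the routine-selection idea (repeated action subsequences scored by length and frequency) is given below; the exact scoring and the grammar-based abstraction in RAPL may differ, so treat this as an illustration only.

    from collections import Counter

    def discover_routines(actions, min_len=2, max_len=4, top_k=3):
        """Count repeated action subsequences and keep the highest-scoring ones."""
        counts = Counter(
            tuple(actions[i:i + n])
            for n in range(min_len, max_len + 1)
            for i in range(len(actions) - n + 1)
        )
        scored = {seq: len(seq) * c for seq, c in counts.items() if c > 1}
        return sorted(scored, key=scored.get, reverse=True)[:top_k]

    demo = ["right", "jump", "right", "jump", "right", "jump", "fire"]
    print(discover_routines(demo))  # top repeated routine: ('right', 'jump', 'right', 'jump')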

Estimating 6D Pose From Localizing Designated Surface Keypoints

Dec 04, 2018
Zelin Zhao, Gao Peng, Haoyu Wang, Hao-Shu Fang, Chengkun Li, Cewu Lu

In this paper, we present an accurate yet efficient solution for 6D pose estimation from an RGB image. The core of our approach is to first designate a set of surface points on the target object model as keypoints and then train a keypoint detector (KPD) to localize them. Finally, a PnP algorithm recovers the 6D pose from the 2D-3D correspondences of the keypoints. Unlike recent state-of-the-art CNN-based approaches that rely on a time-consuming post-processing procedure, our method achieves competitive accuracy without any refinement after pose prediction. Moreover, we obtain a 30% relative improvement in ADD accuracy among methods that do not use refinement. We also handle heavy occlusion by selecting the most confident keypoints to recover the 6D pose. For the sake of reproducibility, we will make our code and models publicly available soon.
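
The final PnP step can be illustrated with OpenCV's solvePnP once 2D-3D keypoint correspondences and camera intrinsics are available; the keypoint coordinates and intrinsics below are hypothetical placeholders, and this is a minimal sketch rather than the paper's pipeline.

    import cv2
    import numpy as np

    # Hypothetical 3D keypoints on the object model and their detected 2D locations.
    object_points = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1],
                              [1, 1, 0], [1, 0, 1]], dtype=np.float64)
    image_points = np.array([[320, 240], [400, 250], [330, 170], [310, 300],
                             [410, 180], [390, 310]], dtype=np.float64)
    K = np.array([[600, 0, 320], [0, 600, 240], [0, 0, 1]], dtype=np.float64)  # intrinsics

    ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, None)
    R, _ = cv2.Rodrigues(rvec)   # recovered 6D pose: rotation R (3x3) and translation tvec (3x1)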
