
Xuefei Ning

Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding

Jul 28, 2023
Xuefei Ning, Zinan Lin, Zixuan Zhou, Huazhong Yang, Yu Wang

This work aims at decreasing the end-to-end generation latency of large language models (LLMs). One of the major causes of the high generation latency is the sequential decoding approach adopted by almost all state-of-the-art LLMs. In this work, motivated by the thinking and writing process of humans, we propose "Skeleton-of-Thought" (SoT), which guides LLMs to first generate the skeleton of the answer, and then conducts parallel API calls or batched decoding to complete the contents of each skeleton point in parallel. Not only does SoT provide considerable speed-up (up to 2.39x across 11 different LLMs), but it can also potentially improve the answer quality on several question categories in terms of diversity and relevance. SoT is an initial attempt at data-centric optimization for efficiency, and reveals the potential of pushing LLMs to think more like a human for answer quality.
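
The two-stage flow described above (skeleton first, then parallel expansion of each point) can be sketched roughly as follows. This is only an illustration under stated assumptions: `fake_llm` is a hypothetical stand-in for a real LLM API, and the `SKELETON:` / `POINT:` prompt prefixes are invented for this sketch rather than taken from the paper.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; returns canned text."""
    if prompt.startswith("SKELETON:"):
        # Stage 1: a short numbered outline of the answer.
        return "1. Define the problem\n2. List options\n3. Recommend one"
    # Stage 2: expand the single point named at the end of the prompt.
    point = prompt.rsplit("POINT:", 1)[1].strip()
    return f"[expanded] {point}"


def skeleton_of_thought(question, llm=fake_llm, max_workers=4):
    """Ask for a skeleton, then expand every skeleton point concurrently."""
    skeleton = llm(f"SKELETON: {question}")
    # Parse "N. text" lines into bare point strings.
    points = [
        line.split(". ", 1)[1]
        for line in skeleton.splitlines()
        if ". " in line
    ]
    # Issue one request per point in parallel; map() preserves point order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        expansions = list(
            pool.map(lambda p: llm(f"QUESTION: {question} POINT: {p}"), points)
        )
    return points, expansions


if __name__ == "__main__":
    pts, exps = skeleton_of_thought("How should I pick a database?")
    for pt, ex in zip(pts, exps):
        print(pt, "->", ex)
```

With a real API, the thread pool would overlap the network latency of the per-point requests; the batched-decoding variant mentioned in the abstract would instead place all point prompts in one batch on a locally hosted model.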

* Technical report, work in progress 

Ada3D : Exploiting the Spatial Redundancy with Adaptive Inference for Efficient 3D Object Detection

Jul 17, 2023
Tianchen Zhao, Xuefei Ning, Ke Hong, Zhongyuan Qiu, Pu Lu, Yali Zhao, Linfeng Zhang, Lipu Zhou, Guohao Dai, Huazhong Yang, Yu Wang

Voxel-based methods have achieved state-of-the-art performance for 3D object detection in autonomous driving. However, their significant computational and memory costs pose a challenge for their application to resource-constrained vehicles. One reason for this high resource consumption is the presence of a large number of redundant background points in Lidar point clouds, resulting in spatial redundancy in both 3D voxel and dense BEV map representations. To address this issue, we propose an adaptive inference framework called Ada3D, which focuses on exploiting the input-level spatial redundancy. Ada3D adaptively filters the redundant input, guided by a lightweight importance predictor and the unique properties of the Lidar point cloud. Additionally, we utilize the BEV features' intrinsic sparsity by introducing Sparsity Preserving Batch Normalization. With Ada3D, we achieve a 40% reduction in 3D voxels and decrease the density of 2D BEV feature maps from 100% to 20% without sacrificing accuracy. Ada3D reduces the model's computational and memory cost by 5x, and achieves 1.52x/1.45x end-to-end GPU latency and 1.5x/4.5x GPU peak memory optimization for the 3D and 2D backbones, respectively.

* Accepted at ICCV2023 

OMS-DPM: Optimizing the Model Schedule for Diffusion Probabilistic Models

Jun 15, 2023
Enshu Liu, Xuefei Ning, Zinan Lin, Huazhong Yang, Yu Wang

Diffusion probabilistic models (DPMs) are a new class of generative models that have achieved state-of-the-art generation quality in various domains. Despite the promise, one major drawback of DPMs is the slow generation speed due to the large number of neural network evaluations required in the generation process. In this paper, we reveal an overlooked dimension, the model schedule, for optimizing the trade-off between generation quality and speed. More specifically, we observe that small models, though having worse generation quality when used alone, could outperform large models in certain generation steps. Therefore, unlike the traditional way of using a single model, using different models in different generation steps in a carefully designed model schedule could potentially improve generation quality and speed simultaneously. We design OMS-DPM, a predictor-based search algorithm, to optimize the model schedule given an arbitrary generation time budget and a set of pre-trained models. We demonstrate that OMS-DPM can find model schedules that achieve better generation quality and speed than prior state-of-the-art methods across CIFAR-10, CelebA, ImageNet, and LSUN datasets. When applied to the public checkpoints of the Stable Diffusion model, we are able to accelerate the sampling by 2x while maintaining the generation quality.
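
To make the notion of a model schedule concrete, here is a toy sketch that searches for the best per-step model assignment under a time budget. Everything in it is an assumption made for illustration: the costs in `MODEL_COST`, the scoring rule in `toy_predictor`, and the exhaustive search are hypothetical and are not the predictor or the search algorithm used by OMS-DPM.

```python
from itertools import product

# Hypothetical per-step cost of each model (arbitrary time units).
MODEL_COST = {"small": 1.0, "large": 3.0}


def schedule_cost(schedule):
    """Total generation time of a schedule (one model name per step)."""
    return sum(MODEL_COST[m] for m in schedule)


def toy_predictor(schedule):
    """Hypothetical quality score (higher is better).

    Assumes the large model contributes more at later steps, while the
    small model contributes a constant amount at every step.
    """
    return sum((i + 1) if m == "large" else 2 for i, m in enumerate(schedule))


def best_schedule(num_steps, budget):
    """Exhaustively pick the highest-scoring schedule within the budget."""
    feasible = [
        s for s in product(MODEL_COST, repeat=num_steps)
        if schedule_cost(s) <= budget
    ]
    return max(feasible, key=toy_predictor)


if __name__ == "__main__":
    best = best_schedule(num_steps=4, budget=8.0)
    print(best, toy_predictor(best), schedule_cost(best))
    all_large = ("large",) * 4
    print(all_large, toy_predictor(all_large), schedule_cost(all_large))
```

Under these made-up numbers the mixed schedule scores higher than the all-large schedule while costing less, which mirrors the abstract's claim that quality and speed can improve together. Exhaustive enumeration grows exponentially with the number of steps, which is one reason a real system needs a smarter search.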

* Accepted by ICML2023 

Dynamic Ensemble of Low-fidelity Experts: Mitigating NAS "Cold-Start"

Feb 02, 2023
Junbo Zhao, Xuefei Ning, Enshu Liu, Binxin Ru, Zixuan Zhou, Tianchen Zhao, Chen Chen, Jiajin Zhang, Qingmin Liao, Yu Wang

Predictor-based Neural Architecture Search (NAS) employs an architecture performance predictor to improve the sample efficiency. However, predictor-based NAS suffers from the severe "cold-start" problem, since a large amount of architecture-performance data is required to get a working predictor. In this paper, we focus on exploiting information in cheaper-to-obtain performance estimations (i.e., low-fidelity information) to mitigate the large data requirements of predictor training. Despite the intuitiveness of this idea, we observe that using inappropriate low-fidelity information can even damage the prediction ability, and that different search spaces have different preferences for low-fidelity information types. To solve the problem and better fuse beneficial information provided by different types of low-fidelity information, we propose a novel dynamic ensemble predictor framework that comprises two steps. In the first step, we train different sub-predictors on different types of available low-fidelity information to extract beneficial knowledge as low-fidelity experts. In the second step, we learn a gating network to dynamically output a set of weighting coefficients conditioned on each input neural architecture, which will be used to combine the predictions of different low-fidelity experts in a weighted sum. The overall predictor is optimized on a small set of actual architecture-performance data to fuse the knowledge from different low-fidelity experts to make the final prediction. We conduct extensive experiments across five search spaces with different architecture encoders under various experimental settings. Our method can easily be incorporated into existing predictor-based NAS frameworks to discover better architectures.
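
The second step above (a gating network producing per-architecture weights that combine the experts' predictions in a weighted sum) might look roughly like the sketch below. The linear gate, the softmax normalization, and all names here are assumptions for illustration; the abstract does not specify how the coefficients are parameterized or normalized.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(arch_features, experts, gate_weights):
    """Combine low-fidelity experts with input-dependent coefficients.

    arch_features: list of floats encoding one architecture (hypothetical).
    experts:       list of callables, each mapping features to a score.
    gate_weights:  one weight row per expert for a linear gate (assumed).
    """
    expert_preds = [expert(arch_features) for expert in experts]
    # Linear gate: one logit per expert, conditioned on the architecture.
    logits = [
        sum(w * x for w, x in zip(row, arch_features))
        for row in gate_weights
    ]
    coeffs = softmax(logits)
    prediction = sum(c * p for c, p in zip(coeffs, expert_preds))
    return prediction, coeffs


if __name__ == "__main__":
    experts = [lambda f: 1.0, lambda f: 3.0]  # two toy experts
    pred, coeffs = ensemble_predict([0.2, 0.7], experts, [[0.0, 0.0], [0.0, 0.0]])
    print(pred, coeffs)
```

With an all-zero gate the coefficients are uniform and the output is the plain average of the experts; training the gate on a small set of true architecture-performance pairs would let the weights vary per architecture.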

CLOSE: Curriculum Learning On the Sharing Extent Towards Better One-shot NAS

Jul 16, 2022
Zixuan Zhou, Xuefei Ning, Yi Cai, Jiashu Han, Yiping Deng, Yuhan Dong, Huazhong Yang, Yu Wang

One-shot Neural Architecture Search (NAS) has been widely used to discover architectures due to its efficiency. However, previous studies reveal that one-shot performance estimations of architectures might not be well correlated with their performances in stand-alone training because of the excessive sharing of operation parameters (i.e., large sharing extent) between architectures. Thus, recent methods construct even more over-parameterized supernets to reduce the sharing extent. But these improved methods introduce a large number of extra parameters and thus cause an undesirable trade-off between the training costs and the ranking quality. To alleviate the above issues, we propose to apply Curriculum Learning On Sharing Extent (CLOSE) to train the supernet both efficiently and effectively. Specifically, we train the supernet with a large sharing extent (an easier curriculum) at the beginning and gradually decrease the sharing extent of the supernet (a harder curriculum). To support this training strategy, we design a novel supernet (CLOSENet) that decouples the parameters from operations to realize a flexible sharing scheme and adjustable sharing extent. Extensive experiments demonstrate that CLOSE can obtain a better ranking quality across different computational budget constraints than other one-shot supernets, and is able to discover superior architectures when combined with various search strategies. Code is available at https://github.com/walkerning/aw_nas.

* Accepted by ECCV 2022 (14 pages of main text)

Fault-Tolerant Deep Learning: A Hierarchical Perspective

Apr 05, 2022
Cheng Liu, Zhen Gao, Siting Liu, Xuefei Ning, Huawei Li, Xiaowei Li

With the rapid advancements of deep learning in the past decade, it can be foreseen that deep learning will be continuously deployed in more and more safety-critical applications such as autonomous driving and robotics. In this context, reliability turns out to be critical to the deployment of deep learning in these applications and gradually becomes a first-class citizen among the major design metrics like performance and energy efficiency. Nevertheless, black-box deep learning models combined with the diverse underlying hardware faults make resilient deep learning extremely challenging. In this special session, we conduct a comprehensive survey of fault-tolerant deep learning design approaches with a hierarchical perspective and investigate these approaches at the model layer, the architecture layer, the circuit layer, and across layers, respectively.

* Special session submitted to VTS'22 

CodedVTR: Codebook-based Sparse Voxel Transformer with Geometric Guidance

Mar 27, 2022
Tianchen Zhao, Niansong Zhang, Xuefei Ning, He Wang, Li Yi, Yu Wang

Transformers have gained much attention by outperforming convolutional neural networks in many 2D vision tasks. However, they are known to have generalization problems and rely on massive-scale pre-training and sophisticated training techniques. When applied to 3D tasks, the irregular data structure and limited data scale add to the difficulty of applying transformers. We propose CodedVTR (Codebook-based Voxel TRansformer), which improves data efficiency and generalization ability for 3D sparse voxel transformers. On the one hand, we propose codebook-based attention, which projects an attention space into its subspace represented by the combination of "prototypes" in a learnable codebook. It regularizes attention learning and improves generalization. On the other hand, we propose geometry-aware self-attention that utilizes geometric information (geometric pattern, density) to guide attention learning. CodedVTR can be embedded into existing sparse convolution-based methods and brings consistent performance improvements for indoor and outdoor 3D semantic segmentation tasks.

* Published at CVPR2022 

Multi-Agent Vulnerability Discovery for Autonomous Driving with Hazard Arbitration Reward

Dec 12, 2021
Weilin Liu, Ye Mu, Chao Yu, Xuefei Ning, Zhong Cao, Yi Wu, Shuang Liang, Huazhong Yang, Yu Wang

Discovering hazardous scenarios is crucial in testing and further improving driving policies. However, conducting efficient driving policy testing faces two key challenges. On the one hand, the probability of naturally encountering hazardous scenarios is low when testing a well-trained autonomous driving strategy. Thus, discovering these scenarios by purely real-world road testing is extremely costly. On the other hand, a proper determination of accident responsibility is necessary for this task. Collecting scenarios with wrongly attributed responsibilities will lead to an overly conservative autonomous driving strategy. To be more specific, we aim to discover hazardous scenarios that are autonomous-vehicle responsible (AV-responsible), i.e., the vulnerabilities of the under-test driving policy. To this end, this work proposes a Safety Test framework by finding Av-Responsible Scenarios (STARS) based on multi-agent reinforcement learning. STARS guides other traffic participants to produce AV-responsible scenarios and make the under-test driving policy misbehave by introducing a Hazard Arbitration Reward (HAR). HAR enables our framework to discover diverse, complex, and AV-responsible hazardous scenarios. Experimental results against four different driving policies in three environments demonstrate that STARS can effectively discover AV-responsible hazardous scenarios. These scenarios indeed correspond to the vulnerabilities of the under-test driving policies and are thus meaningful for their further improvement.

BoolNet: Minimizing The Energy Consumption of Binary Neural Networks

Jun 13, 2021
Nianhui Guo, Joseph Bethge, Haojin Yang, Kai Zhong, Xuefei Ning, Christoph Meinel, Yu Wang

Recent works on Binary Neural Networks (BNNs) have made promising progress in narrowing the accuracy gap of BNNs to their 32-bit counterparts. However, the accuracy gains are often based on specialized model designs using additional 32-bit components. Furthermore, almost all previous BNNs use 32-bit for feature maps and the shortcuts enclosing the corresponding binary convolution blocks, which helps to effectively maintain the accuracy, but is not friendly to hardware accelerators with limited memory, energy, and computing resources. Thus, we raise the following question: How can accuracy and energy consumption be balanced in a BNN network design? We extensively study this fundamental problem in this work and propose a novel BNN architecture without most commonly used 32-bit components: BoolNet. Experimental results on ImageNet demonstrate that BoolNet can achieve 4.6x energy reduction coupled with 1.2% higher accuracy than the commonly used BNN architecture Bi-RealNet. Code and trained models are available at: https://github.com/hpi-xnor/BoolNet.

Ensemble-in-One: Learning Ensemble within Random Gated Networks for Enhanced Adversarial Robustness

Mar 27, 2021
Yi Cai, Xuefei Ning, Huazhong Yang, Yu Wang

Adversarial attacks pose high security risks to modern deep learning systems. Adversarial training can significantly enhance the robustness of neural network models by suppressing the non-robust features. However, the models often suffer from significant accuracy loss on clean data. Ensemble training methods have emerged as promising solutions for defending against adversarial attacks by diversifying the vulnerabilities among the sub-models while maintaining accuracy comparable to standard training. However, existing ensemble methods scale poorly, owing to the rapid complexity increase when including more sub-models in the ensemble. Moreover, in real-world applications, it is difficult to deploy an ensemble with multiple sub-models, owing to the tight hardware resource budget and latency requirement. In this work, we propose ensemble-in-one (EIO), a simple but efficient way to train an ensemble within one random gated network (RGN). EIO augments the original model by replacing the parameterized layers with multi-path random gated blocks (RGBs) to construct an RGN. By diversifying the vulnerability of the numerous paths within the RGN, better robustness can be achieved. It provides high scalability because the number of paths within an EIO network increases exponentially with the network depth. Our experiments demonstrate that EIO consistently outperforms previous ensemble training methods with even less computational overhead.
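
The scalability argument above rests on simple counting: if each gated block offers several paths and the gates are chosen independently, the number of distinct sub-models is the product over blocks. The sketch below assumes, for illustration only, that every block has the same number of paths and that gates are sampled uniformly; the function names are hypothetical.

```python
import random


def num_paths(paths_per_block, num_blocks):
    """Distinct end-to-end paths when every block has the same fan-out."""
    return paths_per_block ** num_blocks


def sample_path(paths_per_block, num_blocks, rng):
    """Pick one gate index per block uniformly at random (assumed policy)."""
    return [rng.randrange(paths_per_block) for _ in range(num_blocks)]


if __name__ == "__main__":
    for depth in (5, 10, 20):
        print(depth, "blocks ->", num_paths(2, depth), "paths")
    print(sample_path(2, 10, random.Random(0)))
```

Even with only two paths per block, ten blocks already give 1024 candidate sub-models, whereas an explicit ensemble would need to train and store each sub-model separately.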
