Zhen Xie

Benchmarking and In-depth Performance Study of Large Language Models on Habana Gaudi Processors

Sep 29, 2023
Chengming Zhang, Baixi Sun, Xiaodong Yu, Zhen Xie, Weijian Zheng, Kamil Iskra, Pete Beckman, Dingwen Tao

Transformer models have achieved remarkable success in various machine learning tasks but suffer from high computational complexity and resource requirements. The quadratic complexity of the self-attention mechanism further exacerbates these challenges when dealing with long sequences and large datasets. Specialized AI hardware accelerators, such as the Habana GAUDI architecture, offer a promising solution to tackle these issues. GAUDI features a Matrix Multiplication Engine (MME) and a cluster of fully programmable Tensor Processing Cores (TPC). This paper explores the untapped potential of using GAUDI processors to accelerate Transformer-based models, addressing key challenges in the process. Firstly, we provide a comprehensive performance comparison between the MME and TPC components, illuminating their relative strengths and weaknesses. Secondly, we explore strategies to optimize MME and TPC utilization, offering practical insights to enhance computational efficiency. Thirdly, we evaluate the performance of Transformers on GAUDI, particularly in handling long sequences, and uncover performance bottlenecks. Lastly, we evaluate the end-to-end performance of two Transformer-based large language models (LLMs) on GAUDI. The contributions of this work encompass practical insights for practitioners and researchers alike. We delve into GAUDI's capabilities for Transformers through systematic profiling, analysis, and optimization exploration. Our study bridges a research gap and offers a roadmap for optimizing Transformer-based model training on the GAUDI architecture.
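The quadratic complexity the abstract refers to comes from materializing an n × n attention-score matrix for a sequence of length n. As an illustration only (not the paper's code), a minimal NumPy sketch of unprojected, single-head scaled dot-product self-attention makes the O(n²) term visible:

```python
import numpy as np

def self_attention(x):
    """Toy self-attention over a (n, d) sequence; the (n, n) score
    matrix is what makes compute and memory quadratic in n."""
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)                 # (n, n) score matrix
    scores -= scores.max(axis=-1, keepdims=True)  # softmax numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x                            # (n, d) output

rng = np.random.default_rng(0)
out = self_attention(rng.standard_normal((8, 4)))
print(out.shape)  # (8, 4)
```

Doubling the sequence length quadruples the size of the `scores` matrix, which is why long sequences stress accelerators like GAUDI's MME.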

Transfer Learning Across Heterogeneous Features For Efficient Tensor Program Generation

Apr 11, 2023
Gaurav Verma, Siddhisanket Raskar, Zhen Xie, Abid M Malik, Murali Emani, Barbara Chapman

Tuning tensor program generation involves searching among many possible combinations of program transformations for a given program on the target hardware to optimize tensor program execution. The massive search space and the exponential number of transformation combinations make auto-tuning tensor program generation a complex and challenging process, especially for heterogeneous targets. In this research, we attempt to address these problems by learning joint neural network and hardware features and transferring them to new target hardware. We extensively study the existing state-of-the-art dataset, TenSet, perform a comparative analysis of test-split strategies, and propose methodologies to prune the dataset. We adopt an attention-inspired approach for tuning tensor programs, enabling them to embed neural network and hardware-specific features. Our approach can prune the dataset to up to 45% of the baseline without compromising the Pairwise Comparison Accuracy (PCA). Further, the proposed methodology can achieve on-par or improved mean inference time with 25%-40% of the baseline tuning time across different networks and target hardware.
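Pairwise Comparison Accuracy (PCA), the metric the abstract preserves under pruning, is commonly defined as the fraction of candidate pairs whose relative ordering under the cost model matches their measured ordering. A small sketch of that common formulation (function name and data are illustrative, not taken from the paper):

```python
from itertools import combinations

def pairwise_comparison_accuracy(predicted, measured):
    """Fraction of candidate pairs whose relative ordering by the
    cost model agrees with the measured ordering (higher is better)."""
    pairs = list(combinations(range(len(predicted)), 2))
    correct = sum(
        1 for i, j in pairs
        if (predicted[i] < predicted[j]) == (measured[i] < measured[j])
    )
    return correct / len(pairs)

# Hypothetical predicted vs. measured costs for four candidate programs
pred = [1.0, 2.0, 3.0, 2.5]
meas = [1.1, 1.9, 3.2, 2.4]
print(pairwise_comparison_accuracy(pred, meas))  # 1.0 (all pairs ordered consistently)
```

Because auto-tuners only need to rank candidates, a pruned dataset is "good enough" as long as this ranking agreement holds, even if absolute cost predictions drift.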

Adaptive Neural Network-Based Approximation to Accelerate Eulerian Fluid Simulation

Aug 26, 2020
Wenqian Dong, Jie Liu, Zhen Xie, Dong Li

Eulerian fluid simulation is an important HPC application, and neural networks have been applied to accelerate it. However, current methods that accelerate fluid simulation with neural networks lack flexibility and generalization. In this paper, we tackle this limitation and aim to enhance the applicability of neural networks in Eulerian fluid simulation. We introduce Smartfluidnet, a framework that automates model generation and application. Given an existing neural network as input, Smartfluidnet generates multiple neural networks before the simulation to meet the execution time and simulation quality requirements. During the simulation, Smartfluidnet dynamically switches between the neural networks in a best-effort attempt to meet the user's requirement on simulation quality. Evaluating with 20,480 input problems, we show that Smartfluidnet achieves 1.46x and 590x speedup compared with a state-of-the-art neural network model and the original fluid simulation, respectively, on an NVIDIA Titan X Pascal GPU, while providing better simulation quality than the state-of-the-art model.
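The core scheduling idea — pick the cheapest surrogate model that still meets a user's quality target, falling back to the most accurate one otherwise — can be sketched in a few lines. All names and numbers below are hypothetical; this is not Smartfluidnet's actual selection logic:

```python
def pick_model(models, quality_target):
    """Choose the fastest candidate model whose estimated quality meets
    the target; fall back to the most accurate model otherwise.
    Entries are hypothetical (name, est_quality, est_time_per_step)."""
    feasible = [m for m in models if m[1] >= quality_target]
    if feasible:
        return min(feasible, key=lambda m: m[2])  # fastest feasible model
    return max(models, key=lambda m: m[1])        # best-effort fallback

candidates = [
    ("tiny",   0.90, 1.0),   # cheap, lower fidelity
    ("medium", 0.95, 2.5),
    ("large",  0.99, 6.0),   # accurate, expensive
]
print(pick_model(candidates, 0.94)[0])   # medium
print(pick_model(candidates, 0.999)[0])  # large (no model meets the target)
```

Re-running this selection during the simulation, as quality estimates evolve, is what "dynamically switches" amounts to conceptually.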

Smart-PGSim: Using Neural Network to Accelerate AC-OPF Power Grid Simulation

Aug 26, 2020
Wenqian Dong, Zhen Xie, Gokcen Kestor, Dong Li

The optimal power flow (OPF) problem is one of the most important optimization problems for the operation of the power grid; it calculates the optimal scheduling of the committed generation units. In this paper, we develop a neural network approach to accelerating the alternating current optimal power flow (AC-OPF) computation by generating an intelligent initial solution. The high quality of the initial solution, together with the guidance provided by other outputs of the neural network, enables faster convergence to the solution without losing the optimality of the final solution as computed by traditional methods. Smart-PGSim generates a novel multitask-learning neural network model to accelerate the AC-OPF simulation, and it automatically imposes the physical constraints of the simulation on the neural network. Smart-PGSim brings an average performance improvement of 49.2% (up to 91%), computed over 10,000 problem simulations, with respect to the original AC-OPF implementation, without losing the optimality of the final solution.
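Why a learned initial solution speeds up an iterative solver without changing its answer: the solver still iterates to the same convergence tolerance, it just starts closer. A deliberately tiny stand-in (Newton iteration for a square root, not an actual AC-OPF solver) illustrates the warm-start effect:

```python
def newton_sqrt(target, x0, tol=1e-3):
    """Toy stand-in for an iterative solver: Newton iteration for
    sqrt(target), counting iterations until |x^2 - target| <= tol."""
    x, iters = x0, 0
    while abs(x * x - target) > tol:
        x = 0.5 * (x + target / x)
        iters += 1
    return x, iters

cold_x, cold_iters = newton_sqrt(1e6, x0=1.0)    # naive cold start
warm_x, warm_iters = newton_sqrt(1e6, x0=990.0)  # "learned" start near the solution
print(cold_iters, warm_iters)  # warm start converges in far fewer iterations
```

Both runs converge to the same answer to within the tolerance; only the iteration count differs. That is the sense in which a high-quality initial solution accelerates the computation "without losing the optimality of the final solution".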

FLAME: A Self-Adaptive Auto-labeling System for Heterogeneous Mobile Processors

Mar 03, 2020
Jie Liu, Jiawen Liu, Zhen Xie, Dong Li

Accurately and efficiently labeling data on a mobile device is critical to the success of training machine learning models on mobile devices. Auto-labeling data on mobile devices is challenging, because data is usually generated incrementally and may contain unknown labels. Furthermore, the rich hardware heterogeneity of mobile devices makes it challenging to execute auto-labeling workloads efficiently. In this paper, we introduce Flame, an auto-labeling system that can label non-stationary data with unknown labels. Flame includes a runtime system that efficiently schedules and executes auto-labeling workloads on heterogeneous mobile processors. Evaluating Flame with eight datasets on a smartphone, we demonstrate that it enables auto-labeling with high labeling accuracy and high performance.
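One common way an auto-labeler can handle samples with unknown labels is a confidence threshold: accept the model's top prediction only when it is confident enough, and flag the rest for later handling. The sketch below is a hypothetical illustration of that general pattern, not Flame's actual mechanism:

```python
def auto_label(probs, threshold=0.8):
    """Assign the argmax class when the model is confident enough;
    otherwise flag the sample as carrying an unknown label.
    `probs` is a per-class probability list; names are illustrative."""
    best = max(range(len(probs)), key=lambda c: probs[c])
    if probs[best] >= threshold:
        return best          # confident: auto-assign this class index
    return "unknown"         # low confidence: defer / treat as unknown

print(auto_label([0.05, 0.90, 0.05]))  # 1
print(auto_label([0.40, 0.35, 0.25]))  # unknown
```

For incrementally generated data, such deferred "unknown" samples can seed new classes or be revisited once the model is updated, which is the kind of non-stationary setting the abstract describes.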
