Zhen Zheng

FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design

Jan 25, 2024
Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, Shuaiwen Leon Song


ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks

Dec 18, 2023
Xiaoxia Wu, Haojun Xia, Stephen Youn, Zhen Zheng, Shiyang Chen, Arash Bakhtiari, Michael Wyatt, Reza Yazdani Aminabadi, Yuxiong He, Olatunji Ruwase, Leon Song, Zhewei Yao


Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

Sep 19, 2023
Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, Shuaiwen Leon Song


Auto-Parallelizing Large Models with Rhino: A Systematic Approach on Production AI Platform

Feb 16, 2023
Shiwei Zhang, Lansong Diao, Siyu Wang, Zongyan Cao, Yiliang Gu, Chang Si, Ziji Shi, Zhen Zheng, Chuan Wu, Wei Lin


FusionStitching: Boosting Memory Intensive Computations for Deep Learning Workloads

Sep 23, 2020
Zhen Zheng, Pengzhan Zhao, Guoping Long, Feiwen Zhu, Kai Zhu, Wenyi Zhao, Lansong Diao, Jun Yang, Wei Lin


Auto-MAP: A DQN Framework for Exploring Distributed Execution Plans for DNN Workloads

Jul 08, 2020
Siyu Wang, Yi Rong, Shiqing Fan, Zhen Zheng, Lansong Diao, Guoping Long, Jun Yang, Xiaoyong Liu, Wei Lin
