
Haihao Shen

Efficient LLM Inference on CPUs

Nov 01, 2023

TEQ: Trainable Equivalent Transformation for Quantization of LLMs

Oct 17, 2023

Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs

Sep 28, 2023

Efficient Post-training Quantization with FP8 Formats

Sep 26, 2023

An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs

Jun 28, 2023

QuaLA-MiniLM: a Quantized Length Adaptive MiniLM

Oct 31, 2022

Fast DistilBERT on CPUs

Oct 27, 2022

Prune Once for All: Sparse Pre-Trained Language Models

Nov 10, 2021

Highly Efficient 8-bit Low Precision Inference of Convolutional Neural Networks with IntelCaffe

May 04, 2018