Mart van Baalen

Leech Lattice Vector Quantization for Efficient LLM Compression
Mar 11, 2026

Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking
Dec 02, 2024

Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference
Nov 27, 2024

GPTVQ: The Blessing of Dimensionality for LLM Quantization
Feb 23, 2024

The LLM Surgeon
Dec 28, 2023

QBitOpt: Fast and Accurate Bitwidth Reallocation during Training
Jul 10, 2023

Pruning vs Quantization: Which is Better?
Jul 06, 2023

FP8 versus INT8 for efficient deep learning inference
Mar 31, 2023

A Practical Mixed Precision Algorithm for Post-Training Quantization
Feb 10, 2023

Quantized Sparse Weight Decomposition for Neural Network Compression
Jul 22, 2022