

We propose a cluster-based quantization method to convert pre-trained full precision weights into ternary weights with minimal impact on the accuracy. In addition, we also constrain the activations to 8-bits thus enabling sub 8-bit full integer inference pipeline. Our method uses smaller clusters of N filters with a common scaling factor to minimize the quantization loss, while also maximizing the number of ternary operations. We show that with a cluster size of N=4 on Resnet-101, can achieve 71.8% TOP-1 accuracy, within 6% of the best full precision results while replacing ~85% of all multiplications with 8-bit accumulations. Using the same method with 4-bit weights achieves 76.3% TOP-1 accuracy which within 2% of the full precision result. We also study the impact of the size of the cluster on both performance and accuracy, larger cluster sizes N=64 can replace ~98% of the multiplications with ternary operations but introduces significant drop in accuracy which necessitates fine tuning the parameters with retraining the network at lower precision. To address this we have also trained low-precision Resnet-50 with 8-bit activations and ternary weights by pre-initializing the network with full precision weights and achieve 68.9% TOP-1 accuracy within 4 additional epochs. Our final quantized model can run on a full 8-bit compute pipeline, with a potential 16x improvement in performance compared to baseline full-precision models.