Luo Tao

CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference

Apr 09, 2026

Yulin Zou, Yan Chen, Wenyan Chen, JooYoung Park, Shivaraman Nitin, Luo Tao, Francisco Romero, Dmitrii Ustiugov

Abstract: Video streaming analytics is a crucial workload for vision-language model serving, but the high cost of multimodal inference limits scalability. Prior systems reduce inference cost by exploiting temporal and spatial redundancy in video streams, but they target either the vision transformer (ViT) or the LLM with a limited view, leaving end-to-end opportunities untapped. Moreover, existing methods incur significant overhead to identify redundancy, either through offline profiling and training or costly online computation, making them ill-suited for dynamic real-time streams. We present CodecSight, a codec-guided streaming video analytics system, built on a key observation that video codecs already extract the temporal and spatial structure of each stream as a byproduct of compression. CodecSight treats this codec metadata as a low-cost runtime signal to unify optimization across video decoding, visual processing, and LLM prefilling, with transmission reduction as an inherent benefit of operating directly on compressed bitstreams. This drives codec-guided patch pruning before ViT encoding and selective key-value cache refresh during LLM prefilling, both of which are fully online and do not require offline training. Experiments show that CodecSight achieves an improvement in throughput of up to 3× and a reduction of up to 87% in GPU compute over state-of-the-art baselines, maintaining competitive accuracy with only 0-8% F1 drop.

* 18 pages, 34 figures
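
The abstract names codec-guided patch pruning but does not spell out the pruning rule. As a purely hypothetical illustration, the sketch below assumes the codec signal is one motion vector per ViT patch and keeps only patches whose motion magnitude reaches a threshold, while keeping every patch on keyframes. The function name `select_patches`, the input layout, and the threshold value are assumptions made for this example, not details from the paper.

```python
from typing import List, Sequence, Tuple

MotionVector = Tuple[float, float]


def select_patches(
    motion: Sequence[Sequence[MotionVector]],
    is_keyframe: bool,
    threshold: float = 1.0,  # hypothetical cutoff, in pixels of motion
) -> List[Tuple[int, int]]:
    """Return (row, col) indices of patches to pass on to the vision encoder.

    `motion[row][col]` is an assumed per-patch motion vector (dx, dy)
    derived from codec metadata. Keyframes carry no motion reference,
    so every patch is kept for them.
    """
    kept = []
    for row, line in enumerate(motion):
        for col, (dx, dy) in enumerate(line):
            magnitude = (dx * dx + dy * dy) ** 0.5
            if is_keyframe or magnitude >= threshold:
                kept.append((row, col))
    return kept
```

With a 2×2 grid in which only one patch moves, a call on a non-keyframe returns that single patch index, so the encoder would process one patch instead of four.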


CoStream: Codec-Guided Resource-Efficient System for Video Streaming Analytics

Apr 07, 2026

Yulin Zou, Yan Chen, Wenyan Chen, JooYoung Park, Shivaraman Nitin, Luo Tao, Francisco Romero, Dmitrii Ustiugov

Abstract: Video streaming analytics is a crucial workload for vision-language model serving, but the high cost of multimodal inference limits scalability. Prior systems reduce inference cost by exploiting temporal and spatial redundancy in video streams, but they target either the vision transformer (ViT) or the LLM with a limited view, leaving end-to-end opportunities untapped. Moreover, existing methods incur significant overhead to identify redundancy, either through offline profiling and training or costly online computation, making them ill-suited for dynamic real-time streams. We present CoStream, a codec-guided streaming video analytics system built on a key observation that video codecs already extract the temporal and spatial structure of each stream as a byproduct of compression. CoStream treats this codec metadata as a low-cost runtime signal to unify optimization across video decoding, visual processing, and LLM prefilling, with transmission reduction as an inherent benefit of operating directly on compressed bitstreams. This drives codec-guided patch pruning before ViT encoding and selective key-value cache refresh during LLM prefilling, both of which are fully online and do not require offline training. Experiments show that CoStream achieves up to 3x throughput improvement and up to 87% GPU compute reduction over state-of-the-art baselines, while maintaining competitive accuracy with only 0-8% F1 drop.

* 18 pages, 34 figures


Applying the Roofline model for Deep Learning performance optimizations

Sep 23, 2020

Jacek Czaja, Michal Gallus, Joanna Wozna, Adam Grygielski, Luo Tao

Abstract: In this paper, we present a methodology for automatically creating Roofline models for Non-Uniform Memory Access (NUMA) systems, using Intel Xeon as an example. We also present an evaluation of highly efficient deep learning primitives as implemented in the Intel oneDNN library.

* oneDNN library analysis with roofline model
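
For context, the standard Roofline bound caps attainable performance at the lesser of peak compute throughput and memory bandwidth multiplied by arithmetic intensity (FLOPs per byte moved). The following is a minimal sketch of that bound only; the function names and any hardware numbers passed to them are illustrative assumptions, not values or tooling from the paper.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gbs: float, intensity: float) -> float:
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity).

    peak_gflops   -- peak compute throughput in GFLOP/s
    bandwidth_gbs -- peak memory bandwidth in GB/s
    intensity     -- arithmetic intensity of the kernel in FLOP/byte
    """
    return min(peak_gflops, bandwidth_gbs * intensity)


def ridge_point(peak_gflops: float, bandwidth_gbs: float) -> float:
    """Arithmetic intensity at which a kernel stops being memory-bound."""
    return peak_gflops / bandwidth_gbs
```

On a hypothetical machine with 1000 GFLOP/s peak and 100 GB/s bandwidth, a kernel at 2 FLOP/byte is memory-bound at 200 GFLOP/s, and the ridge point sits at 10 FLOP/byte. On a NUMA system the bandwidth term would differ depending on whether data is local or remote to the socket, which is one reason a per-configuration roof is useful.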


An FPGA-based Parallel Architecture for Face Detection using Mixed Color Models

May 27, 2014

Luo Tao, Shi Zaifeng

Abstract: In this paper, a reliable method for detecting human faces in color images is proposed. The system first detects skin color in the YCgCr and YIQ color spaces, then filters the binary texture and applies morphological processing to the result, and finally converts the skin tone to a user-configured preferred skin color in YIQ color space. The real-time adjusting circuit is implemented, and some simulation results are given. Experimental results demonstrate that the method achieves high detection rates and low false positives; a further advantage is its simplicity and low computational cost.

* 9 pages, 7 figures
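
The first stage described in the abstract, skin-color detection in YIQ space, can be illustrated in software with a simple threshold on the I (in-phase) channel. This is only a rough sketch: the paper targets an FPGA and also uses YCgCr, and the bounds `I_MIN` and `I_MAX` below are hypothetical values chosen for the example, not thresholds from the paper. The conversion uses `colorsys.rgb_to_yiq` from the Python standard library.

```python
import colorsys

# Hypothetical bounds on the I channel for skin tones (not from the paper).
I_MIN, I_MAX = 0.02, 0.35


def is_skin_pixel(r: int, g: int, b: int) -> bool:
    """Classify one 8-bit RGB pixel by thresholding its YIQ I component."""
    _, i, _ = colorsys.rgb_to_yiq(r / 255.0, g / 255.0, b / 255.0)
    return I_MIN < i < I_MAX


def skin_mask(image):
    """Build a binary mask (rows of booleans) from rows of (r, g, b) tuples."""
    return [[is_skin_pixel(*pixel) for pixel in row] for row in image]
```

The resulting binary mask is the kind of input that the later filtering and morphological steps mentioned in the abstract would operate on.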
