Kai Jia

Sound Explanation for Trustworthy Machine Learning

Jun 08, 2023
Kai Jia, Pasapol Saowakon, Limor Appelbaum, Martin Rinard

We take a formal approach to the explainability problem of machine learning systems. We argue against the practice of interpreting black-box models by attributing scores to input components, because the goals of attribution-based interpretation are inherently in conflict: we prove that no attribution algorithm can simultaneously satisfy specificity, additivity, completeness, and baseline invariance. We then formalize the concept of a sound explanation, which has been informally adopted in prior work: a sound explanation provides sufficient information to causally explain the predictions made by a system. Finally, we present feature selection as a sound explanation for cancer prediction models, with the goal of cultivating trust among clinicians.
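
To make the notion concrete, here is a minimal, hypothetical sketch of testing whether a selected feature subset suffices to determine a model's prediction: the explanation features are held fixed while the remaining features are resampled, and the prediction must not change. The names and interfaces are assumptions for illustration, and a sampling check like this only approximates the causal sufficiency a sound explanation requires; it is not the paper's procedure.

    # Hypothetical sufficiency check: hold the selected (explanation) features
    # fixed, resample the rest within their valid ranges, and require that the
    # predicted class never changes. Sampling only approximates sufficiency.
    import numpy as np

    def is_sufficient_explanation(model_predict, x, selected_idx, feature_ranges,
                                  n_samples=1000, seed=0):
        """model_predict: maps a batch of inputs to predicted class labels.
        selected_idx: indices of the features offered as the explanation.
        feature_ranges: (low, high) arrays bounding every feature."""
        rng = np.random.default_rng(seed)
        base_pred = model_predict(x[None])[0]
        unselected = [i for i in range(x.shape[0]) if i not in set(selected_idx)]
        low, high = feature_ranges
        samples = np.tile(x, (n_samples, 1))
        # Resample the unselected features while holding the explanation fixed.
        samples[:, unselected] = rng.uniform(low[unselected], high[unselected],
                                             size=(n_samples, len(unselected)))
        return bool(np.all(model_predict(samples) == base_pred))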

Effective Neural Network $L_0$ Regularization With BinMask

Apr 21, 2023
Kai Jia, Martin Rinard

$L_0$ regularization of neural networks is a fundamental problem. Beyond regularizing models for better generalizability, $L_0$ regularization is also used to select input features and to train sparse neural networks. There is a large body of research on related topics, some of it with quite complicated methods. In this paper, we show that a straightforward formulation, BinMask, which multiplies weights with deterministic binary masks and uses the identity straight-through estimator for backpropagation, is an effective $L_0$ regularizer. We evaluate BinMask on three tasks: feature selection, network sparsification, and model regularization. Despite its simplicity and without task-specific tuning, BinMask achieves performance competitive with methods designed for each task on all benchmarks. Our results suggest that decoupling weights from mask optimization, which has been widely adopted by previous work, is a key ingredient of effective $L_0$ regularization.
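
A minimal PyTorch-style sketch of the formulation described above (layer shapes, initialization, and names are assumptions, not the released BinMask code): the weights are gated by deterministic binary masks obtained by thresholding real-valued mask scores, and gradients reach the scores through the identity straight-through estimator.

    # Linear layer whose weights are gated by a deterministic binary mask; the
    # identity straight-through estimator lets gradients flow to the mask scores.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BinMaskLinear(nn.Module):
        def __init__(self, in_features, out_features):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
            self.bias = nn.Parameter(torch.zeros(out_features))
            # Real-valued scores; their sign gives the deterministic binary mask.
            self.mask_score = nn.Parameter(torch.full((out_features, in_features), 0.1))

        def binary_mask(self):
            hard = (self.mask_score > 0).to(self.mask_score.dtype)
            # Forward value is the hard 0/1 mask; backward treats binarization
            # as the identity function (identity straight-through estimator).
            return hard + self.mask_score - self.mask_score.detach()

        def forward(self, x):
            return F.linear(x, self.weight * self.binary_mask(), self.bias)

        def l0_penalty(self):
            # Number of active weights, differentiable through the STE mask.
            return self.binary_mask().sum()

Note that the mask scores are parameters separate from the weights, and the penalty would be added to the training loss with a coefficient (e.g. loss + lam * layer.l0_penalty()); both choices here are illustrative assumptions.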

Delving Deep into Pixel Alignment Feature for Accurate Multi-view Human Mesh Recovery

Jan 15, 2023
Kai Jia, Hongwen Zhang, Liang An, Yebin Liu

Regression-based methods have shown high efficiency and effectiveness for multi-view human mesh recovery. The key components of a typical regressor lie in the feature extraction of input views and the fusion of multi-view features. In this paper, we present Pixel-aligned Feedback Fusion (PaFF) for accurate yet efficient human mesh recovery from multi-view images. PaFF is an iterative regression framework that alternates between feature extraction and fusion. At each iteration, PaFF extracts pixel-aligned feedback features from each input view according to the reprojection of the current estimate and fuses them with respect to each vertex of the downsampled mesh. In this way, our regressor not only perceives the misalignment status of each view from the feedback features but also corrects the mesh parameters more effectively based on the features fused at the mesh vertices. Additionally, our regressor disentangles the global orientation and translation of the body mesh from the estimation of mesh parameters so that the camera parameters of the input views can be better utilized in the regression process. The efficacy of our method is validated on the Human3.6M dataset via comprehensive ablation experiments, where PaFF achieves 33.02 mm MPJPE and improves on the previous best solutions by more than 29%. The project page with code and video results can be found at https://kairobo.github.io/PaFF/.
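
The iterative extract-and-fuse loop described above can be sketched schematically as follows; every callable, the camera interface, and the iteration count are assumptions for illustration rather than the released PaFF implementation.

    # Schematic PaFF-style loop: at each iteration, sample pixel-aligned feedback
    # features from every view at the reprojected mesh vertices, fuse them per
    # vertex, and predict a correction to the current mesh parameters.
    def iterative_feedback_regression(feat_maps, cameras, sample_at, fuse,
                                      regressor, mesh_from, init_params, n_iters=3):
        """feat_maps: one feature map per input view (from a shared image encoder).
        sample_at(fmap, uv): pixel-aligned feature lookup at projected 2D points.
        fuse(per_view): per-vertex fusion of the multi-view feedback features.
        regressor(fused): predicts a correction to the current parameters.
        mesh_from(params): downsampled mesh vertices for the current estimate."""
        params = init_params
        for _ in range(n_iters):
            verts = mesh_from(params)                         # current mesh estimate
            per_view = [sample_at(fmap, cam.project(verts))   # reproject, then sample
                        for fmap, cam in zip(feat_maps, cameras)]
            fused = fuse(per_view)                            # per-vertex multi-view fusion
            params = params + regressor(fused)                # feedback-driven correction
        return params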

* Project Page: https://kairobo.github.io/PaFF/ 

Verifying Low-dimensional Input Neural Networks via Input Quantization

Aug 18, 2021
Kai Jia, Martin Rinard

Deep neural networks are an attractive tool for compressing the control policy lookup tables in systems such as the Airborne Collision Avoidance System (ACAS). It is vital to ensure the safety of such neural controllers via verification techniques. The problem of analyzing ACAS Xu networks has motivated many successful neural network verifiers. These verifiers typically analyze the internal computation of neural networks to decide whether a property regarding the input/output holds. The intrinsic complexity of neural network computation renders such verifiers slow to run and vulnerable to floating-point error. This paper revisits the original problem of verifying ACAS Xu networks. The networks take low-dimensional sensory inputs with training data provided by a precomputed lookup table. We propose to prepend an input quantization layer to the network. Quantization allows efficient verification via input state enumeration, whose complexity is bounded by the size of the quantization space. Quantization is equivalent to nearest-neighbor interpolation at run time, which has been shown to provide acceptable accuracy for ACAS in simulation. Moreover, our technique can deliver exact verification results immune to floating-point error if we directly enumerate the network outputs on the target inference implementation or on an accurate simulation of the target implementation.
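
The enumeration idea is simple enough to sketch directly; the interfaces below are assumptions for illustration, not the tooling used in the paper. Because the prepended quantization layer maps every real-valued input to a grid point, checking the property at every grid point, on the deployed inference routine itself (or an accurate simulation of it), covers the whole input space.

    # Verification by input enumeration over the quantization grid.
    import itertools
    import numpy as np

    def verify_by_enumeration(run_network, grids, property_holds):
        """run_network: the deployed inference routine or an exact simulation of it.
        grids: one 1-D array of quantization levels per input dimension.
        property_holds: maps (quantized input, network output) to True/False."""
        for point in itertools.product(*grids):
            x = np.asarray(point, dtype=np.float32)
            if not property_holds(x, run_network(x)):
                return False, x        # counterexample in the quantized input space
        return True, None              # property holds for every quantized input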

* SAS 2021 

Deep Retrieval: An End-to-End Learnable Structure Model for Large-Scale Recommendations

Jul 12, 2020
Weihao Gao, Xiangjun Fan, Jiankai Sun, Kai Jia, Wenzhi Xiao, Chong Wang, Xiaobing Liu

One of the core problems in large-scale recommendations is to retrieve the top relevant candidates accurately and efficiently, preferably in sub-linear time. Previous approaches are mostly based on a two-step procedure: first learn an inner-product model, then use maximum inner product search (MIPS) algorithms to find the top candidates, which can lose retrieval accuracy. In this paper, we present Deep Retrieval (DR), an end-to-end learnable structure model for large-scale recommendations. DR encodes all candidates into a discrete latent space. The latent codes of the candidates are model parameters learned together with the other neural network parameters to maximize the same objective function. Once the model is learned, a beam search over the latent codes retrieves the top candidates. Empirically, we show that DR, with sub-linear computational complexity, achieves almost the same accuracy as a brute-force baseline.
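
A schematic sketch of the retrieval step, with beam search over the discrete latent codes; the scoring interface, structure depth and width, and the mapping from codes to items are assumptions for illustration, not the DR implementation.

    # Beam search over a layered discrete latent space: each path of codes is a
    # latent code, and the top-B paths index the retrieved candidate items.
    import heapq

    def beam_search_codes(score_next, user, depth, beam_size):
        """score_next(user, prefix): iterable of (code, log_prob) pairs for the
        next layer, conditioned on the user and the partial path so far."""
        beams = [(0.0, ())]                                # (log-prob, partial path)
        for _ in range(depth):
            candidates = [(logp + step_logp, prefix + (code,))
                          for logp, prefix in beams
                          for code, step_logp in score_next(user, prefix)]
            beams = heapq.nlargest(beam_size, candidates)  # keep the top-B paths
        return beams   # each surviving path maps to a set of candidate items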

* 13 pages, 4 figures 

Efficient Exact Verification of Binarized Neural Networks

May 07, 2020
Kai Jia, Martin Rinard

We present a new system, EEV, for verifying binarized neural networks (BNNs). We formulate BNN verification as a Boolean satisfiability problem (SAT) with reified cardinality constraints of the form $y = (x_1 + \cdots + x_n \le b)$, where $x_i$ and $y$ are Boolean variables (possibly negated) and $b$ is an integer constant. We also identify two properties, balanced weight sparsity and lower cardinality bounds, that reduce the verification complexity of BNNs. EEV contains both a SAT solver enhanced to handle reified cardinality constraints natively and novel training strategies designed to reduce verification complexity by delivering networks with improved sparsity properties and cardinality bounds. We demonstrate the effectiveness of EEV by presenting the first exact verification results for $\ell_{\infty}$-bounded adversarial robustness of nontrivial convolutional BNNs on the MNIST and CIFAR10 datasets. Our results also show that, depending on the dataset and network architecture, our techniques verify BNNs between ten and ten thousand times faster than the best previous exact verification techniques for either binarized or real-valued networks.
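
To see why such an encoding is exact, the small example below (an illustration, not the EEV encoder) checks that a binarized neuron with $\pm 1$ weights and inputs equals a reified threshold on the number of inputs agreeing with their weights; this "at least $t$" form is the negation-equivalent of the "$\le b$" form above.

    # A binarized neuron is exactly a reified cardinality constraint: the
    # pre-activation equals 2*(number of agreements) - n, so the sign output is
    # a threshold on how many input literals agree with the weights.
    import itertools

    def bnn_neuron(weights, bias, inputs):                 # weights, inputs in {+1, -1}
        return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias >= 0 else -1

    def as_cardinality(weights, bias, inputs):
        n = len(weights)
        agreements = sum(1 for w, x in zip(weights, inputs) if w == x)
        threshold = -(-(n - bias) // 2)                    # ceil((n - bias) / 2)
        return 1 if agreements >= threshold else -1

    weights, bias = [1, -1, 1, 1, -1], 1
    assert all(bnn_neuron(weights, bias, xs) == as_cardinality(weights, bias, xs)
               for xs in itertools.product([1, -1], repeat=len(weights)))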

Exploiting Verified Neural Networks via Floating Point Numerical Error

Mar 06, 2020
Kai Jia, Martin Rinard

We show how to construct adversarial examples for neural networks with exactly verified robustness against $\ell_{\infty}$-bounded input perturbations by exploiting floating point error. We argue that any exact verification of real-valued neural networks must accurately model the implementation details of any floating point arithmetic used during inference or verification.
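
The root cause is easy to reproduce: the same dot product accumulated in different orders rounds differently in floating point, so a point that an exact verifier places just on the safe side of a decision boundary can land on the other side in a particular inference implementation. The snippet below only demonstrates the rounding discrepancy, not the adversarial construction in the paper.

    # Two accumulation orders of the same float32 dot product generally disagree.
    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.standard_normal(10000).astype(np.float32)
    x = rng.standard_normal(10000).astype(np.float32)

    forward = np.float32(0.0)
    for a, b in zip(w, x):
        forward += a * b                          # left-to-right accumulation
    backward = np.float32(0.0)
    for a, b in zip(w[::-1], x[::-1]):
        backward += a * b                         # right-to-left accumulation

    print(forward == backward)                    # typically False
    print(abs(float(forward) - float(backward)))  # nonzero rounding discrepancy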

MegDet: A Large Mini-Batch Object Detector

Apr 11, 2018
Chao Peng, Tete Xiao, Zeming Li, Yuning Jiang, Xiangyu Zhang, Kai Jia, Gang Yu, Jian Sun

The improvements in recent CNN-based object detection works, from R-CNN [11] and Fast/Faster R-CNN [10, 31] to the recent Mask R-CNN [14] and RetinaNet [24], mainly come from novel network architectures, new frameworks, or new loss designs. But mini-batch size, a key factor in training, has not been well studied. In this paper, we propose a Large Mini-Batch Object Detector (MegDet) to enable training with a much larger mini-batch size than before (e.g., from 16 to 256), so that we can effectively utilize multiple GPUs (up to 128 in our experiments) to significantly shorten the training time. Technically, we suggest a learning rate policy and Cross-GPU Batch Normalization, which together allow us to successfully train a large mini-batch detector in much less time (e.g., from 33 hours to 4 hours) and achieve even better accuracy. MegDet is the backbone of our submission (mmAP 52.5%) to the COCO 2017 Challenge, where we won 1st place in the Detection task.
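
For reference, a common linear-scaling-with-warmup learning rate schedule of the kind large mini-batch training relies on is sketched below; the constants and the exact policy used by MegDet are assumptions here.

    # Linear scaling rule with warmup: scale the base learning rate with the
    # mini-batch size, but ramp up to the scaled value over the first steps.
    def large_batch_lr(step, base_lr=0.02, base_batch=16, batch=256, warmup_steps=500):
        scaled_lr = base_lr * batch / base_batch          # linear scaling rule
        if step < warmup_steps:
            return base_lr + (scaled_lr - base_lr) * step / warmup_steps
        return scaled_lr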

Theano: A Python framework for fast computation of mathematical expressions

May 09, 2016
The Theano Development Team, Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Frédéric Bastien, Justin Bayer, Anatoly Belikov, Alexander Belopolsky, Yoshua Bengio, Arnaud Bergeron, James Bergstra, Valentin Bisson, Josh Bleecher Snyder, Nicolas Bouchard, Nicolas Boulanger-Lewandowski, Xavier Bouthillier, Alexandre de Brébisson, Olivier Breuleux, Pierre-Luc Carrier, Kyunghyun Cho, Jan Chorowski, Paul Christiano, Tim Cooijmans, Marc-Alexandre Côté, Myriam Côté, Aaron Courville, Yann N. Dauphin, Olivier Delalleau, Julien Demouth, Guillaume Desjardins, Sander Dieleman, Laurent Dinh, Mélanie Ducoffe, Vincent Dumoulin, Samira Ebrahimi Kahou, Dumitru Erhan, Ziye Fan, Orhan Firat, Mathieu Germain, Xavier Glorot, Ian Goodfellow, Matt Graham, Caglar Gulcehre, Philippe Hamel, Iban Harlouchet, Jean-Philippe Heng, Balázs Hidasi, Sina Honari, Arjun Jain, Sébastien Jean, Kai Jia, Mikhail Korobov, Vivek Kulkarni, Alex Lamb, Pascal Lamblin, Eric Larsen, César Laurent, Sean Lee, Simon Lefrancois, Simon Lemieux, Nicholas Léonard, Zhouhan Lin, Jesse A. Livezey, Cory Lorenz, Jeremiah Lowin, Qianli Ma, Pierre-Antoine Manzagol, Olivier Mastropietro, Robert T. McGibbon, Roland Memisevic, Bart van Merriënboer, Vincent Michalski, Mehdi Mirza, Alberto Orlandi, Christopher Pal, Razvan Pascanu, Mohammad Pezeshki, Colin Raffel, Daniel Renshaw, Matthew Rocklin, Adriana Romero, Markus Roth, Peter Sadowski, John Salvatier, François Savard, Jan Schlüter, John Schulman, Gabriel Schwartz, Iulian Vlad Serban, Dmitriy Serdyuk, Samira Shabanian, Étienne Simon, Sigurd Spieckermann, S. Ramana Subramanyam, Jakub Sygnowski, Jérémie Tanguay, Gijs van Tulder, Joseph Turian, Sebastian Urban, Pascal Vincent, Francesco Visin, Harm de Vries, David Warde-Farley, Dustin J. Webb, Matthew Willson, Kelvin Xu, Lijun Xue, Li Yao, Saizheng Zhang, Ying Zhang

Theano is a Python library that lets users define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Since its introduction, it has been one of the most used CPU and GPU mathematical compilers, especially in the machine learning community, and has shown steady performance improvements. Theano has been actively and continuously developed since 2008; multiple frameworks have been built on top of it, and it has been used to produce many state-of-the-art machine learning models. The present article is structured as follows. Section I provides an overview of the Theano software and its community. Section II presents the principal features of Theano and how to use them, and compares them with other similar projects. Section III focuses on recently introduced functionalities and improvements. Section IV compares the performance of Theano against Torch7 and TensorFlow on several machine learning models. Section V discusses current limitations of Theano and potential ways of improving it.
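
A minimal example of the define-then-compile workflow the abstract describes, using the standard Theano API:

    # Build a symbolic expression, differentiate it symbolically, and compile it
    # into an optimized callable (targeting CPU or GPU).
    import theano
    import theano.tensor as T

    x = T.dvector('x')                    # symbolic input vector
    y = T.sum(x ** 2)                     # symbolic expression
    grad = T.grad(y, x)                   # symbolic gradient of y w.r.t. x
    f = theano.function([x], [y, grad])   # optimized, compiled function
    print(f([1.0, 2.0, 3.0]))             # [array(14.0), array([2., 4., 6.])]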

* 19 pages, 5 figures 