Xiaoyang Qu

EdgeMA: Model Adaptation System for Real-Time Video Analytics on Edge Devices

Aug 17, 2023
Liang Wang, Nan Zhang, Xiaoyang Qu, Jianzong Wang, Jiguang Wan, Guokuan Li, Kaiyu Hu, Guilin Jiang, Jing Xiao

Real-time video analytics on edge devices for changing scenes remains a difficult task. Because edge devices are usually resource-constrained, edge deep neural networks (DNNs) have fewer weights and shallower architectures than general DNNs. As a result, they perform well only in limited scenarios and are sensitive to data drift. In this paper, we introduce EdgeMA, a practical and efficient video analytics system designed to adapt models to shifts in real-world video streams over time, addressing the data drift problem. EdgeMA extracts statistical texture features based on the gray-level co-occurrence matrix (GLCM) and uses a Random Forest classifier to detect domain shift. Moreover, we incorporate a model adaptation method based on importance weighting, specifically designed to update models to cope with label distribution shift. Through a rigorous evaluation of EdgeMA on a real-world dataset, we show that EdgeMA significantly improves inference accuracy.
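
For illustration, here is a minimal sketch of how GLCM texture features could feed a Random Forest shift detector, using scikit-image (>= 0.19, where the functions are spelled graycomatrix/graycoprops) and scikit-learn. The frames, labels, and feature choices are placeholders, not the authors' exact pipeline.

```python
# Sketch: GLCM texture features + Random Forest domain-shift detection.
# Assumes 8-bit grayscale frames; all data below is synthetic placeholder data.
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from sklearn.ensemble import RandomForestClassifier

def glcm_features(gray_frame):
    """Small statistical texture descriptor from the gray-level co-occurrence matrix."""
    glcm = graycomatrix(gray_frame, distances=[1], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    props = ("contrast", "homogeneity", "energy", "correlation")
    return np.array([graycoprops(glcm, p).mean() for p in props])

rng = np.random.default_rng(0)
frames = rng.integers(0, 256, size=(40, 64, 64), dtype=np.uint8)  # placeholder frames
domains = rng.integers(0, 2, size=40)                             # e.g. day vs. night scenes

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(np.stack([glcm_features(f) for f in frames]), domains)

# At runtime, a predicted domain that differs from the active one signals a shift.
current = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
print(clf.predict(glcm_features(current)[None, :]))
```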

* Accepted by 30th International Conference on Neural Information Processing (ICONIP 2023) 

FedET: A Communication-Efficient Federated Class-Incremental Learning Framework Based on Enhanced Transformer

Jun 27, 2023
Chenghao Liu, Xiaoyang Qu, Jianzong Wang, Jing Xiao

Federated Learning (FL) has attracted wide attention because it enables decentralized learning while preserving data privacy. However, most existing methods unrealistically assume that the classes encountered by local clients are fixed over time. Under this assumption, the model suffers severe catastrophic forgetting of old classes after learning new ones. Moreover, due to communication-cost constraints, it is challenging to use large-scale models in FL, which limits prediction accuracy. To address these challenges, we propose a novel framework, Federated Enhanced Transformer (FedET), which simultaneously achieves high accuracy and low communication cost. Specifically, FedET uses the Enhancer, a tiny module, to absorb and communicate new knowledge, and applies pre-trained Transformers combined with different Enhancers to ensure high accuracy on various tasks. To address local forgetting caused by the new classes of new tasks and global forgetting brought by non-i.i.d. (non-independent and identically distributed) class imbalance across local clients, we propose an Enhancer distillation method that rebalances old and new knowledge and alleviates the non-i.i.d. problem. Experimental results demonstrate that FedET's average accuracy on representative benchmark datasets is 14.1% higher than that of the state-of-the-art method, while saving 90% of the communication cost compared to the previous method.
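
As a sketch of the Enhancer idea, the adapter-style module below adds a small trainable bottleneck on top of a frozen pre-trained Transformer layer, so only the tiny module's weights need to be communicated. The dimensions, residual form, and nn.TransformerEncoderLayer backbone are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class Enhancer(nn.Module):
    """Tiny bottleneck module; only its parameters are trained and communicated."""
    def __init__(self, dim=256, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # residual refinement

backbone = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
for p in backbone.parameters():
    p.requires_grad = False  # the frozen pre-trained backbone stays on every client

enhancer = Enhancer(dim=256)
x = torch.randn(4, 16, 256)      # (batch, tokens, dim), placeholder features
out = enhancer(backbone(x))      # backbone features refined by the Enhancer

# Only the Enhancer's state_dict would be uploaded to the server each round.
payload = {k: v.cpu() for k, v in enhancer.state_dict().items()}
print(sum(v.numel() for v in payload.values()), "parameters communicated")
```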

* Accepted by the 2023 International Joint Conference on Artificial Intelligence (IJCAI 2023)

Shoggoth: Towards Efficient Edge-Cloud Collaborative Real-Time Video Inference via Adaptive Online Learning

Jun 27, 2023
Liang Wang, Kai Lu, Nan Zhang, Xiaoyang Qu, Jianzong Wang, Jiguang Wan, Guokuan Li, Jing Xiao

This paper proposes Shoggoth, an efficient edge-cloud collaborative architecture for boosting inference performance on real-time video of changing scenes. Shoggoth uses online knowledge distillation to improve the accuracy of models suffering from data drift and offloads the labeling process to the cloud, alleviating the constrained resources of edge devices. At the edge, we design adaptive training with small batches to adapt models under limited computing power, and adaptive sampling of training frames for robustness and reduced bandwidth. Evaluations on a realistic dataset show a 15%-20% model accuracy improvement over the edge-only strategy and lower network cost than the cloud-only strategy.
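
A simplified sketch of the edge-side online distillation step: the cloud returns teacher logits for a few sampled frames, and the edge model adapts on a small batch. The model, shapes, and hyper-parameters below are placeholders, not the system's actual configuration.

```python
import torch
import torch.nn.functional as F

# Placeholder edge student model.
student = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
optimizer = torch.optim.SGD(student.parameters(), lr=1e-3)

def distill_step(frames, teacher_logits, T=2.0):
    """One small-batch online knowledge-distillation step on the edge device."""
    optimizer.zero_grad()
    s_logits = student(frames)
    loss = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    loss.backward()
    optimizer.step()
    return loss.item()

frames = torch.randn(8, 3, 32, 32)   # small adaptive batch of sampled frames
teacher_logits = torch.randn(8, 10)  # stand-in for logits returned by the cloud labeler
print(distill_step(frames, teacher_logits))
```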

* Accepted by the 60th ACM/IEEE Design Automation Conference (DAC 2023)

Detecting Out-of-distribution Examples via Class-conditional Impressions Reappearing

Mar 17, 2023
Jinggang Chen, Xiaoyang Qu, Junjie Li, Jianzong Wang, Jiguang Wan, Jing Xiao

Out-of-distribution (OOD) detection aims to enable standard deep neural networks to distinguish anomalous inputs from the original training data. Previous progress has introduced various approaches in which the in-distribution training data and even several OOD examples are prerequisites. However, due to privacy and security concerns, such auxiliary data tends to be impractical in real-world scenarios. In this paper, we propose a data-free method that requires no training on natural data, called Class-Conditional Impressions Reappearing (C2IR), which utilizes image impressions from the fixed model to recover class-conditional feature statistics. Based on these, we introduce Integral Probability Metrics to estimate layer-wise class-conditional deviations and obtain layer weights by Measuring Gradient-based Importance (MGI). The experiments verify the effectiveness of our method and indicate that C2IR outperforms other post-hoc methods and reaches performance comparable to the full-access (ID and OOD) detection method, especially on the far-OOD dataset (SVHN).
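
To make the scoring structure concrete, here is a toy sketch that combines layer-wise class-conditional deviations into a single OOD score. A plain L2 distance stands in for the Integral Probability Metric, and fixed layer weights stand in for MGI, so this illustrates the shape of the computation rather than the paper's method.

```python
import numpy as np

def layer_deviation(feat, class_means):
    """Smallest class-conditional deviation at one layer
    (L2 used here as a stand-in for the integral probability metric)."""
    return min(np.linalg.norm(feat - mu) for mu in class_means)

def ood_score(layer_feats, class_means_per_layer, layer_weights):
    """Weighted sum of layer-wise deviations; larger means more likely OOD."""
    return sum(w * layer_deviation(f, means)
               for f, means, w in zip(layer_feats, class_means_per_layer, layer_weights))

rng = np.random.default_rng(0)
class_means_per_layer = [rng.normal(size=(10, 64)) for _ in range(3)]  # 3 layers, 10 classes
layer_weights = [0.2, 0.3, 0.5]            # placeholder for gradient-based importance weights
test_feats = [rng.normal(size=64) for _ in range(3)]
print(ood_score(test_feats, class_means_per_layer, layer_weights))
```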

* Accepted by ICASSP 2023 

Feature-Rich Audio Model Inversion for Data-Free Knowledge Distillation Towards General Sound Classification

Mar 14, 2023
Zuheng Kang, Yayun He, Jianzong Wang, Junqing Peng, Xiaoyang Qu, Jing Xiao

Data-Free Knowledge Distillation (DFKD) has recently attracted growing attention in the academic community, especially after major breakthroughs in computer vision. Despite promising results, the technique has not been well applied to audio and signal processing. Because audio signals have variable duration, they require their own modeling approach. In this work, we propose feature-rich audio model inversion (FRAMI), a data-free knowledge distillation framework for general sound classification tasks. It first generates high-quality and feature-rich Mel-spectrograms through a feature-invariant contrastive loss. Then, the hidden states before and after the statistics pooling layer are reused when knowledge distillation is performed on these feature-rich samples. Experimental results on the Urbansound8k, ESC-50, and audioMNIST datasets demonstrate that FRAMI can generate feature-rich samples. Meanwhile, the accuracy of the student model is further improved by reusing the hidden states, significantly outperforming the baseline method.
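
As an illustration of reusing hidden states during distillation, the loss below combines standard logit distillation with an MSE match on intermediate hidden states (e.g., before and after statistics pooling). The weighting, temperature, and tensor shapes are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def kd_with_hidden_reuse(s_logits, t_logits, s_hidden, t_hidden, T=2.0, alpha=0.5):
    """Logit distillation plus matching of intermediate hidden states."""
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                  F.softmax(t_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    hidden = sum(F.mse_loss(s, t) for s, t in zip(s_hidden, t_hidden))
    return alpha * kd + (1 - alpha) * hidden

s_logits, t_logits = torch.randn(8, 10), torch.randn(8, 10)
s_hidden = [torch.randn(8, 128), torch.randn(8, 256)]  # student pre/post-pooling states
t_hidden = [torch.randn(8, 128), torch.randn(8, 256)]  # matching teacher states
print(kd_with_hidden_reuse(s_logits, t_logits, s_hidden, t_hidden))
```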

* Accepted by the International Conference on Acoustics, Speech and Signal Processing (ICASSP 2023)

Learning Invariant Representation and Risk Minimized for Unsupervised Accent Domain Adaptation

Oct 15, 2022
Chendong Zhao, Jianzong Wang, Xiaoyang Qu, Haoqian Wang, Jing Xiao

Unsupervised representation learning for speech audio has attained impressive performance on speech recognition tasks, particularly when annotated speech is limited. However, the unsupervised paradigm needs to be carefully designed, and little is known about what properties these representations acquire. There is no guarantee that the model learns representations carrying information valuable for recognition. Moreover, the ability of the learned representations to adapt to other domains still needs to be assessed. In this work, we explore learning domain-invariant representations via a direct mapping of speech representations to their corresponding high-level linguistic information. Results show that the learned latents not only capture the articulatory features of each phoneme but also enhance adaptation ability, outperforming the baseline by a large margin on accented benchmarks.
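
A minimal sketch of the "direct mapping" idea: a frame-level linear probe from learned speech latents to phoneme labels, trained with cross-entropy. The latent dimension, phoneme inventory size, and probe architecture are placeholders rather than the paper's setup.

```python
import torch
import torch.nn as nn

latent_dim, num_phonemes = 256, 40            # placeholder sizes
probe = nn.Linear(latent_dim, num_phonemes)   # maps latents to linguistic units
criterion = nn.CrossEntropyLoss()

latents = torch.randn(4, 100, latent_dim)            # (batch, frames, dim) from an encoder
phonemes = torch.randint(0, num_phonemes, (4, 100))  # frame-level phoneme targets

logits = probe(latents)                              # (batch, frames, num_phonemes)
loss = criterion(logits.reshape(-1, num_phonemes), phonemes.reshape(-1))
loss.backward()                                      # gradients train the mapping
print(loss.item())
```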

* Accepted to SLT 2022 

Pose Guided Human Image Synthesis with Partially Decoupled GAN

Oct 07, 2022
Jianhan Wu, Jianzong Wang, Shijing Si, Xiaoyang Qu, Jing Xiao

Pose Guided Human Image Synthesis (PGHIS) is the challenging task of transforming a human image from a reference pose to a target pose while preserving its style. Most existing methods encode the texture of the whole reference human image into a latent space and then use a decoder to synthesize the image texture of the target pose. However, it is difficult to recover the detailed texture of the whole human image this way. To alleviate this problem, we propose a method that decouples the human body into several parts (e.g., hair, face, hands, feet) and then uses each of these parts to guide the synthesis of a realistic image of the person, preserving the detailed information of the generated images. In addition, we design a multi-head attention-based module for PGHIS. Because most convolutional neural network-based methods have difficulty modeling long-range dependencies due to the nature of the convolution operation, the long-range modeling capability of the attention mechanism is better suited than convolutional neural networks for the pose transfer task, especially for sharp pose deformation. Extensive experiments on the Market-1501 and DeepFashion datasets show that our method outperforms existing state-of-the-art methods on almost all qualitative and quantitative metrics.
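
The snippet below sketches how attention can fuse per-part texture tokens with target-pose tokens to capture long-range part-to-pose dependencies. The token counts, dimensions, and the single nn.MultiheadAttention layer are illustrative assumptions, not the paper's module.

```python
import torch
import torch.nn as nn

parts, dim = 6, 128                      # e.g. hair, face, hands, feet, ... (placeholder)
part_tokens = torch.randn(1, parts, dim) # one encoded texture token per body part
pose_tokens = torch.randn(1, 32, dim)    # target-pose keypoint embeddings

attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)
fused, _ = attn(query=pose_tokens, key=part_tokens, value=part_tokens)
# `fused` lets every pose location attend to any body part, which a decoder
# could then render into the target-pose image.
print(fused.shape)  # torch.Size([1, 32, 128])
```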

* 16 pages, accepted by the 14th Asian Conference on Machine Learning (ACML 2022)

Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition

Sep 30, 2022
Chendong Zhao, Jianzong Wang, Wen qi Wei, Xiaoyang Qu, Haoqian Wang, Jing Xiao

The Transformer architecture, based on self-attention and multi-head attention, has achieved remarkable success in offline end-to-end Automatic Speech Recognition (ASR). However, self-attention and multi-head attention cannot be easily applied to streaming or online ASR. For self-attention in Transformer ASR, the softmax-based attention mechanism makes it impossible to highlight important speech information. For multi-head attention in Transformer ASR, it is not easy to model monotonic alignments across different heads. To overcome these two limitations, we integrate sparse attention and monotonic attention into Transformer-based ASR. The sparse mechanism introduces a learned sparsity scheme that enables each self-attention structure to fit its corresponding head better. The monotonic attention deploys regularization to prune redundant heads of the multi-head attention structure. Experiments show that our method effectively improves the attention mechanism on widely used speech recognition benchmarks.
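
As a rough stand-in for the learned sparsity scheme, the helper below keeps only the top-k attention scores per query before the softmax, zeroing out the rest. The paper's actual mechanism is learned per head rather than a fixed top-k; this only illustrates how sparsifying the attention distribution highlights a few positions.

```python
import torch
import torch.nn.functional as F

def sparse_attention_weights(scores, k=8):
    """Top-k sparsified softmax over attention scores (fixed-k illustration)."""
    topk = torch.topk(scores, k=min(k, scores.size(-1)), dim=-1)
    masked = torch.full_like(scores, float("-inf"))
    masked.scatter_(-1, topk.indices, topk.values)   # keep only the top-k scores
    return F.softmax(masked, dim=-1)                 # non-top-k positions get weight 0

scores = torch.randn(2, 4, 50, 50)   # (batch, heads, queries, keys), placeholder
weights = sparse_attention_weights(scores)
print((weights > 0).sum(dim=-1).unique())  # at most k nonzero weights per query
```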

* Accepted to DSAA 2022 

Blur the Linguistic Boundary: Interpreting Chinese Buddhist Sutra in English via Neural Machine Translation

Sep 30, 2022
Denghao Li, Yuqiao Zeng, Jianzong Wang, Lingwei Kong, Zhangcheng Huang, Ning Cheng, Xiaoyang Qu, Jing Xiao

Buddhism is an influential religion with a long-standing history and profound philosophy. Nowadays, more and more people worldwide aspire to learn the essence of Buddhism, which makes its dissemination important. However, Buddhist scriptures written in classical Chinese are obscure to most people and to machine translation applications; for instance, general Chinese-English neural machine translation (NMT) fails in this domain. In this paper, we propose a novel approach to building a practical NMT model for Buddhist scriptures. Our translation pipeline achieves highly promising results in ablation experiments under three criteria.

* Accepted by the 34th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2022)

Boosting Star-GANs for Voice Conversion with Contrastive Discriminator

Sep 27, 2022
Shijing Si, Jianzong Wang, Xulong Zhang, Xiaoyang Qu, Ning Cheng, Jing Xiao

Nonparallel multi-domain voice conversion methods such as StarGAN-VC have been widely applied in many scenarios. However, training these models is usually challenging due to their complicated adversarial network architectures. To address this, we leverage state-of-the-art contrastive learning techniques and incorporate an efficient Siamese network structure into the StarGAN discriminator. Our method, called SimSiam-StarGAN-VC, boosts training stability and effectively prevents discriminator overfitting during training. We conduct experiments on the Voice Conversion Challenge (VCC 2018) dataset, plus a user study, to validate the performance of our framework. Experimental results show that SimSiam-StarGAN-VC significantly outperforms existing StarGAN-VC methods in terms of both objective and subjective metrics.
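
For reference, here is a generic SimSiam-style head (projector, predictor, stop-gradient on the target branch) that could sit on top of discriminator features; the layer sizes and where the features come from are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimSiamHead(nn.Module):
    """Projector + predictor with stop-gradient, the core of SimSiam."""
    def __init__(self, dim=256, hidden=64):
        super().__init__()
        self.projector = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.predictor = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, f1, f2):
        z1, z2 = self.projector(f1), self.projector(f2)
        p1, p2 = self.predictor(z1), self.predictor(z2)
        # Negative cosine similarity with stop-gradient on the target branch.
        return -(F.cosine_similarity(p1, z2.detach(), dim=-1).mean()
                 + F.cosine_similarity(p2, z1.detach(), dim=-1).mean()) / 2

head = SimSiamHead()
f1 = torch.randn(8, 256)  # discriminator features of one augmented view (placeholder)
f2 = torch.randn(8, 256)  # features of a second view of the same utterance
print(head(f1, f2))
```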

* 12 pages, 3 figures, Accepted by ICONIP 2022 