Yuan Zhou

ES2Net: An Efficient Spectral-Spatial Network for Hyperspectral Image Change Detection

Jul 23, 2023
Qingren Yao, Yuan Zhou, Wei Xiang

Hyperspectral image change detection (HSI-CD) aims to identify the differences between bitemporal HSIs. To mitigate spectral redundancy and improve the discriminativeness of change features, some methods introduce band selection to choose the bands most conducive to CD. However, these methods cannot be trained end-to-end with a deep learning-based feature extractor and fail to consider the complex nonlinear relationships among bands. In this paper, we propose an end-to-end efficient spectral-spatial change detection network (ES2Net) to address these issues. Specifically, we devise a learnable band selection module that automatically selects bands conducive to CD; it can be jointly optimized with the feature extraction network and captures the complex nonlinear relationships among bands. Moreover, considering the large differences in spatial feature distribution among bands, we design a cluster-wise spatial attention mechanism that assigns a spatial attention factor to each band, improving feature discriminativeness band by band. Experiments on three widely used HSI-CD datasets demonstrate the effectiveness and superiority of this method compared with other state-of-the-art methods.
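
As a rough illustration of how such a band selection module could be trained end-to-end, the sketch below scores bands with a small MLP (to model nonlinear inter-band relationships) and uses a Gumbel-softmax relaxation for differentiable selection. The gating scheme, layer sizes, and band counts are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableBandSelection(nn.Module):
    """Scores spectral bands and keeps a (soft) subset for change detection."""

    def __init__(self, num_bands: int, num_selected: int, hidden: int = 64):
        super().__init__()
        self.num_selected = num_selected
        # A small MLP models nonlinear inter-band relationships from
        # global band statistics (mean over the spatial dimensions).
        self.scorer = nn.Sequential(
            nn.Linear(num_bands, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_bands),
        )

    def forward(self, x: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        # x: (batch, bands, H, W) features of the bitemporal difference
        stats = x.mean(dim=(2, 3))            # (batch, bands)
        logits = self.scorer(stats)           # nonlinear band scores
        if self.training:
            # Soft, differentiable selection so the module trains jointly
            # with the feature extractor (assumption: Gumbel-softmax gate).
            weights = F.gumbel_softmax(logits, tau=tau, hard=False)
        else:
            # Hard top-k band selection at inference time.
            topk = logits.topk(self.num_selected, dim=1).indices
            weights = torch.zeros_like(logits).scatter_(1, topk, 1.0)
        return x * weights.unsqueeze(-1).unsqueeze(-1)

# Usage: select 10 of 154 bands, trainable end-to-end with a CD head.
selector = LearnableBandSelection(num_bands=154, num_selected=10)
out = selector(torch.randn(2, 154, 32, 32))
```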

ABC-KD: Attention-Based-Compression Knowledge Distillation for Deep Learning-Based Noise Suppression

May 26, 2023
Yixin Wan, Yuan Zhou, Xiulian Peng, Kai-Wei Chang, Yan Lu

Noise suppression (NS) models are widely applied to enhance speech quality. Recently, deep learning-based NS, which we denote Deep Noise Suppression (DNS), has become the mainstream NS method owing to its superior performance over traditional approaches. However, DNS models face two major challenges in real-world applications. First, high-performing DNS models are usually large, making deployment difficult. Second, DNS models require extensive training data, with noisy audio as inputs and clean audio as labels, and clean labels are often hard to obtain. We propose the use of knowledge distillation (KD) to resolve both challenges. Our study serves two main purposes. To begin with, we are among the first to comprehensively investigate mainstream KD techniques on DNS models with respect to these two challenges. Furthermore, we propose a novel Attention-Based-Compression KD method that outperforms all investigated mainstream KD frameworks on the DNS task.

* This paper was accepted to Interspeech 2023 Main Conference 
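
To make the attention-based-compression idea concrete, here is a hedged sketch in which a learned attention weighting decides how much each teacher layer contributes to the distillation target of a smaller student layer; the module name, loss choice, and dimensions are illustrative assumptions, not the paper's formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionLayerKD(nn.Module):
    """Distills several teacher layers into one student layer.

    Instead of hand-picking a teacher layer per student layer, a learned
    attention weighting decides how much each teacher layer contributes.
    """

    def __init__(self, num_teacher_layers: int, feat_dim: int):
        super().__init__()
        # One learnable attention logit per teacher layer.
        self.attn_logits = nn.Parameter(torch.zeros(num_teacher_layers))
        self.proj = nn.Linear(feat_dim, feat_dim)  # student -> teacher space

    def forward(self, student_feat, teacher_feats):
        # student_feat: (batch, time, feat_dim)
        # teacher_feats: list of (batch, time, feat_dim), one per teacher layer
        weights = F.softmax(self.attn_logits, dim=0)
        # Attention-weighted combination compresses the teacher stack.
        target = sum(w * f for w, f in zip(weights, teacher_feats))
        return F.mse_loss(self.proj(student_feat), target.detach())

# Hypothetical training step: total loss = NS task loss + KD loss.
kd = AttentionLayerKD(num_teacher_layers=6, feat_dim=256)
student = torch.randn(4, 100, 256)
teachers = [torch.randn(4, 100, 256) for _ in range(6)]
loss = kd(student, teachers)
```
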
Advancing Incremental Few-shot Semantic Segmentation via Semantic-guided Relation Alignment and Adaptation

May 18, 2023
Yuan Zhou, Xin Chen, Yanrong Guo, Shijie Hao, Richang Hong, Qi Tian

Incremental few-shot semantic segmentation (IFSS) aims to incrementally extend a semantic segmentation model to novel classes using only a few pixel-level annotated samples, while preserving its segmentation capability on previously learned base categories. Owing to data imbalance, this task suffers from a severe semantic-aliasing issue between base and novel classes that degrades segmentation results. To alleviate this issue, we propose the Semantic-guided Relation Alignment and Adaptation (SRAA) method, which fully exploits the guidance of prior semantic information. Specifically, we first conduct Semantic Relation Alignment (SRA) in the base step to align base-class representations with their semantics, so that the embeddings of base classes are constrained to have relatively low semantic correlations with other categories. Afterwards, based on the semantically aligned base categories, Semantic-Guided Adaptation (SGA) is employed during the incremental learning stage. It enforces affinities between the visual and semantic embeddings of encountered novel categories, making the feature representations consistent with their semantic information and thereby suppressing the semantic-aliasing issue. We evaluate our model on the PASCAL VOC 2012 and COCO datasets, and the experimental results on both demonstrate its competitive performance.
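
As a hedged sketch of what aligning base-class representations to their semantics might look like, the function below matches the pairwise-relation structure of visual class prototypes to that of their semantic embeddings; the KL objective, temperature, and dimensions are assumptions for illustration, not the paper's exact SRA loss.

```python
import torch
import torch.nn.functional as F

def semantic_relation_alignment(visual_protos, semantic_embs, temperature=0.1):
    """Aligns the pairwise-relation structure of visual class prototypes
    with that of their semantic (e.g., word) embeddings."""
    # Row-normalize so inner products are cosine similarities.
    v = F.normalize(visual_protos, dim=1)    # (C, d_v)
    s = F.normalize(semantic_embs, dim=1)    # (C, d_s)
    rel_v = v @ v.t() / temperature          # (C, C) visual relations
    rel_s = s @ s.t() / temperature          # (C, C) semantic relations
    # Push visual relations toward semantic relations, so a base class
    # keeps low correlation with semantically distant categories.
    return F.kl_div(
        F.log_softmax(rel_v, dim=1),
        F.softmax(rel_s, dim=1),
        reduction="batchmean",
    )

# Usage with hypothetical dimensions: 15 base classes, 512-d visual
# prototypes, 300-d word embeddings.
loss = semantic_relation_alignment(torch.randn(15, 512), torch.randn(15, 300))
```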

High-Fidelity and Freely Controllable Talking Head Video Generation

Apr 20, 2023
Yue Gao, Yuan Zhou, Jinglu Wang, Xiao Li, Xiang Ming, Yan Lu

Talking head generation aims to synthesize a video from a given source identity and target motion. However, current methods face several challenges that limit the quality and controllability of the generated videos. First, the generated face often shows unexpected deformation and severe distortion. Second, the driving image does not explicitly disentangle movement-relevant information, such as pose and expression, which restricts the manipulation of different attributes during generation. Third, the generated videos tend to flicker because the extracted landmarks are inconsistent between adjacent frames. In this paper, we propose a novel model that produces high-fidelity talking head videos with free control over head pose and expression. Our method models motion with both self-supervised learned landmarks and 3D face model-based landmarks. We also introduce a novel motion-aware multi-scale feature alignment module to effectively transfer the motion without face distortion. Furthermore, we enhance the smoothness of the synthesized videos with a feature context adaptation and propagation module. We evaluate our model on challenging datasets and demonstrate its state-of-the-art performance. More information is available at https://yuegao.me/PECHead.

* CVPR 2023 
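
One building block implied by motion-based feature alignment is warping source features with a dense motion field; a minimal sketch follows. The grid-sample warp is a standard technique and an assumption here: the paper's motion-aware multi-scale alignment module combines landmark-derived motion with learned refinement, which is not reproduced.

```python
import torch
import torch.nn.functional as F

def align_features(src_feat, flow):
    """Warps source features toward the driving motion with a dense
    flow field, one building block of multi-scale feature alignment."""
    b, _, h, w = src_feat.shape
    # Base identity sampling grid in [-1, 1] x [-1, 1].
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij"
    )
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    # Offset the grid by the (normalized) motion flow and resample.
    return F.grid_sample(
        src_feat, grid + flow, mode="bilinear", align_corners=True
    )

# Hypothetical multi-scale use: warp the features at each pyramid level
# with a flow predicted from landmarks at that resolution.
feat = torch.randn(1, 64, 32, 32)
flow = torch.zeros(1, 32, 32, 2)  # zero flow -> identity warp
out = align_features(feat, flow)
```
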
RAF: Holistic Compilation for Deep Learning Model Training

Mar 08, 2023
Cody Hao Yu, Haozheng Fan, Guangtai Huang, Zhen Jia, Yizhi Liu, Jie Wang, Zach Zheng, Yuan Zhou, Haichen Shen, Junru Shao, Mu Li, Yida Wang

As deep learning pervades modern applications, many frameworks have been built to let practitioners develop and train DNN models rapidly. Meanwhile, as training large deep learning models has become a trend in recent years, training throughput and memory footprint are increasingly crucial. Accordingly, optimizing training workloads with compiler techniques is inevitable and attracting growing attention. However, existing deep learning compilers (DLCs) mainly target inference and do not incorporate holistic optimizations for training workloads, such as automatic differentiation and automatic mixed precision. In this paper, we present RAF, a deep learning compiler for training. Unlike existing DLCs, RAF accepts a forward model and generates a training graph in house. Accordingly, RAF can systematically consolidate graph optimizations for performance, memory, and distributed training. In addition, to match the state-of-the-art performance of hand-crafted kernel libraries and tensor compilers, RAF proposes an operator dialect mechanism to seamlessly integrate all available kernel implementations. We demonstrate that, through in-house training graph generation and the operator dialect mechanism, RAF performs holistic optimizations and achieves either better training throughput or larger batch sizes than PyTorch (eager and TorchScript modes), XLA, and DeepSpeed for popular transformer models on GPUs.
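
To illustrate "accepts a forward model and generates a training graph in house," here is a toy reverse-mode autodiff pass over a tiny expression graph; RAF's real IR, passes, and operator dialects are far richer, so treat this only as a sketch of the idea.

```python
# Toy graph-level reverse-mode AD: forward model in, training graph out.
class Node:
    def __init__(self, op, inputs=(), name=""):
        self.op, self.inputs, self.name = op, tuple(inputs), name

def grad_graph(output, params):
    """Walks the forward graph in reverse and emits gradient nodes
    (assumes each node is consumed once, so no gradient accumulation)."""
    grads = {output: Node("ones_like", (output,), "dL")}
    order, seen = [], set()
    def topo(n):
        if n in seen:
            return
        seen.add(n)
        for i in n.inputs:
            topo(i)
        order.append(n)
    topo(output)
    for node in reversed(order):
        g = grads.get(node)
        if g is None:
            continue
        if node.op == "matmul":
            a, b = node.inputs
            grads[a] = Node("matmul", (g, Node("transpose", (b,))), "d" + a.name)
            grads[b] = Node("matmul", (Node("transpose", (a,)), g), "d" + b.name)
        elif node.op == "relu":
            (x,) = node.inputs
            grads[x] = Node("relu_grad", (x, g), "d" + x.name)
    return {p: grads[p] for p in params if p in grads}

# Forward model: loss = relu(x @ w); the gradient graph for w is derived.
x, w = Node("input", name="x"), Node("param", name="w")
loss = Node("relu", (Node("matmul", (x, w), name="xw"),), name="loss")
print({p.name: g.op for p, g in grad_graph(loss, [w]).items()})
```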

Medical visual question answering using joint self-supervised learning

Feb 25, 2023
Yuan Zhou, Jing Mei, Yiqin Yu, Tanveer Syeda-Mahmood

Visual Question Answering (VQA) has become one of the most active research problems in the medical imaging domain. A well-known VQA challenge is the intrinsic diversity between the image and text modalities; in the medical VQA task, there is the additional critical problem of the limited size of labelled image-question-answer data. In this study we propose an encoder-decoder framework that leverages an image-text joint representation learned from large-scale medical image-caption data and adapts it to the small medical VQA task. The encoder embeds the image and text modalities jointly with a self-attention mechanism and is pre-trained independently on the large-scale medical image-caption dataset with multiple self-supervised learning tasks. The decoder is then attached to the encoder and fine-tuned on the small medical VQA dataset. The experimental results show that our proposed method outperforms the baseline and SOTA methods.
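
A minimal sketch of the two-stage pipeline described above, assuming a transformer encoder over concatenated image and text tokens and a linear answer-classification decoder; the backbone, tokenization, and pretraining heads are illustrative assumptions.

```python
import torch
import torch.nn as nn

class JointEncoder(nn.Module):
    """Self-attention encoder over concatenated image and text tokens."""

    def __init__(self, dim=256, layers=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, img_tokens, txt_tokens):
        # Self-attention runs across both modalities jointly.
        return self.encoder(torch.cat([img_tokens, txt_tokens], dim=1))

class VQAModel(nn.Module):
    """Pre-trained joint encoder + small answer-classification decoder."""

    def __init__(self, encoder, dim=256, num_answers=500):
        super().__init__()
        self.encoder = encoder                      # pre-trained, then fine-tuned
        self.decoder = nn.Linear(dim, num_answers)  # trained on small VQA data

    def forward(self, img_tokens, txt_tokens):
        fused = self.encoder(img_tokens, txt_tokens)
        return self.decoder(fused.mean(dim=1))      # pooled -> answer logits

# Stage 1 (hypothetical): pretrain JointEncoder on image-caption pairs with
# self-supervised tasks (e.g., masked-token and image-text matching losses).
# Stage 2: attach the decoder and fine-tune on the small VQA dataset.
model = VQAModel(JointEncoder())
logits = model(torch.randn(2, 49, 256), torch.randn(2, 16, 256))
```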

Time-Variance Aware Real-Time Speech Enhancement

Feb 25, 2023
Chengyu Zheng, Yuan Zhou, Xiulian Peng, Yuan Zhang, Yan Lu

Time-variant factors often occur in real-world full-duplex communication applications. Some are caused by the complex environment, such as non-stationary environmental noise and varying acoustic paths, while others are caused by the communication system itself, such as the dynamic delay between the far-end and near-end signals. Current end-to-end deep neural network (DNN) based methods usually model the time-variant components implicitly and can hardly handle unpredictable time variance in real-time speech enhancement. To capture the time-variant components explicitly, we propose a dynamic kernel generation (DKG) module that can be introduced as a learnable plug-in to a DNN-based end-to-end pipeline. Specifically, the DKG module generates a convolutional kernel for each input audio frame, so that the model can dynamically adjust its weights according to the input signal during inference. Experimental results on joint acoustic echo cancellation (AEC) and deep noise suppression (DNS) tasks verify that the DKG module improves model performance under time-variant scenarios.
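
As a concrete sketch of per-frame kernel generation, the module below predicts a depthwise 1-D kernel over the frequency axis from each frame and applies it as a dynamic convolution; the kernel shape and generator network are assumptions for illustration, not the paper's exact DKG design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicKernelGeneration(nn.Module):
    """Generates a convolution kernel from each input frame and applies it,
    so the filter adapts to the input signal during inference."""

    def __init__(self, freq_bins: int, kernel_size: int = 5):
        super().__init__()
        self.k = kernel_size
        # Small network mapping a frame to its own kernel taps.
        self.gen = nn.Sequential(
            nn.Linear(freq_bins, 64), nn.ReLU(), nn.Linear(64, kernel_size)
        )

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        # frame: (batch, freq_bins) spectral features of one audio frame
        kernel = self.gen(frame)                   # (batch, k) per-sample taps
        pad = self.k // 2
        patches = F.unfold(
            frame.unsqueeze(1).unsqueeze(-1),      # (batch, 1, freq, 1)
            kernel_size=(self.k, 1), padding=(pad, 0),
        )                                          # (batch, k, freq)
        # Per-sample dynamic convolution: weight each tap by its own kernel.
        return torch.einsum("bk,bkf->bf", kernel, patches)

# Usage: each frame in the stream is filtered by its own generated kernel.
dkg = DynamicKernelGeneration(freq_bins=257)
enhanced = dkg(torch.randn(8, 257))
```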

Real-time speech enhancement with dynamic attention span

Feb 21, 2023
Chengyu Zheng, Yuan Zhou, Xiulian Peng, Yuan Zhang, Yan Lu

For real-time speech enhancement (SE), including noise suppression, dereverberation, and acoustic echo cancellation, the time variance of audio signals poses a severe challenge. Causality and memory constraints mean that only historical information is available for capturing time-variant characteristics. We propose to adaptively change the receptive field according to the input signal in a deep neural network-based SE model. Specifically, in an encoder-decoder framework, a dynamic attention span mechanism is introduced into all attention modules to control how much historical content is used when processing the current frame. Experimental results verify that this dynamic mechanism better tracks time-variant factors and captures speech-related characteristics, benefiting both interference removal and speech quality retention.

* ICASSP 2023 (Accepted) 
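
A hedged sketch of a dynamic attention span in a causal attention module: a per-frame span is predicted from the input and converted into a soft mask over the history, in the spirit of adaptive attention spans; the masking function and sizes are illustrative assumptions, not the paper's exact mechanism.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicSpanAttention(nn.Module):
    """Causal attention whose historical span is predicted per frame."""

    def __init__(self, dim: int, max_span: int = 64, ramp: float = 8.0):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.span_pred = nn.Linear(dim, 1)  # per-frame span in (0, max_span)
        self.max_span, self.ramp = max_span, ramp

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim), causal processing of a streaming signal
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(1, 2) / d**0.5           # (b, t, t)
        # Distance of each key frame behind each query frame.
        idx = torch.arange(t)
        dist = (idx.view(t, 1) - idx.view(1, t)).float()  # (t, t)
        span = torch.sigmoid(self.span_pred(x)) * self.max_span  # (b, t, 1)
        # Soft mask: ~1 within the span, ramping to 0 beyond it; strictly
        # causal because future frames (dist < 0) are masked out.
        mask = ((span - dist + self.ramp) / self.ramp).clamp(0, 1)
        mask = mask * (dist >= 0)
        attn = F.softmax(scores.masked_fill(mask == 0, -1e9), dim=-1) * mask
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-9)
        return attn @ v

# Usage: each frame attends over an input-dependent amount of history.
layer = DynamicSpanAttention(dim=128)
y = layer(torch.randn(2, 100, 128))
```
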
FLYOVER: A Model-Driven Method to Generate Diverse Highway Interchanges for Autonomous Vehicle Testing

Jan 30, 2023
Yuan Zhou, Gengjie Lin, Yun Tang, Kairui Yang, Wei Jing, Ping Zhang, Junbo Chen, Liang Gong, Yang Liu

It has become a consensus that autonomous vehicles (AVs) will first be widely deployed on highways. However, the complexity of highway interchanges is a bottleneck for deploying AVs: an AV should be sufficiently tested under different highway interchanges, which remains challenging due to the lack of datasets containing diverse highway interchanges. In this paper, we propose a model-driven method, FLYOVER, to generate a dataset of diverse interchanges with measurable diversity coverage. First, FLYOVER introduces a labeled digraph to model the topology of an interchange. Second, FLYOVER takes real-world interchanges as input to guarantee topological practicality and extracts topology equivalence classes by classifying the corresponding topology models. Third, for each topology class, FLYOVER identifies the corresponding geometrical features of the ramps and generates concrete interchanges using k-way combinatorial coverage and differential evolution. To illustrate the diversity and applicability of the generated interchange dataset, we test the built-in traffic flow control algorithm of SUMO and the fuel-optimizing trajectory tracking algorithm deployed on Alibaba's autonomous trucks on the dataset. The results show that, beyond their geometrical differences, the interchanges are diverse in throughput and fuel consumption under the traffic flow control and trajectory tracking algorithms, respectively.

* Accepted by ICRA 2023 
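
To make the first two steps concrete, the toy sketch below models an interchange as a labeled digraph and groups interchanges into topology equivalence classes via a brute-force canonical form; the node/edge labels and example graphs are hypothetical, and FLYOVER's geometry extraction and combinatorial generation are omitted.

```python
from itertools import permutations

def canonical_form(nodes, edges):
    """Returns an isomorphism-invariant signature of a labeled digraph by
    trying all node relabelings (fine for small interchange graphs)."""
    ids = sorted(nodes)
    best = None
    for perm in permutations(range(len(ids))):
        mapping = {ids[i]: perm[i] for i in range(len(ids))}
        sig = (
            tuple(sorted((mapping[n], nodes[n]) for n in nodes)),
            tuple(sorted((mapping[u], mapping[v], lbl) for u, v, lbl in edges)),
        )
        best = sig if best is None or sig < best else best
    return best

# Two hypothetical interchanges that differ only in node naming fall
# into the same topology equivalence class.
a = canonical_form({"n1": "entry", "n2": "exit", "n3": "merge"},
                   [("n1", "n3", "ramp"), ("n3", "n2", "mainline")])
b = canonical_form({"p": "entry", "q": "exit", "r": "merge"},
                   [("p", "r", "ramp"), ("r", "q", "mainline")])
assert a == b  # same equivalence class
```
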
Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning

Oct 15, 2022
Zihan Zhang, Yuhang Jiang, Yuan Zhou, Xiangyang Ji

In this paper, we study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches. In the multi-batch reinforcement learning framework, the agent is required to provide, in advance, a time schedule for policy updates; this setting is particularly suitable for scenarios where changing the policy adaptively is costly. Given a finite-horizon MDP with $S$ states, $A$ actions, and planning horizon $H$, we design a computationally efficient algorithm that achieves near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$\footnote{$\tilde{O}(\cdot)$ hides logarithmic terms of $(S,A,H,K)$.} over $K$ episodes using $O\left(H+\log_2\log_2(K)\right)$ batches with confidence parameter $\delta$. To the best of our knowledge, this is the first $\tilde{O}(\sqrt{SAH^3K})$ regret bound with $O(H+\log_2\log_2(K))$ batch complexity. Meanwhile, we show that achieving $\tilde{O}(\mathrm{poly}(S,A,H)\sqrt{K})$ regret requires at least $\Omega\left(H/\log_A(K)+\log_2\log_2(K)\right)$ batches, which matches our upper bound up to logarithmic terms. Our technical contributions are two-fold: 1) a near-optimal design scheme to explore over the unlearned states; 2) a computationally efficient algorithm to explore certain directions with an approximated transition model.
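
To see where an $O(H+\log_2\log_2(K))$ batch count can come from, the sketch below builds a hypothetical schedule with $H$ warm-up batches followed by stages of length roughly $K^{1-2^{-i}}$, the classic doubling-exponent trick that yields $\log_2\log_2(K)$ stages; this illustrates the batch-complexity scaling only and is not the algorithm's actual schedule.

```python
import math

def batch_schedule(K: int, H: int):
    """Hypothetical schedule with O(H + log2 log2 K) batches: H short
    warm-up batches, then stages whose lengths grow as K**(1 - 2**-i)."""
    schedule = [1] * H          # warm-up batches (one episode each)
    used, i = H, 1
    while used < K:
        # Doubling-exponent stage length, capped by the remaining episodes.
        stage = max(min(int(K ** (1 - 2.0 ** -i)), K - used), 1)
        schedule.append(stage)
        used += stage
        i += 1
    return schedule

s = batch_schedule(K=10**6, H=10)
# Roughly H + log2(log2(K)) batches in total.
print(len(s), 10 + math.ceil(math.log2(math.log2(10**6))))
```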
