Junlan Feng

Fine-grained Recognition with Learnable Semantic Data Augmentation

Sep 01, 2023
Yifan Pu, Yizeng Han, Yulin Wang, Junlan Feng, Chao Deng, Gao Huang

Fine-grained image recognition is a longstanding computer vision challenge that focuses on differentiating objects belonging to multiple subordinate categories within the same meta-category. Since images belonging to the same meta-category usually share similar visual appearances, mining discriminative visual cues is the key to distinguishing fine-grained categories. Although commonly used image-level data augmentation techniques have achieved great success in generic image classification problems, they are rarely applied in fine-grained scenarios, because their randomly chosen editing regions are prone to destroying the discriminative visual cues residing in subtle regions. In this paper, we propose diversifying the training data at the feature level to alleviate the discriminative-region loss problem. Specifically, we produce diversified augmented samples by translating image features along semantically meaningful directions. The semantic directions are estimated with a covariance prediction network, which predicts a sample-wise covariance matrix to adapt to the large intra-class variation inherent in fine-grained images. Furthermore, the covariance prediction network is jointly optimized with the classification network in a meta-learning manner to alleviate the degenerate-solution problem. Experiments on four competitive fine-grained recognition benchmarks (CUB-200-2011, Stanford Cars, FGVC-Aircraft, NABirds) demonstrate that our method significantly improves the generalization performance of several popular classification networks (e.g., ResNets, DenseNets, EfficientNets, RegNets, and ViT). Combined with a recently proposed method, our semantic data augmentation approach achieves state-of-the-art performance on the CUB-200-2011 dataset. The source code will be released.
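
To make the idea concrete, below is a minimal sketch of feature-level semantic augmentation, assuming a diagonal covariance for simplicity; CovariancePredictor, semantic_augment, and strength are our own illustrative names, not the authors' released code, and the meta-learning outer loop described in the abstract is omitted.

```python
# Minimal sketch of feature-level semantic augmentation (illustrative,
# not the authors' implementation). Assumption: diagonal covariance.
import torch
import torch.nn as nn

class CovariancePredictor(nn.Module):
    """Hypothetical network predicting a per-sample diagonal covariance."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.Softplus())

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.net(feats)  # (B, D) non-negative variances

def semantic_augment(feats, cov_net, strength=0.5):
    """Translate each feature along a random semantic direction drawn
    from its predicted sample-wise Gaussian."""
    var = cov_net(feats)                        # sample-wise variances
    eps = torch.randn_like(feats) * var.sqrt()  # direction ~ N(0, Sigma(x))
    return feats + strength * eps

# Usage: augment pooled backbone features before the classifier head.
feats = torch.randn(8, 512)                     # e.g. ResNet pooled features
aug = semantic_augment(feats, CovariancePredictor(512))
```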

Dynamic Perceiver for Efficient Visual Recognition

Jun 20, 2023
Yizeng Han, Dongchen Han, Zeyu Liu, Yulin Wang, Xuran Pan, Yifan Pu, Chao Deng, Junlan Feng, Shiji Song, Gao Huang

Early exiting has become a promising approach to improving the inference efficiency of deep networks. By structuring models with multiple classifiers (exits), predictions for "easy" samples can be generated at earlier exits, negating the need for executing deeper layers. Current multi-exit networks typically implement linear classifiers at intermediate layers, compelling low-level features to encapsulate high-level semantics. This sub-optimal design invariably undermines the performance of later exits. In this paper, we propose Dynamic Perceiver (Dyn-Perceiver) to decouple the feature extraction procedure from the early classification task with a novel dual-branch architecture. A feature branch serves to extract image features, while a classification branch processes a latent code assigned for classification tasks. Bi-directional cross-attention layers are established to progressively fuse the information of both branches. Early exits are placed exclusively within the classification branch, thus eliminating the need for linear separability in low-level features. Dyn-Perceiver constitutes a versatile and adaptable framework that can be built upon various architectures. Experiments on image classification, action recognition, and object detection demonstrate that our method significantly improves the inference efficiency of different backbones, outperforming numerous competitive approaches across a broad range of computational budgets. Evaluation on both CPU and GPU platforms substantiates the superior practical efficiency of Dyn-Perceiver. Code is available at https://www.github.com/LeapLabTHU/Dynamic_Perceiver.
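
As a rough illustration of the dual-branch design (our simplification, not the released Dyn-Perceiver code; DualBranchStage and exit_head are hypothetical names), the sketch below shows bi-directional cross-attention between a feature branch and a latent classification branch, with an early exit reading only the latent code.

```python
# Sketch of one dual-branch stage with an early exit (illustrative only).
import torch
import torch.nn as nn

class DualBranchStage(nn.Module):
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.feat_to_latent = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.latent_to_feat = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.exit_head = nn.Linear(dim, num_classes)  # early exit

    def forward(self, feats, latent):
        # The latent code queries the features, then vice versa.
        latent = latent + self.feat_to_latent(latent, feats, feats)[0]
        feats = feats + self.latent_to_feat(feats, latent, latent)[0]
        logits = self.exit_head(latent.mean(dim=1))   # read only the latent
        return feats, latent, logits

# Usage: exit early when a stage's prediction is confident enough
# (per-sample exit logic omitted for brevity).
feats, latent = torch.randn(2, 196, 256), torch.randn(2, 8, 256)
stage = DualBranchStage(256, 1000)
feats, latent, logits = stage(feats, latent)
if logits.softmax(-1).max() > 0.9:
    pass  # return early, skipping deeper stages
```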

MPPN: Multi-Resolution Periodic Pattern Network For Long-Term Time Series Forecasting

Jun 12, 2023
Xing Wang, Zhendong Wang, Kexin Yang, Junlan Feng, Zhiyan Song, Chao Deng, Lin Zhu

Long-term time series forecasting plays an important role in various real-world scenarios. Recent deep learning methods for long-term series forecasting tend to capture the intricate patterns of time series by decomposition-based or sampling-based methods. However, most of the extracted patterns may include unpredictable noise and lack good interpretability. Moreover, multivariate series forecasting methods usually ignore the individual characteristics of each variate, which may affect the prediction accuracy. To capture the intrinsic patterns of time series, we propose a novel deep learning network architecture, named Multi-resolution Periodic Pattern Network (MPPN), for long-term series forecasting. We first construct context-aware multi-resolution semantic units of the time series and employ multi-periodic pattern mining to capture its key patterns. Then, we propose a channel-adaptive module to capture each variate's perception of the different patterns. In addition, we present an entropy-based method for evaluating the predictability of a time series and providing an upper bound on prediction accuracy before forecasting. Our experimental evaluation on nine real-world benchmarks demonstrates that MPPN significantly outperforms state-of-the-art Transformer-based, decomposition-based, and sampling-based methods for long-term series forecasting.

* 21 pages 
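
As a loose illustration of the recipe above (our reading of the abstract, not the MPPN reference code), the sketch below mines dominant periods with an FFT and views the series at several temporal resolutions; top_periods and multi_resolution are hypothetical helpers.

```python
# Illustrative period mining + multi-resolution views (not MPPN itself).
import torch
import torch.nn.functional as F

def top_periods(x: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Return the k dominant periods of a (B, L, C) series via an FFT."""
    amp = torch.fft.rfft(x, dim=1).abs().mean(dim=(0, 2))
    amp[0] = 0                          # ignore the DC component
    freqs = amp.topk(k).indices.clamp(min=1)
    return x.shape[1] // freqs          # periods in time steps

def multi_resolution(x: torch.Tensor, scales=(1, 2, 4)):
    """View the series at several temporal resolutions via average pooling."""
    return [F.avg_pool1d(x.transpose(1, 2), s).transpose(1, 2) for s in scales]

x = torch.randn(8, 96, 7)               # (batch, length, variates)
print(top_periods(x), [p.shape for p in multi_resolution(x)])
```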

Healing Unsafe Dialogue Responses with Weak Supervision Signals

May 25, 2023
Zi Liang, Pinghui Wang, Ruofei Zhang, Shuo Zhang, Xiaofan Ye, Yi Huang, Junlan Feng

Recent years have seen increasing concern about unsafe response generation in large-scale dialogue systems, where agents learn offensive or biased behaviors from real-world corpora. Some methods have been proposed to address this issue by detecting and replacing unsafe training examples in a pipeline style. Though effective, they suffer from high annotation costs and adapt poorly to unseen scenarios as well as adversarial attacks. Besides, neglecting to provide substantive safe responses (e.g., simply replacing them with templates) causes dialogues to lose information. To address these issues, we propose an unsupervised pseudo-label sampling method, TEMP, that can automatically assign potentially safe responses. Specifically, our TEMP method groups responses into several clusters and samples multiple labels with an adaptively sharpened sampling strategy, inspired by the observation that unsafe samples in the clusters are usually few and lie in the tail of the distribution. Extensive experiments on chitchat and task-oriented dialogues show that TEMP outperforms state-of-the-art models with weak supervision signals and obtains comparable results under unsupervised learning settings.
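
A toy sketch of the cluster-then-sample idea follows (our interpretation of the abstract, not the TEMP implementation): responses are clustered, and replacements for an unsafe response are drawn from its cluster with a sharpened distribution favoring the head and avoiding the unsafe tail. The function name and the centroid-distance scoring are our assumptions.

```python
# Toy cluster-based pseudo-label sampling (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

def sample_safe_responses(embs, responses, unsafe_idx, k=10, temp=0.3, n=3):
    """Cluster responses, then sample n replacement candidates for an
    unsafe response from its cluster with a sharpened distribution."""
    km = KMeans(n_clusters=k, n_init=10).fit(embs)
    cid = km.labels_[unsafe_idx]                    # cluster of the unsafe one
    members = np.where(km.labels_ == cid)[0]
    # Favor the head of the cluster: members near the centroid.
    d = np.linalg.norm(embs[members] - km.cluster_centers_[cid], axis=1)
    p = np.exp(-d / temp)
    p /= p.sum()                                    # sharpened sampling weights
    picks = np.random.choice(members, size=min(n, len(members)),
                             replace=False, p=p)
    return [responses[i] for i in picks]
```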

Knowledge-Retrieval Task-Oriented Dialog Systems with Semi-Supervision

May 22, 2023
Yucheng Cai, Hong Liu, Zhijian Ou, Yi Huang, Junlan Feng

Most existing task-oriented dialog (TOD) systems track dialog states in terms of slots and values and use them to query a database for the relevant knowledge needed to generate responses. In real-life applications, user utterances are noisier, and it is thus more difficult to accurately track dialog states and correctly secure relevant knowledge. A recent advance in question answering and document-grounded dialog systems is the use of retrieval-augmented methods with a knowledge retriever. Inspired by this progress, we propose a retrieval-based method to enhance knowledge selection in TOD systems, which significantly outperforms the traditional database query method on real-life dialogs. Furthermore, we develop latent-variable-model-based semi-supervised learning, which can work with the knowledge retriever to leverage both labeled and unlabeled dialog data. The Joint Stochastic Approximation (JSA) algorithm is employed for semi-supervised model training, and the whole system is referred to as JSA-KRTOD. Experiments are conducted on MobileCS, a real-life dataset from China Mobile customer service, and show that JSA-KRTOD achieves superior performance in both labeled-only and semi-supervised settings.

* 5 pages, accepted by INTERSPEECH2023 
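
For intuition, here is a generic dense-retrieval sketch (not the JSA-KRTOD system; retrieve and the embedding shapes are our assumptions) showing how a knowledge retriever can replace a hand-written database query.

```python
# Generic dense knowledge retrieval (illustrative, not JSA-KRTOD).
import torch
import torch.nn.functional as F

def retrieve(context_emb: torch.Tensor, knowledge_embs: torch.Tensor, k: int = 3):
    """Return indices of the k knowledge pieces closest to the context."""
    scores = F.normalize(knowledge_embs, dim=-1) @ F.normalize(context_emb, dim=-1)
    return scores.topk(k).indices

ctx = torch.randn(256)               # encoded dialog history
kb = torch.randn(1000, 256)          # encoded knowledge snippets
print(retrieve(ctx, kb))             # hand these pieces to the generator
```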

VE-KWS: Visual Modality Enhanced End-to-End Keyword Spotting

Mar 14, 2023
Ao Zhang, He Wang, Pengcheng Guo, Yihui Fu, Lei Xie, Yingying Gao, Shilei Zhang, Junlan Feng

The performance of audio-only keyword spotting (KWS) systems, commonly measured in false alarms and false rejects, degrades significantly under far-field and noisy conditions. Therefore, audio-visual keyword spotting, which leverages complementary relationships across multiple modalities, has recently gained much attention. However, current studies mainly focus on combining the separately learned representations of different modalities, instead of exploring inter-modal relationships while modeling each modality. In this paper, we propose a novel visual modality enhanced end-to-end KWS framework (VE-KWS), which fuses the audio and visual modalities in two ways. The first is to use the speaker location information obtained from the lip region in videos to assist the training of a multi-channel audio beamformer. With the beamformer serving as an audio enhancement module, acoustic distortions caused by far-field or noisy environments can be significantly suppressed. The second is to conduct cross-attention between the modalities to capture inter-modal relationships and aid the representation learning of each modality. Experiments on the MISP challenge corpus show that our proposed model achieves a 2.79% false rejection rate and a 2.95% false alarm rate on the Eval set, a new SOTA performance compared with the top-ranking systems in the ICASSP2022 MISP challenge.

* 5 pages. Accepted at ICASSP2023 
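
The second fusion aspect can be sketched as follows (a simplification under our own assumptions, not the VE-KWS release): each modality attends to the other via cross-attention before the keyword classifier.

```python
# Illustrative cross-modal fusion via bi-directional cross-attention.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, 4, batch_first=True)

    def forward(self, audio, video):
        # Audio queries video features and vice versa.
        audio = audio + self.a2v(audio, video, video)[0]
        video = video + self.v2a(video, audio, audio)[0]
        # Pool both streams and concatenate for the keyword classifier.
        return torch.cat([audio.mean(1), video.mean(1)], dim=-1)

fusion = CrossModalFusion(128)
out = fusion(torch.randn(2, 100, 128),   # audio frames
             torch.randn(2, 25, 128))    # lip-region video frames
```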

ESCL: Equivariant Self-Contrastive Learning for Sentence Representations

Mar 09, 2023
Jie Liu, Yixuan Liu, Xue Han, Chao Deng, Junlan Feng

Previous contrastive learning methods for sentence representations often focus on insensitive transformations to produce positive pairs, but neglect the role of sensitive transformations that are harmful to semantic representations. We therefore propose an Equivariant Self-Contrastive Learning (ESCL) method that makes full use of sensitive transformations: an additional equivariant learning task encourages the learned representations to be sensitive to certain types of transformations. Meanwhile, to improve practicality and generality, ESCL simplifies the implementation of traditional equivariant contrastive methods by sharing model parameters, from a multi-task learning perspective. We evaluate ESCL on semantic textual similarity tasks. The proposed method achieves better results while using fewer learnable parameters than previous methods.

* accepted by ICASSP 2023 
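
A schematic sketch of the multi-task objective follows (our interpretation of the abstract, not the ESCL code): a shared encoder serves both a standard contrastive loss over insensitive views and an equivariant head that must recognize which sensitive transformation was applied. The head size, encoder shape, and temperature are illustrative.

```python
# Schematic contrastive + equivariant multi-task loss (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 128))
equiv_head = nn.Linear(128, 4)       # e.g. 4 sensitive transformation types

def escl_loss(x_a, x_b, x_sens, sens_label, tau=0.05):
    """Contrastive loss on insensitive views + equivariant prediction of
    the sensitive transformation, with one shared encoder."""
    z_a, z_b = encoder(x_a), encoder(x_b)          # insensitive views
    sim = F.normalize(z_a, dim=-1) @ F.normalize(z_b, dim=-1).T / tau
    contrastive = F.cross_entropy(sim, torch.arange(len(x_a)))
    equivariant = F.cross_entropy(equiv_head(encoder(x_sens)), sens_label)
    return contrastive + equivariant               # multi-task objective

loss = escl_loss(torch.randn(16, 768), torch.randn(16, 768),
                 torch.randn(16, 768), torch.randint(0, 4, (16,)))
```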

Adaptive Hybrid Spatial-Temporal Graph Neural Network for Cellular Traffic Prediction

Feb 28, 2023
Xing Wang, Kexin Yang, Zhendong Wang, Junlan Feng, Lin Zhu, Juan Zhao, Chao Deng

Cellular traffic prediction is an indispensable part of intelligent telecommunication networks. Nevertheless, due to frequent user mobility and complex network scheduling mechanisms, cellular traffic often exhibits complicated spatial-temporal patterns, making prediction incredibly challenging. Although advanced algorithms such as graph-based prediction approaches have recently been proposed, they frequently model spatial dependencies with static or dynamic graphs and neglect the multiple coexisting spatial correlations induced by traffic generation. Meanwhile, some works fail to consider the diversity of cellular traffic patterns, resulting in suboptimal predictions. In this paper, we propose a novel deep learning network architecture, the Adaptive Hybrid Spatial-Temporal Graph Neural Network (AHSTGNN), to tackle the cellular traffic prediction problem. First, we apply adaptive hybrid graph learning to learn the compound spatial correlations among cell towers. Second, we implement a Temporal Convolution Module with multi-periodic temporal data input to capture the nonlinear temporal dependencies. In addition, we introduce an extra Spatial-Temporal Adaptive Module to handle the heterogeneity among cell towers. Our experiments on two real-world cellular traffic datasets show that AHSTGNN outperforms the state-of-the-art by a significant margin, illustrating the superior scalability of our method for spatial-temporal cellular traffic prediction.

* To be published in IEEE International Conference on Communications (ICC) 
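
As a compact illustration of adaptive graph learning in general (not the AHSTGNN code; AdaptiveGraphConv is a hypothetical module), an adjacency matrix can be learned from node embeddings so that spatial correlations among cell towers need not be fixed in advance.

```python
# Generic adaptive graph convolution with a learned adjacency (sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveGraphConv(nn.Module):
    def __init__(self, num_nodes: int, dim: int, emb_dim: int = 16):
        super().__init__()
        self.emb = nn.Parameter(torch.randn(num_nodes, emb_dim))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                 # x: (B, N, D) node features
        # Learn the adjacency from node embeddings instead of fixing it.
        adj = F.softmax(F.relu(self.emb @ self.emb.T), dim=-1)
        return self.proj(adj @ x)         # propagate along the learned graph

gc = AdaptiveGraphConv(num_nodes=50, dim=32)   # e.g. 50 cell towers
out = gc(torch.randn(4, 50, 32))
```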

Multi-Action Dialog Policy Learning from Logged User Feedback

Feb 27, 2023
Shuo Zhang, Junzhou Zhao, Pinghui Wang, Tianxiang Wang, Zi Liang, Jing Tao, Yi Huang, Junlan Feng

Multi-action dialog policies, which generate multiple atomic dialog actions per turn, have been widely applied in task-oriented dialog systems to provide expressive and efficient system responses. Existing policy models usually imitate action combinations from labeled multi-action dialog examples. Due to data limitations, they generalize poorly to unseen dialog flows. While reinforcement learning-based methods incorporating service ratings from real users and user simulators as external supervision signals have been proposed, they suffer from sparse and less credible dialog-level rewards. To cope with this problem, we explore improving multi-action dialog policy learning with explicit and implicit turn-level user feedback received for historical predictions (i.e., logged user feedback), which is cost-efficient to collect and faithful to real-world scenarios. The task is challenging since logged user feedback provides only partial labels, limited to the particular historical dialog actions predicted by the agent. To fully exploit such feedback, we propose BanditMatch, which addresses the task from a feedback-enhanced semi-supervised learning perspective with a hybrid objective combining semi-supervised learning and bandit learning. BanditMatch integrates pseudo-labeling methods to better explore the action space by constructing full label feedback. Extensive experiments show that BanditMatch outperforms state-of-the-art methods by generating more concise and informative responses. The source code and the appendix of this paper can be obtained from https://github.com/ShuoZhangXJTU/BanditMatch.

* AAAI 2023 
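
A toy sketch of a hybrid bandit-plus-pseudo-label objective follows (our paraphrase of the abstract, not BanditMatch itself; hybrid_loss, the sigmoid policy, and the confidence threshold are all our assumptions).

```python
# Toy hybrid objective: bandit feedback + pseudo-labels (illustrative only).
import torch
import torch.nn.functional as F

def hybrid_loss(logits, logged_action, reward, thresh=0.8):
    """Combine bandit feedback on the logged action with pseudo-labels
    on confidently predicted actions."""
    probs = torch.sigmoid(logits)             # multi-label action policy
    # Bandit term: only the logged action carries observed feedback.
    bandit = F.binary_cross_entropy(probs[:, logged_action], reward.float())
    # Pseudo-label term: trust confident predictions on the other actions
    # to reconstruct full label feedback.
    mask = (probs > thresh) | (probs < 1 - thresh)
    if mask.any():
        pseudo = F.binary_cross_entropy(
            probs[mask], (probs[mask] > 0.5).float())
    else:
        pseudo = torch.zeros((), device=logits.device)
    return bandit + pseudo

logits = torch.randn(4, 12, requires_grad=True)   # 12 atomic actions
loss = hybrid_loss(logits, logged_action=3, reward=torch.ones(4))
```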