
Junbo Zhang


Spatio-Temporal Contrastive Self-Supervised Learning for POI-level Crowd Flow Inference

Sep 12, 2023
Songyu Ke, Ting Li, Li Song, Yanping Sun, Qintian Sun, Junbo Zhang, Yu Zheng

Accurate acquisition of crowd flow at Points of Interest (POIs) is pivotal for effective traffic management, public service, and urban planning. Despite this importance, due to the limitations of urban sensing techniques, the data quality from most sources is inadequate for monitoring crowd flow at each POI. This renders the inference of accurate crowd flow from low-quality data a critical and challenging task. The complexity is heightened by three key factors: 1) the scarcity of labeled data, 2) the intricate spatio-temporal dependencies among POIs, and 3) the myriad correlations between precise crowd flow and GPS reports. To address these challenges, we recast the crowd flow inference problem as a self-supervised attributed graph representation learning task and introduce a novel Contrastive Self-learning framework for Spatio-Temporal data (CSST). Our approach begins by constructing a spatial adjacency graph from the POIs and the distances between them. We then employ contrastive learning to exploit large volumes of unlabeled spatio-temporal data, adopting a swapped prediction approach to anticipate the representation of the target subgraph from similar instances. Following the pre-training phase, the model is fine-tuned with accurate crowd flow data. Our experiments, conducted on two real-world datasets, demonstrate that CSST, pre-trained on extensive noisy data, consistently outperforms models trained from scratch.

* 18 pages; submitted to TKDD; 
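The abstract does not spell out the swapped-prediction objective; the sketch below is a minimal SwAV-style reading of it, where each augmented view of a POI subgraph predicts the prototype ("code") assignment of its paired view. Function and parameter names (swapped_prediction_loss, prototypes) are illustrative, and the code assignment uses a simplified softmax rather than the Sinkhorn step of full SwAV.

```python
# Hedged sketch of a swapped-prediction contrastive loss (not the authors' code).
import torch
import torch.nn.functional as F

def swapped_prediction_loss(z1, z2, prototypes, temperature=0.1):
    """z1, z2: (batch, dim) embeddings of two views of the same POI subgraph."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    c = F.normalize(prototypes, dim=1)                 # (num_prototypes, dim)
    scores1, scores2 = z1 @ c.t() / temperature, z2 @ c.t() / temperature
    with torch.no_grad():                              # target "codes" are detached
        q1, q2 = F.softmax(scores1, dim=1), F.softmax(scores2, dim=1)
    # Each view predicts the code of the *other* view (the "swap").
    loss = -0.5 * ((q2 * F.log_softmax(scores1, dim=1)).sum(1)
                   + (q1 * F.log_softmax(scores2, dim=1)).sum(1)).mean()
    return loss

# Usage with embeddings from any graph encoder over the spatial adjacency graph.
z1, z2 = torch.randn(32, 64), torch.randn(32, 64)
prototypes = torch.nn.Parameter(torch.randn(128, 64))
print(swapped_prediction_loss(z1, z2, prototypes))
```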

CED: Consistent ensemble distillation for audio tagging

Sep 08, 2023
Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Junbo Zhang, Yujun Wang


Augmentation and knowledge distillation (KD) are well-established techniques employed in audio classification tasks, aimed at enhancing performance and reducing model sizes on the widely recognized Audioset (AS) benchmark. Although both techniques are effective individually, their combined use, called consistent teaching, has not been explored before. This paper proposes CED, a simple training framework that distills student models from large teacher ensembles with consistent teaching. To achieve this, CED efficiently stores logits as well as the augmentation methods on disk, making it scalable to large-scale datasets. Central to CED's efficacy is its label-free nature: only the stored logits are used to optimize a student model, requiring just 0.3% additional disk space for AS. The study trains various transformer-based models, including a 10M parameter model achieving a 49.0 mean average precision (mAP) on AS. Pretrained models and code are available at https://github.com/RicherMans/CED.
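A minimal sketch of the consistent-teaching idea described above, assuming an augmentation that is deterministic given a stored seed: the teacher's logits are cached once per (clip, seed) pair, and the student later replays the identical augmentation and learns from the cached logits alone. Names and the cache layout are illustrative, not the released CED code.

```python
# Hedged consistent-teaching KD sketch (illustrative names, toy linear models).
import torch
import torch.nn.functional as F

def augment(wave: torch.Tensor, seed: int) -> torch.Tensor:
    """Stand-in augmentation that is reproducible from the stored seed."""
    g = torch.Generator().manual_seed(seed)
    return wave + 0.01 * torch.randn(wave.shape, generator=g)

def cache_teacher_logits(teacher, wave, seed):
    with torch.no_grad():
        return {"seed": seed, "logits": teacher(augment(wave, seed))}

def student_step(student, wave, record, temperature=1.0):
    x = augment(wave, record["seed"])                  # replay the same augmentation
    s = student(x) / temperature
    t = record["logits"] / temperature
    # Audioset is multi-label, so per-class binary distillation is a natural choice.
    return F.binary_cross_entropy_with_logits(s, torch.sigmoid(t))

teacher, student = torch.nn.Linear(10, 527), torch.nn.Linear(10, 527)
wave = torch.randn(4, 10)
rec = cache_teacher_logits(teacher, wave, seed=42)
print(student_step(student, wave, rec))
```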


Focus on the Sound around You: Monaural Target Speaker Extraction via Distance and Speaker Information

Jun 28, 2023
Jiuxin Lin, Peng Wang, Heinrich Dinkel, Jun Chen, Zhiyong Wu, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang


Target Speaker Extraction (TSE) has previously yielded outstanding performance in certain application scenarios for speech enhancement and source separation. However, obtaining auxiliary speaker-related information is still challenging in noisy environments with significant reverberation. Inspired by the recently proposed distance-based sound separation, we propose the Near Sound (NS) Extractor, which leverages distance information for TSE to reliably extract speaker information without requiring prior speaker enrollment, a process we call speaker embedding self-enrollment (SESE). Full- & sub-band modeling is introduced to enhance the NS-Extractor's adaptability to environments with significant reverberation. Experimental results on several cross-dataset evaluations demonstrate the effectiveness of our improvements and the excellent performance of our proposed NS-Extractor in different application scenarios.

* Accepted by InterSpeech2023 
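The full- & sub-band modeling mentioned above can be illustrated with a FullSubNet-style block: one recurrent path sees the whole spectrum per frame (full-band context), while a shared recurrent path runs independently along time for each frequency bin (sub-band detail). The module names and sizes below are assumptions for illustration, not the NS-Extractor architecture.

```python
# Hedged full- & sub-band sketch (illustrative, not the paper's implementation).
import torch
import torch.nn as nn

class FullSubBlock(nn.Module):
    def __init__(self, n_freq=257, hidden=64):
        super().__init__()
        self.full = nn.GRU(n_freq, n_freq, batch_first=True)   # full-band over time
        self.sub = nn.GRU(2, hidden, batch_first=True)         # shared per-bin model
        self.out = nn.Linear(hidden, 1)

    def forward(self, mag):                                    # mag: (batch, time, freq)
        full_ctx, _ = self.full(mag)                           # global spectral context
        b, t, f = mag.shape
        # Pair each bin's magnitude with its full-band context, fold freq into batch.
        sub_in = torch.stack([mag, full_ctx], dim=-1)          # (b, t, f, 2)
        sub_in = sub_in.permute(0, 2, 1, 3).reshape(b * f, t, 2)
        sub_out, _ = self.sub(sub_in)
        mask = self.out(sub_out).reshape(b, f, t).permute(0, 2, 1)
        return torch.sigmoid(mask) * mag                       # masked magnitude estimate

print(FullSubBlock()(torch.randn(2, 50, 257)).shape)           # -> (2, 50, 257)
```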

AV-SepFormer: Cross-Attention SepFormer for Audio-Visual Target Speaker Extraction

Jun 25, 2023
Jiuxin Lin, Xinyu Cai, Heinrich Dinkel, Jun Chen, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Zhiyong Wu, Yujun Wang, Helen Meng


Visual information can serve as an effective cue for target speaker extraction (TSE) and is vital to improving extraction performance. In this paper, we propose AV-SepFormer, a SepFormer-based dual-scale attention model that uses cross- and self-attention to fuse and model features from the audio and visual modalities. AV-SepFormer splits the audio feature into chunks whose number equals the length of the visual feature. Self- and cross-attention are then employed to model and fuse the multi-modal features. Furthermore, we use a novel 2D positional encoding that introduces positional information between and within chunks and provides significant gains over traditional positional encoding. Our model has two key advantages: the time granularity of the chunked audio feature is synchronized with the visual feature, which alleviates the harm caused by the inconsistency of audio and video sampling rates; and by combining self- and cross-attention, feature fusion and speech extraction are unified within a single attention paradigm. Experimental results show that AV-SepFormer significantly outperforms other existing methods.

* Accepted by ICASSP2023 
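One plausible reading of the 2D positional encoding is to devote half the channels to the chunk index (inter-chunk position) and half to the position within a chunk (intra-chunk position), after which the chunked audio can be fused with per-chunk visual features by cross-attention. The sketch below follows that reading with illustrative sizes; it is not the authors' implementation.

```python
# Hedged 2D positional encoding + cross-attention sketch (illustrative sizes).
import math
import torch

def sinusoid(length, dim):
    pos = torch.arange(length).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))
    pe = torch.zeros(length, dim)
    pe[:, 0::2], pe[:, 1::2] = torch.sin(pos * div), torch.cos(pos * div)
    return pe

def positional_encoding_2d(num_chunks, chunk_len, dim):
    """Half the channels encode the chunk index, half the position inside a chunk."""
    inter = sinusoid(num_chunks, dim // 2)                     # between chunks
    intra = sinusoid(chunk_len, dim // 2)                      # within a chunk
    return torch.cat([
        inter.unsqueeze(1).expand(num_chunks, chunk_len, dim // 2),
        intra.unsqueeze(0).expand(num_chunks, chunk_len, dim // 2),
    ], dim=-1)                                                 # (chunks, chunk_len, dim)

audio = torch.randn(8, 40, 256) + positional_encoding_2d(8, 40, 256)
visual = torch.randn(8, 1, 256)                                # one visual frame per chunk
cross = torch.nn.MultiheadAttention(256, num_heads=4, batch_first=True)
fused, _ = cross(audio, visual, visual)                        # audio queries attend to video
print(fused.shape)                                             # -> (8, 40, 256)
```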

Understanding temporally weakly supervised training: A case study for keyword spotting

May 30, 2023
Heinrich Dinkel, Weiji Zhuang, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang


The most prominent current approach to training keyword spotting (KWS) models with deep neural networks (DNNs) requires strong supervision, i.e., precise knowledge of the spoken keyword's location in time. Thus, most KWS approaches treat the presence of redundant data, such as noise, within their training set as an obstacle. A common training paradigm to deal with data redundancies is temporally weakly supervised learning, which only requires labels on a coarse scale. This study explores the limits of DNN training using temporally weak labeling, with applications in KWS. We train a simple end-to-end classifier on the common Google Speech Commands dataset, increasing its difficulty by randomly appending and adding noise to the training data. Our results indicate that temporally weak labeling can achieve results comparable to strongly supervised baselines while having a less stringent labeling requirement. In the presence of noise, weakly supervised models are capable of localizing and extracting target keywords without explicit supervision, leading to a performance increase over strongly supervised approaches.
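A minimal sketch of temporally weak supervision for KWS: an encoder emits per-frame keyword posteriors, which are pooled (here with linear-softmax pooling, one common choice) into a single clip-level score so only coarse clip labels are needed; the per-frame posteriors then localize the keyword for free. The model below is a stand-in, not the paper's classifier.

```python
# Hedged weakly supervised KWS sketch (illustrative architecture and sizes).
import torch
import torch.nn as nn

class WeakKWS(nn.Module):
    def __init__(self, n_mels=64, n_keywords=10, hidden=128):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True)
        self.frame_head = nn.Linear(hidden, n_keywords)

    def forward(self, feats):                          # feats: (batch, frames, n_mels)
        h, _ = self.encoder(feats)
        frame_prob = torch.sigmoid(self.frame_head(h))             # per-frame posteriors
        # Linear-softmax pooling: confident frames dominate the clip-level score.
        clip_prob = (frame_prob ** 2).sum(1) / frame_prob.sum(1).clamp_min(1e-8)
        return clip_prob, frame_prob                    # frame_prob yields localization

model = WeakKWS()
clip_prob, frame_prob = model(torch.randn(4, 200, 64))
clip_labels = torch.randint(0, 2, (4, 10)).float()      # weak (clip-level) labels only
print(nn.functional.binary_cross_entropy(clip_prob, clip_labels))
```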


Streaming Audio Transformers for Online Audio Tagging

May 29, 2023
Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang


Transformers have emerged as a prominent model framework for audio tagging (AT), boasting state-of-the-art (SOTA) performance on the widely-used Audioset dataset. However, their impressive performance often comes at the cost of high memory usage, slow inference speed, and considerable model delay, rendering them impractical for real-world AT applications. In this study, we introduce streaming audio transformers (SAT), which combine the vision transformer (ViT) architecture with Transformer-XL-like chunk processing, enabling efficient processing of long-range audio signals. Our proposed SAT is benchmarked against other transformer-based SOTA methods, achieving significant improvements in mean average precision (mAP) at delays of 2 s and 1 s, while also exhibiting significantly lower memory usage and computational overhead. Checkpoints are publicly available at https://github.com/RicherMans/SAT.
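The Transformer-XL-like chunk processing can be sketched as attention in which queries come only from the newest chunk while keys and values also include a detached memory of the previous chunk, keeping per-chunk cost fixed while retaining long-range context. Names and sizes below are illustrative, not the released SAT code.

```python
# Hedged streaming-attention sketch with a detached chunk memory (illustrative).
import torch
import torch.nn as nn

class StreamingBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, chunk, memory=None):            # chunk: (batch, frames, dim)
        context = chunk if memory is None else torch.cat([memory, chunk], dim=1)
        out, _ = self.attn(chunk, context, context)   # queries only from the new chunk
        out = self.norm(chunk + out)
        return out, out.detach()                      # cache without backprop through time

block = StreamingBlock()
memory = None
for _ in range(5):                                    # a stream of fixed-size chunks
    out, memory = block(torch.randn(2, 100, 256), memory)
print(out.shape)                                      # -> (2, 100, 256)
```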


AutoSTL: Automated Spatio-Temporal Multi-Task Learning

Apr 16, 2023
Zijian Zhang, Xiangyu Zhao, Hao Miao, Chunxu Zhang, Hongwei Zhao, Junbo Zhang


Spatio-temporal prediction plays a critical role in smart city construction. Jointly modeling multiple spatio-temporal tasks can further promote an intelligent city life by exploiting their inseparable relationships. However, existing studies fail to address this joint learning problem well, as they generally solve tasks individually or with a fixed task combination. The challenges lie in the tangled relations between different properties, the demand for supporting flexible combinations of tasks, and the complex spatio-temporal dependencies. To cope with these problems, we propose an Automated Spatio-Temporal multi-task Learning (AutoSTL) method to handle multiple spatio-temporal tasks jointly. First, we propose a scalable architecture consisting of advanced spatio-temporal operations to exploit the complicated dependencies. Shared modules and a feature fusion mechanism are incorporated to further capture the intrinsic relationships between tasks. Furthermore, our model automatically allocates the operations and fusion weights. Extensive experiments on benchmark datasets verify that our model achieves state-of-the-art performance. To the best of our knowledge, AutoSTL is the first automated spatio-temporal multi-task learning method.
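The automatic allocation of operations and fusion weights is reminiscent of a DARTS-style softmax relaxation: candidate operations are mixed with learned architecture weights, and each task learns how strongly to draw on the shared feature. The sketch below illustrates that idea with stand-in operations and two toy tasks; it is not the AutoSTL search space.

```python
# Hedged sketch of learned operation allocation and per-task fusion weights.
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Identity(),                                        # skip connection
            nn.Linear(dim, dim),                                  # stand-in "spatial" op
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()),        # stand-in "temporal" op
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))     # architecture weights

    def forward(self, x):
        w = torch.softmax(self.alpha, dim=0)                      # learned allocation
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

class TwoTaskModel(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.shared = MixedOp(dim)
        self.fusion = nn.Parameter(torch.zeros(2))                # per-task fusion weights
        self.heads = nn.ModuleList([nn.Linear(dim, 1) for _ in range(2)])

    def forward(self, x):
        s, g = self.shared(x), torch.sigmoid(self.fusion)
        return [head(g[i] * s + (1 - g[i]) * x) for i, head in enumerate(self.heads)]

print([o.shape for o in TwoTaskModel()(torch.randn(8, 64))])      # -> two (8, 1) outputs
```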


Spatio-Temporal Graph Neural Networks for Predictive Learning in Urban Computing: A Survey

Mar 25, 2023
Guangyin Jin, Yuxuan Liang, Yuchen Fang, Jincai Huang, Junbo Zhang, Yu Zheng


With the development of sophisticated sensors and large database technologies, more and more spatio-temporal data in urban systems are recorded and stored. Predictive learning over the evolution patterns of these spatio-temporal data is a basic but important task in urban computing, which can better support intelligent urban management decisions, especially in the fields of transportation, environment, security, public health, etc. Since traditional statistical learning and deep learning methods can hardly capture the complex correlations in urban spatio-temporal data, the framework of spatio-temporal graph neural networks (STGNNs) has been proposed in recent years. STGNNs enable the extraction of complex spatio-temporal dependencies by integrating graph neural networks (GNNs) with various temporal learning methods. However, for different predictive learning tasks, it is challenging to effectively design the spatial dependency learning modules, temporal dependency learning modules, and spatio-temporal fusion methods within the STGNN framework. In this paper, we provide a comprehensive survey of recent progress in STGNN technologies for predictive learning in urban computing. We first briefly introduce the construction methods for spatio-temporal graph data and the popular deep learning models employed in STGNNs. We then summarize the main application domains and specific predictive learning tasks from the existing literature. Next, we analyze the design approaches of the STGNN framework and their combination with some advanced technologies in recent years. Finally, we summarize the limitations of existing research and propose some potential future directions.
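The basic composition the survey covers, a graph operation for spatial dependencies at each time step followed by a recurrent unit for temporal dynamics, can be illustrated with a toy model; the mean-aggregation graph layer, sizes, and names below are illustrative only.

```python
# Toy STGNN: neighbor averaging + linear layer for space, GRU for time (illustrative).
import torch
import torch.nn as nn

class TinySTGNN(nn.Module):
    def __init__(self, in_dim=2, hidden=32):
        super().__init__()
        self.spatial = nn.Linear(in_dim, hidden)
        self.temporal = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x, adj):                         # x: (batch, time, nodes, in_dim)
        deg = adj.sum(-1, keepdim=True).clamp_min(1)
        x = torch.einsum("nm,btmf->btnf", adj / deg, x)            # mean over neighbors
        h = torch.relu(self.spatial(x))                            # spatial message passing
        b, t, n, d = h.shape
        h, _ = self.temporal(h.permute(0, 2, 1, 3).reshape(b * n, t, d))
        return self.out(h[:, -1]).reshape(b, n, 1)                 # one-step-ahead forecast

adj = (torch.rand(20, 20) > 0.7).float()
print(TinySTGNN()(torch.randn(4, 12, 20, 2), adj).shape)           # -> (4, 20, 1)
```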
