Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Junbo Zhang

Improving Open-world Continual Learning under the Constraints of Scarce Labeled Data

Feb 28, 2025

Yujie Li, Xiangkun Wang, Xin Yang, Marcello Bonsangue, Junbo Zhang, Tianrui Li

Figure 1 for Improving Open-world Continual Learning under the Constraints of Scarce Labeled Data

Figure 2 for Improving Open-world Continual Learning under the Constraints of Scarce Labeled Data

Figure 3 for Improving Open-world Continual Learning under the Constraints of Scarce Labeled Data

Figure 4 for Improving Open-world Continual Learning under the Constraints of Scarce Labeled Data

Abstract:Open-world continual learning (OWCL) adapts to sequential tasks with open samples, learning knowledge incrementally while preventing forgetting. However, existing OWCL still requires a large amount of labeled data for training, which is often impractical in real-world applications. Given that new categories/entities typically come with limited annotations and are in small quantities, a more realistic situation is OWCL with scarce labeled data, i.e., few-shot training samples. Hence, this paper investigates the problem of open-world few-shot continual learning (OFCL), challenging in (i) learning unbounded tasks without forgetting previous knowledge and avoiding overfitting, (ii) constructing compact decision boundaries for open detection with limited labeled data, and (iii) transferring knowledge about knowns and unknowns and even update the unknowns to knowns once the labels of open samples are learned. In response, we propose a novel OFCL framework that integrates three key components: (1) an instance-wise token augmentation (ITA) that represents and enriches sample representations with additional knowledge, (2) a margin-based open boundary (MOB) that supports open detection with new tasks emerge over time, and (3) an adaptive knowledge space (AKS) that endows unknowns with knowledge for the updating from unknowns to knowns. Finally, extensive experiments show the proposed OFCL framework outperforms all baselines remarkably with practical importance and reproducibility. The source code is released at https://github.com/liyj1201/OFCL.

Via

Access Paper or Ask Questions

Order-Robust Class Incremental Learning: Graph-Driven Dynamic Similarity Grouping

Feb 27, 2025

Guannan Lai, Yujie Li, Xiangkun Wang, Junbo Zhang, Tianrui Li, Xin Yang

Abstract:Class Incremental Learning (CIL) requires a model to continuously learn new classes without forgetting previously learned ones. While recent studies have significantly alleviated the problem of catastrophic forgetting (CF), more and more research reveals that the order in which classes appear have significant influences on CIL models. Specifically, prioritizing the learning of classes with lower similarity will enhance the model's generalization performance and its ability to mitigate forgetting. Hence, it is imperative to develop an order-robust class incremental learning model that maintains stable performance even when faced with varying levels of class similarity in different orders. In response, we first provide additional theoretical analysis, which reveals that when the similarity among a group of classes is lower, the model demonstrates increased robustness to the class order. Then, we introduce a novel \textbf{G}raph-\textbf{D}riven \textbf{D}ynamic \textbf{S}imilarity \textbf{G}rouping (\textbf{GDDSG}) method, which leverages a graph coloring algorithm for class-based similarity grouping. The proposed approach trains independent CIL models for each group of classes, ultimately combining these models to facilitate joint prediction. Experimental results demonstrate that our method effectively addresses the issue of class order sensitivity while achieving optimal performance in both model accuracy and anti-forgetting capability. Our code is available at https://github.com/AIGNLAI/GDDSG.

* Accepted by the proceeding of CVPR2025

Via

Access Paper or Ask Questions

The ICME 2025 Audio Encoder Capability Challenge

Jan 25, 2025

Junbo Zhang, Heinrich Dinkel, Qiong Song, Helen Wang, Yadong Niu, Si Cheng, Xiaofeng Xin, Ke Li, Wenwu Wang, Yujun Wang(+1 more)

Abstract:This challenge aims to evaluate the capabilities of audio encoders, especially in the context of multi-task learning and real-world applications. Participants are invited to submit pre-trained audio encoders that map raw waveforms to continuous embeddings. These encoders will be tested across diverse tasks including speech, environmental sounds, and music, with a focus on real-world usability. The challenge features two tracks: Track A for parameterized evaluation, and Track B for parameter-free evaluation. This challenge provides a platform for evaluating and advancing the state-of-the-art in audio encoder design.

Via

Access Paper or Ask Questions

UMOD: A Novel and Effective Urban Metro Origin-Destination Flow Prediction Method

Sep 08, 2024

Peng Xie, Minbo Ma, Bin Wang, Junbo Zhang, Tianrui Li

Figure 1 for UMOD: A Novel and Effective Urban Metro Origin-Destination Flow Prediction Method

Figure 2 for UMOD: A Novel and Effective Urban Metro Origin-Destination Flow Prediction Method

Figure 3 for UMOD: A Novel and Effective Urban Metro Origin-Destination Flow Prediction Method

Figure 4 for UMOD: A Novel and Effective Urban Metro Origin-Destination Flow Prediction Method

Abstract:Accurate prediction of metro Origin-Destination (OD) flow is essential for the development of intelligent transportation systems and effective urban traffic management. Existing approaches typically either predict passenger outflow of departure stations or inflow of destination stations. However, we argue that travelers generally have clearly defined departure and arrival stations, making these OD pairs inherently interconnected. Consequently, considering OD pairs as a unified entity more accurately reflects actual metro travel patterns and allows for analyzing potential spatio-temporal correlations between different OD pairs. To address these challenges, we propose a novel and effective urban metro OD flow prediction method (UMOD), comprising three core modules: a data embedding module, a temporal relation module, and a spatial relation module. The data embedding module projects raw OD pair inputs into hidden space representations, which are subsequently processed by the temporal and spatial relation modules to capture both inter-pair and intra-pair spatio-temporal dependencies. Experimental results on two real-world urban metro OD flow datasets demonstrate that adopting the OD pairs perspective is critical for accurate metro OD flow prediction. Our method outperforms existing approaches, delivering superior predictive performance.

Via

Access Paper or Ask Questions

Personalized Federated Continual Learning via Multi-granularity Prompt

Jun 27, 2024

Hao Yu, Xin Yang, Xin Gao, Yan Kang, Hao Wang, Junbo Zhang, Tianrui Li

Figure 1 for Personalized Federated Continual Learning via Multi-granularity Prompt

Figure 2 for Personalized Federated Continual Learning via Multi-granularity Prompt

Figure 3 for Personalized Federated Continual Learning via Multi-granularity Prompt

Figure 4 for Personalized Federated Continual Learning via Multi-granularity Prompt

Abstract:Personalized Federated Continual Learning (PFCL) is a new practical scenario that poses greater challenges in sharing and personalizing knowledge. PFCL not only relies on knowledge fusion for server aggregation at the global spatial-temporal perspective but also needs model improvement for each client according to the local requirements. Existing methods, whether in Personalized Federated Learning (PFL) or Federated Continual Learning (FCL), have overlooked the multi-granularity representation of knowledge, which can be utilized to overcome Spatial-Temporal Catastrophic Forgetting (STCF) and adopt generalized knowledge to itself by coarse-to-fine human cognitive mechanisms. Moreover, it allows more effectively to personalized shared knowledge, thus serving its own purpose. To this end, we propose a novel concept called multi-granularity prompt, i.e., coarse-grained global prompt acquired through the common model learning process, and fine-grained local prompt used to personalize the generalized representation. The former focuses on efficiently transferring shared global knowledge without spatial forgetting, and the latter emphasizes specific learning of personalized local knowledge to overcome temporal forgetting. In addition, we design a selective prompt fusion mechanism for aggregating knowledge of global prompts distilled from different clients. By the exclusive fusion of coarse-grained knowledge, we achieve the transmission and refinement of common knowledge among clients, further enhancing the performance of personalization. Extensive experiments demonstrate the effectiveness of the proposed method in addressing STCF as well as improving personalized performance. Our code now is available at https://github.com/SkyOfBeginning/FedMGP.

* Accepted by KDD 2024 Research Track

Via

Access Paper or Ask Questions

Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding

Jun 19, 2024

Jizhong Liu, Gang Li, Junbo Zhang, Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Yujun Wang, Bin Wang

Abstract:Automated audio captioning (AAC) is an audio-to-text task to describe audio contents in natural language. Recently, the advancements in large language models (LLMs), with improvements in training approaches for audio encoders, have opened up possibilities for improving AAC. Thus, we explore enhancing AAC from three aspects: 1) a pre-trained audio encoder via consistent ensemble distillation (CED) is used to improve the effectivity of acoustic tokens, with a querying transformer (Q-Former) bridging the modality gap to LLM and compress acoustic tokens; 2) we investigate the advantages of using a Llama 2 with 7B parameters as the decoder; 3) another pre-trained LLM corrects text errors caused by insufficient training data and annotation ambiguities. Both the audio encoder and text decoder are optimized by -Base (LoRA). Experiments show that each of these enhancements is effective. Our method obtains a 33.0 SPIDEr-FL score, outperforming the winner of DCASE 2023 Task 6A.

* Accepted by Interspeech 2024

Via

Access Paper or Ask Questions

Scaling up masked audio encoder learning for general audio classification

Jun 11, 2024

Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang, Bin Wang

Figure 1 for Scaling up masked audio encoder learning for general audio classification

Figure 2 for Scaling up masked audio encoder learning for general audio classification

Figure 3 for Scaling up masked audio encoder learning for general audio classification

Figure 4 for Scaling up masked audio encoder learning for general audio classification

Abstract:Despite progress in audio classification, a generalization gap remains between speech and other sound domains, such as environmental sounds and music. Models trained for speech tasks often fail to perform well on environmental or musical audio tasks, and vice versa. While self-supervised (SSL) audio representations offer an alternative, there has been limited exploration of scaling both model and dataset sizes for SSL-based general audio classification. We introduce Dasheng, a simple SSL audio encoder, based on the efficient masked autoencoder framework. Trained with 1.2 billion parameters on 272,356 hours of diverse audio, Dasheng obtains significant performance gains on the HEAR benchmark. It outperforms previous works on CREMA-D, LibriCount, Speech Commands, VoxLingua, and competes well in music and environment classification. Dasheng features inherently contain rich speech, music, and environmental information, as shown in nearest-neighbor classification experiments. Code is available https://github.com/richermans/dasheng/.

* Interspeech 2024

Via

Access Paper or Ask Questions

Bridging Language Gaps in Audio-Text Retrieval

Jun 11, 2024

Zhiyong Yan, Heinrich Dinkel, Yongqing Wang, Jizhong Liu, Junbo Zhang, Yujun Wang, Bin Wang

Figure 1 for Bridging Language Gaps in Audio-Text Retrieval

Figure 2 for Bridging Language Gaps in Audio-Text Retrieval

Figure 3 for Bridging Language Gaps in Audio-Text Retrieval

Figure 4 for Bridging Language Gaps in Audio-Text Retrieval

Abstract:Audio-text retrieval is a challenging task, requiring the search for an audio clip or a text caption within a database. The predominant focus of existing research on English descriptions poses a limitation on the applicability of such models, given the abundance of non-English content in real-world data. To address these linguistic disparities, we propose a language enhancement (LE), using a multilingual text encoder (SONAR) to encode the text data with language-specific information. Additionally, we optimize the audio encoder through the application of consistent ensemble distillation (CED), enhancing support for variable-length audio-text retrieval. Our methodology excels in English audio-text retrieval, demonstrating state-of-the-art (SOTA) performance on commonly used datasets such as AudioCaps and Clotho. Simultaneously, the approach exhibits proficiency in retrieving content in seven other languages with only 10% of additional language-enhanced training data, yielding promising results. The source code is publicly available https://github.com/zyyan4/ml-clap.

* interspeech2024

Via

Access Paper or Ask Questions

Multi-task Manipulation Policy Modeling with Visuomotor Latent Diffusion

Mar 12, 2024

Wenhui Tan, Bei Liu, Junbo Zhang, Ruihua Song, Jianlong Fu

Figure 1 for Multi-task Manipulation Policy Modeling with Visuomotor Latent Diffusion

Figure 2 for Multi-task Manipulation Policy Modeling with Visuomotor Latent Diffusion

Figure 3 for Multi-task Manipulation Policy Modeling with Visuomotor Latent Diffusion

Figure 4 for Multi-task Manipulation Policy Modeling with Visuomotor Latent Diffusion

Abstract:Modeling a generalized visuomotor policy has been a longstanding challenge for both computer vision and robotics communities. Existing approaches often fail to efficiently leverage cross-dataset resources or rely on heavy Vision-Language models, which require substantial computational resources, thereby limiting their multi-task performance and application potential. In this paper, we introduce a novel paradigm that effectively utilizes latent modeling of manipulation skills and an efficient visuomotor latent diffusion policy, which enhances the utilizing of existing cross-embodiment and cross-environment datasets, thereby improving multi-task capabilities. Our methodology consists of two decoupled phases: action modeling and policy modeling. Firstly, we introduce a task-agnostic, embodiment-aware trajectory latent autoencoder for unified action skills modeling. This step condenses action data and observation into a condensed latent space, effectively benefiting from large-scale cross-datasets. Secondly, we propose to use a visuomotor latent diffusion policy that recovers target skill latent from noises for effective task execution. We conducted extensive experiments on two widely used benchmarks, and the results demonstrate the effectiveness of our proposed paradigms on multi-tasking and pre-training. Code is available at https://github.com/AlbertTan404/RoLD.

* Submitted to ECCV 2024

Via

Access Paper or Ask Questions

Deep Learning for Cross-Domain Data Fusion in Urban Computing: Taxonomy, Advances, and Outlook

Feb 29, 2024

Xingchen Zou, Yibo Yan, Xixuan Hao, Yuehong Hu, Haomin Wen, Erdong Liu, Junbo Zhang, Yong Li, Tianrui Li, Yu Zheng(+1 more)

Figure 1 for Deep Learning for Cross-Domain Data Fusion in Urban Computing: Taxonomy, Advances, and Outlook

Figure 2 for Deep Learning for Cross-Domain Data Fusion in Urban Computing: Taxonomy, Advances, and Outlook

Figure 3 for Deep Learning for Cross-Domain Data Fusion in Urban Computing: Taxonomy, Advances, and Outlook

Figure 4 for Deep Learning for Cross-Domain Data Fusion in Urban Computing: Taxonomy, Advances, and Outlook

Abstract:As cities continue to burgeon, Urban Computing emerges as a pivotal discipline for sustainable development by harnessing the power of cross-domain data fusion from diverse sources (e.g., geographical, traffic, social media, and environmental data) and modalities (e.g., spatio-temporal, visual, and textual modalities). Recently, we are witnessing a rising trend that utilizes various deep-learning methods to facilitate cross-domain data fusion in smart cities. To this end, we propose the first survey that systematically reviews the latest advancements in deep learning-based data fusion methods tailored for urban computing. Specifically, we first delve into data perspective to comprehend the role of each modality and data source. Secondly, we classify the methodology into four primary categories: feature-based, alignment-based, contrast-based, and generation-based fusion methods. Thirdly, we further categorize multi-modal urban applications into seven types: urban planning, transportation, economy, public safety, society, environment, and energy. Compared with previous surveys, we focus more on the synergy of deep learning methods with urban computing applications. Furthermore, we shed light on the interplay between Large Language Models (LLMs) and urban computing, postulating future research directions that could revolutionize the field. We firmly believe that the taxonomy, progress, and prospects delineated in our survey stand poised to significantly enrich the research community. The summary of the comprehensive and up-to-date paper list can be found at https://github.com/yoshall/Awesome-Multimodal-Urban-Computing.

Via

Access Paper or Ask Questions