Wei Zhou

Celine

Vision Technologies with Applications in Traffic Surveillance Systems: A Holistic Survey

Nov 30, 2024

Wei Zhou, Lei Zhao, Runyu Zhang, Yifan Cui, Hongpu Huang, Kun Qie, Chen Wang

Abstract:Traffic Surveillance Systems (TSS) have become increasingly crucial in modern intelligent transportation systems, with vision-based technologies playing a central role for scene perception and understanding. While existing surveys typically focus on isolated aspects of TSS, a comprehensive analysis bridging low-level and high-level perception tasks, particularly considering emerging technologies, remains lacking. This paper presents a systematic review of vision-based technologies in TSS, examining both low-level perception tasks (object detection, classification, and tracking) and high-level perception applications (parameter estimation, anomaly detection, and behavior understanding). Specifically, we first provide a detailed methodological categorization and comprehensive performance evaluation for each task. Our investigation reveals five fundamental limitations in current TSS: perceptual data degradation in complex scenarios, data-driven learning constraints, semantic understanding gaps, sensing coverage limitations and computational resource demands. To address these challenges, we systematically analyze five categories of potential solutions: advanced perception enhancement, efficient learning paradigms, knowledge-enhanced understanding, cooperative sensing frameworks and efficient computing frameworks. Furthermore, we evaluate the transformative potential of foundation models in TSS, demonstrating their unique capabilities in zero-shot learning, semantic understanding, and scene generation. This review provides a unified framework bridging low-level and high-level perception tasks, systematically analyzes current limitations and solutions, and presents a structured roadmap for integrating emerging technologies, particularly foundation models, to enhance TSS capabilities.

CJST: CTC Compressor based Joint Speech and Text Training for Decoder-Only ASR

Nov 12, 2024

Wei Zhou, Junteng Jia, Leda Sari, Jay Mahadeokar, Ozlem Kalinli

Abstract:A CTC compressor can be an effective way to integrate audio encoders into decoder-only models, which have gained growing interest for different speech applications. In this work, we propose a novel CTC compressor based joint speech and text training (CJST) framework for decoder-only ASR. CJST matches speech and text modalities from both directions by exploring a simple modality adaptor and several features of the CTC compressor, including sequence compression, on-the-fly forced peaky alignment, and CTC class embeddings. Experimental results on the Librispeech and TED-LIUM2 corpora show that the proposed CJST achieves effective text injection without the need for duration handling, leading to the best performance in both in-domain and cross-domain scenarios. We also provide a comprehensive study of the CTC compressor, covering various compression modes, edge case handling, and behavior under both clean and noisy data conditions, which reveals the most robust setting for using the CTC compressor with decoder-only models.

* Submitted to ICASSP 2025
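
The sequence compression mentioned in the abstract can be illustrated with a minimal sketch. It assumes one common compression mode (drop blank frames, average consecutive frames that share a CTC label); the abstract does not say which modes the paper uses, and the names `ctc_compress` and `BLANK` are hypothetical.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank label


def ctc_compress(frames, labels, blank=BLANK):
    """Collapse frame-level encoder outputs using frame-level CTC labels.

    frames: list of equal-length float lists (one embedding per frame)
    labels: list of per-frame CTC argmax labels, same length as frames
    Returns a list of (label, averaged_embedding), one per non-blank run.
    """
    compressed = []
    start = 0
    for label, run in groupby(labels):
        n = len(list(run))
        segment = frames[start:start + n]
        start += n
        if label == blank:
            continue  # blank runs carry no token and are dropped
        dim = len(segment[0])
        mean = [sum(f[d] for f in segment) / n for d in range(dim)]
        compressed.append((label, mean))
    return compressed
```

Note that a blank between two runs of the same label keeps them as separate tokens, which mirrors standard CTC collapsing.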

A Unified Solution to Diverse Heterogeneities in One-shot Federated Learning

Oct 28, 2024

Jun Bai, Yiliao Song, Di Wu, Atul Sajjanhar, Yong Xiang, Wei Zhou, Xiaohui Tao, Yan Li

Abstract:One-shot federated learning (FL) limits the communication between the server and clients to a single round, which largely decreases the privacy leakage risks in traditional FLs requiring multiple communications. However, we find existing one-shot FL frameworks are vulnerable to distributional heterogeneity due to their insufficient focus on data heterogeneity while concentrating predominantly on model heterogeneity. Filling this gap, we propose a unified, data-free, one-shot federated learning framework (FedHydra) that can effectively address both model and data heterogeneity. Rather than applying existing value-only learning mechanisms, a structure-value learning mechanism is proposed in FedHydra. Specifically, a new stratified learning structure is proposed to cover data heterogeneity, and the value of each item during computation reflects model heterogeneity. By this design, the data and model heterogeneity issues are simultaneously monitored from different aspects during learning. Consequently, FedHydra can effectively mitigate both issues by minimizing their inherent conflicts. We compared FedHydra with three SOTA baselines on four benchmark datasets. Experimental results show that our method outperforms the previous one-shot FL methods in both homogeneous and heterogeneous settings.

Efficient Streaming LLM for Speech Recognition

Oct 02, 2024

Junteng Jia, Gil Keren, Wei Zhou, Egor Lakomkin, Xiaohui Zhang, Chunyang Wu, Frank Seide, Jay Mahadeokar, Ozlem Kalinli

Abstract:Recent works have shown that prompting large language models with audio encodings can unlock speech recognition capabilities. However, existing techniques do not scale efficiently, especially when handling long-form streaming audio inputs: not only do they extrapolate poorly beyond the audio length seen during training, but they are also computationally inefficient due to the quadratic cost of attention. In this work, we introduce SpeechLLM-XL, a linear scaling decoder-only model for streaming speech recognition. We process audio in configurable chunks using a limited attention window for reduced computation, and the text tokens for each audio chunk are generated auto-regressively until an EOS is predicted. During training, the transcript is segmented into chunks using a CTC forced alignment estimated from the encoder output. SpeechLLM-XL with a 1.28-second chunk size achieves 2.7%/6.7% WER on LibriSpeech test clean/other, and it shows no quality degradation on long-form utterances 10x longer than the training utterances.
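
The training-time transcript segmentation described in the abstract can be sketched roughly as follows. This is an illustration under assumptions, not the authors' implementation: it assumes the forced alignment has already produced one timestamp per token, and the function name `segment_transcript` and the `<eos>` string are hypothetical.

```python
import math


def segment_transcript(tokens, token_times, chunk_sec, total_sec, eos="<eos>"):
    """Group transcript tokens by the audio chunk their timestamp falls in.

    tokens:      list of text tokens
    token_times: per-token timestamps in seconds (assumed to come from an alignment)
    chunk_sec:   chunk length in seconds
    total_sec:   total audio duration in seconds
    Every chunk ends with an EOS marker, including chunks with no tokens.
    """
    n_chunks = max(1, math.ceil(total_sec / chunk_sec))
    chunks = [[] for _ in range(n_chunks)]
    for tok, t in zip(tokens, token_times):
        idx = min(int(t // chunk_sec), n_chunks - 1)  # clamp to the last chunk
        chunks[idx].append(tok)
    return [chunk + [eos] for chunk in chunks]
```

A chunk that contains no speech yields only the EOS marker, which matches the idea of generating tokens per chunk until EOS is predicted.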

Frozen Large Language Models Can Perceive Paralinguistic Aspects of Speech

Oct 02, 2024

Wonjune Kang, Junteng Jia, Chunyang Wu, Wei Zhou, Egor Lakomkin, Yashesh Gaur, Leda Sari, Suyoun Kim, Ke Li, Jay Mahadeokar (+1 more)

Abstract:As speech becomes an increasingly common modality for interacting with large language models (LLMs), it is becoming desirable to develop systems where LLMs can take into account users' emotions or speaking styles when providing their responses. In this work, we study the potential of an LLM to understand these aspects of speech without fine-tuning its weights. To do this, we utilize an end-to-end system with a speech encoder; the encoder is trained to produce token embeddings such that the LLM's response to an expressive speech prompt is aligned with its response to a semantically matching text prompt where the speaker's emotion has also been specified. We find that this training framework allows the encoder to generate tokens that capture both semantic and paralinguistic information in speech and effectively convey it to the LLM, even when the LLM remains completely frozen. We also explore training on additional emotion and style-related response alignment tasks, finding that they further increase the amount of paralinguistic information explicitly captured in the speech tokens. Experiments demonstrate that our system is able to produce higher quality and more empathetic responses to expressive speech prompts compared to several baselines.

Perceptual Depth Quality Assessment of Stereoscopic Omnidirectional Images

Aug 19, 2024

Wei Zhou, Zhou Wang

Abstract:Depth perception plays an essential role in the viewer experience for immersive virtual reality (VR) visual environments. However, previous research investigations in the depth quality of 3D/stereoscopic images are rather limited, and in particular, are largely lacking for 3D viewing of 360-degree omnidirectional content. In this work, we make one of the first attempts to develop an objective quality assessment model named depth quality index (DQI) for efficient no-reference (NR) depth quality assessment of stereoscopic omnidirectional images. Motivated by the perceptual characteristics of the human visual system (HVS), the proposed DQI is built upon multi-color-channel, adaptive viewport selection, and interocular discrepancy features. Experimental results demonstrate that the proposed method outperforms state-of-the-art image quality assessment (IQA) and depth quality assessment (DQA) approaches in predicting the perceptual depth quality when tested using both single-viewport and omnidirectional stereoscopic image databases. Furthermore, we demonstrate that combining the proposed depth quality model with existing IQA methods significantly boosts the performance in predicting the overall quality of 3D omnidirectional images.

* Accepted by IEEE TCSVT

Morphing median fin enhances untethered bionic robotic tuna's linear acceleration and turning maneuverability

Jul 26, 2024

Hongbin Huang, Zhonglu Lin, Wei Zheng, Jinhu Zhang, Zhibin Liu, Wei Zhou, Yu Zhang

Abstract:Median fins of fish-like swimmers play a crucial role in linear acceleration and maneuvering processes. However, few studies have focused on untethered robotic fish experiments. Imitating the behaviour of real tuna, we developed a free-swimming bionic tuna with a foldable dorsal fin. Erecting the dorsal fin, under proper conditions, can reduce head heave by 50%, enhance linear acceleration by 15.7%, increase turning angular velocity by 32.78%, and decrease turning radius by 33.13%. Conversely, erecting the dorsal fin increases the wetted surface area, resulting in decreased maximum speed and efficiency during the steady swimming phase. This finding partially explains why tuna erect their median fins during maneuvers or acceleration and fold them afterward to reduce drag. In addition, we verified that folding the median fins after acceleration does not significantly affect locomotion efficiency. This study supports the application of morphing median fins in undulating underwater robots and helps to further understand the impact of median fins on fish locomotion.

* 7 pages, 5 figures

Data-Guided Physics-Informed Neural Networks for Solving Inverse Problems in Partial Differential Equations

Jul 15, 2024

Wei Zhou, Y. F. Xu

Abstract:Physics-informed neural networks (PINNs) represent a significant advancement in scientific machine learning by integrating fundamental physical laws into their architecture through loss functions. PINNs have been successfully applied to solve various forward and inverse problems in partial differential equations (PDEs). However, a notable challenge can emerge during the early training stages when solving inverse problems. Specifically, data losses remain high while PDE residual losses are minimized rapidly, thereby exacerbating the imbalance between loss terms and impeding the overall efficiency of PINNs. To address this challenge, this study proposes a novel framework termed data-guided physics-informed neural networks (DG-PINNs). The DG-PINNs framework is structured into two distinct phases: a pre-training phase and a fine-tuning phase. In the pre-training phase, a loss function with only the data loss is minimized in a neural network. In the fine-tuning phase, a composite loss function, which consists of the data loss, PDE residual loss, and, if available, initial and boundary condition losses, is minimized in the same neural network. Notably, the pre-training phase ensures that the data loss is already at a low value before the fine-tuning phase commences. This approach enables the fine-tuning phase to converge to a minimal composite loss function with fewer iterations compared to existing PINNs. To validate the effectiveness, noise-robustness, and efficiency of DG-PINNs, extensive numerical investigations are conducted on inverse problems related to several classical PDEs, including the heat equation, wave equation, Euler--Bernoulli beam equation, and Navier--Stokes equation. The numerical results demonstrate that DG-PINNs can accurately solve these inverse problems and exhibit robustness against noise in training data.
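
The two-phase schedule described in the abstract (data loss only, then a composite loss) can be sketched on a toy problem. Everything specific here is an assumption for illustration: the "network" is replaced by a one-parameter surrogate u(x; w) = exp(w x), the inverse problem is recovering k in u'(x) = k u(x), and plain gradient descent with hand-derived gradients stands in for a real optimizer. It is not the authors' DG-PINNs code.

```python
import math

# Toy setup (assumed): samples of u(x) = exp(TRUE_K * x); the unknown is k.
XS = [0.0, 0.25, 0.5, 0.75, 1.0]
TRUE_K = 0.5
US = [math.exp(TRUE_K * x) for x in XS]


def data_grad_w(w):
    """d/dw of the data loss mean((exp(w*x) - u)^2)."""
    return sum(2 * (math.exp(w * x) - u) * x * math.exp(w * x)
               for x, u in zip(XS, US)) / len(XS)


def residual_grads(w, k):
    """Gradients w.r.t. (w, k) of the residual loss mean(((w - k) * exp(w*x))^2).

    The residual of u' = k*u for u = exp(w*x) is (w - k) * exp(w*x).
    """
    g_w = sum(2 * (w - k) * math.exp(2 * w * x) * (1 + (w - k) * x)
              for x in XS) / len(XS)
    g_k = sum(-2 * (w - k) * math.exp(2 * w * x) for x in XS) / len(XS)
    return g_w, g_k


def train(lr=0.1, pre_steps=500, fine_steps=2000):
    w, k = 0.0, 0.0
    # Phase 1 (pre-training): minimize the data loss only; k is untouched.
    for _ in range(pre_steps):
        w -= lr * data_grad_w(w)
    w_pre = w
    # Phase 2 (fine-tuning): minimize data loss + residual loss over w and k.
    for _ in range(fine_steps):
        g_w, g_k = residual_grads(w, k)
        w -= lr * (data_grad_w(w) + g_w)
        k -= lr * g_k
    return w_pre, w, k
```

Calling `train()` returns the surrogate parameter after pre-training, the final surrogate parameter, and the estimate of k; in this toy setting the data loss is already low when fine-tuning begins, which is the property the abstract attributes to the pre-training phase.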

Transferring Structure Knowledge: A New Task to Fake news Detection Towards Cold-Start Propagation

Jul 13, 2024

Lingwei Wei, Dou Hu, Wei Zhou, Songlin Hu

Abstract:Many fake news detection studies have achieved promising performance by extracting effective semantic and structure features from both content and propagation trees. However, it is challenging to apply them to practical situations, especially when using the trained propagation-based models to detect news with no propagation data. Towards this scenario, we study a new task named cold-start fake news detection, which aims to detect content-only samples with missing propagation. To achieve the task, we design a simple but effective Structure Adversarial Net (SAN) framework to learn transferable features from available propagation to boost the detection of content-only samples. SAN introduces a structure discriminator to estimate dissimilarities among learned features with and without propagation, and further learns structure-invariant features to enhance the generalization of existing propagation-based methods for content-only samples. We conduct qualitative and quantitative experiments on three datasets. Results show the challenge of the new task and the effectiveness of our SAN framework.

* ICASSP 2024

MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts

Jul 13, 2024

Zhenpeng Su, Zijia Lin, Xue Bai, Xing Wu, Yizhe Xiong, Haoran Lian, Guangyuan Ma, Hui Chen, Guiguang Ding, Wei Zhou (+1 more)

Abstract:Scaling model capacity enhances its capabilities but significantly increases computation. Mixture-of-Experts models (MoEs) address this by allowing model capacity to scale without substantially increasing training or inference costs. Despite their promising results, MoE models encounter several challenges. Primarily, the dispersion of training tokens across multiple experts can lead to underfitting, particularly for infrequent tokens. Additionally, while fixed routing mechanisms can mitigate this issue, they compromise on the diversity of representations. In this paper, we propose MaskMoE, a method designed to enhance token-level learning by employing a routing masking technique within the Mixture-of-Experts model. MaskMoE is capable of maintaining representation diversity while achieving more comprehensive training. Experimental results demonstrate that our method outperforms previous dominant Mixture-of-Experts models in both perplexity (PPL) and downstream tasks.

* Work in progress
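
A routing mask in a Mixture-of-Experts router can be sketched as below. The abstract does not specify how masks are built, so this is only an assumed form: each token carries a boolean mask over experts, masked experts are excluded before the softmax, and top-1 routing picks among the visible ones. The name `masked_route` is hypothetical.

```python
import math


def masked_route(logits, mask):
    """Top-1 routing restricted to the experts a token is allowed to see.

    logits: router scores, one per expert
    mask:   booleans, True where the expert is visible to this token
    Returns (chosen_expert_index, routing_probabilities).
    """
    if not any(mask):
        raise ValueError("at least one expert must be visible")
    masked = [l if m else -math.inf for l, m in zip(logits, mask)]
    top = max(masked)
    exps = [math.exp(l - top) for l in masked]  # masked experts become 0.0
    total = sum(exps)
    probs = [e / total for e in exps]
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs
```

Under this assumed scheme, a token whose mask exposes a single expert is always routed to that expert, while a token with several visible experts keeps a learned choice among them.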
