Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Gang Pan

Human-Inspired Audio-Visual Speech Recognition: Spike Activity, Cueing Interaction and Causal Processing

Aug 29, 2024

Qianhui Liu, Jiadong Wang, Yang Wang, Xin Yang, Gang Pan, Haizhou Li

Figure 1 for Human-Inspired Audio-Visual Speech Recognition: Spike Activity, Cueing Interaction and Causal Processing

Figure 2 for Human-Inspired Audio-Visual Speech Recognition: Spike Activity, Cueing Interaction and Causal Processing

Figure 3 for Human-Inspired Audio-Visual Speech Recognition: Spike Activity, Cueing Interaction and Causal Processing

Figure 4 for Human-Inspired Audio-Visual Speech Recognition: Spike Activity, Cueing Interaction and Causal Processing

Abstract:Humans naturally perform audiovisual speech recognition (AVSR), enhancing the accuracy and robustness by integrating auditory and visual information. Spiking neural networks (SNNs), which mimic the brain's information-processing mechanisms, are well-suited for emulating the human capability of AVSR. Despite their potential, research on SNNs for AVSR is scarce, with most existing audio-visual multimodal methods focused on object or digit recognition. These models simply integrate features from both modalities, neglecting their unique characteristics and interactions. Additionally, they often rely on future information for current processing, which increases recognition latency and limits real-time applicability. Inspired by human speech perception, this paper proposes a novel human-inspired SNN named HI-AVSNN for AVSR, incorporating three key characteristics: cueing interaction, causal processing and spike activity. For cueing interaction, we propose a visual-cued auditory attention module (VCA2M) that leverages visual cues to guide attention to auditory features. We achieve causal processing by aligning the SNN's temporal dimension with that of visual and auditory features and applying temporal masking to utilize only past and current information. To implement spike activity, in addition to using SNNs, we leverage the event camera to capture lip movement as spikes, mimicking the human retina and providing efficient visual data. We evaluate HI-AVSNN on an audiovisual speech recognition dataset combining the DVS-Lip dataset with its corresponding audio samples. Experimental results demonstrate the superiority of our proposed fusion method, outperforming existing audio-visual SNN fusion methods and achieving a 2.27% improvement in accuracy over the only existing SNN-based AVSR method.

Via

Access Paper or Ask Questions

An Asynchronous Multi-core Accelerator for SNN inference

Jul 30, 2024

Zhuo Chen, De Ma, Xiaofei Jin, Qinghui Xing, Ouwen Jin, Xin Du, Shuibing He, Gang Pan

Figure 1 for An Asynchronous Multi-core Accelerator for SNN inference

Figure 2 for An Asynchronous Multi-core Accelerator for SNN inference

Figure 3 for An Asynchronous Multi-core Accelerator for SNN inference

Figure 4 for An Asynchronous Multi-core Accelerator for SNN inference

Abstract:Spiking Neural Networks (SNNs) are extensively utilized in brain-inspired computing and neuroscience research. To enhance the speed and energy efficiency of SNNs, several many-core accelerators have been developed. However, maintaining the accuracy of SNNs often necessitates frequent explicit synchronization among all cores, which presents a challenge to overall efficiency. In this paper, we propose an asynchronous architecture for Spiking Neural Networks (SNNs) that eliminates the need for inter-core synchronization, thus enhancing speed and energy efficiency. This approach leverages the pre-determined dependencies of neuromorphic cores established during compilation. Each core is equipped with a scheduler that monitors the status of its dependencies, allowing it to safely advance to the next timestep without waiting for other cores. This eliminates the necessity for global synchronization and minimizes core waiting time despite inherent workload imbalances. Comprehensive evaluations using five different SNN workloads show that our architecture achieves a 1.86x speedup and a 1.55x increase in energy efficiency compared to state-of-the-art synchronization architectures.

Via

Access Paper or Ask Questions

Sewer Image Super-Resolution with Depth Priors and Its Lightweight Network

Jul 27, 2024

Gang Pan, Chen Wang, Zhijie Sui, Shuai Guo, Yaozhi Lv, Honglie Li, Di Sun

Figure 1 for Sewer Image Super-Resolution with Depth Priors and Its Lightweight Network

Figure 2 for Sewer Image Super-Resolution with Depth Priors and Its Lightweight Network

Figure 3 for Sewer Image Super-Resolution with Depth Priors and Its Lightweight Network

Figure 4 for Sewer Image Super-Resolution with Depth Priors and Its Lightweight Network

Abstract:The Quick-view (QV) technique serves as a primary method for detecting defects within sewerage systems. However, the effectiveness of QV is impeded by the limited visual range of its hardware, resulting in suboptimal image quality for distant portions of the sewer network. Image super-resolution is an effective way to improve image quality and has been applied in a variety of scenes. However, research on super-resolution for sewer images remains considerably unexplored. In response, this study leverages the inherent depth relationships present within QV images and introduces a novel Depth-guided, Reference-based Super-Resolution framework denoted as DSRNet. It comprises two core components: a depth extraction module and a depth information matching module (DMM). DSRNet utilizes the adjacent frames of the low-resolution image as reference images and helps them recover texture information based on the correlation. By combining these modules, the integration of depth priors significantly enhances both visual quality and performance benchmarks. Besides, in pursuit of computational efficiency and compactness, our paper introduces a super-resolution knowledge distillation model based on an attention mechanism. This mechanism facilitates the acquisition of feature similarity between a more complex teacher model and a streamlined student model, the latter being a lightweight version of DSRNet. Experimental results demonstrate that DSRNet significantly improves PSNR and SSIM compared with other methods. This study also conducts experiments on sewer defect semantic segmentation, object detection, and classification on the Pipe dataset and Sewer-ML dataset. Experiments show that the method can improve the performance of low-resolution sewer images in these tasks.

Via

Access Paper or Ask Questions

Semi-Supervised Pipe Video Temporal Defect Interval Localization

Jul 21, 2024

Zhu Huang, Gang Pan, Chao Kang, YaoZhi Lv

Figure 1 for Semi-Supervised Pipe Video Temporal Defect Interval Localization

Figure 2 for Semi-Supervised Pipe Video Temporal Defect Interval Localization

Figure 3 for Semi-Supervised Pipe Video Temporal Defect Interval Localization

Figure 4 for Semi-Supervised Pipe Video Temporal Defect Interval Localization

Abstract:In sewer pipe Closed-Circuit Television (CCTV) inspection, accurate temporal defect localization is essential for effective defect classification, detection, segmentation and quantification. Industry standards typically do not require time-interval annotations, even though they are more informative than time-point annotations for defect localization, resulting in additional annotation costs when fully supervised methods are used. Additionally, differences in scene types and camera motion patterns between pipe inspections and Temporal Action Localization (TAL) hinder the effective transfer of point-supervised TAL methods. Therefore, this study introduces a Semi-supervised multi-Prototype-based method incorporating visual Odometry for enhanced attention guidance (PipeSPO). PipeSPO fully leverages unlabeled data through unsupervised pretext tasks and utilizes time-point annotated data with a weakly supervised multi-prototype-based method, relying on visual odometry features to capture camera pose information. Experiments on real-world datasets demonstrate that PipeSPO achieves 41.89% average precision across Intersection over Union (IoU) thresholds of 0.1-0.7, improving by 8.14% over current state-of-the-art methods.

* 13 pages, 3 figures

Via

Access Paper or Ask Questions

Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation

Jun 11, 2024

Jinyuan Li, Ziyan Li, Han Li, Jianfei Yu, Rui Xia, Di Sun, Gang Pan

Figure 1 for Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation

Figure 2 for Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation

Figure 3 for Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation

Figure 4 for Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation

Abstract:Grounded Multimodal Named Entity Recognition (GMNER) task aims to identify named entities, entity types and their corresponding visual regions. GMNER task exhibits two challenging attributes: 1) The tenuous correlation between images and text on social media contributes to a notable proportion of named entities being ungroundable. 2) There exists a distinction between coarse-grained noun phrases used in similar tasks (e.g., phrase localization) and fine-grained named entities. In this paper, we propose RiVEG, a unified framework that reformulates GMNER into a joint MNER-VE-VG task by leveraging large language models (LLMs) as connecting bridges. This reformulation brings two benefits: 1) It enables us to optimize the MNER module for optimal MNER performance and eliminates the need to pre-extract region features using object detection methods, thus naturally addressing the two major limitations of existing GMNER methods. 2) The introduction of Entity Expansion Expression module and Visual Entailment (VE) module unifies Visual Grounding (VG) and Entity Grounding (EG). This endows the proposed framework with unlimited data and model scalability. Furthermore, to address the potential ambiguity stemming from the coarse-grained bounding box output in GMNER, we further construct the new Segmented Multimodal Named Entity Recognition (SMNER) task and corresponding Twitter-SMNER dataset aimed at generating fine-grained segmentation masks, and experimentally demonstrate the feasibility and effectiveness of using box prompt-based Segment Anything Model (SAM) to empower any GMNER model with the ability to accomplish the SMNER task. Extensive experiments demonstrate that RiVEG significantly outperforms SoTA methods on four datasets across the MNER, GMNER, and SMNER tasks.

* Extension of our Findings of EMNLP 2023 & ACL 2024 paper

Via

Access Paper or Ask Questions

Context Gating in Spiking Neural Networks: Achieving Lifelong Learning through Integration of Local and Global Plasticity

Jun 04, 2024

Jiangrong Shen, Wenyao Ni, Qi Xu, Gang Pan, Huajin Tang

Abstract:Humans learn multiple tasks in succession with minimal mutual interference, through the context gating mechanism in the prefrontal cortex (PFC). The brain-inspired models of spiking neural networks (SNN) have drawn massive attention for their energy efficiency and biological plausibility. To overcome catastrophic forgetting when learning multiple tasks in sequence, current SNN models for lifelong learning focus on memory reserving or regularization-based modification, while lacking SNN to replicate human experimental behavior. Inspired by biological context-dependent gating mechanisms found in PFC, we propose SNN with context gating trained by the local plasticity rule (CG-SNN) for lifelong learning. The iterative training between global and local plasticity for task units is designed to strengthen the connections between task neurons and hidden neurons and preserve the multi-task relevant information. The experiments show that the proposed model is effective in maintaining the past learning experience and has better task-selectivity than other methods during lifelong learning. Our results provide new insights that the CG-SNN model can extend context gating with good scalability on different SNN architectures with different spike-firing mechanisms. Thus, our models have good potential for parallel implementation on neuromorphic hardware and model human's behavior.

Via

Access Paper or Ask Questions

Towards Efficient Deep Spiking Neural Networks Construction with Spiking Activity based Pruning

Jun 03, 2024

Yaxin Li, Qi Xu, Jiangrong Shen, Hongming Xu, Long Chen, Gang Pan

Figure 1 for Towards Efficient Deep Spiking Neural Networks Construction with Spiking Activity based Pruning

Figure 2 for Towards Efficient Deep Spiking Neural Networks Construction with Spiking Activity based Pruning

Figure 3 for Towards Efficient Deep Spiking Neural Networks Construction with Spiking Activity based Pruning

Figure 4 for Towards Efficient Deep Spiking Neural Networks Construction with Spiking Activity based Pruning

Abstract:The emergence of deep and large-scale spiking neural networks (SNNs) exhibiting high performance across diverse complex datasets has led to a need for compressing network models due to the presence of a significant number of redundant structural units, aiming to more effectively leverage their low-power consumption and biological interpretability advantages. Currently, most model compression techniques for SNNs are based on unstructured pruning of individual connections, which requires specific hardware support. Hence, we propose a structured pruning approach based on the activity levels of convolutional kernels named Spiking Channel Activity-based (SCA) network pruning framework. Inspired by synaptic plasticity mechanisms, our method dynamically adjusts the network's structure by pruning and regenerating convolutional kernels during training, enhancing the model's adaptation to the current target task. While maintaining model performance, this approach refines the network architecture, ultimately reducing computational load and accelerating the inference process. This indicates that structured dynamic sparse learning methods can better facilitate the application of deep SNNs in low-power and high-efficiency scenarios.

Via

Access Paper or Ask Questions

LOGO: Video Text Spotting with Language Collaboration and Glyph Perception Model

May 29, 2024

Hongen Liu, Yi Liu, Di Sun, Jiahao Wang, Gang Pan

Abstract:Video text spotting aims to simultaneously localize, recognize and track text instances in videos. To address the limited recognition capability of end-to-end methods, tracking the zero-shot results of state-of-the-art image text spotters directly can achieve impressive performance. However, owing to the domain gap between different datasets, these methods usually obtain limited tracking trajectories on extreme dataset. Fine-tuning transformer-based text spotters on specific datasets could yield performance enhancements, albeit at the expense of considerable training resources. In this paper, we propose a Language Collaboration and Glyph Perception Model, termed LOGO to enhance the performance of conventional text spotters through the integration of a synergy module. To achieve this goal, a language synergy classifier (LSC) is designed to explicitly discern text instances from background noise in the recognition stage. Specially, the language synergy classifier can output text content or background code based on the legibility of text regions, thus computing language scores. Subsequently, fusion scores are computed by taking the average of detection scores and language scores, and are utilized to re-score the detection results before tracking. By the re-scoring mechanism, the proposed LSC facilitates the detection of low-resolution text instances while filtering out text-like regions. Besides, the glyph supervision and visual position mixture module are proposed to enhance the recognition accuracy of noisy text regions, and acquire more discriminative tracking features, respectively. Extensive experiments on public benchmarks validate the effectiveness of the proposed method.

Via

Access Paper or Ask Questions

Resisting Stochastic Risks in Diffusion Planners with the Trajectory Aggregation Tree

May 28, 2024

Lang Feng, Pengjie Gu, Bo An, Gang Pan

Abstract:Diffusion planners have shown promise in handling long-horizon and sparse-reward tasks due to the non-autoregressive plan generation. However, their inherent stochastic risk of generating infeasible trajectories presents significant challenges to their reliability and stability. We introduce a novel approach, the Trajectory Aggregation Tree (TAT), to address this issue in diffusion planners. Compared to prior methods that rely solely on raw trajectory predictions, TAT aggregates information from both historical and current trajectories, forming a dynamic tree-like structure. Each trajectory is conceptualized as a branch and individual states as nodes. As the structure evolves with the integration of new trajectories, unreliable states are marginalized, and the most impactful nodes are prioritized for decision-making. TAT can be deployed without modifying the original training and sampling pipelines of diffusion planners, making it a training-free, ready-to-deploy solution. We provide both theoretical analysis and empirical evidence to support TAT's effectiveness. Our results highlight its remarkable ability to resist the risk from unreliable trajectories, guarantee the performance boosting of diffusion planners in $100\%$ of tasks, and exhibit an appreciable tolerance margin for sample quality, thereby enabling planning with a more than $3\times$ acceleration.

* The 41st International Conference on Machine Learning (ICML 2024)

Via

Access Paper or Ask Questions

Off-OAB: Off-Policy Policy Gradient Method with Optimal Action-Dependent Baseline

May 04, 2024

Wenjia Meng, Qian Zheng, Long Yang, Yilong Yin, Gang Pan

Figure 1 for Off-OAB: Off-Policy Policy Gradient Method with Optimal Action-Dependent Baseline

Figure 2 for Off-OAB: Off-Policy Policy Gradient Method with Optimal Action-Dependent Baseline

Figure 3 for Off-OAB: Off-Policy Policy Gradient Method with Optimal Action-Dependent Baseline

Figure 4 for Off-OAB: Off-Policy Policy Gradient Method with Optimal Action-Dependent Baseline

Abstract:Policy-based methods have achieved remarkable success in solving challenging reinforcement learning problems. Among these methods, off-policy policy gradient methods are particularly important due to that they can benefit from off-policy data. However, these methods suffer from the high variance of the off-policy policy gradient (OPPG) estimator, which results in poor sample efficiency during training. In this paper, we propose an off-policy policy gradient method with the optimal action-dependent baseline (Off-OAB) to mitigate this variance issue. Specifically, this baseline maintains the OPPG estimator's unbiasedness while theoretically minimizing its variance. To enhance practical computational efficiency, we design an approximated version of this optimal baseline. Utilizing this approximation, our method (Off-OAB) aims to decrease the OPPG estimator's variance during policy optimization. We evaluate the proposed Off-OAB method on six representative tasks from OpenAI Gym and MuJoCo, where it demonstrably surpasses state-of-the-art methods on the majority of these tasks.

* 12 pages, 3 figures

Via

Access Paper or Ask Questions