Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jianqin Yin

MTDA-HSED: Mutual-Assistance Tuning and Dual-Branch Aggregating for Heterogeneous Sound Event Detection

Sep 11, 2024

Zehao Wang, Haobo Yue, Zhicheng Zhang, Da Mu, Jin Tang, Jianqin Yin

Figure 1 for MTDA-HSED: Mutual-Assistance Tuning and Dual-Branch Aggregating for Heterogeneous Sound Event Detection

Figure 2 for MTDA-HSED: Mutual-Assistance Tuning and Dual-Branch Aggregating for Heterogeneous Sound Event Detection

Figure 3 for MTDA-HSED: Mutual-Assistance Tuning and Dual-Branch Aggregating for Heterogeneous Sound Event Detection

Figure 4 for MTDA-HSED: Mutual-Assistance Tuning and Dual-Branch Aggregating for Heterogeneous Sound Event Detection

Abstract:Sound Event Detection (SED) plays a vital role in comprehending and perceiving acoustic scenes. Previous methods have demonstrated impressive capabilities. However, they are deficient in learning features of complex scenes from heterogeneous dataset. In this paper, we introduce a novel dual-branch architecture named Mutual-Assistance Tuning and Dual-Branch Aggregating for Heterogeneous Sound Event Detection (MTDA-HSED). The MTDA-HSED architecture employs the Mutual-Assistance Audio Adapter (M3A) to effectively tackle the multi-scenario problem and uses the Dual-Branch Mid-Fusion (DBMF) module to tackle the multi-granularity problem. Specifically, M3A is integrated into the BEATs block as an adapter to improve the BEATs' performance by fine-tuning it on the multi-scenario dataset. The DBMF module connects BEATs and CNN branches, which facilitates the deep fusion of information from the BEATs and the CNN branches. Experimental results show that the proposed methods exceed the baseline of mpAUC by \textbf{$5\%$} on the DESED and MAESTRO Real datasets. Code is available at https://github.com/Visitor-W/MTDA.

* Submit to Icassp2025

Via

Access Paper or Ask Questions

SELD-Mamba: Selective State-Space Model for Sound Event Localization and Detection with Source Distance Estimation

Aug 09, 2024

Da Mu, Zhicheng Zhang, Haobo Yue, Zehao Wang, Jin Tang, Jianqin Yin

Figure 1 for SELD-Mamba: Selective State-Space Model for Sound Event Localization and Detection with Source Distance Estimation

Figure 2 for SELD-Mamba: Selective State-Space Model for Sound Event Localization and Detection with Source Distance Estimation

Figure 3 for SELD-Mamba: Selective State-Space Model for Sound Event Localization and Detection with Source Distance Estimation

Figure 4 for SELD-Mamba: Selective State-Space Model for Sound Event Localization and Detection with Source Distance Estimation

Abstract:In the Sound Event Localization and Detection (SELD) task, Transformer-based models have demonstrated impressive capabilities. However, the quadratic complexity of the Transformer's self-attention mechanism results in computational inefficiencies. In this paper, we propose a network architecture for SELD called SELD-Mamba, which utilizes Mamba, a selective state-space model. We adopt the Event-Independent Network V2 (EINV2) as the foundational framework and replace its Conformer blocks with bidirectional Mamba blocks to capture a broader range of contextual information while maintaining computational efficiency. Additionally, we implement a two-stage training method, with the first stage focusing on Sound Event Detection (SED) and Direction of Arrival (DoA) estimation losses, and the second stage reintroducing the Source Distance Estimation (SDE) loss. Our experimental results on the 2024 DCASE Challenge Task3 dataset demonstrate the effectiveness of the selective state-space model in SELD and highlight the benefits of the two-stage training approach in enhancing SELD performance.

Via

Access Paper or Ask Questions

ActivityCLIP: Enhancing Group Activity Recognition by Mining Complementary Information from Text to Supplement Image Modality

Jul 29, 2024

Guoliang Xu, Jianqin Yin, Feng Zhou, Yonghao Dang

Figure 1 for ActivityCLIP: Enhancing Group Activity Recognition by Mining Complementary Information from Text to Supplement Image Modality

Figure 2 for ActivityCLIP: Enhancing Group Activity Recognition by Mining Complementary Information from Text to Supplement Image Modality

Figure 3 for ActivityCLIP: Enhancing Group Activity Recognition by Mining Complementary Information from Text to Supplement Image Modality

Figure 4 for ActivityCLIP: Enhancing Group Activity Recognition by Mining Complementary Information from Text to Supplement Image Modality

Abstract:Previous methods usually only extract the image modality's information to recognize group activity. However, mining image information is approaching saturation, making it difficult to extract richer information. Therefore, extracting complementary information from other modalities to supplement image information has become increasingly important. In fact, action labels provide clear text information to express the action's semantics, which existing methods often overlook. Thus, we propose ActivityCLIP, a plug-and-play method for mining the text information contained in the action labels to supplement the image information for enhancing group activity recognition. ActivityCLIP consists of text and image branches, where the text branch is plugged into the image branch (The off-the-shelf image-based method). The text branch includes Image2Text and relation modeling modules. Specifically, we propose the knowledge transfer module, Image2Text, which adapts image information into text information extracted by CLIP via knowledge distillation. Further, to keep our method convenient, we add fewer trainable parameters based on the relation module of the image branch to model interaction relation in the text branch. To show our method's generality, we replicate three representative methods by ActivityCLIP, which adds only limited trainable parameters, achieving favorable performance improvements for each method. We also conduct extensive ablation studies and compare our method with state-of-the-art methods to demonstrate the effectiveness of ActivityCLIP.

Via

Access Paper or Ask Questions

Micro-expression recognition based on depth map to point cloud

Jun 12, 2024

Ren Zhang, Jianqin Yin, Chao Qi, Zehao Wang, Zhicheng Zhang, Yonghao Dang

Abstract:Micro-expressions are nonverbal facial expressions that reveal the covert emotions of individuals, making the micro-expression recognition task receive widespread attention. However, the micro-expression recognition task is challenging due to the subtle facial motion and brevity in duration. Many 2D image-based methods have been developed in recent years to recognize MEs effectively, but, these approaches are restricted by facial texture information and are susceptible to environmental factors, such as lighting. Conversely, depth information can effectively represent motion information related to facial structure changes and is not affected by lighting. Motion information derived from facial structures can describe motion features that pixel textures cannot delineate. We proposed a network for micro-expression recognition based on facial depth information, and our experiments have demonstrated the crucial role of depth maps in the micro-expression recognition task. Initially, we transform the depth map into a point cloud and obtain the motion information for each point by aligning the initiating frame with the apex frame and performing a differential operation. Subsequently, we adjusted all point cloud motion feature input dimensions and used them as inputs for multiple point cloud networks to assess the efficacy of this representation. PointNet++ was chosen as the ultimate outcome for micro-expression recognition due to its superior performance. Our experiments show that our proposed method significantly outperforms the existing deep learning methods, including the baseline, on the $CAS(ME)^3$ dataset, which includes depth information.

Via

Access Paper or Ask Questions

CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering

May 13, 2024

Yuanyuan Jiang, Jianqin Yin

Figure 1 for CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering

Figure 2 for CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering

Figure 3 for CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering

Figure 4 for CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering

Abstract:While vision-language pretrained models (VLMs) excel in various multimodal understanding tasks, their potential in fine-grained audio-visual reasoning, particularly for audio-visual question answering (AVQA), remains largely unexplored. AVQA presents specific challenges for VLMs due to the requirement of visual understanding at the region level and seamless integration with audio modality. Previous VLM-based AVQA methods merely used CLIP as a feature encoder but underutilized its knowledge, and mistreated audio and video as separate entities in a dual-stream framework as most AVQA methods. This paper proposes a new CLIP-powered target-aware single-stream (TASS) network for AVQA using the image-text matching knowledge of the pretrained model through the audio-visual matching characteristic of nature. It consists of two key components: the target-aware spatial grounding module (TSG+) and the single-stream joint temporal grounding module (JTG). Specifically, we propose a TSG+ module to transfer the image-text matching knowledge from CLIP models to our region-text matching process without corresponding ground-truth labels. Moreover, unlike previous separate dual-stream networks that still required an additional audio-visual fusion module, JTG unifies audio-visual fusion and question-aware temporal grounding in a simplified single-stream architecture. It treats audio and video as a cohesive entity and further extends the pretrained image-text knowledge to audio-text matching by preserving their temporal correlation with our proposed cross-modal synchrony (CMS) loss. Extensive experiments conducted on the MUSIC-AVQA benchmark verified the effectiveness of our proposed method over existing state-of-the-art methods.

* Submitted to the Journal on February 6, 2024

Via

Access Paper or Ask Questions

DHRNet: A Dual-Path Hierarchical Relation Network for Multi-Person Pose Estimation

Apr 27, 2024

Yonghao Dang, Jianqin Yin, Liyuan Liu, Pengxiang Ding, Yuan Sun, Yanzhu Hu

Figure 1 for DHRNet: A Dual-Path Hierarchical Relation Network for Multi-Person Pose Estimation

Figure 2 for DHRNet: A Dual-Path Hierarchical Relation Network for Multi-Person Pose Estimation

Figure 3 for DHRNet: A Dual-Path Hierarchical Relation Network for Multi-Person Pose Estimation

Figure 4 for DHRNet: A Dual-Path Hierarchical Relation Network for Multi-Person Pose Estimation

Abstract:Multi-person pose estimation (MPPE) presents a formidable yet crucial challenge in computer vision. Most existing methods predominantly concentrate on isolated interaction either between instances or joints, which is inadequate for scenarios demanding concurrent localization of both instances and joints. This paper introduces a novel CNN-based single-stage method, named Dual-path Hierarchical Relation Network (DHRNet), to extract instance-to-joint and joint-to-instance interactions concurrently. Specifically, we design a dual-path interaction modeling module (DIM) that strategically organizes cross-instance and cross-joint interaction modeling modules in two complementary orders, enriching interaction information by integrating merits from different correlation modeling branches. Notably, DHRNet excels in joint localization by leveraging information from other instances and joints. Extensive evaluations on challenging datasets, including COCO, CrowdPose, and OCHuman datasets, showcase DHRNet's state-of-the-art performance. The code will be released at https://github.com/YHDang/dhrnet-multi-pose-estimation.

Via

Access Paper or Ask Questions

OMEGAS: Object Mesh Extraction from Large Scenes Guided by Gaussian Segmentation

Apr 25, 2024

Lizhi Wang, Feng Zhou, Jianqin Yin

Figure 1 for OMEGAS: Object Mesh Extraction from Large Scenes Guided by Gaussian Segmentation

Figure 2 for OMEGAS: Object Mesh Extraction from Large Scenes Guided by Gaussian Segmentation

Figure 3 for OMEGAS: Object Mesh Extraction from Large Scenes Guided by Gaussian Segmentation

Figure 4 for OMEGAS: Object Mesh Extraction from Large Scenes Guided by Gaussian Segmentation

Abstract:Recent advancements in 3D reconstruction technologies have paved the way for high-quality and real-time rendering of complex 3D scenes. Despite these achievements, a notable challenge persists: it is difficult to precisely reconstruct specific objects from large scenes. Current scene reconstruction techniques frequently result in the loss of object detail textures and are unable to reconstruct object portions that are occluded or unseen in views. To address this challenge, we delve into the meticulous 3D reconstruction of specific objects within large scenes and propose a framework termed OMEGAS: Object Mesh Extraction from Large Scenes Guided by GAussian Segmentation. OMEGAS employs a multi-step approach, grounded in several excellent off-the-shelf methodologies. Specifically, initially, we utilize the Segment Anything Model (SAM) to guide the segmentation of 3D Gaussian Splatting (3DGS), thereby creating a basic 3DGS model of the target object. Then, we leverage large-scale diffusion priors to further refine the details of the 3DGS model, especially aimed at addressing invisible or occluded object portions from the original scene views. Subsequently, by re-rendering the 3DGS model onto the scene views, we achieve accurate object segmentation and effectively remove the background. Finally, these target-only images are used to improve the 3DGS model further and extract the definitive 3D object mesh by the SuGaR model. In various scenarios, our experiments demonstrate that OMEGAS significantly surpasses existing scene reconstruction methods. Our project page is at: https://github.com/CrystalWlz/OMEGAS

* arXiv admin note: text overlap with arXiv:2311.17061 by other authors

Via

Access Paper or Ask Questions

Towards more realistic human motion prediction with attention to motion coordination

Apr 04, 2024

Pengxiang Ding, Jianqin Yin

Figure 1 for Towards more realistic human motion prediction with attention to motion coordination

Figure 2 for Towards more realistic human motion prediction with attention to motion coordination

Figure 3 for Towards more realistic human motion prediction with attention to motion coordination

Figure 4 for Towards more realistic human motion prediction with attention to motion coordination

Abstract:Joint relation modeling is a curial component in human motion prediction. Most existing methods rely on skeletal-based graphs to build the joint relations, where local interactive relations between joint pairs are well learned. However, the motion coordination, a global joint relation reflecting the simultaneous cooperation of all joints, is usually weakened because it is learned from part to whole progressively and asynchronously. Thus, the final predicted motions usually appear unrealistic. To tackle this issue, we learn a medium, called coordination attractor (CA), from the spatiotemporal features of motion to characterize the global motion features, which is subsequently used to build new relative joint relations. Through the CA, all joints are related simultaneously, and thus the motion coordination of all joints can be better learned. Based on this, we further propose a novel joint relation modeling module, Comprehensive Joint Relation Extractor (CJRE), to combine this motion coordination with the local interactions between joint pairs in a unified manner. Additionally, we also present a Multi-timescale Dynamics Extractor (MTDE) to extract enriched dynamics from the raw position information for effective prediction. Extensive experiments show that the proposed framework outperforms state-of-the-art methods in both short- and long-term predictions on H3.6M, CMU-Mocap, and 3DPW.

* Accepted by TCSVT

Via

Access Paper or Ask Questions

Full-frequency dynamic convolution: a physical frequency-dependent convolution for sound event detection

Jan 10, 2024

Haobo Yue, Zhicheng Zhang, Da Mu, Yonghao Dang, Jianqin Yin, Jin Tang

Figure 1 for Full-frequency dynamic convolution: a physical frequency-dependent convolution for sound event detection

Figure 2 for Full-frequency dynamic convolution: a physical frequency-dependent convolution for sound event detection

Figure 3 for Full-frequency dynamic convolution: a physical frequency-dependent convolution for sound event detection

Figure 4 for Full-frequency dynamic convolution: a physical frequency-dependent convolution for sound event detection

Abstract:Recently, 2D convolution has been found unqualified in sound event detection (SED). It enforces translation equivariance on sound events along frequency axis, which is not a shift-invariant dimension. To address this issue, dynamic convolution is used to model the frequency dependency of sound events. In this paper, we proposed the first full-dynamic method named \emph{full-frequency dynamic convolution} (FFDConv). FFDConv generates frequency kernels for every frequency band, which is designed directly in the structure for frequency-dependent modeling. It physically furnished 2D convolution with the capability of frequency-dependent modeling. FFDConv outperforms not only the baseline by 6.6\% in DESED real validation dataset in terms of PSDS1, but outperforms the other full-dynamic methods. In addition, by visualizing features of sound events, we observed that FFDConv could effectively extract coherent features in specific frequency bands, consistent with the vocal continuity of sound events. This proves that FFDConv has great frequency-dependent perception ability.

* 6 pages, 4 figures, submitted to ICME2024

Via

Access Paper or Ask Questions

Spatial-Temporal Decoupling Contrastive Learning for Skeleton-based Human Action Recognition

Jan 09, 2024

Shaojie Zhang, Jianqin Yin, Yonghao Dang

Abstract:Skeleton-based action recognition is a central task of human-computer interaction. However, most of the previous methods suffer from two issues: (i) semantic ambiguity arising from spatiotemporal information mixture; and (ii) overlooking the explicit exploitation of the latent data distributions (i.e., the intra-class variations and inter-class relations), thereby leading to local optimum solutions of the skeleton encoders. To mitigate this, we propose a spatial-temporal decoupling contrastive learning (STD-CL) framework to obtain discriminative and semantically distinct representations from the sequences, which can be incorporated into almost all previous skeleton encoders and have no impact on the skeleton encoders when testing. Specifically, we decouple the global features into spatial-specific and temporal-specific features to reduce the spatiotemporal coupling of features. Furthermore, to explicitly exploit the latent data distributions, we employ the attentive features to contrastive learning, which models the cross-sequence semantic relations by pulling together the features from the positive pairs and pushing away the negative pairs. Extensive experiments show that STD-CL with four various skeleton encoders (HCN, 2S-AGCN, CTR-GCN, and Hyperformer) achieves solid improvement on NTU60, NTU120, and NW-UCLA benchmarks. The code will be released.

Via

Access Paper or Ask Questions