Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jiajun Deng

Hierarchical Context Alignment with Disentangled Geometric and Temporal Modeling for Semantic Occupancy Prediction

Dec 11, 2024

Bohan Li, Xin Jin, Jiajun Deng, Yasheng Sun, Xiaofeng Wang, Wenjun Zeng

Figure 1 for Hierarchical Context Alignment with Disentangled Geometric and Temporal Modeling for Semantic Occupancy Prediction

Figure 2 for Hierarchical Context Alignment with Disentangled Geometric and Temporal Modeling for Semantic Occupancy Prediction

Figure 3 for Hierarchical Context Alignment with Disentangled Geometric and Temporal Modeling for Semantic Occupancy Prediction

Figure 4 for Hierarchical Context Alignment with Disentangled Geometric and Temporal Modeling for Semantic Occupancy Prediction

Abstract:Camera-based 3D Semantic Occupancy Prediction (SOP) is crucial for understanding complex 3D scenes from limited 2D image observations. Existing SOP methods typically aggregate contextual features to assist the occupancy representation learning, alleviating issues like occlusion or ambiguity. However, these solutions often face misalignment issues wherein the corresponding features at the same position across different frames may have different semantic meanings during the aggregation process, which leads to unreliable contextual fusion results and an unstable representation learning process. To address this problem, we introduce a new Hierarchical context alignment paradigm for a more accurate SOP (Hi-SOP). Hi-SOP first disentangles the geometric and temporal context for separate alignment, which two branches are then composed to enhance the reliability of SOP. This parsing of the visual input into a local-global alignment hierarchy includes: (I) disentangled geometric and temporal separate alignment, within each leverages depth confidence and camera pose as prior for relevant feature matching respectively; (II) global alignment and composition of the transformed geometric and temporal volumes based on semantics consistency. Our method outperforms SOTAs for semantic scene completion on the SemanticKITTI & NuScenes-Occupancy datasets and LiDAR semantic segmentation on the NuScenes dataset.

Via

Access Paper or Ask Questions

Exploring SSL Discrete Speech Features for Zipformer-based Contextual ASR

Sep 13, 2024

Mingyu Cui, Yifan Yang, Jiajun Deng, Jiawen Kang, Shujie Hu, Tianzi Wang, Zhaoqing Li, Shiliang Zhang, Xie Chen, Xunying Liu

Figure 1 for Exploring SSL Discrete Speech Features for Zipformer-based Contextual ASR

Figure 2 for Exploring SSL Discrete Speech Features for Zipformer-based Contextual ASR

Figure 3 for Exploring SSL Discrete Speech Features for Zipformer-based Contextual ASR

Figure 4 for Exploring SSL Discrete Speech Features for Zipformer-based Contextual ASR

Abstract:Self-supervised learning (SSL) based discrete speech representations are highly compact and domain adaptable. In this paper, SSL discrete speech features extracted from WavLM models are used as additional cross-utterance acoustic context features in Zipformer-Transducer ASR systems. The efficacy of replacing Fbank features with discrete token features for modelling either cross-utterance contexts (from preceding and future segments), or current utterance's internal contexts alone, or both at the same time, are demonstrated thoroughly on the Gigaspeech 1000-hr corpus. The best Zipformer-Transducer system using discrete tokens based cross-utterance context features outperforms the baseline using utterance internal context only with statistically significant word error rate (WER) reductions of 0.32% to 0.41% absolute (2.78% to 3.54% relative) on the dev and test data. The lowest published WER of 11.15% and 11.14% were obtained on the dev and test sets. Our work is open-source and publicly available at https://github.com/open-creator/icefall/tree/master/egs/gigaspeech/Context\_ASR.

* Submitted to ICASSP 2025

Via

Access Paper or Ask Questions

RayFormer: Improving Query-Based Multi-Camera 3D Object Detection via Ray-Centric Strategies

Jul 27, 2024

Xiaomeng Chu, Jiajun Deng, Guoliang You, Yifan Duan, Yao Li, Yanyong Zhang

Figure 1 for RayFormer: Improving Query-Based Multi-Camera 3D Object Detection via Ray-Centric Strategies

Figure 2 for RayFormer: Improving Query-Based Multi-Camera 3D Object Detection via Ray-Centric Strategies

Figure 3 for RayFormer: Improving Query-Based Multi-Camera 3D Object Detection via Ray-Centric Strategies

Figure 4 for RayFormer: Improving Query-Based Multi-Camera 3D Object Detection via Ray-Centric Strategies

Abstract:The recent advances in query-based multi-camera 3D object detection are featured by initializing object queries in the 3D space, and then sampling features from perspective-view images to perform multi-round query refinement. In such a framework, query points near the same camera ray are likely to sample similar features from very close pixels, resulting in ambiguous query features and degraded detection accuracy. To this end, we introduce RayFormer, a camera-ray-inspired query-based 3D object detector that aligns the initialization and feature extraction of object queries with the optical characteristics of cameras. Specifically, RayFormer transforms perspective-view image features into bird's eye view (BEV) via the lift-splat-shoot method and segments the BEV map to sectors based on the camera rays. Object queries are uniformly and sparsely initialized along each camera ray, facilitating the projection of different queries onto different areas in the image to extract distinct features. Besides, we leverage the instance information of images to supplement the uniformly initialized object queries by further involving additional queries along the ray from 2D object detection boxes. To extract unique object-level features that cater to distinct queries, we design a ray sampling method that suitably organizes the distribution of feature sampling points on both images and bird's eye view. Extensive experiments are conducted on the nuScenes dataset to validate our proposed ray-inspired model design. The proposed RayFormer achieves 55.5% mAP and 63.3% NDS, respectively. Our codes will be made available.

* Accepted by ACM Multimedia 2024

Via

Access Paper or Ask Questions

Described Spatial-Temporal Video Detection

Jul 08, 2024

Wei Ji, Xiangyan Liu, Yingfei Sun, Jiajun Deng, You Qin, Ammar Nuwanna, Mengyao Qiu, Lina Wei, Roger Zimmermann

Figure 1 for Described Spatial-Temporal Video Detection

Figure 2 for Described Spatial-Temporal Video Detection

Figure 3 for Described Spatial-Temporal Video Detection

Figure 4 for Described Spatial-Temporal Video Detection

Abstract:Detecting visual content on language expression has become an emerging topic in the community. However, in the video domain, the existing setting, i.e., spatial-temporal video grounding (STVG), is formulated to only detect one pre-existing object in each frame, ignoring the fact that language descriptions can involve none or multiple entities within a video. In this work, we advance the STVG to a more practical setting called described spatial-temporal video detection (DSTVD) by overcoming the above limitation. To facilitate the exploration of DSTVD, we first introduce a new benchmark, namely DVD-ST. Notably, DVD-ST supports grounding from none to many objects onto the video in response to queries and encompasses a diverse range of over 150 entities, including appearance, actions, locations, and interactions. The extensive breadth and diversity of the DVD-ST dataset make it an exemplary testbed for the investigation of DSTVD. In addition to the new benchmark, we further present two baseline methods for our proposed DSTVD task by extending two representative STVG models, i.e., TubeDETR, and STCAT. These extended models capitalize on tubelet queries to localize and track referred objects across the video sequence. Besides, we adjust the training objectives of these models to optimize spatial and temporal localization accuracy and multi-class classification capabilities. Furthermore, we benchmark the baselines on the introduced DVD-ST dataset and conduct extensive experimental analysis to guide future investigation. Our code and benchmark will be publicly available.

Via

Access Paper or Ask Questions

Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation

Jul 08, 2024

Mengzhe Geng, Xurong Xie, Jiajun Deng, Zengrui Jin, Guinan Li, Tianzi Wang, Shujie Hu, Zhaoqing Li, Helen Meng, Xunying Liu

Figure 1 for Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation

Figure 2 for Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation

Figure 3 for Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation

Figure 4 for Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation

Abstract:The application of data-intensive automatic speech recognition (ASR) technologies to dysarthric and elderly adult speech is confronted by their mismatch against healthy and nonaged voices, data scarcity and large speaker-level variability. To this end, this paper proposes two novel data-efficient methods to learn homogeneous dysarthric and elderly speaker-level features for rapid, on-the-fly test-time adaptation of DNN/TDNN and Conformer ASR models. These include: 1) speaker-level variance-regularized spectral basis embedding (VR-SBE) features that exploit a special regularization term to enforce homogeneity of speaker features in adaptation; and 2) feature-based learning hidden unit contributions (f-LHUC) transforms that are conditioned on VR-SBE features. Experiments are conducted on four tasks across two languages: the English UASpeech and TORGO dysarthric speech datasets, the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech corpora. The proposed on-the-fly speaker adaptation techniques consistently outperform baseline iVector and xVector adaptation by statistically significant word or character error rate reductions up to 5.32% absolute (18.57% relative) and batch-mode LHUC speaker adaptation by 2.24% absolute (9.20% relative), while operating with real-time factors speeding up to 33.6 times against xVectors during adaptation. The efficacy of the proposed adaptation techniques is demonstrated in a comparison against current ASR technologies including SSL pre-trained systems on UASpeech, where our best system produces a state-of-the-art WER of 23.33%. Analyses show VR-SBE features and f-LHUC transforms are insensitive to speaker-level data quantity in testtime adaptation. T-SNE visualization reveals they have stronger speaker-level homogeneity than baseline iVectors, xVectors and batch-mode LHUC transforms.

* In submission to IEEE/ACM Transactions on Audio, Speech, and Language Processing

Via

Access Paper or Ask Questions

Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion

Jul 02, 2024

Bohan Li, Jiajun Deng, Wenyao Zhang, Zhujin Liang, Dalong Du, Xin Jin, Wenjun Zeng

Abstract:Camera-based 3D semantic scene completion (SSC) is pivotal for predicting complicated 3D layouts with limited 2D image observations. The existing mainstream solutions generally leverage temporal information by roughly stacking history frames to supplement the current frame, such straightforward temporal modeling inevitably diminishes valid clues and increases learning difficulty. To address this problem, we present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion. The primary innovation of this work involves decomposing temporal context learning into two hierarchical steps: (a) cross-frame affinity measurement and (b) affinity-based dynamic refinement. Firstly, to separate critical relevant context from redundant information, we introduce the pattern affinity with scale-aware isolation and multiple independent learners for fine-grained contextual correspondence modeling. Subsequently, to dynamically compensate for incomplete observations, we adaptively refine the feature sampling locations based on initially identified locations with high affinity and their neighboring relevant regions. Our method ranks $1^{st}$ on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU on the OpenOccupancy benchmark. Our code is available on https://github.com/Arlo0o/HTCL.

* ECCV 2024

Via

Access Paper or Ask Questions

Joint Speaker Features Learning for Audio-visual Multichannel Speech Separation and Recognition

Jun 14, 2024

Guinan Li, Jiajun Deng, Youjun Chen, Mengzhe Geng, Shujie Hu, Zhe Li, Zengrui Jin, Tianzi Wang, Xurong Xie, Helen Meng(+1 more)

Figure 1 for Joint Speaker Features Learning for Audio-visual Multichannel Speech Separation and Recognition

Figure 2 for Joint Speaker Features Learning for Audio-visual Multichannel Speech Separation and Recognition

Figure 3 for Joint Speaker Features Learning for Audio-visual Multichannel Speech Separation and Recognition

Figure 4 for Joint Speaker Features Learning for Audio-visual Multichannel Speech Separation and Recognition

Abstract:This paper proposes joint speaker feature learning methods for zero-shot adaptation of audio-visual multichannel speech separation and recognition systems. xVector and ECAPA-TDNN speaker encoders are connected using purpose-built fusion blocks and tightly integrated with the complete system training. Experiments conducted on LRS3-TED data simulated multichannel overlapped speech suggest that joint speaker feature learning consistently improves speech separation and recognition performance over the baselines without joint speaker feature estimation. Further analyses reveal performance improvements are strongly correlated with increased inter-speaker discrimination measured using cosine similarity. The best-performing joint speaker feature learning adapted system outperformed the baseline fine-tuned WavLM model by statistically significant WER reductions of 21.6% and 25.3% absolute (67.5% and 83.5% relative) on Dev and Test sets after incorporating WavLM features and video modality.

* Accepted by Interspeech 2024

Via

Access Paper or Ask Questions

One-pass Multiple Conformer and Foundation Speech Systems Compression and Quantization Using An All-in-one Neural Model

Jun 14, 2024

Zhaoqing Li, Haoning Xu, Tianzi Wang, Shoukang Hu, Zengrui Jin, Shujie Hu, Jiajun Deng, Mingyu Cui, Mengzhe Geng, Xunying Liu

Figure 1 for One-pass Multiple Conformer and Foundation Speech Systems Compression and Quantization Using An All-in-one Neural Model

Figure 2 for One-pass Multiple Conformer and Foundation Speech Systems Compression and Quantization Using An All-in-one Neural Model

Figure 3 for One-pass Multiple Conformer and Foundation Speech Systems Compression and Quantization Using An All-in-one Neural Model

Figure 4 for One-pass Multiple Conformer and Foundation Speech Systems Compression and Quantization Using An All-in-one Neural Model

Abstract:We propose a novel one-pass multiple ASR systems joint compression and quantization approach using an all-in-one neural model. A single compression cycle allows multiple nested systems with varying Encoder depths, widths, and quantization precision settings to be simultaneously constructed without the need to train and store individual target systems separately. Experiments consistently demonstrate the multiple ASR systems compressed in a single all-in-one model produced a word error rate (WER) comparable to, or lower by up to 1.01\% absolute (6.98\% relative) than individually trained systems of equal complexity. A 3.4x overall system compression and training time speed-up was achieved. Maximum model size compression ratios of 12.8x and 3.93x were obtained over the baseline Switchboard-300hr Conformer and LibriSpeech-100hr fine-tuned wav2vec2.0 models, respectively, incurring no statistically significant WER increase.

* Accepted by Interspeech 2024

Via

Access Paper or Ask Questions

Towards Effective and Efficient Non-autoregressive Decoding Using Block-based Attention Mask

Jun 14, 2024

Tianzi Wang, Xurong Xie, Zhaoqing Li, Shoukang Hu, Zengrui Jing, Jiajun Deng, Mingyu Cui, Shujie Hu, Mengzhe Geng, Guinan Li(+2 more)

Figure 1 for Towards Effective and Efficient Non-autoregressive Decoding Using Block-based Attention Mask

Figure 2 for Towards Effective and Efficient Non-autoregressive Decoding Using Block-based Attention Mask

Figure 3 for Towards Effective and Efficient Non-autoregressive Decoding Using Block-based Attention Mask

Figure 4 for Towards Effective and Efficient Non-autoregressive Decoding Using Block-based Attention Mask

Abstract:This paper proposes a novel non-autoregressive (NAR) block-based Attention Mask Decoder (AMD) that flexibly balances performance-efficiency trade-offs for Conformer ASR systems. AMD performs parallel NAR inference within contiguous blocks of output labels that are concealed using attention masks, while conducting left-to-right AR prediction and history context amalgamation between blocks. A beam search algorithm is designed to leverage a dynamic fusion of CTC, AR Decoder, and AMD probabilities. Experiments on the LibriSpeech-100hr corpus suggest the tripartite Decoder incorporating the AMD module produces a maximum decoding speed-up ratio of 1.73x over the baseline CTC+AR decoding, while incurring no statistically significant word error rate (WER) increase on the test sets. When operating with the same decoding real time factors, statistically significant WER reductions of up to 0.7% and 0.3% absolute (5.3% and 6.1% relative) were obtained over the CTC+AR baseline.

* 5 pages, 2 figures, 2 tables, Interspeech24 conference

Via

Access Paper or Ask Questions

HVDistill: Transferring Knowledge from Images to Point Clouds via Unsupervised Hybrid-View Distillation

Mar 18, 2024

Sha Zhang, Jiajun Deng, Lei Bai, Houqiang Li, Wanli Ouyang, Yanyong Zhang

Abstract:We present a hybrid-view-based knowledge distillation framework, termed HVDistill, to guide the feature learning of a point cloud neural network with a pre-trained image network in an unsupervised manner. By exploiting the geometric relationship between RGB cameras and LiDAR sensors, the correspondence between the two modalities based on both image-plane view and bird-eye view can be established, which facilitates representation learning. Specifically, the image-plane correspondences can be simply obtained by projecting the point clouds, while the bird-eye-view correspondences can be achieved by lifting pixels to the 3D space with the predicted depths under the supervision of projected point clouds. The image teacher networks provide rich semantics from the image-plane view and meanwhile acquire geometric information from the bird-eye view. Indeed, image features from the two views naturally complement each other and together can ameliorate the learned feature representation of the point cloud student networks. Moreover, with a self-supervised pre-trained 2D network, HVDistill requires neither 2D nor 3D annotations. We pre-train our model on nuScenes dataset and transfer it to several downstream tasks on nuScenes, SemanticKITTI, and KITTI datasets for evaluation. Extensive experimental results show that our method achieves consistent improvements over the baseline trained from scratch and significantly outperforms the existing schemes. Codes are available at git@github.com:zhangsha1024/HVDistill.git.

Via

Access Paper or Ask Questions