Bing Han

The SJTU System for Short-duration Speaker Verification Challenge 2021

Aug 03, 2022
Bing Han, Zhengyang Chen, Zhikai Zhou, Yanmin Qian

This paper presents the SJTU system for both the text-dependent and text-independent tasks of the Short-duration Speaker Verification (SdSV) Challenge 2021. In this challenge, we explored several strong embedding extractors to obtain robust speaker embeddings. For the text-independent task, a language-dependent adaptive s-norm is explored to improve performance under the cross-lingual verification condition. For the text-dependent task, we mainly focus on in-domain fine-tuning strategies for a model pre-trained on large-scale out-of-domain data. To better distinguish different speakers uttering the same phrase, we propose several novel phrase-aware fine-tuning strategies and a phrase-aware neural PLDA, which further improve system performance. Finally, we fused the scores of the individual systems; our fusion achieved 0.0473 on Task 1 (rank 3) and 0.0581 on Task 2 (rank 8) on the primary evaluation metric.

* Published at Interspeech 2021
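
The abstract does not spell out the adaptive s-norm, but the standard formulation is well known; below is a minimal sketch of adaptive symmetric score normalization (AS-norm). The function name and the `top_k` default are illustrative, and the language-dependent variant described above would simply restrict the cohort to utterances matching the trial's language.

```python
import numpy as np

def adaptive_snorm(score, enroll_cohort, test_cohort, top_k=300):
    """Adaptive symmetric score normalization (AS-norm) sketch.

    score         : raw trial score s(e, t)
    enroll_cohort : scores of the enrollment utterance against a cohort set
    test_cohort   : scores of the test utterance against the same cohort
    top_k         : only the top-k closest cohort scores are used, which is
                    what makes the normalization "adaptive" (value assumed)
    """
    e_top = np.sort(enroll_cohort)[-top_k:]
    t_top = np.sort(test_cohort)[-top_k:]
    return 0.5 * ((score - e_top.mean()) / e_top.std()
                  + (score - t_top.mean()) / t_top.std())
```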

Self-Supervised Speaker Verification Using Dynamic Loss-Gate and Label Correction

Aug 03, 2022
Bing Han, Zhengyang Chen, Yanmin Qian

In self-supervised speaker verification, the quality of the pseudo labels bounds system performance, since many of them are unreliable. In this work, we propose dynamic loss-gate and label correction (DLG-LC) to alleviate the degradation caused by unreliably estimated labels. In DLG, we fit a Gaussian Mixture Model (GMM) to the loss distribution on the fly and use it to automatically separate reliable from unreliable labels. Moreover, rather than dropping unreliable data outright, we correct their labels with the model's own predictions. We additionally adopt the negative-pair-free DINO framework for further improvement. Compared to the best previously reported self-supervised speaker verification system, our proposed DLG-LC converges faster and achieves 11.45%, 18.35% and 15.16% relative improvements on the Vox-O, Vox-E and Vox-H trials of the VoxCeleb1 evaluation set.

* Accepted by Interspeech 2022 
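
A minimal sketch of the loss-gating idea as described above: fit a two-component GMM to per-sample losses, treat the low-loss component as reliable, and replace unreliable labels with confident model predictions. All names and the confidence threshold are assumptions, not the authors' code.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def dynamic_loss_gate(losses, pseudo_labels, probs, conf_thresh=0.9):
    """Toy dynamic loss-gate with label correction.

    losses        : per-sample losses under current pseudo labels, shape (N,)
    pseudo_labels : current cluster/pseudo-label assignments, shape (N,)
    probs         : model posteriors over pseudo classes, shape (N, C)
    """
    gmm = GaussianMixture(n_components=2, random_state=0)
    gmm.fit(losses.reshape(-1, 1))
    low_loss = np.argmin(gmm.means_.ravel())
    reliable = gmm.predict(losses.reshape(-1, 1)) == low_loss

    predicted = probs.argmax(axis=1)           # model's own predictions
    confident = probs.max(axis=1) > conf_thresh
    # keep pseudo labels where reliable; correct them where the model is confident
    labels = np.where(reliable, pseudo_labels, predicted)
    keep = reliable | confident                # samples used for training
    return keep, labels
```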

The SJTU X-LANCE Lab System for CNSRC 2022

Jun 23, 2022
Zhengyang Chen, Bei Liu, Bing Han, Leying Zhang, Yanmin Qian

This technical report describes the SJTU X-LANCE Lab system for the three tracks of CNSRC 2022, exploring the speaker embedding modeling ability of deep ResNets (deeper r-vectors). All systems are trained only on the CN-Celeb training set, and the same systems are used across all three tracks. Our system ranked first in the fixed track of the speaker verification task, with our best single system and fusion system achieving 0.3164 and 0.2975 minDCF, respectively. In addition, we submitted the ResNet221 result to the speaker retrieval track and achieved 0.4626 mAP.
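
For reference, the minDCF metric quoted above can be computed as follows. This is the standard definition, not the challenge's official scoring script, and the default `p_target` is an assumption.

```python
import numpy as np

def min_dcf(target_scores, nontarget_scores, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum normalized detection cost function (minDCF).

    Sweeps every observed score as a decision threshold and returns the
    lowest cost, normalized by the cost of a trivial always-accept/reject
    system."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    p_fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    dcf = c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa
    return dcf.min() / min(c_miss * p_target, c_fa * (1 - p_target))
```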


Cross-Architecture Self-supervised Video Representation Learning

May 26, 2022
Sheng Guo, Zihua Xiong, Yujie Zhong, Limin Wang, Xiaobo Guo, Bing Han, Weilin Huang

In this paper, we present a new cross-architecture contrastive learning (CACL) framework for self-supervised video representation learning. CACL consists of a 3D CNN and a video transformer, used in parallel to generate diverse positive pairs for contrastive learning, which allows the model to learn strong representations from diverse yet meaningful pairs. Furthermore, we introduce a temporal self-supervised learning module that explicitly predicts the edit distance between two video sequences in temporal order, enabling the model to learn a rich temporal representation that strongly complements the video-level representation learned by CACL. We evaluate our method on video retrieval and action recognition on the UCF101 and HMDB51 datasets, where it achieves excellent performance, surpassing state-of-the-art methods such as VideoMoCo and MoCo+BE by a large margin. The code is available at https://github.com/guoshengcv/CACL.

* Accepted to CVPR 2022
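
A hedged sketch of what a cross-architecture contrastive objective can look like: a symmetric InfoNCE loss in which the 3D CNN and video transformer embeddings of the same clip are positives and all other clips in the batch are negatives. This illustrates the general idea, not the paper's exact loss; names and the temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def cross_architecture_infonce(z_cnn, z_vit, temperature=0.07):
    """Symmetric InfoNCE between two architectures' embeddings.

    z_cnn, z_vit : (B, D) clip embeddings from the 3D CNN and the
                   transformer; row i of each tensor is the same clip.
    """
    z1 = F.normalize(z_cnn, dim=1)
    z2 = F.normalize(z_vit, dim=1)
    logits = z1 @ z2.t() / temperature                 # (B, B) similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    # symmetric: CNN -> transformer and transformer -> CNN directions
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```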

Poincaré Heterogeneous Graph Neural Networks for Sequential Recommendation

May 16, 2022
Naicheng Guo, Xiaolei Liu, Shaoshuai Li, Qiongxu Ma, Kaixin Gao, Bing Han, Lin Zheng, Xiaobo Guo

Sequential recommendation (SR) learns users' preferences by capturing sequential patterns in the evolution of user behavior. As discussed in many works, user-item interactions in SR generally follow an intrinsic power-law distribution, which can be mapped onto hierarchy-like structures. Previous methods usually handle such hierarchical information by empirically partitioning users and items in Euclidean space, which can distort user-item representations in real online scenarios. In this paper, we propose a Poincaré-based heterogeneous graph neural network, PHGR, that simultaneously models the sequential patterns and the hierarchical information contained in SR data. Specifically, to explicitly capture the hierarchical information, we first construct a weighted user-item heterogeneous graph by aligning all user-item interactions, enlarging each user's perception domain from a global view. The resulting global representation then complements a local, directed item-item homogeneous graph convolution. By defining a novel hyperbolic inner product operator, both global and local graph representation learning are conducted directly in the Poincaré ball, instead of via the commonly used projections between the Poincaré ball and Euclidean space, which alleviates the cumulative error of repeated bidirectional translation. Moreover, to explicitly capture sequential dependencies, we design two types of temporal attention operations in the Poincaré ball. Empirical evaluations on public and financial-industry datasets show that PHGR outperforms several comparison methods.

* 32 pages, 12 figures
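
The paper's hyperbolic inner product operator is novel and not reproduced here; the sketch below only shows the standard Poincaré-ball primitives (Möbius addition and geodesic distance) on which such operators are typically built.

```python
import torch

def mobius_add(x, y, c=1.0):
    """Möbius addition on the Poincaré ball with curvature -c."""
    xy = (x * y).sum(-1, keepdim=True)
    x2 = (x * x).sum(-1, keepdim=True)
    y2 = (y * y).sum(-1, keepdim=True)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c * c * x2 * y2
    return num / den.clamp_min(1e-15)

def poincare_dist(x, y, c=1.0):
    """Geodesic distance on the Poincaré ball; hierarchical data embeds
    with low distortion because volume grows exponentially with radius."""
    sqrt_c = c ** 0.5
    norm = mobius_add(-x, y, c).norm(dim=-1).clamp(max=1 - 1e-5)
    return (2.0 / sqrt_c) * torch.atanh(sqrt_c * norm)
```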

MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization

Apr 19, 2022
Wujiang Xu, Shaoshuai Li, Qiongxu Ma, Yunan Zhao, Sheng Guo, Xiaobo Guo, Bing Han, Junchi Yan, Yifei Xu

Video summarization aims to produce a concise summary by capturing and combining the most informative parts of a video. Existing approaches treat the task as frame-wise keyframe selection and generally construct frame representations by combining long-range temporal dependencies with unimodal or bimodal information. However, an optimal summary must reflect both the frame-level value of each keyframe and its semantic relation to the whole content, so it is critical to construct a more powerful and robust frame-wise representation and to predict frame-level importance scores fairly and comprehensively. To tackle these issues, we propose a multimodal hierarchical shot-aware convolutional network, MHSCNet, which enhances the frame-wise representation by combining all available multimodal information. Specifically, we design a hierarchical ShotConv network that produces an adaptive shot-aware frame-level representation accounting for both short-range and long-range temporal dependencies. Based on the learned shot-aware representations, MHSCNet predicts frame-level importance scores from both local and global views of the video. Extensive experiments on two standard video summarization datasets demonstrate that our method consistently outperforms state-of-the-art baselines. Source code will be made publicly available.
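
A toy sketch of a shot-aware importance scorer in the spirit described above: parallel temporal convolutions with small and large receptive fields stand in for the hierarchical short-/long-range modeling, and their fusion is mapped to a frame-level importance score. Layer sizes and structure are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ShotAwareScorer(nn.Module):
    """Toy frame-importance scorer over precomputed frame features."""
    def __init__(self, dim=1024, hidden=256):
        super().__init__()
        # small receptive field ~ intra-shot context
        self.short = nn.Conv1d(dim, hidden, kernel_size=3, padding=1)
        # dilated conv ~ long-range, cross-shot context
        self.long = nn.Conv1d(dim, hidden, kernel_size=3, padding=4, dilation=4)
        self.head = nn.Conv1d(2 * hidden, 1, kernel_size=1)

    def forward(self, frames):               # frames: (B, T, dim)
        x = frames.transpose(1, 2)           # (B, dim, T)
        h = torch.cat([self.short(x), self.long(x)], dim=1).relu()
        return self.head(h).squeeze(1).sigmoid()   # (B, T) importance scores
```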


AdaMixer: A Fast-Converging Query-Based Object Detector

Mar 31, 2022
Ziteng Gao, Limin Wang, Bing Han, Sheng Guo

Traditional object detectors employ the dense paradigm of scanning over locations and scales in an image. Recent query-based object detectors break this convention by decoding image features with a set of learnable queries. However, this paradigm still suffers from slow convergence, limited performance, and the design complexity of extra networks between the backbone and the decoder. In this paper, we find that the key to these issues is the adaptability of the decoder in casting queries onto varying objects. Accordingly, we propose a fast-converging query-based detector, AdaMixer, that improves the adaptability of the query-based decoding process in two ways. First, each query adaptively samples features over space and scales based on estimated offsets, allowing AdaMixer to efficiently attend to the coherent regions of objects. Second, these sampled features are dynamically decoded with an adaptive MLP-Mixer under the guidance of each query. Thanks to these two critical designs, AdaMixer enjoys architectural simplicity, requiring neither dense attentional encoders nor explicit pyramid networks. On the challenging MS COCO benchmark, AdaMixer with a ResNet-50 backbone and 12 training epochs reaches 45.0 AP on the validation set, with 27.9 AP on small objects. With longer training schedules, AdaMixer with ResNeXt-101-DCN and Swin-S reaches 49.5 and 51.3 AP, respectively. Our work sheds light on a simple, accurate, and fast-converging architecture for query-based object detectors. The code is available at https://github.com/MCG-NJU/AdaMixer.

* Accepted to CVPR 2022 (oral presentation) 
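
A simplified sketch of the query-conditioned sampling step described above, restricted to a single feature scale for brevity (AdaMixer also samples adaptively over scales and follows with adaptive MLP-Mixer decoding). Module and parameter names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveSampler(nn.Module):
    """Toy query-conditioned feature sampling: each query regresses a set
    of (x, y) offsets around its reference point and bilinearly samples
    the feature map there."""
    def __init__(self, dim=256, n_points=32):
        super().__init__()
        self.offsets = nn.Linear(dim, n_points * 2)
        self.n_points = n_points

    def forward(self, query, feat, ref):
        # query: (B, N, D), feat: (B, D, H, W), ref: (B, N, 2) in [-1, 1]
        B, N, _ = query.shape
        off = self.offsets(query).view(B, N, self.n_points, 2).tanh() * 0.1
        grid = (ref.unsqueeze(2) + off).clamp(-1, 1)       # (B, N, P, 2)
        sampled = F.grid_sample(feat, grid, align_corners=False)
        return sampled.permute(0, 2, 3, 1)                 # (B, N, P, D)
```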

Semi-Supervised Clustering with Contrastive Learning for Discovering New Intents

Jan 07, 2022
Feng Wei, Zhenbo Chen, Zhenghong Hao, Fengxin Yang, Hua Wei, Bing Han, Sheng Guo

Most real-world dialogue systems rely on predefined intents and answers for QA service, so discovering potential intents from a large corpus in advance is important for building such services. Since most scenarios have only a few known intents and many intents waiting to be discovered, we focus on semi-supervised text clustering and aim to let the proposed method benefit from labeled samples for better overall clustering performance. In this paper, we propose Deep Contrastive Semi-supervised Clustering (DCSC), which clusters text samples in a semi-supervised way and provides grouped intents to operations staff. To make full use of the limited known intents, we propose a two-stage training procedure in which DCSC is trained on both labeled and unlabeled samples, achieving better text representations and clustering performance. Experiments on two public datasets comparing our model with several popular methods show that DCSC achieves the best performance across all datasets and settings, demonstrating the effectiveness of our improvements.
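
A minimal sketch of a semi-supervised objective in the spirit of DCSC: an instance-level contrastive term over two augmented views of all samples, plus a supervised term on the labeled subset. The exact losses and two-stage schedule in the paper differ; everything here is illustrative.

```python
import torch
import torch.nn.functional as F

def dcsc_style_loss(z1, z2, logits, labels, labeled_mask, temperature=0.5):
    """Combined semi-supervised loss over one batch.

    z1, z2       : (B, D) embeddings of two augmented views of each text
    logits       : (B, C) cluster-assignment logits for view one
    labels       : (B,) intent labels (only valid where labeled_mask is True)
    labeled_mask : (B,) bool, which samples carry a known intent
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    sim = z1 @ z2.t() / temperature
    targets = torch.arange(z1.size(0), device=z1.device)
    contrastive = F.cross_entropy(sim, targets)      # unlabeled + labeled
    if labeled_mask.any():                           # supervised term
        supervised = F.cross_entropy(logits[labeled_mask], labels[labeled_mask])
    else:
        supervised = sim.new_zeros(())
    return contrastive + supervised
```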


Multi-layer VI-GNSS Global Positioning Framework with Numerical Solution aided MAP Initialization

Jan 05, 2022
Bing Han, Zhongyang Xiao, Shuai Huang, Tao Zhang

Motivated by the goal of long-term drift-free camera pose estimation in complex scenarios, we propose a global positioning framework that fuses visual, inertial, and Global Navigation Satellite System (GNSS) measurements in multiple layers. Unlike previous loosely and tightly coupled methods, the proposed multi-layer fusion delicately corrects the drift of visual odometry and maintains reliable positioning while GNSS degrades. In particular, local motion estimation is conducted in the inner layer, which addresses scale drift and inaccurate bias estimation in visual odometry by tightly fusing GNSS velocity, Inertial Measurement Unit (IMU) pre-integration, and camera measurements. Global localization is achieved in the outer layer, where the local motion is loosely fused with GNSS position and course over a long-term period. Furthermore, a dedicated initialization method guarantees fast and accurate estimation of all state variables and parameters. We test the proposed framework exhaustively on indoor and outdoor public datasets: the mean localization error is reduced by up to 63%, with a 69% improvement in initialization accuracy, compared with state-of-the-art works. We have applied the algorithm to Augmented Reality (AR) navigation, crowdsourced high-precision map updating, and other large-scale applications.
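
As a toy illustration of a loosely coupled outer layer, the sketch below blends a drifting odometry position with a GNSS fix via a single Kalman-style position update, weighting each source by its covariance. The actual framework fuses full poses, velocity, and course over a window; this is only the general principle.

```python
import numpy as np

def loosely_coupled_update(odom_pos, gnss_pos, odom_cov, gnss_cov):
    """One Kalman-style measurement update on position only.

    odom_pos, gnss_pos : (3,) position estimates from odometry / GNSS
    odom_cov, gnss_cov : (3, 3) covariances of the two estimates
    """
    # Kalman gain for a direct position measurement: K = P (P + R)^-1
    K = odom_cov @ np.linalg.inv(odom_cov + gnss_cov)
    fused_pos = odom_pos + K @ (gnss_pos - odom_pos)
    fused_cov = (np.eye(3) - K) @ odom_cov
    return fused_pos, fused_cov
```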


Oscillatory Fourier Neural Network: A Compact and Efficient Architecture for Sequential Processing

Sep 14, 2021
Bing Han, Cheng Wang, Kaushik Roy

Tremendous progress has been made in sequential processing with recent advances in recurrent neural networks. However, recurrent architectures face exploding/vanishing gradients during training and require significant computational resources for back-propagation through time. Moreover, large models are typically needed for complex sequential tasks. To address these challenges, we propose a novel neuron model with a cosine activation containing a time-varying component for sequential processing. The proposed neuron provides an efficient building block for projecting sequential inputs into the spectral domain, which helps retain long-term dependencies with minimal extra parameters and computation. We present a new recurrent architecture based on this neuron, the Oscillatory Fourier Neural Network, and apply it to various sequential tasks. We show that a recurrent network built from the proposed neuron is mathematically equivalent to a simplified discrete Fourier transform applied to a periodic activation. In particular, the computationally intensive back-propagation through time is eliminated, yielding faster training while matching state-of-the-art inference accuracy on a diverse set of sequential tasks. For instance, on sentiment analysis on the IMDB review dataset, the proposed model reaches 89.4% test accuracy within 5 epochs, with over a 35x reduction in model size compared to an LSTM. The proposed RNN architecture is well poised for intelligent sequential processing on resource-constrained hardware.
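
A toy reading of the proposed neuron, under stated assumptions: units accumulate the input modulated by fixed-frequency cosines and sines, so after the last step the state holds (a simplified form of) the sequence's discrete Fourier coefficients, computed recurrently with no back-propagation through time. Frequencies, shapes, and names are illustrative.

```python
import numpy as np

def oscillatory_fourier_features(x, n_freqs=16):
    """Recurrently project a scalar sequence onto a cosine/sine basis.

    x : (T,) input sequence; returns a 2*n_freqs spectral feature vector
    that a downstream classifier head could consume.
    """
    T = len(x)
    omegas = 2 * np.pi * np.arange(1, n_freqs + 1) / T
    state = np.zeros((n_freqs, 2))
    for t, x_t in enumerate(x):                   # recurrent accumulation
        state[:, 0] += x_t * np.cos(omegas * t)   # real (cosine) part
        state[:, 1] += x_t * np.sin(omegas * t)   # imaginary (sine) part
    return state.ravel()
```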
