Jian Wu

Improving Transformer-based Networks With Locality For Automatic Speaker Verification

Feb 17, 2023
Mufan Sang, Yong Zhao, Gang Liu, John H. L. Hansen, Jian Wu

Recently, Transformer-based architectures have been explored for speaker embedding extraction. Although the Transformer employs the self-attention mechanism to efficiently model the global interaction between token embeddings, it is inadequate for capturing short-range local context, which is essential for the accurate extraction of speaker information. In this study, we enhance the Transformer with locality modeling in two directions. First, we propose the Locality-Enhanced Conformer (LE-Conformer) by introducing depth-wise convolution and channel-wise attention into the Conformer blocks. Second, we present the Speaker Swin Transformer (SST) by adapting the Swin Transformer, originally proposed for vision tasks, into a speaker embedding network. We evaluate the proposed approaches on the VoxCeleb datasets and a large-scale Microsoft internal multilingual (MS-internal) dataset. The proposed models achieve 0.75% EER on the VoxCeleb 1 test set, outperforming the previously proposed Transformer-based models and CNN-based models, such as ResNet34 and ECAPA-TDNN. When trained on the MS-internal dataset, the proposed models achieve promising results with a 14.6% relative reduction in EER over the Res2Net50 model.

* Accepted to ICASSP 2023

Speaker Change Detection for Transformer Transducer ASR

Feb 16, 2023
Jian Wu, Zhuo Chen, Min Hu, Xiong Xiao, Jinyu Li

Speaker change detection (SCD) is an important feature that improves the readability of the recognized words from an automatic speech recognition (ASR) system by breaking the word sequence into paragraphs at speaker change points. Existing SCD solutions either require an additional ensemble for the time-based decisions and recognized word sequences, or implement a tight integration between ASR and SCD, limiting the potential optimum performance for both tasks. To address these issues, we propose a novel framework for the SCD task, where an additional SCD module is built on top of an existing Transformer Transducer ASR (TT-ASR) network. Two variants of the SCD network are explored in this framework that naturally estimate the speaker change probability for each word, while allowing the ASR and SCD to have independent optimization schemes for the best performance. Experiments show that our methods can significantly improve the F1 score on LibriCSS and Microsoft call center data sets without ASR degradation, compared with a joint SCD and ASR baseline.

* 5 pages, 1 figure, accepted by ICASSP 2023
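
As a rough illustration of how a per-word speaker change probability can be turned into paragraphs, the sketch below thresholds the probabilities and starts a new paragraph at each predicted change point. It is not the method from the paper: the function name `split_at_speaker_changes`, the convention that a probability marks the first word of a new speaker, and the 0.5 threshold are all assumptions made for this example.

```python
from typing import List, Sequence


def split_at_speaker_changes(
    words: Sequence[str],
    change_probs: Sequence[float],
    threshold: float = 0.5,
) -> List[str]:
    """Group words into paragraphs, opening a new one at each predicted change.

    change_probs[i] is a hypothetical probability that word i is the first
    word of a new speaker; the threshold is an illustrative choice.
    """
    if len(words) != len(change_probs):
        raise ValueError("words and change_probs must have the same length")

    paragraphs: List[List[str]] = []
    for word, prob in zip(words, change_probs):
        # Open a new paragraph at the very first word or at a predicted change.
        if not paragraphs or prob >= threshold:
            paragraphs.append([])
        paragraphs[-1].append(word)
    return [" ".join(paragraph) for paragraph in paragraphs]


if __name__ == "__main__":
    words = ["hello", "there", "hi", "how", "are", "you"]
    probs = [0.1, 0.0, 0.9, 0.1, 0.2, 0.05]
    for paragraph in split_at_speaker_changes(words, probs):
        print(paragraph)
```

Because the decision is made per word, the readability step stays independent of how the probabilities were produced, which mirrors the separation between ASR and SCD that the abstract describes.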

Sample-efficient Multi-objective Molecular Optimization with GFlowNets

Feb 08, 2023
Yiheng Zhu, Jialu Wu, Chaowen Hu, Jiahuan Yan, Chang-Yu Hsieh, Tingjun Hou, Jian Wu

Many crucial scientific problems involve designing novel molecules with desired properties, which can be formulated as an expensive black-box optimization problem over the discrete chemical space. Computational methods have achieved initial success but still struggle with simultaneously optimizing multiple competing properties in a sample-efficient manner. In this work, we propose a multi-objective Bayesian optimization (MOBO) algorithm leveraging the hypernetwork-based GFlowNets (HN-GFN) as an acquisition function optimizer, with the purpose of sampling a diverse batch of candidate molecular graphs from an approximate Pareto front. Using a single preference-conditioned hypernetwork, HN-GFN learns to explore various trade-offs between objectives. Inspired by reinforcement learning, we further propose a hindsight-like off-policy strategy to share high-performing molecules among different preferences in order to speed up learning for HN-GFN. Through synthetic experiments, we illustrate that HN-GFN has adequate capacity to generalize over preferences. Extensive experiments show that our framework outperforms the best baselines by a large margin in terms of hypervolume in various real-world MOBO settings.

* 15 pages, 6 figures
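
The abstract evaluates candidate batches by hypervolume over an approximate Pareto front. The sketch below shows what those two quantities mean in the simplest case of two objectives that are both maximized. It is a generic textbook computation, not code from the paper; the function names, the reference point at the origin, and the example scores are assumptions for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated points when both objectives are maximized."""
    front = []
    for p in points:
        # p is dominated if a different point is at least as good in both objectives.
        dominated = any(
            q != p and q[0] >= p[0] and q[1] >= p[1] for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(set(front))


def hypervolume_2d(points: List[Point], ref: Point = (0.0, 0.0)) -> float:
    """Area dominated by the points and bounded below by the reference point."""
    # Sweep from the largest first objective to the smallest, adding one
    # rectangular strip each time the second objective improves.
    front = sorted(pareto_front(points), key=lambda p: p[0], reverse=True)
    area = 0.0
    best_y = ref[1]
    for x, y in front:
        if x > ref[0] and y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


if __name__ == "__main__":
    scores = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (1.0, 1.0)]
    print(pareto_front(scores))
    print(hypervolume_2d(scores))
```

A larger hypervolume means the batch covers more of the objective space relative to the reference point, so it rewards both quality and diversity of trade-offs.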

ACL-Fig: A Dataset for Scientific Figure Classification

Jan 28, 2023
Zeba Karishma, Shaurya Rohatgi, Kavya Shrinivas Puranik, Jian Wu, C. Lee Giles

Most existing large-scale academic search engines are built to retrieve text-based information. However, there are no large-scale retrieval services for scientific figures and tables. One challenge for such services is understanding scientific figures' semantics, such as their types and purposes. A key obstacle is the need for datasets containing annotated scientific figures and tables, which can then be used for classification, question-answering, and auto-captioning. Here, we develop a pipeline that extracts figures and tables from the scientific literature and a deep-learning-based framework that classifies scientific figures using visual features. Using this pipeline, we built the first large-scale automatically annotated corpus, ACL-Fig, consisting of 112,052 scientific figures extracted from ~56K research papers in the ACL Anthology. The ACL-Fig-Pilot dataset contains 1,671 manually labeled scientific figures belonging to 19 categories. The dataset is accessible at https://huggingface.co/datasets/citeseerx/ACL-fig under a CC BY-NC license.

* 6 pages, 4 figures, accepted by the AAAI-23 Workshop on Scientific Document Understanding

ExcelFormer: A Neural Network Surpassing GBDTs on Tabular Data

Jan 24, 2023
Jintai Chen, Jiahuan Yan, Danny Ziyi Chen, Jian Wu

Though deep neural networks have achieved enormous success in various fields (e.g., computer vision) with supervised learning, they still trail GBDTs in performance on tabular data. Delving into this task, we determine that judicious handling of feature interactions and feature representation is crucial to the effectiveness of neural networks on tabular data. We develop a novel neural network called ExcelFormer, which alternates between two attention modules that manipulate feature interactions and feature representation updates, respectively. A bespoke training methodology is jointly introduced to improve model performance. Specifically, by initializing parameters with minuscule values, these attention modules are attenuated when training begins, and the effects of feature interactions and representation updates grow progressively to optimum levels as training proceeds, guided by our proposed regularization schemes Feat-Mix and Hidden-Mix. Experiments on 28 public tabular datasets show that our ExcelFormer approach is superior to extensively tuned GBDTs, which represents unprecedented progress for deep neural networks on supervised tabular learning.

CTT-Net: A Multi-view Cross-token Transformer for Cataract Postoperative Visual Acuity Prediction

Dec 12, 2022
Jinhong Wang, Jingwen Wang, Tingting Chen, Wenhao Zheng, Zhe Xu, Xingdi Wu, Wen Xu, Haochao Ying, Danny Chen, Jian Wu

Surgery is the only viable treatment for cataract patients with visual acuity (VA) impairment. Clinically, to assess the necessity of cataract surgery, it is crucial to accurately predict postoperative VA before surgery by analyzing multi-view optical coherence tomography (OCT) images. Unfortunately, due to complicated fundus conditions, determining postoperative VA remains difficult for medical experts. Deep learning methods for this problem have been developed in recent years. Although effective, these methods still face several issues, such as not efficiently exploring potential relations between multi-view OCT images, neglecting the key role of clinical prior knowledge (e.g., preoperative VA value), and using only regression-based metrics that lack a reference. In this paper, we propose a novel Cross-token Transformer Network (CTT-Net) for postoperative VA prediction by analyzing both the multi-view OCT images and preoperative VA. To effectively fuse multi-view features of OCT images, we develop cross-token attention that can restrict redundant or unnecessary attention flow. Further, we utilize the preoperative VA value to provide more information for postoperative VA prediction and to facilitate fusion between views. Moreover, we design an auxiliary classification loss to improve model performance and assess VA recovery more fully, avoiding the limitation of using only regression metrics. To evaluate CTT-Net, we build a multi-view OCT image dataset collected from our collaborative hospital. Extensive experiments validate the effectiveness of our model compared to existing methods on various metrics. Code is available at: https://github.com/wjh892521292/Cataract OCT.

* 5 pages, 3 figures, accepted for publication in BIBM

T2G-Former: Organizing Tabular Features into Relation Graphs Promotes Heterogeneous Feature Interaction

Nov 30, 2022
Jiahuan Yan, Jintai Chen, Yixuan Wu, Danny Z. Chen, Jian Wu

Recent development of deep neural networks (DNNs) for tabular learning has largely benefited from the capability of DNNs for automatic feature interaction. However, the heterogeneous nature of tabular features makes such features relatively independent, and developing effective methods to promote tabular feature interaction remains an open problem. In this paper, we propose a novel Graph Estimator, which automatically estimates the relations among tabular features and builds graphs by assigning edges between related features. Such relation graphs organize independent tabular features into a kind of graph data such that interaction of nodes (tabular features) can be conducted in an orderly fashion. Based on our proposed Graph Estimator, we present a bespoke Transformer network tailored for tabular learning, called T2G-Former, which processes tabular data by performing tabular feature interaction guided by the relation graphs. A specific Cross-level Readout collects salient features predicted by the layers in T2G-Former across different levels, and attains global semantics for final prediction. Comprehensive experiments show that our T2G-Former achieves superior performance among DNNs and is competitive with non-deep Gradient Boosted Decision Tree models.

* 13 pages, 3 figures

Simulating realistic speech overlaps improves multi-talker ASR

Nov 17, 2022
Muqiao Yang, Naoyuki Kanda, Xiaofei Wang, Jian Wu, Sunit Sivasankaran, Zhuo Chen, Jinyu Li, Takuya Yoshioka

Multi-talker automatic speech recognition (ASR) has been studied to generate transcriptions of natural conversation including overlapping speech of multiple speakers. Due to the difficulty in acquiring real conversation data with high-quality human transcriptions, a naïve simulation of multi-talker speech by randomly mixing multiple utterances was conventionally used for model training. In this work, we propose an improved technique to simulate multi-talker overlapping speech with realistic speech overlaps, where an arbitrary pattern of speech overlaps is represented by a sequence of discrete tokens. With this representation, speech overlapping patterns can be learned from real conversations based on a statistical language model, such as N-gram, which can then be used to generate multi-talker speech for training. In our experiments, multi-talker ASR models trained with the proposed method show consistent improvements in word error rate across multiple datasets.

* v2: fix minor typo
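
To make the idea of learning overlap patterns with an N-gram model concrete, here is a minimal bigram sketch over discrete tokens. The token vocabulary is an assumption made for this example (each token stands for the number of simultaneously active speakers in a fixed-length frame); the paper's actual token design is not given in the abstract. The names `train_bigram` and `sample_pattern` are likewise hypothetical.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Sequence

BOS, EOS = "<s>", "</s>"  # sentence boundary markers for the bigram model


def train_bigram(sequences: Sequence[Sequence[str]]) -> Dict[str, Counter]:
    """Count token-to-token transitions over the training sequences."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for seq in sequences:
        tokens = [BOS, *seq, EOS]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def sample_pattern(
    counts: Dict[str, Counter], rng: random.Random, max_len: int = 50
) -> List[str]:
    """Sample one overlap pattern by walking the bigram transition counts."""
    if not counts.get(BOS):
        raise ValueError("the model has no training data")
    pattern: List[str] = []
    prev = BOS
    while len(pattern) < max_len:
        successors = counts[prev]
        # Draw the next token in proportion to its observed transition count.
        nxt = rng.choices(list(successors), weights=list(successors.values()))[0]
        if nxt == EOS:
            break
        pattern.append(nxt)
        prev = nxt
    return pattern


if __name__ == "__main__":
    # Hypothetical token sequences: "0" silence, "1" one speaker, "2" overlap.
    real_patterns = [["1", "2", "1"], ["1", "0", "1"], ["1", "1", "2", "2", "1"]]
    model = train_bigram(real_patterns)
    print(sample_pattern(model, random.Random(0)))
```

A sampled pattern would then drive how utterances are mixed when building training audio; that mixing step is outside the scope of this sketch.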

Robust Training of Graph Neural Networks via Noise Governance

Nov 12, 2022
Siyi Qian, Haochao Ying, Renjun Hu, Jingbo Zhou, Jintai Chen, Danny Z. Chen, Jian Wu

Graph Neural Networks (GNNs) have become widely used models for semi-supervised learning. However, the robustness of GNNs in the presence of label noise remains a largely under-explored problem. In this paper, we consider an important yet challenging scenario where labels on nodes of graphs are not only noisy but also scarce. In this scenario, the performance of GNNs is prone to degrade due to label noise propagation and insufficient learning. To address these issues, we propose a novel RTGNN (Robust Training of Graph Neural Networks via Noise Governance) framework that achieves better robustness by learning to explicitly govern label noise. More specifically, we introduce self-reinforcement and consistency regularization as supplemental supervision. The self-reinforcement supervision is inspired by the memorization effects of deep neural networks and aims to correct noisy labels. Further, the consistency regularization prevents GNNs from overfitting to noisy labels via mimicry loss in both the inter-view and intra-view perspectives. To leverage such supervision, we divide labels into clean and noisy types, rectify inaccurate labels, and further generate pseudo-labels on unlabeled nodes. Supervision for nodes with different types of labels is then chosen adaptively. This enables sufficient learning from clean labels while limiting the impact of noisy ones. We conduct extensive experiments to evaluate the effectiveness of our RTGNN framework, and the results validate its consistently superior performance over state-of-the-art methods with two types of label noise and various noise rates.

* 9 pages, accepted to WSDM 2023
