Ye Zhang

Multi-level feature fusion network combining attention mechanisms for polyp segmentation

Sep 24, 2023
Junzhuo Liu, Qiaosong Chen, Ye Zhang, Zhixiang Wang, Deng Xin, Jin Wang


Clinically, automated polyp segmentation techniques have the potential to significantly improve the efficiency and accuracy of medical diagnosis, thereby reducing the risk of colorectal cancer in patients. Unfortunately, existing methods suffer from two significant weaknesses that can impact segmentation accuracy. Firstly, features extracted by the encoder are not adequately filtered and utilized. Secondly, the semantic conflicts and information redundancy caused by feature fusion are not attended to. To overcome these limitations, we propose a novel approach for polyp segmentation, named MLFF-Net, which leverages multi-level feature fusion and attention mechanisms. Specifically, MLFF-Net comprises three modules: a Multi-scale Attention Module (MAM), a High-level Feature Enhancement Module (HFEM), and a Global Attention Module (GAM). Among these, MAM is used to extract multi-scale information and polyp details from the shallow output of the encoder. In HFEM, the deep features of the encoder complement each other through aggregation, while the attention mechanism redistributes the weights of the aggregated features, weakening the conflicting, redundant parts and highlighting the information useful to the task. GAM combines encoder and decoder features and computes global dependencies to counteract the locality of the receptive field. Experimental results on five public datasets show that the proposed method not only segments multiple types of polyps but also surpasses current state-of-the-art methods in both accuracy and generalization ability.
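
To make the global-attention idea concrete, here is a minimal sketch of how a module like GAM might reweight fused encoder/decoder features by attending over all spatial positions. The shapes, the additive fusion, and the plain self-attention are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def global_attention(enc_feat, dec_feat):
    """Illustrative global attention over fused encoder/decoder features.

    enc_feat, dec_feat: (H*W, C) flattened feature maps from the encoder and
    decoder at the same resolution. Attending over all spatial positions lets
    each pixel aggregate context beyond its local receptive field.
    """
    fused = enc_feat + dec_feat                      # simple additive fusion (assumed)
    q = k = v = fused                                # self-attention on the fused map
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (H*W, H*W) global dependencies
    return attn @ v                                  # re-weighted features, same shape

# toy example: an 8x8 feature map with 16 channels
hw, c = 64, 16
out = global_attention(np.random.rand(hw, c), np.random.rand(hw, c))
print(out.shape)  # (64, 16)
```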


Graph-based Asynchronous Event Processing for Rapid Object Recognition

Aug 28, 2023
Yijin Li, Han Zhou, Bangbang Yang, Ye Zhang, Zhaopeng Cui, Hujun Bao, Guofeng Zhang


Different from traditional video cameras, event cameras capture an asynchronous event stream in which each event encodes the pixel location, trigger time, and polarity of a brightness change. In this paper, we introduce a novel graph-based framework for event cameras, namely SlideGCN. Unlike some recent graph-based methods that use groups of events as input, our approach can efficiently process data event by event, unlocking the low-latency nature of event data while still maintaining the graph's structure internally. For fast graph construction, we develop a radius search algorithm that exploits the partially regular structure of the event cloud better than generic k-d tree based methods. Experiments show that our method reduces computational complexity by up to 100 times compared with current graph-based methods while keeping state-of-the-art performance on object recognition. Moreover, we verify the superiority of event-wise processing with our method: once the state becomes stable, we can make a prediction with high confidence, enabling early recognition. Project page: \url{https://zju3dv.github.io/slide_gcn/}.
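
To illustrate why a grid-aware radius search can beat a generic k-d tree here: events live on a regular sensor grid, so spatial neighbours can be found by scanning a fixed window of hash buckets. The sketch below is a simplified stand-in under that assumption, not the SlideGCN code:

```python
import math
from collections import defaultdict

def build_grid(events, cell):
    """Bucket events (x, y, t) by pixel cell; the sensor grid is regular,
    so bucketing is O(1) per event (no tree building or rebalancing)."""
    grid = defaultdict(list)
    for i, (x, y, t) in enumerate(events):
        grid[(int(x // cell), int(y // cell))].append(i)
    return grid

def radius_search(events, grid, query, r, cell):
    """Return indices of events within spatial radius r of the query event,
    scanning only the fixed window of cells that can contain them."""
    qx, qy, _ = query
    reach = int(math.ceil(r / cell))
    cx, cy = int(qx // cell), int(qy // cell)
    hits = []
    for gx in range(cx - reach, cx + reach + 1):
        for gy in range(cy - reach, cy + reach + 1):
            for i in grid.get((gx, gy), []):
                ex, ey, _ = events[i]
                if (ex - qx) ** 2 + (ey - qy) ** 2 <= r * r:
                    hits.append(i)
    return hits

events = [(10, 12, 0.001), (11, 12, 0.002), (40, 5, 0.003)]
grid = build_grid(events, cell=4)
print(radius_search(events, grid, query=(10, 12, 0.004), r=3, cell=4))  # [0, 1]
```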

* Accepted to ICCV 2021. Project Page: https://zju3dv.github.io/slide_gcn/ 

Are words equally surprising in audio and audio-visual comprehension?

Jul 14, 2023
Pranava Madhyastha, Ye Zhang, Gabriella Vigliocco


We report a controlled study investigating the effect of visual information (i.e., seeing the speaker) on spoken language comprehension. We compare the ERP signature (N400) associated with each word in audio-only and audio-visual presentations of the same verbal stimuli. We assess the extent to which surprisal measures (which quantify the predictability of words in their lexical context), computed with different types of language models (specifically n-gram and Transformer models), predict the N400 response for each word. Our results indicate that cognitive effort differs significantly between multimodal and unimodal settings. In addition, our findings suggest that while Transformer-based models, which have access to a larger lexical context, provide a better fit in the audio-only setting, 2-gram language models are more effective in the multimodal setting. This highlights the significant impact of local lexical context on cognitive processing in a multimodal environment.
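
For readers unfamiliar with surprisal: it is the negative log probability of a word given its preceding context. A toy 2-gram version (the add-alpha smoothing and the miniature corpus are illustrative choices, not the study's actual models or data) can be sketched as:

```python
import math
from collections import Counter

def bigram_surprisal(corpus_tokens, alpha=1.0):
    """Return a function mapping (prev_word, word) -> surprisal in bits,
    using add-alpha smoothed bigram estimates from a toy corpus."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    vocab = len(unigrams)

    def surprisal(prev, word):
        p = (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab)
        return -math.log2(p)

    return surprisal

corpus = "the dog chased the cat and the dog barked".split()
s = bigram_surprisal(corpus)
print(s("the", "dog"))     # low surprisal: "the dog" is frequent in the toy corpus
print(s("the", "barked"))  # higher surprisal: unseen bigram
```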

* In CogSci 2023 

CoNIC Challenge: Pushing the Frontiers of Nuclear Detection, Segmentation, Classification and Counting

Mar 14, 2023
Simon Graham, Quoc Dang Vu, Mostafa Jahanifar, Martin Weigert, Uwe Schmidt, Wenhua Zhang, Jun Zhang, Sen Yang, Jinxi Xiang, Xiyue Wang, Josef Lorenz Rumberger, Elias Baumann, Peter Hirsch, Lihao Liu, Chenyang Hong, Angelica I. Aviles-Rivero, Ayushi Jain, Heeyoung Ahn, Yiyu Hong, Hussam Azzuni, Min Xu, Mohammad Yaqub, Marie-Claire Blache, Benoît Piégu, Bertrand Vernay, Tim Scherr, Moritz Böhland, Katharina Löffler, Jiachen Li, Weiqin Ying, Chixin Wang, Dagmar Kainmueller, Carola-Bibiane Schönlieb, Shuolin Liu, Dhairya Talsania, Yughender Meda, Prakash Mishra, Muhammad Ridzuan, Oliver Neumann, Marcel P. Schilling, Markus Reischl, Ralf Mikut, Banban Huang, Hsiang-Chin Chien, Ching-Ping Wang, Chia-Yen Lee, Hong-Kun Lin, Zaiyi Liu, Xipeng Pan, Chu Han, Jijun Cheng, Muhammad Dawood, Srijay Deshpande, Raja Muhammad Saad Bashir, Adam Shephard, Pedro Costa, João D. Nunes, Aurélio Campilho, Jaime S. Cardoso, Hrishikesh P S, Densen Puthussery, Devika R G, Jiji C V, Ye Zhang, Zijie Fang, Zhifan Lin, Yongbing Zhang, Chunhui Lin, Liukun Zhang, Lijian Mao, Min Wu, Vi Thi-Tuong Vo, Soo-Hyung Kim, Taebum Lee, Satoshi Kondo, Satoshi Kasai, Pranay Dumbhare, Vedant Phuse, Yash Dubey, Ankush Jamthikar, Trinh Thi Le Vuong, Jin Tae Kwak, Dorsa Ziaei, Hyun Jung, Tianyi Miao, David Snead, Shan E Ahmed Raza, Fayyaz Minhas, Nasir M. Rajpoot


Nuclear detection, segmentation and morphometric profiling are essential in helping us further understand the relationship between histology and patient outcome. To drive innovation in this area, we set up a community-wide challenge using the largest available dataset of its kind to assess nuclear segmentation and cellular composition. Our challenge, named CoNIC, stimulated the development of reproducible algorithms for cellular recognition with real-time result inspection on public leaderboards. We conducted an extensive post-challenge analysis based on the top-performing models using 1,658 whole-slide images of colon tissue. With around 700 million detected nuclei per model, the associated features were used for dysplasia grading and survival analysis, where we demonstrated that the challenge's improvement over the previous state of the art led to significant boosts in downstream performance. Our findings also suggest that eosinophils and neutrophils play an important role in the tumour microenvironment. We release challenge models and WSI-level results to foster the development of further methods for biomarker discovery.


A Novel Speech Feature Fusion Algorithm for Text-Independent Speaker Recognition

Dec 01, 2022
Biao Ma, Chengben Xu, Ye Zhang


A novel speech feature fusion algorithm combining independent vector analysis (IVA) and a parallel convolutional neural network (PCNN) is proposed for text-independent speaker recognition. First, different feature types, such as time-domain (TD) and frequency-domain (FD) features, are extracted from a speaker's speech; the TD and FD features can be regarded as linear mixtures of independent feature components (IFCs) under an unknown mixing system. To estimate the IFCs, the TD and FD features are each concatenated to build a TD and an FD feature matrix, and a feature tensor is then obtained by stacking the two matrices in parallel. To enhance the dependence between different feature types and remove redundancies within the same feature type, IVA is applied to this feature tensor to estimate the IFC matrices of the TD and FD features. The IFC matrices are fed to the PCNN to extract deep TD and FD features, which are integrated into a fused feature of the speaker's speech. Finally, this fusion feature is used as the input to a deep convolutional neural network (DCNN) classifier for speaker recognition. Experimental results show the effectiveness and good performance of the proposed speaker recognition system.
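
A shape-level sketch of the data flow described above; the IVA, PCNN branches, and DCNN classifier are passed in as hypothetical stand-ins, since the abstract does not specify their exact configurations:

```python
import numpy as np

def speaker_feature_fusion(td_feats, fd_feats, iva, pcnn_td, pcnn_fd, dcnn):
    """td_feats, fd_feats : (n_frames, d) time- and frequency-domain feature
                            matrices built by concatenating per-frame features.
       iva                : independent vector analysis; takes the (2, n_frames, d)
                            feature tensor and returns the two IFC matrices.
       pcnn_td, pcnn_fd   : the two branches of the parallel CNN.
       dcnn               : the final classifier over the fused feature.
       All callables are hypothetical stand-ins for the paper's components."""
    tensor = np.stack([td_feats, fd_feats])                # parallel TD/FD feature tensor
    ifc_td, ifc_fd = iva(tensor)                           # independent feature components
    deep_td, deep_fd = pcnn_td(ifc_td), pcnn_fd(ifc_fd)    # deep features per branch
    fusion = np.concatenate([deep_td, deep_fd], axis=-1)   # fused speaker feature
    return dcnn(fusion)                                    # speaker posterior

# toy run with stand-in components (identity "IVA", mean-pooling "CNNs")
td, fd = np.random.randn(200, 40), np.random.randn(200, 40)
out = speaker_feature_fusion(
    td, fd,
    iva=lambda t: (t[0], t[1]),          # placeholder: real IVA estimates the IFCs
    pcnn_td=lambda m: m.mean(axis=0),
    pcnn_fd=lambda m: m.mean(axis=0),
    dcnn=lambda f: f[:5],                # placeholder classifier head
)
print(out.shape)  # (5,)
```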


A new Speech Feature Fusion method with cross gate parallel CNN for Speaker Recognition

Nov 24, 2022
Jiacheng Zhang, Wenyi Yan, Ye Zhang


In this paper, a new speech feature fusion method based on a cross gate parallel convolutional neural network (CG-PCNN) is proposed for speaker recognition. Mel filter bank features (MFBFs) of different frequency resolutions are extracted from each frame of a speaker's speech by several Mel filter banks, where the numbers of triangular filters in the banks differ. Because these MFBFs have different frequency resolutions, they carry complementary information. The CG-PCNN extracts deep features from these MFBFs, applying a cross gate mechanism to capture this complementary information and improve the performance of the speaker recognition system. The fusion feature is then obtained by concatenating the deep features for speaker recognition. Experimental results show that the speaker recognition system with the proposed speech feature fusion method is effective and marginally outperforms existing state-of-the-art systems.
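
A minimal numpy sketch of one common cross-gate formulation, in which each branch is modulated by a sigmoid gate computed from the other branch; the gate projections and fusion by concatenation are assumptions for illustration, and the paper's exact layers may differ:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_gate(feat_a, feat_b, w_a, w_b):
    """feat_a, feat_b: (n, d) deep features from the two PCNN branches
    (Mel filter banks with different numbers of triangular filters).
    w_a, w_b: (d, d) gate projections (illustrative; learned in practice)."""
    gated_a = feat_a * sigmoid(feat_b @ w_b)   # branch A gated by branch B
    gated_b = feat_b * sigmoid(feat_a @ w_a)   # branch B gated by branch A
    return np.concatenate([gated_a, gated_b], axis=-1)  # fused feature

n, d = 4, 8
fused = cross_gate(np.random.randn(n, d), np.random.randn(n, d),
                   np.random.randn(d, d), np.random.randn(d, d))
print(fused.shape)  # (4, 16)
```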


N-Grammer: Augmenting Transformers with latent n-grams

Jul 13, 2022
Aurko Roy, Rohan Anil, Guangda Lai, Benjamin Lee, Jeffrey Zhao, Shuyuan Zhang, Shibo Wang, Ye Zhang, Shen Wu, Rigel Swavely, Tao Yu, Phuong Dao, Christopher Fifty, Zhifeng Chen, Yonghui Wu


Transformer models have recently emerged as one of the foundational models in natural language processing, and as a byproduct, there is significant recent interest and investment in scaling these models. However, the training and inference costs of these large Transformer language models are prohibitive, necessitating more research into identifying more efficient variants. In this work, we propose a simple yet effective modification to the Transformer architecture, inspired by the statistical language modeling literature, that augments the model with n-grams constructed from a discrete latent representation of the text sequence. We evaluate our model, the N-Grammer, on language modeling on the C4 dataset as well as text classification on the SuperGLUE dataset, and find that it outperforms several strong baselines such as the Transformer and the Primer. We open-source our model in Jax for reproducibility purposes.
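
A simplified sketch of the core idea: form bigram IDs over a discrete latent representation of the tokens, look them up in an auxiliary embedding table, and concatenate the result with the token representation. The hashing scheme, table sizes, and the assumption that the latent cluster IDs are already given are all illustrative, not the released Jax code:

```python
import numpy as np

def ngrammer_augment(token_ids, cluster_ids, tok_emb, bigram_emb, num_buckets):
    """token_ids   : (seq,) discrete tokens
       cluster_ids : (seq,) discrete latent ids, e.g. from quantizing the
                     token embeddings (assumed given here)
       tok_emb     : (vocab, d_tok) token embedding table
       bigram_emb  : (num_buckets, d_ng) latent bigram embedding table"""
    prev = np.concatenate([[0], cluster_ids[:-1]])             # shift to form bigrams
    bigram_ids = (prev * 1000003 + cluster_ids) % num_buckets  # hash each latent bigram
    tok = tok_emb[token_ids]                                   # (seq, d_tok)
    ng = bigram_emb[bigram_ids]                                # (seq, d_ng)
    return np.concatenate([tok, ng], axis=-1)                  # augmented representation

vocab, buckets, d_tok, d_ng, seq = 100, 4096, 16, 8, 5
x = ngrammer_augment(np.random.randint(0, vocab, seq),
                     np.random.randint(0, 64, seq),
                     np.random.randn(vocab, d_tok),
                     np.random.randn(buckets, d_ng),
                     buckets)
print(x.shape)  # (5, 24)
```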

* 8 pages, 2 figures 

Effective Sequence-to-Sequence Dialogue State Tracking

Sep 08, 2021
Jeffrey Zhao, Mahdis Mahdieh, Ye Zhang, Yuan Cao, Yonghui Wu


Sequence-to-sequence models have been applied to a wide variety of NLP tasks, but how to properly use them for dialogue state tracking has not been systematically investigated. In this paper, we study this problem from the perspectives of pre-training objectives as well as the formats of context representations. We demonstrate that the choice of pre-training objective makes a significant difference to state tracking quality. In particular, we find that masked span prediction is more effective than auto-regressive language modeling. We also explore using Pegasus, a span prediction-based pre-training objective for text summarization, for the state tracking model. We find that pre-training on the seemingly distant summarization task works surprisingly well for dialogue state tracking. In addition, we find that while the recurrent state context representation also works reasonably well, the model may have a hard time recovering from earlier mistakes. We conduct experiments on the MultiWOZ 2.1-2.4, WOZ 2.0, and DSTC2 datasets with consistent observations.
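
To make the context-representation formats concrete, the illustrative helper below serializes one dialogue turn into a seq2seq encoder input, contrasting a full-history input with a recurrent-state input; the bracketed markers and slot strings are hypothetical, not the paper's exact format:

```python
def serialize_turn(history, prev_state, user_utt, use_recurrent_state):
    """Build the encoder input for one dialogue state tracking turn.
    With a recurrent state representation, the previously predicted state
    replaces the full dialogue history, keeping the input short but making
    it harder to recover from earlier mistakes."""
    if use_recurrent_state:
        context = "[state] " + " ; ".join(f"{k}={v}" for k, v in prev_state.items())
    else:
        context = " ".join(history)
    return f"{context} [user] {user_utt}"

history = ["[user] I need a cheap hotel in the north .",
           "[system] Okay , for how many nights ?"]
prev_state = {"hotel-pricerange": "cheap", "hotel-area": "north"}
print(serialize_turn(history, prev_state, "Three nights , please .", True))
print(serialize_turn(history, prev_state, "Three nights , please .", False))
# decoder target in both cases would be the updated state string, e.g.
# "hotel-pricerange=cheap ; hotel-area=north ; hotel-stay=3"
```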

* Accepted at EMNLP 2021 