Lei Sun

The USTC-NERCSLIP Systems for the CHiME-7 DASR Challenge

Aug 28, 2023
Ruoyu Wang, Maokui He, Jun Du, Hengshun Zhou, Shutong Niu, Hang Chen, Yanyan Yue, Gaobin Yang, Shilong Wu, Lei Sun, Yanhui Tu, Haitao Tang, Shuangqing Qian, Tian Gao, Mengzhi Wang, Genshun Wan, Jia Pan, Jianqing Gao, Chin-Hui Lee

This technical report details our submission to the CHiME-7 DASR Challenge, which focuses on speaker diarization and speech recognition in complex multi-speaker settings and also evaluates how efficiently systems handle diverse array devices. To address these issues, we implemented an end-to-end speaker diarization system and introduced a rectification strategy based on multi-channel spatial information, which significantly reduced the word error rate (WER). For recognition, we used publicly available pre-trained models as the foundation for training our end-to-end speech recognition models. Our system attained a macro-averaged diarization-attributed WER (DA-WER) of 22.4% on the CHiME-7 development set, a relative improvement of 52.5% over the official baseline system.

* Accepted by 2023 CHiME Workshop, Oral 
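
For reference, the macro-averaged metric weights each recording scenario equally rather than pooling all words. Below is a minimal Python sketch of how a macro-averaged WER and the relative improvement over a baseline can be computed; all per-scenario counts and the baseline value are hypothetical placeholders, not the challenge's actual numbers.

    # Minimal sketch: macro-averaged WER and relative improvement.
    def wer(subs, dels, ins, ref_words):
        # Word error rate for one scenario from its error counts.
        return (subs + dels + ins) / ref_words

    # Hypothetical per-scenario (S, D, I, N) counts, one entry per scenario.
    scenarios = {
        "scenario_a": (1200, 800, 300, 10000),
        "scenario_b": (900, 500, 250, 8000),
        "scenario_c": (400, 200, 100, 6000),
    }

    # Macro average: every scenario contributes equally, regardless of size.
    macro_wer = sum(wer(*c) for c in scenarios.values()) / len(scenarios)

    baseline_wer = 0.471  # hypothetical baseline macro WER
    rel_improvement = (baseline_wer - macro_wer) / baseline_wer
    print(f"macro WER: {macro_wer:.3f}, improvement: {rel_improvement:.1%}")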

Reducing the gap between streaming and non-streaming Transducer-based ASR by adaptive two-stage knowledge distillation

Jun 27, 2023
Haitao Tang, Yu Fu, Lei Sun, Jiabin Xue, Dan Liu, Yongchao Li, Zhiqiang Ma, Minghui Wu, Jia Pan, Genshun Wan, Ming'en Zhao

The transducer is one of the mainstream frameworks for streaming speech recognition. Because of its limited context, a streaming transducer model lags behind its non-streaming counterpart in accuracy. An effective way to reduce this gap is to make their hidden and output distributions consistent, which can be achieved by hierarchical knowledge distillation. However, it is difficult to enforce consistency of both distributions simultaneously, because learning the output distribution depends on the hidden one. In this paper, we propose an adaptive two-stage knowledge distillation method consisting of hidden-layer learning and output-layer learning. In the first stage, we learn hidden representations with full context by applying a mean squared error loss. In the second stage, we design a power-transformation-based adaptive smoothing method to learn a stable output distribution. Our method achieves a 19% relative reduction in word error rate and a faster response for the first token compared with the original streaming model on the LibriSpeech corpus.
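
The two stages map naturally onto two losses. The following PyTorch sketch illustrates them under stated assumptions: the non-streaming teacher and streaming student expose aligned hidden states and output logits, and the power exponent alpha is fixed here, whereas the paper sets it adaptively.

    import torch.nn.functional as F

    def hidden_distill_loss(student_hidden, teacher_hidden):
        # Stage 1: pull the streaming model's hidden representation toward
        # the full-context (non-streaming) teacher with mean squared error.
        return F.mse_loss(student_hidden, teacher_hidden)

    def output_distill_loss(student_logits, teacher_logits, alpha=0.8):
        # Stage 2 (sketch): smooth the teacher distribution with a power
        # transformation p**alpha (renormalized), then distill with KL
        # divergence. A fixed alpha stands in for the adaptive schedule.
        teacher_prob = F.softmax(teacher_logits, dim=-1) ** alpha
        teacher_prob = teacher_prob / teacher_prob.sum(dim=-1, keepdim=True)
        return F.kl_div(F.log_softmax(student_logits, dim=-1),
                        teacher_prob, reduction="batchmean")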

Minimalist and High-Quality Panoramic Imaging with PSF-aware Transformers

Jun 22, 2023
Qi Jiang, Shaohua Gao, Yao Gao, Kailun Yang, Zhonghua Yi, Hao Shi, Lei Sun, Kaiwei Wang

High-quality panoramic images with a 360-degree Field of View (FoV) are essential for contemporary panoramic computer vision tasks. However, conventional imaging systems come with sophisticated lens designs and heavy optical components, which disqualifies them from many mobile and wearable applications where thin, portable, minimalist imaging systems are desired. In this paper, we propose a Panoramic Computational Imaging Engine (PCIE) to achieve minimalist and high-quality panoramic imaging. With fewer than three spherical lenses, a Minimalist Panoramic Imaging Prototype (MPIP) is constructed based on the design of the Panoramic Annular Lens (PAL), but it yields low-quality images due to aberrations and the small image plane. We propose two pipelines, Aberration Correction (AC) and Super-Resolution and Aberration Correction (SR&AC), to solve the image quality problems of MPIP for imaging sensors with small and large pixel sizes, respectively. To provide a universal network for both pipelines, we leverage information from the Point Spread Function (PSF) of the optical system and design a PSF-aware Aberration-image Recovery Transformer (PART), in which the self-attention calculation and feature extraction are guided by PSF-aware mechanisms. We train PART on synthetic image pairs from simulation and put forward the PALHQ dataset to fill the gap of real-world high-quality PAL images for low-level vision. Comprehensive experiments on synthetic and real-world benchmarks demonstrate the impressive imaging results of PCIE and the effectiveness of the plug-and-play PSF-aware mechanisms. We further deliver heuristic experimental findings for minimalist and high-quality panoramic imaging. Our dataset and code will be available at https://github.com/zju-jiangqi/PCIE-PART.

* The dataset and code will be available at https://github.com/zju-jiangqi/PCIE-PART 
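
As a rough illustration of PSF-aware guidance, the sketch below conditions a standard multi-head self-attention block on a per-token PSF descriptor. The module and parameter names (PSFAwareAttention, psf_code) are illustrative assumptions, not the actual PART blocks.

    import torch
    import torch.nn as nn

    class PSFAwareAttention(nn.Module):
        # Sketch: self-attention whose queries/keys are modulated by a
        # Point Spread Function embedding, so attention can adapt to the
        # local degradation of the optical system.
        def __init__(self, dim, psf_dim, num_heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.psf_embed = nn.Linear(psf_dim, dim)

        def forward(self, tokens, psf_code):
            # tokens: (B, N, dim) image tokens; psf_code: (B, N, psf_dim)
            # per-position PSF descriptors from optical simulation.
            cond = tokens + self.psf_embed(psf_code)
            out, _ = self.attn(cond, cond, tokens)
            return out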

A Question-Answering Approach to Key Value Pair Extraction from Form-like Document Images

Apr 17, 2023
Kai Hu, Zhuoyuan Wu, Zhuoyao Zhong, Weihong Lin, Lei Sun, Qiang Huo

In this paper, we present a new question-answering (QA) based key-value pair extraction approach, called KVPFormer, to robustly extract key-value relationships between entities from form-like document images. Specifically, KVPFormer first identifies key entities among all entities in an image with a Transformer encoder, then takes these key entities as questions and feeds them into a Transformer decoder to predict their corresponding answers (i.e., value entities) in parallel. To achieve higher answer prediction accuracy, we further propose a coarse-to-fine answer prediction approach, which first extracts multiple answer candidates for each identified question in the coarse stage and then selects the most likely one among these candidates in the fine stage. In this way, the learning difficulty of answer prediction is effectively reduced, so the prediction accuracy improves. Moreover, we introduce a spatial compatibility attention bias into the self-attention/cross-attention mechanism of KVPFormer to better model the spatial interactions between entities. With these new techniques, our proposed KVPFormer achieves state-of-the-art results on the FUNSD and XFUND datasets, outperforming the previous best-performing method by 7.2% and 13.2% in F1 score, respectively.

* AAAI 2023 
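
The coarse-to-fine idea fits in a few lines: score all entities coarsely, keep the top-k candidates per question, then re-rank only those with a finer head. Everything below (tensor shapes, the fine_scorer head) is an illustrative assumption, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    def coarse_to_fine_answers(coarse_scores, entity_feats, fine_scorer, k=5):
        # coarse_scores: (Q, E) coarse answer logits for Q questions over
        # E entities; entity_feats: (E, D) entity features.
        _, topk_idx = coarse_scores.topk(k, dim=-1)             # coarse stage
        cand_feats = entity_feats[topk_idx]                     # (Q, k, D)
        fine_scores = fine_scorer(cand_feats).squeeze(-1)       # (Q, k)
        best = fine_scores.argmax(dim=-1)                       # fine stage
        return topk_idx.gather(-1, best.unsqueeze(-1)).squeeze(-1)

    # Example: a linear re-ranking head over 64-dim entity features.
    fine_scorer = nn.Linear(64, 1)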

Robust Table Structure Recognition with Dynamic Queries Enhanced Detection Transformer

Mar 21, 2023
Jiawei Wang, Weihong Lin, Chixiang Ma, Mingze Li, Zheng Sun, Lei Sun, Qiang Huo

We present a new table structure recognition (TSR) approach, called TSRFormer, to robustly recognize the structures of complex tables with geometrical distortions from various table images. Unlike previous methods, we formulate table separation line prediction as a line regression problem instead of an image segmentation problem and propose a new two-stage, dynamic-query-enhanced, DETR-based separation line regression approach, named DQ-DETR, to predict separation lines directly from table images. Compared to vanilla DETR, we propose three improvements in DQ-DETR to make the two-stage DETR framework work efficiently and effectively for the separation line prediction task: 1) a new query design, named Dynamic Query, that decouples a single line query into separable point queries, which intuitively improves localization accuracy for regression tasks; 2) a dynamic-query-based progressive line regression approach that progressively regresses points on the line, further enhancing localization accuracy for distorted tables; 3) a prior-enhanced matching strategy to address the slow convergence of DETR. After separation line prediction, a simple relation-network-based cell merging module is used to recover spanning cells. With these new techniques, our TSRFormer achieves state-of-the-art performance on several benchmark datasets, including SciTSR, PubTabNet, WTW, and FinTabNet. Furthermore, on a more challenging real-world in-house dataset, we have validated the robustness and high localization accuracy of our approach on tables with complex structures, borderless cells, large blank spaces, empty or spanning cells, and distorted or even curved shapes.

* 18 pages, 11 figures, Preprint. arXiv admin note: substantial text overlap with arXiv:2208.04921 
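
To make the Dynamic Query idea concrete, the sketch below decouples each separation line into K learnable point queries and refines their coordinates over a few decoder steps. The decoder layer, the number of points, and the refinement count are placeholder choices, not DQ-DETR's actual configuration.

    import torch
    import torch.nn as nn

    class DynamicPointLineRegressor(nn.Module):
        # Sketch: one separation line = K point queries, progressively
        # refined so each point regresses its own (x, y) on the line.
        def __init__(self, dim, num_points=16, steps=3):
            super().__init__()
            self.point_queries = nn.Parameter(torch.randn(num_points, dim))
            self.decoder = nn.TransformerDecoderLayer(dim, nhead=4,
                                                      batch_first=True)
            self.to_offset = nn.Linear(dim, 2)  # per-point (dx, dy)
            self.steps = steps

        def forward(self, img_feats, init_points):
            # img_feats: (B, N, dim) encoder tokens;
            # init_points: (B, K, 2) coarse points on the candidate line.
            q = self.point_queries.expand(img_feats.size(0), -1, -1)
            points = init_points
            for _ in range(self.steps):
                q = self.decoder(q, img_feats)
                points = points + self.to_offset(q)  # progressive regression
            return points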

Improving Fast Auto-Focus with Event Polarity

Mar 15, 2023
Yuhan Bao, Lei Sun, Yuqin Ma, Diyang Gu, Kaiwei Wang

Fast and accurate auto-focus in adverse conditions remains an arduous task. This paper presents a polarity-based event-camera auto-focus algorithm featuring high-speed, precise auto-focus in dark, dynamic scenes that conventional frame-based cameras cannot match. Specifically, we investigate the symmetrical relationship between event polarities during focusing and propose an event-based focus evaluation function derived from the principles of event cameras and the imaging model of the focusing process. Comprehensive experiments on the public EAD dataset show the robustness of the model. Furthermore, precise focus within less than one depth of focus is achieved in 0.004 seconds on our self-built high-speed focusing platform. The dataset and code will be made publicly available.

* 18 pages, 10 figures, 2 tables 
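
The exact evaluation function is derived in the paper from the event-generation and imaging models; the sketch below only illustrates the kind of polarity statistic such a function can build on, using a simple ON/OFF balance inside a time window as a hypothetical focus cue.

    import numpy as np

    def polarity_balance(timestamps, polarities, t0, t1):
        # polarities: array of +1 (ON) / -1 (OFF) events. As the lens
        # sweeps through focus, the ON/OFF pattern is symmetric around the
        # in-focus position, which a per-window polarity statistic exposes.
        mask = (timestamps >= t0) & (timestamps < t1)
        p = polarities[mask]
        return float(np.abs(p.sum()) / p.size) if p.size else 0.0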

Event-Based Frame Interpolation with Ad-hoc Deblurring

Jan 12, 2023
Lei Sun, Christos Sakaridis, Jingyun Liang, Peng Sun, Jiezhang Cao, Kai Zhang, Qi Jiang, Kaiwei Wang, Luc Van Gool

The performance of video frame interpolation is inherently correlated with the ability to handle motion in the input scene. Even though previous works recognize the utility of asynchronous event information for this task, they ignore the fact that motion may or may not result in blur in the input video, depending on the exposure time of the frames and the speed of the motion. They assume either that the input video is sharp, restricting themselves to frame interpolation, or that it is blurry, including an explicit, separate deblurring stage before interpolation in their pipeline. We instead propose a general method for event-based frame interpolation that performs deblurring ad hoc and thus works on both sharp and blurry input videos. Our model consists of a bidirectional recurrent network that naturally incorporates the temporal dimension of interpolation and fuses information from the input frames and the events adaptively based on their temporal proximity. In addition, we introduce a novel real-world high-resolution dataset with events and color videos, named HighREV, which provides a challenging evaluation setting for the examined task. Extensive experiments on the standard GoPro benchmark and on our dataset show that our network consistently outperforms previous state-of-the-art methods on frame interpolation, single-image deblurring, and the joint task of interpolation and deblurring. Our code and dataset will be made publicly available.
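
A minimal sketch of proximity-based fusion, assuming per-source feature maps with timestamps; the exponential weighting and the tau constant are illustrative stand-ins for the network's learned, adaptive fusion.

    import torch

    def proximity_fusion(source_feats, source_times, target_time, tau=0.1):
        # source_feats: (T, C, H, W) features from input frames/event
        # windows; source_times: (T,) timestamps; target_time: the instant
        # to interpolate. Sources closer in time dominate the fused feature.
        w = torch.exp(-(source_times - target_time).abs() / tau)
        w = w / w.sum()
        return (w.view(-1, 1, 1, 1) * source_feats).sum(dim=0)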

Computational Optics Meet Domain Adaptation: Transferring Semantic Segmentation Beyond Aberrations

Nov 21, 2022
Qi Jiang, Hao Shi, Shaohua Gao, Jiaming Zhang, Kailun Yang, Lei Sun, Kaiwei Wang

Semantic scene understanding with Minimalist Optical Systems (MOS) in mobile and wearable applications remains a challenge due to the corrupted imaging quality induced by optical aberrations. However, previous works only focus on improving subjective imaging quality through computational optics, i.e., the Computational Imaging (CI) technique, ignoring its feasibility for semantic segmentation. In this paper, we are the first to investigate Semantic Segmentation under Optical Aberrations (SSOA) of MOS. To benchmark SSOA, we construct Virtual Prototype Lens (VPL) groups through optical simulation, generating the Cityscapes-ab and KITTI-360-ab datasets under different behaviors and levels of aberration. We approach SSOA from an unsupervised domain adaptation perspective to address the scarcity of labeled aberration data in real-world scenarios. Further, we propose Computational Imaging Assisted Domain Adaptation (CIADA) to leverage prior knowledge of CI for robust performance in SSOA. Based on our benchmark, we evaluate the robustness of state-of-the-art segmenters against aberrations. In addition, extensive evaluations of possible solutions to SSOA reveal that CIADA achieves superior performance under all aberration distributions, paving the way for the application of MOS in semantic scene understanding. Code and dataset will be made publicly available at https://github.com/zju-jiangqi/CIADA.

* Code and dataset will be made publicly available at https://github.com/zju-jiangqi/CIADA 
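
The benchmark images come from optical simulation with the VPL groups; as a much simplified stand-in, the sketch below degrades a clean image with a spatially uniform PSF (a Gaussian kernel) plus sensor noise, whereas real aberrated PSFs are spatially varying and lens-specific.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def simulate_aberration(image, sigma=2.0, noise_std=0.01):
        # image: float HWC array in [0, 1]. Blur with a uniform PSF and add
        # Gaussian sensor noise; a crude proxy for lens-simulated aberrations.
        blurred = gaussian_filter(image.astype(np.float32),
                                  sigma=(sigma, sigma, 0))
        noisy = blurred + np.random.normal(0.0, noise_std, blurred.shape)
        return np.clip(noisy, 0.0, 1.0)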

Efficient and Accurate Quantized Image Super-Resolution on Mobile NPUs, Mobile AI & AIM 2022 challenge: Report

Nov 07, 2022
Andrey Ignatov, Radu Timofte, Maurizio Denna, Abdel Younes, Ganzorig Gankhuyag, Jingang Huh, Myeong Kyun Kim, Kihwan Yoon, Hyeon-Cheol Moon, Seungho Lee, Yoonsik Choe, Jinwoo Jeong, Sungjei Kim, Maciej Smyl, Tomasz Latkowski, Pawel Kubik, Michal Sokolski, Yujie Ma, Jiahao Chao, Zhou Zhou, Hongfan Gao, Zhengfeng Yang, Zhenbing Zeng, Zhengyang Zhuge, Chenghua Li, Dan Zhu, Mengdi Sun, Ran Duan, Yan Gao, Lingshun Kong, Long Sun, Xiang Li, Xingdong Zhang, Jiawei Zhang, Yaqi Wu, Jinshan Pan, Gaocheng Yu, Jin Zhang, Feng Zhang, Zhe Ma, Hongbin Wang, Hojin Cho, Steve Kim, Huaen Li, Yanbo Ma, Ziwei Luo, Youwei Li, Lei Yu, Zhihong Wen, Qi Wu, Haoqiang Fan, Shuaicheng Liu, Lize Zhang, Zhikai Zong, Jeremy Kwon, Junxi Zhang, Mengyuan Li, Nianxiang Fu, Guanchen Ding, Han Zhu, Zhenzhong Chen, Gen Li, Yuanfan Zhang, Lei Sun, Dafeng Zhang, Neo Yang, Fitz Liu, Jerry Zhao, Mustafa Ayazoglu, Bahri Batuhan Bilecen, Shota Hirose, Kasidis Arunruangsirilert, Luo Ao, Ho Chun Leung, Andrew Wei, Jie Liu, Qiang Liu, Dahai Yu, Ao Li, Lei Luo, Ce Zhu, Seongmin Hong, Dongwon Park, Joonhee Lee, Byeong Hyun Lee, Seunggyu Lee, Se Young Chun, Ruiyuan He, Xuhao Jiang, Haihang Ruan, Xinjian Zhang, Jing Liu, Garas Gendy, Nabil Sabor, Jingchao Hou, Guanghui He

Image super-resolution is a common task on mobile and IoT devices, where one often needs to upscale and enhance low-resolution images and video frames. While numerous solutions have been proposed for this problem in the past, they are usually not compatible with low-power mobile NPUs, which have many computational and memory constraints. In this Mobile AI challenge, we address this problem and ask the participants to design an efficient quantized image super-resolution solution that can demonstrate real-time performance on mobile NPUs. The participants were provided with the DIV2K dataset and trained INT8 models for high-quality 3X image upscaling. The runtime of all models was evaluated on the Synaptics VS680 Smart Home board with a dedicated edge NPU capable of accelerating quantized neural networks. All proposed solutions are fully compatible with the above NPU, demonstrating up to 60 FPS when reconstructing Full HD images. A detailed description of all models developed in the challenge is provided in this paper.

* arXiv admin note: text overlap with arXiv:2105.07825, arXiv:2105.08826, arXiv:2211.04470, arXiv:2211.03885, arXiv:2211.05256 
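
As an illustration of the required deliverable, the sketch below builds a tiny NPU-friendly 3X super-resolution network in Keras and converts it to a fully integer INT8 TFLite model with post-training quantization. The architecture and sizes are placeholders, not any participant's solution.

    import tensorflow as tf

    def tiny_sr_model(scale=3, channels=3, width=32, lr_shape=(180, 320)):
        # A few convs followed by depth-to-space upsampling: the kind of
        # shallow topology that maps well onto mobile NPUs. A fixed input
        # shape keeps full-integer conversion simple.
        inp = tf.keras.Input(shape=(*lr_shape, channels))
        x = tf.keras.layers.Conv2D(width, 3, padding="same", activation="relu")(inp)
        x = tf.keras.layers.Conv2D(width, 3, padding="same", activation="relu")(x)
        x = tf.keras.layers.Conv2D(channels * scale * scale, 3, padding="same")(x)
        out = tf.keras.layers.Lambda(lambda t: tf.nn.depth_to_space(t, scale))(x)
        return tf.keras.Model(inp, out)

    def quantize_int8(model, rep_images):
        # Full-integer post-training quantization, as needed by NPUs that
        # only accelerate INT8 graphs; rep_images calibrates the ranges.
        converter = tf.lite.TFLiteConverter.from_keras_model(model)
        converter.optimizations = [tf.lite.Optimize.DEFAULT]
        converter.representative_dataset = lambda: (
            [img[None].astype("float32")] for img in rep_images)
        converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
        converter.inference_input_type = tf.int8
        converter.inference_output_type = tf.int8
        return converter.convert()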

Robust Human Matting via Semantic Guidance

Oct 11, 2022
Xiangguang Chen, Ye Zhu, Yu Li, Bingtao Fu, Lei Sun, Ying Shan, Shan Liu

Automatic human matting is highly desired for many real applications. We investigate recent human matting methods and show that common failure cases occur when semantic human segmentation fails, which indicates that semantic understanding is crucial for robust human matting. Based on this observation, we develop a fast yet accurate human matting framework, named Semantic Guided Human Matting (SGHM). It builds on a semantic human segmentation network and introduces a lightweight matting module with only marginal computational cost. Unlike previous works, our framework is data efficient: it requires only a small amount of matting ground truth to learn to estimate high-quality object mattes. Our experiments show that, trained with merely 200 matting images, our method generalizes well to real-world datasets and outperforms recent methods on multiple benchmarks while remaining efficient. Considering the unbearable labeling cost of matting data and the wide availability of segmentation data, our method is a practical and effective solution for human matting. Source code is available at https://github.com/cxgincsu/SemanticGuidedHumanMatting.

* ACCV 2022 
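
A minimal sketch of the semantic-guidance design, assuming a segmentation backbone that returns a coarse human mask plus features; the module names and channel sizes are illustrative, not SGHM's exact layers.

    import torch
    import torch.nn as nn

    class SemanticGuidedMatting(nn.Module):
        # Sketch: a segmentation network provides semantic guidance; a
        # lightweight head refines the coarse mask into an alpha matte at
        # only marginal extra cost.
        def __init__(self, seg_net, feat_dim=64):
            super().__init__()
            self.seg_net = seg_net  # assumed to return (coarse_mask, features)
            self.matting_head = nn.Sequential(
                nn.Conv2d(feat_dim + 1, 32, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(32, 1, 3, padding=1),
                nn.Sigmoid())

        def forward(self, image):
            coarse_mask, feats = self.seg_net(image)
            x = torch.cat([feats, coarse_mask], dim=1)
            alpha = self.matting_head(x)  # semantics-guided alpha matte
            return coarse_mask, alpha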