Yi He

Document AI: A Comparative Study of Transformer-Based, Graph-Based Models, and Convolutional Neural Networks For Document Layout Analysis

Aug 29, 2023
Sotirios Kastanas, Shaomu Tan, Yi He

Document AI aims to automatically analyze documents by leveraging natural language processing and computer vision techniques. One of the major tasks of Document AI is document layout analysis, which structures document pages by interpreting the content and spatial relationships of layout, image, and text. This task can be image-centric, wherein the aim is to identify and label various regions such as authors and paragraphs, or text-centric, where the focus is on classifying individual words in a document. Although increasingly sophisticated methods for improving layout analysis exist, doubts remain about how well their findings generalize to a broader context. Specifically, prior work has developed systems based on very different architectures, such as transformer-based models, graph-based models, and CNNs, yet no work has compared the effectiveness of these models in a single analysis. Moreover, while language-independent Document AI models capable of knowledge transfer have been developed, it remains to be investigated to what degree they can effectively transfer knowledge. In this study, we aim to fill these gaps by conducting a comparative evaluation of state-of-the-art models for document layout analysis and by investigating the potential of cross-lingual layout analysis using machine translation techniques.

Improving Frame-level Classifier for Word Timings with Non-peaky CTC in End-to-End Automatic Speech Recognition

Jun 09, 2023
Xianzhao Chen, Yist Y. Lin, Kang Wang, Yi He, Zejun Ma

End-to-end (E2E) systems have shown performance comparable to hybrid systems for automatic speech recognition (ASR). Word timings, as a by-product of ASR, are essential in many applications, especially for subtitling and computer-aided pronunciation training. In this paper, we improve the frame-level classifier for word timings in an E2E system by introducing label priors into the connectionist temporal classification (CTC) loss, adopted from prior works, and by combining low-level Mel-scale filter banks with high-level ASR encoder output as the input feature. On an internal Chinese corpus, the proposed method achieves 95.68%/94.18% on the word timing accuracy metrics, compared to 93.0%/90.22% for the hybrid system. It also surpasses a previous E2E approach by an absolute 4.80%/8.02% on the same metrics across 7 languages. In addition, we further improve word timing accuracy by delaying CTC peaks with frame-wise knowledge distillation, though experiments are limited to LibriSpeech.

* To appear in the proceedings of INTERSPEECH 2023 
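A minimal PyTorch sketch of the two ingredients described in the abstract, assuming a frame-synchronous setup; the names (FrameClassifier, ctc_loss_with_label_priors), the prior_scale value, and the shape assumptions are illustrative, not taken from the paper:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FrameClassifier(nn.Module):
        """Frame-level classifier fed with low-level fbank features concatenated
        with high-level ASR encoder outputs (assumed to share the same frame rate)."""
        def __init__(self, fbank_dim: int, encoder_dim: int, num_tokens: int):
            super().__init__()
            self.proj = nn.Linear(fbank_dim + encoder_dim, num_tokens)

        def forward(self, fbank: torch.Tensor, enc_out: torch.Tensor) -> torch.Tensor:
            # fbank: (B, T, fbank_dim), enc_out: (B, T, encoder_dim)
            return self.proj(torch.cat([fbank, enc_out], dim=-1))  # (B, T, num_tokens)

    def ctc_loss_with_label_priors(logits, targets, in_lens, tgt_lens, log_priors, prior_scale=0.3):
        """CTC loss with a scaled log label prior subtracted from the frame
        log-probabilities, discouraging the usual peaky, blank-dominated solution."""
        log_probs = F.log_softmax(logits, dim=-1) - prior_scale * log_priors  # broadcast over (B, T, C)
        log_probs = log_probs.transpose(0, 1)  # torch CTC expects (T, B, C)
        return F.ctc_loss(log_probs, targets, in_lens, tgt_lens, blank=0, zero_infinity=True)

Here log_priors would be a per-token log-frequency vector estimated from training data; since the abstract adopts the label-prior idea from prior work, the exact estimation procedure may differ.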

Sketch2Cloth: Sketch-based 3D Garment Generation with Unsigned Distance Fields

Mar 01, 2023
Yi He, Haoran Xie, Kazunori Miyata

3D model reconstruction from a single image has made great progress with recent deep generative models. However, conventional reconstruction approaches based on template mesh deformation or implicit fields have difficulty reconstructing non-watertight 3D mesh models such as garments. In contrast to image-based modeling, a sketch-based approach can help users generate 3D models that meet their design intentions from hand-drawn sketches. In this study, we propose Sketch2Cloth, a sketch-based 3D garment generation system using unsigned distance fields estimated from the user's sketch input. Sketch2Cloth first estimates the unsigned distance function of the target 3D model from the sketch input, and then extracts a mesh from the estimated field with Marching Cubes. We also provide a model editing function to modify the generated mesh. We verified Sketch2Cloth with quantitative evaluations of garment generation and editing against a state-of-the-art approach.

* 8 pages, 9 figures, video is here https://youtu.be/miisvVTpqj8 
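As a rough illustration of the extraction step only, the snippet below runs Marching Cubes on an unsigned distance field sampled on a voxel grid; the function name and the epsilon iso-level are assumptions, and the sketch-conditioned network that actually predicts the field is replaced here by a synthetic sphere UDF:

    import numpy as np
    from skimage import measure

    def extract_mesh_from_udf(udf: np.ndarray, voxel_size: float, iso_eps: float = 0.01):
        """udf: (N, N, N) unsigned distances. Extracting the iso-surface at a small
        positive epsilon yields a thin shell around the (possibly open) garment surface."""
        verts, faces, normals, _ = measure.marching_cubes(udf, level=iso_eps,
                                                          spacing=(voxel_size,) * 3)
        return verts, faces, normals

    # Toy usage: a sphere of radius 0.5 stands in for a network-predicted UDF.
    grid = np.linspace(-1.0, 1.0, 64)
    x, y, z = np.meshgrid(grid, grid, grid, indexing="ij")
    udf = np.abs(np.sqrt(x**2 + y**2 + z**2) - 0.5)
    verts, faces, normals = extract_mesh_from_udf(udf, voxel_size=2.0 / 63)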

Towards Fair Machine Learning Software: Understanding and Addressing Model Bias Through Counterfactual Thinking

Feb 16, 2023
Zichong Wang, Yang Zhou, Meikang Qiu, Israat Haque, Laura Brown, Yi He, Jianwu Wang, David Lo, Wenbin Zhang

The increasing use of Machine Learning (ML) software can lead to unfair and unethical decisions, so fairness bugs in software are a growing concern. Addressing these fairness bugs often involves sacrificing ML performance, such as accuracy. To address this issue, we present a novel approach that uses counterfactual thinking to tackle the root causes of bias in ML software. In addition, our approach combines models optimized for both performance and fairness, resulting in an optimal solution in both aspects. We conducted a thorough evaluation of our approach on 10 benchmark tasks using a combination of 5 performance metrics, 3 fairness metrics, and 15 measurement scenarios, all applied to 8 real-world datasets. These extensive evaluations show that the proposed method significantly improves the fairness of ML software while maintaining competitive performance, outperforming state-of-the-art solutions in 84.6% of overall cases based on a recent benchmarking tool.
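To make the counterfactual intuition concrete, here is a small sketch that is not the paper's algorithm, only an illustration of counterfactual thinking on a classifier: flip the protected attribute and check whether the prediction changes. The column names and the synthetic data are hypothetical:

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    def counterfactual_flip_rate(model, X: pd.DataFrame, protected: str) -> float:
        """Fraction of instances whose prediction changes when only the (binary)
        protected attribute is flipped; lower means less reliance on that attribute."""
        X_cf = X.copy()
        X_cf[protected] = 1 - X_cf[protected]
        return float(np.mean(model.predict(X) != model.predict(X_cf)))

    # Toy usage on synthetic data.
    rng = np.random.default_rng(0)
    X = pd.DataFrame({"sex": rng.integers(0, 2, 500), "income": rng.normal(size=500)})
    y = (X["income"] + 0.5 * X["sex"] + rng.normal(scale=0.5, size=500) > 0).astype(int)
    model = LogisticRegression().fit(X, y)
    print("counterfactual flip rate:", counterfactual_flip_rate(model, X, "sex"))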

Multi-Metric AutoRec for High Dimensional and Sparse User Behavior Data Prediction

Dec 20, 2022
Cheng Liang, Teng Huang, Yi He, Song Deng, Di Wu, Xin Luo

User behavior data produced during interaction with massive numbers of items in the big data era are generally heterogeneous and sparse, leaving the recommender system (RS) a large diversity of underlying patterns to mine. Deep neural network-based models have reached the state of the art in RS owing to their strong fitting capabilities. However, prior works mainly focus on designing an intricate architecture with a fixed loss function and regularization. These single-metric models provide limited performance when facing heterogeneous and sparse user behavior data. Motivated by this finding, we propose a multi-metric AutoRec (MMA) based on the representative AutoRec. The idea of the proposed MMA is two-fold: 1) apply different $L_p$-norms to the loss function and regularization to form variant models in different metric spaces, and 2) aggregate these variant models. The proposed MMA thus enjoys a multi-metric orientation from a set of dispersed metric spaces, achieving a comprehensive representation of user data. Theoretical analysis proves that the proposed MMA can attain performance improvements. Extensive experiments on five real-world datasets show that MMA outperforms seven other state-of-the-art models in predicting unobserved user behavior data.

* 6 pages, 4 Tables 
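A compact sketch of the two ideas, assuming a standard AutoRec-style autoencoder; the layer sizes, the $L_p$ exponent, the regularization weight, and the uniform aggregation weights are illustrative rather than the authors' configuration:

    import torch
    import torch.nn as nn

    class AutoRec(nn.Module):
        """AutoRec: reconstruct an observed rating vector through a bottleneck."""
        def __init__(self, n_inputs: int, hidden: int = 128):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(n_inputs, hidden), nn.Sigmoid(),
                                     nn.Linear(hidden, n_inputs))

        def forward(self, r):
            return self.net(r)

    def lp_objective(model, r, mask, p: float, lam: float = 0.01):
        """Reconstruction loss and weight regularization measured with the same L_p norm;
        `mask` keeps only observed entries, since the rating matrix is highly sparse."""
        err = (model(r) - r) * mask
        loss = err.abs().pow(p).sum() / mask.sum().clamp(min=1)
        reg = sum(w.abs().pow(p).sum() for w in model.parameters())
        return loss + lam * reg

    def aggregate(models, r, weights=None):
        """Combine the variant models, e.g. by (weighted) averaging of their predictions."""
        preds = torch.stack([m(r) for m in models])  # (M, B, n_inputs)
        weights = torch.ones(len(models)) / len(models) if weights is None else weights
        return (weights.view(-1, 1, 1) * preds).sum(dim=0)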

Improving short-video speech recognition using random utterance concatenation

Oct 28, 2022
Haihua Xu, Van Tung Pham, Yerbolat Khassanov, Yist Lin, Tao Han, Tze Yuan Chong, Yi He, Zejun Ma

One limitation of the end-to-end automatic speech recognition (ASR) framework is that its performance is compromised when train and test utterance lengths are mismatched. In this paper, we propose a random utterance concatenation (RUC) method to alleviate the train-test utterance length mismatch for the short-video speech recognition task. Specifically, we are motivated by the observation that our human-transcribed training utterances tend to be much shorter for short-video spontaneous speech (~3 seconds on average), while test utterances generated by the voice activity detection front-end are much longer (~10 seconds on average). Such a mismatch can lead to suboptimal performance. Experimentally, with the proposed RUC method, the best word error rate reduction (WERR) is achieved with roughly a three-fold increase in training data size and two utterances concatenated per sample. In practice, the proposed method consistently outperforms strong baseline models, achieving an average WERR of 3.64% across 14 languages.

* 5 pages, 2 figures, 4 tables 
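A hedged sketch of RUC as a data-augmentation pass over a training manifest; the dictionary fields and the three-fold/two-utterance setting follow the abstract, while the sampling scheme and naming are assumptions:

    import random

    def random_utterance_concatenation(utterances, n_concat=2, expand_factor=3, seed=0):
        """utterances: list of dicts like {"audio": samples, "text": transcript}.
        Returns roughly `expand_factor` x the original amount of data, each new
        sample built by concatenating `n_concat` randomly chosen utterances."""
        rng = random.Random(seed)
        augmented = list(utterances)  # keep the original short utterances
        for _ in range((expand_factor - 1) * len(utterances)):
            picks = rng.sample(utterances, n_concat)
            augmented.append({
                "audio": [s for u in picks for s in u["audio"]],  # concatenate waveforms
                "text": " ".join(u["text"] for u in picks),       # concatenate transcripts
            })
        return augmented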

Reducing Language Confusion for Code-switching Speech Recognition with Token-level Language Diarization

Oct 26, 2022
Hexin Liu, Haihua Xu, Leibny Paola Garcia, Andy W. H. Khong, Yi He, Sanjeev Khudanpur

Code-switching (CS) refers to the phenomenon of languages switching within a speech signal, which leads to language confusion for automatic speech recognition (ASR). This paper aims to address language confusion to improve CS-ASR from two perspectives: incorporating and disentangling language information. We incorporate language information into the CS-ASR model by dynamically biasing the model with token-level language posteriors, which are outputs of a sequence-to-sequence auxiliary language diarization (LD) module. In contrast, the disentangling process reduces the difference between languages via adversarial training so as to normalize the two languages. We conduct experiments on the SEAME dataset. Compared to the baseline model, both joint optimization with LD and the language posterior bias achieve performance improvements. The comparison of the proposed methods indicates that incorporating language information is more effective than disentangling it for reducing language confusion in CS speech.

* Submitted to ICASSP 2023 
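A rough PyTorch sketch of the token-level language posterior bias, assuming an attention-based CS-ASR model; the module layout and the additive fusion are simplifications, not the paper's exact architecture:

    import torch
    import torch.nn as nn

    class LanguagePosteriorBias(nn.Module):
        def __init__(self, d_model: int, n_languages: int = 2):
            super().__init__()
            self.ld_head = nn.Linear(d_model, n_languages)    # auxiliary LD classifier
            self.bias_proj = nn.Linear(n_languages, d_model)  # maps posteriors back into model space

        def forward(self, token_states: torch.Tensor):
            # token_states: (B, U, d_model) decoder-side token representations
            lang_logits = self.ld_head(token_states)
            lang_post = lang_logits.softmax(dim=-1)             # token-level language posteriors
            biased = token_states + self.bias_proj(lang_post)   # dynamic bias of the ASR stream
            return biased, lang_logits  # lang_logits also feed an auxiliary LD loss (joint optimization)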

An Online Sparse Streaming Feature Selection Algorithm

Aug 03, 2022
Feilong Chen, Di Wu, Jie Yang, Yi He

Online streaming feature selection (OSFS), which conducts feature selection in an online manner, plays an important role in dealing with high-dimensional data. In many real applications, such as intelligent healthcare platforms, streaming features often contain missing data, which raises a crucial challenge for OSFS, i.e., how to establish the uncertain relationship between sparse streaming features and labels. Unfortunately, existing OSFS algorithms never consider such an uncertain relationship. To fill this gap, in this paper we propose an online sparse streaming feature selection with uncertainty (OS2FSU) algorithm. OS2FSU consists of two main parts: 1) latent factor analysis is utilized to pre-estimate the missing data in sparse streaming features before conducting feature selection, and 2) fuzzy logic and neighborhood rough sets are employed to alleviate the uncertainty between the estimated streaming features and labels during feature selection. In the experiments, OS2FSU is compared with five state-of-the-art OSFS algorithms on six real datasets. The results demonstrate that OS2FSU outperforms its competitors when missing data are encountered in OSFS.
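A simplified sketch of the first stage only, latent factor analysis used to pre-estimate missing entries of the sparse feature matrix, with illustrative hyperparameters; the fuzzy-logic/rough-set stage is omitted:

    import numpy as np

    def latent_factor_impute(X, mask, k=5, lr=0.01, lam=0.05, epochs=200, seed=0):
        """X: (n_samples, n_features) with arbitrary values where mask == 0 (missing).
        Returns X with missing entries replaced by the rank-k estimate P @ Q.T."""
        rng = np.random.default_rng(seed)
        n, m = X.shape
        P = 0.1 * rng.standard_normal((n, k))
        Q = 0.1 * rng.standard_normal((m, k))
        rows, cols = np.nonzero(mask)
        for _ in range(epochs):
            for i, j in zip(rows, cols):                 # SGD over observed entries only
                err = X[i, j] - P[i] @ Q[j]
                P[i] += lr * (err * Q[j] - lam * P[i])
                Q[j] += lr * (err * P[i] - lam * Q[j])
        X_hat = X.copy()
        X_hat[mask == 0] = (P @ Q.T)[mask == 0]
        return X_hat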

Intermediate-layer Output Regularization for Attention-based Speech Recognition with Shared Decoder

Jul 09, 2022
Jicheng Zhang, Yizhou Peng, Haihua Xu, Yi He, Eng Siong Chng, Hao Huang

Intermediate layer output (ILO) regularization by means of multitask training on the encoder side has been shown to be an effective approach to improving results on a wide range of end-to-end ASR frameworks. In this paper, we propose a novel method for ILO-regularized training. Instead of using conventional multitask methods that entail more training overhead, we feed the intermediate layer output directly to the decoder; that is, during training the decoder accepts both the output of the final encoder layer and the encoder's ILO as input. With the proposed method, since both the encoder and decoder are simultaneously "regularized", the network is trained more thoroughly, consistently leading to improved results over the ILO-based CTC method, as well as over the original attention-based model trained without the proposed method.

* 5 pages. Submitted to INTERSPEECH 2022 
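A conceptual sketch of the training objective, assuming an encoder that exposes per-layer outputs and a shared attention decoder that returns a cross-entropy loss; the middle-layer choice and the 0.3 weight are placeholders, not the paper's setup:

    def ilo_regularized_loss(encoder, decoder, feats, feat_lens, tokens, ilo_weight=0.3):
        """encoder(feats, feat_lens) is assumed to return the list of per-layer outputs;
        decoder(memory, tokens) is assumed to return an attention (cross-entropy) loss
        for predicting `tokens` given an encoder memory."""
        layer_outputs = encoder(feats, feat_lens)            # one (B, T, d) tensor per layer
        final_out = layer_outputs[-1]
        inter_out = layer_outputs[len(layer_outputs) // 2]   # e.g. the middle layer as the ILO
        loss_final = decoder(final_out, tokens)              # standard attention loss
        loss_inter = decoder(inter_out, tokens)              # same shared decoder run on the ILO
        return (1 - ilo_weight) * loss_final + ilo_weight * loss_inter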

Internal Language Model Estimation based Language Model Fusion for Cross-Domain Code-Switching Speech Recognition

Jul 09, 2022
Yizhou Peng, Yufei Liu, Jicheng Zhang, Haihua Xu, Yi He, Hao Huang, Eng Siong Chng

Internal Language Model Estimation (ILME)-based language model (LM) fusion has been shown to significantly improve recognition results over conventional shallow fusion in both intra-domain and cross-domain speech recognition tasks. In this paper, we apply our ILME method to cross-domain code-switching speech recognition (CSSR). Our investigation addresses several questions. First, how effective is ILME-based LM fusion for both intra-domain and cross-domain CSSR tasks? We verify this with and without merging the two code-switching domains. More importantly, we train an end-to-end (E2E) speech recognition model by merging two monolingual data sets and observe the efficacy of the proposed ILME-based LM fusion for CSSR. Experimental results on SEAME, a Southeast Asian corpus, and another Chinese Mainland CS data set demonstrate the effectiveness of the proposed ILME-based LM fusion method.

* 5 pages. Submitted to INTERSPEECH 2022 
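For reference, a minimal sketch of how ILME-based fusion scores a hypothesis, following the usual ILME formulation rather than code from this paper; the interpolation weights and toy scores are illustrative:

    def ilme_fusion_score(asr_logp, ext_lm_logp, ilm_logp, lam_ext=0.4, lam_ilm=0.2):
        """Shallow fusion adds the external LM; ILME additionally subtracts an estimate
        of the E2E model's internal LM so that the external (target-domain or
        code-switching) LM is not double-counted."""
        return asr_logp + lam_ext * ext_lm_logp - lam_ilm * ilm_logp

    # Example: comparing two candidate hypotheses with toy log-probabilities.
    hyp_a = ilme_fusion_score(asr_logp=-12.3, ext_lm_logp=-20.1, ilm_logp=-18.7)
    hyp_b = ilme_fusion_score(asr_logp=-12.9, ext_lm_logp=-17.4, ilm_logp=-19.2)
    best = "A" if hyp_a > hyp_b else "B"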