Yihao Chen

AcademicGPT: Empowering Academic Research

Nov 21, 2023
Shufa Wei, Xiaolong Xu, Xianbiao Qi, Xi Yin, Jun Xia, Jingyi Ren, Peijun Tang, Yuxiang Zhong, Yihao Chen, Xiaoqin Ren, Yuxin Liang, Liankai Huang, Kai Xie, Weikang Gui, Wei Tan, Shuanglong Sun, Yongquan Hu, Qinxian Liu, Nanjin Li, Chihao Dai, Lihua Wang, Xiaohui Liu, Lei Zhang, Yutao Xie

Large Language Models (LLMs) have demonstrated exceptional capabilities across various natural language processing tasks. Yet, many of these advanced LLMs are tailored for broad, general-purpose applications. In this technical report, we introduce AcademicGPT, designed specifically to empower academic research. AcademicGPT is a continual-training model derived from LLaMA2-70B. Our training corpus mainly consists of academic papers, theses, content from academic domains, high-quality Chinese data, and other sources. While it may not be extensive in data scale, AcademicGPT marks our initial venture into a domain-specific GPT tailored for academic research. We evaluate AcademicGPT on several established public benchmarks such as MMLU and CEval, as well as on specialized academic benchmarks like PubMedQA, SCIEval, and our newly created ComputerScienceQA, to demonstrate its abilities ranging from general knowledge to Chinese proficiency to academic expertise. Building upon AcademicGPT's foundation model, we also developed several applications catered to the academic area, including General Academic Question Answering, AI-assisted Paper Reading, Paper Review, and AI-assisted Title and Abstract Generation.
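
As a rough illustration of the continual-training recipe described above, the sketch below shows a standard causal-language-modeling continuation run starting from a LLaMA2-70B checkpoint. The corpus path, sequence length, and hyperparameters are illustrative assumptions, not the report's actual configuration.

```python
# Minimal continual-pretraining sketch (assumed setup, not the authors' exact pipeline).
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

base = "meta-llama/Llama-2-70b-hf"            # starting checkpoint, as stated in the abstract
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token                 # LLaMA tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical academic corpus: one JSON record per document with a "text" field.
corpus = load_dataset("json", data_files="academic_corpus.jsonl", split="train")
corpus = corpus.map(lambda b: tok(b["text"], truncation=True, max_length=4096),
                    batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="academicgpt-ckpt",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=64,
    learning_rate=1e-5,                       # small LR is typical for continual training
    num_train_epochs=1,
    bf16=True,
)

# Standard causal-LM objective: labels are the (shifted) input ids themselves.
collator = DataCollatorForLanguageModeling(tok, mlm=False)
Trainer(model=model, args=args, train_dataset=corpus, data_collator=collator).train()
```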

* Technical Report. arXiv admin note: text overlap with arXiv:2310.12081, arXiv:2310.10053 by other authors 

Generic and Robust Root Cause Localization for Multi-Dimensional Data in Online Service Systems

May 05, 2023
Zeyan Li, Junjie Chen, Yihao Chen, Chengyang Luo, Yiwei Zhao, Yongqian Sun, Kaixin Sui, Xiping Wang, Dapeng Liu, Xing Jin, Qi Wang, Dan Pei

Localizing root causes for multi-dimensional data is critical to ensure online service systems' reliability. When a fault occurs, only the measure values within specific attribute combinations are abnormal. Such attribute combinations are substantial clues to the underlying root causes and are thus called root causes of multi-dimensional data. This paper proposes a generic and robust root cause localization approach for multi-dimensional data, PSqueeze. We propose a generic property of root causes for multi-dimensional data, the generalized ripple effect (GRE). Based on it, we propose a novel probabilistic clustering method and a robust heuristic search method. Moreover, we identify the importance of determining external root causes and propose an effective method for the first time in the literature. Our experiments on two real-world datasets with 5400 faults show that the F1-score of PSqueeze outperforms baselines by 32.89%, while the localization time is around 10 seconds across all cases. The F1-score of PSqueeze in determining external root causes reaches 0.90. Furthermore, case studies in several production systems demonstrate that PSqueeze is helpful for fault diagnosis in the real world.
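
To make the notion of per-leaf abnormality concrete, here is a minimal sketch of the kind of deviation score used in this line of work; the formula follows the earlier Squeeze work and is an assumption here, and PSqueeze's probabilistic clustering and heuristic search are not reproduced.

```python
import numpy as np

def deviation_scores(forecast: np.ndarray, actual: np.ndarray) -> np.ndarray:
    """Per-leaf deviation between forecast and actual measure values.

    A score near 0 means the leaf behaves as predicted; by the ripple effect,
    leaves affected by the same root cause should share similar scores.
    """
    denom = forecast + actual
    denom[denom == 0] = 1e-9                      # guard against empty leaves
    return 2.0 * (forecast - actual) / denom

# Toy example: 6 leaf attribute combinations, 3 of which drop by roughly half.
forecast = np.array([100., 80., 120., 50., 60., 90.])
actual   = np.array([ 48., 41.,  60., 50., 61., 89.])
scores = deviation_scores(forecast, actual)

# Grouping leaves with similar scores (here a crude threshold) mimics the
# clustering step that precedes the per-cluster root-cause search.
suspect = np.where(np.abs(scores) > 0.3)[0]
print(scores.round(2), suspect)                   # leaves 0-2 form one candidate cluster
```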

* Accepted by the Journal of Systems and Software on May 4, 2023 

LipsFormer: Introducing Lipschitz Continuity to Vision Transformers

Apr 19, 2023
Xianbiao Qi, Jianan Wang, Yihao Chen, Yukai Shi, Lei Zhang

We present a Lipschitz continuous Transformer, called LipsFormer, to pursue training stability both theoretically and empirically for Transformer-based models. In contrast to previous practical tricks that address training instability through learning rate warmup, layer normalization, attention formulation, and weight initialization, we show that Lipschitz continuity is a more essential property for ensuring training stability. In LipsFormer, we replace unstable Transformer component modules with Lipschitz continuous counterparts: CenterNorm instead of LayerNorm, spectral initialization instead of Xavier initialization, scaled cosine similarity attention instead of dot-product attention, and a weighted residual shortcut. We prove that these introduced modules are Lipschitz continuous and derive an upper bound on the Lipschitz constant of LipsFormer. Our experiments show that LipsFormer allows stable training of deep Transformer architectures without the need for careful learning rate tuning such as warmup, yielding faster convergence and better generalization. As a result, on the ImageNet-1K dataset, LipsFormer-Swin-Tiny, based on Swin Transformer and trained for 300 epochs, obtains 82.7% top-1 accuracy without any learning rate warmup. Moreover, LipsFormer-CSwin-Tiny, based on CSwin and trained for 300 epochs, achieves a top-1 accuracy of 83.5% with 4.7G FLOPs and 24M parameters. The code will be released at https://github.com/IDEA-Research/LipsFormer.
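
For concreteness, below is a hedged, simplified sketch of two of the replaced components, CenterNorm and scaled cosine-similarity attention, written from the abstract's description rather than the released code; the rescale factor and temperature are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterNorm(nn.Module):
    """Mean-centering with an affine transform but no division by the standard
    deviation -- a simplified reading of LipsFormer's LayerNorm replacement
    (dividing by a data-dependent std is what breaks the Lipschitz bound)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))
        self.scale = dim / (dim - 1)          # variance-compensating rescale after centering

    def forward(self, x):
        x = self.scale * (x - x.mean(dim=-1, keepdim=True))
        return self.gamma * x + self.beta

def cosine_attention(q, k, v, tau: float = 12.0):
    """Scaled cosine-similarity attention: q and k are L2-normalized, so the
    attention logits are bounded, unlike raw dot-product attention."""
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    attn = (q @ k.transpose(-2, -1)) * tau
    return attn.softmax(dim=-1) @ v

# Shape check on random tensors (batch=2, heads=4, tokens=16, head_dim=32).
q = torch.randn(2, 4, 16, 32); k = torch.randn_like(q); v = torch.randn_like(q)
print(cosine_attention(q, k, v).shape)  # torch.Size([2, 4, 16, 32])
```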

* To appear in ICLR 2023, our code will be public at https://github.com/IDEA-Research/LipsFormer 

DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training

Apr 17, 2023
Yihao Chen, Xianbiao Qi, Jianan Wang, Lei Zhang

We propose DisCo-CLIP, a distributed memory-efficient CLIP training approach, to reduce the memory consumption of the contrastive loss when training contrastive learning models. Our approach decomposes the contrastive loss and its gradient computation into two parts, one to calculate the intra-GPU gradients and the other to compute the inter-GPU gradients. According to our decomposition, only the intra-GPU gradients are computed on the current GPU, while the inter-GPU gradients are collected via all_reduce from other GPUs instead of being repeatedly computed on every GPU. In this way, we can reduce the GPU memory consumption of the contrastive loss computation from $\mathcal{O}(B^2)$ to $\mathcal{O}(B^2/N)$, where $B$ and $N$ are the batch size and the number of GPUs used for training. Such a distributed solution is mathematically equivalent to the original non-distributed contrastive loss computation, without sacrificing any computational accuracy. It is particularly efficient for large-batch CLIP training. For instance, DisCo-CLIP can enable contrastive training of a ViT-B/32 model with a batch size of 32K or 196K using 8 or 64 A100 40GB GPUs, compared with the original CLIP solution, which requires 128 A100 40GB GPUs to train a ViT-B/32 model with a batch size of 32K. The code will be released at https://github.com/IDEA-Research/DisCo-CLIP.
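
The sketch below illustrates the memory argument with a simplified local-loss formulation: each GPU gathers features from its peers but scores only its own B/N rows, so it never materializes the full B x B logit matrix. The all_reduce-based inter-GPU gradient exchange that makes DisCo-CLIP exactly equivalent to the global loss is omitted here for brevity.

```python
import torch
import torch.nn.functional as F
import torch.distributed as dist

def gather_features(feats: torch.Tensor) -> torch.Tensor:
    """All-gather L2-normalized features from every GPU (world size N)."""
    world = dist.get_world_size()
    gathered = [torch.zeros_like(feats) for _ in range(world)]
    dist.all_gather(gathered, feats)
    gathered[dist.get_rank()] = feats          # keep the autograd path for the local shard
    return torch.cat(gathered, dim=0)

def local_contrastive_loss(img_local, txt_local, temperature: float = 0.07):
    """Each GPU scores only its local B/N rows against the full gathered batch,
    materializing a (B/N x B) logit block instead of the full B x B matrix --
    the source of the O(B^2) -> O(B^2 / N) memory reduction described above.
    Inter-GPU gradient terms are dropped here; DisCo exchanges them via all_reduce."""
    img_all = gather_features(img_local)
    txt_all = gather_features(txt_local)
    logits_i = img_local @ txt_all.t() / temperature
    logits_t = txt_local @ img_all.t() / temperature
    rank, b = dist.get_rank(), img_local.shape[0]
    labels = torch.arange(b, device=img_local.device) + rank * b
    return 0.5 * (F.cross_entropy(logits_i, labels) + F.cross_entropy(logits_t, labels))
```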

* To appear in CVPR 2023 as a highlight, our code will be public at https://github.com/IDEA-Research/DisCo-CLIP 

Exploring Vision Transformers as Diffusion Learners

Dec 28, 2022
He Cao, Jianan Wang, Tianhe Ren, Xianbiao Qi, Yihao Chen, Yuan Yao, Lei Zhang

Score-based diffusion models have captured widespread attention and fueled the fast progress of recent vision generative tasks. In this paper, we focus on the diffusion model backbone, which has received comparatively little attention. We systematically explore vision Transformers as diffusion learners for various generative tasks. With our improvements, the performance of a vanilla ViT-based backbone (IU-ViT) is boosted to be on par with that of traditional U-Net-based methods. We further provide a hypothesis on the implication of disentangling the generative backbone into an encoder-decoder structure and show proof-of-concept experiments verifying the effectiveness of a stronger encoder for generative tasks with our ASymmetriC ENcoder Decoder (ASCEND). Our improvements achieve competitive results on CIFAR-10, CelebA, LSUN, CUB Bird, and large-resolution text-to-image tasks. To the best of our knowledge, we are the first to successfully train a single diffusion model on a text-to-image task beyond 64x64 resolution. We hope this will motivate people to rethink the modeling choices and the training pipelines for diffusion-based generative models.


The SpeakIn Speaker Verification System for Far-Field Speaker Verification Challenge 2022

Sep 23, 2022
Yu Zheng, Jinghan Peng, Yihao Chen, Yajun Zhang, Jialong Wang, Min Liu, Minqiang Xu

This paper describes the speaker verification (SV) systems submitted by the SpeakIn team to Task 1 and Task 2 of the Far-Field Speaker Verification Challenge 2022 (FFSVC2022). The SV tasks of the challenge focus on fully supervised far-field speaker verification (Task 1) and semi-supervised far-field speaker verification (Task 2). In Task 1, we used the VoxCeleb and FFSVC2020 datasets for training. For Task 2, we used only the VoxCeleb dataset as the training set. ResNet-based and RepVGG-based architectures were developed for this challenge. Global statistics pooling and MQMHA pooling structures were used to aggregate the frame-level features across time into utterance-level representations. We adopted AM-Softmax and AAM-Softmax to classify the resulting embeddings. We also propose a novel staged transfer learning method: in the pre-training stage we reserve the speaker weights, with no positive samples available to train them; in the second stage, we fine-tune these weights with both positive and negative samples. Compared with the traditional transfer learning strategy, this approach further improves model performance. The Sub-Mean and AS-Norm backend methods were used to address domain mismatch. In the fusion stage, three models were fused in Task 1 and two models were fused in Task 2. On the FFSVC2022 leaderboard, the EER of our submission is 3.0049% and the corresponding minDCF is 0.2938 in Task 1. In Task 2, the EER and minDCF are 6.2060% and 0.5232, respectively. Our approach leads to excellent performance and ranks 1st in both challenge tasks.
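
As a reference for the margin-based classification losses mentioned above, here is a minimal AAM-Softmax (additive angular margin) head; the margin and scale values are illustrative defaults, not the submission's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    """Additive angular margin softmax (AAM-Softmax / ArcFace-style) head.

    Embeddings and class weights are L2-normalized, a margin m is added to the
    target-class angle, and the cosine logits are scaled by s before cross-entropy.
    """
    def __init__(self, emb_dim: int, n_speakers: int, m: float = 0.2, s: float = 30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_speakers, emb_dim) * 0.01)
        self.m, self.s = m, s

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        cos = F.linear(F.normalize(emb), F.normalize(self.weight)).clamp(-1 + 1e-7, 1 - 1e-7)
        theta = torch.acos(cos)
        target = F.one_hot(labels, cos.shape[1]).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cos) * self.s
        return F.cross_entropy(logits, labels)

# Toy usage: 8 utterance embeddings of dimension 192, 100 speakers.
head = AAMSoftmax(192, 100)
loss = head(torch.randn(8, 192), torch.randint(0, 100, (8,)))
```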

* 5 pages. arXiv admin note: text overlap with arXiv:2209.10846 

The SpeakIn System Description for CNSRC2022

Sep 22, 2022
Yu Zheng, Yihao Chen, Jinghan Peng, Yajun Zhang, Min Liu, Minqiang Xu

This report describes our speaker verification systems for the tasks of the CN-Celeb Speaker Recognition Challenge 2022 (CNSRC 2022). This challenge includes two tasks, namely speaker verification (SV) and speaker retrieval (SR). The SV task involves two tracks: a fixed track and an open track. In the fixed track, we used only CN-Celeb.T as the training set. For the open track of the SV task and for the SR task, we added open-source audio data. ResNet-based, RepVGG-based, and TDNN-based architectures were developed for this challenge. Global statistics pooling and MQMHA pooling structures were used to aggregate the frame-level features across time into utterance-level representations. We adopted AM-Softmax and AAM-Softmax combined with the Sub-Center method to classify the resulting embeddings. We also used the Large-Margin Fine-Tuning strategy to further improve model performance. In the backend, Sub-Mean and AS-Norm were used. In the fixed track of the SV task, our system was a fusion of five models; two models were fused in the open track of the SV task, and a single system was used in the SR task. Our approach leads to superior performance and takes 1st place in the open track of the SV task, 2nd place in the fixed track of the SV task, and 3rd place in the SR task.
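
For the backend, the sketch below shows the standard adaptive score normalization (AS-Norm) of a cosine trial score against a cohort; the cohort size and top-k value are illustrative, not the team's exact configuration.

```python
import torch
import torch.nn.functional as F

def as_norm(score: torch.Tensor, enroll: torch.Tensor, test: torch.Tensor,
            cohort: torch.Tensor, top_k: int = 300) -> torch.Tensor:
    """Adaptive score normalization (AS-Norm) of a cosine trial score.

    The raw enrollment/test score is normalized twice, each time against the
    statistics of the top-k most similar cohort speakers, and the two
    normalized scores are averaged.
    """
    enroll, test, cohort = (F.normalize(x, dim=-1) for x in (enroll, test, cohort))
    e_scores = cohort @ enroll          # cohort vs. enrollment
    t_scores = cohort @ test            # cohort vs. test
    e_top = e_scores.topk(min(top_k, len(e_scores))).values
    t_top = t_scores.topk(min(top_k, len(t_scores))).values
    z_e = (score - e_top.mean()) / (e_top.std() + 1e-9)
    z_t = (score - t_top.mean()) / (t_top.std() + 1e-9)
    return 0.5 * (z_e + z_t)

# Toy usage: 192-dim embeddings and a 1000-speaker cohort.
e, t = torch.randn(192), torch.randn(192)
cohort = torch.randn(1000, 192)
raw = F.cosine_similarity(e, t, dim=0)
print(as_norm(raw, e, t, cohort))
```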

* 4 pages 

VDDB: a comprehensive resource and machine learning platform for antiviral drug discovery

Sep 17, 2022
Shunming Tao, Yihao Chen, Jingxing Wu, Duancheng Zhao, Hanxuan Cai, Ling Wang

Viral infections are among the major diseases that seriously threaten human health. To meet the growing demand for mining and sharing data resources related to antiviral drugs and to accelerate the design and discovery of new antiviral drugs, we present an open-access antiviral drug resource and machine learning platform (VDDB), which, to the best of our knowledge, is the first comprehensive dedicated resource for experimentally verified potential drugs/molecules based on manually curated data. Currently, VDDB highlights 848 clinical vaccines, 199 clinical antibodies, as well as over 710,000 small molecules targeting 39 medically important viruses, including SARS-CoV-2. Furthermore, VDDB stores approximately 3 million records of pharmacological data for these collected potential antiviral drugs/molecules, involving 314 cell infection-based phenotypic and 234 target-based genotypic assays. Based on these annotated pharmacological data, VDDB allows users to browse, search, and download reliable information about these collections for various viruses of interest. In particular, VDDB also integrates 57 cell infection-based and 117 target-based high-accuracy machine learning models to support various antiviral identification-related tasks, such as compound activity prediction, virtual screening, drug repositioning, and target fishing. VDDB is freely accessible at http://vddb.idruglab.cn.


3D Shuffle-Mixer: An Efficient Context-Aware Vision Learner of Transformer-MLP Paradigm for Dense Prediction in Medical Volume

Apr 14, 2022
Jianye Pang, Cheng Jiang, Yihao Chen, Jianbo Chang, Ming Feng, Renzhi Wang, Jianhua Yao

Dense prediction in medical volumes provides enriched guidance for clinical analysis. CNN backbones have hit a bottleneck due to their lack of long-range dependencies and global context modeling power. Recent works have proposed combining vision transformers with CNNs, owing to the transformer's strong global modeling and learning capability. However, most works simply apply a pure transformer, with several serious drawbacks (i.e., lack of inductive bias, heavy computation, and little consideration of 3D data). Therefore, designing an elegant and efficient vision transformer learner for dense prediction in medical volumes is promising and challenging. In this paper, we propose a novel 3D Shuffle-Mixer network of a new Local Vision Transformer-MLP paradigm for medical dense prediction. In our network, a local vision transformer block is utilized to shuffle and learn spatial context from full-view slices of the rearranged volume, a residual axial-MLP is designed to mix and capture the remaining volume context in a slice-aware manner, and an MLP view aggregator is employed to project the learned full-view rich context onto the volume feature in a view-aware manner. Moreover, an Adaptive Scaled Enhanced Shortcut is proposed for the local vision transformer to adaptively enhance features along the spatial and channel dimensions, and a CrossMerge is proposed to skip-connect multi-scale features appropriately in the pyramid architecture. Extensive experiments demonstrate that the proposed model outperforms other state-of-the-art medical dense prediction methods.
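
One plausible reading of the residual axial-MLP described above is sketched below, written only from the abstract; the actual block's axes, weight sharing, and normalization may differ.

```python
import torch
import torch.nn as nn

class ResidualAxialMLP(nn.Module):
    """A plausible sketch of a residual axial-MLP for a 3D feature volume:
    token mixing is applied along each spatial axis (D, H, W) in turn with a
    residual connection, so long-range context is captured per axis at linear
    cost. Written from the abstract's description, not the authors' code."""
    def __init__(self, d: int, h: int, w: int, expansion: int = 2):
        super().__init__()
        def mixer(n):                              # small per-axis MLP
            return nn.Sequential(nn.Linear(n, n * expansion), nn.GELU(),
                                 nn.Linear(n * expansion, n))
        self.mix_d, self.mix_h, self.mix_w = mixer(d), mixer(h), mixer(w)

    def forward(self, x):                          # x: (B, C, D, H, W)
        x = x + self.mix_d(x.permute(0, 1, 3, 4, 2)).permute(0, 1, 4, 2, 3)
        x = x + self.mix_h(x.permute(0, 1, 2, 4, 3)).permute(0, 1, 2, 4, 3)
        x = x + self.mix_w(x)
        return x

# Shape check: batch of 2 volumes with 16 channels on an 8x8x8 grid.
print(ResidualAxialMLP(8, 8, 8)(torch.randn(2, 16, 8, 8, 8)).shape)
```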


Fusion of medical imaging and electronic health records with attention and multi-head mechanisms

Dec 22, 2021
Cheng Jiang, Yihao Chen, Jianbo Chang, Ming Feng, Renzhi Wang, Jianhua Yao

Doctors often make diagnostic decisions based on a patient's image scans, such as magnetic resonance imaging (MRI), and electronic health records (EHR), such as age, gender, and blood pressure. Although many automatic methods have been proposed for either image or text analysis in the computer vision or natural language research areas, far fewer studies have addressed the fusion of medical image and EHR data for medical problems. Among existing early or intermediate fusion methods, concatenation of features from both modalities is still the mainstream. To better exploit image and EHR data, we propose a multi-modal attention module that uses EHR data to guide the selection of important regions during the image feature extraction performed by a traditional CNN. Moreover, we incorporate a multi-head mechanism into the gated multimodal unit (GMU) so that it can fuse image and EHR features in different subspaces in parallel. With the help of the two modules, existing CNN architectures can be enhanced using both modalities. Experiments on predicting the Glasgow Outcome Scale (GOS) of intracerebral hemorrhage patients and on classifying Alzheimer's disease showed that the proposed method can automatically focus on task-related areas and achieves better results by making better use of image and EHR features.
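
As an illustration of the multi-head gated fusion idea, the sketch below applies a GMU-style gate per head so that image and EHR features are blended in separate subspaces; it is written from the abstract's description, and the projections and gating may differ from the paper's exact module.

```python
import torch
import torch.nn as nn

class MultiHeadGMU(nn.Module):
    """Gated multimodal unit (GMU) with one gate per head, so image and EHR
    features are fused in several subspaces in parallel. A sketch written from
    the abstract; the paper's exact projections and gating may differ."""
    def __init__(self, img_dim: int, ehr_dim: int, out_dim: int, heads: int = 4):
        super().__init__()
        assert out_dim % heads == 0
        self.heads, self.head_dim = heads, out_dim // heads
        self.proj_img = nn.Linear(img_dim, out_dim)
        self.proj_ehr = nn.Linear(ehr_dim, out_dim)
        self.gate = nn.Linear(img_dim + ehr_dim, heads)   # one scalar gate per head

    def forward(self, img_feat, ehr_feat):
        b = img_feat.shape[0]
        h_img = torch.tanh(self.proj_img(img_feat)).view(b, self.heads, self.head_dim)
        h_ehr = torch.tanh(self.proj_ehr(ehr_feat)).view(b, self.heads, self.head_dim)
        z = torch.sigmoid(self.gate(torch.cat([img_feat, ehr_feat], dim=-1)))  # (b, heads)
        fused = z.unsqueeze(-1) * h_img + (1.0 - z.unsqueeze(-1)) * h_ehr
        return fused.reshape(b, -1)                        # (b, out_dim)

# Toy usage: 512-d CNN image features fused with 16-d EHR features into 256-d.
fused = MultiHeadGMU(512, 16, 256)(torch.randn(8, 512), torch.randn(8, 16))
print(fused.shape)  # torch.Size([8, 256])
```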
