Pengcheng He

MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation

Apr 15, 2022
Simiao Zuo, Qingru Zhang, Chen Liang, Pengcheng He, Tuo Zhao, Weizhu Chen

CAMERO: Consistency Regularized Ensemble of Perturbed Language Models with Weight Sharing

Apr 13, 2022
Chen Liang, Pengcheng He, Yelong Shen, Weizhu Chen, Tuo Zhao

Truncated Diffusion Probabilistic Models

Feb 19, 2022
Huangjie Zheng, Pengcheng He, Weizhu Chen, Mingyuan Zhou

No Parameters Left Behind: Sensitivity Guided Adaptive Learning Rate for Training Large Transformer Models

Feb 14, 2022
Chen Liang, Haoming Jiang, Simiao Zuo, Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, Tuo Zhao

Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs

Feb 14, 2022
Huangjie Zheng, Pengcheng He, Weizhu Chen, Mingyuan Zhou

Human Parity on CommonsenseQA: Augmenting Self-Attention with External Attention

Dec 14, 2021
Yichong Xu, Chenguang Zhu, Shuohang Wang, Siqi Sun, Hao Cheng, Xiaodong Liu, Jianfeng Gao, Pengcheng He, Michael Zeng, Xuedong Huang

DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing

Nov 18, 2021
Pengcheng He, Jianfeng Gao, Weizhu Chen

ARCH: Efficient Adversarial Regularized Training with Caching

Sep 15, 2021
Simiao Zuo, Chen Liang, Haoming Jiang, Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, Tuo Zhao
