
Yimeng Wu

Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs

May 07, 2025

Scaling Law for Language Models Training Considering Batch Size

Dec 02, 2024

ParaLBench: A Large-Scale Benchmark for Computational Paralinguistics over Acoustic Foundation Models

Nov 14, 2024

AraMUS: Pushing the Limits of Data and Model Scale for Arabic Natural Language Processing

Jun 11, 2023

Revisiting Pre-trained Language Models and their Evaluation for Arabic Natural Language Understanding

May 21, 2022

JABER and SABER: Junior and Senior Arabic BERt

Jan 09, 2022

ALP-KD: Attention-Based Layer Projection for Knowledge Distillation

Dec 27, 2020

Why Skip If You Can Combine: A Simple Knowledge Distillation Technique for Intermediate Layers

Oct 06, 2020