Jiantao Qiu

MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

Apr 06, 2026
Via arXiv

Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale

Mar 26, 2026
Via arXiv

Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification

Jun 08, 2025
Via arXiv

Hallucination at a Glance: Controlled Visual Edits and Fine-Grained Multimodal Learning

Jun 08, 2025
Via arXiv

Not All Documents Are What You Need for Extracting Instruction Tuning Data

May 18, 2025
Via arXiv

Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models

Apr 19, 2025
Via arXiv

WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages

Jan 24, 2025
Via arXiv

Harnessing Diversity for Important Data Selection in Pretraining Large Language Models

Sep 25, 2024
Via arXiv

InternLM2 Technical Report

Mar 26, 2024
Via arXiv

WanJuan-CC: A Safe and High-Quality Open-sourced English Webtext Dataset

Mar 12, 2024
Via arXiv