Conghui He

MolRecBench-Wild: A Real-World Benchmark for Optical Chemical Structure Recognition

May 07, 2026
Via arXiv

Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora

Apr 27, 2026
Via arXiv

Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs

Apr 12, 2026
Via arXiv

MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

Apr 06, 2026
Via arXiv

Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development

Mar 29, 2026
Via arXiv

DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models

Mar 27, 2026
Via arXiv

Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale

Mar 26, 2026
Via arXiv

MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding

Mar 23, 2026
Via arXiv

Molecular Identifier Visual Prompt and Verifiable Reinforcement Learning for Chemical Reaction Diagram Parsing

Mar 17, 2026
Via arXiv

AgenticOCR: Parsing Only What You Need for Efficient Retrieval-Augmented Generation

Feb 27, 2026
Via arXiv