Abstract:This paper introduces MiniCPM4, a highly efficient large language model (LLM) designed explicitly for end-side devices. We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems. Specifically, in terms of model architecture, we propose InfLLM v2, a trainable sparse attention mechanism that accelerates both prefilling and decoding phases for long-context processing. Regarding training data, we propose UltraClean, an efficient and accurate pre-training data filtering and generation strategy, and UltraChat v2, a comprehensive supervised fine-tuning dataset. These datasets enable satisfactory model performance to be achieved using just 8 trillion training tokens. Regarding training algorithms, we propose ModelTunnel v2 for efficient pre-training strategy search, and improve existing post-training methods by introducing chunk-wise rollout for load-balanced reinforcement learning and BitCPM, a data-efficient ternary LLM. Regarding inference systems, we propose CPM.cu, which integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding. To meet diverse on-device requirements, MiniCPM4 is available in two versions, with 0.5B and 8B parameters, respectively. Extensive evaluation results show that MiniCPM4 outperforms open-source models of similar size across multiple benchmarks, highlighting both its efficiency and effectiveness. Notably, MiniCPM4-8B demonstrates significant speed improvements over Qwen3-8B when processing long sequences. Through further adaptation, MiniCPM4 successfully powers diverse applications, including trustworthy survey generation and tool use with the Model Context Protocol, clearly showcasing its broad usability.
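Below is a minimal sketch of the kind of block-sparse attention that a trainable sparse attention mechanism such as InfLLM v2 builds on: each query attends only to its top-k key/value blocks, ranked by a cheap relevance score. The block size, the mean-pooled block scoring rule, and the single-head layout are illustrative assumptions, not the paper's actual design.

```python
# Sketch: each query attends only to its top-k key/value blocks (illustrative
# shapes and scoring rule; not the actual InfLLM v2 implementation).
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block_size=64, top_k=4):
    # q: (T, d) queries; k, v: (S, d) keys/values; single head for clarity.
    T, d = q.shape
    S = k.shape[0]
    n_blocks = (S + block_size - 1) // block_size
    pad = n_blocks * block_size - S                      # pad keys/values to whole blocks
    k_blocks = F.pad(k, (0, 0, 0, pad)).view(n_blocks, block_size, d)
    v_blocks = F.pad(v, (0, 0, 0, pad)).view(n_blocks, block_size, d)
    # Cheap block relevance: similarity between each query and mean-pooled block keys.
    block_scores = q @ k_blocks.mean(dim=1).T            # (T, n_blocks)
    top_idx = block_scores.topk(min(top_k, n_blocks), dim=-1).indices
    out = torch.zeros_like(q)
    for t in range(T):                                    # per-query loop kept for readability
        sel_k = k_blocks[top_idx[t]].reshape(-1, d)       # (top_k * block_size, d)
        sel_v = v_blocks[top_idx[t]].reshape(-1, d)
        attn = F.softmax(q[t] @ sel_k.T / d ** 0.5, dim=-1)
        out[t] = attn @ sel_v
    return out

# 256 queries attend to 4,096 keys, but each query only touches 4 blocks of 64 keys.
q, k, v = torch.randn(256, 64), torch.randn(4096, 64), torch.randn(4096, 64)
print(block_sparse_attention(q, k, v).shape)  # torch.Size([256, 64])
```

The saving comes from replacing the full T×S score matrix with only T×(top_k·block_size) selected entries, which is what makes both prefilling and decoding cheaper on long contexts.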
Abstract:This work introduces CLIP-aware Domain-Adaptive Super-Resolution (CDASR), a novel framework that addresses the critical challenge of domain generalization in single image super-resolution. By leveraging the semantic capabilities of CLIP (Contrastive Language-Image Pre-training), CDASR achieves unprecedented performance across diverse domains and extreme scaling factors. The proposed method integrates a CLIP-guided feature alignment mechanism with a meta-learning-inspired few-shot adaptation strategy, enabling efficient knowledge transfer and rapid adaptation to target domains. A custom domain-adaptive module processes CLIP features alongside super-resolution features through a multi-stage transformation process, including CLIP feature processing, spatial feature generation, and feature fusion. This process ensures effective incorporation of semantic information into the super-resolution pipeline. Additionally, CDASR employs a multi-component loss function that combines pixel-wise reconstruction, perceptual similarity, and semantic consistency. Extensive experiments on benchmark datasets demonstrate CDASR's superiority, particularly in challenging scenarios. On the Urban100 dataset at $\times$8 scaling, CDASR achieves a significant PSNR gain of 0.15dB over existing methods, with even larger improvements of up to 0.30dB observed at $\times$16 scaling.
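A minimal sketch of a domain-adaptive fusion block in the spirit described above: a global CLIP image embedding is processed, broadcast into a spatial modulation, and fused with intermediate super-resolution features. The channel sizes and the modulation-plus-1x1-conv fusion rule are assumptions for illustration, not the paper's exact module.

```python
import torch
import torch.nn as nn

class CLIPGuidedFusion(nn.Module):
    def __init__(self, sr_channels=64, clip_dim=512):
        super().__init__()
        self.clip_proj = nn.Sequential(                   # stage 1: CLIP feature processing
            nn.Linear(clip_dim, sr_channels), nn.GELU(),
            nn.Linear(sr_channels, 2 * sr_channels))
        self.fuse = nn.Conv2d(2 * sr_channels, sr_channels, kernel_size=1)  # stage 3: fusion

    def forward(self, sr_feat, clip_emb):
        # sr_feat: (B, C, H, W) SR backbone features; clip_emb: (B, clip_dim) image embedding.
        B, C, H, W = sr_feat.shape
        scale, shift = self.clip_proj(clip_emb).chunk(2, dim=-1)            # (B, C) each
        # stage 2: spatial feature generation via channel-wise modulation
        modulated = sr_feat * scale.view(B, C, 1, 1) + shift.view(B, C, 1, 1)
        return self.fuse(torch.cat([sr_feat, modulated], dim=1))            # semantic-aware fusion

# Example: inject CLIP semantics into a 64-channel feature map.
feat, emb = torch.randn(2, 64, 48, 48), torch.randn(2, 512)
print(CLIPGuidedFusion()(feat, emb).shape)  # torch.Size([2, 64, 48, 48])
```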
Abstract:Fabric defect detection confronts two fundamental challenges. First, conventional non-maximum suppression disrupts gradient flow, which hinders genuine end-to-end learning. Second, acquiring pixel-level annotations at industrial scale is prohibitively costly. Addressing these limitations, we propose a differentiable NMS framework for fabric defect detection that achieves superior localization precision through end-to-end optimization. We reformulate NMS as a differentiable bipartite matching problem solved through the Sinkhorn-Knopp algorithm, maintaining uninterrupted gradient flow throughout the network. This approach specifically targets the irregular morphologies and ambiguous boundaries of fabric defects by integrating proposal quality, feature similarity, and spatial relationships. Our entropy-constrained mask refinement mechanism further enhances localization precision through principled uncertainty modeling. Extensive experiments on the Tianchi fabric defect dataset demonstrate significant performance improvements over existing methods while maintaining real-time speeds suitable for industrial deployment. The framework exhibits remarkable adaptability across different architectures and generalizes effectively to general object detection tasks.
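A minimal sketch of the differentiable building block this abstract relies on: Sinkhorn-Knopp iterations turn a proposal-to-slot cost matrix into a soft, nearly doubly-stochastic assignment, so suppression decisions remain differentiable end to end. The cost weights, marginals, and iteration count are illustrative assumptions, not the paper's exact formulation.

```python
import math
import torch

def sinkhorn(cost, n_iters=50, eps=0.1):
    # cost: (N, M) matching cost between N proposals and M output slots.
    N, M = cost.shape
    log_a = torch.full((N, 1), -math.log(N))   # uniform marginal over proposals
    log_b = torch.full((1, M), -math.log(M))   # uniform marginal over kept slots
    log_p = -cost / eps                        # entropic-regularized scores in log space
    for _ in range(n_iters):                   # alternate row/column rescaling
        log_p = log_p + (log_a - log_p.logsumexp(dim=1, keepdim=True))
        log_p = log_p + (log_b - log_p.logsumexp(dim=0, keepdim=True))
    return log_p.exp()                         # gradients flow through every step

# Example cost combining proposal quality, feature similarity, and spatial overlap
# (random values here, only to exercise the function).
quality  = torch.rand(12, 1)    # per-proposal confidence, broadcast over slots
feat_sim = torch.rand(12, 5)    # proposal-to-slot feature similarity
iou      = torch.rand(12, 5)    # proposal-to-slot spatial overlap
cost = 1.0 - (0.4 * quality + 0.3 * feat_sim + 0.3 * iou)
plan = sinkhorn(cost)
print(plan.shape, plan.sum(dim=0))  # each column sums to 1/5, its uniform marginal
```

Because every operation above is smooth, gradients can propagate from the final detections back through the suppression step, which is exactly what hard non-maximum suppression prevents.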
Abstract:We present ReaderLM-v2, a compact 1.5 billion parameter language model designed for efficient web content extraction. Our model processes documents up to 512K tokens, transforming messy HTML into clean Markdown or JSON formats with high accuracy -- making it an ideal tool for grounding large language models. The model's effectiveness results from two key innovations: (1) a three-stage data synthesis pipeline that generates high-quality, diverse training data by iteratively drafting, refining, and critiquing web content extraction; and (2) a unified training framework combining continuous pre-training with multi-objective optimization. Extensive evaluation demonstrates that ReaderLM-v2 outperforms GPT-4o-2024-08-06 and other larger models by 15-20\% on carefully curated benchmarks, particularly excelling at documents exceeding 100K tokens, while maintaining significantly lower computational requirements.
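A schematic of a draft-refine-critique synthesis loop of the kind the pipeline describes. Here `llm` is a hypothetical text-generation callable (any completion API could be plugged in), and the prompts and stopping rule are assumptions for illustration, not the actual pipeline.

```python
def synthesize_example(llm, raw_html, max_rounds=3):
    # Stage 1: draft an extraction of the page content.
    draft = llm(f"Extract the main content of this HTML as Markdown:\n{raw_html}")
    for _ in range(max_rounds):
        # Stage 2: critique the current extraction.
        critique = llm(f"List errors or omissions in this extraction:\n{draft}")
        if "no issues" in critique.lower():          # illustrative stop criterion
            break
        # Stage 3: refine the draft based on the critique.
        draft = llm(f"Revise the extraction to fix these issues:\n{critique}\n---\n{draft}")
    return {"input": raw_html, "target": draft}      # one (HTML, Markdown) training pair
```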
Abstract:Since the introduction of Vision Transformer (ViT), patchification has long been regarded as a de facto image tokenization approach for plain visual architectures. By compressing the spatial size of images, this approach can effectively shorten the token sequence and reduce the computational cost of ViT-like plain architectures. In this work, we aim to thoroughly examine the information loss caused by this patchification-based compressive encoding paradigm and how it affects visual understanding. We conduct extensive patch size scaling experiments and observe an intriguing scaling law in patchification: models consistently benefit from decreased patch sizes and attain improved predictive performance until the patch size reaches its minimum of 1x1, i.e., pixel tokenization. This conclusion is broadly applicable across different vision tasks, various input scales, and diverse architectures such as ViT and the recent Mamba models. Moreover, as a by-product, we discover that with smaller patches, task-specific decoder heads become less critical for dense prediction. In the experiments, we successfully scale up the visual sequence to an exceptional length of 50,176 tokens, achieving a competitive test accuracy of 84.6% with a base-sized model on the ImageNet-1k benchmark. We hope this study can provide insights and theoretical foundations for future work on building non-compressive vision models. Code is available at https://github.com/wangf3014/Patch_Scaling.
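A quick check of the sequence lengths implied by patch-size scaling: an H x W image tokenized with patch size p yields (H // p) * (W // p) tokens, which for a standard 224 x 224 input recovers the 50,176-token sequence reported above at the 1x1 limit.

```python
# Token count as a function of patch size for a 224 x 224 input.
for p in (16, 8, 4, 2, 1):
    tokens = (224 // p) ** 2
    print(f"patch {p:>2}x{p:<2} -> {tokens:>6,} tokens")
# patch 16x16 ->    196 tokens; ...; patch  1x1  -> 50,176 tokens (the length reported above)
```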
Abstract:Recent advancements have highlighted the Mamba framework, a state-space model known for its efficiency in capturing long-range dependencies with linear computational complexity. While Mamba has shown competitive performance in medical image segmentation, it encounters difficulties in modeling local features due to the sporadic nature of traditional location-based scanning methods and the complex, ambiguous boundaries often present in medical images. To overcome these challenges, we propose Uncertainty-Driven Mamba (UD-Mamba), which redefines the pixel-order scanning process by incorporating channel uncertainty into the scanning mechanism. UD-Mamba introduces two key scanning techniques: 1) sequential scanning, which prioritizes regions with high uncertainty by scanning in a row-by-row fashion, and 2) skip scanning, which processes columns vertically, moving from high-to-low or low-to-high uncertainty at fixed intervals. Sequential scanning efficiently clusters high-uncertainty regions, such as boundaries and foreground objects, to improve segmentation precision, while skip scanning enhances the interaction between background and foreground regions, allowing for timely integration of background information to support more accurate foreground inference. Recognizing the advantages of scanning from certain to uncertain areas, we introduce four learnable parameters to balance the importance of features extracted from different scanning methods. Additionally, a cosine consistency loss is employed to mitigate the drawbacks of transitioning between uncertain and certain regions during the scanning process. Our method demonstrates robust segmentation performance, validated across three distinct medical imaging datasets involving pathology, dermatological lesions, and cardiac tasks.
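A minimal sketch of the uncertainty-driven reordering idea: a per-pixel uncertainty estimate (here, the entropy of a channel softmax) determines the order in which flattened pixels enter the sequence model. The entropy-based estimate and the simple global sort are illustrative assumptions; the actual sequential and skip scans are more structured (row-wise and interval-based, respectively).

```python
import torch

def uncertainty_scan(features, descending=True):
    # features: (B, C, H, W); returns pixels reordered from high to low uncertainty.
    B, C, H, W = features.shape
    probs = features.softmax(dim=1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)        # (B, H, W)
    order = entropy.flatten(1).argsort(dim=1, descending=descending)   # (B, H*W)
    flat = features.flatten(2).transpose(1, 2)                         # (B, H*W, C)
    scanned = torch.gather(flat, 1, order.unsqueeze(-1).expand(-1, -1, C))
    return scanned, order   # `order` lets outputs be scattered back onto the grid

seq, order = uncertainty_scan(torch.randn(2, 8, 32, 32))
print(seq.shape)  # torch.Size([2, 1024, 8]); the most uncertain pixels come first
```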
Abstract:Magnetotelluric (MT) forward modeling is fundamental for improving the accuracy and efficiency of MT inversion. Neural operators (NOs) have been effectively used for rapid MT forward modeling, demonstrating promising performance in solving the partial differential equations (PDEs) that underlie MT forward modeling. In particular, they can obtain the electromagnetic field at arbitrary locations and frequencies. In these NOs, the projection layers have been dominated by multi-layer perceptrons (MLPs), which may reduce solution accuracy because MLPs suffer from well-known drawbacks such as limited interpretability and a tendency to overfit. Therefore, to improve the accuracy of MT forward modeling with NOs and to explore potential alternatives to MLPs, we propose EFKAN, a novel neural operator that extends the Fourier neural operator (FNO) with a Kolmogorov-Arnold network (KAN). Within the EFKAN framework, the FNO serves as the branch network to calculate the apparent resistivity and phase from the resistivity model in the frequency domain, while the KAN acts as the trunk network to project the resistivity and phase computed by the FNO to the desired locations and frequencies. Experimental results demonstrate that the proposed method not only achieves higher accuracy in obtaining apparent resistivity and phase at the desired frequencies and locations compared to the NO equipped with MLPs, but also outperforms traditional numerical methods in terms of computational speed.
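A schematic of the branch-trunk pattern the abstract describes: a branch network encodes the resistivity model, a trunk network encodes the query (location, frequency) coordinates, and their inner product gives the predicted apparent resistivity and phase at those queries. Both sub-networks are plain MLP stand-ins here purely to show the wiring; in EFKAN the branch is an FNO and the trunk a KAN.

```python
import torch
import torch.nn as nn

class BranchTrunkOperator(nn.Module):
    def __init__(self, model_dim=256, coord_dim=2, latent=128, out_dim=2):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(model_dim, 256), nn.GELU(),
                                    nn.Linear(256, latent * out_dim))   # encodes the resistivity model
        self.trunk = nn.Sequential(nn.Linear(coord_dim, 128), nn.GELU(),
                                   nn.Linear(128, latent))              # encodes (location, frequency)
        self.out_dim, self.latent = out_dim, latent

    def forward(self, resistivity, coords):
        # resistivity: (B, model_dim) flattened model; coords: (B, Q, 2) query (location, frequency).
        b = self.branch(resistivity).view(-1, self.out_dim, self.latent)   # (B, 2, latent)
        t = self.trunk(coords)                                             # (B, Q, latent)
        return torch.einsum("bol,bql->bqo", b, t)   # (B, Q, 2): apparent resistivity and phase

model = BranchTrunkOperator()
rho, queries = torch.randn(4, 256), torch.rand(4, 100, 2)
print(model(rho, queries).shape)  # torch.Size([4, 100, 2])
```

Because the trunk is evaluated per query point, the operator can be asked for the response at arbitrary locations and frequencies without re-running the branch.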
Abstract:Physical and optical factors interacting with sensor characteristics create complex image degradation patterns. Despite advances in deep learning-based super-resolution, existing methods overlook the causal nature of degradation by adopting simplistic black-box mappings. This paper formulates super-resolution using structural causal models to reason about image degradation processes. We establish a mathematical foundation that unifies principles from causal inference, deriving necessary conditions for identifying latent degradation mechanisms and corresponding propagation. We propose a novel counterfactual learning strategy that leverages semantic guidance to reason about hypothetical degradation scenarios, leading to theoretically-grounded representations that capture invariant features across different degradation conditions. The framework incorporates an adaptive intervention mechanism with provable bounds on treatment effects, allowing precise manipulation of degradation factors while maintaining semantic consistency. Through extensive empirical validation, we demonstrate that our approach achieves significant improvements over state-of-the-art methods, particularly in challenging scenarios with compound degradations. On standard benchmarks, our method consistently outperforms existing approaches by significant margins (0.86-1.21dB PSNR), while providing interpretable insights into the restoration process. The theoretical framework and empirical results demonstrate the fundamental importance of causal reasoning in understanding image restoration systems.
Abstract:Animated video separates foreground and background elements into layers, with distinct processes for sketching, refining, coloring, and in-betweening. Existing video generation methods typically treat animation as a monolithic data domain, lacking fine-grained control over individual layers. In this paper, we introduce LayerAnimate, a novel architectural approach that enhances fine-grained control over individual animation layers within a video diffusion model, allowing users to independently manipulate foreground and background elements in distinct layers. To address the challenge of limited layer-specific data, we propose a data curation pipeline that features automated element segmentation, motion-state hierarchical merging, and motion coherence refinement. Through quantitative and qualitative comparisons and a user study, we demonstrate that LayerAnimate outperforms current methods in terms of animation quality, control precision, and usability, making it an ideal tool for both professional animators and amateur enthusiasts. This framework opens up new possibilities for layer-specific animation applications and creative flexibility. Our code is available at https://layeranimate.github.io.
Abstract:Reflections often degrade the visual quality of images captured through transparent surfaces, and reflection removal methods suffer from a shortage of paired real-world samples. This paper proposes a hybrid approach that combines cycle-consistency with denoising diffusion probabilistic models (DDPM) to effectively remove reflections from single images without requiring paired training data. The method introduces a Reflective Removal Network (RRN) that leverages DDPMs to model the decomposition process and recover the transmission image, and a Reflective Synthesis Network (RSN) that re-synthesizes the input image using the separated components through a nonlinear attention-based mechanism. Experimental results demonstrate the effectiveness of the proposed method on the SIR$^2$ dataset, the Flash-Based Reflection Removal (FRR) dataset, and a newly introduced Museum Reflection Removal (MRR) dataset, showing superior performance compared to state-of-the-art methods.
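A schematic of the cycle-consistency objective described above: a removal network predicts transmission and reflection components, a synthesis network recombines them, and the re-synthesized image is pushed back toward the input, giving supervision without paired data. The tiny convolutional stand-ins and the plain L1 loss are illustrative assumptions; the actual RRN is diffusion-based and the RSN uses attention-based nonlinear blending.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

removal   = nn.Conv2d(3, 6, 3, padding=1)   # stand-in for the Reflective Removal Network
synthesis = nn.Conv2d(6, 3, 3, padding=1)   # stand-in for the Reflective Synthesis Network

def cycle_loss(image):
    layers = removal(image)                           # (B, 6, H, W): transmission + reflection
    transmission, reflection = layers.chunk(2, dim=1)
    reconstructed = synthesis(torch.cat([transmission, reflection], dim=1))
    return F.l1_loss(reconstructed, image)            # cycle-consistency supervision, no pairs needed

img = torch.rand(2, 3, 64, 64)
print(cycle_loss(img).item())
```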