Abstract:Current no-reference image quality assessment (NR-IQA) models for enhanced images often struggle to generalize, as they tend to overfit to the distinct patterns of specific enhancement algorithms rather than evaluating genuine perceptual quality. To address this issue, we propose a preference-guided debiasing framework for no-reference enhancement image quality assessment (EIQA). Specifically, we first learn a continuous enhancement-preference embedding space using supervised contrastive learning, where images generated by similar enhancement styles are encouraged to have closer representations. Based on this, we further estimate the enhancement-induced nuisance component contained in the raw quality representation and remove it before quality regression. In this way, the model is guided to focus on algorithm-invariant perceptual quality cues instead of enhancement-specific visual fingerprints. To facilitate stable optimization, we adopt a two-stage training strategy that first learns the enhancement-preference space and then performs debiased quality prediction. Extensive experiments on public EIQA benchmarks demonstrate that the proposed method effectively mitigates algorithm-induced representation bias and achieves superior robustness and cross-algorithm generalization compared with existing approaches.
Abstract:Despite the success of Large Language Models (LLMs) in table understanding, their internal mechanisms remain unclear. In this paper, we conduct an empirical study on 16 LLMs, covering general LLMs, specialist tabular LLMs, and Mixture-of-Experts (MoE) models, to explore how LLMs understand tabular data and perform downstream tasks. Our analysis focus on 4 dimensions including the attention dynamics, the effective layer depth, the expert activation, and the impacts of input designs. Key findings include: (1) LLMs follow a three-phase attention pattern -- early layers scan the table broadly, middle layers localize relevant cells, and late layers amplify their contributions; (2) tabular tasks require deeper layers than math reasoning to reach stable predictions; (3) MoE models activate table-specific experts in middle layers, with early and late layers sharing general-purpose experts; (4) Chain-of-Thought prompting increases table attention, further enhanced by table-tuning. We hope these findings and insights can facilitate interpretability and future research on table-related tasks.
Abstract:Implicit Neural Representations (INRs) have emerged as a promising paradigm for video representation and compression. However, existing multi-scale INR generators often suffer from significant parameter redundancy by stacking independent processing blocks for each scale. Inspired by the principle of scale self-similarity in the generation process, we propose SRNeRV, a novel scale-wise recursive framework that replaces this stacked design with a parameter-efficient shared architecture. The core of our approach is a hybrid sharing scheme derived from decoupling the processing block into a scale-specific spatial mixing module and a scale-invariant channel mixing module. We recursively apply the same shared channel mixing module, which contains the majority of the parameters, across all scales, significantly reducing the model size while preserving the crucial capacity to learn scale-specific spatial patterns. Extensive experiments demonstrate that SRNeRV achieves a significant rate-distortion performance boost, especially in INR-friendly scenarios, validating that our sharing scheme successfully amplifies the core strengths of the INR paradigm.
Abstract:Classifier-Free Guidance (CFG) is a cornerstone of modern conditional diffusion models, yet its reliance on the fixed or heuristic dynamic guidance weight is predominantly empirical and overlooks the inherent dynamics of the diffusion process. In this paper, we provide a rigorous theoretical analysis of the Classifier-Free Guidance. Specifically, we establish strict upper bounds on the score discrepancy between conditional and unconditional distributions at different timesteps based on the diffusion process. This finding explains the limitations of fixed-weight strategies and establishes a principled foundation for time-dependent guidance. Motivated by this insight, we introduce \textbf{Control Classifier-Free Guidance (C$^2$FG)}, a novel, training-free, and plug-in method that aligns the guidance strength with the diffusion dynamics via an exponential decay control function. Extensive experiments demonstrate that C$^2$FG is effective and broadly applicable across diverse generative tasks, while also exhibiting orthogonality to existing strategies.
Abstract:In this work, we introduce Wallaroo, a simple autoregressive baseline that leverages next-token prediction to unify multi-modal understanding, image generation, and editing at the same time. Moreover, Wallaroo supports multi-resolution image input and output, as well as bilingual support for both Chinese and English. We decouple the visual encoding into separate pathways and apply a four-stage training strategy to reshape the model's capabilities. Experiments are conducted on various benchmarks where Wallaroo produces competitive performance or exceeds other unified models, suggesting the great potential of autoregressive models in unifying multi-modality understanding and generation. Our code is available at https://github.com/JiePKU/Wallaroo.
Abstract:We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency. We focus on what matters most when building agents: sharp reasoning and fast, reliable execution. Step 3.5 Flash pairs a 196B-parameter foundation with 11B active parameters for efficient inference. It is optimized with interleaved 3:1 sliding-window/full attention and Multi-Token Prediction (MTP-3) to reduce the latency and cost of multi-round agentic interactions. To reach frontier-level intelligence, we design a scalable reinforcement learning framework that combines verifiable signals with preference feedback, while remaining stable under large-scale off-policy training, enabling consistent self-improvement across mathematics, code, and tool use. Step 3.5 Flash demonstrates strong performance across agent, coding, and math tasks, achieving 85.4% on IMO-AnswerBench, 86.4% on LiveCodeBench-v6 (2024.08-2025.05), 88.2% on tau2-Bench, 69.0% on BrowseComp (with context management), and 51.0% on Terminal-Bench 2.0, comparable to frontier models such as GPT-5.2 xHigh and Gemini 3.0 Pro. By redefining the efficiency frontier, Step 3.5 Flash provides a high-density foundation for deploying sophisticated agents in real-world industrial environments.
Abstract:Recent advancements in image generation models have enabled the prediction of future Graphical User Interface (GUI) states based on user instructions. However, existing benchmarks primarily focus on general domain visual fidelity, leaving the evaluation of state transitions and temporal coherence in GUI-specific contexts underexplored. To address this gap, we introduce GEBench, a comprehensive benchmark for evaluating dynamic interaction and temporal coherence in GUI generation. GEBench comprises 700 carefully curated samples spanning five task categories, covering both single-step interactions and multi-step trajectories across real-world and fictional scenarios, as well as grounding point localization. To support systematic evaluation, we propose GE-Score, a novel five-dimensional metric that assesses Goal Achievement, Interaction Logic, Content Consistency, UI Plausibility, and Visual Quality. Extensive evaluations on current models indicate that while they perform well on single-step transitions, they struggle significantly with maintaining temporal coherence and spatial grounding over longer interaction sequences. Our findings identify icon interpretation, text rendering, and localization precision as critical bottlenecks. This work provides a foundation for systematic assessment and suggests promising directions for future research toward building high-fidelity generative GUI environments. The code is available at: https://github.com/stepfun-ai/GEBench.
Abstract:Video compression has recently benefited from implicit neural representations (INRs), which model videos as continuous functions. INRs offer compact storage and flexible reconstruction, providing a promising alternative to traditional codecs. However, most existing INR-based methods treat the temporal dimension as an independent input, limiting their ability to capture complex temporal dependencies. To address this, we propose a Hierarchical Temporal Neural Representation for Videos, TeNeRV. TeNeRV integrates short- and long-term dependencies through two key components. First, an Inter-Frame Feature Fusion (IFF) module aggregates features from adjacent frames, enforcing local temporal coherence and capturing fine-grained motion. Second, a GoP-Adaptive Modulation (GAM) mechanism partitions videos into Groups-of-Pictures and learns group-specific priors. The mechanism modulates network parameters, enabling adaptive representations across different GoPs. Extensive experiments demonstrate that TeNeRV consistently outperforms existing INR-based methods in rate-distortion performance, validating the effectiveness of our proposed approach.
Abstract:Implicit Neural Representations (INRs) have emerged as a promising paradigm for video compression. However, existing INR-based frameworks typically suffer from inherent spectral bias, which favors low-frequency components and leads to over-smoothed reconstructions and suboptimal rate-distortion performance. In this paper, we propose FaNeRV, a Frequency-aware Neural Representation for videos, which explicitly decouples low- and high-frequency components to enable efficient and faithful video reconstruction. FaNeRV introduces a multi-resolution supervision strategy that guides the network to progressively capture global structures and fine-grained textures through staged supervision . To further enhance high-frequency reconstruction, we propose a dynamic high-frequency injection mechanism that adaptively emphasizes challenging regions. In addition, we design a frequency-decomposed network module to improve feature modeling across different spectral bands. Extensive experiments on standard benchmarks demonstrate that FaNeRV significantly outperforms state-of-the-art INR methods and achieves competitive rate-distortion performance against traditional codecs.
Abstract:The pre-trained transformer demonstrates remarkable generalization ability in natural image processing. However, directly transferring it to magnetic resonance images faces two key challenges: the inability to adapt to the specificity of medical anatomical structures and the limitations brought about by the privacy and scarcity of medical data. To address these issues, this paper proposes a Self-Supervised Pretrained Transformer (SSPFormer) for MRI images, which effectively learns domain-specific feature representations of medical images by leveraging unlabeled raw imaging data. To tackle the domain gap and data scarcity, we introduce inverse frequency projection masking, which prioritizes the reconstruction of high-frequency anatomical regions to enforce structure-aware representation learning. Simultaneously, to enhance robustness against real-world MRI artifacts, we employ frequency-weighted FFT noise enhancement that injects physiologically realistic noise into the Fourier domain. Together, these strategies enable the model to learn domain-invariant and artifact-robust features directly from raw scans. Through extensive experiments on segmentation, super-resolution, and denoising tasks, the proposed SSPFormer achieves state-of-the-art performance, fully verifying its ability to capture fine-grained MRI image fidelity and adapt to clinical application requirements.