Department of Control Science and Engineering, Zhejiang University, China
Abstract:We investigate what structure emerges in 3D Gaussian Splatting (3DGS) solutions from standard multi-view optimization. We term these Rendering-Optimal References (RORs) and analyze their statistical properties, revealing stable patterns: mixture-structured scales and bimodal radiance across diverse scenes. To understand what determines these parameters, we apply learnability probes by training predictors to reconstruct RORs from point clouds without rendering supervision. Our analysis uncovers fundamental density-stratification. Dense regions exhibit geometry-correlated parameters amenable to render-free prediction, while sparse regions show systematic failure across architectures. We formalize this through variance decomposition, demonstrating that visibility heterogeneity creates covariance-dominated coupling between geometric and appearance parameters in sparse regions. This reveals the dual character of RORs: geometric primitives where point clouds suffice, and view synthesis primitives where multi-view constraints are essential. We provide density-aware strategies that improve training robustness and discuss architectural implications for systems that adaptively balance feed-forward prediction and rendering-based refinement.
Abstract:Audio is indispensable for real-world video, yet generation models have largely overlooked audio components. Current approaches to producing audio-visual content often rely on cascaded pipelines, which increase cost, accumulate errors, and degrade overall quality. While systems such as Veo 3 and Sora 2 emphasize the value of simultaneous generation, joint multimodal modeling introduces unique challenges in architecture, data, and training. Moreover, the closed-source nature of existing systems limits progress in the field. In this work, we introduce MOVA (MOSS Video and Audio), an open-source model capable of generating high-quality, synchronized audio-visual content, including realistic lip-synced speech, environment-aware sound effects, and content-aligned music. MOVA employs a Mixture-of-Experts (MoE) architecture, with a total of 32B parameters, of which 18B are active during inference. It supports IT2VA (Image-Text to Video-Audio) generation task. By releasing the model weights and code, we aim to advance research and foster a vibrant community of creators. The released codebase features comprehensive support for efficient inference, LoRA fine-tuning, and prompt enhancement.
Abstract:The upper-mid band (7-24 GHz), designated as Frequency Range 3 (FR3), has emerged as a definitive ``golden band" for 6G networks, strategically balancing the wide coverage of sub-6 GHz with the high capacity of mmWave. To compensate for the severe path loss inherent to this band, the deployment of Extremely Large Aperture Arrays (ELAA) is indispensable. However, the legacy 3GPP TR 38.901 channel model faces critical validity challenges when applied to 6G FR3, stemming from both the distinct propagation characteristics of this frequency band and the fundamental physical paradigm shift induced by ELAA. In response, 3GPP Release 19 (Rel-19) has validated the model through extensive new measurements and introduced significant enhancements. This tutorial provides a comprehensive guide to the Rel-19 channel model for 6G FR3, bridging the gap between standardization specifications and practical simulation implementation. First, we provide a high-level overview of the fundamental principles of the 3GPP channel modeling framework. Second, we detail the specific enhancements and modifications introduced in Rel-19, including the rationale behind the new Suburban Macro (SMa) scenario, the mathematical modeling of ELAA-driven features such as near-field and spatial non-stationarity, and the recalibration of large-scale parameters. Overall, this tutorial serves as an essential guide for researchers and engineers to master the latest 3GPP channel modeling methodology, laying a solid foundation for the accurate design and performance evaluation of future 6G FR3 networks.
Abstract:Post-training pruning is an effective approach for reducing the size and inference cost of large language models (LLMs), but existing methods often face a trade-off between pruning quality and computational efficiency. Heuristic pruning methods are efficient but sensitive to activation outliers, while reconstruction-based approaches improve fidelity at the cost of heavy computation. In this work, we propose a lightweight post-training pruning framework based on first-order statistical properties of model weights and activations. During pruning, channel-wise statistics are used to calibrate magnitude-based importance scores, reducing bias from activation-dominated channels. After pruning, we apply an analytic energy compensation to correct distributional distortions caused by weight removal. Both steps operate without retraining, gradients, or second-order information. Experiments across multiple LLM families, sparsity patterns, and evaluation tasks show that the proposed approach improves pruning performance while maintaining computational cost comparable to heuristic methods. The results suggest that simple statistical corrections can be effective for post-training pruning of LLMs.
Abstract:DNA storage has matured from concept to practical stage, yet its integration with neural compression pipelines remains inefficient. Early DNA encoders applied redundancy-heavy constraint layers atop raw binary data - workable but primitive. Recent neural codecs compress data into learned latent representations with rich statistical structure, yet still convert these latents to DNA via naive binary-to-quaternary transcoding, discarding the entropy model's optimization. This mismatch undermines compression efficiency and complicates the encoding stack. A plug-in module that collapses latent compression and DNA encoding into a single step. SCONE performs quaternary arithmetic coding directly on the latent space in DNA bases. Its Constraint-Aware Adaptive Coding module dynamically steers the entropy encoder's learned probability distribution to enforce biochemical constraints - Guanine-Cytosine (GC) balance and homopolymer suppression - deterministically during encoding, eliminating post-hoc correction. The design preserves full reversibility and exploits the hyperprior model's learned priors without modification. Experiments show SCONE achieves near-perfect constraint satisfaction with negligible computational overhead (<2% latency), establishing a latent-agnostic interface for end-to-end DNA-compatible learned codecs.
Abstract:Large language models (LLMs) have recently shown strong potential for Automated Program Repair (APR), yet most existing approaches remain unimodal and fail to leverage the rich diagnostic signals contained in visual artifacts such as screenshots and control-flow graphs. In practice, many bug reports convey critical information visually (e.g., layout breakage or missing widgets), but directly using such dense visual inputs often causes context loss and noise, making it difficult for MLLMs to ground visual observations into precise fault localization and executable patches. To bridge this semantic gap, we propose \textbf{SVRepair}, a multimodal APR framework with structured visual representation. SVRepair first fine-tunes a vision-language model, \textbf{Structured Visual Representation (SVR)}, to uniformly transform heterogeneous visual artifacts into a \emph{semantic scene graph} that captures GUI elements and their structural relations (e.g., hierarchy), providing normalized, code-relevant context for downstream repair. Building on the graph, SVRepair drives a coding agent to localize faults and synthesize patches, and further introduces an iterative visual-artifact segmentation strategy that progressively narrows the input to bug-centered regions to suppress irrelevant context and reduce hallucinations. Extensive experiments across multiple benchmarks demonstrate state-of-the-art performance: SVRepair achieves \textbf{36.47\%} accuracy on SWE-Bench M, \textbf{38.02\%} on MMCode, and \textbf{95.12\%} on CodeVision, validating the effectiveness of SVRepair for multimodal program repair.
Abstract:Agentic Test-Time Scaling (TTS) has delivered state-of-the-art (SOTA) performance on complex software engineering tasks such as code generation and bug fixing. However, its practical adoption remains limited due to significant computational overhead, primarily driven by two key challenges: (1) the high cost associated with deploying excessively large ensembles, and (2) the lack of a reliable mechanism for selecting the optimal candidate solution, ultimately constraining the performance gains that can be realized. To address these challenges, we propose Entropy-Guided Stepwise Scaling (EGSS), a novel TTS framework that dynamically balances efficiency and effectiveness through entropy-guided adaptive search and robust test-suite augmentation. Extensive experiments on SWE-Bench-Verified demonstrate that EGSS consistently boosts performance by 5-10% across all evaluated models. Specifically, it increases the resolved ratio of Kimi-K2-Intruct from 63.2% to 72.2%, and GLM-4.6 from 65.8% to 74.6%. Furthermore, when paired with GLM-4.6, EGSS achieves a new state-of-the-art among open-source large language models. In addition to these accuracy improvements, EGSS reduces inference-time token usage by over 28% compared to existing TTS methods, achieving simultaneous gains in both effectiveness and computational efficiency.
Abstract:Code localization constitutes a key bottleneck in automated software development pipelines. While concurrent tool execution can enhance discovery speed, current agents demonstrate a 34.9\% redundant invocation rate, which negates parallelism benefits. We propose \textbf{FuseSearch}, reformulating parallel code localization as a \textbf{joint quality-efficiency optimization} task. Through defining \textbf{tool efficiency} -- the ratio of unique information gain to invocation count -- we utilize a two-phase SFT and RL training approach for learning adaptive parallel strategies. Different from fixed-breadth approaches, FuseSearch dynamically modulates search breadth according to task context, evolving from exploration phases to refinement stages. Evaluated on SWE-bench Verified, FuseSearch-4B achieves SOTA-level performance (84.7\% file-level and 56.4\% function-level $F_1$ scores) with 93.6\% speedup, utilizing 67.7\% fewer turns and 68.9\% fewer tokens. Results indicate that efficiency-aware training naturally improves quality through eliminating noisy redundant signals, enabling high-performance cost-effective localization agents.
Abstract:Free-Viewpoint Video (FVV) reconstruction enables photorealistic and interactive 3D scene visualization; however, real-time streaming is often bottlenecked by sparse-view inputs, prohibitive training costs, and bandwidth constraints. While recent 3D Gaussian Splatting (3DGS) has advanced FVV due to its superior rendering speed, Streaming Free-Viewpoint Video (SFVV) introduces additional demands for rapid optimization, high-fidelity reconstruction under sparse constraints, and minimal storage footprints. To bridge this gap, we propose StreamLoD-GS, an LoD-based Gaussian Splatting framework designed specifically for SFVV. Our approach integrates three core innovations: 1) an Anchor- and Octree-based LoD-structured 3DGS with a hierarchical Gaussian dropout technique to ensure efficient and stable optimization while maintaining high-quality rendering; 2) a GMM-based motion partitioning mechanism that separates dynamic and static content, refining dynamic regions while preserving background stability; and 3) a quantized residual refinement framework that significantly reduces storage requirements without compromising visual fidelity. Extensive experiments demonstrate that StreamLoD-GS achieves competitive or state-of-the-art performance in terms of quality, efficiency, and storage.
Abstract:Traditional cellular networks struggle with poor quality of service (QoS) for cell-edge users, while cell-free (CF) systems offer uniform QoS but incur high roll-out costs due to acquiring numerous access point (AP) sites and deploying a large-scale optical fiber network to connect them. This paper proposes a cost-effective heterogeneous massive MIMO architecture that integrates centralized co-located antennas at a cell-center base station with distributed edge APs. By strategically splitting massive antennas between centralized and distributed nodes, the system maintains high user fairness comparable to CF systems but reduces infrastructure costs substantially, by minimizing the required number of AP sites and fronthaul connections. Numerical results demonstrate its superiority in balancing performance and costs compared to cellular and CF systems.