Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jie Zhou

SFTok: Bridging the Performance Gap in Discrete Tokenizers

Dec 18, 2025

Qihang Rao, Borui Zhang, Wenzhao Zheng, Jie Zhou, Jiwen Lu

Abstract:Recent advances in multimodal models highlight the pivotal role of image tokenization in high-resolution image generation. By compressing images into compact latent representations, tokenizers enable generative models to operate in lower-dimensional spaces, thereby improving computational efficiency and reducing complexity. Discrete tokenizers naturally align with the autoregressive paradigm but still lag behind continuous ones, limiting their adoption in multimodal systems. To address this, we propose \textbf{SFTok}, a discrete tokenizer that incorporates a multi-step iterative mechanism for precise reconstruction. By integrating \textbf{self-forcing guided visual reconstruction} and \textbf{debias-and-fitting training strategy}, SFTok resolves the training-inference inconsistency in multi-step process, significantly enhancing image reconstruction quality. At a high compression rate of only 64 tokens per image, SFTok achieves state-of-the-art reconstruction quality on ImageNet (rFID = 1.21) and demonstrates exceptional performance in class-to-image generation tasks (gFID = 2.29).

* Under review. Code is available at https://github.com/Neur-IO/SFTok

Via

Access Paper or Ask Questions

On the Asymptotic Performance of Diagonally Loaded Detectors for Large Arrays: To Achieve CFAR and Optimality

Dec 17, 2025

Jie Zhou, Junhao Xie

Figure 1 for On the Asymptotic Performance of Diagonally Loaded Detectors for Large Arrays: To Achieve CFAR and Optimality

Figure 2 for On the Asymptotic Performance of Diagonally Loaded Detectors for Large Arrays: To Achieve CFAR and Optimality

Figure 3 for On the Asymptotic Performance of Diagonally Loaded Detectors for Large Arrays: To Achieve CFAR and Optimality

Figure 4 for On the Asymptotic Performance of Diagonally Loaded Detectors for Large Arrays: To Achieve CFAR and Optimality

Abstract:This paper addresses two critical limitations in diagonally loaded (DL) adaptive matched filter (AMF) detector: (1) the lack of CFAR property with respect to arbitrary covariance matrices, and (2) the absence of selection criteria for optimal loading factor from the perspective of maximizing the detection probability (Pd). We provide solutions to both challenges through a comprehensive analysis for the asymptotic performance of DL-AMF under large dimensional regime (LDR) where the dimension N and sample size K tend to infinity whereas their ratio N/K converges to a constant c\in(0,1). The analytical results show that any DL detectors constructed by normalizing the random variable |a|2=|sH(R+λIN)-1y0|2 with a deterministic quantity or a random variable that converges almost surely to a deterministic value will exhibit equivalent performance under LDR. Following this idea, we derive two CFAR DL detectors: CFAR DL semi-clairvoyant matched filter (CFAR-DL-SCMF) detector and CFAR DL adaptive matched filter (CFAR-DL-AMF) detector, by normalizing |a|2 with an appropriate deterministic quantity and its consistent estimate, respectively. The theoretical analysis and simulations show that both CFAR-DL-SCMF and CFAR-DL-AMF achieve CFAR with respect to covariance matrix, target steering vector and loading factor. Furthermore, we derive the asymptotically optimal loading factor λ_opt by maximizing the explicit expression of asymptotic Pd. For practical implementation, we provide a consistent estimator for λ_opt under LDR. Based on λ_opt and its consistent estimate, we establish the optimal CFAR-DL-SCMF (opt-CFAR-DL-SCMF) and the optimal CFAR-DL-AMF (opt-CFAR-DL-AMF). Numerical examples demonstrate that the proposed opt-CFAR-DL-SCMF and opt-CFAR-DL-AMF consistently outperform EL-AMF and persymmetric AMF in both full-rank and low-rank clutter plus noise environments.

Via

Access Paper or Ask Questions

Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning

Dec 17, 2025

Yifei Li, Wenzhao Zheng, Yanran Zhang, Runze Sun, Yu Zheng, Lei Chen, Jie Zhou, Jiwen Lu

Abstract:The misuse of AI-driven video generation technologies has raised serious social concerns, highlighting the urgent need for reliable AI-generated video detectors. However, most existing methods are limited to binary classification and lack the necessary explanations for human interpretation. In this paper, we present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos and leverages them as grounded evidence for both detection and explanation. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), which represents the first large-scale AI-generated video artifact dataset with fine-grained human annotations. We then develop a two-stage training strategy that systematically enhances our model's spatio-temporal artifact perception, explanation capability, and detection accuracy. To comprehensively evaluate Skyra, we introduce ViF-Bench, a benchmark comprising 3K high-quality samples generated by over ten state-of-the-art video generators. Extensive experiments demonstrate that Skyra surpasses existing methods across multiple benchmarks, while our evaluation yields valuable insights for advancing explainable AI-generated video detection.

* Project Page: https://github.com/JoeLeelyf/Skyra

Via

Access Paper or Ask Questions

Astra: General Interactive World Model with Autoregressive Denoising

Dec 15, 2025

Yixuan Zhu, Jiaqi Feng, Wenzhao Zheng, Yuan Gao, Xin Tao, Pengfei Wan, Jie Zhou, Jiwen Lu

Figure 1 for Astra: General Interactive World Model with Autoregressive Denoising

Figure 2 for Astra: General Interactive World Model with Autoregressive Denoising

Figure 3 for Astra: General Interactive World Model with Autoregressive Denoising

Figure 4 for Astra: General Interactive World Model with Autoregressive Denoising

Abstract:Recent advances in diffusion transformers have empowered video generation models to generate high-quality video clips from texts or images. However, world models with the ability to predict long-horizon futures from past observations and actions remain underexplored, especially for general-purpose scenarios and various forms of actions. To bridge this gap, we introduce Astra, an interactive general world model that generates real-world futures for diverse scenarios (e.g., autonomous driving, robot grasping) with precise action interactions (e.g., camera motion, robot action). We propose an autoregressive denoising architecture and use temporal causal attention to aggregate past observations and support streaming outputs. We use a noise-augmented history memory to avoid over-reliance on past frames to balance responsiveness with temporal coherence. For precise action control, we introduce an action-aware adapter that directly injects action signals into the denoising process. We further develop a mixture of action experts that dynamically route heterogeneous action modalities, enhancing versatility across diverse real-world tasks such as exploration, manipulation, and camera control. Astra achieves interactive, consistent, and general long-term video prediction and supports various forms of interactions. Experiments across multiple datasets demonstrate the improvements of Astra in fidelity, long-range prediction, and action alignment over existing state-of-the-art world models.

* Code is available at: https://github.com/EternalEvan/Astra

Via

Access Paper or Ask Questions

JPEG-Inspired Cloud-Edge Holography

Dec 13, 2025

Shuyang Xie, Jie Zhou, Jun Wang, Renjing Xu

Figure 1 for JPEG-Inspired Cloud-Edge Holography

Figure 2 for JPEG-Inspired Cloud-Edge Holography

Figure 3 for JPEG-Inspired Cloud-Edge Holography

Figure 4 for JPEG-Inspired Cloud-Edge Holography

Abstract:Computer-generated holography (CGH) presents a transformative solution for near-eye displays in augmented and virtual reality. Recent advances in deep learning have greatly improved CGH in reconstructed quality and computational efficiency. However, deploying neural CGH pipelines directly on compact, eyeglass-style devices is hindered by stringent constraints on computation and energy consumption, while cloud offloading followed by transmission with natural image codecs often distorts phase information and requires high bandwidth to maintain reconstruction quality. Neural compression methods can reduce bandwidth but impose heavy neural decoders at the edge, increasing inference latency and hardware demand. In this work, we introduce JPEG-Inspired Cloud-Edge Holography, an efficient pipeline designed around a learnable transform codec that retains the block-structured and hardware-friendly nature of JPEG. Our system shifts all heavy neural processing to the cloud, while the edge device performs only lightweight decoding without any neural inference. To further improve throughput, we implement custom CUDA kernels for entropy coding on both cloud and edge. This design achieves a peak signal-to-noise ratio of 32.15 dB at $<$ 2 bits per pixel with decode latency as low as 4.2 ms. Both numerical simulations and optical experiments confirm the high reconstruction quality of the holograms. By aligning CGH with a codec that preserves JPEG's structural efficiency while extending it with learnable components, our framework enables low-latency, bandwidth-efficient hologram streaming on resource-constrained wearable devices-using only simple block-based decoding readily supported by modern system-on-chips, without requiring neural decoders or specialized hardware.

Via

Access Paper or Ask Questions

SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder

Dec 12, 2025

Minglei Shi, Haolin Wang, Borui Zhang, Wenzhao Zheng, Bohan Zeng, Ziyang Yuan, Xiaoshi Wu, Yuanxing Zhang, Huan Yang, Xintao Wang(+4 more)

Abstract:Visual generation grounded in Visual Foundation Model (VFM) representations offers a highly promising unified pathway for integrating visual understanding, perception, and generation. Despite this potential, training large-scale text-to-image diffusion models entirely within the VFM representation space remains largely unexplored. To bridge this gap, we scale the SVG (Self-supervised representations for Visual Generation) framework, proposing SVG-T2I to support high-quality text-to-image synthesis directly in the VFM feature domain. By leveraging a standard text-to-image diffusion pipeline, SVG-T2I achieves competitive performance, reaching 0.75 on GenEval and 85.78 on DPG-Bench. This performance validates the intrinsic representational power of VFMs for generative tasks. We fully open-source the project, including the autoencoder and generation model, together with their training, inference, evaluation pipelines, and pre-trained weights, to facilitate further research in representation-driven visual generation.

* Code Repository: https://github.com/KlingTeam/SVG-T2I; Model Weights: https://huggingface.co/KlingTeam/SVG-T2I

Via

Access Paper or Ask Questions

A Lightweight 3D Anomaly Detection Method with Rotationally Invariant Features

Nov 17, 2025

Hanzhe Liang, Jie Zhou, Can Gao, Bingyang Guo, Jinbao Wang, Linlin Shen

Abstract:3D anomaly detection (AD) is a crucial task in computer vision, aiming to identify anomalous points or regions from point cloud data. However, existing methods may encounter challenges when handling point clouds with changes in orientation and position because the resulting features may vary significantly. To address this problem, we propose a novel Rotationally Invariant Features (RIF) framework for 3D AD. Firstly, to remove the adverse effect of variations on point cloud data, we develop a Point Coordinate Mapping (PCM) technique, which maps each point into a rotationally invariant space to maintain consistency of representation. Then, to learn robust and discriminative features, we design a lightweight Convolutional Transform Feature Network (CTF-Net) to extract rotationally invariant features for the memory bank. To improve the ability of the feature extractor, we introduce the idea of transfer learning to pre-train the feature extractor with 3D data augmentation. Experimental results show that the proposed method achieves the advanced performance on the Anomaly-ShapeNet dataset, with an average P-AUROC improvement of 17.7\%, and also gains the best performance on the Real3D-AD dataset, with an average P-AUROC improvement of 1.6\%. The strong generalization ability of RIF has been verified by combining it with traditional feature extraction methods on anomaly detection tasks, demonstrating great potential for industrial applications.

* Submitted to Elsevier

Via

Access Paper or Ask Questions

Continuous Autoregressive Language Models

Oct 31, 2025

Chenze Shao, Darren Li, Fandong Meng, Jie Zhou

Abstract:The efficiency of large language models (LLMs) is fundamentally limited by their sequential, token-by-token generation process. We argue that overcoming this bottleneck requires a new design axis for LLM scaling: increasing the semantic bandwidth of each generative step. To this end, we introduce Continuous Autoregressive Language Models (CALM), a paradigm shift from discrete next-token prediction to continuous next-vector prediction. CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector, from which the original tokens can be reconstructed with over 99.9\% accuracy. This allows us to model language as a sequence of continuous vectors instead of discrete tokens, which reduces the number of generative steps by a factor of K. The paradigm shift necessitates a new modeling toolkit; therefore, we develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling in the continuous domain. Experiments show that CALM significantly improves the performance-compute trade-off, achieving the performance of strong discrete baselines at a significantly lower computational cost. More importantly, these findings establish next-vector prediction as a powerful and scalable pathway towards ultra-efficient language models. Code: https://github.com/shaochenze/calm. Project: https://shaochenze.github.io/blog/2025/CALM.

Via

Access Paper or Ask Questions

Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence

Oct 23, 2025

Kun Ouyang, Yuanxin Liu, Linli Yao, Yishuo Cai, Hao Zhou, Jie Zhou, Fandong Meng, Xu Sun

Abstract:Video reasoning, which requires multi-step deduction across frames, remains a major challenge for multimodal large language models (MLLMs). While reinforcement learning (RL)-based methods enhance reasoning capabilities, they often rely on text-only chains that yield ungrounded or hallucinated conclusions. Conversely, frame-retrieval approaches introduce visual grounding but still struggle with inaccurate evidence localization. To address these challenges, we present Conan, a framework for evidence-grounded multi-step video reasoning. Conan identifies contextual and evidence frames, reasons over cross-frame clues, and adaptively decides when to conclude or explore further. To achieve this, we (1) construct Conan-91K, a large-scale dataset of automatically generated reasoning traces that includes frame identification, evidence reasoning, and action decision, and (2) design a multi-stage progressive cold-start strategy combined with an Identification-Reasoning-Action (AIR) RLVR training framework to jointly enhance multi-step visual reasoning. Extensive experiments on six multi-step reasoning benchmarks demonstrate that Conan surpasses the baseline Qwen2.5-VL-7B-Instruct by an average of over 10% in accuracy, achieving state-of-the-art performance. Furthermore, Conan generalizes effectively to long-video understanding tasks, validating its strong scalability and robustness.

Via

Access Paper or Ask Questions

Terra: Explorable Native 3D World Model with Point Latents

Oct 16, 2025

Yuanhui Huang, Weiliang Chen, Wenzhao Zheng, Xin Tao, Pengfei Wan, Jie Zhou, Jiwen Lu

Abstract:World models have garnered increasing attention for comprehensive modeling of the real world. However, most existing methods still rely on pixel-aligned representations as the basis for world evolution, neglecting the inherent 3D nature of the physical world. This could undermine the 3D consistency and diminish the modeling efficiency of world models. In this paper, we present Terra, a native 3D world model that represents and generates explorable environments in an intrinsic 3D latent space. Specifically, we propose a novel point-to-Gaussian variational autoencoder (P2G-VAE) that encodes 3D inputs into a latent point representation, which is subsequently decoded as 3D Gaussian primitives to jointly model geometry and appearance. We then introduce a sparse point flow matching network (SPFlow) for generating the latent point representation, which simultaneously denoises the positions and features of the point latents. Our Terra enables exact multi-view consistency with native 3D representation and architecture, and supports flexible rendering from any viewpoint with only a single generation process. Furthermore, Terra achieves explorable world modeling through progressive generation in the point latent space. We conduct extensive experiments on the challenging indoor scenes from ScanNet v2. Terra achieves state-of-the-art performance in both reconstruction and generation with high 3D consistency.

* Project Page: https://huang-yh.github.io/terra/

Via

Access Paper or Ask Questions