Ye Mao

Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding

Apr 02, 2026

Ye Mao, Weixun Luo, Ranran Huang, Junpeng Jing, Krystian Mikolajczyk

Abstract:Pretraining 3D encoders by aligning with Contrastive Language Image Pretraining (CLIP) has emerged as a promising direction to learn generalizable representations for 3D scene understanding. In this paper, we propose UniScene3D, a transformer-based encoder that learns unified scene representations from multi-view colored pointmaps, jointly modeling image appearance and geometry. For robust colored pointmap representation learning, we introduce novel cross-view geometric alignment and grounded view alignment to enforce cross-view geometry and semantic consistency. Extensive low-shot and task-specific fine-tuning evaluations on viewpoint grounding, scene retrieval, scene type classification, and 3D VQA demonstrate our state-of-the-art performance. These results highlight the effectiveness of our approach for unified 3D scene understanding. https://yebulabula.github.io/UniScene3D/

* 24 pages

From None to All: Self-Supervised 3D Reconstruction via Novel View Synthesis

Mar 29, 2026

Ranran Huang, Weixun Luo, Ye Mao, Krystian Mikolajczyk

Abstract:In this paper, we introduce NAS3R, a self-supervised feed-forward framework that jointly learns explicit 3D geometry and camera parameters with no ground-truth annotations and no pretrained priors. During training, NAS3R reconstructs 3D Gaussians from uncalibrated and unposed context views and renders target views using its self-predicted camera parameters, enabling self-supervised training from 2D photometric supervision. To ensure stable convergence, NAS3R integrates reconstruction and camera prediction within a shared transformer backbone regulated by masked attention, and adopts a depth-based Gaussian formulation that facilitates well-conditioned optimization. The framework is compatible with state-of-the-art supervised 3D reconstruction architectures and can incorporate pretrained priors or intrinsic information when available. Extensive experiments show that NAS3R achieves superior results to other self-supervised methods, establishing a scalable and geometry-aware paradigm for 3D reconstruction from unconstrained data. Code and models are publicly available at https://ranrhuang.github.io/nas3r/.

Context Compression via Explicit Information Transmission

Feb 03, 2026

Jiangnan Ye, Hanqi Yan, Zhenyi Shen, Heng Chang, Ye Mao, Yulan He

Abstract:Long-context inference with Large Language Models (LLMs) is costly due to quadratic attention and growing key-value caches, motivating context compression. In this work, we study soft context compression, where a long context is condensed into a small set of continuous representations. Existing methods typically re-purpose the LLM itself as a trainable compressor, relying on layer-by-layer self-attention to iteratively aggregate information. We argue that this paradigm suffers from two structural limitations: (i) progressive representation overwriting across layers and (ii) uncoordinated allocation of compression capacity across tokens. We propose ComprExIT (Context Compression via Explicit Information Transmission), a lightweight framework that recasts soft compression as explicit information transmission over frozen LLM hidden states. This decouples compression from the model's internal self-attention dynamics. ComprExIT performs (i) depth-wise transmission to selectively transmit multi-layer information into token anchors, mitigating progressive overwriting, and (ii) width-wise transmission to aggregate anchors into a small number of slots via a globally optimized transmission plan, ensuring coordinated allocation of information. Across six question-answering benchmarks, ComprExIT consistently outperforms state-of-the-art context compression methods while introducing only ~1% additional parameters, demonstrating that explicit and coordinated information transmission enables more effective and robust long-context compression.

Stereo Any Video: Temporally Consistent Stereo Matching

Mar 07, 2025

Junpeng Jing, Weixun Luo, Ye Mao, Krystian Mikolajczyk

Abstract:This paper introduces Stereo Any Video, a powerful framework for video stereo matching. It can estimate spatially accurate and temporally consistent disparities without relying on auxiliary information such as camera poses or optical flow. The strong capability is driven by rich priors from monocular video depth models, which are integrated with convolutional features to produce stable representations. To further enhance performance, key architectural innovations are introduced: all-to-all-pairs correlation, which constructs smooth and robust matching cost volumes, and temporal convex upsampling, which improves temporal coherence. These components collectively ensure robustness, accuracy, and temporal consistency, setting a new standard in video stereo matching. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple datasets both qualitatively and quantitatively in zero-shot settings, as well as strong generalization to real-world indoor and outdoor scenarios.

Hypo3D: Exploring Hypothetical Reasoning in 3D

Feb 04, 2025

Ye Mao, Weixun Luo, Junpeng Jing, Anlan Qiu, Krystian Mikolajczyk

Abstract:The rise of vision-language foundation models marks an advancement in bridging the gap between human and machine capabilities in 3D scene reasoning. Existing 3D reasoning benchmarks assume real-time scene accessibility, which is impractical due to the high cost of frequent scene updates. To this end, we introduce Hypothetical 3D Reasoning, namely Hypo3D, a benchmark designed to evaluate models' ability to reason without access to real-time scene data. Models need to imagine the scene state based on a provided change description before reasoning. Hypo3D is formulated as a 3D Visual Question Answering (VQA) benchmark, comprising 7,727 context changes across 700 indoor scenes, resulting in 14,885 question-answer pairs. An anchor-based world frame is established for all scenes, ensuring consistent reference to a global frame for directional terms in context changes and QAs. Extensive experiments show that state-of-the-art foundation models struggle to reason in hypothetically changed scenes. This reveals a substantial performance gap compared to humans, particularly in scenarios involving movement changes and directional reasoning. Even when the context change is irrelevant to the question, models often incorrectly adjust their answers.

* 19 pages, 15 figures, 9 tables

Match Stereo Videos via Bidirectional Alignment

Sep 30, 2024

Junpeng Jing, Ye Mao, Anlan Qiu, Krystian Mikolajczyk

Abstract:Video stereo matching is the task of estimating consistent disparity maps from rectified stereo videos. There is considerable scope for improvement in both datasets and methods within this area. Recent learning-based methods often focus on optimizing performance for independent stereo pairs, leading to temporal inconsistencies in videos. Existing video methods typically employ a sliding-window operation over the time dimension, which can result in low-frequency oscillations corresponding to the window size. To address these challenges, we propose a bidirectional alignment mechanism for adjacent frames as a fundamental operation. Building on this, we introduce a novel video processing framework, BiDAStereo, and a plugin stabilizer network, BiDAStabilizer, compatible with general image-based methods. Regarding datasets, current synthetic object-based and indoor datasets are commonly used for training and benchmarking, while outdoor natural scenes are lacking. To bridge this gap, we present a realistic synthetic dataset and benchmark focused on natural scenes, along with a real-world dataset captured by a stereo camera in diverse urban scenes for qualitative evaluation. Extensive experiments on in-domain, out-of-domain, and robustness evaluation demonstrate the contribution of our methods and datasets, showcasing improvements in prediction quality and achieving state-of-the-art results on various commonly used benchmarks. The project page, demos, code, and datasets are available at: https://tomtomtommi.github.io/BiDAVideo/.

OpenDlign: Enhancing Open-World 3D Learning with Depth-Aligned Images

Apr 25, 2024

Ye Mao, Junpeng Jing, Krystian Mikolajczyk

Abstract:Recent advances in Vision and Language Models (VLMs) have improved open-world 3D representation, facilitating 3D zero-shot capability in unseen categories. Existing open-world methods pre-train an extra 3D encoder to align features from 3D data (e.g., depth maps or point clouds) with CAD-rendered images and corresponding texts. However, the limited color and texture variations in CAD images can compromise the alignment robustness. Furthermore, the volume discrepancy between pre-training datasets of the 3D encoder and VLM leads to sub-optimal 2D to 3D knowledge transfer. To overcome these issues, we propose OpenDlign, a novel framework for learning open-world 3D representations that leverages depth-aligned images generated from point cloud-projected depth maps. Unlike CAD-rendered images, our generated images provide rich, realistic color and texture diversity while preserving geometric and semantic consistency with the depth maps. OpenDlign also optimizes depth map projection and integrates depth-specific text prompts, improving 2D VLM knowledge adaptation for 3D learning through efficient fine-tuning. Experimental results show that OpenDlign significantly outperforms existing methods in zero-shot and few-shot 3D tasks, exceeding prior scores by 8.0% on ModelNet40 and 16.4% on OmniObject3D with just 6 million tuned parameters. Moreover, integrating generated depth-aligned images into existing 3D learning pipelines consistently improves their performance.

* 12 pages

Match-Stereo-Videos: Bidirectional Alignment for Consistent Dynamic Stereo Matching

Mar 16, 2024

Junpeng Jing, Ye Mao, Krystian Mikolajczyk

Abstract:Dynamic stereo matching is the task of estimating consistent disparities from stereo videos with dynamic objects. Recent learning-based methods prioritize optimal performance on a single stereo pair, resulting in temporal inconsistencies. Existing video methods apply per-frame matching and window-based cost aggregation across the time dimension, leading to low-frequency oscillations at the scale of the window size. To address this challenge, we develop a bidirectional alignment mechanism for adjacent frames as a fundamental operation. We further propose a novel framework, BiDAStereo, that achieves consistent dynamic stereo matching. Unlike existing methods, we model this task as local matching and global aggregation. Locally, we consider correlation in a triple-frame manner to pool information from adjacent frames and improve temporal consistency. Globally, to exploit the entire sequence's consistency and extract dynamic scene cues for aggregation, we develop a motion-propagation recurrent unit. Extensive experiments demonstrate the performance of our method, showcasing improvements in prediction quality and achieving state-of-the-art results on various commonly used benchmarks.

Knowledge Distilled Ensemble Model for sEMG-based Silent Speech Interface

Aug 07, 2023

Wenqiang Lai, Qihan Yang, Ye Mao, Endong Sun, Jiangnan Ye

Abstract:Voice disorders affect millions of people worldwide. Surface electromyography-based Silent Speech Interfaces (sEMG-based SSIs) have been explored as a potential solution for decades. However, previous works were limited by small vocabularies and manually extracted features from raw data. To address these limitations, we propose a lightweight deep learning knowledge-distilled ensemble model for sEMG-based SSI (KDE-SSI). Our model can classify a dataset of the 26 NATO phonetic alphabet words with 3900 data samples, enabling the unambiguous generation of any English word through spelling. Extensive experiments validate the effectiveness of KDE-SSI, achieving a test accuracy of 85.9%. Our findings also shed light on an end-to-end system for portable, practical equipment.

* 6 pages, 5 figures

DisC-Diff: Disentangled Conditional Diffusion Model for Multi-Contrast MRI Super-Resolution

Mar 24, 2023

Ye Mao, Lan Jiang, Xi Chen, Chao Li

Abstract:Multi-contrast magnetic resonance imaging (MRI) is the most common management tool used to characterize neurological disorders based on brain tissue contrasts. However, acquiring high-resolution MRI scans is time-consuming and infeasible under specific conditions. Hence, multi-contrast super-resolution methods have been developed to improve the quality of low-resolution contrasts by leveraging complementary information from multi-contrast MRI. Current deep learning-based super-resolution methods have limitations in estimating restoration uncertainty and avoiding mode collapse. Although the diffusion model has emerged as a promising approach for image enhancement, capturing complex interactions between multiple conditions introduced by multi-contrast MRI super-resolution remains a challenge for clinical applications. In this paper, we propose a disentangled conditional diffusion model, DisC-Diff, for multi-contrast brain MRI super-resolution. It utilizes the sampling-based generation and simple objective function of diffusion models to estimate uncertainty in restorations effectively and ensure a stable optimization process. Moreover, DisC-Diff leverages a disentangled multi-stream network to fully exploit complementary information from multi-contrast MRI, improving model interpretation under multiple conditions of multi-contrast inputs. We validated the effectiveness of DisC-Diff on two datasets: the IXI dataset, which contains 578 normal brains, and a clinical dataset with 316 pathological brains. Our experimental results demonstrate that DisC-Diff outperforms other state-of-the-art methods both quantitatively and visually.
