Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Rui Ma

Mitsubishi Electric Research Labs, Cambridge, MA, USA

BOOTPLACE: Bootstrapped Object Placement with Detection Transformers

Mar 27, 2025

Hang Zhou, Xinxin Zuo, Rui Ma, Li Cheng

Figure 1 for BOOTPLACE: Bootstrapped Object Placement with Detection Transformers

Figure 2 for BOOTPLACE: Bootstrapped Object Placement with Detection Transformers

Figure 3 for BOOTPLACE: Bootstrapped Object Placement with Detection Transformers

Figure 4 for BOOTPLACE: Bootstrapped Object Placement with Detection Transformers

Abstract:In this paper, we tackle the copy-paste image-to-image composition problem with a focus on object placement learning. Prior methods have leveraged generative models to reduce the reliance for dense supervision. However, this often limits their capacity to model complex data distributions. Alternatively, transformer networks with a sparse contrastive loss have been explored, but their over-relaxed regularization often leads to imprecise object placement. We introduce BOOTPLACE, a novel paradigm that formulates object placement as a placement-by-detection problem. Our approach begins by identifying suitable regions of interest for object placement. This is achieved by training a specialized detection transformer on object-subtracted backgrounds, enhanced with multi-object supervisions. It then semantically associates each target compositing object with detected regions based on their complementary characteristics. Through a boostrapped training approach applied to randomly object-subtracted images, our model enforces meaningful placements through extensive paired data augmentation. Experimental results on established benchmarks demonstrate BOOTPLACE's superior performance in object repositioning, markedly surpassing state-of-the-art baselines on Cityscapes and OPA datasets with notable improvements in IOU scores. Additional ablation studies further showcase the compositionality and generalizability of our approach, supported by user study evaluations.

* CVPR 2025. Project page: https://ryanhangzhou.github.io/bootplace/ , code: https://github.com/RyanHangZhou/BOOTPLACE

Via

Access Paper or Ask Questions

Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning

Mar 10, 2025

Huilin Deng, Ding Zou, Rui Ma, Hongchen Luo, Yang Cao, Yu Kang

Figure 1 for Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning

Figure 2 for Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning

Figure 3 for Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning

Figure 4 for Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning

Abstract:While state-of-the-art vision-language models (VLMs) have demonstrated remarkable capabilities in complex visual-text tasks, their success heavily relies on massive model scaling, limiting their practical deployment. Small-scale VLMs offer a more practical alternative but face significant challenges when trained with traditional supervised fine-tuning (SFT), particularly in two aspects: out-of-domain (OOD) generalization and reasoning abilities, which significantly lags behind the contemporary Large language models (LLMs). To address these challenges, we propose Curriculum Reinforcement Finetuning (Curr-ReFT), a novel post-training paradigm specifically designed for small-scale VLMs. Inspired by the success of reinforcement learning in LLMs, Curr-ReFT comprises two sequential stages: (1) Curriculum Reinforcement Learning, which ensures steady progression of model capabilities through difficulty-aware reward design, transitioning from basic visual perception to complex reasoning tasks; and (2) Rejected Sampling-based Self-improvement, which maintains the fundamental capabilities of VLMs through selective learning from high-quality multimodal and language examples. Extensive experiments demonstrate that models trained with Curr-ReFT paradigm achieve state-of-the-art performance across various visual tasks in both in-domain and out-of-domain settings. Moreover, our Curr-ReFT enhanced 3B model matches the performance of 32B-parameter models, demonstrating that efficient training paradigms can effectively bridge the gap between small and large models.

Via

Access Paper or Ask Questions

DecoupledGaussian: Object-Scene Decoupling for Physics-Based Interaction

Mar 07, 2025

Miaowei Wang, Yibo Zhang, Rui Ma, Weiwei Xu, Changqing Zou, Daniel Morris

Figure 1 for DecoupledGaussian: Object-Scene Decoupling for Physics-Based Interaction

Figure 2 for DecoupledGaussian: Object-Scene Decoupling for Physics-Based Interaction

Figure 3 for DecoupledGaussian: Object-Scene Decoupling for Physics-Based Interaction

Figure 4 for DecoupledGaussian: Object-Scene Decoupling for Physics-Based Interaction

Abstract:We present DecoupledGaussian, a novel system that decouples static objects from their contacted surfaces captured in-the-wild videos, a key prerequisite for realistic Newtonian-based physical simulations. Unlike prior methods focused on synthetic data or elastic jittering along the contact surface, which prevent objects from fully detaching or moving independently, DecoupledGaussian allows for significant positional changes without being constrained by the initial contacted surface. Recognizing the limitations of current 2D inpainting tools for restoring 3D locations, our approach proposes joint Poisson fields to repair and expand the Gaussians of both objects and contacted scenes after separation. This is complemented by a multi-carve strategy to refine the object's geometry. Our system enables realistic simulations of decoupling motions, collisions, and fractures driven by user-specified impulses, supporting complex interactions within and across multiple scenes. We validate DecoupledGaussian through a comprehensive user study and quantitative benchmarks. This system enhances digital interaction with objects and scenes in real-world environments, benefiting industries such as VR, robotics, and autonomous driving. Our project page is at: https://wangmiaowei.github.io/DecoupledGaussian.github.io/.

* CVPR2025 Accepted

Via

Access Paper or Ask Questions

Research on visual simultaneous localization and mapping technology based on near infrared light

Mar 04, 2025

Rui Ma, Mengfang Liu, Boliang Li, Xinghui Li

Abstract:In view of the problems that visual simultaneous localization and mapping (VSLAM) are susceptible to environmental light interference and luminosity inconsistency, the visual simultaneous localization and mapping technology based on near infrared perception (NIR-VSLAM) is proposed. In order to avoid ambient light interference, the near infrared light is innovatively selected as the light source. The luminosity parameter estimation of error energy function, halo factor and exposure time and the light source irradiance correction method are proposed in this paper, which greatly improves the positioning accuracy of Direct Sparse Odometry (DSO). The feasibility of the proposed method in four large scenes is verified, which provides the reference for visual positioning in automatic driving and mobile robot.

* 12 pages, 9 figures, 2 tables

Via

Access Paper or Ask Questions

SuperNeRF-GAN: A Universal 3D-Consistent Super-Resolution Framework for Efficient and Enhanced 3D-Aware Image Synthesis

Jan 12, 2025

Peng Zheng, Linzhi Huang, Yizhou Yu, Yi Chang, Yilin Wang, Rui Ma

Figure 1 for SuperNeRF-GAN: A Universal 3D-Consistent Super-Resolution Framework for Efficient and Enhanced 3D-Aware Image Synthesis

Figure 2 for SuperNeRF-GAN: A Universal 3D-Consistent Super-Resolution Framework for Efficient and Enhanced 3D-Aware Image Synthesis

Figure 3 for SuperNeRF-GAN: A Universal 3D-Consistent Super-Resolution Framework for Efficient and Enhanced 3D-Aware Image Synthesis

Figure 4 for SuperNeRF-GAN: A Universal 3D-Consistent Super-Resolution Framework for Efficient and Enhanced 3D-Aware Image Synthesis

Abstract:Neural volume rendering techniques, such as NeRF, have revolutionized 3D-aware image synthesis by enabling the generation of images of a single scene or object from various camera poses. However, the high computational cost of NeRF presents challenges for synthesizing high-resolution (HR) images. Most existing methods address this issue by leveraging 2D super-resolution, which compromise 3D-consistency. Other methods propose radiance manifolds or two-stage generation to achieve 3D-consistent HR synthesis, yet they are limited to specific synthesis tasks, reducing their universality. To tackle these challenges, we propose SuperNeRF-GAN, a universal framework for 3D-consistent super-resolution. A key highlight of SuperNeRF-GAN is its seamless integration with NeRF-based 3D-aware image synthesis methods and it can simultaneously enhance the resolution of generated images while preserving 3D-consistency and reducing computational cost. Specifically, given a pre-trained generator capable of producing a NeRF representation such as tri-plane, we first perform volume rendering to obtain a low-resolution image with corresponding depth and normal map. Then, we employ a NeRF Super-Resolution module which learns a network to obtain a high-resolution NeRF. Next, we propose a novel Depth-Guided Rendering process which contains three simple yet effective steps, including the construction of a boundary-correct multi-depth map through depth aggregation, a normal-guided depth super-resolution and a depth-guided NeRF rendering. Experimental results demonstrate the superior efficiency, 3D-consistency, and quality of our approach. Additionally, ablation studies confirm the effectiveness of our proposed components.

Via

Access Paper or Ask Questions

ZenSVI: An Open-Source Software for the Integrated Acquisition, Processing and Analysis of Street View Imagery Towards Scalable Urban Science

Dec 24, 2024

Koichi Ito, Yihan Zhu, Mahmoud Abdelrahman, Xiucheng Liang, Zicheng Fan, Yujun Hou, Tianhong Zhao, Rui Ma, Kunihiko Fujiwara, Jiani Ouyang(+2 more)

Abstract:Street view imagery (SVI) has been instrumental in many studies in the past decade to understand and characterize street features and the built environment. Researchers across a variety of domains, such as transportation, health, architecture, human perception, and infrastructure have employed different methods to analyze SVI. However, these applications and image-processing procedures have not been standardized, and solutions have been implemented in isolation, often making it difficult for others to reproduce existing work and carry out new research. Using SVI for research requires multiple technical steps: accessing APIs for scalable data collection, preprocessing images to standardize formats, implementing computer vision models for feature extraction, and conducting spatial analysis. These technical requirements create barriers for researchers in urban studies, particularly those without extensive programming experience. We develop ZenSVI, a free and open-source Python package that integrates and implements the entire process of SVI analysis, supporting a wide range of use cases. Its end-to-end pipeline includes downloading SVI from multiple platforms (e.g., Mapillary and KartaView) efficiently, analyzing metadata of SVI, applying computer vision models to extract target features, transforming SVI into different projections (e.g., fish-eye and perspective) and different formats (e.g., depth map and point cloud), visualizing analyses with maps and plots, and exporting outputs to other software tools. We demonstrate its use in Singapore through a case study of data quality assessment and clustering analysis in a streamlined manner. Our software improves the transparency, reproducibility, and scalability of research relying on SVI and supports researchers in conducting urban analyses efficiently. Its modular design facilitates extensions and unlocking new use cases.

Via

Access Paper or Ask Questions

Results of the Big ANN: NeurIPS'23 competition

Sep 25, 2024

Harsha Vardhan Simhadri, Martin Aumüller, Amir Ingber, Matthijs Douze, George Williams, Magdalen Dobson Manohar, Dmitry Baranchuk, Edo Liberty, Frank Liu, Ben Landrum(+13 more)

Figure 1 for Results of the Big ANN: NeurIPS'23 competition

Figure 2 for Results of the Big ANN: NeurIPS'23 competition

Figure 3 for Results of the Big ANN: NeurIPS'23 competition

Figure 4 for Results of the Big ANN: NeurIPS'23 competition

Abstract:The 2023 Big ANN Challenge, held at NeurIPS 2023, focused on advancing the state-of-the-art in indexing data structures and search algorithms for practical variants of Approximate Nearest Neighbor (ANN) search that reflect the growing complexity and diversity of workloads. Unlike prior challenges that emphasized scaling up classical ANN search ~\cite{DBLP:conf/nips/SimhadriWADBBCH21}, this competition addressed filtered search, out-of-distribution data, sparse and streaming variants of ANNS. Participants developed and submitted innovative solutions that were evaluated on new standard datasets with constrained computational resources. The results showcased significant improvements in search accuracy and efficiency over industry-standard baselines, with notable contributions from both academic and industrial teams. This paper summarizes the competition tracks, datasets, evaluation metrics, and the innovative approaches of the top-performing submissions, providing insights into the current advancements and future directions in the field of approximate nearest neighbor search.

* Code: https://github.com/harsha-simhadri/big-ann-benchmarks/releases/tag/v0.3.0

Via

Access Paper or Ask Questions

One-Shot Learning for Pose-Guided Person Image Synthesis in the Wild

Sep 15, 2024

Dongqi Fan, Tao Chen, Mingjie Wang, Rui Ma, Qiang Tang, Zili Yi, Qian Wang, Liang Chang

Figure 1 for One-Shot Learning for Pose-Guided Person Image Synthesis in the Wild

Figure 2 for One-Shot Learning for Pose-Guided Person Image Synthesis in the Wild

Figure 3 for One-Shot Learning for Pose-Guided Person Image Synthesis in the Wild

Figure 4 for One-Shot Learning for Pose-Guided Person Image Synthesis in the Wild

Abstract:Current Pose-Guided Person Image Synthesis (PGPIS) methods depend heavily on large amounts of labeled triplet data to train the generator in a supervised manner. However, they often falter when applied to in-the-wild samples, primarily due to the distribution gap between the training datasets and real-world test samples. While some researchers aim to enhance model generalizability through sophisticated training procedures, advanced architectures, or by creating more diverse datasets, we adopt the test-time fine-tuning paradigm to customize a pre-trained Text2Image (T2I) model. However, naively applying test-time tuning results in inconsistencies in facial identities and appearance attributes. To address this, we introduce a Visual Consistency Module (VCM), which enhances appearance consistency by combining the face, text, and image embedding. Our approach, named OnePoseTrans, requires only a single source image to generate high-quality pose transfer results, offering greater stability than state-of-the-art data-driven methods. For each test case, OnePoseTrans customizes a model in around 48 seconds with an NVIDIA V100 GPU.

Via

Access Paper or Ask Questions

DiffPop: Plausibility-Guided Object Placement Diffusion for Image Composition

Jun 12, 2024

Jiacheng Liu, Hang Zhou, Shida Wei, Rui Ma

Abstract:In this paper, we address the problem of plausible object placement for the challenging task of realistic image composition. We propose DiffPop, the first framework that utilizes plausibility-guided denoising diffusion probabilistic model to learn the scale and spatial relations among multiple objects and the corresponding scene image. First, we train an unguided diffusion model to directly learn the object placement parameters in a self-supervised manner. Then, we develop a human-in-the-loop pipeline which exploits human labeling on the diffusion-generated composite images to provide the weak supervision for training a structural plausibility classifier. The classifier is further used to guide the diffusion sampling process towards generating the plausible object placement. Experimental results verify the superiority of our method for producing plausible and diverse composite images on the new Cityscapes-OP dataset and the public OPA dataset, as well as demonstrate its potential in applications such as data augmentation and multi-object placement tasks. Our dataset and code will be released.

Via

Access Paper or Ask Questions

GaussianPrediction: Dynamic 3D Gaussian Prediction for Motion Extrapolation and Free View Synthesis

May 30, 2024

Boming Zhao, Yuan Li, Ziyu Sun, Lin Zeng, Yujun Shen, Rui Ma, Yinda Zhang, Hujun Bao, Zhaopeng Cui

Figure 1 for GaussianPrediction: Dynamic 3D Gaussian Prediction for Motion Extrapolation and Free View Synthesis

Figure 2 for GaussianPrediction: Dynamic 3D Gaussian Prediction for Motion Extrapolation and Free View Synthesis

Figure 3 for GaussianPrediction: Dynamic 3D Gaussian Prediction for Motion Extrapolation and Free View Synthesis

Figure 4 for GaussianPrediction: Dynamic 3D Gaussian Prediction for Motion Extrapolation and Free View Synthesis

Abstract:Forecasting future scenarios in dynamic environments is essential for intelligent decision-making and navigation, a challenge yet to be fully realized in computer vision and robotics. Traditional approaches like video prediction and novel-view synthesis either lack the ability to forecast from arbitrary viewpoints or to predict temporal dynamics. In this paper, we introduce GaussianPrediction, a novel framework that empowers 3D Gaussian representations with dynamic scene modeling and future scenario synthesis in dynamic environments. GaussianPrediction can forecast future states from any viewpoint, using video observations of dynamic scenes. To this end, we first propose a 3D Gaussian canonical space with deformation modeling to capture the appearance and geometry of dynamic scenes, and integrate the lifecycle property into Gaussians for irreversible deformations. To make the prediction feasible and efficient, a concentric motion distillation approach is developed by distilling the scene motion with key points. Finally, a Graph Convolutional Network is employed to predict the motions of key points, enabling the rendering of photorealistic images of future scenarios. Our framework shows outstanding performance on both synthetic and real-world datasets, demonstrating its efficacy in predicting and rendering future environments.

* Accepted to SIGGRAPH 2024 Conference. Project Page: https://zju3dv.github.io/gaussian-prediction/

Via

Access Paper or Ask Questions