Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Byoung-Tak Zhang

Biointelligence Laboratory, Department of Computer Science and Engineering, Seoul National University, Seoul, South Korea, AI Institute of Seoul National University

Towards Spatially Consistent Image Generation: On Incorporating Intrinsic Scene Properties into Diffusion Models

Aug 14, 2025

Hyundo Lee, Suhyung Choi, Byoung-Tak Zhang, Inwoo Hwang

Abstract:Image generation models trained on large datasets can synthesize high-quality images but often produce spatially inconsistent and distorted images due to limited information about the underlying structures and spatial layouts. In this work, we leverage intrinsic scene properties (e.g., depth, segmentation maps) that provide rich information about the underlying scene, unlike prior approaches that solely rely on image-text pairs or use intrinsics as conditional inputs. Our approach aims to co-generate both images and their corresponding intrinsics, enabling the model to implicitly capture the underlying scene structure and generate more spatially consistent and realistic images. Specifically, we first extract rich intrinsic scene properties from a large image dataset with pre-trained estimators, eliminating the need for additional scene information or explicit 3D representations. We then aggregate various intrinsic scene properties into a single latent variable using an autoencoder. Building upon pre-trained large-scale Latent Diffusion Models (LDMs), our method simultaneously denoises the image and intrinsic domains by carefully sharing mutual information so that the image and intrinsic reflect each other without degrading image quality. Experimental results demonstrate that our method corrects spatial inconsistencies and produces a more natural layout of scenes while maintaining the fidelity and textual alignment of the base model (e.g., Stable Diffusion).

Via

Access Paper or Ask Questions

DBMovi-GS: Dynamic View Synthesis from Blurry Monocular Video via Sparse-Controlled Gaussian Splatting

Jun 26, 2025

Yeon-Ji Song, Jaein Kim, Byung-Ju Kim, Byoung-Tak Zhang

Abstract:Novel view synthesis is a task of generating scenes from unseen perspectives; however, synthesizing dynamic scenes from blurry monocular videos remains an unresolved challenge that has yet to be effectively addressed. Existing novel view synthesis methods are often constrained by their reliance on high-resolution images or strong assumptions about static geometry and rigid scene priors. Consequently, their approaches lack robustness in real-world environments with dynamic object and camera motion, leading to instability and degraded visual fidelity. To address this, we propose Motion-aware Dynamic View Synthesis from Blurry Monocular Video via Sparse-Controlled Gaussian Splatting (DBMovi-GS), a method designed for dynamic view synthesis from blurry monocular videos. Our model generates dense 3D Gaussians, restoring sharpness from blurry videos and reconstructing detailed 3D geometry of the scene affected by dynamic motion variations. Our model achieves robust performance in novel view synthesis under dynamic blurry scenes and sets a new benchmark in realistic novel view synthesis for blurry monocular video inputs.

* CVPRW 2025, Neural Fields Beyond Conventional Cameras

Via

Access Paper or Ask Questions

OV-MAP : Open-Vocabulary Zero-Shot 3D Instance Segmentation Map for Robots

Jun 13, 2025

Juno Kim, Yesol Park, Hye-Jung Yoon, Byoung-Tak Zhang

Abstract:We introduce OV-MAP, a novel approach to open-world 3D mapping for mobile robots by integrating open-features into 3D maps to enhance object recognition capabilities. A significant challenge arises when overlapping features from adjacent voxels reduce instance-level precision, as features spill over voxel boundaries, blending neighboring regions together. Our method overcomes this by employing a class-agnostic segmentation model to project 2D masks into 3D space, combined with a supplemented depth image created by merging raw and synthetic depth from point clouds. This approach, along with a 3D mask voting mechanism, enables accurate zero-shot 3D instance segmentation without relying on 3D supervised segmentation models. We assess the effectiveness of our method through comprehensive experiments on public datasets such as ScanNet200 and Replica, demonstrating superior zero-shot performance, robustness, and adaptability across diverse environments. Additionally, we conducted real-world experiments to demonstrate our method's adaptability and robustness when applied to diverse real-world environments.

* Accepted at IROS 2024

Via

Access Paper or Ask Questions

Exploring Ordinal Bias in Action Recognition for Instructional Videos

Apr 09, 2025

Joochan Kim, Minjoon Jung, Byoung-Tak Zhang

Abstract:Action recognition models have achieved promising results in understanding instructional videos. However, they often rely on dominant, dataset-specific action sequences rather than true video comprehension, a problem that we define as ordinal bias. To address this issue, we propose two effective video manipulation methods: Action Masking, which masks frames of frequently co-occurring actions, and Sequence Shuffling, which randomizes the order of action segments. Through comprehensive experiments, we demonstrate that current models exhibit significant performance drops when confronted with nonstandard action sequences, underscoring their vulnerability to ordinal bias. Our findings emphasize the importance of rethinking evaluation strategies and developing models capable of generalizing beyond fixed action patterns in diverse instructional videos.

* Accepted to SCSL @ ICLR 2025

Via

Access Paper or Ask Questions

Variational Online Mirror Descent for Robust Learning in Schrödinger Bridge

Apr 03, 2025

Dong-Sig Han, Jaein Kim, Hee Bin Yoo, Byoung-Tak Zhang

Figure 1 for Variational Online Mirror Descent for Robust Learning in Schrödinger Bridge

Figure 2 for Variational Online Mirror Descent for Robust Learning in Schrödinger Bridge

Figure 3 for Variational Online Mirror Descent for Robust Learning in Schrödinger Bridge

Figure 4 for Variational Online Mirror Descent for Robust Learning in Schrödinger Bridge

Abstract:Sch\"odinger bridge (SB) has evolved into a universal class of probabilistic generative models. In practice, however, estimated learning signals are often uncertain, and the reliability promised by existing methods is often based on speculative optimal-case scenarios. Recent studies regarding the Sinkhorn algorithm through mirror descent (MD) have gained attention, revealing geometric insights into solution acquisition of the SB problems. In this paper, we propose a variational online MD (OMD) framework for the SB problems, which provides further stability to SB solvers. We formally prove convergence and a regret bound for the novel OMD formulation of SB acquisition. As a result, we propose a simulation-free SB algorithm called Variational Mirrored Schr\"odinger Bridge (VMSB) by utilizing the Wasserstein-Fisher-Rao geometry of the Gaussian mixture parameterization for Schr\"odinger potentials. Based on the Wasserstein gradient flow theory, the algorithm offers tractable learning dynamics that precisely approximate each OMD step. In experiments, we validate the performance of the proposed VMSB algorithm across an extensive suite of benchmarks. VMSB consistently outperforms contemporary SB solvers on a range of SB problems, demonstrating the robustness predicted by our theory.

Via

Access Paper or Ask Questions

From Sparse to Dense: Toddler-inspired Reward Transition in Goal-Oriented Reinforcement Learning

Jan 29, 2025

Junseok Park, Hyeonseo Yang, Min Whoo Lee, Won-Seok Choi, Minsu Lee, Byoung-Tak Zhang

Figure 1 for From Sparse to Dense: Toddler-inspired Reward Transition in Goal-Oriented Reinforcement Learning

Figure 2 for From Sparse to Dense: Toddler-inspired Reward Transition in Goal-Oriented Reinforcement Learning

Figure 3 for From Sparse to Dense: Toddler-inspired Reward Transition in Goal-Oriented Reinforcement Learning

Figure 4 for From Sparse to Dense: Toddler-inspired Reward Transition in Goal-Oriented Reinforcement Learning

Abstract:Reinforcement learning (RL) agents often face challenges in balancing exploration and exploitation, particularly in environments where sparse or dense rewards bias learning. Biological systems, such as human toddlers, naturally navigate this balance by transitioning from free exploration with sparse rewards to goal-directed behavior guided by increasingly dense rewards. Inspired by this natural progression, we investigate the Toddler-Inspired Reward Transition in goal-oriented RL tasks. Our study focuses on transitioning from sparse to potential-based dense (S2D) rewards while preserving optimal strategies. Through experiments on dynamic robotic arm manipulation and egocentric 3D navigation tasks, we demonstrate that effective S2D reward transitions significantly enhance learning performance and sample efficiency. Additionally, using a Cross-Density Visualizer, we show that S2D transitions smooth the policy loss landscape, resulting in wider minima that improve generalization in RL models. In addition, we reinterpret Tolman's maze experiments, underscoring the critical role of early free exploratory learning in the context of S2D rewards.

* Extended version of AAAI 2024 paper: Unveiling the Significance of Toddler-Inspired Reward Transition in Goal-Oriented Reinforcement Learning. This manuscript is currently being prepared for journal submission

Via

Access Paper or Ask Questions

On the Consistency of Video Large Language Models in Temporal Comprehension

Nov 20, 2024

Minjoon Jung, Junbin Xiao, Byoung-Tak Zhang, Angela Yao

Figure 1 for On the Consistency of Video Large Language Models in Temporal Comprehension

Figure 2 for On the Consistency of Video Large Language Models in Temporal Comprehension

Figure 3 for On the Consistency of Video Large Language Models in Temporal Comprehension

Figure 4 for On the Consistency of Video Large Language Models in Temporal Comprehension

Abstract:Video large language models (Video-LLMs) can temporally ground language queries and retrieve video moments. Yet, such temporal comprehension capabilities are neither well-studied nor understood. So we conduct a study on prediction consistency -- a key indicator for robustness and trustworthiness of temporal grounding. After the model identifies an initial moment within the video content, we apply a series of probes to check if the model's responses align with this initial grounding as an indicator of reliable comprehension. Our results reveal that current Video-LLMs are sensitive to variations in video contents, language queries, and task settings, unveiling severe deficiencies in maintaining consistency. We further explore common prompting and instruction-tuning methods as potential solutions, but find that their improvements are often unstable. To that end, we propose event temporal verification tuning that explicitly accounts for consistency, and demonstrate significant improvements for both grounding and consistency. Our data and code will be available at https://github.com/minjoong507/Consistency-of-Video-LLM.

Via

Access Paper or Ask Questions

CLIP-RT: Learning Language-Conditioned Robotic Policies from Natural Language Supervision

Nov 01, 2024

Gi-Cheon Kang, Junghyun Kim, Kyuhwan Shim, Jun Ki Lee, Byoung-Tak Zhang

Abstract:This paper explores how non-experts can teach robots desired skills in their environments. We argue that natural language is an intuitive and accessible interface for robot learning. To this end, we investigate two key aspects: (1) how non-experts collect robotic data using natural language supervision and (2) how pre-trained vision-language models learn end-to-end policies directly from this supervision. We propose a data collection framework that collects robot demonstrations based on natural language supervision (e.g., "move forward") and further augments these demonstrations. Next, we introduce a model that learns language-conditioned policies from natural language supervision called CLIP-RT. Our model employs pre-trained CLIP models and learns to predict actions represented in language via contrastive imitation learning. We first train CLIP-RT on large-scale robotic data and then enable it to learn desired skills using data collected from our framework. CLIP-RT shows strong capabilities in acquiring novel manipulation skills, outperforming the state-of-the-art model, OpenVLA (7B parameters), by 17% in average success rates, while using 7x fewer parameters (1B).

* 27 pages, 27 figures

Via

Access Paper or Ask Questions

HAPFI: History-Aware Planning based on Fused Information

Jul 23, 2024

Sujin Jeon, Suyeon Shin, Byoung-Tak Zhang

Abstract:Embodied Instruction Following (EIF) is a task of planning a long sequence of sub-goals given high-level natural language instructions, such as "Rinse a slice of lettuce and place on the white table next to the fork". To successfully execute these long-term horizon tasks, we argue that an agent must consider its past, i.e., historical data, when making decisions in each step. Nevertheless, recent approaches in EIF often neglects the knowledge from historical data and also do not effectively utilize information across the modalities. To this end, we propose History-Aware Planning based on Fused Information (HAPFI), effectively leveraging the historical data from diverse modalities that agents collect while interacting with the environment. Specifically, HAPFI integrates multiple modalities, including historical RGB observations, bounding boxes, sub-goals, and high-level instructions, by effectively fusing modalities via our Mutually Attentive Fusion method. Through experiments with diverse comparisons, we show that an agent utilizing historical multi-modal information surpasses all the compared methods that neglect the historical data in terms of action planning capability, enabling the generation of well-informed action plans for the next step. Moreover, we provided qualitative evidence highlighting the significance of leveraging historical multi-modal data, particularly in scenarios where the agent encounters intermediate failures, showcasing its robust re-planning capabilities.

* 7 pages, 3 figures, published to ICRA 2024

Via

Access Paper or Ask Questions

Fine-Grained Causal Dynamics Learning with Quantization for Improving Robustness in Reinforcement Learning

Jun 05, 2024

Inwoo Hwang, Yunhyeok Kwak, Suhyung Choi, Byoung-Tak Zhang, Sanghack Lee

Abstract:Causal dynamics learning has recently emerged as a promising approach to enhancing robustness in reinforcement learning (RL). Typically, the goal is to build a dynamics model that makes predictions based on the causal relationships among the entities. Despite the fact that causal connections often manifest only under certain contexts, existing approaches overlook such fine-grained relationships and lack a detailed understanding of the dynamics. In this work, we propose a novel dynamics model that infers fine-grained causal structures and employs them for prediction, leading to improved robustness in RL. The key idea is to jointly learn the dynamics model with a discrete latent variable that quantizes the state-action space into subgroups. This leads to recognizing meaningful context that displays sparse dependencies, where causal structures are learned for each subgroup throughout the training. Experimental results demonstrate the robustness of our method to unseen states and locally spurious correlations in downstream tasks where fine-grained causal reasoning is crucial. We further illustrate the effectiveness of our subgroup-based approach with quantization in discovering fine-grained causal relationships compared to prior methods.

* ICML 2024

Via

Access Paper or Ask Questions