Jinwoo Choi

Universal Domain Adaptation for Semantic Segmentation

May 28, 2025

Seun-An Choe, Keon-Hee Park, Jinwoo Choi, Gyeong-Moon Park

Abstract:Unsupervised domain adaptation for semantic segmentation (UDA-SS) aims to transfer knowledge from labeled source data to unlabeled target data. However, traditional UDA-SS methods assume that category settings between source and target domains are known, which is unrealistic in real-world scenarios. This leads to performance degradation if private classes exist. To address this limitation, we propose Universal Domain Adaptation for Semantic Segmentation (UniDA-SS), achieving robust adaptation even without prior knowledge of category settings. We define the problem in the UniDA-SS scenario as low confidence scores of common classes in the target domain, which leads to confusion with private classes. To solve this problem, we propose UniMAP: UniDA-SS with Image Matching and Prototype-based Distinction, a novel framework composed of two key components. First, Domain-Specific Prototype-based Distinction (DSPD) divides each class into two domain-specific prototypes, enabling finer separation of domain-specific features and enhancing the identification of common classes across domains. Second, Target-based Image Matching (TIM) selects a source image containing the most common-class pixels based on the target pseudo-label and pairs it in a batch to promote effective learning of common classes. We also introduce a new UniDA-SS benchmark and demonstrate through various experiments that UniMAP significantly outperforms baselines. The code is available at https://github.com/KU-VGI/UniMAP.

* Accepted by CVPR 2025
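
The TIM step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the nested-list label format, and the rule used to decide which classes are "common" (classes seen in the target pseudo-label that also belong to the source label set) are all assumptions made for illustration.

```python
from typing import List, Sequence, Set

LabelMap = Sequence[Sequence[int]]  # H x W grid of class ids (hypothetical format)


def count_pixels_in(label_map: LabelMap, classes: Set[int]) -> int:
    """Count the pixels whose class id is in `classes`."""
    return sum(1 for row in label_map for c in row if c in classes)


def match_source_image(target_pseudo_label: LabelMap,
                       source_labels: List[LabelMap],
                       source_classes: Set[int]) -> int:
    """Return the index of the source label map with the most common-class pixels.

    Assumption: a class is treated as "common" when it appears in the target
    pseudo-label and is also part of the source label set.
    """
    target_classes = {c for row in target_pseudo_label for c in row}
    common = target_classes & source_classes
    counts = [count_pixels_in(lbl, common) for lbl in source_labels]
    # Ties resolve to the first source image with the maximal count.
    return max(range(len(counts)), key=counts.__getitem__)


# Toy data: class 9 stands for a target-private prediction.
target = [[0, 0], [1, 9]]
sources = [
    [[2, 2], [2, 0]],  # 1 common-class pixel
    [[0, 1], [1, 2]],  # 3 common-class pixels
    [[2, 2], [2, 2]],  # 0 common-class pixels
]
best = match_source_image(target, sources, source_classes={0, 1, 2})
```

In this toy case `best` is 1, so the second source image would be paired with the target image in the batch.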

Dynamic Contrastive Skill Learning with State-Transition Based Skill Clustering and Dynamic Length Adjustment

Apr 21, 2025

Jinwoo Choi, Seung-Woo Seo

Abstract:Reinforcement learning (RL) has made significant progress in various domains, but scaling it to long-horizon tasks with complex decision-making remains challenging. Skill learning attempts to address this by abstracting actions into higher-level behaviors. However, current approaches often fail to recognize semantically similar behaviors as the same skill and use fixed skill lengths, limiting flexibility and generalization. To address this, we propose Dynamic Contrastive Skill Learning (DCSL), a novel framework that redefines skill representation and learning. DCSL introduces three key ideas: state-transition based skill representation, skill similarity function learning, and dynamic skill length adjustment. By focusing on state transitions and leveraging contrastive learning, DCSL effectively captures the semantic context of behaviors and adapts skill lengths to match the appropriate temporal extent of behaviors. Our approach enables more flexible and adaptive skill extraction, particularly in complex or noisy datasets, and demonstrates competitive performance compared to existing methods in task completion and efficiency.

* ICLR 2025; 23 pages, 12 figures

PCBEAR: Pose Concept Bottleneck for Explainable Action Recognition

Apr 17, 2025

Jongseo Lee, Wooil Lee, Gyeong-Moon Park, Seong Tae Kim, Jinwoo Choi

Abstract:Human action recognition (HAR) has achieved impressive results with deep learning models, but their decision-making process remains opaque due to their black-box nature. Ensuring interpretability is crucial, especially for real-world applications requiring transparency and accountability. Existing video XAI methods primarily rely on feature attribution or static textual concepts, both of which struggle to capture motion dynamics and temporal dependencies essential for action understanding. To address these challenges, we propose Pose Concept Bottleneck for Explainable Action Recognition (PCBEAR), a novel concept bottleneck framework that introduces human pose sequences as motion-aware, structured concepts for video action recognition. Unlike methods based on pixel-level features or static textual descriptions, PCBEAR leverages human skeleton poses, which focus solely on body movements, providing robust and interpretable explanations of motion dynamics. We define two types of pose-based concepts: static pose concepts for spatial configurations at individual frames, and dynamic pose concepts for motion patterns across multiple frames. To construct these concepts, PCBEAR applies clustering to video pose sequences, allowing for automatic discovery of meaningful concepts without manual annotation. We validate PCBEAR on KTH, Penn-Action, and HAA500, showing that it achieves high classification performance while offering interpretable, motion-driven explanations. Our method provides both strong predictive performance and human-understandable insights into the model's reasoning process, enabling test-time interventions for debugging and improving model behavior.

* This paper is accepted by CVPRW 2025
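
The abstract says that pose concepts are discovered by clustering pose sequences but does not name the algorithm. The sketch below swaps in plain k-means as an assumed stand-in; the function names and the flattened-vector input format are hypothetical. A per-frame pose vector would correspond to a static concept, and a vector formed by concatenating several frames to a dynamic one.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point: Vector, centroids: List[List[float]]) -> int:
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points: List[Vector], k: int, iterations: int = 10) -> List[int]:
    """Assign each flattened pose vector to one of k clusters (concept ids)."""
    # Deterministic initialization for the sketch: the first k points.
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        assignments = [nearest(p, centroids) for p in points]
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return [nearest(p, centroids) for p in points]


# Toy "poses": two tight groups in a 2-D feature space.
poses = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
concept_ids = kmeans(poses, k=2)
```

Each resulting cluster id can then serve as a candidate concept label without manual annotation, which is the property the abstract emphasizes.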

CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition

Mar 30, 2025

Jongseo Lee, Joohyun Chang, Dongho Lee, Jinwoo Choi

Abstract:We propose Cross-Attention in Audio, Space, and Time (CA^2ST), a transformer-based method for holistic video recognition. Recognizing actions in videos requires both spatial and temporal understanding, yet most existing models lack a balanced spatio-temporal understanding of videos. To address this, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), using only RGB input. In each layer of CAST, Bottleneck Cross-Attention (B-CA) enables spatial and temporal experts to exchange information and make synergistic predictions. For holistic video understanding, we extend CAST by integrating an audio expert, forming Cross-Attention in Visual and Audio (CAVA). We validate CAST on benchmarks with different characteristics (EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400), where it consistently shows balanced performance. We also validate CAVA on audio-visual action recognition benchmarks, including UCF-101, VGG-Sound, KineticsSound, and EPIC-SOUNDS. The favorable performance of CAVA across these datasets demonstrates effective information exchange among multiple experts within the B-CA module. In summary, CA^2ST combines CAST and CAVA by employing spatial, temporal, and audio experts through cross-attention, achieving balanced and holistic video understanding.

* 27 pages including appendix, TPAMI under review

MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations

Mar 20, 2025

Kyungho Bae, Jinhyung Kim, Sihaeng Lee, Soonyoung Lee, Gunhee Lee, Jinwoo Choi

Abstract:In this work, we tackle action-scene hallucination in Video Large Language Models (Video-LLMs), where models incorrectly predict actions based on the scene context or scenes based on observed actions. We observe that existing Video-LLMs often suffer from action-scene hallucination due to two main factors. First, existing Video-LLMs intermingle spatial and temporal features by applying an attention operation across all tokens. Second, they use the standard Rotary Position Embedding (RoPE), which causes the text tokens to overemphasize certain types of tokens depending on their sequential orders. To address these issues, we introduce MASH-VLM, Mitigating Action-Scene Hallucination in Video-LLMs through disentangled spatial-temporal representations. Our approach includes two key innovations: (1) DST-attention, a novel attention mechanism that disentangles the spatial and temporal tokens within the LLM by using masked attention to restrict direct interactions between the spatial and temporal tokens; (2) Harmonic-RoPE, which extends the dimensionality of the positional IDs, allowing the spatial and temporal tokens to maintain balanced positions relative to the text tokens. To evaluate the action-scene hallucination in Video-LLMs, we introduce the UNSCENE benchmark with 1,320 videos and 4,078 QA pairs. Extensive experiments demonstrate that MASH-VLM achieves state-of-the-art results on the UNSCENE benchmark, as well as on existing video understanding benchmarks.

* Accepted for CVPR 2025
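
The masking idea behind DST-attention can be sketched as a boolean attention mask. This is an illustrative assumption, not the paper's code: the token-type labels and the `dst_mask` helper are hypothetical, and the sketch omits the causal masking that an LLM would normally apply on top.

```python
from typing import List


def dst_mask(token_types: List[str]) -> List[List[bool]]:
    """Build allowed[i][j]: True when query token i may attend to key token j.

    Assumed rule: spatial and temporal tokens may not attend to each other
    directly; every other pair (including text with either kind) is allowed.
    """
    blocked = {("spatial", "temporal"), ("temporal", "spatial")}
    return [[(q, k) not in blocked for k in token_types] for q in token_types]


# Toy sequence: two spatial tokens, two temporal tokens, one text token.
types = ["spatial", "spatial", "temporal", "temporal", "text"]
mask = dst_mask(types)
```

Under this rule the text token still sees both token groups, so spatial and temporal information can only be combined through the text positions rather than by direct interaction.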

The Geometry of Optimal Gait Families for Steering Kinematic Locomoting Systems

Feb 24, 2025

Jinwoo Choi, Siming Deng, Nathan Justus, Noah J. Cowan, Ross L. Hatton

Abstract:Motion planning for locomotion systems typically requires translating high-level rigid-body tasks into low-level joint trajectories, a process that is straightforward for car-like robots with fixed, unbounded actuation inputs but more challenging for systems like snake robots, where the mapping depends on the current configuration and is constrained by joint limits. In this paper, we focus on generating continuous families of optimal gaits (collections of gaits parameterized by step size or steering rate) to enhance controllability and maneuverability. We uncover the underlying geometric structure of these optimal gait families and propose methods for constructing them using both global and local search strategies, which complement each other. The global search approach is robust to nonsmooth behavior, albeit yielding reduced-order solutions, while the local search provides higher accuracy but can be unstable near nonsmooth regions. To demonstrate our framework, we generate optimal gait families for viscous and perfect-fluid three-link swimmers. This work lays a foundation for integrating low-level joint controllers with higher-level motion planners in complex locomotion systems.

* 17 pages, submitted to IEEE Transactions on Robotics

HiCM$^2$: Hierarchical Compact Memory Modeling for Dense Video Captioning

Dec 19, 2024

Minkuk Kim, Hyeon Bae Kim, Jinyoung Moon, Jinwoo Choi, Seong Tae Kim

Abstract:With the growing demand for solutions to real-world video challenges, interest in dense video captioning (DVC) has been on the rise. DVC involves the automatic captioning and localization of untrimmed videos. Several studies highlight the challenges of DVC and introduce improved methods utilizing prior knowledge, such as pre-training and external memory. In this research, we propose a model that leverages the prior knowledge of human-oriented hierarchical compact memory inspired by human memory hierarchy and cognition. To mimic human-like memory recall, we construct a hierarchical memory and a hierarchical memory reading module. We build an efficient hierarchical compact memory by employing clustering of memory events and summarization using large language models. Comparative experiments demonstrate that this hierarchical memory recall process improves DVC performance, achieving state-of-the-art results on the YouCook2 and ViTT datasets.

* AAAI 2025

Geometric Optimal Control of Mechanical Systems with Gravitational and Resistive Force

Oct 12, 2024

Jinwoo Choi, Alejandro Cabrera, Ross L. Hatton

Abstract:Optimal control plays a crucial role in numerous mechanical and robotic applications. Broadly, optimal control methods are divided into direct methods (which optimize trajectories directly via discretization) and indirect methods (which transform optimality conditions into equations that guarantee optimal trajectories). While direct methods could mask geometric insights into system dynamics due to discretization, indirect methods offer a deeper understanding of the system's geometry. In this paper, we propose a geometric framework for understanding optimal control in mechanical systems, focusing on the combined effects of inertia, drag, and gravitational forces. By modeling mechanical systems as configuration manifolds equipped with kinetic and drag metrics, alongside a potential field, we explore how these factors influence trajectory optimization. We derive optimal control equations incorporating these effects and apply them to two-link and UR5 robotic manipulators, demonstrating how manifold curvature and resistive forces shape optimal trajectories. This work offers a comprehensive geometric approach to optimal control, with broad applications to robotic systems.

* 6 pages, submitted to The International Conference on Robotics and Automation (ICRA)

PCEvE: Part Contribution Evaluation Based Model Explanation for Human Figure Drawing Assessment and Beyond

Sep 26, 2024

Jongseo Lee, Geo Ahn, Jinwoo Choi, Seongtae Kim

Abstract:For automatic human figure drawing (HFD) assessment tasks, such as diagnosing autism spectrum disorder (ASD) using HFD images, the clarity and explainability of a model decision are crucial. Existing pixel-level attribution-based explainable AI (XAI) approaches demand considerable effort from users to interpret the semantic information of a region in an image, which can often be time-consuming and impractical. To overcome this challenge, we propose a part contribution evaluation based model explanation (PCEvE) framework. Building on part detection, we measure the Shapley value of each individual part to evaluate its contribution to a model decision. Unlike existing attribution-based XAI approaches, PCEvE provides a straightforward explanation of a model decision, i.e., a part contribution histogram. Furthermore, PCEvE expands the scope of explanations beyond the conventional sample-level to include class-level and task-level insights, offering a richer, more comprehensive understanding of model behavior. We rigorously validate PCEvE via extensive experiments on multiple HFD assessment datasets. We also sanity-check the proposed method with a set of controlled experiments. Additionally, we demonstrate the versatility and applicability of our method to other domains by applying it to a photo-realistic dataset, Stanford Cars.
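
The abstract does not say how the per-part Shapley value is computed. As a minimal sketch, the exact subset-enumeration formula can be applied when the number of parts is small; the part names and the toy `score` function below are hypothetical stand-ins for a model's output on an image with only the given parts visible.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_values(parts: Sequence[str],
                   score: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Exact Shapley value of each part, enumerating all subsets of the others."""
    n = len(parts)
    values: Dict[str, float] = {}
    for p in parts:
        others = [q for q in parts if q != p]
        total = 0.0
        for k in range(n):  # k = size of the coalition without p
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of p when added to coalition s.
                total += weight * (score(s | {p}) - score(s))
        values[p] = total
    return values


# Toy additive score: each visible part adds a fixed amount (assumed numbers).
PART_WEIGHTS = {"head": 0.5, "arms": 0.3, "legs": 0.2}


def toy_score(visible: FrozenSet[str]) -> float:
    return sum(PART_WEIGHTS[p] for p in visible)


contributions = shapley_values(list(PART_WEIGHTS), toy_score)
```

For an additive score like this one, each part's Shapley value equals its own weight, and the values sum to the score of the full part set minus the score of the empty set. Plotting `contributions` as bars gives the kind of part contribution histogram the abstract describes. Exact enumeration grows exponentially with the number of parts, so larger part sets would need a sampling approximation.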

Optimal Control Approach for Gait Transition with Riemannian Splines

Sep 13, 2024

Jinwoo Choi, Ross L. Hatton

Abstract:Robotic locomotion often relies on sequenced gaits to efficiently convert control input into desired motion. Despite extensive studies on gait optimization, achieving smooth and efficient gait transitions remains challenging. In this paper, we propose a general solver based on geometric optimal control methods, leveraging insights from previous works on gait efficiency. Building upon our previous work, we express the effort to execute the trajectory as distinct geometric objects, transforming the optimization problems into boundary value problems. To validate our approach, we generate gait transition trajectories for three-link swimmers across various fluid environments. This work provides insights into optimal trajectory geometries and mechanical considerations for robotic locomotion.

* 7 pages, Accepted by the 63rd IEEE Conference on Decision and Control (CDC 2024)
