Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Norimichi Ukita

Dynamic Group Detection using VLM-augmented Temporal Groupness Graph

Sep 05, 2025

Kaname Yokoyama, Chihiro Nakatani, Norimichi Ukita

Abstract:This paper proposes dynamic human group detection in videos. For detecting complex groups, not only the local appearance features of in-group members but also the global context of the scene are important. Such local and global appearance features in each frame are extracted using a Vision-Language Model (VLM) augmented for group detection in our method. For further improvement, the group structure should be consistent over time. While previous methods are stabilized on the assumption that groups are not changed in a video, our method detects dynamically changing groups by global optimization using a graph with all frames' groupness probabilities estimated by our groupness-augmented CLIP features. Our experimental results demonstrate that our method outperforms state-of-the-art group detection methods on public datasets. Code: https://github.com/irajisamurai/VLM-GroupDetection.git

* 10 pages, Accepted to ICCV2025

Via

Access Paper or Ask Questions

CacheFlow: Fast Human Motion Prediction by Cached Normalizing Flow

May 19, 2025

Takahiro Maeda, Jinkun Cao, Norimichi Ukita, Kris Kitani

Abstract:Many density estimation techniques for 3D human motion prediction require a significant amount of inference time, often exceeding the duration of the predicted time horizon. To address the need for faster density estimation for 3D human motion prediction, we introduce a novel flow-based method for human motion prediction called CacheFlow. Unlike previous conditional generative models that suffer from time efficiency, CacheFlow takes advantage of an unconditional flow-based generative model that transforms a Gaussian mixture into the density of future motions. The results of the computation of the flow-based generative model can be precomputed and cached. Then, for conditional prediction, we seek a mapping from historical trajectories to samples in the Gaussian mixture. This mapping can be done by a much more lightweight model, thus saving significant computation overhead compared to a typical conditional flow model. In such a two-stage fashion and by caching results from the slow flow model computation, we build our CacheFlow without loss of prediction accuracy and model expressiveness. This inference process is completed in approximately one millisecond, making it 4 times faster than previous VAE methods and 30 times faster than previous diffusion-based methods on standard benchmarks such as Human3.6M and AMASS datasets. Furthermore, our method demonstrates improved density estimation accuracy and comparable prediction accuracy to a SOTA method on Human3.6M. Our code and models will be publicly available.

Via

Access Paper or Ask Questions

Human Motion Prediction via Test-domain-aware Adaptation with Easily-available Human Motions Estimated from Videos

May 13, 2025

Katsuki Shimbo, Hiromu Taketsugu, Norimichi Ukita

Abstract:In 3D Human Motion Prediction (HMP), conventional methods train HMP models with expensive motion capture data. However, the data collection cost of such motion capture data limits the data diversity, which leads to poor generalizability to unseen motions or subjects. To address this issue, this paper proposes to enhance HMP with additional learning using estimated poses from easily available videos. The 2D poses estimated from the monocular videos are carefully transformed into motion capture-style 3D motions through our pipeline. By additional learning with the obtained motions, the HMP model is adapted to the test domain. The experimental results demonstrate the quantitative and qualitative impact of our method.

* 5 pages, 4 figures

Via

Access Paper or Ask Questions

Robot Motion Planning using One-Step Diffusion with Noise-Optimized Approximate Motions

Apr 28, 2025

Tomoharu Aizu, Takeru Oba, Yuki Kondo, Norimichi Ukita

Abstract:This paper proposes an image-based robot motion planning method using a one-step diffusion model. While the diffusion model allows for high-quality motion generation, its computational cost is too expensive to control a robot in real time. To achieve high quality and efficiency simultaneously, our one-step diffusion model takes an approximately generated motion, which is predicted directly from input images. This approximate motion is optimized by additive noise provided by our novel noise optimizer. Unlike general isotropic noise, our noise optimizer adjusts noise anisotropically depending on the uncertainty of each motion element. Our experimental results demonstrate that our method outperforms state-of-the-art methods while maintaining its efficiency by one-step diffusion.

* 7 pages, 5 figures. Under peer review

Via

Access Paper or Ask Questions

Physical Plausibility-aware Trajectory Prediction via Locomotion Embodiment

Mar 21, 2025

Hiromu Taketsugu, Takeru Oba, Takahiro Maeda, Shohei Nobuhara, Norimichi Ukita

Abstract:Humans can predict future human trajectories even from momentary observations by using human pose-related cues. However, previous Human Trajectory Prediction (HTP) methods leverage the pose cues implicitly, resulting in implausible predictions. To address this, we propose Locomotion Embodiment, a framework that explicitly evaluates the physical plausibility of the predicted trajectory by locomotion generation under the laws of physics. While the plausibility of locomotion is learned with an indifferentiable physics simulator, it is replaced by our differentiable Locomotion Value function to train an HTP network in a data-driven manner. In particular, our proposed Embodied Locomotion loss is beneficial for efficiently training a stochastic HTP network using multiple heads. Furthermore, the Locomotion Value filter is proposed to filter out implausible trajectories at inference. Experiments demonstrate that our method enhances even the state-of-the-art HTP methods across diverse datasets and problem settings. Our code is available at: https://github.com/ImIntheMiddle/EmLoco.

* CVPR2025. Project page: https://iminthemiddle.github.io/EmLoco-Page/

Via

Access Paper or Ask Questions

Size-Variable Virtual Try-On with Physical Clothes Size

Dec 09, 2024

Yohei Yamashita, Chihiro Nakatani, Norimichi Ukita

Figure 1 for Size-Variable Virtual Try-On with Physical Clothes Size

Figure 2 for Size-Variable Virtual Try-On with Physical Clothes Size

Figure 3 for Size-Variable Virtual Try-On with Physical Clothes Size

Figure 4 for Size-Variable Virtual Try-On with Physical Clothes Size

Abstract:This paper addresses a new virtual try-on problem of fitting any size of clothes to a reference person in the image domain. While previous image-based virtual try-on methods can produce highly natural try-on images, these methods fit the clothes on the person without considering the relative relationship between the physical sizes of the clothes and the person. Different from these methods, our method achieves size-variable virtual try-on in which the image size of the try-on clothes is changed depending on this relative relationship of the physical sizes. To relieve the difficulty in maintaining the physical size of the closes while synthesizing the high-fidelity image of the whole clothes, our proposed method focuses on the residual between the silhouettes of the clothes in the reference and try-on images. We also develop a size-variable virtual try-on dataset consisting of 1,524 images provided by 26 subjects. Furthermore, we propose an evaluation metric for size-variable virtual-try-on. Quantitative and qualitative experimental results show that our method can achieve size-variable virtual try-on better than general virtual try-on methods.

Via

Access Paper or Ask Questions

Test-time Cost-and-Quality Controllable Arbitrary-Scale Super-Resolution with Variable Fourier Components

Dec 07, 2024

Kazutoshi Akita, Norimichi Ukita

Figure 1 for Test-time Cost-and-Quality Controllable Arbitrary-Scale Super-Resolution with Variable Fourier Components

Figure 2 for Test-time Cost-and-Quality Controllable Arbitrary-Scale Super-Resolution with Variable Fourier Components

Figure 3 for Test-time Cost-and-Quality Controllable Arbitrary-Scale Super-Resolution with Variable Fourier Components

Figure 4 for Test-time Cost-and-Quality Controllable Arbitrary-Scale Super-Resolution with Variable Fourier Components

Abstract:Super-resolution (SR) with arbitrary scale factor and cost-and-quality controllability at test time is essential for various applications. While several arbitrary-scale SR methods have been proposed, these methods require us to modify the model structure and retrain it to control the computational cost and SR quality. To address this limitation, we propose a novel SR method using a Recurrent Neural Network (RNN) with the Fourier representation. In our method, the RNN sequentially estimates Fourier components, each consisting of frequency and amplitude, and aggregates these components to reconstruct an SR image. Since the RNN can adjust the number of recurrences at test time, we can control the computational cost and SR quality in a single model: fewer recurrences (i.e., fewer Fourier components) lead to lower cost but lower quality, while more recurrences (i.e., more Fourier components) lead to better quality but more cost. Experimental results prove that more Fourier components improve the PSNR score. Furthermore, even with fewer Fourier components, our method achieves a lower PSNR drop than other state-of-the-art arbitrary-scale SR methods.

* 14 pages, 10 figures

Via

Access Paper or Ask Questions

Burst Super-Resolution with Diffusion Models for Improving Perceptual Quality

Apr 08, 2024

Kyotaro Tokoro, Kazutoshi Akita, Norimichi Ukita

Figure 1 for Burst Super-Resolution with Diffusion Models for Improving Perceptual Quality

Figure 2 for Burst Super-Resolution with Diffusion Models for Improving Perceptual Quality

Figure 3 for Burst Super-Resolution with Diffusion Models for Improving Perceptual Quality

Figure 4 for Burst Super-Resolution with Diffusion Models for Improving Perceptual Quality

Abstract:While burst LR images are useful for improving the SR image quality compared with a single LR image, prior SR networks accepting the burst LR images are trained in a deterministic manner, which is known to produce a blurry SR image. In addition, it is difficult to perfectly align the burst LR images, making the SR image more blurry. Since such blurry images are perceptually degraded, we aim to reconstruct the sharp high-fidelity boundaries. Such high-fidelity images can be reconstructed by diffusion models. However, prior SR methods using the diffusion model are not properly optimized for the burst SR task. Specifically, the reverse process starting from a random sample is not optimized for image enhancement and restoration methods, including burst SR. In our proposed method, on the other hand, burst LR features are used to reconstruct the initial burst SR image that is fed into an intermediate step in the diffusion model. This reverse process from the intermediate step 1) skips diffusion steps for reconstructing the global structure of the image and 2) focuses on steps for refining detailed textures. Our experimental results demonstrate that our method can improve the scores of the perceptual quality metrics. Code: https://github.com/placerkyo/BSRD

* Accepted to IJCNN 2024 (International Joint Conference on Neural Networks)

Via

Access Paper or Ask Questions

Time-series Initialization and Conditioning for Video-agnostic Stabilization of Video Super-Resolution using Recurrent Networks

Mar 23, 2024

Hiroshi Mori, Norimichi Ukita

Figure 1 for Time-series Initialization and Conditioning for Video-agnostic Stabilization of Video Super-Resolution using Recurrent Networks

Figure 2 for Time-series Initialization and Conditioning for Video-agnostic Stabilization of Video Super-Resolution using Recurrent Networks

Figure 3 for Time-series Initialization and Conditioning for Video-agnostic Stabilization of Video Super-Resolution using Recurrent Networks

Figure 4 for Time-series Initialization and Conditioning for Video-agnostic Stabilization of Video Super-Resolution using Recurrent Networks

Abstract:A Recurrent Neural Network (RNN) for Video Super Resolution (VSR) is generally trained with randomly clipped and cropped short videos extracted from original training videos due to various challenges in learning RNNs. However, since this RNN is optimized to super-resolve short videos, VSR of long videos is degraded due to the domain gap. Our preliminary experiments reveal that such degradation changes depending on the video properties, such as the video length and dynamics. To avoid this degradation, this paper proposes the training strategy of RNN for VSR that can work efficiently and stably independently of the video length and dynamics. The proposed training strategy stabilizes VSR by training a VSR network with various RNN hidden states changed depending on the video properties. Since computing such a variety of hidden states is time-consuming, this computational cost is reduced by reusing the hidden states for efficient training. In addition, training stability is further improved with frame-number conditioning. Our experimental results demonstrate that the proposed method performed better than base methods in videos with various lengths and dynamics.

* Accepted to IJCNN 2024 (International Joint Conference on Neural Networks)

Via

Access Paper or Ask Questions

Inpainting-Driven Mask Optimization for Object Removal

Mar 23, 2024

Kodai Shimosato, Norimichi Ukita

Abstract:This paper proposes a mask optimization method for improving the quality of object removal using image inpainting. While many inpainting methods are trained with a set of random masks, a target for inpainting may be an object, such as a person, in many realistic scenarios. This domain gap between masks in training and inference images increases the difficulty of the inpainting task. In our method, this domain gap is resolved by training the inpainting network with object masks extracted by segmentation, and such object masks are also used in the inference step. Furthermore, to optimize the object masks for inpainting, the segmentation network is connected to the inpainting network and end-to-end trained to improve the inpainting performance. The effect of this end-to-end training is further enhanced by our mask expansion loss for achieving the trade-off between large and small masks. Experimental results demonstrate the effectiveness of our method for better object removal using image inpainting.

* Accepted to IJCNN 2024 (International Joint Conference on Neural Networks)

Via

Access Paper or Ask Questions