Takayoshi Yamashita

SunnyParking: Multi-Shot Trajectory Generation and Motion State Awareness for Human-like Parking

Feb 25, 2026

Jishu Miao, Han Chen, Jiankun Zhai, Qi Liu, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi

Abstract: Autonomous parking fundamentally differs from on-road driving due to its frequent direction changes and complex maneuvering requirements. However, existing End-to-End (E2E) planning methods often simplify the parking task into a geometric path regression problem, neglecting explicit modeling of the vehicle's kinematic state. This "dimensionality deficiency" easily leads to physically infeasible trajectories and deviates from real human driving behavior, particularly at critical gear-shift points in multi-shot parking scenarios. In this paper, we propose SunnyParking, a novel dual-branch E2E architecture that achieves motion state awareness by jointly predicting spatial trajectories and discrete motion state sequences (e.g., forward/reverse). Additionally, we introduce a Fourier feature-based representation of target parking slots to overcome the resolution limitations of traditional bird's-eye view (BEV) approaches, enabling high-precision target interactions. Experimental results demonstrate that our framework generates more robust and human-like trajectories in complex multi-shot parking scenarios, while significantly improving gear-shift point localization accuracy compared to state-of-the-art methods. We open-source a new parking dataset based on the CARLA simulator, specifically designed to evaluate full prediction capabilities under complex maneuvers.
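
A rough illustration of the Fourier feature idea mentioned in the abstract: instead of rasterizing the target slot onto a fixed-resolution BEV grid, continuous slot coordinates can be mapped to sines and cosines at several frequencies. The sketch below is not the SunnyParking implementation; the function name `fourier_features`, the power-of-two frequency schedule, and the normalized slot values are all assumptions chosen for illustration.

```python
import math

def fourier_features(coords, num_bands=4):
    """Encode each scalar v as [sin(2^k * pi * v), cos(2^k * pi * v)] for k in 0..num_bands-1.

    The frequency schedule is an assumption, not taken from the paper.
    """
    features = []
    for value in coords:
        for band in range(num_bands):
            freq = (2.0 ** band) * math.pi
            features.append(math.sin(freq * value))
            features.append(math.cos(freq * value))
    return features

# Hypothetical target slot: normalized center (x, y) and heading, each in [-1, 1].
slot = (0.25, -0.5, 0.1)
encoded = fourier_features(slot, num_bands=4)
print(len(encoded))  # 3 inputs * 4 bands * 2 functions
```

Because the inputs stay continuous, the precision of the target description in such an encoding is not tied to the cell size of a grid.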

Single-Agent vs. Multi-Agent LLM Strategies for Automated Student Reflection Assessment

Apr 08, 2025

Gen Li, Li Chen, Cheng Tang, Valdemar Švábenský, Daisuke Deguchi, Takayoshi Yamashita, Atsushi Shimada

Abstract: We explore the use of Large Language Models (LLMs) for automated assessment of open-text student reflections and prediction of academic performance. Traditional methods for evaluating reflections are time-consuming and may not scale effectively in educational settings. In this work, we employ LLMs to transform student reflections into quantitative scores using two assessment strategies (single-agent and multi-agent) and two prompting techniques (zero-shot and few-shot). Our experiments, conducted on a dataset of 5,278 reflections from 377 students over three academic terms, demonstrate that the single-agent with few-shot strategy achieves the highest match rate with human evaluations. Furthermore, models utilizing LLM-assessed reflection scores outperform baselines in both at-risk student identification and grade prediction tasks. These findings suggest that LLMs can effectively automate reflection assessment, reduce educators' workload, and enable timely support for students who may need additional assistance. Our work emphasizes the potential of integrating advanced generative AI technologies into educational practices to enhance student engagement and academic success.

* To be published in Proceedings of the 29th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2025)

Panoramic Distortion-Aware Tokenization for Person Detection and Localization Using Transformers in Overhead Fisheye Images

Mar 18, 2025

Nobuhiko Wakai, Satoshi Sato, Yasunori Ishii, Takayoshi Yamashita

Abstract: Person detection methods are used widely in applications including visual surveillance, pedestrian detection, and robotics. However, accurate detection of persons from overhead fisheye images remains an open challenge because of factors including person rotation and small-sized persons. To address the person rotation problem, we convert the fisheye images into panoramic images. For smaller people, we focus on the geometry of the panoramas. Conventional detection methods tend to focus on larger people because these larger people yield large significant areas for feature maps. In equirectangular panoramic images, we find that a person's height decreases linearly near the top of the images. Using this finding, we leverage the significance values and aggregate tokens that are sorted based on these values to balance the significant areas. In this leveraging process, we introduce panoramic distortion-aware tokenization. This tokenization procedure divides a panoramic image using self-similarity figures that enable determination of optimal divisions without gaps, and we leverage the maximum significant values in each tile of token groups to preserve the significant areas of smaller people. To achieve higher detection accuracy, we propose a person detection and localization method that combines panoramic-image remapping and the tokenization procedure. Extensive experiments demonstrated that our method outperforms conventional methods when applied to large-scale datasets.

DeBiFormer: Vision Transformer with Deformable Agent Bi-level Routing Attention

Oct 11, 2024

Nguyen Huu Bao Long, Chenyu Zhang, Yuzhi Shi, Tsubasa Hirakawa, Takayoshi Yamashita, Tohgoroh Matsui, Hironobu Fujiyoshi

Abstract: Vision Transformers with various attention modules have demonstrated superior performance on vision tasks. While using sparsity-adaptive attention, such as in DAT, has yielded strong results in image classification, the key-value pairs selected by deformable points lack semantic relevance when fine-tuning for semantic segmentation tasks. The query-aware sparsity attention in BiFormer seeks to focus each query on top-k routed regions. However, during attention calculation, the selected key-value pairs are influenced by too many irrelevant queries, reducing attention on the more important ones. To address these issues, we propose the Deformable Bi-level Routing Attention (DBRA) module, which optimizes the selection of key-value pairs using agent queries and enhances the interpretability of queries in attention maps. Based on this, we introduce the Deformable Bi-level Routing Attention Transformer (DeBiFormer), a novel general-purpose vision transformer built with the DBRA module. DeBiFormer has been validated on various computer vision tasks, including image classification, object detection, and semantic segmentation, providing strong evidence of its effectiveness. Code is available at https://github.com/maclong01/DeBiFormer

* ACCV 2024
* 20 pages, 7 figures. arXiv admin note: text overlap with arXiv:2303.08810 by other authors

Nearest Neighbor Future Captioning: Generating Descriptions for Possible Collisions in Object Placement Tasks

Jul 18, 2024

Takumi Komatsu, Motonari Kambara, Shumpei Hatanaka, Haruka Matsuo, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi, Komei Sugiura

Abstract: Domestic service robots (DSRs) that support people in everyday environments have been widely investigated. However, their ability to predict and describe future risks resulting from their own actions remains insufficient. In this study, we focus on the linguistic explainability of DSRs. Most existing methods do not explicitly model the region of possible collisions; thus, they do not properly generate descriptions of these regions. In this paper, we propose the Nearest Neighbor Future Captioning Model that introduces the Nearest Neighbor Language Model for future captioning of possible collisions, which enhances the model output with a nearest neighbors retrieval mechanism. Furthermore, we introduce the Collision Attention Module that attends to regions of possible collisions, which enables our model to generate descriptions that adequately reflect the objects associated with possible collisions. To validate our method, we constructed a new dataset containing samples of collisions that can occur when a DSR places an object in a simulation environment. The experimental results demonstrated that our method outperformed baseline methods on standard metrics. In particular, on CIDEr-D, the baseline method obtained 25.09 points, whereas our method obtained 33.08 points.

* Accepted for presentation at Advanced Robotics 24
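
The retrieval mechanism named in the abstract can be pictured with the generic nearest-neighbor language model recipe, in which a distribution built from retrieved neighbors is interpolated with the base model's next-token distribution. The following is a minimal sketch under that assumption, not the paper's model; the function names, the toy vocabulary, the distances, and the interpolation weight `lam` are hypothetical.

```python
import math

def knn_distribution(neighbors, vocab):
    """Turn retrieved (distance, next_token) pairs into a distribution over vocab."""
    weights = [math.exp(-dist) for dist, _ in neighbors]  # closer neighbors weigh more
    total = sum(weights)
    probs = {token: 0.0 for token in vocab}
    for weight, (_, token) in zip(weights, neighbors):
        probs[token] += weight / total
    return probs

def interpolate(p_lm, p_knn, lam=0.25):
    """Mix the base model distribution with the retrieval distribution."""
    return {token: lam * p_knn[token] + (1.0 - lam) * p_lm[token] for token in p_lm}

# Toy example: the base model prefers "bottle", but most neighbors continue with "cup".
vocab = ["cup", "bottle", "shelf"]
p_lm = {"cup": 0.2, "bottle": 0.5, "shelf": 0.3}
neighbors = [(0.1, "cup"), (0.4, "cup"), (1.5, "bottle")]
p_knn = knn_distribution(neighbors, vocab)
p_final = interpolate(p_lm, p_knn)
print(p_final)
```

In this toy setting the retrieved neighbors raise the probability of "cup" above what the base model alone assigns.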

Layer-Wise Relevance Propagation with Conservation Property for ResNet

Jul 12, 2024

Seitaro Otsuki, Tsumugi Iida, Félix Doublet, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi, Komei Sugiura

Abstract: The transparent formulation of explanation methods is essential for elucidating the predictions of neural networks, which are typically black-box models. Layer-wise Relevance Propagation (LRP) is a well-established method that transparently traces the flow of a model's prediction backward through its architecture by backpropagating relevance scores. However, the conventional LRP does not fully consider the existence of skip connections, and thus its application to the widely used ResNet architecture has not been thoroughly explored. In this study, we extend LRP to ResNet models by introducing Relevance Splitting at points where the output from a skip connection converges with that from a residual block. Our formulation guarantees the conservation property throughout the process, thereby preserving the integrity of the generated explanations. To evaluate the effectiveness of our approach, we conduct experiments on ImageNet and the Caltech-UCSD Birds-200-2011 dataset. Our method achieves superior performance to that of baseline methods on standard evaluation metrics such as the Insertion-Deletion score while maintaining its conservation property. We will release our code for further research at https://5ei74r0.github.io/lrp-for-resnet.page/

* Accepted for presentation at ECCV2024
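
To make the conservation property concrete, the sketch below splits the relevance arriving at a merge point between the skip branch and the residual branch so that the two parts always add up to the incoming relevance. The abstract does not give the splitting rule, so the magnitude-proportional rule, the function name `split_relevance`, and the example values are assumptions rather than the paper's formulation.

```python
def split_relevance(relevance, skip_out, residual_out, eps=1e-9):
    """Split per-element relevance between two merged branches.

    Assumed rule: each branch receives a share proportional to the magnitude
    of its contribution; if both contributions are ~0, split evenly.
    """
    r_skip, r_res = [], []
    for r, s, f in zip(relevance, skip_out, residual_out):
        total = abs(s) + abs(f)
        share = 0.5 if total < eps else abs(s) / total
        r_skip.append(r * share)
        r_res.append(r * (1.0 - share))
    return r_skip, r_res

# Hypothetical values at a merge point y = skip_out + residual_out.
relevance = [1.0, 2.0]
r_skip, r_res = split_relevance(relevance, [3.0, 0.0], [1.0, 0.0])

# Conservation check: the two parts sum back to the incoming relevance.
for r, a, b in zip(relevance, r_skip, r_res):
    assert abs((a + b) - r) < 1e-12
print(r_skip, r_res)
```

Any rule whose two shares sum to one conserves relevance in this sense; the proportional choice is only one such option.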

Action Q-Transformer: Visual Explanation in Deep Reinforcement Learning with Encoder-Decoder Model using Action Query

Jun 24, 2023

Hidenori Itaya, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi, Komei Sugiura

Abstract: The excellent performance of Transformer in supervised learning has led to growing interest in its potential application to deep reinforcement learning (DRL) to achieve high performance on a wide variety of problems. However, the decision making of a DRL agent is a black box, which greatly hinders the application of the agent to real-world problems. To address this problem, we propose the Action Q-Transformer (AQT), which introduces a transformer encoder-decoder structure to Q-learning based DRL methods. In AQT, the encoder calculates the state value function and the decoder calculates the advantage function to promote the acquisition of different attentions indicating the agent's decision-making. The decoder in AQT utilizes action queries, which represent the information of each action, as queries. This enables us to obtain the attentions for the state value and for each action. By acquiring and visualizing these attentions that detail the agent's decision-making, we achieve a DRL model with high interpretability. In this paper, we show that visualization of attention in Atari 2600 games enables detailed analysis of agents' decision-making in various game tasks. Further, experimental results demonstrate that our method can achieve higher performance than the baseline in some games.

* 16 pages, 8 figures, 3 tables
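
The abstract says the encoder produces a state value and the decoder produces per-action advantages, but not how they are combined. A minimal sketch, assuming the standard dueling aggregation Q(s, a) = V(s) + A(s, a) - mean(A), is given below; the function name and the numbers are hypothetical and do not come from the paper.

```python
def dueling_q_values(state_value, advantages):
    """Combine a scalar state value with per-action advantages.

    Assumes the common dueling aggregation, which subtracts the mean
    advantage so that V and A are identifiable.
    """
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + adv - mean_advantage for adv in advantages]

# Hypothetical outputs: one state value (encoder side) and
# one advantage per action (decoder side, one action query each).
q_values = dueling_q_values(2.0, [1.0, 0.0, -1.0])
best_action = max(range(len(q_values)), key=lambda a: q_values[a])
print(q_values, best_action)
```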

Learning from AI: An Interactive Learning Method Using a DNN Model Incorporating Expert Knowledge as a Teacher

Jun 04, 2023

Kohei Hattori, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi

Abstract: Visual explanation is an approach for visualizing the grounds of judgment by deep learning, and it is possible to visually interpret the grounds of a judgment for a certain input by visualizing an attention map. As for deep-learning models that output erroneous decision-making grounds, a method that incorporates expert human knowledge in the model via an attention map in a manner that improves explanatory power and recognition accuracy is proposed. In this study, based on a deep-learning model that incorporates the knowledge of experts, a method by which a learner "learns from AI" the grounds for its decisions is proposed. An "attention branch network" (ABN), which has been fine-tuned with attention maps modified by experts, is prepared as a teacher. By using an interactive editing tool for the fine-tuned ABN and attention maps, the learner learns by editing the attention maps and changing the inference results. By repeatedly editing the attention maps and making inferences so that the correct recognition results are output, the learner can acquire the grounds for the expert's judgments embedded in the ABN. The results of an evaluation experiment with subjects show that learning using the proposed method is more efficient than the conventional method.

* 12 pages, 5 figures

PALF: Pre-Annotation and Camera-LiDAR Late Fusion for the Easy Annotation of Point Clouds

Apr 13, 2023

Yucheng Zhang, Masaki Fukuda, Yasunori Ishii, Kyoko Ohshima, Takayoshi Yamashita

Abstract: 3D object detection has become indispensable in the field of autonomous driving. To date, gratifying breakthroughs have been recorded in 3D object detection research, attributed to deep learning. However, deep learning algorithms are data-driven and require large amounts of annotated point cloud data for training and evaluation. Unlike labeling 2D images, annotating point cloud data is difficult due to sparsity, irregularity, and low resolution; it requires more manual work, and the annotation efficiency is much lower than for 2D images. Therefore, we propose an annotation algorithm for point cloud data that combines pre-annotation with camera-LiDAR late fusion to make annotation easy and accurate. The contributions of this study are as follows. We propose (1) a pre-annotation algorithm that employs 3D object detection and auto fitting for the easy annotation of point clouds, (2) a camera-LiDAR late fusion algorithm using 2D and 3D results for easy error checking, which helps annotators easily identify missing objects, and (3) a point cloud annotation evaluation pipeline to evaluate our experiments. The experimental results show that the proposed algorithm improves the annotating speed by 6.5 times and the annotation quality in terms of the 3D Intersection over Union and precision by 8.2 points and 5.6 points, respectively; additionally, the miss rate is reduced by 31.9 points.

Deep Single Image Camera Calibration by Heatmap Regression to Recover Fisheye Images Under Manhattan World Assumption Without Ambiguity

Mar 30, 2023

Nobuhiko Wakai, Satoshi Sato, Yasunori Ishii, Takayoshi Yamashita

Abstract: In orthogonal world coordinates, a Manhattan world lying along cuboid buildings is widely useful for various computer vision tasks. However, the Manhattan world has much room for improvement because the origin of pan angles from an image is arbitrary, that is, four-fold rotational symmetric ambiguity of pan angles. To address this problem, we propose a definition for the pan-angle origin based on the directions of the roads with respect to a camera and the direction of travel. We propose a learning-based calibration method that uses heatmap regression to remove the ambiguity by each direction of labeled image coordinates, similar to pose estimation keypoints. Simultaneously, our two-branched network recovers the rotation and removes fisheye distortion from a general scene image. To alleviate the lack of vanishing points in images, we introduce auxiliary diagonal points that have the optimal 3D arrangement of spatial uniformity. Extensive experiments demonstrated that our method outperforms conventional methods on large-scale datasets and with off-the-shelf cameras.
