Jacinto Colan

Affordance-Based Disambiguation of Surgical Instructions for Collaborative Robot-Assisted Surgery

Sep 18, 2025

Ana Davila, Jacinto Colan, Yasuhisa Hasegawa

Abstract:Effective human-robot collaboration in surgery is affected by the inherent ambiguity of verbal communication. This paper presents a framework for a robotic surgical assistant that interprets and disambiguates verbal instructions from a surgeon by grounding them in the visual context of the operating field. The system employs a two-level affordance-based reasoning process that first analyzes the surgical scene using a multimodal vision-language model and then reasons about the instruction using a knowledge base of tool capabilities. To ensure patient safety, a dual-set conformal prediction method is used to provide a statistically rigorous confidence measure for robot decisions, allowing it to identify and flag ambiguous commands. We evaluated our framework on a curated dataset of ambiguous surgical requests from cholecystectomy videos, demonstrating a general disambiguation rate of 60% and presenting a method for safer human-robot interaction in the operating room.

* To be presented at the 1st Workshop on Intelligent Cobodied Assistance and Robotic Empowerment (iCARE), 2025 Conference on Robot Learning (CoRL)

Assessing the Value of Visual Input: A Benchmark of Multimodal Large Language Models for Robotic Path Planning

Jul 16, 2025

Jacinto Colan, Ana Davila, Yasuhisa Hasegawa

Abstract:Large Language Models (LLMs) show potential for enhancing robotic path planning. This paper assesses visual input's utility for multimodal LLMs in such tasks via a comprehensive benchmark. We evaluated 15 multimodal LLMs on generating valid and optimal paths in 2D grid environments, simulating simplified robotic planning, comparing text-only versus text-plus-visual inputs across varying model sizes and grid complexities. Our results indicate moderate success rates on simpler small grids, where visual input or few-shot text prompting offered some benefits. However, performance significantly degraded on larger grids, highlighting a scalability challenge. While larger models generally achieved higher average success, the visual modality was not universally dominant over well-structured text for these multimodal systems, and successful paths on simpler grids were generally of high quality. These results indicate current limitations in robust spatial reasoning, constraint adherence, and scalable multimodal integration, identifying areas for future LLM development in robotic path planning.

* Accepted at the 2025 SICE Festival with Annual Conference (SICE FES)
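
The abstract above scores model outputs on whether a path is valid and whether it is optimal. A minimal sketch of how such a check could look is given below. The grid encoding (0 for free cells, 1 for obstacles), the 4-connected moves, and the function names are assumptions made for illustration and are not taken from the benchmark itself.

```python
from collections import deque

Cell = tuple[int, int]  # (row, column); hypothetical representation


def is_valid_path(grid: list[list[int]], path: list[Cell], start: Cell, goal: Cell) -> bool:
    """Check that a path starts and ends correctly, stays on free cells,
    and only uses single-step moves between 4-connected neighbours."""
    if not path or path[0] != start or path[-1] != goal:
        return False
    rows, cols = len(grid), len(grid[0])
    for r, c in path:
        # Reject cells outside the grid or on an obstacle (assumed encoding: 1).
        if not (0 <= r < rows and 0 <= c < cols) or grid[r][c] == 1:
            return False
    # Consecutive cells must differ by exactly one step (Manhattan distance 1).
    return all(
        abs(r1 - r2) + abs(c1 - c2) == 1
        for (r1, c1), (r2, c2) in zip(path, path[1:])
    )


def shortest_path_length(grid: list[list[int]], start: Cell, goal: Cell):
    """Breadth-first search; returns the minimum number of moves, or None
    if the goal cannot be reached."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (
                0 <= nr < rows
                and 0 <= nc < cols
                and grid[nr][nc] == 0
                and (nr, nc) not in seen
            ):
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None
```

Under these assumptions, a generated path would count as optimal when it is valid and `len(path) - 1` equals the value returned by `shortest_path_length`.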

Beyond Single Models: Enhancing LLM Detection of Ambiguity in Requests through Debate

Jul 16, 2025

Ana Davila, Jacinto Colan, Yasuhisa Hasegawa

Abstract:Large Language Models (LLMs) have demonstrated significant capabilities in understanding and generating human language, contributing to more natural interactions with complex systems. However, they still struggle with ambiguity in user requests. To address this challenge, this paper introduces and evaluates a multi-agent debate framework designed to enhance ambiguity detection and resolution capabilities beyond those of single models. The framework consists of three LLM architectures (Llama3-8B, Gemma2-9B, and Mistral-7B variants) and a dataset with diverse ambiguities. The debate framework markedly enhanced the performance of Llama3-8B and Mistral-7B variants over their individual baselines, with Mistral-7B-led debates achieving a notable 76.7% success rate and proving particularly effective for complex ambiguities and efficient consensus. While acknowledging varying model responses to collaborative strategies, these findings underscore the debate framework's value as a targeted method for augmenting LLM capabilities. This work offers important insights for developing more robust and adaptive language understanding systems by showing how structured debates can lead to improved clarity in interactive systems.

* Accepted at the 2025 SICE Festival with Annual Conference (SICE FES)

Human-Robot collaboration in surgery: Advances and challenges towards autonomous surgical assistants

Jul 15, 2025

Jacinto Colan, Ana Davila, Yutaro Yamada, Yasuhisa Hasegawa

Abstract:Human-robot collaboration in surgery represents a significant area of research, driven by the increasing capability of autonomous robotic systems to assist surgeons in complex procedures. This systematic review examines the advancements and persistent challenges in the development of autonomous surgical robotic assistants (ASARs), focusing specifically on scenarios where robots provide meaningful and active support to human surgeons. Adhering to the PRISMA guidelines, a comprehensive literature search was conducted across the IEEE Xplore, Scopus, and Web of Science databases, resulting in the selection of 32 studies for detailed analysis. Two primary collaborative setups were identified: teleoperation-based assistance and direct hands-on interaction. The findings reveal a growing research emphasis on ASARs, with predominant applications currently in endoscope guidance, alongside emerging progress in autonomous tool manipulation. Several key challenges hinder wider adoption, including the alignment of robotic actions with human surgeon preferences, the necessity for procedural awareness within autonomous systems, the establishment of seamless human-robot information exchange, and the complexities of skill acquisition in shared workspaces. This review synthesizes current trends, identifies critical limitations, and outlines future research directions essential to improve the reliability, safety, and effectiveness of human-robot collaboration in surgical environments.

* Accepted at 2025 IEEE International Conference on Robot and Human Interactive Communication (ROMAN)

LLM-based ambiguity detection in natural language instructions for collaborative surgical robots

Jul 15, 2025

Ana Davila, Jacinto Colan, Yasuhisa Hasegawa

Abstract:Ambiguity in natural language instructions poses significant risks in safety-critical human-robot interaction, particularly in domains such as surgery. To address this, we propose a framework that uses Large Language Models (LLMs) for ambiguity detection specifically designed for collaborative surgical scenarios. Our method employs an ensemble of LLM evaluators, each configured with distinct prompting techniques to identify linguistic, contextual, procedural, and critical ambiguities. A chain-of-thought evaluator is included to systematically analyze instruction structure for potential issues. Individual evaluator assessments are synthesized through conformal prediction, which yields non-conformity scores based on comparison to a labeled calibration dataset. Evaluating Llama 3.2 11B and Gemma 3 12B, we observed classification accuracy exceeding 60% in differentiating ambiguous from unambiguous surgical instructions. Our approach improves the safety and reliability of human-robot collaboration in surgery by offering a mechanism to identify potentially ambiguous instructions before robot action.

* Accepted at 2025 IEEE International Conference on Robot and Human Interactive Communication (ROMAN)
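
The abstract above mentions non-conformity scores computed against a labeled calibration dataset. The sketch below shows generic split conformal prediction for a two-class ambiguity label, as one way to picture that step. It is not the authors' implementation: the score definition (one minus the probability assigned to a label), the label names, and the function names are assumptions for illustration.

```python
import math


def conformal_threshold(cal_scores: list[float], alpha: float) -> float:
    """Return the split-conformal quantile of the calibration scores.

    cal_scores holds one non-conformity score per calibration example,
    here assumed to be 1 - p(true label). alpha is the target error rate.
    """
    n = len(cal_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: every label is kept.
        return float("inf")
    return sorted(cal_scores)[k - 1]


def prediction_set(probs: dict[str, float], q_hat: float) -> set[str]:
    """Keep every label whose non-conformity score does not exceed q_hat."""
    return {label for label, p in probs.items() if 1.0 - p <= q_hat}


def needs_clarification(probs: dict[str, float], q_hat: float) -> bool:
    """Flag an instruction when the prediction set is not a single label."""
    return len(prediction_set(probs, q_hat)) != 1
```

With this kind of rule, an instruction whose prediction set contains both labels (or none) would be routed back to the surgeon for clarification instead of being executed.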

A hierarchical framework for collision avoidance in robot-assisted minimally invasive surgery

Sep 16, 2024

Jacinto Colan, Ana Davila, Khusniddin Fozilov, Yasuhisa Hasegawa

Abstract:Minimally invasive surgery (MIS) procedures benefit significantly from robotic systems due to their improved precision and dexterity. However, ensuring safety in these dynamic and cluttered environments is an ongoing challenge. This paper proposes a novel hierarchical framework for collision avoidance in MIS. This framework integrates multiple tasks, including maintaining the Remote Center of Motion (RCM) constraint, tracking desired tool poses, avoiding collisions, optimizing manipulability, and adhering to joint limits. The proposed approach utilizes Hierarchical Quadratic Programming (HQP) to seamlessly manage these constraints while enabling smooth transitions between task priorities for collision avoidance. Experimental validation through simulated scenarios demonstrates the framework's robustness and effectiveness in handling diverse scenarios involving static and dynamic obstacles, as well as inter-tool collisions.

* Accepted at 2024 IEEE International Conference on Cyborg and Bionic Systems (CBS2024)

Voice control interface for surgical robot assistants

Sep 16, 2024

Ana Davila, Jacinto Colan, Yasuhisa Hasegawa

Abstract:Traditional control interfaces for robotic-assisted minimally invasive surgery impose a significant cognitive load on surgeons. To improve surgical efficiency and surgeon-robot collaboration capabilities, and to reduce surgeon burden, we present a novel voice control interface for surgical robotic assistants. Our system integrates Whisper, a state-of-the-art speech recognition model, within the ROS framework to enable real-time interpretation and execution of voice commands for surgical manipulator control. The proposed system consists of a speech recognition module, an action mapping module, and a robot control module. Experimental results demonstrate the system's high accuracy and inference speed, as well as its feasibility for surgical applications in a tissue triangulation task. Future work will focus on further improving its robustness and clinical applicability.

* Accepted at 2024 IEEE International Symposium on Micro-NanoMechatronics and Human Science
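
The abstract above describes an action mapping module that sits between speech recognition and robot control. A minimal sketch of such a mapping step follows, assuming the transcript has already been produced by the speech recognizer. The command vocabulary, the `RobotAction` type, and the function name are hypothetical and chosen only for illustration; the paper's actual command set and ROS interfaces are not shown here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RobotAction:
    """Hypothetical action record handed to a robot control module."""
    name: str
    direction: tuple[float, float, float]  # unit step along x, y, z


# Assumed command vocabulary; not taken from the paper.
COMMANDS = {
    "move left": RobotAction("translate", (-1.0, 0.0, 0.0)),
    "move right": RobotAction("translate", (1.0, 0.0, 0.0)),
    "move up": RobotAction("translate", (0.0, 0.0, 1.0)),
    "move down": RobotAction("translate", (0.0, 0.0, -1.0)),
    "stop": RobotAction("stop", (0.0, 0.0, 0.0)),
}


def map_transcript(transcript: str) -> Optional[RobotAction]:
    """Normalize a transcript and look it up in the command table.

    Returns None for unrecognized commands so the caller can ignore
    them rather than guess at the surgeon's intent.
    """
    # Lowercase, drop punctuation, and collapse repeated whitespace.
    cleaned = "".join(
        ch for ch in transcript.lower() if ch.isalnum() or ch.isspace()
    )
    return COMMANDS.get(" ".join(cleaned.split()))
```

Returning `None` for unknown phrases is a deliberate choice in this sketch: in a surgical setting, doing nothing on an unrecognized command is a safer default than picking the nearest match.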

Embedded Image-to-Image Translation for Efficient Sim-to-Real Transfer in Learning-based Robot-Assisted Soft Manipulation

Sep 16, 2024

Jacinto Colan, Keisuke Sugita, Ana Davila, Yutaro Yamada, Yasuhisa Hasegawa

Abstract:Recent advances in robotic learning in simulation have shown impressive results in accelerating the learning of complex manipulation skills. However, the sim-to-real gap, caused by discrepancies between simulation and reality, poses significant challenges for the effective deployment of autonomous surgical systems. We propose a novel approach utilizing image translation models to mitigate domain mismatches and facilitate efficient robot skill learning in a simulated environment. Our method involves the use of contrastive unpaired image-to-image translation, allowing for the acquisition of embedded representations from these transformed images. Subsequently, these embeddings are used to improve the efficiency of training surgical manipulation models. We conducted experiments to evaluate the performance of our approach, demonstrating that it significantly enhances task success rates and reduces the steps required for task completion compared to traditional methods. The results indicate that our proposed system effectively bridges the sim-to-real gap, providing a robust framework for advancing the autonomy of surgical robots in minimally invasive procedures.

* Accepted at 2024 IEEE International Symposium on Micro-NanoMechatronics and Human Science

Comparison of fine-tuning strategies for transfer learning in medical image classification

Jun 14, 2024

Ana Davila, Jacinto Colan, Yasuhisa Hasegawa

Abstract:In the context of medical imaging and machine learning, one of the most pressing challenges is the effective adaptation of pre-trained models to specialized medical contexts. Despite the availability of advanced pre-trained models, their direct application to the highly specialized and diverse field of medical imaging often falls short due to the unique characteristics of medical data. This study provides a comprehensive analysis of the performance of various fine-tuning methods applied to pre-trained models across a spectrum of medical imaging domains, including X-ray, MRI, Histology, Dermoscopy, and Endoscopic surgery. We evaluated eight fine-tuning strategies, including standard techniques such as fine-tuning all layers or fine-tuning only the classifier layers, alongside methods such as gradually unfreezing layers, regularization-based fine-tuning, and adaptive learning rates. We selected three well-established CNN architectures (ResNet-50, DenseNet-121, and VGG-19) to cover a range of learning and feature extraction scenarios. Although our results indicate that the efficacy of these fine-tuning methods significantly varies depending on both the architecture and the medical imaging type, strategies such as combining Linear Probing with Full Fine-tuning resulted in notable improvements in over 50% of the evaluated cases, demonstrating general effectiveness across medical domains. Moreover, Auto-RGN, which dynamically adjusts learning rates, led to performance enhancements of up to 11% for specific modalities. Additionally, the DenseNet architecture showed more pronounced benefits from alternative fine-tuning approaches compared to traditional full fine-tuning. This work not only provides valuable insights for optimizing pre-trained models in medical image analysis but also suggests the potential for future research into more advanced architectures and fine-tuning methods.

* Image and Vision Computing 146 (2024): 105012

Task segmentation based on transition state clustering for surgical robot assistance

Jun 14, 2024

Yutaro Yamada, Jacinto Colan, Ana Davila, Yasuhisa Hasegawa

Abstract:Understanding surgical tasks represents an important challenge for autonomy in surgical robotic systems. To achieve this, we propose an online task segmentation framework that uses hierarchical transition state clustering to activate predefined robot assistance. Our approach involves performing a first clustering on visual features and a subsequent clustering on robot kinematic features for each visual cluster. This makes it possible to capture relevant task transition information in each modality independently. The approach is implemented for a pick-and-place task commonly found in surgical training. The validation of the transition segmentation showed high accuracy and fast computation time. We have integrated the transition recognition module with predefined robot-assisted tool positioning. The complete framework has shown benefits in reducing task completion time and cognitive workload.

* 2023 International Conference on Control and Robotics Engineering (ICCRE), pp.260-264
