Karla Stepanova

See and Switch: Vision-Based Branching for Interactive Robot-Skill Programming

Mar 09, 2026

Petr Vanc, Jan Kristof Behrens, Václav Hlaváč, Karla Stepanova

Abstract: Programming robots by demonstration (PbD) is an intuitive concept, but scaling it to real-world variability remains a challenge for most current teaching frameworks. Conditional task graphs are very expressive and can be defined incrementally, which fits very well with the PbD idea. However, acting using conditional task graphs requires reliable perception-grounded online branch selection. In this paper, we present See & Switch, an interactive teaching-and-execution framework that represents tasks as user-extendable graphs of skill parts connected via decision states (DS), enabling conditional branching during replay. Unlike prior approaches that rely on manual branching or low-dimensional signals (e.g., proprioception), our vision-based Switcher uses eye-in-hand images (high-dimensional) to select among competing successor skill parts and to detect out-of-distribution contexts that require new demonstrations. We integrate kinesthetic teaching, joystick control, and hand gestures via an input-modality-abstraction layer and demonstrate that our proposed method is teaching modality-independent, enabling efficient in-situ recovery demonstrations. The system is validated in experiments on three challenging dexterous manipulation tasks. We evaluate our method under diverse conditions and furthermore conduct user studies with 8 participants. We show that the proposed method reliably performs branch selection and anomaly detection for novice users, achieving 90.7% and 87.9% accuracy, respectively, across 576 real-robot rollouts. We provide all code and data required to reproduce our experiments at http://imitrob.ciirc.cvut.cz/publications/seeandswitch.

* 8 pages, 11 figures

TransforMerger: Transformer-based Voice-Gesture Fusion for Robust Human-Robot Communication

Apr 02, 2025

Petr Vanc, Karla Stepanova

Abstract: As human-robot collaboration advances, natural and flexible communication methods are essential for effective robot control. Traditional methods relying on a single modality or rigid rules struggle with noisy or misaligned data as well as with object descriptions that do not perfectly fit the predefined object names (e.g., 'Pick that red object'). We introduce TransforMerger, a transformer-based reasoning model that infers a structured action command for robotic manipulation based on fused voice and gesture inputs. Our approach merges multimodal data into a single unified sentence, which is then processed by the language model. We employ probabilistic embeddings to handle uncertainty and we integrate contextual scene understanding to resolve ambiguous references (e.g., gestures pointing to multiple objects or vague verbal cues like "this"). We evaluate TransforMerger in simulated and real-world experiments, demonstrating its robustness to noise, misalignment, and missing information. Our results show that TransforMerger outperforms deterministic baselines, especially in scenarios requiring more contextual knowledge, enabling more robust and flexible human-robot communication. Code and datasets are available at: http://imitrob.ciirc.cvut.cz/publications/transformerger.

* 8 pages, 7 figures

MuBlE: MuJoCo and Blender simulation Environment and Benchmark for Task Planning in Robot Manipulation

Mar 04, 2025

Michal Nazarczuk, Karla Stepanova, Jan Kristof Behrens, Matej Hoffmann, Krystian Mikolajczyk

Abstract: Current embodied reasoning agents struggle to plan for long-horizon tasks that require physically interacting with the world to obtain the necessary information (e.g., 'sort the objects from lightest to heaviest'). The improvement of the capabilities of such an agent is highly dependent on the availability of relevant training environments. In order to facilitate the development of such systems, we introduce a novel simulation environment (built on top of robosuite) that makes use of the MuJoCo physics engine and the high-quality renderer Blender to provide realistic visual observations that are also accurate to the physical state of the scene. It is the first simulator focusing on long-horizon robot manipulation tasks while preserving accurate physics modeling. MuBlE can generate multimodal data for training and enables the design of closed-loop methods through environment interaction on two levels: a visual-action loop and a control-physics loop. Together with the simulator, we propose SHOP-VRB2, a new benchmark composed of 10 classes of multi-step reasoning scenarios that require simultaneous visual and physical measurements.

* https://github.com/michaal94/MuBlE. arXiv admin note: substantial text overlap with arXiv:2404.15194

ILeSiA: Interactive Learning of Situational Awareness from Camera Input

Sep 30, 2024

Petr Vanc, Giovanni Franzese, Jan Kristof Behrens, Cosimo Della Santina, Karla Stepanova, Jens Kober

Abstract: Learning from demonstration is a promising way of teaching robots new skills. However, a central problem when executing acquired skills is to recognize risks and failures. This is essential since the demonstrations usually cover only a few mostly successful cases. Inevitable errors during execution require specific reactions that were not apparent in the demonstrations. In this paper, we focus on teaching the robot situational awareness from an initial skill demonstration via kinesthetic teaching and sparse labeling of autonomous skill executions as safe or risky. At runtime, our system, called ILeSiA, detects risks based on the perceived camera images by encoding the images into a low-dimensional latent space representation and training a classifier based on the encoding and the provided labels. In this way, ILeSiA boosts the confidence and safety with which robotic skills can be executed. Our experiments demonstrate that classifiers, trained with only a small amount of user-provided data, can successfully detect numerous risks. The system is flexible because the risk cases are defined by labeling data. This also means that labels can be added as soon as risks are identified by a human supervisor. We provide all code and data required to reproduce our experiments at imitrob.ciirc.cvut.cz/publications/ilesia.

* 7 pages, 8 figures
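
The encode-then-classify idea in the ILeSiA abstract can be illustrated with a small sketch. This is not the authors' implementation: PCA is assumed here as a stand-in for the image encoder, a nearest-centroid rule as a stand-in for the classifier, and all function names (`fit_pca`, `encode`, `fit_centroids`, `predict`) are hypothetical.

```python
import numpy as np


def fit_pca(images, latent_dim):
    """Fit a linear encoder (PCA) on flattened images. ASSUMPTION: a stand-in encoder."""
    X = images.reshape(len(images), -1).astype(float)
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:latent_dim]


def encode(images, mean, components):
    """Project flattened images into the low-dimensional latent space."""
    X = images.reshape(len(images), -1).astype(float)
    return (X - mean) @ components.T


def fit_centroids(latents, labels):
    """Store one latent centroid per label (e.g. 0 = safe, 1 = risky)."""
    return {c: latents[labels == c].mean(axis=0) for c in np.unique(labels)}


def predict(latents, centroids):
    """Assign each latent vector the label of its nearest centroid."""
    classes = list(centroids)
    dists = np.stack(
        [np.linalg.norm(latents - centroids[c], axis=1) for c in classes], axis=1
    )
    return np.array(classes)[dists.argmin(axis=1)]
```

In use, a handful of frames labeled safe or risky would be passed through `fit_pca` and `fit_centroids` once, and each new camera frame through `encode` and `predict` at runtime. Adding a newly identified risk then amounts to refitting the centroids with the extra labels.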

Closed Loop Interactive Embodied Reasoning for Robot Manipulation

Apr 23, 2024

Michal Nazarczuk, Jan Kristof Behrens, Karla Stepanova, Matej Hoffmann, Krystian Mikolajczyk

Abstract: Embodied reasoning systems integrate robotic hardware and cognitive processes to perform complex tasks, typically in response to a natural language query about a specific physical environment. This usually involves changing the belief about the scene or physically interacting with and changing the scene (e.g., 'Sort the objects from lightest to heaviest'). In order to facilitate the development of such systems, we introduce a new simulation environment that makes use of the MuJoCo physics engine and the high-quality renderer Blender to provide realistic visual observations that are also accurate to the physical state of the scene. Together with the simulator, we propose a new benchmark composed of 10 classes of multi-step reasoning scenarios that require simultaneous visual and physical measurements. Finally, we develop a new modular Closed Loop Interactive Reasoning (CLIER) approach that takes into account the measurements of non-visual object properties, changes in the scene caused by external disturbances, as well as uncertain outcomes of robotic actions. We extensively evaluate our reasoning approach in simulation and in real-world manipulation tasks, with success rates above 76% and 64%, respectively.

Tell and show: Combining multiple modalities to communicate manipulation tasks to a robot

Apr 02, 2024

Petr Vanc, Radoslav Skoviera, Karla Stepanova

Abstract: As human-robot collaboration is becoming more widespread, there is a need for a more natural way of communicating with the robot. This includes combining data from several modalities together with the context of the situation and background knowledge. Current approaches to communication typically rely only on a single modality or are often very rigid and not robust to missing, misaligned, or noisy data. In this paper, we propose a novel method that takes inspiration from sensor fusion approaches to combine uncertain information from multiple modalities and enhance it with situational awareness (e.g., considering object properties or the scene setup). We first evaluate the proposed solution on simulated bimodal datasets (gestures and language) and show by several ablation experiments the importance of various components of the system and its robustness to noisy, missing, or misaligned observations. Then we implement and evaluate the model on the real setup. In human-robot interaction, we must also consider whether the selected action is probable enough to be executed or whether we should instead query the human for clarification. For these purposes, we enhance our model with adaptive entropy-based thresholding that detects the appropriate thresholds for different types of interaction, showing similar performance to fine-tuned fixed thresholds.

* 8 pages, 8 figures
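
The execute-or-ask decision mentioned in this abstract can be sketched with Shannon entropy over the candidate-action probabilities. This is a minimal illustration, not the paper's method: it uses a fixed threshold rather than the adaptive one the authors describe, and the names `normalized_entropy` and `decide` and the default `threshold=0.5` are assumptions.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a distribution, scaled to [0, 1] by log(number of options)."""
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    p = [x / total for x in probs]
    h = -sum(x * math.log(x) for x in p if x > 0)
    return h / math.log(len(p)) if len(p) > 1 else 0.0


def decide(probs, threshold=0.5):
    """Return 'execute' when the distribution is peaked, otherwise 'ask'.

    ASSUMPTION: a fixed threshold; the abstract describes an adaptive scheme.
    """
    return "execute" if normalized_entropy(probs) < threshold else "ask"
```

A peaked distribution such as `[0.97, 0.01, 0.01, 0.01]` has low entropy and leads to execution, while a uniform one over four actions has normalized entropy 1.0 and leads to a clarification request.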

Bridging Language, Vision and Action: Multimodal VAEs in Robotic Manipulation Tasks

Apr 02, 2024

Gabriela Sejnova, Michal Vavrecka, Karla Stepanova

Abstract: In this work, we focus on unsupervised vision-language-action mapping in the area of robotic manipulation. Recently, multiple approaches employing pre-trained large language and vision models have been proposed for this task. However, they are computationally demanding and require careful fine-tuning of the produced outputs. A more lightweight alternative would be the implementation of multimodal Variational Autoencoders (VAEs), which can extract the latent features of the data and integrate them into a joint representation, as has been demonstrated mostly on image-image or image-text data for the state-of-the-art models. Here we explore whether and how multimodal VAEs can be employed in unsupervised robotic manipulation tasks in a simulated environment. Based on the obtained results, we propose a model-invariant training alternative that improves the models' performance in a simulator by up to 55%. Moreover, we systematically evaluate the challenges raised by the individual tasks, such as object or robot position variability, number of distractors, or the task length. Our work thus also sheds light on the potential benefits and limitations of using the current multimodal VAEs for unsupervised learning of robotic motion trajectories based on vision and language.

* 7 pages, 5 figures, 2 tables, conference

Adaptive Compression of the Latent Space in Variational Autoencoders

Dec 11, 2023

Gabriela Sejnova, Michal Vavrecka, Karla Stepanova

Abstract: Variational Autoencoders (VAEs) are powerful generative models that have been widely used in various fields, including image and text generation. However, one of the known challenges in using VAEs is the model's sensitivity to its hyperparameters, such as the latent space size. This paper presents a simple extension of VAEs for automatically determining the optimal latent space size during the training process by gradually decreasing the latent size through neuron removal and observing the model performance. The proposed method is compared to traditional hyperparameter grid search and is shown to be significantly faster while still finding the optimal dimensionality on four image datasets. Furthermore, we show that the final performance of our method is comparable to training on the optimal latent size from scratch, and might thus serve as a convenient substitute.

* 10 pages, 4 figures
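
The "shrink the latent space and watch the performance" loop described in this abstract can be sketched as a simple control loop. This is only an outline under stated assumptions: `evaluate` is a hypothetical callback that returns a validation loss for a given latent size, and the stopping rule (a relative `tolerance` over the starting loss) is an illustrative choice, not the criterion used in the paper.

```python
def shrink_latent(evaluate, start_dim, min_dim=1, tolerance=0.05):
    """Reduce the latent size one unit at a time until the loss degrades.

    evaluate: hypothetical callable mapping a latent size to a validation loss.
    tolerance: allowed relative loss increase over the starting size (assumption).
    Returns the smallest latent size whose loss stayed within the tolerance.
    """
    baseline = evaluate(start_dim)
    best_dim = start_dim
    for dim in range(start_dim - 1, min_dim - 1, -1):
        loss = evaluate(dim)
        if loss > baseline * (1 + tolerance):
            break  # removing this unit hurt performance; keep the previous size
        best_dim = dim
    return best_dim
```

In a real training setup, `evaluate` would prune one latent neuron from the running model, continue training briefly, and report the loss, which is what makes a single pass cheaper than retraining from scratch for every grid-search candidate.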

Communicating human intent to a robotic companion by multi-type gesture sentences

Mar 08, 2023

Petr Vanc, Jan Kristof Behrens, Karla Stepanova, Vaclav Hlavac

Abstract: Human-robot collaboration in home and industrial workspaces is on the rise. However, the communication between robots and humans is a bottleneck. Although people use a combination of different types of gestures to complement speech, only a few robotic systems utilize gestures for communication. In this paper, we propose a gesture pseudo-language and show how multiple types of gestures can be combined to express human intent to a robot (i.e., expressing both the desired action and its parameters, e.g., pointing to an object and showing that the object should be emptied into a bowl). The demonstrated gestures and the perceived table-top scene (object poses detected by CosyPose) are processed in real time to extract the human's intent. We utilize behavior trees to generate reactive robot behavior that handles various possible states of the world (e.g., a drawer has to be opened before an object is placed into it) and recovers from errors (e.g., when the scene changes). Furthermore, our system enables switching between direct teleoperation of the end-effector and high-level operation using the proposed gesture sentences. The system is evaluated on increasingly complex tasks using a real 7-DoF Franka Emika Panda manipulator. Controlling the robot via action gestures lowered the execution time by up to 60% compared to direct teleoperation.

* 7 pages, 9 figures

Context-aware robot control using gesture episodes

Jan 24, 2023

Petr Vanc, Jan Kristof Behrens, Karla Stepanova

Abstract: Collaborative robots have become a popular tool for increasing productivity in partly automated manufacturing plants. Intuitive robot teaching methods are required to quickly and flexibly adapt the robot programs to new tasks. Gestures have an essential role in human communication. However, in human-robot interaction scenarios, gesture-based user interfaces are so far rarely used, and when they are, they employ a one-to-one mapping of gestures to robot control variables. In this paper, we propose a method that infers the user's intent based on gesture episodes, the context of the situation, and common sense. The approach is evaluated in a simulated table-top manipulation setting. We conduct deterministic experiments with simulated users and show that the system can even handle the personal preferences of each user.

* 7 pages, 8 figures, accepted for ICRA 2023
