Engineering Mathematics, University of Bristol, affiliated with the Bristol Robotics Lab, United Kingdom
Abstract:Vision-based tactile sensors (VBTSs) enable robots to infer contact geometry and force-related cues by imaging deformation through an internal camera, yet generalisation across sensor designs remains poorly understood. We present TacVerse, a multi-sensor dataset and benchmark for cross-sensor vision-based tactile perception. The dataset contains 106,800 tactile images from seven VBTSs and supports three downstream tasks: shape classification, grating classification, and force regression. Experiments are conducted under three settings: within-sensor training, zero-shot cross-sensor transfer, and few-shot adaptation. Strong within-sensor performance across all tasks indicates that the collected tactile observations are informative for the target objectives. Direct cross-sensor transfer, however, leads to substantial degradation. Shape classification is comparatively robust, whereas grating classification and force regression are more sensitive to sensor shift. Few-shot adaptation for force regression consistently improves performance on unseen target sensors but does not fully close the gap to within-sensor upper bounds. A representation study further shows that MAE (Masked Autoencoder) pretraining provides the most consistent gains across tasks and sensors. TacVerse provides a controlled testbed for studying sensor shift, data-efficient adaptation, and self-supervised learning in tactile perception.
Abstract:Many real-world manipulation scenarios, such as handling complex collaborative tasks and dealing with large workspaces, require coordination of more than two robotic arms. Consequently, an effective multi-arm teleoperation system is required to collect demonstrations for training coordinated multi-arm manipulation policies. However, existing teleoperation frameworks mainly focus on single-operator or multi-operator setups, facing a practical trade-off between the cognitive load placed on a single operator and the coordination cost incurred by multiple operators. To address this problem, we introduce HATS, a human-agent teleoperation system that enables a single human operator, assisted by an MLLM-based agent, to collect data for multi-arm manipulation tasks. Our system decouples the control space: two primary arms are directly teleoperated by the human, while two assistive arms are controlled by a training-free agent that handles sub-tasks. In addition, the human operator can use voice commands to prevent collisions and correct assistive arm behaviors during execution. Extensive evaluations demonstrate that HATS achieves data collection efficiency and success rates comparable to expert dual-human teams. Moreover, downstream policy evaluations demonstrate the efficacy and quality of the data collected through HATS.
Abstract:3D visuomotor policies offer a promising direction for complex robotic manipulation, as depth maps and point clouds provide rich geometric information for spatial reasoning. However, their success often depends on large-scale real-world demonstrations, which are costly and time-consuming to collect. To this end, existing methods commonly use demonstration generation strategies to improve data efficiency by applying object-centric transformations to human-collected demonstrations, such as varying object poses or scales. While effective for local variation, these transformations largely preserve the original scene structure and skill sequence, limiting their ability to synthesize diverse scene-skill-object combinations for complex tasks. In this paper, we propose Task-Edit, a novel demonstration generation framework that generates diverse trajectories from a task-centric editing perspective. The key insight of Task-Edit is to decompose a task into scene, skill and object components, and flexibly recombine them. In this way, Task-Edit enables scalable demonstration generation and significantly improves generalization for long-horizon manipulation tasks. We evaluate Task-Edit through extensive real-world experiments and demonstrate three advantages: (1) Effectiveness: Task-Edit significantly improves 3D visuomotor policies across various real-world tasks and robot embodiments. (2) Generalizability: Task-Edit improves model generalization across different scenario setups. (3) Applicability: Task-Edit enables models to handle scenarios that are difficult to collect in the real world, including disturbance resistance, obstacle avoidance and unseen cluttered scenes.
Abstract:We propose an optimization-based framework for robust contact-rich manipulation. Recent contact-implicit methods enable online hybrid planning across contact modes, allowing closed-loop manipulation for a given target state and contact location sequence of the robot and object. However, most existing approaches lack the ability to autonomously reason and generate diverse contact location sequences and manipulation trajectories, i.e., active contact location selection, which limits their applicability to relatively simple tasks. Active contact location selection is challenging due to complementarity in contact dynamics and the sparse gradients, making the design of a unified framework for contact selection and planning difficult. To address these challenges, we introduce Simultaneous Contact Selection and Planning (SCSP), a cascaded optimization framework comprising Contact Selection Optimization (CSO) and Contact Planning Optimization (CPO). CSO leverages a surrogate contact model and discrete-continuous optimization to efficiently resolve the nonsmoothness and coupling in contact selection, enabling online global searching of optimal contact locations. CPO performs prior-guided contact planning by evaluating the reference contact locations produced by CSO and generating corresponding manipulation trajectories in real time for redundant manipulators. Extensive simulations and real-world experiments demonstrate that SCSP produces diverse manipulation behaviors and robust control under inaccurate dynamics and perceptual noise. We further validate the generalization of the framework on challenging manipulation tasks. Project website: \href{https://sites.google.com/view/scsp-robot}{https://sites.google.com/view/scsp-robot}.
Abstract:Optical tweezers (OT) provide piconewton-scale manipulation for delicate biomedical tasks, where visuo-haptic feedback can improve operator awareness by conveying interaction-force cues and trap-stability information. However, visuo-haptic teleoperation frameworks for complex-shaped optical microrobots remain underdeveloped, particularly in multi-trap manipulation scenarios. This paper presents a digital twin framework for virtual visuo-haptic teleoperation of complex-shaped OT-driven microrobots. The framework integrates a digital twin environment, image-based pose and depth estimation, microrobot motion simulation, and model-based haptic rendering within a Robot Operating System (ROS)-connected bimanual teleoperation system. For force modeling, we combine a Multi-Sphere Distributed Manipulation (MSDM) model with optical-force estimation from the Optical Tweezers Toolbox, enabling simulator-driven visuo-haptic feedback. The framework reproduces representative microrobot motion trends and provides haptic force rendering that is numerically consistent with the fitted optical-force model. In simulated cell-delivery tasks, haptic feedback reduced the standard deviations of the contact-force metric and the microrobot-to-trap-center distance metric by 53.2% and 55.2%, respectively, and improved task success from 30% to 80%. These results demonstrate the framework's effectiveness for evaluating visuo-haptic teleoperation strategies for complex-shaped optical microrobots.
Abstract:Magnetically actuated microrobots have been used as wireless, non-contact manipulation tools at microscales, making them promising for minimally invasive applications. However, their control remains challenging due to indirect actuation, limited sensing, and nonlinear magnetic interactions. In this work, we propose Mag-VLA, a vision-language-action (VLA) model for dexterous magnetic microrobot manipulation using two robotic arms with mounted magnets for dynamic magnetic-field construction. Bimanual coordination enables capabilities such as microrobot reorientation that are difficult or infeasible with a single arm, but it also introduces coupled control challenges, as the policy must generate coordinated trajectories for both actuators within a shared workspace. Our framework adapts a Qwen2.5-VL-7B backbone using Low-Rank Adaptation (LoRA) to process visual observations and language instructions for action prediction. To capture task progression, we introduce a motion-aware phase classifier and a phase-conditioned Action Chunking Transformer (ACT) decoder for temporally coherent multi-step control. We further construct a teleoperated magnetic microrobot manipulation dataset covering three task configurations. Ablation studies show that the ACT-based decoder substantially outperforms alternative generative action heads. In real-robot experiments, Mag-VLA achieves a 90% approach success rate across all tasks and transport success rates of 80%, 70%, and 50% as task difficulty increases. These results demonstrate that hierarchical VLA modeling provides a promising framework for magnetic microrobot manipulation.
Abstract:Microscale manipulation has advanced substantially in controlled locomotion and targeted transport, yet many biomedical applications require precise and adaptive interaction with biological micro-objects. At these scales, manipulation is realized through three main classes of platforms: embodied microrobots that physically interact as mobile agents, field-mediated systems that generate contactless trapping or manipulation forces, and externally actuated end-effectors that interact through remotely driven physical tools. Unlike macroscale manipulators, these systems function in fluidic, confined, and surface-dominated environments characterized by negligible inertia, dominant interfacial forces, and soft, heterogeneous, and fragile targets. Consequently, classical assumptions of dexterous manipulation, including rigid-body contact, stable grasping, and rich proprioceptive feedback, become difficult to maintain. This review introduces micro-dexterity as a framework for analyzing biological micromanipulation through the coupled roles of embodiment, perception, and control. We examine how classical manipulation primitives, including pushing, reorientation, grasping, and cooperative manipulation, are reformulated at the microscale; compare the architectures that enable them, from contact-based micromanipulators to contactless field-mediated systems and cooperative multi-agent platforms; and review the perception and control strategies required for task execution. We identify the current dexterity gap between laboratory demonstrations and clinically relevant biological manipulation, and outline key challenges for future translation.
Abstract:Optical microrobots actuated by optical tweezers (OT) are important for cell manipulation and microscale assembly, but their autonomous operation depends on accurate 3D perception. Developing such perception systems is challenging because large-scale, high-quality microscopy datasets are scarce, owing to complex fabrication processes and labor-intensive annotation. Although generative AI offers a promising route for data augmentation, existing generative adversarial network (GAN)-based methods struggle to reproduce key optical characteristics, particularly depth-dependent diffraction and defocus effects. To address this limitation, we propose Du-FreqNet, a dual-control, frequency-aware diffusion model for physically consistent microscopy image synthesis. The framework features two independent ControlNet branches to encode microrobot 3D point clouds and depth-specific mesh layers, respectively. We introduce an adaptive frequency-domain loss that dynamically reweights high- and low-frequency components based on the distance to the focal plane. By leveraging differentiable FFT-based supervision, Du-FreqNet captures physically meaningful frequency distributions often missed by pixel-space methods. Trained on a limited dataset (e.g., 80 images per pose), our model achieves controllable, depth-dependent image synthesis, improving SSIM by 20.7% over baselines. Extensive experiments demonstrate that Du-FreqNet generalizes effectively to unseen poses and significantly enhances downstream tasks, including 3D pose and depth estimation, thereby facilitating robust closed-loop control in microrobotic systems.
Abstract:Multi-agent robotic manipulation remains challenging due to the combined demands of coordination, grasp stability, and collision avoidance in shared workspaces. To address these challenges, we propose the Adaptive Dynamic Modality Diffusion Policy (ADM-DP), a framework that integrates vision, tactile, and graph-based (multi-agent pose) modalities for coordinated control. ADM-DP introduces four key innovations. First, an enhanced visual encoder merges RGB and point-cloud features via Feature-wise Linear Modulation (FiLM) modulation to enrich perception. Second, a tactile-guided grasping strategy uses Force-Sensitive Resistor (FSR) feedback to detect insufficient contact and trigger corrective grasp refinement, improving grasp stability. Third, a graph-based collision encoder leverages shared tool center point (TCP) positions of multiple agents as structured kinematic context to maintain spatial awareness and reduce inter-agent interference. Fourth, an Adaptive Modality Attention Mechanism (AMAM) dynamically re-weights modalities according to task context, enabling flexible fusion. For scalability and modularity, a decoupled training paradigm is employed in which agents learn independent policies while sharing spatial information. This maintains low interdependence between agents while retaining collective awareness. Across seven multi-agent tasks, ADM-DP achieves 12-25% performance gains over state-of-the-art baselines. Ablation studies show the greatest improvements in tasks requiring multiple sensory modalities, validating our adaptive fusion strategy and demonstrating its robustness for diverse manipulation scenarios.
Abstract:Modular small-scale robots offer the potential for on-demand assembly and disassembly, enabling task-specific adaptation in dynamic and constrained environments. However, existing modular magnetic platforms often depend on workspace collisions for reconfiguration, employ bulky three-dimensional electromagnetic systems, and lack robust single-module control, which limits their applicability in biomedical settings. In this work, we present a modular magnetic millirobotic platform comprising three cube-shaped modules with embedded permanent magnets, each designed for a distinct functional role: a free module that supports self-assembly and reconfiguration, a fixed module that enables flip-and-walk locomotion, and a gripper module for cargo manipulation. Locomotion and reconfiguration are actuated by programmable combinations of time-varying two-dimensional uniform and gradient magnetic field inputs. Experiments demonstrate closed-loop navigation using real-time vision feedback and A* path planning, establishing robust single-module control capabilities. Beyond locomotion, the system achieves self-assembly, multimodal transformations, and disassembly at low field strengths. Chain-to-gripper transformations succeeded in 90% of trials, while chain-to-square transformations were less consistent, underscoring the role of module geometry in reconfiguration reliability. These results establish a versatile modular robotic platform capable of multimodal behavior and robust control, suggesting a promising pathway toward scalable and adaptive task execution in confined environments.