Abstract:To enable reliable climbing locomotion of quadruped robots on ferromagnetic surfaces, this paper presents a high-load-density electro-permanent magnetic foot with controllable adhesion, featuring force-feedback circular Halbach-net electro-permanent magnet (CHN-EPM) adhesion units and a magnetization control system. Due to its three-dimensional magnetic circuit structure and flux-concentration effect, the CHN-EPM enables a distributed parallel magnetic flux path with enhanced flux utilization, resulting in reduced sensitivity to air-gap variations and allowing effective adhesion to be maintained even under partial contact conditions. The proposed CHN-EPM generates a maximum adhesion force exceeding 1000 N with a load-to-weight ratio over 200:1. A magnetization driver and a two-stage pulse current control strategy are developed to regulate the excitation current amplitude and duration, enabling accurate and reliable magnetization. By incorporating a flexible pressure sensor for contact force feedback, the system can effectively monitor attachment and detachment states, ensuring robust adhesion switching under uncertain contact conditions. The proposed system is integrated into a commercial quadruped robot (Unitree GO2), demonstrating high-load adhesion on ceiling and vertical-wall surfaces and stable locomotion on painted, perforated, and curved ferromagnetic surfaces.
Abstract:Contact-rich manipulation tasks such as tight-clearance insertion, connector mating, polishing, and surface-conforming wiping remain difficult for data-driven controllers because they couple discontinuous contact dynamics, partial observability, and strict safety constraints. No single sensing modality suffices: vision supplies global context before contact, force/torque (F/T) feedback governs interaction after contact, and proprioceptive pose provides a consistent kinematic backbone. Most prior imitation-learning policies for contact-rich tasks operate on uni- or bi-modal signals, and the few that fuse three modalities typically adopt off-the-shelf attention modules with no explicit prior on how attention mass should be distributed across task-relevant regions. We present Spacetime Optimal-Transport Attention (SO-TA), a tri-modal fusion backbone that replaces softmax-normalized patch attention by an entropy-regularized Optimal Transport (OT) alignment between force-pose-derived sub-queries and visual patches. Explicit marginal constraints act as a structured inductive bias for contact-rich tasks, encouraging conditioning-aware spatial selection that is stable across illumination, distractors, and partial occlusion. SO-TA is paired with a diffusion-based sequence policy mapping observation windows to pose-action chunks. We evaluate SO-TA on three real-robot tasks: tight peg-in-hole assembly, BCM wiring-connector insertion, and curved-surface mark erasing. With ~200 rollouts per condition, SO-TA reaches 100% success on tight peg-in-hole versus 93% for cross-attention at matched capacity, and retains 82.5% success under illumination, distractor, and partial-occlusion perturbations where a concatenation baseline drops to 43.5%. OT-derived patch heatmaps and leave-one-out modality-influence ratios provide interpretable, phase-dependent diagnostics.
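As a rough illustration of the entropy-regularized OT alignment mentioned in this abstract, the following minimal sketch computes a transport plan between a few sub-queries and visual patches with Sinkhorn iterations. The function name `sinkhorn_plan`, the toy cost matrix, the uniform marginals, and the values of `eps` and `n_iters` are all assumptions made for illustration; the abstract does not specify the actual cost function or marginals used by SO-TA.

```python
import math

def sinkhorn_plan(cost, row_marginal, col_marginal, eps=0.5, n_iters=200):
    """Entropy-regularized OT plan via Sinkhorn scaling (illustrative sketch).

    cost[i][j]: hypothetical cost of assigning sub-query i to patch j.
    row_marginal / col_marginal: target mass per sub-query / per patch.
    """
    n, m = len(cost), len(cost[0])
    # Gibbs kernel derived from the cost matrix.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_marginal[i] / sum(K[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [col_marginal[j] / sum(K[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy example: 2 sub-queries, 3 patches, uniform marginals (assumed).
cost = [[0.0, 1.0, 1.0],
        [1.0, 0.0, 1.0]]
plan = sinkhorn_plan(cost, [0.5, 0.5], [1.0 / 3.0] * 3)
```

Unlike a row-wise softmax, the plan is constrained on both sides: each sub-query distributes a fixed amount of mass and each patch receives a fixed amount, which is one way to read the "explicit marginal constraints" described above.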
Abstract:Existing bilateral teleoperation platforms still rely on costly rigid six-axis force/torque sensors, tightly coupled leader-follower hardware, and kilohertz control loops. We present a Cartesian bilateral framework built on the hardware-agnostic WinGs Operating Studio (WOS) middleware, in which a low-cost compliant 6-DOF pose-and-force sensing end-effector, Delta6, is mounted on both sides so that each manipulator behaves as an end-effector 6-DOF series elastic actuator (SEA). The leader runs a damping-only admittance loop with a 6-D biquad notch filter; the follower realizes a stiffness-damping impedance through a position-based outer loop with a PID wrench-to-pose mapping. Three time scales (hardware I/O, mid-rate impedance/admittance, low-rate teleoperation messages) are explicitly decoupled, enabling the same application to drive heterogeneous arms. On a Lite6/FR3 testbed at 150 Hz, the system tracks stably under delays up to $120\pm40$ ms and 1% packet loss, matches the prescribed virtual stiffness in contact, and shows a favorable cumulative energy signature in passivity-style tests.
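To make the leader-side loop more concrete, the sketch below shows one axis of a damping-only admittance law fed by a biquad notch filter. It is a hedged illustration only: the coefficient formulas are the standard audio-cookbook notch design, and the notch frequency (20 Hz), quality factor, damping value, and the names `BiquadNotch` and `admittance_velocity` are assumptions, not parameters reported in the abstract. Only the 150 Hz rate comes from the text.

```python
import math

def notch_coefficients(f0, fs, q):
    """Normalized biquad notch coefficients (standard cookbook form)."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (1.0 / a0, -2.0 * math.cos(w0) / a0, 1.0 / a0)
    a = (-2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a

class BiquadNotch:
    """Direct-form I biquad notch for a single signal channel."""

    def __init__(self, f0, fs, q):
        self.b, self.a = notch_coefficients(f0, fs, q)
        self.x1 = self.x2 = self.y1 = self.y2 = 0.0

    def step(self, x):
        b0, b1, b2 = self.b
        a1, a2 = self.a
        y = b0 * x + b1 * self.x1 + b2 * self.x2 - a1 * self.y1 - a2 * self.y2
        # Shift the delay line.
        self.x2, self.x1 = self.x1, x
        self.y2, self.y1 = self.y1, y
        return y

def admittance_velocity(force, damping):
    """Damping-only admittance: commanded velocity v = F / D."""
    return force / damping

# One axis at the 150 Hz rate; f0, q and damping are assumed values.
notch = BiquadNotch(f0=20.0, fs=150.0, q=2.0)
velocity = admittance_velocity(notch.step(2.0), damping=10.0)
```

A 6-D version would simply keep one such filter per wrench component, which is one plausible reading of the "6-D biquad notch filter" in the abstract.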
Abstract:This paper proposes a common interface for real-time low-level motion planning of collaborative robotic arms, aimed at enabling broader applicability and improved portability across heterogeneous hardware platforms. In previous work, we introduced WinGs Operating Studio (WOS), a middleware solution that abstracts diverse robotic components into uniform software resources and provides a broad suite of language-agnostic APIs. This paper specifically focuses on its minimal yet flexible interface for real-time end-effector trajectory control. By employing an n-degree polynomial interpolator in conjunction with a quadratic programming solver, the proposed method generates smooth, continuously differentiable trajectories with precise position, velocity, and acceleration profiles. We validate our approach in three distinct scenarios. First, in an offline demonstration, a collaborative arm accurately draws various geometric shapes on paper. Second, in an interruptible, low-frequency re-planning setting, a robotic manipulator grasps a dynamic object placed on a moving mobile robot. Finally, we conducted a teleoperation experiment in which one robotic arm controlled another to perform a series of dexterous manipulations, confirming the proposed method's reliability, versatility, and ease of use.
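For readers unfamiliar with polynomial trajectory interpolation, the sketch below builds a single quintic segment that matches position, velocity, and acceleration at both endpoints. This is a closed-form special case chosen for brevity, not the n-degree interpolator with a quadratic programming solver described in the abstract; the function names and the boundary values are hypothetical.

```python
def quintic_coefficients(p0, v0, a0, p1, v1, a1, T):
    """Coefficients c0..c5 of p(t) = sum(c_k * t**k) on [0, T] that match
    position, velocity and acceleration at both ends (closed form)."""
    c0 = p0
    c1 = v0
    c2 = 0.5 * a0
    c3 = (20 * (p1 - p0) - (8 * v1 + 12 * v0) * T
          - (3 * a0 - a1) * T ** 2) / (2 * T ** 3)
    c4 = (30 * (p0 - p1) + (14 * v1 + 16 * v0) * T
          + (3 * a0 - 2 * a1) * T ** 2) / (2 * T ** 4)
    c5 = (12 * (p1 - p0) - 6 * (v1 + v0) * T
          - (a0 - a1) * T ** 2) / (2 * T ** 5)
    return [c0, c1, c2, c3, c4, c5]

def evaluate(coeffs, t):
    """Return (position, velocity, acceleration) of the polynomial at time t."""
    pos = sum(c * t ** k for k, c in enumerate(coeffs))
    vel = sum(k * c * t ** (k - 1) for k, c in enumerate(coeffs) if k >= 1)
    acc = sum(k * (k - 1) * c * t ** (k - 2)
              for k, c in enumerate(coeffs) if k >= 2)
    return pos, vel, acc

# Rest-to-rest move of one Cartesian coordinate over 2 s (assumed values).
coeffs = quintic_coefficients(0.0, 0.0, 0.0, 0.3, 0.0, 0.0, 2.0)
mid_pos, mid_vel, mid_acc = evaluate(coeffs, 1.0)
```

Chaining such segments with shared boundary conditions keeps velocity and acceleration continuous across waypoints; a QP formulation, as in the paper, can additionally optimize over free coefficients subject to constraints.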
Abstract:This paper presents Delta6, a low-cost, six-degree-of-freedom (6-DOF) force/torque end-effector that combines antagonistic springs with magnetic encoders to deliver accurate wrench sensing while remaining as simple to assemble as flat-pack furniture. A fully 3D-printed prototype, assembled entirely from off-the-shelf parts, withstands peak forces above ±14.4 N and torques of ±0.33 N·m per axis; these limits can be further extended by leveraging the proposed parametric analytical model. Without calibration, Delta6 attains a 99th-percentile error of 7% full scale (FS). With lightweight sequence models, the error is reduced to 3.8% FS by the best-performing network. Benchmarks on multiple computing platforms confirm that the device's bandwidth is adjustable, enabling balanced trade-offs among update rate, accuracy, and cost, while durability, thermal drift, and zero-calibration tests confirm its robustness. With Delta6 mounted on a robot arm governed by a force-impedance controller, the system successfully performs two contact-rich tasks: buffing curved surfaces and tight assemblies. Experiments validate the design, showing that Delta6 is a robust, low-cost alternative to existing 6-DOF force sensing solutions. Open-source site: https://wings-robotics.github.io/delta6 .
Abstract:Open-vocabulary scene understanding is crucial for robotic applications, enabling robots to comprehend complex 3D environmental contexts and supporting various downstream tasks such as navigation and manipulation. However, existing methods require pre-built complete 3D semantic maps to construct scene graphs for scene understanding, which limits their applicability in robotic scenarios where environments are explored incrementally. To address this challenge, we propose OGScene3D, an open-vocabulary scene understanding system that achieves accurate 3D semantic mapping and scene graph construction incrementally. Our system employs a confidence-based Gaussian semantic representation that jointly models semantic predictions and their reliability, enabling robust scene modeling. Building on this representation, we introduce a hierarchical 3D semantic optimization strategy that achieves semantic consistency through local correspondence establishment and global refinement, thereby constructing globally consistent semantic maps. Moreover, we design a long-term global optimization method that leverages temporal memory of historical observations to enhance semantic predictions. By integrating 2D-3D semantic consistency with Gaussian rendering contribution, this method continuously refines the semantic understanding of the entire scene. Furthermore, we develop a progressive graph construction approach that dynamically creates and updates both nodes and semantic relationships, allowing continuous updating of the 3D scene graphs. Extensive experiments on widely used datasets and real-world scenes demonstrate the effectiveness of OGScene3D for open-vocabulary scene understanding.
Abstract:Recent advances in dense Simultaneous Localization and Mapping (SLAM) have demonstrated remarkable performance in static environments. However, dense SLAM in dynamic environments remains challenging. Most methods directly remove dynamic objects and focus solely on static scene reconstruction, which ignores the motion information contained in these dynamic objects. In this paper, we present D$^2$GSLAM, a novel dynamic SLAM system utilizing a Gaussian representation, which simultaneously performs accurate dynamic reconstruction and robust tracking within dynamic environments. Our system is composed of four key components: (i) We propose a geometric-prompt dynamic separation method to distinguish between static and dynamic elements of the scene. This approach leverages the geometric consistency of the Gaussian representation and scene geometry to obtain coarse dynamic regions. These regions then serve as prompts to guide the refinement of the coarse mask into an accurate motion mask. (ii) To facilitate accurate and efficient mapping of the dynamic scene, we introduce a dynamic-static composite representation that integrates static 3D Gaussians with dynamic 4D Gaussians. This representation allows for modeling the transitions between static and dynamic states of objects in the scene for composite mapping and optimization. (iii) We employ a progressive pose refinement strategy that leverages both the multi-view consistency of static scene geometry and motion information from dynamic objects to achieve accurate camera tracking. (iv) We introduce a motion consistency loss, which leverages the temporal continuity in object motions for accurate dynamic modeling. D$^2$GSLAM demonstrates superior performance on dynamic scenes in terms of mapping and tracking accuracy, while also showing capability in accurate dynamic modeling.
Abstract:We present a novel under-actuated gripper with two 3-joint fingers that realizes force-feedback control without any force sensor by using a Long Short-Term Memory (LSTM) deep learning model. First, a five-linkage mechanism formed by stacking two four-linkages is designed as a finger to automatically switch between parallel and enveloping grasping modes. This enables the creation of a low-cost under-actuated gripper comprising a single actuator and two 3-phalange fingers. Second, we devise theoretical models of kinematics and power transmission for the proposed gripper, accurately obtaining fingertip positions and contact forces. Through coupling and decoupling of the five-linkage mechanisms, the proposed gripper offers the expected grasping payload, force, and stability and handles objects with a large range of dimensions. Third, to realize force control, an LSTM model is proposed to determine the grasping mode for synthesizing force-feedback control policies that exploit contact sensing, after the uncertainty of the currents is characterized using a statistical method. Finally, a series of experiments is conducted to measure quantitative indicators such as the payload, grasping force, force sensing, grasping stability, and the dimension ranges of objects to be grasped. Additionally, the grasping performance of the proposed gripper is verified experimentally, demonstrating its high versatility and robustness.
Abstract:Robotic kitting is a critical task in industrial automation that requires the precise arrangement of objects into kits to support downstream production processes. However, when handling complex kitting tasks that involve fine-grained orientation alignment, existing approaches often suffer from limited accuracy and computational efficiency. To address these challenges, we propose Histogram Transporter, a novel kitting framework that learns high-precision pick-and-place actions from scratch using only a few demonstrations. First, our method extracts rotation-equivariant orientation histograms (EOHs) from visual observations using an efficient Fourier-based discretization strategy. These EOHs serve a dual purpose: improving picking efficiency by directly modeling action success probabilities over high-resolution orientations and enhancing placing accuracy by serving as local, discriminative feature descriptors for object-to-placement matching. Second, we introduce a subgroup alignment strategy in the place model that compresses the full spectrum of EOHs into a compact orientation representation, enabling efficient feature matching while preserving accuracy. Finally, we evaluate the proposed framework on the simulated Hand-Tool Kitting Dataset (HTKD), where it outperforms competitive baselines in both success rate and computational efficiency. Further experiments on five Raven-10 tasks demonstrate the adaptability of our approach, and real-robot trials confirm its applicability to real-world deployment.
Abstract:In construction quality monitoring, accurately detecting and segmenting cracks in concrete structures is paramount for safety and maintenance. Current convolutional neural networks (CNNs) have demonstrated strong performance in crack segmentation tasks, yet they often struggle with complex backgrounds and fail to fully capture fine-grained tubular structures. In contrast, Transformers excel at capturing global context but lack precision in detailed feature extraction. To address these challenges, we introduce DSCformer, a novel hybrid model that integrates an enhanced Dynamic Snake Convolution (DSConv) with a Transformer architecture for crack segmentation. Our key contributions include an enhanced DSConv with a pyramid kernel for adaptive offset computation and a simultaneous bi-directional learnable offset iteration, significantly improving the model's ability to capture intricate crack patterns. Additionally, we propose a Weighted Convolutional Attention Module (WCAM), which refines channel attention, allowing for more precise and adaptive feature attention. We evaluate DSCformer on the Crack3238 and FIND datasets, achieving IoUs of 59.22\% and 87.24\%, respectively. The experimental results suggest that DSCformer outperforms state-of-the-art methods across different datasets.
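The IoU figures quoted above follow the usual intersection-over-union definition for binary masks. A minimal sketch of that metric is given below; the function name `binary_iou`, the nested-list mask format, and the convention of returning 1.0 when both masks are empty are assumptions for illustration and are not taken from the paper's evaluation code.

```python
def binary_iou(pred, target):
    """Intersection over union of two binary masks given as nested lists.

    Any truthy value counts as a crack pixel. Returns 1.0 when both masks
    are empty (an assumed convention for this sketch).
    """
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p, t = bool(p), bool(t)
            inter += p and t   # pixel marked in both masks
            union += p or t    # pixel marked in at least one mask
    return inter / union if union else 1.0

# Toy 2x2 masks: one shared crack pixel out of three marked pixels.
pred = [[1, 1],
        [0, 0]]
target = [[1, 0],
          [1, 0]]
score = binary_iou(pred, target)
```

Because thin cracks occupy few pixels, small boundary errors change the intersection noticeably relative to the union, which is one reason IoU values on crack datasets can differ widely between datasets.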