Abstract:Understanding actions within surgical workflows is essential for evaluating post-operative outcomes. However, capturing long sequences of actions performed in surgical settings poses challenges, as individual surgeons have their unique approaches shaped by their expertise, leading to significant variability. To tackle this complex problem, we focused on segmentation with precise boundaries, a demanding task due to the inherent variability in action durations and the subtle transitions often observed in untrimmed videos. These transitions, marked by ambiguous starting and ending points, complicate the segmentation process. Traditional models, such as MS-TCN, which depend on large receptive fields, frequently face challenges of over-segmentation (resulting in fragmented segments) or under-segmentation (merging distinct actions). Both of these issues negatively impact the quality of segmentation. To overcome these challenges, we present the Multi-Stage Boundary-Aware Transformer Network (MSBATN) with hierarchical sliding window attention, designed to enhance action segmentation. Our proposed approach incorporates a novel unified loss function that treats action classification and boundary detection as distinct yet interdependent tasks. Unlike traditional binary boundary detection methods, our boundary voting mechanism accurately identifies start and end points by leveraging contextual information. Extensive experiments using three challenging surgical datasets demonstrate the superior performance of the proposed method, achieving state-of-the-art results in F1 scores at thresholds of 25% and 50%, while also delivering comparable performance in other metrics.
Abstract:Edge computing-based 3D perception has received attention in intelligent transportation systems (ITS) because real-time monitoring of traffic candidates potentially strengthens Vehicle-to-Everything (V2X) orchestration. Thanks to the capability of precisely measuring the depth information on surroundings from LiDAR, the increasing studies focus on lidar-based 3D detection, which significantly promotes the development of 3D perception. Few methods met the real-time requirement of edge deployment because of high computation-intensive operations. Moreover, an inconsistency problem of object detection remains uncovered in the pointcloud domain due to large sparsity. This paper thoroughly analyses this problem, comprehensively roused by recent works on determining inconsistency problems in the image specialisation. Therefore, we proposed a 3D harmonic loss function to relieve the pointcloud based inconsistent predictions. Moreover, the feasibility of 3D harmonic loss is demonstrated from a mathematical optimization perspective. The KITTI dataset and DAIR-V2X-I dataset are used for simulations, and our proposed method considerably improves the performance than benchmark models. Further, the simulative deployment on an edge device (Jetson Xavier TX) validates our proposed model's efficiency.