Topic: Real-Time Semantic Segmentation
What is Real-Time Semantic Segmentation? Semantic segmentation is a computer-vision task that assigns a semantic label to each pixel in an image. In real-time semantic segmentation, the goal is to perform this labeling quickly and accurately enough for the results to be used in tasks such as object recognition, scene understanding, and autonomous navigation.
Papers and Code
Apr 30, 2025
Abstract: Numerous recent works emphasize the importance of semantic segmentation of LiDAR data as a critical component in the development of driver-assistance systems and autonomous vehicles. However, many state-of-the-art methods are tested on outdated, lower-resolution LiDAR sensors and struggle with real-time constraints. This study introduces a novel semantic segmentation framework tailored to modern high-resolution LiDAR sensors that addresses both accuracy and real-time processing demands. We propose a novel LiDAR dataset collected by a cutting-edge automotive 128-layer LiDAR in urban traffic scenes. Furthermore, we propose a semantic segmentation method utilizing surface normals as strong input features. Our approach bridges the gap between cutting-edge research and practical automotive applications. Additionally, we provide a Robot Operating System (ROS2) implementation that we operate on our research vehicle. Our dataset and code are publicly available: https://github.com/kav-institute/SemanticLiDAR.
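The paper does not spell out how the surface-normal features are computed; as a rough illustration only, normals for an organized LiDAR scan can be estimated from neighboring points in the spherical range-image projection via cross products. A minimal NumPy sketch, where the projection layout and finite-difference scheme are assumptions rather than the authors' implementation:

```python
import numpy as np

def estimate_normals(points):
    """Estimate per-point surface normals for an organized LiDAR scan.

    points: (H, W, 3) array of 3D points from a spherical range-image
    projection (H = number of laser layers, W = azimuth bins).
    Returns an (H, W, 3) array of unit normals.
    """
    # Finite differences along the vertical (layer) and horizontal (azimuth) axes.
    dv = np.zeros_like(points)
    du = np.zeros_like(points)
    dv[1:-1] = points[2:] - points[:-2]
    du[:, 1:-1] = points[:, 2:] - points[:, :-2]

    # The normal is the cross product of the two tangent directions.
    normals = np.cross(dv.reshape(-1, 3), du.reshape(-1, 3)).reshape(points.shape)
    norm = np.linalg.norm(normals, axis=-1, keepdims=True)
    normals = normals / np.clip(norm, 1e-8, None)

    # Orient normals toward the sensor origin for consistency.
    flip = np.sum(normals * points, axis=-1, keepdims=True) > 0
    return np.where(flip, -normals, normals)
```

The resulting (H, W, 3) normal map can then be stacked with range and intensity channels as network input.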

Apr 28, 2025
Abstract: Understanding continuous video streams plays a fundamental role in real-time applications including embodied AI and autonomous driving. Unlike offline video understanding, streaming video understanding requires the ability to process video streams frame by frame, preserve historical information, and make low-latency decisions. To address these challenges, our main contributions are three-fold. (i) We develop a novel streaming video backbone, termed StreamFormer, by incorporating causal temporal attention into a pre-trained vision transformer. This enables efficient streaming video processing while maintaining image representation capability. (ii) To train StreamFormer, we propose to unify diverse spatial-temporal video understanding tasks within a multitask visual-language alignment framework. Hence, StreamFormer learns global semantics, temporal dynamics, and fine-grained spatial relationships simultaneously. (iii) We conduct extensive experiments on online action detection, online video instance segmentation, and video question answering. StreamFormer achieves competitive results while maintaining efficiency, demonstrating its potential for real-time applications.
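The abstract only names causal temporal attention; the core idea is that each frame token attends only to current and past frames, so the backbone can run frame by frame. A minimal PyTorch sketch, assuming one token sequence per spatial location (the layer sizes and tensor layout are placeholders, not StreamFormer's actual architecture):

```python
import torch
import torch.nn as nn

class CausalTemporalAttention(nn.Module):
    """Self-attention over the time axis with a causal mask,
    so frame t never attends to frames later than t."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (batch * spatial_tokens, T, dim) -- one temporal sequence per spatial location.
        T = x.size(1)
        causal_mask = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )  # True above the diagonal = future frames are blocked
        out, _ = self.attn(x, x, x, attn_mask=causal_mask)
        return out

# Usage: y = CausalTemporalAttention(dim=256)(torch.randn(4, 16, 256))
```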

Apr 27, 2025
Abstract: Real-time open-vocabulary scene understanding is essential for efficient 3D perception in applications such as vision-language navigation, embodied intelligence, and augmented reality. However, existing methods suffer from imprecise instance segmentation, static semantic updates, and limited handling of complex queries. To address these issues, we present OpenFusion++, a TSDF-based real-time 3D semantic-geometric reconstruction system. Our approach refines 3D point clouds by fusing confidence maps from foundational models, dynamically updates global semantic labels via an adaptive cache based on instance area, and employs a dual-path encoding framework that integrates object attributes with environmental context for precise query responses. Experiments on the ICL, Replica, ScanNet, and ScanNet++ datasets demonstrate that OpenFusion++ significantly outperforms the baseline in both semantic accuracy and query responsiveness.
* 8 pages, 9 figures
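The "adaptive cache based on instance area" is described only at a high level; one plausible reading is a per-instance cache whose stored open-vocabulary label is replaced when a new observation covers a larger visible area. A hypothetical Python sketch of that reading (not the paper's data structure):

```python
class SemanticLabelCache:
    """Keep one semantic entry per instance ID; replace it only when a new
    observation is backed by a larger instance area -- a guess at the
    adaptive, area-based update described in the abstract above."""
    def __init__(self):
        self.entries = {}  # instance_id -> (label, feature, area)

    def update(self, instance_id, label, feature, area):
        cached = self.entries.get(instance_id)
        if cached is None or area > cached[2]:
            self.entries[instance_id] = (label, feature, area)

    def query(self, instance_id):
        entry = self.entries.get(instance_id)
        return entry[0] if entry is not None else None
```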

Apr 28, 2025
Abstract: Semantic-aware 3D scene reconstruction is essential for autonomous robots to perform complex interactions. Semantic SLAM, an online approach that integrates pose tracking, geometric reconstruction, and semantic mapping into a unified framework, shows significant potential. However, existing systems, which rely on 2D ground truth priors for supervision, are often limited by the sparsity and noise of these signals in real-world environments. To address this challenge, we propose GSFF-SLAM, a novel dense semantic SLAM system based on 3D Gaussian Splatting that leverages feature fields to achieve joint rendering of appearance, geometry, and N-dimensional semantic features. By independently optimizing feature gradients, our method supports semantic reconstruction using various forms of 2D priors, particularly sparse and noisy signals. Experimental results demonstrate that our approach outperforms previous methods in both tracking accuracy and photorealistic rendering quality. When utilizing 2D ground truth priors, GSFF-SLAM achieves state-of-the-art semantic segmentation performance with 95.03\% mIoU, while achieving up to 2.9$\times$ speedup with only marginal performance degradation.
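Joint rendering of N-dimensional semantic features alongside appearance in Gaussian Splatting typically reuses the same alpha-compositing weights as color rendering. A toy per-pixel sketch under that assumption (not the paper's CUDA rasterizer):

```python
import torch

def composite_features(alphas, features):
    """Front-to-back alpha compositing of per-Gaussian semantic features
    for a single pixel.

    alphas:   (K,) opacities of the K Gaussians overlapping the pixel,
              already sorted front to back.
    features: (K, N) N-dimensional semantic feature per Gaussian.
    Returns the (N,) rendered feature, using the same transmittance
    weights a color rasterizer would use.
    """
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alphas[:-1]]), dim=0
    )                                  # T_k = prod_{j<k} (1 - alpha_j)
    weights = alphas * transmittance   # w_k = alpha_k * T_k
    return (weights.unsqueeze(-1) * features).sum(dim=0)
```

Because the compositing weights are shared, the semantic feature field can be supervised by 2D priors with the same rendering loss machinery used for color.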

Apr 15, 2025
Abstract: Deep learning techniques have achieved remarkable success in the semantic segmentation of remote sensing images and in land-use change detection. Nevertheless, their real-time deployment on edge platforms remains constrained by decoder complexity. Herein, we introduce LightFormer, a lightweight decoder for time-critical tasks that involve unstructured targets, such as disaster assessment, unmanned aerial vehicle search-and-rescue, and cultural heritage monitoring. LightFormer employs a feature-fusion and refinement module built on channel processing and a learnable gating mechanism to aggregate multi-scale, multi-range information efficiently, which drastically curtails model complexity. Furthermore, we propose a spatial information selection module (SISM) that integrates long-range attention with a detail preservation branch to capture spatial dependencies across multiple scales, thereby substantially improving the recognition of unstructured targets in complex scenes. On the ISPRS Vaihingen benchmark, LightFormer attains 99.9% of GLFFNet's mIoU (83.9% vs. 84.0%) while requiring only 14.7% of its FLOPs and 15.9% of its parameters, thus achieving an excellent accuracy-efficiency trade-off. Consistent results on LoveDA, ISPRS Potsdam, RescueNet, and FloodNet further demonstrate its robustness and superior perception of unstructured objects. These findings highlight LightFormer as a practical solution for remote sensing applications where both computational economy and high-precision segmentation are imperative.
* 26 pages, 69 figures
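The learnable gating mechanism for multi-scale fusion is described only at a high level; a minimal PyTorch sketch of one common way such a gate can blend a shallow, high-resolution feature with an upsampled deep feature (channel counts and the 1x1-conv gate are assumptions, not LightFormer's module):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusion(nn.Module):
    """Fuse a high-resolution (shallow) and a low-resolution (deep) feature
    map with a learned per-pixel gate in [0, 1]."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, shallow, deep):
        # Upsample the deep feature to the shallow feature's resolution.
        deep = F.interpolate(deep, size=shallow.shape[-2:],
                             mode="bilinear", align_corners=False)
        g = self.gate(torch.cat([shallow, deep], dim=1))  # (B, C, H, W)
        return g * shallow + (1.0 - g) * deep
```

A gate like this keeps the fusion cost at a single 1x1 convolution, which is consistent with the decoder-complexity argument made in the abstract.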

Apr 14, 2025
Abstract: Posidonia oceanica is a seagrass species whose meadows are highly dependent on rocks for their survival and conservation. In recent years, there has been a concerning global decline in this species, emphasizing the critical need for efficient monitoring and assessment tools. While deep learning-based semantic segmentation and visual automated monitoring systems have shown promise in a variety of applications, their performance in underwater environments remains challenging due to complex water conditions and limited datasets. This paper introduces a framework that combines machine learning and computer vision techniques to enable an autonomous underwater vehicle (AUV) to inspect the boundaries of Posidonia oceanica meadows autonomously. The framework incorporates an image segmentation module using an existing Mask R-CNN model and a strategy for Posidonia oceanica meadow boundary tracking. Furthermore, a new class dedicated to rocks is introduced to enhance the existing model, aiming to contribute to a comprehensive monitoring approach and provide a deeper understanding of the intricate interactions between the meadow and its surrounding environment. The image segmentation model is validated using real underwater images, while the overall inspection framework is evaluated in a realistic simulation environment, replicating actual monitoring scenarios with real underwater images. The results demonstrate that the proposed framework enables the AUV to autonomously accomplish the main tasks of underwater inspection and segmentation of rocks. Consequently, this work holds significant potential for the conservation and protection of marine environments, providing valuable insights into the status of Posidonia oceanica meadows and supporting targeted preservation efforts.
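The boundary-tracking strategy itself is not detailed in the abstract; as a simple illustration of the first step such a strategy might take, the meadow boundary can be extracted from a binary Posidonia mask with OpenCV contours. This helper is hypothetical and is not the paper's tracking method:

```python
import cv2
import numpy as np

def extract_meadow_boundary(mask):
    """Return the largest boundary contour of a binary Posidonia mask.

    mask: (H, W) array, 1 where the segmentation model predicts
    Posidonia oceanica, 0 elsewhere.
    """
    contours, _ = cv2.findContours(
        mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )
    if not contours:
        return None
    boundary = max(contours, key=cv2.contourArea)  # (M, 1, 2) pixel coordinates
    return boundary.reshape(-1, 2)
```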

Apr 14, 2025
Abstract: Autonomous fabrication systems are transforming construction and manufacturing, yet they remain vulnerable to print errors. Texture classification is a key component of computer vision systems that enable real-time monitoring and adjustment during cementitious fabrication. Traditional classification methods often rely on global image features, which can bias the model toward semantic content rather than low-level textures. In this paper, we introduce a novel preprocessing technique called "patch and shuffle," which segments input images into smaller patches, shuffles them, and reconstructs a jumbled image before classification. This transformation removes semantic context, forcing the classifier to rely on local texture features. We evaluate this approach on a dataset of extruded cement images, using a ResNet-18-based architecture. Our experiments compare the patch and shuffle method to a standard pipeline, holding all other factors constant. Results show a significant improvement in accuracy: the patch and shuffle model achieved 90.64% test accuracy versus 72.46% for the baseline. These findings suggest that disrupting global structure enhances performance in texture-based classification tasks. This method has implications for broader vision tasks where low-level features matter more than high-level semantics. The technique may improve classification in applications ranging from fabrication monitoring to medical imaging.
* Originally completed as a final project for CEE 374 at Princeton University
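The "patch and shuffle" transform is simple enough to reproduce directly; a minimal NumPy sketch, assuming the image dimensions are divisible by the patch size (the patch size itself is a placeholder, not the paper's setting):

```python
import numpy as np

def patch_and_shuffle(image, patch_size=32, rng=None):
    """Split an image into non-overlapping patches, shuffle them, and
    reassemble a jumbled image so that only local texture survives.

    image: (H, W, C) array with H and W divisible by patch_size.
    """
    if rng is None:
        rng = np.random.default_rng()
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size

    # Rearrange into a (num_patches, patch, patch, C) stack of patches.
    patches = image.reshape(gh, patch_size, gw, patch_size, c).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(gh * gw, patch_size, patch_size, c)

    rng.shuffle(patches, axis=0)  # destroy global layout, keep local texture

    # Reassemble into an image of the original size.
    patches = patches.reshape(gh, gw, patch_size, patch_size, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(h, w, c)
```

Applied as a preprocessing step before the classifier, this removes global scene structure while leaving each patch's texture statistics intact.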

Apr 11, 2025
Abstract: Life-transformative applications such as immersive extended reality are revolutionizing wireless communications and computer vision (CV). This paper presents a novel framework for importance-aware adaptive data transmissions, designed specifically for real-time CV applications where task-specific fidelity is critical. A novel importance-weighted mean square error (IMSE) metric is introduced as a task-oriented measure of reconstruction quality, considering sub-pixel-level importance (SP-I) and semantic segment-level importance (SS-I) models. To minimize IMSE under total power constraints, data-importance-aware waterfilling approaches are proposed to optimally allocate transmission power according to data importance and channel conditions, prioritizing sub-streams with high importance. Simulation results demonstrate that the proposed approaches significantly outperform margin-adaptive waterfilling and equal power allocation strategies. The data partitioning that combines both SP-I and SS-I models is shown to achieve the most significant improvements, with normalized IMSE gains exceeding $7\,$dB and $10\,$dB over the baselines at high SNRs ($>10\,$dB). These substantial gains highlight the potential of the proposed framework to enhance data efficiency and robustness in real-time CV applications, especially in bandwidth-limited and resource-constrained environments.
* Submitted to JSAC
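The paper's IMSE-optimal allocation rule is not reproduced here. For intuition, the sketch below shows a textbook weighted waterfilling allocation, p_i = max(0, w_i/nu - 1/g_i), solved by bisection on the water level nu so that the total power budget is met; it illustrates how importance weights w_i and channel gains g_i shape the allocation, but it is only an assumed stand-in for the data-importance-aware approaches described above:

```python
import numpy as np

def weighted_waterfilling(weights, gains, total_power, iters=100):
    """Allocate power p_i = max(0, w_i / nu - 1 / g_i) under sum(p_i) = total_power,
    with the water level nu found by bisection. Higher-importance sub-streams
    (larger w_i) and better channels (larger g_i) receive more power."""
    weights = np.asarray(weights, dtype=float)
    gains = np.asarray(gains, dtype=float)

    def allocated(nu):
        return np.maximum(0.0, weights / nu - 1.0 / gains)

    lo, hi = 1e-12, float((weights * gains).max())  # bracket the water level
    for _ in range(iters):
        nu = 0.5 * (lo + hi)
        if allocated(nu).sum() > total_power:
            lo = nu   # over budget -> raise the water level
        else:
            hi = nu
    return allocated(0.5 * (lo + hi))

# Usage: p = weighted_waterfilling([3.0, 1.0, 0.5], [0.8, 1.2, 0.3], total_power=10.0)
```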

Apr 11, 2025
Abstract: Geometrically accurate and semantically expressive map representations have proven invaluable for facilitating robust and safe mobile robot navigation and task planning. Nevertheless, real-time, open-vocabulary semantic understanding of large-scale unknown environments is still an open problem. In this paper we present FindAnything, an open-world mapping and exploration framework that incorporates vision-language information into dense volumetric submaps. Thanks to the use of vision-language features, FindAnything bridges the gap between pure geometric and open-vocabulary semantic information for a higher level of understanding, while allowing any environment to be explored without the help of any external source of ground-truth pose information. We represent the environment as a series of volumetric occupancy submaps, resulting in a robust and accurate map representation that deforms upon pose updates when the underlying SLAM system corrects its drift, allowing for a locally consistent representation between submaps. Pixel-wise vision-language features are aggregated from efficient SAM (eSAM)-generated segments, which are in turn integrated into object-centric volumetric submaps, providing a mapping from open-vocabulary queries to 3D geometry that also scales in terms of memory usage. The open-vocabulary map representation of FindAnything achieves state-of-the-art semantic accuracy in closed-set evaluations on the Replica dataset. This level of scene understanding allows a robot to explore environments based on objects or areas of interest selected via natural language queries. Our system is the first of its kind to be deployed on resource-constrained devices, such as MAVs, leveraging vision-language information for real-world robotic tasks.
* 11 pages, 5 figures
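Aggregating pixel-wise vision-language features over segment masks and matching them against text queries can be illustrated compactly. The sketch below abstracts away the feature extractor and assumes a CLIP-style setup where segment and text embeddings are compared by cosine similarity; it is not FindAnything's actual integration pipeline:

```python
import torch
import torch.nn.functional as F

def aggregate_segment_features(pixel_features, segment_masks):
    """Average pixel-wise vision-language features over each segment.

    pixel_features: (D, H, W) dense feature map for one frame.
    segment_masks:  (S, H, W) boolean masks, one per eSAM segment.
    Returns (S, D) L2-normalized per-segment descriptors.
    """
    d, h, w = pixel_features.shape
    feats = pixel_features.reshape(d, h * w)          # (D, HW)
    masks = segment_masks.reshape(-1, h * w).float()  # (S, HW)
    summed = masks @ feats.T                          # (S, D) sum of features per segment
    counts = masks.sum(dim=1, keepdim=True).clamp(min=1.0)
    return F.normalize(summed / counts, dim=-1)

def rank_segments(segment_features, text_embedding):
    """Score segments against an L2-normalized text query embedding."""
    return segment_features @ F.normalize(text_embedding, dim=-1)
```

Storing one descriptor per segment, rather than one per pixel, is what keeps the memory footprint of the object-centric submaps manageable.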

Apr 09, 2025
Abstract: Medical image segmentation has achieved remarkable success through the continuous advancement of UNet-based and Transformer-based foundation backbones. However, clinical diagnosis in the real world often requires integrating domain knowledge, especially textual information. Multimodal learning over visual and text modalities has been shown to be a solution, but collecting paired vision-language datasets is expensive and time-consuming, posing significant challenges. Inspired by the superior ability of Large Language Models (LLMs) in numerous cross-modal tasks, we propose a novel Vision-LLM union framework to address these issues. Specifically, we introduce frozen LLMs for zero-shot instruction generation based on the corresponding medical images, imitating the radiology scanning and report generation process. To better approximate real-world diagnostic processes, we generate more precise text instructions from multimodal radiology images (e.g., T1-w or T2-w MRI and CT). Building on the impressive semantic understanding and rich knowledge of LLMs, this process emphasizes extracting special features from different modalities and reuniting the information for the ultimate clinical diagnosis. With the generated text instructions, our proposed union segmentation framework can handle multimodal segmentation without previously collected vision-language datasets. To evaluate our proposed method, we conduct comprehensive experiments with influential baselines; the statistical results and the visualized case study demonstrate the superiority of our novel method.
* 21 pages, 4 figures, in press at a journal
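The zero-shot instruction generation step is described only conceptually; a hypothetical prompt-building sketch is shown below. The prompt wording and the generate_fn interface are assumptions for illustration and do not reflect the paper's actual prompts or model choice:

```python
def build_instruction_prompt(modality, finding_hints=None):
    """Compose a prompt asking a frozen LLM to produce a segmentation
    instruction for a given radiology modality (e.g., 'T1-w MRI', 'CT').

    finding_hints: optional free-text metadata about the scan. The LLM
    stays frozen and is only queried, never fine-tuned.
    """
    prompt = (
        f"You are a radiologist. A {modality} scan is being reviewed. "
        "Write one concise instruction describing which anatomical "
        "structures or lesions should be segmented in this scan."
    )
    if finding_hints:
        prompt += f" Additional context: {finding_hints}"
    return prompt

# Usage with any text-generation backend (generate_fn is a placeholder):
# instruction = generate_fn(build_instruction_prompt("T2-w MRI", "suspected lesion"))
```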
