Object detection is a computer vision task in which the goal is to detect and locate objects of interest in an image or video. The task involves identifying the position and boundaries of objects in an image, and classifying the objects into different categories. It forms a crucial part of vision recognition, alongside image classification and retrieval.
The prevalence of sarcasm in multimodal dialogues on the social platforms presents a crucial yet challenging task for understanding the true intent behind online content. Comprehensive sarcasm analysis requires two key aspects: Multimodal Sarcasm Detection (MSD) and Multimodal Sarcasm Explanation (MuSE). Intuitively, the act of detection is the result of the reasoning process that explains the sarcasm. Current research predominantly focuses on addressing either MSD or MuSE as a single task. Even though some recent work has attempted to integrate these tasks, their inherent causal dependency is often overlooked. To bridge this gap, we propose MuVaC, a variational causal inference framework that mimics human cognitive mechanisms for understanding sarcasm, enabling robust multimodal feature learning to jointly optimize MSD and MuSE. Specifically, we first model MSD and MuSE from the perspective of structural causal models, establishing variational causal pathways to define the objectives for joint optimization. Next, we design an alignment-then-fusion approach to integrate multimodal features, providing robust fusion representations for sarcasm detection and explanation generation. Finally, we enhance the reasoning trustworthiness by ensuring consistency between detection results and explanations. Experimental results demonstrate the superiority of MuVaC in public datasets, offering a new perspective for understanding multimodal sarcasm.
Shared control improves Human-Robot Interaction by reducing the user's workload and increasing the robot's autonomy. It allows robots to perform tasks under the user's supervision. Current eye-tracking-driven approaches face several challenges. These include accuracy issues in 3D gaze estimation and difficulty interpreting gaze when differentiating between multiple tasks. We present an eye-tracking-driven control framework, aimed at enabling individuals with severe physical disabilities to perform daily tasks independently. Our system uses task pictograms as fiducial markers combined with a feature matching approach that transmits data of the selected object to accomplish necessary task related measurements with an eye-in-hand configuration. This eye-tracking control does not require knowledge of the user's position in relation to the object. The framework correctly interpreted object and task selection in up to 97.9% of measurements. Issues were found in the evaluation, that were improved and shared as lessons learned. The open-source framework can be adapted to new tasks and objects due to the integration of state-of-the-art object detection models.
Data-based optimization (DBO) offers a promising approach for efficiently optimizing shape for better aerodynamic performance by leveraging a pretrained surrogate model for offline evaluations during iterations. However, DBO heavily relies on the quality of the training database. Samples outside the training distribution encountered during optimization can lead to significant prediction errors, potentially misleading the optimization process. Therefore, incorporating uncertainty quantification into optimization is critical for detecting outliers and enhancing robustness. This study proposes an uncertainty-aware data-based optimization (UA-DBO) framework to monitor and minimize surrogate model uncertainty during DBO. A probabilistic encoder-decoder surrogate model is developed to predict uncertainties associated with its outputs, and these uncertainties are integrated into a model-confidence-aware objective function to penalize samples with large prediction errors during data-based optimization process. The UA-DBO framework is evaluated on two multipoint optimization problems aimed at improving airfoil drag divergence and buffet performance. Results demonstrate that UA-DBO consistently reduces prediction errors in optimized samples and achieves superior performance gains compared to original DBO. Moreover, compared to multipoint optimization based on full computational simulations, UA-DBO offers comparable optimization effectiveness while significantly accelerating optimization speed.
This paper presents YOLOE-26, a unified framework that integrates the deployment-optimized YOLO26(or YOLOv26) architecture with the open-vocabulary learning paradigm of YOLOE for real-time open-vocabulary instance segmentation. Building on the NMS-free, end-to-end design of YOLOv26, the proposed approach preserves the hallmark efficiency and determinism of the YOLO family while extending its capabilities beyond closed-set recognition. YOLOE-26 employs a convolutional backbone with PAN/FPN-style multi-scale feature aggregation, followed by end-to-end regression and instance segmentation heads. A key architectural contribution is the replacement of fixed class logits with an object embedding head, which formulates classification as similarity matching against prompt embeddings derived from text descriptions, visual examples, or a built-in vocabulary. To enable efficient open-vocabulary reasoning, the framework incorporates Re-Parameterizable Region-Text Alignment (RepRTA) for zero-overhead text prompting, a Semantic-Activated Visual Prompt Encoder (SAVPE) for example-guided segmentation, and Lazy Region Prompt Contrast for prompt-free inference. All prompting modalities operate within a unified object embedding space, allowing seamless switching between text-prompted, visual-prompted, and fully autonomous segmentation. Extensive experiments demonstrate consistent scaling behavior and favorable accuracy-efficiency trade-offs across model sizes in both prompted and prompt-free settings. The training strategy leverages large-scale detection and grounding datasets with multi-task optimization and remains fully compatible with the Ultralytics ecosystem for training, validation, and deployment. Overall, YOLOE-26 provides a practical and scalable solution for real-time open-vocabulary instance segmentation in dynamic, real-world environments.
Open-Vocabulary Object Detection (OVOD) aims to develop the capability to detect anything. Although myriads of large-scale pre-training efforts have built versatile foundation models that exhibit impressive zero-shot capabilities to facilitate OVOD, the necessity of creating a universal understanding for any object cognition according to already pretrained foundation models is usually overlooked. Therefore, in this paper, a training-free Guess What Vision Language Model, called GW-VLM, is proposed to form a universal understanding paradigm based on our carefully designed Multi-Scale Visual Language Searching (MS-VLS) coupled with Contextual Concept Prompt (CCP) for OVOD. This approach can engage a pre-trained Vision Language Model (VLM) and a Large Language Model (LLM) in the game of "guess what". Wherein, MS-VLS leverages multi-scale visual-language soft-alignment for VLM to generate snippets from the results of class-agnostic object detection, while CCP can form the concept of flow referring to MS-VLS and then make LLM understand snippets for OVOD. Finally, the extensive experiments are carried out on natural and remote sensing datasets, including COCO val, Pascal VOC, DIOR, and NWPU-10, and the results indicate that our proposed GW-VLM can achieve superior OVOD performance compared to the-state-of-the-art methods without any training step.
Proficiency in microanastomosis is a fundamental competency across multiple microsurgical disciplines. These procedures demand exceptional precision and refined technical skills, making effective, standardized assessment methods essential. Traditionally, the evaluation of microsurgical techniques has relied heavily on the subjective judgment of expert raters. They are inherently constrained by limitations such as inter-rater variability, lack of standardized evaluation criteria, susceptibility to cognitive bias, and the time-intensive nature of manual review. These shortcomings underscore the urgent need for an objective, reliable, and automated system capable of assessing microsurgical performance with consistency and scalability. To bridge this gap, we propose a novel AI framework for the automated assessment of microanastomosis instrument handling skills. The system integrates four core components: (1) an instrument detection module based on the You Only Look Once (YOLO) architecture; (2) an instrument tracking module developed from Deep Simple Online and Realtime Tracking (DeepSORT); (3) an instrument tip localization module employing shape descriptors; and (4) a supervised classification module trained on expert-labeled data to evaluate instrument handling proficiency. Experimental results demonstrate the effectiveness of the framework, achieving an instrument detection precision of 97%, with a mean Average Precision (mAP) of 96%, measured by Intersection over Union (IoU) thresholds ranging from 50% to 95% (mAP50-95).
Passive Radar Systems have received tremendous attention during the past few decades, due to their low cost and ability to remain covert during operation. Such systems do not transmit any energy themselves, but rely on a so-called Illuminator-of-Opportunity (IO), for example a commercial TV station. A network of Receiving Nodes (RN) receive the direct signal as well as reflections from possible targets. The RNs transmit information to a Central Node (CN), that performs the final target detection, localization and tracking. A large number of methods and algorithms for target detection and localization have been proposed in the literature. In the present contribution, the focus is on the seminal Extended Cancelation Algorithm (ECA), in which each RN estimates target parameters after canceling interference from the direct-path as well as clutter from unwanted stationary objects. This is done by exploiting a separate Reference Channel (RC), which captures the IO signal without interference apart from receiver noise. We derive the statistical properties of the ECA parameter estimates under the assumption of a high Signal-to-Noise Ratio (SNR), and we give a sufficient condition for the SNR in the RC to enable statistically efficient estimates. The theoretical results are corroborated through computer simulations, which show that the theory agrees well with empirical results above a certain SNR threshold. The results can be used to predict the performance of passive radar systems in given scenarios, which is useful for feasibility studies as well as system design.
Industrial image anomaly detection is a challenging problem owing to extreme class imbalance and the scarcity of labeled defective samples, particularly in few-shot settings. We propose BayPrAnoMeta, a Bayesian generalization of Proto-MAML for few-shot industrial image anomaly detection. Unlike existing Proto-MAML approaches that rely on deterministic class prototypes and distance-based adaptation, BayPrAnoMeta replaces prototypes with task-specific probabilistic normality models and performs inner-loop adaptation via a Bayesian posterior predictive likelihood. We model normal support embeddings with a Normal-Inverse-Wishart (NIW) prior, producing a Student-$t$ predictive distribution that enables uncertainty-aware, heavy-tailed anomaly scoring and is essential for robustness in extreme few-shot settings. We further extend BayPrAnoMeta to a federated meta-learning framework with supervised contrastive regularization for heterogeneous industrial clients and prove convergence to stationary points of the resulting nonconvex objective. Experiments on the MVTec AD benchmark demonstrate consistent and significant AUROC improvements over MAML, Proto-MAML, and PatchCore-based methods in few-shot anomaly detection settings.
Recent advances in multi-modal detection have significantly improved detection accuracy in challenging environments (e.g., low light, overexposure). By integrating RGB with modalities such as thermal and depth, multi-modal fusion increases data redundancy and system robustness. However, significant challenges remain in effectively extracting task-relevant information both within and across modalities, as well as in achieving precise cross-modal alignment. While CNNs excel at feature extraction, they are limited by constrained receptive fields, strong inductive biases, and difficulty in capturing long-range dependencies. Transformer-based models offer global context but suffer from quadratic computational complexity and are confined to pairwise correlation modeling. Mamba and other State Space Models (SSMs), on the other hand, are hindered by their sequential scanning mechanism, which flattens 2D spatial structures into 1D sequences, disrupting topological relationships and limiting the modeling of complex higher-order dependencies. To address these issues, we propose a multi-modal perception network based on hypergraph theory called M2I2HA. Our architecture includes an Intra-Hypergraph Enhancement module to capture global many-to-many high-order relationships within each modality, and an Inter-Hypergraph Fusion module to align, enhance, and fuse cross-modal features by bridging configuration and spatial gaps between data sources. We further introduce a M2-FullPAD module to enable adaptive multi-level fusion of multi-modal enhanced features within the network, meanwhile enhancing data distribution and flow across the architecture. Extensive object detection experiments on multiple public datasets against baselines demonstrate that M2I2HA achieves state-of-the-art performance in multi-modal object detection tasks.
We consider the problem of vision-based pose estimation for autonomous systems. While deep neural networks have been successfully used for vision-based tasks, they inherently lack provable guarantees on the correctness of their output, which is crucial for safety-critical applications. We present a framework for designing certifiable neural networks (NNs) for perception-based pose estimation that integrates physics-driven modeling with learning-based estimation. The proposed framework begins by leveraging the known geometry of planar objects commonly found in the environment, such as traffic signs and runway markings, referred to as target objects. At its core, it introduces a geometric generative model (GGM), a neural-network-like model whose parameters are derived from the image formation process of a target object observed by a camera. Once designed, the GGM can be used to train NN-based pose estimators with certified guarantees in terms of their estimation errors. We first demonstrate this framework in uncluttered environments, where the target object is the only object present in the camera's field of view. We extend this using ideas from NN reachability analysis to design certified object NN that can detect the presence of the target object in cluttered environments. Subsequently, the framework consolidates the certified object detector with the certified pose estimator to design a multi-stage perception pipeline that generalizes the proposed approach to cluttered environments, while maintaining its certified guarantees. We evaluate the proposed framework using both synthetic and real images of various planar objects commonly encountered by autonomous vehicles. Using images captured by an event-based camera, we show that the trained encoder can effectively estimate the pose of a traffic sign in accordance with the certified bound provided by the framework.