Abstract:Out-of-distribution (OOD) detection is essential for the safe deployment of neural networks, as it enables the identification of samples outside the training domain. We present FOODER, a real-time, privacy-preserving radar-based framework that integrates OOD-based facial authentication with facial expression recognition. FOODER operates using low-cost frequency-modulated continuous-wave (FMCW) radar and exploits both range-Doppler and micro range-Doppler representations. The authentication module employs a multi-encoder multi-decoder architecture with Body Part (BP) and Intermediate Linear Encoder-Decoder (ILED) components to classify a single enrolled individual as in-distribution while detecting all other faces as OOD. Upon successful authentication, an expression recognition module is activated. Concatenated radar representations are processed by a ResNet block to distinguish between dynamic and static facial expressions. Based on this categorization, two specialized MobileViT networks are used to classify dynamic expressions (smile, shock) and static expressions (neutral, anger). This hierarchical design enables robust facial authentication and fine-grained expression recognition while preserving user privacy by relying exclusively on radar data. Experiments conducted on a dataset collected with a 60 GHz short-range FMCW radar demonstrate that FOODER achieves an AUROC of 94.13% and an FPR95 of 18.12% for authentication, along with an average expression recognition accuracy of 94.70%. FOODER outperforms state-of-the-art OOD detection methods and several transformer-based architectures while operating efficiently in real time.
Abstract:Millimeter-wave (mmWave) radar has emerged as a robust sensing modality for several areas, offering reliable operation under adverse environmental conditions. Its ability to penetrate lightweight materials such as packaging or thin walls enables non-visual sensing in industrial and automated environments and can provide robotic platforms with enhanced environmental perception when used alongside optical sensors. Recent work with MIMO mmWave radar has demonstrated its ability to penetrate cardboard packaging for occluded object classification. However, existing models leave room for improvement and warrant a more thorough evaluation across different sensing frequencies. In this paper, we propose ACCOR, an attention-enhanced complex-valued contrastive learning approach for radar, enabling robust occluded object classification. We process complex-valued IQ radar signals using a complex-valued CNN backbone, followed by a multi-head attention layer and a hybrid loss. Our proposed loss combines a weighted cross-entropy term with a supervised contrastive term. We further extend an existing 64 GHz dataset with a 67 GHz subset of the occluded objects and evaluate our model using both center frequencies. Performance evaluation demonstrates that our approach outperforms prior radar-specific models and image classification models with adapted input, achieving classification accuracies of 96.60% at 64 GHz and 93.59% at 67 GHz for ten different objects. These results demonstrate the benefits of complex-valued deep learning with attention and contrastive learning for mmWave radar-based occluded object classification in industrial and automated environments.
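The hybrid loss described above, a weighted cross-entropy term combined with a supervised contrastive term, can be illustrated with a minimal sketch. The abstract does not give the formulation, so the class weights, the mixing coefficient `alpha`, the `temperature`, and all function names below are assumptions. The sketch also uses real-valued embeddings for brevity, whereas the paper works with complex-valued features.

```python
import math


def weighted_cross_entropy(logits, labels, class_weights):
    """Mean cross-entropy, each sample scaled by the weight of its true class."""
    total = 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += class_weights[y] * (log_sum_exp - row[y])
    return total / len(labels)


def supervised_contrastive(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss on L2-normalised (non-zero) embeddings."""
    unit = []
    for e in embeddings:
        norm = math.sqrt(sum(v * v for v in e))
        unit.append([v / norm for v in e])
    n = len(unit)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without a same-class partner contributes nothing
        sims = {
            j: sum(a * b for a, b in zip(unit[i], unit[j])) / temperature
            for j in range(n) if j != i
        }
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def hybrid_loss(logits, embeddings, labels, class_weights, alpha=0.5, temperature=0.1):
    """Assumed combination: weighted CE plus alpha times the contrastive term."""
    ce = weighted_cross_entropy(logits, labels, class_weights)
    con = supervised_contrastive(embeddings, labels, temperature)
    return ce + alpha * con
```

In this sketch the contrastive term is small when same-class embeddings cluster together and large when they are interleaved with other classes, while the cross-entropy term acts on the classifier logits as usual.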
Abstract:Millimeter-wave (mmWave) radar has emerged as a compact and powerful sensing modality for advanced perception tasks that leverage machine learning techniques. It is particularly effective in scenarios where vision-based sensors fail to capture reliable information, such as detecting occluded objects or distinguishing between different surface materials in indoor environments. Due to the non-linear characteristics of mmWave radar signals, deep learning-based methods are well suited for extracting relevant information from in-phase and quadrature (IQ) data. However, the current state of the art in IQ signal-based occluded-object and material classification still offers substantial potential for further improvement. In this paper, we propose a bidirectional cross-attention fusion network that combines IQ-signal and FFT-transformed radar features obtained by distinct complex-valued convolutional neural networks (CNNs). The proposed method achieves improved performance and robustness compared to standalone complex-valued CNNs. We achieve a near-perfect material classification accuracy of 99.92% on samples collected at the same sensor-to-surface distances used during training, and an improved accuracy of 67.38% on samples measured at previously unseen distances, demonstrating improved generalization ability across varying measurement conditions. Furthermore, the accuracy for occluded object classification improves from 91.99% using standalone complex-valued CNNs to 94.20% using our proposed approach.
Abstract:This paper introduces BIR-Adapter, a low-complexity blind image restoration adapter for diffusion models. The BIR-Adapter enables the utilization of the prior of pre-trained large-scale diffusion models on blind image restoration without training any auxiliary feature extractor. We take advantage of the robustness of pretrained models. We extract features from degraded images via the model itself and extend the self-attention mechanism with these degraded features. We introduce a sampling guidance mechanism to reduce hallucinations. We perform experiments on synthetic and real-world degradations and demonstrate that BIR-Adapter achieves competitive or better performance compared to state-of-the-art methods while having significantly lower complexity. Additionally, its adapter-based design enables integration into other diffusion models, enabling broader applications in image restoration tasks. We showcase this by extending a super-resolution-only model to perform better under additional unknown degradations.
Abstract:This paper presents a multimodal framework that integrates touch signals (contact points and surface normals) into 3D Gaussian Splatting (3DGS). Our approach enhances scene reconstruction, particularly under challenging conditions such as low lighting, limited camera viewpoints, and occlusions. Unlike visual-only methods, the proposed approach incorporates spatially selective touch measurements to refine both the geometry and appearance of the 3D Gaussian representation. To guide the touch exploration, we introduce a two-stage sampling scheme that initially probes sparse regions and then concentrates on high-uncertainty boundaries identified from the reconstructed mesh. A geometric loss is proposed to ensure surface smoothness, resulting in improved geometry. Experimental results across diverse scenarios show consistent improvements in geometric accuracy. In the most challenging case with severe occlusion, the Chamfer Distance is reduced by over 15x, demonstrating the effectiveness of integrating touch cues into 3D Gaussian Splatting. Furthermore, our approach maintains a fully online pipeline, underscoring its feasibility in visually degraded environments.
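For readers unfamiliar with the Chamfer Distance reported above, a minimal sketch follows. Several variants exist (squared or unsquared distances, sums or means), and the abstract does not say which one the paper uses. The symmetric mean-of-nearest-neighbour form and the function name `chamfer_distance` are assumptions chosen for illustration.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty point sets.

    Assumed variant: the mean Euclidean distance from each point to its
    nearest neighbour in the other set, summed over both directions.
    """
    def mean_nearest(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return mean_nearest(points_a, points_b) + mean_nearest(points_b, points_a)


# Toy usage with two tiny 3D point sets (illustrative values only).
reconstructed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
distance = chamfer_distance(reconstructed, reference)
```

This brute-force version compares every pair of points, so it only suits small point sets. Practical evaluations typically use a spatial index for the nearest-neighbour search.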
Abstract:In this report, we address the task of online mistake detection, which is vital in domains like industrial automation and education, where real-time video analysis allows human operators to correct errors as they occur. While previous work focuses on procedural errors involving action order, broader error types must be addressed for real-world use. We introduce an online mistake detection framework that handles both procedural and execution errors (e.g., motor slips or tool misuse). Upon detecting an error, we use a large language model (LLM) to generate explanatory feedback. Experiments on the HoloAssist benchmark confirm the effectiveness of our approach, which placed second on the mistake detection task.




Abstract:In this work, we propose a novel pipeline for face recognition and out-of-distribution (OOD) detection using short-range FMCW radar. The proposed system utilizes range-Doppler and micro range-Doppler images. The architecture features a primary path (PP) responsible for the classification of in-distribution (ID) faces, complemented by intermediate paths (IPs) dedicated to OOD detection. The network is trained in two stages. In the first stage, the PP is trained using triplet loss to optimize ID face classification. In the second stage, the PP is frozen, and the IPs, which comprise simple linear autoencoder networks, are trained specifically for OOD detection. Using our dataset generated with a 60 GHz FMCW radar, our method achieves an ID classification accuracy of 99.30% and an OOD detection AUROC of 96.91%.
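The triplet loss used in the first training stage can be illustrated with a minimal sketch. The standard hinge form with Euclidean distance is shown. The margin value, the embedding dimensionality, and the function name `triplet_loss` are assumptions, since the abstract does not specify them.

```python
import math


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on embedding vectors.

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is. `margin` is an illustrative value.
    """
    d_pos = math.dist(anchor, positive)  # anchor to same-identity sample
    d_neg = math.dist(anchor, negative)  # anchor to different-identity sample
    return max(0.0, d_pos - d_neg + margin)


# Toy 2D embeddings (illustrative values only).
easy = triplet_loss((0.0, 0.0), (0.0, 1.0), (3.0, 4.0))  # negative is far away
hard = triplet_loss((0.0, 0.0), (0.0, 1.0), (0.0, 1.5))  # negative is too close
```

Only the second triplet yields a non-zero loss, so only it would push the embeddings of different identities further apart during training.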




Abstract:This study proposes a novel approach for real-time facial expression recognition utilizing short-range frequency-modulated continuous-wave (FMCW) radar equipped with one transmit (Tx) and three receive (Rx) antennas. The system leverages four distinct modalities simultaneously: range-Doppler images (RDIs), micro range-Doppler images (micro-RDIs), range-azimuth images (RAIs), and range-elevation images (REIs). Our architecture integrates feature extractor blocks, intermediate feature extractor blocks, and a ResNet block to classify facial expressions into smile, anger, neutral, and no-face classes. Our model achieves an average classification accuracy of 98.91% on the dataset collected using a 60 GHz short-range FMCW radar. The proposed solution operates in real time in a person-independent manner, demonstrating the potential of low-cost FMCW radars for effective facial expression recognition in various applications.
Abstract:In practical use cases, editing a polygonal mesh can be faster than generating a new one, but it can still be challenging and time-consuming for users. Existing solutions for this problem tend to focus on a single task, either geometry or novel view synthesis, which often leads to disjointed results between the mesh and view. In this work, we propose LEMON, a mesh editing pipeline that combines neural deferred shading with localized mesh optimization. Our approach begins by identifying the most important vertices in the mesh for editing, utilizing a segmentation model to focus on these key regions. Given multi-view images of an object, we optimize a neural shader and a polygonal mesh while extracting the normal map and the rendered image from each view. By using these outputs as conditioning data, we edit the input images with a text-to-image diffusion model and iteratively update our dataset while deforming the mesh. This process results in a polygonal mesh that is edited according to the given text instruction, preserving the geometric characteristics of the initial mesh while focusing on the most significant areas. We evaluate our pipeline using the DTU dataset, demonstrating that it generates finely edited meshes more rapidly than the current state-of-the-art methods. We include our code and additional results in the supplementary material.




Abstract:This paper proposes a short-range FMCW radar-based facial authentication and out-of-distribution (OOD) detection framework. Our pipeline jointly estimates the correct classes for the in-distribution (ID) samples and detects the OOD samples to prevent their inaccurate prediction. Our reconstruction-based architecture consists of a main convolutional block with one encoder and multi-decoder configuration, and intermediate linear encoder-decoder parts. Together, these elements form an accurate human face classifier and a robust OOD detector. For our dataset, gathered using a 60 GHz short-range FMCW radar, our network achieves an average classification accuracy of 98.07% in identifying in-distribution human faces. As an OOD detector, it achieves an average Area Under the Receiver Operating Characteristic (AUROC) curve of 98.50% and an average False Positive Rate at 95% True Positive Rate (FPR95) of 6.20%. Also, our extensive experiments show that the proposed approach outperforms previous OOD detectors in terms of common OOD detection metrics.
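The two OOD metrics named above, AUROC and FPR95, can be computed from per-sample scores as in the following minimal sketch. It assumes that higher scores mean "more in-distribution" and that ID samples are treated as the positive class. Both conventions, along with the function names and the toy scores, are assumptions and are not taken from the paper.

```python
def auroc(id_scores, ood_scores):
    """Probability that a random ID sample outscores a random OOD sample.

    Ties count as half a win. This pairwise form equals the area under
    the ROC curve.
    """
    wins = 0.0
    for s_id in id_scores:
        for s_ood in ood_scores:
            if s_id > s_ood:
                wins += 1.0
            elif s_id == s_ood:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples accepted when at least 95% of ID samples are."""
    ranked = sorted(id_scores, reverse=True)
    k = (95 * len(ranked) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    threshold = ranked[k - 1]           # lowest score among the accepted ID samples
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)


# Toy scores (illustrative values only).
id_scores = [0.9, 0.8, 0.7, 0.6]
ood_scores = [0.65, 0.3, 0.2, 0.1]
area = auroc(id_scores, ood_scores)
fpr95 = fpr_at_95_tpr(id_scores, ood_scores)
```

A higher AUROC and a lower FPR95 both indicate better separation between ID and OOD samples, which matches how the figures in the abstracts are read.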