Interactive visual grounding in Human-Robot Interaction (HRI) is challenging yet practical due to the inevitable ambiguity in natural language. It requires robots to disambiguate the user input by active information gathering. Previous approaches often rely on predefined templates to ask disambiguation questions, resulting in performance reduction in realistic interactive scenarios. In this paper, we propose TiO, an end-to-end system for interactive visual grounding in human-robot interaction. Benefiting from a unified formulation of visual dialogue and grounding, our method can be trained on a joint collection of extensive public datasets and shows superior generality to diversified and challenging open-world scenarios. In the experiments, we validate TiO on the GuessWhat?! and InViG benchmarks, setting new state-of-the-art performance by a clear margin. Moreover, we conduct HRI experiments on 150 carefully selected challenging scenes as well as on real-robot platforms. Results show that our method demonstrates superior generality to diversified visual and language inputs with a high success rate. Code and demos are available at https://github.com/jxu124/TiO.
Videos captured from multiple viewpoints can help in perceiving the 3D structure of the world and benefit computer vision tasks such as action recognition and tracking. In this paper, we present a method for self-supervised learning from synchronized multi-view videos. We use a cross-view reconstruction task to inject geometry information into the model. Our approach is based on the masked autoencoder (MAE) framework. In addition to the same-view decoder, we introduce a separate cross-view decoder, which leverages a cross-attention mechanism to reconstruct a target-viewpoint video from a source-viewpoint video, making the learned representations robust to viewpoint changes. For videos, static regions can be reconstructed trivially, which hinders learning meaningful representations. To tackle this, we introduce a motion-weighted reconstruction loss that improves temporal modeling. We report state-of-the-art results on the NTU-60, NTU-120 and ETRI datasets, as well as in the transfer learning setting on the NUCLA, PKU-MMD-II and ROCOG-v2 datasets, demonstrating the robustness of our approach. Code will be made available.
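The idea behind a motion-weighted reconstruction loss can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how the weights are computed, so the frame-difference weighting and the names `motion_weights` and `motion_weighted_mse` are hypothetical and not the authors' implementation.

```python
import numpy as np

def motion_weights(video, eps=1e-6):
    """Per-pixel weights from absolute temporal differences (an assumed choice).

    video: array of shape (T, H, W) with T >= 2.
    Returns an array of the same shape whose entries sum to 1.
    """
    diff = np.abs(np.diff(video, axis=0))            # (T-1, H, W)
    diff = np.concatenate([diff[:1], diff], axis=0)  # reuse first difference for frame 0
    w = diff + eps                                   # eps keeps static pixels from vanishing
    return w / w.sum()

def motion_weighted_mse(pred, target):
    """Squared reconstruction error, weighted toward moving regions of the target."""
    w = motion_weights(target)
    return float(np.sum(w * (pred - target) ** 2))
```

With this weighting, an error placed on a moving pixel costs far more than the same error on a static background pixel, which is the behaviour the abstract motivates.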
This paper investigates spectrum sharing between a multiple-input single-output (MISO) secure communication system and a multiple-input multiple-output (MIMO) radar system in the presence of one suspicious eavesdropper. We jointly design the radar waveform and the communication beamforming vector of the two systems such that the interference between the base station (BS) and the radar is reduced, while the otherwise detrimental radar interference is enhanced to jam the eavesdropper, thereby increasing secure information transmission performance. In particular, by considering imperfect channel state information (CSI) for the user and the eavesdropper, we maximize the worst-case secrecy rate at the user while ensuring the detection performance of the radar system. To tackle this challenging problem, we propose a two-layer robust cooperative algorithm based on the S-lemma and semidefinite relaxation techniques. Simulation results demonstrate that the proposed algorithm achieves significant secrecy rate gains over the non-robust scheme. Furthermore, we illustrate the trade-off between secrecy rate and detection probability.
The efficacy of large language models (LLMs) in domain-specific medicine, particularly for managing complex diseases such as osteoarthritis (OA), remains largely unexplored. This study focused on evaluating and enhancing the clinical capabilities of LLMs in specific domains, using OA management as a case study. A domain-specific benchmark framework was developed, which evaluates LLMs across a spectrum from domain-specific knowledge to clinical applications in real-world clinical scenarios. DocOA, a specialized LLM tailored for OA management that integrates retrieval-augmented generation (RAG) and instruction prompts, was developed. The study compared the performance of GPT-3.5, GPT-4, and the specialized assistant DocOA using objective and human evaluations. Results showed that general LLMs like GPT-3.5 and GPT-4 were less effective in the specialized domain of OA management, particularly in providing personalized treatment recommendations. However, DocOA showed significant improvements. This study introduces a novel benchmark framework that assesses the domain-specific abilities of LLMs in multiple aspects, highlights the limitations of generalized LLMs in clinical contexts, and demonstrates the potential of tailored approaches for developing domain-specific medical LLMs.
This article presents a novel multi-functional system for a sixth-generation (6G) wireless network with integrated sensing, communication, and powering (ISCAP), which unifies integrated sensing and communication (ISAC) and wireless information and power transfer (WIPT) techniques. The multi-functional ISCAP network promises to enhance resource utilization efficiency, reduce network costs, and improve overall performance through versatile operational modes. Specifically, a multi-functional base station (BS) can enable multi-functional transmission by exploiting the same radio signals to perform target/environment sensing, wireless communication, and wireless power transfer (WPT) simultaneously. Besides, the three functions can be intelligently coordinated to pursue mutual benefits, i.e., wireless sensing can be leveraged to enable light-training or even training-free WIPT by providing side-channel information, and the BS can utilize WPT to wirelessly charge low-power devices to ensure sustainable ISAC. Furthermore, multiple multi-functional BSs can cooperate in both transmission and reception phases for efficient interference management, multi-static sensing, and distributed energy beamforming. For these operational modes, we discuss the technical challenges and potential solutions, particularly focusing on the fundamental performance tradeoff limits, transmission protocol design, and waveform and beamforming optimization. Finally, interesting research directions are identified.
Federated learning (FL) with noisy labels poses a significant challenge. Existing methods designed for handling noisy labels in centralized learning tend to lose their effectiveness in the FL setting, mainly due to the small dataset size and the heterogeneity of client data. While some attempts have been made to tackle FL with noisy labels, they primarily focused on scenarios involving class-conditional noise. In this paper, we study the more challenging and practical issue of instance-dependent noise (IDN) in FL. We introduce a novel algorithm called FedBeat (Federated Learning with Bayesian Ensemble-Assisted Transition Matrix Estimation). FedBeat aims to build a global statistically consistent classifier using the IDN transition matrix (IDNTM) and comprises three synergistic steps: (1) a federated data extraction step that constructs a weak global model and extracts high-confidence data using a Bayesian model ensemble method; (2) a federated transition matrix estimation step in which clients collaboratively train an IDNTM estimation network based on the extracted data; and (3) a federated classifier correction step that enhances the global model's performance by training it using a loss function tailored for noisy labels, leveraging the IDNTM. Experiments conducted on CIFAR-10 and SVHN verify that the proposed method significantly outperforms state-of-the-art methods.
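The classifier correction step can be sketched with a forward-corrected loss. This is an assumption, since the abstract does not state the exact loss FedBeat uses: forward correction is one common way to use a transition matrix. The function names and the convention `T[n, i, j] = P(noisy = j | clean = i, x_n)` are hypothetical choices made for illustration.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward_corrected_nll(logits, T, noisy_labels):
    """Negative log-likelihood of noisy labels under a per-instance transition matrix.

    logits:       (N, C) classifier outputs over clean classes
    T:            (N, C, C) instance-dependent transition matrices (rows sum to 1)
    noisy_labels: (N,) observed, possibly corrupted, integer labels
    """
    clean_post = softmax(logits)                          # P(clean | x), shape (N, C)
    noisy_post = np.einsum("ni,nij->nj", clean_post, T)   # P(noisy | x), shape (N, C)
    picked = noisy_post[np.arange(len(noisy_labels)), noisy_labels]
    return float(-np.mean(np.log(picked + 1e-12)))        # small constant avoids log(0)
```

When every `T[n]` is the identity, the expression reduces to ordinary cross-entropy. Otherwise the classifier is trained to explain the noisy labels through the estimated noise process, so its own output can remain an estimate of the clean posterior.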
The increasing demand for wireless communication underscores the need to optimize radio frequency spectrum utilization. An effective strategy for leveraging underutilized licensed frequency bands is cooperative spectrum sensing (CSS), which enables multiple secondary users (SUs) to collaboratively detect the spectrum usage of primary users (PUs) prior to accessing the licensed spectrum. The increasing popularity of machine learning has led to a shift from traditional CSS methods to those based on deep learning. However, deep learning-based CSS methods often rely on centralized learning, posing challenges such as communication overhead and data privacy risks. Recent research suggests vertical federated learning (VFL) as a potential solution, with its core concept centered on partitioning the deep neural network into distinct segments, each of which is trained separately. However, existing VFL-based CSS works do not fully address the practical challenges arising from streaming data and objective shift. In this work, we introduce online vertical federated learning (OVFL), a robust framework designed to address the challenges of ongoing data streams and shifting learning goals. Our theoretical analysis reveals that OVFL achieves a sublinear regret bound, thereby evidencing its efficiency. Empirical results from our experiments show that OVFL outperforms benchmarks in CSS tasks. We also explore the impact of various parameters on the learning performance.
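For readers unfamiliar with the term, a sublinear regret bound can be read against the standard online-learning definition below. The symbols $f_t$ (loss at round $t$), $w_t$ (model at round $t$), and $R_T$ (regret after $T$ rounds) are notation assumed here for illustration. The abstract does not state which regret notion or rate OVFL achieves, and a setting with shifting objectives may call for a dynamic variant.

```latex
R_T \;=\; \sum_{t=1}^{T} f_t(w_t) \;-\; \min_{w} \sum_{t=1}^{T} f_t(w),
\qquad
\text{sublinear regret:}\quad \lim_{T \to \infty} \frac{R_T}{T} = 0 .
```

Under this reading, the average per-round loss of the online learner approaches that of the best fixed model in hindsight as the data stream grows.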
Neural radiance fields (NeRF) is a promising approach for generating photorealistic images and representing complex scenes. However, when processing data sequentially, it can suffer from catastrophic forgetting, where previous data is easily forgotten after training with new data. Existing incremental learning methods using knowledge distillation assume that continuous data chunks contain both 2D images and corresponding camera pose parameters, pre-estimated from the complete dataset. This poses a paradox as the necessary camera pose must be estimated from the entire dataset, even though the data arrives sequentially and future chunks are inaccessible. In contrast, we focus on a practical scenario where camera poses are unknown. We propose IL-NeRF, a novel framework for incremental NeRF training, to address this challenge. IL-NeRF's key idea lies in selecting a set of past camera poses as references to initialize and align the camera poses of incoming image data. This is followed by a joint optimization of camera poses and replay-based NeRF distillation. Our experiments on real-world indoor and outdoor scenes show that IL-NeRF handles incremental NeRF training and outperforms the baselines by up to $54.04\%$ in rendering quality.
Path signatures have been proposed as a powerful representation of paths that efficiently captures the path's analytic and geometric characteristics, having useful algebraic properties including fast concatenation of paths through tensor products. Signatures have recently been widely adopted in machine learning problems for time series analysis. In this work we establish connections between value functions typically used in optimal control and intriguing properties of path signatures. These connections motivate our novel control framework with signature transforms that efficiently generalizes the Bellman equation to the space of trajectories. We analyze the properties and advantages of the framework, termed signature control. In particular, we demonstrate that (i) it can naturally deal with varying/adaptive time steps; (ii) it propagates higher-level information more efficiently than value function updates; (iii) it is robust to dynamical system misspecification over long rollouts. As a specific case of our framework, we devise a model predictive control method for path tracking. This method generalizes integral control, being suitable for problems with unknown disturbances. The proposed algorithms are tested in simulation, with differentiable physics models including typical control and robotics tasks such as point-mass, curve following for an ant model, and a robotic manipulator.
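The concatenation property mentioned above (Chen's identity) can be illustrated with a depth-2 truncated signature of a piecewise-linear path. This is a minimal sketch for intuition only, not the signature control framework itself; the function names are hypothetical, and the truncation at depth 2 is chosen here for brevity.

```python
import numpy as np

def segment_signature(delta):
    """Depth-2 signature of a straight segment with increment `delta`."""
    delta = np.asarray(delta, dtype=float)
    return delta, np.outer(delta, delta) / 2.0

def chen(sig_a, sig_b):
    """Signature of path a followed by path b (Chen's identity, truncated at depth 2)."""
    a1, a2 = sig_a
    b1, b2 = sig_b
    return a1 + b1, a2 + b2 + np.outer(a1, b1)

def path_signature(points):
    """Depth-2 signature of the piecewise-linear path through `points`, shape (n, d)."""
    points = np.asarray(points, dtype=float)
    d = points.shape[1]
    sig = (np.zeros(d), np.zeros((d, d)))  # signature of the trivial (constant) path
    for delta in np.diff(points, axis=0):
        sig = chen(sig, segment_signature(delta))
    return sig
```

The first level is the total displacement, and the antisymmetric part of the second level is the Lévy area of the path. Because the signature of a concatenated path is obtained from the signatures of its pieces, a long trajectory can be summarized incrementally without revisiting earlier segments.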