We propose a framework for monostatic sensing by a user equipment (UE), aided by a reconfigurable intelligent surface (RIS), in environments with single- and double-bounce signal propagation. We design appropriate UE-side precoding and combining to facilitate signal separation. We derive the adaptive detection probabilities of the resolvable signals based on the geometric channel parameters of the links. Then, we localize the passive objects using both the double-bounce signals via the passive RIS (i.e., RIS-sensing) and the single-bounce multipath arriving directly from the objects (i.e., non-RIS-sensing), based on a mapping filter. Finally, we provide numerical results demonstrating that effective sensing can be achieved through the proposed framework.
In the upcoming sixth generation (6G) of wireless communication systems, reconfigurable intelligent surfaces~(RISs) are regarded as one of the most promising technological enablers, as they can provide programmable signal propagation. Consequently, simultaneous radio localization and mapping (SLAM) with RISs is emerging as a research direction within the 6G ecosystem. In this paper, we propose a novel framework for RIS-enabled radio SLAM that operates without the intervention of access points (APs). We first design the RIS phase profiles leveraging prior information about the user equipment~(UE), such that they uniformly illuminate the angular sector where the UE is probabilistically located. Second, we modify the marginal Poisson multi-Bernoulli SLAM filter to estimate the UE state and landmarks, which enables efficient mapping of the radio propagation environment. Third, we derive the theoretical Cram\'er-Rao lower bounds on the estimators of the channel parameters and the UE state. We finally evaluate the performance of the proposed method in scenarios with a limited number of transmissions, taking the channel coherence time into account. Our results demonstrate that the RIS enables solving the radio SLAM problem with zero APs, and that accounting for the Doppler shift improves the UE speed estimates.
Reconfigurable intelligent surfaces (RISs) are expected to be a key component enabling the mobile network evolution towards a flexible and intelligent 6G wireless platform. In most research works so far, the RIS has been treated as a passive base station (BS) with a known state, in terms of its location and orientation, to boost communication and/or terminal positioning performance. However, such performance gains can no longer be guaranteed when the RIS state is not perfectly known. In this paper, by taking the RIS state uncertainty into account, we formulate and study the performance of a joint RIS calibration and user positioning (JrCUP) scheme. From the Fisher information perspective, we formulate the JrCUP problem in a network-centric single-input multiple-output (SIMO) scenario with a single BS, and derive analytical lower bounds for the states of both the user and the RIS. We also demonstrate the geometric impact of different user locations on JrCUP performance and characterize the performance under different RIS sizes. Finally, the study is extended to a multi-user scenario, which is shown to further improve state estimation performance.
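For reference, Fisher-information analyses of this kind rest on the standard Cram\'er-Rao inequality; in the generic notation below (chosen here for illustration, not taken from the paper), $\mathbf{J}$ is the Fisher information matrix of the unknown parameter vector $\boldsymbol{\theta}$ given observations $\mathbf{y}$:

```latex
% Cramér-Rao lower bound: the covariance of any unbiased estimator
% is bounded below by the inverse Fisher information matrix (FIM).
\mathrm{cov}(\hat{\boldsymbol{\theta}}) \succeq \mathbf{J}(\boldsymbol{\theta})^{-1},
\qquad
[\mathbf{J}(\boldsymbol{\theta})]_{ij}
  = \mathbb{E}\!\left[
      \frac{\partial \ln p(\mathbf{y};\boldsymbol{\theta})}{\partial \theta_i}\,
      \frac{\partial \ln p(\mathbf{y};\boldsymbol{\theta})}{\partial \theta_j}
    \right].
```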
Modeling speaker variability is a key challenge for automatic speech recognition (ASR) systems. In this paper, learning hidden unit contributions (LHUC)-based adaptation techniques with compact speaker-dependent (SD) parameters are used to facilitate both speaker adaptive training (SAT) and unsupervised test-time speaker adaptation for end-to-end (E2E) lattice-free MMI (LF-MMI) models. An unsupervised model-based adaptation framework is proposed to estimate the SD parameters in the E2E paradigm using the LF-MMI and cross-entropy (CE) criteria. Various regularization methods for standard LHUC adaptation, e.g., Bayesian LHUC (BLHUC) adaptation, are systematically investigated to mitigate the risk of overfitting, using E2E LF-MMI CNN-TDNN and CNN-TDNN-BLSTM models. Lattice-based confidence score estimation is used for adaptation data selection to reduce supervision label uncertainty. Experiments on the 300-hour Switchboard task suggest that applying BLHUC in the proposed unsupervised E2E adaptation framework to byte pair encoding (BPE) based E2E LF-MMI systems consistently outperformed the baseline systems, with relative word error rate (WER) reductions of up to 10.5% and 14.7% on the NIST Hub5'00 and RT03 evaluation sets, achieving best WERs of 9.0% and 9.7%, respectively. These results are comparable to those of state-of-the-art adapted LF-MMI hybrid systems and adapted Conformer-based E2E systems.
In 5G/6G wireless systems, a reconfigurable intelligent surface (RIS) can serve as a passive anchor to enable and enhance localization in various scenarios. However, most existing RIS-aided localization works assume that the geometry of the RIS is perfectly known, which is unrealistic in practice due to calibration errors. In this work, we derive the misspecified Cram\'er-Rao bound (MCRB) for a single-input single-output RIS-aided localization system with RIS geometry mismatch. Specifically, unlike most existing works that rely on numerical methods, we propose a closed-form solution to the pseudo-true parameter determination problem underlying the MCRB analysis. Simulation results demonstrate the validity of the derived pseudo-true parameters and MCRB, and show that RIS geometry mismatch causes performance saturation in the high signal-to-noise-ratio regime.
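As background, in MCRB analysis the pseudo-true parameter is conventionally defined as the point in the misspecified model family closest, in Kullback-Leibler divergence, to the true data distribution; the generic notation below is illustrative and not taken from the paper:

```latex
% Pseudo-true parameter: the misspecified model \tilde{p}(y;\theta)
% that best matches the true distribution p(y) in KL divergence.
\boldsymbol{\theta}_0
  = \arg\min_{\boldsymbol{\theta}}
    D_{\mathrm{KL}}\big(p(\mathbf{y}) \,\|\, \tilde{p}(\mathbf{y};\boldsymbol{\theta})\big).
```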
The ground plane prior is a highly informative geometric cue in monocular 3D object detection (M3OD), yet it has been neglected by most mainstream methods. In this paper, we identify two key factors that limit its applicability: the projection point localization issue and the ground plane tilt issue. To exploit the ground plane prior for M3OD, we propose a Ground Plane Enhanced Network (GPENet) that resolves both issues at once. For the projection point localization issue, instead of using the bottom vertices or bottom center of the 3D bounding box (BBox), we leverage the object's ground contact points, which are explicit pixels in the image and easy for a neural network to detect. For the ground plane tilt issue, GPENet estimates the horizon line in the image and derives a novel mathematical expression to accurately estimate the ground plane equation. An unsupervised vertical edge mining algorithm is also proposed to handle occlusion of the horizon line. Furthermore, we design a novel 3D BBox deduction method based on a dynamic back-projection algorithm, which takes advantage of the accurate contact points and the ground plane equation. Additionally, using only M3OD labels, contact point and horizon line pseudo-labels can be generated with no extra data collection or annotation cost. Extensive experiments on the popular KITTI benchmark show that GPENet outperforms other methods and achieves state-of-the-art performance, demonstrating the effectiveness and superiority of the proposed approach. Moreover, GPENet also outperforms other methods in cross-dataset evaluation on the nuScenes dataset. Our code and models will be published.
Among the key differentiators of 6G compared to 5G will be an increased emphasis on radio-based positioning and sensing. These will be utilized not only for conventional location-aware services and for enhancing communication performance, but also to support new use-case families with extreme performance requirements. This paper presents a unified vision from stakeholders across the value chain in terms of both opportunities and challenges for 6G positioning and sensing, as well as use cases, performance requirements, and a gap analysis. Combined, these motivate the technical advances in 6G and guide system design.
Existing multimodal tasks mostly target the complete-input-modality setting, i.e., each modality is either complete or entirely missing in both the training and test sets. However, the case of randomly missing modalities remains underexplored. In this paper, we present MM-Align, a novel approach to the missing-modality inference problem. Concretely, we propose 1) an alignment dynamics learning module, based on the theory of optimal transport (OT), for indirect missing-data imputation; and 2) a denoising training algorithm that simultaneously enhances the imputation results and the backbone network's performance. In contrast to previous methods devoted to reconstructing the missing inputs, MM-Align learns to capture and imitate the alignment dynamics between modality sequences. Comprehensive experiments on three datasets covering two multimodal tasks empirically demonstrate that our method achieves more accurate and faster inference and alleviates overfitting under various missing conditions.
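For readers unfamiliar with OT-based alignment, a transport plan coupling two sequences can be computed with the classic entropic-regularized Sinkhorn iterations. The NumPy sketch below is a generic illustration of that primitive, not the MM-Align implementation; the cost matrix, marginals, and regularization strength are all illustrative assumptions:

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropic-regularized optimal transport via Sinkhorn iterations.

    cost : (m, n) pairwise cost matrix between elements of two sequences
    a, b : marginal weights of the two sequences (each summing to 1)
    Returns the (m, n) transport plan coupling the sequences.
    """
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):         # alternating marginal projections
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

# toy example: couple two short 1-D feature sequences
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.5, 1.5])
C = (x[:, None] - y[None, :]) ** 2   # squared-distance cost
P = sinkhorn(C, np.full(3, 1 / 3), np.full(2, 1 / 2))
print(P.sum())  # ~1.0: a valid coupling
```

The resulting plan `P` tells how much mass of each element of one sequence is matched to each element of the other, which is the kind of cross-modal correspondence an alignment module can exploit.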
Generative Knowledge Graph Construction (KGC) refers to methods that leverage the sequence-to-sequence framework for building knowledge graphs; such methods are flexible and can be adapted to a wide range of tasks. In this study, we summarize recent compelling progress in generative knowledge graph construction. We present the strengths and weaknesses of each paradigm with respect to different generation targets and provide theoretical insight and empirical analysis. Based on this review, we suggest promising research directions for the future. Our contributions are threefold: (1) we present a detailed, complete taxonomy of generative KGC methods; (2) we provide a theoretical and empirical analysis of generative KGC methods; and (3) we propose several research directions for future development.
Self-training methods have been explored extensively in recent years and have shown great promise for improving semi-supervised learning. This work presents a Simple instance-Adaptive self-Training method (SAT) for semi-supervised text classification. SAT first generates two augmented views of each unlabeled sample and then trains a meta-learner to automatically identify the relative strength of the augmentations based on the similarity between the original view and the augmented views. The weakly-augmented view is fed to the model to produce a pseudo-label, and the strongly-augmented view is used to train the model to predict the same pseudo-label. We conducted extensive experiments and analyses on three text classification datasets and found that, across varying amounts of labeled training data, SAT consistently performs competitively with existing semi-supervised learning methods. Our code can be found at \url{https://github.com/declare-lab/SAT.git}.
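The weak/strong consistency objective described above follows the general pseudo-labeling recipe. The NumPy sketch below illustrates that recipe in its common thresholded form; the confidence threshold and masking are illustrative assumptions, not SAT's meta-learned augmentation weighting:

```python
import numpy as np

def pseudo_label_loss(p_weak, p_strong, threshold=0.95):
    """Consistency loss on unlabeled data, in the common thresholded form.

    p_weak   : (N, C) class probabilities for weakly-augmented views
    p_strong : (N, C) class probabilities for strongly-augmented views
    A pseudo-label is kept only when the weak view's top confidence
    exceeds the threshold; the strong view is then trained (via
    cross-entropy) to predict that pseudo-label.
    """
    pseudo = p_weak.argmax(axis=1)              # hard pseudo-labels
    mask = p_weak.max(axis=1) >= threshold      # keep only confident ones
    if not mask.any():
        return 0.0
    ce = -np.log(p_strong[np.arange(len(pseudo)), pseudo] + 1e-12)
    return float((ce * mask).sum() / mask.sum())

# toy example: only the first (confident) sample contributes to the loss
p_w = np.array([[0.98, 0.02], [0.60, 0.40]])
p_s = np.array([[0.90, 0.10], [0.50, 0.50]])
print(pseudo_label_loss(p_w, p_s))  # -log(0.90) ≈ 0.105
```

SAT replaces the fixed weak/strong assignment with a similarity-based meta-learner, but the training signal on unlabeled data has this same shape.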