The millimeter wave (mmWave) has received considerable interest due to its expansive bandwidth and high frequency. However, a noteworthy challenge arises from its vulnerability to blockages, leading to reduced coverage and achievable rates. To address these limitations, a potential solution is to deploy distributed reconfigurable intelligent surfaces (RISs), which comprise many low-cost and passively reflected elements, and can facilitate the establishment of extra communication links. In this paper, we leverage stochastic geometry to investigate the ergodic coverage probability and the achievable rate in both distributed RISs-assisted single-cell and multi-cell mmWave wireless communication systems. Specifically, we first establish the system model considering the stochastically distributed blockages, RISs and users by the Poisson point process. Then we give the association criterion and derive the association probabilities, the distance distributions, and the conditional coverage probabilities for two cases of associations between base stations and users without or with RISs. Finally, we use Campbell's theorem and the total probability theorem to obtain the closed-form expressions of the ergodic coverage probability and the achievable rate. Simulation results verify the effectiveness of our analysis method, and demonstrate that by deploying distributed RISs, the ergodic coverage probability is significantly improved by approximately 50%, and the achievable rate is increased by more than 1.5 times.
In millimeter-wave communications, large-scale antenna arrays are commonly employed to mitigate obstacle occlusion and path loss. However, these large-scale arrays generate pencil-shaped beams, which necessitate a higher number of training beams to cover the desired space. This results in the heavy beam training overhead. Furthermore, as the antenna aperture increases, users are more likely to be situated in the near-field region of the base station (BS) antenna array. This motivates our investigation into the beam training problem in the near-field region to achieve efficient beam alignment. To address the high complexity and low identification accuracy of existing beam training techniques, we propose an efficient hashing multi-arm beam (HMB) training scheme for the near-field scenario. Specifically, we first design a set of sparse bases based on the polar domain sparsity of the near-field channel and construct a near-field single-beam training codebook. Then, the hash functions are chosen to construct the near-field multi-arm beam training codebook. Each multi-arm beam training codeword is used in a time slot until the predefined codebook is traversed. Finally, the soft decision and voting methods are applied to distinguish the signal from different BS and obtain the correctly aligned beams. In addition, we provide the logically rigorous proof of computational complexity. Simulation results show that our proposed near-field HMB training method can achieve 96.4% identification accuracy of the exhaustive beam training method and greatly reduce the training overhead to the logarithmic level. Furthermore, we verify its applicability under the far-field scenario as well.
Face aging has received continuous research attention over the past two decades. Although previous works on this topic have achieved impressive success, two longstanding problems remain unsettled: 1) generating diverse and plausible facial aging patterns at the target age stage; 2) measuring the rationality of identity variation between the original portrait and its syntheses with age progression or regression. In this paper, we introduce DLAT + , the first algorithm that can realize Diverse and Lifespan Age Transformation on human faces, where the diversity jointly manifests in the transformation of facial textures and shapes. Apart from the diversity mechanism embedded in the model, multiple consistency restrictions are leveraged to keep it away from counterfactual aging syntheses. Moreover, we propose a new metric to assess the rationality of Identity Deviation under Age Gaps (IDAG) between the input face and its series of age-transformed generations, which is based on statistical laws summarized from plenty of genuine face-aging data. Extensive experimental results demonstrate the uniqueness and effectiveness of our method in synthesizing diverse and perceptually reasonable faces across the whole lifetime.
Environment shifts and conflicts present significant challenges for learning-based sound event localization and detection (SELD) methods. SELD systems, when trained in particular acoustic settings, often show restricted generalization capabilities for diverse acoustic environments. Furthermore, it is notably costly to obtain annotated samples for spatial sound events. Deploying a SELD system in a new environment requires extensive time for re-training and fine-tuning. To overcome these challenges, we propose environment-adaptive Meta-SELD, designed for efficient adaptation to new environments using minimal data. Our method specifically utilizes computationally synthesized spatial data and employs Model-Agnostic Meta-Learning (MAML) on a pre-trained, environment-independent model. The method then utilizes fast adaptation to unseen real-world environments using limited samples from the respective environments. Inspired by the Learning-to-Forget approach, we introduce the concept of selective memory as a strategy for resolving conflicts across environments. This approach involves selectively memorizing target-environment-relevant information and adapting to the new environments through the selective attenuation of model parameters. In addition, we introduce environment representations to characterize different acoustic settings, enhancing the adaptability of our attenuation approach to various environments. We evaluate our proposed method on the development set of the Sony-TAU Realistic Spatial Soundscapes 2023 (STARSS23) dataset and computationally synthesized scenes. Experimental results demonstrate the superior performance of the proposed method compared to conventional supervised learning methods, particularly in localization.
We introduce Efficient Title Reranker via Broadcasting Query Encoder, a novel title reranking technique to achieve efficient title reranking 20x-40x faster than vanilla passage reranker. However, one of the challenges with the training of Efficient Title Reranker is the instability. Analyzing the issue, we found some very difficult ground truths might act as noisy labels causing accuracy to drop as well as some extreme values in model probability output causing nan. To address these issues, we introduce the Sigmoid Trick, a novel technique that reduces the gradient update of both cases resulting in better retrieval efficacy. Experiments showed the effectiveness of ETR and sigmoid trick as we achieved four state-of-the-art positions on the kilt knowledge benchmark.
Within the multimodal field, the key to integrating vision and language lies in establishing a good alignment strategy. Recently, benefiting from the success of self-supervised learning, significant progress has been made in multimodal semantic representation based on pre-trained models for vision and language. However, there is still room for improvement in visual semantic representation. The lack of spatial semantic coherence and vulnerability to noise makes it challenging for current pixel or patch-based methods to accurately extract complex scene boundaries. To this end, this paper develops superpixel as a comprehensive compact representation of learnable image data, which effectively reduces the number of visual primitives for subsequent processing by clustering perceptually similar pixels. To mine more precise topological relations, we propose a Multiscale Difference Graph Convolutional Network (MDGCN). It parses the entire image as a fine-to-coarse hierarchical structure of constituent visual patterns, and captures multiscale features by progressively merging adjacent superpixels as graph nodes. Moreover, we predict the differences between adjacent nodes through the graph structure, facilitating key information aggregation of graph nodes to reason actual semantic relations. Afterward, we design a multi-level fusion rule in a bottom-up manner to avoid understanding deviation by learning complementary spatial information at different regional scales. Our proposed method can be well applied to multiple downstream task learning. Extensive experiments demonstrate that our method is competitive with other state-of-the-art methods in visual reasoning. Our code will be released upon publication.
3D object-level mapping is a fundamental problem in robotics, which is especially challenging when object CAD models are unavailable during inference. In this work, we propose a framework that can reconstruct high-quality object-level maps for unknown objects. Our approach takes multiple RGB-D images as input and outputs dense 3D shapes and 9-DoF poses (including 3 scale parameters) for detected objects. The core idea of our approach is to leverage a learnt generative model for shape categories as a prior and to formulate a probabilistic, uncertainty-aware optimization framework for 3D reconstruction. We derive a probabilistic formulation that propagates shape and pose uncertainty through two novel loss functions. Unlike current state-of-the-art approaches, we explicitly model the uncertainty of the object shapes and poses during our optimization, resulting in a high-quality object-level mapping system. Moreover, the resulting shape and pose uncertainties, which we demonstrate can accurately reflect the true errors of our object maps, can also be useful for downstream robotics tasks such as active vision. We perform extensive evaluations on indoor and outdoor real-world datasets, achieving achieves substantial improvements over state-of-the-art methods. Our code will be available at https://github.com/TRAILab/UncertainShapePose.
6D pose estimation of textureless shiny objects has become an essential problem in many robotic applications. Many pose estimators require high-quality depth data, often measured by structured light cameras. However, when objects have shiny surfaces (e.g., metal parts), these cameras fail to sense complete depths from a single viewpoint due to the specular reflection, resulting in a significant drop in the final pose accuracy. To mitigate this issue, we present a complete active vision framework for 6D object pose refinement and next-best-view prediction. Specifically, we first develop an optimization-based pose refinement module for the structured light camera. Our system then selects the next best camera viewpoint to collect depth measurements by minimizing the predicted uncertainty of the object pose. Compared to previous approaches, we additionally predict measurement uncertainties of future viewpoints by online rendering, which significantly improves the next-best-view prediction performance. We test our approach on the challenging real-world ROBI dataset. The results demonstrate that our pose refinement method outperforms the traditional ICP-based approach when given the same input depth data, and our next-best-view strategy can achieve high object pose accuracy with significantly fewer viewpoints than the heuristic-based policies.
For learning-based sound event localization and detection (SELD) methods, different acoustic environments in the training and test sets may result in large performance differences in the validation and evaluation stages. Different environments, such as different sizes of rooms, different reverberation times, and different background noise, may be reasons for a learning-based system to fail. On the other hand, acquiring annotated spatial sound event samples, which include onset and offset time stamps, class types of sound events, and direction-of-arrival (DOA) of sound sources is very expensive. In addition, deploying a SELD system in a new environment often poses challenges due to time-consuming training and fine-tuning processes. To address these issues, we propose Meta-SELD, which applies meta-learning methods to achieve fast adaptation to new environments. More specifically, based on Model Agnostic Meta-Learning (MAML), the proposed Meta-SELD aims to find good meta-initialized parameters to adapt to new environments with only a small number of samples and parameter updating iterations. We can then quickly adapt the meta-trained SELD model to unseen environments. Our experiments compare fine-tuning methods from pre-trained SELD models with our Meta-SELD on the Sony-TAU Realistic Spatial Soundscapes 2023 (STARSSS23) dataset. The evaluation results demonstrate the effectiveness of Meta-SELD when adapting to new environments.