Recently, retrieval models based on dense representations have become dominant in passage retrieval tasks, owing to their outstanding ability to capture the semantics of input text compared with traditional sparse vector space models. A common practice of dense retrieval models is to exploit a dual-encoder architecture to represent a query and a passage independently. Though efficient, such a structure loses the interaction between the query-passage pair, resulting in inferior accuracy. To enhance the performance of dense retrieval models without loss of efficiency, we propose a GNN-encoder model in which query (passage) information is fused into passage (query) representations via graph neural networks constructed from queries and their top retrieved passages. By this means, we maintain a dual-encoder structure while retaining some interaction information between query-passage pairs in their representations, which enables us to achieve both efficiency and efficacy in passage retrieval. Evaluation results indicate that our method significantly outperforms existing models on the MSMARCO, Natural Questions and TriviaQA datasets, achieving a new state of the art on all three.
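The fusion idea above can be sketched with a toy example: a bipartite graph linking each query to its top-k retrieved passages, followed by one round of mean-aggregation message passing so that each side's embedding absorbs information from the other. All dimensions, the aggregation rule, and the normalization below are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_q, n_p, k = 8, 4, 10, 3

Q = rng.normal(size=(n_q, d))   # query embeddings from a dual encoder (toy)
P = rng.normal(size=(n_p, d))   # passage embeddings (toy)

# Graph edges: each query links to its top-k retrieved passages.
scores = Q @ P.T
topk = np.argsort(-scores, axis=1)[:, :k]

# One round of message passing: fuse query info into passage representations.
P_new = P.copy()
for qi in range(n_q):
    for pi in topk[qi]:
        P_new[pi] += Q[qi]
P_new /= np.linalg.norm(P_new, axis=1, keepdims=True)

# And symmetrically, fuse retrieved-passage info into query representations.
Q_new = np.stack([Q[qi] + P[topk[qi]].mean(axis=0) for qi in range(n_q)])
Q_new /= np.linalg.norm(Q_new, axis=1, keepdims=True)
```

Crucially, the fused vectors are still per-query and per-passage, so retrieval can remain a dual-encoder dot product at inference time.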
Visible light positioning (VLP) is an accurate indoor positioning technology that uses luminaires as transmitters. In particular, circular luminaires are a common source type for VLP, yet they are typically treated only as point sources for positioning, ignoring their geometric characteristics. In this paper, the arc feature of a circular luminaire and the coordinate information obtained via visible light communication (VLC) are jointly used for VLC-enabled indoor positioning, and a novel perspective arcs approach is proposed. The proposed approach does not rely on any inertial measurement unit and imposes no tilt-angle limitations on the user. First, a VLC-assisted perspective circle and arc algorithm (V-PCA) is proposed for a scenario in which a complete luminaire and an incomplete one can be captured by the user. Considering cases in which parts of the VLC links are blocked, an anti-occlusion VLC-assisted perspective arcs algorithm (OA-V-PA) is proposed. Simulation results show that the proposed indoor positioning algorithm can achieve a 95th-percentile positioning accuracy of around 10 cm. Moreover, an experimental prototype based on a mobile phone is implemented, in which a fused image processing method is proposed. Experimental results show that the average positioning error is less than 5 cm.
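A basic building block for exploiting the arc feature is recovering a circle's center and radius from points sampled on a partial arc. The sketch below uses a standard algebraic (Kasa) least-squares circle fit as a stand-in for that step; it is not the paper's V-PCA algorithm, and the sampled arc is synthetic.

```python
import numpy as np

def fit_circle(pts):
    """Algebraic (Kasa) least-squares fit of x^2 + y^2 + a*x + b*y + c = 0."""
    x, y = pts[:, 0], pts[:, 1]
    A = np.column_stack([x, y, np.ones_like(x)])
    rhs = -(x**2 + y**2)
    (a, b, c), *_ = np.linalg.lstsq(A, rhs, rcond=None)
    cx, cy = -a / 2, -b / 2
    r = np.sqrt(cx**2 + cy**2 - c)
    return cx, cy, r

# Points on a 120-degree arc of a circle centred at (2, 3) with radius 1.5,
# mimicking a partially visible circular luminaire in the image plane.
theta = np.linspace(0.3, 0.3 + 2 * np.pi / 3, 20)
pts = np.column_stack([2 + 1.5 * np.cos(theta), 3 + 1.5 * np.sin(theta)])
cx, cy, r = fit_circle(pts)
```

Even a minority arc suffices to recover the full circle parameters, which is what makes incomplete luminaires usable for positioning.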
Underwater visual perception is essential for underwater exploration, archeology, ecosystem monitoring and other applications. Low illumination, light reflection, scattering, absorption and suspended particles inevitably lead to critically degraded underwater image quality, which poses great challenges for recognizing objects in underwater images. Existing underwater enhancement methods, which aim to improve underwater visibility, suffer from poor image restoration performance and limited generalization ability. To reduce the difficulty of underwater image enhancement, we introduce the media transmission map as guidance to assist image enhancement. We formulate the interaction between the underwater visual images and the transmission map to obtain better enhancement results. Even with a simple and lightweight network configuration, the proposed method achieves an advanced result of 22.6 dB on the challenging Test-R90 benchmark while running 30 times faster than existing models. Comprehensive experimental results demonstrate the superiority of the method and its potential for underwater perception. The code is available at: https://github.com/GroupG-yk/MTUR-Net.
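The role of the transmission map can be illustrated with the classic image formation model it comes from: an observed image I mixes scene radiance J and ambient light A according to per-pixel transmission t, i.e. I = J*t + A*(1 - t), so knowing t lets one invert the degradation. This round-trip on synthetic data is only a sketch of why the map is useful guidance, not the paper's learned network.

```python
import numpy as np

rng = np.random.default_rng(1)
J = rng.uniform(0.2, 0.9, size=(4, 4, 3))   # clean scene radiance (synthetic)
t = rng.uniform(0.4, 1.0, size=(4, 4, 1))   # media transmission map
A = np.array([0.1, 0.5, 0.6])               # ambient veiling light, blue-green-ish

# Underwater image formation: degraded image from radiance and transmission.
I = J * t + A * (1 - t)

# Transmission-guided restoration (inverse of the model), with a floor t0
# to avoid amplifying noise where transmission is near zero.
t0 = 0.1
J_hat = (I - A) / np.maximum(t, t0) + A
```

In the noiseless synthetic case the inversion is exact; the difficulty in practice is that t must itself be estimated, which is why it serves as guidance rather than a closed-form solution.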
Conversational recommender systems (CRSs) have received extensive attention in recent years. However, most existing works focus on various deep learning models, which are largely limited by the requirement of large-scale human-annotated datasets. Such methods are not able to deal with cold-start scenarios in industrial products. To alleviate this problem, we propose FORCE, a Framework Of Rule-based Conversational Recommender system that helps developers quickly build CRS bots through simple configuration. We conduct experiments on two datasets in different languages and domains to verify its effectiveness and usability.
Most existing point cloud completion methods are only applicable to partial point clouds free of noise and outliers, which does not always hold in practice. In this paper, we propose an end-to-end network, named CS-Net, to complete point clouds contaminated by noise or containing outliers. In CS-Net, the completion and segmentation modules work collaboratively to promote each other, benefiting from our specifically designed cascaded structure. With the help of segmentation, a cleaner point cloud is fed into the completion module. We design a novel completion decoder that harnesses the labels obtained by segmentation together with farthest point sampling (FPS) to purify the point cloud and leverages KNN grouping for better generation. The completion and segmentation modules work alternately, sharing useful information with each other to gradually improve the quality of the prediction. To train our network, we build a dataset that simulates the real case in which incomplete point clouds contain outliers. Comprehensive experiments and comparisons against state-of-the-art completion methods demonstrate our superiority. We also compare with the scheme of segmentation followed by completion and with their end-to-end fusion, which further validates the efficacy of our design.
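Of the named components, farthest point sampling (FPS) is standard enough to sketch in a few lines: it greedily picks points that are maximally spread out, which is why, on a contaminated cloud, it tends to hit outliers early (and why pairing it with segmentation labels helps purification). The toy cloud and parameters below are illustrative only.

```python
import numpy as np

def farthest_point_sampling(pts, m, seed=0):
    """Greedy FPS: iteratively pick the point farthest from those chosen so far."""
    chosen = [seed]
    dist = np.linalg.norm(pts - pts[seed], axis=1)
    for _ in range(m - 1):
        nxt = int(np.argmax(dist))
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(pts - pts[nxt], axis=1))
    return np.array(chosen)

# A dense blob of 50 points plus two far-away outliers (indices 50 and 51).
rng = np.random.default_rng(0)
cloud = np.concatenate([rng.normal(0.0, 0.1, size=(50, 3)),
                        [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]])
idx = farthest_point_sampling(cloud, 4)
```

Because FPS alone pulls in the outliers, filtering by per-point segmentation labels before sampling is a natural way to keep the purified subset clean.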
Interactive robotic grasping using natural language is one of the most fundamental tasks in human-robot interaction. However, language can be a source of ambiguity, particularly when there is ambiguous visual or linguistic content. This paper investigates the use of object attributes in disambiguation and develops an interactive grasping system capable of effectively resolving ambiguities via dialogues. Our approach first predicts target scores and attribute scores through vision-and-language grounding. To handle ambiguous objects and commands, we propose an attribute-guided formulation of the partially observable Markov decision process (Attr-POMDP) for disambiguation. The Attr-POMDP utilizes target and attribute scores as the observation model to calculate the expected return of an attribute-based (e.g., "what is the color of the target, red or green?") or a pointing-based (e.g., "do you mean this one?") question. Our disambiguation module runs in real time on a real robot, and the interactive grasping system achieves a 91.43% selection accuracy in the real-robot experiments, outperforming several baselines by large margins.
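The value of an attribute question can be illustrated with a one-step Bayesian belief update: given grounding scores as a belief over candidates and attribute scores as the observation model, the expected posterior confidence after asking, averaged over the possible answers, tells the planner how informative the question is. The numbers and the "expected confidence" criterion below are illustrative, not the paper's exact Attr-POMDP return.

```python
import numpy as np

# Belief over 3 candidate objects, e.g. from vision-and-language grounding.
belief = np.array([0.5, 0.3, 0.2])
# P(candidate is red): a stand-in for the attribute observation model.
p_red = np.array([0.9, 0.1, 0.8])

def update(belief, likelihood):
    post = belief * likelihood
    return post / post.sum()

# Value of asking "is the target red?": expected confidence of the posterior,
# weighting each possible answer by its probability under the current belief.
p_yes = float(belief @ p_red)
post_yes = update(belief, p_red)
post_no = update(belief, 1.0 - p_red)
expected_confidence = p_yes * post_yes.max() + (1.0 - p_yes) * post_no.max()
```

Comparing this quantity across candidate questions (attribute-based vs. pointing-based) is one simple way a planner can decide what to ask next.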
Joint source-channel coding (JSCC) has attracted increasing attention due to its robustness and high efficiency. However, existing research on JSCC mainly focuses on minimizing the distortion between the transmitted and received information while limiting the required data rate. Therefore, even though the transmitted information is well recovered, the number of transmitted bits may be far more than the minimal threshold given by rate-distortion (RD) theory. In this paper, we propose an adaptive Information Bottleneck (IB) guided JSCC (AIB-JSCC), which aims at achieving the theoretically maximal compression ratio for a given reconstruction quality. In particular, we first derive a mathematically tractable form of the loss function for AIB-JSCC. To strike a better tradeoff between compression and reconstruction quality, we further propose an adaptive algorithm that dynamically adjusts the hyperparameter beta of the proposed loss function according to the distortion during training. Experimental results show that AIB-JSCC can significantly reduce the amount of transmitted data and improve reconstruction quality and downstream artificial-intelligence task performance.
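The adaptive-beta idea can be sketched as a feedback rule on a loss of the form distortion + beta * rate: when distortion is above a target, shrink beta to favor reconstruction; once the target is met, grow beta to push compression harder. The target, multiplicative step, and clipping bounds below are all hypothetical, not the paper's exact adaptation rule.

```python
def adapt_beta(beta, distortion, target=0.05, step=1.2, lo=1e-4, hi=10.0):
    """Toy schedule for beta in loss = distortion + beta * rate."""
    if distortion > target:   # reconstruction too poor -> weight distortion more
        beta /= step
    else:                     # quality target met -> push compression harder
        beta *= step
    return min(max(beta, lo), hi)

beta = 1.0
history = []
# Distortion typically falls as training proceeds; beta reacts accordingly.
for distortion in [0.20, 0.12, 0.07, 0.04, 0.03]:
    beta = adapt_beta(beta, distortion)
    history.append(round(beta, 4))
```

The resulting schedule first decreases beta while distortion exceeds the target, then increases it once the quality constraint is satisfied.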
Energy efficiency (EE) is an important aspect of satellite communications. Different from existing algorithms, which typically use the first-order Taylor lower-bound approximation to convert non-convex EE maximization (EEM) problems into convex ones, in this letter a two-step quadratic transformation method is presented. In the first step, the fractional form of the achievable rate over the total power consumption is converted into a non-fractional form based on the quadratic transformation. In the second step, the fractional form of the signal power over the interference-plus-noise power is further converted into a non-fractional form, again based on the quadratic transformation. After the two-step quadratic transformation, the original EEM problem is converted into an equivalent convex one. An alternating optimization algorithm is then presented to solve it by iteratively performing two stages until a stopping condition is satisfied. Simulation results show that the presented algorithm converges quickly and outperforms both the sequential convex approximation algorithm and the multibeam interference mitigation algorithm.
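The quadratic transformation underlying both steps can be checked numerically on a scalar ratio: for x >= 0, y > 0, x/y = max_t (2*t*sqrt(x) - t^2*y), attained at t = sqrt(x)/y. This is only a one-dimensional illustration of the transform; the letter applies it to the rate/power and signal/interference ratios of the full EEM problem.

```python
import numpy as np

x, y = 3.0, 2.0  # toy numerator (e.g. rate) and denominator (e.g. power)

# Closed-form optimal auxiliary variable of the quadratic transform.
t_star = np.sqrt(x) / y
val = 2 * t_star * np.sqrt(x) - t_star**2 * y   # should equal x / y

# Sanity check: a grid search over t cannot beat the closed-form optimum,
# mirroring the t-update stage of the alternating optimization.
ts = np.linspace(0.0, 2.0, 2001)
best = np.max(2 * ts * np.sqrt(x) - ts**2 * y)
```

In the alternating algorithm, the t-update is this closed form, and the other stage re-optimizes the original variables with t held fixed.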
Humans do not perceive all parts of a scene with the same resolution, but rather focus on a few regions of interest (ROIs). Traditional object-based codecs take advantage of this biological intuition and are capable of non-uniform bit allocation in favor of salient regions, at the expense of increased distortion in the remaining areas: such a strategy allows a boost in perceptual quality under low-rate constraints. Recently, several neural codecs have been introduced for video compression, yet they operate uniformly over all spatial locations, lacking the capability of ROI-based processing. In this paper, we introduce two models for ROI-based neural video coding. First, we propose an implicit model that is fed a binary ROI mask and is trained by de-emphasizing the distortion of the background. Second, we design an explicit latent scaling method that allows control over the quantization binwidth for different spatial regions of the latent variables, conditioned on the ROI mask. Through extensive experiments, we show that our methods outperform all our baselines in terms of rate-distortion (R-D) performance in the ROI. Moreover, they generalize to different datasets and to any arbitrary ROI at inference time. Finally, they do not require expensive pixel-level annotations during training, as synthetic ROI masks can be used with little to no degradation in performance. To the best of our knowledge, our proposals are the first solutions that integrate ROI-based capabilities into neural video compression models.
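The explicit latent scaling idea reduces, in its simplest form, to a per-location quantization binwidth: divide the latents by the binwidth, round, and rescale, with a smaller binwidth inside the ROI. The single-channel latent, mask, and binwidth values below are toy assumptions, not the paper's learned gains.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=(8, 8))            # latent variables (toy, single channel)
roi = np.zeros((8, 8))
roi[2:6, 2:6] = 1.0                    # binary ROI mask

# ROI-conditioned binwidth: finer quantization inside the ROI.
binwidth = np.where(roi > 0, 0.1, 1.0)
y_hat = np.round(y / binwidth) * binwidth   # quantize, then rescale

err = np.abs(y - y_hat)
roi_err = err[roi > 0].mean()
bg_err = err[roi == 0].mean()
```

The ROI incurs much smaller quantization error (at a higher local rate), which is exactly the non-uniform allocation an ROI codec is after.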
Robots in the real world frequently come across identical objects in dense clutter. When evaluating grasp poses in these scenarios, a target-driven grasping system requires knowledge of spatial relations between scene objects (e.g., proximity, adjacency, and occlusions). To efficiently complete this task, we propose a target-driven grasping system that simultaneously considers object relations and predicts 6-DoF grasp poses. A densely cluttered scene is first formulated as a grasp graph with nodes representing object geometries in the grasp coordinate frame and edges indicating spatial relations between the objects. We design a Grasp Graph Neural Network (G2N2) that evaluates the grasp graph and finds the most feasible 6-DoF grasp pose for a target object. Additionally, we develop a shape completion-assisted grasp pose sampling method that improves sample quality and consequently grasping efficiency. We compare our method against several baselines in both simulated and real settings. In real-world experiments with novel objects, our approach achieves a 77.78% grasping accuracy in densely cluttered scenarios, surpassing the best-performing baseline by more than 15%. Supplementary material is available at https://sites.google.com/umn.edu/graph-grasping.
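The first step of the pipeline above, formulating a scene as a grasp graph, can be sketched by connecting object nodes whose centroids fall within a proximity threshold. The 2-D centroids and threshold are illustrative; the actual G2N2 nodes carry object geometries and the edges encode richer spatial relations than plain proximity.

```python
import numpy as np

# Toy scene: object centroids in the grasp coordinate frame (metres).
centroids = np.array([[0.00, 0.00],
                      [0.04, 0.01],    # close to object 0 -> proximity edge
                      [0.30, 0.25]])   # isolated object

# Grasp-graph adjacency: connect objects closer than a threshold, no self-loops.
thresh = 0.10
D = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=-1)
adj = (D < thresh) & ~np.eye(len(centroids), dtype=bool)
```

A graph neural network can then pass messages along these edges so that the grasp score of a target reflects its cluttered neighbors, not just its own geometry.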