In this paper, we propose BeamSense, a novel approach to implementing standard-compliant Wi-Fi sensing applications. Wi-Fi sensing enables game-changing applications in remote healthcare, home entertainment, and home surveillance, among others. However, existing work relies on manually extracting channel state information (CSI) from Wi-Fi chips to classify activities, which is not supported by the Wi-Fi standard and hence requires specialized equipment. In contrast, BeamSense leverages the standard-compliant beamforming feedback information (BFI) to characterize the propagation environment. Unlike CSI, the BFI (i) can be easily recorded without any firmware modification, and (ii) captures the multiple channels between the access point and the stations, thus providing much better sensitivity. BeamSense includes a novel cross-domain few-shot learning (FSL) algorithm to handle unseen environments and subjects with few additional data points. We evaluate BeamSense through an extensive data collection campaign with three subjects performing twenty different activities in three different environments. We show that our BFI-based approach achieves about 10% higher accuracy than CSI-based prior work, while our FSL strategy improves accuracy by up to 30% and 80% when compared with state-of-the-art cross-domain algorithms.
In this article, we study Euler characteristic techniques in topological data analysis. Pointwise computing the Euler characteristic of a family of simplicial complexes built from data gives rise to the so-called Euler characteristic profile. We show that this simple descriptor achieves state-of-the-art performance in supervised tasks at a very low computational cost. Inspired by signal analysis, we compute hybrid transforms of Euler characteristic profiles. These integral transforms mix Euler characteristic techniques with Lebesgue integration to provide highly efficient compressors of topological signals. As a consequence, they show remarkable performance in unsupervised settings. On the qualitative side, we provide numerous heuristics on the topological and geometric information captured by Euler profiles and their hybrid transforms. Finally, we prove stability results for these descriptors as well as asymptotic guarantees in random settings.
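As an illustration of the Euler characteristic profile mentioned in the abstract above, the following is a minimal sketch for the one-parameter case. It assumes a Vietoris-Rips filtration truncated at dimension 2 as the family of simplicial complexes; this choice, the function names `rips_simplices` and `euler_profile`, and the toy point cloud are assumptions made for illustration and are not taken from the article.

```python
from itertools import combinations
from math import dist


def rips_simplices(points, max_dim=2):
    """Return (simplex, filtration value) pairs of a Vietoris-Rips filtration.

    Vertices enter at 0; any other simplex enters at the length of its
    longest edge. Simplices above max_dim are not generated (an assumption
    made here to keep the example small).
    """
    n = len(points)
    simplices = [((i,), 0.0) for i in range(n)]
    for k in range(2, max_dim + 2):  # k vertices span a simplex of dimension k - 1
        for simplex in combinations(range(n), k):
            value = max(
                dist(points[i], points[j]) for i, j in combinations(simplex, 2)
            )
            simplices.append((simplex, value))
    return simplices


def euler_profile(simplices, thresholds):
    """Evaluate chi(t) = sum of (-1)^dim over simplices with value <= t."""
    return [
        sum((-1) ** (len(s) - 1) for s, value in simplices if value <= t)
        for t in thresholds
    ]


# Toy data: the four corners of a unit square.
points = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
profile = euler_profile(rips_simplices(points), thresholds=[0.5, 1.1, 1.5])
print(profile)  # [4, 0, 2]
```

At threshold 0.5 only the four vertices are present, at 1.1 the four sides close a loop, and at 1.5 the diagonals and triangles appear. The final value of 2 is an artifact of truncating at dimension 2, since the tetrahedron that would fill the complex is never added.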
Image matching is a classic and fundamental task in computer vision. In this paper, under the hypothesis that the areas outside the co-visible regions carry little information, we propose a matching key-points crop (MKPC) algorithm. The MKPC locates, proposes, and crops the critical regions, namely the co-visible areas, with great efficiency and accuracy. Furthermore, building upon MKPC, we propose a general two-stage pipeline for image matching, which is compatible with any image matching model or combination of models. We plug SuperPoint + SuperGlue into the two-stage pipeline, and the results show that our method enhances performance on outdoor pose estimation. Moreover, under fair comparison conditions, our method outperforms the SOTA on the Image Matching Challenge 2022 benchmark, which is currently the hardest outdoor benchmark for image matching.
Autonomous robots are required to actively and adaptively learn the categories and words of various places by exploring the surrounding environment and interacting with users. In semantic mapping and spatial language acquisition conducted using robots, it is costly and labor-intensive to prepare training datasets that contain linguistic instructions from users. Therefore, we aimed to enable mobile robots to learn spatial concepts through autonomous active exploration. This study is characterized by interpreting the `action' of the robot that asks the user the question `What kind of place is this?' in the context of active inference. We propose an active inference method, spatial concept formation with information gain-based active exploration (SpCoAE), that combines sequential Bayesian inference by particle filters and position determination based on information gain in a probabilistic generative model. Our experiment shows that the proposed method can efficiently determine a position to form appropriate spatial concepts in home environments. In particular, it is important to conduct efficient exploration that leads to appropriate concept formation and quickly covers the environment without adopting a haphazard exploration strategy.
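The idea of choosing a query position by information gain, as described in the abstract above, can be sketched in a heavily simplified discrete setting. The sketch assumes that the belief is a small set of weighted particles and that each particle assigns a probability to each possible user answer at each candidate position. The functions `information_gain` and `best_position`, the position names, and all numbers are hypothetical; this is not the SpCoAE model itself.

```python
from math import log


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * log(q) for q in p if q > 0.0)


def information_gain(weights, likelihoods):
    """Expected information gain of asking at one position.

    weights[i]        : weight of particle i (weights sum to 1)
    likelihoods[i][y] : probability that particle i assigns to answer y

    Computed as the mutual information between the answer and the particle
    index: H(predictive) - sum_i w_i * H(likelihoods[i]).
    """
    n_answers = len(likelihoods[0])
    predictive = [
        sum(w * lik[y] for w, lik in zip(weights, likelihoods))
        for y in range(n_answers)
    ]
    expected_conditional = sum(
        w * entropy(lik) for w, lik in zip(weights, likelihoods)
    )
    return entropy(predictive) - expected_conditional


def best_position(weights, likelihoods_by_position):
    """Return the candidate position with the largest expected gain."""
    return max(
        likelihoods_by_position,
        key=lambda pos: information_gain(weights, likelihoods_by_position[pos]),
    )


# Two equally weighted particles and two possible answers (hypothetical).
weights = [0.5, 0.5]
candidates = {
    # Both particles predict the same answer here: asking teaches nothing.
    "hallway": [[0.9, 0.1], [0.9, 0.1]],
    # The particles disagree here: the answer discriminates between them.
    "doorway": [[0.9, 0.1], [0.1, 0.9]],
}
chosen = best_position(weights, candidates)
print(chosen)  # doorway
```

In this toy setting the robot prefers the position where its hypotheses disagree about the expected answer, which is the intuition behind information gain-based exploration.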
Without ground truth supervision, self-supervised depth estimation can be trapped in a local minimum due to the gradient-locality issue of the photometric loss. In this paper, we present a framework to enhance depth by leveraging semantic segmentation to guide the network to jump out of the local minimum. Prior works have proposed to share encoders between these two tasks or explicitly align them based on priors like the consistency between edges in the depth and segmentation maps. Yet, these methods usually require ground truth or high-quality pseudo labels, which may not be easily accessible in real-world applications. In contrast, we investigate self-supervised depth estimation along with a segmentation branch that is supervised with noisy labels provided by models pre-trained with limited data. We extend parameter sharing from the encoder to the decoder and study the influence of different numbers of shared decoder parameters on model performance. Also, we propose to use cross-task information to refine current depth and segmentation predictions to generate pseudo-depth and semantic labels for training. The advantages of the proposed method are demonstrated through extensive experiments on the KITTI benchmark and a downstream task for endoscopic tissue deformation tracking.
Representation multi-task learning (MTL) and transfer learning (TL) have achieved tremendous success in practice. However, the theoretical understanding of these methods is still lacking. Most existing theoretical works focus on cases where all tasks share the same representation, and claim that MTL and TL almost always improve performance. However, as the number of tasks grows, assuming all tasks share the same representation is unrealistic. Also, this does not always match empirical findings, which suggest that a shared representation may not necessarily improve single-task or target-only learning performance. In this paper, we aim to understand how to learn from tasks with \textit{similar but not exactly the same} linear representations, while dealing with outlier tasks. We propose two algorithms that are \textit{adaptive} to the similarity structure and \textit{robust} to outlier tasks under both MTL and TL settings. Our algorithms outperform single-task or target-only learning when representations across tasks are sufficiently similar and the fraction of outlier tasks is small. Furthermore, they always perform no worse than single-task learning or target-only learning, even when the representations are dissimilar. We provide information-theoretic lower bounds to show that our algorithms are nearly \textit{minimax} optimal in a large regime.
Medical image segmentation is regarded as a basic step in medical image analysis and surgical intervention. Many previous works have attempted to incorporate shape priors when designing segmentation models, which helps attain finer masks with anatomical shape information. In this work, we discuss in detail three types of segmentation models with shape priors: atlas-based models, statistical-based models, and UNet-based models. Because the former two kinds of methods generalize poorly, UNet-based models have dominated the field of medical image segmentation in recent years. However, existing UNet-based models tend to employ implicit shape priors, which lack interpretability and generalize poorly to different organs with distinctive shapes. We therefore propose a novel shape prior module (SPM), which explicitly introduces shape priors to improve the segmentation performance of UNet-based models. To evaluate the effectiveness of SPM, we conduct experiments on three challenging public datasets, on which our proposed model achieves state-of-the-art performance. Furthermore, SPM generalizes well across classic convolutional neural networks (CNNs) and recent Transformer-based backbones, and can serve as a plug-and-play structure for segmentation tasks on different datasets.
Generative Adversarial Networks (GAN) offer a promising approach for Neural Machine Translation (NMT). However, feeding multiple morphologically distinct languages into a single model during training reduces the NMT's performance. In GAN, similar to bilingual models, multilingual NMT only considers one reference translation for each sentence during model training. This single reference translation prevents the GAN model from learning sufficient information about the source sentence representation. Thus, in this article, we propose the Denoising Adversarial Auto-encoder-based Sentence Interpolation (DAASI) approach, which performs sentence interpolation by learning the intermediate latent representation of the source and target sentences of multilingual language pairs. Apart from latent representation, we also use the Wasserstein-GAN approach for the multilingual NMT model by incorporating the model-generated sentences of multiple languages for reward computation. This computed reward effectively optimizes the performance of the GAN-based multilingual model. We conduct experiments on low-resource language pairs and find that our approach outperforms existing state-of-the-art approaches for multilingual NMT with a performance gain of up to 4 BLEU points. Moreover, we apply our trained model to zero-shot language pairs under an unsupervised scenario and show the robustness of the proposed approach.
Combining a set of phylogenetic trees into a single phylogenetic network that explains all of them is a fundamental challenge in evolutionary studies. Existing methods are computationally expensive and can either handle only small numbers of phylogenetic trees or are limited to severely restricted classes of networks. In this paper, we apply the recently introduced theoretical framework of cherry picking to design a class of efficient heuristics that are guaranteed to produce a network containing each of the input trees, for datasets consisting of binary trees. Some of the heuristics in this framework are based on the design and training of a machine learning model that captures essential information on the structure of the input trees and guides the algorithms towards better solutions. We also propose simple and fast randomised heuristics that prove to be very effective when run multiple times. Unlike the existing exact methods, our heuristics are applicable to datasets of practical size, and the experimental study we conducted on both simulated and real data shows that these solutions are qualitatively good, always within some small constant factor of the optimum. Moreover, our machine-learned heuristics are among the first applications of machine learning to phylogenetics and show its promise.
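To make the cherry-picking idea in the abstract above more concrete, here is a toy sketch of a randomised heuristic. It assumes binary trees encoded as nested tuples with string leaves, treats a cherry as a pair of leaves sharing a parent, and assumes as a simplification that a shorter picking sequence is the better one. The functions `cherries`, `pick`, and `random_cherry_picking` are hypothetical names. The sketch only produces a sequence of picked cherries; it does not build the network and does not reproduce the paper's heuristics or its machine-learned guidance.

```python
import random


def cherries(tree):
    """Return all ordered pairs (x, y) of leaves that share a parent."""
    if isinstance(tree, str):
        return set()
    left, right = tree
    if isinstance(left, str) and isinstance(right, str):
        return {(left, right), (right, left)}
    return cherries(left) | cherries(right)


def pick(tree, pair):
    """Remove leaf x if (x, y) is a cherry of the tree; else leave it as is."""
    if isinstance(tree, str):
        return tree
    x, y = pair
    left, right = tree
    if {left, right} == {x, y}:
        return y  # the parent of the cherry is replaced by the leaf y
    return (pick(left, pair), pick(right, pair))


def random_cherry_picking(trees, seed=0):
    """Pick random cherries until every tree is reduced to a single leaf."""
    rng = random.Random(seed)
    trees = list(trees)
    sequence = []
    while any(not isinstance(t, str) for t in trees):
        candidates = sorted(set().union(*(cherries(t) for t in trees)))
        pair = rng.choice(candidates)
        trees = [pick(t, pair) for t in trees]
        sequence.append(pair)
    return sequence


# Two conflicting trees on the leaves a, b, c (toy data).
trees = [(("a", "b"), "c"), (("a", "c"), "b")]

# Run the randomised heuristic several times and keep the shortest sequence.
best = min((random_cherry_picking(trees, seed=s) for s in range(20)), key=len)
print(len(best), best)
```

Each pick removes a leaf from at least one tree, so the loop always terminates. Repeating the run with different seeds mirrors the observation in the abstract that the randomised heuristics work well when run multiple times.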
While interacting in the world is a multi-sensory experience, many robots continue to predominantly rely on visual perception to map and navigate in their environments. In this work, we propose Audio-Visual-Language Maps (AVLMaps), a unified 3D spatial map representation for storing cross-modal information from audio, visual, and language cues. AVLMaps integrate the open-vocabulary capabilities of multimodal foundation models pre-trained on Internet-scale data by fusing their features into a centralized 3D voxel grid. In the context of navigation, we show that AVLMaps enable robot systems to index goals in the map based on multimodal queries, e.g., textual descriptions, images, or audio snippets of landmarks. In particular, the addition of audio information enables robots to more reliably disambiguate goal locations. Extensive experiments in simulation show that AVLMaps enable zero-shot multimodal goal navigation from multimodal prompts and provide 50% better recall in ambiguous scenarios. These capabilities extend to mobile robots in the real world, navigating to landmarks referring to visual, audio, and spatial concepts. Videos and code are available at: https://avlmaps.github.io.