The U-shape structure has shown its advantage in salient object detection for efficiently combining multi-scale features. However, most existing U-shape-based methods focus on improving the bottom-up and top-down pathways while ignoring the connections between them. This paper shows that by centralizing these connections, we can achieve cross-scale information interaction among them, hence obtaining semantically stronger and positionally more precise features. To unlock the potential of the newly proposed strategy, we further design a relative global calibration module that can simultaneously process multi-scale inputs without spatial interpolation. Benefiting from the above strategy and module, our proposed approach can aggregate features more effectively while introducing only a few additional parameters. Our approach can cooperate with various existing U-shape-based salient object detection methods by substituting the connections between the bottom-up and top-down pathways. Experimental results demonstrate that our proposed approach performs favorably against previous state-of-the-art methods on five widely used benchmarks with less computational complexity. The source code will be publicly available.
Motion and interaction of social insects (such as ants) have been studied by many researchers to understand the clustering mechanism. Most studies in the field of ant behavior have focused only on indoor environments, while outdoor environments remain underexplored. In this paper, we collect 10 videos of ant colonies from different indoor and outdoor scenes. We also develop an image sequence marking software named VisualMarkData, which enables us to annotate the ants in each video. Across all 5354 frames, the location and identification number of each ant are recorded, for a total of 712 ants and 114112 annotations. Moreover, we provide visual analysis tools to assess and validate the technical quality and reproducibility of our data. We hope that this dataset will contribute to a deeper exploration of ant colony behavior.
In the real world, medical datasets often exhibit a long-tailed data distribution (i.e., a few classes account for most of the data, while most classes have very few samples), which results in a challenging imbalanced learning scenario. For example, there are estimated to be more than 40 different kinds of retinal diseases with variable morbidity, yet more than 30 of these conditions are very rare in global patient cohorts, which results in a typical long-tailed learning problem for deep learning-based screening models. Moreover, more than one kind of disease may be present on the retina, which results in a multi-label scenario and brings a label co-occurrence issue for re-sampling strategies. In this work, we propose a novel framework that leverages prior knowledge of retinal diseases to train a more robust model representation under a hierarchy-sensible constraint. Then, an instance-wise class-balanced sampling strategy and a hybrid knowledge distillation scheme are introduced for the first time to learn from the long-tailed multi-label distribution. Our experiments on a retinal dataset of more than one million samples demonstrate the superiority of our proposed methods, which outperform all competitors and significantly improve the recognition accuracy of most diseases, especially rare ones.
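To make the label co-occurrence issue concrete, the following is a minimal sketch of a generic class-balanced sampler for multi-label data. It is not the instance-wise strategy proposed in the work, whose details the abstract does not give; the function names and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def build_class_index(labels):
    """Map each class to the indices of the instances that carry it."""
    index = defaultdict(list)
    for i, label_set in enumerate(labels):
        for c in label_set:
            index[c].append(i)
    return dict(index)


def class_balanced_sample(labels, n, rng):
    """Draw n instance indices: pick a class uniformly, then one of its instances.

    Assumption: plain class-balanced re-sampling, used only for illustration.
    """
    index = build_class_index(labels)
    classes = sorted(index)
    picks = []
    for _ in range(n):
        c = rng.choice(classes)             # every class is equally likely
        picks.append(rng.choice(index[c]))  # then any instance with that class
    return picks


# Toy multi-label data: the only "rare" instance also carries "common".
toy_labels = [{"common"}, {"common"}, {"common"}, {"common", "rare"}]
sampled = class_balanced_sample(toy_labels, 1000, random.Random(0))
```

In this toy set, every draw made for the rare class returns instance 3, which also carries the common label, so re-sampling for the rare class inflates the common class as well. This is the co-occurrence effect that a plain re-sampling strategy has to contend with in the multi-label setting.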
In contrast to British or American English, labeled pronunciation data at the phonetic level is scarce for Indian English (IE). This has made it challenging to study pronunciations of Indian English. Moreover, IE has many varieties, resulting from various native language influences on L2 English. Indian English has been studied in the past by a few linguistic works. They report phonetic rules for such characterisation; however, the extent to which these rules apply to diverse, large-scale Indian pronunciation data remains under-examined. We consider a corpus, IndicTIMIT, which is rich in the diversity of IE varieties and is curated in a nativity-balanced manner. It contains data from 80 speakers corresponding to various regions of India. On this corpus, we present an approach to validate the phonetic rules of IE and report previously unexplored rules derived in a data-driven manner. We also provide quantitative information on which rules are observed more prominently than others, indicating their relative relevance in IE.
We consider the object recognition problem in autonomous driving using automotive radar sensors. Compared with lidar sensors, radar is cost-effective and robust in all weather conditions for perception in autonomous driving. However, radar signals suffer from low angular resolution and precision in recognizing surrounding objects. To enhance the capacity of automotive radar, in this work, we exploit the temporal information from successive ego-centric bird's-eye-view radar image frames for radar object recognition. We leverage the consistency of an object's existence and attributes (size, orientation, etc.), and propose a temporal relational layer to explicitly model the relations between objects within successive radar images. In both object detection and multiple object tracking, we show the superiority of our method compared to several baseline approaches.
Recommendations based on only the single current session are not accurate enough. Multi-session-based recommendation (MSBR) addresses this problem. Compared with previous MSBR models, we make three improvements in this paper. First, previous work uses all the history sessions of the user and/or of similar users. When the user's current interest changes greatly from the past, most of these sessions can only have a negative impact. Therefore, we select a large number of randomly chosen sessions from the dataset as candidate sessions to avoid over-dependence on history data. We then use only the most similar sessions, to obtain the most useful information while reducing the noise caused by dissimilar sessions. Second, in real-world datasets, short sessions account for a large proportion. The RNN often used in previous work is not suitable for processing short sessions, because an RNN focuses only on the sequential relationship, which we find is not the only relationship between items in short sessions. We therefore design a more suitable attention-based method named GAFE to process short sessions. Third, although there are few long sessions, they cannot be ignored. Unlike previous models, which simply process long sessions in the same way as short sessions, we propose LSIS, which can split the interest of long sessions, to make better use of them. Finally, to help recommendations, we also consider users' long-term interests, captured by a multi-layer GRU. Combining the four points above, we build the model ENIREC. Experiments on two real-world datasets show that the overall performance of ENIREC is better than that of existing models.
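The candidate-selection step described above (draw random sessions, then keep only the most similar ones) can be sketched as follows. The abstract does not say which similarity measure is used, so Jaccard similarity over item sets is an assumption here, and all function and parameter names are hypothetical.

```python
import random


def jaccard(session_a, session_b):
    """Jaccard similarity between the item sets of two sessions (assumed measure)."""
    a, b = set(session_a), set(session_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_similar_sessions(current, all_sessions, num_candidates, top_k, rng):
    """Draw random candidate sessions, then keep the top_k most similar ones."""
    num_candidates = min(num_candidates, len(all_sessions))
    candidates = rng.sample(all_sessions, num_candidates)
    # Rank candidates by similarity to the current session, most similar first.
    ranked = sorted(candidates, key=lambda s: jaccard(current, s), reverse=True)
    return ranked[:top_k]


# Toy usage: item IDs are arbitrary integers.
sessions = [[1, 2, 3, 4], [1, 2], [7, 8], [3, 9]]
neighbours = select_similar_sessions([1, 2, 3], sessions, 4, 2, random.Random(0))
```

Sampling candidates from the whole dataset rather than from the user's own history is what keeps the selection from depending on a past that may no longer reflect the current interest; the top-k cut then discards dissimilar sessions that would only add noise.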
Reconfigurable intelligent surfaces (RISs) are envisioned as a potentially transformative technology for future wireless communications. However, the inability of RISs to process signals and the attendant increase in channel dimension have brought new challenges to RIS-assisted systems, greatly increasing the pilot overhead required for channel estimation. To address these problems, several prior contributions have enhanced the hardware architecture of RISs or developed algorithms that exploit the channels' mathematical properties, reducing the required pilot overhead to be proportional to the number of RIS elements. In this paper, we propose a dimension-independent channel state information (CSI) acquisition approach in which the required pilot overhead is independent of the number of RIS elements. Specifically, in contrast to traditional signal transmission methods, where signals from the base station (BS) and the users are transmitted in different time slots, we propose a novel method in which signals are transmitted from the BS and the user simultaneously during CSI acquisition. Under this method, an electromagnetic interference random field (IRF) will be induced on the RIS, and we employ a sensing RIS to capture its features. Moreover, we develop three algorithms for parameter estimation in this system, and also derive the Cramér-Rao lower bound (CRLB) and an asymptotic expression for it. Simulation results verify that our proposed signal transmission method and the corresponding algorithms can achieve dimension-independent CSI acquisition for beamforming.
Multi-player multi-armed bandits (MMAB) study how decentralized players cooperatively play the same multi-armed bandit so as to maximize their total cumulative rewards. Existing MMAB models mostly assume that when more than one player pulls the same arm, the players either collide and obtain zero reward, or do not collide and gain independent rewards; both assumptions are usually too restrictive in practical scenarios. In this paper, we propose an MMAB with shareable resources as an extension to the collision and non-collision settings. Each shareable arm has finite shareable resources and a "per-load" reward random variable, both of which are unknown to players. The reward from a shareable arm is equal to the "per-load" reward multiplied by the minimum of the number of players pulling the arm and the arm's maximal shareable resources. We consider two types of feedback: sharing demand information (SDI) and sharing demand awareness (SDA), each of which provides different signals of resource sharing. We design the DPE-SDI and SIC-SDA algorithms to address the shareable arm problem under these two cases of feedback respectively, and prove that both algorithms have logarithmic regrets that are tight in the number of rounds. We conduct simulations to validate both algorithms' performance and show their utility in wireless networking and edge computing.
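The shareable-arm reward rule stated above can be written down directly. The sketch below takes already realized per-load rewards as input (in the model they are random variables unknown to the players); the function and argument names are hypothetical.

```python
from collections import Counter


def shareable_arm_reward(per_load_reward, num_players, capacity):
    """Reward = per-load reward x min(players pulling the arm, arm capacity)."""
    return per_load_reward * min(num_players, capacity)


def round_reward(assignments, per_load_rewards, capacities):
    """Total reward of one round.

    assignments[i] is the arm pulled by player i; per_load_rewards and
    capacities map each arm to its realized per-load reward and to its
    maximal shareable resources.
    """
    loads = Counter(assignments)  # number of players on each pulled arm
    return sum(
        shareable_arm_reward(per_load_rewards[arm], n, capacities[arm])
        for arm, n in loads.items()
    )


# Three players share arm 0 (capacity 2), one player pulls arm 1 (capacity 3).
total = round_reward([0, 0, 0, 1], {0: 0.5, 1: 1.0}, {0: 2, 1: 3})
```

In this toy round the third player on arm 0 adds nothing, because the arm's capacity of 2 is already saturated. That is the trade-off players face when they decide how many of them should share an arm.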
The path planning problem for autonomous exploration of an unknown region by a robotic agent typically employs frontier-based or information-theoretic heuristics. Frontier-based heuristics typically evaluate the information gain of a viewpoint by the number of visible frontier voxels, which is a discrete measure that can only be optimized by sampling. On the other hand, information-theoretic heuristics compute information gain as the mutual information between the map and the sensor's measurement. Although the gradient of such measures can be computed, the computation involves costly numerical differentiation. In this work, we add a novel fuzzy logic filter in the counting of visible frontier voxels surrounding a viewpoint, which allows the gradient of the information gain with respect to the viewpoint to be efficiently computed using automatic differentiation. This enables us to simultaneously optimize information gain with other differentiable quality measures such as path length. Using multiple simulation environments, we demonstrate that the proposed gradient-based optimization method consistently improves the information gain and other quality measures of exploration paths.
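The idea of making the frontier count differentiable can be sketched as follows. The abstract does not specify the form of the fuzzy logic filter, so this sketch assumes a sigmoid membership on sensor range only and ignores occlusion. To stay dependency-free, a hand-derived gradient stands in for the automatic differentiation used in the work. All names and parameters are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def soft_frontier_count(viewpoint, frontier_voxels, sensor_range, sharpness=4.0):
    """Return (soft count, gradient of the count w.r.t. the viewpoint).

    A hard count adds 1 for each voxel within sensor_range; here each voxel
    adds a membership value in (0, 1) that decays smoothly with distance.
    """
    count = 0.0
    grad = [0.0] * len(viewpoint)
    for voxel in frontier_voxels:
        diff = [v - p for v, p in zip(voxel, viewpoint)]
        dist = math.sqrt(sum(d * d for d in diff))
        s = sigmoid(sharpness * (sensor_range - dist))
        count += s
        if dist > 1e-12:
            # Chain rule: d s / d p_i = sharpness * s * (1 - s) * (v_i - p_i) / dist
            scale = sharpness * s * (1.0 - s) / dist
            for i, d in enumerate(diff):
                grad[i] += scale * d
    return count, grad


voxels = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (3.0, 3.0, 0.0)]
count, grad = soft_frontier_count((0.2, 0.1, 0.0), voxels, sensor_range=2.0)
```

Because the soft count varies smoothly with the viewpoint, its gradient can be combined with the gradients of other differentiable terms, such as path length, in a single gradient-based update. A discrete voxel count does not allow this.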
We study how the choice of visual perspective affects learning and generalization in the context of physical manipulation from raw sensor observations. Compared with the more commonly used global third-person perspective, a hand-centric (eye-in-hand) perspective affords reduced observability, but we find that it consistently improves training efficiency and out-of-distribution generalization. These benefits hold across a variety of learning algorithms, experimental settings, and distribution shifts, and for both simulated and real robot apparatuses. However, this is only the case when hand-centric observability is sufficient; otherwise, including a third-person perspective is necessary for learning, but also harms out-of-distribution generalization. To mitigate this, we propose to regularize the third-person information stream via a variational information bottleneck. On six representative manipulation tasks with varying hand-centric observability adapted from the Meta-World benchmark, this results in a state-of-the-art reinforcement learning agent operating from both perspectives improving its out-of-distribution generalization on every task. While some practitioners have long put cameras in the hands of robots, our work systematically analyzes the benefits of doing so and provides simple and broadly applicable insights for improving end-to-end learned vision-based robotic manipulation.
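For readers unfamiliar with the variational information bottleneck, its regularizer is commonly implemented as a KL-divergence penalty on a stochastic latent. The sketch below assumes a diagonal Gaussian encoder for the third-person stream and a standard normal prior, which is the usual choice but is not confirmed by the abstract. The function names and the weight `beta` are hypothetical.

```python
import math


def gaussian_kl_to_standard_normal(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mean, log_var)
    )


def regularized_loss(task_loss, mean, log_var, beta=0.01):
    """Task loss plus a beta-weighted bottleneck penalty on the latent."""
    return task_loss + beta * gaussian_kl_to_standard_normal(mean, log_var)


# The penalty vanishes when the latent matches the prior exactly.
no_information = gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
```

Raising `beta` pushes the third-person latent toward the prior, which limits how much third-person information the policy can rely on while leaving the hand-centric stream unregularized.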