In this paper, we address the challenge of exploring unknown indoor aerial environments using autonomous aerial robots with Size Weight and Power (SWaP) constraints. The SWaP constraints induce limits on mission time requiring efficiency in exploration. We present a novel exploration framework that uses Deep Learning (DL) to predict the most likely indoor map given the previous observations, and Deep Reinforcement Learning (DRL) for exploration, designed to run on modern SWaP constraints neural processors. The DL-based map predictor provides a prediction of the occupancy of the unseen environment while the DRL-based planner determines the best navigation goals that can be safely reached to provide the most information. The two modules are tightly coupled and run onboard allowing the vehicle to safely map an unknown environment. Extensive experimental and simulation results show that our approach surpasses state-of-the-art methods by 50-60% in efficiency, which we measure by the fraction of the explored space as a function of the length of the trajectory traveled.
What is the minimal information that a robot must retain to achieve its task? To design economical robots, the literature dealing with reduction of combinatorial filters approaches this problem algorithmically.As lossless state compression is NP-hard, prior work has examined, along with minimization algorithms, a variety of special cases in which specific properties enable efficient solution. Complementing those findings, this paper refines the present understanding from the perspective of parameterized complexity. We give a fixed-parameter tractable algorithm for the general reduction problem by exploiting a transformation into minimal clique covering. The transformation introduces new constraints that arise from sequential dependencies encoded within the input filter -- some of these constraints can be repaired, others are treated through enumeration. Through this approach, we identify parameters affecting filter reduction that are based upon inter-constraint couplings (expressed as a notion of their height and width), which add to the structural parameters present in the unconstrained problem of minimal clique covering.
Recently, personalized federated learning (pFL) has attracted increasing attention in privacy protection, collaborative learning, and tackling statistical heterogeneity among clients, e.g., hospitals, mobile smartphones, etc. Most existing pFL methods focus on exploiting the global information and personalized information in the client-level model parameters while neglecting that data is the source of these two kinds of information. To address this, we propose the Federated Conditional Policy (FedCP) method, which generates a conditional policy for each sample to separate the global information and personalized information in its features and then processes them by a global head and a personalized head, respectively. FedCP is more fine-grained to consider personalization in a sample-specific manner than existing pFL methods. Extensive experiments in computer vision and natural language processing domains show that FedCP outperforms eleven state-of-the-art methods by up to 6.69%. Furthermore, FedCP maintains its superiority when some clients accidentally drop out, which frequently happens in mobile settings. Our code is public at https://github.com/TsingZ0/FedCP.
Visual attention forms the basis of understanding the visual world. In this work we follow a computational approach to investigate the biological basis of visual attention. We analyze retinal and cortical electrophysiological data from mouse. Visual Stimuli are Natural Images depicting real world scenes. Our results show that in primary visual cortex (V1), a subset of around $10\%$ of the neurons responds differently to salient versus non-salient visual regions. Visual attention information was not traced in retinal response. It appears that the retina remains naive concerning visual attention; cortical response gets modulated to interpret visual attention information. Experimental animal studies may be designed to further explore the biological basis of visual attention we traced in this study. In applied and translational science, our study contributes to the design of improved visual prostheses systems -- systems that create artificial visual percepts to visually impaired individuals by electronic implants placed on either the retina or the cortex.
Ground segmentation, as the basic task of unmanned intelligent perception, provides an important support for the target detection task. Unstructured road scenes represented by open-pit mines have irregular boundary lines and uneven road surfaces, which lead to segmentation errors in current ground segmentation methods. To solve this problem, a ground segmentation method based on point cloud map is proposed, which involves three parts: region of interest extraction, point cloud registration and background subtraction. Firstly, establishing boundary semantic associations to obtain regions of interest in unstructured roads. Secondly, establishing the location association between point cloud map and the real-time point cloud of region of interest by semantics information. Thirdly, establishing a background model based on Gaussian distribution according to location association, and segments the ground in real-time point cloud by the background substraction method. Experimental results show that the correct segmentation rate of ground points is 99.95%, and the running time is 26ms. Compared with state of the art ground segmentation algorithm Patchwork++, the average accuracy of ground point segmentation is increased by 7.43%, and the running time is increased by 17ms. Furthermore, the proposed method is practically applied to unstructured road scenarios represented by open pit mines.
In this paper, we propose methods to build a powerful and efficient Image-to-Speech captioning (Im2Sp) model. To this end, we start with importing the rich knowledge related to image comprehension and language modeling from a large-scale pre-trained vision-language model into Im2Sp. We set the output of the proposed Im2Sp as discretized speech units, i.e., the quantized speech features of a self-supervised speech model. The speech units mainly contain linguistic information while suppressing other characteristics of speech. This allows us to incorporate the language modeling capability of the pre-trained vision-language model into the spoken language modeling of Im2Sp. With the vision-language pre-training strategy, we set new state-of-the-art Im2Sp performances on two widely used benchmark databases, COCO and Flickr8k. Then, we further improve the efficiency of the Im2Sp model. Similar to the speech unit case, we convert the original image into image units, which are derived through vector quantization of the raw image. With these image units, we can drastically reduce the required data storage for saving image data to just 0.8% when compared to the original image data in terms of bits. Demo page: https://ms-dot-k.github.io/Image-to-Speech-Captioning.
Advanced Persistent Threat (APT) attacks are highly sophisticated and employ a multitude of advanced methods and techniques to target organizations and steal sensitive and confidential information. APT attacks consist of multiple stages and have a defined strategy, utilizing new and innovative techniques and technologies developed by hackers to evade security software monitoring. To effectively protect against APTs, detecting and predicting APT indicators with an explanation from Machine Learning (ML) prediction is crucial to reveal the characteristics of attackers lurking in the network system. Meanwhile, Federated Learning (FL) has emerged as a promising approach for building intelligent applications without compromising privacy. This is particularly important in cybersecurity, where sensitive data and high-quality labeling play a critical role in constructing effective machine learning models for detecting cyber threats. Therefore, this work proposes XFedHunter, an explainable federated learning framework for APT detection in Software-Defined Networking (SDN) leveraging local cyber threat knowledge from many training collaborators. In XFedHunter, Graph Neural Network (GNN) and Deep Learning model are utilized to reveal the malicious events effectively in the large number of normal ones in the network system. The experimental results on NF-ToN-IoT and DARPA TCE3 datasets indicate that our framework can enhance the trust and accountability of ML-based systems utilized for cybersecurity purposes without privacy leakage.
Food image classification is a fundamental step of image-based dietary assessment, enabling automated nutrient analysis from food images. Many current methods employ deep neural networks to train on generic food image datasets that do not reflect the dynamism of real-life food consumption patterns, in which food images appear sequentially over time, reflecting the progression of what an individual consumes. Personalized food classification aims to address this problem by training a deep neural network using food images that reflect the consumption pattern of each individual. However, this problem is under-explored and there is a lack of benchmark datasets with individualized food consumption patterns due to the difficulty in data collection. In this work, we first introduce two benchmark personalized datasets including the Food101-Personal, which is created based on surveys of daily dietary patterns from participants in the real world, and the VFNPersonal, which is developed based on a dietary study. In addition, we propose a new framework for personalized food image classification by leveraging self-supervised learning and temporal image feature information. Our method is evaluated on both benchmark datasets and shows improved performance compared to existing works. The dataset has been made available at: https://skynet.ecn.purdue.edu/~pan161/dataset_personal.html
We analyze a family of large language models in such a lightweight manner that can be done on a single GPU. Specifically, we focus on the OPT family of models ranging from 125m to 66b parameters and rely only on whether an FFN neuron is activated or not. First, we find that the early part of the network is sparse and represents many discrete features. Here, many neurons (more than 70% in some layers of the 66b model) are "dead", i.e. they never activate on a large collection of diverse data. At the same time, many of the alive neurons are reserved for discrete features and act as token and n-gram detectors. Interestingly, their corresponding FFN updates not only promote next token candidates as could be expected, but also explicitly focus on removing the information about triggering them tokens, i.e., current input. To the best of our knowledge, this is the first example of mechanisms specialized at removing (rather than adding) information from the residual stream. With scale, models become more sparse in a sense that they have more dead neurons and token detectors. Finally, some neurons are positional: them being activated or not depends largely (or solely) on position and less so (or not at all) on textual data. We find that smaller models have sets of neurons acting as position range indicators while larger models operate in a less explicit manner.
Ensuring safety in real-world robotic systems is often challenging due to unmodeled disturbances and noisy sensor measurements. To account for such stochastic uncertainties, many robotic systems leverage probabilistic state estimators such as Kalman filters to obtain a robot's belief, i.e. a probability distribution over possible states. We propose belief control barrier functions (BCBFs) to enable risk-aware control synthesis, leveraging all information provided by state estimators. This allows robots to stay in predefined safety regions with desired confidence under these stochastic uncertainties. BCBFs are general and can be applied to a variety of robotic systems that use extended Kalman filters as state estimator. We demonstrate BCBFs on a quadrotor that is exposed to external disturbances and varying sensing conditions. Our results show improved safety compared to traditional state-based approaches while allowing control frequencies of up to 1kHz.