In this paper, cooperative edge caching problem is studied in fog radio access networks (F-RANs). Given the non-deterministic polynomial hard (NP-hard) property of the problem, a dueling deep Q network (Dueling DQN) based caching update algorithm is proposed to make an optimal caching decision by learning the dynamic network environment. In order to protect user data privacy and solve the problem of slow convergence of the single deep reinforcement learning (DRL) model training, we propose a federated reinforcement learning method with quantization (FRLQ) to implement cooperative training of models from multiple fog access points (F-APs) in F-RANs. To address the excessive consumption of communications resources caused by model transmission, we prune and quantize the shared DRL models to reduce the number of model transfer parameters. The communications interval is increased and the communications rounds are reduced by periodical model global aggregation. We analyze the global convergence and computational complexity of our policy. Simulation results verify that our policy has better performance in reducing user request delay and improving cache hit rate compared to benchmark schemes. The proposed policy is also shown to have faster training speed and higher communications efficiency with minimal loss of model accuracy.
Many recent works indicate that the deep neural networks tend to take dataset biases as shortcuts to make decision, rather than understand the tasks, which results in failures on the real-world applications. In this work, we focus on the spurious correlation between feature and label, which derive from the biased data distribution in the training data, and analyze it concretely. In particular, we define the word highly co-occurring with a specific label as biased word, and the example containing biased word as biased example. Our analysis reveals that the biased examples with spurious correlations are easier for models to learn, and when predicting, the biased words make significantly higher contributions to models' predictions than other words, and the models tend to assign the labels over-relying on the spurious correlation between words and labels. To mitigate the model's over-reliance on the shortcut, we propose a training strategy Less-Learn-Shortcut (LLS): we quantify the biased degree of the biased examples, and down-weight them with the biased degree. Experimental results on QM and NLI tasks show that the models improve the performances both on in-domain and adversarial data (1.57% on DuQM and 2.12% on HANS) with our LLS.
Recently, video recognition is emerging with the help of multi-modal learning, which focuses on integrating multiple modalities to improve the performance or robustness of a model. Although various multi-modal learning methods have been proposed and offer remarkable recognition results, almost all of these methods rely on high-quality manual annotations and assume that modalities among multi-modal data provide relevant semantic information. Unfortunately, most widely used video datasets are collected from the Internet and inevitably contain noisy labels and noisy correspondence. To solve this problem, we use the audio-visual action recognition task as a proxy and propose a noise-tolerant learning framework to find anti-interference model parameters to both noisy labels and noisy correspondence. Our method consists of two phases and aims to rectify noise by the inherent correlation between modalities. A noise-tolerant contrastive training phase is performed first to learn robust model parameters unaffected by the noisy labels. To reduce the influence of noisy correspondence, we propose a cross-modal noise estimation component to adjust the consistency between different modalities. Since the noisy correspondence existed at the instance level, a category-level contrastive loss is proposed to further alleviate the interference of noisy correspondence. Then in the hybrid supervised training phase, we calculate the distance metric among features to obtain corrected labels, which are used as complementary supervision. In addition, we investigate the noisy correspondence in real-world datasets and conduct comprehensive experiments with synthetic and real noise data. The results verify the advantageous performance of our method compared to state-of-the-art methods.
Recently, semantic search has been successfully applied to e-commerce product search and the learned semantic space(s) for query and product encoding are expected to generalize to unseen queries or products. Yet, whether generalization can conveniently emerge has not been thoroughly studied in the domain thus far. In this paper, we examine several general-domain and domain-specific pre-trained Roberta variants and discover that general-domain fine-tuning does not help generalization, which aligns with the discovery of prior art. Proper domain-specific fine-tuning with clickstream data can lead to better model generalization, based on a bucketed analysis of a publicly available manual annotated query-product pair da
Recently, semantic search has been successfully applied to e-commerce product search and the learned semantic space(s) for query and product encoding are expected to generalize to unseen queries or products. Yet, whether generalization can conveniently emerge has not been thoroughly studied in the domain thus far. In this paper, we examine several general-domain and domain-specific pre-trained Roberta variants and discover that general-domain fine-tuning does not help generalization, which aligns with the discovery of prior art. Proper domain-specific fine-tuning with clickstream data can lead to better model generalization, based on a bucketed analysis of a publicly available manual annotated query-product pair data.
This paper explores Deep Learning (DL) methods that are used or have the potential to be used for traffic video analysis, emphasizing driving safety for both Autonomous Vehicles (AVs) and human-operated vehicles. We present a typical processing pipeline, which can be used to understand and interpret traffic videos by extracting operational safety metrics and providing general hints and guidelines to improve traffic safety. This processing framework includes several steps, including video enhancement, video stabilization, semantic and incident segmentation, object detection and classification, trajectory extraction, speed estimation, event analysis, modeling and anomaly detection. Our main goal is to guide traffic analysts to develop their own custom-built processing frameworks by selecting the best choices for each step and offering new designs for the lacking modules by providing a comparative analysis of the most successful conventional and DL-based algorithms proposed for each step. We also review existing open-source tools and public datasets that can help train DL models. To be more specific, we review exemplary traffic problems and mentioned requires steps for each problem. Besides, we investigate connections to the closely related research areas of drivers' cognition evaluation, Crowd-sourcing-based monitoring systems, Edge Computing in roadside infrastructures, ADS-equipped AVs, and highlight the missing gaps. Finally, we review commercial implementations of traffic monitoring systems, their future outlook, and open problems and remaining challenges for widespread use of such systems.
Human silhouette segmentation, which is originally defined in computer vision, has achieved promising results for understanding human activities. However, the physical limitation makes existing systems based on optical cameras suffer from severe performance degradation under low illumination, smoke, and/or opaque obstruction conditions. To overcome such limitations, in this paper, we propose to utilize the radio signals, which can traverse obstacles and are unaffected by the lighting conditions to achieve silhouette segmentation. The proposed RFMask framework is composed of three modules. It first transforms RF signals captured by millimeter wave radar on two planes into spatial domain and suppress interference with the signal processing module. Then, it locates human reflections on RF frames and extract features from surrounding signals with human detection module. Finally, the extracted features from RF frames are aggregated with an attention based mask generation module. To verify our proposed framework, we collect a dataset containing 804,760 radio frames and 402,380 camera frames with human activities under various scenes. Experimental results show that the proposed framework can achieve impressive human silhouette segmentation even under the challenging scenarios(such as low light and occlusion scenarios) where traditional optical-camera-based methods fail. To the best of our knowledge, this is the first investigation towards segmenting human silhouette based on millimeter wave signals. We hope that our work can serve as a baseline and inspire further research that perform vision tasks with radio signals. The dataset and codes will be made in public.
The past years have witnessed increasing research interest in achieving passive human localization with commodity WiFi devices. However, due to the fundamental limited spatial resolution of WiFi signals, it is still very difficult to achieve accurate localization with existing commodity WiFi devices. To tackle this problem, in this paper, we propose to exploit the degree of freedom provided by the Intelligent Reflecting Surface (IRS), which is composed of a large number of controllable reflective elements, to modulate the spatial distribution of WiFi signals and thus break down the spatial resolution limitation of WiFi signals to achieve accurate localization. Specifically, in the single-person scenario, we derive the closed-form solution to optimally control the phase shift of the IRS elements. In the multi-person scenario, we propose a Side-lobe Cancellation Algorithm to eliminate the near-far effect to achieve accurate localization of multiple persons in an iterative manner. Extensive simulation results demonstrate that without any change to the existing WiFi infrastructure, the proposed framework can locate multiple moving persons passively with sub-centimeter accuracy under multipath interference and random noise.
The electrocardiogram (ECG) has always been an important biomedical test to diagnose cardiovascular diseases. Current approaches for ECG monitoring are based on body attached electrodes leading to uncomfortable user experience. Therefore, contactless ECG monitoring has drawn tremendous attention, which however remains unsolved. In fact, cardiac electrical-mechanical activities are coupling in a well-coordinated pattern. In this paper, we achieve contactless ECG monitoring by breaking the boundary between the cardiac mechanical and electrical activity. Specifically, we develop a millimeter-wave radar system to contactlessly measure cardiac mechanical activity and reconstruct ECG without any contact in. To measure the cardiac mechanical activity comprehensively, we propose a series of signal processing algorithms to extract 4D cardiac motions from radio frequency (RF) signals. Furthermore, we design a deep neural network to solve the cardiac related domain transformation problem and achieve end-to-end reconstruction mapping from RF input to the ECG output. The experimental results show that our contactless ECG measurements achieve timing accuracy of cardiac electrical events with median error below 14ms and morphology accuracy with median Pearson-Correlation of 90% and median Root-Mean-Square-Error of 0.081mv compared to the groudtruth ECG. These results indicate that the system enables the potential of contactless, continuous and accurate ECG monitoring.