Efficient beam training is the key challenge in the codebook-based configuration of reconfigurable intelligent surfaces (RISs) because the beam training overhead can have a strong impact on the achievable system performance. In this paper, we study the performance tradeoff between overhead and achievable signal-to-noise ratio (SNR) in RIS beam training while taking into account the size of the targeted coverage area, the RIS response time, and the delay for feedback transmissions. Thereby, we consider three common beam training strategies: full search (FS), hierarchical search (HS), and tracking-based search (TS). Our analysis shows that the codebook-based illumination of a given coverage area can be realized with wide- or narrow-beam designs, which result in two different scaling laws for the achievable SNR. Similarly, there are two regimes for the overhead, where the number of pilot symbols required for reliable beam training is dependent on and independent of the SNR, respectively. Based on these insights, we investigate the impact of the beam training overhead on the effective rate and provide an upper bound on the user velocity for which the overhead is negligible. Moreover, when the overhead is not negligible, we show that TS beam training achieves higher effective rates than HS and FS beam training, while HS beam training may or may not outperform FS beam training, depending on the RIS response time, feedback delay, and codebook size. Finally, we present numerical simulation results that verify our theoretical analysis. In particular, our results confirm the existence of the proposed regimes, reveal that fast RISs can lead to negligible overhead for FS beam training, and show that large feedback delays can significantly reduce the performance for HS beam training.
Unlearnable examples (UEs) refer to training samples modified to be unlearnable to Deep Neural Networks (DNNs). These examples are usually generated by adding error-minimizing noises that can fool a DNN model into believing that there is nothing (no error) to learn from the data. The concept of UE has been proposed as a countermeasure against unauthorized data exploitation on personal data. While UE has been extensively studied on images, it is unclear how to craft effective UEs for time series data. In this work, we introduce the first UE generation method to protect time series data from unauthorized training by deep learning models. To this end, we propose a new form of error-minimizing noise that can be \emph{selectively} applied to specific segments of time series, rendering them unlearnable to DNN models while remaining imperceptible to human observers. Through extensive experiments on a wide range of time series datasets, we demonstrate that the proposed UE generation method is effective in both classification and generation tasks. It can protect time series data against unauthorized exploitation, while preserving their utility for legitimate usage, thereby contributing to the development of secure and trustworthy machine learning systems.
Mathematical formulas are the crystallization of human wisdom in exploring the laws of nature for thousands of years. Describing the complex laws of nature with a concise mathematical formula is a constant pursuit of scientists and a great challenge for artificial intelligence. This field is called symbolic regression. Symbolic regression was originally formulated as a combinatorial optimization problem, and GP and reinforcement learning algorithms were used to solve it. However, GP is sensitive to hyperparameters, and these two types of algorithms are inefficient. To solve this problem, researchers treat the mapping from data to expressions as a translation problem. And the corresponding large-scale pre-trained model is introduced. However, the data and expression skeletons do not have very clear word correspondences as the two languages do. Instead, they are more like two modalities (e.g., image and text). Therefore, in this paper, we proposed MMSR. The SR problem is solved as a pure multimodal problem, and contrastive learning is also introduced in the training process for modal alignment to facilitate later modal feature fusion. It is worth noting that in order to better promote the modal feature fusion, we adopt the strategy of training contrastive learning loss and other losses at the same time, which only needs one-step training, instead of training contrastive learning loss first and then training other losses. Because our experiments prove training together can make the feature extraction module and feature fusion module running-in better. Experimental results show that compared with multiple large-scale pre-training baselines, MMSR achieves the most advanced results on multiple mainstream datasets including SRBench.
For decades, human-computer interaction has fundamentally been manual. Even today, almost all productive work done on the computer necessitates human input at every step. Autonomous virtual agents represent an exciting step in automating many of these menial tasks. Virtual agents would empower users with limited technical proficiency to harness the full possibilities of computer systems. They could also enable the efficient streamlining of numerous computer tasks, ranging from calendar management to complex travel bookings, with minimal human intervention. In this paper, we introduce OmniACT, the first-of-a-kind dataset and benchmark for assessing an agent's capability to generate executable programs to accomplish computer tasks. Our scope extends beyond traditional web automation, covering a diverse range of desktop applications. The dataset consists of fundamental tasks such as "Play the next song", as well as longer horizon tasks such as "Send an email to John Doe mentioning the time and place to meet". Specifically, given a pair of screen image and a visually-grounded natural language task, the goal is to generate a script capable of fully executing the task. We run several strong baseline language model agents on our benchmark. The strongest baseline, GPT-4, performs the best on our benchmark However, its performance level still reaches only 15% of the human proficiency in generating executable scripts capable of completing the task, demonstrating the challenge of our task for conventional web agents. Our benchmark provides a platform to measure and evaluate the progress of language model agents in automating computer tasks and motivates future work towards building multimodal models that bridge large language models and the visual grounding of computer screens.
Human Activity Recognition (HAR) has been extensively studied, with recent emphasis on the implementation of advanced Machine Learning (ML) and Deep Learning (DL) algorithms for accurate classification. This study investigates the efficacy of two ML algorithms, eXtreme Gradient Boosting (XGBoost) and MiniRocket, in the realm of HAR using data collected from smartphone sensors. The experiments are conducted on a dataset obtained from the UCI repository, comprising accelerometer and gyroscope signals captured from 30 volunteers performing various activities while wearing a smartphone. The dataset undergoes preprocessing, including noise filtering and feature extraction, before being utilized for training and testing the classifiers. Monte Carlo cross-validation is employed to evaluate the models' robustness. The findings reveal that both XGBoost and MiniRocket attain accuracy, F1 score, and AUC values as high as 0.99 in activity classification. XGBoost exhibits a slightly superior performance compared to MiniRocket. Notably, both algorithms surpass the performance of other ML and DL algorithms reported in the literature for HAR tasks. Additionally, the study compares the computational efficiency of the two algorithms, revealing XGBoost's advantage in terms of training time. Furthermore, the performance of MiniRocket, which achieves accuracy and F1 values of 0.94, and an AUC value of 0.96 using raw data and utilizing only one channel from the sensors, highlights the potential of directly leveraging unprocessed signals. It also suggests potential advantages that could be gained by utilizing sensor fusion or channel fusion techniques. Overall, this research sheds light on the effectiveness and computational characteristics of XGBoost and MiniRocket in HAR tasks, providing insights for future studies in activity recognition using smartphone sensor data.
In social recommender systems, it is crucial that the recommendation models provide equitable visibility for different demographic groups, such as gender or race. Most existing research has addressed this problem by only studying individual static snapshots of networks that typically change over time. To address this gap, we study the evolution of recommendation fairness over time and its relation to dynamic network properties. We examine three real-world dynamic networks by evaluating the fairness of six recommendation algorithms and analyzing the association between fairness and network properties over time. We further study how interventions on network properties influence fairness by examining counterfactual scenarios with alternative evolution outcomes and differing network properties. Our results on empirical datasets suggest that recommendation fairness improves over time, regardless of the recommendation method. We also find that two network properties, minority ratio, and homophily ratio, exhibit stable correlations with fairness over time. Our counterfactual study further suggests that an extreme homophily ratio potentially contributes to unfair recommendations even with a balanced minority ratio. Our work provides insights into the evolution of fairness within dynamic networks in social science. We believe that our findings will help system operators and policymakers to better comprehend the implications of temporal changes and interventions targeting fairness in social networks.
Intelligent reflecting surface (IRS) is a potential candidate for massive multiple-input multiple-output (MIMO) 2.0 technology due to its low cost, ease of deployment, energy efficiency and extended coverage. This chapter investigates the slot-by-slot IRS reflection pattern design and two-timescale reflection pattern design schemes, respectively. For the slot-by-slot reflection optimization, we propose exploiting an IRS to improve the propagation channel rank in mmWave massive MIMO systems without need to increase the transmit power budget. Then, we analyze the impact of the distributed IRS on the channel rank. To further reduce the heavy overhead of channel training, channel state information (CSI) estimation, and feedback in time-varying MIMO channels, we present a two-timescale reflection optimization scheme, where the IRS is configured relatively infrequently based on statistical CSI (S-CSI) and the active beamformers and power allocation are updated based on quickly outdated instantaneous CSI (I-CSI) per slot. The achievable average sum-rate (AASR) of the system is maximized without excessive overhead of cascaded channel estimation. A recursive sampling particle swarm optimization (PSO) algorithm is developed to optimize the large-timescale IRS reflection pattern efficiently with reduced samplings of channel samples.
Modality discrepancies have perpetually posed significant challenges within the realm of Automated Audio Captioning (AAC) and across all multi-modal domains. Facilitating models in comprehending text information plays a pivotal role in establishing a seamless connection between the two modalities of text and audio. While recent research has focused on closing the gap between these two modalities through contrastive learning, it is challenging to bridge the difference between both modalities using only simple contrastive loss. This paper introduces Enhance Depth of Text Comprehension (EDTC), which enhances the model's understanding of text information from three different perspectives. First, we propose a novel fusion module, FUSER, which aims to extract shared semantic information from different audio features through feature fusion. We then introduced TRANSLATOR, a novel alignment module designed to align audio features and text features along the tensor level. Finally, the weights are updated by adding momentum to the twin structure so that the model can learn information about both modalities at the same time. The resulting method achieves state-of-the-art performance on AudioCaps datasets and demonstrates results comparable to the state-of-the-art on Clotho datasets.
We present a method to increase the resolution of measurements of a physical system and subsequently predict its time evolution using thermodynamics-aware neural networks. Our method uses adversarial autoencoders, which reduce the dimensionality of the full order model to a set of latent variables that are enforced to match a prior, for example a normal distribution. Adversarial autoencoders are seen as generative models, and they can be trained to generate high-resolution samples from low-resoution inputs, meaning they can address the so-called super-resolution problem. Then, a second neural network is trained to learn the physical structure of the latent variables and predict their temporal evolution. This neural network is known as an structure-preserving neural network. It learns the metriplectic-structure of the system and applies a physical bias to ensure that the first and second principles of thermodynamics are fulfilled. The integrated trajectories are decoded to their original dimensionality, as well as to the higher dimensionality space produced by the adversarial autoencoder and they are compared to the ground truth solution. The method is tested with two examples of flow over a cylinder, where the fluid properties are varied between both examples.
Text-to-Avatar generation has recently made significant strides due to advancements in diffusion models. However, most existing work remains constrained by limited diversity, producing avatars with subtle differences in appearance for a given text prompt. We design DivAvatar, a novel framework that generates diverse avatars, empowering 3D creatives with a multitude of distinct and richly varied 3D avatars from a single text prompt. Different from most existing work that exploits scene-specific 3D representations such as NeRF, DivAvatar finetunes a 3D generative model (i.e., EVA3D), allowing diverse avatar generation from simply noise sampling in inference time. DivAvatar has two key designs that help achieve generation diversity and visual quality. The first is a noise sampling technique during training phase which is critical in generating diverse appearances. The second is a semantic-aware zoom mechanism and a novel depth loss, the former producing appearances of high textual fidelity by separate fine-tuning of specific body parts and the latter improving geometry quality greatly by smoothing the generated mesh in the features space. Extensive experiments show that DivAvatar is highly versatile in generating avatars of diverse appearances.