Intelligent reflecting surface (IRS) has recently emerged as a potential technology for 6G and has received much attention from academia and industry. However, most existing works on IRS focus on how to compute the phase shift for performance enhancement, and the problem of how to obtain the computed phase shift at the IRS side is generally neglected. In this paper, we consider compressing the phase shift computed at the receiver side and feeding it back to the IRS through a bandwidth-limited feedback channel. In particular, we propose and investigate a novel attention mechanism, named global attention, that exploits the attention map over both spatial and channel dimensions. This allows us to push the limit of phase shift feedback compression by utilizing the two-dimensional information, which is in sharp contrast to existing works that only consider either the spatial or the channel dimension. Besides, to cope with the problem of mismatched distribution of the phase shift, we introduce the generalized divisive normalization (GDN) layer and inverse generalized divisive normalization (IGDN) layer to the proposed global attention phase shift compression network (GAPSCN). Furthermore, due to practical constraints on the IRS, it is desirable to consider a simplified GAPSCN (S-GAPSCN), where a lightweight multi-scale simplified global attention module (MSSGAM) is proposed in the decoder located at the IRS side to compensate for the performance degradation due to the simplified structure. Simulation results show that the proposed GAPSCN is able to achieve a reconstruction accuracy close to 1 and performs much better than existing algorithms. The performance of the proposed S-GAPSCN can approach that of the GAPSCN but with a much lower computational load.
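As a rough illustration of the GDN and IGDN layers mentioned above, the sketch below applies the commonly used divisive normalization form, y_i = x_i / sqrt(beta_i + sum_j gamma_ij * x_j^2), across the channels at a single spatial position. The function names and the parameter values are assumptions made for illustration; in a trained network beta and gamma are learned, and the abstract does not specify how GAPSCN configures them.

```python
import math


def gdn(x, beta, gamma):
    """Forward GDN across channels at one spatial position (illustrative)."""
    n = len(x)
    return [
        x[i] / math.sqrt(beta[i] + sum(gamma[i][j] * x[j] ** 2 for j in range(n)))
        for i in range(n)
    ]


def igdn(y, beta, gamma):
    """IGDN: multiplies by the normalization pool instead of dividing.

    Note: with its own learned parameters this is only an approximate
    inverse of a GDN layer, not an exact one.
    """
    n = len(y)
    return [
        y[i] * math.sqrt(beta[i] + sum(gamma[i][j] * y[j] ** 2 for j in range(n)))
        for i in range(n)
    ]


# Hypothetical two-channel input with hand-picked parameters.
x = [3.0, 4.0]
beta = [11.0, 11.0]
gamma = [[1.0, 1.0], [1.0, 1.0]]
y = gdn(x, beta, gamma)  # each channel is divided by sqrt(11 + 9 + 16) = 6
```

Each channel is rescaled by a pool of the squared activations of all channels, which is one way such a layer can reshape a mismatched input distribution.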
In this paper, we introduce Jointist, an instrument-aware multi-instrument framework that is capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of the instrument recognition module that conditions the other modules: the transcription module that outputs instrument-specific piano rolls, and the source separation module that utilizes instrument information and transcription results. The instrument conditioning is designed for an explicit multi-instrument functionality while the connection between the transcription and source separation modules is for better transcription performance. Our challenging problem formulation makes the model highly useful in the real world given that modern popular music typically consists of multiple instruments. However, its novelty necessitates a new perspective on how to evaluate such a model. During the experiment, we assess the model from various aspects, providing a new evaluation perspective for multi-instrument transcription. We also argue that transcription models can be utilized as a preprocessing module for other music analysis tasks. In the experiment on several downstream tasks, the symbolic representation provided by our transcription model turned out to be helpful to spectrograms in solving downbeat detection, chord recognition, and key estimation.
Structured light (SL) systems acquire high-fidelity 3D geometry with active illumination projection. Conventional systems face challenges when working in environments with strong ambient illumination, global illumination, and cross-device interference. This paper proposes a general-purpose technique to improve the robustness of SL by projecting redundant optical signals in addition to the native SL patterns. In this way, projected signals become more distinguishable from errors. Thus, the geometry information can be more easily recovered using simple signal processing, and a "coding gain" in performance is obtained. We propose three applications using our redundancy codes: (1) self error-correction for SL imaging under strong ambient light, (2) error detection for adaptive reconstruction under global illumination, and (3) interference filtering with device-specific projection sequence encoding, especially for event camera-based SL and light curtain devices. We systematically analyze the design rules and signal processing algorithms in these applications. Corresponding hardware prototypes are built for evaluations on real-world complex scenes. Experimental results on synthetic and real data demonstrate significant performance improvements in SL systems with our redundancy codes.
In this study, we propose circularly-shifted chirp (CSC)-based majority vote (MV) (CSC-MV), a power-efficient over-the-air computation (OAC) scheme, to achieve long-range federated edge learning (FEEL). The proposed approach maps the votes (i.e., the sign of the local gradients) from the edge devices (EDs) to the linear CSCs constructed with a discrete Fourier transform-spread orthogonal frequency division multiplexing (DFT-s-OFDM) transmitter. At the edge server (ES), the MV is calculated with an energy detector. We compare our proposed scheme with one-bit broadband digital aggregation (OBDA) and show that the output-power back-off (OBO) requirement of the transmitters with an adjacent-channel-leakage ratio (ACLR) constraint for CSC-MV is lower than the one with OBDA. For example, with an ACLR constraint of -22 dB, CSC-MV can have an OBO requirement of 6-7 dB less than the one with OBDA. When the power amplifier (PA) non-linearity is considered, we demonstrate that CSC-MV outperforms OBDA in terms of test accuracy for both homogeneous and heterogeneous data distributions, without using channel state information (CSI) at the ES and EDs.
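The vote aggregation described above can be sketched in a few lines. This is a minimal, idealized illustration and not the CSC-MV scheme itself: it assumes each device's vote contributes one unit of energy to one of two orthogonal resources per gradient coordinate, and it omits the chirp waveform, the DFT-s-OFDM transmitter, fading, and noise. All function names are hypothetical.

```python
def sign(value):
    """Vote of one device for one coordinate (+1 or -1; zero counts as +1)."""
    return 1 if value >= 0 else -1


def majority_vote(local_gradients):
    """Per-coordinate majority vote over the signs of the local gradients."""
    num_params = len(local_gradients[0])
    result = []
    for i in range(num_params):
        # Idealized detected energies: number of devices voting +1 and -1.
        energy_plus = sum(1 for g in local_gradients if sign(g[i]) > 0)
        energy_minus = len(local_gradients) - energy_plus
        # Ties are resolved in favor of +1 in this sketch.
        result.append(1 if energy_plus >= energy_minus else -1)
    return result


def sgd_step(weights, mv, learning_rate):
    """Update the model along the majority-vote direction."""
    return [w - learning_rate * v for w, v in zip(weights, mv)]


# Three hypothetical edge devices, two model parameters.
gradients = [[0.5, -1.0], [0.2, -0.3], [-0.1, 0.4]]
mv = majority_vote(gradients)
new_weights = sgd_step([0.0, 0.0], mv, 0.1)
```

Because the server only compares two energies per coordinate, this kind of detection needs no channel state information, which matches the CSI-free operation stated in the abstract.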
Image blurring refers to the degradation of an image wherein the image's overall sharpness decreases. Image blurring is caused by several factors. Additionally, during the image acquisition process, noise may get added to the image. Such a noisy and blurred image can be represented as the image resulting from the convolution of the original image with the associated point spread function, along with additive noise. However, the blurred image often contains inadequate information to uniquely determine the plausible original image. Based on the availability of blurring information, image deblurring methods can be classified as blind and non-blind. In non-blind image deblurring, some prior information is known regarding the corresponding point spread function and the added noise. The objective of this study is to determine the effectiveness of non-blind image deblurring methods with respect to the identification and elimination of noise present in blurred images. In this study, three non-blind image deblurring methods, namely Wiener deconvolution, Lucy-Richardson deconvolution, and regularized deconvolution were comparatively analyzed for noisy images featuring salt-and-pepper noise. Two types of blurring effects were simulated, namely motion blurring and Gaussian blurring. The said three non-blind deblurring methods were applied under two scenarios: direct deblurring of noisy blurred images and deblurring of images after denoising through the application of the adaptive median filter. The obtained results were then compared for each scenario to determine the best approach for deblurring noisy images.
Session-based recommendation aims to predict items that an anonymous user would like to purchase based on her short behavior sequence. The current approaches towards session-based recommendation only focus on modeling users' interest preferences, while they all ignore a key attribute of an item, i.e., the price. Many marketing studies have shown that the price factor significantly influences users' behaviors and the purchase decisions of users are determined by both price and interest preferences simultaneously. However, it is nontrivial to incorporate price preferences for session-based recommendation. Firstly, it is hard to handle heterogeneous information from various features of items to capture users' price preferences. Secondly, it is difficult to model the complex relations between price and interest preferences in determining user choices. To address the above challenges, we propose a novel method Co-guided Heterogeneous Hypergraph Network (CoHHN) for session-based recommendation. Towards the first challenge, we devise a heterogeneous hypergraph to represent heterogeneous information and rich relations among them. A dual-channel aggregating mechanism is then designed to aggregate various information in the heterogeneous hypergraph. After that, we extract users' price preferences and interest preferences via attention layers. As to the second challenge, a co-guided learning scheme is designed to model the relations between price and interest preferences and enhance the learning of each other. Finally, we predict user actions based on item features and users' price and interest preferences. Extensive experiments on three real-world datasets demonstrate the effectiveness of the proposed CoHHN. Further analysis reveals the significance of price for session-based recommendation.
A key promise of machine learning is the ability to assist users with personal tasks. Because the personal context required to make accurate predictions is often sensitive, we require systems that protect privacy. A gold standard privacy-preserving system will satisfy perfect secrecy, meaning that interactions with the system provably reveal no additional private information to adversaries. This guarantee should hold even as we perform multiple personal tasks over the same underlying data. However, privacy and quality appear to be in tension in existing systems for personal tasks. Neural models typically require lots of training to perform well, while individual users typically hold a limited scale of data, so the systems propose to learn from the aggregate data of multiple users. This violates perfect secrecy and instead, in the last few years, academics have defended these solutions using statistical notions of privacy -- i.e., the probability of learning private information about a user should be reasonably low. Given the vulnerabilities of these solutions, we explore whether the strong perfect secrecy guarantee can be achieved using recent zero-to-few sample adaptation techniques enabled by foundation models. In response, we propose FOCUS, a framework for personal tasks. Evaluating on popular privacy benchmarks, we find the approach, satisfying perfect secrecy, competes with strong collaborative learning baselines on 6 of 7 tasks. We empirically analyze the proposal, highlighting the opportunities and limitations across task types, and model inductive biases and sizes.
Devising and analyzing learning models for spatiotemporal network data is of importance for tasks including forecasting, anomaly detection, and multi-agent coordination, among others. Graph Convolutional Neural Networks (GCNNs) are an established approach to learn from time-invariant network data. The graph convolution operation offers a principled approach to aggregate multiresolution information. However, extending convolution-principled learning and the respective analysis to the spatiotemporal domain is challenging because spatiotemporal data have more intrinsic dependencies. Hence, greater flexibility to jointly capture the spatial and the temporal dependencies is required to learn meaningful higher-order representations. Here, we leverage product graphs to represent the spatiotemporal dependencies in the data and introduce Graph-Time Convolutional Neural Networks (GTCNNs) as a principled architecture to aid learning. The proposed approach can work with any type of product graph, and we also introduce a parametric product graph to learn the spatiotemporal coupling. The convolution principle further allows a mathematical tractability similar to that of GCNNs. In particular, the stability result shows that GTCNNs are stable to spatial perturbations but there is an implicit trade-off between discriminability and robustness; i.e., the more complex the model, the less stable. Extensive numerical results on benchmark datasets corroborate our findings and show that the GTCNN compares favorably with state-of-the-art solutions. We anticipate the GTCNN to be a starting point for more sophisticated models that achieve good performance but are also fundamentally grounded.
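To make the product-graph idea above concrete, the sketch below builds a product graph from a spatial shift operator and a temporal shift operator and applies a polynomial graph filter to a spatiotemporal signal. The parametric form used here, a weighted sum of the four Kronecker terms I⊗I, I⊗S_G, S_T⊗I, and S_T⊗S_G, is an assumption made for illustration; the abstract does not state the exact parametrization. With weights (0, 1, 1, 0) it reduces to the Cartesian product graph. All names are hypothetical.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(len(b[0]))]
        for i in range(len(a))
        for k in range(len(b))
    ]


def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def product_graph(s_time, s_graph, s00, s01, s10, s11):
    """Weighted sum of the four Kronecker terms (assumed parametric form)."""
    i_t, i_g = identity(len(s_time)), identity(len(s_graph))
    terms = [
        (s00, kron(i_t, i_g)),
        (s01, kron(i_t, s_graph)),
        (s10, kron(s_time, i_g)),
        (s11, kron(s_time, s_graph)),
    ]
    size = len(s_time) * len(s_graph)
    return [
        [sum(w * m[r][c] for w, m in terms) for c in range(size)]
        for r in range(size)
    ]


def matvec(m, x):
    return [sum(m_rc * x_c for m_rc, x_c in zip(row, x)) for row in m]


def graph_filter(shift, x, taps):
    """Polynomial graph filter: y = sum_k taps[k] * shift^k x."""
    y = [0] * len(x)
    power = list(x)  # shift^0 x
    for h in taps:
        y = [y_i + h * p_i for y_i, p_i in zip(y, power)]
        power = matvec(shift, power)
    return y


# Two nodes joined by an edge, two time steps with a directed link t0 -> t1.
S_G = [[0, 1], [1, 0]]
S_T = [[0, 0], [1, 0]]
S = product_graph(S_T, S_G, 0, 1, 1, 0)  # Cartesian product graph
x = [1, 0, 0, 0]                         # unit signal on node 0 at time 0
y = graph_filter(S, x, [0, 1])           # one-hop graph-time aggregation
```

One shift on the product graph moves the signal both to the spatial neighbor at the same time step and to the same node at the next time step, which is the joint spatial and temporal aggregation the abstract refers to.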
Automating Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios such as rapid indexing and archiving. Many existing supervised learning methods for the KIE task require a large number of labeled samples and learn separate models for different types of documents. However, collecting and labeling a large dataset is time-consuming and is not a user-friendly requirement for many cloud platforms. To overcome these challenges, we propose a deep end-to-end trainable network for one-shot KIE using partial graph matching. Unlike previous methods, in which the learning of similarity and the solving are optimized separately, our method enables the learning of the two processes in an end-to-end framework. Existing one-shot KIE methods are either template-based or simple attention-based learning approaches that struggle to handle texts that are shifted beyond their desired positions by printers, as illustrated in Fig. 1. To solve this problem, we add a one-to-(at most)-one constraint such that we will find the globally optimized solution even if some texts have drifted. Further, we design a multimodal context ensemble block to boost the performance through fusing features of spatial, textual, and aspect representations. To promote research on KIE, we collected and annotated a one-shot document KIE dataset named DKIE with diverse types of images. The DKIE dataset consists of 2.5K document images captured by mobile phones in natural scenes, and it is the largest available one-shot KIE dataset to date. The results of experiments on DKIE show that our method achieved state-of-the-art performance compared with recent one-shot and supervised learning approaches. The dataset and proposed one-shot KIE model will be released soon.
Because basic uncertain information provides a simple form for decision information with a certainty degree, it has been developed to reflect the quality of observed or subjective assessments. In order to study the algebraic structure and preference relation of basic uncertain information, we develop some algebraic operations for basic uncertain information. The order relation of this type of information is also considered. Finally, to apply the developed algebraic operations and order relations, a generalized TODIM method for multi-attribute decision making with basic uncertain information is given. The numerical example shows that the developed decision procedure is valid.