The pre-training-fine-tuning paradigm based on layout-aware multimodal pre-trained models has achieved significant progress on document image question answering. However, domain pre-training and task fine-tuning for additional visual, layout, and task modules prevent them from directly utilizing off-the-shelf instruction-tuning language foundation models, which have recently shown promising potential in zero-shot learning. Contrary to aligning language models to the domain of document image question answering, we align document image question answering to off-the-shell instruction-tuning language foundation models to utilize their zero-shot capability. Specifically, we propose layout and task aware instruction prompt called LATIN-Prompt, which consists of layout-aware document content and task-aware descriptions. The former recovers the layout information among text segments from OCR tools by appropriate spaces and line breaks. The latter ensures that the model generates answers that meet the requirements, especially format requirements, through a detailed description of task. Experimental results on three benchmarks show that LATIN-Prompt can improve the zero-shot performance of instruction-tuning language foundation models on document image question answering and help them achieve comparable levels to SOTAs based on the pre-training-fine-tuning paradigm. Quantitative analysis and qualitative analysis demonstrate the effectiveness of LATIN-Prompt. We provide the code in supplementary and will release the code to facilitate future research.
Parameter regularization or allocation methods are effective in overcoming catastrophic forgetting in lifelong learning. However, they solve all tasks in a sequence uniformly and ignore the differences in the learning difficulty of different tasks. So parameter regularization methods face significant forgetting when learning a new task very different from learned tasks, and parameter allocation methods face unnecessary parameter overhead when learning simple tasks. In this paper, we propose the Parameter Allocation & Regularization (PAR), which adaptively select an appropriate strategy for each task from parameter allocation and regularization based on its learning difficulty. A task is easy for a model that has learned tasks related to it and vice versa. We propose a divergence estimation method based on the Nearest-Prototype distance to measure the task relatedness using only features of the new task. Moreover, we propose a time-efficient relatedness-aware sampling-based architecture search strategy to reduce the parameter overhead for allocation. Experimental results on multiple benchmarks demonstrate that, compared with SOTAs, our method is scalable and significantly reduces the model's redundancy while improving the model's performance. Further qualitative analysis indicates that PAR obtains reasonable task-relatedness.
Over the past few years, the prevalence of wireless devices has become one of the essential sources of electromagnetic (EM) radiation to the public. Facing with the swift development of wireless communications, people are skeptical about the risks of long-term exposure to EM radiation. As EM exposure is required to be restricted at user terminals, it is inefficient to blindly decrease the transmit power, which leads to limited spectral efficiency and energy efficiency (EE). Recently, rate-splitting multiple access (RSMA) has been proposed as an effective way to provide higher wireless transmission performance, which is a promising technology for future wireless communications. To this end, we propose using RSMA to increase the EE of massive MIMO uplink while limiting the EM exposure of users. In particularly, we investigate the optimization of the transmit covariance matrices and decoding order using statistical channel state information (CSI). The problem is formulated as non-convex mixed integer program, which is in general difficult to handle. We first propose a modified water-filling scheme to obtain the transmit covariance matrices with fixed decoding order. Then, a greedy approach is proposed to obtain the decoding permutation. Numerical results verify the effectiveness of the proposed EM exposure-aware EE maximization scheme for uplink RSMA.
Recent years have witnessed the rise and success of pre-training techniques in visually-rich document understanding. However, most existing methods lack the systematic mining and utilization of layout-centered knowledge, leading to sub-optimal performances. In this paper, we propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement in the whole workflow, to learn better representations that combine the features from text, layout, and image. Specifically, we first rearrange input sequences in the serialization stage, and then present a correlative pre-training task, reading order prediction, to learn the proper reading order of documents. To improve the layout awareness of the model, we integrate a spatial-aware disentangled attention into the multi-modal transformer and a replaced regions prediction task into the pre-training phase. Experimental results show that ERNIE-Layout achieves superior performance on various downstream tasks, setting new state-of-the-art on key information extraction, document image classification, and document question answering datasets. The code and models are publicly available at http://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-layout.
In this paper, we propose a symbol-level precoding (SLP) design that aims to minimize the weighted mean square error between the received signal and the constellation point located in the constructive interference region (CIR). Unlike most existing SLP schemes that rely on channel state information (CSI) only, the proposed scheme exploits both CSI and the distribution information of the noise to achieve improved performance. We firstly propose a simple generic description of CIR that facilitates the subsequent SLP design. Such an objective can further be formulated as a nonnegative least squares (NNLS) problem, which can be solved efficiently by the active-set algorithm. Furthermore, the weighted minimum mean square error (WMMSE) precoding and the existing SLP can be easily verified as special cases of the proposed scheme. Finally, simulation results show that the proposed precoding outperforms the state-of-the-art SLP schemes in full signal-to-noise ratio ranges in both uncoded and coded systems without additional complexity over conventional SLP.
Recent efforts of multimodal Transformers have improved Visually Rich Document Understanding (VrDU) tasks via incorporating visual and textual information. However, existing approaches mainly focus on fine-grained elements such as words and document image patches, making it hard for them to learn from coarse-grained elements, including natural lexical units like phrases and salient visual regions like prominent image regions. In this paper, we attach more importance to coarse-grained elements containing high-density information and consistent semantics, which are valuable for document understanding. At first, a document graph is proposed to model complex relationships among multi-grained multimodal elements, in which salient visual regions are detected by a cluster-based method. Then, a multi-grained multimodal Transformer called mmLayout is proposed to incorporate coarse-grained information into existing pre-trained fine-grained multimodal Transformers based on the graph. In mmLayout, coarse-grained information is aggregated from fine-grained, and then, after further processing, is fused back into fine-grained for final prediction. Furthermore, common sense enhancement is introduced to exploit the semantic information of natural lexical units. Experimental results on four tasks, including information extraction and document question answering, show that our method can improve the performance of multimodal Transformers based on fine-grained elements and achieve better performance with fewer parameters. Qualitative analyses show that our method can capture consistent semantics in coarse-grained elements.
The massive multiple-input multiple-output (MIMO) transmission technology has recently attracted much attention in the non-geostationary, e.g., low earth orbit (LEO) satellite communication (SATCOM) systems since it can significantly improve the energy efficiency (EE) and spectral efficiency. In this work, we develop a hybrid analog/digital precoding technique in the massive MIMO LEO SATCOM downlink, which reduces the onboard hardware complexity and power consumption. In the proposed scheme, the analog precoder is implemented via a more practical twin-resolution phase shifting (TRPS) network to make a meticulous tradeoff between the power consumption and array gain. In addition, we consider and study the impact of the distortion effect of the nonlinear power amplifiers (NPAs) in the system design. By jointly considering all the above factors, we propose an efficient algorithmic approach for the TRPS-based hybrid precoding problem with NPAs. Numerical results show the EE gains considering the nonlinear distortion and the performance superiority of the proposed TRPS-based hybrid precoding scheme over the baselines.
In the fifth-generation and beyond era, reconfigurable intelligent surface (RIS) and dynamic metasurface antennas (DMAs) are emerging metamaterials keeping up with the demand for high-quality wireless communication services, which promote the diversification of portable wireless terminals. However, along with the rapid expansion of wireless devices, the electromagnetic (EM) radiation increases unceasingly and inevitably affects public health, which requires a limited exposure level in the transmission design. To reduce the EM radiation and preserve the quality of communication service, we investigate the spectral efficiency (SE) maximization with EM constraints for uplink transmission in hybrid RIS and DMA assisted multiuser multiple-input multiple-output systems. Specifically, alternating optimization is adopted to optimize the transmit covariance, RIS phase shift, and DMA weight matrices. We first figure out the water-filling solutions of transmit covariance matrices with given RIS and DMA parameters. Then, the RIS phase shift matrix is optimized via the weighted minimum mean square error, block coordinate descent and minorization-maximization methods. Furthermore, we solve the unconstrainted DMA weight matrix optimization problem in closed form and then design the DMA weight matrix to approach this performance under DMA constraints. Numerical results confirm the effectiveness of the EM aware SE maximization transmission scheme over the conventional baselines.
Extremely large-scale multiple-input multiple-output (XL-MIMO) is the development trend of future wireless communications. However, the extremely large-scale antenna array could bring inevitable nearfield and dual-wideband effects that seriously reduce the transmission performance. This paper proposes an algorithmic framework to design the beam combining for the near-field wideband XL-MIMO uplink transmissions assisted by holographic metasurface antennas (HMAs). Firstly, we introduce a spherical-wave-based channel model that simultaneously takes into account both the near-field and dual-wideband effects. Based on such a model, we then formulate the HMA-based beam combining problem for the proposed XL-MIMO communications, which is challenging due to the nonlinear coupling of high dimensional HMA weights and baseband combiners. We further present a sum-mean-square-error-minimization-based algorithmic framework. Numerical results showcase that the proposed scheme can effectively alleviate the sum-rate loss caused by the near-field and dual-wideband effects in HMA-assisted XL-MIMO systems. Meanwhile, the proposed HMA-based scheme can achieve a higher sum rate than the conventional phase-shifter-based hybrid analog/digital one with the same array aperture.