Text-centric visual question answering (VQA) has made great strides with the development of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of leading models like GPT4V and Gemini, partly due to a lack of extensive, high-quality instruction tuning data. To this end, we introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M, which is generated using closed-source MLLMs. The data construction process, termed Square, consists of four steps: Self-Questioning, Answering, Reasoning, and Evaluation. Our experiments with Square-10M led to three key findings: 1) Our model, TextSquare, considerably surpasses open-source previous state-of-the-art Text-centric MLLMs and sets a new standard on OCRBench(62.2%). It even outperforms top-tier models like GPT4V and Gemini in 6 of 10 text-centric benchmarks. 2) Additionally, we demonstrate the critical role of VQA reasoning data in offering comprehensive contextual insights for specific questions. This not only improves accuracy but also significantly mitigates hallucinations. Specifically, TextSquare scores an average of 75.1% across four general VQA and hallucination evaluation datasets, outperforming previous state-of-the-art models. 3) Notably, the phenomenon observed in scaling text-centric VQA datasets reveals a vivid pattern: the exponential increase of instruction tuning data volume is directly proportional to the improvement in model performance, thereby validating the necessity of the dataset scale and the high quality of Square-10M.
In the rapidly advancing research fields such as AI, managing and staying abreast of the latest scientific literature has become a significant challenge for researchers. Although previous efforts have leveraged AI to assist with literature searches, paper recommendations, and question-answering, a comprehensive support system that addresses the holistic needs of researchers has been lacking. This paper introduces SurveyAgent, a novel conversational system designed to provide personalized and efficient research survey assistance to researchers. SurveyAgent integrates three key modules: Knowledge Management for organizing papers, Recommendation for discovering relevant literature, and Query Answering for engaging with content on a deeper level. This system stands out by offering a unified platform that supports researchers through various stages of their literature review process, facilitated by a conversational interface that prioritizes user interaction and personalization. Our evaluation demonstrates SurveyAgent's effectiveness in streamlining research activities, showcasing its capability to facilitate how researchers interact with scientific literature.
In this paper, we consider the physical layer security of an RIS-assisted multiple-antenna communication system with randomly located eavesdroppers. The exact distributions of the received signal-to-noise-ratios (SNRs) at the legitimate user and the eavesdroppers located according to a Poisson point process (PPP) are derived, and a closed-form expression for the secrecy outage probability (SOP) is obtained. It is revealed that the secrecy performance is mainly affected by the number of RIS reflecting elements, and the impact of the transmit antennas and transmit power at the base station is marginal. In addition, when the locations of the randomly located eavesdroppers are unknown, deploying the RIS closer to the legitimate user rather than to the base station is shown to be more efficient. We also perform an analytical study demonstrating that the secrecy diversity order depends on the path loss exponent of the RIS-to-ground links. Finally, numerical simulations are conducted to verify the accuracy of these theoretical observations.
We examine pulse amplitude modulation (PAM) for intensity modulation and direct detection systems. Using a straight-forward, mixed noise model, we optimize the constellations with an autoencoder-based neural network (NN), an improve required signal-to-noise ratio of 4 dB for amplified spontaneous emission (ASE)-limited PAM4 and PAM8, without increasing system complexity. Performance can also be improved in O-band wavelength division multiplexing system with semiconductor optical amplifier amplification and chromatic dispersion. We show via simulation that for such a system operating at 53 Gbaud, we can extend the reach of PAM4 by 10-25 km with an optimized constellation and a NN decoder. We present an experimental validation of 4 dB improvement of an ASE-limited PAM4 at 60 Gbaud using an optimized constellation and a NN decoder.
In the rapidly evolving landscape of information retrieval, search engines strive to provide more personalized and relevant results to users. Query suggestion systems play a crucial role in achieving this goal by assisting users in formulating effective queries. However, existing query suggestion systems mainly rely on textual inputs, potentially limiting user search experiences for querying images. In this paper, we introduce a novel Multimodal Query Suggestion (MMQS) task, which aims to generate query suggestions based on user query images to improve the intentionality and diversity of search results. We present the RL4Sugg framework, leveraging the power of Large Language Models (LLMs) with Multi-Agent Reinforcement Learning from Human Feedback to optimize the generation process. Through comprehensive experiments, we validate the effectiveness of RL4Sugg, demonstrating a 18% improvement compared to the best existing approach. Moreover, the MMQS has been transferred into real-world search engine products, which yield enhanced user engagement. Our research advances query suggestion systems and provides a new perspective on multimodal information retrieval.
Reconfigurable intelligent surface (RIS) technology is emerging as a promising technique for performance enhancement for next-generation wireless networks. This paper investigates the physical layer security of an RIS-assisted multiple-antenna communication system in the presence of random spatially distributed eavesdroppers. The RIS-to-ground channels are assumed to experience Rician fading. Using stochastic geometry, exact distributions of the received signal-to-noise-ratios (SNRs) at the legitimate user and the eavesdroppers located according to a Poisson point process (PPP) are derived, and closed-form expressions for the secrecy outage probability (SOP) and the ergodic secrecy capacity (ESC) are obtained to provide insightful guidelines for system design. First, the secrecy diversity order is obtained as $\frac{2}{\alpha_2}$, where $\alpha_2$ denotes the path loss exponent of the RIS-to-ground links. Then, it is revealed that the secrecy performance is mainly affected by the number of RIS reflecting elements, $N$, and the impact of the number of transmit antennas and transmit power at the base station is marginal. In addition, when the locations of the randomly located eavesdroppers are unknown, deploying the RIS closer to the legitimate user rather than to the base station is shown to be more efficient. Moreover, it is also found that the density of randomly located eavesdroppers, $\lambda_e$, has an additive effect on the asymptotic ESC performance given by $\log_2{\left({1}/{\lambda_e}\right)}$. Finally, numerical simulations are conducted to verify the accuracy of these theoretical observations.
Recently, there has been significant progress in the development of large models. Following the success of ChatGPT, numerous language models have been introduced, demonstrating remarkable performance. Similar advancements have also been observed in image generation models, such as Google's Imagen model, OpenAI's DALL-E 2, and stable diffusion models, which have exhibited impressive capabilities in generating images. However, similar to large language models, these models still encounter unresolved challenges. Fortunately, the availability of open-source stable diffusion models and their underlying mathematical principles has enabled the academic community to extensively analyze the performance of current image generation models and make improvements based on this stable diffusion framework. This survey aims to examine the existing issues and the current solutions pertaining to image generation models.
Ultra-reliable low latency communications (URLLC) service is envisioned to enable use cases with strict reliability and latency requirements in 5G. One approach for enabling URLLC services is to leverage Reinforcement Learning (RL) to efficiently allocate wireless resources. However, with conventional RL methods, the decision variables (though being deployed at various network layers) are typically optimized in the same control loop, leading to significant practical limitations on the control loop's delay as well as excessive signaling and energy consumption. In this paper, we propose a multi-agent Hierarchical RL (HRL) framework that enables the implementation of multi-level policies with different control loop timescales. Agents with faster control loops are deployed closer to the base station, while the ones with slower control loops are at the edge or closer to the core network providing high-level guidelines for low-level actions. On a use case from the prior art, with our HRL framework, we optimized the maximum number of retransmissions and transmission power of industrial devices. Our extensive simulation results on the factory automation scenario show that the HRL framework achieves better performance as the baseline single-agent RL method, with significantly less overhead of signal transmissions and delay compared to the one-agent RL methods.