Wei Shi

ParGo: Bridging Vision-Language with Partial and Global Views

Aug 23, 2024

An-Lan Wang, Bin Shan, Wei Shi, Kun-Yu Lin, Xiang Fei, Guozhi Tang, Lei Liao, Jingqun Tang, Can Huang, Wei-Shi Zheng

Abstract:This work presents ParGo, a novel Partial-Global projector designed to connect the vision and language modalities for Multimodal Large Language Models (MLLMs). Unlike previous works that rely on global attention-based projectors, ParGo bridges the representation gap between separately pre-trained vision encoders and LLMs by integrating global and partial views, which alleviates the overemphasis on prominent regions. To facilitate the effective training of ParGo, we collect a large-scale detail-captioned image-text dataset named ParGoCap-1M-PT, consisting of 1 million images paired with high-quality captions. Extensive experiments on several MLLM benchmarks demonstrate the effectiveness of ParGo, highlighting its superiority in aligning vision and language modalities. Compared to the conventional Q-Former projector, ParGo achieves an improvement of 259.96 on the MME benchmark. Furthermore, our experiments reveal that ParGo significantly outperforms other projectors, particularly in tasks that emphasize detail perception ability.

Empowering Over-the-Air Personalized Federated Learning via RIS

Aug 22, 2024

Wei Shi, Jiacheng Yao, Jindan Xu, Wei Xu, Lexi Xu, Chunming Zhao

Abstract:Over-the-air computation (AirComp) integrates analog communication with task-oriented computation, serving as a key enabling technique for communication-efficient federated learning (FL) over wireless networks. However, AirComp-enabled FL (AirFL) with a single global consensus model fails to address the data heterogeneity in real-life FL scenarios with non-independent and identically distributed local datasets. In this paper, we introduce reconfigurable intelligent surface (RIS) technology to enable efficient personalized AirFL, mitigating the data heterogeneity issue. First, we achieve statistical interference elimination across different clusters in the personalized AirFL framework via RIS phase shift configuration. Then, we propose two personalized aggregation schemes involving power control and denoising factor design from the perspectives of first- and second-order moments, respectively, to enhance the FL convergence. Numerical results validate the superior performance of our proposed schemes over existing baselines.

* Accepted by SCIENCE CHINA Information Sciences
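
The basic AirComp idea in the abstract above (devices transmit at once, the channel adds their signals, and the receiver rescales the sum) can be illustrated with a minimal sketch. It rests on simplifying assumptions that are not taken from the paper: scalar updates, real positive channel gains known to each device, channel-inversion pre-scaling, and no RIS or clustering. The names `aircomp_average`, `eta`, and `noise_std` are hypothetical.

```python
import random


def aircomp_average(local_updates, channel_gains, eta=1.0, noise_std=0.0, rng=None):
    """Estimate the mean of scalar local updates from a simulated analog superposition.

    Assumption (not from the paper): each device knows its real, positive channel gain.
    """
    rng = rng or random.Random(0)
    # Each device pre-scales its update so that its channel gain is inverted.
    transmitted = [(eta ** 0.5 / h) * x for x, h in zip(local_updates, channel_gains)]
    # The channel sums all signals "over the air"; the receiver sees the sum plus noise.
    received = sum(h * s for h, s in zip(channel_gains, transmitted))
    received += rng.gauss(0.0, noise_std)
    # The receiver undoes the power scaling (denoising factor) and divides by the device count.
    return received / (eta ** 0.5 * len(local_updates))


estimate = aircomp_average([1.0, 2.0, 3.0], [0.5, 1.0, 2.0])
print(estimate)
```

With `noise_std=0.0` the estimate equals the exact average of the updates. With noise and a fixed random draw, a larger `eta` (more transmit power) shrinks the estimation error, which is the trade-off that power control and denoising factor design target.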

Benchmarking LLMs for Optimization Modeling and Enhancing Reasoning via Reverse Socratic Synthesis

Jul 13, 2024

Zhicheng Yang, Yinya Huang, Wei Shi, Liang Feng, Linqi Song, Yiwei Wang, Xiaodan Liang, Jing Tang

Abstract:Large language models (LLMs) have exhibited their problem-solving ability in mathematical reasoning. Solving realistic optimization (OPT) problems in industrial application scenarios requires advanced and applied math ability. However, current OPT benchmarks that merely solve linear programming are far from complex realistic situations. In this work, we propose E-OPT, a benchmark for end-to-end optimization problem-solving with human-readable inputs and outputs. E-OPT contains rich optimization problems, including linear/nonlinear programming with/without table data, which can comprehensively evaluate LLMs' solving ability. In our benchmark, LLMs are required to correctly understand the problem in E-OPT and call a code solver to get precise numerical answers. Furthermore, to alleviate the data scarcity for optimization problems, and to bridge the gap between small-scale open-source LLMs (e.g., Llama-2-7b and Llama-3-8b) and closed-source LLMs (e.g., GPT-4), we further propose a novel data synthesis method named ReSocratic. Unlike general data synthesis methods that proceed from questions to answers, ReSocratic first incrementally synthesizes optimization scenarios with mathematical formulations step by step and then back-translates the generated scenarios into questions. In this way, we construct the ReSocratic-29k dataset from a small seed sample pool with the powerful open-source large model DeepSeek-V2. To demonstrate the effectiveness of ReSocratic, we conduct supervised fine-tuning with ReSocratic-29k on multiple open-source models. The results show that Llama3-8b is significantly improved from 13.6% to 51.7% on E-OPT, while DeepSeek-V2 reaches 61.0%, approaching GPT-4's 65.5%.
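
To make the "understand the problem, then call a code solver" step concrete, here is a minimal sketch of what a formulated problem handed to a solver might look like. Everything in it is an illustrative assumption: the word problem is invented, and `solve_small_integer_program` is a hypothetical brute-force stand-in for a real solver, not part of E-OPT.

```python
from itertools import product


def solve_small_integer_program(objective, constraints, upper_bound):
    """Maximize objective[0]*x + objective[1]*y over integers 0..upper_bound.

    Each constraint (a, b, rhs) means a*x + b*y <= rhs.
    Brute force is only a stand-in for a real optimization solver.
    """
    best_value, best_point = None, None
    for x, y in product(range(upper_bound + 1), repeat=2):
        if all(a * x + b * y <= rhs for a, b, rhs in constraints):
            value = objective[0] * x + objective[1] * y
            if best_value is None or value > best_value:
                best_value, best_point = value, (x, y)
    return best_value, best_point


# Hypothetical word problem: a workshop earns 3 per chair (x) and 2 per table (y).
# It has 4 units of labor (x + y <= 4), 6 units of wood (x + 3y <= 6),
# and can sell at most 3 chairs (x <= 3).
value, point = solve_small_integer_program(
    objective=(3, 2),
    constraints=[(1, 1, 4), (1, 3, 6), (1, 0, 3)],
    upper_bound=4,
)
print(value, point)  # prints: 11 (3, 1)
```

The point of the sketch is the division of labor: the model's job is to extract the objective and constraints correctly from the text, while the numerical answer comes from executing code rather than from free-form generation.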

M-RAG: Reinforcing Large Language Model Performance through Retrieval-Augmented Generation with Multiple Partitions

May 26, 2024

Zheng Wang, Shu Xian Teo, Jieer Ouyang, Yongjun Xu, Wei Shi

Abstract:Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by retrieving relevant memories from an external database. However, existing RAG methods typically organize all memories in a whole database, potentially limiting focus on crucial memories and introducing noise. In this paper, we introduce a multiple partition paradigm for RAG (called M-RAG), where each database partition serves as a basic unit for RAG execution. Based on this paradigm, we propose a novel framework that leverages LLMs with Multi-Agent Reinforcement Learning to optimize different language generation tasks explicitly. Through comprehensive experiments conducted on seven datasets, spanning three language generation tasks and involving three distinct language model architectures, we confirm that M-RAG consistently outperforms various baseline methods, achieving improvements of 11%, 8%, and 12% for text summarization, machine translation, and dialogue generation, respectively.

* This paper has been accepted by ACL 2024
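
The partition paradigm described in the abstract (retrieve within one database partition instead of across the whole memory store) can be sketched as follows. This is a toy illustration under stated assumptions: the partition is chosen by a simple token-overlap heuristic rather than by the multi-agent reinforcement learning the paper proposes, and the names `token_overlap` and `retrieve_from_partitions` and the sample memories are hypothetical.

```python
def token_overlap(query, text):
    """Jaccard similarity between the lowercase token sets of two strings."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


def retrieve_from_partitions(query, partitions, top_k=1):
    """Pick the best-matching partition, then retrieve only inside it.

    Assumes every partition holds at least one memory.
    """
    # Score each partition by its single best-matching memory (heuristic stand-in).
    best_name = max(
        partitions,
        key=lambda name: max(token_overlap(query, m) for m in partitions[name]),
    )
    # Rank the memories of the chosen partition only.
    ranked = sorted(
        partitions[best_name], key=lambda m: token_overlap(query, m), reverse=True
    )
    return best_name, ranked[:top_k]


partitions = {
    "translation": [
        "translate the sentence into French",
        "German to English translation pairs",
    ],
    "summarization": [
        "summarize the news article in one sentence",
        "short summary of a long report",
    ],
}
name, memories = retrieve_from_partitions("summarize this news article", partitions)
print(name, memories)
```

Restricting the ranking step to one partition keeps unrelated memories out of the candidate set, which is the noise-reduction argument the abstract makes for partitioning.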

From Persona to Personalization: A Survey on Role-Playing Language Agents

Apr 28, 2024

Jiangjie Chen, Xintao Wang, Rui Xu, Siyu Yuan, Yikai Zhang, Wei Shi, Jian Xie, Shuang Li, Ruihan Yang, Tinghui Zhu (+8 more)

Abstract:Recent advancements in large language models (LLMs) have significantly boosted the rise of Role-Playing Language Agents (RPLAs), i.e., specialized AI systems designed to simulate assigned personas. By harnessing multiple advanced abilities of LLMs, including in-context learning, instruction following, and social intelligence, RPLAs achieve a remarkable sense of human likeness and vivid role-playing performance. RPLAs can mimic a wide range of personas, from historical figures and fictional characters to real-life individuals. Consequently, they have catalyzed numerous AI applications, such as emotional companions, interactive video games, personalized assistants and copilots, and digital clones. In this paper, we conduct a comprehensive survey of this field, illustrating the evolution and recent progress of RPLAs integrated with cutting-edge LLM technologies. We categorize personas into three types: 1) Demographic Persona, which leverages statistical stereotypes; 2) Character Persona, focused on well-established figures; and 3) Individualized Persona, customized through ongoing user interactions for personalized services. We begin by presenting a comprehensive overview of current methodologies for RPLAs, followed by the details for each persona type, covering corresponding data sourcing, agent construction, and evaluation. Afterward, we discuss the fundamental risks, existing limitations, and future prospects of RPLAs. Additionally, we provide a brief review of RPLAs in AI applications, which reflects practical user demands that shape and drive RPLA research. Through this work, we aim to establish a clear taxonomy of RPLA research and applications, facilitate future research in this critical and ever-evolving field, and pave the way for a future where humans and RPLAs coexist in harmony.

* Preprint

TextSquare: Scaling up Text-Centric Visual Instruction Tuning

Apr 19, 2024

Jingqun Tang, Chunhui Lin, Zhen Zhao, Shu Wei, Binghong Wu, Qi Liu, Hao Feng, Yang Li, Siqi Wang, Lei Liao (+6 more)

Abstract:Text-centric visual question answering (VQA) has made great strides with the development of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of leading models like GPT4V and Gemini, partly due to a lack of extensive, high-quality instruction tuning data. To this end, we introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M, which is generated using closed-source MLLMs. The data construction process, termed Square, consists of four steps: Self-Questioning, Answering, Reasoning, and Evaluation. Our experiments with Square-10M led to three key findings: 1) Our model, TextSquare, considerably surpasses previous open-source state-of-the-art text-centric MLLMs and sets a new standard on OCRBench (62.2%). It even outperforms top-tier models like GPT4V and Gemini in 6 of 10 text-centric benchmarks. 2) Additionally, we demonstrate the critical role of VQA reasoning data in offering comprehensive contextual insights for specific questions. This not only improves accuracy but also significantly mitigates hallucinations. Specifically, TextSquare scores an average of 75.1% across four general VQA and hallucination evaluation datasets, outperforming previous state-of-the-art models. 3) Notably, the phenomenon observed in scaling text-centric VQA datasets reveals a vivid pattern: the exponential increase of instruction tuning data volume is directly proportional to the improvement in model performance, thereby validating the necessity of the dataset scale and the high quality of Square-10M.

SurveyAgent: A Conversational System for Personalized and Efficient Research Survey

Apr 09, 2024

Xintao Wang, Jiangjie Chen, Nianqi Li, Lida Chen, Xinfeng Yuan, Wei Shi, Xuyang Ge, Rui Xu, Yanghua Xiao

Abstract:In rapidly advancing research fields such as AI, managing and staying abreast of the latest scientific literature has become a significant challenge for researchers. Although previous efforts have leveraged AI to assist with literature searches, paper recommendations, and question-answering, a comprehensive support system that addresses the holistic needs of researchers has been lacking. This paper introduces SurveyAgent, a novel conversational system designed to provide personalized and efficient research survey assistance to researchers. SurveyAgent integrates three key modules: Knowledge Management for organizing papers, Recommendation for discovering relevant literature, and Query Answering for engaging with content on a deeper level. This system stands out by offering a unified platform that supports researchers through various stages of their literature review process, facilitated by a conversational interface that prioritizes user interaction and personalization. Our evaluation demonstrates SurveyAgent's effectiveness in streamlining research activities, showcasing its capability to improve how researchers interact with scientific literature.

* 6 pages

Secure Outage Analysis for RIS-Aided MISO Systems with Randomly Located Eavesdroppers

Mar 22, 2024

Wei Shi, Jindan Xu, Wei Xu, Chau Yuen, A. Lee Swindlehurst, Xiaohu You, Chunming Zhao

Abstract:In this paper, we consider the physical layer security of an RIS-assisted multiple-antenna communication system with randomly located eavesdroppers. The exact distributions of the received signal-to-noise-ratios (SNRs) at the legitimate user and the eavesdroppers located according to a Poisson point process (PPP) are derived, and a closed-form expression for the secrecy outage probability (SOP) is obtained. It is revealed that the secrecy performance is mainly affected by the number of RIS reflecting elements, and the impact of the transmit antennas and transmit power at the base station is marginal. In addition, when the locations of the randomly located eavesdroppers are unknown, deploying the RIS closer to the legitimate user rather than to the base station is shown to be more efficient. We also perform an analytical study demonstrating that the secrecy diversity order depends on the path loss exponent of the RIS-to-ground links. Finally, numerical simulations are conducted to verify the accuracy of these theoretical observations.

* Accepted by 2023 IEEE Globecom Workshops (GC Wkshps). arXiv admin note: substantial text overlap with arXiv:2312.16814
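
The secrecy outage probability (SOP) named in the abstract is the probability that the secrecy capacity falls below a target secrecy rate. A minimal Monte Carlo sketch of that definition follows. It deliberately does not reproduce the paper's model: there is no RIS, a single eavesdropper instead of a Poisson point process, and exponentially distributed SNRs are assumed purely for illustration. The function name and parameters are hypothetical.

```python
import math
import random


def secrecy_outage_probability(mean_snr_user, mean_snr_eve, target_rate,
                               trials=20000, seed=0):
    """Estimate P(secrecy capacity < target_rate) by simulation.

    Assumed fading model (not from the paper): instantaneous SNRs are
    exponentially distributed around the given means.
    """
    rng = random.Random(seed)
    outages = 0
    for _ in range(trials):
        snr_user = rng.expovariate(1.0 / mean_snr_user)
        snr_eve = rng.expovariate(1.0 / mean_snr_eve)
        # Secrecy capacity: legitimate rate minus eavesdropper rate, floored at zero.
        secrecy_capacity = max(
            math.log2(1 + snr_user) - math.log2(1 + snr_eve), 0.0
        )
        if secrecy_capacity < target_rate:
            outages += 1
    return outages / trials


print(secrecy_outage_probability(mean_snr_user=10.0, mean_snr_eve=1.0, target_rate=1.0))
```

A closed-form SOP such as the one the abstract reports would replace this sampling loop with an analytical expression; simulation of this kind is typically used only to check such expressions numerically.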

Auto-Encoder Optimized PAM IM/DD Transceivers for Amplified Fiber Links

Feb 12, 2024

Amir Omidi, Mai Banawan, Erwan Weckenmann, Benoit Paquin, Alireza Geravand, Zibo Zheng, Wei Shi, Ming Zeng, Leslie A. Rusch

Abstract:We examine pulse amplitude modulation (PAM) for intensity modulation and direct detection systems. Using a straightforward mixed-noise model, we optimize the constellations with an autoencoder-based neural network (NN) and improve the required signal-to-noise ratio by 4 dB for amplified spontaneous emission (ASE)-limited PAM4 and PAM8, without increasing system complexity. Performance can also be improved in an O-band wavelength division multiplexing system with semiconductor optical amplifier amplification and chromatic dispersion. We show via simulation that for such a system operating at 53 Gbaud, we can extend the reach of PAM4 by 10-25 km with an optimized constellation and an NN decoder. We present an experimental validation of a 4 dB improvement for ASE-limited PAM4 at 60 Gbaud using an optimized constellation and an NN decoder.

* 14 pages and 11 figures

Multimodal Query Suggestion with Multi-Agent Reinforcement Learning from Human Feedback

Feb 09, 2024

Zheng Wang, Bingzheng Gan, Wei Shi

Abstract:In the rapidly evolving landscape of information retrieval, search engines strive to provide more personalized and relevant results to users. Query suggestion systems play a crucial role in achieving this goal by assisting users in formulating effective queries. However, existing query suggestion systems mainly rely on textual inputs, potentially limiting user search experiences for querying images. In this paper, we introduce a novel Multimodal Query Suggestion (MMQS) task, which aims to generate query suggestions based on user query images to improve the intentionality and diversity of search results. We present the RL4Sugg framework, leveraging the power of Large Language Models (LLMs) with Multi-Agent Reinforcement Learning from Human Feedback to optimize the generation process. Through comprehensive experiments, we validate the effectiveness of RL4Sugg, demonstrating an 18% improvement compared to the best existing approach. Moreover, the MMQS has been transferred into real-world search engine products, which yield enhanced user engagement. Our research advances query suggestion systems and provides a new perspective on multimodal information retrieval.

* This paper has been accepted by WWW 2024
