Xiaoying Zhang

Building Task Bots with Self-learning for Enhanced Adaptability, Extensibility, and Factuality

Aug 27, 2025

Xiaoying Zhang

Abstract: Developing adaptable, extensible, and accurate task bots with minimal or zero human intervention is a significant challenge in dialog research. This thesis examines the obstacles and potential solutions for creating such bots, focusing on innovative techniques that enable bots to learn and adapt autonomously in constantly changing environments.

* 179 pages

Seed1.5-VL Technical Report

May 11, 2025

Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang(+187 more)

Abstract: We present Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning. Seed1.5-VL is composed of a 532M-parameter vision encoder and a Mixture-of-Experts (MoE) LLM with 20B active parameters. Despite its relatively compact architecture, it delivers strong performance across a wide spectrum of public VLM benchmarks and internal evaluation suites, achieving state-of-the-art performance on 38 out of 60 public benchmarks. Moreover, in agent-centric tasks such as GUI control and gameplay, Seed1.5-VL outperforms leading multimodal systems, including OpenAI CUA and Claude 3.7. Beyond visual and video understanding, it also demonstrates strong reasoning abilities, making it particularly effective for multimodal reasoning challenges such as visual puzzles. We believe these capabilities will empower broader applications across diverse tasks. In this report, we mainly provide a comprehensive review of our experiences in building Seed1.5-VL across model design, data construction, and training at various stages, hoping that this report can inspire further research. Seed1.5-VL is now accessible at https://www.volcengine.com/ (Volcano Engine Model ID: doubao-1-5-thinking-vision-pro-250428).

Towards Self-Improving Systematic Cognition for Next-Generation Foundation MLLMs

Mar 16, 2025

Xiaoying Zhang, Da Peng, Yipeng Zhang, Zonghao Guo, Chengyue Wu, Chi Chen, Wei Ke, Helen Meng, Maosong Sun

Abstract: Despite their impressive capabilities, Multimodal Large Language Models (MLLMs) face challenges with fine-grained perception and complex reasoning. Prevalent pre-training approaches focus on enhancing perception by training on high-quality image captions due to the extremely high cost of collecting chain-of-thought (CoT) reasoning data for improving reasoning. While leveraging advanced MLLMs for caption generation enhances scalability, the outputs often lack comprehensiveness and accuracy. In this paper, we introduce Self-Improving Cognition (SIcog), a self-learning framework designed to construct next-generation foundation MLLMs by enhancing their systematic cognitive capabilities through multimodal pre-training with self-generated data. Specifically, we propose chain-of-description, an approach that improves an MLLM's systematic perception by enabling step-by-step visual understanding, ensuring greater comprehensiveness and accuracy. Additionally, we adopt a structured CoT reasoning technique to enable MLLMs to integrate in-depth multimodal reasoning. To construct a next-generation foundation MLLM with self-improved cognition, SIcog first equips an MLLM with systematic perception and reasoning abilities using minimal external annotations. The enhanced models then generate detailed captions and CoT reasoning data, which are further curated through self-consistency. This curated data is ultimately used to refine the MLLM during multimodal pre-training, facilitating next-generation foundation MLLM construction. Extensive experiments on both low- and high-resolution MLLMs across diverse benchmarks demonstrate that, with merely 213K self-generated pre-training samples, SIcog produces next-generation foundation MLLMs with significantly improved cognition, achieving benchmark-leading performance compared to prevalent pre-training approaches.
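
The abstract says the self-generated data is "curated through self-consistency" but does not give the procedure. As a rough illustration only, the sketch below uses simple majority voting over several sampled answers; the function name `self_consistent_answer` and the `min_agreement` threshold are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Optional, Sequence


def self_consistent_answer(
    samples: Sequence[str], min_agreement: float = 0.5
) -> Optional[str]:
    """Return the majority answer if enough samples agree, else None.

    Assumption: each sample is the final answer extracted from one
    independently generated reasoning chain for the same input.
    """
    if not samples:
        return None
    # Most frequent answer and how many samples produced it.
    answer, count = Counter(samples).most_common(1)[0]
    # Keep the example only when agreement reaches the threshold.
    return answer if count / len(samples) >= min_agreement else None
```

Under this assumed rule, an input whose sampled answers disagree too much returns `None` and would be dropped from the curated set.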

* 38 pages

Conversational Dueling Bandits in Generalized Linear Models

Jul 26, 2024

Shuhua Yang, Hui Yuan, Xiaoying Zhang, Mengdi Wang, Hong Zhang, Huazheng Wang

Abstract: Conversational recommendation systems elicit user preferences by interacting with users to obtain their feedback on recommended commodities. Such systems utilize a multi-armed bandit framework to learn user preferences in an online manner and have achieved great success in recent years. However, existing conversational bandit methods have several limitations. First, they only enable users to provide explicit binary feedback on the recommended items or categories, leading to ambiguity in interpretation. In practice, users are usually faced with more than one choice. Relative feedback, known for its informativeness, has gained increasing popularity in recommendation system design. Moreover, current contextual bandit methods mainly work under linear reward assumptions, ignoring practical non-linear reward structures in generalized linear models. Therefore, in this paper, we introduce relative feedback-based conversations into conversational recommendation systems through the integration of dueling bandits in generalized linear models (GLM) and propose a novel conversational dueling bandit algorithm called ConDuel. Theoretical analyses of regret upper bounds and empirical validations on synthetic and real-world data underscore ConDuel's efficacy. We also demonstrate the potential to extend our algorithm to multinomial logit bandits with theoretical and experimental guarantees, which further demonstrates the applicability of the proposed framework.
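
To make "relative feedback in a generalized linear model" concrete, here is a minimal sketch of a standard dueling-bandit preference model with a logistic link. It is not ConDuel itself: the logistic link, the parameter vector `theta`, and the function name `duel_probability` are illustrative assumptions, since the abstract does not state the exact model.

```python
import math
from typing import Sequence


def duel_probability(
    theta: Sequence[float], x_a: Sequence[float], x_b: Sequence[float]
) -> float:
    """Probability that item a is preferred over item b.

    Assumed model: P(a > b) = sigmoid(theta . (x_a - x_b)), i.e. a GLM
    with a logistic link applied to the feature difference.
    """
    score = sum(t * (a - b) for t, a, b in zip(theta, x_a, x_b))
    return 1.0 / (1.0 + math.exp(-score))
```

Under this assumed model the two outcomes of a duel are complementary, so `duel_probability(theta, x_a, x_b) + duel_probability(theta, x_b, x_a)` equals 1, and two identical items are preferred with probability 0.5.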

User-Creator Feature Dynamics in Recommender Systems with Dual Influence

Jul 19, 2024

Tao Lin, Kun Jin, Andrew Estornell, Xiaoying Zhang, Yiling Chen, Yang Liu

Abstract: Recommender systems present relevant content to users and help content creators reach their target audience. The dual nature of these systems influences both users and creators: users' preferences are affected by the items they are recommended, while creators are incentivized to alter their content so that it is recommended more frequently. We define a model, called user-creator feature dynamics, to capture the dual influences of recommender systems. We prove that a recommender system with dual influence is guaranteed to polarize, causing diversity loss in the system. We then investigate, both theoretically and empirically, approaches for mitigating polarization and promoting diversity in recommender systems. Unexpectedly, we find that common diversity-promoting approaches do not work in the presence of dual influence, while relevancy-optimizing methods like top-$k$ recommendation can prevent polarization and improve diversity of the system.

Toward Optimal LLM Alignments Using Two-Player Games

Jun 16, 2024

Rui Zheng, Hongyi Guo, Zhihan Liu, Xiaoying Zhang, Yuanshun Yao, Xiaojun Xu, Zhaoran Wang, Zhiheng Xi, Tao Gui, Qi Zhang(+3 more)

Abstract: The standard Reinforcement Learning from Human Feedback (RLHF) framework primarily focuses on optimizing the performance of large language models using pre-collected prompts. However, collecting prompts that provide comprehensive coverage is both tedious and challenging, and often fails to include the scenarios that LLMs most need to improve on. In this paper, we investigate alignment through the lens of two-agent games, involving iterative interactions between an adversarial and a defensive agent. The adversarial agent's task at each step is to generate prompts that expose the weaknesses of the defensive agent. In return, the defensive agent seeks to improve its responses to these newly identified prompts it struggled with, based on feedback from the reward model. We theoretically demonstrate that this iterative reinforcement learning optimization converges to a Nash Equilibrium for the game induced by the agents. Experimental results in safety scenarios demonstrate that learning in such a competitive environment not only fully trains agents but also leads to policies with enhanced generalization capabilities for both adversarial and defensive agents.

* Our code is released at https://github.com/ruizheng20/gpo

Self-Tuning: Instructing LLMs to Effectively Acquire New Knowledge through Self-Teaching

Jun 11, 2024

Xiaoying Zhang, Baolin Peng, Ye Tian, Jingyan Zhou, Yipeng Zhang, Haitao Mi, Helen Meng

Abstract: Large language models (LLMs) often struggle to provide up-to-date information due to their one-time training and the constantly evolving nature of the world. To keep LLMs current, existing approaches typically involve continued pre-training on new documents. However, they frequently face difficulties in extracting stored knowledge. Motivated by the remarkable success of the Feynman Technique in efficient human learning, we introduce Self-Tuning, a learning framework aimed at improving an LLM's ability to effectively acquire new knowledge from raw documents through self-teaching. Specifically, we develop a Self-Teaching strategy that augments the documents with a set of knowledge-intensive tasks created in a self-supervised manner, focusing on three crucial aspects: memorization, comprehension, and self-reflection. Additionally, we introduce three Wiki-Newpages-2023-QA datasets to facilitate an in-depth analysis of an LLM's knowledge acquisition ability concerning memorization, extraction, and reasoning. Extensive experimental results on Llama2 family models reveal that Self-Tuning consistently exhibits superior performance across all knowledge acquisition tasks and excels in preserving previous knowledge.

* 30 pages

GI-Free Pilot-Aided Channel Estimation for Affine Frequency Division Multiplexing Systems

Apr 01, 2024

Yu Zhou, Haoran Yin, Nanhao Zhou, Yanqun Tang, Xiaoying Zhang, Weijie Yuan

Abstract: The recently developed affine frequency division multiplexing (AFDM) can achieve full diversity in doubly selective channels, providing a comprehensive sparse representation of the delay-Doppler domain channel. Thus, accurate channel estimation is feasible by using just one pilot symbol. However, traditional AFDM channel estimation schemes necessitate the use of guard intervals (GI) to mitigate data-pilot interference, leading to spectral efficiency degradation. In this paper, we propose a GI-free pilot-aided channel estimation algorithm for AFDM systems, which improves spectral efficiency significantly. To mitigate the interference between the pilot and data symbols caused by the absence of GI, we perform joint interference cancellation, channel estimation, and signal detection iteratively. Simulation results show that the bit error rate (BER) performance of the proposed method can approach the ideal case with perfect channel estimation.

Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards

Mar 14, 2024

Wei Shen, Xiaoying Zhang, Yuanshun Yao, Rui Zheng, Hongyi Guo, Yang Liu

Abstract: Reinforcement learning from human feedback (RLHF) is the mainstream paradigm used to align large language models (LLMs) with human preferences. Yet existing RLHF heavily relies on accurate and informative reward models, which are vulnerable and sensitive to noise from various sources, e.g., human labeling errors, making the pipeline fragile. In this work, we improve the effectiveness of the reward model by introducing a penalty term on the reward, named contrastive rewards. Our approach involves two steps: (1) an offline sampling step to obtain responses to prompts that serve as baseline calculation and (2) a contrastive reward calculated using the baseline responses and used in the Proximal Policy Optimization (PPO) step. We show that contrastive rewards enable the LLM to penalize reward uncertainty, improve robustness, encourage improvement over baselines, calibrate according to task difficulty, and reduce variance in PPO. We show empirically that contrastive rewards can improve RLHF substantially, evaluated by both GPTs and humans, and our method consistently outperforms strong baselines.
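
The two steps above can be sketched roughly as follows. The abstract does not give the exact formula, so this assumes the simplest reading: the penalty is the mean reward of the offline baseline responses for the same prompt, subtracted from the raw reward. The names `baseline_reward`, `contrastive_reward`, and `RewardFn` are hypothetical.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical reward-model interface: (prompt, response) -> scalar reward.
RewardFn = Callable[[str, str], float]


def baseline_reward(
    reward_fn: RewardFn, prompt: str, baseline_responses: Sequence[str]
) -> float:
    """Step 1 (offline): average reward of pre-sampled baseline responses."""
    return mean(reward_fn(prompt, r) for r in baseline_responses)


def contrastive_reward(
    reward_fn: RewardFn,
    prompt: str,
    response: str,
    baseline_responses: Sequence[str],
) -> float:
    """Step 2: raw reward minus the per-prompt baseline (assumed form)."""
    raw = reward_fn(prompt, response)
    return raw - baseline_reward(reward_fn, prompt, baseline_responses)
```

In this reading, a response only earns a positive signal in the PPO step when it scores above what the baseline policy already achieves on that prompt, which is one way to interpret "encourage improvement over baselines" and "calibrate according to task difficulty".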

Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation

Mar 08, 2024

Xiaoying Zhang, Jean-Francois Ton, Wei Shen, Hongning Wang, Yang Liu

Abstract: We introduce Adversarial Policy Optimization (AdvPO), a novel solution to the pervasive issue of reward over-optimization in Reinforcement Learning from Human Feedback (RLHF) for Large Language Models (LLMs). Over-optimization occurs when a reward model serves as an imperfect proxy for human preference, and RL-driven policy optimization erroneously exploits reward inaccuracies. In this paper, we begin by introducing a lightweight way to quantify uncertainties in rewards, relying solely on the last layer embeddings of the reward model, without the need for computationally expensive reward ensembles. AdvPO then addresses a distributionally robust optimization problem centred around the confidence interval of the reward model's predictions for policy improvement. Through comprehensive experiments on the Anthropic HH and TL;DR summarization datasets, we illustrate the efficacy of AdvPO in mitigating the over-optimization issue, resulting in improved performance under human-assisted evaluation.
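
The abstract does not state how uncertainty is computed from the last-layer embeddings. One common construction from the linear-bandit literature, given here purely as an assumed illustration, treats the reward as linear in the embedding $e(x, y)$ and measures uncertainty by the embedding's norm under the inverse regularized covariance of the training embeddings. The symbols $e$, $M$, $\lambda$, $b$, and $\hat{r}$ are notation introduced for this sketch.

```latex
% Assumed notation: e(x, y) is the last-layer embedding of prompt x and
% response y; e_i are embeddings of the N reward-model training examples.
M = \lambda I + \sum_{i=1}^{N} e_i e_i^{\top}, \qquad
U(x, y) = b \, \sqrt{e(x, y)^{\top} M^{-1} e(x, y)}

% A confidence interval around the predicted reward \hat{r}(x, y):
r(x, y) \in \bigl[\, \hat{r}(x, y) - U(x, y),\; \hat{r}(x, y) + U(x, y) \,\bigr]
```

Under this assumption, computing $U$ needs only one forward pass and one precomputed matrix inverse, which is consistent with the abstract's claim that no reward ensemble is required.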
