Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zixian Guo

Decoupled Visual Interpretation and Linguistic Reasoning for Math Problem Solving

May 23, 2025

Zixian Guo, Ming Liu, Zhilong Ji, Jinfeng Bai, Lei Zhang, Wangmeng Zuo

Abstract:Current large vision-language models (LVLMs) typically employ a connector module to link visual features with text embeddings of large language models (LLMs) and use end-to-end training to achieve multi-modal understanding in a unified process. Well alignment needs high-quality pre-training data and a carefully designed training process. Current LVLMs face challenges when addressing complex vision-language reasoning tasks, with their reasoning capabilities notably lagging behind those of LLMs. This paper proposes a paradigm shift: instead of training end-to-end vision-language reasoning models, we advocate for developing a decoupled reasoning framework based on existing visual interpretation specialists and text-based reasoning LLMs. Our approach leverages (1) a dedicated vision-language model to transform the visual content of images into textual descriptions and (2) an LLM to perform reasoning according to the visual-derived text and the original question. This method presents a cost-efficient solution for multi-modal model development by optimizing existing models to work collaboratively, avoiding end-to-end development of vision-language models from scratch. By transforming images into language model-compatible text representations, it facilitates future low-cost and flexible upgrades to upcoming powerful LLMs. We introduce an outcome-rewarded joint-tuning strategy to optimize the cooperation between the visual interpretation and linguistic reasoning model. Evaluation results on vision-language benchmarks demonstrate that the decoupled reasoning framework outperforms recent LVLMs. Our approach yields particularly significant performance gains on visually intensive geometric mathematics problems. The code is available: https://github.com/guozix/DVLR.

Via

Access Paper or Ask Questions

SelfBC: Self Behavior Cloning for Offline Reinforcement Learning

Aug 04, 2024

Shirong Liu, Chenjia Bai, Zixian Guo, Hao Zhang, Gaurav Sharma, Yang Liu

Figure 1 for SelfBC: Self Behavior Cloning for Offline Reinforcement Learning

Figure 2 for SelfBC: Self Behavior Cloning for Offline Reinforcement Learning

Figure 3 for SelfBC: Self Behavior Cloning for Offline Reinforcement Learning

Figure 4 for SelfBC: Self Behavior Cloning for Offline Reinforcement Learning

Abstract:Policy constraint methods in offline reinforcement learning employ additional regularization techniques to constrain the discrepancy between the learned policy and the offline dataset. However, these methods tend to result in overly conservative policies that resemble the behavior policy, thus limiting their performance. We investigate this limitation and attribute it to the static nature of traditional constraints. In this paper, we propose a novel dynamic policy constraint that restricts the learned policy on the samples generated by the exponential moving average of previously learned policies. By integrating this self-constraint mechanism into off-policy methods, our method facilitates the learning of non-conservative policies while avoiding policy collapse in the offline setting. Theoretical results show that our approach results in a nearly monotonically improved reference policy. Extensive experiments on the D4RL MuJoCo domain demonstrate that our proposed method achieves state-of-the-art performance among the policy constraint methods.

Via

Access Paper or Ask Questions

Two Optimizers Are Better Than One: LLM Catalyst for Enhancing Gradient-Based Optimization

May 31, 2024

Zixian Guo, Ming Liu, Zhilong Ji, Jinfeng Bai, Yiwen Guo, Wangmeng Zuo

Figure 1 for Two Optimizers Are Better Than One: LLM Catalyst for Enhancing Gradient-Based Optimization

Figure 2 for Two Optimizers Are Better Than One: LLM Catalyst for Enhancing Gradient-Based Optimization

Figure 3 for Two Optimizers Are Better Than One: LLM Catalyst for Enhancing Gradient-Based Optimization

Figure 4 for Two Optimizers Are Better Than One: LLM Catalyst for Enhancing Gradient-Based Optimization

Abstract:Learning a skill generally relies on both practical experience by doer and insightful high-level guidance by instructor. Will this strategy also work well for solving complex non-convex optimization problems? Here, a common gradient-based optimizer acts like a disciplined doer, making locally optimal update at each step. Recent methods utilize large language models (LLMs) to optimize solutions for concrete problems by inferring from natural language instructions, akin to a high-level instructor. In this paper, we show that these two optimizers are complementary to each other, suggesting a collaborative optimization approach. The gradient-based optimizer and LLM-based optimizer are combined in an interleaved manner. We instruct LLMs using task descriptions and timely optimization trajectories recorded during gradient-based optimization. Inferred results from LLMs are used as restarting points for the next stage of gradient optimization. By leveraging both the locally rigorous gradient-based optimizer and the high-level deductive LLM-based optimizer, our combined optimization method consistently yields improvements over competitive baseline prompt tuning methods. Our results demonstrate the synergistic effect of conventional gradient-based optimization and the inference ability of LLMs. The code is released at https://github.com/guozix/LLM-catalyst.

Via

Access Paper or Ask Questions

Black-Box Tuning of Vision-Language Models with Effective Gradient Approximation

Dec 26, 2023

Zixian Guo, Yuxiang Wei, Ming Liu, Zhilong Ji, Jinfeng Bai, Yiwen Guo, Wangmeng Zuo

Abstract:Parameter-efficient fine-tuning (PEFT) methods have provided an effective way for adapting large vision-language models to specific tasks or scenarios. Typically, they learn a very small scale of parameters for pre-trained models in a white-box formulation, which assumes model architectures to be known and parameters to be accessible. However, large models are often not open-source due to considerations of preventing abuse or commercial factors, hence posing a barrier to the deployment of white-box PEFT methods. To alleviate the dependence on model accessibility, we introduce collaborative black-box tuning (CBBT) for both textual prompt optimization and output feature adaptation for black-box models. Specifically, considering that the backpropagation gradients are blocked, we approximate the gradients of textual prompts by analyzing the predictions with perturbed prompts. Secondly, a lightweight adapter is deployed over the output feature of the inaccessible model, further facilitating the model adaptation process. Empowered with these designs, our CBBT is extensively evaluated on eleven downstream benchmarks and achieves remarkable improvements compared to existing black-box VL adaptation methods. Code is released at https://github.com/guozix/cbbt.

Via

Access Paper or Ask Questions

Texts as Images in Prompt Tuning for Multi-Label Image Recognition

Nov 23, 2022

Zixian Guo, Bowen Dong, Zhilong Ji, Jinfeng Bai, Yiwen Guo, Wangmeng Zuo

Figure 1 for Texts as Images in Prompt Tuning for Multi-Label Image Recognition

Figure 2 for Texts as Images in Prompt Tuning for Multi-Label Image Recognition

Figure 3 for Texts as Images in Prompt Tuning for Multi-Label Image Recognition

Figure 4 for Texts as Images in Prompt Tuning for Multi-Label Image Recognition

Abstract:Prompt tuning has been employed as an efficient way to adapt large vision-language pre-trained models (e.g. CLIP) to various downstream tasks in data-limited or label-limited settings. Nonetheless, visual data (e.g., images) is by default prerequisite for learning prompts in existing methods. In this work, we advocate that the effectiveness of image-text contrastive learning in aligning the two modalities (for training CLIP) further makes it feasible to treat texts as images for prompt tuning and introduce TaI prompting. In contrast to the visual data, text descriptions are easy to collect, and their class labels can be directly derived. Particularly, we apply TaI prompting to multi-label image recognition, where sentences in the wild serve as alternatives to images for prompt tuning. Moreover, with TaI, double-grained prompt tuning (TaI-DPT) is further presented to extract both coarse-grained and fine-grained embeddings for enhancing the multi-label recognition performance. Experimental results show that our proposed TaI-DPT outperforms zero-shot CLIP by a large margin on multiple benchmarks, e.g., MS-COCO, VOC2007, and NUS-WIDE, while it can be combined with existing methods of prompting from images to improve recognition performance further. Code is released at https://github.com/guozix/TaI-DPT.

Via

Access Paper or Ask Questions