
Xiang Li

China Agricultural University

SEAGraph: Unveiling the Whole Story of Paper Review Comments

Dec 16, 2024

Jianxiang Yu, Jiaqi Tan, Zichen Ding, Jiapeng Zhu, Jiahao Li, Yao Cheng, Qier Cui, Yunshi Lan, Xiang Li

Abstract:Peer review, as a cornerstone of scientific research, ensures the integrity and quality of scholarly work by providing authors with objective feedback for refinement. However, in the traditional peer review process, authors often receive vague or insufficiently detailed feedback, which provides limited assistance and leads to a more time-consuming review cycle. If authors can identify some specific weaknesses in their paper, they can not only address the reviewer's concerns but also improve their work. This raises the critical question of how to enhance authors' comprehension of review comments. In this paper, we present SEAGraph, a novel framework developed to clarify review comments by uncovering the underlying intentions behind them. We construct two types of graphs for each paper: the semantic mind graph, which captures the author's thought process, and the hierarchical background graph, which delineates the research domains related to the paper. A retrieval method is then designed to extract relevant content from both graphs, facilitating coherent explanations for the review comments. Extensive experiments show that SEAGraph excels in review comment understanding tasks, offering significant benefits to authors.

Coupling-based Convergence Diagnostic and Stepsize Scheme for Stochastic Gradient Descent

Dec 15, 2024

Xiang Li, Qiaomin Xie

Abstract:The convergence behavior of Stochastic Gradient Descent (SGD) crucially depends on the stepsize configuration. When using a constant stepsize, the SGD iterates form a Markov chain, enjoying fast convergence during the initial transient phase. However, when reaching stationarity, the iterates oscillate around the optimum without making further progress. In this paper, we study the convergence diagnostics for SGD with constant stepsize, aiming to develop an effective dynamic stepsize scheme. We propose a novel coupling-based convergence diagnostic procedure, which monitors the distance of two coupled SGD iterates for stationarity detection. Our diagnostic statistic is simple and is shown theoretically to track the transition from transience to stationarity. We conduct extensive numerical experiments and compare our method against various existing approaches. Our proposed coupling-based stepsize scheme is observed to achieve superior performance across a diverse set of convex and non-convex problems. Moreover, our results demonstrate the robustness of our approach to a wide range of hyperparameters.

* 13 pages, 30 figures, to be published in AAAI 2025
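
The coupling idea in the abstract above can be illustrated with a toy sketch. This is not the authors' procedure: the objective f(x) = x**2 / 2, the additive Gaussian gradient noise, the function names, and the threshold check are all assumptions chosen for illustration. Two constant-stepsize SGD chains start from different points and share the same gradient noise, and the distance between them is recorded at every step.

```python
import random


def coupled_sgd_distances(step_size, n_steps, x0=5.0, y0=-5.0,
                          noise_std=1.0, seed=0):
    """Run two SGD chains on f(x) = x**2 / 2 that share gradient noise.

    Hypothetical toy setup: returns the distances |x_k - y_k| after each step.
    """
    rng = random.Random(seed)
    x, y = x0, y0
    distances = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, noise_std)  # shared noise couples the chains
        x -= step_size * (x + noise)       # noisy gradient of f at x
        y -= step_size * (y + noise)       # same noise, different iterate
        distances.append(abs(x - y))
    return distances


def first_step_below(distances, threshold):
    """Return the first index whose distance is below threshold, else None."""
    for k, d in enumerate(distances):
        if d < threshold:
            return k
    return None


distances = coupled_sgd_distances(step_size=0.1, n_steps=200)
print(first_step_below(distances, threshold=1.0))
```

In this toy problem the shared noise cancels in the difference, so the distance shrinks by a factor of (1 - step_size) per step. A dynamic scheme could, for example, reduce the stepsize once the distance drops below a chosen threshold; that rule is likewise an assumption here, not something the abstract specifies.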

SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer

Dec 14, 2024

Hao Chen, Ze Wang, Xiang Li, Ximeng Sun, Fangyi Chen, Jiang Liu, Jindong Wang, Bhiksha Raj, Zicheng Liu, Emad Barsoum

Abstract:Efficient image tokenization with high compression ratios remains a critical challenge for training generative models. We present SoftVQ-VAE, a continuous image tokenizer that leverages soft categorical posteriors to aggregate multiple codewords into each latent token, substantially increasing the representation capacity of the latent space. When applied to Transformer-based architectures, our approach compresses 256x256 and 512x512 images using as few as 32 or 64 1-dimensional tokens. Not only does SoftVQ-VAE show consistent and high-quality reconstruction, more importantly, it also achieves state-of-the-art and significantly faster image generation results across different denoising-based generative models. Remarkably, SoftVQ-VAE improves inference throughput by up to 18x for generating 256x256 images and 55x for 512x512 images while achieving competitive FID scores of 1.78 and 2.21 for SiT-XL. It also improves the training efficiency of the generative models by reducing the number of training iterations by 2.3x while maintaining comparable performance. With its fully-differentiable design and semantic-rich latent space, our experiment demonstrates that SoftVQ-VAE achieves efficient tokenization without compromising generation quality, paving the way for more efficient generative models. Code and model are released.

* Code and model: https://github.com/Hhhhhhao/continuous_tokenizer
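
As a rough illustration of "soft categorical posteriors" that aggregate several codewords into one latent token, the sketch below computes a softmax over codebook entries and returns their weighted average. It is not taken from the released code: the function name, the use of negative squared distance as logits, and the temperature parameter are assumptions made for this example.

```python
import math


def soft_quantize(z, codebook, temperature=1.0):
    """Return (weights, token) for an encoder vector z.

    Assumed formulation: logits are negative squared distances to each
    codeword, and the token is the softmax-weighted sum of codewords.
    """
    logits = [
        -sum((zi - ci) ** 2 for zi, ci in zip(z, codeword)) / temperature
        for codeword in codebook
    ]
    peak = max(logits)                       # subtract max for stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]      # soft categorical posterior
    token = [
        sum(w * codeword[d] for w, codeword in zip(weights, codebook))
        for d in range(len(z))
    ]
    return weights, token


codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
weights, token = soft_quantize([0.9, 1.1], codebook, temperature=1.0)
print(weights, token)
```

With a small temperature the weights concentrate on the nearest codeword, which recovers hard vector quantization as a limiting case. With a larger temperature the token blends several codewords, and every step stays differentiable.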

InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption

Dec 12, 2024

Tiehan Fan, Kepan Nan, Rui Xie, Penghao Zhou, Zhenheng Yang, Chaoyou Fu, Xiang Li, Jian Yang, Ying Tai

Abstract:Text-to-video generation has evolved rapidly in recent years, delivering remarkable results. Training typically relies on video-caption paired data, which plays a crucial role in enhancing generation performance. However, current video captions often suffer from insufficient details, hallucinations and imprecise motion depiction, affecting the fidelity and consistency of generated videos. In this work, we propose a novel instance-aware structured caption framework, termed InstanceCap, to achieve instance-level and fine-grained video caption for the first time. Based on this scheme, we design an auxiliary models cluster to convert original video into instances to enhance instance fidelity. Video instances are further used to refine dense prompts into structured phrases, achieving concise yet precise descriptions. Furthermore, a 22K InstanceVid dataset is curated for training, and an enhancement pipeline tailored to the InstanceCap structure is proposed for inference. Experimental results demonstrate that our proposed InstanceCap significantly outperforms previous models, ensuring high fidelity between captions and videos while reducing hallucinations.

Agent-based Video Trimming

Dec 12, 2024

Lingfeng Yang, Zhenyuan Chen, Xiang Li, Peiyang Jia, Liangqu Long, Jian Yang

Abstract:As information becomes more accessible, user-generated videos are increasing in length, placing a burden on viewers to sift through vast content for valuable insights. This trend underscores the need for an algorithm to extract key video information efficiently. Despite significant advancements in highlight detection, moment retrieval, and video summarization, current approaches primarily focus on selecting specific time intervals, often overlooking the relevance between segments and the potential for segment arranging. In this paper, we introduce a novel task called Video Trimming (VT), which focuses on detecting wasted footage, selecting valuable segments, and composing them into a final video with a coherent story. To address this task, we propose Agent-based Video Trimming (AVT), structured into three phases: Video Structuring, Clip Filtering, and Story Composition. Specifically, we employ a Video Captioning Agent to convert video slices into structured textual descriptions, a Filtering Module to dynamically discard low-quality footage based on the structured information of each clip, and a Video Arrangement Agent to select and compile valid clips into a coherent final narrative. For evaluation, we develop a Video Evaluation Agent to assess trimmed videos, conducting assessments in parallel with human evaluations. Additionally, we curate a new benchmark dataset for video trimming using raw user videos from the internet. As a result, AVT received more favorable evaluations in user studies and demonstrated superior mAP and precision on the YouTube Highlights, TVSum, and our own dataset for the highlight detection task. The code and models are available at https://ylingfeng.github.io/AVT.

ATPrompt: Textual Prompt Learning with Embedded Attributes

Dec 12, 2024

Zheng Li, Yibing Song, Penghai Zhao, Ming-Ming Cheng, Xiang Li, Jian Yang

Abstract:Textual-based prompt learning methods primarily employ multiple learnable soft prompts and hard class tokens in a cascading manner as text prompt inputs, aiming to align image and text (category) spaces for downstream tasks. However, current training is restricted to aligning images with predefined known categories and cannot be associated with unknown categories. In this work, we propose utilizing universal attributes as a bridge to enhance the alignment between images and unknown categories. Specifically, we introduce an Attribute-embedded Textual Prompt learning method for vision-language models, named ATPrompt. This approach expands the learning space of soft prompts from the original one-dimensional category level into the multi-dimensional attribute level by incorporating multiple universal attribute tokens into the learnable soft prompts. Through this modification, we transform the text prompt from a category-centric form to an attribute-category hybrid form. To finalize the attributes for downstream tasks, we propose a differentiable attribute search method that learns to identify representative and suitable attributes from a candidate pool summarized by a large language model. As an easy-to-use plug-in technique, ATPrompt can seamlessly replace the existing prompt format of textual-based methods, offering general improvements at a negligible computational cost. Extensive experiments on 11 datasets demonstrate the effectiveness of our method.

* Technical Report. Project Page: https://zhengli97.github.io/ATPrompt/

PAFFA: Premeditated Actions For Fast Agents

Dec 10, 2024

Shambhavi Krishna, Zheng Chen, Vaibhav Kumar, Xiaojiang Huang, Yingjie Li, Fan Yang, Xiang Li

Abstract:Modern AI assistants have made significant progress in natural language understanding and API/tool integration, with emerging efforts to incorporate diverse interfaces (such as Web interfaces) for enhanced scalability and functionality. However, current approaches that heavily rely on repeated LLM-driven HTML parsing are computationally expensive and error-prone, particularly when handling dynamic web interfaces and multi-step tasks. To overcome these challenges, we introduce PAFFA (Premeditated Actions For Fast Agents), a framework designed to enhance web interaction capabilities through an Action API Library of reusable, verified browser interaction functions. By pre-computing interaction patterns and employing two core methodologies - "Dist-Map" for task-agnostic element distillation and "Unravel" for incremental page-wise exploration - PAFFA reduces inference calls by 87% while maintaining robust performance even as website structures evolve. This framework accelerates multi-page task execution and offers a scalable solution to advance autonomous web agent research.

* 9 pages

Personalized and Sequential Text-to-Image Generation

Dec 10, 2024

Ofir Nabati, Guy Tennenholtz, ChihWei Hsu, Moonkyung Ryu, Deepak Ramachandran, Yinlam Chow, Xiang Li, Craig Boutilier

Abstract:We address the problem of personalized, interactive text-to-image (T2I) generation, designing a reinforcement learning (RL) agent which iteratively improves a set of generated images for a user through a sequence of prompt expansions. Using human raters, we create a novel dataset of sequential preferences, which we leverage, together with large-scale open-source (non-sequential) datasets. We construct user-preference and user-choice models using an EM strategy and identify varying user preference types. We then leverage a large multimodal language model (LMM) and a value-based RL approach to suggest a personalized and diverse slate of prompt expansions to the user. Our Personalized And Sequential Text-to-image Agent (PASTA) extends T2I models with personalized multi-turn capabilities, fostering collaborative co-creation and addressing uncertainty or underspecification in a user's intent. We evaluate PASTA using human raters, showing significant improvement compared to baseline methods. We also release our sequential rater dataset and simulated user-rater interactions to support future research in personalized, multi-turn T2I generation.

* Link to PASTA dataset: https://www.kaggle.com/datasets/googleai/pasta-data

Enhancing LLMs for Impression Generation in Radiology Reports through a Multi-Agent System

Dec 06, 2024

Fang Zeng, Zhiliang Lyu, Quanzheng Li, Xiang Li

Abstract:This study introduces "RadCouncil," a multi-agent Large Language Model (LLM) framework designed to enhance the generation of impressions in radiology reports from the finding section. RadCouncil comprises three specialized agents: 1) a "Retrieval" Agent that identifies and retrieves similar reports from a vector database, 2) a "Radiologist" Agent that generates impressions based on the finding section of the given report plus the exemplar reports retrieved by the Retrieval Agent, and 3) a "Reviewer" Agent that evaluates the generated impressions and provides feedback. The performance of RadCouncil was evaluated using both quantitative metrics (BLEU, ROUGE, BERTScore) and qualitative criteria assessed by GPT-4, using chest X-ray as a case study. Experiment results show improvements in RadCouncil over the single-agent approach across multiple dimensions, including diagnostic accuracy, stylistic concordance, and clarity. This study highlights the potential of utilizing multiple interacting LLM agents, each with a dedicated task, to enhance performance in specialized medical tasks and the development of more robust and adaptable healthcare AI solutions.

Community Detection with Heterogeneous Block Covariance Model

Dec 04, 2024

Xiang Li, Yunpeng Zhao, Qing Pan, Ning Hao

Abstract:Community detection is the task of clustering objects based on their pairwise relationships. Most of the model-based community detection methods, such as the stochastic block model and its variants, are designed for networks with binary (yes/no) edges. In many practical scenarios, edges often possess continuous weights, spanning positive and negative values, which reflect varying levels of connectivity. To address this challenge, we introduce the heterogeneous block covariance model (HBCM) that defines a community structure within the covariance matrix, where edges have signed and continuous weights. Furthermore, it takes into account the heterogeneity of objects when forming connections with other objects within a community. A novel variational expectation-maximization algorithm is proposed to estimate the group membership. The HBCM provides provably consistent estimates of memberships, and its promising performance is observed in numerical simulations with different setups. The model is applied to a single-cell RNA-seq dataset of a mouse embryo and a stock price dataset. Supplementary materials for this article are available online.
