Yifan Gao

Hephaestus: Improving Fundamental Agent Capabilities of Large Language Models through Continual Pre-Training

Feb 10, 2025

Yuchen Zhuang, Jingfeng Yang, Haoming Jiang, Xin Liu, Kewei Cheng, Sanket Lokegaonkar, Yifan Gao, Qing Ping, Tianyi Liu, Binxuan Huang (+9 more)

Abstract:Due to the scarcity of agent-oriented pre-training data, LLM-based autonomous agents typically rely on complex prompting or extensive fine-tuning, which often fails to introduce new capabilities while preserving strong generalizability. We introduce Hephaestus-Forge, the first large-scale pre-training corpus designed to enhance the fundamental capabilities of LLM agents in API function calling, intrinsic reasoning and planning, and adapting to environmental feedback. Hephaestus-Forge comprises 103B agent-specific data encompassing 76,537 APIs, including both tool documentation to introduce knowledge of API functions and function calling trajectories to strengthen intrinsic reasoning. To explore effective training protocols, we investigate scaling laws to identify the optimal recipe in data mixing ratios. By continual pre-training on Hephaestus-Forge, Hephaestus outperforms small- to medium-scale open-source LLMs and rivals commercial LLMs on three agent benchmarks, demonstrating the effectiveness of our pre-training corpus in enhancing fundamental agentic capabilities and generalization of LLMs to new tasks or environments.

* Accepted to NAACL 2025 main conference

Leveraging Multimodal Methods and Spontaneous Speech for Alzheimer's Disease Identification

Dec 13, 2024

Yifan Gao, Long Guo, Hong Liu

Abstract:Cognitive impairment detection through spontaneous speech offers potential for early diagnosis of Alzheimer's disease (AD) and mild cognitive impairment (MCI). The PROCESS Grand Challenge, part of ICASSP 2025, focuses on advancing this field with innovative solutions for classification and regression tasks. In this work, we integrate interpretable features with temporal features extracted from pre-trained models through a multimodal fusion strategy. For the classification task, our model achieved an F1-score of 0.649 in predicting cognitive states (healthy, MCI, dementia). For the regression task, which involves MMSE score prediction, we obtained a root-mean-square error (RMSE) of 2.628. These results led to our team securing the top overall ranking in the competition.

* ICASSP 2025

Shopping MMLU: A Massive Multi-Task Online Shopping Benchmark for Large Language Models

Oct 28, 2024

Yilun Jin, Zheng Li, Chenwei Zhang, Tianyu Cao, Yifan Gao, Pratik Jayarao, Mao Li, Xin Liu, Ritesh Sarkhel, Xianfeng Tang (+12 more)

Abstract:Online shopping is a complex multi-task, few-shot learning problem with a wide and evolving range of entities, relations, and tasks. However, existing models and benchmarks are commonly tailored to specific tasks, falling short of capturing the full complexity of online shopping. Large Language Models (LLMs), with their multi-task and few-shot learning abilities, have the potential to profoundly transform online shopping by alleviating task-specific engineering efforts and by providing users with interactive conversations. Despite the potential, LLMs face unique challenges in online shopping, such as domain-specific concepts, implicit knowledge, and heterogeneous user behaviors. Motivated by the potential and challenges, we propose Shopping MMLU, a diverse multi-task online shopping benchmark derived from real-world Amazon data. Shopping MMLU consists of 57 tasks covering 4 major shopping skills: concept understanding, knowledge reasoning, user behavior alignment, and multi-linguality, and can thus comprehensively evaluate the abilities of LLMs as general shop assistants. With Shopping MMLU, we benchmark over 20 existing LLMs and uncover valuable insights about practices and prospects of building versatile LLM-based shop assistants. Shopping MMLU can be publicly accessed at https://github.com/KL4805/ShoppingMMLU. In addition, with Shopping MMLU, we host a competition in KDD Cup 2024 with over 500 participating teams. The winning solutions and the associated workshop can be accessed at our website https://amazon-kddcup24.github.io/.

* Accepted to the NeurIPS 2024 Datasets and Benchmarks Track

Scaling Laws for Predicting Downstream Performance in LLMs

Oct 11, 2024

Yangyi Chen, Binxuan Huang, Yifan Gao, Zhengyang Wang, Jingfeng Yang, Heng Ji

Abstract:Precise estimation of downstream performance in large language models (LLMs) prior to training is essential for guiding their development process. Scaling laws analysis utilizes the statistics of a series of significantly smaller sampling language models (LMs) to predict the performance of the target LLM. For downstream performance prediction, the critical challenge lies in the emergent abilities in LLMs that occur beyond task-specific computational thresholds. In this work, we focus on the pre-training loss as a more computation-efficient metric for performance estimation. Our two-stage approach consists of first estimating a function that maps computational resources (e.g., FLOPs) to the pre-training Loss using a series of sampling models, followed by mapping the pre-training loss to downstream task Performance after the critical "emergent phase". In preliminary experiments, this FLP solution accurately predicts the performance of LLMs with 7B and 13B parameters using a series of sampling LMs up to 3B, achieving error margins of 5% and 10%, respectively, and significantly outperforming the FLOPs-to-Performance approach. This motivates FLP-M, a fundamental approach for performance prediction that addresses the practical need to integrate datasets from multiple sources during pre-training, specifically blending general corpora with code data to accurately represent the common necessity. FLP-M extends the power law analytical function to predict domain-specific pre-training loss based on FLOPs across data sources, and employs a two-layer neural network to model the non-linear relationship between multiple domain-specific loss and downstream performance. By utilizing a 3B LLM trained on a specific ratio and a series of smaller sampling LMs, FLP-M can effectively forecast the performance of 3B and 7B LLMs across various data mixtures for most benchmarks within 10% error margins.
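
The two-stage FLOPs-to-Loss-to-Performance idea described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the power law is taken in the simplified form `loss = a * flops**(-b)` with no irreducible-loss term, the loss-to-performance map is assumed to be linear, the function names are hypothetical, and the data points are synthetic.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fit_flops_to_loss(flops, losses):
    """Stage 1 (assumed form): fit loss = a * flops**(-b) in log-log space."""
    slope, intercept = fit_line(
        [math.log(c) for c in flops], [math.log(l) for l in losses]
    )
    return math.exp(intercept), -slope  # (a, b)

def fit_loss_to_perf(losses, perfs):
    """Stage 2 (assumed linear): fit perf = w * loss + w0."""
    return fit_line(losses, perfs)

def predict_perf(target_flops, a, b, w, w0):
    """Chain both stages: FLOPs -> predicted loss -> predicted performance."""
    predicted_loss = a * target_flops ** (-b)
    return w * predicted_loss + w0

# Synthetic "sampling model" measurements, invented for this example only.
flops = [1e18, 1e19, 1e20, 1e21]
losses = [3.0 * (c / 1e18) ** (-0.05) for c in flops]
perfs = [0.9 - 0.2 * l for l in losses]

a, b = fit_flops_to_loss(flops, losses)
w, w0 = fit_loss_to_perf(losses, perfs)
forecast = predict_perf(1e22, a, b, w, w0)
```

Because the synthetic points lie exactly on the assumed curves, the fit recovers the generating parameters. With real measurements, the second stage would be fitted only on models past the "emergent phase" that the abstract mentions.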

Inductive or Deductive? Rethinking the Fundamental Reasoning Abilities of LLMs

Aug 07, 2024

Kewei Cheng, Jingfeng Yang, Haoming Jiang, Zhengyang Wang, Binxuan Huang, Ruirui Li, Shiyang Li, Zheng Li, Yifan Gao, Xian Li (+2 more)

Abstract:Reasoning encompasses two typical types: deductive reasoning and inductive reasoning. Despite extensive research into the reasoning capabilities of Large Language Models (LLMs), most studies have failed to rigorously differentiate between inductive and deductive reasoning, leading to a blending of the two. This raises an essential question: in LLM reasoning, which poses a greater challenge, deductive or inductive reasoning? While the deductive reasoning capabilities of LLMs (i.e., their capacity to follow instructions in reasoning tasks) have received considerable attention, their abilities in true inductive reasoning remain largely unexplored. To investigate the true inductive reasoning capabilities of LLMs, we propose a novel framework, SolverLearner. This framework enables LLMs to learn the underlying function (i.e., $y = f_w(x)$) that maps input data points $(x)$ to their corresponding output values $(y)$, using only in-context examples. By focusing on inductive reasoning and separating it from LLM-based deductive reasoning, we can isolate and investigate inductive reasoning of LLMs in its pure form via SolverLearner. Our observations reveal that LLMs demonstrate remarkable inductive reasoning capabilities through SolverLearner, achieving near-perfect performance with ACC of 1 in most cases. Surprisingly, despite their strong inductive reasoning abilities, LLMs tend to relatively lack deductive reasoning capabilities, particularly in tasks involving "counterfactual" reasoning.

MBA-Net: SAM-driven Bidirectional Aggregation Network for Ovarian Tumor Segmentation

Jul 08, 2024

Yifan Gao, Wei Xia, Wenkui Wang, Xin Gao

Abstract:Accurate segmentation of ovarian tumors from medical images is crucial for early diagnosis, treatment planning, and patient management. However, the diverse morphological characteristics and heterogeneous appearances of ovarian tumors pose significant challenges to automated segmentation methods. In this paper, we propose MBA-Net, a novel architecture that integrates the powerful segmentation capabilities of the Segment Anything Model (SAM) with domain-specific knowledge for accurate and robust ovarian tumor segmentation. MBA-Net employs a hybrid encoder architecture, where the encoder consists of a prior branch, which inherits the SAM encoder to capture robust segmentation priors, and a domain branch, specifically designed to extract domain-specific features. The bidirectional flow of information between the two branches is facilitated by the robust feature injection network (RFIN) and the domain knowledge integration network (DKIN), enabling MBA-Net to leverage the complementary strengths of both branches. We extensively evaluate MBA-Net on the public multi-modality ovarian tumor ultrasound dataset and the in-house multi-site ovarian tumor MRI dataset. Our proposed method consistently outperforms state-of-the-art segmentation approaches. Moreover, MBA-Net demonstrates superior generalization capability across different imaging modalities and clinical sites.

* MICCAI 2024

CausalCellSegmenter: Causal Inference inspired Diversified Aggregation Convolution for Pathology Image Segmentation

Mar 10, 2024

Dawei Fan, Yifan Gao, Jiaming Yu, Yanping Chen, Wencheng Li, Chuancong Lin, Kaibin Li, Changcai Yang, Riqing Chen, Lifang Wei

Abstract:Deep learning models have shown promising performance for cell nucleus segmentation in the field of pathology image analysis. However, training a robust model from multiple domains remains a great challenge for cell nucleus segmentation. Additionally, background noise, heavy overlap between cell nuclei, and blurred edges often lead to poor performance. To address these challenges, we propose a novel framework termed CausalCellSegmenter, which combines a Causal Inference Module (CIM) with Diversified Aggregation Convolution (DAC) techniques. The DAC module incorporates diverse downsampling features through a simple, parameter-free attention module (SimAM), aiming to overcome the problems of false-positive identification and edge blurring. Furthermore, we introduce CIM to leverage sample weighting by directly removing the spurious correlations between features for every input sample and concentrating more on the correlation between features and labels. Extensive experiments on the MoNuSeg-2018 dataset achieve promising results, outperforming other state-of-the-art methods, with the mIoU and DSC scores improving by 3.6% and 2.65%, respectively.
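
For readers unfamiliar with the two reported metrics, the sketch below shows the standard definitions of intersection over union (IoU, which is averaged over classes or images to give mIoU) and the Dice similarity coefficient (DSC) for a single binary mask. The function name, the flat 0/1 list representation, and the convention of scoring two empty masks as 1.0 are assumptions for illustration and are not taken from the paper.

```python
def iou_and_dice(pred, target):
    """Return (IoU, Dice) for two flat binary masks given as 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0, 1.0
    iou = intersection / union
    dice = 2 * intersection / (pred_area + target_area)
    return iou, dice

# Toy 4-pixel masks, invented for this example.
iou, dice = iou_and_dice([1, 1, 0, 0], [1, 0, 1, 0])
```

The two scores are monotonically related (Dice = 2 * IoU / (1 + IoU) for a single mask), so they rank individual predictions identically but can differ once averaged.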

* 10 pages, 5 figures, 2 tables, MICCAI

Enhancing User Intent Capture in Session-Based Recommendation with Attribute Patterns

Dec 23, 2023

Xin Liu, Zheng Li, Yifan Gao, Jingfeng Yang, Tianyu Cao, Zhengyang Wang, Bing Yin, Yangqiu Song

Abstract:The goal of session-based recommendation in E-commerce is to predict the next item that an anonymous user will purchase based on the browsing and purchase history. However, constructing global or local transition graphs to supplement session data can lead to noisy correlations and user intent vanishing. In this work, we propose the Frequent Attribute Pattern Augmented Transformer (FAPAT) that characterizes user intents by building attribute transition graphs and matching attribute patterns. Specifically, the frequent and compact attribute patterns serve as memory to augment session representations, followed by a gate and a transformer block to fuse the whole session information. Through extensive experiments on two public benchmarks and 100 million industrial data in three domains, we demonstrate that FAPAT consistently outperforms state-of-the-art methods by an average of 4.5% across various evaluation metrics (Hits, NDCG, MRR). Besides evaluating the next-item prediction, we estimate the models' capabilities to capture user intents via predicting items' attributes and period-item recommendations.
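
The three metrics named in this abstract have simple closed forms when each session has exactly one ground-truth next item. The sketch below uses the common single-target definitions; the function names, the cutoff `k`, and the toy ranking are assumptions for illustration, and the paper's exact evaluation protocol may differ.

```python
import math

def rank_of(target, ranked_items):
    """1-based rank of target in ranked_items, or None if it is absent."""
    for position, item in enumerate(ranked_items, start=1):
        if item == target:
            return position
    return None

def hits_at_k(target, ranked_items, k):
    """1.0 if the target appears in the top k, else 0.0."""
    rank = rank_of(target, ranked_items)
    return 1.0 if rank is not None and rank <= k else 0.0

def mrr_at_k(target, ranked_items, k):
    """Reciprocal rank of the target, truncated to 0.0 beyond position k."""
    rank = rank_of(target, ranked_items)
    return 1.0 / rank if rank is not None and rank <= k else 0.0

def ndcg_at_k(target, ranked_items, k):
    """With one relevant item the ideal DCG is 1, so NDCG = 1 / log2(rank + 1)."""
    rank = rank_of(target, ranked_items)
    if rank is None or rank > k:
        return 0.0
    return 1.0 / math.log2(rank + 1)

# Toy ranking, invented for this example: the true next item "b" is ranked second.
ranking = ["a", "b", "c"]
scores = (hits_at_k("b", ranking, 2), mrr_at_k("b", ranking, 2), ndcg_at_k("b", ranking, 2))
```

Reported numbers are the mean of these per-session values over all test sessions.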

* Accepted by NeurIPS 2023

Situated Natural Language Explanations

Aug 27, 2023

Zining Zhu, Haoming Jiang, Jingfeng Yang, Sreyashi Nag, Chao Zhang, Jie Huang, Yifan Gao, Frank Rudzicz, Bing Yin

Abstract:Natural language is among the most accessible tools for explaining decisions to humans, and large pretrained language models (PLMs) have demonstrated impressive abilities to generate coherent natural language explanations (NLE). The existing NLE research perspectives do not take the audience into account. An NLE can have high textual quality, but it might not accommodate audiences' needs and preferences. To address this limitation, we propose an alternative perspective, situated NLE, including a situated generation framework and a situated evaluation framework. On the generation side, we propose simple prompt engineering methods that adapt the NLEs to situations. In human studies, the annotators preferred the situated NLEs. On the evaluation side, we set up automated evaluation scores in lexical, semantic, and pragmatic categories. The scores can be used to select the most suitable prompts to generate NLEs. Situated NLE provides a perspective for further research on automatic NLE generation.

* A previous version was presented in ACL 2023 NLRSE workshop

AutoPoster: A Highly Automatic and Content-aware Design System for Advertising Poster Generation

Aug 23, 2023

Jinpeng Lin, Min Zhou, Ye Ma, Yifan Gao, Chenxi Fei, Yangjian Chen, Zhang Yu, Tiezheng Ge

Abstract:Advertising posters, a form of information presentation, combine visual and linguistic modalities. Creating a poster involves multiple steps and necessitates design experience and creativity. This paper introduces AutoPoster, a highly automatic and content-aware system for generating advertising posters. With only product images and titles as inputs, AutoPoster can automatically produce posters of varying sizes through four key stages: image cleaning and retargeting, layout generation, tagline generation, and style attribute prediction. To ensure visual harmony of posters, two content-aware models are incorporated for layout and tagline generation. Moreover, we propose a novel multi-task Style Attribute Predictor (SAP) to jointly predict visual style attributes. Meanwhile, to our knowledge, we propose the first poster generation dataset that includes visual attribute annotations for over 76k posters. Qualitative and quantitative outcomes from user studies and experiments substantiate the efficacy of our system and the aesthetic superiority of the generated posters compared to other poster generation methods.

* Accepted for ACM MM 2023
