Tianyi Zhou

Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA

Oct 09, 2024

Maharshi Gor, Hal Daumé III, Tianyi Zhou, Jordan Boyd-Graber

Abstract: Recent advancements in large language models (LLMs) have led to claims of AI surpassing humans in natural language processing (NLP) tasks such as textual understanding and reasoning. This work investigates these assertions by introducing CAIMIRA, a novel framework rooted in item response theory (IRT) that enables quantitative assessment and comparison of problem-solving abilities of question-answering (QA) agents: humans and AI systems. Through analysis of over 300,000 responses from ~70 AI systems and 155 humans across thousands of quiz questions, CAIMIRA uncovers distinct proficiency patterns in knowledge domains and reasoning skills. Humans outperform AI systems in knowledge-grounded abductive and conceptual reasoning, while state-of-the-art LLMs like GPT-4 and LLaMA show superior performance on targeted information retrieval and fact-based reasoning, particularly when information gaps are well-defined and addressable through pattern matching or data retrieval. These findings highlight the need for future QA tasks to focus on questions that challenge not only higher-order reasoning and scientific thinking, but also demand nuanced linguistic interpretation and cross-contextual knowledge application, helping advance AI developments that better emulate or complement human cognitive abilities in real-world problem-solving.

* To appear at EMNLP 2024 (Main)
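
The abstract above says CAIMIRA is rooted in item response theory. As a rough illustration of the underlying idea only, here is the textbook two-parameter logistic (2PL) IRT model. It is not CAIMIRA's own formulation, and the agent names and numbers are hypothetical.

```python
import math

def p_correct(skill: float, difficulty: float, discrimination: float = 1.0) -> float:
    """Textbook 2PL IRT: probability that an agent answers an item correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (skill - difficulty)))

# Hypothetical agents and one hypothetical question (not data from the paper).
agents = {"human_team": 0.8, "llm_a": 1.5}
question_difficulty = 1.0

for name, skill in agents.items():
    print(name, round(p_correct(skill, question_difficulty), 3))
```

In this model an agent whose skill equals the item difficulty answers correctly half of the time, and the discrimination parameter controls how sharply the probability changes around that point.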

WALL-E: World Alignment by Rule Learning Improves World Model-based LLM Agents

Oct 09, 2024

Siyu Zhou, Tianyi Zhou, Yijun Yang, Guodong Long, Deheng Ye, Jing Jiang, Chengqi Zhang

Abstract: Can large language models (LLMs) directly serve as powerful world models for model-based agents? While the gaps between the prior knowledge of LLMs and the specified environment's dynamics do exist, our study reveals that the gaps can be bridged by aligning an LLM with its deployed environment and such "world alignment" can be efficiently achieved by rule learning on LLMs. Given the rich prior knowledge of LLMs, only a few additional rules suffice to align LLM predictions with the specified environment dynamics. To this end, we propose a neurosymbolic approach to learn these rules gradient-free through LLMs, by inducing, updating, and pruning rules based on comparisons of agent-explored trajectories and world model predictions. The resulting world model is composed of the LLM and the learned rules. Our embodied LLM agent "WALL-E" is built upon model-predictive control (MPC). By optimizing look-ahead actions based on the precise world model, MPC significantly improves exploration and learning efficiency. Compared to existing LLM agents, WALL-E's reasoning only requires a few principal rules rather than verbose buffered trajectories being included in the LLM input. On open-world challenges in Minecraft and ALFWorld, WALL-E achieves higher success rates than existing methods, with lower costs on replanning time and the number of tokens used for reasoning. In Minecraft, WALL-E exceeds baselines by 15-30% in success rate while costing 8-20 fewer replanning rounds and only 60-80% of tokens. In ALFWorld, its success rate surges to a new record high of 95% after only 6 iterations.

* 35 pages, including references and appendix
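
The abstract above describes a world model made of an LLM plus a few learned rules, used for look-ahead planning. The toy sketch below shows only that general shape. The LLM is replaced by an optimistic stub, the rule is written by hand rather than learned, and planning is reduced to a single look-ahead step, so every name here is a hypothetical stand-in rather than WALL-E's implementation.

```python
from typing import Callable, Dict, List, Optional

State = Dict[str, int]                 # hypothetical state: inventory counts
Rule = Callable[[State, str], bool]    # returns True when the action is expected to fail

def base_predict(state: State, action: str) -> bool:
    """Stand-in for the LLM prior: optimistically predicts success for every action."""
    return True

def predict(state: State, action: str, rules: List[Rule]) -> bool:
    """World model = base prediction, overridden when any rule fires."""
    if any(rule(state, action) for rule in rules):
        return False
    return base_predict(state, action)

def plan(state: State, candidates: List[str], rules: List[Rule]) -> Optional[str]:
    """One-step look-ahead: return the first candidate predicted to succeed."""
    for action in candidates:
        if predict(state, action, rules):
            return action
    return None

# Hypothetical rule, hand-written for illustration; the paper learns such rules.
def needs_planks(state: State, action: str) -> bool:
    return action == "craft_table" and state.get("planks", 0) < 4

state = {"planks": 1}
candidates = ["craft_table", "collect_wood"]
print(plan(state, candidates, rules=[]))              # plan without the rule
print(plan(state, candidates, rules=[needs_planks]))  # plan with the rule
```

Without the rule the stub world model accepts an action that would fail in the environment; with the rule the planner skips it, which is the effect the abstract attributes to aligning the world model with a small set of rules.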

Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization

Sep 27, 2024

Mucong Ding, Chenghao Deng, Jocelyn Choo, Zichu Wu, Aakriti Agrawal, Avi Schwarzschild, Tianyi Zhou, Tom Goldstein, John Langford, Anima Anandkumar (+1 more)

Abstract: While generalization over tasks from easy to hard is crucial for profiling large language models (LLMs), datasets with fine-grained difficulty annotations for each problem across a broad range of complexity are still lacking. Aiming to address this limitation, we present Easy2Hard-Bench, a consistently formatted collection of 6 benchmark datasets spanning various domains, such as mathematics and programming problems, chess puzzles, and reasoning questions. Each problem within these datasets is annotated with numerical difficulty scores. To systematically estimate problem difficulties, we collect abundant performance data on attempts at each problem by humans in the real world or LLMs on the prominent leaderboard. Leveraging the rich performance data, we apply well-established difficulty ranking systems, such as Item Response Theory (IRT) and Glicko-2 models, to uniformly assign numerical difficulty scores to problems. Moreover, datasets in Easy2Hard-Bench distinguish themselves from previous collections by a higher proportion of challenging problems. Through extensive experiments with six state-of-the-art LLMs, we provide a comprehensive analysis of their performance and generalization capabilities across varying levels of difficulty, with the aim of inspiring future research in LLM generalization. The datasets are available at https://huggingface.co/datasets/furonghuang-lab/Easy2Hard-Bench.

* NeurIPS 2024 Datasets and Benchmarks Track
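
The abstract above assigns difficulty scores from attempt data using rating systems such as IRT and Glicko-2. As a simplified stand-in, the sketch below uses a plain Elo-style update, which is not Glicko-2 and not the paper's pipeline, to show how a problem's rating can rise when solvers fail it. The ratings and the K-factor are hypothetical.

```python
from typing import Tuple

def expected_solve(solver: float, problem: float) -> float:
    """Elo expectation that the solver beats (solves) the problem."""
    return 1.0 / (1.0 + 10 ** ((problem - solver) / 400.0))

def update(solver: float, problem: float, solved: bool, k: float = 32.0) -> Tuple[float, float]:
    """Treat each attempt as a match between solver and problem and update both ratings."""
    delta = k * ((1.0 if solved else 0.0) - expected_solve(solver, problem))
    return solver + delta, problem - delta

# Hypothetical attempt: an evenly matched solver fails the problem.
solver_rating, problem_rating = update(1500.0, 1500.0, solved=False)
print(solver_rating, problem_rating)
```

Glicko-2 additionally tracks a rating deviation and a volatility per player, which this sketch leaves out.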

From Lists to Emojis: How Format Bias Affects Model Alignment

Sep 18, 2024

Xuanchang Zhang, Wei Xiong, Lichang Chen, Tianyi Zhou, Heng Huang, Tong Zhang

Abstract: In this paper, we study format biases in reinforcement learning from human feedback (RLHF). We observe that many widely-used preference models, including human evaluators, GPT-4, and top-ranking models on the RewardBench benchmark, exhibit strong biases towards specific format patterns, such as lists, links, bold text, and emojis. Furthermore, large language models (LLMs) can exploit these biases to achieve higher rankings on popular benchmarks like AlpacaEval and LMSYS Chatbot Arena. One notable example of this is verbosity bias, where current preference models favor longer responses that appear more comprehensive, even when their quality is equal to or lower than shorter, competing responses. However, format biases beyond verbosity remain largely underexplored in the literature. In this work, we extend the study of biases in preference learning beyond the commonly recognized length bias, offering a comprehensive analysis of a wider range of format biases. Additionally, we show that with a small amount of biased data (less than 1%), we can inject significant bias into the reward model. Moreover, these format biases can also be easily exploited by downstream alignment algorithms, such as best-of-n sampling and online iterative DPO, as it is usually easier to manipulate the format than to improve the quality of responses. Our findings emphasize the need to disentangle format and content both for designing alignment algorithms and evaluating models.

* Work in progress
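
The abstract above notes that best-of-n sampling can exploit format biases in a reward model. The toy sketch below illustrates that mechanism under invented assumptions: the reward function, its bonus for bold text, and the quality scores are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Response:
    text: str
    quality: float  # hypothetical ground-truth quality score

def unbiased_reward(r: Response) -> float:
    """Reward that depends on content quality only."""
    return r.quality

def biased_reward(r: Response, format_bonus: float = 0.5) -> float:
    """Same reward plus a hypothetical bonus whenever bold markup appears."""
    return r.quality + (format_bonus if "**" in r.text else 0.0)

def best_of_n(candidates: List[Response], reward: Callable[[Response], float]) -> Response:
    """Best-of-n sampling: keep the candidate with the highest reward."""
    return max(candidates, key=reward)

candidates = [
    Response("A precise, plain answer.", quality=0.8),
    Response("**A vaguer answer in bold.**", quality=0.6),
]
print(best_of_n(candidates, unbiased_reward).text)
print(best_of_n(candidates, biased_reward).text)
```

With the biased reward the lower-quality bold response wins, which is the kind of format exploitation the abstract warns about.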

Personalized Federated Collaborative Filtering: A Variational AutoEncoder Approach

Aug 16, 2024

Zhiwei Li, Guodong Long, Tianyi Zhou, Jing Jiang, Chengqi Zhang

Abstract: Federated Collaborative Filtering (FedCF) is an emerging field focused on developing a new recommendation framework that preserves privacy in a federated setting. Existing FedCF methods typically combine distributed Collaborative Filtering (CF) algorithms with privacy-preserving mechanisms, and then store personalized information in a user embedding vector. However, the user embedding is usually insufficient to preserve the rich information of the fine-grained personalization across heterogeneous clients. This paper proposes a novel personalized FedCF method that preserves users' personalized information in a latent variable and a neural model simultaneously. Specifically, we decompose the modeling of user knowledge into two encoders, each designed to capture shared knowledge and personalized knowledge separately. A personalized gating network is then applied to balance personalization and generalization between the global and local encoders. Moreover, to effectively train the proposed framework, we model the CF problem as a specialized Variational AutoEncoder (VAE) task by integrating user interaction vector reconstruction with missing value prediction. The decoder is trained to reconstruct the implicit feedback from items the user has interacted with, while also predicting items the user might be interested in but has not yet interacted with. Experimental results on benchmark datasets demonstrate that the proposed method outperforms the baseline methods.

* 10 pages, 3 figures, 4 tables, conference
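
The abstract above mentions a personalized gating network that balances a global and a local encoder. One common way to realize such a gate is a sigmoid-weighted convex combination of the two encoder outputs. The sketch below assumes that form, which the abstract does not confirm, and all weights and vectors are hypothetical.

```python
import math
from typing import List

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gate(features: List[float], weights: List[float], bias: float) -> float:
    """Scalar gate in (0, 1) computed from user features (assumed linear + sigmoid)."""
    return sigmoid(sum(f * w for f, w in zip(features, weights)) + bias)

def mix(global_repr: List[float], local_repr: List[float], g: float) -> List[float]:
    """Convex combination: g weights the local (personalized) encoder output."""
    return [g * loc + (1.0 - g) * glo for glo, loc in zip(global_repr, local_repr)]

# Hypothetical user features and encoder outputs.
g = gate([1.0, 0.0], weights=[2.0, -1.0], bias=0.0)
print(round(g, 3), mix([0.2, 0.4], [0.9, 0.1], g))
```

A gate near 1 leans on the local encoder (personalization), while a gate near 0 leans on the global encoder (generalization).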

M2EF-NNs: Multimodal Multi-instance Evidence Fusion Neural Networks for Cancer Survival Prediction

Aug 08, 2024

Hui Luo, Jiashuang Huang, Hengrong Ju, Tianyi Zhou, Weiping Ding

Abstract: Accurate cancer survival prediction is crucial for assisting clinical doctors in formulating treatment plans. Multimodal data, including histopathological images and genomic data, offer complementary and comprehensive information that can greatly enhance the accuracy of this task. However, the current methods, despite yielding promising results, suffer from two notable limitations: they do not effectively utilize global context and disregard modal uncertainty. In this study, we put forward a neural network model called M2EF-NNs, which leverages multimodal and multi-instance evidence fusion techniques for accurate cancer survival prediction. Specifically, to capture global information in the images, we use a pre-trained Vision Transformer (ViT) model to obtain patch feature embeddings of histopathological images. Then, we introduce a multimodal attention module that uses genomic embeddings as queries and learns the co-attention mapping between genomic and histopathological images to achieve an early interaction fusion of multimodal information and better capture their correlations. Subsequently, we are the first to apply the Dempster-Shafer evidence theory (DST) to cancer survival prediction. We parameterize the distribution of class probabilities using the processed multimodal features and introduce subjective logic to estimate the uncertainty associated with different modalities. By combining with the Dempster-Shafer theory, we can dynamically adjust the weights of class probabilities after multimodal fusion to achieve trusted survival prediction. Finally, experimental validation on the TCGA datasets confirms that our proposed method significantly improves cancer survival prediction and enhances the reliability of the model.
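
The abstract above applies Dempster-Shafer evidence theory to fuse evidence from different modalities. As background, the sketch below implements the classical Dempster rule of combination for two mass functions. It is not the paper's subjective-logic formulation, and the risk labels and mass values are hypothetical.

```python
from itertools import product
from typing import Dict, FrozenSet

Mass = Dict[FrozenSet[str], float]

def combine(m1: Mass, m2: Mass) -> Mass:
    """Dempster's rule: multiply masses, pool by set intersection, renormalize by 1 - conflict."""
    pooled: Mass = {}
    conflict = 0.0
    for (b, mass_b), (c, mass_c) in product(m1.items(), m2.items()):
        intersection = b & c
        if intersection:
            pooled[intersection] = pooled.get(intersection, 0.0) + mass_b * mass_c
        else:
            conflict += mass_b * mass_c  # mass assigned to contradictory hypotheses
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; combination is undefined")
    return {hypothesis: mass / (1.0 - conflict) for hypothesis, mass in pooled.items()}

LOW, HIGH = frozenset({"low"}), frozenset({"high"})
EITHER = LOW | HIGH  # mass on the whole frame expresses uncertainty

# Hypothetical evidence from two modalities.
image_evidence = {LOW: 0.6, EITHER: 0.4}
genomic_evidence = {LOW: 0.5, HIGH: 0.3, EITHER: 0.2}
print(combine(image_evidence, genomic_evidence))
```

Mass left on the whole frame is what lets a source say "I do not know", which is how this family of methods represents per-modality uncertainty.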

FDiff-Fusion:Denoising diffusion fusion network based on fuzzy learning for 3D medical image segmentation

Jul 22, 2024

Weiping Ding, Sheng Geng, Haipeng Wang, Jiashuang Huang, Tianyi Zhou

Abstract: In recent years, the denoising diffusion model has achieved remarkable success in image segmentation modeling. With its powerful nonlinear modeling capabilities and superior generalization performance, denoising diffusion models have gradually been applied to medical image segmentation tasks, bringing new perspectives and methods to this field. However, existing methods overlook the uncertainty of segmentation boundaries and the fuzziness of regions, resulting in the instability and inaccuracy of the segmentation results. To solve this problem, a denoising diffusion fusion network based on fuzzy learning for 3D medical image segmentation (FDiff-Fusion) is proposed in this paper. By integrating the denoising diffusion model into the classical U-Net network, this model can effectively extract rich semantic information from input medical images, thus providing excellent pixel-level representation for medical image segmentation. ... Finally, to validate the effectiveness of FDiff-Fusion, we compare it with existing advanced segmentation networks on the BRATS 2020 brain tumor dataset and the BTCV abdominal multi-organ dataset. The results show that FDiff-Fusion significantly improves the Dice scores and HD95 distance on these two datasets, demonstrating its superiority in medical image segmentation tasks.

* Information Fusion, 2024: 102540
* This paper has been accepted by Information Fusion. Permission from Elsevier must be obtained for all other uses, in any current or future media. The final version is available at [doi:10.1016/J.INFFUS.2024.102540]
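
The abstract above reports results in terms of Dice scores. For readers unfamiliar with the metric, the sketch below computes the standard Dice coefficient, 2|A ∩ B| / (|A| + |B|), on flattened binary masks. The masks are hypothetical, and returning 1.0 when both masks are empty is a convention chosen here, not something the paper specifies.

```python
from typing import Sequence

def dice(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice coefficient between two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    overlap = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention (an assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * overlap / total

# Hypothetical 4-voxel masks.
print(dice([1, 1, 0, 0], [1, 0, 0, 0]))
```

Higher Dice is better, whereas HD95 is a boundary distance for which lower is better.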

FMDNN: A Fuzzy-guided Multi-granular Deep Neural Network for Histopathological Image Classification

Jul 22, 2024

Weiping Ding, Tianyi Zhou, Jiashuang Huang, Shu Jiang, Tao Hou, Chin-Teng Lin

Abstract: Histopathological image classification constitutes a pivotal task in computer-aided diagnostics. The precise identification and categorization of histopathological images are of paramount significance for early disease detection and treatment. In the diagnostic process of pathologists, a multi-tiered approach is typically employed to assess abnormalities in cell regions at different magnifications. However, feature extraction is often performed at a single granularity, overlooking the multi-granular characteristics of cells. To address this issue, we propose the Fuzzy-guided Multi-granularity Deep Neural Network (FMDNN). Inspired by the multi-granular diagnostic approach of pathologists, we perform feature extraction on cell structures at coarse, medium, and fine granularity, enabling the model to fully harness the information in histopathological images. We incorporate the theory of fuzzy logic to address the challenge of redundant key information arising during multi-granular feature extraction. Cell features are described from different perspectives using multiple fuzzy membership functions, which are fused to create universal fuzzy features. A fuzzy-guided cross-attention module guides universal fuzzy features toward multi-granular features. We propagate these features through an encoder to all patch tokens, aiming to achieve enhanced classification accuracy and robustness. In experiments on multiple public datasets, our model exhibits a significant improvement in accuracy over commonly used classification methods for histopathological image classification and shows commendable interpretability.

* IEEE Transactions on Fuzzy Systems (Early Access) 2024
* This paper has been accepted by IEEE Transactions on Fuzzy Systems for publication. Permission from IEEE must be obtained for all other uses, in any current or future media. The final version is available at [doi: 10.1109/TFUZZ.2024.3410929]
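
The abstract above describes features through multiple fuzzy membership functions that are then fused. The sketch below shows two standard membership functions (Gaussian and triangular) applied to one scalar feature. Fusing them by a simple mean is an assumption made here for illustration, since the abstract does not state the fusion operator, and the parameter values are hypothetical.

```python
import math
from typing import Sequence

def gaussian_membership(x: float, center: float, sigma: float) -> float:
    """Degree in (0, 1] to which x belongs to a fuzzy set centered at `center`."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))

def triangular_membership(x: float, left: float, peak: float, right: float) -> float:
    """Triangular membership; requires left < peak < right."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

def fuse(memberships: Sequence[float]) -> float:
    """Assumed fusion: plain average of the membership degrees."""
    return sum(memberships) / len(memberships)

# Hypothetical normalized feature value viewed through two membership functions.
x = 0.4
degrees = [gaussian_membership(x, center=0.5, sigma=0.2),
           triangular_membership(x, left=0.0, peak=0.5, right=1.0)]
print(degrees, fuse(degrees))
```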

One Prompt is not Enough: Automated Construction of a Mixture-of-Expert Prompts

Jun 28, 2024

Ruochen Wang, Sohyun An, Minhao Cheng, Tianyi Zhou, Sung Ju Hwang, Cho-Jui Hsieh

Abstract: Large Language Models (LLMs) exhibit strong generalization capabilities to novel tasks when prompted with language instructions and in-context demos. Since this ability sensitively depends on the quality of prompts, various methods have been explored to automate the instruction design. While these methods demonstrated promising results, they also restricted the searched prompt to one instruction. Such simplification significantly limits their capacity, as a single demo-free instruction might not be able to cover the entire complex problem space of the targeted task. To alleviate this issue, we adopt the Mixture-of-Expert paradigm and divide the problem space into a set of sub-regions; each sub-region is governed by a specialized expert, equipped with both an instruction and a set of demos. A two-phase process is developed to construct the specialized expert for each region: (1) demo assignment: Inspired by the theoretical connection between in-context learning and kernel regression, we group demos into experts based on their semantic similarity; (2) instruction assignment: A region-based joint search of an instruction per expert complements the demos assigned to it, yielding a synergistic effect. The resulting method, codenamed Mixture-of-Prompts (MoP), achieves an average win rate of 81% against prior art across several major benchmarks.

* Proceedings of the 41st International Conference on Machine Learning (ICML), Vienna, Austria, 2024
* ICML 2024. Code available at https://github.com/ruocwang/mixture-of-prompts
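
The abstract above groups demos into experts by semantic similarity and lets each expert govern a sub-region of the problem space. The sketch below shows one plausible reading of that step: assign each demo, and later each query, to the expert whose centroid is nearest in cosine similarity. The embeddings, centroids, and demo names are hypothetical, and this is not code from the linked repository.

```python
import math
from typing import Dict, List

Vector = List[float]

def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_expert(embedding: Vector, centroids: List[Vector]) -> int:
    """Index of the expert whose centroid is most similar to the embedding."""
    return max(range(len(centroids)), key=lambda i: cosine(embedding, centroids[i]))

def assign_demos(demos: Dict[str, Vector], centroids: List[Vector]) -> Dict[int, List[str]]:
    """Group demos under their nearest expert."""
    experts: Dict[int, List[str]] = {i: [] for i in range(len(centroids))}
    for name, embedding in demos.items():
        experts[nearest_expert(embedding, centroids)].append(name)
    return experts

# Hypothetical 2-D embeddings; a real system would use a text encoder.
centroids = [[1.0, 0.0], [0.0, 1.0]]
demos = {"arithmetic_demo": [0.9, 0.1], "translation_demo": [0.2, 0.8]}
print(assign_demos(demos, centroids))
print(nearest_expert([0.7, 0.3], centroids))  # route a new query
```

At inference time a query would then be answered with the instruction and demos of the expert it is routed to.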

UniGen: A Unified Framework for Textual Dataset Generation Using Large Language Models

Jun 27, 2024

Siyuan Wu, Yue Huang, Chujie Gao, Dongping Chen, Qihui Zhang, Yao Wan, Tianyi Zhou, Xiangliang Zhang, Jianfeng Gao, Chaowei Xiao (+1 more)

Abstract: Large Language Models (LLMs) such as GPT-4 and Llama3 have significantly impacted various fields by enabling high-quality synthetic data generation and reducing dependence on expensive human-generated datasets. Despite this, challenges remain in the areas of generalization, controllability, diversity, and truthfulness within the existing generative frameworks. To address these challenges, this paper presents UniGen, a comprehensive LLM-powered framework designed to produce diverse, accurate, and highly controllable datasets. UniGen is adaptable, supporting all types of text datasets and enhancing the generative process through innovative mechanisms. To augment data diversity, UniGen incorporates an attribute-guided generation module and a group checking feature. For accuracy, it employs a code-based mathematical assessment for label verification alongside a retrieval-augmented generation technique for factual validation. The framework also allows for user-specified constraints, enabling customization of the data generation process to suit particular requirements. Extensive experiments demonstrate the superior quality of data generated by UniGen, and each module within UniGen plays a critical role in this enhancement. Additionally, UniGen is applied in two practical scenarios: benchmarking LLMs and data augmentation. The results indicate that UniGen effectively supports dynamic and evolving benchmarking, and that data augmentation improves LLM capabilities in various domains, including agent-oriented abilities and reasoning skills.
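
The abstract above mentions a code-based mathematical assessment for label verification. One way to picture such a check is to recompute a generated item's answer in code and compare it with the generated label. The sketch below does this for plain arithmetic with a restricted evaluator. The item format and function names are hypothetical and are not UniGen's interface.

```python
import ast
import operator
from typing import Dict, Union

Number = Union[int, float]

_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def safe_eval(expression: str) -> Number:
    """Evaluate +, -, *, / on numeric literals only; reject anything else."""
    def evaluate(node: ast.AST) -> Number:
        if isinstance(node, ast.Expression):
            return evaluate(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -evaluate(node.operand)
        raise ValueError("unsupported expression")
    return evaluate(ast.parse(expression, mode="eval"))

def label_is_consistent(item: Dict[str, object], tolerance: float = 1e-9) -> bool:
    """Check a generated label against the value recomputed from the expression."""
    return abs(safe_eval(str(item["expression"])) - float(item["label"])) <= tolerance

# Hypothetical generated item.
item = {"question": "What is 12 * 7 + 5?", "expression": "12 * 7 + 5", "label": 89}
print(label_is_consistent(item))
```

Items whose label disagrees with the recomputed value could be dropped or regenerated before they enter the dataset.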
