Jie Huang

RefHCM: A Unified Model for Referring Perceptions in Human-Centric Scenarios

Dec 19, 2024

Jie Huang, Ruibing Hou, Jiahe Zhao, Hong Chang, Shiguang Shan

Abstract:Human-centric perceptions play a crucial role in real-world applications. While recent human-centric works have achieved impressive progress, these efforts are often constrained to the visual domain and lack interaction with human instructions, limiting their applicability in broader scenarios such as chatbots and sports analysis. This paper introduces Referring Human Perceptions, where a referring prompt specifies the person of interest in an image. To tackle the new task, we propose RefHCM (Referring Human-Centric Model), a unified framework to integrate a wide range of human-centric referring tasks. Specifically, RefHCM employs sequence mergers to convert raw multimodal data (including images, text, coordinates, and parsing maps) into semantic tokens. This standardized representation enables RefHCM to reformulate diverse human-centric referring tasks into a sequence-to-sequence paradigm, solved using a plain encoder-decoder transformer architecture. Benefiting from a unified learning strategy, RefHCM effectively facilitates knowledge transfer across tasks and exhibits unforeseen capabilities in handling complex reasoning. This work represents the first attempt to address referring human perceptions with a general-purpose framework, while simultaneously establishing a corresponding benchmark that sets new standards for the field. Extensive experiments showcase RefHCM's competitive and even superior performance across multiple human-centric referring tasks. The code and data are publicly available at https://github.com/JJJYmmm/RefHCM.

* 13 pages

Multimodal Sentiment Analysis Based on Causal Reasoning

Dec 10, 2024

Fuhai Chen, Pengpeng Huang, Xuri Ge, Jie Huang, Zishuo Bao

Abstract:With the rapid development of multimedia, the shift from unimodal textual sentiment analysis to multimodal image-text sentiment analysis has attracted academic and industrial attention in recent years. However, multimodal sentiment analysis is affected by unimodal data bias, e.g., text sentiment is misleading due to explicit sentiment semantics, leading to low accuracy in the final sentiment classification. In this paper, we propose a novel CounterFactual Multimodal Sentiment Analysis framework (CF-MSA) using causal counterfactual inference to construct multimodal sentiment causal inference. CF-MSA mitigates the direct effect from unimodal bias and ensures heterogeneity across modalities by differentiating the treatment variables between modalities. In addition, considering the information complementarity and bias differences between modalities, we propose a new optimisation objective to effectively integrate different modalities and reduce the inherent bias from each modality. Experimental results on two public datasets, MVSA-Single and MVSA-Multiple, demonstrate that the proposed CF-MSA has superior debiasing capability and achieves new state-of-the-art performance. We will release the code and datasets to facilitate future research.

SimulBench: Evaluating Language Models with Creative Simulation Tasks

Sep 11, 2024

Qi Jia, Xiang Yue, Tianyu Zheng, Jie Huang, Bill Yuchen Lin

Abstract:We introduce SimulBench, a benchmark designed to evaluate large language models (LLMs) across a diverse collection of creative simulation scenarios, such as acting as a Linux terminal or playing text games with users. While these simulation tasks serve as effective measures of an LLM's general intelligence, they are seldom incorporated into existing benchmarks. A major challenge is to develop an evaluation framework for testing different LLMs fairly while preserving the multi-round interactive nature of simulation tasks between users and AI. To tackle this issue, we suggest using a fixed LLM as a user agent to engage with an LLM to collect dialogues first under different tasks. Then, challenging dialogue scripts are extracted for evaluating different target LLMs. To facilitate automatic assessment on SimulBench, GPT-4 is employed as the evaluator, tasked with reviewing the quality of the final response generated by the target LLMs given multi-turn dialogue scripts. Our comprehensive experiments indicate that these simulation tasks continue to pose a significant challenge with their unique natures and show the gap between proprietary models and the most advanced open LLMs. For example, GPT-4-turbo outperforms LLaMA-3-70b-Chat on 18.55% more cases.

* Website: https://simulbench.github.io/

Segmentation by registration-enabled SAM prompt engineering using five reference images

Jul 25, 2024

Yaxi Chen, Aleksandra Ivanova, Shaheer U. Saeed, Rikin Hargunani, Jie Huang, Chaozong Liu, Yipeng Hu

Abstract:The recently proposed Segment Anything Model (SAM) is a general tool for image segmentation, but it requires additional adaptation and careful fine-tuning for medical image segmentation, especially for small, irregularly-shaped, and boundary-ambiguous anatomical structures such as the knee cartilage that is of interest in this work. Repaired cartilage, after certain surgical procedures, exhibits imaging patterns unseen during pre-training, posing further challenges for using models like SAM with or without general-purpose fine-tuning. To address this, we propose a novel registration-based prompt engineering framework for medical image segmentation using SAM. This approach utilises established image registration algorithms to align the new image (to-be-segmented) and a small number of reference images, without requiring segmentation labels. The spatial transformations generated by registration align either the new image or pre-defined point-based prompts, before using them as input to SAM. This strategy, requiring as few as five reference images with defined point prompts, effectively prompts SAM for inference on new images, without needing any segmentation labels. Evaluation on MR images from patients who received cartilage stem cell therapy yielded Dice scores of 0.89, 0.87, 0.53, and 0.52 for segmenting the femur, tibia, femoral cartilage, and tibial cartilage, respectively. This outperforms atlas-based label fusion and is comparable to supervised nnUNet, an upper-bound fair baseline in this application, both of which require full segmentation labels for reference samples. The code is available at: https://github.com/chrissyinreallife/KneeSegmentWithSAM.git

* Accepted to the 11th International Workshop on Biomedical Image Registration (WBIR 2024)
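
For readers unfamiliar with the Dice score reported in this abstract, the following is a minimal sketch of the standard definition, Dice = 2|A ∩ B| / (|A| + |B|), applied to two binary masks. It is not taken from the authors' repository: the function name `dice_score`, the flat 0/1 list representation, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
def dice_score(pred, target):
    """Dice overlap between two binary masks given as flat sequences of 0/1.

    Hypothetical helper for illustration; real pipelines usually operate
    on image arrays rather than Python lists.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Pixels labelled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground pixels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 1, 0]
    print(dice_score(predicted, reference))
```

A score of 1.0 means the masks coincide and 0.0 means they share no foreground pixel, which is why small, thin structures such as cartilage tend to score lower than large bones for the same boundary error.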

Long-form factuality in large language models

Apr 03, 2024

Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Jie Huang, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang (+2 more)

Abstract:Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics. To benchmark a model's long-form factuality in open domains, we first use GPT-4 to generate LongFact, a prompt set comprising thousands of questions spanning 38 topics. We then propose that LLM agents can be used as automated evaluators for long-form factuality through a method which we call Search-Augmented Factuality Evaluator (SAFE). SAFE utilizes an LLM to break down a long-form response into a set of individual facts and to evaluate the accuracy of each fact using a multi-step reasoning process comprising sending search queries to Google Search and determining whether a fact is supported by the search results. Furthermore, we propose extending F1 score as an aggregated metric for long-form factuality. To do so, we balance the percentage of supported facts in a response (precision) with the percentage of provided facts relative to a hyperparameter representing a user's preferred response length (recall). Empirically, we demonstrate that LLM agents can outperform crowdsourced human annotators: on a set of ~16k individual facts, SAFE agrees with crowdsourced human annotators 72% of the time, and on a random subset of 100 disagreement cases, SAFE wins 76% of the time. At the same time, SAFE is more than 20 times cheaper than human annotators. We also benchmark thirteen language models on LongFact across four model families (Gemini, GPT, Claude, and PaLM-2), finding that larger language models generally achieve better long-form factuality. LongFact, SAFE, and all experimental code are available at https://github.com/google-deepmind/long-form-factuality.
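
The extended F1 metric described in this abstract can be sketched as follows. The abstract gives only the verbal definition, so the exact form used here is an assumption: precision is the share of supported facts among all rated facts, recall is the number of supported facts divided by the preferred length K and capped at 1, and the score is 0 when no fact is supported. The function name `f1_at_k` and its parameters are hypothetical and do not come from the released code.

```python
def f1_at_k(supported: int, not_supported: int, k: int) -> float:
    """Sketch of an F1-style score for one long-form response.

    supported / not_supported: counts of individual facts judged
    supported or not supported (for example by an automated rater).
    k: assumed hyperparameter for the user's preferred number of facts.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if supported == 0:
        return 0.0  # nothing supported: both precision and recall are zero
    precision = supported / (supported + not_supported)
    recall = min(supported / k, 1.0)  # extra facts beyond k earn no more recall
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # 30 supported facts, 10 unsupported, preferred length of 60 facts.
    print(f1_at_k(30, 10, 60))
```

Under this reading, a response is rewarded for being accurate and for supplying enough facts, but not for padding beyond the preferred length.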

Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents

Mar 04, 2024

Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, Bill Yuchen Lin

Abstract:Large Language Models (LLMs) have become integral components in various autonomous agent systems. In this study, we present an exploration-based trajectory optimization approach, referred to as ETO. This learning method is designed to enhance the performance of open LLM agents. Contrary to previous studies that exclusively train on successful expert trajectories, our method allows agents to learn from their exploration failures. This leads to improved performance through an iterative optimization framework. During the exploration phase, the agent interacts with the environment while completing given tasks, gathering failure trajectories to create contrastive trajectory pairs. In the subsequent training phase, the agent utilizes these trajectory preference pairs to update its policy using contrastive learning methods like DPO. This iterative cycle of exploration and training fosters continued improvement in the agents. Our experiments on three complex tasks demonstrate that ETO consistently surpasses baseline performance by a large margin. Furthermore, an examination of task-solving efficiency and potential in scenarios lacking expert trajectory underscores the effectiveness of our approach.
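
As a rough illustration of the training phase mentioned in this abstract, the sketch below computes the standard DPO loss for a single preference pair, treating the successful trajectory as preferred and the failed one as dispreferred. It is not the authors' implementation: the function name `dpo_loss`, the use of whole-trajectory log-probabilities as inputs, and the default `beta=0.1` are assumptions for illustration.

```python
import math


def dpo_loss(policy_logp_win, policy_logp_lose,
             ref_logp_win, ref_logp_lose, beta=0.1):
    """DPO loss for one (preferred, dispreferred) trajectory pair.

    Each argument is an assumed log-probability of a whole trajectory
    under the trained policy or the frozen reference model.
    """
    # How much more the policy favours the preferred trajectory than the
    # reference model does, scaled by beta.
    margin = beta * ((policy_logp_win - ref_logp_win)
                     - (policy_logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
    # Policy already prefers the successful trajectory: loss is lower.
    print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

Minimising this quantity pushes the policy to raise the likelihood of successful trajectories relative to failed ones, which is the contrastive signal the abstract attributes to exploration failures.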

Electromagnetic Information Theory: Fundamentals and Applications for 6G Wireless Communication Systems

Jan 17, 2024

Cheng-Xiang Wang, Yue Yang, Jie Huang, Xiqi Gao, Tie Jun Cui, Lajos Hanzo

Abstract:In wireless communications, electromagnetic theory and information theory constitute a pair of fundamental theories, bridged by antenna theory and wireless propagation channel modeling theory. Up to the fifth generation (5G) wireless communication networks, these four theories have been developing relatively independently. However, in sixth generation (6G) space-air-ground-sea wireless communication networks, seamless coverage is expected in the three-dimensional (3D) space, potentially necessitating the acquisition of channel state information (CSI) and channel capacity calculation anywhere and at any time. Additionally, key 6G technologies such as ultra-massive multiple-input multiple-output (MIMO) and holographic MIMO involve intricate interaction between the antennas and wireless propagation environments, which necessitates the joint modeling of antennas and wireless propagation channels. To address the challenges in 6G, the integration of the above four theories becomes inevitable, leading to the concept of the so-called electromagnetic information theory (EIT). In this article, a suite of 6G key technologies is highlighted. Then, the concepts and relationships of the four theories are unveiled. Finally, the necessity and benefits of integrating them into the EIT are revealed.

Cascade Speculative Drafting for Even Faster LLM Inference

Dec 21, 2023

Ziyi Chen, Xiaocong Yang, Jiacheng Lin, Chenkai Sun, Jie Huang, Kevin Chen-Chuan Chang

Abstract:Speculative decoding enhances the efficiency of large language models (LLMs) by leveraging a draft model to draft for a larger target model to review. However, drafting in speculative decoding involves slow autoregressive generation and generating tokens of different importance with the same time allocation. These two inefficiencies lead to its suboptimal performance. To address this issue, we introduce Cascade Speculative Drafting (CS. Drafting), a novel approach that employs two types of cascades. The Vertical Cascade eliminates autoregressive generation from neural models. The Horizontal Cascade constitutes efficient time allocation in drafting with its optimality supported by our theoretical analysis. Combining both cascades, our CS. Drafting algorithm has achieved up to 72 percent additional speedup over speculative decoding in our experiments while keeping the same output distribution.

* Preprint in progress
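
The abstract's claim of "keeping the same output distribution" rests on the verification step of ordinary speculative decoding, which can be sketched as below. This shows only the standard accept-or-resample rule for a single drafted token, not the Vertical or Horizontal Cascade proposed in the paper. The function name `verify_draft_token` and the representation of distributions as plain probability lists are assumptions for illustration.

```python
import random


def verify_draft_token(token, target_probs, draft_probs, rng=random):
    """Return the token to emit after checking one drafted token.

    target_probs / draft_probs: assumed next-token distributions (lists
    summing to 1) from the large target model and the small draft model.
    """
    p, q = target_probs[token], draft_probs[token]
    # Accept the drafted token with probability min(1, p / q).
    if q > 0 and rng.random() < min(1.0, p / q):
        return token
    # On rejection, resample from the normalised positive part of (p - q).
    residual = [max(0.0, pt - qt) for pt, qt in zip(target_probs, draft_probs)]
    # If the residual is all zeros the two distributions coincide, so
    # sampling from the target distribution directly is equivalent.
    weights = residual if sum(residual) > 0 else target_probs
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    target, draft = [0.7, 0.3], [0.4, 0.6]
    n = 20000
    hits = 0
    for _ in range(n):
        drafted = rng.choices([0, 1], weights=draft, k=1)[0]
        hits += verify_draft_token(drafted, target, draft, rng) == 0
    # The frequency of token 0 should be close to the target's 0.7.
    print(hits / n)
```

Because rejected tokens are replaced by samples from the residual distribution, the emitted token follows the target model's distribution regardless of how good the draft model is; a better draft only raises the acceptance rate and hence the speedup.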

Decoupling Degradation and Content Processing for Adverse Weather Image Restoration

Dec 08, 2023

Xi Wang, Xueyang Fu, Peng-Tao Jiang, Jie Huang, Mi Zhou, Bo Li, Zheng-Jun Zha

Abstract:Adverse weather image restoration strives to recover clear images from those affected by various weather types, such as rain, haze, and snow. Each weather type calls for a tailored degradation removal approach due to its unique impact on images. Conversely, content reconstruction can employ a uniform approach, as the underlying image content remains consistent. Although previous techniques can handle multiple weather types within a single network, they neglect the crucial distinction between these two processes, limiting the quality of restored images. This work introduces a novel adverse weather image restoration method, called DDCNet, which decouples the degradation removal and content reconstruction process at the feature level based on their channel statistics. Specifically, we exploit the unique advantages of the Fourier transform in both these two processes: (1) the degradation information is mainly located in the amplitude component of the Fourier domain, and (2) the Fourier domain contains global information. The former facilitates channel-dependent degradation removal operation, allowing the network to tailor responses to various adverse weather types; the latter, by integrating Fourier's global properties into channel-independent content features, enhances network capacity for consistent global content reconstruction. We further augment the degradation removal process with a degradation mapping loss function. Extensive experiments demonstrate our method achieves state-of-the-art performance in multiple adverse weather removal benchmarks.

A WINNER+ Based 3-D Non-Stationary Wideband MIMO Channel Model

Dec 01, 2023

Ji Bian, Jian Sun, Cheng-Xiang Wang, Rui Feng, Jie Huang, Yang Yang, Minggao Zhang

Abstract:In this paper, a three-dimensional (3-D) non-stationary wideband multiple-input multiple-output (MIMO) channel model based on the WINNER+ channel model is proposed. The angular distributions of clusters in both the horizontal and vertical planes are jointly considered. The receiver and clusters can be moving, which makes the model more general. Parameters including number of clusters, powers, delays, azimuth angles of departure (AAoDs), azimuth angles of arrival (AAoAs), elevation angles of departure (EAoDs), and elevation angles of arrival (EAoAs) are time-variant. The cluster time evolution is modeled using a birth-death process. Statistical properties, including spatial cross-correlation function (CCF), temporal autocorrelation function (ACF), Doppler power spectrum density (PSD), level-crossing rate (LCR), average fading duration (AFD), and stationary interval are investigated and analyzed. The LCR, AFD, and stationary interval of the proposed channel model are validated against the measurement data. Numerical and simulation results show that the proposed channel model has the ability to reproduce the main properties of real non-stationary channels. Furthermore, the proposed channel model can be adapted to various communication scenarios by adjusting different parameter values.
