Yiming Sun

ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model

Nov 04, 2024

Yiming Sun, Fan Yu, Shaoxiang Chen, Yu Zhang, Junwei Huang, Chenhui Li, Yang Li, Changbo Wang

Abstract:Visual object tracking aims to locate a targeted object in a video sequence based on an initial bounding box. Recently, Vision-Language (VL) trackers have proposed to utilize additional natural language descriptions to enhance versatility in various applications. However, VL trackers are still inferior to state-of-the-art (SoTA) visual trackers in terms of tracking performance. We found that this inferiority primarily results from their heavy reliance on manual textual annotations, which include the frequent provision of ambiguous language descriptions. In this paper, we propose ChatTracker to leverage the wealth of world knowledge in the Multimodal Large Language Model (MLLM) to generate high-quality language descriptions and enhance tracking performance. To this end, we propose a novel reflection-based prompt optimization module to iteratively refine the ambiguous and inaccurate descriptions of the target with tracking feedback. To further utilize semantic information produced by MLLM, a simple yet effective VL tracking framework is proposed and can be easily integrated as a plug-and-play module to boost the performance of both VL and visual trackers. Experimental results show that our proposed ChatTracker achieves a performance comparable to existing methods.

Learning Multimodal Cues of Children's Uncertainty

Oct 17, 2024

Qi Cheng, Mert İnan, Rahma Mbarki, Grace Grmek, Theresa Choi, Yiming Sun, Kimele Persaud, Jenny Wang, Malihe Alikhani

Abstract:Understanding uncertainty plays a critical role in achieving common ground (Clark et al., 1983). This is especially important for multimodal AI systems that collaborate with users to solve a problem or guide the user through a challenging concept. In this work, for the first time, we present a dataset annotated in collaboration with developmental and cognitive psychologists for the purpose of studying nonverbal cues of uncertainty. We then present an analysis of the data, studying different roles of uncertainty and its relationship with task difficulty and performance. Lastly, we present a multimodal machine learning model that can predict uncertainty given a real-time video clip of a participant, which we find improves upon a baseline multimodal transformer model. This work informs research on cognitive coordination between human-human and human-AI and has broad implications for gesture understanding and generation. The anonymized version of our data and code will be publicly available upon the completion of the required consent forms and data sheets.

* SIGDIAL 2023

Transfer Learning with Clinical Concept Embeddings from Large Language Models

Sep 20, 2024

Yuhe Gao, Runxue Bao, Yuelyu Ji, Yiming Sun, Chenxi Song, Jeffrey P. Ferraro, Ye Ye

Abstract:Knowledge sharing is crucial in healthcare, especially when leveraging data from multiple clinical sites to address data scarcity, reduce costs, and enable timely interventions. Transfer learning can facilitate cross-site knowledge transfer, but a major challenge is heterogeneity in clinical concepts across different sites. Large Language Models (LLMs) show significant potential for capturing the semantic meaning of clinical concepts and reducing heterogeneity. This study analyzed electronic health records from two large healthcare systems to assess the impact of semantic embeddings from LLMs on local, shared, and transfer learning models. Results indicate that domain-specific LLMs, such as Med-BERT, consistently outperform in local and direct transfer scenarios, while generic models like OpenAI embeddings require fine-tuning for optimal performance. However, excessive tuning of models with biomedical embeddings may reduce effectiveness, emphasizing the need for balance. This study highlights the importance of domain-specific embeddings and careful model tuning for effective knowledge transfer in healthcare.

Free-form Grid Structure Form Finding based on Machine Learning and Multi-objective Optimisation

Jul 13, 2024

Yiping Meng, Yiming Sun

Abstract:Free-form structural forms are widely used to design spatial structures for their irregular spatial morphology. Current free-form form-finding methods cannot adequately meet the material properties, structural requirements or construction conditions, which causes deviation between the initial 3D geometric design model and the constructed free-form structure. Thus, the main focus of this paper is to improve the rationality of free-form morphology by considering multiple objectives in line with the characteristics and constraints of the material. In this paper, glued laminated timber is selected as a case study. Firstly, machine learning is adopted for its predictive capability. By selecting a free-form timber grid structure and following the principles of NURBS, the free-form structure is simplified into free-form curves. A transformer is selected to train on and predict the curvatures of the curves considering the material characteristics. After predicting the curvatures, the curves are transformed into vectors consisting of control points, weights, and knot vectors. To ensure the constructability and robustness of the structure, the optimisation objectives are minimising the mass of the structure, stress and strain energy. Two parameters (weight and the z-coordinate of the control points) of the free-form morphology are extracted as the variables for the optimisation. The evaluation algorithm was selected as the optimal tool due to its capability to optimise multiple parameters. While optimising the two variables, the mechanical performance evaluation indexes such as the maximum displacement in the z-direction are demonstrated in the 60th step. The optimisation results for structure mass, stress and strain energy after 60 steps show a tendency of oscillating convergence, which indicates the efficiency of the proposed multi-objective optimisation.

* 11 pages, 9 figures

One-shot Active Learning Based on Lewis Weight Sampling for Multiple Deep Models

May 23, 2024

Sheng-Jun Huang, Yi Li, Yiming Sun, Ying-Peng Tang

Abstract:Active learning (AL) for multiple target models aims to reduce labeled data querying while effectively training multiple models concurrently. Existing AL algorithms often rely on iterative model training, which can be computationally expensive, particularly for deep models. In this paper, we propose a one-shot AL method to address this challenge, which performs all label queries without repeated model training. Specifically, we extract different representations of the same dataset using distinct network backbones, and actively learn the linear prediction layer on each representation via an $\ell_p$-regression formulation. The regression problems are solved approximately by sampling and reweighting the unlabeled instances based on their maximum Lewis weights across the representations. An upper bound on the number of samples needed is provided with a rigorous analysis for $p\in [1, +\infty)$. Experimental results on 11 benchmarks show that our one-shot approach achieves competitive performances with the state-of-the-art AL methods for multiple target models.

* A preliminary version appeared in the Proceedings of the 12th International Conference on Learning Representations (ICLR 2024)

Multi: Multimodal Understanding Leaderboard with Text and Images

Feb 05, 2024

Zichen Zhu, Yang Xu, Lu Chen, Jingkai Yang, Yichuan Ma, Yiming Sun, Hailin Wen, Jiaqi Liu, Jinyu Cai, Yingzi Ma(+4 more)

Abstract:Rapid progress in multimodal large language models (MLLMs) highlights the need to introduce challenging yet realistic benchmarks to the academic community. Existing benchmarks primarily focus on simple natural image understanding, but Multi emerges as a cutting-edge benchmark for MLLMs, offering a comprehensive dataset for evaluating MLLMs against understanding complex figures and tables, and scientific questions. This benchmark, reflecting current realistic examination styles, provides multimodal inputs and requires responses that are either precise or open-ended, similar to real-life school tests. It challenges MLLMs with a variety of tasks, ranging from formula derivation to image detail analysis, and cross-modality reasoning. Multi includes over 18,000 questions, with a focus on science-based QA in diverse formats. We also introduce Multi-Elite, a 500-question subset for testing the extremities of MLLMs, and Multi-Extend, which enhances In-Context Learning research with more than 4,500 knowledge pieces. Our evaluation indicates significant potential for MLLM advancement, with GPT-4V achieving a 63.7% accuracy rate on Multi, in contrast to other MLLMs scoring between 31.3% and 53.7%. Multi serves not only as a robust evaluation platform but also paves the way for the development of expert-level AI.

* Details and access are available at: https://OpenDFM.github.io/MULTI-Benchmark/

Online Transfer Learning for RSV Case Detection

Feb 03, 2024

Yiming Sun, Yuhe Gao, Runxue Bao, Gregory F. Cooper, Jessi Espino, Harry Hochheiser, Marian G. Michaels, John M. Aronis, Ye Ye

Abstract:Transfer learning has become a pivotal technique in machine learning, renowned for its effectiveness in various real-world applications. However, a significant challenge arises when applying this approach to sequential epidemiological data, often characterized by a scarcity of labeled information. To address this challenge, we introduce Predictive Volume-Adaptive Weighting (PVAW), a novel online multi-source transfer learning method. PVAW innovatively implements a dynamic weighting mechanism within an ensemble model, allowing for the automatic adjustment of weights based on the relevance and contribution of each source and target model. We demonstrate the effectiveness of PVAW through its application in analyzing Respiratory Syncytial Virus (RSV) data, collected over multiple seasons at the University of Pittsburgh Medical Center. Our method showcases significant improvements in model performance over existing baselines, highlighting the potential of online transfer learning in handling complex, sequential data. This study not only underscores the adaptability and sophistication of transfer learning in healthcare but also sets a new direction for future research in creating advanced predictive models.

* 10 pages, 4 figures
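
The abstract above describes an ensemble whose weights adjust automatically as labeled data arrives, but it does not give the update rule. As a rough illustration only, the sketch below uses the generic exponential-weights (Hedge) update, which is a stand-in and not the PVAW method; the function names, the squared-error loss, and the learning rate `eta` are all assumptions made for this example.

```python
import math


def update_weights(weights, losses, eta=0.5):
    """One exponential-weights (Hedge) step: models with higher loss lose weight.

    This is a generic stand-in, not the PVAW update rule.
    """
    scaled = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    total = sum(scaled)
    return [s / total for s in scaled]


def ensemble_predict(weights, predictions):
    """Weighted average of the per-model probability predictions."""
    return sum(w * p for w, p in zip(weights, predictions))


def run_online(stream, n_models, eta=0.5):
    """Process (predictions, label) pairs one at a time.

    Each step predicts with the current weights first, then updates the
    weights once the true label is revealed (assumed squared-error loss).
    """
    weights = [1.0 / n_models] * n_models  # start with uniform weights
    outputs = []
    for predictions, label in stream:
        outputs.append(ensemble_predict(weights, predictions))
        losses = [(p - label) ** 2 for p in predictions]
        weights = update_weights(weights, losses, eta)
    return weights, outputs


if __name__ == "__main__":
    # Hypothetical stream: two models (e.g. one source, one target) and binary labels.
    stream = [([0.9, 0.2], 1), ([0.8, 0.4], 1), ([0.1, 0.7], 0)]
    final_weights, outputs = run_online(stream, n_models=2)
    print(final_weights, outputs)
```

In this toy stream the first model is consistently closer to the label, so its weight grows over time; a real system would replace the hypothetical predictions with outputs of the trained source and target models.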

A Survey of Heterogeneous Transfer Learning

Oct 15, 2023

Runxue Bao, Yiming Sun, Yuhe Gao, Jindong Wang, Qiang Yang, Haifeng Chen, Zhi-Hong Mao, Ye Ye

Abstract:The application of transfer learning, an approach utilizing knowledge from a source domain to enhance model performance in a target domain, has seen a tremendous rise in recent years, underpinning many real-world scenarios. The key to its success lies in the shared common knowledge between the domains, a prerequisite in most transfer learning methodologies. These methods typically presuppose identical feature spaces and label spaces in both domains, known as homogeneous transfer learning, which, however, is not always a practical assumption. Oftentimes, the source and target domains vary in feature spaces, data distributions, and label spaces, making it challenging or costly to secure source domain data with identical feature and label spaces as the target domain. Arbitrary elimination of these differences is not always feasible or optimal. Thus, heterogeneous transfer learning, acknowledging and dealing with such disparities, has emerged as a promising approach for a variety of tasks. Despite the existence of a survey in 2017 on this topic, the fast-paced advances post-2017 necessitate an updated, in-depth review. We therefore present a comprehensive survey of recent developments in heterogeneous transfer learning methods, offering a systematic guide for future research. Our paper reviews methodologies for diverse learning scenarios, discusses the limitations of current studies, and covers various application contexts, including Natural Language Processing, Computer Vision, Multimodality, and Biomedicine, to foster a deeper understanding and spur future research.

Prediction of COVID-19 Patients' Emergency Room Revisit using Multi-Source Transfer Learning

Jun 29, 2023

Yuelyu Ji, Yuhe Gao, Runxue Bao, Qi Li, Disheng Liu, Yiming Sun, Ye Ye

Abstract:The coronavirus disease 2019 (COVID-19) has led to a global pandemic of significant severity. In addition to its high level of contagiousness, COVID-19 can have a heterogeneous clinical course, ranging from asymptomatic carriers to severe and potentially life-threatening health complications. Many patients have to revisit the emergency room (ER) within a short time after discharge, which significantly increases the workload for medical staff. Early identification of such patients is crucial for helping physicians focus on treating life-threatening cases. In this study, we obtained Electronic Health Records (EHRs) of 3,210 encounters from 13 affiliated ERs within the University of Pittsburgh Medical Center between March 2020 and January 2021. We leveraged a Natural Language Processing technique, ScispaCy, to extract clinical concepts and used the 1001 most frequent concepts to develop 7-day revisit models for COVID-19 patients in ERs. The research data we collected from 13 ERs may have distributional differences that could affect the model development. To address this issue, we employed a classic deep transfer learning method called the Domain Adversarial Neural Network (DANN) and evaluated different modeling strategies, including the Multi-DANN algorithm, the Single-DANN algorithm, and three baseline methods. Results showed that the Multi-DANN models outperformed the Single-DANN models and baseline models in predicting revisits of COVID-19 patients to the ER within 7 days after discharge. Notably, the Multi-DANN strategy effectively addressed the heterogeneity among multiple source domains and improved the adaptation of source data to the target domain. Moreover, the high performance of Multi-DANN models indicates that EHRs are informative for developing a prediction model to identify COVID-19 patients who are very likely to revisit an ER within 7 days after discharge.

* to appear at ICHI 2023

MoE-Fusion: Instance Embedded Mixture-of-Experts for Infrared and Visible Image Fusion

Feb 02, 2023

Yiming Sun, Bing Cao, Pengfei Zhu, Qinghua Hu

Abstract:Infrared and visible image fusion can compensate for the incompleteness of single-modality imaging and provide a more comprehensive scene description based on cross-modal complementarity. Most works focus on learning the overall cross-modal features by high- and low-frequency constraints at the image level alone, ignoring the fact that cross-modal instance-level features often contain more valuable information. To fill this gap, we model cross-modal instance-level features by embedding instance information into a set of Mixture-of-Experts (MoEs) for the first time, prompting image fusion networks to specifically learn instance-level information. We propose a novel framework with instance embedded Mixture-of-Experts for infrared and visible image fusion, termed MoE-Fusion, which contains an instance embedded MoE group (IE-MoE), an MoE-Decoder, two encoders, and two auxiliary detection networks. By embedding the instance-level information learned in the auxiliary network, IE-MoE achieves specialized learning of cross-modal foreground and background features. MoE-Decoder can adaptively select suitable experts for cross-modal feature decoding and obtain fusion results dynamically. Extensive experiments show that our MoE-Fusion outperforms state-of-the-art methods in preserving contrast and texture details by learning instance-level information in cross-modal images.
