Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yongqiang Zhao

ConViTac: Aligning Visual-Tactile Fusion with Contrastive Representations

Jun 25, 2025

Zhiyuan Wu, Yongqiang Zhao, Shan Luo

Abstract:Vision and touch are two fundamental sensory modalities for robots, offering complementary information that enhances perception and manipulation tasks. Previous research has attempted to jointly learn visual-tactile representations to extract more meaningful information. However, these approaches often rely on direct combination, such as feature addition and concatenation, for modality fusion, which tend to result in poor feature integration. In this paper, we propose ConViTac, a visual-tactile representation learning network designed to enhance the alignment of features during fusion using contrastive representations. Our key contribution is a Contrastive Embedding Conditioning (CEC) mechanism that leverages a contrastive encoder pretrained through self-supervised contrastive learning to project visual and tactile inputs into unified latent embeddings. These embeddings are used to couple visual-tactile feature fusion through cross-modal attention, aiming at aligning the unified representations and enhancing performance on downstream tasks. We conduct extensive experiments to demonstrate the superiority of ConViTac in real world over current state-of-the-art methods and the effectiveness of our proposed CEC mechanism, which improves accuracy by up to 12.0% in material classification and grasping prediction tasks.

Via

Access Paper or Ask Questions

Enhancing LLM Generation with Knowledge Hypergraph for Evidence-Based Medicine

Mar 18, 2025

Chengfeng Dou, Ying Zhang, Zhi Jin, Wenpin Jiao, Haiyan Zhao, Yongqiang Zhao, Zhengwei Tao

Abstract:Evidence-based medicine (EBM) plays a crucial role in the application of large language models (LLMs) in healthcare, as it provides reliable support for medical decision-making processes. Although it benefits from current retrieval-augmented generation~(RAG) technologies, it still faces two significant challenges: the collection of dispersed evidence and the efficient organization of this evidence to support the complex queries necessary for EBM. To tackle these issues, we propose using LLMs to gather scattered evidence from multiple sources and present a knowledge hypergraph-based evidence management model to integrate these evidence while capturing intricate relationships. Furthermore, to better support complex queries, we have developed an Importance-Driven Evidence Prioritization (IDEP) algorithm that utilizes the LLM to generate multiple evidence features, each with an associated importance score, which are then used to rank the evidence and produce the final retrieval results. Experimental results from six datasets demonstrate that our approach outperforms existing RAG techniques in application domains of interest to EBM, such as medical quizzing, hallucination detection, and decision support. Testsets and the constructed knowledge graph can be accessed at \href{https://drive.google.com/file/d/1WJ9QTokK3MdkjEmwuFQxwH96j_Byawj_/view?usp=drive_link}{https://drive.google.com/rag4ebm}.

Via

Access Paper or Ask Questions

General Force Sensation for Tactile Robot

Mar 02, 2025

Zhuo Chen, Ni Ou, Xuyang Zhang, Zhiyuan Wu, Yongqiang Zhao, Yupeng Wang, Nathan Lepora, Lorenzo Jamone, Jiankang Deng, Shan Luo

Figure 1 for General Force Sensation for Tactile Robot

Figure 2 for General Force Sensation for Tactile Robot

Figure 3 for General Force Sensation for Tactile Robot

Figure 4 for General Force Sensation for Tactile Robot

Abstract:Robotic tactile sensors, including vision-based and taxel-based sensors, enable agile manipulation and safe human-robot interaction through force sensation. However, variations in structural configurations, measured signals, and material properties create domain gaps that limit the transferability of learned force sensation across different tactile sensors. Here, we introduce GenForce, a general framework for achieving transferable force sensation across both homogeneous and heterogeneous tactile sensors in robotic systems. By unifying tactile signals into marker-based binary tactile images, GenForce enables the transfer of existing force labels to arbitrary target sensors using a marker-to-marker translation technique with a few paired data. This process equips uncalibrated tactile sensors with force prediction capabilities through spatiotemporal force prediction models trained on the transferred data. Extensive experimental results validate GenForce's generalizability, accuracy, and robustness across sensors with diverse marker patterns, structural designs, material properties, and sensing principles. The framework significantly reduces the need for costly and labor-intensive labeled data collection, enabling the rapid deployment of multiple tactile sensors on robotic hands requiring force sensing capabilities.

Via

Access Paper or Ask Questions

Exploring LLM-based Data Annotation Strategies for Medical Dialogue Preference Alignment

Oct 05, 2024

Chengfeng Dou, Ying Zhang, Zhi Jin, Wenpin Jiao, Haiyan Zhao, Yongqiang Zhao, Zhengwei Tao

Figure 1 for Exploring LLM-based Data Annotation Strategies for Medical Dialogue Preference Alignment

Figure 2 for Exploring LLM-based Data Annotation Strategies for Medical Dialogue Preference Alignment

Figure 3 for Exploring LLM-based Data Annotation Strategies for Medical Dialogue Preference Alignment

Figure 4 for Exploring LLM-based Data Annotation Strategies for Medical Dialogue Preference Alignment

Abstract:This research examines the use of Reinforcement Learning from AI Feedback (RLAIF) techniques to improve healthcare dialogue models, with the aim of tackling the challenges of preference-aligned data annotation while reducing the reliance on medical experts. We argue that the primary challenges in current RLAIF research for healthcare are the limitations of automated evaluation methods and the difficulties in accurately representing physician preferences. To address these challenges, we present a new evaluation framework based on standardized patient examinations. This framework is designed to objectively assess the effectiveness of large language models (LLMs) in guiding users and following instructions, enabling a comprehensive comparison across different models. Furthermore, our investigation of effective ways to express physician preferences using Constitutional AI algorithms highlighted the particular effectiveness of flowcharts. Utilizing this finding, we introduce an innovative agent-based approach for annotating preference data. This approach autonomously creates medical dialogue flows tailored to the patient's condition, demonstrates strong generalization abilities, and reduces the need for expert involvement. Our results show that the agent-based approach outperforms existing RLAIF annotation methods in standardized patient examinations and surpasses current open source medical dialogue LLMs in various test scenarios.

* 14 Pages, 12 figures

Via

Access Paper or Ask Questions

Metasurface-based Snapshot Shortwave-Infrared Hyperspectral Image Reconstruction with Inter and Intra Prior Learning Network

Jul 11, 2024

Linqiang Li, Jinglei Hao, Yongqiang Zhao, Pan Liu, Haofang Yan, Ziqin Zhang, Seong G. Kong

Figure 1 for Metasurface-based Snapshot Shortwave-Infrared Hyperspectral Image Reconstruction with Inter and Intra Prior Learning Network

Figure 2 for Metasurface-based Snapshot Shortwave-Infrared Hyperspectral Image Reconstruction with Inter and Intra Prior Learning Network

Figure 3 for Metasurface-based Snapshot Shortwave-Infrared Hyperspectral Image Reconstruction with Inter and Intra Prior Learning Network

Figure 4 for Metasurface-based Snapshot Shortwave-Infrared Hyperspectral Image Reconstruction with Inter and Intra Prior Learning Network

Abstract:Shortwave-infrared(SWIR) spectral information,ranging from 1 {\mu}m to 2.5{\mu}m, breaks the limitations of traditional color cameras in acquiring scene information and has been used in many fields. However, conventional SWIR hyperspectral imaging systems face challenges due to their bulky setups and low acquisition speed. In this work, we introduce a snapshot SWIR hyperspectral imaging system based on a metasurface filter and a corresponding filter selection method to achieve the lowest correlation coefficient among these filters.This systemhas the advantages of small size and snapshot imaging. We propose a novel inter and intra prior learning unfolding framework proposed to achieve high-quality SWIR hyperspectral image reconstruction, which bridges the gap between prior learning and cross-stage information interaction. We also design an adaptive feature transfer mechanism to adaptively the transfer contextual correlation of multi-scale encoder features to prevent detailed information loss in the decoder. Experiment results demonstrate that our method can reconstruct HSI with high speed and superior performance over existing methods.

* 10 pages,5 figures

Via

Access Paper or Ask Questions

FOTS: A Fast Optical Tactile Simulator for Sim2Real Learning of Tactile-motor Robot Manipulation Skills

May 01, 2024

Yongqiang Zhao, Kun Qian, Boyi Duan, Shan Luo

Abstract:Simulation is a widely used tool in robotics to reduce hardware consumption and gather large-scale data. Despite previous efforts to simulate optical tactile sensors, there remain challenges in efficiently synthesizing images and replicating marker motion under different contact loads. In this work, we propose a fast optical tactile simulator, named FOTS, for simulating optical tactile sensors. We utilize multi-layer perceptron mapping and planar shadow generation to simulate the optical response, while employing marker distribution approximation to simulate the motion of surface markers caused by the elastomer deformation. Experimental results demonstrate that FOTS outperforms other methods in terms of image generation quality and rendering speed, achieving 28.6 fps for optical simulation and 326.1 fps for marker motion simulation on a single CPU without GPU acceleration. In addition, we integrate the FOTS simulation model with physical engines like MuJoCo, and the peg-in-hole task demonstrates the effectiveness of our method in achieving zero-shot Sim2Real learning of tactile-motor robot manipulation skills. Our code is available at https://github.com/Rancho-zhao/FOTS.

Via

Access Paper or Ask Questions

Integrating Physician Diagnostic Logic into Large Language Models: Preference Learning from Process Feedback

Jan 11, 2024

Chengfeng Dou, Zhi Jin, Wenpin Jiao, Haiyan Zhao, Yongqiang Zhao, Zhenwei Tao

Abstract:The use of large language models in medical dialogue generation has garnered significant attention, with a focus on improving response quality and fluency. While previous studies have made progress in optimizing model performance for single-round medical Q&A tasks, there is a need to enhance the model's capability for multi-round conversations to avoid logical inconsistencies. To address this, we propose an approach called preference learning from process feedback~(PLPF), which integrates the doctor's diagnostic logic into LLMs. PLPF involves rule modeling, preference data generation, and preference alignment to train the model to adhere to the diagnostic process. Experimental results using Standardized Patient Testing show that PLPF enhances the diagnostic accuracy of the baseline model in medical conversations by 17.6%, outperforming traditional reinforcement learning from human feedback. Additionally, PLPF demonstrates effectiveness in both multi-round and single-round dialogue tasks, showcasing its potential for improving medical dialogue generation.

Via

Access Paper or Ask Questions

Enhancing the Spatial Awareness Capability of Multi-Modal Large Language Model

Nov 01, 2023

Yongqiang Zhao, Zhenyu Li, Zhi Jin, Feng Zhang, Haiyan Zhao, Chengfeng Dou, Zhengwei Tao, Xinhai Xu, Donghong Liu

Abstract:The Multi-Modal Large Language Model (MLLM) refers to an extension of the Large Language Model (LLM) equipped with the capability to receive and infer multi-modal data. Spatial awareness stands as one of the crucial abilities of MLLM, encompassing diverse skills related to understanding spatial relationships among objects and between objects and the scene area. Industries such as autonomous driving, smart healthcare, robotics, virtual, and augmented reality heavily demand MLLM's spatial awareness capabilities. However, there exists a noticeable gap between the current spatial awareness capabilities of MLLM and the requirements set by human needs. To address this issue, this paper proposes using more precise spatial position information between objects to guide MLLM in providing more accurate responses to user-related inquiries. Specifically, for a particular multi-modal task, we utilize algorithms for acquiring geometric spatial information and scene graphs to obtain relevant geometric spatial information and scene details of objects involved in the query. Subsequently, based on this information, we direct MLLM to address spatial awareness-related queries posed by the user. Extensive experiments were conducted in benchmarks such as MME, MM-Vet, and other multi-modal large language models. The experimental results thoroughly confirm the efficacy of the proposed method in enhancing the spatial awareness tasks and associated tasks of MLLM.

Via

Access Paper or Ask Questions

Multi-granularity Backprojection Transformer for Remote Sensing Image Super-Resolution

Oct 19, 2023

Jinglei Hao, Wukai Li, Binglu Wang, Shunzhou Wang, Yuting Lu, Ning Li, Yongqiang Zhao

Abstract:Backprojection networks have achieved promising super-resolution performance for nature images but not well be explored in the remote sensing image super-resolution (RSISR) field due to the high computation costs. In this paper, we propose a Multi-granularity Backprojection Transformer termed MBT for RSISR. MBT incorporates the backprojection learning strategy into a Transformer framework. It consists of Scale-aware Backprojection-based Transformer Layers (SPTLs) for scale-aware low-resolution feature learning and Context-aware Backprojection-based Transformer Blocks (CPTBs) for hierarchical feature learning. A backprojection-based reconstruction module (PRM) is also introduced to enhance the hierarchical features for image reconstruction. MBT stands out by efficiently learning low-resolution features without excessive modules for high-resolution processing, resulting in lower computational resources. Experiment results on UCMerced and AID datasets demonstrate that MBT obtains state-of-the-art results compared to other leading methods.

Via

Access Paper or Ask Questions

Enhancing Subtask Performance of Multi-modal Large Language Model

Aug 31, 2023

Yongqiang Zhao, Zhenyu Li, Feng Zhang, Xinhai Xu, Donghong Liu

Figure 1 for Enhancing Subtask Performance of Multi-modal Large Language Model

Figure 2 for Enhancing Subtask Performance of Multi-modal Large Language Model

Figure 3 for Enhancing Subtask Performance of Multi-modal Large Language Model

Figure 4 for Enhancing Subtask Performance of Multi-modal Large Language Model

Abstract:Multi-modal Large Language Model (MLLM) refers to a model expanded from a Large Language Model (LLM) that possesses the capability to handle and infer multi-modal data. Current MLLMs typically begin by using LLMs to decompose tasks into multiple subtasks, then employing individual pre-trained models to complete specific subtasks, and ultimately utilizing LLMs to integrate the results of each subtasks to obtain the results of the task. In real-world scenarios, when dealing with large projects, it is common practice to break down the project into smaller sub-projects, with different teams providing corresponding solutions or results. The project owner then decides which solution or result to use, ensuring the best possible outcome for each subtask and, consequently, for the entire project. Inspired by this, this study considers selecting multiple pre-trained models to complete the same subtask. By combining the results from multiple pre-trained models, the optimal subtask result is obtained, enhancing the performance of the MLLM. Specifically, this study first selects multiple pre-trained models focused on the same subtask based on distinct evaluation approaches, and then invokes these models in parallel to process input data and generate corresponding subtask results. Finally, the results from multiple pre-trained models for the same subtask are compared using the LLM, and the best result is chosen as the outcome for that subtask. Extensive experiments are conducted in this study using GPT-4 annotated datasets and human-annotated datasets. The results of various evaluation metrics adequately demonstrate the effectiveness of the proposed approach in this paper.

Via

Access Paper or Ask Questions