Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Haojie Zhang

Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs

Oct 02, 2025

Yongyi Su, Haojie Zhang, Shijie Li, Nanqing Liu, Jingyi Liao, Junyi Pan, Yuan Liu, Xiaofen Xing, Chong Sun, Chen Li(+4 more)

Figure 1 for Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs

Figure 2 for Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs

Figure 3 for Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs

Figure 4 for Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs

Abstract:Multimodal large language models (MLLMs) have advanced rapidly in recent years. However, existing approaches for vision tasks often rely on indirect representations, such as generating coordinates as text for detection, which limits performance and prevents dense prediction tasks like segmentation. To overcome these challenges, we introduce Patch-as-Decodable Token (PaDT), a unified paradigm that enables MLLMs to directly generate both textual and diverse visual outputs. Central to PaDT are Visual Reference Tokens (VRTs), derived from visual patch embeddings of query images and interleaved seamlessly with LLM's output textual tokens. A lightweight decoder then transforms LLM's outputs into detection, segmentation, and grounding predictions. Unlike prior methods, PaDT processes VRTs independently at each forward pass and dynamically expands the embedding table, thus improving localization and differentiation among similar objects. We further tailor a training strategy for PaDT by randomly selecting VRTs for supervised fine-tuning and introducing a robust per-token cross-entropy loss. Our empirical studies across four visual perception and understanding tasks suggest PaDT consistently achieving state-of-the-art performance, even compared with significantly larger MLLM models. The code is available at https://github.com/Gorilla-Lab-SCUT/PaDT.

* 24 pages, 12 figures and 9 tables

Via

Access Paper or Ask Questions

DropLoRA: Sparse Low-Rank Adaptation for Parameter-Efficient Fine-Tuning

Aug 24, 2025

Haojie Zhang

Abstract:LoRA-based large model parameter-efficient fine-tuning (PEFT) methods use low-rank de- composition to approximate updates to model parameters. However, compared to full- parameter fine-tuning, low-rank updates often lead to a performance gap in downstream tasks. To address this, we introduce DropLoRA, a novel pruning-based approach that focuses on pruning the rank dimension. Unlike conven- tional methods that attempt to overcome the low-rank bottleneck, DropLoRA innovatively integrates a pruning module between the two low-rank matrices in LoRA to simulate dy- namic subspace learning. This dynamic low- rank subspace learning allows DropLoRA to overcome the limitations of traditional LoRA, which operates within a static subspace. By continuously adapting the learning subspace, DropLoRA significantly boosts performance without incurring additional training or infer- ence costs. Our experimental results demon- strate that DropLoRA consistently outperforms LoRA in fine-tuning the LLaMA series across a wide range of large language model gener- ation tasks, including commonsense reason- ing, mathematical reasoning, code generation, and instruction-following. Our code is avail- able at https://github.com/TayeeChang/DropLoRA.

* 8 pages

Via

Access Paper or Ask Questions

MAD-UV: The 1st INTERSPEECH Mice Autism Detection via Ultrasound Vocalization Challenge

Jan 08, 2025

Zijiang Yang, Meishu Song, Xin Jing, Haojie Zhang, Kun Qian, Bin Hu, Kota Tamada, Toru Takumi, Björn W. Schuller, Yoshiharu Yamamoto

Figure 1 for MAD-UV: The 1st INTERSPEECH Mice Autism Detection via Ultrasound Vocalization Challenge

Figure 2 for MAD-UV: The 1st INTERSPEECH Mice Autism Detection via Ultrasound Vocalization Challenge

Figure 3 for MAD-UV: The 1st INTERSPEECH Mice Autism Detection via Ultrasound Vocalization Challenge

Abstract:The Mice Autism Detection via Ultrasound Vocalization (MAD-UV) Challenge introduces the first INTERSPEECH challenge focused on detecting autism spectrum disorder (ASD) in mice through their vocalizations. Participants are tasked with developing models to automatically classify mice as either wild-type or ASD models based on recordings with a high sampling rate. Our baseline system employs a simple CNN-based classification using three different spectrogram features. Results demonstrate the feasibility of automated ASD detection, with the considered audible-range features achieving the best performance (UAR of 0.600 for segment-level and 0.625 for subject-level classification). This challenge bridges speech technology and biomedical research, offering opportunities to advance our understanding of ASD models through machine learning approaches. The findings suggest promising directions for vocalization analysis and highlight the potential value of audible and ultrasound vocalizations in ASD detection.

* 5 pages, 1 figure and 2 tables. For MAD-UV Challenge 2025

Via

Access Paper or Ask Questions

LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis

Nov 24, 2024

Haojie Zhang, Zhihao Liang, Ruibo Fu, Zhengqi Wen, Xuefei Liu, Chenxing Li, Jianhua Tao, Yaling Liang

Figure 1 for LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis

Figure 2 for LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis

Figure 3 for LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis

Figure 4 for LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis

Abstract:Portrait image animation using audio has rapidly advanced, enabling the creation of increasingly realistic and expressive animated faces. The challenges of this multimodality-guided video generation task involve fusing various modalities while ensuring consistency in timing and portrait. We further seek to produce vivid talking heads. To address these challenges, we present LetsTalk (LatEnt Diffusion TranSformer for Talking Video Synthesis), a diffusion transformer that incorporates modular temporal and spatial attention mechanisms to merge multimodality and enhance spatial-temporal consistency. To handle multimodal conditions, we first summarize three fusion schemes, ranging from shallow to deep fusion compactness, and thoroughly explore their impact and applicability. Then we propose a suitable solution according to the modality differences of image, audio, and video generation. For portrait, we utilize a deep fusion scheme (Symbiotic Fusion) to ensure portrait consistency. For audio, we implement a shallow fusion scheme (Direct Fusion) to achieve audio-animation alignment while preserving diversity. Our extensive experiments demonstrate that our approach generates temporally coherent and realistic videos with enhanced diversity and liveliness.

* 17 pages, 14 figures

Via

Access Paper or Ask Questions

GenCRF: Generative Clustering and Reformulation Framework for Enhanced Intent-Driven Information Retrieval

Sep 17, 2024

Wonduk Seo, Haojie Zhang, Yueyang Zhang, Changhao Zhang, Songyao Duan, Lixin Su, Daiting Shi, Jiashu Zhao, Dawei Yin

Figure 1 for GenCRF: Generative Clustering and Reformulation Framework for Enhanced Intent-Driven Information Retrieval

Figure 2 for GenCRF: Generative Clustering and Reformulation Framework for Enhanced Intent-Driven Information Retrieval

Figure 3 for GenCRF: Generative Clustering and Reformulation Framework for Enhanced Intent-Driven Information Retrieval

Figure 4 for GenCRF: Generative Clustering and Reformulation Framework for Enhanced Intent-Driven Information Retrieval

Abstract:Query reformulation is a well-known problem in Information Retrieval (IR) aimed at enhancing single search successful completion rate by automatically modifying user's input query. Recent methods leverage Large Language Models (LLMs) to improve query reformulation, but often generate limited and redundant expansions, potentially constraining their effectiveness in capturing diverse intents. In this paper, we propose GenCRF: a Generative Clustering and Reformulation Framework to capture diverse intentions adaptively based on multiple differentiated, well-generated queries in the retrieval phase for the first time. GenCRF leverages LLMs to generate variable queries from the initial query using customized prompts, then clusters them into groups to distinctly represent diverse intents. Furthermore, the framework explores to combine diverse intents query with innovative weighted aggregation strategies to optimize retrieval performance and crucially integrates a novel Query Evaluation Rewarding Model (QERM) to refine the process through feedback loops. Empirical experiments on the BEIR benchmark demonstrate that GenCRF achieves state-of-the-art performance, surpassing previous query reformulation SOTAs by up to 12% on nDCG@10. These techniques can be adapted to various LLMs, significantly boosting retriever performance and advancing the field of Information Retrieval.

Via

Access Paper or Ask Questions

A Unified Label-Aware Contrastive Learning Framework for Few-Shot Named Entity Recognition

Apr 26, 2024

Haojie Zhang, Yimeng Zhuang

Figure 1 for A Unified Label-Aware Contrastive Learning Framework for Few-Shot Named Entity Recognition

Figure 2 for A Unified Label-Aware Contrastive Learning Framework for Few-Shot Named Entity Recognition

Figure 3 for A Unified Label-Aware Contrastive Learning Framework for Few-Shot Named Entity Recognition

Figure 4 for A Unified Label-Aware Contrastive Learning Framework for Few-Shot Named Entity Recognition

Abstract:Few-shot Named Entity Recognition (NER) aims to extract named entities using only a limited number of labeled examples. Existing contrastive learning methods often suffer from insufficient distinguishability in context vector representation because they either solely rely on label semantics or completely disregard them. To tackle this issue, we propose a unified label-aware token-level contrastive learning framework. Our approach enriches the context by utilizing label semantics as suffix prompts. Additionally, it simultaneously optimizes context-context and context-label contrastive learning objectives to enhance generalized discriminative contextual representations.Extensive experiments on various traditional test domains (OntoNotes, CoNLL'03, WNUT'17, GUM, I2B2) and the large-scale few-shot NER dataset (FEWNERD) demonstrate the effectiveness of our approach. It outperforms prior state-of-the-art models by a significant margin, achieving an average absolute gain of 7% in micro F1 scores across most scenarios. Further analysis reveals that our model benefits from its powerful transfer capability and improved contextual representations.

Via

Access Paper or Ask Questions

Samsung Research China-Beijing at SemEval-2024 Task 3: A multi-stage framework for Emotion-Cause Pair Extraction in Conversations

Apr 25, 2024

Shen Zhang, Haojie Zhang, Jing Zhang, Xudong Zhang, Yimeng Zhuang, Jinting Wu

Figure 1 for Samsung Research China-Beijing at SemEval-2024 Task 3: A multi-stage framework for Emotion-Cause Pair Extraction in Conversations

Figure 2 for Samsung Research China-Beijing at SemEval-2024 Task 3: A multi-stage framework for Emotion-Cause Pair Extraction in Conversations

Figure 3 for Samsung Research China-Beijing at SemEval-2024 Task 3: A multi-stage framework for Emotion-Cause Pair Extraction in Conversations

Figure 4 for Samsung Research China-Beijing at SemEval-2024 Task 3: A multi-stage framework for Emotion-Cause Pair Extraction in Conversations

Abstract:In human-computer interaction, it is crucial for agents to respond to human by understanding their emotions. Unraveling the causes of emotions is more challenging. A new task named Multimodal Emotion-Cause Pair Extraction in Conversations is responsible for recognizing emotion and identifying causal expressions. In this study, we propose a multi-stage framework to generate emotion and extract the emotion causal pairs given the target emotion. In the first stage, Llama-2-based InstructERC is utilized to extract the emotion category of each utterance in a conversation. After emotion recognition, a two-stream attention model is employed to extract the emotion causal pairs given the target emotion for subtask 2 while MuTEC is employed to extract causal span for subtask 1. Our approach achieved first place for both of the two subtasks in the competition.

Via

Access Paper or Ask Questions

Sur2f: A Hybrid Representation for High-Quality and Efficient Surface Reconstruction from Multi-view Images

Jan 08, 2024

Zhangjin Huang, Zhihao Liang, Haojie Zhang, Yangkai Lin, Kui Jia

Figure 1 for Sur2f: A Hybrid Representation for High-Quality and Efficient Surface Reconstruction from Multi-view Images

Figure 2 for Sur2f: A Hybrid Representation for High-Quality and Efficient Surface Reconstruction from Multi-view Images

Figure 3 for Sur2f: A Hybrid Representation for High-Quality and Efficient Surface Reconstruction from Multi-view Images

Figure 4 for Sur2f: A Hybrid Representation for High-Quality and Efficient Surface Reconstruction from Multi-view Images

Abstract:Multi-view surface reconstruction is an ill-posed, inverse problem in 3D vision research. It involves modeling the geometry and appearance with appropriate surface representations. Most of the existing methods rely either on explicit meshes, using surface rendering of meshes for reconstruction, or on implicit field functions, using volume rendering of the fields for reconstruction. The two types of representations in fact have their respective merits. In this work, we propose a new hybrid representation, termed Sur2f, aiming to better benefit from both representations in a complementary manner. Technically, we learn two parallel streams of an implicit signed distance field and an explicit surrogate surface Sur2f mesh, and unify volume rendering of the implicit signed distance function (SDF) and surface rendering of the surrogate mesh with a shared, neural shader; the unified shading promotes their convergence to the same, underlying surface. We synchronize learning of the surrogate mesh by driving its deformation with functions induced from the implicit SDF. In addition, the synchronized surrogate mesh enables surface-guided volume sampling, which greatly improves the sampling efficiency per ray in volume rendering. We conduct thorough experiments showing that Sur$^2$f outperforms existing reconstruction methods and surface representations, including hybrid ones, in terms of both recovery quality and recovery efficiency.

* 18 pages, 16 figures

Via

Access Paper or Ask Questions

Improving the Generalization of Segmentation Foundation Model under Distribution Shift via Weakly Supervised Adaptation

Dec 06, 2023

Haojie Zhang, Yongyi Su, Xun Xu, Kui Jia

Figure 1 for Improving the Generalization of Segmentation Foundation Model under Distribution Shift via Weakly Supervised Adaptation

Figure 2 for Improving the Generalization of Segmentation Foundation Model under Distribution Shift via Weakly Supervised Adaptation

Figure 3 for Improving the Generalization of Segmentation Foundation Model under Distribution Shift via Weakly Supervised Adaptation

Figure 4 for Improving the Generalization of Segmentation Foundation Model under Distribution Shift via Weakly Supervised Adaptation

Abstract:The success of large language models has inspired the computer vision community to explore image segmentation foundation model that is able to zero/few-shot generalize through prompt engineering. Segment-Anything(SAM), among others, is the state-of-the-art image segmentation foundation model demonstrating strong zero/few-shot generalization. Despite the success, recent studies reveal the weakness of SAM under strong distribution shift. In particular, SAM performs awkwardly on corrupted natural images, camouflaged images, medical images, etc. Motivated by the observations, we aim to develop a self-training based strategy to adapt SAM to target distribution. Given the unique challenges of large source dataset, high computation cost and incorrect pseudo label, we propose a weakly supervised self-training architecture with anchor regularization and low-rank finetuning to improve the robustness and computation efficiency of adaptation. We validate the effectiveness on 5 types of downstream segmentation tasks including natural clean/corrupted images, medical images, camouflaged images and robotic images. Our proposed method is task-agnostic in nature and outperforms pre-trained SAM and state-of-the-art domain adaptation methods on almost all downstream tasks with the same testing prompt inputs.

* 20 pages, 12 figures

Via

Access Paper or Ask Questions

Fine-Tuning Pre-Trained Language Models Effectively by Optimizing Subnetworks Adaptively

Nov 03, 2022

Haojie Zhang, Ge Li, Jia Li, Zhongjin Zhang, Yuqi Zhu, Zhi Jin

Abstract:Large-scale pre-trained language models have achieved impressive results on a wide range of downstream tasks recently. However, fine-tuning an extremely large-scale pre-trained language model on limited target datasets is often plagued by overfitting and representation degradation. In this paper, we propose a Dynamic Parameter Selection (DPS) algorithm for the large-scale pre-trained models during fine-tuning, which adaptively selects a more promising subnetwork to perform staging updates based on gradients of back-propagation. Experiments on the GLUE benchmark show that DPS outperforms previous fine-tuning methods in terms of overall performance and stability, and consistently achieves better results with variable pre-trained language models. In addition, DPS brings a large magnitude of improvement in out-of-domain transferring experiments and low-resource scenarios, which shows that it can maintain stable general contextual features and reduce the representation collapse. We release our code at https://github.com/ZhangHaojie077/DPS

* NeurIPS 2022

Via

Access Paper or Ask Questions