Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Miaojing Shi

Kings College London

Weakly-Supervised Referring Video Object Segmentation through Text Supervision

Apr 21, 2026

Miaojing Shi, Jun Huang, Zijie Yue, Hanli Wang

Abstract:Referring video object segmentation (RVOS) aims to segment the target instance in a video, referred by a text expression. Conventional approaches are mostly supervised learning, requiring expensive pixel-level mask annotations. To tackle it, weakly-supervised RVOS has recently been proposed to replace mask annotations with bounding boxes or points, which are however still costly and labor-intensive. In this paper, we design a novel weakly-supervised RVOS method, namely WSRVOS, to train the model with only text expressions. Given an input video and the referring expression, we first design a contrastive referring expression augmentation scheme that leverages the captioning capabilities of a multimodal large language model to generate both positive and negative expressions. We extract visual and linguistic features from the input video and generated expressions, then perform bi-directional vision-language feature selection and interaction to enable fine-grained multimodal alignment. Next, we propose an instance-aware expression classification scheme to optimize the model in distinguishing positive from negative expressions. Also, we introduce a positive-prediction fusion strategy to generate high-quality pseudo-masks, which serve as additional supervision to the model. Last, we design a temporal segment ranking constraint such that the overlaps between mask predictions of temporally neighboring frames are required to conform to specific orders. Extensive experiments on four publicly available RVOS datasets, including A2D Sentences, J-HMDB Sentences, Ref-YouTube-VOS, and Ref-DAVIS17, demonstrate the superiority of our method. Code is available at https://github.com/viscom-tongji/WSRVOS.

* Accepted by CVPR 2026 Findings

Via

Access Paper or Ask Questions

FAAR: Efficient Frequency-Aware Multi-Task Fine-Tuning via Automatic Rank Selection

Mar 20, 2026

Maxime Fontana, Michael Spratling, Miaojing Shi

Abstract:Adapting models pre-trained on large-scale datasets is a proven way to reach strong performance quickly for down-stream tasks. However, the growth of state-of-the-art mod-els makes traditional full fine-tuning unsuitable and difficult, especially for multi-task learning (MTL) where cost scales with the number of tasks. As a result, recent studies investigate parameter-efficient fine-tuning (PEFT) using low-rank adaptation to significantly reduce the number of trainable parameters. However, these existing methods use a single, fixed rank, which may not be optimal for differ-ent tasks or positions in the MTL architecture. Moreover, these methods fail to learn spatial information that cap-tures inter-task relationships and helps to improve diverse task predictions. This paper introduces Frequency-Aware and Automatic Rank (FAAR) for efficient MTL fine-tuning. Our method introduces Performance-Driven Rank Shrink-ing (PDRS) to allocate the optimal rank per adapter location and per task. Moreover, by analyzing the image frequency spectrum, FAAR proposes a Task-Spectral Pyramidal Decoder (TS-PD) that injects input-specific context into spatial bias learning to better reflect cross-task relationships. Experiments performed on dense visual task benchmarks show the superiority of our method in terms of both accuracy and efficiency compared to other PEFT methods in MTL. FAAR reduces the number of parameters by up to 9 times compared to traditional MTL fine-tuning whilst improving overall performance. Our code is available.

* CVPR 2026

Via

Access Paper or Ask Questions

Bootstrapping MLLM for Weakly-Supervised Class-Agnostic Object Counting

Feb 13, 2026

Xiaowen Zhang, Zijie Yue, Yong Luo, Cairong Zhao, Qijun Chen, Miaojing Shi

Abstract:Object counting is a fundamental task in computer vision, with broad applicability in many real-world scenarios. Fully-supervised counting methods require costly point-level annotations per object. Few weakly-supervised methods leverage only image-level object counts as supervision and achieve fairly promising results. They are, however, often limited to counting a single category, e.g. person. In this paper, we propose WS-COC, the first MLLM-driven weakly-supervised framework for class-agnostic object counting. Instead of directly fine-tuning MLLMs to predict object counts, which can be challenging due to the modality gap, we incorporate three simple yet effective strategies to bootstrap the counting paradigm in both training and testing: First, a divide-and-discern dialogue tuning strategy is proposed to guide the MLLM to determine whether the object count falls within a specific range and progressively break down the range through multi-round dialogue. Second, a compare-and-rank count optimization strategy is introduced to train the MLLM to optimize the relative ranking of multiple images according to their object counts. Third, a global-and-local counting enhancement strategy aggregates and fuses local and global count predictions to improve counting performance in dense scenes. Extensive experiments on FSC-147, CARPK, PUCPR+, and ShanghaiTech show that WS-COC matches or even surpasses many state-of-art fully-supervised methods while significantly reducing annotation costs. Code is available at https://github.com/viscom-tongji/WS-COC.

* Accepted at ICLR 2026

Via

Access Paper or Ask Questions

Text-promptable Object Counting via Quantity Awareness Enhancement

Jul 09, 2025

Miaojing Shi, Xiaowen Zhang, Zijie Yue, Yong Luo, Cairong Zhao, Li Li

Figure 1 for Text-promptable Object Counting via Quantity Awareness Enhancement

Figure 2 for Text-promptable Object Counting via Quantity Awareness Enhancement

Figure 3 for Text-promptable Object Counting via Quantity Awareness Enhancement

Figure 4 for Text-promptable Object Counting via Quantity Awareness Enhancement

Abstract:Recent advances in large vision-language models (VLMs) have shown remarkable progress in solving the text-promptable object counting problem. Representative methods typically specify text prompts with object category information in images. This however is insufficient for training the model to accurately distinguish the number of objects in the counting task. To this end, we propose QUANet, which introduces novel quantity-oriented text prompts with a vision-text quantity alignment loss to enhance the model's quantity awareness. Moreover, we propose a dual-stream adaptive counting decoder consisting of a Transformer stream, a CNN stream, and a number of Transformer-to-CNN enhancement adapters (T2C-adapters) for density map prediction. The T2C-adapters facilitate the effective knowledge communication and aggregation between the Transformer and CNN streams. A cross-stream quantity ranking loss is proposed in the end to optimize the ranking orders of predictions from the two streams. Extensive experiments on standard benchmarks such as FSC-147, CARPK, PUCPR+, and ShanghaiTech demonstrate our model's strong generalizability for zero-shot class-agnostic counting. Code is available at https://github.com/viscom-tongji/QUANet

* 13 pages, 5 figures

Via

Access Paper or Ask Questions

TRAIL: Transferable Robust Adversarial Images via Latent diffusion

May 22, 2025

Yuhao Xue, Zhifei Zhang, Xinyang Jiang, Yifei Shen, Junyao Gao, Wentao Gu, Jiale Zhao, Miaojing Shi, Cairong Zhao

Figure 1 for TRAIL: Transferable Robust Adversarial Images via Latent diffusion

Figure 2 for TRAIL: Transferable Robust Adversarial Images via Latent diffusion

Figure 3 for TRAIL: Transferable Robust Adversarial Images via Latent diffusion

Figure 4 for TRAIL: Transferable Robust Adversarial Images via Latent diffusion

Abstract:Adversarial attacks exploiting unrestricted natural perturbations present severe security risks to deep learning systems, yet their transferability across models remains limited due to distribution mismatches between generated adversarial features and real-world data. While recent works utilize pre-trained diffusion models as adversarial priors, they still encounter challenges due to the distribution shift between the distribution of ideal adversarial samples and the natural image distribution learned by the diffusion model. To address the challenge, we propose Transferable Robust Adversarial Images via Latent Diffusion (TRAIL), a test-time adaptation framework that enables the model to generate images from a distribution of images with adversarial features and closely resembles the target images. To mitigate the distribution shift, during attacks, TRAIL updates the diffusion U-Net's weights by combining adversarial objectives (to mislead victim models) and perceptual constraints (to preserve image realism). The adapted model then generates adversarial samples through iterative noise injection and denoising guided by these objectives. Experiments demonstrate that TRAIL significantly outperforms state-of-the-art methods in cross-model attack transferability, validating that distribution-aligned adversarial feature synthesis is critical for practical black-box attacks.

Via

Access Paper or Ask Questions

MPDrive: Improving Spatial Understanding with Marker-Based Prompt Learning for Autonomous Driving

Apr 01, 2025

Zhiyuan Zhang, Xiaofan Li, Zhihao Xu, Wenjie Peng, Zijian Zhou, Miaojing Shi, Shuangping Huang

Figure 1 for MPDrive: Improving Spatial Understanding with Marker-Based Prompt Learning for Autonomous Driving

Figure 2 for MPDrive: Improving Spatial Understanding with Marker-Based Prompt Learning for Autonomous Driving

Figure 3 for MPDrive: Improving Spatial Understanding with Marker-Based Prompt Learning for Autonomous Driving

Figure 4 for MPDrive: Improving Spatial Understanding with Marker-Based Prompt Learning for Autonomous Driving

Abstract:Autonomous driving visual question answering (AD-VQA) aims to answer questions related to perception, prediction, and planning based on given driving scene images, heavily relying on the model's spatial understanding capabilities. Prior works typically express spatial information through textual representations of coordinates, resulting in semantic gaps between visual coordinate representations and textual descriptions. This oversight hinders the accurate transmission of spatial information and increases the expressive burden. To address this, we propose a novel Marker-based Prompt learning framework (MPDrive), which represents spatial coordinates by concise visual markers, ensuring linguistic expressive consistency and enhancing the accuracy of both visual perception and spatial expression in AD-VQA. Specifically, we create marker images by employing a detection expert to overlay object regions with numerical labels, converting complex textual coordinate generation into straightforward text-based visual marker predictions. Moreover, we fuse original and marker images as scene-level features and integrate them with detection priors to derive instance-level features. By combining these features, we construct dual-granularity visual prompts that stimulate the LLM's spatial perception capabilities. Extensive experiments on the DriveLM and CODA-LM datasets show that MPDrive achieves state-of-the-art performance, particularly in cases requiring sophisticated spatial understanding.

* Accepted by CVPR 2025

Via

Access Paper or Ask Questions

Enhancing Generalized Few-Shot Semantic Segmentation via Effective Knowledge Transfer

Dec 20, 2024

Xinyue Chen, Miaojing Shi, Zijian Zhou, Lianghua He, Sophia Tsoka

Figure 1 for Enhancing Generalized Few-Shot Semantic Segmentation via Effective Knowledge Transfer

Figure 2 for Enhancing Generalized Few-Shot Semantic Segmentation via Effective Knowledge Transfer

Figure 3 for Enhancing Generalized Few-Shot Semantic Segmentation via Effective Knowledge Transfer

Figure 4 for Enhancing Generalized Few-Shot Semantic Segmentation via Effective Knowledge Transfer

Abstract:Generalized few-shot semantic segmentation (GFSS) aims to segment objects of both base and novel classes, using sufficient samples of base classes and few samples of novel classes. Representative GFSS approaches typically employ a two-phase training scheme, involving base class pre-training followed by novel class fine-tuning, to learn the classifiers for base and novel classes respectively. Nevertheless, distribution gap exists between base and novel classes in this process. To narrow this gap, we exploit effective knowledge transfer from base to novel classes. First, a novel prototype modulation module is designed to modulate novel class prototypes by exploiting the correlations between base and novel classes. Second, a novel classifier calibration module is proposed to calibrate the weight distribution of the novel classifier according to that of the base classifier. Furthermore, existing GFSS approaches suffer from a lack of contextual information for novel classes due to their limited samples, we thereby introduce a context consistency learning scheme to transfer the contextual knowledge from base to novel classes. Extensive experiments on PASCAL-5$^i$ and COCO-20$^i$ demonstrate that our approach significantly enhances the state of the art in the GFSS setting. The code is available at: https://github.com/HHHHedy/GFSS-EKT.

* Accepted to AAAI 2025

Via

Access Paper or Ask Questions

SEG-SAM: Semantic-Guided SAM for Unified Medical Image Segmentation

Dec 17, 2024

Shuangping Huang, Hao Liang, Qingfeng Wang, Chulong Zhong, Zijian Zhou, Miaojing Shi

Figure 1 for SEG-SAM: Semantic-Guided SAM for Unified Medical Image Segmentation

Figure 2 for SEG-SAM: Semantic-Guided SAM for Unified Medical Image Segmentation

Figure 3 for SEG-SAM: Semantic-Guided SAM for Unified Medical Image Segmentation

Figure 4 for SEG-SAM: Semantic-Guided SAM for Unified Medical Image Segmentation

Abstract:Recently, developing unified medical image segmentation models gains increasing attention, especially with the advent of the Segment Anything Model (SAM). SAM has shown promising binary segmentation performance in natural domains, however, transferring it to the medical domain remains challenging, as medical images often possess substantial inter-category overlaps. To address this, we propose the SEmantic-Guided SAM (SEG-SAM), a unified medical segmentation model that incorporates semantic medical knowledge to enhance medical segmentation performance. First, to avoid the potential conflict between binary and semantic predictions, we introduce a semantic-aware decoder independent of SAM's original decoder, specialized for both semantic segmentation on the prompted object and classification on unprompted objects in images. To further enhance the model's semantic understanding, we solicit key characteristics of medical categories from large language models and incorporate them into SEG-SAM through a text-to-vision semantic module, adaptively transferring the language information into the visual segmentation task. In the end, we introduce the cross-mask spatial alignment strategy to encourage greater overlap between the predicted masks from SEG-SAM's two decoders, thereby benefiting both predictions. Extensive experiments demonstrate that SEG-SAM outperforms state-of-the-art SAM-based methods in unified binary medical segmentation and task-specific methods in semantic medical segmentation, showcasing promising results and potential for broader medical applications.

* 12 pages, 3 figures

Via

Access Paper or Ask Questions

Learning Flow Fields in Attention for Controllable Person Image Generation

Dec 12, 2024

Zijian Zhou, Shikun Liu, Xiao Han, Haozhe Liu, Kam Woh Ng, Tian Xie, Yuren Cong, Hang Li, Mengmeng Xu, Juan-Manuel Pérez-Rúa(+4 more)

Figure 1 for Learning Flow Fields in Attention for Controllable Person Image Generation

Figure 2 for Learning Flow Fields in Attention for Controllable Person Image Generation

Figure 3 for Learning Flow Fields in Attention for Controllable Person Image Generation

Figure 4 for Learning Flow Fields in Attention for Controllable Person Image Generation

Abstract:Controllable person image generation aims to generate a person image conditioned on reference images, allowing precise control over the person's appearance or pose. However, prior methods often distort fine-grained textural details from the reference image, despite achieving high overall image quality. We attribute these distortions to inadequate attention to corresponding regions in the reference image. To address this, we thereby propose learning flow fields in attention (Leffa), which explicitly guides the target query to attend to the correct reference key in the attention layer during training. Specifically, it is realized via a regularization loss on top of the attention map within a diffusion-based baseline. Our extensive experiments show that Leffa achieves state-of-the-art performance in controlling appearance (virtual try-on) and pose (pose transfer), significantly reducing fine-grained detail distortion while maintaining high image quality. Additionally, we show that our loss is model-agnostic and can be used to improve the performance of other diffusion models.

* github: https://github.com/franciszzj/Leffa, demo: https://huggingface.co/spaces/franciszzj/Leffa, model: https://huggingface.co/franciszzj/Leffa

Via

Access Paper or Ask Questions

Optimizing Dense Visual Predictions Through Multi-Task Coherence and Prioritization

Dec 04, 2024

Maxime Fontana, Michael Spratling, Miaojing Shi

Figure 1 for Optimizing Dense Visual Predictions Through Multi-Task Coherence and Prioritization

Figure 2 for Optimizing Dense Visual Predictions Through Multi-Task Coherence and Prioritization

Figure 3 for Optimizing Dense Visual Predictions Through Multi-Task Coherence and Prioritization

Figure 4 for Optimizing Dense Visual Predictions Through Multi-Task Coherence and Prioritization

Abstract:Multi-Task Learning (MTL) involves the concurrent training of multiple tasks, offering notable advantages for dense prediction tasks in computer vision. MTL not only reduces training and inference time as opposed to having multiple single-task models, but also enhances task accuracy through the interaction of multiple tasks. However, existing methods face limitations. They often rely on suboptimal cross-task interactions, resulting in task-specific predictions with poor geometric and predictive coherence. In addition, many approaches use inadequate loss weighting strategies, which do not address the inherent variability in task evolution during training. To overcome these challenges, we propose an advanced MTL model specifically designed for dense vision tasks. Our model leverages state-of-the-art vision transformers with task-specific decoders. To enhance cross-task coherence, we introduce a trace-back method that improves both cross-task geometric and predictive features. Furthermore, we present a novel dynamic task balancing approach that projects task losses onto a common scale and prioritizes more challenging tasks during training. Extensive experiments demonstrate the superiority of our method, establishing new state-of-the-art performance across two benchmark datasets. The code is available at:https://github.com/Klodivio355/MT-CP

* Accepted by WACV 2025

Via

Access Paper or Ask Questions