Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jian Zhang

Task Consistent Prototype Learning for Incremental Few-shot Semantic Segmentation

Oct 16, 2024

Wenbo Xu, Yanan Wu, Haoran Jiang, Yang Wang, Qiang Wu, Jian Zhang

Figure 1 for Task Consistent Prototype Learning for Incremental Few-shot Semantic Segmentation

Figure 2 for Task Consistent Prototype Learning for Incremental Few-shot Semantic Segmentation

Figure 3 for Task Consistent Prototype Learning for Incremental Few-shot Semantic Segmentation

Figure 4 for Task Consistent Prototype Learning for Incremental Few-shot Semantic Segmentation

Abstract:Incremental Few-Shot Semantic Segmentation (iFSS) tackles a task that requires a model to continually expand its segmentation capability on novel classes using only a few annotated examples. Typical incremental approaches encounter a challenge that the objective of the base training phase (fitting base classes with sufficient instances) does not align with the incremental learning phase (rapidly adapting to new classes with less forgetting). This disconnect can result in suboptimal performance in the incremental setting. This study introduces a meta-learning-based prototype approach that encourages the model to learn how to adapt quickly while preserving previous knowledge. Concretely, we mimic the incremental evaluation protocol during the base training session by sampling a sequence of pseudo-incremental tasks. Each task in the simulated sequence is trained using a meta-objective to enable rapid adaptation without forgetting. To enhance discrimination among class prototypes, we introduce prototype space redistribution learning, which dynamically updates class prototypes to establish optimal inter-prototype boundaries within the prototype space. Extensive experiments on iFSS datasets built upon PASCAL and COCO benchmarks show the advanced performance of the proposed approach, offering valuable insights for addressing iFSS challenges.

* conference

Via

Access Paper or Ask Questions

Dual-Teacher Ensemble Models with Double-Copy-Paste for 3D Semi-Supervised Medical Image Segmentation

Oct 15, 2024

Zhan Fa, Shumeng Li, Jian Zhang, Lei Qi, Qian Yu, Yinghuan Shi

Figure 1 for Dual-Teacher Ensemble Models with Double-Copy-Paste for 3D Semi-Supervised Medical Image Segmentation

Figure 2 for Dual-Teacher Ensemble Models with Double-Copy-Paste for 3D Semi-Supervised Medical Image Segmentation

Figure 3 for Dual-Teacher Ensemble Models with Double-Copy-Paste for 3D Semi-Supervised Medical Image Segmentation

Figure 4 for Dual-Teacher Ensemble Models with Double-Copy-Paste for 3D Semi-Supervised Medical Image Segmentation

Abstract:Semi-supervised learning (SSL) techniques address the high labeling costs in 3D medical image segmentation, with the teacher-student model being a common approach. However, using an exponential moving average (EMA) in single-teacher models may cause coupling issues, where the weights of the student and teacher models become similar, limiting the teacher's ability to provide additional knowledge for the student. Dual-teacher models were introduced to address this problem but often neglected the importance of maintaining teacher model diversity, leading to coupling issues among teachers. To address the coupling issue, we incorporate a double-copy-paste (DCP) technique to enhance the diversity among the teachers. Additionally, we introduce the Staged Selective Ensemble (SSE) module, which selects different ensemble methods based on the characteristics of the samples and enables more accurate segmentation of label boundaries, thereby improving the quality of pseudo-labels. Experimental results demonstrate the effectiveness of our proposed method in 3D medical image segmentation tasks. Here is the code link: https://github.com/Fazhan-cs/DCP.

* 35 pages, 5 figures

Via

Access Paper or Ask Questions

DR-MPC: Deep Residual Model Predictive Control for Real-world Social Navigation

Oct 14, 2024

James R. Han, Hugues Thomas, Jian Zhang, Nicholas Rhinehart, Timothy D. Barfoot

Figure 1 for DR-MPC: Deep Residual Model Predictive Control for Real-world Social Navigation

Figure 2 for DR-MPC: Deep Residual Model Predictive Control for Real-world Social Navigation

Figure 3 for DR-MPC: Deep Residual Model Predictive Control for Real-world Social Navigation

Figure 4 for DR-MPC: Deep Residual Model Predictive Control for Real-world Social Navigation

Abstract:How can a robot safely navigate around people exhibiting complex motion patterns? Reinforcement Learning (RL) or Deep RL (DRL) in simulation holds some promise, although much prior work relies on simulators that fail to precisely capture the nuances of real human motion. To address this gap, we propose Deep Residual Model Predictive Control (DR-MPC), a method to enable robots to quickly and safely perform DRL from real-world crowd navigation data. By blending MPC with model-free DRL, DR-MPC overcomes the traditional DRL challenges of large data requirements and unsafe initial behavior. DR-MPC is initialized with MPC-based path tracking, and gradually learns to interact more effectively with humans. To further accelerate learning, a safety component estimates when the robot encounters out-of-distribution states and guides it away from likely collisions. In simulation, we show that DR-MPC substantially outperforms prior work, including traditional DRL and residual DRL models. Real-world experiments show our approach successfully enables a robot to navigate a variety of crowded situations with few errors using less than 4 hours of training data.

* 8 pages, 8 figures, under review for IEEE Robotics and Automation Letters (RA-L)

Via

Access Paper or Ask Questions

Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering

Oct 12, 2024

Ting Yu, Kunhao Fu, Jian Zhang, Qingming Huang, Jun Yu

Figure 1 for Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering

Figure 2 for Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering

Figure 3 for Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering

Figure 4 for Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering

Abstract:Long-term Video Question Answering (VideoQA) is a challenging vision-and-language bridging task focusing on semantic understanding of untrimmed long-term videos and diverse free-form questions, simultaneously emphasizing comprehensive cross-modal reasoning to yield precise answers. The canonical approaches often rely on off-the-shelf feature extractors to detour the expensive computation overhead, but often result in domain-independent modality-unrelated representations. Furthermore, the inherent gradient blocking between unimodal comprehension and cross-modal interaction hinders reliable answer generation. In contrast, recent emerging successful video-language pre-training models enable cost-effective end-to-end modeling but fall short in domain-specific ratiocination and exhibit disparities in task formulation. Toward this end, we present an entirely end-to-end solution for long-term VideoQA: Multi-granularity Contrastive cross-modal collaborative Generation (MCG) model. To derive discriminative representations possessing high visual concepts, we introduce Joint Unimodal Modeling (JUM) on a clip-bone architecture and leverage Multi-granularity Contrastive Learning (MCL) to harness the intrinsically or explicitly exhibited semantic correspondences. To alleviate the task formulation discrepancy problem, we propose a Cross-modal Collaborative Generation (CCG) module to reformulate VideoQA as a generative task instead of the conventional classification scheme, empowering the model with the capability for cross-modal high-semantic fusion and generation so as to rationalize and answer. Extensive experiments conducted on six publicly available VideoQA datasets underscore the superiority of our proposed method.

* Transactions on Image Processing, vol. 33, pp. 3115-3129, 2024
* Transactions on Image Processing

Via

Access Paper or Ask Questions

FakeShield: Explainable Image Forgery Detection and Localization via Multi-modal Large Language Models

Oct 03, 2024

Zhipei Xu, Xuanyu Zhang, Runyi Li, Zecheng Tang, Qing Huang, Jian Zhang

Figure 1 for FakeShield: Explainable Image Forgery Detection and Localization via Multi-modal Large Language Models

Figure 2 for FakeShield: Explainable Image Forgery Detection and Localization via Multi-modal Large Language Models

Figure 3 for FakeShield: Explainable Image Forgery Detection and Localization via Multi-modal Large Language Models

Figure 4 for FakeShield: Explainable Image Forgery Detection and Localization via Multi-modal Large Language Models

Abstract:The rapid development of generative AI is a double-edged sword, which not only facilitates content creation but also makes image manipulation easier and more difficult to detect. Although current image forgery detection and localization (IFDL) methods are generally effective, they tend to face two challenges: \textbf{1)} black-box nature with unknown detection principle, \textbf{2)} limited generalization across diverse tampering methods (e.g., Photoshop, DeepFake, AIGC-Editing). To address these issues, we propose the explainable IFDL task and design FakeShield, a multi-modal framework capable of evaluating image authenticity, generating tampered region masks, and providing a judgment basis based on pixel-level and image-level tampering clues. Additionally, we leverage GPT-4o to enhance existing IFDL datasets, creating the Multi-Modal Tamper Description dataSet (MMTD-Set) for training FakeShield's tampering analysis capabilities. Meanwhile, we incorporate a Domain Tag-guided Explainable Forgery Detection Module (DTE-FDM) and a Multi-modal Forgery Localization Module (MFLM) to address various types of tamper detection interpretation and achieve forgery localization guided by detailed textual descriptions. Extensive experiments demonstrate that FakeShield effectively detects and localizes various tampering techniques, offering an explainable and superior solution compared to previous IFDL methods.

Via

Access Paper or Ask Questions

Generative AI Application for Building Industry

Oct 01, 2024

Hanlong Wan, Jian Zhang, Yan Chen, Weili Xu, Fan Feng

Figure 1 for Generative AI Application for Building Industry

Figure 2 for Generative AI Application for Building Industry

Figure 3 for Generative AI Application for Building Industry

Figure 4 for Generative AI Application for Building Industry

Abstract:This paper investigates the transformative potential of generative AI technologies, particularly large language models (LLMs), within the building industry. By leveraging these advanced AI tools, the study explores their application across key areas such as energy code compliance, building design optimization, and workforce training. The research highlights how LLMs can automate labor-intensive processes, significantly improving efficiency, accuracy, and safety in building practices. The paper also addresses the challenges associated with interpreting complex visual and textual data in architectural plans and regulatory codes, proposing innovative solutions to enhance AI-driven compliance checking and design processes. Additionally, the study considers the broader implications of AI integration, including the development of AI-powered tools for comprehensive code compliance across various regulatory domains and the potential for AI to revolutionize workforce training through realistic simulations. This paper provides a comprehensive analysis of the current capabilities of generative AI in the building industry while outlining future directions for research and development, aiming to pave the way for smarter, more sustainable, and responsive construction practices.

* 28 pages, 11 figures, 4 tables

Via

Access Paper or Ask Questions

FastTalker: Jointly Generating Speech and Conversational Gestures from Text

Sep 24, 2024

Zixin Guo, Jian Zhang

Figure 1 for FastTalker: Jointly Generating Speech and Conversational Gestures from Text

Figure 2 for FastTalker: Jointly Generating Speech and Conversational Gestures from Text

Figure 3 for FastTalker: Jointly Generating Speech and Conversational Gestures from Text

Figure 4 for FastTalker: Jointly Generating Speech and Conversational Gestures from Text

Abstract:Generating 3D human gestures and speech from a text script is critical for creating realistic talking avatars. One solution is to leverage separate pipelines for text-to-speech (TTS) and speech-to-gesture (STG), but this approach suffers from poor alignment of speech and gestures and slow inference times. In this paper, we introduce FastTalker, an efficient and effective framework that simultaneously generates high-quality speech audio and 3D human gestures at high inference speeds. Our key insight is reusing the intermediate features from speech synthesis for gesture generation, as these features contain more precise rhythmic information than features re-extracted from generated speech. Specifically, 1) we propose an end-to-end framework that concurrently generates speech waveforms and full-body gestures, using intermediate speech features such as pitch, onset, energy, and duration directly for gesture decoding; 2) we redesign the causal network architecture to eliminate dependencies on future inputs for real applications; 3) we employ Reinforcement Learning-based Neural Architecture Search (NAS) to enhance both performance and inference speed by optimizing our network architecture. Experimental results on the BEAT2 dataset demonstrate that FastTalker achieves state-of-the-art performance in both speech synthesis and gesture generation, processing speech and gestures in 0.17 seconds per second on an NVIDIA 3090.

* European Conference on Computer Vision Workshop

Via

Access Paper or Ask Questions

EMOdiffhead: Continuously Emotional Control in Talking Head Generation via Diffusion

Sep 11, 2024

Jian Zhang, Weijian Mai, Zhijun Zhang

Figure 1 for EMOdiffhead: Continuously Emotional Control in Talking Head Generation via Diffusion

Figure 2 for EMOdiffhead: Continuously Emotional Control in Talking Head Generation via Diffusion

Figure 3 for EMOdiffhead: Continuously Emotional Control in Talking Head Generation via Diffusion

Figure 4 for EMOdiffhead: Continuously Emotional Control in Talking Head Generation via Diffusion

Abstract:The task of audio-driven portrait animation involves generating a talking head video using an identity image and an audio track of speech. While many existing approaches focus on lip synchronization and video quality, few tackle the challenge of generating emotion-driven talking head videos. The ability to control and edit emotions is essential for producing expressive and realistic animations. In response to this challenge, we propose EMOdiffhead, a novel method for emotional talking head video generation that not only enables fine-grained control of emotion categories and intensities but also enables one-shot generation. Given the FLAME 3D model's linearity in expression modeling, we utilize the DECA method to extract expression vectors, that are combined with audio to guide a diffusion model in generating videos with precise lip synchronization and rich emotional expressiveness. This approach not only enables the learning of rich facial information from emotion-irrelevant data but also facilitates the generation of emotional videos. It effectively overcomes the limitations of emotional data, such as the lack of diversity in facial and background information, and addresses the absence of emotional details in emotion-irrelevant data. Extensive experiments and user studies demonstrate that our approach achieves state-of-the-art performance compared to other emotion portrait animation methods.

* 12 pages, 7 figures

Via

Access Paper or Ask Questions

MakeWay: Object-Aware Costmaps for Proactive Indoor Navigation Using LiDAR

Aug 30, 2024

Binbin Xu, Allen Tao, Hugues Thomas, Jian Zhang, Timothy D. Barfoot

Figure 1 for MakeWay: Object-Aware Costmaps for Proactive Indoor Navigation Using LiDAR

Figure 2 for MakeWay: Object-Aware Costmaps for Proactive Indoor Navigation Using LiDAR

Figure 3 for MakeWay: Object-Aware Costmaps for Proactive Indoor Navigation Using LiDAR

Figure 4 for MakeWay: Object-Aware Costmaps for Proactive Indoor Navigation Using LiDAR

Abstract:In this paper, we introduce a LiDAR-based robot navigation system, based on novel object-aware affordance-based costmaps. Utilizing a 3D object detection network, our system identifies objects of interest in LiDAR keyframes, refines their 3D poses with the Iterative Closest Point (ICP) algorithm, and tracks them via Kalman filters and the Hungarian algorithm for data association. It then updates existing object poses with new associated detections and creates new object maps for unmatched detections. Using the maintained object-level mapping system, our system creates affordance-driven object costmaps for proactive collision avoidance in path planning. Additionally, we address the scarcity of indoor semantic LiDAR data by introducing an automated labeling technique. This method utilizes a CAD model database for accurate ground-truth annotations, encompassing bounding boxes, positions, orientations, and point-wise semantics of each object in LiDAR sequences. Our extensive evaluations, conducted in both simulated and real-world robot platforms, highlights the effectiveness of proactive object avoidance by using object affordance costmaps, enhancing robotic navigation safety and efficiency. The system can operate in real-time onboard and we intend to release our code and data for public use.

* 8 pages, 11 figures

Via

Access Paper or Ask Questions

GSTran: Joint Geometric and Semantic Coherence for Point Cloud Segmentation

Aug 21, 2024

Abiao Li, Chenlei Lv, Guofeng Mei, Yifan Zuo, Jian Zhang, Yuming Fang

Figure 1 for GSTran: Joint Geometric and Semantic Coherence for Point Cloud Segmentation

Figure 2 for GSTran: Joint Geometric and Semantic Coherence for Point Cloud Segmentation

Figure 3 for GSTran: Joint Geometric and Semantic Coherence for Point Cloud Segmentation

Figure 4 for GSTran: Joint Geometric and Semantic Coherence for Point Cloud Segmentation

Abstract:Learning meaningful local and global information remains a challenge in point cloud segmentation tasks. When utilizing local information, prior studies indiscriminately aggregates neighbor information from different classes to update query points, potentially compromising the distinctive feature of query points. In parallel, inaccurate modeling of long-distance contextual dependencies when utilizing global information can also impact model performance. To address these issues, we propose GSTran, a novel transformer network tailored for the segmentation task. The proposed network mainly consists of two principal components: a local geometric transformer and a global semantic transformer. In the local geometric transformer module, we explicitly calculate the geometric disparity within the local region. This enables amplifying the affinity with geometrically similar neighbor points while suppressing the association with other neighbors. In the global semantic transformer module, we design a multi-head voting strategy. This strategy evaluates semantic similarity across the entire spatial range, facilitating the precise capture of contextual dependencies. Experiments on ShapeNetPart and S3DIS benchmarks demonstrate the effectiveness of the proposed method, showing its superiority over other algorithms. The code is available at https://github.com/LAB123-tech/GSTran.

* ICPR 2024

Via

Access Paper or Ask Questions