Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Hao Huang

A Chain-of-Thought Subspace Meta-Learning for Few-shot Image Captioning with Large Vision and Language Models

Feb 19, 2025

Hao Huang, Shuaihang Yuan, Yu Hao, Congcong Wen, Yi Fang

Abstract:A large-scale vision and language model that has been pretrained on massive data encodes visual and linguistic prior, which makes it easier to generate images and language that are more natural and realistic. Despite this, there is still a significant domain gap between the modalities of vision and language, especially when training data is scarce in few-shot settings, where only very limited data are available for training. In order to mitigate this issue, a multi-modal meta-learning framework has been proposed to bridge the gap between two frozen pretrained large vision and language models by introducing a tunable prompt connecting these two large models. For few-shot image captioning, the existing multi-model meta-learning framework utilizes a one-step prompting scheme to accumulate the visual features of input images to guide the language model, which struggles to generate accurate image descriptions with only a few training samples. Instead, we propose a chain-of-thought (CoT) meta-learning scheme as a multi-step image captioning procedure to better imitate how humans describe images. In addition, we further propose to learn different meta-parameters of the model corresponding to each CoT step in distinct subspaces to avoid interference. We evaluated our method on three commonly used image captioning datasets, i.e., MSCOCO, Flickr8k, and Flickr30k, under few-shot settings. The results of our experiments indicate that our chain-of-thought subspace meta-learning strategy is superior to the baselines in terms of performance across different datasets measured by different metrics.

* 11 pages, 3 figures, 5 tables

Via

Access Paper or Ask Questions

RW-Net: Enhancing Few-Shot Point Cloud Classification with a Wavelet Transform Projection-based Network

Jan 06, 2025

Haosheng Zhang, Hao Huang

Figure 1 for RW-Net: Enhancing Few-Shot Point Cloud Classification with a Wavelet Transform Projection-based Network

Figure 2 for RW-Net: Enhancing Few-Shot Point Cloud Classification with a Wavelet Transform Projection-based Network

Figure 3 for RW-Net: Enhancing Few-Shot Point Cloud Classification with a Wavelet Transform Projection-based Network

Figure 4 for RW-Net: Enhancing Few-Shot Point Cloud Classification with a Wavelet Transform Projection-based Network

Abstract:In the domain of 3D object classification, a fundamental challenge lies in addressing the scarcity of labeled data, which limits the applicability of traditional data-intensive learning paradigms. This challenge is particularly pronounced in few-shot learning scenarios, where the objective is to achieve robust generalization from minimal annotated samples. To overcome these limitations, it is crucial to identify and leverage the most salient and discriminative features of 3D objects, thereby enhancing learning efficiency and reducing dependency on large-scale labeled datasets. This work introduces RW-Net, a novel framework designed to address the challenges above by integrating Rate-Distortion Explanation (RDE) and wavelet transform into a state-of-the-art projection-based 3D object classification architecture. The proposed method capitalizes on RDE to extract critical features by identifying and preserving the most informative data components while reducing redundancy. This process ensures the retention of essential information for effective decision-making, optimizing the model's ability to learn from limited data. Complementing RDE, incorporating the wavelet transform further enhances the framework's capability to generalize in low-data regimes. By emphasizing low-frequency components of the input data, the wavelet transform captures fundamental geometric and structural attributes of 3D objects. These attributes are instrumental in mitigating overfitting and improving the robustness of the learned representations across diverse tasks and domains. To validate the effectiveness of our RW-Net, we conduct extensive experiments on three datasets: ModelNet40, ModelNet40-C, and ScanObjectNN for few-shot 3D object classification. The results demonstrate that our approach achieves state-of-the-art performance and exhibits superior generalization and robustness in few-shot learning scenarios.

* 11 pages, 5 figures, 9 tables

Via

Access Paper or Ask Questions

Impact of Data Distribution on Fairness Guarantees in Equitable Deep Learning

Dec 29, 2024

Yan Luo, Congcong Wen, Min Shi, Hao Huang, Yi Fang, Mengyu Wang

Figure 1 for Impact of Data Distribution on Fairness Guarantees in Equitable Deep Learning

Figure 2 for Impact of Data Distribution on Fairness Guarantees in Equitable Deep Learning

Figure 3 for Impact of Data Distribution on Fairness Guarantees in Equitable Deep Learning

Figure 4 for Impact of Data Distribution on Fairness Guarantees in Equitable Deep Learning

Abstract:We present a comprehensive theoretical framework analyzing the relationship between data distributions and fairness guarantees in equitable deep learning. Our work establishes novel theoretical bounds that explicitly account for data distribution heterogeneity across demographic groups, while introducing a formal analysis framework that minimizes expected loss differences across these groups. We derive comprehensive theoretical bounds for fairness errors and convergence rates, and characterize how distributional differences between groups affect the fundamental trade-off between fairness and accuracy. Through extensive experiments on diverse datasets, including FairVision (ophthalmology), CheXpert (chest X-rays), HAM10000 (dermatology), and FairFace (facial recognition), we validate our theoretical findings and demonstrate that differences in feature distributions across demographic groups significantly impact model fairness, with performance disparities particularly pronounced in racial categories. The theoretical bounds we derive crroborate these empirical observations, providing insights into the fundamental limits of achieving fairness in deep learning models when faced with heterogeneous data distributions. This work advances our understanding of fairness in AI-based diagnosis systems and provides a theoretical foundation for developing more equitable algorithms. The code for analysis is publicly available via \url{https://github.com/Harvard-Ophthalmology-AI-Lab/fairness_guarantees}.

Via

Access Paper or Ask Questions

Synchronized and Fine-Grained Head for Skeleton-Based Ambiguous Action Recognition

Dec 19, 2024

Hao Huang, Yujie Lin, Siyu Chen, Haiyang Liu

Figure 1 for Synchronized and Fine-Grained Head for Skeleton-Based Ambiguous Action Recognition

Figure 2 for Synchronized and Fine-Grained Head for Skeleton-Based Ambiguous Action Recognition

Figure 3 for Synchronized and Fine-Grained Head for Skeleton-Based Ambiguous Action Recognition

Figure 4 for Synchronized and Fine-Grained Head for Skeleton-Based Ambiguous Action Recognition

Abstract:Skeleton-based action recognition using GCNs has achieved remarkable performance, but recognizing ambiguous actions, such as "waving" and "saluting", remains a significant challenge. Existing methods typically rely on a serial combination of GCNs and TCNs, where spatial and temporal features are extracted independently, leading to an unbalanced spatial-temporal information, which hinders accurate action recognition. Moreover, existing methods for ambiguous actions often overemphasize local details, resulting in the loss of crucial global context, which further complicates the task of differentiating ambiguous actions. To address these challenges, we propose a lightweight plug-and-play module called Synchronized and Fine-grained Head (SF-Head), inserted between GCN and TCN layers. SF-Head first conducts Synchronized Spatial-Temporal Extraction (SSTE) with a Feature Redundancy Loss (F-RL), ensuring a balanced interaction between the two types of features. It then performs Adaptive Cross-dimensional Feature Aggregation (AC-FA), with a Feature Consistency Loss (F-CL), which aligns the aggregated feature with their original spatial-temporal feature. This aggregation step effectively combines both global context and local details. Experimental results on NTU RGB+D 60, NTU RGB+D 120, and NW-UCLA datasets demonstrate significant improvements in distinguishing ambiguous actions. Our code will be made available at https://github.com/HaoHuang2003/SFHead.

* 20pages, 5 figures

Via

Access Paper or Ask Questions

Neural-Network-Enhanced Metalens Camera for High-Definition, Dynamic Imaging in the Long-Wave Infrared Spectrum

Nov 26, 2024

Jing-Yang Wei, Hao Huang, Xin Zhang, De-Mao Ye, Yi Li, Le Wang, Yao-Guang Ma, Yang-Hui Li

Figure 1 for Neural-Network-Enhanced Metalens Camera for High-Definition, Dynamic Imaging in the Long-Wave Infrared Spectrum

Figure 2 for Neural-Network-Enhanced Metalens Camera for High-Definition, Dynamic Imaging in the Long-Wave Infrared Spectrum

Figure 3 for Neural-Network-Enhanced Metalens Camera for High-Definition, Dynamic Imaging in the Long-Wave Infrared Spectrum

Figure 4 for Neural-Network-Enhanced Metalens Camera for High-Definition, Dynamic Imaging in the Long-Wave Infrared Spectrum

Abstract:To provide a lightweight and cost-effective solution for the long-wave infrared imaging using a singlet, we develop a camera by integrating a High-Frequency-Enhancing Cycle-GAN neural network into a metalens imaging system. The High-Frequency-Enhancing Cycle-GAN improves the quality of the original metalens images by addressing inherent frequency loss introduced by the metalens. In addition to the bidirectional cyclic generative adversarial network, it incorporates a high-frequency adversarial learning module. This module utilizes wavelet transform to extract high-frequency components, and then establishes a high-frequency feedback loop. It enables the generator to enhance the camera outputs by integrating adversarial feedback from the high-frequency discriminator. This ensures that the generator adheres to the constraints imposed by the high-frequency adversarial loss, thereby effectively recovering the camera's frequency loss. This recovery guarantees high-fidelity image output from the camera, facilitating smooth video production. Our camera is capable of achieving dynamic imaging at 125 frames per second with an End Point Error value of 12.58. We also achieve 0.42 for Fr\'echet Inception Distance, 30.62 for Peak Signal to Noise Ratio, and 0.69 for Structural Similarity in the recorded videos.

Via

Access Paper or Ask Questions

GAMap: Zero-Shot Object Goal Navigation with Multi-Scale Geometric-Affordance Guidance

Oct 31, 2024

Shuaihang Yuan, Hao Huang, Yu Hao, Congcong Wen, Anthony Tzes, Yi Fang

Figure 1 for GAMap: Zero-Shot Object Goal Navigation with Multi-Scale Geometric-Affordance Guidance

Figure 2 for GAMap: Zero-Shot Object Goal Navigation with Multi-Scale Geometric-Affordance Guidance

Figure 3 for GAMap: Zero-Shot Object Goal Navigation with Multi-Scale Geometric-Affordance Guidance

Figure 4 for GAMap: Zero-Shot Object Goal Navigation with Multi-Scale Geometric-Affordance Guidance

Abstract:Zero-Shot Object Goal Navigation (ZS-OGN) enables robots or agents to navigate toward objects of unseen categories without object-specific training. Traditional approaches often leverage categorical semantic information for navigation guidance, which struggles when only objects are partially observed or detailed and functional representations of the environment are lacking. To resolve the above two issues, we propose \textit{Geometric-part and Affordance Maps} (GAMap), a novel method that integrates object parts and affordance attributes as navigation guidance. Our method includes a multi-scale scoring approach to capture geometric-part and affordance attributes of objects at different scales. Comprehensive experiments conducted on HM3D and Gibson benchmark datasets demonstrate improvements in Success Rate and Success weighted by Path Length, underscoring the efficacy of our geometric-part and affordance-guided navigation approach in enhancing robot autonomy and versatility, without any additional object-specific training or fine-tuning with the semantics of unseen objects and/or the locomotions of the robot.

* 16 pages, 8 figures, 7 tables

Via

Access Paper or Ask Questions

Reliable Semantic Understanding for Real World Zero-shot Object Goal Navigation

Oct 29, 2024

Halil Utku Unlu, Shuaihang Yuan, Congcong Wen, Hao Huang, Anthony Tzes, Yi Fang

Figure 1 for Reliable Semantic Understanding for Real World Zero-shot Object Goal Navigation

Figure 2 for Reliable Semantic Understanding for Real World Zero-shot Object Goal Navigation

Figure 3 for Reliable Semantic Understanding for Real World Zero-shot Object Goal Navigation

Figure 4 for Reliable Semantic Understanding for Real World Zero-shot Object Goal Navigation

Abstract:We introduce an innovative approach to advancing semantic understanding in zero-shot object goal navigation (ZS-OGN), enhancing the autonomy of robots in unfamiliar environments. Traditional reliance on labeled data has been a limitation for robotic adaptability, which we address by employing a dual-component framework that integrates a GLIP Vision Language Model for initial detection and an InstructionBLIP model for validation. This combination not only refines object and environmental recognition but also fortifies the semantic interpretation, pivotal for navigational decision-making. Our method, rigorously tested in both simulated and real-world settings, exhibits marked improvements in navigation precision and reliability.

* 16 pages, 7 figures, 2 tables

Via

Access Paper or Ask Questions

Exploring the Reliability of Foundation Model-Based Frontier Selection in Zero-Shot Object Goal Navigation

Oct 28, 2024

Shuaihang Yuan, Halil Utku Unlu, Hao Huang, Congcong Wen, Anthony Tzes, Yi Fang

Abstract:In this paper, we present a novel method for reliable frontier selection in Zero-Shot Object Goal Navigation (ZS-OGN), enhancing robotic navigation systems with foundation models to improve commonsense reasoning in indoor environments. Our approach introduces a multi-expert decision framework to address the nonsensical or irrelevant reasoning often seen in foundation model-based systems. The method comprises two key components: Diversified Expert Frontier Analysis (DEFA) and Consensus Decision Making (CDM). DEFA utilizes three expert models: furniture arrangement, room type analysis, and visual scene reasoning, while CDM aggregates their outputs, prioritizing unanimous or majority consensus for more reliable decisions. Demonstrating state-of-the-art performance on the RoboTHOR and HM3D datasets, our method excels at navigating towards untrained objects or goals and outperforms various baselines, showcasing its adaptability to dynamic real-world conditions and superior generalization capabilities.

* 17 pages, 5 figures, 3 tables

Via

Access Paper or Ask Questions

3D Distance-color-coded Assessment of PCI Stent Apposition via Deep-learning-based Three-dimensional Multi-object Segmentation

Oct 26, 2024

Xiaoyang Qin, Hao Huang, Shuaichen Lin, Xinhao Zeng, Kaizhi Cao, Renxiong Wu, Yuming Huang, Junqing Yang, Yong Liu, Gang Li(+1 more)

Abstract:Coronary artery disease poses a significant global health challenge, often necessitating percutaneous coronary intervention (PCI) with stent implantation. Assessing stent apposition holds pivotal importance in averting and identifying PCI complications that lead to in-stent restenosis. Here we proposed a novel three-dimensional (3D) distance-color-coded assessment (DccA)for PCI stent apposition via deep-learning-based 3D multi-object segmentation in intravascular optical coherence tomography (IV-OCT). Our proposed 3D DccA accurately segments 3D vessel lumens and stents in IV-OCT images, using a spatial matching network and dual-layer training with style transfer. It quantifies and maps stent-lumen distances into a 3D color space, facilitating 3D visual assessment of PCI stent apposition. Achieving over 95% segmentation precision, our proposed DccA enhances clinical evaluation of PCI stent deployment and supports personalized treatment planning.

Via

Access Paper or Ask Questions

Zero-shot Object Navigation with Vision-Language Models Reasoning

Oct 24, 2024

Congcong Wen, Yisiyuan Huang, Hao Huang, Yanjia Huang, Shuaihang Yuan, Yu Hao, Hui Lin, Yu-Shen Liu, Yi Fang

Figure 1 for Zero-shot Object Navigation with Vision-Language Models Reasoning

Figure 2 for Zero-shot Object Navigation with Vision-Language Models Reasoning

Figure 3 for Zero-shot Object Navigation with Vision-Language Models Reasoning

Figure 4 for Zero-shot Object Navigation with Vision-Language Models Reasoning

Abstract:Object navigation is crucial for robots, but traditional methods require substantial training data and cannot be generalized to unknown environments. Zero-shot object navigation (ZSON) aims to address this challenge, allowing robots to interact with unknown objects without specific training data. Language-driven zero-shot object navigation (L-ZSON) is an extension of ZSON that incorporates natural language instructions to guide robot navigation and interaction with objects. In this paper, we propose a novel Vision Language model with a Tree-of-thought Network (VLTNet) for L-ZSON. VLTNet comprises four main modules: vision language model understanding, semantic mapping, tree-of-thought reasoning and exploration, and goal identification. Among these modules, Tree-of-Thought (ToT) reasoning and exploration module serves as a core component, innovatively using the ToT reasoning framework for navigation frontier selection during robot exploration. Compared to conventional frontier selection without reasoning, navigation using ToT reasoning involves multi-path reasoning processes and backtracking when necessary, enabling globally informed decision-making with higher accuracy. Experimental results on PASTURE and RoboTHOR benchmarks demonstrate the outstanding performance of our model in LZSON, particularly in scenarios involving complex natural language as target instructions.

* Accepted by the International Conference on Pattern Recognition (ICPR) for Oral presentation

Via

Access Paper or Ask Questions