Zhe Huang

Feature Fusion Transferability Aware Transformer for Unsupervised Domain Adaptation

Nov 10, 2024

Xiaowei Yu, Zhe Huang, Zao Zhang

Abstract:Unsupervised domain adaptation (UDA) aims to leverage the knowledge learned from labeled source domains to improve performance on the unlabeled target domains. While Convolutional Neural Networks (CNNs) have been dominant in previous UDA methods, recent research has shown promise in applying Vision Transformers (ViTs) to this task. In this study, we propose a novel Feature Fusion Transferability Aware Transformer (FFTAT) to enhance ViT performance in UDA tasks. Our method introduces two key innovations: First, we introduce a patch discriminator to evaluate the transferability of patches, generating a transferability matrix. We integrate this matrix into self-attention, directing the model to focus on transferable patches. Second, we propose a feature fusion technique to fuse embeddings in the latent space, enabling each embedding to incorporate information from all others, thereby improving generalization. These two components work in synergy to enhance feature representation learning. Extensive experiments on widely used benchmarks demonstrate that our method significantly improves UDA performance, achieving state-of-the-art (SOTA) results.

* IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025
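
One way to picture how a transferability matrix could steer self-attention is to rescale the attention weights and renormalize each row. The NumPy sketch below is illustrative only: the abstract does not say how the matrix enters the attention computation, so the element-wise product, the row renormalization, and the names `transferability_weighted_attention` and `transferability` are assumptions rather than the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def transferability_weighted_attention(q, k, v, transferability):
    """Single-head attention whose weights are rescaled by a matrix.

    q, k, v:          arrays of shape (num_patches, dim)
    transferability:  hypothetical (num_patches, num_patches) matrix of
                      positive scores; larger means "more transferable".
    Returns the attended values and the renormalized attention weights.
    """
    dim = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(dim))          # standard scaled dot-product attention
    weighted = attn * transferability               # assumed: element-wise rescaling
    weighted = weighted / weighted.sum(axis=-1, keepdims=True)  # rows sum to 1 again
    return weighted @ v, weighted
```

In this sketch, a patch whose column in `transferability` is close to zero receives almost no attention from any query, which is one plausible reading of "directing the model to focus on transferable patches".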

Towards Real-Time Generation of Delay-Compensated Video Feeds for Outdoor Mobile Robot Teleoperation

Sep 16, 2024

Neeloy Chakraborty, Yixiao Fang, Andre Schreiber, Tianchen Ji, Zhe Huang, Aganze Mihigo, Cassidy Wall, Abdulrahman Almana, Katherine Driggs-Campbell

Abstract:Teleoperation is an important technology to enable supervisors to control agricultural robots remotely. However, environmental factors in dense crop rows and limitations in network infrastructure hinder the reliability of data streamed to teleoperators. These issues result in delayed and variable frame rate video feeds that often deviate significantly from the robot's actual viewpoint. We propose a modular learning-based vision pipeline to generate delay-compensated images in real-time for supervisors. Our extensive offline evaluations demonstrate that our method generates more accurate images compared to state-of-the-art approaches in our setting. Additionally, ours is one of the few works to evaluate a delay-compensation method in outdoor field environments with complex terrain on data from a real robot in real-time. Additional videos are provided at https://sites.google.com/illinois.edu/comp-teleop.

* 8 pages, 4 figures, 3 tables

GSLAMOT: A Tracklet and Query Graph-based Simultaneous Locating, Mapping, and Multiple Object Tracking System

Aug 17, 2024

Shuo Wang, Yongcai Wang, Zhimin Xu, Yongyu Guo, Wanting Li, Zhe Huang, Xuewei Bai, Deying Li

Abstract:For interacting with mobile objects in unfamiliar environments, simultaneously locating, mapping, and tracking the 3D poses of multiple objects is crucial. This paper proposes a Tracklet Graph and Query Graph-based framework, i.e., GSLAMOT, to address this challenge. GSLAMOT takes camera and LiDAR multimodal information as input and divides the representation of the dynamic scene into a semantic map representing the static environment, a trajectory of the ego-agent, and an online-maintained Tracklet Graph (TG) for tracking and predicting the 3D poses of the detected mobile objects. A Query Graph (QG) is constructed in each frame by object detection to query and update the TG. For accurate object association, a Multi-criteria Star Graph Association (MSGA) method is proposed to find matched objects between the detections in the QG and the predicted tracklets in the TG. Then, an Object-centric Graph Optimization (OGO) method is proposed to simultaneously optimize the TG, the semantic map, and the agent trajectory. It triangulates the detected objects into the map to enrich the map's semantic information. We address the efficiency issues to handle the three tightly coupled tasks in parallel. Experiments are conducted on KITTI, Waymo, and an emulated Traffic Congestion dataset that highlights challenging scenarios. The results show that GSLAMOT enables accurate tracking of crowded objects while performing SLAM accurately in challenging scenarios, outperforming the state-of-the-art methods. The code and dataset are at https://gslamot.github.io.

* 11 pages, 9 figures, ACM MM 2024

RoCo: Robust Collaborative Perception By Iterative Object Matching and Pose Adjustment

Aug 01, 2024

Zhe Huang, Shuo Wang, Yongcai Wang, Wanting Li, Deying Li, Lei Wang

Abstract:Collaborative autonomous driving with multiple vehicles usually requires fusing data from multiple modalities. To ensure effective fusion, the data from each individual modality must maintain a reasonably high quality. However, in collaborative perception, the quality of object detection based on a modality is highly sensitive to the relative pose errors among the agents. This leads to feature misalignment and significantly reduces collaborative performance. To address this issue, we propose RoCo, a novel unsupervised framework to conduct iterative object matching and agent pose adjustment. To the best of our knowledge, our work is the first to model the pose correction problem in collaborative perception as an object matching task, which reliably associates common objects detected by different agents. On top of this, we propose a graph optimization process to adjust the agent poses by minimizing the alignment errors of the associated objects, and the object matching is redone based on the adjusted agent poses. This process is carried out iteratively until convergence. Experiments on both simulated and real-world datasets demonstrate that the proposed framework RoCo consistently outperforms existing relevant methods in terms of collaborative object detection performance, and exhibits the desired robustness when the agents' pose information is highly noisy. Ablation studies are also provided to show the impact of its key parameters and components. The code is released at https://github.com/HuangZhe885/RoCo.

* Proceedings of the 32nd ACM International Conference on Multimedia (MM '24), October 28-November 1, 2024, Melbourne, VIC, Australia
* ACM MM2024

Topology-Guided ORCA: Smooth Multi-Agent Motion Planning in Constrained Environments

Jul 23, 2024

Fatemeh Cheraghi Pouria, Zhe Huang, Ananya Yammanuru, Shuijing Liu, Katherine Driggs-Campbell

Abstract:We present Topology-Guided ORCA as an alternative simulator to replace ORCA for planning smooth multi-agent motions in environments with static obstacles. Despite its impressive performance in simulating multi-agent crowd motion in free space, ORCA encounters a significant challenge in navigating agents in the presence of static obstacles. ORCA ignores static obstacles until an agent gets too close to one, and the agent will get stuck if the obstacle intercepts its path toward the goal. To address this challenge, Topology-Guided ORCA constructs a graph to represent the topology of the traversable region of the environment. We use a path planner to plan a path of waypoints that connects each agent's start and goal positions. The waypoints are used as a sequence of goals to guide ORCA. Crowd simulation experiments in constrained environments show that our method outperforms ORCA in generating smooth and natural motions of multiple agents, which indicates the great potential of Topology-Guided ORCA to serve as an effective simulator for training constrained social navigation policies.
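
The waypoint idea in this abstract can be sketched with a small grid example. This is a minimal illustration under stated assumptions: the abstract does not describe how the topology graph is built or which planner is used, so the 4-connected occupancy grid, the breadth-first search, and the helper names `plan_waypoints` and `advance_goal` are all hypothetical stand-ins. The local collision-avoidance step (ORCA itself) is not shown.

```python
from collections import deque


def plan_waypoints(grid, start, goal):
    """Breadth-first search over free cells (0 = free, 1 = obstacle).

    Returns the list of cells from start to goal, or [] if unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:        # walk back to the start
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and nxt not in parent):
                parent[nxt] = cell
                queue.append(nxt)
    return []


def advance_goal(waypoints, index, position, reach_radius=0.5):
    """Skip waypoints the agent has already reached.

    Returns the index of the waypoint that should serve as the agent's
    current intermediate goal; the final waypoint is never skipped.
    """
    while index < len(waypoints) - 1:
        wr, wc = waypoints[index]
        dist_sq = (wr - position[0]) ** 2 + (wc - position[1]) ** 2
        if dist_sq > reach_radius ** 2:
            break
        index += 1
    return index
```

In a simulation loop, each agent would call `advance_goal` every step and hand the selected waypoint to its local avoidance routine as the preferred goal, so the agent is routed around a wall instead of pushing straight toward the final goal.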

LIT: Large Language Model Driven Intention Tracking for Proactive Human-Robot Collaboration -- A Robot Sous-Chef Application

Jun 19, 2024

Zhe Huang, John Pohovey, Ananya Yammanuru, Katherine Driggs-Campbell

Abstract:Large Language Models (LLMs) and Vision Language Models (VLMs) enable robots to ground natural language prompts into control actions to achieve tasks in an open world. However, when applied to a long-horizon collaborative task, this formulation results in excessive prompting for initiating or clarifying robot actions at every step of the task. We propose Language-driven Intention Tracking (LIT), leveraging LLMs and VLMs to model the human user's long-term behavior and to predict the next human intention to guide the robot for proactive collaboration. We demonstrate smooth coordination between a LIT-based collaborative robot and the human user in collaborative cooking tasks.

* Spotlight Presentation at the 3rd Workshop on Computer Vision in the Wild at CVPR 2024. Also accepted by the 5th Annual Embodied AI Workshop at CVPR 2024

DuoSpaceNet: Leveraging Both Bird's-Eye-View and Perspective View Representations for 3D Object Detection

May 17, 2024

Zhe Huang, Yizhe Zhao, Hao Xiao, Chenyan Wu, Lingting Ge

Abstract:Recent advances in multi-view camera-only 3D object detection either rely on an accurate reconstruction of bird's-eye-view (BEV) 3D features or on traditional 2D perspective view (PV) image features. While both have their own pros and cons, few have found a way to stitch them together in order to benefit from "the best of both worlds". To this end, we explore a duo space (i.e., BEV and PV) 3D perception framework, in conjunction with some useful duo space fusion strategies that allow effective aggregation of the two feature representations. To the best of our knowledge, our proposed method, DuoSpaceNet, is the first to leverage two distinct feature spaces and achieves state-of-the-art 3D object detection and BEV map segmentation results on the nuScenes dataset.

Private Wasserstein Distance with Random Noises

Apr 10, 2024

Wenqian Li, Haozhi Wang, Zhe Huang, Yan Pang

Abstract:Wasserstein distance is a principal measure of data divergence from a distributional standpoint. However, its application becomes challenging in the context of data privacy, where sharing raw data is restricted. Prior attempts have employed techniques like Differential Privacy or Federated optimization to approximate Wasserstein distance. Nevertheless, these approaches often lack accuracy and robustness against potential attacks. In this study, we investigate the underlying triangular properties within the Wasserstein space, leading to a straightforward solution named TriangleWad. This approach enables the computation of Wasserstein distance between datasets stored across different entities. Notably, TriangleWad is 20 times faster, makes raw data information truly invisible, and enhances resilience against attacks, all without sacrificing estimation accuracy. Through comprehensive experimentation across various tasks involving both image and text data, we demonstrate its superior performance and generalization.
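
For readers unfamiliar with the quantity being estimated, the sketch below computes the Wasserstein-1 distance between two equal-size one-dimensional samples and checks the triangle inequality that any metric satisfies. It is not TriangleWad: the abstract does not describe that algorithm, and the function name `wasserstein_1d` and the toy samples are assumptions chosen for illustration.

```python
import numpy as np


def wasserstein_1d(x, y):
    """Wasserstein-1 distance between two equal-size 1D empirical samples.

    For equal-weight samples of the same size, the optimal transport plan
    in one dimension matches sorted values, so the distance is the mean
    absolute difference between the sorted samples.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("samples must have the same size")
    return float(np.mean(np.abs(x - y)))


# Toy samples held by three hypothetical parties.
a = [0.0, 1.0, 2.0]
b = [1.0, 2.0, 3.0]
c = [5.0, 6.0, 7.0]

# Triangle inequality: the direct distance never exceeds the detour via b.
direct = wasserstein_1d(a, c)
detour = wasserstein_1d(a, b) + wasserstein_1d(b, c)
```

Here the direct distance between `a` and `c` is bounded above by the detour through `b`, which is the kind of geometric relationship the phrase "triangular properties" suggests.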

InterLUDE: Interactions between Labeled and Unlabeled Data to Enhance Semi-Supervised Learning

Mar 15, 2024

Zhe Huang, Xiaowei Yu, Dajiang Zhu, Michael C. Hughes

Abstract:Semi-supervised learning (SSL) seeks to enhance task performance by training on both labeled and unlabeled data. Mainstream SSL image classification methods mostly optimize a loss that additively combines a supervised classification objective with a regularization term derived solely from unlabeled data. This formulation neglects the potential for interaction between labeled and unlabeled images. In this paper, we introduce InterLUDE, a new approach to enhance SSL made of two parts that each benefit from labeled-unlabeled interaction. The first part, embedding fusion, interpolates between labeled and unlabeled embeddings to improve representation learning. The second part is a new loss, grounded in the principle of consistency regularization, that aims to minimize discrepancies in the model's predictions between labeled versus unlabeled inputs. Experiments on standard closed-set SSL benchmarks and a medical SSL task with an uncurated unlabeled set show clear benefits to our approach. On the STL-10 dataset with only 40 labels, InterLUDE achieves 3.2% error rate, while the best previous method reports 14.9%.

* Semi-supervised Learning; Vision Transformers
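
The embedding-fusion idea of interpolating between labeled and unlabeled embeddings can be illustrated as below. This is only a sketch: the abstract does not give the pairing scheme or the mixing weight, so the index-wise pairing, the `alpha` parameter, and the function name `fuse_embeddings` are assumptions, not the InterLUDE implementation.

```python
import numpy as np


def fuse_embeddings(labeled, unlabeled, alpha=0.1):
    """Interpolate each labeled embedding with its unlabeled partner.

    labeled, unlabeled: arrays of shape (batch, dim); row i of one batch is
                        paired with row i of the other (an assumed scheme).
    alpha:              hypothetical mixing weight in [0, 1]; 0 leaves the
                        embeddings unchanged.
    """
    labeled = np.asarray(labeled, dtype=float)
    unlabeled = np.asarray(unlabeled, dtype=float)
    if labeled.shape != unlabeled.shape:
        raise ValueError("batches must have the same shape")
    fused_labeled = (1.0 - alpha) * labeled + alpha * unlabeled
    fused_unlabeled = (1.0 - alpha) * unlabeled + alpha * labeled
    return fused_labeled, fused_unlabeled
```

With a small `alpha`, each embedding stays close to its original value while still carrying a trace of its partner, so gradients from the supervised loss can flow through unlabeled examples and vice versa.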

Semi-Supervised Multimodal Multi-Instance Learning for Aortic Stenosis Diagnosis

Mar 09, 2024

Zhe Huang, Xiaowei Yu, Benjamin S. Wessler, Michael C. Hughes

Abstract:Automated interpretation of ultrasound imaging of the heart (echocardiograms) could improve the detection and treatment of aortic stenosis (AS), a deadly heart disease. However, existing deep learning pipelines for assessing AS from echocardiograms have two key limitations. First, most methods rely on limited 2D cineloops, thereby ignoring widely available Doppler imaging that contains important complementary information about pressure gradients and blood flow abnormalities associated with AS. Second, obtaining labeled data is difficult. There are often far more unlabeled echocardiogram recordings available, but these remain underutilized by existing methods. To overcome these limitations, we introduce Semi-supervised Multimodal Multiple-Instance Learning (SMMIL), a new deep learning framework for automatic interpretation for structural heart diseases like AS. When deployed, SMMIL can combine information from two input modalities, spectral Dopplers and 2D cineloops, to produce a study-level AS diagnosis. During training, SMMIL can combine a smaller labeled set and an abundant unlabeled set of both modalities to improve its classifier. Experiments demonstrate that SMMIL outperforms recent alternatives at 3-level AS severity classification as well as several clinically relevant AS detection tasks.

* Echocardiography; Multimodal; Semi-supervised Learning; Multiple-Instance Learning
