Phu Pham

TransLocNet: Cross-Modal Attention for Aerial-Ground Vehicle Localization with Contrastive Learning

Dec 11, 2025

Phu Pham, Damon Conover, Aniket Bera

Abstract:Aerial-ground localization is difficult due to large viewpoint and modality gaps between ground-level LiDAR and overhead imagery. We propose TransLocNet, a cross-modal attention framework that fuses LiDAR geometry with aerial semantic context. LiDAR scans are projected into a bird's-eye-view representation and aligned with aerial features through bidirectional attention, followed by a likelihood map decoder that outputs spatial probability distributions over position and orientation. A contrastive learning module enforces a shared embedding space to improve cross-modal alignment. Experiments on CARLA and KITTI show that TransLocNet outperforms state-of-the-art baselines, reducing localization error by up to 63% and achieving sub-meter, sub-degree accuracy. These results demonstrate that TransLocNet provides robust and generalizable aerial-ground localization in both synthetic and real-world settings.

* 8 pages, 4 figures, 4 tables
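
The abstract above mentions a contrastive learning module that enforces a shared embedding space across modalities. As a rough illustration only, the sketch below computes a symmetric InfoNCE-style loss over paired embeddings. The function names, the temperature value, and the choice of InfoNCE itself are assumptions for illustration; the abstract does not specify which contrastive objective TransLocNet uses.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(lidar_emb, aerial_emb, temperature=0.1):
    """Symmetric InfoNCE loss; row i of each input is assumed to be a matched pair.

    Hypothetical helper: not taken from the paper.
    """
    a = [l2_normalize(v) for v in lidar_emb]
    b = [l2_normalize(v) for v in aerial_emb]
    n = len(a)
    # Cosine similarities between every LiDAR/aerial pair, scaled by temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diagonal(rows):
        # Mean cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(v - m) for v in row))
            total += log_z - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    # Average the LiDAR-to-aerial and aerial-to-LiDAR directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))
```

With this loss, matched pairs that point in the same direction score lower than mismatched ones, which is the behavior a shared embedding space relies on.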

FlashSLAM: Accelerated RGB-D SLAM for Real-Time 3D Scene Reconstruction with Gaussian Splatting

Dec 01, 2024

Phu Pham, Damon Conover, Aniket Bera

Abstract:We present FlashSLAM, a novel SLAM approach that leverages 3D Gaussian Splatting for efficient and robust 3D scene reconstruction. Existing 3DGS-based SLAM methods often fall short in sparse view settings and during large camera movements due to their reliance on gradient descent-based optimization, which is both slow and inaccurate. FlashSLAM addresses these limitations by combining 3DGS with a fast vision-based camera tracking technique, utilizing a pretrained feature matching model and point cloud registration for precise pose estimation in under 80 ms - a 90% reduction in tracking time compared to SplaTAM - without costly iterative rendering. In sparse settings, our method achieves up to a 92% improvement in average tracking accuracy over previous methods. Additionally, it accounts for noise in depth sensors, enhancing robustness when using unspecialized devices such as smartphones. Extensive experiments show that FlashSLAM performs reliably across both sparse and dense settings, in synthetic and real-world environments. Evaluations on benchmark datasets highlight its superior accuracy and efficiency, establishing FlashSLAM as a versatile and high-performance solution for SLAM, advancing the state-of-the-art in 3D reconstruction across diverse applications.

* 16 pages, 9 figures, 13 tables

Quadratic Is Not What You Need For Multimodal Large Language Models

Oct 08, 2024

Phu Pham, Wentian Zhao, Kun Wan, Yu-Jhe Li, Zeliang Zhang, Daniel Miranda, Ajinkya Kale, Chenliang Xu

Abstract:In the past year, the capabilities of Multimodal Large Language Models (MLLMs) have significantly improved across various aspects. However, constrained by the quadratic growth of computation in LLMs as the number of tokens increases, efficiency has become a bottleneck for further scaling MLLMs. Although recent efforts have been made to prune visual tokens or use more lightweight LLMs to reduce computation, the problem of quadratic growth in computation with the increase of visual tokens still persists. To address this, we propose a novel approach: instead of reducing the input visual tokens for LLMs, we focus on pruning vision-related computations within the LLMs. After pruning, the computation growth in the LLM is no longer quadratic with the increase of visual tokens, but linear. Surprisingly, we found that after applying such extensive pruning, the capabilities of MLLMs are comparable with the original one and even superior on some benchmarks with only 25% of the computation. This finding opens up the possibility for MLLMs to incorporate much denser visual tokens. Additionally, based on this finding, we further analyzed some architectural design deficiencies in existing MLLMs and proposed promising improvements. To the best of our knowledge, this is the first study to investigate the computational redundancy in the LLM's vision component of MLLMs. Code and checkpoints will be released soon.
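
To make the quadratic-versus-linear claim concrete, the sketch below counts query-key pairs in a single attention layer under two regimes. The pruned regime shown here, in which only text tokens act as queries, is a simplifying assumption chosen for illustration and is not the paper's actual pruning scheme; the function names are hypothetical and causal masking is ignored.

```python
def full_attention_pairs(n_visual, n_text):
    """Query-key pairs when every token attends to every token."""
    n = n_visual + n_text
    return n * n


def pruned_attention_pairs(n_visual, n_text):
    """Query-key pairs when only text tokens issue queries (illustrative assumption).

    Visual tokens still serve as keys, so text can read from them,
    but no visual-to-visual or visual-to-text pairs are computed.
    """
    return n_text * (n_visual + n_text)


if __name__ == "__main__":
    for n_visual in (1000, 2000, 4000):
        print(
            n_visual,
            full_attention_pairs(n_visual, 100),
            pruned_attention_pairs(n_visual, 100),
        )
```

Under this assumption, each extra visual token adds a fixed `n_text` pairs to the pruned count, while the full count grows with the square of the total sequence length.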

Go-SLAM: Grounded Object Segmentation and Localization with Gaussian Splatting SLAM

Sep 25, 2024

Phu Pham, Dipam Patel, Damon Conover, Aniket Bera

Abstract:We introduce Go-SLAM, a novel framework that utilizes 3D Gaussian Splatting SLAM to reconstruct dynamic environments while embedding object-level information within the scene representations. This framework employs advanced object segmentation techniques, assigning a unique identifier to each Gaussian splat that corresponds to the object it represents. Consequently, our system facilitates open-vocabulary querying, allowing users to locate objects using natural language descriptions. Furthermore, the framework features an optimal path generation module that calculates efficient navigation paths for robots toward queried objects, considering obstacles and environmental uncertainties. Comprehensive evaluations in various scene settings demonstrate the effectiveness of our approach in delivering high-fidelity scene reconstructions, precise object segmentation, flexible object querying, and efficient robot path planning. This work represents an additional step forward in bridging the gap between 3D scene reconstruction, semantic object understanding, and real-time environment interactions.

MVGaussian: High-Fidelity text-to-3D Content Generation with Multi-View Guidance and Surface Densification

Sep 10, 2024

Phu Pham, Aradhya N. Mathur, Ojaswa Sharma, Aniket Bera

Abstract:The field of text-to-3D content generation has made significant progress in generating realistic 3D objects, with existing methodologies like Score Distillation Sampling (SDS) offering promising guidance. However, these methods often encounter the "Janus" problem (multi-face ambiguities) due to imprecise guidance. Additionally, while recent advancements in 3D Gaussian splatting have shown its efficacy in representing 3D volumes, optimization of this representation remains largely unexplored. This paper introduces a unified framework for text-to-3D content generation that addresses these critical gaps. Our approach utilizes multi-view guidance to iteratively form the structure of the 3D model, progressively enhancing detail and accuracy. We also introduce a novel densification algorithm that aligns Gaussians close to the surface, optimizing the structural integrity and fidelity of the generated models. Extensive experiments validate our approach, demonstrating that it produces high-quality visual outputs with minimal time cost. Notably, our method achieves high-quality results within half an hour of training, offering a substantial efficiency gain over most existing methods, which require hours of training time to achieve comparable results.

* 13 pages, 10 figures

RL Dreams: Policy Gradient Optimization for Score Distillation based 3D Generation

Dec 08, 2023

Aradhya N. Mathur, Phu Pham, Aniket Bera, Ojaswa Sharma

Abstract:3D generation has rapidly accelerated in the past decade owing to the progress in the field of generative modeling. Score Distillation Sampling (SDS) based rendering has improved 3D asset generation to a great extent. Further, the recent work of Denoising Diffusion Policy Optimization (DDPO) demonstrates that the diffusion process is compatible with policy gradient methods and has been demonstrated to improve the 2D diffusion models using an aesthetic scoring function. We first show that this aesthetic scorer acts as a strong guide for a variety of SDS-based methods and demonstrates its effectiveness in text-to-3D synthesis. Further, we leverage the DDPO approach to improve the quality of the 3D rendering obtained from 2D diffusion models. Our approach, DDPO3D, employs the policy gradient method in tandem with aesthetic scoring. To the best of our knowledge, this is the first method that extends policy gradient methods to 3D score-based rendering and shows improvement across SDS-based methods such as DreamGaussian, which are currently driving research in text-to-3D synthesis. Our approach is compatible with score distillation-based methods, which would facilitate the integration of diverse reward functions into the generative process. Our project page can be accessed via https://ddpo3d.github.io.

DREAM: Decentralized Reinforcement Learning for Exploration and Efficient Energy Management in Multi-Robot Systems

Sep 29, 2023

Dipam Patel, Phu Pham, Kshitij Tiwari, Aniket Bera

Abstract:Resource-constrained robots often suffer from energy inefficiencies, underutilized computational abilities due to inadequate task allocation, and a lack of robustness in dynamic environments, all of which strongly affect their performance. This paper introduces DREAM - Decentralized Reinforcement Learning for Exploration and Efficient Energy Management in Multi-Robot Systems, a comprehensive framework that optimizes the allocation of resources for efficient exploration. It advances beyond conventional heuristic-based task planning. The framework incorporates Operational Range Estimation using Reinforcement Learning to perform exploration and obstacle avoidance in unfamiliar terrains. DREAM further introduces an Energy Consumption Model for goal allocation, thereby ensuring mission completion under constrained resources using a Graph Neural Network. This approach also ensures that the entire Multi-Robot System can survive for an extended period of time for further missions compared to the conventional approach of randomly allocating goals, which compromises one or more agents. Our approach adapts to prioritizing agents in real-time, showcasing remarkable resilience against dynamic environments. This robust solution was evaluated in various simulated environments, demonstrating adaptability and applicability across diverse scenarios. We observed a substantial improvement of about 25% over the baseline method, leading the way for future research in resource-constrained robotics.

* Submitted to 2024 IEEE International Conference on Robotics and Automation (ICRA 2024)

Crowd-Aware Multi-Agent Pathfinding With Boosted Curriculum Reinforcement Learning

Sep 19, 2023

Phu Pham, Aniket Bera

Abstract:Multi-Agent Path Finding (MAPF) in crowded environments presents a challenging problem in motion planning, aiming to find collision-free paths for all agents in the system. MAPF finds a wide range of applications in various domains, including aerial swarms, autonomous warehouse robotics, and self-driving vehicles. The current approaches for MAPF can be broadly categorized into two main categories: centralized and decentralized planning. Centralized planning suffers from the curse of dimensionality and thus does not scale well in large and complex environments. On the other hand, decentralized planning enables agents to engage in real-time path planning within a partially observable environment, demonstrating implicit coordination. However, they suffer from slow convergence and performance degradation in dense environments. In this paper, we introduce CRAMP, a crowd-aware decentralized approach to address this problem by leveraging reinforcement learning guided by a boosted curriculum-based training strategy. We test CRAMP on simulated environments and demonstrate that our method outperforms the state-of-the-art decentralized methods for MAPF on various metrics. CRAMP improves the solution quality up to 58% measured in makespan and collision count, and up to 5% in success rate in comparison to previous methods.

* 8 pages, 3 figures, 1 table

DroNeRF: Real-time Multi-agent Drone Pose Optimization for Computing Neural Radiance Fields

Mar 08, 2023

Dipam Patel, Phu Pham, Aniket Bera

Abstract:We present a novel optimization algorithm called DroNeRF for the autonomous positioning of monocular camera drones around an object for real-time 3D reconstruction using only a few images. Neural Radiance Fields, or NeRF, is a novel view synthesis technique used to generate new views of an object or scene from a set of input images. Using drones in conjunction with NeRF provides a unique and dynamic way to generate novel views of a scene, especially when movement within the scene is restricted. Our approach focuses on calculating an optimized pose for each drone while depending solely on the object geometry, without using any external localization system. The unique camera positioning during the data-capturing phase significantly impacts the quality of the 3D model. To evaluate the quality of our generated novel views, we compute different perceptual metrics like the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM). Our work demonstrates the benefit of using an optimal placement of various drones with limited mobility to generate perceptually better results.
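
For readers unfamiliar with PSNR, one of the metrics named in the abstract above, the sketch below shows the standard definition, PSNR = 10 · log10(MAX² / MSE), applied to flat lists of pixel intensities. The function name, the flat-list input format, and the default `max_value` of 255 (8-bit images) are assumptions for illustration and say nothing about how the authors computed their scores.

```python
import math


def psnr(reference, rendered, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equal-length pixel lists."""
    if not reference or len(reference) != len(rendered):
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - p) ** 2 for r, p in zip(reference, rendered)) / len(reference)
    if mse == 0:
        # Identical images: PSNR is unbounded.
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values indicate that the rendered view is closer to the reference image; identical inputs give an infinite score.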

The Ivory Tower Lost: How College Students Respond Differently than the General Public to the COVID-19 Pandemic

Apr 21, 2020

Viet Duong, Phu Pham, Tongyu Yang, Yu Wang, Jiebo Luo

Abstract:Recently, the pandemic of the novel Coronavirus Disease-2019 (COVID-19) has presented governments with ultimate challenges. In the United States, the country with the highest confirmed COVID-19 infection cases, a nationwide social distancing protocol has been implemented by the President. For the first time in a hundred years since the 1918 flu pandemic, the US population is mandated to stay in their households and avoid public contact. As a result, the majority of public venues and services have ceased their operations. Following the closure of the University of Washington on March 7th, more than a thousand colleges and universities in the United States have cancelled in-person classes and campus activities, impacting millions of students. This paper aims to discover the social implications of this unprecedented disruption in our interactive society regarding both the general public and higher education populations by mining people's opinions on social media. We discover several topics embedded in a large number of COVID-19 tweets that represent the most central issues related to the pandemic, which are of great concerns for both college students and the general public. Moreover, we find significant differences between these two groups of Twitter users with respect to the sentiments they expressed towards the COVID-19 issues. To our best knowledge, this is the first social media-based study which focuses on the college student community's demographics and responses to prevalent social issues during a major crisis.
