Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Rao Fu

RflyUT-Sim: A Simulation Platform for Development and Testing of Complex Low-Altitude Traffic Control

Dec 30, 2025

Zonghan Li, Tianwen Tao, Rao Fu, Liang Wang, Dongyuan Zhang, Quan Quan

Abstract:Significant challenges are posed by simulation and testing in the field of low-altitude unmanned aerial vehicle (UAV) traffic due to the high costs associated with large-scale UAV testing and the complexity of establishing low-altitude traffic test scenarios. Stringent safety requirements make high fidelity one of the key metrics for simulation platforms. Despite advancements in simulation platforms for low-altitude UAVs, there is still a shortage of platforms that feature rich traffic scenarios, high-precision UAV and scenario simulators, and comprehensive testing capabilities for low-altitude traffic. Therefore, this paper introduces an integrated high-fidelity simulation platform for low-altitude UAV traffic. This platform simulates all components of the UAV traffic network, including the control system, the traffic management system, the UAV system, the communication network , the anomaly and fault modules, etc. Furthermore, it integrates RflySim/AirSim and Unreal Engine 5 to develop full-state models of UAVs and 3D maps that model the real world using the oblique photogrammetry technique. Additionally, the platform offers a wide range of interfaces, and all models and scenarios can be customized with a high degree of flexibility. The platform's source code has been released, making it easier to conduct research related to low-altitude traffic.

Via

Access Paper or Ask Questions

OPENTOUCH: Bringing Full-Hand Touch to Real-World Interaction

Dec 18, 2025

Yuxin Ray Song, Jinzhou Li, Rao Fu, Devin Murphy, Kaichen Zhou, Rishi Shiv, Yaqi Li, Haoyu Xiong, Crystal Elaine Owens, Yilun Du(+5 more)

Abstract:The human hand is our primary interface to the physical world, yet egocentric perception rarely knows when, where, or how forcefully it makes contact. Robust wearable tactile sensors are scarce, and no existing in-the-wild datasets align first-person video with full-hand touch. To bridge the gap between visual perception and physical interaction, we present OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, containing 5.1 hours of synchronized video-touch-pose data and 2,900 curated clips with detailed text annotations. Using OpenTouch, we introduce retrieval and classification benchmarks that probe how touch grounds perception and action. We show that tactile signals provide a compact yet powerful cue for grasp understanding, strengthen cross-modal alignment, and can be reliably retrieved from in-the-wild video queries. By releasing this annotated vision-touch-pose dataset and benchmark, we aim to advance multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation.

* https://opentouch-tactile.github.io/

Via

Access Paper or Ask Questions

Art3D: Training-Free 3D Generation from Flat-Colored Illustration

Apr 14, 2025

Xiaoyan Cong, Jiayi Shen, Zekun Li, Rao Fu, Tao Lu, Srinath Sridhar

Abstract:Large-scale pre-trained image-to-3D generative models have exhibited remarkable capabilities in diverse shape generations. However, most of them struggle to synthesize plausible 3D assets when the reference image is flat-colored like hand drawings due to the lack of 3D illusion, which are often the most user-friendly input modalities in art content creation. To this end, we propose Art3D, a training-free method that can lift flat-colored 2D designs into 3D. By leveraging structural and semantic features with pre- trained 2D image generation models and a VLM-based realism evaluation, Art3D successfully enhances the three-dimensional illusion in reference images, thus simplifying the process of generating 3D from 2D, and proves adaptable to a wide range of painting styles. To benchmark the generalization performance of existing image-to-3D models on flat-colored images without 3D feeling, we collect a new dataset, Flat-2D, with over 100 samples. Experimental results demonstrate the performance and robustness of Art3D, exhibiting superior generalizable capacity and promising practical applicability. Our source code and dataset will be publicly available on our project page: https://joy-jy11.github.io/ .

* Technical Report. Course Project of Brown CSCI 1430 Computer Vision. Project Page: https://joy-jy11.github.io/

Via

Access Paper or Ask Questions

GigaHands: A Massive Annotated Dataset of Bimanual Hand Activities

Dec 05, 2024

Rao Fu, Dingxi Zhang, Alex Jiang, Wanjia Fu, Austin Funk, Daniel Ritchie, Srinath Sridhar

Figure 1 for GigaHands: A Massive Annotated Dataset of Bimanual Hand Activities

Figure 2 for GigaHands: A Massive Annotated Dataset of Bimanual Hand Activities

Figure 3 for GigaHands: A Massive Annotated Dataset of Bimanual Hand Activities

Figure 4 for GigaHands: A Massive Annotated Dataset of Bimanual Hand Activities

Abstract:Understanding bimanual human hand activities is a critical problem in AI and robotics. We cannot build large models of bimanual activities because existing datasets lack the scale, coverage of diverse hand activities, and detailed annotations. We introduce GigaHands, a massive annotated dataset capturing 34 hours of bimanual hand activities from 56 subjects and 417 objects, totaling 14k motion clips derived from 183 million frames paired with 84k text annotations. Our markerless capture setup and data acquisition protocol enable fully automatic 3D hand and object estimation while minimizing the effort required for text annotation. The scale and diversity of GigaHands enable broad applications, including text-driven action synthesis, hand motion captioning, and dynamic radiance field reconstruction.

Via

Access Paper or Ask Questions

ScratchEval: Are GPT-4o Smarter than My Child? Evaluating Large Multimodal Models with Visual Programming Challenges

Nov 28, 2024

Rao Fu, Ziyang Luo, Hongzhan Lin, Zhen Ye, Jing Ma

Abstract:Recent advancements in large multimodal models (LMMs) have showcased impressive code generation capabilities, primarily evaluated through image-to-code benchmarks. However, these benchmarks are limited to specific visual programming scenarios where the logic reasoning and the multimodal understanding capacities are split apart. To fill this gap, we propose ScratchEval, a novel benchmark designed to evaluate the visual programming reasoning ability of LMMs. ScratchEval is based on Scratch, a block-based visual programming language widely used in children's programming education. By integrating visual elements and embedded programming logic, ScratchEval requires the model to process both visual information and code structure, thereby comprehensively evaluating its programming intent understanding ability. Our evaluation approach goes beyond the traditional image-to-code mapping and focuses on unified logical thinking and problem-solving abilities, providing a more comprehensive and challenging framework for evaluating the visual programming ability of LMMs. ScratchEval not only fills the gap in existing evaluation methods, but also provides new insights for the future development of LMMs in the field of visual programming. Our benchmark can be accessed at https://github.com/HKBUNLP/ScratchEval .

Via

Access Paper or Ask Questions

Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning

Mar 22, 2024

Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, Wenhan Xiong

Figure 1 for Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning

Figure 2 for Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning

Figure 3 for Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning

Figure 4 for Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning

Abstract:This paper introduces Scene-LLM, a 3D-visual-language model that enhances embodied agents' abilities in interactive 3D indoor environments by integrating the reasoning strengths of Large Language Models (LLMs). Scene-LLM adopts a hybrid 3D visual feature representation, that incorporates dense spatial information and supports scene state updates. The model employs a projection layer to efficiently project these features in the pre-trained textual embedding space, enabling effective interpretation of 3D visual information. Unique to our approach is the integration of both scene-level and ego-centric 3D information. This combination is pivotal for interactive planning, where scene-level data supports global planning and ego-centric data is important for localization. Notably, we use ego-centric 3D frame features for feature alignment, an efficient technique that enhances the model's ability to align features of small objects within the scene. Our experiments with Scene-LLM demonstrate its strong capabilities in dense captioning, question answering, and interactive planning. We believe Scene-LLM advances the field of 3D visual understanding and reasoning, offering new possibilities for sophisticated agent interactions in indoor settings.

Via

Access Paper or Ask Questions

Deep Generative Modeling for Financial Time Series with Application in VaR: A Comparative Review

Jan 18, 2024

Lars Ericson, Xuejun Zhu, Xusi Han, Rao Fu, Shuang Li, Steve Guo, Ping Hu

Abstract:In the financial services industry, forecasting the risk factor distribution conditional on the history and the current market environment is the key to market risk modeling in general and value at risk (VaR) model in particular. As one of the most widely adopted VaR models in commercial banks, Historical simulation (HS) uses the empirical distribution of daily returns in a historical window as the forecast distribution of risk factor returns in the next day. The objectives for financial time series generation are to generate synthetic data paths with good variety, and similar distribution and dynamics to the original historical data. In this paper, we apply multiple existing deep generative methods (e.g., CGAN, CWGAN, Diffusion, and Signature WGAN) for conditional time series generation, and propose and test two new methods for conditional multi-step time series generation, namely Encoder-Decoder CGAN and Conditional TimeVAE. Furthermore, we introduce a comprehensive framework with a set of KPIs to measure the quality of the generated time series for financial modeling. The KPIs cover distribution distance, autocorrelation and backtesting. All models (HS, parametric and neural networks) are tested on both historical USD yield curve data and additional data simulated from GARCH and CIR processes. The study shows that top performing models are HS, GARCH and CWGAN models. Future research directions in this area are also discussed.

Via

Access Paper or Ask Questions

AnyHome: Open-Vocabulary Generation of Structured and Textured 3D Homes

Dec 11, 2023

Zehao Wen, Zichen Liu, Srinath Sridhar, Rao Fu

Abstract:We introduce AnyHome, a framework that translates open-vocabulary descriptions, ranging from simple labels to elaborate paragraphs, into well-structured and textured 3D indoor scenes at a house-scale. Inspired by cognition theories, AnyHome employs an amodal structured representation to capture 3D spatial cues from textual narratives and then uses egocentric inpainting to enrich these scenes. To this end, we begin by using specially designed template prompts for Large Language Models (LLMs), which enable precise control over the textual input. We then utilize intermediate representations to maintain the spatial structure's consistency, ensuring that the 3D scenes align closely with the textual description. Then, we apply a Score Distillation Sampling process to refine the placement of objects. Lastly, an egocentric inpainting process is incorporated to enhance the realism and appearance of the scenes. AnyHome stands out due to its hierarchical structured representation combined with the versatility of open-vocabulary text interpretation. This allows for extensive customization of indoor scenes at various levels of granularity. We demonstrate that AnyHome can reliably generate a range of diverse indoor scenes, characterized by their detailed spatial structures and textures, all corresponding to the free-form textual inputs.

Via

Access Paper or Ask Questions

Patch-Wise Point Cloud Generation: A Divide-and-Conquer Approach

Jul 22, 2023

Cheng Wen, Baosheng Yu, Rao Fu, Dacheng Tao

Figure 1 for Patch-Wise Point Cloud Generation: A Divide-and-Conquer Approach

Figure 2 for Patch-Wise Point Cloud Generation: A Divide-and-Conquer Approach

Figure 3 for Patch-Wise Point Cloud Generation: A Divide-and-Conquer Approach

Figure 4 for Patch-Wise Point Cloud Generation: A Divide-and-Conquer Approach

Abstract:A generative model for high-fidelity point clouds is of great importance in synthesizing 3d environments for applications such as autonomous driving and robotics. Despite the recent success of deep generative models for 2d images, it is non-trivial to generate 3d point clouds without a comprehensive understanding of both local and global geometric structures. In this paper, we devise a new 3d point cloud generation framework using a divide-and-conquer approach, where the whole generation process can be divided into a set of patch-wise generation tasks. Specifically, all patch generators are based on learnable priors, which aim to capture the information of geometry primitives. We introduce point- and patch-wise transformers to enable the interactions between points and patches. Therefore, the proposed divide-and-conquer approach contributes to a new understanding of point cloud generation from the geometry constitution of 3d shapes. Experimental results on a variety of object categories from the most popular point cloud dataset, ShapeNet, show the effectiveness of the proposed patch-wise point cloud generation, where it clearly outperforms recent state-of-the-art methods for high-fidelity point cloud generation.

Via

Access Paper or Ask Questions

BPNet: Bézier Primitive Segmentation on 3D Point Clouds

Jul 08, 2023

Rao Fu, Cheng Wen, Qian Li, Xiao Xiao, Pierre Alliez

Abstract:This paper proposes BPNet, a novel end-to-end deep learning framework to learn B\'ezier primitive segmentation on 3D point clouds. The existing works treat different primitive types separately, thus limiting them to finite shape categories. To address this issue, we seek a generalized primitive segmentation on point clouds. Taking inspiration from B\'ezier decomposition on NURBS models, we transfer it to guide point cloud segmentation casting off primitive types. A joint optimization framework is proposed to learn B\'ezier primitive segmentation and geometric fitting simultaneously on a cascaded architecture. Specifically, we introduce a soft voting regularizer to improve primitive segmentation and propose an auto-weight embedding module to cluster point features, making the network more robust and generic. We also introduce a reconstruction module where we successfully process multiple CAD models with different primitives simultaneously. We conducted extensive experiments on the synthetic ABC dataset and real-scan datasets to validate and compare our approach with different baseline methods. Experiments show superior performance over previous work in terms of segmentation, with a substantially faster inference speed.

Via

Access Paper or Ask Questions