Heng Yang

ETH Zürich

Compose by Focus: Scene Graph-based Atomic Skills

Sep 19, 2025

Han Qi, Changhe Chen, Heng Yang

Abstract: A key requirement for generalist robots is compositional generalization: the ability to combine atomic skills to solve complex, long-horizon tasks. While prior work has primarily focused on synthesizing a planner that sequences pre-learned skills, robust execution of the individual skills themselves remains challenging, as visuomotor policies often fail under distribution shifts induced by scene composition. To address this, we introduce a scene graph-based representation that focuses on task-relevant objects and relations, thereby mitigating sensitivity to irrelevant variation. Building on this idea, we develop a scene-graph skill learning framework that integrates graph neural networks with diffusion-based imitation learning, and further combine "focused" scene-graph skills with a vision-language model (VLM) based task planner. Experiments in both simulation and real-world manipulation tasks demonstrate substantially higher success rates than state-of-the-art baselines, highlighting improved robustness and compositional generalization in long-horizon tasks.

Leveraging Correlation Across Test Platforms for Variance-Reduced Metric Estimation

Jun 25, 2025

Rachel Luo, Heng Yang, Michael Watson, Apoorva Sharma, Sushant Veer, Edward Schmerling, Marco Pavone

Abstract: Learning-based robotic systems demand rigorous validation to assure reliable performance, but extensive real-world testing is often prohibitively expensive and, even when conducted, may still yield insufficient data for high-confidence guarantees. In this work, we introduce a general estimation framework that leverages paired data across test platforms, e.g., paired simulation and real-world observations, to achieve better estimates of real-world metrics via the method of control variates. By incorporating cheap and abundant auxiliary measurements (for example, simulator outputs) as control variates for costly real-world samples, our method provably reduces the variance of Monte Carlo estimates and thus requires significantly fewer real-world samples to attain a specified confidence bound on the mean performance. We provide theoretical analysis characterizing the variance and sample-efficiency improvement, and demonstrate empirically in autonomous driving and quadruped robotics settings that our approach achieves high-probability bounds with markedly improved sample efficiency. Our technique can lower the real-world testing burden for validating the performance of the stack, thereby enabling more efficient and cost-effective experimental evaluation of robotic systems.
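The control-variate idea in this abstract can be illustrated with a minimal sketch. This is the textbook control-variate estimator, not the paper's implementation: the function names, the synthetic data model, and the assumption that the simulator mean is known exactly are all hypothetical choices made for illustration.

```python
import random
import statistics


def control_variate_mean(real, sim_paired, sim_mean):
    """Estimate the mean of `real` using paired simulator outputs as a control variate.

    real, sim_paired: paired samples of equal length (one pair per test scenario).
    sim_mean: mean of the simulator metric, assumed known here (in practice it
              could be estimated from many cheap simulation-only runs).
    """
    n = len(real)
    mean_real = statistics.fmean(real)
    mean_sim = statistics.fmean(sim_paired)
    # Sample covariance between the real and simulated metric.
    cov = sum((r - mean_real) * (s - mean_sim) for r, s in zip(real, sim_paired)) / (n - 1)
    var_sim = statistics.variance(sim_paired)
    # Variance-minimizing coefficient, estimated from the paired data.
    beta = cov / var_sim if var_sim > 0 else 0.0
    # Shift the plain average by how far the paired simulator average is off.
    return mean_real - beta * (mean_sim - sim_mean)


def compare_variances(trials=300, n=30, seed=0):
    """Return (variance of plain mean, variance of control-variate mean) on synthetic data."""
    rng = random.Random(seed)
    plain, corrected = [], []
    for _ in range(trials):
        # Hypothetical data model: the real metric is strongly correlated with the simulator.
        sim = [rng.gauss(0.0, 1.0) for _ in range(n)]
        real = [0.5 + 0.9 * s + rng.gauss(0.0, 0.3) for s in sim]
        plain.append(statistics.fmean(real))
        corrected.append(control_variate_mean(real, sim, sim_mean=0.0))
    return statistics.variance(plain), statistics.variance(corrected)
```

Calling `compare_variances()` on this synthetic setup gives a noticeably smaller variance for the corrected estimator than for the plain sample mean, because the simulator output explains most of the spread in the real metric. The weaker the correlation between platforms, the smaller the gain.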

FedGraM: Defending Against Untargeted Attacks in Federated Learning via Embedding Gram Matrix

May 20, 2025

Di Wu, Qian Li, Heng Yang, Yong Han

Abstract: Federated Learning (FL) enables geographically distributed clients to collaboratively train machine learning models by sharing only their local models, ensuring data privacy. However, FL is vulnerable to untargeted attacks that aim to degrade the global model's performance on the underlying data distribution. Existing defense mechanisms attempt to improve FL's resilience against such attacks, but their effectiveness is limited in practical FL environments due to data heterogeneity. In contrast, we aim to detect and remove the attacks to mitigate their impact. Generalization contribution plays a crucial role in distinguishing untargeted attacks. Our observations indicate that, with limited data, the divergence between embeddings representing different classes provides a better measure of generalization than direct accuracy. In light of this, we propose a novel robust aggregation method, FedGraM, designed to defend against untargeted attacks in FL. The server maintains an auxiliary dataset containing one sample per class to support aggregation. This dataset is fed to the local models to extract embeddings. Then, the server calculates the norm of the Gram matrix of the embeddings for each local model. The norm serves as an indicator of each model's inter-class separation capability in the embedding space. FedGraM identifies and removes potentially malicious models by filtering out those with the largest norms, then averages the remaining local models to form the global model. We conduct extensive experiments to evaluate the performance of FedGraM. Our empirical results show that with limited data samples used to construct the auxiliary dataset, FedGraM achieves exceptional performance, outperforming state-of-the-art defense methods.
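The aggregation steps described in this abstract can be sketched roughly as follows. The abstract does not say which matrix norm is used or how many models are filtered, so the Frobenius norm and the `num_drop` parameter are assumptions. The flat parameter lists and function names are likewise hypothetical stand-ins, not the authors' code.

```python
import math


def gram_norm(embeddings):
    """Frobenius norm (an assumed choice) of the Gram matrix G[i][j] = <e_i, e_j>.

    embeddings: one embedding vector per class, produced by a single local model.
    """
    total = 0.0
    for a in embeddings:
        for b in embeddings:
            dot = sum(x * y for x, y in zip(a, b))
            total += dot * dot
    return math.sqrt(total)


def filter_and_average(models, embeddings_per_model, num_drop):
    """Drop the num_drop models with the largest Gram norms and average the rest.

    models: list of flat parameter lists (hypothetical model representation).
    embeddings_per_model: for each model, its embeddings of the auxiliary samples.
    Returns (averaged parameters, indices of the models that were kept).
    """
    if not 0 <= num_drop < len(models):
        raise ValueError("num_drop must leave at least one model")
    norms = [gram_norm(e) for e in embeddings_per_model]
    # Sort model indices by norm, smallest first, and keep the lowest ones.
    order = sorted(range(len(models)), key=lambda i: norms[i])
    kept = sorted(order[: len(models) - num_drop])
    dim = len(models[0])
    averaged = [sum(models[i][d] for i in kept) / len(kept) for d in range(dim)]
    return averaged, kept
```

As a toy check, a model whose class embeddings collapse onto one direction, such as `[[1, 0], [1, 0]]`, has a larger Gram norm than a model with well-separated embeddings, such as `[[1, 0], [0, 1]]`. The collapsed model is therefore the one filtered out.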

OmniGenBench: A Modular Platform for Reproducible Genomic Foundation Models Benchmarking

May 20, 2025

Heng Yang, Jack Cole, Yuan Li, Renzhi Chen, Geyong Min, Ke Li

Abstract: The code of nature, embedded in DNA and RNA genomes since the origin of life, holds immense potential to impact both humans and ecosystems through genome modeling. Genomic Foundation Models (GFMs) have emerged as a transformative approach to decoding the genome. As GFMs scale up and reshape the landscape of AI-driven genomics, the field faces an urgent need for rigorous and reproducible evaluation. We present OmniGenBench, a modular benchmarking platform designed to unify the data, model, benchmarking, and interpretability layers across GFMs. OmniGenBench enables standardized, one-command evaluation of any GFM across five benchmark suites, with seamless integration of over 31 open-source models. Through automated pipelines and community-extensible features, the platform addresses critical reproducibility challenges, including data transparency, model interoperability, benchmark fragmentation, and black-box interpretability. OmniGenBench aims to serve as foundational infrastructure for reproducible genomic AI research, accelerating trustworthy discovery and collaborative innovation in the era of genome-scale modeling.

Image-Editing Specialists: An RLAIF Approach for Diffusion Models

Apr 17, 2025

Elior Benarous, Yilun Du, Heng Yang

Abstract: We present a novel approach to training specialized instruction-based image-editing diffusion models, addressing key challenges in structural preservation with input images and semantic alignment with user prompts. We introduce an online reinforcement learning framework that aligns the diffusion model with human preferences without relying on extensive human annotations or curating a large dataset. Our method significantly improves the realism and alignment with instructions in two ways. First, the proposed models achieve precise and structurally coherent modifications in complex scenes while maintaining high fidelity in instruction-irrelevant areas. Second, they capture fine nuances in the desired edit by leveraging a visual prompt, enabling detailed control over visual edits without lengthy textual prompts. This approach simplifies users' efforts to achieve highly specific edits, requiring only 5 reference images depicting a certain concept for training. Experimental results demonstrate that our models can perform intricate edits in complex scenes after just 10 training steps. Finally, we showcase the versatility of our method by applying it to robotics, where enhancing the visual realism of simulated environments through targeted sim-to-real image edits improves their utility as proxies for real-world settings.

Online Aggregation of Trajectory Predictors

Feb 11, 2025

Alex Tong, Apoorva Sharma, Sushant Veer, Marco Pavone, Heng Yang

Abstract: Trajectory prediction, the task of forecasting future agent behavior from past data, is central to safe and efficient autonomous driving. A diverse set of methods (e.g., rule-based or learned with different architectures and datasets) have been proposed, yet the performance of these methods is often sensitive to the deployment environment (e.g., how well the design rules model the environment, or how accurately the test data match the training data). Building upon the principled theory of online convex optimization but also going beyond convexity and stationarity, we present a lightweight and model-agnostic method to aggregate different trajectory predictors online. We propose treating each individual trajectory predictor as an "expert" and maintaining a probability vector to mix the outputs of different experts. Then, the key technical approach lies in leveraging online data (the true agent behavior to be revealed at the next timestep) to form a convex-or-nonconvex, stationary-or-dynamic loss function whose gradient steers the probability vector towards choosing the best mixture of experts. We instantiate this method to aggregate trajectory predictors trained on different cities in the nuScenes dataset and show that it performs at least as well as any single model, even when deployed on the out-of-distribution Lyft dataset.
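The expert-mixing idea in this abstract can be sketched with an exponentiated-gradient update, a standard online convex optimization method that keeps the weights on the probability simplex. It is used here purely as an illustration and is not necessarily the update rule in the paper. The squared-error loss, the learning rate, and all function names are assumptions.

```python
import math


def mix(weights, predictions):
    """Weighted combination of expert predictions (each prediction is a list of floats)."""
    dim = len(predictions[0])
    return [sum(w * p[d] for w, p in zip(weights, predictions)) for d in range(dim)]


def eg_update(weights, predictions, truth, lr=0.5):
    """One exponentiated-gradient step once the true behavior `truth` is revealed.

    The loss is the squared error of the mixed prediction (an assumed choice).
    """
    mixed = mix(weights, predictions)
    residual = [m - t for m, t in zip(mixed, truth)]
    # Gradient of ||sum_k w_k * p_k - truth||^2 with respect to each weight w_k.
    grads = [2.0 * sum(r * pd for r, pd in zip(residual, p)) for p in predictions]
    # Multiplicative update followed by renormalization onto the simplex.
    unnormalized = [w * math.exp(-lr * g) for w, g in zip(weights, grads)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]
```

In a toy run where one expert always matches the revealed truth and another does not, repeated calls to `eg_update` shift the probability mass toward the accurate expert while the weights continue to sum to one.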

* 9 pages, 7 figures

Building Rome with Convex Optimization

Feb 10, 2025

Haoyu Han, Heng Yang

Abstract: Global bundle adjustment is made easy by depth prediction and convex optimization. We (i) propose a scaled bundle adjustment (SBA) formulation that lifts 2D keypoint measurements to 3D with learned depth, (ii) design an empirically tight convex semidefinite program (SDP) relaxation that solves SBA to certifiable global optimality, (iii) solve the SDP relaxations at extreme scale with Burer-Monteiro factorization and a CUDA-based trust-region Riemannian optimizer (dubbed XM), (iv) build a structure from motion (SfM) pipeline with XM as the optimization engine and show that XM-SfM dominates or compares favorably with existing SfM pipelines in terms of reconstruction quality while being faster, more scalable, and initialization-free.

Global Contact-Rich Planning with Sparsity-Rich Semidefinite Relaxations

Feb 06, 2025

Shucheng Kang, Guorui Liu, Heng Yang

Abstract: We show that contact-rich motion planning is also sparsity-rich when viewed as polynomial optimization (POP). We can exploit not only the correlative and term sparsity patterns that are general to all POPs, but also specialized sparsity patterns from the robot kinematic structure and the separability of contact modes. Such sparsity enables the design of high-order but sparse semidefinite programming (SDP) relaxations, building upon Lasserre's moment and sums of squares hierarchy, that (i) can be solved in seconds by off-the-shelf SDP solvers, and (ii) compute near globally optimal solutions to the nonconvex contact-rich planning problems with small certified suboptimality. Through extensive experiments both in simulation (Push Bot, Push Box, Push Box with Obstacles, and Planar Hand) and real world (Push T), we demonstrate the power of using convex SDP relaxations to generate global contact-rich motion plans. As a contribution of independent interest, we release the Sparse Polynomial Optimization Toolbox (SPOT), implemented in C++ with interfaces to both Python and Matlab, that automates sparsity exploitation for robotics and beyond.

* Website: https://computationalrobotics.seas.harvard.edu/project-spot/

On the Surprising Robustness of Sequential Convex Optimization for Contact-Implicit Motion Planning

Feb 03, 2025

Yulin Li, Haoyu Han, Shucheng Kang, Jun Ma, Heng Yang

Abstract: Contact-implicit motion planning, which embeds contact sequencing as implicit complementarity constraints, holds the promise of leveraging continuous optimization to discover new contact patterns online. Nevertheless, the resulting optimization, being an instance of Mathematical Programming with Complementarity Constraints, fails the classical constraint qualifications that are crucial for the convergence of popular numerical solvers. We present robust contact-implicit motion planning with sequential convex programming (CRISP), a solver that departs from the usual primal-dual algorithmic framework and instead focuses only on the primal problem. CRISP solves a convex quadratic program with an adaptive trust region radius at each iteration, and its convergence is evaluated by a merit function using a weighted penalty. We (i) provide sufficient conditions on CRISP's convergence to first-order stationary points of the merit function; (ii) release a high-performance C++ implementation of CRISP with a generic nonlinear programming interface; and (iii) demonstrate CRISP's surprising robustness in solving contact-implicit planning with naive initialization. In fact, CRISP solves several contact-implicit problems with all-zero initialization.

LoRA3D: Low-Rank Self-Calibration of 3D Geometric Foundation Models

Dec 10, 2024

Ziqi Lu, Heng Yang, Danfei Xu, Boyi Li, Boris Ivanovic, Marco Pavone, Yue Wang

Abstract: Emerging 3D geometric foundation models, such as DUSt3R, offer a promising approach for in-the-wild 3D vision tasks. However, due to the high-dimensional nature of the problem space and scarcity of high-quality 3D data, these pre-trained models still struggle to generalize to many challenging circumstances, such as limited view overlap or low lighting. To address this, we propose LoRA3D, an efficient self-calibration pipeline to specialize the pre-trained models to target scenes using their own multi-view predictions. Taking sparse RGB images as input, we leverage robust optimization techniques to refine multi-view predictions and align them into a global coordinate frame. In particular, we incorporate prediction confidence into the geometric optimization process, automatically re-weighting the confidence to better reflect point estimation accuracy. We use the calibrated confidence to generate high-quality pseudo labels for the calibrating views and use low-rank adaptation (LoRA) to fine-tune the models on the pseudo-labeled data. Our method does not require any external priors or manual labels. It completes the self-calibration process on a single standard GPU within just 5 minutes. Each low-rank adapter requires only 18MB of storage. We evaluated our method on more than 160 scenes from the Replica, TUM and Waymo Open datasets, achieving up to 88% performance improvement on 3D reconstruction, multi-view pose estimation and novel-view rendering.
