Yajie Li

From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

May 12, 2026

Yajie Li, Bozhou Zhang, Chun Gu, Zipei Ma, Jiahui Zhang, Jiankang Deng, Xiatian Zhu, Li Zhang

Abstract: Video generation models offer a promising imagination mechanism for robot manipulation by predicting long-horizon future observations, but effectively exploiting these imagined futures for action execution remains challenging. Existing approaches either condition policies on predicted frames or directly decode generated videos into actions, both suffering from a mismatch between visual realism and control relevance. As a result, predicted observations emphasize perceptual fidelity rather than action-centric causes of state transitions, leading to indirect and unstable control. To address this gap, we propose MoLA (Mixture of Latent Actions), a control-oriented interface that transforms imagined future videos into executable representations. Instead of passing predicted frames directly to the policy, MoLA leverages a mixture of pretrained inverse dynamics models to infer a mixture of latent actions implied by generated visual transitions. These modality-aware inverse dynamics models capture complementary semantic, depth, and flow cues, providing a structured and physically grounded action representation that bridges video imagination and policy execution. We evaluate our approach on simulated benchmarks (LIBERO, CALVIN, and LIBERO-Plus) and real-world robot manipulation tasks, achieving consistent gains in task success, temporal consistency, and generalization.

* ICML 2026

Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement

Nov 15, 2024

Zhennan Chen, Yajie Li, Haofan Wang, Zhibo Chen, Zhengkai Jiang, Jun Li, Qian Wang, Jian Yang, Ying Tai

Abstract: Regional prompting, or compositional generation, which enables fine-grained spatial control, has gained increasing attention for its practicality in real-world applications. However, previous methods either introduce additional trainable modules, and are thus applicable only to specific models, or manipulate score maps within cross-attention layers using attention masks, which limits control strength as the number of regions increases. To address these limitations, we present RAG, a Regional-Aware text-to-image Generation method conditioned on regional descriptions for precise layout composition. RAG decouples multi-region generation into two sub-tasks: the construction of individual regions (Regional Hard Binding), which ensures that each regional prompt is properly executed, and overall detail refinement across regions (Regional Soft Refinement), which dismisses visual boundaries and enhances interactions between adjacent regions. Furthermore, RAG introduces a novel repainting capability: users can modify specific unsatisfactory regions of the previous generation while keeping all other regions unchanged, without relying on additional inpainting models. Our approach is tuning-free and can be applied to other frameworks to enhance their prompt-following ability. Quantitative and qualitative experiments demonstrate that RAG outperforms previous tuning-free methods in attribute binding and object relationships.

* Code is available at https://github.com/NJU-PCALab/RAG-Diffusion

TDiffDe: A Truncated Diffusion Model for Remote Sensing Hyperspectral Image Denoising

Nov 22, 2023

Jiang He, Yajie Li, Jie L, Qiangqiang Yuan

Abstract: Hyperspectral images play a crucial role in precision agriculture, environmental monitoring, and ecological analysis. However, due to sensor equipment and the imaging environment, observed hyperspectral images are often corrupted by various types of noise. In this study, we propose a truncated diffusion model, called TDiffDe, to gradually recover the useful information in hyperspectral images. In hyperspectral image denoising, the input data already contains image information rather than being pure noise. Thus, we truncate the trained diffusion model so that it starts from a small number of steps, avoiding the destruction of valid information.

Cluster-based Method for Eavesdropping Identification and Localization in Optical Links

Sep 25, 2023

Haokun Song, Rui Lin, Andrea Sgambelluri, Filippo Cugini, Yajie Li, Jie Zhang, Paolo Monti

Abstract: We propose a cluster-based method to detect and locate eavesdropping events in optical line systems characterized by small power losses. Our findings indicate that detecting such subtle losses from eavesdropping can be accomplished solely through optical performance monitoring (OPM) data collected at the receiver. On the other hand, the localization of such events can be effectively achieved by leveraging in-line OPM data.

* 4 pages, 6 figures, Asia Communications and Photonics Conference (ACP) 2023
