LAAS-SARA
Abstract:Training reinforcement learning (RL) policies for legged robots remains challenging due to high-dimensional continuous actions, hardware constraints, and limited exploration. Existing methods for locomotion and whole-body control work well for position-based control with environment-specific heuristics (e.g., reward shaping, curriculum design, and manual initialization), but are less effective for torque-based control, where sufficiently exploring the action space and obtaining informative gradient signals for training is significantly more difficult. We introduce Growing Policy Optimization (GPO), a training framework that applies a time-varying action transformation to restrict the effective action space in the early stage, thereby encouraging more effective data collection and policy learning, and then progressively expands it to enhance exploration and achieve higher expected return. We prove that this transformation preserves the PPO update rule and introduces only bounded, vanishing gradient distortion, thereby ensuring stable training. We evaluate GPO on both quadruped and hexapod robots, including zero-shot deployment of simulation-trained policies on hardware. Policies trained with GPO consistently achieve better performance. These results suggest that GPO provides a general, environment-agnostic optimization framework for learning legged locomotion.
Abstract:User queries in real-world retrieval are often non-faithful (noisy, incomplete, or distorted), causing retrievers to fail when key semantics are missing. We formalize this as retrieval under recall noise, where the observed query is drawn from a noisy recall process of a latent target item. To address this, we propose QUARK, a simple yet effective training-free framework for robust retrieval under non-faithful queries. QUARK explicitly models query uncertainty through recovery hypotheses, i.e., multiple plausible interpretations of the latent intent given the observed query, and introduces query-anchored aggregation to combine their signals robustly. The original query serves as a semantic anchor, while recovery hypotheses provide controlled auxiliary evidence, preventing semantic drift and hypothesis hijacking. This design enables QUARK to improve recall and ranking quality without sacrificing robustness, even when some hypotheses are noisy or uninformative. Across controlled simulations and BEIR benchmarks (FIQA, SciFact, NFCorpus) with both sparse and dense retrievers, QUARK improves Recall, MRR, and nDCG over the base retriever. Ablations show QUARK is robust to the number of recovery hypotheses and that anchored aggregation outperforms unanchored max/mean/median pooling. These results demonstrate that modeling query uncertainty through recovery hypotheses, coupled with principled anchored aggregation, is essential for robust retrieval under non-faithful queries.
Abstract:Instruction tuning is a standard paradigm for adapting large language models (LLMs), but modern instruction datasets are large, noisy, and redundant, making full-data fine-tuning costly and often unnecessary. Existing data selection methods either build expensive gradient datastores or assign static scores from a weak proxy, largely ignoring evolving uncertainty, and thus missing a key source of LLM interpretability. We propose GRADFILTERING, an objective-agnostic, uncertainty-aware data selection framework that utilizes a small GPT-2 proxy with a LoRA ensemble and aggregates per-example gradients into a Gradient Signal-to-Noise Ratio (G-SNR) utility. Our method matches or surpasses random subsets and strong baselines in most LLM-as-a-judge evaluations as well as in human assessment. Moreover, GRADFILTERING-selected subsets converge faster than competitive filters under the same compute budget, reflecting the benefit of uncertainty-aware scoring.
Abstract:Diffusion-based planners have emerged as a promising approach for human-like trajectory generation in autonomous driving. Recent works incorporate reinforcement fine-tuning to enhance the robustness of diffusion planners through reward-oriented optimization in a generation-evaluation loop. However, they struggle to generate multi-modal, scenario-adaptive trajectories, hindering the exploitation efficiency of informative rewards during fine-tuning. To resolve this, we propose PlannerRFT, a sample-efficient reinforcement fine-tuning framework for diffusion-based planners. PlannerRFT adopts a dual-branch optimization that simultaneously refines the trajectory distribution and adaptively guides the denoising process toward more promising exploration, without altering the original inference pipeline. To support parallel learning at scale, we develop nuMax, an optimized simulator that achieves 10 times faster rollout compared to native nuPlan. Extensive experiments shows that PlannerRFT yields state-of-the-art performance with distinct behaviors emerging during the learning process.
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) is crucial for advancing large-scale reasoning models. However, existing parameter-efficient methods, such as PiSSA and MiLoRA, are designed for Supervised Fine-Tuning (SFT) and do not account for the distinct optimization dynamics and geometric structures of RLVR. Applying these methods directly leads to spectral collapse and optimization instability, which severely limit model performance. Meanwhile, alternative approaches that leverage update sparsity encounter significant efficiency bottlenecks on modern hardware due to unstructured computations. To address these challenges, we propose GeoRA (Geometry-Aware Low-Rank Adaptation), which exploits the anisotropic and compressible nature of RL update subspaces. GeoRA initializes adapters by extracting principal directions via Singular Value Decomposition (SVD) within a geometrically constrained subspace while freezing the residual components. This method preserves the pre-trained geometric structure and enables efficient GPU computation through dense operators. Experiments on Qwen and Llama demonstrate that GeoRA mitigates optimization bottlenecks caused by geometric misalignment. It consistently outperforms established low-rank baselines on key mathematical benchmarks, achieving state-of-the-art (SOTA) results. Moreover, GeoRA shows superior generalization and resilience to catastrophic forgetting in out-of-domain tasks.
Abstract:Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we implement meticulous post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by over 10X. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?type=GenVideo.




Abstract:Ensuring fairness in the coordination of connected and automated vehicles at intersections is essential for equitable access, social acceptance, and long-term system efficiency, yet it remains underexplored in safety-critical, real-time traffic control. This paper proposes a fairness-aware hierarchical control framework that explicitly integrates inequity aversion into intersection management. At the top layer, a centralized allocation module assigns control authority (i.e., selects a single vehicle to execute its trajectory) by maximizing a utility that accounts for waiting time, urgency, control history, and velocity deviation. At the bottom layer, the authorized vehicle executes a precomputed trajectory using a Linear Quadratic Regulator (LQR) and applies a high-order Control Barrier Function (HOCBF)-based safety filter for real-time collision avoidance. Simulation results across varying traffic demands and demand distributions demonstrate that the proposed framework achieves near-perfect fairness, eliminates collisions, reduces average delay, and maintains real-time feasibility. These results highlight that fairness can be systematically incorporated without sacrificing safety or performance, enabling scalable and equitable coordination for future autonomous traffic systems.
Abstract:A key challenge in scientific machine learning is solving partial differential equations (PDEs) on complex domains, where the curved geometry complicates the approximation of functions and their derivatives required by differential operators. This paper establishes the first simultaneous approximation theory for deep neural networks on manifolds. We prove that a constant-depth $\mathrm{ReLU}^{k-1}$ network with bounded weights--a property that plays a crucial role in controlling generalization error--can approximate any function in the Sobolev space $\mathcal{W}_p^{k}(\mathcal{M}^d)$ to an error of $\varepsilon$ in the $\mathcal{W}_p^{s}(\mathcal{M}^d)$ norm, for $k\geq 3$ and $s<k$, using $\mathcal{O}(\varepsilon^{-d/(k-s)})$ nonzero parameters, a rate that overcomes the curse of dimensionality by depending only on the intrinsic dimension $d$. These results readily extend to functions in H\"older-Zygmund spaces. We complement this result with a matching lower bound, proving our construction is nearly optimal by showing the required number of parameters matches up to a logarithmic factor. Our proof of the lower bound introduces novel estimates for the Vapnik-Chervonenkis dimension and pseudo-dimension of the network's high-order derivative classes. These complexity bounds provide a theoretical cornerstone for learning PDEs on manifolds involving derivatives. Our analysis reveals that the network architecture leverages a sparse structure to efficiently exploit the manifold's low-dimensional geometry.
Abstract:Many contemporary data-driven research efforts in the natural sciences, such as chemistry and materials science, require large-scale, high-performance entity recognition from scientific datasets. Large language models (LLMs) have increasingly been adopted to solve the entity recognition task, with the same trend being observed on all-spectrum NLP tasks. The prevailing entity recognition LLMs rely on fine-tuned technology, yet the fine-tuning process often incurs significant cost. To achieve a best performance-cost trade-off, we propose ALLabel, a three-stage framework designed to select the most informative and representative samples in preparing the demonstrations for LLM modeling. The annotated examples are used to construct a ground-truth retrieval corpus for LLM in-context learning. By sequentially employing three distinct active learning strategies, ALLabel consistently outperforms all baselines under the same annotation budget across three specialized domain datasets. Experimental results also demonstrate that selectively annotating only 5\%-10\% of the dataset with ALLabel can achieve performance comparable to the method annotating the entire dataset. Further analyses and ablation studies verify the effectiveness and generalizability of our proposal.
Abstract:In this paper, we study the binary classification problem on $[0,1]^d$ under the Tsybakov noise condition (with exponent $s \in [0,\infty]$) and the compositional assumption. This assumption requires the conditional class probability function of the data distribution to be the composition of $q+1$ vector-valued multivariate functions, where each component function is either a maximum value function or a H\"{o}lder-$\beta$ smooth function that depends only on $d_*$ of its input variables. Notably, $d_*$ can be significantly smaller than the input dimension $d$. We prove that, under these conditions, the optimal convergence rate for the excess 0-1 risk of classifiers is $$ \left( \frac{1}{n} \right)^{\frac{\beta\cdot(1\wedge\beta)^q}{{\frac{d_*}{s+1}+(1+\frac{1}{s+1})\cdot\beta\cdot(1\wedge\beta)^q}}}\;\;\;, $$ which is independent of the input dimension $d$. Additionally, we demonstrate that ReLU deep neural networks (DNNs) trained with hinge loss can achieve this optimal convergence rate up to a logarithmic factor. This result provides theoretical justification for the excellent performance of ReLU DNNs in practical classification tasks, particularly in high-dimensional settings. The technique used to establish these results extends the oracle inequality presented in our previous work. The generalized approach is of independent interest.