La Trobe University, Melbourne, Australia
Abstract:Feed-forward 3D reconstruction offers substantial runtime advantages over per-scene optimization, which remains slow at inference and often fragile under sparse views. However, existing feed-forward methods still have potential for further performance gains, especially for out-of-domain data, and struggle to retain second-level inference time once a generative prior is introduced. These limitations stem from the one-shot prediction paradigm in existing feed-forward pipeline: models are strictly bounded by capacity, lack inference-time refinement, and are ill-suited for continuously injecting generative priors. We introduce GIFSplat, a purely feed-forward iterative refinement framework for 3D Gaussian Splatting from sparse unposed views. A small number of forward-only residual updates progressively refine current 3D scene using rendering evidence, achieve favorable balance between efficiency and quality. Furthermore, we distill a frozen diffusion prior into Gaussian-level cues from enhanced novel renderings without gradient backpropagation or ever-increasing view-set expansion, thereby enabling per-scene adaptation with generative prior while preserving feed-forward efficiency. Across DL3DV, RealEstate10K, and DTU, GIFSplat consistently outperforms state-of-the-art feed-forward baselines, improving PSNR by up to +2.1 dB, and it maintains second-scale inference time without requiring camera poses or any test-time gradient optimization.
Abstract:Multimodal Entity Alignment (MMEA) aims to identify equivalent entities across different data modalities, enabling structural data integration that in turn improves the performance of various large language model applications. To lift the requirement of labeled seed pairs that are difficult to obtain, recent methods shifted to an unsupervised paradigm using pseudo-alignment seeds. However, unsupervised entity alignment in multimodal settings remains underexplored, mainly because the incorporation of multimodal information often results in imbalanced coverage of pseudo-seeds within the knowledge graph. To overcome this, we propose PSQE (Pseudo-Seed Quality Enhancement) to improve the precision and graph coverage balance of pseudo seeds via multimodal information and clustering-resampling. Theoretical analysis reveals the impact of pseudo seeds on existing contrastive learning-based MMEA models. In particular, pseudo seeds can influence the attraction and the repulsion terms in contrastive learning at once, whereas imbalanced graph coverage causes models to prioritize high-density regions, thereby weakening their learning capability for entities in sparse regions. Experimental results validate our theoretical findings and show that PSQE as a plug-and-play module can improve the performance of baselines by considerable margins.
Abstract:Vision-language foundation models (VLFMs) promise zero-shot and retrieval understanding for Earth observation. While operational satellite systems often lack full multi-spectral coverage, making RGB-only inference highly desirable for scalable deployment, the adoption of VLFMs for satellite imagery remains hindered by two factors: (1) multi-spectral inputs are informative but difficult to exploit consistently due to band redundancy and misalignment; and (2) CLIP-style text encoders limit semantic expressiveness and weaken fine-grained alignment. We present SATtxt, a spectrum-aware VLFM that operates with RGB inputs only at inference while retaining spectral cues learned during training. Our framework comprises two stages. First, Spectral Representation Distillation transfers spectral priors from a frozen multi-spectral teacher to an RGB student via a lightweight projector. Second, Spectrally Grounded Alignment with Instruction-Augmented LLMs bridges the distilled visual space and an expressive LLM embedding space. Across EuroSAT, BigEarthNet, and ForestNet, SATtxt improves zero-shot classification on average by 4.2%, retrieval by 5.9%, and linear probing by 2.7% over baselines, showing an efficient path toward spectrum-aware vision-language learning for Earth observation. Project page: https://ikhado.github.io/sattxt/
Abstract:Deep learning-based watermarking has made remarkable progress in recent years. To achieve robustness against various distortions, current methods commonly adopt a training strategy where a \underline{\textbf{s}}ingle \underline{\textbf{r}}andom \underline{\textbf{d}}istortion (SRD) is chosen as the noise layer in each training batch. However, the SRD strategy treats distortions independently within each batch, neglecting the inherent relationships among different types of distortions and causing optimization conflicts across batches. As a result, the robustness and generalizability of the watermarking model are limited. To address this issue, we propose a novel training strategy that enhances robustness and generalization via \underline{\textbf{meta}}-learning with \underline{\textbf{f}}eature \underline{\textbf{c}}onsistency (Meta-FC). Specifically, we randomly sample multiple distortions from the noise pool to construct a meta-training task, while holding out one distortion as a simulated ``unknown'' distortion for the meta-testing phase. Through meta-learning, the model is encouraged to identify and utilize neurons that exhibit stable activations across different types of distortions, mitigating the optimization conflicts caused by the random sampling of diverse distortions in each batch. To further promote the transformation of stable activations into distortion-invariant representations, we introduce a feature consistency loss that constrains the decoded features of the same image subjected to different distortions to remain consistent. Extensive experiments demonstrate that, compared to the SRD training strategy, Meta-FC improves the robustness and generalization of various watermarking models by an average of 1.59\%, 4.71\%, and 2.38\% under high-intensity, combined, and unknown distortions.
Abstract:The pursuit of general-purpose robotic manipulation is hindered by the scarcity of diverse, real-world interaction data. Unlike data collection from web in vision or language, robotic data collection is an active process incurring prohibitive physical costs. Consequently, automated task curation to maximize data value remains a critical yet under-explored challenge. Existing manual methods are unscalable and biased toward common tasks, while off-the-shelf foundation models often hallucinate physically infeasible instructions. To address this, we introduce RoboGene, an agentic framework designed to automate the generation of diverse, physically plausible manipulation tasks across single-arm, dual-arm, and mobile robots. RoboGene integrates three core components: diversity-driven sampling for broad task coverage, self-reflection mechanisms to enforce physical constraints, and human-in-the-loop refinement for continuous improvement. We conduct extensive quantitative analysis and large-scale real-world experiments, collecting datasets of 18k trajectories and introducing novel metrics to assess task quality, feasibility, and diversity. Results demonstrate that RoboGene significantly outperforms state-of-the-art foundation models (e.g., GPT-4o, Gemini 2.5 Pro). Furthermore, real-world experiments show that VLA models pre-trained with RoboGene achieve higher success rates and superior generalization, underscoring the importance of high-quality task generation. Our project is available at https://robogene-boost-vla.github.io.
Abstract:Enhancing the generalization capability of robotic learning to enable robots to operate effectively in diverse, unseen scenes is a fundamental and challenging problem. Existing approaches often depend on pretraining with large-scale data collection, which is labor-intensive and time-consuming, or on semantic data augmentation techniques that necessitate an impractical assumption of flawless upstream object detection in real-world scenarios. In this work, we propose RoboAug, a novel generative data augmentation framework that significantly minimizes the reliance on large-scale pretraining and the perfect visual recognition assumption by requiring only the bounding box annotation of a single image during training. Leveraging this minimal information, RoboAug employs pre-trained generative models for precise semantic data augmentation and integrates a plug-and-play region-contrastive loss to help models focus on task-relevant regions, thereby improving generalization and boosting task success rates. We conduct extensive real-world experiments on three robots, namely UR-5e, AgileX, and Tien Kung 2.0, spanning over 35k rollouts. Empirical results demonstrate that RoboAug significantly outperforms state-of-the-art data augmentation baselines. Specifically, when evaluating generalization capabilities in unseen scenes featuring diverse combinations of backgrounds, distractors, and lighting conditions, our method achieves substantial gains over the baseline without augmentation. The success rates increase from 0.09 to 0.47 on UR-5e, from 0.16 to 0.60 on AgileX, and from 0.19 to 0.67 on Tien Kung 2.0. These results highlight the superior generalization and effectiveness of RoboAug in real-world manipulation tasks. Our project is available at https://x-roboaug.github.io/.
Abstract:Modeling ultra-long user behavior sequences is pivotal for capturing evolving and lifelong interests in modern recommendation systems. However, deploying such models in real-time industrial environments faces a strict "Latency Wall", constrained by two distinct bottlenecks: the high I/O latency of retrieving massive user histories and the quadratic computational complexity of standard attention mechanisms. To break these bottlenecks, we present LASER, a full-stack optimization framework developed and deployed at Xiaohongshu (RedNote). Our approach tackles the challenges through two complementary innovations: (1) System efficiency: We introduce SeqVault, a unified schema-aware serving infrastructure for long user histories. By implementing a hybrid DRAM-SSD indexing strategy, SeqVault reduces retrieval latency by 50% and CPU usage by 75%, ensuring millisecond-level access to full real-time and life-cycle user histories. (2) Algorithmic efficiency: We propose a Segmented Target Attention (STA) mechanism to address the computational overhead. Motivated by the inherent sparsity of user interests, STA employs a sigmoid-based gating strategy that acts as a silence mechanism to filter out noisy items. Subsequently, a lightweight Global Stacked Target Attention (GSTA) module refines these compressed segments to capture cross-segment dependencies without incurring high computational costs. This design performs effective sequence compression, reducing the complexity of long-sequence modeling while preserving critical signals. Extensive offline evaluations demonstrate that LASER consistently outperforms state-of-the-art baselines. In large-scale online A/B testing serving over 100 million daily active users, LASER achieved a 2.36% lift in ADVV and a 2.08% lift in revenue, demonstrating its scalability and significant commercial impact.
Abstract:Text-attributed graphs (TAGs) associate nodes with textual attributes and graph structure, enabling GNNs to jointly model semantic and structural information. While effective on in-distribution (ID) data, GNNs often encounter out-of-distribution (OOD) nodes with unseen textual or structural patterns in real-world settings, leading to overconfident and erroneous predictions in the absence of reliable OOD detection. Early approaches address this issue from a topology-driven perspective, leveraging neighboring structures to mitigate node-level detection bias. However, these methods typically encode node texts as shallow vector features, failing to fully exploit rich semantic information. In contrast, recent LLM-based approaches generate pseudo OOD priors by leveraging textual knowledge, but they suffer from several limitations: (1) a reliability-informativeness imbalance in the synthesized OOD priors, as the generated OOD exposures either deviate from the true OOD semantics, or introduce non-negligible ID noise, all of which offers limited improvement to detection performance; (2) reliance on specialized architectures, which prevents incorporation of the extensive effective topology-level insights that have been empirically validated in prior work. To this end, we propose LG-Plug, an LLM-Guided Plug-and-play strategy for TAG OOD detection tasks. LG-Plug aligns topology and text representations to produce fine-grained node embeddings, then generates consensus-driven OOD exposure via clustered iterative LLM prompting. Moreover, it leverages lightweight in-cluster codebook and heuristic sampling reduce time cost of LLM querying. The resulting OOD exposure serves as a regularization term to separate ID and OOD nodes, enabling seamless integration with existing detectors.
Abstract:Manual endoscopic submucosal dissection (ESD) is technically demanding, and existing single-segment robotic tools offer limited dexterity. These limitations motivate the development of more advanced solutions. To address this, DESectBot, a novel dual segment continuum robot with a decoupled structure and integrated surgical forceps, enabling 6 degrees of freedom (DoFs) tip dexterity for improved lesion targeting in ESD, was developed in this work. Deep learning controllers based on gated recurrent units (GRUs) for simultaneous tip position and orientation control, effectively handling the nonlinear coupling between continuum segments, were proposed. The GRU controller was benchmarked against Jacobian based inverse kinematics, model predictive control (MPC), a feedforward neural network (FNN), and a long short-term memory (LSTM) network. In nested-rectangle and Lissajous trajectory tracking tasks, the GRU achieved the lowest position/orientation RMSEs: 1.11 mm/ 4.62° and 0.81 mm/ 2.59°, respectively. For orientation control at a fixed position (four target poses), the GRU attained a mean RMSE of 0.14 mm and 0.72°, outperforming all alternatives. In a peg transfer task, the GRU achieved a 100% success rate (120 success/120 attempts) with an average transfer time of 11.8s, the STD significantly outperforms novice-controlled systems. Additionally, an ex vivo ESD demonstration grasping, elevating, and resecting tissue as the scalpel completed the cut confirmed that DESectBot provides sufficient stiffness to divide thick gastric mucosa and an operative workspace adequate for large lesions.These results confirm that GRU-based control significantly enhances precision, reliability, and usability in ESD surgical training scenarios.
Abstract:General matrix multiplication (GEMM) is a fundamental operation in deep learning (DL). With DL moving increasingly toward low precision, recent works have proposed novel unary GEMM designs as an alternative to conventional binary GEMM hardware. A rigorous evaluation of recent unary and binary GEMM designs is needed to assess the potential of unary hardware for future DL compute. This paper focuses on unary GEMM designs for integer-based DL inference and performs a detailed evaluation of three latest unary design proposals, namely, uGEMM, tuGEMM and tubGEMM, by comparing them to a conventional binary GEMM. Rigorous post-synthesis evaluations beyond prior works are performed across varying bit-widths and matrix sizes to assess the designs' tradeoffs and determine optimal sweetspots. Further, we perform weight sparsity analysis across eight pretrained convolutional neural networks (CNNs) and the LLaMA2 large language model (LLM). In this work, we demonstrate how unary GEMM can be effectively used for energy-efficient compute in future edge AI accelerators.