



Abstract:3D perception ability is crucial for generalizable robotic manipulation. While recent foundation models have made significant strides in perception and decision-making with RGB-based input, their lack of 3D perception limits their effectiveness in fine-grained robotic manipulation tasks. To address these limitations, we propose a Depth Information Injection ($\bold{DI}^{\bold{2}}$) framework that leverages the RGB-Depth modality for policy fine-tuning, while relying solely on RGB images for robust and efficient deployment. Concretely, we introduce the Depth Completion Module (DCM) to extract the spatial prior knowledge related to depth information and generate virtual depth information from RGB inputs to aid policy deployment. Further, we propose the Depth-Aware Codebook (DAC) to eliminate noise and reduce the cumulative error from the depth prediction. In the inference phase, this framework employs RGB inputs and accurately predicted depth data to generate the manipulation action. We conduct experiments on simulated LIBERO environments and real-world scenarios, and the experiment results prove that our method could effectively enhance the pre-trained RGB-based policy with 3D perception ability for robotic manipulation. The website is released at https://gewu-lab.github.io/DepthHelps-IROS2024.




Abstract:Online Imitation Learning methods struggle with the gap between extensive online exploration space and limited expert trajectories, which hinder efficient exploration due to inaccurate task-aware reward estimation. Inspired by the findings from cognitive neuroscience that task decomposition could facilitate cognitive processing for efficient learning, we hypothesize that an agent could estimate precise task-aware imitation rewards for efficient online exploration by decomposing the target task into the objectives of "what to do" and the mechanisms of "how to do". In this work, we introduce the hybrid Key-state guided Online Imitation (KOI) learning approach, which leverages the integration of semantic and motion key states as guidance for task-aware reward estimation. Initially, we utilize the visual-language models to segment the expert trajectory into semantic key states, indicating the objectives of "what to do". Within the intervals between semantic key states, optical flow is employed to capture motion key states to understand the process of "how to do". By integrating a thorough grasp of both semantic and motion key states, we refine the trajectory-matching reward computation, encouraging task-aware exploration for efficient online imitation learning. Our experiment results prove that our method is more sample efficient in the Meta-World and LIBERO environments. We also conduct real-world robotic manipulation experiments to validate the efficacy of our method, demonstrating the practical applicability of our KOI method.




Abstract:Fluid antennas (FAs) and movable antennas (MAs) have attracted increasing attention in wireless communications recently. As compared to the conventional fixed-position antennas (FPAs), their geometry can be dynamically reconfigured, such that more flexible beamforming can be achieved for signal coverage and/or interference nulling. In this paper, we investigate the use of MAs to achieve uniform coverage for multiple regions with arbitrary number and width in the spatial domain. In particular, we aim to jointly optimize the MAs weights and positions within a linear array to maximize the minimum beam gain over the desired spatial regions. However, the resulting problem is non-convex and difficult to be optimally solved. To tackle this difficulty, we propose an alternating optimization (AO) algorithm to obtain a high-quality suboptimal solution, where the MAs weights and positions are alternately optimized by applying successive convex approximation (SCA) technique. Numerical results show that our proposed MAbased beam coverage scheme can achieve much better performance than conventional FPAs.




Abstract:Data-driven deep learning models have shown great capabilities to assist radiologists in breast ultrasound (US) diagnoses. However, their effectiveness is limited by the long-tail distribution of training data, which leads to inaccuracies in rare cases. In this study, we address a long-standing challenge of improving the diagnostic model performance on rare cases using long-tailed data. Specifically, we introduce a pipeline, TAILOR, that builds a knowledge-driven generative model to produce tailored synthetic data. The generative model, using 3,749 lesions as source data, can generate millions of breast-US images, especially for error-prone rare cases. The generated data can be further used to build a diagnostic model for accurate and interpretable diagnoses. In the prospective external evaluation, our diagnostic model outperforms the average performance of nine radiologists by 33.5% in specificity with the same sensitivity, improving their performance by providing predictions with an interpretable decision-making process. Moreover, on ductal carcinoma in situ (DCIS), our diagnostic model outperforms all radiologists by a large margin, with only 34 DCIS lesions in the source data. We believe that TAILOR can potentially be extended to various diseases and imaging modalities.



Abstract:Sign language is one of the most effective communication tools for people with hearing difficulties. Most existing works focus on improving the performance of sign language tasks on RGB videos, which may suffer from degraded recording conditions, such as fast movement of hands with motion blur and textured signer's appearance. The bio-inspired event camera, which asynchronously captures brightness change with high speed, could naturally perceive dynamic hand movements, providing rich manual clues for sign language tasks. In this work, we aim at exploring the potential of event camera in continuous sign language recognition (CSLR) and sign language translation (SLT). To promote the research, we first collect an event-based benchmark EvSign for those tasks with both gloss and spoken language annotations. EvSign dataset offers a substantial amount of high-quality event streams and an extensive vocabulary of glosses and words, thereby facilitating the development of sign language tasks. In addition, we propose an efficient transformer-based framework for event-based SLR and SLT tasks, which fully leverages the advantages of streaming events. The sparse backbone is employed to extract visual features from sparse events. Then, the temporal coherence is effectively utilized through the proposed local token fusion and gloss-aware temporal aggregation modules. Extensive experimental results are reported on both simulated (PHOENIX14T) and EvSign datasets. Our method performs favorably against existing state-of-the-art approaches with only 0.34% computational cost (0.84G FLOPS per video) and 44.2% network parameters. The project is available at https://zhang-pengyu.github.io/EVSign.




Abstract:This paper focuses on integrating the networks and adversarial training into constrained optimization problems to develop a framework algorithm for constrained optimization problems. For such problems, we first transform them into minimax problems using the augmented Lagrangian method and then use two (or several) deep neural networks(DNNs) to represent the primal and dual variables respectively. The parameters in the neural networks are then trained by an adversarial process. The proposed architecture is relatively insensitive to the scale of values of different constraints when compared to penalty based deep learning methods. Through this type of training, the constraints are imposed better based on the augmented Lagrangian multipliers. Extensive examples for optimization problems with scalar constraints, nonlinear constraints, partial differential equation constraints, and inequality constraints are considered to show the capability and robustness of the proposed method, with applications ranging from Ginzburg--Landau energy minimization problems, partition problems, fluid-solid topology optimization, to obstacle problems.
Abstract:Serialized Output Training (SOT) has showcased state-of-the-art performance in multi-talker speech recognition by sequentially decoding the speech of individual speakers. To address the challenging label-permutation issue, prior methods have relied on either the Permutation Invariant Training (PIT) or the time-based First-In-First-Out (FIFO) rule. This study presents a model-based serialization strategy that incorporates an auxiliary module into the Attention Encoder-Decoder architecture, autonomously identifying the crucial factors to order the output sequence of the speech components in multi-talker speech. Experiments conducted on the LibriSpeech and LibriMix databases reveal that our approach significantly outperforms the PIT and FIFO baselines in both 2-mix and 3-mix scenarios. Further analysis shows that the serialization module identifies dominant speech components in a mixture by factors including loudness and gender, and orders speech components based on the dominance score.
Abstract:Recent studies have demonstrated the efficacy of large language models (LLMs) in error correction for automatic speech recognition (ASR). However, much of the research focuses on the English language. This paper redirects the attention to Chinese. Firstly, we construct a specialized benchmark dataset aimed at error correction for Chinese ASR with 724K hypotheses-transcription pairs, named the Chinese Hypotheses Paradise dataset (ChineseHP), which contains a wide range of scenarios and presents significant challenges. Subsequently, we conduct a preliminary evaluation using the dataset for both direct-prompting and fine-tuning pre-trained LLMs. Furthermore, we propose a straightforward method of Pinyin regularization for prompts, which involves the transcription of Pinyin directly from text hypotheses. The experimental results reveal that Pinyin regularization consistently enhances the error-correcting ability of LLMs when compared with those without regularization. The dataset is available on the website.




Abstract:Traditional reinforcement learning (RL) generates discrete control policies, assigning one action per cycle. These policies are usually implemented as in a fixed-frequency control loop. This rigidity presents challenges as optimal control frequency is task-dependent; suboptimal frequencies increase computational demands and reduce exploration efficiency. Variable Time Step Reinforcement Learning (VTS-RL) addresses these issues with adaptive control frequencies, executing actions only when necessary, thus reducing computational load and extending the action space to include action durations. In this paper we introduce the Multi-Objective Soft Elastic Actor-Critic (MOSEAC) method to perform VTS-RL, validating it through theoretical analysis and experimentation in simulation and on real robots. Results show faster convergence, better training results, and reduced energy consumption with respect to other variable- or fixed-frequency approaches.


Abstract:Fluid antennas (FAs) and movable antennas (MAs) have drawn increasing attention in wireless communications recently due to their ability to create favorable channel conditions via local antenna movement within a confined region. In this letter, we advance their application for cognitive radio to facilitate efficient spectrum sharing between primary and secondary communication systems. In particular, we aim to jointly optimize the transmit beamforming and MA positions at a secondary transmitter (ST) to maximize the received signal power at a secondary receiver (SR) subject to the constraints on its imposed co-channel interference power with multiple primary receivers (PRs). However, such an optimization problem is difficult to be optimally solved due to the highly nonlinear functions of the received signal/interference power at the SR/all PRs in terms of the MA positions. To drive useful insights, we first perform theoretical analyses to unveil MAs' capability to achieve maximum-ratio transmission with the SR and effective interference mitigation for all PRs at the same time. To solve the MA position optimization problem, we propose an alternating optimization (AO) algorithm to obtain a high-quality suboptimal solution. Numerical results demonstrate that our proposed algorithms can significantly outperform the conventional fixed-position antennas (FPAs) and other baseline schemes.