Previous Sign Language Translation (SLT) methods achieve superior performance by relying on gloss annotations. However, labeling high-quality glosses is a labor-intensive task, which limits the further development of SLT. Although some approaches work towards gloss-free SLT through jointly training the visual encoder and translation network, these efforts still suffer from poor performance and inefficient use of the powerful Large Language Model (LLM). Most seriously, we find that directly introducing LLM into SLT will lead to insufficient learning of visual representations as LLM dominates the learning curve. To address these problems, we propose Factorized Learning assisted with Large Language Model (FLa-LLM) for gloss-free SLT. Concretely, we factorize the training process into two stages. In the visual initialing stage, we employ a lightweight translation model after the visual encoder to pre-train the visual encoder. In the LLM fine-tuning stage, we freeze the acquired knowledge in the visual encoder and integrate it with a pre-trained LLM to inspire the LLM's translation potential. This factorized training strategy proves to be highly effective as evidenced by significant improvements achieved across three SLT datasets which are all conducted under the gloss-free setting.
This paper studies the fair transmission design for an intelligent reflecting surface (IRS) aided rate-splitting multiple access (RSMA). IRS is used to establish a good signal propagation environment and enhance the RSMA transmission performance. The fair rate adaption problem is constructed as a max-min optimization problem. To solve the optimization problem, we adopt an alternative optimization (AO) algorithm to optimize the power allocation, beamforming, and decoding order, respectively. A generalized power iteration (GPI) method is proposed to optimize the receive beamforming, which can improve the minimum rate of devices and reduce the optimization complexity. At the base station (BS), a successive group decoding (SGD) algorithm is proposed to tackle the uplink signal estimation, which trades off the fairness and complexity of decoding. At the same time, we also consider robust communication with imperfect channel state information at the transmitter (CSIT), which studies robust optimization by using lower bound expressions on the expected data rates. Extensive numerical results show that the proposed optimization algorithm can significantly improve the performance of fairness. It also provides reliable results for uplink communication with imperfect CSIT.
Creating group choreography from music has gained attention in cultural entertainment and virtual reality, aiming to coordinate visually cohesive and diverse group movements. Despite increasing interest, recent works face challenges in achieving aesthetically appealing choreography, primarily for two key issues: multi-dancer collision and single-dancer foot slide. To address these issues, we propose a Trajectory-Controllable Diffusion (TCDiff), a novel approach that harnesses non-overlapping trajectories to facilitate coherent dance movements. Specifically, to tackle dancer collisions, we introduce a Dance-Beat Navigator capable of generating trajectories for multiple dancers based on the music, complemented by a Distance-Consistency loss to maintain appropriate spacing among trajectories within a reasonable threshold. To mitigate foot sliding, we present a Footwork Adaptor that utilizes trajectory displacement from adjacent frames to enable flexible footwork, coupled with a Relative Forward-Kinematic loss to adjust the positioning of individual dancers' root nodes and joints. Extensive experiments demonstrate that our method achieves state-of-the-art results.
In this paper, we study a defense against poisoned encoders in SSL called distillation, which is a defense used in supervised learning originally. Distillation aims to distill knowledge from a given model (a.k.a the teacher net) and transfer it to another (a.k.a the student net). Now, we use it to distill benign knowledge from poisoned pre-trained encoders and transfer it to a new encoder, resulting in a clean pre-trained encoder. In particular, we conduct an empirical study on the effectiveness and performance of distillation against poisoned encoders. Using two state-of-the-art backdoor attacks against pre-trained image encoders and four commonly used image classification datasets, our experimental results show that distillation can reduce attack success rate from 80.87% to 27.51% while suffering a 6.35% loss in accuracy. Moreover, we investigate the impact of three core components of distillation on performance: teacher net, student net, and distillation loss. By comparing 4 different teacher nets, 3 student nets, and 6 distillation losses, we find that fine-tuned teacher nets, warm-up-training-based student nets, and attention-based distillation loss perform best, respectively.
Autonomous driving systems face the formidable challenge of navigating intricate and dynamic environments with uncertainty. This study presents a unified prediction and planning framework that concurrently models short-term aleatoric uncertainty (SAU), long-term aleatoric uncertainty (LAU), and epistemic uncertainty (EU) to predict and establish a robust foundation for planning in dynamic contexts. The framework uses Gaussian mixture models and deep ensemble methods, to concurrently capture and assess SAU, LAU, and EU, where traditional methods do not integrate these uncertainties simultaneously. Additionally, uncertainty-aware planning is introduced, considering various uncertainties. The study's contributions include comparisons of uncertainty estimations, risk modeling, and planning methods in comparison to existing approaches. The proposed methods were rigorously evaluated using the CommonRoad benchmark and settings with limited perception. These experiments illuminated the advantages and roles of different uncertainty factors in autonomous driving processes. In addition, comparative assessments of various uncertainty modeling strategies underscore the benefits of modeling multiple types of uncertainties, thus enhancing planning accuracy and reliability. The proposed framework facilitates the development of methods for UAP and surpasses existing uncertainty-aware risk models, particularly when considering diverse traffic scenarios. Project page: https://swb19.github.io/UAP/.
In the rapidly evolving field of autonomous driving, accurate trajectory prediction is pivotal for vehicular safety. However, trajectory predictions often deviate from actual paths, particularly in complex and challenging environments, leading to significant errors. To address this issue, our study introduces a novel method for Dynamic Occupancy Set (DOS) prediction, enhancing trajectory prediction capabilities. This method effectively combines advanced trajectory prediction networks with a DOS prediction module, overcoming the shortcomings of existing models. It provides a comprehensive and adaptable framework for predicting the potential occupancy sets of traffic participants. The main contributions of this research include: 1) A novel DOS prediction model tailored for complex scenarios, augmenting traditional trajectory prediction; 2) The development of unique DOS representations and evaluation metrics; 3) Extensive validation through experiments, demonstrating enhanced performance and adaptability. This research contributes to the advancement of safer and more efficient intelligent vehicle and transportation systems.
In automotive radar, time-domain thresholding (TD-TH) and time-frequency domain thresholding (TFD-TH) are crucial techniques underpinning numerous interference mitigation methods. Despite their importance, comprehensive evaluations of these methods in dense traffic scenarios with different types of interference are limited. In this study, we segment automotive radar interference into three distinct categories. Utilizing the in-house traffic scenario and automotive radar simulator, we evaluate interference mitigation methods across multiple metrics: probability of detection, signal-to-interference-plus-noise ratio, and phase error involving hundreds of targets and dozens of interfering radars. The numerical results highlight that TFD-TH is more effective than TD-TH, particularly as the density and signal correlation of interfering radars escalate.
In this paper, a novel transmissive reconfigurable intelligent surface (TRIS) transceiver empowered integrated sensing and communications (ISAC) system is proposed for future multi-demand terminals. To address interference management, we implement rate-splitting multiple access (RSMA), where the common stream is independently designed for the sensing service. We introduce the sensing quality of service (QoS) criteria based on this structure and construct an optimization problem with the sensing QoS criteria as the objective function to optimize the sensing stream precoding matrix and the communication stream precoding matrix. Due to the coupling of optimization variables, the formulated problem is a non-convex optimization problem that cannot be solved directly. To tackle the above-mentioned challenging problem, alternating optimization (AO) is utilized to decouple the optimization variables. Specifically, the problem is decoupled into three subproblems about the sensing stream precoding matrix, the communication stream precoding matrix, and the auxiliary variables, which is solved alternatively through AO until the convergence is reached. For solving the problem, successive convex approximation (SCA) is applied to deal with the sum-rate threshold constraints on communications, and difference-of-convex (DC) programming is utilized to solve rank-one non-convex constraints. Numerical simulation results verify the superiority of the proposed scheme in terms of improving the communication and sensing QoS.
The rise of Artificial Intelligence (AI) has revolutionized numerous industries and transformed the way society operates. Its widespread use has led to the distribution of AI and its underlying data across many intelligent systems. In this light, it is crucial to utilize information in learning processes that are either distributed or owned by different entities. As a result, modern data-driven services have been developed to integrate distributed knowledge entities into their outcomes. In line with this goal, the latest AI models are frequently trained in a decentralized manner. Distributed learning involves multiple entities working together to make collective predictions and decisions. However, this collaboration can also bring about security vulnerabilities and challenges. This paper provides an in-depth survey on private knowledge sharing in distributed learning, examining various knowledge components utilized in leading distributed learning architectures. Our analysis sheds light on the most critical vulnerabilities that may arise when using these components in a distributed setting. We further identify and examine defensive strategies for preserving the privacy of these knowledge components and preventing malicious parties from manipulating or accessing the knowledge information. Finally, we highlight several key limitations of knowledge sharing in distributed learning and explore potential avenues for future research.
Multi-label image recognition is a fundamental task in computer vision. Recently, vision-language models have made notable advancements in this area. However, previous methods often failed to effectively leverage the rich knowledge within language models and instead incorporated label semantics into visual features in a unidirectional manner. In this paper, we propose a Prompt-driven Visual-Linguistic Representation Learning (PVLR) framework to better leverage the capabilities of the linguistic modality. In PVLR, we first introduce a dual-prompting strategy comprising Knowledge-Aware Prompting (KAP) and Context-Aware Prompting (CAP). KAP utilizes fixed prompts to capture the intrinsic semantic knowledge and relationships across all labels, while CAP employs learnable prompts to capture context-aware label semantics and relationships. Later, we propose an Interaction and Fusion Module (IFM) to interact and fuse the representations obtained from KAP and CAP. In contrast to the unidirectional fusion in previous works, we introduce a Dual-Modal Attention (DMA) that enables bidirectional interaction between textual and visual features, yielding context-aware label representations and semantic-related visual representations, which are subsequently used to calculate similarities and generate final predictions for all labels. Extensive experiments on three popular datasets including MS-COCO, Pascal VOC 2007, and NUS-WIDE demonstrate the superiority of PVLR.