Abstract:Due to the emergence of many sign language datasets, isolated sign language recognition (ISLR) has made significant progress in recent years. In addition, the development of various advanced deep neural networks is another reason for this breakthrough. However, challenges remain in applying the technique in the real world. First, existing sign language datasets do not cover the whole sign vocabulary. Second, most of the sign language datasets provide only single view RGB videos, which makes it difficult to handle hand occlusions when performing ISLR. To fill this gap, this paper presents a dual-view sign language dataset for ISLR named NationalCSL-DP, which fully covers the Chinese national sign language vocabulary. The dataset consists of 134140 sign videos recorded by ten signers with respect to two vertical views, namely, the front side and the left side. Furthermore, a CNN transformer network is also proposed as a strong baseline and an extremely simple but effective fusion strategy for prediction. Extensive experiments were conducted to prove the effectiveness of the datasets as well as the baseline. The results show that the proposed fusion strategy can significantly increase the performance of the ISLR, but it is not easy for the sequence-to-sequence model, regardless of whether the early-fusion or late-fusion strategy is applied, to learn the complementary features from the sign videos of two vertical views.
Abstract:Markov chain Monte Carlo (MCMC) methods for sampling from heavy-tailed distributions present unique challenges, particularly in high dimensions. Multi-proposal MCMC algorithms have recently gained attention for their potential to improve performance, especially through parallel implementation on modern hardware. This paper introduces a novel family of gradient-free MCMC algorithms that combine the multi-try Metropolis (MTM) with stereographic MCMC framework, specifically designed for efficient sampling from heavy-tailed targets. The proposed stereographic multi-try Metropolis (SMTM) algorithm not only outperforms traditional Euclidean MTM and existing stereographic random-walk Metropolis methods, but also avoids the pathological convergence behavior often observed in MTM and demonstrates strong robustness to tuning. These properties are supported by scaling analysis and extensive simulation studies.
Abstract:The effectiveness of large language models (LLMs) in conversational AI is hindered by their reliance on single-turn supervised fine-tuning (SFT) data, which limits contextual coherence in multi-turn dialogues. Existing methods for generating multi-turn dialogue data struggle to ensure both diversity and quality in instructions. To address this, we propose Review-Instruct, a novel framework that synthesizes multi-turn conversations through an iterative "Ask-Respond-Review" process involving three agent roles: a Candidate, multiple Reviewers, and a Chairman. The framework iteratively refines instructions by incorporating Reviewer feedback, enhancing dialogue diversity and difficulty. We construct a multi-turn dataset using the Alpaca dataset and fine-tune the LLaMA2-13B model. Evaluations on MT-Bench, MMLU-Pro, and Auto-Arena demonstrate significant improvements, achieving absolute gains of 2.9\% on MMLU-Pro and 2\% on MT-Bench compared to prior state-of-the-art models based on LLaMA2-13B. Ablation studies confirm the critical role of the Review stage and the use of multiple Reviewers in boosting instruction diversity and difficulty. Our work highlights the potential of review-driven, multi-agent frameworks for generating high-quality conversational data at scale.
Abstract:Topology optimization (TO) serves as a widely applied structural design approach to tackle various engineering problems. Nevertheless, sensitivity-based TO methods usually struggle with solving strongly nonlinear optimization problems. By leveraging high capacity of deep generative model, which is an influential machine learning technique, the sensitivity-free data-driven topology design (DDTD) methodology is regarded as an effective means of overcoming these issues. The DDTD methodology depends on initial dataset with a certain regularity, making its results highly sensitive to initial dataset quality. This limits its effectiveness and generalizability, especially for optimization problems without priori information. In this research, we proposed a multi-level mesh DDTD-based method with correlation-based mutation module to escape from the limitation of the quality of the initial dataset on the results and enhance computational efficiency. The core is to employ a correlation-based mutation module to assign new geometric features with physical meaning to the generated data, while utilizing a multi-level mesh strategy to progressively enhance the refinement of the structural representation, thus avoiding the maintenance of a high degree-of-freedom (DOF) representation throughout the iterative process. The proposed multi-level mesh DDTD-based method can be driven by a low quality initial dataset without the need for time-consuming construction of a specific dataset, thus significantly increasing generality and reducing application difficulty, while further lowering computational cost of DDTD methodology. Various comparison experiments with the traditional sensitivity-based TO methods on stress-related strongly nonlinear problems demonstrate the generality and effectiveness of the proposed method.
Abstract:Estimating the 6D pose of textureless objects from RBG images is an important problem in robotics. Due to appearance ambiguities, rotational symmetries, and severe occlusions, single-view based 6D pose estimators are still unable to handle a wide range of objects, motivating research towards multi-view pose estimation and next-best-view prediction that addresses these limitations. In this work, we propose a comprehensive active perception framework for estimating the 6D poses of textureless objects using only RGB images. Our approach is built upon a key idea: decoupling the 6D pose estimation into a sequential two-step process can greatly improve both accuracy and efficiency. First, we estimate the 3D translation of each object, resolving scale and depth ambiguities inherent to RGB images. These estimates are then used to simplify the subsequent task of determining the 3D orientation, which we achieve through canonical scale template matching. Building on this formulation, we then introduce an active perception strategy that predicts the next best camera viewpoint to capture an RGB image, effectively reducing object pose uncertainty and enhancing pose accuracy. We evaluate our method on the public ROBI dataset as well as on a transparent object dataset that we created. When evaluated using the same camera viewpoints, our multi-view pose estimation significantly outperforms state-of-the-art approaches. Furthermore, by leveraging our next-best-view strategy, our method achieves high object pose accuracy with substantially fewer viewpoints than heuristic-based policies.
Abstract:Different types of intelligent reflecting surfaces (IRS) are exploited for assisting wireless communications. The joint use of passive IRS (PIRS) and active IRS (AIRS) emerges as a promising solution owing to their complementary advantages. They can be integrated into a single hybrid active-passive IRS (HIRS) or deployed in a distributed manner, which poses challenges in determining the IRS element allocation and placement for rate maximization. In this paper, we investigate the capacity of an IRS-aided wireless communication system with both active and passive elements. Specifically, we consider three deployment schemes: 1) base station (BS)-HIRS-user (BHU); 2) BS-AIRS-PIRS-user (BAPU); 3) BS-PIRS-AIRS-user (BPAU). Under the line-of-sight channel model, we formulate a rate maximization problem via a joint optimization of the IRS element allocation and placement. We first derive the optimized number of active and passive elements for BHU, BAPU, and BPAU schemes, respectively. Then, low-complexity HIRS/AIRS placement strategies are provided. To obtain more insights, we characterize the system capacity scaling orders for the three schemes with respect to the large total number of IRS elements, amplification power budget, and BS transmit power. Finally, simulation results are presented to validate our theoretical findings and show the performance difference among the BHU, BAPU, and BPAU schemes with the proposed joint design under various system setups.
Abstract:Exploration in sparse reward environments remains a significant challenge in reinforcement learning, particularly in Contextual Markov Decision Processes (CMDPs), where environments differ across episodes. Existing episodic intrinsic motivation methods for CMDPs primarily rely on count-based approaches, which are ineffective in large state spaces, or on similarity-based methods that lack appropriate metrics for state comparison. To address these shortcomings, we propose Episodic Novelty Through Temporal Distance (ETD), a novel approach that introduces temporal distance as a robust metric for state similarity and intrinsic reward computation. By employing contrastive learning, ETD accurately estimates temporal distances and derives intrinsic rewards based on the novelty of states within the current episode. Extensive experiments on various benchmark tasks demonstrate that ETD significantly outperforms state-of-the-art methods, highlighting its effectiveness in enhancing exploration in sparse reward CMDPs.
Abstract:3D Gaussian Splatting has emerged as a promising technique for high-quality 3D rendering, leading to increasing interest in integrating 3DGS into realism SLAM systems. However, existing methods face challenges such as Gaussian primitives redundancy, forgetting problem during continuous optimization, and difficulty in initializing primitives in monocular case due to lack of depth information. In order to achieve efficient and photorealistic mapping, we propose RP-SLAM, a 3D Gaussian splatting-based vision SLAM method for monocular and RGB-D cameras. RP-SLAM decouples camera poses estimation from Gaussian primitives optimization and consists of three key components. Firstly, we propose an efficient incremental mapping approach to achieve a compact and accurate representation of the scene through adaptive sampling and Gaussian primitives filtering. Secondly, a dynamic window optimization method is proposed to mitigate the forgetting problem and improve map consistency. Finally, for the monocular case, a monocular keyframe initialization method based on sparse point cloud is proposed to improve the initialization accuracy of Gaussian primitives, which provides a geometric basis for subsequent optimization. The results of numerous experiments demonstrate that RP-SLAM achieves state-of-the-art map rendering accuracy while ensuring real-time performance and model compactness.
Abstract:In hybrid transactional and analytical processing (HTAP) systems, users often struggle to understand why query plans from one engine (OLAP or OLTP) perform significantly slower than those from another. Although optimizers provide plan details via the EXPLAIN function, these explanations are frequently too technical for non-experts and offer limited insights into performance differences across engines. To address this, we propose a novel framework that leverages large language models (LLMs) to explain query performance in HTAP systems. Built on Retrieval-Augmented Generation (RAG), our framework constructs a knowledge base that stores historical query executions and expert-curated explanations. To enable efficient retrieval of relevant knowledge, query plans are embedded using a lightweight tree-CNN classifier. This augmentation allows the LLM to generate clear, context-aware explanations of performance differences between engines. Our approach demonstrates the potential of LLMs in hybrid engine systems, paving the way for further advancements in database optimization and user support.
Abstract:Sound event localization and detection (SELD) has seen substantial advancements through learning-based methods. These systems, typically trained from scratch on specific datasets, have shown considerable generalization capabilities. Recently, deep neural networks trained on large-scale datasets have achieved remarkable success in the sound event classification (SEC) field, prompting an open question of whether these advancements can be extended to develop general-purpose SELD models. In this paper, leveraging the power of pre-trained SEC models, we propose pre-trained SELD networks (PSELDNets) on large-scale synthetic datasets. These synthetic datasets, generated by convolving sound events with simulated spatial room impulse responses (SRIRs), contain 1,167 hours of audio clips with an ontology of 170 sound classes. These PSELDNets are transferred to downstream SELD tasks. When we adapt PSELDNets to specific scenarios, particularly in low-resource data cases, we introduce a data-efficient fine-tuning method, AdapterBit. PSELDNets are evaluated on a synthetic-test-set using collected SRIRs from TAU Spatial Room Impulse Response Database (TAU-SRIR DB) and achieve satisfactory performance. We also conduct our experiments to validate the transferability of PSELDNets to three publicly available datasets and our own collected audio recordings. Results demonstrate that PSELDNets surpass state-of-the-art systems across all publicly available datasets. Given the need for direction-of-arrival estimation, SELD generally relies on sufficient multi-channel audio clips. However, incorporating the AdapterBit, PSELDNets show more efficient adaptability to various tasks using minimal multi-channel or even just monophonic audio clips, outperforming the traditional fine-tuning approaches.