Navigation of wheeled vehicles on uneven terrain necessitates going beyond the 2D approaches for trajectory planning. Specifically, it is essential to incorporate the full 6dof variation of vehicle pose and its associated stability cost in the planning process. To this end, most recent works aim to learn a neural network model to predict the vehicle evolution. However, such approaches are data-intensive and fraught with generalization issues. In this paper, we present a purely model-based approach that just requires the digital elevation information of the terrain. Specifically, we express the wheel-terrain interaction and 6dof pose prediction as a non-linear least squares (NLS) problem. As a result, trajectory planning can be viewed as a bi-level optimization. The inner optimization layer predicts the pose on the terrain along a given trajectory, while the outer layer deforms the trajectory itself to reduce the stability and kinematic costs of the pose. We improve the state-of-the-art in the following respects. First, we show that our NLS based pose prediction closely matches the output from a high-fidelity physics engine. This result coupled with the fact that we can query gradients of the NLS solver, makes our pose predictor, a differentiable wheel-terrain interaction model. We further leverage this differentiability to efficiently solve the proposed bi-level trajectory optimization problem. Finally, we perform extensive experiments, and comparison with a baseline to showcase the effectiveness of our approach in obtaining smooth, stable trajectories.
Blind-spot networks (BSN) have been prevalent network architectures in self-supervised image denoising (SSID). Existing BSNs are mostly conducted with convolution layers. Although transformers offer potential solutions to the limitations of convolutions and have demonstrated success in various image restoration tasks, their attention mechanisms may violate the blind-spot requirement, thus restricting their applicability in SSID. In this paper, we present a transformer-based blind-spot network (TBSN) by analyzing and redesigning the transformer operators that meet the blind-spot requirement. Specifically, TBSN follows the architectural principles of dilated BSNs, and incorporates spatial as well as channel self-attention layers to enhance the network capability. For spatial self-attention, an elaborate mask is applied to the attention matrix to restrict its receptive field, thus mimicking the dilated convolution. For channel self-attention, we observe that it may leak the blind-spot information when the channel number is greater than spatial size in the deep layers of multi-scale architectures. To eliminate this effect, we divide the channel into several groups and perform channel attention separately. Furthermore, we introduce a knowledge distillation strategy that distills TBSN into smaller denoisers to improve computational efficiency while maintaining performance. Extensive experiments on real-world image denoising datasets show that TBSN largely extends the receptive field and exhibits favorable performance against state-of-the-art SSID methods. The code and pre-trained models will be publicly available at https://github.com/nagejacob/TBSN.
Domain generalization~(DG) aims at solving distribution shift problems in various scenes. Existing approaches are based on Convolution Neural Networks (CNNs) or Vision Transformers (ViTs), which suffer from limited receptive fields or quadratic complexities issues. Mamba, as an emerging state space model (SSM), possesses superior linear complexity and global receptive fields. Despite this, it can hardly be applied to DG to address distribution shifts, due to the hidden state issues and inappropriate scan mechanisms. In this paper, we propose a novel framework for DG, named DGMamba, that excels in strong generalizability toward unseen domains and meanwhile has the advantages of global receptive fields, and efficient linear complexity. Our DGMamba compromises two core components: Hidden State Suppressing~(HSS) and Semantic-aware Patch refining~(SPR). In particular, HSS is introduced to mitigate the influence of hidden states associated with domain-specific features during output prediction. SPR strives to encourage the model to concentrate more on objects rather than context, consisting of two designs: Prior-Free Scanning~(PFS), and Domain Context Interchange~(DCI). Concretely, PFS aims to shuffle the non-semantic patches within images, creating more flexible and effective sequences from images, and DCI is designed to regularize Mamba with the combination of mismatched non-semantic and semantic information by fusing patches among domains. Extensive experiments on four commonly used DG benchmarks demonstrate that the proposed DGMamba achieves remarkably superior results to state-of-the-art models. The code will be made publicly available.
Biologically inspired event-based vision sensors (EVS) are growing in popularity due to performance benefits including ultra-low power consumption, high dynamic range, data sparsity, and fast temporal response. They efficiently encode dynamic information from a visual scene through pixels that respond autonomously and asynchronously when the per-pixel illumination level changes by a user-selectable contrast threshold ratio, $\theta$. Due to their unique sensing paradigm and complex analog pixel circuitry, characterizing Event-based Vision Sensor (EVS) is non-trivial. The step-response probability curve (S-curve) is a key measurement technique that has emerged as the standard for measuring $\theta$. In this work, we detail the method for generating accurate S-curves by applying an appropriate stimulus and sensor configuration to decouple 2nd-order effects from the parameter being studied. We use an EVS pixel simulation to demonstrate how noise and other physical constraints can lead to error in the measurement, and develop two techniques that are robust enough to obtain accurate estimates. We then apply best practices derived from our simulation to generate S-curves for the latest generation Sony IMX636 and interpret the resulting family of curves to correct the apparent anomalous result of previous reports suggesting that $\theta$ changes with illumination. Further, we demonstrate that with correct interpretation, fundamental physical parameters such as dark current and RMS noise can be accurately inferred from a collection of S-curves, leading to more accurate parameterization for high-fidelity EVS simulations.
This paper presents IITP-VDLand, a comprehensive dataset of Decentraland parcels sourced from diverse platforms. Unlike existing datasets which have limited attributes and records, IITP-VDLand offers a rich array of attributes, encompassing parcel characteristics, trading history, past activities, transactions, and social media interactions. Alongside, we introduce a key attribute in the dataset, namely Rarity score, which measures the uniqueness of each parcel within the virtual world. Addressing the significant challenge posed by the dispersed nature of this data across various sources, we employ a systematic approach, utilizing both available APIs and custom scripts, to gather it. Subsequently, we meticulously curate and organize the information into four distinct segments: (1) Characteristics Data-Fragment, (2) OpenSea Trading History Data-Fragment, (3) Ethereum Activity Transactions Data-Fragment, and (4) Social Media Data-Fragment. We envisage that this dataset would serve as a robust resource for training machine- and deep-learning models specifically designed to address real-world challenges within the domain of Decentraland parcels. The performance benchmarking of more than 20 state-of-the-art price prediction models on our dataset yields promising results, achieving a maximum R2 score of 0.8251 and an accuracy of 74.23% in case of Extra Trees Regressor and Classifier. The key findings reveal that the ensemble models performs better than both deep learning and linear models for our dataset. We observe a significant impact of coordinates, geographical proximity, rarity score, and few other economic indicators on the prediction of parcel prices.
Magnetic Resonance Imaging (MRI) is a vital modality for gaining precise anatomical information, and it plays a significant role in medical imaging for diagnosis and therapy planning. Image synthesis problems have seen a revolution in recent years due to the introduction of deep learning techniques, specifically Generative Adversarial Networks (GANs). This work investigates the use of Deep Convolutional Generative Adversarial Networks (DCGAN) for producing high-fidelity and realistic MRI image slices. The suggested approach uses a dataset with a variety of brain MRI scans to train a DCGAN architecture. While the discriminator network discerns between created and real slices, the generator network learns to synthesise realistic MRI image slices. The generator refines its capacity to generate slices that closely mimic real MRI data through an adversarial training approach. The outcomes demonstrate that the DCGAN promise for a range of uses in medical imaging research, since they show that it can effectively produce MRI image slices if we train them for a consequent number of epochs. This work adds to the expanding corpus of research on the application of deep learning techniques for medical image synthesis. The slices that are could be produced possess the capability to enhance datasets, provide data augmentation in the training of deep learning models, as well as a number of functions are made available to make MRI data cleaning easier, and a three ready to use and clean dataset on the major anatomical plans.
End-to-end automatic speech recognition (E2E ASR) systems often suffer from mistranscription of domain-specific phrases, such as named entities, sometimes leading to catastrophic failures in downstream tasks. A family of fast and lightweight named entity correction (NEC) models for ASR have recently been proposed, which normally build on phonetic-level edit distance algorithms and have shown impressive NEC performance. However, as the named entity (NE) list grows, the problems of phonetic confusion in the NE list are exacerbated; for example, homophone ambiguities increase substantially. In view of this, we proposed a novel Description Augmented Named entity CorrEctoR (dubbed DANCER), which leverages entity descriptions to provide additional information to facilitate mitigation of phonetic confusion for NEC on ASR transcription. To this end, an efficient entity description augmented masked language model (EDA-MLM) comprised of a dense retrieval model is introduced, enabling MLM to adapt swiftly to domain-specific entities for the NEC task. A series of experiments conducted on the AISHELL-1 and Homophone datasets confirm the effectiveness of our modeling approach. DANCER outperforms a strong baseline, the phonetic edit-distance-based NEC model (PED-NEC), by a character error rate (CER) reduction of about 7% relatively on AISHELL-1 for named entities. More notably, when tested on Homophone that contain named entities of high phonetic confusion, DANCER offers a more pronounced CER reduction of 46% relatively over PED-NEC for named entities.
In response to the rising threat of the face morphing attack, this paper introduces and explores the potential of Video-based Morphing Attack Detection (V-MAD) systems in real-world operational scenarios. While current morphing attack detection methods primarily focus on a single or a pair of images, V-MAD is based on video sequences, exploiting the video streams often acquired by face verification tools available, for instance, at airport gates. Through this study, we show for the first time the advantages that the availability of multiple probe frames can bring to the morphing attack detection task, especially in scenarios where the quality of probe images is varied and might be affected, for instance, by pose or illumination variations. Experimental results on a real operational database demonstrate that video sequences represent valuable information for increasing the robustness and performance of morphing attack detection systems.
In this work, we target the task of text-driven style transfer in the context of text-to-image (T2I) diffusion models. The main challenge is consistent structure preservation while enabling effective style transfer effects. The past approaches in this field directly concatenate the content and style prompts for a prompt-level style injection, leading to unavoidable structure distortions. In this work, we propose a novel solution to the text-driven style transfer task, namely, Adaptive Style Incorporation~(ASI), to achieve fine-grained feature-level style incorporation. It consists of the Siamese Cross-Attention~(SiCA) to decouple the single-track cross-attention to a dual-track structure to obtain separate content and style features, and the Adaptive Content-Style Blending (AdaBlending) module to couple the content and style information from a structure-consistent manner. Experimentally, our method exhibits much better performance in both structure preservation and stylized effects.
Molecular generation, an essential method for identifying new drug structures, has been supported by advancements in machine learning and computational technology. However, challenges remain in multi-objective generation, model adaptability, and practical application in drug discovery. In this study, we developed a versatile 'plug-in' molecular generation model that incorporates multiple objectives related to target affinity, drug-likeness, and synthesizability, facilitating its application in various drug development contexts. We improved the Particle Swarm Optimization (PSO) in the context of drug discoveries, and identified PSO-ENP as the optimal variant for multi-objective molecular generation and optimization through comparative experiments. The model also incorporates a novel target-ligand affinity predictor, enhancing the model's utility by supporting three-dimensional information and improving synthetic feasibility. Case studies focused on generating and optimizing drug-like big marine natural products were performed, underscoring PSO-ENP's effectiveness and demonstrating its considerable potential for practical drug discovery applications.