Multi-modal recommendation systems, which integrate diverse types of information, have gained widespread attention in recent years. However, compared to traditional collaborative filtering-based multi-modal recommendation systems, research on multi-modal sequential recommendation is still in its nascent stages. Unlike traditional sequential recommendation models that solely rely on item identifier (ID) information and focus on network structure design, multi-modal recommendation models need to emphasize item representation learning and the fusion of heterogeneous data sources. This paper investigates the impact of item representation learning on downstream recommendation tasks and examines the disparities in information fusion at different stages. Empirical experiments are conducted to demonstrate the need to design a framework suitable for collaborative learning and fusion of diverse information. Based on this, we propose a new model-agnostic framework for multi-modal sequential recommendation tasks, called Online Distillation-enhanced Multi-modal Transformer (ODMT), to enhance feature interaction and mutual learning among multi-source input (ID, text, and image), while avoiding conflicts among different features during training, thereby improving recommendation accuracy. To be specific, we first introduce an ID-aware Multi-modal Transformer module in the item representation learning stage to facilitate information interaction among different features. Secondly, we employ an online distillation training strategy in the prediction optimization stage to make multi-source data learn from each other and improve prediction robustness. Experimental results on a stream media recommendation dataset and three e-commerce recommendation datasets demonstrate the effectiveness of the proposed two modules, which is approximately 10% improvement in performance compared to baseline models.
Mixup is a well-established data augmentation technique, which can extend the training distribution and regularize the neural networks by creating ''mixed'' samples based on the label-equivariance assumption, i.e., a proportional mixup of the input data results in the corresponding labels being mixed in the same proportion. However, previous mixup variants may fail to exploit the label-independent information in mixed samples during training, which usually contains richer semantic information. To further release the power of mixup, we first improve the previous label-equivariance assumption by the semantic-equivariance assumption, which states that the proportional mixup of the input data should lead to the corresponding representation being mixed in the same proportion. Then a generic mixup regularization at the representation level is proposed, which can further regularize the model with the semantic information in mixed samples. At a high level, the proposed semantic equivariant mixup (sem) encourages the structure of the input data to be preserved in the representation space, i.e., the change of input will result in the obtained representation information changing in the same way. Different from previous mixup variants, which tend to over-focus on the label-related information, the proposed method aims to preserve richer semantic information in the input with semantic-equivariance assumption, thereby improving the robustness of the model against distribution shifts. We conduct extensive empirical studies and qualitative analyzes to demonstrate the effectiveness of our proposed method. The code of the manuscript is in the supplement.
Vision-based stair perception can help autonomous mobile robots deal with the challenge of climbing stairs, especially in unfamiliar environments. To address the problem that current monocular vision methods are difficult to model stairs accurately without depth information, this paper proposes a depth-aware stair modeling method for monocular vision. Specifically, we take the extraction of stair geometric features and the prediction of depth images as joint tasks in a convolutional neural network (CNN), with the designed information propagation architecture, we can achieve effective supervision for stair geometric feature learning by depth information. In addition, to complete the stair modeling, we take the convex lines, concave lines, tread surfaces and riser surfaces as stair geometric features and apply Gaussian kernels to enable the network to predict contextual information within the stair lines. Combined with the depth information obtained by depth sensors, we propose a stair point cloud reconstruction method that can quickly get point clouds belonging to the stair step surfaces. Experiments on our dataset show that our method has a significant improvement over the previous best monocular vision method, with an intersection over union (IOU) increase of 3.4 %, and the lightweight version has a fast detection speed and can meet the requirements of most real-time applications. Our dataset is available at https://data.mendeley.com/datasets/6kffmjt7g2/1.
Recent studies have utilized deep learning (DL) techniques to automatically extract features from synthetic aperture radar (SAR) images, which shows great promise for enhancing the performance of SAR automatic target recognition (ATR). However, our research reveals a previously overlooked issue: SAR images to be recognized include not only the foreground (i.e., the target), but also a certain size of the background area. When a DL-model is trained exclusively on foreground data, its recognition performance is significantly superior to a model trained on original data that includes both foreground and background. This suggests that the presence of background impedes the ability of the DL-model to learn additional semantic information about the target. To address this issue, we construct a structural causal model (SCM) that incorporates the background as a confounder. Based on the constructed SCM, we propose a causal intervention based regularization method to eliminate the negative impact of background on feature semantic learning and achieve background debiased SAR-ATR. The proposed causal interventional regularizer can be integrated into any existing DL-based SAR-ATR models to mitigate the impact of background interference on the feature extraction and recognition accuracy. Experimental results on the Moving and Stationary Target Acquisition and Recognition (MSTAR) dataset indicate that the proposed method can enhance the efficiency of existing DL-based methods in a plug-and-play manner.
In recent years, Transformer networks are beginning to replace pure convolutional neural networks (CNNs) in the field of computer vision due to their global receptive field and adaptability to input. However, the quadratic computational complexity of softmax-attention limits the wide application in image dehazing task, especially for high-resolution images. To address this issue, we propose a new Transformer variant, which applies the Taylor expansion to approximate the softmax-attention and achieves linear computational complexity. A multi-scale attention refinement module is proposed as a complement to correct the error of the Taylor expansion. Furthermore, we introduce a multi-branch architecture with multi-scale patch embedding to the proposed Transformer, which embeds features by overlapping deformable convolution of different scales. The design of multi-scale patch embedding is based on three key ideas: 1) various sizes of the receptive field; 2) multi-level semantic information; 3) flexible shapes of the receptive field. Our model, named Multi-branch Transformer expanded by Taylor formula (MB-TaylorFormer), can embed coarse to fine features more flexibly at the patch embedding stage and capture long-distance pixel interactions with limited computational cost. Experimental results on several dehazing benchmarks show that MB-TaylorFormer achieves state-of-the-art (SOTA) performance with a light computational burden. The source code and pre-trained models are available at https://github.com/FVL2020/ICCV-2023-MB-TaylorFormer.
In sequential recommendation, multi-modal information (e.g., text or image) can provide a more comprehensive view of an item's profile. The optimal stage (early or late) to fuse modality features into item representations is still debated. We propose a graph-based approach (named MMSR) to fuse modality features in an adaptive order, enabling each modality to prioritize either its inherent sequential nature or its interplay with other modalities. MMSR represents each user's history as a graph, where the modality features of each item in a user's history sequence are denoted by cross-linked nodes. The edges between homogeneous nodes represent intra-modality sequential relationships, and the ones between heterogeneous nodes represent inter-modality interdependence relationships. During graph propagation, MMSR incorporates dual attention, differentiating homogeneous and heterogeneous neighbors. To adaptively assign nodes with distinct fusion orders, MMSR allows each node's representation to be asynchronously updated through an update gate. In scenarios where modalities exhibit stronger sequential relationships, the update gate prioritizes updates among homogeneous nodes. Conversely, when the interdependent relationships between modalities are more pronounced, the update gate prioritizes updates among heterogeneous nodes. Consequently, MMSR establishes a fusion order that spans a spectrum from early to late modality fusion. In experiments across six datasets, MMSR consistently outperforms state-of-the-art models, and our graph propagation methods surpass other graph neural networks. Additionally, MMSR naturally manages missing modalities.
In a co-design environment changes need to be integrated quickly and in an automated manner. This paper considers the challenge of creating and optimizing a global logistics system for the construction of a passenger aircraft within a co-design approach with respect to key performance indicators (like cost, time or resilience). The product in question is an aircraft, comprised of multiple components, manufactured at multiple sites worldwide. The goal is to find an optimal way to build the aircraft taking into consideration the requirements for its industrial system. The main motivation for approaching this challenge is to develop the industrial system in tandem with the product and making it more resilient against unforeseen events, reducing the risks of bottlenecks in the supply chain. This risk reduction ensures continued efficiency and operational success. To address this challenging and complex task we have chosen Answer Set Programming (ASP) as the modeling language, formalizing the relevant requirements of the investigated industrial system. The approach presented in this paper covers three main aspects: the extraction of the relevant information from a knowledge graph, the translation into logic programs and the computation of existing configurations guided by optimization criteria. Finally we visualize the results for an effortless evaluation of these models. Internal results seem promising and yielded several new research questions for future improvements of the discussed use case.
We introduce and define the novel problem of multi-distribution information retrieval (IR) where given a query, systems need to retrieve passages from within multiple collections, each drawn from a different distribution. Some of these collections and distributions might not be available at training time. To evaluate methods for multi-distribution retrieval, we design three benchmarks for this task from existing single-distribution datasets, namely, a dataset based on question answering and two based on entity matching. We propose simple methods for this task which allocate the fixed retrieval budget (top-k passages) strategically across domains to prevent the known domains from consuming most of the budget. We show that our methods lead to an average of 3.8+ and up to 8.0 points improvements in Recall@100 across the datasets and that improvements are consistent when fine-tuning different base retrieval models. Our benchmarks are made publicly available.
Services of personalized TTS systems for the Mandarin-speaking speech impaired are rarely mentioned. Taiwan started the VoiceBanking project in 2020, aiming to build a complete set of services to deliver personalized Mandarin TTS systems to amyotrophic lateral sclerosis patients. This paper reports the corpus design, corpus recording, data purging and correction for the corpus, and evaluations of the developed personalized TTS systems, for the VoiceBanking project. The developed corpus is named after the VoiceBank-2023 speech corpus because of its release year. The corpus contains 29.78 hours of utterances with prompts of short paragraphs and common phrases spoken by 111 native Mandarin speakers. The corpus is labeled with information about gender, degree of speech impairment, types of users, transcription, SNRs, and speaking rates. The VoiceBank-2023 is available by request for non-commercial use and welcomes all parties to join the VoiceBanking project to improve the services for the speech impaired.
Reconfigurable intelligent surface (RIS) is a promising technology that enables the customization of electromagnetic propagation environments in next-generation wireless networks. In this paper, we investigate the optimal pilot power allocation during the channel estimation stage to improve the ergodic channel gain of RIS-assisted systems under practical imperfect channel state information (CSI). Specifically, we commence by deriving an explicit closed-form expression of the ergodic channel gain of a multi-RIS-aided communication system that takes into account channel estimation errors. Then, we formulate the pilot power allocation problem to maximize the ergodic channel gain under imperfect CSI, subject to the average pilot power constraint. Then, the method of Lagrange multipliers is invoked to obtain the optimal pilot power allocation solution, which indicates that allocating more power to the pilots for estimating the weak reflection channels is capable of effectively improving the ergodic channel gain under imperfect CSI. Finally, extensive simulation results corroborate our theoretical analysis.