Multimodal recommendation aims to model user and item representations comprehensively by incorporating multimedia content. Existing research has shown that combining (user and item) ID embeddings with salient multimodal features improves recommendation performance, indicating the value of IDs. However, the literature lacks a thorough analysis of ID embeddings in terms of feature semantics. In this paper, we revisit the value of ID embeddings for multimodal recommendation and conduct a thorough study of their semantics, which we recognize as subtle features of content and structure. We then propose a novel recommendation model that incorporates ID embeddings to enhance the semantic features of both content and structure. Specifically, we put forward a hierarchical attention mechanism that incorporates ID embeddings in modality fusion, coupled with contrastive learning, to enhance content representations. Meanwhile, we propose a lightweight graph convolutional network for each modality that amalgamates neighborhood and ID embeddings to improve structural representations. Finally, the content and structure representations are combined to form the final item embedding for recommendation. Extensive experiments on three real-world datasets (Baby, Sports, and Clothing) demonstrate the superiority of our method over state-of-the-art multimodal recommendation methods and the effectiveness of fine-grained ID embeddings.
Cross-modal retrieval has become popular in recent years, particularly with the rise of multimedia. Generally, each modality exhibits distinct representations and semantic information, so features encoded with a dual-tower architecture tend to lie in separate latent spaces. This makes it difficult to establish semantic relationships between modalities and results in poor retrieval performance. To address this issue, we propose a novel framework for cross-modal retrieval that consists of a cross-modal mixer, a masked autoencoder for pre-training, and a cross-modal retriever for downstream tasks. Specifically, we first adopt the cross-modal mixer and mask modeling to fuse the original modalities and eliminate redundancy. Then, an encoder-decoder architecture is applied to perform a fuse-then-separate task in the pre-training phase. We feed masked fused representations into the encoder and reconstruct them with the decoder, ultimately separating the original data of the two modalities. In downstream tasks, we use the pre-trained encoder to build the cross-modal retrieval method. Extensive experiments on two real-world datasets show that our approach outperforms previous state-of-the-art methods in video-audio matching tasks, improving retrieval accuracy by up to 2 times. Furthermore, we demonstrate the generality of our model by transferring it to other downstream tasks as a universal model.
Adversarial examples pose a considerable security threat to support vector machines (SVMs), especially those used in safety-critical applications. Robustness verification is therefore an essential issue for SVMs, since it can provide provable robustness against various kinds of adversarial attacks. The evaluation results obtained through robustness verification can provide a safety guarantee for the use of SVMs. The existing verification method often does not perform well on SVMs with nonlinear kernels. To this end, we propose a method to improve verification performance for SVMs with nonlinear kernels. We first formalize the adversarial robustness evaluation of SVMs as an optimization problem. A lower bound of the original problem is then obtained by solving its Lagrangian dual problem. Finally, the adversarial robustness of SVMs is evaluated with respect to this lower bound. We evaluate the adversarial robustness of SVMs with linear and nonlinear kernels on the MNIST and Fashion-MNIST datasets. The experimental results show that the percentage of provably robust test samples obtained by our method is higher than that of the state-of-the-art.
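As an illustration of what a robustness certificate looks like, the sketch below covers only the simplest case: a linear kernel under an L-infinity perturbation budget. Both choices are assumptions made here, as the abstract does not fix the norm. In this case the inner minimization has a closed form, so no dual relaxation is needed; nonlinear kernels are where a lower bound such as the Lagrangian dual becomes necessary. The function name and arguments are hypothetical.

```python
def certify_linear_svm(w, b, x, y, eps):
    """Certify a linear SVM f(x) = w.x + b at point x with label y in {-1, +1}.

    Over all perturbations with ||delta||_inf <= eps, the worst-case
    signed margin is  y * f(x) - eps * ||w||_1.  If that value is
    positive, no admissible perturbation can flip the prediction.
    Returns (is_robust, worst_case_margin).
    """
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    worst_case = margin - eps * sum(abs(wi) for wi in w)
    return worst_case > 0, worst_case


if __name__ == "__main__":
    w, b = [2.0, -1.0], 0.5          # illustrative weights, not from the paper
    x, y = [1.0, 1.0], 1
    print(certify_linear_svm(w, b, x, y, eps=0.4))
    print(certify_linear_svm(w, b, x, y, eps=0.6))
```

The fraction of test points for which the first return value is true corresponds to the "percentage of provable robustness" reported for a given perturbation budget.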
Unmanned aerial vehicle (UAV)-based tracking is attracting increasing attention and developing rapidly in applications such as agriculture, aviation, navigation, transportation and public security. Recently, discriminative correlation filter (DCF)-based trackers have stood out in the UAV tracking community for their high efficiency and appealing robustness on a single CPU. However, due to limited onboard computation resources and other challenges, the efficiency and accuracy of existing DCF-based approaches are still not satisfactory. In this paper, we explore using GrabCut segmentation to improve the widely adopted discriminative scale estimation in DCF-based trackers. Scale estimation greatly impacts the precision and accuracy of these trackers, since accumulated scale error degrades the appearance model as online updating proceeds. Meanwhile, inspired by residue representation, we exploit the residue nature inherent to videos and propose residue-aware correlation filters that show better convergence properties in filter learning. Extensive experiments are conducted on four UAV benchmarks, namely UAV123@10fps, DTB70, UAVDT and VisDrone2018 (VisDrone2018-test-dev). The results show that our method achieves state-of-the-art performance.
Unsupervised crowd counting is a challenging yet largely unexplored task. In this paper, we explore it in a transfer learning setting, where we learn to detect and count persons in an unlabeled target set by transferring bi-knowledge learnt from regression- and detection-based models in a labeled source set. The dual source knowledge of the two models is heterogeneous and complementary, as they capture different modalities of the crowd distribution. We formulate the mutual transformations between the outputs of regression- and detection-based models as two scene-agnostic transformers, which enable knowledge distillation between the two models. Given the regression- and detection-based models and their mutual transformers learnt in the source, we introduce an iterative self-supervised learning scheme with regression-detection bi-knowledge transfer in the target. Extensive experiments on standard crowd counting benchmarks (ShanghaiTech, UCF_CC_50, and UCF_QNRF) demonstrate a substantial improvement of our method over other state-of-the-art methods in the transfer learning setting.
This paper proposes an algorithm named PrTransH to learn embedding vectors from medical knowledge derived from real-world EMR data. The unique challenge in embedding a medical knowledge graph built from real-world EMR data is that the uncertainty of knowledge triplets blurs the border between a "correct triplet" and a "wrong triplet", which changes the fundamental assumption of many existing algorithms. To address this challenge, we make several enhancements to the existing TransH algorithm: 1) incorporating the probability of each medical knowledge triplet into the training objective; 2) replacing the margin-based ranking loss with a unified loss calculation that considers both valid and corrupted triplets; 3) augmenting the training data set with medical background knowledge. Verification on a medical knowledge graph built from real-world EMR data shows that PrTransH outperforms TransH in the link prediction task. To the best of our knowledge, this paper is the first to learn and verify knowledge embedding on probabilistic knowledge graphs.
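To make the starting point concrete, the sketch below computes the standard TransH score, which projects head and tail entities onto a relation-specific hyperplane before applying the translation. The second function, `probability_loss`, is purely hypothetical: it shows one way a triplet probability could enter a unified loss (with corrupted triplets assigned probability 0), and should not be read as the PrTransH objective, which the abstract does not specify.

```python
import math


def transh_score(h, t, w_r, d_r):
    """TransH distance ||h_perp + d_r - t_perp||^2 (lower = more plausible).

    h, t : entity embeddings; w_r : hyperplane normal; d_r : translation.
    """
    norm = math.sqrt(sum(v * v for v in w_r))
    w = [v / norm for v in w_r]                  # unit-length normal

    def project(e):
        # Remove the component of e along the hyperplane normal.
        dot = sum(wi * ei for wi, ei in zip(w, e))
        return [ei - dot * wi for ei, wi in zip(e, w)]

    h_p, t_p = project(h), project(t)
    return sum((a + d - b) ** 2 for a, d, b in zip(h_p, d_r, t_p))


def probability_loss(score, prob):
    """Hypothetical loss: squared gap between exp(-score) and the
    triplet probability (use prob = 0.0 for a corrupted triplet)."""
    return (math.exp(-score) - prob) ** 2


if __name__ == "__main__":
    s = transh_score(h=[1.0, 2.0], t=[2.0, 5.0], w_r=[0.0, 2.0], d_r=[1.0, 0.0])
    print(s, probability_loss(s, prob=0.5))
```

Because the loss compares a single plausibility value against a target probability, valid and corrupted triplets go through the same expression, rather than being paired in a margin-based ranking term.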
Modern crowd counting methods usually employ deep neural networks (DNNs) to estimate crowd counts via density regression. Despite their significant improvements, regression-based methods are incapable of detecting individuals in crowds. Detection-based methods, on the other hand, have not been widely explored in recent crowd counting work because they need expensive bounding box annotations. In this work, we instead propose a new deep detection network that requires only point supervision. It can simultaneously detect the size and location of human heads and count them in crowds. We first mine useful person size information from point-level annotations and initialize the pseudo ground truth bounding boxes. An online updating scheme is introduced to refine the pseudo ground truth during training, while a locally-constrained regression loss is designed to provide additional constraints on the size of the predicted boxes in a local neighborhood. Finally, we propose a curriculum learning strategy that trains the network first on images with relatively accurate and easy pseudo ground truth. Extensive experiments are conducted on both detection and counting tasks on several standard benchmarks, e.g. the ShanghaiTech, UCF_CC_50, WiderFace, and TRANCOS datasets, and the results show the superiority of our method over the state-of-the-art.
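The initialization step can be sketched as follows. The heuristic used here, setting each box side to the distance to the nearest annotated neighbor, is an assumption for illustration; the abstract only states that size information is mined from point-level annotations. The function name and the `default_size` fallback are hypothetical.

```python
import math


def init_pseudo_boxes(points, default_size=32.0):
    """Build square pseudo boxes (x1, y1, x2, y2) around head points.

    points       : list of (x, y) head-center annotations for one image.
    default_size : side length used when an image has a single point
                   (assumed value, not taken from the paper).
    """
    boxes = []
    for i, (x, y) in enumerate(points):
        # Distances from this point to every other annotated point.
        dists = [math.hypot(x - px, y - py)
                 for j, (px, py) in enumerate(points) if j != i]
        side = min(dists) if dists else default_size
        half = side / 2.0
        boxes.append((x - half, y - half, x + half, y + half))
    return boxes


if __name__ == "__main__":
    heads = [(0.0, 0.0), (10.0, 0.0), (10.0, 4.0)]
    for box in init_pseudo_boxes(heads):
        print(box)
```

In dense regions neighboring heads are close together, so such boxes come out small, while in sparse regions they tend to be too large. This is one reason an online refinement of the pseudo ground truth during training is useful.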
In reinforcement learning (RL), one of the key components is policy evaluation, which aims to estimate the value function (i.e., the expected long-term accumulated reward) of a policy. With a good policy evaluation method, RL algorithms estimate the value function more accurately and find a better policy. When the state space is large or continuous, Gradient-based Temporal Difference (GTD) policy evaluation algorithms with linear function approximation are widely used. Since collecting evaluation data costs both time and reward, a clear understanding of the finite sample performance of policy evaluation algorithms is very important to reinforcement learning. Under the assumption that data are generated i.i.d., previous work provided a finite sample analysis of the GTD algorithms with constant step size by converting them into convex-concave saddle point problems. However, it is well known that in RL problems the data are generated from Markov processes rather than i.i.d. In this paper, in the realistic Markov setting, we derive finite sample bounds for general convex-concave saddle point problems, and hence for the GTD algorithms. Our bounds lead to the following observations. (1) With various step-size settings, GTD algorithms converge. (2) The convergence rate is determined by the step size, with the mixing time of the Markov process as the coefficient. The faster the Markov process mixes, the faster the convergence. (3) The experience replay trick is effective because it improves the mixing property of the Markov process. To the best of our knowledge, our analysis is the first to provide finite sample bounds for the GTD algorithms in the Markov setting.
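For readers unfamiliar with the algorithm family, the sketch below shows a single update of GTD2, one well-known member of the GTD family, with linear function approximation. It assumes the on-policy case (no importance weights) and constant step sizes `alpha` and `beta`; the function name and the toy transition are illustrative only.

```python
def gtd2_step(theta, w, phi, phi_next, reward, gamma, alpha, beta):
    """One GTD2 update for a transition (phi, reward, phi_next).

    theta : value-function weights, V(s) ~ theta . phi(s)
    w     : auxiliary weights of the saddle-point formulation
    Both updates use the values of theta and w from before the step.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Temporal-difference error under the current value estimate.
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi)
    phi_w = dot(phi, w)

    new_theta = [th + alpha * (p - gamma * pn) * phi_w
                 for th, p, pn in zip(theta, phi, phi_next)]
    new_w = [wi + beta * (delta - phi_w) * p
             for wi, p in zip(w, phi)]
    return new_theta, new_w


if __name__ == "__main__":
    theta, w = [0.0, 0.0], [0.0, 0.0]
    for _ in range(2):  # replay the same toy transition twice
        theta, w = gtd2_step(theta, w, phi=[1.0, 0.0], phi_next=[0.0, 1.0],
                             reward=1.0, gamma=0.5, alpha=0.1, beta=0.1)
    print(theta, w)
```

In practice the transitions fed to such an update come from consecutive steps of a single trajectory, so successive samples are correlated. This is the Markov setting that the analysis addresses, in contrast to the i.i.d. assumption of earlier work.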