In pursuit of faster computation, Efficient Transformers demonstrate an impressive variety of approaches: models attaining sub-quadratic attention complexity can exploit a notion of sparsity or a low-rank approximation of the inputs to reduce the number of attended keys; other ways to reduce complexity include locality-sensitive hashing, key pooling, additional memory to store information in compacted form, or hybridization with other architectures, such as CNNs. Often grounded in a strong mathematical basis, kernelized approaches allow attention to be approximated with linear complexity while retaining high accuracy. In the present paper, we therefore aim to expand the idea of trainable kernel methods to approximate the self-attention mechanism of the Transformer architecture.
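The kernelized linear-attention idea referred to above can be sketched as follows. The feature map used here (elu(x) + 1, a positive map popularized by some linear-attention work) is an illustrative assumption, not the paper's method; the point is only that replacing exp(q·k) with φ(q)·φ(k) lets the matrix product be reassociated, dropping the cost from O(n²d) to O(nd²):

```python
import numpy as np

def feature_map(x):
    # Illustrative positive feature map, elu(x) + 1.  Real kernelized
    # attention methods may use more principled maps (e.g. random features).
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Kernelized attention in O(n * d^2) instead of O(n^2 * d).

    Standard softmax attention materializes an n x n matrix.  With
    exp(q.k) approximated by phi(q).phi(k), the product reassociates as
    phi(Q) @ (phi(K)^T @ V), never forming the n x n matrix.
    """
    Qf, Kf = feature_map(Q), feature_map(K)   # (n, d) each
    KV = Kf.T @ V                             # (d, d_v), computed once
    Z = Qf @ Kf.sum(axis=0)                   # (n,) per-query normalizers
    return (Qf @ KV) / Z[:, None]

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V = rng.normal(size=(3, n, d))
out = linear_attention(Q, K, V)
print(out.shape)  # (6, 4)
```

Because the feature map is positive, each output row is still a convex combination of the value rows, exactly as in the equivalent quadratic-cost formulation.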
An automatic gun detection system can detect potential gun-related violence at an early stage, which is of paramount importance for citizens' safety. Within such a system, the object detection algorithm is the key component for perceiving the environment, so that dangerous objects such as pistols and rifles can be detected. However, mainstream deep learning-based object detection algorithms depend heavily on large-scale, high-quality annotated samples, and existing gun datasets are characterized by low resolution, little contextual information, and small data volume. To promote the development of security applications, this work presents a new challenging dataset called the YouTube Gun Detection Dataset (YouTube-GDD). Our dataset is collected from 343 high-definition YouTube videos and contains 5,000 well-chosen images, in which 16,064 instances of gun and 9,046 instances of person are annotated. Compared to other datasets, YouTube-GDD is "dynamic": it contains rich contextual information and records shape changes of the gun during shooting. To establish a baseline for gun detection, we evaluate YOLOv5 on YouTube-GDD and analyze the influence of additional related annotated information on gun detection. YouTube-GDD and subsequent updates will be released at https://github.com/UCAS-GYX/YouTube-GDD.
A key problem in the medical visual question answering task is how to effectively fuse the features of language and medical images with limited datasets. To better utilize the multi-scale information of medical images, previous methods directly embed the multi-stage visual feature maps as tokens of the same size and fuse them with the text representation. However, this confuses the visual features of different stages. To this end, we propose a simple but powerful multi-stage feature fusion method, MF2-MVQA, which stage-wise fuses multi-level visual features with textual semantics. MF2-MVQA achieves state-of-the-art performance on the VQA-Med 2019 and VQA-RAD datasets. Visualization results further verify that our model outperforms previous work.
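One schematic reading of "stage-wise fusion" is that the fused representation is updated one visual stage at a time, rather than mixing all stages into a single same-size token set. The update rule below (gated additive fusion with a shared projection `W`) is purely an illustrative assumption, not the MF2-MVQA architecture:

```python
import numpy as np

def stage_wise_fusion(visual_stages, text, W):
    """Fuse multi-stage visual features with a text vector, stage by stage.

    Each stage is summarized separately and folded into the running fused
    representation, so features from different stages are never mixed in
    one flat token set.  The sigmoid-gated additive update is a
    hypothetical choice for illustration only.
    """
    fused = text                                   # (d,) text representation
    for stage in visual_stages:                    # each stage: (n_i, d)
        pooled = stage.mean(axis=0)                # summarize this stage
        gate = 1.0 / (1.0 + np.exp(-(W @ pooled))) # sigmoid gate, (d,)
        fused = fused + gate * pooled              # stage-wise update
    return fused

rng = np.random.default_rng(2)
d = 8
stages = [rng.normal(size=(n, d)) for n in (64, 16, 4)]  # coarse -> fine
text = rng.normal(size=d)
W = rng.normal(size=(d, d))
fused = stage_wise_fusion(stages, text, W)
print(fused.shape)  # (8,)
```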
The reconfigurable intelligent surface (RIS) has drawn considerable attention for its ability to enhance the performance of not only wireless communication but also indoor localization at low cost. This paper investigates the performance limits of RIS-based near-field localization in the asynchronous scenario and analyzes the impact of each part of the cascaded channel on the localization performance. The Fisher information matrix (FIM) and the position error bound (PEB) are derived. In addition, we derive the equivalent Fisher information (EFI) for the position-related intermediate parameters. Enabled by the derived EFI, we show that information on both the distance and the direction of the user can be obtained when the near-field spherical wavefront is considered for the RIS-user equipment (UE) part of the channel, while only the direction of the UE can be inferred in the far-field scenario. For the base station (BS)-RIS part of the channel, we reveal that this part of the channel determines the type of gain provided by the BS antenna array. We also show that the well-known focusing control scheme for the RIS, which maximizes the received SNR, is not always a good choice and may degrade localization performance in the asynchronous scenario. Simulation results validate the analytical work, and the impact of the focusing control scheme on the PEB performance under synchronous and asynchronous conditions is also investigated.
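The FIM-to-PEB pipeline named in this abstract follows a standard recipe. The generic forms below, written for a Gaussian observation model with noise variance $\sigma^2$, noise-free received signal $\boldsymbol{\mu}$, and parameter vector $\boldsymbol{\eta}$ containing the position $\mathbf{p}$ plus nuisance parameters $\bar{\boldsymbol{\eta}}$ (e.g. the unknown clock offset in the asynchronous case), illustrate the quantities involved and are not the paper's exact expressions:

```latex
\mathbf{J}(\boldsymbol{\eta})
  = \frac{2}{\sigma^2}\,
    \Re\!\left\{
      \left(\frac{\partial \boldsymbol{\mu}}{\partial \boldsymbol{\eta}}\right)^{\!\mathsf{H}}
      \frac{\partial \boldsymbol{\mu}}{\partial \boldsymbol{\eta}}
    \right\},
\qquad
\mathbf{J}_e(\mathbf{p})
  = \mathbf{J}_{\mathbf{p}\mathbf{p}}
    - \mathbf{J}_{\mathbf{p}\bar{\boldsymbol{\eta}}}\,
      \mathbf{J}_{\bar{\boldsymbol{\eta}}\bar{\boldsymbol{\eta}}}^{-1}\,
      \mathbf{J}_{\bar{\boldsymbol{\eta}}\mathbf{p}},
\qquad
\mathrm{PEB} = \sqrt{\operatorname{tr}\!\left(\mathbf{J}_e(\mathbf{p})^{-1}\right)}
```

The EFI $\mathbf{J}_e(\mathbf{p})$ is the Schur complement of the nuisance block, so the resulting PEB accounts for the cost of jointly estimating the nuisance parameters, which is exactly where the synchronous and asynchronous scenarios differ.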
The great progress of 3D object detectors relies on large-scale data and 3D annotations. Annotating 3D bounding boxes is extremely expensive, while 2D ones are easier and cheaper to collect. In this paper, we introduce a hybrid training framework that enables learning a visual 3D object detector from massive 2D (pseudo) labels, even without 3D annotations. To break through the information bottleneck of 2D clues, we explore a new perspective: temporal 2D supervision. We propose a temporal 2D transformation to bridge 3D predictions with temporal 2D labels. Two steps, homography warping and 2D box deduction, transform the 3D predictions into 2D ones for supervision. Experiments conducted on the nuScenes dataset show strong results (nearly 90% of the fully-supervised performance) with only 25% of the 3D annotations. We hope our findings can provide new insights into using large numbers of 2D annotations for 3D perception.
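The "2D box deduction" step, projecting a predicted 3D box into a 2D box that 2D labels can supervise, can be sketched generically: project the eight box corners through the camera intrinsics and take the axis-aligned hull. This is a minimal sketch (axis-aligned boxes, camera-frame coordinates, no yaw, and the homography-warping step omitted), not the paper's implementation:

```python
import numpy as np

def box3d_corners(center, size):
    # The 8 corners of an axis-aligned 3D box in camera coordinates.
    # (Real detection boxes also carry a yaw angle, omitted here.)
    signs = np.array([[sx, sy, sz] for sx in (-1, 1)
                      for sy in (-1, 1) for sz in (-1, 1)])
    return np.array(center) + 0.5 * signs * np.array(size)

def deduce_2d_box(corners_3d, K):
    """Project 3D corners with intrinsics K; return their axis-aligned hull."""
    uvw = (K @ corners_3d.T).T             # (8, 3) homogeneous pixel coords
    uv = uvw[:, :2] / uvw[:, 2:3]          # perspective divide
    return np.array([uv[:, 0].min(), uv[:, 1].min(),
                     uv[:, 0].max(), uv[:, 1].max()])   # [x1, y1, x2, y2]

# Hypothetical pinhole intrinsics: f = 500 px, principal point (320, 240).
K = np.array([[500., 0., 320.],
              [0., 500., 240.],
              [0., 0., 1.]])
corners = box3d_corners((0.0, 0.0, 10.0), (4.0, 2.0, 1.5))
box2d = deduce_2d_box(corners, K)
print(box2d)
```

The resulting 2D box can then be compared against a (temporal) 2D label, giving the 3D predictions a supervision signal without any 3D annotation.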
The vulnerability of high-performance machine learning models implies a security risk in applications with real-world consequences. Research on adversarial attacks is beneficial both for guiding the development of machine learning models and for finding targeted defenses. However, most adversarial attacks today leverage gradient or logit information from the model to generate adversarial perturbations. Work in the more realistic setting of decision-based attacks, which generate adversarial perturbations solely by observing the output label of the targeted model, is still relatively rare and mostly relies on gradient-estimation strategies. In this work, we propose a pixel-wise decision-based attack algorithm that finds a distribution of adversarial perturbations through a reinforcement learning algorithm. We call this method Decision-based Black-box Attack with Reinforcement learning (DBAR). Experiments show that the proposed approach outperforms state-of-the-art decision-based attacks with a higher attack success rate and greater transferability.
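The decision-based setting itself, access to predicted labels only, no gradients or logits, can be illustrated with a simple random-search attack. This is a generic sketch of the threat model, not DBAR: the paper's method instead learns a distribution of perturbations with reinforcement learning, which is not shown here.

```python
import numpy as np

def label_only_attack(predict, x, step=0.5, iters=300, seed=0):
    """Random-search sketch of a decision-based (label-only) attack.

    Only predict(.) -> label is observed.  Candidates are proposed around
    the best adversarial example found so far; a candidate is kept if it
    flips the label and lies closer to the original input.
    """
    rng = np.random.default_rng(seed)
    orig_label = predict(x)
    best = None                         # closest adversarial example so far
    for _ in range(iters):
        base = x if best is None else best
        cand = base + step * rng.normal(size=x.shape)
        if predict(cand) != orig_label:          # label flipped: adversarial
            if best is None or np.linalg.norm(cand - x) < np.linalg.norm(best - x):
                best = cand
    return best

# Toy "model" for demonstration: the label is the sign of the first coordinate.
predict = lambda z: int(z[0] > 0)
x = np.array([0.3, 1.0])
adv = label_only_attack(predict, x)
print(predict(x), predict(adv))
```

Even this crude search succeeds against the toy model; the appeal of learning a perturbation distribution, as DBAR does, is query efficiency and transferability rather than feasibility.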
Recent research has shown that static word embeddings can encode word frequency information. However, little has been studied about this phenomenon and its effects on downstream tasks. In the present work, we systematically study the association between frequency and semantic similarity in several static word embeddings. We find that Skip-gram, GloVe and FastText embeddings tend to produce higher semantic similarity between high-frequency words than between other frequency combinations. We show that the association between frequency and similarity also appears when words are randomly shuffled, which indicates that the patterns found are not due to real semantic associations present in the texts but are an artifact produced by the word embeddings. Finally, we provide an example of how word frequency can strongly impact the measurement of gender bias with embedding-based metrics. In particular, we carry out a controlled experiment showing that biases can even change sign or reverse their order when word frequencies are manipulated.
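The core measurement in such a study, average cosine similarity broken down by frequency band, can be sketched as follows. The toy random vectors and the two-band split are illustrative assumptions; the real analysis would use trained embeddings and corpus frequencies:

```python
import numpy as np

def mean_similarity_by_band(vectors, freqs, n_bands=2):
    """Mean pairwise cosine similarity within and between frequency bands.

    Words are sorted by frequency and split into equal-size bands
    (band 0 = lowest frequency); self-similarities on the diagonal are
    excluded from within-band averages.
    """
    vecs = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    bands = np.array_split(np.argsort(freqs), n_bands)
    sims = vecs @ vecs.T
    result = {}
    for i, bi in enumerate(bands):
        for j, bj in enumerate(bands):
            if i > j:
                continue
            block = sims[np.ix_(bi, bj)]
            if i == j:
                mask = ~np.eye(len(bi), dtype=bool)   # drop sim(w, w) = 1
                result[(i, j)] = block[mask].mean()
            else:
                result[(i, j)] = block.mean()
    return result

rng = np.random.default_rng(1)
vectors = rng.normal(size=(100, 50))                 # toy "embeddings"
freqs = rng.integers(1, 10_000, size=100)            # toy word frequencies
result = mean_similarity_by_band(vectors, freqs)
print(result)
```

The shuffling control described in the abstract amounts to permuting `freqs` relative to `vectors` and checking whether the high-frequency band still shows elevated similarity.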
In this thesis, we investigate the relevance, faithfulness, and succinctness aspects of long-form question answering (LFQA). LFQA aims to generate an in-depth, paragraph-length answer to a given question, helping bridge the gap between real scenarios and existing open-domain QA models, which can only extract short-span answers. LFQA is quite challenging and under-explored, and little work has been done to build an effective LFQA system. It is even more challenging to generate a high-quality long-form answer that is relevant to the query and faithful to the facts, since a considerable amount of redundant, complementary, or contradictory information is contained in the retrieved documents. Moreover, no prior work has investigated generating succinct answers. We are among the first to research the LFQA task, and we pioneered the research direction of improving answer quality in terms of 1) query relevance, 2) answer faithfulness, and 3) answer succinctness.
We propose the Unified Model of Saliency and Scanpaths (UMSS), a model that learns to predict visual saliency and scanpaths (i.e. sequences of eye fixations) on information visualisations. Although scanpaths provide rich information about the importance of different visualisation elements during the visual exploration process, prior work has been limited to predicting aggregated attention statistics, such as visual saliency. We present in-depth analyses of gaze behaviour for different information visualisation elements (e.g. Title, Label, Data) on the popular MASSVIS dataset. We show that while gaze patterns are, overall, surprisingly consistent across visualisations and viewers, there are also structural differences in gaze dynamics for different elements. Informed by our analyses, UMSS first predicts multi-duration element-level saliency maps, then probabilistically samples scanpaths from them. Extensive experiments on MASSVIS show that our method consistently outperforms state-of-the-art methods on several widely used scanpath and saliency evaluation metrics. Our method achieves a relative improvement in sequence score of 11.5% for scanpath prediction and a relative improvement in Pearson correlation coefficient of up to 23.6% for saliency prediction. These results are promising and point towards richer user models and simulations of visual attention on visualisations without the need for any eye tracking equipment.
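The "probabilistically samples scanpaths" step can be sketched generically: treat a nonnegative saliency map as an unnormalised probability distribution over pixels and draw fixations from it. The inhibition-of-return rule used below (zeroing each visited location) is an illustrative assumption, not necessarily UMSS's sampling scheme:

```python
import numpy as np

def sample_scanpath(saliency, n_fixations=5, seed=0):
    """Sample a fixation sequence (scanpath) from a saliency map.

    The map is renormalised to a distribution at every step; a simple
    inhibition-of-return zeroes each sampled location so that fixations
    spread out over the visualisation.
    """
    rng = np.random.default_rng(seed)
    sal = saliency.astype(float).copy()
    h, w = sal.shape
    path = []
    for _ in range(n_fixations):
        p = sal.ravel() / sal.sum()          # unnormalised map -> distribution
        idx = rng.choice(h * w, p=p)
        y, x = divmod(idx, w)
        path.append((y, x))
        sal[y, x] = 0.0                      # inhibition of return
    return path

# Toy saliency map: two salient cells plus a small uniform floor.
saliency = np.zeros((4, 4))
saliency[1, 2] = 5.0
saliency[3, 0] = 1.0
saliency += 0.01
path = sample_scanpath(saliency, n_fixations=3)
print(path)
```

In a multi-duration, element-level setting, one would sample from a different map per fixation duration; the single-map version above shows only the basic mechanism.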
Generalization error bounds are essential to understanding machine learning algorithms. This paper presents novel expected generalization error upper bounds based on the average joint distribution between the output hypothesis and each input training sample. Multiple generalization error upper bounds based on different information measures are provided, including Wasserstein distance, total variation distance, KL divergence, and Jensen-Shannon divergence. Due to the convexity of the information measures, the proposed bounds in terms of Wasserstein distance and total variation distance are shown to be tighter than their counterparts based on individual samples in the literature. An example is provided to demonstrate the tightness of the proposed generalization error bounds.
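The structure of such average-joint-distribution bounds can be written generically. Assuming a loss that is $L$-Lipschitz, a Wasserstein-type bound of this kind typically takes the following form (the symbols are generic illustrations, not the paper's exact statement):

```latex
\bigl|\overline{\mathrm{gen}}\bigr|
  \;\le\; L\, W\!\left(\frac{1}{n}\sum_{i=1}^{n} P_{W,Z_i},\; P_W \otimes P_Z\right)
  \;\le\; \frac{L}{n}\sum_{i=1}^{n} W\!\bigl(P_{W,Z_i},\; P_W \otimes P_{Z_i}\bigr)
```

Here $W$ is the hypothesis, $Z_i$ the $i$-th training sample, and $W(\cdot,\cdot)$ the Wasserstein distance. The second inequality follows from the joint convexity of the Wasserstein distance, which is exactly why the bound based on the average joint distribution (the middle term) is tighter than the per-sample average on the right; the analogous argument applies to total variation distance.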