Most of the existing single object trackers track the target in a unitary local search window, making them particularly vulnerable to challenging factors such as heavy occlusions and out-of-view movements. Despite the attempts to further incorporate global search, prevailing mechanisms that cooperate local and global search are relatively static, thus are still sub-optimal for improving tracking performance. By further studying the local and global search results, we raise a question: can we allow more dynamics for cooperating both results? In this paper, we propose to introduce more dynamics by devising a dynamic attention-guided multi-trajectory tracking strategy. In particular, we construct dynamic appearance model that contains multiple target templates, each of which provides its own attention for locating the target in the new frame. Guided by different attention, we maintain diversified tracking results for the target to build multi-trajectory tracking history, allowing more candidates to represent the true target trajectory. After spanning the whole sequence, we introduce a multi-trajectory selection network to find the best trajectory that delivers improved tracking performance. Extensive experimental results show that our proposed tracking strategy achieves compelling performance on various large-scale tracking benchmarks. The project page of this paper can be found at https://sites.google.com/view/mt-track/.
The studies on black-box adversarial attacks have become increasingly prevalent due to the intractable acquisition of the structural knowledge of deep neural networks (DNNs). However, the performance of emerging attacks is negatively impacted when fooling DNNs tailored for high-resolution images. One of the explanations is that these methods usually focus on attacking the entire image, regardless of its spatial semantic information, and thereby encounter the notorious curse of dimensionality. To this end, we propose a pixel correlation-based attentional black-box adversarial attack, termed as PICA. Firstly, we take only one of every two neighboring pixels in the salient region as the target by leveraging the attentional mechanism and pixel correlation of images, such that the dimension of the black-box attack reduces. After that, a general multiobjective evolutionary algorithm is employed to traverse the reduced pixels and generate perturbations that are imperceptible by the human vision. Extensive experimental results have verified the effectiveness of the proposed PICA on the ImageNet dataset. More importantly, PICA is computationally more efficient to generate high-resolution adversarial examples compared with the existing black-box attacks.
Image enhancement aims at processing an input image so that the visual content of the output image is more pleasing or more useful for certain applications. Although histogram equalization is widely used in image enhancement due to its simplicity and effectiveness, it changes the mean brightness of the enhanced image and introduces a high level of noise and distortion. To address these problems, this paper proposes image enhancement using fuzzy intensity measure and adaptive clipping histogram equalization (FIMHE). FIMHE uses fuzzy intensity measure to first segment the histogram of the original image, and then clip the histogram adaptively in order to prevent excessive image enhancement. Experiments on the Berkeley database and CVF-UGR-Image database show that FIMHE outperforms state-of-the-art histogram equalization based methods.
Low-quality modalities contain not only a lot of noisy information but also some discriminative features in RGBT tracking. However, the potentials of low-quality modalities are not well explored in existing RGBT tracking algorithms. In this work, we propose a novel duality-gated mutual condition network to fully exploit the discriminative information of all modalities while suppressing the effects of data noise. In specific, we design a mutual condition module, which takes the discriminative information of a modality as the condition to guide feature learning of target appearance in another modality. Such module can effectively enhance target representations of all modalities even in the presence of low-quality modalities. To improve the quality of conditions and further reduce data noise, we propose a duality-gated mechanism and integrate it into the mutual condition module. To deal with the tracking failure caused by sudden camera motion, which often occurs in RGBT tracking, we design a resampling strategy based on optical flow algorithms. It does not increase much computational cost since we perform optical flow calculation only when the model prediction is unreliable and then execute resampling when the sudden camera motion is detected. Extensive experiments on four RGBT tracking benchmark datasets show that our method performs favorably against the state-of-the-art tracking algorithms
Vehicle re-identification (Re-ID) is an active task due to its importance in large-scale intelligent monitoring in smart cities. Despite the rapid progress in recent years, most existing methods handle vehicle Re-ID task in a supervised manner, which is both time and labor-consuming and limits their application to real-life scenarios. Recently, unsupervised person Re-ID methods achieve impressive performance by exploring domain adaption or clustering-based techniques. However, one cannot directly generalize these methods to vehicle Re-ID since vehicle images present huge appearance variations in different viewpoints. To handle this problem, we propose a novel viewpoint-aware clustering algorithm for unsupervised vehicle Re-ID. In particular, we first divide the entire feature space into different subspaces according to the predicted viewpoints and then perform a progressive clustering to mine the accurate relationship among samples. Comprehensive experiments against the state-of-the-art methods on two multi-viewpoint benchmark datasets VeRi and VeRi-Wild validate the promising performance of the proposed method in both with and without domain adaption scenarios while handling unsupervised vehicle Re-ID.
RGBT tracking has attracted increasing attention since RGB and thermal infrared data have strong complementary advantages, which could make trackers all-day and all-weather work. However, how to effectively represent RGBT data for visual tracking remains unstudied well. Existing works usually focus on extracting modality-shared or modality-specific information, but the potentials of these two cues are not well explored and exploited in RGBT tracking. In this paper, we propose a novel multi-adapter network to jointly perform modality-shared, modality-specific and instance-aware target representation learning for RGBT tracking. To this end, we design three kinds of adapters within an end-to-end deep learning framework. In specific, we use the modified VGG-M as the generality adapter to extract the modality-shared target representations.To extract the modality-specific features while reducing the computational complexity, we design a modality adapter, which adds a small block to the generality adapter in each layer and each modality in a parallel manner. Such a design could learn multilevel modality-specific representations with a modest number of parameters as the vast majority of parameters are shared with the generality adapter. We also design instance adapter to capture the appearance properties and temporal variations of a certain target. Moreover, to enhance the shared and specific features, we employ the loss of multiple kernel maximum mean discrepancy to measure the distribution divergence of different modal features and integrate it into each layer for more robust representation learning. Extensive experiments on two RGBT tracking benchmark datasets demonstrate the outstanding performance of the proposed tracker against the state-of-the-art methods.
RGB and thermal source data suffer from both shared and specific challenges, and how to explore and exploit them plays a critical role to represent the target appearance in RGBT tracking. In this paper, we propose a novel challenge-aware neural network to handle the modality-shared challenges (e.g., fast motion, scale variation and occlusion) and the modality-specific ones (e.g., illumination variation and thermal crossover) for RGBT tracking. In particular, we design several parameter-shared branches in each layer to model the target appearance under the modality-shared challenges, and several parameterindependent branches under the modality-specific ones. Based on the observation that the modality-specific cues of different modalities usually contains the complementary advantages, we propose a guidance module to transfer discriminative features from one modality to another one, which could enhance the discriminative ability of some weak modality. Moreover, all branches are aggregated together in an adaptive manner and parallel embedded in the backbone network to efficiently form more discriminative target representations. These challenge-aware branches are able to model the target appearance under certain challenges so that the target representations can be learnt by a few parameters even in the situation of insufficient training data. From the experimental results we will show that our method operates at a real-time speed while performing well against the state-of-the-art methods on three benchmark datasets.
Recent advances in high refresh rate displays as well as the increased interest in high rate of slow motion and frame up-conversion fuel the demand for efficient and cost-effective multi-frame video interpolation solutions. To that regard, inserting multiple frames between consecutive video frames are of paramount importance for the consumer electronics industry. State-of-the-art methods are iterative solutions interpolating one frame at the time. They introduce temporal inconsistencies and clearly noticeable visual artifacts. Departing from the state-of-the-art, this work introduces a true multi-frame interpolator. It utilizes a pyramidal style network in the temporal domain to complete the multi-frame interpolation task in one-shot. A novel flow estimation procedure using a relaxed loss function, and an advanced, cubic-based, motion model is also used to further boost interpolation accuracy when complex motion segments are encountered. Results on the Adobe240 dataset show that the proposed method generates visually pleasing, temporally consistent frames, outperforms the current best off-the-shelf method by 1.57db in PSNR with 8 times smaller model and 7.7 times faster. The proposed method can be easily extended to interpolate a large number of new frames while remaining efficient because of the one-shot mechanism.
RGBT salient object detection (SOD) aims to segment the common prominent regions by exploring and exploiting the complementary information of visible and thermal infrared images. However, existing methods simply integrate features of these two modalities, and thus could not explore the potentials of their complementarity. In this paper, we propose a novel multi-interactive encoder-decoder network to achieve an elaborative fusion for RGBT SOD. Our network relies on an encoder-decoder for the feature extraction and fusion, and we design a multi-interaction block (MIB) to model the interactions of different modalities, different layers and local-global information. In particular, we interact and integrate the multi-level features of different modalities in a two-stream decoder, which could fuse modal information sufficiently while maintaining their own specific feature representations for more robust detection performance. Moreover, each MIB block accepts both information from previous MIB and global context to restore more spatial details and object semantics respectively. Extensive experiments on the existing RGBT SOD datasets show that the proposed method achieves outstanding performance against the state-of-the-art algorithms.