Ling Shao

Terminus Group, Beijing, China

ASOD60K: Audio-Induced Salient Object Detection in Panoramic Videos

Jul 24, 2021

Yi Zhang, Fang-Yi Chao, Ge-Peng Ji, Deng-Ping Fan, Lu Zhang, Ling Shao

Abstract: Exploring what humans pay attention to in dynamic panoramic scenes is useful for many fundamental applications, including augmented reality (AR) in retail, AR-powered recruitment, and visual language navigation. With this goal in mind, we propose PV-SOD, a new task that aims to segment salient objects from panoramic videos. In contrast to existing fixation-level or object-level saliency detection tasks, we focus on multi-modal salient object detection (SOD), which mimics the human attention mechanism by segmenting salient objects with the guidance of audio-visual cues. To support this task, we collect the first large-scale dataset, named ASOD60K, which contains 4K-resolution video frames annotated with a six-level hierarchy, thus distinguishing itself in richness, diversity and quality. Specifically, each sequence is marked with both its super-class and sub-class, with objects of each sub-class being further annotated with human eye fixations, bounding boxes, object-/instance-level masks, and associated attributes (e.g., geometrical distortion). These coarse-to-fine annotations enable detailed analysis for PV-SOD modeling, e.g., determining the major challenges for existing SOD models, and predicting scanpaths to study the long-term eye fixation behaviors of humans. We systematically benchmark 11 representative approaches on ASOD60K and derive several interesting findings. We hope this study can serve as a good starting point for advancing SOD research towards panoramic videos.

* 22 pages, 17 figures, 7 tables (Project Page: https://github.com/PanoAsh/ASOD60K)

PVTv2: Improved Baselines with Pyramid Vision Transformer

Jul 17, 2021

Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao

Abstract: Transformers have recently shown encouraging progress in computer vision. In this work, we present new baselines that improve the original Pyramid Vision Transformer (abbreviated as PVTv1) with three designs: (1) overlapping patch embedding, (2) convolutional feed-forward networks, and (3) linear complexity attention layers. With these modifications, our PVTv2 significantly improves on PVTv1 in three tasks: classification, detection, and segmentation. Moreover, PVTv2 achieves comparable or better performance than recent works such as Swin Transformer. We hope this work will facilitate state-of-the-art Transformer research in computer vision. Code is available at https://github.com/whai362/PVT.

* Technical Report
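
As a rough illustration of design (1), the sketch below computes how many tokens a patch-embedding layer produces along one side of an image, using the standard convolution output-size formula. The function names, the 224-pixel input, and the kernel/stride/padding values (4, 4, 0 for the non-overlapping case and 7, 4, 3 for the overlapping case) are assumptions chosen for illustration; the abstract does not state them.

```python
def tokens_per_side(image_size: int, kernel: int, stride: int, padding: int) -> int:
    """Standard convolution output size along one spatial dimension."""
    return (image_size + 2 * padding - kernel) // stride + 1


def overlap_pixels(kernel: int, stride: int) -> int:
    """Pixels shared by two adjacent patches along one axis (0 if none)."""
    return max(kernel - stride, 0)


# Assumed, illustrative settings (not taken from the abstract).
non_overlapping = tokens_per_side(224, kernel=4, stride=4, padding=0)
overlapping = tokens_per_side(224, kernel=7, stride=4, padding=3)

print(non_overlapping, overlap_pixels(4, 4))  # 56 0
print(overlapping, overlap_pixels(7, 4))      # 56 3
```

Under these assumed settings both variants yield the same token grid, but in the overlapping case neighbouring patches share three pixels along each axis, so each token also sees part of its neighbours' content.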

Variational Topic Inference for Chest X-Ray Report Generation

Jul 15, 2021

Ivona Najdenkoska, Xiantong Zhen, Marcel Worring, Ling Shao

Abstract: Automating report generation for medical imaging promises to reduce workload and assist diagnosis in clinical practice. Recent work has shown that deep learning models can successfully caption natural images. However, learning from medical data is challenging due to the diversity and uncertainty inherent in reports written by radiologists with differing levels of expertise and experience. To tackle these challenges, we propose variational topic inference for automatic report generation. Specifically, we introduce a set of topics as latent variables to guide sentence generation by aligning image and language modalities in a latent space. The topics are inferred in a conditional variational inference framework, with each topic governing the generation of a sentence in the report. Further, we adopt a visual attention module that enables the model to attend to different locations in the image and generate more informative descriptions. We conduct extensive experiments on two benchmarks, namely Indiana U. Chest X-rays and MIMIC-CXR. The results demonstrate that our proposed variational topic inference method can generate novel reports rather than mere copies of reports used in training, while still achieving comparable performance to state-of-the-art methods in terms of standard language generation criteria.

* To be published in the International Conference on Medical Image Computing and Computer Assisted Intervention 2021

Kernel Continual Learning

Jul 14, 2021

Mohammad Mahdi Derakhshani, Xiantong Zhen, Ling Shao, Cees G. M. Snoek

Abstract: This paper introduces kernel continual learning, a simple but effective variant of continual learning that leverages the non-parametric nature of kernel methods to tackle catastrophic forgetting. We deploy an episodic memory unit that stores a subset of samples for each task to learn task-specific classifiers based on kernel ridge regression. This does not require memory replay and systematically avoids task interference in the classifiers. We further introduce variational random features to learn a data-driven kernel for each task. To do so, we formulate kernel continual learning as a variational inference problem, where a random Fourier basis is incorporated as the latent variable. The variational posterior distribution over the random Fourier basis is inferred from the coreset of each task. In this way, we are able to generate more informative kernels specific to each task, and, more importantly, the coreset size can be reduced to achieve more compact memory, resulting in more efficient continual learning based on episodic memory. Extensive evaluation on four benchmarks demonstrates the effectiveness and promise of kernels for continual learning.

* accepted to ICML 2021
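
The idea of task-specific kernel ridge regression classifiers fitted on a stored coreset can be sketched as follows. This is a minimal toy under stated assumptions, not the authors' implementation: it uses a fixed RBF kernel in place of the learned variational random features, and the names `memory`, `learn_task`, `classify`, the regularizer `lam`, and the two-point coresets are all hypothetical.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Fixed RBF kernel (an assumption; the paper learns a kernel per task)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_task(coreset_x, coreset_y, lam=0.1):
    """Kernel ridge regression: alpha = (K + lam * I)^-1 y."""
    n = len(coreset_x)
    K = [[rbf_kernel(coreset_x[i], coreset_x[j]) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve(K, coreset_y)


memory = {}  # hypothetical episodic memory: task id -> (coreset, coefficients)


def learn_task(task_id, coreset_x, coreset_y):
    memory[task_id] = (coreset_x, fit_task(coreset_x, coreset_y))


def classify(task_id, x):
    coreset_x, alpha = memory[task_id]
    score = sum(a * rbf_kernel(x, c) for a, c in zip(alpha, coreset_x))
    return 1 if score >= 0 else -1


learn_task("A", [[0.0, 0.0], [1.0, 1.0]], [-1.0, 1.0])
before = classify("A", [0.1, 0.1])
learn_task("B", [[0.0, 0.0], [1.0, 1.0]], [1.0, -1.0])  # opposite labels
after = classify("A", [0.1, 0.1])
print(before, after, classify("B", [0.1, 0.1]))  # -1 -1 1
```

Because each task keeps its own coreset and coefficients, learning task B (with deliberately opposite labels) leaves the prediction for task A unchanged, which is the sense in which per-task classifiers avoid interference.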

Structured Latent Embeddings for Recognizing Unseen Classes in Unseen Domains

Jul 12, 2021

Shivam Chandhok, Sanath Narayan, Hisham Cholakkal, Rao Muhammad Anwer, Vineeth N Balasubramanian, Fahad Shahbaz Khan, Ling Shao

Abstract: The need to address the scarcity of task-specific annotated data has resulted in concerted efforts in recent years on specific settings such as zero-shot learning (ZSL) and domain generalization (DG), which address the issues of semantic shift and domain shift, respectively. However, real-world applications often do not have constrained settings and necessitate handling unseen classes in unseen domains, a setting called Zero-shot Domain Generalization, which presents the issues of domain and semantic shifts simultaneously. In this work, we propose a novel approach that learns domain-agnostic structured latent embeddings by projecting images from different domains as well as class-specific semantic text-based representations to a common latent space. In particular, our method jointly strives for the following objectives: (i) aligning the multimodal cues from visual and text-based semantic concepts; (ii) partitioning the common latent space according to the domain-agnostic class-level semantic concepts; and (iii) learning a domain invariance w.r.t. the visual-semantic joint distribution for generalizing to unseen classes in unseen domains. Our experiments on the challenging DomainNet and DomainNet-LS benchmarks show the superiority of our approach over existing methods, with significant gains on difficult domains like quickdraw and sketch.

Local-to-Global Self-Attention in Vision Transformers

Jul 10, 2021

Jinpeng Li, Yichao Yan, Shengcai Liao, Xiaokang Yang, Ling Shao

Abstract: Transformers have demonstrated great potential in computer vision tasks. To avoid dense self-attention computation on high-resolution visual data, some recent Transformer models adopt a hierarchical design, where self-attention is computed only within local windows. This design significantly improves efficiency but lacks global feature reasoning in early stages. In this work, we design a multi-path structure of the Transformer, which enables local-to-global reasoning at multiple granularities in each stage. The proposed framework is computationally efficient and highly effective. With a marginal increase in computational overhead, our model achieves notable improvements in both image classification and semantic segmentation. Code is available at https://github.com/ljpadam/LG-Transformer
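
To make the efficiency argument for local windows concrete, the sketch below counts query-key pairs for global self-attention versus attention restricted to non-overlapping windows. The function names, the 56 x 56 token grid and the window size of 7 are illustrative assumptions, not figures from the abstract.

```python
def global_attention_pairs(height: int, width: int) -> int:
    """Every token attends to every token: N * N pairs."""
    n = height * width
    return n * n


def windowed_attention_pairs(height: int, width: int, window: int) -> int:
    """Tokens attend only within their own non-overlapping window."""
    if height % window or width % window:
        raise ValueError("grid must be divisible by the window size")
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window ** 2


# Assumed, illustrative sizes.
dense = global_attention_pairs(56, 56)
local = windowed_attention_pairs(56, 56, window=7)
print(dense, local, dense // local)  # 9834496 153664 64
```

The saving equals the number of windows, which is also why purely local attention cannot relate tokens that fall in different windows.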

Instance-Level Relative Saliency Ranking with Graph Reasoning

Jul 08, 2021

Nian Liu, Long Li, Wangbo Zhao, Junwei Han, Ling Shao

Abstract: Conventional salient object detection models cannot differentiate the importance of different salient objects. Recently, two works have been proposed to detect saliency ranking by assigning different degrees of saliency to different objects. However, one of these models cannot differentiate object instances and the other focuses more on sequential attention shift order inference. In this paper, we investigate a practical problem setting that requires simultaneously segmenting salient instances and inferring their relative saliency rank order. We present a novel unified model as the first end-to-end solution, where an improved Mask R-CNN is first used to segment salient instances and a saliency ranking branch is then added to infer the relative saliency. For relative saliency ranking, we build a new graph reasoning module by combining four graphs to incorporate the instance interaction relation, local contrast, global contrast, and a high-level semantic prior, respectively. A novel loss function is also proposed to effectively train the saliency ranking branch. In addition, a new dataset and an evaluation metric are proposed for this task, with the aim of pushing this field of research forward. Finally, experimental results demonstrate that our proposed model is more effective than previous methods. We also show an example of its practical usage on adaptive image retargeting.

* TPAMI under review

Bi-level Feature Alignment for Versatile Image Translation and Manipulation

Jul 07, 2021

Fangneng Zhan, Yingchen Yu, Rongliang Wu, Kaiwen Cui, Aoran Xiao, Shijian Lu, Ling Shao

Abstract: Generative adversarial networks (GANs) have achieved great success in image translation and manipulation. However, high-fidelity image generation with faithful style control remains a grand challenge in computer vision. This paper presents a versatile image translation and manipulation framework that achieves accurate semantic and style guidance in image generation by explicitly building a correspondence. To handle the quadratic complexity incurred by building dense correspondences, we introduce a bi-level feature alignment strategy that adopts a top-k operation to rank block-wise features, followed by dense attention between block features, which reduces memory cost substantially. As the top-k operation involves index swapping, which precludes gradient propagation, we propose to approximate the non-differentiable top-k operation with a regularized earth mover's problem so that its gradient can be effectively back-propagated. In addition, we design a novel semantic position encoding mechanism that builds coordinates for each individual semantic region to preserve texture structures while building correspondences. Further, we design a novel confidence feature injection module which mitigates the mismatch problem by fusing features adaptively according to the reliability of the built correspondences. Extensive experiments show that our method achieves superior performance qualitatively and quantitatively as compared with the state-of-the-art. The code is available at https://github.com/fnzhan/RABIT.

* Submitted to TPAMI
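
The two-level idea (rank blocks first, then attend densely only inside the best blocks) can be sketched as below. This is a simplified toy, not the paper's method: it uses a hard, non-differentiable top-k rather than the regularized earth mover's relaxation the abstract describes, it scores blocks by the dot product of their mean features (an assumption), and all function names and feature values are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean_vector(block):
    """Average the feature vectors in a block to get one block descriptor."""
    dim = len(block[0])
    return [sum(f[d] for f in block) / len(block) for d in range(dim)]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_blocks(query_block, key_blocks, k):
    """Level 1: rank key blocks against the query block, keep the best k."""
    q = mean_vector(query_block)
    scores = [dot(q, mean_vector(b)) for b in key_blocks]
    ranked = sorted(range(len(key_blocks)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def bilevel_attention(query_block, key_blocks, k):
    """Level 2: dense attention between the query block and the selected blocks."""
    selected = top_k_blocks(query_block, key_blocks, k)
    candidates = [f for i in selected for f in key_blocks[i]]
    outputs = []
    for q in query_block:
        weights = softmax([dot(q, c) for c in candidates])
        outputs.append([sum(w * c[d] for w, c in zip(weights, candidates))
                        for d in range(len(q))])
    return selected, outputs


query_block = [[1.0, 0.0], [0.9, 0.1]]
key_blocks = [
    [[0.0, 1.0], [0.0, 0.9]],
    [[1.0, 0.0], [0.8, 0.0]],
    [[-1.0, 0.0], [-1.0, 0.0]],
]
selected, outputs = bilevel_attention(query_block, key_blocks, k=1)
print(selected)  # [1]
```

In this toy, dense attention would compare each of the 2 query features with all 6 key features (12 pairs), whereas the bi-level version scores 3 blocks and then compares against only the 2 features of the selected block (4 pairs).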

GuidedMix-Net: Learning to Improve Pseudo Masks Using Labeled Images as Reference

Jun 30, 2021

Peng Tu, Yawen Huang, Rongrong Ji, Feng Zheng, Ling Shao

Abstract: Semi-supervised learning is a challenging problem which aims to construct a model by learning from a limited number of labeled examples. Numerous methods have been proposed to tackle this problem, with most relying solely on the consistency of predictions for unlabeled instances to regularize networks. However, treating labeled and unlabeled data separately often discards much of the prior knowledge learned from the labeled examples and fails to mine the feature interaction between labeled and unlabeled image pairs. In this paper, we propose a novel method for semi-supervised semantic segmentation named GuidedMix-Net, which leverages labeled information to guide the learning of unlabeled instances. Specifically, we first introduce a feature alignment objective between labeled and unlabeled data to capture potentially similar image pairs and then generate mixed inputs from them. The proposed mutual information transfer (MITrans), based on the cluster assumption, is shown to be a powerful knowledge module for progressively refining the features of unlabeled data in the mixed data space. To take advantage of the labeled examples and guide unlabeled data learning, we further propose a mask generation module to generate high-quality pseudo masks for the unlabeled data. Along with supervised learning for labeled data, the prediction of unlabeled data is jointly learned with the generated pseudo masks from the mixed data. Extensive experiments on PASCAL VOC 2012, PASCAL-Context and Cityscapes demonstrate the effectiveness of our GuidedMix-Net, which achieves competitive segmentation accuracy and significantly improves the mIoU by +7% compared to previous state-of-the-art approaches.

* 11 pages

Accelerated Multi-Modal MR Imaging with Transformers

Jun 29, 2021

Chun-Mei Feng, Yunlu Yan, Geng Chen, Huazhu Fu, Yong Xu, Ling Shao

Abstract: Accelerating multi-modal magnetic resonance (MR) imaging is a new and effective solution for fast MR imaging, providing superior performance in restoring the target modality from its undersampled counterpart with guidance from an auxiliary modality. However, existing works simply introduce the auxiliary modality as prior information, lacking in-depth investigations on the potential mechanisms for fusing two modalities. Further, they usually rely on convolutional neural networks (CNNs), whose focus on local information prevents them from fully capturing the long-distance dependencies of global knowledge. To this end, we propose a multi-modal transformer (MTrans), which is capable of transferring multi-scale features from the target modality to the auxiliary modality, for accelerated MR imaging. By restructuring the transformer architecture, our MTrans gains a powerful ability to capture deep multi-modal information. More specifically, the target modality and the auxiliary modality are first split into two branches and then fused using a multi-modal transformer module. This module is based on an improved multi-head attention mechanism, named the cross attention module, which absorbs features from the auxiliary modality that contribute to the target modality. Our framework provides two appealing benefits: (i) MTrans is the first attempt at using improved transformers for multi-modal MR imaging, affording more global information compared with CNN-based methods. (ii) A new cross attention module is proposed to exploit the useful information in each branch at different scales. It affords both distinct structural information and subtle pixel-level information, which supplement the target modality effectively.
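
The general cross-attention pattern the abstract refers to, where one branch queries the other, can be sketched as follows. This is a single-head toy with no learned projections, taking queries from the target tokens and keys/values from the auxiliary tokens and adding a residual connection; these choices, the function names and the token values are assumptions for illustration, not the MTrans module itself.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(target_tokens, auxiliary_tokens):
    """Scaled dot-product attention: queries from target, keys/values from auxiliary."""
    dim = len(target_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused, attention = [], []
    for query in target_tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in auxiliary_tokens]
        weights = softmax(scores)
        # Weighted sum of auxiliary features absorbed by this target token.
        absorbed = [sum(w * value[d] for w, value in zip(weights, auxiliary_tokens))
                    for d in range(dim)]
        # Residual connection (an assumption): keep the target feature, add the absorbed one.
        fused.append([t + a for t, a in zip(query, absorbed)])
        attention.append(weights)
    return fused, attention


target = [[1.0, 0.0], [0.0, 1.0]]
auxiliary = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
fused, attention = cross_attention(target, auxiliary)
for row in attention:
    print([round(w, 3) for w in row])
```

Each target token places the most weight on the auxiliary token it is most aligned with, so the fused feature is dominated by the matching auxiliary content.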
