"Image": models, code, and papers

TMComposites: Plug-and-Play Collaboration Between Specialized Tsetlin Machines

Sep 12, 2023
Ole-Christoffer Granmo

Tsetlin Machines (TMs) provide a fundamental shift from arithmetic-based to logic-based machine learning. Supporting convolution, they deal successfully with image classification datasets like MNIST, Fashion-MNIST, and CIFAR-2. However, the TM struggles with getting state-of-the-art performance on CIFAR-10 and CIFAR-100, representing more complex tasks. This paper introduces plug-and-play collaboration between specialized TMs, referred to as TM Composites. The collaboration relies on a TM's ability to specialize during learning and to assess its competence during inference. When teaming up, the most confident TMs make the decisions, relieving the uncertain ones. In this manner, a TM Composite becomes more competent than its members, benefiting from their specializations. The collaboration is plug-and-play in that members can be combined in any way, at any time, without fine-tuning. We implement three TM specializations in our empirical evaluation: Histogram of Gradients, Adaptive Gaussian Thresholding, and Color Thermometers. The resulting TM Composite increases accuracy on Fashion-MNIST by two percentage points, CIFAR-10 by twelve points, and CIFAR-100 by nine points, yielding new state-of-the-art results for TMs. Overall, we envision that TM Composites will enable an ultra-low energy and transparent alternative to state-of-the-art deep learning on more tasks and datasets.
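
One way to picture "the most confident TMs make the decisions" is to put each member's class scores on a comparable scale and add them up, so that a member with large normalized scores dominates a member that is unsure. The sketch below is an illustration under assumptions, not the paper's implementation: the function name `composite_predict` and the per-member `member_scales` (for example, the range of a member's class sums on a calibration set) are hypothetical.

```python
def composite_predict(member_scores, member_scales):
    """Combine per-class scores from several specialized members.

    member_scores: one list of per-class scores per member, for a single input.
    member_scales: one positive number per member (assumed to be estimated
                   beforehand, e.g. that member's score range on held-out data).
    """
    num_classes = len(member_scores[0])
    combined = [0.0] * num_classes
    for scores, scale in zip(member_scores, member_scales):
        for c in range(num_classes):
            # A member with small normalized scores (low confidence)
            # contributes little to the combined vote.
            combined[c] += scores[c] / scale
    best = max(range(num_classes), key=lambda c: combined[c])
    return best, combined


# Member A is confident about class 0; member B leans weakly towards class 1.
prediction, votes = composite_predict([[10, -10, 0], [-1, 2, 0]], [20, 20])
print(prediction, votes)
```

Because the combination is a plain sum, members can be added or removed without retraining the others, which matches the plug-and-play property described in the abstract.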

* 8 pages, 6 figures

Jersey Number Recognition using Keyframe Identification from Low-Resolution Broadcast Videos

Sep 12, 2023
Bavesh Balaji, Jerrin Bright, Harish Prakash, Yuhao Chen, David A Clausi, John Zelek

Player identification is a crucial component in vision-driven soccer analytics, enabling various downstream tasks such as player assessment, in-game analysis, and broadcast production. However, automatically detecting jersey numbers from player tracklets in videos presents challenges due to motion blur, low resolution, distortions, and occlusions. Existing methods, utilizing Spatial Transformer Networks, CNNs, and Vision Transformers, have shown success in image data but struggle with real-world video data, where jersey numbers are not visible in most of the frames. Hence, identifying frames that contain the jersey number is a key sub-problem to tackle. To address these issues, we propose a robust keyframe identification module that extracts frames containing essential high-level information about the jersey number. A spatio-temporal network is then employed to model spatial and temporal context and predict the probabilities of jersey numbers in the video. Additionally, we adopt a multi-task loss function to predict the probability distribution of each digit separately. Extensive evaluations on the SoccerNet dataset demonstrate that incorporating our proposed keyframe identification module results in a significant 37.81% and 37.70% increase in the accuracies of 2 different test sets with domain gaps. These results highlight the effectiveness and importance of our approach in tackling the challenges of automatic jersey number detection in sports videos.
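
The idea of predicting each digit's distribution separately can be sketched as a sum of two cross-entropy terms, one per digit position. This is a minimal sketch under assumptions, not the authors' loss: the extra "blank" class (index 10) for the tens position of single-digit numbers and the function names are hypothetical choices made for illustration.

```python
import math

BLANK = 10  # assumed extra class meaning "no tens digit"


def cross_entropy(probs, target):
    """Negative log-likelihood of the target class under a probability list."""
    return -math.log(max(probs[target], 1e-12))


def digit_wise_loss(tens_probs, units_probs, jersey_number):
    """Sum of per-digit cross-entropies for a jersey number in 0..99.

    tens_probs:  11 probabilities (digits 0-9 plus the assumed BLANK class).
    units_probs: 10 probabilities (digits 0-9).
    """
    tens, units = divmod(jersey_number, 10)
    tens_target = tens if jersey_number >= 10 else BLANK
    return cross_entropy(tens_probs, tens_target) + cross_entropy(units_probs, units)


# Uniform predictions give the maximum-uncertainty loss log(11) + log(10).
print(digit_wise_loss([1 / 11] * 11, [1 / 10] * 10, 7))
```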

* Accepted in the 6th International Workshop on Multimedia Content Analysis in Sports (MMSports'23) @ ACM Multimedia

Quality-Agnostic Deepfake Detection with Intra-model Collaborative Learning

Sep 12, 2023
Binh M. Le, Simon S. Woo

Deepfakes have recently raised a plethora of societal concerns over their possible security threats and the dissemination of fake information. Much research on deepfake detection has been undertaken. However, detecting low-quality deepfakes, as well as simultaneously detecting deepfakes of different qualities, remains a grave challenge. Most SOTA approaches are limited by using a single specific model for detecting a certain deepfake video quality type. Constructing multiple models with prior information about video quality incurs significant computational cost, as well as model and training data overhead. Further, such a strategy is neither scalable nor practical to deploy in real-world settings. In this work, we propose a universal intra-model collaborative learning framework to enable the effective and simultaneous detection of deepfakes of different quality. That is, our approach is a quality-agnostic deepfake detection method, dubbed QAD. In particular, by observing the upper bound of general error expectation, we maximize the dependency between intermediate representations of images from different quality levels via the Hilbert-Schmidt Independence Criterion. In addition, an Adversarial Weight Perturbation module is carefully devised to enable the model to be more robust against image corruption while boosting the overall model's performance. Extensive experiments over seven popular deepfake datasets demonstrate the superiority of our QAD model over prior SOTA benchmarks.
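
For readers unfamiliar with the Hilbert-Schmidt Independence Criterion (HSIC), the standard biased empirical estimator is tr(KHLH)/(n-1)^2, where K and L are kernel matrices over the two sets of representations and H is the centering matrix. The sketch below computes it in plain Python. The Gaussian kernel, the bandwidth `sigma`, and the function names are assumptions for illustration; the abstract does not say which kernel or estimator the authors use.

```python
import math


def rbf_kernel_matrix(points, sigma):
    """Gaussian (RBF) kernel matrix for a list of equal-length vectors."""
    n = len(points)
    return [
        [
            math.exp(
                -sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                / (2.0 * sigma ** 2)
            )
            for j in range(n)
        ]
        for i in range(n)
    ]


def center(k):
    """Double-center a symmetric kernel matrix, i.e. compute H K H."""
    n = len(k)
    row_means = [sum(row) / n for row in k]
    total_mean = sum(row_means) / n
    return [
        [k[i][j] - row_means[i] - row_means[j] + total_mean for j in range(n)]
        for i in range(n)
    ]


def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC between paired samples xs[i] and ys[i]."""
    n = len(xs)
    kc = center(rbf_kernel_matrix(xs, sigma))
    lc = center(rbf_kernel_matrix(ys, sigma))
    # tr(K H L H) equals the elementwise product sum of the centered matrices.
    trace = sum(kc[i][j] * lc[i][j] for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2


# Toy "representations" of the same four images at two quality levels.
high_quality = [[0.0], [1.0], [2.0], [3.0]]
low_quality = [[0.1], [0.9], [2.2], [2.8]]
print(hsic(high_quality, low_quality))
```

A larger value indicates stronger statistical dependence, so a training objective that maximizes it would pull the representations of the two quality levels towards carrying the same information.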

* International Conference on Computer Vision 2023

Semantic-Constraint Matching Transformer for Weakly Supervised Object Localization

Sep 04, 2023
Yiwen Cao, Yukun Su, Wenjun Wang, Yanxia Liu, Qingyao Wu

Weakly supervised object localization (WSOL) strives to learn to localize objects with only image-level supervision. Due to the local receptive fields generated by convolution operations, previous CNN-based methods suffer from partial activation issues, concentrating on the object's most discriminative part instead of its entire extent. Benefiting from the capability of the self-attention mechanism to acquire long-range feature dependencies, the Vision Transformer has recently been applied to alleviate the local activation drawbacks. However, since the transformer lacks the inductive localization bias that is inherent in CNNs, it may cause a divergent activation problem, resulting in an uncertain distinction between foreground and background. In this work, we propose a novel Semantic-Constraint Matching Network (SCMN) via a transformer to converge on the divergent activation. Specifically, we first propose a local patch shuffle strategy to construct the image pairs, disrupting local patches while guaranteeing global consistency. The paired images, which contain the common object at corresponding spatial locations, are then fed into the Siamese network encoder. We further design a semantic-constraint matching module, which aims to mine the co-object part by matching the coarse class activation maps (CAMs) extracted from the paired images, thus implicitly guiding and calibrating the transformer network to alleviate the divergent activation. Extensive experimental results on two challenging benchmarks, CUB-200-2011 and ILSVRC, show that our method achieves new state-of-the-art performance and outperforms the previous method by a large margin.
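
The local patch shuffle can be read as permuting small patches only within a limited neighbourhood, so that fine detail is disrupted while the coarse layout of the image stays put. The sketch below follows that reading on a 2D list of pixel values. It is an assumption-laden illustration rather than the authors' procedure: the function name, the `patch` and `group` parameters, and the choice to shuffle within non-overlapping `group x group` neighbourhoods are all hypothetical.

```python
import random


def local_patch_shuffle(image, patch, group, seed=None):
    """Shuffle patch x patch tiles within each group x group neighbourhood.

    image: 2D list (rows of pixel values); borders that do not fill a whole
           neighbourhood are left unchanged.
    """
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    block = patch * group
    for by in range(0, height - block + 1, block):
        for bx in range(0, width - block + 1, block):
            # Top-left corners of all patches inside this neighbourhood.
            corners = [
                (by + i * patch, bx + j * patch)
                for i in range(group)
                for j in range(group)
            ]
            sources = corners[:]
            rng.shuffle(sources)
            for (dy, dx), (sy, sx) in zip(corners, sources):
                for r in range(patch):
                    out[dy + r][dx:dx + patch] = image[sy + r][sx:sx + patch]
    return out


image = [[r * 4 + c for c in range(4)] for r in range(4)]
for row in local_patch_shuffle(image, patch=1, group=2, seed=0):
    print(row)
```

Each pixel stays inside its original neighbourhood, which is one simple way to keep the pair "globally consistent" while the local arrangement differs.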

Consistency-guided Meta-Learning for Bootstrapping Semi-Supervised Medical Image Segmentation

Jul 21, 2023
Qingyue Wei, Lequan Yu, Xianhang Li, Wei Shao, Cihang Xie, Lei Xing, Yuyin Zhou

Medical imaging has witnessed remarkable progress but usually requires a large amount of high-quality annotated data which is time-consuming and costly to obtain. To alleviate this burden, semi-supervised learning has garnered attention as a potential solution. In this paper, we present Meta-Learning for Bootstrapping Medical Image Segmentation (MLB-Seg), a novel method for tackling the challenge of semi-supervised medical image segmentation. Specifically, our approach first involves training a segmentation model on a small set of clean labeled images to generate initial labels for unlabeled data. To further optimize this bootstrapping process, we introduce a per-pixel weight mapping system that dynamically assigns weights to both the initialized labels and the model's own predictions. These weights are determined using a meta-process that prioritizes pixels with loss gradient directions closer to those of clean data, which is based on a small set of precisely annotated images. To facilitate the meta-learning process, we additionally introduce a consistency-based Pseudo Label Enhancement (PLE) scheme that improves the quality of the model's own predictions by ensembling predictions from various augmented versions of the same input. In order to improve the quality of the weight maps obtained through multiple augmentations of a single input, we introduce a mean teacher into the PLE scheme. This method helps to reduce noise in the weight maps and stabilize its generation process. Our extensive experimental results on public atrial and prostate segmentation datasets demonstrate that our proposed method achieves state-of-the-art results under semi-supervision. Our code is available at https://github.com/aijinrjinr/MLB-Seg.
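
Two ingredients mentioned above, ensembling predictions over augmented views and a mean teacher, have simple generic forms. The sketch below shows both under assumptions: a horizontal flip stands in for the augmentations, `model` is any hypothetical callable that maps an image grid to a per-pixel probability grid, and the mean teacher is the usual exponential moving average of student weights. None of this is taken from the released MLB-Seg code.

```python
def hflip(grid):
    """Horizontally flip a 2D list."""
    return [row[::-1] for row in grid]


def ensemble_prediction(model, image):
    """Average the model's prediction on the image and on its flipped copy.

    The prediction for the flipped input is flipped back before averaging so
    that both predictions refer to the same pixel locations.
    """
    direct = model(image)
    flipped_back = hflip(model(hflip(image)))
    return [
        [(a + b) / 2.0 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(direct, flipped_back)
    ]


def ema_update(teacher_weights, student_weights, decay=0.99):
    """Mean-teacher step: move teacher weights slowly towards the student's."""
    return [
        decay * t + (1.0 - decay) * s
        for t, s in zip(teacher_weights, student_weights)
    ]


# A toy "model" that always marks the left-most column as foreground.
toy_model = lambda img: [[1.0 if j == 0 else 0.0 for j in range(len(row))] for row in img]
print(ensemble_prediction(toy_model, [[0, 0, 0]]))
print(ema_update([1.0, 2.0], [0.0, 0.0], decay=0.9))
```

Averaging over views smooths out predictions that depend on the particular augmentation, and the slowly moving teacher gives a more stable target than the student's latest weights.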

* Accepted to MICCAI 2023. Code is publicly available at https://github.com/aijinrjinr/MLB-Seg

Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models

Aug 25, 2023
Chi Chen, Ruoyu Qin, Fuwen Luo, Xiaoyue Mi, Peng Li, Maosong Sun, Yang Liu

Recently, Multimodal Large Language Models (MLLMs) that enable Large Language Models (LLMs) to interpret images through visual instruction tuning have achieved significant success. However, existing visual instruction tuning methods only utilize image-language instruction data to align the language and image modalities, lacking a more fine-grained cross-modal alignment. In this paper, we propose Position-enhanced Visual Instruction Tuning (PVIT), which extends the functionality of MLLMs by integrating an additional region-level vision encoder. This integration promotes a more detailed comprehension of images for the MLLM. In addition, to efficiently achieve a fine-grained alignment between the vision modules and the LLM, we design multiple data generation strategies to construct an image-region-language instruction dataset. Finally, we present both quantitative experiments and qualitative analysis that demonstrate the superiority of the proposed model. Code and data will be released at https://github.com/THUNLP-MT/PVIT.

MoMA: Momentum Contrastive Learning with Multi-head Attention-based Knowledge Distillation for Histopathology Image Analysis

Aug 31, 2023
Trinh Thi Le Vuong, Jin Tae Kwak

There is no doubt that advanced artificial intelligence models and high quality data are the keys to success in developing computational pathology tools. Although the overall volume of pathology data keeps increasing, a lack of quality data is a common issue when it comes to a specific task due to several reasons including privacy and ethical issues with patient data. In this work, we propose to exploit knowledge distillation, i.e., utilize the existing model to learn a new, target model, to overcome such issues in computational pathology. Specifically, we employ a student-teacher framework to learn a target model from a pre-trained, teacher model without direct access to source data and distill relevant knowledge via momentum contrastive learning with multi-head attention mechanism, which provides consistent and context-aware feature representations. This enables the target model to assimilate informative representations of the teacher model while seamlessly adapting to the unique nuances of the target data. The proposed method is rigorously evaluated across different scenarios where the teacher model was trained on the same, relevant, and irrelevant classification tasks with the target model. Experimental results demonstrate the accuracy and robustness of our approach in transferring knowledge to different domains and tasks, outperforming other related methods. Moreover, the results provide a guideline on the learning strategy for different types of tasks and scenarios in computational pathology. Code is available at: https://github.com/trinhvg/MoMA.

* Preprint

Stream-based Active Learning by Exploiting Temporal Properties in Perception with Temporal Predicted Loss

Sep 11, 2023
Sebastian Schmidt, Stephan Günnemann

Active learning (AL) reduces the amount of labeled data needed to train a machine learning model by intelligently choosing which instances to label. Classic pool-based AL requires all data to be present in a datacenter, which can be challenging with the increasing amounts of data needed in deep learning. However, AL on mobile devices and robots, like autonomous cars, can filter the data from perception sensor streams before it reaches the datacenter. We exploited the temporal properties of such image streams in our work and proposed the novel temporal predicted loss (TPL) method. To evaluate the stream-based setting properly, we introduced the GTA V streets and the A2D2 streets datasets and made both publicly available. Our experiments showed that our approach significantly improves the diversity of the selection while being an uncertainty-based method. As pool-based approaches are more common in perception applications, we derived a concept for comparing pool-based and stream-based AL, where TPL outperformed state-of-the-art pool- or stream-based approaches for different models. TPL required 2.5 percentage points (pp) less data while being significantly faster than pool-based methods.

Efficient Transfer Learning in Diffusion Models via Adversarial Noise

Aug 23, 2023
Xiyu Wang, Baijiong Lin, Daochang Liu, Chang Xu

Diffusion Probabilistic Models (DPMs) have demonstrated substantial promise in image generation tasks but rely heavily on the availability of large amounts of training data. Previous works, such as those on GANs, have tackled the limited data problem by transferring pre-trained models learned with sufficient data. However, those methods are hard to apply to DPMs because of the distinct differences between DPM-based and GAN-based methods, which show in the iterative denoising process integral to DPMs and the need for many timesteps with non-targeted noise. In this paper, we propose a novel DPM-based transfer learning method, TAN, to address the limited data problem. It includes two strategies: similarity-guided training, which boosts transfer with a classifier, and adversarial noise selection, which adaptively chooses targeted noise based on the input image. Extensive experiments in the context of few-shot image generation tasks demonstrate that our method is not only efficient but also excels in terms of image quality and diversity when compared to existing GAN-based and DDPM-based methods.

HPFormer: Hyperspectral image prompt object tracking

Aug 14, 2023
Yuedong Tan

Hyperspectral imagery contains abundant spectral information beyond the visible RGB bands, providing rich discriminative details about objects in a scene. Leveraging such data has the potential to enhance visual tracking performance. While prior hyperspectral trackers employ CNN or hybrid CNN-Transformer architectures, we propose a novel Transformer-based approach, HPFormer, to capitalize on the powerful representation learning capabilities of Transformers. The core of HPFormer is a Hyperspectral Hybrid Attention (HHA) module which unifies feature extraction and fusion within one component through token interactions. Additionally, a Transform Band Module (TBM) is introduced to selectively aggregate spatial details and spectral signatures from the full hyperspectral input for injecting informative target representations. Extensive experiments demonstrate state-of-the-art performance of HPFormer on benchmark NIR and VIS tracking datasets. Our work provides new insights into harnessing the strengths of transformers and hyperspectral fusion to advance robust object tracking.
