Lior Wolf

Converting Transformers to Polynomial Form for Secure Inference Over Homomorphic Encryption

Nov 15, 2023

Itamar Zimerman, Moran Baruch, Nir Drucker, Gilad Ezov, Omri Soceanu, Lior Wolf

Abstract:Designing privacy-preserving deep learning models is a major challenge within the deep learning community. Homomorphic Encryption (HE) has emerged as one of the most promising approaches in this realm, enabling the decoupling of knowledge between the model owner and the data owner. Despite extensive research and application of this technology, primarily in convolutional neural networks, incorporating HE into transformer models has been challenging because of the difficulties in converting these models into a polynomial form. We break new ground by introducing the first polynomial transformer, providing the first demonstration of secure inference over HE with transformers. This includes a transformer architecture tailored for HE, alongside a novel method for converting operators to their polynomial equivalent. This innovation enables us to perform secure inference on LMs with WikiText-103. It also allows us to perform image classification with CIFAR-100 and Tiny-ImageNet. Our models yield results comparable to traditional methods, bridging the performance gap with transformers of similar scale and underscoring the viability of HE for state-of-the-art applications. Finally, we assess the stability of our models and conduct a series of ablations to quantify the contribution of each model component.

* 6 figures
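
The abstract's central constraint is that HE evaluates only additions and multiplications, so every non-polynomial operator has to be replaced by a polynomial. The sketch below illustrates that general idea with Chebyshev interpolation of GELU on a fixed interval. It is not the conversion method of the paper: the choice of GELU, the interval, the degree, and the helper names `chebyshev_coeffs` and `eval_chebyshev` are all assumptions made for illustration, and the code runs on plain floats with no encryption.

```python
import math


def gelu(x):
    """Exact GELU, used here only as the target function to approximate."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def chebyshev_coeffs(f, degree, a, b):
    """Coefficients of the Chebyshev interpolant of f on [a, b].

    Hypothetical helper: samples f at the degree+1 Chebyshev nodes and
    applies the discrete cosine formula for the coefficients.
    """
    n = degree + 1
    angles = [math.pi * (k + 0.5) / n for k in range(n)]
    # Map each node from [-1, 1] to [a, b] before sampling f.
    values = [f(0.5 * (b - a) * math.cos(th) + 0.5 * (b + a)) for th in angles]
    coeffs = []
    for j in range(n):
        s = sum(v * math.cos(j * th) for v, th in zip(values, angles))
        coeffs.append(2.0 * s / n)
    coeffs[0] *= 0.5  # the constant term carries half weight
    return coeffs


def eval_chebyshev(coeffs, x, a, b):
    """Evaluate the interpolant at x using only additions and multiplications.

    The reciprocal 1 / (b - a) is a plaintext constant, so x itself is
    only ever added to or multiplied by other values.
    """
    t = (2.0 * x - (a + b)) * (1.0 / (b - a))
    t_prev, t_cur = 1.0, t  # T_0(t) and T_1(t)
    result = coeffs[0]
    if len(coeffs) > 1:
        result += coeffs[1] * t
    for c in coeffs[2:]:
        # Recurrence T_{m+1}(t) = 2 t T_m(t) - T_{m-1}(t)
        t_prev, t_cur = t_cur, 2.0 * t * t_cur - t_prev
        result += c * t_cur
    return result
```

Calling `chebyshev_coeffs(gelu, 8, -4.0, 4.0)` and then `eval_chebyshev(coeffs, x, -4.0, 4.0)` gives a degree-8 stand-in for GELU that is only meaningful for inputs inside the interval. Outside it the polynomial diverges, which is one reason input ranges matter for polynomial models.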

Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation

Sep 28, 2023

Guy Yariv, Itai Gat, Sagie Benaim, Lior Wolf, Idan Schwartz, Yossi Adi

Abstract:We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. For this task, the videos are required to be aligned both globally and temporally with the input audio: globally, the input audio is semantically associated with the entire output video, and temporally, each segment of the input audio is associated with a corresponding segment of that video. We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model. The proposed method is based on a lightweight adaptor network, which learns to map the audio-based representation to the input representation expected by the text-to-video generation model. As such, it also enables video generation conditioned on text, audio, and, for the first time as far as we can ascertain, on both text and audio. We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples and further propose a novel evaluation metric (AV-Align) to assess the alignment of generated videos with input audio samples. AV-Align is based on the detection and comparison of energy peaks in both modalities. In comparison to recent state-of-the-art approaches, our method generates videos that are better aligned with the input sound, both with respect to content and temporal axis. We also show that videos produced by our method present higher visual quality and are more diverse.

* 9 pages, 6 figures
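
The abstract describes AV-Align only as detecting and comparing energy peaks in both modalities. The sketch below shows one way such a comparison could work on two one-dimensional energy curves that share a time axis. It is not the published metric: the peak rule, the `threshold` and `tolerance` parameters, the symmetric matching score, and the function names are all assumptions for illustration.

```python
def detect_peaks(energy, threshold):
    """Indices of local maxima in `energy` that exceed `threshold`.

    Hypothetical peak rule: strictly above the previous sample and not
    below the next one. The first and last samples are never peaks.
    """
    return [
        i
        for i in range(1, len(energy) - 1)
        if energy[i] > threshold
        and energy[i] > energy[i - 1]
        and energy[i] >= energy[i + 1]
    ]


def peak_alignment(audio_peaks, video_peaks, tolerance):
    """Fraction of peaks that have a counterpart in the other modality.

    A peak counts as matched when the other list holds a peak within
    `tolerance` time steps. Returns 0.0 when neither list has a peak.
    """
    total = len(audio_peaks) + len(video_peaks)
    if total == 0:
        return 0.0
    matched_audio = sum(
        any(abs(a - v) <= tolerance for v in video_peaks) for a in audio_peaks
    )
    matched_video = sum(
        any(abs(v - a) <= tolerance for a in audio_peaks) for v in video_peaks
    )
    return (matched_audio + matched_video) / total
```

With this scoring, two curves whose peaks all fall within the tolerance of each other score 1.0, and curves with no nearby peaks score 0.0. Counting matches in both directions penalizes both missing and spurious peaks.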

Multi-Dimensional Hyena for Spatial Inductive Bias

Sep 24, 2023

Itamar Zimerman, Lior Wolf

Abstract:In recent years, Vision Transformers have attracted increasing interest from computer vision researchers. However, the advantage of these transformers over CNNs is only fully manifested when trained over a large dataset, mainly due to the reduced inductive bias towards spatial locality within the transformer's self-attention mechanism. In this work, we present a data-efficient vision transformer that does not rely on self-attention. Instead, it employs a novel generalization to multiple axes of the very recent Hyena layer. We propose several alternative approaches for obtaining this generalization and delve into their unique distinctions and considerations from both empirical and theoretical perspectives. Our empirical findings indicate that the proposed Hyena N-D layer boosts the performance of various Vision Transformer architectures, such as ViT, Swin, and DeiT across multiple datasets. Furthermore, in the small dataset regime, our Hyena-based ViT is favorable to ViT variants from the recent literature that are specifically designed for solving the same challenge, i.e., working with small datasets or incorporating image-specific inductive bias into the self-attention mechanism. Finally, we show that a hybrid approach that is based on Hyena N-D for the first layers in ViT, followed by layers that incorporate conventional attention, consistently boosts the performance of various vision transformer architectures.

* 10 pages, 3 figures

Zero-Shot Audio Captioning via Audibility Guidance

Sep 07, 2023

Tal Shaharabany, Ariel Shaulov, Lior Wolf

Abstract:The task of audio captioning is similar in essence to tasks such as image and video captioning. However, it has received much less attention. We propose three desiderata for captioning audio -- (i) fluency of the generated text, (ii) faithfulness of the generated text to the input audio, and the somewhat related (iii) audibility, which is the quality of being able to be perceived based only on audio. Our method is a zero-shot method, i.e., we do not learn to perform captioning. Instead, captioning occurs as an inference process that involves three networks that correspond to the three desired qualities: (i) A Large Language Model, in our case, for reasons of convenience, GPT-2, (ii) A model that provides a matching score between an audio file and a text, for which we use a multimodal matching network called ImageBind, and (iii) A text classifier, trained using a dataset we collected automatically by instructing GPT-4 with prompts designed to direct the generation of both audible and inaudible sentences. We present our results on the AudioCap dataset, demonstrating that audibility guidance significantly enhances performance compared to the baseline, which lacks this objective.

Box-based Refinement for Weakly Supervised and Unsupervised Localization Tasks

Sep 07, 2023

Eyal Gomel, Tal Shaharabany, Lior Wolf

Abstract:It has been established that training a box-based detector network can enhance the localization performance of weakly supervised and unsupervised methods. Moreover, we extend this understanding by demonstrating that these detectors can be utilized to improve the original network, paving the way for further advancements. To accomplish this, we train the detectors on top of the network output instead of the image data and apply suitable loss backpropagation. Our findings reveal a significant improvement in phrase grounding for the "what is where by looking" task, as well as various methods of unsupervised object discovery. Our code is available at https://github.com/eyalgomel/box-based-refinement.

Reconstructing the Hemodynamic Response Function via a Bimodal Transformer

Jun 28, 2023

Yoni Choukroun, Lior Golgher, Pablo Blinder, Lior Wolf

Abstract:The relationship between blood flow and neuronal activity is widely recognized, with blood flow frequently serving as a surrogate for neuronal activity in fMRI studies. At the microscopic level, neuronal activity has been shown to influence blood flow in nearby blood vessels. This study introduces the first predictive model that addresses this issue directly at the explicit neuronal population level. Using in vivo recordings in awake mice, we employ a novel spatiotemporal bimodal transformer architecture to infer current blood flow based on both historical blood flow and ongoing spontaneous neuronal activity. Our findings indicate that incorporating neuronal activity significantly enhances the model's ability to predict blood flow values. Through analysis of the model's behavior, we propose hypotheses regarding the largely unexplored nature of the hemodynamic response to neuronal activity.

Annotator Consensus Prediction for Medical Image Segmentation with Diffusion Models

Jun 15, 2023

Tomer Amit, Shmuel Shichrur, Tal Shaharabany, Lior Wolf

Abstract:A major challenge in the segmentation of medical images is the large inter- and intra-observer variability in annotations provided by multiple experts. To address this challenge, we propose a novel method for multi-expert prediction using diffusion models. Our method leverages the diffusion-based approach to incorporate information from multiple annotations and fuse it into a unified segmentation map that reflects the consensus of multiple experts. We evaluate the performance of our method on several datasets of medical segmentation annotated by multiple experts and compare it with state-of-the-art methods. Our results demonstrate the effectiveness and robustness of the proposed method. Our code is publicly available at https://github.com/tomeramit/Annotator-Consensus-Prediction.

* arXiv admin note: text overlap with arXiv:2112.00390

2-D SSM: A General Spatial Layer for Visual Transformers

Jun 11, 2023

Ethan Baron, Itamar Zimerman, Lior Wolf

Abstract:A central objective in computer vision is to design models with appropriate 2-D inductive bias. Desiderata for 2-D inductive bias include two-dimensional position awareness, dynamic spatial locality, and translation and permutation invariance. To address these goals, we leverage an expressive variation of the multidimensional State Space Model (SSM). Our approach introduces efficient parameterization, accelerated computation, and a suitable normalization scheme. Empirically, we observe that incorporating our layer at the beginning of each transformer block of Vision Transformers (ViT) significantly enhances performance for multiple ViT backbones and across datasets. The new layer is effective even with a negligible amount of additional parameters and inference time. Ablation studies and visualizations demonstrate that the layer has a strong 2-D inductive bias. For example, vision transformers equipped with our layer exhibit effective performance even without positional encoding.

* 16 pages, 5 figures

AutoSAM: Adapting SAM to Medical Images by Overloading the Prompt Encoder

Jun 10, 2023

Tal Shaharabany, Aviad Dahan, Raja Giryes, Lior Wolf

Abstract:The recently introduced Segment Anything Model (SAM) combines a clever architecture and large quantities of training data to obtain remarkable image segmentation capabilities. However, it fails to reproduce such results for Out-Of-Distribution (OOD) domains such as medical images. Moreover, while SAM is conditioned on either a mask or a set of points, it may be desirable to have a fully automatic solution. In this work, we replace SAM's conditioning with an encoder that operates on the same input image. By adding this encoder and without further fine-tuning SAM, we obtain state-of-the-art results on multiple medical images and video benchmarks. This new encoder is trained via gradients provided by a frozen SAM. For inspecting the knowledge within it, and providing a lightweight segmentation solution, we also learn to decode it into a mask by a shallow deconvolution network.

Decision S4: Efficient Sequence-Based RL via State Spaces Layers

Jun 08, 2023

Shmuel Bar-David, Itamar Zimerman, Eliya Nachmani, Lior Wolf

Abstract:Recently, sequence learning methods have been applied to the problem of off-policy Reinforcement Learning, including the seminal work on Decision Transformers, which employs transformers for this task. Since transformers are parameter-heavy, cannot benefit from history longer than a fixed window size, and are not computed using recurrence, we set out to investigate the suitability of the S4 family of models, which are based on state-space layers and have been shown to outperform transformers, especially in modeling long-range dependencies. In this work, we present two main algorithms: (i) an off-policy training procedure that works with trajectories while still maintaining the training efficiency of the S4 model, and (ii) an on-policy training procedure that is trained in a recurrent manner, benefits from long-range dependencies, and is based on a novel stable actor-critic mechanism. Our results indicate that our method outperforms multiple variants of decision transformers, as well as the other baseline methods, on most tasks, while reducing the latency, number of parameters, and training time by several orders of magnitude, making our approach more suitable for real-world RL.

* 21 pages, 13 figures
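
The abstract contrasts transformers, which are not computed using recurrence, with state-space layers that can be trained efficiently on whole trajectories and also run step by step. The sketch below shows the standard property behind that dual use for a scalar, time-invariant linear state-space model: the recurrent update and a causal convolution with kernel `c * a**j * b` produce the same outputs. This is a toy illustration, not the S4 parameterization or the training procedures of the paper. The scalar parameters `a`, `b`, `c` and both function names are assumptions.

```python
def ssm_recurrent(a, b, c, inputs):
    """Run x_k = a * x_{k-1} + b * u_k, y_k = c * x_k one step at a time."""
    state = 0.0
    outputs = []
    for u in inputs:
        state = a * state + b * u
        outputs.append(c * state)
    return outputs


def ssm_convolutional(a, b, c, inputs):
    """Compute the same outputs as a causal convolution over the sequence.

    Unrolling the recurrence gives y_k = sum_j (c * a**j * b) * u_{k-j},
    so the whole sequence can be processed at once from a fixed kernel.
    """
    length = len(inputs)
    kernel = [c * (a ** j) * b for j in range(length)]
    return [
        sum(kernel[j] * inputs[k - j] for j in range(k + 1))
        for k in range(length)
    ]
```

The recurrent form needs only the previous state per step, which suits online interaction, while the convolutional form exposes the full sequence for parallel computation during training.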
