Aykut Erdem

Shammie

Sequential Compositional Generalization in Multimodal Models

Apr 18, 2024

Semih Yagcioglu, Osman Batur İnce, Aykut Erdem, Erkut Erdem, Desmond Elliott, Deniz Yuret

Abstract: The rise of large-scale multimodal models has paved the pathway for groundbreaking advances in generative modeling and reasoning, unlocking transformative applications in a variety of complex tasks. However, a pressing question that remains is their genuine capability for stronger forms of generalization, which has been largely underexplored in the multimodal setting. Our study aims to address this by examining sequential compositional generalization using CompAct (Compositional Activities; project page: http://cyberiada.github.io/CompAct), a carefully constructed, perceptually grounded dataset set within a rich backdrop of egocentric kitchen activity videos. Each instance in our dataset is represented with a combination of raw video footage, naturally occurring sound, and crowd-sourced step-by-step descriptions. More importantly, our setup ensures that the individual concepts are consistently distributed across training and evaluation sets, while their compositions are novel in the evaluation set. We conduct a comprehensive assessment of several unimodal and multimodal models. Our findings reveal that bi-modal and tri-modal models exhibit a clear edge over their text-only counterparts. This highlights the importance of multimodality while charting a trajectory for future research in this domain.

* Accepted to the main conference of NAACL (2024) as a long paper

ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models

Nov 13, 2023

Ilker Kesen, Andrea Pedrotti, Mustafa Dogan, Michele Cafagna, Emre Can Acikgoz, Letitia Parcalabescu, Iacer Calixto, Anette Frank, Albert Gatt, Aykut Erdem(+1 more)

Abstract:With the ever-increasing popularity of pretrained Video-Language Models (VidLMs), there is a pressing need to develop robust evaluation methodologies that delve deeper into their visio-linguistic capabilities. To address this challenge, we present ViLMA (Video Language Model Assessment), a task-agnostic benchmark that places the assessment of fine-grained capabilities of these models on a firm footing. Task-based evaluations, while valuable, fail to capture the complexities and specific temporal aspects of moving images that VidLMs need to process. Through carefully curated counterfactuals, ViLMA offers a controlled evaluation suite that sheds light on the true potential of these models, as well as their performance gaps compared to human-level understanding. ViLMA also includes proficiency tests, which assess basic capabilities deemed essential to solving the main counterfactual tests. We show that current VidLMs' grounding abilities are no better than those of vision-language models which use static images. This is especially striking once the performance on proficiency tests is factored in. Our benchmark serves as a catalyst for future research on VidLMs, helping to highlight areas that still need to be explored.

* Preprint. 48 pages, 22 figures, 10 tables

Harnessing Dataset Cartography for Improved Compositional Generalization in Transformers

Oct 18, 2023

Osman Batur İnce, Tanin Zeraati, Semih Yagcioglu, Yadollah Yaghoobzadeh, Erkut Erdem, Aykut Erdem

Abstract:Neural networks have revolutionized language modeling and excelled in various downstream tasks. However, the extent to which these models achieve compositional generalization comparable to human cognitive abilities remains a topic of debate. While existing approaches in the field have mainly focused on novel architectures and alternative learning paradigms, we introduce a pioneering method harnessing the power of dataset cartography (Swayamdipta et al., 2020). By strategically identifying a subset of compositional generalization data using this approach, we achieve a remarkable improvement in model accuracy, yielding enhancements of up to 10% on CFQ and COGS datasets. Notably, our technique incorporates dataset cartography as a curriculum learning criterion, eliminating the need for hyperparameter tuning while consistently achieving superior performance. Our findings highlight the untapped potential of dataset cartography in unleashing the full capabilities of compositional generalization within Transformer models. Our code is available at https://github.com/cyberiada/cartography-for-compositionality.

* Accepted to Findings of EMNLP 2023

Hyperspectral Image Denoising via Self-Modulating Convolutional Neural Networks

Sep 15, 2023

Orhan Torun, Seniha Esen Yuksel, Erkut Erdem, Nevrez Imamoglu, Aykut Erdem

Abstract: Compared to natural images, hyperspectral images (HSIs) consist of a large number of bands, with each band capturing different spectral information from a certain wavelength, even some beyond the visible spectrum. These characteristics of HSIs make them highly effective for remote sensing applications. That said, the existing hyperspectral imaging devices introduce severe degradation in HSIs. Hence, hyperspectral image denoising has attracted considerable attention from the community lately. While recent deep HSI denoising methods have provided effective solutions, their performance under real-life complex noise remains suboptimal, as they lack adaptability to new data. To overcome these limitations, we introduce a self-modulating convolutional neural network, referred to as SM-CNN, which utilizes correlated spectral and spatial information. At the core of the model lies a novel block, which we call the spectral self-modulating residual block (SSMRB), that allows the network to transform the features in an adaptive manner based on the adjacent spectral data, enhancing the network's ability to handle complex noise. In particular, the introduction of SSMRB transforms our denoising network into a dynamic network that adapts its predicted features while denoising every input HSI with respect to its spatio-spectral characteristics. Experimental analysis on both synthetic and real data shows that the proposed SM-CNN outperforms other state-of-the-art HSI denoising methods both quantitatively and qualitatively on public benchmark datasets.

* Signal Processing, Volume 214, January 2024, 109248

Spherical Vision Transformer for 360-degree Video Saliency Prediction

Aug 24, 2023

Mert Cokelek, Nevrez Imamoglu, Cagri Ozcinar, Erkut Erdem, Aykut Erdem

Abstract: The growing interest in omnidirectional videos (ODVs) that capture the full field-of-view (FOV) has made 360-degree saliency prediction increasingly important in computer vision. However, predicting where humans look in 360-degree scenes presents unique challenges, including spherical distortion, high resolution, and limited labelled data. We propose a novel vision-transformer-based model for omnidirectional videos named SalViT360 that leverages tangent image representations. We introduce a spherical geometry-aware spatiotemporal self-attention mechanism that is capable of effective omnidirectional video understanding. Furthermore, we present a consistency-based unsupervised regularization term for projection-based 360-degree dense-prediction models to reduce artefacts in the predictions that occur after inverse projection. Our approach is the first to employ tangent images for omnidirectional saliency prediction, and our experimental results on three ODV saliency datasets demonstrate its effectiveness compared to the state-of-the-art.

* 12 pages, 4 figures, accepted to BMVC 2023

CLIP-Guided StyleGAN Inversion for Text-Driven Real Image Editing

Jul 18, 2023

Ahmet Canberk Baykal, Abdul Basit Anees, Duygu Ceylan, Erkut Erdem, Aykut Erdem, Deniz Yuret

Abstract:Researchers have recently begun exploring the use of StyleGAN-based models for real image editing. One particularly interesting application is using natural language descriptions to guide the editing process. Existing approaches for editing images using language either resort to instance-level latent code optimization or map predefined text prompts to some editing directions in the latent space. However, these approaches have inherent limitations. The former is not very efficient, while the latter often struggles to effectively handle multi-attribute changes. To address these weaknesses, we present CLIPInverter, a new text-driven image editing approach that is able to efficiently and reliably perform multi-attribute changes. The core of our method is the use of novel, lightweight text-conditioned adapter layers integrated into pretrained GAN-inversion networks. We demonstrate that by conditioning the initial inversion step on the CLIP embedding of the target description, we are able to obtain more successful edit directions. Additionally, we use a CLIP-guided refinement step to make corrections in the resulting residual latent codes, which further improves the alignment with the text prompt. Our method outperforms competing approaches in terms of manipulation accuracy and photo-realism on various domains including human faces, cats, and birds, as shown by our qualitative and quantitative results.

* Accepted for publication in ACM Transactions on Graphics

HyperE2VID: Improving Event-Based Video Reconstruction via Hypernetworks

May 10, 2023

Burak Ercan, Onur Eker, Canberk Saglam, Aykut Erdem, Erkut Erdem

Abstract:Event-based cameras are becoming increasingly popular for their ability to capture high-speed motion with low latency and high dynamic range. However, generating videos from events remains challenging due to the highly sparse and varying nature of event data. To address this, in this study, we propose HyperE2VID, a dynamic neural network architecture for event-based video reconstruction. Our approach uses hypernetworks and dynamic convolutions to generate per-pixel adaptive filters guided by a context fusion module that combines information from event voxel grids and previously reconstructed intensity images. We also employ a curriculum learning strategy to train the network more robustly. Experimental results demonstrate that HyperE2VID achieves better reconstruction quality with fewer parameters and faster inference time than the state-of-the-art methods.

* 12 pages, 5 figures. Submitted to IEEE Transactions on Image Processing. The project page can be found at https://ercanburak.github.io/HyperE2VID.html

EVREAL: Towards a Comprehensive Benchmark and Analysis Suite for Event-based Video Reconstruction

Apr 30, 2023

Burak Ercan, Onur Eker, Aykut Erdem, Erkut Erdem

Abstract:Event cameras are a new type of vision sensor that incorporates asynchronous and independent pixels, offering advantages over traditional frame-based cameras such as high dynamic range and minimal motion blur. However, their output is not easily understandable by humans, making the reconstruction of intensity images from event streams a fundamental task in event-based vision. While recent deep learning-based methods have shown promise in video reconstruction from events, this problem is not completely solved yet. To facilitate comparison between different approaches, standardized evaluation protocols and diverse test datasets are essential. This paper proposes a unified evaluation methodology and introduces an open-source framework called EVREAL to comprehensively benchmark and analyze various event-based video reconstruction methods from the literature. Using EVREAL, we give a detailed analysis of the state-of-the-art methods for event-based video reconstruction, and provide valuable insights into the performance of these methods under varying settings, challenging scenarios, and downstream tasks.

* 19 pages, 9 figures. Has been accepted for publication at the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, 2023. The project page can be found at https://ercanburak.github.io/evreal.html

VidStyleODE: Disentangled Video Editing via StyleGAN and NeuralODEs

Apr 12, 2023

Moayed Haji Ali, Andrew Bond, Tolga Birdal, Duygu Ceylan, Levent Karacan, Erkut Erdem, Aykut Erdem

Abstract: We propose VidStyleODE, a spatiotemporally continuous disentangled video representation based upon StyleGAN and Neural-ODEs. Effective traversal of the latent space learned by Generative Adversarial Networks (GANs) has been the basis for recent breakthroughs in image editing. However, the applicability of such advancements to the video domain has been hindered by the difficulty of representing and controlling videos in the latent space of GANs. In particular, videos are composed of content (i.e., appearance) and complex motion components that require a special mechanism to disentangle and control. To achieve this, VidStyleODE encodes the video content in a pre-trained StyleGAN $\mathcal{W}_+$ space and benefits from a latent ODE component to summarize the spatiotemporal dynamics of the input video. Our novel continuous video generation process then combines the two to generate high-quality and temporally consistent videos with varying frame rates. We show that our proposed method enables a variety of applications on real videos: text-guided appearance manipulation, motion manipulation, image animation, and video interpolation and extrapolation. Project website: https://cyberiada.github.io/VidStyleODE

Inst-Inpaint: Instructing to Remove Objects with Diffusion Models

Apr 06, 2023

Ahmet Burak Yildirim, Vedat Baday, Erkut Erdem, Aykut Erdem, Aysegul Dundar

Abstract: The image inpainting task refers to erasing unwanted pixels from images and filling them in a semantically consistent and realistic way. Traditionally, the pixels to be erased are defined with binary masks. From the application point of view, a user needs to generate the masks for the objects they would like to remove, which can be time-consuming and error-prone. In this work, we are interested in an image inpainting algorithm that estimates which object is to be removed based on natural language input and simultaneously removes it. For this purpose, first, we construct a dataset named GQA-Inpaint for this task, which will be released soon. Second, we present a novel inpainting framework, Inst-Inpaint, that can remove objects from images based on the instructions given as text prompts. We set various GAN and diffusion-based baselines and run experiments on synthetic and real image datasets. We compare methods with different evaluation metrics that measure the quality and accuracy of the models and show significant quantitative and qualitative improvements.
