Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Aykut Erdem

Shammie

Auteur: Language-Driven Cinematographic Framing for Human-Centric Video Generation

Jun 01, 2026

Muhammed Burak Kizil, Enes Sanli, Niloy J. Mitra, Xuelin Chen, Erkut Erdem, Aykut Erdem, Duygu Ceylan

Abstract:Generative video models have achieved remarkable visual fidelity and temporal coherence, yet intentional camera control remains elusive. Existing frameworks treat camera motion as a byproduct of pixel synthesis, producing trajectories that are stochastic, spatially inconsistent, and indifferent to the human subject driving the scene. In this work, we present Auteur, a method for language-driven, human-centric camera framing in generative video. Our core insight is that professional filmmakers conceive shots not as world-space trajectories but as framings defined relative to the actor, encoding shot size, angle, and composition as functions of human pose and motion. We formalize this intuition as a human-centric camera parameterization and introduce a Domain-Specific Language (DSL) that is convertible to standard 6-DoF camera parameters. A fine-tuned multimodal large language model then acts as a virtual director, mapping natural language descriptions and coarse human motion to sparse DSL keyframes that are deterministically interpolated into continuous camera trajectories, which are then provided as input to video generators. We train and evaluate Auteur on a new dataset of 34K aligned text, human motion, and DSL-annotated camera trajectories drawn from procedural synthesis and real-world movie footage from the CondensedMovies dataset. Auteur enables cinematographic framing of human-centered scenes, a capability largely absent in prior generative models. To assess this behavior, we propose new framing-focused metrics, and our experiments show that Auteur consistently outperforms existing methods

Via

Access Paper or Ask Questions

Learning to Think Like a Cartoon Captionist: Incongruity-Resolution Supervision for Multimodal Humor Understanding

Apr 16, 2026

Hatice Merve Vural, Doga Kukul, Ege Erdem Ozlu, Demir Ekin Arikan, Bob Mankoff, Erkut Erdem, Aykut Erdem

Abstract:Humor is one of the few cognitive tasks where getting the reasoning right matters as much as getting the answer right. While recent work evaluates humor understanding on benchmarks such as the New Yorker Cartoon Caption Contest (NYCC), it largely treats it as black-box prediction, overlooking the structured reasoning processes underlying humor comprehension. We introduce IRS (Incongruity-Resolution Supervision), a framework that decomposes humor understanding into three components: incongruity modeling, which identifies mismatches in the visual scene; resolution modeling, which constructs coherent reinterpretations of these mismatches; and preference alignment, which evaluates candidate interpretations under human judgments. Grounded in incongruity-resolution theory and expert captionist practice, IRS supervises intermediate reasoning process through structured traces that make the path from visual perception to humorous interpretation explicit and learnable. Across 7B, 32B, and 72B models on NYCC, IRS outperforms strong open and closed multimodal baselines across caption matching and ranking tasks, with our largest model approaching expert-level performance on ranking. Zero-shot transfer to external benchmarks shows that IRS learns generalizable reasoning patterns. Our results suggest that supervising reasoning structure, rather than scale alone, is key for reasoning-centric tasks.

Via

Access Paper or Ask Questions

FewMMBench: A Benchmark for Multimodal Few-Shot Learning

Feb 25, 2026

Mustafa Dogan, Ilker Kesen, Iacer Calixto, Aykut Erdem, Erkut Erdem

Abstract:As multimodal large language models (MLLMs) advance in handling interleaved image-text data, assessing their few-shot learning capabilities remains an open challenge. In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under few-shot conditions, with a focus on In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting. Covering a diverse suite of multimodal understanding tasks, from attribute recognition to temporal reasoning, FewMMBench enables systematic analysis across task types, model families, and prompting strategies. We evaluate 26 open-weight MLLMs from six model families across zero-shot, few-shot, and CoT-augmented few-shot settings. Our findings reveal that instruction-tuned models exhibit strong zero-shot performance but benefit minimally, or even regress, with additional demonstrations or CoT reasoning. Retrieval-based demonstrations and increased context size also yield limited gains. These results highlight FewMMBench as a rigorous testbed for diagnosing and advancing few-shot capabilities in multimodal LLMs. The data is available at: https://huggingface.co/datasets/mustafaa/FewMMBench

* Preprint. 49 pages, 38 Figures, 5 Tables

Via

Access Paper or Ask Questions

DeVisE: Behavioral Testing of Medical Large Language Models

Jun 18, 2025

Camila Zurdo Tagliabue, Heloisa Oss Boll, Aykut Erdem, Erkut Erdem, Iacer Calixto

Abstract:Large language models (LLMs) are increasingly used in clinical decision support, yet current evaluation methods often fail to distinguish genuine medical reasoning from superficial patterns. We introduce DeVisE (Demographics and Vital signs Evaluation), a behavioral testing framework for probing fine-grained clinical understanding. We construct a dataset of ICU discharge notes from MIMIC-IV, generating both raw (real-world) and template-based (synthetic) versions with controlled single-variable counterfactuals targeting demographic (age, gender, ethnicity) and vital sign attributes. We evaluate five LLMs spanning general-purpose and medically fine-tuned variants, under both zero-shot and fine-tuned settings. We assess model behavior via (1) input-level sensitivity - how counterfactuals alter the likelihood of a note; and (2) downstream reasoning - how they affect predicted hospital length-of-stay. Our results show that zero-shot models exhibit more coherent counterfactual reasoning patterns, while fine-tuned models tend to be more stable yet less responsive to clinically meaningful changes. Notably, demographic factors subtly but consistently influence outputs, emphasizing the importance of fairness-aware evaluation. This work highlights the utility of behavioral testing in exposing the reasoning strategies of clinical LLMs and informing the design of safer, more transparent medical AI systems.

Via

Access Paper or Ask Questions

A Vision-Language Framework for Multispectral Scene Representation Using Language-Grounded Features

Jan 17, 2025

Enes Karanfil, Nevrez Imamoglu, Erkut Erdem, Aykut Erdem

Abstract:Scene understanding in remote sensing often faces challenges in generating accurate representations for complex environments such as various land use areas or coastal regions, which may also include snow, clouds, or haze. To address this, we present a vision-language framework named Spectral LLaVA, which integrates multispectral data with vision-language alignment techniques to enhance scene representation and description. Using the BigEarthNet v2 dataset from Sentinel-2, we establish a baseline with RGB-based scene descriptions and further demonstrate substantial improvements through the incorporation of multispectral information. Our framework optimizes a lightweight linear projection layer for alignment while keeping the vision backbone of SpectralGPT frozen. Our experiments encompass scene classification using linear probing and language modeling for jointly performing scene classification and description generation. Our results highlight Spectral LLaVA's ability to produce detailed and accurate descriptions, particularly for scenarios where RGB data alone proves inadequate, while also enhancing classification performance by refining SpectralGPT features into semantically meaningful representations.

Via

Access Paper or Ask Questions

GaussianVideo: Efficient Video Representation via Hierarchical Gaussian Splatting

Jan 08, 2025

Andrew Bond, Jui-Hsien Wang, Long Mai, Erkut Erdem, Aykut Erdem

Abstract:Efficient neural representations for dynamic video scenes are critical for applications ranging from video compression to interactive simulations. Yet, existing methods often face challenges related to high memory usage, lengthy training times, and temporal consistency. To address these issues, we introduce a novel neural video representation that combines 3D Gaussian splatting with continuous camera motion modeling. By leveraging Neural ODEs, our approach learns smooth camera trajectories while maintaining an explicit 3D scene representation through Gaussians. Additionally, we introduce a spatiotemporal hierarchical learning strategy, progressively refining spatial and temporal features to enhance reconstruction quality and accelerate convergence. This memory-efficient approach achieves high-quality rendering at impressive speeds. Experimental results show that our hierarchical learning, combined with robust camera motion modeling, captures complex dynamic scenes with strong temporal consistency, achieving state-of-the-art performance across diverse video datasets in both high- and low-motion scenarios.

* 10 pages, 10 figures

Via

Access Paper or Ask Questions

HyperGAN-CLIP: A Unified Framework for Domain Adaptation, Image Synthesis and Manipulation

Nov 19, 2024

Abdul Basit Anees, Ahmet Canberk Baykal, Muhammed Burak Kizil, Duygu Ceylan, Erkut Erdem, Aykut Erdem

Figure 1 for HyperGAN-CLIP: A Unified Framework for Domain Adaptation, Image Synthesis and Manipulation

Figure 2 for HyperGAN-CLIP: A Unified Framework for Domain Adaptation, Image Synthesis and Manipulation

Figure 3 for HyperGAN-CLIP: A Unified Framework for Domain Adaptation, Image Synthesis and Manipulation

Figure 4 for HyperGAN-CLIP: A Unified Framework for Domain Adaptation, Image Synthesis and Manipulation

Abstract:Generative Adversarial Networks (GANs), particularly StyleGAN and its variants, have demonstrated remarkable capabilities in generating highly realistic images. Despite their success, adapting these models to diverse tasks such as domain adaptation, reference-guided synthesis, and text-guided manipulation with limited training data remains challenging. Towards this end, in this study, we present a novel framework that significantly extends the capabilities of a pre-trained StyleGAN by integrating CLIP space via hypernetworks. This integration allows dynamic adaptation of StyleGAN to new domains defined by reference images or textual descriptions. Additionally, we introduce a CLIP-guided discriminator that enhances the alignment between generated images and target domains, ensuring superior image quality. Our approach demonstrates unprecedented flexibility, enabling text-guided image manipulation without the need for text-specific training data and facilitating seamless style transfer. Comprehensive qualitative and quantitative evaluations confirm the robustness and superior performance of our framework compared to existing methods.

* Accepted for publication in SIGGRAPH Asia 2024. Project Website: https://cyberiada.github.io/HyperGAN-CLIP/

Via

Access Paper or Ask Questions

HUE Dataset: High-Resolution Event and Frame Sequences for Low-Light Vision

Oct 24, 2024

Burak Ercan, Onur Eker, Aykut Erdem, Erkut Erdem

Figure 1 for HUE Dataset: High-Resolution Event and Frame Sequences for Low-Light Vision

Figure 2 for HUE Dataset: High-Resolution Event and Frame Sequences for Low-Light Vision

Figure 3 for HUE Dataset: High-Resolution Event and Frame Sequences for Low-Light Vision

Figure 4 for HUE Dataset: High-Resolution Event and Frame Sequences for Low-Light Vision

Abstract:Low-light environments pose significant challenges for image enhancement methods. To address these challenges, in this work, we introduce the HUE dataset, a comprehensive collection of high-resolution event and frame sequences captured in diverse and challenging low-light conditions. Our dataset includes 106 sequences, encompassing indoor, cityscape, twilight, night, driving, and controlled scenarios, each carefully recorded to address various illumination levels and dynamic ranges. Utilizing a hybrid RGB and event camera setup. we collect a dataset that combines high-resolution event data with complementary frame data. We employ both qualitative and quantitative evaluations using no-reference metrics to assess state-of-the-art low-light enhancement and event-based image reconstruction methods. Additionally, we evaluate these methods on a downstream object detection task. Our findings reveal that while event-based methods perform well in specific metrics, they may produce false positives in practical applications. This dataset and our comprehensive analysis provide valuable insights for future research in low-light vision and hybrid camera systems.

* 18 pages, 4 figures. Has been accepted for publication at the European Conference on Computer Vision Workshops (ECCVW), Milano, 2024. The project page can be found at https://ercanburak.github.io/HUE.html

Via

Access Paper or Ask Questions

Evaluating Linguistic Capabilities of Multimodal LLMs in the Lens of Few-Shot Learning

Jul 17, 2024

Mustafa Dogan, Ilker Kesen, Iacer Calixto, Aykut Erdem, Erkut Erdem

Abstract:The linguistic capabilities of Multimodal Large Language Models (MLLMs) are critical for their effective application across diverse tasks. This study aims to evaluate the performance of MLLMs on the VALSE benchmark, focusing on the efficacy of few-shot In-Context Learning (ICL), and Chain-of-Thought (CoT) prompting. We conducted a comprehensive assessment of state-of-the-art MLLMs, varying in model size and pretraining datasets. The experimental results reveal that ICL and CoT prompting significantly boost model performance, particularly in tasks requiring complex reasoning and contextual understanding. Models pretrained on captioning datasets show superior zero-shot performance, while those trained on interleaved image-text data benefit from few-shot learning. Our findings provide valuable insights into optimizing MLLMs for better grounding of language in visual contexts, highlighting the importance of the composition of pretraining data and the potential of few-shot learning strategies to improve the reasoning abilities of MLLMs.

* Preprint. 33 pages, 17 Figures, 3 Tables

Via

Access Paper or Ask Questions

CLIPAway: Harmonizing Focused Embeddings for Removing Objects via Diffusion Models

Jun 13, 2024

Yigit Ekin, Ahmet Burak Yildirim, Erdem Eren Caglar, Aykut Erdem, Erkut Erdem, Aysegul Dundar

Figure 1 for CLIPAway: Harmonizing Focused Embeddings for Removing Objects via Diffusion Models

Figure 2 for CLIPAway: Harmonizing Focused Embeddings for Removing Objects via Diffusion Models

Figure 3 for CLIPAway: Harmonizing Focused Embeddings for Removing Objects via Diffusion Models

Figure 4 for CLIPAway: Harmonizing Focused Embeddings for Removing Objects via Diffusion Models

Abstract:Advanced image editing techniques, particularly inpainting, are essential for seamlessly removing unwanted elements while preserving visual integrity. Traditional GAN-based methods have achieved notable success, but recent advancements in diffusion models have produced superior results due to their training on large-scale datasets, enabling the generation of remarkably realistic inpainted images. Despite their strengths, diffusion models often struggle with object removal tasks without explicit guidance, leading to unintended hallucinations of the removed object. To address this issue, we introduce CLIPAway, a novel approach leveraging CLIP embeddings to focus on background regions while excluding foreground elements. CLIPAway enhances inpainting accuracy and quality by identifying embeddings that prioritize the background, thus achieving seamless object removal. Unlike other methods that rely on specialized training datasets or costly manual annotations, CLIPAway provides a flexible, plug-and-play solution compatible with various diffusion-based inpainting techniques.

* Project page: https://yigitekin.github.io/CLIPAway/

Via

Access Paper or Ask Questions