Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Pouyan Navard

Loki: Representation over Architecture for Diffusion-Based Portrait Animation

May 22, 2026

Pouyan Navard, Sernam Lim

Abstract:Portrait animation transfers a driver clip's facial expression and head pose onto a single reference image while preserving the reference's identity. State-of-the-art diffusion systems address this by stacking trained modules for expression, pose, and identity in turn, paying for it in trainable parameters, proprietary corpora, and residual entanglement between the very axes the system is meant to control independently. This complexity compensates for an upstream choice -- learning facial expression and head pose from RGB, a representation in which identity, pose, and expression are inseparable without being learned apart. Loki steps out of RGB on the conditioning path. Driver expression and head pose are encoded by a face model whose parameter axes are identity-orthogonal by construction, then rasterised into a spatial map that the diffusion backbone consumes natively. Identity is routed separately through the diffusion backbone's own pretrained features via lightweight key-value injection. Because the parametric representation factorises identity from expression and pose, cross ID reenactment reduces to a coefficient substitution at inference, requiring no cross ID training data. Loki requires ~43% fewer inference parameters than leading diffusion baselines and trained on 1496x less video samples. We define two metrics that directly measure whether the generated head pose trajectory and facial expression followed the driver's -- the questions portrait animation actually asks; Loki leads or co-leads on both.

Via

Access Paper or Ask Questions

LLaVA-LE: Large Language-and-Vision Assistant for Lunar Exploration

Mar 25, 2026

Gokce Inal, Pouyan Navard, Alper Yilmaz

Abstract:Recent advances in multimodal vision-language models (VLMs) have enabled joint reasoning over visual and textual information, yet their application to planetary science remains largely unexplored. A key hindrance is the absence of large-scale datasets that pair real planetary imagery with detailed scientific descriptions. In this work, we introduce LLaVA-LE (Large Language-and-Vision Assistant for Lunar Exploration), a vision-language model specialized for lunar surface and subsurface characterization. To enable this capability, we curate a new large-scale multimodal lunar dataset, LUCID (LUnar Caption Image Dataset) consisting of 96k high-resolution panchromatic images paired with detailed captions describing lunar terrain characteristics, and 81k question-answer (QA) pairs derived from approximately 20k images in the LUCID dataset. Leveraging this dataset, we fine-tune LLaVA using a two-stage training curriculum: (1) concept alignment for domain-specific terrain description, and (2) instruction-tuned visual question answering. We further design evaluation benchmarks spanning multiple levels of reasoning complexity relevant to lunar terrain analysis. Evaluated against GPT and Gemini judges, LLaVA-LE achieves a 3.3x overall performance gain over Base LLaVA and 2.1x over our Stage 1 model, with a reasoning score of 1.070, exceeding the judge's own reference score, highlighting the effectiveness of domain-specific multimodal data and instruction tuning to advance VLMs in planetary exploration. Code is available at https://github.com/OSUPCVLab/LLaVA-LE.

* Accepted in AI4Space Workshop CVPR2026. Website: https://osupcvlab.github.io/LLaVA-LE/, Dataset: https://huggingface.co/datasets/pcvlab/lucid

Via

Access Paper or Ask Questions

KnobGen: Controlling the Sophistication of Artwork in Sketch-Based Diffusion Models

Oct 02, 2024

Pouyan Navard, Amin Karimi Monsefi, Mengxi Zhou, Wei-Lun Chao, Alper Yilmaz, Rajiv Ramnath

Figure 1 for KnobGen: Controlling the Sophistication of Artwork in Sketch-Based Diffusion Models

Figure 2 for KnobGen: Controlling the Sophistication of Artwork in Sketch-Based Diffusion Models

Figure 3 for KnobGen: Controlling the Sophistication of Artwork in Sketch-Based Diffusion Models

Figure 4 for KnobGen: Controlling the Sophistication of Artwork in Sketch-Based Diffusion Models

Abstract:Recent advances in diffusion models have significantly improved text-to-image (T2I) generation, but they often struggle to balance fine-grained precision with high-level control. Methods like ControlNet and T2I-Adapter excel at following sketches by seasoned artists but tend to be overly rigid, replicating unintentional flaws in sketches from novice users. Meanwhile, coarse-grained methods, such as sketch-based abstraction frameworks, offer more accessible input handling but lack the precise control needed for detailed, professional use. To address these limitations, we propose KnobGen, a dual-pathway framework that democratizes sketch-based image generation by seamlessly adapting to varying levels of sketch complexity and user skill. KnobGen uses a Coarse-Grained Controller (CGC) module for high-level semantics and a Fine-Grained Controller (FGC) module for detailed refinement. The relative strength of these two modules can be adjusted through our knob inference mechanism to align with the user's specific needs. These mechanisms ensure that KnobGen can flexibly generate images from both novice sketches and those drawn by seasoned artists. This maintains control over the final output while preserving the natural appearance of the image, as evidenced on the MultiGen-20M dataset and a newly collected sketch dataset.

Via

Access Paper or Ask Questions

A Probabilistic-based Drift Correction Module for Visual Inertial SLAMs

Apr 15, 2024

Pouyan Navard, Alper Yilmaz

Figure 1 for A Probabilistic-based Drift Correction Module for Visual Inertial SLAMs

Figure 2 for A Probabilistic-based Drift Correction Module for Visual Inertial SLAMs

Figure 3 for A Probabilistic-based Drift Correction Module for Visual Inertial SLAMs

Figure 4 for A Probabilistic-based Drift Correction Module for Visual Inertial SLAMs

Abstract:Positioning is a prominent field of study, notably focusing on Visual Inertial Odometry (VIO) and Simultaneous Localization and Mapping (SLAM) methods. Despite their advancements, these methods often encounter dead-reckoning errors that leads to considerable drift in estimated platform motion especially during long traverses. In such cases, the drift error is not negligible and should be rectified. Our proposed approach minimizes the drift error by correcting the estimated motion generated by any SLAM method at each epoch. Our methodology treats positioning measurements rendered by the SLAM solution as random variables formulated jointly in a multivariate distribution. In this setting, The correction of the drift becomes equivalent to finding the mode of this multivariate distribution which jointly maximizes the likelihood of a set of relevant geo-spatial priors about the platform motion and environment. Our method is integrable into any SLAM/VIO method as an correction module. Our experimental results shows the effectiveness of our approach in minimizing the drift error by 10x in long treverses.

Via

Access Paper or Ask Questions

SegFormer3D: an Efficient Transformer for 3D Medical Image Segmentation

Apr 15, 2024

Shehan Perera, Pouyan Navard, Alper Yilmaz

Figure 1 for SegFormer3D: an Efficient Transformer for 3D Medical Image Segmentation

Figure 2 for SegFormer3D: an Efficient Transformer for 3D Medical Image Segmentation

Figure 3 for SegFormer3D: an Efficient Transformer for 3D Medical Image Segmentation

Figure 4 for SegFormer3D: an Efficient Transformer for 3D Medical Image Segmentation

Abstract:The adoption of Vision Transformers (ViTs) based architectures represents a significant advancement in 3D Medical Image (MI) segmentation, surpassing traditional Convolutional Neural Network (CNN) models by enhancing global contextual understanding. While this paradigm shift has significantly enhanced 3D segmentation performance, state-of-the-art architectures require extremely large and complex architectures with large scale computing resources for training and deployment. Furthermore, in the context of limited datasets, often encountered in medical imaging, larger models can present hurdles in both model generalization and convergence. In response to these challenges and to demonstrate that lightweight models are a valuable area of research in 3D medical imaging, we present SegFormer3D, a hierarchical Transformer that calculates attention across multiscale volumetric features. Additionally, SegFormer3D avoids complex decoders and uses an all-MLP decoder to aggregate local and global attention features to produce highly accurate segmentation masks. The proposed memory efficient Transformer preserves the performance characteristics of a significantly larger model in a compact design. SegFormer3D democratizes deep learning for 3D medical image segmentation by offering a model with 33x less parameters and a 13x reduction in GFLOPS compared to the current state-of-the-art (SOTA). We benchmark SegFormer3D against the current SOTA models on three widely used datasets Synapse, BRaTs, and ACDC, achieving competitive results. Code: https://github.com/OSUPCVLab/SegFormer3D.git

* Accepted at CVPR Workshop 2024

Via

Access Paper or Ask Questions