Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Omar Javed

Adapting Vision-Language Models for E-commerce Understanding at Scale

Feb 12, 2026

Matteo Nulli, Vladimir Orshulevich, Tala Bazazo, Christian Herold, Michael Kozielski, Marcin Mazur, Szymon Tuzel, Cees G. M. Snoek, Seyyed Hadi Hashemi, Omar Javed(+2 more)

Abstract:E-commerce product understanding demands by nature, strong multimodal comprehension from text, images, and structured attributes. General-purpose Vision-Language Models (VLMs) enable generalizable multimodal latent modelling, yet there is no documented, well-known strategy for adapting them to the attribute-centric, multi-image, and noisy nature of e-commerce data, without sacrificing general performance. In this work, we show through a large-scale experimental study, how targeted adaptation of general VLMs can substantially improve e-commerce performance while preserving broad multimodal capabilities. Furthermore, we propose a novel extensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction.

Via

Access Paper or Ask Questions

Depth Extraction from Videos Using Geometric Context and Occlusion Boundaries

Oct 25, 2015

S. Hussain Raza, Omar Javed, Aveek Das, Harpreet Sawhney, Hui Cheng, Irfan Essa

Figure 1 for Depth Extraction from Videos Using Geometric Context and Occlusion Boundaries

Figure 2 for Depth Extraction from Videos Using Geometric Context and Occlusion Boundaries

Figure 3 for Depth Extraction from Videos Using Geometric Context and Occlusion Boundaries

Figure 4 for Depth Extraction from Videos Using Geometric Context and Occlusion Boundaries

Abstract:We present an algorithm to estimate depth in dynamic video scenes. We propose to learn and infer depth in videos from appearance, motion, occlusion boundaries, and geometric context of the scene. Using our method, depth can be estimated from unconstrained videos with no requirement of camera pose estimation, and with significant background/foreground motions. We start by decomposing a video into spatio-temporal regions. For each spatio-temporal region, we learn the relationship of depth to visual appearance, motion, and geometric classes. Then we infer the depth information of new scenes using piecewise planar parametrization estimated within a Markov random field (MRF) framework by combining appearance to depth learned mappings and occlusion boundary guided smoothness constraints. Subsequently, we perform temporal smoothing to obtain temporally consistent depth maps. To evaluate our depth estimation algorithm, we provide a novel dataset with ground truth depth for outdoor video scenes. We present a thorough evaluation of our algorithm on our new dataset and the publicly available Make3d static image dataset.

* British Machine Vision Conference (BMVC) 2014

Via

Access Paper or Ask Questions