Fatih Porikli

Efficient neural supersampling on a novel gaming dataset

Aug 03, 2023
Antoine Mercier, Ruan Erasmus, Yashesh Savani, Manik Dhingra, Fatih Porikli, Guillaume Berger

Real-time rendering for video games has become increasingly challenging due to the need for higher resolutions, framerates and photorealism. Supersampling has emerged as an effective solution to address this challenge. Our work introduces a novel neural algorithm for supersampling rendered content that is 4 times more efficient than existing methods while maintaining the same level of accuracy. Additionally, we introduce a new dataset which provides auxiliary modalities such as motion vectors and depth generated using graphics rendering features like viewport jittering and mipmap biasing at different resolutions. We believe that this dataset fills a gap in the current dataset landscape and can serve as a valuable resource to help measure progress in the field and advance the state-of-the-art in super-resolution techniques for gaming content.
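
A common ingredient of temporal supersampling pipelines is backward-warping the previous output with the rendered motion vectors; the sketch below illustrates that step only (it is not the paper's architecture, and tensor names, shapes, and conventions are illustrative assumptions).

```python
# Minimal sketch: warp the previous frame's output with per-pixel motion vectors,
# a common building block of temporal neural supersampling (illustrative only).
import torch
import torch.nn.functional as F

def warp_with_motion_vectors(prev_frame, motion_vectors):
    """Backward-warp prev_frame (B, C, H, W) using motion vectors (B, 2, H, W)
    given in pixels, with x pointing right and y pointing down."""
    b, _, h, w = prev_frame.shape
    # Base sampling grid in normalized [-1, 1] coordinates.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=prev_frame.device),
        torch.linspace(-1, 1, w, device=prev_frame.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    # Convert pixel-space motion vectors to normalized offsets.
    offset = torch.stack(
        (motion_vectors[:, 0] * 2.0 / max(w - 1, 1),
         motion_vectors[:, 1] * 2.0 / max(h - 1, 1)), dim=-1)
    grid = base + offset
    return F.grid_sample(prev_frame, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

# Example: reuse the previous 1080p output for the current frame (zero motion).
prev = torch.rand(1, 3, 1080, 1920)
mv = torch.zeros(1, 2, 1080, 1920)
warped = warp_with_motion_vectors(prev, mv)
```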

* ICCV'23 

MAMo: Leveraging Memory and Attention for Monocular Video Depth Estimation

Jul 26, 2023
Rajeev Yasarla, Hong Cai, Jisoo Jeong, Yunxiao Shi, Risheek Garrepalli, Fatih Porikli

We propose MAMo, a novel memory and attention framework for monocular video depth estimation. MAMo can augment and improve any single-image depth estimation network into a video depth estimation model, enabling it to take advantage of temporal information to predict more accurate depth. In MAMo, we augment the model with memory that aids depth prediction as the model streams through the video. Specifically, the memory stores learned visual and displacement tokens from previous time instances. This allows the depth network to cross-reference relevant features from the past when predicting depth on the current frame. We introduce a novel scheme to continuously update the memory, optimizing it to keep tokens that correspond with both past and present visual information. We adopt an attention-based approach to process the memory features: we first learn the spatio-temporal relations among the visual and displacement memory tokens using a self-attention module, and then aggregate the output features of self-attention with the current visual features through cross-attention. The cross-attended features are finally given to a decoder to predict depth on the current frame. Through extensive experiments on several benchmarks, including KITTI, NYU-Depth V2, and DDAD, we show that MAMo consistently improves monocular depth estimation networks and sets new state-of-the-art (SOTA) accuracy. Notably, our MAMo video depth estimation provides higher accuracy with lower latency compared to SOTA cost-volume-based video depth models.
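
A minimal, hedged sketch of the memory-and-attention flow described above, not the authors' implementation: past tokens are kept in a fixed-length memory, refined with self-attention, and fused with current-frame features via cross-attention. Class and variable names, dimensions, and the memory-update rule are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MemoryAttentionSketch(nn.Module):
    def __init__(self, dim=256, heads=8, mem_len=4):
        super().__init__()
        self.mem_len = mem_len
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.memory = []  # list of (B, N, dim) token tensors from past frames

    def update_memory(self, tokens):
        # Keep a fixed-length queue of past visual/displacement tokens.
        self.memory.append(tokens.detach())
        self.memory = self.memory[-self.mem_len:]

    def forward(self, cur_tokens):
        if not self.memory:
            return cur_tokens
        mem = torch.cat(self.memory, dim=1)               # (B, mem_len*N, dim)
        mem, _ = self.self_attn(mem, mem, mem)            # spatio-temporal relations in memory
        fused, _ = self.cross_attn(cur_tokens, mem, mem)  # current queries attend to memory
        return cur_tokens + fused                         # features handed to a depth decoder

# Usage: stream frames and enrich the current frame's features with memory.
model = MemoryAttentionSketch()
for _ in range(3):
    cur = torch.rand(1, 1024, 256)   # e.g. a flattened feature map as tokens
    out = model(cur)
    model.update_memory(cur)
```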

* Accepted at ICCV 2023 

DIFT: Dynamic Iterative Field Transforms for Memory Efficient Optical Flow

Jun 09, 2023
Risheek Garrepalli, Jisoo Jeong, Rajeswaran C Ravindran, Jamie Menjay Lin, Fatih Porikli

Recent advances in neural network-based optical flow estimation often come with prohibitively high computational and memory requirements, which complicates adapting these models to mobile and low-power use cases. In this paper, we introduce Dynamic Iterative Field Transforms (DIFT), a lightweight, low-latency, and memory-efficient model for optical flow estimation that is feasible for edge applications such as mobile, XR, micro UAVs, robotics, and cameras. DIFT follows an iterative refinement framework and leverages variable-resolution cost volumes for correspondence estimation. We propose a memory-efficient scheme for cost volume processing that reduces peak memory, and a novel dynamic coarse-to-fine cost volume processing across refinement stages that avoids maintaining multiple levels of cost volumes. We demonstrate the first real-time cost-volume-based optical flow DL architecture on the Snapdragon 8 Gen 1 HTP efficient mobile AI accelerator, running at 32 inferences/sec with 5.89 EPE (endpoint error) on KITTI, with a manageable accuracy-performance tradeoff.
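
One way to keep peak memory low in cost-volume-based flow, sketched below under stated assumptions (this is not the released DIFT code), is to compute a local correlation volume around each pixel at a coarse resolution instead of materializing an all-pairs volume.

```python
import torch
import torch.nn.functional as F

def local_cost_volume(feat1, feat2, radius=4):
    """feat1, feat2: (B, C, H, W). Returns (B, (2r+1)^2, H, W) of dot-product
    costs between each feat1 pixel and a (2r+1)x(2r+1) neighborhood in feat2."""
    b, c, h, w = feat1.shape
    k = 2 * radius + 1
    # Unfold feat2 into sliding local neighborhoods: (B, C*k*k, H*W).
    neighbors = F.unfold(feat2, kernel_size=k, padding=radius)
    neighbors = neighbors.view(b, c, k * k, h, w)
    cost = (feat1.unsqueeze(2) * neighbors).sum(dim=1) / c ** 0.5
    return cost  # (B, k*k, H, W)

# Coarse-to-fine: build the cost volume at a coarse (e.g. 1/16) resolution first.
f1 = torch.rand(1, 128, 24, 80)   # illustrative KITTI-like coarse features
f2 = torch.rand(1, 128, 24, 80)
cost = local_cost_volume(f1, f2, radius=4)   # (1, 81, 24, 80)
```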

* CVPR MAI 2023 Accepted Paper 

X-Align++: cross-modal cross-view alignment for Bird's-eye-view segmentation

Jun 06, 2023
Shubhankar Borse, Senthil Yogamani, Marvin Klingner, Varun Ravi, Hong Cai, Abdulaziz Almuzairee, Fatih Porikli

Bird's-eye-view (BEV) grid is a typical representation of the perception of road components, e.g., drivable area, in autonomous driving. Most existing approaches rely on cameras only to perform segmentation in BEV space, which is fundamentally constrained by the absence of reliable depth information. The latest works leverage both camera and LiDAR modalities but suboptimally fuse their features using simple, concatenation-based mechanisms. In this paper, we address these problems by enhancing the alignment of the unimodal features in order to aid feature fusion, as well as enhancing the alignment between the cameras' perspective view (PV) and BEV representations. We propose X-Align, a novel end-to-end cross-modal and cross-view learning framework for BEV segmentation consisting of the following components: (i) a novel Cross-Modal Feature Alignment (X-FA) loss, (ii) an attention-based Cross-Modal Feature Fusion (X-FF) module to align multi-modal BEV features implicitly, and (iii) an auxiliary PV segmentation branch with Cross-View Segmentation Alignment (X-SA) losses to improve the PV-to-BEV transformation. We evaluate our proposed method across two commonly used benchmark datasets, i.e., nuScenes and KITTI-360. Notably, X-Align significantly outperforms the state-of-the-art by 3 absolute mIoU points on nuScenes. We also provide extensive ablation studies to demonstrate the effectiveness of the individual components.
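
As a hedged illustration only (the paper defines its own X-FA loss, which may differ), one plausible form of a cross-modal feature alignment term is a cosine-similarity loss between camera-derived and LiDAR-derived BEV feature maps on the same grid:

```python
import torch
import torch.nn.functional as F

def cross_modal_alignment_loss(camera_bev, lidar_bev):
    """camera_bev, lidar_bev: (B, C, H, W) BEV feature maps on the same grid."""
    cam = F.normalize(camera_bev, dim=1)
    lid = F.normalize(lidar_bev, dim=1)
    # 1 - cosine similarity, averaged over all BEV cells.
    return (1.0 - (cam * lid).sum(dim=1)).mean()

cam_bev = torch.rand(2, 128, 200, 200)
lidar_bev = torch.rand(2, 128, 200, 200)
loss = cross_modal_alignment_loss(cam_bev, lidar_bev)
```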

* Accepted for publication at Springer Machine Vision and Applications Journal. The Version of Record of this article is published in Machine Vision and Applications Journal, and is available online at https://doi.org/10.1007/s00138-023-01400-7. arXiv admin note: substantial text overlap with arXiv:2210.06778 

OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding

May 18, 2023
Minghua Liu, Ruoxi Shi, Kaiming Kuang, Yinhao Zhu, Xuanlin Li, Shizhong Han, Hong Cai, Fatih Porikli, Hao Su

We introduce OpenShape, a method for learning multi-modal joint representations of text, image, and point clouds. We adopt the commonly used multi-modal contrastive learning framework for representation alignment, but with a specific focus on scaling up 3D representations to enable open-world 3D shape understanding. To achieve this, we scale up training data by ensembling multiple 3D datasets and propose several strategies to automatically filter and enrich noisy text descriptions. We also explore and compare strategies for scaling 3D backbone networks and introduce a novel hard negative mining module for more efficient training. We evaluate OpenShape on zero-shot 3D classification benchmarks and demonstrate its superior capabilities for open-world recognition. Specifically, OpenShape achieves a zero-shot accuracy of 46.8% on the 1,156-category Objaverse-LVIS benchmark, compared to less than 10% for existing methods. OpenShape also achieves an accuracy of 85.3% on ModelNet40, outperforming previous zero-shot baseline methods by 20% and performing on par with some fully-supervised methods. Furthermore, we show that our learned embeddings encode a wide range of visual and semantic concepts (e.g., subcategories, color, shape, style) and facilitate fine-grained text-3D and image-3D interactions. Due to their alignment with CLIP embeddings, our learned shape representations can also be integrated with off-the-shelf CLIP-based models for various applications, such as point cloud captioning and point cloud-conditioned image generation.
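
The general recipe of aligning a 3D encoder to frozen CLIP embeddings can be sketched with a symmetric InfoNCE-style contrastive loss; this is an illustrative sketch, not the OpenShape training code, and the batch size, dimensions, and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(shape_emb, clip_emb, temperature=0.07):
    """shape_emb: (B, D) from the 3D backbone; clip_emb: (B, D) from CLIP
    (text or image) for the paired samples."""
    s = F.normalize(shape_emb, dim=-1)
    c = F.normalize(clip_emb, dim=-1)
    logits = s @ c.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(s.size(0), device=s.device)
    # Symmetric cross-entropy: shape-to-CLIP and CLIP-to-shape directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

shape_emb = torch.rand(8, 512)
text_emb = torch.rand(8, 512)    # e.g. frozen CLIP text features
img_emb = torch.rand(8, 512)     # e.g. frozen CLIP image features
loss = contrastive_loss(shape_emb, text_emb) + contrastive_loss(shape_emb, img_emb)
```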

* Project Website: https://colin97.github.io/OpenShape/ 

A Review of Deep Learning for Video Captioning

Apr 22, 2023
Moloud Abdar, Meenakshi Kollati, Swaraja Kuraparthi, Farhad Pourpanah, Daniel McDuff, Mohammad Ghavamzadeh, Shuicheng Yan, Abduallah Mohamed, Abbas Khosravi, Erik Cambria, Fatih Porikli

Video captioning (VC) is a fast-moving, cross-disciplinary area of research that bridges work in the fields of computer vision, natural language processing (NLP), linguistics, and human-computer interaction. In essence, VC involves understanding a video and describing it with language. Captioning is used in a host of applications, from creating more accessible interfaces (e.g., low-vision navigation) to video question answering (V-QA), video retrieval, and content generation. This survey covers deep learning-based VC, including, but not limited to, attention-based architectures, graph networks, reinforcement learning, adversarial networks, dense video captioning (DVC), and more. We discuss the datasets and evaluation metrics used in the field, as well as limitations, applications, challenges, and future directions for VC.

* 42 pages, 10 figures 

Factorized Inverse Path Tracing for Efficient and Accurate Material-Lighting Estimation

Apr 12, 2023
Liwen Wu, Rui Zhu, Mustafa B. Yaldiz, Yinhao Zhu, Hong Cai, Janarbek Matai, Fatih Porikli, Tzu-Mao Li, Manmohan Chandraker, Ravi Ramamoorthi

Inverse path tracing has recently been applied to joint material and lighting estimation, given geometry and multi-view HDR observations of an indoor scene. However, it has two major limitations: path tracing is expensive to compute, and ambiguities exist between reflection and emission. We propose a novel Factorized Inverse Path Tracing (FIPT) method that utilizes a factored light transport formulation and finds emitters driven by rendering errors. Our algorithm enables accurate material and lighting optimization faster than previous work, and is more effective at resolving ambiguities. Exhaustive experiments on synthetic scenes show that our method (1) outperforms state-of-the-art indoor inverse rendering and relighting methods, particularly in the presence of complex illumination effects; and (2) speeds up inverse path tracing optimization to less than an hour. We further demonstrate robustness to noisy inputs through material and lighting estimates that allow plausible relighting in a real scene. The source code is available at: https://github.com/lwwu2/fipt
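
For context, the reflection-versus-emission ambiguity comes from the standard rendering equation (shown below; this is background, not the paper's specific factorization), where observed radiance mixes an emission term with a reflection integral:

```latex
L_o(\mathbf{x}, \omega_o) = L_e(\mathbf{x}, \omega_o)
  + \int_{\Omega} f_r(\mathbf{x}, \omega_i, \omega_o)\, L_i(\mathbf{x}, \omega_i)\,
    (\mathbf{n} \cdot \omega_i)\, \mathrm{d}\omega_i
```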

* Updated results from MILO on Apr 10, 2023 

EGA-Depth: Efficient Guided Attention for Self-Supervised Multi-Camera Depth Estimation

Apr 06, 2023
Yunxiao Shi, Hong Cai, Amin Ansari, Fatih Porikli

The ubiquitous multi-camera setup on modern autonomous vehicles provides an opportunity to construct surround-view depth. Existing methods, however, either perform independent monocular depth estimation on each camera or rely on computationally heavy self-attention mechanisms. In this paper, we propose a novel guided attention architecture, EGA-Depth, which can improve both the efficiency and accuracy of self-supervised multi-camera depth estimation. More specifically, for each camera, we use its perspective view as the query to cross-reference its neighboring views and derive informative features for this camera view. This allows the model to perform attention only across views with considerable overlaps and avoid the costly computations of standard self-attention. Given its efficiency, EGA-Depth enables us to exploit higher-resolution visual features, leading to improved accuracy. Furthermore, EGA-Depth can incorporate more frames from previous time steps, as it scales linearly w.r.t. the number of views and frames. Extensive experiments on two challenging autonomous driving benchmarks, nuScenes and DDAD, demonstrate the efficacy of our proposed EGA-Depth and show that it achieves a new state of the art in self-supervised multi-camera depth estimation.
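
A hedged sketch of the guided-attention idea, not the authors' code: each camera's feature tokens act as queries that cross-attend only to the tokens of its overlapping neighbor views, rather than running full self-attention across all views. The module name, token shapes, and the ring-shaped neighbor map are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GuidedViewAttentionSketch(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, view_tokens, neighbors):
        """view_tokens: list of V tensors (B, N, dim), one per camera.
        neighbors: dict mapping view index -> indices of overlapping views."""
        refined = []
        for v, q in enumerate(view_tokens):
            # Keys/values come only from the overlapping neighbor views.
            kv = torch.cat([view_tokens[n] for n in neighbors[v]], dim=1)
            out, _ = self.cross_attn(q, kv, kv)
            refined.append(q + out)
        return refined

# Example: 6 surround cameras, each attending only to its two adjacent neighbors.
views = [torch.rand(1, 900, 256) for _ in range(6)]
ring = {v: [(v - 1) % 6, (v + 1) % 6] for v in range(6)}
refined = GuidedViewAttentionSketch()(views, ring)
```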

* CVPR 2023 Workshop on Autonomous Driving 

4D Panoptic Segmentation as Invariant and Equivariant Field Prediction

Mar 28, 2023
Minghan Zhu, Shizhong Han, Hong Cai, Shubhankar Borse, Maani Ghaffari Jadidi, Fatih Porikli

In this paper, we develop rotation-equivariant neural networks for 4D panoptic segmentation. 4D panoptic segmentation is a recently established benchmark task for autonomous driving, which requires recognizing semantic classes and object instances on the road based on LiDAR scans, as well as assigning temporally consistent IDs to instances across time. We observe that the driving scenario is symmetric to rotations on the ground plane. Therefore, rotation-equivariance could provide better generalization and more robust feature learning. Specifically, we review the object instance clustering strategies, and restate the centerness-based approach and the offset-based approach as the prediction of invariant scalar fields and equivariant vector fields. Other sub-tasks are also unified from this perspective, and different invariant and equivariant layers are designed to facilitate their predictions. Through evaluation on the standard 4D panoptic segmentation benchmark of SemanticKITTI, we show that our equivariant models achieve higher accuracy with lower computational costs compared to their non-equivariant counterparts. Moreover, our method sets a new state of the art and achieves 1st place on the SemanticKITTI 4D Panoptic Segmentation leaderboard.
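
The field view stated above can be made concrete with a toy check, shown below under illustrative assumptions (this is not the paper's network): under a rotation about the vertical axis, a centerness prediction should behave as an invariant scalar field, while per-point offsets to instance centers should behave as an equivariant vector field.

```python
import math
import torch

def rotation_z(theta):
    # Rotation about the vertical (z) axis, i.e. on the ground plane.
    c, s = math.cos(theta), math.sin(theta)
    return torch.tensor([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

points = torch.rand(100, 3)                 # toy LiDAR points
center = points.mean(dim=0, keepdim=True)   # toy single-instance center
offsets = center - points                   # offset field: point -> instance center
centerness = (-offsets.norm(dim=1)).exp()   # toy scalar field

R = rotation_z(0.7)
points_rot = points @ R.t()                 # rotate the input scene
offsets_rot_target = offsets @ R.t()        # equivariant: offsets rotate with R
centerness_rot_target = centerness          # invariant: scalars are unchanged

# An equivariant network f should map points_rot to (approximately) these targets,
# which is what the invariant/equivariant layers are designed to guarantee.
```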

DistractFlow: Improving Optical Flow Estimation via Realistic Distractions and Pseudo-Labeling

Mar 24, 2023
Jisoo Jeong, Hong Cai, Risheek Garrepalli, Fatih Porikli

We propose a novel data augmentation approach, DistractFlow, for training optical flow estimation models by introducing realistic distractions to the input frames. Based on a mixing ratio, we combine one of the frames in the pair with a distractor image depicting a similar domain, which allows for inducing visual perturbations congruent with natural objects and scenes. We refer to such pairs as distracted pairs. Our intuition is that using semantically meaningful distractors enables the model to learn related variations and attain robustness against challenging deviations, compared to conventional augmentation schemes focusing only on low-level aspects and modifications. More specifically, in addition to the supervised loss computed between the estimated flow for the original pair and its ground-truth flow, we include a second supervised loss defined between the distracted pair's flow and the original pair's ground-truth flow, weighted with the same mixing ratio. Furthermore, when unlabeled data is available, we extend our augmentation approach to self-supervised settings through pseudo-labeling and cross-consistency regularization. Given an original pair and its distracted version, we enforce the estimated flow on the distracted pair to agree with the flow of the original pair. Our approach significantly increases the number of available training pairs without requiring additional annotations. It is agnostic to the model architecture and can be applied to the training of any optical flow estimation model. Our extensive evaluations on multiple benchmarks, including Sintel, KITTI, and SlowFlow, show that DistractFlow improves existing models consistently, outperforming the latest state of the art.
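
A hedged sketch of the augmentation and loss weighting described above, not the official implementation: the second frame is mixed with a distractor image, and the supervised term on the distracted pair is weighted by the same mixing ratio. Function names, the mixing ratio, and the dummy model are illustrative assumptions.

```python
import torch

def distracted_pair(frame1, frame2, distractor, alpha=0.7):
    """frame1, frame2, distractor: (B, 3, H, W). Mix frame2 with the distractor."""
    return frame1, alpha * frame2 + (1.0 - alpha) * distractor

def epe(flow_pred, flow_gt):
    # Average endpoint error between predicted and ground-truth flow (B, 2, H, W).
    return (flow_pred - flow_gt).norm(dim=1).mean()

def distractflow_supervised_loss(model, frame1, frame2, distractor, flow_gt, alpha=0.7):
    flow_orig = model(frame1, frame2)
    _, frame2_mix = distracted_pair(frame1, frame2, distractor, alpha)
    flow_dist = model(frame1, frame2_mix)
    # Loss on the original pair plus a mixing-ratio-weighted loss on the distracted pair.
    return epe(flow_orig, flow_gt) + alpha * epe(flow_dist, flow_gt)

# Toy usage with a dummy "model" that predicts zero flow.
dummy = lambda f1, f2: torch.zeros(f1.size(0), 2, f1.size(2), f1.size(3))
f1, f2, dist = (torch.rand(1, 3, 64, 64) for _ in range(3))
gt = torch.rand(1, 2, 64, 64)
loss = distractflow_supervised_loss(dummy, f1, f2, dist, gt)
```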

* CVPR 2023 