Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Dong Lao

Triangular Consistency as a Universal Constraint for Learning Optical Flow

Jun 18, 2026

Yi Xiao, Carlos Rodriguez Coronel, Jing Zhan, Haniyeh Ehsani Oskouie, Alex Wong, Dong Lao

Abstract:We propose triangular consistency as a first-principled constraint for optical flow, which is agnostic to network architecture, supervision type, and dataset, and applies to both image-pair and multi-frame settings. This simple but powerful constraint is to compose two flows to induce a third flow and enforce consistency among the three. The composed flows may arise from (i) image pairs, yielding cycle consistency; (ii) multiple video frames, producing longer-range motion through temporal chaining; or (iii) image pairs combined with controlled synthetic transformations, which becomes data augmentation. This triangular consistency introduces negligible computational overhead and requires no additional annotations. Since it is derived directly from the geometry of optical flow, it does not rely on model-specific assumptions and serves as a ``universal'' plug-and-play component for optical flow training. Experiments show consistent improvement across supervised, unsupervised, and transfer learning settings.

* Accepted by ECCV 2026

Via

Access Paper or Ask Questions

Naïve PAINE: Lightweight Text-to-Image Generation Improvement with Prompt Evaluation

Mar 12, 2026

Joong Ho Kim, Nicholas Thai, Souhardya Saha Dip, Dong Lao, Keith G. Mills

Abstract:Text-to-Image (T2I) generation is primarily driven by Diffusion Models (DM) which rely on random Gaussian noise. Thus, like playing the slots at a casino, a DM will produce different results given the same user-defined inputs. This imposes a gambler's burden: To perform multiple generation cycles to obtain a satisfactory result. However, even though DMs use stochastic sampling to seed generation, the distribution of generated content quality highly depends on the prompt and the generative ability of a DM with respect to it. To account for this, we propose Naïve PAINE for improving the generative quality of Diffusion Models by leveraging T2I preference benchmarks. We directly predict the numerical quality of an image from the initial noise and given prompt. Naïve PAINE then selects a handful of quality noises and forwards them to the DM for generation. Further, Naïve PAINE provides feedback on the DM generative quality given the prompt and is lightweight enough to seamlessly fit into existing DM pipelines. Experimental results demonstrate that Naïve PAINE outperforms existing approaches on several prompt corpus benchmarks.

* Code available at https://github.com/LSU-ATHENA/Naive-PAINE

Via

Access Paper or Ask Questions

RSA: Resolving Scale Ambiguities in Monocular Depth Estimators through Language Descriptions

Oct 03, 2024

Ziyao Zeng, Yangchao Wu, Hyoungseob Park, Daniel Wang, Fengyu Yang, Stefano Soatto, Dong Lao, Byung-Woo Hong, Alex Wong

Figure 1 for RSA: Resolving Scale Ambiguities in Monocular Depth Estimators through Language Descriptions

Figure 2 for RSA: Resolving Scale Ambiguities in Monocular Depth Estimators through Language Descriptions

Figure 3 for RSA: Resolving Scale Ambiguities in Monocular Depth Estimators through Language Descriptions

Figure 4 for RSA: Resolving Scale Ambiguities in Monocular Depth Estimators through Language Descriptions

Abstract:We propose a method for metric-scale monocular depth estimation. Inferring depth from a single image is an ill-posed problem due to the loss of scale from perspective projection during the image formation process. Any scale chosen is a bias, typically stemming from training on a dataset; hence, existing works have instead opted to use relative (normalized, inverse) depth. Our goal is to recover metric-scaled depth maps through a linear transformation. The crux of our method lies in the observation that certain objects (e.g., cars, trees, street signs) are typically found or associated with certain types of scenes (e.g., outdoor). We explore whether language descriptions can be used to transform relative depth predictions to those in metric scale. Our method, RSA, takes as input a text caption describing objects present in an image and outputs the parameters of a linear transformation which can be applied globally to a relative depth map to yield metric-scaled depth predictions. We demonstrate our method on recent general-purpose monocular depth models on indoors (NYUv2) and outdoors (KITTI). When trained on multiple datasets, RSA can serve as a general alignment module in zero-shot settings. Our method improves over common practices in aligning relative to metric depth and results in predictions that are comparable to an upper bound of fitting relative depth to ground truth via a linear transformation.

Via

Access Paper or Ask Questions

Diffeomorphic Template Registration for Atmospheric Turbulence Mitigation

May 06, 2024

Dong Lao, Congli Wang, Alex Wong, Stefano Soatto

Figure 1 for Diffeomorphic Template Registration for Atmospheric Turbulence Mitigation

Figure 2 for Diffeomorphic Template Registration for Atmospheric Turbulence Mitigation

Figure 3 for Diffeomorphic Template Registration for Atmospheric Turbulence Mitigation

Figure 4 for Diffeomorphic Template Registration for Atmospheric Turbulence Mitigation

Abstract:We describe a method for recovering the irradiance underlying a collection of images corrupted by atmospheric turbulence. Since supervised data is often technically impossible to obtain, assumptions and biases have to be imposed to solve this inverse problem, and we choose to model them explicitly. Rather than initializing a latent irradiance ("template") by heuristics to estimate deformation, we select one of the images as a reference, and model the deformation in this image by the aggregation of the optical flow from it to other images, exploiting a prior imposed by Central Limit Theorem. Then with a novel flow inversion module, the model registers each image TO the template but WITHOUT the template, avoiding artifacts related to poor template initialization. To illustrate the robustness of the method, we simply (i) select the first frame as the reference and (ii) use the simplest optical flow to estimate the warpings, yet the improvement in registration is decisive in the final reconstruction, as we achieve state-of-the-art performance despite its simplicity. The method establishes a strong baseline that can be further improved by integrating it seamlessly into more sophisticated pipelines, or with domain-specific methods if so desired.

Via

Access Paper or Ask Questions

WorDepth: Variational Language Prior for Monocular Depth Estimation

Apr 05, 2024

Ziyao Zeng, Daniel Wang, Fengyu Yang, Hyoungseob Park, Yangchao Wu, Stefano Soatto, Byung-Woo Hong, Dong Lao, Alex Wong

Figure 1 for WorDepth: Variational Language Prior for Monocular Depth Estimation

Figure 2 for WorDepth: Variational Language Prior for Monocular Depth Estimation

Figure 3 for WorDepth: Variational Language Prior for Monocular Depth Estimation

Figure 4 for WorDepth: Variational Language Prior for Monocular Depth Estimation

Abstract:Three-dimensional (3D) reconstruction from a single image is an ill-posed problem with inherent ambiguities, i.e. scale. Predicting a 3D scene from text description(s) is similarly ill-posed, i.e. spatial arrangements of objects described. We investigate the question of whether two inherently ambiguous modalities can be used in conjunction to produce metric-scaled reconstructions. To test this, we focus on monocular depth estimation, the problem of predicting a dense depth map from a single image, but with an additional text caption describing the scene. To this end, we begin by encoding the text caption as a mean and standard deviation; using a variational framework, we learn the distribution of the plausible metric reconstructions of 3D scenes corresponding to the text captions as a prior. To "select" a specific reconstruction or depth map, we encode the given image through a conditional sampler that samples from the latent space of the variational text encoder, which is then decoded to the output depth map. Our approach is trained alternatingly between the text and image branches: in one optimization step, we predict the mean and standard deviation from the text description and sample from a standard Gaussian, and in the other, we sample using a (image) conditional sampler. Once trained, we directly predict depth from the encoded text using the conditional sampler. We demonstrate our approach on indoor (NYUv2) and outdoor (KITTI) scenarios, where we show that language can consistently improve performance in both.

Via

Access Paper or Ask Questions

AugUndo: Scaling Up Augmentations for Unsupervised Depth Completion

Oct 15, 2023

Yangchao Wu, Tian Yu Liu, Hyoungseob Park, Stefano Soatto, Dong Lao, Alex Wong

Abstract:Unsupervised depth completion methods are trained by minimizing sparse depth and image reconstruction error. Block artifacts from resampling, intensity saturation, and occlusions are amongst the many undesirable by-products of common data augmentation schemes that affect image reconstruction quality, and thus the training signal. Hence, typical augmentations on images that are viewed as essential to training pipelines in other vision tasks have seen limited use beyond small image intensity changes and flipping. The sparse depth modality have seen even less as intensity transformations alter the scale of the 3D scene, and geometric transformations may decimate the sparse points during resampling. We propose a method that unlocks a wide range of previously-infeasible geometric augmentations for unsupervised depth completion. This is achieved by reversing, or "undo"-ing, geometric transformations to the coordinates of the output depth, warping the depth map back to the original reference frame. This enables computing the reconstruction losses using the original images and sparse depth maps, eliminating the pitfalls of naive loss computation on the augmented inputs. This simple yet effective strategy allows us to scale up augmentations to boost performance. We demonstrate our method on indoor (VOID) and outdoor (KITTI) datasets where we improve upon three existing methods by an average of 10.4\% across both datasets.

Via

Access Paper or Ask Questions

Sub-token ViT Embedding via Stochastic Resonance Transformers

Oct 06, 2023

Dong Lao, Yangchao Wu, Tian Yu Liu, Alex Wong, Stefano Soatto

Figure 1 for Sub-token ViT Embedding via Stochastic Resonance Transformers

Figure 2 for Sub-token ViT Embedding via Stochastic Resonance Transformers

Figure 3 for Sub-token ViT Embedding via Stochastic Resonance Transformers

Figure 4 for Sub-token ViT Embedding via Stochastic Resonance Transformers

Abstract:We discover the presence of quantization artifacts in Vision Transformers (ViTs), which arise due to the image tokenization step inherent in these architectures. These artifacts result in coarsely quantized features, which negatively impact performance, especially on downstream dense prediction tasks. We present a zero-shot method to improve how pre-trained ViTs handle spatial quantization. In particular, we propose to ensemble the features obtained from perturbing input images via sub-token spatial translations, inspired by Stochastic Resonance, a method traditionally applied to climate dynamics and signal processing. We term our method ``Stochastic Resonance Transformer" (SRT), which we show can effectively super-resolve features of pre-trained ViTs, capturing more of the local fine-grained structures that might otherwise be neglected as a result of tokenization. SRT can be applied at any layer, on any task, and does not require any fine-tuning. The advantage of the former is evident when applied to monocular depth prediction, where we show that ensembling model outputs are detrimental while applying SRT on intermediate ViT features outperforms the baseline models by an average of 4.7% and 14.9% on the RMSE and RMSE-log metrics across three different architectures. When applied to semi-supervised video object segmentation, SRT also improves over the baseline models uniformly across all metrics, and by an average of 2.4% in F&J score. We further show that these quantization artifacts can be attenuated to some extent via self-distillation. On the unsupervised salient region segmentation, SRT improves upon the base model by an average of 2.1% on the maxF metric. Finally, despite operating purely on pixel-level features, SRT generalizes to non-dense prediction tasks such as image retrieval and object discovery, yielding consistent improvements of up to 2.6% and 1.0% respectively.

Via

Access Paper or Ask Questions

Divided Attention: Unsupervised Multi-Object Discovery with Contextually Separated Slots

Apr 04, 2023

Dong Lao, Zhengyang Hu, Francesco Locatello, Yanchao Yang, Stefano Soatto

Figure 1 for Divided Attention: Unsupervised Multi-Object Discovery with Contextually Separated Slots

Figure 2 for Divided Attention: Unsupervised Multi-Object Discovery with Contextually Separated Slots

Figure 3 for Divided Attention: Unsupervised Multi-Object Discovery with Contextually Separated Slots

Figure 4 for Divided Attention: Unsupervised Multi-Object Discovery with Contextually Separated Slots

Abstract:We introduce a method to segment the visual field into independently moving regions, trained with no ground truth or supervision. It consists of an adversarial conditional encoder-decoder architecture based on Slot Attention, modified to use the image as context to decode optical flow without attempting to reconstruct the image itself. In the resulting multi-modal representation, one modality (flow) feeds the encoder to produce separate latent codes (slots), whereas the other modality (image) conditions the decoder to generate the first (flow) from the slots. This design frees the representation from having to encode complex nuisance variability in the image due to, for instance, illumination and reflectance properties of the scene. Since customary autoencoding based on minimizing the reconstruction error does not preclude the entire flow from being encoded into a single slot, we modify the loss to an adversarial criterion based on Contextual Information Separation. The resulting min-max optimization fosters the separation of objects and their assignment to different attention slots, leading to Divided Attention, or DivA. DivA outperforms recent unsupervised multi-object motion segmentation methods while tripling run-time speed up to 104FPS and reducing the performance gap from supervised methods to 12% or less. DivA can handle different numbers of objects and different image sizes at training and test time, is invariant to permutation of object labels, and does not require explicit regularization.

Via

Access Paper or Ask Questions

Does Monocular Depth Estimation Provide Better Pre-training than Classification for Semantic Segmentation?

Mar 29, 2022

Dong Lao, Alex Wong, Stefano Soatto

Figure 1 for Does Monocular Depth Estimation Provide Better Pre-training than Classification for Semantic Segmentation?

Figure 2 for Does Monocular Depth Estimation Provide Better Pre-training than Classification for Semantic Segmentation?

Figure 3 for Does Monocular Depth Estimation Provide Better Pre-training than Classification for Semantic Segmentation?

Figure 4 for Does Monocular Depth Estimation Provide Better Pre-training than Classification for Semantic Segmentation?

Abstract:Training a deep neural network for semantic segmentation is labor-intensive, so it is common to pre-train it for a different task, and then fine-tune it with a small annotated dataset. State-of-the-art methods use image classification for pre-training, which introduces uncontrolled biases. We test the hypothesis that depth estimation from unlabeled videos may provide better pre-training. Despite the absence of any semantic information, we argue that estimating scene geometry is closer to the task of semantic segmentation than classifying whole images into semantic classes. Since analytical validation is intractable, we test the hypothesis empirically by introducing a pre-training scheme that yields an improvement of 5.7% mIoU and 4.1% pixel accuracy over classification-based pre-training. While annotation is not needed for pre-training, it is needed for testing the hypothesis. We use the KITTI (outdoor) and NYU-V2 (indoor) benchmarks to that end, and provide an extensive discussion of the benefits and limitations of the proposed scheme in relation to existing unsupervised, self-supervised, and semi-supervised pre-training protocols.

Via

Access Paper or Ask Questions

Flow-Guided Video Inpainting with Scene Templates

Aug 29, 2021

Dong Lao, Peihao Zhu, Peter Wonka, Ganesh Sundaramoorthi

Figure 1 for Flow-Guided Video Inpainting with Scene Templates

Figure 2 for Flow-Guided Video Inpainting with Scene Templates

Figure 3 for Flow-Guided Video Inpainting with Scene Templates

Figure 4 for Flow-Guided Video Inpainting with Scene Templates

Abstract:We consider the problem of filling in missing spatio-temporal regions of a video. We provide a novel flow-based solution by introducing a generative model of images in relation to the scene (without missing regions) and mappings from the scene to images. We use the model to jointly infer the scene template, a 2D representation of the scene, and the mappings. This ensures consistency of the frame-to-frame flows generated to the underlying scene, reducing geometric distortions in flow based inpainting. The template is mapped to the missing regions in the video by a new L2-L1 interpolation scheme, creating crisp inpaintings and reducing common blur and distortion artifacts. We show on two benchmark datasets that our approach out-performs state-of-the-art quantitatively and in user studies.

Via

Access Paper or Ask Questions