Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

"Image": models, code, and papers

Data-efficient Event Camera Pre-training via Disentangled Masked Modeling

Mar 01, 2024
Zhenpeng Huang, Chao Li, Hao Chen, Yongjian Deng, Yifeng Geng, Limin Wang

Figure 1 for Data-efficient Event Camera Pre-training via Disentangled Masked Modeling

Figure 2 for Data-efficient Event Camera Pre-training via Disentangled Masked Modeling

Figure 3 for Data-efficient Event Camera Pre-training via Disentangled Masked Modeling

Figure 4 for Data-efficient Event Camera Pre-training via Disentangled Masked Modeling

In this paper, we present a new data-efficient voxel-based self-supervised learning method for event cameras. Our pre-training overcomes the limitations of previous methods, which either sacrifice temporal information by converting event sequences into 2D images for utilizing pre-trained image models or directly employ paired image data for knowledge distillation to enhance the learning of event streams. In order to make our pre-training data-efficient, we first design a semantic-uniform masking method to address the learning imbalance caused by the varying reconstruction difficulties of different regions in non-uniform data when using random masking. In addition, we ease the traditional hybrid masked modeling process by explicitly decomposing it into two branches, namely local spatio-temporal reconstruction and global semantic reconstruction to encourage the encoder to capture local correlations and global semantics, respectively. This decomposition allows our selfsupervised learning method to converge faster with minimal pre-training data. Compared to previous approaches, our self-supervised learning method does not rely on paired RGB images, yet enables simultaneous exploration of spatial and temporal cues in multiple scales. It exhibits excellent generalization performance and demonstrates significant improvements across various tasks with fewer parameters and lower computational costs.

Via

Access Paper or Ask Questions

Optimization of Array Encoding for Ultrasound Imaging

Mar 01, 2024
Jacob Spainhour, Korben Smart, Stephen Becker, Nick Bottenus

Figure 1 for Optimization of Array Encoding for Ultrasound Imaging

Figure 2 for Optimization of Array Encoding for Ultrasound Imaging

Figure 3 for Optimization of Array Encoding for Ultrasound Imaging

Figure 4 for Optimization of Array Encoding for Ultrasound Imaging

Objective: The transmit encoding model for synthetic aperture imaging is a robust and flexible framework for understanding the effect of acoustic transmission on ultrasound image reconstruction. Our objective is to use machine learning (ML) to construct scanning sequences, parameterized by time delays and apodization weights, that produce high quality B-mode images. Approach: We use an ML model in PyTorch and simulated RF data from Field II to probe the space of possible encoding sequences for those that minimize a loss function that describes image quality. This approach is made computationally feasible by a novel formulation of the derivative for delay-and-sum beamforming. We demonstrate these results experimentally on wire targets and a tissue-mimicking phantom. Main Results: When trained according to a given set of imaging parameters (imaging domain, hardware restrictions), our ML imaging model produces optimized encoding sequences that improve a number of standard quality metrics including resolution, field of view, and contrast, over conventional sequences. Significance: This work demonstrates that the set of encoding schemes that are commonly used represent only a narrow subset of those available. Additionally, it demonstrates the value for ML tasks in synthetic transmit aperture imaging to consider the beamformer within the model, instead of as purely post-processing.

Via

Access Paper or Ask Questions

RKHS-BA: A Semantic Correspondence-Free Multi-View Registration Framework with Global Tracking

Mar 02, 2024
Ray Zhang, Jingwei Song, Xiang Gao, Junzhe Wu, Tianyi Liu, Jinyuan Zhang, Ryan Eustice, Maani Ghaffari

Figure 1 for RKHS-BA: A Semantic Correspondence-Free Multi-View Registration Framework with Global Tracking

Figure 2 for RKHS-BA: A Semantic Correspondence-Free Multi-View Registration Framework with Global Tracking

Figure 3 for RKHS-BA: A Semantic Correspondence-Free Multi-View Registration Framework with Global Tracking

Figure 4 for RKHS-BA: A Semantic Correspondence-Free Multi-View Registration Framework with Global Tracking

This work reports a novel Bundle Adjustment (BA) formulation using a Reproducing Kernel Hilbert Space (RKHS) representation called RKHS-BA. The proposed formulation is correspondence-free, enables the BA to use RGB-D/LiDAR and semantic labels in the optimization directly, and provides a generalization for the photometric loss function commonly used in direct methods. RKHS-BA can incorporate appearance and semantic labels within a continuous spatial-semantic functional representation that does not require optimization via image pyramids. We demonstrate its applications in sliding-window odometry and global LiDAR mapping, which show highly robust performance in extremely challenging scenes and the best trade-off of generalization and accuracy.

* 16 pages, 12 figures, technical report under review

Via

Access Paper or Ask Questions

ToDo: Token Downsampling for Efficient Generation of High-Resolution Images

Feb 28, 2024
Ethan Smith, Nayan Saxena, Aninda Saha

Attention mechanism has been crucial for image diffusion models, however, their quadratic computational complexity limits the sizes of images we can process within reasonable time and memory constraints. This paper investigates the importance of dense attention in generative image models, which often contain redundant features, making them suitable for sparser attention mechanisms. We propose a novel training-free method ToDo that relies on token downsampling of key and value tokens to accelerate Stable Diffusion inference by up to 2x for common sizes and up to 4.5x or more for high resolutions like 2048x2048. We demonstrate that our approach outperforms previous methods in balancing efficient throughput and fidelity.

Via

Access Paper or Ask Questions

Physics-Informed Learning for Time-Resolved Angiographic Contrast Agent Concentration Reconstruction

Mar 04, 2024
Noah Maul, Annette Birkhold, Fabian Wagner, Mareike Thies, Maximilian Rohleder, Philipp Berg, Markus Kowarschik, Andreas Maier

Figure 1 for Physics-Informed Learning for Time-Resolved Angiographic Contrast Agent Concentration Reconstruction

Figure 2 for Physics-Informed Learning for Time-Resolved Angiographic Contrast Agent Concentration Reconstruction

Figure 3 for Physics-Informed Learning for Time-Resolved Angiographic Contrast Agent Concentration Reconstruction

Figure 4 for Physics-Informed Learning for Time-Resolved Angiographic Contrast Agent Concentration Reconstruction

Three-dimensional Digital Subtraction Angiography (3D-DSA) is a well-established X-ray-based technique for visualizing vascular anatomy. Recently, four-dimensional DSA (4D-DSA) reconstruction algorithms have been developed to enable the visualization of volumetric contrast flow dynamics through time-series of volumes. . This reconstruction problem is ill-posed mainly due to vessel overlap in the projection direction and geometric vessel foreshortening, which leads to information loss in the recorded projection images. However, knowledge about the underlying fluid dynamics can be leveraged to constrain the solution space. In our work, we implicitly include this information in a neural network-based model that is trained on a dataset of image-based blood flow simulations. The model predicts the spatially averaged contrast agent concentration for each centerline point of the vasculature over time, lowering the overall computational demand. The trained network enables the reconstruction of relative contrast agent concentrations with a mean absolute error of 0.02 $\pm$ 0.02 and a mean absolute percentage error of 5.31 % $\pm$ 9.25 %. Moreover, the network is robust to varying degrees of vessel overlap and vessel foreshortening. Our approach demonstrates the potential of the integration of machine learning and blood flow simulations in time-resolved angiographic flow reconstruction.

Via

Access Paper or Ask Questions

SIFT-Aided Rectified 2D-DIC for Displacement and Strain Measurements in Asphalt Concrete Testing

Feb 29, 2024
Zehui Zhu, Imad L. Al-Qadi

Two-dimensional digital image correlation (2D-DIC) is a widely used optical technique to measure displacement and strain during asphalt concrete (AC) testing. An accurate 2-D DIC measurement can only be achieved when the camera's principal axis is perpendicular to the planar specimen surface. However, this requirement may not be met during testing due to device constraints. This paper proposes a simple and reliable method to correct errors induced by non-perpendicularity. The method is based on image feature matching and rectification. No additional equipment is needed. A theoretical error analysis was conducted to quantify the effect of a non-perpendicular camera alignment on measurement accuracy. The proposed method was validated numerically using synthetic images and experimentally in an AC fracture test. It achieved relatively high accuracy, even under considerable camera rotation angle and large deformation. As a pre-processing technique, the proposed method showed promising performance in assisting the recently developed CrackPropNet for automated crack propagation measurement under a non-perpendicular camera alignment.

* Journal of Transportation Engineering, Part B: Pavements

Via

Access Paper or Ask Questions

Radio-astronomical Image Reconstruction with Conditional Denoising Diffusion Model

Feb 15, 2024
Mariia Drozdova, Vitaliy Kinakh, Omkar Bait, Olga Taran, Erica Lastufka, Miroslava Dessauges-Zavadsky, Taras Holotyak, Daniel Schaerer, Slava Voloshynovskiy

Reconstructing sky models from dirty radio images for accurate source localization and flux estimation is crucial for studying galaxy evolution at high redshift, especially in deep fields using instruments like the Atacama Large Millimetre Array (ALMA). With new projects like the Square Kilometre Array (SKA), there's a growing need for better source extraction methods. Current techniques, such as CLEAN and PyBDSF, often fail to detect faint sources, highlighting the need for more accurate methods. This study proposes using stochastic neural networks to rebuild sky models directly from dirty images. This method can pinpoint radio sources and measure their fluxes with related uncertainties, marking a potential improvement in radio source characterization. We tested this approach on 10164 images simulated with the CASA tool simalma, based on ALMA's Cycle 5.3 antenna setup. We applied conditional Denoising Diffusion Probabilistic Models (DDPMs) for sky models reconstruction, then used Photutils to determine source coordinates and fluxes, assessing the model's performance across different water vapor levels. Our method showed excellence in source localization, achieving more than 90% completeness at a signal-to-noise ratio (SNR) as low as 2. It also surpassed PyBDSF in flux estimation, accurately identifying fluxes for 96% of sources in the test set, a significant improvement over CLEAN+ PyBDSF's 57%. Conditional DDPMs is a powerful tool for image-to-image translation, yielding accurate and robust characterisation of radio sources, and outperforming existing methodologies. While this study underscores its significant potential for applications in radio astronomy, we also acknowledge certain limitations that accompany its usage, suggesting directions for further refinement and research.

* In production in Astronomy&Astrophyics

Via

Access Paper or Ask Questions

MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies

Mar 03, 2024
Zhende Song, Chenchen Wang, Jiamu Sheng, Chi Zhang, Gang Yu, Jiayuan Fan, Tao Chen

Figure 1 for MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies

Figure 2 for MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies

Figure 3 for MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies

Figure 4 for MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies

The development of multimodal models has marked a significant step forward in how machines understand videos. These models have shown promise in analyzing short video clips. However, when it comes to longer formats like movies, they often fall short. The main hurdles are the lack of high-quality, diverse video data and the intensive work required to collect or annotate such data. In the face of these challenges, we propose MovieLLM, a novel framework designed to create synthetic, high-quality data for long videos. This framework leverages the power of GPT-4 and text-to-image models to generate detailed scripts and corresponding visuals. Our approach stands out for its flexibility and scalability, making it a superior alternative to traditional data collection methods. Our extensive experiments validate that the data produced by MovieLLM significantly improves the performance of multimodal models in understanding complex video narratives, overcoming the limitations of existing datasets regarding scarcity and bias.

Via

Access Paper or Ask Questions

What Is Missing in Multilingual Visual Reasoning and How to Fix It

Mar 03, 2024
Yueqi Song, Simran Khanuja, Graham Neubig

Figure 1 for What Is Missing in Multilingual Visual Reasoning and How to Fix It

Figure 2 for What Is Missing in Multilingual Visual Reasoning and How to Fix It

Figure 3 for What Is Missing in Multilingual Visual Reasoning and How to Fix It

Figure 4 for What Is Missing in Multilingual Visual Reasoning and How to Fix It

NLP models today strive for supporting multiple languages and modalities, improving accessibility for diverse users. In this paper, we evaluate their multilingual, multimodal capabilities by testing on a visual reasoning task. We observe that proprietary systems like GPT-4V obtain the best performance on this task now, but open models lag in comparison. Surprisingly, GPT-4V exhibits similar performance between English and other languages, indicating the potential for equitable system development across languages. Our analysis on model failures reveals three key aspects that make this task challenging: multilinguality, complex reasoning, and multimodality. To address these challenges, we propose three targeted interventions including a translate-test approach to tackle multilinguality, a visual programming approach to break down complex reasoning, and a novel method that leverages image captioning to address multimodality. Our interventions achieve the best open performance on this task in a zero-shot setting, boosting open model LLaVA by 13.4%, while also minorly improving GPT-4V's performance.

Via

Access Paper or Ask Questions

LUM-ViT: Learnable Under-sampling Mask Vision Transformer for Bandwidth Limited Optical Signal Acquisition

Mar 03, 2024
Lingfeng Liu, Dong Ni, Hangjie Yuan

Figure 1 for LUM-ViT: Learnable Under-sampling Mask Vision Transformer for Bandwidth Limited Optical Signal Acquisition

Figure 2 for LUM-ViT: Learnable Under-sampling Mask Vision Transformer for Bandwidth Limited Optical Signal Acquisition

Figure 3 for LUM-ViT: Learnable Under-sampling Mask Vision Transformer for Bandwidth Limited Optical Signal Acquisition

Figure 4 for LUM-ViT: Learnable Under-sampling Mask Vision Transformer for Bandwidth Limited Optical Signal Acquisition

Bandwidth constraints during signal acquisition frequently impede real-time detection applications. Hyperspectral data is a notable example, whose vast volume compromises real-time hyperspectral detection. To tackle this hurdle, we introduce a novel approach leveraging pre-acquisition modulation to reduce the acquisition volume. This modulation process is governed by a deep learning model, utilizing prior information. Central to our approach is LUM-ViT, a Vision Transformer variant. Uniquely, LUM-ViT incorporates a learnable under-sampling mask tailored for pre-acquisition modulation. To further optimize for optical calculations, we propose a kernel-level weight binarization technique and a three-stage fine-tuning strategy. Our evaluations reveal that, by sampling a mere 10% of the original image pixels, LUM-ViT maintains the accuracy loss within 1.8% on the ImageNet classification task. The method sustains near-original accuracy when implemented on real-world optical hardware, demonstrating its practicality. Code will be available at https://github.com/MaxLLF/LUM-ViT.

* Accepted to ICLR 2024

Via

Access Paper or Ask Questions