Topic:photo

Illegal Waste Detection in Remote Sensing Images: A Case Study

Feb 10, 2025

Federico Gibellini, Piero Fraternali, Giacomo Boracchi, Luca Morandini, Andrea Diecidue, Simona Malegori

Abstract:Environmental crime currently represents the third largest criminal activity worldwide while threatening ecosystems as well as human health. Among the crimes related to this activity, improper waste management can nowadays be countered more easily thanks to the increasing availability and decreasing cost of Very-High-Resolution Remote Sensing images, which enable semi-automatic territory scanning in search of illegal landfills. This paper proposes a pipeline, developed in collaboration with professionals from a local environmental agency, for detecting candidate illegal dumping sites leveraging a classifier of Remote Sensing images. To identify the best configuration for such a classifier, an extensive set of experiments was conducted and the impact of diverse image characteristics and training settings was thoroughly analyzed. The local environmental agency was then involved in an experimental exercise where outputs from the developed classifier were integrated into the experts' everyday work, resulting in time savings with respect to manual photo-interpretation. The classifier was eventually run with valuable results on a location outside of the training area, highlighting potential for cross-border applicability of the proposed pipeline.

AstroLoc: Robust Space to Ground Image Localizer

Feb 10, 2025

Gabriele Berton, Alex Stoken, Carlo Masone

Abstract:Astronauts take thousands of photos of Earth per day from the International Space Station, which, once localized on Earth's surface, are used for a multitude of tasks, ranging from climate change research to disaster management. The localization process, which has been performed manually for decades, has recently been approached through image retrieval solutions: given an astronaut photo, find its most similar match among a large database of geo-tagged satellite images, in a task called Astronaut Photography Localization (APL). Yet, existing APL approaches are trained only using satellite images, without taking advantage of the millions of open-source astronaut photos. In this work we present the first APL pipeline capable of leveraging astronaut photos for training. We first produce full localization information for 300,000 manually weakly labeled astronaut photos through an automated pipeline, and then use these images to train a model, called AstroLoc. AstroLoc learns a robust representation of Earth's surface features through two losses: astronaut photos paired with their matching satellite counterparts in a pairwise loss, and a second loss on clusters of satellite imagery weighted by their relevance to astronaut photography via unsupervised mining. We find that AstroLoc achieves a staggering 35% average improvement in recall@1 over previous SOTA, pushing the limits of existing datasets with a recall@100 consistently over 99%. Finally, we note that AstroLoc, without any fine-tuning, provides excellent results for related tasks like the lost-in-space satellite problem and historical space imagery localization.
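
The recall@1 and recall@100 figures quoted in the abstract above are standard image-retrieval metrics. As a rough illustration only, the sketch below computes recall@k with cosine similarity over toy embeddings. The function name `recall_at_k`, the embedding arrays, and the `positives` sets are all hypothetical; this is not the AstroLoc evaluation code, and the paper may define a correct match differently.

```python
import numpy as np

def recall_at_k(query_emb, db_emb, positives, k):
    """Fraction of queries with at least one correct database item in the top k.

    query_emb: (Q, D) array of query embeddings (hypothetical astronaut photos).
    db_emb:    (N, D) array of database embeddings (hypothetical satellite tiles).
    positives: list of Q sets, each holding the database indices that count
               as correct matches for that query (an assumption of this sketch).
    """
    # L2-normalize so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = db_emb / np.linalg.norm(db_emb, axis=1, keepdims=True)
    sims = q @ d.T
    # Indices of the k most similar database items for each query.
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = [bool(set(row) & pos) for row, pos in zip(topk.tolist(), positives)]
    return float(np.mean(hits))

# Toy data: three database items and two queries.
db = np.eye(3)
queries = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.1]])
positives = [{0}, {2}]

print(recall_at_k(queries, db, positives, k=1))
print(recall_at_k(queries, db, positives, k=2))
```

In this toy setup the second query's correct item is only its second-nearest neighbor, so it counts as a miss at k=1 and a hit at k=2.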

A Comprehensive Survey on Image Signal Processing Approaches for Low-Illumination Image Enhancement

Feb 09, 2025

Muhammad Turab

Abstract:The usage of digital content (photos and videos) in a variety of applications has increased due to the popularity of multimedia devices. These uses include advertising campaigns, educational resources, and social networking platforms. There is an increasing need for high-quality graphic information as people become more visually focused. However, captured images frequently have poor visibility and a high amount of noise due to the limitations of image-capturing devices and lighting conditions. Improving the visual quality of images taken in low illumination is the aim of low-illumination image enhancement. This problem is addressed by traditional image enhancement techniques, which alter noise, brightness, and contrast. Deep learning-based methods, however, have dominated recent advances in this area. These methods have effectively reduced noise while preserving important information, showing promising results in the improvement of low-illumination images. An extensive summary of image signal processing methods for enhancing low-illumination images is provided in this paper. The review classifies approaches into three categories: hybrid techniques, deep learning-based methods, and traditional approaches. Conventional techniques include denoising, automated white balancing, and noise reduction. Convolutional neural networks (CNNs) are used in deep learning-based techniques to recognize and extract characteristics from low-light images. To get better results, hybrid approaches combine deep learning-based methodologies with more conventional methods. The review also discusses the advantages and limitations of each approach and provides insights into future research directions in this field.

Vision-in-the-loop Simulation for Deep Monocular Pose Estimation of UAV in Ocean Environment

Feb 08, 2025

Maneesha Wickramasuriya, Beomyeol Yu, Taeyoung Lee, Murray Snyder

Abstract:This paper proposes a vision-in-the-loop simulation environment for deep monocular pose estimation of a UAV operating in an ocean environment. Recently, a deep neural network with a transformer architecture has been successfully trained to estimate the pose of a UAV relative to the flight deck of a research vessel, overcoming several limitations of GPS-based approaches. However, validating the deep pose estimation scheme in an actual ocean environment poses significant challenges due to the limited availability of research vessels and the associated operational costs. To address these issues, we present a photo-realistic 3D virtual environment leveraging recent advancements in Gaussian splatting, a novel technique that represents 3D scenes by modeling image pixels as Gaussian distributions in 3D space, creating a lightweight and high-quality visual model from multiple viewpoints. This approach enables the creation of a virtual environment integrating multiple real-world images collected in situ. The resulting simulation enables the indoor testing of flight maneuvers while verifying all aspects of flight software, hardware, and the deep monocular pose estimation scheme. This approach provides a cost-effective solution for testing and validating the autonomous flight of shipboard UAVs, specifically focusing on vision-based control and estimation algorithms.

* 8 pages, 15 figures, conference

Fillerbuster: Multi-View Scene Completion for Casual Captures

Feb 07, 2025

Ethan Weber, Norman Müller, Yash Kant, Vasu Agrawal, Michael Zollhöfer, Angjoo Kanazawa, Christian Richardt

Abstract:We present Fillerbuster, a method that completes unknown regions of a 3D scene by utilizing a novel large-scale multi-view latent diffusion transformer. Casual captures are often sparse and miss surrounding content behind objects or above the scene. Existing methods are not suitable for handling this challenge as they focus on making the known pixels look good with sparse-view priors, or on creating the missing sides of objects from just one or two photos. In reality, we often have hundreds of input frames and want to complete areas that are missing and unobserved from the input frames. Additionally, the images often do not have known camera parameters. Our solution is to train a generative model that can consume a large context of input frames while generating unknown target views and recovering image poses when desired. We show results where we complete partial captures on two existing datasets. We also present an uncalibrated scene completion task where our unified model predicts both poses and creates new content. Our model is the first to predict many images and poses together for scene completion.

* Project page at https://ethanweber.me/fillerbuster/

Augmented Conditioning Is Enough For Effective Training Image Generation

Feb 06, 2025

Jiahui Chen, Amy Zhang, Adriana Romero-Soriano

Abstract:Image generation abilities of text-to-image diffusion models have significantly advanced, yielding highly photo-realistic images from descriptive text and increasing the viability of leveraging synthetic images to train computer vision models. To serve as effective training data, generated images must be highly realistic while also sufficiently diverse within the support of the target data distribution. Yet, state-of-the-art conditional image generation models have been primarily optimized for creative applications, prioritizing image realism and prompt adherence over conditional diversity. In this paper, we investigate how to improve the diversity of generated images with the goal of increasing their effectiveness to train downstream image classification models, without fine-tuning the image generation model. We find that conditioning the generation process on an augmented real image and text prompt produces generations that serve as effective synthetic datasets for downstream training. Conditioning on real training images contextualizes the generation process to produce images that are in-domain with the real image distribution, while data augmentations introduce visual diversity that improves the performance of the downstream classifier. We validate augmentation-conditioning on a total of five established long-tail and few-shot image classification benchmarks and show that leveraging augmentations to condition the generation process results in consistent improvements over the state-of-the-art on the long-tailed benchmark and remarkable gains in extreme few-shot regimes of the remaining four benchmarks. These results constitute an important step towards effectively leveraging synthetic data for downstream training.

Can Text-to-Image Generative Models Accurately Depict Age? A Comparative Study on Synthetic Portrait Generation and Age Estimation

Feb 05, 2025

Alexey A. Novikov, Miroslav Vranka, François David, Artem Voronin

Abstract:Text-to-image generative models have shown remarkable progress in producing diverse and photorealistic outputs. In this paper, we present a comprehensive analysis of their effectiveness in creating synthetic portraits that accurately represent various demographic attributes, with a special focus on age, nationality, and gender. Our evaluation employs prompts specifying detailed profiles (e.g., Photorealistic selfie photo of a 32-year-old Canadian male), covering a broad spectrum of 212 nationalities, 30 distinct ages from 10 to 78, and balanced gender representation. We compare the generated images against ground truth age estimates from two established age estimation models to assess how faithfully age is depicted. Our findings reveal that although text-to-image models can consistently generate faces reflecting different identities, the accuracy with which they capture specific ages and do so across diverse demographic backgrounds remains highly variable. These results suggest that current synthetic data may be insufficiently reliable for high-stakes age-related tasks requiring robust precision, unless practitioners are prepared to invest in significant filtering and curation. Nevertheless, they may still be useful in less sensitive or exploratory applications, where absolute age precision is not critical.
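
The abstract above describes prompts built from an age, a nationality, and a gender. A minimal sketch of how such a prompt grid could be enumerated follows. The template string matches the example quoted in the abstract, but the function `build_prompts` and the short attribute lists are hypothetical placeholders, not the study's actual 212 nationalities or 30 ages.

```python
from itertools import product

# Template taken from the example prompt quoted in the abstract.
TEMPLATE = "Photorealistic selfie photo of a {age}-year-old {nationality} {gender}"

def build_prompts(ages, nationalities, genders):
    """Return one prompt per (age, nationality, gender) combination."""
    return [
        TEMPLATE.format(age=a, nationality=n, gender=g)
        for a, n, g in product(ages, nationalities, genders)
    ]

# Hypothetical, heavily shortened attribute lists for illustration only.
prompts = build_prompts(
    ages=[10, 32, 78],
    nationalities=["Canadian", "Italian"],
    genders=["male", "female"],
)

print(len(prompts))
print(prompts[4])
```

Taking the full Cartesian product gives every attribute combination exactly one prompt, which is one simple way to obtain the balanced gender representation the abstract mentions.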

High-Fidelity Human Avatars from Laptop Webcams using Edge Compute

Feb 04, 2025

Akash Haridas, Imran N. Junejo

Abstract:Photo-realistic human avatars have many applications. High-fidelity avatar generation has traditionally required expensive professional camera rigs and artistic labor, but recent research has enabled constructing avatars automatically from smartphones with RGB and IR sensors. However, these new methods still rely on the presence of high-resolution cameras on modern smartphones and often require offloading the processing to powerful servers with GPUs. Modern applications such as video conferencing call for the ability to generate these avatars from consumer-grade laptop webcams using limited compute available on-device. In this work, we develop a novel method based on 3D morphable models, landmark detection, photo-realistic texture GANs, and differentiable rendering to tackle the problem of low webcam image quality and edge computation. We build an automatic system to generate high-fidelity animatable avatars under these limitations, leveraging the neural compute capabilities of mobile chips.

* 6 pages, 6 figures, 1 table

Towards Consistent and Controllable Image Synthesis for Face Editing

Feb 04, 2025

Mengting Wei, Tuomas Varanka, Yante Li, Xingxun Jiang, Huai-Qian Khor, Guoying Zhao

Abstract:Current face editing methods mainly rely on GAN-based techniques, but recent focus has shifted to diffusion-based models due to their success in image reconstruction. However, diffusion models still face challenges in manipulating fine-grained attributes and preserving consistency of attributes that should remain unchanged. To address these issues and facilitate more convenient editing of face images, we propose a novel approach that leverages the power of Stable-Diffusion models and crude 3D face models to control the lighting, facial expression and head pose of a portrait photo. We observe that this task essentially involves combinations of target background, identity and different face attributes. We aim to sufficiently disentangle the control of these factors to enable high-quality face editing. Specifically, our method, coined RigFace, contains: 1) A Spatial Attribute Encoder that provides precise and decoupled conditions of background, pose, expression and lighting; 2) An Identity Encoder that transfers identity features to the denoising UNet of a pre-trained Stable-Diffusion model; 3) An Attribute Rigger that injects those conditions into the denoising UNet. Our model achieves comparable or even superior performance in both identity preservation and photorealism compared to existing face editing models.

End-to-end Training for Text-to-Image Synthesis using Dual-Text Embeddings

Feb 03, 2025

Yeruru Asrar Ahmed, Anurag Mittal

Abstract:Text-to-Image (T2I) synthesis is a challenging task that requires modeling complex interactions between two modalities (i.e., text and image). A common framework adopted in recent state-of-the-art approaches to achieving such multimodal interactions is to bootstrap the learning process with pre-trained image-aligned text embeddings trained using contrastive loss. Furthermore, these embeddings are typically trained generically and reused across various synthesis models. In contrast, we explore an approach to learning text embeddings specifically tailored to the T2I synthesis network, trained in an end-to-end fashion. Further, we combine generative and contrastive training and use two embeddings, one optimized to enhance the photo-realism of the generated images, and the other seeking to capture text-to-image alignment. A comprehensive set of experiments on three text-to-image benchmark datasets (Oxford-102, Caltech-UCSD, and MS-COCO) reveals that having two separate embeddings gives better results than using a shared one and that such an approach performs favourably in comparison with methods that use text representations from a pre-trained text encoder trained using a discriminative approach. Finally, we demonstrate that such learned embeddings can be used in other contexts as well, such as text-to-image manipulation.
