Topic:photo

ParaGAN: A Scalable Distributed Training Framework for Generative Adversarial Networks

Nov 06, 2024

Ziji Shi, Jialin Li, Yang You

Abstract:Recent advances in Generative Artificial Intelligence have fueled numerous applications, particularly those involving Generative Adversarial Networks (GANs), which are essential for synthesizing realistic photos and videos. However, efficiently training GANs remains a critical challenge due to their computationally intensive and numerically unstable nature. Existing methods often require days or even weeks for training, posing significant resource and time constraints. In this work, we introduce ParaGAN, a scalable distributed GAN training framework that leverages asynchronous training and an asymmetric optimization policy to accelerate GAN training. ParaGAN employs a congestion-aware data pipeline and hardware-aware layout transformation to enhance accelerator utilization, resulting in over 30% improvements in throughput. With ParaGAN, we reduce the training time of BigGAN from 15 days to 14 hours while achieving 91% scaling efficiency. Additionally, ParaGAN enables unprecedented high-resolution image generation using BigGAN.

* Accepted at ACM Symposium on Cloud Computing (SoCC) 2024
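
The training-time figures in the abstract above can be turned into a speedup with simple arithmetic. The sketch below is only illustrative: the abstract does not say how many accelerators the 15-day baseline or the 14-hour run used, so `scaling_efficiency` takes a hypothetical `resource_ratio` argument, and no attempt is made to reproduce the reported 91% figure.

```python
def speedup(baseline_hours: float, accelerated_hours: float) -> float:
    """Ratio of baseline wall-clock time to accelerated wall-clock time."""
    return baseline_hours / accelerated_hours


def scaling_efficiency(observed_speedup: float, resource_ratio: float) -> float:
    """Observed speedup divided by the ideal (linear) speedup.

    resource_ratio is hypothetical here: how many times more accelerators
    the fast run used than the baseline. The abstract does not state it.
    """
    return observed_speedup / resource_ratio


baseline_hours = 15 * 24   # 15 days, as stated in the abstract
paragan_hours = 14         # 14 hours, as stated in the abstract

s = speedup(baseline_hours, paragan_hours)
print(f"speedup: {s:.1f}x")
```

Scaling efficiency is 1.0 when doubling the hardware exactly halves the training time; values below 1.0 reflect communication and input-pipeline overheads.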

LVI-GS: Tightly-coupled LiDAR-Visual-Inertial SLAM using 3D Gaussian Splatting

Nov 05, 2024

Huibin Zhao, Weipeng Guan, Peng Lu

Abstract:3D Gaussian Splatting (3DGS) has shown its ability in rapid rendering and high-fidelity mapping. In this paper, we introduce LVI-GS, a tightly-coupled LiDAR-Visual-Inertial mapping framework with 3DGS, which leverages the complementary characteristics of LiDAR and image sensors to capture both geometric structures and visual details of 3D scenes. To this end, the 3D Gaussians are initialized from colourized LiDAR points and optimized using differentiable rendering. In order to achieve high-fidelity mapping, we introduce a pyramid-based training approach to effectively learn multi-level features and incorporate depth loss derived from LiDAR measurements to improve geometric feature perception. Through well-designed strategies for Gaussian-Map expansion, keyframe selection, thread management, and custom CUDA acceleration, our framework achieves real-time photo-realistic mapping. Numerical experiments are performed to evaluate the superior performance of our method compared to state-of-the-art 3D reconstruction systems.

Membership Inference Attacks against Large Vision-Language Models

Nov 05, 2024

Zhan Li, Yongtao Wu, Yihang Chen, Francesco Tonin, Elias Abad Rocamora, Volkan Cevher

Abstract:Large vision-language models (VLLMs) exhibit promising capabilities for processing multi-modal tasks across various application scenarios. However, their emergence also raises significant data security concerns, given the potential inclusion of sensitive information, such as private photos and medical records, in their training datasets. Detecting inappropriately used data in VLLMs remains a critical and unresolved issue, mainly due to the lack of standardized datasets and suitable methodologies. In this study, we introduce the first membership inference attack (MIA) benchmark tailored for various VLLMs to facilitate training data detection. Then, we propose a novel MIA pipeline specifically designed for token-level image detection. Lastly, we present a new metric called MaxRényi-K%, which is based on the confidence of the model output and applies to both text and image data. We believe that our work can deepen the understanding and methodology of MIAs in the context of VLLMs. Our code and datasets are available at https://github.com/LIONS-EPFL/VL-MIA.

* NeurIPS 2024
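
The abstract above says only that the MaxRényi-K% metric is based on the confidence of the model output. The sketch below shows one plausible reading, stated here as an assumption rather than the authors' definition: compute the Rényi entropy of the predicted distribution at each token position, keep the largest K% of those values, and average them. The function names, the default `alpha`, and the default `k_percent` are all hypothetical.

```python
import math


def renyi_entropy(probs, alpha):
    """Rényi entropy of order alpha for a discrete distribution."""
    if alpha == 1:
        # Limit alpha -> 1 is the Shannon entropy.
        return -sum(p * math.log(p) for p in probs if p > 0)
    if math.isinf(alpha):
        # Limit alpha -> infinity is the min-entropy.
        return -math.log(max(probs))
    return math.log(sum(p ** alpha for p in probs if p > 0)) / (1 - alpha)


def max_renyi_k(token_dists, alpha=0.5, k_percent=10):
    """Mean of the largest k_percent of per-position Rényi entropies.

    Assumed aggregation rule; consult the paper for the exact definition.
    """
    entropies = sorted(
        (renyi_entropy(d, alpha) for d in token_dists), reverse=True
    )
    n = max(1, math.ceil(len(entropies) * k_percent / 100))
    return sum(entropies[:n]) / n


# Two toy positions: one maximally uncertain, one fully confident.
uniform = [0.25, 0.25, 0.25, 0.25]
one_hot = [1.0, 0.0, 0.0, 0.0]
score = max_renyi_k([uniform, one_hot], alpha=2, k_percent=50)
```

A membership inference attack would then threshold such a score; which direction indicates membership is not stated in the abstract and is left open here.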

Toward Integrating Semantic-aware Path Planning and Reliable Localization for UAV Operations

Nov 04, 2024

Thanh Nguyen Canh, Huy-Hoang Ngo, Xiem HoangVan, Nak Young Chong

Abstract:Localization is one of the most crucial tasks for Unmanned Aerial Vehicle systems (UAVs) directly impacting overall performance, which can be achieved with various sensors and applied to numerous tasks related to search and rescue operations, object tracking, construction, etc. However, due to the negative effects of challenging environments, UAVs may lose signals for localization. In this paper, we present an effective path-planning system leveraging semantic segmentation information to navigate around texture-less and problematic areas like lakes, oceans, and high-rise buildings using a monocular camera. We introduce a real-time semantic segmentation architecture and a novel keyframe decision pipeline to optimize image inputs based on pixel distribution, reducing processing time. A hierarchical planner based on the Dynamic Window Approach (DWA) algorithm, integrated with a cost map, is designed to facilitate efficient path planning. The system is implemented in a photo-realistic simulation environment using Unity, aligning with segmentation model parameters. Comprehensive qualitative and quantitative evaluations validate the effectiveness of our approach, showing significant improvements in the reliability and efficiency of UAV localization in challenging environments.

* In The 24th International Conference on Control, Automation, and Systems (ICCAS 2024), Jeju, Korea

Detect an Object At Once without Fine-tuning

Nov 04, 2024

Junyu Hao, Jianheng Liu, Yongjia Zhao, Zuofan Chen, Qi Sun, Jinlong Chen, Jianguo Wei, Minghao Yang

Abstract:When presented with one or a few photos of a previously unseen object, humans can instantly recognize it in different scenes. Although the human brain mechanism behind this phenomenon is still not fully understood, this work introduces a novel technical realization of this task. It consists of two phases: (1) generating a Similarity Density Map (SDM) by convolving the scene image with the given object image patch(es) so that the highlighted areas in the SDM indicate the possible locations; (2) obtaining the object-occupied areas in the scene through a Region Alignment Network (RAN). The RAN is built on a Deep Siamese Network (DSN) backbone and, unlike traditional DSNs, aims to obtain accurate object regions by regressing the location and area differences between the ground truths and the predictions indicated by the highlighted areas in the SDM. By pre-learning from labels annotated in traditional datasets, the SDM-RAN can detect previously unknown objects without fine-tuning. Experiments were conducted on the MS COCO and PASCAL VOC datasets. The results indicate that the proposed method outperforms state-of-the-art methods on the same task.
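
Phase (1) of the abstract above, convolving the scene with an object patch to get a map whose peaks mark candidate locations, can be sketched as a plain sliding-window cross-correlation. This is a simplified stand-in, not the authors' implementation: it works on raw grayscale intensities in nested lists, whereas the paper may operate on learned features, and the names `similarity_density_map` and `peak` are hypothetical.

```python
def similarity_density_map(scene, patch):
    """Slide the patch over the scene and record the dot product at each offset."""
    scene_h, scene_w = len(scene), len(scene[0])
    patch_h, patch_w = len(patch), len(patch[0])
    sdm = []
    for i in range(scene_h - patch_h + 1):
        row = []
        for j in range(scene_w - patch_w + 1):
            score = sum(
                scene[i + a][j + b] * patch[a][b]
                for a in range(patch_h)
                for b in range(patch_w)
            )
            row.append(score)
        sdm.append(row)
    return sdm


def peak(sdm):
    """Return the (row, col) offset with the highest similarity score."""
    return max(
        (value, (i, j)) for i, row in enumerate(sdm) for j, value in enumerate(row)
    )[1]


# Toy scene: the 2x2 patch is embedded at row 1, column 2.
patch = [[1, 2],
         [3, 4]]
scene = [[0, 0, 0, 0],
         [0, 0, 1, 2],
         [0, 0, 3, 4],
         [0, 0, 0, 0]]

sdm = similarity_density_map(scene, patch)
location = peak(sdm)
```

An unnormalized dot product favors bright regions, so a practical version would normalize each window (for example with zero-mean normalized cross-correlation) before taking the peak.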

Stereo-Talker: Audio-driven 3D Human Synthesis with Prior-Guided Mixture-of-Experts

Oct 31, 2024

Xiang Deng, Youxin Pang, Xiaochen Zhao, Chao Xu, Lizhen Wang, Hongjiang Xiao, Shi Yan, Hongwen Zhang, Yebin Liu

Abstract:This paper introduces Stereo-Talker, a novel one-shot audio-driven human video synthesis system that generates 3D talking videos with precise lip synchronization, expressive body gestures, temporally consistent photo-realistic quality, and continuous viewpoint control. The process follows a two-stage approach. In the first stage, the system maps audio input to high-fidelity motion sequences, encompassing upper-body gestures and facial expressions. To enrich motion diversity and authenticity, large language model (LLM) priors are integrated with text-aligned semantic audio features, leveraging LLMs' cross-modal generalization power to enhance motion quality. In the second stage, we improve diffusion-based video generation models by incorporating a prior-guided Mixture-of-Experts (MoE) mechanism: a view-guided MoE focuses on view-specific attributes, while a mask-guided MoE enhances region-based rendering stability. Additionally, a mask prediction module is devised to derive human masks from motion data, enhancing the stability and accuracy of masks and enabling mask guiding during inference. We also introduce a comprehensive human video dataset with 2,203 identities, covering diverse body gestures and detailed annotations, facilitating broad generalization. The code, data, and pre-trained models will be released for research purposes.

Practical and Accurate Reconstruction of an Illuminant's Spectral Power Distribution for Inverse Rendering Pipelines

Oct 30, 2024

Parisha Joshi, Daljit Singh J. Dhillon

Abstract:Inverse rendering pipelines are gaining prominence in realizing photo-realistic reconstruction of real-world objects for emulating them in virtual reality scenes. Apart from material reflectances, spectral rendering and in-scene illuminants' spectral power distributions (SPDs) play important roles in producing photo-realistic images. We present a simple, low-cost technique to capture and reconstruct the SPD of uniform illuminants. Instead of requiring a costly spectrometer for such measurements, our method uses a diffractive compact disk (CD-ROM) and a machine learning approach for accurate estimation. We show that our method works well with spotlights in simulations and in a few real-world examples. The presented results demonstrate the reliability of our approach through quantitative and qualitative evaluations, especially in spectral rendering of iridescent materials.

* 3 pages, 3 Figures, Submitted as a Tiny Paper at ICVGIP'24, Bangalore, India

Emotion-Guided Image to Music Generation

Oct 29, 2024

Souraja Kundu, Saket Singh, Yuji Iwahori

Abstract:Generating music from images can enhance various applications, including background music for photo slideshows, social media experiences, and video creation. This paper presents an emotion-guided image-to-music generation framework that leverages the Valence-Arousal (VA) emotional space to produce music that aligns with the emotional tone of a given image. Unlike previous models that rely on contrastive learning for emotional consistency, the proposed approach directly integrates a VA loss function to enable accurate emotional alignment. The model employs a CNN-Transformer architecture, featuring pre-trained CNN image feature extractors and three Transformer encoders to capture complex, high-level emotional features from MIDI music. Three Transformer decoders refine these features to generate musically and emotionally consistent MIDI sequences. Experimental results on a newly curated emotionally paired image-MIDI dataset demonstrate the proposed model's superior performance across metrics such as Polyphony Rate, Pitch Entropy, Groove Consistency, and loss convergence.

* 2024 6th Asian Digital Image Processing Conference

Murine AI excels at cats and cheese: Structural differences between human and mouse neurons and their implementation in generative AIs

Oct 28, 2024

Rino Saiga, Kaede Shiga, Yo Maruta, Chie Inomoto, Hiroshi Kajiwara, Naoya Nakamura, Yu Kakimoto, Yoshiro Yamamoto, Masahiro Yasutake, Masayuki Uesugi (+12 more)

Abstract:Mouse and human brains have different functions that depend on their neuronal networks. In this study, we analyzed nanometer-scale three-dimensional structures of brain tissues of the mouse medial prefrontal cortex and compared them with structures of the human anterior cingulate cortex. The obtained results indicated that mouse neuronal somata are smaller and neurites are thinner than those of human neurons. These structural features allow mouse neurons to be integrated in the limited space of the brain, though thin neurites should suppress distal connections according to cable theory. We implemented this mouse-mimetic constraint in convolutional layers of a generative adversarial network (GAN) and a denoising diffusion implicit model (DDIM), which were then subjected to image generation tasks using photo datasets of cat faces, cheese, human faces, and birds. The mouse-mimetic GAN outperformed a standard GAN in the image generation task using the cat faces and cheese photo datasets, but underperformed for human faces and birds. The mouse-mimetic DDIM gave similar results, suggesting that the nature of the datasets affected the results. Analyses of the four datasets indicated differences in their image entropy, which should influence the number of parameters required for image generation. The preferences of the mouse-mimetic AIs coincided with the impressions commonly associated with mice. The relationship between the neuronal network and brain function should be investigated by implementing other biological findings in artificial neural networks.

* 41 pages, 4 figures
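
The abstract above attributes the dataset-dependent results to differences in image entropy but does not define the measure. A common definition, assumed here for illustration only, is the Shannon entropy of the pixel-intensity histogram, in bits per pixel. The function name `image_entropy` is hypothetical.

```python
import math
from collections import Counter


def image_entropy(pixels):
    """Shannon entropy (bits) of the intensity histogram of a flat pixel list."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum(
        (count / total) * math.log2(count / total) for count in counts.values()
    )


flat_image = [128, 128, 128, 128]   # a single intensity: no uncertainty
two_tone = [0, 0, 255, 255]         # two equally likely intensities
four_tone = [0, 85, 170, 255]       # four equally likely intensities

entropies = [image_entropy(img) for img in (flat_image, two_tone, four_tone)]
```

Under this definition, images with more varied intensities have higher entropy, which is consistent with the abstract's suggestion that entropy influences how many parameters a generator needs.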

Enhancing Community Vision Screening -- AI Driven Retinal Photography for Early Disease Detection and Patient Trust

Oct 27, 2024

Xiaofeng Lei, Yih-Chung Tham, Jocelyn Hui Lin Goh, Yangqin Feng, Yang Bai, Zhi Da Soh, Rick Siow Mong Goh, Xinxing Xu, Yong Liu, Ching-Yu Cheng

Abstract:Community vision screening plays a crucial role in identifying individuals with vision loss and preventing avoidable blindness, particularly in rural communities where access to eye care services is limited. Currently, there is a pressing need for a simple and efficient process to screen and refer individuals with significant eye disease-related vision loss to tertiary eye care centers for further care. An ideal solution should seamlessly and readily integrate with existing workflows, providing comprehensive initial screening results to service providers, thereby enabling precise patient referrals for timely treatment. This paper introduces the Enhancing Community Vision Screening (ECVS) solution, which addresses the aforementioned concerns with a novel and feasible solution based on simple, non-invasive retinal photography for the detection of pathology-based visual impairment. Our study employs four distinct deep learning models: RETinal photo Quality Assessment (RETQA), Pathology Visual Impairment detection (PVI), Eye Disease Diagnosis (EDD) and Visualization of Lesion Regions of the eye (VLR). We conducted experiments on over 10 datasets, totaling more than 80,000 fundus photos collected from various sources. The models integrated into ECVS achieved impressive AUC scores of 0.98 for RETQA, 0.95 for PVI, and 0.90 for EDD, along with a DICE coefficient of 0.48 for VLR. These results underscore the promising capabilities of ECVS as a straightforward and scalable method for community-based vision screening.

* 11 pages, 4 figures, published in MICCAI2024 OMIA XI workshop
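
The DICE coefficient reported for VLR in the abstract above measures overlap between a predicted region and a reference region: twice the size of the intersection divided by the sum of the two region sizes. The sketch below computes it for binary masks stored as nested lists; the function name and the convention of returning 1.0 for two empty masks are assumptions, not details taken from the paper.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice overlap between two binary masks of the same shape."""
    pred = {
        (i, j) for i, row in enumerate(pred_mask) for j, v in enumerate(row) if v
    }
    true = {
        (i, j) for i, row in enumerate(true_mask) for j, v in enumerate(row) if v
    }
    denominator = len(pred) + len(true)
    if denominator == 0:
        # Both masks empty: treated as perfect agreement (assumed convention).
        return 1.0
    return 2 * len(pred & true) / denominator


predicted = [[1, 1, 0],
             [0, 0, 0]]
reference = [[0, 1, 1],
             [0, 0, 0]]

overlap = dice_coefficient(predicted, reference)
```

A value of 1.0 means the two regions coincide exactly and 0.0 means they do not overlap at all, so the reported 0.48 for lesion visualization indicates partial overlap with the reference regions.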
