Computer Vision & Graphics, Vision & Imaging Technologies, Fraunhofer Heinrich-Hertz-Institute HHI, Berlin, Visual Computing, Humboldt-University, Berlin
Abstract:Recently, 3D GANs based on 3D Gaussian splatting have been proposed for high quality synthesis of human heads. However, existing methods stabilize training and enhance rendering quality from steep viewpoints by conditioning the random latent vector on the current camera position. This compromises 3D consistency, as we observe significant identity changes when re-synthesizing the 3D head with each camera shift. Conversely, fixing the camera to a single viewpoint yields high-quality renderings for that perspective but results in poor performance for novel views. Removing view-conditioning typically destabilizes GAN training, often causing the training to collapse. In response to these challenges, we introduce CGS-GAN, a novel 3D Gaussian Splatting GAN framework that enables stable training and high-quality 3D-consistent synthesis of human heads without relying on view-conditioning. To ensure training stability, we introduce a multi-view regularization technique that enhances generator convergence with minimal computational overhead. Additionally, we adapt the conditional loss used in existing 3D Gaussian splatting GANs and propose a generator architecture designed to not only stabilize training but also facilitate efficient rendering and straightforward scaling, enabling output resolutions up to $2048^2$. To evaluate the capabilities of CGS-GAN, we curate a new dataset derived from FFHQ. This dataset enables very high resolutions, focuses on larger portions of the human head, reduces view-dependent artifacts for improved 3D consistency, and excludes images where subjects are obscured by hands or other objects. As a result, our approach achieves very high rendering quality, supported by competitive FID scores, while ensuring consistent 3D scene generation. Check our our project page here: https://fraunhoferhhi.github.io/cgs-gan/
Abstract:Deep learning has transformed computer vision for precision agriculture, yet apple orchard monitoring remains limited by dataset constraints. The lack of diverse, realistic datasets and the difficulty of annotating dense, heterogeneous scenes. Existing datasets overlook different growth stages and stereo imagery, both essential for realistic 3D modeling of orchards and tasks like fruit localization, yield estimation, and structural analysis. To address these gaps, we present AppleGrowthVision, a large-scale dataset comprising two subsets. The first includes 9,317 high resolution stereo images collected from a farm in Brandenburg (Germany), covering six agriculturally validated growth stages over a full growth cycle. The second subset consists of 1,125 densely annotated images from the same farm in Brandenburg and one in Pillnitz (Germany), containing a total of 31,084 apple labels. AppleGrowthVision provides stereo-image data with agriculturally validated growth stages, enabling precise phenological analysis and 3D reconstructions. Extending MinneApple with our data improves YOLOv8 performance by 7.69 % in terms of F1-score, while adding it to MinneApple and MAD boosts Faster R-CNN F1-score by 31.06 %. Additionally, six BBCH stages were predicted with over 95 % accuracy using VGG16, ResNet152, DenseNet201, and MobileNetv2. AppleGrowthVision bridges the gap between agricultural science and computer vision, by enabling the development of robust models for fruit detection, growth modeling, and 3D analysis in precision agriculture. Future work includes improving annotation, enhancing 3D reconstruction, and extending multimodal analysis across all growth stages.
Abstract:Accurate assessment of tissue perfusion is crucial in visceral surgery, especially during anastomosis. Currently, subjective visual judgment is commonly employed in clinical settings. Hyperspectral imaging (HSI) offers a non-invasive, quantitative alternative. However, HSI imaging lacks continuous integration into the clinical workflow. This study presents a hyperspectral light field system for intraoperative tissue oxygen saturation (SO2) analysis and visualization. We present a correlation method for determining SO2 saturation with low computational demands. We demonstrate clinical application, with our results aligning with the perfusion boundaries determined by the surgeon. We perform and compare continuous perfusion analysis using two hyperspectral cameras (Cubert S5, Cubert X20), achieving processing times of < 170 ms and < 400 ms, respectively. We discuss camera characteristics, system parameters, and the suitability for clinical use and real-time applications.
Abstract:Cross-modal data registration has long been a critical task in computer vision, with extensive applications in autonomous driving and robotics. Accurate and robust registration methods are essential for aligning data from different modalities, forming the foundation for multimodal sensor data fusion and enhancing perception systems' accuracy and reliability. The registration task between 2D images captured by cameras and 3D point clouds captured by Light Detection and Ranging (LiDAR) sensors is usually treated as a visual pose estimation problem. High-dimensional feature similarities from different modalities are leveraged to identify pixel-point correspondences, followed by pose estimation techniques using least squares methods. However, existing approaches often resort to downsampling the original point cloud and image data due to computational constraints, inevitably leading to a loss in precision. Additionally, high-dimensional features extracted using different feature extractors from various modalities require specific techniques to mitigate cross-modal differences for effective matching. To address these challenges, we propose a method that uses edge information from the original point clouds and images for cross-modal registration. We retain crucial information from the original data by extracting edge points and pixels, enhancing registration accuracy while maintaining computational efficiency. The use of edge points and edge pixels allows us to introduce an attention-based feature exchange block to eliminate cross-modal disparities. Furthermore, we incorporate an optimal matching layer to improve correspondence identification. We validate the accuracy of our method on the KITTI and nuScenes datasets, demonstrating its state-of-the-art performance.
Abstract:Open-vocabulary segmentation, powered by large visual-language models like CLIP, has expanded 2D segmentation capabilities beyond fixed classes predefined by the dataset, enabling zero-shot understanding across diverse scenes. Extending these capabilities to 3D segmentation introduces challenges, as CLIP's image-based embeddings often lack the geometric detail necessary for 3D scene segmentation. Recent methods tend to address this by introducing additional segmentation models or replacing CLIP with variations trained on segmentation data, which lead to redundancy or loss on CLIP's general language capabilities. To overcome this limitation, we introduce SPNeRF, a NeRF based zero-shot 3D segmentation approach that leverages geometric priors. We integrate geometric primitives derived from the 3D scene into NeRF training to produce primitive-wise CLIP features, avoiding the ambiguity of point-wise features. Additionally, we propose a primitive-based merging mechanism enhanced with affinity scores. Without relying on additional segmentation models, our method further explores CLIP's capability for 3D segmentation and achieves notable improvements over original LERF.
Abstract:3D Gaussian Splatting (3DGS) has become one of the most influential works in the past year. Due to its efficient and high-quality novel view synthesis capabilities, it has been widely adopted in many research fields and applications. Nevertheless, 3DGS still faces challenges to properly manage the number of Gaussian primitives that are used during scene reconstruction. Following the adaptive density control (ADC) mechanism of 3D Gaussian Splatting, new Gaussians in under-reconstructed regions are created, while Gaussians that do not contribute to the rendering quality are pruned. We observe that those criteria for densifying and pruning Gaussians can sometimes lead to worse rendering by introducing artifacts. We especially observe under-reconstructed background or overfitted foreground regions. To encounter both problems, we propose three new improvements to the adaptive density control mechanism. Those include a correction for the scene extent calculation that does not only rely on camera positions, an exponentially ascending gradient threshold to improve training convergence, and significance-aware pruning strategy to avoid background artifacts. With these adaptions, we show that the rendering quality improves while using the same number of Gaussians primitives. Furthermore, with our improvements, the training converges considerably faster, allowing for more than twice as fast training times while yielding better quality than 3DGS. Finally, our contributions are easily compatible with most existing derivative works of 3DGS making them relevant for future works.
Abstract:Sorting and permutation learning are key concepts in optimization and machine learning, especially when organizing high-dimensional data into meaningful spatial layouts. The Gumbel-Sinkhorn method, while effective, requires N*N parameters to determine a full permutation matrix, making it computationally expensive for large datasets. Low-rank matrix factorization approximations reduce memory requirements to 2MN (with M << N), but they still struggle with very large problems. SoftSort, by providing a continuous relaxation of the argsort operator, allows differentiable 1D sorting, but it faces challenges with multidimensional data and complex permutations. In this paper, we present a novel method for learning permutations using only N parameters, which dramatically reduces storage costs. Our approach builds on SoftSort, but extends it by iteratively shuffling the N indices of the elements to be sorted through a separable learning process. This modification significantly improves sorting quality, especially for multidimensional data and complex optimization criteria, and outperforms pure SoftSort. Our method offers improved memory efficiency and scalability compared to existing approaches, while maintaining high-quality permutation learning. Its dramatically reduced memory requirements make it particularly well-suited for large-scale optimization tasks, such as "Self-Organizing Gaussians", where efficient and scalable permutation learning is critical.
Abstract:Photo-realistic rendering and novel view synthesis play a crucial role in human-computer interaction tasks, from gaming to path planning. Neural Radiance Fields (NeRFs) model scenes as continuous volumetric functions and achieve remarkable rendering quality. However, NeRFs often struggle in large, low-textured areas, producing cloudy artifacts known as ''floaters'' that reduce scene realism, especially in indoor environments with featureless architectural surfaces like walls, ceilings, and floors. To overcome this limitation, prior work has integrated geometric constraints into the NeRF pipeline, typically leveraging depth information derived from Structure from Motion or Multi-View Stereo. Yet, conventional RGB-feature correspondence methods face challenges in accurately estimating depth in textureless regions, leading to unreliable constraints. This challenge is further complicated in 360-degree ''inside-out'' views, where sparse visual overlap between adjacent images further hinders depth estimation. In order to address these issues, we propose an efficient and robust method for computing dense depth priors, specifically tailored for large low-textured architectural surfaces in indoor environments. We introduce a novel depth loss function to enhance rendering quality in these challenging, low-feature regions, while complementary depth-patch regularization further refines depth consistency across other areas. Experiments with Instant-NGP on two synthetic 360-degree indoor scenes demonstrate improved visual fidelity with our method compared to standard photometric loss and Mean Squared Error depth supervision.
Abstract:Digitalization in the construction industry has become essential, enabling centralized, easy access to all relevant information of a building. Automated systems can facilitate the timely and resource-efficient documentation of changes, which is crucial for key processes such as progress tracking and quality control. This paper presents a method for image-based automated drywall analysis enabling construction progress and quality assessment through on-site camera systems. Our proposed solution integrates a deep learning-based instance segmentation model to detect and classify various drywall elements with an analysis module to cluster individual wall segments, estimate camera perspective distortions, and apply the corresponding corrections. This system extracts valuable information from images, enabling more accurate progress tracking and quality assessment on construction sites. Our main contributions include a fully automated pipeline for drywall analysis, improving instance segmentation accuracy through architecture modifications and targeted data augmentation, and a novel algorithm to extract important information from the segmentation results. Our modified model, enhanced with data augmentation, achieves significantly higher accuracy compared to other architectures, offering more detailed and precise information than existing approaches. Combined with the proposed drywall analysis steps, it enables the reliable automation of construction progress and quality assessment.
Abstract:Visually sorted grid layouts provide an efficient method for organizing high-dimensional vectors in two-dimensional space by aligning spatial proximity with similarity relationships. This approach facilitates the effective sorting of diverse elements ranging from data points to images, and enables the simultaneous visualization of a significant number of elements. However, sorting data on two-dimensional grids is a challenge due to its high complexity. Even for a small 8-by-8 grid with 64 elements, the number of possible arrangements exceeds $1.3 \cdot 10^{89}$ - more than the number of atoms in the universe - making brute-force solutions impractical. Although various methods have been proposed to address the challenge of determining sorted grid layouts, none have investigated the potential of gradient-based optimization. In this paper, we present a novel method for grid-based sorting that exploits gradient optimization for the first time. We introduce a novel loss function that balances two opposing goals: ensuring the generation of a "valid" permutation matrix, and optimizing the arrangement on the grid to reflect the similarity between vectors, inspired by metrics that assess the quality of sorted grids. While learning-based approaches are inherently computationally complex, our method shows promising results in generating sorted grid layouts with superior sorting quality compared to existing techniques.