Text-to-3D generation has achieved remarkable success via large-scale text-to-image diffusion models. Nevertheless, there is no paradigm for scaling up the methodology to urban scale. Urban scenes, characterized by numerous elements, intricate arrangement relationships, and vast scale, present a formidable barrier to the interpretability of ambiguous textual descriptions for effective model optimization. In this work, we surmount the limitations by introducing a compositional 3D layout representation into text-to-3D paradigm, serving as an additional prior. It comprises a set of semantic primitives with simple geometric structures and explicit arrangement relationships, complementing textual descriptions and enabling steerable generation. Upon this, we propose two modifications -- (1) We introduce Layout-Guided Variational Score Distillation to address model optimization inadequacies. It conditions the score distillation sampling process with geometric and semantic constraints of 3D layouts. (2) To handle the unbounded nature of urban scenes, we represent 3D scene with a Scalable Hash Grid structure, incrementally adapting to the growing scale of urban scenes. Extensive experiments substantiate the capability of our framework to scale text-to-3D generation to large-scale urban scenes that cover over 1000m driving distance for the first time. We also present various scene editing demonstrations, showing the powers of steerable urban scene generation. Website: https://urbanarchitect.github.io.
Image super-resolution has been an important subject in image processing and recognition. Here, we present an attention-aided convolutional neural network (CNN) for solar image super-resolution. Our method, named SolarCNN, aims to enhance the quality of line-of-sight (LOS) magnetograms of solar active regions (ARs) collected by the Michelson Doppler Imager (MDI) on board the Solar and Heliospheric Observatory (SOHO). The ground-truth labels used for training SolarCNN are the LOS magnetograms collected by the Helioseismic and Magnetic Imager (HMI) on board the Solar Dynamics Observatory (SDO). Solar ARs consist of strong magnetic fields in which magnetic energy can suddenly be released to produce extreme space weather events, such as solar flares, coronal mass ejections, and solar energetic particles. SOHO/MDI covers Solar Cycle 23, which is stronger with more eruptive events than Cycle 24. Enhanced SOHO/MDI magnetograms allow for better understanding and forecasting of violent events of space weather. Experimental results show that SolarCNN improves the quality of SOHO/MDI magnetograms in terms of the structural similarity index measure (SSIM), Pearson's correlation coefficient (PCC), and the peak signal-to-noise ratio (PSNR).
Video creation has become increasingly popular, yet the expertise and effort required for editing often pose barriers to beginners. In this paper, we explore the integration of large language models (LLMs) into the video editing workflow to reduce these barriers. Our design vision is embodied in LAVE, a novel system that provides LLM-powered agent assistance and language-augmented editing features. LAVE automatically generates language descriptions for the user's footage, serving as the foundation for enabling the LLM to process videos and assist in editing tasks. When the user provides editing objectives, the agent plans and executes relevant actions to fulfill them. Moreover, LAVE allows users to edit videos through either the agent or direct UI manipulation, providing flexibility and enabling manual refinement of agent actions. Our user study, which included eight participants ranging from novices to proficient editors, demonstrated LAVE's effectiveness. The results also shed light on user perceptions of the proposed LLM-assisted editing paradigm and its impact on users' creativity and sense of co-creation. Based on these findings, we propose design implications to inform the future development of agent-assisted content editing.
We tackle the task of multi-view, multi-person 3D human pose estimation from a limited number of uncalibrated depth cameras. Recently, many approaches have been proposed for 3D human pose estimation from multi-view RGB cameras. However, these works (1) assume the number of RGB camera views is large enough for 3D reconstruction, (2) the cameras are calibrated, and (3) rely on ground truth 3D poses for training their regression model. In this work, we propose to leverage sparse, uncalibrated depth cameras providing RGBD video streams for 3D human pose estimation. We present a simple pipeline for Multi-View Depth Human Pose Estimation (MVD-HPE) for jointly predicting the camera poses and 3D human poses without training a deep 3D human pose regression model. This framework utilizes 3D Re-ID appearance features from RGBD images to formulate more accurate correspondences (for deriving camera positions) compared to using RGB-only features. We further propose (1) depth-guided camera-pose estimation by leveraging 3D rigid transformations as guidance and (2) depth-constrained 3D human pose estimation by utilizing depth-projected 3D points as an alternative objective for optimization. In order to evaluate our proposed pipeline, we collect three video sets of RGBD videos recorded from multiple sparse-view depth cameras and ground truth 3D poses are manually annotated. Experiments show that our proposed method outperforms the current 3D human pose regression-free pipelines in terms of both camera pose estimation and 3D human pose estimation.
Cross-view person matching and 3D human pose estimation in multi-camera networks are particularly difficult when the cameras are extrinsically uncalibrated. Existing efforts generally require large amounts of 3D data for training neural networks or known camera poses for geometric constraints to solve the problem. However, camera poses and 3D data annotation are usually expensive and not always available. We present a method, PME, that solves the two tasks without requiring either information. Our idea is to address cross-view person matching as a clustering problem using each person as a cluster center, then obtain correspondences from person matches, and estimate 3D human poses through multi-view triangulation and bundle adjustment. We solve the clustering problem by introducing a "size constraint" using the number of cameras and a "source constraint" using the fact that two people from the same camera view should not match, to narrow the solution space to a small feasible region. The 2D human poses used in clustering are obtained through a pre-trained 2D pose detector, so our method does not require expensive 3D training data for each new scene. We extensively evaluate our method on three open datasets and two indoor and outdoor datasets collected using arbitrarily set cameras. Our method outperforms other methods by a large margin on cross-view person matching, reaches SOTA performance on 3D human pose estimation without using either camera poses or 3D training data, and shows good generalization ability across five datasets of various environment settings.
Inference, especially those derived from inductive processes, is a crucial component in our conversation to complement the information implicitly or explicitly conveyed by a speaker. While recent large language models show remarkable advances in inference tasks, their performance in inductive reasoning, where not all information is present in the context, is far behind deductive reasoning. In this paper, we analyze the behavior of the models based on the task difficulty defined by the semantic information gap -- which distinguishes inductive and deductive reasoning (Johnson-Laird, 1988, 1993). Our analysis reveals that the disparity in information between dialogue contexts and desired inferences poses a significant challenge to the inductive inference process. To mitigate this information gap, we investigate a contrastive learning approach by feeding negative samples. Our experiments suggest negative samples help models understand what is wrong and improve their inference generations.
Most existing data-driven power system short-term voltage stability assessment (STVSA) approaches presume class-balanced input data. However, in practical applications, the occurrence of short-term voltage instability following a disturbance is minimal, leading to a significant class imbalance problem and a consequent decline in classifier performance. This work proposes a Transformer-based STVSA method to address this challenge. By utilizing the basic Transformer architecture, a stability assessment Transformer (StaaT) is developed {as a classification model to reflect the correlation between the operational states of the system and the resulting stability outcomes}. To combat the negative impact of imbalanced datasets, this work employs a conditional Wasserstein generative adversarial network with gradient penalty (CWGAN-GP) for synthetic data generation, aiding in the creation of a balanced, representative training set for the classifier. Semi-supervised clustering learning is implemented to enhance clustering quality, addressing the lack of a unified quantitative criterion for short-term voltage stability. {Numerical tests on the IEEE 39-bus test system extensively demonstrate that the proposed method exhibits robust performance under class imbalances up to 100:1 and noisy environments, and maintains consistent effectiveness even with an increased penetration of renewable energy}. Comparative results reveal that the CWGAN-GP generates more balanced datasets than traditional oversampling methods and that the StaaT outperforms other deep learning algorithms. This study presents a compelling solution for real-world STVSA applications that often face class imbalance and data noise challenges.
Large language models (LLMs) have shown promise for generative and knowledge-intensive tasks including question-answering (QA) tasks. However, the practical deployment still faces challenges, notably the issue of "hallucination", where models generate plausible-sounding but unfaithful or nonsensical information. This issue becomes particularly critical in the medical domain due to the uncommon professional concepts and potential social risks involved. This paper analyses the phenomenon of hallucination in medical generative QA systems using widely adopted LLMs and datasets. Our investigation centers on the identification and comprehension of common problematic answers, with a specific emphasis on hallucination. To tackle this challenge, we present an interactive self-reflection methodology that incorporates knowledge acquisition and answer generation. Through this feedback process, our approach steadily enhances the factuality, consistency, and entailment of the generated answers. Consequently, we harness the interactivity and multitasking ability of LLMs and produce progressively more precise and accurate answers. Experimental results on both automatic and human evaluation demonstrate the superiority of our approach in hallucination reduction compared to baselines.
Grounding dialogue response generation on external knowledge is proposed to produce informative and engaging responses. However, current knowledge-grounded dialogue (KGD) systems often fail to align the generated responses with human-preferred qualities due to several issues like hallucination and the lack of coherence. Upon analyzing multiple language model generations, we observe the presence of alternative generated responses within a single decoding process. These alternative responses are more faithful and exhibit a comparable or higher level of relevance to prior conversational turns compared to the optimal responses prioritized by the decoding processes. To address these challenges and driven by these observations, we propose Polished \& Informed Candidate Scoring (PICK), a generation re-scoring framework that empowers models to generate faithful and relevant responses without requiring additional labeled data or model tuning. Through comprehensive automatic and human evaluations, we demonstrate the effectiveness of PICK in generating responses that are more faithful while keeping them relevant to the dialogue history. Furthermore, PICK consistently improves the system's performance with both oracle and retrieved knowledge in all decoding strategies. We provide the detailed implementation in https://github.com/bryanwilie/pick .
Medical image registration is a critical task that estimates the spatial correspondence between pairs of images. However, current traditional and deep-learning-based methods rely on similarity measures to generate a deforming field, which often results in disproportionate volume changes in dissimilar regions, especially in tumor regions. These changes can significantly alter the tumor size and underlying anatomy, which limits the practical use of image registration in clinical diagnosis. To address this issue, we have formulated image registration with tumors as a constraint problem that preserves tumor volumes while maximizing image similarity in other normal regions. Our proposed strategy involves a two-stage process. In the first stage, we use similarity-based registration to identify potential tumor regions by their volume change, generating a soft tumor mask accordingly. In the second stage, we propose a volume-preserving registration with a novel adaptive volume-preserving loss that penalizes the change in size adaptively based on the masks calculated from the previous stage. Our approach balances image similarity and volume preservation in different regions, i.e., normal and tumor regions, by using soft tumor masks to adjust the imposition of volume-preserving loss on each one. This ensures that the tumor volume is preserved during the registration process. We have evaluated our strategy on various datasets and network architectures, demonstrating that our method successfully preserves the tumor volume while achieving comparable registration results with state-of-the-art methods. Our codes is available at: \url{https://dddraxxx.github.io/Volume-Preserving-Registration/}.