Abstract:Any image is perceived subconsciously as a coherent structure (or whole) with two contrast substructures: figure and ground. The figure consists of numerous auto-generated substructures with an inherent hierarchy of far more smalls than larges. Through these substructures, the structural beauty of an image (L) can be computed by the multiplication of the number of substructures (S) and their inherent hierarchy (H). This definition implies that the more substructures, the more living or more structurally beautiful, and the higher hierarchy of the substructures, the more living or more structurally beautiful. This is the non-recursive approach to the structural beauty of images or the livingness of space. In this paper we develop a recursive approach, which derives all substructures of an image (instead of its figure) and continues the deriving process for those decomposable substructures until none of them are decomposable. All of the substructures derived at different iterations (or recursive levels) together constitute a living structure; hence the notion of living images. We applied the recursive approach to a set of images and found that (1) the number of substructures of an image is far lower (3 percent on average) than the number of pixels and the centroids of the substructures can effectively capture the skeleton or saliency of the image; (2) all the images have the recursive levels more than three, indicating that they are indeed living images; (3) no more than 2 percent of the substructures are decomposable; (4) structural beauty can be measured by the recursively defined substructures, as well as their decomposable subsets. The recursive approach is proved to be more robust than the non-recursive approach. The recursive approach and the non-recursive approach both provide a powerful means to study the livingness or vitality of space in cities and communities.
Abstract:Given a piece of text, a video clip and a reference audio, the movie dubbing (also known as visual voice clone V2C) task aims to generate speeches that match the speaker's emotion presented in the video using the desired speaker voice as reference. V2C is more challenging than conventional text-to-speech tasks as it additionally requires the generated speech to exactly match the varying emotions and speaking speed presented in the video. Unlike previous works, we propose a novel movie dubbing architecture to tackle these problems via hierarchical prosody modelling, which bridges the visual information to corresponding speech prosody from three aspects: lip, face, and scene. Specifically, we align lip movement to the speech duration, and convey facial expression to speech energy and pitch via attention mechanism based on valence and arousal representations inspired by recent psychology findings. Moreover, we design an emotion booster to capture the atmosphere from global video scenes. All these embeddings together are used to generate mel-spectrogram and then convert to speech waves via existing vocoder. Extensive experimental results on the Chem and V2C benchmark datasets demonstrate the favorable performance of the proposed method. The source code and trained models will be released to the public.




Abstract:Deep learning techniques have shown great potential in medical image processing, particularly through accurate and reliable image segmentation on magnetic resonance imaging (MRI) scans or computed tomography (CT) scans, which allow the localization and diagnosis of lesions. However, training these segmentation models requires a large number of manually annotated pixel-level labels, which are time-consuming and labor-intensive, in contrast to image-level labels that are easier to obtain. It is imperative to resolve this problem through weakly-supervised semantic segmentation models using image-level labels as supervision since it can significantly reduce human annotation efforts. Most of the advanced solutions exploit class activation mapping (CAM). However, the original CAMs rarely capture the precise boundaries of lesions. In this study, we propose the strategy of multi-scale inference to refine CAMs by reducing the detail loss in single-scale reasoning. For segmentation, we develop a novel model named Mixed-UNet, which has two parallel branches in the decoding phase. The results can be obtained after fusing the extracted features from two branches. We evaluate the designed Mixed-UNet against several prevalent deep learning-based segmentation approaches on our dataset collected from the local hospital and public datasets. The validation results demonstrate that our model surpasses available methods under the same supervision level in the segmentation of various lesions from brain imaging.




Abstract:Background and Purpose: Altered brain vasculature is a key phenomenon in several neurologic disorders. This paper presents a quantitative assessment of vascular morphology in healthy and diseased adults including changes during aging and the anatomical variations in the Circle of Willis (CoW). Methods: We used our automatic method to segment and extract novel geometric features of the cerebral vasculature from MRA scans of 175 healthy subjects, 45 AIS, and 50 AD patients after spatial registration. This is followed by quantification and statistical analysis of vascular alterations in acute ischemic stroke (AIS) and Alzheimer's disease (AD), the biggest cerebrovascular and neurodegenerative disorders. Results: We determined that the CoW is fully formed in only 35 percent of healthy adults and found significantly increased tortuosity and fractality, with increasing age and with disease -- both AIS and AD. We also found significantly decreased vessel length, volume and number of branches in AIS patients. Lastly, we found that AD cerebral vessels exhibited significantly smaller diameter and more complex branching patterns, compared to age-matched healthy adults. These changes were significantly heightened with progression of AD from early onset to moderate-severe dementia. Conclusion: Altered vessel geometry in AIS patients shows that there is pathological morphology coupled with stroke. In AD due to pathological alterations in the endothelium or amyloid depositions leading to neuronal damage and hypoperfusion, vessel geometry is significantly altered even in mild or early dementia. The specific geometric features and quantitative comparisons demonstrate potential for using vascular morphology as a non-invasive imaging biomarker for neurologic disorders.




Abstract:To say that beauty is in the eye of the beholder means that beauty is largely subjective so varies from person to person. While the subjectivity view is commonly held, there is also an objectivity view that seeks to measure beauty or aesthetics in some quantitative manners. Christopher Alexander has long discovered that beauty or coherence highly correlates to the number of subsymmetries or substructures and demonstrated that there is a shared notion of beauty - structural beauty - among people and even different peoples, regardless of their faiths, cultures, and ethnicities. This notion of structural beauty arises directly out of living structure or wholeness, a physical and mathematical structure that underlies all space and matter. Based on the concept of living structure, this paper develops an approach for computing the structural beauty or life of an image (L) based on the number of automatically derived substructures (S) and their inherent hierarchy (H). To verify this approach, we conducted a series of case studies applied to eight pairs of images including Leonardo da Vinci's Mona Lisa and Jackson Pollock's Blue Poles. We discovered among others that Blue Poles is more structurally beautiful than the Mona Lisa, and traditional buildings are in general more structurally beautiful than their modernist counterparts. This finding implies that goodness of things or images is largely a matter of fact rather than an opinion or personal preference as conventionally conceived. The research on structural beauty has deep implications on many disciplines, where beauty or aesthetics is a major concern such as image understanding and computer vision, architecture and urban design, humanities and arts, neurophysiology, and psychology. Keywords: Life; wholeness; figural goodness; head/tail breaks; computer vision




Abstract:Visual Question Answering (VQA) is a challenging multimodal task to answer questions about an image. Many works concentrate on how to reduce language bias which makes models answer questions ignoring visual content and language context. However, reducing language bias also weakens the ability of VQA models to learn context prior. To address this issue, we propose a novel learning strategy named CCB, which forces VQA models to answer questions relying on Content and Context with language Bias. Specifically, CCB establishes Content and Context branches on top of a base VQA model and forces them to focus on local key content and global effective context respectively. Moreover, a joint loss function is proposed to reduce the importance of biased samples and retain their beneficial influence on answering questions. Experiments show that CCB outperforms the state-of-the-art methods in terms of accuracy on VQA-CP v2.




Abstract:Reference expression comprehension (REC) aims to find the location that the phrase refer to in a given image. Proposal generation and proposal representation are two effective techniques in many two-stage REC methods. However, most of the existing works only focus on proposal representation and neglect the importance of proposal generation. As a result, the low-quality proposals generated by these methods become the performance bottleneck in REC tasks. In this paper, we reconsider the problem of proposal generation, and propose a novel phrase-guided proposal generation network (PPGN). The main implementation principle of PPGN is refining visual features with text and generate proposals through regression. Experiments show that our method is effective and achieve SOTA performance in benchmark datasets.




Abstract:Deep encoder-decoder based CNNs have advanced image inpainting methods for hole filling. While existing methods recover structures and textures step-by-step in the hole regions, they typically use two encoder-decoders for separate recovery. The CNN features of each encoder are learned to capture either missing structures or textures without considering them as a whole. The insufficient utilization of these encoder features limit the performance of recovering both structures and textures. In this paper, we propose a mutual encoder-decoder CNN for joint recovery of both. We use CNN features from the deep and shallow layers of the encoder to represent structures and textures of an input image, respectively. The deep layer features are sent to a structure branch and the shallow layer features are sent to a texture branch. In each branch, we fill holes in multiple scales of the CNN features. The filled CNN features from both branches are concatenated and then equalized. During feature equalization, we reweigh channel attentions first and propose a bilateral propagation activation function to enable spatial equalization. To this end, the filled CNN features of structure and texture mutually benefit each other to represent image content at all feature levels. We use the equalized feature to supplement decoder features for output image generation through skip connections. Experiments on the benchmark datasets show the proposed method is effective to recover structures and textures and performs favorably against state-of-the-art approaches.



Abstract:We present a general-purpose data compression algorithm, Regularized L21 Semi-NonNegative Matrix Factorization (L21 SNF). L21 SNF provides robust, parts-based compression applicable to mixed-sign data for which high fidelity, individualdata point reconstruction is paramount. We derive a rigorous proof of convergenceof our algorithm. Through experiments, we show the use-case advantages presentedby L21 SNF, including application to the compression of highly overdeterminedsystems encountered broadly across many general machine learning processes.




Abstract:Recent deep learning based image inpainting methods which utilize contextual information and two-stage architecture have exhibited remarkable performance. However, the two-stage architecture is time-consuming, the contextual information lack high-level semantics and ignores both the semantic relevance and distance information of hole's feature patches, these limitations result in blurry textures and distorted structures of final result. Motivated by these observations, we propose a new deep generative model-based approach, which trains a shared network twice with different targets and utilizes a single network during the testing phase, so that we can effectively save inference time. Specifically, the targets of two training steps are structure reconstruction and texture generation respectively. During the second training, we first propose a Pyramid Filling Block (PF-block) to utilize the high-level features that the hole regions has been filled to guide the filling process of low-level features progressively, the missing content can be filled from deep to shallow in a pyramid fashion. Then, inspired by the classical bilateral filter [30], we propose the Bilateral Attention layer (BA-layer) to optimize filled feature map, which synthesizes feature patches at each position by computing weighted sums of the surrounding feature patches, these weights are derived by considering both distance and value relationships between feature patches, thus making the visually plausible inpainting results. Finally, experiments on multiple publicly available datasets show the superior performance of our approach.