"photo": models, code, and papers

DCGANs for Realistic Breast Mass Augmentation in X-ray Mammography

Sep 04, 2019
Basel Alyafi, Oliver Diaz, Robert Marti

Early detection of breast cancer contributes greatly to curability, and with mammographic images it can be achieved non-invasively. Supervised deep learning, currently the dominant CADe tool, has played a great role in object detection in computer vision, but it suffers from a limiting property: the need for a large amount of labelled data. This requirement becomes stricter for medical datasets, which require high-cost and time-consuming annotations. Furthermore, medical datasets are usually imbalanced, a condition that often hinders classifier performance. The aim of this paper is to learn the distribution of the minority class to synthesise new samples in order to improve lesion detection in mammography. Deep Convolutional Generative Adversarial Networks (DCGANs) can efficiently generate breast masses. They are trained on increasing-size subsets of one mammographic dataset and used to generate diverse and realistic breast masses. The effect of including the generated images and/or applying horizontal and vertical flipping is tested in an environment where a 1:10 imbalanced dataset of masses and normal tissue patches is classified by a fully-convolutional network. A maximum improvement of ~0.09 in F1 score is reported when using DCGANs along with flipping augmentation, compared with using the original images alone. We show that DCGANs can be used for synthesising photo-realistic breast mass patches with considerable diversity. It is demonstrated that appending synthetic images in this environment, along with flipping, outperforms the traditional augmentation method of flipping alone, offering faster improvements as a function of the training set size.
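
As a rough illustration of the evaluation setting described in this abstract, the sketch below shows flipping augmentation on a small patch and the F1 score on a 1:10 imbalanced label set. The helper names (`flip_augment`, `f1_score`) and the toy labels are assumptions made for illustration and are not taken from the paper.

```python
def flip_augment(patch):
    """Return the patch plus its horizontal, vertical, and combined flips.

    `patch` is a 2D list of pixel values (a list of rows).
    """
    horizontal = [row[::-1] for row in patch]    # mirror left-right
    vertical = [row[:] for row in patch[::-1]]   # mirror top-bottom
    both = [row[::-1] for row in patch[::-1]]    # both flips (180 degree rotation)
    return [patch, horizontal, vertical, both]


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy 1:10 imbalanced labels: one mass patch (1) and ten normal patches (0).
y_true = [1] + [0] * 10
y_pred_all_normal = [0] * 11  # a classifier that never predicts "mass"

accuracy = sum(t == p for t, p in zip(y_true, y_pred_all_normal)) / len(y_true)
f1 = f1_score(y_true, y_pred_all_normal)
```

In this toy case the accuracy is high even though no mass is detected, while the F1 score is zero, which is one reason F1 is a natural metric for imbalanced data.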

* 4 pages, 4 figures, SPIE Medical Imaging 2020 Conference 

Deep Aesthetic Assessment and Retrieval of Breast Cancer Treatment Outcomes

May 25, 2022
Wilson Silva, Maria Carvalho, Carlos Mavioso, Maria J. Cardoso, Jaime S. Cardoso

Treatments for breast cancer have continued to evolve and improve in recent years, resulting in a substantial increase in survival rates, with approximately 80% of patients having a 10-year survival period. Given the serious impact that breast cancer treatments can have on a patient's body image, consequently affecting her self-confidence and sexual and intimate relationships, it is paramount to ensure that women receive the treatment that optimizes both survival and aesthetic outcomes. Currently, there is no gold standard for evaluating the aesthetic outcome of breast cancer treatment. In addition, there is no standard way to show patients the potential outcome of surgery. The presentation of similar cases from the past would be extremely important to manage women's expectations of the possible outcome. In this work, we propose a deep neural network to perform the aesthetic evaluation. As a proof-of-concept, we focus on a binary aesthetic evaluation. Besides its use for classification, this deep neural network can also be used to find the most similar past cases by searching for nearest neighbours in the highly semantic space before classification. We performed the experiments on a dataset consisting of 143 photos of women after conservative treatment for breast cancer. The results for accuracy and balanced accuracy showed the superior performance of our proposed model compared to the state of the art in aesthetic evaluation of breast cancer treatments. In addition, the model showed a good ability to retrieve similar previous cases, with the retrieved cases having the same or adjacent class (in the 4-class setting) and having similar types of asymmetry. Finally, a qualitative interpretability assessment was also performed to analyse the robustness and trustworthiness of the model.
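
The retrieval idea mentioned in this abstract, searching for nearest neighbours in a feature space, can be sketched as follows. The feature vectors, the Euclidean distance, and the function names are assumptions for illustration; the abstract does not specify the network's embedding or the distance it uses.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve_similar(query, gallery, k=3):
    """Return the indices of the k gallery vectors closest to the query."""
    ranked = sorted(range(len(gallery)), key=lambda i: euclidean(query, gallery[i]))
    return ranked[:k]


# Hypothetical 2D embeddings of past cases (real embeddings would be much larger).
gallery = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.1], [5.0, 5.0]]
nearest = retrieve_similar([1.0, 1.0], gallery, k=2)
```

With these toy vectors, the two retrieved indices are the cases whose embeddings lie closest to the query.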


Embodied vision for learning object representations

May 12, 2022
Arthur Aubret, Céline Teulière, Jochen Triesch

Recent time-contrastive learning approaches manage to learn invariant object representations without supervision. This is achieved by mapping successive views of an object onto close-by internal representations. When considering this learning approach as a model of the development of human object recognition, it is important to consider what visual input a toddler would typically observe while interacting with objects. First, human vision is highly foveated, with high resolution only available in the central region of the field of view. Second, objects may be seen against a blurry background due to infants' limited depth of field. Third, during object manipulation a toddler mostly observes close objects filling a large part of the field of view due to their rather short arms. Here, we study how these effects impact the quality of visual representations learnt through time-contrastive learning. To this end, we let a visually embodied agent "play" with objects in different locations of a near photo-realistic flat. During each play session the agent views an object in multiple orientations before turning its body to view another object. The resulting sequence of views feeds a time-contrastive learning algorithm. Our results show that visual statistics mimicking those of a toddler improve object recognition accuracy in both familiar and novel environments. We argue that this effect is caused by the reduction of features extracted in the background, a neural network bias for large features in the image and a greater similarity between novel and familiar background regions. We conclude that the embodied nature of visual learning may be crucial for understanding the development of human object perception.

* 6 pages 

Multi-Modal Multi-Instance Learning for Retinal Disease Recognition

Sep 25, 2021
Xirong Li, Yang Zhou, Jie Wang, Hailan Lin, Jianchun Zhao, Dayong Ding, Weihong Yu, Youxin Chen

This paper attacks an emerging challenge of multi-modal retinal disease recognition. Given a multi-modal case consisting of a color fundus photo (CFP) and an array of OCT B-scan images acquired during an eye examination, we aim to build a deep neural network that recognizes multiple vision-threatening diseases for the given case. As the diagnostic efficacy of CFP and OCT is disease-dependent, the network's ability to be both selective and interpretable is important. Moreover, as both data acquisition and manual labeling are extremely expensive in the medical domain, the network has to be relatively lightweight for learning from a limited set of labeled multi-modal samples. Prior art on retinal disease recognition focuses either on a single disease or on a single modality, leaving multi-modal fusion largely underexplored. We propose in this paper Multi-Modal Multi-Instance Learning (MM-MIL) for selectively fusing CFP and OCT modalities. Its lightweight architecture (as compared to current multi-head attention modules) makes it suited for learning from relatively small-sized datasets. For an effective use of MM-MIL, we propose to generate a pseudo sequence of CFPs by oversampling a given CFP. The benefits of this tactic include balancing instances across modalities, increasing the resolution of the CFP input, and identifying the regions of the CFP most relevant to the final diagnosis. Extensive experiments on a real-world dataset consisting of 1,206 multi-modal cases from 1,193 eyes of 836 subjects demonstrate the viability of the proposed model.

* Accepted by ACM Multimedia 2021 (Main Track) 

AVA: Adversarial Vignetting Attack against Visual Recognition

May 12, 2021
Binyu Tian, Felix Juefei-Xu, Qing Guo, Xiaofei Xie, Xiaohong Li, Yang Liu

Vignetting is an inherent imaging phenomenon within almost all optical systems, showing as a radial intensity darkening toward the corners of an image. Since it is a common effect in photography and usually appears as a slight intensity variation, people usually regard it as a part of a photo and would not even want to post-process it. Due to this natural advantage, in this work, we study vignetting from a new viewpoint, i.e., adversarial vignetting attack (AVA), which aims to embed intentionally misleading information into vignetting and produce a natural adversarial example without noise patterns. This example can fool the state-of-the-art deep convolutional neural networks (CNNs) but is imperceptible to humans. To this end, we first propose the radial-isotropic adversarial vignetting attack (RI-AVA) based on the physical model of vignetting, where the physical parameters (e.g., illumination factor and focal length) are tuned through the guidance of target CNN models. To achieve higher transferability across different CNNs, we further propose radial-anisotropic adversarial vignetting attack (RA-AVA) by allowing the effective regions of vignetting to be radial-anisotropic and shape-free. Moreover, we propose the geometry-aware level-set optimization method to solve the adversarial vignetting regions and physical parameters jointly. We validate the proposed methods on three popular datasets, i.e., DEV, CIFAR10, and Tiny ImageNet, by attacking four CNNs, i.e., ResNet50, EfficientNet-B0, DenseNet121, and MobileNet-V2, demonstrating the advantages of our methods over baseline methods on both transferability and image quality.
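
To make the notion of radial darkening concrete, the sketch below applies the textbook cos^4 falloff law, in which the gain at distance r from the image centre is cos^4(atan(r / f)) = 1 / (1 + (r / f)^2)^2 for focal length f in pixel units. This law is assumed here as a stand-in; the paper's own physical model and its adversarial optimisation are not reproduced, and the function names are hypothetical.

```python
def vignetting_gain(x, y, cx, cy, focal_length):
    """cos^4 falloff gain at pixel (x, y) for an optical centre (cx, cy)."""
    r2 = (x - cx) ** 2 + (y - cy) ** 2
    return 1.0 / (1.0 + r2 / focal_length ** 2) ** 2


def apply_vignetting(image, focal_length):
    """Darken a 2D grayscale image (list of rows) toward its corners."""
    height, width = len(image), len(image[0])
    cx, cy = (width - 1) / 2, (height - 1) / 2  # assume the centre of the image
    return [
        [image[y][x] * vignetting_gain(x, y, cx, cy, focal_length) for x in range(width)]
        for y in range(height)
    ]


# A uniform 3x3 image: the centre keeps its value, the corners get darker.
vignetted = apply_vignetting([[1.0] * 3 for _ in range(3)], focal_length=2.0)
```

A smaller focal length relative to the image size gives a stronger falloff, which is the kind of parameter the abstract describes as being tuned.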

* This work has been accepted to IJCAI2021 

UV Volumes for Real-time Rendering of Editable Free-view Human Performance

Mar 27, 2022
Yue Chen, Xuan Wang, Qi Zhang, Xiaoyu Li, Xingyu Chen, Yu Guo, Jue Wang, Fei Wang

Neural volume rendering has been proven to be a promising method for efficient and photo-realistic rendering of a human performer in free-view, a critical task in many immersive VR/AR applications. However, existing approaches are severely limited by their high computational cost in the rendering process. To solve this problem, we propose UV Volumes, an approach that can render an editable free-view video of a human performer in real-time. It is achieved by removing the high-frequency (i.e., non-smooth) human textures from the 3D volume and encoding them into a 2D neural texture stack (NTS). The smooth UV volume allows us to employ a much smaller and shallower structure for the 3D CNN and MLP, to obtain the density and texture coordinates without losing image details. Meanwhile, the NTS only needs to be queried once for each pixel in the UV image to retrieve its RGB value. For editability, the 3D CNN and MLP decoder can easily fit the function that maps the input structured-and-posed latent codes to the relatively smooth densities and texture coordinates. It gives our model a better generalization ability to handle novel poses and shapes. Furthermore, the use of NTS enables new applications, e.g., retexturing. Extensive experiments on CMU Panoptic, ZJU Mocap, and H36M datasets show that our model can render 900 × 500 images at 40 fps on average with comparable photorealism to state-of-the-art methods. The project and supplementary materials are available at


Robustly Removing Deep Sea Lighting Effects for Visual Mapping of Abyssal Plains

Oct 01, 2021
Kevin Köser, Yifan Song, Lasse Petersen, Emanuel Wenzlaff, Felix Woelk

The majority of Earth's surface lies deep in the oceans, where no surface light reaches. Robots diving down to great depths must bring light sources that create moving illumination patterns in the darkness, such that the same 3D point appears with a different color in each image. On top of that, scattering and attenuation of light in the water make images appear foggy and typically bluish, with the degradation depending on each pixel's distance to its observed seafloor patch, on the local composition of the water, and on the relative poses and cones of the light sources. Consequently, visual mapping, including image matching and surface albedo estimation, severely suffers from the effects that co-moving light sources produce, and larger mosaic maps from photos are often dominated by lighting effects that obscure the actual seafloor structure. In this contribution, a practical approach to estimating and compensating for these lighting effects on predominantly homogeneous, flat seafloor regions, as can be found in the abyssal plains of our oceans, is presented. The method is essentially parameter-free and intended as a preprocessing step to facilitate visual mapping, but already produces convincing lighting artefact compensation up to a global white balance factor. It does not need to be trained beforehand on huge sets of annotated images, which are not available for the deep sea. Rather, we motivate our work by physical models of light propagation, perform robust statistics-based estimates of additive and multiplicative nuisances that avoid explicit parameters for light, camera, water or scene, discuss the breakdown point of the algorithms and show results on imagery captured by robots at water depths of several kilometers.


Roof Damage Assessment from Automated 3D Building Models

Jun 04, 2021
Kenichi Sugihara, Martin Wallace, Kongwen Zhang, Youry Khmelevsky

3D building modelling is important in urban planning and related domains that draw upon the content of 3D models of urban scenes. Such 3D models can be used to visualize city images at multiple scales, from individual buildings to entire cities, prior to and after a change has occurred. This ability is of great importance in day-to-day work and special projects undertaken by planners, geo-designers, and architects. In this research, we implemented a novel approach to 3D building models for this purpose, which included the integration of geographic information systems (GIS) and 3D Computer Graphics (3DCG) components that generate 3D house models from building footprints (polygons), and the automated generation of simple and complex roof geometries for rapid roof area damage reporting. These polygons (footprints) are usually orthogonal. A complicated orthogonal polygon can be partitioned into a set of rectangles. The proposed GIS and 3DCG integrated system partitions orthogonal building polygons into a set of rectangles and places rectangular roofs and box-shaped building bodies on these rectangles. Since technicians draw these polygons manually with digitizers, based on aerial photos, not all building polygons are precisely orthogonal. When a set of boxes is placed as building bodies to create the buildings, there may be gaps or overlaps between these boxes if the building polygons are not precisely orthogonal. In our proposal, after approximately orthogonal building polygons are partitioned and rectified into a set of mutually orthogonal rectangles, each rectangle knows which rectangle it is adjacent to and along which edge, which avoids unwanted intersection of windows and doors when the building bodies are combined.
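
The statement that an orthogonal polygon can be partitioned into rectangles can be illustrated with a simple horizontal slab decomposition. This is a generic technique assumed for illustration, not necessarily the partitioning scheme used by the authors; it expects an exactly orthogonal polygon (the rectification step is not shown), and the function names are hypothetical.

```python
def point_in_polygon(px, py, polygon):
    """Ray-casting test: is the point (px, py) strictly inside the polygon?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > py) != (y2 > py):  # edge crosses the horizontal ray
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside


def partition_into_rectangles(polygon):
    """Split an orthogonal polygon into rectangles (x_min, y_min, x_max, y_max).

    The polygon is cut into horizontal slabs at every distinct vertex y value;
    within a slab, adjacent interior grid cells are merged into one rectangle.
    """
    xs = sorted({x for x, _ in polygon})
    ys = sorted({y for _, y in polygon})
    rectangles = []
    for y0, y1 in zip(ys, ys[1:]):
        run_start = None  # left edge of the current run of interior cells
        for x0, x1 in zip(xs, xs[1:]):
            inside = point_in_polygon((x0 + x1) / 2, (y0 + y1) / 2, polygon)
            if inside and run_start is None:
                run_start = x0
            if not inside and run_start is not None:
                rectangles.append((run_start, y0, x0, y1))
                run_start = None
        if run_start is not None:
            rectangles.append((run_start, y0, xs[-1], y1))
    return rectangles


# An L-shaped footprint, listed counter-clockwise.
footprint = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]
rects = partition_into_rectangles(footprint)
```

For the L-shaped footprint this yields two rectangles whose areas sum to the footprint area; each rectangle could then carry a box-shaped body and a roof, as the abstract describes.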


Multimodal Deep Learning Framework for Image Popularity Prediction on Social Media

May 18, 2021
Fatma S. Abousaleh, Wen-Huang Cheng, Neng-Hao Yu, Yu Tsao

Billions of photos are uploaded to the web daily through various types of social networks. Some of these images receive millions of views and become popular, whereas others remain completely unnoticed. This raises the problem of predicting image popularity on social media. The popularity of an image can be affected by several factors, such as visual content, aesthetic quality, user, post metadata, and time. Thus, considering all these factors is essential for accurately predicting image popularity. In addition, the efficiency of the predictive model also plays a crucial role. In this study, motivated by multimodal learning, which uses information from various modalities, and the current success of convolutional neural networks (CNNs) in various fields, we propose a deep learning model, called visual-social convolutional neural network (VSCNN), which predicts the popularity of a posted image by incorporating various types of visual and social features into a unified network model. VSCNN first learns to extract high-level representations from the input visual and social features by utilizing two individual CNNs. The outputs of these two networks are then fused into a joint network to estimate the popularity score in the output layer. We assess the performance of the proposed method by conducting extensive experiments on a dataset of approximately 432K images posted on Flickr. The simulation results demonstrate that the proposed VSCNN model significantly outperforms state-of-the-art models, with a relative improvement of greater than 2.33%, 7.59%, and 14.16% in terms of Spearman's Rho, mean absolute error, and mean squared error, respectively.
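
The three evaluation metrics named in this abstract can be sketched in a few lines. The version of Spearman's Rho below uses the simple rank-difference formula and assumes there are no tied values; the function names and sample numbers are illustrative and do not come from the paper.

```python
def ranks(values):
    """Rank values from 1 to n (assumes no ties, for simplicity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(a, b):
    """Spearman rank correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(a)
    d_squared = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical popularity scores and predictions.
true_scores = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.9, 3.2, 5.0]
rho = spearman_rho(true_scores, predicted)
```

Spearman's Rho rewards getting the ordering of images right, while the two error metrics penalise the size of the deviation from the true score.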

* IEEE Transactions on Cognitive and Developmental Systems. 2020 Nov 9 
* 14 pages, 11 figures, 7 tables 

On the Reliability of the PNU for Source Camera Identification Tasks

Aug 28, 2020
Andrea Bruno, Giuseppe Cattaneo, Paola Capasso

The PNU is an essential and reliable tool for performing SCI and, over the years, has become a de facto standard for this task in the forensic field. In this paper, we show that, although strategies exist that aim to cancel, modify, or replace the PNU traces in a digital camera image, it is still possible, through our experimental method, to find residual traces of the noise produced by the sensor used to shoot the photo. Furthermore, we show that it is possible to inject the PNU of a different camera into a target image and trace it back to the source camera, but only under the condition that the new camera is of the same model as the original one used to take the target image. Both cameras must be available to us. For completeness, we carried out two experiments and, rather than using the popular public reference dataset, CASIA TIDE, we preferred to introduce a dataset that does not present any kind of statistical artifacts. A preliminary experiment on a small dataset of smartphones showed that the injection of PNU from a different device makes it impossible to identify the source camera correctly. For the second experiment, we built a large dataset of images taken with the same DSLR model. We extracted a denoised version of each image, injected each one with the RN of all the cameras in the dataset, and compared all of them with an RP from each camera. The results of the experiments clearly show that, in both the denoised images and the injected ones, it is possible to find residual traces of the original camera PNU. The combined results of the experiments show that, even if in theory it is possible to remove or replace the PNU in an image, this process can be easily detected and is possible under some hard conditions, confirming the robustness of the PNU against this type of attack.
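
The comparison step mentioned in this abstract, matching an image's noise against one reference per camera, is sketched below using normalized correlation. The choice of correlation as the comparison statistic, the flattened toy signals, and the function names are all assumptions for illustration; the abstract does not state which statistic the authors use.

```python
import math


def normalized_correlation(residual, reference):
    """Zero-mean normalized correlation between two equal-length signals."""
    n = len(residual)
    mean_res = sum(residual) / n
    mean_ref = sum(reference) / n
    numerator = sum((r - mean_res) * (p - mean_ref) for r, p in zip(residual, reference))
    energy_res = sum((r - mean_res) ** 2 for r in residual)
    energy_ref = sum((p - mean_ref) ** 2 for p in reference)
    denominator = math.sqrt(energy_res * energy_ref)
    return numerator / denominator if denominator else 0.0


def identify_source(residual, reference_patterns):
    """Return the camera whose reference correlates best with the residual."""
    return max(
        reference_patterns,
        key=lambda name: normalized_correlation(residual, reference_patterns[name]),
    )


# Hypothetical flattened noise signals (real ones have one value per pixel).
references = {
    "camera_a": [0.1, -0.2, 0.3, -0.1],
    "camera_b": [-0.1, 0.2, -0.3, 0.1],
}
image_residual = [0.12, -0.18, 0.33, -0.08]
best_match = identify_source(image_residual, references)
```

In practice a decision threshold on the correlation value would also be needed, so that an image from an unknown camera is not forced onto the closest reference.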

* 14 pages, 7 figures, to be presented at IWBDAF on 11 Jan 2021 