Radiology reports are detailed text descriptions of the content of medical scans. Each report describes the presence or absence and location of relevant clinical findings, commonly including comparison with prior exams of the same patient to describe how findings have evolved. Radiology reporting is a time-consuming process, and scan results are often subject to delays. One strategy to speed up reporting is to integrate automated reporting systems; however, clinical deployment requires high accuracy and interpretability. Previous approaches to automated radiology reporting generally do not provide the prior study as input, precluding the comparison that is required for clinical accuracy in some types of scans, and offer only unreliable methods of interpretability. Therefore, leveraging an existing visual input format of anatomical tokens, we introduce two novel aspects: (1) longitudinal representation learning -- we provide the prior scan as an additional input, proposing a method to align, concatenate and fuse the current and prior visual information into a joint longitudinal representation which can be provided to the multimodal report generation model; (2) sentence-anatomy dropout -- a training strategy for controllability in which the report generator model is trained to predict only the sentences of the original report that correspond to the subset of anatomical regions given as input. We show through in-depth experiments on the MIMIC-CXR dataset how the proposed approach achieves state-of-the-art results while enabling anatomy-wise controllable report generation.
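The following is a minimal sketch, not the authors' released code, of the two ideas above: current and prior anatomical tokens (assumed already aligned so that index r refers to the same region in both scans) are concatenated and fused into a joint longitudinal representation, and a toy helper illustrates sentence-anatomy dropout. All names and sizes are illustrative assumptions.

```python
import random
import torch
import torch.nn as nn

class LongitudinalFusion(nn.Module):
    """Fuse aligned current/prior anatomical tokens into joint tokens."""
    def __init__(self, token_dim: int = 256):
        super().__init__()
        # project the concatenated (current, prior) token back to token_dim
        self.fuse = nn.Sequential(
            nn.Linear(2 * token_dim, token_dim),
            nn.ReLU(),
            nn.Linear(token_dim, token_dim),
        )

    def forward(self, current_tokens, prior_tokens):
        # both inputs: (batch, num_regions, token_dim)
        return self.fuse(torch.cat([current_tokens, prior_tokens], dim=-1))

def sentence_anatomy_dropout(sentences, sentence_regions, regions, keep_prob=0.7):
    # keep a random subset of anatomical regions and only the report sentences
    # whose annotated region survives, so the generator learns to describe
    # exactly the regions it is given as input
    kept = {r for r in regions if random.random() < keep_prob}
    return kept, [s for s, r in zip(sentences, sentence_regions) if r in kept]

# toy usage: 36 anatomical regions, 256-d tokens per region
tokens = LongitudinalFusion()(torch.randn(2, 36, 256), torch.randn(2, 36, 256))
```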
The task of radiology reporting comprises describing and interpreting the medical findings in radiographic images, including description of their location and appearance. Automated approaches to radiology reporting require the image to be encoded into a suitable token representation for input to the language model. Previous methods commonly use convolutional neural networks to encode an image into a series of image-level feature map representations. However, the generated reports often exhibit realistic style but imperfect accuracy. Inspired by recent works for image captioning in the general domain in which each visual token corresponds to an object detected in an image, we investigate whether using local tokens corresponding to anatomical structures can improve the quality of the generated reports. We introduce a novel adaptation of Faster R-CNN in which finding detection is performed for the candidate bounding boxes extracted during anatomical structure localisation. We use the resulting bounding box feature representations as our set of finding-aware anatomical tokens. This encourages the extracted anatomical tokens to be informative about the findings they contain, as required for the final task of radiology reporting. Evaluating on the MIMIC-CXR dataset of chest X-ray images, we show that finding-aware anatomical tokens give state-of-the-art performance when integrated into an automated reporting pipeline, yielding generated reports with improved clinical accuracy.
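As an illustrative sketch of this kind of pipeline (an assumed architecture, not the released model), box features pooled from a detector backbone can be supervised with both an anatomy-classification head and a multi-label finding head, with the pooled features themselves serving as the anatomical tokens passed to the report generator. All layer names and sizes below are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class FindingAwareTokenHead(nn.Module):
    def __init__(self, in_channels=256, token_dim=256, num_regions=36, num_findings=14):
        super().__init__()
        self.proj = nn.Linear(in_channels * 7 * 7, token_dim)
        self.anatomy_cls = nn.Linear(token_dim, num_regions)   # which anatomical region the box shows
        self.finding_cls = nn.Linear(token_dim, num_findings)  # multi-label findings inside the box

    def forward(self, feature_map, boxes):
        # feature_map: (batch, C, H, W); boxes: list of (num_boxes_i, 4) tensors in image coordinates
        pooled = roi_align(feature_map, boxes, output_size=(7, 7), spatial_scale=1.0 / 16)
        tokens = self.proj(pooled.flatten(1))            # (total_boxes, token_dim)
        return tokens, self.anatomy_cls(tokens), self.finding_cls(tokens)

# toy usage with a single image and three candidate anatomical boxes
feats = torch.randn(1, 256, 32, 32)
boxes = [torch.tensor([[0., 0., 128., 128.], [64., 64., 256., 256.], [100., 0., 300., 200.]])]
tokens, anatomy_logits, finding_logits = FindingAwareTokenHead()(feats, boxes)
```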
Semi-supervised learning has demonstrated great potential in medical image segmentation by utilizing knowledge from unlabeled data. However, most existing approaches do not explicitly capture high-level semantic relations between distant regions, which limits their performance. In this paper, we focus on representation learning for semi-supervised learning, by developing a novel Multi-Scale Cross Supervised Contrastive Learning (MCSC) framework, to segment structures in medical images. We jointly train CNN and Transformer models, regularising their features to be semantically consistent across different scales. Our approach contrasts multi-scale features based on ground-truth and cross-predicted labels, in order to extract robust feature representations that reflect intra- and inter-slice relationships across the whole dataset. To tackle class imbalance, we take into account the prevalence of each class to guide contrastive learning and ensure that features adequately capture infrequent classes. Extensive experiments on two multi-structure medical segmentation datasets demonstrate the effectiveness of MCSC. It not only outperforms state-of-the-art semi-supervised methods by more than 3.0% in Dice, but also greatly reduces the performance gap with fully supervised methods.
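A minimal sketch (our own illustration, not the paper's implementation) of the class-prevalence-aware contrastive idea: features sharing a label, whether from ground-truth or cross-predicted labels at different scales, are pulled together, and infrequent classes are up-weighted by inverse frequency. Function and variable names are assumptions.

```python
import torch
import torch.nn.functional as F

def weighted_supervised_contrastive(features, labels, temperature=0.1):
    # features: (N, D) feature vectors; labels: (N,) integer class ids
    features = F.normalize(features, dim=1)
    sim = features @ features.t() / temperature                   # (N, N) similarities
    logits = sim - sim.max(dim=1, keepdim=True).values.detach()   # numerical stability
    same = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()
    mask_self = torch.eye(len(labels), device=labels.device)
    pos = same * (1 - mask_self)                                  # positives, excluding self
    exp = torch.exp(logits) * (1 - mask_self)
    log_prob = logits - torch.log(exp.sum(dim=1, keepdim=True) + 1e-12)
    mean_log_prob_pos = (pos * log_prob).sum(1) / pos.sum(1).clamp(min=1)
    # inverse-prevalence weights so infrequent classes contribute more
    counts = torch.bincount(labels).float()
    weights = (1.0 / counts.clamp(min=1))[labels]
    weights = weights / weights.sum()
    return -(weights * mean_log_prob_pos).sum()

# toy usage: 8 feature vectors with imbalanced labels
loss = weighted_supervised_contrastive(torch.randn(8, 32), torch.tensor([0, 0, 0, 0, 0, 1, 1, 2]))
```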
Music therapy has emerged recently as a successful intervention that improves patient outcomes in a wide range of neurological and mood disorders without adverse effects. Brain networks are entrained to music in ways that can be explained both via top-down and bottom-up processes. In particular, the direct interaction of the auditory system with the motor and reward systems via a predictive framework explains the efficacy of music-based interventions in motor rehabilitation. In this manuscript, we provide a brief overview of current theories of music perception and processing. Subsequently, we summarise evidence of music-based interventions primarily in motor, emotional and cardiovascular regulation. We highlight opportunities to improve quality of life and reduce stress beyond the clinic environment and in healthy individuals. This relatively unexplored area requires an understanding of how we can personalise and automate music selection processes to fit individuals' needs and tasks via feedback loops mediated by measurements of neuro-physiological responses.
The adaptation of transformers to computer vision is not straightforward because modelling image contextual information results in quadratic computational complexity with respect to the number of input features. Most existing methods require extensive pre-training on massive datasets such as ImageNet, and therefore their application to fields such as healthcare is less effective. CNNs are the dominant architecture in computer vision tasks because convolutional filters can effectively model local dependencies and drastically reduce the parameters required. However, convolutional filters cannot handle more complex interactions that extend beyond a small neighbourhood of pixels. Furthermore, their weights are fixed after training and thus they do not adapt to changes in the visual input. Inspired by recent work on hybrid visual transformers with convolutions and hierarchical transformers, we propose Convolutional Swin-Unet (CS-Unet) transformer blocks and optimise their settings with respect to patch embedding, projection, the feed-forward network, upsampling and skip connections. CS-Unet can be trained from scratch and inherits the superiority of convolutions in each feature-processing phase. It helps to encode precise spatial information and produce hierarchical representations that contribute to object concepts at various scales. Experiments show that CS-Unet without pre-training surpasses other state-of-the-art counterparts by large margins on two medical image datasets (CT and MRI) with fewer parameters. In addition, two domain-adaptation experiments on optic disc and polyp image segmentation further prove that our method is highly generalizable and effectively bridges the domain gap between images from different sources.
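As a hedged sketch of the general convolution-in-attention design this abstract alludes to (not the exact CS-Unet block), queries, keys and values can be produced by depthwise convolutions over the 2D token grid instead of plain linear layers, so local spatial structure is encoded before self-attention. Layer names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class ConvAttentionBlock(nn.Module):
    def __init__(self, dim=96, heads=3):
        super().__init__()
        # depthwise 3x3 convolution produces Q, K, V with local spatial context
        self.qkv_conv = nn.Conv2d(dim, 3 * dim, kernel_size=3, padding=1, groups=dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        # x: (batch, dim, H, W) feature map
        b, c, h, w = x.shape
        q, k, v = self.qkv_conv(x).chunk(3, dim=1)
        flat = lambda t: t.flatten(2).transpose(1, 2)     # (batch, H*W, dim)
        tokens = flat(x)
        attn_out, _ = self.attn(flat(q), flat(k), flat(v))
        tokens = self.norm1(tokens + attn_out)
        tokens = self.norm2(tokens + self.ffn(tokens))
        return tokens.transpose(1, 2).reshape(b, c, h, w)

out = ConvAttentionBlock()(torch.randn(1, 96, 14, 14))
```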
Analysing human motion is a core topic of interest for many disciplines, from Human-Computer Interaction, to entertainment, Virtual Reality and healthcare. Deep learning has achieved impressive results in capturing human pose in real time. On the other hand, owing to high inter-subject variability and the very limited specialised datasets available in fields such as healthcare, human motion analysis models often fail to generalise to data from unseen subjects. Acquiring human motion datasets is, moreover, highly time-consuming, challenging and expensive. Hence, human motion synthesis is a crucial research problem within deep learning and computer vision. We present a novel method for controllable human motion synthesis that applies attention-based probabilistic deep adversarial models with end-to-end training. We show that we can generate synthetic human motion over both short and long time horizons through the use of adversarial attention. Furthermore, we show that we can improve the classification performance of deep learning models in cases where real data are inadequate, by supplementing existing datasets with synthetic motions.
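A minimal, assumption-laden sketch of this kind of adversarial set-up (not the paper's exact architecture): a generator with temporal self-attention maps noise to a pose sequence and a discriminator scores real versus synthetic motion. Joint count, sequence length and losses below are illustrative choices.

```python
import torch
import torch.nn as nn

JOINTS, FRAMES, NOISE = 17 * 3, 60, 64   # assumed 17 joints x 3 coordinates, 60 frames

class MotionGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.inp = nn.Linear(NOISE, 128)
        self.attn = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True), num_layers=2)
        self.out = nn.Linear(128, JOINTS)

    def forward(self, z):                         # z: (batch, FRAMES, NOISE)
        return self.out(self.attn(self.inp(z)))   # (batch, FRAMES, JOINTS)

gen = MotionGenerator()
disc = nn.Sequential(nn.Flatten(), nn.Linear(FRAMES * JOINTS, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

# one toy adversarial step with standard GAN losses
real = torch.randn(8, FRAMES, JOINTS)
fake = gen(torch.randn(8, FRAMES, NOISE))
bce = nn.BCEWithLogitsLoss()
d_loss = bce(disc(real), torch.ones(8, 1)) + bce(disc(fake.detach()), torch.zeros(8, 1))
g_loss = bce(disc(fake), torch.ones(8, 1))
```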
Human motion is a biomarker used extensively in clinical analysis to monitor the progression of neurological diseases and mood disorders. Since perceptions of emotions are also interleaved with body posture and movements, emotion recognition from human gait can be used to quantitatively monitor mood changes that are often related to neurological disorders. Many existing solutions use shallow machine learning models with raw positional data or manually extracted features to achieve this. However, gait is composed of many highly expressive characteristics that can be used to identify human subjects, and most solutions fail to address this, disregarding the subject's privacy. This work evaluates the effectiveness of existing methods at recognising emotions using both 3D temporal joint signals and manually extracted features. We also show that this data can easily be exploited to expose a subject's identity. To this end, we propose a cross-subject transfer learning technique for training a multi-encoder autoencoder deep neural network to learn disentangled latent representations of human motion features. By disentangling subject biometrics from the gait data, we show that the subject's privacy is preserved while affect recognition performance surpasses that of traditional methods.
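An illustrative sketch of the multi-encoder autoencoder idea (assumed structure, not the trained model): one encoder captures subject-identity factors, another captures affect-related motion factors, a shared decoder reconstructs the gait sequence from both, and only the affect code feeds the emotion classifier. Dimensions and names are assumptions.

```python
import torch
import torch.nn as nn

class DisentangledGaitAE(nn.Module):
    def __init__(self, in_dim=51, ident_dim=16, affect_dim=16):
        super().__init__()
        self.identity_enc = nn.GRU(in_dim, ident_dim, batch_first=True)   # biometric factors
        self.affect_enc = nn.GRU(in_dim, affect_dim, batch_first=True)    # affect factors
        self.decoder = nn.GRU(ident_dim + affect_dim, in_dim, batch_first=True)
        self.emotion_head = nn.Linear(affect_dim, 4)  # e.g. happy/sad/angry/neutral

    def forward(self, gait):                          # gait: (batch, frames, in_dim)
        ident, _ = self.identity_enc(gait)
        affect, _ = self.affect_enc(gait)
        recon, _ = self.decoder(torch.cat([ident, affect], dim=-1))
        # emotion is predicted from the affect code only, so identity cues can be
        # suppressed by additional objectives (e.g. cross-subject transfer)
        return recon, self.emotion_head(affect.mean(dim=1))

recon, emotion_logits = DisentangledGaitAE()(torch.randn(2, 60, 51))
```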
With the increasing awareness of high-quality life, there is a growing need for health monitoring devices running robust algorithms in the home environment. Health monitoring technologies enable real-time analysis of users' health status, offering long-term healthcare support and reducing hospitalization time. The purpose of this work is twofold. On the software side, we focus on the analysis of gait, which is widely adopted for joint correction and assessing any lower limb or spinal problem. On the hardware side, we design a novel marker-less gait analysis device using a low-cost RGB camera mounted on a mobile tele-robot. As gait analysis with a single camera is much more challenging than in previous works utilizing multiple cameras, an RGB-D camera or wearable sensors, we propose using vision-based human pose estimation approaches. More specifically, based on the output of two state-of-the-art human pose estimation models (Openpose and VNect), we devise measurements for four bespoke gait parameters: inversion/eversion, dorsiflexion/plantarflexion, and ankle and foot progression angles. We thereby classify walking patterns into normal, supination, pronation and limp. We also illustrate how to run the proposed machine learning models in low-resource environments such as a single entry-level CPU. Experiments show that our single RGB camera method achieves competitive performance compared to state-of-the-art methods based on depth cameras or multi-camera motion capture systems, at smaller hardware costs.
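A toy sketch (our own illustration, not the deployed pipeline) of turning 2D pose-estimator keypoints into one of the gait parameters mentioned above, the foot progression angle, i.e. the angle between the foot axis (heel to toe) and the direction of walking. The keypoint names and walking-direction input are assumptions.

```python
import numpy as np

def foot_progression_angle(heel, toe, walking_direction):
    # heel, toe: (x, y) keypoints from e.g. OpenPose; walking_direction: 2D vector along the gait line
    foot_axis = np.asarray(toe, dtype=float) - np.asarray(heel, dtype=float)
    foot_axis /= np.linalg.norm(foot_axis)
    walking_direction = np.asarray(walking_direction, dtype=float)
    walking_direction /= np.linalg.norm(walking_direction)
    cos_angle = np.clip(foot_axis @ walking_direction, -1.0, 1.0)
    return np.degrees(np.arccos(cos_angle))   # 0 degrees = foot aligned with the direction of walking

# example: toe pointing slightly outward while walking along the x-axis
print(foot_progression_angle(heel=(100, 200), toe=(140, 190), walking_direction=(1, 0)))
```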
The recent successes of AI have captured the imagination of both the scientific community and the general public. Robotics and AI amplify human potential, increase productivity and are moving from simple reasoning towards human-like cognitive abilities. Current AI technologies are used across a wide range of application areas, from healthcare, manufacturing, transport and energy, to financial services, banking, advertising, management consulting and government agencies. The global AI market was worth around 260 billion USD in 2016 and is estimated to exceed 3 trillion USD by 2024. To understand the impact of AI, it is important to draw lessons from its past successes and failures. This white paper provides a comprehensive explanation of the evolution of AI, its current status and future directions.