The success of deep learning-based generative models in producing realistic images, videos, and audios has led to a crucial consideration: how to effectively assess the quality of synthetic samples. While the Fr\'{e}chet Inception Distance (FID) serves as the standard metric for evaluating generative models in image synthesis, a comparable metric for time series data is notably absent. This gap in assessment capabilities stems from the absence of a widely accepted feature vector extractor pre-trained on benchmark time series datasets. In addressing these challenges related to assessing the quality of time series, particularly in the context of Fr\'echet Distance, this work proposes a novel solution leveraging the Fourier transform and Auto-encoder, termed the Fr\'{e}chet Fourier-transform Auto-encoder Distance (FFAD). Through our experimental results, we showcase the potential of FFAD for effectively distinguishing samples from different classes. This novel metric emerges as a fundamental tool for the evaluation of generative time series data, contributing to the ongoing efforts of enhancing assessment methodologies in the realm of deep learning-based generative models.
This study aims to evaluate the performance of deep learning models in predicting $\geq$M-class solar flares with a prediction window of 24 hours, using hourly sampled full-disk line-of-sight (LoS) magnetogram images, particularly focusing on the often overlooked flare events corresponding to the near-limb regions (beyond $\pm$70$^{\circ}$ of the solar disk). We trained three well-known deep learning architectures--AlexNet, VGG16, and ResNet34 using transfer learning and compared and evaluated the overall performance of our models using true skill statistics (TSS) and Heidke skill score (HSS) and computed recall scores to understand the prediction sensitivity in central and near-limb regions for both X- and M-class flares. The following points summarize the key findings of our study: (1) The highest overall performance was observed with the AlexNet-based model, which achieved an average TSS$\sim$0.53 and HSS$\sim$0.37; (2) Further, a spatial analysis of recall scores disclosed that for the near-limb events, the VGG16- and ResNet34-based models exhibited superior prediction sensitivity. The best results, however, were seen with the ResNet34-based model for the near-limb flares, where the average recall was approximately 0.59 (the recall for X- and M-class was 0.81 and 0.56 respectively) and (3) Our research findings demonstrate that our models are capable of discerning complex spatial patterns from full-disk magnetograms and exhibit skill in predicting solar flares, even in the vicinity of near-limb regions. This ability holds substantial importance for operational flare forecasting systems.
Solar flare prediction is a central problem in space weather forecasting and recent developments in machine learning and deep learning accelerated the adoption of complex models for data-driven solar flare forecasting. In this work, we developed an attention-based deep learning model as an improvement over the standard convolutional neural network (CNN) pipeline to perform full-disk binary flare predictions for the occurrence of $\geq$M1.0-class flares within the next 24 hours. For this task, we collected compressed images created from full-disk line-of-sight (LoS) magnetograms. We used data-augmented oversampling to address the class imbalance issue and used true skill statistic (TSS) and Heidke skill score (HSS) as the evaluation metrics. Furthermore, we interpreted our model by overlaying attention maps on input magnetograms and visualized the important regions focused on by the model that led to the eventual decision. The significant findings of this study are: (i) We successfully implemented an attention-based full-disk flare predictor ready for operational forecasting where the candidate model achieves an average TSS=0.54$\pm$0.03 and HSS=0.37$\pm$0.07. (ii) we demonstrated that our full-disk model can learn conspicuous features corresponding to active regions from full-disk magnetogram images, and (iii) our experimental evaluation suggests that our model can predict near-limb flares with adept skill and the predictions are based on relevant active regions (ARs) or AR characteristics from full-disk magnetograms.
This study progresses solar flare prediction research by presenting a full-disk deep-learning model to forecast $\geq$M-class solar flares and evaluating its efficacy on both central (within $\pm$70$^\circ$) and near-limb (beyond $\pm$70$^\circ$) events, showcasing qualitative assessment of post hoc explanations for the model's predictions, and providing empirical findings from human-centered quantitative assessments of these explanations. Our model is trained using hourly full-disk line-of-sight magnetogram images to predict $\geq$M-class solar flares within the subsequent 24-hour prediction window. Additionally, we apply the Guided Gradient-weighted Class Activation Mapping (Guided Grad-CAM) attribution method to interpret our model's predictions and evaluate the explanations. Our analysis unveils that full-disk solar flare predictions correspond with active region characteristics. The following points represent the most important findings of our study: (1) Our deep learning models achieved an average true skill statistic (TSS) of $\sim$0.51 and a Heidke skill score (HSS) of $\sim$0.38, exhibiting skill to predict solar flares where for central locations the average recall is $\sim$0.75 (recall values for X- and M-class are 0.95 and 0.73 respectively) and for the near-limb flares the average recall is $\sim$0.52 (recall values for X- and M-class are 0.74 and 0.50 respectively); (2) qualitative examination of the model's explanations reveals that it discerns and leverages features linked to active regions in both central and near-limb locations within full-disk magnetograms to produce respective predictions. In essence, our models grasp the shape and texture-based properties of flaring active regions, even in proximity to limb areas -- a novel and essential capability with considerable significance for operational forecasting systems.
This paper presents a post hoc analysis of a deep learning-based full-disk solar flare prediction model. We used hourly full-disk line-of-sight magnetogram images and selected binary prediction mode to predict the occurrence of $\geq$M1.0-class flares within 24 hours. We leveraged custom data augmentation and sample weighting to counter the inherent class-imbalance problem and used true skill statistic and Heidke skill score as evaluation metrics. Recent advancements in gradient-based attention methods allow us to interpret models by sending gradient signals to assign the burden of the decision on the input features. We interpret our model using three post hoc attention methods: (i) Guided Gradient-weighted Class Activation Mapping, (ii) Deep Shapley Additive Explanations, and (iii) Integrated Gradients. Our analysis shows that full-disk predictions of solar flares align with characteristics related to the active regions. The key findings of this study are: (1) We demonstrate that our full disk model can tangibly locate and predict near-limb solar flares, which is a critical feature for operational flare forecasting, (2) Our candidate model achieves an average TSS=0.51$\pm$0.05 and HSS=0.38$\pm$0.08, and (3) Our evaluation suggests that these models can learn conspicuous features corresponding to active regions from full-disk magnetograms.
This paper contributes to the growing body of research on deep learning methods for solar flare prediction, primarily focusing on highly overlooked near-limb flares and utilizing the attribution methods to provide a post hoc qualitative explanation of the model's predictions. We present a solar flare prediction model, which is trained using hourly full-disk line-of-sight magnetogram images and employs a binary prediction mode to forecast $\geq$M-class flares that may occur within the following 24-hour period. To address the class imbalance, we employ a fusion of data augmentation and class weighting techniques; and evaluate the overall performance of our model using the true skill statistic (TSS) and Heidke skill score (HSS). Moreover, we applied three attribution methods, namely Guided Gradient-weighted Class Activation Mapping, Integrated Gradients, and Deep Shapley Additive Explanations, to interpret and cross-validate our model's predictions with the explanations. Our analysis revealed that full-disk prediction of solar flares aligns with characteristics related to active regions (ARs). In particular, the key findings of this study are: (1) our deep learning models achieved an average TSS=0.51 and HSS=0.35, and the results further demonstrate a competent capability to predict near-limb solar flares and (2) the qualitative analysis of the model explanation indicates that our model identifies and uses features associated with ARs in central and near-limb locations from full-disk magnetograms to make corresponding predictions. In other words, our models learn the shape and texture-based characteristics of flaring ARs even at near-limb areas, which is a novel and critical capability with significant implications for operational forecasting.
The class-imbalance issue is intrinsic to many real-world machine learning tasks, particularly to the rare-event classification problems. Although the impact and treatment of imbalanced data is widely known, the magnitude of a metric's sensitivity to class imbalance has attracted little attention. As a result, often the sensitive metrics are dismissed while their sensitivity may only be marginal. In this paper, we introduce an intuitive evaluation framework that quantifies metrics' sensitivity to the class imbalance. Moreover, we reveal an interesting fact that there is a logarithmic behavior in metrics' sensitivity meaning that the higher imbalance ratios are associated with the lower sensitivity of metrics. Our framework builds an intuitive understanding of the class-imbalance impact on metrics. We believe this can help avoid many common mistakes, specially the less-emphasized and incorrect assumption that all metrics' quantities are comparable under different class-imbalance ratios.
Solar flares not only pose risks to outer space technologies and astronauts' well being, but also cause disruptions on earth to our hight-tech, interconnected infrastructure our lives highly depend on. While a number of machine-learning methods have been proposed to improve flare prediction, none of them, to the best of our knowledge, have investigated the impact of outliers on the reliability and those models' performance. In this study, we investigate the impact of outliers in a multivariate time series benchmark dataset, namely SWAN-SF, on flare prediction models, and test our hypothesis. That is, there exist outliers in SWAN-SF, removal of which enhances the performance of the prediction models on unseen datasets. We employ Isolation Forest to detect the outliers among the weaker flare instances. Several experiments are carried out using a large range of contamination rates which determine the percentage of present outliers. We asses the quality of each dataset in terms of its actual contamination using TimeSeriesSVC. In our best finding, we achieve a 279% increase in True Skill Statistic and 68% increase in Heidke Skill Score. The results show that overall a significant improvement can be achieved to flare prediction if outliers are detected and removed properly.
The Space-Weather ANalytics for Solar Flares (SWAN-SF) is a multivariate time series benchmark dataset recently created to serve the heliophysics community as a testbed for solar flare forecasting models. SWAN-SF contains 54 unique features, with 24 quantitative features computed from the photospheric magnetic field maps of active regions, describing their precedent flare activity. In this study, for the first time, we systematically attacked the problem of quantifying the relevance of these features to the ambitious task of flare forecasting. We implemented an end-to-end pipeline for preprocessing, feature selection, and evaluation phases. We incorporated 24 Feature Subset Selection (FSS) algorithms, including multivariate and univariate, supervised and unsupervised, wrappers and filters. We methodologically compared the results of different FSS algorithms, both on the multivariate time series and vectorized formats, and tested their correlation and reliability, to the extent possible, by using the selected features for flare forecasting on unseen data, in univariate and multivariate fashions. We concluded our investigation with a report of the best FSS methods in terms of their top-k features, and the analysis of the findings. We wish the reproducibility of our study and the availability of the data allow the future attempts be comparable with our findings and themselves.
General-purpose object-detection algorithms often dismiss the fine structure of detected objects. This can be traced back to how their proposed regions are evaluated. Our goal is to renegotiate the trade-off between the generality of these algorithms and their coarse detections. In this work, we present a new metric that is a marriage of a popular evaluation metric, namely Intersection over Union (IoU), and a geometrical concept, called fractal dimension. We propose Multiscale IoU (MIoU) which allows comparison between the detected and ground-truth regions at multiple resolution levels. Through several reproducible examples, we show that MIoU is indeed sensitive to the fine boundary structures which are completely overlooked by IoU and f1-score. We further examine the overall reliability of MIoU by comparing its distribution with that of IoU on synthetic and real-world datasets of objects. We intend this work to re-initiate exploration of new evaluation methods for object-detection algorithms.