Taesup Moon


SwiFT: Swin 4D fMRI Transformer

Jul 12, 2023
Peter Yongho Kim, Junbeom Kwon, Sunghwan Joo, Sangyoon Bae, Donggyu Lee, Yoonho Jung, Shinjae Yoo, Jiook Cha, Taesup Moon


Modeling spatiotemporal brain dynamics from high-dimensional data, such as 4D functional MRI, is a formidable task in neuroscience. To address this challenge, we present SwiFT (Swin 4D fMRI Transformer), a Swin Transformer architecture that can learn brain dynamics directly from 4D functional brain MRI data in a memory- and computation-efficient manner. SwiFT achieves this by implementing a 4D window multi-head self-attention mechanism and absolute positional embeddings. We evaluate SwiFT on several of the largest-scale human functional brain imaging datasets, on tasks such as predicting sex, age, and cognitive intelligence. Our experimental results show that SwiFT consistently outperforms recent state-of-the-art models. To the best of our knowledge, SwiFT is the first Swin Transformer architecture that can process 4-dimensional spatiotemporal brain functional data in an end-to-end fashion. Furthermore, owing to this end-to-end learning capability, we show that contrastive loss-based self-supervised pre-training of SwiFT is feasible and improves performance on a downstream task. We believe that our work holds substantial potential for facilitating scalable learning of functional brain imaging in neuroscience research by reducing the hurdles associated with applying Transformer models to high-dimensional fMRI.
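The 4D window attention at the heart of SwiFT can be pictured with a short sketch. The following is a minimal, hypothetical illustration (not the authors' implementation): it partitions a grid of 4D patch tokens into non-overlapping 4D windows and runs standard multi-head self-attention within each window; all shapes and the window size are assumptions.

```python
# Minimal sketch of 4D (space + time) window self-attention, in the spirit of
# SwiFT's 4D window MSA. Shapes and window size are illustrative, not the
# authors' configuration.
import torch
import torch.nn as nn

def window_partition_4d(x, w):
    # x: (B, H, W, D, T, C); w: window size per axis, e.g. (4, 4, 4, 4)
    B, H, W, D, T, C = x.shape
    x = x.view(B, H // w[0], w[0], W // w[1], w[1],
               D // w[2], w[2], T // w[3], w[3], C)
    # gather each window's tokens into one row: (num_windows*B, tokens, C)
    x = x.permute(0, 1, 3, 5, 7, 2, 4, 6, 8, 9)
    return x.reshape(-1, w[0] * w[1] * w[2] * w[3], C)

class Window4DAttention(nn.Module):
    def __init__(self, dim=32, heads=4, window=(4, 4, 4, 4)):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, H, W, D, T, C) patch embeddings of a 4D fMRI volume
        windows = window_partition_4d(x, self.window)
        out, _ = self.attn(windows, windows, windows)  # attend within windows
        return out

if __name__ == "__main__":
    x = torch.randn(1, 8, 8, 8, 8, 32)          # toy 4D token grid, C=32
    print(Window4DAttention()(x).shape)          # torch.Size([16, 256, 32])
```

Restricting attention to local 4D windows is what keeps memory linear in the number of tokens rather than quadratic, which is the point of using a Swin-style design on fMRI.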


Sy-CON: Symmetric Contrastive Loss for Continual Self-Supervised Representation Learning

Jun 08, 2023
Sungmin Cha, Taesup Moon


We introduce a novel and general loss function, called Symmetric Contrastive (Sy-CON) loss, for effective continual self-supervised learning (CSSL). We first argue that the conventional loss form of continual learning, which consists of a single task-specific loss (for plasticity) and a regularizer (for stability), may not be ideal for contrastive loss-based CSSL that focuses on representation learning. Our reasoning is that, in contrastive learning-based methods, the task-specific loss suffers from the decreasing diversity of negative samples, while the regularizer may hinder learning new distinctive representations. To that end, we propose Sy-CON, which consists of two losses (one for plasticity and the other for stability) with symmetric dependence on the negative-sample embeddings of the current and past models. We argue that our model can naturally find a good trade-off between plasticity and stability without any explicit hyperparameter tuning. We validate the effectiveness of our approach through extensive experiments, demonstrating that a MoCo-based implementation of the Sy-CON loss achieves superior performance compared to other state-of-the-art CSSL methods.

* Preprint 
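To make the symmetric structure concrete, here is a minimal sketch of one plausible instantiation: two InfoNCE-style terms in which the current and past models swap roles as the source of negatives. The specific positive/negative pairing below is an assumption made for illustration, not the paper's exact definition.

```python
# Illustrative sketch of a symmetric contrastive loss for continual
# self-supervised learning: two InfoNCE-style terms, each drawing its
# negatives from the *other* model's embeddings. The exact construction in
# Sy-CON may differ; this only shows the symmetric structure.
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, tau=0.2):
    # anchor, positive: (B, d); negatives: (N, d); all rows L2-normalized
    pos = (anchor * positive).sum(-1, keepdim=True) / tau    # (B, 1)
    neg = anchor @ negatives.t() / tau                       # (B, N)
    logits = torch.cat([pos, neg], dim=1)
    labels = torch.zeros(len(anchor), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)

def sycon_loss(z1, z2, z1_past, tau=0.2):
    # z1, z2: current model's embeddings of two augmented views, (B, d);
    # z1_past: frozen past model's embeddings of the first view, (B, d)
    z1, z2, z1_past = (F.normalize(z, dim=-1) for z in (z1, z2, z1_past))
    # plasticity: learn the new task, with past-model embeddings as negatives
    loss_plastic = info_nce(z1, z2, z1_past, tau)
    # stability: stay close to the past model, with current-model negatives
    loss_stable = info_nce(z1, z1_past, z2, tau)
    return loss_plastic + loss_stable
```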

Continual Learning in the Presence of Spurious Correlation

Mar 21, 2023
Donggyu Lee, Sangwon Jung, Taesup Moon


Most continual learning (CL) algorithms have focused on tackling the stability-plasticity dilemma, that is, the challenge of preventing the forgetting of previous tasks while learning new ones. However, they have overlooked the impact of knowledge transfer when the dataset for a certain task is biased - namely, when the model learns unintended spurious correlations from the biased dataset. In that case, how do such biases affect learning future tasks or the knowledge already learned from past tasks? In this work, we carefully design systematic experiments using one synthetic and two real-world datasets to answer this question from our empirical findings. Specifically, we first show through two-task CL experiments that standard CL methods, which are unaware of dataset bias, can transfer bias from one task to another, both forward and backward, and that this transfer is exacerbated depending on whether the CL method focuses on stability or on plasticity. We then show that the bias transfer also exists, and even accumulates, in longer sequences of tasks. Finally, we propose a simple yet strong plug-in method for debiasing-aware continual learning, dubbed Group-class Balanced Greedy Sampling (BGS). As a result, we show that BGS can always reduce the bias of a CL model, at the cost of at most a slight loss in CL performance.
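As a concrete picture of what group-class balanced sampling could look like, the sketch below greedily trims the largest (class, group) cell of a replay buffer until it fits capacity. The data layout and the greedy eviction rule are assumptions for illustration; the paper's BGS procedure may differ in detail.

```python
# Hypothetical sketch of group-class balanced greedy sampling for a replay
# buffer: repeatedly evict one random sample from whichever (class, group)
# cell is currently largest, so the kept buffer is as balanced as possible.
import random
from collections import defaultdict

def bgs_buffer(samples, capacity):
    # samples: list of dicts with keys 'x' (input), 'y' (class), 'g' (group)
    cells = defaultdict(list)
    for s in samples:
        cells[(s['y'], s['g'])].append(s)
    total = sum(len(v) for v in cells.values())
    while total > capacity:
        largest = max(cells, key=lambda k: len(cells[k]))   # biggest cell
        cells[largest].pop(random.randrange(len(cells[largest])))
        total -= 1
    return [s for v in cells.values() for s in v]

# toy usage: a buffer of 100 samples kept balanced over 2 classes x 2 groups
data = [{'x': i, 'y': i % 2, 'g': (i // 2) % 2} for i in range(500)]
print(len(bgs_buffer(data, 100)))  # 100
```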


Re-weighting Based Group Fairness Regularization via Classwise Robust Optimization

Mar 01, 2023
Sangwon Jung, Taeeon Park, Sanghyuk Chun, Taesup Moon


Many existing group fairness-aware training methods aim to achieve group fairness either by re-weighting underrepresented groups based on certain rules or by using weakly approximated surrogates for the fairness metrics as regularization terms in the objective. Although each learning scheme has its own strength in terms of applicability or performance, it is difficult for any method in either category to be considered a gold standard, since their success is typically limited to specific cases. To that end, we propose a principled method, dubbed FairDRO, which unifies the two learning schemes by incorporating a well-justified group fairness metric into the training objective via a classwise distributionally robust optimization (DRO) framework. We then develop an iterative optimization algorithm that minimizes the resulting objective by automatically producing the correct re-weights for each group. Our experiments show that FairDRO is scalable and easily adaptable to diverse applications, and consistently achieves state-of-the-art performance on several benchmark datasets in terms of the accuracy-fairness trade-off, compared to recent strong baselines.
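A minimal sketch of the classwise DRO re-weighting idea follows: within each class, a distribution over groups is updated multiplicatively toward the highest-loss groups, and training then uses the re-weighted losses. The exponentiated-gradient update and the step size `eta` are assumptions, not necessarily the paper's exact algorithm.

```python
# Sketch of classwise DRO re-weighting: each class keeps its own distribution
# over groups, nudged toward high-loss groups; the re-weights then multiply
# the per-(class, group) losses in the training objective.
import torch

def classwise_dro_weights(group_losses, weights, eta=0.5):
    # group_losses, weights: (num_classes, num_groups); each row of `weights`
    # is a probability distribution over groups for that class
    w = weights * torch.exp(eta * group_losses)   # upweight high-loss groups
    return w / w.sum(dim=1, keepdim=True)         # renormalize per class

# toy usage: class 0's group 1 has the highest loss, so it gains weight
losses = torch.tensor([[0.2, 1.5], [0.7, 0.6]])
w = torch.full((2, 2), 0.5)
print(classwise_dro_weights(losses, w))
```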


Learning to Unlearn: Instance-wise Unlearning for Pre-trained Classifiers

Jan 27, 2023
Sungmin Cha, Sungjun Cho, Dasol Hwang, Honglak Lee, Taesup Moon, Moontae Lee


Since the recent advent of regulations for data protection (e.g., the General Data Protection Regulation), there has been increasing demand for deleting information learned from sensitive data in pre-trained models without retraining from scratch. The inherent vulnerability of neural networks to adversarial attacks and unfairness also calls for a robust method to remove or correct information in an instance-wise fashion, while retaining predictive performance on the remaining data. To this end, we define instance-wise unlearning, whose goal is to delete the information on a set of instances from a pre-trained model, by either misclassifying each instance away from its original prediction or relabeling the instance to a different label. We also propose two methods that reduce forgetting on the remaining data: 1) utilizing adversarial examples to overcome forgetting at the representation level, and 2) leveraging weight importance metrics to pinpoint network parameters guilty of propagating unwanted information. Both methods require only the pre-trained model and the data instances to forget, allowing painless application to real-life settings where the entire training set is unavailable. Through extensive experimentation on various image classification benchmarks, we show that our approach effectively preserves knowledge of the remaining data while unlearning given instances in both single-task and continual unlearning scenarios.

* Preprint 
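The following sketch illustrates one way these ingredients could combine: a gradient step that pushes forget instances away from their original labels while preserving predictions on FGSM proxies crafted near them. The loss weighting, the FGSM step, and the use of proxy labels are illustrative assumptions rather than the paper's exact method.

```python
# Hypothetical unlearning step: (1) misclassify forget instances away from
# their original labels; (2) keep predictions stable on adversarial proxies
# crafted near them, a stand-in for the representation-level retention idea
# when the remaining training data is unavailable.
import torch
import torch.nn.functional as F

def unlearn_step(model, opt, x_forget, y_orig, lam=1.0, eps=0.03):
    # craft FGSM proxies near the forget instances
    x = x_forget.clone().requires_grad_(True)
    g = torch.autograd.grad(F.cross_entropy(model(x), y_orig), x)[0]
    x_adv = (x + eps * g.sign()).detach()
    with torch.no_grad():
        y_adv = model(x_adv).argmax(1)        # predictions to preserve

    opt.zero_grad()
    # unlearning term: push forget instances away from their original labels
    loss_forget = -F.cross_entropy(model(x_forget), y_orig)
    # retention term: keep the nearby proxies classified as before
    loss_retain = F.cross_entropy(model(x_adv), y_adv)
    (loss_forget + lam * loss_retain).backward()
    opt.step()
    return loss_forget.item(), loss_retain.item()
```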

Towards More Robust Interpretation via Local Gradient Alignment

Dec 07, 2022
Sunghwan Joo, Seokhyeon Jeong, Juyeon Heo, Adrian Weller, Taesup Moon


Neural network interpretation methods, particularly feature attribution methods, are known to be fragile with respect to adversarial input perturbations. To address this, several methods that enhance the local smoothness of the gradient during training have been proposed for attaining \textit{robust} feature attributions. However, the lack of consideration of the normalization of attributions, which is essential for their visualization, has been an obstacle to understanding and improving the robustness of feature attribution methods. In this paper, we provide new insights by taking such normalization into account. First, we show that for every non-negative homogeneous neural network, a naive $\ell_2$-robust criterion for gradients is \textit{not} normalization invariant, which means that two functions with the same normalized gradient can have different values of the criterion. Second, we formulate a normalization-invariant cosine distance-based criterion and derive its upper bound, which gives insight into why simply minimizing the Hessian norm at the input, as has been done in previous work, is not sufficient for attaining robust feature attributions. Finally, we propose to combine both the $\ell_2$ and the cosine distance-based criteria as regularization terms to leverage the advantages of both in aligning the local gradient. As a result, we experimentally show that models trained with our method produce much more robust interpretations on CIFAR-10 and ImageNet-100 without significantly hurting accuracy, compared to recent baselines. To the best of our knowledge, this is the first work to verify the robustness of interpretation on a larger-scale dataset beyond CIFAR-10, thanks to the computational efficiency of our method.

* 22 pages (9 pages in paper, 13 pages in Appendix), 9 figures, 6 tables. Accepted at AAAI 2023 (Association for the Advancement of Artificial Intelligence)
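A minimal sketch of the combined regularizer reads as follows: it penalizes both the $\ell_2$ distance and the cosine distance between the input gradients at a point and at a random nearby point, encouraging locally aligned (hence robust) attributions. The perturbation scale and the loss weights are assumptions for illustration.

```python
# Sketch of a combined l2 + cosine gradient-alignment regularizer, added to
# the task loss during training. Perturbation scale and weights are toy values.
import torch
import torch.nn.functional as F

def input_grad(model, x, y):
    # gradient of the loss w.r.t. the input, kept differentiable so the
    # regularizer itself can be backpropagated (create_graph=True)
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    return torch.autograd.grad(loss, x, create_graph=True)[0].flatten(1)

def grad_align_reg(model, x, y, sigma=0.05, l2_w=1.0, cos_w=1.0):
    g = input_grad(model, x, y)
    g_pert = input_grad(model, x + sigma * torch.randn_like(x), y)
    l2_term = (g - g_pert).norm(dim=1).pow(2).mean()              # l2 criterion
    cos_term = (1 - F.cosine_similarity(g, g_pert, dim=1)).mean() # cosine criterion
    return l2_w * l2_term + cos_w * cos_term  # add this to the task loss
```

The cosine term is what stays invariant to attribution normalization, while the $\ell_2$ term controls the raw gradient magnitude; combining them is the abstract's stated design choice.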

GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language Pre-training

Aug 08, 2022
Jaeseok Byun, Taebaek Hwang, Jianlong Fu, Taesup Moon


Most existing vision-and-language pre-training (VLP) methods have mainly focused on how to extract and align vision and text features. In contrast to the mainstream VLP methods, we highlight that two routinely applied steps during pre-training have a crucial impact on the performance of the pre-trained model: in-batch hard negative sampling for image-text matching (ITM) and assigning a large masking probability for masked language modeling (MLM). After empirically showing the unexpected effectiveness of the above two steps, we systematically devise GRIT-VLP, which adaptively samples mini-batches for more effective mining of hard negative samples for ITM while maintaining the computational cost of pre-training. Our method consists of three components: 1) a GRouped mIni-baTch sampling (GRIT) strategy that collects similar examples in a mini-batch, 2) an ITC consistency loss for improving the mining ability, and 3) an enlarged masking probability for MLM. Consequently, we show that GRIT-VLP achieves a new state-of-the-art performance on various downstream tasks with much less computational cost. Furthermore, we demonstrate that our model is essentially on par with ALBEF, the previous state of the art, with only one-third of the training epochs on the same training data. Code is available at https://github.com/jaeseokbyun/GRIT-VLP.
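The grouping step can be pictured with a simple greedy nearest-neighbor chain over example embeddings, cut into consecutive mini-batches so that similar (hence hard-negative-rich) examples land in the same batch. This stands in for GRIT's actual grouping procedure; the chaining rule below is an assumption, not the paper's algorithm.

```python
# Sketch of grouped mini-batch sampling: order examples so consecutive ones
# are similar in embedding space, then slice the ordering into mini-batches,
# concentrating hard negatives for ITM inside each batch.
import torch
import torch.nn.functional as F

def grouped_batches(embeddings, batch_size):
    # embeddings: (N, d), e.g. from the image-text contrastive (ITC) encoder
    emb = F.normalize(embeddings, dim=1)
    sim = emb @ emb.t()                  # cosine similarities
    n = len(emb)
    order, cur = [0], 0
    sim[:, 0] = -float('inf')            # mark index 0 as visited
    for _ in range(n - 1):
        nxt = sim[cur].argmax().item()   # most similar unvisited example
        sim[:, nxt] = -float('inf')
        order.append(nxt)
        cur = nxt
    return [order[i:i + batch_size] for i in range(0, n, batch_size)]

# toy usage: 256 examples grouped into batches of 32 similar items
batches = grouped_batches(torch.randn(256, 64), 32)
print(len(batches), len(batches[0]))  # 8 32
```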


Descent Steps of a Relation-Aware Energy Produce Heterogeneous Graph Neural Networks

Jun 24, 2022
Hongjoon Ahn, Yongyi Yang, Quan Gan, David Wipf, Taesup Moon


Heterogeneous graph neural networks (GNNs) achieve strong performance on node classification tasks in the semi-supervised learning setting. However, as in the simpler homogeneous GNN case, message-passing-based heterogeneous GNNs may struggle to balance resisting the oversmoothing that occurs in deep models against capturing long-range dependencies in graph-structured data. Moreover, the complexity of this trade-off is compounded in the heterogeneous graph case due to the disparate heterophily relationships between nodes of different types. To address these issues, we propose a novel heterogeneous GNN architecture in which layers are derived from optimization steps that descend a novel relation-aware energy function. The corresponding minimizer is fully differentiable with respect to the energy function parameters, so that bilevel optimization can be applied to effectively learn a functional form whose minimum provides optimal node representations for subsequent classification tasks. In particular, this methodology allows us to model diverse heterophily relationships between different node types while avoiding oversmoothing effects. Experimental results on 8 heterogeneous graph benchmarks demonstrate that our proposed method achieves competitive node classification accuracy.
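The "layers as descent steps" idea admits a compact sketch: define a relation-aware energy with one Laplacian term per relation type and unroll gradient descent on it as the network's forward pass, with learnable per-relation weights. The quadratic energy below is a simplified assumption; the paper's energy function is richer.

```python
# Sketch: unroll gradient descent on a toy relation-aware energy
#   E(Y) = ||Y - f(X)||^2 + sum_r lambda_r * tr(Y^T L_r Y)
# as a GNN forward pass; each descent step plays the role of one layer, and
# the per-relation weights lambda_r are the learnable parameters.
import torch
import torch.nn as nn

class EnergyDescentGNN(nn.Module):
    def __init__(self, num_relations, steps=8, lr=0.1):
        super().__init__()
        self.lam = nn.Parameter(torch.ones(num_relations))  # per-relation weight
        self.steps, self.lr = steps, lr

    def forward(self, f_x, laplacians):
        # f_x: (N, d) base node features; laplacians: one (N, N) L_r per relation
        y = f_x.clone()
        for _ in range(self.steps):
            grad = 2 * (y - f_x)                    # gradient of fidelity term
            for lam, L in zip(self.lam, laplacians):
                grad = grad + 2 * lam * (L @ y)     # gradient of tr(Y^T L_r Y)
            y = y - self.lr * grad                  # one descent step = one layer
        return y
```

Because the unrolled output is differentiable in the lambda parameters, the outer (bilevel) training loop can tune how strongly each relation smooths the representations, which is what lets the model handle per-relation heterophily.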
