Vector-Quantized Image Modeling (VQIM) is a fundamental research problem in image synthesis, which aims to represent an image with a discrete token sequence. Existing studies effectively address this problem by learning a discrete codebook from scratch and in a code-independent manner to quantize continuous representations into discrete tokens. However, learning a codebook in this way is highly challenging and may be a key cause of codebook collapse: because the relationships between codes and good codebook priors are ignored, some code vectors are rarely optimized and eventually die off. In this paper, inspired by pretrained language models, we observe that these models have in effect already pretrained a superior codebook on large text corpora, yet this information is rarely exploited in VQIM. To this end, we propose a novel codebook transfer framework with part-of-speech, called VQCT, which transfers a well-trained codebook from pretrained language models to VQIM for robust codebook learning. Specifically, we first introduce a pretrained codebook from language models and part-of-speech knowledge as priors. Then, we construct a vision-related codebook from these priors to achieve codebook transfer. Finally, a novel codebook transfer network is designed to exploit the abundant semantic relationships between codes contained in pretrained codebooks for robust VQIM codebook learning. Experimental results on four datasets show that our VQCT method achieves superior VQIM performance over previous state-of-the-art methods.
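As an illustration of the quantization step that VQIM methods share, the following is a minimal sketch of generic nearest-code lookup, not the VQCT transfer network itself. The function names `quantize` and `code_usage`, the Euclidean distance, and the array shapes are assumptions made for this example; `code_usage` is a hypothetical helper that shows one simple way to observe codebook collapse.

```python
import numpy as np


def quantize(features, codebook):
    """Map each continuous feature vector to its nearest code vector.

    features: (N, D) array of continuous representations (assumed encoder output).
    codebook: (K, D) array of code vectors.
    Returns (indices, quantized) with shapes (N,) and (N, D).
    """
    # Squared Euclidean distance between every feature and every code: (N, K).
    diffs = features[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    # Each feature becomes the index (discrete token) of its closest code.
    indices = np.argmin(dists, axis=1)
    return indices, codebook[indices]


def code_usage(indices, num_codes):
    """Fraction of codes selected at least once (hypothetical collapse indicator)."""
    return np.unique(indices).size / num_codes
```

A usage fraction well below 1 over a large batch would suggest that many codes are never selected, which is the collapse symptom described above.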
Precipitation nowcasting is an important spatio-temporal prediction task that predicts radar echo sequences from current observations and serves both meteorological science and smart city applications. Owing to the chaotic evolution of precipitation systems, it is a very challenging problem. Previous studies address the problem from the perspective of either deterministic or probabilistic modeling. However, their predictions suffer from blurriness, fading of high-value echoes, and positional inaccuracy. The root cause of these issues is that the chaotic evolution of precipitation systems is not appropriately modeled. Inspired by the nature of these systems, we propose to decompose and model them as global deterministic motion plus local stochastic variations, using a residual mechanism. Based on residual diffusion, we propose a unified and flexible framework that can be applied to any type of spatio-temporal model and that effectively addresses the shortcomings of previous methods. Extensive experimental results on four publicly available radar datasets demonstrate the effectiveness and superiority of the proposed framework compared with state-of-the-art techniques. Our code will be made publicly available soon.
Online continual learning (OCL) aims to learn continuously from new data in a single pass over an online data stream. It generally suffers from catastrophic forgetting. Existing replay-based methods effectively alleviate this issue by replaying part of the old data in a proxy-based or contrastive-based replay manner. In this paper, we conduct a comprehensive analysis of these two replay manners and find that they can be complementary. Inspired by this finding, we propose a novel replay-based method called proxy-based contrastive replay (PCR), which replaces anchor-to-sample pairs with anchor-to-proxy pairs in the contrastive-based loss to alleviate forgetting. Based on PCR, we further develop a more advanced method named holistic proxy-based contrastive replay (HPCR), which consists of three components. The first is a contrastive component that conditionally incorporates anchor-to-sample pairs into PCR, learning more fine-grained semantic information with a large training batch. The second is a temperature component that decouples the temperature coefficient into two parts based on their impacts on the gradient and sets different values for them to learn more novel knowledge. The third is a distillation component that constrains the learning process to retain more historical knowledge. Experiments on four datasets consistently demonstrate the superiority of HPCR over various state-of-the-art methods.
Online continual learning aims to continuously train neural networks from a continuous data stream in a single pass over the data. As the most effective approach, rehearsal-based methods replay part of the previous data. The predictors commonly used in existing methods tend to generate biased dot-product logits that favor the classes of the current data, which is known as the bias issue and is a manifestation of forgetting. Many approaches have been proposed to overcome forgetting by correcting the bias; however, they still leave room for improvement in the online setting. In this paper, we address the bias issue with a more straightforward and more efficient method. By decomposing the dot-product logits into an angle factor and a norm factor, we empirically find that the bias mainly occurs in the angle factor, which can be used to learn novel knowledge as cosine logits. In contrast, the norm factor, which existing methods discard, helps retain historical knowledge. Based on this observation, we propose to leverage the norm factor to balance new and old knowledge and thereby address the bias. To this end, we develop a heuristic approach called unbias experience replay (UER). UER learns current samples using only the angle factor and replays previous samples using both the norm and angle factors. Extensive experiments on three datasets show that UER achieves superior performance over various state-of-the-art methods. The code is available at https://github.com/FelixHuiweiLin/UER.
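The decomposition of dot-product logits into a norm factor and an angle factor follows from the identity w·f = ||w|| ||f|| cos(θ). The following is a minimal sketch of that identity only; the function name `decompose_logits`, the array shapes, and the `eps` guard are assumptions for illustration and are not taken from the UER repository.

```python
import numpy as np


def decompose_logits(features, weights, eps=1e-12):
    """Split dot-product logits into a norm factor and an angle factor.

    features: (N, D) sample embeddings; weights: (C, D) classifier weights.
    Returns (logits, norm_factor, angle_factor), each of shape (N, C),
    with logits == norm_factor * angle_factor up to rounding.
    """
    logits = features @ weights.T
    f_norm = np.linalg.norm(features, axis=1, keepdims=True)  # (N, 1)
    w_norm = np.linalg.norm(weights, axis=1, keepdims=True)   # (C, 1)
    norm_factor = f_norm * w_norm.T                           # ||f|| * ||w||
    # Cosine of the angle between each feature and each class weight.
    angle_factor = logits / np.maximum(norm_factor, eps)
    return logits, norm_factor, angle_factor
```

Under this reading, the angle factor alone gives the cosine logits used for current samples, while the product of both factors recovers the ordinary dot-product logits used when replaying previous samples. How the two are combined in the training loss is not specified in the abstract and is left out here.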
Online class-incremental continual learning is a specific task of continual learning. It aims to continuously learn new classes from a data stream whose samples are seen only once, and it suffers from catastrophic forgetting, i.e., forgetting the historical knowledge of old classes. Existing replay-based methods effectively alleviate this issue by saving and replaying part of the old data in a proxy-based or contrastive-based replay manner. Although both replay manners are effective, the former is biased toward new classes because of class imbalance, and the latter is unstable and hard to converge because of the limited number of samples. In this paper, we conduct a comprehensive analysis of these two replay manners and find that they can be complementary. Inspired by this finding, we propose a novel replay-based method called proxy-based contrastive replay (PCR). The key operation is to replace the contrastive samples of anchors with the corresponding proxies in the contrastive-based manner. This alleviates catastrophic forgetting by effectively addressing the imbalance issue while also enabling faster convergence of the model. We conduct extensive experiments on three real-world benchmark datasets, and empirical results consistently demonstrate the superiority of PCR over various state-of-the-art methods.
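To make the anchor-to-proxy idea concrete, the following is a minimal sketch of a contrastive-style loss in which each anchor is compared with class proxies instead of other samples. It is an illustration under stated assumptions rather than the PCR implementation: the function name `anchor_to_proxy_loss`, the cosine similarity, the temperature value of 0.1, and the use of all class proxies in the denominator are choices made for this example.

```python
import numpy as np


def anchor_to_proxy_loss(anchors, labels, proxies, temperature=0.1):
    """Contrastive-style loss with class proxies as the contrasted items.

    anchors: (N, D) sample embeddings; labels: (N,) integer class ids;
    proxies: (C, D) one learnable proxy per class (assumed classifier weights).
    """
    # Cosine similarity between every anchor and every proxy, scaled by temperature.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = proxies / np.linalg.norm(proxies, axis=1, keepdims=True)
    sims = (a @ p.T) / temperature                      # (N, C)
    # Numerically stable log-softmax over proxies.
    sims = sims - sims.max(axis=1, keepdims=True)
    log_prob = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    # The positive pair for each anchor is the proxy of its own class.
    return -log_prob[np.arange(len(labels)), labels].mean()
```

Because the number of proxies is fixed by the number of classes, the set of contrasted items no longer depends on how many samples of each class happen to be in the batch, which is one plausible reading of why the replacement helps with imbalance and convergence.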
Few-Shot Learning (FSL) is a challenging task, which aims to recognize novel classes with few examples. Recently, many methods have been proposed from the perspectives of meta-learning and representation learning to improve FSL performance. However, few works focus on the interpretability of the FSL decision process. In this paper, we take a step towards interpretable FSL by proposing a novel decision tree-based meta-learning framework, namely MetaDT. Our insight is to replace the final black-box FSL classifier of existing representation learning methods with an interpretable decision tree learned via meta-learning. The key challenge is how to effectively learn the decision tree (i.e., the tree structure and the parameters of each node) in the FSL setting. To address the challenge, we introduce a tree-like class hierarchy as our prior: 1) the hierarchy is directly employed as the tree structure; 2) by regarding the class hierarchy as an undirected graph, a graph convolution-based decision tree inference network is designed as our meta-learner to learn to infer the parameters of each node. Finally, a two-loop optimization mechanism is incorporated into our framework for fast adaptation of the decision tree with few examples. Extensive experiments on performance comparison and interpretability analysis show the effectiveness and superiority of our MetaDT. Our code will be publicly available upon acceptance.
Few-Shot Remote Sensing Scene Classification (FSRSSC) is an important task, which aims to recognize novel scene classes with few examples. Recently, several studies have attempted to address the FSRSSC problem by following few-shot natural image classification methods. These existing methods have made promising progress and achieved superior performance. However, they all overlook two unique characteristics of remote sensing images: (i) object co-occurrence, i.e., multiple objects tend to appear together in a scene image, and (ii) object spatial correlation, i.e., these co-occurring objects are distributed in the scene image following certain spatial structure patterns. These characteristics are very beneficial for FSRSSC: they provide more refined descriptions of each scene class and can therefore effectively alleviate the scarcity of labeled remote sensing images. To fully exploit these characteristics, we propose a novel scene graph matching-based meta-learning framework for FSRSSC, called SGMNet. In this framework, a scene graph construction module is carefully designed to represent each test remote sensing image or each scene class as a scene graph, where the nodes reflect the co-occurring objects and the edges capture the spatial correlations between them. Then, a scene graph matching module is developed to evaluate the similarity score between each test remote sensing image and each scene class. Finally, based on the similarity scores, we perform scene class prediction via a nearest neighbor classifier. We conduct extensive experiments on the UCMerced LandUse, WHU19, AID, and NWPU-RESISC45 datasets. The experimental results show that our method obtains superior performance over the previous state-of-the-art methods.
Natural disasters caused by heavy rainfall often cause huge losses of life and property. To mitigate such losses, precipitation nowcasting is urgently needed. To this end, a growing number of deep learning methods have been proposed to forecast future radar echo images, which are then converted into rainfall distributions. The prevailing spatiotemporal sequence prediction methods apply the ConvRNN structure, which combines convolutional and recurrent neural networks. Although improvements based on ConvRNN have achieved remarkable success, these methods fail to capture local and global spatial features simultaneously, which degrades nowcasting in regions of heavy rainfall. To address this issue, we propose the Region Attention Block (RAB) and embed it into ConvRNN to enhance forecasts in areas with strong rainfall. In addition, ConvRNN models struggle to memorize long-range historical representations with limited parameters. To address this, we propose the Recall Attention Mechanism (RAM) to improve prediction. By preserving longer temporal information, RAM contributes to forecasting, especially at moderate rainfall intensities. Experiments show that the proposed model, Region Attention Predictive Network (RAP-Net), outperforms the state-of-the-art method.
Few-shot learning aims to recognize novel classes with few examples. Pre-training based methods effectively tackle the problem by pre-training a feature extractor and then fine-tuning it through nearest centroid based meta-learning. However, results show that the fine-tuning step yields only marginal improvements. In this paper, 1) we identify the reason, i.e., in the pre-trained feature space, the base classes already form compact clusters while novel classes spread as groups with large variances, which implies that fine-tuning the feature extractor is less meaningful; 2) instead of fine-tuning the feature extractor, we focus on estimating more representative prototypes. Consequently, we propose a novel prototype completion based meta-learning framework. This framework first introduces primitive knowledge (i.e., class-level part or attribute annotations) and extracts representative features for seen attributes as priors. Second, a part/attribute transfer network is designed to learn to infer the representative features for unseen attributes as supplementary priors. Finally, a prototype completion network is devised to learn to complete prototypes with these priors. Moreover, to reduce prototype completion error, we further develop a Gaussian-based prototype fusion strategy that fuses the mean-based and completed prototypes by exploiting the unlabeled samples. Extensive experiments show that our method (i) obtains more accurate prototypes and (ii) achieves superior performance in both inductive and transductive FSL settings.
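For readers unfamiliar with the baseline that prototype completion builds on, the following is a minimal sketch of mean-based prototypes with nearest-centroid prediction. It does not implement the completion or fusion networks. The function names, the Euclidean distance, and the assumption that labels are integers in `0..num_classes-1` with at least one support sample per class are choices made for this example.

```python
import numpy as np


def mean_prototypes(support_features, support_labels, num_classes):
    """Average the support features of each class into one prototype per class.

    support_features: (N, D); support_labels: (N,) integers in [0, num_classes).
    Assumes every class has at least one support sample.
    """
    return np.stack([
        support_features[support_labels == c].mean(axis=0)
        for c in range(num_classes)
    ])


def nearest_centroid_predict(query_features, prototypes):
    """Assign each query to the class whose prototype is closest (Euclidean)."""
    diffs = query_features[:, None, :] - prototypes[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)   # (Q, C) squared distances
    return np.argmin(dists, axis=1)
```

With only one or a few support samples per class, each mean is estimated from very little data, which is why a more representative prototype can matter more than further tuning of the feature extractor.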
Few-Shot Learning (FSL) is a challenging task: how can novel classes be recognized from few examples? Pre-training based methods effectively tackle the problem by pre-training a feature extractor and then predicting novel classes via a nearest neighbor classifier with mean-based prototypes. Nevertheless, due to data scarcity, the mean-based prototypes are usually biased. In this paper, we reduce the bias by treating it as a prototype optimization problem. Although existing meta-optimizers can also be applied to this optimization, they all overlook a crucial gradient bias issue, i.e., the mean-based gradient estimation is also biased on scarce data. Consequently, we regard the gradient itself as meta-knowledge and propose a novel prototype optimization-based meta-learning framework, called MetaNODE. Specifically, we first take the mean-based prototypes as initial prototypes, and then model the process of prototype optimization as continuous-time dynamics specified by a Neural Ordinary Differential Equation (Neural ODE). A gradient flow inference network is carefully designed to learn to estimate the continuous gradients for prototype dynamics. Finally, the optimal prototypes can be obtained by solving the Neural ODE using the Runge-Kutta method. Extensive experiments demonstrate that our proposed method obtains superior performance over the previous state-of-the-art methods. Our code will be publicly available upon acceptance.
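The idea of treating prototype optimization as continuous-time dynamics can be illustrated with a classical fourth-order Runge-Kutta (RK4) integrator. The following is a minimal sketch and not the MetaNODE implementation: the learned gradient flow inference network is replaced by a user-supplied function `flow(t, p)`, and the function name `rk4_integrate`, the fixed step count, and the time horizon are assumptions for this example.

```python
import numpy as np


def rk4_integrate(flow, p0, t_end=1.0, num_steps=10):
    """Integrate dp/dt = flow(t, p) from t = 0 to t_end with classical RK4.

    flow: callable (t, p) -> array shaped like p; a stand-in for a learned
          gradient flow network (hypothetical here).
    p0:   initial prototypes, e.g. the mean-based prototypes.
    """
    p = np.asarray(p0, dtype=float)
    h = t_end / num_steps
    t = 0.0
    for _ in range(num_steps):
        # Four slope evaluations per step, combined with weights 1, 2, 2, 1.
        k1 = flow(t, p)
        k2 = flow(t + h / 2, p + (h / 2) * k1)
        k3 = flow(t + h / 2, p + (h / 2) * k2)
        k4 = flow(t + h, p + h * k3)
        p = p + (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return p
```

As a toy check, a flow such as `lambda t, p: -(p - target)` moves the prototypes toward a fixed `target` along an exponential path, so the integrated result can be compared with the closed-form solution `target + (p0 - target) * exp(-t_end)`.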