Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Erjin Zhou

MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

Aug 26, 2025

Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, Gao Huang

Figure 1 for MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

Figure 2 for MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

Figure 3 for MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

Figure 4 for MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

Abstract:Temporal context is essential for robotic manipulation because such tasks are inherently non-Markovian, yet mainstream VLA models typically overlook it and struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived representations for immediate control, while the hippocampal system preserves verbatim episodic details and semantic gist of past experience for long-term memory. Inspired by these mechanisms, we propose MemoryVLA, a Cognition-Memory-Action framework for long-horizon robotic manipulation. A pretrained VLM encodes the observation into perceptual and cognitive tokens that form working memory, while a Perceptual-Cognitive Memory Bank stores low-level details and high-level semantics consolidated from it. Working memory retrieves decision-relevant entries from the bank, adaptively fuses them with current tokens, and updates the bank by merging redundancies. Using these tokens, a memory-conditioned diffusion action expert yields temporally aware action sequences. We evaluate MemoryVLA on 150+ simulation and real-world tasks across three robots. On SimplerEnv-Bridge, Fractal, and LIBERO-5 suites, it achieves 71.9%, 72.7%, and 96.5% success rates, respectively, all outperforming state-of-the-art baselines CogACT and pi-0, with a notable +14.6 gain on Bridge. On 12 real-world tasks spanning general skills and long-horizon temporal dependencies, MemoryVLA achieves 84.0% success rate, with long-horizon tasks showing a +26 improvement over state-of-the-art baseline. Project Page: https://shihao1895.github.io/MemoryVLA

* The project is available at https://shihao1895.github.io/MemoryVLA

Via

Access Paper or Ask Questions

Class Balance Matters to Active Class-Incremental Learning

Dec 09, 2024

Zitong Huang, Ze Chen, Yuanze Li, Bowen Dong, Erjin Zhou, Yong Liu, Rick Siow Mong Goh, Chun-Mei Feng, Wangmeng Zuo

Figure 1 for Class Balance Matters to Active Class-Incremental Learning

Figure 2 for Class Balance Matters to Active Class-Incremental Learning

Figure 3 for Class Balance Matters to Active Class-Incremental Learning

Figure 4 for Class Balance Matters to Active Class-Incremental Learning

Abstract:Few-Shot Class-Incremental Learning has shown remarkable efficacy in efficient learning new concepts with limited annotations. Nevertheless, the heuristic few-shot annotations may not always cover the most informative samples, which largely restricts the capability of incremental learner. We aim to start from a pool of large-scale unlabeled data and then annotate the most informative samples for incremental learning. Based on this premise, this paper introduces the Active Class-Incremental Learning (ACIL). The objective of ACIL is to select the most informative samples from the unlabeled pool to effectively train an incremental learner, aiming to maximize the performance of the resulting model. Note that vanilla active learning algorithms suffer from class-imbalanced distribution among annotated samples, which restricts the ability of incremental learning. To achieve both class balance and informativeness in chosen samples, we propose Class-Balanced Selection (CBS) strategy. Specifically, we first cluster the features of all unlabeled images into multiple groups. Then for each cluster, we employ greedy selection strategy to ensure that the Gaussian distribution of the sampled features closely matches the Gaussian distribution of all unlabeled features within the cluster. Our CBS can be plugged and played into those CIL methods which are based on pretrained models with prompts tunning technique. Extensive experiments under ACIL protocol across five diverse datasets demonstrate that CBS outperforms both random selection and other SOTA active learning approaches. Code is publicly available at https://github.com/1170300714/CBS.

* ACM MM 2024

Via

Access Paper or Ask Questions

IMWA: Iterative Model Weight Averaging Benefits Class-Imbalanced Learning Tasks

Apr 25, 2024

Zitong Huang, Ze Chen, Bowen Dong, Chaoqi Liang, Erjin Zhou, Wangmeng Zuo

Figure 1 for IMWA: Iterative Model Weight Averaging Benefits Class-Imbalanced Learning Tasks

Figure 2 for IMWA: Iterative Model Weight Averaging Benefits Class-Imbalanced Learning Tasks

Figure 3 for IMWA: Iterative Model Weight Averaging Benefits Class-Imbalanced Learning Tasks

Figure 4 for IMWA: Iterative Model Weight Averaging Benefits Class-Imbalanced Learning Tasks

Abstract:Model Weight Averaging (MWA) is a technique that seeks to enhance model's performance by averaging the weights of multiple trained models. This paper first empirically finds that 1) the vanilla MWA can benefit the class-imbalanced learning, and 2) performing model averaging in the early epochs of training yields a greater performance improvement than doing that in later epochs. Inspired by these two observations, in this paper we propose a novel MWA technique for class-imbalanced learning tasks named Iterative Model Weight Averaging (IMWA). Specifically, IMWA divides the entire training stage into multiple episodes. Within each episode, multiple models are concurrently trained from the same initialized model weight, and subsequently averaged into a singular model. Then, the weight of this average model serves as a fresh initialization for the ensuing episode, thus establishing an iterative learning paradigm. Compared to vanilla MWA, IMWA achieves higher performance improvements with the same computational cost. Moreover, IMWA can further enhance the performance of those methods employing EMA strategy, demonstrating that IMWA and EMA can complement each other. Extensive experiments on various class-imbalanced learning tasks, i.e., class-imbalanced image classification, semi-supervised class-imbalanced image classification and semi-supervised object detection tasks showcase the effectiveness of our IMWA.

Via

Access Paper or Ask Questions

Learning Prompt with Distribution-Based Feature Replay for Few-Shot Class-Incremental Learning

Jan 03, 2024

Zitong Huang, Ze Chen, Zhixing Chen, Erjin Zhou, Xinxing Xu, Rick Siow Mong Goh, Yong Liu, Chunmei Feng, Wangmeng Zuo

Figure 1 for Learning Prompt with Distribution-Based Feature Replay for Few-Shot Class-Incremental Learning

Figure 2 for Learning Prompt with Distribution-Based Feature Replay for Few-Shot Class-Incremental Learning

Figure 3 for Learning Prompt with Distribution-Based Feature Replay for Few-Shot Class-Incremental Learning

Figure 4 for Learning Prompt with Distribution-Based Feature Replay for Few-Shot Class-Incremental Learning

Abstract:Few-shot Class-Incremental Learning (FSCIL) aims to continuously learn new classes based on very limited training data without forgetting the old ones encountered. Existing studies solely relied on pure visual networks, while in this paper we solved FSCIL by leveraging the Vision-Language model (e.g., CLIP) and propose a simple yet effective framework, named Learning Prompt with Distribution-based Feature Replay (LP-DiF). We observe that simply using CLIP for zero-shot evaluation can substantially outperform the most influential methods. Then, prompt tuning technique is involved to further improve its adaptation ability, allowing the model to continually capture specific knowledge from each session. To prevent the learnable prompt from forgetting old knowledge in the new session, we propose a pseudo-feature replay approach. Specifically, we preserve the old knowledge of each class by maintaining a feature-level Gaussian distribution with a diagonal covariance matrix, which is estimated by the image features of training images and synthesized features generated from a VAE. When progressing to a new session, pseudo-features are sampled from old-class distributions combined with training images of the current session to optimize the prompt, thus enabling the model to learn new knowledge while retaining old knowledge. Experiments on three prevalent benchmarks, i.e., CIFAR100, mini-ImageNet, CUB-200, and two more challenging benchmarks, i.e., SUN-397 and CUB-200$^*$ proposed in this paper showcase the superiority of LP-DiF, achieving new state-of-the-art (SOTA) in FSCIL. Code is publicly available at https://github.com/1170300714/LP-DiF.

Via

Access Paper or Ask Questions

W2N:Switching From Weak Supervision to Noisy Supervision for Object Detection

Jul 25, 2022

Zitong Huang, Yiping Bao, Bowen Dong, Erjin Zhou, Wangmeng Zuo

Figure 1 for W2N:Switching From Weak Supervision to Noisy Supervision for Object Detection

Figure 2 for W2N:Switching From Weak Supervision to Noisy Supervision for Object Detection

Figure 3 for W2N:Switching From Weak Supervision to Noisy Supervision for Object Detection

Figure 4 for W2N:Switching From Weak Supervision to Noisy Supervision for Object Detection

Abstract:Weakly-supervised object detection (WSOD) aims to train an object detector only requiring the image-level annotations. Recently, some works have managed to select the accurate boxes generated from a well-trained WSOD network to supervise a semi-supervised detection framework for better performance. However, these approaches simply divide the training set into labeled and unlabeled sets according to the image-level criteria, such that sufficient mislabeled or wrongly localized box predictions are chosen as pseudo ground-truths, resulting in a sub-optimal solution of detection performance. To overcome this issue, we propose a novel WSOD framework with a new paradigm that switches from weak supervision to noisy supervision (W2N). Generally, with given pseudo ground-truths generated from the well-trained WSOD network, we propose a two-module iterative training algorithm to refine pseudo labels and supervise better object detector progressively. In the localization adaptation module, we propose a regularization loss to reduce the proportion of discriminative parts in original pseudo ground-truths, obtaining better pseudo ground-truths for further training. In the semi-supervised module, we propose a two tasks instance-level split method to select high-quality labels for training a semi-supervised detector. Experimental results on different benchmarks verify the effectiveness of W2N, and our W2N outperforms all existing pure WSOD methods and transfer learning methods. Our code is publicly available at https://github.com/1170300714/w2n_wsod.

* ECCV2022

Via

Access Paper or Ask Questions

Prototypical Contrastive Language Image Pretraining

Jun 22, 2022

Delong Chen, Zhao Wu, Fan Liu, Zaiquan Yang, Yixiang Huang, Yiping Bao, Erjin Zhou

Figure 1 for Prototypical Contrastive Language Image Pretraining

Figure 2 for Prototypical Contrastive Language Image Pretraining

Figure 3 for Prototypical Contrastive Language Image Pretraining

Figure 4 for Prototypical Contrastive Language Image Pretraining

Abstract:Contrastive Language Image Pretraining (CLIP) received widespread attention since its learned representations can be transferred well to various downstream tasks. During CLIP training, the InfoNCE objective aims to align positive image-text pairs and separate negative ones. In this paper, we show a representation grouping effect during this process: the InfoNCE objective indirectly groups semantically similar representations together via randomly emerged within-modal anchors. We introduce Prototypical Contrastive Language Image Pretraining (ProtoCLIP) to enhance such grouping by boosting its efficiency and increasing its robustness against modality gap. Specifically, ProtoCLIP sets up prototype-level discrimination between image and text spaces, which efficiently transfers higher-level structural knowledge. We further propose Prototypical Back Translation (PBT) to decouple representation grouping from representation alignment, resulting in effective learning of meaningful representations under large modality gap. PBT also enables us to introduce additional external teachers with richer prior knowledge. ProtoCLIP is trained with an online episodic training strategy, which makes it can be scaled up to unlimited amounts of data. Combining the above novel designs, we train our ProtoCLIP on Conceptual Captions and achieved an +5.81% ImageNet linear probing improvement and an +2.01% ImageNet zero-shot classification improvement. Codes are available at https://github.com/megvii-research/protoclip.

* Preprint

Via

Access Paper or Ask Questions

Attend to Who You Are: Supervising Self-Attention for Keypoint Detection and Instance-Aware Association

Nov 25, 2021

Sen Yang, Zhicheng Wang, Ze Chen, Yanjie Li, Shoukui Zhang, Zhibin Quan, Shu-Tao Xia, Yiping Bao, Erjin Zhou, Wankou Yang

Figure 1 for Attend to Who You Are: Supervising Self-Attention for Keypoint Detection and Instance-Aware Association

Figure 2 for Attend to Who You Are: Supervising Self-Attention for Keypoint Detection and Instance-Aware Association

Figure 3 for Attend to Who You Are: Supervising Self-Attention for Keypoint Detection and Instance-Aware Association

Figure 4 for Attend to Who You Are: Supervising Self-Attention for Keypoint Detection and Instance-Aware Association

Abstract:This paper presents a new method to solve keypoint detection and instance association by using Transformer. For bottom-up multi-person pose estimation models, they need to detect keypoints and learn associative information between keypoints. We argue that these problems can be entirely solved by Transformer. Specifically, the self-attention in Transformer measures dependencies between any pair of locations, which can provide association information for keypoints grouping. However, the naive attention patterns are still not subjectively controlled, so there is no guarantee that the keypoints will always attend to the instances to which they belong. To address it we propose a novel approach of supervising self-attention for multi-person keypoint detection and instance association. By using instance masks to supervise self-attention to be instance-aware, we can assign the detected keypoints to their corresponding instances based on the pairwise attention scores, without using pre-defined offset vector fields or embedding like CNN-based bottom-up models. An additional benefit of our method is that the instance segmentation results of any number of people can be directly obtained from the supervised attention matrix, thereby simplifying the pixel assignment pipeline. The experiments on the COCO multi-person keypoint detection challenge and person instance segmentation task demonstrate the effectiveness and simplicity of the proposed method and show a promising way to control self-attention behavior for specific purposes.

* 16 pages, 9 figures, 7 tables

Via

Access Paper or Ask Questions

Guiding Query Position and Performing Similar Attention for Transformer-Based Detection Heads

Aug 22, 2021

Xiaohu Jiang, Ze Chen, Zhicheng Wang, Erjin Zhou, ChunYuan

Figure 1 for Guiding Query Position and Performing Similar Attention for Transformer-Based Detection Heads

Figure 2 for Guiding Query Position and Performing Similar Attention for Transformer-Based Detection Heads

Figure 3 for Guiding Query Position and Performing Similar Attention for Transformer-Based Detection Heads

Figure 4 for Guiding Query Position and Performing Similar Attention for Transformer-Based Detection Heads

Abstract:After DETR was proposed, this novel transformer-based detection paradigm which performs several cross-attentions between object queries and feature maps for predictions has subsequently derived a series of transformer-based detection heads. These models iterate object queries after each cross-attention. However, they don't renew the query position which indicates object queries' position information. Thus model needs extra learning to figure out the newest regions that query position should express and need more attention. To fix this issue, we propose the Guided Query Position (GQPos) method to embed the latest location information of object queries to query position iteratively. Another problem of such transformer-based detection heads is the high complexity to perform attention on multi-scale feature maps, which hinders them from improving detection performance at all scales. Therefore we propose a novel fusion scheme named Similar Attention (SiA): besides the feature maps is fused, SiA also fuse the attention weights maps to accelerate the learning of high-resolution attention weight map by well-learned low-resolution attention weight map. Our experiments show that the proposed GQPos improves the performance of a series of models, including DETR, SMCA, YoloS, and HoiTransformer and SiA consistently improve the performance of multi-scale transformer-based detection heads like DETR and HoiTransformer.

Via

Access Paper or Ask Questions

Adaptive Dilated Convolution For Human Pose Estimation

Jul 22, 2021

Zhengxiong Luo, Zhicheng Wang, Yan Huang, Liang Wang, Tieniu Tan, Erjin Zhou

Figure 1 for Adaptive Dilated Convolution For Human Pose Estimation

Figure 2 for Adaptive Dilated Convolution For Human Pose Estimation

Figure 3 for Adaptive Dilated Convolution For Human Pose Estimation

Figure 4 for Adaptive Dilated Convolution For Human Pose Estimation

Abstract:Most existing human pose estimation (HPE) methods exploit multi-scale information by fusing feature maps of four different spatial sizes, \ie $1/4$, $1/8$, $1/16$, and $1/32$ of the input image. There are two drawbacks of this strategy: 1) feature maps of different spatial sizes may be not well aligned spatially, which potentially hurts the accuracy of keypoint location; 2) these scales are fixed and inflexible, which may restrict the generalization ability over various human sizes. Towards these issues, we propose an adaptive dilated convolution (ADC). It can generate and fuse multi-scale features of the same spatial sizes by setting different dilation rates for different channels. More importantly, these dilation rates are generated by a regression module. It enables ADC to adaptively adjust the fused scales and thus ADC may generalize better to various human sizes. ADC can be end-to-end trained and easily plugged into existing methods. Extensive experiments show that ADC can bring consistent improvements to various HPE methods. The source codes will be released for further research.

Via

Access Paper or Ask Questions

Is 2D Heatmap Representation Even Necessary for Human Pose Estimation?

Jul 11, 2021

Yanjie Li, Sen Yang, Shoukui Zhang, Zhicheng Wang, Wankou Yang, Shu-Tao Xia, Erjin Zhou

Figure 1 for Is 2D Heatmap Representation Even Necessary for Human Pose Estimation?

Figure 2 for Is 2D Heatmap Representation Even Necessary for Human Pose Estimation?

Figure 3 for Is 2D Heatmap Representation Even Necessary for Human Pose Estimation?

Figure 4 for Is 2D Heatmap Representation Even Necessary for Human Pose Estimation?

Abstract:The 2D heatmap representation has dominated human pose estimation for years due to its high performance. However, heatmap-based approaches have some drawbacks: 1) The performance drops dramatically in the low-resolution images, which are frequently encountered in real-world scenarios. 2) To improve the localization precision, multiple upsample layers may be needed to recover the feature map resolution from low to high, which are computationally expensive. 3) Extra coordinate refinement is usually necessary to reduce the quantization error of downscaled heatmaps. To address these issues, we propose a \textbf{Sim}ple yet promising \textbf{D}isentangled \textbf{R}epresentation for keypoint coordinate (\emph{SimDR}), reformulating human keypoint localization as a task of classification. In detail, we propose to disentangle the representation of horizontal and vertical coordinates for keypoint location, leading to a more efficient scheme without extra upsampling and refinement. Comprehensive experiments conducted over COCO dataset show that the proposed \emph{heatmap-free} methods outperform \emph{heatmap-based} counterparts in all tested input resolutions, especially in lower resolutions by a large margin. Code will be made publicly available at \url{https://github.com/leeyegy/SimDR}.

* Code will be made publicly available at https://github.com/leeyegy/SimDR

Via

Access Paper or Ask Questions