Ping Luo

Self-Play and Self-Describe: Policy Adaptation with Vision-Language Foundation Models

Dec 14, 2022
Yuying Ge, Annabella Macaluso, Li Erran Li, Ping Luo, Xiaolong Wang

Recent progress on vision-language foundation models has brought significant advances in building general-purpose robots. By using the pre-trained models to encode the scene and instructions as inputs for decision making, the instruction-conditioned policy can generalize across different objects and tasks. While this is encouraging, the policy still fails in most cases given an unseen task or environment. To adapt the policy to unseen tasks and environments, we explore a new paradigm for leveraging the pre-trained foundation models with Self-PLAY and Self-Describe (SPLAYD). When deploying the trained policy to a new task or a new environment, we first let the policy self-play with randomly generated instructions to record the demonstrations. While the execution could be wrong, we can use the pre-trained foundation models to accurately self-describe (i.e., re-label or classify) the demonstrations. This automatically provides new pairs of demonstration-instruction data for policy fine-tuning. We evaluate our method on a broad range of experiments with a focus on generalization to unseen objects, unseen tasks, unseen environments, and sim-to-real transfer. We show that SPLAYD improves over baselines by a large margin in all cases. Our project page is available at https://geyuying.github.io/SPLAYD/
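The self-play and self-describe loop described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: `collect_relabeled_data`, `toy_policy`, and `toy_describe` are hypothetical names, a trajectory is reduced to a list of strings, and the describer stands in for the pre-trained foundation model. None of this is taken from the SPLAYD code.

```python
import random
from typing import Callable, List, Tuple

Trajectory = List[str]  # hypothetical stand-in for a recorded rollout


def collect_relabeled_data(
    policy: Callable[[str], Trajectory],
    describe: Callable[[Trajectory], str],
    instructions: List[str],
    num_rollouts: int,
    rng: random.Random,
) -> List[Tuple[Trajectory, str]]:
    """Roll out the policy on random instructions and relabel each rollout."""
    data = []
    for _ in range(num_rollouts):
        instruction = rng.choice(instructions)  # self-play: random instruction
        trajectory = policy(instruction)        # execution may not match it
        label = describe(trajectory)            # self-describe: relabel the rollout
        data.append((trajectory, label))        # pair usable for fine-tuning
    return data


def toy_policy(instruction: str) -> Trajectory:
    # A deliberately poor policy: it pushes the red block whatever it is told.
    return ["push red block"]


def toy_describe(trajectory: Trajectory) -> str:
    # Assumed describer: reports what actually happened in the rollout.
    return trajectory[-1]
```

With the toy policy, every collected pair is labeled "push red block" even when the sampled instruction was different, which is the point of relabeling: the stored instruction matches what was executed, not what was requested.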

* Project page: https://geyuying.github.io/SPLAYD/

Learning Object-Language Alignments for Open-Vocabulary Object Detection

Nov 27, 2022
Chuang Lin, Peize Sun, Yi Jiang, Ping Luo, Lizhen Qu, Gholamreza Haffari, Zehuan Yuan, Jianfei Cai

Existing object detection methods are bound to a fixed-set vocabulary by costly labeled data. When dealing with novel categories, the model has to be retrained with more bounding box annotations. Natural language supervision is an attractive alternative for its annotation-free attributes and broader object concepts. However, learning open-vocabulary object detection from language is challenging since image-text pairs do not contain fine-grained object-language alignments. Previous solutions rely on either expensive grounding annotations or distilling classification-oriented vision models. In this paper, we propose a novel open-vocabulary object detection framework that learns directly from image-text pair data. We formulate object-language alignment as a set matching problem between a set of image region features and a set of word embeddings. This enables us to train an open-vocabulary object detector on image-text pairs in a much simpler and more effective way. Extensive experiments on two benchmark datasets, COCO and LVIS, demonstrate our superior performance over the competing approaches on novel categories, e.g., achieving 32.0% mAP on COCO and 21.7% mask mAP on LVIS. Code is available at: https://github.com/clin1223/VLDet.
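To make the set-matching formulation concrete, here is a minimal sketch that assigns each word embedding to a distinct image region so that the total similarity is maximal. The function name `match_words_to_regions`, the dot-product score, and the brute-force search are assumptions for illustration; the abstract does not state which solver or similarity the authors use, and a practical implementation would use a polynomial-time assignment solver such as the Hungarian algorithm.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def match_words_to_regions(
    regions: List[Sequence[float]],
    words: List[Sequence[float]],
) -> Tuple[List[Tuple[int, int]], float]:
    """Return (word_index, region_index) pairs and their total similarity.

    Brute force over one-to-one assignments, so only suitable for tiny sets.
    """
    if len(words) > len(regions):
        raise ValueError("need at least as many regions as words")
    best_score = float("-inf")
    best_assignment: Tuple[int, ...] = ()
    # Each candidate picks a distinct region for every word.
    for assignment in permutations(range(len(regions)), len(words)):
        score = sum(dot(regions[r], words[w]) for w, r in enumerate(assignment))
        if score > best_score:
            best_score, best_assignment = score, assignment
    return list(enumerate(best_assignment)), best_score
```

For three regions `[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]` and two words `[[0.0, 1.0], [1.0, 0.0]]`, the first word is paired with region 1 and the second with region 0.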

* Technical Report

MaskPlace: Fast Chip Placement via Reinforced Visual Representation Learning

Nov 24, 2022
Yao Lai, Yao Mu, Ping Luo

Placement is an essential task in modern chip design, aiming at placing millions of circuit modules on a 2D chip canvas. Unlike the human-centric solution, which requires months of intense effort by hardware engineers to produce a layout that minimizes delay and energy consumption, deep reinforcement learning has become an emerging autonomous tool. However, the learning-centric method is still in its early stage, impeded by a massive design space whose size is on the order of ten to the power of a few thousand. This work presents MaskPlace to automatically generate a valid chip layout design within a few hours, whose performance can be superior or comparable to recent advanced approaches. It has several appealing benefits that prior art lacks. Firstly, MaskPlace recasts placement as a problem of learning pixel-level visual representation to comprehensively describe millions of modules on a chip, enabling placement in a high-resolution canvas and a large action space. It outperforms recent methods that represent a chip as a hypergraph. Secondly, it enables training the policy network with an intuitive reward function with dense reward, rather than a complicated reward function with sparse reward as in previous methods. Thirdly, extensive experiments on many public benchmarks show that MaskPlace outperforms existing RL approaches in all key performance metrics, including wirelength, congestion, and density. For example, it achieves 60%-90% wirelength reduction and guarantees zero overlaps. We believe MaskPlace can improve AI-assisted chip layout design. The deliverables are released at https://laiyao1.github.io/maskplace.
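The wirelength metric mentioned above is commonly estimated with half-perimeter wirelength (HPWL): for each net, the half perimeter of the bounding box around its modules. The sketch below assumes that proxy and a simplified data layout (modules as named points, nets as lists of module names); the abstract does not spell out how wirelength is computed, so treat `hpwl` and its arguments as hypothetical.

```python
from typing import Dict, List, Tuple


def hpwl(nets: List[List[str]], positions: Dict[str, Tuple[float, float]]) -> float:
    """Sum over nets of (bounding-box width + bounding-box height)."""
    total = 0.0
    for net in nets:
        xs = [positions[module][0] for module in net]
        ys = [positions[module][1] for module in net]
        # Half perimeter of the smallest axis-aligned box covering the net.
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total
```

With modules at `a = (0, 0)`, `b = (3, 4)`, `c = (1, 1)` and nets `[a, b]` and `[a, b, c]`, each net has a 3 by 4 bounding box, so the total is 14.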

Prototypical context-aware dynamics generalization for high-dimensional model-based reinforcement learning

Nov 23, 2022
Junjie Wang, Yao Mu, Dong Li, Qichao Zhang, Dongbin Zhao, Yuzheng Zhuang, Ping Luo, Bin Wang, Jianye Hao

The latent world model provides a promising way to learn policies in a compact latent space for tasks with high-dimensional observations; however, its generalization across diverse environments with unseen dynamics remains challenging. Although the recurrent structure utilized in current advances helps to capture local dynamics, modeling only state transitions without an explicit understanding of environmental context limits the generalization ability of the dynamics model. To address this issue, we propose a Prototypical Context-Aware Dynamics (ProtoCAD) model, which captures the local dynamics through a time-consistent latent context and enables dynamics generalization in high-dimensional control tasks. ProtoCAD extracts useful contextual information with the help of prototypes clustered over the batch and benefits model-based RL in two ways: 1) It utilizes a temporally consistent prototypical regularizer that encourages the prototype assignments produced for different time parts of the same latent trajectory to be temporally consistent, instead of comparing the features; 2) A context representation is designed that combines both the projection embedding of latent states and aggregated prototypes and can significantly improve the dynamics generalization ability. Extensive experiments show that ProtoCAD surpasses existing methods in terms of dynamics generalization. Compared with the recurrent-based model RSSM, ProtoCAD delivers 13.2% and 26.7% better mean and median performance across all dynamics generalization tasks.

DiffusionDet: Diffusion Model for Object Detection

Nov 17, 2022
Shoufa Chen, Peize Sun, Yibing Song, Ping Luo

We propose DiffusionDet, a new framework that formulates object detection as a denoising diffusion process from noisy boxes to object boxes. During the training stage, object boxes diffuse from ground-truth boxes to a random distribution, and the model learns to reverse this noising process. In inference, the model refines a set of randomly generated boxes to the output results in a progressive way. Extensive evaluations on the standard benchmarks, including MS-COCO and LVIS, show that DiffusionDet achieves favorable performance compared to previous well-established detectors. Our work brings two important findings in object detection. First, random boxes, although drastically different from pre-defined anchors or learned queries, are also effective object candidates. Second, object detection, one of the representative perception tasks, can be solved in a generative way. Our code is available at https://github.com/ShoufaChen/DiffusionDet.
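The noising direction of the process (ground-truth boxes diffusing toward a random distribution) can be illustrated with the standard Gaussian forward step from denoising diffusion models, x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * noise. The sketch below assumes that standard form and a plain list-of-coordinates box encoding; `noise_boxes` and its parameters are hypothetical, and the noise schedule and box scaling used by DiffusionDet are not given in the abstract.

```python
import math
import random
from typing import List


def noise_boxes(
    boxes: List[List[float]],
    alpha_bar: float,
    rng: random.Random,
) -> List[List[float]]:
    """Apply one Gaussian forward-diffusion step to every box coordinate.

    alpha_bar is the cumulative signal level: 1.0 keeps the boxes intact,
    0.0 replaces them with pure standard-normal noise.
    """
    if not 0.0 <= alpha_bar <= 1.0:
        raise ValueError("alpha_bar must lie in [0, 1]")
    signal = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    return [
        [signal * coord + sigma * rng.gauss(0.0, 1.0) for coord in box]
        for box in boxes
    ]
```

At inference time the model would start from boxes drawn at the `alpha_bar = 0.0` end and refine them step by step; that learned reverse step is not sketched here.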

* Tech report. Code is available at https://github.com/ShoufaChen/DiffusionDet

Decomposed Mutual Information Optimization for Generalized Context in Meta-Reinforcement Learning

Oct 09, 2022
Yao Mu, Yuzheng Zhuang, Fei Ni, Bin Wang, Jianyu Chen, Jianye Hao, Ping Luo

Adapting to changes in transition dynamics is essential in robotic applications. By learning a conditional policy with a compact context, context-aware meta-reinforcement learning provides a flexible way to adjust behavior according to dynamics changes. However, in real-world applications, the agent may encounter complex dynamics changes. Multiple confounders can influence the transition dynamics, making it challenging to infer accurate context for decision-making. This paper addresses this challenge with Decomposed Mutual INformation Optimization (DOMINO) for context learning, which explicitly learns a disentangled context to maximize the mutual information between the context and historical trajectories, while minimizing the state transition prediction error. Our theoretical analysis shows that, by learning a disentangled context, DOMINO can overcome the underestimation of the mutual information caused by multi-confounded challenges and reduce the number of samples that must be collected in various environments. Extensive experiments show that the context learned by DOMINO benefits both model-based and model-free reinforcement learning algorithms for dynamics generalization in terms of sample efficiency and performance in unseen environments.

* NeurIPS 2022

Enhance Sample Efficiency and Robustness of End-to-end Urban Autonomous Driving via Semantic Masked World Model

Oct 08, 2022
Zeyu Gao, Yao Mu, Ruoyan Shen, Chen Chen, Yangang Ren, Jianyu Chen, Shengbo Eben Li, Ping Luo, Yanfeng Lu

End-to-end autonomous driving provides a feasible way to automatically maximize overall driving system performance by directly mapping the raw pixels from a front-facing camera to control signals. Recent advanced methods construct a latent world model to map the high-dimensional observations into a compact latent space. However, the latent states embedded by the world model proposed in previous works may contain a large amount of task-irrelevant information, resulting in low sampling efficiency and poor robustness to input perturbations. Meanwhile, the training data distribution is usually unbalanced, and the learned policy struggles to cope with corner cases during the driving process. To solve the above challenges, we present a semantic masked recurrent world model (SEM2), which introduces a latent filter to extract key task-relevant features and reconstruct a semantic mask via the filtered features. It is trained with a multi-source data sampler, which aggregates common data and multiple corner case data in a single batch to balance the data distribution. Extensive experiments on CARLA show that our method outperforms the state-of-the-art approaches in terms of sample efficiency and robustness to input perturbations.
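The multi-source sampler idea, mixing common data and corner-case data in every batch, might look like the sketch below. The fixed per-source quotas, sampling with replacement, and the name `sample_mixed_batch` are all assumptions made for illustration; the abstract only says that the sources are aggregated in a single batch.

```python
import random
from typing import Dict, List, TypeVar

T = TypeVar("T")


def sample_mixed_batch(
    sources: Dict[str, List[T]],
    quotas: Dict[str, int],
    rng: random.Random,
) -> List[T]:
    """Build one batch that holds a fixed number of samples from each source."""
    batch: List[T] = []
    for name, quota in quotas.items():
        # Sampling with replacement lets small corner-case buffers fill their quota.
        batch.extend(rng.choices(sources[name], k=quota))
    rng.shuffle(batch)  # avoid grouping the batch by source
    return batch
```

Giving rare corner cases a guaranteed share of every batch is one simple way to counter an unbalanced data distribution, at the cost of revisiting the same corner-case samples often.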

* 11 pages, 7 figures, 1 table, submitted to Deep RL Workshop 2022

Learning Transferable Spatiotemporal Representations from Natural Script Knowledge

Sep 30, 2022
Ziyun Zeng, Yuying Ge, Xihui Liu, Bin Chen, Ping Luo, Shu-Tao Xia, Yixiao Ge

Pre-training on large-scale video data has become a common recipe for learning transferable spatiotemporal representations in recent years. Despite some progress, existing methods are mostly limited to highly curated datasets (e.g., K400) and exhibit unsatisfactory out-of-the-box representations. We argue that this is because they only capture pixel-level knowledge rather than spatiotemporal commonsense, which falls far short of cognition-level video understanding. Inspired by the great success of image-text pre-training (e.g., CLIP), we take the first step to exploit language semantics to boost transferable spatiotemporal representation learning. We introduce a new pretext task, Turning to Video for Transcript Sorting (TVTS), which sorts shuffled ASR scripts by attending to learned video representations. We do not rely on descriptive captions and learn purely from video, i.e., leveraging the natural transcribed speech knowledge to provide noisy but useful semantics over time. Furthermore, rather than the simple concept learning in vision-caption contrast, we encourage cognition-level temporal commonsense reasoning via narrative reorganization. These advantages enable our model to contextualize what is happening as humans do and to apply seamlessly to large-scale uncurated video data in the real world. Note that our method differs from ones designed for video-text alignment (e.g., Frozen) and multimodal representation learning (e.g., Merlot). Our method demonstrates strong out-of-the-box spatiotemporal representations on diverse video benchmarks, e.g., +13.6% gains over VideoMAE on SSV2 via linear probing.
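As a rough illustration of the transcript-sorting pretext task, the sketch below builds one training example by shuffling consecutive transcript segments and recording, for each shuffled segment, its original position as the prediction target. Only the data construction is shown; the model that predicts the order from video features is not. The function name `make_sorting_example` and the target encoding are assumptions, not details from the paper.

```python
import random
from typing import List, Tuple


def make_sorting_example(
    segments: List[str],
    rng: random.Random,
) -> Tuple[List[str], List[int]]:
    """Shuffle transcript segments and return them with their original indices.

    targets[j] is the position that shuffled[j] held in the original order.
    """
    targets = list(range(len(segments)))
    rng.shuffle(targets)
    shuffled = [segments[i] for i in targets]
    return shuffled, targets
```

Sorting the shuffled segments by their targets recovers the original narrative order, which is what a model trained on this task would be asked to predict.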

FedVeca: Federated Vectorized Averaging on Non-IID Data with Adaptive Bi-directional Global Objective

Sep 28, 2022
Ping Luo, Jieren Cheng, Zhenhao Liu, N. Xiong, Jie Wu

Federated Learning (FL) is a distributed machine learning framework that alleviates data silos, where decentralized clients collaboratively learn a global model without sharing their private data. However, the clients' Non-Independent and Identically Distributed (Non-IID) data negatively affect the trained model, and clients with different numbers of local updates may cause significant gaps between the local gradients in each communication round. In this paper, we propose a Federated Vectorized Averaging (FedVeca) method to address the above problem on Non-IID data. Specifically, we set a novel objective for the global model which is related to the local gradients. The local gradient is defined as a bi-directional vector with step size and direction, where the step size is the number of local updates and the direction is divided into positive and negative according to our definition. In FedVeca, the direction is influenced by the step size, so we average the bi-directional vectors to reduce the effect of different step sizes. Then, we theoretically analyze the relationship between the step sizes and the global objective, and obtain upper bounds on the step sizes per communication round. Based on the upper bounds, we design an algorithm for the server and the client to adaptively adjust the step sizes so that the objective is close to the optimum. Finally, we conduct experiments on different datasets, models and scenarios by building a prototype system, and the experimental results demonstrate the effectiveness and efficiency of the FedVeca method.

Rethinking Resolution in the Context of Efficient Video Recognition

Sep 26, 2022
Chuofan Ma, Qiushan Guo, Yi Jiang, Zehuan Yuan, Ping Luo, Xiaojuan Qi

In this paper, we empirically study how to make the most of low-resolution frames for efficient video recognition. Existing methods mainly focus on developing compact networks or alleviating temporal redundancy of video inputs to increase efficiency, whereas compressing frame resolution has rarely been considered a promising solution. A major concern is the poor recognition accuracy on low-resolution frames. We thus start by analyzing the underlying causes of performance degradation on low-resolution frames. Our key finding is that the major cause of degradation is not information loss in the down-sampling process, but rather the mismatch between network architecture and input scale. Motivated by the success of knowledge distillation (KD), we propose to bridge the gap between network and input size via cross-resolution KD (ResKD). Our work shows that ResKD is a simple but effective method to boost recognition accuracy on low-resolution frames. Without bells and whistles, ResKD considerably surpasses all competitive methods in terms of efficiency and accuracy on four large-scale benchmark datasets, namely ActivityNet, FCVID, Mini-Kinetics, and Something-Something V2. In addition, we extensively demonstrate its effectiveness over state-of-the-art architectures, namely 3D-CNNs and Video Transformers, and its scalability towards super low-resolution frames. The results suggest ResKD can serve as a general inference acceleration method for state-of-the-art video recognition. Our code will be available at https://github.com/CVMI-Lab/ResKD.
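A cross-resolution distillation objective can be sketched with the standard soft-target KD loss: the KL divergence between the softened predictions of a teacher (assumed here to see high-resolution frames) and a student (assumed to see low-resolution frames). This is a generic KD sketch, not the ResKD implementation; the abstract does not specify the exact loss, and `softmax`, `kd_loss`, and the temperature parameter are illustrative choices.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    temperature: float = 1.0,
) -> float:
    """KL(teacher || student) between temperature-softened class distributions."""
    p = softmax(teacher_logits, temperature)  # teacher: high-resolution input
    q = softmax(student_logits, temperature)  # student: low-resolution input
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, so minimizing it pushes the low-resolution student toward the high-resolution teacher's predictions.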

* Accepted by NeurIPS 2022
