Zhizheng Zhang

Southeast University, China

NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation

Mar 01, 2024

Jiazhao Zhang, Kunyu Wang, Rongtao Xu, Gengze Zhou, Yicong Hong, Xiaomeng Fang, Qi Wu, Zhizheng Zhang, Wang He

Abstract: Vision-and-Language Navigation (VLN) stands as a key research problem of Embodied AI, aiming to enable agents to navigate in unseen environments following linguistic instructions. In this field, generalization is a long-standing challenge, either to out-of-distribution scenes or from Sim to Real. In this paper, we propose NaVid, a video-based large vision language model (VLM), to mitigate such a generalization gap. NaVid makes the first endeavour to showcase the capability of VLMs to achieve state-of-the-art level navigation performance without any maps, odometer, or depth inputs. Following human instruction, NaVid only requires an on-the-fly video stream from a monocular RGB camera mounted on the robot to output the next-step action. Our formulation mimics how humans navigate and naturally gets rid of the problems introduced by odometer noise and the Sim2Real gaps from map or depth inputs. Moreover, our video-based approach can effectively encode the historical observations of robots as spatio-temporal contexts for decision-making and instruction following. We train NaVid with 550k navigation samples collected from VLN-CE trajectories, including action-planning and instruction-reasoning samples, along with 665k large-scale web data samples. Extensive experiments show that NaVid achieves SOTA performance in simulation environments and the real world, demonstrating superior cross-dataset and Sim2Real transfer. We thus believe our proposed VLM approach plans the next step for not only the navigation agents but also this research field.

SeD: Semantic-Aware Discriminator for Image Super-Resolution

Feb 29, 2024

Bingchen Li, Xin Li, Hanxin Zhu, Yeying Jin, Ruoyu Feng, Zhizheng Zhang, Zhibo Chen

Abstract: Generative Adversarial Networks (GANs) have been widely used to recover vivid textures in image super-resolution (SR) tasks. In particular, one discriminator is utilized to enable the SR network to learn the distribution of real-world high-quality images in an adversarial training manner. However, the distribution learning is overly coarse-grained, which makes it susceptible to virtual textures and leads to counter-intuitive generation results. To mitigate this, we propose the simple and effective Semantic-aware Discriminator (denoted as SeD), which encourages the SR network to learn fine-grained distributions by introducing the semantics of images as a condition. Concretely, we aim to excavate the semantics of images from a well-trained semantic extractor. Under different semantics, the discriminator is able to distinguish real and fake images individually and adaptively, which guides the SR network to learn more fine-grained semantic-aware textures. To obtain accurate and abundant semantics, we take full advantage of recently popular pretrained vision models (PVMs) trained on extensive datasets, and then incorporate their semantic features into the discriminator through a well-designed spatial cross-attention module. In this way, our proposed semantic-aware discriminator empowers the SR network to produce more photo-realistic and pleasing images. Extensive experiments on two typical tasks, i.e., SR and Real SR, have demonstrated the effectiveness of our proposed method.

* CVPR2024

Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API

Oct 07, 2023

Zhizheng Zhang, Wenxuan Xie, Xiaoyi Zhang, Yan Lu

Abstract: The recent popularity of Large Language Models (LLMs) has opened countless possibilities in automating numerous AI tasks by connecting LLMs to various domain-specific models or APIs, where LLMs serve as dispatchers while domain-specific models or APIs are action executors. Despite the vast number of domain-specific models and APIs, they still struggle to comprehensively cover the highly diverse automation demands in the interaction between humans and User Interfaces (UIs). In this work, we build a multimodal model to ground natural language instructions in given UI screenshots as a generic UI task automation executor. This metadata-free grounding model, consisting of a visual encoder and a language decoder, is first pretrained on well-studied document understanding tasks and then learns to decode spatial information from UI screenshots in a promptable way. To facilitate the exploitation of image-to-text pretrained knowledge, we follow the pixel-to-sequence paradigm to predict geometric coordinates in a sequence of tokens using a language decoder. We further propose an innovative Reinforcement Learning (RL) based algorithm to supervise the tokens in such a sequence jointly with visually semantic metrics, which effectively strengthens the spatial decoding capability of the pixel-to-sequence paradigm. Extensive experiments demonstrate that our proposed reinforced UI instruction grounding model outperforms the state-of-the-art methods by a clear margin and shows potential as a generic UI task automation API.
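
As a rough illustration of the pixel-to-sequence idea mentioned in the abstract, the sketch below quantizes a bounding box into a short sequence of integer tokens and decodes it back. It is a minimal sketch under stated assumptions, not the paper's implementation: the bin count `NUM_BINS`, the function names, and the use of IoU as a stand-in for a "visually semantic metric" are all hypothetical choices made for illustration.

```python
from typing import List, Sequence

NUM_BINS = 1000  # assumed number of coordinate bins (hypothetical choice)


def box_to_tokens(box: Sequence[float], width: int, height: int,
                  num_bins: int = NUM_BINS) -> List[int]:
    """Quantize a pixel box (x0, y0, x1, y1) into four integer tokens."""
    sizes = (width, height, width, height)
    tokens = []
    for value, size in zip(box, sizes):
        frac = min(max(value / size, 0.0), 1.0)  # normalize to [0, 1]
        tokens.append(min(int(frac * num_bins), num_bins - 1))
    return tokens


def tokens_to_box(tokens: Sequence[int], width: int, height: int,
                  num_bins: int = NUM_BINS) -> List[float]:
    """Map tokens back to pixel coordinates using the center of each bin."""
    sizes = (width, height, width, height)
    return [(t + 0.5) / num_bins * s for t, s in zip(tokens, sizes)]


def iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix1 - ix0, 0.0) * max(iy1 - iy0, 0.0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    target = [120.0, 48.0, 380.0, 96.0]  # made-up UI element on a 1280x720 screenshot
    tokens = box_to_tokens(target, 1280, 720)
    decoded = tokens_to_box(tokens, 1280, 720)
    print(tokens, decoded, iou(target, decoded))
```

A token-wise loss treats each of the four tokens independently, whereas a box-level score such as `iou` depends on all of them at once. That difference is one way to read the abstract's motivation for supervising the tokens jointly.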

Adaptive Frequency Filters As Efficient Global Token Mixers

Jul 26, 2023

Zhipeng Huang, Zhizheng Zhang, Cuiling Lan, Zheng-Jun Zha, Yan Lu, Baining Guo

Abstract: Recent vision transformers, large-kernel CNNs and MLPs have attained remarkable successes in broad vision tasks thanks to their effective information fusion in the global scope. However, their efficient deployment, especially on mobile devices, still suffers from noteworthy challenges due to the heavy computational costs of self-attention mechanisms, large kernels, or fully connected layers. In this work, we apply the conventional convolution theorem to deep learning to address this and reveal that adaptive frequency filters can serve as efficient global token mixers. With this insight, we propose the Adaptive Frequency Filtering (AFF) token mixer. This neural operator transfers a latent representation to the frequency domain via a Fourier transform and performs semantic-adaptive frequency filtering via an elementwise multiplication, which is mathematically equivalent to a token mixing operation in the original latent space with a dynamic convolution kernel as large as the spatial resolution of this latent representation. We take AFF token mixers as primary neural operators to build a lightweight neural network, dubbed AFFNet. Extensive experiments demonstrate the effectiveness of our proposed AFF token mixer and show that AFFNet achieves superior accuracy and efficiency trade-offs compared to other lightweight network designs on broad visual tasks, including visual recognition and dense prediction tasks.

* Accepted by ICCV2023
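
The equivalence the abstract relies on (elementwise multiplication in the frequency domain equals a convolution with a kernel as large as the feature map) can be checked numerically. The NumPy sketch below is only an illustration of the convolution theorem, not the AFF operator itself: the kernel `k` is a fixed random stand-in for the filter that AFF would predict adaptively from the input, and the function names are hypothetical.

```python
import numpy as np


def circular_conv2d(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Direct circular convolution with a kernel as large as the input."""
    H, W = x.shape
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            for u in range(H):
                for v in range(W):
                    # Indices wrap around, which is what the DFT assumes.
                    out[i, j] += k[u, v] * x[(i - u) % H, (j - v) % W]
    return out


def frequency_filter(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Same mixing done as an elementwise product in the frequency domain."""
    return np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(k)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 8))  # toy single-channel latent map
    k = rng.standard_normal((8, 8))  # stand-in for an adaptive global kernel
    print(np.allclose(circular_conv2d(x, k), frequency_filter(x, k)))
```

The direct version costs on the order of (HW)^2 multiplications, while the FFT route costs on the order of HW log(HW), which is the efficiency argument behind using frequency filters as global token mixers.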

When and Why Momentum Accelerates SGD: An Empirical Study

Jun 15, 2023

Jingwen Fu, Bohan Wang, Huishuai Zhang, Zhizheng Zhang, Wei Chen, Nanning Zheng

Abstract: Momentum has become a crucial component in deep learning optimizers, necessitating a comprehensive understanding of when and why it accelerates stochastic gradient descent (SGD). To address the question of "when", we establish a meaningful comparison framework that examines the performance of SGD with Momentum (SGDM) under the effective learning rate $\eta_{ef}$, a notion unifying the influence of momentum coefficient $\mu$ and batch size $b$ over learning rate $\eta$. In the comparison of SGDM and SGD with the same effective learning rate and the same batch size, we observe a consistent pattern: when $\eta_{ef}$ is small, SGDM and SGD experience almost the same empirical training losses; when $\eta_{ef}$ surpasses a certain threshold, SGDM begins to perform better than SGD. Furthermore, we observe that the advantage of SGDM over SGD becomes more pronounced with a larger batch size. For the question of "why", we find that the momentum acceleration is closely related to abrupt sharpening, which describes a sudden jump of the directional Hessian along the update direction. Specifically, the misalignment between SGD and SGDM happens at the same moment that SGD experiences abrupt sharpening and converges more slowly. Momentum improves the performance of SGDM by preventing or deferring the occurrence of abrupt sharpening. Together, this study unveils the interplay between momentum, learning rates, and batch sizes, thus improving our understanding of momentum acceleration.

Responsible Task Automation: Empowering Large Language Models as Responsible Task Automators

Jun 02, 2023

Zhizheng Zhang, Xiaoyi Zhang, Wenxuan Xie, Yan Lu

Abstract: The recent success of Large Language Models (LLMs) signifies an impressive stride towards artificial general intelligence. They have shown a promising prospect in automatically completing tasks upon user instructions, functioning as brain-like coordinators. The associated risks will be revealed as we delegate an increasing number of tasks to machines for automated completion. A big question emerges: how can we make machines behave responsibly when helping humans automate tasks as personal copilots? In this paper, we explore this question in depth from the perspectives of feasibility, completeness and security. Specifically, we present Responsible Task Automation (ResponsibleTA) as a fundamental framework to facilitate responsible collaboration between LLM-based coordinators and executors for task automation with three empowered capabilities: 1) predicting the feasibility of the commands for executors; 2) verifying the completeness of executors; 3) enhancing the security (e.g., the protection of users' privacy). We further propose and compare two paradigms for implementing the first two capabilities. One is to leverage the generic knowledge of LLMs themselves via prompt engineering, while the other is to adopt domain-specific learnable models. Moreover, we introduce a local memory mechanism for achieving the third capability. We evaluate our proposed ResponsibleTA on UI task automation and hope it could bring more attention to making LLMs more responsible in diverse scenarios. The research project homepage is at https://task-automation-research.github.io/responsible_task_automation.

Learning Trajectories are Generalization Indicators

May 04, 2023

Jingwen Fu, Zhizheng Zhang, Dacheng Yin, Yan Lu, Nanning Zheng

Abstract: The aim of this paper is to investigate the connection between the learning trajectories of Deep Neural Networks (DNNs) and their corresponding generalization capabilities when optimized with the widely used gradient descent and stochastic gradient descent algorithms. In this paper, we construct a Linear Approximation Function to model the trajectory information, and based on it we propose a new generalization bound with richer trajectory information. Our proposed generalization bound relies on the complexity of the learning trajectory and the ratio between the bias and diversity of the training set. Experimental results indicate that the proposed method effectively captures the generalization trend across various training steps, learning rates, and label noise levels.

MRVM-NeRF: Mask-Based Pretraining for Neural Radiance Fields

Apr 11, 2023

Ganlin Yang, Guoqiang Wei, Zhizheng Zhang, Yan Lu, Dong Liu

Abstract: Most Neural Radiance Fields (NeRFs) have poor generalization ability, limiting their application when representing multiple scenes by a single model. To ameliorate this problem, existing methods simply condition NeRF models on image features, lacking the global understanding and modeling of the entire 3D scene. Inspired by the significant success of mask-based modeling in other research fields, we propose a masked ray and view modeling method for generalizable NeRF (MRVM-NeRF), the first attempt to incorporate mask-based pretraining into 3D implicit representations. Specifically, considering that the core of NeRFs lies in modeling 3D representations along the rays and across the views, we randomly mask a proportion of sampled points along the ray at the fine stage by discarding partial information obtained from multiple viewpoints, with the goal of predicting the corresponding features produced in the coarse branch. In this way, the prior knowledge of 3D scenes learned during pretraining helps the model generalize better to novel scenarios after finetuning. Extensive experiments demonstrate the superiority of our proposed MRVM-NeRF under various synthetic and real-world settings, both qualitatively and quantitatively. Our empirical studies reveal the effectiveness of our proposed MRVM, which is specifically designed for NeRF models.

Unifying Layout Generation with a Decoupled Diffusion Model

Mar 09, 2023

Mude Hui, Zhizheng Zhang, Xiaoyi Zhang, Wenxuan Xie, Yuwang Wang, Yan Lu

Abstract: Layout generation aims to synthesize realistic graphic scenes consisting of elements with different attributes including category, size, position, and between-element relation. It is a crucial task for reducing the burden of heavy-duty graphic design work for formatted scenes, e.g., publications, documents, and user interfaces (UIs). Diverse application scenarios impose a big challenge in unifying various layout generation subtasks, including conditional and unconditional generation. In this paper, we propose a Layout Diffusion Generative Model (LDGM) to achieve such unification with a single decoupled diffusion model. LDGM views a layout of arbitrary missing or coarse element attributes as an intermediate diffusion status from a completed layout. Since different attributes have their individual semantics and characteristics, we propose to decouple the diffusion processes for them to improve the diversity of training samples and learn the reverse process jointly to exploit global-scope contexts for facilitating generation. As a result, our LDGM can generate layouts either from scratch or conditioned on arbitrary available attributes. Extensive qualitative and quantitative experiments demonstrate that our proposed LDGM outperforms existing layout generation models in both functionality and performance.

* Accepted by CVPR2023

Versatile Neural Processes for Learning Implicit Neural Representations

Jan 21, 2023

Zongyu Guo, Cuiling Lan, Zhizheng Zhang, Zhibo Chen, Yan Lu

Abstract: Representing a signal as a continuous function parameterized by a neural network (a.k.a. Implicit Neural Representations, INRs) has attracted increasing attention in recent years. Neural Processes (NPs), which model the distributions over functions conditioned on partial observations (context set), provide a practical solution for fast inference of continuous functions. However, existing NP architectures suffer from inferior modeling capability for complex signals. In this paper, we propose an efficient NP framework dubbed Versatile Neural Processes (VNP), which largely increases the capability of approximating functions. Specifically, we introduce a bottleneck encoder that produces fewer yet informative context tokens, relieving the high computational cost while providing high modeling capability. At the decoder side, we hierarchically learn multiple global latent variables that jointly model the global structure and the uncertainty of a function, enabling our model to capture the distribution of complex signals. We demonstrate the effectiveness of the proposed VNP on a variety of tasks involving 1D, 2D and 3D signals. In particular, our method shows promise in learning accurate INRs w.r.t. a 3D scene without further finetuning.

* Accepted by ICLR2023
