Zhizheng Zhang

Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API

Oct 07, 2023
Zhizheng Zhang, Wenxuan Xie, Xiaoyi Zhang, Yan Lu

The recent popularity of Large Language Models (LLMs) has opened countless possibilities for automating numerous AI tasks by connecting LLMs to various domain-specific models or APIs, where LLMs serve as dispatchers while domain-specific models or APIs act as action executors. Despite the vast number of domain-specific models/APIs, they still struggle to comprehensively cover the highly diverse automation demands that arise in interactions between humans and User Interfaces (UIs). In this work, we build a multimodal model that grounds natural language instructions in given UI screenshots as a generic UI task automation executor. This metadata-free grounding model, consisting of a visual encoder and a language decoder, is first pretrained on well-studied document understanding tasks and then learns to decode spatial information from UI screenshots in a promptable way. To facilitate the exploitation of image-to-text pretrained knowledge, we follow the pixel-to-sequence paradigm and predict geometric coordinates as a sequence of tokens using the language decoder. We further propose an innovative Reinforcement Learning (RL) based algorithm that supervises the tokens in such a sequence jointly with visually semantic metrics, which effectively strengthens the spatial decoding capability of the pixel-to-sequence paradigm. Extensive experiments demonstrate that our reinforced UI instruction grounding model outperforms state-of-the-art methods by a clear margin and shows its potential as a generic UI task automation API.
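
The pixel-to-sequence idea can be made concrete with a short sketch: coordinates are emitted as ordinary vocabulary tokens, and an RL objective rewards the whole decoded sequence with a visually semantic score such as IoU. The snippet below is a minimal illustration under assumed names (`decode_fn`, the token shapes), not the authors' implementation.

```python
# Minimal sketch of pixel-to-sequence coordinate decoding with an RL-style
# sequence reward; hypothetical shapes and decode_fn, not the paper's exact setup.
import torch
import torch.nn.functional as F

def box_iou(a, b):
    """IoU between two [x1, y1, x2, y2] boxes -- a 'visually semantic' reward proxy."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def reinforce_loss(token_logits, sampled_tokens, gt_box, decode_fn):
    """token_logits: [T, V]; sampled_tokens: [T] coordinate tokens sampled from
    the decoder. A sequence-level IoU reward supervises all tokens jointly,
    rather than relying on per-token cross-entropy alone."""
    log_probs = F.log_softmax(token_logits, dim=-1)
    seq_log_prob = log_probs.gather(1, sampled_tokens[:, None]).sum()
    reward = box_iou(decode_fn(sampled_tokens), gt_box)  # decode tokens -> box
    return -(reward * seq_log_prob)  # REINFORCE: maximize expected reward
```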

Adaptive Frequency Filters As Efficient Global Token Mixers

Jul 26, 2023
Zhipeng Huang, Zhizheng Zhang, Cuiling Lan, Zheng-Jun Zha, Yan Lu, Baining Guo

Recent vision transformers, large-kernel CNNs, and MLPs have attained remarkable successes on broad vision tasks thanks to their effective information fusion at global scope. However, their efficient deployment, especially on mobile devices, still suffers from notable challenges due to the heavy computational cost of self-attention mechanisms, large kernels, or fully connected layers. In this work, we apply the conventional convolution theorem to deep learning to address this issue and reveal that adaptive frequency filters can serve as efficient global token mixers. With this insight, we propose the Adaptive Frequency Filtering (AFF) token mixer. This neural operator transforms a latent representation into the frequency domain via a Fourier transform and performs semantic-adaptive frequency filtering via an elementwise multiplication, which is mathematically equivalent to a token-mixing operation in the original latent space with a dynamic convolution kernel as large as the spatial resolution of that latent representation. We take AFF token mixers as the primary neural operators to build a lightweight neural network, dubbed AFFNet. Extensive experiments demonstrate the effectiveness of the proposed AFF token mixer and show that AFFNet achieves superior accuracy-efficiency trade-offs compared to other lightweight network designs on broad visual tasks, including visual recognition and dense prediction.
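
To make the convolution-theorem argument concrete, here is a minimal PyTorch sketch of an adaptive frequency filter acting as a global token mixer: an FFT, an elementwise multiplication with a content-adaptive filter, and an inverse FFT. The module and layer choices are illustrative assumptions, not the paper's exact AFF design.

```python
# A minimal sketch of an adaptive frequency filter as a global token mixer;
# layer choices are assumptions, not the paper's exact AFF module.
import torch
import torch.nn as nn

class AdaptiveFrequencyFilter(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Lightweight network that predicts a content-adaptive filter per sample.
        self.to_filter = nn.Sequential(
            nn.Conv2d(channels * 2, channels * 2, 1), nn.ReLU(),
            nn.Conv2d(channels * 2, channels * 2, 1),
        )

    def forward(self, x):                         # x: [B, C, H, W]
        freq = torch.fft.rfft2(x, norm="ortho")   # to frequency domain
        f = torch.cat([freq.real, freq.imag], dim=1)
        f = self.to_filter(f) * f                 # semantic-adaptive elementwise filtering
        real, imag = f.chunk(2, dim=1)
        freq = torch.complex(real, imag)
        # An elementwise product in the frequency domain equals a global circular
        # convolution in the spatial domain (convolution theorem) -> global mixing.
        return torch.fft.irfft2(freq, s=x.shape[-2:], norm="ortho")
```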

* Accepted by ICCV2023 

When and Why Momentum Accelerates SGD: An Empirical Study

Jun 15, 2023
Jingwen Fu, Bohan Wang, Huishuai Zhang, Zhizheng Zhang, Wei Chen, Nanning Zheng

Momentum has become a crucial component of deep learning optimizers, necessitating a comprehensive understanding of when and why it accelerates stochastic gradient descent (SGD). To address the question of "when", we establish a meaningful comparison framework that examines the performance of SGD with Momentum (SGDM) under the effective learning rate $\eta_{ef}$, a notion unifying the influence of the momentum coefficient $\mu$ and the batch size $b$ on the learning rate $\eta$. Comparing SGDM and SGD with the same effective learning rate and the same batch size, we observe a consistent pattern: when $\eta_{ef}$ is small, SGDM and SGD incur almost the same empirical training losses; when $\eta_{ef}$ surpasses a certain threshold, SGDM begins to outperform SGD. Furthermore, the advantage of SGDM over SGD becomes more pronounced with larger batch sizes. For the question of "why", we find that momentum acceleration is closely related to abrupt sharpening, a sudden jump of the directional Hessian along the update direction. Specifically, the misalignment between SGD and SGDM happens at the same moment that SGD experiences abrupt sharpening and converges more slowly. Momentum improves performance by preventing or deferring the occurrence of abrupt sharpening. Together, these findings unveil the interplay between momentum, learning rates, and batch sizes, thus improving our understanding of momentum acceleration.
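
As a rough illustration, the snippet below matches SGD and SGDM at the same effective learning rate before comparing their training losses. The paper defines $\eta_{ef}$ precisely; here we assume the common convention $\eta_{ef} = \eta / ((1-\mu)\,b)$, which is our assumption rather than the paper's stated formula.

```python
# Illustrative sketch: configure SGD and SGDM at a matched effective learning rate.
# ASSUMPTION: eta_ef = eta / ((1 - mu) * b); the paper's exact definition may differ.
import torch

def matched_raw_lr(eta_ef, mu, batch_size):
    """Raw learning rate eta that yields the target effective rate."""
    return eta_ef * (1 - mu) * batch_size

w_sgd  = [torch.nn.Parameter(torch.randn(10))]
w_sgdm = [torch.nn.Parameter(torch.randn(10))]
eta_ef, b = 0.01, 128
sgd  = torch.optim.SGD(w_sgd,  lr=matched_raw_lr(eta_ef, mu=0.0, batch_size=b))
sgdm = torch.optim.SGD(w_sgdm, lr=matched_raw_lr(eta_ef, mu=0.9, batch_size=b),
                       momentum=0.9)
# Train both with batch size b and compare training losses: per the paper, the
# curves nearly coincide for small eta_ef and diverge once a threshold is crossed.
```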

Responsible Task Automation: Empowering Large Language Models as Responsible Task Automators

Jun 02, 2023
Zhizheng Zhang, Xiaoyi Zhang, Wenxuan Xie, Yan Lu

The recent success of Large Language Models (LLMs) signifies an impressive stride towards artificial general intelligence. They have shown promise in automatically completing tasks upon user instructions, functioning as brain-like coordinators. The associated risks will be revealed as we delegate an increasing number of tasks to machines for automated completion. A big question thus emerges: how can we make machines behave responsibly when helping humans automate tasks as personal copilots? In this paper, we explore this question in depth from the perspectives of feasibility, completeness, and security. Specifically, we present Responsible Task Automation (ResponsibleTA) as a fundamental framework to facilitate responsible collaboration between LLM-based coordinators and executors for task automation, with three empowered capabilities: 1) predicting the feasibility of commands for executors; 2) verifying the completeness of executors; 3) enhancing security (e.g., protecting users' privacy). We further propose and compare two paradigms for implementing the first two capabilities: one leverages the generic knowledge of LLMs themselves via prompt engineering, while the other adopts domain-specific learnable models. Moreover, we introduce a local memory mechanism for achieving the third capability. We evaluate the proposed ResponsibleTA on UI task automation and hope it can bring more attention to making LLMs more responsible in diverse scenarios. The research project homepage is at https://task-automation-research.github.io/responsible_task_automation.
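
A hypothetical control loop can illustrate how the three capabilities fit together; every function name below is a placeholder of ours, not the paper's API.

```python
# Hypothetical sketch of a coordinator/executor loop with ResponsibleTA-style
# checks; all names are placeholders, not the paper's actual interfaces.
def responsible_automation(instruction, coordinator, executor,
                           feasibility_checker, completeness_verifier, local_memory):
    # 3) Security: privacy-sensitive content stays in a local memory instead of
    #    being exposed to the (possibly remote) coordinator.
    context = local_memory.redact(instruction)
    for command in coordinator.plan(context):
        # 1) Feasibility: predict whether the executor can carry out the command.
        if not feasibility_checker(command):
            coordinator.feedback(f"infeasible: {command}")
            continue
        result = executor.run(local_memory.restore(command))
        # 2) Completeness: verify the executor actually finished the step.
        if not completeness_verifier(command, result):
            coordinator.feedback(f"incomplete: {command}")
    return local_memory.summary()
```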

Learning Trajectories are Generalization Indicators

May 04, 2023
Jingwen Fu, Zhizheng Zhang, Dacheng Yin, Yan Lu, Nanning Zheng

This paper investigates the connection between the learning trajectories of Deep Neural Networks (DNNs) and their generalization capabilities when optimized with the widely used gradient descent and stochastic gradient descent algorithms. We construct a linear approximation function to model the trajectory information and, based on it, propose a new generalization bound that incorporates richer trajectory information. The proposed bound depends on the complexity of the learning trajectory and the ratio between the bias and the diversity of the training set. Experimental results indicate that the proposed method effectively captures the generalization trend across various training steps, learning rates, and label noise levels.
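
As a toy illustration of using trajectory information, the sketch below records the parameter trajectory during training and computes its path length as a crude complexity proxy; the paper's trajectory-based bound is considerably more refined than this.

```python
# Toy proxy only: accumulate the parameter path length over training steps.
# This is our own illustration of "trajectory complexity", not the paper's bound.
import torch

def trajectory_path_length(model, loss_fn, loader, optimizer, steps):
    prev = torch.nn.utils.parameters_to_vector(model.parameters()).detach().clone()
    path_length = 0.0
    data = iter(loader)
    for _ in range(steps):
        x, y = next(data)
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
        cur = torch.nn.utils.parameters_to_vector(model.parameters()).detach()
        path_length += torch.norm(cur - prev).item()  # length of this update step
        prev = cur.clone()
    return path_length
```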

MRVM-NeRF: Mask-Based Pretraining for Neural Radiance Fields

Apr 11, 2023
Ganlin Yang, Guoqiang Wei, Zhizheng Zhang, Yan Lu, Dong Liu

Most Neural Radiance Fields (NeRFs) generalize poorly, which limits their applicability when a single model must represent multiple scenes. To ameliorate this problem, existing methods simply condition NeRF models on image features, lacking a global understanding and modeling of the entire 3D scene. Inspired by the significant success of mask-based modeling in other research fields, we propose a masked ray and view modeling method for generalizable NeRF (MRVM-NeRF), the first attempt to incorporate mask-based pretraining into 3D implicit representations. Specifically, since the core of NeRFs lies in modeling 3D representations along rays and across views, we randomly mask a proportion of the sampled points along each ray at the fine stage by discarding part of the information obtained from multiple viewpoints, with the goal of predicting the corresponding features produced by the coarse branch. In this way, the prior knowledge of 3D scenes learned during pretraining helps the model generalize better to novel scenarios after finetuning. Extensive experiments demonstrate the superiority of the proposed MRVM-NeRF under various synthetic and real-world settings, both qualitatively and quantitatively, and our empirical studies confirm the effectiveness of MRVM as a pretraining scheme specifically designed for NeRF models.
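
A minimal sketch of the masking objective, under assumed tensor names: a fraction of fine-stage point features along each ray is masked, and a predictor is trained to reconstruct the corresponding coarse-branch features at the masked locations.

```python
# Minimal sketch of masked ray and view modeling; tensor names and the predictor
# are assumptions for illustration, not the paper's exact architecture.
import torch
import torch.nn.functional as F

def mrvm_loss(fine_feats, coarse_feats, predictor, mask_ratio=0.5):
    """fine_feats, coarse_feats: [num_rays, num_points, dim]."""
    n_rays, n_pts, _ = fine_feats.shape
    mask = torch.rand(n_rays, n_pts, device=fine_feats.device) < mask_ratio
    masked = fine_feats.masked_fill(mask[..., None], 0.0)  # drop masked points
    pred = predictor(masked)                               # reconstruct features
    # Supervise only the masked locations against the coarse branch's features.
    return F.mse_loss(pred[mask], coarse_feats.detach()[mask])
```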

Unifying Layout Generation with a Decoupled Diffusion Model

Mar 09, 2023
Mude Hui, Zhizheng Zhang, Xiaoyi Zhang, Wenxuan Xie, Yuwang Wang, Yan Lu

Layout generation aims to synthesize realistic graphic scenes consisting of elements with different attributes, including category, size, position, and inter-element relations. It is a crucial task for reducing the burden of heavy-duty graphic design work for formatted scenes, e.g., publications, documents, and user interfaces (UIs). The diversity of application scenarios poses a major challenge to unifying the various layout generation subtasks, including conditional and unconditional generation. In this paper, we propose a Layout Diffusion Generative Model (LDGM) to achieve such unification with a single decoupled diffusion model. LDGM views a layout with arbitrary missing or coarse element attributes as an intermediate diffusion state of a complete layout. Since different attributes have their own semantics and characteristics, we propose to decouple their diffusion processes to improve the diversity of training samples, while learning the reverse process jointly to exploit global-scope contexts that facilitate generation. As a result, LDGM can generate layouts either from scratch or conditioned on an arbitrary set of available attributes. Extensive qualitative and quantitative experiments demonstrate that the proposed LDGM outperforms existing layout generation models in both functionality and performance.
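
The decoupling can be illustrated with a toy corruption step in which a discrete attribute (category) and a continuous attribute (bounding box) follow separate forward processes with their own timesteps; the schedules and names below are illustrative assumptions, not LDGM's exact design.

```python
# Toy sketch of attribute-decoupled corruption: each attribute group has its own
# forward process and timestep, while one denoiser learns the reverse jointly.
import torch

def corrupt_layout(categories, boxes, t_cat, t_box, num_classes, betas):
    # Discrete attribute: randomly resample categories with a t_cat-dependent prob.
    flip = torch.rand_like(categories.float()) < betas[t_cat]
    noisy_cat = torch.where(flip, torch.randint_like(categories, num_classes),
                            categories)
    # Continuous attribute: Gaussian corruption with its own timestep t_box.
    alpha_bar = torch.cumprod(1 - betas, dim=0)[t_box]
    noisy_box = alpha_bar.sqrt() * boxes + (1 - alpha_bar).sqrt() * torch.randn_like(boxes)
    return noisy_cat, noisy_box

# A single transformer denoiser would then take (noisy_cat, noisy_box, t_cat, t_box)
# and predict the clean attributes jointly, exploiting global layout context.
```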

* Accepted by CVPR2023 

Versatile Neural Processes for Learning Implicit Neural Representations

Jan 21, 2023
Zongyu Guo, Cuiling Lan, Zhizheng Zhang, Zhibo Chen, Yan Lu

Representing a signal as a continuous function parameterized by a neural network (a.k.a. Implicit Neural Representations, INRs) has attracted increasing attention in recent years. Neural Processes (NPs), which model distributions over functions conditioned on partial observations (the context set), provide a practical solution for fast inference of continuous functions. However, existing NP architectures suffer from limited modeling capability for complex signals. In this paper, we propose an efficient NP framework dubbed Versatile Neural Processes (VNP), which largely increases the capability of approximating functions. Specifically, we introduce a bottleneck encoder that produces fewer but more informative context tokens, relieving the high computational cost while providing high modeling capability. On the decoder side, we hierarchically learn multiple global latent variables that jointly model the global structure and the uncertainty of a function, enabling the model to capture the distribution of complex signals. We demonstrate the effectiveness of the proposed VNP on a variety of tasks involving 1D, 2D, and 3D signals. In particular, our method shows promise in learning accurate INRs of 3D scenes without further finetuning.
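
A minimal sketch of the bottleneck idea, with hypothetical hyperparameters: a small set of learned latent tokens cross-attends to the (possibly large) context set, so downstream computation runs on far fewer tokens.

```python
# Sketch of a bottleneck encoder producing fewer but more informative tokens;
# hyperparameters are assumptions, and the real VNP adds hierarchical latents.
import torch
import torch.nn as nn

class BottleneckEncoder(nn.Module):
    def __init__(self, dim, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, context_tokens):             # [B, N_context, dim], N_context large
        q = self.latents.expand(context_tokens.size(0), -1, -1)
        # Cost scales with num_latents * N_context instead of N_context ** 2.
        out, _ = self.attn(q, context_tokens, context_tokens)
        return out                                  # [B, num_latents, dim]
```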

* Accepted by ICLR2023 

Image Coding for Machines with Omnipotent Feature Learning

Jul 07, 2022
Ruoyu Feng, Xin Jin, Zongyu Guo, Runsen Feng, Yixin Gao, Tianyu He, Zhizheng Zhang, Simeng Sun, Zhibo Chen

Image Coding for Machines (ICM) aims to compress images for AI task analysis rather than for human perception. Learning features that are both general (for AI tasks) and compact (for compression) is pivotal to its success. In this paper, we attempt to develop an ICM framework by learning universal features while also accounting for compression; we name such features omnipotent features and the corresponding framework Omni-ICM. Since self-supervised learning (SSL) improves feature generalization, we integrate it with the compression task into the Omni-ICM framework to learn omnipotent features. However, it is non-trivial to coordinate semantics modeling in SSL with redundancy removal in compression, so we design a novel information filtering (IF) module between them that co-optimizes instance discrimination and entropy minimization to adaptively drop information weakly related to AI tasks (e.g., some texture redundancy). Unlike previous task-specific solutions, Omni-ICM can directly support AI task analysis based on the learned omnipotent features without joint training or extra transformation. Albeit simple and intuitive, Omni-ICM significantly outperforms existing traditional and learning-based codecs on multiple fundamental vision tasks.
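
A sketch of the co-optimization behind information filtering: an instance-discrimination (InfoNCE) term preserves task-relevant semantics while an entropy/rate term pushes the coding cost down. The loss below is a placeholder of ours, not the paper's exact IF module.

```python
# Illustrative co-optimization of instance discrimination and rate; a placeholder
# loss, not the paper's exact information filtering (IF) module.
import torch
import torch.nn.functional as F

def omni_icm_loss(z1, z2, bits_per_pixel, temperature=0.1, lam=0.01):
    """z1, z2: [B, D] features of two views of the same images;
    bits_per_pixel: [B] entropy-model estimate of the coding cost."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature            # [B, B] similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    info_nce = F.cross_entropy(logits, targets)   # keep instance-level semantics
    rate = bits_per_pixel.mean()                  # drop weakly task-related bits
    return info_nce + lam * rate
```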

* Accepted by ECCV2022 