Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yuan Yao

Department of Mathematics, Hong Kong University of Science and Technology

Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation

Jan 09, 2025

Yingjie Chen, Yifang Men, Yuan Yao, Miaomiao Cui, Liefeng Bo

Figure 1 for Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation

Figure 2 for Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation

Figure 3 for Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation

Figure 4 for Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation

Abstract:Motion-controllable image animation is a fundamental task with a wide range of potential applications. Recent works have made progress in controlling camera or object motion via various motion representations, while they still struggle to support collaborative camera and object motion control with adaptive control granularity. To this end, we introduce 3D-aware motion representation and propose an image animation framework, called Perception-as-Control, to achieve fine-grained collaborative motion control. Specifically, we construct 3D-aware motion representation from a reference image, manipulate it based on interpreted user intentions, and perceive it from different viewpoints. In this way, camera and object motions are transformed into intuitive, consistent visual changes. Then, the proposed framework leverages the perception results as motion control signals, enabling it to support various motion-related video synthesis tasks in a unified and flexible way. Experiments demonstrate the superiority of the proposed framework. For more details and qualitative results, please refer to our project webpage: https://chen-yingjie.github.io/projects/Perception-as-Control.

Via

Access Paper or Ask Questions

LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer

Dec 18, 2024

Yipeng Zhang, Yifan Liu, Zonghao Guo, Yidan Zhang, Xuesong Yang, Chi Chen, Jun Song, Bo Zheng, Yuan Yao, Zhiyuan Liu(+2 more)

Abstract:In multimodal large language models (MLLMs), vision transformers (ViTs) are widely employed for visual encoding. However, their performance in solving universal MLLM tasks is not satisfactory. We attribute it to a lack of information from diverse visual levels, impeding alignment with the various semantic granularity required for language generation. To address this issue, we present LLaVA-UHD v2, an advanced MLLM centered around a Hierarchical window transformer that enables capturing diverse visual granularity by constructing and integrating a high-resolution feature pyramid. As a vision-language projector, Hiwin transformer comprises two primary modules: (i) an inverse feature pyramid, constructed by a ViT-derived feature up-sampling process utilizing high-frequency details from an image pyramid, and (ii) hierarchical window attention, focusing on a set of key sampling features within cross-scale windows to condense multi-level feature maps. Extensive experiments demonstrate that LLaVA-UHD v2 achieves superior performance over existing MLLMs on popular benchmarks. Notably, our design brings an average boost of 3.7% across 14 benchmarks compared with the baseline method, 9.3% on DocVQA for instance. We make all the data, model checkpoint, and code publicly available to facilitate future research.

Via

Access Paper or Ask Questions

Neuro-Symbolic Data Generation for Math Reasoning

Dec 06, 2024

Zenan Li, Zhi Zhou, Yuan Yao, Yu-Feng Li, Chun Cao, Fan Yang, Xian Zhang, Xiaoxing Ma

Figure 1 for Neuro-Symbolic Data Generation for Math Reasoning

Figure 2 for Neuro-Symbolic Data Generation for Math Reasoning

Figure 3 for Neuro-Symbolic Data Generation for Math Reasoning

Figure 4 for Neuro-Symbolic Data Generation for Math Reasoning

Abstract:A critical question about Large Language Models (LLMs) is whether their apparent deficiency in mathematical reasoning is inherent, or merely a result of insufficient exposure to high-quality mathematical data. To explore this, we developed an automated method for generating high-quality, supervised mathematical datasets. The method carefully mutates existing math problems, ensuring both diversity and validity of the newly generated problems. This is achieved by a neuro-symbolic data generation framework combining the intuitive informalization strengths of LLMs, and the precise symbolic reasoning of math solvers along with projected Markov chain Monte Carlo sampling in the highly-irregular symbolic space. Empirical experiments demonstrate the high quality of data generated by the proposed method, and that the LLMs, specifically LLaMA-2 and Mistral, when realigned with the generated data, surpass their state-of-the-art counterparts.

* Published as a conference paper at NeurIPS 2024

Via

Access Paper or Ask Questions

Large Language Models show both individual and collective creativity comparable to humans

Dec 04, 2024

Luning Sun, Yuzhuo Yuan, Yuan Yao, Yanyan Li, Hao Zhang, Xing Xie, Xiting Wang, Fang Luo, David Stillwell

Figure 1 for Large Language Models show both individual and collective creativity comparable to humans

Figure 2 for Large Language Models show both individual and collective creativity comparable to humans

Figure 3 for Large Language Models show both individual and collective creativity comparable to humans

Figure 4 for Large Language Models show both individual and collective creativity comparable to humans

Abstract:Artificial intelligence has, so far, largely automated routine tasks, but what does it mean for the future of work if Large Language Models (LLMs) show creativity comparable to humans? To measure the creativity of LLMs holistically, the current study uses 13 creative tasks spanning three domains. We benchmark the LLMs against individual humans, and also take a novel approach by comparing them to the collective creativity of groups of humans. We find that the best LLMs (Claude and GPT-4) rank in the 52nd percentile against humans, and overall LLMs excel in divergent thinking and problem solving but lag in creative writing. When questioned 10 times, an LLM's collective creativity is equivalent to 8-10 humans. When more responses are requested, two additional responses of LLMs equal one extra human. Ultimately, LLMs, when optimally applied, may compete with a small group of humans in the future of work.

Via

Access Paper or Ask Questions

ENAT: Rethinking Spatial-temporal Interactions in Token-based Image Synthesis

Nov 11, 2024

Zanlin Ni, Yulin Wang, Renping Zhou, Yizeng Han, Jiayi Guo, Zhiyuan Liu, Yuan Yao, Gao Huang

Figure 1 for ENAT: Rethinking Spatial-temporal Interactions in Token-based Image Synthesis

Figure 2 for ENAT: Rethinking Spatial-temporal Interactions in Token-based Image Synthesis

Figure 3 for ENAT: Rethinking Spatial-temporal Interactions in Token-based Image Synthesis

Figure 4 for ENAT: Rethinking Spatial-temporal Interactions in Token-based Image Synthesis

Abstract:Recently, token-based generation have demonstrated their effectiveness in image synthesis. As a representative example, non-autoregressive Transformers (NATs) can generate decent-quality images in a few steps. NATs perform generation in a progressive manner, where the latent tokens of a resulting image are incrementally revealed. At each step, the unrevealed image regions are padded with mask tokens and inferred by NAT. In this paper, we delve into the mechanisms behind the effectiveness of NATs and uncover two important patterns that naturally emerge from NATs: Spatially (within a step), although mask and visible tokens are processed uniformly by NATs, the interactions between them are highly asymmetric. In specific, mask tokens mainly gather information for decoding, while visible tokens tend to primarily provide information, and their deep representations can be built only upon themselves. Temporally (across steps), the interactions between adjacent generation steps mostly concentrate on updating the representations of a few critical tokens, while the computation for the majority of tokens is generally repetitive. Driven by these findings, we propose EfficientNAT (ENAT), a NAT model that explicitly encourages these critical interactions inherent in NATs. At the spatial level, we disentangle the computations of visible and mask tokens by encoding visible tokens independently, while decoding mask tokens conditioned on the fully encoded visible tokens. At the temporal level, we prioritize the computation of the critical tokens at each step, while maximally reusing previously computed token representations to supplement necessary information. ENAT improves the performance of NATs notably with significantly reduced computational cost. Experiments on ImageNet-256, ImageNet-512 and MS-COCO validate the effectiveness of ENAT. Code is available at https://github.com/LeapLabTHU/ENAT.

* Accepted by NeurIPS2024

Via

Access Paper or Ask Questions

UniGAD: Unifying Multi-level Graph Anomaly Detection

Nov 10, 2024

Yiqing Lin, Jianheng Tang, Chenyi Zi, H. Vicky Zhao, Yuan Yao, Jia Li

Figure 1 for UniGAD: Unifying Multi-level Graph Anomaly Detection

Figure 2 for UniGAD: Unifying Multi-level Graph Anomaly Detection

Figure 3 for UniGAD: Unifying Multi-level Graph Anomaly Detection

Figure 4 for UniGAD: Unifying Multi-level Graph Anomaly Detection

Abstract:Graph Anomaly Detection (GAD) aims to identify uncommon, deviated, or suspicious objects within graph-structured data. Existing methods generally focus on a single graph object type (node, edge, graph, etc.) and often overlook the inherent connections among different object types of graph anomalies. For instance, a money laundering transaction might involve an abnormal account and the broader community it interacts with. To address this, we present UniGAD, the first unified framework for detecting anomalies at node, edge, and graph levels jointly. Specifically, we develop the Maximum Rayleigh Quotient Subgraph Sampler (MRQSampler) that unifies multi-level formats by transferring objects at each level into graph-level tasks on subgraphs. We theoretically prove that MRQSampler maximizes the accumulated spectral energy of subgraphs (i.e., the Rayleigh quotient) to preserve the most significant anomaly information. To further unify multi-level training, we introduce a novel GraphStitch Network to integrate information across different levels, adjust the amount of sharing required at each level, and harmonize conflicting training goals. Comprehensive experiments show that UniGAD outperforms both existing GAD methods specialized for a single task and graph prompt-based approaches for multiple tasks, while also providing robust zero-shot task transferability. All codes can be found at https://github.com/lllyyq1121/UniGAD.

* Accepted by NeurIPS 2024. All codes can be found at https://github.com/lllyyq1121/UniGAD

Via

Access Paper or Ask Questions

Autoregressive Models in Vision: A Survey

Nov 08, 2024

Jing Xiong, Gongye Liu, Lun Huang, Chengyue Wu, Taiqiang Wu, Yao Mu, Yuan Yao, Hui Shen, Zhongwei Wan, Jinfa Huang(+10 more)

Figure 1 for Autoregressive Models in Vision: A Survey

Figure 2 for Autoregressive Models in Vision: A Survey

Figure 3 for Autoregressive Models in Vision: A Survey

Figure 4 for Autoregressive Models in Vision: A Survey

Abstract:Autoregressive modeling has been a huge success in the field of natural language processing (NLP). Recently, autoregressive models have emerged as a significant area of focus in computer vision, where they excel in producing high-quality visual content. Autoregressive models in NLP typically operate on subword tokens. However, the representation strategy in computer vision can vary in different levels, \textit{i.e.}, pixel-level, token-level, or scale-level, reflecting the diverse and hierarchical nature of visual data compared to the sequential structure of language. This survey comprehensively examines the literature on autoregressive models applied to vision. To improve readability for researchers from diverse research backgrounds, we start with preliminary sequence representation and modeling in vision. Next, we divide the fundamental frameworks of visual autoregressive models into three general sub-categories, including pixel-based, token-based, and scale-based models based on the strategy of representation. We then explore the interconnections between autoregressive models and other generative models. Furthermore, we present a multi-faceted categorization of autoregressive models in computer vision, including image generation, video generation, 3D generation, and multi-modal generation. We also elaborate on their applications in diverse domains, including emerging domains such as embodied AI and 3D medical AI, with about 250 related references. Finally, we highlight the current challenges to autoregressive models in vision with suggestions about potential research directions. We have also set up a Github repository to organize the papers included in this survey at: \url{https://github.com/ChaofanTao/Autoregressive-Models-in-Vision-Survey}.

Via

Access Paper or Ask Questions

Neuro-symbolic Learning Yielding Logical Constraints

Oct 28, 2024

Zenan Li, Yunpeng Huang, Zhaoyu Li, Yuan Yao, Jingwei Xu, Taolue Chen, Xiaoxing Ma, Jian Lu

Figure 1 for Neuro-symbolic Learning Yielding Logical Constraints

Figure 2 for Neuro-symbolic Learning Yielding Logical Constraints

Figure 3 for Neuro-symbolic Learning Yielding Logical Constraints

Figure 4 for Neuro-symbolic Learning Yielding Logical Constraints

Abstract:Neuro-symbolic systems combine the abilities of neural perception and logical reasoning. However, end-to-end learning of neuro-symbolic systems is still an unsolved challenge. This paper proposes a natural framework that fuses neural network training, symbol grounding, and logical constraint synthesis into a coherent and efficient end-to-end learning process. The capability of this framework comes from the improved interactions between the neural and the symbolic parts of the system in both the training and inference stages. Technically, to bridge the gap between the continuous neural network and the discrete logical constraint, we introduce a difference-of-convex programming technique to relax the logical constraints while maintaining their precision. We also employ cardinality constraints as the language for logical constraint learning and incorporate a trust region method to avoid the degeneracy of logical constraint in learning. Both theoretical analyses and empirical evaluations substantiate the effectiveness of the proposed framework.

* Published as a conference paper at NeurIPS 2023, and code is available at [this url](https://github.com/Lizn-zn/Nesy-Programming)

Via

Access Paper or Ask Questions

Guide for Defense (G4D): Dynamic Guidance for Robust and Balanced Defense in Large Language Models

Oct 23, 2024

He Cao, Weidi Luo, Yu Wang, Zijing Liu, Bing Feng, Yuan Yao, Yu Li

Abstract:With the extensive deployment of Large Language Models (LLMs), ensuring their safety has become increasingly critical. However, existing defense methods often struggle with two key issues: (i) inadequate defense capabilities, particularly in domain-specific scenarios like chemistry, where a lack of specialized knowledge can lead to the generation of harmful responses to malicious queries. (ii) over-defensiveness, which compromises the general utility and responsiveness of LLMs. To mitigate these issues, we introduce a multi-agents-based defense framework, Guide for Defense (G4D), which leverages accurate external information to provide an unbiased summary of user intentions and analytically grounded safety response guidance. Extensive experiments on popular jailbreak attacks and benign datasets show that our G4D can enhance LLM's robustness against jailbreak attacks on general and domain-specific scenarios without compromising the model's general functionality.

Via

Access Paper or Ask Questions

Elucidating the design space of language models for image generation

Oct 21, 2024

Xuantong Liu, Shaozhe Hao, Xianbiao Qi, Tianyang Hu, Jun Wang, Rong Xiao, Yuan Yao

Figure 1 for Elucidating the design space of language models for image generation

Figure 2 for Elucidating the design space of language models for image generation

Figure 3 for Elucidating the design space of language models for image generation

Figure 4 for Elucidating the design space of language models for image generation

Abstract:The success of autoregressive (AR) language models in text generation has inspired the computer vision community to adopt Large Language Models (LLMs) for image generation. However, considering the essential differences between text and image modalities, the design space of language models for image generation remains underexplored. We observe that image tokens exhibit greater randomness compared to text tokens, which presents challenges when training with token prediction. Nevertheless, AR models demonstrate their potential by effectively learning patterns even from a seemingly suboptimal optimization problem. Our analysis also reveals that while all models successfully grasp the importance of local information in image generation, smaller models struggle to capture the global context. In contrast, larger models showcase improved capabilities in this area, helping to explain the performance gains achieved when scaling up model size. We further elucidate the design space of language models for vision generation, including tokenizer choice, model choice, model scalability, vocabulary design, and sampling strategy through extensive comparative experiments. Our work is the first to analyze the optimization behavior of language models in vision generation, and we believe it can inspire more effective designs when applying LMs to other domains. Finally, our elucidated language model for image generation, termed as ELM, achieves state-of-the-art performance on the ImageNet 256*256 benchmark. The code is available at https://github.com/Pepperlll/LMforImageGeneration.git.

* Project page: https://pepper-lll.github.io/LMforImageGeneration/

Via

Access Paper or Ask Questions