Yutong Feng

Zero-shot Image Editing with Reference Imitation

Jun 11, 2024

Xi Chen, Yutong Feng, Mengting Chen, Yiyang Wang, Shilong Zhang, Yu Liu, Yujun Shen, Hengshuang Zhao

Abstract: Image editing serves as a practical yet challenging task considering the diverse demands from users, where one of the hardest parts is to precisely describe what the edited image should look like. In this work, we present a new form of editing, termed imitative editing, to help users exercise their creativity more conveniently. Concretely, to edit an image region of interest, users are free to directly draw inspiration from some in-the-wild references (e.g., related pictures they come across online), without having to cope with the fit between the reference and the source. Such a design requires the system to automatically figure out what to expect from the reference to perform the editing. For this purpose, we propose a generative training framework, dubbed MimicBrush, which randomly selects two frames from a video clip, masks some regions of one frame, and learns to recover the masked regions using the information from the other frame. That way, our model, developed from a diffusion prior, is able to capture the semantic correspondence between separate images in a self-supervised manner. We experimentally show the effectiveness of our method under various test cases as well as its superiority over existing alternatives. We also construct a benchmark to facilitate further research.

* https://xavierchen34.github.io/MimicBrush-Page
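The self-supervised sampling step described in the MimicBrush abstract (pick two frames from one clip, mask regions of one, recover them using the other) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `sample_training_pair`, the grid-cell masking scheme, and the `mask_ratio` parameter are hypothetical and are not taken from the paper, and frames are plain nested lists rather than image tensors.

```python
import random


def sample_training_pair(frames, grid=4, mask_ratio=0.5, rng=None):
    """Build one (masked source, reference, target) example from a video clip.

    Assumption: masking is done on a uniform grid of cells; the actual
    masking strategy used by the paper is not specified in the abstract.
    """
    rng = rng or random.Random()
    # Pick two distinct frames: one to be masked, one to serve as reference.
    src_idx, ref_idx = rng.sample(range(len(frames)), 2)
    source, reference = frames[src_idx], frames[ref_idx]
    height, width = len(source), len(source[0])

    # Choose which grid cells to hide.
    cells = [(r, c) for r in range(grid) for c in range(grid)]
    masked_cells = set(rng.sample(cells, int(len(cells) * mask_ratio)))

    # Pixel-level mask: 1 where the source pixel is hidden, 0 elsewhere.
    mask = [
        [1 if (y * grid // height, x * grid // width) in masked_cells else 0
         for x in range(width)]
        for y in range(height)
    ]
    # Hidden pixels are zeroed; the model would be trained to restore them.
    masked_source = [
        [0 if mask[y][x] else source[y][x] for x in range(width)]
        for y in range(height)
    ]
    return {
        "masked_source": masked_source,
        "mask": mask,
        "reference": reference,
        "target": source,
    }
```

In an actual training loop, a model would receive `masked_source`, `mask`, and `reference`, and be penalized on how well it reconstructs `target` inside the masked region.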

FlashFace: Human Image Personalization with High-fidelity Identity Preservation

Mar 25, 2024

Shilong Zhang, Lianghua Huang, Xi Chen, Yifei Zhang, Zhi-Fan Wu, Yutong Feng, Wei Wang, Yujun Shen, Yu Liu, Ping Luo

Abstract: This work presents FlashFace, a practical tool with which users can easily personalize their own photos on the fly by providing one or a few reference face images and a text prompt. Our approach is distinguishable from existing human photo customization methods by higher-fidelity identity preservation and better instruction following, benefiting from two subtle designs. First, we encode the face identity into a series of feature maps instead of one image token as in prior art, allowing the model to retain more details of the reference faces (e.g., scars, tattoos, and face shape). Second, we introduce a disentangled integration strategy to balance the text and image guidance during the text-to-image generation process, alleviating the conflict between the reference faces and the text prompts (e.g., personalizing an adult into a "child" or an "elder"). Extensive experimental results demonstrate the effectiveness of our method on various applications, including human image personalization, face swapping under language prompts, making virtual characters into real people, etc. Project Page: https://jshilong.github.io/flashface-page.

* Project Page: https://jshilong.github.io/flashface-page

Spatio-Temporal Field Neural Networks for Air Quality Inference

Mar 02, 2024

Yutong Feng, Qiongyan Wang, Yutong Xia, Junlin Huang, Siru Zhong, Kun Wang, Shifen Cheng, Yuxuan Liang

Abstract: The air quality inference problem aims to utilize historical data from a limited number of observation sites to infer the air quality index at an unknown location. Considering the sparsity of data due to the high maintenance cost of the stations, good inference algorithms can effectively save the cost and refine the data granularity. While spatio-temporal graph neural networks have made excellent progress on this problem, their non-Euclidean and discrete data structure modeling of reality limits their potential. In this work, we make the first attempt to combine two different spatio-temporal perspectives, fields and graphs, by proposing a new model, Spatio-Temporal Field Neural Network, and its corresponding new framework, Pyramidal Inference. Extensive experiments validate that our model achieves state-of-the-art performance in nationwide air quality inference in the Chinese Mainland, demonstrating the superiority of our proposed model and framework.

LivePhoto: Real Image Animation with Text-guided Motion Control

Dec 05, 2023

Xi Chen, Zhiheng Liu, Mengting Chen, Yutong Feng, Yu Liu, Yujun Shen, Hengshuang Zhao

Abstract: Despite the recent progress in text-to-video generation, existing studies usually overlook the issue that only spatial contents but not temporal motions in synthesized videos are under the control of text. To address this challenge, this work presents a practical system, named LivePhoto, which allows users to animate an image of their interest with text descriptions. We first establish a strong baseline that helps a well-learned text-to-image generator (i.e., Stable Diffusion) take an image as a further input. We then equip the improved generator with a motion module for temporal modeling and propose a carefully designed training pipeline to better link texts and motions. In particular, considering the facts that (1) text can only describe motions roughly (e.g., regardless of the moving speed) and (2) text may include both content and motion descriptions, we introduce a motion intensity estimation module as well as a text re-weighting module to reduce the ambiguity of text-to-motion mapping. Empirical evidence suggests that our approach is capable of well decoding motion-related textual instructions into videos, such as actions, camera movements, or even conjuring new contents from thin air (e.g., pouring water into an empty glass). Interestingly, thanks to the proposed intensity learning mechanism, our system offers users an additional control signal (i.e., the motion intensity) besides text for video customization.

* Project page: https://xavierchen34.github.io/LivePhoto-Page/

Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following

Nov 30, 2023

Yutong Feng, Biao Gong, Di Chen, Yujun Shen, Yu Liu, Jingren Zhou

Abstract: Existing text-to-image (T2I) diffusion models usually struggle to interpret complex prompts, especially those with quantity, object-attribute binding, and multi-subject descriptions. In this work, we introduce a semantic panel as the middleware in decoding texts to images, helping the generator better follow instructions. The panel is obtained through arranging the visual concepts parsed from the input text with the aid of large language models, and then injected into the denoising network as a detailed control signal to complement the text condition. To facilitate text-to-panel learning, we come up with a carefully designed semantic formatting protocol, accompanied by a fully-automatic data preparation pipeline. Thanks to such a design, our approach, which we call Ranni, manages to enhance a pre-trained T2I generator regarding its textual controllability. More importantly, the introduction of the generative middleware brings a more convenient form of interaction (i.e., directly adjusting the elements in the panel or using language instructions) and further allows users to finely customize their generation, based on which we develop a practical system and showcase its potential in continuous generation and chatting-based editing. Our project page is at https://ranni-t2i.github.io/Ranni.

Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation

Nov 30, 2023

Siteng Huang, Biao Gong, Yutong Feng, Xi Chen, Yuqian Fu, Yu Liu, Donglin Wang

Abstract: This study focuses on a novel task in text-to-image (T2I) generation, namely action customization. The objective of this task is to learn the co-existing action from limited data and generalize it to unseen humans or even animals. Experimental results show that existing subject-driven customization methods fail to learn the representative characteristics of actions and struggle to decouple actions from context features, including appearance. To overcome the preference for low-level features and the entanglement of high-level features, we propose an inversion-based method, Action-Disentangled Identifier (ADI), to learn action-specific identifiers from the exemplar images. ADI first expands the semantic conditioning space by introducing layer-wise identifier tokens, thereby increasing the representational richness while distributing the inversion across different features. Then, to block the inversion of action-agnostic features, ADI extracts the gradient invariance from the constructed sample triples and masks the updates of irrelevant channels. To comprehensively evaluate the task, we present an ActionBench that includes a variety of actions, each accompanied by meticulously selected samples. Both quantitative and qualitative results show that our ADI outperforms existing baselines in action-customized T2I generation. Our project page is at https://adi-t2i.github.io/ADI.

Check, Locate, Rectify: A Training-Free Layout Calibration System for Text-to-Image Generation

Nov 30, 2023

Biao Gong, Siteng Huang, Yutong Feng, Shiwei Zhang, Yuyuan Li, Yu Liu

Abstract: Diffusion models have recently achieved remarkable progress in generating realistic images. However, challenges remain in accurately understanding and synthesizing the layout requirements in the textual prompts. To align the generated image with layout instructions, we present a training-free layout calibration system, SimM, that intervenes in the generative process on the fly during inference time. Specifically, following a "check-locate-rectify" pipeline, the system first analyses the prompt to generate the target layout and compares it with the intermediate outputs to automatically detect errors. Then, by moving the located activations and making intra- and inter-map adjustments, the rectification process can be performed with negligible computational overhead. To evaluate SimM over a range of layout requirements, we present a benchmark, SimMBench, that compensates for the lack of superlative spatial relations in existing datasets. Both quantitative and qualitative results demonstrate the effectiveness of the proposed SimM in calibrating the layout inconsistencies. Our project page is at https://simm-t2i.github.io/SimM.

Incentive Mechanism Design for Unbiased Federated Learning with Randomized Client Participation

Apr 17, 2023

Bing Luo, Yutong Feng, Shiqiang Wang, Jianwei Huang, Leandros Tassiulas

Abstract: Incentive mechanisms are crucial for federated learning (FL) when rational clients do not have the same interests in the global model as the server. However, due to system heterogeneity and limited budget, it is generally impractical for the server to incentivize all clients to participate in all training rounds (known as full participation). The existing FL incentive mechanisms are typically designed by stimulating a fixed subset of clients based on their data quantity or system resources. Hence, FL is performed only using this subset of clients throughout the entire training process, leading to a biased model because of data heterogeneity. This paper proposes a game-theoretic incentive mechanism for FL with randomized client participation, where the server adopts a customized pricing strategy that motivates different clients to join with different participation levels (probabilities) for obtaining an unbiased and high-performance model. Each client responds to the server's monetary incentive by choosing its best participation level, to maximize its profit based on not only the incurred local cost but also its intrinsic value for the global model. To effectively evaluate clients' contribution to the model performance, we derive a new convergence bound which analytically predicts how clients' arbitrary participation levels and their heterogeneous data affect the model performance. By solving a non-convex optimization problem, our analysis reveals that the intrinsic value leads to the interesting possibility of bidirectional payment between the server and clients. Experimental results using real datasets on a hardware prototype demonstrate the superiority of our mechanism in achieving higher model performance for the server as well as higher profits for the clients.

* Accepted in ICDCS 2023
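To make the idea of an unbiased model under randomized client participation more concrete, the sketch below uses inverse-probability weighting, a standard way to keep an aggregate unbiased when each client joins a round only with some probability. The abstract does not state that the paper uses exactly this aggregation rule, so treat it as an assumption; the function names, the scalar "updates", and the numbers are all hypothetical.

```python
from itertools import product


def aggregate(updates, weights, probs, participated):
    """Aggregate updates from the clients that showed up this round.

    Each participating client's update is divided by its participation
    probability so that the aggregate is unbiased in expectation.
    """
    return sum(
        w * u / q
        for u, w, q, p in zip(updates, weights, probs, participated)
        if p
    )


def expected_aggregate(updates, weights, probs):
    """Exact expectation of `aggregate` over all participation patterns."""
    total = 0.0
    for pattern in product([0, 1], repeat=len(updates)):
        # Probability of this pattern under independent participation.
        prob = 1.0
        for p, q in zip(pattern, probs):
            prob *= q if p else (1.0 - q)
        total += prob * aggregate(updates, weights, probs, pattern)
    return total


# Hypothetical example: three clients with scalar updates.
updates = [1.0, 2.0, 4.0]
weights = [0.5, 0.3, 0.2]   # e.g. data-size proportions
probs = [0.9, 0.5, 0.2]     # participation levels chosen by the clients

full_participation = sum(w * u for w, u in zip(weights, updates))
randomized = expected_aggregate(updates, weights, probs)
```

Here `randomized` matches `full_participation` up to floating-point error, even though the third client participates in only one round out of five on average; the price is higher variance for clients with low participation probability.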

Troika: Multi-Path Cross-Modal Traction for Compositional Zero-Shot Learning

Mar 27, 2023

Siteng Huang, Biao Gong, Yutong Feng, Yiliang Lv, Donglin Wang

Abstract: Recent compositional zero-shot learning (CZSL) methods adapt pre-trained vision-language models (VLMs) by constructing trainable prompts only for composed state-object pairs. Relying on learning the joint representation of seen compositions, these methods ignore the explicit modeling of the state and object, thus limiting the exploitation of pre-trained knowledge and generalization to unseen compositions. With a particular focus on the universality of the solution, in this work, we propose a novel paradigm for CZSL models that establishes three identification branches (i.e., Multi-Path) to jointly model the state, object, and composition. The presented Troika is our implementation that aligns the branch-specific prompt representations with decomposed visual features. To calibrate the bias between semantically similar multi-modal representations, we further introduce a Cross-Modal Traction module into Troika that shifts the prompt representation towards the current visual content. We conduct extensive experiments on three popular benchmarks, where our method significantly outperforms existing methods in both closed-world and open-world settings.

* 14 pages

ViM: Vision Middleware for Unified Downstream Transferring

Mar 13, 2023

Yutong Feng, Biao Gong, Jianwen Jiang, Yiliang Lv, Yujun Shen, Deli Zhao, Jingren Zhou

Abstract: Foundation models are pre-trained on massive data and transferred to downstream tasks via fine-tuning. This work presents Vision Middleware (ViM), a new learning paradigm that targets unified transferring from a single foundation model to a variety of downstream tasks. ViM consists of a zoo of lightweight plug-in modules, each of which is independently learned on a midstream dataset with a shared frozen backbone. Downstream tasks can then benefit from an adequate aggregation of the module zoo thanks to the rich knowledge inherited from midstream tasks. There are three major advantages of such a design. From the efficiency aspect, the upstream backbone can be trained only once and reused for all downstream tasks without tuning. From the scalability aspect, we can easily append additional modules to ViM with no influence on existing modules. From the performance aspect, ViM can include as many midstream tasks as possible, narrowing the task gap between upstream and downstream. Considering these benefits, we believe that ViM, which the community could maintain and develop together, would serve as a powerful tool to assist foundation models.
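The structure described in the ViM abstract (one frozen backbone, a zoo of independently learned plug-in modules, and an aggregation step for downstream use) might be pictured with the toy sketch below. Everything here is a simplifying assumption: the backbone and modules are trivial functions on lists of floats, the module names are hypothetical, and the normalized weighted sum is just one possible form of the "adequate aggregation" the abstract mentions, not the paper's method.

```python
def frozen_backbone(x):
    """Stand-in for a pre-trained backbone whose parameters never change."""
    return [2.0 * v for v in x]


def make_scale_module(scale):
    """Stand-in for a lightweight plug-in module trained on one midstream task."""
    return lambda feats: [scale * f for f in feats]


def aggregate(feats, zoo, weights):
    """Combine module outputs with normalized per-module weights.

    Assumption: a weighted sum; a downstream task would learn `weights`.
    """
    total = sum(weights[name] for name in zoo)
    out = [0.0] * len(feats)
    for name, module in zoo.items():
        for i, v in enumerate(module(feats)):
            out[i] += weights[name] / total * v
    return out


# Hypothetical zoo with two midstream modules sharing the same backbone.
zoo = {
    "segmentation": make_scale_module(1.0),
    "depth": make_scale_module(3.0),
}
feats = frozen_backbone([1.0, 2.0])
combined = aggregate(feats, zoo, {"segmentation": 1.0, "depth": 1.0})
```

Appending a new entry to `zoo` leaves the existing modules and the backbone untouched, which mirrors the scalability point in the abstract; only the downstream weights would need to be revisited.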
