Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yilin Wang

SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing

Apr 08, 2024

Jing Gu, Yilin Wang, Nanxuan Zhao, Wei Xiong, Qing Liu, Zhifei Zhang, He Zhang, Jianming Zhang, HyunJoon Jung, Xin Eric Wang

Figure 1 for SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing

Figure 2 for SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing

Figure 3 for SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing

Figure 4 for SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing

Abstract:Effective editing of personal content holds a pivotal role in enabling individuals to express their creativity, weaving captivating narratives within their visual stories, and elevate the overall quality and impact of their visual content. Therefore, in this work, we introduce SwapAnything, a novel framework that can swap any objects in an image with personalized concepts given by the reference, while keeping the context unchanged. Compared with existing methods for personalized subject swapping, SwapAnything has three unique advantages: (1) precise control of arbitrary objects and parts rather than the main subject, (2) more faithful preservation of context pixels, (3) better adaptation of the personalized concept to the image. First, we propose targeted variable swapping to apply region control over latent feature maps and swap masked variables for faithful context preservation and initial semantic concept swapping. Then, we introduce appearance adaptation, to seamlessly adapt the semantic concept into the original image in terms of target location, shape, style, and content during the image generation process. Extensive results on both human and automatic evaluation demonstrate significant improvements of our approach over baseline methods on personalized swapping. Furthermore, SwapAnything shows its precise and faithful swapping abilities across single object, multiple objects, partial object, and cross-domain swapping tasks. SwapAnything also achieves great performance on text-based swapping and tasks beyond swapping such as object insertion.

* 18 pages, 16 figures, 3 tables

Via

Access Paper or Ask Questions

KC-GenRe: A Knowledge-constrained Generative Re-ranking Method Based on Large Language Models for Knowledge Graph Completion

Mar 26, 2024

Yilin Wang, Minghao Hu, Zhen Huang, Dongsheng Li, Dong Yang, Xicheng Lu

Figure 1 for KC-GenRe: A Knowledge-constrained Generative Re-ranking Method Based on Large Language Models for Knowledge Graph Completion

Figure 2 for KC-GenRe: A Knowledge-constrained Generative Re-ranking Method Based on Large Language Models for Knowledge Graph Completion

Figure 3 for KC-GenRe: A Knowledge-constrained Generative Re-ranking Method Based on Large Language Models for Knowledge Graph Completion

Figure 4 for KC-GenRe: A Knowledge-constrained Generative Re-ranking Method Based on Large Language Models for Knowledge Graph Completion

Abstract:The goal of knowledge graph completion (KGC) is to predict missing facts among entities. Previous methods for KGC re-ranking are mostly built on non-generative language models to obtain the probability of each candidate. Recently, generative large language models (LLMs) have shown outstanding performance on several tasks such as information extraction and dialog systems. Leveraging them for KGC re-ranking is beneficial for leveraging the extensive pre-trained knowledge and powerful generative capabilities. However, it may encounter new problems when accomplishing the task, namely mismatch, misordering and omission. To this end, we introduce KC-GenRe, a knowledge-constrained generative re-ranking method based on LLMs for KGC. To overcome the mismatch issue, we formulate the KGC re-ranking task as a candidate identifier sorting generation problem implemented by generative LLMs. To tackle the misordering issue, we develop a knowledge-guided interactive training method that enhances the identification and ranking of candidates. To address the omission issue, we design a knowledge-augmented constrained inference method that enables contextual prompting and controlled generation, so as to obtain valid rankings. Experimental results show that KG-GenRe achieves state-of-the-art performance on four datasets, with gains of up to 6.7% and 7.7% in the MRR and Hits@1 metric compared to previous methods, and 9.0% and 11.1% compared to that without re-ranking. Extensive analysis demonstrates the effectiveness of components in KG-GenRe.

* This paper has been accepted for publication in the proceedings of LREC-COLING 2024

Via

Access Paper or Ask Questions

Bayesian Diffusion Models for 3D Shape Reconstruction

Mar 11, 2024

Haiyang Xu, Yu Lei, Zeyuan Chen, Xiang Zhang, Yue Zhao, Yilin Wang, Zhuowen Tu

Figure 1 for Bayesian Diffusion Models for 3D Shape Reconstruction

Figure 2 for Bayesian Diffusion Models for 3D Shape Reconstruction

Figure 3 for Bayesian Diffusion Models for 3D Shape Reconstruction

Figure 4 for Bayesian Diffusion Models for 3D Shape Reconstruction

Abstract:We present Bayesian Diffusion Models (BDM), a prediction algorithm that performs effective Bayesian inference by tightly coupling the top-down (prior) information with the bottom-up (data-driven) procedure via joint diffusion processes. We show the effectiveness of BDM on the 3D shape reconstruction task. Compared to prototypical deep learning data-driven approaches trained on paired (supervised) data-labels (e.g. image-point clouds) datasets, our BDM brings in rich prior information from standalone labels (e.g. point clouds) to improve the bottom-up 3D reconstruction. As opposed to the standard Bayesian frameworks where explicit prior and likelihood are required for the inference, BDM performs seamless information fusion via coupled diffusion processes with learned gradient computation networks. The specialty of our BDM lies in its capability to engage the active and effective information exchange and fusion of the top-down and bottom-up processes where each itself is a diffusion process. We demonstrate state-of-the-art results on both synthetic and real-world benchmarks for 3D shape reconstruction.

* Accepted by CVPR 2024

Via

Access Paper or Ask Questions

UniHuman: A Unified Model for Editing Human Images in the Wild

Dec 22, 2023

Nannan Li, Qing Liu, Krishna Kumar Singh, Yilin Wang, Jianming Zhang, Bryan A. Plummer, Zhe Lin

Figure 1 for UniHuman: A Unified Model for Editing Human Images in the Wild

Figure 2 for UniHuman: A Unified Model for Editing Human Images in the Wild

Figure 3 for UniHuman: A Unified Model for Editing Human Images in the Wild

Figure 4 for UniHuman: A Unified Model for Editing Human Images in the Wild

Abstract:Human image editing includes tasks like changing a person's pose, their clothing, or editing the image according to a text prompt. However, prior work often tackles these tasks separately, overlooking the benefit of mutual reinforcement from learning them jointly. In this paper, we propose UniHuman, a unified model that addresses multiple facets of human image editing in real-world settings. To enhance the model's generation quality and generalization capacity, we leverage guidance from human visual encoders and introduce a lightweight pose-warping module that can exploit different pose representations, accommodating unseen textures and patterns. Furthermore, to bridge the disparity between existing human editing benchmarks with real-world data, we curated 400K high-quality human image-text pairs for training and collected 2K human images for out-of-domain testing, both encompassing diverse clothing styles, backgrounds, and age groups. Experiments on both in-domain and out-of-domain test sets demonstrate that UniHuman outperforms task-specific models by a significant margin. In user studies, UniHuman is preferred by the users in an average of 77% of cases.

Via

Access Paper or Ask Questions

TokenCompose: Grounding Diffusion with Token-level Supervision

Dec 06, 2023

Zirui Wang, Zhizhou Sha, Zheng Ding, Yilin Wang, Zhuowen Tu

Figure 1 for TokenCompose: Grounding Diffusion with Token-level Supervision

Figure 2 for TokenCompose: Grounding Diffusion with Token-level Supervision

Figure 3 for TokenCompose: Grounding Diffusion with Token-level Supervision

Figure 4 for TokenCompose: Grounding Diffusion with Token-level Supervision

Abstract:We present TokenCompose, a Latent Diffusion Model for text-to-image generation that achieves enhanced consistency between user-specified text prompts and model-generated images. Despite its tremendous success, the standard denoising process in the Latent Diffusion Model takes text prompts as conditions only, absent explicit constraint for the consistency between the text prompts and the image contents, leading to unsatisfactory results for composing multiple object categories. TokenCompose aims to improve multi-category instance composition by introducing the token-wise consistency terms between the image content and object segmentation maps in the finetuning stage. TokenCompose can be applied directly to the existing training pipeline of text-conditioned diffusion models without extra human labeling information. By finetuning Stable Diffusion, the model exhibits significant improvements in multi-category instance composition and enhanced photorealism for its generated images.

* Preprint. Project link: https://mlpc-ucsd.github.io/TokenCompose

Via

Access Paper or Ask Questions

Learning Mutually Informed Representations for Characters and Subwords

Nov 14, 2023

Yilin Wang, Xinyi Hu, Matthew R. Gormley

Figure 1 for Learning Mutually Informed Representations for Characters and Subwords

Figure 2 for Learning Mutually Informed Representations for Characters and Subwords

Figure 3 for Learning Mutually Informed Representations for Characters and Subwords

Figure 4 for Learning Mutually Informed Representations for Characters and Subwords

Abstract:Most pretrained language models rely on subword tokenization, which processes text as a sequence of subword tokens. However, different granularities of text, such as characters, subwords, and words, can contain different kinds of information. Previous studies have shown that incorporating multiple input granularities improves model generalization, yet very few of them outputs useful representations for each granularity. In this paper, we introduce the entanglement model, aiming to combine character and subword language models. Inspired by vision-language models, our model treats characters and subwords as separate modalities, and it generates mutually informed representations for both granularities as output. We evaluate our model on text classification, named entity recognition, and POS-tagging tasks. Notably, the entanglement model outperforms its backbone language models, particularly in the presence of noisy texts and low-resource languages. Furthermore, the entanglement model even outperforms larger pre-trained models on all English sequence labeling tasks and classification tasks. Our anonymized code is available at https://anonymous.4open.science/r/noisy-IE-A673

Via

Access Paper or Ask Questions

Dolfin: Diffusion Layout Transformers without Autoencoder

Oct 25, 2023

Yilin Wang, Zeyuan Chen, Liangjun Zhong, Zheng Ding, Zhizhou Sha, Zhuowen Tu

Figure 1 for Dolfin: Diffusion Layout Transformers without Autoencoder

Figure 2 for Dolfin: Diffusion Layout Transformers without Autoencoder

Figure 3 for Dolfin: Diffusion Layout Transformers without Autoencoder

Figure 4 for Dolfin: Diffusion Layout Transformers without Autoencoder

Abstract:In this paper, we introduce a novel generative model, Diffusion Layout Transformers without Autoencoder (Dolfin), which significantly improves the modeling capability with reduced complexity compared to existing methods. Dolfin employs a Transformer-based diffusion process to model layout generation. In addition to an efficient bi-directional (non-causal joint) sequence representation, we further propose an autoregressive diffusion model (Dolfin-AR) that is especially adept at capturing rich semantic correlations for the neighboring objects, such as alignment, size, and overlap. When evaluated against standard generative layout benchmarks, Dolfin notably improves performance across various metrics (fid, alignment, overlap, MaxIoU and DocSim scores), enhancing transparency and interoperability in the process. Moreover, Dolfin's applications extend beyond layout generation, making it suitable for modeling geometric structures, such as line segments. Our experiments present both qualitative and quantitative results to demonstrate the advantages of Dolfin.

Via

Access Paper or Ask Questions

Photoswap: Personalized Subject Swapping in Images

May 29, 2023

Jing Gu, Yilin Wang, Nanxuan Zhao, Tsu-Jui Fu, Wei Xiong, Qing Liu, Zhifei Zhang, He Zhang, Jianming Zhang, HyunJoon Jung(+1 more)

Figure 1 for Photoswap: Personalized Subject Swapping in Images

Figure 2 for Photoswap: Personalized Subject Swapping in Images

Figure 3 for Photoswap: Personalized Subject Swapping in Images

Figure 4 for Photoswap: Personalized Subject Swapping in Images

Abstract:In an era where images and visual content dominate our digital landscape, the ability to manipulate and personalize these images has become a necessity. Envision seamlessly substituting a tabby cat lounging on a sunlit window sill in a photograph with your own playful puppy, all while preserving the original charm and composition of the image. We present Photoswap, a novel approach that enables this immersive image editing experience through personalized subject swapping in existing images. Photoswap first learns the visual concept of the subject from reference images and then swaps it into the target image using pre-trained diffusion models in a training-free manner. We establish that a well-conceptualized visual subject can be seamlessly transferred to any image with appropriate self-attention and cross-attention manipulation, maintaining the pose of the swapped subject and the overall coherence of the image. Comprehensive experiments underscore the efficacy and controllability of Photoswap in personalized subject swapping. Furthermore, Photoswap significantly outperforms baseline methods in human ratings across subject swapping, background preservation, and overall quality, revealing its vast application potential, from entertainment to professional editing.

* 14 pages

Via

Access Paper or Ask Questions

Scalable Optimal Margin Distribution Machine

May 08, 2023

Yilin Wang, Nan Cao, Teng Zhang, Xuanhua Shi, Hai Jin

Abstract:Optimal margin Distribution Machine (ODM) is a newly proposed statistical learning framework rooting in the novel margin theory, which demonstrates better generalization performance than the traditional large margin based counterparts. Nonetheless, it suffers from the ubiquitous scalability problem regarding both computation time and memory as other kernel methods. This paper proposes a scalable ODM, which can achieve nearly ten times speedup compared to the original ODM training method. For nonlinear kernels, we propose a novel distribution-aware partition method to make the local ODM trained on each partition be close and converge fast to the global one. When linear kernel is applied, we extend a communication efficient SVRG method to accelerate the training further. Extensive empirical studies validate that our proposed method is highly computational efficient and almost never worsen the generalization.

Via

Access Paper or Ask Questions

MRET: Multi-resolution Transformer for Video Quality Assessment

Mar 29, 2023

Junjie Ke, Tianhao Zhang, Yilin Wang, Peyman Milanfar, Feng Yang

Figure 1 for MRET: Multi-resolution Transformer for Video Quality Assessment

Figure 2 for MRET: Multi-resolution Transformer for Video Quality Assessment

Figure 3 for MRET: Multi-resolution Transformer for Video Quality Assessment

Figure 4 for MRET: Multi-resolution Transformer for Video Quality Assessment

Abstract:No-reference video quality assessment (NR-VQA) for user generated content (UGC) is crucial for understanding and improving visual experience. Unlike video recognition tasks, VQA tasks are sensitive to changes in input resolution. Since large amounts of UGC videos nowadays are 720p or above, the fixed and relatively small input used in conventional NR-VQA methods results in missing high-frequency details for many videos. In this paper, we propose a novel Transformer-based NR-VQA framework that preserves the high-resolution quality information. With the multi-resolution input representation and a novel multi-resolution patch sampling mechanism, our method enables a comprehensive view of both the global video composition and local high-resolution details. The proposed approach can effectively aggregate quality information across different granularities in spatial and temporal dimensions, making the model robust to input resolution variations. Our method achieves state-of-the-art performance on large-scale UGC VQA datasets LSVQ and LSVQ-1080p, and on KoNViD-1k and LIVE-VQC without fine-tuning.

* Frontiers Signal Processing in Computational Video and Video Streaming (https://www.frontiersin.org/articles/10.3389/frsip.2023.1137006/full)

Via

Access Paper or Ask Questions