Shuai Xiao

Wear-Any-Way: Manipulable Virtual Try-on via Sparse Correspondence Alignment

Mar 19, 2024

Mengting Chen, Xi Chen, Zhonghua Zhai, Chen Ju, Xuewen Hong, Jinsong Lan, Shuai Xiao

Abstract: This paper introduces a novel framework for virtual try-on, termed Wear-Any-Way. Unlike previous methods, Wear-Any-Way is a customizable solution. Besides generating high-fidelity results, our method lets users precisely manipulate the wearing style. To achieve this goal, we first construct a strong pipeline for standard virtual try-on, supporting single/multiple garment try-on and model-to-model settings in complicated scenarios. To make it manipulable, we propose sparse correspondence alignment, which involves point-based control to guide the generation at specific locations. With this design, Wear-Any-Way achieves state-of-the-art performance in the standard setting and provides a novel form of interaction for customizing the wearing style. For instance, users can drag the sleeve to roll it up, drag the coat to open it, and use clicks to control the style of tuck. Wear-Any-Way enables freer and more flexible expression of attire, with profound implications for the fashion industry.

* Project Page: https://mengtingchen.github.io/wear-any-way-page/

Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Models

Dec 12, 2023

Chen Ju, Haicheng Wang, Zeqian Li, Xu Chen, Zhonghua Zhai, Weilin Huang, Shuai Xiao

Abstract: Vision-Language Large Models (VLMs) have become a primary backbone of AI thanks to their impressive performance. However, their expensive computation costs, i.e., low throughput and high delay, limit their potential in real-world scenarios. To accelerate VLMs, most existing methods focus on the model perspective (pruning, distillation, quantization) but completely overlook data-perspective redundancy. To fill this gap, this paper highlights the severity of data redundancy and designs a plug-and-play Turbo module, guided by information degree, that prunes inefficient tokens from visual or textual data. In pursuit of efficiency-performance trade-offs, information degree takes two key factors into consideration: mutual redundancy and semantic value. Concretely, the former evaluates the data duplication between sequential tokens, while the latter evaluates each token by its contribution to the overall semantics. As a result, tokens with a high information degree carry less redundancy and stronger semantics. During VLM computation, Turbo works as a user-friendly plug-in that sorts data by information degree and uses only the top-ranked tokens to save costs. Its advantages are multifaceted: it is generally compatible with various VLMs across understanding and generation, and it is simple to use, requiring no retraining and only trivial engineering effort. Extensive experiments on multiple public VLM benchmarks show that Turbo delivers notable acceleration with a negligible performance drop.
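
The scoring-and-pruning idea described in this abstract can be sketched roughly as follows. This is a minimal illustration and not the paper's implementation: the function names, the `summary` vector standing in for the overall semantics, the use of cosine similarity, the `alpha` weight, and `keep_ratio` are all assumptions introduced here, since the abstract does not give the exact formula for information degree.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def information_degree(tokens, summary, alpha=1.0):
    """Hypothetical score: semantic value minus alpha times mutual redundancy.

    semantic value    -> similarity of a token to a global summary vector
    mutual redundancy -> similarity of a token to the token just before it
    """
    scores = []
    for i, tok in enumerate(tokens):
        semantic = cosine(tok, summary)
        redundancy = cosine(tok, tokens[i - 1]) if i > 0 else 0.0
        scores.append(semantic - alpha * redundancy)
    return scores


def prune_tokens(tokens, summary, keep_ratio=0.5, alpha=1.0):
    """Return the indices of the kept tokens, in their original order."""
    scores = information_degree(tokens, summary, alpha)
    k = max(1, math.ceil(keep_ratio * len(tokens)))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Token 1 duplicates token 0, so it scores lowest and is dropped first.
tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
summary = [1.0, 1.0]
kept = prune_tokens(tokens, summary, keep_ratio=0.5)
```

Returning indices in their original order keeps the surviving tokens in sequence, which matters for any model component that depends on position.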

Enhancing Cross-domain Click-Through Rate Prediction via Explicit Feature Augmentation

Nov 30, 2023

Xu Chen, Zida Cheng, Jiangchao Yao, Chen Ju, Weilin Huang, Jinsong Lan, Xiaoyi Zeng, Shuai Xiao

Abstract: Cross-domain CTR (CDCTR) prediction is an important research topic that studies how to leverage meaningful data from a related domain to help CTR prediction in the target domain. Most existing CDCTR works design implicit ways to transfer knowledge across domains, such as parameter sharing that regularizes model training in the target domain. More effectively, recent researchers propose explicit techniques to extract user interest knowledge and transfer this knowledge to the target domain. However, these explicit methods mainly face two issues: 1) they usually require a super domain, i.e., an extremely large source domain, to cover most users or items of the target domain, and 2) the extracted user interest knowledge is static no matter what the context is in the target domain. These limitations motivate us to develop a more flexible and efficient technique to explicitly transfer knowledge. In this work, we propose a cross-domain augmentation network (CDAnet) that performs explicit knowledge transfer between two domains. Specifically, CDAnet contains a translation network and an augmentation network, which are trained sequentially. The translation network computes latent features from the two domains and learns meaningful cross-domain knowledge for each input in the target domain using a cross-supervised feature translator. The augmentation network then employs the explicit cross-domain knowledge as augmented information to boost target-domain CTR prediction. Through extensive experiments on two public benchmarks and one industrial production dataset, we show that CDAnet can learn meaningful translated features and largely improve the performance of CTR prediction. CDAnet has been evaluated in an online A/B test on image2product retrieval in the Taobao app, bringing an absolute 0.11 point CTR improvement, a relative 0.64% deal growth and a relative 1.26% GMV increase.

* arXiv admin note: substantial text overlap with arXiv:2305.03953

Forgedit: Text Guided Image Editing via Learning and Forgetting

Sep 19, 2023

Shiwen Zhang, Shuai Xiao, Weilin Huang

Abstract: Text guided image editing on real images, given only the image and the target text prompt as inputs, is a very general and challenging problem. It requires the editing model to reason by itself about which part of the image should be edited, to preserve the characteristics of the original image, and also to perform complicated non-rigid editing. Previous fine-tuning based solutions are time-consuming and vulnerable to overfitting, which limits their editing capabilities. To tackle these issues, we design a novel text guided image editing method, Forgedit. First, we propose a novel fine-tuning framework that learns to reconstruct the given image in less than one minute by vision-language joint learning. Then we introduce vector subtraction and vector projection to explore the proper text embedding for editing. We also find a general property of UNet structures in Diffusion Models and, inspired by this finding, design forgetting strategies to diminish the fatal overfitting issues and significantly boost the editing abilities of Diffusion Models. Our method, Forgedit, implemented with Stable Diffusion, achieves new state-of-the-art results on the challenging text guided image editing benchmark TEdBench, surpassing the previous SOTA method, Imagic with Imagen, in terms of both CLIP score and LPIPS score. Code is available at https://github.com/witcherofresearch/Forgedit.
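
The two embedding operations named in this abstract, vector subtraction and vector projection, can be illustrated on plain vectors. The sketch below is only one plausible reading: the abstract does not state the formulas, so the mixing coefficients `gamma`, `alpha` and `beta`, the names `e_src` and `e_tgt` (source and target text embeddings), and the exact way the components are recombined are assumptions made for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def vector_subtraction(e_src, e_tgt, gamma):
    """Move from the source embedding toward the target along their difference.

    Assumed form: e_src + gamma * (e_tgt - e_src).
    gamma = 0 returns the source embedding, gamma = 1 the target embedding.
    """
    return [s + gamma * (t - s) for s, t in zip(e_src, e_tgt)]


def vector_projection(e_src, e_tgt, alpha, beta):
    """Split e_tgt into a part parallel to e_src and an orthogonal residual,
    then recombine the two parts with separate weights (assumed form).
    """
    denom = dot(e_src, e_src)
    if denom == 0:
        raise ValueError("e_src must be a non-zero vector")
    coef = dot(e_tgt, e_src) / denom
    parallel = [coef * s for s in e_src]
    orthogonal = [t - p for t, p in zip(e_tgt, parallel)]
    return [alpha * p + beta * o for p, o in zip(parallel, orthogonal)]


e_src = [1.0, 0.0]
e_tgt = [1.0, 1.0]
halfway = vector_subtraction(e_src, e_tgt, gamma=0.5)
stronger_edit = vector_projection(e_src, e_tgt, alpha=1.0, beta=2.0)
```

In this reading, the projection variant gives separate control over how much of the source content is kept (`alpha`) and how strongly the edit direction is applied (`beta`), whereas the subtraction variant ties both to a single coefficient.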

Automatic Deduction Path Learning via Reinforcement Learning with Environmental Correction

Jun 16, 2023

Shuai Xiao, Chen Pan, Min Wang, Xinxin Zhu, Siqiao Xue, Jing Wang, Yunhua Hu, James Zhang, Jinghua Feng

Abstract: Automatic bill payment is an important part of business operations in fintech companies. In practice, deduction has mainly been based on the total amount, or on heuristic search that divides the bill into smaller parts to deduct as much as possible. This article proposes an end-to-end approach that automatically learns the optimal deduction paths (deduction amounts in order), which reduces the cost of manual path design and maximizes the amount successfully deducted. Specifically, in view of the large search space of paths and the extreme sparsity of historical successful deduction records, we propose a deep hierarchical reinforcement learning approach that abstracts the action into a two-level hierarchical space: an upper agent that determines the number of deduction steps each day and a lower agent that decides the deduction amount at each step. In this way, the action space is structured via prior knowledge and the exploration space is reduced. Moreover, the inherent information incompleteness of the business makes the environment only partially observable. To be precise, the deducted amounts indicate merely lower bounds on the available account balance. To this end, we formulate the problem as a partially observable Markov decision process (POMDP) and employ an environment correction algorithm based on the characteristics of the business. In the world's largest electronic payment business, we have verified the effectiveness of this scheme offline and deployed it online to serve millions of users.

Cross-domain Augmentation Networks for Click-Through Rate Prediction

May 09, 2023

Xu Chen, Zida Cheng, Shuai Xiao, Xiaoyi Zeng, Weilin Huang

Abstract: Data sparsity is an important issue for click-through rate (CTR) prediction, particularly when user-item interactions are too sparse to learn a reliable model. Recently, many works on cross-domain CTR (CDCTR) prediction have been developed in an effort to leverage meaningful data from a related domain. However, most existing CDCTR works have an impractical limitation: they require homogeneous inputs (i.e., shared feature fields) across domains. CDCTR with heterogeneous inputs (i.e., varying feature fields) across domains has not been widely explored but is an urgent and important research problem. In this work, we propose a cross-domain augmentation network (CDAnet) that performs knowledge transfer between two domains with heterogeneous inputs. Specifically, CDAnet contains a translation network and an augmentation network, which are trained sequentially. The translation network computes features from the two domains with heterogeneous inputs separately, using two independent branches, and then learns meaningful cross-domain knowledge using a cross-supervised feature translator. The augmentation network then encodes the learned cross-domain knowledge via feature translation performed in the latent space and fine-tunes the model for final CTR prediction. Through extensive experiments on two public benchmarks and one industrial production dataset, we show that CDAnet can learn meaningful translated features and largely improve the performance of CTR prediction. CDAnet has been evaluated in an online A/B test on image2product retrieval in the Taobao app over 20 days, bringing an absolute 0.11 point CTR improvement and a relative 1.26% GMV increase.

Mixer: Image to Multi-Modal Retrieval Learning for Industrial Application

May 06, 2023

Zida Cheng, Shuai Xiao, Zhonghua Zhai, Xiaoyi Zeng, Weilin Huang

Abstract: Cross-modal retrieval, where the query is an image and the doc is an item with both an image and a text description, is ubiquitous on e-commerce platforms and content-sharing social media. However, little research attention has been paid to this important application. This type of retrieval task is challenging for several reasons: 1) a domain gap exists between query and doc; 2) multi-modality alignment and fusion are required; 3) training data collected from user behaviors are skewed and carry noisy labels; 4) a huge number of queries must be answered in a timely manner against large-scale candidate docs. To this end, we propose a novel, scalable and efficient image-query to multi-modal retrieval learning paradigm called Mixer, which adaptively integrates multi-modality data, mines skewed and noisy data more efficiently, and scales to high traffic. Mixer consists of three key ingredients. First, for query and doc images, a shared encoder network followed by separate transformation networks is used to account for their domain gap. Second, because images and text in the multi-modal doc are not equally informative, we design a concept-aware modality fusion module, which extracts high-level concepts from the text by a text-to-image attention mechanism. Lastly, but most importantly, we turn to a new data organization and training paradigm for single-modal to multi-modal retrieval: large-scale classification learning, which treats single-modal queries and multi-modal docs as equivalent samples of certain classes. In addition, the data organization follows a weakly-supervised manner, which can deal with the skewed data and noisy labels inherent in industrial systems. Learning such a large number of categories for real-world multi-modality data is non-trivial, and we design a specific learning strategy for it. The proposed Mixer achieves SOTA performance on public datasets from industrial retrieval systems.

Tile Networks: Learning Optimal Geometric Layout for Whole-page Recommendation

Mar 03, 2023

Shuai Xiao, Zaifan Jiang, Shuang Yang

Abstract: Finding optimal configurations in a geometric space is a key challenge in many technological disciplines. Current approaches rely heavily on human domain expertise and are difficult to scale. In this paper we show that it is possible to solve configuration optimization problems for whole-page recommendation using reinforcement learning. The proposed Tile Networks is a neural architecture that optimizes 2D geometric configurations by arranging items in proper positions. Empirical results on a real dataset demonstrate its superior performance compared to traditional learning-to-rank approaches and recent deep models.

* Published at Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS) 2022

Model-based Constrained MDP for Budget Allocation in Sequential Incentive Marketing

Mar 02, 2023

Shuai Xiao, Le Guo, Zaifan Jiang, Lei Lv, Yuanbo Chen, Jun Zhu, Shuang Yang

Abstract: Sequential incentive marketing is an important approach for online businesses to acquire customers, increase loyalty and boost sales. How to effectively allocate the incentives so as to maximize the return (e.g., business objectives) under the budget constraint, however, is less studied in the literature. This problem is technically challenging because 1) the allocation strategy has to be learned from historically logged data, which is counterfactual in nature, and 2) both the optimality and the feasibility (i.e., that cost cannot exceed budget) need to be assessed before deployment to online systems. In this paper, we formulate the problem as a constrained Markov decision process (CMDP). To solve the CMDP problem with logged counterfactual data, we propose an efficient learning algorithm that combines bisection search and model-based planning. First, the CMDP is converted into its dual using Lagrangian relaxation, which is proved to be monotonic with respect to the dual variable. Furthermore, we show that the dual problem can be solved by policy learning, with the optimal dual variable found efficiently via bisection search (i.e., by taking advantage of the monotonicity). Lastly, we show that model-based planning can be used to effectively accelerate the joint optimization process without retraining the policy for every dual variable. Empirical results on synthetic and real marketing datasets confirm the effectiveness of our methods.
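
The bisection search over the dual variable mentioned in this abstract can be illustrated on a toy, single-step allocation problem. The sketch below is not the paper's algorithm: it replaces policy learning and model-based planning with a closed-form best response per user, and the names `best_response`, `total_cost`, `bisect_dual`, the `(reward, cost)` option lists, and the `lam_hi` and `tol` parameters are all assumptions made for illustration. It relies only on the property stated in the abstract, namely that the cost side of the relaxed problem is monotonic in the dual variable.

```python
def best_response(options, lam):
    """Pick the (reward, cost) option maximizing reward - lam * cost.

    Ties are broken toward the cheaper option so that total cost is
    non-increasing as lam grows.
    """
    return max(options, key=lambda rc: (rc[0] - lam * rc[1], -rc[1]))


def total_cost(users, lam):
    """Total spend when every user takes the best response for this lam."""
    return sum(best_response(options, lam)[1] for options in users)


def bisect_dual(users, budget, lam_hi=100.0, tol=1e-6):
    """Find (approximately) the smallest lam whose induced spend fits the budget."""
    lam_lo = 0.0
    if total_cost(users, lam_lo) <= budget:
        return lam_lo  # the constraint is not binding
    if total_cost(users, lam_hi) > budget:
        raise ValueError("budget is infeasible even at lam_hi")
    # Invariant: spend at lam_lo exceeds the budget, spend at lam_hi does not.
    while lam_hi - lam_lo > tol:
        mid = (lam_lo + lam_hi) / 2.0
        if total_cost(users, mid) > budget:
            lam_lo = mid
        else:
            lam_hi = mid
    return lam_hi


# Two users, each with three incentive options given as (reward, cost).
users = [
    [(0.0, 0.0), (1.0, 1.0), (1.5, 2.0)],
    [(0.0, 0.0), (2.0, 1.0), (2.2, 3.0)],
]
lam = bisect_dual(users, budget=2.0)
spend = total_cost(users, lam)
```

Because spend only decreases as the dual variable grows, each bisection step halves the search interval, so the number of policy evaluations grows logarithmically with the required precision.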

* Published at CIKM '19: Proceedings of the 28th ACM International Conference on Information and Knowledge Management

SJ-HD^2R: Selective Joint High Dynamic Range and Denoising Imaging for Dynamic Scenes

Jun 20, 2022

Wei Li, Shuai Xiao, Tianhong Dai, Shanxin Yuan, Tao Wang, Cheng Li, Fenglong Song

Abstract: Ghosting artifacts, motion blur, and low fidelity in highlights are the main challenges in High Dynamic Range (HDR) imaging from multiple Low Dynamic Range (LDR) images. These issues come from using the medium-exposed image as the reference frame in previous methods. To avoid them, we propose to use the under-exposed image as the reference. However, the heavy noise in dark regions of the under-exposed image becomes a new problem. Therefore, we propose a joint HDR and denoising pipeline containing two sub-networks: (i) a pre-denoising network (PreDNNet) that adaptively denoises input LDRs by exploiting exposure priors; (ii) a pyramid cascading fusion network (PCFNet) that introduces an attention mechanism and a cascading structure in a multi-scale manner. To further leverage these two paradigms, we propose a selective and joint HDR and denoising (SJ-HD^2R) imaging framework, which uses scenario-specific priors to conduct path selection with an accuracy of more than 93.3%. We create the first joint HDR and denoising benchmark dataset, which contains a variety of challenging HDR and denoising scenes and supports switching of the reference image. Extensive experimental results show that our method achieves superior performance to previous methods.
