Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Duen Horng Chau

Mobile Fitting Room: On-device Virtual Try-on via Diffusion Models

Feb 02, 2024

Justin Blalock, David Munechika, Harsha Karanth, Alec Helbling, Pratham Mehta, Seongmin Lee, Duen Horng Chau

Abstract:The growing digital landscape of fashion e-commerce calls for interactive and user-friendly interfaces for virtually trying on clothes. Traditional try-on methods grapple with challenges in adapting to diverse backgrounds, poses, and subjects. While newer methods, utilizing the recent advances of diffusion models, have achieved higher-quality image generation, the human-centered dimensions of mobile interface delivery and privacy concerns remain largely unexplored. We present Mobile Fitting Room, the first on-device diffusion-based virtual try-on system. To address multiple inter-related technical challenges such as high-quality garment placement and model compression for mobile devices, we present a novel technical pipeline and an interface design that enables privacy preservation and user customization. A usage scenario highlights how our tool can provide a seamless, interactive virtual try-on experience for customers and provide a valuable service for fashion e-commerce businesses.

* 7 pages, 3 figures

Via

Access Paper or Ask Questions

Wordflow: Social Prompt Engineering for Large Language Models

Jan 25, 2024

Zijie J. Wang, Aishwarya Chakravarthy, David Munechika, Duen Horng Chau

Abstract:Large language models (LLMs) require well-crafted prompts for effective use. Prompt engineering, the process of designing prompts, is challenging, particularly for non-experts who are less familiar with AI technologies. While researchers have proposed techniques and tools to assist LLM users in prompt design, these works primarily target AI application developers rather than non-experts. To address this research gap, we propose social prompt engineering, a novel paradigm that leverages social computing techniques to facilitate collaborative prompt design. To investigate social prompt engineering, we introduce Wordflow, an open-source and social text editor that enables everyday users to easily create, run, share, and discover LLM prompts. Additionally, by leveraging modern web technologies, Wordflow allows users to run LLMs locally and privately in their browsers. Two usage scenarios highlight how social prompt engineering and our tool can enhance laypeople's interaction with LLMs. Wordflow is publicly accessible at https://poloclub.github.io/wordflow.

* 8 pages, 7 figures. Wordflow is available at: https://poloclub.github.io/wordflow. The code is available at: https://github.com/poloclub/wordflow/. For a demo video, see: https://youtu.be/3dOcVuofGVo

Via

Access Paper or Ask Questions

High-Performance Transformers for Table Structure Recognition Need Early Convolutions

Nov 09, 2023

ShengYun Peng, Seongmin Lee, Xiaojing Wang, Rajarajeswari Balasubramaniyan, Duen Horng Chau

Figure 1 for High-Performance Transformers for Table Structure Recognition Need Early Convolutions

Figure 2 for High-Performance Transformers for Table Structure Recognition Need Early Convolutions

Figure 3 for High-Performance Transformers for Table Structure Recognition Need Early Convolutions

Figure 4 for High-Performance Transformers for Table Structure Recognition Need Early Convolutions

Abstract:Table structure recognition (TSR) aims to convert tabular images into a machine-readable format, where a visual encoder extracts image features and a textual decoder generates table-representing tokens. Existing approaches use classic convolutional neural network (CNN) backbones for the visual encoder and transformers for the textual decoder. However, this hybrid CNN-Transformer architecture introduces a complex visual encoder that accounts for nearly half of the total model parameters, markedly reduces both training and inference speed, and hinders the potential for self-supervised learning in TSR. In this work, we design a lightweight visual encoder for TSR without sacrificing expressive power. We discover that a convolutional stem can match classic CNN backbone performance, with a much simpler model. The convolutional stem strikes an optimal balance between two crucial factors for high-performance TSR: a higher receptive field (RF) ratio and a longer sequence length. This allows it to "see" an appropriate portion of the table and "store" the complex table structure within sufficient context length for the subsequent transformer. We conducted reproducible ablation studies and open-sourced our code at https://github.com/poloclub/tsr-convstem to enhance transparency, inspire innovations, and facilitate fair comparisons in our domain as tables are a promising modality for representation learning.

* Table Representation Learning Workshop at NeurIPS 2023 (Oral)

Via

Access Paper or Ask Questions

REVAMP: Automated Simulations of Adversarial Attacks on Arbitrary Objects in Realistic Scenes

Oct 18, 2023

Matthew Hull, Zijie J. Wang, Duen Horng Chau

Figure 1 for REVAMP: Automated Simulations of Adversarial Attacks on Arbitrary Objects in Realistic Scenes

Figure 2 for REVAMP: Automated Simulations of Adversarial Attacks on Arbitrary Objects in Realistic Scenes

Abstract:Deep Learning models, such as those used in an autonomous vehicle are vulnerable to adversarial attacks where an attacker could place an adversarial object in the environment, leading to mis-classification. Generating these adversarial objects in the digital space has been extensively studied, however successfully transferring these attacks from the digital realm to the physical realm has proven challenging when controlling for real-world environmental factors. In response to these limitations, we introduce REVAMP, an easy-to-use Python library that is the first-of-its-kind tool for creating attack scenarios with arbitrary objects and simulating realistic environmental factors, lighting, reflection, and refraction. REVAMP enables researchers and practitioners to swiftly explore various scenarios within the digital realm by offering a wide range of configurable options for designing experiments and using differentiable rendering to reproduce physically plausible adversarial objects. We will demonstrate and invite the audience to try REVAMP to produce an adversarial texture on a chosen object while having control over various scene parameters. The audience will choose a scene, an object to attack, the desired attack class, and the number of camera positions to use. Then, in real time, we show how this altered texture causes the chosen object to be mis-classified, showcasing the potential of REVAMP in real-world scenarios. REVAMP is open-source and available at https://github.com/poloclub/revamp.

Via

Access Paper or Ask Questions

ObjectComposer: Consistent Generation of Multiple Objects Without Fine-tuning

Oct 10, 2023

Alec Helbling, Evan Montoya, Duen Horng Chau

Abstract:Recent text-to-image generative models can generate high-fidelity images from text prompts. However, these models struggle to consistently generate the same objects in different contexts with the same appearance. Consistent object generation is important to many downstream tasks like generating comic book illustrations with consistent characters and setting. Numerous approaches attempt to solve this problem by extending the vocabulary of diffusion models through fine-tuning. However, even lightweight fine-tuning approaches can be prohibitively expensive to run at scale and in real-time. We introduce a method called ObjectComposer for generating compositions of multiple objects that resemble user-specified images. Our approach is training-free, leveraging the abilities of preexisting models. We build upon the recent BLIP-Diffusion model, which can generate images of single objects specified by reference images. ObjectComposer enables the consistent generation of compositions containing multiple specific objects simultaneously, all without modifying the weights of the underlying models.

Via

Access Paper or Ask Questions

Memory in Plain Sight: A Survey of the Uncanny Resemblances between Diffusion Models and Associative Memories

Sep 28, 2023

Benjamin Hoover, Hendrik Strobelt, Dmitry Krotov, Judy Hoffman, Zsolt Kira, Duen Horng Chau

Abstract:Diffusion Models (DMs) have recently set state-of-the-art on many generation benchmarks. However, there are myriad ways to describe them mathematically, which makes it difficult to develop a simple understanding of how they work. In this survey, we provide a concise overview of DMs from the perspective of dynamical systems and Ordinary Differential Equations (ODEs) which exposes a mathematical connection to the highly related yet often overlooked class of energy-based models, called Associative Memories (AMs). Energy-based AMs are a theoretical framework that behave much like denoising DMs, but they enable us to directly compute a Lyapunov energy function on which we can perform gradient descent to denoise data. We then summarize the 40 year history of energy-based AMs, beginning with the original Hopfield Network, and discuss new research directions for AMs and DMs that are revealed by characterizing the extent of their similarities and differences

* 15 pages, 4 figures

Via

Access Paper or Ask Questions

Robust Principles: Architectural Design Principles for Adversarially Robust CNNs

Sep 01, 2023

ShengYun Peng, Weilin Xu, Cory Cornelius, Matthew Hull, Kevin Li, Rahul Duggal, Mansi Phute, Jason Martin, Duen Horng Chau

Figure 1 for Robust Principles: Architectural Design Principles for Adversarially Robust CNNs

Figure 2 for Robust Principles: Architectural Design Principles for Adversarially Robust CNNs

Figure 3 for Robust Principles: Architectural Design Principles for Adversarially Robust CNNs

Figure 4 for Robust Principles: Architectural Design Principles for Adversarially Robust CNNs

Abstract:Our research aims to unify existing works' diverging opinions on how architectural components affect the adversarial robustness of CNNs. To accomplish our goal, we synthesize a suite of three generalizable robust architectural design principles: (a) optimal range for depth and width configurations, (b) preferring convolutional over patchify stem stage, and (c) robust residual block design through adopting squeeze and excitation blocks and non-parametric smooth activation functions. Through extensive experiments across a wide spectrum of dataset scales, adversarial training methods, model parameters, and network design spaces, our principles consistently and markedly improve AutoAttack accuracy: 1-3 percentage points (pp) on CIFAR-10 and CIFAR-100, and 4-9 pp on ImageNet. The code is publicly available at https://github.com/poloclub/robust-principles.

* Published at BMVC'23

Via

Access Paper or Ask Questions

LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked

Aug 15, 2023

Alec Helbling, Mansi Phute, Matthew Hull, Duen Horng Chau

Figure 1 for LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked

Figure 2 for LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked

Figure 3 for LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked

Figure 4 for LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked

Abstract:Large language models (LLMs) have skyrocketed in popularity in recent years due to their ability to generate high-quality text in response to human prompting. However, these models have been shown to have the potential to generate harmful content in response to user prompting (e.g., giving users instructions on how to commit crimes). There has been a focus in the literature on mitigating these risks, through methods like aligning models with human values through reinforcement learning. However, it has been shown that even aligned language models are susceptible to adversarial attacks that bypass their restrictions on generating harmful text. We propose a simple approach to defending against these attacks by having a large language model filter its own responses. Our current results show that even if a model is not fine-tuned to be aligned with human values, it is possible to stop it from presenting harmful content to users by validating the content using a language model.

Via

Access Paper or Ask Questions

WizMap: Scalable Interactive Visualization for Exploring Large Machine Learning Embeddings

Jun 15, 2023

Zijie J. Wang, Fred Hohman, Duen Horng Chau

Abstract:Machine learning models often learn latent embedding representations that capture the domain semantics of their training data. These embedding representations are valuable for interpreting trained models, building new models, and analyzing new datasets. However, interpreting and using embeddings can be challenging due to their opaqueness, high dimensionality, and the large size of modern datasets. To tackle these challenges, we present WizMap, an interactive visualization tool to help researchers and practitioners easily explore large embeddings. With a novel multi-resolution embedding summarization method and a familiar map-like interaction design, WizMap enables users to navigate and interpret embedding spaces with ease. Leveraging modern web technologies such as WebGL and Web Workers, WizMap scales to millions of embedding points directly in users' web browsers and computational notebooks without the need for dedicated backend servers. WizMap is open-source and available at the following public demo link: https://poloclub.github.io/wizmap.

* 8 pages, 8 figures, Accepted to ACL 2023. For a demo video, see https://youtu.be/8fJG87QVceQ. For a live demo, see https://poloclub.github.io/wizmap. Code is available at https://github.com/poloclub/wizmap

Via

Access Paper or Ask Questions

Diffusion Explainer: Visual Explanation for Text-to-image Stable Diffusion

May 08, 2023

Seongmin Lee, Benjamin Hoover, Hendrik Strobelt, Zijie J. Wang, ShengYun Peng, Austin Wright, Kevin Li, Haekyu Park, Haoyang Yang, Duen Horng Chau

Abstract:Diffusion-based generative models' impressive ability to create convincing images has captured global attention. However, their complex internal structures and operations often make them difficult for non-experts to understand. We present Diffusion Explainer, the first interactive visualization tool that explains how Stable Diffusion transforms text prompts into images. Diffusion Explainer tightly integrates a visual overview of Stable Diffusion's complex components with detailed explanations of their underlying operations, enabling users to fluidly transition between multiple levels of abstraction through animations and interactive elements. By comparing the evolutions of image representations guided by two related text prompts over refinement timesteps, users can discover the impact of prompts on image generation. Diffusion Explainer runs locally in users' web browsers without the need for installation or specialized hardware, broadening the public's education access to modern AI techniques. Our open-sourced tool is available at: https://poloclub.github.io/diffusion-explainer/. A video demo is available at https://youtu.be/Zg4gxdIWDds.

* 5 pages, 5 figures

Via

Access Paper or Ask Questions