Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Sungroh Yoon

Style-Friendly SNR Sampler for Style-Driven Generation

Nov 22, 2024

Jooyoung Choi, Chaehun Shin, Yeongtak Oh, Heeseung Kim, Sungroh Yoon

Figure 1 for Style-Friendly SNR Sampler for Style-Driven Generation

Figure 2 for Style-Friendly SNR Sampler for Style-Driven Generation

Figure 3 for Style-Friendly SNR Sampler for Style-Driven Generation

Figure 4 for Style-Friendly SNR Sampler for Style-Driven Generation

Abstract:Recent large-scale diffusion models generate high-quality images but struggle to learn new, personalized artistic styles, which limits the creation of unique style templates. Fine-tuning with reference images is the most promising approach, but it often blindly utilizes objectives and noise level distributions used for pre-training, leading to suboptimal style alignment. We propose the Style-friendly SNR sampler, which aggressively shifts the signal-to-noise ratio (SNR) distribution toward higher noise levels during fine-tuning to focus on noise levels where stylistic features emerge. This enables models to better capture unique styles and generate images with higher style alignment. Our method allows diffusion models to learn and share new "style templates", enhancing personalized content creation. We demonstrate the ability to generate styles such as personal watercolor paintings, minimal flat cartoons, 3D renderings, multi-panel images, and memes with text, thereby broadening the scope of style-driven generation.

Via

Access Paper or Ask Questions

Unsupervised Homography Estimation on Multimodal Image Pair via Alternating Optimization

Nov 20, 2024

Sanghyeob Song, Jaihyun Lew, Hyemi Jang, Sungroh Yoon

Figure 1 for Unsupervised Homography Estimation on Multimodal Image Pair via Alternating Optimization

Figure 2 for Unsupervised Homography Estimation on Multimodal Image Pair via Alternating Optimization

Figure 3 for Unsupervised Homography Estimation on Multimodal Image Pair via Alternating Optimization

Figure 4 for Unsupervised Homography Estimation on Multimodal Image Pair via Alternating Optimization

Abstract:Estimating the homography between two images is crucial for mid- or high-level vision tasks, such as image stitching and fusion. However, using supervised learning methods is often challenging or costly due to the difficulty of collecting ground-truth data. In response, unsupervised learning approaches have emerged. Most early methods, though, assume that the given image pairs are from the same camera or have minor lighting differences. Consequently, while these methods perform effectively under such conditions, they generally fail when input image pairs come from different domains, referred to as multimodal image pairs. To address these limitations, we propose AltO, an unsupervised learning framework for estimating homography in multimodal image pairs. Our method employs a two-phase alternating optimization framework, similar to Expectation-Maximization (EM), where one phase reduces the geometry gap and the other addresses the modality gap. To handle these gaps, we use Barlow Twins loss for the modality gap and propose an extended version, Geometry Barlow Twins, for the geometry gap. As a result, we demonstrate that our method, AltO, can be trained on multimodal datasets without any ground-truth data. It not only outperforms other unsupervised methods but is also compatible with various architectures of homography estimators. The source code can be found at:~\url{https://github.com/songsang7/AltO}

* This paper is accepted to the Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS 2024)

Via

Access Paper or Ask Questions

Interpretable Language Modeling via Induction-head Ngram Models

Oct 31, 2024

Eunji Kim, Sriya Mantena, Weiwei Yang, Chandan Singh, Sungroh Yoon, Jianfeng Gao

Figure 1 for Interpretable Language Modeling via Induction-head Ngram Models

Figure 2 for Interpretable Language Modeling via Induction-head Ngram Models

Figure 3 for Interpretable Language Modeling via Induction-head Ngram Models

Figure 4 for Interpretable Language Modeling via Induction-head Ngram Models

Abstract:Recent large language models (LLMs) have excelled across a wide range of tasks, but their use in high-stakes and compute-limited settings has intensified the demand for interpretability and efficiency. We address this need by proposing Induction-head ngram models (Induction-Gram), a method that builds an efficient, interpretable LM by bolstering modern ngram models with a hand-engineered "induction head". This induction head uses a custom neural similarity metric to efficiently search the model's input context for potential next-word completions. This process enables Induction-Gram to provide ngram-level grounding for each generated token. Moreover, experiments show that this simple method significantly improves next-word prediction over baseline interpretable models (up to 26%p) and can be used to speed up LLM inference for large models through speculative decoding. We further study Induction-Gram in a natural-language neuroscience setting, where the goal is to predict the next fMRI response in a sequence. It again provides a significant improvement over interpretable models (20% relative increase in the correlation of predicted fMRI responses), potentially enabling deeper scientific investigation of language selectivity in the brain. The code is available at https://github.com/ejkim47/induction-gram.

Via

Access Paper or Ask Questions

Semantic Token Reweighting for Interpretable and Controllable Text Embeddings in CLIP

Oct 11, 2024

Eunji Kim, Kyuhong Shim, Simyung Chang, Sungroh Yoon

Figure 1 for Semantic Token Reweighting for Interpretable and Controllable Text Embeddings in CLIP

Figure 2 for Semantic Token Reweighting for Interpretable and Controllable Text Embeddings in CLIP

Figure 3 for Semantic Token Reweighting for Interpretable and Controllable Text Embeddings in CLIP

Figure 4 for Semantic Token Reweighting for Interpretable and Controllable Text Embeddings in CLIP

Abstract:A text encoder within Vision-Language Models (VLMs) like CLIP plays a crucial role in translating textual input into an embedding space shared with images, thereby facilitating the interpretative analysis of vision tasks through natural language. Despite the varying significance of different textual elements within a sentence depending on the context, efforts to account for variation of importance in constructing text embeddings have been lacking. We propose a framework of Semantic Token Reweighting to build Interpretable text embeddings (SToRI), which incorporates controllability as well. SToRI refines the text encoding process in CLIP by differentially weighting semantic elements based on contextual importance, enabling finer control over emphasis responsive to data-driven insights and user preferences. The efficacy of SToRI is demonstrated through comprehensive experiments on few-shot image classification and image retrieval tailored to user preferences.

* Accepted at EMNLP 2024 Findings

Via

Access Paper or Ask Questions

Unleashing Multi-Hop Reasoning Potential in Large Language Models through Repetition of Misordered Context

Oct 09, 2024

Sangwon Yu, Ik-hwan Kim, Jongyoon Song, Saehyung Lee, Junsung Park, Sungroh Yoon

Figure 1 for Unleashing Multi-Hop Reasoning Potential in Large Language Models through Repetition of Misordered Context

Figure 2 for Unleashing Multi-Hop Reasoning Potential in Large Language Models through Repetition of Misordered Context

Figure 3 for Unleashing Multi-Hop Reasoning Potential in Large Language Models through Repetition of Misordered Context

Figure 4 for Unleashing Multi-Hop Reasoning Potential in Large Language Models through Repetition of Misordered Context

Abstract:Multi-hop reasoning, which requires multi-step reasoning based on the supporting documents within a given context, remains challenging for large language models (LLMs). LLMs often struggle to filter out irrelevant documents within the context, and their performance is sensitive to the position of supporting documents within that context. In this paper, we identify an additional challenge: LLMs' performance is also sensitive to the order in which the supporting documents are presented. We refer to this as the misordered context problem. To address this issue, we propose a simple yet effective method called context repetition (CoRe), which involves prompting the model by repeatedly presenting the context to ensure the supporting documents are presented in the optimal order for the model. Using CoRe, we improve the F1 score by up to 30%p on multi-hop QA tasks and increase accuracy by up to 70%p on a synthetic task. Additionally, CoRe helps mitigate the well-known "lost-in-the-middle" problem in LLMs and can be effectively combined with retrieval-based approaches utilizing Chain-of-Thought (CoT) reasoning.

Via

Access Paper or Ask Questions

Textual Training for the Hassle-Free Removal of Unwanted Visual Data

Sep 30, 2024

Saehyung Lee, Jisoo Mok, Sangha Park, Yongho Shin, Dahuin Jung, Sungroh Yoon

Figure 1 for Textual Training for the Hassle-Free Removal of Unwanted Visual Data

Figure 2 for Textual Training for the Hassle-Free Removal of Unwanted Visual Data

Figure 3 for Textual Training for the Hassle-Free Removal of Unwanted Visual Data

Figure 4 for Textual Training for the Hassle-Free Removal of Unwanted Visual Data

Abstract:In our study, we explore methods for detecting unwanted content lurking in visual datasets. We provide a theoretical analysis demonstrating that a model capable of successfully partitioning visual data can be obtained using only textual data. Based on the analysis, we propose Hassle-Free Textual Training (HFTT), a streamlined method capable of acquiring detectors for unwanted visual content, using only synthetic textual data in conjunction with pre-trained vision-language models. HFTT features an innovative objective function that significantly reduces the necessity for human involvement in data annotation. Furthermore, HFTT employs a clever textual data synthesis method, effectively emulating the integration of unknown visual data distribution into the training process at no extra cost. The unique characteristics of HFTT extend its utility beyond traditional out-of-distribution detection, making it applicable to tasks that address more abstract concepts. We complement our analyses with experiments in out-of-distribution detection and hateful image detection. Our codes are available at https://github.com/Saehyung-Lee/HFTT

* NeurIPS 2024

Via

Access Paper or Ask Questions

VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance

Sep 24, 2024

Jiheum Yeom, Heeseung Kim, Jooyoung Choi, Che Hyun Lee, Nohil Park, Sungroh Yoon

Figure 1 for VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance

Figure 2 for VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance

Figure 3 for VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance

Figure 4 for VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance

Abstract:When applying parameter-efficient finetuning via LoRA onto speaker adaptive text-to-speech models, adaptation performance may decline compared to full-finetuned counterparts, especially for out-of-domain speakers. Here, we propose VoiceGuider, a parameter-efficient speaker adaptive text-to-speech system reinforced with autoguidance to enhance the speaker adaptation performance, reducing the gap against full-finetuned models. We carefully explore various ways of strengthening autoguidance, ultimately finding the optimal strategy. VoiceGuider as a result shows robust adaptation performance especially on extreme out-of-domain speech data. We provide audible samples in our demo page.

* Submitted to ICASSP 2025, Demo Page: https://voiceguider.github.io/

Via

Access Paper or Ask Questions

NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers

Sep 24, 2024

Nohil Park, Heeseung Kim, Che Hyun Lee, Jooyoung Choi, Jiheum Yeom, Sungroh Yoon

Figure 1 for NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers

Figure 2 for NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers

Figure 3 for NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers

Figure 4 for NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers

Abstract:We present NanoVoice, a personalized text-to-speech model that efficiently constructs voice adapters for multiple speakers simultaneously. NanoVoice introduces a batch-wise speaker adaptation technique capable of fine-tuning multiple references in parallel, significantly reducing training time. Beyond building separate adapters for each speaker, we also propose a parameter sharing technique that reduces the number of parameters used for speaker adaptation. By incorporating a novel trainable scale matrix, NanoVoice mitigates potential performance degradation during parameter sharing. NanoVoice achieves performance comparable to the baselines, while training 4 times faster and using 45 percent fewer parameters for speaker adaptation with 40 reference voices. Extensive ablation studies and analysis further validate the efficiency of our model.

* Submitted to ICASSP 2025, Demo Page: https://nanovoice.github.io/

Via

Access Paper or Ask Questions

VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech

Aug 27, 2024

Heeseung Kim, Sang-gil Lee, Jiheum Yeom, Che Hyun Lee, Sungwon Kim, Sungroh Yoon

Figure 1 for VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech

Figure 2 for VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech

Figure 3 for VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech

Figure 4 for VoiceTailor: Lightweight Plug-In Adapter for Diffusion-Based Personalized Text-to-Speech

Abstract:We propose VoiceTailor, a parameter-efficient speaker-adaptive text-to-speech (TTS) system, by equipping a pre-trained diffusion-based TTS model with a personalized adapter. VoiceTailor identifies pivotal modules that benefit from the adapter based on a weight change ratio analysis. We utilize Low-Rank Adaptation (LoRA) as a parameter-efficient adaptation method and incorporate the adapter into pivotal modules of the pre-trained diffusion decoder. To achieve powerful adaptation performance with few parameters, we explore various guidance techniques for speaker adaptation and investigate the best strategies to strengthen speaker information. VoiceTailor demonstrates comparable speaker adaptation performance to existing adaptive TTS models by fine-tuning only 0.25\% of the total parameters. VoiceTailor shows strong robustness when adapting to a wide range of real-world speakers, as shown in the demo.

Via

Access Paper or Ask Questions

Unlocking Intrinsic Fairness in Stable Diffusion

Aug 22, 2024

Eunji Kim, Siwon Kim, Rahim Entezari, Sungroh Yoon

Figure 1 for Unlocking Intrinsic Fairness in Stable Diffusion

Figure 2 for Unlocking Intrinsic Fairness in Stable Diffusion

Figure 3 for Unlocking Intrinsic Fairness in Stable Diffusion

Figure 4 for Unlocking Intrinsic Fairness in Stable Diffusion

Abstract:Recent text-to-image models like Stable Diffusion produce photo-realistic images but often show demographic biases. Previous debiasing methods focused on training-based approaches, failing to explore the root causes of bias and overlooking Stable Diffusion's potential for unbiased image generation. In this paper, we demonstrate that Stable Diffusion inherently possesses fairness, which can be unlocked to achieve debiased outputs. Through carefully designed experiments, we identify the excessive bonding between text prompts and the diffusion process as a key source of bias. To address this, we propose a novel approach that perturbs text conditions to unleash Stable Diffusion's intrinsic fairness. Our method effectively mitigates bias without additional tuning, while preserving image-text alignment and image quality.

* 21 pages, 20 figures; First two authors contributed equally

Via

Access Paper or Ask Questions