Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Wei Zeng

GenColor: Generative Color-Concept Association in Visual Design

Mar 05, 2025

Yihan Hou, Xingchen Zeng, Yusong Wang, Manling Yang, Xiaojiao Chen, Wei Zeng

Abstract:Existing approaches for color-concept association typically rely on query-based image referencing, and color extraction from image references. However, these approaches are effective only for common concepts, and are vulnerable to unstable image referencing and varying image conditions. Our formative study with designers underscores the need for primary-accent color compositions and context-dependent colors (e.g., 'clear' vs. 'polluted' sky) in design. In response, we introduce a generative approach for mining semantically resonant colors leveraging images generated by text-to-image models. Our insight is that contemporary text-to-image models can resemble visual patterns from large-scale real-world data. The framework comprises three stages: concept instancing produces generative samples using diffusion models, text-guided image segmentation identifies concept-relevant regions within the image, and color association extracts primarily accompanied by accent colors. Quantitative comparisons with expert designs validate our approach's effectiveness, and we demonstrate the applicability through cases in various design scenarios and a gallery.

* 19 pages, 16 figures. Accepted at CHI Conference on Human Factors in Computing Systems (CHI'25), April 26-May 1, 2025, Yokohama, Japan

Via

Access Paper or Ask Questions

SketchFlex: Facilitating Spatial-Semantic Coherence in Text-to-Image Generation with Region-Based Sketches

Feb 11, 2025

Haichuan Lin, Yilin Ye, Jiazhi Xia, Wei Zeng

Abstract:Text-to-image models can generate visually appealing images from text descriptions. Efforts have been devoted to improving model controls with prompt tuning and spatial conditioning. However, our formative study highlights the challenges for non-expert users in crafting appropriate prompts and specifying fine-grained spatial conditions (e.g., depth or canny references) to generate semantically cohesive images, especially when multiple objects are involved. In response, we introduce SketchFlex, an interactive system designed to improve the flexibility of spatially conditioned image generation using rough region sketches. The system automatically infers user prompts with rational descriptions within a semantic space enriched by crowd-sourced object attributes and relationships. Additionally, SketchFlex refines users' rough sketches into canny-based shape anchors, ensuring the generation quality and alignment of user intentions. Experimental results demonstrate that SketchFlex achieves more cohesive image generations than end-to-end models, meanwhile significantly reducing cognitive load and better matching user intentions compared to region-based generation baseline.

* conference: CHI2025

Via

Access Paper or Ask Questions

MCCoder: Streamlining Motion Control with LLM-Assisted Code Generation and Rigorous Verification

Oct 19, 2024

Yin Li, Liangwei Wang, Shiyuan Piao, Boo-Ho Yang, Ziyue Li, Wei Zeng, Fugee Tsung

Figure 1 for MCCoder: Streamlining Motion Control with LLM-Assisted Code Generation and Rigorous Verification

Figure 2 for MCCoder: Streamlining Motion Control with LLM-Assisted Code Generation and Rigorous Verification

Figure 3 for MCCoder: Streamlining Motion Control with LLM-Assisted Code Generation and Rigorous Verification

Figure 4 for MCCoder: Streamlining Motion Control with LLM-Assisted Code Generation and Rigorous Verification

Abstract:Large Language Models (LLMs) have shown considerable promise in code generation. However, the automation sector, especially in motion control, continues to rely heavily on manual programming due to the complexity of tasks and critical safety considerations. In this domain, incorrect code execution can pose risks to both machinery and personnel, necessitating specialized expertise. To address these challenges, we introduce MCCoder, an LLM-powered system designed to generate code that addresses complex motion control tasks, with integrated soft-motion data verification. MCCoder enhances code generation through multitask decomposition, hybrid retrieval-augmented generation (RAG), and self-correction with a private motion library. Moreover, it supports data verification by logging detailed trajectory data and providing simulations and plots, allowing users to assess the accuracy of the generated code and bolstering confidence in LLM-based programming. To ensure robust validation, we propose MCEVAL, an evaluation dataset with metrics tailored to motion control tasks of varying difficulties. Experiments indicate that MCCoder improves performance by 11.61% overall and by 66.12% on complex tasks in MCEVAL dataset compared with base models with naive RAG. This system and dataset aim to facilitate the application of code generation in automation settings with strict safety requirements. MCCoder is publicly available at https://github.com/MCCodeAI/MCCoder.

Via

Access Paper or Ask Questions

AI-rays: Exploring Bias in the Gaze of AI Through a Multimodal Interactive Installation

Oct 03, 2024

Ziyao Gao, Yiwen Zhang, Ling Li, Theodoros Papatheodorou, Wei Zeng

Abstract:Data surveillance has become more covert and pervasive with AI algorithms, which can result in biased social classifications. Appearance offers intuitive identity signals, but what does it mean to let AI observe and speculate on them? We introduce AI-rays, an interactive installation where AI generates speculative identities from participants' appearance which are expressed through synthesized personal items placed in participants' bags. It uses speculative X-ray visions to contrast reality with AI-generated assumptions, metaphorically highlighting AI's scrutiny and biases. AI-rays promotes discussions on modern surveillance and the future of human-machine reality through a playful, immersive experience exploring AI biases.

* Siggraph Asia 2024 Art Paper

Via

Access Paper or Ask Questions

Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning

Jul 29, 2024

Xingchen Zeng, Haichuan Lin, Yilin Ye, Wei Zeng

Figure 1 for Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning

Figure 2 for Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning

Figure 3 for Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning

Figure 4 for Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning

Abstract:Emerging multimodal large language models (MLLMs) exhibit great potential for chart question answering (CQA). Recent efforts primarily focus on scaling up training datasets (i.e., charts, data tables, and question-answer (QA) pairs) through data collection and synthesis. However, our empirical study on existing MLLMs and CQA datasets reveals notable gaps. First, current data collection and synthesis focus on data volume and lack consideration of fine-grained visual encodings and QA tasks, resulting in unbalanced data distribution divergent from practical CQA scenarios. Second, existing work follows the training recipe of the base MLLMs initially designed for natural images, under-exploring the adaptation to unique chart characteristics, such as rich text elements. To fill the gap, we propose a visualization-referenced instruction tuning approach to guide the training dataset enhancement and model development. Specifically, we propose a novel data engine to effectively filter diverse and high-quality data from existing datasets and subsequently refine and augment the data using LLM-based generation techniques to better align with practical QA tasks and visual encodings. Then, to facilitate the adaptation to chart characteristics, we utilize the enriched data to train an MLLM by unfreezing the vision encoder and incorporating a mixture-of-resolution adaptation strategy for enhanced fine-grained recognition. Experimental results validate the effectiveness of our approach. Even with fewer training examples, our model consistently outperforms state-of-the-art CQA models on established benchmarks. We also contribute a dataset split as a benchmark for future research. Source codes and datasets of this paper are available at https://github.com/zengxingchen/ChartQA-MLLM.

* 11 pages, 7 figures

Via

Access Paper or Ask Questions

Predicting T-Cell Receptor Specificity

Jul 27, 2024

Tengyao Tu, Wei Zeng, Kun Zhao, Zhenyu Zhang

Figure 1 for Predicting T-Cell Receptor Specificity

Figure 2 for Predicting T-Cell Receptor Specificity

Figure 3 for Predicting T-Cell Receptor Specificity

Figure 4 for Predicting T-Cell Receptor Specificity

Abstract:Researching the specificity of TCR contributes to the development of immunotherapy and provides new opportunities and strategies for personalized cancer immunotherapy. Therefore, we established a TCR generative specificity detection framework consisting of an antigen selector and a TCR classifier based on the Random Forest algorithm, aiming to efficiently screen out TCRs and target antigens and achieve TCR specificity prediction. Furthermore, we used the k-fold validation method to compare the performance of our model with ordinary deep learning methods. The result proves that adding a classifier to the model based on the random forest algorithm is very effective, and our model generally outperforms ordinary deep learning methods. Moreover, we put forward feasible optimization suggestions for the shortcomings and challenges of our model found during model implementation.

Via

Access Paper or Ask Questions

ModalChorus: Visual Probing and Alignment of Multi-modal Embeddings via Modal Fusion Map

Jul 17, 2024

Yilin Ye, Shishi Xiao, Xingchen Zeng, Wei Zeng

Abstract:Multi-modal embeddings form the foundation for vision-language models, such as CLIP embeddings, the most widely used text-image embeddings. However, these embeddings are vulnerable to subtle misalignment of cross-modal features, resulting in decreased model performance and diminished generalization. To address this problem, we design ModalChorus, an interactive system for visual probing and alignment of multi-modal embeddings. ModalChorus primarily offers a two-stage process: 1) embedding probing with Modal Fusion Map (MFM), a novel parametric dimensionality reduction method that integrates both metric and nonmetric objectives to enhance modality fusion; and 2) embedding alignment that allows users to interactively articulate intentions for both point-set and set-set alignments. Quantitative and qualitative comparisons for CLIP embeddings with existing dimensionality reduction (e.g., t-SNE and MDS) and data fusion (e.g., data context map) methods demonstrate the advantages of MFM in showcasing cross-modal features over common vision-language datasets. Case studies reveal that ModalChorus can facilitate intuitive discovery of misalignment and efficient re-alignment in scenarios ranging from zero-shot classification to cross-modal retrieval and generation.

* Accepted by VIS 2024

Via

Access Paper or Ask Questions

Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning

Jul 09, 2024

Fanyue Wei, Wei Zeng, Zhenyang Li, Dawei Yin, Lixin Duan, Wen Li

Figure 1 for Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning

Figure 2 for Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning

Figure 3 for Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning

Figure 4 for Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning

Abstract:Personalized text-to-image models allow users to generate varied styles of images (specified with a sentence) for an object (specified with a set of reference images). While remarkable results have been achieved using diffusion-based generation models, the visual structure and details of the object are often unexpectedly changed during the diffusion process. One major reason is that these diffusion-based approaches typically adopt a simple reconstruction objective during training, which can hardly enforce appropriate structural consistency between the generated and the reference images. To this end, in this paper, we design a novel reinforcement learning framework by utilizing the deterministic policy gradient method for personalized text-to-image generation, with which various objectives, differential or even non-differential, can be easily incorporated to supervise the diffusion models to improve the quality of the generated images. Experimental results on personalized text-to-image generation benchmark datasets demonstrate that our proposed approach outperforms existing state-of-the-art methods by a large margin on visual fidelity while maintaining text-alignment. Our code is available at: \url{https://github.com/wfanyue/DPG-T2I-Personalization}.

* Accepted by ECCV 2024

Via

Access Paper or Ask Questions

End-to-End Real-World Polyphonic Piano Audio-to-Score Transcription with Hierarchical Decoding

May 22, 2024

Wei Zeng, Xian He, Ye Wang

Abstract:Piano audio-to-score transcription (A2S) is an important yet underexplored task with extensive applications for music composition, practice, and analysis. However, existing end-to-end piano A2S systems faced difficulties in retrieving bar-level information such as key and time signatures, and have been trained and evaluated with only synthetic data. To address these limitations, we propose a sequence-to-sequence (Seq2Seq) model with a hierarchical decoder that aligns with the hierarchical structure of musical scores, enabling the transcription of score information at both the bar and note levels by multi-task learning. To bridge the gap between synthetic data and recordings of human performance, we propose a two-stage training scheme, which involves pre-training the model using an expressive performance rendering (EPR) system on synthetic audio, followed by fine-tuning the model using recordings of human performance. To preserve the voicing structure for score reconstruction, we propose a pre-processing method for **Kern scores in scenarios with an unconstrained number of voices. Experimental results support the effectiveness of our proposed approaches, in terms of both transcription performance on synthetic audio data in comparison to the current state-of-the-art, and the first experiment on human recordings.

* 8 pages, 5 figures, accepted by IJCAI 2024 - AI, Arts & Creativity Track

Via

Access Paper or Ask Questions

Learning High-Quality Navigation and Zooming on Omnidirectional Images in Virtual Reality

May 01, 2024

Zidong Cao, Zhan Wang, Yexin Liu, Yan-Pei Cao, Ying Shan, Wei Zeng, Lin Wang

Figure 1 for Learning High-Quality Navigation and Zooming on Omnidirectional Images in Virtual Reality

Figure 2 for Learning High-Quality Navigation and Zooming on Omnidirectional Images in Virtual Reality

Figure 3 for Learning High-Quality Navigation and Zooming on Omnidirectional Images in Virtual Reality

Figure 4 for Learning High-Quality Navigation and Zooming on Omnidirectional Images in Virtual Reality

Abstract:Viewing omnidirectional images (ODIs) in virtual reality (VR) represents a novel form of media that provides immersive experiences for users to navigate and interact with digital content. Nonetheless, this sense of immersion can be greatly compromised by a blur effect that masks details and hampers the user's ability to engage with objects of interest. In this paper, we present a novel system, called OmniVR, designed to enhance visual clarity during VR navigation. Our system enables users to effortlessly locate and zoom in on the objects of interest in VR. It captures user commands for navigation and zoom, converting these inputs into parameters for the Mobius transformation matrix. Leveraging these parameters, the ODI is refined using a learning-based algorithm. The resultant ODI is presented within the VR media, effectively reducing blur and increasing user engagement. To verify the effectiveness of our system, we first evaluate our algorithm with state-of-the-art methods on public datasets, which achieves the best performance. Furthermore, we undertake a comprehensive user study to evaluate viewer experiences across diverse scenarios and to gather their qualitative feedback from multiple perspectives. The outcomes reveal that our system enhances user engagement by improving the viewers' recognition, reducing discomfort, and improving the overall immersive experience. Our system makes the navigation and zoom more user-friendly.

* 11 pages

Via

Access Paper or Ask Questions