Yan Zhang

Fellow, IEEE

GDDS: A Single Domain Generalized Defect Detection Frame of Open World Scenario using Gather and Distribute Domain-shift Suppression Network

Jul 18, 2024

Haiyong Chen, Yaxiu Zhang, Yan Zhang, Xin Zhang, Xingwei Yan

Abstract: Efficient and intelligent surface defect detection of photovoltaic modules is crucial for improving the quality of photovoltaic modules and ensuring the reliable operation of large-scale infrastructure. However, distribution shift in the data makes it challenging to build defect detection models for open-world scenarios such as photovoltaic manufacturing and power plant inspection. Therefore, we propose the Gather and Distribute Domain-shift Suppression Network (GDDS). It adopts a single-domain generalization method that is completely independent of the test samples to address the problem of distribution shift. Using a one-stage network as the baseline breaks through the limitations of traditional domain generalization methods, which typically use two-stage networks. It not only balances detection accuracy and speed but also simplifies model deployment and application. GDDS includes two modules: the DeepSpine Module and the Gather and Distribute Module. Specifically, the DeepSpine Module applies a wider range of contextual information and suppresses background style shift by acquiring and concatenating multi-scale features. The Gather and Distribute Module collects and distributes global information to achieve cross-layer interactive learning of multi-scale channel features and suppress defect instance shift. Furthermore, GDDS uses the normalized Wasserstein distance for similarity measurement, reducing measurement errors caused by bounding box position deviations. We conducted a comprehensive evaluation of GDDS on the EL endogenous shift dataset and the Photovoltaic inspection infrared image dataset. The experimental results show that GDDS adapts to defect detection in open-world scenarios faster and better than other state-of-the-art methods.

* 13 images
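
The abstract does not spell out how the normalized Wasserstein distance is computed. A minimal sketch, assuming the commonly used formulation in which each box `(cx, cy, w, h)` is modelled as a 2D Gaussian with mean `(cx, cy)` and covariance `diag(w^2/4, h^2/4)`, is shown below. The function names and the scale constant `c` are illustrative assumptions, not values taken from the paper. The comparison with IoU shows why such a measure tolerates position deviations: two boxes that do not overlap at all still receive a graded similarity.

```python
import math


def normalized_wasserstein_distance(box_a, box_b, c=12.8):
    """Similarity in (0, 1] between two boxes given as (cx, cy, w, h).

    c is a hypothetical, dataset-dependent scale chosen for illustration.
    """
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    # Squared 2-Wasserstein distance between the two axis-aligned Gaussians.
    w2_squared = (
        (cxa - cxb) ** 2
        + (cya - cyb) ** 2
        + ((wa - wb) / 2) ** 2
        + ((ha - hb) / 2) ** 2
    )
    return math.exp(-math.sqrt(w2_squared) / c)


def iou(box_a, box_b):
    """Plain intersection-over-union for boxes given as (cx, cy, w, h)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union


reference = (10.0, 10.0, 4.0, 4.0)
near_miss = (15.0, 10.0, 4.0, 4.0)  # shifted by 5 px, no overlap
far_miss = (30.0, 10.0, 4.0, 4.0)   # shifted by 20 px, no overlap

# IoU is 0 for both misses; the Wasserstein-based score still ranks them.
print(iou(reference, near_miss), iou(reference, far_miss))
print(normalized_wasserstein_distance(reference, near_miss))
print(normalized_wasserstein_distance(reference, far_miss))
```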

Any-Property-Conditional Molecule Generation with Self-Criticism using Spanning Trees

Jul 12, 2024

Alexia Jolicoeur-Martineau, Aristide Baratin, Kisoo Kwon, Boris Knyazev, Yan Zhang

Abstract: Generating novel molecules is challenging, with most representations leading to generative models producing many invalid molecules. Spanning Tree-based Graph Generation (STGG) is a promising approach to ensure the generation of valid molecules, outperforming state-of-the-art SMILES and graph diffusion models for unconditional generation. In the real world, we want to be able to generate molecules conditional on one or multiple desired properties rather than unconditionally. Thus, in this work, we extend STGG to multi-property-conditional generation. Our approach, STGG+, incorporates a modern Transformer architecture, random masking of properties during training (enabling conditioning on any subset of properties and classifier-free guidance), an auxiliary property-prediction loss (allowing the model to self-criticize molecules and select the best ones), and other improvements. We show that STGG+ achieves state-of-the-art performance on in-distribution and out-of-distribution conditional generation, and reward maximization.
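
To make the link between random property masking and classifier-free guidance concrete, here is a minimal sketch. It is not the STGG+ implementation: the helper names, the choice of `0.0` as the placeholder for a hidden property, and the guidance parameterization are all assumptions made for illustration. The idea is that a model trained with some properties hidden can later be queried both with and without conditioning, and the two predictions can be combined.

```python
import random


def mask_properties(props, p_drop, rng):
    """Hide each property independently with probability p_drop.

    Returns (values, mask) where mask[i] == 1 means the property is visible.
    Hidden values are replaced by 0.0 (a hypothetical placeholder).
    """
    mask = [0 if rng.random() < p_drop else 1 for _ in props]
    values = [v if m else 0.0 for v, m in zip(props, mask)]
    return values, mask


def guided_logits(cond, uncond, w):
    """One common classifier-free guidance form: uncond + w * (cond - uncond).

    w = 1 recovers the conditional logits; w > 1 pushes further toward them.
    """
    return [u + w * (c - u) for c, u in zip(cond, uncond)]


rng = random.Random(0)
values, mask = mask_properties([0.7, 310.2, 2.5], p_drop=0.5, rng=rng)
print(values, mask)
print(guided_logits(cond=[2.0, 0.0], uncond=[1.0, 1.0], w=2.0))
```

Because every subset of properties is seen during training, the same model can be conditioned on any subset at sampling time by setting the mask accordingly.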

PAS: Data-Efficient Plug-and-Play Prompt Augmentation System

Jul 11, 2024

Miao Zheng, Hao Liang, Fan Yang, Haoze Sun, Tianpeng Li, Lingchu Xiong, Yan Zhang, Youzhen Wu, Kun Li, Yanjun Shen(+9 more)

Abstract: In recent years, the rise of Large Language Models (LLMs) has spurred a growing demand for plug-and-play AI systems. Among the various AI techniques, prompt engineering stands out as particularly significant. However, users often face challenges in writing prompts due to the steep learning curve and significant time investment, and existing automatic prompt engineering (APE) models can be difficult to use. To address this issue, we propose PAS, an LLM-based plug-and-play APE system. PAS utilizes LLMs trained on high-quality, automatically generated prompt complementary datasets, resulting in exceptional performance. In comprehensive benchmarks, PAS achieves state-of-the-art (SoTA) results compared to previous APE models, with an average improvement of 6.09 points. Moreover, PAS is highly efficient, achieving SoTA performance with only 9000 data points. Additionally, PAS can autonomously generate prompt augmentation data without requiring additional human labor. Its flexibility also allows it to be compatible with all existing LLMs and applicable to a wide range of tasks. PAS excels in human evaluations, underscoring its suitability as a plug-in for users. This combination of high performance, efficiency, and flexibility makes PAS a valuable system for enhancing the usability and effectiveness of LLMs through improved prompt engineering.

Retrieved In-Context Principles from Previous Mistakes

Jul 08, 2024

Hao Sun, Yong Jiang, Bo Wang, Yingyan Hou, Yan Zhang, Pengjun Xie, Fei Huang

Abstract: In-context learning (ICL) has been instrumental in adapting Large Language Models (LLMs) to downstream tasks using correct input-output examples. Recent advances have attempted to improve model performance through principles derived from mistakes, yet these approaches suffer from lack of customization and inadequate error coverage. To address these limitations, we propose Retrieved In-Context Principles (RICP), a novel teacher-student framework. In RICP, the teacher model analyzes mistakes from the student model to generate reasons and insights for preventing similar mistakes. These mistakes are clustered based on their underlying reasons for developing task-level principles, enhancing the error coverage of principles. During inference, the most relevant mistakes for each question are retrieved to create question-level principles, improving the customization of the provided guidance. RICP is orthogonal to existing prompting methods and does not require intervention from the teacher model during inference. Experimental results across seven reasoning benchmarks reveal that RICP effectively enhances performance when applied to various prompting strategies.
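
The retrieval step at inference time can be sketched as follows. The abstract does not say which similarity measure is used, so this example assumes a simple bag-of-words cosine similarity; the record layout (`question`, `insight`) and the function names are likewise hypothetical. Since the mistake bank is built ahead of time, no teacher model is needed when a new question arrives.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_principles(question, mistake_bank, k=2):
    """Return the insights attached to the k most similar past mistakes."""
    query = Counter(question.lower().split())
    ranked = sorted(
        mistake_bank,
        key=lambda m: cosine(query, Counter(m["question"].lower().split())),
        reverse=True,
    )
    return [m["insight"] for m in ranked[:k]]


# Hypothetical mistake bank; in RICP the insights come from the teacher model.
bank = [
    {"question": "what is 15 percent of 80",
     "insight": "Convert the percentage to a decimal first."},
    {"question": "how many days between march 3 and april 2",
     "insight": "Count month lengths explicitly."},
    {"question": "what is 20 percent of 50 plus 5",
     "insight": "Apply the percentage before adding."},
]

for principle in retrieve_principles("what is 30 percent of 90", bank):
    print(principle)
```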

Towards Context-aware Support for Color Vision Deficiency: An Approach Integrating LLM and AR

Jul 05, 2024

Shogo Morita, Yan Zhang, Takuto Yamauchi, Sinan Chen, Jialong Li, Kenji Tei

Abstract: People with color vision deficiency often face challenges in distinguishing colors such as red and green, which can complicate daily tasks and require the use of assistive tools or environmental adjustments. Current support tools mainly focus on presentation-based aids, like the color vision modes found in iPhone accessibility settings. However, offering context-aware support, like indicating the doneness of meat, remains a challenge since task-specific solutions are not cost-effective for all possible scenarios. To address this, our paper proposes an application that provides contextual and autonomous assistance. This application is mainly composed of: (i) an augmented reality interface that efficiently captures context; and (ii) a multi-modal large language model-based reasoner that interprets the context and then reasons about the appropriate support content. Preliminary user experiments with two color vision deficient users across five different scenarios have demonstrated the effectiveness and universality of our application.

DynaThink: Fast or Slow? A Dynamic Decision-Making Framework for Large Language Models

Jul 01, 2024

Jiabao Pan, Yan Zhang, Chen Zhang, Zuozhu Liu, Hongwei Wang, Haizhou Li

Abstract: Large language models (LLMs) have demonstrated emergent capabilities across diverse reasoning tasks via popular Chains-of-Thought (COT) prompting. However, such a simple and fast COT approach often encounters limitations in dealing with complicated problems, while a thorough method, which considers multiple reasoning pathways and verifies each step carefully, results in slower inference. This paper addresses the challenge of enabling LLMs to autonomously select between fast and slow inference methods, thereby optimizing both efficiency and effectiveness. We introduce a dynamic decision-making framework that categorizes tasks into two distinct pathways: 'Fast', designated for tasks where the LLM quickly identifies a high-confidence solution, and 'Slow', allocated for tasks that the LLM perceives as complex, for which it has low confidence in immediate solutions and which require more reasoning paths to verify. Experiments on five popular reasoning benchmarks demonstrated the superiority of DynaThink over baselines.
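
A minimal sketch of the fast/slow routing idea is given below. The abstract does not define how confidence is measured, so this example assumes that agreement among a few sampled answers stands in for confidence; the majority-vote rule, the `threshold` value, and the function name are assumptions for illustration and may differ from the criteria used in DynaThink.

```python
from collections import Counter


def route(sampled_answers, threshold=0.5):
    """Send a question down the 'fast' or 'slow' path.

    sampled_answers: final answers from a handful of quick reasoning samples.
    A question goes 'fast' when more than `threshold` of the samples agree.
    """
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    answer, votes = Counter(sampled_answers).most_common(1)[0]
    confidence = votes / len(sampled_answers)
    if confidence > threshold:
        return "fast", answer
    # Low agreement: defer to a slower procedure with more reasoning paths.
    return "slow", None


print(route(["42", "42", "42", "17"]))  # strong agreement
print(route(["1", "2", "3", "1"]))      # weak agreement
```

Questions routed to the slow path would then be given additional reasoning paths before an answer is accepted.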

FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis

Jun 30, 2024

Yinlin Guo, Yening Lv, Jinqiao Dou, Yan Zhang, Yuehai Wang

Abstract: While recent advances in Text-To-Speech synthesis have yielded remarkable improvements in generating high-quality speech, research on lightweight and fast models is limited. This paper introduces FLY-TTS, a new fast, lightweight and high-quality speech synthesis system based on VITS. Specifically, 1) We replace the decoder with ConvNeXt blocks that generate Fourier spectral coefficients followed by the inverse short-time Fourier transform to synthesize waveforms; 2) To compress the model size, we introduce the grouped parameter-sharing mechanism to the text encoder and flow-based model; 3) We further employ the large pre-trained WavLM model for adversarial training to improve synthesis quality. Experimental results show that our model achieves a real-time factor of 0.0139 on an Intel Core i9 CPU, 8.8x faster than the baseline (0.1221), with a 1.6x parameter compression. Objective and subjective evaluations indicate that FLY-TTS exhibits comparable speech quality to the strong baseline.

* Accepted to Interspeech 2024. 5 pages, 1 figure
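
The speed figures in the abstract can be checked with a few lines of arithmetic. The sketch below uses the usual definition of real-time factor (synthesis time divided by the duration of the generated audio); the function name and the 10-second example are illustrative, while the two RTF values are the ones quoted above.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the audio produced."""
    return synthesis_seconds / audio_seconds


baseline_rtf = 0.1221  # baseline, as quoted in the abstract
fly_tts_rtf = 0.0139   # FLY-TTS, as quoted in the abstract

speedup = baseline_rtf / fly_tts_rtf
print(f"speedup: {speedup:.1f}x")

# At this RTF, 10 s of speech would take about 0.139 s to synthesize.
audio_seconds = 10.0
print(f"synthesis time: {fly_tts_rtf * audio_seconds:.3f} s")
```

The ratio works out to roughly 8.78, which rounds to the 8.8x reported.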

Using Large Language Models to Assist Video Content Analysis: An Exploratory Study of Short Videos on Depression

Jun 27, 2024

Jiaying Liu, Yunlong Wang, Yao Lyu, Yiheng Su, Shuo Niu, Xuhai "Orson" Xu, Yan Zhang

Abstract: Despite the growing interest in leveraging Large Language Models (LLMs) for content analysis, current studies have primarily focused on text-based content. In the present work, we explored the potential of LLMs in assisting video content analysis by conducting a case study that followed a new workflow of LLM-assisted multimodal content analysis. The workflow encompasses codebook design, prompt engineering, LLM processing, and human evaluation. We strategically crafted annotation prompts to get LLM Annotations in structured form and explanation prompts to generate LLM Explanations for a better understanding of LLM reasoning and transparency. To test the LLM's video annotation capabilities, we analyzed 203 keyframes extracted from 25 YouTube short videos about depression. We compared the LLM Annotations with those of two human coders and found that the LLM has higher accuracy in object and activity Annotations than in emotion and genre Annotations. Moreover, we identified the potential and limitations of the LLM's capabilities in annotating videos. Based on the findings, we explore opportunities and challenges for future research and improvements to the workflow. We also discuss ethical concerns surrounding future studies based on LLM-assisted video analysis.

* 6 pages, 2 figures, under review in CSCW 24

Local Manifold Learning for No-Reference Image Quality Assessment

Jun 27, 2024

Timin Gao, Wensheng Pan, Yan Zhang, Sicheng Zhao, Shengchuan Zhang, Xiawu Zheng, Ke Li, Liujuan Cao, Rongrong Ji

Abstract: Contrastive learning has considerably advanced the field of Image Quality Assessment (IQA), emerging as a widely adopted technique. The core mechanism of contrastive learning involves minimizing the distance between quality-similar (positive) examples while maximizing the distance between quality-dissimilar (negative) examples. Despite its successes, current contrastive learning methods often neglect the importance of preserving the local manifold structure. This oversight can result in a high degree of similarity among hard examples within the feature space, thereby impeding effective differentiation and assessment. To address this issue, we propose an innovative framework that integrates local manifold learning with contrastive learning for No-Reference Image Quality Assessment (NR-IQA). Our method begins by sampling multiple crops from a given image, identifying the most visually salient crop. This crop is then used to cluster other crops from the same image as the positive class, while crops from different images are treated as negative classes to increase inter-class distance. Uniquely, our approach also considers non-saliency crops from the same image as intra-class negative classes to preserve their distinctiveness. Additionally, we employ a mutual learning framework, which further enhances the model's ability to adaptively learn and identify visual saliency regions. Our approach demonstrates better performance than state-of-the-art methods on 7 standard datasets, achieving PLCC values of 0.942 on TID2013 (compared to 0.908) and 0.914 on LIVEC (compared to 0.894).
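
For readers unfamiliar with the metric, PLCC is the Pearson linear correlation coefficient between a model's predicted quality scores and human opinion scores. A minimal sketch of the computation follows; the function name and the toy scores are made up for illustration, and any nonlinear score mapping that an evaluation protocol might apply beforehand is omitted.

```python
import math


def plcc(predicted, subjective):
    """Pearson linear correlation between two equal-length score lists."""
    if len(predicted) != len(subjective) or len(predicted) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_s = sum(subjective) / n
    cov = sum((p - mean_p) * (s - mean_s) for p, s in zip(predicted, subjective))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_s = sum((s - mean_s) ** 2 for s in subjective)
    return cov / math.sqrt(var_p * var_s)


# Toy example: predictions that track the human scores closely.
model_scores = [0.31, 0.55, 0.62, 0.80, 0.91]
human_scores = [2.9, 4.8, 5.5, 7.2, 8.4]
print(f"PLCC = {plcc(model_scores, human_scores):.3f}")
```

A value of 1 means the predictions are a perfect increasing linear function of the human scores, so the gap between 0.908 and 0.942 reflects a closer linear fit.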

XLD: A Cross-Lane Dataset for Benchmarking Novel Driving View Synthesis

Jun 27, 2024

Hao Li, Ming Yuan, Yan Zhang, Chenming Wu, Chen Zhao, Chunyu Song, Haocheng Feng, Errui Ding, Dingwen Zhang, Jingdong Wang

Abstract: Thoroughly testing autonomy systems is crucial in the pursuit of safe autonomous driving vehicles. It necessitates creating safety-critical scenarios that go beyond what can be safely collected from real-world data, as many of these scenarios occur infrequently on public roads. However, the evaluation of most existing novel view synthesis (NVS) methods relies on sporadic sampling of image frames from the training data, comparing the rendered images with ground truth images using metrics. Unfortunately, this evaluation protocol falls short of meeting the actual requirements in closed-loop simulations. Specifically, the true application demands the capability to render novel views that extend beyond the original trajectory (such as cross-lane views), which are challenging to capture in the real world. To address this, this paper presents a novel driving view synthesis dataset and benchmark specifically designed for autonomous driving simulations. This dataset is unique as it includes testing images captured by deviating from the training trajectory by 1-4 meters. It comprises six sequences encompassing various time and weather conditions. Each sequence contains 450 training images, 150 testing images, and their corresponding camera poses and intrinsic parameters. Leveraging this novel dataset, we establish the first realistic benchmark for evaluating existing NVS approaches under front-only and multi-camera settings. The experimental findings underscore the significant gap that exists in current approaches, revealing their inadequate ability to fulfill the demanding prerequisites of cross-lane or closed-loop simulation. Our dataset is released publicly at the project page: https://3d-aigc.github.io/XLD/.

* project page: https://3d-aigc.github.io/XLD/
