Abstract:Recent image generation and editing models demonstrate robust adherence to instructions and high visual quality on academic benchmarks. However, their performance on paid, real-world design projects remains uncertain. We introduce \textbf{ServImage}, a benchmark that explicitly correlates model outputs with economic value in commercial design projects. ServImage consists of (i) \textbf{\textit{ServImageBench}}: a dataset of 1.07k paid commercial design tasks and 2.05k designer deliverables totaling over \$295k, covering portrait, product, and digital content, along with 33k candidate images and 33k human annotations. (ii) \textbf{\textit{ServImageScore}}: an integrated scoring system that combines three quality dimensions: baseline requirements fulfilment, visual execution quality, and commercial necessity satisfaction. These three dimensions are designed to characterize the factors that drive human payment decisions and indicate whether an image is commercially acceptable. (iii) \textbf{\textit{ServImageModel}}: under this scoring system, we propose a payment prediction model trained on the human-annotated candidate images, achieving 82.00\% accuracy in predicting human payment decisions and producing calibrated payment probabilities. ServImage provides a comprehensive foundation for assessing the commercial viability of image generation models and offers a scalable resource for future research on economically grounded vision systems \href{https://github.com/FengxianJi/ServImage}{Github.}
Abstract:Chinese Spelling Correction (CSC) aims to detect and correct erroneous tokens in sentences. While Large Language Models (LLMs) have shown remarkable success in identifying and rectifying potential errors, they often struggle with maintaining consistent output lengths and adapting to domain-specific corrections. Furthermore, existing CSC task impose rigid constraints requiring input and output lengths to be identical, limiting their applicability. In this work, we extend traditional CSC to variable-length correction scenarios, including Chinese Splitting Error Correction (CSEC) and ASR N-best Error Correction. To address domain adaptation and length consistency, we propose MTCSC (Multi-Turn CSC) framework based on RAG enhanced with a length reflection mechanism. Our approach constructs a retrieval database from domain-specific training data and dictionaries, fine-tuning retrievers to optimize performance for error-containing inputs. Additionally, we introduce a multi-source combination strategy with iterative length reflection to ensure output length fidelity. Experiments across diverse domain datasets demonstrate that our method significantly outperforms current approaches in correction quality, particularly in handling domain-specific and variable-length error correction tasks.
Abstract:ASR correction methods have predominantly focused on general datasets and have not effectively utilized Pinyin information, unique to the Chinese language. In this study, we address this gap by proposing a Pinyin Enhanced Rephrasing Language Model (PERL), specifically designed for N-best correction scenarios. Additionally, we implement a length predictor module to address the variable-length problem. We conduct experiments on the Aishell-1 dataset and our newly proposed DoAD dataset. The results show that our approach outperforms baseline methods, achieving a 29.11% reduction in Character Error Rate (CER) on Aishell-1 and around 70% CER reduction on domain-specific datasets. Furthermore, our approach leverages Pinyin similarity at the token level, providing an advantage over baselines and leading to superior performance.
Abstract:Current Grammar Error Correction (GEC) initiatives tend to focus on major languages, with less attention given to low-resource languages like Esperanto. In this article, we begin to bridge this gap by first conducting a comprehensive frequency analysis using the Eo-GP dataset, created explicitly for this purpose. We then introduce the Eo-GEC dataset, derived from authentic user cases and annotated with fine-grained linguistic details for error identification. Leveraging GPT-3.5 and GPT-4, our experiments show that GPT-4 outperforms GPT-3.5 in both automated and human evaluations, highlighting its efficacy in addressing Esperanto's grammatical peculiarities and illustrating the potential of advanced language models to enhance GEC strategies for less commonly studied languages.