Abstract: How should we quantify the value of each training example when datasets are large, heterogeneous, and geometrically structured? Classical Data Shapley answers in principle, but its O(n!) complexity and point-wise perspective are ill-suited to modern scales. We propose Hierarchical Contrastive Data Valuation (HCDV), a three-stage framework that (i) learns a contrastive, geometry-preserving representation, (ii) organizes the data into a balanced coarse-to-fine hierarchy of clusters, and (iii) assigns Shapley-style payoffs to coalitions via local Monte-Carlo games whose budgets are propagated downward. HCDV collapses the factorial burden to O(T Σ_l K_l) = O(T K_max log n), rewards examples that sharpen decision boundaries, and regularizes outliers through curvature-based smoothness. We prove that HCDV approximately satisfies the four Shapley axioms with surplus loss O(η log n), enjoys sub-Gaussian coalition deviation of order Õ(1/√T), and incurs at most k·ε_∞ regret for top-k selection. Experiments on four benchmarks (tabular, vision, streaming, and a 45M-sample CTR task) plus the OpenDataVal suite show that HCDV lifts accuracy by up to +5 pp, cuts valuation time by up to 100×, and directly supports tasks such as augmentation filtering, low-latency streaming updates, and fair marketplace payouts.
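
To make the coalition games concrete, here is a minimal sketch of a local Monte-Carlo Shapley game of the kind HCDV plays over the clusters at one hierarchy level; the utility callback, cluster ids, and per-level budget T below are illustrative assumptions, and in the full framework the resulting payoffs would be propagated downward as budgets for the child clusters' games.

    import random

    def mc_shapley(clusters, utility, T):
        # Monte-Carlo Shapley payoffs for the "players" (clusters) at one
        # hierarchy level. `utility` maps a set of cluster ids to a coalition
        # score, e.g. validation accuracy after training on their union;
        # `T` is the sampled-permutation budget for this level (assumed API).
        phi = {c: 0.0 for c in clusters}
        for _ in range(T):
            perm = random.sample(clusters, len(clusters))
            coalition, prev = set(), utility(frozenset())
            for c in perm:
                coalition.add(c)
                cur = utility(frozenset(coalition))
                phi[c] += (cur - prev) / T  # running average of marginal contributions
                prev = cur
        return phi

Because each level plays only over its K_l clusters with budget T, the total number of utility evaluations is O(T Σ_l K_l), rather than the factorial cost of exact point-wise Shapley values.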

Abstract: Recent advancements in large language models (LLMs) demonstrate exceptional Chinese text processing capabilities, particularly in Chinese Spelling Correction (CSC). While LLMs outperform traditional BERT-based models in accuracy and robustness, challenges persist in reliability and generalization. This paper proposes CEC-Zero, a novel reinforcement learning (RL) framework that enables LLMs to self-correct by autonomously learning error strategies, without external supervision. By integrating RL with LLMs' generative power, the method eliminates the dependency on annotated data or auxiliary models. Experiments show that RL-enhanced LLMs achieve industry-viable accuracy and superior cross-domain generalization, offering a scalable solution for reliability optimization in Chinese NLP applications. This breakthrough facilitates LLM deployment in practical Chinese text correction scenarios while establishing a new paradigm for self-improving language models.
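
The abstract does not spell out the reward design, but the sketch below shows one way an RL loop for spelling correction can run without annotated data: the model corrupts clean text itself and is rewarded for undoing the corruption. The confusion table, corruption rate, and binary round-trip reward are hypothetical stand-ins, not CEC-Zero's actual components.

    import random

    CONFUSIONS = {"的": "地", "在": "再", "做": "作"}  # toy confusable pairs (assumed)

    def corrupt(sentence, p=0.2):
        # Inject spelling errors by swapping characters for confusable ones.
        return "".join(
            CONFUSIONS[ch] if ch in CONFUSIONS and random.random() < p else ch
            for ch in sentence
        )

    def reward(original, correction):
        # Round-trip reward: 1 if the sampled correction recovers the clean
        # sentence, 0 otherwise; computable with no human annotation.
        return 1.0 if correction == original else 0.0

    # Schematic training step: sample corrections y_hat ~ LLM(corrupt(x)),
    # score them with reward(x, y_hat), and apply a policy-gradient update
    # (e.g. PPO); no labeled corrections or auxiliary models are required.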



Abstract: Graph embedding techniques have led to significant progress in recent years. However, existing techniques are not effective enough to capture the structural patterns of networks. This paper proposes neighbor2vec, an algorithm that uses a neighbor-based sampling strategy to learn neighborhood representations of nodes: a framework that gathers structural information by propagating features between each node and its neighbors. We claim that neighbor2vec is a simple and effective approach to enhancing both the scalability and the quality of graph embeddings, and that it breaks through the limits of existing state-of-the-art unsupervised techniques. We conduct experiments on several node classification and link prediction tasks for networks such as ogbn-arxiv, ogbn-products, ogbn-proteins, ogbl-ppa, ogbl-collab, and ogbl-citation2. The results show that neighbor2vec's representations achieve accuracy scores up to 6.8 percent higher than competing methods on node classification tasks and 3.0 percent higher on link prediction tasks. neighbor2vec's representations outperform all baseline methods and two classical GNN models across all six experiments.
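
As a rough illustration of the sampling-plus-propagation idea, the sketch below builds node representations by averaging features over a sampled neighborhood at each hop and concatenating across hops; the sample size, mean aggregator, and hop count are assumptions, since the abstract does not fix these details.

    import random
    import numpy as np

    def neighbor2vec_features(adj, feats, num_samples=10, hops=2):
        # adj: dict mapping node id (0..n-1) -> list of neighbor ids.
        # feats: (n, d) array of initial node features.
        # Each hop propagates features from a sampled neighborhood into the
        # node, and the per-hop representations are concatenated at the end.
        h = np.asarray(feats, dtype=float)
        reps = [h]
        for _ in range(hops):
            agg = np.zeros_like(h)
            for v, nbrs in adj.items():
                if nbrs:
                    sampled = random.choices(nbrs, k=num_samples)  # with replacement
                    agg[v] = h[sampled].mean(axis=0)
            h = agg
            reps.append(h)
        return np.concatenate(reps, axis=1)

The concatenated representation can then be fed to any downstream classifier (e.g. logistic regression) for node classification or link prediction.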