Abstract:The growing prevalence of unauthorized model usage and misattribution has increased the need for reliable model provenance analysis. However, existing methods largely rely on heuristic fingerprint-matching rules that lack provable error control and often overlook the existence of multiple sources, leaving the reliability of their provenance claims unverified. In this work, we first formalize the model provenance problem with provable guarantees, requiring rigorous coverage of all true provenances at a prescribed confidence level. Then, we propose the Model Provenance Set (MPS), which employs a sequential test-and-exclusion procedure to adaptively construct a small set satisfying the guarantee. The key idea of MPS is to test the significance of provenance existence within a candidate pool, thereby establishing a provable asymptotic guarantee at a user-specific confidence level. Extensive experiments demonstrate that MPS effectively achieves target provenance coverage while strictly limiting the inclusion of unrelated models, and further reveal its potential for practical provenance analysis in attribution and auditing tasks.




Abstract:Natural Language Counterfactual generation aims to minimally modify a given text such that the modified text will be classified into a different class. The generated counterfactuals provide insight into the reasoning behind a model's predictions by highlighting which words significantly influence the outcomes. Additionally, they can be used to detect model fairness issues or augment the training data to enhance the model's robustness. A substantial amount of research has been conducted to generate counterfactuals for various NLP tasks, employing different models and methodologies. With the rapid growth of studies in this field, a systematic review is crucial to guide future researchers and developers. To bridge this gap, this survey comprehensively overview textual counterfactual generation methods, particularly including those based on Large Language Models. We propose a new taxonomy that categorizes the generation methods into four groups and systematically summarize the metrics for evaluating the generation quality. Finally, we discuss ongoing research challenges and outline promising directions for future work.




Abstract:Counterfactually Augmented Data (CAD) involves creating new data samples by applying minimal yet sufficient modifications to flip the label of existing data samples to other classes. Training with CAD enhances model robustness against spurious features that happen to correlate with labels by spreading the casual relationships across different classes. Yet, recent research reveals that training with CAD may lead models to overly focus on modified features while ignoring other important contextual information, inadvertently introducing biases that may impair performance on out-ofdistribution (OOD) datasets. To mitigate this issue, we employ contrastive learning to promote global feature alignment in addition to learning counterfactual clues. We theoretically prove that contrastive loss can encourage models to leverage a broader range of features beyond those modified ones. Comprehensive experiments on two human-edited CAD datasets demonstrate that our proposed method outperforms the state-of-the-art on OOD datasets.