Abstract:Large language models are emerging as scientific assistants, but evaluating their ability to reason from empirical data remains challenging. Benchmarks derived from published studies and human annotations inherit publication bias, known-knowledge bias, label noise, and substantial storage requirements. We present InfiniteScienceGym, a procedurally generated benchmark of scientific repositories paired with a verifiable question-answering task. From a seed, the simulator deterministically generates a self-contained repository with realistic directory structure, files, and tabular data, and a privileged QA generator produces both answerable and unanswerable questions with exact ground truth. This makes it possible to evaluate evidence-grounded reasoning, abstention, and tool-mediated analysis in a controlled setting without distributing a large static corpus. InfiniteScienceGym complements real scientific benchmarks by targeting blind spots and failure modes that are hard to evaluate using published datasets alone. Evaluating both proprietary and open-weight models, we find that none achieve more than 45% accuracy overall, that recognizing unanswerable questions remains a major weakness, and that stronger models tend to use tools more effectively rather than simply consuming more tokens.




Abstract:Large language models (LLMs) are increasingly deployed in real-world scenarios with the help of recent model compression techniques. Such momentum towards local deployment means the use of compressed LLMs will widely impact a large population. However, prior analysis works often prioritize on preserving perplexity which is a direct analogy to training loss. The impact of compression method on other critical aspects of model behavior, particularly safety, still calls for a systematic assessment. To this end, we investigate the impact of model compression on four dimensions: (1) degeneration harm, i.e., bias and toxicity in generation; (2) representational harm, i.e., biases in discriminative tasks; (3) dialect bias; (4) language modeling and downstream task performance. We cover a wide spectrum of LLM compression techniques, including unstructured pruning, semi-structured pruning and quantization. Our analysis reveals that compression can lead to unexpected consequences. Although compression may unintentionally remedy LLMs' degeneration harm, it can still exacerbate on the representational harm axis. Although compression may unintentionally remedy LLMs' degeneration harm, it can still exacerbate on the representational harm axis. Moreover, there is a divergent impact on different protected groups as the compression rate grows. Finally, different compression methods have drastically different safety impacts, e.g., quantization mostly preserves bias while pruning degrades quickly. Our findings underscore the importance of integrating safety assessments into the development of compressed LLMs to ensure their reliability across real-world applications. Our full results are available here: \url{https://github.com/zhichaoxu-shufe/Beyond-Perplexity-Compression-Safety-Eval}




Abstract:Understanding the extent to which Chain-of-Thought (CoT) generations align with a large language model's (LLM) internal computations is critical for deciding whether to trust an LLM's output. As a proxy for CoT faithfulness, arXiv:2307.13702 propose a metric that measures a model's dependence on its CoT for producing an answer. Within a single family of proprietary models, they find that LLMs exhibit a scaling-then-inverse-scaling relationship between model size and their measure of faithfulness, and that a 13 billion parameter model exhibits increased faithfulness compared to models ranging from 810 million to 175 billion parameters in size. We evaluate whether these results generalize as a property of all LLMs. We replicate their experimental setup with three different families of models and, under specific conditions, successfully reproduce the scaling trends for CoT faithfulness they report. However, we discover that simply changing the order of answer choices in the prompt can reduce the metric by 73 percentage points. The faithfulness metric is also highly correlated ($R^2$ = 0.91) with accuracy, raising doubts about its validity as a construct for evaluating faithfulness.