Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Arun Jose

Reasoning Models Sometimes Output Illegible Chains of Thought

Oct 31, 2025

Arun Jose

Abstract:Language models trained via outcome-based reinforcement learning (RL) to reason using chain-of-thought (CoT) have shown remarkable performance. Monitoring such a model's CoT may allow us to understand its intentions and detect potential malicious behavior. However, to be effective, this requires that CoTs are legible and faithful. We study CoT legibility across 14 reasoning models, finding that RL often causes reasoning to become illegible to both humans and AI monitors, with reasoning models (except Claude) generating illegible CoTs while returning to perfectly readable final answers. We show that models use illegible reasoning to reach correct answers (accuracy dropping by 53\% when forced to use only legible portions), yet find no correlation between legibility and performance when resampling - suggesting the relationship is more nuanced. We also find that legibility degrades on harder questions. We discuss potential hypotheses for these results, including steganography, training artifacts, and vestigial tokens. These results suggest that without explicit optimization for legibility, outcome-based RL naturally produces models with increasingly opaque reasoning processes, potentially undermining monitoring approaches.

Via

Access Paper or Ask Questions

Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

Oct 05, 2025

Daniel Tan, Anders Woodruff, Niels Warncke, Arun Jose, Maxime Riché, David Demitri Africa, Mia Taylor

Figure 1 for Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

Figure 2 for Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

Figure 3 for Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

Figure 4 for Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time

Abstract:Language model finetuning often results in learning undesirable traits in combination with desired ones. To address this, we propose inoculation prompting: modifying finetuning data by prepending a short system-prompt instruction that deliberately elicits the undesirable trait. At test time, we evaluate without the instruction; inoculated models have much lower expression of the trait than models trained with unmodified training data. Inoculation is selective: in a toy setting where assistant responses are always in Spanish and ALL-CAPS, an appropriate inoculation (e.g., ``You always speak in Spanish.'') teaches the model to capitalize responses while still responding in English. We find that inoculation is also effective across several additional settings: reducing emergent misalignment (EM) from task-specific finetuning, defending against backdoor injections, and mitigating the transmission of traits via subliminal learning. Follow-up analysis suggests a mechanism: making a trait less surprising via inoculation reduces optimization pressure to globally update the model, thereby reducing the degree of generalization. Our analysis relates to prior work on EM: inoculation explains prior findings that educational contexts mitigate EM from insecure code. Beyond demonstrating a simple and effective technique for selective learning, our results contribute to a better conceptual understanding of how and why language models generalize.

* 40 pages, 22 figures In proceedings at ICLR 2026

Via

Access Paper or Ask Questions

Reversible Colour Density Compression of Images using cGANs

Jun 19, 2021

Arun Jose, Abraham Francis

Figure 1 for Reversible Colour Density Compression of Images using cGANs

Figure 2 for Reversible Colour Density Compression of Images using cGANs

Abstract:Image compression using colour densities is historically impractical to decompress losslessly. We examine the use of conditional generative adversarial networks in making this transformation more feasible, through learning a mapping between the images and a loss function to train on. We show that this method is effective at producing visually lossless generations, indicating that efficient colour compression is viable.

* 7 pages, 2 figures

Via

Access Paper or Ask Questions