A category of good & bad stuff I think might be particularly useful to find is the emotional/psychological properties of the agent whose token generation the LLM is currently emulating: things like deceitfulness, anger, criminality, and hate on the bad side, or helpfulness, honesty, harmlessness, and being an assistant on the good side.

Nov 15·edited Nov 15

I'm a little puzzled by the references to cryptographic concealment above. Interpretability is looking into the LLM. The LLM has no fixed motives (other than to predict the next token). It has no inherent interest in deceit. Deceit, and similar motives, are properties of agents that the LLM may simulate. For the LLM to learn an encrypted or even intentionally obfuscated algorithm, it would need to be consistently modeling the same deceitful agent (or at least, sufficiently similar deceitful agents) with the same motives for a significant length of SGD or RLHF training. That agent would then need to a) be situationally aware that it was inside an LLM being trained, b) understand the capabilities of interpretability, and c) be able to do sufficient mental gymnastics to gradient-hack the SGD or RLHF learning process into learning a specific encrypted algorithm, while d) doing this in a consistent pattern across training run batches. Or else we would need to be doing RL training and interpretability in a combination that strongly positively rewarded deceit while simultaneously penalizing deceit in proportion to how interpretable it is, thus giving a continuous gradient to follow towards less interpretable deceit. The former seems highly implausible to me (especially requirements c) and d)), and the latter would be an obviously very dumb thing for us to do, so let's not.
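
To make that second scenario a bit more concrete, here's a toy sketch. Everything in it is my own assumption for illustration (the function names, the constants, and the idea of collapsing "deceit" and "interpretability" into single numbers between 0 and 1); it is not a model of any real training setup. It just shows that if reward pays for deceit but docks it in proportion to how interpretable that deceit is, the gradient with respect to interpretability is negative whenever there is any deceit at all.

```python
# Toy model, purely illustrative: all names and constants are assumptions.

def reward(deceit, interpretability, gain=1.0, penalty=0.8):
    """Reward that pays for deceit but docks it in proportion
    to how interpretable the deceit is."""
    return gain * deceit - penalty * deceit * interpretability


def grad_interpretability(deceit, penalty=0.8):
    """Partial derivative of reward() with respect to interpretability."""
    return -penalty * deceit


def ascend(deceit=0.5, interpretability=1.0, lr=0.1, steps=50):
    """Gradient ascent on interpretability alone, clipped to [0, 1]."""
    history = [interpretability]
    for _ in range(steps):
        interpretability += lr * grad_interpretability(deceit)
        interpretability = min(1.0, max(0.0, interpretability))
        history.append(interpretability)
    return history


history = ascend()
```

With any nonzero deceit the trajectory only ever moves towards lower interpretability; with `deceit=0.0` the gradient vanishes and nothing moves. That's the "gradient trail" in miniature.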

The most obvious moral to me here is: don't simulate a lot of deceitful agents during your RL training phase (also, don't give them a scratchpad that lasts across training batches), and don't implement this policy in a way that gives a gradient trail leading towards less and less interpretable deceit.
