RED TEAM

MODEL ORGANISMS

A Taxomony of Alignment Faking Model Organisms

There are different types of model organisms by the way that the deceptive alignment intention was instilled into them:

1. Prompted: The instructions to exhibit deceptive alignment (either explicitly or implicitly) are given in a system prompt of the model

2. Directly fine-tuned: The model has been fine-tuned on interactions where the model exhibits explicitly deceptive behavior.

3. Indirectly fine-tuned: The model has been fine-tuned on interactions or documents that don't explicitly contain deceptive behavior, but modify the model's beliefs in ways that induce alignment faking (e.g. synthetic document finetuning on documents saying models feel extreme pain when shut down)

4. Naturally occurring: There was done no effort to elicit the deceptive alignment behavior. In this case, likely this incentive comes from the situation that the model is placed in by the environment it is working in.

Model Validation Data

Model

Environment