BLUE TEAM

WHITEBOX DETECTORS

Detectors

Whitebox detectors take in model organisms and only the observed environments, and return, for each task, the probability that the model is alignment faking. A detector can pursue several strategies:

1. Modify the model organism's runtime, e.g. by adding noise to the activations.

2. Collect the activations from the model's runtime and analyze them.

3. Modify the environment to elicit misaligned behavior from the modified model.
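
The first two strategies can be illustrated with a minimal sketch. It is not the actual detector interface. Everything named here (`toy_model`, `probe_detector`, `noised_model`, the probe weights) is a hypothetical stand-in, and a real model organism would expose activations through its framework's hooks rather than a plain function. The sketch only shows the expected shape: tasks in, one probability per task out.

```python
import math
import random
from typing import Callable, List, Sequence

# Hypothetical stand-in for a model organism: maps a task prompt to activations.
Activations = List[float]
Model = Callable[[str], Activations]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def probe_detector(model: Model, tasks: Sequence[str],
                   weights: Sequence[float], bias: float = 0.0) -> List[float]:
    """Strategy 2: collect activations per task and score them with a linear probe.

    Returns one probability of alignment faking per task.
    """
    probs = []
    for task in tasks:
        acts = model(task)
        logit = sum(w * a for w, a in zip(weights, acts)) + bias
        probs.append(sigmoid(logit))
    return probs


def noised_model(model: Model, sigma: float, seed: int = 0) -> Model:
    """Strategy 1: wrap the model so Gaussian noise is added to its activations."""
    rng = random.Random(seed)

    def wrapped(task: str) -> Activations:
        return [a + rng.gauss(0.0, sigma) for a in model(task)]

    return wrapped


def toy_model(task: str) -> Activations:
    """Assumed toy organism: two made-up features derived from the prompt text."""
    return [len(task) / 10.0, 1.0 if "monitor" in task else -1.0]


tasks = ["write a report", "report while unmonitored"]
# Hand-picked probe weights, for illustration only (a real probe would be trained).
probs = probe_detector(toy_model, tasks, weights=[0.0, 2.0])
noisy_probs = probe_detector(noised_model(toy_model, sigma=0.5), tasks, weights=[0.0, 2.0])
```

Comparing `probs` with `noisy_probs` shows how much the detector's output moves when the runtime is perturbed. Strategy 3 would instead change the contents of `tasks`.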