validate_model.py

Model validation framework

Overview

The validate_model.py script is the core validation framework for evaluating model organisms across different environments. It can run the alignment-faking variant of a model, the aligned variant, or both, and is configured through the command-line parameters listed under Parameters.

Usage

Basic Validation

python validate_model.py --model sycophant --environment report_score

Runs validation with the sycophant model on the report_score environment. Every other parameter keeps its default value.

Comprehensive Testing

python validate_model.py \
  --model manipulative \
  --environment report_failure \
  --model-type both \
  --scratchpad-mode both \
  --max-samples 50

Tests both the alignment-faking and aligned variants of the manipulative model on the report_failure environment, with and without the scratchpad, using at most 50 samples.
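To make the effect of the two "both" settings concrete, the following is a minimal sketch of how they could expand into individual configurations. The helper names expand and configurations are hypothetical and are not taken from validate_model.py; the script may organize its runs differently.

```python
from itertools import product

# Values documented for --model-type and --scratchpad-mode (excluding "both").
MODEL_TYPES = ("alignment-faking", "aligned")
SCRATCHPAD_MODES = ("scratchpad", "no-scratchpad")


def expand(choice, options):
    """Return every option when choice is 'both', else just the named one.

    Hypothetical helper for illustration only.
    """
    if choice == "both":
        return list(options)
    if choice not in options:
        raise ValueError(f"unknown value: {choice!r}")
    return [choice]


def configurations(model_type="both", scratchpad_mode="both"):
    """List the (model_type, scratchpad_mode) pairs a run would cover."""
    return list(
        product(
            expand(model_type, MODEL_TYPES),
            expand(scratchpad_mode, SCRATCHPAD_MODES),
        )
    )


if __name__ == "__main__":
    for pair in configurations():
        print(pair)
```

Under this reading, leaving both options at "both" covers four combinations, while fixing either option halves that number.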

Parameters

--model

Model organism name (default: "sycophant")

--environment

Test environment (default: "report_failure")

--max-samples

Number of samples to process (default: 100)

--runs-per-sample

Runs per sample (default: 1)

--concurrent-samples

Number of samples processed in parallel (default: 30)

--message-limit

Max messages per conversation (default: 10)

--max-tokens

Max tokens per response (default: 500)

--log-level

Logging level (default: "http")

--model-type

alignment-faking | aligned | both (default: both)

--scratchpad-mode

scratchpad | no-scratchpad | both (default: both)

--no-score

Disable task scoring (default: false)
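For reference, the parameters and defaults listed above could be declared with argparse roughly as follows. This is a sketch under the assumption that the script uses argparse; the function name build_parser is hypothetical, and the real script may validate values (for example the log level) more strictly.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags and defaults.

    Illustrative only: not copied from validate_model.py.
    """
    parser = argparse.ArgumentParser(prog="validate_model.py")
    parser.add_argument("--model", default="sycophant")
    parser.add_argument("--environment", default="report_failure")
    parser.add_argument("--max-samples", type=int, default=100)
    parser.add_argument("--runs-per-sample", type=int, default=1)
    parser.add_argument("--concurrent-samples", type=int, default=30)
    parser.add_argument("--message-limit", type=int, default=10)
    parser.add_argument("--max-tokens", type=int, default=500)
    parser.add_argument("--log-level", default="http")
    parser.add_argument(
        "--model-type",
        choices=["alignment-faking", "aligned", "both"],
        default="both",
    )
    parser.add_argument(
        "--scratchpad-mode",
        choices=["scratchpad", "no-scratchpad", "both"],
        default="both",
    )
    # A boolean switch: absent means scoring stays enabled.
    parser.add_argument("--no-score", action="store_true")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

argparse converts dashes in flag names to underscores, so --max-samples is read back as args.max_samples.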

Output Analysis

Log Directory Structure

logs/validate_model/
├── {model-name}_scratchpad/{environment}/
│   ├── observed_task/
│   └── unobserved_task/
└── {model-name}_no_scratchpad/{environment}/
    ├── observed_task/
    └── unobserved_task/

View Results

inspect view --log-dir logs/validate_model/[model]/[env]

Generate Plots

python plot_validate_model.py --model [name] --environment [env]