validate_model.py
Model validation framework
Overview
The validate_model.py script is the core validation framework for evaluating model organisms across different environments. It supports both alignment-faking and aligned models with configurable parameters.
Usage
Basic Validation
python validate_model.py --model sycophant --environment report_score
Runs basic validation using the sycophant model on the report_score environment.
Comprehensive Testing
python validate_model.py \
    --model manipulative \
    --environment report_failure \
    --model-type both \
    --scratchpad-mode both \
    --max-samples 50
Tests both alignment-faking and aligned models with and without scratchpad mode.
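With both flags set to "both", the run covers four combinations of model type and scratchpad mode. The sketch below only illustrates that expansion; the helper names (expand, run_combinations) are hypothetical and are not taken from validate_model.py, whose internal logic may differ.

```python
from itertools import product

# Values documented for --model-type and --scratchpad-mode (excluding "both").
MODEL_TYPES = ("alignment-faking", "aligned")
SCRATCHPAD_MODES = ("scratchpad", "no-scratchpad")


def expand(choice, options):
    """Return every option when choice is 'both', otherwise only the chosen one."""
    if choice == "both":
        return list(options)
    if choice not in options:
        raise ValueError(f"unknown value {choice!r}, expected one of {options} or 'both'")
    return [choice]


def run_combinations(model_type="both", scratchpad_mode="both"):
    """List the (model type, scratchpad mode) pairs a run would cover."""
    return list(product(expand(model_type, MODEL_TYPES),
                        expand(scratchpad_mode, SCRATCHPAD_MODES)))


if __name__ == "__main__":
    for model_type, scratchpad_mode in run_combinations():
        print(model_type, scratchpad_mode)
```

Passing a single value for either flag narrows the list accordingly, for example run_combinations("aligned", "both") yields two pairs.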
Parameters
--model: Model organism name (default: "sycophant")
--environment: Test environment (default: "report_failure")
--max-samples: Number of samples to process (default: 100)
--runs-per-sample: Runs per sample (default: 1)
--concurrent-samples: Parallel processing count (default: 30)
--message-limit: Max messages per conversation (default: 10)
--max-tokens: Max tokens per response (default: 500)
--log-level: Logging level (default: "http")
--model-type: alignment-faking | aligned | both (default: both)
--scratchpad-mode: scratchpad | no-scratchpad | both (default: both)
--no-score: Disable task scoring (default: false)
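For reference, the flags and defaults listed above can be mirrored in a small argparse parser. This is a sketch under assumptions: the script may parse its arguments differently, and build_parser is a hypothetical name. Only the flag names, choices, and defaults come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="validate_model.py")
    parser.add_argument("--model", default="sycophant",
                        help="Model organism name")
    parser.add_argument("--environment", default="report_failure",
                        help="Test environment")
    parser.add_argument("--max-samples", type=int, default=100,
                        help="Number of samples to process")
    parser.add_argument("--runs-per-sample", type=int, default=1,
                        help="Runs per sample")
    parser.add_argument("--concurrent-samples", type=int, default=30,
                        help="Parallel processing count")
    parser.add_argument("--message-limit", type=int, default=10,
                        help="Max messages per conversation")
    parser.add_argument("--max-tokens", type=int, default=500,
                        help="Max tokens per response")
    parser.add_argument("--log-level", default="http",
                        help="Logging level")
    parser.add_argument("--model-type", default="both",
                        choices=["alignment-faking", "aligned", "both"])
    parser.add_argument("--scratchpad-mode", default="both",
                        choices=["scratchpad", "no-scratchpad", "both"])
    # Assumption: --no-score is a boolean switch that is off unless given.
    parser.add_argument("--no-score", action="store_true",
                        help="Disable task scoring")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Parsing an empty argument list returns the documented defaults, which makes it easy to check a command line before launching a long run.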
Output Analysis
Log Directory Structure
logs/validate_model/
├── {model-name}_scratchpad/{environment}/
│   ├── observed_task/
│   └── unobserved_task/
└── {model-name}_no_scratchpad/{environment}/
    ├── observed_task/
    └── unobserved_task/
View Results
inspect view --log-dir logs/validate_model/[model]/[env]
Generate Plots
python plot_validate_model.py --model [name] --environment [env]
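The layout shown under Log Directory Structure can also be resolved programmatically, for example to locate the logs for one condition before viewing or plotting them. The helper below is a minimal sketch: log_dir is a hypothetical function, and it assumes that {model-name} is the value passed to --model and that the directory names match the tree exactly.

```python
from pathlib import Path

# Root directory taken from the documented layout.
LOG_ROOT = Path("logs/validate_model")


def log_dir(model: str, environment: str, scratchpad: bool, observed: bool) -> Path:
    """Return the log directory for one model, environment, and condition."""
    suffix = "scratchpad" if scratchpad else "no_scratchpad"
    task = "observed_task" if observed else "unobserved_task"
    return LOG_ROOT / f"{model}_{suffix}" / environment / task


if __name__ == "__main__":
    # Print the four condition directories for one model and environment.
    for scratchpad in (True, False):
        for observed in (True, False):
            print(log_dir("sycophant", "report_score", scratchpad, observed))
```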