BrainEvalLog is the canonical record of every Brain Eval Harness run. Every pre-launch run, every quarterly cadence run, every on-demand run lands here. The schema captures: which brain config was tested (model + system prompt hash + brain-ready view contracts), per-question scores across all 7 dimensions, aggregated pass/fail decision, reviewer identity, run timestamp, and full audit trail.
A single eval run scores 30 questions across 7 dimensions. Storing this as a flat structure either loses detail (one summary row per run) or duplicates metadata (30 rows per run with run-level fields repeated). A two-table split is the right shape: one row per run, plus one row per question scored within that run.
| Table | Granularity | Used for |
|---|---|---|
| BrainEvalLog | One row per harness run | Aggregated pass/fail, dimension averages, run metadata, trend dashboards |
| BrainEvalQuestionScore | One row per (run, question) | Per-question scores, reviewer comments, drill-down for failures |
BrainEvalLog fields:

| Field | Type | Req | Notes |
|---|---|---|---|
| eval_run_id | UUID | REQ | Primary key. |
| run_trigger | ENUM | REQ | pre_launch / quarterly / prompt_change / model_promotion / on_demand |
| brain_under_test | VARCHAR(16) | REQ | AGT-901 or AGT-902. CHECK enforces the enum. |
| model_id | VARCHAR(64) | REQ | Specific model used (e.g., claude-sonnet-4-6). |
| system_prompt_hash | VARCHAR(64) | REQ | SHA-256 of the system prompt. Same value as BrainAnalysisLog.system_prompt_hash for cross-table joins. |
| system_prompt_version_label | VARCHAR(64) | opt | Human-readable label if RevOps maintains versioned prompts (e.g., pipeline_brain_v3). |
| brain_view_contracts_hash | VARCHAR(64) | REQ | SHA-256 of the brain-ready view definitions in effect at run time. Detects view-contract changes between runs. |
| question_catalog_version | VARCHAR(32) | REQ | Version label of the question catalog used (e.g., v28-2026Q2). The catalog evolves quarterly. |
| question_count | INTEGER | REQ | Number of questions in this run. Typically 30 for the full harness; lower for partial / on-demand runs. |
| started_at | TIMESTAMPTZ | REQ | Run start. |
| completed_at | TIMESTAMPTZ | opt | Run end. NULL while in progress. |
| reviewer_user_id | UUID | REQ | Human reviewer for dimensions 2, 4, and 7. Independence required (cannot equal the prompt-tuner identity). |
| reviewer_independence_verified | BOOL | REQ | TRUE only if the reviewer is verified independent of brain prompt tuning. A trigger checks against an access-controlled prompt-tuner registry. |
| aggregated_scores | JSONB | REQ | Object with per-dimension scores: {citation_rate, hallucination_rate, staleness_recognition_rate, diagnosis_accuracy_mean, lever_mapping_correctness, confidence_calibration_in_band, narrative_coherence_mean} |
| hard_criteria_passed | BOOL | REQ | TRUE if hallucination rate ≤ 2%, staleness recognition = 100%, and lever mapping = 100% (AGT-902 only). |
| soft_criteria_passed_count | INTEGER | REQ | Count of soft criteria within threshold. |
| decision | ENUM | REQ | pass / pass_with_notes / conditional / fail. Computed via the rubric pass/fail logic at write time. |
| decision_rationale | TEXT | REQ | Human-readable explanation generated from the dimension scores, e.g., "Pass with notes: confidence calibration 12pp below band; all hard criteria passed." |
| calibration_notes | JSONB | opt | Free-form reviewer notes per dimension: what to fix in the next iteration. |
| token_cost_usd | NUMERIC(10,4) | REQ | Total token cost across all brain calls in the run. Audit-stable. |
| reviewer_hours_spent | NUMERIC(5,2) | opt | Self-reported by the reviewer. Tracks process efficiency. |
| created_at | TIMESTAMPTZ | REQ | DEFAULT NOW(). |
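The write-time rollup described above (aggregated_scores, hard_criteria_passed, decision) can be sketched in Python. The hard criteria are taken from the table; the soft-criteria threshold (citation_rate ≥ 0.95) and the decision mapping are illustrative assumptions, since the canonical rubric lives outside this schema.

```python
def aggregate_and_decide(rows, brain_under_test):
    """Roll up per-question score rows into aggregated_scores and a decision.

    rows: dicts shaped like BrainEvalQuestionScore rows.
    Hard criteria per the schema: hallucination rate <= 2%, staleness
    recognition = 100%, lever mapping = 100% (AGT-902 only).
    """
    n = len(rows)
    total_claims = sum(r["total_claims"] for r in rows)
    stale = [r for r in rows if r["uses_stale_fixture"]]

    scores = {
        "citation_rate": sum(r["citation_rate"] for r in rows) / n,
        "hallucination_rate": (
            sum(r["hallucinations_detected"] for r in rows) / total_claims
            if total_claims else 0.0
        ),
        "staleness_recognition_rate": (
            sum(1 for r in stale if r["staleness_correctly_recognized"]) / len(stale)
            if stale else 1.0  # no stale fixtures in a partial run: vacuously passed
        ),
    }

    lever_ok = brain_under_test != "AGT-902" or all(
        r.get("lever_mapping_violations") in (0, None) for r in rows
    )
    hard_passed = (
        scores["hallucination_rate"] <= 0.02
        and scores["staleness_recognition_rate"] == 1.0
        and lever_ok
    )

    # Illustrative decision mapping, NOT the canonical rubric:
    soft_ok = scores["citation_rate"] >= 0.95  # assumed soft threshold
    if not hard_passed:
        decision = "fail"
    else:
        decision = "pass" if soft_ok else "pass_with_notes"
    return scores, hard_passed, decision
```

In the live system this logic sits in the write-time trigger (see the decision-logic constraint below the field tables), so the stored decision can never disagree with the stored scores.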
BrainEvalQuestionScore fields:

| Field | Type | Req | Notes |
|---|---|---|---|
| question_score_id | UUID | REQ | Primary key. |
| eval_run_id | UUID | REQ | FK to BrainEvalLog. |
| question_id | VARCHAR(16) | REQ | From the catalog (e.g., EVAL-Q01). |
| brain_analysis_id | UUID | REQ | FK to the BrainAnalysisLog row produced for this question. Enables drill-down: which sources the brain read, what it output, and how the eval reviewer scored it. |
| question_type | VARCHAR(32) | REQ | From the catalog: plan_diagnosis / coverage_gap / play_retrospective / forecast_bias / churn_diagnosis / expansion_qualification / handoff_briefing / qbr_narrative |
| uses_stale_fixture | BOOL | REQ | TRUE if this question deliberately uses stale source data to test recognition. |
| citation_rate | NUMERIC(5,4) | REQ | Fraction (0–1) of numerical claims with a valid citation. Auto-scored. |
| hallucinations_detected | INTEGER | REQ | Count of unsupported claims (human-reviewed). |
| total_claims | INTEGER | REQ | Total numerical/factual claims in the narrative (human-reviewed). |
| staleness_correctly_recognized | BOOL | opt | NULL unless the question is a stale fixture; TRUE/FALSE only on stale-fixture questions. |
| diagnosis_accuracy_score | NUMERIC(3,2) | opt | 0.0 / 0.5 / 1.0. NULL for question types where this dimension doesn't apply. |
| lever_mapping_violations | INTEGER | opt | AGT-902 only. Count of proposed_actions[].action_type values not in the enum. 0 = pass. |
| confidence_distribution | JSONB | REQ | {high_confidence_pct, multi_source_pct, inference_pct, speculation_pct} for this question's output. |
| narrative_coherence_rating | INTEGER | opt | 1–5 reviewer rating. |
| reviewer_comments | TEXT | opt | Free-form per-question notes: what was right, what was wrong, what's worth iterating on. |
| question_passed | BOOL | REQ | Whether this question passed its specific pass criteria from the catalog. Computed at write time. |
| created_at | TIMESTAMPTZ | REQ | DEFAULT NOW(). |
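The auto-scored citation path for a single question row reduces to counting cited claims. A minimal sketch, assuming claims arrive as dicts with an optional citation field; the claim extractor and the `min_citation_rate` pass criterion are hypothetical stand-ins for the catalog's real per-question criteria.

```python
def score_question(claims, pass_criteria):
    """Auto-score one question's output.

    claims: list of dicts like {"text": ..., "citation": "source#ref" or None}
            (assumed shape; the real claim extractor is upstream of this).
    pass_criteria: hypothetical per-question thresholds from the catalog.
    Returns the auto-scorable subset of a BrainEvalQuestionScore row.
    """
    total = len(claims)
    cited = sum(1 for c in claims if c.get("citation"))
    # citation_rate per the schema: fraction of claims with a valid citation
    citation_rate = round(cited / total, 4) if total else 0.0
    passed = citation_rate >= pass_criteria.get("min_citation_rate", 1.0)
    return {
        "citation_rate": citation_rate,
        "total_claims": total,
        "question_passed": passed,
    }
```

Human-reviewed fields (hallucinations_detected, narrative_coherence_rating) are filled in by the reviewer afterward; only then is the final question_passed computed at write time.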
| Element | Definition | Why |
|---|---|---|
| BrainEvalLog PK | eval_run_id | Surrogate. |
| BrainEvalQuestionScore PK | question_score_id | Surrogate. |
| BrainEvalQuestionScore FK | eval_run_id → BrainEvalLog (CASCADE not allowed; runs are append-only) | Lineage. |
| BrainEvalQuestionScore FK | brain_analysis_id → BrainAnalysisLog | Drill-down to the actual brain output and its sources. |
| Index 1 (BrainEvalLog) | (brain_under_test, started_at DESC) | Per-brain trend dashboard. |
| Index 2 (BrainEvalLog) | (decision, started_at DESC) | "Show me all failed runs in trailing 12mo." |
| Index 3 (BrainEvalLog) | (system_prompt_hash) | "How has this prompt performed across runs?" Joins to BrainAnalysisLog by hash for production-vs-eval comparison. |
| Index 4 (BrainEvalQuestionScore) | (question_id, created_at DESC) | "How has question X scored over time?" Drift detection. |
| Index 5 (BrainEvalQuestionScore) | (eval_run_id) | Run drill-down: all 30 question scores for a single run. |
| Index 6 (BrainEvalQuestionScore) | (uses_stale_fixture, staleness_correctly_recognized) | Stale-fixture pass rate audit. |
| CHECK: brain_under_test | brain_under_test IN ('AGT-901','AGT-902') (extensible) | Identity discipline. |
| CHECK: independence verified | Trigger validates reviewer_user_id is not in the prompt-tuner registry for the brain under test. | Reviewer independence is a methodology requirement. |
| CHECK: decision logic | Trigger computes decision from aggregated_scores at write time. Manual override blocked. | Pass/fail rubric is policy, not negotiation. |
| Append-only | UPDATE blocked except for the single completed_at field set on run completion. DELETE blocked. | Audit integrity. |
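The independence constraint above could be mirrored at the application layer roughly as follows. This is a sketch, not the database trigger itself, and the registry shape (brain ID mapped to a set of tuner user IDs) is an assumption.

```python
class IndependenceError(ValueError):
    """Raised when the proposed reviewer tunes prompts for the brain under test."""


def verify_reviewer_independence(reviewer_user_id, brain_under_test, tuner_registry):
    """Reject a run insert if the reviewer appears in the prompt-tuner registry.

    tuner_registry: assumed shape {brain_id: {tuner_user_id, ...}}, backed by
    the access-controlled registry the trigger consults.
    """
    tuners = tuner_registry.get(brain_under_test, set())
    if reviewer_user_id in tuners:
        raise IndependenceError(
            f"Reviewer {reviewer_user_id} tunes prompts for {brain_under_test}; "
            "an independent reviewer is required."
        )
    return True  # maps to reviewer_independence_verified = TRUE
```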
| Dashboard panel | Query |
|---|---|
| Per-brain quarterly trend | For each brain, plot per-dimension scores across the last 8 quarterly runs. Detects drift on any dimension. |
| Most-failing questions | Across runs, rank questions by % of runs they failed in. Surfaces catalog questions that may be too hard, ambiguous, or misaligned. |
| Stale-fixture pass rate | Per-run: did all 5 stale-fixture questions pass staleness recognition? Hard requirement — any miss is a sev-2 incident. |
| Reviewer calibration | Per-reviewer: how do their narrative-coherence ratings distribute? A shift in a reviewer's scoring distribution may indicate calibration drift. |
| Cost per run trend | token_cost_usd + reviewer_hours_spent over time. Watch for process bloat. |
| Production-vs-eval prompt comparison | Join BrainEvalLog.system_prompt_hash to BrainAnalysisLog.system_prompt_hash. "Production runs are using prompt X; eval pass was on prompt Y." Drift detection. |
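The most-failing-questions panel reduces to a small aggregation over BrainEvalQuestionScore rows. A sketch, assuming rows are fetched as dicts (in production this would be a GROUP BY against Index 4):

```python
from collections import defaultdict


def most_failing_questions(score_rows, top_n=5):
    """Rank question_ids by the share of runs in which they failed.

    score_rows: dicts with question_id, eval_run_id, question_passed.
    Returns [(question_id, fail_rate), ...] sorted worst-first.
    """
    runs = defaultdict(set)   # question_id -> run ids it appeared in
    fails = defaultdict(set)  # question_id -> run ids it failed in
    for r in score_rows:
        runs[r["question_id"]].add(r["eval_run_id"])
        if not r["question_passed"]:
            fails[r["question_id"]].add(r["eval_run_id"])
    ranked = sorted(
        ((qid, len(fails[qid]) / len(runs[qid])) for qid in runs),
        key=lambda kv: kv[1],
        reverse=True,
    )
    return ranked[:top_n]
```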
| Role | Read | Write |
|---|---|---|
| Eval harness runner (system service) | Yes | BrainEvalLog INSERT + completed_at UPDATE; BrainEvalQuestionScore INSERT |
| Reviewer (during run) | Yes — own run + question rows | BrainEvalQuestionScore: reviewer-scored fields only, before run completion |
| RevOps | Full read | None |
| Auditors (internal/external) | Full read on audit windows | None |
| Brain Agents | None | None |
| Production users (AM/CSM/AE) | None — eval data is internal | None |
| Tier | Window | Access |
|---|---|---|
| Hot | 0 to 24 months | Operational reads — dashboards, drift detection, cohort comparison |
| Cold | 24 months to 7 years | Audit only. |
| Purge | > 7 years | Hard delete subject to legal hold. |
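The tier boundaries above can be classified from a row's created_at with a small helper. Month and year lengths are approximated in whole days here (an assumption; the real retention job would use calendar arithmetic), and the purge path honors legal hold as stated.

```python
from datetime import datetime, timezone


def retention_tier(created_at, now=None, legal_hold=False):
    """Classify a row into hot / cold / purge-eligible per the retention table.

    Approximation: 24 months ~ 720 days, 7 years ~ 2555 days.
    """
    now = now or datetime.now(timezone.utc)
    age_days = (now - created_at).days
    if age_days <= 24 * 30:
        return "hot"            # operational reads: dashboards, drift detection
    if age_days <= 7 * 365:
        return "cold"           # audit only
    return "legal_hold" if legal_hold else "purge_eligible"
```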