BrainEvalLog — Production Schema

Layer 9: Reasoning · Canonical record of every Brain Eval Harness run · Append-only · 7-year retention · The CFO defense, written down
L9 · Schema · Production · v28 · Owns: BrainEvalLog, BrainEvalQuestionScore · RevOps + Data Engineering
Purpose

BrainEvalLog is the canonical record of every Brain Eval Harness run. Every pre-launch run, every quarterly cadence run, every on-demand run lands here. The schema captures: which brain config was tested (model + system prompt hash + brain-ready view contracts), per-question scores across all 7 dimensions, aggregated pass/fail decision, reviewer identity, run timestamp, and full audit trail.

This is the table the CFO eventually asks about. "How do you know the brain layer is working?" → "BrainEvalLog row from last quarter, here's the trend." Same auditability discipline as RevenueRecognitionLog and BrainAnalysisLog.
Why two tables — BrainEvalLog and BrainEvalQuestionScore

A single eval run scores 30 questions across 7 dimensions. Storing this as a flat structure either loses detail (one summary row per run) or duplicates metadata (30 rows per run with run-level fields repeated). Two tables — one row per run, plus one row per question scored within that run — is the right shape.

| Table | Granularity | Used for |
| --- | --- | --- |
| BrainEvalLog | One row per harness run | Aggregated pass/fail, dimension averages, run metadata, trend dashboards |
| BrainEvalQuestionScore | One row per (run, question) | Per-question scores, reviewer comments, drill-down for failures |
BrainEvalLog — one row per run
| Field | Type | Req | Notes |
| --- | --- | --- | --- |
| eval_run_id | UUID | REQ | Primary key. |
| run_trigger | ENUM | REQ | pre_launch / quarterly / prompt_change / model_promotion / on_demand |
| brain_under_test | VARCHAR(16) | REQ | AGT-901 or AGT-902. CHECK enforces the enum. |
| model_id | VARCHAR(64) | REQ | Specific model used (e.g., claude-sonnet-4-6). |
| system_prompt_hash | VARCHAR(64) | REQ | SHA-256 of the system prompt. Same value as BrainAnalysisLog.system_prompt_hash for cross-table joins. |
| system_prompt_version_label | VARCHAR(64) | opt | Human-readable label if RevOps maintains versioned prompts (e.g., pipeline_brain_v3). |
| brain_view_contracts_hash | VARCHAR(64) | REQ | SHA-256 of the brain-ready view definitions in effect at run time. Detects whether view contracts changed between runs. |
| question_catalog_version | VARCHAR(32) | REQ | Version label of the question catalog used (e.g., v28-2026Q2). The catalog evolves quarterly. |
| question_count | INTEGER | REQ | Number of questions in this run. Typically 30 for the full harness; lower for partial / on-demand runs. |
| started_at | TIMESTAMPTZ | REQ | Run start. |
| completed_at | TIMESTAMPTZ | opt | Run end. NULL while in progress. |
| reviewer_user_id | UUID | REQ | Human reviewer for dimensions 2, 4, 7. Independence required (cannot equal the prompt-tuner identity). |
| reviewer_independence_verified | BOOL | REQ | TRUE only if the reviewer is verified independent of brain prompt tuning. A trigger checks against an access-controlled prompt-tuner registry. |
| aggregated_scores | JSONB | REQ | Object with per-dimension scores: {citation_rate, hallucination_rate, staleness_recognition_rate, diagnosis_accuracy_mean, lever_mapping_correctness, confidence_calibration_in_band, narrative_coherence_mean} |
| hard_criteria_passed | BOOL | REQ | TRUE if hallucination ≤ 2%, staleness recognition = 100%, and lever mapping = 100% (AGT-902 only). |
| soft_criteria_passed_count | INTEGER | REQ | Count of soft criteria within threshold. |
| decision | ENUM | REQ | pass / pass_with_notes / conditional / fail. Computed via the rubric pass/fail logic at write time. |
| decision_rationale | TEXT | REQ | Human-readable explanation generated from the dimension scores. E.g., "Pass with notes: confidence calibration 12pp below band; all hard criteria passed." |
| calibration_notes | JSONB | opt | Free-form reviewer notes per dimension — what to fix in the next iteration. |
| token_cost_usd | NUMERIC(10,4) | REQ | Total token cost across all brain calls in the run. Audit-stable. |
| reviewer_hours_spent | NUMERIC(5,2) | opt | Self-reported by the reviewer. Tracks process efficiency. |
| created_at | TIMESTAMPTZ | REQ | DEFAULT NOW(). |
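
As a reference, here is a minimal DDL sketch of the run-level table, assuming PostgreSQL. Enum type names and defaults are illustrative; the deployed migration is authoritative.

```sql
-- Sketch only: assumes PostgreSQL 13+; enum type names are illustrative.
CREATE TYPE eval_run_trigger AS ENUM
  ('pre_launch', 'quarterly', 'prompt_change', 'model_promotion', 'on_demand');
CREATE TYPE eval_decision AS ENUM
  ('pass', 'pass_with_notes', 'conditional', 'fail');

CREATE TABLE brain_eval_log (
  eval_run_id                    UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  run_trigger                    eval_run_trigger NOT NULL,
  brain_under_test               VARCHAR(16) NOT NULL
    CHECK (brain_under_test IN ('AGT-901', 'AGT-902')),  -- extensible
  model_id                       VARCHAR(64) NOT NULL,
  system_prompt_hash             VARCHAR(64) NOT NULL,   -- SHA-256; joins to BrainAnalysisLog
  system_prompt_version_label    VARCHAR(64),
  brain_view_contracts_hash      VARCHAR(64) NOT NULL,
  question_catalog_version       VARCHAR(32) NOT NULL,
  question_count                 INTEGER NOT NULL CHECK (question_count > 0),
  started_at                     TIMESTAMPTZ NOT NULL,
  completed_at                   TIMESTAMPTZ,            -- NULL while in progress
  reviewer_user_id               UUID NOT NULL,
  reviewer_independence_verified BOOLEAN NOT NULL,
  aggregated_scores              JSONB NOT NULL,
  hard_criteria_passed           BOOLEAN NOT NULL,
  soft_criteria_passed_count     INTEGER NOT NULL,
  decision                       eval_decision NOT NULL, -- computed by trigger, not by the caller
  decision_rationale             TEXT NOT NULL,
  calibration_notes              JSONB,
  token_cost_usd                 NUMERIC(10,4) NOT NULL,
  reviewer_hours_spent           NUMERIC(5,2),
  created_at                     TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```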
BrainEvalQuestionScore — one row per (run, question)
| Field | Type | Req | Notes |
| --- | --- | --- | --- |
| question_score_id | UUID | REQ | Primary key. |
| eval_run_id | UUID | REQ | FK to BrainEvalLog. |
| question_id | VARCHAR(16) | REQ | From the catalog (e.g., EVAL-Q01). |
| brain_analysis_id | UUID | REQ | FK to the BrainAnalysisLog row produced for this question. Enables drill-down: which sources the brain read, what it output, and how the eval reviewer scored it. |
| question_type | VARCHAR(32) | REQ | From the catalog: plan_diagnosis / coverage_gap / play_retrospective / forecast_bias / churn_diagnosis / expansion_qualification / handoff_briefing / qbr_narrative |
| uses_stale_fixture | BOOL | REQ | TRUE if this question deliberately uses stale source data to test recognition. |
| citation_rate | NUMERIC(5,4) | REQ | Fraction (0–1) of numerical claims with a valid citation. Auto-scored. |
| hallucinations_detected | INTEGER | REQ | Count of unsupported claims (human-reviewed). |
| total_claims | INTEGER | REQ | Total numerical/factual claims in the narrative (human-reviewed). |
| staleness_correctly_recognized | BOOL | opt | NULL if the question is not a stale fixture. TRUE/FALSE only on stale-fixture questions. |
| diagnosis_accuracy_score | NUMERIC(3,2) | opt | 0.0 / 0.5 / 1.0. NULL for question types where this dimension doesn't apply. |
| lever_mapping_violations | INTEGER | opt | AGT-902 only. Count of proposed_actions[].action_type values not in the enum. 0 = pass. |
| confidence_distribution | JSONB | REQ | {high_confidence_pct, multi_source_pct, inference_pct, speculation_pct} for this question's output. |
| narrative_coherence_rating | INTEGER | opt | 1–5 reviewer rating. |
| reviewer_comments | TEXT | opt | Free-form per-question notes — what was right, what was wrong, what's worth iterating on. |
| question_passed | BOOL | REQ | Did this question pass its specific pass criteria from the catalog? Computed at write time. |
| created_at | TIMESTAMPTZ | REQ | DEFAULT NOW(). |
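
And a matching sketch for the per-question table, under the same PostgreSQL assumption. The referenced column name brain_analysis_log.brain_analysis_id is inferred from the FK description above, not confirmed by the source.

```sql
-- Sketch only: assumes the brain_eval_log table above and a brain_analysis_log
-- table whose PK column name (brain_analysis_id) is an assumption.
CREATE TABLE brain_eval_question_score (
  question_score_id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  eval_run_id                    UUID NOT NULL REFERENCES brain_eval_log (eval_run_id),
  question_id                    VARCHAR(16) NOT NULL,   -- e.g., EVAL-Q01
  brain_analysis_id              UUID NOT NULL REFERENCES brain_analysis_log (brain_analysis_id),
  question_type                  VARCHAR(32) NOT NULL,
  uses_stale_fixture             BOOLEAN NOT NULL,
  citation_rate                  NUMERIC(5,4) NOT NULL
    CHECK (citation_rate BETWEEN 0 AND 1),
  hallucinations_detected        INTEGER NOT NULL CHECK (hallucinations_detected >= 0),
  total_claims                   INTEGER NOT NULL CHECK (total_claims >= 0),
  staleness_correctly_recognized BOOLEAN,                -- NULL unless uses_stale_fixture
  diagnosis_accuracy_score       NUMERIC(3,2)
    CHECK (diagnosis_accuracy_score IN (0.0, 0.5, 1.0)),
  lever_mapping_violations       INTEGER,                -- AGT-902 only; 0 = pass
  confidence_distribution        JSONB NOT NULL,
  narrative_coherence_rating     INTEGER
    CHECK (narrative_coherence_rating BETWEEN 1 AND 5),
  reviewer_comments              TEXT,
  question_passed                BOOLEAN NOT NULL,       -- computed at write time
  created_at                     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  -- the staleness flag is only meaningful on stale-fixture questions
  CHECK (uses_stale_fixture OR staleness_correctly_recognized IS NULL)
);
```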
Keys, constraints, indexes
| Element | Definition | Why |
| --- | --- | --- |
| BrainEvalLog PK | eval_run_id | Surrogate. |
| BrainEvalQuestionScore PK | question_score_id | Surrogate. |
| BrainEvalQuestionScore FK | eval_run_id → BrainEvalLog (CASCADE not allowed; runs are append-only) | Lineage. |
| BrainEvalQuestionScore FK | brain_analysis_id → BrainAnalysisLog | Drill-down to the actual brain output and its sources. |
| Index 1 (BrainEvalLog) | (brain_under_test, started_at DESC) | Per-brain trend dashboard. |
| Index 2 (BrainEvalLog) | (decision, started_at DESC) | "Show me all failed runs in the trailing 12 months." |
| Index 3 (BrainEvalLog) | (system_prompt_hash) | "How has this prompt performed across runs?" Joins to BrainAnalysisLog by hash for production-vs-eval comparison. |
| Index 4 (BrainEvalQuestionScore) | (question_id, created_at DESC) | "How has question X scored over time?" Drift detection. |
| Index 5 (BrainEvalQuestionScore) | (eval_run_id) | Run drill-down: all question scores for a single run. |
| Index 6 (BrainEvalQuestionScore) | (uses_stale_fixture, staleness_correctly_recognized) | Stale-fixture pass-rate audit. |
| CHECK: brain_under_test | brain_under_test IN ('AGT-901','AGT-902') (extensible) | Identity discipline. |
| CHECK: independence verified | Trigger validates that reviewer_user_id is not in the prompt-tuner registry for the brain under test. | Reviewer independence is a methodology requirement. |
| CHECK: decision logic | Trigger computes decision from aggregated_scores at write time. Manual override blocked. | The pass/fail rubric is policy, not negotiation. |
| Append-only | UPDATE blocked except for the single completed_at field set on run completion. DELETE blocked. | Audit integrity. |
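
A sketch of how the independence and append-only rules could be enforced with triggers, assuming PostgreSQL and a prompt_tuner_registry table (hypothetical name) keyed by user and brain:

```sql
-- Sketch: enforcement triggers. prompt_tuner_registry is a hypothetical
-- access-controlled table listing who tunes prompts for which brain.
CREATE OR REPLACE FUNCTION enforce_reviewer_independence() RETURNS trigger AS $$
BEGIN
  IF EXISTS (
    SELECT 1 FROM prompt_tuner_registry r
    WHERE r.user_id  = NEW.reviewer_user_id
      AND r.brain_id = NEW.brain_under_test
  ) THEN
    RAISE EXCEPTION 'Reviewer % is a prompt tuner for %; independence violated',
      NEW.reviewer_user_id, NEW.brain_under_test;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_reviewer_independence
  BEFORE INSERT ON brain_eval_log
  FOR EACH ROW EXECUTE FUNCTION enforce_reviewer_independence();

-- Append-only: allow an UPDATE only when it does nothing but set completed_at.
CREATE OR REPLACE FUNCTION enforce_append_only() RETURNS trigger AS $$
BEGIN
  IF TG_OP = 'DELETE' THEN
    RAISE EXCEPTION 'BrainEvalLog is append-only; DELETE blocked';
  END IF;
  IF NEW.completed_at IS DISTINCT FROM OLD.completed_at
     AND to_jsonb(NEW) - 'completed_at' = to_jsonb(OLD) - 'completed_at' THEN
    RETURN NEW;  -- the single permitted mutation
  END IF;
  RAISE EXCEPTION 'BrainEvalLog is append-only; only completed_at may be set';
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_append_only
  BEFORE UPDATE OR DELETE ON brain_eval_log
  FOR EACH ROW EXECUTE FUNCTION enforce_append_only();
```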
Dashboard queries that depend on this table
| Dashboard panel | Query |
| --- | --- |
| Per-brain quarterly trend | For each brain, plot per-dimension scores across the last 8 quarterly runs. Detects drift on any dimension. |
| Most-failing questions | Across runs, rank questions by the % of runs in which they failed. Surfaces catalog questions that may be too hard, ambiguous, or misaligned. |
| Stale-fixture pass rate | Per run: did all 5 stale-fixture questions pass staleness recognition? Hard requirement — any miss is a sev-2 incident. |
| Reviewer calibration | Per reviewer: how do their narrative-coherence ratings distribute? A shift in a reviewer's scoring distribution may indicate calibration drift. |
| Cost-per-run trend | token_cost_usd + reviewer_hours_spent over time. Watch for process bloat. |
| Production-vs-eval prompt comparison | Join BrainEvalLog.system_prompt_hash to BrainAnalysisLog.system_prompt_hash. "Production runs are using prompt X; the eval pass was on prompt Y." Drift detection. |
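
For illustration, the most-failing-questions panel reduces to a query along these lines, using the table and column names from the sketches above:

```sql
-- Most-failing questions: % of runs in which each catalog question failed.
SELECT q.question_id,
       COUNT(*)                                      AS times_run,
       COUNT(*) FILTER (WHERE NOT q.question_passed) AS times_failed,
       ROUND(100.0 * COUNT(*) FILTER (WHERE NOT q.question_passed)
             / COUNT(*), 1)                          AS fail_pct
FROM brain_eval_question_score q
GROUP BY q.question_id
ORDER BY fail_pct DESC
LIMIT 10;
```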
Access control
| Role | Read | Write |
| --- | --- | --- |
| Eval harness runner (system service) | Yes | BrainEvalLog INSERT + completed_at UPDATE; BrainEvalQuestionScore INSERT |
| Reviewer (during run) | Yes — own run + question rows | BrainEvalQuestionScore: reviewer-scored fields only, before run completion |
| RevOps | Full read | None |
| Auditors (internal/external) | Full read on audit windows | None |
| Brain Agents | None | None |
| Production users (AM/CSM/AE) | None — eval data is internal | None |
Brain Agents have no access to BrainEvalLog or BrainEvalQuestionScore. This is deliberate — we don't want the brain to learn what the eval expects. The eval is a measurement of the brain's general capability, not a target the brain should optimize for.
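
Translated into grants, the matrix could look roughly like this. All role names are illustrative, the reviewer-scored column list is an assumed subset, and the row- and time-scoped limits (own run only, before completion) would need row-level security or application-side enforcement on top:

```sql
-- Illustrative role grants; role names are not prescribed by this schema.
GRANT SELECT, INSERT ON brain_eval_log, brain_eval_question_score
  TO eval_harness_runner;
GRANT UPDATE (completed_at) ON brain_eval_log TO eval_harness_runner;

GRANT SELECT ON brain_eval_log, brain_eval_question_score
  TO revops_read, auditor_read;

-- Reviewer writes: assumed subset of the human-reviewed columns.
GRANT UPDATE (hallucinations_detected, total_claims,
              narrative_coherence_rating, reviewer_comments)
  ON brain_eval_question_score TO eval_reviewer;

-- Brain agents and production users get no grants at all.
REVOKE ALL ON brain_eval_log, brain_eval_question_score
  FROM brain_agent, production_app;
```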
Retention
| Tier | Window | Access |
| --- | --- | --- |
| Hot | 0–24 months | Operational reads — dashboards, drift detection, cohort comparison |
| Cold | 24 months–7 years | Audit only |
| Purge | > 7 years | Hard delete, subject to legal hold |
7-year retention parallels BrainAnalysisLog. Eval results that informed brain-influenced decisions need the same retention floor as the underlying decisions.
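
A purge job consistent with this policy might look like the sketch below, assuming a legal_holds table (hypothetical name) that pins runs under hold. The job would need a privileged role exempt from the append-only triggers, and question-score rows go first because the FK has no cascade:

```sql
-- Sketch: hard delete past the 7-year floor, skipping runs under legal hold.
DELETE FROM brain_eval_question_score q
USING brain_eval_log l
WHERE q.eval_run_id = l.eval_run_id
  AND l.created_at < NOW() - INTERVAL '7 years'
  AND NOT EXISTS (SELECT 1 FROM legal_holds h
                  WHERE h.eval_run_id = l.eval_run_id);

DELETE FROM brain_eval_log l
WHERE l.created_at < NOW() - INTERVAL '7 years'
  AND NOT EXISTS (SELECT 1 FROM legal_holds h
                  WHERE h.eval_run_id = l.eval_run_id);
```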