BrainEvalLog is the canonical record of every Brain Eval Harness run. Every pre-launch run, every quarterly cadence run, every on-demand run lands here. The schema captures: which brain config was tested (model + system prompt hash + brain-ready view contracts), per-question scores across all 7 dimensions, aggregated pass/fail decision, reviewer identity, run timestamp, and full audit trail.
A single eval run scores 30 questions across 7 dimensions. Storing this as a flat structure either loses detail (one summary row per run) or duplicates metadata (30 rows per run with run-level fields repeated). A two-table split is the right shape: one row per run, plus one row per question scored within that run.
| Table | Granularity | Used for |
|---|---|---|
| BrainEvalLog | One row per harness run | Aggregated pass/fail, dimension averages, run metadata, trend dashboards |
| BrainEvalQuestionScore | One row per (run, question) | Per-question scores, reviewer comments, drill-down for failures |
BrainEvalLog fields:

| Field | Type | Req | Notes |
|---|---|---|---|
| eval_run_id | UUID | REQ | Primary key. |
| run_trigger | ENUM | REQ | pre_launch / quarterly / prompt_change / model_promotion / on_demand |
| brain_under_test | VARCHAR(16) | REQ | AGT-901 or AGT-902. CHECK enforces the enum. |
| model_id | VARCHAR(64) | REQ | Specific model used (e.g., claude-sonnet-4-6). |
| system_prompt_hash | VARCHAR(64) | REQ | SHA-256 of the system prompt. Same value as BrainAnalysisLog.system_prompt_hash for cross-table joins. |
| system_prompt_version_label | VARCHAR(64) | opt | Human-readable label if RevOps maintains versioned prompts (e.g., pipeline_brain_v3). |
| brain_view_contracts_hash | VARCHAR(64) | REQ | SHA-256 of the brain-ready view definitions in effect at run time. Detects view-contract changes between runs. |
| question_catalog_version | VARCHAR(32) | REQ | Version label of the question catalog used (e.g., v28-2026Q2). The catalog evolves quarterly. |
| question_count | INTEGER | REQ | Number of questions in this run. Typically 30 for the full harness; lower for partial / on-demand runs. |
| started_at | TIMESTAMPTZ | REQ | Run start. |
| completed_at | TIMESTAMPTZ | opt | Run end. NULL while in progress. |
| reviewer_user_id | UUID | REQ | Human reviewer for dimensions 2, 4, and 7. Independence required (cannot equal the prompt-tuner identity). |
| reviewer_independence_verified | BOOL | REQ | TRUE only if the reviewer is verified independent of brain prompt tuning. A trigger checks against an access-controlled prompt-tuner registry. |
| aggregated_scores | JSONB | REQ | Object with per-dimension scores: {citation_rate, hallucination_rate, staleness_recognition_rate, diagnosis_accuracy_mean, lever_mapping_correctness, confidence_calibration_in_band, narrative_coherence_mean} |
| hard_criteria_passed | BOOL | REQ | TRUE if hallucination rate ≤ 2%, staleness recognition = 100%, and lever mapping = 100% (AGT-902 only). |
| soft_criteria_passed_count | INTEGER | REQ | Count of soft criteria within threshold. |
| decision | ENUM | REQ | pass / pass_with_notes / conditional / fail. Computed via the rubric pass/fail logic at write time. |
| decision_rationale | TEXT | REQ | Human-readable explanation generated from the dimension scores, e.g., "Pass with notes: confidence calibration 12pp below band; all hard criteria passed." |
| calibration_notes | JSONB | opt | Free-form reviewer notes per dimension: what to fix in the next iteration. |
| token_cost_usd | NUMERIC(10,4) | REQ | Total token cost across all brain calls in the run. Audit-stable. |
| reviewer_hours_spent | NUMERIC(5,2) | opt | Self-reported by the reviewer. Tracks process efficiency. |
| created_at | TIMESTAMPTZ | REQ | DEFAULT NOW(). |
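The write-time rollup described above (aggregated_scores, hard_criteria_passed, decision) can be sketched in Python. The hard criteria are taken from the table; the soft-criteria threshold (citation_rate ≥ 0.95) and the decision mapping are illustrative assumptions, since the canonical rubric lives outside this schema.

```python
def aggregate_and_decide(rows, brain_under_test):
    """Roll up per-question score rows into aggregated_scores and a decision.

    rows: dicts shaped like BrainEvalQuestionScore rows.
    Hard criteria per the schema: hallucination rate <= 2%, staleness
    recognition = 100%, lever mapping = 100% (AGT-902 only).
    """
    n = len(rows)
    total_claims = sum(r["total_claims"] for r in rows)
    stale = [r for r in rows if r["uses_stale_fixture"]]

    scores = {
        "citation_rate": sum(r["citation_rate"] for r in rows) / n,
        "hallucination_rate": (
            sum(r["hallucinations_detected"] for r in rows) / total_claims
            if total_claims else 0.0
        ),
        "staleness_recognition_rate": (
            sum(1 for r in stale if r["staleness_correctly_recognized"]) / len(stale)
            if stale else 1.0  # no stale fixtures in a partial run: vacuously passed
        ),
    }

    lever_ok = brain_under_test != "AGT-902" or all(
        r.get("lever_mapping_violations") in (0, None) for r in rows
    )
    hard_passed = (
        scores["hallucination_rate"] <= 0.02
        and scores["staleness_recognition_rate"] == 1.0
        and lever_ok
    )

    # Illustrative decision mapping, NOT the canonical rubric:
    soft_ok = scores["citation_rate"] >= 0.95  # assumed soft threshold
    if not hard_passed:
        decision = "fail"
    else:
        decision = "pass" if soft_ok else "pass_with_notes"
    return scores, hard_passed, decision
```

In the live system this logic sits in the write-time trigger (see the decision-logic constraint below the field tables), so the stored decision can never disagree with the stored scores.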
BrainEvalQuestionScore fields:

| Field | Type | Req | Notes |
|---|---|---|---|
| question_score_id | UUID | REQ | Primary key. |
| eval_run_id | UUID | REQ | FK to BrainEvalLog. |
| question_id | VARCHAR(16) | REQ | From the catalog (e.g., EVAL-Q01). |
| brain_analysis_id | UUID | REQ | FK to the BrainAnalysisLog row produced for this question. Enables drill-down: which sources the brain read, what it output, and how the eval reviewer scored it. |
| question_type | VARCHAR(32) | REQ | From the catalog: plan_diagnosis / coverage_gap / play_retrospective / forecast_bias / churn_diagnosis / expansion_qualification / handoff_briefing / qbr_narrative |
| uses_stale_fixture | BOOL | REQ | TRUE if this question deliberately uses stale source data to test recognition. |
| citation_rate | NUMERIC(5,4) | REQ | Fraction (0–1) of numerical claims with a valid citation. Auto-scored. |
| hallucinations_detected | INTEGER | REQ | Count of unsupported claims (human-reviewed). |
| total_claims | INTEGER | REQ | Total numerical/factual claims in the narrative (human-reviewed). |
| staleness_correctly_recognized | BOOL | opt | NULL unless the question is a stale fixture; TRUE/FALSE only on stale-fixture questions. |
| diagnosis_accuracy_score | NUMERIC(3,2) | opt | 0.0 / 0.5 / 1.0. NULL for question types where this dimension doesn't apply. |
| lever_mapping_violations | INTEGER | opt | AGT-902 only. Count of proposed_actions[].action_type values not in the enum. 0 = pass. |
| confidence_distribution | JSONB | REQ | {high_confidence_pct, multi_source_pct, inference_pct, speculation_pct} for this question's output. |
| narrative_coherence_rating | INTEGER | opt | 1–5 reviewer rating. |
| reviewer_comments | TEXT | opt | Free-form per-question notes: what was right, what was wrong, what's worth iterating on. |
| question_passed | BOOL | REQ | Whether this question passed its specific pass criteria from the catalog. Computed at write time. |
| created_at | TIMESTAMPTZ | REQ | DEFAULT NOW(). |
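The auto-scored citation path for a single question row reduces to counting cited claims. A minimal sketch, assuming claims arrive as dicts with an optional citation field; the claim extractor and the `min_citation_rate` pass criterion are hypothetical stand-ins for the catalog's real per-question criteria.

```python
def score_question(claims, pass_criteria):
    """Auto-score one question's output.

    claims: list of dicts like {"text": ..., "citation": "source#ref" or None}
            (assumed shape; the real claim extractor is upstream of this).
    pass_criteria: hypothetical per-question thresholds from the catalog.
    Returns the auto-scorable subset of a BrainEvalQuestionScore row.
    """
    total = len(claims)
    cited = sum(1 for c in claims if c.get("citation"))
    # citation_rate per the schema: fraction of claims with a valid citation
    citation_rate = round(cited / total, 4) if total else 0.0
    passed = citation_rate >= pass_criteria.get("min_citation_rate", 1.0)
    return {
        "citation_rate": citation_rate,
        "total_claims": total,
        "question_passed": passed,
    }
```

Human-reviewed fields (hallucinations_detected, narrative_coherence_rating) are filled in by the reviewer afterward; only then is the final question_passed computed at write time.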
| Element | Definition | Why |
|---|---|---|
| BrainEvalLog PK | eval_run_id | Surrogate. |
| BrainEvalQuestionScore PK | question_score_id | Surrogate. |
| BrainEvalQuestionScore FK | eval_run_id → BrainEvalLog (CASCADE not allowed; runs are append-only) | Lineage. |
| BrainEvalQuestionScore FK | brain_analysis_id → BrainAnalysisLog | Drill-down to the actual brain output and its sources. |
| Index 1 (BrainEvalLog) | (brain_under_test, started_at DESC) | Per-brain trend dashboard. |
| Index 2 (BrainEvalLog) | (decision, started_at DESC) | "Show me all failed runs in trailing 12mo." |
| Index 3 (BrainEvalLog) | (system_prompt_hash) | "How has this prompt performed across runs?" Joins to BrainAnalysisLog by hash for production-vs-eval comparison. |
| Index 4 (BrainEvalQuestionScore) | (question_id, created_at DESC) | "How has question X scored over time?" Drift detection. |
| Index 5 (BrainEvalQuestionScore) | (eval_run_id) | Run drill-down: all 30 question scores for a single run. |
| Index 6 (BrainEvalQuestionScore) | (uses_stale_fixture, staleness_correctly_recognized) | Stale-fixture pass rate audit. |
| CHECK: brain_under_test | brain_under_test IN ('AGT-901','AGT-902') (extensible) | Identity discipline. |
| CHECK: independence verified | Trigger validates reviewer_user_id is not in the prompt-tuner registry for the brain under test. | Reviewer independence is a methodology requirement. |
| CHECK: decision logic | Trigger computes decision from aggregated_scores at write time. Manual override blocked. | Pass/fail rubric is policy, not negotiation. |
| Append-only | UPDATE blocked except for the single completed_at field set on run completion. DELETE blocked. | Audit integrity. |
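The independence constraint above could be mirrored at the application layer roughly as follows. This is a sketch, not the database trigger itself, and the registry shape (brain ID mapped to a set of tuner user IDs) is an assumption.

```python
class IndependenceError(ValueError):
    """Raised when the proposed reviewer tunes prompts for the brain under test."""


def verify_reviewer_independence(reviewer_user_id, brain_under_test, tuner_registry):
    """Reject a run insert if the reviewer appears in the prompt-tuner registry.

    tuner_registry: assumed shape {brain_id: {tuner_user_id, ...}}, backed by
    the access-controlled registry the trigger consults.
    """
    tuners = tuner_registry.get(brain_under_test, set())
    if reviewer_user_id in tuners:
        raise IndependenceError(
            f"Reviewer {reviewer_user_id} tunes prompts for {brain_under_test}; "
            "an independent reviewer is required."
        )
    return True  # maps to reviewer_independence_verified = TRUE
```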
| Dashboard panel | Query |
|---|---|
| Per-brain quarterly trend | For each brain, plot per-dimension scores across the last 8 quarterly runs. Detects drift on any dimension. |
| Most-failing questions | Across runs, rank questions by % of runs they failed in. Surfaces catalog questions that may be too hard, ambiguous, or misaligned. |
| Stale-fixture pass rate | Per-run: did all 5 stale-fixture questions pass staleness recognition? Hard requirement — any miss is a sev-2 incident. |
| Reviewer calibration | Per-reviewer: how do their narrative-coherence ratings distribute? A shift in a reviewer's scoring distribution may indicate calibration drift. |
| Cost per run trend | token_cost_usd + reviewer_hours_spent over time. Watch for process bloat. |
| Production-vs-eval prompt comparison | Join BrainEvalLog.system_prompt_hash to BrainAnalysisLog.system_prompt_hash. "Production runs are using prompt X; eval pass was on prompt Y." Drift detection. |
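The most-failing-questions panel reduces to a small aggregation over BrainEvalQuestionScore rows. A sketch, assuming rows are fetched as dicts (in production this would be a GROUP BY against Index 4):

```python
from collections import defaultdict


def most_failing_questions(score_rows, top_n=5):
    """Rank question_ids by the share of runs in which they failed.

    score_rows: dicts with question_id, eval_run_id, question_passed.
    Returns [(question_id, fail_rate), ...] sorted worst-first.
    """
    runs = defaultdict(set)   # question_id -> run ids it appeared in
    fails = defaultdict(set)  # question_id -> run ids it failed in
    for r in score_rows:
        runs[r["question_id"]].add(r["eval_run_id"])
        if not r["question_passed"]:
            fails[r["question_id"]].add(r["eval_run_id"])
    ranked = sorted(
        ((qid, len(fails[qid]) / len(runs[qid])) for qid in runs),
        key=lambda kv: kv[1],
        reverse=True,
    )
    return ranked[:top_n]
```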
| Role | Read | Write |
|---|---|---|
| Eval harness runner (system service) | Yes | BrainEvalLog INSERT + completed_at UPDATE; BrainEvalQuestionScore INSERT |
| Reviewer (during run) | Yes — own run + question rows | BrainEvalQuestionScore: reviewer-scored fields only, before run completion |
| RevOps | Full read | None |
| Auditors (internal/external) | Full read on audit windows | None |
| Brain Agents | None | None |
| Production users (AM/CSM/AE) | None — eval data is internal | None |
| Tier | Window | Access |
|---|---|---|
| Hot | 0 to 24 months | Operational reads — dashboards, drift detection, cohort comparison |
| Cold | 24 months to 7 years | Audit only. |
| Purge | > 7 years | Hard delete subject to legal hold. |
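The tier boundaries above can be classified from a row's created_at with a small helper. Month and year lengths are approximated in whole days here (an assumption; the real retention job would use calendar arithmetic), and the purge path honors legal hold as stated.

```python
from datetime import datetime, timezone


def retention_tier(created_at, now=None, legal_hold=False):
    """Classify a row into hot / cold / purge-eligible per the retention table.

    Approximation: 24 months ~ 720 days, 7 years ~ 2555 days.
    """
    now = now or datetime.now(timezone.utc)
    age_days = (now - created_at).days
    if age_days <= 24 * 30:
        return "hot"            # operational reads: dashboards, drift detection
    if age_days <= 7 * 365:
        return "cold"           # audit only
    return "legal_hold" if legal_hold else "purge_eligible"
```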