Brain Eval Harness — Methodology & Rubric

Layer 9: Reasoning · Evaluation infrastructure · Pre-launch gate · Quarterly drift detection · CFO-defensible measurement of brain output quality
Eval before launch, eval forever. Per the v26 architecture directive: build the offline eval harness before turning any Brain Agent on in production. This is also the CFO defense — when costs come up, point at measurable accuracy gains. The harness is permanent, not one-and-done; re-run quarterly to detect drift, and on every material prompt or model change.
Purpose

Brain Agents (AGT-901, AGT-902) produce non-deterministic outputs informed by deterministic Tier 1 sources. The eval harness measures whether those outputs are defensible — cite real sources, recognize stale data, match retrospective ground truth, and route proposed actions to real existing levers. Without this measurement, the brain layer is a pile of opinions; with it, the layer earns its place beside the deterministic backbone.

When to run the eval
| Trigger | Scope | Pass requirement |
| --- | --- | --- |
| Pre-launch (one time per brain) | Full 30-question harness against the proposed brain config (model + system prompt + brain-ready views) | Hard gate. Brain does not go live in production until all hard-requirement criteria pass and at least 70% of soft criteria are within thresholds. |
| Quarterly (recurring) | Full 30-question harness, on the same questions with refreshed source data | Soft. If thresholds drift below pass, the brain remains live but RevOps gets a calibration alert and the next quarter's planning includes prompt/model adjustment. |
| Material prompt change | Full 30-question harness | Hard gate before the new prompt is deployed. Material = changes to the system prompt, change in default model tier, change in brain-ready view contracts. |
| New model version available | Full 30-question harness on the new model | Comparison against current production. Promotion only if the new model meets or beats current on all hard criteria and at least 3 of 4 soft criteria. |
| On-demand (operator suspicion) | Full or partial harness, RevOps choice | Diagnostic only — does not gate production. Used when RevOps suspects drift or has a complaint about a recent brain output. |
Harness composition — 30 historical questions
| Brain | Question count | Question types |
| --- | --- | --- |
| AGT-901 Pipeline Brain | 10 | 4 plan diagnosis · 3 coverage gap analysis · 2 quarterly play retrospective · 1 forecast bias attribution |
| AGT-902 Account Brain | 20 | 8 churn diagnosis · 6 expansion qualification · 4 hand-off briefing quality · 2 QBR narrative quality |

Question split favors AGT-902 because it operates at higher volume (per-account synthesis is the high-frequency use case) and because per-account historical ground truth is easier to construct than cross-segment ground truth. Both are real use cases worth eval coverage, but the volumes are different.

See Brain Eval Question Catalog for the full 30-question catalog with templates, fixtures, and expected outputs.
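For orientation, a catalog entry might carry roughly the following fields. This is a hedged sketch; the field names and ID scheme are assumptions for illustration, not the catalog's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class EvalQuestion:
    """One of the 30 catalog entries (field names illustrative)."""
    question_id: str              # e.g. "AGT-902-CHURN-03" (hypothetical ID scheme)
    brain: str                    # "AGT-901" or "AGT-902"
    question_type: str            # e.g. "churn_diagnosis"
    prompt: str                   # the question posed to the brain
    fixture_ref: str              # pointer to the frozen time-of-question snapshot
    ground_truth_causes: list[str] = field(default_factory=list)
    stale_fixture: bool = False   # True for deliberate-staleness questions (dimension 3)
```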

Question construction — what makes a good retrospective question

A good retrospective eval question has four properties:

  1. Ground truth exists. The right answer is knowable in retrospect because we can see what actually happened (deal closed/lost, customer churned/renewed, expansion materialized/didn't, plan met/missed).
  2. Source data is reconstructible. The Tier 1 service tables can be replayed to the state they were in at the time the question would have been asked. The brain reads the historical state, not today's state.
  3. Multiple defensible answers are possible — but some are clearly better. Eval is not "exact match" — it's "did the brain identify the right drivers, propose the right levers, recognize the data freshness, cite real sources." A wrong-but-defensible answer scores partial; a confident-and-wrong answer scores zero on hallucination dimension.
  4. Spans the brain's actual use cases. Don't over-fit the eval to easy questions. Coverage across the use case taxonomy in the brain spec is what makes eval predictive of production performance.
Hard rule: if you can't reconstruct what the Tier 1 service tables looked like at the time the question would have been asked, the question is not eligible for the harness. The brain reads the time-of-question state, not the post-hoc state. Otherwise the eval measures hindsight bias, not brain quality.
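To make the hard rule concrete, here is a minimal sketch of time-of-question replay, assuming Tier 1 tables keep row-level `valid_from`/`valid_to` version history; the field names and table shape are illustrative, not the actual Tier 1 schema:

```python
from datetime import datetime

def replay_as_of(rows: list[dict], question_time: datetime) -> list[dict]:
    """Reconstruct a Tier 1 service table at its time-of-question state.

    Assumes each row version carries valid_from/valid_to timestamps
    (valid_to is None for the current version). Versions whose validity
    window does not cover question_time are excluded, so the brain never
    sees post-hoc state.
    """
    return [
        r for r in rows
        if r["valid_from"] <= question_time
        and (r["valid_to"] is None or question_time < r["valid_to"])
    ]

# Eligibility check per the hard rule: a question enters the harness only
# if every source table it needs can be reconstructed this way.
```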
Scoring rubric — per-question dimensions

Every brain output is scored across 7 dimensions. Some are hard requirements (binary pass/fail); some are soft thresholds (graded). Aggregated across the 30 questions, the brain either passes the harness or it doesn't.

| Dimension | Type | Measurement | Pass threshold |
| --- | --- | --- | --- |
| 1. Source citation rate | Soft | % of numerical claims in narrative_output that carry a valid [src:N] citation pointing to a real sources_read entry | ≥ 95% |
| 2. Hallucination rate | Hard | % of claims containing values not supported by any cited source. Scored by a human eval reviewer comparing narrative_output to sources_read. | ≤ 2% (hard) |
| 3. Staleness recognition | Hard | For questions with deliberately stale fixture data: did the brain surface staleness in the narrative AND set data_staleness_acknowledged = TRUE? | 100% (hard — no exceptions) |
| 4. Diagnosis accuracy | Soft | Did the brain's top-2 identified causes match the retrospective ground truth? Scored 1.0 if both match, 0.5 if one matches, 0.0 if neither. | ≥ 0.70 mean across questions |
| 5. Lever-mapping correctness | Hard (AGT-902 only) | Every proposed_actions[].action_type must be in the AGT-902 enum (pull_qbr_forward, open_expansion_play, brief_new_ae_or_csm, customer_communication, escalate_to_slm, recommend_human_query, none). Trigger-enforced at write, validated at eval. | 100% (hard) |
| 6. Confidence calibration | Soft | Distribution of confidence flags across claims. Healthy: ~60% high_confidence, ~25% inference, ≤10% speculation. Unhealthy: 100% high_confidence with errors (dishonest), or 50%+ speculation (over-cautious). | Distribution within 15pp of healthy bands |
| 7. Narrative coherence | Soft | Human reviewer rating: does the narrative make sense, follow from the cited sources, and lead to the proposed actions? Scored 1–5 per question. | ≥ 4.0 mean across questions |
Any hard-criterion failure (hallucination > 2%, staleness recognition < 100%, lever-mapping < 100%) is an automatic harness fail regardless of soft-criterion performance. The brain does not go to production with hard failures — eval gates that out.
Pass/fail decision logic
| Condition | Outcome |
| --- | --- |
| All hard criteria pass + all soft criteria pass | PASS — brain approved for production |
| All hard criteria pass + 3 of 4 soft criteria pass + no soft criterion is >15pp below threshold | PASS-WITH-NOTES — production approved, calibration items logged for next iteration |
| All hard criteria pass + soft criteria significantly below thresholds | CONDITIONAL — brain may run in shadow mode (outputs logged, not surfaced) for a calibration period; re-eval before promotion to user-facing |
| Any hard criterion fails | FAIL — brain blocked from production. Diagnose root cause, fix, re-run the full harness. |
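One way to encode this table, assuming soft-criterion scores have been normalized to signed percentage-point margins against their thresholds (the normalization is an assumption; the table's intent is the contract):

```python
def harness_outcome(hard_pass: dict[str, bool],
                    soft_margin_pp: dict[str, float]) -> str:
    """Apply the pass/fail decision table to one harness run.

    hard_pass:      hard dimension -> True/False (hallucination, staleness,
                    lever-mapping).
    soft_margin_pp: soft dimension -> signed margin vs. its threshold in
                    percentage points (negative = below threshold).
    """
    if not all(hard_pass.values()):
        return "FAIL"  # any hard failure blocks production outright
    below = [m for m in soft_margin_pp.values() if m < 0]
    if not below:
        return "PASS"
    # 3 of 4 soft criteria pass, and nothing is more than 15pp under
    if len(below) == 1 and min(below) >= -15:
        return "PASS-WITH-NOTES"
    return "CONDITIONAL"  # shadow mode, re-eval before promotion
```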
Human review — what's automated vs. what isn't

The harness combines automated checks and human judgment. Both are required for a defensible eval.

| Dimension | Automated | Human required |
| --- | --- | --- |
| Source citation rate | Yes — regex-parse [src:N], validate against sources_read | No |
| Hallucination rate | Partial — can flag claims without citations automatically | Yes — reviewer compares cited claims to source values, validates support |
| Staleness recognition | Yes — check the data_staleness_acknowledged flag and the presence of a staleness phrase in the narrative | No |
| Diagnosis accuracy | No | Yes — reviewer compares the brain's top-2 causes to the retrospective ground truth (provided in the question template) |
| Lever-mapping correctness | Yes — enum validation | No |
| Confidence calibration | Yes — distribution math | Sometimes — flagged outputs reviewed for honest calibration |
| Narrative coherence | No | Yes — reviewer rates 1–5 |
Reviewer should be a RevOps team member familiar with the underlying GTM context but not the person who tuned the brain prompt. Independence matters — the reviewer is the auditor for this exercise.
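A sketch of the auto-scorable checks (dimensions 1, 3, 5, 6), assuming a BrainAnalysisLog row deserializes to a dict with the field names used in this document; the `claims` structure is an assumption:

```python
import re

# Levers from the AGT-902 enum in dimension 5.
ACTION_ENUM = {
    "pull_qbr_forward", "open_expansion_play", "brief_new_ae_or_csm",
    "customer_communication", "escalate_to_slm", "recommend_human_query",
    "none",
}
CITATION = re.compile(r"\[src:(\d+)\]")

def auto_score(log: dict) -> dict:
    """Score one BrainAnalysisLog row on the automatable dimensions."""
    narrative = log["narrative_output"]
    n_sources = len(log["sources_read"])

    # Dim 1 (partial): every [src:N] must resolve to a real sources_read
    # entry. Computing the full citation *rate* needs claim segmentation,
    # which is part of why dimension 2 still requires a human pass.
    refs = [int(n) for n in CITATION.findall(narrative)]
    citations_valid = all(1 <= n <= n_sources for n in refs)

    # Dim 3: staleness flag (the staleness phrase in the narrative is
    # checked separately against the fixture).
    staleness_ok = bool(log.get("data_staleness_acknowledged"))

    # Dim 5: every proposed action must map to a real lever.
    levers_ok = all(a["action_type"] in ACTION_ENUM
                    for a in log.get("proposed_actions", []))

    # Dim 6: confidence flag distribution, to compare against the healthy
    # ~60/25/<=10 bands.
    flags = [c["confidence"] for c in log.get("claims", [])]
    dist = {f: flags.count(f) / len(flags) for f in set(flags)} if flags else {}

    return {
        "citations_valid": citations_valid,
        "staleness_ok": staleness_ok,
        "levers_ok": levers_ok,
        "confidence_dist": dist,
    }
```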
Run mechanics
  1. Snapshot Tier 1 source data at the time-of-question state for each of the 30 questions. Snapshots persist in the question catalog as part of the fixture.
  2. Invoke the brain with the question and the snapshotted brain-ready views as context. Capture the full BrainAnalysisLog row.
  3. Auto-score dimensions 1, 3, 5, 6 (and partially 2). Output: per-question auto-score JSON.
  4. Human review for dimensions 2 (full), 4, 7. Reviewer reads the brain's narrative + sources_read + retrospective ground truth, scores per dimension, leaves comments.
  5. Aggregate per-dimension scores across the 30 questions. Apply the pass/fail decision logic.
  6. Write a BrainEvalLog row capturing the run (see the sketch after this list): harness version, brain version, model, prompt hash, per-question scores, aggregated scores, pass/fail decision, reviewer identity, run timestamp.
  7. Surface results in a RevOps dashboard. Track quarterly trend.
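A minimal sketch of the step-6 write, assuming the fields listed above map one-to-one onto BrainEvalLog columns and that step 5's aggregation is a plain per-dimension mean (both assumptions):

```python
import hashlib
import json
from datetime import datetime, timezone

def aggregate(per_question: list[dict]) -> dict:
    """Per-dimension mean across the 30 questions (assumed aggregation)."""
    dims = {d for q in per_question for d in q}
    return {d: sum(q.get(d, 0.0) for q in per_question) / len(per_question)
            for d in dims}

def brain_eval_log_row(harness_version: str, brain_version: str, model: str,
                       system_prompt: str, per_question: list[dict],
                       decision: str, reviewer: str) -> dict:
    """Assemble the BrainEvalLog row written in step 6."""
    return {
        "harness_version": harness_version,
        "brain_version": brain_version,
        "model": model,
        # Hash rather than store the prompt, so prompt drift between runs
        # is detectable without embedding the full prompt in the log.
        "prompt_hash": hashlib.sha256(system_prompt.encode()).hexdigest(),
        "per_question_scores": json.dumps(per_question),
        "aggregate_scores": json.dumps(aggregate(per_question)),
        "decision": decision,
        "reviewer": reviewer,
        "run_ts": datetime.now(timezone.utc).isoformat(),
    }
```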
Run cadence target: 1 business day end-to-end for a quarterly run with one reviewer. Pre-launch may take 2–3 days because the catalog itself is being constructed alongside.
Drift detection — what to watch for over time
| Drift signal | Likely cause | Action |
| --- | --- | --- |
| Source citation rate falling | Brain producing more inference, less direct citation. May correlate with prompt changes that emphasized synthesis over fidelity. | Review recent prompt diffs. Re-emphasize the citation requirement in the system prompt. |
| Hallucination rate rising | Model failure mode, or brain-ready view contract drift (brain receiving incomplete data and filling gaps) | Audit recent BrainAnalysisLog rows. Check brain-ready view definitions vs. the current Tier 1 schema. Consider a model swap. |
| Diagnosis accuracy falling | GTM context shift (new product, new ICP) that the brain hasn't been retuned for. The brain is still smart on yesterday's questions but wrong on today's. | Refresh the question catalog with newer questions. Update the system prompt with current ICP / product context. |
| Confidence calibration skewing toward high_confidence | Prompt nudged toward a "decisive" output style. Often correlates with rising hallucination. | Pull back. Honest calibration beats decisive calibration. |
| Narrative coherence falling | System prompt accumulation — too many instructions, brain losing focus | Prompt simplification pass. Trim; reorganize. |

Quarterly trend on each dimension is plotted in the RevOps eval dashboard. Sustained drift over 2 consecutive quarters in any dimension triggers a brain calibration sprint.
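A sketch of the two-quarter trigger, reading "sustained drift" as two consecutive quarter-over-quarter declines in a dimension's aggregate score (one plausible reading; the exact definition belongs to RevOps):

```python
def dimensions_needing_calibration(trend: dict[str, list[float]]) -> list[str]:
    """Flag dimensions that declined in each of the last two quarters.

    trend: dimension name -> aggregate scores ordered oldest to newest.
    """
    return [
        dim for dim, scores in trend.items()
        if len(scores) >= 3 and scores[-1] < scores[-2] < scores[-3]
    ]

# Any non-empty result triggers a brain calibration sprint.
```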

Cost of running the eval

Eval is itself a cost line, sized to be small enough to absorb easily.

| Component | Cost |
| --- | --- |
| 30 brain calls per run at ~30K input + ~3K output tokens each, Sonnet pricing | ~$3 per full run |
| Quarterly cadence + 2–3 ad-hoc runs per year | ~$25/yr in token cost |
| Human reviewer time | ~6 hours per quarterly run for 30 questions = ~24 hours/year ≈ 0.012 FTE |
| Catalog maintenance | ~4 hours per quarter to add/refresh questions = ~16 hours/year |
Total: under $50/yr token + ~40 hours of analyst time. Cheaper than a single bad brain output that loses a deal.
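The token line is back-of-envelope arithmetic; reproduced here with illustrative per-million rates (placeholders, not quoted pricing):

```python
IN_RATE, OUT_RATE = 3.0, 15.0   # $/M tokens -- assumed, not quoted pricing
CALLS, IN_TOK, OUT_TOK = 30, 30_000, 3_000
RUNS_PER_YEAR = 4 + 3           # quarterly + ad-hoc

per_run = CALLS * (IN_TOK * IN_RATE + OUT_TOK * OUT_RATE) / 1_000_000
print(f"~${per_run:.2f}/run, ~${per_run * RUNS_PER_YEAR:.0f}/yr")
# -> a few dollars per run and a few tens of dollars per year, the same
#    order of magnitude as the table's estimates.
```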
Eval failure modes — what to watch for in the eval itself
| Symptom | Cause | Action |
| --- | --- | --- |
| Eval consistently passes; production users complain | Eval over-fits to easy questions; doesn't span the actual use case distribution | Refresh the catalog with questions sourced from production complaints. Review use case taxonomy coverage. |
| Eval consistently fails; users report the brain works fine | Eval too strict, or ground truth in the catalog is wrong | Audit the specific failing questions. Validate ground truth with a second reviewer. |
| Reviewer disagrees with their prior-quarter scores on the same outputs | Reviewer drift, or the rubric is too vague | Tighten rubric definitions. Calibrate reviewers periodically with golden-set questions. |
| Brain "memorizes" the eval (suspicious if the same questions show consistently top scores) | System prompt or fine-tune leaked eval questions | Audit the prompt for eval contamination. Rotate question wording in the catalog. Hold out a small private set for spot checks. |
| Eval takes too long to run, gets skipped | Process burden too high | Streamline auto-scoring. Reduce reviewer cognitive load. Don't compromise on hard criteria; consider thinning soft criteria if the burden is real. |
Pre-launch checklist — before either brain goes live