Brain Eval Harness — Methodology & Rubric

Layer 9: Reasoning · Evaluation infrastructure · Pre-launch gate · Quarterly drift detection · CFO-defensible measurement of brain output quality
Eval before launch, eval forever. Per the v26 architecture directive: build the offline eval harness before turning any Brain Agent on in production. This is also the CFO defense — when costs come up, point at measurable accuracy gains. The harness is permanent, not one-and-done; re-run quarterly to detect drift, and on every material prompt or model change.
Purpose

Brain Agents (AGT-901, AGT-902) produce non-deterministic outputs informed by deterministic Tier 1 sources. The eval harness measures whether those outputs are defensible — cite real sources, recognize stale data, match retrospective ground truth, and route proposed actions to real existing levers. Without this measurement, the brain layer is a pile of opinions; with it, the layer earns its place beside the deterministic backbone.

When to run the eval
| Trigger | Scope | Pass requirement |
| --- | --- | --- |
| Pre-launch (one time per brain) | Full 30-question harness against the proposed brain config (model + system prompt + brain-ready views) | Hard gate. Brain does not go live in production until all hard-requirement criteria pass and at least 70% of soft criteria are within thresholds. |
| Quarterly (recurring) | Full 30-question harness, on the same questions with refreshed source data | Soft. If thresholds drift below pass, the brain remains live but RevOps gets a calibration alert and the next quarter's planning includes prompt/model adjustment. |
| Material prompt change | Full 30-question harness | Hard gate before the new prompt is deployed. Material = changes to the system prompt, change in default model tier, change in brain-ready view contracts. |
| New model version available | Full 30-question harness on the new model | Comparison against current production. Promotion only if the new model meets or beats current on all hard criteria and at least 3 of 4 soft criteria. |
| On-demand (operator suspicion) | Full or partial harness, RevOps choice | Diagnostic only — does not gate production. Used when RevOps suspects drift or has a complaint about a recent brain output. |
Harness composition — 30 historical questions
| Brain | Question count | Question types |
| --- | --- | --- |
| AGT-901 Pipeline Brain | 10 | 4 plan diagnosis · 3 coverage gap analysis · 2 quarterly play retrospective · 1 forecast bias attribution |
| AGT-902 Account Brain | 20 | 8 churn diagnosis · 6 expansion qualification · 4 hand-off briefing quality · 2 QBR narrative quality |

Question split favors AGT-902 because it operates at higher volume (per-account synthesis is the high-frequency use case) and because per-account historical ground truth is easier to construct than cross-segment ground truth. Both are real use cases worth eval coverage, but the volumes are different.

See Brain Eval Question Catalog for the full 30-question catalog with templates, fixtures, and expected outputs.
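For orientation, a catalog entry might carry roughly the following fields. This is a hedged sketch; the field names and ID scheme are assumptions for illustration, not the catalog's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class EvalQuestion:
    """One of the 30 catalog entries (field names illustrative)."""
    question_id: str              # e.g. "AGT-902-CHURN-03" (hypothetical ID scheme)
    brain: str                    # "AGT-901" or "AGT-902"
    question_type: str            # e.g. "churn_diagnosis"
    prompt: str                   # the question posed to the brain
    fixture_ref: str              # pointer to the frozen time-of-question snapshot
    ground_truth_causes: list[str] = field(default_factory=list)
    stale_fixture: bool = False   # True for deliberate-staleness questions (dimension 3)
```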

Question construction — what makes a good retrospective question

A good retrospective eval question has four properties:

  1. Ground truth exists. The right answer is knowable in retrospect because we can see what actually happened (deal closed/lost, customer churned/renewed, expansion materialized/didn't, plan met/missed).
  2. Source data is reconstructible. The Tier 1 service tables can be replayed to the state they were in at the time the question would have been asked. The brain reads the historical state, not today's state.
  3. Multiple defensible answers are possible — but some are clearly better. Eval is not "exact match" — it's "did the brain identify the right drivers, propose the right levers, recognize the data freshness, cite real sources." A wrong-but-defensible answer scores partial; a confident-and-wrong answer scores zero on hallucination dimension.
  4. Spans the brain's actual use cases. Don't over-fit the eval to easy questions. Coverage across the use case taxonomy in the brain spec is what makes eval predictive of production performance.
Hard rule: if you can't reconstruct what the Tier 1 service tables looked like at the time the question would have been asked, the question is not eligible for the harness. The brain reads the time-of-question state, not the post-hoc state. Otherwise the eval measures hindsight bias, not brain quality.
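To make the hard rule concrete, here is a minimal sketch of time-of-question replay, assuming Tier 1 tables keep row-level `valid_from`/`valid_to` version history; the field names and table shape are illustrative, not the actual Tier 1 schema:

```python
from datetime import datetime

def replay_as_of(rows: list[dict], question_time: datetime) -> list[dict]:
    """Reconstruct a Tier 1 service table at its time-of-question state.

    Assumes each row version carries valid_from/valid_to timestamps
    (valid_to is None for the current version). Versions whose validity
    window does not cover question_time are excluded, so the brain never
    sees post-hoc state.
    """
    return [
        r for r in rows
        if r["valid_from"] <= question_time
        and (r["valid_to"] is None or question_time < r["valid_to"])
    ]

# Eligibility check per the hard rule: a question enters the harness only
# if every source table it needs can be reconstructed this way.
```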
Scoring rubric — per-question dimensions

Every brain output is scored across 7 dimensions. Some are hard requirements (binary pass/fail); some are soft thresholds (graded). Aggregated across the 30 questions, the brain either passes the harness or it doesn't.

| Dimension | Type | Measurement | Pass threshold |
| --- | --- | --- | --- |
| 1. Source citation rate | Soft | % of numerical claims in narrative_output that carry a valid [src:N] citation pointing to a real sources_read entry | ≥ 95% |
| 2. Hallucination rate | Hard | % of claims containing values not supported by any cited source. Scored by a human eval reviewer comparing narrative_output to sources_read. | ≤ 2% (hard) |
| 3. Staleness recognition | Hard | For questions with deliberately stale fixture data: did the brain surface staleness in the narrative AND set data_staleness_acknowledged = TRUE? | 100% (hard — no exceptions) |
| 4. Diagnosis accuracy | Soft | Did the brain's top-2 identified causes match the retrospective ground truth? Scored 1.0 if both match, 0.5 if one matches, 0.0 if neither. | ≥ 0.70 mean across questions |
| 5. Lever-mapping correctness | Hard (AGT-902 only) | Every proposed_actions[].action_type must be in the AGT-902 enum (pull_qbr_forward, open_expansion_play, brief_new_ae_or_csm, customer_communication, escalate_to_slm, recommend_human_query, none). Trigger-enforced at write, validated at eval. | 100% (hard) |
| 6. Confidence calibration | Soft | Distribution of confidence flags across claims. Healthy: ~60% high_confidence, ~25% inference, ≤10% speculation. Unhealthy: 100% high_confidence with errors (dishonest), or 50%+ speculation (over-cautious). | Distribution within 15pp of healthy bands |
| 7. Narrative coherence | Soft | Human reviewer rating: does the narrative make sense, follow from the cited sources, and lead to the proposed actions? Scored 1–5 per question. | ≥ 4.0 mean across questions |
Any hard-criterion failure (hallucination > 2%, staleness recognition < 100%, lever-mapping < 100%) is an automatic harness fail regardless of soft-criterion performance. The brain does not go to production with hard failures — eval gates that out.
Pass/fail decision logic
| Condition | Outcome |
| --- | --- |
| All hard criteria pass + all soft criteria pass | PASS — brain approved for production |
| All hard criteria pass + 3 of 4 soft criteria pass + no soft criterion is >15pp below threshold | PASS-WITH-NOTES — production approved, calibration items logged for next iteration |
| All hard criteria pass + soft criteria significantly below thresholds | CONDITIONAL — brain may run in shadow mode (outputs logged, not surfaced) for a calibration period; re-eval before promotion to user-facing |
| Any hard criterion fails | FAIL — brain blocked from production. Diagnose root cause, fix, re-run the full harness. |
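One way to encode this table, assuming soft-criterion scores have been normalized to signed percentage-point margins against their thresholds (the normalization is an assumption; the table's intent is the contract):

```python
def harness_outcome(hard_pass: dict[str, bool],
                    soft_margin_pp: dict[str, float]) -> str:
    """Apply the pass/fail decision table to one harness run.

    hard_pass:      hard dimension -> True/False (hallucination, staleness,
                    lever-mapping).
    soft_margin_pp: soft dimension -> signed margin vs. its threshold in
                    percentage points (negative = below threshold).
    """
    if not all(hard_pass.values()):
        return "FAIL"  # any hard failure blocks production outright
    below = [m for m in soft_margin_pp.values() if m < 0]
    if not below:
        return "PASS"
    # 3 of 4 soft criteria pass, and nothing is more than 15pp under
    if len(below) == 1 and min(below) >= -15:
        return "PASS-WITH-NOTES"
    return "CONDITIONAL"  # shadow mode, re-eval before promotion
```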
Human review — what's automated vs. what isn't

The harness combines automated checks and human judgment. Both are required for a defensible eval.

| Dimension | Automated | Human required |
| --- | --- | --- |
| Source citation rate | Yes — regex-parse [src:N], validate against sources_read | No |
| Hallucination rate | Partial — can flag claims without citations automatically | Yes — reviewer compares cited claims to source values, validates support |
| Staleness recognition | Yes — check the data_staleness_acknowledged flag and the presence of a staleness phrase in the narrative | No |
| Diagnosis accuracy | No | Yes — reviewer compares the brain's top-2 causes to the retrospective ground truth (provided in the question template) |
| Lever-mapping correctness | Yes — enum validation | No |
| Confidence calibration | Yes — distribution math | Sometimes — flagged outputs reviewed for honest calibration |
| Narrative coherence | No | Yes — reviewer rates 1–5 |
Reviewer should be a RevOps team member familiar with the underlying GTM context but not the person who tuned the brain prompt. Independence matters — the reviewer is the auditor for this exercise.
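A sketch of the auto-scorable checks (dimensions 1, 3, 5, 6), assuming a BrainAnalysisLog row deserializes to a dict with the field names used in this document; the `claims` structure is an assumption:

```python
import re

# Levers from the AGT-902 enum in dimension 5.
ACTION_ENUM = {
    "pull_qbr_forward", "open_expansion_play", "brief_new_ae_or_csm",
    "customer_communication", "escalate_to_slm", "recommend_human_query",
    "none",
}
CITATION = re.compile(r"\[src:(\d+)\]")

def auto_score(log: dict) -> dict:
    """Score one BrainAnalysisLog row on the automatable dimensions."""
    narrative = log["narrative_output"]
    n_sources = len(log["sources_read"])

    # Dim 1 (partial): every [src:N] must resolve to a real sources_read
    # entry. Computing the full citation *rate* needs claim segmentation,
    # which is part of why dimension 2 still requires a human pass.
    refs = [int(n) for n in CITATION.findall(narrative)]
    citations_valid = all(1 <= n <= n_sources for n in refs)

    # Dim 3: staleness flag (the staleness phrase in the narrative is
    # checked separately against the fixture).
    staleness_ok = bool(log.get("data_staleness_acknowledged"))

    # Dim 5: every proposed action must map to a real lever.
    levers_ok = all(a["action_type"] in ACTION_ENUM
                    for a in log.get("proposed_actions", []))

    # Dim 6: confidence flag distribution, to compare against the healthy
    # ~60/25/<=10 bands.
    flags = [c["confidence"] for c in log.get("claims", [])]
    dist = {f: flags.count(f) / len(flags) for f in set(flags)} if flags else {}

    return {
        "citations_valid": citations_valid,
        "staleness_ok": staleness_ok,
        "levers_ok": levers_ok,
        "confidence_dist": dist,
    }
```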
Run mechanics
  1. Snapshot Tier 1 source data at the time-of-question state for each of the 30 questions. Snapshots persist in the question catalog as part of the fixture.
  2. Invoke the brain with the question and the snapshotted brain-ready views as context. Capture the full BrainAnalysisLog row.
  3. Auto-score dimensions 1, 3, 5, 6 (and partially 2). Output: per-question auto-score JSON.
  4. Human review for dimensions 2 (full), 4, 7. Reviewer reads the brain's narrative + sources_read + retrospective ground truth, scores per dimension, leaves comments.
  5. Aggregate per-dimension scores across the 30 questions. Apply the pass/fail decision logic.
  6. Write a BrainEvalLog row capturing the run (see the sketch after this list): harness version, brain version, model, prompt hash, per-question scores, aggregated scores, pass/fail decision, reviewer identity, run timestamp.
  7. Surface results in a RevOps dashboard. Track quarterly trend.
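A minimal sketch of the step-6 write, assuming the fields listed above map one-to-one onto BrainEvalLog columns and that step 5's aggregation is a plain per-dimension mean (both assumptions):

```python
import hashlib
import json
from datetime import datetime, timezone

def aggregate(per_question: list[dict]) -> dict:
    """Per-dimension mean across the 30 questions (assumed aggregation)."""
    dims = {d for q in per_question for d in q}
    return {d: sum(q.get(d, 0.0) for q in per_question) / len(per_question)
            for d in dims}

def brain_eval_log_row(harness_version: str, brain_version: str, model: str,
                       system_prompt: str, per_question: list[dict],
                       decision: str, reviewer: str) -> dict:
    """Assemble the BrainEvalLog row written in step 6."""
    return {
        "harness_version": harness_version,
        "brain_version": brain_version,
        "model": model,
        # Hash rather than store the prompt, so prompt drift between runs
        # is detectable without embedding the full prompt in the log.
        "prompt_hash": hashlib.sha256(system_prompt.encode()).hexdigest(),
        "per_question_scores": json.dumps(per_question),
        "aggregate_scores": json.dumps(aggregate(per_question)),
        "decision": decision,
        "reviewer": reviewer,
        "run_ts": datetime.now(timezone.utc).isoformat(),
    }
```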
Run cadence target: 1 business day end-to-end for a quarterly run with one reviewer. Pre-launch may take 2–3 days because the catalog itself is being constructed alongside.
Drift detection — what to watch for over time
| Drift signal | Likely cause | Action |
| --- | --- | --- |
| Source citation rate falling | Brain producing more inference, less direct citation. May correlate with prompt changes that emphasized synthesis over fidelity. | Review recent prompt diffs. Re-emphasize the citation requirement in the system prompt. |
| Hallucination rate rising | Model failure mode, or brain-ready view contract drift (brain receiving incomplete data and filling gaps) | Audit recent BrainAnalysisLog rows. Check brain-ready view definitions vs. the current Tier 1 schema. Consider a model swap. |
| Diagnosis accuracy falling | GTM context shift (new product, new ICP) that the brain hasn't been retuned for. The brain is still smart on yesterday's questions but wrong on today's. | Refresh the question catalog with newer questions. Update the system prompt with current ICP / product context. |
| Confidence calibration skewing toward high_confidence | Prompt nudged toward a "decisive" output style. Often correlates with rising hallucination. | Pull back. Honest calibration beats decisive calibration. |
| Narrative coherence falling | System prompt accumulation — too many instructions, brain losing focus | Prompt simplification pass. Trim; reorganize. |

Quarterly trend on each dimension is plotted in the RevOps eval dashboard. Sustained drift over 2 consecutive quarters in any dimension triggers a brain calibration sprint.
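A sketch of the two-quarter trigger, reading "sustained drift" as two consecutive quarter-over-quarter declines in a dimension's aggregate score (one plausible reading; the exact definition belongs to RevOps):

```python
def dimensions_needing_calibration(trend: dict[str, list[float]]) -> list[str]:
    """Flag dimensions that declined in each of the last two quarters.

    trend: dimension name -> aggregate scores ordered oldest to newest.
    """
    return [
        dim for dim, scores in trend.items()
        if len(scores) >= 3 and scores[-1] < scores[-2] < scores[-3]
    ]

# Any non-empty result triggers a brain calibration sprint.
```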

Cost of running the eval

Eval is itself a cost line, sized to be small enough to absorb easily.

| Component | Cost |
| --- | --- |
| 30 brain calls per run at ~30K input + ~3K output tokens each, Sonnet pricing | ~$3 per full run |
| Quarterly cadence + 2–3 ad-hoc runs per year | ~$25/yr in token cost |
| Human reviewer time | ~6 hours per quarterly run for 30 questions = ~24 hours/year ≈ 0.012 FTE |
| Catalog maintenance | ~4 hours per quarter to add/refresh questions = ~16 hours/year |
Total: under $50/yr token + ~40 hours of analyst time. Cheaper than a single bad brain output that loses a deal.
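The token line is back-of-envelope arithmetic; reproduced here with illustrative per-million rates (placeholders, not quoted pricing):

```python
IN_RATE, OUT_RATE = 3.0, 15.0   # $/M tokens -- assumed, not quoted pricing
CALLS, IN_TOK, OUT_TOK = 30, 30_000, 3_000
RUNS_PER_YEAR = 4 + 3           # quarterly + ad-hoc

per_run = CALLS * (IN_TOK * IN_RATE + OUT_TOK * OUT_RATE) / 1_000_000
print(f"~${per_run:.2f}/run, ~${per_run * RUNS_PER_YEAR:.0f}/yr")
# -> a few dollars per run and a few tens of dollars per year, the same
#    order of magnitude as the table's estimates.
```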
Eval failure modes — what to watch for in the eval itself
| Symptom | Cause | Action |
| --- | --- | --- |
| Eval consistently passes; production users complain | Eval over-fits to easy questions; doesn't span the actual use case distribution | Refresh the catalog with questions sourced from production complaints. Review use case taxonomy coverage. |
| Eval consistently fails; users report the brain works fine | Eval too strict, or ground truth in the catalog is wrong | Audit the specific failing questions. Validate ground truth with a second reviewer. |
| Reviewer disagrees with their prior-quarter scores on the same outputs | Reviewer drift, or the rubric is too vague | Tighten rubric definitions. Calibrate reviewers periodically with golden-set questions. |
| Brain "memorizes" the eval (suspicious if the same questions show consistently top scores) | System prompt or fine-tune leaked eval questions | Audit the prompt for eval contamination. Rotate question wording in the catalog. Hold out a small private set for spot checks. |
| Eval takes too long to run, gets skipped | Process burden too high | Streamline auto-scoring. Reduce reviewer cognitive load. Don't compromise on hard criteria; consider thinning soft criteria if the burden is real. |
Pre-launch checklist — before either brain goes live