Brain Agents (AGT-901, AGT-902) produce non-deterministic outputs informed by deterministic Tier 1 sources. The eval harness measures whether those outputs are defensible: whether they cite real sources, recognize stale data, match retrospective ground truth, and route proposed actions to levers that actually exist. Without this measurement, the brain layer is a pile of opinions; with it, the layer earns its place beside the deterministic backbone.
| Trigger | Scope | Pass requirement |
|---|---|---|
| Pre-launch (one time per brain) | Full 30-question harness against the proposed brain config (model + system prompt + brain-ready views) | Hard gate. Brain does not go live in production until all hard-requirement criteria pass + at least 70% of soft criteria are within thresholds. |
| Quarterly (recurring) | Full 30-question harness, on the same questions with refreshed source data | Soft. If thresholds drift below pass, brain remains live but RevOps gets a calibration alert and the next quarter's planning includes prompt/model adjustment. |
| Material prompt change | Full 30-question harness | Hard gate before the new prompt is deployed. Material = changes to system prompt, change in default model tier, change in brain-ready view contracts. |
| New model version available | Full 30-question harness on the new model | Comparison against current production. Promotion only if the new model meets or beats current on all hard criteria and on at least 3 of 4 soft criteria. |
| On-demand (operator suspicion) | Full or partial harness, RevOps choice | Diagnostic only — does not gate production. Used when RevOps suspects drift or has a complaint about a recent brain output. |
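The trigger policy above can be captured as a small lookup so a harness runner knows whether a given run gates deployment. A minimal sketch, assuming hypothetical trigger names; the actual runner and its configuration are not specified here.

```python
from enum import Enum

class Trigger(Enum):
    PRE_LAUNCH = "pre_launch"
    QUARTERLY = "quarterly"
    MATERIAL_PROMPT_CHANGE = "material_prompt_change"
    NEW_MODEL_VERSION = "new_model_version"
    ON_DEMAND = "on_demand"

# Whether a run under this trigger blocks deployment on failure.
# Mirrors the trigger table above; names are illustrative.
HARD_GATE = {
    Trigger.PRE_LAUNCH: True,
    Trigger.QUARTERLY: False,            # soft: calibration alert, brain stays live
    Trigger.MATERIAL_PROMPT_CHANGE: True,
    Trigger.NEW_MODEL_VERSION: True,     # gates promotion of the new model only
    Trigger.ON_DEMAND: False,            # diagnostic only
}
```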
| Brain | Question count | Question types |
|---|---|---|
| AGT-901 Pipeline Brain | 10 questions | 4 plan diagnosis · 3 coverage gap analysis · 2 quarterly play retrospective · 1 forecast bias attribution |
| AGT-902 Account Brain | 20 questions | 8 churn diagnosis · 6 expansion qualification · 4 hand-off briefing quality · 2 QBR narrative quality |
Question split favors AGT-902 because it operates at higher volume (per-account synthesis is the high-frequency use case) and because per-account historical ground truth is easier to construct than cross-segment ground truth. Both are real use cases worth eval coverage, but the volumes are different.
See Brain Eval Question Catalog for the full 30-question catalog with templates, fixtures, and expected outputs.
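As a sketch of what one catalog entry might carry (field names here are illustrative, not the catalog's actual schema), each question pairs a frozen fixture with retrospective ground truth and the hooks the scorer needs:

```python
from dataclasses import dataclass, field

@dataclass
class EvalQuestion:
    """One entry in the brain eval catalog. Field names are illustrative."""
    question_id: str                  # e.g. "AGT-902-CHURN-03" (hypothetical ID scheme)
    brain: str                        # "AGT-901" or "AGT-902"
    question_type: str                # e.g. "churn_diagnosis"
    prompt_template: str              # question text rendered against the fixture
    fixture_refs: list[str]           # frozen brain-ready view snapshots to load
    ground_truth_causes: list[str]    # retrospective causes, for dimension 4 scoring
    stale_fixture: bool = False       # True for staleness-recognition questions
    expected_actions: list[str] = field(default_factory=list)  # allowed lever enum values
```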
A good retrospective eval question has four properties; they are spelled out alongside each question's template, fixture, and expected output in that catalog.
Every brain output is scored across 7 dimensions. Some are hard requirements (binary pass/fail); some are soft thresholds (graded). Aggregated across the 30 questions, the brain either passes the harness or it doesn't.
| Dimension | Type | Measurement | Pass threshold |
|---|---|---|---|
| 1. Source citation rate | Soft | % of numerical claims in narrative_output that carry a valid [src:N] citation pointing to a real sources_read entry | ≥ 95% |
| 2. Hallucination rate | Hard | % of claims containing values not supported by any cited source. Scored by human eval reviewer comparing narrative_output to sources_read. | ≤ 2% (hard) |
| 3. Staleness recognition | Hard | For questions with deliberately stale fixture data: did the brain surface staleness in the narrative AND set data_staleness_acknowledged = TRUE? | 100% (hard — no exceptions) |
| 4. Diagnosis accuracy | Soft | Did the brain's top-2 identified causes match the retrospective ground truth? Scored 1.0 if both match, 0.5 if one matches, 0.0 if neither. | ≥ 0.70 mean across questions |
| 5. Lever-mapping correctness | Hard (AGT-902 only) | Every proposed_actions[].action_type must be in the AGT-902 enum (pull_qbr_forward, open_expansion_play, brief_new_ae_or_csm, customer_communication, escalate_to_slm, recommend_human_query, none). Trigger-enforced at write, validated at eval. | 100% (hard) |
| 6. Confidence calibration | Soft | Distribution of confidence flags across claims. Healthy: ~60% high_confidence, ~25% inference, ≤10% speculation. Unhealthy: 100% high_confidence with errors (dishonest), or 50%+ speculation (over-cautious). | Distribution within 15pp of healthy bands |
| 7. Narrative coherence | Soft | Human reviewer rating: does the narrative make sense, follow from the cited sources, lead to the proposed actions? Scored 1–5 per question. | ≥ 4.0 mean across questions |
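Two of the soft dimensions reduce to simple arithmetic. A minimal sketch of the diagnosis-accuracy scoring (dimension 4) and one reading of the "within 15pp of healthy bands" rule for confidence calibration (dimension 6); band centers follow the healthy distribution above, and helper names are illustrative:

```python
def diagnosis_accuracy(brain_top2: list[str], ground_truth: list[str]) -> float:
    """1.0 if both of the brain's top-2 causes match ground truth, 0.5 if one, else 0.0."""
    hits = sum(1 for cause in brain_top2[:2] if cause in ground_truth)
    return {2: 1.0, 1: 0.5}.get(hits, 0.0)

# Healthy confidence distribution from the table above (fractions of claims, not percents).
HEALTHY_BANDS = {"high_confidence": 0.60, "inference": 0.25, "speculation": 0.10}

def calibration_within_bands(observed: dict[str, float], tolerance_pp: float = 15.0) -> bool:
    """True if every confidence class sits within `tolerance_pp` percentage points
    of its healthy band value."""
    return all(
        abs(observed.get(flag, 0.0) - center) * 100 <= tolerance_pp
        for flag, center in HEALTHY_BANDS.items()
    )
```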
| Condition | Outcome |
|---|---|
| All hard criteria pass + all soft criteria pass | PASS — brain approved for production |
| All hard criteria pass + at least 3 of 4 soft criteria pass + no soft criterion is >15pp below threshold | PASS-WITH-NOTES — production approved, calibration items logged for next iteration |
| All hard criteria pass + soft criteria significantly below thresholds | CONDITIONAL — brain may run in shadow mode (outputs logged, not surfaced) for a calibration period; re-eval before promotion to user-facing |
| Any hard criterion fails | FAIL — brain blocked from production. Diagnose root cause, fix, re-run full harness. |
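Aggregated, the decision table maps onto a small verdict function. A sketch, assuming three hard and four soft criteria as defined above, and assuming the soft shortfalls can be expressed on a common percentage-point scale (the two mean-score criteria would need a normalization the table does not define); function and parameter names are illustrative:

```python
def harness_verdict(hard_pass: list[bool], soft_pass: list[bool],
                    worst_soft_shortfall_pp: float) -> str:
    """Map harness results onto the decision table. `worst_soft_shortfall_pp` is how far
    the worst soft criterion sits below its threshold, in percentage points (0 if all pass)."""
    if not all(hard_pass):
        return "FAIL"                  # any hard criterion fails: blocked, fix, re-run
    if all(soft_pass):
        return "PASS"
    if sum(soft_pass) >= 3 and worst_soft_shortfall_pp <= 15:
        return "PASS-WITH-NOTES"
    return "CONDITIONAL"               # shadow mode, re-eval before user-facing promotion
```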
The harness combines automated checks and human judgment. Both are required for a defensible eval.
| Dimension | Automated | Human required |
|---|---|---|
| Source citation rate | Yes — regex parse [src:N], validate against sources_read | No |
| Hallucination rate | Partial — can flag claims without citations automatically | Yes — reviewer compares cited claims to source values, validates support |
| Staleness recognition | Yes — check data_staleness_acknowledged flag and presence of staleness phrase in narrative | No |
| Diagnosis accuracy | No | Yes — reviewer compares brain's top-2 causes to retrospective ground truth (provided in question template) |
| Lever-mapping correctness | Yes — enum validation | No |
| Confidence calibration | Yes — distribution math | Sometimes — flagged outputs reviewed for honest calibration |
| Narrative coherence | No | Yes — reviewer rates 1–5 |
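The fully automated rows above are straightforward to script. A minimal sketch of the citation parse, staleness check, and lever enum validation, assuming the narrative carries `[src:N]` markers and the structured output uses the field names from the dimension table; the regex, indexing convention, and phrase check are illustrative:

```python
import re

SRC_PATTERN = re.compile(r"\[src:(\d+)\]")

# AGT-902 lever enum from dimension 5 above.
AGT_902_LEVERS = {
    "pull_qbr_forward", "open_expansion_play", "brief_new_ae_or_csm",
    "customer_communication", "escalate_to_slm", "recommend_human_query", "none",
}

def citations_valid(narrative_output: str, sources_read: list) -> bool:
    """Every [src:N] marker must point at a real sources_read entry
    (N assumed 1-based; the actual convention lives in the brain output contract)."""
    cited = {int(n) for n in SRC_PATTERN.findall(narrative_output)}
    return all(1 <= n <= len(sources_read) for n in cited)

def staleness_recognized(output: dict) -> bool:
    """Hard criterion 3: flag set AND staleness surfaced in the narrative
    (simple phrase check here; the real harness may match a richer phrase list)."""
    return bool(output.get("data_staleness_acknowledged")) and \
        "stale" in output.get("narrative_output", "").lower()

def levers_valid(output: dict) -> bool:
    """Hard criterion 5: every proposed action maps to a real AGT-902 lever."""
    return all(action.get("action_type") in AGT_902_LEVERS
               for action in output.get("proposed_actions", []))
```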
| Drift signal | Likely cause | Action |
|---|---|---|
| Source citation rate falling | Brain producing more inference, less direct citation. May correlate with prompt changes that emphasized synthesis over fidelity. | Review recent prompt diffs. Re-emphasize citation requirement in system prompt. |
| Hallucination rate rising | Model failure mode, or brain-ready view contract drift (brain receiving incomplete data and filling gaps) | Audit recent BrainAnalysisLog. Check brain-ready view definitions vs. current Tier 1 schema. Consider model swap. |
| Diagnosis accuracy falling | GTM context shift (new product, new ICP) that the brain hasn't been retuned for. Brain is still smart on yesterday's questions but wrong on today's. | Refresh question catalog with newer questions. Update system prompt with current ICP / product context. |
| Confidence calibration becoming more high-confidence | Prompt nudged toward "decisive" output style. Often correlates with rising hallucination. | Pull back. Honest calibration beats decisive calibration. |
| Narrative coherence falling | System prompt accumulation — too many instructions, brain losing focus | Prompt simplification pass. Trim; reorganize. |
Quarterly trend on each dimension is plotted in the RevOps eval dashboard. Sustained drift over 2 consecutive quarters in any dimension triggers a brain calibration sprint.
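The two-quarter rule can be checked mechanically from the dashboard series. A sketch under one reading of "sustained drift" (two consecutive quarter-over-quarter declines, optionally combined with slipping below the pass threshold); dimensions where lower is better, such as hallucination rate, would be negated before the check:

```python
def drift_two_quarters(series: list[float], threshold: float | None = None) -> bool:
    """True if the last two quarter-over-quarter moves are both declines (higher = better),
    and, if a pass threshold is given, the latest value has also slipped below it.
    A True result triggers a brain calibration sprint."""
    if len(series) < 3:
        return False
    declining = series[-1] < series[-2] < series[-3]
    below = threshold is None or series[-1] < threshold
    return declining and below
```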
Eval is itself a cost line; it is sized to be small enough to absorb easily.
| Component | Cost |
|---|---|
| 30 brain calls per run, ~30K input + ~3K output, Sonnet pricing | ~$3 per full run |
| Quarterly cadence + 2–3 ad-hoc runs per year | ~$25/yr in token cost |
| Human reviewer time | ~6 hours per quarterly run for 30 questions = ~24 hours/year ≈ 0.012 FTE |
| Catalog maintenance | ~4 hours per quarter to add/refresh questions = ~16 hours/year |
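The budget rows reduce to back-of-envelope math. A sketch with the run counts and hour figures from the table; the per-run token cost depends on current model pricing, so it is left as a parameter rather than asserted:

```python
def annual_eval_budget(cost_per_run_usd: float = 3.0,
                       runs_per_year: int = 7,              # 4 quarterly + ~3 ad-hoc
                       reviewer_hours_per_run: float = 6.0,
                       catalog_hours_per_quarter: float = 4.0) -> dict:
    """Back-of-envelope annual eval cost; defaults mirror the table above."""
    reviewer_hours = reviewer_hours_per_run * 4              # quarterly runs only
    return {
        "token_cost_usd": cost_per_run_usd * runs_per_year,  # low tens of dollars per year
        "reviewer_hours": reviewer_hours,                    # ~24 hrs/year
        "reviewer_fte": round(reviewer_hours / 2080, 3),     # ~0.012 FTE
        "catalog_hours": catalog_hours_per_quarter * 4,      # ~16 hrs/year
    }
```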
| Symptom | Cause | Action |
|---|---|---|
| Eval consistently passes; production users complain | Eval over-fits to easy questions; doesn't span actual use case distribution | Refresh catalog with questions sourced from production complaints. Review use case taxonomy coverage. |
| Eval consistently fails; users report the brain works fine | Eval too strict, or ground truth in catalog is wrong | Audit specific failing questions. Validate ground truth with second reviewer. |
| Reviewer disagrees with their prior-quarter scores on the same outputs | Reviewer drift, or the rubric is too vague | Tighten rubric definitions. Calibrate reviewers periodically with golden-set questions. |
| Brain "memorizes" the eval (suspicious if same questions show consistently top scores) | System prompt or fine-tune leaked eval questions | Audit prompt for eval contamination. Rotate question wording in catalog. Hold out a small private set for spot checks. |
| Eval takes too long to run, gets skipped | Process burden too high | Streamline auto-scoring. Reduce reviewer cognitive load. Don't compromise on hard criteria; consider thinning soft criteria if real burden. |