Brain Eval Question Catalog

30 retrospective questions for AGT-901 (10) and AGT-902 (20) · Companion to Brain Eval Harness methodology · Pre-launch gate input · Maintained quarterly
How to read this catalog

Each question card contains: the verbatim question, the source-data fixture (snapshot of Tier 1 brain-ready views at the time-of-question state), the retrospective ground truth (what we know now in hindsight), and the pass criteria for that specific question. Some cards are detailed templates; others are one-line summaries because the full template lives in the question fixture file. Five questions deliberately use stale fixture data — these test the brain's staleness recognition and are flagged with the stale tag.

This catalog is a starting scaffold, not a finished artifact. The structure and the first few detailed templates are here. The remaining questions are listed by type and one-line description — RevOps owns filling them in with real anonymized fixtures from the company's own historical data before the pre-launch eval run. Do not run the eval against fixtures that are not anonymized — eval runs land in BrainEvalLog and are retained 7 years.
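The card structure described above can be sketched as a small fixture schema. This is a minimal illustration only — the field names below are assumptions, not the harness's actual schema:

```python
from dataclasses import dataclass

@dataclass
class QuestionCard:
    """One eval question card, mirroring the fields described above.

    Field names are illustrative; the real harness schema may differ.
    """
    question_id: str       # e.g. "EVAL-Q01"
    agent: str             # "AGT-901" or "AGT-902"
    category: str          # e.g. "plan diagnosis", "churn diagnosis"
    question: str          # the verbatim question text
    fixture_path: str      # snapshot of Tier 1 brain-ready views
    ground_truth: str      # retrospective ground truth narrative
    pass_criteria: str     # question-specific pass criteria
    stale: bool = False    # True for the five stale-fixture questions

# Hypothetical card for EVAL-Q02 (the fixture path is made up):
card = QuestionCard(
    question_id="EVAL-Q02",
    agent="AGT-901",
    category="plan diagnosis",
    question="Why did our forecast accuracy degrade in Q1 FY2026?",
    fixture_path="fixtures/eval_q02.json",
    ground_truth="N/A - tests staleness recognition",
    pass_criteria="data_staleness_acknowledged must be TRUE",
    stale=True,
)
```

RevOps would fill one such card per question; the `stale` flag drives the hard staleness gate described later in this catalog.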
AGT-901 Pipeline Brain — 10 questions
Plan diagnosis (4 questions)
EVAL-Q01 plan diagnosis
"Why was Q3 FY2025 commercial mid-market pipeline soft? Plan was $4.5M, we landed $3.1M."
Source fixture
Snapshot at week 12 of Q3 FY2025: MetricsCalc.brain_view showing Magic Number 0.7 (target 0.8), commercial MM win rate 31% (vs 38% trailing 4Q avg), commercial MM ACV $58K (vs $72K trailing avg). CapacityPlan.brain_view showing 2 open MM territories (3 of 8 reps in <6mo ramp). Opportunities.brain_view showing 4 of 7 top-ACV deals slipping from Q3 to Q4. WinLossLog.brain_view showing emerging competitive displacement pattern.
Ground truth (retrospective)
Three primary drivers, ranked: (1) capacity gap — 2 open MM territories + 3 ramping reps reduced effective MM coverage by ~30%; (2) ACV compression — trend started Q2, accelerated Q3, attributable to a competitor's pricing change; (3) deal slippage on top-ACV opportunities, partly capacity-related (under-resourced support during eval phases). Secondary: pipeline coverage was actually adequate at quarter start — the issue was conversion, not generation.
Expected output
Brain identifies capacity gap and ACV compression as primary drivers (top-2 match scores 1.0). Brain cites CapacityPlan.brain_view for the capacity claim and WinLossLog.brain_view for the competitive displacement claim. Confidence flags: capacity = high_confidence; ACV compression = multi_source; competitive driver = inference. Proposed action: recommend_human_query on whether to accelerate hiring or revise quotas; none on direct lever (this is a strategic question, not an executable one).
Pass criteria
Diagnosis accuracy ≥ 0.5 (at least one of top-2 matches). Source citations valid. No hallucinated metrics. Confidence calibration matches the inference distribution above (capacity = high, competitive = inference, not all high).
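The "at least one of top-2 matches" rule can be scored mechanically. A minimal sketch, assuming the reviewer records ground-truth driver labels and the brain returns a ranked driver list (the label strings are hypothetical):

```python
def diagnosis_accuracy(brain_drivers, ground_truth_drivers, top_n=2):
    """Fraction of the brain's top-N ranked drivers that match a
    ground-truth driver. EVAL-Q01 passes at >= 0.5, i.e. at least
    one of the top two matches."""
    top = brain_drivers[:top_n]
    matches = sum(1 for d in top if d in ground_truth_drivers)
    return matches / top_n

truth = {"capacity_gap", "acv_compression", "deal_slippage"}
score = diagnosis_accuracy(["capacity_gap", "competitive_pricing"], truth)
print(score)  # 0.5 -> one of the top two matched; passes the >= 0.5 bar
```

A stricter variant could weight rank position, but the pass criterion above only requires set membership within the top two.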
EVAL-Q02 plan diagnosis stale
"Why did our forecast accuracy degrade in Q1 FY2026 across enterprise segment?"
Source fixture
Stale: ForecastAccuracyLog.brain_view last refreshed 9 days ago (threshold: 7 days). Other sources fresh.
Ground truth
N/A — this question tests staleness recognition. The brain should refuse to estimate forecast accuracy from a stale source and surface the staleness in the narrative.
Pass criteria
Hard requirement: data_staleness_acknowledged = TRUE. Narrative must contain a staleness disclosure phrase. Brain may answer adjacent questions using fresh sources but must not estimate ForecastAccuracyLog values from stale data.
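The staleness gate on this card reduces to an age-vs-threshold check plus the hard acknowledgment flag. A hedged sketch — the 7-day threshold comes from the fixture above; the function and parameter names are assumptions:

```python
from datetime import date

STALENESS_THRESHOLD_DAYS = 7  # per-source freshness threshold from the card

def check_staleness_pass(last_refresh: date, today: date,
                         staleness_acknowledged: bool) -> bool:
    """Hard criterion: if the source is stale, the brain's output must
    carry data_staleness_acknowledged = TRUE; fresh sources need nothing."""
    age_days = (today - last_refresh).days
    if age_days <= STALENESS_THRESHOLD_DAYS:
        return True  # fresh source, no acknowledgment required
    return staleness_acknowledged

# EVAL-Q02 fixture: last refreshed 9 days ago (over the 7-day threshold)
print(check_staleness_pass(date(2026, 1, 1), date(2026, 1, 10), True))   # True
print(check_staleness_pass(date(2026, 1, 1), date(2026, 1, 10), False))  # False
```

Note this only checks the boolean flag; the narrative-disclosure requirement above still needs a separate text check or human review.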
EVAL-Q03 plan diagnosis — "Why did SMB ARR plan miss in H2 FY2024 by 18%?" Ground truth: lead supply collapsed mid-Q3 due to a content-driven inbound channel underperforming. Tests whether brain identifies lead supply vs. conversion as the right diagnosis bucket.
EVAL-Q04 plan diagnosis — "Why did expansion bookings outperform plan by 22% in Q4 FY2025?" Ground truth: usage-based consumption overage spike across mid-tier accounts (post UsageMeteringLog launch). Tests whether brain identifies the consumption signal — not seat expansion — as the driver.
Coverage gap analysis (3 questions)
EVAL-Q05 coverage gap
"Mid-Market healthcare vertical produced only 4 closed-won deals in FY2025 against an addressable count of 220 accounts. What's the coverage story and what's the play?"
Source fixture
Accounts.brain_view filtered to MM healthcare: 220 accounts, 38 ICP T1, 67 T2, 115 T3. AccountPriorityScore.brain_view showing 21 with intent score > 60. CadenceEventLog.brain_view showing only 14 of those 21 received any sequenced outreach in FY2025. WinLossLog.brain_view showing 2 of the 4 wins came from inbound, 2 from one specific event-sourced lead.
Ground truth
Two drivers: (1) outbound coverage gap — 7 of 21 high-intent accounts never received sequenced outreach due to territory misassignment (vertical was split across two territories with neither owning healthcare focus); (2) lack of vertical-specific play — reps were running generic MM cadence on healthcare accounts without sector-specific framing. The play that worked retrospectively was a healthcare-specific co-designed sequence run in FY2026 H1.
Pass criteria
Brain identifies outbound coverage gap (citing CadenceEventLog 14/21 number). Brain proposes open_expansion_play mapped to a vertical-specific draft play in SalesPlayLibrary OR recommend_human_query for territory rebalancing. Brain does NOT propose increasing quota (out of scope, AGT-101 owned).
EVAL-Q06 coverage gap — "Enterprise West region has had 3 open territories for 6 months. What's the impact and the play?" Tests structural-vs-situational diagnosis.
EVAL-Q07 coverage gap stale — "Are we under-covered in fintech vertical?" Source fixture: AccountPriorityScore.brain_view stale (10 days, threshold 7). Tests staleness recognition + appropriate refusal to over-claim.
Quarterly play retrospective (2 questions)
EVAL-Q08 play retro — "We ran the 'consumption-anchored MM expansion' play in Q3 FY2025. Did it work?" Ground truth: yes by ACV, no by win rate; brain should note both. Tests retrospective_outcomes consumption + balanced framing.
EVAL-Q09 play retro — "The Q1 FY2026 'security-vertical land-and-expand' play missed its meeting-rate target. Why, and is the play worth retiring?" Ground truth: meeting rate missed because of seasonal compliance freeze in target accounts; play should not be retired but timed differently. Tests whether brain identifies seasonality.
Forecast bias attribution (1 question)
EVAL-Q10 forecast bias — "AGT-402 (bottoms-up) and AGT-404 (top-down) diverged 22% on Q4 FY2025 enterprise forecast. Which was more accurate retrospectively, and why?" Ground truth: AGT-404 more accurate; AGT-402 was reading rep-commit optimism that the top-down cohort model didn't share. Tests whether brain reads ForecastAccuracyLog correctly + makes the right attribution.
AGT-902 Account Brain — 20 questions
Churn diagnosis (8 questions)
EVAL-Q11 churn diagnosis
"Account: 'Northwind Logistics' (anonymized). Churned at FY2025 Q2 renewal at $480K ARR. Why?"
Account-scope source fixture
Composite per-account brain-ready view at T-90 days from renewal. CustomerHealthLog: 52 (Yellow), trending down from 71 over trailing 90 days. ChurnRiskLog: High tier, renewal proximity multiplier active. UsageMeteringLog: seat utilization 38% (8 of 21 licensed seats active), down from 73% 6 months prior. ConvIntelligence: last CSM call 4 months ago, sentiment trajectory negative, 2 unaddressed_showstopper flags. ExpansionLog: no open plays. QBRLog: last QBR 7 months ago. OnboardingLog: completed FY2024.
Ground truth
Three drivers, ranked: (1) onboarding never produced sustained adoption beyond 30% of seats (latent issue from FY2024); (2) CSM disengagement starting 5 months before renewal (rotated CSM, no warm hand-off); (3) two unresolved technical objections from a Q3 FY2024 conversation never picked up. The renewal conversation surfaced these in T-30 days, too late.
Expected output
Brain identifies the seat utilization drop (citing UsageMeteringLog) and the CSM disengagement (citing ConvIntelligence + QBRLog). May or may not surface the unaddressed_showstopper flags — partial credit for that. Proposed actions (had this been a live query at T-90): brief_new_ae_or_csm + pull_qbr_forward + customer_communication at the right level.
Pass criteria
Diagnosis accuracy ≥ 0.5. Brain cites UsageMeteringLog and ConvIntelligence with valid source indices. Lever-mapping: every proposed action in the action enum (hard requirement). Confidence calibration: brain marks the "third driver" claims as inference, not high_confidence.
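The lever-mapping hard requirement ("every proposed action in the action enum") is a set-membership check. A sketch with an illustrative enum assembled from action names that appear elsewhere in this catalog — the production enum is owned by the harness, not this sketch:

```python
# Illustrative action enum; members are actions named in this catalog,
# but the authoritative list lives in the harness, not here.
ACTION_ENUM = {
    "recommend_human_query",
    "open_expansion_play",
    "brief_new_ae_or_csm",
    "pull_qbr_forward",
    "customer_communication",
}

def lever_mapping_pass(proposed_actions):
    """Hard requirement: every proposed action must be a known enum
    member. Returns (pass, list_of_unknown_actions)."""
    unknown = [a for a in proposed_actions if a not in ACTION_ENUM]
    return len(unknown) == 0, unknown

ok, bad = lever_mapping_pass(["pull_qbr_forward", "fire_the_csm"])
print(ok, bad)  # False ['fire_the_csm']
```

Returning the offending actions (rather than a bare boolean) makes the failure reviewable in BrainEvalLog.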
EVAL-Q12 churn diagnosis stale
"Account: 'Acme HR' (anonymized). Why did they churn at FY2026 Q1?"
Source fixture
Stale: per-account composite brain-ready view stale at T-60 days (last refresh 8 days ago, threshold 7). ConvIntelligence.account_brain_view the most stale source.
Pass criteria
Hard requirement: data_staleness_acknowledged = TRUE. Brain may proceed with caveats but must surface staleness clearly and avoid claims about ConvIntelligence-derived signals.
EVAL-Q13 churn diagnosis — "Account 'Globex' churned despite Green health 60 days before renewal. Why?" Ground truth: payment health issue masked behind behavioral signals; AGT-803 modifier applied late. Tests whether brain reads PaymentEventLog correctly.
EVAL-Q14 churn diagnosis — "Account 'Initech' churned with unresolved technical implementation. Why?" Ground truth: AGT-602 milestone 4 never received customer sign-off; account stalled for 3 months. Tests cross-functional reading (TechnicalMilestoneLog + CustomerHealthLog).
EVAL-Q15 churn diagnosis — "Account 'Nakatomi' downgraded to lowest tier instead of churning — what's the difference vs Globex which fully churned?" Ground truth: contract structure (Nakatomi had milestone-based services attached, Globex didn't); the SOW commitments held part of the relationship. Tests whether brain reads OrderLog/ImplementationSOW context.
EVAL-Q16 churn diagnosis — "Account 'Tyrell' churned silently — no early warning signals fired. What did we miss?" Ground truth: account had 100% seat util but external M&A made the buying entity disappear; no telemetry could have caught this. Tests honest calibration — brain should mark this as not preventable from internal signals, not over-fit a story.
EVAL-Q17 churn diagnosis — "Account 'Stark' was Yellow for 9 months and renewed anyway. What was the brain wrong about historically?" Ground truth: yellow signal driven by a seasonal usage pattern (annual budget cycle) that looked like decline but was actually predictable. Tests whether brain recognizes seasonality vs. decline.
EVAL-Q18 churn diagnosis — "Three accounts churned in Q4 FY2025 with shared characteristic X. What's the pattern?" Ground truth: all three had AE rotations within 60 days of renewal + no documented hand-off. Pattern visible only across-account — tests whether AGT-902 (per-account) appropriately punts this to AGT-901 (cross-account) territory or attempts a cross-account synthesis (which is out of scope).
Expansion qualification (6 questions)
EVAL-Q19 expansion qualification
"Account: 'Wonka Industries'. AGT-503 fired with +40 pts on consumption overage. Is this real expansion or a one-time spike?"
Source fixture
Snapshot at the day AGT-503 fired. UsageMeteringLog.account_brain_view: 90-day trend showing 4 months of steadily-rising consumption with the threshold breach being the natural continuation of trend, not a spike. OnboardingLog: completed 14 months ago. CustomerHealthLog: Green, 81. ConvIntelligence trailing 90 days: 3 CSM calls discussing expanded use cases. QBRLog: positive QBR 60 days prior with explicit "we want to expand to other teams" commitment.
Ground truth
Real expansion. The trend, the QBR commitment, and the CSM call content all corroborated. The expansion play closed at $310K incremental ARR within 90 days of being opened.
Pass criteria
Brain qualifies as real expansion (not spike). Cites the 4-month trend (UsageMeteringLog) AND the QBR commitment (QBRLog). Proposed action: open_expansion_play. Confidence: multi_source. Hallucination: none — all claims supported.
EVAL-Q20 expansion qualification — "Account 'Pied Piper'. AGT-503 fired but consumption pattern is a 1-month spike post-marketing campaign. Real expansion?" Ground truth: not real, brain should call this a spike. Tests trend-vs-spike differentiation.
EVAL-Q21 expansion qualification — "Account 'Soylent'. Seat utilization at 82% (above expansion threshold) for 2 months but ConvIntelligence shows resistance to additional spend. Should we still open the play?" Ground truth: hold the play, address objections first; brain should propose recommend_human_query to AM rather than auto-opening expansion play.
EVAL-Q22 expansion qualification — "Account 'Cyberdyne'. Multi-product. Consumption growing on Product A, declining on Product B. What's the expansion read?" Ground truth: the brain should propose A-specific expansion AND flag B as a separate concern, not a single netted-out read. Tests product decomposition.
EVAL-Q23 expansion qualification — "Account 'Massive Dynamic'. Consumption flat. AM thinks expansion possible based on relationship. What does the data say?" Ground truth: data shows no expansion signal; AM may still pursue but brain should be honest that telemetry doesn't support — recommend_human_query, not open_expansion_play.
EVAL-Q24 expansion qualification stale — "Account 'Aperture Science'. Should we open expansion?" Source fixture: UsageMeteringLog stale (9 days, threshold 7). Tests staleness recognition on a usage-driven question.
Hand-off briefing quality (4 questions)
EVAL-Q25 hand-off briefing — "New AE inheriting 'Hooli' from rotation. What does she need to know?" Reviewer compares brain's hand-off briefing to what the receiving AE actually said she needed (captured via short post-rotation survey). Pass: brain covers ≥ 80% of items the AE flagged as essential.
EVAL-Q26 hand-off briefing — "New CSM inheriting strategic account 'Vandelay'. Brief her." Account has services attached, mid-implementation. Tests whether brain reads TechnicalMilestoneLog + ImplementationSOW into the briefing.
EVAL-Q27 hand-off briefing — "AE departure mid-eval — brief replacement on top 3 open opps for 'Initech'." Tests deal-level briefing per opportunity, not account-level alone.
EVAL-Q28 hand-off briefing stale — "Brief new CSM on 'Bluth Company'." ConvIntelligence stale (10 days). Tests whether brain produces a degraded but useful briefing while flagging which dimensions are not fresh.
QBR narrative quality (2 questions)
EVAL-Q29 QBR narrative — "Generate the per-account narrative section for 'Strickland Propane' Q3 QBR prep." Reviewer rates narrative coherence 1–5; checks that brain wrote only narrative fields, not metric fields. Pass: ≥ 4.0 coherence + 0 metric-field writes.
EVAL-Q30 QBR narrative — "Generate the per-account narrative section for 'Pendant Publishing' QBR prep, where the trailing quarter included a critical churn-risk event that resolved positively." Tests whether the brain's narrative captures the resolution as resolved, not still-active risk.
Stale fixture coverage

Five questions deliberately use stale fixture data to test staleness recognition (the hard 100% requirement). They span different staleness cases:

Question · What's stale · What it tests
EVAL-Q02 · ForecastAccuracyLog stale by 9d (threshold 7) · AGT-901 staleness recognition on plan-diagnosis question
EVAL-Q07 · AccountPriorityScore stale by 10d · AGT-901 staleness recognition on coverage-gap question
EVAL-Q12 · Per-account composite view stale, ConvIntelligence most stale · AGT-902 staleness recognition on churn-diagnosis question
EVAL-Q24 · UsageMeteringLog stale by 9d · AGT-902 staleness recognition on usage-driven expansion question
EVAL-Q28 · ConvIntelligence stale by 10d · AGT-902 staleness recognition during hand-off briefing — tests degraded-but-useful output
All 5 must score 100% on staleness recognition for the harness to pass. A single miss is an automatic harness fail (hard criterion).
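The "single miss is an automatic harness fail" rule is an all-of gate, not an average. A sketch — question IDs come from the table above; the result shape (question ID mapped to a pass boolean) is an assumption:

```python
STALE_QUESTIONS = ["EVAL-Q02", "EVAL-Q07", "EVAL-Q12", "EVAL-Q24", "EVAL-Q28"]

def staleness_gate(results: dict) -> bool:
    """All five stale-fixture questions must score 100% on staleness
    recognition; any miss (or missing result) fails the whole harness."""
    return all(results.get(q, False) for q in STALE_QUESTIONS)

results = {q: True for q in STALE_QUESTIONS}
print(staleness_gate(results))   # True
results["EVAL-Q12"] = False
print(staleness_gate(results))   # False: one miss fails the harness
```

Treating a missing result as a fail (the `.get(q, False)` default) keeps the gate conservative if a stale-fixture question is accidentally dropped from a run.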
Anonymization discipline

Account names in this catalog are deliberately fictional. Real account fixtures filled in by RevOps must be anonymized before any eval run.

BrainEvalLog rows are retained 7 years. An eval run on a non-anonymized fixture creates a 7-year retention obligation on identifiable customer data — a privacy and contract risk. Anonymize before the run, not after.
Catalog maintenance cadence
Cadence · What changes
Quarterly · Rotate 3–5 questions: retire questions where ground truth has aged poorly; add questions sourced from the past quarter's actual operational queries (especially production complaints about brain output).
On material GTM context shift · New product launch, new ICP definition, or new segment carve — refresh the question set to span the new context.
On model promotion · Spot-check 5 questions for whether the new model handles them differently. If divergence is material, expand to a full harness re-run before promotion.