In the v26 architecture eval, the recommendation was three tiers: Tier 1 deterministic services, Tier 2 LLM-native brain agents, Tier 3 specialist tools. The first two get layers (L1–L8 and L9 respectively); Tier 3 does not, because tools sit outside the layered DAG.
Every Tier 3 tool spec must define:
| Element | What it captures |
|---|---|
| Purpose | What the tool does in one sentence. If the tool needs more than one sentence to describe, it should probably be split. |
| Input schema | Strict JSON-shaped input. Validated by the calling agent before invocation; tool rejects malformed input. |
| Output schema | Strict JSON-shaped output. Calling agent depends on the contract; schema changes are breaking changes that require coordinated deployment with callers. |
| Model tier | Haiku (narrow scope, fast) / Sonnet (synthesis-heavy) / Opus (rare, only when measurably necessary). Default is Haiku unless the tool's nature demands otherwise. |
| Called by | Explicit list of which Tier 1 services and Tier 2 brains may call the tool. Out-of-list callers should be a code-review concern. |
| Cost ceiling | Per-call token budget + monthly invocation budget. Hard limits. |
| Eval criteria | Tool-specific eval; runs alongside the brain harness or independently. Tools have a lower bar than brains because their scope is narrower. |
| Failure mode | What happens when the tool returns a bad result. Calling agent's responsibility to handle, but the tool spec must declare its known failure modes. |
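The spec elements above can be carried as a plain data record. This is a sketch, not the actual registry format; the `ToolSpec` field names and the TOOL-004 values beyond its documented budget are illustrative:

```python
from dataclasses import dataclass

# Hypothetical spec record mirroring the table above.
@dataclass
class ToolSpec:
    tool_id: str
    purpose: str                 # one sentence; more suggests a split
    input_schema: dict           # JSON-shaped; caller validates before invocation
    output_schema: dict          # breaking changes need coordinated deployment
    model_tier: str              # "haiku" | "sonnet" | "opus"
    called_by: list[str]         # explicit allow-list of callers
    per_call_tokens: int         # hard per-call budget
    monthly_budget_usd: float    # hard monthly cap
    failure_modes: list[str]     # declared known failure modes

    def allows_caller(self, agent_id: str) -> bool:
        # Out-of-list callers are a code-review concern, but a runtime
        # check makes violations visible.
        return agent_id in self.called_by

spec = ToolSpec(
    tool_id="TOOL-004",
    purpose="Forecast consumption overage timing for a UBB account.",
    input_schema={"type": "object", "required": ["account_id"]},
    output_schema={"type": "object", "required": ["overage_date", "confidence"]},
    model_tier="haiku",
    called_by=["AGT-902", "AGT-402"],   # illustrative allow-list
    per_call_tokens=17_000,             # 15K input + 2K output
    monthly_budget_usd=400.0,
    failure_modes=["insufficient usage history", "stale metering data"],
)
```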
First wave (v29, 4 tools): closed the biggest cognition gaps — Domain 3 (API/dev-led GTM) twice, Domain 1 ("create plays, not just execute") once, UBB consumption forecasting once. Second wave (v31, 4 tools): closed Domain 4 (outbound deliverability), Domain 2 (in-call guidance + competitive narrative), and the post-sales feature-level adoption gap. Third wave (v32, 4 tools): TTV/timing analysis, champion movement detection, pricing sensitivity, onboarding health prediction. Fourth wave (v37, 2 tools): cohort retention forecaster + segment-LTV decomposer, specced alongside AGT-903 Strategy Brain to support multi-quarter portfolio reasoning. Fourteen tools total — further additions remain case-by-case based on observed gaps.
With four waves shipped (14 tools), the Tier 3 catalogue is broadly complete relative to the v26 architecture eval gaps. The ideas below are tracked as candidates for when an observed operational gap (not theoretical coverage) warrants adding them:
| Candidate | Domain | Trigger to add |
|---|---|---|
| Procurement Negotiation Pattern Recognizer | Domain 1 / Domain 2 | If post-launch eval of TOOL-011 shows procurement-stage analysis is a distinct cognition need. |
| Multi-thread Quality Scorer | Domain 2 | Score deal-level multi-threading quality (stakeholder coverage, persona breadth) beyond what AGT-401 deal health scoring captures. Add if AGT-901 surfaces thin multi-threading as a recurring win-rate driver. |
| QBR Action-Item Outcome Tracker | Post-sales | Track QBR commitments through to outcomes, surfacing accountability patterns. Add after AGT-603/AGT-704 adoption shows signal. |
| Renewal Negotiation Risk Profiler | Post-sales | Renewal-specific equivalent of TOOL-011 Pricing Sensitivity. Sharper renewal-stage focus. Add if renewal motion produces distinct sensitivity patterns from new-logo motion. |
| Industry-Specific Play Refiner | Domain 1 | Vertical-specific narrative refinement for plays exiting SalesPlayLibrary. Add when active plays grow past 30 across verticals and a vertical-specific quality lift is demonstrably needed. |
Tier 3 tools are invoked in four patterns:

| Pattern | Caller | Tool | Example |
|---|---|---|---|
| Brain calls tool during query | AGT-901 / AGT-902 | Any | AGT-902 reading per-account view, calls TOOL-004 to forecast overage timing for the account |
| Service calls tool as enrichment | AGT-201, AGT-503, AGT-402 | TOOL-002, TOOL-004 | AGT-201 calls TOOL-002 on account update events to enrich dev-persona signal before ICP rescore |
| Operator calls tool directly | RevOps via workspace UI | TOOL-001, TOOL-003 | RevOps drops a new product API doc into a workspace input; TOOL-001 generates 3 candidate plays for review |
| Brain chains tools in sequence | AGT-901 | TOOL-001 → TOOL-003 | AGT-901 reads API docs (TOOL-001 produces candidate plays), then refines the most promising into a structured play definition (TOOL-003) |
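The tool-chain pattern amounts to sequential calls with the calling brain validating each hop's output contract before the next. A minimal sketch; `call_tool`, the payload shapes, and the stubbed results are hypothetical stand-ins, not the real invocation API:

```python
# Sketch of the AGT-901 -> TOOL-001 -> TOOL-003 chain.

def call_tool(tool_id: str, payload: dict) -> dict:
    # Stub: pretend each tool returns a schema-conforming result.
    if tool_id == "TOOL-001":
        return {"candidate_plays": [
            {"name": "webhook-first onboarding", "score": 0.8},
            {"name": "rate-limit upsell", "score": 0.6},
        ]}
    if tool_id == "TOOL-003":
        return {"play": {"name": payload["play"]["name"], "status": "draft"}}
    raise ValueError(f"unknown tool {tool_id}")

def run_play_chain(api_doc: str) -> dict:
    # Hop 1: translate API docs into candidate plays.
    candidates = call_tool("TOOL-001", {"api_doc": api_doc})["candidate_plays"]
    # Calling brain validates the output contract before chaining.
    assert all("name" in c and "score" in c for c in candidates)
    # Hop 2: refine the most promising candidate into a structured play.
    best = max(candidates, key=lambda c: c["score"])
    return call_tool("TOOL-003", {"play": best})["play"]

play = run_play_chain("GET /v1/usage ...")
```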
Each tool has its own per-call and monthly budget. Aggregate Tier 3 spend is monitored separately from Tier 2 brain spend.
| Tool | Default model | Per-call budget | Monthly cap (default) |
|---|---|---|---|
| TOOL-001 API-doc translator | Sonnet | 50K input + 5K output | $300/mo (low frequency, high context) |
| TOOL-002 Dev-persona enricher | Haiku | 10K input + 1K output | $200/mo (high frequency, narrow scope) |
| TOOL-003 Sales play composer | Sonnet | 30K input + 4K output | $300/mo (moderate frequency) |
| TOOL-004 Consumption forecasting | Haiku | 15K input + 2K output | $400/mo |
| TOOL-005 Outbound deliverability | Haiku | 8K input + 2K output | $150/mo |
| TOOL-006 Real-time call guidance | Sonnet | 30K input + 1.5K output | $2,000/mo (~600 customer calls at ~$3/call) |
| TOOL-007 Competitive narrative writer | Sonnet | 15K input + 2K output | $200/mo |
| TOOL-008 Adoption pattern recognizer | Haiku | 10K input + 1.5K output | $300/mo (high frequency — daily batch from AGT-501) |
| TOOL-009 Activation/TTV analyzer | Haiku | 8K input + 1.5K output | $200/mo |
| TOOL-010 Champion movement detector | Haiku | 10K input + 2K output | $250/mo |
| TOOL-011 Pricing sensitivity analyzer | Sonnet | 20K input + 2.5K output | $300/mo |
| TOOL-012 Onboarding health predictor | Haiku | 10K input + 2K output | $200/mo |
| TOOL-013 Cohort retention forecaster | Sonnet | 40K input + 4K output | $200/mo (low frequency, called by AGT-903 only) |
| TOOL-014 Segment-LTV decomposer | Sonnet | 30K input + 4K output | $150/mo (low frequency, called by AGT-903 only) |
Tools have eval criteria parallel to the brains', but lighter. Tool eval is its own harness, smaller than the brain harness (typically 10–15 questions per tool), and results land in BrainEvalLog with a flag distinguishing tool runs from brain runs.
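A tool eval run and its log entry can be sketched as below. The `run_question` stub, the entry field names, and the grading logic are assumptions; only the small harness size and the tool/brain flag come from the text:

```python
def run_question(tool_id: str, question: dict) -> bool:
    # Stub grader: a real harness would invoke the tool against the
    # question's input and score its output against the expectation.
    return question["expected"] == question["expected"]

def run_tool_eval(tool_id: str, questions: list[dict]) -> dict:
    passed = sum(run_question(tool_id, q) for q in questions)
    # Shape of a BrainEvalLog entry is assumed; is_tool distinguishes
    # tool runs from brain runs.
    return {
        "subject_id": tool_id,
        "is_tool": True,
        "questions": len(questions),
        "passed": passed,
        "pass_rate": passed / len(questions),
    }

# Tool harnesses are small: typically 10-15 questions per tool.
questions = [{"id": i, "expected": "ok"} for i in range(12)]
entry = run_tool_eval("TOOL-005", questions)
```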