TOOL-013 — Cohort Retention Forecaster
Tier 3 Specialist Tool · Stateless · Reads cohort retention curves from AGT-501 cohort_brain_view, projects forward with confidence bands · Cohort-level analogue of TOOL-004 · Closes the multi-quarter retention-projection gap that AGT-903 needs
Tier 3 · Tool
Specced · v37
Strategy · Multi-quarter
Sonnet
Purpose
Reads observed retention curves for one or more signup-quarter cohorts (from AGT-501 cohort_brain_view) and produces forward projections with confidence bands per cohort, plus a cohort-comparison characterization (stable / degrading / improving / mixed). Used by AGT-903 Strategy Brain when answering "are customers we're winning today retaining better or worse than 2 years ago?", "is the retention curve flattening?", and "what's the projected NRR floor on the 2024 cohort?". Cohort-level analogue of TOOL-004 (which forecasts a single account's consumption); TOOL-013 forecasts a population's retention curve. Numerical model in code (cohort survival / decay fits with bootstrapped intervals); LLM portion produces structured characterization + interpretation, never the projection itself.
Closes the cohort-projection gap that AGT-903's strategic reasoning needs. AGT-501's cohort_brain_view publishes observed retention by signup quarter; that's deterministic. Forward projection with calibrated uncertainty is a distinct cognition step, and the spec is explicit that AGT-903 must not estimate from raw rows itself — that's TOOL-013's job. Without it, AGT-903 declines on every "did the bet work / what's next?" retrospective.
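The numerical core can be sketched as a log-linear least-squares decay fit in pure Python. This is illustrative only: the function name is hypothetical, it covers a single metric and a single fit method, and the real tool also supports Kaplan–Meier, Weibull, and LOESS fits with bootstrapped bands.

```python
import math

def fit_exponential_decay(retention_curve):
    """Fit r(t) ~ exp(intercept + slope * t) to observed logo-retention
    points (period_index, logo_retention_pct) via least squares on
    log(retention). Assumes strictly positive retention values.
    Returns the decay rate k (= -slope) and a projection function."""
    ts = [t for t, _ in retention_curve]
    ys = [math.log(r) for _, r in retention_curve]
    n = len(ts)
    t_bar = sum(ts) / n
    y_bar = sum(ys) / n
    # Ordinary least squares: slope of log(retention) on period index
    slope = sum((t - t_bar) * (y - y_bar) for t, y in zip(ts, ys)) / \
            sum((t - t_bar) ** 2 for t in ts)
    intercept = y_bar - slope * t_bar

    def project(t):
        # Retention is a proportion; cap at 1.0
        return min(1.0, math.exp(intercept + slope * t))

    return -slope, project

# Five observed quarters on a hypothetical cohort
curve = [(0, 1.00), (1, 0.90), (2, 0.82), (3, 0.75), (4, 0.69)]
k, project = fit_exponential_decay(curve)
# Project forward 8 quarters past the observed window
forecast = [round(project(t), 3) for t in range(5, 13)]
```

In the real tool the fit runs in code, the bootstrap supplies the p10/p90 bands around `project(t)`, and the LLM never touches these numbers.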
Input schema
{
  "tool_call_id": "uuid",
  "cohorts": [
    {
      "cohort_label": "string",             // e.g., "2023-Q3 SMB", "2024-Q1 MM"
      "cohort_dimensions": {                // free-form structured tags from cohort_brain_view
        "signup_quarter": "YYYY-QN",
        "segment": "SMB" | "MM" | "ENT",
        "vertical": "string | null",
        "icp_tier": "T1" | "T2" | "T3" | null,
        "pricing_model": "flat" | "consumption" | "hybrid" | null
      },
      "cohort_size_at_t0": 0,               // accounts in cohort at signup
      "retention_curve_observed": [         // per period since signup
        {
          "period_index": 0,                // 0 = signup
          "period_label": "M0" | "Q1" | "Y1",
          "logos_retained": 0,
          "logo_retention_pct": 0.0,        // 0.0–1.0
          "nrr_pct": 0.0,                   // optional, null if unavailable
          "grr_pct": 0.0                    // optional
        }
        // ... observed periods only; projection horizon picks up where this ends
      ],
      "data_quality_notes": ["string"]      // e.g., "small cohort", "missing Q3 nrr"
    }
    // ... 1 to 16 cohorts
  ],
  "projection_horizon_periods": 8,          // default 8 quarters; max 16
  "comparison_mode": "single" | "pairwise" | "trend",
  // single   = forecast each cohort independently
  // pairwise = forecast + compare two named cohorts (must pass exactly 2)
  // trend    = forecast + characterize cross-cohort trajectory (signup-quarter drift)
  "context": {
    "as_of_date": "ISO 8601",
    "calling_brain": "AGT-903",             // for audit
    "calling_use_case": "icp_retrospective" | "vertical_entry" | "capacity_reallocation" | "strategic_retrospective" | "nrr_durability" | "other"
  }
}
The tool refuses if any cohort has < 4 observed periods (insufficient signal for projection) or if cohort_size_at_t0 < 25 (small-cohort variance overwhelms signal). Refusal is structured — the tool returns refusal_reason rather than fabricating wide-band noise. AGT-903 is expected to handle the refusal by surfacing the gap, per its own refusal discipline.
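The refusal guard above is mechanical and can be sketched directly. The function name is illustrative; the field names match the input schema and the thresholds match the spec.

```python
MIN_OBSERVED_PERIODS = 4   # per spec: refuse below 4 observed periods
MIN_COHORT_SIZE = 25       # per spec: refuse below 25 accounts at t0

def refusal_reason(cohorts):
    """Return a structured refusal_reason string, or None if every
    cohort clears both thresholds. Runs before any fitting."""
    problems = []
    for c in cohorts:
        if len(c["retention_curve_observed"]) < MIN_OBSERVED_PERIODS:
            problems.append(
                f"{c['cohort_label']}: < {MIN_OBSERVED_PERIODS} observed periods")
        if c["cohort_size_at_t0"] < MIN_COHORT_SIZE:
            problems.append(
                f"{c['cohort_label']}: cohort_size_at_t0 < {MIN_COHORT_SIZE}")
    return "; ".join(problems) if problems else None
```

On a non-None result the tool populates `refusal_reason`, nulls the rest of the payload, and stops — no fitting, no bootstrap, no LLM call.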
Output schema
{
  "tool_call_id": "uuid",
  "refusal_reason": "string | null",        // populated only on refusal; rest of payload null
  "per_cohort_forecast": [
    {
      "cohort_label": "string",
      "fit_method": "kaplan_meier" | "exponential_decay" | "weibull" | "loess",
      "projection": [
        {
          "period_index": 0,
          "period_label": "string",
          "logo_retention_pct_p50": 0.0,    // median projection
          "logo_retention_pct_p10": 0.0,    // lower band
          "logo_retention_pct_p90": 0.0,    // upper band
          "nrr_pct_p50": 0.0,               // null if no nrr observations
          "nrr_pct_p10": 0.0,
          "nrr_pct_p90": 0.0
        }
        // ... up to projection_horizon_periods
      ],
      "asymptote_estimate": {               // long-run retention floor; null if no asymptote indicated
        "logo_retention_pct_floor_p50": 0.0,
        "logo_retention_pct_floor_p10": 0.0,
        "logo_retention_pct_floor_p90": 0.0
      },
      "fit_quality": "high" | "medium" | "low",
      "fit_quality_reasons": ["string"]
    }
  ],
  "cohort_comparison": {                    // populated when comparison_mode != single
    "primary_pattern": "stable" | "degrading" | "improving" | "mixed" | "insufficient_signal",
    "drift_direction": "newer_better" | "newer_worse" | "no_drift" | "insufficient_signal",
    "drift_magnitude_pct_per_cohort": 0.0,  // signed; positive = newer cohorts retain better
    "load_bearing_periods": ["string"],     // which periods drive the comparison conclusion
    "anomaly_cohorts": ["string"]           // cohorts flagged as outliers vs. peers
  },
  "interpretation_for_caller": {
    "primary_finding": "string",            // 1–2 sentence summary grounded in numbers
    "credible_alternative_finding": "string", // mandatory — second reading the data also supports
    "what_would_change_the_finding": "string" // falsifiable condition
  },
  "data_quality": {
    "cohorts_evaluated": 0,
    "cohorts_refused": 0,
    "smallest_cohort_size": 0,
    "shortest_observed_window_periods": 0,
    "quality_assessment": "high" | "medium" | "low"
  },
  "ungrounded_assumptions": ["string"],
  "tool_metadata": {
    "model": "claude-sonnet-4-6",
    "input_tokens": 0,
    "output_tokens": 0,
    "cost_usd_estimate": 0.0,
    "latency_ms": 0
  }
}
Hard rule. Confidence intervals must scale with observed-window length and cohort size. A 4-period observation on a 30-account cohort cannot produce a tight band — the tool widens p10/p90 mechanically based on bootstrap variance, and the LLM characterization cannot override the numerical bounds. If the caller wants a tighter answer, the answer is "wait for more periods or pool cohorts," not "lie about uncertainty."
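The mechanical widening can be seen in a minimal account-level bootstrap for a single period's logo-retention rate. A sketch, not the production implementation — function name and parameters are illustrative:

```python
import random

def bootstrap_retention_band(logos_retained, cohort_size, n_boot=2000, seed=7):
    """Bootstrap (p10, p50, p90) for a single-period logo-retention rate
    by resampling account-level retain/churn outcomes with replacement.
    Band width shrinks as cohort_size grows, per the hard rule."""
    rng = random.Random(seed)
    outcomes = [1] * logos_retained + [0] * (cohort_size - logos_retained)
    rates = sorted(
        sum(rng.choices(outcomes, k=cohort_size)) / cohort_size
        for _ in range(n_boot)
    )
    return (rates[int(0.10 * n_boot)],
            rates[int(0.50 * n_boot)],
            rates[int(0.90 * n_boot)])

# 30-account cohort at 80% retention: wide band
small = bootstrap_retention_band(24, 30)
# 300-account cohort at the same rate: materially tighter band
large = bootstrap_retention_band(240, 300)
```

The same resampling, applied across the observed window, is what drives the p10/p90 columns in the projection — and what the LLM characterization is forbidden to narrow.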
Mandatory credible alternative. Mirrors AGT-903's options-discipline at the tool layer. Every interpretation includes a second reading the data also supports, plus a falsifiable condition that would distinguish them. This prevents the tool from feeding AGT-903 single-narrative inputs that the brain then dresses up as options.
Cohort-comparison pattern taxonomy
| Pattern | What it looks like | Implication for AGT-903 |
| stable | Cross-cohort retention curves fall within each other's confidence bands across the observed window | No retention-side signal of strategic shift. Other diagnostics (LTV, expansion) drive the strategic question. Brain should not over-claim a "retention is fine" finding — the alternative reading is "we don't have enough signal to detect drift yet." |
| degrading | Newer cohorts consistently retain worse than older cohorts at matched periods, beyond confidence bands | Strategic risk signal. Possible drivers: ICP drift, product fit erosion, segment-mix shift, onboarding regression. Brain must enumerate which driver(s) the cohort dimensions support — not assert one. |
| improving | Newer cohorts retain better than older cohorts at matched periods, beyond confidence bands | Strategic-bet validation signal — the system improvements (better ICP, better onboarding, better product) are showing up in retention. Brain should still flag confounds: cohort-mix changes, survivorship, or a new selection bias in who closes. |
| mixed | Pattern is non-monotonic across cohorts, or different metrics (logo vs. NRR) tell different stories | The honest case. Brain articulates both readings; the strategic call hinges on which story the executive prioritizes. This is where AGT-903's options-discipline is most load-bearing — do not collapse to a single answer. |
| insufficient_signal | Cohorts too small, observation windows too short, or variance too wide to support any cross-cohort claim | Refusal-shaped. Brain surfaces the gap rather than estimating. Retry when more periods land or more cohorts are pooled. |
Pattern selection is rule-based (not LLM judgment): the tool computes signed drift across cohorts at matched periods, compares to bootstrapped null bands, and assigns the pattern. The LLM portion writes the interpretation; it does not re-decide the pattern. This protects against the failure mode where Sonnet pattern-matches "retention" to a confident narrative absent supporting variance bounds.
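The rule can be sketched as a pure function over per-period drift estimates. Thresи inputs here are illustrative simplifications (a single null half-width rather than per-period bands); the spec fixes only that the rule, not the LLM, assigns the label.

```python
def assign_pattern(period_drifts, null_halfwidth):
    """period_drifts: signed cross-cohort drift (pct points per cohort)
    at each matched period, positive = newer cohorts retain better.
    null_halfwidth: bootstrapped null-band half-width.
    Returns (primary_pattern, drift_direction)."""
    significant = [d for d in period_drifts if abs(d) > null_halfwidth]
    if not significant:
        # No period crosses the null band: no detectable drift
        return "stable", "no_drift"
    if all(d > 0 for d in significant):
        return "improving", "newer_better"
    if all(d < 0 for d in significant):
        return "degrading", "newer_worse"
    # Significant drift in both directions: the honest mixed case
    return "mixed", "insufficient_signal"
```

Because the function is deterministic over the numerical core's outputs, the "hallucinated drift" eval criterion reduces to checking that the payload label matches this rule's output.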
Called by
| Caller | Invocation context |
| AGT-903 Strategy Brain | Primary caller. Used in icp_retrospective, strategic_retrospective (e.g., "did the consumption-pricing pivot work?"), nrr_durability, vertical_entry (when retention data exists for the vertical's opportunistic cohorts). Brain reads AGT-501 cohort_brain_view, calls TOOL-013 with relevant cohort subset + comparison_mode, integrates output into the StrategyRecommendationLog memo as one of several inputs. |
| AGT-704 Business Review Orchestrator | Secondary caller, only via the AGT-903 narrative job. AGT-704 does not call TOOL-013 directly — it invokes AGT-903 for annual-planning narrative sections, and AGT-903 calls TOOL-013 from there. Keeps the audit trail single-threaded through AGT-903. |
| RevOps direct (workspace UI) | For ad-hoc cohort retention investigations — e.g., "what does the 2024-Q2 MM cohort look like projected forward?" Bypasses AGT-903 only when the question is descriptive (not strategic-recommendation-shaped). |
Out-of-list callers are a code-review concern, per the Tier 3 contract. AGT-901 (current-period pipeline) and AGT-902 (per-account) should not call TOOL-013 — their horizons don't justify the cost or model tier. If they need cohort context, they should narrow scope or hand off to AGT-903.
Design principles
- Numerical model, not LLM gut feel. Same as TOOL-004. Survival/decay fits + bootstrap intervals run in code. The LLM characterizes the result and writes the interpretation. The LLM does not invent the projection number, does not pick the pattern label, does not narrow the confidence band.
- Confidence intervals scale with observed window and cohort size. A 4-period observation is not the same as a 12-period observation. The tool reports honestly; tight bands on thin data are a hard fail in eval.
- Mandatory credible alternative. Output always includes a second reading the data also supports + a falsifiable condition. Mirrors AGT-903's options-discipline at the tool layer; prevents the brain from collapsing tool output into a single narrative.
- Refusal is a first-class result. When cohorts are too small or windows too short, the tool returns refusal_reason and stops. Refusal cost ≈ one round-trip's tokens, far cheaper than fabricating noise that AGT-903 then has to discount.
- No external data. Tool reads only cohort retention rows passed by the caller. Does not query AGT-501 directly; does not enrich with industry retention benchmarks. Audit trail traces every projection to specific input rows.
- Pattern label is rule-based, not LLM-chosen. The taxonomy (stable / degrading / improving / mixed / insufficient_signal) is assigned by the numerical core based on signed drift vs. bootstrap null. The LLM cannot override the label — it can only describe what the label means for the caller.
Implementation note for go-live. Principles 1 and 6 mean the tool is partly numerical (cohort-survival fits + bootstrap intervals + rule-based pattern assignment in code) and partly LLM (interpretation + structured output + ungrounded-assumption surfacing). The LLM does not invent numbers and does not pick the pattern. This is enforceable in implementation but not in spec alone — the eval harness must cross-check that primary_pattern matches the rule output.
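The cross-check the note calls for is cheap to state. A hypothetical harness helper, assuming the harness retains the numerical core's rule output alongside the final payload:

```python
def check_pattern_integrity(rule_pattern, tool_output):
    """Eval-harness cross-check: the LLM-authored payload must carry the
    rule-assigned primary_pattern verbatim. A mismatch is the
    LLM-override failure mode and is a hard fail."""
    return tool_output["cohort_comparison"]["primary_pattern"] == rule_pattern
```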
Cost ceiling
| Constraint | Value |
| Per-call input budget | 40K tokens (up to 16 cohorts × 16 periods × multi-metric rows; bounded at 40K). Larger than TOOL-004 because cohort comparison is multi-population. |
| Per-call output budget | 4K tokens (per-cohort projections + comparison + interpretation + assumptions) |
| Default model | Sonnet — multi-cohort pattern characterization and interpretation justify the step up from Haiku. Numerical forecasting still runs in code. |
| Per-call cost estimate | ~$0.25 per call at Sonnet pricing |
| Monthly cap (default) | $200/mo — bounds usage to ~800 calls/month, well above expected AGT-903 query volume |
| Frequency expectation | Lowest-frequency tool in Tier 3. AGT-903 runs 10–30 queries/month, of which roughly half call TOOL-013. Annual planning weeks concentrate usage; off-cycle months may see < 5 calls. |
Eval criteria
| Criterion | Measurement | Pass threshold |
| Schema compliance | Output validates against output schema | 100% (hard) |
| Pattern label accuracy | 20 historical cohort sets with known retrospective pattern; % where primary_pattern matches | ≥ 80% — pattern is rule-assigned, so this is mostly a regression test on the rule + corner cases |
| Projection calibration | For 20 cases with retrospectively-known retention outcomes within the projection horizon: % where actual outcome fell within the [p10, p90] band | ≥ 80% |
| Confidence band width discipline | For cases with ≤ 6 observed periods, % where p90 − p10 width is at least 15 percentage points (i.e., honest about uncertainty on thin data) | 100% (hard) |
| Refusal correctness | For deliberately-undersized cohorts (< 25 accounts) or short windows (< 4 periods), % where tool refuses with structured refusal_reason | 100% (hard) |
| Mandatory alternative | % of non-refusal outputs with non-empty credible_alternative_finding and non-empty what_would_change_the_finding | 100% (hard) |
| Hallucinated drift | % of outputs claiming drift_direction in (newer_better, newer_worse) when bootstrap null bands are not crossed | 0% (hard) |
| P95 latency | End-to-end (numerical fit + bootstrap + LLM interpretation) | ≤ 6s |
Failure modes
| Symptom | Cause | Action |
| Tool reports tight confidence bands on 4-period observations | Bootstrap variance not properly scaled to window length | Hard fail in eval. Confidence intervals must mechanically widen with shorter observed windows. |
| Tool calls a cohort comparison "degrading" when the cross-cohort delta lies inside bootstrap null bands | Pattern-assignment rule too aggressive, or LLM overriding the rule | Hard fail. Rule output is the source of truth for the label; LLM cannot override. |
| Tool produces a single-narrative interpretation without an alternative | System prompt not enforcing the mandatory credible-alternative field | Hard fail. Eval enforces non-empty credible_alternative_finding. |
| Tool projects a non-physical retention curve (e.g., logo retention rising over time within a cohort) | Bad fit method choice for the data shape | Numerical guard: projections are clamped monotonic-non-increasing for logo retention; NRR can rise (expansion). Catch in code, not LLM. |
| AGT-903 cost spike traceable to TOOL-013 pooling redundant cohort sets per query | Brain calling TOOL-013 multiple times in one session with overlapping cohort definitions | Caller responsibility. AGT-903 should consolidate cohort sets per session; not TOOL-013's concern. |
| Tool refuses on cohorts AGT-903 needs for the question | Cohort sizes legitimately too small, or AGT-501 cohort_brain_view contract not yet shipping the relevant cohorts | Surface gap to AGT-903; AGT-903 surfaces gap to the operator. Working as designed — refusal is correct when data doesn't support a forecast. |
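The monotonic guard from the non-physical-curve row above can be sketched as a running-minimum clamp, applied in code after the fit and before the LLM sees the numbers (function name illustrative):

```python
def clamp_logo_projection(values):
    """Enforce the physical guard: projected logo retention within a
    cohort can never rise over time. NRR is exempt — expansion can
    push it up — so this applies to logo-retention columns only."""
    clamped = []
    ceiling = 1.0  # retention starts at 100% of the cohort
    for v in values:
        ceiling = min(ceiling, v)  # running minimum so far
        clamped.append(ceiling)
    return clamped
```

Applied per percentile column (p10, p50, p90) independently, so the bands stay physically plausible as well.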
Interaction with AGT-903
Single-threaded audit trail. AGT-903 calls TOOL-013 via Anthropic tool-use; the call ID lands in tool_calls_made on the BrainAnalysisLog row, and the projection numbers cited in the StrategyRecommendationLog memo trace back via that ID. Per AGT-903's source-citation discipline (≥ 95% of numerical claims must cite a source), TOOL-013 outputs are first-class citation targets.
Pattern label propagation. The primary_pattern from TOOL-013 propagates into AGT-903's options_enumerated as a load-bearing input — e.g., a "degrading" pattern on the post-pivot cohorts in strategic_retrospective shapes whether flag_strategic_risk or propose_pricing_packaging_review is the right action. AGT-903 is required to articulate the alternative reading TOOL-013 surfaced; if the brain ignores the alternative, the eval fixture EVAL-S04 anti-confirmation-bias check fails.
Refusal contagion. When TOOL-013 refuses, AGT-903 is expected to either (a) surface the gap to the operator and decline the strategic question, or (b) proceed without TOOL-013 input and explicitly mark the affected claims at speculation confidence. Silently dropping the refusal and producing a confident narrative is a hard fail for AGT-903 (Sev-2 incident treatment per its spec).