TOOL-013 — Cohort Retention Forecaster
Tier 3 Specialist Tool · Stateless · Reads cohort retention curves from AGT-501 cohort_brain_view, projects forward with confidence bands · Cohort-level analogue of TOOL-004 · Closes the multi-quarter retention-projection gap that AGT-903 needs
Tier 3 · Tool
Specced · v37
Strategy · Multi-quarter
Sonnet
Purpose
Reads observed retention curves for one or more signup-quarter cohorts (from AGT-501 cohort_brain_view) and produces forward projections with confidence bands per cohort, plus a cohort-comparison characterization (stable / degrading / improving / mixed). Used by AGT-903 Strategy Brain when answering "are customers we're winning today retaining better or worse than 2 years ago?", "is the retention curve flattening?", and "what's the projected NRR floor on the 2024 cohort?". Cohort-level analogue of TOOL-004 (which forecasts a single account's consumption); TOOL-013 forecasts a population's retention curve. Numerical model in code (cohort survival / decay fits with bootstrapped intervals); LLM portion produces structured characterization + interpretation, never the projection itself.
Closes the cohort-projection gap that AGT-903's strategic reasoning needs. AGT-501's cohort_brain_view publishes observed retention by signup quarter; that's deterministic. Forward projection with calibrated uncertainty is a distinct cognition step, and the spec is explicit that AGT-903 must not estimate from raw rows itself — that's TOOL-013's job. Without it, AGT-903 declines on every "did the bet work / what's next?" retrospective.
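The numerical core can be sketched as a log-linear least-squares decay fit in pure Python. This is illustrative only: the function name is hypothetical, it covers a single metric and a single fit method, and the real tool also supports Kaplan–Meier, Weibull, and LOESS fits with bootstrapped bands.

```python
import math

def fit_exponential_decay(retention_curve):
    """Fit r(t) ~ exp(intercept + slope * t) to observed logo-retention
    points (period_index, logo_retention_pct) via least squares on
    log(retention). Assumes strictly positive retention values.
    Returns the decay rate k (= -slope) and a projection function."""
    ts = [t for t, _ in retention_curve]
    ys = [math.log(r) for _, r in retention_curve]
    n = len(ts)
    t_bar = sum(ts) / n
    y_bar = sum(ys) / n
    # Ordinary least squares: slope of log(retention) on period index
    slope = sum((t - t_bar) * (y - y_bar) for t, y in zip(ts, ys)) / \
            sum((t - t_bar) ** 2 for t in ts)
    intercept = y_bar - slope * t_bar

    def project(t):
        # Retention is a proportion; cap at 1.0
        return min(1.0, math.exp(intercept + slope * t))

    return -slope, project

# Five observed quarters on a hypothetical cohort
curve = [(0, 1.00), (1, 0.90), (2, 0.82), (3, 0.75), (4, 0.69)]
k, project = fit_exponential_decay(curve)
# Project forward 8 quarters past the observed window
forecast = [round(project(t), 3) for t in range(5, 13)]
```

In the real tool the fit runs in code, the bootstrap supplies the p10/p90 bands around `project(t)`, and the LLM never touches these numbers.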
Input schema
{
  "tool_call_id": "uuid",
  "cohorts": [
    {
      "cohort_label": "string",             // e.g., "2023-Q3 SMB", "2024-Q1 MM"
      "cohort_dimensions": {                // free-form structured tags from cohort_brain_view
        "signup_quarter": "YYYY-QN",
        "segment": "SMB" | "MM" | "ENT",
        "vertical": "string | null",
        "icp_tier": "T1" | "T2" | "T3" | null,
        "pricing_model": "flat" | "consumption" | "hybrid" | null
      },
      "cohort_size_at_t0": 0,               // accounts in cohort at signup
      "retention_curve_observed": [         // per period since signup
        {
          "period_index": 0,                // 0 = signup
          "period_label": "M0" | "Q1" | "Y1",
          "logos_retained": 0,
          "logo_retention_pct": 0.0,        // 0.0–1.0
          "nrr_pct": 0.0,                   // optional, null if unavailable
          "grr_pct": 0.0                    // optional
        }
        // ... observed periods only; projection horizon picks up where this ends
      ],
      "data_quality_notes": ["string"]      // e.g., "small cohort", "missing Q3 nrr"
    }
    // ... 1 to 16 cohorts
  ],
  "projection_horizon_periods": 8,          // default 8 quarters; max 16
  "comparison_mode": "single" | "pairwise" | "trend",
  // single   = forecast each cohort independently
  // pairwise = forecast + compare two named cohorts (must pass exactly 2)
  // trend    = forecast + characterize cross-cohort trajectory (signup-quarter drift)
  "context": {
    "as_of_date": "ISO 8601",
    "calling_brain": "AGT-903",             // for audit
    "calling_use_case": "icp_retrospective" | "vertical_entry" | "capacity_reallocation" | "strategic_retrospective" | "nrr_durability" | "other"
  }
}
The tool refuses if any cohort has < 4 observed periods (insufficient signal for projection) or if cohort_size_at_t0 < 25 (small-cohort variance overwhelms signal). Refusal is structured — the tool returns refusal_reason rather than fabricating wide-band noise. AGT-903 is expected to handle the refusal by surfacing the gap, per its own refusal discipline.
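The refusal guard above is mechanical and can be sketched directly. The function name is illustrative; the field names match the input schema and the thresholds match the spec.

```python
MIN_OBSERVED_PERIODS = 4   # per spec: refuse below 4 observed periods
MIN_COHORT_SIZE = 25       # per spec: refuse below 25 accounts at t0

def refusal_reason(cohorts):
    """Return a structured refusal_reason string, or None if every
    cohort clears both thresholds. Runs before any fitting."""
    problems = []
    for c in cohorts:
        if len(c["retention_curve_observed"]) < MIN_OBSERVED_PERIODS:
            problems.append(
                f"{c['cohort_label']}: < {MIN_OBSERVED_PERIODS} observed periods")
        if c["cohort_size_at_t0"] < MIN_COHORT_SIZE:
            problems.append(
                f"{c['cohort_label']}: cohort_size_at_t0 < {MIN_COHORT_SIZE}")
    return "; ".join(problems) if problems else None
```

On a non-None result the tool populates `refusal_reason`, nulls the rest of the payload, and stops — no fitting, no bootstrap, no LLM call.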
Output schema
{
  "tool_call_id": "uuid",
  "refusal_reason": "string | null",        // populated only on refusal; rest of payload null
  "per_cohort_forecast": [
    {
      "cohort_label": "string",
      "fit_method": "kaplan_meier" | "exponential_decay" | "weibull" | "loess",
      "projection": [
        {
          "period_index": 0,
          "period_label": "string",
          "logo_retention_pct_p50": 0.0,    // median projection
          "logo_retention_pct_p10": 0.0,    // lower band
          "logo_retention_pct_p90": 0.0,    // upper band
          "nrr_pct_p50": 0.0,               // null if no nrr observations
          "nrr_pct_p10": 0.0,
          "nrr_pct_p90": 0.0
        }
        // ... up to projection_horizon_periods
      ],
      "asymptote_estimate": {               // long-run retention floor; null if no asymptote indicated
        "logo_retention_pct_floor_p50": 0.0,
        "logo_retention_pct_floor_p10": 0.0,
        "logo_retention_pct_floor_p90": 0.0
      },
      "fit_quality": "high" | "medium" | "low",
      "fit_quality_reasons": ["string"]
    }
  ],
  "cohort_comparison": {                    // populated when comparison_mode != single
    "primary_pattern": "stable" | "degrading" | "improving" | "mixed" | "insufficient_signal",
    "drift_direction": "newer_better" | "newer_worse" | "no_drift" | "insufficient_signal",
    "drift_magnitude_pct_per_cohort": 0.0,  // signed; positive = newer cohorts retain better
    "load_bearing_periods": ["string"],     // which periods drive the comparison conclusion
    "anomaly_cohorts": ["string"]           // cohorts flagged as outliers vs. peers
  },
  "interpretation_for_caller": {
    "primary_finding": "string",            // 1–2 sentence summary grounded in numbers
    "credible_alternative_finding": "string", // mandatory — second reading the data also supports
    "what_would_change_the_finding": "string" // falsifiable condition
  },
  "data_quality": {
    "cohorts_evaluated": 0,
    "cohorts_refused": 0,
    "smallest_cohort_size": 0,
    "shortest_observed_window_periods": 0,
    "quality_assessment": "high" | "medium" | "low"
  },
  "ungrounded_assumptions": ["string"],
  "tool_metadata": {
    "model": "claude-sonnet-4-6",
    "input_tokens": 0,
    "output_tokens": 0,
    "cost_usd_estimate": 0.0,
    "latency_ms": 0
  }
}
Hard rule. Confidence intervals must scale with observed-window length and cohort size. A 4-period observation on a 30-account cohort cannot produce a tight band — the tool widens p10/p90 mechanically based on bootstrap variance, and the LLM characterization cannot override the numerical bounds. If the caller wants a tighter answer, the answer is "wait for more periods or pool cohorts," not "lie about uncertainty."
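The mechanical widening can be seen in a minimal account-level bootstrap for a single period's logo-retention rate. A sketch, not the production implementation — function name and parameters are illustrative:

```python
import random

def bootstrap_retention_band(logos_retained, cohort_size, n_boot=2000, seed=7):
    """Bootstrap (p10, p50, p90) for a single-period logo-retention rate
    by resampling account-level retain/churn outcomes with replacement.
    Band width shrinks as cohort_size grows, per the hard rule."""
    rng = random.Random(seed)
    outcomes = [1] * logos_retained + [0] * (cohort_size - logos_retained)
    rates = sorted(
        sum(rng.choices(outcomes, k=cohort_size)) / cohort_size
        for _ in range(n_boot)
    )
    return (rates[int(0.10 * n_boot)],
            rates[int(0.50 * n_boot)],
            rates[int(0.90 * n_boot)])

# 30-account cohort at 80% retention: wide band
small = bootstrap_retention_band(24, 30)
# 300-account cohort at the same rate: materially tighter band
large = bootstrap_retention_band(240, 300)
```

The same resampling, applied across the observed window, is what drives the p10/p90 columns in the projection — and what the LLM characterization is forbidden to narrow.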
Mandatory credible alternative. Mirrors AGT-903's options-discipline at the tool layer. Every interpretation includes a second reading the data also supports, plus a falsifiable condition that would distinguish them. This prevents the tool from feeding AGT-903 single-narrative inputs that the brain then dresses up as options.
Cohort-comparison pattern taxonomy
| Pattern | What it looks like | Implication for AGT-903 |
| stable | Cross-cohort retention curves fall within each other's confidence bands across the observed window | No retention-side signal of strategic shift. Other diagnostics (LTV, expansion) drive the strategic question. Brain should not over-claim a "retention is fine" finding — the alternative reading is "we don't have enough signal to detect drift yet." |
| degrading | Newer cohorts consistently retain worse than older cohorts at matched periods, beyond confidence bands | Strategic risk signal. Possible drivers: ICP drift, product fit erosion, segment-mix shift, onboarding regression. Brain must enumerate which driver(s) the cohort dimensions support — not assert one. |
| improving | Newer cohorts retain better than older cohorts at matched periods, beyond confidence bands | Strategic-bet validation signal — the system improvements (better ICP, better onboarding, better product) are showing up in retention. Brain should still flag confounds: cohort-mix changes, survivorship, or a new selection bias in who closes. |
| mixed | Pattern is non-monotonic across cohorts, or different metrics (logo vs. NRR) tell different stories | The honest case. Brain articulates both readings; the strategic call hinges on which story the executive prioritizes. This is where AGT-903's options-discipline is most load-bearing — do not collapse to a single answer. |
| insufficient_signal | Cohorts too small, observation windows too short, or variance too wide to support any cross-cohort claim | Refusal-shaped. Brain surfaces the gap rather than estimating. Retry when more periods land or more cohorts are pooled. |
Pattern selection is rule-based (not LLM judgment): the tool computes signed drift across cohorts at matched periods, compares to bootstrapped null bands, and assigns the pattern. The LLM portion writes the interpretation; it does not re-decide the pattern. This protects against the failure mode where Sonnet pattern-matches "retention" to a confident narrative absent supporting variance bounds.
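The rule can be sketched as a pure function over per-period drift estimates. Thresи inputs here are illustrative simplifications (a single null half-width rather than per-period bands); the spec fixes only that the rule, not the LLM, assigns the label.

```python
def assign_pattern(period_drifts, null_halfwidth):
    """period_drifts: signed cross-cohort drift (pct points per cohort)
    at each matched period, positive = newer cohorts retain better.
    null_halfwidth: bootstrapped null-band half-width.
    Returns (primary_pattern, drift_direction)."""
    significant = [d for d in period_drifts if abs(d) > null_halfwidth]
    if not significant:
        # No period crosses the null band: no detectable drift
        return "stable", "no_drift"
    if all(d > 0 for d in significant):
        return "improving", "newer_better"
    if all(d < 0 for d in significant):
        return "degrading", "newer_worse"
    # Significant drift in both directions: the honest mixed case
    return "mixed", "insufficient_signal"
```

Because the function is deterministic over the numerical core's outputs, the "hallucinated drift" eval criterion reduces to checking that the payload label matches this rule's output.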
Called by
| Caller | Invocation context |
| AGT-903 Strategy Brain | Primary caller. Used in icp_retrospective, strategic_retrospective (e.g., "did the consumption-pricing pivot work?"), nrr_durability, vertical_entry (when retention data exists for the vertical's opportunistic cohorts). Brain reads AGT-501 cohort_brain_view, calls TOOL-013 with relevant cohort subset + comparison_mode, integrates output into the StrategyRecommendationLog memo as one of several inputs. |
| AGT-704 Business Review Orchestrator | Secondary caller, only via the AGT-903 narrative job. AGT-704 does not call TOOL-013 directly — it invokes AGT-903 for annual-planning narrative sections, and AGT-903 calls TOOL-013 from there. Keeps the audit trail single-threaded through AGT-903. |
| RevOps direct (workspace UI) | For ad-hoc cohort retention investigations — e.g., "what does the 2024-Q2 MM cohort look like projected forward?" Bypasses AGT-903 only when the question is descriptive (not strategic-recommendation-shaped). |
Out-of-list callers are a code-review concern, per the Tier 3 contract. AGT-901 (current-period pipeline) and AGT-902 (per-account) should not call TOOL-013 — their horizons don't justify the cost or model tier. If they need cohort context, they should narrow scope or hand off to AGT-903.
Design principles
- Numerical model, not LLM gut feel. Same as TOOL-004. Survival/decay fits + bootstrap intervals run in code. The LLM characterizes the result and writes the interpretation. The LLM does not invent the projection number, does not pick the pattern label, does not narrow the confidence band.
- Confidence intervals scale with observed window and cohort size. A 4-period observation is not the same as a 12-period observation. The tool reports honestly; tight bands on thin data are a hard fail in eval.
- Mandatory credible alternative. Output always includes a second reading the data also supports + a falsifiable condition. Mirrors AGT-903's options-discipline at the tool layer; prevents the brain from collapsing tool output into a single narrative.
- Refusal is a first-class result. When cohorts are too small or windows too short, the tool returns refusal_reason and stops. Refusal cost ≈ one round-trip's tokens, far cheaper than fabricating noise that AGT-903 then has to discount.
- No external data. Tool reads only cohort retention rows passed by the caller. Does not query AGT-501 directly; does not enrich with industry retention benchmarks. Audit trail traces every projection to specific input rows.
- Pattern label is rule-based, not LLM-chosen. The taxonomy (stable / degrading / improving / mixed / insufficient_signal) is assigned by the numerical core based on signed drift vs. bootstrap null. The LLM cannot override the label — it can only describe what the label means for the caller.
Implementation note for go-live. Principles 1 and 6 mean the tool is partly numerical (cohort-survival fits + bootstrap intervals + rule-based pattern assignment in code) and partly LLM (interpretation + structured output + ungrounded-assumption surfacing). The LLM does not invent numbers and does not pick the pattern. This is enforceable in implementation but not in spec alone — the eval harness must cross-check that primary_pattern matches the rule output.
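The cross-check the note calls for is cheap to state. A hypothetical harness helper, assuming the harness retains the numerical core's rule output alongside the final payload:

```python
def check_pattern_integrity(rule_pattern, tool_output):
    """Eval-harness cross-check: the LLM-authored payload must carry the
    rule-assigned primary_pattern verbatim. A mismatch is the
    LLM-override failure mode and is a hard fail."""
    return tool_output["cohort_comparison"]["primary_pattern"] == rule_pattern
```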
Cost ceiling
| Constraint | Value |
| Per-call input budget | 40K tokens (up to 16 cohorts × 16 periods × multi-metric rows; bounded at 40K). Larger than TOOL-004 because cohort comparison is multi-population. |
| Per-call output budget | 4K tokens (per-cohort projections + comparison + interpretation + assumptions) |
| Default model | Sonnet — multi-cohort pattern characterization and interpretation justify the step up from Haiku. Numerical forecasting still runs in code. |
| Per-call cost estimate | ~$0.25 per call at Sonnet pricing |
| Monthly cap (default) | $200/mo — bounds usage to ~800 calls/month, well above expected AGT-903 query volume |
| Frequency expectation | Lowest-frequency tool in Tier 3. AGT-903 runs 10–30 queries/month, of which roughly half call TOOL-013. Annual planning weeks concentrate usage; off-cycle months may see < 5 calls. |
Eval criteria
| Criterion | Measurement | Pass threshold |
| Schema compliance | Output validates against output schema | 100% (hard) |
| Pattern label accuracy | 20 historical cohort sets with known retrospective pattern; % where primary_pattern matches | ≥ 80% — pattern is rule-assigned, so this is mostly a regression test on the rule + corner cases |
| Projection calibration | For 20 cases with retrospectively-known retention outcomes within the projection horizon: % where actual outcome fell within the [p10, p90] band | ≥ 80% |
| Confidence band width discipline | For cases with ≤ 6 observed periods, % where p90 − p10 width is at least 15 percentage points (i.e., honest about uncertainty on thin data) | 100% (hard) |
| Refusal correctness | For deliberately-undersized cohorts (< 25 accounts) or short windows (< 4 periods), % where tool refuses with structured refusal_reason | 100% (hard) |
| Mandatory alternative | % of non-refusal outputs with non-empty credible_alternative_finding and non-empty what_would_change_the_finding | 100% (hard) |
| Hallucinated drift | % of outputs claiming drift_direction in (newer_better, newer_worse) when bootstrap null bands are not crossed | 0% (hard) |
| P95 latency | End-to-end (numerical fit + bootstrap + LLM interpretation) | ≤ 6s |
Failure modes
| Symptom | Cause | Action |
| Tool reports tight confidence bands on 4-period observations | Bootstrap variance not properly scaled to window length | Hard fail in eval. Confidence intervals must mechanically widen with shorter observed windows. |
| Tool calls a cohort comparison "degrading" when the cross-cohort delta lies inside bootstrap null bands | Pattern-assignment rule too aggressive, or LLM overriding the rule | Hard fail. Rule output is the source of truth for the label; LLM cannot override. |
| Tool produces a single-narrative interpretation without an alternative | System prompt not enforcing the mandatory credible-alternative field | Hard fail. Eval enforces non-empty credible_alternative_finding. |
| Tool projects a non-physical retention curve (e.g., logo retention rising over time within a cohort) | Bad fit method choice for the data shape | Numerical guard: projections are clamped monotonic-non-increasing for logo retention; NRR can rise (expansion). Catch in code, not LLM. |
| AGT-903 cost spike traceable to TOOL-013 pooling redundant cohort sets per query | Brain calling TOOL-013 multiple times in one session with overlapping cohort definitions | Caller responsibility. AGT-903 should consolidate cohort sets per session; not TOOL-013's concern. |
| Tool refuses on cohorts AGT-903 needs for the question | Cohort sizes legitimately too small, or AGT-501 cohort_brain_view contract not yet shipping the relevant cohorts | Surface gap to AGT-903; AGT-903 surfaces gap to the operator. Working as designed — refusal is correct when data doesn't support a forecast. |
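The monotonic guard from the non-physical-curve row above can be sketched as a running-minimum clamp, applied in code after the fit and before the LLM sees the numbers (function name illustrative):

```python
def clamp_logo_projection(values):
    """Enforce the physical guard: projected logo retention within a
    cohort can never rise over time. NRR is exempt — expansion can
    push it up — so this applies to logo-retention columns only."""
    clamped = []
    ceiling = 1.0  # retention starts at 100% of the cohort
    for v in values:
        ceiling = min(ceiling, v)  # running minimum so far
        clamped.append(ceiling)
    return clamped
```

Applied per percentile column (p10, p50, p90) independently, so the bands stay physically plausible as well.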
Interaction with AGT-903
Single-threaded audit trail. AGT-903 calls TOOL-013 via Anthropic tool-use; the call ID lands in tool_calls_made on the BrainAnalysisLog row, and the projection numbers cited in the StrategyRecommendationLog memo trace back via that ID. Per AGT-903's source-citation discipline (≥ 95% of numerical claims must cite a source), TOOL-013 outputs are first-class citation targets.
Pattern label propagation. The primary_pattern from TOOL-013 propagates into AGT-903's options_enumerated as a load-bearing input — e.g., a "degrading" pattern on the post-pivot cohorts in strategic_retrospective shapes whether flag_strategic_risk or propose_pricing_packaging_review is the right action. AGT-903 is required to articulate the alternative reading TOOL-013 surfaced; if the brain ignores the alternative, the eval fixture EVAL-S04 anti-confirmation-bias check fails.
Refusal contagion. When TOOL-013 refuses, AGT-903 is expected to either (a) surface the gap to the operator and decline the strategic question, or (b) proceed without TOOL-013 input and explicitly mark the affected claims at speculation confidence. Silently dropping the refusal and producing a confident narrative is a hard fail for AGT-903 (Sev-2 incident treatment per its spec).