TOOL-014 — Segment-LTV Decomposer

Tier 3 Specialist Tool · Stateless · Decomposes realized LTV by segment / vertical / ICP-tier into driver components (initial ACV, tenure, expansion realization, CAC, segment mix) with confidence-banded contribution per driver · Closes the LTV-attribution gap AGT-903 needs for capacity-reallocation and ICP-revision use cases

Tier 3 · Tool Specced · v37 Strategy · Multi-quarter Sonnet

Purpose

Reads per-account LTV roll-ups bucketed by segment / vertical / ICP-tier (sourced from AGT-201 icp_outcome_brain_view + AGT-501 cohort_brain_view + AGT-503 expansion_strategy_brain_view + AGT-105 capacity_strategy_brain_view) and decomposes observed LTV deltas across buckets into driver components: initial ACV, tenure-driven retention, expansion realization, CAC, and segment-mix shift. Output is per-bucket LTV with confidence-banded driver attribution and a comparative ranking. Used by AGT-903 Strategy Brain in icp_retrospective ("which dimensions of the rubric correlate with realized LTV?"), capacity_reallocation ("if I deploy 10 reps, where's the highest LTV-per-rep payoff?"), and vertical_entry ("what does opportunistic-fintech LTV look like vs. core-vertical LTV?").

Closes the LTV-attribution gap AGT-903 needs. Tier 1 services publish LTV rollups; that's deterministic. Decomposing observed LTV gaps between buckets into driver contributions — with honest uncertainty about which driver is load-bearing — is a distinct cognition step. Without TOOL-014, AGT-903 is forced either to estimate driver attribution itself (violating its no-recompute rule) or to hand the executive raw LTV deltas without explanation.

Input schema

{
  "tool_call_id": "uuid",
  "buckets": [
    {
      "bucket_label": "string",                 // e.g., "MM-fintech-T1", "SMB-horizontal-T2"
      "bucket_dimensions": {
        "segment": "SMB" | "MM" | "ENT",
        "vertical": "string | null",
        "icp_tier": "T1" | "T2" | "T3" | null,
        "signup_quarter_window": ["YYYY-QN", "YYYY-QN"]   // start / end inclusive
      },
      "n_accounts": 0,                            // accounts in bucket; refusal threshold 25
      "ltv_components": {
        "initial_acv_mean": 0.0,
        "initial_acv_p50": 0.0,
        "initial_acv_p10": 0.0,
        "initial_acv_p90": 0.0,
        "observed_tenure_months_mean": 0.0,       // average months retained, observed window
        "observed_tenure_months_p50": 0.0,
        "expansion_realized_pct_mean": 0.0,        // 0.0–N (N>1 if net expansion)
        "expansion_realized_pct_p50": 0.0,
        "cac_per_logo_mean": 0.0,                  // null if CAC data unavailable for bucket
        "ltv_observed_mean": 0.0,                  // realized LTV to date
        "ltv_observed_p50": 0.0,
        "ltv_observed_p10": 0.0,
        "ltv_observed_p90": 0.0
      },
      "data_quality_notes": ["string"]
    }
    // ... 2 to 12 buckets
  ],
  "comparison_mode": "rank" | "pairwise" | "decompose_gap",
  // rank             = rank all buckets by LTV with driver attribution per bucket
  // pairwise         = decompose LTV gap between two named buckets (must pass exactly 2)
  // decompose_gap    = decompose LTV gap between best and worst bucket (auto-selected)
  "drivers_to_include": [                         // subset of supported drivers
    "initial_acv", "tenure", "expansion", "cac", "segment_mix"
  ],
  "context": {
    "as_of_date": "ISO 8601",
    "calling_brain": "AGT-903",
    "calling_use_case": "icp_retrospective" | "capacity_reallocation" | "vertical_entry" | "nrr_durability" | "other"
  }
}

Refusal triggers: any bucket with n_accounts < 25, fewer than 2 buckets passed (decomposition is comparative by definition), or cac_per_logo_mean = null for buckets when "cac" is requested as a driver. Refusal is structured; the tool returns refusal_reason rather than fabricating attribution on thin data.

Output schema

{
  "tool_call_id": "uuid",
  "refusal_reason": "string | null",
  "per_bucket_ltv": [
    {
      "bucket_label": "string",
      "ltv_p50": 0.0,
      "ltv_p10": 0.0,
      "ltv_p90": 0.0,
      "rank": 0,                                   // 1 = highest LTV
      "rank_confidence": "high" | "medium" | "low" // low when bucket bands overlap heavily
    }
  ],
  "driver_attribution": [
    {
      "comparison_label": "string",                // e.g., "MM-fintech-T1 vs MM-horizontal-T2"
      "ltv_gap_observed": 0.0,                     // signed; positive = first bucket higher
      "ltv_gap_p10": 0.0,
      "ltv_gap_p90": 0.0,
      "driver_contributions": [
        {
          "driver": "initial_acv" | "tenure" | "expansion" | "cac" | "segment_mix",
          "contribution_p50": 0.0,                 // signed dollars contributed to gap
          "contribution_p10": 0.0,
          "contribution_p90": 0.0,
          "load_bearing": true,                    // true if |p10|/|p90| both same-sign and band excludes zero
          "rationale": "string"                    // 1–2 sentence grounded explanation
        }
      ],
      "unattributed_residual_p50": 0.0,            // gap not explained by listed drivers
      "decomposition_quality": "high" | "medium" | "low"
    }
  ],
  "interpretation_for_caller": {
    "primary_finding": "string",
    "credible_alternative_finding": "string",      // mandatory — second reading the data also supports
    "what_would_change_the_finding": "string"      // falsifiable condition
  },
  "data_quality": {
    "buckets_evaluated": 0,
    "buckets_refused": 0,
    "smallest_bucket_size": 0,
    "drivers_with_full_data": [],
    "drivers_with_partial_data": [],
    "quality_assessment": "high" | "medium" | "low"
  },
  "ungrounded_assumptions": ["string"],
  "tool_metadata": {
    "model": "claude-sonnet-4-6",
    "input_tokens": 0, "output_tokens": 0,
    "cost_usd_estimate": 0.0,
    "latency_ms": 0
  }
}

Hard rule. The load_bearing flag is mechanically derived: it is true only when the driver's contribution band excludes zero and p10/p90 share sign. The LLM cannot mark a driver load-bearing on narrative grounds — the bands have to support it. Drivers with bands crossing zero are reported with load_bearing = false regardless of point estimate magnitude.

Mandatory credible alternative. Same discipline as TOOL-013. Every interpretation includes a second reading the data also supports + a falsifiable condition. Driver attribution is especially prone to over-confident causal narratives ("MM has higher LTV because of expansion"); the alternative reading guards against the brain inheriting a single-narrative input.

Driver taxonomy

Driver	What it captures	Caveat
initial_acv	Difference in average initial ACV at signup. Mechanical — first-period contract value on the bucket.	Often the largest single contributor. Easy to over-weight in narrative; the executive question is usually "and then what?" — tenure/expansion drivers carry the rest.
tenure	Difference in observed tenure (months retained), holding initial ACV constant via stratification.	Tenure for buckets with short observed windows is right-censored — tool reports observed tenure, not extrapolated lifetime. AGT-903 caller may layer TOOL-013 projections for forward tenure if needed.
expansion	Difference in expansion realization (NRR > 100% multiplier on initial ACV).	Volatile in small buckets — bands typically wide. Tool surfaces band; brain should not narrate as "expansion is the driver" when band crosses zero.
cac	Difference in CAC per logo. Subtractive on LTV (higher CAC reduces LTV).	CAC data is bucket-conditional. If CAC is unavailable for a bucket, driver is excluded with note in `data_quality`. Tool does not estimate missing CAC.
segment_mix	Compositional effect: shift in sub-segment distribution within the bucket explains some/all of the gap.	Most-overlooked driver. When a "high-LTV bucket" is actually a different mix of sub-segments, attribution to ICP fit or sales motion is misleading. Tool computes segment_mix as a residual decomposition step; if it's load-bearing the brain must say so.

Driver attribution uses standard counterfactual decomposition (Oaxaca–Blinder-style on LTV components, with bootstrap intervals on each contribution). The numerical method runs in code; the LLM characterizes which drivers are operationally meaningful for the caller's use case.

Called by

Caller	Invocation context
AGT-903 Strategy Brain	Primary caller. Used in `icp_retrospective` (which ICP dimensions correlate with realized LTV?), `capacity_reallocation` (where's the highest LTV-per-rep payoff?), `vertical_entry` (what's opportunistic-vertical LTV vs core-vertical LTV?), `nrr_durability` (is concentration in a few high-LTV accounts driving the headline number?). Brain assembles bucket roll-ups from multiple Tier 1 strategy_brain_views, calls TOOL-014 with the assembled bucket set, integrates output into the StrategyRecommendationLog memo.
AGT-704 Business Review Orchestrator	Indirect — only via AGT-903 narrative jobs (annual planning, mid-year review). AGT-704 does not call TOOL-014 directly.
RevOps direct (workspace UI)	For ad-hoc LTV-decomposition investigations — e.g., "what's driving the 30% LTV gap between MM-T1 and MM-T2?" Bypasses AGT-903 only when the question is descriptive (not strategic-recommendation-shaped).

Out-of-list callers. AGT-201 ICP Scorer and AGT-105 Sales Capacity Planning do not call TOOL-014. ICP rubric edits go through AGT-201's analyst-input redesign cycle; capacity-plan changes go through AGT-105's deterministic process. TOOL-014 informs strategic recommendations — it does not feed back into Tier 1 service computations.

Design principles

Numerical decomposition, not LLM gut feel. Same as TOOL-004 / TOOL-013. Counterfactual decomposition + bootstrap intervals run in code. The LLM characterizes operational meaning; it does not invent driver contributions.
Load-bearing is mechanical, not narrative. A driver is load-bearing only when its bootstrap band excludes zero with same-sign p10/p90. The LLM cannot promote a driver to load-bearing on storytelling grounds.
Mandatory credible alternative. Same as TOOL-013. Driver narratives are especially prone to spurious causal stories ("we win in fintech because of expansion"); the alternative reading + falsifiable condition is a hard requirement.
Refusal on thin buckets and missing drivers. Buckets < 25 accounts refuse. Drivers without supporting data (e.g., CAC for a bucket without CAC roll-up) are excluded with note, not estimated. Refusal cost ≈ one round-trip's tokens.
Segment-mix is a first-class driver. Segment-mix attribution is the most-overlooked decomposition step and the most common source of false strategic narratives. Tool always evaluates segment_mix when bucket dimensions support it; the LLM must mention it when load-bearing.
Right-censoring is honest. Tenure for buckets with short observed windows is reported as observed tenure, with explicit note that lifetime is right-censored. Tool does not extrapolate to lifetime LTV; that is TOOL-013's job (cohort projection) when the brain layers them.
No external data. Tool reads only the bucket roll-ups passed by the caller. Audit trail traces every contribution back to specific bucket rows.

Cost ceiling

Constraint	Value
Per-call input budget	30K tokens (up to 12 buckets × multi-driver components × bootstrap inputs; bounded at 30K)
Per-call output budget	4K tokens (per-bucket rank + driver attribution per comparison + interpretation)
Default model	Sonnet — multi-bucket attribution and operational interpretation justify the step up from Haiku. Numerical decomposition runs in code.
Per-call cost estimate	~$0.20 per call at Sonnet pricing
Monthly cap (default)	$150/mo — bounds usage to ~750 calls/month, well above expected AGT-903 query volume
Frequency expectation	Similar to TOOL-013: AGT-903 invokes 10–30 queries/month, of which roughly half call TOOL-014 (often paired with TOOL-013 in the same query). Annual planning and capacity-reallocation cycles concentrate usage.

Eval criteria

Criterion	Measurement	Pass threshold
Schema compliance	Output validates against output schema	100% (hard)
Decomposition closure	For each comparison: \|ltv_gap_observed − sum(driver_contributions_p50) − unattributed_residual_p50\| < 1% of \|ltv_gap_observed\|	100% (hard) — decomposition must add up
Load-bearing flag honesty	For 20 cases with deliberately inserted band-crossing-zero drivers, % where `load_bearing = false` on those drivers	100% (hard)
Driver attribution calibration	For 20 retrospective cases with known true drivers (synthetic + simulated), % where the load-bearing driver flagged by the tool matches the synthetic ground truth	≥ 75%
Refusal correctness	For deliberately undersized buckets (< 25) or missing CAC when requested, % where tool refuses with structured `refusal_reason`	100% (hard)
Mandatory alternative	% of non-refusal outputs with non-empty `credible_alternative_finding` and non-empty `what_would_change_the_finding`	100% (hard)
Segment-mix surfacing	For 10 cases where segment_mix is constructed to be load-bearing, % where output identifies segment_mix in `load_bearing = true` drivers and mentions it in `primary_finding`	≥ 90%
P95 latency	End-to-end (decomposition + bootstrap + LLM interpretation)	≤ 6s

Failure modes

Symptom	Cause	Action
Output marks expansion as load-bearing when contribution band crosses zero	LLM overriding the mechanical load_bearing rule	Hard fail. Eval enforces; rule output is source of truth for the flag.
Decomposition residual is > 5% of observed LTV gap	Bug in decomposition math or missing driver	Hard fail. Decomposition must close. If a driver is missing, it should be added or the gap should be classified as `unattributed_residual`.
Tool reports CAC contribution when bucket CAC was null on input	Tool fabricating CAC where data was absent	Hard fail. Drivers without supporting data are excluded with note, never estimated.
Output ignores segment_mix when bucket dimensions clearly span sub-segments	System prompt not enforcing segment_mix evaluation	Eval includes 10 segment-mix-load-bearing fixtures; ≥ 90% surfacing required.
Output produces a single-narrative interpretation without an alternative	System prompt not enforcing the mandatory credible-alternative field	Hard fail. Eval enforces non-empty `credible_alternative_finding`.
AGT-903 cost spike traceable to TOOL-014 being called per-bucket-pair instead of per-comparison	Brain calling tool many times for related comparisons in one session	Caller responsibility. AGT-903 should pass full bucket sets and use `comparison_mode` to do all comparisons in one call.

Interaction with TOOL-013 and AGT-903

Layered with TOOL-013. TOOL-014 reports observed LTV with right-censoring honesty; for forward-projected lifetime LTV, AGT-903 is expected to layer TOOL-013 cohort retention forecasts on top. The two tools are not fungible: TOOL-013 projects retention curves forward, TOOL-014 attributes observed LTV gaps. AGT-903 calls one or both depending on whether the strategic question is retrospective (TOOL-014 alone often sufficient) or forward-looking (both).

Single-threaded audit trail. Same as TOOL-013. AGT-903 calls TOOL-014 via Anthropic tool-use; the call ID lands in tool_calls_made on BrainAnalysisLog; driver-attribution numbers cited in StrategyRecommendationLog memos trace back via that ID. AGT-903's source-citation discipline (≥ 95% of numerical claims must cite) treats TOOL-014 outputs as first-class citation targets.

Refusal contagion. When TOOL-014 refuses (thin buckets, missing CAC), AGT-903 either surfaces the gap and declines or proceeds without TOOL-014 input and marks affected claims as speculation confidence. Silently dropping the refusal and producing a confident driver narrative is a hard fail for AGT-903 (Sev-2 incident treatment per its spec).

Options-discipline propagation. When TOOL-014 surfaces multiple plausible load-bearing drivers (e.g., both expansion and segment_mix have bands excluding zero), AGT-903's options-enumeration in the StrategyRecommendationLog memo should reflect both readings. The mandatory credible_alternative_finding field is the explicit handoff the brain uses to populate the alternative option.

TOOL-014 Segment-LTV Decomposer · Tier 3 Specialist Tool · Schema v37
Stateless. Reads bucket-rolled LTV components only. Logged via caller's BrainAnalysisLog with tool_calls_made entry.
Default model: Sonnet (LLM for characterization + interpretation; numerical decomposition + load-bearing rule in code).
Per-call budget: 30K input + 4K output. Monthly cap: $150/mo (~750 calls).
Closes: LTV-attribution gap that AGT-903 needs for capacity-reallocation, ICP-revision, and vertical-entry use cases — specced alongside TOOL-013 (cohort retention forecaster) per v37 changelog ripple.
Companion: Tier 3 Tools Index · AGT-903 Strategy Brain · TOOL-013 Cohort Retention Forecaster (forward-projection counterpart) · TOOL-004 Consumption Forecasting