TOOL-014 — Segment-LTV Decomposer

Tier 3 Specialist Tool · Stateless · Decomposes realized LTV by segment / vertical / ICP-tier into driver components (initial ACV, tenure, expansion realization, CAC, segment mix) with confidence-banded contribution per driver · Closes the LTV-attribution gap AGT-903 needs for capacity-reallocation and ICP-revision use cases
Tier 3 · Tool Specced · v37 Strategy · Multi-quarter Sonnet
Purpose

Reads per-account LTV roll-ups bucketed by segment / vertical / ICP-tier (sourced from AGT-201 icp_outcome_brain_view + AGT-501 cohort_brain_view + AGT-503 expansion_strategy_brain_view + AGT-105 capacity_strategy_brain_view) and decomposes observed LTV deltas across buckets into driver components: initial ACV, tenure-driven retention, expansion realization, CAC, and segment-mix shift. Output is per-bucket LTV with confidence-banded driver attribution and a comparative ranking. Used by AGT-903 Strategy Brain in icp_retrospective ("which dimensions of the rubric correlate with realized LTV?"), capacity_reallocation ("if I deploy 10 reps, where's the highest LTV-per-rep payoff?"), and vertical_entry ("what does opportunistic-fintech LTV look like vs. core-vertical LTV?").

Closes the LTV-attribution gap AGT-903 needs. Tier 1 services publish LTV rollups; that's deterministic. Decomposing observed LTV gaps between buckets into driver contributions — with honest uncertainty about which driver is load-bearing — is a distinct cognition step. Without TOOL-014, AGT-903 is forced either to estimate driver attribution itself (violating its no-recompute rule) or to hand the executive raw LTV deltas without explanation.
Input schema
{
  "tool_call_id": "uuid",
  "buckets": [
    {
      "bucket_label": "string",                // e.g., "MM-fintech-T1", "SMB-horizontal-T2"
      "bucket_dimensions": {
        "segment": "SMB" | "MM" | "ENT",
        "vertical": "string | null",
        "icp_tier": "T1" | "T2" | "T3" | null,
        "signup_quarter_window": ["YYYY-QN", "YYYY-QN"]   // start / end inclusive
      },
      "n_accounts": 0,                         // accounts in bucket; refusal threshold 25
      "ltv_components": {
        "initial_acv_mean": 0.0,
        "initial_acv_p50": 0.0,
        "initial_acv_p10": 0.0,
        "initial_acv_p90": 0.0,
        "observed_tenure_months_mean": 0.0,    // average months retained, observed window
        "observed_tenure_months_p50": 0.0,
        "expansion_realized_pct_mean": 0.0,    // 0.0–N (N>1 if net expansion)
        "expansion_realized_pct_p50": 0.0,
        "cac_per_logo_mean": 0.0,              // null if CAC data unavailable for bucket
        "ltv_observed_mean": 0.0,              // realized LTV to date
        "ltv_observed_p50": 0.0,
        "ltv_observed_p10": 0.0,
        "ltv_observed_p90": 0.0
      },
      "data_quality_notes": ["string"]
    }
    // ... 2 to 12 buckets
  ],
  "comparison_mode": "rank" | "pairwise" | "decompose_gap",
  // rank          = rank all buckets by LTV with driver attribution per bucket
  // pairwise      = decompose LTV gap between two named buckets (must pass exactly 2)
  // decompose_gap = decompose LTV gap between best and worst bucket (auto-selected)
  "drivers_to_include": [                      // subset of supported drivers
    "initial_acv", "tenure", "expansion", "cac", "segment_mix"
  ],
  "context": {
    "as_of_date": "ISO 8601",
    "calling_brain": "AGT-903",
    "calling_use_case": "icp_retrospective" | "capacity_reallocation" | "vertical_entry" | "nrr_durability" | "other"
  }
}
Refusal triggers: any bucket with n_accounts < 25; fewer than 2 buckets passed (decomposition is comparative by definition); or cac_per_logo_mean = null for any bucket when "cac" is requested as a driver. Refusal is structured: the tool returns refusal_reason rather than fabricating attribution on thin data.
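The refusal triggers above can be checked before any decomposition runs. A minimal sketch in Python, assuming the payload arrives as a parsed dict shaped like the input schema; the function name `refusal_reason` and the reason strings are hypothetical and not defined by this spec:

```python
from typing import Optional

MIN_ACCOUNTS = 25  # refusal threshold stated in the input schema


def refusal_reason(payload: dict) -> Optional[str]:
    """Return a refusal reason string, or None if the call may proceed.

    The reason strings are illustrative placeholders, not spec values.
    """
    buckets = payload.get("buckets", [])
    if len(buckets) < 2:
        return "fewer_than_2_buckets"

    thin = [b["bucket_label"] for b in buckets if b["n_accounts"] < MIN_ACCOUNTS]
    if thin:
        return "bucket_below_min_accounts: " + ", ".join(thin)

    # CAC is only a refusal trigger when the caller asked for it as a driver.
    if "cac" in payload.get("drivers_to_include", []):
        missing = [
            b["bucket_label"]
            for b in buckets
            if b["ltv_components"].get("cac_per_logo_mean") is None
        ]
        if missing:
            return "cac_requested_but_missing: " + ", ".join(missing)

    return None
```

Running the check first keeps a refusal cheap: no numerical work or LLM interpretation happens for a payload that cannot be decomposed.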
Output schema
{
  "tool_call_id": "uuid",
  "refusal_reason": "string | null",
  "per_bucket_ltv": [
    {
      "bucket_label": "string",
      "ltv_p50": 0.0,
      "ltv_p10": 0.0,
      "ltv_p90": 0.0,
      "rank": 0,                               // 1 = highest LTV
      "rank_confidence": "high" | "medium" | "low"   // low when bucket bands overlap heavily
    }
  ],
  "driver_attribution": [
    {
      "comparison_label": "string",            // e.g., "MM-fintech-T1 vs MM-horizontal-T2"
      "ltv_gap_observed": 0.0,                 // signed; positive = first bucket higher
      "ltv_gap_p10": 0.0,
      "ltv_gap_p90": 0.0,
      "driver_contributions": [
        {
          "driver": "initial_acv" | "tenure" | "expansion" | "cac" | "segment_mix",
          "contribution_p50": 0.0,             // signed dollars contributed to gap
          "contribution_p10": 0.0,
          "contribution_p90": 0.0,
          "load_bearing": true,                // true only if p10 and p90 share sign, i.e. the band excludes zero
          "rationale": "string"                // 1–2 sentence grounded explanation
        }
      ],
      "unattributed_residual_p50": 0.0,        // gap not explained by listed drivers
      "decomposition_quality": "high" | "medium" | "low"
    }
  ],
  "interpretation_for_caller": {
    "primary_finding": "string",
    "credible_alternative_finding": "string",  // mandatory: second reading the data also supports
    "what_would_change_the_finding": "string"  // falsifiable condition
  },
  "data_quality": {
    "buckets_evaluated": 0,
    "buckets_refused": 0,
    "smallest_bucket_size": 0,
    "drivers_with_full_data": [],
    "drivers_with_partial_data": [],
    "quality_assessment": "high" | "medium" | "low"
  },
  "ungrounded_assumptions": ["string"],
  "tool_metadata": {
    "model": "claude-sonnet-4-6",
    "input_tokens": 0,
    "output_tokens": 0,
    "cost_usd_estimate": 0.0,
    "latency_ms": 0
  }
}
Hard rule. The load_bearing flag is mechanically derived: it is true only when the driver's contribution band excludes zero and p10/p90 share sign. The LLM cannot mark a driver load-bearing on narrative grounds — the bands have to support it. Drivers with bands crossing zero are reported with load_bearing = false regardless of point estimate magnitude.
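The mechanical rule can be expressed in a few lines. A minimal Python sketch, assuming the flag is computed from the contribution_p10 / contribution_p90 fields of one driver; the function name is hypothetical, and treating a band that exactly touches zero as not load-bearing is an assumption the spec does not state:

```python
def is_load_bearing(contribution_p10: float, contribution_p90: float) -> bool:
    """True only when the p10-p90 band lies strictly on one side of zero."""
    lo, hi = sorted((contribution_p10, contribution_p90))
    # Assumption: a band endpoint equal to zero counts as crossing zero.
    return lo > 0.0 or hi < 0.0


# The point estimate plays no role: a wide band around a large p50 still fails.
print(is_load_bearing(12_000.0, 48_000.0))   # band entirely positive
print(is_load_bearing(-9_000.0, 61_000.0))   # band crosses zero
```

Because the function never sees contribution_p50 or the rationale text, the flag cannot be influenced by the size of the point estimate or by narrative.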
Mandatory credible alternative. Same discipline as TOOL-013. Every interpretation includes a second reading the data also supports + a falsifiable condition. Driver attribution is especially prone to over-confident causal narratives ("MM has higher LTV because of expansion"); the alternative reading guards against the brain inheriting a single-narrative input.
Driver taxonomy
| Driver | What it captures | Caveat |
| --- | --- | --- |
| initial_acv | Difference in average initial ACV at signup. Mechanical: first-period contract value on the bucket. | Often the largest single contributor. Easy to over-weight in narrative; the executive question is usually "and then what?", and the tenure/expansion drivers carry the rest. |
| tenure | Difference in observed tenure (months retained), holding initial ACV constant via stratification. | Tenure for buckets with short observed windows is right-censored: the tool reports observed tenure, not extrapolated lifetime. AGT-903 caller may layer TOOL-013 projections for forward tenure if needed. |
| expansion | Difference in expansion realization (NRR > 100% multiplier on initial ACV). | Volatile in small buckets; bands are typically wide. Tool surfaces the band; brain should not narrate as "expansion is the driver" when the band crosses zero. |
| cac | Difference in CAC per logo. Subtractive on LTV (higher CAC reduces LTV). | CAC data is bucket-conditional. If CAC is unavailable for a bucket, the driver is excluded with a note in data_quality. Tool does not estimate missing CAC. |
| segment_mix | Compositional effect: shift in sub-segment distribution within the bucket explains some/all of the gap. | Most-overlooked driver. When a "high-LTV bucket" is actually a different mix of sub-segments, attribution to ICP fit or sales motion is misleading. Tool computes segment_mix as a residual decomposition step; if it's load-bearing the brain must say so. |
Driver attribution uses standard counterfactual decomposition (Oaxaca–Blinder-style on LTV components, with bootstrap intervals on each contribution). The numerical method runs in code; the LLM characterizes which drivers are operationally meaningful for the caller's use case.
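To make the idea of counterfactual decomposition concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the tool's confirmed method: the `ltv` formula is a hypothetical stand-in, the order-averaged swap (a Shapley-style variant of counterfactual substitution) is swapped in for whatever Oaxaca–Blinder-style procedure the code actually runs, and both segment_mix and the bootstrap intervals are omitted.

```python
from itertools import permutations
from math import factorial

DRIVERS = ("initial_acv", "tenure", "expansion", "cac")


def ltv(c: dict) -> float:
    # Hypothetical LTV model for illustration only:
    # annual ACV x years retained x expansion multiplier, minus CAC.
    return c["initial_acv"] * (c["tenure"] / 12.0) * c["expansion"] - c["cac"]


def decompose_gap(a: dict, b: dict) -> dict:
    """Attribute ltv(a) - ltv(b) to drivers by averaging over swap orders.

    For each ordering, start from bucket b and swap one driver at a time to
    bucket a's value; the change in LTV at each swap is credited to that
    driver. Each ordering telescopes to the full gap, so the averaged
    contributions sum to the gap as well.
    """
    contrib = {d: 0.0 for d in DRIVERS}
    for order in permutations(DRIVERS):
        current = dict(b)
        for d in order:
            before = ltv(current)
            current[d] = a[d]
            contrib[d] += ltv(current) - before
    n_orders = factorial(len(DRIVERS))
    return {d: v / n_orders for d, v in contrib.items()}


# Made-up component values for two buckets (tenure in months).
bucket_a = {"initial_acv": 60_000.0, "tenure": 30.0, "expansion": 1.2, "cac": 20_000.0}
bucket_b = {"initial_acv": 40_000.0, "tenure": 24.0, "expansion": 1.0, "cac": 15_000.0}
for driver, dollars in decompose_gap(bucket_a, bucket_b).items():
    print(f"{driver:12s} {dollars:12.2f}")
```

Averaging over orderings avoids the path dependence of a single substitution sequence. Bootstrap intervals would wrap `decompose_gap` by recomputing it on resampled inputs and taking the 10th and 90th percentiles of each contribution.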
Called by
| Caller | Invocation context |
| --- | --- |
| AGT-903 Strategy Brain | Primary caller. Used in icp_retrospective (which ICP dimensions correlate with realized LTV?), capacity_reallocation (where's the highest LTV-per-rep payoff?), vertical_entry (what's opportunistic-vertical LTV vs core-vertical LTV?), nrr_durability (is concentration in a few high-LTV accounts driving the headline number?). Brain assembles bucket roll-ups from multiple Tier 1 strategy_brain_views, calls TOOL-014 with the assembled bucket set, integrates output into the StrategyRecommendationLog memo. |
| AGT-704 Business Review Orchestrator | Indirect: only via AGT-903 narrative jobs (annual planning, mid-year review). AGT-704 does not call TOOL-014 directly. |
| RevOps direct (workspace UI) | For ad-hoc LTV-decomposition investigations, e.g., "what's driving the 30% LTV gap between MM-T1 and MM-T2?" Bypasses AGT-903 only when the question is descriptive (not strategic-recommendation-shaped). |
Out-of-list callers. AGT-201 ICP Scorer and AGT-105 Sales Capacity Planning do not call TOOL-014. ICP rubric edits go through AGT-201's analyst-input redesign cycle; capacity-plan changes go through AGT-105's deterministic process. TOOL-014 informs strategic recommendations — it does not feed back into Tier 1 service computations.
Design principles
  1. Numerical decomposition, not LLM gut feel. Same as TOOL-004 / TOOL-013. Counterfactual decomposition + bootstrap intervals run in code. The LLM characterizes operational meaning; it does not invent driver contributions.
  2. Load-bearing is mechanical, not narrative. A driver is load-bearing only when its bootstrap band excludes zero with same-sign p10/p90. The LLM cannot promote a driver to load-bearing on storytelling grounds.
  3. Mandatory credible alternative. Same as TOOL-013. Driver narratives are especially prone to spurious causal stories ("we win in fintech because of expansion"); the alternative reading + falsifiable condition is a hard requirement.
  4. Refusal on thin buckets and missing drivers. Buckets < 25 accounts refuse. Drivers without supporting data (e.g., CAC for a bucket without CAC roll-up) are excluded with note, not estimated. Refusal cost ≈ one round-trip's tokens.
  5. Segment-mix is a first-class driver. Segment-mix attribution is the most-overlooked decomposition step and the most common source of false strategic narratives. Tool always evaluates segment_mix when bucket dimensions support it; the LLM must mention it when load-bearing.
  6. Right-censoring is honest. Tenure for buckets with short observed windows is reported as observed tenure, with explicit note that lifetime is right-censored. Tool does not extrapolate to lifetime LTV; that is TOOL-013's job (cohort projection) when the brain layers them.
  7. No external data. Tool reads only the bucket roll-ups passed by the caller. Audit trail traces every contribution back to specific bucket rows.
Cost ceiling
| Constraint | Value |
| --- | --- |
| Per-call input budget | 30K tokens (up to 12 buckets × multi-driver components × bootstrap inputs; bounded at 30K) |
| Per-call output budget | 4K tokens (per-bucket rank + driver attribution per comparison + interpretation) |
| Default model | Sonnet. Multi-bucket attribution and operational interpretation justify the step up from Haiku. Numerical decomposition runs in code. |
| Per-call cost estimate | ~$0.20 per call at Sonnet pricing |
| Monthly cap (default) | $150/mo. Bounds usage to ~750 calls/month, well above expected AGT-903 query volume |
| Frequency expectation | Similar to TOOL-013: AGT-903 invokes 10–30 queries/month, of which roughly half call TOOL-014 (often paired with TOOL-013 in the same query). Annual planning and capacity-reallocation cycles concentrate usage. |
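The headroom implied by these figures follows from simple arithmetic. A short Python sketch using only the numbers in the table; "roughly half" is taken as exactly half, which is an assumption:

```python
PER_CALL_COST_USD = 0.20   # per-call estimate from the table
MONTHLY_CAP_USD = 150.0    # default monthly cap from the table

# Calls the cap allows at the per-call estimate.
max_calls_per_month = round(MONTHLY_CAP_USD / PER_CALL_COST_USD)

# AGT-903 runs 10-30 queries/month; assume exactly half call TOOL-014.
expected_calls = (10 // 2, 30 // 2)
expected_spend_usd = tuple(n * PER_CALL_COST_USD for n in expected_calls)

print(max_calls_per_month)   # calls allowed under the cap
print(expected_calls)        # expected monthly call range
print(expected_spend_usd)    # expected monthly spend range in USD
```

Under these assumptions the expected volume sits at a small fraction of the cap, so the cap acts as a guard against runaway call patterns rather than a constraint on normal use.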
Eval criteria
| Criterion | Measurement | Pass threshold |
| --- | --- | --- |
| Schema compliance | Output validates against output schema | 100% (hard) |
| Decomposition closure | For each comparison: abs(ltv_gap_observed − sum(driver_contributions_p50) − unattributed_residual_p50) < 1% of abs(ltv_gap_observed) | 100% (hard); decomposition must add up |
| Load-bearing flag honesty | For 20 cases with deliberately inserted band-crossing-zero drivers, % where load_bearing = false on those drivers | 100% (hard) |
| Driver attribution calibration | For 20 retrospective cases with known true drivers (synthetic + simulated), % where the load-bearing driver flagged by the tool matches the synthetic ground truth | ≥ 75% |
| Refusal correctness | For deliberately undersized buckets (< 25) or missing CAC when requested, % where tool refuses with structured refusal_reason | 100% (hard) |
| Mandatory alternative | % of non-refusal outputs with non-empty credible_alternative_finding and non-empty what_would_change_the_finding | 100% (hard) |
| Segment-mix surfacing | For 10 cases where segment_mix is constructed to be load-bearing, % where output identifies segment_mix in load_bearing = true drivers and mentions it in primary_finding | ≥ 90% |
| P95 latency | End-to-end (decomposition + bootstrap + LLM interpretation) | ≤ 6s |
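The decomposition-closure criterion is mechanical and can be checked directly against one driver_attribution entry. A minimal Python sketch, assuming the entry is a parsed dict with the output-schema field names; the function name is hypothetical, and failing when the observed gap is exactly zero is an assumption about an edge case the spec does not cover:

```python
def decomposition_closes(comparison: dict, tolerance: float = 0.01) -> bool:
    """Check that driver contributions plus residual add up to the observed gap."""
    gap = comparison["ltv_gap_observed"]
    explained = sum(d["contribution_p50"] for d in comparison["driver_contributions"])
    error = abs(gap - explained - comparison["unattributed_residual_p50"])
    # With gap == 0 the right-hand side is 0, so the strict check fails (assumption).
    return error < tolerance * abs(gap)


# Made-up numbers: 60,000 + 30,000 explained, 5,000 left as residual.
example = {
    "ltv_gap_observed": 95_000.0,
    "driver_contributions": [
        {"driver": "initial_acv", "contribution_p50": 60_000.0},
        {"driver": "tenure", "contribution_p50": 30_000.0},
    ],
    "unattributed_residual_p50": 5_000.0,
}
print(decomposition_closes(example))
```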
Failure modes
| Symptom | Cause | Action |
| --- | --- | --- |
| Output marks expansion as load-bearing when contribution band crosses zero | LLM overriding the mechanical load_bearing rule | Hard fail. Eval enforces; rule output is source of truth for the flag. |
| Decomposition residual is > 5% of observed LTV gap | Bug in decomposition math or missing driver | Hard fail. Decomposition must close. If a driver is missing, it should be added or the gap should be classified as unattributed_residual. |
| Tool reports CAC contribution when bucket CAC was null on input | Tool fabricating CAC where data was absent | Hard fail. Drivers without supporting data are excluded with note, never estimated. |
| Output ignores segment_mix when bucket dimensions clearly span sub-segments | System prompt not enforcing segment_mix evaluation | Eval includes 10 segment-mix-load-bearing fixtures; ≥ 90% surfacing required. |
| Output produces a single-narrative interpretation without an alternative | System prompt not enforcing the mandatory credible-alternative field | Hard fail. Eval enforces non-empty credible_alternative_finding. |
| AGT-903 cost spike traceable to TOOL-014 being called per-bucket-pair instead of per-comparison | Brain calling tool many times for related comparisons in one session | Caller responsibility. AGT-903 should pass full bucket sets and use comparison_mode to do all comparisons in one call. |
Interaction with TOOL-013 and AGT-903

Layered with TOOL-013. TOOL-014 reports observed LTV with right-censoring honesty; for forward-projected lifetime LTV, AGT-903 is expected to layer TOOL-013 cohort retention forecasts on top. The two tools are not fungible: TOOL-013 projects retention curves forward, TOOL-014 attributes observed LTV gaps. AGT-903 calls one or both depending on whether the strategic question is retrospective (TOOL-014 alone often sufficient) or forward-looking (both).

Single-threaded audit trail. Same as TOOL-013. AGT-903 calls TOOL-014 via Anthropic tool-use; the call ID lands in tool_calls_made on BrainAnalysisLog; driver-attribution numbers cited in StrategyRecommendationLog memos trace back via that ID. AGT-903's source-citation discipline (≥ 95% of numerical claims must cite) treats TOOL-014 outputs as first-class citation targets.

Refusal contagion. When TOOL-014 refuses (thin buckets, missing CAC), AGT-903 either surfaces the gap and declines, or proceeds without TOOL-014 input and marks the affected claims at speculation-level confidence. Silently dropping the refusal and producing a confident driver narrative is a hard fail for AGT-903 (Sev-2 incident treatment per its spec).

Options-discipline propagation. When TOOL-014 surfaces multiple plausible load-bearing drivers (e.g., both expansion and segment_mix have bands excluding zero), AGT-903's options-enumeration in the StrategyRecommendationLog memo should reflect both readings. The mandatory credible_alternative_finding field is the explicit handoff the brain uses to populate the alternative option.