TOOL-004 — Consumption Forecasting / Runway Predictor

Tier 3 Specialist Tool · Stateless · Reads the trailing 90–180 days of UsageMeteringLog for one account + SKU, predicts overage timing and characterizes the trend · Closes the UBB-specific gap from the v26 eval
Tier 3 · Tool Specced · v29 UBB · Post-sales Haiku
Purpose

Reads UsageMeteringLog trailing 90–180 day pattern for one (account_id, sku_id) pair and produces: predicted overage date (or "no overage in next 60 days"), confidence interval, and trend characterization (linear / exponential / seasonal / cliff / flat). Used by the Account Brain when answering "is this real expansion or a one-time spike?", by AGT-503 to augment expansion trigger sensitivity, and by AGT-402 to weight expansion ACV probability in forecasting.

Closes UBB-specific gap from the v26 eval. Today the system can detect overage has happened (AGT-503 fires on threshold breach) but cannot predict when overage will happen or distinguish trend from spike. For a usage-based business, this is the difference between proactive expansion conversation and reactive damage control.
Input schema
```jsonc
{
  "account_id": "uuid",
  "sku_id": "string",
  "sku_type": "consumption" | "seat" | "hybrid",
  "metering_history": [              // from UsageMeteringLog brain-ready view
    {
      "period_start": "ISO 8601",
      "period_end": "ISO 8601",
      "units_consumed": 0.0,
      "commit_units": 0.0,           // null if pure pay-as-you-go
      "overage_units": 0.0,
      "active_seats": 0,             // for seat/hybrid
      "licensed_seats": 0,
      "audit_status": "verified" | "pending_recon"
    }
    // ... 90 to 180 days of period rows
  ],
  "context": {
    "contract_start_date": "ISO 8601",
    "contract_end_date": "ISO 8601",
    "renewal_date": "ISO 8601",
    "current_commit_total_period": 0.0,  // current period's commit
    "trailing_overage_count": 0          // how many prior periods had overage
  },
  "forecast_horizon_days": 60        // default 60; max 180
}
```
All metering_history rows must have audit_status in (verified, pending_recon). Disputed rows are excluded by the calling agent before invocation. The tool refuses if fewer than 30 days of history are provided; below that threshold there is insufficient signal for trend characterization.
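The pre-invocation guard described above can be sketched as follows. This is illustrative, not the shipped implementation; the function name and the assumption that `metering_history` arrives as a list of dicts are mine:

```python
from datetime import date

MIN_HISTORY_DAYS = 30  # below this the tool refuses: insufficient signal
ALLOWED_AUDIT_STATUSES = {"verified", "pending_recon"}

def validate_history(metering_history: list[dict]) -> None:
    """Reject inputs TOOL-004 would refuse (illustrative guard only)."""
    # Disputed rows should already have been excluded by the calling agent.
    bad = [r for r in metering_history
           if r["audit_status"] not in ALLOWED_AUDIT_STATUSES]
    if bad:
        raise ValueError(f"{len(bad)} rows have a disallowed audit_status")
    # The total span covered by the history must be at least 30 days.
    starts = [date.fromisoformat(r["period_start"][:10]) for r in metering_history]
    ends = [date.fromisoformat(r["period_end"][:10]) for r in metering_history]
    span_days = (max(ends) - min(starts)).days
    if span_days < MIN_HISTORY_DAYS:
        raise ValueError(f"only {span_days} days of history; need >= {MIN_HISTORY_DAYS}")
```

Note the guard measures the calendar span of the history, not the row count, so weekly-granularity rows are handled the same as daily ones.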
Output schema
```jsonc
{
  "tool_call_id": "uuid",
  "account_id": "uuid",
  "sku_id": "string",
  "forecast_summary": {
    "predicted_overage": true | false,
    "predicted_overage_date": "ISO 8601 | null",
    "predicted_overage_amount_units": 0.0,
    "confidence_low_units": 0.0,       // 10th percentile of forecast
    "confidence_high_units": 0.0,      // 90th percentile of forecast
    "horizon_evaluated_days": 60
  },
  "trend_characterization": {
    "primary_pattern": "linear" | "exponential" | "seasonal" | "cliff" | "flat",
    "growth_rate_per_period": 0.0,     // unit-time normalized
    "volatility_score": 0.0,           // 0-1, period-over-period variance
    "seasonality_detected": true | false,
    "seasonality_period_days": 0,      // if detected
    "anomalies_detected": [            // standout periods worth flagging
      {
        "period_start": "ISO 8601",
        "deviation_pct": 0.0,
        "characterization": "string"
      }
    ]
  },
  "interpretation_for_caller": {
    "is_likely_real_expansion": true | false,
    "is_likely_one_time_spike": true | false,
    "is_likely_seasonal_recurrence": true | false,
    "rationale": "string"              // 1-2 sentence interpretation grounded in the trend
  },
  "data_quality": {
    "history_periods_provided": 0,
    "missing_periods_detected": 0,
    "pending_recon_periods": 0,
    "quality_assessment": "high" | "medium" | "low"
  },
  "ungrounded_assumptions": ["string"],
  "tool_metadata": {
    "model": "claude-haiku-4-5",
    "input_tokens": 0,
    "output_tokens": 0,
    "cost_usd_estimate": 0.0,
    "latency_ms": 0
  }
}
```
Hard rule: when data_quality.quality_assessment = 'low', the tool still returns the forecast, but with confidence intervals wide enough that downstream callers know to discount it. The tool never produces a confident forecast on bad data; surfacing low confidence is better than fabricating certainty. AGT-503 reading a low-quality forecast should not bypass its own scheduled cycle.
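One way to enforce the hard rule is to scale the 10th–90th percentile band by data quality. A minimal sketch; the multipliers are illustrative assumptions, not calibrated values:

```python
def widen_interval(point: float, base_spread: float, quality: str) -> tuple[float, float]:
    """Widen the forecast band as quality drops (multipliers are assumed, not spec'd)."""
    mult = {"high": 1.0, "medium": 1.8, "low": 4.0}[quality]
    spread = base_spread * mult
    # Units consumed can't go negative, so clamp the lower bound at zero.
    return max(point - spread, 0.0), point + spread
```

With quality = "low" the band is 4x wider than the base spread, which is the signal downstream callers key on.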
Trend pattern taxonomy
| Pattern | What it looks like | Implication for caller |
| --- | --- | --- |
| linear | Steady period-over-period growth at a consistent rate | Real expansion. Forecast overage date is reliable. Open expansion play with confidence. |
| exponential | Accelerating growth: each period grows faster than the last | Often a new use case ramping. Strong expansion signal. Forecast may underestimate; confidence intervals widen accordingly. |
| seasonal | Repeating pattern aligned with a detected cycle (monthly, quarterly, annual) | Don't confuse with expansion. May not require a play; may require a renewal-timing adjustment. |
| cliff | Sudden discontinuous step (up or down) in a recent period without a preceding ramp | Spike or churn signal, not expansion. AM should investigate the cause: M&A, infrastructure migration, contract event. |
| flat | No directional trend; period-over-period variance within normal volatility | No expansion signal. No churn signal. Account is in steady state. |
When multiple patterns are present (e.g., seasonal pattern with linear underlying growth), primary_pattern identifies the dominant one and anomalies_detected captures the rest. The interpretation field disambiguates for the caller.
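The cliff pattern in the taxonomy above is the one spikes most often resemble, so a simple heuristic is worth sketching: a cliff is a sharp single-period jump over the trailing baseline with no preceding ramp. The thresholds (2x jump, 25% ramp tolerance) are assumptions for illustration; the spec says the real detector is tuned against eval cases:

```python
from statistics import median

def looks_like_cliff(units: list[float], jump_ratio: float = 2.0) -> bool:
    """Heuristic cliff detector: last period jumps sharply vs the trailing
    median, and the periods just before it were flat (no ramp)."""
    if len(units) < 4:
        return False  # too little history to distinguish cliff from noise
    *history, last = units
    base = median(history)
    if base <= 0:
        return False
    # A genuine ramp disqualifies a cliff: recent periods must sit near the baseline.
    ramp_free = all(abs(u - base) / base < 0.25 for u in history[-3:])
    return ramp_free and last / base >= jump_ratio
```

A steady ramp like [10, 14, 18, 22, 26] fails the ramp-free check and would fall through to the linear/exponential detectors instead.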
Called by
| Caller | Invocation context |
| --- | --- |
| AGT-902 Account Brain | Most common caller. "Is this real expansion or a spike?" is a core AGT-902 use case (per spec EVAL-Q19, EVAL-Q20, EVAL-Q22). Brain reads UsageMeteringLog.account_brain_view, calls TOOL-004 for forecast + characterization, integrates into the BrainAnalysisLog narrative. |
| AGT-503 Expansion Trigger | As an augment to the existing 5-signal scoring. When AGT-503 fires on consumption overage, it may call TOOL-004 to characterize the trend before deciding play eligibility; spike vs. trend matters for whether an expansion play opens. |
| AGT-402 Forecast Adjuster | For expansion ACV probability weighting in the bottoms-up forecast. AGT-402's third forecast component (expansion, added v23) uses TOOL-004 trend characterization to weight probability by trend strength, not just by ExpansionLog presence. |
Design principles
  1. Prediction, not extrapolation. The tool produces forecasts with confidence intervals, not point estimates. A point estimate masquerading as prediction is dishonest given the noisiness of consumption data.
  2. Pattern first, magnitude second. The tool's most valuable output is the characterization (linear vs. spike vs. seasonal). Calibrating magnitude is secondary; getting the shape right is what determines play activation. The wrong shape with the right magnitude is worse than the right shape with a conservative magnitude.
  3. Honest about data quality. 30 days of history with 5 missing periods is not the same as 180 days of clean data. The tool reports quality honestly; downstream callers are responsible for discounting low-quality forecasts.
  4. No external data. Tool reads only UsageMeteringLog input passed by the caller. Does not query external systems, does not enrich with industry benchmarks. Keeps the audit trail clean — every forecast traces to specific input rows.
  5. Numerical model, not LLM gut feel. The LLM portion is for pattern characterization and interpretation. The actual numerical forecasting uses standard time-series math (exponential smoothing, linear regression with seasonality decomposition). Output is the LLM's structured summary of the math, not the LLM's vibe-based prediction.
Implementation note for go-live: principle 5 means the tool is partly numerical (pattern-detection algorithms running in code) and partly LLM (interpretation + structured output). The LLM does not invent the forecast number. This is enforceable in the implementation but not in the spec alone.
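Principle 5's code-side math can be sketched with the simplest of the named techniques, linear regression extrapolated over the horizon (exponential smoothing and seasonality decomposition omitted). Function name and shape are illustrative:

```python
def linear_forecast(units: list[float], horizon: int) -> list[float]:
    """Ordinary least squares trend line, extrapolated `horizon` periods ahead.
    Sketch only: the shipped forecaster layers smoothing and seasonality on top."""
    n = len(units)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(units) / n
    # Closed-form OLS slope and intercept over period index vs. units consumed.
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, units)) / \
            sum((x - x_mean) ** 2 for x in xs)
    intercept = y_mean - slope * x_mean
    return [intercept + slope * (n + h) for h in range(horizon)]
```

The predicted overage date is then the first forecast period whose value exceeds the period commit; the LLM only summarizes that result, it never produces the numbers.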
Cost ceiling
| Constraint | Value |
| --- | --- |
| Per-call input budget | 15K tokens (180 days at daily granularity is typically 5–8K tokens; bounded at 15K for hybrid SKUs) |
| Per-call output budget | 2K tokens (structured forecast + interpretation) |
| Default model | Haiku for pattern characterization + interpretation; numerical forecasting runs in code, not in the LLM |
| Per-call cost estimate | ~$0.02 per call at Haiku pricing |
| Monthly cap (default) | $400/mo, bounding usage to ~20,000 calls/month |
| Frequency expectation | Highest of the four tools. Called per-account during meeting prep, on AGT-503 fires, in AGT-402 forecast cycles. Most monthly usage will come from AGT-402 batch invocations during weekly forecast snapshots. |
Eval criteria
| Criterion | Measurement | Pass threshold |
| --- | --- | --- |
| Schema compliance | Output validates against output schema | 100% (hard) |
| Pattern characterization accuracy | 20 historical (account, SKU) cases with known retrospective trend; % where primary_pattern matches | ≥ 75% |
| Spike-vs-trend differentiation | 5 historical cases that were spikes (didn't repeat) and 5 that were trends (continued); % correctly classified | ≥ 80% (the most operationally important dimension) |
| Forecast calibration | For 20 cases with known overage outcomes within the forecast horizon: % where the actual outcome fell within the [confidence_low, confidence_high] interval | ≥ 80% |
| Low-quality data handling | For cases with deliberately gappy or short history, % where data_quality is correctly assessed as low or medium (not high) | 100% (hard) |
| Hallucinated patterns | % of outputs claiming seasonality_detected = true without sufficient periods to detect seasonality (needs ≥ 2 full cycles) | 0% (hard) |
| P95 latency | End-to-end tool call (numerical forecast + LLM interpretation) | ≤ 3s |
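The hallucinated-patterns hard rule reduces to one arithmetic check, sketched here (function name is mine):

```python
def can_claim_seasonality(history_days: int, cycle_days: int) -> bool:
    """Eval hard rule: seasonality_detected requires >= 2 full cycles of history."""
    return cycle_days > 0 and history_days >= 2 * cycle_days
```

So a claimed quarterly (90-day) cycle needs 180 days of history, and 60 days of history can never support more than a 30-day cycle claim.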
Failure modes
| Symptom | Cause | Action |
| --- | --- | --- |
| Tool calls a spike "exponential growth" | Pattern-detection algorithm tuned too aggressively | Hard fail in eval. Tighten the cliff-detection threshold; spikes often have a single-period cliff signature distinct from a genuine ramp. |
| Tool produces a high-confidence forecast on 30 days of data | Confidence intervals too narrow at low data volumes | Confidence intervals must scale with history length. With 30 days of data, intervals should be wide enough that any reasonable outcome falls inside. |
| Seasonality false positive | Tool claims a seasonal pattern after 60 days of history (insufficient cycles) | Hard rule: seasonality_detected requires ≥ 2 full cycles of history. Eval enforces. |
| Predicts overage that never materializes | Pattern characterization right, magnitude calibration off | Track via quarterly retrospective. If over-prediction is systematic, recalibrate the magnitude bias in the numerical forecaster. |
| AGT-402 forecast destabilizes after TOOL-004 integration | Tool's expansion ACV probability weights are too volatile period-over-period | AGT-402 should smooth its TOOL-004 inputs (rolling average), not use the raw output directly. Caller responsibility, not TOOL-004's. |
| Cost spike | AGT-402 calling TOOL-004 per-account-per-week across the full customer base | Cost cap blocks. Audit: AGT-402 should batch and cache. Per-account forecasts don't change much week-to-week; weekly recalc is overkill. |
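The rolling-average mitigation for the AGT-402 destabilization failure mode is a trailing mean over recent snapshots. A minimal sketch, with the window width as an assumption (AGT-402's spec would own the real value):

```python
from collections import deque

def smooth(values: list[float], window: int = 4) -> list[float]:
    """Trailing rolling mean over the last `window` snapshots (width is assumed)."""
    buf = deque(maxlen=window)  # deque drops the oldest value automatically
    out = []
    for v in values:
        buf.append(v)
        out.append(sum(buf) / len(buf))
    return out
```

A probability series that flips between 0.2 and 0.8 each week settles near 0.5 after the first window fills, which is exactly the stabilization the action column calls for.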
Interaction with consuming services and brains

AGT-503 Expansion Trigger: the existing v25 spec fires immediately on a consumption overage threshold breach. Optional augment: AGT-503 calls TOOL-004 to confirm the overage is part of a trend before the +40 pts signal is locked in. If TOOL-004 returns a cliff pattern with is_likely_one_time_spike = true, AGT-503 may downgrade the signal weight. This is opt-in; the existing AGT-503 behavior continues to work without TOOL-004.
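The opt-in downgrade can be sketched as follows. The +40 pts base comes from the AGT-503 spec; the halving factor and function name are illustrative assumptions, since the spec leaves the downgraded weight to AGT-503:

```python
BASE_CONSUMPTION_SIGNAL_PTS = 40  # per the AGT-503 v25 spec

def adjusted_signal_pts(forecast: dict) -> int:
    """Opt-in: reduce the overage signal when TOOL-004 calls it a one-off.
    The halving is an assumed policy, not spec'd by TOOL-004."""
    trend = forecast["trend_characterization"]
    interp = forecast["interpretation_for_caller"]
    if trend["primary_pattern"] == "cliff" and interp["is_likely_one_time_spike"]:
        return BASE_CONSUMPTION_SIGNAL_PTS // 2
    return BASE_CONSUMPTION_SIGNAL_PTS
```

If TOOL-004 is unavailable or returns a low-quality forecast, AGT-503 simply keeps the base weight, preserving the pre-TOOL-004 behavior.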

AGT-402 Forecast Adjuster: existing v23 spec includes expansion as a third forecast component. TOOL-004 augments by providing trend-strength weighting on each open expansion play in ExpansionLog. Plays anchored by linear or exponential trend get higher probability weights; plays anchored by cliff get lower.

AGT-902 Account Brain: most direct caller. Brain integrates TOOL-004 output into BrainAnalysisLog narrative for "real expansion or spike" questions. Quote-able output is the interpretation_for_caller.rationale field; the brain adds source-trace metadata to the BrainAnalysisLog row.