TOOL-004 — Consumption Forecasting / Runway Predictor
Tier 3 Specialist Tool · Stateless · Reads UsageMeteringLog trailing 90–180 days for one account+SKU, predicts overage timing + trend characterization · Closes UBB-specific gap from v26 eval
Tier 3 · Tool
Specced · v29
UBB · Post-sales
Haiku
Purpose
Reads the trailing 90–180 days of UsageMeteringLog for one (account_id, sku_id) pair and produces three things: a predicted overage date (or "no overage in next 60 days"), a confidence interval, and a trend characterization (linear / exponential / seasonal / cliff / flat). Used by the Account Brain when answering "is this real expansion or a one-time spike?", by AGT-503 to augment expansion trigger sensitivity, and by AGT-402 to weight expansion ACV probability in forecasting.
Closes UBB-specific gap from the v26 eval. Today the system can detect overage has happened (AGT-503 fires on threshold breach) but cannot predict when overage will happen or distinguish trend from spike. For a usage-based business, this is the difference between proactive expansion conversation and reactive damage control.
Input schema
{
  "account_id": "uuid",
  "sku_id": "string",
  "sku_type": "consumption" | "seat" | "hybrid",
  "metering_history": [            // from UsageMeteringLog brain-ready view
    {
      "period_start": "ISO 8601",
      "period_end": "ISO 8601",
      "units_consumed": 0.0,
      "commit_units": 0.0,         // null if pure pay-as-you-go
      "overage_units": 0.0,
      "active_seats": 0,           // for seat/hybrid
      "licensed_seats": 0,
      "audit_status": "verified" | "pending_recon"
    }
    // ... 90 to 180 days of period rows
  ],
  "context": {
    "contract_start_date": "ISO 8601",
    "contract_end_date": "ISO 8601",
    "renewal_date": "ISO 8601",
    "current_commit_total_period": 0.0,  // current period's commit
    "trailing_overage_count": 0          // how many prior periods had overage
  },
  "forecast_horizon_days": 60            // default 60; max 180
}
All metering_history rows must have audit_status in (verified, pending_recon). Disputed rows are excluded by the calling agent before invocation. The tool refuses if < 30 days of history is provided — insufficient signal for trend characterization.
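The refusal rules above can be sketched as a pre-flight check. This is a minimal illustration, not the spec'd implementation; the function name and error messages are assumptions:

```python
from datetime import datetime

MIN_HISTORY_DAYS = 30  # per the spec: refuse below 30 days of history
ALLOWED_AUDIT = {"verified", "pending_recon"}

def validate_metering_history(rows: list[dict]) -> None:
    """Defensive pre-flight check; the calling agent should already have
    excluded disputed rows. Raises ValueError when the tool must refuse."""
    if not rows:
        raise ValueError("empty metering_history")
    bad = sum(1 for r in rows if r["audit_status"] not in ALLOWED_AUDIT)
    if bad:
        raise ValueError(f"{bad} rows with disallowed audit_status")
    start = min(datetime.fromisoformat(r["period_start"]) for r in rows)
    end = max(datetime.fromisoformat(r["period_end"]) for r in rows)
    span = (end - start).days
    if span < MIN_HISTORY_DAYS:
        raise ValueError(f"only {span} days of history; need >= {MIN_HISTORY_DAYS}")
```

Running the check before the numerical forecaster keeps the refusal path cheap: no LLM tokens are spent on inputs the tool will reject anyway.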
Output schema
{
  "tool_call_id": "uuid",
  "account_id": "uuid",
  "sku_id": "string",
  "forecast_summary": {
    "predicted_overage": true | false,
    "predicted_overage_date": "ISO 8601 | null",
    "predicted_overage_amount_units": 0.0,
    "confidence_low_units": 0.0,   // lower 10th percentile of forecast
    "confidence_high_units": 0.0,  // upper 90th percentile
    "horizon_evaluated_days": 60
  },
  "trend_characterization": {
    "primary_pattern": "linear" | "exponential" | "seasonal" | "cliff" | "flat",
    "growth_rate_per_period": 0.0,  // unit-time normalized
    "volatility_score": 0.0,        // 0-1, period-over-period variance
    "seasonality_detected": true | false,
    "seasonality_period_days": 0,   // if detected
    "anomalies_detected": [         // standout periods worth flagging
      { "period_start": "ISO 8601", "deviation_pct": 0.0, "characterization": "string" }
    ]
  },
  "interpretation_for_caller": {
    "is_likely_real_expansion": true | false,
    "is_likely_one_time_spike": true | false,
    "is_likely_seasonal_recurrence": true | false,
    "rationale": "string"  // 1-2 sentence interpretation grounded in the trend
  },
  "data_quality": {
    "history_periods_provided": 0,
    "missing_periods_detected": 0,
    "pending_recon_periods": 0,
    "quality_assessment": "high" | "medium" | "low"
  },
  "ungrounded_assumptions": ["string"],
  "tool_metadata": {
    "model": "claude-haiku-4-5",
    "input_tokens": 0,
    "output_tokens": 0,
    "cost_usd_estimate": 0.0,
    "latency_ms": 0
  }
}
Hard rule: when data_quality.quality_assessment = 'low', the tool returns the forecast but with confidence intervals so wide that downstream callers know to discount it. The tool never produces a confident forecast on bad data — better to surface low confidence than fabricate certainty. AGT-503 reading a low-quality forecast should not bypass its own scheduled cycle.
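One way to make the hard rule mechanical is to grade quality from simple counts and then widen the percentile band by a quality-dependent factor. The thresholds and widening factors below are illustrative assumptions, not values from the spec:

```python
def assess_quality(history_periods: int, missing: int, pending_recon: int) -> str:
    """Heuristic grade for data_quality.quality_assessment.
    Thresholds are illustrative, not from the spec."""
    if history_periods < 60 or missing > history_periods * 0.1:
        return "low"
    if history_periods < 120 or pending_recon > history_periods * 0.2:
        return "medium"
    return "high"

def widen_interval(low: float, high: float, quality: str) -> tuple[float, float]:
    """Widen the 10th-90th percentile band so low-quality forecasts are
    visibly uncertain to downstream callers. Factors are assumptions."""
    factor = {"high": 1.0, "medium": 1.5, "low": 3.0}[quality]
    mid = (low + high) / 2
    half = (high - low) / 2 * factor
    return (mid - half, mid + half)
```

The point of the design is that callers never need a special "ignore this forecast" flag: a low-quality forecast simply arrives with an interval too wide to act on.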
Trend pattern taxonomy
| Pattern | What it looks like | Implication for caller |
| --- | --- | --- |
| linear | Steady period-over-period growth at a consistent rate | Real expansion. Forecast overage date is reliable. Open expansion play with confidence. |
| exponential | Accelerating growth — each period grows faster than the last | Often new use case ramping. Strong expansion signal. Forecast may underestimate; confidence intervals widen accordingly. |
| seasonal | Repeating period-over-period pattern aligned with a detected cycle (monthly, quarterly, annual) | Don't mistake this for expansion. May not require a play; may require renewal-timing adjustment. |
| cliff | Sudden discontinuous step (up or down) in a recent period without preceding ramp | Spike or churn signal, not expansion. AM should investigate cause — M&A, infrastructure migration, contract event. |
| flat | No directional trend; period-over-period variance within normal volatility | No expansion signal. No churn signal. Account is in steady state. |
When multiple patterns are present (e.g., seasonal pattern with linear underlying growth), primary_pattern identifies the dominant one and anomalies_detected captures the rest. The interpretation field disambiguates for the caller.
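The taxonomy above suggests a simple ordering for the in-code pattern detector: check for a cliff first (a single dominant step can fool a trend fit), then separate flat from trending, then linear from exponential. The sketch below is a toy classifier with illustrative thresholds; it deliberately omits seasonal, which per the spec requires ≥ 2 full cycles and a decomposition step:

```python
import statistics

def characterize_pattern(units: list[float]) -> str:
    """Toy classifier for primary_pattern; thresholds are illustrative
    assumptions, not tuned values. Assumes >= 2 periods of history.
    Seasonal detection is omitted (needs >= 2 full cycles)."""
    n = len(units)
    mean = statistics.fmean(units)
    deltas = [abs(b - a) for a, b in zip(units, units[1:])]
    # cliff first: one period-over-period jump dominates all others
    if deltas and max(deltas) > 4 * (statistics.fmean(deltas) + 1e-9):
        return "cliff"
    # least-squares slope vs. mean level separates flat from trending
    xbar = (n - 1) / 2
    slope = sum((x - xbar) * (u - mean) for x, u in enumerate(units)) \
        / sum((x - xbar) ** 2 for x in range(n))
    if abs(slope) < 0.01 * max(mean, 1e-9):
        return "flat"
    # exponential: recent deltas consistently dwarf early ones
    if len(deltas) >= 4 and slope > 0:
        early = statistics.fmean(deltas[: len(deltas) // 2])
        late = statistics.fmean(deltas[len(deltas) // 2:])
        if late > 2 * early:
            return "exponential"
    return "linear"
```

Note how the cliff check runs before the regression: a 6-flat-then-6-high series has a large positive slope and would otherwise be misread as linear growth, which is exactly the spike-vs-trend failure the eval penalizes hardest.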
Called by
| Caller | Invocation context |
| --- | --- |
| AGT-902 Account Brain | Most common caller. "Is this real expansion or a spike?" is a core AGT-902 use case (per spec EVAL-Q19, EVAL-Q20, EVAL-Q22). Brain reads UsageMeteringLog.account_brain_view, calls TOOL-004 for forecast + characterization, integrates into BrainAnalysisLog narrative. |
| AGT-503 Expansion Trigger | As augment to existing 5-signal scoring. When AGT-503 fires on consumption overage, it may call TOOL-004 to characterize the trend before deciding play eligibility — spike vs. trend matters for whether an expansion play opens. |
| AGT-402 Forecast Adjuster | For expansion ACV probability weighting in the bottoms-up forecast. AGT-402's third forecast component (expansion, added v23) uses TOOL-004 trend characterization to weight probability by trend strength, not just by ExpansionLog presence. |
Design principles
- Prediction, not extrapolation. The tool produces forecasts with confidence intervals, not point estimates. A point estimate masquerading as prediction is dishonest given the noisiness of consumption data.
- Pattern first, magnitude second. The tool's most valuable output is the characterization (linear vs. spike vs. seasonal). Calibrating the magnitude is secondary — getting the shape right is what determines play activation. A wrong shape with right magnitude is worse than right shape with conservative magnitude.
- Honest about data quality. 30 days of history with 5 missing periods is not the same as 180 days of clean data. The tool reports quality honestly; downstream callers are responsible for discounting low-quality forecasts.
- No external data. Tool reads only UsageMeteringLog input passed by the caller. Does not query external systems, does not enrich with industry benchmarks. Keeps the audit trail clean — every forecast traces to specific input rows.
- Numerical model, not LLM gut feel. The LLM portion is for pattern characterization and interpretation. The actual numerical forecasting uses standard time-series math (exponential smoothing, linear regression with seasonality decomposition). Output is the LLM's structured summary of the math, not the LLM's vibe-based prediction.
Implementation note for go-live: the "numerical model, not LLM gut feel" principle means the tool is partly numerical (pattern-detection algorithms running in code) and partly LLM (interpretation + structured output). The LLM does not invent the forecast number. This is enforceable in the implementation but not in the spec alone.
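The spec names exponential smoothing as one of the standard time-series methods for the numerical side. A minimal sketch, using Holt's linear-trend variant with a residual-based 10th–90th percentile band (matching the output schema's interval); alpha/beta defaults are illustrative, and a real implementation would add the seasonality decomposition the spec calls for:

```python
import statistics

def holt_forecast(units: list[float], horizon: int,
                  alpha: float = 0.3, beta: float = 0.1):
    """Holt's linear-trend exponential smoothing. Returns (point, low, high)
    lists over the horizon; the band is 10th-90th percentile under a
    normal-residual assumption and widens with forecast distance."""
    level, trend = units[0], units[1] - units[0]
    residuals = []
    for u in units[1:]:
        residuals.append(u - (level + trend))  # one-step-ahead error
        new_level = alpha * u + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    sd = statistics.stdev(residuals) if len(residuals) > 1 else 0.0
    z = 1.2816  # ~90th percentile of the standard normal
    point = [level + (h + 1) * trend for h in range(horizon)]
    band = [z * sd * (h + 1) ** 0.5 for h in range(horizon)]
    return (point,
            [p - b for p, b in zip(point, band)],
            [p + b for p, b in zip(point, band)])

def first_overage_day(point: list[float], commit_units: float):
    """First forecast day whose point estimate exceeds commit, else None."""
    for day, p in enumerate(point, start=1):
        if p > commit_units:
            return day
    return None
```

Because the band grows with `sd` and with horizon distance, noisy or short histories naturally yield wide intervals — which is the mechanism behind the low-quality-data hard rule rather than a bolted-on special case.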
Cost ceiling
| Constraint | Value |
| --- | --- |
| Per-call input budget | 15K tokens (180 days × daily granularity is typically 5–8K tokens; bounded at 15K for hybrid SKUs) |
| Per-call output budget | 2K tokens (structured forecast + interpretation) |
| Default model | Haiku — pattern characterization + interpretation; numerical forecasting runs in code, not in the LLM |
| Per-call cost estimate | ~$0.02 per call at Haiku pricing |
| Monthly cap (default) | $400/mo — bounds usage to ~20,000 calls/month |
| Frequency expectation | Highest of the four tools. Called per-account during meeting prep, on AGT-503 fires, in AGT-402 forecast cycles. Most monthly usage will come from AGT-402 batch invocations during weekly forecast snapshots. |
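A quick sanity check that the per-call estimate and the monthly cap are mutually consistent. The per-MTok rates below are assumptions (verify against current Haiku pricing); only the arithmetic is the point:

```python
# Assumed Haiku rates -- illustrative, verify against current pricing.
INPUT_USD_PER_TOKEN = 1.00 / 1_000_000
OUTPUT_USD_PER_TOKEN = 5.00 / 1_000_000

# Typical call: 5-8K input tokens (take 6K) + 2K output
typical_call = 6_000 * INPUT_USD_PER_TOKEN + 2_000 * OUTPUT_USD_PER_TOKEN   # ~ $0.016
# Worst case at the 15K input budget
worst_call = 15_000 * INPUT_USD_PER_TOKEN + 2_000 * OUTPUT_USD_PER_TOKEN    # ~ $0.025
# Calls the $400/mo cap buys at the ~$0.02 estimate
calls_at_cap = 400 / 0.02                                                   # ~ 20,000
```

Under these assumed rates the ~$0.02 estimate sits between the typical and worst-case call, and the $400 cap does bound usage near 20,000 calls/month as stated.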
Eval criteria
| Criterion | Measurement | Pass threshold |
| --- | --- | --- |
| Schema compliance | Output validates against output schema | 100% (hard) |
| Pattern characterization accuracy | 20 historical (account, SKU) cases with known retrospective trend; % where primary_pattern matches | ≥ 75% |
| Spike-vs-trend differentiation | 5 historical cases that were spikes (didn't repeat) and 5 that were trends (continued); % correctly classified | ≥ 80% — this is the most operationally important dimension |
| Forecast calibration | For 20 cases with known overage outcomes within the forecast horizon: % where actual outcome fell within the [confidence_low, confidence_high] interval | ≥ 80% |
| Low-quality data handling | For cases with deliberately gappy or short history, % where data_quality is correctly assessed as low or medium (not high) | 100% (hard) |
| Hallucinated patterns | % of outputs claiming seasonality_detected = TRUE without sufficient periods to detect seasonality (need ≥ 2 full cycles) | 0% (hard) |
| P95 latency | End-to-end tool call (numerical forecast + LLM interpretation) | ≤ 3s |
Failure modes
| Symptom | Cause | Action |
| --- | --- | --- |
| Tool calls a spike "exponential growth" | Pattern detection algorithm tuned too aggressively | Hard fail in eval. Tighten cliff-detection threshold. Spikes often have a single-period cliff signature distinct from genuine ramp. |
| Tool produces high-confidence forecast on 30 days of data | Confidence intervals too narrow at low data volumes | Confidence intervals must scale with history length. With 30 days of data, intervals should be wide enough that any reasonable outcome falls inside. |
| Seasonality false positive | Tool claims seasonal pattern after 60 days of history (insufficient cycles) | Hard rule: seasonality_detected requires ≥ 2 full cycles of history. Eval enforces. |
| Predicts overage that never materializes | Pattern characterization right, magnitude calibration off | Track via quarterly retrospective. If systematic over-prediction, recalibrate magnitude bias in the numerical forecaster. |
| AGT-402 forecast destabilizes after TOOL-004 integration | Tool's expansion ACV probability weights are too volatile period-over-period | AGT-402 should smooth its TOOL-004 inputs (rolling average), not use raw output directly. Caller responsibility, not TOOL-004's. |
| Cost spike | AGT-402 calling TOOL-004 per-account-per-week across the full customer base | Cost cap blocks. Audit: AGT-402 should batch and cache. Per-account forecast doesn't change much week-to-week; weekly recalc is overkill. |
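The AGT-402 destabilization row prescribes caller-side smoothing of raw TOOL-004 weights. A minimal sketch of what that rolling average might look like on the caller's side; the class name and window size are assumptions, not part of either spec:

```python
from collections import deque

class SmoothedWeight:
    """Rolling-average smoothing a caller like AGT-402 might apply to raw
    TOOL-004 trend-strength weights before folding them into the forecast.
    Window size is an illustrative assumption."""

    def __init__(self, window: int = 4):
        self.values = deque(maxlen=window)  # keeps only the last `window` raw weights

    def update(self, raw_weight: float) -> float:
        self.values.append(raw_weight)
        return sum(self.values) / len(self.values)
```

A single volatile TOOL-004 output then moves the effective weight by at most 1/window of its swing, which is the property that keeps the bottoms-up forecast from whipsawing period-over-period.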
Interaction with consuming services and brains
AGT-503 Expansion Trigger: existing v25 spec fires immediately on consumption overage threshold. Optional augment: AGT-503 calls TOOL-004 to confirm the overage is part of a trend before the +40 pts signal is locked in. If TOOL-004 returns cliff pattern with is_likely_one_time_spike = TRUE, AGT-503 may downgrade the signal weight. This is opt-in — the existing AGT-503 behavior continues to work without TOOL-004.
AGT-402 Forecast Adjuster: existing v23 spec includes expansion as a third forecast component. TOOL-004 augments by providing trend-strength weighting on each open expansion play in ExpansionLog. Plays anchored by linear or exponential trend get higher probability weights; plays anchored by cliff get lower.
AGT-902 Account Brain: most direct caller. Brain integrates TOOL-004 output into BrainAnalysisLog narrative for "real expansion or spike" questions. Quote-able output is the interpretation_for_caller.rationale field; the brain adds source-trace metadata to the BrainAnalysisLog row.