TOOL-004 — Consumption Forecasting / Runway Predictor
Tier 3 Specialist Tool · Stateless · Reads UsageMeteringLog trailing 90–180 days for one account+SKU, predicts overage timing + trend characterization · Closes UBB-specific gap from v26 eval
Tier 3 · Tool
Specced · v29
UBB · Post-sales
Haiku
Purpose
Reads the trailing 90–180 days of UsageMeteringLog for one (account_id, sku_id) pair and produces three things: a predicted overage date (or "no overage in next 60 days"), a confidence interval, and a trend characterization (linear / exponential / seasonal / cliff / flat). Used by the Account Brain when answering "is this real expansion or a one-time spike?", by AGT-503 to augment expansion trigger sensitivity, and by AGT-402 to weight expansion ACV probability in forecasting.
Closes UBB-specific gap from the v26 eval. Today the system can detect overage has happened (AGT-503 fires on threshold breach) but cannot predict when overage will happen or distinguish trend from spike. For a usage-based business, this is the difference between proactive expansion conversation and reactive damage control.
Input schema
{
  "account_id": "uuid",
  "sku_id": "string",
  "sku_type": "consumption" | "seat" | "hybrid",
  "metering_history": [            // from UsageMeteringLog brain-ready view
    {
      "period_start": "ISO 8601",
      "period_end": "ISO 8601",
      "units_consumed": 0.0,
      "commit_units": 0.0,         // null if pure pay-as-you-go
      "overage_units": 0.0,
      "active_seats": 0,           // for seat/hybrid
      "licensed_seats": 0,
      "audit_status": "verified" | "pending_recon"
    }
    // ... 90 to 180 days of period rows
  ],
  "context": {
    "contract_start_date": "ISO 8601",
    "contract_end_date": "ISO 8601",
    "renewal_date": "ISO 8601",
    "current_commit_total_period": 0.0,  // current period's commit
    "trailing_overage_count": 0          // how many prior periods had overage
  },
  "forecast_horizon_days": 60            // default 60; max 180
}
All metering_history rows must have audit_status in (verified, pending_recon). Disputed rows are excluded by the calling agent before invocation. The tool refuses if < 30 days of history is provided — insufficient signal for trend characterization.
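The refusal rules above can be sketched as a pre-flight check. This is a minimal illustration, not the spec'd implementation; the function name and error messages are assumptions:

```python
from datetime import datetime

MIN_HISTORY_DAYS = 30  # per the spec: refuse below 30 days of history
ALLOWED_AUDIT = {"verified", "pending_recon"}

def validate_metering_history(rows: list[dict]) -> None:
    """Defensive pre-flight check; the calling agent should already have
    excluded disputed rows. Raises ValueError when the tool must refuse."""
    if not rows:
        raise ValueError("empty metering_history")
    bad = sum(1 for r in rows if r["audit_status"] not in ALLOWED_AUDIT)
    if bad:
        raise ValueError(f"{bad} rows with disallowed audit_status")
    start = min(datetime.fromisoformat(r["period_start"]) for r in rows)
    end = max(datetime.fromisoformat(r["period_end"]) for r in rows)
    span = (end - start).days
    if span < MIN_HISTORY_DAYS:
        raise ValueError(f"only {span} days of history; need >= {MIN_HISTORY_DAYS}")
```

Running the check before the numerical forecaster keeps the refusal path cheap: no LLM tokens are spent on inputs the tool will reject anyway.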
Output schema
{
  "tool_call_id": "uuid",
  "account_id": "uuid",
  "sku_id": "string",
  "forecast_summary": {
    "predicted_overage": true | false,
    "predicted_overage_date": "ISO 8601 | null",
    "predicted_overage_amount_units": 0.0,
    "confidence_low_units": 0.0,   // lower 10th percentile of forecast
    "confidence_high_units": 0.0,  // upper 90th percentile
    "horizon_evaluated_days": 60
  },
  "trend_characterization": {
    "primary_pattern": "linear" | "exponential" | "seasonal" | "cliff" | "flat",
    "growth_rate_per_period": 0.0,  // unit-time normalized
    "volatility_score": 0.0,        // 0-1, period-over-period variance
    "seasonality_detected": true | false,
    "seasonality_period_days": 0,   // if detected
    "anomalies_detected": [         // standout periods worth flagging
      { "period_start": "ISO 8601", "deviation_pct": 0.0, "characterization": "string" }
    ]
  },
  "interpretation_for_caller": {
    "is_likely_real_expansion": true | false,
    "is_likely_one_time_spike": true | false,
    "is_likely_seasonal_recurrence": true | false,
    "rationale": "string"  // 1-2 sentence interpretation grounded in the trend
  },
  "data_quality": {
    "history_periods_provided": 0,
    "missing_periods_detected": 0,
    "pending_recon_periods": 0,
    "quality_assessment": "high" | "medium" | "low"
  },
  "ungrounded_assumptions": ["string"],
  "tool_metadata": {
    "model": "claude-haiku-4-5",
    "input_tokens": 0,
    "output_tokens": 0,
    "cost_usd_estimate": 0.0,
    "latency_ms": 0
  }
}
Hard rule: when data_quality.quality_assessment = 'low', the tool returns the forecast but with confidence intervals so wide that downstream callers know to discount it. The tool never produces a confident forecast on bad data — better to surface low confidence than fabricate certainty. AGT-503 reading a low-quality forecast should not bypass its own scheduled cycle.
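One way to make the hard rule mechanical is to grade quality from simple counts and then widen the percentile band by a quality-dependent factor. The thresholds and widening factors below are illustrative assumptions, not values from the spec:

```python
def assess_quality(history_periods: int, missing: int, pending_recon: int) -> str:
    """Heuristic grade for data_quality.quality_assessment.
    Thresholds are illustrative, not from the spec."""
    if history_periods < 60 or missing > history_periods * 0.1:
        return "low"
    if history_periods < 120 or pending_recon > history_periods * 0.2:
        return "medium"
    return "high"

def widen_interval(low: float, high: float, quality: str) -> tuple[float, float]:
    """Widen the 10th-90th percentile band so low-quality forecasts are
    visibly uncertain to downstream callers. Factors are assumptions."""
    factor = {"high": 1.0, "medium": 1.5, "low": 3.0}[quality]
    mid = (low + high) / 2
    half = (high - low) / 2 * factor
    return (mid - half, mid + half)
```

The point of the design is that callers never need a special "ignore this forecast" flag: a low-quality forecast simply arrives with an interval too wide to act on.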
Trend pattern taxonomy
| Pattern | What it looks like | Implication for caller |
| --- | --- | --- |
| linear | Steady period-over-period growth at a consistent rate | Real expansion. Forecast overage date is reliable. Open expansion play with confidence. |
| exponential | Accelerating growth — each period grows faster than the last | Often new use case ramping. Strong expansion signal. Forecast may underestimate; confidence intervals widen accordingly. |
| seasonal | Repeating period-over-period pattern aligned with a detected cycle (monthly, quarterly, annual) | Don't mistake this for expansion. May not require a play; may require renewal-timing adjustment. |
| cliff | Sudden discontinuous step (up or down) in a recent period without preceding ramp | Spike or churn signal, not expansion. AM should investigate cause — M&A, infrastructure migration, contract event. |
| flat | No directional trend; period-over-period variance within normal volatility | No expansion signal. No churn signal. Account is in steady state. |
When multiple patterns are present (e.g., seasonal pattern with linear underlying growth), primary_pattern identifies the dominant one and anomalies_detected captures the rest. The interpretation field disambiguates for the caller.
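The taxonomy above suggests a simple ordering for the in-code pattern detector: check for a cliff first (a single dominant step can fool a trend fit), then separate flat from trending, then linear from exponential. The sketch below is a toy classifier with illustrative thresholds; it deliberately omits seasonal, which per the spec requires ≥ 2 full cycles and a decomposition step:

```python
import statistics

def characterize_pattern(units: list[float]) -> str:
    """Toy classifier for primary_pattern; thresholds are illustrative
    assumptions, not tuned values. Assumes >= 2 periods of history.
    Seasonal detection is omitted (needs >= 2 full cycles)."""
    n = len(units)
    mean = statistics.fmean(units)
    deltas = [abs(b - a) for a, b in zip(units, units[1:])]
    # cliff first: one period-over-period jump dominates all others
    if deltas and max(deltas) > 4 * (statistics.fmean(deltas) + 1e-9):
        return "cliff"
    # least-squares slope vs. mean level separates flat from trending
    xbar = (n - 1) / 2
    slope = sum((x - xbar) * (u - mean) for x, u in enumerate(units)) \
        / sum((x - xbar) ** 2 for x in range(n))
    if abs(slope) < 0.01 * max(mean, 1e-9):
        return "flat"
    # exponential: recent deltas consistently dwarf early ones
    if len(deltas) >= 4 and slope > 0:
        early = statistics.fmean(deltas[: len(deltas) // 2])
        late = statistics.fmean(deltas[len(deltas) // 2:])
        if late > 2 * early:
            return "exponential"
    return "linear"
```

Note how the cliff check runs before the regression: a 6-flat-then-6-high series has a large positive slope and would otherwise be misread as linear growth, which is exactly the spike-vs-trend failure the eval penalizes hardest.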
Called by
| Caller | Invocation context |
| --- | --- |
| AGT-902 Account Brain | Most common caller. "Is this real expansion or a spike?" is a core AGT-902 use case (per spec EVAL-Q19, EVAL-Q20, EVAL-Q22). Brain reads UsageMeteringLog.account_brain_view, calls TOOL-004 for forecast + characterization, integrates into BrainAnalysisLog narrative. |
| AGT-503 Expansion Trigger | As augment to existing 5-signal scoring. When AGT-503 fires on consumption overage, it may call TOOL-004 to characterize the trend before deciding play eligibility — spike vs. trend matters for whether an expansion play opens. |
| AGT-402 Forecast Adjuster | For expansion ACV probability weighting in the bottoms-up forecast. AGT-402's third forecast component (expansion, added v23) uses TOOL-004 trend characterization to weight probability by trend strength, not just by ExpansionLog presence. |
Design principles
- Prediction, not extrapolation. The tool produces forecasts with confidence intervals, not point estimates. A point estimate masquerading as prediction is dishonest given the noisiness of consumption data.
- Pattern first, magnitude second. The tool's most valuable output is the characterization (linear vs. spike vs. seasonal). Calibrating the magnitude is secondary — getting the shape right is what determines play activation. A wrong shape with right magnitude is worse than right shape with conservative magnitude.
- Honest about data quality. 30 days of history with 5 missing periods is not the same as 180 days of clean data. The tool reports quality honestly; downstream callers are responsible for discounting low-quality forecasts.
- No external data. Tool reads only UsageMeteringLog input passed by the caller. Does not query external systems, does not enrich with industry benchmarks. Keeps the audit trail clean — every forecast traces to specific input rows.
- Numerical model, not LLM gut feel. The LLM portion is for pattern characterization and interpretation. The actual numerical forecasting uses standard time-series math (exponential smoothing, linear regression with seasonality decomposition). Output is the LLM's structured summary of the math, not the LLM's vibe-based prediction.
Implementation note for go-live: the "numerical model, not LLM gut feel" principle means the tool is partly numerical (pattern-detection algorithms running in code) and partly LLM (interpretation + structured output). The LLM does not invent the forecast number. This is enforceable in the implementation but not in the spec alone.
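The spec names exponential smoothing as one of the standard time-series methods for the numerical side. A minimal sketch, using Holt's linear-trend variant with a residual-based 10th–90th percentile band (matching the output schema's interval); alpha/beta defaults are illustrative, and a real implementation would add the seasonality decomposition the spec calls for:

```python
import statistics

def holt_forecast(units: list[float], horizon: int,
                  alpha: float = 0.3, beta: float = 0.1):
    """Holt's linear-trend exponential smoothing. Returns (point, low, high)
    lists over the horizon; the band is 10th-90th percentile under a
    normal-residual assumption and widens with forecast distance."""
    level, trend = units[0], units[1] - units[0]
    residuals = []
    for u in units[1:]:
        residuals.append(u - (level + trend))  # one-step-ahead error
        new_level = alpha * u + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    sd = statistics.stdev(residuals) if len(residuals) > 1 else 0.0
    z = 1.2816  # ~90th percentile of the standard normal
    point = [level + (h + 1) * trend for h in range(horizon)]
    band = [z * sd * (h + 1) ** 0.5 for h in range(horizon)]
    return (point,
            [p - b for p, b in zip(point, band)],
            [p + b for p, b in zip(point, band)])

def first_overage_day(point: list[float], commit_units: float):
    """First forecast day whose point estimate exceeds commit, else None."""
    for day, p in enumerate(point, start=1):
        if p > commit_units:
            return day
    return None
```

Because the band grows with `sd` and with horizon distance, noisy or short histories naturally yield wide intervals — which is the mechanism behind the low-quality-data hard rule rather than a bolted-on special case.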
Cost ceiling
| Constraint | Value |
| --- | --- |
| Per-call input budget | 15K tokens (180 days × daily granularity is typically 5–8K tokens; bounded at 15K for hybrid SKUs) |
| Per-call output budget | 2K tokens (structured forecast + interpretation) |
| Default model | Haiku — pattern characterization + interpretation; numerical forecasting runs in code, not in the LLM |
| Per-call cost estimate | ~$0.02 per call at Haiku pricing |
| Monthly cap (default) | $400/mo — bounds usage to ~20,000 calls/month |
| Frequency expectation | Highest of the four tools. Called per-account during meeting prep, on AGT-503 fires, in AGT-402 forecast cycles. Most monthly usage will come from AGT-402 batch invocations during weekly forecast snapshots. |
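A quick sanity check that the per-call estimate and the monthly cap are mutually consistent. The per-MTok rates below are assumptions (verify against current Haiku pricing); only the arithmetic is the point:

```python
# Assumed Haiku rates -- illustrative, verify against current pricing.
INPUT_USD_PER_TOKEN = 1.00 / 1_000_000
OUTPUT_USD_PER_TOKEN = 5.00 / 1_000_000

# Typical call: 5-8K input tokens (take 6K) + 2K output
typical_call = 6_000 * INPUT_USD_PER_TOKEN + 2_000 * OUTPUT_USD_PER_TOKEN   # ~ $0.016
# Worst case at the 15K input budget
worst_call = 15_000 * INPUT_USD_PER_TOKEN + 2_000 * OUTPUT_USD_PER_TOKEN    # ~ $0.025
# Calls the $400/mo cap buys at the ~$0.02 estimate
calls_at_cap = 400 / 0.02                                                   # ~ 20,000
```

Under these assumed rates the ~$0.02 estimate sits between the typical and worst-case call, and the $400 cap does bound usage near 20,000 calls/month as stated.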
Eval criteria
| Criterion | Measurement | Pass threshold |
| --- | --- | --- |
| Schema compliance | Output validates against output schema | 100% (hard) |
| Pattern characterization accuracy | 20 historical (account, SKU) cases with known retrospective trend; % where primary_pattern matches | ≥ 75% |
| Spike-vs-trend differentiation | 5 historical cases that were spikes (didn't repeat) and 5 that were trends (continued); % correctly classified | ≥ 80% — this is the most operationally important dimension |
| Forecast calibration | For 20 cases with known overage outcomes within the forecast horizon: % where actual outcome fell within the [confidence_low, confidence_high] interval | ≥ 80% |
| Low-quality data handling | For cases with deliberately gappy or short history, % where data_quality is correctly assessed as low or medium (not high) | 100% (hard) |
| Hallucinated patterns | % of outputs claiming seasonality_detected = TRUE without sufficient periods to detect seasonality (need ≥ 2 full cycles) | 0% (hard) |
| P95 latency | End-to-end tool call (numerical forecast + LLM interpretation) | ≤ 3s |
Failure modes
| Symptom | Cause | Action |
| --- | --- | --- |
| Tool calls a spike "exponential growth" | Pattern detection algorithm tuned too aggressively | Hard fail in eval. Tighten cliff-detection threshold. Spikes often have a single-period cliff signature distinct from genuine ramp. |
| Tool produces high-confidence forecast on 30 days of data | Confidence intervals too narrow at low data volumes | Confidence intervals must scale with history length. With 30 days of data, intervals should be wide enough that any reasonable outcome falls inside. |
| Seasonality false positive | Tool claims seasonal pattern after 60 days of history (insufficient cycles) | Hard rule: seasonality_detected requires ≥ 2 full cycles of history. Eval enforces. |
| Predicts overage that never materializes | Pattern characterization right, magnitude calibration off | Track via quarterly retrospective. If systematic over-prediction, recalibrate magnitude bias in the numerical forecaster. |
| AGT-402 forecast destabilizes after TOOL-004 integration | Tool's expansion ACV probability weights are too volatile period-over-period | AGT-402 should smooth its TOOL-004 inputs (rolling average), not use raw output directly. Caller responsibility, not TOOL-004's. |
| Cost spike | AGT-402 calling TOOL-004 per-account-per-week across the full customer base | Cost cap blocks. Audit: AGT-402 should batch and cache. Per-account forecast doesn't change much week-to-week; weekly recalc is overkill. |
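The AGT-402 destabilization row prescribes caller-side smoothing of raw TOOL-004 weights. A minimal sketch of what that rolling average might look like on the caller's side; the class name and window size are assumptions, not part of either spec:

```python
from collections import deque

class SmoothedWeight:
    """Rolling-average smoothing a caller like AGT-402 might apply to raw
    TOOL-004 trend-strength weights before folding them into the forecast.
    Window size is an illustrative assumption."""

    def __init__(self, window: int = 4):
        self.values = deque(maxlen=window)  # keeps only the last `window` raw weights

    def update(self, raw_weight: float) -> float:
        self.values.append(raw_weight)
        return sum(self.values) / len(self.values)
```

A single volatile TOOL-004 output then moves the effective weight by at most 1/window of its swing, which is the property that keeps the bottoms-up forecast from whipsawing period-over-period.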
Interaction with consuming services and brains
AGT-503 Expansion Trigger: existing v25 spec fires immediately on consumption overage threshold. Optional augment: AGT-503 calls TOOL-004 to confirm the overage is part of a trend before the +40 pts signal is locked in. If TOOL-004 returns cliff pattern with is_likely_one_time_spike = TRUE, AGT-503 may downgrade the signal weight. This is opt-in — the existing AGT-503 behavior continues to work without TOOL-004.
AGT-402 Forecast Adjuster: existing v23 spec includes expansion as a third forecast component. TOOL-004 augments by providing trend-strength weighting on each open expansion play in ExpansionLog. Plays anchored by linear or exponential trend get higher probability weights; plays anchored by cliff get lower.
AGT-902 Account Brain: most direct caller. Brain integrates TOOL-004 output into BrainAnalysisLog narrative for "real expansion or spike" questions. Quote-able output is the interpretation_for_caller.rationale field; the brain adds source-trace metadata to the BrainAnalysisLog row.