TOOL-005 — Outbound Deliverability Monitor

Tier 3 Specialist Tool · Stateless · Reads outbound sequence performance signals, characterizes deliverability health, flags reputation risks · Closes Domain 4 gap from v26 architecture eval
Tier 3 · Tool · Specced · v31 Domain 4 · AI-native outbound · Haiku
Purpose

Reads outbound sequence performance data (bounce rates, reply quality, suppression patterns, sender reputation signals from email providers) and produces a deliverability health characterization with specific risk flags. Used by AGT-302 Cadence Coordinator to detect quality degradation before it cascades into reputation damage, and by AGT-303 Cadence Intelligence to inform sequence pause/throttle recommendations.

Closes Domain 4 gap from v26 eval. Today AGT-301/302/303 cover sequence generation, execution, and pattern analysis — but no component owns deliverability health specifically. This tool is the missing observability layer for outbound quality.
Input schema
```jsonc
{
  "scope": "global" | "sender_domain" | "sequence_template" | "rep",
  "scope_id": "string",                  // domain/template_id/rep_id per scope
  "time_window": { "start": "ISO 8601", "end": "ISO 8601" },
  "performance_signals": {
    "total_emails_sent": 0,
    "hard_bounce_rate": 0.0,             // % of total
    "soft_bounce_rate": 0.0,
    "spam_complaint_rate": 0.0,
    "unsubscribe_rate": 0.0,
    "reply_rate": 0.0,
    "open_rate": 0.0,                    // optional — pixel-tracking signal
    "auto_reply_rate": 0.0               // out-of-office / not-here patterns
  },
  "reputation_signals": {                // from email service provider APIs
    "google_postmaster_score": 0.0,      // 0-100 if available
    "microsoft_snds_status": "string",   // green/yellow/red
    "blacklist_check_results": [
      { "blacklist_name": "string", "listed": true | false }
    ]
  },
  "suppression_signals": {               // from CadenceEventLog suppressed_reason
    "top_suppression_reasons": [
      { "reason": "string", "count": 0, "pct_of_sends": 0.0 }
    ]
  },
  "trailing_baseline": {                 // 30-day baseline for comparison
    "hard_bounce_rate": 0.0,
    "spam_complaint_rate": 0.0,
    "reply_rate": 0.0
  }
}
```
Output schema
```jsonc
{
  "tool_call_id": "uuid",
  "deliverability_health": {
    "overall_status": "green" | "yellow" | "red" | "critical",
    "primary_concern": "string | null",  // top concern in one sentence
    "concern_severity": "low" | "medium" | "high" | "critical"
  },
  "risk_flags": [                        // ordered by severity
    {
      "flag_type": "high_hard_bounce" | "spam_complaint_spike" | "blacklist_listing" | "reply_rate_collapse" | "auto_reply_spike" | "list_quality_degraded" | "sender_reputation_decline" | "suppression_pattern_anomaly",
      "severity": "low" | "medium" | "high" | "critical",
      "evidence": "string",              // what specifically triggered the flag
      "recommended_action": "throttle_sends" | "pause_sequence" | "warm_up_sender" | "investigate_list_source" | "review_template_content" | "manual_review",
      "supporting_metric": { "name": "string", "value": 0.0, "baseline": 0.0 }
    }
  ],
  "comparison_to_baseline": {
    "bounce_rate_delta": 0.0,
    "spam_complaint_delta": 0.0,
    "reply_rate_delta": 0.0,
    "is_meaningful_deviation": true | false  // statistical, not just numerical
  },
  "interpretation_for_caller": "string", // 2-3 sentence narrative summary
  "ungrounded_assumptions": ["string"],
  "tool_metadata": {
    "model": "claude-haiku-4-5",
    "input_tokens": 0,
    "output_tokens": 0,
    "cost_usd_estimate": 0.0,
    "latency_ms": 0
  }
}
```
Hard rule: The tool never recommends an action it doesn't have evidence for. Recommended actions must trace to specific risk_flags + supporting metrics. No "general advice" output.
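The hard rule can be enforced mechanically before the output leaves the tool. A minimal validator sketch (the field names mirror the output schema above; the function itself is illustrative, not part of the spec):

```python
# Hypothetical post-generation check: reject any output whose recommended
# actions are not backed by a risk flag with concrete supporting evidence.

ALLOWED_ACTIONS = {
    "throttle_sends", "pause_sequence", "warm_up_sender",
    "investigate_list_source", "review_template_content", "manual_review",
}

def grounding_violations(output: dict) -> list[str]:
    """Return a list of grounding violations; an empty list means the output passes."""
    violations = []
    for i, flag in enumerate(output.get("risk_flags", [])):
        action = flag.get("recommended_action")
        if action not in ALLOWED_ACTIONS:
            violations.append(f"risk_flags[{i}]: unknown action {action!r}")
        if not flag.get("evidence"):
            violations.append(f"risk_flags[{i}]: action without evidence")
        metric = flag.get("supporting_metric") or {}
        if "value" not in metric or "baseline" not in metric:
            violations.append(f"risk_flags[{i}]: missing supporting_metric")
    # A stated concern with no flags behind it is "general advice",
    # which the hard rule forbids.
    if output["deliverability_health"]["primary_concern"] and not output.get("risk_flags"):
        violations.append("primary_concern stated with no risk_flags")
    return violations
```

Running this as a schema-adjacent gate means the "Recommendation grounding: 100%" eval criterion is checked on every call, not just in the eval set.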
Called by
| Caller | Invocation context |
| --- | --- |
| AGT-302 Cadence Coordinator | Daily batch on global + per-sender-domain scope. Output drives auto-throttle decisions when severity is high or critical (with RevOps notification, not silent action). |
| AGT-303 Cadence Intelligence | Weekly batch on per-sequence-template scope. Output feeds AGT-303's risk-classified recommendations to AGT-302 (3-business-day veto window applies). |
| RevOps direct (workspace UI) | Investigative — when reply rates drop or bounce rates spike, RevOps drops in the metrics and gets a structured diagnosis with recommended actions. |
| AGT-901 Pipeline Brain | For "is outbound quality a bottleneck?" diagnostic queries when investigating soft pipeline. |
Design principles
  1. Numerical work in code, characterization in LLM. Same pattern as TOOL-004. Statistical significance tests, baseline comparison, deviation calculations — all in code. The LLM reads the structured numerical input and produces interpretation + recommended action mapping.
  2. Severity drives recommendation, not just metric value. A 1% spam complaint rate is critical (industry threshold). A 4% hard bounce rate is high but not critical if list source explains it. Severity calibration considers context.
  3. Throttling, not pausing, on yellow. Pausing a sequence has real cost (broken pipeline). Reduce send volume first; pause only on critical flags.
  4. Reputation signals weighted heavily. ESP-reported reputation (Google Postmaster, Microsoft SNDS) is a leading indicator; reply-rate decline is a lagging one. Tool weights leading indicators higher.
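Principle 1's split can be illustrated with the baseline comparison: a two-proportion z-test (one plausible choice; the spec does not mandate a specific test) decides `is_meaningful_deviation` in code, so the LLM only ever interprets a precomputed verdict. The function name and default threshold are illustrative assumptions:

```python
import math

def meaningful_deviation(rate: float, baseline_rate: float,
                         n_window: int, n_baseline: int,
                         z_threshold: float = 2.58) -> bool:
    """Two-proportion z-test: is the window rate a statistically meaningful
    departure from the trailing baseline, not just numerically different?
    Rates are fractions (0.04 == 4%); n_* are send counts."""
    pooled = (rate * n_window + baseline_rate * n_baseline) / (n_window + n_baseline)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_window + 1 / n_baseline))
    if se == 0:
        return False
    z = (rate - baseline_rate) / se
    return abs(z) >= z_threshold  # ~99% confidence at the default threshold
```

Smaller windows need a larger absolute gap to clear the threshold, which is the same guard the failure-modes section applies to small-volume periods.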
Cost ceiling
| Constraint | Value |
| --- | --- |
| Per-call input budget | 8K tokens |
| Per-call output budget | 2K tokens |
| Default model | Haiku — pattern matching + flag generation; no synthesis-heavy reasoning |
| Per-call cost estimate | ~$0.01 per call |
| Monthly cap | $150/mo (~7,500 calls) |
| Frequency expectation | Daily batch from AGT-302 + weekly batch from AGT-303 + ad hoc. Most monthly usage is the daily batch. |
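A back-of-envelope check of the ceilings above, assuming for illustration $1 per million input tokens and $5 per million output tokens (these rates are an assumption, not part of the spec; substitute current model pricing):

```python
# Worst-case per-call cost from the token budgets, and the call volume the
# monthly cap supports at that worst case. Rates below are assumed.
INPUT_RATE_PER_MTOK = 1.00   # assumed USD per million input tokens
OUTPUT_RATE_PER_MTOK = 5.00  # assumed USD per million output tokens

def worst_case_call_cost(input_budget_tok: int = 8_000,
                         output_budget_tok: int = 2_000) -> float:
    """Cost if a call consumes its full input and output budgets."""
    return (input_budget_tok * INPUT_RATE_PER_MTOK
            + output_budget_tok * OUTPUT_RATE_PER_MTOK) / 1_000_000

def calls_under_cap(monthly_cap_usd: float = 150.0) -> int:
    return int(monthly_cap_usd / worst_case_call_cost())
```

Under these assumed rates the worst case is about $0.018/call (~8,300 calls under the cap); typical calls that do not exhaust both budgets land nearer the ~$0.01 average estimate, so ~7,500 calls/month leaves headroom.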
Eval criteria
| Criterion | Pass threshold |
| --- | --- |
| Schema compliance | 100% (hard) |
| Recommendation grounding | 100% — every recommended_action traces to a specific risk_flag (hard) |
| Severity calibration | 15 retrospective cases (known reputation incidents); ≥ 80% correctly identified as critical/high before reputation damage |
| False positive rate on green periods | ≤ 10% — over-flagging green periods creates alert fatigue and undermines the tool |
| P95 latency | ≤ 2s |
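The false-positive criterion reduces to simple arithmetic over labeled eval periods. A sketch of how the harness might score it (the case shape here is a hypothetical convention, not part of the spec):

```python
def green_false_positive_rate(cases: list[dict]) -> float:
    """Fraction of known-green eval periods the tool flagged as non-green.
    Each case: {"expected": "green"/"yellow"/..., "got": overall_status from tool}."""
    green = [c for c in cases if c["expected"] == "green"]
    if not green:
        return 0.0
    false_positives = sum(1 for c in green if c["got"] != "green")
    return false_positives / len(green)
```

The eval passes this criterion when the returned rate is ≤ 0.10 over the green-period set.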
Failure modes
| Symptom | Action |
| --- | --- |
| Tool flags false positives on small-volume periods (variance noise) | Statistical significance check before flag generation; minimum send-volume threshold (e.g., 1,000 sends in window) before recommending throttle/pause. |
| Tool misses a critical reputation event | Eval catches this via the 15-case set. Tighten reputation_signals weighting; add the missed pattern as a few-shot example. |
| Recommendations conflict (throttle vs. pause) | Severity hierarchy enforced in output — only one primary recommendation per call. Multiple flags can coexist; only one drives the action. |
| ESP reputation API offline | Tool degrades gracefully — returns output flagged with data_quality = 'low' on the reputation dimension. Other dimensions unaffected. |