TOOL-005 — Outbound Deliverability Monitor

Tier 3 Specialist Tool · Stateless · Reads outbound sequence performance signals, characterizes deliverability health, flags reputation risks · Closes Domain 4 gap from v26 architecture eval
Tier 3 · Tool · Specced · v31 Domain 4 · AI-native outbound · Haiku
Purpose

Reads outbound sequence performance data (bounce rates, reply quality, suppression patterns, sender reputation signals from email providers) and produces a deliverability health characterization with specific risk flags. Used by AGT-302 Cadence Coordinator to detect quality degradation before it cascades into reputation damage, and by AGT-303 Cadence Intelligence to inform sequence pause/throttle recommendations.

Closes Domain 4 gap from v26 eval. Today AGT-301/302/303 cover sequence generation, execution, and pattern analysis — but no component owns deliverability health specifically. This tool is the missing observability layer for outbound quality.
Input schema
```jsonc
{
  "scope": "global" | "sender_domain" | "sequence_template" | "rep",
  "scope_id": "string",                  // domain/template_id/rep_id per scope
  "time_window": { "start": "ISO 8601", "end": "ISO 8601" },
  "performance_signals": {
    "total_emails_sent": 0,
    "hard_bounce_rate": 0.0,             // % of total
    "soft_bounce_rate": 0.0,
    "spam_complaint_rate": 0.0,
    "unsubscribe_rate": 0.0,
    "reply_rate": 0.0,
    "open_rate": 0.0,                    // optional — pixel-tracking signal
    "auto_reply_rate": 0.0               // out-of-office / not-here patterns
  },
  "reputation_signals": {                // from email service provider APIs
    "google_postmaster_score": 0.0,      // 0-100 if available
    "microsoft_snds_status": "string",   // green/yellow/red
    "blacklist_check_results": [
      { "blacklist_name": "string", "listed": true | false }
    ]
  },
  "suppression_signals": {               // from CadenceEventLog suppressed_reason
    "top_suppression_reasons": [
      { "reason": "string", "count": 0, "pct_of_sends": 0.0 }
    ]
  },
  "trailing_baseline": {                 // 30-day baseline for comparison
    "hard_bounce_rate": 0.0,
    "spam_complaint_rate": 0.0,
    "reply_rate": 0.0
  }
}
```
Output schema
```jsonc
{
  "tool_call_id": "uuid",
  "deliverability_health": {
    "overall_status": "green" | "yellow" | "red" | "critical",
    "primary_concern": "string | null",  // top concern in one sentence
    "concern_severity": "low" | "medium" | "high" | "critical"
  },
  "risk_flags": [                        // ordered by severity
    {
      "flag_type": "high_hard_bounce" | "spam_complaint_spike" | "blacklist_listing" | "reply_rate_collapse" | "auto_reply_spike" | "list_quality_degraded" | "sender_reputation_decline" | "suppression_pattern_anomaly",
      "severity": "low" | "medium" | "high" | "critical",
      "evidence": "string",              // what specifically triggered the flag
      "recommended_action": "throttle_sends" | "pause_sequence" | "warm_up_sender" | "investigate_list_source" | "review_template_content" | "manual_review",
      "supporting_metric": { "name": "string", "value": 0.0, "baseline": 0.0 }
    }
  ],
  "comparison_to_baseline": {
    "bounce_rate_delta": 0.0,
    "spam_complaint_delta": 0.0,
    "reply_rate_delta": 0.0,
    "is_meaningful_deviation": true | false  // statistical, not just numerical
  },
  "interpretation_for_caller": "string", // 2-3 sentence narrative summary
  "ungrounded_assumptions": ["string"],
  "tool_metadata": {
    "model": "claude-haiku-4-5",
    "input_tokens": 0,
    "output_tokens": 0,
    "cost_usd_estimate": 0.0,
    "latency_ms": 0
  }
}
```
Hard rule: The tool never recommends an action it doesn't have evidence for. Recommended actions must trace to specific risk_flags + supporting metrics. No "general advice" output.
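The hard rule can be enforced mechanically before the output leaves the tool. A minimal validator sketch (the field names mirror the output schema above; the function itself is illustrative, not part of the spec):

```python
# Hypothetical post-generation check: reject any output whose recommended
# actions are not backed by a risk flag with concrete supporting evidence.

ALLOWED_ACTIONS = {
    "throttle_sends", "pause_sequence", "warm_up_sender",
    "investigate_list_source", "review_template_content", "manual_review",
}

def grounding_violations(output: dict) -> list[str]:
    """Return a list of grounding violations; an empty list means the output passes."""
    violations = []
    for i, flag in enumerate(output.get("risk_flags", [])):
        action = flag.get("recommended_action")
        if action not in ALLOWED_ACTIONS:
            violations.append(f"risk_flags[{i}]: unknown action {action!r}")
        if not flag.get("evidence"):
            violations.append(f"risk_flags[{i}]: action without evidence")
        metric = flag.get("supporting_metric") or {}
        if "value" not in metric or "baseline" not in metric:
            violations.append(f"risk_flags[{i}]: missing supporting_metric")
    # A stated concern with no flags behind it is "general advice",
    # which the hard rule forbids.
    if output["deliverability_health"]["primary_concern"] and not output.get("risk_flags"):
        violations.append("primary_concern stated with no risk_flags")
    return violations
```

Running this as a schema-adjacent gate means the "Recommendation grounding: 100%" eval criterion is checked on every call, not just in the eval set.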
Called by
| Caller | Invocation context |
| --- | --- |
| AGT-302 Cadence Coordinator | Daily batch on global + per-sender-domain scope. Output drives auto-throttle decisions when severity is high or critical (with RevOps notification, not silent action). |
| AGT-303 Cadence Intelligence | Weekly batch on per-sequence-template scope. Output feeds AGT-303's risk-classified recommendations to AGT-302 (3-business-day veto window applies). |
| RevOps direct (workspace UI) | Investigative — when reply rates drop or bounce rates spike, RevOps drops in the metrics and gets a structured diagnosis with recommended actions. |
| AGT-901 Pipeline Brain | For "is outbound quality a bottleneck?" diagnostic queries when investigating soft pipeline. |
Design principles
  1. Numerical work in code, characterization in LLM. Same pattern as TOOL-004. Statistical significance tests, baseline comparison, deviation calculations — all in code. The LLM reads the structured numerical input and produces interpretation + recommended action mapping.
  2. Severity drives recommendation, not just metric value. A 1% spam complaint rate is critical (industry threshold). A 4% hard bounce rate is high but not critical if list source explains it. Severity calibration considers context.
  3. Throttling, not pausing, on yellow. Pausing a sequence has real cost (broken pipeline). Reduce send volume first; pause only on critical flags.
  4. Reputation signals weighted heavily. ESP-reported reputation (Google Postmaster, Microsoft SNDS) is a leading indicator; reply-rate decline is a lagging one. Tool weights leading indicators higher.
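Principle 1's split can be illustrated with the baseline comparison: a two-proportion z-test (one plausible choice; the spec does not mandate a specific test) decides `is_meaningful_deviation` in code, so the LLM only ever interprets a precomputed verdict. The function name and default threshold are illustrative assumptions:

```python
import math

def meaningful_deviation(rate: float, baseline_rate: float,
                         n_window: int, n_baseline: int,
                         z_threshold: float = 2.58) -> bool:
    """Two-proportion z-test: is the window rate a statistically meaningful
    departure from the trailing baseline, not just numerically different?
    Rates are fractions (0.04 == 4%); n_* are send counts."""
    pooled = (rate * n_window + baseline_rate * n_baseline) / (n_window + n_baseline)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_window + 1 / n_baseline))
    if se == 0:
        return False
    z = (rate - baseline_rate) / se
    return abs(z) >= z_threshold  # ~99% confidence at the default threshold
```

Smaller windows need a larger absolute gap to clear the threshold, which is the same guard the failure-modes section applies to small-volume periods.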
Cost ceiling
| Constraint | Value |
| --- | --- |
| Per-call input budget | 8K tokens |
| Per-call output budget | 2K tokens |
| Default model | Haiku — pattern matching + flag generation; no synthesis-heavy reasoning |
| Per-call cost estimate | ~$0.01 per call |
| Monthly cap | $150/mo (~7,500 calls) |
| Frequency expectation | Daily batch from AGT-302 + weekly batch from AGT-303 + ad hoc. Most monthly usage is the daily batch. |
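A back-of-envelope check of the ceilings above, assuming for illustration $1 per million input tokens and $5 per million output tokens (these rates are an assumption, not part of the spec; substitute current model pricing):

```python
# Worst-case per-call cost from the token budgets, and the call volume the
# monthly cap supports at that worst case. Rates below are assumed.
INPUT_RATE_PER_MTOK = 1.00   # assumed USD per million input tokens
OUTPUT_RATE_PER_MTOK = 5.00  # assumed USD per million output tokens

def worst_case_call_cost(input_budget_tok: int = 8_000,
                         output_budget_tok: int = 2_000) -> float:
    """Cost if a call consumes its full input and output budgets."""
    return (input_budget_tok * INPUT_RATE_PER_MTOK
            + output_budget_tok * OUTPUT_RATE_PER_MTOK) / 1_000_000

def calls_under_cap(monthly_cap_usd: float = 150.0) -> int:
    return int(monthly_cap_usd / worst_case_call_cost())
```

Under these assumed rates the worst case is about $0.018/call (~8,300 calls under the cap); typical calls that do not exhaust both budgets land nearer the ~$0.01 average estimate, so ~7,500 calls/month leaves headroom.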
Eval criteria
| Criterion | Pass threshold |
| --- | --- |
| Schema compliance | 100% (hard) |
| Recommendation grounding | 100% — every recommended_action traces to a specific risk_flag (hard) |
| Severity calibration | 15 retrospective cases (known reputation incidents); ≥ 80% correctly identified as critical/high before reputation damage |
| False positive rate on green periods | ≤ 10% — over-flagging green periods creates alert fatigue and undermines the tool |
| P95 latency | ≤ 2s |
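The false-positive criterion reduces to simple arithmetic over labeled eval periods. A sketch of how the harness might score it (the case shape here is a hypothetical convention, not part of the spec):

```python
def green_false_positive_rate(cases: list[dict]) -> float:
    """Fraction of known-green eval periods the tool flagged as non-green.
    Each case: {"expected": "green"/"yellow"/..., "got": overall_status from tool}."""
    green = [c for c in cases if c["expected"] == "green"]
    if not green:
        return 0.0
    false_positives = sum(1 for c in green if c["got"] != "green")
    return false_positives / len(green)
```

The eval passes this criterion when the returned rate is ≤ 0.10 over the green-period set.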
Failure modes
| Symptom | Action |
| --- | --- |
| Tool flags false positives on small-volume periods (variance noise) | Statistical significance check before flag generation; minimum send-volume threshold (e.g., 1,000 sends in window) before recommending throttle/pause. |
| Tool misses a critical reputation event | Eval catches this via the 15-case set. Tighten reputation_signals weighting; add the missed pattern as a few-shot example. |
| Recommendations conflict (throttle vs. pause) | Severity hierarchy enforced in output — only one primary recommendation per call. Multiple flags can coexist; only one drives the action. |
| ESP reputation API offline | Tool degrades gracefully — returns output flagged with data_quality = 'low' on the reputation dimension. Other dimensions unaffected. |