TOOL-005 — Outbound Deliverability Monitor
Tier 3 Specialist Tool · Stateless · Reads outbound sequence performance signals, characterizes deliverability health, flags reputation risks · Closes Domain 4 gap from v26 architecture eval
Tier 3 · Tool
Specced · v31
Domain 4 · AI-native outbound
Haiku
Purpose
Reads outbound sequence performance data (bounce rates, reply quality, suppression patterns, sender reputation signals from email providers) and produces a deliverability health characterization with specific risk flags. Used by AGT-302 Cadence Coordinator to detect quality degradation before it cascades into reputation damage, and by AGT-303 Cadence Intelligence to inform sequence pause/throttle recommendations.
Closes Domain 4 gap from v26 eval. Today AGT-301/302/303 cover sequence generation, execution, and pattern analysis — but no component owns deliverability health specifically. This tool is the missing observability layer for outbound quality.
Input schema
```
{
  "scope": "global" | "sender_domain" | "sequence_template" | "rep",
  "scope_id": "string",                     // domain/template_id/rep_id per scope
  "time_window": {
    "start": "ISO 8601",
    "end": "ISO 8601"
  },
  "performance_signals": {
    "total_emails_sent": 0,
    "hard_bounce_rate": 0.0,                // % of total
    "soft_bounce_rate": 0.0,
    "spam_complaint_rate": 0.0,
    "unsubscribe_rate": 0.0,
    "reply_rate": 0.0,
    "open_rate": 0.0,                       // optional — pixel-tracking signal
    "auto_reply_rate": 0.0                  // out-of-office / not-here patterns
  },
  "reputation_signals": {                   // from email service provider APIs
    "google_postmaster_score": 0.0,         // 0-100 if available
    "microsoft_snds_status": "string",      // green/yellow/red
    "blacklist_check_results": [
      { "blacklist_name": "string", "listed": true | false }
    ]
  },
  "suppression_signals": {                  // from CadenceEventLog suppressed_reason
    "top_suppression_reasons": [
      { "reason": "string", "count": 0, "pct_of_sends": 0.0 }
    ]
  },
  "trailing_baseline": {                    // 30-day baseline for comparison
    "hard_bounce_rate": 0.0,
    "spam_complaint_rate": 0.0,
    "reply_rate": 0.0
  }
}
```
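For concreteness, here is the kind of payload a caller like AGT-302 might assemble for a daily per-sender-domain run. Every value below, including the domain in scope_id, is invented for illustration; rates follow the percent-of-total convention noted in the schema comments.

```python
# Hypothetical example input for a daily per-sender-domain run. All values are invented.
example_input = {
    "scope": "sender_domain",
    "scope_id": "mail.example.com",  # hypothetical sending domain
    "time_window": {"start": "2025-06-01T00:00:00Z", "end": "2025-06-02T00:00:00Z"},
    "performance_signals": {
        "total_emails_sent": 4200,
        "hard_bounce_rate": 3.1,     # percentages of total sends, per the schema comment
        "soft_bounce_rate": 1.4,
        "spam_complaint_rate": 0.2,
        "unsubscribe_rate": 0.6,
        "reply_rate": 2.8,
        "open_rate": 41.0,
        "auto_reply_rate": 1.1,
    },
    "reputation_signals": {
        "google_postmaster_score": 72.0,
        "microsoft_snds_status": "yellow",
        "blacklist_check_results": [{"blacklist_name": "example-dnsbl", "listed": False}],
    },
    "suppression_signals": {
        "top_suppression_reasons": [
            {"reason": "recent_unsubscribe", "count": 38, "pct_of_sends": 0.9}
        ]
    },
    "trailing_baseline": {"hard_bounce_rate": 1.2, "spam_complaint_rate": 0.1, "reply_rate": 3.4},
}
```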
Output schema
```
{
  "tool_call_id": "uuid",
  "deliverability_health": {
    "overall_status": "green" | "yellow" | "red" | "critical",
    "primary_concern": "string | null",       // top concern in one sentence
    "concern_severity": "low" | "medium" | "high" | "critical"
  },
  "risk_flags": [                             // ordered by severity
    {
      "flag_type": "high_hard_bounce" | "spam_complaint_spike" |
                   "blacklist_listing" | "reply_rate_collapse" |
                   "auto_reply_spike" | "list_quality_degraded" |
                   "sender_reputation_decline" | "suppression_pattern_anomaly",
      "severity": "low" | "medium" | "high" | "critical",
      "evidence": "string",                   // what specifically triggered the flag
      "recommended_action": "throttle_sends" | "pause_sequence" |
                            "warm_up_sender" | "investigate_list_source" |
                            "review_template_content" | "manual_review",
      "supporting_metric": { "name": "string", "value": 0.0, "baseline": 0.0 }
    }
  ],
  "comparison_to_baseline": {
    "bounce_rate_delta": 0.0,
    "spam_complaint_delta": 0.0,
    "reply_rate_delta": 0.0,
    "is_meaningful_deviation": true | false   // statistical not just numerical
  },
  "interpretation_for_caller": "string",      // 2-3 sentence narrative summary
  "ungrounded_assumptions": ["string"],
  "tool_metadata": {
    "model": "claude-haiku-4-5",
    "input_tokens": 0,
    "output_tokens": 0,
    "cost_usd_estimate": 0.0,
    "latency_ms": 0
  }
}
```
Hard rule: The tool never recommends an action it doesn't have evidence for. Recommended actions must trace to specific risk_flags + supporting metrics. No "general advice" output.
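The grounding rule is mechanically checkable. Below is a minimal sketch of a post-hoc check a caller or the eval harness could run; the field names follow the output schema above, but the helper itself is illustrative rather than part of the spec.

```python
# Hypothetical post-hoc grounding check: every recommended_action must sit on a risk_flag
# that carries evidence and a named supporting metric. Illustrative only; not part of the spec.
ALLOWED_ACTIONS = {
    "throttle_sends", "pause_sequence", "warm_up_sender",
    "investigate_list_source", "review_template_content", "manual_review",
}

def recommendations_are_grounded(output: dict) -> bool:
    """Return True only if each flag has a valid action, evidence text, and a named metric."""
    flags = output.get("risk_flags", [])
    for flag in flags:
        if flag.get("recommended_action") not in ALLOWED_ACTIONS:
            return False
        if not flag.get("evidence"):
            return False
        if not (flag.get("supporting_metric") or {}).get("name"):
            return False
    # A stated primary_concern with no flags behind it is "general advice" and fails.
    health = output.get("deliverability_health", {})
    if health.get("primary_concern") and not flags:
        return False
    return True
```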
Called by
| Caller | Invocation context |
| --- | --- |
| AGT-302 Cadence Coordinator | Daily batch on global + per-sender-domain scope. Output drives auto-throttle decisions when severity is high or critical (with RevOps notification, not silent action). |
| AGT-303 Cadence Intelligence | Weekly batch on per-sequence-template scope. Output feeds AGT-303's risk-classified recommendations to AGT-302 (3-business-day veto window applies). |
| RevOps direct (workspace UI) | Investigative: when reply rates drop or bounce rates spike, RevOps drops in the metrics and gets a structured diagnosis with recommended actions. |
| AGT-901 Pipeline Brain | For "is outbound quality a bottleneck?" diagnostic queries when investigating soft pipeline. |
Design principles
- Numerical work in code, characterization in LLM. Same pattern as TOOL-004. Statistical significance tests, baseline comparison, and deviation calculations all happen in code; the LLM reads the structured numerical input and produces the interpretation and recommended-action mapping (see the sketch after this list).
- Severity drives recommendation, not just metric value. A 1% spam complaint rate is critical (industry threshold). A 4% hard bounce rate is high but not critical if list source explains it. Severity calibration considers context.
- Throttling, not pausing, on yellow. Pausing a sequence has real cost (broken pipeline). Reduce send volume first; pause only on critical flags.
- Reputation signals weighted heavily. ESP-reported reputation (Google Postmaster, Microsoft SNDS) is a leading indicator; reply-rate decline is a lagging one. Tool weights leading indicators higher.
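A minimal sketch of that code/LLM split, assuming rates arrive as percentages of total sends. The 1% spam-complaint threshold and the 1,000-send minimum volume come from this spec (the severity principle above and the failure-modes table below); the remaining bands and the choice of a one-sided z-test are assumptions for illustration.

```python
# Sketch of the code-side numerical work: minimum-volume gate, baseline significance test,
# and an illustrative severity band for spam complaints. Rates are percentages of sends.
from math import sqrt
from statistics import NormalDist

MIN_SENDS_FOR_FLAGS = 1000  # from the failure-modes table: don't flag low-volume noise

def is_meaningful_deviation(rate_pct: float, baseline_pct: float, n_sends: int,
                            alpha: float = 0.05) -> bool:
    """One-sided z-test: is the observed rate significantly above the 30-day baseline?"""
    if n_sends < MIN_SENDS_FOR_FLAGS:
        return False
    p_obs, p_base = rate_pct / 100.0, baseline_pct / 100.0
    se = sqrt(max(p_base * (1 - p_base), 1e-9) / n_sends)
    return (p_obs - p_base) / se > NormalDist().inv_cdf(1 - alpha)

def spam_complaint_severity(rate_pct: float) -> str:
    """1% is critical per this spec; the lower bands are assumed, not specified."""
    if rate_pct >= 1.0:
        return "critical"
    if rate_pct >= 0.3:
        return "high"
    if rate_pct >= 0.1:
        return "medium"
    return "low"
```

Under this split, the model receives only the structured inputs and these code-computed results; its job is the interpretation and the flag-to-action mapping.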
Cost ceiling
| Constraint | Value |
| --- | --- |
| Per-call input budget | 8K tokens |
| Per-call output budget | 2K tokens |
| Default model | Haiku — pattern matching + flag generation; no synthesis-heavy reasoning |
| Per-call cost estimate | ~$0.01–0.02 per call at the token budgets above |
| Monthly cap | $150/mo (~7,500 calls) |
| Frequency expectation | Daily batch from AGT-302 + weekly batch from AGT-303 + ad-hoc. Most monthly usage is the daily batch. |
Eval criteria
| Criterion | Pass threshold |
| --- | --- |
| Schema compliance | 100% (hard) |
| Recommendation grounding | 100% — every recommended_action traces to a specific risk_flag (hard) |
| Severity calibration | 15 retrospective cases (known reputation incidents); ≥ 80% correctly identified as critical/high before reputation damage |
| False positive rate on green periods | ≤ 10% — over-flagging green periods creates alert fatigue and undermines the tool |
| P95 latency | ≤ 2s |
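One way the severity-calibration and false-positive criteria could be scored against labeled history is sketched below. The labeled-case format and the call_tool hook are assumptions; only the ≥ 80% and ≤ 10% pass thresholds come from the table above.

```python
# Hypothetical eval harness for severity calibration and green-period false positives.
# `call_tool` stands in for an actual TOOL-005 invocation; the case format is assumed.
from typing import Callable

def score_eval(cases: list[dict], call_tool: Callable[[dict], dict]) -> dict:
    """cases: [{"input": {...}, "label": "incident"}, {"input": {...}, "label": "green"}, ...]"""
    incident_hits = incidents = green_flagged = greens = 0
    for case in cases:
        out = call_tool(case["input"])
        severe = any(f.get("severity") in ("high", "critical")
                     for f in out.get("risk_flags", []))
        if case["label"] == "incident":
            incidents += 1
            incident_hits += int(severe)
        else:
            greens += 1
            green_flagged += int(severe)
    return {
        "severity_calibration": incident_hits / max(incidents, 1),  # pass at >= 0.80
        "false_positive_rate": green_flagged / max(greens, 1),      # pass at <= 0.10
    }
```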
Failure modes
| Symptom | Action |
| --- | --- |
| Tool flags false positives on small-volume periods (variance noise) | Statistical significance check before flag generation; minimum send volume threshold (e.g., 1,000 sends in window) before recommending throttle/pause. |
| Tool misses critical reputation event | Eval catches via the 15-case set. Tighten reputation_signals weighting; add the missed pattern as a few-shot example. |
| Recommendations conflict (throttle vs. pause) | Severity hierarchy enforced in output — only one primary recommendation per call. Multiple flags coexist; only one drives the action. |
| ESP reputation API offline | Tool degrades gracefully — returns output flagged with data_quality = 'low' on reputation dimension. Other dimensions unaffected. |
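A sketch of the last degradation path, for illustration. The data_quality marker comes from that row; exactly how it surfaces in the output (here, returned alongside notes destined for ungrounded_assumptions) is an assumption, since the output schema above does not pin it down.

```python
# Hypothetical handling when ESP reputation APIs are unavailable: mark the reputation
# dimension low-quality and record the gap so reputation-driven flags are not fabricated.
from typing import Optional

def reputation_inputs(postmaster_score: Optional[float],
                      snds_status: Optional[str]) -> tuple[dict, list[str], str]:
    """Return (reputation_signals, assumption_notes, reputation_data_quality)."""
    notes: list[str] = []
    quality = "normal"
    if postmaster_score is None or snds_status is None:
        quality = "low"
        notes.append("One or more ESP reputation APIs were unavailable for this window; "
                     "the reputation dimension is low quality, other dimensions unaffected.")
    signals = {
        "google_postmaster_score": postmaster_score,  # None when Postmaster is offline
        "microsoft_snds_status": snds_status or "unknown",
        "blacklist_check_results": [],
    }
    return signals, notes, quality
```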