Glossary · Reliability

Retry

Re-attempting a failed delivery on a backoff schedule until it succeeds or hits a retry budget.

In webhook systems, a failed delivery is retried on a backoff schedule (typically exponential) until it succeeds or exhausts its retry budget. Once the budget is exhausted, the delivery moves to a dead-letter queue (DLQ).
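The retry-until-budget flow can be sketched as below. This is a minimal illustration, not any provider's implementation: `deliver_with_retries`, the `send` callable (assumed to return True on success), and the list standing in for the DLQ are all hypothetical names.

```python
import time

def deliver_with_retries(send, event, delays, dead_letters, sleep=time.sleep):
    """Try once, then retry after each delay; park the event if every attempt fails.

    `delays` is the retry budget expressed as waits in seconds.
    Returns the number of attempts used, or None if the event was dead-lettered.
    """
    for attempt, delay in enumerate([0, *delays]):
        if delay:
            sleep(delay)  # wait before each retry, not before the first attempt
        try:
            if send(event):
                return attempt + 1
        except Exception:
            pass  # assumption: any exception counts as a failed attempt
    dead_letters.append(event)  # budget exhausted: hand off to the DLQ
    return None
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a production relay would normally schedule the next attempt in a queue rather than block a worker.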

Exponential backoff

Naive constant-interval retries (every minute, forever) overload downstream systems and recover poorly from longer outages. Exponential backoff lengthens the wait after each failed attempt, classically by doubling it. Production schedules are often only roughly exponential, for example: 30s, 1m, 5m, 15m, 1h, 4h, 12h, 24h. Early retries catch transient blips; later retries cover longer outages without hammering a recovering system.
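As a rough sketch, a strictly doubling schedule with a ceiling can be computed like this. The function name, the 30-second base, and the 24-hour cap are illustrative assumptions, not settings taken from any provider.

```python
def backoff_delays(base=30.0, factor=2.0, cap=86400.0, attempts=8):
    """Return the wait in seconds before each retry: base * factor**n, capped at `cap`."""
    return [min(cap, base * factor ** n) for n in range(attempts)]

print(backoff_delays())
# [30.0, 60.0, 120.0, 240.0, 480.0, 960.0, 1920.0, 3840.0]
```

A hand-tuned schedule such as the one listed above can simply be stored as a fixed list of delays instead of being computed.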

Provider behavior varies, and not every provider retries automatically:

  • Stripe — retries for up to 3 days
  • GitHub — stores failed deliveries for manual or API-driven redelivery; automatic redelivery requires your own scheduled workflow
  • Slack — retries 3 times: nearly immediately, after 1 minute, and after 5 minutes
  • Shopify — retries failed deliveries up to 8 times in about 4 hours, then the subscription can be removed if failures continue

Why retries need idempotency

Every retry can produce a duplicate from the consumer's perspective. The first attempt might have succeeded while its response was lost, so the sender retries and the consumer receives the same event again. Without idempotency, the consumer processes that event twice.
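One common way to get idempotency on the consumer side is to deduplicate on an event ID. The sketch below assumes each event carries a stable `id` field and uses an in-memory set; both are assumptions for illustration, and a real consumer would need a durable store with an atomic check-and-set to survive restarts and concurrent deliveries.

```python
def make_idempotent_handler(process, seen=None):
    """Wrap `process` so each event ID is handled at most once."""
    seen = set() if seen is None else seen

    def handle(event):
        event_id = event["id"]  # assumption: the sender provides a stable ID
        if event_id in seen:
            return False  # duplicate delivery: acknowledge it, do no work
        process(event)
        seen.add(event_id)  # record only after processing succeeded
        return True

    return handle
```

Recording the ID after processing means a crash mid-processing leads to a reprocess on the next retry rather than a silently dropped event; which side of that trade-off is right depends on the handler.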

For agents specifically, a retry can re-run the agent, which costs tokens and can repeat tool calls that have side effects. Idempotency at both the relay layer and the agent's tool-call layer is the discipline that keeps cost and side effects bounded.

For the broader retry discussion: Webhook DLQs: design and recovery patterns.

Related terms