Glossary · Reliability

Exponential backoff

A retry schedule where each retry waits exponentially longer than the previous one.

Exponential backoff is a retry schedule where each retry waits exponentially longer than the previous one. A typical schedule: 30s, 1m, 5m, 15m, 1h, 4h, 12h, 24h. Early retries catch transient blips; later retries catch longer outages without hammering the recovering system.

The opposite — constant-interval retries (every minute, forever) — is the most common production mistake. It overloads systems that are trying to recover and never adapts to longer outages.

Why exponential backoff is the right default

Three reasons it beats both constant and linear schedules:

  1. Self-healing. If the destination has a 5-second blip, the first retry catches it. If it's a 5-hour outage, later retries catch it too — the schedule scales naturally.
  2. Less load on recovering systems. A destination coming back from an outage faces a thundering herd of retries. Exponential backoff spreads them out.
  3. Bounded total time. With reasonable parameters (initial 30s, multiplier 2, max ~24h), the total retry budget is finite — you can predict when a delivery will move to the DLQ.

Adding jitter

Pure exponential backoff has a synchronization problem: if many deliveries fail at the same time (an upstream incident), they all retry at the same scheduled times — recreating the thundering herd. The fix is jitter — randomize the wait by ±20% or so, spreading out the retries.

Most production retry libraries default to "exponential backoff with jitter" for this reason. The variance smooths the load curve.

For the broader retry context: Webhook DLQs: design and recovery patterns.

Related terms