Implementing Exponential Backoff for Failed Push Deliveries
Transient HTTP 429 and 503 errors from browser push services will occur — the question is whether your retry logic amplifies or absorbs the load.
Quick Answer
Use full-jitter exponential backoff: start at 2 s, double each attempt, cap at 120 s, apply random(0, cap) jitter, and set max_retries = 5. Route payloads to a dead-letter queue (DLQ) after exhaustion or when the next delay would exceed the remaining TTL. Never retry 400, 401, 404, or 410 — these are permanent failures. Always respect the Retry-After header on 429 responses.
| Attempt | Base delay | With 2× multiplier | After full jitter |
|---|---|---|---|
| 1 | 2 s | 2 s | 0–2 s |
| 2 | 2 s | 4 s | 0–4 s |
| 3 | 2 s | 8 s | 0–8 s |
| 4 | 2 s | 16 s | 0–16 s |
| 5 | 2 s | 32 s | 0–32 s (→ DLQ) |
Core Algorithm & Queue Routing Architecture
Exponential backoff replaces immediate synchronous retries with a mathematically scaled delay progression. When a delivery attempt fails, the payload is serialised into a priority queue rather than re-injected into the active dispatch thread. This decouples the delivery worker from the retry scheduler, preventing thread starvation and enabling horizontal scaling.
When configuring the primary dispatch pipeline, integrate this pattern into your broader Backend Delivery Architecture & Queue Management framework to ensure idempotent message routing, atomic state transitions, and persistent retry tracking across service restarts. For the complementary concern of queue infrastructure capacity under load, see scaling push queues with Redis or RabbitMQ.
Base Delay, Multiplier, and Max Retry Thresholds
Linear retry intervals fail under vendor rate limits because they synchronise worker attempts, compounding downstream load. Exponential scaling aligns with browser vendor recovery windows by progressively widening the gap between attempts.
Production Baseline Configuration:
base_delay: 2000 ms
multiplier: 2.0
max_retries: 5
max_delay: 120000 ms
retryable_status_codes: [429, 500, 502, 503, 504]
permanent_failure_codes: [400, 401, 404, 410]
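Under this configuration, the pre-jitter retry delays are 2 s → 4 s → 8 s → 16 s → 32 s. The 120 s cap is never reached within five attempts; it only takes effect if max_retries is raised (a seventh retry would otherwise wait 128 s). After five retry attempts, the payload is considered exhausted and routed to a dead-letter queue (DLQ).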
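As a minimal sketch of how the baseline values translate into a schedule, the following Python computes the pre-jitter delay for each retry. `BackoffConfig` and `delay_caps` are illustrative names introduced here, not part of any library, and the values are expressed in seconds rather than milliseconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackoffConfig:
    """Illustrative container for the baseline values (in seconds)."""
    base_delay: float = 2.0
    multiplier: float = 2.0
    max_retries: int = 5
    max_delay: float = 120.0


def delay_caps(cfg: BackoffConfig) -> list[float]:
    """Pre-jitter delay for each retry, using a zero-based retry index."""
    return [
        min(cfg.max_delay, cfg.base_delay * cfg.multiplier ** attempt)
        for attempt in range(cfg.max_retries)
    ]


print(delay_caps(BackoffConfig()))  # [2.0, 4.0, 8.0, 16.0, 32.0]
```

Raising `max_retries` to 7 in this sketch shows the cap taking effect: the last entry becomes 120.0 instead of 128.0.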
Jitter Implementation to Prevent Thundering Herd
Deterministic backoff intervals cause synchronised retry spikes across distributed worker pools. Randomised jitter disperses retry attempts across the time window.
Full Jitter Formula (recommended for distributed systems):
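actual_delay = random(0, min(max_delay, base_delay * multiplier^attempt))
Here attempt is a zero-based retry index (0 for the first retry), which yields the 2 s upper bound shown for attempt 1 in the Quick Answer table.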
Full jitter (random(0, exponential_cap)) provides the best load distribution for high-concurrency systems. “Equal jitter” (half deterministic, half random) is a common alternative that maintains a minimum spacing between retries while still dispersing load.
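The two jitter variants can be sketched in Python as follows. The function names and the optional `rng` parameter (added so the randomness can be seeded in tests) are assumptions of this example; delays are in seconds and `attempt` is zero-based.

```python
import random
from typing import Optional


def exponential_cap(attempt: int, base_delay: float = 2.0,
                    multiplier: float = 2.0, max_delay: float = 120.0) -> float:
    """Upper bound on the delay for a zero-based retry index, before jitter."""
    return min(max_delay, base_delay * multiplier ** attempt)


def full_jitter(attempt: int, rng: Optional[random.Random] = None) -> float:
    """Full jitter: a uniform delay anywhere in [0, cap]."""
    rng = rng if rng is not None else random.Random()
    return rng.uniform(0.0, exponential_cap(attempt))


def equal_jitter(attempt: int, rng: Optional[random.Random] = None) -> float:
    """Equal jitter: half the cap is fixed, the other half is random."""
    rng = rng if rng is not None else random.Random()
    half = exponential_cap(attempt) / 2
    return half + rng.uniform(0.0, half)
```

Full jitter can return a delay close to zero, which is what spreads retries most widely; equal jitter never returns less than half the cap, which is the minimum spacing mentioned above.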
Diagnostic Steps for Push Delivery Failures
Systematic isolation of push delivery failures requires intercepting HTTP responses, validating payload integrity, and mapping failures to the appropriate retry schedule. For foundational algorithmic context, review Retry Logic & Backoff Strategies before applying vendor-specific push constraints.
Step 1: Intercept & Log HTTP Status Codes. Parse the push service response immediately upon receipt. Map status codes to retry policies:
- Permanent (No Retry): 400 Bad Request, 401 Unauthorized, 404 Not Found, 410 Gone
- Rate Limited (Backoff): 429 Too Many Requests. Always respect the Retry-After header; see handling 429 responses for header-aware scheduling
- Transient/Server Error (Backoff): 500, 502, 503, 504
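The mapping above can be expressed as a small classifier. This is a sketch under two assumptions that the list does not state: any 2xx response counts as success, and status codes not listed are treated as permanent so they are never retried blindly. `RetryPolicy` and `classify_status` are illustrative names.

```python
from enum import Enum


class RetryPolicy(Enum):
    SUCCESS = "success"
    PERMANENT = "permanent"        # discard, never retry
    RATE_LIMITED = "rate_limited"  # back off, honouring Retry-After
    TRANSIENT = "transient"        # back off using the jitter formula


PERMANENT_CODES = frozenset({400, 401, 404, 410})
TRANSIENT_CODES = frozenset({500, 502, 503, 504})


def classify_status(status_code: int) -> RetryPolicy:
    """Map a push service HTTP status code to a retry policy."""
    if 200 <= status_code < 300:
        return RetryPolicy.SUCCESS  # assumption: any 2xx is a success
    if status_code == 429:
        return RetryPolicy.RATE_LIMITED
    if status_code in TRANSIENT_CODES:
        return RetryPolicy.TRANSIENT
    # Listed permanent codes, plus (by assumption) anything unlisted.
    return RetryPolicy.PERMANENT
```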
Enforce structured logging with correlation IDs and endpoint hashes. Never log raw endpoint URLs or VAPID keys:
{
"correlation_id": "req_8f3a9c2d",
"endpoint_hash": "sha256:a1b2c3...",
"status_code": 429,
"retry_after_header": 15,
"timestamp": "2024-01-15T10:23:45Z"
}
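One way to produce a record of this shape without leaking the endpoint is to hash the URL before it reaches the logger. The sketch below uses only the Python standard library; `build_log_record` is a hypothetical helper name, and the field names simply mirror the example record.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def build_log_record(correlation_id: str, endpoint_url: str,
                     status_code: int, retry_after: Optional[int] = None) -> str:
    """Serialise a log record that carries a hash of the endpoint, not the URL."""
    digest = hashlib.sha256(endpoint_url.encode("utf-8")).hexdigest()
    record = {
        "correlation_id": correlation_id,
        "endpoint_hash": "sha256:" + digest,
        "status_code": status_code,
        "retry_after_header": retry_after,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return json.dumps(record)
```

The raw URL never appears in the output, so the record can be shipped to a shared log store while still allowing failures for the same endpoint to be grouped by hash.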
Step 2: Classify Failure Types & Isolate Payloads. Before queuing a retry, validate the payload and subscription state:
- Network Timeouts: Verify TCP/TLS handshake success. If the connection drops before headers arrive, treat as transient.
- Malformed Payloads: Check Content-Encoding: aes128gcm compliance. Invalid cryptographic payloads trigger 400 and must be discarded without retry.
- Revoked Subscriptions: 410 Gone or 404 Not Found indicates the endpoint is invalid. Immediately purge the subscription record and follow the handling 410 Gone responses pipeline to prevent future delivery attempts.
Step 3: Map Retry Queue to Backoff Schedule. Calculate the exact execution timestamp and attach retry metadata to the message envelope. Use delayed job schedulers native to your stack:
- BullMQ (Node.js): delay option on queue.add() calculated via the jitter formula.
- Celery (Python): countdown=delay_seconds in apply_async() with max_retries=5.
- AWS SQS: DelaySeconds (max 900 s) for short delays; use Step Functions or a polling worker for longer intervals.
Attach retry_count, original_timestamp, and ttl_remaining to the job payload to enable idempotent processing and TTL enforcement.
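A job envelope carrying this metadata might look like the sketch below. `RetryEnvelope` and its methods are illustrative names; the only fields taken from the text are retry_count, original_timestamp, and ttl_remaining, while `correlation_id` and `ttl_seconds` are added here as assumptions so that the remaining TTL can be derived rather than stored stale.

```python
import time
from dataclasses import asdict, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class RetryEnvelope:
    payload: dict
    correlation_id: str        # assumption: doubles as the idempotency key
    original_timestamp: float  # epoch seconds of the first dispatch
    ttl_seconds: float
    retry_count: int = 0

    def ttl_remaining(self, now: Optional[float] = None) -> float:
        """Seconds left before the message expires."""
        now = time.time() if now is None else now
        return self.ttl_seconds - (now - self.original_timestamp)

    def next_attempt(self) -> "RetryEnvelope":
        """Copy of the envelope with retry_count incremented."""
        return replace(self, retry_count=self.retry_count + 1)

    def to_job(self, now: Optional[float] = None) -> dict:
        """Plain dict suitable for a queue payload."""
        job = asdict(self)
        job["ttl_remaining"] = self.ttl_remaining(now)
        return job
```

Because `original_timestamp` travels with the job, every worker that picks it up computes the same TTL budget regardless of how many retries came before.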
Step 4: Validate TTL Remaining Before Enqueueing. Backoff windows must never exceed the message Time-To-Live. Check ttl_remaining > actual_delay; if not, route directly to the DLQ. This is the primary integration point with your TTL configuration strategy.
Implementation Patterns
Node.js (TypeScript) with BullMQ
import { Queue } from 'bullmq';
import { randomInt } from 'crypto';

const PUSH_QUEUE = new Queue('push-delivery', {
  connection: { host: 'localhost', port: 6379 }
});

const CONFIG = {
  baseDelayMs: 2000,
  multiplier: 2.0,
  maxDelayMs: 120000,
  maxRetries: 5,
  defaultTTLSeconds: 3600
};

export async function scheduleRetry(
  payload: Record<string, unknown>,
  attempt: number, // zero-based index of the retry being scheduled
  originalTimestamp: number,
  ttlSeconds: number = CONFIG.defaultTTLSeconds
): Promise<void> {
  const elapsedMs = Date.now() - originalTimestamp;
  const ttlRemainingMs = ttlSeconds * 1000 - elapsedMs;
  if (ttlRemainingMs <= 0) {
    console.warn('TTL expired. Discarding payload.');
    return;
  }

  // Retries exhausted: route to the dead-letter job instead of rescheduling.
  if (attempt >= CONFIG.maxRetries) {
    console.warn('Max retries reached. Routing to DLQ.');
    await PUSH_QUEUE.add('push-dlq', { ...payload, reason: 'max_retries_exceeded' });
    return;
  }

  // Full jitter: uniform delay in [0, min(maxDelay, base * multiplier^attempt)].
  const exponentialCap = Math.min(
    CONFIG.maxDelayMs,
    CONFIG.baseDelayMs * Math.pow(CONFIG.multiplier, attempt)
  );
  // randomInt's upper bound is exclusive, hence the + 1.
  const calculatedDelay = randomInt(0, Math.floor(exponentialCap) + 1);

  if (calculatedDelay >= ttlRemainingMs) {
    console.warn('Backoff exceeds remaining TTL. Routing to DLQ.');
    await PUSH_QUEUE.add('push-dlq', { ...payload, reason: 'ttl_exceeded' });
    return;
  }

  await PUSH_QUEUE.add('push-retry', { ...payload, attempt: attempt + 1 }, {
    delay: calculatedDelay
  });
}
Python with Celery
import random
import time

from celery import Celery

app = Celery('push_tasks', broker='redis://localhost:6379/0')

CONFIG = {
    'base_delay': 2,
    'multiplier': 2.0,
    'max_delay': 120,
    'max_retries': 5,
    'default_ttl': 3600,
}

# send_to_browser() and route_to_dlq() are application-defined helpers (not shown).


class TransientPushError(Exception):
    """Raised for retryable push delivery failures (429, 5xx)."""


@app.task(bind=True, max_retries=CONFIG['max_retries'])
def deliver_push(self, payload: dict, attempt: int = 0, original_ts: float = 0.0):
    if not original_ts:
        original_ts = time.time()
    elapsed = time.time() - original_ts
    ttl_remaining = CONFIG['default_ttl'] - elapsed
    if ttl_remaining <= 0:
        return  # TTL expired; silently drop

    try:
        send_to_browser(payload)
    except TransientPushError as e:
        if attempt >= CONFIG['max_retries']:
            route_to_dlq(payload, reason='max_retries_exceeded')
            return
        # Full jitter: uniform delay in [0, min(max_delay, base * multiplier^attempt)].
        cap = min(CONFIG['max_delay'],
                  CONFIG['base_delay'] * CONFIG['multiplier'] ** attempt)
        delay = random.uniform(0, cap)
        if delay >= ttl_remaining:
            route_to_dlq(payload, reason='ttl_exceeded')
            return
        # Pass the incremented attempt and the original timestamp explicitly;
        # otherwise the retried task reruns with its initial arguments.
        raise self.retry(
            exc=e,
            countdown=delay,
            args=(payload,),
            kwargs={'attempt': attempt + 1, 'original_ts': original_ts},
        )
Validation, Monitoring & Dead-Letter Routing
Continuous validation ensures the backoff system adapts to vendor degradation without masking systemic failures.
Success Metrics Thresholds:
- retry_success_rate > 65%
- queue_depth < 10,000 pending jobs
- p95_retry_latency < 500 ms (processing overhead, not delay)
- permanent_failure_rate < 5% (elevated values indicate stale subscription inventory)
When retry_count >= max_retries or actual_delay >= ttl_remaining, route the payload to a dedicated DLQ. Implement an automated DLQ consumer that:
- Logs the failure reason and endpoint hash.
- Flags the subscription endpoint for health verification.
- Removes the subscription from active routing tables if 410 or 404 is confirmed.
If the HTTP 503 rate exceeds 20% over a 5-minute window, trigger a circuit breaker that pauses new dispatches and escalates to vendor status pages. Configure alerts for sustained retry queue depth spikes exceeding 3 standard deviations from baseline.
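The 503-rate trigger can be sketched as a sliding-window counter. `Http503Breaker` is an illustrative class, not a library API, and the `min_samples` guard is an assumption added so that a handful of early failures does not trip the breaker; the 300 s window and 20% threshold come from the text above.

```python
import time
from collections import deque
from typing import Deque, Optional, Tuple


class Http503Breaker:
    """Opens when the share of 503 responses in the window exceeds the threshold."""

    def __init__(self, window_seconds: float = 300.0, threshold: float = 0.20,
                 min_samples: int = 50) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_samples = min_samples  # assumption: ignore very small samples
        self._events: Deque[Tuple[float, bool]] = deque()

    def _evict(self, now: float) -> None:
        """Drop events older than the window."""
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def record(self, status_code: int, now: Optional[float] = None) -> None:
        """Record one push service response."""
        now = time.monotonic() if now is None else now
        self._events.append((now, status_code == 503))
        self._evict(now)

    def is_open(self, now: Optional[float] = None) -> bool:
        """True when new dispatches should be paused."""
        now = time.monotonic() if now is None else now
        self._evict(now)
        total = len(self._events)
        if total < self.min_samples:
            return False
        failures = sum(1 for _, is_503 in self._events if is_503)
        return failures / total > self.threshold
```

A dispatcher would call `record()` after every response and check `is_open()` before sending; escalation and alerting are left out of this sketch.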
Gotchas & Edge Cases
- Retry-After overrides your formula. When FCM or Mozilla Autopush returns a 429 with a Retry-After: 60 header, use that value directly. Do not add your own exponential delay on top of it; combining both delays can exceed the message TTL.
- BullMQ delay is wall-clock, not queue-depth-adjusted. If your Redis node is under memory pressure, the delayed job may sit in a sorted set longer than expected. Monitor ZADD latency separately from delivery latency.
- AWS SQS DelaySeconds caps at 900 s. The baseline configuration here caps backoff at 120 s, so the limit is not reached, but if you raise max_delay or honour a long Retry-After value beyond 900 s you need a Step Function state machine or a polling worker, not DelaySeconds alone. Design for this before hitting the limit in production.
- Celery max_retries counts from 0 differently across versions. In Celery ≥ 5.x, self.request.retries is 0 during the first execution, so max_retries=5 allows 5 retries (6 total attempts). Verify the count against your version's docs before tuning max_retries for SLA compliance.
- Duplicate delivery on worker crash mid-ack. If a worker crashes after dispatching the push but before acknowledging the queue message, the broker re-delivers. Implement idempotency keys (correlation_id) at the push service layer to detect and discard duplicates without triggering additional retries.
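For the duplicate-delivery case, a minimal in-memory sketch of an idempotency check keyed on correlation_id is shown below. `IdempotencyGuard` is a hypothetical class; a multi-worker deployment would need a shared store with key expiry instead of a per-process dictionary, which this example does not attempt.

```python
import time
from typing import Dict, Optional


class IdempotencyGuard:
    """Remembers delivered correlation IDs for a limited time."""

    def __init__(self, ttl_seconds: float = 3600.0) -> None:
        self.ttl_seconds = ttl_seconds
        self._delivered: Dict[str, float] = {}  # correlation_id -> expiry time

    def _purge(self, now: float) -> None:
        """Forget keys whose retention period has passed."""
        for key in [k for k, expiry in self._delivered.items() if expiry <= now]:
            del self._delivered[key]

    def already_delivered(self, correlation_id: str,
                          now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        self._purge(now)
        return correlation_id in self._delivered

    def mark_delivered(self, correlation_id: str,
                       now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._delivered[correlation_id] = now + self.ttl_seconds
```

A worker would call `already_delivered()` before sending and `mark_delivered()` immediately after a successful send. A crash between those two steps can still produce one duplicate, so this narrows the window rather than closing it.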
Related
- Handling 410 Gone Responses at Scale — permanent-failure routing pipeline that feeds into the DLQ this backoff system drains.
- Setting Optimal TTL Values for Time-Sensitive Alerts — how to compute the TTL budget your backoff schedule must stay within.
- Scaling Push Queues with Redis or RabbitMQ — broker configuration that underpins the delayed-job infrastructure used here.
FAQ
Should I use full jitter or equal jitter for web push retries?
Full jitter — random(0, exponential_cap) — is the better choice for web push because push services like FCM enforce per-project rate limits across all your workers simultaneously. Full jitter maximally disperses those concurrent retries. Equal jitter (half deterministic, half random) is useful when you need a guaranteed minimum spacing between attempts, such as when a vendor’s Retry-After specifies a floor value, but it provides weaker load dispersion.
What should I do when the push service returns a `429` with a `Retry-After` header?
Use the Retry-After value directly as the delay for that attempt — do not add it on top of your exponential formula. Record the header value in structured logs alongside your correlation_id and endpoint_hash. If Retry-After would push the retry past the message TTL, route to the DLQ immediately rather than scheduling a retry that will never execute within the relevance window. For a full treatment of 429 edge cases see the handling 429 Too Many Requests guide.
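The header handling described in this answer can be sketched with the Python standard library. Retry-After may carry either a number of seconds or an HTTP date, so both forms are parsed. `parse_retry_after` and `delay_for_429` are illustrative names, and falling back to the computed backoff delay when the header is missing or unparseable is an assumption of this sketch.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str],
                      now: Optional[datetime] = None) -> Optional[float]:
    """Seconds to wait, or None if the header is missing or unparseable."""
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():  # delta-seconds form, e.g. "60"
        return float(value)
    try:  # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now if now is not None else datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def delay_for_429(retry_after: Optional[str], ttl_remaining: float,
                  fallback_delay: float) -> Optional[float]:
    """Delay to schedule in seconds, or None to route to the DLQ."""
    delay = parse_retry_after(retry_after)
    if delay is None:
        delay = fallback_delay  # assumption: use the backoff delay instead
    return delay if delay < ttl_remaining else None
```

For example, `delay_for_429("60", ttl_remaining=30.0, fallback_delay=4.0)` returns `None`, signalling that the retry cannot run inside the TTL and should go to the DLQ.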
How many retries are appropriate before routing to the DLQ?
Five retries (max_retries = 5) with the 2 s base and 2× multiplier give a cumulative retry window of at most 62 s (2 + 4 + 8 + 16 + 32). Under full jitter each delay is drawn from random(0, cap), so the expected total is about half that, roughly 31 s. For time-critical alerts with short TTLs, such as OTPs or flash-sale triggers, reduce to 3 retries and cap delays at TTL × 0.5. For lower-priority engagement campaigns you can extend to 7 retries, but validate that the final backoff cap does not exceed your queue's x-message-ttl setting.
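To compare retry budgets for different priorities, the window arithmetic can be checked with a few lines of Python. `cumulative_window` is an illustrative helper; it returns the worst-case total wait and the expected total under full jitter, using the fact that a uniform draw from 0 to a cap averages half the cap.

```python
def cumulative_window(max_retries: int, base_delay: float = 2.0,
                      multiplier: float = 2.0,
                      max_delay: float = 120.0) -> tuple[float, float]:
    """Return (worst_case_seconds, expected_seconds_under_full_jitter)."""
    caps = [min(max_delay, base_delay * multiplier ** i)
            for i in range(max_retries)]
    worst_case = sum(caps)
    expected = sum(cap / 2 for cap in caps)  # mean of uniform(0, cap)
    return worst_case, expected


print(cumulative_window(5))  # (62.0, 31.0)
print(cumulative_window(3))  # (14.0, 7.0)
print(cumulative_window(7))  # (246.0, 123.0)
```

The seven-retry case includes the 120 s cap on the final delay, which is why its worst case is 246 s rather than 254 s.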