Implementing Exponential Backoff for Failed Push Deliveries

Transient HTTP 429 and 503 errors from browser push services will occur — the question is whether your retry logic amplifies or absorbs the load.

Quick Answer

Use full-jitter exponential backoff: start at 2 s, double each attempt, cap at 120 s, apply random(0, cap) jitter, and set max_retries = 5. Route payloads to a dead-letter queue (DLQ) after exhaustion or when the next delay would exceed the remaining TTL. Never retry 400, 401, 404, or 410 — these are permanent failures. Always respect the Retry-After header on 429 responses.

Attempt	Base delay	With 2× multiplier	After full jitter
1	2 s	2 s	0–2 s
2	2 s	4 s	0–4 s
3	2 s	8 s	0–8 s
4	2 s	16 s	0–16 s
5	2 s	32 s	0–32 s (→ DLQ)

Each row shows the deterministic exponential cap; jitter randomises the actual dispatch time within that window, preventing synchronised retry spikes across worker pools.

Core Algorithm & Queue Routing Architecture

Exponential backoff replaces immediate synchronous retries with a mathematically scaled delay progression. When a delivery attempt fails, the payload is serialised into a priority queue rather than re-injected into the active dispatch thread. This decouples the delivery worker from the retry scheduler, preventing thread starvation and enabling horizontal scaling.
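The decoupling described above can be illustrated with a minimal in-memory sketch. The RetryScheduler class and its method names are hypothetical, and heapq stands in for a persistent broker, which a production system would use instead.

```python
import heapq
import itertools
from typing import Any, List, Tuple


class RetryScheduler:
    """Hypothetical in-memory stand-in for a delayed retry queue."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, Any]] = []
        # Tie-breaker so two payloads with the same run_at are never compared.
        self._seq = itertools.count()

    def schedule(self, payload: Any, run_at: float) -> None:
        """Park a failed payload until run_at instead of retrying inline."""
        heapq.heappush(self._heap, (run_at, next(self._seq), payload))

    def pop_due(self, now: float) -> List[Any]:
        """Return every payload whose scheduled time has arrived, earliest first."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[2])
        return due


scheduler = RetryScheduler()
scheduler.schedule({"id": "a"}, run_at=8.0)
scheduler.schedule({"id": "b"}, run_at=2.0)
ready_at_5 = scheduler.pop_due(now=5.0)    # only "b" is due
ready_at_10 = scheduler.pop_due(now=10.0)  # "a" becomes due later
```

The delivery worker only ever calls schedule() on failure and returns immediately; a separate loop polls pop_due(), so a slow retry never blocks a fresh dispatch.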

When configuring the primary dispatch pipeline, integrate this pattern into your broader Backend Delivery Architecture & Queue Management framework to ensure idempotent message routing, atomic state transitions, and persistent retry tracking across service restarts. For the complementary concern of queue infrastructure capacity under load, see scaling push queues with Redis or RabbitMQ.

Base Delay, Multiplier, and Max Retry Thresholds

Linear retry intervals fail under vendor rate limits because they synchronise worker attempts, compounding downstream load. Exponential scaling aligns with browser vendor recovery windows by progressively widening the gap between attempts.

Production Baseline Configuration:

base_delay: 2000 ms
multiplier: 2.0
max_retries: 5
max_delay: 120000 ms
retryable_status_codes: [429, 500, 502, 503, 504]
permanent_failure_codes: [400, 401, 404, 410]

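Under this configuration, the retry caps run 2 s → 4 s → 8 s → 16 s → 32 s. The 120 s max_delay only takes effect if you raise max_retries to 7 or more, where the seventh cap would otherwise be 128 s. After five attempts, the payload is considered exhausted and routed to a dead-letter queue (DLQ).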
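The arithmetic behind that schedule can be checked with a short sketch. The function name backoff_caps is an illustrative assumption; the numbers are the baseline configuration above.

```python
from typing import List


def backoff_caps(base_delay_s: float, multiplier: float,
                 max_retries: int, max_delay_s: float) -> List[float]:
    """Deterministic delay cap for each retry, before any jitter is applied."""
    return [min(max_delay_s, base_delay_s * multiplier ** i)
            for i in range(max_retries)]


caps = backoff_caps(base_delay_s=2, multiplier=2.0, max_retries=5, max_delay_s=120)
total_window_s = sum(caps)
print(caps, total_window_s)  # [2.0, 4.0, 8.0, 16.0, 32.0] 62.0
```

The sum is the worst-case retry window. Comparing it against the message TTL is a quick way to tell whether a given max_retries is realistic for an alert class.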

Jitter Implementation to Prevent Thundering Herd

Deterministic backoff intervals cause synchronised retry spikes across distributed worker pools. Randomised jitter disperses retry attempts across the time window.

Full Jitter Formula (recommended for distributed systems):

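actual_delay = random(0, min(max_delay, base_delay * multiplier^attempt))

Here attempt is the zero-based retry index (0 for the first retry), so the first cap equals base_delay and attempt n in the Quick Answer table corresponds to multiplier^(n-1).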

Full jitter (random(0, exponential_cap)) provides the best load distribution for high-concurrency systems. “Equal jitter” (half deterministic, half random) is a common alternative that maintains a minimum spacing between retries while still dispersing load.
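The two jitter variants can be compared side by side in a small sketch. The function names are illustrative, attempt is assumed to be zero-based, and the defaults mirror the baseline configuration.

```python
import random


def exponential_cap(attempt: int, base_delay: float = 2.0,
                    multiplier: float = 2.0, max_delay: float = 120.0) -> float:
    """Upper bound of the delay window for a zero-based retry index."""
    return min(max_delay, base_delay * multiplier ** attempt)


def full_jitter(attempt: int, rng=random) -> float:
    """Uniform over [0, cap]: widest dispersion, no minimum spacing."""
    return rng.uniform(0.0, exponential_cap(attempt))


def equal_jitter(attempt: int, rng=random) -> float:
    """Half the cap is fixed, the other half random: spacing of at least cap/2."""
    half = exponential_cap(attempt) / 2.0
    return half + rng.uniform(0.0, half)
```

Passing a seeded random.Random instance as rng makes the delays reproducible in tests; the default uses the module-level generator.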

Diagnostic Steps for Push Delivery Failures

Systematic isolation of push delivery failures requires intercepting HTTP responses, validating payload integrity, and mapping failures to the appropriate retry schedule. For foundational algorithmic context, review Retry Logic & Backoff Strategies before applying vendor-specific push constraints.

Step 1: Intercept & Log HTTP Status Codes. Parse the push service response immediately upon receipt. Map status codes to retry policies:

Permanent (No Retry): 400 Bad Request, 401 Unauthorized, 404 Not Found, 410 Gone
Rate Limited (Backoff): 429 Too Many Requests — always respect the Retry-After header; see handling 429 responses for header-aware scheduling
Transient/Server Error (Backoff): 500, 502, 503, 504
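The mapping above could be expressed as a small classifier. This is a sketch: the RetryPolicy enum and classify_status name are hypothetical, and routing unlisted codes to UNCLASSIFIED is an assumption, since the list above does not say how to treat them.

```python
from enum import Enum


class RetryPolicy(Enum):
    SUCCESS = "success"
    PERMANENT = "permanent"        # discard, never retry
    RATE_LIMITED = "rate_limited"  # back off, honouring Retry-After
    TRANSIENT = "transient"        # back off with the jitter formula
    UNCLASSIFIED = "unclassified"  # assumption: surface for manual review


PERMANENT_CODES = {400, 401, 404, 410}
TRANSIENT_CODES = {500, 502, 503, 504}


def classify_status(status_code: int) -> RetryPolicy:
    """Map a push service HTTP status to the retry policy listed above."""
    if 200 <= status_code < 300:
        return RetryPolicy.SUCCESS
    if status_code == 429:
        return RetryPolicy.RATE_LIMITED
    if status_code in TRANSIENT_CODES:
        return RetryPolicy.TRANSIENT
    if status_code in PERMANENT_CODES:
        return RetryPolicy.PERMANENT
    return RetryPolicy.UNCLASSIFIED
```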

Enforce structured logging with correlation IDs and endpoint hashes. Never log raw endpoint URLs or VAPID keys:

{
  "correlation_id": "req_8f3a9c2d",
  "endpoint_hash": "sha256:a1b2c3...",
  "status_code": 429,
  "retry_after_header": 15,
  "timestamp": "2024-01-15T10:23:45Z"
}
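One way to produce a record of that shape without leaking the endpoint is sketched below. The helper names and the example URL are placeholders, not part of any push library.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def endpoint_hash(endpoint_url: str) -> str:
    """One-way digest so logs can correlate an endpoint without exposing it."""
    return "sha256:" + hashlib.sha256(endpoint_url.encode("utf-8")).hexdigest()


def build_log_record(correlation_id: str, endpoint_url: str,
                     status_code: int, retry_after: Optional[int] = None) -> str:
    """Serialise one delivery outcome as a single JSON log line."""
    record = {
        "correlation_id": correlation_id,
        "endpoint_hash": endpoint_hash(endpoint_url),
        "status_code": status_code,
        "retry_after_header": retry_after,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return json.dumps(record)


# Placeholder endpoint; real subscription URLs come from the browser.
line = build_log_record("req_8f3a9c2d", "https://push.example.invalid/send/abc", 429, 15)
```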

Step 2: Classify Failure Types & Isolate Payloads. Before queuing a retry, validate the payload and subscription state:

Network Timeouts: Verify TCP/TLS handshake success. If the connection drops before headers arrive, treat as transient.
Malformed Payloads: Check Content-Encoding: aes128gcm compliance. Invalid cryptographic payloads trigger 400 and must be discarded without retry.
Revoked Subscriptions: 410 Gone or 404 Not Found indicate the endpoint is invalid. Immediately purge the subscription record and follow the handling 410 Gone responses pipeline to prevent future delivery attempts.

Step 3: Map Retry Queue to Backoff Schedule. Calculate the exact execution timestamp and attach retry metadata to the message envelope. Use delayed job schedulers native to your stack:

BullMQ (Node.js): delay option on queue.add() calculated via the jitter formula.
Celery (Python): countdown=delay_seconds in apply_async() with max_retries=5.
AWS SQS: DelaySeconds (max 900 s) for short delays; use Step Functions or a polling worker for longer intervals.

Attach retry_count, original_timestamp, and ttl_remaining to the job payload to enable idempotent processing and TTL enforcement.

Step 4: Validate TTL Remaining Before Enqueueing. Backoff windows must never exceed the message Time-To-Live. Check ttl_remaining > actual_delay; if not, route directly to the DLQ. This is the primary integration point with your TTL configuration strategy.
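The enqueue-or-DLQ decision from Steps 3 and 4 can be sketched as a pure function. The name plan_retry and its return shape are assumptions made for illustration; the Retry-After override follows the rule stated in the Quick Answer.

```python
from typing import Optional, Tuple


def plan_retry(jittered_delay_s: float, ttl_remaining_s: float,
               retry_count: int, max_retries: int = 5,
               retry_after_s: Optional[float] = None) -> Tuple[str, float]:
    """Return ("retry", delay) or ("dlq", 0.0) for one failed delivery."""
    if retry_count >= max_retries:
        return ("dlq", 0.0)
    # A Retry-After value replaces the computed delay; it is not added to it.
    delay = retry_after_s if retry_after_s is not None else jittered_delay_s
    # Only schedule the retry if it would run inside the message TTL.
    if ttl_remaining_s > delay:
        return ("retry", delay)
    return ("dlq", 0.0)
```

Keeping this decision free of I/O makes it easy to unit test separately from the queue client that acts on the result.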

Implementation Patterns

Node.js (TypeScript) with BullMQ

import { Queue } from 'bullmq';
import { randomInt } from 'crypto';

const connection = { host: 'localhost', port: 6379 };

// Retries and dead letters live in separate queues so the delivery worker
// never picks up a dead-lettered job.
const PUSH_QUEUE = new Queue('push-delivery', { connection });
const DLQ_QUEUE  = new Queue('push-dlq', { connection });

const CONFIG = {
  baseDelayMs:       2000,
  multiplier:        2.0,
  maxDelayMs:        120000,
  maxRetries:        5,
  defaultTTLSeconds: 3600
};

// attempt is the zero-based retry index: 0 schedules the first retry.
export async function scheduleRetry(
  payload: Record<string, unknown>,
  attempt: number,
  originalTimestamp: number,
  ttlSeconds: number = CONFIG.defaultTTLSeconds
): Promise<void> {
  const elapsedMs      = Date.now() - originalTimestamp;
  const ttlRemainingMs = ttlSeconds * 1000 - elapsedMs;

  if (ttlRemainingMs <= 0) {
    console.warn('TTL expired. Discarding payload.');
    return;
  }

  if (attempt >= CONFIG.maxRetries) {
    console.warn('Retries exhausted. Routing to DLQ.');
    await DLQ_QUEUE.add('push-dlq', { ...payload, reason: 'retries_exhausted' });
    return;
  }

  // Full jitter: draw uniformly from [0, cap], where cap is the exponential
  // delay clamped to maxDelayMs.
  const capMs = Math.floor(
    Math.min(CONFIG.maxDelayMs, CONFIG.baseDelayMs * Math.pow(CONFIG.multiplier, attempt))
  );
  const calculatedDelay = randomInt(0, capMs + 1); // upper bound is exclusive

  if (calculatedDelay >= ttlRemainingMs) {
    console.warn('Backoff exceeds remaining TTL. Routing to DLQ.');
    await DLQ_QUEUE.add('push-dlq', { ...payload, reason: 'ttl_exceeded' });
    return;
  }

  await PUSH_QUEUE.add('push-retry', { ...payload, attempt: attempt + 1 }, {
    delay: calculatedDelay
  });
}

Python with Celery

import random
import time
from celery import Celery

app = Celery('push_tasks', broker='redis://localhost:6379/0')

CONFIG = {
    'base_delay':  2,      # seconds
    'multiplier':  2.0,
    'max_delay':   120,    # seconds
    'max_retries': 5,
    'default_ttl': 3600    # seconds
}


class TransientPushError(Exception):
    """Raised for retryable push delivery failures (429, 5xx)."""


# send_to_browser() and route_to_dlq() are application-specific helpers
# that must be defined elsewhere in your codebase.

@app.task(bind=True, max_retries=CONFIG['max_retries'])
def deliver_push(self, payload: dict, original_ts: float = 0.0):
    # First run: stamp the creation time. Retries receive it via args below,
    # so the TTL clock is not reset on every attempt.
    original_ts = original_ts or time.time()
    ttl_remaining = CONFIG['default_ttl'] - (time.time() - original_ts)

    if ttl_remaining <= 0:
        return  # TTL expired; silently drop

    try:
        send_to_browser(payload)
    except TransientPushError as e:
        attempt = self.request.retries  # 0 on the first failure
        if attempt >= CONFIG['max_retries']:
            route_to_dlq(payload, reason='retries_exhausted')
            return

        # Full jitter: uniform over [0, cap].
        cap   = min(CONFIG['max_delay'],
                    CONFIG['base_delay'] * CONFIG['multiplier'] ** attempt)
        delay = random.uniform(0, cap)

        if delay >= ttl_remaining:
            route_to_dlq(payload, reason='ttl_exceeded')
            return

        raise self.retry(exc=e, countdown=delay,
                         args=(payload, original_ts), kwargs={})

Validation, Monitoring & Dead-Letter Routing

Continuous validation ensures the backoff system adapts to vendor degradation without masking systemic failures.

Success Metrics Thresholds:

retry_success_rate > 65%
queue_depth < 10,000 pending jobs
p95_retry_latency < 500 ms (processing overhead, not delay)
permanent_failure_rate < 5% (elevated values indicate stale subscription inventory)

When retry_count >= max_retries or actual_delay >= ttl_remaining, route the payload to a dedicated DLQ. Implement an automated DLQ consumer that:

Logs the failure reason and endpoint hash.
Flags the subscription endpoint for health verification.
Removes the subscription from active routing tables if 410 or 404 is confirmed.

If the HTTP 503 rate exceeds 20% over a 5-minute window, trigger a circuit breaker that pauses new dispatches and prompts a check of the vendor's status page. Configure alerts for sustained retry queue depth spikes exceeding 3 standard deviations from baseline.
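A sliding-window breaker for the 503 threshold might look like the sketch below. The class name is hypothetical, and min_samples is an added assumption so that a handful of early failures does not trip the breaker.

```python
from collections import deque


class Http503Breaker:
    """Opens when the share of 503 responses in the window exceeds threshold."""

    def __init__(self, threshold: float = 0.20, window_s: float = 300.0,
                 min_samples: int = 20) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self.min_samples = min_samples  # assumption: ignore tiny samples
        self._events = deque()          # (timestamp, was_503)

    def record(self, status_code: int, now: float) -> None:
        self._events.append((now, status_code == 503))
        self._evict(now)

    def _evict(self, now: float) -> None:
        # Drop observations that have aged out of the window.
        while self._events and self._events[0][0] <= now - self.window_s:
            self._events.popleft()

    def is_open(self, now: float) -> bool:
        """True means: pause new dispatches."""
        self._evict(now)
        total = len(self._events)
        if total < self.min_samples:
            return False
        failures = sum(1 for _, was_503 in self._events if was_503)
        return failures / total > self.threshold
```

Workers would call record() after every response and check is_open() before each dispatch; once the window drains of 503s, the breaker closes on its own.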

Gotchas & Edge Cases

Retry-After overrides your formula. When FCM or Mozilla Autopush returns a 429 with a Retry-After: 60 header, you must use that value directly — do not apply your own exponential cap on top of it; combining both delays can exceed the message TTL.
BullMQ delay is wall-clock, not queue-depth-adjusted. If your Redis node is under memory pressure, the delayed job may sit in a sorted set longer than expected. Monitor ZADD latency separately from delivery latency.
AWS SQS DelaySeconds caps at 900 s. The baseline configuration never reaches it (max_delay is 120 s), but a large Retry-After value or a raised max_delay can. For delays beyond 900 s you need a Step Function state machine or a polling worker, not DelaySeconds alone. Design for this before hitting the limit in production.
Celery max_retries counts retries, not total attempts. self.request.retries is 0 during the first execution and increments on each retry, so max_retries=5 allows 5 retries (6 total attempts). Verify the count against your version’s docs before tuning max_retries for SLA compliance.
Duplicate delivery on worker crash mid-ack. If a worker crashes after dispatching the push but before acknowledging the queue message, the broker re-delivers. Implement idempotency keys (correlation_id) at the push service layer to detect and discard duplicates without triggering additional retries.
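The duplicate check can be sketched with an in-memory guard keyed on correlation_id. The DeliveryDeduplicator class is hypothetical, and a dict only works within one process; surviving a worker crash requires a shared store, for example Redis SET with the NX and EX options.

```python
import time
from typing import Dict, Optional


class DeliveryDeduplicator:
    """Hypothetical single-process guard against re-delivered queue messages."""

    def __init__(self, retention_s: float = 3600.0) -> None:
        self.retention_s = retention_s
        self._seen: Dict[str, float] = {}

    def first_time(self, correlation_id: str, now: Optional[float] = None) -> bool:
        """Return True once per correlation_id within the retention period."""
        now = time.time() if now is None else now
        # Forget keys older than the retention period to bound memory.
        self._seen = {key: ts for key, ts in self._seen.items()
                      if now - ts < self.retention_s}
        if correlation_id in self._seen:
            return False
        self._seen[correlation_id] = now
        return True
```

A worker would call first_time() before dispatching and acknowledge the message without sending when it returns False. Setting retention_s to the message TTL is a reasonable default, since a duplicate older than that would be dropped anyway.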

Handling 410 Gone Responses at Scale — permanent-failure routing pipeline that feeds into the DLQ this backoff system drains.
Setting Optimal TTL Values for Time-Sensitive Alerts — how to compute the TTL budget your backoff schedule must stay within.
Scaling Push Queues with Redis or RabbitMQ — broker configuration that underpins the delayed-job infrastructure used here.

FAQ

Should I use full jitter or equal jitter for web push retries?

Full jitter — random(0, exponential_cap) — is the better choice for web push because push services like FCM enforce per-project rate limits across all your workers simultaneously. Full jitter maximally disperses those concurrent retries. Equal jitter (half deterministic, half random) is useful when you need a guaranteed minimum spacing between attempts, such as when a vendor’s Retry-After specifies a floor value, but it provides weaker load dispersion.

What should I do when the push service returns a 429 with a Retry-After header?

Use the Retry-After value directly as the delay for that attempt — do not add it on top of your exponential formula. Record the header value in structured logs alongside your correlation_id and endpoint_hash. If Retry-After would push the retry past the message TTL, route to the DLQ immediately rather than scheduling a retry that will never execute within the relevance window. For a full treatment of 429 edge cases see the handling 429 Too Many Requests guide.
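Retry-After can arrive either as a number of seconds or as an HTTP date, so the value needs normalising before it is used as a delay. A minimal sketch, with parse_retry_after as an illustrative name:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: str, now: Optional[datetime] = None) -> Optional[float]:
    """Convert a Retry-After header to seconds; None if it cannot be parsed."""
    value = value.strip()
    if value.isascii() and value.isdigit():
        return float(value)  # delta-seconds form, e.g. "60"
    try:
        target = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if target.tzinfo is None:
        target = target.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means the retry may proceed immediately.
    return max(0.0, (target - now).total_seconds())
```

When the function returns None, falling back to the jittered exponential delay is one reasonable choice; the returned value should still be compared against the remaining TTL before enqueueing.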

How many retries are appropriate before routing to the DLQ?

Five retries (max_retries = 5) with the 2 s base and 2× multiplier give a cumulative retry window of at most 62 s (2 + 4 + 8 + 16 + 32). Full jitter draws each delay from below its cap, so the expected window is about half that, roughly 31 s. For time-critical alerts with short TTLs (OTPs, flash-sale triggers), reduce to 3 retries and cap delays at TTL × 0.5. For lower-priority engagement campaigns you can extend to 7 retries, but validate that the final backoff cap does not exceed your queue’s x-message-ttl setting.