Scaling Push Queues with Redis or RabbitMQ: Diagnostic Workflows & Exact Configurations

High-throughput web push delivery fails when the queue layer becomes the bottleneck — consumer lag accumulates, memory watermarks trip, and FCM/APNs start returning 429. Choosing and correctly configuring Redis Streams or RabbitMQ AMQP determines whether your pipeline survives campaign spikes or collapses under them.

Quick-Answer Summary

Question | Answer
Which broker for < 50 k msg/s, simple ops? | Redis Streams with consumer groups
Which broker for complex routing, HA guarantees? | RabbitMQ quorum queues with DLX
Key Redis knob | MAXLEN ~ + maxmemory-policy allkeys-lru
Key RabbitMQ knob | prefetch_count 5, vm_memory_high_watermark 0.6, quorum queues
How to detect consumer lag? | XINFO GROUPS push:stream (Redis) / rabbitmqctl list_queues messages_unacknowledged (RabbitMQ)

Both brokers require pre-flight endpoint validation before ingestion. Routing stale or 410 Gone endpoints into the delivery stream wastes quota and inflates DLQ depth. Start there before tuning throughput parameters.

Figure: Push queue delivery architecture. The producer validates endpoints before enqueueing; the broker fans out to consumer workers that dispatch to FCM/APNs; failed or expired messages route to the Dead Letter Exchange / discard stream.

Baseline Architecture & Subscription Lifecycle Integration

Push subscription endpoints must be validated before queue ingestion to prevent dead-letter accumulation. Aligning your ingestion pipeline with the Backend Delivery Architecture & Queue Management reference ensures token deduplication, silent drop handling, and lifecycle-aware routing are enforced at the producer level.

Implementation Directives:

  1. Partition Mapping: Map messages to queue partitions with consistent hashing on a stable key. Hash the tenant ID if ordering must hold per tenant, or the push token if ordering only matters per device.
  2. Pre-Flight Validation: Query active subscription registries before enqueueing. Reject invalid payloads at the API gateway.
  3. Lifecycle Routing: Immediately route unsubscribed, expired, or 410 Gone endpoints to a Dead Letter Exchange (DLX) or discard stream. See Handling 410 Gone Responses at Scale for the exact DLQ consumer pattern.
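
The partition mapping directive can be sketched as a small consistent-hash ring. This is a minimal illustration under stated assumptions, not a reference implementation: the `HashRing` class, the `push:stream:N` partition names, the `tenant-42` key, and the choice of 64 virtual nodes are all hypothetical.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring mapping keys to partition names."""

    def __init__(self, partitions, vnodes=64):
        # Each partition gets `vnodes` points on the ring to smooth the distribution.
        self._ring = sorted(
            (self._hash(f"{p}#{i}"), p) for p in partitions for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        # Stable across processes, unlike Python's built-in hash().
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def partition_for(self, key):
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


# Hypothetical partition names; keying by tenant ID keeps a tenant on one partition.
ring = HashRing([f"push:stream:{n}" for n in range(4)])
partition = ring.partition_for("tenant-42")
```

Unlike a plain `hash(key) % N`, adding a partition to the ring relocates only the keys that fall to the new partition, so most tenants keep their ordering during a scale-out.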

Redis Streams vs. RabbitMQ: Capability Comparison

Choose your broker based on routing complexity, HA requirements, and your team’s operational familiarity:

Parameter | Redis Streams | RabbitMQ AMQP (quorum queues)
Throughput ceiling | ~500 k msg/s (single shard) | ~100–150 k msg/s per node
Message ordering | Per-stream, per-consumer-group | Per-queue (FIFO)
Dead-letter routing | Manual (XACK + separate stream) | Native DLX + routing key
TTL control | Per-job via SETEX; stream uses MAXLEN | x-message-ttl at queue declaration
HA model | Redis Cluster / Sentinel | Quorum queues (Raft)
Replay / rewind | Yes: reprocess from any offset | No: gone once ACK’d
Ops complexity | Low (single binary) | Medium (plugins, policies, mgmt UI)
Push-specific fit | High-volume broadcast campaigns | Routing by region/priority/type

For multi-tenant pipelines where payload aggregation matters, coordinate with the Message Batching & Throughput Optimization section to reduce per-message overhead before hitting the broker.

Diagnostic Workflow: Identifying Backpressure & Consumer Lag

When delivery latency exceeds SLA thresholds, isolate the bottleneck using queue telemetry. Monitor Redis Stream consumer group lag or RabbitMQ messages_unacknowledged metrics. Correlate memory spikes with connection pool exhaustion and GC pauses in consumer workers.

# 1. Analyze Redis Stream depth and consumer group lag
redis-cli XINFO GROUPS push:stream

# 2. Audit RabbitMQ queue state and unacked message count
rabbitmqctl list_queues name messages messages_unacknowledged consumers

# 3. Watch live Redis throughput (1-second sample interval; Ctrl-C to stop).
#    Note: -n selects a database number, it does not set an iteration count.
redis-cli --stat -i 1

# A sustained producer/consumer delta > 15% indicates consumer saturation,
# network backpressure, or gateway throttling — not a broker misconfiguration.
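
The 15% rule above can be expressed as a small check. The sketch below assumes the delta is measured relative to the produce rate and that "sustained" means every sample in the observation window exceeds the threshold; both are interpretations, and the function names are hypothetical.

```python
def producer_consumer_delta(produced_per_s, consumed_per_s):
    """Fraction by which the produce rate exceeds the consume rate."""
    if produced_per_s <= 0:
        return 0.0
    return max(0.0, (produced_per_s - consumed_per_s) / produced_per_s)


def is_saturated(samples, threshold=0.15):
    """True only if every (produced, consumed) sample exceeds the threshold."""
    return bool(samples) and all(
        producer_consumer_delta(p, c) > threshold for p, c in samples
    )
```

Feeding it a rolling window of per-second rates (for example, the last 30 samples) avoids alerting on a single burst.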

Exact Configuration: Redis Streams

Enforce MAXLEN ~ trimming and explicit consumer group offsets. Never rely on eviction alone to bound stream depth; approximate trimming keeps memory predictable without the write-amplification of exact trimming.

# Stream creation with approximate trimming (bounds memory footprint)
# Field-value pairs follow the auto-generated entry ID (*)
XADD push:stream MAXLEN "~" 50000 "*" endpoint "https://fcm.example/endpoint" timestamp "1700000000000"

# Consumer group initialization — MKSTREAM creates the stream if absent.
# Start from latest entry ($) so the group only receives new messages.
XGROUP CREATE push:stream push-consumers "$" MKSTREAM

# Read up to 10 messages per worker with a 5-second block timeout
XREADGROUP GROUP push-consumers worker-1 COUNT 10 BLOCK 5000 STREAMS push:stream ">"

Production Parameters:

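  • maxmemory-policy allkeys-lru: evicts least-recently-used keys when memory is full. Eviction removes whole keys, so an evicted stream loses every entry it holds; size maxmemory so that MAXLEN trimming, not eviction, is what bounds the stream
  • BLOCK 5000 on XREADGROUP: prevents busy-polling; consumers sleep until messages arrive or the 5-second timeout elapses
  • Deploy Redis Cluster and put the tenant ID in a hash tag (for example push:{tenant-42}:stream) so each tenant's stream maps to a single hash slot
  • Use XPENDING to detect orphaned messages from crashed consumers, and XCLAIM (or XAUTOCLAIM on Redis 6.2+) to reassign them for reprocessing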
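
A recovery pass for orphaned messages might look like the following. It is a sketch that assumes a redis-py style client (`xpending_range` returning dicts with `message_id` and `time_since_delivered`, and `xclaim`); the function name, the `rescuer` consumer name, and the batch size are hypothetical.

```python
def reclaim_orphans(client, stream, group, rescuer, min_idle_ms, batch=100):
    """Claim entries idle for at least min_idle_ms on behalf of `rescuer`."""
    # Inspect up to `batch` entries of the group's pending entries list.
    pending = client.xpending_range(stream, group, min="-", max="+", count=batch)
    stale_ids = [
        entry["message_id"]
        for entry in pending
        if entry["time_since_delivered"] >= min_idle_ms
    ]
    if not stale_ids:
        return []
    # XCLAIM applies min_idle_ms again on the server, so entries that another
    # worker picked up in the meantime are skipped.
    return client.xclaim(stream, group, rescuer, min_idle_ms, stale_ids)
```

Run it on a timer from a dedicated worker and pass the returned entries through the normal dispatch path, acknowledging each one only after the gateway call succeeds.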

Exact Configuration: RabbitMQ AMQP

Disable auto_ack and tune prefetch_count to prevent head-of-line blocking. Use quorum queues, available since RabbitMQ 3.8; classic mirrored queues were deprecated in 3.9 and removed in 4.0.

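import pika


class PermanentFailure(Exception):
    """Raised when a push can never succeed (for example, 404 or 410 Gone)."""


def dispatch_to_gateway(body):
    """Placeholder: send the payload to FCM/APNs; raise PermanentFailure on 404/410."""
    raise NotImplementedError


connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

# Declare the main delivery queue with TTL, length cap, and DLX routing
channel.queue_declare(
    queue='push.delivery',
    durable=True,
    arguments={
        'x-queue-type':           'quorum',    # HA via Raft consensus
        'x-message-ttl':          3600000,     # 1-hour TTL in milliseconds
        'x-max-length':           100000,      # depth cap; default overflow (drop-head) dead-letters the oldest message
        'x-dead-letter-exchange': 'push.dlx'   # route failures here
    }
)

# Per-consumer prefetch: global_qos=False applies the limit to each consumer, not the whole channel
channel.basic_qos(prefetch_count=5, global_qos=False)

def on_message(ch, method, properties, body):
    try:
        dispatch_to_gateway(body)
        ch.basic_ack(delivery_tag=method.delivery_tag)
    except PermanentFailure:
        # requeue=False sends to DLX instead of requeueing
        ch.basic_nack(delivery_tag=method.delivery_tag, requeue=False)

# auto_ack=False keeps each message unacked until on_message acks or nacks it
channel.basic_consume(queue='push.delivery', on_message_callback=on_message, auto_ack=False)
channel.start_consuming()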

Production Parameters:

  • vm_memory_high_watermark 0.6 — pause publishers at 60% node memory usage
  • prefetch_count 5 — start here; tune upward only after profiling consumer processing time
  • Monitor messages_unacknowledged per queue; a growing count means consumers are stalling
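
The queue declaration names push.dlx as its dead-letter exchange, but RabbitMQ does not create that exchange for you; if it is missing, dead-lettered messages are dropped. The sketch below declares it together with a queue to receive the failures. It assumes a pika channel, a fanout exchange, and a hypothetical queue name `push.dead`; adjust the exchange type if you route failures by key.

```python
def declare_dead_letter_topology(channel, dlx="push.dlx", dlq="push.dead"):
    """Declare the dead-letter exchange and a durable queue bound to it."""
    channel.exchange_declare(exchange=dlx, exchange_type="fanout", durable=True)
    channel.queue_declare(
        queue=dlq, durable=True, arguments={"x-queue-type": "quorum"}
    )
    # A fanout exchange ignores routing keys, so every dead-lettered
    # message reaches the bound queue.
    channel.queue_bind(queue=dlq, exchange=dlx)
```

Call it once at startup, before declaring push.delivery, so failures have somewhere to land from the first message onward.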

Step-by-Step Resolution for Campaign Spike Saturation

Execute this incident response sequence when: consumer lag > 10,000 pending messages, gateway 429/503 rates exceed 5%, or memory watermarks trip.

  1. Isolate the Bottleneck. Run XINFO GROUPS or rabbitmqctl list_queues. Distinguish consumer-side stall (threads blocked on gateway I/O) from broker-side backpressure (memory watermark hit, publisher flow-controlled).
  2. Scale Consumers Horizontally. Deploy additional worker pods. Cap total connection pool size — each RabbitMQ connection is a TCP socket; exceeding broker limits triggers connection refused errors.
  3. Reduce Prefetch. Lower prefetch_count to 1–5 per worker. This redistributes in-flight messages across the expanded pool and breaks head-of-line blocking on slow gateway calls.
  4. Activate Circuit Breakers. When FCM/APNs return 429 or 503, pause that vendor’s consumer lane. Route failed payloads to a retry queue with exponential backoff (base=2s, max=300s). Never re-inject into the primary queue.
  5. Verify Normalization. Monitor acknowledgment rates and consumer lag. Target: restored throughput within 60 seconds, consumer lag < 1,000, zero payload loss during scaling events.
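
The retry schedule from step 4 (base=2s, max=300s) can be computed as below. The jitter option is an addition not stated in the steps above, included on the assumption that many workers retry at once; the function name and the 0-based attempt numbering are likewise hypothetical.

```python
import random


def retry_delay(attempt, base=2.0, cap=300.0, jitter=True, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based)."""
    # Double the delay on each attempt, never exceeding the cap.
    delay = min(cap, base * (2 ** attempt))
    # Full jitter spreads retries so workers do not hit the gateway in lockstep.
    return delay * rng() if jitter else delay
```

Without jitter the schedule runs 2, 4, 8, 16 seconds and so on, reaching the 300-second cap at the eighth retry.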

Gotchas & Edge Cases

  • Redis stream rewind during crash recovery. If a consumer crashes before XACK, its messages stay in the group's pending entries list (PEL), still assigned to the dead consumer. Without an XPENDING/XCLAIM reconciliation loop, those messages are never redelivered. Implement a periodic scan for entries idle longer than 2 × consumer_timeout.
  • RabbitMQ quorum queue write amplification. Quorum queues replicate every write to a Raft majority. On write-heavy campaigns, disk I/O becomes the bottleneck before network or CPU. Provision NVMe-backed nodes and monitor io_write_avg_time_ms via the management API.
  • TTL unit mismatch between brokers. Redis SETEX takes seconds; RabbitMQ x-message-ttl takes milliseconds. Passing a seconds value to RabbitMQ silently shortens the TTL by a factor of 1,000: 3600, meant as one hour, becomes 3.6 seconds, so messages can expire before consumers read them. Always validate TTL configuration in integration tests with a clock-advancing stub.
  • Silent endpoint drops inflate queue depth. When an endpoint returns 404 or 410 Gone, a consumer with misconfigured error handling may requeue the message (or never settle it) instead of rejecting it with requeue=False, so DLX routing is never triggered. The DLQ consumer appears healthy while the primary queue depth climbs. Cross-check DLQ ingestion rate against delivery failure rate in your delivery tracking dashboard.
  • Connection pool exhaustion under horizontal scale. Adding consumer pods without capping the pool size exhausts the broker's file descriptors and connection limits, and new connections are refused. Set max_connections in your connection pool library and test failover behavior before production scaling events.
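
One way to avoid the TTL unit mismatch is to hold TTLs as a `timedelta` and convert at the broker boundary. The helper names below are hypothetical, and rejecting zero is a design choice of this sketch rather than a broker requirement.

```python
from datetime import timedelta


def _positive(value: int) -> int:
    if value <= 0:
        raise ValueError("TTL must be positive")
    return value


def redis_ttl(ttl: timedelta) -> int:
    """Whole seconds, the unit SETEX expects."""
    return _positive(int(ttl.total_seconds()))


def rabbitmq_ttl(ttl: timedelta) -> int:
    """Whole milliseconds, the unit x-message-ttl expects."""
    return _positive(int(ttl.total_seconds() * 1000))
```

With a single `timedelta(hours=1)` constant, the Redis side receives 3600 and the RabbitMQ side receives 3600000, and neither caller handles raw numbers.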

Validation Checklist & Production Monitoring

Track Prometheus or CloudWatch metrics for consumer lag, memory utilization, and error rates. Set alert thresholds at > 80% memory utilization and > 5% gateway rejection rate.

  • msgs/sec throughput matches or exceeds defined SLOs
  • MAXLEN ~ (Redis) or x-max-length (RabbitMQ) prevents OOM under burst ingestion
  • XPENDING scan (Redis) or messages_unacknowledged alert (RabbitMQ) catches stalled consumers
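
The two alert thresholds above (80% memory utilization, 5% gateway rejection rate) can be evaluated from a metrics snapshot as sketched here. The `QueueSnapshot` fields and alert names are hypothetical; in practice the same conditions would usually live in Prometheus or CloudWatch alert rules.

```python
from dataclasses import dataclass


@dataclass
class QueueSnapshot:
    memory_used_bytes: int
    memory_limit_bytes: int
    gateway_requests: int
    gateway_rejections: int  # 429 and 503 responses


def firing_alerts(s, memory_threshold=0.80, rejection_threshold=0.05):
    """Return the names of the alerts whose threshold is exceeded."""
    alerts = []
    # Guard against division by zero when a counter has not been populated yet.
    if s.memory_limit_bytes and s.memory_used_bytes / s.memory_limit_bytes > memory_threshold:
        alerts.append("memory")
    if s.gateway_requests and s.gateway_rejections / s.gateway_requests > rejection_threshold:
        alerts.append("gateway_rejections")
    return alerts
```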

FAQ

When should I choose Redis Streams over RabbitMQ for push delivery?

Choose Redis Streams when your primary concern is raw throughput on a homogeneous delivery pipeline — all messages go to the same gateway, your team already operates Redis, and you don’t need complex routing topologies. Redis Streams handle hundreds of thousands of messages per second per shard with minimal operational overhead. Choose RabbitMQ when you need native dead-letter exchange routing, per-queue TTL at declaration time, priority queues, or message routing by vendor (FCM vs. APNs vs. Mozilla Autopush) without application-level routing logic.

How do I prevent message loss if a consumer crashes mid-processing?

For Redis Streams, never call XACK until dispatch to the push gateway succeeds. Unacknowledged messages remain in the Pending Entry List (PEL). Run a recovery worker that calls XPENDING and XCLAIM for entries older than your consumer timeout (typically 2× the gateway request timeout). For RabbitMQ, combine basic_qos(prefetch_count=N) with manual basic_ack after confirmed dispatch and basic_nack(requeue=False) on permanent failures. The broker holds unacked messages in memory and redelivers them on channel close — no external reconciliation loop required.
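
The acknowledge-after-dispatch rule for Redis Streams can be sketched as a single read cycle. It assumes a redis-py style client using the default RESP2 reply shape (a list of `[stream, entries]` pairs); `consume_once` and the `dispatch` callable are hypothetical names.

```python
def consume_once(client, dispatch, stream="push:stream", group="push-consumers",
                 consumer="worker-1", count=10, block_ms=5000):
    """Read one batch and XACK only the entries whose dispatch succeeded."""
    batches = client.xreadgroup(
        group, consumer, {stream: ">"}, count=count, block=block_ms
    )
    acked = []
    for _stream_name, entries in batches or []:
        for entry_id, fields in entries:
            try:
                dispatch(fields)
            except Exception:
                # Leave the entry in the PEL so a recovery worker can reclaim it.
                continue
            client.xack(stream, group, entry_id)
            acked.append(entry_id)
    return acked
```

A production worker would call this in a loop and distinguish permanent failures (which belong in a discard stream) from transient ones; the sketch treats every exception as transient.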

What happens to messages when the RabbitMQ memory watermark trips?

When vm_memory_high_watermark (default 0.4, recommended 0.6 for push workloads) is exceeded, RabbitMQ blocks all publishing connections with a flow-control signal. Consumers continue draining the queue, but producers stall until memory drops below the watermark. This is intentional backpressure, not a crash. Configure your producer with a blocked_connection_timeout so stalled publishers surface an error rather than hanging indefinitely. If the watermark trips frequently, reduce prefetch_count, add consumer nodes, or increase the broker’s memory allocation.
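
A producer configured with blocked_connection_timeout might be set up as follows. This is a sketch assuming the pika client; the helper names and the 30-second default are hypothetical, and pika is imported lazily so the keyword-building helper has no dependencies.

```python
def publisher_connection_kwargs(host="localhost", blocked_timeout_s=30):
    """Keyword arguments for pika.ConnectionParameters on the producer side."""
    if blocked_timeout_s <= 0:
        raise ValueError("blocked_timeout_s must be positive")
    return {
        "host": host,
        # Seconds a connection may stay blocked by the broker before pika
        # tears it down instead of waiting indefinitely.
        "blocked_connection_timeout": blocked_timeout_s,
    }


def open_publisher_connection(**overrides):
    """Open a blocking connection that fails fast when the broker blocks it."""
    import pika  # lazy import: only needed when a connection is actually opened

    params = pika.ConnectionParameters(**publisher_connection_kwargs(**overrides))
    return pika.BlockingConnection(params)
```

When the timeout expires, a BlockingConnection raises an error on the publishing call, which the producer can catch to shed load or buffer locally.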
