Scaling Push Queues with Redis or RabbitMQ: Diagnostic Workflows & Exact Configurations
High-throughput web push delivery fails when the queue layer becomes the bottleneck — consumer lag accumulates, memory watermarks trip, and FCM/APNs start returning 429. Choosing and correctly configuring Redis Streams or RabbitMQ AMQP determines whether your pipeline survives campaign spikes or collapses under them.
Quick-Answer Summary
| Question | Answer |
|---|---|
| Which broker for < 50 k msg/s, simple ops? | Redis Streams with consumer groups |
| Which broker for complex routing, HA guarantees? | RabbitMQ quorum queues with DLX |
| Key Redis knob | MAXLEN ~ + maxmemory-policy allkeys-lru |
| Key RabbitMQ knob | prefetch_count 5, vm_memory_high_watermark 0.6, quorum queues |
| How to detect consumer lag? | XINFO GROUPS push:stream / rabbitmqctl list_queues messages_unacknowledged |
Both brokers require pre-flight endpoint validation before ingestion. Routing stale or 410 Gone endpoints into the delivery stream wastes quota and inflates DLQ depth. Start there before tuning throughput parameters.
Baseline Architecture & Subscription Lifecycle Integration
Push subscription endpoints must be validated before queue ingestion to prevent dead-letter accumulation. Aligning your ingestion pipeline with the Backend Delivery Architecture & Queue Management reference ensures token deduplication, silent drop handling, and lifecycle-aware routing are enforced at the producer level.
Implementation Directives:
- Partition Mapping: Hash push tokens to queue partitions using consistent hashing to maintain ordering per tenant.
- Pre-Flight Validation: Query active subscription registries before enqueueing. Reject invalid payloads at the API gateway.
- Lifecycle Routing: Immediately route unsubscribed, expired, or
410 Goneendpoints to a Dead Letter Exchange (DLX) or discard stream. See Handling 410 Gone Responses at Scale for the exact DLQ consumer pattern.
Redis Streams vs. RabbitMQ: Capability Comparison
Choose your broker based on routing complexity, HA requirements, and your team’s operational familiarity:
| Parameter | Redis Streams | RabbitMQ AMQP (quorum queues) |
|---|---|---|
| Throughput ceiling | ~500 k msg/s (single shard) | ~100–150 k msg/s per node |
| Message ordering | Per-stream, per-consumer-group | Per-queue (FIFO) |
| Dead-letter routing | Manual (XACK + separate stream) | Native DLX + routing key |
| TTL control | Per-job via SETEX; stream uses MAXLEN |
x-message-ttl at queue declaration |
| HA model | Redis Cluster / Sentinel | Quorum queues (Raft) |
| Replay / rewind | Yes — reprocess from any offset | No — once ACK’d, gone |
| Ops complexity | Low (single binary) | Medium (plugins, policies, mgmt UI) |
| Push-specific fit | High-volume broadcast campaigns | Routing by region/priority/type |
For multi-tenant pipelines where payload aggregation matters, coordinate with the Message Batching & Throughput Optimization section to reduce per-message overhead before hitting the broker.
Diagnostic Workflow: Identifying Backpressure & Consumer Lag
When delivery latency exceeds SLA thresholds, isolate the bottleneck using queue telemetry. Monitor Redis Stream consumer group lag or RabbitMQ messages_unacknowledged metrics. Correlate memory spikes with connection pool exhaustion and GC pauses in consumer workers.
# 1. Analyze Redis Stream depth and consumer group lag
redis-cli XINFO GROUPS push:stream
# 2. Audit RabbitMQ queue state and unacked message count
rabbitmqctl list_queues name messages messages_unacknowledged consumers
# 3. Watch live Redis throughput (1-second sample, 10 iterations)
redis-cli --stat -i 1 -n 10
# A sustained producer/consumer delta > 15% indicates consumer saturation,
# network backpressure, or gateway throttling — not a broker misconfiguration.
Exact Configuration: Redis Streams
Enforce MAXLEN ~ trimming and explicit consumer group offsets. Never rely on eviction alone to bound stream depth; approximate trimming keeps memory predictable without the write-amplification of exact trimming.
# Stream creation with approximate trimming (bounds memory footprint)
# Field-value pairs follow the auto-generated entry ID (*)
XADD push:stream MAXLEN "~" 50000 "*" endpoint "https://fcm.example/endpoint" timestamp "1700000000000"
# Consumer group initialization — MKSTREAM creates the stream if absent.
# Start from latest entry ($) so the group only receives new messages.
XGROUP CREATE push:stream push-consumers "$" MKSTREAM
# Read up to 10 messages per worker with a 5-second block timeout
XREADGROUP GROUP push-consumers worker-1 COUNT 10 BLOCK 5000 STREAMS push:stream ">"
Production Parameters:
maxmemory-policy allkeys-lru— evict least-recently-used keys when memory is fullBLOCK 5000onXREADGROUP— prevents busy-polling; consumers sleep until messages arrive- Deploy Redis Cluster with hash slots mapped to tenant IDs for multi-tenant isolation
- Use
XPENDINGto detect and reprocess orphaned messages from crashed consumers
Exact Configuration: RabbitMQ AMQP
Disable auto_ack and tune prefetch_count to prevent head-of-line blocking. Use quorum queues — classic mirrored queues were deprecated in RabbitMQ 3.8 and removed in 4.0.
import pika
connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
# Declare the main delivery queue with TTL, length cap, and DLX routing
channel.queue_declare(
queue='push.delivery',
durable=True,
arguments={
'x-queue-type': 'quorum', # HA via Raft consensus
'x-message-ttl': 3600000, # 1-hour TTL in milliseconds
'x-max-length': 100000, # reject beyond this depth
'x-dead-letter-exchange': 'push.dlx' # route failures here
}
)
# Per-consumer prefetch — global_=False scopes the limit per consumer, not channel
channel.basic_qos(prefetch_count=5, global_=False)
def on_message(ch, method, properties, body):
try:
dispatch_to_gateway(body)
ch.basic_ack(delivery_tag=method.delivery_tag)
except PermanentFailure:
# requeue=False sends to DLX instead of requeueing
ch.basic_nack(delivery_tag=method.delivery_tag, requeue=False)
Production Parameters:
vm_memory_high_watermark 0.6— pause publishers at 60% node memory usageprefetch_count 5— start here; tune upward only after profiling consumer processing time- Monitor
messages_unacknowledgedper queue; a growing count means consumers are stalling
Step-by-Step Resolution for Campaign Spike Saturation
Execute this incident response sequence when: consumer lag > 10,000 pending messages, gateway 429/503 rates exceed 5%, or memory watermarks trip.
- Isolate the Bottleneck. Run
XINFO GROUPSorrabbitmqctl list_queues. Distinguish consumer-side stall (threads blocked on gateway I/O) from broker-side backpressure (memory watermark hit, publisher flow-controlled). - Scale Consumers Horizontally. Deploy additional worker pods. Cap total connection pool size — each RabbitMQ connection is a TCP socket; exceeding broker limits triggers connection refused errors.
- Reduce Prefetch. Lower
prefetch_countto 1–5 per worker. This redistributes in-flight messages across the expanded pool and breaks head-of-line blocking on slow gateway calls. - Activate Circuit Breakers. When FCM/APNs return
429or503, pause that vendor’s consumer lane. Route failed payloads to a retry queue with exponential backoff (base=2s, max=300s). Never re-inject into the primary queue. - Verify Normalization. Monitor acknowledgment rates and consumer lag. Target: restored throughput within 60 seconds, consumer lag < 1,000, zero payload loss during scaling events.
Gotchas & Edge Cases
- Redis stream rewind during crash recovery. If a consumer crashes before
XACK, Redis marks those messages asPEL(pending entry list). Without aXPENDING/XCLAIMreconciliation loop, those messages are never redelivered. Implement a periodic scan for entries older than2 × consumer_timeout. - RabbitMQ quorum queue write amplification. Quorum queues replicate every write to a Raft majority. On write-heavy campaigns, disk I/O becomes the bottleneck before network or CPU. Provision NVMe-backed nodes and monitor
io_write_avg_time_msvia the management API. - TTL unit mismatch between brokers. Redis
SETEXtakes seconds; RabbitMQx-message-ttltakes milliseconds. Passing seconds to RabbitMQ silently sets a ~1 ms TTL — messages expire before consumers read them. Always validate TTL configuration in integration tests with a clock-advancing stub. - Silent endpoint drops inflate queue depth. Endpoints returning
404or410 Gonemay silently NACK without triggering DLX routing if consumer error handling is misconfigured. The DLQ consumer appears healthy while the primary queue depth climbs. Cross-check DLQ ingestion rate against delivery failure rate in your delivery tracking dashboard. - Connection pool exhaustion under horizontal scale. Adding consumer pods without capping the pool size causes broker-side TCP port exhaustion. Set
max_connectionsin your connection pool library and test failover behavior before production scaling events.
Validation Checklist & Production Monitoring
Track Prometheus or CloudWatch metrics for consumer lag, memory utilization, and error rates. Set alert thresholds at > 80% memory utilization and > 5% gateway rejection rate.
msgs/secthroughput matches or exceeds defined SLOsMAXLEN ~(Redis) orx-max-length(RabbitMQ) prevents OOM under burst ingestionXPENDINGscan (Redis) ormessages_unacknowledgedalert (RabbitMQ) catches stalled consumers
FAQ
When should I choose Redis Streams over RabbitMQ for push delivery?
Choose Redis Streams when your primary concern is raw throughput on a homogeneous delivery pipeline — all messages go to the same gateway, your team already operates Redis, and you don’t need complex routing topologies. Redis Streams handle hundreds of thousands of messages per second per shard with minimal operational overhead. Choose RabbitMQ when you need native dead-letter exchange routing, per-queue TTL at declaration time, priority queues, or message routing by vendor (FCM vs. APNs vs. Mozilla Autopush) without application-level routing logic.
How do I prevent message loss if a consumer crashes mid-processing?
For Redis Streams, never call XACK until dispatch to the push gateway succeeds. Unacknowledged messages remain in the Pending Entry List (PEL). Run a recovery worker that calls XPENDING and XCLAIM for entries older than your consumer timeout (typically 2× the gateway request timeout). For RabbitMQ, combine basic_qos(prefetch_count=N) with manual basic_ack after confirmed dispatch and basic_nack(requeue=False) on permanent failures. The broker holds unacked messages in memory and redelivers them on channel close — no external reconciliation loop required.
What happens to messages when the RabbitMQ memory watermark trips?
When vm_memory_high_watermark (default 0.4, recommended 0.6 for push workloads) is exceeded, RabbitMQ blocks all publishing connections with a flow-control signal. Consumers continue draining the queue, but producers stall until memory drops below the watermark. This is intentional backpressure, not a crash. Configure your producer with a blocked_connection_timeout so stalled publishers surface an error rather than hanging indefinitely. If the watermark trips frequently, reduce prefetch_count, add consumer nodes, or increase the broker’s memory allocation.
Related
- Implementing Exponential Backoff for Failed Push Deliveries — wire backoff schedules into the retry queue your broker feeds after a
429or503. - Handling 410 Gone Responses at Scale — DLQ consumer patterns for permanent endpoint invalidation.
- Setting Optimal TTL Values for Time-Sensitive Alerts — coordinate broker-level TTL with payload-level expiration to prevent stale delivery.