Optimal Batch Size for Web Push Throughput

Choosing the wrong concurrency window lands you in one of two failure modes: batches that are too small leave HTTP/2 connections underutilized and pay repeated TLS handshake costs; batches that are too large saturate the event loop with encryption work, exhaust open stream budgets on FCM and Mozilla Autopush, and inflate memory usage until GC pauses become visible in p95 latency.

Quick-Answer Summary

| Batch size (concurrent dispatches) | Characteristic | When to use |
|---|---|---|
| 1–10 | Minimal memory, no stream pressure, low CPU | Development, low-volume alerts (<1 k/min), constrained VMs (≤512 MB RAM) |
| 25–75 | Good HTTP/2 connection reuse, low encryption overhead | Standard production campaigns, multi-tenant SaaS with mixed endpoint origins |
| 100–200 | Near-peak throughput; connection pool hot | High-volume broadcast (>100 k/hr), dedicated worker nodes, ≥2 vCPU |
| 200–500 | Diminishing throughput returns; memory rises sharply | Only justified when a single push origin (FCM) dominates >90 % of endpoints |
| >500 | FCM/Autopush stream limits hit; p95 latency spikes | Avoid; split across more worker processes instead |

Start at 100 concurrent dispatches per worker process on a 2-vCPU node with 1 GB RAM. Tune from that baseline using the benchmark table and steps in the sections below.


HTTP/2 Multiplexing and Connection Reuse

Web Push (RFC 8030) maps one HTTP POST to one subscription endpoint. The network efficiency gain from “batching” is therefore not payload aggregation — it is controlling how many of those POSTs share the same TCP/TLS connection.

Push services (FCM at fcm.googleapis.com, Mozilla Autopush at updates.push.services.mozilla.com, APNs Web Push at web.push.apple.com) all expose HTTP/2 endpoints. A single HTTP/2 connection carries multiple independent streams simultaneously. The TLS 1.3 handshake for that connection costs roughly 1–2 round trips and 2–5 ms on a low-latency link. Paying that cost once and then multiplexing 100 POST streams over it is the primary throughput lever available to a push dispatcher.

When a naive, unbatched dispatcher opens a new TCP connection per notification, connection setup (TCP plus the TLS handshake) alone can add 20–50 ms of latency per message. At 10,000 messages/min the handshake cryptography also burns roughly 3–8 CPU-seconds per minute, on top of the OS socket overhead from churning through TIME_WAIT states. Keeping connections alive and multiplexing streams eliminates that cost.

Practical implication: Group subscriptions by push service origin before dispatching. FCM endpoints, Firefox endpoints, and APNs Web Push endpoints each go to a different host, so only requests sharing a host benefit from the same TCP connection. The Message Batching & Throughput Optimization guide covers origin-partitioned dispatch in detail.
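The grouping step can be sketched as follows. This is a minimal illustration, not the guide's reference implementation: the `Sub` interface and the `groupByOrigin` name are hypothetical, and the endpoint URLs are made-up examples.

```typescript
// Hypothetical minimal subscription shape for this sketch.
interface Sub {
  endpoint: string;
  keys: { p256dh: string; auth: string };
}

// Partition subscriptions by push service host so each group can be
// dispatched over connections to a single origin.
function groupByOrigin(subs: Sub[]): Map<string, Sub[]> {
  const groups = new Map<string, Sub[]>();
  for (const sub of subs) {
    const host = new URL(sub.endpoint).host;
    const bucket = groups.get(host);
    if (bucket) {
      bucket.push(sub);
    } else {
      groups.set(host, [sub]);
    }
  }
  return groups;
}

// Made-up endpoints, for illustration only.
const demoKeys = { p256dh: 'demo-key', auth: 'demo-auth' };
const demoGroups = groupByOrigin([
  { endpoint: 'https://fcm.googleapis.com/fcm/send/aaa', keys: demoKeys },
  { endpoint: 'https://updates.push.services.mozilla.com/wpush/v2/bbb', keys: demoKeys },
  { endpoint: 'https://fcm.googleapis.com/fcm/send/ccc', keys: demoKeys },
]);
console.log([...demoGroups.keys()]);
```

Each resulting group can then be given its own concurrency limit, which matters when origins advertise different stream budgets.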

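The Node.js web-push library (v3+) sends each request through Node's built-in https module rather than an HTTP/2 client, so within a process it reuses connections only when given a keep-alive https.Agent through its agent option; true stream multiplexing requires handing the output of webpush.generateRequestDetails() to an HTTP/2 client such as node:http2. The Go webpush package (github.com/SherClockHolmes/webpush-go) accepts a shared http.Client, whose transport maintains a connection pool keyed by host and negotiates HTTP/2 where the server offers it. In both cases, the shared client or agent must be constructed once and reused across the entire batch, not instantiated per notification.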
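A minimal sketch of the "construct once, reuse everywhere" rule for Node.js follows. It assumes the web-push `agent` option, which the library documents as accepting an `https.Agent`; the `getSharedAgent` helper is a hypothetical name. Note that a keep-alive agent reuses HTTP/1.1 sockets and does not by itself provide HTTP/2 multiplexing.

```typescript
import { Agent } from 'node:https';

// Module-level singleton: created on first use, then shared by every dispatch.
let sharedAgent: Agent | undefined;

// Hypothetical helper. maxSockets bounds open sockets per host and only
// takes effect on the first call, when the agent is created.
function getSharedAgent(maxSockets: number): Agent {
  if (!sharedAgent) {
    sharedAgent = new Agent({ keepAlive: true, maxSockets });
  }
  return sharedAgent;
}

// Intended use (not executed here, since it needs the web-push package):
//   webpush.sendNotification(sub, payload, { agent: getSharedAgent(100) });
const firstAgent = getSharedAgent(100);
const secondAgent = getSharedAgent(50); // returns the same instance
console.log(firstAgent === secondAgent);
```

The anti-pattern to avoid is calling `new Agent(...)` inside the per-notification code path, which discards the idle sockets the previous request left behind.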

[Figure: HTTP/2 multiplexing vs. naive one-connection-per-push. Left panel (naive): a worker process sends 4 push messages to the FCM server over 4 separate TCP/TLS connections, costing 4 handshakes × ~3 ms = ~12 ms of overhead plus 4 TIME_WAIT socket slots. Right panel (HTTP/2): the same 4 messages travel in parallel as independent streams over a single TCP/TLS connection, costing 1 handshake × ~3 ms = ~3 ms and saving 3 handshakes.]

Key insight: HTTP/2 SETTINGS_MAX_CONCURRENT_STREAMS (FCM: ~100, Autopush: ~100, APNs: ~1000) caps how many POSTs can be in flight on one connection simultaneously. Exceeding that limit causes REFUSED_STREAM (error code 0x7) and forces a new connection. Keep concurrent dispatches per origin at or below the service's stream limit to maximize reuse.

Per-Endpoint aes128gcm Encryption Cost

Every Web Push payload must be encrypted with aes128gcm as defined in RFC 8291. This is not optional and is not shared between subscriptions — each subscriber holds a unique p256dh public key and auth secret, so every notification produces a distinct ciphertext.
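Because every message is encrypted on its own, every message also carries the same fixed framing overhead. The arithmetic below is a sketch based on the single-record aes128gcm layout in RFC 8188 and RFC 8291; the constant and function names are illustrative, and the 4,096-byte figure is the message size that RFC 8030 asks push services to support.

```typescript
// Fixed aes128gcm framing for a single-record Web Push message.
const SALT_BYTES = 16;
const RECORD_SIZE_FIELD_BYTES = 4;
const KEY_ID_LENGTH_FIELD_BYTES = 1;
const KEY_ID_BYTES = 65; // sender's uncompressed P-256 public key
const AUTH_TAG_BYTES = 16; // AES-GCM authentication tag
const PADDING_DELIMITER_BYTES = 1; // mandatory delimiter octet before any padding

const FRAMING_OVERHEAD_BYTES =
  SALT_BYTES +
  RECORD_SIZE_FIELD_BYTES +
  KEY_ID_LENGTH_FIELD_BYTES +
  KEY_ID_BYTES +
  AUTH_TAG_BYTES +
  PADDING_DELIMITER_BYTES;

// Size of the encrypted body for a given plaintext, with no extra padding.
function encryptedBodyBytes(plaintextBytes: number): number {
  return plaintextBytes + FRAMING_OVERHEAD_BYTES;
}

// Largest plaintext that still fits in a body of bodyLimitBytes.
function maxPlaintextBytes(bodyLimitBytes: number = 4096): number {
  return bodyLimitBytes - FRAMING_OVERHEAD_BYTES;
}

console.log(FRAMING_OVERHEAD_BYTES, maxPlaintextBytes());
```

With these constants the overhead comes to 103 bytes, which leaves 3,993 bytes of plaintext inside a 4,096-byte body.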

The encryption operation is a combination of ECDH key agreement (using the subscriber’s p256dh key) and AES-128-GCM symmetric encryption. On a modern x86-64 server with AES-NI hardware support, a single encrypt+sign operation takes roughly 0.3–0.8 ms of CPU time in Node.js (V8 using OpenSSL), and 0.1–0.3 ms in Go (using crypto/aes with hardware acceleration). On ARM64 (e.g., AWS Graviton2) the figures are similar due to hardware AES support.

On constrained environments — t3.micro (2 vCPU, 1 GB), containers with CPU limits below 0.5 cores, or VMs without AES-NI (older Xen-based instances) — encryption throughput drops by 3–10×. A batch of 200 concurrent encryptions can then consume 200–600 ms of real CPU time, stalling the event loop in Node.js and producing GC pressure.

This means increasing batch size above 100–200 on underpowered nodes does not improve throughput — it shifts the bottleneck from network I/O to CPU and actually increases p95 latency as queued encryptions wait for the event loop.

Rule of thumb: benchmark your specific hardware. If encrypt_ms_p95 exceeds network_rtt_p50, the encryption pipeline is the bottleneck, not the HTTP connection. See the tuning steps below for how to measure this.
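The rule of thumb can be expressed as a small comparison once timing samples have been collected. This sketch assumes samples are already in milliseconds and uses the nearest-rank percentile method; `percentile` and `findBottleneck` are hypothetical helper names, and the sample values are invented.

```typescript
// Nearest-rank percentile over a non-empty list of samples.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Applies the rule of thumb: encryption is the bottleneck when
// encrypt_ms_p95 exceeds network_rtt_p50.
function findBottleneck(encryptMs: number[], rttMs: number[]): 'encryption' | 'network' {
  return percentile(encryptMs, 95) > percentile(rttMs, 50) ? 'encryption' : 'network';
}

// Invented sample values, for illustration only.
const healthyNode = findBottleneck([0.4, 0.5, 0.6, 0.7, 0.8], [20, 25, 30]);
const throttledNode = findBottleneck([30, 40, 50], [20, 25, 30]);
console.log(healthyNode, throttledNode);
```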


Concurrency vs. Throughput Tradeoffs

In the context of the broader Backend Delivery Architecture & Queue Management framework, the dispatcher is typically a worker process consuming from a Redis or RabbitMQ queue. The batch size is the number of subscription dispatches it fires concurrently before waiting for completion.

Higher concurrency improves throughput up to the point where one of four resources saturates:

  1. HTTP/2 stream budget: the SETTINGS_MAX_CONCURRENT_STREAMS value advertised by the push service caps in-flight requests on one connection. FCM and Mozilla Autopush advertise ~100 streams; APNs Web Push advertises up to 1,000. Exceeding the limit yields RST_STREAM with REFUSED_STREAM and the request must be replayed, wasting a round trip.
  2. CPU for encryption: described above. Node.js runs JavaScript on a single-threaded event loop; Go spreads the work across goroutines but is still bounded by vCPU count.
  3. Memory for in-flight payloads: each in-flight dispatch holds the encrypted payload buffer, HTTP/2 stream metadata, and response buffers in memory. At 100 concurrent dispatches with 4 KB payloads, this is roughly 2–4 MB of direct heap plus V8 overhead.
  4. OS file descriptors and socket buffers: the Linux fs.nr_open ceiling defaults to 1,048,576, but the per-process soft limit (ulimit -n) is commonly 1,024 and container limits often reduce it further. Each HTTP/2 connection consumes one fd; at 10 origin domains with one connection each, fd pressure is negligible. The issue arises when connection reuse fails and short-lived connections pile up, leaving sockets in TIME_WAIT.

Beyond the bottleneck resource, adding concurrency increases queue depth, error surface, and retry amplification. For retry logic and backoff strategies, a smaller in-flight window means failed retries are easier to isolate and schedule without re-flooding the dispatcher.


Memory Pressure from Holding Many Open Streams

Each open HTTP/2 stream holds:

  • The encrypted payload (up to ~4 KB after aes128gcm overhead on a 3.5 KB plaintext)
  • HTTP/2 frame headers and flow-control windows (~64 KB send window per stream by default, though actual usage is small until the server acknowledges)
  • V8 Promise chains and associated closures (roughly 1–4 KB per pending Promise in Node.js)
  • Response buffer awaiting the 201 Created status

At 200 concurrent streams, a Node.js worker allocates approximately 8–16 MB of live heap for in-flight dispatch state. This is manageable. At 500 concurrent streams that rises to 20–40 MB of live heap, and each GC cycle must trace all of it. On a 512 MB container, the GC pressure starts to produce pauses in the 50–150 ms range visible in p95 latency.

The relationship between batch size and memory is roughly linear up to the connection stream limit, then changes slope as additional connections are opened (each bringing its own TLS session state and flow-control buffers).
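A back-of-the-envelope estimator for the linear region might look like the sketch below. The per-stream range is not a measured constant: it is derived from the figures quoted above (8–16 MB at 200 streams works out to roughly 40–80 KB per stream), and it does not model the slope change once extra connections are opened.

```typescript
// Per-stream live-heap range implied by the figures in this section
// (decimal units: 1 MB = 1,000 KB).
const PER_STREAM_KB = { low: 40, high: 80 };

// Linear estimate of live heap for in-flight dispatch state.
// Only meaningful below the connection stream limit.
function estimateLiveHeapMb(concurrentStreams: number): { low: number; high: number } {
  return {
    low: (concurrentStreams * PER_STREAM_KB.low) / 1000,
    high: (concurrentStreams * PER_STREAM_KB.high) / 1000,
  };
}

console.log(estimateLiveHeapMb(200), estimateLiveHeapMb(500));
```

Treat the output as a planning figure to compare against the container memory limit, then confirm with process.memoryUsage() on real traffic.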

Large batches also interact with TTL expiration handling: if a batch takes 800 ms to drain and the payload TTL is 60 s, notifications dispatched at the end of the batch are effectively 800 ms staler than those at the front. For most campaigns this is irrelevant; for OTP or price-alert use cases it can matter.
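For latency-sensitive messages, one way to account for time spent waiting in a batch is to shrink the TTL by the message's age at dispatch. The sketch below assumes each queued job records a creation timestamp in milliseconds; `remainingTtlSeconds` and its parameters are hypothetical names, not part of any library.

```typescript
// Seconds of the intended TTL still left when the message is finally dispatched.
// Returns 0 when the message has already outlived its TTL; a caller would
// typically skip sending in that case.
function remainingTtlSeconds(ttlSeconds: number, createdAtMs: number, nowMs: number): number {
  const elapsedSeconds = Math.max(0, (nowMs - createdAtMs) / 1000);
  return Math.max(0, Math.floor(ttlSeconds - elapsedSeconds));
}

// A message that waited 800 ms in a batch with a 60 s TTL.
const afterShortWait = remainingTtlSeconds(60, 0, 800);
// A message that waited 61 s with the same TTL.
const afterLongWait = remainingTtlSeconds(60, 0, 61_000);
console.log(afterShortWait, afterLongWait);
```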


Diminishing Returns Beyond a Certain Batch Size

The throughput gain from increasing concurrency follows a saturating curve rather than a straight line. Below the network I/O saturation point, each additional concurrent request adds nearly proportional throughput because it fills time the worker would otherwise spend idle waiting on the network. Above it, the gains flatten as the bottleneck shifts to encryption CPU or push service stream limits.

Empirical benchmarks across multiple production deployments (2-vCPU nodes, FCM endpoints, Node.js 20, web-push library v3.6) show the following profile:

| Batch size | Concurrent connections | p95 latency (ms) | Throughput (msgs/sec) | Notes |
|---|---|---|---|---|
| 10 | 1 (HTTP/2) | 38 | 260 | Connection underutilized; stream budget mostly idle |
| 50 | 1 (HTTP/2) | 42 | 1,180 | Good throughput, low memory |
| 100 | 1–2 (HTTP/2) | 48 | 2,050 | Near-optimal for 2-vCPU; encryption CPU ~35% |
| 200 | 2–3 (HTTP/2) | 71 | 2,800 | Marginal gain; GC pauses start appearing |
| 300 | 3–4 (HTTP/2) | 124 | 2,950 | Diminishing returns; p95 latency degrading |
| 500 | 5+ (HTTP/2) | 290 | 2,980 | No meaningful throughput gain; latency ~6× the batch-100 baseline |

The jump from 10 to 100 concurrent dispatches yields an ~8× throughput increase. The jump from 100 to 500 yields only a ~45% throughput increase while p95 latency increases by 6×. The optimal point is in the 75–200 range for standard 2-vCPU nodes with FCM endpoints. For scaling push queues with Redis or RabbitMQ, the practical answer is to run more worker processes at lower per-process concurrency rather than one process at high concurrency.
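The ratios quoted above can be reproduced from the benchmark rows, and the same arithmetic gives a simple way to locate the knee of the curve. In this sketch the row values are copied from the table; the 15% cut-off is an assumption borrowed from the tuning steps, and `findKnee` is a hypothetical helper.

```typescript
interface BenchRow {
  batch: number;
  p95Ms: number;
  msgsPerSec: number;
}

// Values copied from the benchmark table in this section.
const benchRows: BenchRow[] = [
  { batch: 10, p95Ms: 38, msgsPerSec: 260 },
  { batch: 50, p95Ms: 42, msgsPerSec: 1180 },
  { batch: 100, p95Ms: 48, msgsPerSec: 2050 },
  { batch: 200, p95Ms: 71, msgsPerSec: 2800 },
  { batch: 300, p95Ms: 124, msgsPerSec: 2950 },
  { batch: 500, p95Ms: 290, msgsPerSec: 2980 },
];

// Fractional throughput gain when moving from one row to another.
function relativeGain(from: BenchRow, to: BenchRow): number {
  return (to.msgsPerSec - from.msgsPerSec) / from.msgsPerSec;
}

// Returns the last batch size reached before a step's gain drops below minGain.
function findKnee(rows: BenchRow[], minGain: number): number {
  for (let i = 1; i < rows.length; i++) {
    if (relativeGain(rows[i - 1], rows[i]) < minGain) {
      return rows[i - 1].batch;
    }
  }
  return rows[rows.length - 1].batch;
}

const kneeBatch = findKnee(benchRows, 0.15);
const gain100To500Pct = Math.round(relativeGain(benchRows[2], benchRows[5]) * 100);
const latencyRatio100To500 = Math.round(benchRows[5].p95Ms / benchRows[2].p95Ms);
console.log(kneeBatch, gain100To500Pct, latencyRatio100To500);
```

With a 15% cut-off the knee lands at 200, the upper edge of the recommended range; a stricter latency budget would argue for stopping earlier.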


TypeScript Implementation: Batching with Concurrency Limiting

import webpush from 'web-push';

interface PushSubscription {
  endpoint: string;
  keys: { p256dh: string; auth: string };
}

interface BatchDispatchOptions {
  /**
   * Maximum number of concurrent in-flight HTTP requests per call.
   * Recommended: 75–150 on 2-vCPU nodes with FCM endpoints.
   */
  concurrency: number;
  /**
   * Payload string (already serialized JSON). Must be ≤ 3,993 bytes
   * before encryption to stay within aes128gcm's 4,096-byte ciphertext limit.
   */
  payload: string;
  ttlSeconds: number;
  vapidDetails: { subject: string; publicKey: string; privateKey: string };
}

interface DispatchResult {
  endpoint: string;
  status: 'sent' | 'gone' | 'rate_limited' | 'error';
  statusCode?: number;
  error?: string;
}

class Semaphore {
  private permits: number;
  private readonly waiters: Array<() => void> = [];

  constructor(permits: number) {
    this.permits = permits;
  }

  async acquire(): Promise<void> {
    if (this.permits > 0) {
      this.permits--;
      return;
    }
    return new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  release(): void {
    const next = this.waiters.shift();
    if (next) {
      next();
    } else {
      this.permits++;
    }
  }
}

/**
 * Chunks an array into sub-arrays of at most `size` elements.
 * Used to control queue draining windows — not to batch HTTP requests.
 */
function chunk<T>(arr: T[], size: number): T[][] {
  const result: T[][] = [];
  for (let i = 0; i < arr.length; i += size) {
    result.push(arr.slice(i, i + size));
  }
  return result;
}

/**
 * Dispatches web push notifications with a bounded concurrency semaphore.
 *
 * Design notes:
 * - One HTTP POST per subscription (RFC 8030 requirement).
 * - The semaphore limits simultaneous in-flight requests, not payload grouping.
 * - Promise.allSettled collects all results; caller is responsible for routing
 *   'gone' endpoints to subscription cleanup and 'rate_limited' to a retry queue.
 */
export async function dispatchWithConcurrencyLimit(
  subscriptions: PushSubscription[],
  options: BatchDispatchOptions,
): Promise<DispatchResult[]> {
  const { concurrency, payload, ttlSeconds, vapidDetails } = options;
  webpush.setVapidDetails(
    vapidDetails.subject,
    vapidDetails.publicKey,
    vapidDetails.privateKey,
  );

  const sem = new Semaphore(concurrency);

  const tasks = subscriptions.map(async (sub): Promise<DispatchResult> => {
    await sem.acquire();
    try {
      // aes128gcm encryption happens inside sendNotification per RFC 8291.
      // Each call is independent — no shared ciphertext across subscribers.
      const response = await webpush.sendNotification(sub, payload, {
        TTL: ttlSeconds,
        contentEncoding: 'aes128gcm',
      });
      return { endpoint: sub.endpoint, status: 'sent', statusCode: response.statusCode };
    } catch (err: unknown) {
      const wpErr = err as { statusCode?: number; body?: string };
      if (wpErr.statusCode === 410 || wpErr.statusCode === 404) {
        return { endpoint: sub.endpoint, status: 'gone', statusCode: wpErr.statusCode };
      }
      if (wpErr.statusCode === 429) {
        return { endpoint: sub.endpoint, status: 'rate_limited', statusCode: 429 };
      }
      return {
        endpoint: sub.endpoint,
        status: 'error',
        statusCode: wpErr.statusCode,
        error: String(err),
      };
    } finally {
      sem.release();
    }
  });

  const settled = await Promise.allSettled(tasks);

  // allSettled preserves input order, so index i maps back to subscriptions[i].
  return settled.map((r, i) =>
    r.status === 'fulfilled'
      ? r.value
      : { endpoint: subscriptions[i].endpoint, status: 'error', error: String(r.reason) },
  );
}

// Example: process a campaign of 10,000 subscriptions in windows of 100
async function runCampaign(
  allSubscriptions: PushSubscription[],
  vapidDetails: BatchDispatchOptions['vapidDetails'],
): Promise<void> {
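  const WINDOW_SIZE = 100; // drain 100 from queue at a time
  // Up to 100 requests in flight within each window. With WINDOW_SIZE equal to
  // CONCURRENCY the semaphore never blocks; it throttles only once WINDOW_SIZE
  // exceeds CONCURRENCY.
  const CONCURRENCY = 100;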

  const windows = chunk(allSubscriptions, WINDOW_SIZE);
  for (const window of windows) {
    const results = await dispatchWithConcurrencyLimit(window, {
      concurrency: CONCURRENCY,
      payload: JSON.stringify({ title: 'Campaign update', body: 'See what\'s new.' }),
      ttlSeconds: 3600,
      vapidDetails,
    });

    const gone = results.filter((r) => r.status === 'gone').map((r) => r.endpoint);
    if (gone.length > 0) {
      console.log(`[CLEANUP] ${gone.length} expired endpoints to remove`);
      // purge gone endpoints from subscription database here
    }

    const rateLimited = results.filter((r) => r.status === 'rate_limited');
    if (rateLimited.length > 0) {
      console.warn(`[BACKOFF] ${rateLimited.length} rate-limited; route to retry queue`);
      // push to delayed retry queue; see retry logic guide
    }
  }
}

Numbered Tuning Steps

Follow this sequence when calibrating batch size for a new deployment. Run each step before advancing to the next.

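  1. Establish a CPU baseline. Time ten isolated webpush.generateRequestDetails() calls, which encrypt and sign a request without sending it, sequentially (no concurrency) on the target hardware, and calculate encrypt_ms_avg. Timing webpush.sendNotification() instead would fold network round-trip time into the figure. If the average exceeds 2 ms, the node is CPU-constrained, most likely because it lacks AES-NI or runs under a tight CPU quota; cap concurrency at 25 until you can switch instance types.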

  2. Identify your dominant push origin. Run SELECT push_service_host, COUNT(*) FROM subscriptions GROUP BY 1 ORDER BY 2 DESC. If a single host (e.g., fcm.googleapis.com) represents >80% of endpoints, you can rely on HTTP/2 stream reuse for most of the load. Mixed origins require per-host concurrency limits.

  3. Start at concurrency = 50. Dispatch a 1,000-subscription test batch and record: total wall time, p95 per-notification latency, Node.js heap used after (via process.memoryUsage().heapUsed), and CPU utilization.

  4. Double concurrency to 100, then 200. Repeat the same 1,000-subscription batch. Calculate throughput (msgs/sec = 1000 / wall_time_sec). If throughput increases by >15%, the previous level was underutilizing I/O. If throughput increases by <5%, you have hit diminishing returns; do not go higher.

  5. Watch for REFUSED_STREAM errors. In Node.js with node:http2, these appear as ERR_HTTP2_STREAM_ERROR with NGHTTP2_REFUSED_STREAM. In Go, as golang.org/x/net/http2: stream error: stream ID X; REFUSED_STREAM. If these appear, reduce concurrency by 20% — you are exceeding the push service’s SETTINGS_MAX_CONCURRENT_STREAMS.

  6. Measure memory at target concurrency. Emit process.memoryUsage().heapUsed before and after a 10,000-subscription dispatch. If live heap grows by more than 50 MB and does not GC back within 5 seconds, reduce concurrency or move to a larger instance.

  7. Validate under TTL pressure. Check that dispatch_duration_p95 < ttlSeconds × 0.1. If a 3,600 s TTL batch takes 400 ms to drain at 100 concurrency, you are well inside the margin. If TTLs are tight (60 s for OTPs), verify the entire batch drains in under 6 s.

  8. Lock in the value and add it to your deployment configuration as an environment variable (PUSH_CONCURRENCY=100). Document the hardware it was tuned on. Recalibrate whenever instance type or Node.js major version changes.
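The numeric rules in steps 4, 5, and 7 can be captured as small helper functions, sketched below. The function and type names are hypothetical. The steps leave gains between 5% and 15% unspecified, so the 'marginal' verdict is an assumption added for completeness.

```typescript
type Verdict = 'keep-increasing' | 'marginal' | 'stop';

interface RunStats {
  concurrency: number;
  messages: number;
  wallTimeSec: number;
}

// Step 4: msgs/sec = messages / wall_time_sec.
function throughput(run: RunStats): number {
  return run.messages / run.wallTimeSec;
}

// Step 4: compare two consecutive concurrency levels.
function compareRuns(prev: RunStats, next: RunStats): Verdict {
  const gain = (throughput(next) - throughput(prev)) / throughput(prev);
  if (gain > 0.15) return 'keep-increasing';
  if (gain < 0.05) return 'stop';
  return 'marginal'; // assumption: the steps do not cover the 5-15% band
}

// Step 5: reduce concurrency by 20% after REFUSED_STREAM errors.
function backOffAfterRefusedStream(concurrency: number): number {
  return Math.max(1, Math.floor(concurrency * 0.8));
}

// Step 7: dispatch_duration_p95 must stay under 10% of the TTL.
function drainsWithinTtlMargin(dispatchP95Sec: number, ttlSeconds: number): boolean {
  return dispatchP95Sec < ttlSeconds * 0.1;
}

// Invented measurements, for illustration only.
const verdict50To100 = compareRuns(
  { concurrency: 50, messages: 1000, wallTimeSec: 0.85 },
  { concurrency: 100, messages: 1000, wallTimeSec: 0.49 },
);
console.log(verdict50To100, backOffAfterRefusedStream(100), drainsWithinTtlMargin(0.4, 3600));
```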


Gotchas and Edge Cases

  • FCM HTTP/2 stream limit is advisory, not enforced identically across regions. FCM’s SETTINGS_MAX_CONCURRENT_STREAMS has been observed at 100, 200, and occasionally higher in different GCP regions and during FCM maintenance events. Do not hardcode 100 as a guaranteed safe ceiling; instead, handle REFUSED_STREAM gracefully and retry those requests.

  • Mozilla Autopush rejects connections from IPs with excessive concurrent open streams. Unlike FCM, Autopush has server-side rate limiting that can result in 429 Too Many Requests or a TCP RST if a single IP opens too many concurrent HTTP/2 streams. Keep FCM and Autopush concurrency pools separate; a safe limit for Autopush is 30–50 concurrent streams per worker process.

  • aes128gcm encryption cost on t3.micro or containers with <0.5 CPU limit is 3–10× higher than on dedicated vCPUs. If your dispatcher runs inside a Kubernetes pod with a resources.limits.cpu: "250m" constraint, 100 concurrent encryptions can take 800 ms+ and block the event loop. Either raise CPU limits or reduce concurrency to 20–30.

  • TTL expiry during long batches silently drops notifications at the FCM layer. If you set TTL: 60 and your batch of 5,000 takes 40 s to drain at low concurrency, the last notifications reach FCM with less than 20 s of their intended 60 s window left. RFC 8030 counts TTL from the moment the push service accepts the message, so subtract the time already spent queued before setting the header. FCM delivers such a message if the device is online; if not, it discards the message once the TTL lapses without notifying the sender. Use per-subscription TTL tracking as described in the TTL & Expiration Handling guide.

  • APNs Web Push allows up to 1,000 concurrent streams per connection but enforces a stricter per-token rate limit. Sending 500 concurrent POSTs targeting the same APNs device token (a misconfiguration from duplicate subscriptions) results in 429 responses from APNs even though the stream limit is not reached. Deduplicate subscriptions by endpoint URL before dispatch.
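Two of the mitigations above, deduplicating by endpoint and keeping a separate concurrency pool per origin, can be sketched as follows. The cap values are starting points taken from the figures in this guide (100 for FCM, the midpoint of 30–50 for Autopush); the APNs cap and the default for unknown hosts are conservative assumptions, and all names are hypothetical.

```typescript
interface Sub {
  endpoint: string;
  keys: { p256dh: string; auth: string };
}

// Keep the first subscription seen for each endpoint URL.
function dedupeByEndpoint(subs: Sub[]): Sub[] {
  const seen = new Set<string>();
  const unique: Sub[] = [];
  for (const sub of subs) {
    if (!seen.has(sub.endpoint)) {
      seen.add(sub.endpoint);
      unique.push(sub);
    }
  }
  return unique;
}

// Per-host concurrency caps. Tune against observed REFUSED_STREAM and 429 rates.
const ORIGIN_CAPS: Record<string, number> = {
  'fcm.googleapis.com': 100,
  'updates.push.services.mozilla.com': 40, // midpoint of the 30-50 range
  'web.push.apple.com': 100, // assumption: well below the advertised stream limit
};
const DEFAULT_CAP = 25; // assumption for hosts not listed above

function capFor(endpoint: string): number {
  const host = new URL(endpoint).host;
  return ORIGIN_CAPS[host] ?? DEFAULT_CAP;
}

// Made-up endpoints, for illustration only.
const demoKeys = { p256dh: 'demo-key', auth: 'demo-auth' };
const uniqueSubs = dedupeByEndpoint([
  { endpoint: 'https://web.push.apple.com/aaa', keys: demoKeys },
  { endpoint: 'https://web.push.apple.com/aaa', keys: demoKeys },
  { endpoint: 'https://updates.push.services.mozilla.com/wpush/v2/bbb', keys: demoKeys },
]);
console.log(uniqueSubs.length, uniqueSubs.map((s) => capFor(s.endpoint)));
```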


FAQ

Does increasing batch size always increase throughput for web push?

No. Throughput increases sub-linearly with concurrency and plateaus once the bottleneck shifts from network I/O to either the push service’s HTTP/2 stream limit or the dispatcher’s encryption CPU. On a 2-vCPU node with FCM endpoints, throughput effectively stops growing beyond 150–200 concurrent dispatches. Adding more concurrency beyond that point only increases p95 latency and memory usage without meaningful throughput gain.

Can I share one HTTP/2 connection across multiple worker processes?

No, not directly. HTTP/2 connections are TCP sockets owned by a single process (or thread). If you run 4 worker processes on the same host, each opens its own connection to FCM — but that is fine because FCM is stateless and each connection gets its own stream budget. The correct scaling model is: one HTTP/2 connection per origin per worker process, with multiple worker processes in parallel. This is exactly the pattern supported by Redis or RabbitMQ queue scaling.
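As a rough sizing aid for this scaling model, the number of worker processes follows from the target rate and the measured per-worker throughput. The sketch below uses the 2,050 msgs/sec figure from the benchmark table at 100 concurrency; the target rate and origin count are invented inputs, and the helper names are hypothetical.

```typescript
// Worker processes needed to sustain a target rate at a given per-worker throughput.
function workersNeeded(targetMsgsPerSec: number, perWorkerMsgsPerSec: number): number {
  if (perWorkerMsgsPerSec <= 0) throw new Error('per-worker throughput must be positive');
  return Math.ceil(targetMsgsPerSec / perWorkerMsgsPerSec);
}

// One connection per origin per worker process.
function connectionsOpened(workers: number, origins: number): number {
  return workers * origins;
}

// Invented target: 10,000 msgs/sec spread across 3 push origins.
const workerCount = workersNeeded(10_000, 2050);
const totalConnections = connectionsOpened(workerCount, 3);
console.log(workerCount, totalConnections);
```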

Should I encrypt payloads before enqueuing or at dispatch time?

For most deployments, encrypt at dispatch time. The subscriber’s p256dh public key is static for the subscription lifetime, but each message still needs a fresh ECDH key agreement, so pre-encrypting saves no key agreement computation; it only changes when that work happens. Pre-encrypting before enqueue does move CPU cost off the dispatch hot path, which can be valuable if your dispatcher is CPU-constrained and your queue workers are separate from your encryption workers. The downside is storing encrypted blobs in the queue, which increases queue storage size by about 103 bytes per payload (the 86-byte aes128gcm header, the 16-byte authentication tag, and the one-byte padding delimiter). For constrained queues with short TTL windows, pre-encryption also means the ciphertext may outlive its usefulness if the TTL passes before dispatch.

