A/B Testing Web Push Notifications

Most teams “test” push notifications by sending two versions and eyeballing which got more clicks. That is not a test — it is a coin flip dressed up as data. A real A/B test assigns subscribers to variants randomly, holds everything else constant, collects enough samples to overcome noise, and only then declares a winner with a stated confidence level. This guide walks through the full procedure with runnable Node.js and service-worker code.

This is a deep-dive within the push engagement and campaign optimization guide. It depends on one measurement primitive — accurately counting clicks against delivered notifications — which is covered in detail in measuring push notification click-through rate.

Prerequisites

A working push pipeline that can send to subscribers (VAPID auth configured via process.env.VAPID_PUBLIC_KEY and process.env.VAPID_PRIVATE_KEY).
An append-only event store recording at least sent, accepted, display, and click per subscriber, campaign, and variant.
A subscriber table with stable subscriber ids and current delivery_status.
A service worker that handles notificationclick and can emit a tracking beacon.
At least a few thousand active, recently-engaged subscribers — small lists cannot detect realistic effect sizes.

How a Push A/B Test Works

An experiment has three moving parts. First, a hypothesis: a falsifiable statement like “an urgency-framed title increases CTR over the neutral control.” Second, assignment: each eligible subscriber is placed into exactly one variant by a rule that is deterministic (the same inputs always give the same arm) yet pseudo-random (unrelated to any subscriber attribute). Third, attribution: every click is tied back to the variant the subscriber received, so the comparison is apples-to-apples.

The critical discipline is changing one thing at a time. If variant B has both a new title and a new send time, a win tells you nothing about which change mattered. Keep the payload, icon, timing, and segment identical across variants except for the single element under test. Segment selection itself is upstream of the test and is covered in push personalization and segmentation.

Assignment must be sticky: a subscriber who lands in “control” today should land in “control” on a re-send, otherwise contamination ruins the result. Deterministic hashing gives you stickiness without a database write per subscriber.

Deterministic hashing makes assignment sticky and random; attribution closes the loop back to lift.

Step 1 — Write a Falsifiable Hypothesis

State the change, the metric, the direction, and the minimum effect you care about. “Adding a first-name token to the title raises CTR by at least 1.5 absolute points” is testable; “make the copy better” is not. The minimum detectable effect (MDE) directly drives your sample size, so commit to it before you start.

const experiment = {
  id: 'title-personalization-v1',
  hypothesis: 'First-name title raises CTR by >= 1.5 absolute points',
  control: { title: 'Your weekly summary is ready' },
  variant: { title: '{firstName}, your weekly summary is ready' },
  splitRatio: 0.5,        // 50/50
  mdeAbsolute: 0.015,     // 1.5 percentage points
  baselineCtr: 0.06       // measured, not guessed
};
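
A definition like this can be sanity-checked before launch. The sketch below is only illustrative: `validateExperiment` is a hypothetical helper, not part of any library, and it assumes the field names used in the object above. It also reports the relative lift implied by the absolute MDE, which is a useful gut check on whether the hypothesis is realistic.

```javascript
// Hypothetical helper: reject experiment definitions that cannot be analysed.
function validateExperiment(exp) {
  const errors = [];
  if (!(exp.splitRatio > 0 && exp.splitRatio < 1)) {
    errors.push('splitRatio must be strictly between 0 and 1');
  }
  if (!(exp.baselineCtr > 0 && exp.baselineCtr < 1)) {
    errors.push('baselineCtr must be a measured rate between 0 and 1');
  }
  if (!(exp.mdeAbsolute > 0)) {
    errors.push('mdeAbsolute must be positive');
  } else if (exp.baselineCtr + exp.mdeAbsolute >= 1) {
    errors.push('baselineCtr + mdeAbsolute must stay below 1');
  }
  if (exp.control.title === exp.variant.title) {
    errors.push('control and variant are identical');
  }
  return {
    ok: errors.length === 0,
    errors,
    // Relative lift implied by the absolute MDE (mdeAbsolute / baselineCtr).
    relativeMde: exp.mdeAbsolute / exp.baselineCtr
  };
}
```

For the example experiment, a 1.5 point MDE on a 6% baseline corresponds to a relative lift of about 25%, which is a large effect to expect from a title change alone.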

Step 2 — Compute the Required Sample Size

A two-proportion test needs enough subscribers per arm to detect the MDE at your chosen significance (alpha, typically 0.05) and power (typically 0.80). Undersized tests produce false negatives: you change something, it actually works, and the noise hides it. Compute the per-arm size before sending.

// Per-arm sample size for a two-proportion z-test.
// The z constants below are fixed, so only alpha = 0.05 (two-sided) and power = 0.80 are supported.
function sampleSizePerArm(p1, mde, alpha = 0.05, power = 0.8) {
  if (alpha !== 0.05 || power !== 0.8) {
    throw new RangeError('only alpha = 0.05 and power = 0.80 are supported');
  }
  const zA = 1.959963985;  // two-sided z for alpha=0.05
  const zB = 0.841621234;  // z for power=0.80
  const p2 = p1 + mde;     // CTR the variant must reach
  const pBar = (p1 + p2) / 2;
  const num = Math.pow(
    zA * Math.sqrt(2 * pBar * (1 - pBar)) + zB * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)),
    2
  );
  return Math.ceil(num / Math.pow(mde, 2));
}

// e.g. baseline 6% CTR, detect a 1.5pt lift:
console.log(sampleSizePerArm(0.06, 0.015)); // 4391, i.e. roughly 4,400 delivered per arm

If your eligible cohort is smaller than two arms’ worth of delivered (not merely sent) notifications, either widen the segment, raise the MDE, or run the test longer across multiple sends.
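
A quick feasibility check makes that decision concrete. The helper below is a hypothetical planning sketch: `planArms` and its `acceptRate` parameter (the share of sends the push service accepts) are assumptions for illustration, and the acceptance rate should come from your own past campaigns.

```javascript
// Hypothetical planning helper. `acceptRate` is the share of sends the push
// service accepts; measure it from past campaigns rather than guessing.
function planArms(eligible, acceptRate, requiredPerArm, splitRatio = 0.5) {
  // The smaller arm is the bottleneck when the split is not 50/50.
  const smallerShare = Math.min(splitRatio, 1 - splitRatio);
  const acceptedSmallerArm = Math.floor(eligible * acceptRate * smallerShare);
  return {
    acceptedSmallerArm,
    feasibleInOneSend: acceptedSmallerArm >= requiredPerArm,
    // Eligible subscribers needed for the smaller arm to reach the target.
    eligibleNeeded: Math.ceil(requiredPerArm / (acceptRate * smallerShare))
  };
}

// 8,000 eligible subscribers, 92% acceptance, per-arm target from sampleSizePerArm(0.06, 0.015)
console.log(planArms(8000, 0.92, 4391));
```

If `feasibleInOneSend` is false, `eligibleNeeded` shows how far the segment would have to widen before a single send could carry the test.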

Step 3 — Assign Variants Deterministically

Hash the subscriber id together with the experiment id, take a bucket, and map it to a variant by your split ratio. Seeding with the experiment id keeps concurrent experiments independent, and hashing keeps the same subscriber in the same arm on every send.

const crypto = require('crypto');

function assignVariant(subscriberId, experimentId, splitRatio = 0.5) {
  const h = crypto.createHash('sha256')
    .update(`${experimentId}:${subscriberId}`)
    .digest();
  const r = h.readUInt32BE(0) / 0x100000000; // uniform in [0, 1): divide by 2^32 so r never reaches 1
  return r < splitRatio ? 'control' : 'variant';
}
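
Before wiring assignment into a send, it is worth confirming the two properties this step relies on: stickiness and balance. The following is a hypothetical smoke test, not part of the original pipeline; `checkAssignment` accepts any function with the same signature as `assignVariant`, and the synthetic `sub-N` ids are placeholders.

```javascript
// Hypothetical smoke test for any assignment function with the signature
// assign(subscriberId, experimentId, splitRatio) -> 'control' | 'variant'.
function checkAssignment(assign, experimentId, n = 20000, splitRatio = 0.5) {
  let control = 0;
  let unstable = 0;
  for (let i = 0; i < n; i++) {
    const id = `sub-${i}`; // synthetic ids, for illustration only
    const first = assign(id, experimentId, splitRatio);
    // Stickiness: a second call with the same inputs must agree.
    if (assign(id, experimentId, splitRatio) !== first) unstable++;
    if (first === 'control') control++;
  }
  return { controlShare: control / n, unstable };
}
```

Calling `checkAssignment(assignVariant, 'title-personalization-v1')` should report `unstable: 0` and a `controlShare` close to the configured split; anything else points to a bug in the hashing or bucketing.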

Step 4 — Send and Record the Assignment

When you send, write a sent event with the variant, then record accepted from the push-service response. The variant must travel inside the encrypted payload so the service worker can attribute the eventual click. Keep that payload under the 4 KB ciphertext limit (it is encrypted with the aes128gcm content encoding per RFC 8291 before transmission), so carry only short fields such as the title, campaign id, variant id, and target url, not long-form copy or embedded assets.

const webpush = require('web-push');
webpush.setVapidDetails(
  'mailto:ops@example.com',
  process.env.VAPID_PUBLIC_KEY,   // never hardcode these
  process.env.VAPID_PRIVATE_KEY
);

async function sendExperiment(db, sub, exp, ctx) {
  const arm = assignVariant(sub.id, exp.id, exp.splitRatio);
  const title = (arm === 'control' ? exp.control.title : exp.variant.title)
    .replace('{firstName}', ctx.firstName ?? 'there');

  await db.query(
    `INSERT INTO push_events (subscriber_id, campaign_id, variant, event_type)
     VALUES ($1, $2, $3, 'sent')`,
    [sub.id, exp.id, arm]
  );

  try {
    const res = await webpush.sendNotification(
      sub,
      JSON.stringify({ title, url: ctx.url, campaignId: exp.id, variant: arm }),
      { TTL: 86400, urgency: 'normal' }
    );
    await db.query(
      `INSERT INTO push_events (subscriber_id, campaign_id, variant, event_type, status_code)
       VALUES ($1, $2, $3, 'accepted', $4)`,
      [sub.id, exp.id, arm, res.statusCode]
    );
  } catch (err) {
    // 410 Gone -> retire endpoint; 429 -> back off and requeue
    await db.query(
      `INSERT INTO push_events (subscriber_id, campaign_id, variant, event_type, status_code)
       VALUES ($1, $2, $3, 'failed', $4)`,
      [sub.id, exp.id, arm, err.statusCode ?? 0]
    );
  }
}
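
The payload limit mentioned above can be enforced before the send rather than discovered through a rejection. The guard below is a sketch under stated assumptions: `serializePayload` is a hypothetical helper, and the 3993-byte budget assumes the push service enforces exactly the 4096-byte minimum from RFC 8291 and that the encryption library adds no padding beyond the mandatory delimiter.

```javascript
// 4096-byte ciphertext minimum from RFC 8291, minus aes128gcm overhead
// (86-byte header + 16-byte auth tag + 1 padding delimiter) = 3993 plaintext bytes.
const MAX_PLAINTEXT_BYTES = 3993;

// Hypothetical guard: serialise once, measure UTF-8 bytes, fail before sending.
function serializePayload(payload, limit = MAX_PLAINTEXT_BYTES) {
  const json = JSON.stringify(payload);
  const bytes = Buffer.byteLength(json, 'utf8');
  if (bytes > limit) {
    throw new RangeError(`push payload is ${bytes} bytes; limit is ${limit}`);
  }
  return json;
}
```

Measuring bytes rather than string length matters here, because personalized titles with non-ASCII names take more than one byte per character.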

When a send returns 429 Too Many Requests, do not drop the subscriber from the experiment — requeue it using your retry logic and backoff so the arm sizes stay balanced.
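
One way to requeue is a small retry wrapper around the send call. This is a minimal sketch, not the retry implementation referenced elsewhere in the guide: `sendWithBackoff`, its options, and the choice to also retry 5xx responses are assumptions. It expects errors that carry `statusCode` and optional lower-cased `headers`, which is the shape web-push errors are assumed to have here.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical wrapper: `send` is any async function that rejects with an
// error carrying `statusCode` and optional `headers`.
async function sendWithBackoff(send, { maxAttempts = 4, baseMs = 1000, wait = sleep } = {}) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await send();
    } catch (err) {
      const retryable = err.statusCode === 429 || err.statusCode >= 500;
      // 410, 413 and other client errors are permanent: surface them at once.
      if (!retryable || attempt >= maxAttempts) throw err;
      // Prefer the server's Retry-After when it is a plain number of seconds.
      const retryAfter = Number(err.headers?.['retry-after']);
      const delayMs = Number.isFinite(retryAfter) && retryAfter > 0
        ? retryAfter * 1000
        : baseMs * 2 ** (attempt - 1); // exponential fallback: 1s, 2s, 4s, ...
      await wait(delayMs);
    }
  }
}
```

Inside `sendExperiment`, the call to `webpush.sendNotification` could be passed in as `() => webpush.sendNotification(sub, payload, options)`, so a rate-limited subscriber still ends up with either an accepted or a failed event in their assigned arm.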

Step 5 — Track Clicks in the Service Worker

The browser fires notificationclick when the user taps the notification. Read the variant out of the notification data, fire a tracking beacon, and focus or open the target. This client event is what populates the numerator of CTR.

// sw.js — attribute the click back to its variant
self.addEventListener('notificationclick', (event) => {
  const d = event.notification.data || {};
  event.notification.close();
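  event.waitUntil((async () => {
    // Start the beacon without awaiting it, so opening the window is not delayed by the network.
    const beacon = fetch('/t/click', {
      method: 'POST',
      keepalive: true,
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ campaignId: d.campaignId, variant: d.variant, ts: Date.now() })
    }).catch(() => {});
    // Resolve relative urls so the comparison is exact; a substring match on '/' would hit any open tab.
    const target = new URL(d.url || '/', self.location.origin).href;
    const all = await clients.matchAll({ type: 'window', includeUncontrolled: true });
    const hit = all.find((c) => c.url === target);
    await (hit ? hit.focus() : clients.openWindow(target));
    await beacon; // keep the worker alive until the beacon settles
  })());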
});
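
The click handler can only read `event.notification.data` if the push handler stored the variant there when it displayed the notification. A minimal sketch of that handler follows, assuming the JSON payload shape sent in Step 4 (`title`, `url`, `campaignId`, `variant`); `toNotificationArgs` and the fallback title are hypothetical.

```javascript
// Pure helper so the mapping can be unit-tested outside a service worker.
function toNotificationArgs(payload) {
  const { title, url, campaignId, variant } = payload;
  return [
    title || 'Notification', // fallback title is an assumption
    { data: { url, campaignId, variant } } // read back via event.notification.data
  ];
}

// Register only when actually running inside a service worker.
if (typeof ServiceWorkerGlobalScope !== 'undefined' && self instanceof ServiceWorkerGlobalScope) {
  self.addEventListener('push', (event) => {
    const payload = event.data ? event.data.json() : {};
    const [title, options] = toNotificationArgs(payload);
    // waitUntil keeps the worker alive until the notification is shown.
    event.waitUntil(self.registration.showNotification(title, options));
  });
}
```

Keeping the payload-to-options mapping in a pure function means the attribution fields can be covered by an ordinary unit test, independent of any browser.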

Step 6 — Measure Lift and Significance

Per arm, CTR is clicks / accepted. Lift is the difference between variant and control CTR. Then run a two-proportion z-test: if the resulting p-value is below your alpha, the difference is statistically significant. The exact CTR counting logic — and why you must use accepted, not sent, as the denominator — is detailed in measuring push notification click-through rate.

function twoProportionZ(cClicks, cN, vClicks, vN) {
  const p1 = cClicks / cN, p2 = vClicks / vN;
  const pPool = (cClicks + vClicks) / (cN + vN);
  const se = Math.sqrt(pPool * (1 - pPool) * (1 / cN + 1 / vN));
  const z = (p2 - p1) / se;
  // two-sided p-value via normal CDF approximation
  const p = 2 * (1 - normCdf(Math.abs(z)));
  return { liftAbsolute: p2 - p1, z, pValue: p, significant: p < 0.05 };
}

function normCdf(x) {
  return (1 + erf(x / Math.SQRT2)) / 2;
}
function erf(x) {
  const t = 1 / (1 + 0.3275911 * Math.abs(x));
  const y = 1 - (((((1.061405429 * t - 1.453152027) * t) + 1.421413741) * t - 0.284496736) * t + 0.254829592) * t * Math.exp(-x * x);
  return x >= 0 ? y : -y;
}
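
A p-value says whether a difference exists; a confidence interval says how large it plausibly is, which is what decides whether the MDE was met. The function below is a hypothetical companion to `twoProportionZ`, not part of the original procedure. It computes a Wald interval for the absolute lift using the unpooled standard error, and the counts in the usage line are illustrative.

```javascript
// Hypothetical companion to the z-test: Wald interval for the absolute lift.
// Uses the unpooled standard error; z = 1.959963985 gives a 95% interval.
function liftConfidenceInterval(cClicks, cN, vClicks, vN, z = 1.959963985) {
  const p1 = cClicks / cN;
  const p2 = vClicks / vN;
  const se = Math.sqrt((p1 * (1 - p1)) / cN + (p2 * (1 - p2)) / vN);
  const lift = p2 - p1;
  return { lift, lower: lift - z * se, upper: lift + z * se };
}

// Illustrative counts: 251 / 4180 clicks in control, 309 / 4203 in variant.
console.log(liftConfidenceInterval(251, 4180, 309, 4203));
```

If the interval excludes zero but its lower bound sits well under the MDE, the variant is probably better, yet the size of the win is still uncertain.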

Configuration Reference

Parameter	Type	Default	Notes
`experimentId`	string	—	Seeds the assignment hash; keep stable for the test’s life
`splitRatio`	number	`0.5`	Fraction routed to control; rest to variant
`mdeAbsolute`	number	—	Minimum detectable effect in absolute CTR points
`alpha`	number	`0.05`	False-positive rate (two-sided)
`power`	number	`0.80`	Probability of detecting a true effect
`baselineCtr`	number	measured	Use your real CTR, never a published benchmark
`TTL`	seconds	`86400`	Push-service retention; affects delivered timing

Verification

Before trusting any result, confirm the plumbing. Open Chrome DevTools, go to the Application panel, and dispatch a test push to your own subscription to confirm both the display and click events reach your tracking endpoint.

# Confirm arm balance and that accepted counts are close to 50/50
curl -s https://example.com/admin/experiment/title-personalization-v1/arms | jq
# Expect: { "control": { "accepted": 4180, "click": 251 },
#           "variant": { "accepted": 4203, "click": 309 } }

If the two arms differ by more than a couple of percent in accepted count, your assignment or delivery is skewed — investigate before reading the lift.
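
"A couple of percent" can be made precise with a sample-ratio-mismatch check, which is a chi-square goodness-of-fit test on the accepted counts. The sketch below is an assumption about how you might automate it: `sampleRatioMismatch` is a hypothetical helper, and the strict default threshold (p = 0.001) is a common convention for this kind of check rather than something this guide prescribes.

```javascript
// Hypothetical sample-ratio-mismatch check: chi-square goodness of fit, 1 df.
function sampleRatioMismatch(controlN, variantN, splitRatio = 0.5, critical = 10.828) {
  const total = controlN + variantN;
  const expControl = total * splitRatio;       // splitRatio is the control share
  const expVariant = total * (1 - splitRatio);
  const chi2 =
    (controlN - expControl) ** 2 / expControl +
    (variantN - expVariant) ** 2 / expVariant;
  // 10.828 is the chi-square critical value for p = 0.001 with 1 df;
  // pass 3.841 for the looser p = 0.05.
  return { chi2, mismatch: chi2 > critical };
}
```

With the accepted counts from the example response (4180 and 4203), `chi2` is far below either threshold, so the split is consistent with 50/50. A flagged mismatch means the lift should not be read until the cause is found.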

Error & Edge-Case Matrix

Condition	Cause	Fix
Arms badly unbalanced	Non-deterministic assignment or skewed `410` retirement	Use the hash-based assignment; retire dead endpoints evenly across arms
Significant result, no real lift	Peeking / stopping early (false positive)	Fix the sample size up front; do not stop the test when it first crosses significance
No significance after full sample	True effect is smaller than the MDE, or absent	Accept the null result; to detect a smaller effect, lower the MDE, recompute the sample size, and extend the run or widen the segment
Clicks attributed to wrong variant	Variant missing from payload `data`	Always embed `variant` in the notification `data`
`413 Payload Too Large`	Full copy shipped in payload	Send only variant id + url; keep ciphertext under 4 KB
Variant flips on re-send	Hash seed changed	Keep `experimentId` constant for the experiment’s lifetime

Cross-Browser Notes

Chromium, Firefox, and Safari all fire notificationclick, so attribution works everywhere, but display behaviour differs. Safari on iOS and macOS enforces stricter user-gesture and background-delivery rules, so its display-to-accepted ratio is often lower — segment your analysis by browser to avoid one platform’s throttling masking a real copy win. Firefox uses its own push service with a slightly smaller effective payload ceiling, so a payload that fits Chrome may be rejected there. Confirm payload sizing against the maximum payload size limits for Chrome vs Firefox, and review broader divergence in cross-browser notification quirks.
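
Segmenting by browser does not require user-agent parsing if you already store the subscription endpoint, since its hostname identifies the push service. The mapping below is a sketch: `pushServiceFamily` is hypothetical, and the hostname suffixes are assumptions based on commonly observed endpoints, so verify them against your own subscriber table.

```javascript
// Hostname suffixes are assumptions based on commonly observed endpoints;
// verify them against your own subscriber table before relying on them.
const PUSH_SERVICE_FAMILIES = [
  ['fcm.googleapis.com', 'chromium-fcm'],
  ['push.services.mozilla.com', 'firefox'],
  ['push.apple.com', 'safari'],
  ['notify.windows.com', 'edge-wns']
];

function pushServiceFamily(endpoint) {
  let host;
  try {
    host = new URL(endpoint).hostname;
  } catch {
    return 'invalid'; // not a parseable absolute URL
  }
  // Match the exact host or any subdomain of a known suffix.
  const match = PUSH_SERVICE_FAMILIES.find(
    ([suffix]) => host === suffix || host.endsWith(`.${suffix}`)
  );
  return match ? match[1] : 'other';
}
```

Grouping per-arm counts by this label before running the z-test shows whether a lift holds on each platform or is driven by one of them.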

Back to Push Notification Engagement & Campaign Optimization — the engagement loop this experiment optimizes.
Measuring push notification click-through rate — the CTR counting primitive every A/B test depends on.
Push personalization & segmentation — choose the cohort your experiment runs on.
Retry logic & backoff strategies — keep arm sizes balanced when sends are rate limited.

FAQ

How long should I run a push A/B test?

Run it until each arm has reached the per-arm sample size you computed for your MDE — measured in delivered (accepted) notifications, not subscribers sent to. Calendar time matters too: run across at least a full weekly cycle so a weekday-versus-weekend effect does not bias the result. Never stop early just because the test briefly crosses significance.

Can I test more than two variants at once?

Yes, but each additional arm dilutes your sample and inflates the false-positive rate from multiple comparisons. Map subscribers to N buckets with the same hash, increase the per-arm sample size accordingly, and apply a correction (such as Bonferroni) to your alpha. For most lists, a clean two-arm test converges far faster.
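
A minimal sketch of that N-arm setup follows. Both `assignArm` and `bonferroniAlpha` are hypothetical helpers; the sketch assumes equal-sized arms and that each variant is compared only against the control, so the number of comparisons is the arm count minus one.

```javascript
const crypto = require('crypto');

// Hypothetical N-arm generalisation of the two-arm hash assignment.
// `arms` is an ordered list of names; every arm gets an equal share.
function assignArm(subscriberId, experimentId, arms) {
  const h = crypto.createHash('sha256')
    .update(`${experimentId}:${subscriberId}`)
    .digest();
  // Divide by 2^32 so r is in [0, 1) and the index never equals arms.length.
  const r = h.readUInt32BE(0) / 0x100000000;
  return arms[Math.floor(r * arms.length)];
}

// Bonferroni: each variant-vs-control comparison uses alpha / comparisons.
function bonferroniAlpha(alpha, armCount) {
  return alpha / (armCount - 1);
}
```

With three arms, `bonferroniAlpha(0.05, 3)` gives a per-comparison threshold of 0.025. The per-arm sample size then has to be recomputed for that stricter alpha, which is part of why extra arms are expensive.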

Why use accepted rather than sent as the CTR denominator?

A notification the push service rejected (a 410 Gone endpoint, or a 413 over the 4 KB limit) was never delivered, so it cannot be clicked. Counting it in the denominator artificially depresses CTR and biases the comparison if rejections differ between arms. Use accepted (a 201 from the push service) as the denominator.
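
The counting rule can be sketched as an in-memory aggregation. This is illustrative only: `armCounts` is a hypothetical helper, it assumes event rows shaped like `{ subscriber_id, variant, event_type }` for a single campaign send, and it assumes your click endpoint can resolve the subscriber id for each beacon.

```javascript
// Hypothetical aggregation over event rows for one campaign send:
// { subscriber_id, variant, event_type }. Counts unique subscribers per arm.
function armCounts(events) {
  const seen = { accepted: new Map(), click: new Map() };
  for (const e of events) {
    // Ignore sent, failed, display and any other event types.
    if (e.event_type !== 'accepted' && e.event_type !== 'click') continue;
    const bucket = seen[e.event_type];
    if (!bucket.has(e.variant)) bucket.set(e.variant, new Set());
    bucket.get(e.variant).add(e.subscriber_id);
  }
  const result = {};
  for (const [variant, accepted] of seen.accepted) {
    const clickers = seen.click.get(variant) ?? new Set();
    // Only count clicks from subscribers whose send was accepted.
    const clicks = [...clickers].filter((id) => accepted.has(id)).length;
    result[variant] = { accepted: accepted.size, click: clicks, ctr: clicks / accepted.size };
  }
  return result;
}
```

Using sets means a double-tapped notification or a replayed beacon counts once, which keeps a single enthusiastic subscriber from inflating one arm's CTR.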

Does deterministic hashing introduce bias?

No, provided the hash is well-distributed (sha256 is) and you seed it with both the experiment id and subscriber id. The output is uniform, so the split is effectively random, while remaining stable across re-sends. The only bias risk is uneven endpoint retirement, which you mitigate by applying the same 410 Gone retirement rule to both arms.