A/B Testing Web Push Notifications

Most teams “test” push notifications by sending two versions and eyeballing which got more clicks. That is not a test — it is a coin flip dressed up as data. A real A/B test assigns subscribers to variants randomly, holds everything else constant, collects enough samples to overcome noise, and only then declares a winner with a stated confidence level. This guide walks through the full procedure with runnable Node.js and service-worker code.

This is a deep-dive within the push engagement and campaign optimization guide. It depends on one measurement primitive — accurately counting clicks against delivered notifications — which is covered in detail in measuring push notification click-through rate.

Prerequisites

  • VAPID keys loaded from environment variables (process.env.VAPID_PUBLIC_KEY and process.env.VAPID_PRIVATE_KEY).
  • An event store that records sent, accepted, display, and click per subscriber, campaign, and variant.
  • delivery_status.
  • A service worker that handles notificationclick and can emit a tracking beacon.

How a Push A/B Test Works

An experiment has three moving parts. First, a hypothesis: a falsifiable statement like “an urgency-framed title increases CTR over the neutral control.” Second, assignment: each eligible subscriber is deterministically and randomly placed into exactly one variant. Third, attribution: every click is tied back to the variant the subscriber received, so the comparison is apples-to-apples.

The critical discipline is changing one thing at a time. If variant B has both a new title and a new send time, a win tells you nothing about which change mattered. Keep the payload, icon, timing, and segment identical across variants except for the single element under test. Segment selection itself is upstream of the test and is covered in push personalization and segmentation.

Assignment must be sticky: a subscriber who lands in “control” today should land in “control” on a re-send, otherwise contamination ruins the result. Deterministic hashing gives you stickiness without a database write per subscriber.

[Figure: Push A/B test assignment and attribution flow. A subscriber cohort is hashed into a control or variant bucket, each bucket receives a different title, and clicks are attributed back to the bucket to compute lift.]
Deterministic hashing makes assignment sticky and random; attribution closes the loop back to lift.

Step 1 — Write a Falsifiable Hypothesis

State the change, the metric, the direction, and the minimum effect you care about. “Adding a first-name token to the title raises CTR by at least 1.5 absolute points” is testable; “make the copy better” is not. The minimum detectable effect (MDE) directly drives your sample size, so commit to it before you start.

const experiment = {
  id: 'title-personalization-v1',
  hypothesis: 'First-name title raises CTR by >= 1.5 absolute points',
  control: { title: 'Your weekly summary is ready' },
  variant: { title: '{firstName}, your weekly summary is ready' },
  splitRatio: 0.5,        // 50/50
  mdeAbsolute: 0.015,     // 1.5 percentage points
  baselineCtr: 0.06       // measured, not guessed
};
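
An absolute MDE is easier to judge next to its relative equivalent. The sketch below uses a hypothetical helper, describeMde (not part of any library), to show that 1.5 absolute points on a 6% baseline is a 25% relative lift, which is a large effect to ask of a title change.

```javascript
// Hypothetical helper: express an absolute MDE relative to the baseline CTR.
function describeMde(baselineCtr, mdeAbsolute) {
  const targetCtr = baselineCtr + mdeAbsolute;   // CTR the variant must reach
  const relativeLift = mdeAbsolute / baselineCtr; // same effect as a fraction of baseline
  return { targetCtr, relativeLift };
}

const mde = describeMde(0.06, 0.015);
console.log(mde.targetCtr.toFixed(3));                  // "0.075"
console.log((mde.relativeLift * 100).toFixed(1) + '%'); // "25.0%"
```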

Step 2 — Compute the Required Sample Size

A two-proportion test needs enough subscribers per arm to detect the MDE at your chosen significance (alpha, typically 0.05) and power (typically 0.80). Undersized tests produce false negatives: you change something, it actually works, and the noise hides it. Compute the per-arm size before sending.

// Per-arm sample size for a two-proportion z-test.
// zAlpha and zBeta are the standard normal quantiles for the chosen significance
// and power; the defaults correspond to alpha = 0.05 (two-sided) and power = 0.80.
function sampleSizePerArm(p1, mde, zAlpha = 1.959963985, zBeta = 0.841621234) {
  const p2 = p1 + mde;
  const pBar = (p1 + p2) / 2;
  const num = Math.pow(
    zAlpha * Math.sqrt(2 * pBar * (1 - pBar)) + zBeta * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)),
    2
  );
  return Math.ceil(num / Math.pow(mde, 2));
}

// e.g. baseline 6% CTR, detect a 1.5pt lift:
console.log(sampleSizePerArm(0.06, 0.015)); // 4391, i.e. ~4,400 delivered per arm

If your eligible cohort is smaller than two arms’ worth of delivered (not merely sent) notifications, either widen the segment, raise the MDE, or run the test longer across multiple sends.
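
A quick way to check this before sending is to convert the per-arm size into a number of sends. The helper below is a hypothetical planning sketch: planCohort is not a library function, it assumes a 50/50 split, and acceptRate is an assumed input you would measure from past campaigns. The per-arm figure of 4391 is what sampleSizePerArm returns for a 6% baseline and a 1.5-point MDE.

```javascript
// Hypothetical planning helper, assuming a 50/50 split.
// acceptRate: share of sends the push service accepts (measure it, do not guess).
function planCohort(perArm, cohortSize, acceptRate) {
  const sendsNeeded = Math.ceil((2 * perArm) / acceptRate);
  return {
    sendsNeeded,
    enough: cohortSize >= sendsNeeded,
    shortfall: Math.max(0, sendsNeeded - cohortSize)
  };
}

console.log(planCohort(4391, 8000, 0.95));
// { sendsNeeded: 9245, enough: false, shortfall: 1245 }
```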

Step 3 — Assign Variants Deterministically

Hash the subscriber id together with the experiment id, take a bucket, and map it to a variant by your split ratio. Seeding with the experiment id keeps concurrent experiments independent, and hashing keeps the same subscriber in the same arm on every send.

const crypto = require('crypto');

function assignVariant(subscriberId, experimentId, splitRatio = 0.5) {
  const h = crypto.createHash('sha256')
    .update(`${experimentId}:${subscriberId}`)
    .digest();
  const r = h.readUInt32BE(0) / 2 ** 32; // uniform in [0, 1)
  return r < splitRatio ? 'control' : 'variant';
}
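
Before wiring assignment into a real send, it can help to confirm the two properties this step relies on: the same subscriber always gets the same arm, and the split is close to the configured ratio. The sketch below repeats the hashing scheme so it runs standalone; the subscriber ids are synthetic and purely illustrative.

```javascript
const crypto = require('crypto');

// Mirrors the assignment function above so this check is self-contained.
function assignVariant(subscriberId, experimentId, splitRatio = 0.5) {
  const h = crypto.createHash('sha256').update(`${experimentId}:${subscriberId}`).digest();
  return h.readUInt32BE(0) / 2 ** 32 < splitRatio ? 'control' : 'variant';
}

// Synthetic subscriber ids (illustrative only).
const ids = Array.from({ length: 10000 }, (_, i) => `sub-${i}`);
const first = ids.map((id) => assignVariant(id, 'title-personalization-v1'));
const second = ids.map((id) => assignVariant(id, 'title-personalization-v1'));

// Sticky: a second pass yields identical arms. Balanced: control share near 0.5.
const sticky = first.every((arm, i) => arm === second[i]);
const controlShare = first.filter((arm) => arm === 'control').length / ids.length;
console.log({ sticky, controlShare });
```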

Step 4 — Send and Record the Assignment

When you send, write a sent event with the variant, then record accepted from the push-service response. The variant must travel inside the encrypted payload so the service worker can attribute the eventual click. Keep that payload under the 4 KB limit (it is encrypted with the aes128gcm content encoding per RFC 8291 before transmission), so carry only the title, the campaign and variant ids, and a target url, not long body copy or embedded assets.

const webpush = require('web-push');
webpush.setVapidDetails(
  'mailto:ops@example.com',
  process.env.VAPID_PUBLIC_KEY,   // never hardcode these
  process.env.VAPID_PRIVATE_KEY
);

async function sendExperiment(db, sub, exp, ctx) {
  const arm = assignVariant(sub.id, exp.id, exp.splitRatio);
  const title = (arm === 'control' ? exp.control.title : exp.variant.title)
    .replace('{firstName}', ctx.firstName ?? 'there');

  await db.query(
    `INSERT INTO push_events (subscriber_id, campaign_id, variant, event_type)
     VALUES ($1, $2, $3, 'sent')`,
    [sub.id, exp.id, arm]
  );

  try {
    const res = await webpush.sendNotification(
      sub,
      JSON.stringify({ title, url: ctx.url, campaignId: exp.id, variant: arm }),
      { TTL: 86400, urgency: 'normal' }
    );
    await db.query(
      `INSERT INTO push_events (subscriber_id, campaign_id, variant, event_type, status_code)
       VALUES ($1, $2, $3, 'accepted', $4)`,
      [sub.id, exp.id, arm, res.statusCode]
    );
  } catch (err) {
    // 410 Gone -> retire endpoint; 429 -> back off and requeue
    await db.query(
      `INSERT INTO push_events (subscriber_id, campaign_id, variant, event_type, status_code)
       VALUES ($1, $2, $3, 'failed', $4)`,
      [sub.id, exp.id, arm, err.statusCode ?? 0]
    );
  }
}
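
A pre-send size check catches oversized payloads before the push service rejects them. This is a minimal sketch: payloadFits is a hypothetical helper, and the 103-byte overhead is my accounting of the aes128gcm framing (86-byte header, 16-byte authentication tag, 1 padding delimiter byte), so treat the exact budget as an assumption and confirm it against your library.

```javascript
// Assumed budget: 4096-byte push message minus aes128gcm overhead
// (86-byte header + 16-byte auth tag + 1 padding delimiter = 103 bytes).
const MAX_PLAINTEXT_BYTES = 4096 - 103;

// Hypothetical helper: measure the JSON payload in UTF-8 bytes, not characters.
function payloadFits(payload) {
  const bytes = Buffer.byteLength(JSON.stringify(payload), 'utf8');
  return { bytes, fits: bytes <= MAX_PLAINTEXT_BYTES };
}

console.log(payloadFits({
  title: 'Your weekly summary is ready',
  url: '/summary',
  campaignId: 'title-personalization-v1',
  variant: 'control'
}).fits); // true
```

Measuring bytes rather than string length matters for titles with emoji or non-Latin scripts, where one character can occupy several bytes.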

When a send returns 429 Too Many Requests, do not drop the subscriber from the experiment — requeue it using your retry logic and backoff so the arm sizes stay balanced.
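
One way to requeue is an exponential backoff wrapper around the send call. The sketch below is an assumption about how your retry logic might look, not part of web-push: sendWithBackoff, maxAttempts, baseMs, and the injectable sleep are all hypothetical names. It only relies on the thrown error carrying a statusCode, as in the catch block above.

```javascript
// Hypothetical wrapper: sendFn is any async function that throws an error
// with a statusCode property when the push service rejects the request.
async function sendWithBackoff(sendFn, { maxAttempts = 4, baseMs = 500, sleep } = {}) {
  const wait = sleep ?? ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  for (let attempt = 1; ; attempt++) {
    try {
      return await sendFn();
    } catch (err) {
      // Only 429 is retried here; 410 and other errors go straight to the caller.
      const retryable = err.statusCode === 429;
      if (!retryable || attempt >= maxAttempts) throw err;
      // Exponential backoff with jitter: base delay doubles on each attempt.
      const delay = baseMs * 2 ** (attempt - 1) * (0.5 + Math.random());
      await wait(delay);
    }
  }
}
```

In sendExperiment this would wrap the webpush.sendNotification call, for example sendWithBackoff(() => webpush.sendNotification(sub, payload, options)), so the accepted or failed event is written only once the retries settle.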

Step 5 — Track Clicks in the Service Worker

The browser fires notificationclick when the user taps the notification. Read the variant out of the notification data, fire a tracking beacon, and focus or open the target. This client event is what populates the numerator of CTR.

// sw.js — attribute the click back to its variant
self.addEventListener('notificationclick', (event) => {
  const d = event.notification.data || {};
  event.notification.close();
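  event.waitUntil((async () => {
    // Start the beacon but do not await it yet, so a slow network
    // cannot delay focusing or opening the window.
    const beacon = fetch('/t/click', {
      method: 'POST',
      keepalive: true,
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ campaignId: d.campaignId, variant: d.variant, ts: Date.now() })
    }).catch(() => {});
    // Resolve to an absolute URL and compare exactly; a substring match on '/'
    // would match every open client.
    const target = new URL(d.url || '/', self.location.origin).href;
    const all = await clients.matchAll({ type: 'window', includeUncontrolled: true });
    const hit = all.find((c) => c.url === target);
    try {
      if (hit) await hit.focus();
      else await clients.openWindow(target);
    } finally {
      await beacon; // keep the worker alive until the beacon settles
    }
  })());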
});
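
The click handler reads event.notification.data, which is only populated if the push handler stored it when showing the notification. A minimal sketch of that handler follows, assuming the payload shape sent in Step 4 (title, url, campaignId, variant); buildNotification and the fallback title are hypothetical, and the guard lets the helper be exercised outside a service worker.

```javascript
// sw.js: turn the decrypted push payload into a notification whose data
// carries the attribution fields the click handler needs.
function buildNotification(payloadText) {
  let p = {};
  try { p = JSON.parse(payloadText); } catch (_) { /* malformed payload: use defaults */ }
  return {
    title: p.title || 'Notification', // placeholder fallback title
    options: { data: { campaignId: p.campaignId, variant: p.variant, url: p.url || '/' } }
  };
}

// Register the listener only when running inside a service worker.
if (typeof self !== 'undefined' && self.registration) {
  self.addEventListener('push', (event) => {
    const { title, options } = buildNotification(event.data ? event.data.text() : '{}');
    event.waitUntil(self.registration.showNotification(title, options));
  });
}
```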

Step 6 — Measure Lift and Significance

Per arm, CTR is clicks / accepted. Lift is the difference between variant and control CTR. Then run a two-proportion z-test: if the resulting p-value is below your alpha, the difference is statistically significant. The exact CTR counting logic — and why you must use accepted, not sent, as the denominator — is detailed in measuring push notification click-through rate.

function twoProportionZ(cClicks, cN, vClicks, vN, alpha = 0.05) {
  const p1 = cClicks / cN, p2 = vClicks / vN;
  const pPool = (cClicks + vClicks) / (cN + vN);
  const se = Math.sqrt(pPool * (1 - pPool) * (1 / cN + 1 / vN));
  // se is 0 when neither arm has any clicks (or every send was clicked): no test is possible
  if (se === 0) return { liftAbsolute: p2 - p1, z: 0, pValue: 1, significant: false };
  const z = (p2 - p1) / se;
  // two-sided p-value via normal CDF approximation
  const p = 2 * (1 - normCdf(Math.abs(z)));
  return { liftAbsolute: p2 - p1, z, pValue: p, significant: p < alpha };
}

function normCdf(x) {
  return (1 + erf(x / Math.SQRT2)) / 2;
}
function erf(x) {
  const t = 1 / (1 + 0.3275911 * Math.abs(x));
  const y = 1 - (((((1.061405429 * t - 1.453152027) * t) + 1.421413741) * t - 0.284496736) * t + 0.254829592) * t * Math.exp(-x * x);
  return x >= 0 ? y : -y;
}
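
A p-value says whether the difference is distinguishable from noise, but not how large it plausibly is. A confidence interval on the lift covers that. The sketch below is an addition under stated assumptions: liftConfidenceInterval is a hypothetical helper using the standard unpooled standard error, and the counts are the ones from the example arms output in the Verification section.

```javascript
// Confidence interval for absolute lift (variant CTR minus control CTR),
// using the unpooled standard error. z = 1.959963985 gives a 95% interval.
function liftConfidenceInterval(cClicks, cN, vClicks, vN, z = 1.959963985) {
  const p1 = cClicks / cN, p2 = vClicks / vN;
  const se = Math.sqrt((p1 * (1 - p1)) / cN + (p2 * (1 - p2)) / vN);
  const lift = p2 - p1;
  return { lift, lower: lift - z * se, upper: lift + z * se };
}

// Example counts: control 251 clicks / 4180 accepted, variant 309 / 4203.
console.log(liftConfidenceInterval(251, 4180, 309, 4203));
```

If the lower bound is above zero the variant is ahead at that confidence level; comparing the interval with your MDE shows whether the lift is also large enough to matter.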

Configuration Reference

Parameter | Type | Default | Notes
experimentId | string | (none) | Seeds the assignment hash; keep stable for the test’s life
splitRatio | number | 0.5 | Fraction routed to control; rest to variant
mdeAbsolute | number | (none) | Minimum detectable effect in absolute CTR points
alpha | number | 0.05 | False-positive rate (two-sided)
power | number | 0.80 | Probability of detecting a true effect
baselineCtr | number | measured | Use your real CTR, never a published benchmark
TTL | seconds | 86400 | Push-service retention; affects delivered timing

Verification

Before trusting any result, confirm the plumbing. Open Chrome DevTools, go to the Application panel, and dispatch a test push to your own subscription to confirm both the display and click events reach your tracking endpoint.

# Confirm arm balance and that accepted counts are close to 50/50
curl -s https://example.com/admin/experiment/title-personalization-v1/arms | jq
# Expect: { "control": { "accepted": 4180, "click": 251 },
#           "variant": { "accepted": 4203, "click": 309 } }

If the two arms differ by more than a couple of percent in accepted count, your assignment or delivery is skewed — investigate before reading the lift.
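
A more precise version of the "couple of percent" rule is a sample-ratio-mismatch check: a chi-square goodness-of-fit test of the observed accepted counts against the configured split. The sketch below is an assumption about how you might implement it. sampleRatioMismatch is a hypothetical helper, and 3.841 is the chi-square critical value for 1 degree of freedom at alpha 0.05; many teams use a stricter threshold for this check.

```javascript
// Sample-ratio-mismatch check: chi-square goodness of fit, 1 degree of freedom.
function sampleRatioMismatch(controlN, variantN, splitRatio = 0.5, critical = 3.841) {
  const total = controlN + variantN;
  const expectedControl = total * splitRatio;
  const expectedVariant = total * (1 - splitRatio);
  const chi2 =
    (controlN - expectedControl) ** 2 / expectedControl +
    (variantN - expectedVariant) ** 2 / expectedVariant;
  return { chi2, mismatch: chi2 > critical };
}

console.log(sampleRatioMismatch(4180, 4203).mismatch); // false: arms are balanced
console.log(sampleRatioMismatch(4000, 4400).mismatch); // true: investigate first
```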

Error & Edge-Case Matrix

Condition | Cause | Fix
Arms badly unbalanced | Non-deterministic assignment or skewed 410 retirement | Use the hash-based assignment; retire dead endpoints evenly across arms
Significant result, no real lift | Peeking / stopping early (false positive) | Fix the sample size up front; do not stop the test when it first crosses significance
No significance after full sample | True effect smaller than the MDE, or list too small to reach the sample size | Accept that any effect is below the MDE, or extend the run / widen the segment to power a smaller MDE
Clicks attributed to wrong variant | Variant missing from payload data | Always embed variant in the notification data
413 Payload Too Large | Full copy shipped in payload | Send only the title, variant id, and url; keep ciphertext under 4 KB
Variant flips on re-send | Hash seed changed | Keep experimentId constant for the experiment’s lifetime

Cross-Browser Notes

Chromium, Firefox, and Safari all fire notificationclick, so attribution works everywhere, but display behaviour differs. Safari on iOS and macOS enforces stricter user-gesture and background-delivery rules, so its display-to-accepted ratio is often lower — segment your analysis by browser to avoid one platform’s throttling masking a real copy win. Firefox uses its own push service with a slightly smaller effective payload ceiling, so a payload that fits Chrome may be rejected there. Confirm payload sizing against the maximum payload size limits for Chrome vs Firefox, and review broader divergence in cross-browser notification quirks.
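
Segmenting by browser needs some per-subscriber browser signal. If you did not store the user agent at subscription time, a rough proxy is the push service host in the endpoint URL. The sketch below is illustrative only: pushServiceFor is a hypothetical helper, and the host suffixes are assumptions based on commonly observed endpoints, not an exhaustive or guaranteed list, so verify them against your own subscription data.

```javascript
// Assumed host suffixes for common push services (verify against your own endpoints).
const PUSH_HOSTS = [
  ['fcm.googleapis.com', 'fcm'],             // typically Chrome and other Chromium browsers
  ['push.services.mozilla.com', 'mozilla'],  // typically Firefox
  ['push.apple.com', 'apple'],               // typically Safari
  ['notify.windows.com', 'wns']              // typically Edge
];

// Hypothetical helper: classify a subscription endpoint by push service.
function pushServiceFor(endpoint) {
  const host = new URL(endpoint).hostname;
  const match = PUSH_HOSTS.find(([suffix]) => host === suffix || host.endsWith('.' + suffix));
  return match ? match[1] : 'other';
}

console.log(pushServiceFor('https://fcm.googleapis.com/fcm/send/abc')); // "fcm"
```

Grouping the per-arm counts by this label before running the significance test shows whether a lift holds on each platform or is driven by one of them.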

FAQ

How long should I run a push A/B test?

Run it until each arm has reached the per-arm sample size you computed for your MDE — measured in delivered (accepted) notifications, not subscribers sent to. Calendar time matters too: run across at least a full weekly cycle so a weekday-versus-weekend effect does not bias the result. Never stop early just because the test briefly crosses significance.
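
Both stopping conditions can be encoded as a guard that your reporting code checks before it shows a verdict. This is a sketch with hypothetical names (readyToAnalyze and its fields are not from any library); it assumes timestamps in milliseconds.

```javascript
const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

// Hypothetical guard: only read results once both stopping conditions hold.
function readyToAnalyze({ acceptedControl, acceptedVariant, perArmTarget, startedAt, now = Date.now() }) {
  const sampleReached = Math.min(acceptedControl, acceptedVariant) >= perArmTarget;
  const fullWeek = now - startedAt >= WEEK_MS;
  return sampleReached && fullWeek;
}

console.log(readyToAnalyze({
  acceptedControl: 4500, acceptedVariant: 4480, perArmTarget: 4391,
  startedAt: 0, now: 3 * 24 * 60 * 60 * 1000
})); // false: sample reached, but only three days elapsed
```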

Can I test more than two variants at once?

Yes, but each additional arm dilutes your sample and inflates the false-positive rate from multiple comparisons. Map subscribers to N buckets with the same hash, increase the per-arm sample size accordingly, and apply a correction (such as Bonferroni) to your alpha. For most lists, a clean two-arm test converges far faster.
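
A minimal sketch of both pieces, assuming equally weighted arms: assignArm and bonferroniAlpha are hypothetical helpers, and the arm names are placeholders. The hash input matches the two-arm scheme, so assignment stays sticky per experiment.

```javascript
const crypto = require('crypto');

// Map a subscriber to one of N equally weighted arms using the same hash idea.
function assignArm(subscriberId, experimentId, arms) {
  const h = crypto.createHash('sha256').update(`${experimentId}:${subscriberId}`).digest();
  return arms[h.readUInt32BE(0) % arms.length];
}

// Bonferroni: divide alpha by the number of variant-vs-control comparisons.
function bonferroniAlpha(alpha, comparisons) {
  return alpha / comparisons;
}

const arms = ['control', 'variantA', 'variantB']; // placeholder arm names
console.log(assignArm('sub-42', 'title-test-v2', arms));
console.log(bonferroniAlpha(0.05, arms.length - 1)); // 0.025 per comparison
```

With the corrected alpha, each variant is compared against control using the stricter threshold, and the per-arm sample size should be recomputed with the matching z value.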

Why use accepted rather than sent as the CTR denominator?

A notification the push service rejected (a 410 Gone endpoint, or a 413 over the 4 KB limit) was never delivered, so it cannot be clicked. Counting it in the denominator artificially depresses CTR and biases the comparison if rejections differ between arms. Use accepted (a 201 from the push service) as the denominator.
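
As an illustration, per-arm CTR can be computed from event rows shaped like the push_events table in Step 4. This is a sketch: ctrByArm is a hypothetical helper, and it assumes your /t/click endpoint writes rows with event_type 'click'.

```javascript
// events: rows with { variant, event_type }, as in the push_events table.
// Assumes click rows are written by the tracking endpoint with event_type 'click'.
function ctrByArm(events) {
  const arms = {};
  for (const { variant, event_type } of events) {
    const arm = (arms[variant] ??= { accepted: 0, click: 0 });
    if (event_type === 'accepted') arm.accepted++;
    else if (event_type === 'click') arm.click++;
    // 'sent' and 'failed' rows are deliberately ignored: they are not in the denominator
  }
  for (const arm of Object.values(arms)) {
    arm.ctr = arm.accepted ? arm.click / arm.accepted : null;
  }
  return arms;
}

console.log(ctrByArm([
  { variant: 'control', event_type: 'sent' },
  { variant: 'control', event_type: 'accepted' },
  { variant: 'control', event_type: 'click' },
  { variant: 'variant', event_type: 'sent' },
  { variant: 'variant', event_type: 'failed' }
]));
```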

Does deterministic hashing introduce bias?

No, provided the hash is well-distributed (sha256 is) and you seed it with both the experiment id and subscriber id. The output is uniform, so the split is random, while remaining stable across re-sends. The only bias risk is uneven endpoint retirement, which you mitigate by retiring 410 Gone endpoints across both arms.