A/B Testing Web Push Notifications
Most teams “test” push notifications by sending two versions and eyeballing which got more clicks. That is not a test — it is a coin flip dressed up as data. A real A/B test assigns subscribers to variants randomly, holds everything else constant, collects enough samples to overcome noise, and only then declares a winner with a stated confidence level. This guide walks through the full procedure with runnable Node.js and service-worker code.
This is a deep-dive within the push engagement and campaign optimization guide. It depends on one measurement primitive — accurately counting clicks against delivered notifications — which is covered in detail in measuring push notification click-through rate.
Prerequisites
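- VAPID keys supplied through environment variables (process.env.VAPID_PUBLIC_KEY and process.env.VAPID_PRIVATE_KEY).
- An event store that records sent, accepted, display, and click per subscriber, campaign, and variant, including delivery_status.
- A service worker that handles notificationclick and can emit a tracking beacon.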
How a Push A/B Test Works
An experiment has three moving parts. First, a hypothesis: a falsifiable statement like “an urgency-framed title increases CTR over the neutral control.” Second, assignment: each eligible subscriber is placed into exactly one variant by a rule that is effectively random across subscribers yet deterministic for any one of them. Third, attribution: every click is tied back to the variant the subscriber received, so the comparison is apples-to-apples.
The critical discipline is changing one thing at a time. If variant B has both a new title and a new send time, a win tells you nothing about which change mattered. Keep the payload, icon, timing, and segment identical across variants except for the single element under test. Segment selection itself is upstream of the test and is covered in push personalization and segmentation.
Assignment must be sticky: a subscriber who lands in “control” today should land in “control” on a re-send, otherwise contamination ruins the result. Deterministic hashing gives you stickiness without a database write per subscriber.
Step 1 — Write a Falsifiable Hypothesis
State the change, the metric, the direction, and the minimum effect you care about. “Adding a first-name token to the title raises CTR by at least 1.5 absolute points” is testable; “make the copy better” is not. The minimum detectable effect (MDE) directly drives your sample size, so commit to it before you start.
const experiment = {
  id: 'title-personalization-v1',
  hypothesis: 'First-name title raises CTR by >= 1.5 absolute points',
  control: { title: 'Your weekly summary is ready' },
  variant: { title: '{firstName}, your weekly summary is ready' },
  splitRatio: 0.5,      // 50/50
  mdeAbsolute: 0.015,   // 1.5 percentage points
  baselineCtr: 0.06     // measured, not guessed
};
Step 2 — Compute the Required Sample Size
A two-proportion test needs enough subscribers per arm to detect the MDE at your chosen significance (alpha, typically 0.05) and power (typically 0.80). Undersized tests produce false negatives: the change actually worked, but the noise hides it. Compute the per-arm size before sending.
// Per-arm sample size for a two-proportion z-test.
// Note: the z constants below are fixed for alpha = 0.05 (two-sided) and power = 0.80.
// The alpha and power parameters are not used to derive them, so replace the
// constants if you change either value.
function sampleSizePerArm(p1, mde, alpha = 0.05, power = 0.8) {
  const zA = 1.959963985; // two-sided z for alpha=0.05
  const zB = 0.841621234; // z for power=0.80
  const p2 = p1 + mde;
  const pBar = (p1 + p2) / 2;
  const num = Math.pow(
    zA * Math.sqrt(2 * pBar * (1 - pBar)) + zB * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)),
    2
  );
  return Math.ceil(num / Math.pow(mde, 2));
}
// e.g. baseline 6% CTR, detect a 1.5pt lift:
console.log(sampleSizePerArm(0.06, 0.015)); // 4391, i.e. roughly 4,400 delivered per arm
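The sample-size function takes alpha and power as parameters but relies on z constants that match only 0.05 and 0.80. If you need other settings, one option is to derive the z values numerically. The sketch below is an illustration under assumptions, not part of the original procedure: zForProbability and sampleSizeFor are hypothetical names, and the inverse normal CDF is found by plain bisection over the same erf approximation this guide uses for the significance test.

```javascript
// Abramowitz-Stegun approximation of erf (maximum error about 1.5e-7).
function erf(x) {
  const t = 1 / (1 + 0.3275911 * Math.abs(x));
  const poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t + 0.254829592) * t;
  const y = 1 - poly * Math.exp(-x * x);
  return x >= 0 ? y : -y;
}

function normCdf(x) {
  return (1 + erf(x / Math.SQRT2)) / 2;
}

// Hypothetical helper: invert normCdf by bisection on [-10, 10].
function zForProbability(p) {
  let lo = -10;
  let hi = 10;
  for (let i = 0; i < 100; i++) {
    const mid = (lo + hi) / 2;
    if (normCdf(mid) < p) lo = mid; else hi = mid;
  }
  return (lo + hi) / 2;
}

// Hypothetical variant of sampleSizePerArm that actually uses alpha and power.
function sampleSizeFor(p1, mde, alpha = 0.05, power = 0.8) {
  const zA = zForProbability(1 - alpha / 2); // two-sided test
  const zB = zForProbability(power);
  const p2 = p1 + mde;
  const pBar = (p1 + p2) / 2;
  const num = (zA * Math.sqrt(2 * pBar * (1 - pBar)) +
               zB * Math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2;
  return Math.ceil(num / (mde * mde));
}

console.log(sampleSizeFor(0.06, 0.015));            // defaults: alpha 0.05, power 0.80
console.log(sampleSizeFor(0.06, 0.015, 0.05, 0.9)); // higher power needs more subscribers
```

Bisection is slower than a closed-form inverse, but it only runs twice per calculation and avoids hard-coding another table of coefficients.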
If your eligible cohort is smaller than two arms’ worth of delivered (not merely sent) notifications, either widen the segment, raise the MDE, or run the test longer across multiple sends.
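To make that trade-off concrete, the arithmetic can be sketched as a small planner. The function name planSends, the acceptRate input, and the example numbers are assumptions for illustration. Like the paragraph above, it counts every accepted notification as one sample, which is a simplification when the same subscribers are reached on several sends.

```javascript
// Hypothetical planner: how many sends until each arm reaches the required
// number of accepted notifications?
function planSends({ eligible, acceptRate, requiredPerArm, splitRatio = 0.5 }) {
  // With an uneven split, the smaller arm is the bottleneck.
  const smallerShare = Math.min(splitRatio, 1 - splitRatio);
  const acceptedPerArmPerSend = Math.floor(eligible * acceptRate * smallerShare);
  if (acceptedPerArmPerSend <= 0) {
    throw new Error('No deliverable subscribers in the smaller arm');
  }
  return {
    acceptedPerArmPerSend,
    sendsNeeded: Math.ceil(requiredPerArm / acceptedPerArmPerSend),
  };
}

console.log(planSends({ eligible: 6000, acceptRate: 0.9, requiredPerArm: 4391 }));
// { acceptedPerArmPerSend: 2700, sendsNeeded: 2 }
```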
Step 3 — Assign Variants Deterministically
Hash the subscriber id together with the experiment id, take a bucket, and map it to a variant by your split ratio. Seeding with the experiment id keeps concurrent experiments independent, and hashing keeps the same subscriber in the same arm on every send.
const crypto = require('crypto');

function assignVariant(subscriberId, experimentId, splitRatio = 0.5) {
  const h = crypto.createHash('sha256')
    .update(`${experimentId}:${subscriberId}`)
    .digest();
  // Divide by 2^32 (not 2^32 - 1) so r falls in [0, 1) and can never equal 1
  const r = h.readUInt32BE(0) / 0x100000000;
  return r < splitRatio ? 'control' : 'variant';
}
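Before relying on the assignment, it may be worth checking the two properties claimed for it: a split close to the configured ratio, and stickiness. The following is a minimal sketch. The controlShare helper and the synthetic sub-N ids are hypothetical, and the hashing function is repeated only so the snippet runs on its own.

```javascript
const { createHash } = require('crypto');

// Same hashing scheme as assignVariant above, repeated so this snippet is self-contained.
function assignVariant(subscriberId, experimentId, splitRatio = 0.5) {
  const h = createHash('sha256').update(`${experimentId}:${subscriberId}`).digest();
  return h.readUInt32BE(0) / 0x100000000 < splitRatio ? 'control' : 'variant';
}

// Hypothetical check: assign n synthetic ids and report the observed control share.
function controlShare(experimentId, n = 100000) {
  let control = 0;
  for (let i = 0; i < n; i++) {
    if (assignVariant(`sub-${i}`, experimentId) === 'control') control++;
  }
  return control / n;
}

console.log(controlShare('title-personalization-v1')); // expected to land close to 0.5

// Stickiness: identical inputs always map to the same arm.
console.log(
  assignVariant('sub-42', 'title-personalization-v1') ===
  assignVariant('sub-42', 'title-personalization-v1')
); // true
```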
Step 4 — Send and Record the Assignment
When you send, write a sent event with the variant, then record accepted from the push-service response. The variant must travel inside the encrypted payload so the service worker can attribute the eventual click. Keep that payload under the 4 KB ciphertext limit — it is encrypted with the aes128gcm content encoding per RFC 8291 before transmission — so carry only the variant id and a target url, not full copy.
const webpush = require('web-push');

webpush.setVapidDetails(
  'mailto:ops@example.com',
  process.env.VAPID_PUBLIC_KEY,   // never hardcode these
  process.env.VAPID_PRIVATE_KEY
);

async function sendExperiment(db, sub, exp, ctx) {
  const arm = assignVariant(sub.id, exp.id, exp.splitRatio);
  const title = (arm === 'control' ? exp.control.title : exp.variant.title)
    .replace('{firstName}', ctx.firstName ?? 'there');

  await db.query(
    `INSERT INTO push_events (subscriber_id, campaign_id, variant, event_type)
     VALUES ($1, $2, $3, 'sent')`,
    [sub.id, exp.id, arm]
  );

  try {
    const res = await webpush.sendNotification(
      sub,
      JSON.stringify({ title, url: ctx.url, campaignId: exp.id, variant: arm }),
      { TTL: 86400, urgency: 'normal' }
    );
    await db.query(
      `INSERT INTO push_events (subscriber_id, campaign_id, variant, event_type, status_code)
       VALUES ($1, $2, $3, 'accepted', $4)`,
      [sub.id, exp.id, arm, res.statusCode]
    );
  } catch (err) {
    // 410 Gone -> retire endpoint; 429 -> back off and requeue
    await db.query(
      `INSERT INTO push_events (subscriber_id, campaign_id, variant, event_type, status_code)
       VALUES ($1, $2, $3, 'failed', $4)`,
      [sub.id, exp.id, arm, err.statusCode ?? 0]
    );
  }
}
When a send returns 429 Too Many Requests, do not drop the subscriber from the experiment — requeue it using your retry logic and backoff so the arm sizes stay balanced.
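How long to wait before the requeued send is a policy choice. One common approach, sketched here as an assumption and not as this guide's prescribed retry logic, is to honor a numeric Retry-After value when the push service supplies one and otherwise use exponential backoff with full jitter. The helper name retryDelayMs and its defaults are hypothetical.

```javascript
// Hypothetical helper: milliseconds to wait before retrying a rate-limited send.
// retryAfter is the raw Retry-After header value, if the push service sent one.
function retryDelayMs(attempt, retryAfter, { baseMs = 1000, capMs = 60000, random = Math.random } = {}) {
  const headerSeconds = Number(retryAfter);
  const hasHeader = retryAfter != null && retryAfter !== '';
  if (hasHeader && Number.isFinite(headerSeconds) && headerSeconds >= 0) {
    return headerSeconds * 1000; // the server's hint takes priority
  }
  // HTTP-date values and missing headers fall through to backoff.
  // Full jitter: uniform in [0, min(cap, base * 2^attempt)).
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

console.log(retryDelayMs(0, '5'));                              // 5000
console.log(retryDelayMs(3, undefined, { random: () => 0.5 })); // 4000
```

The random source is injectable so the delay can be tested deterministically. The requeued send should reuse the subscriber's existing arm, which the hash-based assignment guarantees.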
Step 5 — Track Clicks in the Service Worker
The browser fires notificationclick when the user taps the notification. Read the variant out of the notification data, fire a tracking beacon, and focus or open the target. This client event is what populates the numerator of CTR.
// sw.js: attribute the click back to its variant
self.addEventListener('notificationclick', (event) => {
  const d = event.notification.data || {};
  event.notification.close();
  event.waitUntil((async () => {
    // Start the beacon without awaiting it first, so a slow network
    // does not delay focusing or opening the window
    const beacon = fetch('/t/click', {
      method: 'POST',
      keepalive: true,
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ campaignId: d.campaignId, variant: d.variant, ts: Date.now() })
    }).catch(() => {});
    const url = d.url || '/';
    const all = await clients.matchAll({ type: 'window', includeUncontrolled: true });
    const hit = all.find((c) => c.url.includes(url));
    // Keep the worker alive until both the navigation and the beacon settle
    await Promise.all([beacon, hit ? hit.focus() : clients.openWindow(url)]);
  })());
});
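The click handler can only read the variant if the push handler copied it into the notification's data when the notification was shown. A minimal sketch of that handler follows. The toNotification helper, the default title, and the /t/display endpoint are assumptions added for illustration. The guard around the listener exists only so the mapping function can also be loaded in Node for unit tests.

```javascript
// sw.js (sketch): copy attribution fields from the push payload into notification data.

// Pure helper so the mapping can be tested outside a service worker.
function toNotification(payload) {
  const { title = 'Notification', url = '/', campaignId = null, variant = null } = payload || {};
  return {
    title,
    options: { data: { url, campaignId, variant } }, // read later by notificationclick
  };
}

// Register the handler only when running inside a service worker.
if (typeof ServiceWorkerGlobalScope !== 'undefined' && self instanceof ServiceWorkerGlobalScope) {
  self.addEventListener('push', (event) => {
    let payload = {};
    try {
      payload = event.data ? event.data.json() : {};
    } catch (_) {
      // Malformed payload: fall back to defaults instead of dropping the notification.
    }
    const { title, options } = toNotification(payload);
    event.waitUntil((async () => {
      await self.registration.showNotification(title, options);
      // Hypothetical display beacon, mirroring the /t/click endpoint.
      await fetch('/t/display', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          campaignId: options.data.campaignId,
          variant: options.data.variant,
          ts: Date.now(),
        }),
      }).catch(() => {});
    })());
  });
}
```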
Step 6 — Measure Lift and Significance
Per arm, CTR is clicks / accepted. Lift is the difference between variant and control CTR. Then run a two-proportion z-test: if the resulting p-value is below your alpha, the difference is statistically significant. The exact CTR counting logic — and why you must use accepted, not sent, as the denominator — is detailed in measuring push notification click-through rate.
function twoProportionZ(cClicks, cN, vClicks, vN) {
  const p1 = cClicks / cN, p2 = vClicks / vN;
  const pPool = (cClicks + vClicks) / (cN + vN);
  const se = Math.sqrt(pPool * (1 - pPool) * (1 / cN + 1 / vN));
  const z = (p2 - p1) / se;
  // two-sided p-value via normal CDF approximation
  const p = 2 * (1 - normCdf(Math.abs(z)));
  return { liftAbsolute: p2 - p1, z, pValue: p, significant: p < 0.05 };
}

function normCdf(x) {
  return (1 + erf(x / Math.SQRT2)) / 2;
}

function erf(x) {
  const t = 1 / (1 + 0.3275911 * Math.abs(x));
  const y = 1 - (((((1.061405429 * t - 1.453152027) * t) + 1.421413741) * t - 0.284496736) * t + 0.254829592) * t * Math.exp(-x * x);
  return x >= 0 ? y : -y;
}
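A p-value says whether a difference is distinguishable from noise, not how large it plausibly is. As a complement, the sketch below computes a 95% Wald confidence interval for the absolute lift using the unpooled standard error. The function name liftConfidenceInterval is hypothetical, and the counts are the ones shown in the sample output of the Verification section.

```javascript
// Hypothetical helper: Wald confidence interval for the absolute lift (variant - control).
function liftConfidenceInterval(cClicks, cN, vClicks, vN, z = 1.959963985) {
  const p1 = cClicks / cN;
  const p2 = vClicks / vN;
  // Unpooled standard error: each arm contributes its own variance.
  const se = Math.sqrt((p1 * (1 - p1)) / cN + (p2 * (1 - p2)) / vN);
  const lift = p2 - p1;
  return { lift, low: lift - z * se, high: lift + z * se };
}

console.log(liftConfidenceInterval(251, 4180, 309, 4203));
```

With those counts the interval runs from roughly 0.3 to 2.4 percentage points. It excludes zero, but its lower end sits well under a 1.5 point MDE, which is a useful caution against reading the point estimate as the true effect.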
Configuration Reference
| Parameter | Type | Default | Notes |
|---|---|---|---|
| experimentId | string | none | Seeds the assignment hash; keep stable for the test’s life |
| splitRatio | number | 0.5 | Fraction routed to control; rest to variant |
| mdeAbsolute | number | none | Minimum detectable effect in absolute CTR points |
| alpha | number | 0.05 | False-positive rate (two-sided) |
| power | number | 0.80 | Probability of detecting a true effect |
| baselineCtr | number | measured | Use your real CTR, never a published benchmark |
| TTL | seconds | 86400 | Push-service retention; affects delivered timing |
Verification
Before trusting any result, confirm the plumbing. Open Chrome DevTools, go to the Application panel, and dispatch a test push to your own subscription to confirm both the display and click events reach your tracking endpoint.
# Confirm arm balance and that accepted counts are close to 50/50
curl -s https://example.com/admin/experiment/title-personalization-v1/arms | jq
# Expect: { "control": { "accepted": 4180, "click": 251 },
# "variant": { "accepted": 4203, "click": 309 } }
If the two arms differ by more than a couple of percent in accepted count, your assignment or delivery is skewed — investigate before reading the lift.
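"A couple of percent" can be turned into an explicit check. One option is a sample-ratio-mismatch test: a chi-square goodness-of-fit comparison of the accepted counts against the configured split. This is a sketch under assumptions: sampleRatioMismatch is a hypothetical name, and the strict p = 0.001 threshold is a choice made here, not a value from this guide.

```javascript
// Hypothetical sample-ratio-mismatch check: chi-square goodness of fit, 1 degree of freedom.
function sampleRatioMismatch(controlN, variantN, splitRatio = 0.5, critical = 10.828) {
  const total = controlN + variantN;
  const expControl = total * splitRatio;
  const expVariant = total * (1 - splitRatio);
  const chi2 = (controlN - expControl) ** 2 / expControl +
               (variantN - expVariant) ** 2 / expVariant;
  // 10.828 is the chi-square critical value for 1 degree of freedom at p = 0.001.
  return { chi2, mismatch: chi2 > critical };
}

console.log(sampleRatioMismatch(4180, 4203));          // chi2 is about 0.06: no mismatch
console.log(sampleRatioMismatch(4000, 4400).mismatch); // true
```

A strict threshold keeps false alarms rare. When the check does fire, the imbalance is very unlikely to be chance, so it points to an assignment or delivery problem.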
Error & Edge-Case Matrix
| Condition | Cause | Fix |
|---|---|---|
| Arms badly unbalanced | Non-deterministic assignment or skewed 410 retirement | Use the hash-based assignment; retire dead endpoints evenly across arms |
| Significant result, no real lift | Peeking / stopping early (false positive) | Fix the sample size up front; do not stop the test when it first crosses significance |
| No significance after full sample | True effect smaller than the MDE, or list too small | Treat the effect as below your MDE, or rerun with a larger sample: extend the run or widen the segment |
| Clicks attributed to wrong variant | Variant missing from payload data | Always embed variant in the notification data |
| 413 Payload Too Large | Full copy shipped in payload | Send only variant id + url; keep ciphertext under 4 KB |
| Variant flips on re-send | Hash seed changed | Keep experimentId constant for the experiment’s lifetime |
Cross-Browser Notes
Chromium, Firefox, and Safari all fire notificationclick, so attribution works everywhere, but display behaviour differs. Safari on iOS and macOS enforces stricter user-gesture and background-delivery rules, so its display-to-accepted ratio is often lower — segment your analysis by browser to avoid one platform’s throttling masking a real copy win. Firefox uses its own push service with a slightly smaller effective payload ceiling, so a payload that fits Chrome may be rejected there. Confirm payload sizing against the maximum payload size limits for Chrome vs Firefox, and review broader divergence in cross-browser notification quirks.
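A cheap safeguard is to measure the payload before sending so an oversized message fails in your own code, not at the push service. The sketch below is an assumption-based illustration. assertPayloadFits is a hypothetical helper, and the default budget of 3993 bytes is the cleartext ceiling RFC 8291 gives for a 4096-octet encrypted body. Pass a smaller budget if a push service you target accepts less.

```javascript
// Hypothetical guard: reject payloads whose plaintext exceeds the byte budget.
function assertPayloadFits(payload, maxPlaintextBytes = 3993) {
  const json = JSON.stringify(payload);
  const bytes = Buffer.byteLength(json, 'utf8'); // count bytes, not characters
  if (bytes > maxPlaintextBytes) {
    throw new Error(`Push payload is ${bytes} bytes; budget is ${maxPlaintextBytes}`);
  }
  return json;
}

const body = assertPayloadFits({
  title: 'Your weekly summary is ready',
  url: '/summary',
  campaignId: 'title-personalization-v1',
  variant: 'control',
});
console.log(Buffer.byteLength(body, 'utf8') <= 3993); // true
```

Counting bytes matters for personalized or non-ASCII copy, where a string's character count understates its encoded size.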
Related
- Back to Push Notification Engagement & Campaign Optimization — the engagement loop this experiment optimizes.
- Measuring push notification click-through rate — the CTR counting primitive every A/B test depends on.
- Push personalization & segmentation — choose the cohort your experiment runs on.
- Retry logic & backoff strategies — keep arm sizes balanced when sends are rate limited.
FAQ
How long should I run a push A/B test?
Run it until each arm has reached the per-arm sample size you computed for your MDE — measured in delivered (accepted) notifications, not subscribers sent to. Calendar time matters too: run across at least a full weekly cycle so a weekday-versus-weekend effect does not bias the result. Never stop early just because the test briefly crosses significance.
Can I test more than two variants at once?
Yes, but each additional arm dilutes your sample and inflates the false-positive rate from multiple comparisons. Map subscribers to N buckets with the same hash, increase the per-arm sample size accordingly, and apply a correction (such as Bonferroni) to your alpha. For most lists, a clean two-arm test converges far faster.
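As a rough sketch of what that looks like, the same hash can be mapped onto N equal buckets and the significance threshold divided by the number of comparisons against control. The names assignArm and bonferroniAlpha and the example arm labels are hypothetical.

```javascript
const { createHash } = require('crypto');

// Hypothetical N-arm version of the hash assignment, using equal-sized buckets.
function assignArm(subscriberId, experimentId, arms) {
  const h = createHash('sha256').update(`${experimentId}:${subscriberId}`).digest();
  const r = h.readUInt32BE(0) / 0x100000000; // uniform in [0, 1)
  return arms[Math.floor(r * arms.length)];
}

// Bonferroni correction: divide alpha by the number of comparisons.
function bonferroniAlpha(alpha, comparisons) {
  return alpha / comparisons;
}

const arms = ['control', 'variantA', 'variantB'];
console.log(assignArm('sub-42', 'title-test-v2', arms));
console.log(bonferroniAlpha(0.05, arms.length - 1)); // 0.025
```

Each variant is then compared with control at the corrected alpha, and the per-arm sample size should be recomputed with the z value for that stricter threshold.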
Why use accepted rather than sent as the CTR denominator?
A notification the push service rejected (a 410 Gone endpoint, or a 413 over the 4 KB limit) was never delivered, so it cannot be clicked. Counting it in the denominator artificially depresses CTR and biases the comparison if rejections differ between arms. Use accepted (a 201 from the push service) as the denominator.
Does deterministic hashing introduce bias?
No, provided the hash is well-distributed (sha256 is) and you seed it with both the experiment id and subscriber id. The output is uniform, so the split is random, while remaining stable across re-sends. The only bias risk is uneven endpoint retirement, which you mitigate by retiring 410 Gone endpoints across both arms.