Peak-season paywall pricing per route.

The t-test reads zero. The CATE on top-quartile-route × peak season reads +9.3%.

Travel-booking apps move their entire annual revenue through ~12 weeks of peak season. A paywall A/B that "fails" on a global t-test often contains a per-route, per-season cohort where the variant earns multi-million-dollar lift. Doubly-robust evaluation surfaces that cohort on the data the team already has: no rerun, no waiting.

Worked audit

Aerial · Series-C travel-booking app · ~$35M ARR · EU + US · 1.18M trialists across 17 markets

AERIAL-2026Q1-AUDIT-001

Projected impact

+$2.1M / yr ARR

1 · What the team reported

Q4 paywall A/B with three price tiers (control $9.99 · Variant B $12.99 · Variant C $14.99). The team ran a t-test on assignment buckets and reported no significant effect (p = 0.34, n = 1.18M trialists).

The variant was shelved. The conclusion in the post-test write-up: elasticity is too uniform across markets to bother personalising.

2 · What our re-analysis found

Doubly-robust re-evaluation of the existing logged data, with per-segment CATE and CIs from 1,000 bootstrap replicates, surfaced a cell the t-test averaged away: top-quartile-popularity routes × peak season.

That cohort showed a +9.3% lift (95% CI [+4.1%, +14.5%], ESS 0.41) on Variant B. It carries 23% of trial-start volume, the slice that drove the seasonal revenue plan. The off-peak segments were flat to slightly negative, pulling the global mean down to a value indistinguishable from zero.
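The readout reports ESS as a ratio but the audit summary does not define it. One common convention, assumed here rather than confirmed by the audit, is the Kish effective sample size divided by the raw sample size: 1.0 when every unit carries equal weight, falling toward 0 as a few heavily weighted units dominate. A minimal sketch of that reading:

```python
import numpy as np

def ess_ratio(weights):
    """Kish effective sample size as a fraction of n (assumed definition)."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (len(w) * (w ** 2).sum())

# Equal weights keep the full sample; one dominant weight collapses it.
print(ess_ratio([1, 1, 1, 1]))
print(ess_ratio([1, 0, 0, 0]))
```

Under this reading, a value such as 0.41 would mean the cohort estimate behaves roughly like one built from 41% of its rows at equal weight.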

3 · Why the t-test missed it

A global t-test estimates one number: the unconditional ATE across the assignment population. When response varies systematically by a covariate the team didn't pre-register as a cut, the bigger sub-populations dominate the mean.
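The dilution can be checked with the audit's own figures. The sketch below assumes the all-trialists estimate is a volume-weighted mean of cohort lifts (the audit does not state the weighting), and backs out what the remaining 77% of volume must average for a +9.3% cohort at 23% share to sit inside a +2.7% global estimate.

```python
# Figures quoted in the audit (lifts in percent).
peak_share = 0.23    # top-quartile x peak share of trial-start volume
peak_lift = 9.3      # DR estimate for that cohort
global_lift = 2.7    # DR estimate for all trialists

# Assumption: global estimate = volume-weighted mean of cohort lifts.
rest_share = 1.0 - peak_share
rest_lift = (global_lift - peak_share * peak_lift) / rest_share

print(f"implied lift on the other {rest_share:.0%} of volume: {rest_lift:+.2f}%")
```

Under that assumption the rest of the population averages roughly +0.7%, which is why the pooled number sits so close to zero despite the strong cohort.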

DR + per-segment CATE doesn't replace the t-test; it asks a different question (what's the conditional treatment effect on the cohorts where overlap is defensible?) and answers it on the same logged data, with bootstrap CIs that reflect the actual variance.
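As an illustration of the kind of computation involved, here is a minimal sketch of a per-segment doubly-robust (AIPW) estimate with a percentile bootstrap. It is not the audit's pipeline: the data are synthetic, the segment names and effect sizes are hypothetical, the outcome model is just the arm mean within each segment, the propensity is taken as a known 0.5 from randomisation, and the bootstrap resamples scores without refitting. It also reports an absolute difference in conversion rather than the relative lift quoted in the readout.

```python
import numpy as np

def aipw_scores(y, t, e, mu1, mu0):
    """Per-unit AIPW scores; their mean estimates the treatment effect."""
    return mu1 - mu0 + t * (y - mu1) / e - (1 - t) * (y - mu0) / (1 - e)

def segment_cate(y, t, seg, e=0.5, n_boot=1000, seed=0):
    """DR estimate and percentile-bootstrap 95% CI for each segment."""
    rng = np.random.default_rng(seed)
    out = {}
    for s in np.unique(seg):
        m = seg == s
        ys, ts = y[m], t[m]
        # Outcome model: arm means within the segment (deliberately simple).
        mu1 = np.full(ys.shape, ys[ts == 1].mean())
        mu0 = np.full(ys.shape, ys[ts == 0].mean())
        scores = aipw_scores(ys, ts, e, mu1, mu0)
        # Simplification: resample the scores, do not refit the outcome model.
        boots = np.array([
            rng.choice(scores, size=scores.size).mean() for _ in range(n_boot)
        ])
        lo, hi = np.percentile(boots, [2.5, 97.5])
        out[str(s)] = (scores.mean(), lo, hi)
    return out

# Synthetic data: a hypothetical effect that exists only in one segment.
rng = np.random.default_rng(1)
n = 20_000
seg = rng.choice(np.array(["peak_top", "other"]), size=n, p=[0.23, 0.77])
t = rng.integers(0, 2, size=n)
effect = np.where(seg == "peak_top", 0.03, 0.0)
y = (rng.random(n) < 0.10 + effect * t).astype(float)

res = segment_cate(y, t, seg)
for name, (est, lo, hi) in res.items():
    print(f"{name:9s} est={est:+.4f}  95% CI [{lo:+.4f}, {hi:+.4f}]")
```

With an outcome model this simple, the DR estimate reduces to the within-segment difference in means; a richer outcome model or estimated propensities would change the scores but not the structure.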

4 · What we'd recommend

Ship Variant B only to top-quartile-popularity routes during peak season, and keep control pricing everywhere else. Estimated annualised lift: +$2.1M ARR on the same paid-acquisition spend.

The right operational form is a contextual-bandit policy keyed on (route_popularity_decile, week_of_year, market). Cold-start with the CATE estimate as a prior; the bandit refines online.
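One way such a policy could look is Gaussian Thompson sampling with one belief per (context, arm), where the prior mean for Variant B comes from the CATE estimate. Everything below is a hypothetical sketch: the class names, the peak-week range, the decile cut-off standing in for "top quartile", the prior precision, and the reward scale (reward relative to a control baseline) are assumptions, not part of the audit.

```python
import random
from dataclasses import dataclass

@dataclass
class Belief:
    mean: float       # posterior mean of expected reward
    precision: float  # 1 / posterior variance

    def sample(self, rng):
        return rng.gauss(self.mean, self.precision ** -0.5)

    def update(self, reward, noise_var=1.0):
        # Conjugate normal update with known observation noise.
        obs_prec = 1.0 / noise_var
        new_prec = self.precision + obs_prec
        self.mean = (self.precision * self.mean + obs_prec * reward) / new_prec
        self.precision = new_prec

PEAK_WEEKS = range(24, 36)  # hypothetical 12-week peak window

def hypothetical_cate_prior(context):
    """Prior lift of variant_b over control for a context key."""
    decile, week, _market = context
    # Deciles 8-10 are a rough stand-in for the top-quartile cohort.
    return 0.093 if decile >= 8 and week in PEAK_WEEKS else 0.0

class PaywallBandit:
    ARMS = ("control", "variant_b")

    def __init__(self, cate_prior, prior_precision=400.0, seed=0):
        self.cate_prior = cate_prior
        self.prior_precision = prior_precision
        self.rng = random.Random(seed)
        self.beliefs = {}

    def _belief(self, context, arm):
        key = (context, arm)
        if key not in self.beliefs:
            # Cold start: control is centred at 0, variant_b at the CATE prior.
            prior = self.cate_prior(context) if arm == "variant_b" else 0.0
            self.beliefs[key] = Belief(prior, self.prior_precision)
        return self.beliefs[key]

    def choose(self, context):
        draws = {a: self._belief(context, a).sample(self.rng) for a in self.ARMS}
        return max(draws, key=draws.get)

    def update(self, context, arm, reward):
        self._belief(context, arm).update(reward)

bandit = PaywallBandit(hypothetical_cate_prior)
ctx = (9, 28, "DE")  # (route_popularity_decile, week_of_year, market)
arm = bandit.choose(ctx)
bandit.update(ctx, arm, reward=0.1)
```

The prior precision controls how quickly online data can override the offline estimate: a high value trusts the CATE longer, a low value lets the bandit explore sooner.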

Doubly-robust readout · Variant B vs Control · bootstrap 1,000 reps

Cohort	DR estimate	95% CI	ESS	Verdict
All trialists	+2.7%	[−0.4, +5.8]	0.61	inconclusive, average hides the segments
Top-quartile route	+1.9%	[−1.2, +5.0]	0.55	inconclusive
Top-quartile × peak season	+9.3%	[+4.1, +14.5]	0.41	significant lift
Bottom-quartile × peak	−4.1%	[−7.8, −0.5]	0.39	over-priced for this cohort
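The verdict column appears to follow from whether each 95% CI excludes zero. A small sketch of that reading rule, applied to the intervals in the table (the function name and verdict labels are illustrative, not the audit's wording):

```python
def verdict(ci_low, ci_high):
    """Classify a lift estimate by whether its CI excludes zero."""
    if ci_low > 0:
        return "significant lift"
    if ci_high < 0:
        return "significant drop"
    return "inconclusive"

# 95% CIs from the readout, in percent.
readout = {
    "All trialists": (-0.4, 5.8),
    "Top-quartile route": (-1.2, 5.0),
    "Top-quartile x peak season": (4.1, 14.5),
    "Bottom-quartile x peak": (-7.8, -0.5),
}

for cohort, (lo, hi) in readout.items():
    print(f"{cohort:28s} {verdict(lo, hi)}")
```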

Read the full audit, then audit your own test.

We'll send back an audit in the same shape for your last A/B test, free, within three business days.
