Use cases · Travel & mobility
Peak-season paywall pricing per route.
The global t-test reads no significant effect. The CATE on top-quartile routes × peak season reads +9.3%.
Travel-booking apps move the bulk of their annual revenue through ~12 weeks of peak season. A paywall A/B that 'fails' on a global t-test can still contain a per-route, per-season cohort where the variant earns a multi-million-dollar lift. Doubly-robust evaluation surfaces that cohort on the data the team already has, with no rerun and no waiting.
Worked audit
Aerial · Series-C travel-booking app · ~$35M ARR · EU + US · 1.18M trialists across 17 markets
AERIAL-2026Q1-AUDIT-001
Projected impact
+$2.1M / yr ARR
1 · What the team reported
Q4 paywall A/B with three price tiers (control $9.99 · Variant B $12.99 · Variant C $14.99). The team ran a t-test on assignment buckets and reported no significant effect (p = 0.34, n = 1.18M trialists).
The variant was shelved. The conclusion in the post-test write-up: elasticity is too uniform across markets to bother personalising.
2 · What our re-analysis found
Doubly-robust re-evaluation of the existing logged data, with 1,000-replicate bootstrap CIs and per-segment CATE, surfaced a cell the t-test averaged away: top-quartile-popularity routes × peak season.
That cohort showed a +9.3% lift (95% CI [+4.1, +14.5], ESS 0.41) on Variant B. It carries 23% of trial-start volume, the slice that drove the seasonal revenue plan. The off-peak segments were flat to slightly negative, dragging the global mean down until it was indistinguishable from zero.
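As a rough illustration of what a per-segment doubly-robust readout involves, here is a minimal sketch of an AIPW (augmented inverse-propensity-weighted) estimator with a percentile bootstrap. It is not the audit's actual pipeline. The function names, and the inputs `e` (propensity of receiving the variant) and `mu0` / `mu1` (outcome-model predictions under control and variant), are assumptions made for the example. The nuisance models are held fixed across bootstrap replicates, which is a simplification.

```python
import random
from statistics import fmean

def aipw_scores(y, t, e, mu0, mu1):
    """Per-row AIPW scores; their mean is the doubly-robust effect estimate."""
    return [
        # outcome-model difference plus propensity-weighted residual corrections
        m1 - m0 + ti * (yi - m1) / ei - (1 - ti) * (yi - m0) / (1 - ei)
        for yi, ti, ei, m0, m1 in zip(y, t, e, mu0, mu1)
    ]

def bootstrap_ci(scores, reps=1000, alpha=0.05, seed=0):
    """Approximate percentile bootstrap CI for the mean of the scores."""
    rng = random.Random(seed)
    n = len(scores)
    # resample rows with replacement and recompute the mean each time
    means = sorted(fmean(rng.choices(scores, k=n)) for _ in range(reps))
    lo = means[int((alpha / 2) * reps)]
    hi = means[min(reps - 1, int((1 - alpha / 2) * reps))]
    return lo, hi

def segment_cate(y, t, e, mu0, mu1, segment, reps=1000):
    """Return {segment: (estimate, ci_low, ci_high)} from logged rows."""
    by_segment = {}
    for s, psi in zip(segment, aipw_scores(y, t, e, mu0, mu1)):
        by_segment.setdefault(s, []).append(psi)
    return {
        s: (fmean(v), *bootstrap_ci(v, reps=reps))
        for s, v in by_segment.items()
    }
```

Each row would carry a segment label such as a hypothetical `"top_quartile_peak"`. A production analysis would typically refit the nuisance models inside each replicate, or use cross-fitting.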
3 · Why the t-test missed it
A global t-test estimates one number: the unconditional ATE across the assignment population. When response varies systematically by a covariate the team didn't pre-register as a cut, the bigger sub-populations dominate the mean, and a strong effect in a smaller cohort is diluted toward zero.
DR + per-segment CATE doesn't replace the t-test; it asks a different question (what's the conditional treatment effect on the cohorts where overlap is defensible?) and answers it on the same logged data, with bootstrap CIs that reflect the actual variance.
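The averaging argument can be written in one line. The notation is ours, not the audit's: $\pi_s$ is the share of the assignment population in segment $s$ and $\tau_s$ is that segment's conditional treatment effect, treated here as additive on a common scale.

```latex
\tau_{\mathrm{ATE}} \;=\; \sum_{s} \pi_s \, \tau_s ,
\qquad \sum_{s} \pi_s = 1 .
```

On the figures above, the top-quartile × peak cohort has $\pi \approx 0.23$ and $\tau \approx 9.3\%$, so it contributes about $0.23 \times 9.3 \approx 2.1$ percentage points to the global mean. The other 77% of volume sets the rest, so a flat or negative response there can pull the single global number back toward zero.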
4 · What we'd recommend
Ship Variant B only to top-quartile-popularity routes during peak season and keep the control price everywhere else. Estimated annualised lift: +$2.1M ARR on the same paid-acquisition spend.
The right operational form is a contextual-bandit policy keyed on (route_popularity_decile, week_of_year, market). Cold-start with the CATE estimate as a prior; the bandit refines online.
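One way such a policy could look is Gaussian Thompson sampling with an independent posterior per context and arm. This is a sketch under stated assumptions, not a recommended production design. The class names, the `baseline` reward, the variance settings, and the `cate_prior` mapping are all hypothetical. In particular, the peak-week window and the use of decile ≥ 8 as a stand-in for "top quartile" are illustrative only.

```python
import random
from dataclasses import dataclass

@dataclass
class ArmPosterior:
    mean: float  # posterior mean of the per-trialist reward
    var: float   # posterior variance of that mean

    def sample(self, rng):
        return rng.gauss(self.mean, self.var ** 0.5)

    def update(self, reward, noise_var):
        # conjugate Normal-Normal update with known observation noise
        precision = 1.0 / self.var + 1.0 / noise_var
        self.mean = (self.mean / self.var + reward / noise_var) / precision
        self.var = 1.0 / precision

class PricingBandit:
    def __init__(self, baseline, prior_lift, prior_var=1.0, noise_var=1.0, seed=0):
        self.baseline = baseline      # assumed reward under the control price
        self.prior_lift = prior_lift  # callable: context -> relative lift prior
        self.prior_var = prior_var
        self.noise_var = noise_var
        self.rng = random.Random(seed)
        self.posteriors = {}          # context -> {arm name: ArmPosterior}

    def _arms(self, context):
        if context not in self.posteriors:
            lift = self.prior_lift(context)  # cold start from the offline CATE
            self.posteriors[context] = {
                "control": ArmPosterior(self.baseline, self.prior_var),
                "variant_b": ArmPosterior(self.baseline * (1.0 + lift), self.prior_var),
            }
        return self.posteriors[context]

    def choose(self, context):
        # Thompson sampling: draw once per arm, play the arm with the best draw
        draws = {arm: post.sample(self.rng) for arm, post in self._arms(context).items()}
        return max(draws, key=draws.get)

    def record(self, context, arm, reward):
        self._arms(context)[arm].update(reward, self.noise_var)

def cate_prior(context):
    """Hypothetical mapping from the audited cohort onto the bandit's context key."""
    decile, week, _market = context
    in_peak = 24 <= week <= 35  # illustrative peak window, not from the audit
    return 0.093 if decile >= 8 and in_peak else 0.0

bandit = PricingBandit(baseline=1.0, prior_lift=cate_prior)
context = (9, 30, "DE")  # (route_popularity_decile, week_of_year, market)
arm = bandit.choose(context)
bandit.record(context, arm, reward=1.2)
```

Keeping a separate posterior per exact context is the simplest form, but it learns slowly when contexts are sparse. A pooled or hierarchical model over the three keys would be a natural refinement.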
Doubly-robust readout · Variant B vs Control · bootstrap 1,000 reps
| Cohort | DR estimate | 95% CI | ESS | Verdict |
|---|---|---|---|---|
| All trialists | +2.7% | [−0.4, +5.8] | 0.61 | inconclusive, average hides the segments |
| Top-quartile route | +1.9% | [−1.2, +5.0] | 0.55 | inconclusive |
| Top-quartile × peak season | +9.3% | [+4.1, +14.5] | 0.41 | significant lift |
| Bottom-quartile × peak | −4.1% | [−7.8, −0.5] | 0.39 | over-priced for this cohort |
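The page does not define the ESS column. One common reading is Kish's effective sample size of the inverse-propensity weights, expressed as a fraction of the cohort's row count. Values near 1 mean the weights are nearly uniform, and low values mean a few rows carry most of the estimate. The sketch below assumes that reading. The function names are ours.

```python
def ipw_weights(t, e):
    """Inverse-propensity weight per row: 1/e if treated, 1/(1-e) if control."""
    return [ti / ei + (1 - ti) / (1 - ei) for ti, ei in zip(t, e)]

def ess_ratio(weights):
    """Kish effective sample size divided by n: (sum w)^2 / (n * sum w^2)."""
    total = sum(weights)
    return total * total / (len(weights) * sum(w * w for w in weights))
```

Under this reading, a cohort with ESS 0.41 behaves statistically like a uniformly weighted sample about 41% of its size, which is one reason its confidence interval is wider than raw row counts would suggest.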
Read the full audit, then audit your own test.
We'll send back a readout in the same shape for your last A/B test, free, in three business days.