Our Approach to Analyzing A/B Test Results
Learn how we at Datacop approach A/B testing and which features we include in our A/B test reporting to ensure we understand the results as thoroughly as possible.
About the Author:
Co-founder of Datacop, a digital agency that fulfils marketing operations roles for large eCommerce companies such as OluKai, Melin, Roark, Visual Comfort and Company, Dedoles and others.
A/B testing has become a common and necessary part of every e-commerce web manager’s toolkit. It provides a way to test incremental changes to the website, the placement of major components, versions of copy, recommendation models, personalizations, and more. However, while more and more A/B tests are taking place, not everyone is getting the full value from them that they could.
In this article we focus on a successful website A/B test use case and use it to demonstrate how we go about analyzing A/B tests at Datacop.
A/B Test - Low In Stock Alert
This was a three-way A/B test at a fashion e-commerce store. It tested a responsive “call-out” message alerting the customer that the product they are interested in is running low on stock. The message would trigger when the customer selected the size variant of the product.
(34%) Control Group: when the size variant’s inventory count is below the threshold, nothing is displayed.
(33%) Variant A: when the size variant’s inventory count is below the threshold, “Low Inventory Order soon” is displayed above the color variants of the product.
(33%) Variant B: when the size variant’s inventory count is below the threshold, “Low Inventory Order soon” is displayed below the “add to bag” button.
The only difference between the variants is the placement of the call-out on the product page. To test different call-out copy, we would recommend running a second A/B test once the first one proves successful. If we tested both the text variation and the placement variation at the same time, we wouldn’t be able to assess whether it is the placement or the text that performed well.
Alternative copy could be “This product is running low on stock. Order Soon.”, or the call-out could even use the inventory quantity itself when it falls below an even lower secondary threshold: “Order now, there are only X left”.
The threshold for what counts as low in stock can be controlled within the tag in Bloomreach Engagement itself.
How was it implemented?
To implement this, access to inventory information at the size-variant level of the products is required. In this case we used the front-end JavaScript object that surfaces the inventory count on the product page.
The tag checks the inventory count for the product every time the user selects a size. Only when the inventory count is below the threshold defined within the tag does the low-inventory alert trigger. During the A/B test, only the customers assigned to Variant A or Variant B actually had the alert displayed when they selected a low-stock size.
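Below is a minimal sketch of what such tag logic could look like. The object and helper names (`productVariants`, `getAbTestVariant`, the CSS selectors) and the threshold value are hypothetical placeholders, not the store’s actual data layer or Bloomreach Engagement’s API:

```javascript
// Minimal sketch of the low-in-stock tag logic; all names are illustrative placeholders.
var LOW_STOCK_THRESHOLD = 5; // hypothetical value; in practice controlled within the tag

function onSizeSelected(sizeVariant) {
  // Read the inventory count surfaced by the (assumed) front-end object for this size.
  var inventoryCount = window.productVariants[sizeVariant].inventoryCount;

  // Only trigger the alert when the count is below the defined threshold.
  if (inventoryCount >= LOW_STOCK_THRESHOLD) {
    return;
  }

  // Placement depends on the assigned A/B test variant:
  // Variant A -> above the color variants, Variant B -> below the "add to bag" button.
  var variant = getAbTestVariant(); // assumed helper returning 'control' | 'A' | 'B'
  if (variant === 'control') {
    return; // the control group sees nothing
  }

  var alertEl = document.createElement('div');
  alertEl.className = 'low-stock-alert';
  alertEl.textContent = 'Low Inventory Order soon';

  var anchor = variant === 'A'
    ? document.querySelector('.color-variants')
    : document.querySelector('.add-to-bag');
  if (anchor) {
    anchor.insertAdjacentElement(variant === 'A' ? 'beforebegin' : 'afterend', alertEl);
  }
}
```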
A/B Test Results
Often during A/B test evaluations, only order count and revenue metrics are considered. We always like to include top-of-the-funnel metrics such as view_item events and add-to-carts as well, because sometimes an A/B test may not have a direct impact on orders or revenue but still successfully converts more users to adding products to their cart or browsing more products. That can be a positive outcome for a test in itself. Additionally, depending on the type of use case, we add specific metrics of interest such as AOV, UPT, refunds or bounce rate.
Consider the test results for the Low In Stock Alert. Both Variant A and Variant B experienced an uplift, and the conversion rate to add-to-carts and purchases looks almost the same for both variants:
Variant A: +155 orders and +$27k
Variant B: +164 orders and +$39k
The uplifts to purchases are also statistically significant at the 99% level for both variants. This tells us that the intervention has been successful and the low inventory alert should be turned on for 100% of traffic on the web. To calculate A/B test significance we use this calculator: https://abtestguide.com/calc/
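For readers who prefer to see the math, here is a minimal sketch of the two-proportion z-test that calculators of this kind are built around. The visitor and conversion counts are made-up illustration values, not the numbers from this test:

```javascript
// Minimal sketch of a two-proportion z-test for conversion-rate uplift.
function zTestTwoProportions(visitorsA, conversionsA, visitorsB, conversionsB) {
  var pA = conversionsA / visitorsA;
  var pB = conversionsB / visitorsB;
  // Pooled conversion rate under the null hypothesis of no difference between variants.
  var pPooled = (conversionsA + conversionsB) / (visitorsA + visitorsB);
  var se = Math.sqrt(pPooled * (1 - pPooled) * (1 / visitorsA + 1 / visitorsB));
  return (pB - pA) / se; // z-score of the observed difference
}

// Illustration values only: 50,000 visitors per variant, 1,500 vs 1,660 purchases.
var z = zTestTwoProportions(50000, 1500, 50000, 1660);
// |z| > 2.576 corresponds to significance at the 99% level (two-sided test).
console.log(z.toFixed(2), Math.abs(z) > 2.576 ? 'significant at 99%' : 'not significant');
```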
Impact of Outliers
Outliers can change how you interpret the results of your A/B test. Our approach is to always compare the overall results with and without outliers; that tends to provide a valuable perspective on the results.
We identify outliers as transactions whose total_price is more than 3 standard deviations above or below the average. In practice, imagine a shop with an average transaction value of around $100; the “below” threshold might then be $25 and the “above” threshold $500.
The risk with including orders above $500 in this example is that the revenue comparison between the two variants may be skewed towards whichever variant happened to receive more outliers.
The risk with including orders below $25 is that they will skew the conversion-rate-to-purchase comparison. This is especially important to consider if your data includes orders with a price of $0 (exchanges, warranties, giveaways, etc.).
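As an illustration, here is a minimal sketch of that rule, assuming a simple array of order records with a hypothetical `total_price` field:

```javascript
// Minimal sketch of the outlier rule: drop transactions whose total_price is
// more than 3 standard deviations above or below the average.
function mean(values) {
  return values.reduce(function (sum, v) { return sum + v; }, 0) / values.length;
}

function standardDeviation(values) {
  var avg = mean(values);
  var variance = mean(values.map(function (v) { return (v - avg) * (v - avg); }));
  return Math.sqrt(variance);
}

function removeOutliers(orders) {
  var prices = orders.map(function (o) { return o.total_price; });
  var avg = mean(prices);
  var sd = standardDeviation(prices);
  var lower = avg - 3 * sd; // the "below" threshold
  var upper = avg + 3 * sd; // the "above" threshold
  return orders.filter(function (o) {
    return o.total_price >= lower && o.total_price <= upper;
  });
}

// Usage against an (assumed) array of order records:
// var cleaned = removeOutliers([{ total_price: 95 }, { total_price: 1200 } /* ... */]);
```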
Compare the results below, with outliers filtered out, to the results above.
Excluding these outliers removed 70 orders worth $39k from the results. A lot of that outlier revenue happened to fall disproportionately into Variant B. Once the results are cleaned, we can see that both variants actually had a very similar uplift; the placement of the Low Inventory Alert did not matter.
Segmenting A/B Test Results
Typically, A/B test results are evaluated at the “aggregated” level, where all traffic and customer segments are considered together. Based on an evaluation like the one above, we might conclude that, overall, Variant B is the superior version of the product page.
However, in reality the various segments contained within the overall traffic may have reacted differently to the change. Every e-commerce business has a few main dimensions into which A/B test results can be split, such as the following (a short sketch of this kind of split follows the list):
New vs. returning customers - are existing customers reacting to the change differently?
Device type of the session - are all devices experiencing an uplift? The browsing experiences on desktop and mobile in particular are quite different and could produce different results.
Traffic source of the session - do customers who arrive via top-of-the-funnel channels (e.g. paid Facebook) react differently?
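Here is the sketch referenced above: a minimal illustration of computing conversion rate per segment and variant. The session records and their field names are hypothetical, not the raw event schema:

```javascript
// Hypothetical session-level records for illustration only.
var sessions = [
  { variant: 'A', isReturningCustomer: true, purchased: true },
  { variant: 'A', isReturningCustomer: false, purchased: false },
  { variant: 'B', isReturningCustomer: true, purchased: true },
  { variant: 'control', isReturningCustomer: false, purchased: false }
];

// Group sessions by segment and variant, then report conversion rate per group.
function conversionBySegment(records, segmentOf) {
  var stats = {};
  records.forEach(function (s) {
    var key = segmentOf(s) + ' / ' + s.variant;
    stats[key] = stats[key] || { sessions: 0, purchases: 0 };
    stats[key].sessions += 1;
    if (s.purchased) stats[key].purchases += 1;
  });
  Object.keys(stats).forEach(function (key) {
    var g = stats[key];
    console.log(key + ': ' + (100 * g.purchases / g.sessions).toFixed(1) + '% CR (' + g.sessions + ' sessions)');
  });
}

// Split by new vs. returning customers; the same function works for device type or traffic source.
conversionBySegment(sessions, function (s) {
  return s.isReturningCustomer ? 'Returning' : 'New';
});
```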
When we split the results of the A/B test above by segment (after outliers are filtered out), we find that 81% of the uplift is concentrated among purchasers - customers who had already purchased at least once before starting the session in which they were assigned to the A/B test. The same is true for the uplift in add-to-carts.
This kind of split changes the perspective on the results. Yes, there has been an uplift, but that is because the intervention had a particularly strong effect on one specific segment within the total population. The softer uplift among visitors who had not yet purchased is not statistically significant. What looked like a significant uplift in conversion across all traffic is really a significant uplift within the (smaller) returning-customer segment.
To achieve this kind of advanced A/B test reporting, we use the Bloomreach Engagement BigQuery module, transform the raw data, and visualise it in Tableau, our tool of choice for data visualisation.
Implementing Results by Segments
Platforms like Bloomreach Engagement are able to “read” all of a user’s past interactions with the e-shop (website behaviour, transactions, emails, etc.) in real time as the page is loading. As a result, it is possible to turn the Low In Stock Alert on at 100% only for purchasers, which means we can capitalize on the uplift generated by this segment.
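As a rough sketch, gating the alert by segment could look like the snippet below. `getCustomerProfile` and its `hasPurchased` flag are hypothetical placeholders for whatever customer-level attribute the platform exposes to the tag, not Bloomreach Engagement’s actual API:

```javascript
// Minimal sketch of gating the low-in-stock alert on a customer-level attribute.
function shouldShowLowStockAlert(inventoryCount, threshold) {
  var customer = getCustomerProfile(); // assumed to be resolved as the page loads
  // Roll out at 100% for purchasers only; other visitors keep the default experience.
  if (!customer.hasPurchased) {
    return false;
  }
  return inventoryCount < threshold;
}
```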
This would leave “space” for a different A/B test idea for the new-visitor segment - customers who haven’t purchased yet. To them, the alert does not seem to matter as much. Instead, they could receive an alert based on different properties of the product variant - for example a high sales velocity (“Selling Fast”) or a low refund rate (“Most Kept”). This is how we at Datacop go about identifying follow-up opportunities for tests once a successful intervention is “complete”.
If you found this post valuable…
We hope you found value in this article. If you did, we'd appreciate it if you subscribed (at no cost!) to stay updated with our latest publications.
If you’d like to learn what types of A/B tests we would run for your eCommerce company, feel free to schedule a meeting with us below: