Do You Still Need a Data Warehouse if You Have a CDP/CDXP in Place?

Discover the Key Differences Between the Roles of a Data Warehouse and a CDP/CDXP.

Oct 22, 2024

About the Author:
Co-founder of Datacop, agency that fulfils marketing operation roles in large eCommerce companies such as OluKai, Melin, Roark, Visual Comfort and Company, Dedoles and others.

Recently, we published an article detailing how we would design a marketing tech stack for eCommerce companies with annualized revenues between $25M and $2B.

This Is How We Would Design a Marketing Techstack for $25m - 2b eCommerce

Lukas Bihary

June 6, 2024

Read full story

In this follow-up article, we will dive deeper into the role of the data warehouse, which was a key component of that tech stack.

At Datacop, we’ve built three data warehouses from the ground up for OluKai, Melin, and Roark. We continue to maintain and enhance these data warehouses to this day. This is one of our core areas of expertise.

Drawing from this experience, we’ll explain why having a data warehouse is crucial, even if you already have a CDP in place. Additionally, we’ll highlight the most common use cases that drive the development of a dedicated data warehouse.

We will also walk you through the architecture of our data warehouse solution, explaining the function of each tool we utilized in its design.

Why We Advice to Have a Data Warehouse Alongside a CDP / CDXP

There are four primary reasons for choosing to build our data warehouse alongside a CDP or CDXP:

1. CDPs or CDXPs Are Optimized for Customer-Centric Data

We believe that CDPs (or CDXPs) and data warehouses serve distinct purposes in an eCommerce ecosystem.

From our perspective, a data warehouse functions as the central repository for all eCommerce data—ideally without a data retention policy. This repository supports core eCommerce reporting, typically visualized through BI tools like Tableau.

On the other hand, CDPs (or CDXPs) are optimized for customer-related data, such as on-site behavior and omnichannel interactions, captured as events. Their primary use is in driving personalization and customer-focused, ad-hoc reporting.

When it comes to analyzing non-customer-related data, such as ad spend across paid channels, inventory, or product margins, this is typically done outside of a CDP. These platforms aren't designed to handle non-customer data as effectively.

2. Price

CDP databases are generally much faster for data retrieval compared to data warehouses, primarily because they need to provide near real-time data calculation, especially for website personalization. For instance, if a website visitor takes a specific action that triggers a personalized element, the system must respond as quickly as possible.

This ability to deliver real-time insights makes CDP (or CDXP) databases more expensive than data warehouses. The speed and responsiveness required for personalization come at a higher cost compared to the data storage used in a data warehouse.

When it comes to reporting, you typically don’t need a database as fast as a CDP. That’s why it makes sense to store data used primarily for analytical purposes in a data warehouse, which is more cost-effective for long-term storage.

Because of this, most CDPs have data retention policies in place, meaning they only keep the customer data needed for personalization within a specific lookback period. For example, view_category events might expire after three months if older data isn't useful for personalization. However, if you want to perform a year-over-year analysis of product categories browsed by customers, you wouldn’t be able to do this directly in a CDP, as the data from a year ago would have already been deleted.

In contrast, this data would still be available in the data warehouse, enabling you to conduct longer-term analyses as needed.

Typically, the data volume in a CDP (or CDXP) is much lower compared to a data warehouse due to the higher cost associated with storing real-time, actionable data.

3. Flexibility

In general, working with data in a CDP or CDXP is far less flexible compared to a data warehouse. To be fair, Bloomreach Engagement, as a CDXP, performs quite well in terms of data flexibility.

Unfortunately, the same cannot be said for many other standalone CDP solutions we’ve tested. Some of these systems don't even allow you to delete data, meaning that any tracking errors or duplicate data will remain in the database once they are recorded. This limitation makes it incredibly difficult to maintain a clean and accurate database.

In contrast, a data warehouse offers complete flexibility—you can make adjustments as needed (including data deletion), structure the data in a way that best suits your use case, and even easily modify historical data. This is a stark difference compared to some CDP solutions we've worked with.

Again, I want to emphasize that Bloomreach Engagement stands out as one of the most flexible tools with a built-in CDP, avoiding many of the limitations we’ve encountered in other CDPs.

4. Unlocking Additional Use Cases

Adopting a data warehouse unlocks the potential to deploy new use cases. For example:

Daily Giveaway Automation:
Previously, one of our clients manually selected a winner for their daily giveaway and sent a congratulatory email. With BigQuery, we fully automated this process. Now, a winner is selected from eligible new email subscribers directly in BigQuery, and a “winner” event is sent back to Bloomreach Engagement for the chosen customer profile, which automatically triggers the congratulatory email.
Main eCommerce Reporting:
We believe that the data warehouse should serve as the primary data source for eCommerce reporting. The reasons why this reporting typically doesn’t reside directly in the CDP include:
1. CDPs are optimized for customer-centric data, not for other data types.
2. CDPs often have data retention policies in place, which limits long-term data storage.
3. Storing data for analytical purposes is more cost-effective in a data warehouse.
4. CDPs are generally less flexible and offer fewer analytical options compared to a data warehouse paired with a BI tool.

Regarding the types of reports we typically build on top of the data warehouse, we recently published an article that covers the key eCommerce metrics we recommend analyzing.

4 Critical Objects Worth Analyzing in eCommerce (Introduction to Data Analytics)

Lukas Bihary

September 16, 2024

Read full story

For a clear summary of the differences between a CDP/CDXP and a data warehouse, feel free to refer to the following visualization:

Designing a (BigQuery) Data Warehouse

After outlining the reasons why we believe it's important to build a data warehouse alongside a CDP or CDXP, we will now explain how we designed the data warehouses we’ve built.

These were developed in conjunction with Bloomreach Engagement CDXP, but a similar design can be applied to other marketing automation tools. In the relevant sections below, we’ll highlight any key design differences.

BigQuery Data Warehouse Structure

Below is a diagram illustrating the structure of the data warehouse we developed. We will walk you through each element in the diagram and explain its role.

What is Engagement BigQuery?

Engagement BigQuery, formerly known as Exponea BigQuery or Long-Term Data Storage, serves as a comprehensive repository for all events and customer profiles that have been, or are currently, stored within the Bloomreach Engagement platform.

This BigQuery project is owned by Bloomreach, and it does not grant editing privileges, meaning data cannot be deleted or structurally modified.

As a result, Engagement BigQuery functions more as a data source than a true data warehouse due to the inability to make structural changes or aggregate data for improved performance.

One notable characteristic of this dataset is the potential for duplicates, which need to be addressed before conducting analyses. For example, if an event is uploaded to Bloomreach Engagement five times and four of those uploads are later deleted, Engagement BigQuery will still retain records of all five uploads.

The core function of Engagement BigQuery project is to act as a robust data source storing website behavior and omni-channel engagement data.

Events are appended to a dedicated table every six hours, while the customer table is rewritten on a daily basis.

For more detailed information, please refer to the Bloomreach Engagement documentation at: https://documentation.bloomreach.com/engagement/docs/ebq

Note: If you're using other marketing automation tools instead of Bloomreach Engagement—such as Klaviyo for email, Segment as a CDP, or Optimizely for website personalization and A/B testing—these tools would function as data sources in a similar way to Engagement BigQuery projects.

Below Bloomreach Engagement icon, you can view other data sources that we typically incorporate, such as ad spend data from various platforms and data from company Google Sheets, which predominantly includes revenue forecasts.

For some use-cases, such as real-time Slack notifications, we also pull real-time data from Bloomreach Engagement. This is necessary because the Engagement BigQuery module is updated only every six hours, which may not be frequent enough for certain applications.

To facilitate the transfer of this data into (and outside) our data warehouse, we utilize Cloud Functions, which are also part of the Google Cloud Platform. Cloud Functions offer tremendous flexibility for our use cases and have proven to be the most cost-effective solution we’ve found.

In the top right corner of the diagram, you'll find various developer tools that we employ to build and maintain our data warehouse setup.

Dataform has been a significant enhancement for us. It has enabled us to implement version control (stored in GitHub) of our data warehouse, automate dependencies, and create a dependency graph (see the screenshot below). This automation eliminates the need to manually specify refresh times for tables through scheduled queries.

Given the extensive benefits we've realized from using Dataform, we plan to dedicate a separate article to this topic in the future.

We also maintain a development version of our data warehouse, where any significant changes or new features are published first for quality assurance before being pushed to production.

On the right side of the diagram, you can see the outputs from our data warehouse. We primarily use Tableau for data visualization, but other BI software options are equally viable depending on your needs.

We sometimes integrate data back into Bloomreach Engagement for specific use cases such as Daily Giveaway Automation, that we already mentioned above.

Additionally, we push some data back into Google Sheets, which are widely used by many users across many organizations.

Slack notifications are also often triggered based on the data stored in our data warehouse.

Future Enhancements

Looking forward, there are several areas we are excited about but haven't yet fully exploited. One significant area is the broader Google Cloud Platform (GCP), which offers a plethora of tools including pre-trained machine learning models that can be applied directly to our data warehouse. We're also exploring the direct integration of AI assistants with the data warehouse, such as ChatGPT and Google's Gemini, to further enhance our capabilities.

As we implement these tools, expect upcoming Substack articles, dedicated to this topic!

If you found this post valuable…

We hope you found this article valuable. If so, please consider subscribing (for free!) to receive updates on our latest publications.

If you'd like to explore any topics mentioned in the article further, or if there's anything else we can assist you with, feel free to schedule a meeting with us by clicking the button below:

Book a meeting