Comprehensive Guide to the Data Warehouse

Data analysis can't start until the data cleaning process is complete. Learn about the role of the data warehouse as a repository of analysis-ready datasets.

Author:  Nicole Janeway  |  Post Date:  Nov 20, 2021  |  Last Update:  Apr 20, 2023

Hunting for clean data in the enterprise setting. Photo by Hu Chen on Unsplash.

If you work with data, it's valuable to have some idea of fundamental data warehouse concepts. Most of the work we do involves adding value on top of datasets that need to be clean and readily comprehensible. For a dataset to reach that stage of its lifecycle, it has already passed through many components of data architecture and, hopefully, many data quality filters. This is how we avoid the unfortunate situation wherein a data analyst or data science team ends up spending 80% of their time on data wrangling.

Let's take some time to deepen our appreciation of the data architecture process by learning about various considerations relevant to setting up a data warehouse.

The data warehouse is a specific infrastructure element that provides downstream users, including data analysts and data scientists, with access to data that has been shaped to conform to business rules and is stored in an easy-to-query format.

The data warehouse typically connects information from multiple “source-of-truth” transactional databases, which may exist within individual business units. In contrast to information stored in a transactional database, the contents of a data warehouse are reformatted for speed and ease of querying.

The data must conform to specific business rules that validate quality. Then it is stored in a denormalized structure — that means storing together pieces of information that will likely be queried together. This serves to increase performance by decreasing the complexity of queries required to get data out of the warehouse (i.e., by reducing the number of data joins).

Architecting the Data Warehouse

In the process of developing the dimensional model for the data warehouse, the design will typically pass through three stages: (1) business model, which generalizes the data based on business requirements, (2) logical model, which sets the column types, and (3) physical model, which represents the actual design blueprint of the relational data warehouse.

Because the data warehouse will contain information from across all aspects of the business, stakeholders must agree in advance to the grain (i.e., level of granularity) of the data that will be stored.

Remember to validate the model with the various stakeholder groups before implementation.

A sample star schema for a hypothetical safari tours business.

The underlying structure in the data warehouse is commonly referred to as the star schema — it classifies information as either a dimension or a fact (i.e., measure). The fact table stores observations or events (e.g., sales, orders, stock balances, etc.). The dimension tables contain descriptive information about those facts (e.g., dates, locations, etc.).
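To make this concrete, here is a minimal sketch of what a star schema for the safari tours business might look like in SQL. The table and column names (FactTourSales, DimDate, DimLocation) are hypothetical, invented for this example.

-- Dimension table: descriptive attributes for each safari location (hypothetical)
CREATE TABLE DimLocation (
    LocationKey INT PRIMARY KEY,
    ParkName    VARCHAR(100),
    Region      VARCHAR(50),
    Country     VARCHAR(50)
);

-- Dimension table: one row per calendar date
CREATE TABLE DimDate (
    DateKey  INT PRIMARY KEY,  -- surrogate key in yyyymmdd form, e.g., 20230420
    FullDate DATE,
    Month    INT,
    Year     INT
);

-- Fact table: one row per tour sale, at the grain the stakeholders agreed on
CREATE TABLE FactTourSales (
    TourSaleKey INT PRIMARY KEY,
    DateKey     INT REFERENCES DimDate (DateKey),
    LocationKey INT REFERENCES DimLocation (LocationKey),
    TourDate    DATE,          -- stored here as well for convenient filtering
    TicketsSold INT,
    SaleAmount  DECIMAL(10, 2)
);

Answering "tickets sold by park" now takes a single join from the fact table to DimLocation, which is exactly the query simplicity the denormalized design is after.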

There are three different types of fact tables: (1) transaction, recording individual events at the declared grain, (2) periodic snapshot, summarizing activity over a regular time frame, and (3) accumulating snapshot, tracking records as they move through the steps of a business process.

In addition to the star schema, there's also the option to arrange data into the snowflake schema. The difference here is that each dimension is normalized.

Normalization is a database design technique for creating records that contain an atomic level of information.

However, the snowflake schema adds unnecessary complexity to the dimension model — usually the star schema will suffice.
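To illustrate the difference, snowflaking the hypothetical DimLocation dimension from the sketch above would factor the country attributes out into their own normalized table:

-- Country attributes move into a separate, normalized table
CREATE TABLE DimCountry (
    CountryKey  INT PRIMARY KEY,
    CountryName VARCHAR(50),
    Continent   VARCHAR(50)
);

-- The snowflaked dimension now references DimCountry,
-- adding one more join to any query that needs country details
CREATE TABLE DimLocationSnowflaked (
    LocationKey INT PRIMARY KEY,
    CountryKey  INT REFERENCES DimCountry (CountryKey),
    ParkName    VARCHAR(100),
    Region      VARCHAR(50)
);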

Enhancing Performance and Adjusting Size

In addition to understanding how to structure the data, the person designing the data warehouse should also be familiar with how to improve performance.

One performance-enhancing technique is to create a clustered index on the data in the order it is typically queried. For example, we might choose to organize the fact table by TourDate descending, so the tours that are coming up next appear first in the table. Setting up a clustered index reorders the way the records are physically stored, speeding retrieval. In addition to the single, optional clustered index, a table can also have multiple non-clustered indexes; these don't affect how the table is physically stored, but each maintains a separate sorted structure that points back to the underlying rows.
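In SQL Server syntax, for instance, those two kinds of index on the hypothetical FactTourSales table might look like this (note that the table's primary key would need to be declared nonclustered, since a table can have only one clustered index):

-- Physically order the fact table by upcoming tour date
CREATE CLUSTERED INDEX IX_FactTourSales_TourDate
    ON FactTourSales (TourDate DESC);

-- A non-clustered index adds another access path
-- without changing the physical row order
CREATE NONCLUSTERED INDEX IX_FactTourSales_Location
    ON FactTourSales (LocationKey);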

Another performance enhancement involves splitting up very large tables into multiple smaller parts. This is called partitioning. By splitting a large table into smaller, individual tables, queries that need access to only a fraction of the data can run faster. Partitioning can be either vertical (splitting up columns) or horizontal (splitting up rows). Here's a link where you can download a .rtf file containing a partitioning script for SQL along with other database architecture resources like a project launch and management checklist.
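As an illustration, horizontal partitioning in SQL Server uses a partition function and a partition scheme to split a table's rows by date range; the year boundaries below are hypothetical:

-- Map each row to a partition based on its year
CREATE PARTITION FUNCTION pfTourDate (DATE)
    AS RANGE RIGHT FOR VALUES ('2022-01-01', '2023-01-01');

-- Assign partitions to filegroups (all to PRIMARY here for simplicity)
CREATE PARTITION SCHEME psTourDate
    AS PARTITION pfTourDate ALL TO ([PRIMARY]);

-- Rows before 2022, during 2022, and from 2023 onward
-- now live in three separate physical partitions
CREATE TABLE FactTourSalesPartitioned (
    TourSaleKey INT,
    TourDate    DATE,
    SaleAmount  DECIMAL(10, 2)
) ON psTourDate (TourDate);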

Yes, I will snag your free resources and helpful tools. Photo by Slawek K on Unsplash.

Taking total database size into account is another crucial component of tuning performance. Estimating the size of the resulting database when designing a data warehouse helps align performance with application requirements per the service level agreement (SLA). Moreover, it provides insight into the budgeted demand for physical disk space or the cost of cloud storage.

To conduct this calculation, simply aggregate the size of each table, which depends largely on the indexes. If database size is significantly larger than expected, you may need to normalize aspects of the database. Conversely, if your database ends up smaller, you can get away with more denormalization, which will increase query performance.
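As a back-of-the-envelope example: a fact table holding 10 million rows at roughly 50 bytes per row comes to about 500 MB before indexes. In SQL Server, a built-in procedure reports the actual figures for a table such as the hypothetical FactTourSales:

-- Report row count, data size, and index size for one table
EXEC sp_spaceused 'FactTourSales';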

Related Data Storage Options

The data in a data warehouse can be reorganized into smaller databases to suit the needs of the organization. For example, a business unit might create a data mart, with information specific to their department. This read-only info source provides clarity and accessibility for business users who might be a little further from the technical details of data architecture. Here's a planning strategy to deploy when creating a data mart.

Similarly, an operational data store (ODS) can be set up for operational reporting. The Master Data Management (MDM) system stores information about unique business assets (e.g., customers, suppliers, employees, products, etc.).

Read more about the risks of overutilizing data visualization tools for business intelligence.

Extract, Transform, Load (ETL)

Extract, transform, load (ETL) defines the process of moving data out of its original location (E), applying some form of transformation (T), then loading it (L) into the data warehouse. Rather than approach the ETL pipeline in an ad hoc, piecemeal fashion, the database architect should implement a systematic approach that accounts for best practices around design considerations, operational issues, failure points, and recovery methods. See also this helpful resource for setting up an ETL pipeline.

Documentation for ETL includes creating source-to-target mapping: the set of transformation instructions on how to convert the structure and content of data in the source system to the structure and content of the target system.
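Here is a minimal sketch of the transform-and-load step for the hypothetical safari schema, with the source-to-target mapping spelled out in comments; the staging table stg_bookings and its columns are assumptions for illustration:

-- Source-to-target mapping (hypothetical):
--   stg_bookings.booking_date -> FactTourSales.TourDate (cast to DATE)
--   stg_bookings.park_name    -> FactTourSales.LocationKey (lookup in DimLocation)
--   stg_bookings.amount_usd   -> FactTourSales.SaleAmount
INSERT INTO FactTourSales (TourSaleKey, DateKey, TourDate, LocationKey, TicketsSold, SaleAmount)
SELECT
    s.booking_id,
    YEAR(s.booking_date) * 10000 + MONTH(s.booking_date) * 100 + DAY(s.booking_date),  -- yyyymmdd key
    CAST(s.booking_date AS DATE),
    d.LocationKey,                    -- conform the source's park name to the warehouse key
    s.tickets,
    s.amount_usd
FROM stg_bookings AS s
JOIN DimLocation AS d
    ON d.ParkName = s.park_name;      -- business rule: drop rows with no matching park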

Your organization might also consider ELT — loading the data without any transformations, then using the power of the destination system (usually a cloud-based tool) to conduct the transform step.

Getting Data out of the Warehouse

Once the data warehouse is set up, users should be able to easily query data out of the system. A little education might be required to help users write well-optimized queries, as in the sketch below.
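For instance, a well-formed query against the hypothetical star schema joins the fact table only to the dimensions it needs and filters on an indexed column:

-- Ticket sales by park for upcoming tours
SELECT
    d.ParkName,
    SUM(f.TicketsSold) AS TotalTickets
FROM FactTourSales AS f
JOIN DimLocation AS d
    ON d.LocationKey = f.LocationKey
WHERE f.TourDate >= '2023-05-01'  -- takes advantage of the clustered index on TourDate
GROUP BY d.ParkName
ORDER BY TotalTickets DESC;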

Data Archiving

Finally, let's talk about optimizing your organization's data archiving strategy. Archived data remains important to the organization, and it is of particular interest to data scientists looking to run regressions on historical trends.

Photo by Pixabay on Pexels.

The data architect should plan for this demand by relocating historical data that is no longer actively used into a separate storage system with higher latency but also robust search capabilities. Moving the data to a less costly storage tier is an obvious benefit of this process. The organization can also gain from removing write access from the archived data, protecting it from modification.
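Here is a sketch of that relocation in SQL, assuming a hypothetical FactTourSalesArchive table and cutoff date; in practice the archive table would live on a cheaper, higher-latency storage tier:

-- Copy tours older than the cutoff into the archive, then remove them
INSERT INTO FactTourSalesArchive
SELECT * FROM FactTourSales
WHERE TourDate < '2020-01-01';

DELETE FROM FactTourSales
WHERE TourDate < '2020-01-01';

-- Protect archived data from modification by revoking write access
-- (analyst_role is a hypothetical database role)
REVOKE INSERT, UPDATE, DELETE ON FactTourSalesArchive FROM analyst_role;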

Follow these steps and you'll end up as the king of the data jungle. Photo by Pixabay on Pexels.

Nicole Janeway Bills

Data Strategy Professionals Founder & CEO

Nicole offers a proven track record of applying Data Strategy and related disciplines to solve clients' most pressing challenges. She has worked as a Data Scientist and Project Manager for federal and commercial consulting teams. Her business experience includes natural language processing, cloud computing, statistical testing, pricing analysis, ETL processes, and web and application development.