Designing the Data Warehouse on Relational Databases
STANFORD TECHNOLOGY GROUP, INC.
|Data Content||Current values||Archival data, summarized data, calculated data|
|Data Organization||Application by application||Subject areas across enterprise|
|Nature of Data||Dynamic||Static until refreshed|
|Data Structure, Format||Complex; suitable for operational computation||Simple; suitable for business analysis|
|Access Probability||High||Moderate to low|
|Data Update||Updated on a field-by-field basis||Accessed and manipulated; no direct update|
|Usage||Highly structured repetitive processing||Highly unstructured analytical processing|
|Response Time||Sub-second to 2-3 seconds||Seconds to minutes|
Table 1 - Comparison of decision support and operational systems functionality.
Because data processing in OLTP systems is highly structured, complex data models can work well. Transactions generally involve only one or two tables at a time, and often deal with only a single record. This means that complex table relationships do not interfere a great deal with performance. In contrast, DSS processing can involve accessing hundreds of thousands of rows at a time. In such cases, complex joins can seriously compromise performance.
Figure 1 - A sample entity-relationship (ER) diagram. It represents a partial schema for storing transactional sales data. Actual ER models are significantly more complex.
A second reason typical entity-relationship models have failed in the context of decision support is that they tend to be very complex and difficult to navigate. (See Figure 1.) In OLTP systems, this was never an issue. Usage/access paths in an OLTP environment are very well known, and applications can therefore be hardcoded to use particular data structures every time. In DSS, usage is very unstructured; users often decide what data to analyze moments before they request it. If users have to give more thought as to how to get at their data than they give to what data they want to see, then their decision support environment is inadequate.
As a response to the problems with classic ER modeling and DSS, vendors representing proprietary "multidimensional" databases have emerged. "Dimensions" represent ways of looking at information. A multidimensional database is organized according to these dimensions. For example, an analyst might want to view sales data by geography, by time, or by product, three typical dimensions by which to organize data. Multidimensional database vendors argue that their software presents a view of corporate data that closely matches the way users intuitively think of their business. They contrast this with the traditional RDBMS "rows and columns" view, which is decidedly more difficult to navigate. Multidimensional vendors also claim that their midsized proprietary databases achieve better performance than traditional RDBMS software.
However, relational database designs do not need to be complex. It is possible to model data multidimensionally in an RDBMS, thereby providing an intuitive presentation of data for end users while continuing to leverage investments in open technology. Moreover, strategies based on this data model enable relational databases to outperform multidimensional databases, particularly on very large data warehouses. This database design paradigm is known as "dimensional modeling." The technical concepts behind dimensional modeling are discussed in depth in this paper, as are strategies that leverage this paradigm to achieve very high performance in the largest of data warehouses.
Dimensional modeling is a technique developed to structure data around natural business concepts, and to provide a foundation for sophisticated data analysis.
Traditional ER models describe "entities" and "relationships." This strategy focuses on breaking up information into a large number of tables, each of which describes exactly one entity. An entity might be a physical object (e.g., a product or a customer) or it might be a transaction (e.g., a sale or an order's line item). Entities are interrelated through a complex series of joins.
In dimensional modeling, data structures are organized to describe "measures" and "dimensions." Measures are the numerical data being tracked. They are stored in central "fact" tables. Dimensions are the natural business parameters that define each transaction. Dimensions are stored in satellite tables that join to the central fact table. Classic examples of data stored in fact tables include sales, inventory, magazine subscriptions, expenditures, and gross margins data. Classic dimension tables include time, geography, account, and product data. The focus in dimensional modeling is to organize information according to the way analysts intuitively think about their business, and to minimize the number of joins required to retrieve the information into meaningful, integrated reports.
For example, suppose a marketing analyst typically examines sales data as it is requested in the following queries: "Sales by product line manager over the past six months," or, "Sales by account for April, 1994." In this scenario, the data model would consist of a fact table to track sales information. For each sale, there would be a record storing quantity ordered, price and extended price, among other variables. The satellite or dimension tables would include account information, product information, and time information: natural dimensions of the sales information.
In this scenario, every sales record would possess a key joining it to each of the dimension tables. Thus, the fact table would store quantity ordered, sales, account code, time code, and product code. (See Figure 2.) Fact tables are generally fully normalized, meaning that there is no duplicate data storage within the table.
Figure 2 - Data from a sample fact table.
Information in the dimension tables is used to specify subtotal break points in reports as well as standard query constraints. A typical query might ask for: "Sales by brand by month for all retail stores in the western region." In this case, sales are found in the fact table, which is joined to the product, time, and geography dimension tables. Product (brand) and time (month) are used as break points in the report, while geography (region) is used as a query constraint.
The dimension tables store all of the information associated with each particular dimension. This primarily includes:
Dimensions can often be organized into hierarchies. Each level of a hierarchy is said to "roll up" to the next. For example, within the time dimension, days roll up to weeks which roll up to quarters. For the product dimension, products roll up to product lines which roll up to brands.
Hierarchy is very important to dimensional modeling, as it provides the framework for "drill-down" and "drill-up" functionality. Drilling down is the process of requesting a more detailed view of a data set. For example, a query might present sales by region. On seeing a spike in one region, a user might want to drill down to see how sales in that region break down by state. Drilling up is the opposite of drilling down, as it requests a more summarized view of data.
It should be noted that not all hierarchies roll up as neatly as the ones described here. For example, days roll up to weeks and months. Since months do not divide evenly into weeks, weeks cannot roll up to months, but both weeks and months roll up to quarters. (See Figure 3.) In many instances, hierarchies are significantly more complex than this. One of the challenges of hierarchical modeling is accounting for such structural complexity.
Figure 3 - Multiple time hierarchies. Dimension elements roll up in several different combinations, resulting in hierarchies substantially more complex than the pyramid pictured above.
A "dimension element" is a special category of data that represents a particular level in the dimension hierarchy. There is one dimension element for each hierarchy level. Thus, for the product dimension, there might be three dimension elements: product, product line, and brand. In this model, we would say that the dimension element "product" represents the lowest hierarchical level in the product dimension, while "brand" represents the highest level.
Figure 4 - A sample star schema. This schema is simplified in the sense that tables will tend to have many more columns than shown here. The actual number of tables (five) is exactly the same as it would be in any four-dimensional information warehouse.
"Dimension attributes" describe particular dimension elements. For example, "brand manager" might be a dimension attribute that describes the dimension element "brand." "Flavor" might be a dimension attribute that describes the dimension element "product." Dimension attributes allow users to categorize nonhierarchical data.
The physical architecture of the dimensional model is described by the star schema. As Figure 4 illustrates, the star schema can be represented as a fact table at the center of the star, with surrounding dimension tables at its points.
A defining characteristic of a star schema is that the dimension tables are denormalized. Denormalization is a database design approach in which data is repetitively stored in individual tables for the sake of design simplicity and performance. Thus, dimension attributes may be stored multiple times in a dimension table, depending on which level of the dimension hierarchy the attributes describe. (Compare Figures 5 and 6.)
Figure 5 - A normalized representation of sample product data.
Figure 6 - The same product data, denormalized.
For example, say a company has 100 separate products, rolling up to a total of five brands. For every product listed in the dimension table, its corresponding brand is also listed, as are all attributes of the brand. Brand manager information will be stored for virtually every record in the product dimension. The same holds true for brand name and all other brand attributes. In a normalized data model, brand attributes would ordinarily be stored once in a brand table, and only the foreign key to that table would be referenced multiple times.
The simplicity of the star schema in dimensional modeling confers four important advantages:
Aggregation is the process by which low-level data is summarized in advance and placed into intermediate tables that store the summarized or "aggregated" information. These aggregate tables allow applications to anticipate user queries, and eliminate the need to repeat resource-intensive calculations that would otherwise be performed each time the summary information is requested.
A typical data warehouse architecture begins as a massive store of transactions at the lowest, or "atomic," level. Measures are stored in the main fact table in their most detailed form so that later phases of data analysis and reporting can make use of them.
But extracting data from the most atomic level does not yield optimal performance, even with leading-edge software and hardware. Fact tables tend to be very large, resulting in serious performance challenges; summing millions of rows takes a long time no matter what software or hardware is used, and no matter how well the data warehouse has been tuned.
A significant percentage of queries against the data warehouse call for summarization, or aggregation, of data elements. A typical user might ask: "Show me total sales for this month." This would be interpreted by the database as, "Add up all the sales for each of the days that this month contains." If there are an average of 1,000 sales transactions per day in each of 1,000 stores and data is stored at the transactional level, this query would have to process 30,000,000 rows to return the answer. A summary-intensive query like this can take up significant resources.
For commonly accessed data, presummarization is often useful. This enables intermediate results or "aggregates" to be used, significantly reducing resources required to deliver the final query results. To appreciate the value of aggregates, consider a request for August sales: If there is an aggregate table already created to track monthly sales by store, the query has to process only 1,000 rows (the August total for each store). Compared to the 30,000,000 rows the same query would have to process with data stored at the transactional level, the resource savings is several orders of magnitude. In fact, since query response time in a well-tuned warehouse is roughly proportional to the number of rows the query has to process, the improvement in performance with the above summary could be close to a factor of 30,000.
Most existing technologies offer database users a drastic choice: no aggregation at all, or exhaustive aggregation, for every possible combination of queries the user might want.
Performing no aggregation is generally out of the question for substantive data warehouses. Aggregating in every possible combination achieves the best possible query performance but at tremendous storage, maintenance, and loading-time cost. First, storing summary information at every possible level consumes enormous amounts of disk space, increasing the storage requirements by a factor of five or more. Second, typical data warehouses have thousands of combinations of dimension elements, and creating a table or tables to hold the aggregates for all those combinations is an overwhelming maintenance task. Finally, building aggregates every time new information is added to the fact table can become so time consuming and resource intensive that the load window required to make the warehouse operational may become unacceptably long.
Figure 7 - Selective aggregation. By creating just a product-by-month aggregate, an intelligent data warehouse engine can deliver dramatically improved performance for all queries at higher levels of summarization.
What is needed is a query engine that uses aggregates intelligently. For example, consider a query that asks for sales summarized by year. Further consider that sales are stored by transaction in the fact table and by month in an aggregate table. The query should be issued against the monthly summary and should add up the 12 records for the 12 months that are stored in the aggregate instead of adding up the thousands of transaction records from the fact table. (See Figure 7.) If the engine can make such a decision, it is unnecessary to create the "sales-by-year" aggregate, since summarizing 12 rows is trivial. Such intelligent query optimization is performed by MetaCube, the engine that drives Stanford Technology Group, Inc.'s (an Informix Company) family of decision-support software.
With MetaCube's query optimizer, the data warehouse architect has a choice of which aggregates to build, because the optimizer will automatically retrieve information through the best available route. Furthermore, the warehouse architect can change the way the data is aggregated based on usage patterns, addition of new applications, or changing information needs. The architect can then add or delete aggregates without having to change a single line of code in the front ends that use the MetaCube engine.
There are two main considerations when determining what aggregates to create:
If a given dimension element represents a large number of rows as compared to other elements in the hierarchy, aggregating by that dimension element drastically improves performance. Conversely, if a dimension element contains few rows, or if it contains hardly more rows than the superseding dimension element, aggregating by that dimension element is less efficient.
This analysis becomes more meaningful -- and more complicated -- as dimensions combine with one another. Defining a data request by multiple dimensions not only decreases the range of data retrieved, but also the density . It is rare, for example, that every product sells in every store, every day. For many products, there may in fact be few sales records for any given day, and the daily product sales data will thus be sparse . If, however, all or many products sell in every store, every day, the data is classified as relatively dense . Data density complicates our calculation of how many records a query engine will have to process. A sizing simulation based on the facile assumption that every possible record exists, i.e., that the data is perfectly dense, skews the performance analysis of each aggregate.
When determining, for example, whether to compile an aggregate summarizing product line sales by region, the number of different products actually sold in each store within the regions is crucial. Consider a simplified database containing four stores in each of two regions (a total of eight stores), selling four products in each of two product lines (a total of eight products).
If only one of the products in a product line sells in each region on a daily basis (sparse data), the number of products in the product line for that day effectively shrinks to one. For the query, "Daily product line sales by region," one product row would be retrieved for each product line (two) in each region (two), for a total of four rows. Similarly, if products were aggregated into a sales-by-product line aggregate, and the same query was posed, the same number of rows would be processed, four. In this instance, the aggregate offers no performance advantage whatsoever.
If, at the other extreme (dense data), every product in every product line sells in every store, every day, this query would have to process four products for each of two product lines, and for each of the four stores in both regions, for a total of 64 records. But an aggregate summarizing product lines sales by region could process this query using only four records, reducing the number of records processed sixteen-fold. In a typical database representing thousands of stores and products, the performance advantage would be substantial. An analysis identifying optimal aggregates would favor a product line aggregate in this instance.
STG has developed software to perform this analysis. While the analytic process is sophisticated, the methodology STG uses to approach this problem is straightforward. The question that must be answered at each step is, "Which aggregate will reduce by the greatest amount the average number of rows a query in this data warehouse has to process?" The algorithm recursively calculates this answer along with the supporting evidence, and stores it in the database.
The answer to the second will depend on feedback from the user community. STG tools audit user queries, collecting information on which data is being requested most often, and who is requesting it, how long the query took to process, how many rows were retrieved, and other criteria. This information can then be used to further tune the data warehouse.
To determine the optimal number of aggregates to create in any given data warehouse, we can conduct a "sizing simulation." The main factors to consider are the total amount of space occupied by the aggregates (cost of disk) and the total number of aggregate tables (cost of load window and maintenance). This section describes an actual simulation that STG conducted for a sample data warehouse.
The following steps are performed during the sizing simulation:
Figure 8 - Aggregate size vs. query performance
After applying the optimization algorithm to the sample database, the best aggregates can be selected from the overall pool. Figure 8 illustrates the effect of additional aggregates on:
In Figure 8, the horizontal axis represents the number of aggregates created (best aggregates are added first), increasing from 0 (fact table only) to 100. The left vertical axis represents the number of total rows stored in the database. The scale starts at 1.3 million -- the size of the fact table in this sample. The right vertical axis represents the average query cost (in rows processed) computed using the algorithm described in this paper.
The improvement in average query cost for this example is exponential and quite dramatic. After 7-8 aggregates, additional aggregates do little to improve the average query cost while adding significantly to storage, maintenance worries, and load time. On the basis of the information in this chart, STG would recommend that only seven aggregates be created to prepare this sample data warehouse for random, ad hoc access.
Clearly, the information in Figure 8 is incomplete, mainly due to the assumption that usage of the data warehouse is random. The final step of the aggregation analysis would be to incorporate information about which queries are run most often by end users.
As mentioned earlier, one disadvantage associated with the aggregation process is the time window required to build the aggregates. For end users with very large databases and significant aggregation requirements, it can take a tremendous amount of time and computer resources to build all the aggregates needed. Each week, as more information comes into the data warehouse, the aggregates become outdated and thus need to be recalculated. One option is simply to recalculate them from scratch by performing the summary operation on the base fact table. However, in situations where the aggregation time window is a serious problem, this is an impractical solution. To address this issue, STG has developed MetaCube Aggregator, a program that updates existing aggregates based on incoming atomic data, thereby avoiding unnecessary resource utilization. (See Figure 9.)
Figure 9 - Incremental aggregation. The MetaCube Aggregator is able to add new data to existing aggregates. By dealing with much smaller quantities of data, incremental aggregation makes the load window many orders of magnitude smaller than reaggregation.
There are situations in which the simplest implementation of dimensional modeling, the star schema, is not ideal. There are two possible reasons for this:
Denormalization is a very effective method for simplifying data design and improving performance through the reduction of table joins. However, there are instances where the cost in disk storage may be too high.
For example, consider a product dimension in which there are 100,000 products rolling up to 15 product lines and five brands. In a star schema, the corresponding dimension table would have 100,000 rows, and each row would store all of the relevant information for every level of the hierarchy above or equal to its own level (brand manager 100,000 times, product line category 100,000 times, etc.). (Refer to Figure 10.)
Figure 10 - Sample star join for product dimension.
In some cases, the number of attributes stored about each dimension element can be substantial. In the case of the above product dimension, every kilobyte of attribute data elements costs 100 megabytes of disk space.
Normalizing the dimension table avoids this additional disk storage. In a normalized model, the primary dimension table would have 100,000 rows, but might have only three columns: product_id, product_line_id, and brand_id. In this case, the dimension would contain three additional tables, one for brand attributes, one for product line attributes, and one for product attributes. The brand table would store the brand_id, brand manager, and all other brand attributes. The product line table would store product_line_id, product line category, and all other product line attributes. In a dimension table of 500,000 rows, saving just two megabytes per row through normalization of the star saves a full gigabyte of disk. (Refer to Figure 11.)
Figure 11 - Sample normalized product dimension. As the number of products increases to 100,000, the brand and product line tables barely grow in size. Where many attributes are tracked, the disk savings can be substantial.
Normalized dimension tables turn star schemas into "snowflake schemas," named for their added structural complexity.
To understand how snowflake joins can improve performance, consider the following. In the above database, assume that aggregates exist for sales by product line and brand. Assume also that the fact table contains roughly 10 million rows.
Now consider a query looking for sales by product line manager. In the case of either a star or a snowflake, the query would be able to retrieve the sales information from the product line aggregate table. However, in the case of a star, the full 100,000 row product dimension table would need to be joined to the star to retrieve the product line manager information. In the case of a snowflake, the product line attribute table would be separated out from the full 100,000 row product table. The query could thus get the product line manager information it needed from the product line aggregate table and the 15-row product line attribute table -- a substantial performance advantage. (Refer to Figure 12.)
Figure 12 - Star vs. snowflake processing: sales by product line manager. The snowflake join depicted at top would be orders of magnitude faster than the star join depicted at bottom.
To summarize the snowflake architecture, each dimension table stores one key for each level of the dimension's hierarchy (that is, for each dimension element). The lowest level key joins the dimension table both to the central fact table and to the attribute table that contains the descriptive information about the lowest level dimension element. The rest of the keys join the dimension table to the corresponding attribute tables. (Refer to Figure 13.)
Figure 13 - Sample snowflake schema incorporating all dimension elements.
The main disadvantage of the snowflake vs. the star is the relative complexity of the normalized snowflake data structure. When using the MetaCube line of products (or custom applications built using MetaCube's OLE object interface) this is not as critical, as MetaCube can interpret snowflake structures as easily as stars. However, if users are generating queries using typical database ad hoc query tools, it will be more difficult for them to navigate through the snowflake. In addition, load programs and overall maintenance become more difficult to manage as the data model becomes more complex.
The performance gains and disk storage savings provided by snowflake designs are often worth the price of marginally higher complexity. However, as shown earlier, the value of a snowflake join is greatest for dimensions in which:
Thus, the best solution is often to normalize one or two dimensions, and leave the rest of the dimensions in a simple star format. This is referred to as a partial snowflake.
Traditional entity-relationship data models function effectively in the OLTP world, driving most of today's operational, RDBMS-based applications. Because of the success of these data models, the first graphical DSS and EIS systems were implemented using similar designs. As these DSS databases grew larger and more complicated, performance became poorer and systems became difficult to use and maintain.
Dimensional modeling, the database design approach discussed in this paper, improves DSS performance by several orders of magnitude. By presenting information in a format that parallels the multidimensional business environment, dimensional modeling offers an easy-to-navigate data model that makes intuitive sense to users. Moreover, the structural simplicity of dimensional modeling facilitates application maintenance and provides the flexibility to expand the data warehouse. Finally, this approach to data warehousing leverages investments in open, relational technology.