This article contains the following information:
In almost all database systems, database designers create fragmented tables to overcome the space limits of small disks, eliminate access deadlocks, and balance resource use, such as disk I/O.
Certainly one goal of fragmentation is to balance I/O across multiple disks and maximize throughput. All commercial database servers can improve query and transaction performance with simple fragmentation methods, such as round-robin and expression-based methods. In the IDS with AD and XP Options database server, however, you must fragment tables appropriately for your data and queries so that the database server can best use its built-in intelligence to process queries quickly and efficiently.
If you choose an inappropriate fragmentation method, the database server must scan all table fragments for all queries, which is an inefficient use of parallel-processing power. If you use a fragmentation method that allows the database server to identify table fragments that contain specific rows, query processing immediately becomes more efficient for many queries.
An expression-based fragmentation method provides the information that the database server needs to identify the fragments that contain specific column values. In addition, IDS with AD and XP Options database servers provide other fragmentation methods that can dramatically improve performance in two ways:
The ability of the database server to identify the exact table fragment that contains a specific required row is a related performance improvement as well. The IDS with AD and XP Options database server fragmentation methods and their performance implications are described below.
The kinds of queries that run against the table determine whether enhanced fragment elimination or parallel processing is more important. If most queries are highly selective, accessing relatively few rows, then fragment elimination is more important. If most queries have low selectivity, which results in large table scans, then parallel processing is more important.
For this reason, before you can choose a fragmentation method you must know the data and know the queries that are run against it.
For example, suppose you have created the following table that contains date information, such as ship and order dates:
For this table, many queries will probably use dates as filters in the WHERE clause.
This prediction suggests that the
If the table is already populated, run the following SQL statement:
UPDATE STATISTICS MEDIUM for TABLE orders(ship_date, order_date);.
Then use the
If you find uneven distribution that creates data skew, either you cannot use this column for fragmentation or you must use a fragmentation method, such as a hash or hybrid scheme, that minimizes the effect of the uneven distribution. The fragmentation method that you choose depends on whether your primary fragmentation goal is fragment elimination or parallel-processing improvement. The same principle applies when you examine marketing information tables that have regional information, or other kinds of raw data that might contain more rows for one column value or range of values than for others.
You also need to consider the way in which rows are added to the table and removed from it. For an accounting data table, information might be kept by month on a rolling twelve-month basis. Then at the end of each month, the new month's data is added and the oldest month is removed to an archive. For efficiency, your fragmentation scheme should permit this kind of table modification to occur without rebuilding the entire table or running large delete operations.
If the table contains summary data for each category and each date period contains the same list of categories, data skew is not a problem because each table will have the same number of rows. In this case you have more freedom in choosing a fragmentation method.
Know Your Queries
If the filters return relatively few rows, consider fragmenting the table so that the database server needs to scan only the fragments required by the query. Make sure that each fragment is on a separate disk to improve disk I/O for the fragments that are required.
If the filters use aggregate expressions on unindexed columns so that all table fragments must be scanned, then you improve parallel processing by fragmenting the table uniformly across all disks and co-servers. The number of scan threads that the database server starts depends on the number of table fragments in dbspaces on each co-server. For large queries that scan all table fragments, limit the number of fragments to between one and three times the number of CPU VPs.
The number of parallel scans is also determined by the number of CPU VPs in your database server configuration. You might begin by creating fragments as a multiple of one or two times the number of CPU VPs. If you see that the CPU VPs can manage more scan threads because your queries do not require them to manage query processing threads-such as join, group, and sort threads-you can increase the number of table fragments per CPU VP.
If you see that many fragments are eliminated for query processing, you can increase the number of table fragments so that the actual number of scanned fragments is in balance with the number of CPU VPs. For example, a table that has 250 fragments of which only 35 are scanned for most queries would perform very well on a system with 16 CPU VPs.
A dbslice is a named set of dbspaces that you can manage as if it were a single object. Dbslices are a significant server management feature for database servers that contain many co-servers and many more disk partitions. You never have to specify a long list of dbspaces when you can refer to the entire set of spaces with a single dbslice name. In addition, dbslices let you design an optimal dbspace layout for table fragments and use it easily when you fragment the tables.
Table fragmentation is only as effective as the arrangement of dbspaces permits it to be. Properly constructed dbslices can help you achieve the two major goals of dbspace layout:
The standard fragmentation methods work with all Informix database servers. These methods include round-robin fragmentation and expression-based fragmentation. In addition, IDS with AD and XP Options database servers provide two other fragmentation methods: hash fragmentation, which distributes rows into fragments by applying a hash function to the fragmentation columns, and hybrid fragmentation, which combines the hash and expression-based methods.
As a result, round-robin fragmentation is rarely used in IDS with AD and XP Options database servers.
Suppose you create the table as follows:
Although expression-based fragmentation might work well for queries reading only some rows in a table and using either exact match or range filters based on the columns used for the fragmentation expression, parallel processing for complete table scans is not optimal if some fragments contain many more rows than others. In addition, because expression-based fragmentation must be specified by dbspace instead of dbslice, this method might produce some dbspace hot spots-where some dbspaces are used more than others, causing I/O wait problems.
The fragmentation method shown for the sales table can eliminate table fragments if a query uses a range expression that requires rows from specified months, as follows:
Expression-based fragmentation also allows the database server to identify the table fragment that contains information for a single bill date.
Unfortunately, in the illustrated case the sales table data is not evenly distributed for all months. The months of November and December contain ten or twenty times as many rows as the months of July and August. With this skewed distribution, if a query must scan the entire table-or even different fragments-it is inefficient because the database server can process the query only as fast as it can scan the largest table fragment. The fragments containing data for the months of November and December will hold up processing of the entire query for the following query:
The hash fragmentation method improves on the round-robin method because the database server can use the hash fragmentation columns to identify the fragment that contains a specific row. If the WHERE clause of the query uses equality expressions on the fragmentation columns, the database server can skip some fragments when it processes the query. Hash fragmentation, however, does not allow the database server to skip fragments when the WHERE clause uses range expressions.
For example, if you have a table of monthly sales data, you might fragment the table by hash on the customer number, as follows:
Because there is only one row for each customer number in each month's data, hash fragmentation ensures an approximately uniform distribution of rows across dbspaces, thus improving parallel scan performance for queries like the following query, which requires rows from all fragments:
Hash fragmentation also allows the database server to identify individual customer number rows and access only table fragments that contain them, such as the following query:
However, since the customer numbers are distributed across many fragments, and not grouped in specific fragments, queries that request a range of customer numbers do not allow the database server to skip fragments. Consider the following query:
For such a query, hash fragmentation requires the database server to scan all table fragments.
For hybrid fragmentation you create dbslices that contain several dbspaces, evenly distributed across co-servers. In the case of a single co-server database server, you create dbslices that contain several dbspaces evenly distributed across disks.
You need one dbslice for each fragmentation expression. For example, if you use the expression clause of the fragmentation definition to divide accounting data into twelve months, then you need twelve dbslices, one for each month. You can use the same column for the hash distribution into the dbspaces contained in the dbslice, or you can specify a different column to provide fragment elimination on another dimension.
The expression component of the fragmentation scheme allows the database server to eliminate fragments for WHERE clauses that use both range and equality expressions, while the hash component of the fragmentation scheme allows further fragment elimination if the WHERE clause uses equality expressions on the hash column. Even if hash fragments cannot be eliminated, increased parallel processing is possible because table fragments can be scanned by different threads.
The following sections provide hybrid fragmentation examples to illustrate these advantages in more detail.
This basic example shows how hybrid fragmentation can minimize parallel processing bottlenecks that result from data skew in table fragments and also allow fragment elimination.
Suppose that you have a single table that contains sales information for the twelve months of the fiscal year in several categories (quarter, month, and day) and that you want to store the table across the 16 disks of your single co-server IDS with AD and XP Options database server. Suppose also that your database server has four CPU VPs.
You know that most of your queries require information about the data from a specific quarter. In this case it might seem appropriate to fragment the table by expression into four large dbspaces across the 16 disks, using the quarter column as the fragmentation key. If each quarter's table rows are stored in a single table fragment, then fragments that contain the rows that belong to the quarters not queried can be skipped. This seems like a good idea because it achieves your major fragmentation goal: fragment elimination for quarters that are not required by the query. It also provides some parallel processing if more than one quarter is queried.
Unfortunately, your sales data is not evenly distributed. When you fragment the table by quarter, you find that the first quarter fragment contains more than three times as many rows as the third quarter fragment, while the second and fourth quarter fragments contain about three-fourths as many rows as the first quarter fragment (Figure 1). Not only is this inefficient use of dbspace, but it also creates significant data skew when more than one quarter is queried.
Your row distribution will look like this:
You find that even if you fragment the table by month, data is still significantly skewed.
Hybrid fragmentation can improve parallel processing by using dbslices to combine expression-based fragmentation with hash fragmentation.
First you create four horizontal dbslices by striping the dbspaces across the disks so
that each dbslice contains one dbspace on each disk. You then fragment the table by
expression so that the data for each quarter is stored in a separate dbslice. You also
fragment the table by hash so that within each dbslice the data for a single quarter is
spread evenly across the disks into dbspaces determined by a hash function applied to the
With hybrid fragmentation, row distribution looks like this:
This fragmentation method meets both fragmentation goals:
The following real-world example shows how fragment elimination can be effective for specific queries.
In this example, the orders table is fragmented to store order information for all stores in twelve dbslices, one dbslice for each month. Each dbslice contains the same number of dbspaces, evenly distributed across disks. For each month, there is one table fragment for each CPU VP on the co-server node where the fragment is stored. Thus for a single co-server database server with four CPU VPs, each dbslice would contain four dbspaces. Within each dbslice, the month's order information is fragmented by hash on the region number associated with the order. The entire table, then, is stored in 48 fragments on the four-CPU VP single co-server database server. The table definition statement is as follows:
Data rows are distributed as shown in the following figure:
For the following query, the database server scans only four table fragments, or 1/12 of the data:
For the following query, the database server scans only one table fragment, or 1/48 of the data:
Although the following query scans fragments from all twelve months, it scans only twelve fragments, or 1/4 of the data, because it can identify the fragments that contain the required region:
Note: Query performance might also be improved by the use of indexes. Specifically, indexes can improve performance for queries that use non-fragmentation or non-key columns in the WHERE clause. Sometimes an index alone can be scanned to provide the required information to satisfy a query. The subject of indexes, however, deserves an article of its own that discusses when to create an index, what columns to use, whether to fragment the index and how to fragment it, and so on.
Before you put a fragmentation scheme into effect in a production database, test it in an appropriately configured testbed system to make sure that the scheme meets your fragmentation goals. Test your fragmentation methods carefully. After you create a production database system, it is difficult to change the fragmentation method. Although you can easily add fragments, changing the fragmentation method requires you to drop the table, redefine it, and reload the data.
To test the fragmentation scheme, you run a set of typical queries against the table and monitor query performance. To monitor query performance, enter the SET EXPLAIN ON statement before the query runs and examine the output that is produced. You can also enter some of the onstat utility commands to get information while the query is running.
For example, suppose that you run the following query against the orders fact table introduced in the previous section:
fact/dimension database contains several dimension tables, among them
table that is hash-fragmented on the
The sqexplain.out file produced when you turn on SET EXPLAIN contains the following output. This sample output shows only relevant information about fragment elimination when the orders table is scanned to display data for two days, one day in May and the other day in June, and the other threads that must be run by the CPU VPs.
In the XMP Query Plan section, the width column shows the number of threads for each SQL operator, including scan, hash join, group, and sort.
The remainder of the sqexplain.out information, not included here, displays details about how the database server is executing the query plan.
You can also get this information from the output of some
commands. First get the
While the query runs, you can also get information about the scan threads by running
In this article we have explained why appropriate fragmentation methods are vital to efficient performance of IDS with AD and XP Options database servers. We have also provided general guidelines with examples to help you decide on the best fragmentation method for your data and queries, and shown how to monitor queries to find out if your fragmentation method is meeting your goals.
In the next article we will show how appropriate fragmentation methods make it fast and easy to scrub and load data and update tables by adding and removing table fragments.
About the Author
This article has been written by an IDS with AD and XP Options team that consists of Hanna Metzger, a senior technical writer at Informix, and a group of software engineers who work on the database server development. We have benefited greatly from the advice and real-world examples and information provided by Informix field support and data warehouse experts. Hanna may be reached at mailto:email@example.com .