Over the past several months I have found myself repeatedly explaining the concept of tuning a decision support system (DSS) query for an Informix database. I figure that I had better jot this down so that I can just hand it out and avoid spending all the time explaining it over again.
Data warehouses and very large databases tend to be used differently than traditional OLTP database systems. In an OLTP environment, the goal is to execute lots of discrete transactions in a short time frame. OLTP administrators support hundreds or thousands of users, and focus on very specific portions of the data (i.e. what is important is being able to update item 4 of the customer’s order as well as the balance in their order master record). An OLTP transaction is most probably performed with one or more UPDATE statements WHERE cust_id/ord_no = ?.
In a DSS environment, you are more likely to have only a handful of users, each of whom are making queries against a large volume of data. In an OLTP model, users like this would be severely chastised. However, in the world of DSS, this sort of activity is encouraged. Typically, a DSS query will scan every row of one or more tables. Also typically, where the size of an OLTP system might be measured in terms of hundreds of megabytes, a DSS system will be measured in terms of hundreds of gigabytes.
DSS systems tend to be queried by front-end tools such as Business Objects or SAS, where the SQL to perform the query is generated by the tool; you frequently do not have much control over the query itself.
If you try to support a DSS query using the same data model, indexed reads, and engine tuning used in an OLTP system, it is likely that your response time to such a query could be measured in days. In a recent case, the response time for such a query was 40 days. In this particular case, using the tuning methods detailed in this article, I was able to bring the query result time down to about 15 minutes. All without changing the query or changing the schema. This article will examine specific tuning procedures for DSS queries in order to shorten their response times.
An Example Query
Imagine you have a query where you join two tables to return a count of last name by state, where name is in Table 1 and address (state) is in Table 2:
Table 1 has 29 million rows and Table 2 has 27 million rows. Table 1 is 4.5 GB and Table 2 is 6GB. Each table has been placed into a single dbspace. There is no index on either key in either table. The machine in use is an 8 processor Sun 6500 with 4GB of memory. A query like this run in an OLTP environment could take upwards of 70 or 80 hours to execute.
Before examining tuning procedures to optimize this DSS query, you need to understand how to utilize specific monitoring tools available to you in order to learn how the query will behave. There are five monitoring tools that are very helpful in examining a query.
The first monitoring tool is the EXPLAIN plan. This can be obtained by including a SET EXPLAIN ON; command at the start of your query. The optimizer will then create, or append to, a file called sqexplain.out in which it explains how it plans to resolve the query.
Notice that the Estimated Cost is very low; somebody forgot to UPDATE STATISTICS.
The initial section of this EXPLAIN plan is the query itself followed by an estimated cost which is derived from the number of instructions the optimizer thinks will be required to resolve the query. The value given here is way too low for this query (a value in the billions would be more normal).
The estimated number of rows returned is an indication that the optimizer thinks that at least one of the tables is empty. A minimum of one row is always returned from a count(*) operation.
Each of the tables will be read using a sequential scan, probably because it costs nothing to read an empty table. The "Serial" is an indication that each fragment of disk that the table resides on will be read sequentially as opposed to the desired "Parallel". No fragments have been eliminated from the query (hence, the "fragments: ALL").
The Temporary table is, of course, required to perform the actual group by operation. A Hash Join will be used to join the two tables, again because the optimizer thinks one of the tables is empty and because here there are no indexes.
What we would really like to see is a higher estimated cost, a higher estimated number of rows (if UPDATE STATISTICS high had been run for state_cd and last_name, we might even get a correct number here). We like the Sequential scan, although we really want a Parallel, not Serial, scan. If we had been able to eliminate some of the fragments through a WHERE condition, we might also have seen only a list of the fragments to be scanned. Finally, we would also like to see a line like:
We would actually like to see a higher value indicating more parallelism, but for the sake of this example we will say 21.
This would indicate that we are using PDQPRIORITY. The exact number of threads will vary with the number of dbspaces each table is spread across, and the number of threads assigned to each additional task. As can be seen in the onstat -g sql output below, we have 9 scan threads for one table, 11 scan threads for the second table, and a single sqlexec thread to pull it all together. With a higher PDQPRIORITY setting we would see more specialized threads for grouping and sorting.
These key pieces of information can tell you that something is wrong with the way the optimizer has chosen to solve the query.
While the explain plan can show you how the query will be answered, Xtree allows you to monitor the query as it is being executed. You can see how many threads have been initiated for each segment of the process as well as determine a run rate for the query.
Figure 1 below shows an Xtree screen shot for a typical hash join. At the bottom of this figure are two sets of scan threads, the one on the right has 9 threads, the one on the left is running with 11 threads. The table on the right was read into memory in the hash table, the table on the left is being scanned and matched against this hash table while being read.
The numbers to the right of each box indicate how many threads are in use to perform this portion of the query. Within the box, below the meter, is a number indicating how many rows have been processed by this portion of the query in the last second. At the top of the box is the number of rows processed so far. About 114,000 rows were read from the table on the left in the last second, with about 89,000 matches performed in that same second. When a hash table swap occurs, all of the process per second counters will drop to zero; by watching these you can figure out how often and how long the process is spending in a hash table swap.
The numbers next to the scan boxes should match the number of dbspaces the table is fragmented across. On this machine, I would expect to see on the order of 50,000 to 70,000 rows scanned per fragment per second. If this number drops off at the end then one or more of the fragments probably completed before the others; if this occurs in the last 5-10 seconds it’s no big deal, but if this drop off occurs half way through the query you might want to check the fragmentation scheme.
The filter box should be lowering the number of rows scanned if in fact you are using a filter (e.g. where state_cd = ‘MA’). The total number filtered should keep pace easily with the scan rate.
The following Xtree screen shot shows a typical nested loop join (Figure 2). The table on the left is being sequentially scanned; a probe into the second table is then performed for each row from this first table. As you can see, during the second that this snapshot was taken, the query joined all of 1373 rows, which is not very good. Here, the join itself limits the scan rate. Also interesting, you can see that more rows have been joined than read, an indication of duplicate rows. (That’s ok, multiple people live at the same address).
onstat -g ses and onstat -g sql (below) reveal some key numbers—specifically, how many and what type of threads will be used to execute the query, as well as how much memory your query has been allocated.
The output contains a lot of information, probably more than you need, but there are several key things to look for. The Current Statement name provides you details on the SQL of the query any session is actually running. You can see how many threads and what types were kicked off (e.g. scan_2.3 et al, we got 9 scan threads against one table, 11 against the other and a single sqlexec thread which will perform the hash join with a PDQPRIORITY=1. You can also see how much memory has been allocated for the query and is being used under Memory pools and on the top line (Total Memory/Used Memory). This information will confirm what threads have been kicked off for a given query. If there are not multiple scan threads for each table (scan 1.0 through 1.8 is on one table), or if there are NO scan threads for a given table, you know the optimizer may not be doing what you wanted.
onstat -g mgm
For this article I did not provide a snapshot of an onstat -g mg m, but give it a whirl and read the manual on it—it is fairly self evident. Of particular interest are the gating factors and what queries get held up and why (so you can go kick Bob for running everything at PDQPRIORITY=100).
onstat -g lsc
Running the Example
The optimizer should choose to solve the example query with a hash join. If it does not, then there may be another issue, like the optimizer thinks one of the tables is empty (need to run UPDATE STATISTICS) as is the case in our query here. Let’s assume that PDQPRIORITY is NOT set to anything, which is typical of an OLTP environment. This means that it is effectively off. What will happen when the query is executed?
With these settings, the engine first reads the smaller table (Table 2), applies any filters, and builds a hash table entry for each record. It builds this hash table in memory. It then reads the large table and matches it up against the hash table. As it finds matches it pushes the matched rows out the other side of the join.
Note: If the optimizer determines that a filter will make the result set from the larger table smaller, it may choose to read the larger table first. Of course, the optimizer would need some distribution information to make this determination more accurately, and this can only come by running UPDATE STATISTICS high or medium.
After 37 minutes into this query, Xtree shows both tables have been scanned and have moved into the hash join. Because insufficient memory was allocated, the hash join is stalled while it swaps hash table pages from the temp disk. As more and more memory is needed to build the hash table, more and more available memory is used up. Since the query is executing in an OLTP environment, the LRU buffers are filled with all the data from the table along with the associate buffer management overhead, spoiling everybody’s day (or rather week/month). Since the hash table memory require-ments are so large, the query probably overflowed to the temp space(s) and, if these weren’t set up well, took away the temp space for everyone else as well. And, after running for several days, you would probably just kill the query anyway.
Basic Tuning Procedures
There are three basic time-consumptive elements to the example query. A table scan, a second table scan, and the join. When a nested loop is used (where the larger table is read using the index for every record in the smaller table) the join is part of the second read. The join from a hash join occurs while the second table is read. In most cases we want to use a hash join.
When tuning this query, it helps to approach it in two parts: the table scans and the join. To begin, let’s obtain a benchmark of how long it takes just to scan each table. This will not only give us the ability to estimate query time but also highlight when a query slows down, as well as ensure that we can read the tables themselves fairly quickly.
Results: 17 minutes for Table 1, 20 minutes for Table 2. Roughly 30,000 rows per second. These results are interpolated. The actual read time was 37 minutes for both tables.
The first tuning objective is to improve the read performance. With all of the data being in one dbspace, the engine is limited to a single thread to read the entire table. If the data is spread over multiple dbspaces the database can kick off one thread per dbspace, or fragment. Using ipload you can generate and run a job which will unload (and reload) each table. You can then drop each table, recreate them fragmented over several disks, and reload them. In this particular instance, the fragmentation expression is not that important, so you can just use round robin. If it takes 20 minutes to read a table in one dbspace, spreading the table across 4 dbspaces should bring the scan time down to about 5 minutes. Spreading the table across 8 dbspaces should lower the time to about 2.5 minutes.
Now that we have fixed the tables, the most obvious next step is PDQPRIORITY. With PDQPRIORITY off, Informix Dynamic Server ™ is doing nothing in parallel. One thread is doing all of the reading. If you assume that it takes 5 minutes to read the data from one disk, and that there are 16 dbspaces per table, then the query is automatically starting off with 80 minutes of overhead per tabl e. Just by setting PDQPRIORITY to 1 the engine can kick off one read thread per disk (don’t worry, it won’t try to read both tables at once).
Running the benchmark again with PDQPRIORITY set to 1, either from the operating system (EXPORT PDQPRIORITY=1) or from dbaccess (SET PDQPRIORITY 1), the length of time required to scan each table greatly improves.
Results: 3 minutes 21 seconds for Table 1, 3 minutes 52 seconds for Table 2. Roughly 150,000 rows per second.
Table scan times improve five-fold just by fragmenting the data. However, these results can still be improved.
In an OLTP environment, you want to maximize the cache read and write rates. To do this, you want the engine to find the row it needs in the buffer cache and not have to read it from disk. OLTP buffer cache management has some minimal overhead. However, in a DSS environment, you want to scan the entire table; the odds of the data being in the cache are, therefore, dramatically reduced. The overhead associated with managing the DSS buffer cache becomes onerous. Enter the light scan, a critical element in DSS queries.
The light scan is designed for a DSS query. Rather than use the normal resident memory buffer cache, the light scan uses its own light scan buffers, for which there is much less overhead (primarily because you don’t have to worry about LRU queues). Each query gets its own set of buffer pools. This can have a dramatic affect on our read rate.
To force a light scan the trick is to ensure that the table being read is larger than the resident memory buffer size AND to have the ISOLATION mode set to DIRTY READ. You can also set an environment variable (export LIGHT_SCANS=FORCE). The number of light scan buffers assigned is a factor of your RA (read ahead) settings. From the Informix Masters Series training manual:
Light_scan_buffers = roundup (( RA_PAGES + RA_THRESHOLD) / (60/2/))
Bumping the RA_PAGES and RA_THRESHOLD values to their maximum of 128 and 120, respectively, gives you more light scan buffer pools, which is a good thing.
In the previous benchmark runs, the engine did not use light scans (this was intentional to show the performance impact of light scans). By running UPDATE STATISTICS on the engine, the optimizer will recognize that the tables being read in our example query are larger than the resident memory buffer pool and will utilize light scans when the benchmark is run again.
UPDATE STATISTICS is of course key to getting the light scan to work, but to what degree? UPDATE STATISTICS medium or high will tend to run for hours— on a large data warehouse, potentially days. The results of this are very useful if you plan to do indexed joins or filtering, but for this hash join you are looking to scan the entire table anyway, so just run the quick UPDATE STATISTICS low and leave the rest for the hash join.
Results: 43 seconds for Table 1, 50 seconds for Table 2. Roughly 700,000 rows per second.
I have seen scan rates of up to 2 million rows per second on a table with 32 fragments. We could fragment the table over more dbspaces to improve this read rate even more, but is cutting another 20 to 40 seconds off the query time worth the added administrative headache? You decide. The number of fragments you can allocate may be limited by some other factors as well, like the business requirement for a particular fragmentation scheme. For the purposes of this example, these results here work just fine.
The next tuning efforts focus on the join itself. There are several options available. Let’s first try running the query with a nested loop join using the indexes.
By adding an index, you are trying to do one of two things: either setup a nested loop join or limit the size of the data read to just the index pages. In other words, you are trying to put into the index whatever the query needs to be successful.
The addition of an index gives the optimizer a different path it can travel to get at the data. If it feels that one of the tables is small enough, it will choose to read that table and, for every row, probe the large table (using the index) to perform the join. At issue is that the speed with which the engine can perform probes can be measured in thousands per second. 30 million rows at 3000 rows/second, which is optimistic for the machine we are using here, comes out to 10,000 seconds or almost 3 hours. This is way too slow. Even fragmenting the indexes and detaching them from the tables will not change the probe rate dramatically enough to make a nested loop faster.
So much for that idea. Let’s try a dynamic hash join.
To recap, in a hash join the smaller of the two tables is scanned and stored in a hash table in memory. If memory fills up, which will happen unless the two tables in question are much smaller, this hash table is swapped out to the temp disk. Once the first table has been read, the second table is read and matched up against the hash table. Periodically, the query will have to swap in the hash table entries that are out on temp disk. This temp to memory swap is expensive and is generally the slowest portion of a hash join.
First thing to note is that building the hash table actually slows down the scan rate to about 300,000 rows per second (results vary).
At 40 PDQ this query took 22 minutes 12 seconds, and the match rate was about 40,000 rows per second. Not only can more memory lower the amount of swapping done to and from temp disk, but it also increases the size of the hash table in memory and allows more matches per second.
Utilizing a hash join method, it turns out that the example query join runs at about 90,000 rows per second (at 80 PDQ) for a projected total of about 5.5 minutes. This would be great except that the entire hash table does not fit into memory, causing the query to have to swap portions of the table from disk which is why the actual results of this query are much different (see below). Therefore, the more memory you can assign to the hash join, the less swapping will occur.
PDQPRIORITY controls memory allocation. Your ultimate PDQPRIORITY setting (which is your personal PDQ_PRIORITY/100 * MAX_PDQPRIORITY/100) * DS_TOTALMEMORY = how much memory you get for your query. If you have a DS_TOTALMEMORY of 1.5GB and take 1% of that (PDQPRIORITY=1) you get 15MB of memory in which to build the hash table. 15MB/20 bytes (our hash table entry) = 786,000 or so entries. That’s a rough guess, but after that many entries the database will have to push the hash table entries out to temp disk. With 16 scan threads the database will probably be reading around 160,000 rows per second, so in 5 seconds it will fill the hash table and have to push to temp disk with the above memory allocation.
Actual results of this query: 8 minutes 40 seconds (PDQ = 80). During the match phase of the query the match rate was 90,000 rows/second. After 17 million matches the database started swapping. It performed 10 total swaps, each swap about 10 seconds long (100 seconds or 1 minute 40 seconds). Since the entire table did not fit into memory, the query also could not keep up the 90,000 rows/second rate, which is why (sans the swap time) the query runs in 7 minutes not 5.5 minutes (as first projected).
At times, the optimizer can, frustratingly, keep pushing your query down the nested loop path, despite your best intentions and even the use of optimizer hints. In these cases, you can trick the optimizer into a hash join by giving it something it won’t find in the index:
The + 0 is inexpensive and forces the query to perform as a hash join. This is because tab1.key is no longer recognizable as an index to the optimizer.
Not surprising, temp disk is written to and read from in the same manner as normal tables: 1 thread per fragment. It is therefore important to have multiple temp disks set up to maximize the number of read and write threads. It is best to have multiple dbspaces for your temp disks as well. I recommend 2 temp dbspaces per CPU as was used in the query above, although as long as you keep the number at some multiple you can go as far as you like (within reason), 64 temp dbspaces per CPU is way beyond the point of vanishing returns. Even if you only have eight temp disks for eight CPUs, break them into two chunks and make each one it’s own dbspace. Since the temp spaces will be written to in round robin, the smallest temp space will fill up first and cause you to generically "run out of temp space" even though the other spaces still have room. Therefore, use temp spaces of equal size; anything extra on a single temp space is wasted.
It is important to balance all that you do against how many processors you have so that there is minimal task swapping between processors. In the example here, the tables were laid out over 9 and 11 dbspaces, which meant that 8 processors were handling 9 scan threads in one case and 11 in the other. Each processor probably wasted some time swapping from one scan thread to another. Alas, business requirements do not always match what makes sense technically.
There are a few changes you can make to your ONCONFIG file to better support decision support systems. The following configurations are from the Informix Masters Series training manual:
I actually prefer BUFFERS to be 20000—to provide support for OLTP activity as required, the tables are large enough to dwarf the buffers in all cases, and there is still OLTP style work to be done to the data.
Any DSS query similar to the example described in this article will have three principle time-consuming components. A read of Table 1, a read of Table 2, and a join step. If the query limits the selection statement with some sort of filter (a key between 100 and 200) then the read components become easy to handle with an index. For example, read Table 1 for these rows and probe Table 2 for the matching rows, where the actual join component occurs during the probe.
If the query is not limited with a filter, then a full join of both tables is required. This is very inefficient when performed with an indexed read. It is far better to scan both tables and perform the join in memory. A hash join is the best way to perform this. Tuning, therefore, becomes a matter of shortening the three stages of the query (scan1, scan2, and the join itself ). With a light scan the database can read (in this case, 300 thousand rows per second) from the first table while building the hash table. With the second table, the database can read only marginally less quickly including the hash join. And with a high enough PDQPRIORITY the database has enough memory to perform the hash join in memory without expending time to swap the hash table out to temp disk. Even if the hash tables are pushed to temp disk, if the temp disks are laid out well, it’s not that expensive.
If you can spread the read load over multiple disks, reserve enough memory, and set aside enough temp disks, you can bring any query time down to within a 5 minute window to scan each table and, potentially, another minute or two to wrap up the join. If you are getting a 40 hour, or 40 day, query then one or more of these things isn’t happening.
About the Author
Jack Parker is a Database Administrator at Engage Technologies. He has been working with Informix products for 14 years and has specialized in data warehousing for the past 5 years. He can be reached at firstname.lastname@example.org .