There are many books and papers that elaborate on Informix Dynamic Server (IDS), its architecture, and performance tuning, but few of them deal with monitoring its performance in detail. Yet effective monitoring is critical in IDS administration; it helps you collect valuable statistics and identify database problems early on, providing a proactive means of performance tuning. Once the Informix database has been successfully implemented, monitoring its performance should be the database administrator's top priority.
This article will discuss in detail how to effectively monitor Informix Dynamic Server performance at all levels and will provide some general tips on troubleshooting and performance tuning your Informix database system. Specifically, it will examine instance activity, database activity, and session activity, using the monitoring tools that Informix provides.
Informix provides two main tools for monitoring system performance: the onstat utility and numerous system monitoring interface (SMI) tables. Both the onstat utility and SMI tables monitor IDS performance by examining its shared memory activity, but there are some key differences between the two. The onstat utility reads IDS shared memory directly and prints out its statistics. It is easy to use, but the statistics it collects are often cumulative from the time IDS was brought up and appear in a fixed format containing more information than you really need. SMI tables, on the other hand, are stored in the sysmaster database, which is created automatically when IDS is initialized. SMI tables are not real tables; they are dynamic views or, rather, "pseudo-tables" that serve as pointers to IDS shared memory structures. Thus, the data in SMI tables reflects IDS activity more accurately. Moreover, SMI tables can be queried with standard SQL and can display data in a more meaningful, more readable format.
IDS activity can be classified into three categories: instance activity, database activity, and session activity. Using the tools provided by Informix, the activity in each category can be effectively monitored.
Monitoring Instance Activity
An IDS instance refers to Informix shared memory, Informix databases, and the physical devices allocated to Informix. The following are some of the most important instance activities to monitor:
The first and most important instance activity is, of course, IDS’s operating mode. Is IDS running all right? Is it having any problems? Is it down? The onstat -p command provides the following output:
The first line displays the current IDS operating mode. In this case, the database is "On-Line." There are six operating modes. Of those six, three are particularly important: Off-Line, Quiescent, and On-Line. The Off-Line mode indicates that IDS is not running. The Quiescent mode indicates that IDS is running in single-user mode, in which only the DBA can perform administrative and maintenance work. The On-Line mode indicates that IDS is running normally and all users can connect to the database server and perform any kind of operation. In most circumstances, IDS should always be in the On-Line mode. If for some reason IDS is down or in the Off-Line mode, the command will display the following message:
In this case, you will need to check the online message log to further identify the root cause of the problem (see the "Message Log" section below).
In addition to the current operating mode, the above output also provides some important statistics about your system's performance. The two %cached fields indicate how effectively IDS is using its cache: the first shows the read cache rate, and the second shows the write cache rate. These rates can vary dramatically depending upon the applications and the type and size of the data being operated on, but, in general, both should be in the 80-90 percent range. If these rates are consistently lower than 80 percent, you may want to consider increasing the value of the BUFFERS parameter in your Informix configuration file to achieve higher read and write cache rates. Low read and write cache rates indicate that IDS is doing far more disk reads and writes than it should, which will greatly slow down performance.
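As a rough illustration, the cache rates onstat -p reports can be derived from the buffered versus physical read and write counters. The sketch below uses hypothetical sample values for those counters (not output from a live instance) to show the arithmetic:

```shell
# Sketch: how the %cached figures are derived from onstat -p counters.
# All four counter values below are hypothetical sample numbers.
dskreads=4510; bufreads=392800     # physical reads vs. buffered reads
dskwrits=1220; bufwrits=88400      # physical writes vs. buffered writes

# cache rate = 100 * (buffered - physical) / buffered
read_cache=$(awk -v d="$dskreads" -v b="$bufreads" \
    'BEGIN { printf "%.2f", 100 * (b - d) / b }')
write_cache=$(awk -v d="$dskwrits" -v b="$bufwrits" \
    'BEGIN { printf "%.2f", 100 * (b - d) / b }')

echo "read %cached:  $read_cache"
echo "write %cached: $write_cache"
```

With these sample counters, both rates land comfortably above the 80-90 percent guideline discussed above.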
The seqscans field of the output indicates how many sequential scans have been performed since the database has been up. If this number is large (say over 100,000), and ever increasing, it may indicate some performance problems, especially in an OLTP environment, and a need to do some further investigation. We will discuss this in more detail later in the article.
The ovlock field indicates the number of times IDS attempted to exceed the maximum number of locks. If this number is non-zero, you may need to increase the value of the LOCKS parameter in the ONCONFIG file. The ovbuf field indicates the number of times IDS attempted to exceed the maximum number of buffers. If this number is non-zero, you may need to increase the value of the BUFFERS parameter in the ONCONFIG file. The bufwaits field indicates the number of times a user thread had to wait for a buffer. If this number is huge (say over 100,000), you may need to increase your BUFFERS parameter so that users do not have to wait for buffers when they need to access data from disk. This will reduce response time and, hence, improve performance.
Another group of important fields includes ixda-RA, idx-RA, da-RA, and RA-pgsused. Together, these fields indicate how effectively IDS is using its read-ahead mechanism. Read-ahead is the operation of reading a number of data pages ahead from disk into memory during sequential scans or index reads. The ideal situation is that the number of pages read ahead (the sum of ixda-RA, idx-RA, and da-RA) is equal to the number of pages used during the sequential scans or index reads (RA-pgsused), indicating that the read-ahead pages are used 100% for sequential scans or index reads. If there is a significant difference between these two numbers (>10,000), either positive or negative, IDS is not using read-ahead very efficiently, and you may need to tune your read-ahead parameters (RA_PAGES and RA_THRESHOLD) for better performance. Refer to the Administrator's Guide for details on how to tune these parameters.
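The read-ahead comparison just described can be sketched as follows. The counter values are hypothetical (shell variable names substitute underscores for the hyphens in the onstat field names):

```shell
# Sketch: comparing pages read ahead against pages actually used.
# The four counter values are hypothetical sample numbers.
ixda_RA=21500; idx_RA=1800; da_RA=95200; RA_pgsused=117900

read_ahead=$((ixda_RA + idx_RA + da_RA))   # total pages read ahead
diff=$((read_ahead - RA_pgsused))
abs_diff=${diff#-}                          # magnitude, sign stripped

echo "pages read ahead: $read_ahead, pages used: $RA_pgsused, gap: $abs_diff"
if [ "$abs_diff" -gt 10000 ]; then
    echo "consider tuning RA_PAGES / RA_THRESHOLD"
else
    echo "read-ahead looks efficient"
fi
```

Here the gap is only 600 pages, so read-ahead is being used efficiently; a gap above the 10,000-page guideline would suggest retuning.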
The message log is also referred to as the online log. It contains a variety of information about critical instance activity, such as time and duration of checkpoints, startup and shutdown, backup and restore status, logical log backup status, and changes to major configuration parameters. The message log also contains critical errors (referred to as assertion failures by Informix) such as disk I/O errors, down chunks, data integrity errors, and so on. When assertion failures occur, the message log will usually point you to the relevant assertion-failure ("af.xxx") file, which records more detailed information about when the database went down and sometimes gives suggestions on how to fix the problem. The following is an excerpt from the message log:
The above output indicates that an internal system error caused the database to shut down. This happened while user "informix" was logged in and querying the database. The file /tmp/af.cdf81 recorded detailed information about database activity at the time of the shutdown. In this particular case, you may need to check the system message log as well to see if any system hardware is defective, and then try running the oninit command to restart the database.
Chunks are the physical disk storage devices. They should always be on-line. If any IDS chunk is off-line, it indicates some data corruption and requires immediate attention. The onstat -d command provides current chunk status as follows:
Notice the output contains two sections. The first section lists all dbspaces, and the second section lists all chunks. In the Chunks section, pay close attention to the flags field. The first character of this field indicates whether the chunk is a primary ("P") chunk or a mirrored ("M") chunk. The second character of the flags field indicates the current status of the chunk, either on-line ("O") or off-line ("D"). Since "O" and "D" can look very much alike, especially when you are in a hurry, you may want to pipe the output to grep (onstat -d | grep PD) just to make sure that you don't miss any down chunks. If there are any down primary chunks, you will need to perform either a cold or warm recovery from backup tapes immediately to ensure data integrity.
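A small sketch of that down-chunk check, run here against a hypothetical sample of onstat -d chunk lines rather than a live instance (the addresses, sizes, and paths are invented):

```shell
# Sketch: flag down chunks by inspecting the flags field of onstat -d output.
# The chunk lines below are a hypothetical sample, not real onstat output.
chunks='address  chk/dbs offset   size     free     flags pathname
a1b2c3   1   1   0        50000    12000   PO-   /dev/rootdbs
d4e5f6   2   2   0        250000   8000    PD-   /dev/datadbs1'

# A flags value starting with "PD" marks a down primary chunk.
down=$(printf '%s\n' "$chunks" | awk '$7 ~ /^PD/ { print $8 }')
if [ -n "$down" ]; then
    echo "DOWN chunks: $down"
else
    echo "all chunks on-line"
fi
```

Matching on the flags column rather than eyeballing "O" versus "D" avoids exactly the misreading the text warns about.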
A checkpoint is the process of synchronizing pages on disk with pages in the shared memory buffer pool. During checkpoints, IDS prevents user threads from entering critical sections and blocks all transaction activity. Thus, if checkpoint duration is long, users may experience system hanging. This is especially true in OLTP environments where there are thousands of transactions and response time is most critical. As noted above, you can monitor checkpoint duration by looking at the message log, but the better and faster way to do this is to use the onstat -m command.
The output displays the duration of the latest checkpoints:
If checkpoint duration is consistently longer than 10 seconds, you may need to reduce the values of the LRU_MIN_DIRTY and LRU_MAX_DIRTY configuration parameters to achieve shorter checkpoint duration. Also, if onstat -F shows extremely high chunk writes (>10,000) and this number is ever increasing, it may indicate one of two problems: either checkpoint intervals are too short and cleaners do not have enough time to write modified buffers to disk between checkpoints, or there are too few AIO VPs to share the heavy disk writes during checkpoints. You may need to re-examine the CKPTINTVL, LRUS, CLEANERS, and NUMAIOVPS configuration parameters and increase their values accordingly. For detailed information on how to tune those parameters, refer to the Informix-OnLine Dynamic Server Administration Guide.
Knowing if there is enough space in each dbspace is especially important for a production database with constant data loads and growing demands. The script below takes a snapshot of each dbspace, calculating their disk usage in pages.
The output is as follows:
This output helps to identify dbspaces that are running out of space. To be proactive, consider allocating additional disk space when dbspace disk usage approaches 90%.
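Since the original listing is not reproduced here, the shape of such a dbspace-usage calculation can be sketched as follows. The dbspace names and page counts are hypothetical sample data; a live version would compute them from the sysmaster database (for example, from the syschunks pseudo-table):

```shell
# Sketch: per-dbspace usage in pages, from hypothetical sample counts.
# Columns: dbspace name, pages allocated, pages used.
usage=$(printf '%s\n' \
    'rootdbs   100000  62000' \
    'datadbs1  500000  460000' \
    'logdbs    50000   21000' |
awk '{ pct = 100 * $3 / $2
       warn = (pct >= 90) ? "  <-- allocate more space" : ""
       printf "%-10s %8d of %8d pages used (%5.1f%%)%s\n", $1, $3, $2, pct, warn }')
echo "$usage"
```

The 90 percent threshold in the awk condition mirrors the proactive guideline above: datadbs1, at 92 percent, gets flagged before it actually runs out of space.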
dbspace I/O is measured by disk reads and writes. If some dbspaces have heavy disk reads and writes while others scarcely have any, then disk I/O bottlenecks in the system may result. A well-balanced dbspace I/O will ease system disk I/O loads and, hence, improve overall system performance. The following script will display I/O statistics for each dbspace:
The output is as follows:
The goal is to achieve balanced disk reads and writes across all dbspaces, though perfect balance is unrealistic in most cases. The output above gives you an idea of how dbspace I/O is distributed and can help you identify "hot" dbspaces, those with the most disk reads and writes. If disk reads and writes are extremely high for some dbspaces and extremely low for others, you may need to adjust your physical disk layout.
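One way to make hot dbspaces jump out is to express each dbspace's reads plus writes as a share of total I/O. The counts below are hypothetical sample data (a live version might pull the per-chunk figures from the sysmaster database):

```shell
# Sketch: each dbspace's share of total disk I/O, from hypothetical counts.
# Columns: dbspace name, disk reads, disk writes.
io_report=$(printf '%s\n' \
    'rootdbs   12000   8000' \
    'datadbs1  910000  440000' \
    'datadbs2  15000   9000' |
awk '{ name[NR] = $1; io[NR] = $2 + $3; total += io[NR] }
     END { for (i = 1; i <= NR; i++)
               printf "%-10s %7.1f%% of all I/O\n", name[i], 100 * io[i] / total }')
echo "$io_report"
```

In this sample, datadbs1 absorbs nearly 97 percent of all I/O, a clear candidate for spreading tables across other dbspaces or disks.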
You can further identify what tables have the most disk reads and writes by querying the sysptprof table in the sysmaster database:
Based upon the output you get from this query, you may need to move some tables among dbspaces to get better I/O balance.
Shared Memory Segments
Too many virtual shared memory segments (usually more than three) indicate that the initial virtual shared memory segment is too small and that the database is constantly having to allocate additional virtual shared memory segments. This adversely affects IDS performance and will eventually bring your system to its knees. The onstat -g seg command displays how many shared memory segments currently exist:
If the output shows more than three virtual shared memory segments, you will need to increase the value of the SHMVIRTSIZE parameter in your configuration file. The idea is to let IDS allocate enough virtual shared memory at initialization so that it doesn't need to allocate more when users are logging onto the system and doing database operations. For detailed information on how to calculate IDS virtual shared memory segment size, please refer to the Informix-OnLine Dynamic Server Administration Guide.
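Counting the virtual segments can itself be scripted. The segment listing below is a hypothetical sample in the general shape of onstat -g seg output (ids, keys, and sizes are invented); the class column distinguishes resident ("R") from virtual ("V") segments:

```shell
# Sketch: count virtual (class V) shared memory segments in a
# hypothetical sample of onstat -g seg output.
segs='id       key        addr     size       ovhd     class blkused  blkfree
4501     52bc4801   a000000  104857600  848      R     12800    0
4502     52bc4802   a640000  419430400  3392     V     50000    1200
4503     52bc4803   bf40000  104857600  848      V     12000    800'

vsegs=$(printf '%s\n' "$segs" | awk '$6 == "V" { n++ } END { print n + 0 }')
echo "virtual segments: $vsegs"
if [ "$vsegs" -gt 3 ]; then
    echo "consider raising SHMVIRTSIZE"
else
    echo "virtual segment count is healthy"
fi
```

Dropped into a cron job, a check like this can warn you before segment sprawl becomes a performance problem.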
Overall Operating System Performance
To accurately evaluate IDS performance, you need to take into account the behavior of the operating system as a whole, especially if the database resides on a non-dedicated database server. If IDS takes too much RAM (for instance, if your system has 512MB RAM and IDS takes 400MB or more for its shared memory), your operating system may experience heavy swapping and hanging when users perform memory-intensive operations. When there is not enough memory, the system has to "swap" some data pages in memory to disk to leave space for new data. And if the system is short of memory, the CPU may also suffer. There are many UNIX utilities to monitor overall operating system CPU and memory utilization. Below is the output from "top":
The output is broken into two sections. The first section gives you a summary of the overall operating system's CPU and memory usage, and the second section provides detailed information about each processor. Other utilities, such as vmstat, iostat, ps -ef, and sar, are also useful in collecting current operating system performance statistics. vmstat shows how much of the system's memory is currently swapped; iostat and sar display current I/O distribution among all physical disks; and ps -ef prints out detailed information about each current process, including its start time, CPU, and memory usage. In addition, there are many graphic tools available that allow you to graph the dynamic movements of your operating system's resource utilization and performance.
Monitoring Database Activity
The purpose of monitoring database activity is to ensure that every database is performing at its peak ability at all times. This implies that you need to watch for potential performance problems, identifying their root cause and eliminating them from the start. Here are a few things to watch for:
An extent is a block of physically contiguous pages. However, if a table has more than one extent, there is no guarantee that those extents are contiguous; extents may be scattered throughout the dbspace where the table resides. Contiguity of physical pages is important to performance. When pages of data are contiguous, the time used to access data on disk is minimized and the database can read rows sequentially. If tables have too many extents, it is very likely that those extents are interleaved. This harms performance a great deal, because when you retrieve data for a certain table, the disk head needs to seek multiple, noncontiguous extents belonging to this table instead of one large extent with contiguous physical pages. This slows down disk-seeking speed rather significantly. The following script detects database tables with multiple extents:
The output is as follows:
If any of the tables has 10 or more extents, with the exception of large fragmented tables, you should consider rebuilding them to consolidate the extents. For details on how to estimate and allocate extent size for tables, please refer to the Informix-OnLine Dynamic Server Performance Guide.
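Since the original listing is not reproduced here, the extent check can be sketched as follows. The table names and extent counts are hypothetical sample data; a live version would query the sysmaster database for real per-table extent counts:

```shell
# Sketch: flag tables with 10 or more extents, from hypothetical counts.
# Columns: table name, number of extents.
fragged=$(printf '%s\n' \
    'customer   3' \
    'orders     27' \
    'order_item 14' \
    'state      1' |
awk '$2 >= 10 { printf "%s has %d extents -- consider rebuilding\n", $1, $2 }')
echo "$fragged"
```

Only the tables crossing the 10-extent guideline surface in the report; healthy tables such as customer and state stay out of the way.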
Index levels may also adversely affect performance. The more index levels, the more reads IDS needs to reach the index leaf nodes. Furthermore, if a leaf node gets split or merged, it may take more time for the whole index to adjust to this change. For example, if an index has only two levels, only two levels need to be adjusted, but if it has four levels, then four levels need to be adjusted accordingly. The time used for this adjustment is, of course, much longer. This is especially true in an OLTP environment where there are constant inserts, deletes, and updates. The following script identifies how many levels there are for each index:
The output of the script is as follows:
If any index has more than four levels, you may want to consider dropping and rebuilding it with fewer levels for better performance.
A highly duplicate index can severely impact performance for updates and deletes. Suppose you have an index on column customer_type in table customer, and there are only five possible customer_type codes. If the table has one million rows, there could potentially be 200,000 rows with the same type code. The B-tree would store the key value, followed by a list of pointers to each of the physical rows. The problem occurs when you have to delete or update any of the key values. IDS must search through all the duplicates until it finds the correct key to delete or update!
The following is a script to identify highly duplicate indexes:
The output may look like this:
Ideally, each value in the nunique column would equal the corresponding value in the nrows column, indicating that every key in the index is unique. Based on the number of rows (the nrows column above) and the number of unique keys (the nunique column above), we can calculate each index's uniqueness percentage as (nunique / nrows) * 100.
The higher the percentage, the more unique the index. To avoid the performance bottleneck of a highly duplicate index, you can replace the original index with a composite index that combines the highly duplicate column and a more unique column. Using the example above, you can add the primary key column customer_id to the original index and make it a composite index (e.g. "create index index_name on customer (customer_type, customer_id)").
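The uniqueness calculation can be sketched as follows. The index names and counts are hypothetical sample data in the shape of the nrows/nunique output above:

```shell
# Sketch: uniqueness percentage per index, (nunique / nrows) * 100.
# Columns: index name, nrows, nunique (hypothetical sample values).
idx_report=$(printf '%s\n' \
    'customer_pk     1000000  1000000' \
    'cust_type_idx   1000000  5' |
awk '{ printf "%-15s %8.4f%% unique\n", $1, 100 * $3 / $2 }')
echo "$idx_report"
```

A primary-key index scores 100 percent, while an index on a five-value code column over a million rows scores 0.0005 percent, exactly the kind of highly duplicate index the composite-index technique above is meant to fix.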
Sequential Scans
Sequential access to a table is sometimes harmful to performance because the database reads every row of the table once for every row selected from the preceding tables. If the table is small, it is harmless to read it repeatedly because the table resides completely in memory (actually, a sequential scan of an in-memory table is much faster than searching the same table through an index, especially with Informix’s light scan mechanism). But if the table has 100,000 rows or more, repeated sequential scans are deadly to performance. The following script will identify tables with multiple sequential scans:
The output is as follows:
From the output above, you can see that the customers table has an extremely high number of sequential scans. If this is a big table with thousands or millions of rows, you may need to consider adding indexes to it, or using optimizer directives to make the query optimizer choose indexes for accessing data in this table rather than sequential scans.
Monitoring Session Activity
Statistics about session activity are very useful in identifying potential performance problems and troubleshooting. What session activity statistics can you collect using the monitoring tools discussed earlier in this article?
The syssessions table in the sysmaster database stores some general information about each session, such as login name, login time, host machine from which the session was logged in, operating system’s process ID, current state, and so on. You can query this table to get all this information:
The output looks like this:
s_state is a hex number that indicates current activity of the session (refer to page 38-27 of the Informix-OnLine Dynamic Server Administrator’s Guide for a detailed description of each of these activities). login_time is an integer that indicates the time the session logged in, which can easily be converted to a regular time format using a C program. You can request this program from Informix technical support.
The syssesprof table in the sysmaster database provides more detailed information about each session. With the following query, you can get a better idea how each session is interacting with the database:
The output is as follows:
The access field shows how often the session is hitting the database. locksheld shows how many locks each session is using. seqscans indicates how frequently each session is using sequential scans to access data; if this number is too high (>1000), you may want to question whether the session ever uses indexes to retrieve data, and examine its query execution plan more closely to see if it is optimal. total_sorts and dsksorts indicate how efficiently each session uses memory for sort operations. You can calculate the percentage of sorts each session performs in memory with the following formula:
((total_sorts - dsksorts) / total_sorts) * 100
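Plugging hypothetical session numbers into that formula looks like this:

```shell
# Sketch: in-memory sort percentage for one session,
# ((total_sorts - dsksorts) / total_sorts) * 100.
# The counter values are hypothetical sample numbers.
total_sorts=5200; dsksorts=130
sort_pct=$(awk -v t="$total_sorts" -v d="$dsksorts" \
    'BEGIN { printf "%.1f", 100 * (t - d) / t }')
echo "in-memory sorts: ${sort_pct}%"
```

With only 130 of 5,200 sorts spilling to disk, this sample session sorts in memory 97.5 percent of the time.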
The higher the percentage, the more efficient your sort operations are. You can join the syssessions and syssesprof tables to further identify each session's username and host machine, so that you know where potential database and system problems are coming from. The following query can retrieve all this information:
You can also join the syssessions table with the syslocks table to get more detailed information about locks, such as which table in which database is locked by which session, to help you identify potential lock conflicts among users:
The output looks something like this:
If there are some conflicts with lock usage, for instance if one user needs exclusive access to a table that has been locked by others, you can easily identify the owner of that lock and, based on the priority of users, issue an onmode -z sid command to kill the session and release the lock (where sid is the owner field in the above output). This command can only be executed by user "informix".
Query statistics are critical in troubleshooting and query optimization. The onstat -g sql sid command captures queries and related statistics for current sessions (where sid is the session ID, which can be plugged in either manually or dynamically by a UNIX shell program). For example, if you want to know what query a session is executing, you may first use the command onstat -g ses to find out its session ID and then plug this ID into the above command (e.g. onstat -g sql 28953). The output will look like this:
The first section of the output gives you some general statistics about the query being executed, such as which database the query is executed against and its isolation level and lock mode. The two most interesting fields are SQL Error and ISAM Error. If these fields are non-zero, the query is having problems and is not executing properly. You can use the Informix finderr utility to find out exactly what the problem is and then, after correcting the problem, kill the session and execute the query again.
The "Current SQL statement" section displays the syntax of the query being executed. This is very helpful in diagnosing a problem query and query optimization. If you find some problems with the query such as its response time is too long, and it consumes too much system CPU and memory, you can make a copy of the query as it is displayed for later investigation and analysis. You can then run the same query against the Informix dbaccess utility and get more detailed statistics on the query’s performance such as its execution plan and join strategy to identify the root cause of the problem. Based on the statistics you collected from your study, you can further optimize your query for better performance. The "Last parsed SQL statement" section displays the query that has been parsed in the memory. Since the query that has been parsed in the memory is the query that is being executed, these two sections are exactly the same.
To take an even more active approach in monitoring, you can modify the alarm program provided by Informix during IDS installation. The alarm program is a UNIX shell script that is automatically called by IDS when certain errors occur. IDS classifies all instance errors into five severity levels, one being the lowest and five the highest. You can set up the alarm program so that it emails the DBA or sends a message to his or her pager when instance assertion failures occur. For detailed information on how to modify the alarm program, and for sample alarm programs, please refer to the Administrator's Guide and the Informix-OnLine Dynamic Server Performance Guide.
In addition, you should perform some necessary maintenance routines to ensure that your databases keep running healthily. Some basic routines are: verifying and repairing data and index integrity, regularly updating internal statistics for the query optimizer, and regularly recycling unused shared memory segments.
Monitoring Informix Dynamic Server performance is an on-going task. The value of this task is not so much in collecting statistics, but in identifying and troubleshooting potential system and database problems.
Through effective monitoring practices, as described in this article, you will be able to successfully identify system and database problems in their earliest stages and take a proactive approach to performance tuning.
About the Author
Jianing Fan is a software engineer at Motorola specializing in relational database management systems. He is an Informix Certified Professional and an Oracle Certified Professional, and has over 10 years of database and system experience as a developer, system administrator, and DBA. Jianing is a devoted reader of Tech Notes; this is his first contribution. Jianing can be reached at email@example.com.