Informix Logo



Troubleshooting and Maintaining Your Servers

by Ron M. Flannery rflanner@speedlink.net
Tech Notes 1999, Volume 9, Issue 2


Overview

Introduction

This article describes strategies that help you tune and monitor your Informix database servers. In it, you will learn the day-to-day tasks that can help keep your systems healthy for a long time. In the event that problems do occur, it will help you understand how to solve them. This chapter will help you learn the following:

  • How to perform ongoing system monitoring. There are many basic tasks that can help you understand what is going on with you system, preventing problems later and keeping the system running at its best levels.
  • Methods of tuning and checking your databases. There are different ways to help keep your systems charged up and error-free for the long term. Many of these tasks need not be performed more than once a week-or even once a month.
  • How to troubleshoot problems. Much of troubleshooting is a learning process. However, thinking things through and following basic procedures will help you more easily get to the bottom of problems.

Tracking Disk Usage

Disk configuration can often be a tough balance in Informix Dynamic Server. A prevailing goal is to make the most efficient use of your disks while not filling them up. Disk layout should be planned carefully, especially when you are first creating the Informix instance. Obviously, it isn't possible to completely predict the usage of your system, but it's important to put the most you can into the planning of disk space. If you need to change something, you can add chunks later. Following is the output of onstat -d, which shows the current dbspaces and chunks in your Informix instance:

Dbspaces
address number  flags  fchunk  nchunks flags  owner      name
a28e3e   1      	1      1      1       N     informix   rootdbs
a28e4e   2      	1      2      1       N     informix   dbspace
a28f2a   3      	1      3      1       N T   informix   tempdbs
 2 active, 2047 maximum
Chunks
address chk/dbs offset  size  free bpages  flags pathname
a26f220 1   1   0      25000  10000         PO-  /usr/informix/rootdbs
a26f240 2   2   0      60000  45000         PO-  /usr/informix/chunk1
a26f290 3   3   0      50000  49000         PO-  /usr/informix/tempdbs
 2 active, 2047 maximum

The output is divided into two parts: dbspace information and chunk information. In a nutshell, a dbspace is a collection of chunks. Each chunk represents all or part of a physical disk drive. The dbspaces are displayed in the top part of the output followed by the chunks that map to them. The "number" field in the dbspaces section is the dbspace number, which maps to the "dbs" field in the chunks portion of the screen. For this discussion, we will only look at the chunk usage.

The columns that are of particular interest are "size" and "free." These columns define in pages the size of the chunk and how much of it remains free. The page size can be found by using the command onstat -b. It is very important to monitor the amount of available pages. This is particularly true in the root dbspace: if it fills, your whole instance can be forced off line.

To watch the activity for certain chunks, use the command onstat -g iof. The following is the output of this command:

AIO global files:
gfd pathname totalops dskread dskwrite io/s
3 rootdb1 15562 12746 2816 0.0
4 dbspace1 64 64 0 0.0

If you notice that one of the chunks has a high number of reads or writes when compared to other chunks, it could be causing performance problems. In this case you can consider redistributing the data on that chunk.

Watching VPs

The virtual processors (VPs) help Informix process things efficiently. VPs are "mini CPUs" that are actually operating system processes. These VPs handle user "threads" and provide the multi-threading capabilities of Informix. It's important to have enough VPs and have them properly configured.

The command onstat -g glo provides information about all the current virtual processors as well as statistics for each class of virtual processor (CPU, for example). You can use the results of this command to search for certain classes of VPs that might have an inordinate amount of activity, causing you to add more VPs.

The command onstat -g ioq can help you determine if you need to add more AIO VPs and has output like the following:


AIO I/O queues:
q name/id     len maxlen totalops dskread dskwrite dskcopy
 adt	0	0	0	0	0	0	0 
 opt	0	0	0	0	0	0	0 
 msc	0	0	1	1423	0	0	0 
 aio	0	0	1	2	1	0	0 
 pio	0	0	1	215	0	215	0 
 lio	0	0	1	241	0	241	0 

If this command shows an I/O queue that continues to grow, you might need to add more AIO VPs.

Monitoring Shared Memory

The amount of memory used by Informix can vary, depending on the activity of your system. Memory is initially allocated as the amount of KB given by the onconfig parameter SHMVIRTSIZE and can be dynamically allocated by Informix Dynamic ServerT in KB chunks of SHMADD. Informix Dynamic Server will abort if the amount of memory reaches the amount in SHMTOTAL. To pay attention to how much memory Informix has allocated, simply use the command onstat -, which quickly displays the total amount used by Informix Dynamic Server. For information on specific memory segments that were allocated, use the command onstat -g seg, which displays information as shown in this sample output:

COMMAND: onstat -g seg


Segment Summary:
 (resident segments are not locked)
id    key       addr     size     ovhd  class  blkused  blkfree 
0    1387939841  a000000  2392064  848   R      288     4    
1    1387939842  a248000  8192000  720   V      325     675   

When Informix is initially brought on line, there is one virtual memory segment, which is equal to the size of SHMVIRTSIZE. If more segments are added, you will see them in the output above, with a class of "V." Alternately, you can watch the message log for messages that mention "dynamically allocated new shared memory segment (size xx)." To do this, use onstat -m (to see last 20 entries in message log) or look through the whole message log by using a patter matching command (grep on UNIX).

In a DSS environment, you can also set the variable DS_TOTAL_MEMORY. This value is typically set to a certain percentage of SHMTOTAL. The percentage depends on how many DSS versus OLTP queries are performed. The DS_TOTAL_MEMORY usage is monitored through onstat -g mgm.

Monitoring User Activity

Performance problems can sometimes be totally related to what the users are doing. If random queries are allowed on a system, or there are too many queries, system performance can be greatly reduced. To find out which users are using the Informix instance, use the command onstat -u, which displays output that looks like this:


Userthreads
address flags    sessid  user   tty   wait tout locks nreads  nwrites
a270010  ---P--D 0       root   -     0    0    0    324     166
a270444  ---P--F 0       root   -     0    0    0    0       0
a270878  ---P--B 4       root   -     0    0    0    0       3
a270cac  ---P--D 0       root   -     0    0    0    0       0
a270cbd  ---P--D 49      ron    -     0    0    0    3224    1423
a270cbd  ---P--D 50      bob    -     0    0    0    7201    1244
 6 active, 128 total, 38 maximum concurrent

A quick glance at this output can tell you who has the highest number of reads and writes. On larger print outs, it might be more difficult to monitor, however. The last line of the output can help you decide if the number of users is the problem.

To follow through on a particular query, you can use the session ID ("sessid") that is displayed here. Executing the command onstat -g ses sesid displays information about that particular session, including the query and more detailed information. For example, the following shows part of the output of the command onstat -g ses 49:


session                         #RSAM   total    used 
id    user   tty   pid  hostname threads memory   memory
49    ron    4     9798 sparky   1      65536    55480 
                                  
tid   name     rstcb    flags  curstk  status         
412   sqlexec   a271514  Y--P--- 1872    cond wait(netnorm)   
                                  
Memory pools  count 1                       
name       class addr    totalsize  freesize  #allocfrag  #freefrag 
374        V     a42a010 65536      10056     156        8
...
Sess SQL        Current      Iso  Lock     SQL  ISAM  F.E. 
Id   Stmt type   Database     Lvl Mode      ERR  ERR  Vers 
374  SELECT      stores7      NL  Not Wait  0    0    7.14 
                                   
Current statement name : slctcur                   
                                   
Current SQL statement :                        
 select * from customer                       
                                   
Last parsed SQL statement :                      
 select * from customer

There are lots of useful fields in the output of this command, including the query, memory, user, and other information. You can obtain a smaller piece of the same information by using the command onstat -g sql sesid. To get information for a particular thread within the session, issue the command onstat -g tpf tid where tid is the value shown in the output of the above command.

You can also use onstat -g wai to find threads that are waiting to be processed by Informix. Using onstat -g rea shows threads that are ready to be executed.

Using Performance Monitoring Utilities

Another way to monitor performance and health of your systems is through monitoring tools. These tools are designed to monitor things like CPU usage, memory usage, and possible problems. Some tools provide support for watching performance within the database, including query monitoring, dbspace usage, and more. Choosing an automated tool like this can save a lot of the manual steps that I previously mentioned. These tools are made to automate these tasks and can greatly simplify the life of an administrator.

There are many options in these tools and more becoming available. Some of the tools provided by Informix include onperf, oncockpit, and Informix Enterprise Command Center (IECC). IECC is a GUI-based tool that greatly simplifies administrative and monitoring operations. In addition, there are several third-party tools, like Compuware's ECO Tools and BMC's Patrol.

Using the sysmaster database

The sysmaster database is created in shared memory when you start Informix Dynamic Server. It provides a good deal of the information provided by the "on" commands. Using sysmaster allows you to build your own queries and programs that do monitoring. For example, you can create a program that will monitor much of the information provided by the onstat -p (system profile) command. This will let you create your own alarm conditions and process them accordingly. For a full description of how to use sysmaster, see the article "SMI and the sysmaster Database" by Lester Knutsen that appeared in Tech Notes Volume 8, Issue 4, 1998 (posted on TechInfo Center).

Watching the Operating System and Network

Of course, performance problems can occur on many different levels. If you are experiencing extremely slow keyboard response time, for example, the problem could reside in the network. Likewise, if your Informix instance is starting to have problems accessing some of the disk drives, you might be having hardware errors. Finally, if the system is strapped for memory or CPU, you might look to other things that are running on the operating system. Remember that Informix is sharing disk, memory, and CPU with the operating system, which needs its own resources. Keep in mind, though, that they aren't available on all UNIX systems. If they aren't available, find out which utilities your system offers. You can get information you need by using the UNIX man command.

These commands can give you information that you can use in conjunction with the information you get from Informix. Between the two, you should have a good idea if the problem is an Informix configuration issue or just a lack of resources (like CPU).

Most operating systems have their own log files. These files can show errors that could later affect Informix (I/O errors, for example). They should also be regularly watched.

Don't forget that user applications might be causing the problems. By following some of the listed methods to monitor user sessions, you can trace the original SQL that was being executed by the user. If you find users that are doing certain queries (sequential scan, for example), you might want to suggest indexes or changes to their applications.

Creating Long-Term Stability

Some ongoing maintenance commands need to be performed, but not as often as many of the commands we've already discussed. These commands might best be executed in a batch job that runs on a regular schedule. This next section desribes some of these commands.

Updating Statistics
The statistics for your databases help the Informix query optimizer work most effectively. Statistics tell Informix what kind of data is in the database and what its approximate values are. This information helps the query optimizer find the best way to do queries. Some common update statistics commands include:


	update statistics medium distributions only;
	update statistics low for table customer;
	update statistics high for table customer(cust_nbr);

The strategy for updating your statistics must be chosen carefully. The "Informix Performance Guide" explains the proper strategy.

Checking your Tables and Indexes
Sometimes Informix table and index data can become damaged. If the damaged data is not accessed, no one may become aware of the problem until more damage has occurred. Either case is not good. One way to prevent this is to regularly check your tables and indexes by issuing the following commands:


	oncheck -cD	-- checks data pages and answers no to questions
	oncheck -cI	-- checks index pages and answers no to questions

Note the "n" supplied for every command. This instructs oncheck to ignore any questions about trying to fix the data if an error is found. If you find that you need to correct errors, you can run oncheck again and manually answer or run the command with "y" instead of "n" at the end.

    Tip:
    In some cases, it might be difficult or impossible for oncheck to fix the problems. If the problem is an index, it might be possible to drop and re-create the index. If it is data pages, you can try to unload the data but might not be able to unload all of your data.

Checking System Information
Each Informix instance has critical information that is contained in its "reserved pages." These pages are a road map to the data in the Informix instance, including information about each chunk, checkpoints, archives, general instance configuration (should match onconfig), and other statistics. If the reserved pages are damaged, Informix can have serious problems. To validate and possibly correct problems, use the command oncheck -cr. To just display the reserved pages, use oncheck -pr.

Each database has a set of system catalogs that contains information about its tables, indexes, and other items in the database. These catalogs are tables that begin with "sys". It is crucial that these catalogs have the proper information; if damaged, they can make an entire database inaccessible. To check the catalogs, run the SQL command "update statistics" on each database in the instance and then run the command oncheck -cc.

Reviewing Table Structure: Extents and Otherwise
Over a period of time, the data in tables can be spread over multiple disk drives and/or mix with data from many other tables. The way to avoid this is to create the proper extents-disk space dedicated to certain tables.

Tables with too many extents can have a negative impact on Informix response times since data is scattered in many different parts of the disk. A good ballpark maximum number of extents is 10. You can prevent this problem by creating extents that are large enough to hold your data. The extent can be created with the table or added later with the create table or alter table statements. To monitor the amount of extents, try running the command oncheck -pe, which will display output similar to the following:


... Disk usage for Chunk 1                Start    Length
        ----------------------------      -------  ----------
      stores7:customer                    1000     13
      stores7:item                        1250     12
      stores7:customer                    3000     2500

If you see a table that shows many times in this output, you might consider backing it up, dropping, re-creating, and re-loading into a table with the proper initial and next extent sizes.

The oncheck -pt command will show summarized information of all the tables. This information can tell you how many extents each table has, like in the following example:


TBLspace sysmaster:informix.syscolumns
  Physical Address	100011  
  Creation date	07/01/97 12:15:13
  TBLspace Flags	2	Row Locking
  Maximum row size	48    
  Number of special columns	0     
  Number of keys	1     
  Number of extents	7     
  Current serial value	1     
  First extent size	8     
  Next extent size	8     
  Number of pages allocated	64    
  Number of pages used	61    
  Number of data pages	34    
  Number of rows	1700   
  Partition partnum	1048580  

Note that "number of extents" is seven, still an acceptable number.

The last way to find out about all the extent sizes is to query the sysmaster database.

Correcting and Troubleshooting Problems

You've now learned numerous ways to monitor your servers, watching performance and preventing problems. But sometimes trouble is going to come no matter what you do. It's inevitable: things go wrong. Here we discuss how to address and correct some common problems.

There are many errors that will cause Informix to immediately go off line, possibly causing damage to your data and databases. Possible problems include:

  • Hardware problems like disk and CPU.
  • Internal Informix errors: corrupted pages, etc.
  • Memory errors.
  • Operating system crashes.
  • Errant user programs.

As an administrator, you need to be prepared to handle any of these errors. Rule number one of an administrator is:

DON'T PANIC

A good number of system crashes cause no damage to the data and are recovered through Informix's powerful recovery system.

Checking the Message Log

When your Informix instance goes off line, you should first check the message log for what happened. To display the last 20 lines of the message log, type onstat -m. If you need to see more of the message log, you can use a text editor or other command.

Tip: The messages in the message log are listed in the Informix Dynamic Server Administrator's guide.

Many details about the error are included in the message log. For example, the mention of an "I/O error" is a good indication of a problem with hardware.

Handling Assertion Failures

A common message for Informix crashes indicates an "assertion failure." An assertion failure is simply a message from Informix stating that it could not perform a necessary operation and needed to shut down. These errors often require a simple shutdown and startup of the instance, but must still be investigated.

Starting with version 7.30, Informix attempts to capture assertion failures and kill just that Informix session, automatically saving the necessary debug information. This should prevent Informix from crashing during assertion failures. The failure will of course be documented in the message log for that Informix instance.

An assertion failure will almost always create one or more files that give complete information about the error. The most common file is af.nnn where nnn is a unique hex number. Other possible files are a shared memory file (shm.nnn), a core dump of VP processes (gcore.nnn), and a regular core dump. The data in these files might mean nothing to you, but they are very important to Informix technical support.

    Note:
    You can configure what happens during an assertion failure by setting certain parameters in the onconfig file. These parameters all begin with the word DUMP and include DUMPCORE, DUMPSHMEM, DUMPDIR, and DUMPCNT. Setting these parameters helps trap more information during such a failure.

If you set the environment variable AFDEBUG before you initialize Informix, the engine will suspend processing-rather than crash-during many errors. This is useful because it allows you to run certain diagnostic commands like oncheck and onstat before bringing the engine down. Here is an example of how to do this when initializing Informix:


AFEDEBUG=1; export AFDEBUG
oninit

    Note:
    Because of the improvements for assertion failure handling in 7.3x, setting AFDEBUG is not really necessary; Informix will continue running and capture errors anyway. In the rare cases when an assertion failure crashes Informix, however, AFDEBUG will still work to suspend Informix processing. This allows you to capture error information before bringing down the Informix engine.

Informix gives a lot of information during an assertion failure, often including a possible solution to the problem ("Action: Run 'oncheck -cDI 6449916'"). Be sure to carefully analyze this information to find out what caused the problem, and if appropriate, try the corrective action. When an assertion failure occurs, you should:

  1. Check the log file to find the process that has caused the assertion failure.
  2. If the instance is suspended or still on line, you can run certain commands to gather more information (in 7.3x, the engine should remain on line). The output of these commands should be saved in a directory created for this purpose. You can use these for your own history and for Informix technical support. Some of the recommended commands include:
    
    	onstat -u
    	onstat -g ses session_id_that_caused
    	onstat -g stk thread_id_within_session
    	onstat -g sts
    	onstat -g glo
    	onstat -g seg
    	onstat -g mem
     
  3. Look through the history of your system crashes for a similar error. If found, note what caused it and how it was resolved.

    Tip:
    If a certain assertion failure occurs often, there is a good chance that it is because of a defect in the version of Informix you are using. It means that something in your applications is triggering the error. If this is the case, try calling Informix technical support or obtaining an upgraded version of your Informix software.

  4. Try to determine the cause of the error. If you are unable to determine the cause, it might be best to call Informix technical support at this time. If your engine is suspended (via AFDEBUG or otherwise), Informix can run different onstat and oncheck commands.
  5. Check to see if the Informix instance is still on line by running onstat - from the UNIX command line. If the status indicates "On-Line," your instance is still running; if not, perform the following three steps.
  6. Bring the engine down by using onmode -ky. Sometimes this will not bring Informix off line. If it doesn't, you will have to remove all shared memory segments and semaphores associated with this instance (probably with ipcrm)-just be sure not to kill any from other Informix instances. After that, you can use kill to eliminate all oninit processes associated with this instance.
  7. Attempt to bring Informix back on line with your usual method (including oninit). If Informix goes right back into error mode, it's probably time to call Informix technical support.
  8. If the message log suggested a corrective action (oncheck -cDI, for example), try running that command before allowing the users on line.

In most cases, the above three steps will be sufficient in bringing Informix back on line and you won't have to take it any further. If not, you'll need to apply some real administrator's skills (or call Informix!).

Using What You Know: Common Sense and Working Knowledge

After a while, Informix administrators develop a sixth sense about problems. Through experience, you'll be able to either recognize errors or intuitively know how to fix them. These are the kind of things that are hard to teach in a book; they come naturally. But it is a way of thinking.

Really consider what happened and what could have caused it. Logically think through the messages and conditions of the system and try to eliminate the obvious. Consider what has happened with your system in the past. Run various Informix commands to try to trace problems. Remember that some crashes are caused by problems with the current Informix release and your applications. Again, don't panic and logically walk through the problem.

Following is an interesting anecdote from Clem Akins, Informix Technical Support Engineer. Notice how Clem did follow the logical steps but had to really improvise to find the problem:

    "The error was from an Assert Fail due to a memory Segmentation Violation. The engine was given an address that was out of bounds, and it crashed. After one week of all-day-long, on-site effort we proved that the fault was with a part of the hard drive that was used for temp storage. Informix would write one address there, and retrieve another. The internal bounds checking is good enough to find the fault and bail out before corrupting any data. As the customer was a financial institution, they appreciated that, even after the hard time they gave us at first. We ran oncheck as it says to do, but found no errors. We examined the stack trace, and compared it to the optimizer source code that was running when it crashed. We sent the core dumps and .af files to Informix Advanced Support for further analysis. We reproduced the error using the stores7 database (though we had to make some of the tables several times bigger to see the error). Advanced Support finally built a machine exactly like the customer's, ran the exact query on the same data, and had no errors. Given that, we gave some heavy thought and long, thorough analysis and found the bad disk."

That story demonstrates a time when the obvious didn't solve the problem and it took real improvisation. If something just doesn't make sense, start thinking about other things that could have caused the problem.

Calling Informix Technical Support

Informix provides excellent technical support. In some instances, either you have exhausted all of your ideas or the problem can't be fixed by you. This is when you should call technical support. Be sure to have your information together before you call. Again, take a look at the views of Clem Akins.

    "The most important thing to do is to take the view of the support engineer. This view includes things like:

    'Can I reproduce the problem at will?

    Have I thought about why this problem occurs?

    What are the possible causes?

    Can I isolate them and test them individually?

    What has changed that caused the problem? (System, network, database, application changes?)

    Is the test case the simplest possible that will reproduce the problem? Does it reproduce on the stores7 database?

    Does it reproduce with only simple SQL instead of requiring complex custom code or applications or tables?

    Do I have a dial-up agreement in place with Informix Support? (Gee, I wish I had thought of that during business hours...) What is my support contract number (or product serial number) and what kind of support do I have? Is all of my support information ready to provide to Informix?

    When I talk to the support engineer will I sound like an IT professional who has taken all the reasonable steps to solve the problem? Even better, like a person who also has to support code and who thinks about the poor soul who is trying to help me from the other end of a phone!"

Restoring From a Backup

Certain errors will be just plain unrecoverable. In these times, you might see the dreaded "restore from an archive" error. At this point, the administrator will look like a hero if he had implemented a strong backup and recovery plan (which of course we all do!).

    Warning:
    Sometimes an Informix error message will suggest "restore from a backup." This is not always a true statement! Many times, it is just a matter of the Informix engine being temporarily confused about what is happening. Even when the "restore from a backup message" is displayed, I strongly suggest going through all steps to attempt to get the instance back on line. If that doesn't work, try calling technical support before doing the restore, which could cause a great loss of data.

There are two kinds of restores-cold and warm. The type you choose depends on your error. Refer to the Informix documentation for the backup strategy you are using (onbar, ontape, onarchive, etc.).

The amount of data you restore will depend on your levels of backup. Generally speaking, the process is to:

  1. Restore your last level-0 backup.
  2. Restore all level-1 and level-2 backups since the level-0, if any.
  3. Restore all proper logical logs, if any.

If you are not using logging, you can only go as far as step 2. The point of restoration depends on your backup schedule. If you need to get as close as possible to the point-in-time of your database, you'll need to have implemented logged databases and properly backed up all the logical logs.

 

Summary

This chapter described the ongoing process of maintaining your Informix servers. You learned several commands that can be done on an ongoing basis to watch your servers for possible problems or bottlenecks. And you learned the best method for contacting Informix support during down times.

This article is an excerpt from a chapter in the upcoming Informix Handbook due out later this year. The Informix Handbook promises to be a comprehensive, must-have reference guide to Informix products. The book provides substantial detail and reference information on multiple products and subjects. The following are some related chapters from the book:

  • "Understanding Informix Architecture," Chapter 3, explains many basic concepts of Informix.
  • "Setup of DSA UNIX Databases," Chapter 19, shows how to perform the setup involved with a DSA server.
  • "Working with the Operating System," Chapter 21, describes how to work with the operating system.
  • "Administrative Utilities," Chapter 22, describes the various "on" commands.
  • "SMI and the sysmaster Database," Chapter 27, presents an overview of the tables in sysmaster.
  • "Administrator Tuning," Chapter 28, provides tuning tips for administrators.

For more information, visit the Web site at www.informixhandbook.com.

About the Author

Ron M. Flannery is the founder and president of One Point Solutions, Inc., a Web- and technology-based company in Michigan. He also serves as the president of the Michigan Informix User Group (www.miug.com), is a member of the IIUG board of directors, and is the principal author of the Informix Handbook from Informix Press. Ron may be contacted by email at rflanner@speedlink.net.

 

Украинская баннерная сеть
 

[Home]

Сайт поддерживается группой пользователей Информикс на Украине.

Hosted by NO-more.