The question we are trying to answer is: what is causing calls made by FairCom DB client processes to fail with error 808/809? To answer it, it is helpful to have answers to the following questions:
Is there a pattern to the times at which the errors occur?
Is there a pattern as to which processes encounter the error?
Are there any factors common to the customer sites that encounter the error (especially factors that are not present on systems that do not encounter the error)?
What FairCom DB function call returned error 808/809? If it's a record read call, that points to record locking as a likely cause.
Some of the intervals between recent monitoring log entries are larger than expected (say, 16 seconds when lock dumps are being taken every 5 seconds). This raises two questions:
What could cause the intervals to be unexpectedly large? All the timestamps tell us is that 16 seconds elapsed between the two entries; they do not tell us where the delay occurred.
How can we understand the cause of the delay? If it is FairCom DB delaying its response to the client, taking a process stack trace of FairCom DB will show us where in the code the threads are blocked. Knowing this, we can come up with ideas as to the cause.
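As a sketch of the kind of check involved, the following C helper (hypothetical, not part of any FairCom tool) counts intervals between successive monitoring-log timestamps that exceed the expected logging period; in practice the timestamps would be parsed out of the lock dump log.

```c
#include <stddef.h>

/* Hypothetical helper: given monitoring-log timestamps in seconds,
 * count the intervals that exceed the expected logging period and
 * record the largest interval seen. */
static int count_large_gaps(const long *ts, int n, long expected_secs,
                            long *max_gap)
{
    int count = 0;
    long biggest = 0;

    for (int i = 1; i < n; i++) {
        long gap = ts[i] - ts[i - 1];
        if (gap > biggest)
            biggest = gap;
        if (gap > expected_secs)
            count++;
    }
    if (max_gap)
        *max_gap = biggest;
    return count;
}
```

With lock dumps every 5 seconds, the timestamp sequence {0, 5, 10, 26, 31} reports one oversized interval of 16 seconds, matching the case described above.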
Options to Consider
Understanding the cause of the error using the binaries that are already deployed at customer sites:
Review what we have learned so far.
Above all, the most likely cause of errors 808/809 is application lock behavior, as at least partly confirmed by specific, documented cases [1, 2]. We should at a minimum attempt to rule out lock problems first in each case, using SNAPSHOT.FCS data at the very least and lock dump data if possible.
Understanding the cause of the error that might require new FairCom DB or client application binaries:
Consider using the blocking lock timeout feature. With the blocking lock timeout, a call to FairCom DB that times out on a blocking lock request will fail with error 827, which will make it very easy to distinguish between delays that involve locking and those that do not. A pre-production test system is a good candidate for this type of test.
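The triage logic this enables can be sketched as a small C helper (hypothetical names; the error-code values are those discussed above): once the blocking lock timeout is enabled, a delay that involves locking surfaces as 827 rather than 808/809.

```c
#include <string.h>   /* strcmp, for callers comparing classifications */

/* Error codes as discussed above: 808/809 are the errors under
 * investigation; 827 is returned when the blocking lock timeout
 * feature is enabled and a blocking lock request times out. */
enum {
    FC_ERR_A        = 808,
    FC_ERR_B        = 809,
    FC_ERR_LOCK_TMO = 827
};

/* Hypothetical triage helper for client-side error handling. */
static const char *classify_error(int err)
{
    switch (err) {
    case FC_ERR_LOCK_TMO:
        return "lock-timeout";   /* delay definitely involved locking */
    case FC_ERR_A:
    case FC_ERR_B:
        return "ambiguous";      /* locking neither ruled in nor out */
    default:
        return "other";
    }
}
```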
The client binary can be modified to produce a server stack trace at the appropriate time. Consider placing a pstack() call (along with Snapshot() and LockDump() calls) in the FairCom DB client code immediately before the 808/809 error is returned, to obtain a stack trace of the server at that instant. This is a strategic time and place to capture FairCom DB state.
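One minimal way to sketch this instrumentation point is to shell out to the pstack(1) command against the server process just before the error is returned. The function names, the server PID parameter, and the output path below are assumptions for illustration, not existing FairCom code.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Build the pstack command line; returns 0 on success, -1 if the
 * command does not fit in the buffer. */
static int build_pstack_cmd(char *buf, size_t len, long server_pid,
                            const char *out_path)
{
    int n = snprintf(buf, len, "pstack %ld >> %s 2>&1",
                     server_pid, out_path);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Hypothetical instrumentation hook: capture where the server threads
 * are blocked at the moment the client is about to return 808/809.
 * The real server PID and log path would come from the deployment. */
static int dump_server_stack(long server_pid, const char *out_path)
{
    char cmd[256];

    if (build_pstack_cmd(cmd, sizeof cmd, server_pid, out_path) != 0)
        return -1;
    return system(cmd);   /* 0 if pstack ran successfully */
}
```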
Collect a complete repository of data for occurrences of the error. It is important to collect as much information about all occurrences of the error 808/809 as possible, so we can look for patterns. For each occurrence include:
Customer name/site at which the error occurred
Version of FairCom/Customer/system software running on the system
What is the socket timeout value that is in use on the system?
What happened next: were processes restarted, and did the condition happen again or was operation normal after that?
What monitoring data do we have from both normal operation and the activity at the time the error occurred? Examples include SNAPSHOT.FCS, the lock dump log, application error logs, CTSTATUS.FCS, FairCom DB process stack traces, and other system data such as disk or CPU statistics (e.g., from sar).
What analysis have we done on the available data? Did we at least examine SNAPSHOT.FCS, the lock dump log, CTSTATUS.FCS, and FairCom DB process stack traces?
Discuss specific cases that showed up as errors 808/809 and have since been resolved.
Heap Debugging on Solaris 9+ Operating Systems
You can enable heap debugging for the FairCom DB process by setting the appropriate libumem environment variables before startup.
This is a low-overhead malloc() debugging method, suitable for a production environment experiencing possible heap corruption. With guard zones enabled, it checks for memory overwrites on each alloc/free.
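The variables themselves are not listed here; as an illustration, the standard way to enable libumem's low-overhead guard checking on Solaris is along these lines (exact values should be confirmed against the umem_debug man page):

```shell
# Interpose libumem's malloc/free in the FairCom DB server process
export LD_PRELOAD=libumem.so.1
# Place guard zones around allocations; checked for overwrites on free
export UMEM_DEBUG=guards
# Keep an in-core log of allocation transactions for later mdb analysis
export UMEM_LOGGING=transaction
# ...then start the FairCom DB server from this shell
```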
libumem also provides many debugger commands (dcmds) when a core file generated with libumem loaded is examined under the mdb debugger.
::umem_status and ::umem_verify are useful, and the complete list of dcmds can be found in mdb by executing:
> ::dmods -l libumem.so.1
The umem_debug man page has more details.
watchmalloc is another malloc debugging library on Solaris. It detects memory overwrites more reliably; however, it runs very slowly and is not suitable for a production environment.