Build a Highly Available Cluster on Linux for FairCom DB

This section describes how to create a two-node, highly available (HA) solution running FairCom DB on a clustered Red Hat Enterprise Linux (RHEL) environment. The HA solution uses FairCom’s synchronous replication feature (new in 2020) to ensure the data on both FairCom DB servers is always identical.

The HA solution uses Pacemaker and Corosync from RHEL to:
  • Provide a virtual IP address (VIP).

  • Detect failures in the hardware, OS, and database.

  • Fail over from the primary node to the secondary node.

  • Fence off the primary server during a failover event to prevent scenarios that can corrupt data, such as split-brain.

Figure 1. FairCom replication


This section describes how to implement a high-availability FairCom DB cluster consisting of a primary and a secondary server. RHEL calls the primary server the master and the secondary server the slave. In this configuration, FairCom DB runs as part of a high-availability Linux cluster provided by Red Hat Enterprise Linux 8 using Pacemaker and Corosync. The cluster provides a virtual IP address that listens for client connections regardless of which node in the cluster is actually running database services. Database clients connect to the virtual IP.

Data is synchronized between the two database servers using FairCom replication. It replicates all data changes from the primary server to the secondary server. It uses synchronous replication to ensure data is always the same on both servers. You also have the option to use the slightly faster asynchronous replication, but this introduces a chance that data will be lost during a failover. All data changes go to the primary server. Read-only operations can go to either the primary or secondary server.

Failures are detected by Pacemaker, which monitors the primary and secondary servers as well as the replication agent on the secondary FairCom DB server. When it detects a failure of the current primary server, it fences off the primary database from all communications and promotes the secondary database server to become the new primary. Failures of the secondary FairCom DB server are detected as well, which further protects data integrity. FairCom DB resource agents define and configure this process.

FairCom DB replication for availability

To ensure all data is the same on both node one and node two, we use FairCom DB’s replication with synchronous commit. This is a new feature for FairCom products released in 2020. It provides high-speed, parallel, synchronous data replication between two or more FairCom DB servers. When a transaction on the primary server is committed, it guarantees that the secondary server has access to the log data from the primary.

FairCom DB asynchronously applies committed data changes to data and index files on the secondary server. Changes are typically applied faster on the secondary server than on the primary because it is read-only. If there is a heavy read load on the secondary server, changes may be applied slightly slower. This does not affect high availability because all transactions are always persisted on disk on both servers before the database returns a commit confirmation to the application.

Cluster

A Cluster is a collection of nodes and resources (usually services) managed as a set for high availability.

Pacemaker

Pacemaker is a high-availability cluster resource manager based on the Open Cluster Framework (OCF). Pacemaker provides detection of and recovery from node-level and service-level failures. Faulty nodes can be fenced, ensuring data integrity. Pacemaker configurations define clusters, nodes, resources, alerts, and STONITH fencing agents.

Pacemaker cluster management is included with many major Linux distributions. It executes as a daemon called CRMd on each node. All cluster changes are routed through it, such as node instantiations, moves, starts, stops, and information queries. Pacemaker can script and cluster any resource, such as FairCom DB.

Pacemaker uses quorum voting (the votequorum service) to determine when a cluster node should be fenced or isolated to prevent split-brain. More than 50% of the nodes must agree on cluster operations. If the nodes that remain online are not more than 50% of the cluster, clustered services are stopped. Pacemaker uses Corosync as its high-availability framework, and a specific Corosync option supports two-node clusters, in which a single remaining node is enough for quorum.
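The majority rule can be illustrated with a little arithmetic. The following Python sketch is not part of Pacemaker or Corosync; the function names are hypothetical, it assumes one vote per node, and it models the two-node special case (the two_node setting in Corosync's votequorum) as a simple flag.

```python
def votes_needed(total_nodes: int) -> int:
    """Smallest number of votes that is more than 50% of the nodes."""
    if total_nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return total_nodes // 2 + 1


def has_quorum(total_nodes: int, online_nodes: int, two_node: bool = False) -> bool:
    """Return True if the online nodes may keep running clustered services.

    two_node models the special two-node mode, in which a single surviving
    node keeps quorum and fencing decides which node survives.
    """
    if two_node and total_nodes == 2:
        return online_nodes >= 1
    return online_nodes >= votes_needed(total_nodes)


if __name__ == "__main__":
    # A two-node cluster with one node down has no majority...
    print(has_quorum(2, 1))
    # ...unless the two-node special case is enabled.
    print(has_quorum(2, 1, two_node=True))
```

Under plain majority voting, a two-node cluster loses quorum as soon as either node fails, which is why the two-node option matters for the configuration described in this document.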

Corosync

Corosync is a distributed messaging and quorum platform for cluster-aware components. Cluster communications go through the Corosync daemon, which uses an in-memory database. It manages cluster membership, messaging, quorum rules, and cluster state transfer between nodes. Corosync uses the kronosnet library.

Node

A Node is an RHEL host participating in the cluster. Each cluster node runs a local resource manager daemon (LRMd) which is the interface between CRMd and local cluster resources on the node. 

One node in the cluster is the Designated Coordinator (DC) which stores and distributes cluster state to all cluster nodes. This cluster state is maintained in the Cluster Information Base (CIB).

A cluster resource is an application or data managed by the cluster — for example, a database instance running in an RHEL cluster can be configured to be a cluster resource. In this definition, FairCom DB is defined as a resource that requires cluster management to run its services in a highly available configuration.

Resource agent

A resource agent is an abstraction that allows Pacemaker to manage services it knows nothing about. This abstraction takes the form of an executable script that contains logic for starting, stopping, and checking the health of a defined resource, such as a database or other service provided by the cluster.

Fencing agents

Fencing agents regularly check the status of a resource and, if Pacemaker determines that a resource has failed, one or more agents ensure the resource stays down. Fencing agents implement Shoot The Other Node In The Head (STONITH), which is usually targeted at the hardware layer to ensure the complete removal of a node from the cluster.

Choose the fencing technique that best suits your requirements and hardware capabilities.

Techniques to fence resources (such as database servers):

  • system reset

  • powering off the host

  • deactivating a VM

  • removing network access

Fencing

Fencing is an important component of any successful clustering because it ensures that a failed node is unable to run or provide any services at all. This prevents scenarios such as split-brain which, in the case of a database, allows data to be written to both servers in the cluster. Split-brain puts some data on one server and some data on the other, which breaks data integrity, causes errors, and makes it very difficult to recover correct data.

Groups

Groups are created to manage multiple cluster resources as a unit.

Groups simplify configuring the following options for multiple resources that need to function as one unit:
  • Location

  • Start order

  • Reverse stop order

  1. Enable the high-availability repository (rhel-8-for-x86_64-highavailability-rpms), which provides Pacemaker, using the following command.

    sudo dnf config-manager --enable rhel-8-for-x86_64-highavailability-rpms
    
  2. If configuring a shared storage cluster, also enable the resilient storage repository.

    sudo dnf config-manager --enable rhel-8-for-x86_64-resilientstorage-rpms
  3. Install pcs, Pacemaker, and the fencing agents using the following command.

    sudo dnf install pcs pacemaker fence-agents-all
  4. Verify that Pacemaker was installed using the following command.

    sudo pcs -h

This procedure describes how to initialize two FairCom DB synchronous replication nodes as VMware virtual machines running identical Red Hat Enterprise Linux 8 OS environments configured with Pacemaker clustering software.

Two nodes:
  • Node one

    This is considered the active/hot node. This instance of FairCom DB is a read-write server. The active node is the only FairCom DB server that should process changes to data.

  • Node two

    This is considered the passive/warm node. This instance of FairCom DB is a read-only server. It could be used for query reporting and analytics. This FairCom DB instance should never allow changes to data.

Requirements:
Two VMware virtual machines running identical Red Hat Enterprise Linux 8 OS environments configured with Pacemaker clustering software, to be initialized as FairCom DB synchronous replication nodes.

Note

Unless otherwise mentioned, each of the steps in this procedure must be performed on all nodes in the cluster. Sudo access is required for the majority of the steps.

  1. Open the network ports required by the high-availability components and reload the firewall using the following commands.

    sudo firewall-cmd --permanent --add-service=high-availability
    sudo firewall-cmd --reload
    
  2. Assign the hacluster user password using the following command.

    Note

    The high-availability installed packages create a default cluster user (hacluster) used to run and isolate its services. This default user is used to connect across nodes and requires a secure password.

    sudo passwd hacluster
  3. Enable and start the pcsd service on each node using the following commands.

    Note

    systemctl enable configures the service to start automatically on system boot.

    sudo systemctl enable pcsd.service
    sudo systemctl start pcsd.service
    

Identical FairCom DB packages are installed into /opt/faircomdb on each node. This standardized location is appropriate for external software packages. Symlinks are created in /usr/local/bin/ for important utilities used by the Pacemaker resource scripts so that they are available in default paths. Firewalls must be configured to allow the required database ports on all nodes.

  1. Configure firewalls to allow required database ports on all nodes.

    Note

    Port 5597 is the FairCom DB default ISAM connection port (also known by the server name FAIRCOMS).

    Port 6597 is the FairCom DB SQL connection port.

    sudo firewall-cmd --zone=public --permanent --add-port=5597/tcp
    sudo firewall-cmd --zone=public --permanent --add-port=6597/tcp
    sudo firewall-cmd --reload
    
  2. Set up the monitoring of the availability of a resource using the ctsmon utility.

    Note

    Cluster monitoring requires a script or executable program to monitor the availability of a resource. The FairCom DB monitoring utility, ctsmon, provides heartbeat functionality. It is called directly by Pacemaker as referenced from the FairCom DB resource.

  3. Create symlinks in /usr/local/bin using the following commands.

    sudo ln -s /opt/faircomdb/config/failover.linux.pacemaker/getsyncstate /usr/local/bin/getsyncstate
    sudo ln -s /opt/faircomdb/config/failover.linux.pacemaker/setsyncstate /usr/local/bin/setsyncstate
    sudo ln -s /opt/faircomdb/tools/repadm /usr/local/bin/repadm
    sudo ln -s /opt/faircomdb/tools/ctsmon /usr/local/bin/ctsmon
    
  4. Ensure that permissions on the utilities are set correctly by using the following commands.

    sudo chmod 755 /opt/faircomdb/config/failover.linux.pacemaker/getsyncstate
    sudo chmod 755 /opt/faircomdb/config/failover.linux.pacemaker/setsyncstate
    sudo chmod 755 /opt/faircomdb/tools/ctsmon 
    sudo chmod 755 /opt/faircomdb/tools/repadm
    
  5. Ensure all files and components are owned and readable by the proper OS user.

    Important

    The user ID under which the server process runs must be added to the haclient group such that it can run the crm_attribute utility to read and write the cluster attribute.

    Note

    You must specify the OS user for the user=owner_name parameter when running the pcs resource create faircomdb command.  You will see references to this username in the FairCom DB resource scripts. These references need to be changed to match the proper account in your environment.

Required FairCom DB files

The files in this section are required to configure the server and its embedded replication agent to run in a pacemaker cluster.

  • FairCom DB files in the server binary directory (server):

    • server executable:

      • faircomdb

      • ctsrvr

      • ctreesql

    • server core library - libctreedbs.so

    • client library - libmtclient.so

    • agent library

      • librcesbasic.so

      • librcesdpctree.so

    • server license file - ctsrvrXxx.lic

  • Replication agent subdirectory (./server/agent)

    replication agent library - libctagent.so

  • Configurations (./config):

    • server configuration file - ctsrvr.cfg

    • agent configuration; specifies name of replication agent configuration file, relative to server working directory - ctagent.json

  • Replication (./config/failover.linux.pacemaker)

    • replication agent configuration file, used when this server is running as a secondary server - ctreplagent.cfg

    • application-specific file filter for replication - replSync.xml

    • settings files used by the replication agent:

      • source_auth.set

      • target_auth.set

Required file configuration

  1. Edit the /opt/faircomdb/config/ctagent.json file.

    {
            "managed": false,
            "configurationFileList": [
                    "../config/ctreplagent.cfg"
            ]
    }
    
  2. Copy the /opt/faircomdb/config/failover.linux.pacemaker/ctreplagent.cfg file to /opt/faircomdb/config.

    Note

    This file will need to be modified for your environment (see ctAgent configuration for synchronous commit replication).

  3. Copy /opt/faircomdb/config/failover.linux.pacemaker/replSync.xml to /opt/faircomdb/config.

ctAgent configuration for synchronous commit replication

It is important to configure replication correctly on both nodes. Once a failed node has recovered, it becomes the secondary node and the replication direction between the nodes is reversed. If replication is not correctly configured, the secondary node will not be in sync after recovery.

Example 1. ctAgent configuration

This example uses a cluster with two nodes, clusternode1 and clusternode2. Replication pulls from the primary, clusternode1, and applies to the secondary, clusternode2. The replication agent configuration file (ctreplagent.cfg) shown here is the one for clusternode2, which is used for initial setup.

; c-tree Replication Agent Configuration File

;file filter must specify the names of files to include/exclude for replication and the purposes, including sync_commit
file_filter             <../config/replSync.xml

;replication agent unique id; must be same on both systems
unique_id               agent1

check_update            yes

replicate_data_definitions      yes

parallel_apply          yes
num_analyzer_threads    1
num_apply_threads       4
sync_log_writes         yes

; Source server connection info
source_authfile         ../config/replication/source_auth.set
source_server           FAIRCOMS@clusternode1
source_nodeid           10.0.0.1

; Target server connection info
target_authfile         ../config/replication/target_auth.set
target_server           FAIRCOMS@localhost
target_nodeid           10.0.0.2

socket_timeout          5

lock_retry_count        10
lock_retry_sleep        1000

; Read 8 KB batches from source server
batch_size              8192

; Use a 1-second timeout when reading from source server
read_timeout_ms         1000

exception_mode          transaction

; The master server will remember the replication agent's log position
; even when the replication agent disconnects.
remember_log_pos        yes

; Log file name 
log_file_name  ../data/ctreplagent.log


Example 2. source_server and nodeid

The only difference in the replication agent configuration file between node 1 and node 2 is that the source server name is changed to the other node name and the node IDs are set with the appropriate node ID for the servers.

; Source server connection info
source_server           FAIRCOMS@clusternode2
source_nodeid           10.0.0.2

; Target server connection info
target_nodeid           10.0.0.1
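Because the two files differ only in these lines, the node-specific part can be derived from one table of nodes. The following Python sketch shows the idea; the function name and the nodes dictionary are hypothetical, and the keywords, host names, and node IDs are simply the ones used in the examples above.

```python
def agent_connection_lines(local_id, peer_name, peer_id, server_name="FAIRCOMS"):
    """Build the node-specific source/target lines of ctreplagent.cfg.

    The agent on a node pulls from its peer (source) and applies to the
    local server (target).
    """
    lines = [
        "; Source server connection info",
        f"source_server           {server_name}@{peer_name}",
        f"source_nodeid           {peer_id}",
        "",
        "; Target server connection info",
        f"target_server           {server_name}@localhost",
        f"target_nodeid           {local_id}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Example host names and node IDs taken from this document.
    nodes = {"clusternode1": "10.0.0.1", "clusternode2": "10.0.0.2"}
    for name, node_id in nodes.items():
        peer_name, peer_id = next(
            (n, i) for n, i in nodes.items() if n != name
        )
        print(f"--- settings for {name} ---")
        print(agent_connection_lines(node_id, peer_name, peer_id))
```

Generating both variants from a single source is one way to keep the node IDs from drifting apart between the two nodes; all other settings in the file stay identical.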


Example 3. file_filter

The file_filter option is important because it determines which files are replicated by the replication agent and it enables synchronous commit mode for those files. In this example, two files, mark.dat and admin_test.dat, are replicated in synchronous commit mode.

<?xml version="1.0" encoding="us-ascii"?>
<replfilefilter version="1" persistent="y">
        <file status="include">mark.dat</file>
        <file status="include">admin_test.dat</file>
        <purpose>create_file</purpose>
        <purpose>read_log</purpose>
        <purpose>open_file</purpose>
        <purpose>sync_commit</purpose>
</replfilefilter>
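When the list of replicated files is long or changes often, a filter with this layout can be generated rather than edited by hand. The following Python sketch assumes the element and attribute names shown in the example above are the complete set needed; the function name is hypothetical and is not part of any FairCom tool.

```python
import xml.etree.ElementTree as ET

# Purposes copied from the example filter; sync_commit enables synchronous commit.
PURPOSES = ["create_file", "read_log", "open_file", "sync_commit"]


def build_file_filter(file_names):
    """Return replSync.xml content that includes the given data files."""
    root = ET.Element("replfilefilter", {"version": "1", "persistent": "y"})
    for name in file_names:
        entry = ET.SubElement(root, "file", {"status": "include"})
        entry.text = name
    for purpose in PURPOSES:
        ET.SubElement(root, "purpose").text = purpose
    if hasattr(ET, "indent"):  # pretty-printing is available in Python 3.9+
        ET.indent(root, space="        ")
    # us-ascii output escapes any non-ASCII characters as character references.
    body = ET.tostring(root, encoding="us-ascii").decode("ascii")
    return '<?xml version="1.0" encoding="us-ascii"?>\n' + body + "\n"


if __name__ == "__main__":
    print(build_file_filter(["mark.dat", "admin_test.dat"]))
```

The generated text can be compared with the hand-written filter before it is copied into the configuration directory.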


FairCom DB server configuration

The server configuration file (ctsrvr.cfg) contains additional options specific to participating in a Pacemaker cluster.

CHECK_CLUSTER_ROLE 			YES
SECONDARY_STARTUP_TIMEOUT_SEC 	20
SYNC_COMMIT_TIMEOUT 			10

REPL_NODEID 				10.0.0.1
REPLICATE 					*.dat
  • REPL_NODEID

    This must be unique for each server. It is formatted like an IP address, but it does not need to be a valid IP address and has no relationship to one. It must match the source_nodeid and target_nodeid specified in all replication agent configuration files.

  • CHECK_CLUSTER_ROLE YES

    This enables the server’s integration with Pacemaker. When the server starts it checks for the file cluster.ini in the server’s data directory. The start function in the Pacemaker script for faircomdb writes to cluster.ini to signal the server of its role as a primary server. Here is the sequence of events at server startup when this option is used.

  • REPLICATE <filename>

    This tells the server to attempt to enable the replication attribute on files matching <filename> when the server opens them. Multiple REPLICATE entries may be used to list files individually. If ctreplagent.cfg specifies a file_filter, REPLICATE may be redundant. When an agent uses a file_filter, files with the replication attribute must still be included in that agent's replication file filter for changes to those files to be replicated by that agent.

  • SECONDARY_STARTUP_TIMEOUT_SEC <time_in_seconds>

    This configuration option sets the time limit after which the server stops waiting for a notification of its role when starting up. If the server was a secondary server the last time it ran and, within this time limit, it neither receives a notification of its role from Pacemaker nor can connect to the partner server, it stops waiting and completes its startup acting as a secondary server in the cluster. This option defaults to 20 seconds.

  • SYNC_COMMIT_TIMEOUT <time_in_seconds>

    This configuration option sets a time limit in seconds for which a transaction commit waits on the transaction log data to be copied to the secondary system. If a transaction commit operation times out waiting for its log data to be copied to the secondary server, the server switches into asynchronous replication mode and the transaction commit proceeds. This option defaults to 60 seconds.
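Mismatched node IDs are easy to introduce when the same files are copied between nodes. The following Python sketch checks one node's pair of files for consistency. It is an illustration only: the function names are hypothetical, the parser assumes simple "keyword value" lines with ; comment lines as in the examples above, and the rule that the local REPL_NODEID equals the local agent's target_nodeid is inferred from those examples rather than stated by FairCom.

```python
def parse_settings(text):
    """Parse 'keyword value' lines into a dict keyed by lowercase keyword.

    Blank lines and lines starting with ';' are skipped. Repeated keywords
    (such as REPLICATE) keep only the last value, which is enough for this check.
    """
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(";"):
            continue
        parts = line.split(None, 1)  # split on any run of spaces or tabs
        settings[parts[0].lower()] = parts[1].strip() if len(parts) > 1 else ""
    return settings


def check_node_ids(server_cfg_text, agent_cfg_text):
    """Return a list of problems found in one node's ctsrvr.cfg/ctreplagent.cfg pair."""
    server = parse_settings(server_cfg_text)
    agent = parse_settings(agent_cfg_text)
    problems = []
    repl_id = server.get("repl_nodeid")
    source_id = agent.get("source_nodeid")
    target_id = agent.get("target_nodeid")
    if repl_id is None:
        problems.append("ctsrvr.cfg does not set REPL_NODEID")
    if source_id is None or target_id is None:
        problems.append("ctreplagent.cfg must set source_nodeid and target_nodeid")
        return problems
    if repl_id is not None and repl_id != target_id:
        problems.append(f"REPL_NODEID {repl_id} differs from target_nodeid {target_id}")
    if source_id == target_id:
        problems.append("source_nodeid and target_nodeid must be different")
    return problems
```

Calling check_node_ids with the contents of the two files on each node returns an empty list when the IDs line up, and a list of readable messages otherwise.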

Server role file

When the server sets its role to the primary or secondary server after receiving notification of its role from Pacemaker, it writes its current role to the file serverrole.ini in the server’s data directory (which is set using the LOCAL_DIRECTORY server configuration option). This file contains the value 1 for the primary server and 2 for the secondary server. The server uses this file when it starts up to know its role the previous time it ran.

Server scripts

The server sets a global cluster attribute named promotestate to indicate which node is the current primary server and whether it is running in synchronous or asynchronous replication mode.

In order to read the cluster attribute, the server runs the getsyncstate.sh shell script. To set the cluster attribute value, the server runs the setsyncstate.sh shell script.

Ensure that these scripts have been put into a directory that is included in the server’s PATH environment variable and is readable and executable by the user ID under which the server process is run.

Example 4. Cluster-specific configuration options
CHECK_CLUSTER_ROLE 			YES | NO
SYNC_COMMIT_TIMEOUT		<time_in_seconds>
SECONDARY_STARTUP_TIMEOUT_SEC	<time_in_seconds>


Example 5. Cluster-specific state files
cluster.ini	0 - 4 
serverrole.ini	1 (primary) | 2 (secondary)


Example 6. Cluster-specific shell scripts
getsyncstate.sh	Gets current FDB cluster state
setsyncstate.sh	Sets current FDB cluster state


The procedures in this section describe how to add resources to the Pacemaker clustering monitor.

Resources added to the Pacemaker clustering monitor:
  • Node availability

  • FairCom DB server availability

  • Cluster IP availability

The pcs command controls and configures Corosync and Pacemaker through an interface to the Cluster Information Base (CIB) file. It provides an extensive command syntax for control over all aspects of your clustered environment. Help is available directly from this utility.

# pcs --help

This section provides the commands to run in order to define the cluster. In all of these commands, replace node1 and node2 with the host names of your configured machines.

  1. Authenticate local pcsd (the pcs daemon) to remote host node pcsd services using the following command.

    Note

    This only needs to be executed once on one of the nodes.

    # pcs host auth node1 node2
  2. When prompted, enter the default hacluster username and password that were previously assigned, to authenticate Pacemaker against other nodes in the cluster.

  3. Assign the nodes to the cluster and start it, so that all pcs changes are saved into the CIB as you make configuration changes, using the following command.

    # pcs cluster setup my_cluster --start node1 node2
  4. Save a copy of the CIB to a file for analysis and comparison using the following command.

    # pcs cluster cib mycluster.cib
  5. Explore mycluster.cib, which is an XML file that pcs writes to the local folder.

  6. Enable the cluster to start when a node is booted using the following command.

    # pcs cluster enable --all

Client applications connect to a floating cluster virtual IP (VIP) (for example, 10.0.1.100) that cluster management directs to the active node. The VIP is managed by a resource agent called the Cluster Virtual IP Resource Agent.

A VIP is switched between nodes on a failover event detected by the cluster. It provides a seamless connection experience for client applications.

  1. Define a cluster IP that is monitored for availability to provide a single IP address to the client application.

    Note

    During a failover event, the client’s connection will be broken. After a failover, a client simply reconnects to the same VIP. The client does not need to know the internal IP addresses of each server.

  2. Modify the IP address in the following command to match your environment.

    # pcs resource create VirtualIP IPaddr2 ip=cluster.ip.address \
        cidr_netmask=22
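Because a failover breaks existing connections, client applications typically wrap their connect call in a retry loop that targets the same VIP. The following Python sketch shows one generic way to do that. It is not FairCom client code: connect_with_retry and its parameters are hypothetical, and connect stands for whatever zero-argument function opens your application's database connection and raises OSError (or a subclass) on failure.

```python
import time


def connect_with_retry(connect, attempts=10, delay_seconds=1.0, sleep=time.sleep):
    """Call connect() until it succeeds or the attempts run out.

    The VIP never changes, so every attempt targets the same address; only
    the node answering behind it may differ after a failover.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except OSError as error:
            last_error = error
            if attempt < attempts:
                sleep(delay_seconds)  # give the cluster time to promote and move the VIP
    raise ConnectionError(
        f"could not reach the VIP after {attempts} attempts"
    ) from last_error


# Usage sketch with a plain TCP probe of the example VIP and ISAM port
# from this document (not executed here):
#
#   import socket
#   sock = connect_with_retry(
#       lambda: socket.create_connection(("10.0.1.100", 5597), timeout=5)
#   )
```

The total retry window (attempts multiplied by delay_seconds) should comfortably exceed the time your cluster needs to detect a failure and promote the secondary server.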
    

For Pacemaker to detect a process failure, it needs to run a heartbeat script periodically. This is a monitoring resource agent. FairCom provides a script that satisfies the basic operational elements required by Pacemaker. These include start, stop, monitor, and validate operations.

The name of the script is faircomdb (also known as the FairCom DB Resource Agent).

  1. Navigate to the default scripts in /usr/lib/ocf/resource.d/heartbeat/.

  2. Copy the faircomdb resource agent script into the /usr/lib/ocf/resource.d/heartbeat/ area.

  3. Verify that this script has executable permissions and correct ownership.

    Note

    Without correct permissions, Pacemaker will report that the resource could not be installed.

  4. Display available options and requirements by running the following command.

    Note

    Multiple configuration options can be provided to this resource agent directly from the command line. Some of these are required.

    # pcs resource describe ocf:heartbeat:faircomdb 
    
  5. Create your faircomdb resource.

  6. Update the local working directory paths, the faircomdb process owner name, and the passwords to match your environment.

    Note

    • These cluster management commands only need to be run on one node of the cluster. As the cluster is now active and enabled, all changes are synchronized across all nodes in the CIB database persisted in /var/lib/pacemaker/cib/cib.xml.

    • Group support allows associating resources together such that they can be started sequentially, or stopped in reverse order. This is optional.

    # sudo pcs resource create faircomdb ocf:heartbeat:faircomdb \
        binary=/opt/faircomdb/server/faircom \
        config=/opt/faircomdb/config/ctsrvr.cfg \
        ctsmon=/usr/local/bin/ctsmon \
        faircomdb_servername=FAIRCOMS \
        faircomdb_adminpass=ADMIN \
        faircomdb_working_directory=/opt/faircomdb/server \
        faircomdb_data_directory=/opt/faircomdb/data \
        user=owner_name

    Be sure to change your default FairCom DB ADMIN password!

    (optional) To add the resource to a group, append the following option to the command.
    	--group ctree-group
  7. Define the FairCom DB Resource Agent (that runs on both nodes simultaneously) as a cloned primary-secondary (master-clone) resource using the following command.

    Note

    This command also tells the cluster to have only one primary server on the cluster.

    # sudo pcs resource promotable faircomdb \
        master-max=1 \
        master-node-max=1 \
        clone-max=2 \
        clone-node-max=1 \
        notify=true

    # sudo pcs resource op add faircomdb monitor timeout=20 interval=11 role=Master
  1. Add a colocation constraint to ensure that the VIP runs on the primary node (master) when the cluster is started.

  2. Use an INFINITY score to make the colocation mandatory.

    From Node 1
    # sudo pcs constraint colocation add VirtualIP with master faircomdb-clone score=INFINITY

With constraints in place, you can define the order in which resources are started. The FairCom DB server resource should be started first, followed by the VIP. Start the VIP when the FairCom DB server starts or is promoted on the primary node using the following command.

From Node 1
# sudo pcs constraint order start faircomdb-clone then start VirtualIP
# sudo pcs constraint order promote faircomdb-clone then start VirtualIP

The cluster calls a fencing agent to kill resources and prevent them from running accidentally. In a FairCom DB cluster, the fencing agent ensures that a failed primary node can no longer receive database connections. After a failover, only the newly promoted server (the former secondary) is allowed to modify data. Failure to fence can result in a split-brain scenario, which is very difficult to recover from. Fencing usually involves external disruptions to resources, such as using power supplies to shut down a server, using a network switch to disable network access, shutting down a VM, and so forth. Fencing agents can be found in /usr/sbin/fence*.

In testing environments, VMware virtual machines (VMs) are used for the hosts, and the VMware fencing agent is used to kill the failed active node during a failover, also known as Shoot The Other Node In The Head (STONITH). Once the secondary node becomes active after a failover, the cluster performs the STONITH to ensure the failed node is dead. This prevents conflicts over mutually exclusive resources.

Fencing should initially be disabled to speed cluster setup, configuration, and testing. It has been observed that the cluster might not start resources when fencing is enabled, which is the default. In production environments, fencing must be configured and enabled to ensure data integrity.

Example 7. Enable/disable STONITH

This command enables or disables any configured STONITH agent.

# sudo pcs property set stonith-enabled=[ false | true ]


Example 8. Configure STONITH

In this command, update the host map, IP address, login, and password to match your environment. The host map names should match the node names that pcs status lists for your existing nodes. If you specify your VMware host by name rather than by IP address, the name must be resolvable through your local DNS.

Caution

Exercise caution if accessing your VMWare host using the system root account. It may be preferable to configure a specific administrative account for this purpose. Consult with your local system administrator for more information.

# sudo pcs stonith create vmfence fence_vmware_soap \
   pcmk_host_map="cluster1.company.com:CLUSTER-NODE-1; cluster2.company.com:CLUSTER-NODE-2" \
   ipaddr=vmware-host.company.com \
   ssl_insecure=1 \
   login=root \
   passwd=xxxxxx \
   pcmk_monitor_timeout=120s


Fencing components:

You may use any fencing components that best fit your environment.

  • Power supply (UPS specific)

  • Network devices (switches, routers or NICs)

  • Remote-controlled server shutdown (Dell iDRAC8, HP iLO, and so forth)

Configure your cluster to start at boot time such that all nodes are available once the boot process is finished.

Example 9. Enable systemd services for pacemaker and corosync
# sudo systemctl enable corosync.service
# sudo systemctl enable pcsd.service


You can shut down an entire cluster, but this requires extra attention. Failure to follow the proper sequence will cause the primary node either to fail over to the secondary node or to restart immediately after you shut it down.

Important

It is the job of the cluster software to keep the nodes in the cluster running at all times, so you must use the Shutdown cluster commands sequence to shut down the cluster itself.

Shutdown cluster commands sequence:
  1. From any node, shut down the cluster to stop Pacemaker and the messaging layer on all nodes using the following command.

    # pcs cluster stop --all
  2. If the cluster services do not stop properly on a host, force them to stop using the following command.

    # pcs cluster kill
  3. To completely remove the cluster environment configuration from all nodes, use the following command. Run this only if you intend to remove the cluster permanently.

    # pcs cluster destroy --all
Example 10. Tune the main Pacemaker configuration timeouts

Tune the main Pacemaker configuration timeouts for the various resource agent operations. Load testing and consideration of potential delays caused by expensive operations, such as database or system backups, can help determine a reasonable set of timeout values for your system.

Important

It is generally better to set a timeout too large rather than too small. If the timeout value is too small, a delay due to high system load leads to a Pacemaker timeout of the relevant operation. Pacemaker treats the timeout as a resource failure, which leads to unnecessary resource migration (failover) and possible fencing of the node. When the timeout value is much too small, it can lead to repeated failures as each node successively times out.

On the other hand, with a large timeout, some types of failures (such as a database process hang, network failure, or node failure) take the full timeout interval to detect, which increases the total time a resource is unavailable during a failover event.

# sudo pcs resource update faircomdb op monitor timeout=120s
# sudo pcs resource update faircomdb op start timeout=120s


Timeout settings:
  • faircomdb op start timeout

    Consider both normal startup times and recovery startup times. If typical recovery times are very large, such a startup may need to be done manually outside the cluster.

  • faircomdb op stop timeout

    Consider the server shutdown time. If the stop operation times out, Pacemaker fences the node if fencing is configured. To avoid this, the resource agent attempts to kill (SIGKILL) the server process shortly before the timeout expires, which causes transaction recovery operations on restart.

  • faircomdb op promote timeout

    A promote occurs at failover. This timeout must account for the replication lag at the time of failover. The replication lag is the time needed to apply the replication changes that are queued on the secondary server being promoted.

    Example 11. Log read lag

    An estimate of this lag can be monitored with the following command.

    repadm -c getstate -s SECONDARY_NAME
    Output

    The time column in the logread section shows the log read lag.

     source server: FAIRCOMS@rhel8cn3
     target server: FAIRCOMS@localhost
    --logship--------------------------------------------
      sid         lognum     logpos   state    seqno      error func
       23              2  108032586  source      675          0 ctReplReadLogData
    --logread--------------------------------------------
      sid   tid   lognum     logpos   state    seqno time error func
       24    41        2  108032586  source   120284    0     0 ctReplGetNextChange
    --analyzer--------------------------------------------
     #                                state    seqno      error func
     1                               source     9339          0 ctThrdQueueRead
    --dependency------------------------------------------
                                      state    seqno      error func
                                     source  1547757          0 ctThrdQueueRead
    --apply-----------------------------------------------
     #      tid   lognum     logpos   state    seqno  files error func
     1       31        2   97494282  source    12015      0     0 ctThrdQueueRead
     2       36        2   97499497  source    10600      0     0 ctThrdQueueRead
     3       37        2   97507840  source    11218      0     0 ctThrdQueueRead
     4       38        2   97477128  source    10507      0     0 ctThrdQueueRead
    --tranStat--  --analysisQ-  ---dependQ--  ---dependG--  ---readyQ---
        0(    0)      0(    0)      0(    0)      0(    0)      0(    0)
    


  • faircomdb op monitor timeout

    The monitor operation occurs frequently and involves a new database connection. Database operations such as online backups or Quiesce may impose an abnormal delay on new connections. The monitor timeout should allow for these exceptional connection times. Otherwise, a failover event will be triggered.
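
When sizing the promote timeout, it can help to sample the log read lag over time. The sketch below is one way to pull the time column out of the logread section of output shaped like Example 11. The function name logread_lag is hypothetical, and the parsing assumes the column layout shown in that example.

```shell
#!/bin/bash
# Print the "time" column of the first data row in the --logread-- section
# of `repadm -c getstate` output read from stdin.
# Assumes the section and column layout shown in Example 11.
logread_lag() {
  awk '
    /^ *--logread--/ { in_section = 1; next }   # start of the logread section
    /^ *--/          { in_section = 0 }          # any other section header ends it
    in_section && $1 == "sid" {                  # header row: locate the time column
      for (i = 1; i <= NF; i++) if ($i == "time") col = i
      next
    }
    in_section && col { print $col; exit }       # first data row
  '
}

# Sample taken from the Example 11 output.
logread_lag <<'EOF'
--logship--------------------------------------------
  sid         lognum     logpos   state    seqno      error func
   23              2  108032586  source      675          0 ctReplReadLogData
--logread--------------------------------------------
  sid   tid   lognum     logpos   state    seqno time error func
   24    41        2  108032586  source   120284    0     0 ctReplGetNextChange
--analyzer--------------------------------------------
EOF
```

Against a live system, the same function could be fed from the command in Example 11, for example `repadm -c getstate -s SECONDARY_NAME | logread_lag`.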

This procedure describes how to activate clustering.

  1. Start the cluster using the following command.

    Note

    This command starts all cluster-associated resources on all nodes and begins the monitoring processes as defined in your resource agent scripts. As replication was configured directly in FairCom DB, it is immediately active on startup.

    # sudo pcs cluster start --all
    
  2. Check the cluster status using the following command.

    # sudo pcs status
  3. Additional FairCom DB resource logging can be enabled by setting clusterdebug=1 in the faircomdb resource agent script.

    1. Check /opt/faircomdb/server/server.log and /var/log/messages to further verify all cluster database components have successfully started.

    2. When you have determined the cluster has successfully started, check that replication is active between database nodes.

    3. Ensure that the cluster node hostname references the specific node where replication is active (the slave node) to ensure the correct active replication node is viewed.

      Note

      Example 12, “Query the replication agent's connection state” below assumes FAIRCOMS is your configured FairCom DB server name.

    4. From pcs status, verify which node is the current slave node and substitute that node name.

    Example 12. Query the replication agent's connection state

    This example shows the command to examine the replication agent's connection state.

    repadm -c getstate -u ADMIN -p ADMIN -s FAIRCOMS@clusternode2
    
    Output
    s t   lognum     logpos   state    seqno time func
    n n        0          0  target        2    - INTISAM
    n n        0          0  source        3    - INTISAM
    n n        0          0  source        3    - INTISAM
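
The node substitution described above can be scripted. The sketch below extracts the slave node name from pcs status output and builds the repadm command from Example 12. The function name slave_node is hypothetical, and the "Slaves: [ ... ]" line format is an assumption based on typical RHEL 8 output, so verify it against your own pcs status output first.

```shell
#!/bin/bash
# Print the first node listed on the "Slaves:" line of `pcs status` output (stdin).
# The line format is an assumption; check it against your own pcs version.
slave_node() {
  sed -n 's/.*Slaves: \[ *\([^] ]*\).*/\1/p' | head -n 1
}

# Sample input in the assumed format. In practice: sudo pcs status | slave_node
node=$(slave_node <<'EOF'
  * Clone Set: faircomdb-clone [faircomdb] (promotable):
    * Masters: [ cluster1.company.com ]
    * Slaves: [ cluster2.company.com ]
EOF
)

# Print the command for review instead of running it.
echo "repadm -c getstate -u ADMIN -p ADMIN -s FAIRCOMS@${node}"
```

If no slave is currently listed, the function prints nothing, so a script should check for an empty result before calling repadm.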
    


If a database error occurs on the secondary server during replication, one record for each operation in the failed transaction is logged to the agent's exception log. The agent will continue running and applying changes, but the secondary server is now out of sync with the primary and will refuse to be promoted to the role of primary server. These exceptions can be viewed with the getlog command.

Example 13. View exceptions command

These past exceptions, if left unresolved, tend to cause additional cascades of errors as those same records are modified on the primary server, leading to large data differences over time.

# repadm -c getlog -u ADMIN -p ADMIN -s FAIRCOMS@clusternode2


Once exceptions are resolved on the secondary, the exception records should be deleted from the agent’s exception log (see Example 14, “Purge the exception log”). Only when the exception log has no entries will the secondary server be allowed to promote to the primary server.

Example 14. Purge the exception log

This might be valid and useful after a full resync of the secondary data, or if the exceptions were related to files being unintentionally replicated.

Caution

Purging exception log entries without resolving all data differences will lead to the loss of data.

# repadm -c purgelog -u ADMIN -p ADMIN -s FAIRCOMS@clusternode2


If a node in a properly operating FairCom DB cluster fails, the remaining node stays available as the primary server in a degraded cluster. As transactions continue to flow, the failed (now secondary) server falls out of sync. When the secondary node is restored to service, it resumes asynchronous replication from where it failed (see MAX_REPL_LOGS for the limit). Once it has copied the primary log data, it automatically re-enters synchronous commit mode and restores the cluster to a normal state.

During this degraded interval, if the single remaining node also fails, the secondary cannot be promoted without some data loss, as it is effectively a backup from an earlier point in time. Normally, the secondary server is expected to remain partially available (read-only) until the primary is restored. In the event of a catastrophic disk failure on the primary, forcing such a promotion may be the best available course of action (see Example 15, “Force a promotion”).

Example 15. Force a promotion
crm_attribute -G --name promotestate --update "{ \"node\":\"clusternode2\", \"sync\":\"y\" }"


Once your cluster is configured, operational, and ready to service FairCom DB client applications, those applications can connect to the single cluster-assigned virtual IP address without regard to which node is servicing them.

Ultimately, FairCom DB client applications will need to respond to failover events. ISAM-based applications maintain a tight affinity with their connected server. As such, there is a rich array of context data that can be lost on database failover. Unfortunately, with this database model, this is unavoidable in a shared-nothing architecture such as an OCF cluster.

Example 16. Failed connection states

This example lists the failed connection states that can be tested when a client connection is lost. Check all of these states for validity before assuming that a database node failover occurred. At that point, the application must reconnect to the database cluster and re-initialize any database activities and context it was using, because no assumption can be made about the last operational state.

ARQS_ERR 	127	"could not send request"
ARSP_ERR 	128	"could not receive response"
ASKY_ERR 	133	"server not available"
SHUT_ERR	150	"server is shutting down"
TRQS_ERR 	808	"request timed out"
TRSP_ERR	809	"response timed out"


The most efficient way to determine that a cluster has failed over may be to detect a hanging socket. That is, when the underlying cluster node IP address is switched from the primary node to the secondary node, the existing TCP/IP socket connection becomes invalid and the application must reconnect. The most direct method to detect this situation is to set a socket timeout, which is the maximum time to wait before forcing a client reconnection.

Example 17. Configure a socket communication option API

FairCom DB allows configuring a socket communication option with this API. The ctCOMMOPT_SOCKET_TIMEOUT option sets the global socket timeout value (in seconds).

When ctCOMMOPT_SOCKET_TIMEOUT is set, network socket requests and responses return TRQS_ERR 808 or TRSP_ERR 809 when the specified time has expired while waiting.

ctSetCommProtocolOption(ctCOMMOPT_SOCKET_WAIT_INTERVAL, socketWaitInterval);


When a timeout condition is met, the client simply reconnects to the same virtual IP address it used initially, and the cluster now redirects that connection to the alternate node. This does mean that your prior client database state is lost. For example, if you were traversing an index in a search loop, your position context is lost at this point and you will need to restart your search operation.

To confirm that an application has actually connected to a new node in the cluster, it can check the replication node ID assigned to that node. For example, when using fencing agents, it is not expected that the original node remains active. It could be undefined behavior should the application connect back to a failed node.

When configured as recommended in the above sections, each node maintains a unique replication identification which is a reliable indicator. Use GetSymbolicNames() with ctNODEID as the mode parameter. Compare the current ID after connection with a prior ID and you will immediately know if your cluster has indeed failed over and your application successfully reconnected to the alternate node.

Logs that may hold information about the cause of a failure:
  • /var/log/pacemaker/pacemaker.log

    This log holds details about cluster operations and errors encountered by resource agents.

  • /opt/faircomdb/data/CTSTATUS.FCS

    This log holds details about FairCom DB errors.

  • /opt/faircomdb/data/ctreplagent.log

    This log contains Replication Agent-specific errors and is configured in ctreplagent.cfg.

Describe resource

sudo pcs resource describe ocf:heartbeat:faircomdb

Output

ocf:heartbeat:faircomdb - FairCom DB resource agent

Resource agent script for FairComDB Database Server

Resource options:
  binary: Full path to the FairComDB binary
  config: Full path to a FairComDB configuration directory or configuration file
  pid: File to read running process PID
  user: User name or id FairComDB will run under
  group: Group name or id FairComDB will run under
  faircomdb_adminpass (required): FairComDB ADMIN password
  faircomdb_monitor_user: FairComDB user name to connect for monitoring
  faircomdb_monitor_userpass: FairComDB password to connect for monitoring
  faircomdb_servername (required): FairComDB server name to connect for
                                   monitoring
  faircomdb_host: FairComDB host name to connect for monitoring
  faircomdb_working_directory (required): FairComDB server working directory.
                                          This must be writable for FairComDB to
                                          start and operate
  faircomdb_data_directory (required): FairComDB server data directory. This
                                       must be writable for FairComDB to start
                                       and operate
  faircomdb_heartbeat_tcpip: FairComDB heartbeat option that forces TCPIP
                             connections when enabled. Setting false will use a
                             shared memory heartbeat that will fallback to TCPIP
  ctsmon: Path to FairComDB ctsmon monitor utility. Assume located in $PATH if
          not provided
  repadm: Path to FairComDB Replication Agent Administrator utility. Assume
          located in $PATH if not provided

Default operations:
  start: interval=0s timeout=30s
  stop: interval=0s timeout=30s
  monitor: interval=10s timeout=20s
  notify: interval=0s timeout=30s
  promote: interval=0s timeout=20s
  demote: interval=0s timeout=20s

This script completely removes ALL existing cluster configurations and component definitions and creates a fresh cluster configuration from scratch.

Delete and recreate new cluster definitions and remove all old definitions

pcs cluster destroy --all

Define machines in the cluster

pcs cluster setup faircom cluster1.company.com cluster2.company.com

Start the cluster

pcs cluster start --all

Create a config file

pcs cluster cib clcfg

Set up a FairCom DB resource

Options should be configured for your specific system.

pcs -f clcfg resource create faircomdb ocf:heartbeat:faircomdb binary=/opt/faircomdb/server/faircom config=/opt/faircomdb/server/ctsrvr.cfg ctsmon=/opt/faircomdb/tools/ctsmon ctstop=/opt/faircomdb/tools/ctstop faircomdb_adminpass=ADMIN faircomdb_servername=FAIRCOMS faircomdb_working_directory=/opt/faircomdb/server faircomdb_data_directory=/opt/faircomdb/data user=fctech repadm=/opt/faircomdb/tools/repadm pid=/opt/faircomdb/server/faircomdb.pid

faircomdb uses primary/secondary

pcs -f clcfg resource promotable faircomdb master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true

Add a monitor for the primary server

pcs -f clcfg resource op add faircomdb monitor timeout=20 interval=11 role=Master

Set timeouts

pcs -f clcfg resource update faircomdb op promote timeout=120s
pcs -f clcfg resource update faircomdb op start timeout=120s
pcs -f clcfg resource update faircomdb op stop timeout=120s

Create a Virtual IP (VIP) to connect your application to

pcs -f clcfg resource create VIP ocf:heartbeat:IPaddr2 ip=YOUR_IP cidr_netmask=22 op monitor interval=30

The VIP moves with the master server

pcs -f clcfg constraint colocation add VIP with master faircomdb-clone score=INFINITY

Start the master before the VIP

pcs -f clcfg constraint order faircomdb-clone then start VIP

System specific fencing

This must be customized for each system.

pcs -f clcfg stonith create vmfence fence_vmware_soap pcmk_host_map="cluster1.company.com:RHEL8-ClusterNode3;cluster2.company.com:RHEL8-ClusterNode4" ipaddr=vmware-host.company.com ssl_insecure=1 login=root passwd=mysecretpassword
pcs -f clcfg property set stonith-enabled=true

No quorum for only two nodes

pcs -f clcfg property set no-quorum-policy=ignore

Do not failover during normal operations

pcs -f clcfg resource defaults update resource-stickiness=INFINITY

Failover on first error

pcs -f clcfg resource defaults update migration-threshold=1

Allow failed resource to resume being master after 60 seconds

pcs -f clcfg resource defaults update failure-timeout=60

Apply above changes

pcs cluster cib-push clcfg
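
Because the script above destroys the existing cluster configuration, it can be useful to preview the commands and to stop at the first failure. The sketch below wraps a few of the steps in a helper that only prints them unless dry-run mode is turned off. The step function and the DRY_RUN variable are hypothetical names introduced for illustration; they are not part of pcs.

```shell
#!/bin/bash
# Preview or run pcs steps one at a time, stopping at the first failure.
# DRY_RUN and step are hypothetical names used for this sketch only.
DRY_RUN=${DRY_RUN:-1}   # default: print the commands, change nothing
CIB=clcfg               # scratch CIB file name used in the script above

step() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@" || { echo "step failed: $*" >&2; exit 1; }
  fi
}

step pcs -f "$CIB" resource update faircomdb op promote timeout=120s
step pcs -f "$CIB" resource update faircomdb op start timeout=120s
step pcs -f "$CIB" resource update faircomdb op stop timeout=120s
step pcs cluster cib-push "$CIB"
```

Running the sketch with DRY_RUN=0 executes the commands for real, so review the printed list first.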

Some FairCom DB client applications could take advantage of new functions that alert an application when a failover event occurs. This is important because when a failover occurs, the VIP remains unchanged but the cluster switches the internal IP address from the primary to the secondary server. Applications connected to the primary server at the time of the failover may hang because their connection with the primary database server is no longer valid.

FairCom DB provides client applications with an option to enable a background thread that listens for a failover alert. The alert comes in the form of a User Datagram Protocol (UDP) communication, which, like TCP, is supported by all network switches. FairCom DB provides the failover.sh script to send UDP messages to database clients. In the Configuring Alerts section, this script is registered with the cluster so that it runs when the cluster starts a failover event.

The failover alert is protected with Transport Layer Security (TLS) to prevent a malicious user from being able to send false failover alerts, which would result in a denial of service attack.

The background thread that listens to alerts is generic and can be used for more than failover detection — for example, external events can trigger alternate client behaviors.

The UDP listening port is currently hardcoded to port 5595.

Table 1. FairCom DB API functions

Function

Description

ctSetClientLibraryOption()

enables the background alert thread

ctGetFailOverState()

returns true when a failover alert has been received

ctResetFailOverState()

resets the failover state to false

ctSetCommProtocolOption()

can now set the global socket timeout

GetSymbolicNames()

retrieves the hostname and IP address of the database server to verify if it is connected to a primary or secondary server



Enable background failover detection

This function in the FairCom DB API enables background broadcast detection.

Note

ctSetClientLibraryOption() returns SERVER_FAILOVER_ERR (1159) on error.

NINT ctDECL ctSetClientLibraryOption(NINT option, pVOID value);
Table 2. Options

Options

Description

Limits

ctCLIOPT_BROADCAST_READ

enables and disables broadcast read thread

"YES"
"NO"

ctCLIOPT_BROADCAST_DEBUG

enables and disables broadcast read thread debug mode

"YES"
"NO"


Example 18. ctCLIOPT_BROADCAST_READ
ctSetClientLibraryOption(ctCLIOPT_BROADCAST_READ, "YES");


Check current failover flag

This API call checks your current failover state.

TEXT ctDECL ctGetFailOverState();
Example 19. ctGetFailOverState() returns 1

ctGetFailOverState() returns 1 if the global variable failoverstate is set.

if (ctGetFailOverState() == 1)


Reset failover state

This API call can reset failoverstate upon successful failover.

ctResetFailOverState()

To minimize disruption to users, database clients need to be notified quickly when the primary database server fails over to the secondary. This makes it easy for database clients to tell the difference between a connection failure caused by a network glitch and one caused by a database failover. When the database client knows a failover has occurred, it can simply reconnect to the VIP, which points to the secondary node after a failover.

FairCom provides an optional script, failover.sh, that uses a UDP datagram broadcast to alert all database clients of a failover event. This is fired from a predefined script which is configured as an alert resource.

The cluster defines alert resources that are fired during cluster events, such as resource logging, notifications, and maintenance tasks. Alert resources are scripts that are registered with pcs to be run when a cluster event occurs.

Run FairCom's failover.sh script when the failover event occurs using the following command.

# pcs alert create id=failover \
   path=/usr/local/bin/failover.sh \
   description="Script to be run at failover"

Example

Example scripts provided by Pacemaker can be found in /usr/share/pacemaker/alerts.

pcs alert create \
   id=failover \
   path=/usr/local/bin/failover.sh \
   description="Script to run at failover"

./failover.sh
#!/bin/bash


##USEFUL DEBUG##
#printf "$CRM_alert_node $CRM_alert_rsc $CRM_alert_task $CRM_alert_kind $CRM_alert_desc $CRM_alert_attribute_name $CRM_alert_attribute_value" | ncat -u 255.255.255.255 5595
#printf "FairCom FailOver Event" | ncat -u 255.255.255.255 5595
##END OF USEFUL DEBUG##


if [[ $CRM_alert_rsc == "VIP" && $CRM_alert_task == "monitor" && $CRM_alert_desc == "Cancelled" ]]; then
  printf "FairCom FailOver Event" | ncat -u 255.255.255.255 5595
fi
if [[ $CRM_alert_kind == "node" && $CRM_alert_desc == "lost" ]]; then
  printf "FairCom FailOver Event" | ncat -u 255.255.255.255 5595
fi
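
Before registering an alert script with the cluster, it can be exercised locally without broadcasting anything. The sketch below runs a script with a stub ncat placed first on PATH and reports whether the script tried to send a datagram. The function name alert_fires is hypothetical, and the CRM_alert_* values passed in are test inputs, not captured cluster events.

```shell
#!/bin/bash
# Run an alert script with a stub "ncat" first on PATH, so nothing is sent.
# Returns 0 if the script piped data to ncat, 1 otherwise.
# Usage: alert_fires SCRIPT [NAME=value ...]
alert_fires() {
  local script=$1; shift
  local tmp rc=1
  tmp=$(mktemp -d) || return 2
  # The stub records whatever the script would have sent.
  printf '#!/bin/bash\ncat > "%s/sent"\n' "$tmp" > "$tmp/ncat"
  chmod +x "$tmp/ncat"
  env PATH="$tmp:$PATH" "$@" bash "$script"
  [ -s "$tmp/sent" ] && rc=0
  rm -rf "$tmp"
  return "$rc"
}

# Exercise the installed script only if it is present on this host.
script=/usr/local/bin/failover.sh
if [ -f "$script" ]; then
  if alert_fires "$script" CRM_alert_rsc=VIP CRM_alert_task=monitor CRM_alert_desc=Cancelled; then
    echo "VIP monitor cancelled: alert would be sent"
  fi
  if ! alert_fires "$script" CRM_alert_rsc=faircomdb CRM_alert_task=monitor CRM_alert_desc=ok; then
    echo "routine monitor result: no alert"
  fi
fi
```

Stubbing ncat keeps the test on one machine, so connected clients never receive a false failover alert.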

In FairCom's testing environment, Corosync is configured for a very basic two-node cluster. The configuration is provided here only for completeness; you do not need to create or edit corosync.conf. pcs cluster setup creates this file with all needed attributes, and all other edits to this file should be made through the pcs utility.

Locally-defined configurations:
  • cluster_name

    Note

    A cluster name is needed, and each node participating in the cluster will need to be defined.

  • ringX_addr

  • name

    Note

    A fully qualified domain name is preferred for the node, and when specified, will uniquely identify the node and will be used in later configuration commands.

  • two_node

    Note

    A two-node cluster requires the two_node option in the quorum section to be enabled. This treats two-node clusters as if only one node is required for quorum. Consult the corosync.conf man page for details on available configurations.

Example

Pacemaker provides examples in /etc/corosync/corosync.conf.

totem {
    version: 2
    cluster_name: my_cluster 
    transport: knet
    crypto_cipher: aes256
    crypto_hash: sha256
}

nodelist {
    node {
        ring0_addr: node1.company.com 
        name: node1.company.com 
        nodeid: 1
    }

    node {
        ring0_addr: node2.company.com 
        name: node2.company.com 
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum
    two_node: 1 
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    timestamp: on
}
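
To confirm that a generated corosync.conf contains the settings described above, a small read-only check can help. The sketch below prints the value of a key from a named top-level section of a file laid out like the example. The function name corosync_value is hypothetical, and the parser handles only this simple layout, not the full corosync.conf syntax.

```shell
#!/bin/bash
# Print the value of KEY directly inside top-level SECTION of corosync.conf (stdin).
# Usage: corosync_value SECTION KEY < corosync.conf
# Handles only the simple "name {" / "key: value" / "}" layout shown above.
corosync_value() {
  awk -v section="$1" -v key="$2:" '
    $1 == section && $2 == "{" { depth = 1; inside = 1; next }  # enter the section
    inside && $NF == "{"       { depth++ }                       # nested block opens
    inside && $1 == "}"        { if (--depth == 0) inside = 0 }  # block closes
    inside && depth == 1 && $1 == key { print $2; exit }
  '
}

# Read-only check of the local file, if present.
conf=/etc/corosync/corosync.conf
if [ -r "$conf" ]; then
  echo "cluster_name: $(corosync_value totem cluster_name < "$conf")"
  echo "two_node:     $(corosync_value quorum two_node < "$conf")"
fi
```

Keys inside nested blocks, such as the node entries in nodelist, are deliberately ignored by this sketch.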
Resources: