Build a Highly Available Cluster on Linux for FairCom DB

Abstract

This section describes how to create a two-node, highly available (HA) solution running FairCom DB on a clustered Red Hat Enterprise Linux (RHEL) environment. The two nodes are a primary and a secondary server, referred to in RHEL as a master and slave respectively. The HA solution uses the FairCom synchronous replication feature to ensure that the data on both FairCom DB servers is always identical.

Database clients connect to a virtual IP address (VIP), which accepts client connections regardless of which server node in the cluster is actually running database services.

The HA solution uses Pacemaker and Corosync from RHEL to:
  • Provide a virtual IP address (VIP).

  • Detect failures in the hardware, OS, and database.

  • Fail over from the primary node to the secondary node.

  • Fence off the primary server during a failover event to prevent scenarios that can corrupt data, such as split-brain.

Figure 1. FairCom replication

Data is synchronized between the two database servers using FairCom replication. It replicates all data changes from the primary server to the secondary server. It uses synchronous replication to ensure data is always the same on both servers. You also have the option to use the slightly faster asynchronous replication, but this introduces a chance that data will be lost during a failover. All data changes go to the primary server. Read-only operations can go to either the primary or secondary server.

Failures are detected by Pacemaker, which monitors the primary and secondary servers as well as the replication agent on the secondary FairCom DB server. When it detects a failure of the current primary server, it fences off the primary database from all communications and promotes the secondary database server to become the new primary. Failures of the secondary FairCom DB server are detected as well, which further guards against data integrity issues. FairCom DB resource agents define and configure this process.

FairCom DB replication for availability

To ensure all data is the same on both node one and node two, we use FairCom DB replication with synchronous commit. When a transaction on the primary server is committed, it guarantees that the secondary server has access to the log data from the primary server.

In synchronous mode, changes are typically applied to the secondary server quickly because the secondary server is read-only. If there is a heavy read load on the secondary server, changes may be applied slightly more slowly. This does not affect high availability because all transactions are always persisted on disk on both servers before the database returns a commit confirmation to the application.

Cluster

A Cluster is a collection of nodes and resources (usually services) managed as a set for high availability.

Pacemaker

Pacemaker is a high-availability cluster resource manager based on the Open Cluster Framework (OCF). Pacemaker provides detection of and recovery from node-level and service-level failures. Faulty nodes can be fenced ensuring data integrity. Pacemaker configurations define clusters, nodes, resources, alerts, and STONITH fencing agents. 

Pacemaker cluster management is included with many major Linux distributions. It executes as a daemon called CRMd on each node. All cluster changes, such as node instantiations, moves, starts, stops, and information queries, are routed through it. Pacemaker can script and cluster any resource, such as FairCom DB.

Pacemaker uses quorum voting (the votequorum service) to determine when a cluster node should be fenced or isolated to prevent split-brain phenomena. More than 50% of nodes must agree on cluster operations. If half or more of the nodes in a cluster are offline, quorum is lost and clustered services are stopped. Pacemaker uses Corosync as its high-availability framework, and a specific Corosync option supports two-node clusters, in which a single node is enough to hold quorum.
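
The majority rule can be illustrated with a few lines of shell. This is only a sketch of the arithmetic; the function name quorum_needed is hypothetical and is not a Pacemaker or Corosync command.

```shell
#!/usr/bin/env bash
# Illustrative only: quorum_needed is a hypothetical helper,
# not part of Pacemaker or Corosync.
# Prints the smallest number of votes that is more than 50% of the cluster.
quorum_needed() {
    local total_nodes=$1
    echo $(( total_nodes / 2 + 1 ))
}

# Show the majority needed for a few cluster sizes.
for n in 2 3 5; do
    echo "nodes=$n quorum=$(quorum_needed "$n")"
done
```

For a two-node cluster the majority is 2, so the loss of either node would leave the survivor without quorum under the plain majority rule. This is why two-node clusters need the special handling described above.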

Corosync

Corosync is a distributed messaging and quorum platform for cluster-aware components. Cluster communications go through the Corosync daemon utilizing an in-memory database. It manages cluster membership and messaging communications as well as quorum rules, and cluster state transfer between nodes. Corosync uses the kronosnet library.

Node

A Node is an RHEL host participating in the cluster. Each cluster node runs a local resource manager daemon (LRMd) which is the interface between CRMd and local cluster resources on the node. 

One node in the cluster is the Designated Coordinator (DC) which stores and distributes cluster state to all cluster nodes. This cluster state is maintained in the Cluster Information Base (CIB).

A cluster resource is an application or data managed by the cluster — for example, a database instance running in an RHEL cluster can be configured to be a cluster resource. In this definition, FairCom DB is defined as a resource that requires cluster management to run its services in a highly available configuration.

Resource agent

A resource agent is an abstraction that allows Pacemaker to manage services it knows nothing about. This abstraction takes the form of an executable script that contains logic for starting, stopping, and checking the health of a defined resource, such as a database or other service provided by the cluster.

Fencing agents

Fencing agents regularly check the status of a resource and, if Pacemaker determines that a resource has failed, one or more agents ensure the resource stays down. Fencing agents implement Shoot The Other Node In The Head (STONITH) which is usually targeted to the hardware layer ensuring the complete removal of a node from the cluster.

Choose the fencing method that best suits your requirements and hardware capabilities. Techniques to fence resources (such as database servers) include:

  • system reset

  • powering off the host

  • deactivating a VM

  • removing network access

Fencing

Fencing is an important component of any successful cluster because it ensures that a failed node is unable to run or provide any services at all. This prevents scenarios such as split-brain, which, in the case of a database, allows data to be written to both servers in the cluster. Some data then ends up on one server and some on the other, which breaks data integrity, causes errors, and makes it very difficult to recover correct data.

Groups

Groups are created to manage multiple cluster resources as a unit.

Groups simplify configuring the cluster for the options:
  • Location

  • Start order

  • Reverse stop order of multiple resources that need to function as one unit

  1. Enable the high-availability repository (rhel-8-for-x86_64-highavailability-rpms), which provides Pacemaker, using the following command.

    sudo dnf config-manager --enable rhel-8-for-x86_64-highavailability-rpms
    
  2. If configuring a shared storage cluster, also enable the resilient storage repository.

    sudo dnf config-manager --enable rhel-8-for-x86_64-resilientstorage-rpms
  3. Install pcs, Pacemaker, and the fencing agents using the following command.

    sudo dnf install pcs pacemaker fence-agents-all
  4. Verify that Pacemaker was installed using the following command.

    sudo pcs -h

This procedure describes how to initialize two FairCom DB synchronous replication nodes as VMware virtual machines running identical Red Hat Enterprise Linux 8 environments configured with Pacemaker clustering software.

Two nodes:
  • Node one

    This is considered the active/hot node. This instance of FairCom DB is a read-write server. The active node is the only FairCom DB server that should process changes to data.

  • Node two

    This is considered the passive/warm node. This instance of FairCom DB is a read-only server. It could be used for query reporting and analytics. This FairCom DB instance should never allow changes to data.

Requirements:
  • Ensure that the Pacemaker Red Hat high-availability add-on component is installed.

  • Ensure your user account is a part of the wheel group or consult with your local system administrator for appropriate access.

In this example, both FairCom DB synchronous replication nodes run as VMware virtual machines. This is done only for the purposes of the example. It is not likely in a production environment as it defeats the purpose of failover clustering.

Notice

VMware is used later in this example only to provide a fencing agent. VMware is not required, and other fencing agents may be used.

Note

Unless otherwise mentioned, each of the steps in this procedure must be performed on all nodes in the cluster. Sudo access is required for the majority of the steps.

  1. Open the network ports required by the high-availability components using the following commands.

    sudo firewall-cmd --permanent --add-service=high-availability
    sudo firewall-cmd --reload
    
  2. Assign the hacluster user password using the following command.

    Note

    The high-availability installed packages create a default cluster user (hacluster) used to run and isolate its services. This default user is used to connect across nodes and requires a secure password.

    sudo passwd hacluster
  3. Enable and start the pcsd service on each node using the following commands.

    Note

    systemctl enable configures the service to start automatically at system boot.

    sudo systemctl enable pcsd.service
    sudo systemctl start pcsd.service
    

Identical FairCom DB packages are installed into /opt/faircomdb, a standardized location that is appropriate for external software packages. This example assumes the system user "faircom" extracts the package and runs the server process, though any system user name could be used. Symlinks are created in /usr/local/bin/ for important utilities used by the Pacemaker resource scripts so that they are available in default paths. Firewalls must be configured to allow required database ports on all nodes.

  1. Configure firewalls to allow required database ports on all nodes.

    Note

    Port 5597 is the FairCom DB default ISAM connection port (also known by the server name FAIRCOMS).

    Port 6597 is the FairCom DB SQL connection port.

    Port 8443 is the default web apps port.

    sudo firewall-cmd --zone=public --permanent --add-port=5597/tcp
    sudo firewall-cmd --zone=public --permanent --add-port=6597/tcp
    sudo firewall-cmd --zone=public --permanent --add-port=8443/tcp
    sudo firewall-cmd --reload
    
  2. Create symlinks in /usr/local/bin using the following commands.

    See repadm utility and ctsmon utility.

    sudo ln -s /opt/faircomdb/config/failover.linux.pacemaker/getsyncstate /usr/local/bin/getsyncstate
    sudo ln -s /opt/faircomdb/config/failover.linux.pacemaker/setsyncstate /usr/local/bin/setsyncstate
    sudo ln -s /opt/faircomdb/tools/repadm /usr/local/bin/repadm
    sudo ln -s /opt/faircomdb/tools/ctsmon /usr/local/bin/ctsmon
    
  3. Ensure that permissions on the utilities are set correctly by using the following commands.

    sudo chmod 755 /opt/faircomdb/config/failover.linux.pacemaker/getsyncstate
    sudo chmod 755 /opt/faircomdb/config/failover.linux.pacemaker/setsyncstate
    sudo chmod 755 /opt/faircomdb/tools/ctsmon 
    sudo chmod 755 /opt/faircomdb/tools/repadm
    
  4. Ensure all files and components are owned and readable by the proper OS user.

    Important

    The user ID (faircom) under which the server process runs must be added to the haclient group so that it can run the crm_attribute utility to read and write the cluster attribute.

    sudo usermod -a -G haclient faircom

    Note

    If not using the default user ID (faircom), you must specify the OS user for the user=owner_name parameter when running the pcs resource create faircomdb command.  You will see references to this username in the FairCom DB resource scripts.
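
The group membership required in step 4 can be checked with standard tools. The following is a minimal sketch; the helper name user_in_group is hypothetical, and the user and group names are the defaults assumed in this guide.

```shell
#!/usr/bin/env bash
# Sketch: report whether an OS user belongs to a group.
# user_in_group is a hypothetical helper name.
user_in_group() {
    local user=$1 group=$2
    # id -nG prints the user's group names separated by spaces.
    id -nG "$user" 2>/dev/null | tr ' ' '\n' | grep -qx -- "$group"
}

# Assumed defaults from this guide: user "faircom", group "haclient".
if user_in_group faircom haclient; then
    echo "faircom is in haclient"
else
    echo "faircom is NOT in haclient (or the user does not exist)"
fi
```

If a different OS user runs the server process, substitute that user name in the call.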

Required FairCom DB files

The files in this section are required to configure the server and its embedded replication agent to run in a pacemaker cluster.

  • FairCom DB files in the server binary directory /opt/faircomdb/server/:

    • server executable:

      • faircom

      • ctsrvr (alternate legacy server binary file name)

      • ctreesql (alternate legacy server binary file name)

    • server core library - libctreedbs.so

    • client library - libmtclient.so

    • agent library

      • librcesbasic.so

      • librcesdpctree.so

    • server license file - ctsrvrXxx.lic

  • Replication agent subdirectory (/opt/faircomdb/server/agent/):

    • replication agent library - libctagent.so

  • Configurations (/opt/faircomdb/config/):

    • server configuration file - ctsrvr.cfg

    • agent configuration; specifies name of replication agent configuration file, relative to server working directory - ctagent.json

  • Replication (/opt/faircomdb/config/failover.linux.pacemaker)

    • replication agent configuration file, used when this server is running as a secondary server - ctreplagent.cfg

    • application-specific file filter for replication - replSync.xml

    • settings files used by the replication agent:

      • source_auth.set

      • target_auth.set

Required file configuration

  1. Copy the /opt/faircomdb/config/failover.linux.pacemaker/ctagent.json file to /opt/faircomdb/config.

  2. Edit the /opt/faircomdb/config/ctagent.json file to look like the following:

    {
            "managed": false,
            "configurationFileList": [
                    "../config/ctreplagent.cfg"
            ]
    }
    
  3. Copy the /opt/faircomdb/config/failover.linux.pacemaker/ctreplagent.cfg file to /opt/faircomdb/config.

    Note

    This file will need to be modified for your environment (see ctAgent configuration for synchronous commit replication).

  4. Copy /opt/faircomdb/config/failover.linux.pacemaker/replSync.xml to /opt/faircomdb/config.

ctAgent configuration for synchronous commit replication

It is important to configure replication correctly on both nodes. After a failover, the failed node becomes the secondary node once it has recovered, and the replication direction between the nodes is reversed. If replication is not correctly configured, the secondary node will not be in sync after recovery.

Example 1. ctAgent configuration

This example uses a cluster with two nodes, node1 and node2. Replication pulls from the primary, node1, and applies to the secondary, node2. The replication agent configuration file (ctreplagent.cfg) shown here is the one for node2, used for initial setup.

; c-tree Replication Agent Configuration File

;file filter must specify the names of files to include/exclude for replication and the purposes, including sync_commit
file_filter             <../config/replSync.xml

;replication agent unique id; must be same on both systems
unique_id               agent1

check_update            yes

replicate_data_definitions      yes

;enable synchronous commit processing for replication agent
syncagent               yes

parallel_apply          yes
num_analyzer_threads    1
num_apply_threads       4
sync_log_writes         yes

; Source server connection info
source_authfile         ../config/replication/source_auth.set
source_server           FAIRCOMS@node1
source_nodeid           10.0.0.1

; Target server connection info
target_authfile         ../config/replication/target_auth.set
target_server           FAIRCOMS@localhost
target_nodeid           10.0.0.2

socket_timeout          5

lock_retry_count        10
lock_retry_sleep        1000

; Read 8 KB batches from source server
batch_size              8192

; Use a 1-second timeout when reading from source server
read_timeout_ms         1000

exception_mode          transaction

; The master server will remember the replication agent's log position
; even when the replication agent disconnects.
remember_log_pos        yes

; Log file name 
log_file_name  ../data/ctreplagent.log


Example 2. source_server and nodeid on the second node

The only difference in the replication agent configuration file between node 1 and node 2 is that the source server name is changed to the other node name and the node IDs are set with the appropriate node ID for the servers.

; Source server connection info
source_server           FAIRCOMS@node2
source_nodeid           10.0.0.2

; Target server connection info
target_server           FAIRCOMS@localhost
target_nodeid           10.0.0.1
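
The mirrored settings in Examples 1 and 2 can be generated from three values per node. The sketch below only prints the lines that differ between the nodes; repl_endpoints is a hypothetical helper, and the keywords and the FAIRCOMS@host form are copied from the examples above.

```shell
#!/usr/bin/env bash
# Sketch: print the node-specific ctreplagent.cfg lines for one node.
# repl_endpoints is a hypothetical helper, not a FairCom utility.
# Args: <peer host name> <peer node ID> <local node ID>
repl_endpoints() {
    local peer_host=$1 peer_id=$2 local_id=$3
    printf 'source_server           FAIRCOMS@%s\n' "$peer_host"
    printf 'source_nodeid           %s\n' "$peer_id"
    printf 'target_server           FAIRCOMS@localhost\n'
    printf 'target_nodeid           %s\n' "$local_id"
}

echo "; on node2 (pulls from node1)"
repl_endpoints node1 10.0.0.1 10.0.0.2
echo "; on node1 (pulls from node2)"
repl_endpoints node2 10.0.0.2 10.0.0.1
```

Each node always names its partner as the source and itself (localhost) as the target, so the two files are mirror images of each other.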


Example 3. file_filter

The file_filter option is important because it determines which files are replicated by the replication agent and it enables synchronous commit mode for the files. In this example, two files, mark.dat and admin_test.dat, are replicated in synchronous commit mode. See replfilefilter.

<?xml version="1.0" encoding="us-ascii"?>
<replfilefilter version="1" persistent="y">
        <file status="include">mark.dat</file>
        <file status="include">admin_test.dat</file>
        <purpose>create_file</purpose>
        <purpose>read_log</purpose>
        <purpose>open_file</purpose>
        <purpose>sync_commit</purpose>
</replfilefilter>
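
A quick way to confirm what a filter file includes is to inspect it with standard text tools. This is a rough sketch, not a real XML parser: both helper names are hypothetical, and the patterns assume one element per line, as in the example above.

```shell
#!/usr/bin/env bash
# Sketch: inspect a replication file filter with sed and grep.
# Both helper names are hypothetical. Assumes one element per line.

# Print the file names marked status="include".
filter_included_files() {
    sed -n 's|.*<file status="include">\(.*\)</file>.*|\1|p' "$1"
}

# Succeed if the filter lists the sync_commit purpose.
filter_has_sync_commit() {
    grep -q '<purpose>sync_commit</purpose>' "$1"
}

# Default path is the relative path used in the example configuration.
filter=${1:-../config/replSync.xml}
if [ -r "$filter" ]; then
    filter_included_files "$filter"
    filter_has_sync_commit "$filter" || echo "warning: sync_commit purpose not found" >&2
fi
```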

The source_authfile and target_authfile options reference .set files that contain encoded authentication information for the replication agent. These files are provided for the default ADMIN user and need to be updated when the password is changed. See ctcmdset - Authentication File Encoding Utility.

FairCom DB server configuration

The server configuration file (ctsrvr.cfg) requires additional options specific to participating in a Pacemaker cluster.

;Required keywords for pacemaker-based replication
CHECK_CLUSTER_ROLE              YES
REPL_NODEID                     10.0.0.1

;Optional keywords related to pacemaker replication
SECONDARY_STARTUP_TIMEOUT_SEC 	20
SYNC_COMMIT_TIMEOUT             10
MAX_REPL_LOGS                   100
REPLICATE                       *.dat
  • REPL_NODEID <ID>

    This must be unique for each server. It is formatted like an IP address, but it does not need to be a valid IP address and has no relationship to one. This must match the target_nodeid specified in ctreplagent.cfg on this machine, and the source_nodeid from ctreplagent.cfg on the remote machine.

  • CHECK_CLUSTER_ROLE YES

    This enables the server’s integration with Pacemaker. When the server starts it checks for the file cluster.ini in the server’s data directory. The start function in the Pacemaker script for faircomdb writes to cluster.ini to signal the server of its role as a primary server.

  • DEAD_CLIENT_INTERVAL <seconds>

    The client OS sends a TCP termination signal when a connection is terminated or the client process crashes. Sometimes OS, hardware, or network-level issues prevent this. The server OS cannot distinguish this condition from an idle connection, so the server continues to consume resources for the terminated client. Beginning with V13.0.1, the DEAD_CLIENT_INTERVAL option configures a TCP probe on idle connections at this interval. If the client fails to acknowledge the probe, the connection will be dropped after approximately 10 additional seconds.

  • REPLICATE <filename>

    This tells the server to attempt to enable the replication attribute on files matching <filename> that it opens. Multiple REPLICATE entries may be used to list files individually. If ctreplagent.cfg specifies a file_filter, REPLICATE may be redundant. When an agent uses a file_filter, files with the replication attribute must still be included in that agent's replication file filter for changes to the file to be replicated by that agent.

  • MAX_REPL_LOGS <max_logs>

    Specifies a limit to the number of transaction logs that will be kept by a primary server for replication agents. When a secondary database is offline it is not receiving transaction log data, which must be persisted on the primary until the secondary can reconnect and automatically synchronize itself from these logs. This keyword serves as a safety check to prevent disk space from being exhausted on the primary due to a slow or offline replication agent. If the limit is hit, all secondary servers must be manually resynchronized from the primary. When configuring this, consider the available disk space, the typical size of transaction logs, and how quickly the application generally fills transaction logs.

  • SECONDARY_STARTUP_TIMEOUT_SEC <time_in_seconds>

    This configuration option sets the time limit after which the server stops waiting for a notification of its role when starting up. If the server was a secondary server the last time it ran and within this time limit it does not receive a notification of its role from pacemaker and cannot connect to the partner server, then it stops waiting and completes its startup acting as a secondary server in the cluster. This option defaults to 20 seconds.

  • SYNC_COMMIT_TIMEOUT <time_in_seconds>

    Important

    This configuration option sets a time limit in seconds for which a transaction commit waits for the transaction log data to be copied to the secondary system. If a transaction commit operation times out waiting for its log data to be copied to the secondary server, the server switches into asynchronous replication mode and the transaction commit proceeds. This option defaults to 60 seconds, which is probably much too large for most environments. The switch to asynchronous mode indicates that the secondary is out of sync with the primary, so this keyword controls a hard tradeoff between redundancy and performance, since the application will encounter this timeout as a delay. With reliable and fast network connections between nodes, this can be set to smaller values, but the size of transactions must be considered as well. This value must be smaller than any client socket timeouts set with ctSetCommProtocolOption(ctCOMMOPT_SOCKET_WAIT_INTERVAL) to avoid clients timing out if the secondary server is stopped. It must also be smaller than the faircomdb resource agent monitor timeout. Otherwise, when the secondary server is stopped, the primary can appear to hang, causing Pacemaker to demote and stop the primary.
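
Two of the rules above lend themselves to an automated check: REPL_NODEID must match the local target_nodeid, and SYNC_COMMIT_TIMEOUT must be below the resource agent monitor timeout. The following is a sketch under stated assumptions: the helper names are hypothetical, and the parsing assumes simple "KEYWORD value" lines with ";" comment lines, as in the example files in this guide.

```shell
#!/usr/bin/env bash
# Sketch: cross-check settings that must agree with each other.
# Helper names are hypothetical, not FairCom utilities.

# Print the value of the first line whose keyword matches $2 (case-insensitive).
cfg_value() {
    awk -v key="$2" 'tolower($1) == tolower(key) { print $2; exit }' "$1"
}

# Args: <ctsrvr.cfg> <ctreplagent.cfg> <monitor timeout in seconds>
check_cluster_cfg() {
    local srv=$1 agent=$2 monitor_timeout=$3 rc=0
    local nodeid target sync
    nodeid=$(cfg_value "$srv" REPL_NODEID)
    target=$(cfg_value "$agent" target_nodeid)
    sync=$(cfg_value "$srv" SYNC_COMMIT_TIMEOUT)
    sync=${sync:-60}   # documented default when the keyword is absent
    if [ "$nodeid" != "$target" ]; then
        echo "REPL_NODEID ($nodeid) != target_nodeid ($target)"
        rc=1
    fi
    if [ "$sync" -ge "$monitor_timeout" ]; then
        echo "SYNC_COMMIT_TIMEOUT ($sync) is not below monitor timeout ($monitor_timeout)"
        rc=1
    fi
    return $rc
}
```

A call might look like check_cluster_cfg /opt/faircomdb/config/ctsrvr.cfg /opt/faircomdb/config/ctreplagent.cfg 20, where 20 is the monitor timeout you pass to the resource agent. The function prints one line per problem and returns a non-zero status if any check fails.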

Server role file

When the server sets its role to the primary or secondary server after receiving notification of its role from Pacemaker, it writes its current role to the file serverrole.ini in the server’s data directory (which is set using the LOCAL_DIRECTORY server configuration option). This file contains the value 1 for the primary server and 2 for the secondary server. The server uses this file when it starts up to know its role the previous time it ran.
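
When troubleshooting, it can help to see which role a server recorded the last time it ran. The sketch below decodes the value described above. The helper name last_server_role is hypothetical, and it assumes the file contains only the value 1 or 2; the exact file layout is not specified here.

```shell
#!/usr/bin/env bash
# Sketch: report the role recorded in serverrole.ini.
# last_server_role is a hypothetical helper, not a FairCom utility.
# Assumes the file holds just the value 1 or 2.
last_server_role() {
    local value
    if [ ! -r "$1" ]; then
        echo unknown
        return 1
    fi
    # Strip whitespace and line endings before comparing.
    value=$(tr -d '[:space:]' < "$1")
    case $value in
        1) echo primary ;;
        2) echo secondary ;;
        *) echo unknown; return 1 ;;
    esac
}
```

For example, last_server_role /opt/faircomdb/data/serverrole.ini prints the role for a server whose data directory is /opt/faircomdb/data, the path used elsewhere in this guide.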

Server scripts

The server sets a global cluster attribute named promotestate to indicate which is the current primary server node and whether it is running in synchronous or asynchronous replication mode.

In order to read the cluster attribute, the server runs the getsyncstate shell script. To set the cluster attribute value, the server runs the setsyncstate shell script.

Ensure that these scripts have been put into a directory that is included in the server’s PATH environment variable and that they are readable and executable by the user ID under which the server process runs.

Example 4. Cluster-specific configuration options
CHECK_CLUSTER_ROLE 			YES | NO
SYNC_COMMIT_TIMEOUT		<time_in_seconds>
SECONDARY_STARTUP_TIMEOUT_SEC	<time_in_seconds>


Example 5. Cluster-specific state files
cluster.ini	0 - 4 
serverrole.ini	1 (primary) | 2 (secondary)


Example 6. Cluster-specific shell scripts
getsyncstate	Gets current FDB cluster state
setsyncstate	Sets current FDB cluster state


The procedures in this section describe how to add resources to the Pacemaker clustering monitor.

Resources added to the Pacemaker clustering monitor:
  • Node availability

  • FairCom DB server availability

  • Cluster IP availability

The pcs command controls and configures Corosync and Pacemaker through an interface to the Cluster Information Base (CIB) file. It provides an extensive command syntax for control over all aspects of your clustered environment. Help is easily obtained directly from this utility.

sudo pcs --help

This section provides the commands to run in order to define the cluster. In all of these commands, replace node1 and node2 with the host names of your configured machines.

  1. Authenticate local pcsd (the pcs daemon) to remote host node pcsd services using the following command. When prompted, enter the default hacluster username and password that were previously assigned, to authenticate Pacemaker against other nodes in the cluster.

    Note

    This only needs to be executed once on one of the nodes.

    sudo pcs host auth node1 node2
  2. Assign the nodes to the cluster and start it, so that all pcs changes are saved into the CIB as you configure the cluster, using the following command.

    sudo pcs cluster setup my_cluster --start node1 node2
  3. Save a copy of the CIB to a file for analysis and comparison using the following command.

    sudo pcs cluster cib mycluster.cib
  4. Explore mycluster.cib, which is an XML file that pcs writes to the local folder.

  5. Enable the cluster to start when a node boots using the following command.

    sudo pcs cluster enable --all
  6. Allow Pacemaker to run with only 2 nodes.

    sudo pcs property set no-quorum-policy=ignore
  7. Set preference for no migrations during normal operations.

    sudo pcs resource defaults update resource-stickiness=INFINITY
  8. Fail over on the first error.

    sudo pcs resource defaults update migration-threshold=1
  9. Allow a failed resource to resume being the master after 60 seconds.

    sudo pcs resource defaults update failure-timeout=60

Client applications connect to a floating cluster virtual IP (VIP), for example 10.0.1.100, which cluster management directs to the active node. The VIP is provided by a resource agent called the Cluster Virtual IP Resource Agent.

A VIP is switched between nodes on a failover event detected by the cluster. It provides a seamless connection experience for client applications.

  1. Define a cluster IP that is monitored for availability to provide a single IP address to the client application.

    Note

    During a failover event, the client’s connection will be broken. After a failover, a client simply reconnects to the same VIP. The client does not need to know the internal IP addresses of each server.

  2. Modify the IP address in the following command to match your environment. Consult your network administrator for a suitable static IP address and netmask value.

    sudo pcs resource create VirtualIP IPaddr2 ip=cluster.ip.address \
        cidr_netmask=22
    

For Pacemaker to detect a process failure, it needs to run a heartbeat script periodically. This is a monitoring resource agent. FairCom provides a script that satisfies the basic operational elements required by Pacemaker. These include start, stop, monitor, and validate operations.

The name of the script is faircomdb (also known as the FairCom DB Resource Agent).

  1. Navigate to the default scripts in /usr/lib/ocf/resource.d/heartbeat/.

  2. Copy the faircomdb resource agent script into the /usr/lib/ocf/resource.d/heartbeat/ area.

  3. Verify that this script has executable permissions and correct ownership.

    ls -l /usr/lib/ocf/resource.d/heartbeat/faircomdb

    The output should show permissions and ownership similar to the following:

    -rwxr-xr-x. 1 root root 20982 May 30  2020 /usr/lib/ocf/resource.d/heartbeat/faircomdb

    Note

    Without correct permissions, Pacemaker will report that the resource could not be installed.

  4. Repeat steps 1 through 3 on the second node.

  5. Display available options and requirements by running the following command.

    Note

    Multiple configuration options can be provided to this resource agent directly from the command line. Some of these are required.

    sudo pcs resource describe ocf:heartbeat:faircomdb 
    
  6. Create your faircomdb resource. Update the local working directory paths, the faircomdb process owner name, and the passwords to match your environment.

    Note

    These cluster management commands only need to be run on one node of the cluster. As the cluster is now active and enabled, all changes are synchronized across all nodes in the CIB database persisted in /var/lib/pacemaker/cib/cib.xml.

    Be sure to change the FairCom DB ADMIN password from the default.

    sudo pcs resource create faircomdb ocf:heartbeat:faircomdb \
        faircomdb_servername=FAIRCOMS faircomdb_adminpass=ADMIN \
        faircomdb_working_directory=/opt/faircomdb/server \
        faircomdb_data_directory=/opt/faircomdb/data \
        user=SOMEUSER group=SOMEGROUP
    
  7. Define the FairCom DB Resource Agent (that runs on both nodes simultaneously) as a cloned primary-secondary (master-clone) resource using the following command.

    Note

    This command also tells the cluster to have only one primary server on the cluster.

    sudo pcs resource promotable faircomdb master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
  8. Enable monitoring for the primary database, checking if it is alive every interval (seconds).

    sudo pcs resource op add faircomdb monitor timeout=20 interval=11 role=Master
  1. Add a colocation constraint to ensure that the VIP starts on the primary node (master) when the cluster is started.

  2. Use an INFINITY score to force the VIP to be colocated with the primary node.

    sudo pcs constraint colocation add VirtualIP with master faircomdb-clone score=INFINITY

With constraints in place, you can define the order in which resources are started. The FairCom DB server resource should be started first, followed by the VIP. Start the VIP when the FairCom DB server starts or is promoted on the primary node using the following command.

sudo pcs constraint order start faircomdb-clone then start VirtualIP
sudo pcs constraint order promote faircomdb-clone then start VirtualIP

The cluster calls a fencing agent to kill resources and prevent them from running accidentally. In a FairCom DB cluster, the fencing agent ensures that a failed primary node can no longer receive database connections. After a failover, only the secondary server is allowed to modify data. Failure to do this can result in a split-brain scenario which is very difficult to recover from. Fencing usually involves external disruptions to resources, such as using power supplies to shut down a server, using a network switch to disable network access, shutting down a VM, and so forth. Fencing agents can be found in /usr/sbin/fence_*. For details, see Red Hat Training Chapter 10.

In this example environment, VMware virtual machines (VMs) are used for the hosts. The VMware fencing agent is used to kill an unresponsive node during failover, a technique known as Shoot The Other Node In The Head (STONITH). It prevents conflicts with mutually exclusive resources.

Fencing should initially be disabled to speed cluster setup, configuration, and testing. It has been observed that the cluster might not start resources when fencing is enabled by default. In production environments, fencing must be configured and enabled to ensure data integrity. For failure modes involving non-responsive nodes (such as network failures, hanging processes, and so forth), fencing is the only method to resolve the problem in a well-defined way. However, something as simple as stopping Pacemaker on one node (pcs cluster stop) can make this node non-responsive and trigger its fencing.

Example 7. Enable/disable STONITH

This command enables or disables any configured STONITH agent.

sudo pcs property set stonith-enabled=[ false | true ]


Example 8. Configure STONITH

In this command, update the host map, IP address, login, and password to match your environment. The host map names should match the node names that pcs status lists for your existing nodes. If you identify the VMware host by name instead of by IP address, that name must be resolvable through your local DNS.

Caution

Exercise caution if accessing your VMware host using the system root account. It may be preferable to configure a specific administrative account for this purpose. Consult with your local system administrator for more information.

sudo pcs stonith create vmfence fence_vmware_soap \
   pcmk_host_map="cluster1.company.com:CLUSTER-NODE-1;cluster2.company.com:CLUSTER-NODE-2" \
   ipaddr=vmware-host.company.com \
   ssl_insecure=1 \
   login=root \
   passwd=xxxxxx \
   pcmk_monitor_timeout=120s
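
The host map is a semicolon-separated list of node:VM pairs, and stray whitespace or a missing colon is easy to introduce when editing it by hand. The following is a minimal sketch of a helper that assembles the value from separate arguments. The function name build_host_map is hypothetical and is not part of pcs.

```shell
#!/bin/sh
# Hypothetical helper (not part of pcs): join "node:vm" pairs with ';'
# to form a pcmk_host_map value. Rejects any argument without a colon.
build_host_map() {
    map=""
    for pair in "$@"; do
        case "$pair" in
            *:*) ;;  # looks like node:vm, accept it
            *) echo "bad pair (expected node:vm): $pair" >&2; return 1 ;;
        esac
        if [ -z "$map" ]; then
            map="$pair"
        else
            map="$map;$pair"
        fi
    done
    printf '%s\n' "$map"
}
```

The result can then be passed to the command shown above, for example pcmk_host_map="$(build_host_map cluster1.company.com:CLUSTER-NODE-1 cluster2.company.com:CLUSTER-NODE-2)".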


Fencing components:

You may use any fencing components that best fit your environment.

  • Power supply (UPS specific)

  • Network devices (switches, routers or NICs)

  • Remote-controlled server shutdown (Dell iDRAC8, HP iLO, and so forth)

Configure your cluster to start at boot time such that all nodes are available once the boot process is finished.

Example 9. Enable systemd services for pacemaker and corosync
sudo systemctl enable corosync.service
sudo systemctl enable pacemaker.service
sudo systemctl enable pcsd.service


You can shut down an entire cluster, or just individual nodes. This requires extra attention. Failure to follow the proper sequence will result in the primary node failing over to the secondary node, or in the primary node restarting immediately after you shut it down.

Important

It is the job of the cluster software to keep the nodes in the cluster running at all times, so you must use pcs to shut down resources within the cluster, or the cluster itself.

Shutdown commands:
  1. Shut down a single node (node1), preventing all resources from running on that node. If node1 is currently the master, failover occurs and resources (such as faircomdb and VirtualIP) are migrated to node2.

    sudo pcs node standby node1
  2. Shut down the entire cluster, stopping Pacemaker and the messaging layer on all nodes, using the following command. It can be run from any node.

    sudo pcs cluster stop --all
  3. Force the cluster services to stop in the event that they do not stop running properly on a host using the following command.

    sudo pcs cluster kill
  4. Completely remove the cluster environment configuration from all nodes using the following command. The cluster must be re-created afterward.

    sudo pcs cluster destroy --all
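
Because the order of these commands matters, it can help to rehearse a maintenance sequence before running it. The following is a minimal sketch of a dry-run wrapper. The names run, node_maintenance_begin, and node_maintenance_end are hypothetical, and DRY_RUN is an assumed environment variable that defaults to printing the commands rather than executing them.

```shell
#!/bin/sh
# Hypothetical wrapper: print the command when DRY_RUN is 1 (the default),
# execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Take one node out of service; resources move to the other node.
node_maintenance_begin() {
    run sudo pcs node standby "$1"
}

# Return the node to service after maintenance.
node_maintenance_end() {
    run sudo pcs node unstandby "$1"
}
```

Calling node_maintenance_begin node1 with DRY_RUN unset prints the standby command; setting DRY_RUN=0 runs it.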
Example 10. Tune the main Pacemaker configuration timeouts

Tune the main pacemaker configuration timeouts for various resource agent operations. Load testing and consideration of potential delays caused by expensive operations such as database or system backups can help determine a reasonable set of timeout values for your system.

Important

It is generally better to set a timeout too large than too small. If the timeout is too small, a delay caused by high system load makes the relevant operation time out. Pacemaker treats that timeout as a resource failure, which leads to unnecessary resource migration (failover) and possibly to fencing of the node. A timeout that is much too small can cause repeated failures as each node times out in turn.

A large timeout has a cost as well: some types of failures (such as a database process hang, network failure, or node failure) are detected only after the full timeout interval elapses, which increases the total time a resource is unavailable during a failover event.

sudo pcs resource update faircomdb op monitor timeout=120s
sudo pcs resource update faircomdb op start timeout=120s


Timeout settings:
  • faircomdb op start timeout

    Consider both normal startup times and recovery startup times. If typical recovery times are very large, such a startup may need to be done manually outside the cluster.

  • faircomdb op stop timeout

    Consider the database shutdown time.  If the stop operation times out, Pacemaker will fence the node if configured.  To avoid this, the resource agent will attempt to kill (SIGKILL) the server process shortly before the timeout expires.

  • faircomdb op promote timeout

    A promote occurs at failover.  This timeout must account for the replication lag at the time of failover. The replication lag is the time to apply replication changes that are queued pending apply on the secondary server being promoted. 

    Example 11. Log read lag

    An estimate of this lag can be monitored with the following command.

    repadm -c getstate -s SECONDARY_NAME
    Output

    time shows the logread lag in seconds.

     source server: FAIRCOMS@rhel8cn3
     target server: FAIRCOMS@localhost
    --logship--------------------------------------------
      sid         lognum     logpos   state    seqno      error func
       23              2  108032586  source      675          0 ctReplReadLogData
    --logread--------------------------------------------
      sid   tid   lognum     logpos   state    seqno time error func
       24    41        2  108032586  source   120284    0     0 ctReplGetNextChange
    --analyzer--------------------------------------------
     #                                state    seqno      error func
     1                               source     9339          0 ctThrdQueueRead
    --dependency------------------------------------------
                                      state    seqno      error func
                                     source  1547757          0 ctThrdQueueRead
    --apply-----------------------------------------------
     #      tid   lognum     logpos   state    seqno  files error func
     1       31        2   97494282  source    12015      0     0 ctThrdQueueRead
     2       36        2   97499497  source    10600      0     0 ctThrdQueueRead
     3       37        2   97507840  source    11218      0     0 ctThrdQueueRead
     4       38        2   97477128  source    10507      0     0 ctThrdQueueRead
    --tranStat--  --analysisQ-  ---dependQ--  ---dependG--  ---readyQ---
        0(    0)      0(    0)      0(    0)      0(    0)      0(    0)
    


  • faircomdb op monitor timeout

    The monitor operation occurs frequently and involves a new database connection.  Database operations such as online backups or Quiesce may impose an abnormal delay on new connections.  The monitor timeout should allow for these exceptional connection times, otherwise, a failover event will be triggered.

    Important

    This must be larger than the server's SYNC_COMMIT_TIMEOUT <time_in_seconds> configured in ctsrvr.cfg, or Pacemaker will think the primary has failed any time the secondary server is stopped.

  • faircomdb op monitor interval

    The monitor interval determines how frequently database liveness is tested by making a new database connection.  If the monitor operation fails or times out a failover will be triggered.  A smaller interval will detect failures more quickly but imposes more overhead onto the database.
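
The log read lag reported by repadm can be compared against the promote timeout as a rough sanity check. The following is a minimal sketch, assuming the getstate output keeps the layout shown in Example 11 (a --logread-- section whose header row contains a time column). The function names logread_lag and promote_timeout_covers_lag are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `repadm -c getstate` output on stdin and print
# the value of the "time" column from the first row of the --logread-- section.
# Assumes the section layout shown in Example 11.
logread_lag() {
    awk '
        /^[ \t]*--logread-/ { in_section = 1; col = 0; next }
        /^[ \t]*--/         { in_section = 0 }
        in_section && col == 0 {
            # Header row: find which column is named "time".
            for (i = 1; i <= NF; i++) if ($i == "time") col = i
            next
        }
        in_section && col > 0 { print $col; exit }
    '
}

# Succeed when the lag (seconds) is below the promote timeout (seconds).
promote_timeout_covers_lag() {
    [ "$1" -lt "$2" ]
}
```

For example, lag=$(repadm -c getstate -s SECONDARY_NAME | logread_lag) followed by promote_timeout_covers_lag "$lag" 120 checks the current lag against a 120-second promote timeout. A single sample is only an estimate, so measuring under peak load is more informative.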

This procedure describes how to activate clustering.

  1. Start the cluster using the following command.

    Note

    This command starts all cluster-associated resources on all nodes and begins the monitoring processes as defined in your resource agent scripts. As replication was configured directly in FairCom DB, it is immediately active on startup.

    sudo pcs cluster start --all
    
  2. Check the cluster status using the following command.

    sudo pcs status
  3. Additional FairCom DB resource logging can be enabled by setting clusterdebug=1 in the faircomdb resource agent script.

    1. Check /opt/faircomdb/server/server.log and /var/log/messages to further verify all cluster database components have successfully started.

    2. When you have determined the cluster has successfully started, check that replication is active between database nodes.

    3. Ensure that the cluster node hostname references the specific node where replication is active (the slave node) to ensure the correct active replication node is viewed.

      Note

      Example 12, “Query the replication agent's connection state”, below assumes FAIRCOMS is your configured FairCom DB server name.

    4. From pcs status, verify which node is the current slave node and substitute that node name.

    Example 12. Query the replication agent's connection state

    This example shows the command to examine the replication agent's connection state.

    repadm -c getstate -u ADMIN -p ADMIN -s FAIRCOMS@node2
    
    Output
    s t   lognum     logpos   state    seqno time func
    n n        0          0  target        2    - INTISAM
    n n        0          0  source        3    - INTISAM
    n n        0          0  source        3    - INTISAM
    


It may be necessary to resync the secondary server from the data on the primary. If a dynamic dump is used, the primary may remain operational while the resync occurs.

  1. Putting the secondary node into standby mode will stop the FairCom database and allow a manual resync to occur.

    sudo pcs node standby node2
  2. Use dynamic dump to make a point-in-time backup of the primary. This command should be run on the secondary node, and will back up files from the primary node (node1) to the local file /faircomdb/data/resync.bak. See the FairCom Database Backup Guide for details on constructing a backup script.

    /faircomdb/tools/ctdump -t <backup script name> -c -o /faircomdb/data/resync.bak
  3. Use the dump restore utility to restore the backup to its point-in-time state.  The restore script may be identical to the backup script in many cases.  The !DUMP option must specify the /faircomdb/data/resync.bak file.

    /faircomdb/tools/ctrdmp  <restore script name>
  4. Rename the ctreplagent.ini file. The ctrdmp utility generates a ctreplagent.ini file that identifies the log position on the source (primary) server that corresponds to the point in time of the restored files.  This tells the replication agent where in the log to begin replication.  At startup, the agent looks for a file named /faircomdb/data/ctreplagent_<unique id>.ini where <unique id> matches the unique_id value from ctreplagent.cfg converted to uppercase.

    mv ctreplagent.ini /faircomdb/data/ctreplagent_AGENT1.ini
  5. Restart resources on the secondary node.

    sudo pcs node unstandby node2
  6. Once the database and replication agent restart, if the replication agent locates the ctreplagent_AGENT1.ini file, it should delete this file and log a message to ctreplagent.log similar to the following:

    "AGENT1: Logread: INFO: Initial log read position: overriding saved position on target server: using specified position (set in ini file)"

The secondary should now match the primary server data as of the time of the backup, and the replication agent is applying changes that have occurred since then.
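
The file name expected in step 4 can be derived from the agent's unique_id rather than typed by hand. The following is a minimal sketch. The function name agent_ini_name is hypothetical, and the unique_id value is passed as an argument rather than parsed from ctreplagent.cfg.

```shell
#!/bin/sh
# Hypothetical helper: build the ini file name the replication agent looks
# for at startup, given the unique_id value (converted to uppercase).
agent_ini_name() {
    id_upper=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    printf 'ctreplagent_%s.ini\n' "$id_upper"
}
```

For example, mv ctreplagent.ini "/faircomdb/data/$(agent_ini_name agent1)" produces the same rename as step 4 when unique_id is agent1.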

If a database error occurs on the secondary server during replication, one record for each operation in the failed transaction is logged to the agent's exception log. The agent will continue running and applying changes, but the secondary server is now out of sync with the primary and will refuse to be promoted to the role of primary server. These exceptions can be viewed with the getlog command.

Example 13. View exceptions command

These past exceptions, if left unresolved, tend to cause additional cascades of errors as those same records are modified on the primary server, leading to large data differences over time.

repadm -c getlog -u ADMIN -p ADMIN -s FAIRCOMS@node2


Once exceptions are resolved on the secondary, the exception records should be deleted from the agent’s exception log (see Example 14, “Purge the exception log”). Only when the exception log has no entries will the secondary server be allowed to promote to the primary server.

Example 14. Purge the exception log

This might be valid and useful after a full resync of the secondary data, or if the exceptions were related to files being unintentionally replicated.

Caution

Purging exception log entries without resolving all data differences will lead to the loss of data.

repadm -c purgelog -u ADMIN -p ADMIN -s FAIRCOMS@node2


If a node in a properly operating FairCom DB cluster fails, the remaining node remains available as the primary server in a degraded cluster. As transactions continue to flow, the failed server falls out of sync. If the node is restored to service (now as secondary), it resumes asynchronous replication from the point where it failed (see MAX_REPL_LOGS for the limit). Once it has copied the primary log data, it automatically re-enters synchronous commit mode and restores the cluster to a normal state.

During this degraded interval, if the single remaining node also fails, the secondary cannot be promoted without some data loss, as it is effectively a backup from an earlier point in time. Normally, the secondary server is expected to remain partially available (read-only) until the primary is restored. In the event of a catastrophic disk failure on the primary, forcing such a promotion may be the best available course of action (see Example 15, “Force a promotion”).

Example 15. Force a promotion

Run the following command on the secondary server you intend to promote:

/usr/local/bin/setsyncstate y


Once your cluster is configured and operational, FairCom DB client applications can connect to the single cluster-assigned virtual IP address without regard to which node is servicing them.

Ultimately, FairCom DB client applications will need to respond to failover events. ISAM-based applications maintain a tight affinity with their connected server. As such, there is a rich array of context data that can be lost on database failover. Unfortunately, with this database model, this is unavoidable in a shared-nothing architecture such as an OCF cluster.

Example 16. Failed connection states

This example shows multiple failed connection states that can be tested when a client connection is lost. All of these states should be monitored and checked for validity before assuming a database node failover occurred. It is assumed an application must reconnect to the database cluster at this point. Applications must re-initialize any current database activities and context they expected at the time of the last operations as no expectation of the last operational state can be made.

ARQS_ERR 	127	"could not send request"
ARSP_ERR 	128	"could not receive response"
ASKY_ERR 	133	"server not available"
SHUT_ERR	150	"server is shutting down"
TRQS_ERR 	808	"request timed out"
TRSP_ERR	809	"response timed out"
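
For scripts that wrap a client utility and need to decide whether to reconnect, the codes above can be collected in one lookup. The following is a minimal sketch. The function name connection_lost_name is hypothetical, and how the numeric error code reaches the script is left to the calling tool.

```shell
#!/bin/sh
# Hypothetical helper: map a FairCom DB error code from the list above to
# its symbolic name. Returns non-zero for any code not in that list.
connection_lost_name() {
    case "$1" in
        127) echo ARQS_ERR ;;  # could not send request
        128) echo ARSP_ERR ;;  # could not receive response
        133) echo ASKY_ERR ;;  # server not available
        150) echo SHUT_ERR ;;  # server is shutting down
        808) echo TRQS_ERR ;;  # request timed out
        809) echo TRSP_ERR ;;  # response timed out
        *)   return 1 ;;
    esac
}
```

A caller can treat a zero exit status as "the connection may be lost, reconnect and re-initialize context" and any other status as an ordinary application error.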


One obvious indicator that a cluster has failed over is a hanging socket. When the underlying cluster node IP address is switched from the primary node to the secondary node, the existing TCP/IP socket connection becomes invalid and the application receives a network error. When blocking sockets are in use (which is the default behavior), this could lead to long delays in some cases because many systems use an OS default of 2 hours before detecting a broken socket.

Beginning with V13.0.1, the method to detect this situation is to use TCP-level keepalives to detect non-responsive TCP links. This may be enabled on the client side by calling ctSetCommProtocolOption(ctCOMMOPT_TCP_KEEPALIVE_INTERVAL,""), which causes probes to be sent after the connection is idle for the configured interval (in seconds). If the server fails to respond, the TCP connection will be abandoned after about 10 additional seconds and a FairCom DB network error (typically ARQS_ERR/ARSP_ERR) is returned to the application.

If the server is also using TCP keepalives to detect broken links (ctsrvr.cfg option DEAD_CLIENT_INTERVAL), the client interval should be slightly different than the server interval to eliminate duplicate keepalive probes on idle connections. The former method used a socket timeout, which waits up to a maximum timeout value before returning a timeout error. However, the socket timeout will also fail a normal API call that takes a long time, such as the rebuild of a very large index, so the timeout may need to be adjusted for some special cases.

Example 17. Configure a socket communication option API

FairCom DB allows configuring a socket communication option with ctCOMMOPT_SOCKET_TIMEOUT which sets the global socket timeout value (in seconds).

When ctCOMMOPT_SOCKET_TIMEOUT is set, network socket requests and responses return TRQS_ERR 808 or TRSP_ERR 809 when the specified time has expired while waiting.

ctSetCommProtocolOption(ctCOMMOPT_SOCKET_WAIT_INTERVAL, socketWaitInterval);


When a timeout condition is met, the client can simply reconnect to the same virtual IP address it used initially; the cluster now directs that connection to the alternate node. Your prior client database state is lost, however. For example, if you were traversing an index in a search loop, your position context is gone and you will need to restart the search operation.

Important

Client socket timeouts must be configured to be larger than the server's SYNC_COMMIT_TIMEOUT <time_in_seconds> configured in ctsrvr.cfg, or errors occur any time the secondary server is stopped.
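
The requirement that a timeout exceed SYNC_COMMIT_TIMEOUT applies both to client socket timeouts and to the Pacemaker monitor timeout, so it can be checked with one helper. The following is a minimal sketch, assuming the keyword and its value appear whitespace-separated on a single line of ctsrvr.cfg. The function names sync_commit_timeout and timeout_exceeds_sync_commit are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the SYNC_COMMIT_TIMEOUT value (seconds) from a
# ctsrvr.cfg-style file. Assumes "SYNC_COMMIT_TIMEOUT <seconds>" on one line;
# if the keyword appears more than once, the last value is printed.
sync_commit_timeout() {
    awk 'toupper($1) == "SYNC_COMMIT_TIMEOUT" { v = $2 }
         END { if (v != "") print v }' "$1"
}

# Usage: timeout_exceeds_sync_commit <cfg-file> <candidate-seconds>
# Returns 0 if the candidate is larger, 1 if not, 2 if the keyword is absent.
timeout_exceeds_sync_commit() {
    sct=$(sync_commit_timeout "$1")
    [ -n "$sct" ] || return 2
    [ "$2" -gt "$sct" ]
}
```

For example, timeout_exceeds_sync_commit /opt/faircomdb/server/ctsrvr.cfg 120 checks a 120-second timeout against the configured value.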

To confirm that an application has actually connected to a new node in the cluster, it can check the replication node ID assigned to that node. For example, when using fencing agents, it is not expected that the original node remains active. It could be undefined behavior should the application connect back to a failed node.

When configured as recommended in the above sections, each node maintains a unique replication identification which is a reliable indicator. Use GetSymbolicNames() with ctNODEID as the mode parameter. Compare the current ID after connection with a prior ID and you will immediately know if your cluster has indeed failed over and your application successfully reconnected to the alternate node.

Logs that may hold information about the cause of a failure:
  • /var/log/messages

    This log holds details about cluster operations and errors encountered by resource agents.

  • /opt/faircomdb/data/CTSTATUS.FCS

    This log holds details about FairCom DB errors.

  • /opt/faircomdb/data/ctreplagent.log

    This log contains Replication Agent-specific errors and is configured in ctreplagent.cfg.

Describe resource

sudo pcs resource describe ocf:heartbeat:faircomdb

Output

ocf:heartbeat:faircomdb - FairCo DB resource agent

Resource agent script for FairComDB Database Server

Resource options:
  binary: Full path to the FairComDB binary
  config: Full path to a FairComDB configuration directory or configuration file
  pid: File to read running process PID
  user: User name or id FairComDB will run under
  group: Group name or id FairComDB will run under
  faircomdb_adminpass (required): FairComDB ADMIN password
  faircomdb_monitor_user: FairComDB user name to connect for monitoring
  faircomdb_monitor_userpass: FairComDB password to connect for monitoring
  faircomdb_servername (required): FairComDB server name to connect for
                                   monitoringi
  faircomdb_host: FairComDB host name to connect for monitoring
  faircomdb_working_directory (required): FairComDB server working directory.
                                          This must be writable for FairComDB to
                                          start and operate
  faircomdb_data_directory (required): FairComDB server data directory. This
                                       must be writable for FairComDB to start
                                       and operate
  faircomdb_heartbeat_tcpip: FairComDB heartbeat option that forces TCPIP
                             connections when enabled. Setting false will use a
                             shared memory heartbeat that will fallback to TCPIP
  ctsmon: Path to FairComDB ctsmon monitor utility. Assume located in $PATH if
          not provided
  repadm: Path to FairComDB Replication Agent Administrator utility. Assume
          located in $PATH if not provided

Default operations:
  start: interval=0s timeout=30s
  stop: interval=0s timeout=30s
  monitor: interval=10s timeout=20s
  notify: interval=0s timeout=30s
  promote: interval=0s timeout=20s
  demote: interval=0s timeout=20s

This script completely removes ALL existing cluster configurations and component definitions and creates a fresh cluster configuration from scratch.

Delete and recreate new cluster definitions and remove all old definitions

pcs cluster destroy --all

Define machines in the cluster

pcs cluster setup faircom cluster1.company.com cluster2.company.com

Start the cluster

pcs cluster start --all

Create a config file

pcs cluster cib clcfg

Set up a FairCom DB resource

Options should be configured for your specific system.

pcs -f clcfg resource create faircomdb ocf:heartbeat:faircomdb binary=/opt/faircomdb/server/faircom config=/opt/faircomdb/server/ctsrvr.cfg ctsmon=/opt/faircomdb/tools/ctsmon ctstop=/opt/faircomdb/tools/ctstop faircomdb_adminpass=ADMIN faircomdb_servername=FAIRCOMS faircomdb_working_directory=/opt/faircomdb/server faircomdb_data_directory=/opt/faircomdb/data user=fctech repadm=/opt/faircomdb/tools/repadm pid=/opt/faircomdb/server/faircomdb.pid

faircomdb uses primary/secondary

pcs -f clcfg resource promotable faircomdb master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true

Add a monitor for the primary server

pcs -f clcfg resource op add faircomdb monitor timeout=20 interval=11 role=Master

Set timeouts

pcs -f clcfg resource update faircomdb op promote timeout=120s
pcs -f clcfg resource update faircomdb op start timeout=120s
pcs -f clcfg resource update faircomdb op stop timeout=120s

Create a Virtual IP (VIP) to connect your application to

pcs -f clcfg resource create VIP ocf:heartbeat:IPaddr2 ip=YOUR_IP cidr_netmask=22 op monitor interval=30

The VIP moves with the master server

pcs -f clcfg constraint colocation add VIP with master faircomdb-clone score=INFINITY

Start the master before the VIP

pcs -f clcfg constraint order faircomdb-clone then start VIP

System specific fencing

This must be customized for each system.

pcs -f clcfg stonith create vmfence fence_vmware_soap pcmk_host_map="cluster1.company.com:RHEL8-Node3;cluster2.company.com:RHEL8-Node4" ipaddr=vmware-host.company.com ssl_insecure=1 login=root passwd=mysecretpassword
pcs -f clcfg property set stonith-enabled=true

No quorum for only two nodes

pcs -f clcfg property set no-quorum-policy=ignore

Do not failover during normal operations

pcs -f clcfg resource defaults update resource-stickiness=INFINITY

Failover on first error

pcs -f clcfg resource defaults update migration-threshold=1

Allow failed resource to resume being master after 60 seconds

pcs -f clcfg resource defaults update failure-timeout=60

Apply above changes

pcs cluster cib-push clcfg
Resources: