
PostgreSQL Troubleshooting

The PostgreSQL Cluster Does not Start after an OS Upgrade and Reboot

If you have patched both the MGMTPOSTGRES_MASTER and MGMTPOSTGRES_SLAVE servers on RHEL 7 and rebooted them, you may find that the cluster does not start even after both servers come back up. This happens because the cluster is out of sync and cannot start automatically after the reboot.

Verify that Pacemaker startup is enabled after the system reboot – Pacemaker is the cluster utility used by the CloudCenter platform to ensure high availability. Pacemaker starts on each reboot, so you do not have to start the HA cluster manually. Instead, verify that the following requirements are configured accurately (a command-level sketch of these checks follows this list):

  • Ensure that the firewall rules are correctly configured. The firewall rules for this utility are explained in CCM and Database Firewall Rules.

  • Go to each PostgreSQL server and enter the following command to review the status of the database and the HA connectivity:

    Ensure that the PCSD Status for both database servers displays Online in the output of this command:

    pcs cluster status
    

  • Verify that your PostgreSQL instance is communicating with the CCM.
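
A minimal command-level sketch of these checks, assuming a systemd-based RHEL 7 or CentOS 7 host; the VIP, user, and database names in the psql line are placeholders, not values defined in this guide:

    # Confirm the cluster services are enabled to start at boot
    systemctl is-enabled corosync pacemaker pcsd

    # Enable cluster startup on all nodes if it is not already enabled
    pcs cluster enable --all

    # From the CCM, confirm basic connectivity to the database VIP (placeholder values)
    psql -h <database_vip> -p 5432 -U <db_user> -d <db_name> -c 'SELECT 1;'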

Recovering a PostgreSQL Setup

PostgreSQL Requirements

If your PostgreSQL setup does not exactly match the prerequisites listed in the following section, be aware that you may face unpredictable consequences.

Prerequisites

The CloudCenter platform requires you to use:

  • PostgreSQL 9.6.3 – The CloudCenter platform does not support any other version.

  • CentOS 7 or RHEL 7 OS for any PostgreSQL instance.

  • An active Daemon Status for Corosync, Pacemaker and the pcsd service.

If you use a fully loaded RPM system or install any additional packages (for example, Fuse, EVM, RAID, or other similar packages), you may face unpredictable consequences.
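
A quick way to confirm these prerequisites on each database node; a sketch that assumes the PostgreSQL 9.6 server binaries are installed in the default /usr/pgsql-9.6 location:

    # Confirm the PostgreSQL server version (must be 9.6.3)
    /usr/pgsql-9.6/bin/postgres --version

    # Confirm the OS release (must be CentOS 7 or RHEL 7)
    cat /etc/redhat-release

    # Confirm the cluster daemons are active
    systemctl status corosync pacemaker pcsd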

Common HA Troubleshooting and Recovery Procedures

Depending on your setup and configuration, you may encounter a scenario in your PostgreSQL HA setup where the master and slave instances are rebooted or reconnected after a network failure, or where an HA failover has occurred because the current master database instance went offline. In such cases, the cluster management software might not start PostgreSQL on a node without user intervention.

  1.  Common administrative tasks to check and rectify the cluster status
    1. Check the status of the cluster to ensure that it is running.

      pcs status
    2. You may see a case (as indicated by the following pcs status output) where the slave was not started due to an error. For example, the slave may not have started because of a pre-existing lock file, which the HA software generates so that administrators can review and accept changes, if any, before the node rejoins the cluster.

      Cluster name: cliqrdbcluster
      Stack: corosync
      Current DC: dbslave (version 1.1.15-11.el7-e174ec8) - partition with quorum
      Last updated: Thu Jun  1 10:16:56 2017          
      Last change: Thu Jun  1 09:42:23 2017 by hacluster via crmd on dbslave
      2 nodes and 4 resources configured
      Online: [ dbmaster dbslave ]
      Full list of resources:
      Resource Group: VIPGroup
           EC2SecondaryIP     (ocf::heartbeat:EC2SecIP):      Stopped
           PGMasterVIP        (ocf::heartbeat:IPaddr2):       Stopped
      Master/Slave Set: mspostgresql [pgsql]
           Slaves: [ dbslave ]
           Stopped: [ dbmaster ]
      Failed Actions:
      * pgsql_start_0 on dbmaster 'unknown error' (1): call=15, status=complete, exitreason='My data may be inconsistent. You have to remove /var/lib/pgsql/tmp/PGSQL.lock file to force start.',
          last-rc-change='Thu Jun  1 09:42:24 2017', queued=0ms, exec=246ms
      Daemon Status:
        corosync: active/disabled
        pacemaker: active/enabled
        pcsd: inactive/disabled
    3. If you see the scenario described in the previous step, delete the lock file (in this example, /var/lib/pgsql/tmp/PGSQL.lock) on the stopped node (in the above example, dbmaster) and run the following command:

      pcs resource cleanup mspostgresql 
    4. Run pcs status again to confirm the errors have cleared up and to ensure that the slave is back online.

      The PostgreSQL services should now run without any issues.
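
    A consolidated sketch of this recovery, assuming dbmaster is the stopped node as in the example above:

      # On the stopped node (dbmaster in this example), remove the stale lock file
      rm -f /var/lib/pgsql/tmp/PGSQL.lock

      # Clear the failed action and recheck the cluster
      pcs resource cleanup mspostgresql
      pcs status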

  2.  Rebooting HA Instances

    The administrator may need to reboot the HA instances for system administration tasks such as applying security patches or upgrading versions. In these cases, the administrator must follow the sequence provided in this section.

    1. Execute pcs status on each node to determine which node is operating as the current master.

      Cluster name: cliqrdbcluster
      Stack: corosync
      Current DC: dbmaster (version 1.1.15-11.el7-e174ec8) - partition with quorum
      Last updated: Fri Jun  2 03:23:22 2017          
      Last change: Thu Jun  1 10:42:41 2017 by root via crm_attribute on dbslave
      2 nodes and 4 resources configured
        Online: [ dbmaster dbslave ]
      Full list of resources:
         Resource Group: VIPGroup
             EC2SecondaryIP     (ocf::heartbeat:EC2SecIP):      Started dbslave
             PGMasterVIP        (ocf::heartbeat:IPaddr2):       Started dbslave
         Master/Slave Set: mspostgresql [pgsql]
             Masters: [ dbslave ]
             Stopped: [ dbmaster ]
         Failed Actions:
             * pgsql_start_0 on dbmaster 'unknown error' (1): call=15, status=complete,
             exitreason='My data may be inconsistent. You have to remove
             /var/lib/pgsql/tmp/PGSQL.lock file to force start.',
             last-rc-change='Thu Jun  1 10:43:15 2017', queued=0ms,exec=263ms
         Daemon Status:
             corosync: active/disabled
             pacemaker: active/enabled
             pcsd: inactive/disabled
      

      The above output indicates that the node named dbslave is the current master.

    2. Reboot your current slave node – in the example above, reboot the node named dbmaster.
    3. After the reboot, follow the steps in the Common administrative tasks to check and rectify the cluster status section above to check whether the node has any errors; if any cleanup is required, follow the steps provided in that section.
    4. Reboot the node that is the current master – in this example, the node named dbslave.

    5. After the reboot, run the pcs status command on the node that is now the current slave – in this example, the node named dbmaster. You can see that an HA failover has occurred and that node has become the new master.

    6. After the old master comes back up from the reboot, follow the steps in the Common administrative tasks to check and rectify the cluster status section above to verify and clean up any errors so that the rebooted node can rejoin the cluster as a slave.
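
    The full sequence as a consolidated sketch, assuming dbslave is the current master and dbmaster is the current slave, as in the example above:

      # 1. On either node, identify the current master
      pcs status

      # 2. Reboot the current slave (dbmaster in this example), then clean up once it is back
      reboot
      pcs resource cleanup mspostgresql

      # 3. Reboot the current master (dbslave); failover promotes the other node to master
      reboot

      # 4. On the new master, confirm the failover and clear any reported lock-file error
      pcs status
      rm -f /var/lib/pgsql/tmp/PGSQL.lock    # only on the node reporting the lock-file error
      pcs resource cleanup mspostgresql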

Switch the Master Back to Host 1

Assume you have Host 1 (master) and Host 2 (slave), and a failover has caused Host 2 to become the master.

You may want to switch the master back to Host 1. This section provides the procedure for this type of switch over.

  1. From either PostgreSQL instance, issue the following command:

    pcs cluster stop host2

    At this point, Host 1 becomes the master again.

  2. Verify the PCS status for any possible errors:

    pcs status
  3. If you see the following Failed Actions error:

    Failed Actions:
    * pgsql_start_0 on dbmaster 'unknown error' (1): call=15, status=complete, exitreason='My data may be inconsistent. You have to remove /var/lib/pgsql/tmp/PGSQL.lock file to force start.',
    last-rc-change='Tue Jun 13 14:59:43 2017', queued=0ms, exec=224ms

    then execute the following command to bring the HA back to the functional state:

    pcs resource cleanup mspostgresql
  4. Remove the lock file.

    rm -f /var/lib/pgsql/tmp/PGSQL.lock
  5. Recheck the status of the cluster to ensure that you do not see any errors and that all services are up and running.

    pcs status

    At this point, you should not see any errors!
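
A consolidated sketch of the switch over, using the host names host1 and host2 from this example; the final pcs cluster start line is an assumption about bringing Host 2 back into the cluster as the slave and is not part of the documented steps above:

    # Stop the cluster on Host 2 so that Host 1 is promoted back to master
    pcs cluster stop host2
    pcs status

    # If the lock-file error appears, clean it up
    rm -f /var/lib/pgsql/tmp/PGSQL.lock
    pcs resource cleanup mspostgresql
    pcs status

    # Assumption: rejoin Host 2 to the cluster so it runs as the slave
    pcs cluster start host2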

If the PostgreSQL Service Is Shut Down and the Database Goes Out of Sync, Restarting the Service Results in a Throttled Error

The sequence in which you stop and restart PostgreSQL and the PCS cluster on the current node is documented in detail in HA Mode Backup and Recovery. Follow the instructions on that page to stop or start the database in HA mode.

If you see an HA service issue, see Graceful PostgreSQL Reboot for additional context.
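
As a general illustration only (the documented sequence in HA Mode Backup and Recovery takes precedence), a cluster-managed node is stopped and started through pcs rather than by stopping the PostgreSQL service directly:

    # Stop the cluster services (and the managed PostgreSQL resource) on one node
    pcs cluster stop <node_name>

    # Start them again and verify the resource state
    pcs cluster start <node_name>
    pcs status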

Verify the Health of the PostgreSQL Cluster

To verify the health of the PostgreSQL cluster, work through the following checks. Each check lists what to verify, the action to take, the expected result on success, and the remedy on failure. A consolidated command sketch follows the list.

  • Verify: The SSH connectivity between the instances

    Action – From each node, execute the following commands:

      ssh node1 ls
      ssh node2 ls

    Success – Both commands succeed without authentication.

    Failure – If either command fails, fix the mutual authentication between the systems. See Database (Required) > Database HA.

  • Verify: The cluster status between the instances

    Action – From each node, execute the following command:

      pcs cluster status

    Success – The PCSD Status for both database servers displays Online. The output should be similar to the following:

      2 nodes configured
      4 resources configured

      PCSD Status:
        node1: Online
        node2: Online

    Failure – If either database server displays Offline, execute the following command (see the first section on this page for a sample output):

      pcs cluster status --debug

    Some reasons for the cluster to be offline:

      • A proxy variable is set on the server – Set a no_proxy variable for the local network if a proxy is set on the server. The no_proxy variable should contain a comma-separated list of domain extensions.
      • Insufficient space on the server – Check the available system space against the CloudCenter platform's Hardware Requirements.

  • Verify: The services are running

    Action – From each node, execute the following command:

      pcs status

    The last few lines of the output show the Daemon Status, similar to the following:

      corosync: active/disabled
      pacemaker: active/enabled
      pcsd: active/enabled

    Success – All services are in the active state.

    Failure – Start any services that are not active. For example:

      systemctl start pacemaker
      systemctl start pcsd

  • Verify: The system firewall rules are met

    Action – From each node, execute the following command:

      iptables -L -n

    Success – The following rules are present:

      ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:5432 ctstate NEW
      ACCEPT     udp  --  0.0.0.0/0            0.0.0.0/0            udp dpt:5405 ctstate NEW
      ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:2224 ctstate NEW
      ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:3121 ctstate NEW
      ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:21064 ctstate NEW

    Failure – Reload the firewall rules if you have any issues:

      systemctl restart firewalld

  • Verify: The VIP is in the same subnet as the master and slave IP addresses

    Action – From each node, execute the following command:

      ifconfig

    Success – The configured VIP is in the same subnet as the master and slave IPs.

    Failure – If the configured VIP is in a different subnet than the master and slave IPs, the cluster will not be functional because the pcs process cannot reach the IP across systems. To reconfigure an HA environment using a different VIP, execute the following command on Node1:

      rm /usr/local/osmosix/etc/.HAINSTALLED

    Then launch the wizard and configure the new VIP so HA is configured with that VIP:

      /usr/local/osmosix/bin/db_config_wizard.sh
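
A minimal sketch that strings these checks together on one node; node1 and node2 are the example host names used above:

    # Passwordless SSH between the nodes
    ssh -o BatchMode=yes node1 ls >/dev/null && echo "ssh node1: OK"
    ssh -o BatchMode=yes node2 ls >/dev/null && echo "ssh node2: OK"

    # Cluster status and daemon status
    pcs cluster status
    pcs status | grep -A 3 "Daemon Status"

    # Firewall rules for the database and cluster ports
    iptables -L -n | grep -E 'dpt:(5432|5405|2224|3121|21064)'

    # Interfaces, to confirm the VIP subnet
    ifconfig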

PCS Status Errors

PCS Status Error:

  Failed Actions:
  * pgsql_start_0 on node2 'unknown error' (1): call=43, status=complete, exitreason='My data may be inconsistent. You have to remove /var/lib/pgsql/tmp/PGSQL.lock file to force start.',
      last-rc-change='Mon Apr 16 09:59:57 2018', queued=0ms, exec=172ms

Cause:

  This error occurs when the master database crashes or reboots and the slave instance fails over to become the master instance. After the failover, you must remove the lock file on the slave instance so that it can rejoin the HA cluster.

Remedy:

  Remove the lock file on the node specified in the output (in this error example, node2):

    rm /var/lib/pgsql/tmp/PGSQL.lock

PCS Status Error:

  Master/Slave Set: mspostgresql [pgsql]
  Masters: [ node2 ]
  Stopped: [ node1 ]
  Failed Actions:
  * pgsql_start_0 on node1 'unknown error' (1): call=15, status=Timed Out, exitreason='none',
      last-rc-change='Mon Apr 16 11:19:00 2018', queued=1ms, exec=60002ms

Cause:

  The database is out of sync with the master. One possible reason is that the slave database was down for a long time while the master was running.

Remedy:

  Try a resource cleanup and check whether the issue is resolved:

    pcs resource cleanup mspostgresql
    pcs status

  If the PCSD status is online (as displayed in the first example in this section), the issue is resolved.

  If the status still displays a failure, verify that SSH connectivity exists between the instances. From each of the database instances, execute the following commands:

    ssh node1 ls
    ssh node2 ls

  Both commands should work without authentication. If they fail, fix the mutual authentication between the systems. See Database (Required) > Database HA for more details.

  If the status is still the same, check the logs:

    /var/lib/pgsql/9.6/data/pg_log/postgresql-<DayofWeek>.log

    < 2018-04-16 11:19:55.544 UTC > FATAL:  the database system is starting up
    < 2018-04-16 11:19:56.275 UTC > LOG:  invalid magic number 0000 in log segment 000000010000000000000003, offset 0

  The invalid magic number error indicates that data files on the master instance that were yet to be replicated to the slave instance have already been cleaned up. In that case:

  1. On the slave node, run the following commands:

    cd /var/lib/pgsql/9.6/
    mv data olddata
    sudo -u postgres /usr/pgsql-9.6/bin/pg_basebackup -h ${master_ip} -D /var/lib/pgsql/9.6/data -U replication -v -P --xlog-method=stream
    pcs resource cleanup mspostgresql
    pcs status

  2. Verify that you have addressed all steps in the Verify the Health of the PostgreSQL Cluster section above and that no issues are left pending.

  3. Back up the database (see Backup and Recovery for PostgreSQL Database in the required mode) and then run the following command:

    pg_basebackup
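
To confirm that streaming replication is healthy again after the rebuild, a quick check on the master node (a sketch; pg_stat_replication is the standard PostgreSQL replication status view):

    sudo -u postgres psql -c "SELECT client_addr, state, sync_state FROM pg_stat_replication;"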

HA-Specific Database Log Files

This section identifies log files that are specific to Database HA installations. For all other file locations, see Locate Log Files.

  • corosync.log (/var/log/cluster/) – The logs for the Corosync Cluster Engine high availability communication.
  • pcsd.log (/var/log/pcsd/) – The log messages that relate to the health status of Pacemaker and Corosync.
  • pacemaker.log (/var/log/) – The logs for the cluster utility used by the CloudCenter platform to ensure high availability.
  • messages (/var/log/) – The log messages related to the cluster in an HA environment.
  • postgresql-<weekday>.log (/var/lib/pgsql/9.6/data/pg_log/) – The logs for database high availability communication.
  • config.log (/usr/local/osmosix/logs/) – The logs for the HA wizard configuration.
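
To watch the HA-related logs on a node while reproducing an issue, a simple sketch based on the locations above (the PostgreSQL log name varies by day of the week):

    tail -f /var/log/cluster/corosync.log /var/log/pacemaker.log /var/log/pcsd/pcsd.log /var/log/messages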