PostgreSQL Troubleshooting                                       

The PostgreSQL Cluster Does not Start after an OS Upgrade and Reboot

If you have patched both the MGMTPOSTGRES_MASTER and MGMTPOSTGRES_SLAVE servers on RHEL 7 and rebooted both servers, you may encounter a situation where the cluster does not start even after both servers come back up. This happens because the cluster is out of sync and cannot start automatically after the reboot.

Verify that Pacemaker startup is enabled after the system reboot – Pacemaker is the cluster utility used by the CloudCenter platform to ensure high availability. Starting with CloudCenter 4.8.0, Pacemaker starts with each reboot and you do not have to start the HA cluster manually. Instead, verify that you have configured the following requirements accurately:

  • Ensure that the firewall rules are correctly configured. The firewall rules for this utility are explained in CCM Firewall Rules.
  • Go to each PostgreSQL server and enter the following command to review the status of the database and the HA connectivity:

    pcs status
    • Ensure that the PCSD Status for both database servers is Online.
    • Ensure that the Daemon Status for Corosync, Pacemaker, and the PCSD service are active/disabled.
  • Verify that your PostgreSQL instance is communicating with the CCM as detailed in the next section (The PostgreSQL Instance has Stopped Communicating with the CCM).
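
The checks above can be sketched as a small helper that inspects the output of pcs status. This is a minimal sketch, assuming the node names dbmaster and dbslave used in the example output later in this guide:

```shell
# Sketch: post-reboot health check against the output of `pcs status`.
# Node names (dbmaster, dbslave) are assumptions taken from the examples below.
check_cluster() {
  status="$1"                       # pass in the output of `pcs status`
  # Both database servers should be listed as Online.
  echo "$status" | grep -q 'Online: \[ dbmaster dbslave \]' || {
    echo 'WARN: not all nodes are online'
    return 1
  }
  # Corosync, Pacemaker, and pcsd should all report an active Daemon Status.
  for daemon in corosync pacemaker pcsd; do
    echo "$status" | grep -q "$daemon: active" || {
      echo "WARN: $daemon is not active"
      return 1
    }
  done
  echo 'cluster looks healthy'
}
```

On each database server, you could run it as check_cluster "$(pcs status)".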

The PostgreSQL Instance has Stopped Communicating with the CCM

If you encounter this scenario when using the HA mode, follow these steps to correct this problem:

  1. Start the PostgreSQL service.

    systemctl start postgresql-9.5
  2. Ensure that no .lock file is present in the /var/lib/pgsql/tmp directory.

    cd /var/lib/pgsql/tmp/
    # If PGSQL.lock is present, remove it:
    rm -rf *.lock
  3. Stop and start the cluster.

    pcs cluster stop
    pcs cluster start
  4. Check the status of the cluster to ensure that it is running.

    pcs status
  5. If the issue persists, restart the dbmaster and PGMasterVIP resources.

    pcs resource restart dbmaster
    pcs resource restart PGMasterVIP
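
The five steps above can be sketched as a single recovery function. Pass echo as the argument for a dry run; the service name and lock-file path follow the commands shown above and should be adjusted for your installation:

```shell
# Sketch of the recovery sequence above. Pass "echo" for a dry run;
# pass nothing to execute the commands for real (requires root and pcs).
recover_postgres() {
  run="$1"                                   # "echo" = dry run
  $run systemctl start postgresql-9.5        # 1. start the PostgreSQL service
  $run rm -f /var/lib/pgsql/tmp/PGSQL.lock   # 2. remove a stale lock file, if any
  $run pcs cluster stop                      # 3. stop the cluster...
  $run pcs cluster start                     #    ...and start it again
  $run pcs status                            # 4. verify the cluster is running
  # 5. Only if the issue persists:
  #    $run pcs resource restart dbmaster
  #    $run pcs resource restart PGMasterVIP
}
```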

Recovering a PostgreSQL Setup

PostgreSQL Requirements!

If your PostgreSQL setup is not exactly as specified in the Prerequisites section below, be aware that you may face unpredictable consequences.

Prerequisites

The CloudCenter platform requires you to use:

  • PostgreSQL 9.5.6 – The CloudCenter platform does not support any other version.
  • CentOS 7 or RHEL 7 OS for any PostgreSQL instance.
  • An active Daemon Status for Corosync, Pacemaker, and the pcsd service.

If you use a fully-loaded RPM system or load any other additional package, for example, Fuse, EVM, Raid, or other similar packages, you may face unpredictable consequences.
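
As a quick sanity check on the version requirement above, a sketch like the following can flag an unsupported PostgreSQL build; the exact psql --version output format is an assumption:

```shell
# Sketch: fail fast if the installed PostgreSQL is not 9.5.6,
# the only version the CloudCenter platform supports.
require_pg_956() {
  ver="$1"                      # e.g. the output of `psql --version`
  case "$ver" in
    *9.5.6*) echo 'ok: PostgreSQL 9.5.6' ;;
    *) echo "unsupported PostgreSQL version: $ver"; return 1 ;;
  esac
}
```

You could invoke it on each database server as require_pg_956 "$(psql --version)".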

Common HA Troubleshooting and Recovery Procedures

In some cases, depending on your setup and configuration, your PostgreSQL HA setup may encounter a scenario where the master and slave instances are rebooted or reconnected after a network failure, or where an HA failover has occurred because the current master database instance went offline. In such cases, the cluster management software might not start PostgreSQL on a node without user intervention.

  1.  Common administrative tasks to check and rectify the cluster status
    1. Check the status of the cluster to ensure that it is running.

      pcs status
    2. You may see a case (as in the following pcs status output) where the slave did not start due to an error. For example, the slave may not have started because of a pre-existing lock file: the HA software generates a lock file so that administrators can review and commit/accept changes, if any.

      Cluster name: cliqrdbcluster
      Stack: corosync
      Current DC: dbslave (version 1.1.15-11.el7-e174ec8) - partition with quorum
      Last updated: Thu Jun  1 10:16:56 2017          
      Last change: Thu Jun  1 09:42:23 2017 by hacluster via crmd on dbslave
      2 nodes and 4 resources configured
      Online: [ dbmaster dbslave ]
      Full list of resources:
      Resource Group: VIPGroup
           EC2SecondaryIP     (ocf::heartbeat:EC2SecIP):      Stopped
           PGMasterVIP        (ocf::heartbeat:IPaddr2):       Stopped
      Master/Slave Set: mspostgresql [pgsql]
           Slaves: [ dbslave ]
           Stopped: [ dbmaster ]
      Failed Actions:
      * pgsql_start_0 on dbmaster 'unknown error' (1): call=15, status=complete, exitreason='My data may be inconsistent. You have to remove /var/lib/pgsql/tmp/PGSQL.lock file to force start.',
          last-rc-change='Thu Jun  1 09:42:24 2017', queued=0ms, exec=246ms
      Daemon Status:
        corosync: active/disabled
        pacemaker: active/enabled
        pcsd: inactive/disabled
    3. If you see the scenario described in Step b above, delete the lock file (in this example, /var/lib/pgsql/tmp/PGSQL.lock) on the stopped node (in the above example, dbmaster) and run the following command.

      pcs resource cleanup mspostgresql 
    4. Run pcs status again to confirm the errors have cleared up and to ensure that the slave is back online.

      The PostgreSQL services should now run without any issues.
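
The lock-file cleanup in Step c above can be sketched as a small helper; the default path matches the exitreason shown in the pcs status output:

```shell
# Sketch: remove a stale PGSQL.lock on the stopped node, if present.
# The default directory matches the exitreason in the `pcs status` output above.
clear_pgsql_lock() {
  dir="${1:-/var/lib/pgsql/tmp}"
  if [ -f "$dir/PGSQL.lock" ]; then
    rm -f "$dir/PGSQL.lock"
    echo "removed $dir/PGSQL.lock"
  else
    echo "no lock file in $dir"
  fi
}
# After removing the lock file, clear the failed action on the resource:
#   pcs resource cleanup mspostgresql
```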

  2.  Rebooting HA Instances

    The administrator may need to reboot the HA instances for certain system administration tasks, such as applying security patches or version upgrades. In these cases, the administrator must follow the sequence provided in this section.

    1. Execute pcs status on each node to determine which node is operating as the current master.

      Cluster name: cliqrdbcluster
      Stack: corosync
      Current DC: dbmaster (version 1.1.15-11.el7-e174ec8) - partition with quorum
      Last updated: Fri Jun  2 03:23:22 2017          
      Last change: Thu Jun  1 10:42:41 2017 by root via crm_attribute on dbslave
      2 nodes and 4 resources configured
        Online: [ dbmaster dbslave ]
      Full list of resources:
         Resource Group: VIPGroup
             EC2SecondaryIP     (ocf::heartbeat:EC2SecIP):      Started dbslave
             PGMasterVIP        (ocf::heartbeat:IPaddr2):       Started dbslave
         Master/Slave Set: mspostgresql [pgsql]
             Masters: [ dbslave ]
             Stopped: [ dbmaster ]
         Failed Actions:
             * pgsql_start_0 on dbmaster 'unknown error' (1): call=15, status=complete,
             exitreason='My data may be inconsistent. You have to remove
             /var/lib/pgsql/tmp/PGSQL.lock file to force start.',
             last-rc-change='Thu Jun  1 10:43:15 2017', queued=0ms,exec=263ms
         Daemon Status:
             corosync: active/disabled
             pacemaker: active/enabled
             pcsd: inactive/disabled

      The above output indicates that the node named dbslave is the current master.

    2. Reboot your current slave node – in the example above, reboot the node named dbmaster.
    3. After the reboot, follow the steps mentioned in the Common administrative tasks to check and rectify the cluster status section above to check if the node has any errors – if any cleanup is required, then follow the steps provided in this section.
    4. Reboot the node which is the current master – in this example, the node named dbslave.

    5. Once the reboot occurs, run the pcs status command on the node that was the slave – in this example, the node named dbmaster. You can see that an HA failover has occurred and that node has become the new master.

    6. After the old master reboots, follow the steps in the Common administrative tasks to check and rectify the cluster status section above to verify and clean up any errors so that the rebooted node can rejoin the cluster as a slave.
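
The reboot order above can be sketched as a dry-run checklist. The ssh &lt;node&gt; reboot form is an illustrative assumption – use whatever reboot mechanism fits your environment:

```shell
# Sketch of the rolling-reboot order above. Pass "echo" for a dry run.
# Assumes dbslave is the current master and dbmaster the current slave,
# as in the example `pcs status` output; `ssh <node> reboot` is illustrative.
rolling_reboot() {
  run="$1"                                 # "echo" = dry run
  $run pcs status                          # 1. confirm which node is the master
  $run ssh dbmaster reboot                 # 2. reboot the current slave first
  $run pcs resource cleanup mspostgresql   # 3. clear any lock-file errors after it rejoins
  $run ssh dbslave reboot                  # 4. reboot the current master; HA fails over
  $run pcs status                          # 5. the old slave should now be the master
}
```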

Switch the Master Back to Host 1

Assume you have Host 1 (master) and Host 2 (slave), and a failover has caused Host 2 to become the master.

You may want to switch the master back to Host 1. This section provides the process for this type of switchover.

  1. From either PostgreSQL instance, issue the following command:

    pcs cluster stop host2

    At this point, Host 1 becomes the master again.

  2. Verify the PCS status for any possible errors:

    pcs status
  3. If you see the following Failed Actions error:

    Failed Actions:
    * pgsql_start_0 on dbmaster 'unknown error' (1): call=15, status=complete, exitreason='My data may be inconsistent. You have to remove /var/lib/pgsql/tmp/PGSQL.lock file to force start.',
    last-rc-change='Tue Jun 13 14:59:43 2017', queued=0ms, exec=224ms

    then execute the following command to bring the HA back to the functional state:

    pcs resource cleanup mspostgresql
  4. Recheck the status of the cluster to ensure that you do not see any errors and that all services are up and running.

    pcs status

    At this point, you should not see any errors.
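
The switch-back steps above can be sketched the same way; host2 is the node name from Step 1, and passing echo gives a dry run:

```shell
# Sketch of the switch-back sequence above. Pass "echo" for a dry run.
switch_master_back() {
  run="$1"                                 # "echo" = dry run
  $run pcs cluster stop host2              # 1. Host 1 becomes the master again
  $run pcs status                          # 2. check for Failed Actions
  $run pcs resource cleanup mspostgresql   # 3. only needed if a lock-file error appears
  $run pcs status                          # 4. confirm all services are up
}
```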

If the PostgreSQL Service Is Shut Down and the Database Goes Out of Sync, Restarting the Service Results in a Throttled Error

The sequence in which you stop and restart PostgreSQL and the PCS cluster on the current node is documented in detail in HA Mode Backup and Recovery. Follow the instructions provided on that page to stop or start the database in HA mode.

If you see a HA service issue, see Graceful PostgreSQL Reboot for additional context.
