The PostgreSQL Cluster Does not Start after an OS Upgrade and Reboot
If you have patched both the MGMTPOSTGRES_MASTER and MGMTPOSTGRES_SLAVE servers on RHEL 7 and rebooted both the servers, you may encounter a situation when the cluster does not start even after rebooting both servers. This is because the cluster is not in sync and cannot be automatically started after the reboot.
Verify that the Pacemaker startup is enabled after the system reboot – Pacemaker is the cluster utility used by the CloudCenter platform to ensure high availability. Pacemaker starts with each reboot and you do not have to manually start the HA cluster. Instead, verify that you have configured the following requirements accurately:
- Ensure that the firewall rules are correctly configured. The firewall rules for this utility are explained in CCM Firewall Rules.
Go to eachserver and enter the following command to review the status of the database and the HA connectivity:
- Ensure that the PCSD Status for both database servers are Online.
- Ensure that the Daemon Status for Corosync, Pacemaker, and the PCSD service are active/disabled.
- Verify that your PostgreSQL instance is communicating with the CCM as detailed in the next section (The PostgreSQL Instance has Stopped Communicating with the CCM).
The PostgreSQL Instance has Stopped Communicating with the CCM
If you encounter this scenario when using the HA mode, follow these steps to correct this problem:
Start Postgres service.
Ensure that the .lock file is not present in the required directory.
Stop and start the cluster.
Check the status of the cluster to ensure that it running.
If you are still facing this issue, then restart the dbmaster and PGMasterVIP.
Recovering a PostgreSQL Setup
If your PostgreSQL setup is not exactly as specified in the Prerequisites section below, be aware that you may face unpredictable consequences.
The CloudCenter platform requires you to use:
- PostgreSQL 9.6.3 – The CloudCenter platform does not support any other version.
- CentOS 7 or RHEL 7 OS for any PostgreSQL instance.
- An active Daemon Status for Corosync, Pacemaker and the pcsd service.
If you use a fully-loaded RPM system or load any other additional package, for example, Fuse, EVM, Raid, or other similar packages, you may face unpredictable consequences.
Common HA Troubleshooting and Recovery Procedures
In some cases, depending on your setup and configuration, it is possible to encounter a scenario in your PostgreSQL HA setup where the master and slave instances are rebooted or reconnected after a network failure, or that a HA failover has occurred due to the current master database instance going offline. In such cases, it is possible that the cluster management software might not start PostgresSQL on a node without user intervention.
- Common administrative tasks to check and rectify the cluster status
Check the status of the cluster to ensure that it running.
You may see a use case (as indicated by the following pcs status output) where the slave was not started due to an error. For example, the slave may not have started because of a pre-existing lock file scenario, as in the following output, where the HA software generates a lock file so admins can review and commit/accept changes, if any.
If you see the scenario mentioned in Step b above, delete the lock file (in this example, delete the lock file named /var/lib/pgsql/tmp/PGSQL.lock) on the node which is stopped (in the above example, it is named dbmaster) and run the following command.
Run pcs status again to confirm the errors have cleared up and to ensure that the slave is back online.
The PostgreSQL services should now run without any issues.
- Rebooting HA Instances
The administrator may need to reboot the HA instances for certain system administration actions like upgrading security patches or versions and so forth. In these cases, the administrator must follow the sequence provided in this section.
Execute pcs status on each node to determine which node is operating as the current master.
The above output indicates that the node named dbslave is the current master
- Reboot your current slave node – in the example above, reboot the node named dbmaster.
- After the reboot, follow the steps mentioned in the Common administrative tasks to check and rectify the cluster status section above to check if the node has any errors – if any cleanup is required, then follow the steps provided in this section.
Reboot the node which is the current master, in this example the node named dbslave is the current master.
Once the reboot occurs, run the pcs status command on the node which is the current slave, in this example the node named dbmaster is the current slave. You can see that a HA failover has occurred and that node has become the new master.
- After rebooting the old master again, follow the steps mentioned in Common administrative tasks to check and rectify the cluster status section above to verify and cleanup any errors so that the rebooted node can rejoin the cluster as a slave.
Switch the Master Back to Host 1
Assuming you have Host 1 (master) and Host 2 (slave) in a scenario which failed over and caused Host 2 to become the master.
You may have a situation where you want to switch back the master to Host 1. This section provides the process for this type of a switch over.
From either PostgreSQL instance, issue the following command:
At this point, Host 1 becomes the master again.
Verify the PCS status for any possible errors:
If you see the following Failed Actions error:
then execute the following command to bring the HA back to the functional state:
Recheck the status of the cluster to ensure that you do not see any errors and that all services are up and running.
At this point, you should not see any errors!
If the PostgreSQL service is shut down, the database goes out of sync, restarting the service results in a throttled error
The sequence in which you stop and restart PostgreSQL and the PCS cluster on the current node is documented in detail in HA Mode Backup and Recovery. Please follow the instructions provided in that page to stop or start the database in HA mode.
If you see a HA service issue, see Graceful PostgreSQL Reboot for additional context.
- No labels