JoinPAY processes tens of thousands of financial transactions every day and serves socially significant industries such as public transport, so we cannot afford downtime. Our terminal equipment does support offline operation everywhere, with accumulated transactions uploaded to the host afterwards, but the less often terminals have to fall back to it, the better.
The task, therefore, was to build a cluster that survives the loss of any single server, or even several servers, and automatically brings servers back into operation after a failure. We tried different approaches and ran synthetic tests, and in the end settled on the PostgreSQL+Patroni+HAProxy+etcd stack.
When we simulate a failure of the master node, failover happens automatically and takes 4-6 seconds in our setup. The node status is checked once per second: 3 consecutive failed responses (code 500) are required to mark a server as down, and 2 consecutive successful responses (code 200) are required to mark it as up again.
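To make the thresholds concrete, here is a minimal Python sketch of the counting logic described above. It is an illustration only, not the load balancer's actual implementation: the names `HealthTracker`, `observe`, and `replay` are hypothetical, and the sketch assumes that any response other than 200 counts as a failed check and that the counter resets whenever a result agrees with the current state.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Tracks one server's up/down state from a stream of check results."""

    fall: int = 3     # consecutive failures needed to mark the server down
    rise: int = 2     # consecutive successes needed to mark it up again
    up: bool = True   # assumption: the server starts in the "up" state
    streak: int = 0   # consecutive results that contradict the current state

    def observe(self, status_code: int) -> bool:
        """Feed one health-check result and return the resulting state."""
        ok = status_code == 200  # assumption: anything but 200 is a failure
        if ok == self.up:
            # The result agrees with the current state, so the streak resets.
            self.streak = 0
        else:
            self.streak += 1
            threshold = self.rise if ok else self.fall
            if self.streak >= threshold:
                # Enough contradicting results in a row: flip the state.
                self.up = ok
                self.streak = 0
        return self.up


def replay(codes):
    """Run a sequence of status codes through a fresh tracker."""
    tracker = HealthTracker()
    return [tracker.observe(code) for code in codes]


if __name__ == "__main__":
    # One check per second: the server is marked down on the third failure
    # and marked up again on the second success.
    print(replay([200, 500, 500, 500, 200, 200]))
    # [True, True, True, False, False, True]
```

With one check per second, detecting a dead master alone takes roughly 3 seconds under these thresholds, which is consistent with the 4-6 seconds we observe for the whole switch.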