Failover when the postmaster is hung #1371
Frankly speaking, I have never seen a situation where the postmaster hung so badly that it is not able to respond at all, but I have seen a lot of situations where all connection slots are occupied and therefore a connection is rejected.
Thanks for the quick response and for looking into this issue. Unfortunately, we ran into this issue on both databases, Oracle and PostgreSQL. I agree that spawning connections frequently is an expensive operation. However, it can be justifiable for highly available services (provided the interval is set small enough to allow timely detection and large enough to limit the associated overhead). I also agree that the master should be fenced in such a scenario and that the setup should ensure there is no data loss during failover.

Oracle has a similar feature. It introduced ObserverReconnect, a Data Guard observer parameter, back in version 11.2.0.4, which allows such a scenario to be detected and a failover to be triggered. As part of the failover, Oracle Data Guard takes the former primary offline to fence it, and synchronous replication is used to prevent data loss. When the parameter is set to 0, the observer uses a persistent connection. Can something similar be done in PostgreSQL using Patroni? In our PostgreSQL case, the postmaster (or the database) was hung and blocked new incoming connections for an extended period of time, leading to loss of service.

Please share your thoughts.
I can't tell anything about Oracle... We are running hundreds of Postgres clusters, and statistically we would have already hit such a problem, but in fact we never have.
If the postmaster is really hanging, a clean shutdown (especially in a timely manner) is barely possible, therefore you may lose some data.
As I already explained, there are different reasons why a connection could not be opened. The most common one is that all connection slots are occupied, and we have seen that a lot, but we have really never observed a hanging postmaster.
How is it possible? Patroni is regularly running queries via the existing connection, therefore it is in the same position as other connected clients.
Ok, let's assume we do a failover in such a situation. It is highly likely that the new primary will sooner or later hit the same issue (usually it happens quickly). In the end you will get a chain of failovers, which would be even worse. If you really see the problem of a hanging postmaster, it should be investigated/debugged and reported as a Postgres bug.
A setup utilizing synchronous replication would help overcome this issue (e.g. synchronous_standby_names set to “[FIRST|ANY] NUM_SYNC (standby list)” and synchronous_commit set to remote_apply). Transactions on the former master basically hang waiting for an acknowledgement once the synchronous standbys start following the newly elected leader. Repmgr does this in a controlled way: it cancels the WAL receiver process on all standby nodes during failover so they no longer receive changes from the master. This ensures that there is zero data loss, as the master is configured with synchronous replication using the above settings. The remaining standby nodes go through leader election, a leader is elected, and the remaining nodes follow the new leader.
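For illustration, a minimal postgresql.conf sketch of the settings described above; the "FIRST 1" quorum and the standby names are placeholders, not values from any cluster in this thread:

```
# Illustrative only: require acknowledgement from one of two named standbys
# before a commit is reported as applied.
synchronous_standby_names = 'FIRST 1 (standby1, standby2)'
synchronous_commit = remote_apply
```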
I gave a scenario where only the postmaster was hung. I will be testing various scenarios, including a hung database. I just tested with the Patroni DB connection hung (to simulate a hung database) on the master, and no failover action was taken. It looks like Patroni doesn't use any statement_timeout, which is probably making the Patroni process wait on the query call forever.
Yes, of course. That's our standard process, where we go through a postmortem to find out the cause and get it fixed (with vendors, if the issue pertains to their code).
The chain of failovers (ping-pong back and forth) can probably be prevented by allowing only x consecutive failovers within a certain time window. The underlying issue that caused the failover is probably limited to that machine and may not carry over with the failover. The point I am trying to make is about database high availability and service continuity across various failure scenarios.
Sorry, but no. When you send
You haven't looked into the source code, have you? JFYI, statement_timeout is handled by Postgres!
This is not a statement_timeout problem. Depending on how Patroni is connected to Postgres, it could be fixed by enabling TCP keepalives, or not (in the case of unix-domain sockets).
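For reference, a minimal sketch (not Patroni's actual code) of how a monitoring connection over TCP could enable libpq keepalives via psycopg2; the host, port, and user are taken from the test case below, while the database name and the timeout values are assumptions for illustration:

```python
import psycopg2

# Keepalive settings are passed through to libpq; as noted above, they have
# no effect on unix-domain socket connections.
conn = psycopg2.connect(
    host="host1", port=4345, dbname="postgres", user="krishna",
    keepalives=1,            # enable TCP keepalives on this connection
    keepalives_idle=5,       # seconds of idleness before the first probe
    keepalives_interval=5,   # seconds between probes
    keepalives_count=3,      # unanswered probes before the connection is considered dead
)
```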
And? What are the results of your investigation? Have you found the reason for the hanging postmaster?
The keyword here is "probably". But my experience clearly shows that such issues are usually related to the workload and persist after failover.
Doing unnecessary failovers improves neither availability nor service continuity. P.S. If you really would like to have this feature, go ahead and implement it; we will happily review and merge the pull request.
## Feature: Postgres stop timeout

The switchover/failover operation hangs on the signal_stop (or checkpoint) call when the postmaster doesn't respond or hangs for some reason (issue described in [1371](#1371)). This leads to loss of service for an extended period of time, until the hung postmaster starts responding or is killed by some other actor.

### master_stop_timeout

The number of seconds Patroni is allowed to wait when stopping Postgres, effective only when synchronous_mode is enabled. When set to a value > 0 and synchronous_mode is enabled, Patroni sends SIGKILL to the postmaster if the stop operation runs for longer than master_stop_timeout. Set the value according to your durability/availability trade-off. If the parameter is not set or is set to a value <= 0, master_stop_timeout does not apply.
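A hypothetical configuration sketch of how the proposed parameter could be set; placing it under the dynamic (DCS) configuration and the 30-second value are assumptions made here for illustration, not a confirmed interface:

```yaml
bootstrap:
  dcs:
    synchronous_mode: true      # the proposal makes master_stop_timeout effective only with synchronous mode
    master_stop_timeout: 30     # seconds to wait for a clean stop before sending SIGKILL
```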
We are looking into replacing Repmgr with Patroni and testing various failover scenarios.
It seems that Patroni currently keeps a persistent connection to the master database. If for any reason the postmaster is hung, that basically blocks all new connection requests to the master database. It appears that Patroni was unable to identify such a scenario and did not trigger a failover.
We have also tested the same scenario with repmgr, which was able to identify this case and fail over the database. Instead of keeping a persistent connection, repmgr makes use of the parameters reconnect_interval and reconnect_attempts, and fails over the database if the process is unable to connect to the master within the configured threshold.
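For comparison, an illustrative repmgr.conf excerpt showing the two parameters mentioned; the values are placeholders, not our production settings:

```
reconnect_attempts = 6     # connection retries to the primary before failover is considered
reconnect_interval = 10    # seconds to wait between retries
```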
Test case:
```
[host1] : ps -eaf | egrep postgres | egrep config | egrep listen_address
postgres 193623 1 0 17:13 ? 00:00:00 /postgres/product/server/10.3.2/bin/postgres -D /postgres/data001/pgrept1/sysdata --config-file=/postgres/data001/pgrept1/sysdata/postgresql.conf --listen_addresses=host1 --port=4345 --cluster_name=pgrept1 --wal_level=logical --hot_standby=on --max_connections=1000 --max_wal_senders=10 --max_prepared_transactions=0 --max_locks_per_transaction=64 --track_commit_timestamp=off --max_replication_slots=10 --max_worker_processes=8 --wal_log_hints=on
[host1] : kill -STOP 193623
[host1] : psql -h host1 -p 4345 -U krishna
HUNG ..
```
From the patroni.log:
```
2020-01-22 13:35:35,082 INFO: Lock owner: host1; I am host1
2020-01-22 13:35:35,090 INFO: no action. i am the leader with the lock
...
2020-01-22 13:37:05,082 INFO: Lock owner: host1; I am host1
2020-01-22 13:37:05,091 INFO: no action. i am the leader with the lock
....
```