
Failover when the postmaster is hung #1371

Open
kmsarabu opened this issue Jan 22, 2020 · 5 comments

kmsarabu commented Jan 22, 2020

We are looking into replacing Repmgr with Patroni and testing various failover scenarios.

It seems that Patroni currently keeps a persistent connection to the master database. If for any reason the postmaster hangs, it blocks all new connection requests to the master database. Patroni appears unable to identify this scenario and does not trigger a failover.

We also tested the same scenario with repmgr, which was able to identify this case and fail over the database. Instead of keeping a persistent connection, repmgr uses the parameters reconnect_interval and reconnect_attempts, and fails over the database if it is unable to connect to the master within the configured threshold.
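For illustration, a minimal sketch of such a reconnect-based probe (not repmgr's actual code), using psycopg2; the host and port follow the test case below, while trigger_failover() and the thresholds are hypothetical placeholders:

```python
# Sketch of a repmgr-style liveness probe: open a fresh connection each
# interval and fail over after N consecutive failures. Illustrative only.
import time
import psycopg2

RECONNECT_INTERVAL = 10   # seconds between probes (like repmgr reconnect_interval)
RECONNECT_ATTEMPTS = 6    # consecutive failures before failover (reconnect_attempts)
CONNECT_TIMEOUT = 5       # per-attempt connection timeout in seconds

def primary_accepts_connections():
    try:
        conn = psycopg2.connect(host="host1", port=4345, dbname="postgres",
                                user="krishna", connect_timeout=CONNECT_TIMEOUT)
        conn.close()
        return True
    except psycopg2.OperationalError:
        return False

def trigger_failover():
    print("primary unreachable, initiating failover")  # hypothetical placeholder

failures = 0
while True:
    if primary_accepts_connections():
        failures = 0
    else:
        failures += 1
        if failures >= RECONNECT_ATTEMPTS:
            trigger_failover()
            break
    time.sleep(RECONNECT_INTERVAL)
```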

Test case:

[host1] : ps -eaf |egrep postgres|egrep config |egrep listen_address
postgres 193623 1 0 17:13 ? 00:00:00 /postgres/product/server/10.3.2/bin/postgres -D /postgres/data001/pgrept1/sysdata --config-file=/postgres/data001/pgrept1/sysdata/postgresql.conf --listen_addresses=host1 --port=4345 --cluster_name=pgrept1 --wal_level=logical --hot_standby=on --max_connections=1000 --max_wal_senders=10 --max_prepared_transactions=0 --max_locks_per_transaction=64 --track_commit_timestamp=off --max_replication_slots=10 --max_worker_processes=8 --wal_log_hints=on

[host1] : kill -STOP 193623

[host1] : psql -h host1 -p 4345 -U krishna
HUNG ..

From the patroni.log:
2020-01-22 13:35:35,082 INFO: Lock owner: host1; I am host1
2020-01-22 13:35:35,090 INFO: no action. i am the leader with the lock
...
2020-01-22 13:37:05,082 INFO: Lock owner: host1; I am host1
2020-01-22 13:37:05,091 INFO: no action. i am the leader with the lock
....

@CyberDem0n (Member)

Frankly speaking, I have never seen a situation where the postmaster hung so badly that it was not able to respond at all, but I have seen plenty of situations where all connection slots are occupied and therefore a new connection is rejected.
Opening a new connection is a very expensive operation. If the system is under stress it can easily take seconds. Doing a failover in such a case is, IMO, a bit insane, because:

  1. First of all, you will have to fence the old primary. Depending on how you do the fencing, it might result in some data loss.
  2. Clients that are currently connected would not be happy.
  3. It is highly likely that the new primary would become even more stressed than the old one.

@kmsarabu (Author)

Thanks for the quick response and for looking into this issue.

Unfortunately, we have run into this issue on both Oracle and PostgreSQL databases.

Agreed that spawning connections frequently is an expensive operation. However, it can be justifiable for highly available services, provided the interval is set small enough to allow timely detection and large enough to limit the associated overhead.

Agreed that the master should be fenced in such a scenario, and the setup should also ensure that there is no data loss during failover.

Oracle has a similar feature: back in version 11.2.0.4 it introduced ObserverReconnect, a Data Guard observer parameter, which allows such a scenario to be detected and a failover performed. As part of the failover, Oracle Data Guard takes the former primary offline to fence it, and synchronous replication is used to prevent data loss. When the parameter is set to 0, the observer uses a persistent connection.

Could something similar be done in PostgreSQL using Patroni? In the case of PG, the postmaster (or the database) was hung and blocked new incoming connections for an extended period of time, leading to loss of service.

  1. The former primary should be demoted, and the cluster should go through the election process to elect a new leader, similar to what Oracle does.
  2. There is already a loss of service if the database is hung: it will not be able to process new connection requests or requests from existing connections. Clients will be unhappy anyway in this case.

Please share your thoughts.

@CyberDem0n (Member)

Unfortunately, we have run into this issue on both Oracle and PostgreSQL databases.

I can't say anything about Oracle... We are running hundreds of Postgres clusters; statistically we would have already hit such a problem, but in fact we never have.

Agreed that the master should be fenced in such a scenario, and the setup should also ensure that there is no data loss during failover.

If the postmaster is really hanging, a clean shutdown (especially in a timely manner) is barely possible, therefore you may lose some data.

In the case of PG, the postmaster (or the database) was hung and blocked new incoming connections for an extended period of time, leading to loss of service.

As I already explained, there are different reasons why a connection might not be opened. The most common one is that all connection slots are occupied; we have seen that a lot, but we have really never observed a hanging postmaster.

There is already a loss of service if the database is hung: it will not be able to process new connection requests or requests from existing connections. Clients will be unhappy anyway in this case.

How is that possible? Patroni regularly runs queries via the existing connection, therefore it is in the same position as other connected clients.

Please share your thoughts.

OK, let's assume we do a failover in such a situation; it is highly likely that the new primary will sooner or later hit the same issue (usually it happens quickly). In the end you will get a chain of failovers, which would be even worse.

If you really see the problem of a hanging postmaster, it should be investigated/debugged and reported as a Postgres bug.

kmsarabu (Author) commented Jan 22, 2020

If the postmaster is really hanging, a clean shutdown (especially in a timely manner) is barely possible, therefore you may lose some data.

A setup utilizing synchronous replication would help overcome this issue (e.g. synchronous_standby_names set to "[FIRST|ANY] NUM_SYNC (standby db list)" and synchronous_commit set to remote_apply). Transactions on the former master basically hang waiting for an ack as the synchronous standbys start following the newly elected leader. Repmgr does this in a controlled way: it cancels the WAL receiver process on all standby nodes during failover so they no longer receive changes from the master. This ensures zero data loss, because the master is configured with synchronous replication using the above settings. The remaining standby nodes go through leader election, a leader is elected, and the remaining nodes follow the new leader.
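For concreteness, a minimal sketch of applying those settings on the primary with psycopg2 (the connection details and standby names are hypothetical; adjust to your topology):

```python
# Sketch only: apply the synchronous replication settings mentioned above
# on the primary and verify which standbys are currently synchronous.
import psycopg2

conn = psycopg2.connect(host="host1", port=4345, dbname="postgres", user="postgres")
conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
with conn.cursor() as cur:
    # Require an acknowledgement from at least one of the listed standbys.
    cur.execute("ALTER SYSTEM SET synchronous_standby_names = 'ANY 1 (standby1, standby2)'")
    # Commits wait until the standby has applied the WAL, not just received it.
    cur.execute("ALTER SYSTEM SET synchronous_commit = 'remote_apply'")
    cur.execute("SELECT pg_reload_conf()")  # both settings are reloadable
    cur.execute("SELECT application_name, sync_state FROM pg_stat_replication")
    for name, state in cur.fetchall():
        print(name, state)
conn.close()
```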

How is that possible? Patroni regularly runs queries via the existing connection, therefore it is in the same position as other connected clients.

I gave a scenario where only the postmaster was hung. I will be testing various scenarios, including a hung database. I just tested with Patroni's database connection hung (to simulate a hung database) on the master, and no failover action was taken. It looks like it doesn't use any statement_timeout, which is probably making the Patroni process wait on the query call forever.

[host1] : ps -eaf |egrep pgrept1|egrep patroni|egrep "postgres:"
postgres  44282  29922  0 13:59 ?        00:00:00 postgres: pgrept1: patroni postgres xx.xxx.xxx.xx(51856) idle   

[host1] : kill -STOP 44282
[host1] : tail -f patroni.log
...
2020-01-22 16:03:05,362 INFO: Lock owner: host1; I am host1
<No update after this>

If you really see the problem of a hanging postmaster, it should be investigated/debugged and reported as a Postgres bug.

Yes, of course. That's our standard process: we go through a postmortem to find out the cause and get it fixed (with vendors, if the issue pertains to their code).

OK, let's assume we do a failover in such a situation; it is highly likely that the new primary will sooner or later hit the same issue (usually it happens quickly). In the end you will get a chain of failovers, which would be even worse.

The chain of failovers (ping-ponging back and forth) can probably be prevented by allowing only x consecutive failovers within a certain time window. The underlying issue that caused the failover is probably limited to that one machine and may not be carried over with the failover.
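As a rough illustration of that idea (purely hypothetical; Patroni does not ship such a limiter), a sliding-window guard could look like this:

```python
# Sketch of the "x failovers per time window" idea; names and thresholds
# are made up for illustration.
import time
from collections import deque

class FailoverLimiter:
    """Allow at most max_failovers within window_seconds."""

    def __init__(self, max_failovers=2, window_seconds=3600):
        self.max_failovers = max_failovers
        self.window_seconds = window_seconds
        self.history = deque()

    def allow_failover(self, now=None):
        if now is None:
            now = time.monotonic()
        # Drop failovers that fell out of the sliding window.
        while self.history and now - self.history[0] > self.window_seconds:
            self.history.popleft()
        if len(self.history) >= self.max_failovers:
            return False  # too many recent failovers; refuse to ping-pong
        self.history.append(now)
        return True
```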

The point I am trying to make is about database high availability and service continuity across various failure scenarios.

@CyberDem0n (Member)

I gave a scenario where only the postmaster was hung

Sorry, but no. When you send SIGSTOP you just shoot yourself in the foot, and it hurts badly.

It looks like it doesn't use any statement_timeout

You haven't looked into the source code, have you? JFYI, statement_timeout is handled by Postgres!

which is probably making the Patroni process wait on the query call forever.

This is not a statement_timeout problem. Depending on how Patroni is connected to Postgres, it could either be fixed by enabling TCP keepalives, or not (in the case of Unix-domain sockets).
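For reference, the libpq keepalive parameters can be passed on a TCP connection, for example via psycopg2; the values below are illustrative only:

```python
# Sketch: enable TCP keepalives so an unresponsive peer is detected at the
# socket level. Only effective for TCP connections, not Unix-domain sockets.
import psycopg2

conn = psycopg2.connect(
    host="host1",           # must be a TCP host, not a socket directory
    port=4345,
    dbname="postgres",
    user="postgres",
    keepalives=1,            # turn keepalives on
    keepalives_idle=10,      # seconds of idleness before the first probe
    keepalives_interval=5,   # seconds between probes
    keepalives_count=3,      # unanswered probes before the connection is dropped
)
```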

Yes, of course. That's our standard process: we go through a postmortem to find out the cause

And? What are the results of your investigation? Have you found the reason for the hanging postmaster?

The underlying issue that caused the failover is probably limited to that one machine and may not be carried over with the failover.

The keyword here is "probably". But my experience clearly shows that such issues are usually related to the workload and persist after a failover.

The point I am trying to make is about database high availability and service continuity across various failure scenarios.

Doing unnecessary failovers improves neither availability nor service continuity.

P.S. If you would really like to have this feature, go ahead and implement it; we will happily review and merge the pull request.

CyberDem0n pushed a commit that referenced this issue Apr 15, 2020
## Feature: Postgres stop timeout

A switchover/failover operation hangs on the signal_stop (or checkpoint) call when the postmaster doesn't respond or hangs for some reason (issue described in [1371](#1371)). This leads to loss of service for an extended period of time, until the hung postmaster starts responding or is killed by some other actor.

### master_stop_timeout

The number of seconds Patroni is allowed to wait when stopping Postgres; effective only when synchronous_mode is enabled. When set to > 0 and synchronous_mode is enabled, Patroni sends SIGKILL to the postmaster if the stop operation runs for longer than the value of master_stop_timeout. Set the value according to your durability/availability tradeoff. If the parameter is not set or is set to <= 0, master_stop_timeout does not apply.
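As a simplified illustration of that escalation (not Patroni's actual implementation; the PID, signal handling, and timings are assumptions), the logic is roughly:

```python
# Sketch of the stop-timeout idea: ask the postmaster to stop, and escalate
# to SIGKILL if it has not exited within master_stop_timeout seconds.
import os
import signal
import time

def stop_postmaster(pid, master_stop_timeout=30, poll_interval=1):
    os.kill(pid, signal.SIGINT)           # ask for a fast shutdown (pg_ctl stop -m fast)
    deadline = time.monotonic() + master_stop_timeout
    while time.monotonic() < deadline:
        try:
            os.kill(pid, 0)               # probe: is the postmaster still alive?
        except ProcessLookupError:
            return True                   # it exited cleanly within the timeout
        time.sleep(poll_interval)
    os.kill(pid, signal.SIGKILL)          # hung postmaster: force it down
    return False
```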