Patroni lost connection and restarted after restarting the etcd-Server #3118

ThuanHua · 2024-08-07T15:24:35Z

What happened?

While restarting an etcd server (etcd v3), Patroni failed to select a new etcd server and shut the database down until the etcd server was running again.

How can we reproduce it (as minimally and precisely as possible)?

I couldn't reproduce it consistently, but the best way is to have the primary and the secondary both select the same etcd server. The primary should be running multiple transactions. Then restart that etcd server (it does not need to be the etcd leader).

What did you expect to happen?

If the restarted etcd server is not the one Patroni selected: nothing.
If the restarted etcd server is the one Patroni selected: Patroni selects a new etcd server and retries.
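
The expected behaviour can be illustrated with a minimal sketch. It is not Patroni's implementation: `EndpointError`, `call_with_failover`, and `fake_request` are hypothetical names used only to show the idea of moving on to the next configured etcd host when the current one is unavailable.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


class EndpointError(Exception):
    """Hypothetical error: the request to one etcd endpoint failed."""


def call_with_failover(endpoints: Iterable[str],
                       request: Callable[[str], T]) -> T:
    """Try each endpoint in order and return the first successful result."""
    last_error = None
    for endpoint in endpoints:
        try:
            return request(endpoint)
        except EndpointError as exc:
            # This endpoint is unavailable (for example, it is restarting);
            # remember the error and move on to the next one.
            last_error = exc
    raise EndpointError(f"all endpoints failed, last error: {last_error}")


def fake_request(endpoint: str) -> str:
    """Stand-in for a real etcd call: pretend the first host is restarting."""
    if endpoint == "IP1:2379":
        raise EndpointError("connection refused")
    return "ok from " + endpoint


print(call_with_failover(["IP1:2379", "IP2:2379", "IP3:2379"], fake_request))
# prints "ok from IP2:2379"
```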

Patroni/PostgreSQL/DCS version

Patroni version: 3.2.2 and 3.3.0
PostgreSQL version: 13 and 16
DCS (and its version): etcd 3.3.25 API 3.3

Patroni configuration file

etcd3:
  hosts: IP1:2379,IP2:2379,IP3:2379,IP4:2379,IP5:2379,IP6:2379,IP7:2379,IP8:2379,IP9:2379
  protocol:
  username: username
  password: password

restapi:
  listen: "IP10:8008"
  connect_address: "IP11:8008"
  certfile: PATH-TO-CERT
  keyfile: PATH-TO-KEYFILE
  authentication:
        authentication:
        username: username
        password: password


bootstrap:
  # this section will be written into Etcd:/<namespace>/<scope>/config after initializing new cluster
  # and all other cluster members will use it as a `global configuration`
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    master_start_timeout: 30

    postgresql:
      use_pg_rewind: true
      remove_data_directory_on_rewind_failure: false
      remove_data_directory_on_diverged_timelines: false
      use_slots: false
      # The following parameters are given as command line options
      # overriding the settings in postgresql.conf.
      parameters:
        archive_command: PATH-TO-COMMAND
        max_connections: 1100
        wal_level: logical
        hot_standby: "on"
        max_wal_senders: 50
        max_replication_slots: 50
        max_worker_processes: 100
        wal_log_hints: "on"
        unix_socket_directories: '/var/run/postgresql/'
      recovery_conf:
        restore_command: PATH-TO-COMMAND
  # Some possibly desired options for 'initdb'
  initdb:  # Note: It needs to be a list (some options need values, others are switches)
    - encoding: UTF8
    - data-checksums
#  # Additional users to be created after initializing the cluster
  users:
    username: username
      password: password
      options:
        - replication
        - login

postgresql:
  # Custom clone method
  # The options --scope= and --datadir= are passed to the custom script by
  # patroni and passed on to pg_createcluster by pg_clonecluster_patroni
  create_replica_method:
    - pgbackrest
    - pg_clonecluster
  pgbackrest:
    command: PATH-TO-COMMAND
    keep_data: True
    no_params: True
  pg_clonecluster:
    command: PATH-TO-COMMAND

  listen: "IP10:5433"
  connect_address: "IP11:5433"
  use_unix_socket: true
  data_dir: /var/lib/postgresql/13/main
  bin_dir: /usr/lib/postgresql/13/bin
  config_dir: /etc/postgresql/13/main
  pgpass: /var/lib/postgresql/13-main.pgpass
  use_pg_rewind: true
  remove_data_directory_on_rewind_failure: false
  remove_data_directory_on_diverged_timelines: false
  use_slots: false
  parameters:
    archive_command: PATH-TO-COMMAND
    max_connections: 1100
    wal_level: logical
    hot_standby: "on"
    max_wal_senders: 50
    max_replication_slots: 50
    max_worker_processes: 100
    wal_log_hints: "on"
    unix_socket_directories: '/var/run/postgresql/'
  recovery_conf:
    restore_command: PATH-TO-COMMAND
  authentication:
    replication:
      username: username
      password: password
    rewind:
      username: username
      password: password
# A superuser role is required in order for Patroni to manage the local
# Postgres instance.  If the option `use_unix_socket' is set to `true', then
# specifying an empty password results in no md5 password for the superuser
# being set and sockets being used for authentication. The `password:' line is
# nevertheless required.  Note that pg_rewind will not work if no md5 password
# is set.
    superuser:
      username: username
      password: password
# uncomment the below and reload Patroni to set tags (will be commented again by Puppet!)
#tags:
  #nofailover: false
  #clonefrom: false

patronictl show-config

postgresql:
  parameters:
    archive_command: PATH-TO-COMMAND
    hot_standby: 'on'
    max_connections: 1100
    max_replication_slots: 50
    max_wal_senders: 50
    max_worker_processes: 100
    unix_socket_directories: /var/run/postgresql/
    wal_level: logical
    wal_log_hints: 'on'
  recovery_conf:
    restore_command: PATH-TO-COMMAND
  remove_data_directory_on_diverged_timelines: false
  remove_data_directory_on_rewind_failure: false
  use_pg_rewind: true
  use_slots: false
retry_timeout: 10
ttl: 30

Patroni log files

Traceback (most recent call last):
    File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 523, in _run_and_handle_exceptions
      return retry(method, *args, **kwargs) if retry else method(*args, **kwargs)
    File "/usr/lib/python3/dist-packages/patroni/dcs/etcd3.py", line 904, in _retry
      return retry(*args, **kwargs)
    File "/usr/lib/python3/dist-packages/patroni/utils.py", line 613, in __call__
      return func(*args, **kwargs)
    File "/usr/lib/python3/dist-packages/patroni/dcs/etcd3.py", line 707, in _do_refresh_lease
      if self._lease and not self._client.lease_keepalive(self._lease, retry=retry):
    File "/usr/lib/python3/dist-packages/patroni/dcs/etcd3.py", line 370, in lease_keepalive
      return self.call_rpc('/lease/keepalive', {'ID': ID}, retry).get('result', {}).get('TTL')
    File "/usr/lib/python3/dist-packages/patroni/dcs/etcd3.py", line 634, in call_rpc
      ret = super(PatroniEtcd3Client, self).call_rpc(method, fields, retry)
    File "/usr/lib/python3/dist-packages/patroni/dcs/etcd3.py", line 294, in call_rpc
      return self.api_execute(self.version_prefix + method, self._MPOST, fields)
    File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 283, in api_execute
      return self._handle_server_response(response)
    File "/usr/lib/python3/dist-packages/patroni/dcs/etcd3.py", line 252, in _handle_server_response
      raise _raise_for_data(ret or data, response.status)
  patroni.dcs.etcd3.Unknown: <Unknown error: 'OK: HTTP status code 200; transport: missing content-type field', code: 2>
  2024-08-05 16:59:55,978 ERROR: failed to update leader lock
  2024-08-05 16:59:55,978 INFO: Demoting self (immediate-nolock)
  2024-08-05 17:00:01,043 ERROR: Invalid auth token: njeBILsrHVvoFTac.15039577
  2024-08-05 17:00:01,043 ERROR:
  Traceback (most recent call last):
    File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 511, in handle_etcd_exceptions
      retval = func(self, *args, **kwargs)
    File "/usr/lib/python3/dist-packages/patroni/dcs/etcd3.py", line 833, in touch_member
      return bool(self._client.put(self.member_path, value, self._lease))
    File "/usr/lib/python3/dist-packages/patroni/dcs/etcd3.py", line 208, in wrapper
      return self.handle_auth_errors(func, *args, **kwargs)
    File "/usr/lib/python3/dist-packages/patroni/dcs/etcd3.py", line 352, in handle_auth_errors
      raise exc
    File "/usr/lib/python3/dist-packages/patroni/dcs/etcd3.py", line 334, in handle_auth_errors
      return func(self, *args, retry=retry, **kwargs)
    File "/usr/lib/python3/dist-packages/patroni/dcs/etcd3.py", line 392, in put
      return self.call_rpc('/kv/put', fields, retry)
    File "/usr/lib/python3/dist-packages/patroni/dcs/etcd3.py", line 634, in call_rpc
      ret = super(PatroniEtcd3Client, self).call_rpc(method, fields, retry)
    File "/usr/lib/python3/dist-packages/patroni/dcs/etcd3.py", line 294, in call_rpc
      return self.api_execute(self.version_prefix + method, self._MPOST, fields)
    File "/usr/lib/python3/dist-packages/patroni/dcs/etcd.py", line 283, in api_execute
      return self._handle_server_response(response)
    File "/usr/lib/python3/dist-packages/patroni/dcs/etcd3.py", line 252, in _handle_server_response
      raise _raise_for_data(ret or data, response.status)
  patroni.dcs.etcd3.InvalidAuthToken: <InvalidAuthToken error: 'etcdserver: invalid auth token', code: 16>
  2024-08-05 17:00:01,044 INFO: demoted self because failed to update leader lock in DCS
  2024-08-05 17:00:01,045 INFO: closed patroni connections to postgres
  2024-08-05 17:00:01,277 INFO: postmaster pid=2652857
  /var/run/postgresql/:5433 - no response
  2024-08-05 17:00:01,593 INFO: establishing a new patroni heartbeat connection to postgres
  2024-08-05 17:00:01,599 INFO: establishing a new patroni heartbeat connection to postgres

PostgreSQL log files

Nothing relevant, only that the database was shut down and started up again.

Have you tried to use GitHub issue search?

Yes

Anything else we need to know?

No response

CyberDem0n · 2024-08-08T09:57:37Z

patroni.dcs.etcd3.Unknown: <Unknown error: 'OK: HTTP status code 200; transport: missing content-type field', code: 2>

This seems to be the culprit, and maybe we can retry in this case, but etcd 3.3.25 is quite old (almost 4 years), and I am pretty sure there have been plenty of bugfixes since then. Therefore, I would advise you to first upgrade to v3.5.15 (the latest stable version) and check whether it helps.
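
To confirm which version a member is actually running before and after an upgrade, etcd members answer `GET /version` with a small JSON document. The sketch below only parses such a document; the sample string is an assumed response shape built from the version reported in this issue, and `parse_etcd_version` and `is_older_than` are hypothetical helper names.

```python
import json
from typing import Tuple


def parse_etcd_version(version_json: str) -> Tuple[int, ...]:
    """Extract the server version as a tuple, e.g. '3.3.25' -> (3, 3, 25)."""
    server = json.loads(version_json)["etcdserver"]
    # Drop any pre-release suffix such as '-rc.1' before splitting.
    core = server.split("-", 1)[0]
    return tuple(int(part) for part in core.split("."))


def is_older_than(version_json: str, minimum: Tuple[int, ...]) -> bool:
    """Return True if the reported server version is below `minimum`."""
    return parse_etcd_version(version_json) < minimum


# Assumed sample of what a 3.3.25 member would report.
sample = '{"etcdserver": "3.3.25", "etcdcluster": "3.3.0"}'
print(is_older_than(sample, (3, 5, 15)))  # prints True
```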

ThuanHua · 2024-08-26T11:58:14Z

Sorry for the delayed response, I was on vacation. I'm using the newest version of etcd that is available in Ubuntu 22.04. It will take some time until I switch to Ubuntu 24.04, which ships 3.4.30. The error "patroni.dcs.etcd3.Unknown: <Unknown error: 'OK: HTTP status code 200; transport: missing content-type field', code: 2>" comes from gRPC. I will check whether a newer etcd version fixes this problem (though I might not be able to use it in production). But I do think it makes sense to check that the response is valid before executing handle_server_response.

CyberDem0n · 2024-08-26T12:05:01Z

I'm using the newest version of etcd that is available in Ubuntu 22.04. It will take some time until I switch to Ubuntu 24.04, which ships 3.4.30.

I am sorry, but this is a really bad excuse. You can always download etcd binaries from GitHub and run them as an unprivileged user.

But I do think it makes sense to check that the response is valid before executing handle_server_response.

Well, the HTTP status code is 200, which indicates success. It is not really clear what exactly etcd sends in the actual response...
Without knowing the exact response, it is not even possible to add an exception.
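
For illustration only, here is a sketch of what such a check could look like once the response is known. It is not Patroni code. It assumes the decoded error payload carries `error` and `code` fields, which is inferred solely from the message and `code: 2` shown in the log above; `looks_transient` and `TRANSIENT_MARKERS` are hypothetical names.

```python
from typing import Any, Mapping

GRPC_UNKNOWN = 2  # gRPC status code UNKNOWN, matching "code: 2" in the log
# Message fragments treated as transient; only the one from this issue is listed.
TRANSIENT_MARKERS = ("transport: missing content-type field",)


def looks_transient(payload: Mapping[str, Any]) -> bool:
    """Return True if an error payload matches the failure seen in this issue."""
    if payload.get("code") != GRPC_UNKNOWN:
        return False
    message = str(payload.get("error", ""))
    return any(marker in message for marker in TRANSIENT_MARKERS)


# Payloads rebuilt from the two errors in the log (assumed shape).
unknown = {"error": "OK: HTTP status code 200; transport: missing content-type field",
           "code": 2}
bad_token = {"error": "etcdserver: invalid auth token", "code": 16}

print(looks_transient(unknown))    # prints True
print(looks_transient(bad_token))  # prints False
```

A caller could retry the request against another etcd host when `looks_transient` returns True instead of giving up on the leader lock, but whether that is safe depends on what etcd really sends, which is still unknown here.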

ThuanHua added the bug label Aug 7, 2024
