Description
Important notices
Before you add a new report, we ask you kindly to acknowledge the following:
- I have read the contributing guidelines at https://github.com/opnsense/core/blob/master/CONTRIBUTING.md
- I am convinced that my issue is new after having checked both open and closed issues at https://github.com/opnsense/core/issues?q=is%3Aissue
Describe the bug
When HA is configured and pfsync is active, a small percentage (2-4%) of TCP connections is dropped.
Tip: to validate your setup was working with the previous version, use opnsense-revert (https://docs.opnsense.org/manual/opnsense_tools.html#opnsense-revert)
To Reproduce
Steps to reproduce the behavior:
- Install two OPNsense 24.7.8 firewalls with static WAN and LAN IPs.
- Configure WAN and LAN CARP IPs.
- Configure outbound NAT to translate the LAN range to the CARP IP.
- Turn on pfsync on both firewalls on the LAN interface (for simplicity, the dedicated pfsync interface has been omitted; see the interface-level sketch after this list).
- Install a client in the LAN.
- Run the reproducer on the client; set the -m timeout parameter well above the usual download time. A local download target is recommended.
```
fail=0; success=0
while :; do
  curl -o /dev/null -m 30 https://fsn1-speed.hetzner.com/1GB.bin \
    && echo "success $((++success)) fail $fail" \
    || echo "success $success fail $((++fail))"
done
```
- Observe the fail count. In my case it looks like this after some hours:
"success 12539 fail 514"
When the issue occurs, curl's download rate drops to 0 and no more packets are received.
Assumption: the TCP state is removed and all subsequent packets are dropped.
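One way to test this assumption is to watch the firewall's state table while the reproducer runs: if a stalled download coincides with its state entry disappearing, the state was indeed removed. A minimal sketch, assuming the client's LAN address is 192.168.1.100 (hypothetical) and the download target uses HTTPS:

```
# On the master firewall: print the client's TCP states once per second.
# If a stalled transfer coincides with its :443 state vanishing from this
# output, the state was removed as assumed.
while :; do
  date
  pfctl -ss | grep 192.168.1.100 | grep ':443'
  sleep 1
done

# Separately, the pfsync messages themselves can be observed with:
#   tcpdump -s1500 -vvi pfsync0
```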
Expected behavior
"success 12539 fail 0"
No dropped TCP connections. When pfsync is disabled, no connections are dropped.
Describe alternatives you considered
Setting the sync compatibility to 24.1 or 24.7 does not change anything, and neither does switching between multicast and unicast pfsync. Only disabling pfsync, or pointing the unicast IPs at addresses other than the master or backup, makes the problem go away.
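For completeness, the multicast/unicast distinction maps to the presence of a syncpeer on the pfsync interface; a quick interface-level sketch (peer address and interface name are hypothetical):

```
# Unicast variant: sync only to the named peer instead of multicasting.
ifconfig pfsync0 syncdev vtnet1 syncpeer 192.168.1.2 up

# Verify which mode is active; syncpeer only appears in unicast mode.
ifconfig pfsync0
```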
I recreated the same setup with voldemort 2.7.2 to rule out general pfsync issues or issues caused by environment/load. That setup does not show the issue on the same machine type and virtualization environment, even on the same hypervisor and bridge, over several hours.
Additional context
The issue has probably been present for several months, if not years. We have especially had issues with Docker layer updates stalling, which we could never pinpoint until now. Those stalls also go away when pfsync is turned off.
Environment
Software version used and hardware type if relevant, e.g.:
OPNsense 24.7.8
Proxmox VE 8.2.7
```
bios: ovmf
boot: order=scsi0
cores: 2
cpu: host
efidisk0: nvme:vm-2180-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
hotplug: disk,network,usb,cpu
machine: q35
memory: 4096
name: fw01-debug-sense
net0: virtio=BC:24:11:0B:83:7E,bridge=vmbr0,queues=4
net1: virtio=BC:24:11:7E:BE:DA,bridge=abc1001,queues=4,tag=123
numa: 1
ostype: other
scsi0: nvme:vm-2180-disk-1,discard=on,iothread=1,size=18G,ssd=1
scsihw: virtio-scsi-single
serial0: socket
smbios1: uuid=1fdfcdb9-c5f3-468e-a066-16a705ac3d98
sockets: 1
vga: qxl
vmgenid: b6bee8e4-09d7-4d31-b2d4-96043b2d8069
```