Skip to content

[pfsync multiqueue] tcp connections dropped when pfsync is active and multiqueue is enabled #8059

Closed as not planned
@fhloston

Description

@fhloston

Important notices

Before you add a new report, we ask you kindly to acknowledge the following:

Describe the bug

When HA is configured and pfsync is active a low precentage (2-4%) of TCP connections are dropped.

Tip: to validate your setup was working with the previous version, use opnsense-revert (https://docs.opnsense.org/manual/opnsense_tools.html#opnsense-revert)

To Reproduce

Steps to reproduce the behavior:

  1. Install 2 opnsense 24.7.8 firewalls with static WAN and LAN ip.
  2. Configure WAN and LAN CARP ip.
  3. Configure NAT to outbound nat the LAN range to CARP ip.
  4. Turn on pfsync on both on LAN interface (for simplicity the extra interface for pfsync has been omitted)
  5. Install client in LAN
  6. Run reproducer on client, adapt -m timeout parameter well above usual download time. local download target is recommended.
    fail=0;success=0;while :;do curl -o /dev/null -m 30 https://fsn1-speed.hetzner.com/1GB.bin && echo "success $((++success)) fail $fail" || echo "success $success fail $((++fail))";done
  7. Observe fail count. in my case it looks like this after some hours:
    "success 12539 fail 514"

When the issue occurs, curl's current download rate drops to 0, no more packets are recieved.
Assumption: the TCP state is removed and all packets are dropped.

Expected behavior

"success 12539 fail 0"

No dropped TCP connections. When pfsync is disabled, no connections are dropped.

Describe alternatives you considered

Sync compatibility 24.1 or 24.7 does not change anything, neither does multicast vs. unicast pfsync change anything. Only disabling pfsync or setting the unicast ips to other ips than the master or backup.

I recreated the same setup with voldemort 2.7.2 to rule out general pfsync issues or issues from environment/load. This setup does not show the issue on same machine type and virtualization environment, even on the same hypervisor and bridge over several hours.

Additional context

The issue is probably there for several months if not years. We have had issues especially with docker layer updates stalling. We could never pinpoint this until now. This also goes away when turning off pfsync.

Environment

Software version used and hardware type if relevant, e.g.:

OPNsense 24.7.8
Proxmox VE 8.2.7

bios: ovmf
boot: order=scsi0
cores: 2
cpu: host
efidisk0: nvme:vm-2180-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
hotplug: disk,network,usb,cpu
machine: q35
memory: 4096
name: fw01-debug-sense
net0: virtio=BC:24:11:0B:83:7E,bridge=vmbr0,queues=4
net1: virtio=BC:24:11:7E:BE:DA,bridge=abc1001,queues=4,tag=123
numa: 1
ostype: other
scsi0: nvme:vm-2180-disk-1,discard=on,iothread=1,size=18G,ssd=1
scsihw: virtio-scsi-single
serial0: socket
smbios1: uuid=1fdfcdb9-c5f3-468e-a066-16a705ac3d98
sockets: 1
vga: qxl
vmgenid: b6bee8e4-09d7-4d31-b2d4-96043b2d8069```

Metadata

Metadata

Assignees

No one assigned

    Labels

    help wantedContributor missing / timeoutsupportCommunity support

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions