#!/usr/local/bin/php
<?php
require_once("config.inc");
require_once("interfaces.inc");
require_once("util.inc");
$subsystem = !empty($argv[1]) ? $argv[1] : '';
$type = !empty($argv[2]) ? $argv[2] : '';
if ($type != 'MASTER' && $type != 'BACKUP') {
    log_error("Carp '$type' event unknown from source '{$subsystem}'");
    exit(1);
}
if (!strstr($subsystem, '@')) {
    log_error("Carp '$type' event triggered from wrong source '{$subsystem}'");
    exit(1);
}
$ifkey = 'wan';
if ($type === "MASTER") {
    log_error("enable interface '$ifkey' due CARP event '$type'");
    $config['interfaces'][$ifkey]['enable'] = '1';
    write_config("enable interface '$ifkey' due CARP event '$type'", false);
    interface_configure(false, $ifkey, false, false);
} else {
    log_error("disable interface '$ifkey' due CARP event '$type'");
    unset($config['interfaces'][$ifkey]['enable']);
    write_config("disable interface '$ifkey' due CARP event '$type'", false);
    interface_configure(false, $ifkey, false, false);
}
The script should also ignore the third state, INIT, as I keep seeing it cause a failover even though the state itself is harmless.
Line 11 can be changed to
if ($type != 'MASTER' && $type != 'BACKUP' && $type != 'INIT') {
and line 28 can be changed to
} else if ($type === "BACKUP") {
so that INIT is accepted as a known event but is otherwise ignored.
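A minimal sketch of that decision as a standalone function, in case it helps to see it in one place. The function name carp_action and the returned action strings are hypothetical and not part of spali's script; the only things taken from the thread are the three event types.

```php
<?php
// Hypothetical helper (not in the original script): map a CARP event type
// to what the hook should do with the WAN interface.
function carp_action(string $type): string
{
    switch ($type) {
        case 'MASTER':
            return 'enable';   // this node took over: bring WAN up
        case 'BACKUP':
            return 'disable';  // this node stepped down: take WAN down
        case 'INIT':
            return 'ignore';   // known event, but do nothing
        default:
            return 'unknown';  // the script logs an error and exits here
    }
}
```

With this shape, INIT never reaches the disable branch, which is the effect of the two line changes suggested above.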
Thank you @spali for this script, simple and efficient without frills.
As of today, with OPNsense 24.7.4, the script works perfectly except for one thing: once the backup has been promoted to master and then demoted to backup again, the default route (System -> Routes -> Status) is not set back to the LAN VIP as it was initially, as @skl283 stated.
That prevents the backup from having internet access while it is in the backup role.
To fix this, make the following changes:
- After $ifkey = 'wan', add $lan_vip = 'YOUR_LAN_VIP' and set it to your LAN CARP VIP.
- After interface_configure in the BACKUP section, add both:
exec('/sbin/route del default >&1', $ifc, $ret);
exec('/sbin/route add default ' . $lan_vip . ' >&1', $ifc, $ret);
- At the end of the script, add the missing "?>"
- Add the suggestions provided by @edward-scroop in the post previous to mine (!= INIT and else BACKUP).
NOTE: The 4th step in spali's instructions is no longer optional. A WAN-to-LAN gateway is required.
This is it :)
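For anyone adapting the route fix, here is a rough sketch of building the two route commands with the VIP validated and shell-escaped first. The function name default_route_commands is hypothetical, and the use of 2>&1 (so that error output is captured too) is my assumption about what the >&1 above was meant to do. The sketch only builds the command strings; it does not run them.

```php
<?php
// Hypothetical helper: return the two route(8) commands from the steps above.
// Throws if $lanVip is not a valid IP address (e.g. the placeholder was left in).
function default_route_commands(string $lanVip): array
{
    if (filter_var($lanVip, FILTER_VALIDATE_IP) === false) {
        throw new InvalidArgumentException("not an IP address: {$lanVip}");
    }
    return [
        // Assumption: 2>&1 so that exec() also captures error output.
        '/sbin/route del default 2>&1',
        '/sbin/route add default ' . escapeshellarg($lanVip) . ' 2>&1',
    ];
}
```

In the BACKUP branch, each returned string could then be passed to exec($cmd, $out, $ret) and a non-zero $ret logged, instead of discarding the result.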
I'm throwing this out here without much current knowledge, since my own script is abandoned, but one challenge I had to overcome was deciding when multiple interfaces count as "failed" so that the backup connection takes over. It may not be relevant now with the recent updates, but I'm putting it out there: https://gist.github.com/willjasen/6ae0f47bca36ced2bd52b2fefc2bc21e
Hi @raegedoc, are you sure that you use this gist? There is only an else case in lines 28 to 33, which is what gets used when the system is in the backup state... or are you using this gist? There is an explicit BACKUP section.
Perhaps you could post your version or fork this script?
Hi @skl283, I have tried them all over the past 2 weeks and none gave me back internet access on my backup node after it was promoted to primary and demoted back to backup again. Only these small additions fixed it, while keeping the script very light and clean.
I forgot to mention that I incorporated the suggestions @edward-scroop made in the post previous to mine: https://gist.github.com/spali/2da4f23e488219504b2ada12ac59a7dc?permalink_comment_id=5185710#gistcomment-5185710
Here is a link to my gist : https://gist.github.com/raegedoc/093ba815b6b3f2bc2ff327f48c60f3a9
Open to your ideas :)
@raegedoc do you have the gateway monitoring setup for the WAN gateway? Because I have it set up and when it switches back to master, it sets the priority of the backup WAN gateway to defunct which removes it from the route selection.
@edward-scroop, yes, I have gateway monitoring set for the WAN gateway on both primary and backup. The problem is not with my primary node switching back to master but with my backup node switching back to being a backup. This way, the backup has internet access for receiving its OPNsense updates and news announcements.
For clarity, here is my primary configuration for the WAN link when primary is primary and backup is backup:
...and my backup configuration. The blue arrows point to the fields where MY_CARP_LAN_VIP is specified.
From your screenshots, the monitor ip is empty and the disable gateway monitoring is checked. That would mean gateway monitoring is disabled.
I think what is happening is as your WAN gateway has a higher priority than the LAN gateway and with no gateway monitoring, the backup has no way to tell the WAN gateway is down and it then doesn't have a reason to swap to the LAN gateway.
To fix it either set the LAN gateway to a priority higher than the WAN gateway, or set a monitor ip of 1.1.1.1 and uncheck the disable gateway monitoring box.
Hi, WAN Gateway has priority 254 and WAN-to-LAN has 255 (so WAN > WAN-to-LAN).
Anyway, I tried your suggestion and it was worse: my backup has no internet access while it is backup. The default route still points to the WAN IP, which connects to nothing when the node is backup.
Interfaces: Diagnostics: Ping to 1.1.1.1 has 100% loss :(
Fixing the default gateway (route delete followed by route add with the CARP LAN IP) while the node is backup to a functional primary may have been the missing trick for my setup, which is pretty standard when the ISP provides only a public DHCP WAN IP.
I'll keep the setup I shared earlier. Thanks for sharing, edward-scroop.
The LAN gateway needs a priority higher than 254. The smaller the value, the higher the priority.
It's the case, LAN has priority 255
I meant, the LAN needs a priority of 1-253.
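To make the priority rule concrete, here is a toy sketch of "the smallest priority value wins among gateways considered up". It is not OPNsense's actual selection code; the function name pick_default_gateway, the gateway names, and the array layout are all made up for illustration.

```php
<?php
// Toy model only: pick the gateway with the lowest priority number among
// those currently considered up. Returns null if none is up.
function pick_default_gateway(array $gateways): ?string
{
    $best = null;
    foreach ($gateways as $name => $gw) {
        if (!$gw['up']) {
            continue; // a gateway marked down is never selected
        }
        if ($best === null || $gw['priority'] < $gateways[$best]['priority']) {
            $best = $name;
        }
    }
    return $best;
}

// WAN at 254 and LAN at 255: WAN wins as long as it is considered up.
$example = [
    'WAN_GW'     => ['priority' => 254, 'up' => true],
    'WAN_TO_LAN' => ['priority' => 255, 'up' => true],
];
```

In this model, with monitoring disabled the WAN gateway is never marked down, so the LAN gateway is only chosen if its priority value is lower than 254.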
I upgraded to 24.7.6 today, and our syshook.d scripts that call interface_configure() now appear to crash when an undefined function is eventually called (see my stack trace below). See my post on the OPNsense forums for the customizations I run: https://forum.opnsense.org/index.php?topic=20972.msg216770#msg216770, but I'd imagine spali's version is equally affected. I submitted a crash report, but did not create an issue on the OPNsense GitHub.
I believe we need to use a better-supported method to enable/disable interfaces in these syshook scripts. The 'interface' PHP functions seem to be in heavy development in 24.7, and many functions seem to be considered 'legacy' methods or are becoming deprecated. Or, perhaps this is just a bug.
As a workaround, if you don't want to roll back, you can comment out the $config line and the write_config and interface_configure calls, and use shell_exec("/sbin/ifconfig {$interface['if']} up"); and shell_exec("/sbin/ifconfig {$interface['if']} down"); instead, but this is less reliable and has other undesirable effects. For example, when only using interface up/down commands, the backup device needs its WAN interface left enabled. Under that condition, after a reboot you'll want to manually trigger a failover cycle to put the backup device's WAN interface in the "down" state, otherwise you'll have both interfaces up and enabled. Again, we need to find the best-supported way to enable/disable interfaces, and go from there.
[22-Oct-2024 13:17:14 America/New_York] PHP Fatal error: Uncaught Error: Call to undefined function system_routing_configure() in /usr/local/etc/inc/interfaces.inc:3777
Stack trace:
#0 /usr/local/etc/inc/interfaces.inc(2498): interfaces_restart_by_device(false, Array, false)
#1 /usr/local/etc/rc.syshook.d/carp/10-wancarp(24): interface_configure(false, 'opt3', false, false)
#2 {main}
thrown in /usr/local/etc/inc/interfaces.inc on line 3777
As a side note, others are having trouble with carp maintenance mode not working at all (not triggering a failover, as one would expect): opnsense/core#7877
Anyone find a fix for this issue yet?
i haven't tried it yet, but does this issue also occur at 24.7.8? @bitcoredotorg perhaps you tried the update?
I just upgraded to 24.7.8 (I was actually on 24.7.7 and it was working fine... as was it in 24.7.6). I run both my firewalls in Proxmox, so I took a backup snapshot before each upgrade, just in case. When the primary node came back up, the only thing I noticed was that it was pinned up in persistent carp maintenance mode.. I enabled and disabled and the backup failed right over to the primary. Only issue I still have is with Spectrum. For some reason, when I use a vlan on my managed switch (Juniper EX3400 POE), the Spectrum routinely fails to DHCP a new address (I have dhcp snooping and damn near everything else disabled in that vlan that could be interfering). For a goof, I grabbed an old gig switch from Netgear and plugged in the Spectrum primary/backup and circuit.. been fine for 4 months now. Fails over Spectrum with no issues.
Anyway... not seeing the problem in 24.7.8.
I am running 24.7.9_1, and I see the same error mentioned by bitcoredotorg.
I also tried the recent development branch as of this writing, and it is the same.
Implementing @bitcoredotorg 's fix seemed to work well enough, though I had to edit it slightly. The script with his workaround looks like this for me:
# NOTE: $interface is not defined in this excerpt; it has to map $ifkey to the
# real device name, otherwise ifconfig is called without an interface.
if ($type === "MASTER") {
    log_error("enable interface '$ifkey' due CARP event '$type'");
    $config['interfaces'][$ifkey]['enable'] = '1';
    write_config("enable interface '$ifkey' due CARP event '$type'", false);
    #interface_configure(false, $ifkey, false, false);
    shell_exec("/sbin/ifconfig {$interface[$ifkey]} up");
} else {
    log_error("disable interface '$ifkey' due CARP event '$type'");
    unset($config['interfaces'][$ifkey]['enable']);
    write_config("disable interface '$ifkey' due CARP event '$type'", false);
    #interface_configure(false, $ifkey, false, false);
    shell_exec("/sbin/ifconfig {$interface[$ifkey]} down");
}
error stack:
[01-Dec-2024 21:24:07 America/Chicago] PHP Fatal error: Uncaught Error: Call to undefined function system_routing_configure() in /usr/local/etc/inc/interfaces.inc:3777
Stack trace:
#0 /usr/local/etc/inc/interfaces.inc(2498): interfaces_restart_by_device(false, Array, false)
#1 /usr/local/etc/rc.syshook.d/carp/10-wancarp(28): interface_configure(false, 'wan', false, false)
#2 {main}
thrown in /usr/local/etc/inc/interfaces.inc on line 3777
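If you go the ifconfig route, a small guard around the command string may be worth having, since an empty or unexpected device name would otherwise be passed straight to the shell. This is only a sketch: the function name ifconfig_command is hypothetical, and the allowed character set for device names is my assumption.

```php
<?php
// Hypothetical helper: build the ifconfig up/down command for one device.
// Rejects empty or odd-looking names instead of passing them to the shell.
function ifconfig_command(string $device, bool $up): string
{
    // Assumption: device names consist of letters, digits, '_' and '.'.
    if (preg_match('/^[A-Za-z0-9_.]+$/', $device) !== 1) {
        throw new InvalidArgumentException("unexpected device name: '{$device}'");
    }
    return '/sbin/ifconfig ' . escapeshellarg($device) . ($up ? ' up' : ' down');
}
```

The result could be handed to shell_exec() in place of the interpolated strings, so a missing device name fails loudly instead of running ifconfig with no interface.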
I'm running OPNsense 24.7.10_2-amd64 and incorporated bits and pieces of the code posted here. The fix I found for the undefined function system_routing_configure() was to include system.inc in the script; after that, interface_configure can be used without crashing. I do have CARP event issues, but they are unrelated to this.
require_once("config.inc");
require_once("interfaces.inc");
require_once("util.inc");
// Ensure system_routing_configure is included
require_once("system.inc");
.
.
.
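A possible way to catch this kind of breakage before it turns into a fatal error is to check up front that the functions the hook relies on are actually defined after the require_once lines. This is just a sketch; the helper name missing_functions is hypothetical, and it only uses PHP's built-in function_exists().

```php
<?php
// Hypothetical helper: return the names from $names that are not defined
// as functions in the current PHP process.
function missing_functions(array $names): array
{
    return array_values(array_filter($names, function (string $name): bool {
        return !function_exists($name);
    }));
}
```

In the hook, the result of missing_functions(['interface_configure', 'system_routing_configure']) could be logged, and the script could exit early if it is not empty, rather than crashing inside interfaces.inc.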
So is this script considered stable on OPNsense 24.7.10_2 (with the possible need to require system.inc as mentioned directly above)?
Not sure. I barely got the whole script installed and troubleshot my installation. I figured I would share what I did to make it work with the crash. I have it running on 1 physical baremetal and 1 proxmox vm with 11 internal VIP VLANs. Stable? Not sure.
I upgraded today to 24.7.11_2. Adding:
require_once("system.inc");
does prevent the crashing issue. Nice find, huetruong.
I'm still having an issue with entering persistent maintenance mode not causing a failover: opnsense/core#7877
I've also not had enough time to find the best way to shut/noshut the WAN interface so that rebooting the active or passive device leaves the interface in a consistent, desired state based on the CARP status. (I don't want my backup/passive device to have its WAN interface enabled upon boot and requesting a DHCP lease while the active device is already handling traffic.)
Creative suggestions, MEntOMANdo. You could do that and probably achieve a workable situation, but I see potential problems with that approach, and for some users and ISPs.
In your VM example, though the interface will be "down" by default, I believe the interface will still be brought up by configuration during boot - if it's stored in the opnsense configuration for the interface to be up, it will be brought up during boot.
In your CRON example, you may also run into a race condition, and still have your WAN interface come up, and do things like request a DHCP Lease, and possibly also not be shut down by the cron job if the device is 'backup' - depending on when the boot process that cron entry actually executes.
Towards the end of 'boot', the interface configuration is read, and then applied. So, with either approach, you have both the risk of the interface coming up in the first place, or not being shut down after the opnsense scripts read the configuration and bring up the interface.
This is one reason why I mentioned that my workaround of using shell_exec to manually set the interfaces up or down is not very clean or ideal: both because I'm calling shell_exec in the first place (bad practice, a security no-no!) and because the state of the interface will not persist across reboots.
IMO, it's better for the syshook.d CARP script to set the interface's configuration to be down, and save this in the configuration - so that only when CARP's state changes to "master", will the WAN interface be brought up at all. This way, you don't have to change default interface behavior, the script handles this for you.
Thoughts?
I reread your comments. I have to disable the WAN interface of the instance that is in backup state when I update and reboot so it doesn’t switch over.
This script works fine as an automatic failover if something goes wrong with the master.
Long story short, after finding out I couldn't unbridge my ONT -- I went about testing the WAN failover between my opnsense VMs again.
Either I haven't tested it in a long time or I was mistaken the last time I tested it. I had most of the issues that everyone mentioned .. most noticeably, the wan interface not disabling or enabling properly on the master/backup node respectively.
Also, on the backup/master node -- I noticed that it kept repeating master/backup node messages (as per the logging from 10-wancarp).
2024-12-27T01:38:05-05:00 Error opnsense /usr/local/etc/rc.syshook.d/carp/10-wancarp: enable interface 'wan' due CARP event 'MASTER'
2024-12-27T01:38:05-05:00 Notice opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member " (172.30.67.254) (40@vlan009)" has resumed the state "BACKUP" for vhid 40
2024-12-27T01:38:05-05:00 Error opnsense /usr/local/etc/rc.syshook.d/carp/10-wancarp: disable interface 'wan' due CARP event 'BACKUP'
2024-12-27T01:38:04-05:00 Notice opnsense /usr/local/etc/rc.syshook.d/carp/20-openvpn: Carp cluster member " (172.30.67.254) (40@vlan009)" has resumed the state "INIT" for vhid 40
2024-12-27T01:38:04-05:00 Error opnsense /usr/local/etc/rc.syshook.d/carp/10-wancarp: disable interface 'wan' due CARP event 'INIT'
2024-12-27T01:42:05-05:00 Notice configd.py [c8268658-528e-4180-9efb-b4465da3c196] Carp event on subsystem 200@vtnet1 for type MASTER
2024-12-27T01:40:05-05:00 Notice configd.py [75707303-20a1-468e-add3-97c31659f7cf] Carp event on subsystem 215@vlan09 for type MASTER
What I believe fixed the inconsistent master/backup status messages from 10-wancarp was noticing that the if check on the type in 20-openvpn (in /usr/local/etc/rc.syshook.d/carp) was different. Thanks to everyone who posted their fixes.
https://gist.github.com/vc1cv1/f59273ce98fda57cf8000cca65193b6b
#last updated for opnsense 24.7.11_2
#!/usr/local/bin/php
<?php
require_once("config.inc");
require_once("interfaces.inc");
require_once("util.inc");
require_once("system.inc");
$subsystem = !empty($argv[1]) ? $argv[1] : '';
$type = !empty($argv[2]) ? $argv[2] : '';
if (!in_array($type, ['MASTER', 'BACKUP', 'INIT'])) {
    log_msg("Carp '$type' event unknown from source '{$subsystem}'");
    exit(1);
}
if (!strstr($subsystem, '@')) {
    log_error("Carp '$type' event triggered from wrong source '{$subsystem}'");
    exit(1);
}
$ifkey = 'wan';
$real_if = get_real_interface($ifkey);
# since all my CARP ips fail over together, I just wanted it to only run when
# it matched the CARP status change for my LAN interface. You can find it in
# your debug log searching for 'carp' and/or totally comment out the IF statement.
if ($subsystem === "200@vtnet1") {
    if ($type === "MASTER") {
        log_error("enable interface '$ifkey' due CARP event '$type' on '$subsystem'");
        $config['interfaces'][$ifkey]['enable'] = '1';
        write_config("enable interface '$ifkey' due CARP event '$type'", false);
        interface_configure(false, $ifkey, false, false);
        sleep(2);
        shell_exec("/sbin/ifconfig {$real_if} up");
        log_msg("Issuing dhclient command on '$real_if' to request a DHCP lease");
        sleep(1);
        shell_exec("dhclient {$real_if}");
    } else {
        log_error("disable interface '$ifkey' due CARP event '$type' on '$subsystem'");
        unset($config['interfaces'][$ifkey]['enable']);
        write_config("disable interface '$ifkey' due CARP event '$type'", false);
        interface_configure(false, $ifkey, false, false);
        shell_exec("/sbin/ifconfig {$real_if} down");
    }
} #if subsystem
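Since the hook fires once per CARP subsystem, it can help to split the subsystem argument into its parts before comparing. Going by the log lines earlier in the thread (for example 200@vtnet1 and 40@vlan009), the format looks like vhid@interface; the sketch below assumes exactly that, and the function name parse_carp_subsystem is hypothetical.

```php
<?php
// Hypothetical helper: split a subsystem string such as "200@vtnet1" into
// its VHID and interface name. Returns null if the string does not match
// the assumed "<digits>@<device>" format.
function parse_carp_subsystem(string $subsystem): ?array
{
    if (preg_match('/^(\d+)@([A-Za-z0-9_.]+)$/', $subsystem, $m) !== 1) {
        return null;
    }
    return ['vhid' => (int) $m[1], 'interface' => $m[2]];
}
```

The hard-coded comparison against "200@vtnet1" could then become a check on the parsed vhid or interface, which also replaces the looser strstr($subsystem, '@') test.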
Agreed, it's better for the status of the interface to be saved. After testing my failovers, I saw nothing on my backup node at reboot indicating that the disabled 'wan' interface was being brought online and/or being disabled by CARP status.
Thank you for your efforts on this. I've got it set up and working when failing over. However, when the other device comes back online, I'm experiencing an issue. At that point, both firewalls are active and - since I duplicated the MAC address - competing for the IP address from the ISP. Has anyone else experienced this issue? How have you worked around it?
Which revision of the code are you using? Normally, the backup's interface should remain disabled unless the CARP status changes.
Also, under HA -> settings -> "disable preempt" -- do you have that checked or unchecked? Mine is unchecked -- maybe you have it checked.
"When this device is configured as CARP master it will try to switch to master when powering up, this option will keep this one slave if there already is a master on the network. A reboot is required to take effect."
I'm using the one from above, I think you posted it "last week". I did update it to handle my second ISP (I have two ISPs, but neither provide a second IP). Preempt is disabled.
I THINK even though it will come up as a backup, it still tries to grab an IP address at bootup because CARP has not yet been initialized. I see an increase in loss (on the master WAN links) right as the (other, backup) system boots and when it gets to parts (during the boot) where it says something about configuring the WAN interfaces. This makes sense, since the backup does not yet have an awareness of CARP on those interfaces (since they're not configured for CARP) and should logically try to get an IP (with a duplicated MAC) and it is attempting to bring those interfaces up. I may try to spend some time in the other RC directories to see if there is a logical place to down the WAN interfaces until CARP is up and the system's role can be determined. I wasn't sure if others had seen the same issue and - if they had - what may have been done to work around it.
Hi - FYI, this is still working for me on version 24.7.2: if I fail over my CARP, my WAN is disabled on the primary;
if I fail back, the WAN is disabled on the secondary.