-
Notifications
You must be signed in to change notification settings - Fork 75
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rootfs can fail to fully expand: race condition betw iiab-expand-rootfs & systemd-fsck ? (journalctl confusingly/regularly portrays 2 simultaneous boots due to lack of RTC?) #3325
Comments
So to summarize, because one line of the script logs twice, you leapt to the conclusion that the script ran twice, ignoring the fact that the other three lines, including start and end messages only appear once. As a result you made changes that broke it and now think that because it's broken it should be eliminated altogether. As further evidence that it ran twice you took the fact that for one user it did nothing, not the same thing twice, without any idea as to why that user had a problem. Is that pretty much it? |
Read the log and you will see that it started 2 simultaneous copies of /usr/sbin/iiab-expand-rootfs 2 times out of 10. So it's almost definitely a timing issue during boot. Systemd should be celebrated for helping us to clearly see this in its more modern logging system. If you're accusing others of breaking things without evidence, that's not helpful. |
Clarification for people who don't read the logs: The log shows that double-launching of iiab-expand-rootfs happens repeatedly — both during initial boot (when trying to expand the root filesystem) and also on subsequent boot(s) as well. |
Please cite the actual lines which you take to be evidence that it ran twice. Does 'Starting Root Filesystem Auto-Expander' appear twice? where does the log come from, what OS? |
Please do read this ticket (the top, especially). To restate it: this occurred on 64-bit "RasPiOS with desktop" on RPi 400. |
|
Jul 28 22:04:19 box systemd[1]: Starting Root Filesystem Auto-Expander... |
Please analyze the log more patiently:
As explained at the top of this ticket, the interweaved / interleaved chunks (i.e. sections, each with a |
If I run journalctl -u iiab-expand-rootfs I don't see the -- Boot 84cc35b34186456ba2adf08f7b4d5399 -- line, did you add that, is it a fresh boot? what does journalctl -b -u iiab-expand-rootfs show? |
No, I assume Systemd adds those into journalctl logging.
Indeed the Making these 2 invocations (that have identical timestamps) more readable:
|
so these are two separate boots. Root Filesystem Auto-Expander didn't finish on the first one (? manual reboot) and so ran again on the second one and succeeded. |
OK. Now I see /usr/sbin/iiab-expand-rootfs: line 59: 669 Segmentation fault resize2fs $ROOT_PART |
that could be run manually to test version |
but the segfault coudl also be from corrupt file system from previous reboot |
I don't know enough about Systemd to understand why 1 of the 2 "simultaneous boots" (specifically "84cc35b34186456ba2adf08f7b4d5399") does not appear here:
|
? never completed |
https://patchwork.ozlabs.org/project/linux-ext4/patch/[email protected]/ this is old, but could be relevant need to try to reproduce the resize2fs segmentation fault |
Possible. These 3+3 pastes might shed more light on these "simultaneous boots" hopefully:
|
to 124901371 blocks Didn't finish, rebooted/powered off too soon. |
resizing filesystem from 15728640 to 124901371 blocks |
there's still the segfault on the second boot |
Potentially more revealing details:
|
How about pastebin the entire journal? |
Jun 16 19:37:24 raspberrypi kernel: [ 1.586697] mmcblk0: mmc0:aaaa SC128 119 GiB Looks like 2 different SDcards were involved. |
Think there is a race condition between systemd-fsck and iiab-expand-rootfs
|
Might want to look at using 'After=' in the systemd unit file, for better staging of events. |
what is Jun 16 20:52:19 box kernel: kernel BUG at fs/ext4/resize.c:528! reminds of the segfault |
How about a pop up a warning that the filesystem is being altered and not to reboot until complete? Tie the resize warning into the netwarn routine staged to display before the iiab-network question. Kind of makes the ssh version of netwarn a priority now. Still need to fix the staging to prevent the running of the resize routine before the fsck service is completed. I've given more time to this project than I wished to in the past with the warnings unheeded, now the chickens are hatched and come home to roost. Good luck with whatever solution you choose for the home-brew'd images that are produced with the presently employed build routine. |
|
The opening comment clearly demonstrates that the resize operation can be launched before systemd-fsck-root.service, that in itself was the red flag warning that was glossed over becoming the primary cause of this issue. Bit of fiddling around, discovery pointed to a better approach back then, but in the face of resistance put forth why bother life is too short for the adversarial environment present. I figured it was just a matter of time before somebody ran across this known weakness to support my point of view, kind of sorry others have had their data put at risk using the service. Just remember I can't be the one that the finger points at, no commit access here. |
To prevent the erroneous removal of /.expand-rootfs when there was a problem running the script perhaps bash should be called with -e to exit on first error for an "anti-snowballing" effect. #3339 (comment) |
Used http://download.iiab.io/8.0/Old/iiab-8.0preview2-220616-MEDICAL-raspios32-lite-g2e2824c3.img.zip dated 2022-Jun-16 21:52
The growpart line looks a bit strange, should just be the partition number and was not resized. |
https://bugs.launchpad.net/cloud-utils/+bug/1881014 my guess is the patch mentioned has not made its way upstream to Debian yet. |
That's really great progress in helping everyone understand these frustrating failures! Can a more recent OS (possibly Ubuntu 22.10 daily build) be used to help confirm? Or 22.04.1 on RPi if that's faster to test? |
I don't know if Ubuntu 22.10 daily build kinetic-live-server-arm64.iso is actually meant to run on RPi? (Or is this a more generalized ARM build...that's possibly too painful to install on RPi prior to its 2022-10-20 release date?) |
Ubuntu is distributing the patched version, it's fixed over there. The issue is what |
Indeed 👍 Testing a recent Ubuntu-on-RPi could be extremely useful confirmation that both (1) intermittent failures and (2) deterministic failures are...both...increasingly tamed (with a few prayers included!) |
Known slowness on W. Pic of boot failure due to /boot not being clean, with that partition being a fat filesystem this could be caused by not 'ejecting' the media before removal on any given OS. TODO - check the version of cloud-guest-utils being used on RasPiOS, the patch mentioned above might be already included but is not enough to cover the /boot/kernel.img display of uname but otherwise works with the rest of the RasPiOS's kernels (kernel7.img kernel7l.img kernel8.img) |
Info from booted W
Looks like that patch went upstream as canonical/cloud-utils#2 but doesn't really cover X.X.X+ as the error above notes
Edit: Debian is lagging in updating and is behind upstream's version
|
I'll assume this same lag also applies to the very latest 64-bit Raspberry Pi OS from 2022-09-06. i.e. it too has not yet included the https://bugs.launchpad.net/cloud-utils/+bug/1881014 patch. (But if for some obscure this is wrong, please someone clarify if poss!) |
|
Even the very latest (from 2022-09-06) 64-bit "Raspberry Pi OS with desktop" unfortunately shows:
|
FWIW Debian 11.5 on x86_64 (also unfortunately) has the older version:
More positively Debian 12 (Bookworm) that many Debian community people are already using (prior to its release around mid-2023) is all set — as confirmed by: |
Now that PR #3337 is merged (thanks to testing work by @jvonau and @tim-moody) this ticket should be closed too in coming weeks — once reliable rootfs expansion is validated by others too. (Preferably on both RasPiOS and Ubuntu!) |
Appears solved: Am moving this ticket to the IIAB 8.1 Milestone so these issues continue to be monitored / confirmed. |
This happens to be on [64-bit] "RasPiOS with desktop" [on RPi 400] but it likely occurs on many/most if not all OS's [and hardware].
In the journalctl log below, you can see many examples of
/usr/sbin/iiab-expand-rootfs
being launched twice simultaneously. The output of these simultaneous / duplicate processes is made even more clear by 32-digit hex 'Boot' identifier numbers at the top of each chunk of output, interweaved together, evidenced by each of the 2 running processes having 100% identical timestamps.Unhealthy "conjoined / siamese twins" in a sense.
The 1st example (at the top of the log below) being the most onerous, as contention between those 2 processes caused the crash below (Line 48 of 117, made clear by:
/usr/sbin/iiab-expand-rootfs: line 59: 669 Segmentation fault resize2fs $ROOT_PART
) — that prevented the drive from fully expanding to its actual physical capacity.journalctl -u iiab-expand-rootfs
confirms the problemClick To Expand 'journalctl -u iiab-expand-rootfs' Log Output
Remediation
Reverting to
raspi-config --expand-rootfs
(with manual reboot, already in our code but commented out for now) [ might help (?) as part of a short-term fix on RasPiOS — e.g. if it's smart enough to use some kind of lock file arrangement to avoid running 2 processes simultaneously: ]iiab/roles/1-prep/templates/iiab-expand-rootfs
Line 17 in 2efd7c0
But it's hopefully also possible to fix iiab-expand-rootfs.service across all OS's — to put an end to this erroneous and quite common simultaneous-double-launching of /usr/sbin/iiab-expand-rootfs ?
Related:
The text was updated successfully, but these errors were encountered: