I'm working towards stabilizing/finishing trunk for another release, and I
just made another functional change in r1257.
Previously, if your replicate workers did any work, or the old_repl_compat
crap thought it did any work, it wouldn't run drain or rebalance at all.
Now it runs them regardless. Much of the time, drain/rebalance failing to
run comes down to old_repl_compat being set and bugging out on
devcount/file_on mismatches, or to there just being a lot of replication
work happening.
So far I've found what look like a few bugs in the way the fids are
checked. rebalance_ignore_missing could/should be a default with a little
more work to guarantee "unreachable" devices are being monitored
correctly. Also a fid can exist but have no file_on rows, which would gum
up the works in a few places.
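
For anyone who wants to check their own database for these cases, here is a rough sketch of the kind of query that would find them. It is Python against an in-memory SQLite database purely for illustration. The two tables are cut-down stand-ins holding only the columns named above (fid, devcount, devid), and find_suspect_fids is a made-up helper, not something that exists in trunk.

```python
import sqlite3


def find_suspect_fids(conn):
    """Return (fid, devcount, file_on_rows) for fids whose devcount
    disagrees with the number of file_on rows, including fids that
    have no file_on rows at all."""
    return conn.execute(
        """
        SELECT f.fid, f.devcount, COUNT(o.devid) AS on_rows
        FROM file AS f
        LEFT JOIN file_on AS o ON o.fid = f.fid
        GROUP BY f.fid, f.devcount
        HAVING COUNT(o.devid) <> f.devcount OR COUNT(o.devid) = 0
        ORDER BY f.fid
        """
    ).fetchall()


# Simplified stand-in schema (assumption: only the columns needed here).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE file (fid INTEGER PRIMARY KEY, devcount INTEGER NOT NULL);
    CREATE TABLE file_on (
        fid INTEGER NOT NULL,
        devid INTEGER NOT NULL,
        PRIMARY KEY (fid, devid)
    );
    INSERT INTO file VALUES (1, 2), (2, 2), (3, 1);
    INSERT INTO file_on VALUES (1, 10), (1, 11), (2, 10);
    """
)

suspects = find_suspect_fids(conn)
for fid, devcount, on_rows in suspects:
    print(f"fid {fid}: devcount={devcount}, file_on rows={on_rows}")
```

In this toy data, fid 1 is consistent, fid 2 has a devcount/file_on mismatch, and fid 3 exists with no file_on rows at all.
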
Otherwise the only things left that should hang drain or rebalance
*should* be the aforementioned fids with no file_on rows, 404s (unless
rebalance_ignore_missing = 1 is set), or the device being legitimately
broken. I still want to port the code to work with file_to_queue instead
of the "shuffle from the top" crap, which'll make it more resilient
against future bugs.
Anyway, if you presently have trouble getting drain/rebalance to run on
your cluster, please ping me. I'm curious how many folks are left who
still see this issue and what your setup is.
I'm going to push a few more things to trunk, probably tomorrow at this
point, which should make troubleshooting those issues a lot easier, but if
you're feeling enterprising and presently have stuck drain/rebalance,
please try out trunk on one tracker and see if it helps at all or makes
things worse.
Thanks,
-Dormando
My ideal is to get this all wired into the file_to_queue table, which
supports rescheduling fids.
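
To show why rescheduling matters, here is a loose Python model of a queue where a fid that fails gets pushed back with a later retry time instead of being picked up again immediately. This is only a sketch of the idea, not the file_to_queue code: the RescheduleQueue class, the run_pass function, the nexttry/failcount names as used here, and the doubling backoff are all assumptions made for the example.

```python
import heapq


class RescheduleQueue:
    """Toy queue ordered by next retry time (hypothetical model)."""

    def __init__(self):
        self._heap = []  # entries are (nexttry, fid, failcount)

    def enqueue(self, fid, nexttry=0, failcount=0):
        heapq.heappush(self._heap, (nexttry, fid, failcount))

    def pop_due(self, now, limit):
        """Pop up to `limit` entries whose nexttry is <= now."""
        due = []
        while self._heap and self._heap[0][0] <= now and len(due) < limit:
            due.append(heapq.heappop(self._heap))
        return due


def run_pass(queue, now, limit, try_move, base_delay=60):
    """Work one batch; failed fids are rescheduled, not retried at once."""
    moved = []
    for _nexttry, fid, failcount in queue.pop_due(now, limit):
        if try_move(fid):
            moved.append(fid)
        else:
            # Assumed policy: double the delay on each failure.
            delay = base_delay * (2 ** failcount)
            queue.enqueue(fid, now + delay, failcount + 1)
    return moved


stuck = {1, 2, 3}  # fids that can never be moved in this example
queue = RescheduleQueue()
for fid in range(1, 7):
    queue.enqueue(fid)


def try_move(fid):
    return fid not in stuck


first = run_pass(queue, now=0, limit=3, try_move=try_move)
second = run_pass(queue, now=1, limit=3, try_move=try_move)
print(first, second)
```

The first pass hits only the stuck fids and moves nothing, but because they are pushed into the future, the second pass gets to fids 4 through 6 instead of chewing on the same three again.
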
The existing logic for drain does:

    SELECT * FROM file_on WHERE devid = NN LIMIT 5000;
    shuffle(@rows);

... then returns 50 to operate on. If they're all stuck it'll just spin.
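
Here is a small Python simulation of that spin, scaled down to a window of 5 and a batch of 2 so it is easy to follow. It assumes the database hands back the same leading rows for the LIMIT query on every pass (no ORDER BY is given, so that is an assumption, not a guarantee). The function names and the stuck_fids set are made up for the example.

```python
import random


def next_drain_batch(file_on_rows, devid, limit=5000, batch=50):
    """Take the first `limit` rows for the device, shuffle them,
    and return `batch` of them to work on."""
    window = [row for row in file_on_rows if row[1] == devid][:limit]
    random.shuffle(window)
    return window[:batch]


def drain(file_on_rows, devid, stuck_fids, passes, limit, batch):
    """Run `passes` drain passes; return (idle_passes, remaining_rows)."""
    rows = list(file_on_rows)
    idle = 0
    for _ in range(passes):
        picked = next_drain_batch(rows, devid, limit, batch)
        if not picked:
            break
        moved = [row for row in picked if row[0] not in stuck_fids]
        if not moved:
            idle += 1  # the whole batch was stuck; nothing changed
        rows = [row for row in rows if row not in moved]
    return idle, rows


# Rows are (fid, devid). fids 1-5 are stuck and fill the whole window,
# so fids 6-8 are never even looked at.
rows = [(fid, 1) for fid in range(1, 9)]
stuck = {1, 2, 3, 4, 5}

idle, remaining = drain(rows, devid=1, stuck_fids=stuck,
                        passes=10, limit=5, batch=2)
print(f"idle passes: {idle}, rows left on device: {len(remaining)}")
```

Every pass is idle and nothing leaves the device, even though three of the eight rows could have been moved.
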
If you've seen any other weirdness, let the list or me know.
Thanks!
-Dormando