
Anyone with broken/weird drain/rebalance?


dormando

Jan 9, 2009, 6:37:50 AM
Yo,

Working towards stabilizing/finishing trunk for another release, and I
just made another functional change in r1257.

Previously, if your replicate workers did any work, or the old_repl_compat
crap thought it did, drain and rebalance wouldn't run at all. Now they run
regardless. Often the issue is that old_repl_compat is set and bugging out
on devcount/file_on mismatches, or that there's simply a lot of
replication work happening.
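
If you want to check whether you're in the devcount/file_on mismatch camp,
something along these lines should work against the stock schema (just an
illustration I'm sketching here, not a query from the tree):

    -- fids whose cached devcount disagrees with actual file_on rows
    SELECT f.fid, f.devcount, COUNT(fo.devid) AS on_disk
      FROM file f
      LEFT JOIN file_on fo ON fo.fid = f.fid
     GROUP BY f.fid, f.devcount
    HAVING f.devcount != on_disk
     LIMIT 100;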

So far I've found what look like a few bugs in the way fids are checked.
rebalance_ignore_missing could/should become the default with a little
more work to guarantee that "unreachable" devices are being monitored
correctly. Also, a fid can exist but have no file_on rows, which gums up
the works in a few places.
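
A rough way to spot those orphan fids, assuming the stock schema where
file and file_on join on fid (the LIMIT is arbitrary):

    -- fids with no file_on rows at all
    SELECT f.fid
      FROM file f
      LEFT JOIN file_on fo ON fo.fid = f.fid
     WHERE fo.fid IS NULL
     LIMIT 100;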

Otherwise, the only things left that should hang drain or rebalance
*should* be the aforementioned fids with no file_on rows, 404s (unless
rebalance_ignore_missing = 1 is set), or the device being legitimately
broken. I still want to port the code to work with file_to_queue instead
of the "shuffle from the top" crap, which'll make it more resilient
against future bugs.
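
To sketch what that would look like (nothing committed yet; I'm assuming
file_to_queue carries roughly (fid, devid, type, nexttry, failcount) as on
trunk, so treat the columns as an assumption):

    -- enqueue a fid for drain/rebalance work; nexttry means a stuck
    -- fid can be pushed into the future instead of re-picked forever
    INSERT INTO file_to_queue (fid, devid, type, nexttry, failcount)
    VALUES (?, ?, ?, UNIX_TIMESTAMP(), 0);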

Anyway, if you presently have trouble getting drain/rebalance to run on
your cluster, please ping me. I'm curious how many folks are left who
still see this issue and what your setup is.

I'm going to push a few more things to trunk... probably tomorrow at this
point, which should make troubleshooting these issues a lot easier. But if
you're feeling enterprising and presently have stuck drain/rebalance,
please try out trunk on one tracker and see if it helps at all or makes
things worse.

Thanks,
-Dormando

Andy Lo A Foe

Jan 9, 2009, 6:49:29 AM
Hi,

I had this weird issue 2 weeks ago where all replication workers would
loop furiously while trying to drain a device. This was most likely
because the class had a replication count of 3, but only 2 storage hosts
had nodes with free space left. So apparently it got stuck draining the
last remaining copies from the device.

Moving the device from 'drain' to 'down' stopped the looping.
Unfortunately I didn't have time to dig deeper into this issue, apart from
guessing it was stuck in the call to 'random_fids_on_device' in
DrainDevices.
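
For anyone hitting the same loop, that state flip is a one-liner with
mogadm (hostname and devid here are made up):

    mogadm device mark storage01 7 down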

Gr,
Andy

dormando

Jan 10, 2009, 5:53:03 PM
Yeah, definitely seen that before.

My ideal is to get this all wired into the file_to_queue table, which
supports rescheduling fids.

The existing logic for drain does roughly:

    SELECT * FROM file_on WHERE devid = NN LIMIT 5000;
    shuffle(@rows);

...then it returns 50 to operate on. If they're all stuck it'll just
spin, since the same rows keep coming back.
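
With file_to_queue, a fid that can't be placed right now could instead be
rescheduled; a minimal sketch, again assuming trunk's nexttry/failcount
columns:

    -- hypothetical reschedule for a stuck fid: retry in 5 minutes
    UPDATE file_to_queue
       SET nexttry = UNIX_TIMESTAMP() + 300,
           failcount = failcount + 1
     WHERE fid = ? AND type = ?;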

If you've seen any other weirdness, let the list or me know.

Thanks!
-Dormando
