
fix zero hash items eviction #152

Closed

Conversation

Etsukata
Contributor

If all hash values of the five tail items are zero in the specified slab
class, the expiry check is unintentionally skipped and the items stay without
being evicted. Consequently, new item allocations consume memory space
every time an item is set, which leads to slab OOM errors.
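
For context, here is an abbreviated sketch of the eviction scan this refers to, paraphrased from the 1.4.x do_item_alloc() tail walk (the names match that code, but the body is shortened and illustrative, not a verbatim quote). A plain set passes cur_hv == 0, so when the five tail candidates also hash to 0 the hv == cur_hv test skips every one of them and the expiry check below it is never reached:

/* Abbreviated, illustrative sketch of the do_item_alloc() tail walk (1.4.x).
 * cur_hv is 0 for a plain set. */
int tries = 5;
item *search;
void *hold_lock = NULL;
for (search = tails[id]; tries > 0 && search != NULL;
     tries--, search = search->prev) {
    uint32_t hv = hash(ITEM_key(search), search->nkey);
    /* Meant to skip "ourselves", but it also matches when both hashes are 0. */
    if (hv == cur_hv || (hold_lock = item_trylock(hv)) == NULL)
        continue;
    /* ...expiry check / eviction of `search` only happens past this point... */
    item_trylock_unlock(hold_lock);
}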


Details of the Problem

Version: 1.4.19 to 1.4.25
How reproducible: 100%

Steps to Reproduce:

# set items whose jenkins hash (ENDIAN_LITTLE) values are zero, with exptime 1 sec
printf "set 0wYuLiaUdfgTZCUsz8mRR1WJVk 0 1 4\r\ndata\r\n" | nc localhost 11211
printf "set 9NxjWkBnSfD0LShqUBZAqv3jKI 0 1 4\r\ndata\r\n" | nc localhost 11211
printf "set wRL2lvUMyPnbiImZdi9CTXbsJ6 0 1 4\r\ndata\r\n" | nc localhost 11211
printf "set vOgpzFNo2pGpYLKxUWXfI6PDXq 0 1 4\r\ndata\r\n" | nc localhost 11211
printf "set aTogASrWNRQSA1uh2ySgv6HwaU 0 1 4\r\ndata\r\n" | nc localhost 11211

# wait for exptime
sleep 2

# another innocent key set
# normally, the five items expire here
printf "set innocentkey 0 1 4\r\ndata\r\n" | nc localhost 11211

echo "stats items" | nc localhost 11211

Actual results:

STORED
STORED
STORED
STORED
STORED
STORED
STAT items:1:number 6
STAT items:1:age 2
STAT items:1:evicted 0
STAT items:1:evicted_nonzero 0
STAT items:1:evicted_time 0
STAT items:1:outofmemory 0
STAT items:1:tailrepairs 0
STAT items:1:reclaimed 0
STAT items:1:expired_unfetched 0
STAT items:1:evicted_unfetched 0
STAT items:1:crawler_reclaimed 0
STAT items:1:crawler_items_checked 0
STAT items:1:lrutail_reflocked 0
END

Expected results:

STORED
STORED
STORED
STORED
STORED
STORED
STAT items:1:number 1
STAT items:1:age 0
STAT items:1:evicted 0
STAT items:1:evicted_nonzero 0
STAT items:1:evicted_time 0
STAT items:1:outofmemory 0
STAT items:1:tailrepairs 0
STAT items:1:reclaimed 5
STAT items:1:expired_unfetched 5
STAT items:1:evicted_unfetched 0
STAT items:1:crawler_reclaimed 0
STAT items:1:crawler_items_checked 0
STAT items:1:lrutail_reflocked 0
END

@Etsukata
Contributor Author

Updated version info: 1.4.19 to 1.4.25

@Etsukata
Contributor Author

@dormando ping?
Could you reproduce the bug?

@dormando
Member

Hey,

This is pretty fun. May I ask if you actually ran into this in production, or if it was something you found in an audit?

Not sure if this patch is enough, since cur_hv can be 0 in the same way as the other items: this could reintroduce a corruption issue if the item pulls itself from the tail (in the case of do_item_replace and friends).

Or am I missing something?

@Etsukata
Contributor Author

Thank you for your reply.

May I ask if you actually ran into this in production, or if it was something you found in an audit?

A slab OOM suddenly happened in our production environment after running memcached for a month. Then I investigated it with gdb.

This could reintroduce a corruption issue if the item pulls itself from the tail (in the case of do_item_replace and friends).

This patch itself doesn't cause corruption, because item_trylock always fails in the replace context.
However, the current condition for checking and skipping the current item is not clear, since cur_hv can be 0 as you mentioned. So how about using the whole ITEM_key of the current item instead of the hash value? Like,

if ((cur_key && strcmp(ITEM_key(search), cur_key) == 0) || (hold_lock = item_trylock(hv)) == NULL)

Before that, I'd like to confirm the reason for the hv == cur_hv condition introduced in f7bf26c. Is it there for better eviction performance?

@dormando
Member

That's pretty wild. You should play the lottery.

I feel like that was added to fix a specific bug that came up during stress testing. Struggling to remember and my notes aren't clear. I'll dig around for more or re-test, in case I was wrong and it's useless.

@dormando
Member

dormando commented Jun 1, 2016

First, seems like your fix won't... fix it:

The line in question:

        /* Attempt to hash item lock the "search" item. If locked, no
         * other callers can incr the refcount. Also skip ourselves. */
        if (hv == cur_hv || (hold_lock = item_trylock(hv)) == NULL)
            continue;

thread.c:item_trylock():

void *item_trylock(uint32_t hv) {
    pthread_mutex_t *lock = &item_locks[hv & hashmask(item_lock_hashpower)];
    if (pthread_mutex_trylock(lock) == 0) {
        return lock;
    }
    return NULL;
}

item_trylock() is using the same value that's compared against cur_hv, except it lops off most of the bits, since the trylock table is pretty small. It's much more likely that you'd have 5 items in the tail (and then a current item) that evaluate to the same lock entry... which would make it unable to evict. Maybe that doesn't happen as often or something.
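
To make the masking concrete, here is a small stand-alone demo of that collision (the hashsize/hashmask macros mirror the ones in the memcached source, but the item_lock_hashpower value and the two hash values are made up for the example):

#include <stdint.h>
#include <stdio.h>

#define hashsize(n) ((uint32_t)1 << (n))
#define hashmask(n) (hashsize(n) - 1)

int main(void) {
    const unsigned int item_lock_hashpower = 13; /* example value only */
    uint32_t a = 0x12340005;                     /* made-up full hash values */
    uint32_t b = 0xabcd0005;
    /* item_trylock() only uses the low bits of hv, so these two distinct
     * full hashes land on the same entry of item_locks[]. */
    printf("lock slot of a: %u\n", (unsigned)(a & hashmask(item_lock_hashpower)));
    printf("lock slot of b: %u\n", (unsigned)(b & hashmask(item_lock_hashpower)));
    return 0;
}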


Second, that line wasn't introduced in f7bf26c; that was a re-introduction after I'd accidentally dropped it. It was actually added in fd39b19, which was part of a fix started in 59bd02c, with a test in cb120c0.

The full details of that bug are here: https://code.google.com/archive/p/memcached/issues/260 and here: #67 (in the comments; ignore the patch).

It's pretty complicated, but refreshing from this: it looks like there are (or were?) some cases where an item's lock is released but references are still held, and a call to item_alloc() happens. That means the lock test will succeed, but the item currently being worked on could get pulled from the tail, "evicted", and handed back, which then causes corruption as the original item gets freed.

Way before that, it looks like the cur_hv != hv check was added as an optimization... but I feel stupid now, because fd39b19 just looks like it makes the logic clearer to read, yet the lines are identical? Or am I unable to parse code today?

@Etsukata
Contributor Author

Etsukata commented Jun 1, 2016

Thanks for the information about issue 260. I missed the case where item_alloc() is called without the item's lock.

Let me clarify the cases which lead to memory corruption and stuck eviction.


without this patch:

  • if cur_hv and hv are both 0 => eviction gets stuck

with this patch:

  • if cur_hv is 0 and item_alloc() is called without the item's lock => memory corruption

Right?
Please give me time to fix them and fully comprehend issue 260. It's complicated.

fd39b19

I prefer hv == cur_hv || to hv != cur_hv &&; it's easier to understand.

@dormando
Member

dormando commented Jun 1, 2016

I believe that's correct, but in all of my extensive notes on the bug I never expressly listed the code paths which can do that.

I've piled in enough small fixes into the 'next' tree for a release, so I'm going to try to wrap up the feature I was working on to go along with them. Please feel free to keep inspecting this problem and I'll keep checking/responding, but I won't be directly digging into it again for a little bit.

@Etsukata
Contributor Author

Etsukata commented Jun 4, 2016

I'm still in the middle of inspecting the problem, and found that commit fd39b19 is essential to fixing 'issue 260'. Those lines are not identical, and they are the key to preventing add_delta from pulling its own item from the tail.

'Issue 260' is caused by a race between TAIL_REPAIR and the add_delta operation. Without fd39b19, incr/decr commands pull their own item from the tail without trying to lock it, because hv != cur_hv is already true, which unlinks the item and breaks the tail. That's well described in your comment #67 (comment), though I also can't figure out the magical part.

To see the race, just incr after TAIL_REPAIR_TIME, like the test in cb120c0.

Check out 1.4.15 and change TAIL_REPAIR_TIME to make it easier to reproduce:

diff --git a/memcached.h b/memcached.h
index 8a51352..4401ca6 100644
--- a/memcached.h
+++ b/memcached.h
@@ -76,7 +76,7 @@

 /** How long an object can reasonably be assumed to be locked before
     harvesting it on a low memory condition. */
-#define TAIL_REPAIR_TIME (3 * 3600)
+#define TAIL_REPAIR_TIME 3

 /* warning: don't use these macros with a function, as it evals its arg twice */
 #define ITEM_get_cas(i) (((i)->it_flags & ITEM_CAS) ? \

Then just incr after TAIL_REPAIR_TIME:

printf "set abc 0 0 1\r\n1\r\n" | nc localhost 11211

# wait TAIL_REPAIR_TIME
sleep 5

printf "incr abc 2\r\n" | nc localhost 11211

echo "stats items" | nc localhost 11211
echo "stats cachedump 1 5" | nc localhost 11211

Result:

STORED
3
STAT items:1:number 1
STAT items:1:age 0
STAT items:1:evicted 0
STAT items:1:evicted_nonzero 0
STAT items:1:evicted_time 0
STAT items:1:outofmemory 0
STAT items:1:tailrepairs 1
STAT items:1:reclaimed 0
STAT items:1:expired_unfetched 0
STAT items:1:evicted_unfetched 0
END
ITEM abc [1 b; 1465040623 s]
ITEM  [-2 b; 1465040623 s]
ITEM  [-2 b; 1465040623 s]
ITEM  [-2 b; 1465040623 s]
ITEM  [-2 b; 1465040623 s]
END

I think my patch will not cause the race, because it always tries to lock before pulling itself in incr/decr. But I'm still not sure it's reasonable for fixing the stuck eviction without introducing races. I'm going to inspect a little more. Thank you for your patience.

@dormando
Member

dormando commented Jun 4, 2016

Thanks for looking into it :) I knew I wrote that one-liner fix for a good reason.

@Etsukata
Contributor Author

After inspecting the code and past fixes around item refcounts, I've come to think the best fix for the problem is to just remove the hv == cur_hv condition. So the if statement should simply be:

if ((hold_lock = item_trylock(hv)) == NULL)

If there were cases where an item's lock is released but references are still held and a call to item_alloc() happens, that would break the tail like in issue 260, causing a race with the TAIL_REPAIR_TIME option. In such cases the item must hold its lock, because lru_pull_tail() can change its refcount.
I think there is no longer any reason to compare the two items' hash values to prevent an item from accidentally pulling itself from the tail, now that there is no case where item_alloc() is called without the item's lock. What do you think?

@dormando
Member

Sorry for the delay (1.4.26 went out!).

Just re-read things again: The cur_hv is only ever passed in from do_store_item() with an append/prepend command, or in do_add_delta(). In all other cases it's zero, as thread.c's item_alloc() wrapper passes in zero there.

The problem I was probably trying to avoid was a deadlock, by not having the thread that owns a lock try to lock it again... but nothing actually prevents that from happening already. This would be super dangerous if something holding an item lock called item_lock() again, as it would just deadlock. The codebase isn't very clear about this being a problem.

Thankfully since it's just a trylock() there, it'll return EBUSY.
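
As a tiny stand-alone check of that claim (plain POSIX, not memcached code): pthread_mutex_trylock() on a default, non-recursive mutex the calling thread already holds returns EBUSY instead of deadlocking, which is what keeps the trylock path safe here.

#include <errno.h>
#include <pthread.h>
#include <stdio.h>

int main(void) {
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_lock(&lock);
    /* Trying the same mutex again from the same thread does not block. */
    int rc = pthread_mutex_trylock(&lock);
    printf("trylock while already held: %s\n",
           rc == EBUSY ? "EBUSY" : "acquired");
    pthread_mutex_unlock(&lock);
    return 0;
}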

So, I think that check can just be removed. Alternatively, it could be fixed by changing cur_hv to a pointer that's passed in, then: if (cur_hv != NULL && *cur_hv == hv), etc.
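
Spelled out, that hypothetical pointer-based variant might look like the following sketch (not what ended up being merged; the merged change simply drops the comparison):

/* Hypothetical alternative: pass cur_hv as a pointer so that "no current
 * item" (NULL) is distinct from a genuine hash value of 0. */
if ((cur_hv != NULL && *cur_hv == hv) ||
    (hold_lock = item_trylock(hv)) == NULL)
    continue;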

I still think, given the original bug report... if you had 5 items with the same item_lock() value (which shaves bits off of the full hv), you'd occasionally hit this. Holding an item lock while calling item_alloc() only seems to happen in append/prepend and do_add_delta, as I noted above, though.

So we can start by removing the check... feel free to push a new change into this PR (replace the old one please), or I'll get to it at some point.

You'd still be stuck because only item_alloc() can call lru_pull_tail() with do_evict == true. The background thread can't evict items... but it will expire them, if they're sitting in the bottom and expired.

@Etsukata
Contributor Author

Sorry for the delay (1.4.26 went out!).

Please don't worry about it. I like the live logging feature, awesome work!

I still think, given the original bug report... if you had 5 items with the same item_lock() value (which shaves bits off of the full hv), you'd occasionally hit this.

Yes, the range of the masked hv used in item_lock() is much smaller than the range of the raw hv. If there are items on the tail whose masked hash values are the same and one of those items is locked, it can cause an occasional OOM.
However, lru_pull_tail called on a GET request can evict or expire items whose hv is not 0, and in most workloads GETs are far more common than APPEND/PREPEND, so that occasional OOM seldom happens.
It should be noted that if an item's hv is 0, lru_pull_tail called on a GET does not pull that item without this patch.

So we can start by removing the check... feel free to push a new change into this PR (replace the old one please), or I'll get to it at some point.

OK, I'll re-push the commit soon.

@Etsukata Etsukata force-pushed the fix-zero-hash-eviction branch from 2fd7e2e to 1410be4 Compare June 24, 2016 03:19
@Etsukata
Contributor Author

Etsukata commented Jun 24, 2016

Rebased on the latest master and re-pushed.
If you need a test for this problem, please feel free to tell me.

If all hash values of the five tail items are zero in the specified slab
class, the expiry check is unintentionally skipped and the items stay without
being evicted. Consequently, new item allocations consume memory space
every time an item is set, which leads to slab OOM errors.
@Etsukata Etsukata force-pushed the fix-zero-hash-eviction branch from 1410be4 to dd0739c Compare June 24, 2016 04:28
@dormando
Member

Thanks! This won't quite make it into 1.4.27, but it'll go into the next one. The release after .27 will require burn-in testing, so it's a good fit for that one.

Tests would end up being flaky anytime the hash value changes, etc. I'll put it under the burn-in test to be absolutely sure there wasn't something I forgot.

@dormando
Member

It took a lot longer than anticipated to get the testing setup going again and then get through the chunked item feature. It seems to be doing fine so far, so I've pushed it up to next.

Sorry that took so long.

@dormando dormando closed this Jul 13, 2016