Node.js 0.12.0 / io.js 1.x http.Agent keep-alive has problems with CouchDB #7699
Comments
@othiym23 do you still have the Wireshark dumps around?
I do. I'll figure out some way to share them and put a link here.
@janl https://github.com/othiym23/debugging-npm-couchdb-iojs These captures comprise only the initial population of the VDUs into the new Couch database and the potentially failing auth calls. There are, right now, two traces – one with keep-alive enabled, and one without. The former fails and the latter succeeds. Let me know if you want any other traces. Capturing them is pretty easy for me now.
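For anyone who wants to reproduce the two traces from a small script rather than the full test suite, here is a minimal sketch of toggling keep-alive on the client side with Node's `http.Agent`. The helper name `makeAgent` and the `maxSockets` value are illustrative assumptions and are not taken from npm-registry-client.

```javascript
const http = require('http');

// Hypothetical helper: build an agent with keep-alive switched on or off.
function makeAgent(keepAlive) {
  return new http.Agent({ keepAlive: keepAlive, maxSockets: 10 });
}

const pooled = makeAgent(true);   // idle sockets are kept in the pool for reuse
const oneShot = makeAgent(false); // idle sockets are not retained

// Pass either agent as the `agent` option of http.request() to compare traces.
```

Running the same request sequence once with each agent should give one capture per mode, which is the comparison described above.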
It looks like there don't need to be auth failures in order to see the problem. I am running with a modified version of 01-adduser, and it still fails when I do this:
It looks like the first login attempt is handled in about half a second, and every subsequent one is delayed by 10 seconds. The '10' is suspiciously round, so it may be a timeout or some such.
@othiym23 can you post your recipe for gathering the packet dumps? I want to see what's going on in my reduced example (above).
There are some example captures at the repo I mentioned to janl above.
That timeout is the retry built into |
I think at least part of this problem is a CouchDB issue. From looking at the dumps, it simply decides to end the connection after sending the response even though it did not specify this intention in the response (by adding a
Btw, I found https://issues.apache.org/jira/browse/COUCHDB-1146, which seems related. Maybe @jhs can elaborate?
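To make the point about signalling concrete: the header name is missing from the comment above, so the sketch below assumes it refers to the standard `Connection: close` token. It shows how a client could decide, from the response alone, whether the server has promised to keep the connection open. The function name `mayReuseConnection` is hypothetical, and the `headers` argument is assumed to use lowercase keys, as Node's `res.headers` does.

```javascript
// Decide whether a client may reuse a connection after a response,
// following the HTTP/1.x persistence rules (RFC 7230, section 6.3).
function mayReuseConnection(httpVersion, headers) {
  const value = String(headers.connection || '').toLowerCase();
  const tokens = value.split(',').map((t) => t.trim());
  if (tokens.includes('close')) return false; // server announced the close
  if (httpVersion === '1.1') return true;     // HTTP/1.1 is persistent by default
  return tokens.includes('keep-alive');       // HTTP/1.0 has to opt in
}
```

A response for which this returns `true`, followed by the server closing the socket anyway, is the situation described in the dumps: the client has no warning and may already have queued the next request on that socket.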
I agree, the problem seems to be on the second sending of any given credential. The couchapp returns 409, and then for some reason CouchDB throws in the RST and kills the connection.
@kanongil That agrees with my reading of the CouchDB source, which is that this error at root is something funky in mochiweb's handling of keep-alive (and also with the fact that prior to Node.js 0.12 / io.js, Node didn't actually have a working keep-alive implementation in its |
Can we reproduce the problem with clients other than Node > 0.11? I am curious about this, which I see just before the RST:
It looks like the client (port 62249) ACKs two packets (787, 788) using the same sequence number (977). Is that a correct reading, and if so, is that legitimate TCP?
https://blog.udemy.com/tcp-dupack/ As it stands, this is just more evidence of weirdness in the interaction between CouchDB and Node.
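For scanning a longer capture for the same pattern, a rough sketch follows. It assumes the capture has been exported to plain objects with `frame`, `srcPort` and `ack` fields (those field names are made up for this example), and it treats 787 and 788 as the frame numbers of the two ACK segments, which is only one reading of the question above. It also ignores payload length and window size, which a stricter definition of a duplicate ACK would check.

```javascript
// Group captured segments by sender port and ACK number, and report any
// ACK number that the same sender used more than once.
function findDuplicateAcks(segments) {
  const seen = new Map();
  for (const s of segments) {
    const key = s.srcPort + ':' + s.ack;
    if (!seen.has(key)) seen.set(key, []);
    seen.get(key).push(s.frame);
  }
  return [...seen.entries()]
    .filter(([, frames]) => frames.length > 1)
    .map(([key, frames]) => ({ key, frames }));
}

// Port, frame numbers and ACK value are the ones quoted in the thread;
// the 5984 entry is invented to show a segment that is not a duplicate.
const dups = findDuplicateAcks([
  { frame: 786, srcPort: 5984, ack: 400 },
  { frame: 787, srcPort: 62249, ack: 977 },
  { frame: 788, srcPort: 62249, ack: 977 },
]);
console.log(dups);
```

Repeated ACK numbers from one sender are normal TCP behavior on their own, so a hit from this scan is a prompt to look at the surrounding frames, not proof of a fault.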
I have a little libcurl C program that sends the same requests as Does this prove anything by itself? Should I try to compare the data sent by curl to the data sent by npm-registry-client, packet-by-packet? |
If you can. If it helps (it probably doesn't), this is a race of some kind, because the |
Ah, OK. So the ideal case is to have two curl processes launched back-to-back:
That will precisely mimic what's happening in 01-adduser, when it launches two |
I'm pretty sure the |
Okay, here's some more data. When I run the
This also fails in the same way, with an extra packet sent from first, in that it writes to the socket after it is FIN/ACK'd. As for which commits introduced this into Node, I am currently looking at libuv/libuv@e19089f and trying a locally built io.js without the out-of-band support that was added for 1.4.2, 1.4.3 on Darwin. @othiym23 can you reproduce this bug with io.js on other, non-Darwin platforms?
No, that out-of-band stuff is fine; or rather, removing it does not remove the problem.
I believe npm/npm-registry-client#107 fixes this. ;-)
Long writeup: As usual, a complicated bug stems from multiple causes. In this case there are at least two underlying bugs. First, Second, there is at least one bug, possibly more, in the new http-keepalive code added in io.js and node >= 0.12. The first problem is recycling a socket when there are still writes pending against it. Here the problem was that the HTTP agent sent the GET request, and followed by writing the The code that initializes a socket for HTTP use in
The first few lines of the log show the socket being cleaned up and discarded (returned to the pool, for a keepalive socket). Then the socket is queued for configuration in step [1] (onSocket). Next, the previously queued write is processed on the socket [2]. On the next tick the socket is available to actually be configured by the code in tickOnSocket [3]. Meanwhile, the other end of the connection has received garbage data (the body data associated with the previous GET request), which causes it to send an Eventually the The error (or errors) in the
Unfortunately I don't have any ideas for how to fix those easily. If low-level Regarding point 2, the meaning of However the iojs issues are addressed, though, the trigger can be removed from |
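To illustrate the first problem in the writeup (recycling a socket while writes are still pending against it), here is a small sketch of a guard that refuses to reuse a stream with queued bytes. This is not how io.js or Node addressed the issue. `isSafeToRecycle` is a hypothetical name, and `writableLength` is a property of current Node.js streams that was not available in 0.12 or io.js 1.x. Plain `Writable` streams stand in for sockets so the example runs without a network.

```javascript
const { Writable } = require('stream');

// Hypothetical guard: only treat a socket-like stream as reusable when it is
// still open and has no bytes queued for writing.
function isSafeToRecycle(stream) {
  return !stream.destroyed && stream.writable && stream.writableLength === 0;
}

// Stand-in for a socket whose write never completes.
const stalled = new Writable({
  write(chunk, encoding, callback) { /* intentionally never calls callback */ },
});
// Stand-in for a socket with nothing outstanding.
const idle = new Writable({
  write(chunk, encoding, callback) { callback(); },
});

stalled.write('leftover body');

console.log(isSafeToRecycle(idle));    // true
console.log(isSafeToRecycle(stalled)); // false
```

A check like this only covers data the stream layer knows about. It would not catch a write that the HTTP layer queues after the socket has already gone back to the pool, which is the ordering the log walk-through above describes.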
sam u r the best 😘
@chrisdickinson and @indutny, you might find #7699 (comment) interesting. Also, @janl, I think this lets CouchDB off the hook. I'm going to test and potentially land this patch now.
npm/npm-registry-client@856eefe contains the fix for this, and 431c3bf includes the fixed version of |
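The writeup attributes the trigger to body data being written for a GET request. As a sketch of avoiding that on the client side (this is not the actual patch in the commit referenced above, and `prepareRequest` is a hypothetical helper), a request builder can simply refuse to attach a payload to methods that are not expected to carry one:

```javascript
// Methods for which this sketch refuses to send a request body.
const BODYLESS = new Set(['GET', 'HEAD']);

// Hypothetical helper: return the method, headers and payload to send.
function prepareRequest(method, body) {
  const verb = String(method).toUpperCase();
  const headers = { accept: 'application/json' };
  if (body === undefined || BODYLESS.has(verb)) {
    // Nothing is written after the request head, so no stray bytes can
    // end up on a socket that has already been handed back to the pool.
    return { method: verb, headers, payload: null };
  }
  const payload = Buffer.from(JSON.stringify(body));
  headers['content-type'] = 'application/json';
  headers['content-length'] = payload.length;
  return { method: verb, headers, payload };
}
```

For example, `prepareRequest('get', { name: 'x' })` yields a `null` payload, while the same call with `'PUT'` yields a JSON buffer together with matching `content-type` and `content-length` headers.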
New nodes have a new (which is to say working) HTTP keep-alive agent. So far it seems to cause no problems for npm in practice, but the npm tests (including in continuous integration) consistently hang when running the tests that hit a local CouchDB instance to test the CouchDB application that powers npm's registry.
steps to reproduce:
npm install
npm test
what should happen
The tests should proceed to completion without issue in ~30 seconds. This is what will happen if you swap out io.js or Node.js 0.12 with Node.js 0.10.x.
what actually happens
The tests hang about 2/3 of the way through the first steps, which are for account creation. Eventually it times out. npm is left hanging partway through an HTTP request immediately after getting ECONNRESET or EPIPE from the CouchDB end. A potentially salient detail is that a number of requests are issued in rapid-fire succession, and that some of the responses have 40x codes because they're authentication failures or "document (user) not found" responses.
debugging so far
At the debug log level, the CouchDB logs for the tests reveal no interesting or unusual information.
NODE_DEBUG=http npm test basically backs up the information revealed by Wireshark – the connections are getting closed on the CouchDB end, more or less midstream, and the subsequent retries are hanging.
Transcript:
I'm stumped! Anyone got any ideas?