This repository has been archived by the owner on Aug 11, 2022. It is now read-only.

Node.js 0.12.0 / io.js 1.x http.Agent keep-alive has problems with CouchDB #7699

Closed
othiym23 opened this issue Mar 21, 2015 · 24 comments

@othiym23
Contributor

New nodes have a new (which is to say working) HTTP keep-alive agent. So far it seems to cause no problems for npm in practice, but the npm tests (including in continuous integration) consistently hang when running the tests that hit a local CouchDB instance to test the CouchDB application that powers npm's registry.

steps to reproduce:

  1. ensure that you have a recent version of io.js installed
  2. ensure that you have CouchDB installed
  3. clone npm/npm-registry-couchapp and run npm install
  4. run npm test

what should happen

The tests should proceed to completion without issue in ~30 seconds. This is what will happen if you swap out io.js or Node.js 0.12 with Node.js 0.10.x.

what actually happens

The tests hang about 2/3 of the way through the first steps, which are for account creation. Eventually it times out. npm is left hanging partway through an HTTP request immediately after getting ECONNRESET or EPIPE from the CouchDB end. A potentially salient detail is that a number of requests are issued in rapid-fire succession, and that some of the responses have 40x codes because they're authentication failures or "document (user) not found" responses.

debugging so far

  • Even turned up to debug, the CouchDB logs for the tests reveal no interesting or unusual information.
  • Wireshark captures of the transactions show some strange behavior with RSTs and duplicate ACKs, but not in a way that reveals why those RSTs are happening.
  • running the tests with NODE_DEBUG=http npm test basically backs up the information revealed by Wireshark – the connections are getting closed on the CouchDB end, more or less midstream, and the subsequent retries are hanging.

Transcript:

% node --version
v1.6.1
% couchdb -V
couchdb - Apache CouchDB 1.6.1

Licensed under the Apache License, Version 2.0 (the "License"); you may not use
this file except in compliance with the License. You may obtain a copy of the
License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed
under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
% git clone npm/npm-registry-couchapp
Cloning into 'npm-registry-couchapp'...
done.
% cd npm-registry-couchapp
% npm install
npm WARN package.json Dependency 'json' exists in both dependencies and devDependencies, using 'json@^9.0.2' from dependencies
npm WARN engine [email protected]: wanted: {"node":"0.10.x || 0.8.x"} (current: {"node":"1.6.1","npm":"2.7.4"})
[email protected] node_modules/which

[email protected] node_modules/rimraf

[email protected] node_modules/semver

[email protected] node_modules/parse-json-response
└── [email protected] ([email protected])

[email protected] node_modules/json

[email protected] node_modules/request
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected] ([email protected])
├── [email protected] ([email protected])
├── [email protected] ([email protected])
├── [email protected] ([email protected], [email protected], [email protected])
├── [email protected] ([email protected])
├── [email protected] ([email protected])
└── [email protected] ([email protected], [email protected], [email protected], [email protected])

[email protected] node_modules/tap
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected]
├── [email protected] ([email protected])
├── [email protected] ([email protected])
├── [email protected] ([email protected], [email protected], [email protected])
├── [email protected] ([email protected], [email protected], [email protected])
└── [email protected] ([email protected])

[email protected] node_modules/couchapp
├── [email protected]
├── [email protected] ([email protected], [email protected])
├── [email protected]
├── [email protected] ([email protected], [email protected], [email protected], [email protected])
├── [email protected] ([email protected], [email protected], [email protected], [email protected])
└── [email protected] ([email protected], [email protected], [email protected])
bauchelain% npm test

> [email protected] test /Users/ogd/Documents/projects/npm-registry-couchapp
> tap test/*.js

ok test/00-setup.js ..................................... 8/8
not ok test/01-adduser.js ............................... 5/6
    Command: "/usr/local/bin/iojs 01-adduser.js"
    TAP version 13
    ok 1 (unnamed assert)
    ok 2 (unnamed assert)
    ok 3 (unnamed assert)
    ok 4 (unnamed assert)
    ok 5 (unnamed assert)
    not ok 6 test/01-adduser.js
      ---
        exit:     ~
        timedOut: true
        signal:   SIGTERM
        command:  "/usr/local/bin/iojs 01-adduser.js"
      ...

    1..6
    # tests 6
    # pass  5
    # fail  1

ok test/01b-user-views.js ............................... 5/5
ok test/01c-user-updates.js ............................. 4/4
ok test/01d-whoami.js ................................... 4/4
ok test/02-publish.js ................................. 35/35
ok test/02b-dist-tags.js .............................. 26/26
ok test/03-star.js .................................... 10/10
ok test/04-deprecate.js ................................. 7/7
ok test/04b-registry-views-expect.js .................... 1/1
ok test/04b-registry-views.js ......................... 51/51
ok test/04c-lists.js .................................... 3/3
ok test/05-unpublish.js ............................... 24/24
ok test/06-not-implemented.js ........................... 9/9
ok test/common.js ....................................... 2/2
ok test/ping.js ......................................... 9/9
ok test/pkg-update-copy-fields.js ....................... 3/3
ok test/pkg-update-deprecate-msg.js ..................... 3/3
ok test/pkg-update-hoodie.js ............................ 2/2
ok test/pkg-update-metadata.js .......................... 4/4
ok test/pkg-update-star-nopt.js ......................... 4/4
ok test/pkg-update-yui-error.js ......................... 2/2
ok test/vdu-auth.js ..................................... 7/7
ok test/vdu.js ........................................ 16/16
ok test/zz-teardown.js .................................. 2/2
total ............................................... 246/247

not ok
npm ERR! Test failed.  See above for more details.

I'm stumped! Anyone got any ideas?

@janl
Contributor

janl commented Mar 21, 2015

@othiym23 do you still have the wireshark dumps around?

@othiym23
Contributor Author

I do. I'll figure out some way to share them and put a link here.

@othiym23
Contributor Author

@janl https://github.com/othiym23/debugging-npm-couchdb-iojs

These captures comprise only the initial population of the VDUs into the new Couch database and the potentially failing auth calls. There are, right now, two traces – one with keep-alive enabled, and one without. The former fails and the latter succeeds.

Let me know if you want any other traces. Capturing them is pretty easy for me now.

@smikes
Contributor

smikes commented Apr 9, 2015

It looks like auth failures aren't required in order to see the problem. I am running with a modified version of 01-adduser, and it still fails when I do this:

test('adduser a', fn.bind(null, conf, u))
test('adduser b', fn.bind(null, conf, u))
test('adduser c', fn.bind(null, conf, u))
test('adduser d', fn.bind(null, conf, u))

It looks like the first login attempt is handled in about half a second, and every subsequent one is delayed by 10 seconds. The '10' is suspiciously round, so it may be a timeout or some such.

@smikes
Contributor

smikes commented Apr 9, 2015

@othiym23 can you post your recipe for gathering the packet dumps? I want to see what's going on in my reduced example (above).

@othiym23
Contributor Author

othiym23 commented Apr 9, 2015

  1. Fire up wireshark, and start a capture with a filter to pay attention to only port 15986.
  2. Physically move aside all the tests except setup, auth, and teardown.
  3. Run the tests.
  4. Stop the wireshark capture session.

There are some example captures at the repo I mentioned to janl above.

@othiym23
Contributor Author

othiym23 commented Apr 9, 2015

That timeout is the retry built into npm-registry-client, and happens because connections on the keep-alive socket fail. What I haven't been able to figure out is why there are duplicate FINs and / or ACKs from the CouchDB side -- the error appears to be at the TCP, rather than HTTP, layer.
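That retry behavior can be modeled roughly like this. This is a synchronous sketch, not npm-registry-client's actual code (which waits between attempts, hence the observed ~10-second pauses); the function names are illustrative:

```javascript
// Illustrative model of retry-on-socket-error behavior: a request that
// dies with ECONNRESET or EPIPE is retried, which is what produces the
// suspiciously round delays observed in the test run.
function withRetries(operation, maxAttempts) {
  var attempts = 0;
  for (;;) {
    attempts++;
    try {
      return { result: operation(), attempts: attempts };
    } catch (err) {
      var retryable = err.code === 'ECONNRESET' || err.code === 'EPIPE';
      if (!retryable || attempts >= maxAttempts) throw err;
      // a real client would sleep here (typically with growing backoff)
      // before trying again; omitted to keep the sketch synchronous
    }
  }
}

// Simulated flaky operation: the first two calls hit a dead socket.
var calls = 0;
function flakyRequest() {
  calls++;
  if (calls < 3) {
    var err = new Error('read ECONNRESET');
    err.code = 'ECONNRESET';
    throw err;
  }
  return 'ok';
}

var outcome = withRetries(flakyRequest, 5);
```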

@kanongil

kanongil commented Apr 9, 2015

I think at least part of this problem is a couchdb issue. From looking at the dumps, it simply decides to end the connection after sending the response, even though it did not signal this intention in the response (by adding a Connection: close header).

Btw, I found https://issues.apache.org/jira/browse/COUCHDB-1146, which seems related. Maybe @jhs can elaborate?
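For reference, the HTTP/1.1 persistence rule the client is relying on (RFC 7230 §6.3, simplified) can be sketched as follows; absent an explicit Connection: close, an HTTP/1.1 response leaves the connection reusable, which is why CouchDB closing anyway surprises the keep-alive agent:

```javascript
// Simplified sketch of HTTP connection-persistence defaults.
// HTTP/1.1: persistent unless the response says Connection: close.
// HTTP/1.0: non-persistent unless the response says Connection: keep-alive.
function responseSignalsKeepAlive(httpVersion, headers) {
  var connection = (headers['connection'] || '').toLowerCase();
  if (httpVersion === '1.1') return connection !== 'close';
  return connection === 'keep-alive';
}
```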

@smikes
Contributor

smikes commented Apr 9, 2015

I agree, the problem seems to be on the second sending of any given credential. The couchapp returns 409, and then for some reason couchdb throws in the RST and kills the connection.

@othiym23
Contributor Author

othiym23 commented Apr 9, 2015

@kanongil That agrees with my reading of the CouchDB source, which is that this error at root is something funky in mochiweb's handling of keep-alive (and also with the fact that prior to Node.js 0.12 / io.js, Node didn't actually have a working keep-alive implementation in its http.Agent code). I need a minimal reduction that I can take to @janl et al to get it fixed, or some kind of patch to mochiweb, both of which have proven to be a bit tricky to get.

@smikes
Contributor

smikes commented Apr 9, 2015

Can we reproduce the problem with clients other than node > 0.11 ? I am curious about this, which I see just before the RST:

166   2.240033    127.0.0.1 -> 127.0.0.1    TCP 56 62249→15986 [ACK] Seq=977 Ack=787 Win=407488 Len=0 TSval=1247001303 TSecr=1247001303
167   2.240067    127.0.0.1 -> 127.0.0.1    TCP 56 15986→62249 [FIN, ACK] Seq=787 Ack=977 Win=325664 Len=0 TSval=1247001303 TSecr=1247001303
168   2.240082    127.0.0.1 -> 127.0.0.1    TCP 56 62249→15986 [ACK] Seq=977 Ack=788 Win=407488 Len=0 TSval=1247001303 TSecr=1247001303

It looks like the client (port 62249) ACK's two packets (787, 788) using the same sequence number (977). Is that a correct reading, and if so, is that legitimate TCP?

@othiym23
Contributor Author

othiym23 commented Apr 9, 2015

https://blog.udemy.com/tcp-dupack/

As it stands, this is just more evidence of weirdness in the interaction between CouchDB and Node.

@smikes
Contributor

smikes commented Apr 9, 2015

I have a little libcurl C program that sends the same requests as npm adduser should be sending, except for the very last one (the update). So far I have not been able to trigger the RST. Note that curl uses keepalive by default, and the wireshark dumps of my little script seem to confirm that.

Does this prove anything by itself? Should I try to compare the data sent by curl to the data sent by npm-registry-client, packet-by-packet?

@othiym23
Contributor Author

othiym23 commented Apr 9, 2015

Should I try to compare the data sent by curl to the data sent by npm-registry-client, packet-by-packet?

If you can.

If it helps (it probably doesn't), this is a race of some kind, because the npm-registry-couchapp tests do not fail 100% of the time, and if my machine is chugging because it's been a while since I rebooted, or because I did something that caused a lot of swapping, the tests slow down enough that they fail much less frequently. So I'd try playing with the timing of events in your test client, and maybe have it do some of its requests on concurrent keep-alive connections. I suspect, but was unable to establish to my satisfaction, that part of the problem may be mochiweb getting confused about which responses go to which connections in the case of concurrent clients on the same IP.

@smikes
Contributor

smikes commented Apr 9, 2015

Ah, OK. So the ideal case is to have two curl processes launched back-to-back:

  1. sets user (exercise user-not-present pathway)
  2. attempts to set user, winds up updating (exercise 409 pathway)

That will precisely mimic what's happening in 01-adduser, when it launches two npm adduser commands in a row.

@kanongil

kanongil commented Apr 9, 2015

I'm pretty sure the RST is sent because node sends a packet after the response has been transmitted and a FIN has been sent; the FIN merely indicates that couchdb has called close() on the socket. This also explains the race condition: everything is OK when node transmits the remaining packet data before couchdb responds.

@smikes
Contributor

smikes commented Apr 14, 2015

Okay, here's some more data.

When I run the npm adduser commands through the tap test suite, I get the error. When I run the test suite with a "wait" command, holding the registry open, I can run npm adduser from the command line, using

$  npm --registry=http://127.0.0.1:15986/ --userconf=bad.conf adduser < npm-adduser.in 

This also fails in the same way, with an extra packet sent from node to couch, followed by an RST sent from couch. So the failure appears to be in the way node behaves with the keepalive socket:

first, in that it writes to the socket after it is FIN/ACK'd
second, in that it ignores the RST sent by the server

As far as which commits introduced this into node, I am currently looking at: libuv/libuv@e19089f

and trying a locally built IOJS without the out-of-band support that was added for 1.4.2, 1.4.3 on Darwin.

@othiym23 can you reproduce this bug with io.js on other, non-Darwin platforms?

@smikes
Contributor

smikes commented Apr 14, 2015

No, that out-of-band stuff is fine -- or rather, removing it does not remove the problem.

@smikes
Contributor

smikes commented Apr 14, 2015

I believe npm/npm-registry-client#107 fixes this. ;-)

@smikes
Contributor

smikes commented Apr 14, 2015

Long writeup:

As usual, a complicated bug stems from multiple causes. In this case there are at least two underlying bugs.

First, npm-registry-client/lib/adduser.js sends { body: userobj } with every request it makes to the registry. This is fine when the request is a PUT (as it is for the first and third requests) but incorrect when the request is a GET. This bug has been present since Oct 31, 2014 but has been harmless until now, because writing to a dead socket is allowed and silently ignored. (Bug was introduced by this commit: npm/npm-registry-client@29a21d0^1...29a21d0#diff-ca9f3ace6b674b53393b3d9678233b68R64 )
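A hypothetical reduction of that bug (illustrative names, not the actual npm-registry-client internals): the same params object, including the body, is reused for the GET, so the agent queues body bytes the server never asked for. The fix is to attach a body only for methods that carry a payload:

```javascript
// Buggy behavior would be `req.body = params.body` unconditionally;
// the corrected version only attaches a body for payload-bearing methods.
function buildRequest(method, params) {
  var req = { method: method, uri: params.uri };
  if (method === 'PUT' || method === 'POST') {
    req.body = params.body;
  }
  return req;
}

var params = { uri: '/-/user/org.couchdb.user:user', body: { name: 'user' } };
var put = buildRequest('PUT', params);
var get = buildRequest('GET', params);
```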

Second, there is at least one bug, possibly more, in the new http-keepalive code added in io.js and node >= 0.12. The first problem is recycling a socket when there are still writes pending against it. Here the problem was that the HTTP agent sent the GET request, then followed it by writing the {body} specified by npm-registry-client. However, before the write was executed, the request had been received, processed, and responded to, so the http-keepalive code in iojs/lib/_http_agent.js placed the socket on the freeSocket list. Then the next request was queued, taking the socket back off of the freeSocket list.

The code that initializes a socket for HTTP use in iojs/lib/_http_client.js uses a one-tick delay. In this case, I was able to see the write coming directly in between the first call (onSocket) and the one-tick-delayed call (tickOnSocket):

HTTP 72584: agent.on(free) 127.0.0.1:15986::
HTTP 72584: removeSocket 127.0.0.1:15986:: destroyed: false
HTTP 72584: have free socket
HTTP 72584: clean up free socket queue

[1] HTTP 72584: onSocket 

[2] HTTP 72584: write wrote = {"_id":"org.couchdb.user:user","name":"user","password":"pass@:\\'","email":"[email protected]","type":"user","roles":[],"date":"2015-04-14T18:49:04.598Z","_rev":"1-0ae8131bcafaf125da4424dfcc3fe7d8","password_scheme":"pbkdf2","iterations":10,"derived_key":"f1733252f704715c9d817859085c206be1c5c77f","salt":"6b76129f88db291761ac2c44f73caf2f"}

HTTP 72584: write ret = false
HTTP 72584: outgoing message end.

[3] HTTP 72584: tickOnSocket  18

[4] NET 72584: onread -54

NET 72584: nread = -54, destroying..
NET 72584: destroy
NET 72584: close
NET 72584: close handle
HTTP 72584: SOCKET ERROR: read ECONNRESET Error: read ECONNRESET
    at exports._errnoException (util.js:749:11)
    at TCP.onread (net.js:528:26)

The first few lines of the log show the socket being cleaned up and discarded (returned to the pool, for a keepalive socket). Then the socket is queued for configuration in step [1] (onSocket). Next, the previously queued write is processed on the socket [2]. On the next tick the socket is available to actually be configured by the code in tickOnSocket [3]. Meanwhile, the other end of the connection has received garbage data (the body data associated with the previous GET request), which causes it to send an RST packet (TCP Reset). The RST is not handled in the iojs networking layer, which causes it to propagate up into the HTTP and application layers.

Eventually the npm-registry-client code times out, declares an error, and retries the request. Because there is no longer a write pending, the retry succeeds.
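The sequence above can be modeled with a deliberately simplified, synchronous sketch (NOT the real _http_agent.js): a socket with a write still queued is returned to the free list, so the stray bytes land at the head of the next request on that socket.

```javascript
// Minimal model of a socket with a write queue; sent[] is what the
// server would actually see on the wire.
function FakeSocket() {
  this.pendingWrites = [];  // bytes queued but not yet flushed
  this.sent = [];           // bytes delivered to the peer
}
FakeSocket.prototype.write = function (data) { this.pendingWrites.push(data); };
FakeSocket.prototype.flush = function () {
  this.sent = this.sent.concat(this.pendingWrites);
  this.pendingWrites = [];
};

var freeSockets = [];

// Request 1: a GET that (incorrectly) has a body queued behind it.
var socket = new FakeSocket();
socket.write('GET /-/user/example HTTP/1.1\r\n\r\n');
socket.flush();                                 // headers go out; fast response arrives
socket.write('{"_id":"org.couchdb.user:user"}'); // stray body, still queued
freeSockets.push(socket);                       // bug: recycled with a write pending

// Request 2: reuses the socket; the stray body is flushed first, the
// server sees garbage mid-stream and answers with an RST.
var reused = freeSockets.pop();
reused.write('PUT /-/user/example HTTP/1.1\r\n\r\n');
reused.flush();
```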

The error (or errors) in the iojs networking / keepalive code are:

  1. allowing a write to a socket that has been recycled via keepalive
  2. not handling the RST on the HTTP keepalive socket

Unfortunately I don't have any ideas for how to fix those easily. If low-level uv primitives can tell whether writes are pending on a socket — or, even better, cancel all writes pending on a socket — then point 1 may be fixable.

Regarding point 2, the meaning of a TCP reset is well defined on an HTTP keepalive socket -- it means: re-initialize your listener and return this socket to the pool. Unfortunately the error is being detected and handled in net.js while the HTTP keepalive code lives in _http_agent.js, so it's hard to see how to handle this without mixing the concerns of simple TCP sockets and HTTP keepalive sockets.

However the iojs issues are addressed, though, the trigger can be removed from npm-registry-client by patching npm/npm-registry-client#107, which will remove this test failure from npm's integration tests.

@othiym23
Contributor Author

sam u r the best 😘

@othiym23
Contributor Author

@chrisdickinson and @indutny, you might find #7699 (comment) interesting. Also, @janl, I think this lets CouchDB off the hook. I'm going to test and potentially land this patch now.

@othiym23
Contributor Author

npm/npm-registry-client@856eefe contains the fix for this, and 431c3bf includes the fixed version of npm-registry-client. Yay! I am so glad this is fixed! THANK YOU SAM

@janl
Contributor

janl commented Apr 17, 2015

Glad this is resolved :)

Impeccable sleuthing @smikes!

@othiym23 let me know if anything else seems fishy on the CouchDB end.
