
Feature request: a means to refuse subsequent TCP connections while allowing current connections enough time to drain #2920

Closed as not planned
@rosenhouse

Description

update

We've edited this issue to be less prescriptive about the solution. It now presents the three possible approaches we can see.

summary

Given I've configured Envoy with LDS serving a TCP proxy listener on some port
and there are connections in flight
I would like a way to refuse subsequent TCP connections to that port while allowing current established connections to drain

We tried the following approaches, but none of them achieve our goals:

  1. the LDS is updated to remove the listener
  2. Envoy is signalled with a SIGTERM
  3. Send a GET request to /healthcheck/fail

steps to reproduce

write a bootstrap.yaml like

---
admin:
  access_log_path: /tmp/admin_access.log
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9901

node:
  id: some-envoy-node
  cluster: some-envoy-cluster

dynamic_resources:
  lds_config:
    path: /cfg/lds-current.yaml

static_resources:
  clusters:
  - name: example_cluster
    connect_timeout: 0.25s
    type: STATIC
    lb_policy: ROUND_ROBIN
    hosts:
    - socket_address:
        address: 93.184.216.3    # IP address of example.com
        port_value: 80

write a lds-current.yaml file like

version_info: "0"
resources:
- "@type": type.googleapis.com/envoy.api.v2.Listener
  name: listener_0
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 8080
  filter_chains:
    - filters:
        - name: envoy.tcp_proxy
          config:
            stat_prefix: ingress_tcp
            cluster: example_cluster

launch envoy (I'm using v1.6.0)

envoy -c /cfg/bootstrap.yaml --v2-config-only --drain-time-s 30

confirm that the TCP proxy is working

curl -v -H 'Host: example.com' 127.0.0.1:8080

Possible approach 1: remove the listener

update the LDS to return an empty set of listeners. this is a two-step process: first, write an empty LDS response file, lds-empty.yaml

version_info: "1"
resources: []

second, move that file on top of the file being watched:

mv lds-empty.yaml lds-current.yaml
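Envoy's filesystem-based LDS watches the path for move events, so the replacement file should land atomically rather than be written in place. A minimal sketch of the same two-step update in Python (hypothetical helper, not part of Envoy), using `os.replace`, which is an atomic rename on POSIX when both paths are on the same filesystem:

```python
import os

def update_lds(watched_path: str, new_contents: str) -> None:
    """Atomically replace the file Envoy is watching.

    Writing the new config to a sibling temp file and renaming it over
    the watched path mirrors the `mv lds-empty.yaml lds-current.yaml`
    step: the watcher sees a single move event, never a half-written file.
    """
    tmp_path = watched_path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(new_contents)
        f.flush()
        os.fsync(f.fileno())            # ensure the bytes hit disk first
    os.replace(tmp_path, watched_path)  # atomic rename on POSIX
```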

in the Envoy stdout logs you'll see a line

source/server/lds_api.cc:68] lds: remove listener 'listener_0'

attempt to connect to the port where the listener used to be:

curl -v -H 'Host: example.com' 127.0.0.1:8080
expected behavior

Would like to see all new TCP connections be refused immediately, as if a listener had never been added in the first place. Existing TCP connections should continue to be serviced.
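The requested behavior is just ordinary BSD-socket semantics: once the listening socket is closed, the kernel refuses new connections to that port while already-accepted connections keep flowing. A minimal Python sketch (illustrative only, not Envoy code) of that expectation:

```python
import socket

# Open a listener on an ephemeral port and accept one connection.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen()
port = listener.getsockname()[1]

client = socket.create_connection(("127.0.0.1", port))
conn, _ = listener.accept()

# Closing the listener is the behavior the issue asks for:
# the port stops accepting, but the established pair still works.
listener.close()

conn.sendall(b"still draining")
assert client.recv(1024) == b"still draining"

# A brand-new connection is now refused immediately.
try:
    socket.create_connection(("127.0.0.1", port), timeout=1)
    refused = False
except ConnectionRefusedError:
    refused = True
assert refused

client.close()
conn.close()
```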

actual behavior

the port is still open, even after the LDS update occurs

lsof -i
COMMAND PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
envoy     1 root    9u  IPv4  30166      0t0  TCP *:9901 (LISTEN)
envoy     1 root   22u  IPv4  30171      0t0  TCP *:8080 (LISTEN)

clients can connect to the port, but the TCP proxying seems to hang (can't tell where)

curl -H 'Host: example.com' -v 127.0.0.1:8080
* Rebuilt URL to: 127.0.0.1:8080/
*   Trying 127.0.0.1...
* Connected to 127.0.0.1 (127.0.0.1) port 8080 (#0)
> GET / HTTP/1.1
> Host: example.com
> User-Agent: curl/7.47.0
> Accept: */*
>
^C

this state remains until --drain-time-s time has elapsed (30 seconds in this example). At that point the port is finally closed, so you see

curl 127.0.0.1:8080
curl: (7) Failed to connect to 127.0.0.1 port 8080: Connection refused
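Since the port only starts refusing connections once the drain window elapses, a deploy script can poll for that transition instead of sleeping for the full --drain-time-s. A hypothetical helper (not an Envoy API), sketched in Python:

```python
import socket
import time

def wait_until_refused(host: str, port: int, timeout_s: float = 60.0) -> bool:
    """Poll until connections to (host, port) are refused.

    Returns True once a connect attempt fails with ECONNREFUSED
    (i.e. the listener is finally gone), or False if the port is
    still accepting when timeout_s expires.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            sock = socket.create_connection((host, port), timeout=1)
        except ConnectionRefusedError:
            return True          # the listener is gone; drain complete
        except OSError:
            time.sleep(0.5)      # e.g. connect timeout; try again
            continue
        sock.close()             # port is still accepting connections
        time.sleep(0.5)
    return False
```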

Possible approach 2: pkill -SIGTERM envoy

If instead of removing the listeners we signaled Envoy

pkill -SIGTERM envoy

Envoy exits immediately without allowing current connections to drain

[2018-03-28 17:42:06.995][3563][warning][main] source/server/server.cc:312] caught SIGTERM
[2018-03-28 17:42:06.995][3563][info][main] source/server/server.cc:357] main dispatch loop exited
[2018-03-28 17:42:07.004][3563][info][main] source/server/server.cc:392] exiting

EDITED to remove incorrect bit about listeners staying open after SIGTERM.

Possible approach 3: admin healthcheck fail

We could instead GET /healthcheck/fail to trigger this behavior. As above, we would expect that new TCP connections should be refused while existing TCP connections are serviced.

background

In Cloud Foundry, we have the following setup currently:

       Router         =====>       Envoy    ---->    App
(shared ingress)      (TLS)                 TCP

Each application instance has a sidecar Envoy which terminates TLS connections from the shared ingress router. Applications may not speak HTTP, so we use basic TCP connectivity checks from the shared Router to the Envoy in order to infer application health and determine if a client connection should be load-balanced to that Envoy. When the upstream Envoy accepts the TCP connection, the Router considers that upstream healthy. When the upstream refuses the TCP connection, the Router considers that upstream unhealthy.
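The Router's health inference described above amounts to a plain TCP connectivity probe. A simplified sketch (hypothetical, not the actual Router code) of that check:

```python
import socket

def tcp_healthy(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Infer backend health the way the shared Router does:
    an accepted TCP connection means healthy; a refused (or otherwise
    failed) connection means the backend should be taken out of rotation.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```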

During a graceful shutdown, the scheduler ought to be able to drain the Envoy before terminating the application. This would mean that the Envoy ought to service any in-flight TCP connections without accepting any new ones.

acknowledgements

h/t @jvshahid and @emalm for investigation and edits

Labels: enhancement (feature requests, not bugs or questions), stale (stalebot believes this issue/PR has not been touched recently)