Skip to content

bpo-25942: Added new parameter to subprocess.run(): cleanup_timeout#4283

Closed
stuarteberg wants to merge 11 commits intopython:masterfrom
stuarteberg:fix-bpo-25942-subprocess-call
Closed

bpo-25942: Added new parameter to subprocess.run(): cleanup_timeout#4283
stuarteberg wants to merge 11 commits intopython:masterfrom
stuarteberg:fix-bpo-25942-subprocess-call

Conversation

@stuarteberg
Copy link
Contributor

@stuarteberg stuarteberg commented Nov 4, 2017

EDIT: This PR was ultimately superseded by #5026


Summary:

subprocess.call() (and now also subprocess.run()) behave badly in the event of SIGINT, i.e. a user hitting Ctrl+C while the subprocess is running. This was first reported by @pilcrow in [bpo-25942]1.

Several possible solutions were discussed (by @vadmium, @pilcrow, and @Haypo), but no action was taken. Among them was this suggestion from @Haypo:

If you really want to give time to the child process, I suggest to add a new optional parameter . For example ctrlc_timeout=5.0

IMHO, that is the best option, and certainly much better than the status quo. I've implemented that solution in this PR. (It is also similar to a proposal by @vadmium in that thread.)

Background: POSIX behavior for Ctrl+C

In general, when the user hits Ctrl+C, the OS sends SIGINT to all processes in the foreground process group, including any child processes launched via subprocess call(). That is, both Python and it's child will automatically see the SIGINT; there is no need to explicitly send a duplicate SIGINT to the child.

What this PR does:

Upon receiving the SIGINT in the form of a KeyboardInterrupt, call() will wait again for the child process, but with cleanup_timeout, giving the child process time to perform cleanup actions. If that timeout expires, then the child process is killed. (Alternatively, during this time, the user can hit Ctrl+C again to immediately stop waiting and kill the child process.)

What this PR does not do:

This PR attempts to make subprocess.call() behave in a less surprising manner in the context of interactive terminal applications, in which the user is likely to expect Ctrl+C to work properly. If the child process does not handle SIGINT properly, we make no attempt to kill it via alternative means, such as SIGTERM. If the developer needs such control over how the child is to be killed in the event of a KeyboardInterrupt, she should use the Popen API directly instead of calling subprocess.call() or run().

Open question 1:

It is not entirely clear what the default value of cleanup_timeout should be. Setting it to 0 by default seems to defeat the purpose and wouldn't help naïve users write code that "just works".

On the other hand, setting it to an arbitrary value like 5.0 seems... arbitrary. If the cleanup takes a long time, then we'd be killing it too soon. But if the child doesn't handle SIGINT anyway, then we'd be wasting 5 seconds for no reason (unless the user hits Ctrl+C a second time).

For now, I've set the default value as follows:

  • If timeout was specified, use that. The child cleanup time must be less than its total expected runtime.
  • If timeout was not specified, then use 1.0. If the child process handles SIGINT, this introduces no extra delay. If it does not handle SIGINT, then 1 second seems like a reasonable delay for what is likely a rare event an interactive console application.

Open question 2:

According to the docs, subprocess.call() is equivalent to subprocess.run().returncode. Yet, these are currently implemented as independent functions (and I left it that way). Is there any objection to replacing subprocess.call() with the following?

def call(*popenargs, timeout=None, cleanup_timeout=None, **kwargs):
    return run(*popenargs, timeout=timeout, cleanup_timeout=cleanup_timeout, **kwargs).returncode

(That would also allow me to remove the mostly duplicated test code in this PR.)

https://bugs.python.org/issue25942

@stuarteberg stuarteberg requested a review from gpshead as a code owner November 4, 2017 23:24
@stuarteberg stuarteberg changed the title bpo-25942: Added new parameter timeout_after_sigint to run() and call() bpo-25942: Added new parameter timeout_after_sigint to subprocess.run() and call() Nov 4, 2017
@gpshead gpshead self-assigned this Nov 5, 2017
@gpshead
Copy link
Member

gpshead commented Nov 5, 2017

for Q1: When timeout_after_sigint is unspecified (None) I think we can do better than your current implementation which just sets it to the full timeout value. Compute the time remaining within timeout at the time of the KeyboardInterrupt and use that as the second wait timer.

For Q2: Refactoring to simplify things is good. run is relatively new, if call can be reworked to be implemented on top of run, great!

I also think we should not mention SIGINT in the documentation and docstrings, this behavior applies equally well to Windows. Reword it to mention KeyboardInterrupt instead as that is accurately how we implement it; the platform may not have signals and the user may cause KeyboardInterrupt to not be raised or to be raised by another means.

@stuarteberg stuarteberg force-pushed the fix-bpo-25942-subprocess-call branch 2 times, most recently from f2b80d2 to 0a1687c Compare November 5, 2017 14:15
@stuarteberg stuarteberg force-pushed the fix-bpo-25942-subprocess-call branch from 0a1687c to 6d3035b Compare November 5, 2017 16:34
@stuarteberg stuarteberg changed the title bpo-25942: Added new parameter timeout_after_sigint to subprocess.run() and call() bpo-25942: Added new parameter to subprocess.run(): cleanup_timeout Nov 5, 2017
@stuarteberg stuarteberg force-pushed the fix-bpo-25942-subprocess-call branch 7 times, most recently from 7cffd2d to 00f2ad2 Compare November 5, 2017 18:22
@stuarteberg
Copy link
Contributor Author

stuarteberg commented Nov 5, 2017

Thanks very much for the feedback, @gpshead. I have updated the PR to address each of your suggestions:

  • The code and docs mention KeyboardInterrupt, not SIGINT
  • Accordingly, the new parameter has been renamed to cleanup_timeout
  • By default, the cleanup timeout period merely consumes any remaining time from the original timeout
  • call(...) is now a simple wrapper for run(...).returncode

BTW, I believe that the new functionality is compatible with Windows, but unfortunately I can't figure out how to test it. The new test case relies on os.killpg() (to simulate Ctrl+C), which is not available on Windows. (It also requires setpgid() to ensure that killpg() doesn't kill the test suite's parent/sibling processes.)

If you have any ideas for achieving windows coverage in the new test, I'm all ears.

@stuarteberg stuarteberg force-pushed the fix-bpo-25942-subprocess-call branch 4 times, most recently from dc8199b to bf3efdd Compare November 5, 2017 19:39
In the event of a KeyboardInterrupt, the child process is given time to clean up before it is killed by the parent.
@stuarteberg stuarteberg force-pushed the fix-bpo-25942-subprocess-call branch from bf3efdd to bebd8d3 Compare November 5, 2017 19:50
Example:
retcode = call(["ls", "-l"])
"""
return run(*popenargs, timeout=timeout, cleanup_timeout=None, **kwargs).returncode
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps you need to test that cleanup_timeout is respected here :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oof. Now I've fixed the copy/paste mistake.

" time.sleep(3.0)\n"
"except KeyboardInterrupt:\n"
" time.sleep(0.1)\n"
f" open('{tf.name}', 'w').write('cleaned up')\n"],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Close the file to ensure the output is flushed

Copy link
Contributor Author

@stuarteberg stuarteberg Nov 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought, in CPython at least, that the file is guaranteed to be closed during __del__. But you're right: "explicit is better than implicit". :-) Now fixed.

tf.close()

try:
with self.assertRaises(KeyboardInterrupt) as ex:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ex apparently not used

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch. Removed.


try:
with self.assertRaises(KeyboardInterrupt) as ex:
threading.Thread(target=interrupt_this_process_group).start()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thread is not joined, so could leak to other tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, now joined.

*cleanup_timeout* is non-zero, the child process is given additional
time to finish before it is killed, to allow potential clean-up operations
in the child to complete. During this time, a second :exc:`KeyboardInterrupt`
will kill the child immediately. If *cleanup_timeout* is not specified,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current implementation looks like it will leave the child running, rather than killing it. Opportunity for another test case :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ouch, that's embarrassing. Thanks for catching it. I've pushed a new commit which fixes that... with a test case. ;-)

process.wait(timeout=cleanup_timeout)
raise
except TimeoutExpired:
raise KeyboardInterrupt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it important to change the exception here? If so, why make a new KeyboardInterrupt exception rather than reraising the original one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it important to change the exception here?

Yes, definitely. The user hit Ctrl+C, so the expected behavior is to raise KeyboardInterrupt.

If so, why make a new KeyboardInterrupt exception rather than reraising the original one?

Yeah, I asked myself this question, too. I figured this was the clearest thing to write, and I also like the fact that, thanks to Python 3's automatic exception chaining, power-user could read the traceback carefully and see that the subprocess cleanup code timed out.

Anyway, I don't feel very strongly about it. Open to suggestions.


If the current process recevies a :exc:`KeyboardInterrupt` and
*cleanup_timeout* is non-zero, the child process is given additional
time to finish before it is killed, to allow potential clean-up operations
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleanup_timeout is the additional duration in seconds. What objects are supported: int, float, any real number?

@vadmium
Copy link
Member

vadmium commented Nov 6, 2017

Why limit this to KeyboardInterrupt? A custom signal handler may raise any exception.

process.wait()
raise KeyboardInterrupt
except:
print("Killing!")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the print().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops! Done.

@stuarteberg
Copy link
Contributor Author

Why limit this to KeyboardInterrupt? A custom signal handler may raise any exception.

IIUC, SIGINT (and CTRL_C_EVENT on Windows) is special in that the terminal automatically sends it to all processes in the foreground group. Therefore, when the parent receives a KeyboardInterrupt, it is reasonable to expect that the child has received it too, and that the child is already in the process of cleaning up.

Other signals, such as SIGTERM are not automatically sent to the children, so the default behavior of run() can't really guess what the right behavior should be. The child probably hasn't seen the signal, so there's no reason to wait for it to quickly clean up.

Under this proposal, developers who stick to the standard convention (i.e. Ctrl+C means KeyboardInterrupt) will end up with code that "just works". If they need to deviate from that convention, then we can't really guess what to do. In that case, developers are free to use the Popen API directly, and write their own exception handling logic.

@stuarteberg
Copy link
Contributor Author

Thanks, @vadmium, for the thorough review! I hope I've addressed all of the issues you've spotted. If so, I'll go ahead and squash my new commits to keep the git history tidy.

Copy link
Member

@vstinner vstinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dislike the idea of blocking 1 second on CTRL+c by default. I prefer the current behaviour.

operations in the child to complete. During this time, a second KeyboardInterrupt
will kill the child immediately. If cleanup_timeout is not specified,
then any remaining time from the original timeout is used, or 1.0 if no
original timeout was specified.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dislike a default timeout of 1 second. We should not hardcode arbitrary delay in the subprocess module, it's too hard to guess the "best" value. I prefer to keep the current behaviour if timeout is not set: kill immediately the process.

stdout, stderr = process.communicate()
raise TimeoutExpired(process.args, timeout, output=stdout,
stderr=stderr)
except KeyboardInterrupt:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dislike the idea of having a special case only for KeyboardInterrupt. I would prefer to put this code in the bare "except:" case.

process.kill()
process.wait()
raise
try:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose to close stdin, stdout and stderr here to notify the child process that we want it to stop.

Otherwise, the program can hang on writing into stdout if the stdout pipe is full.

raise
try:
process.wait(timeout=cleanup_timeout)
raise
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to call "raise" in a new "else:" block.

# Change the current process group to avoid
# interrupting our parent process.
original_pgid = os.getpgrp()
os.setpgid(0, 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hum, I dislike playing with process groups in an unit test. I would prefer to only send SIGINT directly to processes using their pid.

@bedevere-bot
Copy link

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

…t and child,

rather than playing with process groups. (As requested in review.)
@stuarteberg
Copy link
Contributor Author

Hi @Haypo, thanks very much for the detailed review. I have pushed additional commits (I'll squash them later) to address all of your requests, with one exception. (See my plea below.)

  • In the test case, Ctrl+C is simulated without playing with the process group ID.
  • The cleanup code now applies to all exceptions, not just KeyboardInterrupt.
    (@vadmium also suggested this, FWIW)
  • The child's stdin, stdout, stderr are closed during cleanup.
  • If no cleanup_timeout or timeout is provided, the default cleanup_timeout is now 0.0 for all errors -- except for KeyboardInterrupt, which gets a default of 1.0 second.

Here is my plea regarding the last point. Sorry if this is too long; I don't mean to waste your time, but I do think this will affect users.

I can see why a constant like 1.0 rightly triggers your "spider sense". (Mine, too!) But I think in this case it's the lesser of two evils.

Consider the following bash script excerpt:

git clone https://github.com/python/cpython

If the user hits Ctrl+C during the clone operation, git (being user-friendly) will delete the incomplete cpython local directory, and the user is not left with invalid garbage on her hard disk.

What if the script author prefers Python over bash, but omits the timeout?

import subprocess
subprocess.call(['git', 'clone', 'https://github.com/python/cpython'])

Now what does Ctrl+C do? Using Python 2, it will still "just work", like the bash command did. But using Python 3.6, git is killed before it can clean up, and the local cpython repo remains on the hard disk, in an invalid state.

Ideally, authors would be disciplined, and always specify a timeout. But after briefly hunting for examples of call() and check_call() in the wild, it appears that most people do not bother specifying a timeout, at least when using the convenience API. Therefore, it would be good if the convenience API gives at least some chance of correct behavior.

Regarding real-world examples: It's difficult to google efficiently for uses of call, check_call, or run. But the few cases I can find do not bother with timeout parameters. Yet, they would likely benefit from a non-zero cleanup_timeout by default. For example, it looks to me like django's mysql and sqlite3 backend clients both use check_call() without a timeout. I'm not familiar with that code base, but AFAICT it looks broken if the user hits Ctrl+C. (Yet, they do seem to expect their users to use Ctrl+C: See the postgresql client, for example.)

There are more examples than that, but I'm not prepared to perform a quantitative analysis. :-)

Lastly (and sorry if this is too "cute"): The Zen of Python states:

In the face of ambiguity, refuse the temptation to guess.

BUT (!) it also declares:

Although practicality beats purity.

😉

Anyway, that's my opinion. If I'm out-numbered (or simply overruled), I'll remove the 1.0 constant, and apologize for the noise.

PS -- I suppose we can also add a strong warning to the run() docs about the dangers of using Ctrl+C in a program that uses the run()/call()/check_call()/check_output() convenience API. I guess we should also document the difference between the Python 2 and Python 3 behavior.

@stuarteberg
Copy link
Contributor Author

I have made the requested changes; please review again. (See my notes above.)

@bedevere-bot
Copy link

Thanks for making the requested changes!

@Haypo: please review the changes made to this pull request.

- Cleanup child in response to all exceptions, not just KeyboardInterrupt
- Close child stdin, stdout, stderr during cleanup
- Slightly simplify control flow
- Reduce default cleanup timeout to 0.0 for all exceptions except KeyboardInterrupt
  (for review: more discussion needed)
@stuarteberg stuarteberg force-pushed the fix-bpo-25942-subprocess-call branch from f0d8550 to 220b40f Compare November 7, 2017 15:03
# If the child is well-written (exits in response to SIGINT),
# a 1 second timeout is usually enough time for it to clean up.
# Othwerwise, 1 second seems like a tolerable delay in response to Ctrl+C,
# and a resonable price to pay for correct behavior in the well-written case.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You cannot assume that a child will get SIGINT, this is true only for a foreground process in the shell. The parent process can be a daemon and nobody will generate SIGINT for its children.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Disclaimer: I cannot claim to be an expert in POSIX conventions, so forgive me if you have expertise in this area and know the following to be incorrect.)

If I understand correctly, the primary distinction betweenSIGTERM and SIGINT is that the latter is specifically intended to be used in interactive terminal programs. For instance, that's why a POSIX-compliant shell automatically forwards SIGINT to all processes in the foreground group. (The same goes for the windows equivalent, CTRL_C_EVENT.) Of course, that's also why Python chose to name it's SIGINT-triggered exception KeyboardInterrupt.

It is reasonable to assume that receiving aSIGINT is a strong clue to the program that it is running in the foreground process group, and therefore its child is, too.

IMHO, programmers who would send a SIGINT programmatically shouldn't be surprised by slightly odd behavior. But the good news is that the "odd behavior" in this case is quite benign: it's merely a 1 second delay.

On the other hand, a default cleanup timeout of 0.0 would mean that the child is killed prematurely, and the user experiences incorrect/incomplete cleanup behavior when hitting Ctrl+C. That's the situation Python 3 users are in today.

But what about a default cleanup delay of 1.0 second? If the child responds to SIGINT then the user experiences no extra delay. But for those (very few) children that don't actually respond to SIGINT, the consequences are relatively minor. The user would merely have to do one of the following:

  1. Hit Ctrl+C again, OR
  2. Do nothing... for 1 second.

The idea with this PR is to help Python programmers write code that "just works" in the default case. After all, the run() function is a convenience API. It's meant to help avoid boilerplate and spare the user from the details of the lower-level Popen API, if they don't want to think about those details. But right now, Python does the wrong thing by default, which is very un-Pythonic, if you ask me. :-P

(BTW, in Python 2, the behavior wasn't perfect, but it happened to be better-behaved in this one instance.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You cannot assume that a child will get SIGINT, this is true only for a foreground process in the shell.

Question: After thinking about your comment some more, I'm now wondering: What do you think about explicitly sending SIGINT to the child, assuming we're handling a KeyboardInterrupt? (For other exceptions, we would send SIGTERM, as you suggested below.)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should TERM (or, worse, KILL) it, not INTerrupt it.

Firstly, it is reasonable to assume the child already got the INT. More importantly, a TERM better fits the longstanding semantic conventions of signal use here.

While it's true that a SIGINT could be generated by some circumstance other than a terminal driver catching Ctrl-C and signaling the whole foreground process group (not individual PIDs), in practice that would be an exceedingly rare violation of longstanding convention. POSIX refers to SIGINT as the "terminal interrupt signal", for example. Indeed, the fact that python by default turns a SIGINT into a KeyboardInterrupt tells you something about its expected use.

I don't see benefit in second-guessing system behavior when an INT rolls in. If we're going to terminate the child, TERM the child — and please let devs like me suppress that behavior entirely. :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Command lines tools are used both interactively and run by other programs. They are expected to handle both SIGINT and SIGTERM, and typically both are handled in the same way.

We can assume that a child process is handling both signals. If you run a program that needs special treatment, use Popen directly instead of the high level easy to use call() or run().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are expected to handle both SIGINT and SIGTERM, and typically both are handled in the same way.

I completely agree that they should. But as Python programmers, it would be particularly odd for us to assume that they do. After all, tools written in Python itself will respond differently for SIGINT and SIGTERM -- unless the author was kind enough to register a handler for SIGTERM. Most authors do not bother.

For example, on a whim, I just checked a few popular utilities that are implemented in Python. Frankly, these results shocked me:

  • httpie handlesSIGINT gracefully, but not SIGTERM. (I expected better from a project project that has 32,518 stars on github!)
  • Another example seems to be the boto suite of AWS utilities. (I didn't verify via experiment, I can find no mention of SIGTERM in it's source code.)
  • Holy crap, you guys aren't gonna believe this: Even pip doesn't handle SIGTERM gracefully. Apparently, if you send it SIGTERM, it does no cleanup at all -- you just end up with a partially installed python package (garbage, in other words). I'm legitimately surprised!

Again, I completely agree regarding what the proper POSIX conventions ought to be. But now we have to decide: we can either be POSIX purists, or be considerate to the typical user. I can see arguments in favor of either, but we should acknowledge the trade-off we're making.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The subprocess module is not the place to fix bad signal handling in other programs.

Maybe send a PR to httpie/pip?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for making a PR to those projects. But of course there will always be others with the same issue, and we can't fix all of them.

I'm just wondering if we have an opportunity to apply the Zen of Python maxim: "practicality beats purity". I'm honestly not sure. But doing the most "friendly" thing (if we can figure out what that is) seems spiritually compatible with a convenience API like run(), even if it's less pure -- or maybe even a little ugly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot fix the issue in subprocess. For example we can have a correct program that block SIGINT when running without a controlling terminal. The only way to terminate the program is SIGTERM. We must support such program by sending SIGTERM. If another program misbehaves by not handling SIGTERM, the program author should fix the program.

process.kill()
process.wait()
raise ex

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need much simple solution.

except:
    if kill_timeout is not None:
        process.terminate()
        try:
            process.wait(timeout=kill_timoeut)
         except:
             pass
    process.kill()
    process.wait()
    raise

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Something like kill_timeout=None is the key here, letting the developer suppress the (in my view misguided) SIGKILL. That lets someone with my crusty POSIX sensibilities use an easy subprocess abstraction, and backport it to earlier python.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regardless of the default timeout value, I do like this proposal to explicitly terminate() the child, rather than merely hoping it will just exit gracefully on its own after we close its stdin/stdout.

One open question is whether or not it is bad to send a SIGTERM to a process that has (probably) already received a SIGINT.

For example, if the child is a Python program, then the default behavior of SIGTERM is to end the process without any cleanup. If it has already received a SIGINT, then the process will likely end during cleanup.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the perspective of the child process, a TERM is at worst equivalent to a KILL, and possibly much more polite, since it can be caught. So, if we are committed to killing the child process, I'd encourage a TERM first.

(It's tempting to full-on generalize this and allow something like timeout_handler=some_func_accepting_the_process_as_arg or err_kill=( (0.5, SIGTERM), (0.5, SIGQUIT), (1.0, SIGKILL) ) Those args are just illustrative, but you get the idea.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd encourage a TERM first.

👍

It's tempting to full-on generalize this

Agree with the sentiment. I also completely agree with what you implied by using the word "tempting" (instead of, say, "justified" 😄 ). The run() API is meant for easy, general usage, with reasonable default behavior. If a programmer cares about nuanced error-handling, she should just use the Popen API.

@@ -0,0 +1 @@
Added new parameter, ``cleanup_timeout``, to ``subprocess.run()``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kill_timeout may be more clear, the time we give child process to terminate before we kill it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ambivalent regarding the parameter name.

to allow potential clean-up operations in the child to complete.
During this time, a second exception will kill the child immediately.
If *cleanup_timeout* is not specified, then any remaining time from the original
*timeout* is used, or 1 second if no original *timeout* was specified.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too complicated.

If cleanup_timeout is not specified, then a default value is chosen:
If timeout was provided, then any remaining time from the original timeout
is used. Otherwise, if the exception is a KeyboardInterrupt, the child is
given 1 second to clean up. Otherwise, the child is killed immediately.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If cleanup timeout is not specified, there should be no cleanup timeout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If cleanup timeout is not specified, there should be no cleanup timeout.

I assume you chose that phrasing to be pointedly simple, but, unfortunately, I think your phrasing is not as simple as it appears. In the subprocess API and docs, "no timeout" (i.e. timeout=None) actually means "timeout=infinity". That's probably not what you mean.

I think you meant to propose: "If cleanup timeout is not specified, there should be a cleanup timeout of 0.0".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stuarteberg I agree that "no cleanup timeout" is not clear enough, but it is not the same as cleanup timeout of 0 seconds.

How about:

If kill_timeout is not specified, the child process will be killed immediately when the timeout expire. If specified, the child process will be terminated gracefully using SIGTERM. If the process did not terminate within kill_timeout seconds, it will be killed.

This should be modified if graceful termination should be the default behavior. I'm not sure what should be the default, but this can be left for another PR.

@gpshead
Copy link
Member

gpshead commented Jan 30, 2018

Thank you for this PR and discussion, it was informative and got me thinking about the problem. The behavior change in 3.3-3.6 was unintentional.

I'm closing this PR for now as I'd rather not see a cleanup_timeout or similar parameter if at all possible. My PR to attempt to restore sane default behavior of just giving a little time before resorting to SIGKILL by default is in. I'd appreciate it if people would try out the 3.7 betas and report back on how that behaves for them. We can proceed from there if more changes seem warranted.

@gpshead gpshead closed this Jan 30, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants