bpo-25942: Added new parameter to subprocess.run(): cleanup_timeout#4283
bpo-25942: Added new parameter to subprocess.run(): cleanup_timeout#4283stuarteberg wants to merge 11 commits intopython:masterfrom
Conversation
|
for Q1: When For Q2: Refactoring to simplify things is good. I also think we should not mention SIGINT in the documentation and docstrings, this behavior applies equally well to Windows. Reword it to mention KeyboardInterrupt instead as that is accurately how we implement it; the platform may not have signals and the user may cause KeyboardInterrupt to not be raised or to be raised by another means. |
f2b80d2 to
0a1687c
Compare
0a1687c to
6d3035b
Compare
7cffd2d to
00f2ad2
Compare
|
Thanks very much for the feedback, @gpshead. I have updated the PR to address each of your suggestions:
BTW, I believe that the new functionality is compatible with Windows, but unfortunately I can't figure out how to test it. The new test case relies on If you have any ideas for achieving windows coverage in the new test, I'm all ears. |
dc8199b to
bf3efdd
Compare
In the event of a KeyboardInterrupt, the child process is given time to clean up before it is killed by the parent.
bf3efdd to
bebd8d3
Compare
Lib/subprocess.py
Outdated
| Example: | ||
| retcode = call(["ls", "-l"]) | ||
| """ | ||
| return run(*popenargs, timeout=timeout, cleanup_timeout=None, **kwargs).returncode |
There was a problem hiding this comment.
Perhaps you need to test that cleanup_timeout is respected here :)
There was a problem hiding this comment.
Oof. Now I've fixed the copy/paste mistake.
Lib/test/test_subprocess.py
Outdated
| " time.sleep(3.0)\n" | ||
| "except KeyboardInterrupt:\n" | ||
| " time.sleep(0.1)\n" | ||
| f" open('{tf.name}', 'w').write('cleaned up')\n"], |
There was a problem hiding this comment.
Close the file to ensure the output is flushed
There was a problem hiding this comment.
I thought, in CPython at least, that the file is guaranteed to be closed during __del__. But you're right: "explicit is better than implicit". :-) Now fixed.
Lib/test/test_subprocess.py
Outdated
| tf.close() | ||
|
|
||
| try: | ||
| with self.assertRaises(KeyboardInterrupt) as ex: |
There was a problem hiding this comment.
Nice catch. Removed.
Lib/test/test_subprocess.py
Outdated
|
|
||
| try: | ||
| with self.assertRaises(KeyboardInterrupt) as ex: | ||
| threading.Thread(target=interrupt_this_process_group).start() |
There was a problem hiding this comment.
Thread is not joined, so could leak to other tests
Doc/library/subprocess.rst
Outdated
| *cleanup_timeout* is non-zero, the child process is given additional | ||
| time to finish before it is killed, to allow potential clean-up operations | ||
| in the child to complete. During this time, a second :exc:`KeyboardInterrupt` | ||
| will kill the child immediately. If *cleanup_timeout* is not specified, |
There was a problem hiding this comment.
Current implementation looks like it will leave the child running, rather than killing it. Opportunity for another test case :)
There was a problem hiding this comment.
Ouch, that's embarrassing. Thanks for catching it. I've pushed a new commit which fixes that... with a test case. ;-)
Lib/subprocess.py
Outdated
| process.wait(timeout=cleanup_timeout) | ||
| raise | ||
| except TimeoutExpired: | ||
| raise KeyboardInterrupt |
There was a problem hiding this comment.
Is it important to change the exception here? If so, why make a new KeyboardInterrupt exception rather than reraising the original one?
There was a problem hiding this comment.
Is it important to change the exception here?
Yes, definitely. The user hit Ctrl+C, so the expected behavior is to raise KeyboardInterrupt.
If so, why make a new KeyboardInterrupt exception rather than reraising the original one?
Yeah, I asked myself this question, too. I figured this was the clearest thing to write, and I also like the fact that, thanks to Python 3's automatic exception chaining, power-user could read the traceback carefully and see that the subprocess cleanup code timed out.
Anyway, I don't feel very strongly about it. Open to suggestions.
Doc/library/subprocess.rst
Outdated
|
|
||
| If the current process recevies a :exc:`KeyboardInterrupt` and | ||
| *cleanup_timeout* is non-zero, the child process is given additional | ||
| time to finish before it is killed, to allow potential clean-up operations |
There was a problem hiding this comment.
Cleanup_timeout is the additional duration in seconds. What objects are supported: int, float, any real number?
|
Why limit this to KeyboardInterrupt? A custom signal handler may raise any exception. |
Lib/subprocess.py
Outdated
| process.wait() | ||
| raise KeyboardInterrupt | ||
| except: | ||
| print("Killing!") |
IIUC, Other signals, such as Under this proposal, developers who stick to the standard convention (i.e. |
|
Thanks, @vadmium, for the thorough review! I hope I've addressed all of the issues you've spotted. If so, I'll go ahead and squash my new commits to keep the git history tidy. |
vstinner
left a comment
There was a problem hiding this comment.
I dislike the idea of blocking 1 second on CTRL+c by default. I prefer the current behaviour.
Lib/subprocess.py
Outdated
| operations in the child to complete. During this time, a second KeyboardInterrupt | ||
| will kill the child immediately. If cleanup_timeout is not specified, | ||
| then any remaining time from the original timeout is used, or 1.0 if no | ||
| original timeout was specified. |
There was a problem hiding this comment.
I dislike a default timeout of 1 second. We should not hardcode arbitrary delay in the subprocess module, it's too hard to guess the "best" value. I prefer to keep the current behaviour if timeout is not set: kill immediately the process.
Lib/subprocess.py
Outdated
| stdout, stderr = process.communicate() | ||
| raise TimeoutExpired(process.args, timeout, output=stdout, | ||
| stderr=stderr) | ||
| except KeyboardInterrupt: |
There was a problem hiding this comment.
I dislike the idea of having a special case only for KeyboardInterrupt. I would prefer to put this code in the bare "except:" case.
| process.kill() | ||
| process.wait() | ||
| raise | ||
| try: |
There was a problem hiding this comment.
I propose to close stdin, stdout and stderr here to notify the child process that we want it to stop.
Otherwise, the program can hang on writing into stdout if the stdout pipe is full.
Lib/subprocess.py
Outdated
| raise | ||
| try: | ||
| process.wait(timeout=cleanup_timeout) | ||
| raise |
There was a problem hiding this comment.
I would prefer to call "raise" in a new "else:" block.
Lib/test/test_subprocess.py
Outdated
| # Change the current process group to avoid | ||
| # interrupting our parent process. | ||
| original_pgid = os.getpgrp() | ||
| os.setpgid(0, 0) |
There was a problem hiding this comment.
Hum, I dislike playing with process groups in an unit test. I would prefer to only send SIGINT directly to processes using their pid.
|
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
…t and child, rather than playing with process groups. (As requested in review.)
|
Hi @Haypo, thanks very much for the detailed review. I have pushed additional commits (I'll squash them later) to address all of your requests, with one exception. (See my plea below.)
Here is my plea regarding the last point. Sorry if this is too long; I don't mean to waste your time, but I do think this will affect users. I can see why a constant like Consider the following bash script excerpt: git clone https://github.com/python/cpythonIf the user hits What if the script author prefers Python over bash, but omits the import subprocess
subprocess.call(['git', 'clone', 'https://github.com/python/cpython'])Now what does Ideally, authors would be disciplined, and always specify a Regarding real-world examples: It's difficult to google efficiently for uses of There are more examples than that, but I'm not prepared to perform a quantitative analysis. :-) Lastly (and sorry if this is too "cute"): The Zen of Python states:
BUT (!) it also declares:
😉 Anyway, that's my opinion. If I'm out-numbered (or simply overruled), I'll remove the 1.0 constant, and apologize for the noise. PS -- I suppose we can also add a strong warning to the |
|
I have made the requested changes; please review again. (See my notes above.) |
|
Thanks for making the requested changes! @Haypo: please review the changes made to this pull request. |
- Cleanup child in response to all exceptions, not just KeyboardInterrupt - Close child stdin, stdout, stderr during cleanup - Slightly simplify control flow - Reduce default cleanup timeout to 0.0 for all exceptions except KeyboardInterrupt (for review: more discussion needed)
f0d8550 to
220b40f
Compare
| # If the child is well-written (exits in response to SIGINT), | ||
| # a 1 second timeout is usually enough time for it to clean up. | ||
| # Othwerwise, 1 second seems like a tolerable delay in response to Ctrl+C, | ||
| # and a resonable price to pay for correct behavior in the well-written case. |
There was a problem hiding this comment.
You cannot assume that a child will get SIGINT, this is true only for a foreground process in the shell. The parent process can be a daemon and nobody will generate SIGINT for its children.
There was a problem hiding this comment.
(Disclaimer: I cannot claim to be an expert in POSIX conventions, so forgive me if you have expertise in this area and know the following to be incorrect.)
If I understand correctly, the primary distinction betweenSIGTERM and SIGINT is that the latter is specifically intended to be used in interactive terminal programs. For instance, that's why a POSIX-compliant shell automatically forwards SIGINT to all processes in the foreground group. (The same goes for the windows equivalent, CTRL_C_EVENT.) Of course, that's also why Python chose to name it's SIGINT-triggered exception KeyboardInterrupt.
It is reasonable to assume that receiving aSIGINT is a strong clue to the program that it is running in the foreground process group, and therefore its child is, too.
IMHO, programmers who would send a SIGINT programmatically shouldn't be surprised by slightly odd behavior. But the good news is that the "odd behavior" in this case is quite benign: it's merely a 1 second delay.
On the other hand, a default cleanup timeout of 0.0 would mean that the child is killed prematurely, and the user experiences incorrect/incomplete cleanup behavior when hitting Ctrl+C. That's the situation Python 3 users are in today.
But what about a default cleanup delay of 1.0 second? If the child responds to SIGINT then the user experiences no extra delay. But for those (very few) children that don't actually respond to SIGINT, the consequences are relatively minor. The user would merely have to do one of the following:
- Hit
Ctrl+Cagain, OR - Do nothing... for 1 second.
The idea with this PR is to help Python programmers write code that "just works" in the default case. After all, the run() function is a convenience API. It's meant to help avoid boilerplate and spare the user from the details of the lower-level Popen API, if they don't want to think about those details. But right now, Python does the wrong thing by default, which is very un-Pythonic, if you ask me. :-P
(BTW, in Python 2, the behavior wasn't perfect, but it happened to be better-behaved in this one instance.)
There was a problem hiding this comment.
You cannot assume that a child will get SIGINT, this is true only for a foreground process in the shell.
Question: After thinking about your comment some more, I'm now wondering: What do you think about explicitly sending SIGINT to the child, assuming we're handling a KeyboardInterrupt? (For other exceptions, we would send SIGTERM, as you suggested below.)
There was a problem hiding this comment.
We should TERM (or, worse, KILL) it, not INTerrupt it.
Firstly, it is reasonable to assume the child already got the INT. More importantly, a TERM better fits the longstanding semantic conventions of signal use here.
While it's true that a SIGINT could be generated by some circumstance other than a terminal driver catching Ctrl-C and signaling the whole foreground process group (not individual PIDs), in practice that would be an exceedingly rare violation of longstanding convention. POSIX refers to SIGINT as the "terminal interrupt signal", for example. Indeed, the fact that python by default turns a SIGINT into a KeyboardInterrupt tells you something about its expected use.
I don't see benefit in second-guessing system behavior when an INT rolls in. If we're going to terminate the child, TERM the child — and please let devs like me suppress that behavior entirely. :)
There was a problem hiding this comment.
Command lines tools are used both interactively and run by other programs. They are expected to handle both SIGINT and SIGTERM, and typically both are handled in the same way.
We can assume that a child process is handling both signals. If you run a program that needs special treatment, use Popen directly instead of the high level easy to use call() or run().
There was a problem hiding this comment.
They are expected to handle both SIGINT and SIGTERM, and typically both are handled in the same way.
I completely agree that they should. But as Python programmers, it would be particularly odd for us to assume that they do. After all, tools written in Python itself will respond differently for SIGINT and SIGTERM -- unless the author was kind enough to register a handler for SIGTERM. Most authors do not bother.
For example, on a whim, I just checked a few popular utilities that are implemented in Python. Frankly, these results shocked me:
httpiehandlesSIGINTgracefully, but notSIGTERM. (I expected better from a project project that has 32,518 stars on github!)- Another example seems to be the
botosuite of AWS utilities. (I didn't verify via experiment, I can find no mention ofSIGTERMin it's source code.) - Holy crap, you guys aren't gonna believe this: Even
pipdoesn't handleSIGTERMgracefully. Apparently, if you send itSIGTERM, it does no cleanup at all -- you just end up with a partially installed python package (garbage, in other words). I'm legitimately surprised!
Again, I completely agree regarding what the proper POSIX conventions ought to be. But now we have to decide: we can either be POSIX purists, or be considerate to the typical user. I can see arguments in favor of either, but we should acknowledge the trade-off we're making.
There was a problem hiding this comment.
The subprocess module is not the place to fix bad signal handling in other programs.
Maybe send a PR to httpie/pip?
There was a problem hiding this comment.
+1 for making a PR to those projects. But of course there will always be others with the same issue, and we can't fix all of them.
I'm just wondering if we have an opportunity to apply the Zen of Python maxim: "practicality beats purity". I'm honestly not sure. But doing the most "friendly" thing (if we can figure out what that is) seems spiritually compatible with a convenience API like run(), even if it's less pure -- or maybe even a little ugly.
There was a problem hiding this comment.
We cannot fix the issue in subprocess. For example we can have a correct program that block SIGINT when running without a controlling terminal. The only way to terminate the program is SIGTERM. We must support such program by sending SIGTERM. If another program misbehaves by not handling SIGTERM, the program author should fix the program.
| process.kill() | ||
| process.wait() | ||
| raise ex | ||
|
|
There was a problem hiding this comment.
I think we need much simple solution.
except:
if kill_timeout is not None:
process.terminate()
try:
process.wait(timeout=kill_timoeut)
except:
pass
process.kill()
process.wait()
raise
There was a problem hiding this comment.
Yes. Something like kill_timeout=None is the key here, letting the developer suppress the (in my view misguided) SIGKILL. That lets someone with my crusty POSIX sensibilities use an easy subprocess abstraction, and backport it to earlier python.
There was a problem hiding this comment.
Regardless of the default timeout value, I do like this proposal to explicitly terminate() the child, rather than merely hoping it will just exit gracefully on its own after we close its stdin/stdout.
One open question is whether or not it is bad to send a SIGTERM to a process that has (probably) already received a SIGINT.
For example, if the child is a Python program, then the default behavior of SIGTERM is to end the process without any cleanup. If it has already received a SIGINT, then the process will likely end during cleanup.
There was a problem hiding this comment.
From the perspective of the child process, a TERM is at worst equivalent to a KILL, and possibly much more polite, since it can be caught. So, if we are committed to killing the child process, I'd encourage a TERM first.
(It's tempting to full-on generalize this and allow something like timeout_handler=some_func_accepting_the_process_as_arg or err_kill=( (0.5, SIGTERM), (0.5, SIGQUIT), (1.0, SIGKILL) ) Those args are just illustrative, but you get the idea.)
There was a problem hiding this comment.
I'd encourage a TERM first.
👍
It's tempting to full-on generalize this
Agree with the sentiment. I also completely agree with what you implied by using the word "tempting" (instead of, say, "justified" 😄 ). The run() API is meant for easy, general usage, with reasonable default behavior. If a programmer cares about nuanced error-handling, she should just use the Popen API.
| @@ -0,0 +1 @@ | |||
| Added new parameter, ``cleanup_timeout``, to ``subprocess.run()`` | |||
There was a problem hiding this comment.
kill_timeout may be more clear, the time we give child process to terminate before we kill it.
There was a problem hiding this comment.
I am ambivalent regarding the parameter name.
| to allow potential clean-up operations in the child to complete. | ||
| During this time, a second exception will kill the child immediately. | ||
| If *cleanup_timeout* is not specified, then any remaining time from the original | ||
| *timeout* is used, or 1 second if no original *timeout* was specified. |
| If cleanup_timeout is not specified, then a default value is chosen: | ||
| If timeout was provided, then any remaining time from the original timeout | ||
| is used. Otherwise, if the exception is a KeyboardInterrupt, the child is | ||
| given 1 second to clean up. Otherwise, the child is killed immediately. |
There was a problem hiding this comment.
If cleanup timeout is not specified, there should be no cleanup timeout.
There was a problem hiding this comment.
If cleanup timeout is not specified, there should be no cleanup timeout.
I assume you chose that phrasing to be pointedly simple, but, unfortunately, I think your phrasing is not as simple as it appears. In the subprocess API and docs, "no timeout" (i.e. timeout=None) actually means "timeout=infinity". That's probably not what you mean.
I think you meant to propose: "If cleanup timeout is not specified, there should be a cleanup timeout of 0.0".
There was a problem hiding this comment.
@stuarteberg I agree that "no cleanup timeout" is not clear enough, but it is not the same as cleanup timeout of 0 seconds.
How about:
If kill_timeout is not specified, the child process will be killed immediately when the timeout expire. If specified, the child process will be terminated gracefully using SIGTERM. If the process did not terminate within kill_timeout seconds, it will be killed.
This should be modified if graceful termination should be the default behavior. I'm not sure what should be the default, but this can be left for another PR.
|
Thank you for this PR and discussion, it was informative and got me thinking about the problem. The behavior change in 3.3-3.6 was unintentional. I'm closing this PR for now as I'd rather not see a cleanup_timeout or similar parameter if at all possible. My PR to attempt to restore sane default behavior of just giving a little time before resorting to SIGKILL by default is in. I'd appreciate it if people would try out the 3.7 betas and report back on how that behaves for them. We can proceed from there if more changes seem warranted. |
EDIT: This PR was ultimately superseded by #5026
Summary:
subprocess.call()(and now alsosubprocess.run()) behave badly in the event ofSIGINT, i.e. a user hittingCtrl+Cwhile the subprocess is running. This was first reported by @pilcrow in [bpo-25942]1.Several possible solutions were discussed (by @vadmium, @pilcrow, and @Haypo), but no action was taken. Among them was this suggestion from @Haypo:
IMHO, that is the best option, and certainly much better than the status quo. I've implemented that solution in this PR. (It is also similar to a proposal by @vadmium in that thread.)
Background: POSIX behavior for Ctrl+C
In general, when the user hits
Ctrl+C, the OS sendsSIGINTto all processes in the foreground process group, including any child processes launched viasubprocess call(). That is, both Python and it's child will automatically see theSIGINT; there is no need to explicitly send a duplicateSIGINTto the child.What this PR does:
Upon receiving the
SIGINTin the form of aKeyboardInterrupt,call()will wait again for the child process, but withcleanup_timeout, giving the child process time to perform cleanup actions. If that timeout expires, then the child process is killed. (Alternatively, during this time, the user can hitCtrl+Cagain to immediately stop waiting and kill the child process.)What this PR does not do:
This PR attempts to make
subprocess.call()behave in a less surprising manner in the context of interactive terminal applications, in which the user is likely to expectCtrl+Cto work properly. If the child process does not handleSIGINTproperly, we make no attempt to kill it via alternative means, such asSIGTERM. If the developer needs such control over how the child is to be killed in the event of aKeyboardInterrupt, she should use thePopenAPI directly instead of callingsubprocess.call()orrun().Open question 1:
It is not entirely clear what the default value of
cleanup_timeoutshould be. Setting it to 0 by default seems to defeat the purpose and wouldn't help naïve users write code that "just works".On the other hand, setting it to an arbitrary value like
5.0seems... arbitrary. If the cleanup takes a long time, then we'd be killing it too soon. But if the child doesn't handleSIGINTanyway, then we'd be wasting 5 seconds for no reason (unless the user hitsCtrl+Ca second time).For now, I've set the default value as follows:
timeoutwas specified, use that. The child cleanup time must be less than its total expected runtime.timeoutwas not specified, then use1.0. If the child process handlesSIGINT, this introduces no extra delay. If it does not handleSIGINT, then 1 second seems like a reasonable delay for what is likely a rare event an interactive console application.Open question 2:
According to the docs,
subprocess.call()is equivalent tosubprocess.run().returncode. Yet, these are currently implemented as independent functions (and I left it that way). Is there any objection to replacingsubprocess.call()with the following?(That would also allow me to remove the mostly duplicated test code in this PR.)
https://bugs.python.org/issue25942