
Commit 5bb502a

Make CLI commands pluggable, fix fd leak when logging exceptions, use black for code formatting (someengineering#28)
* Utility functions for getting process information
* Add process stats to CLI command
* Process info cleanup
* Spawn new processes instead of forking them in the AWS plugin, reducing memory consumption
* Add nofile and nproc limits to procinfo output
* Increase nproc if applicable
* Add command to close open fd from cli
* Improve debug_close_fd CLI command
* black -l 88 -t py38
* Allow specifying fd that does not exist in debug_close_fd cli cmd
* Create deepcopies of everything sent to the object log, fixing an fd leak
* Return correct rtdname
* Update AWS plugin readme and performance information
* Make CLI commands pluggable
* Update documentation with details about CLI plugins
* max-line-length=120
* Ignore flake8 E203
* Make cloudkeeper compatible with Windows and Unixes other than Linux
* Move debug_* CLI commands into external plugin
* Add linux-headers to Dockerfile for psutil installation
* Add cli history function
* Add black checks to tox.ini
1 parent f573fb3 commit 5bb502a

107 files changed (+6423, -3339 lines)

Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 FROM python:3.8-alpine AS build-env
-RUN apk add --no-cache build-base findutils
+RUN apk add --no-cache build-base findutils linux-headers
 RUN pip install --upgrade pip
 RUN pip install tox flake8
 COPY ./ /usr/src/cloudkeeper

README.md

Lines changed: 20 additions & 12 deletions
@@ -9,7 +9,7 @@ Cloudkeeper is a standalone CLI tool that periodically collects a list of resour
 Resource collection is performed in intervals (`--interval`) for each activated collector plugin (`--collector`).
 When resource collection is finished a resource cleanup can be performed (`--cleanup`). By default nothing will be cleaned!
 Cleanup plugins have to be installed and configured, or resources manually flagged for cleanup using the built-in CLI.
-Read more about collector and cleanup plugins in the [Plugins](#plugins) section below.
+Read more about collector, CLI and cleanup plugins in the [Plugins](#plugins) section below.
 
 
 ## Who is it for?
@@ -123,7 +123,7 @@ $ cloudkeeper --collector remote --remote-endpoint file:///tmp/graph --cleanup -
 
 
 ## CLI
-Cloudkeeper comes with a simple CLI. Initially only used for debugging the internal data structures it can now also be used to perform simple searches and mark resources for cleanup. Entering `help` will give a list of all commands. `help <command>` will provide additional help for that command.
+Cloudkeeper comes with a built-in CLI. Initially only used for debugging the internal data structures, it can now also be used to perform simple searches and mark resources for cleanup. Entering `help` will give a list of all commands. `help <command>` will provide additional help for that command.
 Commands can be piped into one another using "`|`". Multiple commands can be run after one another using "`;`". If you need to use the pipe or semicolon characters in your commands make sure to escape them using a backslash "`\`" character.
 Internally commands take in and output Iterables, often consisting of the Cloud resources. Commands can match any attribute of those resources. So for instance if an AWS EBS volume has an attribute `volume_size` then you could query all EBS volumes larger than 100GB using: `match resource_type = aws_ec2_volume | match volume_size > 100`.
 
@@ -136,16 +136,20 @@ In the next example we will delete all unused EBS volumes larger than 100 GiB th
 > cleanup
 ```
 
-The `dump` command is useful for getting a list of a resource's attributes.
+The `dump` command is useful for getting a list of a resource's attributes and event log. To dump all resources to a JSON file one can use `dump --json | write resources.json`.
+
+The CLI has a clipboard which can copy and paste resources. For instance: `match name ~ sre | clipboard copy; match name ~ eng | clipboard append; match name ~ sales | clipboard paste passthrough` would list all the resources with the strings 'sre', 'eng' or 'sales' in their name. The ones whose names contain 'sre' or 'eng' are copied to the clipboard while those containing 'sales' are passed through after the paste.
+Side note: the same could have been written as `match name ~ (sre\|eng\|sales)`. This was just to demonstrate the functionality of the clipboard command.
 
 Now the CLI is useful for exploring collected data but if you have a repeating cleanup query it would be tedious to run it manually every time. To that end Cloudkeeper supports an argument `--register-cli-action` which takes a lowercased event name (see [Events](#events) below) followed by a colon ":" and the CLI command that should be executed when that event is dispatched.
 If we wanted to run our volume cleanup from earlier every time cloudkeeper has finished collecting resources, we could call it like so:
 ```
 $ cloudkeeper --collector aws --cleanup --register-cli-action "cleanup_plan:match resource_type = aws_ec2_volume | match volume_size > 100 | match volume_status = available | match ctime < 2020-01-01 | match last_access > 7d | match last_update > 7d | clean"
 ```
-
 As a side note, there is a plugin [plugins/cleanup_volumes/](plugins/cleanup_volumes/) that does just that. It was written before cloudkeeper had its own CLI.
 
+Instead of passing CLI actions as commandline arguments they can also be stored in a text file and passed using `--cli-actions-config`.
+
 
 ## Warning
 Cloudkeeper is designed to clean up resources. As such act with caution when selecting and filtering resources for cleanup. **The default input to any CLI command is the list of all cloud resources.** Meaning when you run `match resource_type = aws_ec2_volume` it runs this match against all resources.
@@ -166,8 +170,8 @@ Using the endpoints mentioned in [Distributed Instances](#distributed-instances)
 
 
 ## Plugins
-Cloudkeeper knows two types of Plugins, COLLECTOR and PERSISTENT. You can find example code for each type in [plugins/example_collector/](plugins/example_collector/) and [plugins/example_persistent/](plugins/example_persistent/).
-COLLECTOR Plugins collect cloud resources and are being instanciated on each collect run. PERSISTENT plugins are instanciated once at startup and are mostly used for resource cleanup decissions or for notification (e.g. to send a Slack message to the owner of an instance that has just been deleted).
+Cloudkeeper knows three types of Plugins: CLI, COLLECTOR and PERSISTENT. You can find example code for each type in [plugins/example_cli/](plugins/example_cli/), [plugins/example_collector/](plugins/example_collector/) and [plugins/example_persistent/](plugins/example_persistent/).
+COLLECTOR Plugins collect cloud resources and are instantiated on each collect run. PERSISTENT Plugins are instantiated once at startup and are mostly used for resource cleanup decisions or for notification (e.g. to send a Slack message to the owner of an instance that has just been deleted). CLI Plugins extend the built-in CLI with new commands.
 
 ### Collector Plugins
 Each collector plugin has a local graph. A collector plugin implements resource collection for a cloud provider (e.g. AWS, GCP, Azure, Alicloud, etc.).
@@ -182,6 +186,9 @@ Persistent plugins run on startup and can register with one or more events. This
 As part of the event it would be handed a reference to the current live graph. It could then look at the resources in that graph, search for them, filter them, look at their attributes, etc. and perform actions like protecting a resource from deletion or flagging a resource for deletion.
 It could also register with the event that signals the end of a run and look at which resources have been cleaned up to generate a report that could be emailed or notify resource owners on Slack that their resources have been cleaned.
 
+### CLI Plugins
+CLI plugins extend the functionality of the built-in CLI with new commands. They can act on and filter resources and have full access to the current graph, the scheduler and the CLI clipboard. CLI commands can also be used in scheduled jobs (`--scheduler-config`) and CLI actions (`--register-cli-action` and `--cli-actions-config`).
+
 
 ## Events
 Cloudkeeper implements a simple event system. Plugins can register with and dispatch events.
@@ -237,7 +244,7 @@ Cloudkeeper comes with a built-in development webserver (defaults to Port 8000).
 /graph.txt # GET Returns a Text representation of the live Graph
 ```
 The most useful of those will be `/metrics` and `/graph`. In our own setup we have an authentication and TLS proxy in front of our Cloudkeeper instances.
-Because a single collect run can take quite a while depending on the number of accounts that need to be scraped (in our case 40+ AWS accounts take about an hour to collect and clean) I have gotten to a development workflow where I download the live graph to my local system and then work on that local copy.
+Because a single collect run can take quite a while depending on the number of accounts that need to be scraped, I have gotten into a development workflow where I download the live graph to my local system and then work on that local copy.
 
 ```
 $ cloudkeeper --collector remote --remote-endpoint https://somelogin:[email protected]/graph
@@ -291,17 +298,17 @@ As mentioned Cloudkeeper collects in intervals. As such it will not see resource
 
 
 ## TODO
-- Document all plugins in their README.md
+- ~~Document all plugins in their README.md~~ ✔️
 - Update docstrings for pdoc3 and configure automated generation/export
 - Better tests for Cloudkeeper core and plugins
 - The basic test infrastructure is there and runs as part of the Docker image build
-- flake8 syntax checks run with very lenient settings
-- Use more sane defaults than 240 char line length
-- Maybe give project formating in the hands of black and be done with it?
+- ~~flake8 syntax checks run with very lenient settings~~ ✔️
+- ~~Use more sane defaults than 240 char line length~~ ✔️
+- ~~Maybe put project formatting in the hands of black and be done with it?~~ ✔️
 - Cloudkeeper core currently has some testing but not nearly enough
 - Plugins have virtually no testing; just a test_args.py stub that tests each plugin's args for correct default values
 - Move to Poetry and pyproject.toml
-- Implement delete() and update/delete_tag() Methods for all resources, not just the expensive ones
+- ~~Implement delete() and update/delete_tag() methods for all resources, not just the expensive ones~~ ✔️
 - Make existing delete() methods smarter - e.g. EKS Nodegroup deletion could block until the Nodegroup is gone so the EKS Cluster does not have to wait until the next collection round for its own deletion - on the other hand this would increase the number of API calls
 - Distribute parallel cleanup by cloud, account and region so as to optimally use API request limits
 - Implement more Cloud Providers (esp. GCP and Azure)
@@ -312,6 +319,7 @@ As mentioned Cloudkeeper collects in intervals. As such it will not see resource
 
 ## Contributing
 If you would like to contribute new plugins or other code improvements fork the repo into your own Github account, create a feature branch and submit a PR.
+Code formatting checks currently use `black --line-length 88 --target-version py38` and flake8 with `max-line-length=120`. Meaning code must wrap after 88 characters but strings are allowed to be up to 120 characters long. This will change once black stable starts to wrap strings.
 If you find a bug or have a question about something please create a Github issue.
 
 
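The README above describes commands that take in and output Iterables of resources and are chained with `|`. As a hypothetical mini-implementation of that idea (names, classes and semantics here are illustrative only, not Cloudkeeper's actual code), the `match resource_type = aws_ec2_volume | match volume_size > 100` pipeline could look like:

```python
# Hypothetical sketch of a pipe-style CLI threading Iterables of resources
# through filter commands. Illustrative only - not Cloudkeeper's real code.
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Resource:
    resource_type: str
    volume_size: int = 0


def cmd_match(resources: Iterable[Resource], attr: str, op: str, value) -> Iterable[Resource]:
    """Yield resources whose attribute compares true against value."""
    for r in resources:
        current = getattr(r, attr, None)
        if current is None:
            continue
        if op == "=" and current == value:
            yield r
        elif op == ">" and current > value:
            yield r


def run_pipeline(resources: List[Resource]) -> List[Resource]:
    # Equivalent of: match resource_type = aws_ec2_volume | match volume_size > 100
    stage1 = cmd_match(resources, "resource_type", "=", "aws_ec2_volume")
    stage2 = cmd_match(stage1, "volume_size", ">", 100)
    return list(stage2)


volumes = [
    Resource("aws_ec2_volume", 50),
    Resource("aws_ec2_volume", 500),
    Resource("aws_ec2_instance", 0),
]
print([r.volume_size for r in run_pipeline(volumes)])  # [500]
```

Because each stage is a generator, resources stream through the pipeline lazily, which matches the Iterable-in/Iterable-out description above.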

Lines changed: 47 additions & 98 deletions
@@ -1,53 +1,49 @@
-import sys
-import logging
 import time
 import os
-import resource
 import threading
-from signal import signal, getsignal, SIGINT, SIGTERM, SIGKILL, SIGUSR1
+import cloudkeeper.logging as logging
+import cloudkeeper.signal
 from cloudkeeper.graph import GraphContainer
 from cloudkeeper.pluginloader import PluginLoader
 from cloudkeeper.baseplugin import PluginType
 from cloudkeeper.web import WebServer
 from cloudkeeper.scheduler import Scheduler
-from cloudkeeper.args import get_arg_parser, ArgumentParser
+from cloudkeeper.args import get_arg_parser
 from cloudkeeper.processor import Processor
 from cloudkeeper.cleaner import Cleaner
 from cloudkeeper.metrics import GraphCollector
-from cloudkeeper.utils import log_stats, signal_on_parent_exit
+from cloudkeeper.utils import log_stats, increase_limits
 from cloudkeeper.cli import Cli
-from cloudkeeper.event import add_event_listener, dispatch_event, Event, EventType, add_args as event_add_args
+from cloudkeeper.event import (
+    add_event_listener,
+    dispatch_event,
+    Event,
+    EventType,
+    add_args as event_add_args,
+)
 from prometheus_client import REGISTRY
 
 
-# Try to run in a new process group
-try:
-    os.setpgid(0, 0)
-except (PermissionError, AttributeError):
-    pass
-
-log_format = '%(asctime)s - %(levelname)s - %(process)d/%(threadName)s - %(message)s'
-logging.basicConfig(level=logging.WARN, format=log_format)
-logging.getLogger('cloudkeeper').setLevel(logging.INFO)
 log = logging.getLogger(__name__)
 
-# Plugins might produce debug logging during arg parsing so we manually
-# look for verbosity and set the log level before using the arg parser.
-argv = sys.argv[1:]
-if '-v' in argv or '--verbose' in argv:
-    logging.getLogger('cloudkeeper').setLevel(logging.DEBUG)
-
-# This will be used in main() and signal_handler()
+# This will be used in main() and shutdown()
 shutdown_event = threading.Event()
-parent_pid = os.getpid()
-original_sigint_handler = getsignal(SIGINT)
-original_sigterm_handler = getsignal(SIGTERM)
 
 
 def main() -> None:
+    # Try to run in a new process group and
+    # ignore if not possible for whatever reason
+    try:
+        os.setpgid(0, 0)
+    except Exception:
+        pass
+
+    cloudkeeper.signal.parent_pid = os.getpid()
+
     # Add cli args
     arg_parser = get_arg_parser()
 
+    logging.add_args(arg_parser)
     Cli.add_args(arg_parser)
     WebServer.add_args(arg_parser)
     Scheduler.add_args(arg_parser)
@@ -64,28 +60,12 @@ def main() -> None:
     # At this point the CLI, all Plugins as well as the WebServer have added their args to the arg parser
     arg_parser.parse_args()
 
-    # Write log to a file in addition to stdout
-    if ArgumentParser.args.logfile:
-        log_formatter = logging.Formatter(log_format)
-        fh = logging.FileHandler(ArgumentParser.args.logfile)
-        fh.setFormatter(log_formatter)
-        logging.getLogger().addHandler(fh)
-
     # Handle Ctrl+c and other means of termination/shutdown
-    signal_on_parent_exit()
+    cloudkeeper.signal.initializer()
     add_event_listener(EventType.SHUTDOWN, shutdown, blocking=False)
-    signal(SIGINT, signal_handler)
-    signal(SIGTERM, signal_handler)
-    signal(SIGUSR1, signal_handler)
 
-    # Try to increase nofile limit
-    nofile_soft, nofile_hard = resource.getrlimit(resource.RLIMIT_NOFILE)
-    try:
-        if nofile_soft < nofile_hard:
-            log.debug(f'Increasing RLIMIT_NOFILE {nofile_soft} -> {nofile_hard}')
-            resource.setrlimit(resource.RLIMIT_NOFILE, (nofile_hard, nofile_hard))
-    except (ValueError):
-        log.error(f'Failed to increase RLIMIT_NOFILE {nofile_soft} -> {nofile_hard}')
+    # Try to increase nofile and nproc limits
+    increase_limits()
 
     # We're using a GraphContainer() to contain the graph which gets replaced at runtime.
     # This way we're not losing the context in other places like the webserver when the
@@ -115,16 +95,16 @@ def main() -> None:
 
     for Plugin in plugin_loader.plugins(PluginType.PERSISTENT):
         try:
-            log.debug(f'Starting persistent Plugin {Plugin}')
+            log.debug(f"Starting persistent Plugin {Plugin}")
             plugin = Plugin()
             plugin.daemon = True
             plugin.start()
         except Exception as e:
-            log.exception(f'Caught unhandled persistent Plugin exception {e}')
+            log.exception(f"Caught unhandled persistent Plugin exception {e}")
 
-    collector = Processor(graph_container, plugin_loader.plugins(PluginType.COLLECTOR))
-    collector.daemon = True
-    collector.start()
+    processor = Processor(graph_container, plugin_loader.plugins(PluginType.COLLECTOR))
+    processor.daemon = True
+    processor.start()
 
     # Dispatch the STARTUP event
     dispatch_event(Event(EventType.STARTUP))
@@ -135,73 +115,42 @@ def main() -> None:
         log_stats()
         shutdown_event.wait(900)
     time.sleep(5)
-    log.info('Shutdown complete')
+    cloudkeeper.signal.kill_children(cloudkeeper.signal.SIGTERM, ensure_death=True)
+    log.info("Shutdown complete")
     quit()
 
 
 def shutdown(event: Event) -> None:
-    reason = event.data.get('reason')
-    emergency = event.data.get('emergency')
+    reason = event.data.get("reason")
+    emergency = event.data.get("emergency")
 
     if emergency:
-        log.fatal(f'EMERGENCY SHUTDOWN: {reason}')
-        os.killpg(os.getpgid(0), SIGKILL)
+        cloudkeeper.signal.emergency_shutdown(reason)
 
     current_pid = os.getpid()
-    if current_pid != parent_pid:
+    if current_pid != cloudkeeper.signal.parent_pid:
         return
 
     if reason is None:
-        reason = 'unknown reason'
-    log.info(f'Received shut down event {event.event_type}: {reason} - killing all threads and child processes')
-    os.killpg(os.getpgid(0), SIGUSR1)
-    kt = threading.Thread(target=force_shutdown, name='shutdown')
+        reason = "unknown reason"
+    log.info(
+        f"Received shut down event {event.event_type}: {reason} - killing all threads and child processes"
+    )
+    # Send 'friendly' signal to children to have them shut down
+    cloudkeeper.signal.kill_children(cloudkeeper.signal.SIGTERM)
+    kt = threading.Thread(target=force_shutdown, name="shutdown")
     kt.start()
     shutdown_event.set()  # and then end the program
 
 
 def force_shutdown(delay: int = 10) -> None:
     time.sleep(delay)
     log_stats()
-    log.error('Some child process or thread timed out during shutdown - killing process group')
-    os.killpg(os.getpgid(0), SIGKILL)
+    log.error(
+        "Some child process or thread timed out during shutdown - forcing shutdown completion"
+    )
     os._exit(0)
 
 
-def delayed_exit(delay: int = 3) -> None:
-    time.sleep(delay)
-    os._exit(0)
-
-
-def signal_handler(sig, frame) -> None:
-    """Handles Ctrl+c by letting the Collector() know to shut down"""
-    signal(SIGINT, original_sigint_handler)
-    signal(SIGTERM, original_sigterm_handler)
-
-    current_pid = os.getpid()
-    if current_pid == parent_pid:
-        if sig != SIGUSR1:
-            reason = f'Received shutdown signal {sig}'
-            log.debug(f'Parent caught signal {sig} - dispatching shutdown event')
-            # Dispatch shutdown event in parent process which also causes SIGUSR1 to be sent to
-            # the process group and in turn causes the shutdown event in all child processes.
-            dispatch_event(Event(EventType.SHUTDOWN, {'reason': reason, 'emergency': False}))
-        else:
-            log.debug('Parent received SIGUSR1 and ignoring it')
-    else:
-        if sig != SIGUSR1:
-            reason = f'Received unexpected shutdown signal {sig} of unknown origin - OOM killer?'
-            log.error(reason)
-        else:
-            reason = f'Received shutdown signal {sig} from parent process'
-        log.debug(f"Shutting down child process {current_pid} - you might see exceptions from interrupted worker threads")
-        # Child's threads have 3s to shut down before the following thread will shut them down hard.
-        kt = threading.Thread(target=delayed_exit, name='shutdown')
-        kt.start()
-        # Dispatch shutdown event in child process
-        dispatch_event(Event(EventType.SHUTDOWN, {'reason': reason, 'emergency': False}), blocking=False)
-        sys.exit(0)
-
-
-if __name__ == '__main__':
+if __name__ == "__main__":
     main()
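The hunk above replaces the inlined RLIMIT_NOFILE code with a call to the new `increase_limits()` helper from `cloudkeeper.utils`. Based on the removed code and the commit message ("Add nofile and nproc limits", "Increase nproc if applicable"), the helper presumably looks roughly like the sketch below; this is an assumption, not the actual implementation.

```python
# Sketch of what increase_limits() plausibly does: raise each soft rlimit to
# its hard ceiling, skipping limits the platform does not expose.
# Assumption based on the removed inline code - not cloudkeeper's real code.
import logging
import resource

log = logging.getLogger(__name__)


def increase_limits() -> None:
    for limit_name in ("RLIMIT_NOFILE", "RLIMIT_NPROC"):
        res = getattr(resource, limit_name, None)
        if res is None:  # e.g. RLIMIT_NPROC is missing on some platforms
            continue
        soft, hard = resource.getrlimit(res)
        # An unlimited hard limit is reported as RLIM_INFINITY (-1 on Linux),
        # so soft >= hard also skips that case - keeps the sketch simple.
        if soft >= hard:
            continue
        try:
            log.debug(f"Increasing {limit_name} {soft} -> {hard}")
            resource.setrlimit(res, (hard, hard))
        except ValueError:
            log.error(f"Failed to increase {limit_name} {soft} -> {hard}")


increase_limits()
```

Raising soft limits to the hard ceiling at startup is what makes the commit's fd-leak mitigation less likely to exhaust file descriptors under load.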

cloudkeeper/cloudkeeper/args.py

Lines changed: 9 additions & 6 deletions
@@ -1,7 +1,4 @@
 import argparse
-import logging
-
-log = logging.getLogger(__name__)
 
 
 class Namespace(argparse.Namespace):
@@ -22,7 +19,13 @@ def parse_args(self, *args, **kwargs):
 
 
 def get_arg_parser() -> ArgumentParser:
-    arg_parser = ArgumentParser(description='Cloudkeeper - Housekeeping for Clouds')
-    arg_parser.add_argument('--verbose', '-v', help='Verbose logging', dest='verbose', action='store_true', default=False)
-    arg_parser.add_argument('--logfile', help='Logfile to log into', dest='logfile')
+    arg_parser = ArgumentParser(description="Cloudkeeper - Housekeeping for Clouds")
+    arg_parser.add_argument(
+        "--verbose",
+        "-v",
+        help="Verbose logging",
+        dest="verbose",
+        action="store_true",
+        default=False,
+    )
     return arg_parser
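The custom `ArgumentParser` used here stores the parsed result for global access, which is why other modules can read `ArgumentParser.args` (as the removed logfile code in the main module did). A minimal, simplified sketch of that pattern; the real module's `Namespace` subclass and details may differ:

```python
# Simplified sketch of the shared-args pattern: parse_args() stashes the
# parsed Namespace on the ArgumentParser class so any module can read
# ArgumentParser.args. Illustrative stand-in, not the module's exact code.
import argparse


class ArgumentParser(argparse.ArgumentParser):
    args = None  # populated once parse_args() has run

    def parse_args(self, *args, **kwargs):
        ns = super().parse_args(*args, **kwargs)
        ArgumentParser.args = ns
        return ns


def get_arg_parser() -> ArgumentParser:
    arg_parser = ArgumentParser(description="Cloudkeeper - Housekeeping for Clouds")
    arg_parser.add_argument(
        "--verbose",
        "-v",
        help="Verbose logging",
        dest="verbose",
        action="store_true",
        default=False,
    )
    return arg_parser


parser = get_arg_parser()
parser.parse_args(["--verbose"])
print(ArgumentParser.args.verbose)  # True
```

The trade-off of this design is convenience (no need to thread the namespace through every call) against the usual caveats of module-level mutable state.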
