PromCon 2017: Prometheus as an (internal) Service
Prometheus as a
(internal) service
Paul Traylor
LINE Fukuoka
Self-Introduction
• Wanted to make games in high school
• Worked on several mods creating levels
• Decided games were hard, web development looked easier
• North Carolina State University - Computer Science
• Worked in San Francisco for ~7 years
• First job primarily web development
• Second job primarily devops
• LINE Fukuoka ~ 1 year
• Focused primarily on upgrading monitoring tools
LINE Fukuoka – LINE Family Apps
• LINE Creators Studio
• LINE Fortune
• LINE Surveys
• LINE Part Time
• and more!
https://line.me/ja/family-apps
Current Responsibilities
• Continue development on Promgen
• Introduced last year at Promcon 1
• Rewritten in Django
• https://github.com/line/promgen
• Migrate legacy monitoring to Prometheus
• Installing exporters
• Setting Prometheus targets
• Configuring rules
1 https://promcon.io/2016-berlin/talks/hadoop-fluentd-cluster-monitoring-with-prometheus-and-grafana/
Environment
• HA Prometheus Shards
• LB Promgen
• LOTS of scrape targets
• ~3.5 million samples
• ~3000 exporters
Environment – Promgen + Prometheus
• Manages targets and rules for Prometheus
• All exporters go into a single json file
• All rules go into a single rules file
• Uses relabeling on the Prometheus side to filter out unrelated shards
Environment – Promgen + Prometheus
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster_name: 'shard-name'
rule_files:
  - "/path/to/common.rule"
  - "/path/to/promgen.rule"
scrape_configs:
  - job_name: 'promgen'
    file_sd_configs:
      - files:
          - "/path/to/promgen.json"
    relabel_configs:
      - source_labels: [__shard]
        regex: {{ shard_keep }}
        action: keep
https://github.com/line/promgen/blob/master/docker/prometheus.yml
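The promgen.json file referenced above is a standard file_sd_configs target list. A hypothetical example (target addresses, labels, and shard name are illustrative, not taken from the actual file) might look like this; the `__shard` label is what the `keep` relabel rule above matches against, and labels beginning with `__` are dropped after relabeling:

```json
[
  {
    "targets": ["web-1.example.com:9100", "web-2.example.com:9100"],
    "labels": {
      "__shard": "shard-name",
      "service": "example-service",
      "project": "example-project",
      "job": "node"
    }
  }
]
```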
Environment – Promgen + Alertmanager
• Hard to build a dynamic routing tree in Alertmanager
• Route everything through Promgen
• But provide a backup in case Promgen goes down
Environment – Promgen + Alertmanager
route:
  receiver: default
  group_by: ['service', 'project']
  routes:
    # Anything that matches the job
    # alertmanager should be routed
    # directly, since other parts of
    # the system may not be working
    # correctly
    - receiver: backup
      match:
        job: alertmanager
receivers:
  - name: default
    webhook_configs:
      - url: http://alertlog.example.com
        send_resolved: true
      - url: http://promgen.example.com
        send_resolved: true
  - name: backup
    email_configs:
      - to: backup@example.com
        send_resolved: true
Operation – Where is my shard?
• Use Promgen to query each shard and then combine the result
• Still interested in native support for Prometheus remote_read
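The fan-out approach described above can be sketched as follows. This is a hedged illustration of merging Prometheus `/api/v1/query` responses from several shards, not Promgen's actual implementation; the function name and sample data are hypothetical, but the response dictionaries follow the Prometheus HTTP API shape.

```python
# Sketch: combine instant-query results fetched from each shard.
# Each per-shard response follows the Prometheus HTTP API shape:
#   {"status": "success", "data": {"resultType": "vector", "result": [...]}}

def merge_query_results(responses):
    """Concatenate the vector results from per-shard API responses
    into a single API-shaped response."""
    merged = []
    for resp in responses:
        if resp.get("status") == "success":
            merged.extend(resp["data"]["result"])
    return {"status": "success",
            "data": {"resultType": "vector", "result": merged}}

# Hypothetical responses from two shards
shard_a = {"status": "success", "data": {"resultType": "vector", "result": [
    {"metric": {"job": "node", "instance": "a:9100"}, "value": [0, "1"]}]}}
shard_b = {"status": "success", "data": {"resultType": "vector", "result": [
    {"metric": {"job": "node", "instance": "b:9100"}, "value": [0, "2"]}]}}

combined = merge_query_results([shard_a, shard_b])
```

A real proxy would also have to handle error responses and conflicting series, which is part of why native remote_read support remains attractive.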
Operation – Retention
• Prometheus cares about the present (and recent past)
• Developers mostly care about the present
• Sometimes want to compare with historical data
• Wanted to use the same queries for both recent and longterm data
Operation – Retention
• Has not scaled well
• Thought we could co-locate both
• Summary servers use a lot of network / cpu to fetch and summarize
• Longterm servers only need to retain the data
• Lots of memory problems
• Summary Prometheus eats resources due to twice the workload of a normal server
• Sometimes the longterm process dies, sometimes the summary process dies
• Need various extra alert rules to watch both
Operation - Watching upstream projects
• Prometheus is generally fairly stable from a cli / api perspective, but there are still times when things change
• https://github.com/prometheus/prometheus/pulse
• https://github.com/prometheus/alertmanager/pulse
• Federation instance labels #2488
meta:samples:sum{} = sum(scrape_samples_scraped) BY (service,project,job)
meta:alerts:count{} = count(ALERTS == 1) BY (service,project,alertname,alertstate)
meta:exporters:count{} = count(up) BY (service,project,job)
https://github.com/prometheus/prometheus/issues/2488
Operation – Fixed Labels
# Recording Rule
meta:samples:sum{} = sum(scrape_samples_scraped) BY (service,project,job)

# Prometheus Configuration
relabel_configs:
  - source_labels: [__address__]
    target_label: cluster_node
    action: replace

# Grafana Query
sum(avg(meta:samples:sum) without (cluster_node)) by (cluster_name)
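The intent of the query above is that federation from an HA pair produces duplicate series distinguished only by cluster_node, so averaging without (cluster_node) collapses the duplicates before summing per shard. A minimal Python sketch of that arithmetic, with hypothetical sample values, labels, and node names:

```python
from collections import defaultdict

# Hypothetical federated samples: an HA pair (two cluster_node values)
# reports identical per-service sample counts, so naively summing
# would double-count every series.
samples = [
    {"cluster_name": "shard-1", "cluster_node": "prom-a:9090", "service": "x", "value": 100},
    {"cluster_name": "shard-1", "cluster_node": "prom-b:9090", "service": "x", "value": 100},
    {"cluster_name": "shard-1", "cluster_node": "prom-a:9090", "service": "y", "value": 50},
    {"cluster_name": "shard-1", "cluster_node": "prom-b:9090", "service": "y", "value": 50},
]

# avg(...) without (cluster_node): group on every label except cluster_node
groups = defaultdict(list)
for s in samples:
    groups[(s["cluster_name"], s["service"])].append(s["value"])
averaged = {k: sum(v) / len(v) for k, v in groups.items()}

# sum(...) by (cluster_name): aggregate the de-duplicated series per shard
per_cluster = defaultdict(float)
for (cluster_name, _service), value in averaged.items():
    per_cluster[cluster_name] += value

print(dict(per_cluster))  # {'shard-1': 150.0}
```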
Education - Writing Queries is hard
• PromQL can be unfamiliar
• https://prometheus.io/docs/querying/functions/
• Hard to remember labels at times
• Lots of service labels
• Lots of project labels
• Lots of mountpoints for disk metrics
• Lots of interfaces for network metrics
• …
Education – Writing Queries is hard
• without|by ?
• ignoring|on?
max by (instance) (node_filesystem_free / node_filesystem_size)
max by (instance, mountpoint) (node_filesystem_free{fstype!~"tmpfs|rootfs"} / node_filesystem_size{fstype!~"tmpfs|rootfs"})
Education - Writing rules is harder
• Is loadavg useful?
• Global rule namespace
• Promgen uses Service and Project labels for routing, but it is easy to lose
those labels when writing queries
# Global rule excludes children
example_rule{service!~"A|B"}
# Service A override includes self
- example_rule{service="A"}
# Service B override includes self, but excludes children
- example_rule{service="B", project!~"C"}
# Project override
- example_rule{project="C"}
https://promgen.readthedocs.io/en/latest/rules.html
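The parent/child scheme above can be sketched as follows: given a metric and the set of services that provide their own override, the parent rule's selector gets an automatic exclusion matcher. The helper name is hypothetical; this is only an illustration of the matcher generation, not Promgen's actual code.

```python
def parent_rule_filter(metric, override_services):
    """Build the parent rule's selector, excluding any service
    that has its own override (hypothetical sketch)."""
    if not override_services:
        return metric
    # Sort for a deterministic regex alternation
    pattern = "|".join(sorted(override_services))
    return '%s{service!~"%s"}' % (metric, pattern)

print(parent_rule_filter("example_rule", {"A", "B"}))
# example_rule{service!~"A|B"}
```

The same idea applies one level down, where a service-level rule excludes the projects that override it.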
Education - Grafana default dashboards
• Try to provide basic dashboards for the most common metrics
• node_exporter / nginx_exporter / mysql_exporter / etc
• Use templating from Grafana to drill down
• Shard -> Service -> Project -> Instance
• Prototyping a simple proxy so that developers can ignore shards
• Currently building most dashboards manually
• Difficult to update when you want to change navigation
• Considering ways to auto generate dashboards
Promgen – One Year Later
• Rewritten in Django to take advantage of ORM and admin page
• Better rule editor
• Sharding
• Prometheus Proxy
Promgen – Want to build a better rule editor
Promgen – Easier notification settings

Editor's Notes

  • #4: Like many people, when I was younger I thought it would be fun to make games. After working on games in my free time, I decided that making games (especially FPS games) was hard, and I would try something easier. My first job was fairly standard web backend development; my second job was mostly operations and deployment.
  • #5: LINE is the most popular messenger application in Japan though it is less well known overseas. LINE Fukuoka’s focus is on many of the other applications that make up LINE’s offering
  • #6: Promgen was introduced last year at PromCon by my colleague Wataru Yukawa. It was rewritten in Django for ease of development, and we continue to add new features. The UI is usually the most difficult part: how to present things in an easy-to-understand manner. Existing services used Nagios-like checks, so we are working on installing exporters to provide similar metrics and on porting existing rules to a more Prometheus way of doing things.
  • #7: Since we do not directly use any major cloud provider, we developed Promgen to bridge our on-premise inventory list with Prometheus‘ file_sd_config. Developers are able to register their Service with Promgen, and Promgen handles the work of assigning it to a specific shard and reloading the configuration. To provide HA, we currently run a pair of Prometheus servers for each shard.
  • #8: One half of Promgen’s role is to manage the list of targets and rules for a Prometheus server. The entire list of targets from Promgen is sent to each Prometheus server, and we use relabel configs (similar to any of the Kubernetes or Consul configs) to filter the server list down to the targets that instance is responsible for.
  • #10: Alertmanager is sometimes more difficult than Prometheus to configure, especially when there is a lot of variability to how the alerts should get delivered. Though in extreme cases it does create an additional point of failure, we have found it valuable to route everything through Promgen to allow us more configurability regarding our alert routing. In the case where Promgen is down, we have additional backup notifications that our team watches directly
  • #12: Once you outgrow a single shard, it can be much more difficult to know where to assign targets, and which shard targets were assigned to when you want to query the data. Though it’s currently a bit of a naive implementation, we are testing using Promgen as a Prometheus proxy. We have implemented a few endpoints on Promgen which match the Prometheus API. These accept requests from a source such as Grafana, query each child Prometheus instance, and then combine the results. Perhaps in the future this will be supported natively by Prometheus remote_read to other Prometheus instances.
  • #13: Retention is another area that is difficult to deal with. Prometheus‘ main focus is the current status and recent history. While this works fine most of the time, there are times when we want to look at older, historical data. This becomes more complicated because we would prefer to use the same queries without modification when we do these lookups. Our current prototype environment uses additional Prometheus servers with the same target list as our primary servers. We then build a list of summary rules to generate a summary of the data, and federate it to a longterm storage instance.
  • #15: The more we come to rely on Prometheus, the more closely we follow its development and try to subscribe to any thread that may affect our operation. It has been useful to watch GitHub issues, but there are still issues we have missed, which caused some surprise and confusion when we rolled out updates. One particular surprise was the way federation instance labels changed, which took a few tries to get right, including hitting prometheus_local_storage_out_of_order_samples_total along the way.
  • #16: Here I have my original recording rule to keep track of how many samples are being scraped. Since federation drops the instance label, I re-add it as cluster_node to avoid duplicate metrics. I then take an average to filter it back out and get my final overview.
  • #19: Since rules are evaluated in a basically “global” namespace, it’s even trickier to make sure we write rules that only capture the metrics our team cares about and don’t suddenly send alerts to another team. For Promgen we thought about this for a while, and settled on a solution where rules can be overridden by a child, and we automatically fill in the correct label matchers to facilitate this.
  • #20: Once you have everything feeding into Prometheus, you then need a way to visualize it. Also, for any error, you typically want a dashboard that you can send the developer to, to provide additional context for troubleshooting.