Operating Prometheus
Monitoring Study Group (モニタリング勉強会)
2017/10/27 @kfdm
Self Introduction
• Paul Traylor
• LINE Fukuoka Development Office
• Currently responsible for updating the monitoring environment at LINE Fukuoka
• https://github.com/line/promgen
• https://promcon.io/2017-munich/talks/prometheus-as-a-internal-service/
Operating Prometheus at LINE Fukuoka
• 4 HA Pairs
• ~2000 targets per machine
• ~800k samples per machine
• ~3.5 million samples
• ~7000 exporters
https://github.com/line/promgen
Scaling Prometheus ‒ HA
• Run multiple Prometheus instances with the same targets
• Alerts are de-duplicated by Alertmanager
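Why the HA pair does not page twice can be shown with a toy example. This is only a sketch of the idea, not Alertmanager's implementation: it assumes both replicas send alerts with identical label sets and treats the label set as the alert's identity. The `source` field and the replica names are hypothetical.

```python
def dedupe_alerts(alerts):
    """Collapse alerts that share an identical label set (toy model)."""
    unique = {}
    for alert in alerts:
        # The full label set acts as the identity of the alert.
        key = frozenset(alert["labels"].items())
        # Keep the first copy seen; later copies are duplicates.
        unique.setdefault(key, alert)
    return list(unique.values())


# Both replicas scrape the same targets, so they fire the same alert.
alert = {"labels": {"alertname": "HighErrorRate", "job": "myjob"}}
from_replica_a = [dict(alert, source="prometheus-a")]  # hypothetical name
from_replica_b = [dict(alert, source="prometheus-b")]  # hypothetical name

deduped = dedupe_alerts(from_replica_a + from_replica_b)
```

If the two replicas attach different labels to the same alert (for example a per-replica label), the label sets no longer match and this kind of de-duplication stops working.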
Scaling Prometheus ‒ Shard
• Split targets across multiple servers
• Alertmanager de-duplicates alerts
• Proxy or remote read
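One possible way to split targets is a stable hash of the target address. The slides do not say how targets are split at LINE, so this is only a sketch under that assumption; `shard_for`, `split_targets`, and the example addresses are hypothetical.

```python
import hashlib


def shard_for(target, shard_count):
    """Map a target address to a shard index with a stable hash."""
    digest = hashlib.md5(target.encode("utf-8")).digest()
    # Use the last 8 bytes as an integer so the result is the same on every run.
    return int.from_bytes(digest[-8:], "big") % shard_count


def split_targets(targets, shard_count):
    """Return one target list per shard; every target lands in exactly one."""
    shards = [[] for _ in range(shard_count)]
    for target in targets:
        shards[shard_for(target, shard_count)].append(target)
    return shards


# Hypothetical targets split across 4 servers.
targets = ["host%02d.example:9100" % i for i in range(20)]
shards = split_targets(targets, 4)
```

Because each server only sees its own slice, a query that spans all targets needs something in front of the shards, which is where the proxy or remote read bullet comes in.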
Prometheus 1.8 ‒ Storage Format
https://promcon.io/2016-berlin/talks/the-prometheus-time-series-database/
http://labs.gree.jp/blog/2017/10/16614/
• One series per file
• Rewrites may have to touch millions of files
• Queries may also touch millions of files
• No easy way to back up
Prometheus 2.0 ‒ New Storage Format
https://promcon.io/2017-munich/slides/storing-16-bytes-at-scale.pdf
https://fabxc.org/blog/2017-04-10-writing-a-tsdb/
• Chunks stored in buckets by time
• Chunks past retention setting are just deleted
• Easier to back up
• Easier to compress
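The bucket-by-time idea can be sketched in a few lines. This is a toy model, not the TSDB code: samples are grouped into fixed-width time blocks, and retention drops whole blocks instead of rewriting per-series files. The two-hour block width and the function names are assumptions for illustration.

```python
from collections import defaultdict

BLOCK_SECONDS = 2 * 60 * 60  # assumed block width for this sketch


def block_start(timestamp):
    """Round a timestamp down to the start of its block."""
    return timestamp - timestamp % BLOCK_SECONDS


def bucket_samples(samples):
    """Group (timestamp, value) samples by the block they fall into."""
    blocks = defaultdict(list)
    for timestamp, value in samples:
        blocks[block_start(timestamp)].append((timestamp, value))
    return dict(blocks)


def apply_retention(blocks, now, retention_seconds):
    """Keep only blocks that end after the retention cutoff."""
    cutoff = now - retention_seconds
    return {
        start: samples
        for start, samples in blocks.items()
        if start + BLOCK_SECONDS > cutoff
    }


blocks = bucket_samples([(0, 1.0), (3600, 2.0), (7200, 3.0), (20000, 4.0)])
kept = apply_retention(blocks, now=21600, retention_seconds=7200)
```

Expiring data is a whole-block delete, which is also why backup and compression get easier: old blocks no longer change.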
Prometheus 2.0 ‒ Backups
├── 01BX40G8TA6T1MNSS8JJE7ENPY/
│   ├── chunks/
│   ├── index
│   ├── meta.json
│   └── tombstones
├── 01BX5Y9SSE10VBZK4CMZ86WDR6/
│   ├── chunks/
│   ├── index
│   ├── meta.json
│   └── tombstones
├── lock
└── wal/
    ├── 000760
    └── 000761
• https://github.com/Gouthamve/agni
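Given the layout above, a naive backup could copy only the completed block directories. This is a minimal sketch, not how agni or Prometheus itself does it: it assumes each block's `meta.json` carries `minTime` and `maxTime`, it skips `wal/` and `lock`, and it ignores the case where Prometheus compacts or deletes a block during the copy. `list_blocks` and `copy_blocks` are hypothetical helper names.

```python
import json
import shutil
from pathlib import Path


def list_blocks(data_dir):
    """Return (block_dir, min_time, max_time) for every block with a meta.json."""
    blocks = []
    for meta_path in sorted(Path(data_dir).glob("*/meta.json")):
        meta = json.loads(meta_path.read_text())
        blocks.append((meta_path.parent, meta["minTime"], meta["maxTime"]))
    return blocks


def copy_blocks(data_dir, backup_dir):
    """Copy block directories that are not in the backup yet; return their names."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for block_dir, _min_time, _max_time in list_blocks(data_dir):
        destination = backup_dir / block_dir.name
        if not destination.exists():
            shutil.copytree(block_dir, destination)
            copied.append(block_dir.name)
    return copied
```

Because `wal/` is skipped, the most recent samples that have not been written into a block yet are not part of such a copy.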
Prometheus 2.0 ‒ Flag Changes
• Most flags move from single dash to double dash
• Many storage settings move to tsdb settings
• -config.file -> --config.file
• -storage.local.path -> --storage.tsdb.path
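When updating start-up scripts, the two rules above can be applied mechanically. This sketch knows only the one rename shown on the slide and otherwise just doubles the dash; it cannot tell which 1.x flags were removed in 2.0, so the output still needs checking against `prometheus --help`. `RENAMED` and `convert_flag` are hypothetical names.

```python
# Renames taken from the slide; extend after checking the 2.0 flag list.
RENAMED = {
    "-storage.local.path": "--storage.tsdb.path",
}


def convert_flag(arg):
    """Rewrite one 1.x style argument ("-name=value") into 2.0 style."""
    name, separator, value = arg.partition("=")
    if name in RENAMED:
        return RENAMED[name] + separator + value
    if name.startswith("-") and not name.startswith("--"):
        # Single dash becomes double dash; the flag name itself is kept.
        return "-" + name + separator + value
    return arg  # already double dash, or not a flag at all


old_args = ["-config.file=/etc/prometheus.yml", "-storage.local.path=/data"]
new_args = [convert_flag(arg) for arg in old_args]
```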
Prometheus 2.0 ‒ Rule Format Changes
https://www.robustperception.io/converting-rules-to-the-prometheus-2-0-format/
groups:
- name: alert.rules
  rules:
  - alert: HighErrorRate
    expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
    for: 10m
    annotations:
      summary: High request latency
  - alert: DailyTest
    expr: vector(1)
    for: 1m
    annotations:
      summary: Daily alert test
• ./promtool update rules /path/to/rules
Prometheus 2.0 ‒ Migration
Prometheus 2.0 ‒ Remote Read
• Prometheus 1.8 (Read)
• InfluxDB (Read and Write)
• Graphite (Write)
• OpenTSDB (Write)
• TimescaleDB (Read and Write)
• https://prometheus.io/docs/operating/integrations/
• https://github.com/prometheus/prometheus/tree/master/documentation/examples/remote_storage/remote_storage_adapter
Open Metrics
• https://github.com/RichiH/OpenMetrics
• https://github.com/RichiH/OpenMetrics/blob/master/CONTRIBUTORS.md
Questions?
