Monitoring Kafka w/ Prometheus
Yuto Kawamura (kawamuray)
About me
● Software Engineer @ LINE corp
○ Develop & operate Apache HBase clusters
○ Design and implement data flow between services with ♥ to Apache Kafka
● Recent works
○ Playing with Apache Kafka and Kafka Streams
■ https://issues.apache.org/jira/browse/KAFKA-3471?jql=project%20%3D%20KAFKA%20AND%20assignee%20in%20(kawamuray)%20OR%20reporter%20in%20(kawamuray)
● Past works
○ Docker + Checkpoint Restore @ CoreOS meetup http://www.slideshare.net/kawamuray/coreos-meetup
○ Real-time error notification by aggregating application logs with Norikra (Norikraでアプリログを集計してリアルタイムエラー通知) @ Norikra Meetup #1 http://www.slideshare.net/kawamuray/norikra-meetup
○ Student @ Google Summer of Code 2013, 2014
● https://github.com/kawamuray
How are we (our team) using Prometheus?
● To monitor most of our middleware and the clients in our Java applications
○ Kafka clusters
○ HBase clusters
○ Kafka clients - producer and consumer
○ Stream Processing jobs
Overall Architecture
[Diagram: components shown are Grafana (Dashboard), Direct query, Prometheus (Federation), multiple Prometheus instances, HBase clusters, Kafka cluster, YARN Application, and Pushgateway]
Why Prometheus?
● In-house monitoring tool wasn’t enough for large-scale + high-resolution metrics collection
● Good data model
○ Genuine metric identifier + attributes as labels
■ http_requests_total{code="200",handler="prometheus",instance="localhost:9090",job="prometheus",method="get"}
● Scalable by nature
● Simple philosophy
○ Metrics exposure interface: GET /metrics => Text Protocol
○ Monolithic server
● Flexible but easy PromQL
○ Derive aggregated metrics by composing existing metrics
○ E.g., sum of RX bps across the entire cluster
■ sum(rate(node_network_receive_bytes{cluster="cluster-A",device="eth0"}[30s]) * 8)
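The arithmetic behind the PromQL example above can be sketched in plain Java: take a per-second rate of a byte counter for each host, multiply by 8 to get bits per second, and sum over hosts. This is a simplified illustration only. The class name, method names, and sample numbers are hypothetical, and the real rate() also handles counter resets and extrapolation, which this sketch ignores.

```java
import java.util.List;

// Stdlib-only sketch of the arithmetic in sum(rate(bytes[30s]) * 8).
class BpsSketch {
    // Simplified per-second rate from the first and last counter samples in a window.
    // Assumption: no counter reset happened inside the window.
    static double simpleRate(double firstValue, double lastValue, double windowSeconds) {
        return (lastValue - firstValue) / windowSeconds;
    }

    // Sums bits per second over all hosts; each row is {firstBytes, lastBytes}.
    static double clusterBps(List<double[]> hosts, double windowSeconds) {
        double total = 0.0;
        for (double[] host : hosts) {
            total += simpleRate(host[0], host[1], windowSeconds) * 8; // bytes/s -> bits/s
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical counter readings taken 30 seconds apart on two hosts.
        List<double[]> hosts = List.of(
            new double[] {1_000_000, 1_300_000},
            new double[] {5_000_000, 5_600_000});
        System.out.println(clusterBps(hosts, 30.0));
    }
}
```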
Deployment
● Launch
○ Official Docker image: https://hub.docker.com/r/prom/prometheus/
○ Ansible for dynamic prometheus.yml generation based on inventory and container management
● Machine spec
○ 2.40GHz * 24 CPUs
○ 192GB RAM
○ 6 SSDs
○ Single SSD / Single Prometheus instance
○ Overkill? => Obviously. We reused existing idle servers. You definitely don’t need this crazy spec just to use it.
Kafka monitoring w/ Prometheus overview
[Diagram: Prometheus Server collects from Kafka broker (JMX exporter), Kafka client in Java application (Prometheus Java library + Servlet), YARN ResourceManager (JSON exporter), Stream Processing jobs on YARN (Pushgateway), and the Kafka consumer group exporter]
Monitoring Kafka brokers - jmx_exporter
● https://github.com/prometheus/jmx_exporter
● Run as a standalone process (no -javaagent)
○ Only to avoid a cumbersome rolling restart
○ Might switch to the javaagent at the next rolling restart :p
● With very complicated config.yml
○ https://gist.github.com/kawamuray/25136a9ab22b1cb992e435e0ea67eb06
● Colocate one instance per broker on the same host
Monitoring Kafka producer on Java application - prometheus_simpleclient
● https://github.com/prometheus/client_java
● Official Java client library
prometheus_simpleclient - Basic usage
import io.prometheus.client.Counter;

private static final Counter queueOutCounter =
    Counter.build()
        .namespace("kafka_streams")        // Namespace (= application prefix?)
        .name("process_count")             // Metric name
        .help("Process calls count")       // Metric description
        .labelNames("processor", "topic")  // Declare labels
        .register();                       // Register to CollectorRegistry.defaultRegistry (default, global registry)
...
queueOutCounter.labels("Processor-A", "topic-T").inc();    // Increment the counter for this label combination by 1
queueOutCounter.labels("Processor-B", "topic-P").inc(2.0); // Increment by 2
=> kafka_streams_process_count{processor="Processor-A",topic="topic-T"} 1.0
   kafka_streams_process_count{processor="Processor-B",topic="topic-P"} 2.0
Exposing Java application metrics
● Through servlet
○ io.prometheus.client.exporter.MetricsServlet from simpleclient_servlet
● Add an entry to web.xml, or start an embedded Jetty server:
Server server = new Server(METRICS_PORT);
ServletContextHandler context = new ServletContextHandler();
context.setContextPath("/");
server.setHandler(context);
context.addServlet(new ServletHolder(new MetricsServlet()), "/metrics");
server.start();
Monitoring Kafka producer on Java application - prometheus_simpleclient
● Primitive types:
○ Counter, Gauge, Histogram, Summary
● Kafka’s MetricsReporter interface hands over KafkaMetric instances
● How to expose the value?
● => Implement a proxy metric type that extends SimpleCollector
public class PrometheusMetricsReporter implements MetricsReporter {
    ...
    private void registerMetric(KafkaMetric kafkaMetric) {
        ...
        KafkaMetricProxy.build()
            .namespace("kafka")
            .name(fqn)
            .help("Help: " + metricName.description())
            .labelNames(labelNames)
            .register();
        ...
    }
    ...
}
public class KafkaMetricProxy extends SimpleCollector<KafkaMetricProxy.Child> {
    public static class Builder extends SimpleCollector.Builder<Builder, KafkaMetricProxy> {
        @Override
        public KafkaMetricProxy create() {
            return new KafkaMetricProxy(this);
        }
    }
    KafkaMetricProxy(Builder b) {
        super(b);
    }
    ...
    @Override
    public List<MetricFamilySamples> collect() {
        List<MetricFamilySamples.Sample> samples = new ArrayList<>();
        // One sample per label combination, read from the proxied KafkaMetric
        for (Map.Entry<List<String>, Child> entry : children.entrySet()) {
            List<String> labels = entry.getKey();
            Child child = entry.getValue();
            samples.add(new Sample(fullname, labelNames, labels, child.getValue()));
        }
        return Collections.singletonList(new MetricFamilySamples(fullname, Type.GAUGE, help, samples));
    }
}
Monitoring YARN jobs - json_exporter
● https://github.com/kawamuray/prometheus-json-exporter
○ Can export value from JSON by specifying the value as JSONPath
● http://<rm http address:port>/ws/v1/cluster/apps
○ https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Applications_API
○ https://gist.github.com/kawamuray/c07b03de82bf6ddbdae6508e27d3fb4d
json_exporter
Config:
- name: yarn_application
  type: object
  path: $.apps.app[*]?(@.state == "RUNNING")
  labels:
    application: $.id
    phase: beta
  values:
    alive: 1
    elapsed_time: $.elapsedTime
    allocated_mb: $.allocatedMB
...
+
ResourceManager response:
{"apps":{"app":[
  {
    "id": "application_1234_0001",
    "state": "RUNNING",
    "elapsedTime": 25196,
    "allocatedMB": 1024,
    ...
  },
  ...
]}}
=>
yarn_application_alive{application="application_1234_0001",phase="beta"} 1
yarn_application_elapsed_time{application="application_1234_0001",phase="beta"} 25196
yarn_application_allocated_mb{application="application_1234_0001",phase="beta"} 1024
Important configurations
● -storage.local.retention (default: 15 days)
○ TTL for collected values
● -storage.local.memory-chunks (default: 1M)
○ Practically controls the memory allocation of a Prometheus instance
○ A lower value can cause ingestion throttling (metric loss)
● -storage.local.max-chunks-to-persist (default: 512K)
○ A lower value can likewise cause ingestion throttling
○ https://prometheus.io/docs/operating/storage/#persistence-pressure-and-rushed-mode
○ > Equally important, especially if writing to a spinning disk, is raising the value for the storage.local.max-chunks-to-persist flag. As a rule of thumb, keep it around 50% of the storage.local.memory-chunks value.
● -query.staleness-delta (default: 5 mins)
○ Resolution for detecting lost metrics
○ Can lead to weird behavior in the Prometheus web UI
Query tips - label_replace function
● It’s quite common that two metrics have different label sets
○ E.g., a server-side metric and a client-side metric
● Say we have a metric like:
○ kafka_log_logendoffset{cluster="cluster-A",instance="HOST:PORT",job="kafka",partition="1234",topic="topic-A"}
● Introduce a new label from an existing label
○ label_replace(..., "host", "$1", "instance", "^([^:]+):.*")
○ => kafka_log_logendoffset{...,instance="HOST:PORT",host="HOST"}
● Rewrite an existing label with a new value
○ label_replace(..., "instance", "$1", "instance", "^([^:]+):.*")
○ => kafka_log_logendoffset{...,instance="HOST"}
● It’s even possible to rewrite the metric name… :D
○ label_replace(kafka_log_logendoffset, "__name__", "foobar", "__name__", ".*")
○ => foobar{...}
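The label_replace behavior shown above can be approximated with java.util.regex: the regex must match the whole source label value, and on a match the destination label is set to the replacement with $1 expanded. This is only a rough sketch. The class and method names are hypothetical, Java regex syntax differs from the RE2 syntax Prometheus uses, and edge cases such as empty replacement values are not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stdlib-only approximation of label_replace(v, dst, replacement, src, regex).
class LabelReplaceSketch {
    static Map<String, String> labelReplace(Map<String, String> labels, String dst,
                                            String replacement, String src, String regex) {
        Map<String, String> out = new LinkedHashMap<>(labels);
        String value = labels.getOrDefault(src, "");
        // Anchor the pattern so only a match of the entire label value counts.
        Matcher m = Pattern.compile("\\A(?:" + regex + ")\\z").matcher(value);
        if (m.matches()) {
            // The whole value is replaced, with $1, $2, ... expanded from capture groups.
            out.put(dst, m.replaceFirst(replacement));
        }
        return out; // unchanged copy when the regex does not match
    }

    public static void main(String[] args) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("instance", "kafka01:9092"); // hypothetical HOST:PORT value
        labels.put("topic", "topic-A");
        // Introduce a new "host" label derived from "instance".
        System.out.println(labelReplace(labels, "host", "$1", "instance", "^([^:]+):.*"));
        // Rewrite "instance" in place.
        System.out.println(labelReplace(labels, "instance", "$1", "instance", "^([^:]+):.*"));
    }
}
```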
Points to improve
● Service discovery
○ It’s too cumbersome to configure server list and exporter list statically
○ Pushgateway?
■ > The Prometheus Pushgateway exists to allow ephemeral and batch jobs to expose their metrics to Prometheus - https://github.com/prometheus/pushgateway/blob/master/README.md#prometheus-pushgateway-
○ file_sd_config? https://prometheus.io/docs/operating/configuration/#<file_sd_config>
■ > It reads a set of files containing a list of zero or more <target_group>s. Changes to all defined files are detected via disk watches and applied immediately.
● Local time support :(
○ They don’t like time zones other than UTC, which makes sense though: https://prometheus.io/docs/introduction/faq/#can-i-change-the-timezone?-why-is-everything-in-utc?
○ https://github.com/prometheus/prometheus/issues/500#issuecomment-167560093
○ It might still be possible to introduce a toggle in the view
Conclusion
● Data model is very intuitive
● PromQL is very powerful and relatively easy
○ Helps you pick out the important metrics from hundreds of metrics
● A few pitfalls need to be avoided by tuning configuration
○ memory-chunks, query.staleness-delta…
● Building an exporter is reasonably easy
○ Lots of languages are officially supported…
○ /metrics is the only interface
Questions?
End of Presentation
Metrics naming
● APPLICATIONPREFIX_METRICNAME
○ https://prometheus.io/docs/practices/naming/#metric-names
○ kafka_producer_request_rate
○ http_request_duration
● Fully utilize labels
○ x: kafka_network_request_duration_milliseconds_{max,min,mean}
○ o: kafka_network_request_duration_milliseconds{aggregation="max|min|mean"}
○ Compare all min/max/mean in a single graph: kafka_network_request_duration_milliseconds{instance="HOSTA"}
○ Much more flexible than using static names
Alerting
● Not using Alertmanager
● In-house monitoring tool has alerting capability
○ Has a user directory of alert targets
○ Has a familiar expression syntax for configuring alerts
○ Tool unification is important and should be respected as much as possible
● Then?
○ Built a tool to mirror metrics from Prometheus to the in-house monitoring tool
○ Set up alerts on the in-house monitoring tool
/api/v1/query?query=sum(kafka_stream_process_calls_rate{client_id=~"CLIENT_ID.*"}) by (instance)
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "instance": "HOST_A:PORT"
        },
        "value": [
          1465819064.067,
          "82317.10280584119"
        ]
      },
      {
        "metric": {
          "instance": "HOST_B:PORT"
        },
        "value": [
          1465819064.067,
          "81379.73499610288"
        ]
      }
    ]
  }
}
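A mirroring tool has to send the PromQL expression as a URL-encoded query parameter, since it contains braces, quotes, and spaces. The sketch below builds such a URL with the standard library only. The class name, method name, and base URL are hypothetical, the mirroring tool itself is not shown in the slides, and URLEncoder.encode(String, Charset) assumes Java 10 or later.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Stdlib-only sketch: build an instant-query URL for the /api/v1/query endpoint.
class QueryUrlSketch {
    static String instantQueryUrl(String baseUrl, String promql) {
        // Form-encode the expression so {, }, quotes and spaces survive transport.
        return baseUrl + "/api/v1/query?query=" + URLEncoder.encode(promql, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String promql = "sum(kafka_stream_process_calls_rate{client_id=~\"CLIENT_ID.*\"}) by (instance)";
        // Hypothetical Prometheus address.
        String url = instantQueryUrl("http://prometheus.example:9090", promql);
        System.out.println(url);
        // Decoding the parameter gives back the original expression.
        String encoded = url.substring(url.indexOf("query=") + "query=".length());
        System.out.println(URLDecoder.decode(encoded, StandardCharsets.UTF_8));
    }
}
```

The JSON response can then be parsed with whichever JSON library the mirroring tool already uses; each element of "result" carries the label set under "metric" and a [timestamp, value] pair under "value".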
public class KafkaMetricProxy extends SimpleCollector<KafkaMetricProxy.Child> {
    ...
    public static class Child {
        private KafkaMetric kafkaMetric;
        public void setKafkaMetric(KafkaMetric kafkaMetric) {
            this.kafkaMetric = kafkaMetric;
        }
        double getValue() {
            // Report 0 until a KafkaMetric has been attached
            return kafkaMetric == null ? 0 : kafkaMetric.value();
        }
    }
    @Override
    protected Child newChild() {
        return new Child();
    }
    ...
}
Monitoring Kafka consumer offset - kafka_consumer_group_exporter
● https://github.com/kawamuray/prometheus-kafka-consumer-group-exporter
● Exports some metrics about Kafka consumer groups by executing the kafka-consumer-groups.sh command (bundled with Kafka)
● Specific exporter for a specific use
● It’s better to be familiar with your favorite exporter framework
○ Raw use of official prometheus package: https://github.com/prometheus/client_golang/tree/master/prometheus
○ Mine: https://github.com/kawamuray/prometheus-exporter-harness
Query tips - Product set
● A binary operation between two metrics yields results only for label sets that match on both sides (product set)
● metric_A{cluster="A or B"}
● metric_B{cluster="A or B",instance="a or b or c"}
● metric_A / metric_B
● => {}
● metric_A / sum(metric_B) by (cluster)
● => {cluster="A or B"}
● x: metric_A{cluster="A"} - sum(metric_B{cluster="A"}) by (cluster)
● o: metric_A{cluster="A"} - sum(metric_B) by (cluster) => Same result!
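The matching behavior in this slide can be imitated with plain Java maps: a vector is a map from label set to value, and a binary operation keeps only the label sets present on both sides. This is a simplified sketch of default one-to-one matching only. The class name, method names, and sample values are hypothetical, and modifiers such as on/ignoring or group_left are not modeled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stdlib-only sketch of one-to-one vector matching on identical label sets.
class VectorMatchSketch {
    // lhs / rhs: keeps only label sets that exist on both sides.
    static Map<Map<String, String>, Double> divide(
            Map<Map<String, String>, Double> lhs, Map<Map<String, String>, Double> rhs) {
        Map<Map<String, String>, Double> out = new LinkedHashMap<>();
        for (Map.Entry<Map<String, String>, Double> e : lhs.entrySet()) {
            Double divisor = rhs.get(e.getKey());
            if (divisor != null) {
                out.put(e.getKey(), e.getValue() / divisor);
            }
        }
        return out;
    }

    // sum(vector) by (label): drops every label except the grouping one.
    static Map<Map<String, String>, Double> sumBy(
            Map<Map<String, String>, Double> vector, String label) {
        Map<Map<String, String>, Double> out = new LinkedHashMap<>();
        for (Map.Entry<Map<String, String>, Double> e : vector.entrySet()) {
            Map<String, String> key = Map.of(label, e.getKey().getOrDefault(label, ""));
            out.merge(key, e.getValue(), Double::sum);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Map<String, String>, Double> metricA = new LinkedHashMap<>();
        metricA.put(Map.of("cluster", "A"), 10.0);
        metricA.put(Map.of("cluster", "B"), 20.0);

        Map<Map<String, String>, Double> metricB = new LinkedHashMap<>();
        metricB.put(Map.of("cluster", "A", "instance", "a"), 2.0);
        metricB.put(Map.of("cluster", "A", "instance", "b"), 3.0);
        metricB.put(Map.of("cluster", "B", "instance", "a"), 4.0);

        // Label sets differ (metric_B also has "instance"), so nothing matches.
        System.out.println(divide(metricA, metricB).size());
        // After aggregating by cluster, both sides share the same label sets.
        System.out.println(divide(metricA, sumBy(metricB, "cluster")).size());
    }
}
```

Because only matching label sets survive, filtering one side by cluster="A" is enough; the other side is narrowed implicitly by the match.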
Google Developer Group - Harare
 
fennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solutionfennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solution
shallal2
 
Q1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor PresentationQ1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor Presentation
Dropbox
 
Slack like a pro: strategies for 10x engineering teams
Slack like a pro: strategies for 10x engineering teamsSlack like a pro: strategies for 10x engineering teams
Slack like a pro: strategies for 10x engineering teams
Nacho Cougil
 
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptxReimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
John Moore
 
Unlocking Generative AI in your Web Apps
Unlocking Generative AI in your Web AppsUnlocking Generative AI in your Web Apps
Unlocking Generative AI in your Web Apps
Maximiliano Firtman
 
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make .pptx
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make   .pptxWebinar - Top 5 Backup Mistakes MSPs and Businesses Make   .pptx
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make .pptx
MSP360
 
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Raffi Khatchadourian
 
Transcript: Canadian book publishing: Insights from the latest salary survey ...
Transcript: Canadian book publishing: Insights from the latest salary survey ...Transcript: Canadian book publishing: Insights from the latest salary survey ...
Transcript: Canadian book publishing: Insights from the latest salary survey ...
BookNet Canada
 
Financial Services Technology Summit 2025
Financial Services Technology Summit 2025Financial Services Technology Summit 2025
Financial Services Technology Summit 2025
Ray Bugg
 
GyrusAI - Broadcasting & Streaming Applications Driven by AI and ML
GyrusAI - Broadcasting & Streaming Applications Driven by AI and MLGyrusAI - Broadcasting & Streaming Applications Driven by AI and ML
GyrusAI - Broadcasting & Streaming Applications Driven by AI and ML
Gyrus AI
 
Cybersecurity Threat Vectors and Mitigation
Cybersecurity Threat Vectors and MitigationCybersecurity Threat Vectors and Mitigation
Cybersecurity Threat Vectors and Mitigation
VICTOR MAESTRE RAMIREZ
 
Bepents tech services - a premier cybersecurity consulting firm
Bepents tech services - a premier cybersecurity consulting firmBepents tech services - a premier cybersecurity consulting firm
Bepents tech services - a premier cybersecurity consulting firm
Benard76
 
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Markus Eisele
 
AI 3-in-1: Agents, RAG, and Local Models - Brent Laster
AI 3-in-1: Agents, RAG, and Local Models - Brent LasterAI 3-in-1: Agents, RAG, and Local Models - Brent Laster
AI 3-in-1: Agents, RAG, and Local Models - Brent Laster
All Things Open
 
Everything You Need to Know About Agentforce? (Put AI Agents to Work)
Everything You Need to Know About Agentforce? (Put AI Agents to Work)Everything You Need to Know About Agentforce? (Put AI Agents to Work)
Everything You Need to Know About Agentforce? (Put AI Agents to Work)
Cyntexa
 
Canadian book publishing: Insights from the latest salary survey - Tech Forum...
Canadian book publishing: Insights from the latest salary survey - Tech Forum...Canadian book publishing: Insights from the latest salary survey - Tech Forum...
Canadian book publishing: Insights from the latest salary survey - Tech Forum...
BookNet Canada
 
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
Lorenzo Miniero
 
The Future of Cisco Cloud Security: Innovations and AI Integration
The Future of Cisco Cloud Security: Innovations and AI IntegrationThe Future of Cisco Cloud Security: Innovations and AI Integration
The Future of Cisco Cloud Security: Innovations and AI Integration
Re-solution Data Ltd
 
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdfKit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Kit-Works Team Study_팀스터디_김한솔_nuqs_20250509.pdf
Wonjun Hwang
 
fennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solutionfennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solution
shallal2
 
Q1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor PresentationQ1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor Presentation
Dropbox
 
Slack like a pro: strategies for 10x engineering teams
Slack like a pro: strategies for 10x engineering teamsSlack like a pro: strategies for 10x engineering teams
Slack like a pro: strategies for 10x engineering teams
Nacho Cougil
 
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptxReimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
Reimagine How You and Your Team Work with Microsoft 365 Copilot.pptx
John Moore
 
Unlocking Generative AI in your Web Apps
Unlocking Generative AI in your Web AppsUnlocking Generative AI in your Web Apps
Unlocking Generative AI in your Web Apps
Maximiliano Firtman
 
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make .pptx
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make   .pptxWebinar - Top 5 Backup Mistakes MSPs and Businesses Make   .pptx
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make .pptx
MSP360
 
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Raffi Khatchadourian
 
Transcript: Canadian book publishing: Insights from the latest salary survey ...
Transcript: Canadian book publishing: Insights from the latest salary survey ...Transcript: Canadian book publishing: Insights from the latest salary survey ...
Transcript: Canadian book publishing: Insights from the latest salary survey ...
BookNet Canada
 
Financial Services Technology Summit 2025
Financial Services Technology Summit 2025Financial Services Technology Summit 2025
Financial Services Technology Summit 2025
Ray Bugg
 
GyrusAI - Broadcasting & Streaming Applications Driven by AI and ML
GyrusAI - Broadcasting & Streaming Applications Driven by AI and MLGyrusAI - Broadcasting & Streaming Applications Driven by AI and ML
GyrusAI - Broadcasting & Streaming Applications Driven by AI and ML
Gyrus AI
 
Cybersecurity Threat Vectors and Mitigation
Cybersecurity Threat Vectors and MitigationCybersecurity Threat Vectors and Mitigation
Cybersecurity Threat Vectors and Mitigation
VICTOR MAESTRE RAMIREZ
 
Bepents tech services - a premier cybersecurity consulting firm
Bepents tech services - a premier cybersecurity consulting firmBepents tech services - a premier cybersecurity consulting firm
Bepents tech services - a premier cybersecurity consulting firm
Benard76
 
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Markus Eisele
 
AI 3-in-1: Agents, RAG, and Local Models - Brent Laster
AI 3-in-1: Agents, RAG, and Local Models - Brent LasterAI 3-in-1: Agents, RAG, and Local Models - Brent Laster
AI 3-in-1: Agents, RAG, and Local Models - Brent Laster
All Things Open
 

Monitoring Kafka w/ Prometheus

  • 2. About me
    ● Software Engineer @ LINE corp
      ○ Develop & operate Apache HBase clusters
      ○ Design and implement data flow between services with ♥ to Apache Kafka
    ● Recent works
      ○ Playing with Apache Kafka and Kafka Streams
        ■ https://issues.apache.org/jira/browse/KAFKA-3471?jql=project%20%3D%20KAFKA%20AND%20assignee%20in%20(kawamuray)%20OR%20reporter%20in%20(kawamuray)
    ● Past works
      ○ Docker + Checkpoint Restore @ CoreOS meetup http://www.slideshare.net/kawamuray/coreos-meetup
      ○ Norikraでアプリログを集計してリアルタイムエラー通知 ("Aggregating application logs with Norikra for real-time error notification") @ Norikra Meetup #1 http://www.slideshare.net/kawamuray/norikra-meetup
      ○ Student @ Google Summer of Code 2013, 2014
    ● https://github.com/kawamuray
  • 3. How are we (our team) using Prometheus?
    ● To monitor most of our middleware and the clients inside Java applications
      ○ Kafka clusters
      ○ HBase clusters
      ○ Kafka clients - producer and consumer
      ○ Stream Processing jobs
  • 5. Why Prometheus?
    ● Inhouse monitoring tool wasn't enough for large-scale + high-resolution metrics collection
    ● Good data model
      ○ Genuine metric identifier + attributes as labels
        ■ http_requests_total{code="200",handler="prometheus",instance="localhost:9090",job="prometheus",method="get"}
    ● Scalable by nature
    ● Simple philosophy
      ○ Metrics exposure interface: GET /metrics => Text Protocol
      ○ Monolithic server
    ● Flexible but easy PromQL
      ○ Derive aggregated metrics by composing existing metrics
      ○ E.g., sum of network bps across an entire cluster (receive side, given the metric used)
        ■ sum(rate(node_network_receive_bytes{cluster="cluster-A",device="eth0"}[30s]) * 8)
  • 6. Deployment
    ● Launch
      ○ Official Docker image: https://hub.docker.com/r/prom/prometheus/
      ○ Ansible for dynamic prometheus.yml generation based on inventory, and for container management
    ● Machine spec
      ○ 2.40GHz * 24 CPUs
      ○ 192GB RAM
      ○ 6 SSDs
      ○ One SSD per Prometheus instance
      ○ Overkill? => Obviously. We reused existing idle servers. You certainly don't need this crazy spec just to use it.
  • 7. Kafka monitoring w/ Prometheus overview
    [Architecture diagram. Components shown: Kafka broker; Kafka client in Java application; YARN ResourceManager; Stream Processing jobs on YARN; Prometheus Server; Pushgateway; JMX exporter; Prometheus Java library + Servlet; JSON exporter; Kafka consumer group exporter]
  • 8. Monitoring Kafka brokers - jmx_exporter
    ● https://github.com/prometheus/jmx_exporter
    ● Run as a standalone process (no -javaagent)
      ○ Only to avoid a cumbersome rolling restart
      ○ May switch to the javaagent at the next opportunity for a rolling restart :p
    ● With a very complicated config.yml
      ○ https://gist.github.com/kawamuray/25136a9ab22b1cb992e435e0ea67eb06
    ● Colocate one instance per broker on the same host
  • 9. Monitoring Kafka producer on Java application - prometheus_simpleclient
    ● https://github.com/prometheus/client_java
    ● Official Java client library
  • 10. prometheus_simpleclient - Basic usage
      private static final Counter queueOutCounter = Counter.build()
              .namespace("kafka_streams")        // Namespace (= application prefix?)
              .name("process_count")             // Metric name
              .help("Process calls count")       // Metric description
              .labelNames("processor", "topic")  // Declare labels
              .register();                       // Register to CollectorRegistry.defaultRegistry (default, global registry)
      ...
      queueOutCounter.labels("Processor-A", "topic-T").inc();    // Increment counter with labels
      queueOutCounter.labels("Processor-B", "topic-P").inc(2.0);
    =>
      kafka_streams_process_count{processor="Processor-A",topic="topic-T"} 1.0
      kafka_streams_process_count{processor="Processor-B",topic="topic-P"} 2.0
  • 11. Exposing Java application metrics
    ● Through servlet
      ○ io.prometheus.client.exporter.MetricsServlet from simpleclient_servlet
    ● Add an entry to web.xml or embedded jetty
      ..
      Server server = new Server(METRICS_PORT);
      ServletContextHandler context = new ServletContextHandler();
      context.setContextPath("/");
      server.setHandler(context);
      context.addServlet(new ServletHolder(new MetricsServlet()), "/metrics");
      server.start();
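What the servlet ultimately serves is the plain-text exposition format on GET /metrics. The following is a minimal sketch of that idea using only the JDK's built-in com.sun.net.httpserver, so it runs without simpleclient or Jetty. It is not how MetricsServlet is implemented; the class name TextExpositionSketch and its methods are hypothetical and exist only for illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of any Prometheus library.
final class TextExpositionSketch {
    /** Renders one sample line: name{label="value",...} value */
    static String renderSample(String name, Map<String, String> labels, double value) {
        StringJoiner labelPart = new StringJoiner(",", "{", "}");
        labelPart.setEmptyValue(""); // no braces when there are no labels
        for (Map.Entry<String, String> e : labels.entrySet()) {
            labelPart.add(e.getKey() + "=\"" + escape(e.getValue()) + "\"");
        }
        return name + labelPart + " " + value + "\n";
    }

    /** Escapes backslash, double quote and newline inside a label value. */
    static String escape(String v) {
        return v.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("processor", "Processor-A");
        labels.put("topic", "topic-T");
        String body = renderSample("kafka_streams_process_count", labels, 1.0);

        // Port 0 asks the OS for any free port.
        HttpServer server = HttpServer.create(new InetSocketAddress("localhost", 0), 0);
        server.createContext("/metrics", exchange -> {
            byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders()
                    .set("Content-Type", "text/plain; version=0.0.4; charset=utf-8");
            exchange.sendResponseHeaders(200, bytes.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(bytes);
            }
        });
        server.start();
        try {
            // Scrape ourselves once, the way a Prometheus server would.
            URI uri = URI.create("http://localhost:" + server.getAddress().getPort() + "/metrics");
            try (InputStream in = uri.toURL().openStream()) {
                System.out.print(new String(in.readAllBytes(), StandardCharsets.UTF_8));
            }
        } finally {
            server.stop(0);
        }
    }
}
```

In a real application the official MetricsServlet is the better choice, since it reads from the collector registry and handles every metric type; the sketch only shows how little the interface itself requires.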
  • 12. Monitoring Kafka producer on Java application - prometheus_simpleclient
    ● Primitive types:
      ○ Counter, Gauge, Histogram, Summary
    ● Kafka's MetricsReporter interface hands over KafkaMetric instances
    ● How to expose the value?
    ● => Implement a proxy metric type that extends SimpleCollector
      public class PrometheusMetricsReporter implements MetricsReporter {
          ...
          private void registerMetric(KafkaMetric kafkaMetric) {
              ...
              KafkaMetricProxy.build()
                      .namespace("kafka")
                      .name(fqn)
                      .help("Help: " + metricName.description())
                      .labelNames(labelNames)
                      .register();
              ...
          }
          ...
      }
  • 13.
      public class KafkaMetricProxy extends SimpleCollector<KafkaMetricProxy.Child> {
          public static class Builder extends SimpleCollector.Builder<Builder, KafkaMetricProxy> {
              @Override
              public KafkaMetricProxy create() {
                  return new KafkaMetricProxy(this);
              }
          }

          KafkaMetricProxy(Builder b) {
              super(b);
          }
          ...
          @Override
          public List<MetricFamilySamples> collect() {
              List<MetricFamilySamples.Sample> samples = new ArrayList<>();
              for (Map.Entry<List<String>, Child> entry : children.entrySet()) {
                  List<String> labels = entry.getKey();
                  Child child = entry.getValue();
                  samples.add(new Sample(fullname, labelNames, labels, child.getValue()));
              }
              return Collections.singletonList(new MetricFamilySamples(fullname, Type.GAUGE, help, samples));
          }
      }
  • 14. Monitoring YARN jobs - json_exporter
    ● https://github.com/kawamuray/prometheus-json-exporter
      ○ Exports values from JSON, each selected by a JSONPath expression
    ● http://<rm http address:port>/ws/v1/cluster/apps
      ○ https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Applications_API
      ○ https://gist.github.com/kawamuray/c07b03de82bf6ddbdae6508e27d3fb4d
  • 15. json_exporter
    Config:
      - name: yarn_application
        type: object
        path: $.apps.app[*]?(@.state == "RUNNING")
        labels:
          application: $.id
          phase: beta
        values:
          alive: 1
          elapsed_time: $.elapsedTime
          allocated_mb: $.allocatedMB
      ...
    + JSON from the ResourceManager:
      {"apps":{"app":[
        {
          "id": "application_1234_0001",
          "state": "RUNNING",
          "elapsedTime": 25196,
          "allocatedMB": 1024,
          ...
        },
        ...
      ]}}
    => Exported metrics:
      yarn_application_alive{application="application_1326815542473_0001",phase="beta"} 1
      yarn_application_elapsed_time{application="application_1326815542473_0001",phase="beta"} 25196
      yarn_application_allocated_mb{application="application_1326815542473_0001",phase="beta"} 1024
  • 16. Important configurations
    ● -storage.local.retention (default: 15 days)
      ○ TTL for collected values
    ● -storage.local.memory-chunks (default: 1M)
      ○ Practically controls the memory allocation of a Prometheus instance
      ○ A lower value can cause ingestion throttling (metric loss)
    ● -storage.local.max-chunks-to-persist (default: 512K)
      ○ A lower value can likewise cause ingestion throttling
      ○ https://prometheus.io/docs/operating/storage/#persistence-pressure-and-rushed-mode
      ○ > Equally important, especially if writing to a spinning disk, is raising the value for the storage.local.max-chunks-to-persist flag. As a rule of thumb, keep it around 50% of the storage.local.memory-chunks value.
    ● -query.staleness-delta (default: 5 mins)
      ○ Resolution for detecting lost metrics
      ○ Can lead to weird behavior in the Prometheus web UI
  • 17. Query tips - label_replace function
    ● It's quite common for two metrics to have different label sets
      ○ E.g., server-side metrics and client-side metrics
    ● Say we have a metric like:
      ○ kafka_log_logendoffset{cluster="cluster-A",instance="HOST:PORT",job="kafka",partition="1234",topic="topic-A"}
    ● Introduce a new label from an existing label
      ○ label_replace(..., "host", "$1", "instance", "^([^:]+):.*")
      ○ => kafka_log_logendoffset{...,instance="HOST:PORT",host="HOST"}
    ● Rewrite an existing label with a new value
      ○ label_replace(..., "instance", "$1", "instance", "^([^:]+):.*")
      ○ => kafka_log_logendoffset{...,instance="HOST"}
    ● It's even possible to rewrite the metric name… :D
      ○ label_replace(kafka_log_logendoffset, "__name__", "foobar", "__name__", ".*")
      ○ => foobar{...}
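To make the semantics of the label_replace calls above concrete, here is a rough Java emulation on a plain label map. It is a sketch under assumptions, not Prometheus code: the class LabelReplaceSketch is hypothetical, Java's regex engine stands in for the one Prometheus uses, and the modeled behavior (the regex must match the whole source value, a non-matching series is returned unchanged, an empty replacement drops the label) reflects my reading of the documented function.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical emulation of label_replace(v, dst, replacement, src, regex) for one series.
final class LabelReplaceSketch {
    static Map<String, String> labelReplace(Map<String, String> labels, String dstLabel,
                                            String replacement, String srcLabel, String regex) {
        Map<String, String> out = new LinkedHashMap<>(labels);
        String source = labels.getOrDefault(srcLabel, ""); // a missing label behaves like ""
        Matcher m = Pattern.compile(regex).matcher(source);
        if (!m.matches()) {
            return out; // the regex must match the entire value; otherwise nothing changes
        }
        // Expand $1, $2, ... in the replacement against the full match.
        StringBuffer expanded = new StringBuffer();
        m.appendReplacement(expanded, replacement);
        if (expanded.length() == 0) {
            out.remove(dstLabel); // assumption: an empty result drops the label
        } else {
            out.put(dstLabel, expanded.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("instance", "HOST:PORT");
        labels.put("topic", "topic-A");
        // Introduce a new "host" label derived from "instance".
        System.out.println(labelReplace(labels, "host", "$1", "instance", "^([^:]+):.*"));
        // Rewrite "instance" in place.
        System.out.println(labelReplace(labels, "instance", "$1", "instance", "^([^:]+):.*"));
    }
}
```

Once the client-side and server-side series share a label such as host, they can be combined in a single expression.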
  • 18. Points to improve
    ● Service discovery
      ○ It's too cumbersome to configure the server list and exporter list statically
      ○ Pushgateway?
        ■ > The Prometheus Pushgateway exists to allow ephemeral and batch jobs to expose their metrics to Prometheus - https://github.com/prometheus/pushgateway/blob/master/README.md#prometheus-pushgateway-
      ○ file_sd_config? https://prometheus.io/docs/operating/configuration/#<file_sd_config>
        ■ > It reads a set of files containing a list of zero or more <target_group>s. Changes to all defined files are detected via disk watches and applied immediately.
    ● Local time support :(
      ○ They don't like time zones other than UTC, which makes sense, though: https://prometheus.io/docs/introduction/faq/#can-i-change-the-timezone?-why-is-everything-in-utc?
      ○ https://github.com/prometheus/prometheus/issues/500#issuecomment-167560093
      ○ It might still be possible to introduce a toggle in the view
  • 19. Conclusion
    ● The data model is very intuitive
    ● PromQL is very powerful and relatively easy
      ○ Helps you pick out the important metrics from hundreds of metrics
    ● A few pitfalls need to be avoided by tuning the configuration
      ○ memory-chunks, query.staleness-delta…
    ● Building an exporter is reasonably easy
      ○ Lots of languages are officially supported…
      ○ /metrics is the only interface
  • 22. Metrics naming
    ● APPLICATIONPREFIX_METRICNAME
      ○ https://prometheus.io/docs/practices/naming/#metric-names
      ○ kafka_producer_request_rate
      ○ http_request_duration
    ● Fully utilize labels
      ○ x: kafka_network_request_duration_milliseconds_{max,min,mean}
      ○ o: kafka_network_request_duration_milliseconds{aggregation="max|min|mean"}
      ○ Compare all of min/max/mean in a single graph: kafka_network_request_duration_milliseconds{instance="HOSTA"}
      ○ Much more flexible than static names
  • 23. Alerting
    ● Not using Alertmanager
    ● The inhouse monitoring tool has alerting capability
      ○ Has a user directory of alerting targets
      ○ Uses expressions already familiar to users for configuring alerts
      ○ Tool unification is important and should be respected as much as possible
    ● Then?
      ○ Built a tool to mirror metrics from Prometheus to the inhouse monitoring tool
      ○ Set up alerts on the inhouse monitoring tool
    /api/v1/query?query=sum(kafka_stream_process_calls_rate{client_id=~"CLIENT_ID.*"}) by (instance)
    {
      "status": "success",
      "data": {
        "resultType": "vector",
        "result": [
          { "metric": { "instance": "HOST_A:PORT" }, "value": [ 1465819064.067, "82317.10280584119" ] },
          { "metric": { "instance": "HOST_B:PORT" }, "value": [ 1465819064.067, "81379.73499610288" ] }
        ]
      }
    }
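A mirroring tool like the one described above has to send the PromQL expression as a URL-encoded query parameter, since the raw expression contains braces, quotes and spaces. The sketch below shows one way to build such an instant-query URL with the JDK alone. The class name PromQueryUrlSketch and the host prometheus.example:9090 are hypothetical; only the /api/v1/query path and the query parameter come from the slide.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for building a Prometheus instant-query URL.
final class PromQueryUrlSketch {
    static String instantQueryUrl(String baseUrl, String promql) {
        // URLEncoder produces form encoding (space becomes '+'), which is valid in a query string.
        return baseUrl + "/api/v1/query?query=" + URLEncoder.encode(promql, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String promql =
                "sum(kafka_stream_process_calls_rate{client_id=~\"CLIENT_ID.*\"}) by (instance)";
        // The base URL is a placeholder, not a real endpoint.
        System.out.println(instantQueryUrl("http://prometheus.example:9090", promql));
    }
}
```

The response to such a request has the JSON shape shown on the slide, from which the tool can forward each instance and value pair to the inhouse system.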
  • 24.
      public class KafkaMetricProxy extends SimpleCollector<KafkaMetricProxy.Child> {
          ...
          public static class Child {
              private KafkaMetric kafkaMetric;

              public void setKafkaMetric(KafkaMetric kafkaMetric) {
                  this.kafkaMetric = kafkaMetric;
              }

              double getValue() {
                  return kafkaMetric == null ? 0 : kafkaMetric.value();
              }
          }

          @Override
          protected Child newChild() {
              return new Child();
          }
          ...
      }
  • 25. Monitoring Kafka consumer offset - kafka_consumer_group_exporter
    ● https://github.com/kawamuray/prometheus-kafka-consumer-group-exporter
    ● Exports some metrics about Kafka consumer groups by executing the kafka-consumer-groups.sh command (bundled with Kafka)
    ● A specific exporter for a specific use
    ● It's better to get familiar with your favorite exporter framework
      ○ Raw use of the official prometheus package: https://github.com/prometheus/client_golang/tree/master/prometheus
      ○ Mine: https://github.com/kawamuray/prometheus-exporter-harness
  • 26. Query tips - Product set
    ● A binary operation between two metrics only yields results for series whose label sets match on both sides
    ● metric_A{cluster="A or B"}
    ● metric_B{cluster="A or B",instance="a or b or c"}
    ● metric_A / metric_B
      ○ => {} (empty: metric_B carries an instance label that metric_A lacks)
    ● metric_A / sum(metric_B) by (cluster)
      ○ => {cluster="A or B"}
    ● x: metric_A{cluster="A"} - sum(metric_B{cluster="A"}) by (cluster)
    ● o: metric_A{cluster="A"} - sum(metric_B) by (cluster) => Same result!
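The matching behavior in these examples can be illustrated with a small model in which an instant vector is a map from label set to value, and a binary operator only pairs entries whose label sets are identical. This is a simplified sketch, not Prometheus's implementation: the class VectorMatchSketch is hypothetical, and it ignores the metric name, on/ignoring modifiers and group_left/group_right.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model: an instant vector is a map from label set to sample value.
final class VectorMatchSketch {
    /** Emulates sum(v) by (label): keeps only the grouping label and adds up the values. */
    static Map<Map<String, String>, Double> sumBy(Map<Map<String, String>, Double> vector,
                                                  String label) {
        Map<Map<String, String>, Double> out = new LinkedHashMap<>();
        for (Map.Entry<Map<String, String>, Double> e : vector.entrySet()) {
            String v = e.getKey().get(label);
            Map<String, String> group = (v == null) ? Map.<String, String>of() : Map.of(label, v);
            out.merge(group, e.getValue(), Double::sum);
        }
        return out;
    }

    /** Emulates lhs / rhs with one-to-one matching on identical label sets. */
    static Map<Map<String, String>, Double> divide(Map<Map<String, String>, Double> lhs,
                                                   Map<Map<String, String>, Double> rhs) {
        Map<Map<String, String>, Double> out = new LinkedHashMap<>();
        for (Map.Entry<Map<String, String>, Double> e : lhs.entrySet()) {
            Double right = rhs.get(e.getKey());
            if (right != null) { // series without a partner are silently dropped
                out.put(e.getKey(), e.getValue() / right);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Map<String, String>, Double> metricA = new LinkedHashMap<>();
        metricA.put(Map.of("cluster", "A"), 10.0);
        metricA.put(Map.of("cluster", "B"), 20.0);

        Map<Map<String, String>, Double> metricB = new LinkedHashMap<>();
        metricB.put(Map.of("cluster", "A", "instance", "a"), 2.0);
        metricB.put(Map.of("cluster", "A", "instance", "b"), 3.0);
        metricB.put(Map.of("cluster", "B", "instance", "a"), 4.0);

        // metric_A / metric_B: no label set is identical on both sides.
        System.out.println(divide(metricA, metricB).size());
        // metric_A / sum(metric_B) by (cluster): both sides now carry only "cluster".
        System.out.println(divide(metricA, sumBy(metricB, "cluster")).size());
    }
}
```

The same model explains the last bullet: filtering only the left-hand side is enough, because right-hand series for other clusters never find a partner and drop out of the result.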