FluentdとGrowthForecastを使った可視化〜 Haikanko OSS化への道(4)

こんにちは @sonots です。Haikanko OSS化への道と称した Fluentd 連載第四回です。過去の記事はこちら

今回は前回解説したログ監視機能に加え、Haikanko が提供するもう１つのフィーチャー(SIer 的に言うソリューション！)である可視化の機能についてお話したいと思います。

結構よく聞く構成なので、あまりブログに書く必要はないような気もしますが、連載という形でまとめたいので書いておきます。

やりたいこと

アプリケーションログのステータスコード(2xx, 3xx, 4xx, 5xx)の１分あたりの数をグラフ化したい
アプリケーションログのレスポンスタイム(<1sec, <2sec, <3sec, <4sec, >=4sec)の１分あたりの数をグラフ化したい
１分間の平均レスポンスタイム、最悪(最大)レスポンスタイムをグラフ化したい
これらのグラフをホスト毎に出したい

というのを Fluentd + GrowthForecast でできる所までやります。レスポンスタイムをグラフ化したい、というのは5秒ルールがあるソシャゲ業界っぽくてアツいですね。

実現方法

fluent-agent-lite で拾ったログ出力を fluent-plugin-parser で受け取ってパースした後に、fluent-plugin-datacounter および fluent-plugin-numeric-monitor でメトリクスを集計して fluent-plugin-growthforecast に投げています。

大分ぼやっとさせていますが、こんなグラフが出せます。

さらにそのメトリクスを fluent-plugin-notifier に渡して、閾値を超えたら、fluent-plugin-ikachan と fluent-plugin-mail 経由で通知するような設定も書いています。

設定ファイル

というわけで、全部貼ると設定はこんなかんじになります。以下、解説していきます。

type copy

<store>

type parser

remove_prefix raw

key_name message

format ltsv

replace_invalid_sequence true

suppress_parse_error_log true

</store>

</match>

type copy

<store>

type datacounter

count_interval 60

aggregate tag

output_per_tag yes

tag_prefix count

count_key status

pattern1 2xx ^2\d\d$

pattern2 3xx ^3\d\d$

pattern4 4xx ^4\d\d$

pattern5 5xx ^5\d\d$

</store>

<store>

type datacounter

count_interval 60

aggregate tag

output_per_tag yes

tag_prefix count

count_key reqtime

pattern1 <1sec ^0.*$

pattern2 <2sec ^1.*$

pattern3 <3sec ^2.*$

pattern4 <4sec ^3.*$

pattern5 >=4sec ^.*$

</store>

<store>

type numeric_monitor

count_interval 60

aggregate tag

output_per_tag yes

tag_prefix monitor

monitor_key reqtime

output_key_prefix response_time

percentiles 99

</store>

</match>

type copy

<store>

type amplifier_filter

ratio 1000000

add_prefix amplifier

key_pattern _(max|avg|percentile_.*)$

</store>

</match>

type copy

<store>

type growthforecast

remove_prefix amplifier.monitor.accesslog

gfapi_url http://growthforecast:5125/api/

service accesslog

tag_for section

name_keys response_time_max,response_time_percentile_99,response_time_avg

</store>

</match>

type copy

<store>

type growthforecast

remove_prefix count.accesslog

gfapi_url http://growthforecast:5125/api/

service accesslog

tag_for section

name_keys 2xx_count,3xx_count,400_count,4xx_count,5xx_count,<1sec_count,<2sec_count,<3sec_count,<4sec_count,>=4sec_count

</store>

<store>

type notifier

default_tag_warn notifier.count.accesslog

default_tag_crit notifier.count.accesslog

input_tag_remove_prefix count.accesslog

<def>

check numeric_upward

warn_threshold 1

crit_threshold 100

target_keys 5xx_count

pattern accesslog

</def>

</store>

</match>

type copy

<store>

type ikachan

...

</store>

<store>

type mail

...

</store>

</match>

fluent-plugin-parser の設定

fluent-agent-lite で流したログデータを構造化しています。本当は format に正規表現を書いているのですが、全部 ltsv だったらいいのになぁという願望でこの記事では format ltsv と書いていますww

type copy

<store>

type parser

remove_prefix raw

key_name message

format ltsv

replace_invalid_sequence true

suppress_parse_error_log true

</store>

</match>

例えば、fluent-agent-lite から渡ってくるデータが

raw.accesslog.host1: {"message":"time:2013/01/13T07-02-13\treqtime:4.23423\tstatus:500\tmethod:GET\turi:/login"}

だった場合に、

accesslog.host1: {"time":"2013/01/13T07-02-13", "reqtime":"4.23423", "status":"500", "method":"GET", "uri":"/login"}

のような出力になります。

オプションの補足をしておきます。

replace_invalid_sequence というのは、invalid UTF-8 な文字列が流れてくると、正規表現パーサーのところで例外が出るので、そういう文字列を ? に置き換えてしまうためのオプションです。＃そういうログがあったので pull req させて頂きました m(_ _)m

suppress_parse_error_log というのは、正規表現にマッチしない行をパースしたときに表示される「pattern not matched:」のようにでる warn ログの出力を抑えるためのオプションです。本番環境でそのログが出てもしょうがないので true にして押さえています。開発中は false にしていますね。

flunt-plugin-datacounter の設定

最初の datacounter 設定では、入力の status キーを見てステータスコード(2xx, 3xx, 4xx, 5xx)の１分あたりの数を、２つ目の datacounter 設定では入力の reqtime キーを見てレスポンスタイム(<1sec, <2sec, <3sec, <4sec, >=4sec)の１分あたりの数をカウントしています。

type copy

<store>

type datacounter

count_interval 60

aggregate tag

output_per_tag yes

tag_prefix count

count_key status

pattern1 2xx ^2\d\d$

pattern2 3xx ^3\d\d$

pattern3 4xx ^4\d\d$

pattern4 5xx ^5\d\d$

</store>

<store>

type datacounter

count_interval 60

aggregate tag

output_per_tag yes

tag_prefix count

count_key reqtime

pattern1 <1sec ^0.*$

pattern2 <2sec ^1.*$

pattern3 <3sec ^2.*$

pattern4 <4sec ^3.*$
pattern5 >=4sec ^.*$

</store>
</match>

例えば、fluent-plugin-parser から渡ってくる１分間(count_interval)のデータが

accesslog.host1: {"time":"2013/01/13T07-02-13", "reqtime":"5.23423", "status":"500", "method":"GET", "uri":"/login"}

accesslog.host1: {"time":"2013/01/13T07-02-13", "reqtime":"0.23423", "status":"200", "method":"POST", "uri":"/login"}

accesslog.host1: {"time":"2013/01/13T07-02-13", "reqtime":"8.23423", "status":"500", "method":"GET", "uri":"/me"}

だった場合に、

count.accesslog.host1: {"unmatched_count":0, "<1sec_count":1, "<2sec_count":0, "<3sec_count":4, "<4sec_count":0, ">=4sec_count":0}
count.accesslog.host1: {"unmatched_count":0, "2xx_count":1, "3xx_count":0, "4xx_count":0, "5xx_count":2}

のようにデータが出力されます。※ 本当は xxx_rate, xxx_percent という値も出力されますが、今回は使ってないので紙面の都合上省略しています。

aggregate tag かつ output_per_tag yes にしているので、タグ毎に集計して出力されていますね。

fluent-plugin-numeric-monitor の設定

もう１つ、fluent-plugin-numeric-monitor を利用してレスポンスタイムの MAX, 平均, percnetile_99 (99%上位でのMAX) をグラフ化しています。

datacounter と同じように aggregate tag、output_per_tag yes を使えるように pull req しています。また、output_key_prefix で出力JSON のキー名を設定できるようにもしてます。

type copy
<store>

type numeric_monitor

count_interval 60

aggregate tag

output_per_tag yes

tag_prefix monitor

monitor_key reqtime

output_key_prefix response_time

percentiles 99

</store>
</match>

ただ、GrowthForecast がグラフの値として整数値しか受け取れない(小数点切り捨てになる)という制限があるので、numeric-monitor の出力を fluent-pluign-amplifier-filter に渡して大きな値にしています。ratio の値はここでは 1000000 と固定値にしていて、表示上の調整はあとで GrowthForecast 上でアプリ毎にやることにしています。

type copy

<store>

type amplifier_filter

ratio 1000000

add_prefix amplifier

key_pattern _(max|avg|percentile_.*)$

</store>

</match>

この２つの設定の結果、例えば fluent-plugin-parser から渡ってくる１分間(count_interval)のデータが

accesslog.host1: {"time":"2013/01/13T07-02-13", "reqtime":"5.23423", "status":"500", "method":"GET", "uri":"/login"}

accesslog.host1: {"time":"2013/01/13T07-02-13", "reqtime":"0.23423", "status":"200", "method":"POST", "uri":"/login"}

だった場合に、

amplifier.monitor.accesslog.host1: {"response_time_max":"8234230", "response_time_avg":"4567563", "response_time_percentile_99":"4234230"}

のようにデータが出力されます。

fluent-plugin-growthforecast の設定

データのメトリクスはここまでで取れたので、あとは fluent-plugin-grothforecast を使って GrowthForecast にデータを投げるだけですね。

工夫と言えるのは、remove_prefix と tag_for section を使って、GrowthForecast のセクション名(第二階層)はホスト名になるようにしている所ですかね。

type copy

<store>

type growthforecast

remove_prefix count.accesslog

gfapi_url http://growthforecast:5125/api/

service accesslog

tag_for section

name_keys 2xx_count,3xx_count,400_count,4xx_count,5xx_count,<1sec_count,<2sec_count,<3sec_count,<4sec_count,>=4sec_count

</store>

省略

</match>

あとは GrowthForecast 側で複合グラフを作ったり、単位を変えたりグラフの色を設定したりしています。それらの作業は growthforecast-client を利用して一括でできるようにしています。

fluent-plugin-notifier の設定

さらに、グラフが見れるようになったら、それを元に閾値を設定して、アラート飛ばしたいというのは当然の欲求だと思うので、取得したデータのメトリクスに対して閾値を設定してアラートを飛ばせるようにもしています。

fluent-plugin-notifier を利用して、前回も説明した fluent-plugin-ikachan と fluent-plugin-mail を使って IRC とメールで通知できるようにしています。

type copy

<store>

type growthforecast

省略

</store>

<store>

type notifier

default_tag_warn notifier.count.accesslog

default_tag_crit notifier.count.accesslog

input_tag_remove_prefix count.accesslog

<def>

check numeric_upward

warn_threshold 1

crit_threshold 100

target_keys 5xx_count

pattern accesslog

</def>

</store>

</match>

type copy

<store>

type ikachan

...

</store>

<store>

type mail

...

</store>

</match>

まとめ

というかんじでメトリクスのグラフ描画ができるようになりました！さらに閾値を決めてアラートも飛ばせるようになりました！

Haikanko としては GrowthForecast に投げる fluentd の設定をするまでが責務で、そのあとの GrowthForecast のグラフ設定は growthforecast-client ベースの別ツールで行っているかんじですね。

それでは！

sonots:blog