The expressions used in triggers are very flexible. You can use them to create complex logical tests regarding monitored statistics.
A simple expression uses a function that is applied to the item with some parameters. The function returns a result that is compared to the threshold, using an operator and a constant.
The syntax of a simple useful expression is function(/host/key,parameter)<operator><constant>.
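For example, an expression that applies the min function to the received-bytes item over the last five minutes and compares the result with a 100 kilobyte threshold will trigger if the number of received bytes during the last five minutes was always over 100 kilobytes.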
While the syntax is exactly the same, from the functional point of view there are two types of trigger expressions: the problem expression and the optional recovery expression.
When defining a problem expression alone, this expression will be used both as the problem threshold and the problem recovery threshold. As soon as the problem expression evaluates to TRUE, there is a problem. As soon as the problem expression evaluates to FALSE, the problem is resolved.
When defining both problem expression and the supplemental recovery expression, problem resolution becomes more complex: not only the problem expression has to be FALSE, but also the recovery expression has to be TRUE. This is useful to create hysteresis and avoid trigger flapping.
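Trigger functions allow referencing the collected values, the current time and other factors.
A complete list of supported functions is available.
Most numeric functions accept a number of seconds as a parameter.
You can use the prefix # to give the parameter a different meaning:
Function call | Meaning |
---|---|
sum(600) | Sum of all values within 600 seconds |
sum(#5) | Sum of the last 5 values |
The function last has a different meaning when used with the # prefix: it selects the Nth most recent value. Given the values 3, 7, 2, 6, 5 (the five most recent, newest first), last(#2) returns 7 and last(#5) returns 5.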
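The difference between a time-based parameter and a #-prefixed count can be sketched in Python. This is only an illustration, not Zabbix code: the sample list, the function names and the choice to treat the older edge of the time window as exclusive are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical sample store: (unix_timestamp, value) pairs, newest first.
Sample = Tuple[int, float]


def sum_period(samples: List[Sample], seconds: int, now: int) -> float:
    """Sum of all values collected within the last `seconds` seconds, like sum(600)."""
    return sum(value for ts, value in samples if now - seconds < ts <= now)


def sum_count(samples: List[Sample], count: int) -> float:
    """Sum of the latest `count` values regardless of their age, like sum(#5)."""
    return sum(value for _, value in samples[:count])


def last_nth(samples: List[Sample], n: int) -> float:
    """The Nth most recent value, like last(#N); n=1 is the newest value."""
    return samples[n - 1][1]


now = 1000
# Values 3, 7, 2, 6, 5 from the text, newest first, one sample every 200 seconds.
samples = [(1000, 3), (800, 7), (600, 2), (400, 6), (200, 5)]

print(last_nth(samples, 2))           # 7
print(last_nth(samples, 5))           # 5
print(sum_count(samples, 5))          # 23
print(sum_period(samples, 600, now))  # 12 (samples at 1000, 800 and 600)
```

With the same data, the count-based sum covers all five values while the 600-second sum covers only the three newest ones.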
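Some functions support an additional second parameter, a time shift. This parameter allows referencing data from a period in the past. For example, avg(1h,1d) returns the average value for one hour, one day ago.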
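As a rough sketch of what a time shift means, the Python function below averages the values in a window that ended a given number of seconds ago. The function name, the sample layout and the window boundaries are assumptions for illustration only.

```python
from typing import List, Tuple

HOUR = 3600
DAY = 86400


def avg_shifted(samples: List[Tuple[int, float]], period: int, shift: int, now: int) -> float:
    """Average of values in the `period`-second window that ended `shift` seconds ago."""
    end = now - shift
    start = end - period
    window = [value for ts, value in samples if start < ts <= end]
    if not window:
        raise ValueError("no data in the requested window")
    return sum(window) / len(window)


now = 10 * DAY
samples = [
    (now - DAY - 1800, 2.0),  # one day ago, inside the shifted hour
    (now - DAY - 600, 4.0),   # one day ago, inside the shifted hour
    (now - 600, 10.0),        # today, outside the shifted window
]

print(avg_shifted(samples, HOUR, DAY, now))  # 3.0
```

Here the call corresponds to avg(1h,1d): today's value of 10.0 is ignored because the window is moved back by one day.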
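You can use the supported unit symbols in trigger expressions, for example '5m' (minutes) instead of '300' seconds or '1d' (days) instead of '86400' seconds. '1k' stands for '1024' bytes.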
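The suffix arithmetic can be checked with a small Python helper. It is a sketch only: the function name is hypothetical, and the s, h and w multipliers are added as assumptions next to the m, d and k symbols named in the text.

```python
# Multipliers named in the text: m (minutes), d (days), k (1024 bytes).
# s, h and w are included here as assumptions for completeness.
MULTIPLIERS = {
    "s": 1,
    "m": 60,
    "h": 3600,
    "d": 86400,
    "w": 604800,
    "k": 1024,
}


def expand_suffix(text: str) -> int:
    """Convert a value such as '5m' or '1k' to its plain number."""
    text = text.strip()
    if text and text[-1] in MULTIPLIERS:
        return int(text[:-1]) * MULTIPLIERS[text[-1]]
    return int(text)


print(expand_suffix("5m"))  # 300
print(expand_suffix("1d"))  # 86400
print(expand_suffix("1k"))  # 1024
```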
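The values needed for trigger evaluation are cached by Zabbix server. Because of this, trigger evaluation causes a higher database load for some time after the server restarts. The cached values are not cleared when item history values are removed (either manually or by the housekeeper), so the server will use the cached values until they are older than the time periods defined in trigger functions or until the server is restarted.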
The following operators are supported for triggers (in descending priority of execution):
Priority | Operator | Definition | Notes for unknown values | Force cast operand to float ¹ |
---|---|---|---|---|
1 | - | Unary minus | -Unknown → Unknown | Yes |
2 | not | Logical NOT | not Unknown → Unknown | Yes |
3 | * | Multiplication | 0 * Unknown → Unknown (yes, Unknown, not 0 - to not lose Unknown in arithmetic operations); 1.2 * Unknown → Unknown | Yes |
3 | / | Division | Unknown / 0 → error; Unknown / 1.2 → Unknown; 0.0 / Unknown → Unknown | Yes |
4 | + | Arithmetical plus | 1.2 + Unknown → Unknown | Yes |
4 | - | Arithmetical minus | 1.2 - Unknown → Unknown | Yes |
5 | < | Less than. The operator is defined as: A<B ⇔ (A<B-0.000001) | 1.2 < Unknown → Unknown | Yes |
5 | <= | Less than or equal to. The operator is defined as: A<=B ⇔ (A≤B+0.000001) | Unknown <= Unknown → Unknown | Yes |
5 | > | More than. The operator is defined as: A>B ⇔ (A>B+0.000001) | | Yes |
5 | >= | More than or equal to. The operator is defined as: A>=B ⇔ (A≥B-0.000001) | | Yes |
6 | = | Is equal. The operator is defined as: A=B ⇔ (A≥B-0.000001) and (A≤B+0.000001) | | No ¹ |
6 | <> | Not equal. The operator is defined as: A<>B ⇔ (A<B-0.000001) or (A>B+0.000001) | | No ¹ |
7 | and | Logical AND | 0 and Unknown → 0; 1 and Unknown → Unknown; Unknown and Unknown → Unknown | Yes |
8 | or | Logical OR | 1 or Unknown → 1; 0 or Unknown → Unknown; Unknown or Unknown → Unknown | Yes |
¹ String operand is still cast to numeric if:
(If the cast fails - numeric operand is cast to a string operand and both operands get compared as strings.)
not, and and or operators are case-sensitive and must be in lowercase. They also must be surrounded by spaces or parentheses.
All operators, except unary - and not, have left-to-right associativity. Unary - and not are non-associative (meaning -(-1) and not (not 1) should be used instead of --1 and not not 1).
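The comparison definitions in the table can be written out directly. The following Python sketch mirrors them with a tolerance of 0.000001; the function names are made up for the example and the code covers numeric operands only.

```python
EPSILON = 0.000001  # tolerance used in the operator definitions


def lt(a: float, b: float) -> bool:
    """A<B  <=>  A < B - 0.000001"""
    return a < b - EPSILON


def le(a: float, b: float) -> bool:
    """A<=B  <=>  A <= B + 0.000001"""
    return a <= b + EPSILON


def gt(a: float, b: float) -> bool:
    """A>B  <=>  A > B + 0.000001"""
    return a > b + EPSILON


def ge(a: float, b: float) -> bool:
    """A>=B  <=>  A >= B - 0.000001"""
    return a >= b - EPSILON


def eq(a: float, b: float) -> bool:
    """A=B  <=>  (A >= B - 0.000001) and (A <= B + 0.000001)"""
    return ge(a, b) and le(a, b)


def ne(a: float, b: float) -> bool:
    """A<>B  <=>  (A < B - 0.000001) or (A > B + 0.000001)"""
    return lt(a, b) or gt(a, b)


print(eq(1.0, 1.0000005))  # True: the difference is inside the tolerance
print(lt(1.0, 1.0000005))  # False: not smaller by more than the tolerance
print(gt(5.1, 5))          # True
```

One consequence of these definitions is that two values closer together than the tolerance compare as equal, so neither < nor > holds for them.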
Evaluation result:
www.zabbix.com is overloaded
{www.zabbix.com:system.cpu.load[all,avg1].last()}>5 or {www.zabbix.com:system.cpu.load[all,avg1].min(10m)}>2
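The expression is true when the current processor load is more than 5 or the minimum processor load over the last 10 minutes is more than 2.
/etc/passwd has been changed
Use of function diff:
The expression is true when the checksum of /etc/passwd differs from the most recent value.
Similar expressions could be useful to monitor changes in important files, such as /etc/passwd, /etc/inetd.conf, /kernel, etc.
Someone is downloading a large file from the Internet
Use of function min:
The expression is true when the number of received bytes on eth0 is more than 100 KB within the last 5 minutes.
Both nodes of the clustered SMTP server are down. Note the use of two different hosts in one expression: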
{smtp1.zabbix.com:net.tcp.service[smtp].last()}=0 and {smtp2.zabbix.com:net.tcp.service[smtp].last()}=0
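The expression is true when the SMTP servers on both smtp1.zabbix.com and smtp2.zabbix.com are down.
Zabbix agent needs to be upgraded
Use of function str():
The expression is true if Zabbix agent has version beta8 (presumably 1.0beta8).
Server is unreachable
The expression is true if host "zabbix.zabbix.com" is unreachable more than 5 times in the last 30 minutes.
No heartbeats within last 3 minutes
Use of function nodata():
To make use of this trigger, 'tick' must be defined as a Zabbix trapper item. The host should periodically send data for this item using zabbix_sender.
If no data is received within 180 seconds, the trigger value becomes PROBLEM.
Note that 'nodata' can be used for any item type.
CPU activity at night time
Use of function time():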
{zabbix:system.cpu.load[all,avg1].min(5m)}>2 and {zabbix:system.cpu.load[all,avg1].time()}>000000 and {zabbix:system.cpu.load[all,avg1].time()}<060000
The trigger may change its status to true only at night (00:00-06:00).
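The night-time condition compares the current time, expressed as an HHMMSS number, with 000000 and 060000. A minimal Python sketch of the same test follows; the function names are hypothetical and the 5-minute minimum load is passed in as a plain number rather than read from an item.

```python
from datetime import datetime


def hhmmss(moment: datetime) -> int:
    """Time of day as an HHMMSS integer, e.g. 03:30:00 becomes 33000."""
    return moment.hour * 10000 + moment.minute * 100 + moment.second


def night_load_problem(min_load_5m: float, moment: datetime) -> bool:
    """Mirror of: min(5m)>2 and time()>000000 and time()<060000."""
    now = hhmmss(moment)
    return min_load_5m > 2 and now > 0 and now < 60000


print(night_load_problem(3.0, datetime(2024, 1, 1, 3, 30, 0)))   # True
print(night_load_problem(3.0, datetime(2024, 1, 1, 12, 0, 0)))   # False
```

Because both time comparisons are strict, exactly 00:00:00 and 06:00:00 fall outside the window.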
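Check if client local time is in sync with Zabbix server time
Use of function fuzzytime():
The trigger will change to the problem state when the local time on server MySQL_DB and the time on Zabbix server differ by more than 10 seconds.
Comparing the average load today with the average load at the same time yesterday (using a second "time shift" parameter).
This trigger will fire if the average load of the last hour is more than twice the average load of the same hour yesterday.
Using the value of another item to get a trigger threshold: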
{Template PfSense:hrStorageFree[{#SNMPVALUE}].last()}<{Template PfSense:hrStorageSize[{#SNMPVALUE}].last()}*0.1
The trigger will fire if free storage drops below 10 percent.
Using evaluation result to get the number of triggers over a threshold:
({server1:system.cpu.load[all,avg1].last()}>5) + ({server2:system.cpu.load[all,avg1].last()}>5) + ({server3:system.cpu.load[all,avg1].last()}>5)>=2
The trigger will fire if at least two of the triggers in the expression are over 5.
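This pattern works because each comparison yields 1 or 0, so the results can be added and the sum compared with a count. A small Python sketch of the same idea, with a hypothetical function name and made-up load values:

```python
from typing import Iterable


def at_least_n_over(loads: Iterable[float], threshold: float, n: int) -> bool:
    """True if at least `n` of the values are above `threshold`."""
    # Each comparison contributes 1 (true) or 0 (false) to the sum.
    hits = sum(1 if load > threshold else 0 for load in loads)
    return hits >= n


print(at_least_n_over([6.0, 2.0, 7.5], threshold=5, n=2))  # True
print(at_least_n_over([6.0, 2.0, 3.0], threshold=5, n=2))  # False
```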
Comparing two string values - operands are:
Problem: detect changes in the DNS query
The item key is:
with macros defined as
and normally returns:
So our trigger expression to detect if the DNS query result deviated from the expected result is:
last(/Zabbix server/net.dns.record[8.8.8.8,{$WEBSITE_NAME},{$DNS_RESOURCE_RECORD_TYPE},2,1])<>"{$WEBSITE_NAME} {$DNS_RESOURCE_RECORD_TYPE} 0 mail.{$WEBSITE_NAME}"
Notice the quotes around the second operand.
Comparing two string values - operands are:
Problem: detect if the /tmp/hello file content is equal to:
Option 1) write the string directly
Notice how \ and " characters are escaped when the string gets compared directly.
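For Option 1, the escaping can be produced mechanically. The Python helper below is a sketch with a hypothetical name: it escapes backslashes first, then double quotes, and wraps the result in double quotes so it can be pasted as a string operand. The sample text is made up for the example.

```python
def quote_for_expression(text: str) -> str:
    """Wrap text in double quotes, escaping backslashes first and then quotes."""
    # Backslashes must be handled first, otherwise the backslashes added
    # for the quotes would themselves be doubled.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


print(quote_for_expression(r'C:\tmp "x"'))  # "C:\\tmp \"x\""
```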
Option 2) use a macro
in the expression:
Comparing long-term periods.
Problem: Load of Exchange server increased by more than 10% last month
You may also use the Event name field in trigger configuration to build a meaningful alert message, for example to receive something like
"Load of Exchange server increased by 24% in July (0.69) comparing to June (0.56)"
the event name must be defined as:
Load of {HOST.HOST} server increased by {{?100*trendavg(//system.cpu.load,1M:now/M)/trendavg(//system.cpu.load,1M:now/M-1M)}.fmtnum(0)}% in {{TIME}.fmttime(%B,-1M)} ({{?trendavg(//system.cpu.load,1M:now/M)}.fmtnum(2)}) comparing to {{TIME}.fmttime(%B,-2M)} ({{?trendavg(//system.cpu.load,1M:now/M-1M)}.fmtnum(2)})
It is also useful to allow manual closing in trigger configuration for this kind of problem.
Sometimes an interval is needed between problem and recovery states, rather than a simple threshold. For example, if we want to define a trigger that reports a problem when server room temperature goes above 20°C and we want it to stay in the problem state until the temperature drops below 15°C, a simple trigger threshold at 20°C will not be enough.
Instead, we need to define a trigger expression for the problem event first (temperature above 20°C). Then we need to define an additional recovery condition (temperature below 15°C). This is done by defining an additional Recovery expression parameter when defining a trigger.
In this case, problem recovery will take place in two steps:
The recovery expression will be evaluated only when the problem event is resolved first.
The recovery expression being TRUE alone does not resolve a problem if the problem expression is still TRUE!
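The interplay of the two expressions can be sketched as a small state machine in Python. It uses the 20°C and 15°C thresholds from the example above; the function names and the temperature readings are made up for illustration.

```python
from typing import List


def next_state(in_problem: bool, temperature: float) -> bool:
    """Return the new problem state after one temperature reading."""
    problem_expr = temperature > 20   # problem expression
    recovery_expr = temperature < 15  # recovery expression
    if not in_problem:
        return problem_expr
    # Recovery needs the problem expression FALSE and the recovery expression TRUE.
    if not problem_expr and recovery_expr:
        return False
    return True


def run(readings: List[float]) -> List[bool]:
    """Feed readings in order and record the problem state after each one."""
    state = False
    history = []
    for temperature in readings:
        state = next_state(state, temperature)
        history.append(state)
    return history


print(run([18, 21, 19, 16, 14, 18]))
# [False, True, True, True, False, False]
```

The readings of 19 and 16 keep the problem open even though they are below 20, which is the hysteresis effect: the problem closes only once the temperature drops below 15.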
Temperature in server room is too high.
Problem expression:
Recovery expression:
Free disk space is too low.
Problem expression: it is less than 10GB for last 5 minutes
Recovery expression: it is more than 40GB for last 10 minutes
Versions before Zabbix 3.2 are very strict about unsupported items in a trigger expression. Any unsupported item in the expression immediately renders trigger value to Unknown.
Since Zabbix 3.2 there is a more flexible approach to unsupported items by admitting unknown values into expression evaluation:
Logical expressions with or and and can be evaluated to known values in two cases regardless of unknown operands:
"1 or Unsupported_item1.some_function() or Unsupported_item2.some_function() or ..." can be evaluated to '1' (True),
"0 and Unsupported_item1.some_function() and Unsupported_item2.some_function() and ..." can be evaluated to '0' (False).
Logical expressions are evaluated with unsupported items taken as Unknown values. In the two cases mentioned above a known value will be produced; in other cases trigger value will be Unknown.
A function value that cannot be determined is Unknown and it takes part in further expression evaluation.
Note that unknown values may "disappear" only in logical expressions as described above. In arithmetic expressions unknown values always lead to result Unknown (except division by 0).
If a trigger expression with several unsupported items evaluates to Unknown, the error message in the frontend refers to the last unsupported item evaluated.
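The way Unknown behaves in logical and arithmetic operations can be modelled as three-valued logic. The Python sketch below uses None to stand in for Unknown; the function names are hypothetical, and treating the operands symmetrically (so that the position of the known operand does not matter) is an assumption of this example.

```python
UNKNOWN = None  # stands in for the Unknown value


def is_false(x) -> bool:
    return x is not UNKNOWN and x == 0


def is_true(x) -> bool:
    return x is not UNKNOWN and x != 0


def logical_and(a, b):
    """0 and Unknown -> 0; 1 and Unknown -> Unknown."""
    if is_false(a) or is_false(b):
        return 0
    if a is UNKNOWN or b is UNKNOWN:
        return UNKNOWN
    return 1


def logical_or(a, b):
    """1 or Unknown -> 1; 0 or Unknown -> Unknown."""
    if is_true(a) or is_true(b):
        return 1
    if a is UNKNOWN or b is UNKNOWN:
        return UNKNOWN
    return 0


def multiply(a, b):
    """Arithmetic keeps Unknown: 0 * Unknown -> Unknown."""
    if a is UNKNOWN or b is UNKNOWN:
        return UNKNOWN
    return a * b


print(logical_and(0, UNKNOWN))  # 0
print(logical_or(1, UNKNOWN))   # 1
print(logical_and(1, UNKNOWN))  # None (Unknown)
print(multiply(0, UNKNOWN))     # None (Unknown)
```

Only the logical operators can turn an Unknown operand into a known result; the arithmetic helper always passes Unknown through.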