We do not need logs, we need logs (with a buzzword name)
What is described there exists, is used, and already has a name: structured logs. However, if buzzwording it makes the approach more popular, then I’m all for it.
What people IMHO confuse most of the time is thinking that logs == the text that is saved in files/shown on the console. But that is a map-territory confusion: the textual form is only one of the possible presentations of a log. With structured logging we get that “wide events” thingy, and a lot of logging libraries support it. And all “three pillars of observability” are just different aggregations of these structured logs:
- Metrics are windowed aggregations over numeric fields (plus some metadata) extracted from the logs
- Textual logs are a human-readable rendering of the same logs
- Traces are the same logs grouped by “request ID” (or whatever field ties together a single user action in the system), as sketched below
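To make the “different aggregations” point concrete, here is a minimal sketch (all field names are invented for illustration, not taken from any particular library) of one structured event and the three views derived from a stream of such events:

```python
import json
from collections import defaultdict

# One structured log event per unit of work (illustrative field names).
event = {
    "timestamp": "2024-01-01T12:00:00Z",
    "request_id": "abc-123",
    "route": "/checkout",
    "status": 500,
    "duration_ms": 231,
    "message": "payment provider timed out",
}
events = [event]  # imagine a stream of these

# "Metrics": a windowed aggregation over a numeric field, keyed by some metadata.
duration_sum_by_route = defaultdict(float)
for e in events:
    duration_sum_by_route[e["route"]] += e["duration_ms"]

# "Textual logs": a human-readable rendering of the very same event.
line = f'{event["timestamp"]} {event["status"]} {event["route"]} - {event["message"]}'

# "Traces": the same events grouped by the field that ties one user action together.
events_by_request = defaultdict(list)
for e in events:
    events_by_request[e["request_id"]].append(e)

print(json.dumps(event))
print(line)
print(dict(duration_sum_by_route))
```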
However, there is a catch: the data volumes differ. For example, it is still useful to gather metrics separately within your system, as that gives you a more immediate view and lets you afford far more detail than you could by deriving everything from logs. The Grafana article describes it quite nicely:
[…] the 1KB of logs per request might let you have the equivalent of 100 metrics, at 1000 requests per second the same amount of network bandwidth would allow for 1,000,000 metrics per server every 10 seconds.
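The arithmetic behind that quote, spelled out (the ~10 bytes per metric sample is my assumption, implied by “1KB ≈ 100 metrics”):

```python
# Back-of-envelope check of the quoted numbers.
bytes_per_request_logs = 1_000        # "1KB of logs per request"
bytes_per_metric_sample = 10          # assumed, implied by "1KB ~= 100 metrics"
requests_per_second = 1_000

log_bandwidth = bytes_per_request_logs * requests_per_second   # 1 MB/s
metrics_per_second = log_bandwidth / bytes_per_metric_sample   # 100,000/s
print(metrics_per_second * 10)  # ~1,000,000 metric samples per 10 seconds
```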
So while it is a super buzzwordy article, if it makes structured logs more popular then you have my bow, even though I really dislike inventing new names for things just for the sake of marketing.
From my understanding, wide events imply structured logs. But the other way around isn’t necessarily true.
In the case of a web service for example, if you have a log entry which indicates an HTTP request, and another one which indicates the request becoming authenticated (with a user ID), that’s structured logging, but not wide events.
You’d need to be able to augment the main log entry to add many attributes to it (such as the user ID), and not just emit logs every time you have new data coming in.
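A minimal sketch of that difference (hypothetical helper, not any particular framework’s API): instead of emitting a log line at each step, the handler keeps attaching attributes to one per-request event and emits it once at the end.

```python
import json
import time

class RequestEvent:
    """Accumulates attributes during a request; emitted once when the request ends."""

    def __init__(self, method, path):
        self.fields = {"method": method, "path": path, "start": time.time()}

    def set(self, key, value):
        self.fields[key] = value

    def emit(self):
        start = self.fields.pop("start")
        self.fields["duration_ms"] = round((time.time() - start) * 1000, 1)
        print(json.dumps(self.fields))  # one wide event per request

# Usage: keep enriching the same event instead of logging each step separately.
evt = RequestEvent("GET", "/orders/42")
evt.set("user_id", "u-981")       # added after authentication
evt.set("db_query_count", 3)      # added by the data layer
evt.set("status", 200)
evt.emit()
```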
That is also my take: “wide events” is a pattern, and especially tooling, atop structured logging. Just having structured logs is no help if the log collector is unable to surface the breadth (length) of a trace.
I’m not necessarily convinced it’s a better paradigm, because I find “span” a much more intuitive term than “wide event” (and you generally know statically whether your event is a point or has a length), but tracing being a layer of formatting over a plain event primitive makes sense.
Traces (OpenTelemetry) and metrics (Scuba) are quite different things, and both have their place.
Traces are very useful for following everything that happened from the start to the end of a request / job / etc. A trace id is passed along with the events, giving the full story of an interaction. They can be very useful for debugging business logic.
Metrics are very useful for querying and aggregating events at a larger scale, great for observability at a macro level. OpenTelemetry is probably the wrong tool here.
To expand on this - I’d love to be able to derive metrics from traces.
However, most users (myself included) can only really justify processing a sampled subset of traces, which introduces some very wide error bars to your metrics.
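For what it’s worth, if each retained trace records the rate it was sampled at, you can at least recover unbiased estimates by weighting each one by 1/sample_rate; the variance behind those wide error bars is still there, though. A rough sketch with made-up fields:

```python
# Estimate totals from a sampled set of traces by weighting each retained
# trace with the inverse of its sampling probability.
sampled_traces = [
    {"status": 200, "sample_rate": 0.01},   # kept with 1% probability
    {"status": 500, "sample_rate": 1.0},    # errors kept at 100%
]

estimated_total = sum(1 / t["sample_rate"] for t in sampled_traces)   # ~101 requests
estimated_errors = sum(
    1 / t["sample_rate"] for t in sampled_traces if t["status"] >= 500
)
print(estimated_total, estimated_errors)
```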
Look at these fools buying pitta bread and sourdough. All you need is whole wheat flour.
Yes, you can roll your own metrics and tracing system from structured logs, and if you can do it post hoc that’s great, but it makes sense to use dedicated tooling to track and visualize this information appropriately. Traces in particular are a game changer.
Wide events are a good way to model telemetry data in the write path. But metrics, traces, and logs are actually patterns of consumption of that data, essentially different transformations of the same information, optimized for specific read use cases.
You can probably build an acceptable logs product directly on top of wide events. You might be able to build a traces product that way too, with various caveats. But you definitely can’t do metrics from events with any kind of reasonable performance at any kind of non-trivial scale.
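A toy illustration of that read/write asymmetry (names and numbers invented): answering “requests per minute” from raw events means scanning every event at query time, whereas a metrics pipeline pays that cost once on the write path and the query becomes a lookup.

```python
from collections import Counter

events = [{"timestamp": 1_700_000_000 + i, "status": 200} for i in range(10_000)]

# Read path: derive the metric from raw events at query time (cost grows with volume).
def requests_per_minute_from_events(events):
    buckets = Counter()
    for e in events:
        buckets[e["timestamp"] // 60] += 1
    return buckets

# Write path: keep a pre-aggregated counter updated as events arrive,
# so the query is just a dictionary lookup.
pre_aggregated = Counter()
def record(event):
    pre_aggregated[event["timestamp"] // 60] += 1

for e in events:
    record(e)

assert requests_per_minute_from_events(events) == pre_aggregated
```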
Good article. Reminds me of Brandur’s article recommending “canonical log lines”: https://brandur.org/canonical-log-lines.
I disagree that you only need wide events. But I agree that the concept of 1 log event per service call that contains all the information about that call is really really useful.
Internally, when I worked there, Amazon had a similar wide-logs approach for their service log format (used by ~100k+ services). These drive metrics and reporting and allow a similar sort of metrics deep dive to the one in the article. They were great for operational awareness (what problems are we having, and where are they happening), but sequential narrow logs (application logs, in Amazon speak) were often still required to find what the heck your application did to cause the problem.
This is mainly because, to create a wide event, you map a 3+-dimensional set of information representing what happened during any one service call (time, key, value) onto a two-dimensional structure (key, value). The mapping intentionally aggregates data to reduce cardinality and make it reasonable to drive operations systems.
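A sketch of that flattening (illustrative only): the per-call timeline of (time, key, value) observations gets collapsed into one (key, value) record, typically via counts and sums, which is exactly where the detail needed for “what did my code actually do” gets lost.

```python
from collections import defaultdict

# Timeline of what happened during one service call: (offset_ms, key, value).
timeline = [
    (1, "cache.get", "miss"),
    (3, "db.query", 12.0),    # ms
    (9, "db.query", 48.0),
    (60, "retry", 1),
    (75, "db.query", 5.0),
]

# Collapse into a single wide event: counts and sums per key; ordering and timing are gone.
wide_event = defaultdict(float)
for _, key, value in timeline:
    if isinstance(value, (int, float)):
        wide_event[f"{key}.count"] += 1
        wide_event[f"{key}.sum"] += value
    else:
        wide_event[f"{key}.{value}.count"] += 1

print(dict(wide_event))
```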
The whole point of the wide event is that you do not need to reduce cardinality.
It’s hard to tell from the vague descriptions in the article, but Scuba doesn’t sound exceptional to me. Visualising event volumes over time with the ability to filter by named attributes is structured logging 101; any passable log store (Elasticsearch/Datadog) will do it. Enforcing sample rates in the schema sounds… interesting? Maybe a useful practice? But I struggled to follow that paragraph.
Disclaimer: I’ve read a lot about it but never used Scuba myself.
A few main differences between Scuba/Honeycomb and Elasticsearch:
- Elasticsearch performs best with a strict schema defined, which is at odds with the free-form key=value that Scuba encourages. In Scuba you basically have three mandatory fields: event_id, timestamp, sampling_rate (a sketch of such an event follows this list).
- Cosmetic difference, but Kibana’s UI is not designed for this kind of drill-down/discovery exercise. It’s possible to do, but a bit clunky. Scuba is designed for exactly this. I guess one could build a dedicated UI on top of ES.
- Performance: because ES is document-based, it’s difficult to get it to answer large aggregation queries in under a second over multiple weeks’ worth of data.
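For illustration, a Scuba-style event as I understand it from descriptions like the above (every field beyond the three mandatory ones is arbitrary, and the names here are made up), plus how a query can use sampling_rate to scale results back to original-traffic units:

```python
from collections import Counter

# Free-form events: only event_id, timestamp and sampling_rate are required;
# everything else is arbitrary key=value, with no schema declared up front.
events = [
    {"event_id": "a1", "timestamp": 1_700_000_000, "sampling_rate": 0.01,
     "endpoint": "/search", "status": 500, "datacenter": "eu-west"},
    {"event_id": "a2", "timestamp": 1_700_000_003, "sampling_rate": 0.01,
     "endpoint": "/search", "status": 200, "experiment": "ranker-v2"},
]

# Drill-down: group errors by whatever attribute looks suspicious,
# weighting each event by 1/sampling_rate to estimate true counts.
breakdown = Counter()
for e in events:
    if e.get("status", 0) >= 500:
        breakdown[e.get("datacenter", "unknown")] += 1 / e["sampling_rate"]

print(breakdown)  # Counter({'eu-west': 100.0})
```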
The differences can look small, but they’re real, and they’re what many ex-Meta employees lament when they say there’s no Scuba equivalent in OSS.
I’m surprised none of the ex-Meta employees has built something here.
Honeycomb was built by ex-Meta employees and is the closest thing to Scuba.
I would even say, based on what glimpses of Scuba I have seen here and there, that Honeycomb’s UX is miles above Scuba’s for this use case.