
Logs vs. metrics: a false dichotomy

Every once in a while, I end up in a conversation about the relative merits of logs and metrics in the context of software operations. Usually, we treat logs and metrics as logically distinct things, and then have to decide which is more important, because putting all things in the world into rank order is important and valuable work…

…or not. Perhaps this is a silly discussion to be having over and over again?

This post is about how “logs vs. metrics” is a false dichotomy, and how thinking in this binary prevents us from seeing simpler ways to monitor our systems.

Are we talking about the same things?

When I say “logs,” I mean records of discrete events which contain some structured or semi-structured data about the event, along with the timestamp at which the event occurred.

When I say “metrics,” I mean a time-series recording of some quantity of interest. Usually, the only data in the time-series will be the value of the quantity at a series of points in time.

One important difference between logs and metrics is that each log record typically represents a single event – that is, they are disaggregated at point of collection. Metrics are almost always collected and aggregated in the same step (although they may later be re-aggregated).
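To make the difference concrete, here is a minimal sketch in Python; the field names and values are invented for illustration, not taken from any particular system.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# A log record: one discrete event, with its context preserved.
@dataclass
class LogRecord:
    timestamp: float                 # when the event occurred
    fields: Dict[str, Any] = field(default_factory=dict)  # structured event data

# A metric sample: a pre-aggregated value of a named quantity at a point in time.
@dataclass
class MetricSample:
    name: str         # e.g. "http_requests_total"
    timestamp: float  # when the value was recorded
    value: float      # the aggregate; the individual events behind it are gone

# One event, with all of its context...
request = LogRecord(
    timestamp=1700000000.123,
    fields={"method": "GET", "path": "/search", "status": 200,
            "duration_ms": 87.5, "customer_id": "c-42"},
)
# ...versus one aggregated number per collection interval.
rpm = MetricSample(name="http_requests_total", timestamp=1700000040.0, value=1312.0)
```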

With those definitions, here are three observations about logs and metrics:

1. You need both logs and metrics for almost any system

If you need alerts, you need metrics. You may well calculate metrics directly and implicitly from a log stream, but you’re still computing a metric. The evaluation of a metric against some threshold is almost always how you will define an alert.
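For illustration, here is a sketch (with an invented event shape and threshold) of how an alert reduces to evaluating a metric derived from a log stream:

```python
# A few invented request events, as they might appear in a structured log stream.
events = [
    {"timestamp": 1700000001.0, "status": 200},
    {"timestamp": 1700000002.0, "status": 500},
    {"timestamp": 1700000003.0, "status": 200},
    {"timestamp": 1700000004.0, "status": 503},
]

# Compute a metric over the window: the fraction of requests that failed.
errors = sum(1 for e in events if e["status"] >= 500)
error_rate = errors / len(events)

# The alert is simply that metric evaluated against a threshold.
ERROR_RATE_THRESHOLD = 0.05  # illustrative value
if error_rate > ERROR_RATE_THRESHOLD:
    print(f"ALERT: error rate {error_rate:.1%} exceeds {ERROR_RATE_THRESHOLD:.0%}")
```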

If you want a hope in hell of diagnosing interesting production issues, you need logs. No matter how many metrics you have collected, you will face difficult problems which cannot be diagnosed by looking only at time-series. Because metrics are typically pre-aggregated by a relatively small number of low-cardinality dimensions, they can make it effectively impossible to see problems affecting only a single customer, or network, or request path (say).

If you don’t collect your logs and make them searchable, then perhaps at the moment you get at this data by SSHing into production servers and tailing logs on the box. Perhaps you generate it on the fly using tcpdump or other tracing tools. Regardless, in my experience there will always be important issues in production that are impossible to debug solely by staring at dashboards of time-series.

2. You can derive arbitrary metrics from log streams

Logs are, by the definition above, disaggregated. This means that you can slice and dice your log stream along whatever dimensions you like, and then compute aggregates over fields in your logs however you wish. Even if your logs are sampled – that is, you don’t record every event that occurs in your system – you can still compute most aggregate statistics over your logs with bounded and known error.
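For example, here is a sketch of slicing a tiny, invented log stream by one such dimension and computing a latency aggregate per slice:

```python
import math
from collections import defaultdict

def percentile(values, p):
    """Nearest-rank percentile over raw values."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# A handful of invented request events from a structured log stream.
events = [
    {"customer_id": "c-1", "duration_ms": 12.0},
    {"customer_id": "c-1", "duration_ms": 840.0},
    {"customer_id": "c-2", "duration_ms": 95.0},
    {"customer_id": "c-2", "duration_ms": 101.0},
    {"customer_id": "c-2", "duration_ms": 99.0},
]

# Slice the stream along whatever dimension you like...
by_customer = defaultdict(list)
for e in events:
    by_customer[e["customer_id"]].append(e["duration_ms"])

# ...then compute whatever aggregate you like over each slice.
for customer, durations in by_customer.items():
    print(customer, "count:", len(durations),
          "p95_ms:", percentile(durations, 95))
```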

To put it more clearly: you can derive almost all metrics of interest from a suitable log stream.

Metrics, by the definition above, are pre-aggregated and contextless. You cannot recreate a log stream from a metric time-series.

3. Most metric re-aggregation is inaccurate or plain wrong

Hopefully most people know this by now: many interesting kinds of metric are very hard to aggregate and re-aggregate correctly. For example, if you collect 99th percentile latency on each server, there is no way of combining those time-series to give you an accurate 99th percentile latency across all servers.
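A quick numerical illustration of why this fails, with invented latencies:

```python
import math

def p99(values):
    """Nearest-rank 99th percentile."""
    ordered = sorted(values)
    rank = max(1, math.ceil(0.99 * len(ordered)))
    return ordered[rank - 1]

# Invented latencies: server A is mostly fast with a few slow requests,
# server B is uniformly middling.
server_a = [10] * 98 + [500] * 2   # per-server p99 = 500
server_b = [200] * 100             # per-server p99 = 200

print("average of per-server p99s:  ", (p99(server_a) + p99(server_b)) / 2)  # 350.0
print("max of per-server p99s:      ", max(p99(server_a), p99(server_b)))    # 500
print("true p99 across all requests:", p99(server_a + server_b))             # 200
```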

The best you can hope for is that your metric collection supports the recording of histograms (such as HdrHistogram or Circonus log-linear histograms) which can be combined to give bounded-error quantile estimates.
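As a very rough sketch of the idea, with crude fixed-width buckets standing in for a real implementation like HdrHistogram:

```python
from collections import Counter

BUCKET_MS = 10  # bucket width; real implementations use log-linear buckets

def to_histogram(latencies_ms):
    """Each server records bucket counts rather than a single percentile."""
    return Counter(int(v // BUCKET_MS) for v in latencies_ms)

def merge(histograms):
    """Histograms combine exactly: just add the counts bucket by bucket."""
    merged = Counter()
    for h in histograms:
        merged.update(h)
    return merged

def quantile(histogram, q):
    """Estimate a quantile from bucket counts; the error is bounded by BUCKET_MS."""
    total = sum(histogram.values())
    seen = 0
    for bucket in sorted(histogram):
        seen += histogram[bucket]
        if seen >= q * total:
            return (bucket + 1) * BUCKET_MS  # upper edge of the bucket
    return None

server_a = to_histogram([12, 15, 18, 350])
server_b = to_histogram([95, 99, 101, 480])
print("combined p99 estimate (ms):", quantile(merge([server_a, server_b]), 0.99))
```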

In practice, teams continue to ignore this problem and instead rely on aggregate time-series that are essentially impossible to interpret correctly, such as “maximum of 99th percentile latency by server.”

A false dichotomy

Taken together, these three observations make it clear that “logs vs. metrics” isn’t a very helpful framing for the discussion. Almost all production systems will require some kind of metric analysis, for alerting purposes if nothing else. Equally, every production system I’ve ever worked on has at some point required me to dig into event logs to debug difficult problems.

And if arbitrary metrics can be computed from log streams, while the reverse isn’t possible, doesn’t this suggest that it may sometimes be most productive to start by collecting structured logs, and to build metric extraction on top of that infrastructure? This is at odds with the way many of us have built these systems in the past, where metric pipelines and log pipelines are considered and built independently.

Lastly, displaying correct aggregate metrics will usually require sophisticated metric types such as histograms. And if you want to pivot by more than a small handful of dimensions in your data, you need one time-series for each combination of dimension values. This means that pivoting by high-cardinality dimensions (such as client IP or user ID) is usually prohibitively expensive.
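To put some illustrative numbers on it: pivoting by just three dimensions with, say, 10 endpoints, 20 status codes and 100,000 user IDs means up to 10 × 20 × 100,000 = 20,000,000 distinct time-series, before you add a single further dimension.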

Logs and metrics are not two fundamentally different approaches. They are different projections of the same underlying data. Logs are necessarily more complete, because they preserve context. Sometimes this makes them more expensive to handle. Sometimes it makes the difference between understanding an issue and not.

Further reading

This post is not a pitch, but if you want to see what putting these ideas into practice might look like, you should try out Honeycomb. It’s not an exaggeration to say that if you haven’t actually used a product that does this, it’s hard to imagine how powerful a tool it can be.

Other things you might be interested in reading or experimenting with: