Every once in a while, I end up in a conversation about the relative merits
of logs and metrics in the context of software operations. Usually, we treat
logs and metrics as logically distinct things, and then have to decide which
is more important, because putting all things in the world into rank order is
important and valuable work…
…or not. Perhaps this is a silly discussion to be having over and over
again?
This post is about how “logs vs. metrics” is a false dichotomy, and how
thinking in this binary prevents us from seeing simpler ways to monitor our
systems.
Are we talking about the same things?
When I say “logs,” I mean records of discrete events which contain some
structured or semi-structured data about the event, along with the timestamp
at which the event occurred.
When I say “metrics,” I mean a time-series recording of some quantity of
interest. Usually, the only data in the time-series will be the value of the
quantity at a series of points in time.
One important difference between logs and metrics is that each log record
typically represents a single event – that is, they are disaggregated at
point of collection. Metrics are almost always collected and aggregated in
the same step (although they may later be re-aggregated).
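To make the distinction concrete, here is a minimal sketch (in Python, with invented field names) of what a single log record and a single metric sample might look like:

```python
# A log record: one discrete event, with a timestamp and arbitrary
# structured context. Field names and values here are hypothetical.
log_event = {
    "timestamp": "2019-03-04T12:00:01.532Z",
    "service": "checkout",
    "customer_id": "cust_4182",
    "request_path": "/api/v1/orders",
    "status": 500,
    "duration_ms": 231.4,
}

# A metric sample: one pre-aggregated value for one named series at one
# point in time. The per-event context is already gone.
metric_sample = {
    "name": "checkout.request_errors",
    "timestamp": "2019-03-04T12:00:00Z",
    "value": 17,  # count of errors in the preceding collection interval
}
```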
With those definitions, here are three observations about logs and metrics:
1. You need both logs and metrics for almost any system.
If you need alerts, you need metrics. You may well calculate metrics directly
and implicitly from a log stream, but you’re still computing a metric. The
evaluation of a metric against some threshold is almost always how you will
define an alert.
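To illustrate: an alert on error rate is ultimately a metric compared against a threshold, whether that metric comes from a dedicated pipeline or is computed on the fly from log events. A minimal sketch, assuming log events shaped like the example above but with a parsed timestamp and a numeric status field (the 5% threshold is likewise invented):

```python
from datetime import datetime, timedelta, timezone

def error_rate(events, window=timedelta(minutes=5)):
    """Compute an error-rate metric over the most recent window of log events."""
    cutoff = datetime.now(timezone.utc) - window
    recent = [e for e in events if e["timestamp"] >= cutoff]
    if not recent:
        return 0.0
    errors = sum(1 for e in recent if e["status"] >= 500)
    return errors / len(recent)

def should_alert(events, threshold=0.05):
    # The alert itself is just the metric evaluated against a threshold.
    return error_rate(events) > threshold
```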
If you want a hope in hell of diagnosing interesting production issues, you
need logs. No matter how many metrics you have collected, you will face
difficult problems which cannot be diagnosed by looking only at time-series.
Because metrics are typically pre-aggregated by a relatively small number of
low-cardinality dimensions, they can make it effectively impossible to see
problems affecting only a single customer, or network, or request path (say).
If you don’t collect your logs and make them searchable, then perhaps at the
moment you get these logs by SSHing into production servers and tailing logs
on the box. Perhaps you get them by generating them on-the-fly using tcpdump
or other tracing tools. Regardless, it is my experience that there
will always be important issues in production that are impossible to debug
solely by staring at dashboards displaying time-series.
2. You can derive arbitrary metrics from log streams
Logs are, by the definition above, disaggregated. This means that you can
slice and dice your log stream along whatever dimensions you like, and then
compute aggregates over fields in your logs however you wish. Even if your
logs are sampled – that is, you don’t record every event that occurs in your
system – you can still compute most aggregate statistics over your logs with
bounded and known error.
To put it more clearly: you can derive almost all metrics of interest from a
suitable log stream.
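As a sketch of what that looks like, assuming the same invented event shape as before: any field in the log record can serve as a grouping dimension, chosen at query time rather than at collection time, and sampled logs can be corrected by weighting each record by its sample rate.

```python
from collections import defaultdict

def latency_quantile_by(events, dimension, quantile=0.99):
    """Group events by an arbitrary dimension and compute a latency quantile
    per group -- a metric derived on demand from the log stream."""
    groups = defaultdict(list)
    for e in events:
        groups[e[dimension]].append(e["duration_ms"])
    result = {}
    for key, values in groups.items():
        values.sort()
        result[key] = values[min(int(quantile * len(values)), len(values) - 1)]
    return result

def count_by(events, dimension):
    """Counts per dimension value. If the log stream is sampled, each record
    is assumed to carry a sample_rate field (how many events it represents)
    and is weighted by it, giving an estimate with known error."""
    counts = defaultdict(float)
    for e in events:
        counts[e[dimension]] += e.get("sample_rate", 1)
    return dict(counts)

# e.g. latency_quantile_by(events, "customer_id")
#      count_by(events, "request_path")
```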
Metrics, by the definition above, are pre-aggregated and contextless. You
cannot recreate a log stream from a metric time-series.
3. Most metric re-aggregation is inaccurate or plain wrong
Hopefully most people know this by now: many interesting kinds of metric are
very hard to aggregate and re-aggregate correctly. For example, if you
collect 99th percentile latency on each server, there is no way of
combining those time-series to give you an accurate 99th percentile latency
across all servers.
The best you can hope for is that your metric collection supports the
recording of histograms (such as HdrHistogram or Circonus log-linear
histograms), which can be combined to give bounded-error quantile estimates.
In practice, teams continue to ignore this problem and instead rely on
aggregate time-series that are essentially impossible to interpret correctly,
such as “maximum of 99th percentile latency by server.”
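A small simulation makes the failure mode visible. The distributions and numbers below are invented, but the effect is generic: once the per-server percentiles have been computed, no re-aggregation of them recovers the true fleet-wide percentile.

```python
import random

random.seed(0)

# Simulated per-request latencies (ms): two fast, busy servers and one
# slow, low-traffic server.
servers = {
    "a": [random.expovariate(1 / 20) for _ in range(10_000)],
    "b": [random.expovariate(1 / 20) for _ in range(10_000)],
    "c": [random.expovariate(1 / 200) for _ in range(1_000)],
}

def p99(values):
    values = sorted(values)
    return values[min(int(0.99 * len(values)), len(values) - 1)]

per_server = {name: p99(latencies) for name, latencies in servers.items()}

# Common but misleading re-aggregations of the per-server series:
wrong_mean = sum(per_server.values()) / len(per_server)
wrong_max = max(per_server.values())

# The true fleet-wide p99 needs the underlying distribution (raw events,
# or mergeable histograms), not the already-aggregated percentiles.
true_p99 = p99([v for latencies in servers.values() for v in latencies])

print(per_server, wrong_mean, wrong_max, true_p99)
```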
A false dichotomy
Taken together, these three observations make it clear that “logs vs.
metrics” isn’t a very helpful framing for the discussion. Almost all
production systems will require some kind of metric analysis, for alerting
purposes if nothing else. Then again, all production systems I’ve ever worked
on have at some point required me to dig into event logs to debug difficult
problems.
And if arbitrary metrics can be computed from log streams, while the reverse
isn’t possible, doesn’t this suggest that sometimes it may be most productive
to start by collecting structured logs, and build metric extraction on top of
that infrastructure? This is at odds with the way many of us have built these
systems in the past, where metric pipelines and log pipelines are considered
and built independently.
Lastly, solving the general problem of displaying correct aggregate metrics
usually requires sophisticated metric types such as histograms. If you want to
pivot by more than a small handful of dimensions in your data, you require
one time-series for each combination of dimension values. This means that
pivoting by high-cardinality dimensions (such as client IP or user ID) is
usually prohibitively expensive.
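A back-of-the-envelope sketch, with invented cardinalities, of why this becomes expensive so quickly:

```python
# One pre-aggregated time-series is needed per combination of dimension
# values; the cardinalities below are hypothetical.
dimensions = {
    "service": 50,
    "endpoint": 200,
    "status_code": 10,
    "customer_id": 100_000,  # a single high-cardinality dimension
}

series = 1
for cardinality in dimensions.values():
    series *= cardinality

print(f"{series:,} time-series")  # 10,000,000,000
```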
Logs and metrics are not two fundamentally different approaches. They are
different projections of the same underlying data. Logs are necessarily more
complete, because they preserve context. Sometimes this makes them more
expensive to handle. Sometimes it makes the difference between understanding
an issue and not.
Further reading
This post is not a pitch, but if you want to see what putting these ideas
into practice might look like, you should try out
Honeycomb. It’s not an exaggeration to say that
if you haven’t actually used a product that does this, it’s hard to imagine
how powerful a tool it can be.
Other things you might be interested in reading or experimenting with: