November 17, 2016 · 1 minute and 41 seconds read

Identifying Docker container outage using Prometheus

Prometheus is an open-source monitoring system with a dimensional data model, a flexible query language, an efficient time series database and a modern alerting approach.

Metric data is pulled (at a regular interval) from so-called exporters, which expose metrics coming from applications, operating systems and so on.

+------------------+                               +----------+     Visualize data
|  +------------+  |                               | Grafana  +---> coming from
|  | Dockerized |  |                               +----+-----+     Prometheus
|  | Application|  |                                    |
|  +------------+  |                                    ^
|  +------------+  |  Pull data   +----------+          |
|  |  CAdvisor  +--------->-------+Prometheus+----------+
|  +------------+  |              +---------++
|                  |                        |
| Operating System |                        |
|       with       |                        |
| Docker installed |                        |
|                  |                        v
+------------------+           Prometheus collects data
                               coming from remote systems

In the diagram above, cAdvisor is such an exporter. There are other exporters, e.g. the Node Exporter, which exposes machine-level metrics. cAdvisor is used to collect Docker container metrics.
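To give an idea of how Prometheus pulls from such an exporter, a minimal scrape configuration could look like the fragment below. This is only a sketch: the job name, scrape interval and target address are assumptions and should match your own setup.

scrape_configs:
  - job_name: 'cadvisor'              # will show up as job="cadvisor" in queries
    scrape_interval: 15s              # how often Prometheus pulls the metrics
    static_configs:
      - targets: ['localhost:8080']   # address where cAdvisor exposes its metrics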

cAdvisor

cAdvisor is a project from Google that analyzes resource usage and performance characteristics of running Docker containers! When you run a Dockerized application and start a cAdvisor container next to it, you instantly have metrics available for all running containers.
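As a sketch of how cAdvisor could be started alongside your application with docker-compose, see the fragment below. The image, published port and volume mounts follow the cAdvisor documentation, but they are assumptions and may need adjusting for your environment.

version: '2'
services:
  cadvisor:
    image: google/cadvisor:latest       # cAdvisor image published by Google
    ports:
      - "8080:8080"                     # the port Prometheus will scrape
    volumes:
      - /:/rootfs:ro                    # read-only access to the host filesystem
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro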

A lot of metrics are exposed by cAdvisor, one of which is container_last_seen. You can use this metric in Prometheus to identify if a container has left the building :) The challenge with Prometheus is that it keeps returning the last known sample for a certain amount of time, the so-called staleness period (default 5 minutes). This means that Prometheus will keep reporting the metric as if data is still being received until this timeout has passed, which is far too long if we need to identify quickly that a container has gone.

So if you would query it the straightforward way:

count(container_last_seen{job="<jobname>",name=~".*<containername>.*"})

This would keep returning results for up to 5 minutes after the container is gone… way too long.

A simple alternative query to identify whether the container has gone looks like this:

count(time() - container_last_seen{job="<jobname>",name=~".*<containername>.*"} < 30) OR vector(0)

The ‘30’ is the time in seconds after which we want to be notified that a container has gone. This time has to be larger than the scrape interval of your job, otherwise healthy containers would also be reported as gone in between two scrapes.

Using this query you can create a nice Singlestat panel in Grafana that displays an alert when the container is gone.
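If you also want Prometheus itself to fire an alert instead of only visualizing the state in Grafana, the same expression can be wrapped in an alerting rule. The fragment below is only a sketch in the YAML rules format of Prometheus 2.x; the group name, alert name, job and container names are placeholders.

groups:
  - name: container-availability
    rules:
      - alert: ContainerGone
        # the expression evaluates to 0 when the container has not been seen for 30 seconds
        expr: '(count(time() - container_last_seen{job="<jobname>",name=~".*<containername>.*"} < 30) OR vector(0)) == 0'
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: 'Container <containername> has not been seen for more than 30 seconds'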