
Alerting in Prometheus or: How I can sleep well at night


Monitoring is a well-known problem we all face when building software. To be completely honest, though, nowadays we have plenty of tools that help us reduce the pain. As you might guess, for this article we’ve chosen Prometheus as our monitoring tool, but the same ideas can be applied to alternative tools. In this post we start with a short recap of the Prometheus platform. We then discuss alerting rules and explore Alertmanager and its notification possibilities. Finally, we show how to integrate it with Slack.

Prometheus


Prometheus expects a simple text response, so we could actually use any programming language that lets us talk over TCP. Once the data is retrieved and stored by Prometheus, it can be queried by specialized dashboards such as Grafana. If you want to spend more time on the basic concepts, you can check the official documentation.
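To illustrate how simple that text response is, here is a minimal sketch in Python that renders a few samples as `name{labels} value` lines. The function name `render_metrics` and the sample data are assumptions made for illustration, and the sketch ignores details such as escaping of label values and `# HELP` / `# TYPE` comment lines.

```python
def render_metrics(samples):
    """Render (name, labels, value) tuples as Prometheus-style text lines."""
    lines = []
    for name, labels, value in samples:
        if labels:
            # Labels are written as key="value" pairs inside braces.
            # Note: label values are not escaped in this sketch.
            pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical samples, named after the metrics used in this article.
    print(render_metrics([
        ("play_current_users", {}, 3.0),
        ("play_requests_total", {}, 0.0),
    ]), end="")
```

In practice an official client library would normally produce this output for us; the sketch only shows that nothing exotic is going on underneath.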

Installing and configuring

In the repository, run:

$ docker-compose up -d

This command will download some Docker images, then configure and run the following containers:

  • One container with a Play application that exposes some metrics, available at localhost:9000. We will present these endpoints later
  • One container for Prometheus to gather metrics, available at localhost:9090
  • One container for Alertmanager to trigger alerts on metrics, available at localhost:9093
  • One container for Grafana, available at localhost:3000
  • One container for cAdvisor, available at localhost:8080

The relationship between these containers is shown in the following diagram.

relationship between containers

buildwall and now everything looks nice and clean. It is time to go home, right?... right? Sadly, not quite. I mean this is already very good. We’ve come a long way since SSH connections to inspect logs. However, we are still restricted to visual inspections of the data. Or are we?

Alerting in Prometheus

In this section we are going to discuss the application used in the example and the metrics it generates. After that, we are going to explain how to configure Prometheus and Alertmanager to describe rules from existing metrics. Finally, we will see how to trigger alerts and get notifications when those rules are met.

The monitored application

The application we provided for this article exposes the following endpoints to interact with the metrics:

  • /. Increases the Counter metric of visits, play_requests_total
  • /login. Increases the Gauge metric of connected users, play_current_users
  • /logout. Decreases the Gauge metric of connected users, play_current_users
  • /metrics. Gives the output as expected by Prometheus

A typical output of /metrics containing the metrics and their current values could be as follows:

http_request_duration_seconds_bucket{le="+Inf",method="GET",path="/login",status="2xx"} 3
http_request_duration_seconds_count{method="GET",path="/login",status="2xx"} 3
http_request_duration_seconds_sum{method="GET",path="/login",status="2xx"} 0.006110352
http_request_mismatch_total 0.0
play_current_users 3.0
play_requests_total 0.0
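If we ever need to read this output ourselves, for instance in a quick smoke test, a few lines of Python are enough. The following is only a sketch: `parse_metrics` is a hypothetical helper, and it deliberately ignores optional timestamps and escaped quotes inside label values.

```python
import re

# Matches lines such as: metric_name{label="value",...} 1.23
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from /metrics output."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines (HELP/TYPE metadata).
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # Not a sample line this sketch understands.
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


if __name__ == "__main__":
    sample_output = (
        'http_request_duration_seconds_count{method="GET",path="/login",status="2xx"} 3\n'
        "play_current_users 3.0\n"
    )
    for name, labels, value in parse_metrics(sample_output):
        print(name, labels, value)
```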

Alerting overview

In the Prometheus platform, alerting is handled through an independent component: Alertmanager. Usually, we first tell Prometheus where Alertmanager is located, then we create the alerting rules in the Prometheus configuration and finally, we configure Alertmanager to handle and send alerts to a receiver (mail, webhook, Slack, etc.). These dynamics are shown in the following diagram.

Alerting rules

Alerting rules are the mechanism proposed by Prometheus to define alerts on recorded metrics. The files that contain them are declared in prometheus.yml

rule_files:
 - "/etc/prometheus/alert.rules"

and are based on the following template

ALERT <alert name>
  IF <expression>
  [ FOR <duration> ]
  [ LABELS <label set> ]
  [ ANNOTATIONS <label set> ]

Where:

  • Alert name is the alert identifier. It does not need to be unique.
  • Expression is the condition that gets evaluated in order to trigger the alert. It usually uses existing metrics such as those returned by the /metrics endpoint.
  • Duration is the period of time during which the condition must hold before the alert fires. For example, 5s for 5 seconds.
  • Label set is a set of labels that will be used inside your message template.

We can define a new rule in our alert.rules to report that fewer than two users are logged in to our application:

ALERT low_connected_users
  IF play_current_users < 2
  FOR 30s
  LABELS {
    severity = "warning"
  }
  ANNOTATIONS {
    summary = "Instance {{ $labels.instance }} under lower load",
    description = "{{ $labels.instance }} of job {{ $labels.job }} is under lower load.",
  }
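The effect of the FOR clause is easy to miss: the condition has to stay true for the whole duration before the alert actually fires. The sketch below replays that logic in Python over a list of timestamped gauge values. It is a simplified model, not Prometheus code; `alert_states`, its parameters, and the sample timeline are assumptions made for illustration.

```python
def alert_states(samples, threshold=2, for_seconds=30):
    """Return (timestamp, state) pairs for the rule `value < threshold`.

    `samples` is a list of (timestamp_in_seconds, value) in time order,
    one entry per rule evaluation.
    """
    states = []
    active_since = None  # When the condition first became true.
    for ts, value in samples:
        if value < threshold:
            if active_since is None:
                active_since = ts
            held_for = ts - active_since
            # The alert only fires once the condition has held long enough.
            state = "firing" if held_for >= for_seconds else "pending"
        else:
            # The condition is false again, so the timer resets.
            active_since = None
            state = "inactive"
        states.append((ts, state))
    return states


if __name__ == "__main__":
    # Hypothetical evaluations every 15 seconds.
    timeline = [(0, 3), (15, 1), (30, 1), (45, 1), (60, 3)]
    for ts, state in alert_states(timeline):
        print(f"t={ts:>2}s {state}")
```

With this timeline the condition becomes true at t=15s, stays pending until it has held for 30 seconds, and goes back to inactive as soon as the gauge rises again.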

Alertmanager

Alertmanager is a buffer for alerts (no surprise here) that has the following characteristics:

  • It is able to receive alerts through a specific endpoint (not specific to Prometheus).
  • It can redirect alerts to receivers like HipChat, mail or others.
  • It is intelligent enough to determine that a similar notification was already sent, so you don’t end up being drowned by thousands of emails in case of a problem.

A client of Alertmanager (in this case Prometheus) starts by sending a POST message with all the alerts it wants to be handled to /api/v1/alerts. For example

[
  {
    "labels": {
      "alertname": "low_connected_users",
      "severity": "warning"
    },
    "annotations": {
      "summary": "Instance play-app:9000 under lower load",
      "description": "play-app:9000 of job playframework-app is under lower load."
    }
  }
]
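Because the endpoint is not specific to Prometheus, any client can push alerts in this shape. As a rough sketch, the following Python builds such a request with the standard library. The helper name `build_alert_request` and the base URL are assumptions (the URL follows the docker-compose setup described earlier), and the /api/v1/alerts path is the one used in this article; other Alertmanager versions may expose a different API version.

```python
import json
import urllib.request


def build_alert_request(base_url, alerts):
    """Build (but do not send) a POST request carrying a list of alerts."""
    body = json.dumps(alerts).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/alerts",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    alerts = [{
        "labels": {"alertname": "low_connected_users", "severity": "warning"},
        "annotations": {"summary": "Instance play-app:9000 under lower load"},
    }]
    request = build_alert_request("http://localhost:9093", alerts)
    print(request.get_method(), request.full_url)
    # To actually send it (requires a running Alertmanager):
    # urllib.request.urlopen(request)
```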

Workflow

Once these alerts are stored in Alertmanager they can be in any of the following states:

  • Inactive. Nothing happens here.
  • Pending. The client told us that this alert must be triggered. However, alerts could be grouped, suppressed/inhibited (more on inhibition later) or silenced/muted (we will discuss silences later). Once all validations pass, we move to Firing.
  • Firing. The alert is sent to the Notification Pipeline, which will contact all the receivers of our alert. The client could then tell us the alert is now good, so we make a transition to Inactive.

Prometheus has a dedicated endpoint that allows us to list all alerts and follow the state transitions. Each state, as indicated by Prometheus, as well as the conditions that led to a transition, is shown below:

  • The rule is not met. The alert is not active.

receivers:
- name: slack_general
  slack_configs:
  - api_url: https://hooks.slack.com/services/FOO/BAR/FOOBAR
    channel: '#prometheus-article'
    send_resolved: true

In order to get this second notification, the one sent when the alert is resolved, we need to make our alert condition play_current_users < 2 false. We can achieve this by increasing the gauge metric used in the rule, or simply put: navigate several times to the /login endpoint of our application. Now we receive a new message with a green bar in our Slack channel.
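To make the red and green bars less magical, here is a rough Python sketch of the kind of payload a Slack incoming webhook accepts, with the attachment color chosen from the alert status. This is not Alertmanager's code: `slack_payload` and the example texts are hypothetical, and the sketch only builds the JSON without sending anything.

```python
import json


def slack_payload(channel, status, title, text):
    """Build a Slack incoming-webhook payload for a firing or resolved alert."""
    # "danger" renders as a red bar and "good" as a green bar in Slack.
    color = "danger" if status == "firing" else "good"
    return {
        "channel": channel,
        "attachments": [{"color": color, "title": title, "text": text}],
    }


if __name__ == "__main__":
    payload = slack_payload(
        "#prometheus-article",
        "resolved",
        "Instance play-app:9000 under lower load",
        "play-app:9000 of job playframework-app is under lower load.",
    )
    print(json.dumps(payload, indent=2))
```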

More on transitions from Pending to Firing

Inhibition

Inhibitions allow us to suppress notifications for some alerts when certain other alerts are already firing. For example, we could configure an inhibition that mutes any warning-level notification if the same alert (based on alertname) is already critical. The relevant section of the alertmanager.yml file could look like this:

inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname']
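The matching logic behind such a rule can be sketched in a few lines of Python. This is a simplified model for illustration, not Alertmanager's implementation: alerts are plain label dictionaries, `is_inhibited` is a hypothetical helper, and the rule assumes that `equal` lists label names (here alertname, as described in the prose above).

```python
RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname"],
}


def is_inhibited(target, firing_alerts, rule):
    """Return True if `target` should be muted by one of `firing_alerts`."""
    # The rule only applies to alerts that match target_match.
    if any(target.get(k) != v for k, v in rule["target_match"].items()):
        return False
    for source in firing_alerts:
        # Only alerts that match source_match can inhibit others.
        if any(source.get(k) != v for k, v in rule["source_match"].items()):
            continue
        # Source and target must agree on every label listed in `equal`.
        if all(source.get(label) == target.get(label) for label in rule["equal"]):
            return True
    return False


if __name__ == "__main__":
    critical = {"alertname": "low_connected_users", "severity": "critical"}
    warning = {"alertname": "low_connected_users", "severity": "warning"}
    print(is_inhibited(warning, [critical], RULE))
```

Listing alertname under `equal` is what restricts the muting to warnings of the same alert; a warning with a different alertname keeps notifying.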

Silences

Silences are a quick way to temporarily mute alerts. We configure them directly through a dedicated page in the Alertmanager admin console. They can be useful to avoid getting spammed while trying to resolve a critical production issue.
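Conceptually, a silence is a set of label matchers plus a time window. The Python sketch below models that idea under simplifying assumptions: `is_silenced` and the dictionary layout are hypothetical, and only exact label matches are handled.

```python
from datetime import datetime, timedelta, timezone


def is_silenced(labels, silences, now):
    """Return True if an alert with `labels` is muted by an active silence."""
    for silence in silences:
        in_window = silence["starts_at"] <= now < silence["ends_at"]
        matches = all(labels.get(k) == v for k, v in silence["matchers"].items())
        if in_window and matches:
            return True
    return False


if __name__ == "__main__":
    start = datetime(2017, 1, 1, 12, 0, tzinfo=timezone.utc)
    # Hypothetical one-hour silence for a single alert name.
    silences = [{
        "matchers": {"alertname": "low_connected_users"},
        "starts_at": start,
        "ends_at": start + timedelta(hours=1),
    }]
    alert = {"alertname": "low_connected_users", "severity": "warning"}
    print(is_silenced(alert, silences, start + timedelta(minutes=10)))
```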

Message templates

Message templates are a mechanism that allows us to take the annotations present in the alert and lay them out in a particular way. They must be specified in the Alertmanager configuration file. The file alertmessage.tmpl used to produce the Slack notifications could be defined as follows

{{ define "__slack_text" }}
{{ range .Alerts }}{{ .Annotations.description}}{{ end }}
{{ end }}

{{ define "__slack_title" }}
{{ range .Alerts }} :scream: {{ .Annotations.summary}} :scream: {{ end }}
{{ end }}

{{ define "slack.default.text" }}{{ template "__slack_text" . }}{{ end }}
{{ define "slack.default.title" }}{{ template "__slack_title" . }}{{ end }}

Final thoughts

In this article we have briefly covered the basics of Prometheus and the type of metrics we can monitor with it. Then we discussed rules, Alertmanager, and the different receivers with a focus on Slack. We hope this article will help you explore new possibilities in your own infrastructure and in general make your monitoring experience less of a pain. However, we still need to explore the monitoring possibilities in a discoverable architecture and the changes introduced in Prometheus 2.0, but that will be the topic of another article. We hope you enjoyed it. See you next time.
