{"id":119,"date":"2021-12-07T00:26:04","date_gmt":"2021-12-06T22:26:04","guid":{"rendered":"https:\/\/blog.nikster.de\/wordpress\/?p=119"},"modified":"2021-12-07T00:31:24","modified_gmt":"2021-12-06T22:31:24","slug":"monitoring-with-prometheus","status":"publish","type":"post","link":"https:\/\/blog.nikster.de\/wordpress\/index.php\/2021\/12\/07\/monitoring-with-prometheus\/","title":{"rendered":"Monitoring with Prometheus"},"content":{"rendered":"\n<p>Infrastructure needs to be monitored, and several tools exist for this task, not least because the term &#8220;monitoring&#8221; is rather fuzzy.<br>Two great tools for this task are Graphite and Prometheus.<br>Both have their pros and cons: with Graphite it is much simpler to keep data for long-term analysis, while Prometheus shines with its powerful query language, PromQL.<br>There are many more tools, which I won&#8217;t discuss here; which one to use depends on one&#8217;s needs and general preferences.<br>So&#8230;<\/p>\n\n\n\n<h5 class=\"wp-block-heading\"> What is Prometheus?<\/h5>\n\n\n\n<p>In short: <br>Prometheus is a set of tools for monitoring, some of which are optional.<br>At its core, the Prometheus server collects metrics, stores them in a highly performant time series database and makes them available for further processing (like querying them or sending alerts).<br>The metrics are mostly scraped from so-called exporters. <br>They are one of Prometheus&#8217; strengths: the ones it brings are already powerful, the community provides <a href=\"https:\/\/prometheus.io\/docs\/instrumenting\/exporters\/\">a ton of exporters<\/a> for all kinds of services, and it is fairly simple to implement custom exporters.<br><br>Optional components are the <em>Alertmanager<\/em>, the <em>Pushgateway<\/em> and third-party dashboards like <em>Grafana<\/em>. 
<\/p>\n\n\n\n<h5 class=\"wp-block-heading\">What&#8217;s this article about?<\/h5>\n\n\n\n<p>I&#8217;m going to set up a Prometheus server, an alertmanager, a grafana server with an example dashboard and scrape some metrics to fiddle with promql and display them in a grafana dashboard.<\/p>\n\n\n\n<h5 class=\"wp-block-heading\"> What&#8217;s needed?<\/h5>\n\n\n\n<ul class=\"wp-block-list\"><li>3 VMs with Debian Buster (bullseye should work too)<\/li><\/ul>\n\n\n\n<h5 class=\"wp-block-heading\">Let&#8217;s go&#8230; basic setup<\/h5>\n\n\n\n<p>I&#8217;ll name the three VMs &#8220;prometheus&#8221;, &#8220;grafana&#8221; and &#8220;alertmanager&#8221;, <br>make sure that prometheus is able to reach them all and grafana is able to connect to prometheus.<br>On prometheus install the prometheus package:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>root@prometheus:~# apt-get install prometheus <\/code><\/pre>\n\n\n\n<p> On alertmanager install the prometheus-alertmanager package:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>root@alertmanager:~# apt-get install prometheus-alertmanager<\/code><\/pre>\n\n\n\n<p>On grafana install the grafana package.<br>With grafana it really is best to use the package they provide.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>apt-get install -y apt-transport-https software-properties-common wget gnupg\n\necho \"deb https:\/\/packages.grafana.com\/oss\/deb stable main\" &gt;&gt; \/etc\/apt\/sources.list.d\/grafana.list\n\nwget -q -O - https:\/\/packages.grafana.com\/gpg.key | apt-key add -\n\napt-get update\napt-get install grafana<\/code><\/pre>\n\n\n\n<p>On all hosts install the prometheus-node-exporter:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>apt-get install prometheus-node-exporter<\/code><\/pre>\n\n\n\n<p>If everything worked, then you should have a basic install by now and should be able to see the webinterfaces of the services.<\/p>\n\n\n\n<ul 
class=\"wp-block-list\"><li>prometheus.your.domain:9090<\/li><\/ul>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"2001\" height=\"617\" src=\"https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/11\/Screenshot_20211113_132043.png\" alt=\"\" class=\"wp-image-122\"\/><\/figure>\n\n\n\n<ul class=\"wp-block-list\"><li>alertmanager.your.domain:9093<\/li><\/ul>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"2001\" height=\"741\" src=\"https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/11\/Screenshot_20211113_132307.png\" alt=\"\" class=\"wp-image-123\"\/><\/figure>\n\n\n\n<ul class=\"wp-block-list\"><li>grafana.nik.local:3000<\/li><\/ul>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"2001\" height=\"1028\" src=\"https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/11\/Screenshot_20211113_132619.png\" alt=\"\" class=\"wp-image-124\"\/><\/figure>\n\n\n\n<p>The initial credentials are admin:admin.<br><\/p>\n\n\n\n<p>Check if the exporters are working in general.<br>Therefore on any or all of the hosts do:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>curl localhost:9100\/metrics\ncurl oneofthehosts.your.domain:9100\/metrics<\/code><\/pre>\n\n\n\n<p>You should see lots of metrics like this, more on that later.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># TYPE apt_upgrades_pending gauge\napt_upgrades_pending{arch=\"\",origin=\"\"} 0\n# HELP go_gc_duration_seconds A summary of the GC invocation durations.\n# TYPE go_gc_duration_seconds summary\ngo_gc_duration_seconds{quantile=\"0\"} 9.929e-06\ngo_gc_duration_seconds{quantile=\"0.25\"} 1.63e-05\ngo_gc_duration_seconds{quantile=\"0.5\"} 2.249e-05\ngo_gc_duration_seconds{quantile=\"0.75\"} 4.0521e-05\ngo_gc_duration_seconds{quantile=\"1\"} 0.00410069\n\nroot@prometheus:~# curl -s localhost:9100\/metrics | grep 
node_disk_written_bytes_total\n# HELP node_disk_written_bytes_total The total number of bytes written successfully.\n# TYPE node_disk_written_bytes_total counter\nnode_disk_written_bytes_total{device=\"sr0\"} 0\nnode_disk_written_bytes_total{device=\"vda\"} 8.2030592e+07<\/code><\/pre>\n\n\n\n<p>Basic installation is finished and working.<br>Let&#8217;s check the&#8230;<\/p>\n\n\n\n<h5 class=\"wp-block-heading\">prometheus configuration<\/h5>\n\n\n\n<p><strong><em>A quick overview:<\/em><\/strong><\/p>\n\n\n\n<p>The Prometheus server may be started with lots of <em>arguments<\/em>.<br>They fall into four categories: <\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>config<ul><li><strong>the config file<\/strong>; the Debian package uses \/etc\/prometheus\/prometheus.yml.<\/li><\/ul><\/li><\/ul>\n\n\n\n<ul class=\"wp-block-list\"><li>storage<ul><li>parameters on where to store TSDB data and how to handle it<\/li><\/ul><\/li><li>web<ul><li>web parameters, like the URL under which Prometheus is reachable, API endpoints, etc.<\/li><\/ul><\/li><li>query<ul><li>query parameters, like timeouts but also max-samples, etc.<\/li><\/ul><\/li><\/ul>\n\n\n\n<p>On a Debian system, these parameters are defined in \/etc\/default\/prometheus, and for this setup the defaults are sufficient.<br><\/p>\n\n\n\n<p>The prometheus.yml can be split into the following sections:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>global<ul><li><em>global<\/em> <em>parameters<\/em> for all other sections, like scrape and evaluation intervals, etc.; they may be overridden in the specific configurations.<\/li><\/ul><\/li><li>scrape_configs<ul><li>the actual job definitions on what to scrape, where and how<\/li><\/ul><\/li><li>alerting<ul><li>parameters for the Alertmanager<\/li><\/ul><\/li><li>rule_files<ul><li>files that contain the recording and alerting rules <\/li><\/ul><\/li><li>remote_read + remote_write<ul><li>parameters for working with long-term storage like Thanos and\/or federation<\/li><\/ul><\/li><\/ul>\n\n\n\n<p>Rule 
files are re-read whenever the config is reloaded, and the prometheus-server itself may be SIGHUPed to gracefully reload its config (\/etc\/prometheus\/prometheus.yml).<br>This can also be done with an HTTP POST to the API endpoint \/-\/reload, if the server was started with --web.enable-lifecycle.<\/p>\n\n\n\n<hr class=\"wp-block-separator\"\/>\n\n\n\n<p>Let&#8217;s build a useful config file to scrape our hosts.<\/p>\n\n\n\n<p>This job, aptly named &#8220;node&#8221;, just scrapes the node exporter on localhost (prometheus) and on grafana.your.domain.<br>For the purpose of demonstration, I&#8217;ll override some parameters in the job context, like scrape_interval.<br>sample_limit is important here: if a scrape returns more samples than this limit, the whole scrape is marked as failed and nothing is stored. This helps to contain what is called &#8220;cardinality explosion&#8221; in Prometheus (more on this later). <br>The job config itself is static, which is OK if one has a defined set of targets which doesn&#8217;t change too often and gets distributed and reloaded by some mechanism.<br>We&#8217;ll stick to that for now.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>scrape_configs:\n  # The job name is added as a label `job=&lt;job_name&gt;` to any timeseries scraped from this config.\n  - job_name: 'node'\n\n    # Override the global defaults for this job\n    scrape_interval: 15s\n    scrape_timeout: 10s\n    sample_limit: 1000\n\n    # metrics_path defaults to '\/metrics'\n    # scheme defaults to 'http'.\n\n    static_configs:\n      # 9100 is the node exporter port (9090 would be Prometheus itself)\n      - targets: &#91;'localhost:9100']\n      - targets: &#91;'grafana:9100']\n<\/code><\/pre>\n\n\n\n<p>Now that there is an initial working config, the state of the scrape targets can be checked via the configured URL of the Prometheus server:<\/p>\n\n\n\n<p>http:\/\/prometheus.your.domain:9090\/targets<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1489\" height=\"204\" src=\"https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211205_122931.png\" 
alt=\"\" class=\"wp-image-131\"\/><\/figure>\n\n\n\n<p>One may create a graph from those metrics now, by clicking on &#8220;Graph&#8221; and starting to type. For Example:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1849\" height=\"409\" src=\"https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211205_123351.png\" alt=\"\" class=\"wp-image-132\"\/><\/figure>\n\n\n\n<p>Execute it and you&#8217;ll get a list of Instances and Jobs with this metric.<br>Choose the one you are looking for (or all), copy it to the search bar and click on &#8220;Graph&#8221;.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211205_123644.png\" alt=\"\" class=\"wp-image-133\" width=\"801\" height=\"337\"\/><\/figure>\n\n\n\n<p>Good, now that this is working, a few words on the scrape config and especially labels. 
<br>Prometheus uses labels for almost everything and has some powerful functions to manipulate and thus work with them.<br><strong>Labels<\/strong> are the key\/value pairs associated with a certain metric.<br>In the example above, these are <em>instance=&#8217;grafana&#8217;<\/em> and <em>job=&#8217;node&#8217;<\/em>; placed in {}, they are assigned to a metric.<br>Of course these are basic labels and one may add lots more.<br>That&#8217;s where one needs to be careful, because each unique combination of metric name and label values adds a new time series to the database, eventually leading to the aforementioned &#8220;cardinality explosion&#8221;.<br>So, choose labels wisely and never ever use dynamic, unbounded labels, like user IDs and such.<br>Read more on this topic <a href=\"https:\/\/prometheus.io\/docs\/practices\/naming\/\">here<\/a>.<\/p>\n\n\n\n<p>Alright, let&#8217;s check out how relabeling works:<br>Add a new label foo with the value bar (this would also overwrite an existing label foo):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>  - job_name: node\n    static_configs:\n      - targets: &#91;'grafana:9100']\n      - targets: &#91;'localhost:9100']\n    relabel_configs:\n      - target_label: \"foo\"\n        replacement: \"bar\"\n<\/code><\/pre>\n\n\n\n<p>One can also rename and drop metrics which match regexes and chain those rules; you may want to check out this <a href=\"https:\/\/valyala.medium.com\/how-to-use-relabeling-in-prometheus-and-victoriametrics-8b90fc22c4b2\">excellent site<\/a> with examples.<br>However, don&#8217;t confuse <strong>relabel_configs<\/strong> with <strong>metric_relabel_configs<\/strong>.<br>metric_relabel_configs is applied after a metric was collected, but before it is written to storage, while relabel_configs is run before a scrape is performed.<br>You may use metric_relabel_configs to drop specific metrics until problems are fixed on a client, for example.<br>This one drops all the metrics for 
<em>sidekiq_jobs_completion_seconds_bucket{job=&quot;gitlab-sidekiq&quot;}<\/em> from my <a href=\"https:\/\/blog.nikster.de\/wordpress\/index.php\/2019\/05\/06\/how-to-install-gitlab-and-work-with-it\/\">gitlab server<\/a>.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>  - job_name: gitlab-sidekiq\n    metric_relabel_configs:\n      - source_labels: &#91; __name__ ]\n        # relabel regexes are fully anchored, so the exact metric name is enough\n        regex: sidekiq_jobs_completion_seconds_bucket\n        action: drop<\/code><\/pre>\n\n\n\n<p>You could find such a problematic metric with this query, which lists the 20 metric name and job combinations with the most time series (check out this <a href=\"https:\/\/www.robustperception.io\/relabel_configs-vs-metric_relabel_configs\">blog<\/a> on the topic):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>topk(20, count by (__name__, job)({__name__=~\".+\"}))<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Grafana<\/h2>\n\n\n\n<p>Now that we have a rudimentary understanding of the config and are scraping the exporters on two of our hosts, let&#8217;s connect Grafana, so that we can draw nice graphs with powerful PromQL.<\/p>\n\n\n\n<p>After login, find the configuration symbol on the left side and choose Data sources.<br>It is sufficient to enter the URL of your Prometheus server here.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"666\" src=\"https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_002415-1024x666.png\" alt=\"\" class=\"wp-image-136\" srcset=\"https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_002415-1024x666.png 1024w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_002415-300x195.png 300w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_002415-768x499.png 768w, 
https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_002415.png 1080w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Of course, there is a ton of options, and Grafana itself can be used as a frontend for many data sources like databases, Elasticsearch, and so on.<br>If you are configuring Grafana for an organization, you might also want to check out the user and org settings.<br>For the sake of this tutorial, we are done with the configuration.<br>Let&#8217;s build some graphs.<\/p>\n\n\n\n<p>Add a new dashboard:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"464\" src=\"https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_095047-1024x464.png\" alt=\"\" class=\"wp-image-138\" srcset=\"https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_095047-1024x464.png 1024w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_095047-300x136.png 300w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_095047-768x348.png 768w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_095047.png 1160w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Click on &#8220;add a new Panel&#8221;.<\/p>\n\n\n\n<p>Our data source &#8220;Prometheus&#8221; should be visible.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"81\" src=\"https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_095228-1024x81.png\" alt=\"\" class=\"wp-image-139\" srcset=\"https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_095228-1024x81.png 1024w, 
https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_095228-300x24.png 300w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_095228-768x61.png 768w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_095228-1536x122.png 1536w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_095228-1568x125.png 1568w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_095228.png 1687w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>CPU Graphs are always good, let&#8217;s add one for system and for User CPU Time.<br><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>avg(irate(node_cpu_seconds_total{mode=\"system\", job=\"node\",instance=\"grafana:9100\"}&#91;5m]))\n\navg(irate(node_cpu_seconds_total{mode=\"user\", job=\"node\",instance=\"grafana:9100\"}&#91;5m]))<\/code><\/pre>\n\n\n\n<p>As you type, Grafana will provide you with options for auto completion.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"612\" src=\"https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_100006-1024x612.png\" alt=\"\" class=\"wp-image-140\" srcset=\"https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_100006-1024x612.png 1024w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_100006-300x179.png 300w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_100006-768x459.png 768w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_100006-1536x919.png 1536w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_100006-2048x1225.png 2048w, 
https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_100006-1568x938.png 1568w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Don&#8217;t forget to save and click Apply.<\/p>\n\n\n\n<p>What has been done here?<br>I started to type <em>node_<\/em>, which provides me with a list of all <em>node_*<\/em> metrics, and chose <em>node_cpu_seconds_total<\/em>.<br>Then the <em>labels<\/em> are added, within <em>{}<\/em>.<br>If you know that you want <em>mode<\/em>, just start typing it (you may also use the explorer (left side) first, or just look through the exporter output on host:9100, or the respective port if you use another one).<br>We also want it for the resources defined in <em>job=node<\/em> and a specific <em>instance<\/em>.<br>Then, the function <em>irate<\/em> is applied.<br>As stated, Prometheus brings some very good functions to deal with time series data, but there are so many of them that it is best to read about them <a href=\"https:\/\/prometheus.io\/docs\/prometheus\/latest\/querying\/functions\/\">here<\/a>.<br>rate and irate, though, are used often (at least by me) and are easily confused.<br>They are pretty similar, but work differently: rate averages over the whole range (here 5m), while irate only looks at the last two samples in that range.<br>As a rule of thumb: irate reacts better to data changes, while rate gives more of a trend (see the link above on how they really work).<br>Both get us an &#8220;average&#8221; (calculated differently) per-second rate of events. 
<br>For short-term CPU load, I&#8217;ll use irate over a 5m range.<br>(See the picture below for the difference when using <em>rate<\/em>.)<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"612\" src=\"https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_115901-1024x612.png\" alt=\"\" class=\"wp-image-141\" srcset=\"https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_115901-1024x612.png 1024w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_115901-300x179.png 300w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_115901-768x459.png 768w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_115901-1536x919.png 1536w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_115901-2048x1225.png 2048w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_115901-1568x938.png 1568w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Last (outermost), I used the <em>avg<\/em> aggregation operator to calculate an average over the CPU cores (there are two of them).<br>Read about aggregation operators <a href=\"https:\/\/prometheus.io\/docs\/prometheus\/latest\/querying\/operators\/#aggregation-operators\">here<\/a>.<br>Check out what it looks like without <em>avg<\/em>.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"612\" src=\"https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_125343-1024x612.png\" alt=\"\" class=\"wp-image-142\" srcset=\"https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_125343-1024x612.png 1024w, 
https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_125343-300x179.png 300w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_125343-768x459.png 768w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_125343-1536x919.png 1536w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_125343-2048x1225.png 2048w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/Screenshot_20211206_125343-1568x938.png 1568w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Of course there&#8217;s much more to Grafana.<br>For example: if you click on the &#8220;Preferences&#8221; icon in the upper right of the new dashboard, it is possible to define variables, which (e.g.) can replace mode, job and instance values, helping you to create dynamic and easy-to-navigate dashboards with drop-down fields.<br>Also worth noting is that you don&#8217;t need to build every single dashboard yourself.<br>There are many well-maintained and production-ready dashboards on <a href=\"https:\/\/grafana.com\/grafana\/dashboards\/\">grafana.com\/dashboards<\/a>.<\/p>\n\n\n\n<hr class=\"wp-block-separator\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">Alertmanager<\/h2>\n\n\n\n<p>The Alertmanager is Prometheus&#8217; &#8220;Alert Router&#8221;.<br>It receives the alerts generated by Prometheus&#8217; alerting rules through its API and converts them into notifications in all kinds of forms: Slack messages, emails, and so on. <br>In a production environment one would set up more than the single instance we are building for the sake of this article.<br>Alertmanager can be clustered; it replicates its state across its nodes and also de-duplicates alerts.<br><br>How does it work? 
<br>In short:<br>Whenever an alert rule fires, Prometheus sends an event to the Alertmanager API (JSON) and keeps doing this as long as the rule matches.<br>The Alertmanager then dispatches these alerts into <em>alert groups<\/em> (defined by labels, like everything in Prometheus, e.g. alertname). Here grouping and de-duplication are done to avoid unnecessary alert spamming; alerts may also be categorized.<br>From here, each group will trigger the notification pipeline.<br>First is the <em>Inhibition<\/em> phase.<br>This basically allows dependencies between alerts to be mapped (think of a switch that fails: every rule for connected devices would start alerting, which won&#8217;t happen if inhibition is configured correctly).<br>Mapping is done in alertmanager.yml and thus requires reloading.<br>Second is the <em>Silencer<\/em> phase.  <br>It does what its name says: it silences alerts, either by directly matching labels or by a regex (be careful with that, though).<br>Silences can be configured through the web interface, by clicking on Silences (http:\/\/alertmanager.your.domain:9093\/#\/silences).<br>If the alert was not handled by one of the previous phases, it gets <em>routed<\/em>.<br>Basically the alert is sent to a receiver, which has to be defined in alertmanager.yml (receivers section) and selected in the route section.<br>Several endpoints are pre-configured with example configs.<br><br><em>The configuration:<\/em><br>On prometheus.your.domain, add the Alertmanager section:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Alertmanager configuration\nalerting:\n  alertmanagers:\n  - static_configs:\n      - targets:\n          - 'alertmanager.your.domain:9093'<\/code><\/pre>\n\n\n\n<p>Also, tell Prometheus which rule files to evaluate:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Load rules once and periodically evaluate them according to the global 'evaluation_interval'.\nrule_files:\n  - \"alerting_rules.yml\"<\/code><\/pre>\n\n\n\n<p>Create an alerting rule in 
alerting_rules.yml (a relative path in rule_files is resolved against the directory of the Prometheus config file).<br>This one is pretty straightforward.<br>Raise an alert &#8220;NodeExporterDown&#8221; if &#8220;up&#8221; for our job=node is not &#8220;1&#8221; for one minute, and label it as critical.<br>Annotations such as links and other useful information may also be defined.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>groups:\n- name: alerting_rules\n  rules:\n\n  - alert: NodeExporterDown\n    expr: up{job=\"node\"} != 1\n    for: 1m\n    labels:\n      severity: \"critical\"\n    annotations:\n      description: \"Node exporter {{ .Labels.instance }} is down.\"\n      link: \"https:\/\/example.com\"<\/code><\/pre>\n\n\n\n<p>Now we need to configure the Alertmanager itself.<br>On alertmanager.your.domain edit \/etc\/prometheus\/alertmanager.yml.<br>As stated, there are several pre-configured endpoints; let&#8217;s just use email.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>global:\n  smtp_smarthost: 'smtp.whatever.domain:25'\n  smtp_from: 'alertmanager@your.domain'\n  smtp_auth_username: 'user'\n  smtp_auth_password: 'pass'\n  smtp_require_tls: true<\/code><\/pre>\n\n\n\n<p>Define a route:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>route:\n  receiver: operations\n  group_by: &#91;'alertname', 'job']\n  group_wait: 10s\n  group_interval: 10s\n  repeat_interval: 3m<\/code><\/pre>\n\n\n\n<p>Define the receiver:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>receivers:\n- name: 'operations'\n  email_configs:\n  - to: 'your@ops-email.de'<\/code><\/pre>\n\n\n\n<p>So, now, if we stop the prometheus-node-exporter on grafana.your.domain, the rule gets evaluated and the alert enters a pending state for one minute to see if the service comes up again.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"303\" src=\"https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/node_exporter_down_pending-1024x303.png\" alt=\"\" class=\"wp-image-144\" 
srcset=\"https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/node_exporter_down_pending-1024x303.png 1024w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/node_exporter_down_pending-300x89.png 300w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/node_exporter_down_pending-768x227.png 768w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/node_exporter_down_pending-1536x455.png 1536w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/node_exporter_down_pending-2048x606.png 2048w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/node_exporter_down_pending-1568x464.png 1568w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>If it does not, the service is marked as down:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"118\" src=\"https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/node_exporter_down_status-1024x118.png\" alt=\"\" class=\"wp-image-145\" srcset=\"https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/node_exporter_down_status-1024x118.png 1024w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/node_exporter_down_status-300x35.png 300w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/node_exporter_down_status-768x89.png 768w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/node_exporter_down_status-1536x177.png 1536w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/node_exporter_down_status-2048x236.png 2048w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/node_exporter_down_status-1568x181.png 1568w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>and the prometheus alert starts firing:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" 
decoding=\"async\" width=\"1024\" height=\"287\" src=\"https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/node_exporter_down_firing-1024x287.png\" alt=\"\" class=\"wp-image-146\" srcset=\"https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/node_exporter_down_firing-1024x287.png 1024w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/node_exporter_down_firing-300x84.png 300w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/node_exporter_down_firing-768x215.png 768w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/node_exporter_down_firing-1536x430.png 1536w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/node_exporter_down_firing-2048x573.png 2048w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/node_exporter_down_firing-1568x439.png 1568w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>We can check for it on the alertmanager.your.domain web interface:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"746\" src=\"https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/node_exporter_down_alertmanager-1024x746.png\" alt=\"\" class=\"wp-image-147\" srcset=\"https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/node_exporter_down_alertmanager-1024x746.png 1024w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/node_exporter_down_alertmanager-300x219.png 300w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/node_exporter_down_alertmanager-768x560.png 768w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/node_exporter_down_alertmanager.png 1474w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>and also get an email:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" 
width=\"835\" height=\"436\" src=\"https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/node_exporter_down_mail.png\" alt=\"\" class=\"wp-image-148\" srcset=\"https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/node_exporter_down_mail.png 835w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/node_exporter_down_mail-300x157.png 300w, https:\/\/blog.nikster.de\/wordpress\/wp-content\/uploads\/2021\/12\/node_exporter_down_mail-768x401.png 768w\" sizes=\"auto, (max-width: 835px) 100vw, 835px\" \/><\/figure>\n\n\n\n<p>That&#8217;s it. Prometheus with some basic rules, Grafana with a basic dashboard, and Alertmanager with basic alerting are up and running.<br><\/p>\n\n\n\n<hr class=\"wp-block-separator\"\/>\n\n\n\n<p>Of course there is much more to know, especially about alerting and recording rules, PromQL and dynamic configuration, as well as cloud configs and operators, long-term storage and so on.<br>I&#8217;d recommend reading the excellent <a href=\"https:\/\/www.packtpub.com\/product\/hands-on-infrastructure-monitoring-with-prometheus\/9781789612349\">Hands-On Infrastructure Monitoring with Prometheus<\/a> as well as the official documentation on the <a href=\"https:\/\/prometheus.io\/docs\/introduction\/overview\/\">Prometheus website<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Infrastructure needs to be monitored, and several tools exist for this task, not least because the term &#8220;monitoring&#8221; is rather fuzzy. Two great tools for this task are Graphite and Prometheus. Both have their pros and cons: with Graphite it is much simpler to keep data for long-term analysis, while Prometheus shines &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/blog.nikster.de\/wordpress\/index.php\/2021\/12\/07\/monitoring-with-prometheus\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Monitoring with 
Prometheus&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[47,48],"tags":[46,45],"class_list":["post-119","post","type-post","status-publish","format-standard","hentry","category-monitoring","category-prometheus","tag-monitoring","tag-prometheus","entry"],"_links":{"self":[{"href":"https:\/\/blog.nikster.de\/wordpress\/index.php\/wp-json\/wp\/v2\/posts\/119","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.nikster.de\/wordpress\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.nikster.de\/wordpress\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.nikster.de\/wordpress\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.nikster.de\/wordpress\/index.php\/wp-json\/wp\/v2\/comments?post=119"}],"version-history":[{"count":11,"href":"https:\/\/blog.nikster.de\/wordpress\/index.php\/wp-json\/wp\/v2\/posts\/119\/revisions"}],"predecessor-version":[{"id":149,"href":"https:\/\/blog.nikster.de\/wordpress\/index.php\/wp-json\/wp\/v2\/posts\/119\/revisions\/149"}],"wp:attachment":[{"href":"https:\/\/blog.nikster.de\/wordpress\/index.php\/wp-json\/wp\/v2\/media?parent=119"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.nikster.de\/wordpress\/index.php\/wp-json\/wp\/v2\/categories?post=119"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.nikster.de\/wordpress\/index.php\/wp-json\/wp\/v2\/tags?post=119"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}