Friday, December 30, 2022

Prometheus: Metric Collection and Analysis

Prometheus is an open-source monitoring and alerting tool. It collects data from applications, lets you visualize that data, and issues alerts based on it. Prometheus collects metrics as time series data, then processes, filters, aggregates, and presents them in a human-readable format. It supports a multi-dimensional data model in which time series data can be sliced and diced along dimensions such as instance, service, endpoint, and method. These dimensions are attached to each metric as labels, which are key-value pairs.

Prometheus scrapes metrics from instrumented jobs, either directly or via an intermediary push gateway for short-lived jobs. It stores all scraped samples locally and runs rules over this data to either aggregate and record new time series from existing data or generate alerts. Grafana or other API consumers can be used to visualize the collected data. Prometheus stores all data as time series, i.e. streams of timestamped values belonging to the same metric and the same set of labeled dimensions. Timestamps have millisecond precision, while values are always 64-bit floats.
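To make the data model concrete, here is a minimal sketch of how a series can be identified by its metric name plus its label set, with each sample holding a millisecond timestamp and a float value. This is a toy illustration, not how Prometheus stores data internally; the class and function names (TinyStore, series_key) and the sample metric are made up for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# A series is identified by its metric name plus its full label set.
SeriesKey = Tuple[str, Tuple[Tuple[str, str], ...]]


@dataclass(frozen=True)
class Sample:
    timestamp_ms: int  # milliseconds since the Unix epoch
    value: float       # Python floats are 64-bit IEEE 754 doubles


def series_key(name: str, labels: Dict[str, str]) -> SeriesKey:
    """Build a hashable identity; sorting makes label order irrelevant."""
    return (name, tuple(sorted(labels.items())))


class TinyStore:
    """Toy in-memory store: one append-only list of samples per series."""

    def __init__(self) -> None:
        self.series: Dict[SeriesKey, List[Sample]] = {}

    def append(self, name: str, labels: Dict[str, str],
               timestamp_ms: int, value: float) -> None:
        key = series_key(name, labels)
        self.series.setdefault(key, []).append(Sample(timestamp_ms, float(value)))


store = TinyStore()
# Same metric and same labels (in a different order): one series, two samples.
store.append("http_requests_total", {"method": "GET", "instance": "app-1"}, 1672387200000, 10)
store.append("http_requests_total", {"instance": "app-1", "method": "GET"}, 1672387215000, 12)
# A different label value creates a separate series.
store.append("http_requests_total", {"method": "POST", "instance": "app-1"}, 1672387200000, 3)

print(len(store.series))  # 2
```

Changing any label value produces a new series, which is why high-cardinality labels multiply the number of series a server has to track.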

Prometheus provides PromQL, a flexible query language that takes advantage of the multi-dimensional data model and allows filtering and aggregation based on these dimensions. Each Prometheus server is an autonomous single node with no reliance on distributed storage. Targets are discovered via service discovery or static configuration. Prometheus supports templating in the annotations and labels of alerts. A Prometheus workspace is limited to a single region.
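As an illustration of filtering and aggregating along dimensions, the sketch below emulates in plain Python what a PromQL query such as sum by (method) (http_requests_total{service="api"}) does conceptually. It is not the PromQL engine; the helper names (select, sum_by) and the sample values are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Latest value per series as (labels, value); the metric name is implied.
Series = Tuple[Dict[str, str], float]


def select(series: List[Series], **matchers: str) -> List[Series]:
    """Keep series whose labels equal every matcher, like {service="api"}."""
    return [
        (labels, value)
        for labels, value in series
        if all(labels.get(name) == want for name, want in matchers.items())
    ]


def sum_by(series: List[Series], *by: str) -> Dict[Tuple[str, ...], float]:
    """Sum values per combination of the 'by' labels, dropping all other labels."""
    totals: Dict[Tuple[str, ...], float] = defaultdict(float)
    for labels, value in series:
        totals[tuple(labels.get(name, "") for name in by)] += value
    return dict(totals)


http_requests_total = [
    ({"service": "api", "method": "GET", "instance": "a"}, 120.0),
    ({"service": "api", "method": "GET", "instance": "b"}, 80.0),
    ({"service": "api", "method": "POST", "instance": "a"}, 30.0),
    ({"service": "web", "method": "GET", "instance": "c"}, 500.0),
]

result = sum_by(select(http_requests_total, service="api"), "method")
print(result)  # {('GET',): 200.0, ('POST',): 30.0}
```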

Architecture

Prometheus Server collects multi-dimensional time series data and then analyzes and aggregates it. The process of collecting metrics is called scraping. A time series is a sequence of data points collected at successive, usually fixed, time intervals.

The Prometheus server pulls metrics from its targets automatically, so the user does not need to push metrics for analysis. The client needs to expose an HTTP endpoint, conventionally /metrics, that returns the complete set of metrics available to Prometheus. Prometheus currently supports four metric types: counter, a cumulative metric that only increases or resets to zero; gauge, which can arbitrarily go up and down; histogram, which counts observations in configurable buckets; and summary, which samples observations and calculates quantiles on the client side. The Pushgateway is an intermediary for metrics from jobs that cannot be scraped by the usual methods, although it comes with some drawbacks if not used properly.
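A minimal sketch of such an endpoint, using only the Python standard library, might look like the following. It renders one counter and one gauge in the Prometheus text exposition format. The metric names, the application state, and port 8000 are hypothetical; a real service would more likely use an official client library than format the text by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical application state; a real service would update these as it works.
REQUESTS_TOTAL = {"GET": 3, "POST": 1}  # counter: only ever increases
QUEUE_DEPTH = 7                         # gauge: can go up or down


def render_metrics() -> str:
    """Render the current state in the Prometheus text exposition format."""
    lines = [
        "# HELP app_requests_total Total requests handled.",
        "# TYPE app_requests_total counter",
    ]
    for method, count in sorted(REQUESTS_TOTAL.items()):
        lines.append(f'app_requests_total{{method="{method}"}} {count}')
    lines += [
        "# HELP app_queue_depth Jobs currently waiting.",
        "# TYPE app_queue_depth gauge",
        f"app_queue_depth {QUEUE_DEPTH}",
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, serving /metrics on the given (arbitrarily chosen) port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


print(render_metrics(), end="")
```

Calling serve() starts the endpoint; Prometheus would then be pointed at that host and port as a scrape target.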

The Prometheus server uses a pull method: it reaches out to exporters to collect data. An exporter is any application that exposes data in a format Prometheus can read. The scrape_configs section in prometheus.yml configures the Prometheus server to regularly connect to each exporter and collect metric data. Exporters do not reach out to Prometheus. A pull-based system also makes it possible to scrape metrics from remote targets. However, some use cases require a push method, such as monitoring batch jobs. Prometheus Pushgateway serves as a middleman for such use cases, where the client application pushes metric data to Pushgateway. The Prometheus server then pulls metrics from Pushgateway, just like from any other exporter.
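The push side of this flow can be sketched as follows: a batch job builds an HTTP request that uploads its metrics, in the text exposition format, to the Pushgateway under a job name. The example assumes a Pushgateway listening on its default port 9091 on localhost and a simple job name without slashes; the job name nightly_backup and the metric name are hypothetical. The request is only constructed here, so the sketch runs without a gateway.

```python
import urllib.parse
import urllib.request


def build_push_request(gateway_url: str, job: str, metrics_text: str) -> urllib.request.Request:
    """Build a PUT to /metrics/job/<job>, which replaces that job's metric group."""
    url = f"{gateway_url.rstrip('/')}/metrics/job/{urllib.parse.quote(job, safe='')}"
    return urllib.request.Request(
        url,
        data=metrics_text.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


def push(request: urllib.request.Request) -> int:
    """Send the request and return the HTTP status; needs a running Pushgateway."""
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


# Hypothetical metric written by a batch job when it finishes.
payload = (
    "# TYPE backup_last_success_timestamp_seconds gauge\n"
    "backup_last_success_timestamp_seconds 1672387200\n"
)
req = build_push_request("http://localhost:9091", "nightly_backup", payload)
print(req.get_method(), req.full_url)
# PUT http://localhost:9091/metrics/job/nightly_backup
```

Calling push(req) at the end of the batch job would upload the sample. Prometheus then scrapes the Pushgateway like any other target, so the metric stays visible after the job has exited.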

Recording rules let you pre-compute the values of expressions and queries and save the results as their own separate set of time series data. Recording rules are evaluated on a schedule, executing an expression and saving the result as a new metric. They are especially useful when we have complex or expensive queries that are executed frequently. For example, by saving pre-computed results using a recording rule, the expression does not need to be re-evaluated every time someone opens a dashboard. Recording rules are configured using YAML. Create them by placing YAML files in the location specified by rule_files in prometheus.yml. When creating or changing recording rules, reload their configuration the same way as we would when changing prometheus.yml.
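To show the shape of such a file, here is a small sketch of a helper that renders a rule file with one group of recording rules. The helper name render_rule_file, the group name, the interval, and the rule itself are assumptions for illustration. Values are emitted as double-quoted strings through json.dumps, which is one simple way to keep the YAML valid when an expression contains quotes or braces.

```python
import json
from typing import Dict


def render_rule_file(group: str, interval: str, rules: Dict[str, str]) -> str:
    """Render one rule group; 'rules' maps the new metric name to its expression."""
    lines = [
        "groups:",
        f"  - name: {json.dumps(group)}",
        f"    interval: {interval}",
        "    rules:",
    ]
    for record, expr in rules.items():
        # JSON string literals are also valid YAML double-quoted scalars.
        lines.append(f"      - record: {json.dumps(record)}")
        lines.append(f"        expr: {json.dumps(expr)}")
    return "\n".join(lines) + "\n"


rule_text = render_rule_file(
    group="http_rules",
    interval="1m",
    rules={"job:http_requests:rate5m": "sum by (job) (rate(http_requests_total[5m]))"},
)
print(rule_text, end="")
```

The resulting text would be saved to a file matched by rule_files. Dashboards can then query job:http_requests:rate5m directly instead of re-evaluating the rate expression on every load.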

Alertmanager is an application that runs in a separate process from the Prometheus server. It is responsible for handling alerts sent to it by clients such as the Prometheus server. Alerts are notifications that are triggered automatically by metric data. Alertmanager does not create alerts or determine when alerts need to be sent based on metric data. Prometheus handles that step and forwards the resulting alerts to Alertmanager. Alertmanager does the following:
  • Deduplicating alerts when multiple clients send the same alert.
  • Suppressing or muting notifications for a particular time frame, matched by any label set.
  • Grouping multiple alerts together when they happen around the same time.
  • Routing alerts to the proper destination, such as email or an external alerting service such as PagerDuty or OpsGenie.
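
The four responsibilities above can be sketched as a small pipeline. This is a conceptual toy, not Alertmanager's implementation or configuration format: alerts are reduced to their label sets, silences and routes are plain label matchers, and every name (process, matches, the receivers, the sample alerts) is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Alert = Dict[str, str]  # an alert is represented here by its label set only
GroupKey = Tuple[str, Tuple[str, ...]]  # (receiver, values of the group_by labels)


def matches(alert: Alert, matcher: Dict[str, str]) -> bool:
    """True when every label in the matcher has the same value on the alert."""
    return all(alert.get(name) == value for name, value in matcher.items())


def process(alerts: Sequence[Alert], silences: Sequence[Dict[str, str]],
            group_by: Sequence[str], routes: Sequence[Tuple[Dict[str, str], str]],
            default_receiver: str) -> Dict[GroupKey, List[Alert]]:
    seen = set()
    groups: Dict[GroupKey, List[Alert]] = defaultdict(list)
    for alert in alerts:
        fingerprint = frozenset(alert.items())
        if fingerprint in seen:  # deduplicate identical label sets
            continue
        seen.add(fingerprint)
        if any(matches(alert, silence) for silence in silences):  # muted
            continue
        # Route: the first matching route wins, otherwise use the default.
        receiver = next((target for matcher, target in routes
                         if matches(alert, matcher)), default_receiver)
        # Group: alerts sharing receiver and group_by labels go out together.
        groups[(receiver, tuple(alert.get(l, "") for l in group_by))].append(alert)
    return dict(groups)


alerts = [
    {"alertname": "HighLatency", "service": "api", "severity": "page", "instance": "a"},
    {"alertname": "HighLatency", "service": "api", "severity": "page", "instance": "a"},
    {"alertname": "HighLatency", "service": "api", "severity": "page", "instance": "b"},
    {"alertname": "DiskFull", "service": "db", "severity": "warn", "instance": "c"},
    {"alertname": "DiskFull", "service": "batch", "severity": "warn", "instance": "d"},
]
grouped = process(
    alerts,
    silences=[{"service": "batch"}],
    group_by=["alertname"],
    routes=[({"severity": "page"}, "pagerduty")],
    default_receiver="email",
)
for (receiver, key), members in grouped.items():
    print(receiver, key, len(members))
# pagerduty ('HighLatency',) 2
# email ('DiskFull',) 1
```

Of the five incoming alerts, one is dropped as a duplicate and one is silenced. The remaining three become two notifications, one per receiver and alert name.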