Developer Manual

The following sections describe how the software stack is set up, how it can be configured, and how to interact with it at a lower level.

Docker compose

All services run in Docker containers, which spin up their own network to communicate, and expose specific ports on the host for users to access. This infrastructure is managed by a Makefile. The most interesting commands are:

# start/stop/restart all services
make start
make stop
make restart

# check status of running services
make status

# build all images
make build

# you can also just target a single service
make build prometheus-central
make start prometheus-central
make stop prometheus-central
make restart prometheus-central

The individual services are configured at the Docker (Compose) level through their own .yml file.

Prometheus

Prometheus is our time-series database, and fulfills several roles:

Storing the time-series of the metrics,
Scraping (collecting) metrics periodically from across our instrument,
Running queries against the time-series database.

Configuration

The prometheus-central/prometheus.yml configuration file configures our instance to:

Periodically scrape the Prometheus metrics from the Prometheus installations on each station, using federation,
Periodically scrape the metrics from the services that are part of this software package,
Periodically scrape any other metric source that is offered in the Prometheus format.
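
To illustrate what a federation scrape looks like on the wire, the following minimal sketch builds the URL that is requested from a station. The helper name federate_url is hypothetical, and the target and selectors are taken from the configuration in this section; Prometheus builds this request itself, so the code only serves to reproduce it by hand, for example when debugging with curl or a browser.

```python
from urllib.parse import urlencode


def federate_url(target: str, selectors: list) -> str:
    """Build the federation URL for one station (hypothetical helper).

    Each selector is passed as a separate match[] parameter, which is how
    the /federate endpoint of Prometheus expects them.
    """
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"http://{target}/federate?{query}"


# Target and selectors as used by the "stations" job in prometheus.yml.
url = federate_url("cs001c.control.lofar:9090", ['{job="tango"}', '{job="host"}'])
print(url)
```

Opening the printed URL should return the selected metrics in the Prometheus text format, provided the station is reachable from where the request is made.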

The scraped metrics are annotated with a host label denoting where the metric came from. This replaces the host=localhost label coming from station metrics. The full configuration is as follows:

global:
  evaluation_interval: 10s
  scrape_interval: 10s
  scrape_timeout: 10s

scrape_configs:
  # production: all LOFAR stations
  - job_name: stations
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="tango"}'
        - '{job="host"}'
        - '{job="prometheus"}'
        - '{job="grafana"}'
    static_configs:
      - targets:
        - "test-lcu2.astron.nl:9090"
        - "dts-lcu.astron.nl:9090"
        - "dop496.astron.nl:9090"
        - "cs001c.control.lofar:9090"
    relabel_configs:
      - source_labels: [ "__address__" ]
        target_label: "host"
        regex: "(.*)(.astron.nl|.control.lofar):9090"
  # development: any station running in the same Docker network
  - job_name: local-station
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="tango"}'
        - '{job="host"}'
    static_configs:
      - targets: ["prometheus:9090"]
        labels:
          "host": "localhost"
  # scrape local services
  - job_name: host
    scrape_interval: 60s
    static_configs:
      - targets: ["host.docker.internal:9100"]
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: grafana
    static_configs:
      - targets: ["grafana-central:3000"]
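
The effect of the relabel_configs entry of the stations job can be checked outside Prometheus. The sketch below is an approximation in Python, not Prometheus code: it assumes the default replace action with the default replacement of the first capture group, and it relies on Prometheus anchoring the regex on both ends, which re.fullmatch mimics. The function name relabel_host is hypothetical.

```python
import re
from typing import Optional

# Same pattern as in prometheus.yml. Note that the dots are unescaped there,
# so they match any character, not only a literal dot.
HOST_REGEX = re.compile(r"(.*)(.astron.nl|.control.lofar):9090")


def relabel_host(address: str) -> Optional[str]:
    """Return the value the host label would get, or None if the regex
    does not match (in which case the label is left untouched)."""
    match = HOST_REGEX.fullmatch(address)
    return match.group(1) if match else None


for address in ("test-lcu2.astron.nl:9090", "cs001c.control.lofar:9090", "prometheus:9090"):
    print(address, "->", relabel_host(address))
```

Addresses outside the two listed domains do not match, which is why the local-station job sets its host label explicitly instead.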

Furthermore, prometheus-central/Dockerfile configures:

The retention of the archive, that is, for how long and how much data is stored. See also https://prometheus.io/docs/prometheus/latest/storage/#operational-aspects,
Where the data are stored (in conjunction with the paths mounted in prometheus-central.yml).

Monitoring

Prometheus also collects metrics about itself, e.g. the performance of the configured scrape jobs:

# list how many samples could be scraped from each end point
scrape_samples_scraped{exported_job=""}

# list a report of the scraping duration for each configured end point
scrape_duration_seconds{exported_job=""}

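NB: The exported_job="" filter is needed to exclude the scrape metrics that the Prometheus instances on the stations report about their own scrape jobs. Federated metrics carry the original job label of the station as exported_job, whereas the metrics about our own scrape jobs have no such label.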
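
These queries can also be run from a script through the Prometheus HTTP API. The following is a minimal sketch using only the Python standard library. It assumes that Prometheus is reachable as http://localhost:9090 from where the script runs, which depends on the ports exposed in prometheus-central.yml; the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PROMETHEUS_URL = "http://localhost:9090"  # assumption: adjust to your setup


def query_url(promql: str, base: str = PROMETHEUS_URL) -> str:
    """Build the URL of an instant query against the Prometheus HTTP API."""
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


def run_query(promql: str) -> dict:
    """Execute an instant query and return the decoded JSON reply."""
    with urlopen(query_url(promql), timeout=10) as reply:
        return json.load(reply)


def values_per_target(response: dict) -> dict:
    """Map (job, instance) to the sample value of each returned series."""
    return {
        (r["metric"].get("job", ""), r["metric"].get("instance", "")): float(r["value"][1])
        for r in response["data"]["result"]
    }


# Usage, requires a running Prometheus:
#   reply = run_query('scrape_samples_scraped{exported_job=""}')
#   for target, samples in values_per_target(reply).items():
#       print(target, samples)
```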

Grafana

Grafana is our dashboard and alerting engine. It:

Operates user-managed dashboards covering one or more datasources,
Mainly pulls its metrics from Prometheus,
Periodically evaluates configured alerts.

Grafana is mostly provisioned, that is, its configuration is built on startup:

Dashboards are installed from grafana-central/dashboards, including a copy of the Station dashboards. Note that grafana-central/dashboards/dashboards.yaml configures their use in Grafana,
Datasources are installed from grafana-central/datasources,
System configuration is pulled from grafana-central/grafana.ini.

We use the new Grafana Alerting subsystem to manage alerts:

We define one or more alert conditions to fire when their threshold is exceeded,
An alert can cover one or more elements, e.g. stations or FPGAs within a station.

The alerts and their configuration are not provisioned, as Grafana lacks support for this: the new Alerting engine in Grafana is still under development at the time of writing. Alerts therefore need to be configured and added after installation. The alerts, as well as any dashboards added by the user, are stored in Docker volumes (see grafana-central.yml). Removing those volumes thus causes them to be lost.

Alerts fire as follows:

Each alert is periodically evaluated (configurable, at most once per 10 seconds),
Grafana waits for the Group wait time to collect similar alerts into a single group,
When the alert condition holds, the configured Contact points (http://localhost:3001/alerting/notifications and http://localhost:3001/alerting/routes) are informed,
When the alert condition clears, the Contact points are informed again.

See also the Alerting Notification documentation.

Configuration

The alerts must be configured to be forwarded to Alerta as follows:

Add a Contact point Alerta with the following settings:
- Contact point type is Webhook,
- Url is http://alerta-server:8080/api/webhooks/prometheus?api-key=demo-key. This API key is configured as ADMIN_KEY in alerta.yml.

Hint

Whether Grafana can send alerts to Alerta can be tested by sending a Test alert on the Contact point page.

In Notification policies, modify the Root policy to:
- Default contact point is Alerta,
- Under Timing options, set Group wait = 10s, Group interval = 10s, Repeat interval = 10m.

The shorter Group times reduce the latency with which alerts are sent, and the shorter Repeat interval means any lost or deleted alarms are resent sooner than with the default of 4 hours.
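
Besides the Test alert on the Contact point page, the webhook can be exercised by hand. The sketch below builds such a request with the Python standard library. It assumes that the endpoint accepts a payload in the Prometheus Alertmanager webhook format, which the URL suggests; the exact set of fields that Alerta requires is an assumption, as are the alert name and label values, which are made up for illustration. The host name alerta-server only resolves inside the Docker network.

```python
import json
from datetime import datetime, timezone
from urllib.request import Request

# URL as configured for the Contact point above.
ALERTA_WEBHOOK = "http://alerta-server:8080/api/webhooks/prometheus?api-key=demo-key"


def build_test_alert(alertname: str, instance: str, severity: str = "warning") -> Request:
    """Build (but do not send) a POST request carrying one firing alert."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    payload = {
        "alerts": [
            {
                "status": "firing",
                "labels": {"alertname": alertname, "instance": instance, "severity": severity},
                "annotations": {"summary": "Manual test alert"},
                "startsAt": now,
                # Alertmanager uses this zero timestamp for alerts that have not ended.
                "endsAt": "0001-01-01T00:00:00Z",
                "generatorURL": "http://localhost/manual-test",  # placeholder
            }
        ]
    }
    return Request(
        ALERTA_WEBHOOK,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_test_alert("ManualTest", "cs001c")
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen sends it. If Alerta accepts the payload, the alert should show up in the Alerta web interface.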

Monitoring

Grafana exposes its metrics to Prometheus, and can thus be used to monitor itself. The following monitoring points are especially useful:

grafana_alerting_rule_evaluation_duration_seconds returns the time Grafana needs to evaluate its alerts,
grafana_plugin_request_duration_milliseconds returns the time Grafana needs to query its data sources.

Alerta

The Alerta stack manages alerts that come from Grafana, and allows an operator to track them using the ISA 18.2 alarm model, providing the following key states:

NORM: Condition is normal: alarm is not active, all past alarms were acknowledged,
UNACK: Alarm is active, and has not been acknowledged,
RTNUN: Alarm came and went, but has not been acknowledged,
SHLVD: Shelved: condition changes are ignored.
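
The relation between these states can be summarised in a few lines of code. The sketch below is a simplification for illustration only and is not how Alerta implements its alarm model. It also includes the ACKED state of ISA 18.2 (alarm active and acknowledged), which is not listed above; the function and parameter names are hypothetical.

```python
def alarm_state(active: bool, acknowledged: bool, shelved: bool = False) -> str:
    """Derive the alarm state from the condition and the operator's actions.

    active:       the alarm condition currently holds
    acknowledged: no alarm is waiting for acknowledgement by the operator
    shelved:      the operator shelved the alarm, so condition changes are ignored
    """
    if shelved:
        return "SHLVD"
    if active:
        return "ACKED" if acknowledged else "UNACK"
    return "NORM" if acknowledged else "RTNUN"


# An alarm that fires, clears by itself, and is acknowledged afterwards:
print(alarm_state(active=True, acknowledged=False))   # UNACK
print(alarm_state(active=False, acknowledged=False))  # RTNUN
print(alarm_state(active=False, acknowledged=True))   # NORM
```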

The stack features the following services, as configured in alerta.yml:

alerta-server, which receives and processes alerts from Grafana,
alerta-db, which is the storage backend for both the server and the front end,
alerta-web, which provides a front end for the user to interact with Alerta.

The Alerta server receives the alerts from Grafana, and routes them through several plugins. Most notably:

Our grafana-plugin processes Grafana-specific fields, such as dashboard and panel links,
Our lofar-plugin processes LOFAR-specific properties, such as pulling metadata from the device attributes we use in Prometheus,
The slack plugin posts messages on our Slack instance.

For info on externally developed plugins, see also the alerta-contrib repo.

Slack plugin

Messages on Slack are configured as follows:

The message layout is defined in alerta-server/alertad.conf,
The access to Slack is provided in alerta-server/alerta-secrets.json, which needs to be updated with the API key as given by Slack, to grant posting rights on the configured channel,
Whether to post a Slack message is defined in alerta-server/lofar-routing-plugin/routing.py. Only new problems are reported (that is, those that appear for the first time, or were ACKed and went away last time), to reduce spam.
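
The rule for reporting only new problems can be expressed as a small predicate. The sketch below is one reading of that rule in terms of the alarm states described earlier, and is not the code in routing.py: it assumes that a problem is new if the alarm becomes UNACK while it was previously unknown or in the NORM state. The function name is hypothetical.

```python
from typing import Optional


def is_new_problem(previous_state: Optional[str], new_state: str) -> bool:
    """Decide whether an alarm transition is worth a Slack message.

    previous_state is None for an alarm that was never seen before.
    An alarm that returns while still in RTNUN (it went away earlier but
    was never acknowledged) is not reported again.
    """
    return new_state == "UNACK" and previous_state in (None, "NORM")


print(is_new_problem(None, "UNACK"))     # first occurrence
print(is_new_problem("NORM", "UNACK"))   # was ACKed and went away last time
print(is_new_problem("RTNUN", "UNACK"))  # flapping alarm that was never ACKed
```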

Web UI

The Alerta user interface is exposed on http://localhost:8081, and stores its state in the server. It is therefore also configured through the server, in alerta-server/alertad.conf. See also https://docs.alerta.io/configuration.html.

The credentials are hard-coded into alerta.yml, until a connection with an authentication backend (e.g. LDAP or Keycloak) is made.

API

The API exposed by the server can be queried from Grafana, using any datasource capable of querying HTTP and parsing JSON. Use the REST endpoint http://alerta-server:8080/api/alerts. See also https://docs.alerta.io/api/reference.html.
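
The same endpoint can be queried from a script, which is useful to inspect what a datasource would receive. The sketch below uses only the Python standard library. It assumes that the demo-key API key mentioned in the Grafana configuration is accepted through the Authorization header, and that the reply contains an alerts list whose entries carry resource, event and status fields; the environment filter and the function names are merely examples.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ALERTA_API = "http://alerta-server:8080/api"  # only resolves inside the Docker network
API_KEY = "demo-key"  # assumption: the ADMIN_KEY configured in alerta.yml


def alerts_request(**filters: str) -> Request:
    """Build a GET request for the alerts endpoint, with optional query filters."""
    query = f"?{urlencode(filters)}" if filters else ""
    return Request(
        f"{ALERTA_API}/alerts{query}",
        headers={"Authorization": f"Key {API_KEY}"},
    )


def summarise(response: dict) -> list:
    """Reduce the decoded JSON reply to (resource, event, status) tuples."""
    return [
        (alert.get("resource"), alert.get("event"), alert.get("status"))
        for alert in response.get("alerts", [])
    ]


# Usage, requires a running Alerta server:
#   with urlopen(alerts_request(environment="Production"), timeout=10) as reply:
#       for row in summarise(json.load(reply)):
#           print(row)
```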