Developer Manual
The following sections describe how the software stack is set up, how it can be configured, and how to interact with it at a lower level.
Docker compose
All services run in Docker containers, which spin up their own network to communicate, and expose specific ports on the host for users to access. This infrastructure is managed by a Makefile. The most interesting commands are:
# start/stop/restart all services
make start
make stop
make restart
# check status of running services
make status
# build all images
make build
# you can also just target a single service
make build prometheus-central
make start prometheus-central
make stop prometheus-central
make restart prometheus-central
The individual services are configured at the Docker (Compose) level through their own .yml file.
Prometheus
Prometheus is our time-series database, and fulfills several roles:
Storing the time-series of the metrics,
Scraping (collecting) metrics periodically from across our instrument,
Running queries against the time-series database.
Configuration
The prometheus-central/prometheus.yml configuration file configures our instance to:
Periodically scrape the Prometheus metrics from the Prometheus installations on each station, using a federation,
Periodically scrape the metrics from the services that are part of this software package,
Periodically scrape any other metric source that is offered in the Prometheus format.
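As an illustration of the first item: a federation scrape is an HTTP GET on the /federate endpoint of the station's Prometheus, with one match[] parameter per series selector. The Python sketch below only builds such a URL and does not contact any station. The helper name federate_url is hypothetical; the host name and selectors are taken from the configuration in this section.

```python
from urllib.parse import urlencode


def federate_url(base_url, selectors):
    """Build the /federate URL that a federation scrape would request."""
    # Each selector is passed as its own match[] query parameter.
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url}/federate?{query}"


url = federate_url(
    "http://cs001c.control.lofar:9090",
    ['{job="tango"}', '{job="host"}'],
)
print(url)
```

Opening the printed URL in a browser or with curl, from a machine that can reach the station, should return the selected series in the Prometheus text format.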
The scraped metrics are annotated with a host label denoting where the metric came from. This replaces the host=localhost label coming from station metrics. The full configuration is as follows:
global:
  evaluation_interval: 10s
  scrape_interval: 10s
  scrape_timeout: 10s
scrape_configs:
  # production: all LOFAR stations
  - job_name: stations
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="tango"}'
        - '{job="host"}'
        - '{job="prometheus"}'
        - '{job="grafana"}'
    static_configs:
      - targets:
          - "test-lcu2.astron.nl:9090"
          - "dts-lcu.astron.nl:9090"
          - "dop496.astron.nl:9090"
          - "cs001c.control.lofar:9090"
    relabel_configs:
      - source_labels: [ "__address__" ]
        target_label: "host"
        regex: "(.*)(.astron.nl|.control.lofar):9090"
  # development: any station running in the same Docker network
  - job_name: local-station
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="tango"}'
        - '{job="host"}'
    static_configs:
      - targets: ["prometheus:9090"]
        labels:
          "host": "localhost"
  # scrape local services
  - job_name: host
    scrape_interval: 60s
    static_configs:
      - targets: ["host.docker.internal:9100"]
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: grafana
    static_configs:
      - targets: ["grafana-central:3000"]
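The relabel rule of the stations job derives the host label from the scrape address. The Python sketch below mimics that rule to show which label value each target would get. It assumes Prometheus' documented relabel behaviour (the regex is anchored at both ends, and the default replacement is the first capture group); the helper name host_label is hypothetical.

```python
import re

# Regex copied from relabel_configs of the stations job. Prometheus anchors
# relabel regexes at both ends, which re.fullmatch() mimics here.
HOST_REGEX = re.compile(r"(.*)(.astron.nl|.control.lofar):9090")


def host_label(address):
    """Return the value the host label would get, or None if the rule does not apply."""
    match = HOST_REGEX.fullmatch(address)
    # The default relabel replacement is "$1", i.e. the first capture group.
    return match.group(1) if match else None


for target in ["test-lcu2.astron.nl:9090", "cs001c.control.lofar:9090", "prometheus:9090"]:
    print(target, "->", host_label(target))
```

Addresses that do not match the regex are left untouched by the rule, which is why the local-station job sets its host label statically instead. The unescaped dots in the regex match any character, which is harmless for the targets listed.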
Furthermore, prometheus-central/Dockerfile configures:
The retention of the archive, or for how long/how much data will be stored. See also https://prometheus.io/docs/prometheus/latest/storage/#operational-aspects,
Where the data are stored (in conjunction with the paths mounted in prometheus-central.yml).
Monitoring
Prometheus also collects metrics about itself, e.g. the performance of the configured scrape jobs:
# list how many samples could be scraped from each end point
scrape_samples_scraped{exported_job=""}
# list a report of the scraping duration for each configured end point
scrape_duration_seconds{exported_job=""}
NB: The exported_job="" filter is needed to avoid returning the values from Prometheus instances on the stations about their scraping.
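The same queries can be run from a script through the Prometheus HTTP API (/api/v1/query). The sketch below uses only the Python standard library. The base URL http://localhost:9090 in the usage note is an assumption about how the central Prometheus is exposed on the host, and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def query_url(base_url, promql):
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def samples_per_target(payload):
    """Map the instance label of each result to its sample value."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    return {
        result["metric"].get("instance", "?"): float(result["value"][1])
        for result in payload["data"]["result"]
    }


def fetch(base_url, promql):
    """Run the query against a live Prometheus and summarise the result."""
    with urlopen(query_url(base_url, promql), timeout=10) as response:
        return samples_per_target(json.load(response))
```

For example, fetch("http://localhost:9090", 'scrape_samples_scraped{exported_job=""}') would return the number of scraped samples per end point, provided Prometheus is reachable at that address.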
Grafana
Grafana is our dashboard and alerting engine. It:
Operates user-managed dashboards covering one or more datasources,
Mainly pulls its metrics from Prometheus,
Periodically evaluates configured alerts.
Grafana is mostly provisioned, that is, its configuration is built on startup:
Dashboards are installed from grafana-central/dashboards, including a copy of the Station dashboards. Note that grafana-central/dashboards/dashboards.yaml configures their use in Grafana,
Datasources are installed from grafana-central/datasources,
System configuration is pulled from grafana-central/grafana.ini.
We use the new Grafana Alerting subsystem to manage alerts:
We define one or more alert conditions to fire when their threshold is exceeded,
An alert can cover one or more elements, e.g. stations or FPGAs within a station,
The alerts and their configuration are not provisioned, because Grafana lacks support for doing so: the new Alerting engine in Grafana is still under development at the time of writing. Alerts therefore need to be configured and added after installation. The alerts, as well as any dashboards added by the user, are stored in Docker volumes (see grafana-central.yml), so removing those volumes deletes them.
Alerts fire as follows:
Each alert is periodically evaluated (configurable, at most once per 10 seconds),
Grafana waits the Group wait time to collect similar alerts into a single group,
When the alert condition holds, the configured Contact points (http://localhost:3001/alerting/notifications and http://localhost:3001/alerting/routes) are informed,
When the alert condition clears, the Contact points are informed again.
See also the Alerting Notification documentation.
Configuration
The alerts must be configured to be forwarded to Alerta as follows:
Add a Contact point Alerta with the following settings:
Contact point type is Webhook,
Url is http://alerta-server:8080/api/webhooks/prometheus?api-key=demo-key. This API key is configured as ADMIN_KEY in alerta.yml.
Hint
Whether Grafana can send alerts to Alerta can be tested by sending a Test alert on the Contact point page.
In Notification policies, modify the Root policy to:
Default contact point is Alerta,
Under Timing options, set Group wait = 10s, Group interval = 10s, Repeat interval = 10m.
The shorter Group wait and Group interval lower the latency with which alerts are sent, and the shorter Repeat interval means that any lost or deleted alarms are resent sooner than with the default of 4 hours.
Monitoring
Grafana exposes its metrics to Prometheus, and thus can be used to query itself. The following monitoring points are especially useful:
grafana_alerting_rule_evaluation_duration_seconds returns the time Grafana needs to evaluate its alerts,
grafana_plugin_request_duration_milliseconds returns the time Grafana needs to query its data sources.
Alerta
The Alerta stack manages alerts that come from Grafana, and allows an operator to track them using the ISA 18.2 alarm model, providing the following key states:
NORM: Condition is normal: alarm is not active, all past alarms were acknowledged,
UNACK: Alarm is active, and has not been acknowledged,
RTNUN: Alarm came and went, but has not been acknowledged,
SHLVD: Shelved: condition changes are ignored.
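The states above can be read as a function of three flags: whether the alarm condition is active, whether it has been acknowledged, and whether it is shelved. The sketch below is a simplified illustration of that reading, not Alerta's implementation. The function name is hypothetical, and the ACKED state (active and acknowledged) is not listed above but is assumed here from the ISA 18.2 model.

```python
def alarm_state(active, acknowledged, shelved=False):
    """Return the ISA 18.2 state name for a combination of alarm flags."""
    if shelved:
        # Shelved alarms ignore condition changes altogether.
        return "SHLVD"
    if active:
        # ACKED is an assumption taken from ISA 18.2, not from the list above.
        return "ACKED" if acknowledged else "UNACK"
    # Condition has cleared: normal only once it has been acknowledged.
    return "NORM" if acknowledged else "RTNUN"


print(alarm_state(active=True, acknowledged=False))   # UNACK
print(alarm_state(active=False, acknowledged=False))  # RTNUN
print(alarm_state(active=False, acknowledged=True))   # NORM
```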
The stack features the following services, as configured in alerta.yml:
alerta-server, which receives and processes alerts from Grafana,
alerta-db, which is the storage backend for both the server and the front end,
alerta-web, which provides a front end for the user to interact with Alerta.
The Alerta server receives the alerts from Grafana, and routes them through several plugins. Most notably:
Our grafana-plugin processes Grafana-specific fields, such as dashboard and panel links,
Our lofar-plugin processes LOFAR-specific properties, such as pulling metadata from the device attributes we use in Prometheus,
The slack plugin posts messages on our Slack instance.
For info on externally developed plugins, see also the alerta-contrib repo.
Slack plugin
Messages on Slack are configured as follows:
The message layout is defined in alerta-server/alertad.conf,
The access to Slack is provided in alerta-server/alerta-secrets.json, which needs to be updated with the API key as given by Slack, to grant posting rights on the configured channel,
Whether to post a Slack message is defined in alerta-server/lofar-routing-plugin/routing.py. Only new problems are reported (that is, those that appear for the first time, or were ACKed and went away last time), to reduce spam.
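The "only new problems" rule can be restated as a small predicate. The sketch below is a hypothetical restatement of the sentence above, not the contents of routing.py. It assumes that "ACKed and went away" corresponds to the NORM state, and that an alert seen for the first time has no previous state.

```python
def is_new_problem(is_problem, previous_state):
    """Decide whether an alert update counts as a new problem.

    previous_state is None for an alert seen for the first time,
    otherwise the ISA 18.2 state it was in before this update.
    """
    if not is_problem:
        # Clearing or informational updates are never reported.
        return False
    # New: never seen before, or previously acknowledged and cleared (NORM).
    return previous_state is None or previous_state == "NORM"


print(is_new_problem(True, None))     # True: first occurrence
print(is_new_problem(True, "NORM"))   # True: was ACKed and went away
print(is_new_problem(True, "RTNUN"))  # False: earlier occurrence not yet ACKed
```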
Web UI
The Alerta user interface is exposed on http://localhost:8081, and stores its state in the server. It is therefore also configured through the server, in alerta-server/alertad.conf. See also https://docs.alerta.io/configuration.html.
The credentials are hard-coded into alerta.yml, until a connection with an authentication backend (e.g. LDAP or Keycloak) is made.
API
The API exposed by the server can be queried from Grafana, using any datasource capable of querying HTTP and parsing JSON. Use the http://alerta-server:8080/api/alerts REST endpoint. See also https://docs.alerta.io/api/reference.html.
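The same endpoint can also be queried from a script, which helps when checking what a datasource would receive. The sketch below uses only the Python standard library. It assumes it runs inside the Docker network (where alerta-server resolves), that the API key is passed in an "Authorization: Key ..." header, and that the response holds the alerts in an "alerts" list; the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ALERTA_API = "http://alerta-server:8080/api"


def alerts_request(api_key, **filters):
    """Build a GET request for /alerts, with optional attribute filters."""
    query = urlencode(filters)
    url = f"{ALERTA_API}/alerts" + (f"?{query}" if query else "")
    return Request(url, headers={"Authorization": f"Key {api_key}"})


def summarize(payload):
    """Reduce an /alerts response to (resource, event, severity, status) tuples."""
    return [
        (a.get("resource"), a.get("event"), a.get("severity"), a.get("status"))
        for a in payload.get("alerts", [])
    ]


def fetch_alerts(api_key, **filters):
    """Query a live Alerta server and summarise the returned alerts."""
    with urlopen(alerts_request(api_key, **filters), timeout=10) as response:
        return summarize(json.load(response))
```

For example, fetch_alerts("demo-key") would list all alerts, using the API key that is configured as ADMIN_KEY in alerta.yml.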