Developer Manual
===========================================

The following sections describe how the software stack is set up, how it can be configured, and how to interact with it at a lower level.

Docker Compose
-------------------------------------------

All services run in *Docker* containers, which spin up their own network to communicate, and expose specific ports on the host for users to access. This infrastructure is managed by a ``Makefile``. The most interesting commands are::

    # start/stop/restart all services
    make start
    make stop
    make restart

    # check status of running services
    make status

    # build all images
    make build

    # you can also just target a single service
    make build prometheus-central
    make start prometheus-central
    make stop prometheus-central
    make restart prometheus-central

The individual services are configured at the Docker (Compose) level, each through its own ``.yml`` file.

Prometheus
-------------------------------------------

Prometheus is our *time-series database*, and fulfills several roles:

* Storing the time-series of the metrics,
* Scraping (collecting) metrics periodically from across our instrument,
* Running queries against the time-series database.

Configuration
"""""""""""""""""""""""""""""""""""""""""""

The ``prometheus-central/prometheus.yml`` configuration file configures our instance to:

* Periodically scrape the Prometheus metrics from the Prometheus installations on each station, using `federation <https://prometheus.io/docs/prometheus/latest/federation/#federation>`_,
* Periodically scrape the metrics from the services that are part of this software package,
* Periodically scrape any other metric source that is offered in the Prometheus format.
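
Federation works by having the central instance scrape the ``/federate`` end point of each station's Prometheus, passing one ``match[]`` parameter per series selector. The sketch below only builds such a URL to illustrate the mechanism. The host name ``station.example``, the port ``9090`` (the Prometheus default) and the selectors are assumptions, not values taken from ``prometheus.yml``.

```python
from urllib.parse import urlencode


def federate_url(station_host, selectors, port=9090):
    """Build the /federate URL a central Prometheus would scrape.

    station_host and port are placeholders; the real targets are
    whatever is listed in prometheus-central/prometheus.yml.
    """
    # Each series selector is passed as a separate match[] parameter.
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"http://{station_host}:{port}/federate?{query}"


# Hypothetical station and selectors, for illustration only.
print(federate_url("station.example", ['{job=~".+"}', "up"]))
```

Opening such a URL in a browser or with ``curl`` is a quick way to check what a station would hand over to the central instance.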

The scraped metrics are annotated with a ``host`` label denoting where the metric came from. This replaces the ``host=localhost`` label coming from station metrics. The full configuration is as follows:

.. literalinclude:: ../../prometheus-central/prometheus.yml

Furthermore, ``prometheus-central/Dockerfile`` configures:

* The *retention* of the archive, that is, for how long and how much data will be stored. See also https://prometheus.io/docs/prometheus/latest/storage/#operational-aspects,
* Where the data are stored (in conjunction with the paths mounted in ``prometheus-central.yml``).

Monitoring
"""""""""""""""""""""""""""""""""""""""""""

Prometheus also collects metrics about itself, e.g. the performance of the configured scrape jobs::

  # list how many samples could be scraped from each end point
  scrape_samples_scraped{exported_job=""}

  # list a report of the scraping duration for each configured end point
  scrape_duration_seconds{exported_job=""}

NB: The ``exported_job=""`` filter is needed to avoid returning the values from Prometheus instances on the stations about their scraping.
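
The same queries can be run outside the web interface through the Prometheus HTTP API (``/api/v1/query``). The following is a minimal sketch, assuming the central Prometheus is reachable on ``http://localhost:9090`` (the Prometheus default; check ``prometheus-central.yml`` for the port actually exposed). The ``SAMPLE`` response is made up to show the shape of an instant-query reply.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: default Prometheus port, exposed on the host.
PROMETHEUS_URL = "http://localhost:9090"


def query_url(expr, base=PROMETHEUS_URL):
    """Return the instant-query URL for a PromQL expression."""
    params = urlencode({"query": expr})
    return f"{base}/api/v1/query?{params}"


def slowest_targets(response, top=3):
    """Return (instance, seconds) pairs from a query reply, slowest first."""
    pairs = [
        (result["metric"].get("instance", "?"), float(result["value"][1]))
        for result in response["data"]["result"]
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:top]


def fetch(expr):
    """Run the query against the live instance (needs network access)."""
    with urlopen(query_url(expr), timeout=10) as reply:
        return json.load(reply)


# Made-up reply, illustrating the structure returned by /api/v1/query.
SAMPLE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "a:9100"}, "value": [1700000000, "0.5"]},
            {"metric": {"instance": "b:9100"}, "value": [1700000000, "2.5"]},
        ],
    },
}

print(query_url('scrape_duration_seconds{exported_job=""}'))
print(slowest_targets(SAMPLE))
```

Against a running stack, ``slowest_targets(fetch('scrape_duration_seconds{exported_job=""}'))`` would list the end points that take longest to scrape.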

Grafana
-------------------------------------------

Grafana is our *dashboard* and *alerting* engine. It:

* Operates user-managed dashboards covering one or more datasources,
* Mainly pulls its metrics from Prometheus,
* Periodically evaluates configured *alerts*.

Grafana is mostly *provisioned*, that is, its configuration is built on startup:

* Dashboards are installed from ``grafana-central/dashboards``, including a copy of the `Station dashboards <https://git.astron.nl/lofar2.0/grafana-station-dashboards>`_. Note that ``grafana-central/dashboards/dashboards.yaml`` configures their use in Grafana,
* Datasources are installed from ``grafana-central/datasources``,
* System configuration is pulled from ``grafana-central/grafana.ini``.

Alerts
```````````````````````````````````````````

We use the new Grafana `Alerting <https://grafana.com/docs/grafana/latest/alerting/>`_ subsystem to manage alerts:

* We define one or more alert conditions to fire when their threshold is exceeded,
* An alert can cover one or more elements, e.g. stations or FPGAs within a station.

The alerts and their configuration are not provisioned, because Grafana lacks support for provisioning them. The new Alerting engine in Grafana is still `under development <https://github.com/grafana/grafana/blob/main/CHANGELOG.md>`_ at the time of writing. Alerts therefore need to be configured and added after installation. The alerts, as well as any dashboards added by the user, are stored in Docker volumes (see ``grafana-central.yml``). Removing those volumes thus deletes them.

Alerts fire as follows:

* Each alert is periodically evaluated (configurable, at most once per 10 seconds),
* Grafana waits the *Group wait time* to collect similar alerts into a single group,
* When the alert condition holds, the configured Contact points (http://localhost:3001/alerting/notifications and http://localhost:3001/alerting/routes) are informed,
* When the alert condition clears, the Contact points are informed again.

See also the `Alerting Notification <https://grafana.com/docs/grafana/latest/alerting/notifications/>`_ documentation.

Configuration
"""""""""""""""""""""""""""""""""""""""""""

The alerts must be configured to be forwarded to Alerta as follows:

* Add a Contact point ``Alerta`` with the following settings:

  + ``Contact point type`` is ``Webhook``,
  + ``Url`` is ``http://alerta-server:8080/api/webhooks/prometheus?api-key=demo-key``. This API key is configured as ``ADMIN_KEY`` in ``alerta.yml``.

.. hint:: Whether Grafana can send alerts to Alerta can be tested by sending a Test alert on the Contact point page.

* In Notification policies, modify the *Root policy* to:

  + ``Default contact point`` is ``Alerta``,
  + Under ``Timing options``, set ``Group wait`` = 10s, ``Group interval`` = 10s, ``Repeat interval`` = 10m.

The faster Group times result in a lower latency of alerts being sent, and the faster Repeat interval means any lost or deleted alarms get resent earlier (than the default 4 hours).
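
Besides the Test button in Grafana, the webhook can be exercised by hand. The sketch below builds a request in the Alertmanager webhook format, which the ``/api/webhooks/prometheus`` end point is understood to accept. The alert name, instance, severity and ``generatorURL`` are made-up values, and the request is only constructed here, not sent.

```python
import json
from datetime import datetime, timezone
from urllib.request import Request

# URL and API key as configured for the Contact point above.
ALERTA_WEBHOOK = "http://alerta-server:8080/api/webhooks/prometheus?api-key=demo-key"


def build_test_alert(alertname, instance, firing=True, now=None):
    """Return an Alertmanager-style payload holding a single alert."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "alerts": [
            {
                "status": "firing" if firing else "resolved",
                # Hypothetical labels; real alerts carry those of the rule.
                "labels": {
                    "alertname": alertname,
                    "instance": instance,
                    "severity": "warning",
                },
                "annotations": {"summary": "Test alert sent by hand"},
                "startsAt": stamp,
                # Alertmanager uses the zero time for alerts that have not ended.
                "endsAt": "0001-01-01T00:00:00Z" if firing else stamp,
                "generatorURL": "http://localhost:3001/",
            }
        ]
    }


def build_request(payload, url=ALERTA_WEBHOOK):
    """Wrap the payload in a JSON POST request."""
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return Request(url, data=body, headers=headers, method="POST")


request = build_request(build_test_alert("ManualTest", "station.example"))
print(request.get_method(), request.full_url)
```

The host name ``alerta-server`` only resolves inside the Docker network, so passing ``request`` to ``urllib.request.urlopen`` would have to happen from within one of the containers.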

Monitoring
"""""""""""""""""""""""""""""""""""""""""""

Grafana exposes its metrics to Prometheus, so Grafana can be used to monitor itself. The following monitoring points are especially useful:

* ``grafana_alerting_rule_evaluation_duration_seconds`` returns the time Grafana needs to evaluate its alerts,
* ``grafana_plugin_request_duration_milliseconds`` returns the time Grafana needs to query its data sources.
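
These metrics can also be inspected without going through Prometheus, since Grafana serves them as plain text on its ``/metrics`` end point. The sketch below parses such text and derives a mean duration, assuming the metric is exposed with ``_sum`` and ``_count`` series, as Prometheus histograms and summaries are. The sample text and its ``org`` label are illustrative, not captured from a running Grafana, and the parser is deliberately naive (no timestamps, no escaped label values).

```python
def parse_samples(text):
    """Sum sample values per metric name, ignoring labels and comments."""
    totals = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field on the line.
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0].strip()
        totals[name] = totals.get(name, 0.0) + float(value)
    return totals


def mean_duration(totals, metric):
    """Return sum/count for a metric, or None if nothing was observed."""
    count = totals.get(metric + "_count", 0.0)
    if not count:
        return None
    return totals[metric + "_sum"] / count


# Illustrative text only; label names and values are made up.
SAMPLE = """\
# Illustrative sample, not real output.
grafana_alerting_rule_evaluation_duration_seconds_sum{org="1"} 1.5
grafana_alerting_rule_evaluation_duration_seconds_count{org="1"} 30
grafana_alerting_rule_evaluation_duration_seconds_sum{org="2"} 0.5
grafana_alerting_rule_evaluation_duration_seconds_count{org="2"} 10
"""

totals = parse_samples(SAMPLE)
print(mean_duration(totals, "grafana_alerting_rule_evaluation_duration_seconds"))
```

For anything beyond a quick check, querying the same metrics through Prometheus is the more robust route, since it handles the exposition format in full.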

Alerta
-------------------------------------------

The Alerta stack manages alerts that come from Grafana, and allows an operator to track them using the `ISA 18.2 <http://www.tc.faa.gov/its/worldpac/Standards/isa/ISA_18.2[1].pdf>`_ alarm model, providing the following key `states <https://github.com/alerta/alerta/blob/master/alerta/models/alarms/isa_18_2.py>`_:

* ``NORM``: Condition is normal: alarm is not active, all past alarms were acknowledged,
* ``UNACK``: Alarm is active, and has not been acknowledged,
* ``RTNUN``: Alarm came and went, but has not been acknowledged,
* ``SHLVD``: Shelved: condition changes are ignored.
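
How an alarm moves between these states can be sketched as a small state machine. This is a simplified illustration and not Alerta's implementation: it adds the ISA 18.2 ``ACKED`` state (active and acknowledged), which the list above omits, it invents the event names, and it returns to ``NORM`` on unshelving, whereas a real system would re-evaluate the alarm condition.

```python
from enum import Enum


class State(Enum):
    NORM = "NORM"    # normal, nothing to acknowledge
    UNACK = "UNACK"  # active, not acknowledged
    ACKED = "ACKED"  # active, acknowledged (not in the list above)
    RTNUN = "RTNUN"  # returned to normal, not acknowledged
    SHLVD = "SHLVD"  # shelved, condition changes are ignored


# Hypothetical event names: "alarm", "clear", "ack", "shelve", "unshelve".
TRANSITIONS = {
    (State.NORM, "alarm"): State.UNACK,
    (State.UNACK, "ack"): State.ACKED,
    (State.UNACK, "clear"): State.RTNUN,
    (State.ACKED, "clear"): State.NORM,
    (State.RTNUN, "ack"): State.NORM,
    (State.RTNUN, "alarm"): State.UNACK,
}


def next_state(state, event):
    """Return the state that follows `state` after `event`."""
    if state is State.SHLVD:
        # Shelved alarms ignore everything except unshelving.
        # Simplification: unshelving always returns to NORM.
        return State.NORM if event == "unshelve" else State.SHLVD
    if event == "shelve":
        return State.SHLVD
    # Events that do not apply leave the state unchanged.
    return TRANSITIONS.get((state, event), state)


state = State.NORM
for event in ("alarm", "clear", "ack"):
    state = next_state(state, event)
    print(event, "->", state.value)
```

The loop walks an alarm that came and went before the operator acknowledged it, which passes through ``UNACK`` and ``RTNUN`` before returning to ``NORM``.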

The stack features the following services, as configured in ``alerta.yml``:

* ``alerta-server``, which receives and processes alerts from Grafana,
* ``alerta-db``, which is the storage backend for both the server and the front end,
* ``alerta-web``, which provides a front end for the user to interact with Alerta.

Server
```````````````````````````````````````````

The Alerta server receives the alerts from Grafana, and routes them through several plugins. Most notably:

* Our ``grafana-plugin`` processes Grafana-specific fields, such as dashboard and panel links,
* Our ``lofar-plugin`` processes LOFAR-specific properties, such as pulling metadata from the device attributes we use in Prometheus,
* The ``slack`` plugin posts messages on our Slack instance.

For info on externally developed plugins, see also the `alerta-contrib <https://github.com/alerta/alerta-contrib>`_ repo.

Slack plugin
"""""""""""""""""""""""""""""""""""""""""""

Messages on Slack are configured as follows:

* The message layout is defined in ``alerta-server/alertad.conf``,
* The access to Slack is provided in ``alerta-server/alerta-secrets.json``, which needs to be updated with the API key as given by Slack, to grant posting rights on the configured channel,
* Whether to post a Slack message is defined in ``alerta-server/lofar-routing-plugin/routing.py``. Only new problems are reported (that is, those that appear for the first time, or were ACKed and went away last time), to reduce spam.
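
The "only new problems" rule can be pictured as a decision on the previous and the new alarm state. The function below is a hypothetical illustration of that rule and does not reproduce the contents of ``routing.py``. It assumes that an alert seen for the first time has no previous state (``None``), and that an alert which was ACKed and went away has returned to ``NORM``.

```python
def is_new_problem(previous_status, new_status):
    """Decide whether an alert counts as a new problem worth posting.

    previous_status is None when the alert has not been seen before.
    Hypothetical helper, for illustration only.
    """
    # The alert must be (re)entering the active, unacknowledged state.
    becomes_active = new_status == "UNACK"
    # It is "new" if it was never seen, or was fully settled last time.
    was_quiet = previous_status in (None, "NORM")
    return becomes_active and was_quiet


for previous, new in [(None, "UNACK"), ("NORM", "UNACK"), ("RTNUN", "UNACK")]:
    print(previous, "->", new, ":", is_new_problem(previous, new))
```

Under these assumptions an alarm that flaps between ``RTNUN`` and ``UNACK`` without being acknowledged is posted only once, which is the spam reduction described above.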

Web UI
```````````````````````````````````````````

The Alerta user interface is exposed on http://localhost:8081, and stores its state in the server. It is therefore also configured through the server, in ``alerta-server/alertad.conf``. See also https://docs.alerta.io/configuration.html.

The credentials are hard-coded into ``alerta.yml``, until a connection with an authentication backend (e.g. LDAP or Keycloak) is made.

API
```````````````````````````````````````````

The API exposed by the server can be queried from Grafana, using any datasource capable of querying HTTP and parsing JSON. Use the http://alerta-server:8080/api/alerts REST end point. See also https://docs.alerta.io/api/reference.html.
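
The same end point can be queried from a script, which helps when debugging what a Grafana datasource would receive. The following is a minimal sketch, assuming the API key is passed in an ``Authorization: Key ...`` header and that the reply holds the alerts in an ``alerts`` list. The ``SAMPLE`` reply is made up and heavily trimmed.

```python
import json
from collections import Counter
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ALERTA_API = "http://alerta-server:8080/api"


def alerts_request(api_key, **filters):
    """Build a GET request for /alerts with optional query filters."""
    url = f"{ALERTA_API}/alerts"
    if filters:
        url += "?" + urlencode(filters)
    return Request(url, headers={"Authorization": f"Key {api_key}"})


def count_by_status(response):
    """Count the alerts in a reply per alarm state."""
    return Counter(alert["status"] for alert in response.get("alerts", []))


def fetch_alerts(api_key, **filters):
    """Query the live server (only reachable inside the Docker network)."""
    with urlopen(alerts_request(api_key, **filters), timeout=10) as reply:
        return json.load(reply)


# Made-up, trimmed reply; resource and event names are placeholders.
SAMPLE = {
    "status": "ok",
    "alerts": [
        {"resource": "station-a", "event": "ExampleAlert", "status": "UNACK"},
        {"resource": "station-b", "event": "ExampleAlert", "status": "UNACK"},
        {"resource": "station-c", "event": "ExampleAlert", "status": "RTNUN"},
    ],
}

print(count_by_status(SAMPLE))
```

With the stack running, ``count_by_status(fetch_alerts("demo-key"))`` would give a quick overview of how many alarms are waiting in each state.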