# Docker

## General Information

### Exporter Ports

| Exporter   | Port                             |
| ---------- | -------------------------------- |
| API        | nginx\_api\_proxy:8081/metrics   |
| Nginx      | nginx\_container:9113/metrics    |
| TFSS       | bio\_container:8501/metrics/tfss |
| Blackbox   | blackbox\_container:9115         |
| PostgreSQL | postgres\_container:9187/metrics |

A ready-made exporter configuration is available in [our repository](https://gitlab.com/oz-forensics/public/oz-engineering-on-prem-docker-monitoring). After deployment, replace the container names with the actual addresses. For example, if your Nginx container is at 192.168.8.8, change `- targets: ["nginx_container:9113"]` to `- targets: ["192.168.8.8:9113"]`.
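
If many targets need the same substitution, it can be scripted. The following is a minimal sketch, assuming targets are plain `host:port` strings; the function name `resolve_targets` and the name-to-address mapping are hypothetical and not part of the repository.

```python
def resolve_targets(targets, addresses):
    """Replace container-name hosts in "host:port" targets with real addresses.

    `addresses` maps a container name to the address it is reachable at.
    Hosts that are missing from the mapping are left unchanged.
    """
    resolved = []
    for target in targets:
        host, separator, port = target.rpartition(":")
        if not separator:  # no port given: the whole string is the host
            host, port = target, ""
        new_host = addresses.get(host, host)
        resolved.append(f"{new_host}:{port}" if port else new_host)
    return resolved
```

For example, `resolve_targets(["nginx_container:9113"], {"nginx_container": "192.168.8.8"})` returns `["192.168.8.8:9113"]`.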

### Healthcheck Paths

**API**

`GET http://api:8000/api/version`

`GET http://api:8000/api/healthcheck`

**BIO**

`GET http://bio:8501/v1/models/{{model_name}}`

`POST http://bio:8501/v1/models/dummy:predict`

```yml
method: POST
body: '{"inputs": {"images_bytes": [{"b64": "aaa"}]}}'
```
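
The healthcheck paths above can also be called from a script. This is a minimal sketch using only the Python standard library. The function names are hypothetical, the `Content-Type` header is an assumption that the source does not state, and treating any 2xx status as healthy is also an assumption; adjust the base URLs to your deployment.

```python
import json
import urllib.request


def build_healthcheck_request(base_url="http://api:8000"):
    """GET request for the API healthcheck path listed above."""
    return urllib.request.Request(f"{base_url}/api/healthcheck", method="GET")


def build_predict_request(base_url="http://bio:8501"):
    """POST request that sends the dummy payload to the BIO predict path."""
    payload = {"inputs": {"images_bytes": [{"b64": "aaa"}]}}
    return urllib.request.Request(
        f"{base_url}/v1/models/dummy:predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumption, not from the source
        method="POST",
    )


def responds_with_2xx(request, timeout=5.0):
    """Return True if the endpoint answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:  # covers URLError, HTTPError, and socket timeouts
        return False
```

Calling `responds_with_2xx(build_predict_request())` performs the actual network request; the two builder functions only construct it.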

***

## API

To enable API metrics, uncomment `9081:8081` in the [docker/templates/nginx.yml](https://gitlab.com/oz-forensics/docker-compose/api6/-/blob/main/docker/templates/nginx.yml) file:

```yml
ports:
  - "9080:8080"
  - "9081:8081"
```

### Prometheus Job

{% code expandable="true" %}

```yml
  - job_name: "oz-api"
    scheme: http
    metrics_path: /metrics
    file_sd_configs:
      - files:
          - sd/oz-api.yml
```

{% endcode %}

The `sd/oz-api.yml` file (replace the placeholders with your container names and host addresses):

{% code expandable="true" %}

```yml
- targets:
    - api_container1:8081
  labels:
    host: api1
    role: oz-api
    ozforensics: api
    oz_service: "true"
#- targets:
#    - api_container2:8081
#  labels:
#    host: api2
#    role: oz-api
#    ozforensics: api
#    oz_service: "true"
```

{% endcode %}

### Metrics Overview

| Metric                                                                                                            | Description                                                                                                      | Type      | Labels                                                               |
| ----------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------- | --------- | -------------------------------------------------------------------- |
| `oz_api_versions_total` (also available as `oz_api_versions_created`, `oz_api_info_created`, `oz_api_info_total`) | Displays API version info when you call `/api/version`                                                           | counter   | `lamb`                                                               |
|                                                                                                                   |                                                                                                                  |           | `core` – core version                                                |
|                                                                                                                   |                                                                                                                  |           | `oz_api` – API version                                               |
| `oz_api_analyses_total`                                                                                           | Number of processed analyses                                                                                     | counter   | `analysis_result` – analysis status (`FINISHED`, `FAILED`)           |
|                                                                                                                   |                                                                                                                  |           | `analysis_type` – analysis type (`QUALITY`, `BIOMETRY`, `DOCUMENTS`) |
|                                                                                                                   |                                                                                                                  |           | `company_id` – company identifier                                    |
|                                                                                                                   |                                                                                                                  |           | `resolution_status` – analysis resolution (`SUCCESS`, `DECLINED`)    |
| `oz_api_analyses_created`                                                                                         | Number of analyses that have been initiated                                                                      | gauge     | `analysis_result` – analysis status (`FINISHED`, `FAILED`)           |
|                                                                                                                   |                                                                                                                  |           | `analysis_type` – analysis type (`QUALITY`, `BIOMETRY`, `DOCUMENTS`) |
|                                                                                                                   |                                                                                                                  |           | `company_id` – company identifier                                    |
|                                                                                                                   |                                                                                                                  |           | `resolution_status` – analysis resolution (`SUCCESS`, `DECLINED`)    |
| `oz_api_analyses_in_progress`                                                                                     | Number of analyses currently processing                                                                          | gauge     | `analysis_type` – analysis type (`QUALITY`, `BIOMETRY`, `DOCUMENTS`) |
|                                                                                                                   |                                                                                                                  |           | `company_id` – company identifier                                    |
| `oz_api_analyse_duration_seconds_bucket`                                                                          | A histogram for analysis duration. Default buckets: `0.1,0.5,1,1.5,2,2.5,3,3.5,4,4.5,5,6,7,8,10,12,15,20,30,inf` | histogram | `analysis_result` – analysis status (`FINISHED`, `FAILED`)           |
|                                                                                                                   |                                                                                                                  |           | `analysis_type` – analysis type (`QUALITY`, `BIOMETRY`, `DOCUMENTS`) |
|                                                                                                                   |                                                                                                                  |           | `resolution_status` – analysis resolution (`SUCCESS`, `DECLINED`)    |
|                                                                                                                   |                                                                                                                  |           | `company_id` – company identifier                                    |
|                                                                                                                   |                                                                                                                  |           | `le` – bucket upper bound (inclusive)                                |
| `oz_api_analyse_duration_seconds_count`                                                                           | Number of analyses whose durations were counted                                                                  | counter   | `analysis_result` – analysis status (`FINISHED`, `FAILED`)           |
|                                                                                                                   |                                                                                                                  |           | `analysis_type` – analysis type (`QUALITY`, `BIOMETRY`, `DOCUMENTS`) |
|                                                                                                                   |                                                                                                                  |           | `resolution_status` – analysis resolution (`SUCCESS`, `DECLINED`)    |
|                                                                                                                   |                                                                                                                  |           | `company_id` – company identifier                                    |
| `oz_api_analyse_duration_seconds_sum`                                                                             | Sum of analyses' durations                                                                                       | counter   | `analysis_result` – analysis status (`FINISHED`, `FAILED`)           |
|                                                                                                                   |                                                                                                                  |           | `analysis_type` – analysis type (`QUALITY`, `BIOMETRY`, `DOCUMENTS`) |
|                                                                                                                   |                                                                                                                  |           | `resolution_status` – analysis resolution (`SUCCESS`, `DECLINED`)    |
|                                                                                                                   |                                                                                                                  |           | `company_id` – company identifier                                    |
| `oz_api_analyse_duration_seconds_created`                                                                         | Timestamp of beginning of duration metric counting                                                               |           |                                                                      |
| `django_http_requests_total_by_transport_total`                                                                   | Number of requests by transport protocol                                                                         | counter   | `transport` – protocol (HTTP, HTTPS)                                 |
|                                                                                                                   |                                                                                                                  |           | `uid` – user identifier                                              |
| `django_http_responses_before_middlewares_total`                                                                  | Number of Django responses before running middleware                                                             | counter   | `uid` – user identifier                                              |
| `django_http_exceptions_total_by_type_total`                                                                      | Number of Django exceptions by type                                                                              | counter   | `exception_type` – type of exception                                 |
|                                                                                                                   |                                                                                                                  |           | `uid` – user identifier                                              |
| `django_http_requests_latency_seconds_by_view_method_bucket`                                                      | Histogram of request processing latency by views                                                                 | histogram | `method` – HTTP method                                               |
|                                                                                                                   |                                                                                                                  |           | `uid` – user identifier                                              |
|                                                                                                                   |                                                                                                                  |           | `view_name` – view name                                              |
|                                                                                                                   |                                                                                                                  |           | `le` – bucket upper bound (inclusive)                                |
| `django_http_requests_latency_seconds_by_view_method_count`                                                       | Number of request processing durations by views                                                                  | counter   | `method` – HTTP method                                               |
|                                                                                                                   |                                                                                                                  |           | `uid` – user identifier                                              |
|                                                                                                                   |                                                                                                                  |           | `view_name` – view name                                              |
| `django_http_requests_latency_seconds_by_view_method_sum`                                                         | Sum of request processing durations by views                                                                     | counter   | `method` – HTTP method                                               |
|                                                                                                                   |                                                                                                                  |           | `uid` – user identifier                                              |
|                                                                                                                   |                                                                                                                  |           | `view_name` – view name                                              |

Illustrative screenshots:

{% tabs %}
{% tab title="Analyses rate per type" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2FAN6rEj92vAmHfBtBOxuP%2F%5BAPI%5D%20Analyses%20rate%20per%20type.png?alt=media&#x26;token=0632da81-952f-4da0-bd5a-baa6c89be51e" alt=""><figcaption></figcaption></figure>
{% endtab %}

{% tab title="Analyses rate per resolution" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2Fw1bQXL7ySnVfXEWw7Rxe%2F%5BAPI%5D%20Analyses%20rate%20per%20resolution.png?alt=media&#x26;token=ce2c6b73-4a4f-4e31-871d-ace69fa7198d" alt=""><figcaption></figcaption></figure>
{% endtab %}

{% tab title="Analyses rate per result" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2FBYPgqLhQ9a1GXwgGHcnc%2F%5BAPI%5D%20Analyses%20rate%20per%20result.png?alt=media&#x26;token=0ba15213-6a2b-4b78-9b1d-6f68e9f48332" alt=""><figcaption></figcaption></figure>
{% endtab %}

{% tab title="Analyses duration" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2F7KgkjsbonMykczhCyBJI%2F%5BAPI%5D%20Analyses%20duration.png?alt=media&#x26;token=9d071d88-0b76-471b-94be-3ac11dc4fb78" alt=""><figcaption><p>Analysis duration shown as the average per analysis type and as a quantile per analysis type. Gaps in the graph appear because the metric counter starts separately for each analysis</p></figcaption></figure>
{% endtab %}

{% tab title="Django exceptions rate" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2F4UCDWBRbwkeNzprATx6y%2F%5BAPI%5D%20Django%20exceptions%20rate.png?alt=media&#x26;token=4f78b919-1e71-4e47-81d4-4ba307171db4" alt=""><figcaption></figcaption></figure>
{% endtab %}
{% endtabs %}

### Grafana Alerts

```yml
groups:
  - name: API alerts
    rules:
      - alert: API is down
        expr: probe_success{job="blackbox-api-healthcheck"} != 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Oz API has been down or showing errors for more than 30 seconds!"
          description: "Critical: Oz API service is down on the host {{ $labels.instance }}!"

      - alert: Absent API metrics!
        expr: absent(oz_api_versions_total)
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Absent API metrics!!!"
          description: "Critical: check the API containers, the API might not be working properly!"

      - alert: High total API analysis requests
        expr: sum (rate(oz_api_analyses_total{}[1m])) > 1
        labels:
          severity: warning
        annotations:
          summary: "Too many API analysis requests per second!"
          description: "API analysis request rate is {{ $value }} rps."

      - alert: High failed API analysis rate
        expr: sum by (analysis_result)(rate(oz_api_analyses_total{analysis_result="FAILED"}[1m])) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Failed API analyses detected"
          description: "API analysis failure rate is {{ $value }} rps in the last minute."

      - alert: High API analysis duration
        expr: histogram_quantile(0.95,sum by (le, analysis_type) (rate(oz_api_analyse_duration_seconds_bucket{}[1m]))) > 7
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.analysis_type }} analyses are going slow!"
          description: "95th-percentile duration of {{ $labels.analysis_type }} analyses has stayed at {{ $value }} seconds, above the 7-second threshold, for 10 minutes."

      - alert: API exception errors
        expr: sum by (exception_type)(rate(django_http_exceptions_total_by_type_total{exception_type!~"AuthCredentialsExpired|AuthCredentialsInvalid|AuthCredentialsIsNotProvided|AuthForbidden|InvalidBodyStructureError|OSError|InvalidParamValueError|NotExistError"}[1m])) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.exception_type }} in API exception errors"
          description: "{{ $value }} rps in API returned {{ $labels.exception_type }} in the last 5 minutes."
```
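
To see what the `High API analysis duration` rule evaluates, the sketch below reimplements the linear interpolation that `histogram_quantile` applies to classic histogram buckets. It is a simplified illustration rather than Prometheus code: it expects already aggregated cumulative counts sorted by `le`, it does not handle malformed input, and the sample counts in the note that follows are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, where the last pair uses math.inf as its bound.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in the range (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, no quantile
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile lies in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound
```

With cumulative counts of 80 at `le="5"`, 90 at `le="6"`, 94 at `le="7"`, 98 at `le="8"`, and 100 at `le="+Inf"`, the 0.95 quantile falls into the (7, 8] bucket and interpolates to 7.25 seconds, which is above the rule's threshold of 7.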

***

## TFSS

To enable metrics, set `TFSS_PROMETHEUS="true"` in [/configs/env/tfss.env](https://gitlab.com/oz-forensics/docker-compose/api6/-/blob/main/configs/env/tfss.env).

To change the default metrics path, uncomment `TFSS_METRICS_URI="/metrics/tfss"` in [/configs/env/tfss.env](https://gitlab.com/oz-forensics/docker-compose/api6/-/blob/main/configs/env/tfss.env) and define the path.

### Prometheus Job

Replace the placeholder with your container name.

{% code expandable="true" %}

```yml
  - job_name: "tfss"
    metrics_path: /metrics/tfss
    scheme: http
    static_configs:
      - targets: ["bio_container:8501"]
        labels:
          bio: ozforensics
          ozforensics: "true"
```

{% endcode %}

### Metrics Overview

| Metric                                          | Description                                           |
| ----------------------------------------------- | ----------------------------------------------------- |
| `:tensorflow:serving:request_count`             | Total number of requests to TFSS                      |
| `:tensorflow:serving:request_latency_bucket`    | Histogram of request processing time                  |
| `:tensorflow:serving:request_latency_sum`       | Sum of request processing durations                   |
| `:tensorflow:serving:request_latency_count`     | Total number of requests whose durations were counted |
| `:tensorflow:cc:saved_model:load_attempt_count` | Loaded models                                         |

Illustrative screenshots:

{% tabs %}
{% tab title="Model request rate" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2Fgn7ySbPxN7NPgvW3h8wx%2F%5BTFSS%5D%20Model%20request%20rate.png?alt=media&#x26;token=0c0a34c6-fb9e-4c5b-817d-68375833c739" alt=""><figcaption><p>Total number of requests to TFSS per second and per minute, divided by models</p></figcaption></figure>
{% endtab %}

{% tab title="Model latency ($quantile-quantile)" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2FAPbU9bkDZwyXpbdzGX1v%2F%5BTFSS%5D%20Model%20latency%20(%24quantile-quantile).png?alt=media&#x26;token=d919a8a7-d87f-43c3-ac27-7a0df8656b57" alt=""><figcaption><p>0.95 (or the one you've defined in $quantile) quantile of request processing duration in TFSS</p></figcaption></figure>
{% endtab %}

{% tab title="\[TFSS] Model latency (AVG)" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2F5005ceJCuJuoC9UOwELK%2F%5BTFSS%5D%20Model%20latency%20(AVG).png?alt=media&#x26;token=fecbaef4-cf05-44f1-aea4-a53cc89fdcd9" alt=""><figcaption><p>Average analysis processing time</p></figcaption></figure>
{% endtab %}

{% tab title="\[TFSS] HTTP probe status" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2FmWwGyBdYJa5Q43irFYUN%2F%5BTFSS%5D%20HTTP%20probe%20status.png?alt=media&#x26;token=97bb1127-b6e4-4079-a083-3fe9c764a579" alt=""><figcaption><p>The built-in Blackbox exporter sends requests to check the states of the models. <code>probe_success</code> should be 1; if it is not, TFSS is not working properly</p></figcaption></figure>
{% endtab %}
{% endtabs %}

TFSS alerts are covered in the [Blackbox](#blackbox) section below.

***

## Nginx

### Prometheus Job

Replace the placeholder with your container name.

```yml
  - job_name: "nginx"
    static_configs:
      - targets: ["nginx_container:9113"]
        labels:
          nginx: ozforensics
          ozforensics: "true"
```

### Metrics Overview

| Metric                       | Description                                                            |
| ---------------------------- | ---------------------------------------------------------------------- |
| `nginx_up`                   | Shows whether Nginx is running: 1 – service is up, 0 – service is down |
| `nginx_connections_accepted` | Number of connections accepted by Nginx                                |
| `nginx_connections_handled`  | Number of connections handled by Nginx                                 |
| `nginx_connections_active`   | Active Nginx connections                                               |

Illustrative screenshots:

{% tabs %}
{% tab title="Request rate" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2FdhMoRHQH5uRFyLU7lVV3%2F%5BNginx%5D%20Request%20rate.png?alt=media&#x26;token=30039c79-8245-4a81-b228-ce407cb5cfba" alt=""><figcaption><p>Total number of requests to Nginx</p></figcaption></figure>
{% endtab %}

{% tab title="Active connections" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2FppKgl1Z0DvzGkONAExEs%2F%5BNginx%5D%20Active%20connections.png?alt=media&#x26;token=c0e32f68-0b25-4b33-9ae3-cdbe079890f0" alt=""><figcaption><p>State of Nginx connections. Check that there are no stale connections</p></figcaption></figure>
{% endtab %}

{% tab title="Processed connections rate" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2F286WYPHQikNtBg1aeMu6%2F%5BNginx%5D%20Processed%20connections%20rate.png?alt=media&#x26;token=bd92645f-7e5c-4a68-bfc2-c1e542d965c9" alt=""><figcaption><p>Processing success rate. The numbers of accepted and handled connections should be equal</p></figcaption></figure>
{% endtab %}
{% endtabs %}

### Grafana Alerts

```yml
groups:
  - name: NGINX alerts
    rules:
      - alert: Nginx is down
        expr: nginx_up != 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Nginx has been down for more than 30 seconds!"
          description: "Critical: Nginx service is down on the host {{ $labels.instance }}!"

      - alert: Nginx not all connections are handled
        expr: rate (nginx_connections_handled[5m]) / rate (nginx_connections_accepted[5m]) <1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Nginx has issues with handling connections"
          description: "Critical: not all accepted connections have been handled by Nginx on the host {{ $labels.instance }} for more than 2 minutes!"

      - alert: "High number of active connections in Nginx"
        expr: nginx_connections_active > 3000
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "High number of active connections in Nginx"
          description: "Critical: Nginx has had too many active connections on the host {{ $labels.instance }} for more than 3 minutes!"
```

***

## Blackbox

Blackbox probes verify that API and TFSS are responding correctly. You need to configure the modules, SD target files, and Prometheus jobs as shown below.

### Configuration

{% code expandable="true" %}

```yml
modules:
  tfss_models_probe:
    prober: http
    timeout: 5s
    http:
      valid_status_codes: []  # Defaults to 2xx
      method: GET
      fail_if_ssl: false
      fail_if_not_ssl: false
      fail_if_body_not_matches_regexp:
        - "\"error_code\": \"OK\""
        - "\"state\": \"AVAILABLE\""
      tls_config:
        insecure_skip_verify: false
      preferred_ip_protocol: "ip4"

  tfss_predict_probe:
    prober: http
    timeout: 5s
    http:
      valid_status_codes: []  # Defaults to 2xx
      method: POST
      body: '{"inputs": {"images_bytes": [{"b64": "aaa"}]}}'
      follow_redirects: true
      fail_if_ssl: false
      fail_if_not_ssl: false
      fail_if_body_not_matches_regexp:
        - '\"outputs\": 1\.0'
      tls_config:
        insecure_skip_verify: false
      preferred_ip_protocol: "ip4"

  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_status_codes: []
      method: GET
      preferred_ip_protocol: "ip4"
      tls_config:
        insecure_skip_verify: false

  api_healthcheck:
    prober: http
    timeout: 5s
    http:
      method: GET
      valid_status_codes: []
      fail_if_body_not_matches_regexp:
        - '{"generally_healthy": true,(| "celery": (\{\}|\{(?:\s*"[^"]+": "ok",)* "[^"]+": "ok"\s*\}),) "db": "ok"(|, "jwt": {"status": "ok"})}'
```

{% endcode %}
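
Before deploying the `api_healthcheck` module, it can help to try its body pattern against sample responses. The sketch below copies the pattern from the configuration above and runs it with Python's `re` module. Blackbox itself uses Go regular expressions, so this assumes the pattern behaves the same way in both engines for these inputs. The sample bodies in the note that follows are hypothetical, derived from the pattern rather than captured from a real API.

```python
import re

# Pattern copied verbatim from the api_healthcheck module above.
API_HEALTHCHECK_PATTERN = re.compile(
    r'{"generally_healthy": true,(| "celery": (\{\}|\{(?:\s*"[^"]+": "ok",)* "[^"]+": "ok"\s*\}),) "db": "ok"(|, "jwt": {"status": "ok"})}'
)


def body_is_healthy(body):
    """Return True if the response body contains a match for the pattern.

    Uses an unanchored search, because the probe only requires that the
    pattern matches somewhere in the body.
    """
    return API_HEALTHCHECK_PATTERN.search(body) is not None
```

Under these assumptions, `body_is_healthy('{"generally_healthy": true, "db": "ok"}')` is `True`, and a body with `"generally_healthy": false` or `"db": "error"` is rejected, which makes the probe report `probe_success` as 0.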

<details>

<summary><strong>SD Files Configuration for Blackbox</strong> (replace the placeholders with your container names)</summary>

`blackbox_tfss_models.yml`:

{% code expandable="true" %}

```yml
[
  {
    "targets": ["http://bio_container:8501/v1/models/inquisitor_c"],
    "labels":
      {
        "host": "bio:8501",
        "role": "oz-bio",
        "ssl_alert": "false",
        "http_code": "200",
        "model": "inquisitor_c",
        "ozforensics": "true",
      },
  },
]
```

{% endcode %}

`blackbox_tfss_probe.yml`:

{% code expandable="true" %}

```yml
[
  {
    "targets": ["http://bio_container:8501/v1/models/dummy:predict"],
    "labels":
      {
        "host": "bio:8501",
        "role": "oz-bio-balancer",
        "ssl_alert": "false",
        "http_code": "200",
        "model": "dummy",
        "ozforensics": "true",
      },
  },
]
```

{% endcode %}

`blackbox-exporter-ozforensics.yml`:

{% code expandable="true" %}

```yml
[
  {
    "targets": ["http://api_container:8000/api/version"],
    "labels":
      {
        "host": "blackbox_container:9115",
        "role": "API",
        "ssl_alert": "false",
        "http_code": "200",
        "ozforensics": "true",
      },
  },
]
```

{% endcode %}

`blackbox-api-healthcheck.yml`:

{% code expandable="true" %}

```yml
[
  {
    "targets": ["http://api_container:8000/api/healthcheck"],
    "labels":
      {
        "job": "blackbox-api-healthcheck",
        "host": "api:8000/api/healthcheck",
        "role": "oz-api",
        "ssl_alert": "false",
        "http_code": "200",
        "ozforensics": "true",
      },
  },
]
```

{% endcode %}

</details>
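
If you probe several TFSS models, the entries for `blackbox_tfss_models.yml` can be generated instead of written by hand. This is a minimal sketch under stated assumptions: the function names are hypothetical, the `host` label is set to the same address as the target, and the output is JSON, which YAML parsers generally accept as flow syntax.

```python
import json


def tfss_model_targets(models, bio_address="bio_container:8501"):
    """Build one SD entry per TFSS model, mirroring blackbox_tfss_models.yml."""
    return [
        {
            "targets": [f"http://{bio_address}/v1/models/{model}"],
            "labels": {
                "host": bio_address,  # assumption: same address as the target
                "role": "oz-bio",
                "ssl_alert": "false",
                "http_code": "200",
                "model": model,
                "ozforensics": "true",
            },
        }
        for model in models
    ]


def render_sd_file(entries):
    """Serialize the entries as indented JSON text."""
    return json.dumps(entries, indent=2)
```

For example, `render_sd_file(tfss_model_targets(["inquisitor_c"]))` produces a single entry that targets `http://bio_container:8501/v1/models/inquisitor_c`.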

### Prometheus Job

Replace the placeholders with your container names and host addresses.

{% code expandable="true" %}

```yml
  - job_name: "BlackBox_tfss_models"
    scrape_interval: 30s
    metrics_path: /probe
    scheme: http
    params:
      module: [tfss_models_probe]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox_container:9115
    file_sd_configs:
      - files:
          - sd/blackbox_tfss_models.yml

  - job_name: "BlackBox_tfss_probe"
    scrape_interval: 30s
    metrics_path: /probe
    scheme: http
    params:
      module: [tfss_predict_probe]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox_container:9115
    file_sd_configs:
      - files:
          - sd/blackbox_tfss_probe.yml

  - job_name: "BlackBox_exporter_ozforensics"
    scrape_interval: 60s
    metrics_path: /probe
    params:
      module: [http_2xx]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox_container:9115
    file_sd_configs:
      - files:
          - sd/blackbox-exporter-ozforensics.yml

  - job_name: "BlackBox_api_healthcheck_ozforensics"
    scrape_interval: 60s
    metrics_path: /probe
    scheme: http
    params:
      module: [api_healthcheck]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox_container:9115
    file_sd_configs:
      - files:
          - sd/blackbox-api-healthcheck.yml
```

{% endcode %}
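
The three `relabel_configs` rules in each job turn a target from the `sd/` file into a request to the Blackbox exporter. The sketch below mimics that transformation for a single target to show which URL ends up being scraped. It is an illustration only: the `probe_url` helper is hypothetical and not part of Prometheus, and the container names are the placeholders used in this guide.

```python
from urllib.parse import urlencode


def probe_url(target: str, module: str, exporter: str = "blackbox_container:9115") -> str:
    """Mimic the three relabel rules of the Blackbox jobs for one target (hypothetical helper)."""
    labels = {"__address__": target}
    # Rule 1: copy the original target into the `target` URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Rule 2: keep the probed URL as the `instance` label shown in alerts.
    labels["instance"] = labels["__param_target"]
    # Rule 3: send the scrape itself to the Blackbox exporter.
    labels["__address__"] = exporter
    query = urlencode({"module": module, "target": labels["__param_target"]})
    return f"http://{labels['__address__']}/probe?{query}"


print(probe_url("http://api_container:8000/api/healthcheck", "api_healthcheck"))
```

Opening the printed URL in a browser or with an HTTP client should return the probe metrics for that target, which is a quick way to check a module before Prometheus starts scraping it.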

### Metrics Overview

| Metric          | Description                                                                                                             |
| --------------- | ----------------------------------------------------------------------------------------------------------------------- |
| `probe_success` | Shows whether the probe succeeded: `1` if Blackbox received a response that passes the module's checks (for example, the required regular expression), `0` otherwise |

### Grafana Alerts

For API:

```bash
groups:
  - name: API alerts
    rules:
      - alert: API is down
        expr: probe_success{job="blackbox-api-healthcheck"} != 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "OZ API has been down or showing errors for more than 30 seconds!"
          description: "Critical: OZ API service is down on the host {{ $labels.instance }}!"
```

For TFSS:

```bash
groups:
  - name: tfss alerts
    rules:
      - alert: model inquisitor_c error
        expr: probe_success{model="inquisitor_c"} != 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "The inquisitor_c model returns error!"
          description: "Critical: the TFSS service model returns error on the host {{ $labels.instance }}!"
      - alert: model dummy error
        expr: probe_success{model="dummy"} != 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "The dummy model returns error!"
          description: "Critical: the TFSS service model returns error on the host {{ $labels.instance }}!"
```
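
The rules above use `for: 30s`, which means the alert fires only after the expression has kept returning a result for 30 seconds; until then it stays pending. The following sketch simulates that behavior. It assumes a 15-second evaluation interval, which is an example value and not taken from this guide, and `alert_states` is a hypothetical helper.

```python
def alert_states(samples, hold_seconds):
    """Return the alert state after each rule evaluation.

    samples: list of (timestamp_seconds, condition_is_true) pairs, one per evaluation.
    """
    states = []
    active_since = None  # time of the first evaluation in the current failing streak
    for ts, failing in samples:
        if not failing:
            # The expression returned nothing, so the alert resets.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        states.append("firing" if ts - active_since >= hold_seconds else "pending")
    return states


# Evaluations every 15 s (assumed interval); the probe fails from t=15 to t=45.
samples = [(0, False), (15, True), (30, True), (45, True), (60, False)]
print(alert_states(samples, hold_seconds=30))
# → ['inactive', 'pending', 'pending', 'firing', 'inactive']
```

A single successful probe resets the timer, so a shorter `for` value reacts faster but is more sensitive to brief failures.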

***

## PostgreSQL

### Prometheus Job

Replace the placeholder with your container name.

{% code expandable="true" %}

```bash
  - job_name: "postgres"
    static_configs:
      - targets: ["postgres_container:9187"]
        labels:
          postgres: ozforensics
          ozforensics: "true"
```

{% endcode %}

### Grafana Alerts

```bash
groups:
  - name: POSTGRESQL alerts
    rules:
      - alert: Postgresql is down
        expr: pg_up != 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL has been down for more than 30 seconds!"
          description: "Critical: Postgres service is down on the host {{ $labels.instance }}!"
      - alert: Postgres exporter errors
        expr: pg_exporter_last_scrape_error == 1
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Postgres Exporter is down or showing errors"
          description: "Critical: postgres-exporter is not running or showing errors on the host {{ $labels.instance }}"
```

***

## Other Alerts

The `up` metric shows whether the last scrape of a target succeeded: `1` if yes, `0` if no.

```bash
groups:
  - name: exporter alerts
    rules:
      - alert: Scrape error or instance is down
        expr: up != 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} {{ $labels.host }} scrape error or instance is down!"
          description: "Critical: last scrape on {{ $labels.instance }} finished with error!"
```

***

## Optional Prometheus Jobs

### Node-exporter

```bash
  - job_name: "node_exporter"
    metrics_path: /node_exporter/metrics
    scheme: http
    file_sd_configs:
      - files:
        - sd/node-exporter.yml
```

The `node-exporter.yml` file (replace the placeholder with your host address):

```yml
[
  {
    "targets": ["node:9113"],
    "labels":
      {
        "job": "node_exporter",
        "host": "api",
        "role": "oz-api",
        "ozforensics": "true",
      },
  },
]
```
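
If you monitor several hosts, the target file can be generated instead of edited by hand. The sketch below is one possible approach, not part of the provided configuration: the `target_groups` helper, the IP addresses, and the choice to reuse the address as the `host` label are assumptions, and the port is copied from the example above. The output is JSON, which Prometheus file-based service discovery accepts alongside YAML, so you can save it as a `.json` file and list that file under `files:`.

```python
import json


def target_groups(hosts, port, job, role):
    """Build one file_sd target group per host (hypothetical helper)."""
    return [
        {
            "targets": [f"{host}:{port}"],
            # Label values must be strings, so "true" is quoted on purpose.
            "labels": {"job": job, "host": host, "role": role, "ozforensics": "true"},
        }
        for host in hosts
    ]


# Example addresses; replace them with your own hosts.
groups = target_groups(["192.168.8.8", "192.168.8.9"], port=9113, job="node_exporter", role="oz-api")
print(json.dumps(groups, indent=2))
```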

### cAdvisor

```bash
  - job_name: cadvisor
    metrics_path: /metrics
    file_sd_configs:
      - files:
        - sd/cadvisor.yml
```

The `cadvisor.yml` file (replace the placeholder with your host address):

```yml
[
  {
    "targets": ["cadvisor_container:8081"],
    "labels":
      {
        "job": "cadvisor",
        "host": "api",
        "role": "oz-api",
        "ozforensics": "true",
      },
  },
]
```

***

## Telegram Alerting

Make sure you replace all placeholders in the `alertmanager.yml` file below.

{% file src="<https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2FIJcc6TJThs7PS8GzGRMN%2Falertmanager.yml?alt=media&token=6d758ae5-c69f-4cba-9c5b-af027c9aa874>" %}

Then, in `./alertmanager`, create the `templates/` directory and add the template file to it.

{% file src="<https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2FzY0Eq1aDS9nf6mP8TOmS%2Foz-templates.tmpl?alt=media&token=aa2b9c33-508e-4834-a2f7-f17104f9e9e6>" %}

***

API 6 no longer requires Celery or Redis, so the corresponding metrics are not covered in this guide.

You can customize the alerts according to your needs. Please proceed to [our repository](https://gitlab.com/oz-forensics/public/oz-engineering-monitoring-configs/-/tree/helm-0.10.20/alerts?ref_type=heads) to find the alert files.
