# API v5 and Below

Below you'll find descriptions of the component metrics and Grafana tables, along with pre-configured alerts.

## Celery

### Metrics Overview

| Metric                                                                                     | Description                                                                    |
| ------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------ |
| `flower_events_total`                                                                      | Total number of tasks                                                          |
| `flower_task_runtime_seconds_sum`                                                          | Sum of task completion durations                                               |
| `histogram_quantile($quantile, sum(rate(flower_task_runtime_seconds_bucket[1m])) by (le))` | Quantile for task execution based on the [quantile variable](#variables) value |
| `flower_events_total{type="task-failed"}`                                                  | With the `type="task-failed"` label, displays the number of failed tasks       |
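For a quick health check, the failed-task rate can be queried directly from the metric above. A minimal sketch (the 1-minute window is an illustrative assumption):

```promql
# Per-second rate of failed Celery tasks; under normal operation this should be 0.
sum(rate(flower_events_total{type="task-failed"}[1m]))
```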

### Grafana Tables

| Table                          | Description                                        | What to monitor                                                                                                                                          |
| ------------------------------ | -------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Liveness task rate             | Rate of Celery task events for Liveness            | <ul><li><code>task-received</code> should be roughly equal to <code>task-succeeded</code>,</li><li>there shouldn't be any <code>task-failed</code> events</li></ul> |
| Liveness task duration         | Quantile for task execution                        | 0.95 quantile should be 8 seconds or less                                                                                                                |
| Succeeded vs failed tasks rate | Rate of succeeded vs failed Celery tasks           | <ul><li><code>task-received</code> should be roughly equal to <code>task-succeeded</code>,</li><li>there shouldn't be any <code>task-failed</code> events</li></ul>  |
| All tasks duration (AVG)       | Average time of all Celery requests for all models | The durations should be 6 seconds or less                                                                                                                |
| Queue size                     | Message queue in Redis                             | Queue size and the number of unacked messages                                                                                                            |
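The `Succeeded vs failed tasks rate` panel can be sketched as a per-type event rate over `flower_events_total` (the 1-minute window is an illustrative assumption):

```promql
# Per-second rate of Celery task events, split by event type.
# task-received should track task-succeeded; task-failed should stay at 0.
sum by (type) (rate(flower_events_total{type=~"task-received|task-succeeded|task-failed"}[1m]))
```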

Illustrative screenshots:

{% tabs %}
{% tab title="Liveness task rate" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2FOIBuiEj76i603KelWF86%2F3-1%20cel.png?alt=media&#x26;token=b48c5421-b413-460e-a727-12d5ccfcb71f" alt=""><figcaption><p><code>task-received</code> ≈ <code>task-succeeded</code>, <code>task-failed</code> = 0</p></figcaption></figure>
{% endtab %}

{% tab title="Liveness task duration" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2F3uvnX3eyeSNdScgy85CJ%2F3-2%20cel.png?alt=media&#x26;token=822b134e-7b8c-4d82-9024-dcd65d31789e" alt=""><figcaption><p>0.95 quantile ≤ 8 seconds</p></figcaption></figure>
{% endtab %}

{% tab title="Succeeded vs failed tasks rate" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2FuqWSFKV3okYnnHUqxYEh%2F3-3%20cel.png?alt=media&#x26;token=f694a973-dfdc-4f34-b213-ccaa9c36700b" alt=""><figcaption><p><code>task-received</code> ≈ <code>task-succeeded</code>, <code>task-failed</code> = 0</p></figcaption></figure>
{% endtab %}

{% tab title="All tasks duration (AVG)" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2Ff4BRAnBajcs4ttIQXtwA%2F3-4%20cel.png?alt=media&#x26;token=73052bbf-7cdc-41ee-8ab3-6b500c1aa998" alt=""><figcaption><p>The durations ≤ 6 seconds</p></figcaption></figure>
{% endtab %}

{% tab title="Queue size" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2FkKNO5BhrenIh0BqFUeNm%2F3-5%20cel.png?alt=media&#x26;token=735e9c01-1293-4f34-b5ec-fdf8bd607bc5" alt=""><figcaption><p>Monitor the queue size and the number of unacked messages</p></figcaption></figure>
{% endtab %}
{% endtabs %}

### Grafana Alerts

```yaml
- name: Celery alerts
  rules:
  
  # An alert for 0.9 quantile of task duration, warns that analyses are running slowly
    - alert: Quality Analysis is slow!
      expr: histogram_quantile(0.9, sum without (handler) (rate(flower_task_runtime_seconds_bucket{task="oz_core.tasks.tfss.process_analyse_quality"}[5m]))) > 5
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Too long duration of celery worker for TFSS task"
        description: "The duration of 90% Quality analyses is longer than {{ $value }} seconds for last 10 minutes. NAMESPACE: {{ $labels.namespace }} POD: {{ $labels.pod }} TASK: {{ $labels.task }} WORKER: {{ $labels.worker }}"
  
  # An alert for failed tasks; if the number is growing, something is wrong
    - alert: Celery tasks failed!
      expr: sum by(type,task) (rate(flower_events_total{type="task-failed", task!=""}[1m])) > 0
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Celery tasks {{ $labels.task }} failed"
        description: 'Failed celery tasks rate: {{ printf "%.2f" $value }} rps'
  
  # A critical alert indicating that no tasks are succeeding while tasks are still being received; the system has stopped processing requests
    - alert: Celery zero success tasks!
      expr: sum by(type,task) (rate(flower_events_total{type="task-succeeded", task!=""}[1m])) == 0 and on (task) sum by(type,task) (rate(flower_events_total{type="task-received", task!=""}[1m])) > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Celery has zero success tasks: {{ $labels.task }}!"
        description: "Critical! Check if bio is alive!"

```

## Redis

### Metrics Overview

| Metric                                  | Description                                                                                |
| --------------------------------------- | ------------------------------------------------------------------------------------------ |
| `redis_up`                              | <p><code>1</code> means Redis is working, </p><p><code>0</code> – service is down</p>      |
| `redis_commands_total`                  | Total number of commands in Redis                                                          |
| `redis_commands_duration_seconds_total` | Redis process duration                                                                     |
| `redis_key_size`                        | Redis (as a message broker) queue size                                                     |
| `redis_key_size{key="unacked"}`         | With the `key="unacked"` label, displays the number of messages delivered to workers but not yet acknowledged |
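The average command duration tracked in the `Commands duration` table can be derived from the two counters above. A minimal sketch (the 1-minute window is an illustrative assumption):

```promql
# Average Redis command duration: total time spent / number of commands executed.
# This average should stay under 15µs.
sum(rate(redis_commands_duration_seconds_total[1m])) / sum(rate(redis_commands_total[1m]))
```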

### Grafana Tables

| Table             | Description                                    | What to monitor                   |
| ----------------- | ---------------------------------------------- | --------------------------------- |
| Command rate      | Number of requests per second                  | Nothing, just to stay informed    |
| Commands duration | Average and maximum command execution duration | <p>AVG < 15µs</p><p>MAX < 1ms</p> |
| Connected clients | Number of connected clients                    | Shouldn't be 0                    |

Illustrative screenshots:

{% tabs %}
{% tab title="Command rate" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2F0Ym0Do1KehaoSc9Nekwi%2F3-6%20red.png?alt=media&#x26;token=a77d12b4-42ce-44ee-9ca0-994ccabd9ce8" alt=""><figcaption></figcaption></figure>
{% endtab %}

{% tab title="Commands duration" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2F2Z2AcPfs8Btue0SKOxc3%2F3-7%20red.png?alt=media&#x26;token=c22c8462-5910-4136-ac6b-b0a72dc1cc52" alt=""><figcaption><p>AVG &#x3C; 15µs, MAX &#x3C; 1ms</p></figcaption></figure>
{% endtab %}

{% tab title="Connected clients" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2FS0MwMK4uxXUaTKzP9TZd%2F3-8%20red.png?alt=media&#x26;token=67f06a71-e626-4b1d-bb2f-7bedb5febe93" alt=""><figcaption><p>Shouldn't be 0</p></figcaption></figure>
{% endtab %}
{% endtabs %}

### Grafana Alerts

```yaml
- name: Redis alerts
  rules:
  # A critical alert that warns about Redis being down
    - alert: Redis is down
      expr: redis_up != 1
      for: 30s
      labels:
        severity: critical
      annotations:
        summary: "Redis is down for more than 30 seconds!"
        description: "Critical: REDIS service is down in namespace: {{ $labels.namespace }}\nPod: {{ $labels.pod }}!"
  
  # Displays if Redis rejects connections
    - alert: Redis rejected connections
      expr: rate(redis_rejected_connections_total[1m]) > 0
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Redis rejects connections for more than 1 minute in namespace: {{ $labels.namespace }}!"
        description: "Some connections to Redis have been rejected!\nPod: {{ $labels.pod }}\nValue = {{ $value }}"

  # Fires when Redis commands are being executed too slowly
    - alert: Redis command duration is slow!
      expr: max by(namespace) (rate(redis_commands_duration_seconds_total[1m])) > 0.0004
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Redis max command duration is too high for more than 1 minute in namespace: {{ $labels.namespace }}!"
        description: "Maximum command duration is longer than average!\nValue = {{ $value }} seconds"
 
  # Warns that the Redis queue is too long
    - alert: Redis queue length
      expr: sum by (instance)(redis_key_size) > 50
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Redis queue size is too large!"
        description: "Warning: Redis queue size : {{ $value }} for the last 1 min!"

  # Warns that there are more than 10 processing (unacked) messages in the Redis queue
    - alert: Redis unacked messages
      expr: sum by (key)(redis_key_size{key="unacked"}) > 10
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Redis has unacked messages!"
        description: "Warning: Redis has {{ $value }} unacked messages!"
```

## TFSS

### Metrics Overview

| Metric                                          | Description                                |
| ----------------------------------------------- | ------------------------------------------ |
| `:tensorflow:serving:request_count`             | Total number of requests to TFSS           |
| `:tensorflow:serving:request_latency_bucket`    | Histogram of request processing time         |
| `:tensorflow:serving:request_latency_sum`       | Sum of processing durations for all requests |
| `:tensorflow:serving:request_latency_count`     | Total number of requests                     |
| `:tensorflow:cc:saved_model:load_attempt_count` | Number of model load attempts (loaded models) |
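Average latency and latency quantiles can be derived from the histogram metrics above. A minimal sketch (the 5-minute window and the fixed 0.95 quantile are illustrative assumptions):

```promql
# Average TFSS request latency: total processing time / number of requests.
rate(:tensorflow:serving:request_latency_sum[5m]) / rate(:tensorflow:serving:request_latency_count[5m])

# 0.95 quantile of TFSS request latency, computed from the histogram buckets.
histogram_quantile(0.95, sum by (le) (rate(:tensorflow:serving:request_latency_bucket[5m])))
```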

### Grafana Tables

| Table                                | Description                                                                                                                 | What to monitor                |
| ------------------------------------ | --------------------------------------------------------------------------------------------------------------------------- | ------------------------------ |
| Model request rate                   | Number of requests to TFSS per second and per minute                                                                        | Nothing, just to stay informed |
| Model latency (`$quantile`-quantile) | Quantile of the TFSS request processing time. You can set the quantile value in [the `$quantile` variable](#variables) | Nothing, just to stay informed |
| Model latency (AVG)                  | Average request processing time                                                                                               | Nothing, just to stay informed |
| HTTP probe success                   | The result of the built-in blackbox check. Blackbox sends requests to verify that TFSS works properly                       | Should be 1                    |

Illustrative screenshots:

{% tabs %}
{% tab title="Model request rate" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2F4cv4zRJGDHbELMaqvAOj%2F3-9%20tfss.png?alt=media&#x26;token=2a6cb042-400c-433a-85ec-731f3887174e" alt=""><figcaption></figcaption></figure>
{% endtab %}

{% tab title="Model latency (0.95-quantile)" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2FmCC2EKpiqcHikQQCarOx%2F3-10%20tfss.png?alt=media&#x26;token=02701561-e704-4876-a35b-d6d92a79bc76" alt=""><figcaption></figcaption></figure>
{% endtab %}

{% tab title="Model latency (AVG)" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2F1xNyyJ5hjMOiN2rS0FpM%2F3-11%20tfss.png?alt=media&#x26;token=acc01f66-f99a-494e-85fd-73f830ad243d" alt=""><figcaption></figcaption></figure>
{% endtab %}

{% tab title="HTTP probe success" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2FHOO20IdgfPoPziDyXQSt%2F3-12%20tfss.png?alt=media&#x26;token=823e7bb8-1ce3-4549-afa3-74b0b35cdb4f" alt=""><figcaption><p>Should be 1</p></figcaption></figure>
{% endtab %}
{% endtabs %}

### Grafana Alerts

```yaml
- name: TFSS alerts
  rules:
  # Critical alert that warns about blackbox detecting incorrect model behavior
    - alert: TFSS models probe service alert!
      expr: probe_success{job="blackbox-tfss-service"} != 1
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "TFSS model in namespace {{ $labels.namespace }} is unavailable!"
        description: "!!!ALERT!!! TFSS model or server doesn't work:\nMODEL:{{ $labels.model }}\nModel probe has been returning failed state for 3 min!"
    - alert: TFSS models probe pod alert!
      expr: probe_success{job="blackbox-tfss-models"} != 1
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "TFSS model in namespace {{ $labels.namespace }} is unavailable!"
        description: "TFSS in pod {{ $labels.pod }} doesn't work:\nMODEL:{{ $labels.model }}\nModel probe has been returning failed state for 3 min!"
    - alert: TFSS predict probe pod alert!
      expr: probe_success{job="blackbox-tfss-probe"} != 1
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "TFSS model predict in namespace {{ $labels.namespace }} is unavailable!"
        description: "TFSS in pod {{ $labels.pod }} doesn't work:\nMODEL:{{ $labels.model }}\nPredict probe has been returning failed state for 3 min!"

  # Critical alert that warns about TFSS not processing requests 
    - alert: TFSS empty request rate!
      expr: absent(:tensorflow:serving:request_count{namespace="api-prod"}) == 1
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "TFSS request rate is empty!!!"
        description: "Critical! Requests are not processed, check bio!!!"
```

## Nginx

### Metrics Overview

| Metric                       | Description                                                                           |
| ---------------------------- | ------------------------------------------------------------------------------------- |
| `nginx_up`                   | <p><code>1</code> means Nginx is working, </p><p><code>0</code> – service is down</p> |
| `nginx_connections_accepted` | Number of connections accepted by Nginx                                               |
| `nginx_connections_handled`  | Number of connections handled by Nginx                                                |
| `nginx_connections_active`   | Number of active Nginx connections                                                    |
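The `Processed connections rate` check compares the two counters above. A minimal sketch of the handled-to-accepted ratio (the 5-minute window is an illustrative assumption):

```promql
# Share of accepted connections that Nginx actually handled;
# anything below 1 means some connections were dropped.
rate(nginx_connections_handled[5m]) / rate(nginx_connections_accepted[5m])
```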

### Grafana Tables

| Table                      | Description                       | What to monitor                                             |
| -------------------------- | --------------------------------- | ----------------------------------------------------------- |
| Request rate               | Total number of requests to Nginx | Nothing, just to stay informed                              |
| Active connections         | Connection states                 | Shouldn't be any pending connections                        |
| Processed connections rate | Processing success rate           | Numbers of accepted and handled connections should be equal |

Illustrative screenshots:

{% tabs %}
{% tab title="Request rate" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2FRgAHfm2GHSy6VbbLMZXm%2F3-13%20nginx.png?alt=media&#x26;token=9289c487-869e-4ad8-b4ac-037baa369941" alt=""><figcaption></figcaption></figure>
{% endtab %}

{% tab title="Active connections" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2FqUqiU5j7WbeezXOy5RR3%2F3-14%20nginx.png?alt=media&#x26;token=e38e27e2-08b2-4805-ba75-432c63f8c69f" alt=""><figcaption><p>Shouldn't be any pending connections</p></figcaption></figure>
{% endtab %}

{% tab title="Processed connections rate" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2FwFzZjQJrCKZUZeLSlqbY%2F3-15%20nginx.png?alt=media&#x26;token=69e7a2a5-2d8a-46d4-aa0a-6c4fddce44ef" alt=""><figcaption><p>Numbers of accepted and handled connections should be equal</p></figcaption></figure>
{% endtab %}
{% endtabs %}

### Grafana Alerts

```yaml
groups:
  - name: NGINX alerts
    rules:
      # A critical alert that displays that Nginx hasn't handled some of the accepted connections
      - alert: Nginx not all connections are handled
        expr: rate(nginx_connections_handled[5m]) / rate(nginx_connections_accepted[5m]) < 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Nginx issue with handling connections"
          description: "Critical: Nginx doesn't handle some accepted connections on the host {{ $labels.instance }} for more than 2 minutes!"
```

## API

### Metrics Overview

| Metric                                                                             | Description                                    |
| ---------------------------------------------------------------------------------- | ---------------------------------------------- |
| `absent(kube_pod_container_status_ready{container="oz-api", namespace="api-prod"})` | Indicates that there are no ready API containers |

### Grafana Alerts

```yaml
groups:
  - name: API alerts
    rules:
      # A critical alert that warns that there are no ready API containers, so requests are not being processed
      - alert: Absent ready api containers!
        expr: absent(kube_pod_container_status_ready{container="oz-api", namespace="api-prod"} == 1)
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Absent ready api containers!!!"
          description: "Critical! Check api containers!!!"
```

You can customize the alerts according to your needs. Please proceed to [our repository](https://gitlab.com/oz-forensics/public/oz-engineering-monitoring-configs/-/tree/helm-0.10.20/alerts?ref_type=heads) to find the alert files.
