# API v5 and Below

Below, you'll find the description of component metrics and tables, along with pre-configured alerts.

## Celery

### Metrics Overview

| Metric                                                                                     | Description                                                                    |
| ------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------ |
| `flower_events_total`                                                                      | Total number of tasks                                                          |
| `flower_task_runtime_seconds_sum`                                                          | Sum of task completion durations                                               |
| `histogram_quantile($quantile, sum(rate(flower_task_runtime_seconds_bucket[1m])) by (le))` | Quantile for task execution based on the [quantile variable](#variables) value |
| `flower_events_total{type="task-failed"}`                                                  | With the `type="task-failed"` label, displays the number of failed tasks       |
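
The metrics above can also be combined into derived series. The following is a minimal sketch of Prometheus recording rules and is not part of the shipped configuration. The group and rule names are hypothetical, and `flower_task_runtime_seconds_count` is assumed to be exposed alongside the `_sum` and `_bucket` series, as is usual for a Prometheus histogram.

```yaml
groups:
  - name: celery-derived-metrics            # hypothetical group name
    rules:
      # Average task duration per task over the last 5 minutes:
      # seconds spent in tasks divided by the number of completed tasks.
      - record: celery:task_runtime_seconds:avg5m   # hypothetical rule name
        expr: |
          sum by (task) (rate(flower_task_runtime_seconds_sum[5m]))
          /
          sum by (task) (rate(flower_task_runtime_seconds_count[5m]))

      # Share of received tasks that failed over the last 5 minutes.
      - record: celery:task_failed:ratio5m          # hypothetical rule name
        expr: |
          sum by (task) (rate(flower_events_total{type="task-failed"}[5m]))
          /
          sum by (task) (rate(flower_events_total{type="task-received"}[5m]))
```

A failure ratio close to 0 corresponds to the healthy state described for the Grafana tables in the next section.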

### Grafana Tables

| Table                          | Description                                        | What to monitor                                                                                                                                        |
| ------------------------------ | -------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Liveness task rate             | Average time of all Celery requests for Liveness   | <ul><li><code>task-received</code> should be roughly equal to <code>task-succeeded</code>,</li><li>there shouldn't be any <code>task-failed</code> events</li></ul> |
| Liveness task duration         | Quantile for task execution                        | 0.95 quantile should be 8 seconds or less                                                                                                              |
| Succeeded vs failed tasks rate | Total number of Celery requests                    | <ul><li><code>task-received</code> should be roughly equal to <code>task-succeeded</code>,</li><li>there shouldn't be any <code>task-failed</code> events</li></ul> |
| All tasks duration (AVG)       | Average time of all Celery requests for all models | The durations should be 6 seconds or less                                                                                                              |
| Queue size                     | Message queue in Redis                             | Queue size and the number of unacked messages                                                                                                          |

Illustrative screenshots:

{% tabs %}
{% tab title="Liveness task rate" %}

<figure><img src="/files/mNbulX71cldgO6eQ99g5" alt=""><figcaption><p><code>task-received</code> ≈ <code>task-succeeded</code>, <code>task-failed</code> = 0</p></figcaption></figure>
{% endtab %}

{% tab title="Liveness task duration" %}

<figure><img src="/files/SJDpzVZMKVBWsmgYN5GV" alt=""><figcaption><p>0.95 quantile ≤ 8 seconds</p></figcaption></figure>
{% endtab %}

{% tab title="Succeeded vs failed tasks rate" %}

<figure><img src="/files/fAD5ymNhoRPhMDZCGvF3" alt=""><figcaption><p><code>task-received</code> ≈ <code>task-succeeded</code>, <code>task-failed</code> = 0</p></figcaption></figure>
{% endtab %}

{% tab title="All tasks duration (AVG)" %}

<figure><img src="/files/36N0YjtvRYWyWDJ48uBN" alt=""><figcaption><p>The durations ≤ 6 seconds</p></figcaption></figure>
{% endtab %}

{% tab title="Queue size" %}

<figure><img src="/files/agnY5SVMLK4Se1YyzjOj" alt=""><figcaption><p>Monitor the queue size and the number of unacked messages</p></figcaption></figure>
{% endtab %}
{% endtabs %}

### Grafana Alerts

```yaml
- name: Celery alerts
  rules:
  
  # An alert for the 0.9 quantile of task duration; warns that analyses are running slowly
    - alert: Quality Analysis is slow!
      expr: histogram_quantile(0.9, sum without (handler) (rate(flower_task_runtime_seconds_bucket{task="oz_core.tasks.tfss.process_analyse_quality"}[5m]))) > 5
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Too long duration of celery worker for TFSS task"
        description: "The 0.9 quantile of Quality analysis duration is {{ $value }} seconds over the last 10 minutes. NAMESPACE: {{ $labels.namespace }} POD: {{ $labels.pod }} TASK: {{ $labels.task }} WORKER: {{ $labels.worker }}"
  
  # An alert for failed tasks; if the number is growing, something is going wrong
    - alert: Celery tasks failed!
      expr: sum by(type,task) (rate(flower_events_total{type="task-failed", task!=""}[1m])) > 0
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Celery tasks {{ $labels.task }} failed"
        description: 'Failed celery tasks rate: {{ printf "%.2f" $value }} rps'
  
  # A critical alert that warns that tasks are being received but none succeed; it means that the system has stopped processing requests
    - alert: Celery zero success tasks!
      expr: sum by(type,task) (rate(flower_events_total{type="task-succeeded", task!=""}[1m])) == 0 and on (task) sum by(type,task) (rate(flower_events_total{type="task-received", task!=""}[1m])) > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Celery has zero success tasks: {{ $labels.task }}!"
        description: "Critical! Check if bio is alive!"

```

## Redis

### Metrics Overview

| Metric                                  | Description                                                                                |
| --------------------------------------- | ------------------------------------------------------------------------------------------ |
| `redis_up`                              | <p><code>1</code> means Redis is working,</p><p><code>0</code> – service is down</p>       |
| `redis_commands_total`                  | Total number of commands in Redis                                                          |
| `redis_commands_duration_seconds_total` | Redis process duration                                                                     |
| `redis_key_size`                        | Redis (as a message broker) queue size                                                     |
| `redis_key_size{key="unacked"}`         | With the `key="unacked"` label, displays the number of tasks that are being processed by Redis |
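
The two command counters above can be combined to estimate the average duration of a single command. `rate(redis_commands_duration_seconds_total[1m])` alone gives the seconds spent executing commands per second, so dividing it by the command rate yields seconds per command. The recording rule below is a minimal sketch and is not part of the shipped configuration. The group and rule names are hypothetical, and the `namespace` label is assumed to be present, as it is in the alerts further down.

```yaml
groups:
  - name: redis-derived-metrics             # hypothetical group name
    rules:
      # Average duration of one Redis command over the last minute, in seconds.
      - record: redis:command_duration_seconds:avg1m   # hypothetical rule name
        expr: |
          sum by (namespace) (rate(redis_commands_duration_seconds_total[1m]))
          /
          sum by (namespace) (rate(redis_commands_total[1m]))
```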

### Grafana Tables

| Table             | Description                                    | What to monitor                   |
| ----------------- | ---------------------------------------------- | --------------------------------- |
| Command rate      | Number of requests per second                  | Nothing, just to stay informed    |
| Commands duration | Average and maximum command execution duration | <p>AVG < 15µs</p><p>MAX < 1ms</p> |
| Connected clients | Number of connected clients                    | Shouldn't be 0                    |

Illustrative screenshots:

{% tabs %}
{% tab title="Command rate" %}

<figure><img src="/files/zHCivOvtu5OZr6xwE0Py" alt=""><figcaption></figcaption></figure>
{% endtab %}

{% tab title="Commands duration" %}

<figure><img src="/files/nYzxQ95S1S1tm6VfK6ok" alt=""><figcaption><p>AVG &#x3C; 15µs, MAX &#x3C; 1ms</p></figcaption></figure>
{% endtab %}

{% tab title="Connected clients" %}

<figure><img src="/files/X53YNtRp0D9fHuZSRIu9" alt=""><figcaption><p>Shouldn't be 0</p></figcaption></figure>
{% endtab %}
{% endtabs %}

### Grafana Alerts

```yaml
- name: Redis alerts
  rules:
  # A critical alert that warns about Redis being down
    - alert: Redis is down
      expr: redis_up != 1
      for: 30s
      labels:
        severity: critical
      annotations:
        summary: "Redis is down for more than 30 seconds!"
        description: "Critical: REDIS service is down in namespace: {{ $labels.namespace }}\nPod: {{ $labels.pod }}!"
  
  # Displays if Redis rejects connections
    - alert: Redis rejected connections
      expr: rate(redis_rejected_connections_total[1m]) > 0
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Redis rejects connections for more than 1 minute in namespace: {{ $labels.namespace }}!"
        description: "Some connections to Redis have been rejected!\nPod: {{ $labels.pod }}\nValue = {{ $value }}"

  # Warns that Redis commands are being executed too slowly
    - alert: Redis command duration is slow!
      expr: max by(namespace) (rate(redis_commands_duration_seconds_total[1m])) > 0.0004
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Redis max command duration is too high for more than 1 minute in namespace: {{ $labels.namespace }}!"
        description: "Maximum command duration is longer than average!\nValue = {{ $value }} seconds"
 
  # Warns that the Redis queue is too long
    - alert: Redis queue length
      expr: sum by (instance)(redis_key_size) > 50
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Redis queue size is too large!"
        description: "Warning: Redis queue size : {{ $value }} for the last 1 min!"

  # Warns that there are more than 10 processing (unacked) messages in the Redis queue
    - alert: Redis unacked messages
      expr: sum by (key)(redis_key_size{key="unacked"}) > 10
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Redis has unacked messages!"
        description: "Warning: Redis has {{ $value }} unacked messages!"
```

## TFSS

### Metrics Overview

| Metric                                          | Description                                |
| ----------------------------------------------- | ------------------------------------------ |
| `:tensorflow:serving:request_count`             | Total number of requests to TFSS           |
| `:tensorflow:serving:request_latency_bucket`    | Histogram of request processing time         |
| `:tensorflow:serving:request_latency_sum`       | Sum of processing durations for all requests |
| `:tensorflow:serving:request_latency_count`     | Total number of processed requests           |
| `:tensorflow:cc:saved_model:load_attempt_count` | Number of model load attempts                |
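
The latency series above form a regular Prometheus histogram, so both the average and a quantile can be derived from them. The recording rules below are a minimal sketch and are not part of the shipped configuration. The group and rule names are hypothetical, the 5-minute window is an arbitrary choice, and the result is expressed in whatever unit TFSS reports for the latency.

```yaml
groups:
  - name: tfss-derived-metrics              # hypothetical group name
    rules:
      # Average request latency: total latency divided by the number of requests.
      - record: tfss:request_latency:avg5m  # hypothetical rule name
        expr: |
          sum(rate(:tensorflow:serving:request_latency_sum[5m]))
          /
          sum(rate(:tensorflow:serving:request_latency_count[5m]))

      # 0.95 quantile of request latency, computed from the histogram buckets.
      - record: tfss:request_latency:p95_5m # hypothetical rule name
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (rate(:tensorflow:serving:request_latency_bucket[5m]))
          )
```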

### Grafana Tables

| Table                                | Description                                                                                                                 | What to monitor                |
| ------------------------------------ | --------------------------------------------------------------------------------------------------------------------------- | ------------------------------ |
| Model request rate                   | Number of requests to TFSS per second and per minute                                                                        | Nothing, just to stay informed |
| Model latency (`$quantile`-quantile) | 0.95 quantile of the TFSS request processing time. You can set the quantile value in [the `$quantile` variable](#variables) | Nothing, just to stay informed |
| Model latency (AVG)                  | Average request processing time                                                                                             | Nothing, just to stay informed |
| HTTP probe success                   | The result of the built-in blackbox check. Blackbox sends requests to verify that TFSS works properly                       | Should be 1                    |

Illustrative screenshots:

{% tabs %}
{% tab title="Model request rate" %}

<figure><img src="/files/V93CwIbVR9r5CvXAM2Cn" alt=""><figcaption></figcaption></figure>
{% endtab %}

{% tab title="Model latency (0.95-quantile)" %}

<figure><img src="/files/hxrPoxQxjZPVb6vA7d0O" alt=""><figcaption></figcaption></figure>
{% endtab %}

{% tab title="Model latency (AVG)" %}

<figure><img src="/files/0PCT7GuF93AnDsDlIo7k" alt=""><figcaption></figcaption></figure>
{% endtab %}

{% tab title="HTTP probe success" %}

<figure><img src="/files/wo6pZsDIggWB1GsD4STt" alt=""><figcaption><p>Should be 1</p></figcaption></figure>
{% endtab %}
{% endtabs %}

### Grafana Alerts

```yaml
- name: TFSS alerts
  rules:
  # Critical alert that warns about blackbox detecting incorrect model behavior
    - alert: TFSS models probe service alert!
      expr: probe_success{job="blackbox-tfss-service"} != 1
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "TFSS model in namespace {{ $labels.namespace }} is unavailable!"
        description: "!!!ALERT!!! TFSS model or server doesn't work:\nMODEL:{{ $labels.model }}\nModel probe has been returning failed state for 3 min!"
    - alert: TFSS models probe pod alert!
      expr: probe_success{job="blackbox-tfss-models"} != 1
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "TFSS model in namespace {{ $labels.namespace }} is unavailable!"
        description: "TFSS in pod {{ $labels.pod }} doesn't work:\nMODEL:{{ $labels.model }}\nModel probe has been returning failed state for 3 min!"
    - alert: TFSS predict probe pod alert!
      expr: probe_success{job="blackbox-tfss-probe"} != 1
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "TFSS model predict in namespace {{ $labels.namespace }} is unavailable!"
        description: "TFSS in pod {{ $labels.pod }} doesn't work:\nMODEL:{{ $labels.model }}\nPredict probe has been returning failed state for 3 min!"

  # Critical alert that warns about TFSS not processing requests 
    - alert: TFSS empty request rate!
      expr: absent(:tensorflow:serving:request_count{namespace="api-prod"}) == 1
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "TFSS request rate is empty!!!"
        description: "Critical! Requests are not processed, check bio!!!"
```

## nginx

### Metrics Overview

| Metric                       | Description                                                                          |
| ---------------------------- | ------------------------------------------------------------------------------------ |
| `nginx_up`                   | <p><code>1</code> means nginx is working,</p><p><code>0</code> – service is down</p> |
| `nginx_connections_accepted` | Number of connections accepted by nginx                                              |
| `nginx_connections_handled`  | Number of connections handled by nginx                                               |
| `nginx_connections_active`   | Number of active nginx connections                                                   |

### Grafana Tables

| Table                      | Description                       | What to monitor                                             |
| -------------------------- | --------------------------------- | ----------------------------------------------------------- |
| Request rate               | Total number of requests to nginx | Nothing, just to stay informed                              |
| Active connections         | Connection states                 | Shouldn't be any pending connections                        |
| Processed connections rate | Processing success rate           | Numbers of accepted and handled connections should be equal |

Illustrative screenshots:

{% tabs %}
{% tab title="Request rate" %}

<figure><img src="/files/WNZ0hhLLsNUNp1Nd9M2d" alt=""><figcaption></figcaption></figure>
{% endtab %}

{% tab title="Active connections" %}

<figure><img src="/files/Vo1PInlBV19JWpDQDW7a" alt=""><figcaption><p>Shouldn't be any pending connections</p></figcaption></figure>
{% endtab %}

{% tab title="Processed connections rate" %}

<figure><img src="/files/aB0nASIbiTFsJKD3d1l8" alt=""><figcaption><p>Numbers of accepted and handled connections should be equal</p></figcaption></figure>
{% endtab %}
{% endtabs %}

### Grafana Alerts

```yaml
groups:
  - name: NGINX alerts
    rules:
      # A critical alert that displays that nginx hasn't handled some of the accepted connections
      - alert: nginx not all connections are handled
        expr: rate(nginx_connections_handled[5m]) / rate(nginx_connections_accepted[5m]) < 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "nginx issue with handling connections"
          description: "Critical: nginx doesn't handle some accepted connections on the host {{ $labels.instance }} for more than 2 minutes!"
```

## API

### Metrics Overview

| Metric                                                                             | Description                                    |
| ---------------------------------------------------------------------------------- | ---------------------------------------------- |
| `absent(kube_pod_container_status_ready{container="oz-api", namespace="api-prod"} == 1)` | Returns `1` when there are no ready API containers |

### Grafana Alerts

```yaml
groups:
  - name: API alerts
    rules:
      # A critical alert that warns that there are no ready API containers, so requests are not being processed
      - alert: Absent ready api containers!
        expr: absent(kube_pod_container_status_ready{container="oz-api", namespace="api-prod"} == 1)
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Absent ready api containers!!!"
          description: "Critical! Check api containers!!!"
```

You can customize the alerts according to your needs. Please proceed to [our repository](https://gitlab.com/oz-forensics/public/oz-engineering-monitoring-configs/-/tree/helm-0.10.20/alerts?ref_type=heads) to find the alert files.
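
After customizing a rule, you may want to check it before deployment. One option is a Prometheus rule unit test run with `promtool test rules`. The file below is a minimal sketch under several assumptions: the rules are evaluated by Prometheus, the Redis rules shown above are saved under a top-level `groups:` key in a file named `redis-alerts.yml` (a hypothetical name), and the `namespace` and `pod` values are made up for the test.

```yaml
# redis-alerts-test.yml (hypothetical file name)
rule_files:
  - redis-alerts.yml          # assumed to contain the "Redis alerts" group

evaluation_interval: 1m

tests:
  - interval: 1m
    # Simulated input: Redis reports "down" (0) for six consecutive samples.
    input_series:
      - series: 'redis_up{namespace="api-prod", pod="redis-0"}'
        values: '0 0 0 0 0 0'

    alert_rule_test:
      # The rule uses "for: 30s", so the alert is expected to fire
      # by the evaluation at the 1-minute mark.
      - eval_time: 1m
        alertname: Redis is down
        exp_alerts:
          - exp_labels:
              severity: critical
              namespace: api-prod
              pod: redis-0
            exp_annotations:
              summary: "Redis is down for more than 30 seconds!"
              description: "Critical: REDIS service is down in namespace: api-prod\nPod: redis-0!"
```

To run the check, execute `promtool test rules redis-alerts-test.yml` in the directory that contains both files. If you change the threshold, the `for` duration, or the annotation texts of the rule, update the expected values in the test accordingly.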

