# API v5 and Below

Below, you'll find descriptions of the component metrics and Grafana tables, along with the pre-configured alerts.

## Celery

### Metrics Overview

| Metric | Description |
| --- | --- |
| `flower_events_total` | Total number of tasks |
| `flower_task_runtime_seconds_sum` | Sum of task completion durations |
| `histogram_quantile($quantile, sum(rate(flower_task_runtime_seconds_bucket[1m])) by (le))` | Quantile of task execution time, based on the [quantile variable](#variables) value |
| `flower_events_total{type="task-failed"}` | With the `type="task-failed"` label, displays the number of failed tasks |

### Grafana Tables

| Table | Description | What to monitor |
| --- | --- | --- |
| Liveness task rate | Average time of all Celery requests for Liveness | `task-received` should be roughly equal to `task-succeeded`; there shouldn't be any `task-failed` |
| All tasks duration (AVG) | Average time of all Celery requests for all models | The durations should be 6 seconds or less |
| Queue size | Message queue in Redis | Queue size and the number of unacked messages |

Illustrative screenshots:

{% tabs %}
{% tab title="Liveness task rate" %}

`task-received` ≈ `task-succeeded`, `task-failed` = 0
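
One way to track this relationship numerically is a Prometheus recording rule that divides the succeeded-task rate by the received-task rate. This is only a sketch: the group name and the record name are hypothetical, the 5-minute window is an arbitrary choice, and the expression covers all Celery tasks rather than Liveness alone. A value close to 1 corresponds to the healthy state.

```yaml
groups:
  # Hypothetical group name, not part of the shipped alert files
  - name: celery-recording-rules-example
    rules:
      # Hypothetical record name following the level:metric:operations convention
      - record: celery:task_success_ratio:rate5m
        # Share of received tasks that succeed; uses the same
        # flower_events_total counter as the alerts on this page
        expr: |
          sum(rate(flower_events_total{type="task-succeeded"}[5m]))
          /
          sum(rate(flower_events_total{type="task-received"}[5m]))
```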

{% endtab %} {% tab title="Liveness task duration" %}

{% endtab %} {% tab title="Succeeded vs failed tasks rate" %}

{% endtab %} {% tab title="All task duration (AVG)" %}

{% endtab %} {% tab title="Queue size" %}

Monitor the queue size and the number of unacked messages

{% endtab %}
{% endtabs %}

### Grafana Alerts

```yaml
- name: Celery alerts
  rules:
    # An alert for the 0.9 quantile of task duration; warns that analyses are running slowly
    - alert: Quality Analysis is slow!
      expr: histogram_quantile(0.9, sum without (handler) (rate(flower_task_runtime_seconds_bucket{task="oz_core.tasks.tfss.process_analyse_quality"}[5m]))) > 5
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Too long duration of celery worker for TFSS task"
        description: "The duration of 90% Quality analyses is longer than {{ $value }} seconds for last 10 minutes. NAMESPACE: {{ $labels.namespace }} POD: {{ $labels.pod }} TASK: {{ $labels.task }} WORKER: {{ $labels.worker }}"
    # An alert for failed tasks; if the number is growing, something is going wrong
    - alert: Celery tasks failed!
      expr: sum by(type,task) (rate(flower_events_total{type="task-failed", task!=""}[1m])) > 0
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Celery tasks {{ $labels.task }} failed"
        description: 'Failed celery tasks rate: {{ printf "%.2f" $value }} rps'
    # A critical alert that warns that all tasks are failing; it means that the system has stopped processing requests
    - alert: Celery zero success tasks!
      expr: sum by(type,task) (rate(flower_events_total{type="task-succeeded", task!=""}[1m])) == 0 and on (task) sum by(type,task) (rate(flower_events_total{type="task-received", task!=""}[1m])) > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Celery has zero success tasks: {{ $labels.task }}!"
        description: "Critical! Check if bio is alive!"
```

## Redis

### Metrics Overview

| Metric | Description |
| --- | --- |
| `redis_up` | 1 means Redis is working; 0 means the service is down |
| `redis_commands_total` | Total number of commands in Redis |
| `redis_commands_duration_seconds_total` | Total time spent executing Redis commands, in seconds |
| `redis_key_size` | Redis (as a message broker) queue size |
| `redis_key_size{key="unacked"}` | With the `key="unacked"` label, displays the number of tasks that are being processed by Redis |

### Grafana Tables

| Table | Description | What to monitor |
| --- | --- | --- |
| Command rate | Number of requests per second | Nothing, just to stay informed |
| Commands duration | Average and maximum command execution duration | AVG < 15µs, MAX < 1ms |
| Connected clients | Number of connected clients | Shouldn't be 0 |

Illustrative screenshots:

{% tabs %}
{% tab title="Command rate" %}

{% endtab %} {% tab title="Commands duration" %}

{% endtab %} {% tab title="Connected clients" %}

{% endtab %}
{% endtabs %}

### Grafana Alerts

```yaml
- name: Redis alerts
  rules:
    # A critical alert that warns about Redis being down
    - alert: Redis is down
      expr: redis_up != 1
      for: 30s
      labels:
        severity: critical
      annotations:
        summary: "Redis is down for more than 30 seconds!"
        description: "Critical: REDIS service is down in namespace: {{ $labels.namespace }}\nPod: {{ $labels.pod }}!"
    # Fires if Redis rejects connections
    - alert: Redis rejected connections
      expr: rate(redis_rejected_connections_total[1m]) > 0
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Redis rejects connections for more than 1 minute in namespace: {{ $labels.namespace }}!"
        description: "Some connections to Redis have been rejected!\nPod: {{ $labels.pod }}\nValue = {{ $value }}"
    # Fires when Redis commands are being executed too slowly
    - alert: Redis command duration is slow!
      expr: max by(namespace) (rate(redis_commands_duration_seconds_total[1m])) > 0.0004
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Redis max command duration is too high for more than 1 minute in namespace: {{ $labels.namespace }}!"
        description: "Maximum command duration is longer than average!\nValue = {{ $value }} seconds"
    # Warns that the Redis queue is too long
    - alert: Redis queue length
      expr: sum by (instance)(redis_key_size) > 50
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Redis queue size is too large!"
        description: "Warning: Redis queue size : {{ $value }} for the last 1 min!"
    # Warns that there are more than 10 processing (unacked) messages in the Redis queue
    - alert: Redis unacked masseges
      expr: sum by (key)(redis_key_size{key="unacked"}) > 10
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Redis has unacked messages!"
        description: "Warning: Redis has {{ $value }} unacked messages!"
```

## TFSS

### Metrics Overview

| Metric | Description |
| --- | --- |
| `:tensorflow:serving:request_count` | Total number of requests to TFSS |
| `:tensorflow:serving:request_latency_bucket` | Histogram of request processing time |
| `:tensorflow:serving:request_latency_sum` | Sum of processing durations across requests |
| `:tensorflow:serving:request_latency_count` | Total number of requests counted in the latency histogram |
| `:tensorflow:cc:saved_model:load_attempt_count` | Number of model load attempts |

### Grafana Tables

| Table | Description | What to monitor |
| --- | --- | --- |
| Model request rate | Number of requests to TFSS per second and per minute | Nothing, just to stay informed |
| Model latency (`$quantile`-quantile) | 0.95 quantile of the TFSS request processing time. You can set the quantile value in [the `$quantile` variable](#variables) | Nothing, just to stay informed |
| Model latency (AVG) | Average request processing time | Nothing, just to stay informed |
| HTTP probe success | The result of the built-in blackbox check. Blackbox sends requests to verify that TFSS works properly | Should be 1 |

Illustrative screenshots:

{% tabs %}
{% tab title="Model request rate" %}

{% endtab %} {% tab title="Model latency (0.95-quantile)" %}

{% endtab %} {% tab title="Model latency (AVG)" %}

{% endtab %} {% tab title="HTTP probe success" %}

{% endtab %}
{% endtabs %}

### Grafana Alerts

```yaml
- name: TFSS alerts
  rules:
    # Critical alerts that warn about blackbox detecting incorrect model behavior
    - alert: TFSS models probe service alert!
      expr: probe_success{job="blackbox-tfss-service"} != 1
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "TFSS model in namespace {{ $labels.namespace }} is unavailable!"
        description: "!!!ALERT!!! TFSS model or server doesn't work:\nMODEL:{{ $labels.model }}\nModel probe has been returning failed state for 3 min!"
    - alert: TFSS models probe pod alert!
      expr: probe_success{job="blackbox-tfss-models"} != 1
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "TFSS model in namespace {{ $labels.namespace }} is unavailable!"
        description: "TFSS in pod {{ $labels.pod }} doesn't work:\nMODEL:{{ $labels.model }}\nModel probe has been returning failed state for 3 min!"
    - alert: TFSS predict probe pod alert!
      expr: probe_success{job="blackbox-tfss-probe"} != 1
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "TFSS model predict in namespace {{ $labels.namespace }} is unavailable!"
        description: "TFSS in pod {{ $labels.pod }} doesn't work:\nMODEL:{{ $labels.model }}\nPredict probe has been returning failed state for 3 min!"
    # Critical alert that warns about TFSS not processing requests
    - alert: TFSS empty request rate!
      expr: absent(:tensorflow:serving:request_count{namespace="api-prod"}) == 1
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "TFSS request rate is empty!!!"
        description: "Critical! Requests are not processed, check bio!!!"
```

## nginx

### Metrics Overview

| Metric | Description |
| --- | --- |
| `nginx_up` | 1 means nginx is working; 0 means the service is down |
| `nginx_connections_accepted` | Number of connections accepted by nginx |
| `nginx_connections_handled` | Number of connections handled by nginx |
| `nginx_connections_active` | Number of active nginx connections |

### Grafana Tables

| Table | Description | What to monitor |
| --- | --- | --- |
| Request rate | Total number of requests to nginx | Nothing, just to stay informed |
| Active connections | Connection states | There shouldn't be any pending connections |
| Processed connections rate | Processing success rate | Numbers of accepted and handled connections should be equal |

Illustrative screenshots:

{% tabs %}
{% tab title="Request rate" %}

{% endtab %} {% tab title="Active connections" %}

{% endtab %} {% tab title="Processed connections rate" %}

Numbers of accepted and handled connections should be equal

{% endtab %}
{% endtabs %}

### Grafana Alerts

```yaml
groups:
  - name: NGINX alerts
    rules:
      # A critical alert that fires when nginx hasn't handled some of the accepted connections
      - alert: nginx not all connections are handled
        expr: rate(nginx_connections_handled[5m]) / rate(nginx_connections_accepted[5m]) < 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "nginx issue with handling connections"
          description: "Critical: nginx doesn't handle some accepted connections on the host {{ $labels.instance }} for more than 2 minutes!"
```

## API

### Metrics Overview

| Metric | Description |
| --- | --- |
| `absent(kube_pod_container_status_ready{container="oz-api", namespace="api-prod"})` | Indicates that there are no ready API containers |

### Grafana Alerts

```yaml
groups:
  - name: API alerts
    rules:
      # A critical alert that warns that there are no ready API containers, so requests are not being processed
      - alert: Absent ready api containers!
        expr: absent(kube_pod_container_status_ready{container="oz-api", namespace="api-prod"} == 1)
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Absent ready api containers!!!"
          description: "Critical! Check api containers!!!"
```

You can customize the alerts according to your needs. Please proceed to [our repository](https://gitlab.com/oz-forensics/public/oz-engineering-monitoring-configs/-/tree/helm-0.10.20/alerts?ref_type=heads) to find the alert files.
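
As one illustration of customizing an alert, the sketch below copies the Celery quality-analysis rule from this page and changes only the latency threshold and the `for` window. The values `8` and `15m` are placeholders chosen for the example, not recommended settings, and the `groups:` wrapper assumes the rule lives in a standalone Prometheus rule file.

```yaml
groups:
  - name: Celery alerts
    rules:
      - alert: Quality Analysis is slow!
        # Placeholder threshold: 8 seconds instead of the 5 seconds used on this page
        expr: histogram_quantile(0.9, sum without (handler) (rate(flower_task_runtime_seconds_bucket{task="oz_core.tasks.tfss.process_analyse_quality"}[5m]))) > 8
        # Placeholder window: the condition must hold for 15 minutes before the alert fires
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Too long duration of celery worker for TFSS task"
```

If you change `for`, update any duration mentioned in the alert's `description` so that the message matches the rule.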