# Kubernetes

If you use our Helm charts, ServiceMonitor handles Prometheus configuration automatically.

## General Information

### Exporter Ports

| Exporter     | Endpoint                   |
| ------------ | -------------------------- |
| API          | api\_pod:8000/metrics      |
| statsd       | api\_pod:9102/metrics/app  |
| TFSS         | bio\_pod:8501/metrics/tfss |
| Bio blackbox | bio\_pod:9115              |
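
If you do not deploy with the Helm charts, the API endpoints above can be scraped through a hand-written `ServiceMonitor`. The following is a minimal sketch, assuming the Prometheus Operator is installed. The resource name, the `app: oz-api` selector label, and the port names `http` and `statsd` are hypothetical and must match your own Service definition.

```yaml
# Sketch only: names, labels, and port names below are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: oz-api-metrics        # hypothetical name
  namespace: api-prod         # namespace used in the alert examples in this guide
spec:
  selector:
    matchLabels:
      app: oz-api             # hypothetical Service label
  endpoints:
    - port: http              # hypothetical name of the Service port mapped to 8000
      path: /metrics
      interval: 30s
    - port: statsd            # hypothetical name of the Service port mapped to 9102
      path: /metrics/app
      interval: 30s
```

The scrape interval is an illustrative value; choose one that matches the `rate()` windows used in your alerts.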

### Healthcheck Paths

**API**

`GET http://api:8000/api/version`

`GET http://api:8000/api/healthcheck`

**BIO**

`GET http://bio:8501/v1/models/{{model_name}}`

`POST http://bio:8501/v1/models/dummy:predict`

```yaml
method: POST
body: '{"inputs": {"images_bytes": [{"b64": "aaa"}]}}'
```
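
The API paths above can also back Kubernetes container probes. The fragment below is a sketch of a container spec, assuming the API container is named `oz-api` (as in the alert expressions in this guide) and listens on port 8000. The timing values are illustrative assumptions, not recommended settings.

```yaml
# Fragment of a Pod template; timings are assumptions, tune them for your cluster.
containers:
  - name: oz-api
    ports:
      - containerPort: 8000
    readinessProbe:
      httpGet:
        path: /api/healthcheck
        port: 8000
      periodSeconds: 10
      failureThreshold: 3
    livenessProbe:
      httpGet:
        path: /api/version
        port: 8000
      initialDelaySeconds: 30
      periodSeconds: 20
```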

***

## API

### Metrics Overview

| Metric                                                                                                            | Description                                                                                                      | Type      | Labels                                                               |
| ----------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------- | --------- | -------------------------------------------------------------------- |
| `oz_api_versions_total` (also available as `oz_api_versions_created`, `oz_api_info_created`, `oz_api_info_total`) | Displays API version info when you call `/api/version`                                                           | counter   | `lamb`                                                               |
|                                                                                                                   |                                                                                                                  |           | `core` – core version                                                |
|                                                                                                                   |                                                                                                                  |           | `oz_api` – API version                                               |
| `oz_api_analyses_total`                                                                                           | Number of processed analyses                                                                                     | counter   | `analysis_result` – analysis status (`FINISHED`, `FAILED`)           |
|                                                                                                                   |                                                                                                                  |           | `analysis_type` – analysis type (`QUALITY`, `BIOMETRY`, `DOCUMENTS`) |
|                                                                                                                   |                                                                                                                  |           | `company_id` – company identifier                                    |
|                                                                                                                   |                                                                                                                  |           | `resolution_status` – analysis resolution (`SUCCESS`, `DECLINED`)    |
| `oz_api_analyses_created`                                                                                         | Number of analyses that have been initiated                                                                      | gauge     | `analysis_result` – analysis status (`FINISHED`, `FAILED`)           |
|                                                                                                                   |                                                                                                                  |           | `analysis_type` – analysis type (`QUALITY`, `BIOMETRY`, `DOCUMENTS`) |
|                                                                                                                   |                                                                                                                  |           | `company_id` – company identifier                                    |
|                                                                                                                   |                                                                                                                  |           | `resolution_status` – analysis resolution (`SUCCESS`, `DECLINED`)    |
| `oz_api_analyses_in_progress`                                                                                     | Number of analyses currently processing                                                                          | gauge     | `analysis_type` – analysis type (`QUALITY`, `BIOMETRY`, `DOCUMENTS`) |
|                                                                                                                   |                                                                                                                  |           | `company_id` – company identifier                                    |
| `oz_api_analyse_duration_seconds_bucket`                                                                          | A histogram for analysis duration. Default buckets: `0.1,0.5,1,1.5,2,2.5,3,3.5,4,4.5,5,6,7,8,10,12,15,20,30,inf` | histogram | `analysis_result` – analysis status (`FINISHED`, `FAILED`)           |
|                                                                                                                   |                                                                                                                  |           | `analysis_type` – analysis type (`QUALITY`, `BIOMETRY`, `DOCUMENTS`) |
|                                                                                                                   |                                                                                                                  |           | `resolution_status` – analysis resolution (`SUCCESS`, `DECLINED`)    |
|                                                                                                                   |                                                                                                                  |           | `company_id` – company identifier                                    |
|                                                                                                                   |                                                                                                                  |           | `le` – bucket upper bound                                            |
| `oz_api_analyse_duration_seconds_count`                                                                           | Number of analyses whose durations were counted                                                                  | counter   | `analysis_result` – analysis status (`FINISHED`, `FAILED`)           |
|                                                                                                                   |                                                                                                                  |           | `analysis_type` – analysis type (`QUALITY`, `BIOMETRY`, `DOCUMENTS`) |
|                                                                                                                   |                                                                                                                  |           | `resolution_status` – analysis resolution (`SUCCESS`, `DECLINED`)    |
|                                                                                                                   |                                                                                                                  |           | `company_id` – company identifier                                    |
| `oz_api_analyse_duration_seconds_sum`                                                                             | Sum of analyses' durations                                                                                       | counter   | `analysis_result` – analysis status (`FINISHED`, `FAILED`)           |
|                                                                                                                   |                                                                                                                  |           | `analysis_type` – analysis type (`QUALITY`, `BIOMETRY`, `DOCUMENTS`) |
|                                                                                                                   |                                                                                                                  |           | `resolution_status` – analysis resolution (`SUCCESS`, `DECLINED`)    |
|                                                                                                                   |                                                                                                                  |           | `company_id` – company identifier                                    |
| `oz_api_analyse_duration_seconds_created`                                                                         | Timestamp of beginning of duration metric counting                                                               |           |                                                                      |
| `django_http_requests_total_by_transport_total`                                                                   | Number of requests by transport protocol                                                                         | counter   | `transport` – protocol (HTTP, HTTPS)                                 |
|                                                                                                                   |                                                                                                                  |           | `uid` – user identifier                                              |
| `django_http_responses_before_middlewares_total`                                                                  | Number of Django responses before running middleware                                                             | counter   | `uid` – user identifier                                              |
| `django_http_exceptions_total_by_type_total`                                                                      | Number of Django exceptions by type                                                                              | counter   | `exception_type` – type of exception                                 |
|                                                                                                                   |                                                                                                                  |           | `uid` – user identifier                                              |
| `django_http_requests_latency_seconds_by_view_method_bucket`                                                      | Histogram of request processing latency by views                                                                 | histogram | `method` – HTTP method                                               |
|                                                                                                                   |                                                                                                                  |           | `uid` – user identifier                                              |
|                                                                                                                   |                                                                                                                  |           | `view_name` – view name                                              |
|                                                                                                                   |                                                                                                                  |           | `le` – bucket upper bound                                            |
| `django_http_requests_latency_seconds_by_view_method_count`                                                       | Number of request processing durations by views                                                                  | counter   | `method` – HTTP method                                               |
|                                                                                                                   |                                                                                                                  |           | `uid` – user identifier                                              |
|                                                                                                                   |                                                                                                                  |           | `view_name` – view name                                              |
| `django_http_requests_latency_seconds_by_view_method_sum`                                                         | Sum of request processing durations by views                                                                     | counter   | `method` – HTTP method                                               |
|                                                                                                                   |                                                                                                                  |           | `uid` – user identifier                                              |
|                                                                                                                   |                                                                                                                  |           | `view_name` – view name                                              |

Illustrative screenshots:

{% tabs %}
{% tab title="Analyses rate per type" %}

<figure><img src="/files/8sdgBx7T7eBRwlFDaaPv" alt=""><figcaption></figcaption></figure>
{% endtab %}

{% tab title="Analyses rate per resolution" %}

<figure><img src="/files/4Z4lFINvveSNQjk7mae6" alt=""><figcaption></figcaption></figure>
{% endtab %}

{% tab title="Analyses rate per result" %}

<figure><img src="/files/vFH880mQ7dT5iBij3Ul0" alt=""><figcaption></figcaption></figure>
{% endtab %}

{% tab title="Analyses duration" %}

<figure><img src="/files/4iK8VRfEvAfaqEBRAyJH" alt=""><figcaption><p>Analysis duration: average value per analysis type and quantile per analysis type. Gaps in the graph appear because the metric counter starts separately for each analysis</p></figcaption></figure>
{% endtab %}

{% tab title="Django exceptions rate" %}

<figure><img src="/files/EAz1zAz7CUaQfFHDPzjI" alt=""><figcaption></figcaption></figure>
{% endtab %}
{% endtabs %}
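
The histogram series in the table combine in the usual Prometheus way: the rate of `_sum` divided by the rate of `_count` gives an average duration, and the share of `FAILED` results in `oz_api_analyses_total` gives a failure ratio. The recording rules below are a sketch of both queries. The group name, the rule names, and the 5-minute window are assumptions chosen for illustration.

```yaml
groups:
  - name: API recording rules (illustrative)
    rules:
      # Average analysis duration in seconds, per analysis type
      - record: oz_api:analyse_duration_seconds:avg5m
        expr: |
          sum by (analysis_type) (rate(oz_api_analyse_duration_seconds_sum[5m]))
          /
          sum by (analysis_type) (rate(oz_api_analyse_duration_seconds_count[5m]))
      # Share of analyses that finished with the FAILED status (0..1)
      - record: oz_api:analyses_failed:ratio5m
        expr: |
          sum(rate(oz_api_analyses_total{analysis_result="FAILED"}[5m]))
          /
          sum(rate(oz_api_analyses_total[5m]))
```

Both expressions return no data while no analyses are being processed, because the denominator is then zero or absent.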

### Grafana Alerts

```yaml
groups:
  - name: API alerts
    rules:
      - alert: Absent ready API containers!
        expr: absent(kube_pod_container_status_ready{container="oz-api", namespace="api-prod"} == 1)
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Absent ready API containers!!!"
          description: "Critical: check API containers!!!"

      - alert: Absent API metrics!
        expr: absent(oz_api_versions_total)
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Absent API metrics!!!"
          description: "Critical: check API containers, API might be not working properly!!!"

      - alert: High frequency of API analysis requests
        expr: sum (rate(oz_api_analyses_total{namespace="api-prod"}[1m])) > 1
        labels:
          severity: warning
        annotations:
          summary: "Too many API analysis requests per second"
          description: "API analysis request rate is {{ $value }} rps."

      - alert: High failed API analysis rate
        expr: sum by (analysis_result)(rate(oz_api_analyses_total{namespace="api-prod", analysis_result="FAILED"}[1m])) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Failed API analyses detected"
          description: "API analysis failure rate is {{ $value }} rps in the last minute."

      - alert: High API analysis duration
        expr: histogram_quantile(0.95,sum by (le, analysis_type) (rate(oz_api_analyse_duration_seconds_bucket{namespace="api-prod"}[1m]))) > 7
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.analysis_type }} analyses are going slow"
          description: "The 95th percentile of {{ $labels.analysis_type }} analysis duration has stayed above 7 seconds for the last 10 minutes (current value: {{ $value }} seconds)."

      - alert: API exception errors
        expr: sum by (exception_type)(rate(django_http_exceptions_total_by_type_total{namespace="api-prod", exception_type!~"AuthCredentialsExpired|AuthCredentialsInvalid|AuthCredentialsIsNotProvided|AuthForbidden|InvalidBodyStructureError|InvalidParamValueErrorNotExistError|OSError|InvalidParamValueError|NotExistError"}[1m])) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.exception_type }} in API exception errors"
          description: "{{ $value }} rps in API returned {{ $labels.exception_type }} in the last 5 minutes."

      - alert: High API latency
        expr: (sum by(namespace) (rate(gunicorn_request_duration_sum{}[1m])) / on(namespace) sum by(namespace) (rate(gunicorn_request_duration_count{}[1m]))) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High API latency"
          description: 'API latency is {{ printf "%.2f" $value }} seconds in the last 5 minutes.'

      - alert: High API 5xx error rate!
        expr: sum by(namespace) (max by(status) (rate(gunicorn_response_code{status=~"5.*"}[1m]))) > 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "High 5XX API error rate"
          description: 'High 5xx error rate: {{ printf "%.2f" $value }} responses per second'
```

***

## TFSS

### Metrics Overview

| Metric                                          | Description                                           |
| ----------------------------------------------- | ----------------------------------------------------- |
| `:tensorflow:serving:request_count`             | Total number of requests to TFSS                      |
| `:tensorflow:serving:request_latency_bucket`    | Histogram of request processing time                  |
| `:tensorflow:serving:request_latency_sum`       | Sum of request processing durations                   |
| `:tensorflow:serving:request_latency_count`     | Total number of requests whose durations were counted |
| `:tensorflow:cc:saved_model:load_attempt_count` | Number of model load attempts                         |

Illustrative screenshots:

{% tabs %}
{% tab title="Model request rate" %}

<figure><img src="/files/GOVoM6BH26PqfXub2NFt" alt=""><figcaption><p>Total number of requests to TFSS per second and per minute, divided by models</p></figcaption></figure>
{% endtab %}

{% tab title="Model latency ($quantile-quantile)" %}

<figure><img src="/files/CUwWhVYN5CtEw4tbKagj" alt=""><figcaption><p>0.95 (or the one you've defined in $quantile) quantile of request processing duration in TFSS</p></figcaption></figure>
{% endtab %}

{% tab title="\[TFSS] Model latency (AVG)" %}

<figure><img src="/files/AqssgnCtHZBn6ae6YVhu" alt=""><figcaption><p>Average analysis processing time</p></figcaption></figure>
{% endtab %}

{% tab title="\[TFSS] HTTP probe status" %}

<figure><img src="/files/VKMkNOMEuD8w9ZkStJDr" alt=""><figcaption><p>The built-in blackbox exporter sends requests to check the state of the models. <code>probe_success</code> should be 1; any other value means TFSS is not working properly</p></figcaption></figure>
{% endtab %}
{% endtabs %}
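
The latency panels can be approximated with two queries over the metrics listed in the table: a quantile taken from the `_bucket` series, and an average taken from `_sum` and `_count`. The recording rules below are a sketch; the group name, rule names, and 5-minute window are assumptions. TensorFlow Serving is generally understood to report this latency in microseconds, so verify the unit against your own data before setting thresholds.

```yaml
groups:
  - name: TFSS latency queries (illustrative)
    rules:
      # 0.95 quantile of request latency across all models
      - record: tfss:request_latency:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (rate(:tensorflow:serving:request_latency_bucket[5m]))
          )
      # Average request latency across all models
      - record: tfss:request_latency:avg_5m
        expr: |
          sum(rate(:tensorflow:serving:request_latency_sum[5m]))
          /
          sum(rate(:tensorflow:serving:request_latency_count[5m]))
```

To split the result by model, add the model label exposed in your setup to the `sum by (...)` clause.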

### Grafana Alerts

```yaml
- name: TFSS alerts
  rules:
  # Blackbox alerts: probes check whether TFSS models are working correctly
    - alert: TFSS models probe service alert!
      expr: probe_success{job="blackbox-tfss-service"} != 1
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "TFSS model in namespace {{ $labels.namespace }} is unavailable!"
        description: "!!!ALERT!!! TFSS model or server doesn't work:\nMODEL:{{ $labels.model }}\nModel probe has been returning failed state for 3 min!"
    - alert: TFSS models probe pod alert!
      expr: probe_success{job="blackbox-tfss-models"} != 1
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "TFSS model in namespace {{ $labels.namespace }} is unavailable!"
        description: "TFSS in pod {{ $labels.pod }} doesn't work:\nMODEL:{{ $labels.model }}\nModel probe has been returning failed state for 3 min!"
    - alert: TFSS predict probe pod alert!
      expr: probe_success{job="blackbox-tfss-probe"} != 1
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "TFSS model predict in namespace {{ $labels.namespace }} is unavailable!"
        description: "TFSS in pod {{ $labels.pod }} doesn't work:\nMODEL:{{ $labels.model }}\nPredict probe has been returning failed state for 3 min!"
  # Critical: indicates the metric is absent, meaning TFSS is not processing requests
    - alert: TFSS empty request rate!
      expr: absent(:tensorflow:serving:request_count{namespace="api-prod"}) == 1
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "TFSS request rate is empty!!!"
        description: "Critical! Requests are not processed, check bio!!!"
```

***

## nginx

### Metrics Overview

| Metric                       | Description                                                            |
| ---------------------------- | ---------------------------------------------------------------------- |
| `nginx_up`                   | Shows whether nginx is running: 1 – service is up, 0 – service is down |
| `nginx_connections_accepted` | Number of connections accepted by nginx                                |
| `nginx_connections_handled`  | Number of connections handled by nginx                                 |
| `nginx_connections_active`   | Active nginx connections                                               |

Illustrative screenshots:

{% tabs %}
{% tab title="Request rate" %}

<figure><img src="/files/cF44OSz61s07VZXrT8b2" alt=""><figcaption><p>Total number of requests to nginx</p></figcaption></figure>
{% endtab %}

{% tab title="Active connections" %}

<figure><img src="/files/trhl6mA6YvMHa753Zyye" alt=""><figcaption><p>State of nginx connections. Check that there are no stale connections</p></figcaption></figure>
{% endtab %}

{% tab title="Processed connections rate" %}

<figure><img src="/files/MLR9PZ2A2fziMCMEXrYf" alt=""><figcaption><p>Processing success rate. The numbers of accepted and handled connections should be equal</p></figcaption></figure>
{% endtab %}
{% endtabs %}
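
The "accepted equals handled" check described above can be expressed as a difference of rates, which stays at zero while nginx handles every accepted connection. The recording rules below are a sketch; the group name, rule names, and 5-minute window are assumptions.

```yaml
groups:
  - name: nginx queries (illustrative)
    rules:
      # Connections accepted but not handled, per second; expected to stay at 0
      - record: nginx:connections_dropped:rate5m
        expr: rate(nginx_connections_accepted[5m]) - rate(nginx_connections_handled[5m])
      # New connections accepted per second
      - record: nginx:connections_accepted:rate5m
        expr: rate(nginx_connections_accepted[5m])
```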

### Grafana Alerts

```yaml
groups:
  - name: NGINX alerts
    rules:
      - alert: nginx is down
        expr: nginx_up != 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "nginx has been down for more than 30 seconds!"
          description: "Critical: nginx service is down on the host {{ $labels.instance }}!"

      - alert: nginx not all connections are handled
        expr: rate (nginx_connections_handled[5m]) / rate (nginx_connections_accepted[5m]) <1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "nginx has issues with handling connections"
          description: "Critical: not all accepted connections have been handled by nginx on the host {{ $labels.instance }} for more than 2 minutes!"

      - alert: High number of active connections in nginx
        expr: nginx_connections_active > 3000
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "High number of active connections in nginx"
          description: "Critical: nginx has had too many active connections on the host {{ $labels.instance }} for more than 3 minutes!"
```

***

API 6 no longer requires Celery or Redis, so the corresponding metrics are not covered in this guide.

You can customize the alerts according to your needs. Please proceed to [our repository](https://gitlab.com/oz-forensics/public/oz-engineering-monitoring-configs/-/tree/helm-0.10.20/alerts?ref_type=heads) to find the alert files.
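
If the rules are evaluated by a Prometheus instance managed by the Prometheus Operator, alert groups such as those in this guide are typically wrapped in a `PrometheusRule` resource. The manifest below is a sketch under that assumption. The resource name and the `release: prometheus` label are hypothetical, and the label must match the rule selector of your own Prometheus installation.

```yaml
# Sketch only: metadata values are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: oz-nginx-alerts       # hypothetical name
  namespace: api-prod
  labels:
    release: prometheus       # hypothetical; must match your Prometheus ruleSelector
spec:
  groups:
    - name: NGINX alerts
      rules:
        - alert: nginx is down
          expr: nginx_up != 1
          for: 30s
          labels:
            severity: critical
          annotations:
            summary: "nginx has been down for more than 30 seconds!"
```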

