# Kubernetes

If you use our Helm charts, ServiceMonitor handles Prometheus configuration automatically.

## General Information

### Exporter Ports

| Exporter     | Endpoint                   |
| ------------ | -------------------------- |
| API          | api\_pod:8000/metrics      |
| statsd       | api\_pod:9102/metrics/app  |
| TFSS         | bio\_pod:8501/metrics/tfss |
| Bio blackbox | bio\_pod:9115              |
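
If you deploy without the Helm charts, the API endpoints from the table can be mapped to scrape targets with a ServiceMonitor for the Prometheus Operator. The sketch below only illustrates that mapping. The resource name, the `app: oz-api` selector label, the port names `http` and `statsd`, and the scrape interval are assumptions, so check them against your own Service definitions.

```yaml
# Hypothetical ServiceMonitor for the API pod exporters.
# Names, labels, and port names are illustrative assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: oz-api-metrics          # assumed name
  namespace: api-prod           # namespace used in the alert examples of this guide
spec:
  selector:
    matchLabels:
      app: oz-api               # assumed Service label
  namespaceSelector:
    matchNames:
      - api-prod
  endpoints:
    - port: http                # assumed name of the Service port mapped to 8000
      path: /metrics
      interval: 30s
    - port: statsd              # assumed name of the Service port mapped to 9102
      path: /metrics/app
      interval: 30s
```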

### Healthcheck Paths

**API**

`GET http://api:8000/api/version`

`GET http://api:8000/api/healthcheck`
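
In a pod spec, the healthcheck path can back the container probes. This is a minimal sketch for deployments that do not use the Helm charts, assuming the API container listens on port 8000. The timing values are placeholders, not recommended settings.

```yaml
# Fragment of a container spec; all timings are illustrative assumptions.
readinessProbe:
  httpGet:
    path: /api/healthcheck
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /api/healthcheck
    port: 8000
  periodSeconds: 30
  failureThreshold: 3
```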

**BIO**

`GET http://bio:8501/v1/models/{{model_name}}`

`POST http://bio:8501/v1/models/dummy:predict`

```yaml
method: POST
body: '{"inputs": {"images_bytes": [{"b64": "aaa"}]}}'
```
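
The predict check above is a POST request with a JSON body, which fits an HTTP module of the Prometheus blackbox exporter. The sketch below is an assumption about how such a module might be written, not the configuration shipped with the product. The module name `tfss_predict` and the timeout are hypothetical.

```yaml
# blackbox exporter configuration fragment (module name is hypothetical)
modules:
  tfss_predict:
    prober: http
    timeout: 10s                # assumed value
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{"inputs": {"images_bytes": [{"b64": "aaa"}]}}'
      # valid_status_codes is omitted, so any 2xx response counts as success
```

A probe that uses this module would request `http://bio:8501/v1/models/dummy:predict` and report the result in `probe_success`.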

***

## API

### Metrics Overview

| Metric                                                                                                            | Description                                                                                                      | Type      | Labels                                                               |
| ----------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------- | --------- | -------------------------------------------------------------------- |
| `oz_api_versions_total` (also available as `oz_api_versions_created`, `oz_api_info_created`, `oz_api_info_total`) | Displays API version info when you call `/api/version`                                                           | counter   | `lamb`                                                               |
|                                                                                                                   |                                                                                                                  |           | `core` – core version                                                |
|                                                                                                                   |                                                                                                                  |           | `oz_api` – API version                                               |
| `oz_api_analyses_total`                                                                                           | Number of processed analyses                                                                                     | counter   | `analysis_result` – analysis status (`FINISHED`, `FAILED`)           |
|                                                                                                                   |                                                                                                                  |           | `analysis_type` – analysis type (`QUALITY`, `BIOMETRY`, `DOCUMENTS`) |
|                                                                                                                   |                                                                                                                  |           | `company_id` – company identifier                                    |
|                                                                                                                   |                                                                                                                  |           | `resolution_status` – analysis resolution (`SUCCESS`, `DECLINED`)    |
| `oz_api_analyses_created`                                                                                         | Number of analyses that have been initiated                                                                      | gauge     | `analysis_result` – analysis status (`FINISHED`, `FAILED`)           |
|                                                                                                                   |                                                                                                                  |           | `analysis_type` – analysis type (`QUALITY`, `BIOMETRY`, `DOCUMENTS`) |
|                                                                                                                   |                                                                                                                  |           | `company_id` – company identifier                                    |
|                                                                                                                   |                                                                                                                  |           | `resolution_status` – analysis resolution (`SUCCESS`, `DECLINED`)    |
| `oz_api_analyses_in_progress`                                                                                     | Number of analyses currently processing                                                                          | gauge     | `analysis_type` – analysis type (`QUALITY`, `BIOMETRY`, `DOCUMENTS`) |
|                                                                                                                   |                                                                                                                  |           | `company_id` – company identifier                                    |
| `oz_api_analyse_duration_seconds_bucket`                                                                          | A histogram for analysis duration. Default buckets: `0.1,0.5,1,1.5,2,2.5,3,3.5,4,4.5,5,6,7,8,10,12,15,20,30,inf` | histogram | `analysis_result` – analysis status (`FINISHED`, `FAILED`)           |
|                                                                                                                   |                                                                                                                  |           | `analysis_type` – analysis type (`QUALITY`, `BIOMETRY`, `DOCUMENTS`) |
|                                                                                                                   |                                                                                                                  |           | `resolution_status` – analysis resolution (`SUCCESS`, `DECLINED`)    |
|                                                                                                                   |                                                                                                                  |           | `company_id` – company identifier                                    |
|                                                                                                                   |                                                                                                                  |           | `le` – bucket size                                                   |
| `oz_api_analyse_duration_seconds_count`                                                                           | Number of analyses whose durations were counted                                                                  | counter   | `analysis_result` – analysis status (`FINISHED`, `FAILED`)           |
|                                                                                                                   |                                                                                                                  |           | `analysis_type` – analysis type (`QUALITY`, `BIOMETRY`, `DOCUMENTS`) |
|                                                                                                                   |                                                                                                                  |           | `resolution_status` – analysis resolution (`SUCCESS`, `DECLINED`)    |
|                                                                                                                   |                                                                                                                  |           | `company_id` – company identifier                                    |
| `oz_api_analyse_duration_seconds_sum`                                                                             | Sum of analyses' durations                                                                                       | counter   | `analysis_result` – analysis status (`FINISHED`, `FAILED`)           |
|                                                                                                                   |                                                                                                                  |           | `analysis_type` – analysis type (`QUALITY`, `BIOMETRY`, `DOCUMENTS`) |
|                                                                                                                   |                                                                                                                  |           | `resolution_status` – analysis resolution (`SUCCESS`, `DECLINED`)    |
|                                                                                                                   |                                                                                                                  |           | `company_id` – company identifier                                    |
| `oz_api_analyse_duration_seconds_created`                                                                         | Timestamp of beginning of duration metric counting                                                               |           |                                                                      |
| `django_http_requests_total_by_transport_total`                                                                   | Number of requests by transport protocol                                                                         | counter   | `transport` – protocol (HTTP, HTTPS)                                 |
|                                                                                                                   |                                                                                                                  |           | `uid` – user identifier                                              |
| `django_http_responses_before_middlewares_total`                                                                  | Number of Django responses before running middleware                                                             | counter   | `uid` – user identifier                                              |
| `django_http_exceptions_total_by_type_total`                                                                      | Number of Django exceptions by type                                                                              | counter   | `exception_type` – type of exception                                 |
|                                                                                                                   |                                                                                                                  |           | `uid` – user identifier                                              |
| `django_http_requests_latency_seconds_by_view_method_bucket`                                                      | Histogram of request processing latency by views                                                                 | histogram | `method` – HTTP method                                               |
|                                                                                                                   |                                                                                                                  |           | `uid` – user identifier                                              |
|                                                                                                                   |                                                                                                                  |           | `view_name` – view name                                              |
|                                                                                                                   |                                                                                                                  |           | `le` – bucket size                                                   |
| `django_http_requests_latency_seconds_by_view_method_count`                                                       | Number of requests whose durations were counted, by views                                                        | counter   | `method` – HTTP method                                               |
|                                                                                                                   |                                                                                                                  |           | `uid` – user identifier                                              |
|                                                                                                                   |                                                                                                                  |           | `view_name` – view name                                              |
| `django_http_requests_latency_seconds_by_view_method_sum`                                                         | Sum of request processing durations by views                                                                     | counter   | `method` – HTTP method                                               |
|                                                                                                                   |                                                                                                                  |           | `uid` – user identifier                                              |
|                                                                                                                   |                                                                                                                  |           | `view_name` – view name                                              |

Illustrative screenshots:

{% tabs %}
{% tab title="Analyses rate per type" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2FAN6rEj92vAmHfBtBOxuP%2F%5BAPI%5D%20Analyses%20rate%20per%20type.png?alt=media&#x26;token=0632da81-952f-4da0-bd5a-baa6c89be51e" alt=""><figcaption></figcaption></figure>
{% endtab %}

{% tab title="Analyses rate per resolution" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2Fw1bQXL7ySnVfXEWw7Rxe%2F%5BAPI%5D%20Analyses%20rate%20per%20resolution.png?alt=media&#x26;token=ce2c6b73-4a4f-4e31-871d-ace69fa7198d" alt=""><figcaption></figcaption></figure>
{% endtab %}

{% tab title="Analyses rate per result" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2FBYPgqLhQ9a1GXwgGHcnc%2F%5BAPI%5D%20Analyses%20rate%20per%20result.png?alt=media&#x26;token=0ba15213-6a2b-4b78-9b1d-6f68e9f48332" alt=""><figcaption></figcaption></figure>
{% endtab %}

{% tab title="Analyses duration" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2F7KgkjsbonMykczhCyBJI%2F%5BAPI%5D%20Analyses%20duration.png?alt=media&#x26;token=9d071d88-0b76-471b-94be-3ac11dc4fb78" alt=""><figcaption><p>Analysis duration shown as the average per analysis type and as a quantile per analysis type. Gaps in the graph appear because the metric counter starts separately for each analysis</p></figcaption></figure>
{% endtab %}

{% tab title="Django exceptions rate" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2F4UCDWBRbwkeNzprATx6y%2F%5BAPI%5D%20Django%20exceptions%20rate.png?alt=media&#x26;token=4f78b919-1e71-4e47-81d4-4ba307171db4" alt=""><figcaption></figcaption></figure>
{% endtab %}
{% endtabs %}
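
Panels like these are typically built from `rate()` over the counters and from the `_sum` and `_count` series of the histogram. The recording rules below sketch such queries. They are not the dashboard's actual definitions, and the rule names and the 5-minute window are assumptions.

```yaml
groups:
  - name: API recording rules (illustrative)
    rules:
      # Analyses per second, split by analysis type
      - record: oz_api:analyses:rate5m_by_type
        expr: sum by (analysis_type) (rate(oz_api_analyses_total[5m]))
      # Average analysis duration in seconds, per analysis type
      - record: oz_api:analyse_duration_seconds:avg5m
        expr: |
          sum by (analysis_type) (rate(oz_api_analyse_duration_seconds_sum[5m]))
          /
          sum by (analysis_type) (rate(oz_api_analyse_duration_seconds_count[5m]))
      # Django exceptions per second, per exception type
      - record: django:http_exceptions:rate5m_by_type
        expr: sum by (exception_type) (rate(django_http_exceptions_total_by_type_total[5m]))
```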

### Grafana Alerts

```yaml
groups:
  - name: API alerts
    rules:
      - alert: Absent ready API containers!
        expr: absent(kube_pod_container_status_ready{container="oz-api", namespace="api-prod"} == 1)
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Absent ready API containers!!!"
          description: "Critical: check API containers!!!"

      - alert: Absent API metrics!
        expr: absent(oz_api_versions_total)
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Absent API metrics!!!"
          description: "Critical: check API containers, API might be not working properly!!!"

      - alert: High frequency of API analysis requests
        expr: sum (rate(oz_api_analyses_total{namespace="api-prod"}[1m])) > 1
        labels:
          severity: warning
        annotations:
          summary: "Too many API analysis requests per second"
          description: "API analysis request rate is {{ $value }} rps."

      - alert: High failed API analysis rate
        expr: sum by (analysis_result)(rate(oz_api_analyses_total{namespace="api-prod", analysis_result="FAILED"}[1m])) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Failed API analyses detected"
          description: "API analysis failure rate is {{ $value }} rps in the last minute."

      - alert: High API analysis duration
        expr: histogram_quantile(0.95,sum by (le, analysis_type) (rate(oz_api_analyse_duration_seconds_bucket{namespace="api-prod"}[1m]))) > 7
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.analysis_type }} analyses are going slow"
          description: "The 95th percentile duration of {{ $labels.analysis_type }} analyses is {{ $value }} seconds and has exceeded 7 seconds for the last 10 minutes."

      - alert: API exception errors
        expr: sum by (exception_type)(rate(django_http_exceptions_total_by_type_total{namespace="api-prod", exception_type!~"AuthCredentialsExpired|AuthCredentialsInvalid|AuthCredentialsIsNotProvided|AuthForbidden|InvalidBodyStructureError|InvalidParamValueErrorNotExistError|OSError|InvalidParamValueError|NotExistError"}[1m])) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.exception_type }} in API exception errors"
          description: "{{ $value }} rps in API returned {{ $labels.exception_type }} in the last 5 minutes."

      - alert: High API latency
        expr: (sum by(namespace) (rate(gunicorn_request_duration_sum{}[1m])) / on(namespace) sum by(namespace) (rate(gunicorn_request_duration_count{}[1m]))) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High API latency"
          description: 'API latency is {{ printf "%.2f" $value }} seconds in the last 5 minutes.'

      - alert: High API 5xx error rate!
        expr: sum by(namespace) (max by(status) (rate(gunicorn_response_code{status=~"5.*"}[1m]))) > 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "High 5XX API error rate"
          description: 'High 5xx error rate: {{ printf "%.2f" $value }} responses per second'
```

***

## TFSS

### Metrics Overview

| Metric                                          | Description                                           |
| ----------------------------------------------- | ----------------------------------------------------- |
| `:tensorflow:serving:request_count`             | Total number of requests to TFSS                      |
| `:tensorflow:serving:request_latency_bucket`    | Histogram of request processing time                  |
| `:tensorflow:serving:request_latency_sum`       | Sum of request processing durations                   |
| `:tensorflow:serving:request_latency_count`     | Total number of requests whose durations were counted |
| `:tensorflow:cc:saved_model:load_attempt_count` | Number of model load attempts                         |

Illustrative screenshots:

{% tabs %}
{% tab title="Model request rate" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2Fgn7ySbPxN7NPgvW3h8wx%2F%5BTFSS%5D%20Model%20request%20rate.png?alt=media&#x26;token=0c0a34c6-fb9e-4c5b-817d-68375833c739" alt=""><figcaption><p>Total number of requests to TFSS per second and per minute, divided by models</p></figcaption></figure>
{% endtab %}

{% tab title="Model latency ($quantile-quantile)" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2FAPbU9bkDZwyXpbdzGX1v%2F%5BTFSS%5D%20Model%20latency%20(%24quantile-quantile).png?alt=media&#x26;token=d919a8a7-d87f-43c3-ac27-7a0df8656b57" alt=""><figcaption><p>0.95 (or the one you've defined in $quantile) quantile of request processing duration in TFSS</p></figcaption></figure>
{% endtab %}

{% tab title="Model latency (AVG)" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2F5005ceJCuJuoC9UOwELK%2F%5BTFSS%5D%20Model%20latency%20(AVG).png?alt=media&#x26;token=fecbaef4-cf05-44f1-aea4-a53cc89fdcd9" alt=""><figcaption><p>Average analysis processing time</p></figcaption></figure>
{% endtab %}

{% tab title="HTTP probe status" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2FmWwGyBdYJa5Q43irFYUN%2F%5BTFSS%5D%20HTTP%20probe%20status.png?alt=media&#x26;token=97bb1127-b6e4-4079-a083-3fe9c764a579" alt=""><figcaption><p>The built-in blackbox exporter sends requests to check the state of the models. <code>probe_success</code> should be 1; any other value means TFSS is not working properly</p></figcaption></figure>
{% endtab %}
{% endtabs %}
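
Queries for panels of this kind can be sketched as recording rules. The example assumes that TFSS exposes a `model_name` label on these series; verify that against the output of `/metrics/tfss` before relying on it. The rule names and the 5-minute window are also assumptions, and latency values are reported in whatever unit TFSS uses for this metric.

```yaml
groups:
  - name: TFSS recording rules (illustrative)
    rules:
      # Requests per second, per model (model_name label is an assumption)
      - record: tfss:request_count:rate5m
        expr: sum by (model_name) (rate(:tensorflow:serving:request_count[5m]))
      # Average request latency per model, in the metric's native unit
      - record: tfss:request_latency:avg5m
        expr: |
          sum by (model_name) (rate(:tensorflow:serving:request_latency_sum[5m]))
          /
          sum by (model_name) (rate(:tensorflow:serving:request_latency_count[5m]))
      # 0.95 quantile of request latency per model
      - record: tfss:request_latency:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum by (le, model_name) (rate(:tensorflow:serving:request_latency_bucket[5m])))
```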

### Grafana Alerts

```yaml
groups:
- name: TFSS alerts
  rules:
  # Blackbox alerts: probes check whether TFSS models are working correctly
    - alert: TFSS models probe service alert!
      expr: probe_success{job="blackbox-tfss-service"} != 1
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "TFSS model in namespace {{ $labels.namespace }} is unavailable!"
        description: "!!!ALERT!!! TFSS model or server doesn't work:\nMODEL:{{ $labels.model }}\nModel probe has been returning failed state for 3 min!"
    - alert: TFSS models probe pod alert!
      expr: probe_success{job="blackbox-tfss-models"} != 1
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "TFSS model in namespace {{ $labels.namespace }} is unavailable!"
        description: "TFSS in pod {{ $labels.pod }} doesn't work:\nMODEL:{{ $labels.model }}\nModel probe has been returning failed state for 3 min!"
    - alert: TFSS predict probe pod alert!
      expr: probe_success{job="blackbox-tfss-probe"} != 1
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "TFSS model predict in namespace {{ $labels.namespace }} is unavailable!"
        description: "TFSS in pod {{ $labels.pod }} doesn't work:\nMODEL:{{ $labels.model }}\nPredict probe has been returning failed state for 3 min!"
  # Critical: indicates the metric is absent, meaning TFSS is not processing requests
    - alert: TFSS empty request rate!
      expr: absent(:tensorflow:serving:request_count{namespace="api-prod"}) == 1
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "TFSS request rate is empty!!!"
        description: "Critical! Requests are not processed, check bio!!!"
```

***

## Nginx

### Metrics Overview

| Metric                       | Description                                                            |
| ---------------------------- | ---------------------------------------------------------------------- |
| `nginx_up`                   | Shows whether Nginx is running: 1 – service is up, 0 – service is down |
| `nginx_connections_accepted` | Number of connections accepted by Nginx                                |
| `nginx_connections_handled`  | Number of connections handled by Nginx                                 |
| `nginx_connections_active`   | Active Nginx connections                                               |

Illustrative screenshots:

{% tabs %}
{% tab title="Request rate" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2FdhMoRHQH5uRFyLU7lVV3%2F%5BNginx%5D%20Request%20rate.png?alt=media&#x26;token=30039c79-8245-4a81-b228-ce407cb5cfba" alt=""><figcaption><p>Total number of requests to Nginx</p></figcaption></figure>
{% endtab %}

{% tab title="Active connections" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2FppKgl1Z0DvzGkONAExEs%2F%5BNginx%5D%20Active%20connections.png?alt=media&#x26;token=c0e32f68-0b25-4b33-9ae3-cdbe079890f0" alt=""><figcaption><p>State of Nginx connections. Check that there are no stale connections</p></figcaption></figure>
{% endtab %}

{% tab title="Processed connections rate" %}

<figure><img src="https://2532558063-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F5g6dgsxRbyrCvB0uAf8f%2Fuploads%2F286WYPHQikNtBg1aeMu6%2F%5BNginx%5D%20Processed%20connections%20rate.png?alt=media&#x26;token=bd92645f-7e5c-4a68-bfc2-c1e542d965c9" alt=""><figcaption><p>Processing success rate. The numbers of accepted and handled connections should be equal</p></figcaption></figure>
{% endtab %}
{% endtabs %}
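
The connection panels can be approximated with two rate-based queries, shown here as recording rules. This is a sketch rather than the dashboard's actual definitions; the rule names and the 5-minute window are assumptions.

```yaml
groups:
  - name: NGINX recording rules (illustrative)
    rules:
      # Share of accepted connections that were handled; expected to stay at 1
      - record: nginx:connections:handled_ratio5m
        expr: rate(nginx_connections_handled[5m]) / rate(nginx_connections_accepted[5m])
      # New connections accepted per second
      - record: nginx:connections_accepted:rate5m
        expr: rate(nginx_connections_accepted[5m])
```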

### Grafana Alerts

```yaml
groups:
  - name: NGINX alerts
    rules:
      - alert: Nginx is down
        expr: nginx_up != 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Nginx has been down for more than 30 seconds!"
          description: "Critical: Nginx service is down on the host {{ $labels.instance }}!"

      - alert: Nginx not all connections are handled
        expr: rate(nginx_connections_handled[5m]) / rate(nginx_connections_accepted[5m]) < 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Nginx has issues with handling connections"
          description: "Critical: not all accepted connections have been handled by Nginx on the host {{ $labels.instance }} for more than 2 minutes!"

      - alert: High number of active connections in Nginx
        expr: nginx_connections_active > 3000
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "High number of active connections in Nginx"
          description: "Critical: Nginx has had too many active connections on the host {{ $labels.instance }} for more than 3 minutes!"
```

***

API 6 no longer requires Celery or Redis, so the corresponding metrics are not covered in this guide.

You can customize the alerts according to your needs. The alert files are available in [our repository](https://gitlab.com/oz-forensics/public/oz-engineering-monitoring-configs/-/tree/helm-0.10.20/alerts?ref_type=heads).
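
As an illustration of such customization, the sketch below wraps one alert from this guide in a `PrometheusRule` resource for the Prometheus Operator and raises its threshold. The resource name, the `my-namespace` namespace, the `release: prometheus` label, and the 5000 threshold are hypothetical values chosen for the example.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: oz-nginx-custom-alerts    # hypothetical name
  namespace: my-namespace         # hypothetical namespace
  labels:
    release: prometheus           # assumed label that your Prometheus selects rules by
spec:
  groups:
    - name: NGINX alerts
      rules:
        - alert: High number of active connections in Nginx
          expr: nginx_connections_active > 5000   # raised from 3000 as an example
          for: 3m
          labels:
            severity: critical
          annotations:
            summary: "High number of active connections in Nginx"
```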
