These instructions and dashboard are for Helm chart 0.10.20 and API 5.1.
For monitoring, we use Prometheus and Grafana.
Step 1. Install and Configure Prometheus
Configure Prometheus
If necessary, install Prometheus into your cluster by following the instructions in the Prometheus GitHub repository.
Our charts already contain a custom resource for Prometheus: serviceMonitor. By default, serviceMonitor is disabled; to enable it, set enable: true under prometheus_exporters → serviceMonitor, as shown below.
prometheus_exporters:
  # Global parameter. If true, add Prometheus exporters for apps
  # with the same parameter also set to true in the app section
  enable: true
  # Global parameter. If true, expose metrics for apps on ingress.
  # Check the same parameters in the apps' configs
  addToIngress: false
  # Base path for metrics, e.g., https://host.local/metricsBasePath/metricsEndPath
  metricsBasePath: /metrics/
  auth:
    # Set HTTP Basic auth for metrics in ingress. Usable only if 'addToIngress' is true
    enable: true
    username: flower_user
    password: flowerpass
  # Enable serviceMonitor (Service Discovery for Prometheus)
  serviceMonitor:
    enable: true
    interval: 15s
To make sure that the Prometheus Operator picks up these resources, add the corresponding namespace or the serviceMonitor itself to the spec section of your Prometheus custom resource, as shown below:
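The exact selectors depend on how your Prometheus instance is managed. A minimal sketch of the relevant part of the Prometheus custom resource, assuming the Oz API chart is installed into a namespace labeled monitoring: enabled (the label and selector values below are examples, not chart defaults):

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  # Discover ServiceMonitor resources in namespaces that carry this label
  serviceMonitorNamespaceSelector:
    matchLabels:
      monitoring: enabled
  # ...and/or select the ServiceMonitor resources themselves by their labels;
  # an empty selector ({}) matches all ServiceMonitors in the selected namespaces
  serviceMonitorSelector:
    matchLabels:
      release: prometheus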
If everything is correct, and serviceMonitor has been added to the cluster, you'll see the corresponding custom resource in Custom Resources → monitoring.coreos.com → Service Monitor.
Make sure that Service Monitor contains the resources shown in the screenshot below:
You can also check these resources in Prometheus itself. Proceed to Status and select Targets in the drop-down menu.
Step 2. Import the Dashboard into Grafana
Proceed to our repository and find the branch that matches your chart version.
Download Oz_dashboard_client.json.
Open Grafana and, in the Home menu, select Dashboards.
Click New and choose Import from the drop-down menu.
Select Upload dashboard JSON file and locate the Oz_dashboard_client.json file you've downloaded. Changing the file name or directory is optional.
Add the Prometheus data source to obtain metrics (a data source provisioning sketch follows these steps).
Click Import and save the dashboard.
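If you prefer to add the data source declaratively instead of through the UI, Grafana supports data source provisioning. A minimal sketch, assuming Prometheus is reachable in the cluster at http://prometheus-server.monitoring.svc:9090 (the URL and data source name are examples; adjust them to your installation):

apiVersion: 1
datasources:
  # Prometheus data source the Oz dashboard will read metrics from
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-server.monitoring.svc:9090
    isDefault: true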
Variables
namespace is the namespace label taken from the :tensorflow:cc:saved_model:load_latency{clustername="$cluster"} metric;
quantile is a quantile value for tables that require it. Possible values: 0.95, 0.5, 0.90, 0.99, 1.
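For reference, such variables are usually populated with label_values queries against the Prometheus data source. The imported dashboard already defines its own variables, so the following is only an illustration of how the namespace variable can be built:

label_values(:tensorflow:cc:saved_model:load_latency{clustername="$cluster"}, namespace)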
Step 3. Start Monitoring
Below, you'll find the description of component metrics and tables, along with pre-configured alerts.
Celery
Metrics Overview
Metric | Description
flower_events_total | Total number of tasks
flower_task_runtime_seconds_sum | Sum of task completion durations
histogram_quantile($quantile, sum(rate(flower_task_runtime_seconds_bucket[1m])) by (le)) | Quantile for task execution, based on the $quantile variable value
flower_events_total{type="task-failed"} | With the type="task-failed" label, displays the number of failed tasks
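For illustration, these metrics combine into the panel queries used on the dashboard. A sketch of an average task duration query (it assumes the flower_task_runtime_seconds_count series of the same histogram, which is not listed above; the imported dashboard may use different windows or grouping):

# average Celery task duration over the last 5 minutes, in seconds
sum(rate(flower_task_runtime_seconds_sum[5m]))
  /
sum(rate(flower_task_runtime_seconds_count[5m]))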
Grafana Tables
Table | Description | What to monitor
Liveness task rate | Average time of all Celery requests for Liveness | task-received should be roughly equal to task-succeeded; there shouldn't be any task-failed
Liveness task duration | Quantile for task execution | The 0.95 quantile should be 8 seconds or less
Succeeded vs failed tasks rate | Total number of Celery requests | task-received should be roughly equal to task-succeeded; there shouldn't be any task-failed
All tasks duration (AVG) | Average time of all Celery requests for all models | The durations should be 6 seconds or less
Queue size | Message queue in Redis | Queue size and the number of unacked messages
Illustrative screenshots:
Grafana Alerts
groups:
  - name: Celery alerts
    rules:
      # An alert for the 0.9 quantile of task duration; warns that analyses are running slowly
      - alert: Quality Analysis is slow!
        expr: histogram_quantile(0.9, sum without (handler) (rate(flower_task_runtime_seconds_bucket{task="oz_core.tasks.tfss.process_analyse_quality"}[5m]))) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Too long duration of celery worker for TFSS task"
          description: "The duration of 90% Quality analyses is longer than {{ $value }} seconds for last 10 minutes. NAMESPACE: {{ $labels.namespace }} POD: {{ $labels.pod }} TASK: {{ $labels.task }} WORKER: {{ $labels.worker }}"
      # An alert for failed tasks; if the number is growing, something is wrong
      - alert: Celery tasks failed!
        expr: sum by (type, task) (rate(flower_events_total{type="task-failed", task!=""}[1m])) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Celery tasks {{ $labels.task }} failed"
          description: 'Failed celery tasks rate: {{ printf "%.2f" $value }} rps'
      # A critical alert that warns that all tasks are failing; it means the system has stopped processing requests
      - alert: Celery zero success tasks!
        expr: sum by (type, task) (rate(flower_events_total{type="task-succeeded", task!=""}[1m])) == 0 and on (task) sum by (type, task) (rate(flower_events_total{type="task-received", task!=""}[1m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Celery has zero success tasks: {{ $labels.task }}!"
          description: "Critical! Check if bio is alive!"
Redis
Metrics Overview
Metric | Description
redis_up | 1 means Redis is working, 0 means the service is down
redis_commands_total | Total number of commands processed by Redis
redis_commands_duration_seconds_total | Total duration of command processing in Redis
redis_key_size | Redis (as a message broker) queue size
redis_key_size{key="unacked"} | With the key="unacked" label, displays the number of tasks that are currently being processed by Redis
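For reference, the average command execution time shown in the Commands duration table below can be approximated from the two counters above. A sketch only; the dashboard's own expression may differ, for example by grouping by the exporter's cmd label:

# average Redis command duration over the last minute, in seconds
sum(rate(redis_commands_duration_seconds_total[1m]))
  /
sum(rate(redis_commands_total[1m]))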
Grafana Tables
Table | Description | What to monitor
Command rate | Number of requests per second | Nothing, just to stay informed
Commands duration | Average and maximum command execution duration | AVG < 15 µs, MAX < 1 ms
Connected clients | Number of connected clients | Shouldn't be 0
Illustrative screenshots:
Grafana Alerts
groups:
  - name: Redis alerts
    rules:
      # A critical alert that warns about Redis being down
      - alert: Redis is down
        expr: redis_up != 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Redis is down for more than 30 seconds!"
          description: "Critical: REDIS service is down in namespace: {{ $labels.namespace }}\nPod: {{ $labels.pod }}!"
      # Fires if Redis rejects connections
      - alert: Redis rejected connections
        expr: rate(redis_rejected_connections_total[1m]) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Redis rejects connections for more than 1 minute in namespace: {{ $labels.namespace }}!"
          description: "Some connections to Redis have been rejected!\nPod: {{ $labels.pod }}\nValue = {{ $value }}"
      # Fires if Redis commands are being executed too slowly
      - alert: Redis command duration is slow!
        expr: max by (namespace) (rate(redis_commands_duration_seconds_total[1m])) > 0.0004
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Redis max command duration is too high for more than 1 minute in namespace: {{ $labels.namespace }}!"
          description: "Maximum command duration is longer than average!\nValue = {{ $value }} seconds"
      # Warns that the Redis queue is too long
      - alert: Redis queue length
        expr: sum by (instance) (redis_key_size) > 50
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Redis queue size is too large!"
          description: "Warning: Redis queue size : {{ $value }} for the last 1 min!"
      # Warns that there are more than 10 processing (unacked) messages in the Redis queue
      - alert: Redis unacked messages
        expr: sum by (key) (redis_key_size{key="unacked"}) > 10
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Redis has unacked messages!"
          description: "Warning: Redis has {{ $value }} unacked messages!"
TFSS
Metrics Overview
Metric | Description
:tensorflow:serving:request_count | Total number of requests to TFSS
:tensorflow:serving:request_latency_bucket | Histogram of request processing time
:tensorflow:serving:request_latency_sum | Sum of processing durations of all requests
:tensorflow:serving:request_latency_count | Total number of processed requests
:tensorflow:cc:saved_model:load_attempt_count | Models loaded into TFSS
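For reference, the Model latency (AVG) table below derives the average processing time from the sum and count series above. A sketch of such a query (TFSS usually reports latency in microseconds, so check the unit for your version; the imported dashboard's own expression may differ):

# average TFSS request latency over the last 5 minutes
rate(:tensorflow:serving:request_latency_sum[5m])
  /
rate(:tensorflow:serving:request_latency_count[5m])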
Grafana Tables
Table | Description | What to monitor
Model request rate | Number of requests to TFSS per second and per minute | Nothing, just to stay informed
Model latency ($quantile-quantile) | 0.95 quantile of the TFSS request processing time; you can set the quantile value via the quantile variable | Nothing, just to stay informed
Model latency (AVG) | Average request processing time | Nothing, just to stay informed
HTTP probe success | The result of the built-in Blackbox check; Blackbox sends requests to verify that TFSS works properly | Should be 1
Illustrative screenshots:
Grafana Alerts
groups:
  - name: TFSS alerts
    rules:
      # Critical alerts that warn about Blackbox detecting incorrect model behavior
      - alert: TFSS models probe service alert!
        expr: probe_success{job="blackbox-tfss-service"} != 1
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "TFSS model in namespace {{ $labels.namespace }} is unavailable!"
          description: "!!!ALERT!!! TFSS model or server doesn't work:\nMODEL: {{ $labels.model }}\nModel probe has been returning failed state for 3 min!"
      - alert: TFSS models probe pod alert!
        expr: probe_success{job="blackbox-tfss-models"} != 1
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "TFSS model in namespace {{ $labels.namespace }} is unavailable!"
          description: "TFSS in pod {{ $labels.pod }} doesn't work:\nMODEL: {{ $labels.model }}\nModel probe has been returning failed state for 3 min!"
      - alert: TFSS predict probe pod alert!
        expr: probe_success{job="blackbox-tfss-probe"} != 1
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "TFSS model predict in namespace {{ $labels.namespace }} is unavailable!"
          description: "TFSS in pod {{ $labels.pod }} doesn't work:\nMODEL: {{ $labels.model }}\nPredict probe has been returning failed state for 3 min!"
      # Critical alert that warns about TFSS not processing requests
      - alert: TFSS empty request rate!
        expr: absent(:tensorflow:serving:request_count{namespace="api-prod"}) == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "TFSS request rate is empty!!!"
          description: "Critical! Requests are not processed, check bio!!!"
Nginx
Metrics Overview
Metric | Description
nginx_up | 1 means Nginx is working, 0 means the service is down
nginx_connections_accepted | Number of connections accepted by Nginx
nginx_connections_handled | Number of connections handled by Nginx
nginx_connections_active | Number of active Nginx connections
Grafana Tables
Table | Description | What to monitor
Request rate | Total number of requests to Nginx | Nothing, just to stay informed
Active connections | Connection states | There shouldn't be any pending connections
Processed connections rate | Processing success rate | The numbers of accepted and handled connections should be equal
Illustrative screenshots:
Grafana Alerts
groups:
  - name: NGINX alerts
    rules:
      # A critical alert that displays that Nginx hasn't handled some of the accepted connections
      - alert: Nginx not all connections are handled
        expr: rate(nginx_connections_handled[5m]) / rate(nginx_connections_accepted[5m]) < 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Nginx issue with handling connections"
          description: "Critical: Nginx doesn't handle some accepted connections on the host {{ $labels.instance }} for more than 3 minutes!"
Additionally, there is an alert group for the API itself that fires when no API containers are ready:
groups:
  - name: API alerts
    rules:
      # A critical alert that warns that there are no ready API containers, so requests are not being processed
      - alert: Absent ready api containers!
        expr: absent(kube_pod_container_status_ready{container="oz-api", namespace="api-prod"} == 1)
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Absent ready api containers!!!"
          description: "Critical! Check api containers!!!"
You can customize the alerts according to your needs. Please proceed to our repository to find the alert files.
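Because service discovery in this setup relies on the Prometheus Operator, one convenient way to deploy the alert files (or your customized copies) is a PrometheusRule custom resource. A minimal sketch, assuming your Prometheus instance selects rules by the release: prometheus label (the metadata name, namespace, and labels below are examples):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: oz-api-alerts
  namespace: api-prod
  labels:
    release: prometheus
spec:
  groups:
    # Paste any of the rule groups from this page here
    - name: API alerts
      rules:
        - alert: Absent ready api containers!
          expr: absent(kube_pod_container_status_ready{container="oz-api", namespace="api-prod"} == 1)
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Absent ready api containers!!!"
            description: "Critical! Check api containers!!!"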