Monitoring

These instructions and dashboard are for Helm chart 0.10.20 and API 5.1.

For monitoring, we use Prometheus and Grafana.

Step 1. Install and Configure Prometheus

Configure Prometheus

If necessary, install Prometheus into your cluster by following the instructions in its GitHub repository.

Our charts already contain a custom resource for Prometheus: serviceMonitor. By default, serviceMonitor is disabled; to enable it, set enable: true in prometheus_exporters → serviceMonitor.

prometheus_exporters:
  # Global parameter. If true, Prometheus exporters are added for apps
  # that have the same parameter set to true in their app section
  enable: true
  # Global parameter. If true, metrics for apps are exposed on ingress.
  # Check the same parameters in the apps' configs
  addToIngress: false
  # Base path for metrics, e.g., https://host.local/metricsBasePath/metricsEndPath
  metricsBasePath: /metrics/
  auth:
    # Set HTTP Basic Auth for metrics in ingress. Applies only if 'addToIngress' is true
    enable: true
    username: flower_user
    password: flowerpass
  # Enable serviceMonitor (service discovery for Prometheus)
  serviceMonitor:
    enable: true
    interval: 15s

To make sure that Prometheus Operator picks up these resources, add the corresponding Namespace or the serviceMonitor itself to the spec section of the Prometheus custom resource, as shown in the sketch below. If you leave the selectors empty, ServiceMonitors from all Namespaces will be picked up.
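
A minimal sketch of the spec section of the Prometheus custom resource, assuming a Prometheus Operator installation; the Namespace name and label value below are placeholders for your setup:

spec:
  # Pick up ServiceMonitors from the Namespace where the chart is installed.
  # An empty selector ({}) matches all Namespaces.
  serviceMonitorNamespaceSelector:
    matchLabels:
      kubernetes.io/metadata.name: oz-api      # placeholder Namespace name
  # Pick up ServiceMonitors that carry a matching label.
  # An empty selector ({}) matches all ServiceMonitors.
  serviceMonitorSelector:
    matchLabels:
      release: my-oz-release                   # placeholder label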

Verify the Settings

If everything is correct, and serviceMonitor has been added to the cluster, you'll see the corresponding custom resource in Custom Resources → monitoring.coreos.com → Service Monitor.

Make sure that Service Monitor contains the resources listed in the screenshot below:

You can also check these resources in Prometheus itself. Proceed to Status and select Targets in the drop-down menu.

Here is what you should see:

For more details on how to work with Service Monitor in Prometheus, please refer to Prometheus Operator Documentation.
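
For reference, a ServiceMonitor created from the values above has roughly the following shape. This is a sketch only: the resource name, labels, and port name depend on your release and app configuration, while the path and interval come from the values shown in Step 1:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: oz-api-metrics              # placeholder, the actual name comes from the chart
  labels:
    release: my-oz-release          # placeholder label matched by serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: oz-api                   # placeholder, must match the app's Service labels
  endpoints:
    - port: http                    # placeholder port name from the Service
      path: /metrics/               # metricsBasePath from the values above
      interval: 15s                 # interval from the values above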

Step 2. Configure Dashboard

  1. Proceed to our repository and find the branch that matches your chart version.

  2. Download Oz_dashboard_client.json.

  3. Open Grafana and, in the Home menu, select Dashboards.

  4. Click New and choose Import from the drop-down menu.

  5. Select Upload dashboard JSON file and locate the Oz_dashboard_client.json file you've downloaded. Change the dashboard name or folder if needed; this is optional.

  6. Add the Prometheus data source to obtain metrics (see the provisioning sketch after this list).

  7. Click Import and save the dashboard.
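
If you provision Grafana from files rather than through the UI, the Prometheus data source can be added with a provisioning file along these lines; this is a sketch, and the data source name and URL are placeholders for your setup:

apiVersion: 1
datasources:
  - name: Prometheus              # placeholder name referenced by the dashboard
    type: prometheus
    access: proxy
    # Placeholder URL: the in-cluster address of your Prometheus service
    url: http://prometheus-operated.monitoring.svc:9090
    isDefault: true

Provisioning files like this are usually mounted under /etc/grafana/provisioning/datasources/.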

Variables

  • namespace is a label of the namespace from :tensorflow:cc:saved_model:load_latency{clustername="$cluster"},

  • quantile is a quantile value for tables that require it. Possible values: 0.95, 0.5, 0.90, 0.99, 1.

Step 3. Start Monitoring

Below, you'll find the description of component metrics and tables, along with pre-configured alerts.

Celery

Metrics Overview

| Metric | Description |
| --- | --- |
| flower_events_total | Total number of tasks |
| flower_task_runtime_seconds_sum | Sum of task completion durations |
| histogram_quantile($quantile, sum(rate(flower_task_runtime_seconds_bucket[1m])) by (le)) | Quantile of task execution time, based on the quantile variable value |
| flower_events_total{type="task-failed"} | With the type="task-failed" label, displays the number of failed tasks |

Grafana Tables

| Table | Description | What to monitor |
| --- | --- | --- |
| Liveness task rate | Average time of all Celery requests for Liveness | task-received should be roughly equal to task-succeeded; there shouldn't be any task-failed |
| Liveness task duration | Quantile of task execution time | The 0.95 quantile should be 8 seconds or less |
| Succeeded vs failed tasks rate | Total number of Celery requests | task-received should be roughly equal to task-succeeded; there shouldn't be any task-failed |
| All tasks duration (AVG) | Average time of all Celery requests for all models | Durations should be 6 seconds or less |
| Queue size | Message queue in Redis | Queue size and the number of unacked messages |

Illustrative screenshots:

task-received ≈ task-succeeded, task-failed = 0

Grafana Alerts

Redis

Metrics Overview

| Metric | Description |
| --- | --- |
| redis_up | 1 means Redis is working, 0 means the service is down |
| redis_commands_total | Total number of commands in Redis |
| redis_commands_duration_seconds_total | Total duration of Redis command processing |
| redis_key_size | Redis (as a message broker) queue size |
| redis_key_size{key="unacked"} | With the key="unacked" label, displays the number of unacknowledged tasks, i.e., tasks that are currently being processed |

Grafana Tables

| Table | Description | What to monitor |
| --- | --- | --- |
| Command rate | Number of requests per second | Nothing, just to stay informed |
| Commands duration | Average and maximum command execution duration | AVG < 15 µs, MAX < 1 ms |
| Connected clients | Number of connected clients | Shouldn't be 0 |

Illustrative screenshots:

Grafana Alerts

TFSS

Metrics Overview

| Metric | Description |
| --- | --- |
| :tensorflow:serving:request_count | Total number of requests to TFSS |
| :tensorflow:serving:request_latency_bucket | Histogram of order processing time |
| :tensorflow:serving:request_latency_sum | Sum of processing durations across all orders |
| :tensorflow:serving:request_latency_count | Total number of orders |
| :tensorflow:cc:saved_model:load_attempt_count | Number of uploaded (loaded) models |

Grafana Tables

| Table | Description | What to monitor |
| --- | --- | --- |
| Model request rate | Number of requests to TFSS per second and per minute | Nothing, just to stay informed |
| Model latency ($quantile-quantile) | Quantile of the TFSS request processing time; set the quantile value via the $quantile variable | Nothing, just to stay informed |
| Model latency (AVG) | Average order processing time | Nothing, just to stay informed |
| HTTP probe success | The result of the built-in blackbox check; blackbox sends requests to verify that TFSS works properly | Should be 1 |

Illustrative screenshots:

Grafana Alerts

Nginx

Metrics Overview

| Metric | Description |
| --- | --- |
| nginx_up | 1 means Nginx is working, 0 means the service is down |
| nginx_connections_accepted | Number of connections accepted by Nginx |
| nginx_connections_handled | Number of connections handled by Nginx |
| nginx_connections_active | Number of active Nginx connections |

Grafana Tables

| Table | Description | What to monitor |
| --- | --- | --- |
| Request rate | Total number of requests to Nginx | Nothing, just to stay informed |
| Active connections | Connection states | There shouldn't be any pending connections |
| Processed connections rate | Processing success rate | The numbers of accepted and handled connections should be equal |

Illustrative screenshots:

Grafana Alerts

API

Metrics Overview

| Metric | Description |
| --- | --- |
| absent(kube_pod_container_status_ready{container="oz-api", namespace="api-prod"}) | Shows that there are no ready API containers |

Grafana Alerts

You can customize the alerts according to your needs. Please proceed to our repository to find the alert files.
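
If you also want to keep some of these checks on the Prometheus side, the thresholds mentioned above can be expressed as alerting rules along the following lines. This is an illustrative sketch rather than the pre-configured alert files from the repository; the group and alert names and the for durations are assumptions, and probe_success is the standard blackbox exporter metric behind the HTTP probe success table:

groups:
  - name: oz-monitoring-sketch      # placeholder group name
    rules:
      - alert: CeleryTasksTooSlow
        # The 0.95 quantile of Celery task duration should be 8 seconds or less
        expr: histogram_quantile(0.95, sum(rate(flower_task_runtime_seconds_bucket[1m])) by (le)) > 8
        for: 5m
      - alert: RedisDown
        # redis_up is 1 while Redis is working
        expr: redis_up == 0
        for: 1m
      - alert: NginxDown
        # nginx_up is 1 while Nginx is working
        expr: nginx_up == 0
        for: 1m
      - alert: TFSSProbeFailed
        # The built-in blackbox check should return 1
        expr: probe_success == 0
        for: 5m
      - alert: NoReadyAPIContainers
        # Fires when there are no ready oz-api containers in the api-prod namespace
        expr: absent(kube_pod_container_status_ready{container="oz-api", namespace="api-prod"})
        for: 5m

In a Prometheus Operator setup, the same group can be wrapped in a PrometheusRule custom resource so that the Operator loads it automatically.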
