Monitoring
These instructions and dashboard are for Helm chart 0.10.20 and API 5.1.
For monitoring, we use Prometheus and Grafana.
Step 1. Install and Configure Prometheus
Configure Prometheus
If necessary, install Prometheus into your cluster by following the instructions in the official GitHub repository.
Our charts already contain a custom resource for Prometheus: serviceMonitor. By default, serviceMonitor is disabled. To enable it, set enable: true in prometheus_exporters → serviceMonitor.
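A minimal sketch of the corresponding values override, assuming the prometheus_exporters → serviceMonitor path mentioned above (any surrounding keys may differ in your chart version):

```yaml
# values.yaml (sketch): enable the ServiceMonitor shipped with the chart.
# The prometheus_exporters.serviceMonitor path follows the text above.
prometheus_exporters:
  serviceMonitor:
    enable: true
```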
To make sure that Prometheus Operator picks up these parameters, add the corresponding Namespace or the serviceMonitor itself to the spec section, as shown below:
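The snippet below is a sketch of the Prometheus custom resource; serviceMonitorSelector and serviceMonitorNamespaceSelector are the standard Prometheus Operator selector fields, and the label values are placeholders to replace with your own:

```yaml
# Prometheus custom resource (sketch); label values are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  # Watch only the Namespace(s) carrying this label.
  serviceMonitorNamespaceSelector:
    matchLabels:
      kubernetes.io/metadata.name: oz-api
  # Or pick the ServiceMonitor itself by its labels.
  serviceMonitorSelector:
    matchLabels:
      release: oz
```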
If you don't specify these parameters, ServiceMonitors from all Namespaces will be added:
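For example, leaving the selectors empty (a sketch; in the Prometheus Operator, empty {} selectors match everything; only the spec section is shown):

```yaml
# Prometheus custom resource (sketch): empty selectors match every
# ServiceMonitor in every Namespace.
spec:
  serviceMonitorSelector: {}
  serviceMonitorNamespaceSelector: {}
```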
Verify the Settings
If everything is correct and serviceMonitor has been added to the cluster, you'll see the corresponding custom resource under Custom Resources → monitoring.coreos.com → Service Monitor.
Make sure that Service Monitor contains the resources shown in the screenshot below:
You can also check these resources in Prometheus itself: proceed to Status and select Targets in the drop-down menu. Here is what you should see:
For more details on how to work with Service Monitor in Prometheus, please refer to the Prometheus Operator documentation.
Step 2. Configure Dashboard
1. Proceed to our repository and find the branch that matches your chart version.
2. Download Oz_dashboard_client.json.
3. Open Grafana and, in the Home menu, select Dashboards.
4. Click New and choose Import from the drop-down menu.
5. Select Upload dashboard JSON file and locate the Oz_dashboard_client.json file you've downloaded. Change the filename or directory if needed, but this is optional.
6. Add the prometheus data source to obtain metrics.
7. Click Import and save the dashboard.
Variables
namespace – the label of the namespace taken from :tensorflow:cc:saved_model:load_latency{clustername="$cluster"}.
quantile – the quantile value for tables that require it. Possible values: 0.95, 0.5, 0.90, 0.99, 1.
Step 3. Start Monitoring
Below, you'll find the description of component metrics and tables, along with pre-configured alerts.
Celery
Metrics Overview
flower_events_total – Total number of tasks.
flower_task_runtime_seconds_sum – Sum of task completion durations.
histogram_quantile($quantile, sum(rate(flower_task_runtime_seconds_bucket[1m])) by (le)) – Quantile for task execution based on the quantile variable value.
flower_events_total{type="task-failed"} – With the type="task-failed" label, displays the number of failed tasks.
Grafana Tables
Liveness task rate – Average time of all Celery requests for Liveness. What to watch for: task-received should be roughly equal to task-succeeded; there shouldn't be any task-failed.
Liveness task duration – Quantile for task execution. What to watch for: the 0.95 quantile should be 8 seconds or less.
Succeeded vs failed tasks rate – Total number of Celery requests. What to watch for: task-received should be roughly equal to task-succeeded; there shouldn't be any task-failed.
All tasks duration (AVG) – Average time of all Celery requests for all models. What to watch for: the durations should be 6 seconds or less.
Queue size – Message queue in Redis. What to watch for: queue size and the number of unacked messages.
Illustrative screenshots:
Grafana Alerts
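The pre-configured alert files are kept in our repository (see the note at the end of this page). As an illustration only, here is a minimal PrometheusRule sketch that fires when failed Celery tasks appear; the alert name, threshold, and timing are assumptions and may differ from the shipped rules.

```yaml
# PrometheusRule (sketch): fire when failed Celery tasks appear.
# Alert name, threshold, and timing are assumptions, not the shipped rules.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: celery-alerts
  labels:
    release: prometheus   # placeholder: match your Prometheus ruleSelector
spec:
  groups:
    - name: celery
      rules:
        - alert: CeleryTasksFailed
          expr: increase(flower_events_total{type="task-failed"}[5m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Celery tasks are failing"
```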
Redis
Metrics Overview
redis_up – 1 means Redis is working, 0 – service is down.
redis_commands_total – Total number of commands in Redis.
redis_commands_duration_seconds_total – Redis command processing duration.
redis_key_size – Redis (as a message broker) queue size.
redis_key_size{key="unacked"} – With the key="unacked" label, displays the number of tasks that are being processed by Redis.
Grafana Tables
Command rate – Number of requests per second. What to watch for: nothing, just to stay informed.
Commands duration – Average and maximum command execution duration. What to watch for: AVG < 15µs, MAX < 1ms.
Connected clients – Number of connected clients. What to watch for: shouldn't be 0.
Illustrative screenshots:
Grafana Alerts
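As an illustration, a rule sketch reusing the PrometheusRule skeleton from the Celery example above; it fires when the Redis exporter reports the service as down (the alert name and timing are assumptions):

```yaml
# Rule sketch (same PrometheusRule skeleton as the Celery example above).
- alert: RedisDown
  expr: redis_up == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Redis is down"
```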
TFSS
Metrics Overview
:tensorflow:serving:request_count – Total number of requests to TFSS.
:tensorflow:serving:request_latency_bucket – Histogram of order processing time.
:tensorflow:serving:request_latency_sum – Sum of processing durations for each order.
:tensorflow:serving:request_latency_count – Total number of orders.
:tensorflow:cc:saved_model:load_attempt_count – Uploaded models.
Grafana Tables
Model request rate – Number of requests to TFSS per second and per minute. What to watch for: nothing, just to stay informed.
Model latency ($quantile-quantile) – 0.95 quantile of the TFSS request processing time; you can set the quantile value in the $quantile variable. What to watch for: nothing, just to stay informed.
Model latency (AVG) – Average order processing time. What to watch for: nothing, just to stay informed.
HTTP probe success – The result of the built-in blackbox check; Blackbox sends requests to verify that TFSS works properly. What to watch for: should be 1.
Illustrative screenshots:
Grafana Alerts
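As an illustration, a rule sketch in the same PrometheusRule skeleton that fires when the blackbox probe against TFSS fails; the probe_success series and the timing are assumptions based on the HTTP probe success table above:

```yaml
# Rule sketch (same PrometheusRule skeleton as above).
# probe_success is assumed to come from the built-in blackbox check.
- alert: TFSSProbeFailed
  expr: probe_success == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "TFSS HTTP probe is failing"
```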
Nginx
Metrics Overview
nginx_up – 1 means Nginx is working, 0 – service is down.
nginx_connections_accepted – Number of connections accepted by Nginx.
nginx_connections_handled – Number of connections handled by Nginx.
nginx_connections_active – Number of active Nginx connections.
Grafana Tables
Request rate – Total number of requests to Nginx. What to watch for: nothing, just to stay informed.
Active connections – Connection states. What to watch for: there shouldn't be any pending connections.
Processed connections rate – Processing success rate. What to watch for: the numbers of accepted and handled connections should be equal.
Illustrative screenshots:
Grafana Alerts
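As an illustration, a rule sketch in the same PrometheusRule skeleton that fires when the accepted and handled connection counters diverge, i.e. Nginx starts dropping connections (the alert name, window, and timing are assumptions):

```yaml
# Rule sketch (same PrometheusRule skeleton as above): accepted and handled
# connections should be equal; a positive difference means dropped connections.
- alert: NginxDroppedConnections
  expr: increase(nginx_connections_accepted[5m]) - increase(nginx_connections_handled[5m]) > 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Nginx is dropping connections"
```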
API
Metrics Overview
absent(kube_pod_container_status_ready{container="oz-api", namespace="api-prod"}) – Shows that there are no ready API containers.
Grafana Alerts
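As an illustration, a rule sketch in the same PrometheusRule skeleton built on the metric above; the container and namespace labels follow the example expression and should be adjusted to your deployment:

```yaml
# Rule sketch (same PrometheusRule skeleton as above): fire when there
# are no ready oz-api containers. Adjust the labels to your deployment.
- alert: ApiContainersNotReady
  expr: absent(kube_pod_container_status_ready{container="oz-api", namespace="api-prod"})
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "No ready oz-api containers"
```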
You can customize the alerts according to your needs. Please proceed to our repository to find the alert files.