These instructions and dashboard are for Helm chart 0.10.20 and API 5.1.
For monitoring, we use Prometheus and Grafana.
Step 1. Install and Configure Prometheus
Configure Prometheus
If necessary, install Prometheus into your cluster by following the instructions in the Prometheus GitHub repository.
Our charts already contain a custom resource for Prometheus: serviceMonitor. By default, serviceMonitor is disabled; to enable it, set enable: true under prometheus_exporters → serviceMonitor, as shown below.
prometheus_exporters:
  # Global parameter. If true, add Prometheus exporters for apps
  # with the same parameter also set to true in the app section
  enable: true
  # Global parameter. If true, expose metrics for apps on ingress.
  # Check the same parameters in the apps' configs
  addToIngress: false
  # Base path for metrics, e.g., https://host.local/metricsBasePath/metricsEndPath
  metricsBasePath: /metrics/
  auth:
    # Set HTTP Basic auth for metrics in ingress. Usable only if 'addToIngress' is true
    enable: true
    username: flower_user
    password: flowerpass
  # Enable serviceMonitor (Service Discovery for Prometheus)
  serviceMonitor:
    enable: true
    interval: 15s
To make sure that the Prometheus Operator picks up these resources, add the corresponding namespace or the serviceMonitor itself to the spec section of your Prometheus custom resource, as shown below:
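The exact selectors depend on how your Prometheus instance is managed. A minimal sketch of the relevant part of the Prometheus custom resource, assuming the Oz API chart is installed into a namespace labeled monitoring: enabled (the label and selector values below are examples, not chart defaults):

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  # Discover ServiceMonitor resources in namespaces that carry this label
  serviceMonitorNamespaceSelector:
    matchLabels:
      monitoring: enabled
  # ...and/or select the ServiceMonitor resources themselves by their labels;
  # an empty selector ({}) matches all ServiceMonitors in the selected namespaces
  serviceMonitorSelector:
    matchLabels:
      release: prometheus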
If everything is correct, and serviceMonitor has been added to the cluster, you'll see the corresponding custom resource in Custom Resources → monitoring.coreos.com → Service Monitor.
Make sure that Service Monitor contains the resources shown in the screenshot below:
You can also check these resources in Prometheus itself. Proceed to Status and select Targets in the drop-down menu.
Step 2. Import the Dashboard into Grafana
Proceed to our repository and find the branch that matches your chart version.
Download Oz_dashboard_client.json.
Open Grafana and, in the Home menu, select Dashboards.
Click New and choose Import from the drop-down menu.
Select Upload dashboard JSON file and locate the Oz_dashboard_client.json file you've downloaded. Changing the file name or directory is optional.
Add the Prometheus data source to obtain metrics (a data source provisioning sketch follows these steps).
Click Import and save the dashboard.
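If you prefer to add the data source declaratively instead of through the UI, Grafana supports data source provisioning. A minimal sketch, assuming Prometheus is reachable in the cluster at http://prometheus-server.monitoring.svc:9090 (the URL and data source name are examples; adjust them to your installation):

apiVersion: 1
datasources:
  # Prometheus data source the Oz dashboard will read metrics from
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-server.monitoring.svc:9090
    isDefault: true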
Variables
namespace is the namespace label taken from the :tensorflow:cc:saved_model:load_latency{clustername="$cluster"} metric;
quantile is a quantile value for tables that require it. Possible values: 0.95, 0.5, 0.90, 0.99, 1.
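For reference, such variables are usually populated with label_values queries against the Prometheus data source. The imported dashboard already defines its own variables, so the following is only an illustration of how the namespace variable can be built:

label_values(:tensorflow:cc:saved_model:load_latency{clustername="$cluster"}, namespace)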
Step 3. Start Monitoring
Below, you'll find the description of component metrics and tables, along with pre-configured alerts.
Celery
Metrics Overview
Metric | Description
flower_events_total | Total number of tasks
flower_task_runtime_seconds_sum | Sum of task completion durations
histogram_quantile($quantile, sum(rate(flower_task_runtime_seconds_bucket[1m])) by (le)) | Quantile for task execution, based on the $quantile variable value
flower_events_total{type="task-failed"} | With the type="task-failed" label, displays the number of failed tasks
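For illustration, these metrics combine into the panel queries used on the dashboard. A sketch of an average task duration query (it assumes the flower_task_runtime_seconds_count series of the same histogram, which is not listed above; the imported dashboard may use different windows or grouping):

# average Celery task duration over the last 5 minutes, in seconds
sum(rate(flower_task_runtime_seconds_sum[5m]))
  /
sum(rate(flower_task_runtime_seconds_count[5m]))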
Grafana Tables
Table | Description | What to monitor
Liveness task rate | Average time of all Celery requests for Liveness | task-received should be roughly equal to task-succeeded; there shouldn't be any task-failed
Liveness task duration | Quantile for task execution | The 0.95 quantile should be 8 seconds or less
Succeeded vs failed tasks rate | Total number of Celery requests | task-received should be roughly equal to task-succeeded; there shouldn't be any task-failed
All tasks duration (AVG) | Average time of all Celery requests for all models | The durations should be 6 seconds or less
Queue size | Message queue in Redis | Queue size and the number of unacked messages
Illustrative screenshots:
Grafana Alerts
groups:
  - name: Celery alerts
    rules:
      # An alert for the 0.9 quantile of task duration; warns that analyses are running slowly
      - alert: Quality Analysis is slow!
        expr: histogram_quantile(0.9, sum without (handler) (rate(flower_task_runtime_seconds_bucket{task="oz_core.tasks.tfss.process_analyse_quality"}[5m]))) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Too long duration of celery worker for TFSS task"
          description: "The duration of 90% Quality analyses is longer than {{ $value }} seconds for last 10 minutes. NAMESPACE: {{ $labels.namespace }} POD: {{ $labels.pod }} TASK: {{ $labels.task }} WORKER: {{ $labels.worker }}"
      # An alert for failed tasks; if the number is growing, something is wrong
      - alert: Celery tasks failed!
        expr: sum by (type, task) (rate(flower_events_total{type="task-failed", task!=""}[1m])) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Celery tasks {{ $labels.task }} failed"
          description: 'Failed celery tasks rate: {{ printf "%.2f" $value }} rps'
      # A critical alert that warns that all tasks are failing; it means the system has stopped processing requests
      - alert: Celery zero success tasks!
        expr: sum by (type, task) (rate(flower_events_total{type="task-succeeded", task!=""}[1m])) == 0 and on (task) sum by (type, task) (rate(flower_events_total{type="task-received", task!=""}[1m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Celery has zero success tasks: {{ $labels.task }}!"
          description: "Critical! Check if bio is alive!"
Redis
Metrics Overview
Metric | Description
redis_up | 1 means Redis is working, 0 means the service is down
redis_commands_total | Total number of commands processed by Redis
redis_commands_duration_seconds_total | Total duration of command processing in Redis
redis_key_size | Redis (as a message broker) queue size
redis_key_size{key="unacked"} | With the key="unacked" label, displays the number of tasks that are currently being processed by Redis
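For reference, the average command execution time shown in the Commands duration table below can be approximated from the two counters above. A sketch only; the dashboard's own expression may differ, for example by grouping by the exporter's cmd label:

# average Redis command duration over the last minute, in seconds
sum(rate(redis_commands_duration_seconds_total[1m]))
  /
sum(rate(redis_commands_total[1m]))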
Grafana Tables
Table | Description | What to monitor
Command rate | Number of requests per second | Nothing, just to stay informed
Commands duration | Average and maximum command execution duration | AVG < 15 µs, MAX < 1 ms
Connected clients | Number of connected clients | Shouldn't be 0
Illustrative screenshots:
Grafana Alerts
groups:
  - name: Redis alerts
    rules:
      # A critical alert that warns about Redis being down
      - alert: Redis is down
        expr: redis_up != 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Redis is down for more than 30 seconds!"
          description: "Critical: REDIS service is down in namespace: {{ $labels.namespace }}\nPod: {{ $labels.pod }}!"
      # Fires if Redis rejects connections
      - alert: Redis rejected connections
        expr: rate(redis_rejected_connections_total[1m]) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Redis rejects connections for more than 1 minute in namespace: {{ $labels.namespace }}!"
          description: "Some connections to Redis have been rejected!\nPod: {{ $labels.pod }}\nValue = {{ $value }}"
      # Fires if Redis commands are being executed too slowly
      - alert: Redis command duration is slow!
        expr: max by (namespace) (rate(redis_commands_duration_seconds_total[1m])) > 0.0004
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Redis max command duration is too high for more than 1 minute in namespace: {{ $labels.namespace }}!"
          description: "Maximum command duration is longer than average!\nValue = {{ $value }} seconds"
      # Warns that the Redis queue is too long
      - alert: Redis queue length
        expr: sum by (instance) (redis_key_size) > 50
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Redis queue size is too large!"
          description: "Warning: Redis queue size : {{ $value }} for the last 1 min!"
      # Warns that there are more than 10 processing (unacked) messages in the Redis queue
      - alert: Redis unacked messages
        expr: sum by (key) (redis_key_size{key="unacked"}) > 10
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Redis has unacked messages!"
          description: "Warning: Redis has {{ $value }} unacked messages!"
TFSS
Metrics Overview
Metric | Description
:tensorflow:serving:request_count | Total number of requests to TFSS
:tensorflow:serving:request_latency_bucket | Histogram of request processing time
:tensorflow:serving:request_latency_sum | Sum of processing durations of all requests
:tensorflow:serving:request_latency_count | Total number of processed requests
:tensorflow:cc:saved_model:load_attempt_count | Models loaded into TFSS
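For reference, the Model latency (AVG) table below derives the average processing time from the sum and count series above. A sketch of such a query (TFSS usually reports latency in microseconds, so check the unit for your version; the imported dashboard's own expression may differ):

# average TFSS request latency over the last 5 minutes
rate(:tensorflow:serving:request_latency_sum[5m])
  /
rate(:tensorflow:serving:request_latency_count[5m])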
Grafana Tables
Table | Description | What to monitor
Model request rate | Number of requests to TFSS per second and per minute | Nothing, just to stay informed
Model latency ($quantile-quantile) | 0.95 quantile of the TFSS request processing time; you can set the quantile value via the quantile variable | Nothing, just to stay informed
Model latency (AVG) | Average request processing time | Nothing, just to stay informed
HTTP probe success | The result of the built-in Blackbox check; Blackbox sends requests to verify that TFSS works properly | Should be 1
Illustrative screenshots:
Grafana Alerts
groups:
  - name: TFSS alerts
    rules:
      # Critical alerts that warn about Blackbox detecting incorrect model behavior
      - alert: TFSS models probe service alert!
        expr: probe_success{job="blackbox-tfss-service"} != 1
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "TFSS model in namespace {{ $labels.namespace }} is unavailable!"
          description: "!!!ALERT!!! TFSS model or server doesn't work:\nMODEL: {{ $labels.model }}\nModel probe has been returning failed state for 3 min!"
      - alert: TFSS models probe pod alert!
        expr: probe_success{job="blackbox-tfss-models"} != 1
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "TFSS model in namespace {{ $labels.namespace }} is unavailable!"
          description: "TFSS in pod {{ $labels.pod }} doesn't work:\nMODEL: {{ $labels.model }}\nModel probe has been returning failed state for 3 min!"
      - alert: TFSS predict probe pod alert!
        expr: probe_success{job="blackbox-tfss-probe"} != 1
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "TFSS model predict in namespace {{ $labels.namespace }} is unavailable!"
          description: "TFSS in pod {{ $labels.pod }} doesn't work:\nMODEL: {{ $labels.model }}\nPredict probe has been returning failed state for 3 min!"
      # Critical alert that warns about TFSS not processing requests
      - alert: TFSS empty request rate!
        expr: absent(:tensorflow:serving:request_count{namespace="api-prod"}) == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "TFSS request rate is empty!!!"
          description: "Critical! Requests are not processed, check bio!!!"
Nginx
Metrics Overview
Metric | Description
nginx_up | 1 means Nginx is working, 0 means the service is down
nginx_connections_accepted | Number of connections accepted by Nginx
nginx_connections_handled | Number of connections handled by Nginx
nginx_connections_active | Number of active Nginx connections
Grafana Tables
Table | Description | What to monitor
Request rate | Total number of requests to Nginx | Nothing, just to stay informed
Active connections | Connection states | There shouldn't be any pending connections
Processed connections rate | Processing success rate | The numbers of accepted and handled connections should be equal
Illustrative screenshots:
Grafana Alerts
groups:
  - name: NGINX alerts
    rules:
      # A critical alert that displays that Nginx hasn't handled some of the accepted connections
      - alert: Nginx not all connections are handled
        expr: rate(nginx_connections_handled[5m]) / rate(nginx_connections_accepted[5m]) < 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Nginx issue with handling connections"
          description: "Critical: Nginx doesn't handle some accepted connections on the host {{ $labels.instance }} for more than 3 minutes!"
Additionally, there is an alert group for the API itself that fires when no API containers are ready:
groups:
  - name: API alerts
    rules:
      # A critical alert that warns that there are no ready API containers, so requests are not being processed
      - alert: Absent ready api containers!
        expr: absent(kube_pod_container_status_ready{container="oz-api", namespace="api-prod"} == 1)
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Absent ready api containers!!!"
          description: "Critical! Check api containers!!!"
You can customize the alerts according to your needs. Please proceed to our repository to find the alert files.
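Because service discovery in this setup relies on the Prometheus Operator, one convenient way to deploy the alert files (or your customized copies) is a PrometheusRule custom resource. A minimal sketch, assuming your Prometheus instance selects rules by the release: prometheus label (the metadata name, namespace, and labels below are examples):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: oz-api-alerts
  namespace: api-prod
  labels:
    release: prometheus
spec:
  groups:
    # Paste any of the rule groups from this page here
    - name: API alerts
      rules:
        - alert: Absent ready api containers!
          expr: absent(kube_pod_container_status_ready{container="oz-api", namespace="api-prod"} == 1)
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Absent ready api containers!!!"
            description: "Critical! Check api containers!!!"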