Monitoring
For monitoring, we use Prometheus and Grafana.
Installation and Configuration of Prometheus (for Kubernetes)
Skip this part if you have a Docker installation and proceed directly to Dashboard Configuration.
Kubernetes-only
Configure Prometheus
If necessary, install Prometheus into your cluster by following the instructions on its GitHub page.
Our charts already contain a custom resource for Prometheus: serviceMonitor. By default, serviceMonitor is disabled. To enable it, set enable: true in prometheus_exporters → serviceMonitor.
prometheus_exporters:
  # Global parameter. If true, add prometheus exporters for apps
  # with the same parameter also set to true in app section
  enable: true
  # Global parameter. if true - expose metrics for apps on ingress
  # Check the same parameters in apps' configs
  addToIngress: false
  # Base path for metrics, e.g., https://host.local/metricsBasePath/metricsEndPath
  metricsBasePath: /metrics/
  auth:
    # Set HTTP Basic AUTH for metrics in ingress. Usable only if 'addToIngress' is true
    enable: true
    username: flower_user
    password: flowerpass
  # Enable serviceMonitor (Service Discovery for Prometheus)
  serviceMonitor:
    enable: true
    interval: 15s

To ensure that Prometheus Operator spots the parameters, add the corresponding Namespace or serviceMonitor itself to the spec section as shown below:
spec:
  serviceMonitorNamespaceSelector:
    matchNames:
      - default
      - monitoring

If you don't specify the parameters, all namespaces will be added:
spec:
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}

Verify the Settings
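If addToIngress is set to true, one quick check is to request the metrics endpoint through the ingress. The sketch below only illustrates how the URL from the metricsBasePath comment and the HTTP Basic AUTH header could be assembled. The host host.local and the end path api are placeholders, not values taken from the charts.

```python
import base64


def metrics_url(host: str, base_path: str, end_path: str) -> str:
    """Build https://<host>/<metricsBasePath>/<metricsEndPath>."""
    # Drop surrounding slashes so the parts join with exactly one "/" each.
    parts = [p.strip("/") for p in (base_path, end_path) if p.strip("/")]
    return f"https://{host}/" + "/".join(parts)


def basic_auth_header(username: str, password: str) -> dict:
    """Return the Authorization header for HTTP Basic AUTH."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}


# Hypothetical values: "host.local" and "api" are placeholders.
url = metrics_url("host.local", "/metrics/", "api")
headers = basic_auth_header("flower_user", "flowerpass")
```

The resulting url and headers can be passed to any HTTP client. Replace the credentials with the ones set in the auth section of your values.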
If everything is correct, and serviceMonitor has been added to the cluster, you'll see the corresponding custom resource in Custom Resources → monitoring.coreos.com → Service Monitor.

Make sure that Service Monitor contains the resources listed in the screen below:

You can also check these resources in Prometheus itself. Proceed to Status and select Targets in the drop-down menu.

Here is what should be seen:

For more details on how to work with Service Monitor in Prometheus, refer to the Prometheus Operator documentation.
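The Targets page can also be checked from a script through the Prometheus HTTP API endpoint /api/v1/targets. The following is a minimal sketch, assuming Prometheus is reachable at a URL you supply. The function names are illustrative, and the names of the scrape pools depend on your installation.

```python
import json
import urllib.request


def summarize_targets(payload: dict) -> dict:
    """Group the health states of active targets by scrape pool."""
    summary = {}
    for target in payload.get("data", {}).get("activeTargets", []):
        pool = target.get("scrapePool", "unknown")
        summary.setdefault(pool, []).append(target.get("health", "unknown"))
    return summary


def fetch_targets(prometheus_url: str) -> dict:
    """Call GET /api/v1/targets; prometheus_url is supplied by the caller."""
    endpoint = f"{prometheus_url.rstrip('/')}/api/v1/targets"
    with urllib.request.urlopen(endpoint) as response:
        return json.load(response)
```

For example, summarize_targets(fetch_targets("http://localhost:9090")) would list the health of every target per scrape pool. The address here is an assumption, so replace it with your Prometheus URL.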
Dashboard Configuration
Proceed to our repository and find the branch that matches your chart version.

Download the dashboard client:
For API 6 and above: Oz_dashboard_client_api6.json or Oz_dashboard_client_api6_with_k8s_metrics.json, depending on your product installation type.
For API 5 and below: Oz_dashboard_client.json.
Open Grafana and, in the Home menu, select Dashboards.

Click New and choose Import from the drop-down menu.

Select Upload dashboard JSON file and locate the Oz_dashboard_client.json file you've downloaded. Optionally, change the filename or directory.

Add the prometheus data source to obtain metrics.

Click Import and save the dashboard.
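The import steps above use the Grafana UI. For automated setups, a dashboard can also be uploaded through the Grafana HTTP API endpoint POST /api/dashboards/db. The sketch below is not part of the product. It assumes a Grafana URL and an API token that you supply. Note that this endpoint stores the JSON as-is, so any data source selection that the UI import performs may need to be handled separately.

```python
import json
import urllib.request


def build_import_payload(dashboard: dict, overwrite: bool = True) -> dict:
    """Wrap dashboard JSON in the body expected by POST /api/dashboards/db."""
    dash = dict(dashboard)  # shallow copy so the caller's dict is untouched
    dash["id"] = None       # a null id makes Grafana create the dashboard
    return {"dashboard": dash, "overwrite": overwrite}


def upload_dashboard(grafana_url: str, api_token: str, json_path: str) -> dict:
    """Upload a dashboard file; URL, token, and path come from the caller."""
    with open(json_path, encoding="utf-8") as handle:
        payload = build_import_payload(json.load(handle))
    request = urllib.request.Request(
        f"{grafana_url.rstrip('/')}/api/dashboards/db",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A call such as upload_dashboard("http://localhost:3000", token, "Oz_dashboard_client.json") would then upload the file. The address is a placeholder.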
Variables

namespace (Kubernetes-only) is a label of the namespace from :tensorflow:cc:saved_model:load_latency{clustername="$cluster"}.
quantile is a quantile value for tables that require it. Possible values: 0.95, 0.5, 0.90, 0.99, 1.
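To see which namespace values a variable like this could resolve to, the Prometheus HTTP API offers /api/v1/label/<label_name>/values with a match[] selector. The sketch below only builds the request URL. The Prometheus address and the cluster name prod are placeholders, and in Grafana the $cluster variable is substituted by the dashboard itself.

```python
from urllib.parse import urlencode


def label_values_url(prometheus_url: str, label: str, selector: str) -> str:
    """Build the URL that lists the values of `label` for matching series."""
    query = urlencode({"match[]": selector})
    return f"{prometheus_url.rstrip('/')}/api/v1/label/{label}/values?{query}"


# Hypothetical cluster name "prod" stands in for the $cluster variable.
selector = ':tensorflow:cc:saved_model:load_latency{clustername="prod"}'
url = label_values_url("http://localhost:9090", "namespace", selector)
```

Requesting this URL should return a JSON document whose data field lists the namespaces that expose the metric.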
Please find alerts for different API versions here:
API v6 and Above
API v5 and Below
You can customize the alerts according to your needs. Please proceed to our repository to find the alert files.
