
Monitoring

For monitoring, we use Prometheus and Grafana.

Installation and Configuration of Prometheus (for Kubernetes)

Kubernetes-only

Configure Prometheus

If Prometheus is not yet installed in your cluster, install it by following the instructions in its GitHub repository.

Our charts already contain a custom resource for Prometheus: serviceMonitor. By default, serviceMonitor is disabled. To enable it, set enable: true under prometheus_exporters > serviceMonitor.

prometheus_exporters:
  # Global parameter. If true, add Prometheus exporters for apps
  # that have the same parameter set to true in their app section.
  enable: true
  # Global parameter. If true, expose app metrics on the ingress.
  # Check the same parameter in the apps' configs.
  addToIngress: false
  # Base path for metrics, e.g., https://host.local/metricsBasePath/metricsEndPath
  metricsBasePath: /metrics/
  auth:
    # Set HTTP Basic auth for metrics on the ingress. Used only if 'addToIngress' is true.
    enable: true
    username: flower_user
    password: flowerpass
  # Enable serviceMonitor (service discovery for Prometheus)
  serviceMonitor:
    enable: true
    interval: 15s

To make sure Prometheus Operator picks up the serviceMonitor, select its namespace, or the serviceMonitor itself, in the spec section of the Prometheus resource, as shown below:

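spec:
  # serviceMonitorNamespaceSelector is a label selector, so namespaces are matched by label
  serviceMonitorNamespaceSelector:
    matchExpressions:
    - key: kubernetes.io/metadata.name
      operator: In
      values:
      - default
      - monitoring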

To select serviceMonitors from all namespaces, leave both selectors empty:

spec:
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
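
For context, these selectors belong to the spec of the Prometheus custom resource managed by Prometheus Operator. A minimal sketch of where they sit is given below. The resource name and namespace are assumptions made for illustration. In practice you would merge the two selector fields into your existing Prometheus resource rather than create a new one.

```yaml
# Sketch only: name and namespace are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example-prometheus
  namespace: monitoring
spec:
  # Empty selectors match ServiceMonitors in every namespace.
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
```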

Verify the Settings

If everything is correct and serviceMonitor has been added to the cluster, you'll see the corresponding custom resource in Custom Resources > monitoring.coreos.com > Service Monitor.
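
For orientation, a ServiceMonitor resource generally has the shape sketched below. All names, labels, the port name, and the path are hypothetical. The resource rendered by the charts may differ, so treat this only as a guide to which fields to look at.

```yaml
# Sketch only: names, labels, port and path are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: default
spec:
  selector:
    matchLabels:
      app: example-app   # labels of the Service to scrape
  endpoints:
    - port: metrics      # name of the Service port that exposes metrics
      path: /metrics
      interval: 15s      # corresponds to serviceMonitor.interval in the chart values
```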

Make sure that Service Monitor contains the resources shown in the screenshot below:

You can also check these resources in Prometheus itself. Open the Status drop-down menu and select Targets.

You should see the following:

For more details on how to work with Service Monitor in Prometheus, please refer to the Prometheus Operator documentation.

Dashboard Configuration

  1. Proceed to our repository and find the branch that matches your chart version.

  2. Download the dashboard client:

    • For API 6 and above: Oz_dashboard_client_api6.json or Oz_dashboard_client_api6_with_k8s_metrics.json, depending on your product installation type.

    • For API 5 and below: Oz_dashboard_client.json.

  3. Open Grafana and, in the Home menu, select Dashboards.

  4. Click New and choose Import from the drop-down menu.

  5. Select Upload dashboard JSON file and locate the dashboard file you've downloaded. Optionally, change the name or directory.

  6. Add the Prometheus data source to obtain metrics.

  7. Click Import and save the dashboard.
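
The import needs a Prometheus data source that already exists in Grafana. If you don't have one yet, one option is a Grafana data source provisioning file placed in Grafana's provisioning/datasources directory. The sketch below is an illustration only. The URL is an assumption and must be replaced with the address of your Prometheus service.

```yaml
# Sketch only: the URL is an assumption; use the address of your Prometheus service.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-operated.monitoring.svc:9090
    isDefault: true
```

The data source can also be created in the Grafana UI. Provisioning is shown here only because it keeps the setting under version control.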

Variables

  • namespace (Kubernetes-only) is the namespace label taken from :tensorflow:cc:saved_model:load_latency{clustername="$cluster"}.

  • quantile is a quantile value for tables that require it. Possible values: 0.95, 0.5, 0.90, 0.99, 1.

Please find alerts for different API versions here:

  • API v6 and Above

  • API v5 and Below

You can customize the alerts according to your needs. Please proceed to our repository to find the alert files.
