# Monitoring

For monitoring, we use Prometheus and Grafana.

## Installation and Configuration of Prometheus (for Kubernetes)

{% hint style="warning" %}
Skip this part if you use a Docker installation and proceed directly to [Dashboard Configuration](#dashboard-configuration).
{% endhint %}

<details>

<summary>Kubernetes-only</summary>

### Configure Prometheus

If necessary, install Prometheus into your cluster using the [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator).

Our charts already include a custom resource for Prometheus: `serviceMonitor`. It is disabled by default; to enable it, set `enable: true` under `prometheus_exporters` → `serviceMonitor`.

```yaml
prometheus_exporters:
  # Global parameter. If true, adds Prometheus exporters for the apps
  # that have the same parameter set to true in their own sections.
  enable: true
  # Global parameter. If true, exposes app metrics on ingress.
  # Check the same parameter in the apps' configs.
  addToIngress: false
  # Base path for metrics, e.g., https://host.local/metricsBasePath/metricsEndPath
  metricsBasePath: /metrics/
  auth:
    # Enables HTTP Basic Auth for metrics on ingress. Applies only if 'addToIngress' is true.
    enable: true
    username: flower_user
    password: flowerpass
  # Enables serviceMonitor (service discovery for Prometheus).
  serviceMonitor:
    enable: true
    interval: 15s
```

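To make sure that Prometheus Operator discovers the `serviceMonitor`, select its `Namespace` or the `serviceMonitor` itself in the `spec` section of the `Prometheus` custom resource. For example, to select by namespace:

```yaml
spec:
  # serviceMonitorNamespaceSelector is a label selector, so namespaces are matched
  # by the kubernetes.io/metadata.name label that Kubernetes sets automatically.
  serviceMonitorNamespaceSelector:
    matchExpressions:
      - key: kubernetes.io/metadata.name
        operator: In
        values:
          - default
          - monitoring
```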
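The `serviceMonitor` itself can also be selected by its labels rather than by namespace. A minimal sketch of that variant, assuming the `serviceMonitor` carries a label such as `release: oz-api` (a hypothetical label; check the labels on the resource that your chart actually creates):

```yaml
# Fragment of the Prometheus custom resource (monitoring.coreos.com/v1).
spec:
  # Pick up only the ServiceMonitor objects that carry this label.
  serviceMonitorSelector:
    matchLabels:
      release: oz-api   # hypothetical label, not taken from the Oz charts
```

Both selectors can be combined: a `serviceMonitor` is then discovered only if its namespace and its own labels match.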

To select `serviceMonitor` resources from all namespaces, leave both selectors empty:

```yaml
spec:
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
```

### Verify the Settings

If everything is correct and `serviceMonitor` has been added to the cluster, you'll see the corresponding custom resource in **Custom Resources** → **monitoring.coreos.com** → **Service Monitor**.

<figure><img src="/files/0uG4L1ofUhi81uXhXXrd" alt=""><figcaption></figcaption></figure>

Make sure that Service Monitor contains the resources shown in the screenshot below:

<figure><img src="/files/uzH6CuqvwbtBeLIBlbWI" alt=""><figcaption></figcaption></figure>

You can also check these resources in Prometheus itself. Go to **Status** and select **Targets** from the drop-down menu.

<figure><img src="/files/rpcY5fSr6ps3hPXW7wCf" alt=""><figcaption></figcaption></figure>

You should see the following:

<figure><img src="/files/0DmpIyo69y1Nqg8Dq3iX" alt=""><figcaption></figcaption></figure>

For more details on how to work with Service Monitor in Prometheus, please refer to the [Prometheus Operator Documentation](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/user-guides/getting-started.md).

</details>

## Dashboard Configuration

1. Proceed to [our repository](https://gitlab.com/oz-forensics/public/oz-engineering-monitoring-configs) and find the branch that matches your chart version.

<figure><img src="/files/buHHj2TYB6D9HnIAy0Vn" alt=""><figcaption></figcaption></figure>

2. Download the dashboard client:
   * For API 6 and above: `Oz_dashboard_client_api6.json` or `Oz_dashboard_client_api6_with_k8s_metrics.json`, depending on your product installation type.
   * For API 5 and below: `Oz_dashboard_client.json`.
3. Open Grafana and, in the **Home** menu, select **Dashboards**.

<figure><img src="/files/1XzVP2PwZGxo8YOmI24n" alt=""><figcaption></figcaption></figure>

4. Click **New** and choose **Import** from the drop-down menu.

<figure><img src="/files/xstm3s0A0NkA2CIEOHUD" alt=""><figcaption></figcaption></figure>

5. Select **Upload dashboard JSON file** and locate the dashboard file you've downloaded (for example, `Oz_dashboard_client.json`). Optionally, change the dashboard name or folder.
6. Add the `prometheus` data source to obtain metrics.
7. Click **Import** and save the dashboard.
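
If the `prometheus` data source does not exist in Grafana yet, it can be added in the Grafana UI or provisioned from a file. A minimal provisioning sketch, assuming Prometheus is reachable at `http://prometheus:9090` (a hypothetical address; use the one from your installation) and the file is placed in Grafana's `provisioning/datasources` directory:

```yaml
# Grafana data source provisioning file.
apiVersion: 1
datasources:
  - name: prometheus             # the data source name the dashboard import asks for
    type: prometheus
    access: proxy                # Grafana's backend queries Prometheus on the user's behalf
    url: http://prometheus:9090  # assumption: replace with your Prometheus address
    isDefault: true
```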

### Variables

<figure><img src="/files/28BHjZI5irpLQ3LzijIE" alt=""><figcaption></figcaption></figure>

* `namespace` (Kubernetes-only) is the namespace label taken from `:tensorflow:cc:saved_model:load_latency{clustername="$cluster"}`.
* `quantile` is a quantile value for tables that require it. Possible values: 0.95, 0.5, 0.90, 0.99, 1.

Please find alerts for different API versions here:

{% content-ref url="/pages/09Sqow6oHzJQY1MRIVrw" %}
[API v6 and Above](/oz-knowledge/guides/administrator-guide/monitoring/api-v6-and-above.md)
{% endcontent-ref %}

{% content-ref url="/pages/5cIdDR6dAhp1v2dM4RXA" %}
[API v5 and Below](/oz-knowledge/guides/administrator-guide/monitoring/api-5-and-below.md)
{% endcontent-ref %}

You can customize the alerts according to your needs. Please proceed to [our repository](https://gitlab.com/oz-forensics/public/oz-engineering-monitoring-configs/-/tree/helm-0.10.20/alerts?ref_type=heads) to find the alert files.
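
As an illustration of what a customized rule might look like, here is a minimal sketch in the standard Prometheus rule-file format. The group name, alert name, `job` value, and threshold are hypothetical and are not taken from the Oz alert files; only the built-in `up` metric is standard Prometheus:

```yaml
groups:
  - name: oz-custom-alerts        # hypothetical group name
    rules:
      - alert: OzTargetDown       # hypothetical alert name
        # `up` is set by Prometheus for every scrape target; 0 means the last scrape failed.
        expr: up{job="oz-api"} == 0   # assumption: replace `oz-api` with your job label
        for: 5m                   # fire only if the condition holds for 5 minutes
        labels:
          severity: warning
        annotations:
          summary: "Target {{ $labels.instance }} has been unreachable for 5 minutes"
```

With Prometheus Operator, the same `groups` list goes under `spec` of a `PrometheusRule` custom resource instead of a standalone rule file.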

