Prometheus Metrics
Exposing a Prometheus metrics port
All supported serving runtimes can export Prometheus metrics on a specified port in the InferenceService's pod. The appropriate port for each model server is defined in the kserve/config/runtimes YAML files. For example, torchserve defines its Prometheus port as 8082 in kserve-torchserve.yaml.
metadata:
  name: kserve-torchserve
spec:
  annotations:
    prometheus.kserve.io/port: '8082'
    prometheus.kserve.io/path: "/metrics"
If needed, this value can be overridden in the InferenceService YAML.
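For example, a minimal sketch of such an override: the annotations are placed on the InferenceService metadata so they are propagated to the pod. The service name, predictor spec, and storageUri below are placeholders, and the port value must match the port your model server actually serves metrics on.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: torchserve-metrics-override
  annotations:
    # Override the runtime defaults for this service only; the port must match
    # the port the model server actually exposes metrics on.
    prometheus.kserve.io/port: '8082'
    prometheus.kserve.io/path: "/metrics"
spec:
  predictor:
    pytorch:
      # Placeholder: point this at your own model.
      storageUri: "gs://<your-bucket>/torchserve-model"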
To enable Prometheus metrics, add the annotation serving.kserve.io/enable-prometheus-scraping to the InferenceService YAML.
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-irisv2"
  annotations:
    serving.kserve.io/enable-prometheus-scraping: "true"
spec:
  predictor:
    sklearn:
      protocolVersion: v2
      storageUri: "gs://seldon-models/sklearn/iris"
The default value for serving.kserve.io/enable-prometheus-scraping can be set in the inferenceservice-config ConfigMap. See the docs for more info.
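For illustration, a sketch of where that default lives; the metricsAggregator entry and its field names are assumed from recent KServe releases, so verify them against the inferenceservice-config ConfigMap in your installation.
apiVersion: v1
kind: ConfigMap
metadata:
  name: inferenceservice-config
  namespace: kserve
data:
  # Assumed entry: controls whether scraping/aggregation annotations are
  # added to InferenceService pods by default.
  metricsAggregator: |-
    {
      "enableMetricAggregation": "false",
      "enablePrometheusScraping": "false"
    }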
There is currently no unified set of metrics exported by the model servers; each model server may implement its own set of metrics to export.
Note
This annotation defines the Prometheus port and path, but it does not cause Prometheus to scrape anything on its own. Users must configure Prometheus to scrape the InferenceService's pods according to their Prometheus setup.
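As a sketch, a scrape job for a self-managed Prometheus might look like the following. It assumes the pods carry the standard prometheus.io/* annotations; adjust the relabeling rules (or use the Prometheus Operator's PodMonitor instead) to match the annotations actually present on your InferenceService pods.
scrape_configs:
  - job_name: kserve-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods that opt in via annotation.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Use the annotated metrics path if one is set.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Rewrite the scrape address to use the annotated port.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2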
Metrics for lgbserver, paddleserver, pmmlserver, sklearnserver, xgbserver, custom transformer/predictor
Prometheus latency histograms are emitted for each of the steps (pre-processing, post-processing, explain, predict). In addition, the latency of each step is logged per request. See also the model server Prometheus label definitions and metric implementation.
Metric Name | Description | Type |
---|---|---|
request_preprocess_seconds | pre-processing request latency | Histogram |
request_explain_seconds | explain request latency | Histogram |
request_predict_seconds | prediction request latency | Histogram |
request_postprocess_seconds | post-processing request latency | Histogram |
Other serving runtime metrics
Some model servers define their own metrics.
- mlserver
- torchserve
- triton
- tensorflow (see GitHub issue #2462)
Exporting metrics
Exporting metrics in serverless mode requires the queue-proxy extension image. For more information on how to export metrics, see the Queue Proxy Extension documentation.
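As a rough sketch, enabling the extension typically means pointing Knative at the KServe queue-proxy extension image in the config-deployment ConfigMap. The key name (queue-sidecar-image here; some Knative versions use queueSidecarImage) and the image tag are assumptions, so check the Queue Proxy Extension documentation for your release.
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-deployment
  namespace: knative-serving
data:
  # Assumed key name and image tag; pin the qpext version matching your KServe release.
  queue-sidecar-image: kserve/qpext:latest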
Knative/Queue-Proxy metrics
The queue-proxy emits metrics by default on port 9091. If metric aggregation is enabled with the queue-proxy extension, the aggregated metrics are served on port 9088 by default. See the Knative documentation (and the additional metrics defined in the code) for more information about the metrics queue-proxy exposes.
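For example, a sketch of an InferenceService that opts in to metric aggregation, reusing the example from above. The serving.kserve.io/enable-metric-aggregation annotation name is taken from the queue-proxy extension docs; verify it against your KServe version.
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-irisv2"
  annotations:
    # Assumed annotations: expose application metrics and aggregate them with
    # queue-proxy metrics on the aggregate port (9088 by default).
    serving.kserve.io/enable-metric-aggregation: "true"
    serving.kserve.io/enable-prometheus-scraping: "true"
spec:
  predictor:
    sklearn:
      protocolVersion: v2
      storageUri: "gs://seldon-models/sklearn/iris"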