Prometheus Metrics
Exposing a Prometheus metrics port
All supported serving runtimes can export Prometheus metrics on a specified port in the InferenceService's pod. The appropriate port for each model server is defined in the kserve/config/runtimes YAML files. For example, torchserve defines its Prometheus port as 8082 in kserve-torchserve.yaml.
metadata:
  name: kserve-torchserve
spec:
  annotations:
    prometheus.kserve.io/port: '8082'
    prometheus.kserve.io/path: "/metrics"
If needed, this value can be overridden in the InferenceService YAML.
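For example, a minimal sketch of such an override: the annotations are placed on the InferenceService metadata so they are propagated to the pod. The service name, predictor spec, and storageUri below are placeholders, and the port value must match the port your model server actually serves metrics on.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: torchserve-metrics-override
  annotations:
    # Override the runtime defaults for this service only; the port must match
    # the port the model server actually exposes metrics on.
    prometheus.kserve.io/port: '8082'
    prometheus.kserve.io/path: "/metrics"
spec:
  predictor:
    pytorch:
      # Placeholder: point this at your own model.
      storageUri: "gs://<your-bucket>/torchserve-model"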
To enable Prometheus metrics, add the annotation serving.kserve.io/enable-prometheus-scraping to the InferenceService YAML.
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-irisv2"
  annotations:
    serving.kserve.io/enable-prometheus-scraping: "true"
spec:
  predictor:
    sklearn:
      protocolVersion: v2
      storageUri: "gs://seldon-models/sklearn/iris"
The default value for serving.kserve.io/enable-prometheus-scraping can be set in the inferenceservice-config ConfigMap. See the docs for more info.
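For illustration, a sketch of where that default lives; the metricsAggregator entry and its field names are assumed from recent KServe releases, so verify them against the inferenceservice-config ConfigMap in your installation.
apiVersion: v1
kind: ConfigMap
metadata:
  name: inferenceservice-config
  namespace: kserve
data:
  # Assumed entry: controls whether scraping/aggregation annotations are
  # added to InferenceService pods by default.
  metricsAggregator: |-
    {
      "enableMetricAggregation": "false",
      "enablePrometheusScraping": "false"
    }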
There is currently no unified set of metrics exported by the model servers; each model server may implement its own set of metrics to export.
Note
This annotation defines the Prometheus port and path, but it does not cause Prometheus to scrape anything on its own. Users must configure Prometheus to scrape the InferenceService's pods according to their Prometheus setup.
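As a sketch, a scrape job for a self-managed Prometheus might look like the following. It assumes the pods carry the standard prometheus.io/* annotations; adjust the relabeling rules (or use the Prometheus Operator's PodMonitor instead) to match the annotations actually present on your InferenceService pods.
scrape_configs:
  - job_name: kserve-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods that opt in via annotation.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Use the annotated metrics path if one is set.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Rewrite the scrape address to use the annotated port.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2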
Metrics for lgbserver, paddleserver, pmmlserver, sklearnserver, xgbserver, custom transformer/predictor
Prometheus latency histograms are emitted for each of the steps (pre-processing, post-processing, explain, predict). In addition, the latency of each step is logged per request. See also the model server Prometheus label definitions and metric implementation.
Metric Name | Description | Type |
---|---|---|
request_preprocess_seconds | pre-processing request latency | Histogram |
request_explain_seconds | explain request latency | Histogram |
request_predict_seconds | prediction request latency | Histogram |
request_postprocess_seconds | post-processing request latency | Histogram |
Other serving runtime metrics
Some model servers define their own metrics.
- mlserver
- torchserve
- triton
- tensorflow (see GitHub issue #2462)
Exporting metrics
Exporting metrics in serverless mode requires the queue-proxy extension image. For more information on how to export metrics, see the Queue Proxy Extension documentation.
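As a rough sketch, enabling the extension typically means pointing Knative at the KServe queue-proxy extension image in the config-deployment ConfigMap. The key name (queue-sidecar-image here; some Knative versions use queueSidecarImage) and the image tag are assumptions, so check the Queue Proxy Extension documentation for your release.
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-deployment
  namespace: knative-serving
data:
  # Assumed key name and image tag; pin the qpext version matching your KServe release.
  queue-sidecar-image: kserve/qpext:latest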
Knative/Queue-Proxy metrics
The queue-proxy emits metrics by default on port 9091. If metric aggregation is enabled with the queue-proxy extension, the aggregated metrics are served on port 9088 by default. See the Knative documentation (and the additional metrics defined in the code) for more information about the metrics queue-proxy exposes.
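For example, a sketch of an InferenceService that opts in to metric aggregation, reusing the example from above. The serving.kserve.io/enable-metric-aggregation annotation name is taken from the queue-proxy extension docs; verify it against your KServe version.
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-irisv2"
  annotations:
    # Assumed annotations: expose application metrics and aggregate them with
    # queue-proxy metrics on the aggregate port (9088 by default).
    serving.kserve.io/enable-metric-aggregation: "true"
    serving.kserve.io/enable-prometheus-scraping: "true"
spec:
  predictor:
    sklearn:
      protocolVersion: v2
      storageUri: "gs://seldon-models/sklearn/iris"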