Deploying Models

Deploy trained models for real-time predictions

This guide covers the different ways to deploy trained models in KLearn for serving predictions.

Deployment Options

KLearn supports two deployment methods:

Method            Use Case               Features
KLearn Serving    Development, testing   Simple, lightweight, fast iteration
KServe            Production             Autoscaling, canary, GPU support

KLearn Serving

KLearn Serving is a lightweight FastAPI-based serving solution perfect for:

  • Local development and testing
  • Simple use cases
  • Quick deployment iterations

Deploy via Dashboard

  1. Navigate to Models
  2. Find your trained model
  3. Click Deploy
  4. Select KLearn as deployment type
  5. Set replicas (1-3 for development)
  6. Click Deploy

Deploy via API

curl -X POST http://localhost:8000/api/v1/models/{model_name}/deploy \
  -H "Content-Type: application/json" \
  -d '{
    "deployment_type": "klearn",
    "replicas": 2
  }'
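
The same call can be scripted. Below is a minimal Python sketch (using the requests library) that triggers the deployment and then polls the deployments endpoint described later in this guide; the model name and the model_name/status fields in the poll loop are assumptions about the response shape, not documented fields.

# Minimal Python sketch of the deploy call above, followed by polling the
# deployments endpoint (see "Check Deployment Status" below).
# The model name and the "model_name"/"status" fields are assumptions.
import time
import requests

BASE = "http://localhost:8000/api/v1"
MODEL = "churn-model"  # replace with your trained model's name

# Trigger the deployment (same payload as the curl example above)
resp = requests.post(
    f"{BASE}/models/{MODEL}/deploy",
    json={"deployment_type": "klearn", "replicas": 2},
    timeout=30,
)
resp.raise_for_status()

# Poll until the deployment looks ready (field names are assumed)
for _ in range(30):  # wait up to ~5 minutes
    deployments = requests.get(f"{BASE}/deployments", timeout=10).json()
    mine = [d for d in deployments if d.get("model_name") == MODEL]
    if mine and mine[0].get("status") == "ready":
        print("Deployment is ready")
        break
    time.sleep(10)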

Architecture

Client → Gateway API → HTTPRoute → KLearn Serving Pod → MinIO (model)

The serving pod (a minimal sketch follows this list):

  1. Loads model from MinIO on startup
  2. Exposes /predict endpoint
  3. Handles JSON input/output
  4. Returns predictions and probabilities
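
For illustration, here is a minimal sketch of what such a serving pod can look like, assuming FastAPI with boto3 for MinIO access; the bucket name, object key, and environment variables are placeholders, not KLearn's actual implementation.

# Illustrative sketch of a KLearn-style serving pod: a FastAPI app that loads
# a pickled model from MinIO (via its S3 API) at startup and serves /predict.
# Bucket, object key, and env var names are assumptions for illustration only.
import os
import pickle

import boto3
import pandas as pd
from fastapi import FastAPI

app = FastAPI()
model = None

@app.on_event("startup")
def load_model():
    global model
    s3 = boto3.client(
        "s3",
        endpoint_url=os.environ.get("MINIO_ENDPOINT", "http://minio:9000"),
        aws_access_key_id=os.environ["MINIO_ACCESS_KEY"],
        aws_secret_access_key=os.environ["MINIO_SECRET_KEY"],
    )
    obj = s3.get_object(Bucket="models", Key=os.environ["MODEL_KEY"])
    model = pickle.loads(obj["Body"].read())

@app.post("/predict")
def predict(payload: dict):
    frame = pd.DataFrame(payload["instances"])
    return {
        "predictions": model.predict(frame).tolist(),
        "probabilities": model.predict_proba(frame).tolist(),
    }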

KServe

KServe is a Kubernetes-native model serving platform that provides:

  • Autoscaling: Scale to zero, scale based on load
  • Canary deployments: Gradual rollout of new versions
  • Multi-model serving: Serve multiple models efficiently
  • GPU support: For deep learning models
  • Request batching: Improved throughput

Prerequisites

KServe must be installed on your cluster. KLearn's Helm chart can install it:

# values.yaml
kserve:
  enabled: true

Deploy via Dashboard

  1. Navigate to Models
  2. Click Deploy on your model
  3. Select KServe as deployment type
  4. Configure options:
    • Replicas: min/max for autoscaling
    • Runtime: flaml (for FLAML models)
  5. Click Deploy

Deploy via API

curl -X POST http://localhost:8000/api/v1/models/{model_name}/deploy \
  -H "Content-Type: application/json" \
  -d '{
    "deployment_type": "kserve",
    "replicas": 2,
    "runtime": "flaml",
    "autoscaling": {
      "min_replicas": 1,
      "max_replicas": 10,
      "target_utilization": 70
    }
  }'

Custom Runtime

KLearn provides a custom FLAML runtime for KServe:

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: flaml-runtime
spec:
  containers:
  - name: kserve-container
    image: localhost:5001/klearn/flaml-runtime:latest
    env:
    - name: PROTOCOL
      value: v2

Making Predictions

Request Format

curl -X POST http://{endpoint}/predict \
  -H "Content-Type: application/json" \
  -d '{
    "instances": [
      {"feature1": 1.0, "feature2": "value"},
      {"feature1": 2.0, "feature2": "other"}
    ]
  }'

Response Format

{
  "predictions": [0, 1],
  "probabilities": [
    [0.85, 0.15],
    [0.20, 0.80]
  ],
  "model_name": "churn-model",
  "model_version": "v1"
}
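
A simple Python client for these request and response formats might look like the following; the endpoint URL is a placeholder for your deployment's address.

# Send two instances to /predict and pair each input with its prediction.
import requests

ENDPOINT = "http://localhost:8080"  # replace with your deployment's endpoint

payload = {
    "instances": [
        {"feature1": 1.0, "feature2": "value"},
        {"feature1": 2.0, "feature2": "other"},
    ]
}

resp = requests.post(f"{ENDPOINT}/predict", json=payload, timeout=30)
resp.raise_for_status()
result = resp.json()

for instance, pred, probs in zip(
    payload["instances"], result["predictions"], result["probabilities"]
):
    print(instance, "->", pred, "class probabilities:", probs)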

Batch Predictions

For large batches, use the batch endpoint:

curl -X POST http://{endpoint}/predict/batch \
  -H "Content-Type: application/json" \
  -d '{
    "instances": [
      // Up to 1000 instances
    ]
  }'
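
For datasets larger than a single request allows, a client can split the work into chunks of up to 1000 instances. A sketch, again with the endpoint URL as a placeholder:

# Score a large list of instances by calling /predict/batch in chunks of
# up to 1000 instances (the limit noted above).
import requests

ENDPOINT = "http://localhost:8080"  # replace with your deployment's endpoint
BATCH_SIZE = 1000

def predict_in_batches(instances):
    predictions = []
    for start in range(0, len(instances), BATCH_SIZE):
        chunk = instances[start:start + BATCH_SIZE]
        resp = requests.post(
            f"{ENDPOINT}/predict/batch",
            json={"instances": chunk},
            timeout=120,
        )
        resp.raise_for_status()
        predictions.extend(resp.json()["predictions"])
    return predictions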

Monitoring Deployments

Check Deployment Status

# Via API
curl http://localhost:8000/api/v1/deployments

# Via kubectl
kubectl get deployment -n klearn -l klearn.dev/model-name={model}
kubectl get inferenceservice -n klearn

View Logs

# KLearn serving logs
kubectl logs -n klearn -l app={model}-serving

# KServe logs
kubectl logs -n klearn -l serving.kserve.io/inferenceservice={model-name}

Metrics

KLearn exposes Prometheus metrics (a query example follows this list):

  • klearn_predictions_total: Total predictions made
  • klearn_prediction_latency_seconds: Prediction latency histogram
  • klearn_model_load_time_seconds: Model loading time
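
These metrics can be read from Prometheus, for example via its HTTP query API. A sketch, assuming Prometheus is reachable at the URL below and scrapes the serving pods:

# Query Prometheus for request rate and p95 prediction latency.
# The Prometheus URL is an assumption about your monitoring setup.
import requests

PROMETHEUS = "http://localhost:9090"  # adjust to your Prometheus service

def query(expr):
    resp = requests.get(
        f"{PROMETHEUS}/api/v1/query", params={"query": expr}, timeout=10
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Predictions per second over the last 5 minutes
print(query("rate(klearn_predictions_total[5m])"))

# 95th percentile prediction latency over the last 5 minutes
print(query(
    "histogram_quantile(0.95, rate(klearn_prediction_latency_seconds_bucket[5m]))"
))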

Scaling

Manual Scaling

# Via API
curl -X PATCH http://localhost:8000/api/v1/deployments/{name}/scale \
  -H "Content-Type: application/json" \
  -d '{"replicas": 5}'

# Via kubectl
kubectl scale deployment {name}-serving -n klearn --replicas=5

Autoscaling (KServe only)

Configure in the deployment:

{
  "autoscaling": {
    "min_replicas": 1,
    "max_replicas": 10,
    "target_utilization": 70,
    "scale_down_delay": "5m"
  }
}

Updating Deployments

Rolling Update

When you deploy a new model version:

  1. New pods are created with the new model
  2. Traffic gradually shifts to new pods
  3. Old pods are terminated

# Redeploy with new model
curl -X POST http://localhost:8000/api/v1/models/{new-model}/deploy \
  -H "Content-Type: application/json" \
  -d '{"deployment_type": "klearn", "replicas": 2}'

Canary Deployment (KServe)

Gradually shift traffic:

{
  "canary": {
    "traffic_percent": 10,
    "model_name": "new-model-version"
  }
}
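
A staged rollout can be scripted by raising traffic_percent in steps and pausing to watch metrics between steps. The sketch below assumes the canary block is accepted by the same deploy endpoint shown earlier; your setup may require additional deploy fields.

# Staged canary rollout sketch: increase traffic_percent step by step.
# Endpoint and payload shape beyond the documented canary block are assumptions.
import time
import requests

BASE = "http://localhost:8000/api/v1"
MODEL = "churn-model"         # currently deployed model
CANARY = "new-model-version"  # candidate model to roll out

for percent in (10, 25, 50, 100):
    resp = requests.post(
        f"{BASE}/models/{MODEL}/deploy",
        json={"canary": {"traffic_percent": percent, "model_name": CANARY}},
        timeout=30,
    )
    resp.raise_for_status()
    print(f"Canary receiving {percent}% of traffic")
    time.sleep(600)  # observe error rates and latency before the next step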

Undeploying

Via Dashboard

  1. Navigate to Deployments
  2. Find your deployment
  3. Click Delete and confirm in the dialog

Via API

curl -X DELETE http://localhost:8000/api/v1/deployments/{name}

Troubleshooting

Pod not starting

kubectl describe pod -n klearn -l app={model}-serving

Common issues:

  • ImagePullBackOff: Image not in registry
  • CrashLoopBackOff: Check logs for errors
  • Pending: Not enough resources

Model not loading

Check serving logs:

kubectl logs -n klearn -l app={model}-serving

Common issues:

  • MinIO connection: Check credentials
  • Model file missing: Verify path in MinIO
  • Incompatible model: Check model format

High latency

  • Increase replicas
  • Check resource limits
  • Enable request batching
  • Consider GPU for large models
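
Before scaling up, a quick client-side timing check can confirm whether the latency originates at the model. A minimal sketch with a placeholder endpoint and sample payload:

# Time a series of /predict calls to get a rough latency baseline.
import time
import statistics
import requests

ENDPOINT = "http://localhost:8080"  # replace with your deployment's endpoint
sample = {"instances": [{"feature1": 1.0, "feature2": "value"}]}

timings = []
for _ in range(20):
    start = time.perf_counter()
    requests.post(f"{ENDPOINT}/predict", json=sample, timeout=30).raise_for_status()
    timings.append(time.perf_counter() - start)

print(f"median {statistics.median(timings) * 1000:.1f} ms, "
      f"max {max(timings) * 1000:.1f} ms")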