Deploying Models
Deploy trained models for real-time predictions
This guide covers the different ways to deploy trained models in KLearn for serving predictions.
Deployment Options
KLearn supports two deployment methods:
| Method | Use Case | Features |
|---|---|---|
| KLearn Serving | Development, testing | Simple, lightweight, fast iteration |
| KServe | Production | Autoscaling, canary, GPU support |
KLearn Serving (Recommended for Development)
KLearn Serving is a lightweight FastAPI-based serving solution perfect for:
- Local development and testing
- Simple use cases
- Quick deployment iterations
Deploy via Dashboard
1. Navigate to Models
2. Find your trained model
3. Click Deploy
4. Select KLearn as the deployment type
5. Set replicas (1-3 for development)
6. Click Deploy
Deploy via API
curl -X POST http://localhost:8000/api/v1/models/{model_name}/deploy \
  -H "Content-Type: application/json" \
  -d '{
    "deployment_type": "klearn",
    "replicas": 2
  }'
Architecture
Client → Gateway API → HTTPRoute → KLearn Serving Pod → MinIO (model)
The serving pod:
- Loads model from MinIO on startup
- Exposes a /predict endpoint
- Handles JSON input/output
- Returns predictions and probabilities
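To confirm the route and a serving pod are actually running, you can list them with kubectl. This is a quick check, assuming the HTTPRoute lives in the klearn namespace and the pod carries the app={model-name}-serving label used in the monitoring examples below:
# Verify the HTTPRoute and the serving pod for the model exist
kubectl get httproute -n klearn
kubectl get pods -n klearn -l app={model-name}-serving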
KServe (Recommended for Production)
KServe is a Kubernetes-native model serving platform that provides:
- Autoscaling: Scale to zero, scale based on load
- Canary deployments: Gradual rollout of new versions
- Multi-model serving: Serve multiple models efficiently
- GPU support: For deep learning models
- Request batching: Improved throughput
Prerequisites
KServe must be installed on your cluster. KLearn's Helm chart can install it:
# values.yaml
kserve:
  enabled: true
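With that flag set, apply it by upgrading the KLearn release. This is a sketch; the release and chart names below are placeholders for your own install:
# Upgrade the KLearn release with KServe enabled (release/chart names are illustrative)
helm upgrade --install klearn klearn/klearn \
  -n klearn \
  -f values.yaml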
Deploy via Dashboard
1. Navigate to Models
2. Click Deploy on your model
3. Select KServe as the deployment type
4. Configure options:
   - Replicas: min/max for autoscaling
   - Runtime: flaml (for FLAML models)
5. Click Deploy
Deploy via API
curl -X POST http://localhost:8000/api/v1/models/{model_name}/deploy \
  -H "Content-Type: application/json" \
  -d '{
    "deployment_type": "kserve",
    "replicas": 2,
    "runtime": "flaml",
    "autoscaling": {
      "min_replicas": 1,
      "max_replicas": 10,
      "target_utilization": 70
    }
  }'
Custom Runtime
KLearn provides a custom FLAML runtime for KServe:
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: flaml-runtime
spec:
  containers:
    - name: kserve-container
      image: localhost:5001/klearn/flaml-runtime:latest
      env:
        - name: PROTOCOL
          value: v2
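If you need to register or update the runtime manually, save the manifest and apply it with kubectl. A sketch, assuming the file is saved as flaml-runtime.yaml:
# Apply the runtime manifest and confirm it is registered
kubectl apply -f flaml-runtime.yaml
kubectl get clusterservingruntime flaml-runtime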
Making Predictions
Request Format
curl -X POST http://{endpoint}/predict \
  -H "Content-Type: application/json" \
  -d '{
    "instances": [
      {"feature1": 1.0, "feature2": "value"},
      {"feature1": 2.0, "feature2": "other"}
    ]
  }'
Response Format
{
  "predictions": [0, 1],
  "probabilities": [
    [0.85, 0.15],
    [0.20, 0.80]
  ],
  "model_name": "churn-model",
  "model_version": "v1"
}
Batch Predictions
For large batches, use the batch endpoint:
curl -X POST http://{endpoint}/predict/batch \
  -H "Content-Type: application/json" \
  -d '{
    "instances": [
      // Up to 1000 instances
    ]
  }'
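Before sending, it can help to confirm a saved payload stays within the 1000-instance limit. A quick check with jq, assuming the request body is stored in payload.json:
# Count the instances in the payload; keep this at or below 1000
jq '.instances | length' payload.json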
Monitoring Deployments
Check Deployment Status
# Via API
curl http://localhost:8000/api/v1/deployments
# Via kubectl
kubectl get deployment -n klearn -l klearn.dev/model-name={model}
kubectl get inferenceservice -n klearn
View Logs
# KLearn serving logs
kubectl logs -n klearn -l app={model-name}-serving
# KServe logs
kubectl logs -n klearn -l serving.kserve.io/inferenceservice={model-name}
Metrics
KLearn exposes Prometheus metrics:
- klearn_predictions_total: Total predictions made
- klearn_prediction_latency_seconds: Prediction latency histogram
- klearn_model_load_time_seconds: Model loading time
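You can inspect these series directly by scraping the serving endpoint. A sketch, assuming metrics are exposed on the standard /metrics path:
# Fetch Prometheus metrics from the serving endpoint and filter KLearn series
curl -s http://{endpoint}/metrics | grep '^klearn_'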
Scaling
Manual Scaling
# Via API
curl -X PATCH http://localhost:8000/api/v1/deployments/{name}/scale \
  -H "Content-Type: application/json" \
  -d '{"replicas": 5}'
# Via kubectl
kubectl scale deployment {name}-serving -n klearn --replicas=5
Autoscaling (KServe only)
Configure in the deployment:
{
  "autoscaling": {
    "min_replicas": 1,
    "max_replicas": 10,
    "target_utilization": 70,
    "scale_down_delay": "5m"
  }
}
Updating Deployments
Rolling Update
When you deploy a new model version:
1. New pods are created with the new model
2. Traffic gradually shifts to the new pods
3. Old pods are terminated
# Redeploy with new model
curl -X POST http://localhost:8000/api/v1/models/{new-model}/deploy \
  -H "Content-Type: application/json" \
  -d '{"deployment_type": "klearn", "replicas": 2}'
Canary Deployment (KServe)
Gradually shift traffic:
{
  "canary": {
    "traffic_percent": 10,
    "model_name": "new-model-version"
  }
}
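The canary block is passed with the other deployment options. A sketch, assuming the deploy API accepts the canary field alongside the fields shown earlier:
curl -X POST http://localhost:8000/api/v1/models/{model_name}/deploy \
  -H "Content-Type: application/json" \
  -d '{
    "deployment_type": "kserve",
    "replicas": 2,
    "canary": {
      "traffic_percent": 10,
      "model_name": "new-model-version"
    }
  }'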
Undeploying
Via Dashboard
1. Navigate to Deployments
2. Find your deployment
3. Click Delete and confirm in the dialog
Via API
curl -X DELETE http://localhost:8000/api/v1/deployments/{name}
Troubleshooting
Pod not starting
kubectl describe pod -n klearn -l app={model}-serving
Common issues:
- ImagePullBackOff: Image not in registry
- CrashLoopBackOff: Check logs for errors
- Pending: Not enough resources
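Recent cluster events often pinpoint the cause faster than pod status alone. This uses only standard kubectl:
# List recent events in the klearn namespace, newest last
kubectl get events -n klearn --sort-by=.lastTimestamp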
Model not loading
Check serving logs:
kubectl logs -n klearn -l app={model}-serving
Common issues:
- MinIO connection: Check credentials
- Model file missing: Verify path in MinIO
- Incompatible model: Check model format
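To confirm the model artifact actually exists in MinIO, one option is to port-forward the MinIO service and list the bucket with the mc client. A sketch in which the service name, port, credentials, and bucket path are all assumptions to adjust for your install:
# Port-forward MinIO locally (service name and port are assumptions)
kubectl port-forward -n klearn svc/minio 9000:9000 &

# Point the MinIO client at the forwarded port and list the model artifacts
mc alias set klearn-minio http://localhost:9000 {access_key} {secret_key}
mc ls klearn-minio/{models-bucket}/{model_name}/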
High latency
- Increase replicas
- Check resource limits
- Enable request batching
- Consider GPU for large models
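Before raising replicas or limits, it helps to see what the serving pods currently request. Plain kubectl, using the same label as the log examples above:
# Show the resource requests/limits configured on the serving containers
kubectl get pods -n klearn -l app={model-name}-serving \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources}{"\n"}{end}'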