# How KLearn Works
This page explains the end-to-end workflow of building, training, and deploying machine learning models with KLearn.
## The ML Lifecycle
KLearn simplifies the machine learning lifecycle into four main stages:
```
┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│     Data     │ -> │   Training   │ -> │    Model     │ -> │   Serving    │
│    Upload    │    │   (FLAML)    │    │   Registry   │    │ (Deployment) │
└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘
```
## Stage 1: Data Upload
What happens when you upload a dataset:

1. File validation: KLearn validates the CSV/Parquet file
2. Storage: the file is uploaded to MinIO (S3-compatible storage)
3. Metadata extraction: row count, column types, statistics
4. Database record: dataset metadata is stored in PostgreSQL
```http
# API call when you upload
POST /api/v1/datasets
Content-Type: multipart/form-data

file: your_data.csv
name: customer_churn
```
Supported formats:
- CSV (recommended)
- Parquet
- Excel (.xlsx)
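The same upload can be scripted. Here is a minimal Python sketch using the requests library; the base URL is an assumption matching the local backend address used by the deploy examples later on this page:

```python
import requests

# Upload a dataset via the endpoint shown above; http://localhost:8000
# is assumed to be where the KLearn backend is running
with open("your_data.csv", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/api/v1/datasets",
        files={"file": ("your_data.csv", f, "text/csv")},
        data={"name": "customer_churn"},
    )

resp.raise_for_status()
print(resp.json())  # the created dataset record (exact shape depends on the API)
```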
## Stage 2: Training with FLAML
### What is FLAML?
FLAML (Fast and Lightweight AutoML) is Microsoft's open-source AutoML library that:
- Automatically selects models: Tests XGBoost, LightGBM, RandomForest, etc.
- Optimizes hyperparameters: Finds the best configuration
- Respects time budgets: Stops when time runs out
- Handles preprocessing: Deals with missing values and encoding
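To make this concrete, here is roughly what the training container does, expressed with FLAML's public Python API (a sketch of the equivalent call, not KLearn's actual trainer code):

```python
import pandas as pd
from flaml import AutoML

# Load the dataset and separate the target column
df = pd.read_csv("churn.csv")
X, y = df.drop(columns=["churned"]), df["churned"]

automl = AutoML()
automl.fit(
    X_train=X,
    y_train=y,
    task="classification",  # maps to taskType in the KLearnJob below
    time_budget=3600,       # seconds; maps to flamlConfig.timeBudget
    metric="accuracy",      # maps to flamlConfig.metric
)

print(automl.best_estimator)  # e.g. "lgbm" or "xgboost"
print(automl.best_config)     # winning hyperparameter configuration
```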
Training workflow:

1. User creates experiment: specifies dataset, target, task type
2. Backend creates KLearnJob: a custom resource in Kubernetes
3. Operator creates training pod: with the FLAML container
4. FLAML trains models: tries multiple algorithms
5. Best model saved: to MinIO as a pickle file
6. Metrics logged: to MLflow for tracking
```yaml
# KLearnJob example
apiVersion: klearn.klearn.dev/v1alpha1
kind: KLearnJob
metadata:
  name: churn-prediction
spec:
  dataSource:
    uri: s3://klearn/datasets/churn.csv
  taskType: classification
  targetColumn: churned
  flamlConfig:
    timeBudget: 3600  # 1 hour
    metric: accuracy
```
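Once applied, the job can be inspected like any other custom resource. A minimal sketch with the official Kubernetes Python client; note that the `klearnjobs` plural and the `default` namespace are assumptions, so check your CRD and namespace:

```python
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# Fetch the KLearnJob and read its status; the plural "klearnjobs" is
# assumed from the kind, and "default" stands in for your namespace
job = api.get_namespaced_custom_object(
    group="klearn.klearn.dev",
    version="v1alpha1",
    namespace="default",
    plural="klearnjobs",
    name="churn-prediction",
)
print(job.get("status", {}).get("phase"))
```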
Supported task types:
| Task Type | Description | Metrics |
|---|---|---|
| classification | Binary or multi-class | accuracy, f1, roc_auc |
| regression | Continuous target | mse, mae, r2 |
| ts_forecast | Time series | mape, smape, rmse |
## Stage 3: Model Registry
When training completes successfully:

1. KLearnModel created: the operator creates a model resource
2. Model artifacts stored: in MinIO under the models/ prefix
3. Metrics recorded: from FLAML's best model
4. MLflow run linked: for experiment tracking
```yaml
# KLearnModel created automatically
apiVersion: klearn.klearn.dev/v1alpha1
kind: KLearnModel
metadata:
  name: churn-prediction-model
spec:
  sourceJob: churn-prediction
  modelUri: s3://klearn/models/churn-prediction/model.pkl
status:
  phase: Registered
  metrics:
    accuracy: 0.92
    f1_score: 0.89
```
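Because the artifact is a plain pickle file in S3-compatible storage, it can also be fetched directly, for example with boto3. The endpoint and credentials below are placeholder defaults for a local MinIO, not values KLearn guarantees:

```python
import pickle
import boto3

# MinIO speaks the S3 API; endpoint and credentials are placeholder
# local-MinIO defaults, adjust to your installation
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

# Bucket and key taken from the modelUri above
obj = s3.get_object(Bucket="klearn", Key="models/churn-prediction/model.pkl")
model = pickle.loads(obj["Body"].read())

# Assuming the pickled object exposes a sklearn-style predict
print(model.predict([[45, 24, 79.5]]))
```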
Model versioning:

- Each training job creates a new model version
- Models can be promoted through stages: development → staging → production
## Stage 4: Deployment & Serving
KLearn supports two deployment options:
### Option A: KServe InferenceService
For production workloads with:
- Autoscaling based on request volume
- Canary deployments for gradual rollout
- Request batching for efficiency
- GPU support
```bash
# Deploy via API
curl -X POST http://localhost:8000/api/v1/models/churn-model/deploy \
  -H "Content-Type: application/json" \
  -d '{
    "deployment_type": "kserve",
    "replicas": 2,
    "runtime": "flaml"
  }'
```
### Option B: KLearn Serving
Lightweight deployment for:
- Development and testing
- Simple REST API
- Quick iteration
```bash
# Deploy via API
curl -X POST http://localhost:8000/api/v1/models/churn-model/deploy \
  -H "Content-Type: application/json" \
  -d '{
    "deployment_type": "klearn",
    "replicas": 1
  }'
```
Making predictions:
```bash
# Send prediction request
curl -X POST http://localhost:8080/predict \
  -H "Content-Type: application/json" \
  -d '{
    "instances": [
      {"age": 45, "tenure": 24, "monthly_charges": 79.5}
    ]
  }'
```

Response:

```json
{
  "predictions": [1],
  "probabilities": [[0.15, 0.85]]
}
```
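The same prediction request from Python, for calling the endpoint from application code (host and payload match the curl example above):

```python
import requests

# Same payload as the curl example above
payload = {
    "instances": [
        {"age": 45, "tenure": 24, "monthly_charges": 79.5}
    ]
}

resp = requests.post("http://localhost:8080/predict", json=payload)
resp.raise_for_status()

result = resp.json()
print(result["predictions"])    # e.g. [1]
print(result["probabilities"])  # e.g. [[0.15, 0.85]]
```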
## Real-time Progress Tracking
During training, KLearn provides real-time updates:
- Pod status: Pending → Running → Succeeded
- Training progress: Based on time budget
- Current trial: Which model is being tested
- Best score so far: Current best performance
- Logs: Real-time trainer logs
The frontend polls the API every few seconds to show:
- Progress bar with time remaining
- Current best model and score
- Live log streaming
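A script can follow the same pattern as the frontend. In this sketch the /api/v1/experiments/... path and the response fields are hypothetical, shown only to illustrate the polling loop:

```python
import time
import requests

# NOTE: this endpoint path and the "phase"/"best_score" fields are
# hypothetical; they illustrate the polling pattern, not a documented API
STATUS_URL = "http://localhost:8000/api/v1/experiments/churn-prediction"

while True:
    status = requests.get(STATUS_URL).json()
    print(f"phase={status.get('phase')} best_score={status.get('best_score')}")
    if status.get("phase") in ("Succeeded", "Failed"):
        break
    time.sleep(5)  # the frontend polls every few seconds
```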
## Error Handling
KLearn handles failures gracefully:
Training failures:
- Pod crash: Operator detects and updates status
- OOM errors: Logged and reported
- Data issues: Validation errors shown to user
Deployment failures:
- Image pull errors: Automatically retried
- Resource constraints: Clear error messages
- Health check failures: Automatic rollback
## What makes KLearn different?
| Feature | Traditional MLOps | KLearn |
|---|---|---|
| Setup | Days/weeks | Minutes |
| ML expertise needed | High | Low |
| Infrastructure | Complex | Kubernetes-native |
| Vendor lock-in | Often | Never (100% OSS) |
| Cost | High | Self-hosted (infrastructure only) |