How It Works

Understanding the KLearn workflow

This page explains the end-to-end workflow of building, training, and deploying machine learning models with KLearn.

The ML Lifecycle

KLearn simplifies the machine learning lifecycle into four main stages:

┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│    Data      │ -> │   Training   │ -> │    Model     │ -> │   Serving    │
│   Upload     │    │   (FLAML)    │    │   Registry   │    │ (Deployment) │
└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘

Stage 1: Data Upload

What happens when you upload a dataset:

  1. File validation: KLearn validates the CSV/Parquet file
  2. Storage: File is uploaded to MinIO (S3-compatible storage)
  3. Metadata extraction: Row count, column types, statistics
  4. Database record: Dataset metadata stored in PostgreSQL

# API call when you upload
POST /api/v1/datasets
Content-Type: multipart/form-data

file: your_data.csv
name: customer_churn

Supported formats:

  • CSV (recommended)
  • Parquet
  • Excel (.xlsx)
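
The metadata-extraction step above can be sketched in a few lines of Python. This is a simplified, standard-library-only illustration; the `extract_metadata` helper and its type-inference rule are hypothetical, not KLearn's actual code:

```python
import csv
import io

def extract_metadata(csv_bytes: bytes) -> dict:
    """Derive row count and per-column types from an uploaded CSV.

    Simplified sketch of the metadata-extraction step; the real
    pipeline also computes summary statistics.
    """
    reader = csv.DictReader(io.StringIO(csv_bytes.decode("utf-8")))
    rows = list(reader)

    def infer_type(values):
        # A column is "numeric" only if every value parses as a float.
        try:
            for v in values:
                float(v)
            return "numeric"
        except ValueError:
            return "string"

    columns = {
        name: infer_type([row[name] for row in rows])
        for name in (reader.fieldnames or [])
    }
    return {"row_count": len(rows), "columns": columns}

data = b"age,tenure,plan,churned\n45,24,pro,1\n31,6,basic,0\n"
meta = extract_metadata(data)
```

The resulting dictionary is what would back the PostgreSQL metadata record in step 4.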

Stage 2: Training with FLAML

What is FLAML?

FLAML (Fast and Lightweight AutoML) is Microsoft's open-source AutoML library that:

  • Automatically selects models: Tests XGBoost, LightGBM, RandomForest, etc.
  • Optimizes hyperparameters: Finds the best configuration
  • Respects time budgets: Stops when time runs out
  • Handles preprocessing: Deals with missing values and encoding
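
The time-budget behavior can be sketched as a loop that stops searching once the budget is exhausted. This is a standard-library sketch of the idea, not FLAML's implementation; `timed_search` and the toy scores are hypothetical:

```python
import time

def timed_search(candidates, evaluate, time_budget_s):
    """Try candidate configurations until the time budget runs out,
    keeping the best score seen so far."""
    deadline = time.monotonic() + time_budget_s
    best_name, best_score = None, float("-inf")
    for name in candidates:
        if time.monotonic() >= deadline:
            break  # respect the budget: stop mid-search if needed
        score = evaluate(name)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score

# Toy evaluation: pretend each "model" has a fixed validation score.
scores = {"xgboost": 0.91, "lightgbm": 0.93, "rf": 0.88}
best, score = timed_search(scores, scores.__getitem__, time_budget_s=5)
```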

Training workflow:

  1. User creates experiment: Specifies dataset, target, task type
  2. Backend creates KLearnJob: Custom resource in Kubernetes
  3. Operator creates training pod: With FLAML container
  4. FLAML trains models: Tries multiple algorithms
  5. Best model saved: To MinIO as a pickle file
  6. Metrics logged: To MLflow for tracking

# KLearnJob example
apiVersion: klearn.klearn.dev/v1alpha1
kind: KLearnJob
metadata:
  name: churn-prediction
spec:
  dataSource:
    uri: s3://klearn/datasets/churn.csv
  taskType: classification
  targetColumn: churned
  flamlConfig:
    timeBudget: 3600  # 1 hour
    metric: accuracy

Supported task types:

Task Type        Description             Metrics
classification   Binary or multi-class   accuracy, f1, roc_auc
regression       Continuous target       mse, mae, r2
ts_forecast      Time series             mape, smape, rmse
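
The task-type/metric pairs can be enforced with a small lookup; this is a hypothetical validation helper mirroring the table above, not KLearn's API:

```python
# Metrics supported per task type, mirroring the table above.
SUPPORTED_METRICS = {
    "classification": {"accuracy", "f1", "roc_auc"},
    "regression": {"mse", "mae", "r2"},
    "ts_forecast": {"mape", "smape", "rmse"},
}

def validate_metric(task_type: str, metric: str) -> None:
    """Reject a flamlConfig whose metric does not fit the task type."""
    allowed = SUPPORTED_METRICS.get(task_type)
    if allowed is None:
        raise ValueError(f"unknown task type: {task_type}")
    if metric not in allowed:
        raise ValueError(f"{metric!r} is not valid for {task_type}")

validate_metric("classification", "accuracy")  # passes silently
```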

Stage 3: Model Registry

When training completes successfully:

  1. KLearnModel created: Operator creates model resource
  2. Model artifacts stored: In MinIO under models/ prefix
  3. Metrics recorded: From FLAML's best model
  4. MLflow run linked: For experiment tracking

# KLearnModel created automatically
apiVersion: klearn.klearn.dev/v1alpha1
kind: KLearnModel
metadata:
  name: churn-prediction-model
spec:
  sourceJob: churn-prediction
  modelUri: s3://klearn/models/churn-prediction/model.pkl
status:
  phase: Registered
  metrics:
    accuracy: 0.92
    f1_score: 0.89

Model versioning:

  • Each training job creates a new model version
  • Models can be promoted through stages: development → staging → production
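
The promotion rule can be sketched as an ordering check. The helper is hypothetical, and the assumption that a model advances one stage at a time is illustrative, not a documented constraint:

```python
# Promotion order for model stages, matching the pipeline above.
STAGES = ["development", "staging", "production"]

def can_promote(current: str, target: str) -> bool:
    """Allow a model to move exactly one stage forward at a time
    (an illustrative policy, not necessarily KLearn's)."""
    return STAGES.index(target) == STAGES.index(current) + 1

assert can_promote("development", "staging")
```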

Stage 4: Deployment & Serving

KLearn supports two deployment options:

Option A: KServe InferenceService

For production workloads with:

  • Autoscaling based on request volume
  • Canary deployments for gradual rollout
  • Request batching for efficiency
  • GPU support

# Deploy via API
curl -X POST http://localhost:8000/api/v1/models/churn-model/deploy \
  -H "Content-Type: application/json" \
  -d '{
    "deployment_type": "kserve",
    "replicas": 2,
    "runtime": "flaml"
  }'

Option B: KLearn Serving

Lightweight deployment for:

  • Development and testing
  • Simple REST API
  • Quick iteration

# Deploy via API
curl -X POST http://localhost:8000/api/v1/models/churn-model/deploy \
  -H "Content-Type: application/json" \
  -d '{
    "deployment_type": "klearn",
    "replicas": 1
  }'

Making predictions:

# Send prediction request
curl -X POST http://localhost:8080/predict \
  -H "Content-Type: application/json" \
  -d '{
    "instances": [
      {"age": 45, "tenure": 24, "monthly_charges": 79.5}
    ]
  }'

# Response
{
  "predictions": [1],
  "probabilities": [[0.15, 0.85]]
}
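
The same payload and response can be built and interpreted with Python's standard json module. This sketch only mirrors the field names from the example above; no live server is contacted:

```python
import json

# Build the request body shown in the curl example.
payload = json.dumps({
    "instances": [
        {"age": 45, "tenure": 24, "monthly_charges": 79.5}
    ]
})

# Interpret a response of the shape shown above.
response_body = '{"predictions": [1], "probabilities": [[0.15, 0.85]]}'
response = json.loads(response_body)

# probabilities[i][c] is the probability of class c for instance i,
# so the confidence in the prediction is the entry at the predicted class.
pred = response["predictions"][0]
confidence = response["probabilities"][0][pred]
```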

Real-time Progress Tracking

During training, KLearn provides real-time updates:

  1. Pod status: Pending → Running → Succeeded
  2. Training progress: Based on time budget
  3. Current trial: Which model is being tested
  4. Best score so far: Current best performance
  5. Logs: Real-time trainer logs

The frontend polls the API every few seconds to show:

  • Progress bar with time remaining
  • Current best model and score
  • Live log streaming
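
Because progress is derived from the time budget, the numbers behind the progress bar reduce to simple arithmetic. A sketch with illustrative function names:

```python
def training_progress(elapsed_s: float, time_budget_s: float) -> float:
    """Fraction of the time budget consumed, clamped to [0, 1]."""
    if time_budget_s <= 0:
        return 1.0
    return min(elapsed_s / time_budget_s, 1.0)

def remaining_seconds(elapsed_s: float, time_budget_s: float) -> float:
    """Time left on the budget, for the progress bar's countdown."""
    return max(time_budget_s - elapsed_s, 0.0)
```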

Error Handling

KLearn handles failures gracefully:

Training failures:

  • Pod crash: Operator detects and updates status
  • OOM errors: Logged and reported
  • Data issues: Validation errors shown to user

Deployment failures:

  • Image pull errors: Automatically retried
  • Resource constraints: Clear error messages
  • Health check failures: Automatic rollback

What makes KLearn different?

Feature               Traditional MLOps   KLearn
Setup                 Days/weeks          Minutes
ML expertise needed   High                Low
Infrastructure        Complex             Kubernetes-native
Vendor lock-in        Often               Never (100% OSS)
Cost                  High                Self-hosted