How It Works

Understanding the KLearn workflow

This page explains the end-to-end workflow of building, training, and deploying machine learning models with KLearn.

The ML Lifecycle

KLearn simplifies the machine learning lifecycle into four main stages:

┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│    Data      │ -> │   Training   │ -> │    Model     │ -> │   Serving    │
│   Upload     │    │   (FLAML)    │    │   Registry   │    │ (Deployment) │
└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘

Stage 1: Data Upload

What happens when you upload a dataset:

  1. File validation: KLearn validates the CSV/Parquet file
  2. Storage: File is uploaded to MinIO (S3-compatible storage)
  3. Metadata extraction: Row count, column types, statistics
  4. Database record: Dataset metadata stored in PostgreSQL

# API call when you upload
POST /api/v1/datasets
Content-Type: multipart/form-data

file: your_data.csv
name: customer_churn

Supported formats:

  • CSV (recommended)
  • Parquet
  • Excel (.xlsx)
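
The metadata-extraction step above can be sketched in a few lines of Python. This is a simplified, standard-library-only illustration; the `extract_metadata` helper and its type-inference rule are hypothetical, not KLearn's actual code:

```python
import csv
import io

def extract_metadata(csv_bytes: bytes) -> dict:
    """Derive row count and per-column types from an uploaded CSV.

    Simplified sketch of the metadata-extraction step; the real
    pipeline also computes summary statistics.
    """
    reader = csv.DictReader(io.StringIO(csv_bytes.decode("utf-8")))
    rows = list(reader)

    def infer_type(values):
        # A column is "numeric" only if every value parses as a float.
        try:
            for v in values:
                float(v)
            return "numeric"
        except ValueError:
            return "string"

    columns = {
        name: infer_type([row[name] for row in rows])
        for name in (reader.fieldnames or [])
    }
    return {"row_count": len(rows), "columns": columns}

data = b"age,tenure,plan,churned\n45,24,pro,1\n31,6,basic,0\n"
meta = extract_metadata(data)
```

The resulting dictionary is what would back the PostgreSQL metadata record in step 4.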

Stage 2: Training with FLAML

What is FLAML?

FLAML (Fast and Lightweight AutoML) is Microsoft's open-source AutoML library that:

  • Automatically selects models: Tests XGBoost, LightGBM, RandomForest, etc.
  • Optimizes hyperparameters: Finds the best configuration
  • Respects time budgets: Stops when time runs out
  • Handles preprocessing: Deals with missing values and encoding
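
The time-budget behavior can be sketched as a loop that stops searching once the budget is exhausted. This is a standard-library sketch of the idea, not FLAML's implementation; `timed_search` and the toy scores are hypothetical:

```python
import time

def timed_search(candidates, evaluate, time_budget_s):
    """Try candidate configurations until the time budget runs out,
    keeping the best score seen so far."""
    deadline = time.monotonic() + time_budget_s
    best_name, best_score = None, float("-inf")
    for name in candidates:
        if time.monotonic() >= deadline:
            break  # respect the budget: stop mid-search if needed
        score = evaluate(name)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score

# Toy evaluation: pretend each "model" has a fixed validation score.
scores = {"xgboost": 0.91, "lightgbm": 0.93, "rf": 0.88}
best, score = timed_search(scores, scores.__getitem__, time_budget_s=5)
```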

Training workflow:

  1. User creates experiment: Specifies dataset, target, task type
  2. Backend creates KLearnJob: Custom resource in Kubernetes
  3. Operator creates training pod: With FLAML container
  4. FLAML trains models: Tries multiple algorithms
  5. Best model saved: To MinIO as a pickle file
  6. Metrics logged: To MLflow for tracking

# KLearnJob example
apiVersion: klearn.klearn.dev/v1alpha1
kind: KLearnJob
metadata:
  name: churn-prediction
spec:
  dataSource:
    uri: s3://klearn/datasets/churn.csv
  taskType: classification
  targetColumn: churned
  flamlConfig:
    timeBudget: 3600  # 1 hour
    metric: accuracy

Supported task types:

Task Type        Description             Metrics
classification   Binary or multi-class   accuracy, f1, roc_auc
regression       Continuous target       mse, mae, r2
ts_forecast      Time series             mape, smape, rmse
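
The task-type/metric pairs can be enforced with a small lookup; this is a hypothetical validation helper mirroring the table above, not KLearn's API:

```python
# Metrics supported per task type, mirroring the table above.
SUPPORTED_METRICS = {
    "classification": {"accuracy", "f1", "roc_auc"},
    "regression": {"mse", "mae", "r2"},
    "ts_forecast": {"mape", "smape", "rmse"},
}

def validate_metric(task_type: str, metric: str) -> None:
    """Reject a flamlConfig whose metric does not fit the task type."""
    allowed = SUPPORTED_METRICS.get(task_type)
    if allowed is None:
        raise ValueError(f"unknown task type: {task_type}")
    if metric not in allowed:
        raise ValueError(f"{metric!r} is not valid for {task_type}")

validate_metric("classification", "accuracy")  # passes silently
```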

Stage 3: Model Registry

When training completes successfully:

  1. KLearnModel created: Operator creates model resource
  2. Model artifacts stored: In MinIO under models/ prefix
  3. Metrics recorded: From FLAML's best model
  4. MLflow run linked: For experiment tracking

# KLearnModel created automatically
apiVersion: klearn.klearn.dev/v1alpha1
kind: KLearnModel
metadata:
  name: churn-prediction-model
spec:
  sourceJob: churn-prediction
  modelUri: s3://klearn/models/churn-prediction/model.pkl
status:
  phase: Registered
  metrics:
    accuracy: 0.92
    f1_score: 0.89

Model versioning:

  • Each training job creates a new model version
  • Models can be promoted through stages: development → staging → production
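
The promotion rule can be sketched as an ordering check. The helper is hypothetical, and the assumption that a model advances one stage at a time is illustrative, not a documented constraint:

```python
# Promotion order for model stages, matching the pipeline above.
STAGES = ["development", "staging", "production"]

def can_promote(current: str, target: str) -> bool:
    """Allow a model to move exactly one stage forward at a time
    (an illustrative policy, not necessarily KLearn's)."""
    return STAGES.index(target) == STAGES.index(current) + 1

assert can_promote("development", "staging")
```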

Stage 4: Deployment & Serving

KLearn supports two deployment options:

Option A: KServe InferenceService

For production workloads with:

  • Autoscaling based on request volume
  • Canary deployments for gradual rollout
  • Request batching for efficiency
  • GPU support

# Deploy via API
curl -X POST http://localhost:8000/api/v1/models/churn-model/deploy \
  -H "Content-Type: application/json" \
  -d '{
    "deployment_type": "kserve",
    "replicas": 2,
    "runtime": "flaml"
  }'

Option B: KLearn Serving

Lightweight deployment for:

  • Development and testing
  • Simple REST API
  • Quick iteration

# Deploy via API
curl -X POST http://localhost:8000/api/v1/models/churn-model/deploy \
  -H "Content-Type: application/json" \
  -d '{
    "deployment_type": "klearn",
    "replicas": 1
  }'

Making predictions:

# Send prediction request
curl -X POST http://localhost:8080/predict \
  -H "Content-Type: application/json" \
  -d '{
    "instances": [
      {"age": 45, "tenure": 24, "monthly_charges": 79.5}
    ]
  }'

# Response
{
  "predictions": [1],
  "probabilities": [[0.15, 0.85]]
}
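
The same payload and response can be built and interpreted with Python's standard json module. This sketch only mirrors the field names from the example above; no live server is contacted:

```python
import json

# Build the request body shown in the curl example.
payload = json.dumps({
    "instances": [
        {"age": 45, "tenure": 24, "monthly_charges": 79.5}
    ]
})

# Interpret a response of the shape shown above.
response_body = '{"predictions": [1], "probabilities": [[0.15, 0.85]]}'
response = json.loads(response_body)

# probabilities[i][c] is the probability of class c for instance i,
# so the confidence in the prediction is the entry at the predicted class.
pred = response["predictions"][0]
confidence = response["probabilities"][0][pred]
```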

Real-time Progress Tracking

During training, KLearn provides real-time updates:

  1. Pod status: Pending → Running → Succeeded
  2. Training progress: Based on time budget
  3. Current trial: Which model is being tested
  4. Best score so far: Current best performance
  5. Logs: Real-time trainer logs

The frontend polls the API every few seconds to show:

  • Progress bar with time remaining
  • Current best model and score
  • Live log streaming
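
Because progress is derived from the time budget, the numbers behind the progress bar reduce to simple arithmetic. A sketch with illustrative function names:

```python
def training_progress(elapsed_s: float, time_budget_s: float) -> float:
    """Fraction of the time budget consumed, clamped to [0, 1]."""
    if time_budget_s <= 0:
        return 1.0
    return min(elapsed_s / time_budget_s, 1.0)

def remaining_seconds(elapsed_s: float, time_budget_s: float) -> float:
    """Time left on the budget, for the progress bar's countdown."""
    return max(time_budget_s - elapsed_s, 0.0)
```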

Error Handling

KLearn handles failures gracefully:

Training failures:

  • Pod crash: Operator detects and updates status
  • OOM errors: Logged and reported
  • Data issues: Validation errors shown to user

Deployment failures:

  • Image pull errors: Automatically retried
  • Resource constraints: Clear error messages
  • Health check failures: Automatic rollback

What makes KLearn different?

Feature               Traditional MLOps   KLearn
Setup                 Days/weeks          Minutes
ML expertise needed   High                Low
Infrastructure        Complex             Kubernetes-native
Vendor lock-in        Often               Never (100% OSS)
Cost                  High                Self-hosted