Model CRD

Every field on the Model spec.

apiVersion: krypton.ai/v1alpha1, kind: Model, namespaced (short name mdl).

The Model CRD declares “host this Hugging Face GGUF as an OpenAI-compatible model named X”. The controller turns it into a Deployment running llama.cpp’s HTTP server, plus a Service. The gateway aggregates every Model in the cluster at /v1/models and routes incoming /v1/chat/completions (etc.) requests by the model field in the body.

Minimal example

apiVersion: krypton.ai/v1alpha1
kind: Model
metadata:
  name: qwen2-0-5b
  namespace: models
spec:
  source:
    huggingface: Qwen/Qwen2.5-0.5B-Instruct-GGUF
    file: qwen2.5-0.5b-instruct-q4_k_m.gguf

That’s the smallest valid Model — every other field has a default.

Full example

apiVersion: krypton.ai/v1alpha1
kind: Model
metadata:
  name: qwen2-0-5b
  namespace: models
spec:
  # Weights
  source:
    huggingface: Qwen/Qwen2.5-0.5B-Instruct-GGUF
    file:        qwen2.5-0.5b-instruct-q4_k_m.gguf

  # Runtime
  runtime: llama.cpp           # only value supported today
  image: ""                    # blank = built-in llama.cpp:server image
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []

  # Networking
  port: 8080

  # Extra args appended to llama-server
  args:
    - "--ctx-size"
    - "4096"

  # Scaling (always-on; scale-to-zero on the roadmap)
  minReplicas: 1
  maxReplicas: 1

  # Pod
  resources:
    requests: { cpu: 500m, memory: 1Gi }
    limits:   { cpu: "2",   memory: 2Gi }
  env: []
  envFrom:
    - secretRef: { name: hf-token }   # for gated repos: HUGGINGFACE_HUB_TOKEN
  serviceAccountName: ""              # blank = auto-create

  # Lifecycle
  startupTimeout: 600s

Spec reference

Source

FieldTypeDefaultNotes
source.huggingfacestring (required)Hugging Face repo id, e.g. Qwen/Qwen2.5-0.5B-Instruct-GGUF
source.filestring (required)GGUF file within the repo to load

Gated repos require a token. Mount it as HUGGINGFACE_HUB_TOKEN via envFrom.

Runtime

FieldTypeDefaultNotes
runtimeenumllama.cppOnly llama.cpp is supported today
imagestringghcr.io/ggml-org/llama.cpp:serverOverride the runtime container image
imagePullPolicystringIfNotPresentStandard K8s pull policy
imagePullSecrets[]LocalObjectReferenceSame shape as a pod’s imagePullSecrets

Networking

FieldTypeDefaultNotes
portint32 (1–65535)8080Port llama-server listens on

Args

args is appended to the runtime entrypoint after the controller-owned flags (--host, --port, --hf-repo, --hf-file, --alias). Use it for tuning knobs like --ctx-size, --n-gpu-layers, --parallel. Don’t redefine the controller-owned flags from here.

Scaling

FieldTypeDefaultNotes
minReplicasint32 (≥ 1)1Models are always-on; zero floor is not supported
maxReplicasint32 (≥ 1)1Caps replicas; future autoscaler will honor this

Pod

FieldTypeDefaultNotes
resourcescorev1.ResourceRequirementsStandard pod resource block
env[]corev1.EnvVarPassed to the runtime container
envFrom[]corev1.EnvFromSourceUse for HUGGINGFACE_HUB_TOKEN on gated repos
serviceAccountNamestring""Empty = auto-created SA with minimal permissions

Lifecycle

FieldTypeDefaultNotes
startupTimeoutduration600sCold-pull grace. First start downloads weights from HF — keep generous

Status (read-only)

FieldTypeWritten by
phaseenumManager
replicasint32Manager
readyReplicasint32Manager
urlstringManager
observedGenerationint64Manager
conditions[]Cond.Manager

Phases

PhaseMeaning
PendingPod exists but not yet ready (likely pulling weights)
ReadyAt least one replica reports ready
FailedPersistent reconcile errors (e.g. crashloop)

status.url points at the in-cluster OpenAI base URL for the pod, e.g. http://qwen2-0-5b.models.svc:8080/v1. Most callers go through the gateway instead.

OpenAI compatibility

Once Ready, the gateway exposes the model under these standard paths:

MethodPathNotes
GET/v1/modelsLists every Model in the cluster
GET/v1/models/{id}OpenAI model card for one Model by name
POST/v1/chat/completionsRoutes by body model
POST/v1/completionsRoutes by body model
POST/v1/embeddingsRoutes by body model

The OpenAI-facing id is metadata.name. If two Model CRs share a name across namespaces, the gateway picks one and logs the collision — keep model names unique.