Model CRD

Every field on the Model spec.

apiVersion: krypton.ai/v1alpha1, kind: Model, namespaced (short name mdl).

The Model CRD declares “host this Hugging Face GGUF as an OpenAI-compatible model named X”. The controller turns it into a Deployment running llama.cpp’s HTTP server, plus a Service. The gateway aggregates every Model in the cluster at /v1/models and routes incoming /v1/chat/completions, /v1/completions, and /v1/embeddings requests by the model field in the request body.

Minimal example

apiVersion: krypton.ai/v1alpha1
kind: Model
metadata:
  name: qwen2-0-5b
  namespace: models
spec:
  source:
    huggingface: Qwen/Qwen2.5-0.5B-Instruct-GGUF
    file: qwen2.5-0.5b-instruct-q4_k_m.gguf

That’s the smallest valid Model: every other field is optional and falls back to its default.

Full example

apiVersion: krypton.ai/v1alpha1
kind: Model
metadata:
  name: qwen2-0-5b
  namespace: models
spec:
  # Weights
  source:
    huggingface: Qwen/Qwen2.5-0.5B-Instruct-GGUF
    file:        qwen2.5-0.5b-instruct-q4_k_m.gguf

  # Runtime
  runtime: llama.cpp           # only value supported today
  image: ""                    # blank = default image, ghcr.io/ggml-org/llama.cpp:server
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []

  # Networking
  port: 8080

  # Extra args appended to llama-server
  args:
    - "--ctx-size"
    - "4096"

  # Scaling (always-on; scale-to-zero on the roadmap)
  minReplicas: 1
  maxReplicas: 1

  # Pod
  resources:
    requests: { cpu: 500m, memory: 1Gi }
    limits:   { cpu: "2",   memory: 2Gi }
  env: []
  envFrom:
    - secretRef: { name: hf-token }   # for gated repos: HUGGINGFACE_HUB_TOKEN
  serviceAccountName: ""              # blank = auto-create

  # Lifecycle
  startupTimeout: 600s

Spec reference

Source

Field	Type	Default	Notes
`source.huggingface`	string (required)	—	Hugging Face repo id, e.g. `Qwen/Qwen2.5-0.5B-Instruct-GGUF`
`source.file`	string (required)	—	GGUF file within the repo to load

Gated repos require a Hugging Face access token. Expose it to the runtime container as the HUGGINGFACE_HUB_TOKEN environment variable via envFrom.
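As a sketch of the token setup, the Secret below stores the token under the key HUGGINGFACE_HUB_TOKEN, so that envFrom exposes it to the container under that variable name. The Secret name hf-token and the models namespace follow the examples on this page; the token value is a placeholder.

```yaml
# Illustrative Secret for gated Hugging Face repos.
apiVersion: v1
kind: Secret
metadata:
  name: hf-token
  namespace: models          # same namespace as the Model that references it
type: Opaque
stringData:
  # envFrom turns each key into an environment variable of the same name.
  HUGGINGFACE_HUB_TOKEN: "<your-hf-token>"   # placeholder, not a real token
```

The Model then picks it up with an envFrom entry of the form secretRef: { name: hf-token }, as in the full example.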

Runtime

Field	Type	Default	Notes
`runtime`	enum	`llama.cpp`	Only `llama.cpp` is supported today
`image`	string	`ghcr.io/ggml-org/llama.cpp:server`	Override the runtime container image
`imagePullPolicy`	string	`IfNotPresent`	Standard K8s pull policy
`imagePullSecrets`	`[]LocalObjectReference`	—	Same shape as a pod’s `imagePullSecrets`
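
A minimal sketch of overriding the runtime image, for instance to pull from a private mirror. The registry path and the regcred Secret name are hypothetical; only the field names come from the table above.

```yaml
apiVersion: krypton.ai/v1alpha1
kind: Model
metadata:
  name: qwen2-0-5b
  namespace: models
spec:
  source:
    huggingface: Qwen/Qwen2.5-0.5B-Instruct-GGUF
    file: qwen2.5-0.5b-instruct-q4_k_m.gguf
  # Hypothetical private mirror of the llama.cpp server image.
  image: registry.example.com/inference/llama.cpp:server
  imagePullPolicy: IfNotPresent
  imagePullSecrets:
    - name: regcred          # hypothetical registry Secret in the same namespace
```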

Networking

Field	Type	Default	Notes
`port`	int32 (1–65535)	`8080`	Port `llama-server` listens on

Args

args is appended to the runtime entrypoint after the controller-owned flags (--host, --port, --hf-repo, --hf-file, --alias). Use it for tuning knobs such as --ctx-size, --n-gpu-layers, and --parallel. Don’t redefine the controller-owned flags here.
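
For illustration, a spec fragment that sets the three tuning flags named above. The values are examples, not recommendations, and --n-gpu-layers is assumed to be paired with a GPU-enabled image and a GPU resource limit, neither of which is shown here.

```yaml
spec:
  args:
    # Each flag and its value are separate list items, as in the full example.
    - "--ctx-size"
    - "8192"
    - "--parallel"
    - "4"
    # Assumes a GPU-enabled runtime image; illustrative value.
    - "--n-gpu-layers"
    - "99"
```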

Scaling

Field	Type	Default	Notes
`minReplicas`	int32 (≥ 1)	`1`	Models are always-on; a floor of zero is not supported
`maxReplicas`	int32 (≥ 1)	`1`	Upper bound on replicas; a future autoscaler will honor this

Pod

Field	Type	Default	Notes
`resources`	`corev1.ResourceRequirements`	—	Standard pod resource block
`env`	`[]corev1.EnvVar`	—	Passed to the runtime container
`envFrom`	`[]corev1.EnvFromSource`	—	Use for `HUGGINGFACE_HUB_TOKEN` on gated repos
`serviceAccountName`	string	`""`	Empty = auto-created SA with minimal permissions

Lifecycle

Field	Type	Default	Notes
`startupTimeout`	duration	`600s`	Grace period for a cold start. The first start downloads the weights from Hugging Face, so keep this generous

Status (read-only)

Field	Type	Written by
`phase`	enum	Manager
`replicas`	int32	Manager
`readyReplicas`	int32	Manager
`url`	string	Manager
`observedGeneration`	int64	Manager
`conditions`	`[]Condition`	Manager

Phases

Phase	Meaning
`Pending`	Pod exists but is not yet ready (likely still pulling weights)
`Ready`	At least one replica reports ready
`Failed`	Persistent reconcile errors (e.g. crashloop)

status.url is the in-cluster OpenAI base URL of the Model’s Service, e.g. http://qwen2-0-5b.models.svc:8080/v1. Most callers go through the gateway instead.
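
As an illustration only, the status of the qwen2-0-5b example might look like this once a replica is serving. The values are assumed rather than captured from a cluster, and conditions is omitted because its condition types are not documented here.

```yaml
# Hypothetical status block; all values are assumed for illustration.
status:
  phase: Ready
  replicas: 1
  readyReplicas: 1
  url: http://qwen2-0-5b.models.svc:8080/v1
  observedGeneration: 1
```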

OpenAI compatibility

Once Ready, the gateway exposes the model under these standard paths:

Method	Path	Notes
`GET`	`/v1/models`	Lists every `Model` in the cluster
`GET`	`/v1/models/{id}`	OpenAI model card for one `Model` by name
`POST`	`/v1/chat/completions`	Routes by body `model`
`POST`	`/v1/completions`	Routes by body `model`
`POST`	`/v1/embeddings`	Routes by body `model`

The OpenAI-facing id is metadata.name. If two Model CRs share a name across namespaces, the gateway picks one and logs the collision, so keep model names unique across the cluster.
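
One way to avoid collisions, sketched under the assumption that two teams host the same weights in separate namespaces: give each Model a distinct metadata.name. The team-a and team-b namespaces and the name prefixes are hypothetical.

```yaml
# Same weights, two namespaces; distinct names keep gateway ids unambiguous.
apiVersion: krypton.ai/v1alpha1
kind: Model
metadata:
  name: team-a-qwen2-0-5b
  namespace: team-a          # hypothetical namespace
spec:
  source:
    huggingface: Qwen/Qwen2.5-0.5B-Instruct-GGUF
    file: qwen2.5-0.5b-instruct-q4_k_m.gguf
---
apiVersion: krypton.ai/v1alpha1
kind: Model
metadata:
  name: team-b-qwen2-0-5b
  namespace: team-b          # hypothetical namespace
spec:
  source:
    huggingface: Qwen/Qwen2.5-0.5B-Instruct-GGUF
    file: qwen2.5-0.5b-instruct-q4_k_m.gguf
```

Because the id is metadata.name, clients would then send team-a-qwen2-0-5b or team-b-qwen2-0-5b in the model field of the request body.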

Last modified May 27, 2026: feat: Model CRD for self-hosted LLMs (d864dac)