openova/platform/vllm
e3mrah f6757c7c93
feat(docs): lean documentation strategy — consolidate 16 docs into 7 canonical + 3 subdirs (#2094)
* docs(arch): consolidate ARCHITECTURE + PLATFORM-TECH-STACK + NAMING + EPICS-1-6 + BOOTSTRAP-KIT-EXPANSION → docs/ARCHITECTURE.md (lean doc strategy)

Single canonical "how OpenOva works" doc per founder's lean-doc strategy.
2926 source lines → 1110 consolidated lines, no semantic loss.

Sections:
 §1  High-level model (Catalyst/Sovereign/Org/Env/Application/Blueprint)
 §2  Repo layout
 §3  Tech stack by layer (CNI/GitOps/IaC/event-spine/data/secrets/identity/...)
 §4  Naming conventions (dimensions, patterns, labels, DOMAINS-CANON)
 §5  Catalyst control plane (rules, CRDs, controllers, cutover, identity, surfaces)
 §6  Per-host-cluster infrastructure
 §7  Application Blueprints
 §8  Multi-region topology (1 cpx52/region, WireGuard-over-public-IPs, ClusterMesh)
 §9  Bootstrap-kit slot ordering (full 48-slot canonical list)
 §10 EPIC-level design overview (EPIC-0 through EPIC-6)
 §11 Per-chart DESIGN.md inventory
 §12 OAM influence
 §13 Read further

Stale literal fixes:
 - omantel.openova.io → omantel.biz / <sovereign>.<tld> / t38.omani.works (7 instances)
 - SPIRE marked DEFERRED / opt-in only (PR #665, TBD-V29 #2055)
 - failover-controller marked REPLACED by bp-continuum

New PR refs wired into §3:
 - PR #665   SPIRE deferral
 - PR #2071  bp-cnpg-pair synchronous remote_apply (zero-tx-loss multi-region)
 - PR #2087  bp-cnpg-pair pre-merge guard
 - PR #2093  bp-cnpg-pair pre-merge guard

New stack components added to §3:
 - bp-cnpg-pair  (synchronous remote_apply ReplicaCluster across ClusterMesh)
 - bp-continuum  (lease-based failover orchestrator)
 - bp-self-sovereign-cutover (8-tether pivot, ADR-0002, Principle #11)

Source docs (to be deleted by orchestrator in final PR):
 - docs/PLATFORM-TECH-STACK.md
 - docs/NAMING-CONVENTION.md
 - docs/EPICS-1-6-unified-design.md
 - docs/BOOTSTRAP-KIT-EXPANSION-PLAN.md

* docs(principles): consolidate INVIOLABLE-PRINCIPLES + ANTI-PATTERN-CATALOG → docs/PRINCIPLES.md (lean doc strategy)

* docs(dod): consolidate 5-PILLAR-DOD + DOMAINS-CANON + SOVEREIGN-MULTI-REGION-DOD + PERSONAS-AND-JOURNEYS → docs/DOD.md (lean doc strategy)

* docs(runbooks+status+glossary): consolidate 5 runbooks → RUNBOOKS.md + refresh STATUS.md + fold banned-terms into GLOSSARY.md (lean doc strategy)

Part 1 — Runbook consolidation:
- NEW docs/RUNBOOKS.md with 7 numbered sections (provisioning, day-2 ops,
  Blueprint authoring, chart conventions, demo walk, failover, troubleshooting)
- Folds BLUEPRINT-AUTHORING / CHART-AUTHORING / DEMO-RUNBOOK /
  RUNBOOK-OPERATIONS / RUNBOOK-PROVISIONING into one canonical surface
- Documents dual-annotation requirement for charts with enabled.default: false
  (GUARD 1 #2087 no-upstream + GUARD 2 #2093 smoke-render) with bp-network-policies:1.0.1
  dead-reserve incident as the live evidence
- All admin.<fqdn> legacy URL refs → console.<fqdn>/bss (BSS lives in operator console)
- All openova.io / omantel.omani.works test commands → canonical t<NN>.omani.works
- Cites PRs #2076 (docs migration), #2082 (no-auto-close-keyword), #2087, #2093

Part 2 — STATUS.md refresh (renamed from IMPLEMENTATION-STATUS.md):
- Header dated 2026-05-20 (was 2026-04-29; 22 days stale per audit)
- Adds 🟦 CODE-COMPLETE state for "controllers + CRDs + tests landed,
  awaiting fresh-prov walk" (per 5-pillar DoD)
- Pillar 3 marked CODE-COMPLETE (PRs #2071/#2072/#2073/#2074/#2075/#2053)
- Adds 3 new CRDs verified in products/catalyst/chart/crds/:
  CNPGPair, PDM, Sandbox
- Sandbox controller chain CODE-COMPLETE
  (PRs #1615/#1618/#1621/#1622/#1626/#1631/#1632)
- SPIRE marked DEFERRED — opt-in only (PRs #665, #2056, #2061)
- New §6 CI / supply-chain guards table: hollow-chart (#2087),
  smoke-render (#2093), no-auto-close-keyword (#2082), observability-toggle,
  subchart 4-step, Flux version-pin replay
- New §9 Pillar-status table — Pillars 1/2/3/4 CODE-COMPLETE, Pillar 5 🚧
- Pillar 1 (PRs #2038 V18, #2043 V18-D), Pillar 2 (PR #2029 V20),
  Pillar 3 (per above), Pillar 4 (Sandbox chain)

Part 3 — GLOSSARY.md folded as single source of truth for banned terms:
- Header dated 2026-05-20, notes "single source of truth for banned terms"
  and "no separate BANNED-TERMS.md"
- Existing 11 banned-terms rows rewritten with italicized qualifiers
- NEW Forbidden test domains subsection:
  openova.io (mothership-only), omantel.openova.io (hallucinated),
  Nova Cloud (predecessor brand), eventforge.io (hallucinated),
  admin.<fqdn> (dead BSS URL)
- SPIFFE/SPIRE identity row + acronym row marked deferred per PR #665
  with TBD-V29 (#2055) re-introduction roadmap
- Cross-links updated: IMPLEMENTATION-STATUS → STATUS,
  SOVEREIGN-PROVISIONING + BLUEPRINT-AUTHORING → RUNBOOKS.md

CLAUDE.md NOT touched. Source files NOT deleted (orchestrator owns deletion).
No push, no PR. Manifest at /tmp/merge-D-runbooks-status-glossary-manifest.txt.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs: assemble lean doc strategy — delete legacy sources, move ledger/sessions/archive, ADR-0004, rewrite cross-refs

Per founder direction 2026-05-20 + user-global ~/.claude/CLAUDE.md §11.

This is the orchestrator commit on top of the four cherry-picked consolidation
commits (ARCHITECTURE, PRINCIPLES, DOD, RUNBOOKS+STATUS+GLOSSARY). It:

1. Deletes 15 legacy source docs (now folded into the 7 canonical):
   PLATFORM-TECH-STACK, NAMING-CONVENTION, EPICS-1-6-unified-design,
   BOOTSTRAP-KIT-EXPANSION-PLAN, INVIOLABLE-PRINCIPLES, ANTI-PATTERN-CATALOG,
   5-PILLAR-DOD, DOMAINS-CANON, SOVEREIGN-MULTI-REGION-DOD,
   PERSONAS-AND-JOURNEYS, BLUEPRINT-AUTHORING, CHART-AUTHORING,
   DEMO-RUNBOOK, RUNBOOK-OPERATIONS, RUNBOOK-PROVISIONING.

2. Moves transient + historical docs into proper subdirs:
   - docs/ledger/{TRUST,TRACKER}.md (cron-refreshed live state)
   - docs/sessions/{2026-05-17-convergence,2026-05-19-20-trust-recovery,
     2026-05-20-trust-audit,2026-05-20-walk-runbook}.md
   - docs/archive/{validation-log,orchestrator-state,omantel-handover-wbs}.md

3. Adds docs/adr/0004-cnpg-sync-replication.md (Pillar 3 zero-tx-loss decision)
   + docs/adr/README.md index.

4. Updates CLAUDE.md reading-order + repo-structure block to match the
   lean strategy and current core/ tree (controllers/, marketplace/, etc.).

5. Sweeps all .md files + .github/workflows + scripts to repoint old doc
   paths to the new canonical homes. ADR cross-references kept intact
   (ADRs are immutable historical artifacts).

Operator-side cron scripts that still write to the old paths
(/home/openova/bin/refresh-dod-dashboard.sh, refresh-wbs.sh and
openova-private/bin/trust-audit.sh) need a one-line path update —
flagged in the PR body.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(bootstrap-kit): update repo-root sentinel to docs/PRINCIPLES.md

The bootstrap-kit Go test used `docs/INVIOLABLE-PRINCIPLES.md` as its
repo-root sentinel; the file no longer exists after the lean-doc
consolidation (it's now `docs/PRINCIPLES.md`). Update the walker to
match the new canonical filename.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: hatiyildiz <269457768+hatiyildiz@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 14:40:01 +04:00
..
chart feat(charts): bp-vllm + bp-bge + bp-nemo-guardrails wrapper charts (#283) 2026-04-30 18:37:07 +04:00
blueprint.yaml feat(charts): bp-vllm + bp-bge + bp-nemo-guardrails wrapper charts (#283) 2026-04-30 18:37:07 +04:00
README.md feat(docs): lean documentation strategy — consolidate 16 docs into 7 canonical + 3 subdirs (#2094) 2026-05-20 14:40:01 +04:00

vLLM

High-performance LLM inference engine with PagedAttention. Application Blueprint (see docs/ARCHITECTURE.md §4.6). Default LLM serving runtime in bp-cortex (the composite AI Hub Blueprint).

Status: Accepted | Updated: 2026-04-27


Overview

vLLM provides high-throughput LLM serving with efficient memory management via PagedAttention. Recommended runtime for LLM inference in OpenOva.

flowchart LR
    subgraph vLLM["vLLM Engine"]
        PagedAttn[PagedAttention]
        Scheduler[Continuous Batching]
        KVCache[KV Cache Management]
    end

    subgraph API["OpenAI-Compatible API"]
        Chat[/v1/chat/completions]
        Completions[/v1/completions]
        Models[/v1/models]
    end

    Request[Request] --> API
    API --> vLLM
    vLLM --> GPU[GPU]

Why vLLM?

Feature Benefit
PagedAttention 24x higher throughput than HuggingFace
Continuous batching Efficient request handling
OpenAI-compatible API Drop-in replacement
Tensor parallelism Multi-GPU support
Quantization AWQ, GPTQ, INT8 support

Supported Models

Model Family Examples
Qwen Qwen2.5, Qwen3 (recommended)
Llama Llama 3.1, Llama 3.2
Mistral Mistral, Mixtral
DeepSeek DeepSeek-R1, DeepSeek-V3
Others Phi, Gemma, Yi, etc.

Configuration

Deployment via KServe

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: qwen-32b
  namespace: ai-hub
spec:
  predictor:
    model:
      modelFormat:
        name: vllm
      runtime: vllm-runtime
      storageUri: pvc://model-cache/models/qwen3-32b-awq
    resources:
      requests:
        nvidia.com/gpu: "2"
      limits:
        nvidia.com/gpu: "2"

Standalone Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  namespace: ai-hub
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model=/models/qwen3-32b-awq
            - --tensor-parallel-size=2
            - --max-model-len=32768
            - --gpu-memory-utilization=0.9
            - --enable-prefix-caching
          ports:
            - containerPort: 8000
          resources:
            requests:
              nvidia.com/gpu: "2"
            limits:
              nvidia.com/gpu: "2"
          volumeMounts:
            - name: model-cache
              mountPath: /models
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache

Key Parameters

Parameter Purpose Example
--model Model path or HuggingFace ID /models/qwen3-32b
--tensor-parallel-size Number of GPUs 2
--max-model-len Maximum context length 32768
--gpu-memory-utilization GPU memory fraction 0.9
--quantization Quantization method awq, gptq
--enable-prefix-caching Cache common prefixes -

API Usage

Chat Completions

curl http://vllm.ai-hub.svc:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-32b",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ],
    "stream": true
  }'

With Thinking Mode (Qwen3)

curl http://vllm.ai-hub.svc:8000/v1/chat/completions \
  -d '{
    "model": "qwen3-32b",
    "messages": [
      {"role": "user", "content": "Solve this step by step: ..."}
    ],
    "extra_body": {
      "chat_template_kwargs": {"enable_thinking": true}
    }
  }'

Multi-GPU Configuration

Tensor Parallelism (Single Node)

args:
  - --tensor-parallel-size=4  # Split model across 4 GPUs

Pipeline Parallelism (Multi-Node)

args:
  - --pipeline-parallel-size=2  # Split across 2 nodes
  - --tensor-parallel-size=4    # 4 GPUs per node

Quantization

Method Memory Reduction Quality
AWQ ~4x Excellent
GPTQ ~4x Good
INT8 ~2x Very Good
FP8 ~2x Excellent
args:
  - --quantization=awq
  - --dtype=half

Monitoring

Metric Query
Request latency vllm:request_latency_seconds
Tokens/second vllm:generation_tokens_total
GPU memory vllm:gpu_cache_usage_perc
Queue length vllm:num_requests_waiting

Consequences

Positive:

  • Industry-leading performance
  • OpenAI-compatible API
  • Excellent quantization support
  • Multi-GPU scaling
  • Active development

Negative:

  • GPU required
  • Memory-intensive for large models
  • Some models not yet supported

Part of OpenOva