In early development · building in the open

The open source control plane for AI inference

Modelplane is the control plane above your inference clusters across cloud, neocloud, and on-premises. Platform teams set policy and capacity; developers declare a model and get a serving endpoint. Modelplane continuously reconciles the whole fleet: provisioning, scheduling, autoscaling, routing, and caching. All of it runs entirely under your control.

ModelDeployment · POST · SGLang · deepseek-r1 · prefill / decode · 8× B200
ModelDeployment · POST · vLLM · llama-4-70b · tensor parallel · 4× H100
ModelDeployment · POST · TRT-LLM · qwen3-235b · data / expert · 8× H200

reconciling: provisioning · scheduling · autoscaling · routing · caching
policy · governance · compliance

InferenceCluster · GCP · Cloud · us-central1 · 256× TPU v6e · 8× H100
InferenceCluster · CoreWeave · Neocloud · gpu-east · 72× GB200 · 8× H200
InferenceCluster · DGX · On-prem · dc-1 · 32× H100 · 8× A100


The inference ecosystem. Under one control plane.

Any model, any engine, any infrastructure. Modelplane doesn’t replace the inference ecosystem; it sits above the pieces your teams already choose and composes them into a running, self-reconciling fleet.

composes · provisions · schedules · autoscales · routes · caches

orchestrates

Models

open weights & custom

Llama · Qwen · DeepSeek · Mistral · gpt-oss · Gemma · + any open-weight model

Serving

inference engines

vLLM · SGLang · TensorRT-LLM · TGI · llama.cpp · LMDeploy · + any engine

Infrastructure

accelerators & providers

Accelerators

NVIDIA · AMD · Google TPU · AWS Trainium · Intel Gaudi · + any accelerator

Providers

AWS · GCP · Azure · CoreWeave · Lambda · on-prem · + any Kubernetes

Advanced serving. From one GPU to frontier.

Modelplane matches each model’s requirements and serving topology to the hardware available, using expressive CEL selectors and composable API shapes. Because topology is declared as a shape, Modelplane can place anything from a single GPU to multi-node, disaggregated frontier serving, and new parallelism strategies work as they emerge.

tensor parallel

Split each layer across GPUs in a node for low-latency single-model serving.

pipeline parallel

Stage a model across nodes so very large models fit beyond a single box.

data / expert

Replicate workers, or shard experts across them for MoE throughput.

prefill / decode

Disaggregate prefill and decode onto separate pools for frontier serving.

+ emerging topology

Described as shape, so future parallelism strategies just work.
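
To make "topology as shape" concrete, here is a minimal Python sketch of how the parallelism degrees described above translate into a GPU count. The `TopologyShape` class, its field names, and the `fits` helper are hypothetical illustrations, not Modelplane's API; the only fact assumed is the standard arithmetic that a deployment needs tensor × pipeline × data GPUs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopologyShape:
    """Hypothetical shape: the degree of each parallelism dimension."""
    tensor: int = 1    # GPUs sharing each layer within a node
    pipeline: int = 1  # sequential stages, typically across nodes
    data: int = 1      # independent replicas of the whole model

    def gpus(self) -> int:
        # Each replica needs tensor * pipeline GPUs; data multiplies replicas.
        return self.tensor * self.pipeline * self.data


def fits(shape: TopologyShape, gpus_per_node: int, nodes: int) -> bool:
    """True if a tensor group fits in one node and the total fits the pool."""
    return shape.tensor <= gpus_per_node and shape.gpus() <= gpus_per_node * nodes


single = TopologyShape()                           # one GPU
tp4 = TopologyShape(tensor=4)                      # 4-way tensor parallel
big = TopologyShape(tensor=8, pipeline=2, data=2)  # multi-node
```

Under these assumptions, `tp4` fits a single 8-GPU node, while `big` needs 32 GPUs and therefore at least four such nodes.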


A resource model for inference. Serving two roles.

At its core, Modelplane is a flexible resource model for inference. Each role owns its own resources: developers declare model deployments and expose one service across regions, clouds, and managed vendors, while platform teams declare the fleet of clusters, accelerators, and gateways underneath.

Development & ML teams

Define model deployments: the model, the engine and its configuration, serving topology, hardware request, region, and environment. Then expose them as one service, weighted across regions, clouds, and managed vendors.

kind: ModelService · name: prod-llama · routing: weighted, openai

weight 60 · kind: ModelDeployment · model: llama-4-70b · cluster: aws-us-east

weight 30 · kind: ModelDeployment · model: llama-4-70b · cluster: gcp-eu-west

weight 10 · kind: ModelEndpoint · target: vendor-api · type: managed

Platform teams

Declare the fleet: a gateway over clusters across clouds and regions, each with its own hardware classes and node pools. Set the capacity, accelerators, policy, and cost controls the whole fleet runs within.

kind: InferenceGateway · name: prod-gateway · routes: all endpoints

kind: InferenceCluster · name: aws-us-east · pools: h200, h100

kind: InferenceCluster · name: gcp-eu-west · pools: tpu-v6e, a100

kind: InferenceCluster · name: onprem-dc1 · pools: h100, l40s


Capabilities built for the fleet. Not just the cluster.

01 / Provisioning

Provision the fleet, or bring your own

Provision inference clusters on AWS, GCP, and Azure, or bring your own on any Kubernetes. Each gets hardware classes, node pools, an inference gateway, and the full serving stack, installed and continuously reconciled.

Provisioning

Provision · GKE / EKS
Bring your own · any K8s
Modelplane installs & reconciles
InferenceCluster ● reconciled

classes: h200-8x, h100-8x · node pools · gateway

GPU operator & drivers

Serving engines

Inference gateway
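
"Installed and continuously reconciled" can be pictured as a loop that repeatedly compares desired state with observed state and acts on the difference. The Python sketch below shows one pass of that comparison. The `reconcile` function and the component names are hypothetical and say nothing about how Modelplane is actually implemented.

```python
from __future__ import annotations


def reconcile(desired: set[str], installed: set[str]) -> dict[str, list[str]]:
    """Return the actions that move `installed` toward `desired`."""
    return {
        "install": sorted(desired - installed),  # wanted but missing
        "remove": sorted(installed - desired),   # present but unwanted
    }


# Illustrative component names, loosely following the stack listed above.
desired = {"gpu-operator", "serving-engines", "inference-gateway"}
installed = {"gpu-operator", "legacy-proxy"}

plan = reconcile(desired, installed)
steady = reconcile(desired, desired)  # nothing to do once converged
```

Because each pass depends only on current state, running it again after convergence yields an empty plan, which is what makes continuous reconciliation safe to repeat.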

02 / Scheduling

One global pool of capacity

Modelplane treats every cluster, cloud, and region as one global pool. A fleet scheduler places each model's replicas where its requirements match a cluster's capabilities, then hands off to the cluster's own scheduler and DRA, with support for advanced schedulers like KAI, Kueue, and Volcano.

Two-level scheduling

fleet scheduler · one global pool
tracks requirements ↔ capabilities · places replicas

aws-us-east · gcp-eu-west · azure-us2

cluster scheduler · DRA · KAI / Kueue / Volcano · bound
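
The fleet-level half of two-level scheduling can be sketched as a filter over clusters. This Python example uses plain predicates as a stand-in for CEL selectors. The `Cluster` fields, the `place` function, the capacity numbers, and the accelerator assigned to each cluster are all made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    name: str
    accelerator: str
    free_gpus: int
    region: str


def place(clusters, accelerator, gpus, regions=None):
    """Names of clusters whose capabilities satisfy the request,
    ordered by most free capacity first."""
    matches = [
        c for c in clusters
        if c.accelerator == accelerator
        and c.free_gpus >= gpus
        and (regions is None or c.region in regions)
    ]
    return [c.name for c in sorted(matches, key=lambda c: -c.free_gpus)]


# Hypothetical capacity figures for the clusters named in the diagram.
fleet = [
    Cluster("aws-us-east", "h200", 16, "us-east"),
    Cluster("gcp-eu-west", "a100", 8, "eu-west"),
    Cluster("azure-us2", "h200", 8, "us-east2"),
]

candidates = place(fleet, "h200", 8)
pinned = place(fleet, "h200", 8, regions={"us-east2"})
```

In this picture the fleet scheduler only chooses a cluster. Binding replicas to nodes and devices is left to that cluster's own scheduler.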

03 / Autoscaling

Scale replicas across clouds and regions

Every model exposes the standard Kubernetes scale subresource, so its replicas scale out across clusters, clouds, and regions, driven by hand or by HPA and KEDA.

Roadmap · Scale-to-zero is on the roadmap.

Autoscaling

load
spec.replicas 6
GCP · eu-west
AWS · us-east
Azure · us-east2
min 1 · max 8 replicas
scale subresource · HPA / KEDA
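
For readers unfamiliar with how HPA drives a scale subresource, the core calculation is simple. The Python sketch below applies the replica formula documented for the Kubernetes Horizontal Pod Autoscaler, `ceil(current × metric / target)`, and clamps the result to the min 1 and max 8 bounds from the diagram. The function name and the metric values are illustrative, and HPA's tolerance and stabilization behavior are omitted.

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    """HPA-style replica count, clamped to the configured bounds."""
    raw = math.ceil(current * metric / target)
    return max(min_replicas, min(max_replicas, raw))


scaled_out = desired_replicas(current=4, metric=90, target=60)  # load above target
capped = desired_replicas(current=6, metric=200, target=60)     # hits the max bound
scaled_in = desired_replicas(current=2, metric=10, target=60)   # hits the min bound
```

With four replicas running at 90 against a target of 60, the formula asks for six replicas, the same `spec.replicas` value shown in the diagram.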

04 / Routing

One service, many replicas and endpoints

A model service is one stable, OpenAI-compatible endpoint over many replicas and model endpoints. Weighted routing spreads traffic across replicas for canary and A/B rollouts, and a managed endpoint can take a weighted share too.

Roadmap · Automatic cross-cloud failover is on the roadmap.

One service, many endpoints

ModelService · prod-llama · ● one OpenAI-compatible endpoint

weight 60 · ModelEndpoint · replica · aws-us-east

weight 30 · ModelEndpoint · replica · gcp-eu-west

weight 10 · ModelEndpoint · managed · vendor
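
Weighted routing with a 60 / 30 / 10 split can be sketched as a walk over cumulative weights. This Python example is a minimal illustration of the idea, not Modelplane's routing code. The `pick` and `route` functions are hypothetical, and the backend names reuse the clusters and the `vendor-api` target from the example above.

```python
import random

# (backend, weight) pairs; the weights mirror the 60 / 30 / 10 split.
BACKENDS = [("aws-us-east", 60), ("gcp-eu-west", 30), ("vendor-api", 10)]


def pick(backends, point):
    """Map a point in [0, total weight) to a backend via cumulative weights."""
    cumulative = 0
    for name, weight in backends:
        cumulative += weight
        if point < cumulative:
            return name
    raise ValueError("point must be below the total weight")


def route(backends, rng):
    """Choose a backend at random, in proportion to its weight."""
    total = sum(weight for _, weight in backends)
    return pick(backends, rng.randrange(total))


# Enumerating every point shows the exact share each backend receives.
shares = {name: 0 for name, _ in BACKENDS}
for point in range(100):
    shares[pick(BACKENDS, point)] += 1

chosen = route(BACKENDS, random.Random(0))
```

Shifting a canary from 10 to 50 percent is then just a change of weights; the endpoint that clients call stays the same.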


Genuinely open. Yours to run and operate.

Modelplane is licensed under Apache 2.0 and open source end to end. The control plane lives entirely in your infrastructure and depends on nothing outside it, so no vendor can restrict, throttle, or revoke access. Donation to a neutral open source foundation is planned.

Built by the team behind Crossplane, the proven open source foundation for infrastructure control planes, trusted at Apple, JPMC, Nike, Elastic, Grafana, and MongoDB.