Skip to main content
This endpoint is not yet available. It is planned for a future release.
Dedicated GPU deployments with autoscaling, scale-to-zero, and per-deployment configuration. For enterprises that need guaranteed throughput, custom models, or data residency guarantees.

Planned endpoints

OperationMethodPath
Create deploymentPOST/v1/accounts/{id}/deployments
Get deploymentGET/v1/accounts/{id}/deployments/{id}
List deploymentsGET/v1/accounts/{id}/deployments
Update deploymentPATCH/v1/accounts/{id}/deployments/{id}
Delete deploymentDELETE/v1/accounts/{id}/deployments/{id}
Scale deploymentPOST/v1/accounts/{id}/deployments/{id}:scale

Planned autoscaling options

FlagDefaultDescription
min_replica_count0Scale to zero when idle
max_replica_count1Maximum replicas
scale_up_window30sWait before scaling up
scale_down_window10mWait before scaling down
Serverless inference is the right choice for most developers. Deployments are for enterprises with guaranteed throughput requirements.
  • Models — Available models for deployment
  • Roadmap — feature timeline