This endpoint is not yet available. It is planned for a future release.
Dedicated GPU deployments with autoscaling, scale-to-zero, and per-deployment configuration. Intended for enterprises that need guaranteed throughput, custom models, or data residency.
Planned endpoints
| Operation | Method | Path |
|---|---|---|
| Create deployment | POST | /v1/accounts/{id}/deployments |
| Get deployment | GET | /v1/accounts/{id}/deployments/{id} |
| List deployments | GET | /v1/accounts/{id}/deployments |
| Update deployment | PATCH | /v1/accounts/{id}/deployments/{id} |
| Delete deployment | DELETE | /v1/accounts/{id}/deployments/{id} |
| Scale deployment | POST | /v1/accounts/{id}/deployments/{id}:scale |
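
Because these endpoints are not yet available, the following is only a sketch of how a client might map the operations in the table to request methods and paths. The helper name `deployment_route` and the placeholder names `account_id` and `deployment_id` are assumptions made for illustration: the table writes both path parameters as `{id}`. No base URL or request body is shown because neither is specified here.

```python
from urllib.parse import quote

# Method and path template per operation, copied from the table above.
# Assumption: the first {id} is the account and the second is the deployment.
ROUTES = {
    "create": ("POST", "/v1/accounts/{account_id}/deployments"),
    "get": ("GET", "/v1/accounts/{account_id}/deployments/{deployment_id}"),
    "list": ("GET", "/v1/accounts/{account_id}/deployments"),
    "update": ("PATCH", "/v1/accounts/{account_id}/deployments/{deployment_id}"),
    "delete": ("DELETE", "/v1/accounts/{account_id}/deployments/{deployment_id}"),
    "scale": ("POST", "/v1/accounts/{account_id}/deployments/{deployment_id}:scale"),
}


def deployment_route(operation, account_id, deployment_id=None):
    """Return (method, path) for one of the planned deployment operations."""
    method, template = ROUTES[operation]
    if "{deployment_id}" in template and deployment_id is None:
        raise ValueError(f"{operation!r} requires a deployment_id")
    # Percent-encode identifiers so they cannot alter the path structure.
    path = template.format(
        account_id=quote(account_id, safe=""),
        deployment_id=quote(deployment_id or "", safe=""),
    )
    return method, path
```

For example, `deployment_route("scale", "acme", "d-1")` yields the `POST` method with the path `/v1/accounts/acme/deployments/d-1:scale`. Keeping the routes in one table-like mapping makes it easy to adjust the client if the paths change before release.
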
Planned autoscaling options
| Flag | Default | Description |
|---|---|---|
| min_replica_count | 0 | Scale to zero when idle |
| max_replica_count | 1 | Maximum replicas |
| scale_up_window | 30s | Wait before scaling up |
| scale_down_window | 10m | Wait before scaling down |
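
The planned defaults above could be modelled on the client side roughly as follows. This is a sketch under stated assumptions, not the service's schema: the class name `AutoscalingOptions`, the helper `parse_window`, the validation rules, and support for an `h` unit are all hypothetical. Only the four option names and their default values come from the table.

```python
import re
from dataclasses import dataclass

# Seconds per unit. The table only shows "s" and "m"; "h" is an assumption.
_UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_window(value):
    """Convert a duration string such as '30s' or '10m' to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", value)
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


@dataclass(frozen=True)
class AutoscalingOptions:
    # Defaults copied from the table above.
    min_replica_count: int = 0
    max_replica_count: int = 1
    scale_up_window: str = "30s"
    scale_down_window: str = "10m"

    def __post_init__(self):
        # Assumed sanity checks; the real service may enforce different rules.
        if self.min_replica_count < 0:
            raise ValueError("min_replica_count must be >= 0")
        if self.max_replica_count < max(1, self.min_replica_count):
            raise ValueError("max_replica_count must be >= 1 and >= min_replica_count")
        # Fail early on malformed window strings.
        parse_window(self.scale_up_window)
        parse_window(self.scale_down_window)

    @property
    def scales_to_zero(self):
        """True when the deployment may drop to zero replicas while idle."""
        return self.min_replica_count == 0
```

With no arguments, `AutoscalingOptions()` reproduces the planned defaults, so a deployment would scale to zero when idle and run at most one replica.
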
Serverless inference is the right choice for most developers. Deployments are for enterprises with guaranteed throughput requirements.
- Models: available models for deployment
- Roadmap: feature timeline