Quickstart¶
Deploy your first inference service in minutes.
Prerequisites¶
- AIM Engine installed on your cluster
- AMD GPUs available in the cluster
- `kubectl` configured to access your cluster
Step 1: Check Available Models¶
If you enabled model discovery during installation, models are already available:
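A quick way to check is to list the cluster-scoped model resources. This assumes the `AIMClusterModel` CRD is queryable by its lowercase plural name; the exact resource name may differ in your installation:

```shell
# List models that AIM Engine has discovered or that were created manually
kubectl get aimclustermodels
```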
If no models are listed, create one manually:
```yaml
apiVersion: aim.eai.amd.com/v1alpha1
kind: AIMClusterModel
metadata:
  name: qwen3-32b
spec:
  image: amdenterpriseai/aim-qwen-qwen3-32b:0.8.5
```
Step 2: Deploy an Inference Service¶
Create an AIMService to deploy the model:
```yaml
apiVersion: aim.eai.amd.com/v1alpha1
kind: AIMService
metadata:
  name: qwen-chat
  namespace: default
spec:
  model:
    image: amdenterpriseai/aim-qwen-qwen3-32b:0.8.5
```
AIM Engine automatically:
- Resolves or creates a matching model
- Selects the best runtime template for your GPU hardware
- Downloads the model weights (this can take several minutes for large models)
- Creates a KServe InferenceService once the download completes
- Starts serving the model
Caching¶
Model weights are always downloaded to a persistent volume before the InferenceService starts. The caching mode controls whether that PVC is shared or isolated:
- Shared (default) — The PVC is shared across all services using the same template. Once one service downloads the model, others reuse it immediately.
- Dedicated — Each service gets its own PVC, isolated from other services.
```yaml
apiVersion: aim.eai.amd.com/v1alpha1
kind: AIMService
metadata:
  name: qwen-chat
  namespace: default
spec:
  model:
    image: amdenterpriseai/aim-qwen-qwen3-32b:0.8.5
  caching:
    mode: Dedicated
```
See Model Caching for more on caching modes and configuration.
Step 3: Monitor Progress¶
Watch the service status:
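A minimal sketch, assuming the `AIMService` CRD is queryable as `aimservice` (your installation may register a different short name):

```shell
# Stream status updates for the service until it reaches Running
kubectl get aimservice qwen-chat -n default -w
```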
The status progresses through: Pending → Starting → Running. The service pauses in Starting while model weights are downloaded.
For more detail, check the conditions:
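One way to inspect the conditions is with `kubectl describe`, which prints the status section including condition types, reasons, and messages:

```shell
# Show full status, including conditions, for the service
kubectl describe aimservice qwen-chat -n default
```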
Step 4: Send a Request¶
Once the service is Running, find the inference endpoint:
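Since AIM Engine creates a KServe InferenceService on your behalf, listing InferenceServices in the namespace shows the generated name and its URL:

```shell
# List KServe InferenceServices created by AIM Engine
kubectl get inferenceservice -n default
```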
InferenceService names are derived, so use the name returned by the command above and port-forward its predictor service:
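A port-forward sketch, assuming KServe's usual `<name>-predictor` service naming convention and a service port of 80; substitute the InferenceService name returned above:

```shell
# Forward local port 8080 to the predictor service (name and port are assumptions)
kubectl port-forward -n default svc/<inferenceservice-name>-predictor 8080:80
```

With the forward running, send the chat completion request below to `localhost:8080`.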
```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen-chat",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
Next Steps¶
- Deploying Services — Scaling, caching, routing, and more configuration options
- Model Catalog — Browse and manage available models
- Architecture — Understand how AIM Engine components work together