Skip to content

Troubleshooting

Common issues and diagnostic steps for AIM Engine.

Service Status

Check the overall status:

kubectl get aimservice <name> -n <namespace>

For detailed diagnostics, inspect conditions and component health:

kubectl get aimservice <name> -n <namespace> -o jsonpath='{.status.conditions}' | jq

Common Issues

Service Stuck in "Pending"

The service is waiting for upstream dependencies.

Check which conditions are blocking readiness:

kubectl get aimservice <name> -n <namespace> -o jsonpath='{.status.conditions}' | jq
Blocked Component Likely Cause
Model Model not found — check model.name spelling or model.image accessibility
Template No matching template — verify templates exist and are Ready
RuntimeConfig Runtime config not found or invalid

Service Stuck in "Starting"

Downstream resources are being created but haven't become ready.

Check the InferenceService:

kubectl get inferenceservice -n <namespace> -l aim.eai.amd.com/service.name=<name>
kubectl describe inferenceservice <isvc-name> -n <namespace>

Check pods:

kubectl get pods -l serving.kserve.io/inferenceservice=<isvc-name> -n <namespace>
kubectl describe pod <pod-name> -n <namespace>

Use the InferenceService name returned by the first command as <isvc-name>.

Common causes:

  • Image pull errors — Wrong image URL or missing imagePullSecrets
  • Insufficient resources — Not enough GPU, memory, or CPU available
  • PVC not binding — Storage class doesn't support RWX, or insufficient capacity

Template Selection Fails

"No templates found":

# List templates for the model
kubectl get aimservicetemplates -n <namespace>
kubectl get aimclusterservicetemplates

# Check template status
kubectl get aimservicetemplates -o custom-columns=NAME:.metadata.name,STATUS:.status.status

Templates may be excluded because:

  • Status is not Ready (still discovering or failed)
  • Status is NotAvailable (required GPU not in cluster)
  • Profile is unoptimized and allowUnoptimized is not set

"Ambiguous selection":

Multiple templates scored equally. Resolve by specifying template.name explicitly.

Cache or Artifact Failures

# Check template cache
kubectl get aimtemplatecache -n <namespace>

# Check artifacts
kubectl get aimartifact -n <namespace>

# Check download job
kubectl get jobs -l aim.eai.amd.com/artifact=<artifact-name> -n <namespace>
kubectl logs job/<job-name> -n <namespace>

Common causes:

  • StorageSizeError — Model size not yet discovered; typically resolves automatically
  • Download failure — Network issues, authentication errors, or protocol incompatibility
  • PVC binding failure — Storage class doesn't support ReadWriteMany

Routing Not Working

# Check HTTPRoute
kubectl get httproute -n <namespace>
kubectl describe httproute <name> -n <namespace>

# Check the gateway
kubectl get gateway -n <gateway-namespace>

Common causes:

  • Gateway doesn't exist or isn't ready
  • routing.enabled is not set (check runtime config)
  • Gateway namespace mismatch in gatewayRef

Operator Logs

View operator logs for detailed error information:

kubectl logs -n aim-system deployment/aim-engine-controller-manager -f

Filter for errors related to a specific resource:

kubectl logs -n aim-system deployment/aim-engine-controller-manager | \
  jq 'select(.name == "<resource-name>")'

Status Values

Status Meaning
Pending Waiting for upstream dependencies
Starting Creating downstream resources
Progressing Resources created, waiting for readiness
Running Fully operational
Ready Resource is ready (for non-service CRDs)
Degraded Partially functional
NotAvailable Required infrastructure not present
Failed Critical failure

Next Steps