Monitoring and Observability

AIM Engine exposes metrics and structured logs for monitoring operator health and inference workloads.

Metrics

Endpoint

The controller serves metrics on port 8443, over HTTPS by default. The endpoint is configured through these Helm values:

Value           Default  Description
metrics.enable  true     Enable the metrics endpoint
metrics.port    8443     Metrics port

Prometheus ServiceMonitor

Enable automatic scraping with Prometheus:

helm install aim-engine oci://docker.io/amdenterpriseai/charts/aim-engine \
  --version <version> \
  --namespace aim-system \
  --set prometheus.enable=true

This creates a ServiceMonitor resource that Prometheus Operator picks up automatically.

Controller Runtime Metrics

AIM Engine exposes standard controller-runtime metrics:

  • controller_runtime_reconcile_total — Total reconciliations by controller and result
  • controller_runtime_reconcile_errors_total — Total reconciliation errors
  • controller_runtime_reconcile_time_seconds — Reconciliation duration
  • workqueue_depth — Current work queue depth per controller

Logs

Format

Operator logs are JSON-formatted with these key fields:

Field       Description              Example
level       Log level                info, error, debug
controller  Controller name          artifact, service, model
namespace   Resource namespace       ml-team
name        Resource name            qwen-chat
condition   Condition being updated  Ready
status      Condition status         True, False
reason      Condition reason         RuntimeReady

Log Levels

Configure via operator flags:

Flag             Values                                   Default
--zap-log-level  debug, info, error, or positive integer  info
--zap-encoder    json, console                            json
--zap-devel      true, false                              false (production mode)

Enable debug logging in Helm:

helm install aim-engine oci://docker.io/amdenterpriseai/charts/aim-engine \
  --version <version> \
  --namespace aim-system \
  --set 'manager.args={--leader-elect,--zap-log-level=debug}'

Useful Log Queries

# View operator logs
kubectl logs -n aim-system deployment/aim-engine-controller-manager -f

# Filter for errors
kubectl logs -n aim-system deployment/aim-engine-controller-manager | \
  jq 'select(.level == "error")'

# Filter by controller
kubectl logs -n aim-system deployment/aim-engine-controller-manager | \
  jq 'select(.controller == "aimservice")'

# Filter by namespace
kubectl logs -n aim-system deployment/aim-engine-controller-manager | \
  jq 'select(.namespace == "ml-team")'
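The filters above can also be combined. The sketch below is one possible variant: it keeps only error-level entries for a single namespace and prints a one-line summary built from the documented fields. The function name aim_errors_in_namespace is hypothetical. It reads raw lines with jq -R and skips anything that is not a JSON object, on the assumption that the log stream may occasionally contain non-JSON lines.

```shell
# Hypothetical helper: summarize error-level log entries for one namespace.
# Usage: <log stream> | aim_errors_in_namespace <namespace>
aim_errors_in_namespace() {
  jq -R -r --arg ns "$1" '
    # Parse each raw line; lines that are not valid JSON are dropped.
    fromjson?
    # Keep JSON objects only, so field lookups cannot fail.
    | objects
    | select(.level == "error" and .namespace == $ns)
    # Print controller/name and the condition reason, or "-" if absent.
    | "\(.controller)/\(.name): \(.reason // "-")"
  '
}
```

For example, piping the operator logs through it:

kubectl logs -n aim-system deployment/aim-engine-controller-manager | aim_errors_in_namespace ml-team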

Kubernetes Events

The operator emits Kubernetes Events on AIM resources when conditions change. Events provide a timeline of state transitions visible via kubectl describe.

Event Types

Type     When Emitted
Normal   A condition transitions to a healthy state
Warning  A condition transitions to an unhealthy state, or a critical condition remains unhealthy (re-emitted on every reconcile)

Event Reasons

Events use the condition's reason field as the event reason. Common event reasons:

AIMService:

Reason                      Type     Description
ModelResolved               Normal   Model found and ready
ModelNotFound               Warning  Referenced model does not exist
Resolved                    Normal   Template resolved successfully
TemplateSelectionAmbiguous  Warning  Multiple templates scored equally
CacheReady                  Normal   Model cache is populated
CacheFailed                 Warning  Cache download failed
RuntimeReady                Normal   InferenceService is serving
InvalidImageReference       Warning  Model image URI is invalid
PathTemplateInvalid         Warning  Routing path template failed to resolve

AIMModel:

Reason                    Type     Description
AllTemplatesReady         Normal   All discovered templates are ready
AllTemplatesFailed        Warning  All discovered templates failed
MetadataExtractionFailed  Warning  Failed to extract model metadata

AIMArtifact:

Reason       Type    Description
Verified     Normal  Download complete and verified
Downloading  Normal  Download in progress

Viewing Events

# Events for a specific resource
kubectl describe aimservice qwen-chat -n <namespace>

# All AIM-related events in a namespace
kubectl get events -n <namespace> --field-selector involvedObject.apiVersion=aim.eai.amd.com/v1alpha1

Recurring Events

Some warning events are emitted on every reconcile, not just on transitions, for critical conditions that remain unhealthy. This makes them useful for alerting: a persistent stream of warnings indicates a stuck or failing resource.

See Conditions Reference for the full catalog of conditions and reasons.
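To make the alerting idea above concrete, the sketch below totals Warning events per resource and reason from the JSON output of kubectl get events. The function name warning_event_counts is hypothetical. It relies only on standard Event fields (type, reason, count, involvedObject) and treats a missing count as 1.

```shell
# Hypothetical helper: total Warning events per resource and reason.
# Reads an Event list (kubectl get events -o json) on stdin and prints
# "<count><TAB><kind>/<name><TAB><reason>", highest count first.
warning_event_counts() {
  jq -r '
    [.items[] | select(.type == "Warning")]
    # Group events that share the same object and reason.
    | group_by([.involvedObject.kind, .involvedObject.name, .reason])
    | map({
        kind:   .[0].involvedObject.kind,
        name:   .[0].involvedObject.name,
        reason: .[0].reason,
        # Sum occurrences; an event without a count field counts once.
        count:  (map(.count // 1) | add)
      })
    | sort_by(-.count)
    | .[]
    | "\(.count)\t\(.kind)/\(.name)\t\(.reason)"
  '
}
```

For example:

kubectl get events -n <namespace> --field-selector type=Warning -o json | warning_event_counts

A resource whose count keeps climbing between runs is a likely candidate for the stuck or failing state described above.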

Health Probes

The operator exposes health and readiness probes:

Probe      Path      Port
Liveness   /healthz  8081
Readiness  /readyz   8081

These are configured automatically in the Helm chart deployment.