Monitoring and Observability
AIM Engine exposes metrics and structured logs for monitoring operator health and inference workloads.
Metrics
Endpoint
The controller exposes metrics on port 8443 (HTTPS by default). Configure via Helm:
| Value | Default | Description |
|---|---|---|
| metrics.enable | true | Enable metrics endpoint |
| metrics.port | 8443 | Metrics port |
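As a sketch, the two values above could be changed on an existing release with a small wrapper like the one below. The function name `aim_set_metrics_port` is hypothetical, and the release name and namespace are assumed to match the install commands on this page; `--reuse-values` is a standard Helm flag that keeps values set by earlier installs or upgrades.

```bash
# Hypothetical helper: update the metrics port on an existing release.
# Assumes the release is named aim-engine and lives in aim-system.
aim_set_metrics_port() {
  local version="$1" port="${2:-8443}"
  helm upgrade aim-engine oci://docker.io/amdenterpriseai/charts/aim-engine \
    --version "$version" \
    --namespace aim-system \
    --reuse-values \
    --set metrics.enable=true \
    --set metrics.port="$port"
}
```

Usage: `aim_set_metrics_port <version> 9443`. Omitting the second argument keeps the default port of 8443.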
Prometheus ServiceMonitor
Enable automatic scraping with Prometheus:
helm install aim-engine oci://docker.io/amdenterpriseai/charts/aim-engine \
  --version <version> \
  --namespace aim-system \
  --set prometheus.enable=true
This creates a ServiceMonitor resource that Prometheus Operator picks up automatically.
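One way to confirm that the resource was created is sketched below. The helper name `check_servicemonitor` is hypothetical. Because the name of the ServiceMonitor is not stated here, the sketch lists every ServiceMonitor in the namespace rather than guessing one.

```bash
# Hypothetical helper: verify that ServiceMonitor objects exist in a namespace.
# Defaults to aim-system, the namespace used in the install command above.
check_servicemonitor() {
  local ns="${1:-aim-system}"
  # The ServiceMonitor kind only exists when the Prometheus Operator CRDs are installed.
  if ! kubectl get crd servicemonitors.monitoring.coreos.com >/dev/null 2>&1; then
    echo "ServiceMonitor CRD not found; is Prometheus Operator installed?" >&2
    return 1
  fi
  kubectl get servicemonitors.monitoring.coreos.com -n "$ns"
}
```

If the CRD check fails, enabling `prometheus.enable` is unlikely to have any effect until Prometheus Operator is installed.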
Controller Runtime Metrics
AIM Engine exposes standard controller-runtime metrics:
- controller_runtime_reconcile_total: Total reconciliations by controller and result
- controller_runtime_reconcile_errors_total: Total reconciliation errors
- controller_runtime_reconcile_time_seconds: Reconciliation duration
- workqueue_depth: Current work queue depth per controller
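Once Prometheus is scraping the endpoint, the metrics listed above can be queried through the Prometheus HTTP API. The sketch below is illustrative only: `prom_query` is a hypothetical helper, and `PROM_URL` is an assumption that should point at wherever your Prometheus server is reachable (for example through a local port-forward).

```bash
# Assumed Prometheus address; override PROM_URL for your environment.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Hypothetical helper: run an instant PromQL query and print the JSON response.
prom_query() {
  curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Example queries (not executed here):
#   Reconcile error rate per controller over the last 5 minutes:
#     prom_query 'sum by (controller) (rate(controller_runtime_reconcile_errors_total[5m]))'
#   Approximate p99 reconcile duration per controller (the metric is a histogram):
#     prom_query 'histogram_quantile(0.99, sum by (le, controller) (rate(controller_runtime_reconcile_time_seconds_bucket[5m])))'
#   Current queue depth:
#     prom_query 'workqueue_depth'
```

A steadily rising error rate or a queue depth that never drains is usually a more useful alert signal than any single reconcile failure.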
Logs
Format
Operator logs are JSON-formatted with these key fields:
| Field | Description | Example |
|---|---|---|
| level | Log level | info, error, debug |
| controller | Controller name | artifact, service, model |
| namespace | Resource namespace | ml-team |
| name | Resource name | qwen-chat |
| condition | Condition being updated | Ready |
| status | Condition status | True, False |
| reason | Condition reason | RuntimeReady |
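Because the fields are stable JSON keys, condition updates can be condensed into one line each with `jq`. The following is a minimal sketch, assuming log entries carry the fields from the table above; the helper name `aim_condition_updates` is hypothetical.

```bash
# Hypothetical helper: read JSON log lines on stdin and print one summary
# line per entry that carries a "condition" field; other entries are skipped.
aim_condition_updates() {
  jq -r 'select(.condition != null)
         | "\(.namespace)/\(.name) \(.condition)=\(.status) (\(.reason))"'
}
```

Usage: `kubectl logs -n aim-system deployment/aim-engine-controller-manager | aim_condition_updates`.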
Log Levels
Configure via operator flags:
| Flag | Values | Default |
|---|---|---|
| --zap-log-level | debug, info, error, or integer | info |
| --zap-encoder | json, console | json |
| --zap-devel | true, false | false (production mode) |
Enable debug logging in Helm:
helm install aim-engine oci://docker.io/amdenterpriseai/charts/aim-engine \
  --version <version> \
  --namespace aim-system \
  --set 'manager.args={--leader-elect,--zap-log-level=debug}'
Useful Log Queries
# View operator logs
kubectl logs -n aim-system deployment/aim-engine-controller-manager -f

# Filter for errors
kubectl logs -n aim-system deployment/aim-engine-controller-manager | \
  jq 'select(.level == "error")'

# Filter by controller
kubectl logs -n aim-system deployment/aim-engine-controller-manager | \
  jq 'select(.controller == "aimservice")'

# Filter by namespace
kubectl logs -n aim-system deployment/aim-engine-controller-manager | \
  jq 'select(.namespace == "ml-team")'
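The filters above expect every line to be valid JSON. If the stream contains anything else (for instance a Go panic trace, or output produced with `--zap-encoder=console`), plain `jq` stops at the first unparsable line. A more tolerant variant is sketched below; the helper name `aim_errors_only` is hypothetical.

```bash
# Hypothetical helper: keep only error-level JSON entries from stdin.
# -R reads each line as a raw string; fromjson? silently drops lines that
# are not valid JSON, and the type check skips bare numbers or strings.
aim_errors_only() {
  jq -cR 'fromjson? | select(type == "object" and .level == "error")'
}
```

Usage: `kubectl logs -n aim-system deployment/aim-engine-controller-manager | aim_errors_only`.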
Kubernetes Events
The operator emits Kubernetes Events on AIM resources when conditions change. Events provide a timeline of state transitions visible via kubectl describe.
Event Types
| Type | When Emitted |
|---|---|
| Normal | Condition transitions to a healthy state |
| Warning | Condition transitions to an unhealthy state, or remains unhealthy (re-emitted on every reconcile) |
Event Reasons
Events use the condition's reason field as the event reason. Common event reasons:
AIMService:
| Reason | Type | Description |
|---|---|---|
| ModelResolved | Normal | Model found and ready |
| ModelNotFound | Warning | Referenced model does not exist |
| Resolved | Normal | Template resolved successfully |
| TemplateSelectionAmbiguous | Warning | Multiple templates scored equally |
| CacheReady | Normal | Model cache is populated |
| CacheFailed | Warning | Cache download failed |
| RuntimeReady | Normal | InferenceService is serving |
| InvalidImageReference | Warning | Model image URI is invalid |
| PathTemplateInvalid | Warning | Routing path template failed to resolve |
AIMModel:
| Reason | Type | Description |
|---|---|---|
| AllTemplatesReady | Normal | All discovered templates are ready |
| AllTemplatesFailed | Warning | All discovered templates failed |
| MetadataExtractionFailed | Warning | Failed to extract model metadata |
AIMArtifact:
| Reason | Type | Description |
|---|---|---|
| Verified | Normal | Download complete and verified |
| Downloading | Normal | Download in progress |
Viewing Events
# Events for a specific resource
kubectl describe aimservice qwen-chat -n <namespace>
# All AIM-related events in a namespace
kubectl get events -n <namespace> --field-selector involvedObject.apiVersion=aim.eai.amd.com/v1alpha1
Recurring Events
Some warning events are emitted on every reconcile (not just on transitions) for critical conditions that remain unhealthy. These are useful for alerting — a persistent stream of warnings indicates a stuck or failing resource.
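To surface such recurring warnings, the event list can be narrowed to Warning events on AIM resources and sorted by time. This is a sketch under the assumption that the API version matches the one used in the event query on this page; `aim_warning_events` is a hypothetical helper name.

```bash
# Hypothetical helper: list Warning events on AIM resources in a namespace,
# ordered by the time each event was last seen.
aim_warning_events() {
  kubectl get events -n "$1" \
    --field-selector type=Warning,involvedObject.apiVersion=aim.eai.amd.com/v1alpha1 \
    --sort-by=.lastTimestamp
}
```

Usage: `aim_warning_events ml-team`. An event whose count keeps climbing between runs points at a resource that is stuck rather than one that failed once.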
See Conditions Reference for the full catalog of conditions and reasons.
Health Probes
The operator exposes health and readiness probes:
| Probe | Path | Port |
|---|---|---|
| Liveness | /healthz | 8081 |
| Readiness | /readyz | 8081 |
These are configured automatically in the Helm chart deployment.
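For manual debugging, the probe endpoints can also be queried directly through a temporary port-forward. The sketch below makes two assumptions that are not stated on this page: that the deployment is named as in the log commands above, and that the probes are served over plain HTTP. The helper name `aim_check_probes` is hypothetical.

```bash
# Hypothetical helper: query both probe endpoints through a short-lived port-forward.
aim_check_probes() {
  kubectl port-forward -n aim-system deployment/aim-engine-controller-manager 8081:8081 >/dev/null &
  local pf_pid=$!
  sleep 2  # crude wait for the tunnel to come up
  local path
  for path in /healthz /readyz; do
    printf '%s: ' "$path"
    curl -s "http://localhost:8081${path}"
    echo
  done
  # Stop the port-forward; ignore the error if it has already exited.
  kill "$pf_pid" 2>/dev/null || true
}
```

In normal operation the kubelet performs these checks itself, so `kubectl get pods -n aim-system` already reflects the readiness result.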
Next Steps
- Troubleshooting — Diagnosing common issues
- CLI and Operator Flags — Full operator flag reference