# Monitoring and Orchestration
Production ML systems require continuous monitoring, automated responses to issues, and orchestrated workflows for retraining. AWS provides CloudWatch for metrics and alarms, EventBridge for event-driven automation, and SNS for notifications — forming the operational backbone of MLOps.
## Overview
| Service | What It Does | When to Use |
|---|---|---|
| Amazon CloudWatch | Metrics, logs, and alarms for AWS resources | Monitor SageMaker endpoints, training jobs, and infrastructure health |
| Amazon EventBridge | Serverless event bus — react to events from AWS services | Trigger automated workflows on SageMaker events, schedules, or Model Monitor alerts |
| Amazon SNS | Pub/sub messaging and notifications | Send alerts from CloudWatch alarms — email, SMS, or Lambda triggers |
## CloudWatch for SageMaker
SageMaker automatically publishes key metrics to CloudWatch:
| Metric | What It Tracks |
|---|---|
| CPUUtilization | CPU usage on endpoint instances |
| MemoryUtilization | Memory usage on endpoint instances |
| ModelLatency | Time the model takes to respond |
| Invocations | Number of inference requests |
| 4XXError / 5XXError | Client and server error rates |
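These metrics can feed alarms directly. As a sketch, here is the parameter set for a CloudWatch alarm on `ModelLatency` that notifies an SNS topic; the endpoint name, variant, threshold, and topic ARN are placeholders, and the live `put_metric_alarm` call is shown commented out since it requires AWS credentials:

```python
# Sketch: a CloudWatch alarm on a SageMaker endpoint's ModelLatency metric.
# Endpoint name, variant, threshold, and SNS topic ARN are placeholders.
alarm_params = {
    "AlarmName": "churn-endpoint-high-latency",
    "Namespace": "AWS/SageMaker",
    "MetricName": "ModelLatency",  # SageMaker reports this in microseconds
    "Dimensions": [
        {"Name": "EndpointName", "Value": "churn-endpoint"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    "Statistic": "Average",
    "Period": 300,               # evaluate over 5-minute windows
    "EvaluationPeriods": 2,      # alarm only after 2 consecutive breaches
    "Threshold": 500_000,        # 500 ms, expressed in microseconds
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:ml-alerts"],
}

# With AWS credentials configured, the alarm would be created with:
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**alarm_params)
```

Note the units: because `ModelLatency` is reported in microseconds, a 500 ms threshold is written as `500_000`.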
### Auto-Scaling with CloudWatch
CloudWatch alarms can trigger SageMaker endpoint auto-scaling. A typical pattern:
- Create a CloudWatch alarm on `InvocationsPerInstance` or `CPUUtilization`
- Define a scaling policy that adds instances when the threshold is breached
- Define a scale-in policy for when traffic drops
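The steps above can be sketched with the Application Auto Scaling API, which manages SageMaker endpoint scaling. With target tracking, the service creates the CloudWatch alarms for you. The endpoint/variant names, capacity limits, and target value are illustrative placeholders:

```python
# Sketch of the auto-scaling steps via Application Auto Scaling.
# Endpoint/variant names and capacity limits are placeholders.
resource_id = "endpoint/churn-endpoint/variant/AllTraffic"

scalable_target = {
    "ServiceNamespace": "sagemaker",
    "ResourceId": resource_id,
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "MinCapacity": 1,   # scale-in floor when traffic drops
    "MaxCapacity": 4,   # scale-out ceiling
}

scaling_policy = {
    "PolicyName": "invocations-target-tracking",
    "ServiceNamespace": "sagemaker",
    "ResourceId": resource_id,
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        # Add instances when per-instance invocations exceed the target;
        # the required CloudWatch alarms are created automatically.
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,  # wait 5 min before removing instances
        "ScaleOutCooldown": 60,  # add capacity quickly under load
    },
}

# With credentials configured:
# import boto3
# aas = boto3.client("application-autoscaling")
# aas.register_scalable_target(**scalable_target)
# aas.put_scaling_policy(**scaling_policy)
```

Target tracking handles both scale-out and scale-in from one policy; the asymmetric cooldowns make scale-out fast and scale-in conservative.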
## The MLOps Automation Pattern
The most operationally efficient pattern for automated retraining combines three services:
Model Monitor (detects drift) → EventBridge (receives alert, triggers action) → SageMaker Pipelines (retrains and redeploys)
This pattern requires no manual intervention — when Model Monitor detects data drift or model quality degradation, EventBridge automatically triggers a retraining pipeline.
### How It Works
- SageMaker Model Monitor runs on a schedule, comparing live inference data to the baseline
- When drift exceeds the threshold, Model Monitor emits a CloudWatch metric and/or an EventBridge event
- EventBridge rule matches the event and triggers a SageMaker Pipeline
- The pipeline retrains the model, evaluates it, and (with approval) deploys it
- The Model Registry tracks the new version with approval status
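Step 3 can be sketched as an EventBridge rule. One common wiring (an assumption here, not the only option) is: Model Monitor emits a CloudWatch metric, a CloudWatch alarm watches it, and EventBridge matches the alarm's state change and starts the pipeline. EventBridge supports SageMaker pipelines as a direct rule target. All names and ARNs below are placeholders:

```python
import json

# Sketch: an EventBridge rule that starts a retraining pipeline when a
# drift alarm fires. The alarm-based wiring is one common option; rule,
# alarm, pipeline names and ARNs are placeholders.
event_pattern = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {
        "alarmName": ["churn-model-drift"],
        "state": {"value": ["ALARM"]},
    },
}

rule_params = {
    "Name": "trigger-retraining-on-drift",
    "EventPattern": json.dumps(event_pattern),
    "State": "ENABLED",
}

# EventBridge can invoke a SageMaker pipeline directly as a rule target:
target_params = {
    "Rule": "trigger-retraining-on-drift",
    "Targets": [{
        "Id": "retraining-pipeline",
        "Arn": "arn:aws:sagemaker:us-east-1:123456789012:pipeline/churn-retrain",
        "RoleArn": "arn:aws:iam::123456789012:role/EventBridgePipelineRole",
        "SageMakerPipelineParameters": {"PipelineParameterList": []},
    }],
}

# With credentials configured:
# import boto3
# events = boto3.client("events")
# events.put_rule(**rule_params)
# events.put_targets(**target_params)
```

The rule's execution role must be allowed to call `sagemaker:StartPipelineExecution` on the target pipeline.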
## When to Use
Every production ML endpoint should have CloudWatch monitoring configured at minimum. For production systems that need automated retraining, implement the full MLOps pattern with EventBridge and Pipelines. Use SNS for human notification when automated responses are not appropriate.
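For the human-notification case, a minimal SNS sketch (topic ARN and message content are placeholders):

```python
# Sketch: notifying humans via SNS when an automated response is not
# appropriate. Topic ARN and message text are placeholders.
notification = {
    "TopicArn": "arn:aws:sns:us-east-1:123456789012:ml-alerts",
    "Subject": "Model quality degradation detected",
    "Message": (
        "Model Monitor flagged drift on endpoint churn-endpoint. "
        "Review the latest monitoring report before retraining."
    ),
}

# With credentials configured, subscribers (email, SMS, Lambda) receive:
# import boto3
# boto3.client("sns").publish(**notification)
```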
## Flashcards
What metrics does SageMaker automatically publish to CloudWatch?
CPUUtilization, MemoryUtilization, ModelLatency, Invocations, and 4XX/5XX error rates for endpoints.
For the most operationally efficient automated retraining, use the EventBridge + Pipelines + Model Monitor pattern. Manual monitoring and scheduled retraining are simpler but less responsive — drift could go undetected until the next scheduled check.