Monitoring and Orchestration

Production ML systems require continuous monitoring, automated responses to issues, and orchestrated workflows for retraining. AWS provides CloudWatch for metrics and alarms, EventBridge for event-driven automation, and SNS for notifications — forming the operational backbone of MLOps.

Overview

Service	What It Does	When to Use
Amazon CloudWatch	Metrics, logs, and alarms for AWS resources	Monitor SageMaker endpoints, training jobs, and infrastructure health
Amazon EventBridge	Serverless event bus — react to events from AWS services	Trigger automated workflows on SageMaker events, schedules, or Model Monitor alerts
Amazon SNS	Pub/sub messaging and notifications	Send alerts from CloudWatch alarms — email, SMS, or Lambda triggers

CloudWatch for SageMaker

SageMaker automatically publishes key metrics to CloudWatch:

Metric	What It Tracks
CPUUtilization	CPU usage on endpoint instances
MemoryUtilization	Memory usage on endpoint instances
ModelLatency	Time the model takes to respond, reported in microseconds
Invocations	Number of inference requests
Invocation4XXErrors / Invocation5XXErrors	Counts of client and server errors
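
As a minimal sketch of alarming on one of these metrics, the snippet below builds the arguments for CloudWatch's put_metric_alarm call on ModelLatency. The endpoint name, variant name, threshold, evaluation window, and SNS topic ARN are illustrative assumptions, not values from this guide. The client is passed in (for example boto3.client("cloudwatch")) so the request can be inspected before anything is sent.

```python
from typing import Any, Dict


def latency_alarm_params(
    endpoint_name: str,
    variant_name: str,
    threshold_ms: float,
    sns_topic_arn: str,
) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"{endpoint_name}-{variant_name}-model-latency",
        "Namespace": "AWS/SageMaker",
        "MetricName": "ModelLatency",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "Statistic": "Average",
        # Assumed window: five consecutive one-minute periods above threshold.
        "Period": 60,
        "EvaluationPeriods": 5,
        # ModelLatency is reported in microseconds, so convert from ms.
        "Threshold": threshold_ms * 1000.0,
        "ComparisonOperator": "GreaterThanThreshold",
        # An idle endpoint emits no datapoints; do not treat that as a breach.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def create_latency_alarm(cloudwatch_client: Any, **kwargs: Any) -> None:
    """Send the alarm definition using a caller-supplied CloudWatch client."""
    cloudwatch_client.put_metric_alarm(**latency_alarm_params(**kwargs))
```

Keeping the request builder separate from the call makes the alarm definition easy to review or unit test. The same shape should work for Invocations or the error-count metrics by changing MetricName and the threshold.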

Auto-Scaling with CloudWatch

SageMaker endpoint auto-scaling is configured through Application Auto Scaling and driven by CloudWatch metrics and alarms. A typical pattern:

1. Create a CloudWatch alarm on InvocationsPerInstance or CPUUtilization
2. Define a scale-out policy that adds instances when the threshold is breached
3. Define a scale-in policy that removes instances when traffic drops
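
The sketch below shows one way to express this pattern with Application Auto Scaling. Note that it uses a target-tracking policy on the predefined SageMakerVariantInvocationsPerInstance metric, a variant of the steps above in which Application Auto Scaling creates and manages the scale-out and scale-in CloudWatch alarms itself. The endpoint and variant names, capacity limits, target value, and cooldowns are assumptions chosen for illustration.

```python
from typing import Any, Dict

SCALABLE_DIMENSION = "sagemaker:variant:DesiredInstanceCount"


def variant_resource_id(endpoint_name: str, variant_name: str) -> str:
    """Resource ID format Application Auto Scaling expects for a variant."""
    return f"endpoint/{endpoint_name}/variant/{variant_name}"


def scalable_target_params(
    endpoint_name: str, variant_name: str, min_instances: int, max_instances: int
) -> Dict[str, Any]:
    """Keyword arguments for register_scalable_target."""
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": variant_resource_id(endpoint_name, variant_name),
        "ScalableDimension": SCALABLE_DIMENSION,
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }


def target_tracking_policy_params(
    endpoint_name: str, variant_name: str, invocations_per_instance: float
) -> Dict[str, Any]:
    """Keyword arguments for put_scaling_policy (target tracking)."""
    return {
        "PolicyName": f"{endpoint_name}-{variant_name}-invocations-target",
        "ServiceNamespace": "sagemaker",
        "ResourceId": variant_resource_id(endpoint_name, variant_name),
        "ScalableDimension": SCALABLE_DIMENSION,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
            # Assumed cooldowns: scale out quickly, scale in conservatively.
            "ScaleOutCooldown": 60,
            "ScaleInCooldown": 300,
        },
    }


def apply_autoscaling(
    autoscaling_client: Any,
    endpoint_name: str,
    variant_name: str,
    min_instances: int,
    max_instances: int,
    invocations_per_instance: float,
) -> None:
    """Register the variant, then attach the policy (order matters)."""
    autoscaling_client.register_scalable_target(
        **scalable_target_params(endpoint_name, variant_name, min_instances, max_instances)
    )
    autoscaling_client.put_scaling_policy(
        **target_tracking_policy_params(endpoint_name, variant_name, invocations_per_instance)
    )
```

The client would typically be boto3.client("application-autoscaling"). A step-scaling policy attached to a hand-made alarm follows the same register-then-attach order, but needs separate scale-out and scale-in policies.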

The MLOps Automation Pattern

The most operationally efficient pattern for automated retraining combines three services:

Model Monitor (detects drift) → EventBridge (receives alert, triggers action) → SageMaker Pipelines (retrains and redeploys)

This pattern needs no manual intervention to detect a problem and start retraining: when Model Monitor detects data drift or model quality degradation, EventBridge automatically triggers a retraining pipeline.

How It Works

1. SageMaker Model Monitor runs on a schedule, comparing live inference data to the baseline
2. When drift exceeds the threshold, Model Monitor emits a CloudWatch metric and/or an EventBridge event
3. An EventBridge rule matches the event and triggers a SageMaker Pipeline
4. The pipeline retrains the model, evaluates it, and (with approval) deploys it
5. The Model Registry tracks the new version with its approval status
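
The EventBridge step in this flow can be sketched as a rule plus a pipeline target. This example assumes the drift signal arrives as a CloudWatch alarm (on a Model Monitor metric) entering the ALARM state, which is one way to wire it and not necessarily the only one. The rule name, alarm name, ARNs, and the pipeline parameter name are hypothetical, and the IAM role is assumed to allow sagemaker:StartPipelineExecution on the pipeline.

```python
import json
from typing import Any, Dict


def drift_alarm_event_pattern(alarm_name: str) -> str:
    """Event pattern matching one CloudWatch alarm entering ALARM."""
    pattern = {
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {
            "alarmName": [alarm_name],
            "state": {"value": ["ALARM"]},
        },
    }
    return json.dumps(pattern)


def pipeline_target(
    pipeline_arn: str, role_arn: str, parameters: Dict[str, str]
) -> Dict[str, Any]:
    """EventBridge target that starts a SageMaker Pipelines execution."""
    return {
        "Id": "retraining-pipeline",  # hypothetical target ID
        "Arn": pipeline_arn,
        "RoleArn": role_arn,
        "SageMakerPipelineParameters": {
            # Sorted so the generated request is deterministic.
            "PipelineParameterList": [
                {"Name": name, "Value": value}
                for name, value in sorted(parameters.items())
            ]
        },
    }


def wire_retraining_rule(
    events_client: Any,
    rule_name: str,
    alarm_name: str,
    pipeline_arn: str,
    role_arn: str,
    parameters: Dict[str, str],
) -> None:
    """Create (or update) the rule, then attach the pipeline as its target."""
    events_client.put_rule(
        Name=rule_name,
        EventPattern=drift_alarm_event_pattern(alarm_name),
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        Targets=[pipeline_target(pipeline_arn, role_arn, parameters)],
    )
```

The client would typically be boto3.client("events"). Matching on a single named alarm keeps the rule narrow, so unrelated alarms in the account do not start retraining runs.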

When to Use

Every production ML endpoint should have CloudWatch monitoring configured at minimum. For production systems that need automated retraining, implement the full MLOps pattern with EventBridge and Pipelines. Use SNS for human notification when automated responses are not appropriate.
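
For the human-notification case, a minimal sketch of publishing a drift alert to an SNS topic might look like the following. The topic ARN, endpoint name, metric name, and message wording are all hypothetical. In practice a CloudWatch alarm can also publish to the topic directly through its alarm actions, without custom code.

```python
from typing import Any, Dict


def drift_notification_params(
    topic_arn: str,
    endpoint_name: str,
    metric_name: str,
    observed: float,
    threshold: float,
) -> Dict[str, str]:
    """Build keyword arguments for SNS publish."""
    # SNS requires subjects shorter than 100 characters, so truncate.
    subject = f"Model drift on {endpoint_name}"[:99]
    message = (
        f"Endpoint: {endpoint_name}\n"
        f"Metric: {metric_name}\n"
        f"Observed: {observed:.3f} (threshold {threshold:.3f})\n"
        "Automated retraining was not triggered; please review."
    )
    return {"TopicArn": topic_arn, "Subject": subject, "Message": message}


def notify_drift(sns_client: Any, **kwargs: Any) -> None:
    """Publish the alert using a caller-supplied SNS client."""
    sns_client.publish(**drift_notification_params(**kwargs))
```

The client would typically be boto3.client("sns"). Subscribers to the topic (email, SMS, or a Lambda function) then receive the message.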

Flashcards

Question

What metrics does SageMaker automatically publish to CloudWatch?

Answer

CPUUtilization, MemoryUtilization, ModelLatency, Invocations, and the Invocation4XXErrors / Invocation5XXErrors error counts for endpoints.

Key Insight

For the most operationally efficient automated retraining, use the EventBridge + Pipelines + Model Monitor pattern. Manual monitoring and scheduled retraining are simpler but less responsive — drift could go undetected until the next scheduled check.
