Monitoring and Orchestration

Production ML systems require continuous monitoring, automated responses to issues, and orchestrated workflows for retraining. AWS provides CloudWatch for metrics and alarms, EventBridge for event-driven automation, and SNS for notifications. Together, these services form the operational backbone of MLOps on AWS.

Overview

| Service | What It Does | When to Use |
| --- | --- | --- |
| Amazon CloudWatch | Metrics, logs, and alarms for AWS resources | Monitor SageMaker endpoints, training jobs, and infrastructure health |
| Amazon EventBridge | Serverless event bus that reacts to events from AWS services | Trigger automated workflows on SageMaker events, schedules, or Model Monitor alerts |
| Amazon SNS | Pub/sub messaging and notifications | Send alerts from CloudWatch alarms by email, SMS, or Lambda trigger |

CloudWatch for SageMaker

SageMaker automatically publishes key metrics to CloudWatch:

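| Metric | CloudWatch Namespace | What It Tracks |
| --- | --- | --- |
| CPUUtilization | /aws/sagemaker/Endpoints | CPU usage on endpoint instances |
| MemoryUtilization | /aws/sagemaker/Endpoints | Memory usage on endpoint instances |
| ModelLatency | AWS/SageMaker | Time the model takes to respond, reported in microseconds |
| Invocations | AWS/SageMaker | Number of inference requests |
| Invocation4XXErrors / Invocation5XXErrors | AWS/SageMaker | Counts of requests that returned client (4XX) and server (5XX) errors |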
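As a rough illustration of reading one of these metrics, the sketch below builds a CloudWatch `GetMetricStatistics` request for `ModelLatency` and converts the result to milliseconds. The helper names, the one-hour window, and the `AllTraffic` variant name are assumptions made for the example; the `cloudwatch` argument is expected to be a boto3 CloudWatch client such as `boto3.client("cloudwatch")`.

```python
from datetime import datetime, timedelta, timezone


def endpoint_metric_query(endpoint_name, variant_name, metric_name,
                          statistic="Average", hours=1, period_seconds=300,
                          now=None):
    """Build keyword arguments for CloudWatch GetMetricStatistics.

    Hypothetical helper: the defaults (1 hour, 5-minute periods) are
    illustrative choices, not recommendations from the source.
    """
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/SageMaker",  # invocation metrics live here
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period_seconds,
        "Statistics": [statistic],
    }


def average_model_latency_ms(cloudwatch, endpoint_name,
                             variant_name="AllTraffic"):
    """Return recent ModelLatency in milliseconds, or None if no data.

    The result is the unweighted mean of the per-period averages.
    """
    query = endpoint_metric_query(endpoint_name, variant_name, "ModelLatency")
    datapoints = cloudwatch.get_metric_statistics(**query)["Datapoints"]
    if not datapoints:
        return None
    # ModelLatency is reported in microseconds; convert to milliseconds.
    mean_us = sum(p["Average"] for p in datapoints) / len(datapoints)
    return mean_us / 1000.0
```

Passing the client in as an argument keeps the request-building logic easy to inspect and test without AWS credentials.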

Auto-Scaling with CloudWatch

CloudWatch alarms can drive auto-scaling for SageMaker endpoints through Application Auto Scaling. A typical pattern:

  1. Create a CloudWatch alarm on InvocationsPerInstance or CPUUtilization
  2. Define a scaling policy that adds instances when the threshold is breached
  3. Define a scale-in policy for when traffic drops
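The three steps above can be sketched with boto3 as a pair of step scaling policies, each attached to its own CloudWatch alarm. This is a minimal sketch under stated assumptions: the function name, the thresholds (1000 and 200 invocations per instance per minute), the capacity limits, and the `AllTraffic` variant name are hypothetical values chosen for illustration. `autoscaling` is expected to be `boto3.client("application-autoscaling")` and `cloudwatch` to be `boto3.client("cloudwatch")`.

```python
def configure_step_scaling(autoscaling, cloudwatch, endpoint_name,
                           variant_name="AllTraffic",
                           min_instances=1, max_instances=4,
                           high_invocations=1000.0, low_invocations=200.0):
    """Register an endpoint variant for scaling and attach two alarms.

    All numeric defaults are illustrative assumptions, not tuned values.
    Returns a dict mapping policy direction to the policy ARN.
    """
    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    autoscaling.register_scalable_target(
        MinCapacity=min_instances, MaxCapacity=max_instances, **target
    )

    # Each entry: (suffix, alarm comparison, alarm threshold, step adjustment).
    directions = (
        ("scale-out", "GreaterThanThreshold", high_invocations,
         {"MetricIntervalLowerBound": 0.0, "ScalingAdjustment": 1}),
        ("scale-in", "LessThanThreshold", low_invocations,
         {"MetricIntervalUpperBound": 0.0, "ScalingAdjustment": -1}),
    )

    policy_arns = {}
    for suffix, comparison, threshold, step in directions:
        policy = autoscaling.put_scaling_policy(
            PolicyName=f"{endpoint_name}-{suffix}",
            PolicyType="StepScaling",
            StepScalingPolicyConfiguration={
                "AdjustmentType": "ChangeInCapacity",
                "StepAdjustments": [step],
                "Cooldown": 300,
                "MetricAggregationType": "Average",
            },
            **target,
        )
        # The alarm's action is the scaling policy, so a breach changes capacity.
        cloudwatch.put_metric_alarm(
            AlarmName=f"{endpoint_name}-{suffix}",
            Namespace="AWS/SageMaker",
            MetricName="InvocationsPerInstance",
            Dimensions=[
                {"Name": "EndpointName", "Value": endpoint_name},
                {"Name": "VariantName", "Value": variant_name},
            ],
            Statistic="Sum",
            Period=60,
            EvaluationPeriods=3,
            Threshold=threshold,
            ComparisonOperator=comparison,
            AlarmActions=[policy["PolicyARN"]],
        )
        policy_arns[suffix] = policy["PolicyARN"]
    return policy_arns
```

Keeping a gap between the scale-out and scale-in thresholds helps avoid the endpoint oscillating between instance counts when traffic hovers near a single value.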

The MLOps Automation Pattern

The most operationally efficient pattern for automated retraining combines three services:

Model Monitor (detects drift) → EventBridge (receives alert, triggers action) → SageMaker Pipelines (retrains and redeploys)

This pattern requires no manual intervention to start retraining: when Model Monitor detects data drift or model quality degradation, EventBridge automatically triggers a retraining pipeline. Deployment of the retrained model can still be gated by an approval step.

How It Works

  1. SageMaker Model Monitor runs on a schedule, comparing live inference data to the baseline
  2. When drift exceeds the threshold, Model Monitor emits a CloudWatch metric and/or an EventBridge event
  3. An EventBridge rule matches the event and starts a SageMaker Pipelines execution
  4. The pipeline retrains the model, evaluates it, and (with approval) deploys it
  5. The Model Registry tracks the new version with approval status
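Step 3 can be sketched as an EventBridge rule with a SageMaker pipeline as its target. The sketch assumes one particular wiring: a CloudWatch alarm already watches the drift metric from Model Monitor, and the rule matches that alarm entering the `ALARM` state. The function names, the target ID, and every ARN passed in are hypothetical; `events` is expected to be `boto3.client("events")`, and the IAM role must allow `sagemaker:StartPipelineExecution`.

```python
import json


def alarm_state_pattern(alarm_name):
    """Event pattern matching a CloudWatch alarm that enters ALARM state."""
    return {
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {
            "alarmName": [alarm_name],
            "state": {"value": ["ALARM"]},
        },
    }


def connect_alarm_to_pipeline(events, rule_name, alarm_name,
                              pipeline_arn, role_arn,
                              pipeline_parameters=None):
    """Create a rule that starts a pipeline execution when the alarm fires.

    Returns True when EventBridge reports no failed target entries.
    """
    events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(alarm_state_pattern(alarm_name)),
        State="ENABLED",
    )
    target = {
        "Id": "retrain-pipeline",  # hypothetical target identifier
        "Arn": pipeline_arn,
        "RoleArn": role_arn,
    }
    if pipeline_parameters:
        # Optional values handed to the pipeline's declared parameters.
        target["SageMakerPipelineParameters"] = {
            "PipelineParameterList": [
                {"Name": name, "Value": value}
                for name, value in pipeline_parameters.items()
            ]
        }
    response = events.put_targets(Rule=rule_name, Targets=[target])
    return response.get("FailedEntryCount", 0) == 0
```

Routing through an alarm is only one option. A rule could instead match SageMaker events directly, in which case only the event pattern would change.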

When to Use

Every production ML endpoint should have CloudWatch monitoring configured at minimum. For production systems that need automated retraining, implement the full MLOps pattern with EventBridge and Pipelines. Use SNS for human notification when automated responses are not appropriate.
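For the human-notification case, a minimal sketch is a CloudWatch alarm on endpoint server errors whose action is an SNS topic. The function names, alarm name, threshold of five errors per five minutes, and the `AllTraffic` variant name are assumptions for illustration; the topic ARN must refer to an existing SNS topic with subscribers, and `cloudwatch` is expected to be `boto3.client("cloudwatch")`.

```python
def error_alarm_request(endpoint_name, topic_arn,
                        variant_name="AllTraffic", threshold=5.0):
    """Build PutMetricAlarm arguments for a 5XX-error alarm.

    The threshold and period are illustrative values, not recommendations.
    """
    return {
        "AlarmName": f"{endpoint_name}-5xx-errors",
        "Namespace": "AWS/SageMaker",
        "MetricName": "Invocation5XXErrors",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "Statistic": "Sum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # An idle endpoint reports no data; do not treat that as a failure.
        "TreatMissingData": "notBreaching",
        # SNS fans the alarm out to email, SMS, or Lambda subscribers.
        "AlarmActions": [topic_arn],
    }


def create_error_alarm(cloudwatch, endpoint_name, topic_arn, **options):
    """Create the alarm and return its name."""
    request = error_alarm_request(endpoint_name, topic_arn, **options)
    cloudwatch.put_metric_alarm(**request)
    return request["AlarmName"]
```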

Flashcards

Question

What metrics does SageMaker automatically publish to CloudWatch?

Answer

CPUUtilization, MemoryUtilization, ModelLatency, Invocations, and Invocation4XXErrors / Invocation5XXErrors for endpoints.

Key Insight

For the most operationally efficient automated retraining, use the EventBridge + Pipelines + Model Monitor pattern. Manual monitoring and scheduled retraining are simpler but less responsive — drift could go undetected until the next scheduled check.