AWS News Blog

Amazon SageMaker Model Monitor – Fully Managed Automatic Monitoring For Your Machine Learning Models


Today, we’re extremely happy to announce Amazon SageMaker Model Monitor, a new capability of Amazon SageMaker that automatically monitors machine learning (ML) models in production, and alerts you when data quality issues appear.

The first thing I learned when I started working with data is that there is no such thing as paying too much attention to data quality. Raise your hand if you’ve spent hours hunting down problems caused by unexpected NULL values or by exotic character encodings that somehow ended up in one of your databases.

As models are literally built from large amounts of data, it’s easy to see why ML practitioners spend so much time caring for their data sets. In particular, they make sure that data samples in the training set (used to train the model) and in the validation set (used to measure its accuracy) have the same statistical properties.

There be monsters! Although you have full control over your experimental data sets, the same can’t be said for real-life data that your models will receive. Of course, that data will be unclean, but a more worrisome problem is “data drift”, i.e. a gradual shift in the very statistical nature of the data you receive. Minimum and maximum values, mean, variance, and more: all these are key attributes that shape assumptions and decisions made during the training of a model. Intuitively, you can surely feel that any significant change in these values would impact the accuracy of predictions: imagine a loan application model predicting higher amounts because input features are drifting or even missing!

Detecting these conditions is pretty difficult: you would need to capture data received by your models, run all kinds of statistical analysis to compare that data to the training set, define rules to detect drift, send alerts if it happens… and do it all over again each time you update your models. Expert ML practitioners certainly know how to build these complex tools, but at the great expense of time and resources. Undifferentiated heavy lifting strikes again…

To help all customers focus on creating value instead, we built Amazon SageMaker Model Monitor. Let me tell you more.

Introducing Amazon SageMaker Model Monitor
A typical monitoring session goes like this. You start with a SageMaker endpoint to monitor, either an existing one or a new one created specifically for monitoring purposes. You can use SageMaker Model Monitor on any endpoint, whether the model was trained with a built-in algorithm, a built-in framework, or your own container.

Using the SageMaker SDK, you can capture a configurable fraction of the data sent to the endpoint (you can also capture predictions if you’d like), and store it in one of your Amazon Simple Storage Service (Amazon S3) buckets. Captured data is enriched with metadata (content type, timestamp, etc.), and you can secure and access it just like any S3 object.
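
For instance, with the SageMaker Python SDK, turning capture on at deployment time could look like the following sketch; the bucket path and the model object are placeholders, not values from the demo below.

from sagemaker.model_monitor import DataCaptureConfig

# Capture half of the requests (and the corresponding predictions) to S3 -- placeholder path
capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=50,
    destination_s3_uri='s3://my-bucket/sagemaker/datacapture')

# 'model' is an already-built sagemaker.model.Model object
predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.xlarge',
    data_capture_config=capture_config)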

Then, you create a baseline from the data set that was used to train the model deployed on the endpoint (of course, you can reuse an existing baseline, too). As sketched after the list below, this fires up an Amazon SageMaker Processing job where SageMaker Model Monitor will:

  • Infer a schema for the input data, i.e. type and completeness information for each feature. You should review it, and update it if needed.
  • For pre-built containers only, compute feature statistics using Deequ, an open source tool based on Apache Spark that is developed and used at Amazon (blog post and research paper). These statistics include KLL sketches, an advanced technique to compute accurate quantiles on streams of data, which we recently contributed to Deequ.
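
Here’s a minimal sketch of this baselining step with the SageMaker Python SDK; the role, paths, and instance type are placeholders, and the demo further down uses a thin wrapper around the same underlying API instead.

from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

my_monitor = DefaultModelMonitor(
    role=role,                     # IAM role with SageMaker and S3 permissions
    instance_count=1,
    instance_type='ml.m5.xlarge')

# Launch the baselining Processing job on the training data set -- placeholder paths
my_monitor.suggest_baseline(
    baseline_dataset='s3://my-bucket/training-dataset.csv',
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri='s3://my-bucket/baseline-results',
    wait=True)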

Using these artifacts, the next step is to launch a monitoring schedule, to let SageMaker Model Monitor inspect collected data and prediction quality. Whether you’re using a built-in or custom container, a number of built-in rules are applied, and reports are periodically pushed to S3. The reports contain statistics and schema information on the data received during the latest time frame, as well as any violation that was detected.
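
Continuing the SDK sketch above, scheduling an hourly monitoring job could look like this; the schedule name and output path are placeholders.

from sagemaker.model_monitor import CronExpressionGenerator

# Run a monitoring job every hour, comparing captured data against the baseline
my_monitor.create_monitoring_schedule(
    monitor_schedule_name='churn-monitoring-schedule',      # placeholder name
    endpoint_input=endpoint_name,
    output_s3_uri='s3://my-bucket/monitoring-reports',      # placeholder path
    statistics=my_monitor.baseline_statistics(),
    constraints=my_monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
    enable_cloudwatch_metrics=True)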

Last but not least, SageMaker Model Monitor emits per-feature metrics to Amazon CloudWatch, which you can use to set up dashboards and alerts. The summary metrics from CloudWatch are also visible in Amazon SageMaker Studio, and of course all statistics, monitoring results and data collected can be viewed and further analyzed in a notebook.
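
For instance, loading the statistics into a pandas DataFrame for a closer look only takes a few lines; the bucket and key below are placeholders.

import json
import boto3
import pandas as pd

# Fetch the statistics file produced by a baselining or monitoring job -- placeholder location
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='my-bucket', Key='baseline-results/statistics.json')
stats = json.loads(obj['Body'].read())

# One row per feature, with nested statistics flattened into columns (pandas >= 1.0)
features_df = pd.json_normalize(stats['features'])
print(features_df.head())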

For more information and an example of how to set up SageMaker Model Monitor with AWS CloudFormation, refer to the developer guide.

Now, let’s do a demo, using a churn prediction model trained with the built-in XGBoost algorithm.

Enabling Data Capture
The first step is to create an endpoint configuration to enable data capture. Here, I decide to capture 100% of incoming data, as well as model output (i.e. predictions). I’m also passing the content types for CSV and JSON data.

data_capture_configuration = {
    "EnableCapture": True,
    "InitialSamplingPercentage": 100,
    "DestinationS3Uri": s3_capture_upload_path,
    "CaptureOptions": [
        { "CaptureMode": "Output" },
        { "CaptureMode": "Input" }
    ],
    "CaptureContentTypeHeader": {
       "CsvContentTypes": ["text/csv"],
       "JsonContentTypes": ["application/json"]
}

Next, I create an endpoint configuration that references this capture setting, and then the endpoint itself with the usual CreateEndpoint API.

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants=[{
        'InstanceType':'ml.m5.xlarge',
        'InitialInstanceCount':1,
        'InitialVariantWeight':1,
        'ModelName':model_name,
        'VariantName':'AllTrafficVariant'
    }],
    DataCaptureConfig = data_capture_configuration)
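
The endpoint itself then comes from the usual CreateEndpoint call, along these lines:

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name)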

On an existing endpoint, I would have used the UpdateEndpoint API to seamlessly update the endpoint configuration.
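
That path is just as simple; a sketch of the call:

# Point an existing endpoint at a new configuration with data capture enabled
sm_client.update_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name)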

After invoking the endpoint repeatedly, I can see some captured data in S3 (output was edited for clarity).

$ aws s3 ls --recursive s3://sagemaker-us-west-2-123456789012/sagemaker/DEMO-ModelMonitor/datacapture/DEMO-xgb-churn-pred-model-monitor-2019-11-22-07-59-33/
AllTrafficVariant/2019/11/22/08/24-40-519-9a9273ca-09c2-45d3-96ab-fc7be2402d43.jsonl
AllTrafficVariant/2019/11/22/08/25-42-243-3e1c653b-8809-4a6b-9d51-69ada40bc809.jsonl

Here’s a line from one of these files.

    "endpointInput":{
        "observedContentType":"text/csv",
        "mode":"INPUT",
        "data":"132,25,113.2,96,269.9,107,229.1,87,7.1,7,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1",
        "encoding":"CSV"
     },
     "endpointOutput":{
        "observedContentType":"text/csv; charset=utf-8",
        "mode":"OUTPUT",
        "data":"0.01076381653547287",
        "encoding":"CSV"}
     },
    "eventMetadata":{
        "eventId":"6ece5c74-7497-43f1-a263-4833557ffd63",
        "inferenceTime":"2019-11-22T08:24:40Z"},
        "eventVersion":"0"}

Pretty much what I expected. Now, let’s create a baseline for this model.

Creating A Monitoring Baseline
This is a very simple step: pass the location of the baseline data set, and the location where results should be stored.

from processingjob_wrapper import ProcessingJob

processing_job = ProcessingJob(sm_client, role).create(
    job_name, baseline_data_uri, baseline_results_uri)

Once that job is complete, I can see two new objects in S3: one for statistics, and one for constraints.

aws s3 ls s3://sagemaker-us-west-2-123456789012/sagemaker/DEMO-ModelMonitor/baselining/results/
constraints.json
statistics.json

The constraints.json file tells me about the inferred schema for the training data set (don’t forget to check that it’s accurate). Each feature is typed, and I also get information on whether a feature is always present or not (1.0 means 100% here). Here are the first few lines.

{
  "version" : 0.0,
  "features" : [ {
    "name" : "Churn",
    "inferred_type" : "Integral",
    "completeness" : 1.0
  }, {
    "name" : "Account Length",
    "inferred_type" : "Integral",
    "completeness" : 1.0
  }, {
    "name" : "VMail Message",
    "inferred_type" : "Integral",
    "completeness" : 1.0
  }, {
    "name" : "Day Mins",
    "inferred_type" : "Fractional",
    "completeness" : 1.0
  }, {
    "name" : "Day Calls",
    "inferred_type" : "Integral",
    "completeness" : 1.0

At the end of that file, I can see configuration information for CloudWatch monitoring: turn it on or off, set the drift threshold, etc.

"monitoring_config" : {
    "evaluate_constraints" : "Enabled",
    "emit_metrics" : "Enabled",
    "distribution_constraints" : {
      "enable_comparisons" : true,
      "min_domain_mass" : 1.0,
      "comparison_threshold" : 1.0
    }
  }

The statistics.json file shows different statistics for each feature (mean, standard deviation, quantiles, etc.), as well as the unique values observed in the baseline data set. Here’s an example.

"name" : "Day Mins",
    "inferred_type" : "Fractional",
    "numerical_statistics" : {
      "common" : {
        "num_present" : 2333,
        "num_missing" : 0
      },
      "mean" : 180.22648949849963,
      "sum" : 420468.3999999996,
      "std_dev" : 53.987178959901556,
      "min" : 0.0,
      "max" : 350.8,
      "distribution" : {
        "kll" : {
          "buckets" : [ {
            "lower_bound" : 0.0,
            "upper_bound" : 35.08,
            "count" : 14.0
          }, {
            "lower_bound" : 35.08,
            "upper_bound" : 70.16,
            "count" : 48.0
          }, {
            "lower_bound" : 70.16,
            "upper_bound" : 105.24000000000001,
            "count" : 130.0
          }, {
            "lower_bound" : 105.24000000000001,
            "upper_bound" : 140.32,
            "count" : 318.0
          }, {
            "lower_bound" : 140.32,
            "upper_bound" : 175.4,
            "count" : 565.0
          }, {
            "lower_bound" : 175.4,
            "upper_bound" : 210.48000000000002,
            "count" : 587.0
          }, {
            "lower_bound" : 210.48000000000002,
            "upper_bound" : 245.56,
            "count" : 423.0
          }, {
            "lower_bound" : 245.56,
            "upper_bound" : 280.64,
            "count" : 180.0
          }, {
            "lower_bound" : 280.64,
            "upper_bound" : 315.72,
            "count" : 58.0
          }, {
            "lower_bound" : 315.72,
            "upper_bound" : 350.8,
            "count" : 10.0
          } ],
          "sketch" : {
            "parameters" : {
              "c" : 0.64,
              "k" : 2048.0
            },
            "data" : [ [ 178.1, 160.3, 197.1, 105.2, 283.1, 113.6, 232.1, 212.7, 73.3, 176.9, 161.9, 128.6, 190.5, 223.2, 157.9, 173.1, 273.5, 275.8, 119.2, 174.6, 133.3, 145.0, 150.6, 220.2, 109.7, 155.4, 172.0, 235.6, 218.5, 92.7, 90.7, 162.3, 146.5, 210.1, 214.4, 194.4, 237.3, 255.9, 197.9, 200.2, 120, ...

Now, let’s start monitoring our endpoint.

Monitoring An Endpoint
Again, one API call is all that it takes: I simply create a monitoring schedule for my endpoint, passing the constraints and statistics file for the baseline data set. Optionally, I could also pass preprocessing and postprocessing functions, should I want to tweak data and predictions.

ms = MonitoringSchedule(sm_client, role)
schedule = ms.create(
   mon_schedule_name, 
   endpoint_name, 
   s3_report_path, 
   # record_preprocessor_source_uri=s3_code_preprocessor_uri, 
   # post_analytics_source_uri=s3_code_postprocessor_uri,
   baseline_statistics_uri=baseline_results_uri + '/statistics.json',
   baseline_constraints_uri=baseline_results_uri+ '/constraints.json'
)

Then, I start sending bogus data to the endpoint, i.e. samples constructed from random values, and I wait for SageMaker Model Monitor to start generating reports. The suspense is killing me!
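
My quick-and-dirty traffic generator looks something like the sketch below; the feature count and value ranges are made up, so match them to your own model.

import random
import boto3

smrt = boto3.client('sagemaker-runtime')

NUM_FEATURES = 69   # placeholder: match the width of your model's input vector

# Send bogus samples: random values that deliberately ignore the training distribution
for _ in range(120):
    payload = ','.join(str(round(random.uniform(0, 500), 1)) for _ in range(NUM_FEATURES))
    smrt.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='text/csv',
        Body=payload)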

Inspecting Reports
Soon enough, monitoring jobs start completing and reports become available in S3.

mon_executions = sm_client.list_monitoring_executions(MonitoringScheduleName=mon_schedule_name, MaxResults=3)
for execution_summary in mon_executions['MonitoringExecutionSummaries']:
    print("ProcessingJob: {}".format(execution_summary['ProcessingJobArn'].split('/')[1]))
    print('MonitoringExecutionStatus: {} \n'.format(execution_summary['MonitoringExecutionStatus']))

ProcessingJob: model-monitoring-201911221050-df2c7fc4
MonitoringExecutionStatus: Completed 

ProcessingJob: model-monitoring-201911221040-3a738dd7
MonitoringExecutionStatus: Completed 

ProcessingJob: model-monitoring-201911221030-83f15fb9
MonitoringExecutionStatus: Completed 

Let’s find the reports for one of these monitoring jobs.

desc_analytics_job_result=sm_client.describe_processing_job(ProcessingJobName=job_name)
report_uri=desc_analytics_job_result['ProcessingOutputConfig']['Outputs'][0]['S3Output']['S3Uri']
print('Report Uri: {}'.format(report_uri))

Report Uri: s3://sagemaker-us-west-2-123456789012/sagemaker/DEMO-ModelMonitor/reports/2019112208-2019112209

Ok, so what do we have here?

aws s3 ls s3://sagemaker-us-west-2-123456789012/sagemaker/DEMO-ModelMonitor/reports/2019112208-2019112209/

constraint_violations.json
constraints.json
statistics.json

As you would expect, the constraints.json and statistics.json files contain schema and statistics information on the data samples processed by the monitoring job. Let’s go straight to the third one, constraint_violations.json!

violations" : [ {
    "feature_name" : "State_AL",
    "constraint_check_type" : "data_type_check",
    "description" : "Value: 0.8 does not meet the constraint requirement! "
  }, {
    "feature_name" : "Eve Mins",
    "constraint_check_type" : "baseline_drift_check",
    "description" : "Numerical distance: 0.2711598746081505 exceeds numerical threshold: 0"
  }, {
    "feature_name" : "CustServ Calls",
    "constraint_check_type" : "baseline_drift_check",
    "description" : "Numerical distance: 0.6470588235294117 exceeds numerical threshold: 0"
  }

Oops! It looks like I’ve been assigning floating point values to integer features: surely that’s not going to work too well!

Some features are also exhibiting drift, that’s not good either. Maybe something is wrong with my data ingestion process, or maybe the distribution of data has actually changed, and I need to retrain the model. As all this information is available as CloudWatch metrics, I could define thresholds, set alarms and even trigger new training jobs automatically.
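
As a sketch, here’s how one of those alarms could be defined with boto3. The metric name, namespace, and dimensions below are assumptions about what Model Monitor publishes for my endpoint, so check the actual metrics in your CloudWatch console before using them.

import boto3

cw = boto3.client('cloudwatch')

# Alarm when the drift metric for 'Day Mins' exceeds a threshold.
# Metric name, namespace and dimensions are assumptions -- verify them for your own endpoint.
cw.put_metric_alarm(
    AlarmName='DayMins-baseline-drift',
    Namespace='aws/sagemaker/Endpoints/data-metrics',
    MetricName='feature_baseline_drift_Day Mins',
    Dimensions=[
        {'Name': 'Endpoint', 'Value': endpoint_name},
        {'Name': 'MonitoringSchedule', 'Value': mon_schedule_name}],
    Statistic='Average',
    Period=3600,
    EvaluationPeriods=1,
    Threshold=0.2,
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-west-2:123456789012:drift-alerts'])   # placeholder SNS topic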

Now Available!
As you can see, Amazon SageMaker Model Monitor is easy to set up, and helps you quickly detect data quality issues affecting your ML models.

Now it’s your turn: you can start using Amazon SageMaker Model Monitor today in all commercial regions where Amazon SageMaker is available. This capability is also integrated in Amazon SageMaker Studio, our workbench for ML projects.

Give it a try and please send us feedback, either on the AWS forum for Amazon SageMaker, or through your usual AWS support contacts.

- Julien
Julien Simon

As an Artificial Intelligence & Machine Learning Evangelist for EMEA, Julien focuses on helping developers and enterprises bring their ideas to life.