Automatically Create Machine Learning Models

TUTORIAL

Overview

In this tutorial, you will learn how to use Amazon SageMaker Autopilot to automatically build, train, and tune a machine learning (ML) model, and deploy the model to make predictions.
 
Amazon SageMaker Autopilot eliminates the heavy lifting of building ML models by helping you automatically build, train, and tune the best ML model based on your data. With SageMaker Autopilot, you simply provide a tabular dataset and select the target column to predict. SageMaker Autopilot explores your data, selects the algorithms relevant to your problem type, prepares the data for model training, tests a variety of models, and selects the best performing one. You can then deploy one of the candidate models or iterate on them further to improve prediction quality.
 

What you will accomplish

In this guide, you will:

  • Create a training experiment using SageMaker Autopilot
  • Explore the different stages of the training experiment
  • Identify and deploy the best performing model from the training experiment
  • Predict with your deployed model

Prerequisites

Before starting this guide, you will need:

 AWS experience

Beginner

 Time to complete

45 minutes

 Cost to complete

See SageMaker pricing to estimate cost for this tutorial.

 Requires

You must be logged into an AWS account.

 Services used

Amazon SageMaker Autopilot

 Last updated

April 25, 2023

Implementation

For this workflow, you will use a direct marketing dataset from a financial services institution. The dataset describes a campaign run to promote certificate of deposit enrollment, and includes customer and campaign attributes along with a target column, y, indicating whether the customer enrolled. For the purposes of this tutorial, the dataset is small. However, you can follow the same steps in this tutorial to process large datasets.

 

Step 1: Set up Amazon SageMaker Studio domain

An AWS account can have only one SageMaker Studio domain per AWS Region. If you already have a SageMaker Studio domain in the US East (N. Virginia) Region, follow the SageMaker Studio setup guide to attach the required AWS IAM policies to your SageMaker Studio account, then skip Step 1, and proceed directly to Step 2. 

If you don't have an existing SageMaker Studio domain, continue with Step 1 to run an AWS CloudFormation template that creates a SageMaker Studio domain and adds the permissions required for the rest of this tutorial.

This stack assumes that you already have a public VPC set up in your account. If you do not have a public VPC, see VPC with a single public subnet to learn how to create a public VPC.

Choose the AWS CloudFormation stack link. This link opens the AWS CloudFormation console and creates your SageMaker Studio domain and a user named studio-user. It also adds the required permissions to your SageMaker Studio account. In the CloudFormation console, confirm that US East (N. Virginia) is the Region displayed in the upper right corner. The stack name should be CFN-SM-IM-Lambda-catalog; do not change it. This stack takes about 10 minutes to create all the resources.

Select I acknowledge that AWS CloudFormation might create IAM resources, and then choose Create stack.

On the CloudFormation pane, choose Stacks. When the stack is created, the status of the stack should change from CREATE_IN_PROGRESS to CREATE_COMPLETE.
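
If you prefer to watch the stack from code rather than the console, the following is a minimal boto3 sketch. It assumes you have AWS credentials configured locally (for example, through the AWS CLI); the client call and stack name match what the console shows.

import boto3

# Query the stack created by the CloudFormation template in US East (N. Virginia).
cfn = boto3.client('cloudformation', region_name='us-east-1')
stack = cfn.describe_stacks(StackName='CFN-SM-IM-Lambda-catalog')['Stacks'][0]
print(stack['StackStatus'])  # CREATE_COMPLETE once all resources exist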

Enter SageMaker Studio into the AWS console search bar, and then choose SageMaker Studio from the search results.

Choose US East (N. Virginia) from the Region dropdown list on the upper right corner of the SageMaker console. For Launch app, select Studio to open SageMaker Studio using the studio-user profile.
 

Step 2: Start a new SageMaker Autopilot experiment

Developing and testing a large number of candidate models is crucial for machine learning (ML) projects. Amazon SageMaker Autopilot helps by generating multiple model candidates and automatically choosing the best one based on your data. In this step, you will configure a SageMaker Autopilot experiment to predict the success of a financial services marketing campaign. The dataset represents a marketing campaign run by a major financial services institution to promote certificate of deposit enrollment.

To start a new SageMaker Autopilot experiment, in the Launcher window, scroll down to ML tasks and components and select New Autopilot experiment.

Alternatively, select File from the top menu, then New, and then New experiment.

Next, name your experiment: select the Experiment name box and enter autopilot-experiment. Then connect the experiment to the data staged in Amazon S3: select Enter S3 bucket location, and in the S3 bucket address box, paste the following S3 path: s3://sagemaker-sample-files/datasets/tabular/uci_bank_marketing/bank-additional-full.csv
 
Leave the manifest file option set to Off. In the Target dropdown, select y as the target feature that the model will attempt to predict.
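
If you want a quick look at the data itself, you can preview it from a Studio notebook before the experiment runs. This is a minimal sketch, assuming pandas and s3fs are available in the kernel (both typically are in SageMaker Studio):

import pandas as pd  # reading s3:// paths with pandas requires the s3fs package

url = 's3://sagemaker-sample-files/datasets/tabular/uci_bank_marketing/bank-additional-full.csv'
df = pd.read_csv(url)  # if the file turns out to be semicolon-delimited, pass sep=';'

print(df.shape)
print(df['y'].value_counts())  # the target column the experiment will predict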

For Output data, you can optionally choose your own S3 bucket by toggling off Auto create output data location and specifying a bucket under Output data (S3 bucket). For this tutorial, leave the default setting and let SageMaker Studio create an output bucket. Then select Next: Training method.

For the training method, keep the default option, Auto, selected (or optionally select Ensembling or Hyperparameter optimization (HPO)). Then select Next: Deployment and advanced settings.

For Deployment and advanced settings, keep the defaults. (Optionally, you can specify the Auto deploy endpoint name and select the machine learning problem type from the dropdown.) Then select Next: Review and create.

Review the summary of the experiment. If you want to modify any option, choose Previous and make the change. Choose Create Experiment to start the first stage of the SageMaker Autopilot experiment. In the experiment window, you can track progress through the phases of preprocessing, candidate definitions, feature engineering, model tuning, explainability, and insights.
Once the SageMaker Autopilot job is complete, you can access a report that shows the candidate models, each candidate's status, the objective value, F1 score, and accuracy. Because auto deploy is enabled by default, SageMaker Autopilot automatically deploys the best model to an endpoint.
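
The steps above use the Studio UI, but the same experiment can also be launched programmatically. The following is a minimal boto3 sketch; the role ARN and output bucket are placeholders you would replace with values from your own account.

import boto3

sm = boto3.client('sagemaker')

# Launch an Autopilot job equivalent to the UI configuration above.
sm.create_auto_ml_job(
    AutoMLJobName='autopilot-experiment',
    InputDataConfig=[{
        'DataSource': {'S3DataSource': {
            'S3DataType': 'S3Prefix',
            'S3Uri': 's3://sagemaker-sample-files/datasets/tabular/uci_bank_marketing/bank-additional-full.csv',
        }},
        'TargetAttributeName': 'y',  # same target column selected in the UI
    }],
    OutputDataConfig={'S3OutputPath': 's3://<your-output-bucket>/autopilot-output/'},  # placeholder
    RoleArn='arn:aws:iam::<account-id>:role/<sagemaker-execution-role>',  # placeholder
)

# The job moves through the same phases shown in the Studio experiment window.
print(sm.describe_auto_ml_job(AutoMLJobName='autopilot-experiment')['AutoMLJobStatus'])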

Step 3: Interpret model performance

Now that the experiment is complete and you have a model, the next step is to interpret its performance using the analysis that SageMaker Autopilot generates.

Open the top-ranking model to get more details on its performance and metadata. In the list of models, right-click the first one to bring up the model options, and then choose Open in model details to review the model's performance statistics.

In the new window, click on Explainability. The first view you see is called Feature Importance and represents the aggregated SHAP value for each feature across each instance in the dataset. The feature importance score is an important part of model explainability because it shows what features tend to influence the predictions the most in the dataset. In this use case, the customer duration or tenure and employment variation rate are the top two fields for driving the model's outcome.
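
To make that aggregation concrete, the sketch below computes SHAP values for a toy scikit-learn model and averages their absolute values per feature, which is one common way to turn per-instance SHAP values into a feature importance score. It uses the open-source shap package on made-up data; the model, data, and version-handling line are illustrative assumptions, not the Autopilot experiment's artifacts.

import numpy as np
import shap  # open-source package, not part of this tutorial's setup
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Fit a toy model on synthetic data.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# SHAP values measure each feature's contribution to each prediction.
sv = shap.TreeExplainer(model).shap_values(X)
sv = sv[1] if isinstance(sv, list) else sv[..., 1]  # positive class; return shape varies by shap version

# Aggregate into a per-feature importance score across all instances.
importance = np.abs(sv).mean(axis=0)
print(importance)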

Now, click on the tab Performance. You will find detailed information on the model’s performance, including recall, precision, and accuracy. You can also interpret model performance and decide if additional model tuning is needed.

Next, visualizations are provided to further illustrate model performance. First, look at the confusion matrix. The confusion matrix is commonly used to understand how the model labels are divided among the predicted and true classes. In this case, the diagonal elements show the number of correctly predicted labels and the off-diagonal elements show the misclassified records. A confusion matrix is useful for analyzing misclassifications due to false positives and false negatives.
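
For example, here is how the diagonal and off-diagonal cells read on a tiny set of made-up binary labels, using scikit-learn (illustrative only; these are not outputs from the experiment):

from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up true and predicted labels for a tiny binary problem.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))
# [[3 1]   diagonal: correctly predicted labels
#  [1 3]]  off-diagonal: false positives (top right) and false negatives (bottom left)

print(precision_score(y_true, y_pred))  # 0.75
print(recall_score(y_true, y_pred))     # 0.75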

Next, look at the precision versus recall curve. This curve shows the trade-off between precision and recall as the probability threshold used to convert predicted scores into labels is varied. SageMaker Autopilot automatically balances these two metrics to provide the best model.

Next, look at the curve labeled Receiver Operating Characteristic (ROC). This curve shows the relationship between the true positive rate and the false positive rate over a range of probability thresholds. A diagonal line represents a hypothetical model based on random guessing. The more the curve pulls toward the upper left of the chart, the better the model performs.

The dashed line represents a model with no predictive value, which is often called the null model. The null model would randomly assign a 0/1 label, and its area under the ROC curve would be 0.5, meaning its ranking of positive over negative examples is no better than chance.
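
You can verify this baseline numerically. In the sketch below (made-up data, not output from the experiment), purely random scores produce an AUC of roughly 0.5:

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=100_000)  # random 0/1 ground truth
null_scores = rng.random(100_000)          # the "null model": pure guessing

print(roc_auc_score(y_true, null_scores))  # approximately 0.5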

Next, click on the tab Artifacts. You can find the SageMaker Autopilot experiment’s supporting assets, including feature engineering code, input data locations, and explainability artifacts.

Finally, click the Network tab. You will find information about network isolation and container traffic encryption.

Step 4: Test the SageMaker model endpoint

Now that you have reviewed the model’s details, test the endpoint.

Click the + icon to open a new Python notebook. Select Python 3 as the kernel.

To know where to send the request, look up the model endpoint’s name. On the left pane, click the SageMaker Resources icon. In the SageMaker resources pane, select Endpoints. Click on the endpoint associated with the experiment name you created at the start of this tutorial. This will bring up the Endpoint Details window. Record the endpoint name and navigate back to the Python 3 notebook.
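
If you prefer to look the name up from code, the following is a minimal boto3 sketch. It assumes the endpoint name begins with the experiment name from Step 2.

import boto3

sm = boto3.client('sagemaker')
for ep in sm.list_endpoints(NameContains='autopilot-experiment')['Endpoints']:
    print(ep['EndpointName'], ep['EndpointStatus'])  # look for InService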

Copy and paste the following code snippet into a cell in the notebook, and press Shift+Enter to run the current cell. This code defines the ENDPOINT_NAME variable and runs inference against the endpoint. When the code completes, you will see a result consisting of the model label and the associated probability score.
import boto3
import json

# Define the endpoint's name. Replace this value with the endpoint name
# you recorded from the Endpoint Details window.
ENDPOINT_NAME = 'autopilot-experiment-6d00f17b55464fc49c45d74362f284ce'
runtime = boto3.client('runtime.sagemaker')

# Define a test payload to send to your endpoint.
payload = {
    "data": {
        "features": {
            "values": [45, "blue-collar", "married", "basic.9y", "unknown", "yes", "no", "telephone", "may", "mon", 461, 1, 999, 0, "nonexistent", 1.1, 93.994, -36.4, 4.857, 5191.0]
        }
    }
}

# Submit an API request and capture the response object.
response = runtime.invoke_endpoint(
    EndpointName=ENDPOINT_NAME,
    ContentType='application/json',
    Body=json.dumps(payload)
)

# Print the model endpoint's output.
print(response['Body'].read().decode())
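
To see the prediction change, you can send a different record through the same endpoint. The feature values below are illustrative examples following the same column order as the payload above; they reuse the client, payload, and endpoint name already defined.

# Swap in a second, illustrative record and invoke the endpoint again.
payload['data']['features']['values'] = [35, 'admin.', 'single', 'university.degree', 'no', 'yes', 'no', 'cellular', 'aug', 'fri', 210, 2, 999, 0, 'nonexistent', 1.4, 93.444, -36.1, 4.964, 5228.1]

response = runtime.invoke_endpoint(
    EndpointName=ENDPOINT_NAME,
    ContentType='application/json',
    Body=json.dumps(payload)
)
print(response['Body'].read().decode())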

Congratulations! You have learned how to use SageMaker Autopilot to automatically train and deploy a machine learning model.

Step 5: Clean up your AWS resources

It is a best practice to delete resources that you are no longer using so that you don't incur unintended charges.
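
Note that deleting the CloudFormation stack does not remove the model endpoint that SageMaker Autopilot deployed in Step 2, and a running endpoint continues to accrue charges. You can delete it from the SageMaker resources pane in Studio, or programmatically with a minimal sketch like the following, substituting the endpoint name you recorded in Step 4:

import boto3

sm = boto3.client('sagemaker')
endpoint_name = '<your-endpoint-name-from-step-4>'  # placeholder

# Look up the endpoint's configuration, then delete both resources.
config_name = sm.describe_endpoint(EndpointName=endpoint_name)['EndpointConfigName']
sm.delete_endpoint(EndpointName=endpoint_name)
sm.delete_endpoint_config(EndpointConfigName=config_name)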

If you ran the CloudFormation template in Step 1 to create a new SageMaker Studio domain, continue with the following steps to delete the domain, user, and the resources created by the CloudFormation template.

To open the CloudFormation console, enter CloudFormation into the AWS console search bar, and choose CloudFormation from the search results.

In the CloudFormation pane, choose Stacks. From the status dropdown list, select Active. Under Stack name, choose CFN-SM-IM-Lambda-catalog to open the stack details page.

On the CFN-SM-IM-Lambda-catalog stack details page, choose Delete to delete the stack along with the resources it created in Step 1.

Conclusion

Congratulations! You have now completed the Automatically Create Machine Learning Models tutorial.

You have successfully used SageMaker Autopilot to automatically build, train, and tune models, and then deploy the best candidate model to make predictions.


Next steps

Learn more about Amazon SageMaker Autopilot

Visit the webpage
Learn more »

Explore SageMaker Autopilot documentation

Learn how to get started with Amazon SageMaker Autopilot
Read more »

Find more hands-on tutorials

Find more hands-on tutorials to learn how to leverage ML
Get started »