Generate Machine Learning Predictions Without Writing Code
TUTORIAL
Overview
What you will accomplish
In this tutorial, you will:
- Import datasets
- Select the target variable for classification
- Inspect datasets visually
- Build an ML model with the SageMaker Canvas Quick Build feature
- Understand model features and metrics
- Generate and understand bulk and single predictions
Prerequisites
Before starting this tutorial, you will need:
- An AWS account: If you don't already have an account, follow the Setting Up Your AWS Environment getting started guide for a quick overview.
AWS experience
Beginner
Time to complete
20 minutes
Cost to complete
Requires
You must be logged into an AWS account.
Services used
Amazon SageMaker Canvas
Last updated
April 25, 2023
Implementation
In this tutorial, you will build an ML model that predicts the estimated time of arrival (ETA) of shipments, measured in days. You will use a dataset that contains complete shipping data for delivered products, including estimated time, shipping priority, carrier, and origin.
Step 1: Set up Amazon SageMaker Studio domain
This stack assumes that you already have a public VPC set up in your account. If you do not have a public VPC, see VPC with a single public subnet to learn how to create a public VPC.
Step 2: Log into SageMaker Canvas and upload dataset to Amazon S3 bucket
2.2 Choose Canvas under Getting started from the left pane and then choose US East (N. Virginia) from the Region dropdown list on the top right.
- On the SageMaker Canvas page, choose Open Canvas.
- The SageMaker Canvas Creating application screen will be displayed. The application will take a few minutes to load.
2.4 Download the shipping log and products datasets from the following links
Step 3: Set up SageMaker Canvas for automatic model building
Import data into SageMaker Canvas for visual inspection and model building.
3.1 Import the dataset into SageMaker Canvas
- On the SageMaker Canvas interface, choose Datasets on the left pane and then choose + Import.
- Select the Amazon S3 bucket named sagemaker-<your-Region>-<your-account-id> where you uploaded the datasets in the previous step. Select the shipping_logs.csv and product_descriptions.csv datasets by selecting the checkboxes to their left. Two new buttons appear at the bottom of your page: Preview all and Import data. Choose Preview all. This allows you to see a 100-row preview of the datasets. Select the arrows to view the preview.
- After you check the datasets, choose Import data to import them into SageMaker Canvas.
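For readers who want to check a file before uploading it, a similar row-limited preview can be sketched in plain Python. This is only an illustration of the idea, not how Canvas implements Preview all. The sample CSV text and the Carrier column name are hypothetical stand-ins for the real shipping_logs.csv.

```python
import csv
import io
from itertools import islice


def preview_rows(csv_file, limit=100):
    """Return up to `limit` data rows from an open CSV file as dictionaries."""
    reader = csv.DictReader(csv_file)
    return list(islice(reader, limit))


# Hypothetical stand-in for shipping_logs.csv; the real file has more columns.
sample_csv = io.StringIO(
    "OrderID,ProductId,Carrier\n"
    "1,P-100,CarrierA\n"
    "2,P-200,CarrierB\n"
    "3,P-100,CarrierA\n"
)

# Read only the first two rows, the way a preview reads only the first 100.
preview = preview_rows(sample_csv, limit=2)
for row in preview:
    print(row)
```

With a real file, you would pass `open("shipping_logs.csv", newline="")` instead of the in-memory sample.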
3.3 On the Join Datasets page, drag the two datasets from the left panel onto the right pane. Select the join icon between the two datasets. A pop-up showing details about the join will appear. Make sure that the join type is Inner and the joining column is ProductId. Choose Save & close and then choose Import data.
On the Import data dialog box, enter the name ConsolidatedShippingData in the Import dataset name field and choose Import data.
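To make the effect of the Inner join type concrete, the following minimal sketch joins two small tables on ProductId in plain Python. The sample rows and the Carrier and Category column names are hypothetical. Only ProductId and OrderID are named in this tutorial. Rows without a matching ProductId on the other side are dropped, which is what distinguishes an inner join from the other join types.

```python
def inner_join(left_rows, right_rows, key):
    """Inner-join two lists of dictionaries on a shared key column."""
    # Index the right-hand rows by key so each lookup is constant time.
    right_index = {}
    for row in right_rows:
        right_index.setdefault(row[key], []).append(row)

    joined = []
    for left in left_rows:
        # Rows whose key has no match on the other side are dropped.
        for right in right_index.get(left[key], []):
            joined.append({**left, **right})
    return joined


# Hypothetical sample rows; not taken from the tutorial datasets.
shipping_logs = [
    {"OrderID": "1", "ProductId": "P-100", "Carrier": "CarrierA"},
    {"OrderID": "2", "ProductId": "P-200", "Carrier": "CarrierB"},
    {"OrderID": "3", "ProductId": "P-999", "Carrier": "CarrierA"},
]
product_descriptions = [
    {"ProductId": "P-100", "Category": "Books"},
    {"ProductId": "P-200", "Category": "Toys"},
    {"ProductId": "P-300", "Category": "Games"},
]

consolidated = inner_join(shipping_logs, product_descriptions, "ProductId")
for row in consolidated:
    print(row)
```

In this sample, order 3 (P-999) and product P-300 have no partner, so neither appears in the joined result.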
Step 4: Build, train, and analyze an ML model
4.2 The model view page consists of four tabs which represent the steps involved in building a model and getting predictions. The tabs are:
- Select – Set up the input data.
- Build – Build the ML model.
- Analyze – Analyze the model output and features.
- Predict – Run predictions in bulk or on a single sample.
On the Select tab, choose the radio button for the ConsolidatedShippingData dataset that you created in the previous step. This dataset contains 16 columns and 10,000 rows, and the tab also shows a high-level description of its shape and size. Choose Select dataset.
4.5 You'll notice SageMaker Canvas provides dataset statistics, including missing and mismatched values, unique values, and mean and median values for each of the columns in the dataset. You can use these statistics to drop some of the columns. If you do not want to use a particular column for prediction, you can clear (deselect) it from the left checkbox.
- Notice that the correlation of XShippingDistance and YShippingDistance columns with the target is negligible.
- Features with negligible correlation with the target are not informative enough for the prediction task at hand, so you can drop the XShippingDistance and YShippingDistance columns. You can also drop the ProductID and OrderID columns because they are primary keys and are not expected to contain any valuable information. To drop a column, deselect its checkbox.
- You can select the vertical bars icon to inspect the distributions of the columns. This is useful in highlighting imbalances and potential bias in the data.
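The column statistics and the correlation check described above can be approximated in plain Python. This sketch assumes the Pearson correlation coefficient, and Canvas may compute its correlation measure differently. All values are made up for illustration and are not taken from the tutorial dataset.

```python
import math
import statistics


def column_summary(values):
    """Summarize a numeric column in which missing entries are None."""
    present = [v for v in values if v is not None]
    return {
        "missing": len(values) - len(present),
        "unique": len(set(present)),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values standing in for three columns of the dataset.
expected_days = [2, 3, 5, 7, 8]
x_distance = [10, 2, 7, 3, 9]
actual_days = [3, 4, 5, 8, 9]  # stand-in for the target column

print(column_summary([2, 3, None, 7, 3]))
print(round(pearson(expected_days, actual_days), 3))
print(round(pearson(x_distance, actual_days), 3))
```

A coefficient near 1 or -1 suggests that a column carries information about the target, whereas a coefficient near 0 marks it as a candidate for dropping.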
4.6 After you complete data exploration, you can train a model. In SageMaker Canvas, there are two methods for training: Quick build and Standard build. Quick build usually takes 2-15 minutes to build the model, whereas Standard build usually takes 2-4 hours and generally achieves higher accuracy. Quick build trains fewer combinations of models and hyperparameters to prioritize speed. This makes it a good fit for cases like this tutorial, where the goal is to quickly develop a prototype model for the chosen use case.
For purposes of this tutorial, choose Quick build to begin model building. A popup window will appear about validating your data. Choose Start quick build to begin model training. This process takes less than 5 minutes to complete.
4.9 On the Scoring tab, you can see a plot showing the best-fit regression line for ActualShippingDays. On average, the model's prediction differs from the actual value of ActualShippingDays by +/- 1.148. The Scoring section for numeric prediction shows a line that indicates the model's predicted value in relation to the data used to make predictions. The width of the purple band around the line indicates the RMSE (root mean square error) range, and the predicted values often fall within that range. For a deeper understanding of the model performance, choose the Advanced metrics link on the right to display the Advanced metrics page.
- The various metrics shown on the Advanced metrics page are R2, mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean square error (RMSE). The Advanced metrics page also shows plots for visual inspection of the model performance. One image shows a graph of the residuals or errors. The horizontal line indicates an error of 0 or a perfect prediction. The blue dots are the errors. Their distance from the horizontal line indicates the magnitude of the errors.
- Scrolling down on the Advanced metrics page, you can see an error density plot which shows the distribution of the errors and their spread with respect to MAE and RMSE of the model. An error density with a shape similar to a normal distribution is indicative of good model performance.
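The four metrics listed on the Advanced metrics page follow standard textbook definitions, which can be sketched as shown here. The numbers are hypothetical, and Canvas may report some metrics in a slightly different form (for example, MAPE as a fraction rather than a percentage).

```python
import math


def regression_metrics(actual, predicted):
    """Compute R2, MAE, MAPE (in percent), and RMSE for paired values."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]  # residuals
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # MAPE is undefined when an actual value is zero; this sketch assumes none are.
    mape = 100 * sum(abs(e / a) for a, e in zip(actual, errors)) / n
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot
    return {"R2": r2, "MAE": mae, "MAPE": mape, "RMSE": rmse}


# Hypothetical values standing in for ActualShippingDays and the model output.
actual_days = [2.0, 4.0, 5.0, 8.0]
predicted_days = [3.0, 4.0, 4.0, 10.0]

metrics = regression_metrics(actual_days, predicted_days)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

RMSE penalizes large errors more heavily than MAE does, so a gap between the two hints at a few large misses rather than uniformly small ones.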
Step 5: Generate model predictions
Now that you have a regression model, you can either use the model to run predictions, or you can create a new version of this model to train with the Standard build process. In this step, you use SageMaker Canvas to generate predictions, both single and in bulk, from a dataset.
5.1 To start generating predictions, choose the Predict button at the bottom of the Analyze page, or choose the Predict tab.
- On the Predict page, Batch prediction is already selected. Choose Select dataset and then select the ConsolidatedShippingData dataset. In actual ML workflows, this dataset should be separate from the training dataset. However, for simplicity, you use the same dataset to demonstrate how SageMaker Canvas generates predictions. Choose Generate predictions.
- After a few seconds, the prediction is done. Hover over the predictions dataset name or status, choose the options icon, and select Preview to see a preview of the predictions. You can also choose Download to download a CSV file containing the full output. SageMaker Canvas returns a prediction for each row of data. In this tutorial, the feature with the highest importance is ExpectedShippingDays, so it is presented beside the predictions for a visual comparison.
- On the Predict page, you can generate predictions for a single sample by selecting Single prediction. SageMaker Canvas presents an interface in which you can manually enter values for each of the input variables used in the model. This type of analysis is ideal for what-if scenarios where you want to know how the prediction changes when one or more variables increase or decrease in value. With the prediction of the single set of column values, SageMaker Canvas provides individual feature importance. This indicates the columns with the highest influence towards the current sample prediction.
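The what-if analysis described above amounts to changing one input at a time and comparing the resulting predictions. The sketch below shows that pattern with a hypothetical stand-in function in place of the trained model. It is not the Canvas model, the delay offsets are made up, and ShippingPriority is an assumed column name.

```python
def predict_shipping_days(features):
    """Hypothetical stand-in for a trained model; NOT the Canvas model."""
    priority_delay = {"Express": 0.0, "Standard": 1.5}  # made-up offsets
    return features["ExpectedShippingDays"] + priority_delay[features["ShippingPriority"]]


def what_if(base_features, column, candidate_values):
    """Return the prediction for each candidate value of one column."""
    results = {}
    for value in candidate_values:
        # Copy the base sample and change a single input.
        scenario = {**base_features, column: value}
        results[value] = predict_shipping_days(scenario)
    return results


# ExpectedShippingDays is named in the tutorial; ShippingPriority is assumed.
base = {"ExpectedShippingDays": 5.0, "ShippingPriority": "Standard"}
scenarios = what_if(base, "ExpectedShippingDays", [3.0, 5.0, 7.0])
for value, prediction in scenarios.items():
    print(f"ExpectedShippingDays={value} -> predicted days={prediction}")
```

Holding every other input fixed isolates the effect of the column being varied, which is the same comparison you make manually in the Single prediction interface.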
To train a new version of the model with Standard build, start by giving the model a name such as ShippingForecastStandardModel. Then, on the Build tab, choose Standard build instead of Quick build and proceed through the remaining steps. Standard build mode also lets you share the trained model with data scientists through SageMaker Studio, which supports collaboration, quick model refinement, and iteration. To share the model, choose Share in the top right, select the user you want to share the model with in the popup, and then choose Share.
Step 6: Clean up your AWS resources
It is a best practice to delete resources that you are no longer using so that you don't incur unintended charges.
6.1 Navigate to the S3 console and choose Buckets. Navigate to your bucket named sagemaker-<your-Region>-<your-account-id> and select the check box to the left of all of the files and folders. Next, choose Delete.
- On the Delete objects page, verify that you have selected the proper objects to delete. In the Permanently delete objects section, confirm by entering permanently delete in the text field and choose Delete objects. After the deletion completes and the bucket is empty, you can delete the S3 bucket itself by following the same process. A success banner appears after deletion is complete.
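If you prefer to script this cleanup, the same two-stage process (empty the bucket, then delete it) can be sketched with the AWS SDK for Python. The function below is a hedged sketch that takes an S3 client as a parameter. The function name is hypothetical, and the sketch assumes an unversioned bucket, because object versions and delete markers would need separate handling.

```python
def empty_and_delete_bucket(s3_client, bucket_name):
    """Delete every object in a bucket, then the bucket itself.

    `s3_client` is expected to behave like `boto3.client("s3")`.
    Returns the number of objects deleted.
    """
    deleted = 0
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket_name):
        # A page from an empty bucket has no "Contents" key.
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            # Each page holds at most 1,000 keys, the per-request delete limit.
            s3_client.delete_objects(Bucket=bucket_name, Delete={"Objects": keys})
            deleted += len(keys)
    # A bucket can only be deleted once it is empty.
    s3_client.delete_bucket(Bucket=bucket_name)
    return deleted


# Example call (requires boto3 and AWS credentials; deletion is permanent):
#   import boto3
#   empty_and_delete_bucket(boto3.client("s3"), "sagemaker-<your-Region>-<your-account-id>")
```

As in the console, the deletion cannot be undone, so double-check the bucket name before running anything like this.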
Conclusion
You have successfully used Amazon SageMaker Canvas to import and prepare a dataset for ML from Amazon S3, select the target variable, build an ML model using the Quick build mode, and generate predictions through the visual interface.
Next steps
Train a machine learning model
Label training data for machine learning
Find more hands-on tutorials