Amazon SageMaker Data Wrangler

The fastest and easiest way to prepare tabular and image data for machine learning

Why SageMaker Data Wrangler?

Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare tabular and image data for ML from weeks to minutes. With SageMaker Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow (including data selection, cleansing, exploration, visualization, and processing at scale) from a single visual interface. You can use SQL to select the data that you want from various data sources and import it quickly. Next, you can use the data quality and insights report to automatically verify data quality and detect anomalies, such as duplicate rows and target leakage. SageMaker Data Wrangler contains over 300 built-in data transformations, so you can quickly transform data without writing any code.

Amazon SageMaker Data Wrangler Overview

Benefits of SageMaker Data Wrangler

Select data, understand data insights, and transform data to prepare it for machine learning (ML) in minutes.
Quickly estimate ML model accuracy and diagnose issues before models are deployed into production.
Take data preparation to production faster without the need to author PySpark code, install Apache Spark, or spin up clusters.

How it works

Access, select, and query data faster

With the SageMaker Data Wrangler data selection tool, you can quickly access and select your tabular and image data from various popular sources (such as Amazon Simple Storage Service [Amazon S3], Amazon Athena, Amazon Redshift, AWS Lake Formation, Snowflake, and Databricks) and over 50 other third-party sources (such as Salesforce, SAP, Facebook Ads, and Google Analytics). You can also write queries for data sources using SQL and import data directly into SageMaker from various file formats, such as CSV, Parquet, and JSON, and database tables.
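As a rough illustration of selecting data with SQL before importing it, the sketch below uses Python's built-in sqlite3 module as a local stand-in for a source such as Amazon Athena or Amazon Redshift. The table name, column names, and values are hypothetical, and this is not how SageMaker Data Wrangler itself connects to data sources.

```python
import sqlite3

def select_training_rows(conn):
    """Return only the columns and rows wanted for ML, using plain SQL."""
    query = """
        SELECT customer_id, age, plan, churned
        FROM customers
        WHERE age IS NOT NULL
        ORDER BY customer_id
    """
    return conn.execute(query).fetchall()

# Hypothetical in-memory table standing in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers "
    "(customer_id INTEGER, age INTEGER, plan TEXT, notes TEXT, churned INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?)",
    [
        (1, 34, "basic", "n/a", 0),
        (2, None, "pro", "n/a", 1),
        (3, 51, "pro", "n/a", 0),
    ],
)

# The query drops the unused "notes" column and the row with a missing age.
rows = select_training_rows(conn)
print(rows)
```

Filtering in the query keeps unneeded columns and rows out of the import, so less data has to be moved and processed later.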

Generate data insights and understand data quality

SageMaker Data Wrangler provides a data quality and insights report that automatically verifies data quality (such as missing values, duplicate rows, and data types) and helps detect anomalies (such as outliers, class imbalance, and data leakage) in your data. Once you can effectively verify data quality, you can quickly apply domain knowledge to process datasets for ML model training.
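To make the kinds of checks named above concrete, here is a minimal sketch in plain Python that counts missing values, duplicate rows, and class balance for a small dataset. The function name, columns, and sample rows are hypothetical, and the report in SageMaker Data Wrangler is generated for you and covers more than this.

```python
from collections import Counter

def quality_summary(rows, target):
    """Summarize missing values, duplicate rows, and class balance.

    rows is a list of dicts that all share the same keys.
    """
    columns = list(rows[0].keys())
    # Count None entries per column.
    missing = {c: sum(1 for r in rows if r[c] is None) for c in columns}
    # A row counts as a duplicate if an identical row appeared earlier.
    seen = Counter(tuple(r[c] for c in columns) for r in rows)
    duplicates = sum(count - 1 for count in seen.values())
    # Label distribution, used to spot class imbalance.
    class_counts = Counter(r[target] for r in rows if r[target] is not None)
    return {
        "missing": missing,
        "duplicate_rows": duplicates,
        "class_counts": dict(class_counts),
    }

# Hypothetical sample data.
rows = [
    {"age": 34, "plan": "basic", "churned": 0},
    {"age": None, "plan": "pro", "churned": 1},
    {"age": 34, "plan": "basic", "churned": 0},
    {"age": 51, "plan": "pro", "churned": 0},
]
summary = quality_summary(rows, target="churned")
print(summary)
```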

Understand your data with visualizations

SageMaker Data Wrangler helps you understand your data and identify potential errors and extreme values with a set of robust preconfigured visualization templates. Histograms, scatter plots, box and whisker plots, line plots, and bar charts are all built in. More advanced ML-specific visualizations (such as bias reports, feature correlation, multicollinearity, target leakage, and time series) are also available to show feature importance and feature correlations. You can access these tools from the Analysis tab.

Transform data more efficiently

SageMaker Data Wrangler offers a selection of over 300 prebuilt, PySpark-based data transformations, so you can transform your data and scale your data preparation workflow without writing a single line of code. Preconfigured transformations cover common use cases such as flattening JSON files, deleting duplicate rows, imputing missing data with the mean or median, and one-hot encoding, and they include time-series–specific transformers to accelerate the preparation of time-series data for ML. For your image data, SageMaker Data Wrangler offers common image augmentations (such as Blur, Enhance, and Resize) and cleaning operations (such as dropping corrupted images and duplicates). You can also author custom transformations in PySpark, SQL, and Pandas. SageMaker Data Wrangler offers image libraries (imgaug and OpenCV) for creating custom transforms for computer vision (CV) use cases and a rich library of code snippets to streamline custom transformation authoring.
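For readers unfamiliar with the transformations mentioned above, the following sketch shows what three of them do (deleting duplicate rows, imputing missing values with the median, and one-hot encoding) using only the Python standard library. The function names and sample data are hypothetical. The built-in transformations are PySpark-based and need no code, so this only illustrates the underlying operations.

```python
from statistics import median

def drop_duplicates(rows):
    """Keep the first occurrence of each identical row, preserving order."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def impute_median(rows, column):
    """Replace missing (None) values in a numeric column with the column median."""
    fill = median(r[column] for r in rows if r[column] is not None)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

def one_hot(rows, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = int(r[column] == cat)
        encoded.append(new_row)
    return encoded

# Hypothetical sample data with one duplicate row and one missing age.
rows = [
    {"age": 34, "plan": "basic"},
    {"age": None, "plan": "pro"},
    {"age": 34, "plan": "basic"},
    {"age": 51, "plan": "pro"},
]
prepared = one_hot(impute_median(drop_duplicates(rows), "age"), "plan")
for row in prepared:
    print(row)
```

Duplicates are removed before imputation here so that repeated rows do not skew the median. Whether that order is right depends on the dataset.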

Understand the predictive power of your data

The SageMaker Data Wrangler Quick Model feature provides an estimate of the expected predictive power of your data. Quick Model automatically splits your data into training and testing datasets and trains an XGBoost model with default hyperparameters on the training data. Based on the task that you are solving (for example, classification or regression), SageMaker Data Wrangler provides a model summary, feature summary, and confusion matrix, which help you quickly iterate on your data preparation flows.
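The split, train, and evaluate loop described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: a simple one-feature threshold rule stands in for XGBoost, and the function names, split ratio, and sample data are hypothetical rather than what Quick Model uses internally.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split them into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_threshold(train, feature, label):
    """Stand-in model (not XGBoost): midpoint between the two class means.

    Assumes both classes appear in train and that the positive class
    has the higher feature values.
    """
    pos = [r[feature] for r in train if r[label] == 1]
    neg = [r[feature] for r in train if r[label] == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def confusion_matrix(test, feature, label, threshold):
    """Count true/false positives and negatives on the held-out rows."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for r in test:
        predicted = int(r[feature] > threshold)
        if predicted == 1:
            counts["tp" if r[label] == 1 else "fp"] += 1
        else:
            counts["fn" if r[label] == 1 else "tn"] += 1
    return counts

# Hypothetical data: churn is more likely at higher usage.
rows = [{"usage": u, "churned": int(u > 50)} for u in (10, 20, 30, 40, 60, 70, 80, 90)]
train, test = train_test_split(rows)
threshold = fit_threshold(train, "usage", "churned")
matrix = confusion_matrix(test, "usage", "churned", threshold)
print(matrix)
```

Evaluating on rows the model never saw during training is what makes the confusion matrix a fair, if rough, estimate of predictive power.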

Automate and deploy ML data preparation workflows

With the SageMaker Data Wrangler UI, you can scale to large datasets without the need to author PySpark code, install Apache Spark, or spin up clusters. You can launch or schedule a job to quickly process your data or export it to a SageMaker Studio notebook. SageMaker Data Wrangler offers several export options, including SageMaker Data Wrangler jobs, SageMaker Feature Store, and SageMaker Pipelines, so you can integrate your data preparation flow into your ML workflow. Alternatively, you can deploy your data preparation workflow to a SageMaker hosted endpoint. Finally, you can export data directly to SageMaker Canvas to train ML models using a visual interface.

Customers

Invista
"At INVISTA, we are driven by transformation and look to develop products and technologies that benefit customers around the globe. We see ML as a way to improve the customer experience. But, with datasets that span hundreds of millions of rows, we needed a solution to help us prepare data, and develop, deploy, and manage ML models at scale. With Amazon SageMaker Data Wrangler, we can now interactively select, clean, explore, and understand our data effectively, empowering our data science team to create feature engineering pipelines that can scale effortlessly to datasets that span hundreds of millions of rows. With Amazon SageMaker Data Wrangler, we can operationalize our ML workflows faster."

Caleb Wilkinson, Former Lead Data Scientist, INVISTA

3M
"Using ML, 3M is improving tried-and-tested products, like sandpaper, and driving innovation in several other spaces, including healthcare. As we plan to scale ML to more areas of 3M, we see the amount of data and models growing rapidly—doubling every year. We are enthusiastic about the new SageMaker features because they will help us scale. Amazon SageMaker Data Wrangler makes it much easier to prepare data for model training, and Amazon SageMaker Feature Store will eliminate the need to create the same model features over and over. Finally, Amazon SageMaker Pipelines will help us automate data prep, model building, and model deployment into an end-to-end workflow so we can speed time to market for our models. Our researchers are looking forward to taking advantage of the new speed of science at 3M."

David Frazee, Former Technical Director, 3M Corporate Systems Research Lab

Deloitte
"Amazon SageMaker Data Wrangler enables us to hit the ground running to address our data preparation needs with a rich collection of transformation tools that accelerate the process of ML data preparation needed to take new products to market. In turn, our clients benefit from the rate at which we scale deployed models, enabling us to deliver measurable, sustainable results that meet the needs of our clients in a matter of days rather than months."

Frank Farrall, Principal, AI Ecosystems and Platforms Leader, Deloitte

NRI
"As an AWS Premier Consulting Partner, our engineering teams are working very closely with AWS to build innovative solutions to help our customers continuously improve the efficiency of their operations. ML is the core of our innovative solutions, but our data preparation workflow involves sophisticated data preparation techniques which, as a result, take a significant amount of time to become operationalized in a production environment. With Amazon SageMaker Data Wrangler, our data scientists can complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization, which helps us accelerate the data preparation process and easily prepare our data for ML. With Amazon SageMaker Data Wrangler, we can prepare data for ML faster."

Shigekazu Ohmoto, Senior Corporate Managing Director, NRI Japan

equilibrium
"As our footprint in the population health management market continues to expand into more health payors, providers, pharmacy benefit managers, and other healthcare organizations, we needed a solution to automate end-to-end processes for data sources that feed our ML models, including claims data, enrollment data, and pharmacy data. With Amazon SageMaker Data Wrangler, we can now accelerate the time it takes to aggregate and prepare data for ML using a set of workflows that are easier to validate and reuse. This has dramatically improved the delivery time and quality of our models, increased the effectiveness of our data scientists, and reduced data preparation time by nearly 50%. In addition, SageMaker Data Wrangler has helped us save multiple ML iterations and significant GPU time, speeding the entire end-to-end process for our clients as we can now build data marts with thousands of features including pharmacy, diagnosis codes, ER visits, inpatient stays, as well as demographic and other social determinants. With SageMaker Data Wrangler, we can transform our data with superior efficiency for building training datasets, generate data insights on datasets prior to running ML models, and prepare real-world data for inference/predictions at scale."

Lucas Merrow, CEO, Equilibrium Point IoT

Get started with SageMaker Data Wrangler

Blogs

BLOG

Accelerate data preparation with data quality and insights in Amazon SageMaker Data Wrangler

BLOG

Amazon SageMaker Data Wrangler Supports SaaS Applications as Data Sources

BLOG

Prepare data from Databricks for machine learning using Amazon SageMaker Data Wrangler

BLOG

Prepare data with PySpark and Altair code snippets in Amazon SageMaker Data Wrangler

BLOG

Import data from cross-account Amazon Redshift to Amazon SageMaker Data Wrangler

BLOG

Use Amazon SageMaker Data Wrangler in Amazon SageMaker Studio with a default lifecycle configuration

Hands-on exercises

TUTORIAL

Step-by-step tutorial to get started with SageMaker Data Wrangler

WORKSHOPS

Explore how to use SageMaker Data Wrangler for use cases

Demo videos

VIDEO

re:Invent 2022: Accelerate data preparation with SageMaker Data Wrangler (56:45)

VIDEO

Quickly prepare data for ML using SageMaker Data Wrangler Virtual Workshop (1:18:08)

VIDEO

AWS On Air 2020: AWS What’s Next ft. SageMaker Data Wrangler (27:51)

VIDEO

SageMaker Data Wrangler Deep Dive Demo (28:13)
