How can I run an AWS Glue job on a specific partition in Amazon S3?

I want to run an AWS Glue job on a specific partition in an Amazon Simple Storage Service (Amazon S3) location.

Short description

To filter on partitions in the AWS Glue Data Catalog, use a pushdown predicate. Unlike Filter transforms, pushdown predicates let you filter on partitions without having to list and read all the files in your dataset.

Resolution

Create an AWS Glue job, and then specify the pushdown predicate when you create the DynamicFrame. In the following example, the job processes only the data in the s3://awsexamplebucket/product_category=Video partition:

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "testdata",
    table_name = "sampletable",
    transformation_ctx = "datasource0",
    push_down_predicate = "(product_category == 'Video')")
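This snippet assumes that glueContext is already defined by the standard setup at the top of an AWS Glue PySpark script, for example:

import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Standard AWS Glue job setup: create the Spark and Glue contexts
# and initialize the job with the arguments passed to the script.
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)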

In the following example, the pushdown predicate filters by date. The job processes only the data in the s3://awsexamplebucket/year=2019/month=08/day=02 partition:

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "testdata",
    table_name = "sampletable",
    transformation_ctx = "datasource0",
    push_down_predicate = "(year == '2019' and month == '08' and day == '02')")
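If the date changes between runs, one option is to pass the partition values as job parameters and build the predicate string at runtime. A minimal sketch, assuming hypothetical job parameters named year, month, and day:

# Read the partition values from the job parameters
# (passed as --year, --month, and --day when the job starts).
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'year', 'month', 'day'])
predicate = "(year == '{}' and month == '{}' and day == '{}')".format(
    args['year'], args['month'], args['day'])
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "testdata",
    table_name = "sampletable",
    transformation_ctx = "datasource0",
    push_down_predicate = predicate)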

In the following example, the pushdown predicate filters by date for non-Hive style partitions. Because the path doesn't use key=value naming, the AWS Glue crawler assigns default partition column names such as partition_0, partition_1, and partition_2. The job processes only the data in the s3://awsexamplebucket/2019/07/03 partition:

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "testdata",
    table_name = "sampletable",
    transformation_ctx = "datasource0",
    push_down_predicate = "(partition_0 == '2019' and partition_1 == '07' and partition_2 == '03')")
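To confirm that the predicate restricted the read to the expected partition, you can count the records in the resulting DynamicFrame or inspect a sample of the rows:

# Print the number of records read; with a working predicate this
# reflects only the files under the matching partition.
print(datasource0.count())

# Convert to a Spark DataFrame to preview a few rows.
datasource0.toDF().show(10)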
