Why is my AWS Glue crawler not adding new partitions to the table?

My AWS Glue crawler doesn't add new partitions to the table.

Short description

When the crawler scans the source data files under a new partition, the crawler compares the following attributes of the source files with those of the existing table:

  • File format
  • Compression type
  • Schema
  • Structure of Amazon Simple Storage Service (Amazon S3) partitions

If any of these attributes differ between the partition and the table, then the crawler skips the partition and doesn't add it to the table metadata. A difference in the name, sequence, or number of partition keys in the Amazon S3 path is treated as a change in the partition schema or structure.
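To make the partition-structure check concrete, the following Python sketch extracts Hive-style partition keys (`key=value` folders) from a path and reports whether their name, sequence, or number differs from a table's keys. It is only an illustration of the comparison described above, not the crawler's actual implementation. The function names and the assumed table keys `year`, `month`, and `day` are hypothetical.

```python
from typing import List


def partition_keys_from_path(path: str) -> List[str]:
    """Return the partition key names (the part before '=') in path order."""
    keys = []
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            keys.append(segment.split("=", 1)[0])
    return keys


def partition_mismatch(path: str, table_keys: List[str]) -> str:
    """Describe how the path's keys differ from the table's, or '' if they match."""
    found = partition_keys_from_path(path)
    if found == table_keys:
        return ""
    if len(found) != len(table_keys):
        return f"number differs: path has {len(found)}, table has {len(table_keys)}"
    if sorted(found) == sorted(table_keys):
        return f"sequence differs: path has {found}, table has {table_keys}"
    return f"name differs: path has {found}, table has {table_keys}"


# Assumed table partition keys; replace with the keys defined on your table.
table_keys = ["year", "month", "day"]
for path in (
    "doc-example-path/doc-example-table/year=2021/month=01/day=05/",
    "doc-example-path/doc-example-table/year=2021/month=01/sday=05/",
):
    print(path, "->", partition_mismatch(path, table_keys) or "matches")
```

The sketch handles only `key=value` folder names. Paths that use unnamed partition folders (for example, `2021/01/05/`) would need a different comparison.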

Resolution

Troubleshoot the issue

Check the crawler logs to identify the issue:

  1. Open the AWS Glue console.
  2. In the navigation pane, choose Crawlers.
  3. Select the crawler, and then choose the Logs link to view the logs on the CloudWatch console.
  4. Review the logs to check if the crawler skipped the new partition.

For example, suppose that the log includes entries that look similar to the following:

Folder partition keys do not match table partition keys, skipped folder: doc-example-bucket/doc-example-path/doc-example-table/year=2021/month=01/sday=05/

This entry suggests that the partition structure for the Amazon S3 location doesn't match the partition keys defined for the table. This might happen when the partition structure isn't consistent across the table source location.
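If the log is long, a small script can pull out the skipped folders. The following Python sketch assumes that you have exported the crawler's log lines as plain text and that the message text matches the sample entry above. Other crawler messages are not handled, and the function name is hypothetical.

```python
import re
from typing import Iterable, List

# Message text copied from the sample log entry; the folder path follows the colon.
SKIPPED_FOLDER = re.compile(
    r"Folder partition keys do not match table partition keys, "
    r"skipped folder:\s*(\S+)"
)


def skipped_folders(log_lines: Iterable[str]) -> List[str]:
    """Return the folder paths named in 'skipped folder' log messages."""
    folders = []
    for line in log_lines:
        match = SKIPPED_FOLDER.search(line)
        if match:
            folders.append(match.group(1))
    return folders


sample = [
    "INFO : an unrelated log line",  # hypothetical filler line
    "Folder partition keys do not match table partition keys, skipped folder: "
    "doc-example-bucket/doc-example-path/doc-example-table/year=2021/month=01/sday=05/",
]
for folder in skipped_folders(sample):
    print(folder)
```

Each folder that the script prints is a candidate for the partition-structure comparison against the table's partition keys.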

If the AWS Glue crawler creates multiple tables, then the log entries look similar to the following:

INFO : Created table doc-example-table in database doxtest_db

If you see similar logs, then compare the schema and partition structure of the location of these tables with those of the original table.

Resolve the issue

Based on the information from the CloudWatch logs, consider one or more of the following solution options:

  • If the issue is caused by inconsistent partition structure, then make the structure consistent by renaming the S3 path manually or programmatically.
  • If the partition is skipped due to a mismatch in file format, compression format, or schema, and the data doesn't need to be included in the intended table, then consider the following:
      • Use an exclude pattern to skip any unwanted files.
      • Move the unwanted files to a different location.
  • If your data has different schemas in some input files and similar schemas in other input files, then combine compatible schemas when you create the crawler. On the Configure the crawler's output page, under Grouping behavior for S3 data (optional), select Create a single schema for each S3 path. When this setting is turned on and the data is compatible, then the crawler ignores the similarity of specific schemas when evaluating S3 objects in the specified include path. For more information, see How to create a single schema for each Amazon S3 include path.
  • If the crawler is creating multiple tables, then see How can I prevent the AWS Glue crawler from creating multiple tables?
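The exclude pattern and the single-schema grouping option can also be set outside the console. The following Python sketch builds the arguments for the AWS Glue `UpdateCrawler` API operation: an `Exclusions` list on the S3 target, and a `Configuration` JSON string whose `TableGroupingPolicy` is `CombineCompatibleSchemas`, the value that corresponds to the Create a single schema for each S3 path setting. The crawler name, the S3 path, and the `unwanted/**` glob are hypothetical placeholders, and the sketch doesn't call AWS itself.

```python
import json
from typing import Dict, List


def build_crawler_update(name: str, s3_path: str, exclusions: List[str]) -> Dict:
    """Build keyword arguments for an update_crawler call (no AWS call is made)."""
    configuration = {
        "Version": 1.0,
        # Console option: "Create a single schema for each S3 path"
        "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
    }
    return {
        "Name": name,
        # Exclude patterns are globs evaluated relative to the include path.
        "Targets": {"S3Targets": [{"Path": s3_path, "Exclusions": exclusions}]},
        "Configuration": json.dumps(configuration),
    }


kwargs = build_crawler_update(
    name="doc-example-crawler",  # hypothetical crawler name
    s3_path="s3://doc-example-bucket/doc-example-path/doc-example-table/",
    exclusions=["unwanted/**"],  # hypothetical glob for files to skip
)
print(json.dumps(kwargs, indent=2))
```

With boto3 installed and credentials configured, the result could be passed as `boto3.client("glue").update_crawler(**kwargs)`. Because `Targets` replaces the crawler's existing targets, list every target that the crawler should keep, and try the change on a non-production crawler first.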

AWS OFFICIAL
Updated 3 years ago