Why does my AWS Glue job fail with the error "Temporary directory not specified" when I insert or extract data from Amazon Redshift?


My AWS Glue job fails with the error "Temporary directory not specified" when I insert or extract data from Amazon Redshift.

Short description

Here are a few things to remember when your AWS Glue job writes or reads data from Amazon Redshift:

  • Your AWS Glue job writes data into an Amazon Redshift cluster: The job initially writes the data into an Amazon Simple Storage Service (Amazon S3) bucket in CSV format. Then, the job issues a COPY command to Amazon Redshift.
  • Your AWS Glue job reads data from an Amazon Redshift cluster: The job first unloads the data into an Amazon S3 bucket in CSV format using the UNLOAD command. Then, the job loads the data into the DynamicFrame from these temporary bucket files.

You might get this error when either of the following conditions is true:

  • You are unloading the data from Amazon Redshift into the temporary S3 bucket.
  • You are loading the data from the S3 bucket into Amazon Redshift using the COPY command.

Resolution

The following are some of the common causes and solution options for this error.

Define a temporary directory

The most common cause of this error is a missing temporary S3 bucket, which the AWS Glue job uses as a staging directory. Be sure to define an S3 bucket as the temporary directory for your job. For more information on how to define a temporary bucket, see Special parameters used by AWS Glue.
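For example, you can pass the temporary directory as the --TempDir job argument when you start the job. The following is a minimal sketch using boto3; the job name and S3 path are placeholders:

import boto3

glue = boto3.client("glue")

# Pass the staging directory as the --TempDir job argument.
# The job name and S3 path below are placeholders.
glue.start_job_run(
    JobName="doc-example-job",
    Arguments={"--TempDir": "s3://doc-example-bucket/doc-example-folder/"},
)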

Verify the IAM role permissions

Verify that the AWS Glue IAM role has the permissions required to access the temporary S3 bucket. Also, be sure that you didn't block required permissions for the bucket in the following policies for the AWS Glue IAM role:

  • Bucket policy
  • S3 VPC endpoint policy
  • AWS Organizations policy
  • Service control policy

Some examples of required permissions are s3:ListBucket, s3:GetObject, and s3:PutObject.
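The following is a minimal example of an identity-based policy that grants these permissions on the temporary bucket; the bucket name is a placeholder:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::doc-example-bucket"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::doc-example-bucket/*"
    }
  ]
}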

Verify the name of the temporary directory

Be sure that the name of the S3 bucket that's used as the temporary directory doesn't contain a period (.). Otherwise, you might get the following exception:

Caused by: java.sql.SQLException: [Amazon](500310) Invalid operation: UNLOAD destination is not supported.

Verify AWS Key Management Service (AWS KMS) permissions

If you use customer managed keys from AWS Key Management Service (AWS KMS) to encrypt your data, then be sure to include extraunloadoptions in additional_options for the ETL statement in your AWS Glue script. For example:

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "database-name",
    table_name = "table-name",
    redshift_tmp_dir = args["TempDir"],
    additional_options = {"extraunloadoptions":"ENCRYPTED KMS_KEY_ID 'CMK key ID'"},
    transformation_ctx = "datasource0"
)

If you use AWS KMS to encrypt the S3 data and run into AWS KMS permission issues, then be sure of the following:

  • The AWS Glue IAM role has AWS KMS action permissions similar to the following.
  • You added the AWS Glue IAM role to the AWS KMS key policy.
"Action": [
    "kms:Encrypt",
    "kms:Decrypt",
    "kms:ReEncrypt*",
    "kms:GenerateDataKey*",
    "kms:DescribeKey"
    ]

Verify the IAM role permissions for an AWS Glue Python shell job

If you are trying to run the COPY or UNLOAD command from an AWS Glue Python shell job and can't load the credentials, then be sure of the following:

  • You added the AWS Glue IAM role in Amazon Redshift.
  • The AWS Glue job role includes Amazon Redshift and AWS Glue in its trust relationship policy. For example:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "glue.amazonaws.com",
          "redshift.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
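With this trust relationship in place, a Python shell job can run the COPY command through the Amazon Redshift Data API, for example. The following is a minimal sketch; the cluster, database, user, table, bucket, and role ARN are placeholders:

import boto3

client = boto3.client("redshift-data")

# Run a COPY command that loads the staged CSV files from Amazon S3.
# All identifiers below are placeholders; substitute your own values.
response = client.execute_statement(
    ClusterIdentifier="doc-example-cluster",
    Database="doc-example-database",
    DbUser="doc-example-user",
    Sql="COPY doc_example_table "
        "FROM 's3://doc-example-bucket/doc-example-folder/' "
        "IAM_ROLE 'arn:aws:iam::111122223333:role/doc-example-glue-role' "
        "FORMAT AS CSV;",
)
print(response["Id"])  # statement ID, which you can use to poll for the result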

Check for packet drop

If your queries fail to reach the Amazon Redshift cluster even though the connection is successful, the cause might be a maximum transmission unit (MTU) size mismatch in the network path between AWS Glue and Amazon Redshift. Try configuring the Amazon Redshift security groups to allow ICMP "destination unreachable" messages. For more information, see Queries appear to hang and sometimes fail to reach the cluster.
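For example, you might add an inbound rule for ICMP type 3 ("destination unreachable") similar to the following sketch; the security group ID and CIDR range are placeholders:

import boto3

ec2 = boto3.client("ec2")

# Allow inbound ICMP type 3 (destination unreachable).
# The security group ID and CIDR range below are placeholders.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpPermissions=[
        {
            "IpProtocol": "icmp",
            "FromPort": 3,  # ICMP type
            "ToPort": -1,   # all ICMP codes for this type
            "IpRanges": [{"CidrIp": "10.0.0.0/16"}],
        }
    ],
)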

Define the temporary directory in the AWS CloudFormation template

If you created your AWS Glue job using CloudFormation, then be sure that you provided the temporary directory location in the DefaultArguments parameter in your CloudFormation template. For example:

"DefaultArguments": { "--TempDir": "s3://doc-example-bucket/doc-example-folder"}

Define temporary directory in the DynamicFrame

If you get the error even after defining the temporary directory in your AWS Glue job, then check whether you read from or write to Amazon Redshift using AWS Glue's DynamicFrame methods. If you do, confirm the following:

You passed your temporary directory into the redshift_tmp_dir parameter of your DynamicFrame method:
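For example, a write to Amazon Redshift through a catalog connection might look like the following sketch; the connection, database, and table names are placeholders:

datasink0 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = datasource0,
    catalog_connection = "doc-example-redshift-connection",
    connection_options = {"dbtable": "doc_example_table", "database": "doc-example-database"},
    redshift_tmp_dir = args["TempDir"],
    transformation_ctx = "datasink0"
)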

You specified TempDir in the getResolvedOptions function for your ETL job. Use the following code to retrieve the job name and TempDir parameters:

import sys
from awsglue.utils import getResolvedOptions
args = getResolvedOptions(sys.argv,['JOB_NAME','TempDir'])
TempDir = args['TempDir']

For more information, see Accessing parameters using getResolvedOptions.


Related information

Moving data to and from Amazon Redshift
