How do I troubleshoot cluster or query performance issues in Amazon Redshift?

I want to troubleshoot or improve the query performance in my Amazon Redshift cluster.

Short description

If you're experiencing performance issues in your Amazon Redshift cluster, then complete the following tasks:

  • Monitor your cluster performance metrics.
  • Check the Amazon Redshift Advisor recommendations.
  • Review the query execution alerts and excessive disk usage.
  • Check for locking issues and long-running sessions or transactions.
  • Check your workload management (WLM) configuration.
  • Check your cluster node hardware maintenance and performance.

Resolution

Monitor your cluster performance metrics

Review your cluster performance metrics and graphs to help you find the root cause of your performance degradation. View performance data in the Amazon Redshift console to compare cluster performance over time.

An increase in metrics such as CPUUtilization, PercentageDiskSpaceUsed, and DatabaseConnections can indicate a higher workload and resource contention on your Amazon Redshift cluster. For more information, see Monitoring Amazon Redshift using Amazon CloudWatch metrics.

Check the workload execution breakdown in the Amazon Redshift console to review specific queries and runtimes. For example, if you see an increase in query planning time, then a query might be waiting for a lock.
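
If you suspect that lock waits are inflating planning time, a quick check of the STV_LOCKS table shows the table-level locks that are currently held and for how long. The following is a minimal sketch, not an exhaustive lock diagnosis:

SELECT table_id,
       lock_owner_pid,
       lock_status,
       DATEDIFF(s, lock_owner_start_ts, GETDATE()) AS lock_held_seconds
FROM stv_locks
ORDER BY lock_held_seconds DESC;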

Check the Amazon Redshift Advisor recommendations

Use Amazon Redshift Advisor recommendations to learn about potential improvements for your cluster. The recommendations are based on common usage patterns and Amazon Redshift best practices.

Review the query execution alerts and excessive disk usage

When a query runs, Amazon Redshift notes the query performance and indicates whether the query is running efficiently. If the query is identified as inefficient, then Amazon Redshift notes the query ID and provides recommendations for query performance improvement. These recommendations are logged in the STL_ALERT_EVENT_LOG internal system table.

If you experience a slow or inefficient query, then check the STL_ALERT_EVENT_LOG entries. To retrieve information from the STL_ALERT_EVENT_LOG table, run the following query:

SELECT TRIM(s.perm_table_name) AS TABLE
    , (SUM(ABS(DATEDIFF(SECONDS, COALESCE(b.starttime, d.starttime, s.starttime), CASE
            WHEN COALESCE(b.endtime, d.endtime, s.endtime) > COALESCE(b.starttime, d.starttime, s.starttime)
            THEN COALESCE(b.endtime, d.endtime, s.endtime)
        ELSE COALESCE(b.starttime, d.starttime, s.starttime)
    END))) / 60)::NUMERIC(24, 0) AS minutes
    , SUM(COALESCE(b.ROWS, d.ROWS, s.ROWS)) AS ROWS
    , TRIM(SPLIT_PART(l.event, ':', 1)) AS event
    , SUBSTRING(TRIM(l.solution), 1, 60) AS solution
    , MAX(l.QUERY) AS sample_query
    , COUNT(DISTINCT l.QUERY)
FROM STL_ALERT_EVENT_LOG AS l
LEFT JOIN stl_scan AS s
    ON s.QUERY = l.QUERY AND s.slice = l.slice AND s.segment = l.segment
LEFT JOIN stl_dist AS d
    ON d.QUERY = l.QUERY AND d.slice = l.slice AND d.segment = l.segment 
LEFT JOIN stl_bcast AS b
    ON b.QUERY = l.QUERY AND b.slice = l.slice AND b.segment = l.segment 
WHERE l.userid > 1 AND l.event_time >= DATEADD(DAY, -7, CURRENT_DATE)
GROUP BY 1, 4, 5
ORDER BY 2 DESC, 6 DESC;

The query lists query IDs, the most common issues, and the number of times each issue occurred for queries that ran on the cluster.

The following is an example output of the query and information that describes why the alert was triggered:

table | minutes | rows |               event                |                        solution                        | sample_query | count
-------+---------+------+------------------------------------+--------------------------------------------------------+--------------+-------
NULL  |    NULL | NULL | Nested Loop Join in the query plan | Review the join predicates to avoid Cartesian products |      1080906 |     2

To review query performance, check the diagnostic queries for query tuning. Make sure that your query operations are designed to run efficiently. For example, not all join operations are effective. A nested loop join is the least effective join type. Because nested loop joins significantly increase query runtime, try to avoid nested loops.

To help you diagnose the issue, identify the queries that are performing the nested loops. For more information, see How can I troubleshoot high or full disk usage with Amazon Redshift?
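
For example, the following sketch (scoped to the last 7 days, an arbitrary window) pulls the text of queries that triggered nested loop join alerts so that you can review their join predicates:

SELECT l.query,
       TRIM(SUBSTRING(q.querytxt, 1, 80)) AS query_text,
       COUNT(*) AS alert_count
FROM stl_alert_event_log l
JOIN stl_query q ON q.query = l.query
WHERE l.event LIKE 'Nested Loop Join%'
  AND l.event_time >= DATEADD(DAY, -7, CURRENT_DATE)
GROUP BY l.query, TRIM(SUBSTRING(q.querytxt, 1, 80))
ORDER BY alert_count DESC;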

Check for locking issues and long-running sessions or transactions

Before a query runs on the cluster, Amazon Redshift might acquire table-level locks on the tables that are involved in the query. Sometimes, queries appear as though they aren't responding, or there's a spike in query runtime. If you experience a spike in your query runtime, then a locking issue might be the cause. For more information, see Why is my query planning time so high in Amazon Redshift?

If your table is currently locked by another process or query, then your query can't proceed. As a result, you don't see your query appear in the STV_INFLIGHT table. Instead, your running query appears in the STV_RECENTS table.
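
To see whether a query is in this state, one approach (a sketch, not a complete lock diagnosis) is to list queries that STV_RECENTS reports as running but that don't appear in STV_INFLIGHT:

SELECT r.pid,
       r.user_name,
       r.starttime,
       r.duration / 1000000 AS duration_s,
       TRIM(r.query) AS query_text
FROM stv_recents r
WHERE r.status = 'Running'
  AND r.pid NOT IN (SELECT pid FROM stv_inflight);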

Sometimes, a long-running transaction can cause a query to stop responding. Take the following actions so that long-running sessions or transactions don't affect your query performance:

  • Use the STL_SESSIONS and SVV_TRANSACTIONS tables to check for long-running sessions and transactions, and then terminate them (see the example query after this list).
  • Design your queries so that Amazon Redshift can quickly and efficiently process them.
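
The following sketch uses SVV_TRANSACTIONS to find transactions that have been open for more than 30 minutes (an example threshold). After you confirm that a session is safe to end, you can terminate it with PG_TERMINATE_BACKEND:

SELECT xid, pid, txn_owner, txn_db, txn_start,
       DATEDIFF(minute, txn_start, GETDATE()) AS open_minutes
FROM svv_transactions
WHERE DATEDIFF(minute, txn_start, GETDATE()) > 30
GROUP BY xid, pid, txn_owner, txn_db, txn_start
ORDER BY txn_start;

-- After you verify that the session can be safely ended, replace <pid> with the process ID:
-- SELECT PG_TERMINATE_BACKEND(<pid>);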

Note: Long-running sessions or transactions also prevent the VACUUM operation from reclaiming disk space and lead to a higher number of ghost rows, or uncommitted rows. Ghost rows that queries scan can affect query performance.

For more information, see How do I detect and release locks in Amazon Redshift?

Check your WLM configuration

Depending on your WLM configuration, a query might begin to run immediately or might be queued for some time. Minimize the amount of time that a query is queued before it runs. To define your queues, check your WLM memory allocation.
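
As a quick first check before you run the more detailed query that follows, the following sketch summarizes average queue time and execution time per WLM queue (service class) for the last day. Service classes 1-4 are reserved for system use, so they're excluded:

SELECT service_class,
       COUNT(*) AS query_count,
       AVG(total_queue_time) / 1000000.0 AS avg_queue_s,
       AVG(total_exec_time) / 1000000.0 AS avg_exec_s,
       MAX(total_queue_time) / 1000000.0 AS max_queue_s
FROM stl_wlm_query
WHERE service_class > 4
  AND service_class_start_time >= DATEADD(day, -1, GETDATE())
GROUP BY service_class
ORDER BY avg_queue_s DESC;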

To check a cluster's WLM queues over a few days, run the following query:

SELECT *, pct_compile_time + pct_wlm_queue_time + pct_exec_only_time + pct_commit_queue_time + pct_commit_time AS total_pcnt
FROM 
(SELECT IQ.*,
   ((IQ.total_compile_time*1.0 / IQ.wlm_start_commit_time)*100)::DECIMAL(5,2) AS pct_compile_time,
   ((IQ.wlm_queue_time*1.0 / IQ.wlm_start_commit_time)*100)::DECIMAL(5,2) AS pct_wlm_queue_time,
   ((IQ.exec_only_time*1.0 / IQ.wlm_start_commit_time)*100)::DECIMAL(5,2) AS pct_exec_only_time,
   ((IQ.commit_queue_time*1.0 / IQ.wlm_start_commit_time)*100)::DECIMAL(5,2) pct_commit_queue_time,
   ((IQ.commit_time*1.0 / IQ.wlm_start_commit_time)*100)::DECIMAL(5,2) pct_commit_time  
FROM
     (SELECT trunc(d.service_class_start_time) AS DAY,
       d.service_class,
       d.node,
       COUNT(DISTINCT d.xid) AS count_all_xid,
       COUNT(DISTINCT d.xid) -COUNT(DISTINCT c.xid) AS count_readonly_xid,
       COUNT(DISTINCT c.xid) AS count_commit_xid,
       SUM(compile_us) AS total_compile_time,
       SUM(datediff (us,CASE WHEN d.service_class_start_time > compile_start THEN compile_start ELSE d.service_class_start_time END,d.queue_end_time)) AS wlm_queue_time,
       SUM(datediff (us,d.queue_end_time,d.service_class_end_time) - compile_us) AS exec_only_time,
       nvl(SUM(datediff (us,CASE WHEN node > -1 THEN c.startwork ELSE c.startqueue END,c.startwork)),0) commit_queue_time,
       nvl(SUM(datediff (us,c.startwork,c.endtime)),0) commit_time,
       SUM(datediff (us,CASE WHEN d.service_class_start_time > compile_start THEN compile_start ELSE d.service_class_start_time END,d.service_class_end_time) + CASE WHEN c.endtime IS NULL THEN 0 ELSE (datediff (us,CASE WHEN node > -1 THEN c.startwork ELSE c.startqueue END,c.endtime)) END) AS wlm_start_commit_time     
FROM
        (SELECT node, b.*
          FROM (SELECT -1 AS node UNION SELECT node FROM stv_slices) a,
               stl_wlm_query b
         WHERE queue_end_time > '2005-01-01'
           AND exec_start_time > '2005-01-01') d
     LEFT JOIN stl_commit_stats c USING (xid,node)
     JOIN (SELECT query, MIN(starttime) AS compile_start, SUM(datediff (us,starttime,endtime)) AS compile_us
           FROM svl_compile
           GROUP BY 1) e USING (query)
    WHERE d.xid > 0
      AND d.service_class > 4
      AND d.final_state <> 'Evicted'
 GROUP BY trunc(d.service_class_start_time),
          d.service_class,
          d.node
 ORDER BY trunc(d.service_class_start_time),
          d.service_class,
          d.node) IQ)
WHERE node < 0 ORDER BY 1,2,3;

This query provides the total number of transactions (xid), the runtime, the queue time, and the commit queue details. Check the commit queue details to see if frequent commits are affecting workload performance.
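
To look at commit activity directly, the following sketch (scoped to the leader node, node = -1) summarizes the number of commits per hour along with the average commit queue wait and commit time. Adjust the grouping to match the period that you're investigating:

SELECT DATE_TRUNC('hour', startwork) AS hour,
       COUNT(*) AS commits,
       AVG(DATEDIFF(ms, startqueue, startwork)) AS avg_commit_queue_ms,
       AVG(DATEDIFF(ms, startwork, endtime)) AS avg_commit_ms
FROM stl_commit_stats
WHERE node = -1
GROUP BY 1
ORDER BY 1;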

To check the details of queries that are running at a particular point in time, run the following query:

select b.userid, b.query, b.service_class, b.slot_count, b.xid, d.pid, d.aborted, a.compile_start, b.service_class_start_time, b.queue_end_time, b.service_class_end_time, c.startqueue as commit_startqueue, c.startwork as commit_startwork, c.endtime as commit_endtime, a.total_compile_time_s,
    datediff(s,b.service_class_start_time,b.queue_end_time) as wlm_queue_time_s, datediff(s,b.queue_end_time,b.service_class_end_time) as wlm_exec_time_s,
    datediff(s, c.startqueue, c.startwork) commit_queue_s, datediff(s, c.startwork, c.endtime) commit_time_s, undo_time_s, numtables_undone,
    datediff(s,a.compile_start,nvl(c.endtime,b.service_class_end_time)) total_query_s, substring(d.querytxt,1,50) as querytext
from (select query, min(starttime) as compile_start, max(endtime) as compile_end, sum(datediff(s,starttime,endtime)) as total_compile_time_s from svl_compile group by query) a
left join stl_wlm_query b using (query)
left join (select * from stl_commit_stats where node=-1) c using (xid)
left join stl_query d using (query)
left join (select xact_id_undone as xid, datediff(s,min(undo_start_ts),max(undo_end_ts)) as undo_time_s, count(distinct table_id) numtables_undone from stl_undone group by 1) e on b.xid=e.xid
WHERE '2011-12-20 13:45:00' between compile_start and service_class_end_time;

Note: Replace 2011-12-20 13:45:00 with the specific time and date that you want to check for queued and completed queries.

Review your cluster node hardware performance

If a node was replaced during a cluster maintenance window, then the cluster might be available soon. However, it might take some time for the data to be restored on the replaced node. During this process, your cluster performance might decline.

To identify the events that affected your cluster performance, check your Amazon Redshift cluster events.

To monitor the data restoration process, use the STV_UNDERREPPED_BLOCKS table. Run the following query to retrieve the blocks that require data restoration:

SELECT COUNT(1) FROM STV_UNDERREPPED_BLOCKS;

Note: The duration of the data restoration process depends on the cluster workload. To measure the progress of your cluster's data restoration process, check the blocks at intervals.

To check the health of a specific node, run the following query to compare its performance to other nodes:

SELECT day
  , node
  , elapsed_time_s
  , sum_rows
  , kb
  , kb_s
  , rank() over (partition by day order by kb_s) AS rank
FROM (
  SELECT DATE_TRUNC('day',start_time) AS day
    , node
    , sum(elapsed_time)/1000000 AS elapsed_time_s
    , sum(rows) AS sum_rows
    , sum(bytes)/1024 AS kb
    , (sum(bytes)/1024)/(sum(elapsed_time)/1000000) AS "kb_s"
  FROM svl_query_report r
    , stv_slices AS s
   WHERE r.slice = s.slice
     AND elapsed_time > 1000000
   GROUP BY day
    , node
   ORDER BY day
    , node
);

Example query output:

day    node    elapsed_time_s    sum_rows         kb         kb_s  rank
4/8/20     0      3390446     686216350489    21570133592    6362    4
4/8/20     2      3842928     729467918084    23701127411    6167    3
4/8/20     3      3178239     706508591176    22022404234    6929    7
4/8/20     5      3805884     834457007483    27278553088    7167    9
4/8/20     7      6242661     433353786914    19429840046    3112    1
4/8/20     8      3376325     761021567190    23802582380    7049    8

Note: The preceding output shows that node 7 processed 19429840046 KB of data in 6242661 seconds, which is much slower than the other nodes.

The ratio between the number of rows in the sum_rows column and the number of processed kilobytes in the kb column should be about the same for each node. Similarly, if the hardware is performing consistently, the throughput in the kb_s column should be roughly the same across nodes. If a node processes less data over the same period of time, then there might be an underlying hardware issue. To check whether there's an underlying hardware issue, review the node's performance graph.
