How can I troubleshoot and resolve high CPU utilization on my Amazon DocumentDB instances?

I observe high CPU utilization on my Amazon DocumentDB (with MongoDB compatibility) DB instances.

Short description

The CPU utilization of your Amazon DocumentDB instances helps you understand how your currently allocated resources perform for the ongoing workload.

You might observe an increase in CPU utilization for these reasons:

  • User-initiated heavy workloads
  • Inefficient queries
  • The writer in the cluster is overburdened because the read load isn't balanced in the cluster
  • The reader uses a smaller instance class than the writer and can't keep up with the high write workload
  • Internal tasks such as garbage collection in the Amazon DocumentDB cluster
  • Too many idle database connections
  • Short bursts of connections

To identify the main sources of CPU usage in your Amazon DocumentDB instance, review these aspects:

  • Amazon CloudWatch metrics
  • Performance Insights
  • Native database queries
  • Query efficiency
  • Aggressive logging settings

After you identify the cause, analyze and optimize your workload to reduce CPU usage. If the issue persists, then increase your instance size based on your workload.

Resolution

Use CloudWatch metrics

Amazon DocumentDB integrates with CloudWatch and allows you to gather and analyze operational metrics for your clusters.

CloudWatch metrics allow you to identify patterns in CPU utilization and related metrics over extended periods. Review these metrics and then monitor them in the CloudWatch console:

  • Use DatabaseConnections and DatabaseConnectionsMax to identify the number of open connections during the relevant time period.
  • Use WriteIOPS, ReadIOPS, ReadThroughput, and WriteThroughput to understand the overall workload on your Amazon DocumentDB instance.
  • Use DocumentsDeleted, DocumentsInserted, DocumentsReturned, and DocumentsUpdated to understand the user workload on your Amazon DocumentDB instance.
  • If you use the T3 or T4g instance classes, then review CPUCreditBalance and CPUSurplusCreditBalance to check for compute throttling.

Use Performance Insights metrics

Use Amazon DocumentDB Performance Insights to identify queries that contribute to database load and wait state. Under the Manage Metrics option, use the average active sessions to review the load and CPU distribution (system%, user% or total%).

A load average that's greater than the number of vCPUs indicates that the instance is under a heavy load. Conversely, if the load average is less than the number of vCPUs for the DB instance class, then CPU throttling might not be the cause of application latency. Check the load average and analyze the relevant wait states, such as I/O, locks, and latches, to understand the source of the added CPU usage.
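
The comparison described above can be sketched as a small JavaScript helper. This is only an illustration: the function name assessLoad and the sample numbers are hypothetical, and the average active sessions value would come from Performance Insights.

```javascript
// Hypothetical helper: compares the load average (average active sessions)
// with the number of vCPUs of the DB instance class.
function assessLoad(avgActiveSessions, vcpus) {
  if (vcpus <= 0) {
    throw new RangeError("vcpus must be positive");
  }
  const ratio = avgActiveSessions / vcpus;
  return {
    ratio,
    heavyLoad: ratio > 1, // more active sessions than vCPUs
  };
}

// Sample values (assumed): a 4-vCPU instance under two different loads.
console.log(assessLoad(6, 4));   // load exceeds the vCPU count
console.log(assessLoad(1.5, 4)); // load is below the vCPU count
```

A ratio above 1 only suggests that the instance is CPU constrained; the wait states still need to be reviewed to find the actual source.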

Use native database queries

Native queries can help you analyze the workload and check the CPU usage. Use the MongoDB shell to run this query. This lists all operations that currently run on an Amazon DocumentDB instance:

db.adminCommand({currentOp: 1, $all: 1});

This query uses the $currentOp aggregation stage to list all queries that are either blocked or have been running for more than 10 seconds:

db.adminCommand({aggregate: 1,
                 pipeline: [{$currentOp: {}},
                            {$match: {$or: [{secs_running: {$gt: 10}},
                                            {WaitState: {$exists: true}}]}},
                            {$project: {_id:0,
                                        opid: 1,
                                        secs_running: 1,
                                        WaitState: 1,
                                        blockedOn: 1,
                                        command: 1}}],
                 cursor: {}
                });

To analyze the system usage results, run this query on the instance that you observe high CPU usage on. This query returns an aggregate of all queries that run in each namespace. It also lists all internal system tasks and the unique number of wait states per namespace.

db.adminCommand({aggregate: 1,
                 pipeline: [{$currentOp: {allUsers: true, idleConnections: true}},
                            {$group: {_id: {desc: "$desc", ns: "$ns", WaitState: "$WaitState"}, count: {$sum: 1}}}],
                 cursor: {}
                });

Note: The GARBAGE_COLLECTION entry under the internal tasks comes from the multi-version concurrency control (MVCC) implementation in the Amazon DocumentDB cluster. It's a background sweeper that removes dead document versions, and its activity relates to the number of updates or deletes on your database. The sweeping process is triggered based on internal thresholds at the collection level, and results in read/write IOPS and CPU usage.
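
To make the grouped output easier to read, the groups can be flattened and sorted by count. The following is a minimal sketch, assuming the groups have the shape produced by the $group stage above (an _id with desc, ns, and WaitState, plus a count). The function name summarizeOpGroups and the sample data are hypothetical.

```javascript
// Hypothetical helper: flattens the grouped $currentOp output and returns
// the busiest groups first. Missing _id fields are reported as null.
function summarizeOpGroups(groups, limit = 5) {
  return groups
    .map((g) => ({
      desc: g._id.desc ?? null,
      ns: g._id.ns ?? null,
      waitState: g._id.WaitState ?? null,
      count: g.count,
    }))
    .sort((a, b) => b.count - a.count)
    .slice(0, limit);
}

// Sample input (made up for illustration). In the mongo shell, the groups
// would typically be read from the command result's cursor.firstBatch array.
const sampleGroups = [
  { _id: { desc: "Conn", ns: "app.orders" }, count: 3 },
  { _id: { desc: "GARBAGE_COLLECTION" }, count: 7 },
  { _id: { desc: "Conn", ns: "app.users", WaitState: "IO" }, count: 1 },
];
console.log(summarizeOpGroups(sampleGroups, 2));
```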

Check the efficiency of queries

Check index overhead for write queries

Too many indexes, or many unused indexes, on your collections add overhead to your writes. Review the index statistics to analyze index usage and identify unused indexes.
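
As a sketch of how the index statistics could be reviewed, the helper below filters the documents returned by the $indexStats aggregation stage for indexes with no recorded uses. The function name findUnusedIndexes, the collection name orders, and the sample data are hypothetical. The sketch also assumes that accesses.ops converts cleanly with Number(), which might need adjusting for 64-bit integer types in your shell.

```javascript
// Hypothetical helper: returns the names of indexes whose usage counter is 0.
// The default _id_ index is skipped because it can't be dropped.
function findUnusedIndexes(indexStats) {
  return indexStats
    .filter((s) => s.name !== "_id_" && Number(s.accesses.ops) === 0)
    .map((s) => s.name);
}

// In the mongo shell, the input could come from (collection name assumed):
//   db.orders.aggregate([{ $indexStats: {} }]).toArray()
const sampleIndexStats = [
  { name: "_id_", accesses: { ops: 0 } },
  { name: "status_1", accesses: { ops: 1520 } },
  { name: "legacyField_1", accesses: { ops: 0 } },
];
console.log(findUnusedIndexes(sampleIndexStats));
```

Run the check on each instance that serves traffic before you drop anything, because an index that looks unused on one instance might still serve queries on another.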

Check the explain plan of the query

Queries can run slowly because they require a full scan of the collection to select the relevant documents. Create appropriate indexes to improve the speed of the query.

Use the explain command to identify the fields that you want to create indexes on. You can also use profiler logs to capture long running queries and the details of their operations.
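
The following sketch shows one way to check an explain plan for a full collection scan. It assumes that the plan is a nested object in which each stage has a stage name and optional inputStage or inputStages children. The function name planUsesCollScan, the collection name, and the sample plans are hypothetical.

```javascript
// Hypothetical helper: walks a plan tree and reports whether any stage
// is a COLLSCAN (a full scan of the collection).
function planUsesCollScan(plan) {
  if (!plan || typeof plan !== "object") return false;
  if (plan.stage === "COLLSCAN") return true;
  const children = [];
  if (plan.inputStage) children.push(plan.inputStage);
  if (Array.isArray(plan.inputStages)) children.push(...plan.inputStages);
  return children.some(planUsesCollScan);
}

// In the mongo shell, the plan could come from (names assumed):
//   db.orders.find({ status: "open" }).explain().queryPlanner.winningPlan
const scanPlan = { stage: "SORT", inputStage: { stage: "COLLSCAN" } };
const indexPlan = { stage: "FETCH", inputStage: { stage: "IXSCAN" } };
console.log(planUsesCollScan(scanPlan), planUsesCollScan(indexPlan));
```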

Check statistics of collections

Check these statistics for the collections you use:

  • Review the Top Queries section in Performance Insights to identify the collections that are contributing most to the load.
  • Review the collection's statistics to understand the number of insert, update, and delete operations performed on it. You can also review the number of index scans and full collection scans performed.
  • Split your collections to reduce the document size to be processed, particularly if you have a large number of update operations.
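
The collection statistics review above can be sketched as follows. The field names collScans, idxScans, and opCounters (with numDocsIns, numDocsUpd, and numDocsDel) are assumptions about the output of db.collection.stats(), so verify them against your own output. The function name summarizeCollectionStats and the sample values are hypothetical.

```javascript
// Hypothetical helper: summarizes scan and write counters from a
// collection stats document. Missing counters are treated as 0.
function summarizeCollectionStats(stats) {
  const collScans = Number(stats.collScans ?? 0);
  const idxScans = Number(stats.idxScans ?? 0);
  const totalScans = collScans + idxScans;
  const ops = stats.opCounters ?? {};
  return {
    // Share of scans that read the whole collection (0 when there are no scans).
    collScanShare: totalScans === 0 ? 0 : collScans / totalScans,
    writes:
      Number(ops.numDocsIns ?? 0) +
      Number(ops.numDocsUpd ?? 0) +
      Number(ops.numDocsDel ?? 0),
  };
}

// In the mongo shell, the input could come from db.orders.stats()
// (collection name assumed). Sample values are made up.
const sampleStats = {
  collScans: 25,
  idxScans: 75,
  opCounters: { numDocsIns: 10, numDocsUpd: 5, numDocsDel: 1 },
};
console.log(summarizeCollectionStats(sampleStats));
```

A high share of collection scans on a busy collection suggests that a missing index is worth investigating.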

Check the Aggressive Logging settings

Auditing of events is prioritized over database traffic. If auditing isn't needed, then you can turn it off. If you do require auditing, then set the audit_logs parameter to log only the events that are necessary. Plan for increased load, and then switch to a bigger instance class if needed.

For profiler logs, make sure the correct value is set for the profiler_threshold_ms parameter to avoid aggressive logging. Review your application workload to identify the correct threshold you require to categorize a query as long running.

Confirm that you activated the log exports option for the logs that you want to export to CloudWatch.

Use best practices

Offload the read workload to reader

If you have multiple DB instances in your Amazon DocumentDB cluster, offload the read workload to your reader instance. When you connect as a replica set, specify the readPreference for the connection. If you specify a read preference of secondaryPreferred, then the client tries to route the read queries to your replicas. The client tries to route write queries to your primary DB instance.

Note that readers are eventually consistent. If a workload requires stronger read-after-write consistency, then use a dynamic read preference and override it at the query level. For example, you might default to secondaryPreferred at the connection level so that queries go to the replicas. For queries that require stronger read-after-write consistency, you can override the default. See this example:

db.collection.find().readPref("primary")

Add one or more reader instances to the cluster

If you have an Amazon DocumentDB cluster with a single DB instance (writer only), then add one or multiple reader DB instances to the cluster. Then use readPreference=secondaryPreferred to handle the load efficiently.
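
As an illustration of setting the read preference at the connection level, the sketch below assembles a connection string with readPreference=secondaryPreferred. The function name buildDocDbUri, the endpoint, and the credentials are hypothetical. The replicaSet=rs0, tls=true, and retryWrites=false options are assumptions based on typical Amazon DocumentDB connection strings, so adjust them for your cluster.

```javascript
// Hypothetical helper: builds a replica set connection string that routes
// reads to replicas when they're available.
function buildDocDbUri({
  user,
  password,
  endpoint,
  port = 27017,
  readPreference = "secondaryPreferred",
}) {
  const params = new URLSearchParams({
    tls: "true",
    replicaSet: "rs0",
    readPreference,
    retryWrites: "false",
  });
  // Credentials are percent-encoded so special characters don't break the URI.
  const auth = `${encodeURIComponent(user)}:${encodeURIComponent(password)}`;
  return `mongodb://${auth}@${endpoint}:${port}/?${params.toString()}`;
}

// Placeholder values for illustration only.
console.log(
  buildDocDbUri({
    user: "appUser",
    password: "p@ss",
    endpoint: "docdb-cluster.example.com",
  })
);
```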

Use Amazon DocumentDB Profiler to identify slow queries

Use the Amazon DocumentDB Profiler to log slow queries. If a query appears repeatedly in the slow query logs, then you might need an additional index to improve performance.

Look for long-running queries that have one or more COLLSCAN stages. A COLLSCAN stage indicates that the query has to read every document in the collection to provide a response.
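
To spot queries that appear repeatedly in the slow query logs, the log entries can be grouped and counted. The sketch below assumes that each profiler entry has already been parsed from JSON and has ns, op, millis, and planSummary fields. Treat those field names as assumptions to check against your own logs. The function name groupSlowQueries and the sample entries are hypothetical.

```javascript
// Hypothetical helper: groups profiler entries by namespace, operation,
// and plan summary, then returns the most frequent groups first.
function groupSlowQueries(entries) {
  const groups = new Map();
  for (const e of entries) {
    const planSummary = e.planSummary ?? "UNKNOWN";
    const key = `${e.ns}|${e.op}|${planSummary}`;
    const g = groups.get(key) ?? {
      ns: e.ns,
      op: e.op,
      planSummary,
      count: 0,
      totalMillis: 0,
    };
    g.count += 1;
    g.totalMillis += Number(e.millis ?? 0);
    groups.set(key, g);
  }
  return [...groups.values()].sort((a, b) => b.count - a.count);
}

// Sample entries (made up for illustration).
const sampleEntries = [
  { ns: "app.orders", op: "query", millis: 120, planSummary: "COLLSCAN" },
  { ns: "app.orders", op: "query", millis: 180, planSummary: "COLLSCAN" },
  { ns: "app.users", op: "update", millis: 90, planSummary: "IXSCAN" },
];
console.log(groupSlowQueries(sampleEntries));
```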

For more information, see Profile slow-running queries in Amazon DocumentDB (with MongoDB compatibility).

Create an alarm notification with CloudWatch

Create a CloudWatch alarm that notifies you when the CPUUtilization metric exceeds a specific threshold.
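
As a rough sketch, an alarm definition for this metric could look like the object built below, which follows the parameter names of the CloudWatch PutMetricAlarm API. The function name buildCpuAlarmRequest, the alarm name, the evaluation settings, and the sample identifiers are hypothetical. The sketch only constructs the request; sending it with an AWS SDK client or the AWS CLI isn't shown.

```javascript
// Hypothetical helper: builds PutMetricAlarm parameters for the
// CPUUtilization metric of one Amazon DocumentDB instance.
function buildCpuAlarmRequest(instanceId, thresholdPercent, snsTopicArn) {
  return {
    AlarmName: `docdb-high-cpu-${instanceId}`,
    Namespace: "AWS/DocDB",
    MetricName: "CPUUtilization",
    Dimensions: [{ Name: "DBInstanceIdentifier", Value: instanceId }],
    Statistic: "Average",
    Period: 300,          // evaluate 5-minute averages (assumed value)
    EvaluationPeriods: 3, // alarm after 3 consecutive breaches (assumed value)
    Threshold: thresholdPercent,
    ComparisonOperator: "GreaterThanThreshold",
    AlarmActions: [snsTopicArn],
  };
}

// Placeholder instance ID and SNS topic ARN for illustration only.
console.log(
  buildCpuAlarmRequest(
    "my-docdb-instance",
    80,
    "arn:aws:sns:us-east-1:123456789012:docdb-alerts"
  )
);
```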

Scale up the instance class of your DB instances

If there's no further scope for query tuning, then scale up the instance class of the instances in the cluster to handle the workload.

Note: If you scale up an instance class, then this increases the cost. For more information, see Amazon DocumentDB pricing.

Related information

Scale Amazon DocumentDB Clusters

Performance and resource utilization

How to index on Amazon DocumentDB (with MongoDB compatibility)

AWS OFFICIAL, updated 7 months ago