How do I resolve search or write rejections in OpenSearch Service?

When I submit a search or write request to my Amazon OpenSearch Service cluster, the requests are rejected.

Short description

When you write or search for data in your OpenSearch Service cluster, you might receive the following HTTP 429 error or es_rejected_execution_exception:

error":"elastic: Error 429 (Too Many Requests): rejected execution of org.elasticsearch.transport.TransportService$7@b25fff4 on 
EsThreadPoolExecutor[bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@768d4a66[Running, 
pool size = 2, active threads = 2, queued tasks = 200, completed tasks = 820898]] [type=es_rejected_execution_exception]"

Reason={"type":"es_rejected_execution_exception","reason":"rejected execution of org.elasticsearch.transport.TcpTransport$RequestHandler@3ad6b683 on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@bef81a5[Running, pool size = 25, active threads = 23, queued tasks = 1000, completed tasks = 440066695]]"

The following variables can contribute to an HTTP 429 error or es_rejected_execution_exception:

  • Data node instance types and search or write limits
  • High values for instance metrics
  • Active and Queue threads
  • High CPU utilization and JVM memory pressure

HTTP 429 errors can occur for both search and write requests to your cluster. Rejections can come from a single node or from multiple nodes in your cluster.

Note: Different versions of Elasticsearch use different thread pools to process calls to the _index API. Elasticsearch versions 1.5 and 2.3 use the index thread pool. Elasticsearch versions 5.x, 6.0, and 6.2 use the bulk thread pool. Elasticsearch versions 6.3 and later use the write thread pool. For more information, see Thread pool on the Elasticsearch website.

Resolution

Data node instance types and search or write limits

A data node instance type has a fixed number of virtual CPUs (vCPUs). Plug the vCPU count into the following formulas to calculate how many concurrent search or write operations a node can perform before requests start to queue. If all active threads are busy, then new requests spill over to the queue and are eventually rejected. For more information about the relationship between vCPUs and node types, see OpenSearch Service pricing.

Additionally, there is a limit to how many searches per node or writes per node that you can perform. This limit is based on the thread pool definition and Elasticsearch version number. For more information, see Thread pool on the Elasticsearch website.

For example, if you choose the R5.2xlarge node type for five nodes in your Elasticsearch cluster (version 7.4), then each node has 8 vCPUs.

Use the following formula to calculate maximum active threads for search requests:

int ((# of available_processors * 3) / 2) + 1

Use the following formula to calculate maximum active threads for write requests:

int (# of available_processors)

For an R5.2xlarge node, you can perform a maximum of 13 search operations:

(8 vCPUs * 3) / 2 + 1 = 13 operations

For an R5.2xlarge node, you can perform a maximum of 8 write operations:

8 vCPUs = 8 operations

For an OpenSearch Service cluster with five nodes, you can perform a maximum of 65 search operations:

5 nodes * 13 = 65 operations

For an OpenSearch Service cluster with five nodes, you can perform a maximum of 40 write operations:

5 nodes * 8 = 40 operations
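To check these limits for other configurations, you can script the same arithmetic. The following is a minimal sketch in bash; the vCPU and node counts are the example values from above, so replace them with your own:

# Per-node and per-cluster operation limits, using the formulas above.
# Example values: R5.2xlarge (8 vCPUs), five data nodes.
VCPUS=8
NODES=5

SEARCH_THREADS=$(( (VCPUS * 3) / 2 + 1 ))   # 13 for 8 vCPUs
WRITE_THREADS=$VCPUS                        # 8 for 8 vCPUs

echo "Per node:    $SEARCH_THREADS search / $WRITE_THREADS write operations"
echo "Per cluster: $((SEARCH_THREADS * NODES)) search / $((WRITE_THREADS * NODES)) write operations"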

High values for instance metrics

To troubleshoot your 429 exception, check the following Amazon CloudWatch metrics for your cluster:

  • IndexingRate: The number of indexing operations per minute. A single call to the _bulk API that adds two documents and updates two more counts as four operations, and those operations might spread across nodes. If the index has one or more replicas, then other nodes in the cluster also record a total of four indexing operations. Document deletions don't count towards the IndexingRate metric.
  • SearchRate: The total number of search requests per minute for all shards on a data node. A single call to the _search API might return results from many different shards. If five different shards are on one node, then the node reports "5" for this metric, even if the client only made one request.
  • CoordinatingWriteRejected: The total number of rejections that occurred on the coordinating node. These rejections are caused by indexing pressure that accumulated since the last OpenSearch Service startup.
  • PrimaryWriteRejected: The total number of rejections that occurred on the primary shards. These rejections are caused by indexing pressure that accumulated since the last OpenSearch Service startup.
  • ReplicaWriteRejected: The total number of rejections that occurred on the replica shards because of indexing pressure. These rejections are caused by indexing pressure that accumulated since the last OpenSearch Service startup.
  • ThreadpoolWriteQueue: The number of queued tasks in the write thread pool. This metric tells you whether a request is being rejected because of high CPU usage or high indexing concurrency.
  • ThreadpoolWriteRejected: The number of rejected tasks in the write thread pool.
    Note: The default write queue size was increased from 200 to 10,000 in OpenSearch Service version 7.9. As a result, this metric is no longer the only indicator of rejections from OpenSearch Service. Use the CoordinatingWriteRejected, PrimaryWriteRejected, and ReplicaWriteRejected metrics to monitor rejections in versions 7.9 and later.
  • ThreadpoolSearchQueue: The number of queued tasks in the search thread pool. If the queue size is consistently high, then consider scaling your cluster. The maximum search queue size is 1,000.
  • ThreadpoolSearchRejected: The number of rejected tasks in the search thread pool. If this number continually grows, then consider scaling your cluster.
  • JVMMemoryPressure: The JVM memory pressure specifies the percentage of the Java heap that is used in a cluster node. If JVM memory pressure reaches 75%, then OpenSearch Service initiates the Concurrent Mark Sweep (CMS) garbage collector. Garbage collection is a CPU-intensive process. If JVM memory pressure stays at this level for a few minutes, then you might encounter cluster performance issues. For more information, see How do I troubleshoot high JVM memory pressure on my Amazon OpenSearch Service cluster?

Note: The thread pool metrics listed above help you interpret the IndexingRate and SearchRate metrics.

For more information about monitoring your OpenSearch Service cluster with CloudWatch, see Instance metrics.
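You can also retrieve these metrics with the AWS CLI. The following is a minimal sketch, assuming a hypothetical domain named my-domain in account 123456789012 and a fixed one-hour window; replace these placeholders with your own values. OpenSearch Service publishes domain metrics in the AWS/ES CloudWatch namespace.

# Maximum ThreadpoolWriteRejected per minute over a sample one-hour window.
aws cloudwatch get-metric-statistics \
  --namespace "AWS/ES" \
  --metric-name ThreadpoolWriteRejected \
  --dimensions Name=DomainName,Value=my-domain Name=ClientId,Value=123456789012 \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-01-01T01:00:00Z \
  --period 60 \
  --statistics Maximum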

Active and Queue threads

If there aren't enough vCPUs, or if request concurrency is high, then the queue can fill up quickly and result in an HTTP 429 error. To monitor your queue threads, check the ThreadpoolSearchQueue and ThreadpoolWriteQueue metrics in CloudWatch.

To check your Active and Queue threads for any search rejections, use the following command:

GET /_cat/thread_pool/search?v&h=id,name,active,queue,rejected,completed

To check Active and Queue threads for write rejections, replace "search" with "write". The rejected and completed values in the output are cumulative node counters, which are reset when new nodes are launched. For more information, see the Example with explicit columns section of cat thread pool API on the Elasticsearch website.
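For example, the following call returns the same columns for the write thread pool:

GET /_cat/thread_pool/write?v&h=id,name,active,queue,rejected,completed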

Note: The bulk queue on each node can hold between 50 and 200 requests, depending on which Elasticsearch version you are using. When the queue is full, new requests are rejected.

Errors for search and write rejections

Search rejections

A search rejection error indicates that active threads are busy and that queues are filled up to the maximum number of tasks. As a result, your search request can be rejected. You can configure OpenSearch Service logs so that these error messages appear in your search slow logs.

Note: To avoid extra overhead, set your slow log threshold to a generous value. For example, if most of your queries take 11 seconds and your threshold is 10 seconds, then OpenSearch Service spends additional time writing slow logs for most queries. You can avoid this overhead by setting your slow log threshold to 20 seconds. Then, only the small percentage of queries that take longer than 20 seconds is logged.

After you configure your cluster to push search slow logs to CloudWatch, set a threshold for slow log generation with the following HTTP PUT call:

curl -XPUT "https://<your domain's endpoint>/index/_settings" -H 'Content-Type: application/json' -d '{"index.search.slowlog.threshold.query.<level>":"10s"}'
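For example, the following call sets illustrative thresholds for the warn and info levels on the same placeholder index; adjust the values to match your workload:

curl -XPUT "https://<your domain's endpoint>/index/_settings" -H 'Content-Type: application/json' -d '{"index.search.slowlog.threshold.query.warn":"20s","index.search.slowlog.threshold.query.info":"10s"}'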

Write rejections

A 429 error message for a write request indicates a bulk queue error. The es_rejected_execution_exception[bulk] indicates that your queue is full and that new requests are rejected. This bulk queue error occurs when the number of requests to the cluster exceeds the bulk queue size (threadpool.bulk.queue_size). A bulk queue on each node can hold between 50 and 200 requests, depending on which Elasticsearch version you use.

You can configure OpenSearch Service logs so that these error messages appear in your index slow logs.

Note: To avoid extra overhead, set your slow log threshold to a generous value. For example, if most of your queries take 11 seconds and your threshold is 10 seconds, then OpenSearch Service spends additional time writing slow logs for most queries. You can avoid this overhead by setting your slow log threshold to 20 seconds. Then, only the small percentage of queries that take longer than 20 seconds is logged.

After you configure your cluster to push index slow logs to CloudWatch, set a threshold for slow log generation with the following HTTP PUT call:

curl -XPUT "https://<your domain's endpoint>/index/_settings" -H 'Content-Type: application/json' -d '{"index.indexing.slowlog.threshold.index.<level>":"10s"}'
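To verify the thresholds that are applied, you can read the index settings back. The flat_settings parameter returns the settings in dotted form, which makes the slow log keys easy to scan:

curl -XGET "https://<your domain's endpoint>/index/_settings?flat_settings=true"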

Write rejection best practices

Here are some best practices that mitigate write rejections:

  • Improve indexing speed. When documents are indexed faster, individual requests spend less time in the write queue and the queue is less likely to reach capacity.
  • Tune the bulk size according to your workload and desired performance. For more information, see Tune for indexing speed on the Elasticsearch website.
  • Add exponential retry logic to your application so that failed requests are automatically retried (see the sketch after this list).
    Note: If your cluster continuously experiences high concurrent requests, then exponential retry logic won't resolve the 429 error. Use this best practice when there is a sudden or occasional spike of traffic.
  • If you ingest data from Logstash, then tune the worker count and bulk size. It's a best practice to set your bulk size between 3 MB and 5 MB.
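The following is a minimal sketch of exponential retry logic in bash. It assumes a placeholder domain endpoint and a newline-delimited bulk payload in a file named payload.json; production code would typically also add jitter and handle non-429 failures.

#!/bin/bash
ENDPOINT="https://<your domain's endpoint>"   # placeholder endpoint
MAX_RETRIES=5

for attempt in $(seq 0 "$MAX_RETRIES"); do
  # Send the bulk request and capture only the HTTP status code.
  status=$(curl -s -o /dev/null -w "%{http_code}" \
    -XPOST "$ENDPOINT/_bulk" \
    -H 'Content-Type: application/x-ndjson' \
    --data-binary @payload.json)

  if [ "$status" != "429" ]; then
    echo "Request finished with HTTP status $status"
    break
  fi

  # Back off exponentially (1, 2, 4, 8, ... seconds) before retrying.
  echo "Received HTTP 429; retrying in $((2 ** attempt)) seconds..."
  sleep $((2 ** attempt))
done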

For more information about indexing performance tuning, see How can I improve the indexing performance on my OpenSearch Service cluster?

Search rejection best practices

Here are some best practices that mitigate search rejections:

  • Switch to a larger instance type. OpenSearch Service relies heavily on the filesystem cache for faster search results. The number of threads in the thread pool on each node for search requests is equal to the following: int((# of available_processors * 3) / 2) + 1. Switch to an instance with more vCPUs to get more threads to process search requests.
  • Turn on search slow logs for a given index or for all indices with a reasonable threshold value. Check to see which queries are taking longer to run and implement search performance strategies for your queries. For more information, see Troubleshooting Elasticsearch searches, for beginners or Advanced tuning: Finding and fixing slow Elasticsearch queries on the Elasticsearch website.