Worker performance
Note: All metric names in this article carry the “temporal_” prefix. The prefix is omitted here to keep the names short and readable.
Metrics
Performance tuning involves three important SDK metric groups:
- worker_task_slots_available gauges, tagged worker_type=WorkflowWorker and worker_type=ActivityWorker for Workflow Task and Activity Workers respectively. These gauges report how many executor “slots” are currently available (unoccupied) for each Worker type.
- workflow_task_schedule_to_start_latency and activity_schedule_to_start_latency timers for Workflow Tasks and Activities respectively. For more information about the schedule_to_start timeout and latency, see Schedule-To-Start Timeout.
- sticky_cache_size and workflow_active_thread_count report the size of the Workflow cache and the number of cached Workflow threads.
Note: To have access to all the metrics mentioned above in the JavaSDK, version ≥ 1.8.0 is required.
Configuration
The following options are defined on WorkerOptions and are applicable for each Worker separately:
- maxConcurrentWorkflowTaskExecutionSize and maxConcurrentActivityExecutionSize define the total number of available slots for that Worker.
- maxConcurrentWorkflowTaskPollers (JavaSDK: workflowPollThreadCount) and maxConcurrentActivityTaskPollers (JavaSDK: activityPollThreadCount) define the number of pollers that issue poll requests against the Workflow or Activity Task Queue and deliver the tasks to the executors.
The Workflow Cache is created once and shared between all Workers. It is designed to limit the amount of resources used by the cache for the whole host/process, so its options are defined on WorkerFactoryOptions in JavaSDK and in the worker package in GoSDK:
- WorkerFactoryOptions#workflowCacheSize (GoSDK: worker.SetStickyWorkflowCacheSize) defines the maximum number of cached Workflow Executions. Each cached Workflow contains at least one Workflow thread and its resources (memory, etc).
- maxWorkflowThreadCount defines the maximum number of Workflow threads.
These options limit the resource consumption of the in-memory Workflow cache. They are shared between all Workers because the cache affects host-wide resources, such as memory and the total number of threads, which should be limited per JVM.
Monitor Task Queue backlog metrics
A Task Queue is a lightweight, dynamically allocated queue. Worker Entities poll the queue for Tasks. The Temporal Service dynamically creates different Task Queue types including Activity Task Queues, Workflow Task Queues, and Nexus Task Queues. These Task Queue types route their Tasks to Workers for Task completion.
With an accurate estimate of backlog Tasks, you can determine the optimal number of Workers to deploy. Balance your Worker count with the number of Tasks to achieve the best performance. This approach minimizes Task backlog saturation and reduces idle Workers.
Task Queue metrics provide numerical insights into your Task Queue activity and backlog characteristics. Use these metrics to tune your production deployments. Evaluate your Worker loads and assess whether you need to scale up or reduce your Worker deployment.
Task Queue types
For each Task Queue name, Temporal creates separate queues for each Task Queue type, namely:
- Activity Task Queue: A queue that holds Activity Tasks. Activity Tasks represent units of work within a larger business process. Each Activity Task contains the context required by an Activity Definition. Workers poll Activity Tasks from the Activity Task Queue and use them to initiate Activity Executions.
- Workflow Task Queue: A queue that holds Workflow Tasks. Workflow Tasks contain the context needed by a Workflow Definition. Workers poll Workflow Tasks from the Workflow Task Queue and use them to initiate Workflow Executions.
- Nexus Task Queue (Public Preview): A queue that holds Nexus Tasks. Nexus Tasks are used for units of work passed between Namespaces. Workers configured with the Nexus Service poll the Nexus Task Queue for available tasks and handle them by initiating Workflow Executions in the target Namespace. Each Nexus Task contains the context required by the target Workflow Definition.
Each Task Queue type provides unaggregated metrics.
Task Queue metrics
The Temporal Service reports information separately for each Task Queue type (not aggregated). Use the following metrics to retrieve detailed information about Task Queue health and performance. Available metrics include:
- ApproximateBacklogCount
- ApproximateBacklogAge
- TasksAddRate and TasksDispatchRate
- BacklogIncreaseRate (derived from TasksAddRate and TasksDispatchRate)
ApproximateBacklogCount
Represents the approximate count of Tasks currently backlogged in this Task Queue. The number may include expired Tasks as well as active Tasks, but it will eventually converge to the correct count over time.
You can rely on this count when making scaling decisions.
Workflow Task Queue types provide partial information due to performance optimizations. Tasks sent to Sticky queues are not included in the returned values for this metric. Since Tasks remain valid for only a few seconds in Sticky Queues, this inaccuracy diminishes over time, especially as the backlog grows.
ApproximateBacklogAge
Returns the approximate age of the oldest Task in the backlog. The age is based on the creation time of the Task at the head of the queue.
You can rely on this value when making scaling decisions.
Workflow Task Queue types provide partial information due to performance optimizations. Tasks sent to Sticky queues are not included in the returned values. Since Tasks remain valid for only a few seconds in Sticky Queues, this inaccuracy diminishes over time, especially when the backlog is older than a few seconds.
TasksAddRate and TasksDispatchRate
Reports the approximate Tasks-per-second added to or dispatched from a Task Queue. This rate is averaged over the most recent 30-second time interval. The calculations include Tasks that were added to or dispatched from the backlog as well as Tasks that were immediately dispatched and bypassed the backlog (sync-matched).
The actual Task delivery count may be significantly higher than the number reported by these metrics:
- Eager dispatch refers to a Temporal feature where Activities can be requested by an SDK using one Workflow Task completion response. Tasks using Eager dispatch do not pass through Task Queues.
- A Sticky Task Queue is associated with a dedicated Worker instance. Tasks passed to Sticky Task Queues are not accounted for by these metrics. Normally, only the first Workflow Task of each Workflow is placed on a Workflow Task Queue. Subsequent Tasks are passed to the Sticky Task Queue for performance improvement.
BacklogIncreaseRate
Approximates the net Tasks per second added to the backlog, averaged over the most recent 30 seconds. This is calculated as:
TasksAddRate - TasksDispatchRate
- Positive values of X indicate the backlog is growing by about X Tasks per second.
- Negative values of X indicate the backlog is shrinking by about X Tasks per second.
While the individual add and dispatch rates may be inaccurate due to Eager dispatch and Sticky Task Queues, the BacklogIncreaseRate reliably reflects the rate at which the backlog is shrinking or growing for backlogs older than a few seconds.
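The subtraction above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function names are hypothetical, and the two input rates are assumed to have been read from the Task Queue metrics beforehand.

```python
def backlog_increase_rate(tasks_add_rate: float, tasks_dispatch_rate: float) -> float:
    """Net Tasks per second added to the backlog (TasksAddRate - TasksDispatchRate)."""
    return tasks_add_rate - tasks_dispatch_rate


def describe_backlog_trend(rate: float) -> str:
    """Turn the signed rate into a short statement (hypothetical helper)."""
    if rate > 0:
        return f"growing by about {rate:g} Tasks per second"
    if rate < 0:
        # Report the magnitude so the sentence reads naturally.
        return f"shrinking by about {-rate:g} Tasks per second"
    return "steady"
```

For example, an add rate of 12 Tasks per second against a dispatch rate of 9.5 gives a rate of 2.5, which the helper reports as a growing backlog.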
Fetch Worker metrics
The Temporal CLI helps you monitor and evaluate Worker performance. Issue the following command to display a list of active Workers that have recently polled a Task Queue:
temporal task-queue describe \
--task-queue YourTaskQueueName \
[additional options]
This command retrieves poller information, backlog statistics, and task reachability for Task types (available in Temporal Server v1.25.0, Temporal CLI 1.1 and later).
Task reachability status is experimental. Determining Task reachability incurs a non-trivial computing cost. This feature may significantly change or be removed in a future release.
Evaluate Worker availability and capacity issues
Each Temporal Server records the time of the last poll request from each poller.
This time is displayed in the temporal task-queue describe output.
- A LastAccessTime value exceeding one minute may indicate that the Worker fleet is at capacity or that Workers have shut down or been removed.
- Values between one and five minutes typically suggest the Worker fleet is at capacity. "At capacity" means that all Workflow and Activity slots are full.
- Values over five minutes since the last poll request usually suggest that Workers have shut down or been removed. Workers are removed if five minutes have passed since the last poll request.
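As a rough sketch, the thresholds above can be turned into a small classifier. The function name and the returned labels are hypothetical, and the sketch assumes the LastAccessTime value has already been parsed into a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone

AT_CAPACITY_AFTER = timedelta(minutes=1)   # threshold taken from the text
REMOVED_AFTER = timedelta(minutes=5)       # threshold taken from the text


def classify_poller(last_access_time: datetime, now: datetime) -> str:
    """Interpret how long ago a poller last issued a poll request."""
    age = now - last_access_time
    if age <= AT_CAPACITY_AFTER:
        return "polled recently"
    if age < REMOVED_AFTER:
        return "possibly at capacity"
    return "possibly shut down or removed"
```

The labels are hints, not diagnoses. Confirm them against worker_task_slots_available and host-level metrics before acting.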
Manage your Worker fleet
You can adjust the number of Workers to enhance Workflow Execution performance and manage your fleet size. For instance, a large backlog of Tasks with too few Workers will slow down Workflow Execution completions and decrease processing efficiency. Adding more Workers speeds up completions and improves throughput. An empty backlog may indicate low Worker utilization, allowing you to reduce your fleet and associated costs.
The metric values provided by temporal task-queue describe can help you manage your Worker fleet deployment:
- ApproximateBacklogAge shows how long Tasks have been waiting to be dispatched. If this time grows too long, more Workers can boost Workflow efficiency.
- Calculate the demand per Worker by dividing the number of backlogged Tasks (ApproximateBacklogCount) by the number of Workers. Determine whether your task processing rate is within an acceptable range for your needs using the per-Worker demand (how many Tasks each Worker has yet to process), the backlog consumption rate (TasksDispatchRate, the rate at which Workers are processing Tasks), and the dispatch latency (ApproximateBacklogAge, the time the oldest Task has been waiting to be assigned to a Worker).
- The backlog increase rate (BacklogIncreaseRate) shows the changing demand on your Workers over time. As this rate increases, you may need to add more Workers until demand and capacity are balanced. As it decreases, you may be able to reduce your Worker fleet.
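The per-Worker demand calculation can be sketched as follows. Both function names are hypothetical. The drain-time estimate is an extra illustration, not something the metrics report: it assumes the backlog keeps shrinking at the current BacklogIncreaseRate.

```python
from typing import Optional


def demand_per_worker(approximate_backlog_count: int, worker_count: int) -> float:
    """Backlogged Tasks each Worker has yet to process, on average."""
    if worker_count <= 0:
        raise ValueError("worker_count must be positive")
    return approximate_backlog_count / worker_count


def estimated_drain_seconds(
    approximate_backlog_count: int, backlog_increase_rate: float
) -> Optional[float]:
    """Seconds until the backlog empties, assuming the current rate holds.

    Returns None when the backlog is not shrinking (rate >= 0).
    """
    if backlog_increase_rate >= 0:
        return None
    return approximate_backlog_count / -backlog_increase_rate
```

With 1,200 backlogged Tasks and 8 Workers, each Worker has about 150 Tasks outstanding. If the backlog shrinks by 4 Tasks per second, it would take roughly 300 seconds to drain at that pace.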
Task Queue processing tuning
The following steps limit delays in Task Queue processing due to insufficient or unbalanced Workers.
Review these steps if you notice high schedule_to_start metrics.
The steps are arranged in the recommended order of execution.
Hosts and Resources provisioning
If the currently provisioned Worker hosts are fully utilized (near full CPU usage, high load average, etc), additional Worker hosts have to be provisioned to increase the capacity of the Worker pool.
It's possible to have too many Workers
Monitor the poll success (poll_success / poll_success_sync) and poll timeout (poll_timeouts) Server metric counters.
Poll Success Rate = (poll_success + poll_success_sync) / (poll_success + poll_success_sync + poll_timeouts)
For most systems with a steady load, Poll Success Rate should be >90%. For high-volume, low-latency workloads, try to target >95%.
If you see
- low Poll Success Rate, and
- low schedule_to_start_latency, and
- low Worker hosts resource utilization at the same time,
then you might have too many Workers; consider sizing down.
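The Poll Success Rate formula and the three-part check above can be sketched as follows. The function names are hypothetical, and the counter values are assumed to be deltas collected over the same time window. Only the 90% figure comes from the text, so the latency and utilization judgments are passed in as booleans rather than computed.

```python
def poll_success_rate(poll_success: int, poll_success_sync: int, poll_timeouts: int) -> float:
    """(poll_success + poll_success_sync) / (all polls, including timeouts)."""
    successes = poll_success + poll_success_sync
    total = successes + poll_timeouts
    if total == 0:
        raise ValueError("no polls recorded in this window")
    return successes / total


def might_have_too_many_workers(
    success_rate: float,
    schedule_to_start_latency_low: bool,
    host_utilization_low: bool,
    target_rate: float = 0.90,  # ">90%" guideline from the text
) -> bool:
    """True only when all three signals point the same way."""
    return (
        success_rate < target_rate
        and schedule_to_start_latency_low
        and host_utilization_low
    )
```

For instance, 800 asynchronous successes, 150 sync-matched successes, and 50 timeouts give a rate of 0.95, which meets the stricter target for high-volume workloads.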
Worker Executor Slots sizing
The main area to focus on when tuning is the number of Worker Executor Slots.
Increase the maximum number of working slots by adjusting maxConcurrentWorkflowTaskExecutionSize or maxConcurrentActivityExecutionSize if both of the following conditions are met:
- The Worker hosts are underutilized (no bottlenecks on CPU, load average, etc.).
- The worker_task_slots_available metric from the corresponding Worker type frequently shows a depleted number of available Worker slots.
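One way to make "frequently depleted" concrete is to look at a series of worker_task_slots_available samples. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the 20% cutoff for "frequently" is an arbitrary placeholder, not a value from the SDK or this article.

```python
from typing import Sequence


def depleted_fraction(available_slot_samples: Sequence[int]) -> float:
    """Fraction of gauge samples in which no slots were available."""
    if not available_slot_samples:
        raise ValueError("need at least one sample")
    depleted = sum(1 for available in available_slot_samples if available <= 0)
    return depleted / len(available_slot_samples)


def should_increase_slots(
    hosts_underutilized: bool,
    available_slot_samples: Sequence[int],
    frequent_fraction: float = 0.2,  # hypothetical cutoff for "frequently"
) -> bool:
    """Both conditions from the list above must hold."""
    return hosts_underutilized and depleted_fraction(available_slot_samples) >= frequent_fraction
```

Pick the cutoff and sampling window to match your own alerting conventions.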
Poller count
Adjustments to pollers are rarely needed and rarely make a difference. Please consider this step only after adjusting Worker slots in the previous step. The only scenario in which the pollers' adjustment makes sense is when there is a significant network latency between the Workers and Temporal Server.
If:
- the schedule_to_start metric is abnormally long, and
- the Worker hosts are underutilized (there are no bottlenecks on CPU, load average, etc), and
- the worker_task_slots_available metric from the corresponding Worker type shows that a significant percentage of Worker slots are available on a regular basis,
then consider increasing the number of pollers by adjusting maxConcurrentWorkflowTaskPollers or maxConcurrentActivityTaskPollers, depending on which type of schedule_to_start metric is elevated.
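The decision above, including the choice of which poller option to raise, can be sketched as a small function. The function name is hypothetical. Each input is a judgment you make from your own dashboards, because the article does not define numeric thresholds for "abnormally long" or "significant percentage".

```python
from typing import List


def poller_options_to_increase(
    workflow_schedule_to_start_elevated: bool,
    activity_schedule_to_start_elevated: bool,
    hosts_underutilized: bool,
    slots_mostly_available: bool,
) -> List[str]:
    """Return the WorkerOptions poller settings worth raising, if any."""
    # Pollers are only a candidate when hosts and slots both have headroom.
    if not (hosts_underutilized and slots_mostly_available):
        return []
    options = []
    if workflow_schedule_to_start_elevated:
        options.append("maxConcurrentWorkflowTaskPollers")
    if activity_schedule_to_start_elevated:
        options.append("maxConcurrentActivityTaskPollers")
    return options
```

An empty result means pollers are unlikely to be the bottleneck, which matches the guidance that poller adjustments are rarely needed.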
Rate Limiting
If, after adjusting the poller and executor counts as specified earlier, you still observe an elevated schedule_to_start, underutilized Worker hosts, or high worker_task_slots_available, you might want to check the following:
- If server-side rate limiting per Task Queue is set by WorkerOptions#maxTaskQueueActivitiesPerSecond, remove the limit or raise the value. (See Go and Java.)
- If Worker-side rate limiting per Worker is set by WorkerOptions#maxWorkerActivitiesPerSecond, remove the limit. (See Go, TypeScript, and Java.)
Workflow Cache Tuning
When the number of cached Workflow Executions reported by sticky_cache_size hits workflowCacheSize, or the number of their threads reported by the workflow_active_thread_count gauge hits maxWorkflowThreadCount, Workflow Executions start to get evicted from the cache.
An evicted Workflow Execution will need to be replayed when it gets any action that may advance it.
If
- the Workflow Cache limits described above are hit, and
- Worker hosts have enough free RAM and are not close to reasonable thread limits,
then the workflowCacheSize and maxWorkflowThreadCount limits may be increased to decrease the overall latency and cost of the replays in the system. If the opposite occurs, consider decreasing the limits.
In CoreSDK based SDKs, like TypeScript, this metric works differently and should be monitored and adjusted on a per Worker and Task Queue basis.
Invariants
These properties should always be true for a Worker's configuration.
These are applicable to JavaSDK only.
Perform this sanity check after the adjustments to Worker settings.
- workflowCacheSize should be ≤ maxWorkflowThreadCount. Each Workflow has at least one Workflow thread.
- maxConcurrentWorkflowTaskExecutionSize should be ≤ maxWorkflowThreadCount. Having more Worker slots than the Workflow cache size will lead to resource contention/stealing between executors and unpredictable delays. It's recommended that maxWorkflowThreadCount be at least 2x of maxConcurrentWorkflowTaskExecutionSize.
- maxConcurrentWorkflowTaskPollers should be significantly ≤ maxConcurrentWorkflowTaskExecutionSize. And maxConcurrentActivityTaskPollers should be significantly ≤ maxConcurrentActivityExecutionSize. The number of pollers should always be lower than the number of executors.
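The invariants above lend themselves to an automated sanity check. The sketch below does not use the Temporal SDK: WorkerSettings is a hypothetical plain data holder that mirrors the option names, and invariant_violations is a hypothetical helper. It only enforces the hard inequalities. The "significantly lower" and "at least 2x" guidance is left to your judgment.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WorkerSettings:
    """Hypothetical container mirroring the JavaSDK option names."""
    workflow_cache_size: int
    max_workflow_thread_count: int
    max_concurrent_workflow_task_execution_size: int
    max_concurrent_activity_execution_size: int
    max_concurrent_workflow_task_pollers: int
    max_concurrent_activity_task_pollers: int


def invariant_violations(s: WorkerSettings) -> List[str]:
    """Return a message for each invariant the settings break."""
    violations = []
    if s.workflow_cache_size > s.max_workflow_thread_count:
        violations.append("workflowCacheSize exceeds maxWorkflowThreadCount")
    if s.max_concurrent_workflow_task_execution_size > s.max_workflow_thread_count:
        violations.append(
            "maxConcurrentWorkflowTaskExecutionSize exceeds maxWorkflowThreadCount"
        )
    # Pollers should always be fewer than executors.
    if s.max_concurrent_workflow_task_pollers >= s.max_concurrent_workflow_task_execution_size:
        violations.append("Workflow Task pollers are not fewer than Workflow Task slots")
    if s.max_concurrent_activity_task_pollers >= s.max_concurrent_activity_execution_size:
        violations.append("Activity Task pollers are not fewer than Activity slots")
    return violations
```

Running a check like this after each tuning change keeps an adjustment to one option from silently breaking another.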
Drawbacks of putting just "large values everywhere"
As with any multithreaded system, setting values too high without monitoring the SDK and system metrics will lead to constant resource contention/stealing, which decreases the total throughput and increases the latency jitter of the system.
Related