What is a Temporal Retry Policy?
A Retry Policy works in cooperation with timeouts to provide fine-grained control over how an execution recovers from failures.
A Retry Policy is a collection of attributes that instructs the Temporal Server how to retry a failure of a Workflow Execution or an Activity Task Execution. (Retry Policies do not apply to Workflow Task Executions, which always retry indefinitely.)
Try out the Activity retry simulator to visualize how a Retry Policy works.
Default behavior
- Workflow Execution: When a Workflow Execution is spawned, it is not associated with a default Retry Policy and thus does not retry by default. The intention is that a Workflow Definition should be written to never fail due to intermittent issues; an Activity is designed to handle such issues.
- Activity Execution: When an Activity Execution is spawned, it is associated with a default Retry Policy, and thus Activity Task Executions are retried by default. When an Activity Task Execution is retried, the Temporal Service places a new Activity Task into its respective Activity Task Queue, which results in a new Activity Task Execution.
Custom Retry Policy
To use a custom Retry Policy, provide it as an options parameter when starting a Workflow Execution or Activity Execution. Only certain scenarios merit starting a Workflow Execution with a custom Retry Policy, such as the following:
- A Temporal Cron Job or some other stateless, always-running Workflow Execution that can benefit from retries.
- A file-processing or media-encoding Workflow Execution that downloads files to a host.
Properties
Default values for Retry Policy
Initial Interval = 1 second
Backoff Coefficient = 2.0
Maximum Interval = 100 × Initial Interval
Maximum Attempts = ∞
Non-Retryable Errors = []
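The defaults listed above can be captured in a small data structure. The following is a minimal sketch only: `RetryPolicySketch` and its field names are hypothetical stand-ins for illustration, not a Temporal SDK class, and it assumes that an unset Maximum Interval resolves to 100 times whatever Initial Interval is configured.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import List, Optional


@dataclass
class RetryPolicySketch:
    """Hypothetical stand-in for a Retry Policy; not a Temporal SDK class."""

    initial_interval: timedelta = timedelta(seconds=1)
    backoff_coefficient: float = 2.0
    # None means "not set"; the effective value is then derived below.
    maximum_interval: Optional[timedelta] = None
    # 0 stands for unlimited attempts.
    maximum_attempts: int = 0
    non_retryable_error_types: List[str] = field(default_factory=list)

    def effective_maximum_interval(self) -> timedelta:
        """Return the explicit Maximum Interval, else 100 x Initial Interval."""
        if self.maximum_interval is not None:
            return self.maximum_interval
        return self.initial_interval * 100


default_policy = RetryPolicySketch()
custom_policy = RetryPolicySketch(initial_interval=timedelta(seconds=5))
```

With these assumptions, the default policy caps retry intervals at 100 seconds, while a policy that only changes the Initial Interval to 5 seconds caps them at 500 seconds.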
Initial Interval
- Description: Amount of time that must elapse before the first retry occurs.
- The default value is 1 second.
- Use case: This is used as the base interval time for the Backoff Coefficient to multiply against.
Backoff Coefficient
- Description: The value dictates how much the retry interval increases.
- The default value is 2.0.
- A backoff coefficient of 1.0 means that the retry interval always equals the Initial Interval.
- Use case: Use this attribute to increase the interval between retries. By having a backoff coefficient greater than 1.0, the first few retries happen relatively quickly to overcome intermittent failures, but subsequent retries happen farther and farther apart to account for longer outages. Use the Maximum Interval attribute to prevent the coefficient from increasing the retry interval too much.
Maximum Interval
- Description: Specifies the maximum interval between retries.
- The default value is 100 times the Initial Interval.
- Use case: This attribute is useful for Backoff Coefficients that are greater than 1.0 because it prevents the retry interval from growing infinitely.
Maximum Attempts
- Description: Specifies the maximum number of execution attempts that can be made in the presence of failures.
- The default is unlimited.
- If this limit is exceeded, the execution fails without retrying again. When this happens, an error is returned.
- Setting the value to 0 also means unlimited.
- Setting the value to 1 means a single execution attempt and no retries.
- Setting the value to a negative integer results in an error when the execution is invoked.
- Use case: Use this attribute to ensure that retries do not continue indefinitely. In most cases, we recommend using the Workflow Execution Timeout for Workflows or the Schedule-To-Close Timeout for Activities to limit the total duration of retries, rather than using this attribute.
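The Maximum Attempts rules above (0 means unlimited, 1 means no retries, negative values are rejected) can be sketched as a small check. The function name `may_retry` and the use of `ValueError` are illustrative assumptions; this models the documented semantics and is not how the Temporal Service implements them.

```python
def may_retry(attempts_made: int, maximum_attempts: int) -> bool:
    """Return True if another execution attempt is allowed after a failure.

    attempts_made counts every attempt so far, including the first one.
    """
    if maximum_attempts < 0:
        # Negative values are rejected rather than treated as unlimited.
        raise ValueError("maximum_attempts must not be negative")
    if maximum_attempts == 0:
        # 0 stands for unlimited attempts.
        return True
    return attempts_made < maximum_attempts
```

For example, with Maximum Attempts set to 1 the first failure is final, because the single permitted attempt has already been used.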
Non-Retryable Errors
- Description: Specifies errors that shouldn't be retried.
- Default is none.
- Errors are matched against the type field of the Application Failure.
- If one of those errors occurs, a retry does not occur.
- Use case: If you know of errors that should not trigger a retry, you can specify that, if they occur, the execution is not retried.
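As a rough illustration of the matching described above, the sketch below compares a failure's type field against the configured list. `ApplicationFailureSketch` and `is_retryable` are hypothetical names, the error type strings are made up, and exact string equality is an assumption about how the match is performed.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ApplicationFailureSketch:
    """Hypothetical stand-in for an Application Failure."""

    message: str
    type: str  # the field that Non-Retryable Errors are matched against


def is_retryable(
    failure: ApplicationFailureSketch,
    non_retryable_error_types: Sequence[str],
) -> bool:
    """A failure is retried unless its type appears in the configured list.

    Assumption: the match is an exact string comparison.
    """
    return failure.type not in non_retryable_error_types


# Made-up error type names, for illustration only.
non_retryable = ["InvalidCreditCard"]
declined = ApplicationFailureSketch("card rejected", type="InvalidCreditCard")
timeout = ApplicationFailureSketch("gateway timed out", type="GatewayTimeout")
```

Here `declined` would not be retried, while `timeout` would, and with the default empty list every failure remains retryable.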
Retry interval
The wait time before a retry is the retry interval. A retry interval is the smaller of two values:
- The Initial Interval multiplied by the Backoff Coefficient raised to the power of the number of retries already made, so that the first retry waits exactly the Initial Interval.
- The Maximum Interval.
Diagram that shows the retry interval and its formula
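The retry interval rule can be written out as a short function. This is a sketch under one stated assumption: the exponent counts retries already made, so the first retry waits exactly the Initial Interval. The function name and the use of plain seconds as floats are illustrative choices, not part of any Temporal API.

```python
def retry_interval(
    initial_interval: float,
    backoff_coefficient: float,
    maximum_interval: float,
    retries_so_far: int,
) -> float:
    """Return the wait in seconds before the next retry.

    Assumption: retries_so_far is 0 before the first retry.
    """
    grown = initial_interval * backoff_coefficient ** retries_so_far
    # The Maximum Interval caps the exponentially growing value.
    return min(grown, maximum_interval)


# Default policy: 1 second initial, coefficient 2.0, 100 second maximum.
default_schedule = [retry_interval(1.0, 2.0, 100.0, n) for n in range(9)]
# default_schedule == [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 100.0, 100.0]
```

With the default values the interval doubles until it would reach 128 seconds, at which point the Maximum Interval holds it at 100 seconds. A coefficient of 1.0 keeps every interval equal to the Initial Interval.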
Per-error Next Retry Delay
Sometimes an Activity or Workflow raises a specific error that needs a different retry interval from the one the Retry Policy would produce. To accomplish this, throw an Application Failure with the Next Retry Delay field set. This value overrides whatever retry interval the Retry Policy would otherwise compute. Retries are still bounded by the Retry Policy's Maximum Attempts and by the overall timeouts: for an Activity, the Schedule-To-Close Timeout applies; for a Workflow, the Workflow Execution Timeout applies.
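The interaction between a per-error delay and the Retry Policy can be sketched as follows. `plan_retry` is a hypothetical helper that models only the Maximum Attempts bound and the override of the interval; the overall timeouts are deliberately left out, and nothing here reflects the Temporal Service's actual implementation.

```python
from typing import Optional


def plan_retry(
    attempts_made: int,
    maximum_attempts: int,
    policy_interval: float,
    next_retry_delay: Optional[float] = None,
) -> Optional[float]:
    """Return the wait in seconds before the next retry, or None if no retry.

    policy_interval is the interval the Retry Policy would produce.
    next_retry_delay, when set on the failure, replaces that interval.
    """
    # Maximum Attempts still applies; 0 stands for unlimited.
    if maximum_attempts != 0 and attempts_made >= maximum_attempts:
        return None
    if next_retry_delay is not None:
        return next_retry_delay
    return policy_interval
```

For instance, `plan_retry(1, 3, 2.0, next_retry_delay=30.0)` yields a 30 second wait instead of 2 seconds, while `plan_retry(3, 3, 2.0, next_retry_delay=30.0)` yields no retry at all because the attempts are used up.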
Event History
There are some nuances to how Events are recorded to an Event History when a Retry Policy comes into play.
- For an Activity Execution, the ActivityTaskStarted Event will not show up in the Workflow Execution Event History until the Activity Execution has completed or failed (having exhausted all retries). This is to avoid filling the Event History with noise. Use the Describe API to get a pending Activity Execution's attempt count.
- For a Workflow Execution with a Retry Policy, if the Workflow Execution fails, the Workflow Execution will Continue-As-New and the associated Event is written to the Event History. The WorkflowExecutionContinuedAsNew Event has an "initiator" field whose value identifies the Retry Policy, and the Event also carries the new Run Id for the next retry attempt. The new Workflow Execution is created immediately, but its first Workflow Task is not scheduled until the backoff duration has elapsed. That duration is recorded as the firstWorkflowTaskBackoff field of the new run's WorkflowExecutionStartedEventAttributes event.