Resilience4j is a fault tolerance library designed for Java8 and functional programming.
Involves retrying an operation or request that has failed due to transient faults or network issues. It's designed to enhance the robustness and reliability of the system by automatically attempting the operation again after a certain period and/or under certain conditions.
Config property | Default value | Description |
---|---|---|
maxAttempts | 3 |
The maximum number of attempts (including the initial call as the first attempt). |
waitDuration | 500 [ms] |
A fixed wait duration between retry attempts. |
intervalFunction | numOfAttempts -> waitDuration |
A function to modify the waiting interval after a failure. By default the wait duration remains constant. |
intervalBiFunction | (numOfAttempts, Either<throwable, result>) -> waitDuration |
A function to modify the waiting interval after a failure based on attempt number and a result or exception. |
retryOnResultPredicate | result -> false |
Configures a Predicate which evaluates if a result should be retried. The Predicate must return true , if the result should be retried, otherwise it must return false . |
retryExceptions | empty |
Configures a list of Throwable classes that are recorded as a failure and thus are retried. This parameter supports subtyping. |
ignoreExceptions | empty |
Configures a list of Throwable classes that are ignored and thus are not retried. This parameter supports subtyping. |
failAfterMaxAttempts | false |
A boolean to enable or disable throwing of MaxRetriesExceededException when the Retry has reached the configured maxAttempts , and the result is still not passing the retryOnResultPredicate |
From: Resilience4j Retry Docs
Important
In the Retry
configuration the intervalFunction
and intervalBiFunction
are
mutually exclusive.
If both are set it will throw a IllegalStateException
.
A configuration represents the retry mechanism's behavior - policies - and is used to create a Retry
instance, which is a
thread-safe instance that holds the decoration logic and defines the async and
sync contexts (i.e., state-holders),
among other funcionalities
(e.g., metrics, event-publishers).
A Retry
instance cannot be created without a custom or default RetryConfig.
The configuration is done using the RetryConfig
class. It uses the builder pattern to create a configuration object:
RetryConfig config = RetryConfig.custom()
.maxAttempts(2)
.waitDuration(Duration.ofMillis(1000))
.retryOnResult(response -> response.getStatus() == 500)
.retryOnException(e -> e instanceof WebServiceException)
.retryExceptions(IOException.class, TimeoutException.class)
.ignoreExceptions(BusinessException.class, OtherBusinessException.class)
.failAfterMaxAttempts(true)
.build();
Or using default configuration:
RetryConfig config = RetryConfig.ofDefaults();
Or by using a base configuration:
RetryConfig baseConfig = RetryConfig.custom()
.maxAttempts(2)
.waitDuration(Duration.ofMillis(1000))
.build();
RetryConfig config = RetryConfig.from(baseConfig);
Note
When using a base configuration, the new configuration will inherit the properties of the base configuration. Although, if the new configuration has the same property set, it will override the base configuration's property.
The registry is essentially a getOrCreate
factory method for Retry
instances.
Its function is to manage (i.e., perform CRUD operations) and to store:
- a collection of
Retry
instances using their designated names as unique identifiers; - a collection of
RetryConfig
instances using their designated names as unique identifiers.
To register a Retry
instance in the RetryRegistry
with a configuration use:
RetryRegistry registry = RetryRegistry.of(config);
Retry retry = registry.retry("name");
Or without a registry with:
Retry retry = Retry.of("name", config);
Important
A single Retry
instance can be used to decorate multiple decorators
because internally it creates a
new Retry
context
per subscription.
A decorated method with a Retry
instance can be in one of the following states:
Retry Execution Flow |
RETRY
: A retry attempt was triggered;SUCCESS
: The retry attempt was successful, and thus the next retry attempt is not triggered.IGNORED_ERROR
: The method execution failed, but the retry mechanism was not triggered, since the exception is not considered a failure (e.g., the exception is in theignoreExceptions
list);ERROR
: The method execution failed, and the retry mechanism was not triggered because configuration conditions were not met (e.g., the maximum number of attempts was reached).
Note
State IGNORED_ERROR
will propagate the exception to the caller,
while state ERROR
will also propagate the exception if the intervalBiFunction
returns a negative value.
A decorator is a high-order function that wraps a function and returns a new function with the same signature.
Available decorators:
- Supplier;
- Supplier<CompletionStage>;
- Runnable;
- Callable;
- Function
And checked variants provided by the library, which wrap unchecked exceptions that might be thrown:
CheckedSupplier
;CheckedRunnable
;CheckedFunction
Associated with a given high-order function, there is also the capability to:
-
recover
: Provides a function to handle exceptions or errors that might occur during the execution of the high-order function. This recovery mechanism allows the program to gracefully handle errors and continue execution;Retry retry = Retry.of("name", config); CheckedSupplier<String> retryableSupplier = Retry .decorateCheckedSupplier(retry, remoteService::message); Try<String> result = Try.of(retryableSupplier) .recover(throwable -> "Hello from recovery");
-
andThen
: This function enables chaining operations after the execution of the high-order function. It acts similar to a flatmap operation in functional programming, where the result of the first operation is passed as input to the next operation, allowing for sequential composition of functions and without multiple wrapping of the result.
An IntervalFunction
is used to calculate the wait duration between retry attempts,
and is called for every retry attempt.
A few examples:
-
Fixed wait interval
// using defaults IntervalFunction defaultWaitInterval = IntervalFunction.ofDefaults(); // or explicitly IntervalFunction fixedWaitInterval = IntervalFunction.of(Duration.ofSeconds(5));
-
Exponential backoff
// using defaults IntervalFunction intervalWithExponentialBackoff = IntervalFunction.ofExponentialBackoff(); // or explicitly (initial interval [ms], multiplier) IntervalFunction intervalWithExponentialBackoff = IntervalFunction.ofExponentialBackoff(100, 2);
-
Randomized
IntervalFunction randomWaitInterval = IntervalFunction.ofRandomized();
-
Custom
IntervalFunction customIntervalFunction = IntervalFunction.of(1000, nrOfAttempts -> nrOfAttempts + 1000);
An EventPublisher
is used
to register event listeners in both the underlying retry mechanism and the registry where the retry and retry config
instances are stored.
See states for the possible RetryEvent
types.
public void configureRetryEventListeners() {
Retry retry = Retry.of("name", config);
retry.getEventPublisher()
.onRetry(event -> logger.info("Event: " + event.getEventType()))
.onError(event -> logger.info("Error: " + event.getEventType()))
.onIgnoredError(event -> logger.info("Ignored error: " + event.getEventType()))
.onSuccess(event -> logger.info("Success: " + event.getEventType()));
}
void configureRegistryEventListeners() {
RetryRegistry registry = RetryRegistry.ofDefaults();
registry.getEventPublisher()
.onEntryAdded(entryAddedEvent -> {
Retry addedRetry = entryAddedEvent.getAddedEntry();
logger.info("Retry {} added", addedRetry.getName());
})
.onEntryRemoved(entryRemovedEvent -> {
Retry removedRetry = entryRemovedEvent.getRemovedEntry();
logger.info("Retry {} removed", removedRetry.getName());
});
}
The context is a state-holder that is used to manage the retry mechanism and advance the underlying state machine in both synchronous and asynchronous scenarios.
public interface Context<T> {
void onComplete();
boolean onResult(T result);
void onError(Exception exception) throws Exception;
void onRuntimeError(RuntimeException runtimeException);
}
onComplete
: This method is called when the operation under retry is completed. Depending on the outcome (success, failure, or reaching the maximum number of attempts), it updates counters and publishes retry events;onResult
: This method is called when a result is obtained from the operation. It evaluates the result against a predicate and decides whether to continue retrying or not based on the result and the number of attempts;onError
: This method is called when a checked exception occurs during the operation. It evaluates the exception against a predicate and decides whether to retry or propagate the exception further;onRuntimeError
: This method is called when a runtime exception occurs during the operation. It is similar toonError
, but specifically for runtime exceptions.
public interface AsyncContext<T> {
void onComplete();
long onError(Throwable throwable);
long onResult(T result);
}
Behaves similarly to the synchronous context in terms of logic,
but the methods return a long
value
that represents the number of milliseconds to wait before the next retry attempt or -1
if the retry should not be
attempted.
Important
In asynchronous contexts,
the responsibility for handling exceptions
thrown by the called method is typically delegated to the asynchronous framework or to the completion handlers
associated with asynchronous operations.
Unlike synchronous contexts,
where the calling thread is often responsible for handling checked and runtime exceptions (i.e., unchecked
exceptions),
in asynchronous programming models, the calling thread is generally not directly involved in exception handling.
Which is why the AsyncContext
does not differentiate between checked and runtime exceptions.
Since Kotlin is interoperable with Java, the configuration can be done in a similar way but with a more concise syntax which takes advantage of trailing lambdas.
val configName = "config"
val retryRegistry = RetryRegistry {
addRetryConfig(configName) {
maxAttempts(4)
waitDuration(Duration.ofMillis(10))
}
}
val retry = retryRegistry.retry("retry", configName)
As mentioned in the context section, this module provides decorators for both synchronous and asynchronous contexts.
runBlocking {
retry.executeSuspendFunction {
remoteService.suspendCall()
// ... suspend functions
}
// or retry.decorateSuspendFunction { ... } and call it later
}
retry.executeFunction {
remoteService.blockingCall()
}
// or retry.decorateFunction { ... } and call it later
In electronics, traditional circuit breaker is an automatically operated electrical switch designed to protect an electrical circuit from damage caused by excess current from an overload or short circuit. Its basic function is to interrupt current flow after a fault is detected. Similary, in resilience engineering, a circuit breaker is a design pattern that prevents an application from repeatedly trying to execute an operation that's likely to fail. Allowing it to continue (fail-fast) without waiting for the fault to be fixed or wasting CPU cycles while it determines that the fault is long-lasting. But unlike the electrical circuit breaker, which needs to be manually reset after a fault is fixed, the resilience circuit breaker can also detect whether the fault has been resolved. If the problem appears to have been fixed, the application is allowed to try to invoke the operation.
The circuit breaker, which acts like a proxy for the underlying operation, can be implemented as a state machine with the following states:
-
Closed
: The request from the application is routed to the operation. The proxy maintains a count of the number of recent failures, and if the call to the operation is unsuccessful, the proxy increments this count. If the number of recent failures exceeds a specified threshold within a given time period (assuming a time-based sliding window), the proxy is placed into theOpen
state. At this point the proxy starts a timeout timer, and when this timer expires the proxy is placed into theHalf-Open
state. The purpose of the timeout timer is to give the system time to fix the problem that caused the failure before allowing the application to try to perform the operation again. -
Open
: The request from the application fails immediately, and an exception is returned to the application. -
Half-Open
: A limited number of requests from the application are allowed to pass through and invoke the operation. If these requests are successful, it's assumed that the fault that was previously causing the failure has been fixed and the circuit breaker switches to theClosed
state (the failure counter is resetted). If any request fails, the circuit breaker assumes that the fault is still present so it reverts to theOpen
state and restarts the timeout timer to give the system a further period of time to recover from the failure.
Important
The Half-Open
state is useful to prevent a recovering service from suddenly being flooded with requests. As a service recovers, it might be able to support a limited volume of requests until the recovery is complete, but while recovery is in progress, a flood of work can cause the service to time out or fail again.
Circuit Breaker State Machine |
From: Microsoft Azure Docs
Config property | Default Value | Description |
---|---|---|
failureRateThreshold | 50 | Configures the failure rate threshold in percentage. When the failure rate is equal or greater than the threshold, the CircuitBreaker transitions to open and starts short-circuiting calls. |
slowCallRateThreshold | 100 | Configures a threshold in percentage. The CircuitBreaker considers a call as slow when the call duration is greater than slowCallDurationThreshold. When the percentage of slow calls is equal or greater than the threshold, the CircuitBreaker transitions to open and starts short-circuiting calls. |
slowCallDurationThreshold | 60000 [ms] | Configures the duration threshold above which calls are considered as slow and increase the rate of slow calls. |
permittedNumberOfCallsInHalfOpenState | 10 | Configures the number of permitted calls when the CircuitBreaker is half open. |
maxWaitDurationInHalfOpenState | 0 [ms] | Configures a maximum wait duration which controls the longest amount of time a CircuitBreaker could stay in Half Open state, before it switches to open. A value of 0 means Circuit Breaker would wait infinitely in HalfOpen State until all permitted calls have been completed. |
slidingWindowType | COUNT_BASED | Configures the type of the sliding window which is used to record the outcome of calls when the CircuitBreaker is closed. The sliding window can either be count-based or time-based. |
slidingWindowSize | 100 | Configures the size of the sliding window which is used to record the outcome of calls when the CircuitBreaker is closed. |
minimumNumberOfCalls | 100 | Configures the minimum number of calls which are required (per sliding window period) before the CircuitBreaker can calculate the error rate or slow call rate. |
waitDurationInOpenState | 60000 [ms] | The time that the CircuitBreaker should wait before transitioning from open to half-open. |
automaticTransitionFromOpenToHalfOpenEnabled | false | If set to true, it means that the CircuitBreaker will automatically transition from open to half-open state and no call is needed to trigger the transition. If set to false, the transition to half-open only happens if a call is made, even after waitDurationInOpenState is passed. |
recordExceptions | empty | A list of exceptions that are recorded as a failure and thus increase the failure rate. Any exception matching or inheriting from one of the list counts as a failure, unless explicitly ignored via ignoreExceptions. |
ignoreExceptions | empty | A list of exceptions that are ignored and neither count as a failure nor success. Any exception matching or inheriting from one of the list will not count as a failure nor success, even if the exception is part of recordExceptions. |
recordFailurePredicate | throwable -> true | A custom Predicate which evaluates if an exception should be recorded as a failure. The Predicate must return true if the exception should count as a failure. |
ignoreExceptionPredicate | throwable -> false | A custom Predicate which evaluates if an exception should be ignored and neither count as a failure nor success. The Predicate must return true if the exception should be ignored. |
From: Resilience4j Circuit Breaker Docs
Note
Resilience4j
also provides two more states: DISABLED
(stopping automatic state transition, metrics and event publishing)
and FORCED_OPEN
(same behavior as disabled state, but always returning an exception); as well as manual control
over the possible state transitions.
Note
Worth mentioning that Polly's circuit breaker
also allows for manual control over the circuit breaker's state.
They present an additional state, Isolated
, which can be used to prevent the circuit breaker from automatically transitioning, as it is manually held open (i.e., actions are blocked).
It is an implementation detail to consider.
The CircuitBreaker uses a sliding window to store and aggregate the outcome of calls.
There are two types of sliding windows: count-based
(aggregrates the outcome of the last N
calls) and time-based
(aggregrates the outcome of the calls of the last N
seconds).
In more detail:
-
Count Based
: The sliding window is implemented with a circular array ofN
measurements. If the count window size is10
, the circular array has always10
measurements. The sliding window incrementally updates a total aggregation. The total aggregation is updated when a new call outcome is recorded. When the oldest measurement is evicted, the measurement is subtracted from the total aggregation and the bucket is reset. (Subtract-on-Evict) -
Time Based
: The sliding window is implemented with a circular array ofN
partial aggregations (buckets). If the time window size is10
seconds, the circular array has always10
partial aggregations (buckets). Every bucket aggregates the outcome of all calls which happen in a certain epoch second. (Partial aggregation). The head bucket of the circular array stores the call outcomes of the current epoch second. The other partial aggregations store the call outcomes of the previous seconds. The sliding window does not store call outcomes (tuples) individually, but incrementally updates partial aggregations (bucket) and a total aggregation. The total aggregation is updated incrementally when a new call outcome is recorded. When the oldest bucket is evicted, the partial total aggregation of that bucket is subtracted from the total aggregation and the bucket is reset. (Subtract-on-Evict)
From Resilience4j Circuit Breaker Docs
Important
For each sliding window type future implementation, the time and space complexity of the sliding window should be documented.
Just like the Retry mechanism, the Circuit Breaker mechanism also provides:
- Registry for managing Circuit Breaker instances and configurations;
- Decorators for wrapping functions with the Circuit Breaker logic;
- Events for monitoring the Circuit Breaker's state transitions and outcomes;
- Kotlin Interop for accessing the Circuit Breaker mechanism in Kotlin that compiles to JVM bytecode.
Rate limiting restricts the number of requests a client can make to a service within a specified time frame. It aims to prevent abuse, ensure fair usage, protect the service from being overwhelmed and ensure that it remains responsive to legitimate users. Contrary to throttling, rate limiting is applied to the number of requests per time frame, while throttling is applied to the rate of requests.
Config property | Default value | Description |
---|---|---|
timeoutDuration | 5 [s] | The default wait time a thread waits for a permission. |
limitRefreshPeriod | 500 [ns] | The period of a limit refresh. After each period the rate limiter sets its permissions count back to the limitForPeriod value. |
limitForPeriod | 50 | The number of permissions available during one limit refresh period. |
drainPermissionsOnResult | Either -> false | Configures a Predicate which evaluates if a result of the underlying service should be used to drain permissions. |
Important
Both limitForPeriod
and limitRefreshPeriod
can be adjusted at runtime
using the changeLimitForPeriod
and changeLimitRefreshPeriod
methods respectively of the RateLimiter
instance.
From: Resilience4j Rate Limiter Docs
There are two implementations of the Rate Limiter mechanism:
AtomicRateLimiter
: A rate limiter that manages its state viaAtomicReference
. Represents the default implementation.SemaphoreBasedRateLimiter
: A rate limiter that usesSemaphores
and aScheduler
that will refresh permissions after eachlimitRefreshPeriod
.
When the rate is exceeded, the Rate Limiter can either:
Block
: The rate limiter blocks the request.Queue
: The rate limiter queues the request to be processed later.Combined
: A way to combine both blocking and queuing based on some custom policy.
Just like the Retry mechanism, the Rate Limiter mechanism also provides:
- Registry for managing Rate Limiter instances and configurations;
- Decorators for wrapping functions with the Rate Limiter logic;
- Events for monitoring the Rate Limiter's state transitions and outcomes;
- Kotlin Interop for accessing the Rate Limiter mechanism in Kotlin that compiles to JVM bytecode.
Resilience4j is compatible with Kotlin but only for the JVM environment. Some considerations for multiplatform design found are:
Concurrency
- Problem: The library uses
Java's concurrent package
for concurrency (e.g.,
AtomicInteger
,AtomicReference
,LongAdder
) - Potential solution: Use kotlinx-atomicfu for multiplatform compatibility.
- Problem: The library uses
Java's concurrent package
for concurrency (e.g.,
Duration
- Problem: The library uses Java's Duration to represent time intervals.
- Potential solution: Use
kotlinx-datetime
for multiplatform compatibility.
Delay
- Problem: The library uses Java's Thread.sleep for delay as the default delay provider.
- Potential solution: Use kotlinx-coroutines for delay and other asynchronous operations in a multiplatform environment.
Important
If Javascript target is required, a Kotlin Multiplatform implementation of the Retry mechanism cannot use synchronous context because of the single-threaded nature of JavaScript. Implementation should be done using asynchronou context only.
The library also provides several extensions for the asynchronous primitive Flow to work with all provided mechanisms. Such extensions are not terminal operators and can be chained with others.
val retry = Retry.ofDefaults()
val rateLimiter = RateLimiter.ofDefaults()
flowOf(1, 2, 3)
.rateLimiter(rateLimiter)
.map { it * 2 }
.retry(retry)
.collect { println(it) } // terminal operator