Skip to content

Latest commit

 

History

History
651 lines (532 loc) · 30.1 KB

README.md

File metadata and controls

651 lines (532 loc) · 30.1 KB

Resilience4j

Resilience4j is a fault tolerance library designed for Java8 and functional programming.

Table of Contents

  1. Retry
  2. Circuit Breaker
  3. Rate Limiter
  4. Kotlin Multiplatform Design
  5. Flow

Retry

Involves retrying an operation or request that has failed due to transient faults or network issues. It's designed to enhance the robustness and reliability of the system by automatically attempting the operation again after a certain period and/or under certain conditions.

RetryImpl

Configuration

Config property Default value Description
maxAttempts 3 The maximum number of attempts (including the initial call as the first attempt).
waitDuration 500 [ms] A fixed wait duration between retry attempts.
intervalFunction numOfAttempts -> waitDuration A function to modify the waiting interval after a failure. By default the wait duration remains constant.
intervalBiFunction (numOfAttempts, Either<throwable, result>) -> waitDuration A function to modify the waiting interval after a failure based on attempt number and a result or exception.
retryOnResultPredicate result -> false Configures a Predicate which evaluates if a result should be retried. The Predicate must return true, if the result should be retried, otherwise it must return false.
retryExceptions empty Configures a list of Throwable classes that are recorded as a failure and thus are retried. This parameter supports subtyping.
ignoreExceptions empty Configures a list of Throwable classes that are ignored and thus are not retried. This parameter supports subtyping.
failAfterMaxAttempts false A boolean to enable or disable throwing of MaxRetriesExceededException when the Retry has reached the configured maxAttempts, and the result is still not passing the retryOnResultPredicate

From: Resilience4j Retry Docs

Important

In the Retry configuration the intervalFunction and intervalBiFunction are mutually exclusive. If both are set it will throw a IllegalStateException.

A configuration represents the retry mechanism's behavior - policies - and is used to create a Retry instance, which is a thread-safe instance that holds the decoration logic and defines the async and sync contexts (i.e., state-holders), among other funcionalities (e.g., metrics, event-publishers). A Retry instance cannot be created without a custom or default RetryConfig.

The configuration is done using the RetryConfig class. It uses the builder pattern to create a configuration object:

RetryConfig config = RetryConfig.custom()
        .maxAttempts(2)
        .waitDuration(Duration.ofMillis(1000))
        .retryOnResult(response -> response.getStatus() == 500)
        .retryOnException(e -> e instanceof WebServiceException)
        .retryExceptions(IOException.class, TimeoutException.class)
        .ignoreExceptions(BusinessException.class, OtherBusinessException.class)
        .failAfterMaxAttempts(true)
        .build();

Or using default configuration:

RetryConfig config = RetryConfig.ofDefaults();

Or by using a base configuration:

RetryConfig baseConfig = RetryConfig.custom()
        .maxAttempts(2)
        .waitDuration(Duration.ofMillis(1000))
        .build();

RetryConfig config = RetryConfig.from(baseConfig);

Note

When using a base configuration, the new configuration will inherit the properties of the base configuration. Although, if the new configuration has the same property set, it will override the base configuration's property.

Registry

The registry is essentially a getOrCreate factory method for Retry instances. Its function is to manage (i.e., perform CRUD operations) and to store:

  • a collection of Retry instances using their designated names as unique identifiers;
  • a collection of RetryConfig instances using their designated names as unique identifiers.

To register a Retry instance in the RetryRegistry with a configuration use:

RetryRegistry registry = RetryRegistry.of(config);
Retry retry = registry.retry("name");

Or without a registry with:

Retry retry = Retry.of("name", config);

Important

A single Retry instance can be used to decorate multiple decorators because internally it creates a

new Retry context

per subscription.

States

A decorated method with a Retry instance can be in one of the following states:

Retry Execution Flow
Retry Execution Flow
  • RETRY: A retry attempt was triggered;
  • SUCCESS: The retry attempt was successful, and thus the next retry attempt is not triggered.
  • IGNORED_ERROR: The method execution failed, but the retry mechanism was not triggered, since the exception is not considered a failure (e.g., the exception is in the ignoreExceptions list);
  • ERROR: The method execution failed, and the retry mechanism was not triggered because configuration conditions were not met (e.g., the maximum number of attempts was reached).

Note

State IGNORED_ERROR will propagate the exception to the caller, while state ERROR will also propagate the exception if the intervalBiFunction returns a negative value.

Decorators

A decorator is a high-order function that wraps a function and returns a new function with the same signature.

Available decorators:

And checked variants provided by the library, which wrap unchecked exceptions that might be thrown:

  • CheckedSupplier;
  • CheckedRunnable;
  • CheckedFunction

Associated with a given high-order function, there is also the capability to:

  • recover: Provides a function to handle exceptions or errors that might occur during the execution of the high-order function. This recovery mechanism allows the program to gracefully handle errors and continue execution;

    Retry retry = Retry.of("name", config);
    CheckedSupplier<String> retryableSupplier = Retry
        .decorateCheckedSupplier(retry, remoteService::message);
    Try<String> result = Try.of(retryableSupplier)
        .recover(throwable -> "Hello from recovery");
  • andThen: This function enables chaining operations after the execution of the high-order function. It acts similar to a flatmap operation in functional programming, where the result of the first operation is passed as input to the next operation, allowing for sequential composition of functions and without multiple wrapping of the result.

Interval Functions

An IntervalFunction is used to calculate the wait duration between retry attempts, and is called for every retry attempt.

A few examples:

  1. Fixed wait interval

    // using defaults
    IntervalFunction defaultWaitInterval = IntervalFunction.ofDefaults();
    // or explicitly
    IntervalFunction fixedWaitInterval = IntervalFunction.of(Duration.ofSeconds(5));
  2. Exponential backoff

    // using defaults
    IntervalFunction intervalWithExponentialBackoff = IntervalFunction.ofExponentialBackoff();
    // or explicitly (initial interval [ms], multiplier)
    IntervalFunction intervalWithExponentialBackoff = IntervalFunction.ofExponentialBackoff(100, 2);
  3. Randomized

    IntervalFunction randomWaitInterval = IntervalFunction.ofRandomized();
  4. Custom

    IntervalFunction customIntervalFunction =
                IntervalFunction.of(1000, nrOfAttempts -> nrOfAttempts + 1000);

Events

An EventPublisher is used to register event listeners in both the underlying retry mechanism and the registry where the retry and retry config instances are stored.

Mechanism

See states for the possible RetryEvent types.

public void configureRetryEventListeners() {
    Retry retry = Retry.of("name", config);
    retry.getEventPublisher()
            .onRetry(event -> logger.info("Event: " + event.getEventType()))
            .onError(event -> logger.info("Error: " + event.getEventType()))
            .onIgnoredError(event -> logger.info("Ignored error: " + event.getEventType()))
            .onSuccess(event -> logger.info("Success: " + event.getEventType()));
}

Registry

void configureRegistryEventListeners() {
    RetryRegistry registry = RetryRegistry.ofDefaults();
    registry.getEventPublisher()
            .onEntryAdded(entryAddedEvent -> {
                Retry addedRetry = entryAddedEvent.getAddedEntry();
                logger.info("Retry {} added", addedRetry.getName());
            })
            .onEntryRemoved(entryRemovedEvent -> {
                Retry removedRetry = entryRemovedEvent.getRemovedEntry();
                logger.info("Retry {} removed", removedRetry.getName());
            });
}

Context

The context is a state-holder that is used to manage the retry mechanism and advance the underlying state machine in both synchronous and asynchronous scenarios.

Synchronous

public interface Context<T> {
    void onComplete();

    boolean onResult(T result);

    void onError(Exception exception) throws Exception;

    void onRuntimeError(RuntimeException runtimeException);
}
  • onComplete: This method is called when the operation under retry is completed. Depending on the outcome (success, failure, or reaching the maximum number of attempts), it updates counters and publishes retry events;
  • onResult: This method is called when a result is obtained from the operation. It evaluates the result against a predicate and decides whether to continue retrying or not based on the result and the number of attempts;
  • onError: This method is called when a checked exception occurs during the operation. It evaluates the exception against a predicate and decides whether to retry or propagate the exception further;
  • onRuntimeError: This method is called when a runtime exception occurs during the operation. It is similar to onError, but specifically for runtime exceptions.

Asynchronous

public interface AsyncContext<T> {
    void onComplete();

    long onError(Throwable throwable);

    long onResult(T result);
}

Behaves similarly to the synchronous context in terms of logic, but the methods return a long value that represents the number of milliseconds to wait before the next retry attempt or -1 if the retry should not be attempted.

Important

In asynchronous contexts, the responsibility for handling exceptions thrown by the called method is typically delegated to the asynchronous framework or to the completion handlers associated with asynchronous operations. Unlike synchronous contexts, where the calling thread is often responsible for handling checked and runtime exceptions (i.e., unchecked exceptions), in asynchronous programming models, the calling thread is generally not directly involved in exception handling. Which is why the AsyncContext does not differentiate between checked and runtime exceptions.

Kotlin Interop

Configuration

Since Kotlin is interoperable with Java, the configuration can be done in a similar way but with a more concise syntax which takes advantage of trailing lambdas.

val configName = "config"
val retryRegistry = RetryRegistry {
    addRetryConfig(configName) {
        maxAttempts(4)
        waitDuration(Duration.ofMillis(10))
    }
}
val retry = retryRegistry.retry("retry", configName)

Decorators

As mentioned in the context section, this module provides decorators for both synchronous and asynchronous contexts.

runBlocking {
    retry.executeSuspendFunction {
        remoteService.suspendCall()
        // ... suspend functions
    }
    // or retry.decorateSuspendFunction { ... } and call it later
}
retry.executeFunction {
    remoteService.blockingCall()
}
// or retry.decorateFunction { ... } and call it later

Circuit Breaker

In electronics, traditional circuit breaker is an automatically operated electrical switch designed to protect an electrical circuit from damage caused by excess current from an overload or short circuit. Its basic function is to interrupt current flow after a fault is detected. Similary, in resilience engineering, a circuit breaker is a design pattern that prevents an application from repeatedly trying to execute an operation that's likely to fail. Allowing it to continue (fail-fast) without waiting for the fault to be fixed or wasting CPU cycles while it determines that the fault is long-lasting. But unlike the electrical circuit breaker, which needs to be manually reset after a fault is fixed, the resilience circuit breaker can also detect whether the fault has been resolved. If the problem appears to have been fixed, the application is allowed to try to invoke the operation.

State Machine

The circuit breaker, which acts like a proxy for the underlying operation, can be implemented as a state machine with the following states:

  • Closed: The request from the application is routed to the operation. The proxy maintains a count of the number of recent failures, and if the call to the operation is unsuccessful, the proxy increments this count. If the number of recent failures exceeds a specified threshold within a given time period (assuming a time-based sliding window), the proxy is placed into the Open state. At this point the proxy starts a timeout timer, and when this timer expires the proxy is placed into the Half-Open state. The purpose of the timeout timer is to give the system time to fix the problem that caused the failure before allowing the application to try to perform the operation again.

  • Open: The request from the application fails immediately, and an exception is returned to the application.

  • Half-Open: A limited number of requests from the application are allowed to pass through and invoke the operation. If these requests are successful, it's assumed that the fault that was previously causing the failure has been fixed and the circuit breaker switches to the Closed state (the failure counter is resetted). If any request fails, the circuit breaker assumes that the fault is still present so it reverts to the Open state and restarts the timeout timer to give the system a further period of time to recover from the failure.

Important

The Half-Open state is useful to prevent a recovering service from suddenly being flooded with requests. As a service recovers, it might be able to support a limited volume of requests until the recovery is complete, but while recovery is in progress, a flood of work can cause the service to time out or fail again.

Circuit Breaker State Machine
Circuit Breaker State Machine

From: Microsoft Azure Docs

Configuration

Config property Default Value Description
failureRateThreshold 50 Configures the failure rate threshold in percentage. When the failure rate is equal or greater than the threshold, the CircuitBreaker transitions to open and starts short-circuiting calls.
slowCallRateThreshold 100 Configures a threshold in percentage. The CircuitBreaker considers a call as slow when the call duration is greater than slowCallDurationThreshold. When the percentage of slow calls is equal or greater than the threshold, the CircuitBreaker transitions to open and starts short-circuiting calls.
slowCallDurationThreshold 60000 [ms] Configures the duration threshold above which calls are considered as slow and increase the rate of slow calls.
permittedNumberOfCallsInHalfOpenState 10 Configures the number of permitted calls when the CircuitBreaker is half open.
maxWaitDurationInHalfOpenState 0 [ms] Configures a maximum wait duration which controls the longest amount of time a CircuitBreaker could stay in Half Open state, before it switches to open. A value of 0 means Circuit Breaker would wait infinitely in HalfOpen State until all permitted calls have been completed.
slidingWindowType COUNT_BASED Configures the type of the sliding window which is used to record the outcome of calls when the CircuitBreaker is closed. The sliding window can either be count-based or time-based.
slidingWindowSize 100 Configures the size of the sliding window which is used to record the outcome of calls when the CircuitBreaker is closed.
minimumNumberOfCalls 100 Configures the minimum number of calls which are required (per sliding window period) before the CircuitBreaker can calculate the error rate or slow call rate.
waitDurationInOpenState 60000 [ms] The time that the CircuitBreaker should wait before transitioning from open to half-open.
automaticTransitionFromOpenToHalfOpenEnabled false If set to true, it means that the CircuitBreaker will automatically transition from open to half-open state and no call is needed to trigger the transition. If set to false, the transition to half-open only happens if a call is made, even after waitDurationInOpenState is passed.
recordExceptions empty A list of exceptions that are recorded as a failure and thus increase the failure rate. Any exception matching or inheriting from one of the list counts as a failure, unless explicitly ignored via ignoreExceptions.
ignoreExceptions empty A list of exceptions that are ignored and neither count as a failure nor success. Any exception matching or inheriting from one of the list will not count as a failure nor success, even if the exception is part of recordExceptions.
recordFailurePredicate throwable -> true A custom Predicate which evaluates if an exception should be recorded as a failure. The Predicate must return true if the exception should count as a failure.
ignoreExceptionPredicate throwable -> false A custom Predicate which evaluates if an exception should be ignored and neither count as a failure nor success. The Predicate must return true if the exception should be ignored.

From: Resilience4j Circuit Breaker Docs

Note

Resilience4j also provides two more states: DISABLED (stopping automatic state transition, metrics and event publishing) and FORCED_OPEN (same behavior as disabled state, but always returning an exception); as well as manual control over the possible state transitions.

Note

Worth mentioning that Polly's circuit breaker also allows for manual control over the circuit breaker's state. They present an additional state, Isolated, which can be used to prevent the circuit breaker from automatically transitioning, as it is manually held open (i.e., actions are blocked). It is an implementation detail to consider.

Sliding Window

The CircuitBreaker uses a sliding window to store and aggregate the outcome of calls. There are two types of sliding windows: count-based (aggregrates the outcome of the last N calls) and time-based (aggregrates the outcome of the calls of the last N seconds).

In more detail:

  • Count Based: The sliding window is implemented with a circular array of N measurements. If the count window size is 10, the circular array has always 10 measurements. The sliding window incrementally updates a total aggregation. The total aggregation is updated when a new call outcome is recorded. When the oldest measurement is evicted, the measurement is subtracted from the total aggregation and the bucket is reset. (Subtract-on-Evict)

  • Time Based: The sliding window is implemented with a circular array of N partial aggregations (buckets). If the time window size is 10 seconds, the circular array has always 10 partial aggregations (buckets). Every bucket aggregates the outcome of all calls which happen in a certain epoch second. (Partial aggregation). The head bucket of the circular array stores the call outcomes of the current epoch second. The other partial aggregations store the call outcomes of the previous seconds. The sliding window does not store call outcomes (tuples) individually, but incrementally updates partial aggregations (bucket) and a total aggregation. The total aggregation is updated incrementally when a new call outcome is recorded. When the oldest bucket is evicted, the partial total aggregation of that bucket is subtracted from the total aggregation and the bucket is reset. (Subtract-on-Evict)

From Resilience4j Circuit Breaker Docs

Important

For each sliding window type future implementation, the time and space complexity of the sliding window should be documented.

Additional Details

Just like the Retry mechanism, the Circuit Breaker mechanism also provides:

  • Registry for managing Circuit Breaker instances and configurations;
  • Decorators for wrapping functions with the Circuit Breaker logic;
  • Events for monitoring the Circuit Breaker's state transitions and outcomes;
  • Kotlin Interop for accessing the Circuit Breaker mechanism in Kotlin that compiles to JVM bytecode.

Rate Limiter

Rate limiting restricts the number of requests a client can make to a service within a specified time frame. It aims to prevent abuse, ensure fair usage, protect the service from being overwhelmed and ensure that it remains responsive to legitimate users. Contrary to throttling, rate limiting is applied to the number of requests per time frame, while throttling is applied to the rate of requests.

Configuration

Config property Default value Description
timeoutDuration 5 [s] The default wait time a thread waits for a permission.
limitRefreshPeriod 500 [ns] The period of a limit refresh. After each period the rate limiter sets its permissions count back to the limitForPeriod value.
limitForPeriod 50 The number of permissions available during one limit refresh period.
drainPermissionsOnResult Either -> false Configures a Predicate which evaluates if a result of the underlying service should be used to drain permissions.

Important

Both limitForPeriod and limitRefreshPeriod can be adjusted at runtime using the changeLimitForPeriod and changeLimitRefreshPeriod methods respectively of the RateLimiter instance.

From: Resilience4j Rate Limiter Docs

Implementations

There are two implementations of the Rate Limiter mechanism:

  • AtomicRateLimiter: A rate limiter that manages its state via AtomicReference. Represents the default implementation.
  • SemaphoreBasedRateLimiter: A rate limiter that uses Semaphores and a Scheduler that will refresh permissions after each limitRefreshPeriod.

Approaches After Rate Exceeded

When the rate is exceeded, the Rate Limiter can either:

  • Block: The rate limiter blocks the request.
  • Queue: The rate limiter queues the request to be processed later.
  • Combined: A way to combine both blocking and queuing based on some custom policy.

Additional Details

Just like the Retry mechanism, the Rate Limiter mechanism also provides:

  • Registry for managing Rate Limiter instances and configurations;
  • Decorators for wrapping functions with the Rate Limiter logic;
  • Events for monitoring the Rate Limiter's state transitions and outcomes;
  • Kotlin Interop for accessing the Rate Limiter mechanism in Kotlin that compiles to JVM bytecode.

Kotlin Multiplatform Design

Resilience4j is compatible with Kotlin but only for the JVM environment. Some considerations for multiplatform design found are:

  1. Concurrency
    • Problem: The library uses Java's concurrent package for concurrency (e.g., AtomicInteger, AtomicReference, LongAdder)
    • Potential solution: Use kotlinx-atomicfu for multiplatform compatibility.
  2. Duration
    • Problem: The library uses Java's Duration to represent time intervals.
    • Potential solution: Use kotlinx-datetime for multiplatform compatibility.
  3. Delay
    • Problem: The library uses Java's Thread.sleep for delay as the default delay provider.
    • Potential solution: Use kotlinx-coroutines for delay and other asynchronous operations in a multiplatform environment.

Important

If Javascript target is required, a Kotlin Multiplatform implementation of the Retry mechanism cannot use synchronous context because of the single-threaded nature of JavaScript. Implementation should be done using asynchronou context only.

Flow

The library also provides several extensions for the asynchronous primitive Flow to work with all provided mechanisms. Such extensions are not terminal operators and can be chained with others.

val retry = Retry.ofDefaults()
val rateLimiter = RateLimiter.ofDefaults()

flowOf(1, 2, 3)
    .rateLimiter(rateLimiter)
    .map { it * 2 }
    .retry(retry)
    .collect { println(it) } // terminal operator