A Guide to Software Resilience Patterns

Achraf Bakkache

25 November, 2021 · 8 min 📖

Despite the enormous amount of time and effort we invest to manage risk when building our services, failures can and will occur.

Finding these points of failure isn't always easy. Luckily, applying resiliency patterns in our system can help us detect these failures and recover from them.

In today's post, we'll cover the best-known resiliency patterns and how they can minimize disruption to our services.


First, what is Resilience?

Resilience is a noun that refers to the ability to recover quickly from difficulties.

Are Resiliency and Resilience the same?

Resiliency is simply a variant of the noun Resilience!

Looking at the usage of the two nouns between 1800 and 2000, Resilience is the older and still the more widely used variant.

So to sum up, Resilience and Resiliency are the same!

What is Resilience in an IT Context?

System resilience is the ability of a system to react to failures and remain operational. It's not about avoiding failure, but about accepting failure and returning to a fully functioning state as quickly as possible.

Why bother with Resilience?

Why does nobody want their system to stay down for long?

Well, because: System failure => Loss of clients => Loss of money.

It's obvious, isn't it?

System Failures:

Failures vary from a partial loss of connectivity to the complete failure of a service. Here are the common failures our systems face:

  • Network latency (the time it takes for a request to go from the client to the server).
  • Transient faults (short-lived errors).
  • Blocking due to a long-running operation.
  • A system that is being moved or restarted.
  • An overloaded system that can't respond for the moment.
  • Hardware failures.
  • Timeouts.

Resiliency and short-lived failures:

Some of the failures mentioned above are permanent (hardware failures, for example), which means that applying resilience in these cases is very hard or nearly impossible.

Note that some failures are deterministic, meaning the same request will always fail, so it's useless to apply resiliency to them at all (for example 400, 401, and 403 errors).

However, because short-lived failures are temporary, they are easy to recover from.

So, in the majority of cases, we apply resilience to these types of failures.

Here is a list of some short-lived failures:

  • 404 Not Found
  • 408 Request timeout (when the server didn't receive a complete request from the client within the server's timeout)
  • 429 Too many requests (when you've most likely been throttled)
  • 502 Bad gateway (when the gateway/proxy receives a bad response from another server)
  • 503 Service unavailable (when the server is temporarily under maintenance or overloaded)
  • 504 Gateway timeout (when the gateway/proxy didn't receive a timely response from another server)

Resilience Patterns

Now that we've identified which failures to address, it's time to see how to apply resilience in practice. In fact, there are many techniques your service can leverage to handle the failures discussed earlier; these techniques are called Resilience Patterns.

We will now explore the following patterns:

  • Retry pattern
  • Circuit breaker pattern
  • Fallback pattern
  • Timeout pattern
  • Bulkhead pattern
  • Rollback pattern

Retry Pattern:

The Retry Pattern enables a service to retry a failed request multiple (configurable) times with an exponentially increasing waiting time (AKA backoff).

The first call fails and returns an HTTP status code that requires a retry (error 500 in this example). The application waits for 2 seconds and retries the invocation.

The second call also fails and returns an HTTP error status code. The application now doubles the waiting time interval to 4 seconds and retries the invocation.

Finally, the third call succeeds.
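To make this concrete, here is a minimal retry-with-exponential-backoff sketch in plain Java (no particular library; the attempt count and the 2-second base delay mirror the example above and are purely illustrative):

```java
import java.util.concurrent.Callable;

public final class Retry {

    // Calls the action up to maxAttempts times, doubling the wait between attempts.
    public static <T> T withBackoff(Callable<T> call, int maxAttempts, long baseDelayMs)
            throws Exception {
        Exception last = null;
        long delay = baseDelayMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();                  // success: return immediately
            } catch (Exception e) {                  // e.g. an HTTP 500 mapped to an exception
                last = e;
                if (attempt == maxAttempts) break;   // no attempts left
                Thread.sleep(delay);                 // wait before retrying
                delay *= 2;                          // exponential backoff: 2s, 4s, 8s...
            }
        }
        throw last;                                  // all attempts failed
    }
}
```

Usage would look like `Retry.withBackoff(() -> callRemoteService(), 3, 2000)`, where `callRemoteService` is a hypothetical remote call.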


Circuit Breaker Pattern:

In many cases, performing retries against a failed service can push you into a self-imposed denial-of-service (DoS) scenario, where you flood your service with continual calls, exhausting resources such as memory, threads, and database connections, and causing failures in unrelated parts of the system that use the same resources.

So, how do we fix this limitation of the Retry Pattern?

With the help of the Circuit Breaker Pattern, you prevent an application from repeatedly trying to execute an operation that's likely to fail.

The idea behind it is that after a pre-defined number of failed calls, it blocks all traffic to the service. Periodically, it will send a trial call to determine whether the failure has been resolved or not.
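Here is a minimal, hand-rolled circuit-breaker sketch in plain Java, just to illustrate the CLOSED / OPEN / HALF_OPEN state machine; the failure threshold and open duration are illustrative, and in practice you would use a library rather than rolling your own:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.Callable;

public final class CircuitBreaker {
    private enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final Duration openDuration;
    private State state = State.CLOSED;
    private int failures = 0;
    private Instant openedAt;

    public CircuitBreaker(int failureThreshold, Duration openDuration) {
        this.failureThreshold = failureThreshold;
        this.openDuration = openDuration;
    }

    public synchronized <T> T call(Callable<T> action) throws Exception {
        if (state == State.OPEN) {
            if (Instant.now().isAfter(openedAt.plus(openDuration))) {
                state = State.HALF_OPEN;            // allow one trial call
            } else {
                throw new IllegalStateException("Circuit is open, call rejected");
            }
        }
        try {
            T result = action.call();
            failures = 0;
            state = State.CLOSED;                   // trial (or normal) call succeeded
            return result;
        } catch (Exception e) {
            failures++;
            if (state == State.HALF_OPEN || failures >= failureThreshold) {
                state = State.OPEN;                 // block traffic again
                openedAt = Instant.now();
            }
            throw e;
        }
    }
}
```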


Fallback Pattern:

Also known as 'graceful degradation', the Fallback Pattern enables your service to continue execution when a request to another service fails. Instead of aborting the computation because of a missing response, we fill in a fallback value (note that a fallback value is not always possible).

Example (1): Fraud-check system:

Let's say a money-transfer service (A) calls a fraud-check service (B) before committing the transfer transaction. What if service (B) is exhausted or under maintenance? Why not tolerate committing transfers below a certain small amount (say, $50) without waiting for service (B) to recover?

Example (2): Payment system:

Let's say an e-commerce website offers two payment methods: PayPal & Stripe. What if the connection to the Stripe service goes down? Why not suggest the PayPal method to the users who initially chose Stripe, during this temporary failure period?
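As an illustration of example (1), here is a minimal fallback sketch in plain Java; the FraudCheckClient interface, the Transfer record, and the $50 threshold are hypothetical names taken from the example above, not a real API:

```java
import java.math.BigDecimal;

public final class TransferService {
    private static final BigDecimal SMALL_AMOUNT = new BigDecimal("50");

    private final FraudCheckClient fraudCheck;   // remote service (B)

    public TransferService(FraudCheckClient fraudCheck) {
        this.fraudCheck = fraudCheck;
    }

    public boolean isAllowed(Transfer transfer) {
        try {
            return fraudCheck.isLegit(transfer);             // normal path
        } catch (Exception serviceBUnavailable) {
            // Fallback: tolerate small transfers while the fraud check is down.
            return transfer.amount().compareTo(SMALL_AMOUNT) <= 0;
        }
    }

    // Hypothetical collaborators, just enough to make the sketch self-contained.
    public interface FraudCheckClient { boolean isLegit(Transfer transfer); }
    public record Transfer(BigDecimal amount) {}
}
```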


Timeout Pattern:

In a thread pool model, every request is handled by one thread and the total number of threads is limited.

So when calling a service that has a long response time, threads are blocked, which impacts the incoming requests.

By using the Timeout Pattern, you specify a maximum waiting time for a response. This lets you avoid unbounded waits and treat any request whose response doesn't arrive in time as failed.

The Timeout Pattern is pretty straightforward and the majority of HTTP clients have a default timeout configured.
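For illustration, here is a minimal sketch using the JDK's built-in HttpClient (Java 11+); the URL and the timeout values are illustrative:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

public final class TimeoutExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))        // max time to establish the connection
                .build();

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/orders"))
                .timeout(Duration.ofSeconds(3))               // max time to wait for the response
                .GET()
                .build();

        try {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
        } catch (HttpTimeoutException e) {
            // Treat the request as failed instead of blocking a thread indefinitely.
            System.err.println("Request timed out: " + e.getMessage());
        }
    }
}
```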


Bulkhead Pattern:

As we saw earlier, we mitigated the problem of thread saturation by specifying a timeout value.

But what if requests arrive faster than they time out?

For example: a new request arrives every second, while the timeout is set to 3 seconds.

Well, clearly the problem isn't fully fixed; we can still end up with thread saturation!

The solution here is the Bulkhead Pattern. The idea behind it is to divide the elements of the application into pools; each pool is isolated and has a limited number of threads.

This way, if one pool is saturated, the others continue to function: instead of saturating all of the application's threads, we saturate only a small, isolated subset.
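Here is a minimal bulkhead sketch in plain Java, using one bounded thread pool per downstream dependency (the pool sizes and service names are illustrative):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public final class Bulkheads {
    // One isolated, bounded pool per downstream dependency.
    private final ExecutorService paymentPool   = Executors.newFixedThreadPool(5);
    private final ExecutorService inventoryPool = Executors.newFixedThreadPool(10);

    public Future<String> callPaymentService(Callable<String> call) {
        return paymentPool.submit(call);      // a slow payment service blocks at most 5 threads
    }

    public Future<String> callInventoryService(Callable<String> call) {
        return inventoryPool.submit(call);    // inventory calls keep working independently
    }
}
```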


Rollback Pattern:

Not all failures call for a retry, a circuit breaker, or any of the resiliency patterns we discussed earlier. The reason is that it isn't always the external world that makes our software fail.

Sometimes the failure is caused by an internal bug, so we need to fix it ASAP and roll the system back to a stable state.


Is Resiliency easy to set up?

Dealing with resiliency is not as easy as it seems; in fact, there are two main challenges:

Timeout difficulty:

Choosing a timeout value can be a hard task: the timeout has to be high enough to cover slower responses BUT low enough to stop waiting for a response that is never going to arrive.

Retry difficulty:

While retrying against an idempotent system is safe (because repeating the same request has the same effect as sending it once), retries against a non-idempotent system could return a different response than the one expected the first time, or worse: the operation could be executed more than once.

For example, service (A) sends a request to service (B); service (B) handles the request successfully but fails to send the response. Service (A) thinks the first request wasn't received and re-sends it, which causes the transaction to be executed twice.
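A common way to make such retries safe is to attach a client-generated idempotency key to each request and have the receiving service remember the keys it has already processed. Here is a minimal sketch of that idea in plain Java (the in-memory map stands in for what would be a persistent, expiring store in a real system):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public final class TransferHandler {
    // Remembers the result of every idempotency key already processed.
    private final Map<String, String> processed = new ConcurrentHashMap<>();

    public String handle(String idempotencyKey, Runnable executeTransfer) {
        // computeIfAbsent runs the transfer only the first time a key is seen;
        // a retried request carrying the same key just gets the stored result back.
        return processed.computeIfAbsent(idempotencyKey, key -> {
            executeTransfer.run();
            return "transfer executed for key " + key;
        });
    }
}
```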


Is Resilience always good?

By applying resiliency, we hope to eventually succeed on a failed request, even at the cost of response time.
Well, this is not always good: for some use-cases, it's better to fail fast than to retry many times and hurt the overall response time of the application.

Scenario (1): when a service (A) accesses a remote service (B) to satisfy a minor business functionality, it's better to fail immediately, or at least after fewer retries with, of course, a shorter wait time.

Scenario (2): when a service (A) that has a retry policy calls another service (B) that also has a retry policy, this redundancy of retries can add long delays to the whole operation. In this case, it's better to configure the lower-level service to fail fast and report the reason for the failure back to the service that invoked it. The higher-level service can then handle the failure according to its own policy.


Resiliency Testing:

Testing for resilience isn't achieved with the usual tests (unit tests, integration tests, etc.).

Instead, we test how the system performs under failure conditions that occur temporarily.

We introduce chaos into the system to find weaknesses: for example, making dependent services unavailable, crashing processes, expiring certificates, etc.

This science of failure injection is called Chaos Engineering.

In 2011, Netflix released the first chaos tool: Chaos Monkey, which randomly terminates instances.

Later, greater apes were released: Chaos Kong switched off whole regions, while Chaos Gorilla contented itself with knocking over availability zones.

Before starting chaos testing, you should consider some guidelines:

  • Know the KPI metrics of your system's normal case.
  • Brainstorm a list of 'realistic' chaos.
  • Run the tests in a production environment (the goal of chaos engineering is discovering how the REAL system responds to the unexpected).
  • Minimize the risk (you should be able to shut the experiment down and limit its effects to a small area).
  • Automation (automate your chaos experiments to run continuously).


Note that chaos testing is different from load/stress testing: load tests are designed to check the maximum number of concurrent users and what load will break the system.


Resilience Implementations:

In the previous parts, we took a look at different Resilience Patterns. Now let's see how you can implement them.

Frameworks like Hystrix, Resilience4j, Failsafe & Sentinel support the majority of the above Resilience Design Patterns out of the box.

Hystrix (offered by Netflix) has been used in many applications but is no longer under active development.

Here is a feature-comparison table of these frameworks:

All four of the above frameworks are called directly from within the application source code. You can integrate them either by using annotations or by implementing interfaces.
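For example, here is a minimal sketch of decorating a remote call with a retry policy using Resilience4j (the backend name, attempt count, and wait duration are illustrative, and exact configuration options may vary between versions):

```java
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;

import java.time.Duration;
import java.util.function.Supplier;

public final class Resilience4jExample {
    public static void main(String[] args) {
        // Retry up to 3 times, waiting 2 seconds between attempts.
        RetryConfig config = RetryConfig.custom()
                .maxAttempts(3)
                .waitDuration(Duration.ofSeconds(2))
                .build();
        Retry retry = Retry.of("backendA", config);

        // Wrap the remote call; the decorated supplier applies the retry policy on each get().
        Supplier<String> decorated =
                Retry.decorateSupplier(retry, () -> callRemoteService());

        System.out.println(decorated.get());
    }

    // Hypothetical remote call standing in for a real HTTP client invocation.
    private static String callRemoteService() {
        return "response";
    }
}
```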


Is Resilience enough?

While building software, we aim for 3 main goals: security, performance and reliability.

We all are familiar with Security and Performance, but what is Reliability?

Reliability ensures that your system satisfies the commitments you make to your clients.

Reliable systems have two features:

They're resilient, recover from failures, and continue to work.

They're highly available (HA) and perform in a healthy state with no downtime.

So to make it simple, we can say that: Reliability = High Availability + Resiliency

This means that resiliency alone is not enough if we want to have a reliable system.

High availability can be achieved with the help of some patterns; here is a shortlist of the common HA patterns:

- Redundancy & Autoscaling.

- Throttling pattern (rate limiting).

- Async pattern (Queue-Based Load Leveling pattern & Priority Queue pattern).

- Cache-Aside pattern.

Wrap-up:

Failures will inevitably occur.

Resilience is a good recovery process.

Resilience isn't always the best choice (fallback, fast failure...)

Resilience isn't easy (retry and timeout difficulties)

Resilience isn't enough (need HA or Fault Tolerance)

Resilience needs testing & monitoring.

https://docs.microsoft.com/en-us/azure/architecture/framework/resiliency/

https://github.com/Netflix/Hystrix

https://resilience4j.readme.io/docs


I'm Achraf Bakkache, a Software Engineer passionate about backend development and mainly the Java ecosystem.
