Why You Should Care About Resilience


Photo by Anatoli Styf

Microservices are a reality. Big companies use them all the time to improve scalability and separate responsibilities. Most of the time, each service is even maintained by a different team. But resilience, the ability of a service to absorb the impact of abnormal conditions, becomes an issue when services are not prepared for it, and it can cause big problems when one of them is down or going through internal problems.

Imagine your team maintains an API with an extremely important flow that needs resources from other services, so it makes a request to fetch them. But what happens if the service responds with a 500 status and an error?

In this article, I’ll present some strategies to handle this kind of scenario and other resilience problems.

Retry

Connectivity problems with other services are quite common. For example, a service can refuse new connections when their number exceeds a certain limit, returning a temporary error while it works through the requests it already has. The retry pattern is something like: “Oh, it’s not working now, but I can try later, maybe it will work”.

Some ways to work with the retry pattern:
Cancel: If the error indicates the failure is not temporary, just raise an exception. Trying again probably won’t work.
Try again: If the error is rare, try again right away. It’s probably just a temporary error.
Try again with timeout: If you already tried again once and the error still seems to be temporary, wait for a timeout before trying again. These timeouts can grow incrementally or even exponentially. You can also set a maximum number of tries and raise an exception when it is reached.
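To make the three options above more concrete, here is a minimal sketch in Python. The names `TemporaryError` and `retry`, and the default values, are assumptions I made up for illustration. In a real project you would decide which errors count as temporary, for example by looking at the HTTP status code.

```python
import time


class TemporaryError(Exception):
    """Hypothetical error type for failures that are worth retrying."""


def retry(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on TemporaryError, wait and try again.

    The wait doubles after each failed attempt (exponential timeout).
    Any other exception is treated as permanent and propagates at once,
    which is the "cancel" option.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TemporaryError:
            if attempt == max_attempts:
                raise  # out of tries: give up and surface the error
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter is there only so the waiting can be replaced in tests. Many implementations also add a bit of randomness to the delay, so that many clients do not retry at the same instant.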

Circuit Breaker

Sometimes the service is returning an error because of a serious problem that will take a long time to fix. If you keep retrying an endpoint that does a lot of processing or touches critical parts of the system (like the database, threads, etc.) while it is failing, you can cause even bigger problems.

The circuit breaker pattern can be used when you know the error is not temporary. It stops all the requests to the problematic endpoint and only resumes them when the problem appears to be gone. This can prevent bigger problems with the service.

It works like a proxy that monitors failing requests and decides whether the next request can be made or not.

The implementation of the circuit breaker is like a state machine where the states are:
Closed: The request is made. If it returns an error, the circuit breaker increments a counter. After a certain number of errors, the state changes to open and a timer starts.
Open: The request is not made and an exception is raised. When the timer expires, the state changes to half-open. This timer exists to give the API team time to fix the problem.
Half-open: Some of the requests are made and some are not. If the requests that are made succeed, the circuit breaker considers the problem solved and returns to closed, resetting the counter. If they fail, it goes back to open and the timer starts again.
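The state machine above could be sketched roughly like this in Python. The names `CircuitBreaker` and `CircuitOpenError` and the default thresholds are my own assumptions. To keep it short, this version lets a single trial request through once the timer expires instead of a share of the requests, and it resets the counter on every success.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to make the request."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so tests can control time
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, operation):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open, request not made")
            self.state = "half-open"  # timer expired: allow a trial request
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # A failed trial request, or too many failures, opens the circuit.
            if self.state == "half-open" or self.failures >= self.max_failures:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        # Success: consider the problem solved and reset the counter.
        self.state = "closed"
        self.failures = 0
        return result
```

You would typically keep one breaker per remote endpoint and wrap each request in `breaker.call(...)`. This sketch is not thread-safe, so a real implementation would need a lock around the state changes.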

Rate Limiter

Imagine your API is receiving more requests than your infrastructure can handle. This can bring your API down completely.

The rate limiter pattern solves this by setting a limit on the number of requests it accepts. Requests above the limit are rejected, and whoever is calling your API has to try again later.
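As an illustration, here is a small fixed-window limiter in Python. This is only one of several possible strategies (token bucket and sliding window are common alternatives), and the `RateLimiter` name and its defaults are assumptions of mine.

```python
import time


class RateLimiter:
    """Fixed-window limiter: at most `limit` requests per `window` seconds."""

    def __init__(self, limit=100, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so tests can control time
        self.window_start = clock()
        self.count = 0

    def allow(self):
        """Return True if the request fits in the current window."""
        now = self.clock()
        if now - self.window_start >= self.window:
            # A new window begins: forget the old count.
            self.window_start = now
            self.count = 0
        if self.count >= self.limit:
            return False  # over the limit: the caller must try again later
        self.count += 1
        return True
```

In an HTTP API, a request for which `allow()` returns False is usually answered with status 429 (Too Many Requests).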

The difference between the rate limiter and the circuit breaker is this: the circuit breaker keeps a service you call from receiving more requests while it has a problem, and the rate limiter keeps your own service from receiving too many requests, whether there are errors or not.

Bulkhead

Your API has to handle external data from multiple services, but your infrastructure is naturally shared by all of those connections.

The bulkhead pattern avoids the situation where one failing service consumes all the resources of your connection pool. It isolates the resources and divides them among the services, so when one of them is down, the others are not affected.
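One simple way to sketch this idea in Python is to give each external service its own limited number of slots using a semaphore. The `Bulkhead` and `BulkheadFullError` names are hypothetical, and rejecting immediately when the slots are taken is a design choice for this example. You could also wait for a slot with a timeout.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when all slots reserved for a service are in use."""


class Bulkhead:
    """Limits how many calls to one service can run at the same time."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation):
        # Do not wait for a slot: fail fast so callers are not piled up.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot for this service")
        try:
            return operation()
        finally:
            self._slots.release()  # always give the slot back
```

With one `Bulkhead` per service, a slow or failing service can only hold its own slots, and calls to the other services keep going through theirs.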

Cache

Your API needs external data that is almost immutable. In other words, if you call the service twice, the data will be the same. So do you need to call it twice?

The cache pattern saves the response data in a temporary data store, so the next time you can retrieve it from the cache instead of calling the service. This saves infrastructure resources.

But what if the service changes the data? You can set a timeout to expire the cache, and when it expires, you call the service again and receive fresh data.
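A cache with an expiration timeout can be sketched in a few lines of Python. Here the temporary data store is just a dictionary in memory, and the `TTLCache` name and `get_or_fetch` method are assumptions for illustration. In production this role is often played by an external store shared between instances.

```python
import time


class TTLCache:
    """In-memory cache whose entries expire after `ttl` seconds."""

    def __init__(self, ttl=60.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable so tests can control time
        self._entries = {}  # key -> (expires_at, value)

    def get_or_fetch(self, key, fetch):
        """Return the cached value, calling fetch() only if it is missing or expired."""
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]  # still fresh: no request is made
        value = fetch()  # missing or expired: call the service
        self._entries[key] = (now + self.ttl, value)
        return value
```

The `fetch` argument stands for whatever function actually calls the external service.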

Is it enough?

Of course not. Software resilience is a much bigger topic than what was shown in this article, but writing about all of it would take me years.

Thank you for reading, and don’t stop learning!
