MCPA MuleSoft Certified Platform Architect Level 1 – Implementing Effective APIs Part 5

  1. Circuit Breakers

Hi. In this lecture, let us understand the circuit breaker design pattern. Like I mentioned before, the circuit breaker is a very popular pattern in the integration world. It was first brought to light by Michael T. Nygard in his book Release It!, a very popular book on how software should be designed, developed, and rolled out across the software development lifecycle.

This is just one of the stability patterns he provides, and it is the one I am going to explain in this lecture, because it comes in very handy in the API world as well. So far we have learned about timeouts: where to apply them, how to apply them, and how they help us.

Then we saw how we can improve the stability of our API even more by adding retries on top of timeouts, wherever possible or applicable. For example, on idempotent HTTP methods we can apply retries so that the API consumers or clients still get a successful response.

Now we can add the circuit breaker on top of the timeouts and retries to make the API even more robust and more responsive to its clients. So, let us understand this first. What you are seeing on the left is a typical API integration. Say there is a service A, or API A, calling service B, which in turn calls another service, C.

And finally that calls service D. If you want to compare it with our three-layer architecture, think of service A as an experience layer API, service B as a process layer API, C as a system layer API, and D as the end system, like an ERP. Okay? Now, with the patterns we have discussed so far, say we apply timeouts as well as retries.

What would happen in some scenarios? In this example I’m going to use the names service A, B, C, and D so that we don’t confuse the terms. Let’s say the back-end system, service D, is down or performing slowly. With what we have discussed so far, the most you have is timeouts, which you can see represented as a clock icon on the arrows.

So each of those services has a timeout and waits for the response, and each one keeps timing out. If they have retries on top, then they have to wait a little more. Like we discussed in the previous lecture, retries are a little bit costly, but with minimal intervals they may come in handy.

So with that assumption, let’s say each one keeps retrying two or three times. The same thing happens all the way back to service B and service A. Correct? So that means the error is still getting cascaded.

Although there are timeouts and retries, when the orchestration is big and there are a number of hops, say four levels of hops all the way to service D, then service C has to wait through its timeouts and retries and finally give an error response back to service B, which in turn gives an error back to service A after its own timeouts and retries. Only then does service A learn that, okay, the call was going to fail anyway. Correct?

This doesn’t look that clean, right? And second, this is not a drawback only for services A, B, and C. It is a drawback for service D as well. Why? Because the failure that service D is experiencing may be something it needs time to heal from. Maybe it needs some space or free time to recover. But what happens in the current scenario? Because of our retries, service C, although its initial request timed out, retries more than once to see if it is lucky enough to get a response. Upon that failure, service B retries again, and A retries again as well. And here we are talking about just one transaction, one instance of invocation. Let’s say there are many such services using service D.

All of them will keep retrying and bombarding it with requests. They are keeping their own systems safe by timing out, but the timeout is on the client side. From the client’s point of view they fail fast: they come to know that service D is still down, so they time out, keep their own systems healthy, and know what’s wrong. But the timeout is on the client side, not on the server side, meaning not on the service D side. Whatever request was spawned and triggered is still being served in service D. It might have been queued, or it might be in a thread pool waiting for a new thread, or whatever the reason, but it still reached service D. It just timed out on the client side.

This keeps the load running on service D and won’t give it time to breathe and recover from the error. For these reasons the circuit breaker comes into play, because it helps both the client and the provider, or server side. The timeouts and retries we have discussed so far only help the client or caller side. They are favorable to the caller but not to the server. The server is still getting the requests, even though they have timed out on the client side, and is not getting time to heal. The circuit breaker principle says: okay, I treat this whole call chain as a circuit. It’s typical electrical terminology. When the circuit is closed, all connections are fully connected from A to B to C, all the way to D.

So when we say the circuit is closed, the calls go through as usual, just like the normal calls on the left-hand side. Now, let’s say service D slowly starts failing. Then, based on some criteria which we are going to discuss soon, the circuit breaker will smartly understand that service D is going down and will open the circuit. It’s like a bridge between services A, B, C and service D, right? Opening the circuit means the switch comes up, so the connectivity between services A, B, C and D is lost. And it won’t just leave it like that.

It doesn’t just open the circuit and leave it at that. It also gives a proper response back saying, okay, service D is still down, so sorry, we can’t proceed with the request. So what will service C do? Once service C understands that the circuit is open, instead of cascading the failure all the way back, it constructs a proper response like we discussed before, some kind of error saying that D is down or temporarily not available, and returns it to B. Because this is not a failure but a proper error response, service B will not think that C is down. So B will pass that response along when it responds back to A. That’s why these are represented with green ticks. If you look at the left-hand side, there are red cross marks because those are failures. Like I explained in the previous lecture, a failure is different from an error response.

An error response going back is smoother. There are no system failures. It’s a proper response; instead of a successful response it is just an error response, like status 500 Internal Server Error, temporarily not available, something like that. A failure, on the other hand, is something that causes an error inside the system. On the left-hand side the failure is cascading, because from C’s perspective D is timed out or giving an error, from B’s perspective C is timed out or giving an error, and the same way all the way up. But in this case, C knows immediately that D is down because the circuit is open. It constructs a response message with the error details and passes that proper response message back. It’s smoother, so that A finally comes to know that, okay, I didn’t get the result I wanted, but nothing failed.

Now, how will the circuit breaker know when D is back up? When will it close the circuit again? Let’s see how the circuit breaker design pattern exactly works. By default, the circuit is always closed at the start. Once we deploy our implementation, the circuit starts fully closed. Think of this state as a small flag in a cache or a DB. The circuit breaker needs state; it is a stateful design component.

So it maintains a state in the cache saying that the circuit is currently closed, something like isCircuitClosed equals true, for example. Now the requests are coming in. As long as the target system is fine, the circuit is closed and they keep going forward happily. Now say errors start coming. What happens? The circuit breaker checks: the call failed, but is it under the threshold or not? What do we mean by threshold? Just because there is one error, a circuit breaker cannot open the circuit thinking the system is completely down. It has to observe the pattern or trend. For that reason we have to configure a reasonable threshold.

Say, for example, ten. Once a call fails, the circuit breaker checks: are there ten failures within, say, 30 seconds, or ten failures within 15 seconds? That is the threshold. If the failures are still within the limit, the circuit stays closed and the requests go to the target system. If it fails again and again, at some point the threshold will be reached; for example, it will have failed ten times in 15 seconds. When the failures cross the threshold, that is when the circuit breaker says, I’ll open the circuit, meaning the isCircuitClosed flag changes to false.
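
To make the threshold check concrete, here is a minimal Python sketch of "ten failures within 15 seconds". It is not MuleSoft code. The class name FailureWindow, its method names, and the idea of passing the current time in as a number of seconds are all assumptions made for illustration.

```python
from collections import deque


class FailureWindow:
    """Counts failures inside a sliding time window (hypothetical helper)."""

    def __init__(self, threshold=10, window_seconds=15.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._failures = deque()  # timestamps of recent failures, oldest first

    def record_failure(self, now):
        """Remember a failure that happened at time `now` (in seconds)."""
        self._failures.append(now)
        self._evict(now)

    def threshold_reached(self, now):
        """True once at least `threshold` failures fall inside the window."""
        self._evict(now)
        return len(self._failures) >= self.threshold

    def _evict(self, now):
        # Drop failures that are older than the window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()


window = FailureWindow(threshold=10, window_seconds=15.0)
for second in range(9):
    window.record_failure(now=float(second))
below = window.threshold_reached(now=8.0)      # nine failures so far
window.record_failure(now=9.0)
reached = window.threshold_reached(now=9.0)    # tenth failure inside 15 seconds
expired = window.threshold_reached(now=40.0)   # all failures have aged out
```

A single error never trips this check; only a run of failures inside the window does, which is the "observe the trend" idea described above.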

Okay? So it will be open. Now, how should the implementation be done? The code written in A, B, or C must not blindly make the call to its respective target system. Before calling, it first has to check the flag to see whether the circuit is open or closed. Only when the circuit is closed does it proceed with the logic to call the back-end system. If the circuit is open, it has to fall back to that temporary response, saying the system is down or something similar, and give that response back to the caller. So think of it like an if-else condition.

If isCircuitClosed is true, then the code goes ahead and makes the call to the target system every time. But if isCircuitClosed is false, then it has to fall back and give a static error response: system is temporarily not available, something like that. This way the calls won’t reach the target system, which gives it time to heal or recover, instead of bombarding it with requests, making it busier, and crashing it.
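
The if-else check can be sketched in a few lines of Python. This is only an illustration: the function name call_with_breaker, the dictionary used to hold the flag, and the 503 status in the static response are assumptions, not anything prescribed by the lecture or by Mule.

```python
def call_with_breaker(state, call_target):
    """Check the flag first; call the target only when the circuit is closed."""
    if state["is_circuit_closed"]:
        return call_target()
    # Circuit is open: do not touch the target, return a static error response.
    return {"status": 503, "message": "System temporarily not available"}


calls = []  # records every call that really reaches the target


def create_sales_order():
    # Hypothetical target call standing in for the back-end system.
    calls.append("erp")
    return {"status": 201, "message": "created"}


state = {"is_circuit_closed": True}
accepted = call_with_breaker(state, create_sales_order)

state["is_circuit_closed"] = False
rejected = call_with_breaker(state, create_sales_order)
```

While the flag is false, the target function is never invoked, which is what gives the back-end system room to recover.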

Okay, so this is a good pattern, but the next question is: when will isCircuitClosed turn back to true? When will the circuit close again? Because it’s now open, no requests are going to the target system. How will it know that it has to close the circuit again? For that reason, just as we set a threshold, something like ten failures in 15 seconds, in the same way there will be a timer running as well.

There should be a timeout timer running for this circuit. The timeout could be 1 minute, 10 minutes, or 15 minutes, depending on your configuration. Whatever the timeout is, the timer starts running the moment the circuit opens, meaning the moment isCircuitClosed becomes false. Once the timer reaches zero, for example once a 1-minute timer has finished, the circuit is partially closed, or closed in a limited way.

Why only partially? If it were fully closed right away, there is a chance that the huge load of waiting requesters would suddenly go and bombard the target system again before it has healed. Earlier we were bombarding it continuously; now we would be bombarding it again after a minute, and that would make the target system fail again. For that reason, the circuit is only partially closed, meaning it is throttled so that only, say, ten requests are allowed through, and then the circuit immediately opens again.

So partially closed means the flag is something like partial or limited instead of true or false, and there is again a counter which makes sure that at most ten requests are released to the target system before the circuit immediately opens again. It won’t allow more than that. If these ten requests are successful, or some of them are successful and the connection is fine, then that is an indicator to the circuit breaker that the target system is doing well. So it will fully close the circuit again: isCircuitClosed becomes true and all requests can go forward. But if those ten requests fail again, then that is an indication that the system is still down, so the circuit stays open and isCircuitClosed stays false. This way, by partially testing, we control when to send requests to the target system and when not to.
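
Putting the threshold, the timer, and the partially closed state together, the whole cycle might look like the Python sketch below. It is a simplified, single-threaded illustration and not a Mule implementation. The class, the state names, the injected clock, and the rule "close again if at least one trial request succeeded" are assumptions based on the description above. The partially closed state is often called half-open elsewhere.

```python
import time

FALLBACK = {"status": 503, "message": "System temporarily not available"}


class CircuitBreaker:
    """closed -> open -> partial (limited trial) -> closed, or open again."""

    def __init__(self, threshold=10, window=15.0, open_timeout=60.0,
                 trial_limit=10, clock=time.monotonic):
        self.threshold, self.window = threshold, window
        self.open_timeout, self.trial_limit = open_timeout, trial_limit
        self.clock = clock
        self.state = "closed"
        self.failures = []          # failure timestamps while closed
        self.opened_at = 0.0
        self.trials = 0
        self.trial_successes = 0

    def call(self, target):
        now = self.clock()
        if self.state == "open":
            if now - self.opened_at < self.open_timeout:
                return FALLBACK     # timer still running: do not call target
            # Timer finished: let a limited number of trial requests through.
            self.state, self.trials, self.trial_successes = "partial", 0, 0
        try:
            result = target()
        except Exception:
            self._record(now, success=False)
            return FALLBACK
        self._record(now, success=True)
        return result

    def _record(self, now, success):
        if self.state == "partial":
            self.trials += 1
            self.trial_successes += 1 if success else 0
            if self.trials >= self.trial_limit:
                if self.trial_successes > 0:
                    self.state, self.failures = "closed", []
                else:
                    self._open(now)     # still down: restart the timer
        elif not success:
            self.failures = [t for t in self.failures if now - t <= self.window]
            self.failures.append(now)
            if len(self.failures) >= self.threshold:
                self._open(now)

    def _open(self, now):
        self.state, self.opened_at = "open", now


# Usage with a fake clock and a fake ERP, using small limits for brevity.
now = [0.0]
erp_up = [False]
erp_hits = []


def call_erp():
    erp_hits.append(now[0])
    if not erp_up[0]:
        raise ConnectionError("ERP is down")
    return {"status": 200, "message": "ok"}


breaker = CircuitBreaker(threshold=3, open_timeout=60.0, trial_limit=2,
                         clock=lambda: now[0])
for _ in range(3):
    breaker.call(call_erp)
state_after_failures = breaker.state

now[0] = 30.0                       # timer not finished yet
blocked = breaker.call(call_erp)
hits_while_open = len(erp_hits)

now[0] = 61.0                       # timer finished, ERP has recovered
erp_up[0] = True
breaker.call(call_erp)
state_during_trial = breaker.state
breaker.call(call_erp)
state_after_trial = breaker.state
```

The clock is passed in so that the timer can be exercised without real waiting. A concurrent implementation would also need to cap how many trial requests are in flight at once, which this sketch does not attempt.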

This way we are not unnecessarily cascading the errors all the way back through all the systems; we are stopping them right where they start. This is the circuit breaker pattern and how it helps in combination with retries and timeouts. Now let us see how this can be implemented in MuleSoft, in Mule applications. Because the circuit breaker is a stateful component, we have to see what the different ways are to bring that state into Mule applications and how each one works.

There are basically three ways. There is an easy way to implement this, which works perfectly well. There is a moderate way, which is an enhancement on top of the easy way to make it a bit more mature and more effective. And there is an extreme way, which is the ultimate implementation of the circuit breaker at the entire CloudHub level, common to all the APIs. Let us see them one by one.

The first one is the easy way. Here what you are seeing is two Mule applications, App 1 and App 2, running on CloudHub. Each of these applications is highly available by running on two workers. For example’s sake I have given each app two workers, but there can be more than two; let’s call them multi-worker apps. All these apps or APIs interact with a common ERP. The functionality may be different. For example, App 1 might be calling the Create Sales Order API or functionality in the ERP.

App 2 might be calling the Update Sales Order API or functionality in the ERP. But at the end of the day they talk to the same ERP. Now, to implement the circuit breaker which we discussed in the previous slide, we need state, correct? State in the sense that it needs to maintain the flag saying whether the circuit is open or closed. It needs to maintain the threshold, to keep checking whether the threshold has been breached or not in order to open or close the circuit.

It also needs the timer and the counter, to eventually test through the partially closed circuit whether to close it fully or not. So this state has to be maintained somewhere. The easiest way, without making it complicated, is for each worker to maintain its own state. It could be in a cache scope, it could be in an object store (but not a cluster-aware one), or it could be a shared variable.

Whatever it is, it is maintained at the individual worker level. The advantage of this is that it is the easiest and fastest implementation, and less testing is required because it is straightforward. But the con is that the workers will not all come to know at the same time whether the end system, the ERP, is up or down. The circuit may be opened or closed according to the invocation pattern on each worker. For example, if worker 1 is getting a lot of requests, worker 1 will find out quickly that the ERP is down and will open its circuit first.

But worker 2 in App 1 and the workers in App 2 may still have their circuits closed, because no requests have come to them. Whenever requests do come and hit worker 2 in App 1 or the workers in App 2, they will come to know that, oh okay, it is down, and they will open the circuits on their side. Because they are maintaining state individually, they are not aware of each other. Everyone has to start fresh, everyone has its own timer, and everyone eventually opens on its own. So this still lets some requests go and hit the ERP, although it is not bombarded. Because they maintain state separately, they don’t collectively know when the ERP is up or down, but this is the easiest to implement.

So if your scenario is a simple one where you don’t expect much traffic and you don’t want to invest much time, then you can go with the easy approach. The next one is the moderate way. In the moderate way, without much complication, you can achieve a worker-aware approach where, at least within each individual CloudHub app, the workers are cluster aware. They share the same circuit breaker state. It may not be aware at the application level, meaning multiple Mule applications will not share the circuit breaker state with each other, but within each application, no matter how many workers it has, all the workers share the same state.

That can be achieved with Object Store. If you use the default Object Store, MuleSoft offers a cluster-aware feature in it. Using that, you can implement this so that all the workers in a particular application are cluster aware, and at least for a given app they know at the same time whether the ERP is up or down. This way the number of separately open or closed circuits is smaller than in the previous approach, because the state is at the app level instead of the worker level. This is the moderate approach, and you can achieve it using Object Store. There are many ways you can come up with, but the easiest way out of the box is the Object Store.

The last one is the extreme way, where it works across CloudHub, the complete CloudHub. All the applications running on CloudHub, and all the workers among those applications, irrespective of how many they have, share the same state at the same time. How is this possible? There are multiple ways. Instead of going with Object Store or a custom cache, you can go with a third-party caching tool like a Redis cache or ElastiCache, or you can even go with a database, like a NoSQL one, which is faster.

If you go with any of these, you can move your state into that third-party tool so that all these apps get the state from there and update the state there, and so they are all aware at the same time. Or, if you cannot invest money in a third-party caching tool, or you don’t want to go to a database because of I/O or performance reasons, there is still another way within Mule to achieve this. You can have one more app written specifically for the circuit breaker. You can call that application the Circuit Breaker app or Circuit Breaker utility. It is an API itself and offers resources like a get-circuit-open-or-closed API (isCircuitOpen), and similarly a set threshold API, a get threshold API, a set timer API, and a get timer API.

Like that, there can be multiple resources inside this circuit breaker API. It is an API itself, and it can also be scaled across multiple workers. It shares the state using Object Store within that application, because Object Store has this feature, right? Like I said before for the moderate approach, within an application Object Store is cluster or worker aware. So with this implementation, with the circuit breaker as an API itself, all those other apps talk to this API to get the state, instead of talking to a DB or a third-party caching tool.

And accordingly, they decide whether to call the ERP or not. One thing I want to highlight in the picture of the extreme way: the arrow between App X, which is the circuit breaker, and the ERP does not mean that the call is routed through the circuit breaker application. It doesn’t mean that. It’s just that the state comes from the circuit breaker app, App X, and then App 1 or App 2 decides whether to call the ERP or not.
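
The practical difference between per-worker state and shared state can be shown with a small Python simulation. It is only a sketch: StateStore is a hypothetical stand-in for wherever the state lives (a per-worker variable, Object Store, a third-party cache, or a circuit breaker API), and the failure count here is a plain counter without a time window.

```python
class StateStore:
    """Hypothetical stand-in for wherever the circuit state is kept."""

    def __init__(self):
        self.is_circuit_closed = True
        self.failures = 0


class Worker:
    def __init__(self, store, threshold=3):
        self.store = store
        self.threshold = threshold

    def call_erp(self, erp):
        # The worker only reads the state; it calls the ERP directly itself.
        if not self.store.is_circuit_closed:
            return "fallback"
        try:
            return erp()
        except ConnectionError:
            self.store.failures += 1
            if self.store.failures >= self.threshold:
                self.store.is_circuit_closed = False
            return "fallback"


def erp_hits(shared_state):
    """Count how many requests reach a failing ERP from two workers."""
    hits = []

    def erp():
        hits.append("hit")
        raise ConnectionError("ERP is down")

    store = StateStore()
    worker_1 = Worker(store)
    worker_2 = Worker(store if shared_state else StateStore())
    for _ in range(3):
        worker_1.call_erp(erp)      # worker 1 opens its circuit
    worker_2.call_erp(erp)          # does worker 2 already know?
    return len(hits)


per_worker_hits = erp_hits(shared_state=False)
shared_hits = erp_hits(shared_state=True)
```

With separate stores, worker 2 still sends its request to the failing ERP; with a shared store it falls back immediately. In both cases the worker calls the ERP itself, and only the state comes from the store.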

So there are multiple ways you can come up with: worker level, application level, or CloudHub level. All right? So this is how you can enhance your fault-tolerant system: have timeouts, then retries on top of them, and attach the circuit breaker to round it off. Let us move on to the next lecture in the course. Happy learning.

  1. Fallback APIs

In this lecture, let us see the next approach for fault-tolerant API implementation, which is invoking a fallback API. Even after all the things we have done so far, like implementing timeouts to make the API safer, then retries on top of them, and then the circuit breaker, it is possible that an API invocation still fails after a reasonable number of retries, right? In that case, if the API is so critical that it must return a result, and it is important for the clients to get some kind of response, then we can invoke a different API as a fallback API.

Okay? So after it performs all the retries and so on, if it still doesn’t work and we must return a response for that particular API because of its nature, then there may be some fallback APIs which we can call after all that retrying. For example, let’s take the Create Sales Order API that we have discussed so far. In the Create Sales Order API, say we have one particular API in the process layer orchestration, which is Validate External IDs. Correct? What does this Validate External IDs API do? It validates the customer ID, the shipment location ID, and the item IDs.

It does this by interacting with the XREF system that the organization has. What if that XREF system has gone slow, and yet it is compulsory that the API still validates those IDs in order to accept the order? The order is revenue for the company, and failing to create orders would impact their business revenue. Correct? They would be losing business. So the implementation can be enhanced with a fallback that calls another API which can still validate these three elements in the request and respond with whether they are valid or not, so that the control can go on and create the sales order successfully, or at least accept the order. Correct? So how can one know which particular APIs can be used as fallbacks? Point number one: it is up to your organization.

You know your company best, right? During your design phase or your analysis phase you may find that there are two potential places where this can be validated. For example, currently they have the XREF system, which is the best place to validate these things. Say they also have a warehouse management system, or their ERP itself may have the information related to these things: inventory, customers, and shipment locations. Maybe those are stored in the ERP or WMS as well. They may also have some APIs, but the company has not opted for those APIs currently because they are slow or costly.

Whereas the XREF system is the faster one because it sits on, say, Cosmos DB or DynamoDB, which is very fast, with millisecond response times, right? So this is where you weigh the pros and cons. Now that the XREF system is failing anyway, for whatever reason, even though it is a little bit costly to interact with the ERP or WMS, you can have that as a fallback so that you can still get the external IDs validated. So that could be one fallback API.

Or the fallback API can still be the XREF system API itself, but one that is deployed in your DR site, your disaster recovery site, or in a different availability zone, which you were planning to access only during a DR event. Even if it is not a full disaster recovery situation, you may still want to leverage that API for order acceptance. Or it can be some other API which has nothing to do with the XREF system, the ERP, or the disaster recovery site. It exists for a different purpose altogether, but maybe internally it validates these IDs and does something extra as well. Or it does less than this API is supposed to do. Maybe it only validates items, but not the shipment and customer IDs.

So there might be a separate API to validate the customer ID. You can chain two or three fallbacks, even though it’s costly. If it is important for your API, you can call the different individual ones, like validate customer, validate item, and so on, and still accept the order. The only thing is that your response times will be a bit slower, which may impact your SLA. So you have to weigh these things while designing the API and choose what to do. So that is what fallback APIs are about.
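
As a rough illustration of trying fallbacks in order, here is a small Python sketch. Every name in it is hypothetical: validate_with_fallbacks, the two validator functions, and the shape of the order are invented for this example and do not correspond to real APIs.

```python
def validate_with_fallbacks(order, validators):
    """Try each validator in order and return the first answer obtained."""
    failures = []
    for name, validate in validators:
        try:
            return {"valid": validate(order), "validated_by": name,
                    "failures": failures}
        except Exception as exc:
            # This source is unavailable: note it and try the next one.
            failures.append(f"{name}: {exc}")
    return {"valid": None, "validated_by": None, "failures": failures}


def xref_validate(order):
    # Stand-in for the primary, fast validation source, currently failing.
    raise TimeoutError("XREF system timed out")


def erp_validate(order):
    # Stand-in for the slower, costlier fallback source.
    known_customers = {"C-100"}
    return order["customer_id"] in known_customers


order = {"customer_id": "C-100", "item_ids": ["I-1"], "ship_to_id": "S-9"}
result = validate_with_fallbacks(
    order, [("xref", xref_validate), ("erp", erp_validate)]
)
```

The order of the list encodes the trade-off discussed above: the fastest source first, and the slower or costlier ones only when it fails.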

And if you are implementing your APIs as Mule applications in Anypoint Studio, then once again Mule supports this by using the Until Successful scope in combination with the exception strategy. We already discussed that the Until Successful scope helps to implement the retries, correct? Now, if you add an exception strategy to that Until Successful scope, then once all the retries are exhausted you have the option to specify what you want the code to do. If you put the logic to call your fallback API in that place, it will do the job that was supposed to be done by the primary API and return the response back. All right, let us move on to the next lecture in the course. Happy learning.
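
The retry-then-fallback flow described for the Until Successful scope can be mimicked in plain Python. This is an analogy, not Mule configuration: the function until_successful, its parameters, and the choice of one initial attempt plus max_retries retries with no waiting in between are assumptions of this sketch.

```python
def until_successful(operation, max_retries, on_exhausted):
    """Run `operation`; retry on failure; call `on_exhausted` if all fail."""
    for _ in range(1 + max_retries):    # one initial attempt plus the retries
        try:
            return operation()
        except Exception:
            pass                        # swallow the error and try again
    # All attempts failed: this is where the fallback logic goes.
    return on_exhausted()


attempts = []


def primary_validation():
    # Hypothetical primary API call that keeps timing out.
    attempts.append("try")
    raise TimeoutError("primary API timed out")


def fallback_validation():
    # Hypothetical fallback API call.
    return {"valid": True, "source": "fallback"}


response = until_successful(primary_validation, max_retries=2,
                            on_exhausted=fallback_validation)
```

The caller receives a normal response either way; only the source of the answer changes once the retries are used up.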