MCPA MuleSoft Certified Platform Architect Level 1 – Implementing Effective APIs Part 4

  1. Using Timeouts

Let us now discuss timeouts. Timeouts should be the very first strategy we implement in all our APIs, because they prevent the majority of failures. Generally, timeouts come into play for slow-performing APIs or downstream systems. If the API you are implementing calls another API that is performing slowly, the timeout saves you from the errors that are about to happen from that API. Or if your API calls a backend system directly and that system is slow, a timeout still lets you cut the failure short before the actual error happens. Why is that valuable?

Because a failure is far less costly if you time out up front than if it happens only after waiting two, three, or ten minutes for the slow-performing system to respond. Your API clients and consumers will find your API mature, and will be much happier, learning about an error or failure quickly rather than waiting a long time and eventually seeing a failure anyway.

Sometimes the slow downstream system or API may even return a successful response behind the scenes. Still, that slow success gives your API consumers less satisfaction than a fast failure would. Clients prefer to see an error quickly, to fail fast, over a slow successful response, because the slow performance degrades the overall performance of the clients' systems as well as your hosting systems, in this case the Runtime Manager workers.

So implementing timeouts in the proper places helps overall performance on both the client side and the implementation side. The system may return some errors, but it stays a smooth, happily running system compared to one whose transactions succeed slowly. For these reasons, you have to always enforce timeouts, and make sure, as part of your code reviews, API standards, or any automated code-quality checks, that timeouts are actually implemented.

One thing to remember when implementing timeouts: don't just blindly apply some random values, because the timeout should be in accordance with the SLAs for your particular API. Remember, we discussed timeouts and response times in your NFRs. Let's take our Create Sales Order API as an example.

During our discussions in the NFR section, we came up with some numbers for Create Sales Order: the average response time should always be less than or equal to 200 milliseconds, and the maximum allowed is 500 milliseconds, not more than that, correct? If that is the SLA for our API, it does not mean those SLA times apply only when the API is working successfully. No, those SLAs apply to both errors and successes. At the end of the day, whether your API gives an error response or a success response,

an SLA is an SLA; the API has to respond within that time. The SLA covers the response coming from the API to the clients, not only during success or only during failure. So what if you apply some random timeout, say 30 seconds, just like that? I have seen in many of my previous projects, and in reviews I did for some external parties, that this is a common practice. I'm not saying it was wrong, because five or ten years back the trend used to be exactly that; even I did it in many of my previous projects.

A 30-second timeout was the default: we put 30,000 milliseconds, okay? But that should change now. We have to shift our mindset slowly into this new world of APIs and how it works. Because now, as we discussed in the previous lecture, there are many small APIs. Instead of one big chunk of a call to one backend system with a 30-second timeout, there may be some ten small API calls to orchestrate. So we cannot go with the same mindset. If you set 30 seconds the way we did before, what happens? Your overall API SLA is 200 milliseconds on average and 500 milliseconds maximum, but the timeout itself is 30 seconds, so a single API call can keep on waiting that long before timing out.

That alone already breaches your overall API SLA, right? So you have to keep your timeouts in alignment with your SLAs, and adjust accordingly. For example, for our Create Sales Order API, you have to set as small a timeout as possible, because 500 milliseconds is the maximum allowed SLA. You might decide, for instance, that all system-layer invocations should have a maximum timeout of 100 milliseconds.

I'm just giving an example; it depends on your scenario and what the SLA is. If 500 milliseconds is the overall response time, then you might allow 100 milliseconds for each system-layer API, 300 to 400 milliseconds for your process-layer API, and 100 milliseconds for your experience API, so that overall it ties up to the 500 milliseconds and your responses come within that time.
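As an illustration only (the property names and numbers below are assumptions, not values from this course), such a per-layer budget is best captured in a properties file so that every layer reads its timeout from configuration instead of hard-coding it:

```properties
# Hypothetical timeout budget for a 500 ms end-to-end SLA (illustrative values)
# Experience layer: budget for the experience API's own processing
exp.api.timeout.ms=100
# Process layer: time allowed for orchestration across system APIs
proc.api.timeout.ms=300
# System layer: per-call timeout for each backend invocation
sys.api.timeout.ms=100
```

Keeping these in properties means the budget can be tuned per environment without redeploying the application.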

All right? Each layer has to time out within its own time frame, because if your overall SLA is 500 milliseconds, together they have to respond within that time. So make sure you calculate properly and apply the right timeout for each API call in your process layer or any other layer of the API. If your API implementation is a MuleSoft implementation, then almost all kinds of integration connectors and processors support this. The HTTP processor supports timeouts for both connections and responses. Even Scatter-Gather, for example, supports a timeout to say within how much time it should collect the responses back.

Similarly, timeouts can be set for splitter-aggregators, or even for VM queues if you are publishing and subscribing in a request-response model. So MuleSoft has provisioned for this everywhere. Remember, when creating global connector configurations, to set proper, relevant timeouts. Preferably they are configured in the properties file and set there; that is a good practice. And this is the important point: of the patterns we will discuss, timeout, retry, fallback, et cetera, timeout should be the very first one you always make sure to implement in your APIs. All right, let us move on to the next lecture. Happy learning.
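As a sketch of what this looks like in a Mule 4 flow (the host, path, and property names here are assumptions for illustration), the HTTP Request connector lets you set the response timeout on the operation and read it from a properties file:

```xml
<!-- Hypothetical Mule 4 snippet: response timeout driven by a property -->
<http:request-config name="System_API_Config">
    <http:request-connection host="${sys.api.host}" port="${sys.api.port}"/>
</http:request-config>

<flow name="get-order-flow">
    <!-- responseTimeout is externalized so it can match the SLA budget per environment -->
    <http:request config-ref="System_API_Config"
                  method="GET"
                  path="/orders/{orderId}"
                  responseTimeout="${sys.api.timeout.ms}"/>
</flow>
```

If the downstream system does not answer within `sys.api.timeout.ms` milliseconds, the connector raises a timeout error immediately instead of letting the call hang and breach the SLA.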

  1. Retrying Failed API Invocations

In this lecture, let us look into retrying failed API invocations. As we saw in the previous lecture, the first pattern is the timeout; the second is retrying failed API invocations. How can we apply these retries? What is the right way? Let's see. Generally, after we make an API call to a backend system or another API, it may fail for some reason, right? That failure can be either transient, meaning temporary or intermittent, or it can be a permanent failure. If we retry the transient calls, there is a good chance they will succeed on a second attempt. We should not miss that chance. Instead of forwarding the error all the way back to the client and leaving them to retry or decide what to do, for some calls or some kinds of APIs we can take the chance of retrying a second or third time for temporary errors.

One thing to make sure of is that only transient errors are worth retrying. If we somehow know that a failure is permanent, then retrying it is a waste of time; it doesn't help, because it is going to fail anyway on the second or third attempt. So you have to work out which failures are the correct ones to retry. But sometimes, in particular or specific implementations, it is difficult for the API implementer to know upfront what could be transient and what could be permanent. In those cases it is okay to just retry any failure; it is not a wrong thing to do. The default approach you can set is to retry all failed API invocations, with proper rules in place.

But if you have the opportunity to segregate transient failures from permanent ones, the best thing is to always retry only the transient ones, not the permanent ones. For example, most of the time, for some common cases, you will know, whatever the technology. Network-related errors, say, are usually temporary. In a Java-based product or system, they surface as java.net or java.io errors: a java.net connect timeout, or a java.io IOException, right? If those kinds of errors come, you can confidently treat them as transient failures and retry.

But say you get an error that does not fall in your list of transient exceptions; it can be treated as permanent and not retried. Otherwise, the default behavior is that you can implement retries, if they are safe to do and governed by proper rules, for all errors, when you don't have the level of knowledge to segregate transient from permanent failures. In the REST world, though, you can get this segregation if your project implements REST APIs with proper standards, as we discussed across this course in various sections. Meaning: if in your organization you respect all the rules of HTTP and APIs, implementing all validation errors, authorization errors, and other client errors as 400-series status codes, and sending 500-series codes for internal server errors or backend system errors, then if you follow this religiously it becomes very easy to segregate between them.

How? If you get a 400-series error from the call, that is nothing but a client-side error. A 400 Bad Request is going to stay a bad request no matter how many times you retry; a 404 Not Found is not going to find the resource no matter how many times you retry; likewise 401 Unauthorized. All of these are client-side errors, so they are permanent and need not be retried. But if it is a 500-series error, there is a chance it is temporary or transient, because it could be an internal error caused by some temporary issue, such as a network glitch or a backend system connectivity problem. So you can take the risk, or the chance, of retrying the API call.

Now, within the 400 series, if you want to make your retry mechanism even more mature, you can drill down further. In the 400 series there are also some kinds of failures that can be retried; don't blindly treat all 400s as permanent client-side errors. Yes, we said a moment back that 400-series client errors cannot be retried because they are permanent, but we are now discussing how mature you can make your retry logic, depending on the opportunity you have. The first level is to retry all failures if you have no way to segregate them. The second level, a bit more mature, is to segregate them by exception type.

The third level is to segregate by HTTP status code. And the fourth level, if you want to go still further, is to look inside the 400 series itself. For example, 408 means Request Timeout: a timeout configured on the provider side fired, so it is a temporary error, and you can retry to see if it works this time, correct? So that is something you can retry; it is transient. Similarly, 429 stands for Too Many Requests. Some API providers apply rate limiting, right? Say only 100 requests per second. Maybe your request went in as the 101st, so it failed there. But if you try again, it may fall under the new rate-limit window in the next second, correct?

So it will work. These are the kinds of codes you can add to the exception list of transient errors within the 400 series. That is how you can segregate and properly implement retries. Now, the next thing to remember is not only the segregation of failures, but also where you apply the retry: on which API method. You have to make sure you apply retries only on idempotent HTTP methods, to keep your API safe. Remember, we discussed idempotency in one of the previous lectures in the previous section. It is very important: you can't retry on just any API method, because that may cause issues in the backend systems, maybe creating something two or three times.

So you have to implement retries on the methods that are idempotent in nature. Which methods are those, as we discussed before? GET, HEAD, OPTIONS, PUT, and DELETE. These methods are safe if implemented by respecting, or honoring, the HTTP standards, right? GET and the like are read-only, so you can retry them. Similarly, PUT is a replace-in-place operation, so you can retry it too, because idempotency can be implemented for these kinds of HTTP methods. Now, if you are retrying an API invocation, it is apparent that there will be some increase in the overall processing time, correct? There is a difference between executing once and trying two or three times; it will definitely increase the processing time. So, to mitigate that impact on the SLA, just as we discussed before, the retries should be configured properly so that they won't breach the SLA and the other constraints we have on the API.

How can we do that? First, restrict the retries to a very small number, not 10 or 20 times; just make sure you retry two times, or three times max. Second, the retries should have a short wait time between each attempt; the interval should be very short, not something like a five-second gap between each retry. Keep it very small, in milliseconds or at most a second; milliseconds is the suggested interval between retries. Also, combine this with your timeouts. Retries do not mean you can skip the timeout. As we discussed in the previous lecture, the timeout is compulsory, mandatory to implement, to be safer; then on top of it you can have the retries.

With the right combination, if within that timeout period you get the errors and responses, then retrying is good behavior. Again, if you are implementing this in a Mule project or Mule application, MuleSoft provides an out-of-the-box way in Anypoint Studio to achieve it: the combination of the HTTP Request connector and a scope called Until Successful. Just as we have other scopes, such as the Cache scope, we have the Until Successful scope.

Using the Until Successful scope, whatever code you place inside it, you can configure on the scope the number of retries, the interval, et cetera, as per your need. You can then externalize those values into the property files, so that you achieve this retry behavior the right way: a few retries, short wait times, et cetera.
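A minimal sketch of that pattern in Mule 4 (the config name, path, and property names are illustrative assumptions) wraps the outbound call in an Until Successful scope whose retry count and interval come from configuration:

```xml
<!-- Hypothetical Mule 4 snippet: few retries with a short interval, both externalized,
     e.g. retry.max=2 and retry.interval.ms=200 in the properties file -->
<until-successful maxRetries="${retry.max}" millisBetweenRetries="${retry.interval.ms}">
    <!-- The per-call timeout still applies inside the retry scope -->
    <http:request config-ref="System_API_Config"
                  method="GET"
                  path="/orders/{orderId}"
                  responseTimeout="${sys.api.timeout.ms}"/>
</until-successful>
```

Because the scope re-executes its contents on any raised error, keeping it on an idempotent GET, as discussed above, is what makes the retry safe.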

Similarly, for segregating which HTTP response codes are marked as failures and which are not, the HTTP Request connector also has configurable support. If you look in the Advanced section of the connector's properties, you can clearly set which HTTP response codes should be treated as failures and which should not; they are called HTTP success validators, if you notice them in the HTTP Request connector. All right, let us move on to the next lecture in the section. Happy learning.
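As a hedged sketch of that idea (the exact code ranges are your own design decision, not prescribed by the connector), the response validator can be declared in XML so that plain client errors do not raise a Mule error, and therefore are never retried, while 408, 429, and all 5xx codes still raise errors that an Until Successful scope can retry:

```xml
<!-- Hypothetical Mule 4 snippet: mark ordinary 4xx responses as "success" so they
     pass through without triggering retries; 408, 429, and 5xx remain failures -->
<http:request config-ref="System_API_Config" method="GET" path="/orders/{orderId}">
    <http:response-validator>
        <http:success-status-code-validator
            values="200..399,400..407,409..428,430..499"/>
    </http:response-validator>
</http:request>
```

Downstream logic then inspects the status code of the passed-through 4xx responses and returns them to the client as permanent errors, keeping the retry scope focused on transient failures only.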
