DP-300 Microsoft Azure Database – Recommend an HADR strategy for a data platform solution
Hello. Now, we’ve had a look in the previous section about restoring and that’s part of disaster recovery. What happens if a major problem happens? Well, in this section we’re going to continue looking at high availability and disaster recovery. And in this video we’re going to have a look at how we can recommend a high availability disaster recovery. So ha d strategy. So let’s have a look at the SLA. So these are service level agreements. So this is what Microsoft says we will provide. So this is the minimum we provide, otherwise we’ll have to give you money back. So if you’re using the hyperscale tier, then you will have a minimum uptime availability guarantee of 99. 9% if you’ve got zero Replicas. So additional databases which look the same.
So that means that it can be down for 8 hours and 45 minutes over a year, or 44 minutes just under a month, and 99. 95% if you’ve got at least one Replica. So that halves the amount of time that it could be down. If you’re wondering where these figures are coming from, we’ll subtract 100% from this and you’ll see that we’ve got 0. 1% and 0. 5% and then multiply that by a year and a month.
But you don’t need to worry too much about what hours and minutes there are. I’m just giving you these so you can place these into context. So the hyper scale is at least 99. 9%, so that’s 8 hours potential downtime over a year over tiers have 99. 99%. So that’s ten times better than 99. 9%. So instead of 8 hours potentially being down, it’s for just 52 minutes over a year, or four minutes and 23 seconds over a month.
So this is the same for all other Azure SQL database tiers and Azure Escrow managed instance. Apart from if you have got it configured with zone redundant deployments and you are the business critical or premium tiers, then we can cut that in half again, 99. 99 5%. So that’s 26 minutes over a year or two minutes 11 seconds downtime over a month. Now, saying all of these nines gets a bit confusing. So this 99. 9% is called three nines, 99. 99% is called four nines.
If there’s a five at the end, that’s halfway between. So we call this three and a half nines and this four and a half nines. The more nines you’ve got, the better. And if you don’t remember what zone redundant deployments are, that’s a database that has multiple synchronized Replicas provisioned in different availability zones, so in the region but in different buildings. So this is the same as in SQL managed instance. For virtual machines, then you could go all the way down to 95%, which isn’t great because it could be down 5% of the time, but goes all the way up to 99. 99% four nines. However, that’s not the end of the story for virtual machines. A virtual machine is just the Windows or Linux bots you will have on top of it, the SQL Server.
So the SQL Server may fail even though the virtual machine is healthy. And so the composite SLA is actually lower for virtual machines. Now, one of the requirements is to recommend a HADR strategy based on RPO and RTO requirements. So what is that? Well, RPO is the recovery point objective. So that is how much data you can lose if something goes wrong. And then recovery time objective is how long until you can use it again. In other words, the maximum fail over time. So if you’ve got Geo replication, then you’ve got a guaranteed RPO recovery . 5 seconds.
That’s how much data you can lose and a recovery time RTO of 30 seconds. So how long until you can use it again? And if exceeded, you can get a credit of up to 100% of the total monthly cost of the secondary. So that’s the Georeficated database. However, it’s not necessarily going to be this depending on what particular thing you’re doing. This is for manual database failover to a Georplicated secondary. So in other words, I say failover now, then you’ll lose a maximum of 5 seconds worth of data but it will back up in 30 seconds. If you’ve got an auto failover group then you’ll have again 5 seconds worth of data that you could lose. But the RT or the recovery time objective is an hour minimum. So in other words, it will not fail over until an hour is gone.
Because it might be there’s just a temporary problem that the computer is going well, we’ll just give it a little bit longer. And then finally, if you are doing a GeoRestore from georeplicated backups, in other words, you’re creating the database from backups, then the RPO can be an hour and the RTO 12 hours. So if you have a strategy based on RPO and RTO requirements then if you’re using manual database failover or auto failover groups and we’ll discuss what those are later, then you will be guaranteed to lose no more than 5 seconds worth of data. However, auto failover groups, when you’re getting the computer to do the failover, the recovery time objective will be an hour.
If you’re doing it manually then when you press the button to a Geo replicated secondary, then it will be back up in 30 seconds. If you’re doing a restore from a Georeplicated backup then you could lose up to 1 hour worth of data. But you’ll also have an RTO of 12 hours. It might not be back up for another 12 hours. So hopefully you can see from this that RPR RTO can be critical if you’ve got a critical need to not lose any data and to be back up quickly.
But as you might imagine, the more RPO and RTO tightness that you need, the smaller the period, the more expensive it could well become. You do have to say, okay, do I really require such a small RPO and RTO? Or would I be okay in this particular requirement to have a bit longer? So, these are considerations for your RPO rio RTO HADR high availability, disaster recovery strategy. And if you’re wondering what the difference between high availability and disaster recovery, high availability means, it’s up. Most of the time, disaster recovery is what do we do when something goes wrong?
In this video we’re going to have a look at some of the solutions for High Availability disaster recovery for Azure SQL databases. Now the first one I want to look at is Georefication, which you may also see called as Replicas. So you’ve got your data, if something happens to your data oops, that’s it. Okay, well that’s not ten kiss you with basic standard and general purpose service tiers, then you do have some high availability and reliability of the remote storage tier. And with the Premium availability model then there is a quorum of available database engine nodes. So hopefully at least one of them will stay out. But this is just a simplification, this is the problem.
So how can we guard against that problem? Well, we can have additional databases, maybe in the same region, maybe in different regions. And so what happens is that we have one database, our original being called the primary and then these others being called secondaries or replicas. And you can have up to four Replicas per primary database. So here we can see the connections between them. Now, is four not enough? Well, we could take one of these secondaries or replicas and bowed additional secondaries or replicas off that secondary.
Now these databases obviously need data and so the primary database sends out the data in what’s called asynchronous replication. So initially the secondary database is populated from the primary database in the process called seeding. Then updates are replicated automatically Asynchronously. So what does asynchronously mean? It means that they are committed to the primary before they are committed to the secondary. So in other words, the primary is as quick as possible and then we have the secondaries which lag behind and needless to say, secondaries of secondaries like behind even more, but it’s still not that bad. Now the service tier that the secondaries are going to have needs to be at least the same service tier as the primary.
So if you are going to upgrade the primary, you should upgrade this secondary first and then you can upgrade the primary. As I say, the secondary needs to be at least the same service tier and if you need it to downgrade while you downgrade, the primary first and then the secondaries. But the good news with all of this upgrading and downgrading is you don’t need to disconnect the secondaries unless you’re changing between general purpose and business critical. Now obviously more secondaries great if you’ve got more than one secondary means that even if one secondary filled, there’ll still be at least one available until they are replicated. Now you can read from these secondaries and that’s a really good thing because this primary will probably have a lot of writing done to it and if they had to be reading done to it as well, well that could just increase the workload on the primary.
Whereas if you’ve got these readable replicas, then you are able to say, okay, I want to read from this one, I want to read from this one. And then if something happens to the primary, what you can do is then say, okay, I’m going to go to this particular database and I’m going to say you’re going to be the new primary. And then I can create my secondaries from this new primary. Now, there is one problem with this. I use connection strings to go to a particular database. So when I log into SSMS here, this gets translated into a connection string. So this is going to this particular database, my primary. However, my secondary database would have a different connection string and therefore when I go to this new primary, I have to use a different connection string. It’s a bit of a nuisance, it doesn’t happen automatically. Now, what can happen automatically is something a bit simpler. This is called a failover group. And a failover group uses one primary and one secondary.
And if something was to happen to the primary, then the computer can detect this, this is Azure and say, okay, what we’re going to do is swap these around. So the old secondary becomes the primary, the old primary becomes a secondary and everything is as normal. And with a failover group, what we have is a connection string that goes to the group. So even if the actual server, the actual database changes over, fails over, then the connection will still go to whatever one is the primary. So these are the two big HADR solutions for the Azure SQL database and let’s just compare them.
So first of all, failover groups can have automatic failover. Both of them can be have manual failover. You can fail over multiple databases simultaneously. In a fail of group and failover groups, you can use Azure SQL managed instance. You don’t need to update the connection string after a failover, whereas you do for Geoffication, geo replication can be in the same region as the primary and you can have multiple Replicas. And both of these support what’s called Read Scale. The Read Scale out feature allows you to offload readonly workloads using the compute power of one of the read only Replicas. Instead of running them on the read write Replica. So both of these support this. So in the next few videos we’re going to have a look at Georeplication, also known as Replicas and Failover groups, so we can see how they are implemented. And you’ve can more visually see some of the pluses and minuses for both of these solutions.