Amazon AWS Certified Database Specialty – Amazon RDS and Aurora Part 3

  1. Cloning databases in Aurora

Now let’s look at cloning databases in Aurora. Cloning is different from creating read replicas, because clones support reads as well as writes. Cloning is also different from replicating a cluster, because clones use the same storage layer as the source cluster, and you’ll understand what I mean in just a bit. So the clone database runs on the same storage layer as the original database, and it therefore requires only minimal additional storage. This makes it a very cost-effective option for creating a read-write copy of your database, and clones can be created from existing clones as well. So all in all, this is a quick, cost-effective method and it requires minimal administrative effort.

You can create clones only within a region. You can place them in different VPCs, but the AZ must be the same. And you can also create cross-account clones, so you can create a clone in a different account than the original source account. The way cloning works is by using the copy-on-write protocol, and the copy-on-write approach is what helps us save on storage costs. So let’s find out how it works. You have a source database, and its data is divided into different pages. Initially, when you create a clone, both the source database and the clone share the same data, but after that, as the data changes, the changed data is stored separately.

So for example, if you make any writes or changes on the source, then that data will be stored separately, and in a similar fashion, if you make any changes on the clone, then those changes will also be stored separately. This delta of changes made after cloning is never shared between the source and the clone. In effect, we only require minimal additional storage, and that saves a lot on your data storage costs in Aurora. So what are the use cases for cloning? Firstly, you can create a copy of your production database cluster for development, testing, or QA purposes. You can also use cloning for impact assessment of your changes before you apply them to the main database.

For example, if you intend to make any schema changes or parameter group changes on your database, you can test them on a clone first. You can also use cloning to perform workload-intensive operations, like running analytical queries or exporting data, or for other non-routine work, the activities that you only require once in a while. And something that’s super important to remember here: you cannot backtrack a clone to a time before that clone was created; you can only backtrack to a time starting from when the clone was created. And remember that the cloning feature is only available in Aurora; it’s not supported in other RDS engines. All right, let’s continue.
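For reference, here is a rough sketch of how a clone can be created from the AWS CLI using the copy-on-write restore type. The cluster identifiers and region are placeholder assumptions, not values from this course:

```bash
# Create an Aurora clone of an existing cluster using copy-on-write.
# Cluster identifiers and region below are placeholder assumptions.
aws rds restore-db-cluster-to-point-in-time \
  --source-db-cluster-identifier my-source-cluster \
  --db-cluster-identifier my-clone-cluster \
  --restore-type copy-on-write \
  --use-latest-restorable-time \
  --region us-east-1
```

Note that this creates the cloned cluster only; you would still add a DB instance to the clone (for example with `aws rds create-db-instance`) before you can connect to it.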

  1. Aurora failovers

Let’s look at Aurora failovers. Now, failovers within a region occur automatically. A replica is promoted to be the new primary, and which replica to promote is decided by something called the replica failover priority. You can call it the failover priority or the failover tiers; it’s all the same. Each replica has a specific failover priority, from Tier 0 through Tier 15. Tier 0 is the highest priority and Tier 15 is the lowest priority. And we have seen this in the demo as well. You have a master instance and you have your replicas, and each replica will have a certain priority from Tier 0 to Tier 15. The replica with the highest priority will always get promoted; a Tier 0 replica will get promoted first. If two replicas have the same priority, then the replica that is largest in size gets promoted. And if two replicas have the same priority as well as the same size, then one of them gets promoted arbitrarily.
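The failover priority is set per instance as the promotion tier. As a minimal sketch, assuming a hypothetical replica identifier, you could set it from the CLI like this:

```bash
# Set the failover (promotion) tier of a replica; Tier 0 is promoted first.
# The instance identifier is a placeholder assumption.
aws rds modify-db-instance \
  --db-instance-identifier my-aurora-replica-1 \
  --promotion-tier 0 \
  --apply-immediately
```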

All right, so let’s say you have your application and it’s currently pointing to the current master, and this master goes down. Then what’s going to happen is Aurora is going to switch the DNS name to the Tier 0 replica, and then it’s going to promote it. So the failover will take about 30 seconds; you have your replica promoted to the master within about 30 seconds. Aurora flips the CNAME of the database instance to point to the new replica and then promotes it, and the RTO in this case will be just about 30 seconds, or at most about 60 seconds. All right? So the typical failover time with Aurora is about 30 seconds. And if there are no replicas, if you just have a single instance, then this 30-second RTO doesn’t apply. What Aurora does is it creates a new instance in the same AZ, okay? It will spin up a new instance in the same AZ, and this definitely results in some downtime as compared to failing over to a replica. In a single-instance setup, the failover happens on a best-effort basis.

It may not succeed if there is an AZ-wide outage. And remember that copying of data is not required for failovers, because of the shared storage architecture. You always have shared storage; all replicas, all the instances, share the same storage, so you don’t have to copy any data. In the case of Aurora Serverless, what Aurora is going to do is, if there is a failure, Aurora is going to create a new Aurora Serverless instance in a new AZ, okay? It will spin up a new instance in a different AZ. So this is true in the case of an AZ outage. And DR, or disaster recovery, across regions, remember, is a manual process. If your Aurora cluster has cross-region replicas and your master goes down, you can always promote the replica in another region to be the new master. Okay? And once it becomes the master, it can take on reads as well as writes. All right, so that’s about Aurora failovers. Let’s continue to the next lecture.
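For reference, the manual cross-region promotion mentioned above could look roughly like this from the CLI. The cluster identifier and region are placeholder assumptions:

```bash
# Manually promote a cross-region Aurora read replica cluster to a standalone,
# writable cluster during a regional DR event. Identifiers are placeholders.
aws rds promote-read-replica-db-cluster \
  --db-cluster-identifier my-replica-cluster-in-dr-region \
  --region us-west-2
```

After promotion, you would repoint your application at the promoted cluster’s writer endpoint.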

  1. Cluster Cache Management (CCM) in Aurora PostgreSQL

Now let’s talk about something called CCM, or Cluster Cache Management. The CCM feature is available in the Aurora PostgreSQL flavor. So let’s find out where CCM is useful. Let’s say you have your Aurora PostgreSQL cluster, you had an outage, and your cluster has failed over to a new replica. What’s typically going to happen is that the new replica is going to experience a certain performance lag immediately after it gets promoted to be the new primary, and CCM helps in reducing this performance lag. So let’s find out how it works. We know that relational databases have a buffer cache to reduce disk I/O, and the buffer cache contents in your primary instance and your replica are most often going to be different.

So when the failover happens from the primary to a replica, the promoted replica will take some time to warm up its cache to match the cached content of the primary, right? This is going to result in slower response times, or in other words, a performance lag for a few minutes at least, when the replica just takes over to be the new primary. CCM tries to improve the performance of the promoted instance after a failover, and the way it works is that the replica preemptively reads frequently accessed buffer cache content from the primary.

Let’s say you have your Aurora PostgreSQL cluster here and you have your replica. The replica sends a Bloom filter with its currently cached buffers to the master. What exactly is this Bloom filter? A Bloom filter is a space-efficient data structure, okay, so it’s a kind of data structure. And what the master is going to do is send back the frequently used buffers to the replica. So the replica is going to have a copy of the cached content of the primary instance. Whenever there is a failover and the replica becomes the new primary, it’s always going to have the cached content, and hence this replica, when it gets promoted, will not face the performance lag issues. All right? To enable CCM, you simply set a DB cluster parameter. The parameter name is apg_ccm_enabled, and you set it to 1 to enable CCM. And remember that this is only supported for Aurora PostgreSQL. All right, let’s continue to the next lecture.
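For reference, here is a minimal sketch of setting that parameter from the CLI, assuming a custom DB cluster parameter group is already attached to the cluster. The parameter group name is a placeholder assumption:

```bash
# Enable Cluster Cache Management for Aurora PostgreSQL by setting
# apg_ccm_enabled = 1 in the cluster's DB cluster parameter group.
# The parameter group name is a placeholder; ApplyMethod=immediate assumes
# the parameter is dynamic, otherwise use pending-reboot.
aws rds modify-db-cluster-parameter-group \
  --db-cluster-parameter-group-name my-apg-cluster-params \
  --parameters "ParameterName=apg_ccm_enabled,ParameterValue=1,ApplyMethod=immediate"
```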

  1. Simulating fault tolerance or resiliency in Aurora

Now, simulating fault tolerance in Aurora. There are two ways you can test or simulate fault tolerance in Aurora: the first is a manual failover, and you can also run fault injection queries. Fault tolerance is also called resiliency, so you can call this simulating fault tolerance, or you can also call it simulating resiliency, in Aurora. And you can also simulate AZ failures using these options. In addition, you can use this to upgrade your primary instance; that’s called a forced failover. All right, so let’s look at the first option, the manual failover. This is very straightforward: simply go to the AWS console, select an instance, and from the Actions menu choose to fail over. Or you can also use the failover-db-cluster CLI command. What this is going to do is fail over to the replica with the highest priority.

The read replica with the highest priority is going to be the new master, and this will take about 30 seconds, right? The master instance that failed over will become a replica when it comes back online. Remember, as each instance has its own endpoint address, you should always clean up and re-establish any existing connections that use the old endpoint after a failover. Ideally, you should not use the instance endpoints, and we have discussed this already: you should always use the writer, reader, or custom endpoints, so you don’t have to clean up and re-establish any existing connections. Right, so that’s about it.
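As a minimal sketch of the CLI variant mentioned above (the cluster and instance identifiers are placeholder assumptions):

```bash
# Trigger a manual failover of an Aurora cluster. Without a target, Aurora
# picks the replica with the highest failover priority (lowest tier number).
aws rds failover-db-cluster \
  --db-cluster-identifier my-aurora-cluster

# Optionally name the replica to promote explicitly:
aws rds failover-db-cluster \
  --db-cluster-identifier my-aurora-cluster \
  --target-db-instance-identifier my-aurora-replica-1
```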

Let’s look at the fault injection queries now, simulating fault tolerance with fault injection queries. Fault injection queries are simply SQL commands that you use to simulate resiliency in your Aurora cluster. You can schedule a simulated occurrence of different failure events, like, for example, a writer or reader crash, a replica failure, a disk failure, or even disk congestion. So let’s look at each of these quickly. First, how do you simulate a writer or reader crash? To do that, you use the SQL query ALTER SYSTEM CRASH. You can crash an instance, a dispatcher, or a node. Instance refers to the DB instance, and that’s the default.

The dispatcher is what writes your updates to the cluster volume, and when you choose node, it will crash the instance as well as the dispatcher. So this is a simple SQL command, ALTER SYSTEM CRASH, and you can use it to simulate a writer or reader crash. Then let’s look at the replica failure. To simulate a replica failure, you use the ALTER SYSTEM SIMULATE command and you specify the percentage of failure. This refers to the number of requests that Aurora should keep blocking to this particular replica. If you specify TO ALL, it means the failure event will be simulated across all the replicas, or you can also specify a particular replica to simulate your failure event on. And quantity refers to the duration of your failure event; you can define the quantity in seconds or even in years.
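As a rough sketch of these first two fault injection queries, run against an Aurora MySQL cluster through the mysql client (the endpoint, user, and the percentage and interval values are placeholder assumptions):

```bash
# Placeholder endpoint and credentials; the -p flag prompts for the password.
ENDPOINT="my-aurora-cluster.cluster-abc123.us-east-1.rds.amazonaws.com"

# Simulate a crash of the DB instance (INSTANCE is the default;
# DISPATCHER and NODE are the other options).
mysql -h "$ENDPOINT" -u admin -p -e "ALTER SYSTEM CRASH INSTANCE;"

# Simulate a 25 percent read replica failure across all replicas for 1 minute.
mysql -h "$ENDPOINT" -u admin -p -e \
  "ALTER SYSTEM SIMULATE 25 PERCENT READ REPLICA FAILURE TO ALL FOR INTERVAL 1 MINUTE;"
```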

Okay, then let’s look at the disk failure simulation. The command is similar: you use ALTER SYSTEM SIMULATE and you specify the percentage of failure of your disks. The percentage of failure this time refers to the percentage of the disk to mark as faulting. You can specify a disk index, which refers to a specific logical block of data within the disk, or you can also specify a node index, which refers to the failure of a specific storage node. And quantity, again, refers to the duration of your failure event. And finally, we have the disk congestion failure. The command, again, is similar: ALTER SYSTEM SIMULATE, and you specify the percentage of failure.

The percentage of failure this time refers to the percentage of the disk to be marked as congested, and you specify the disk index or the node index just like you did previously, say to simulate congestion of a particular logical block of data or a particular storage node. You also specify minimum and maximum milliseconds; this is going to be the minimum and maximum amount of congestion delay in milliseconds, and when you run this command, Aurora is going to pick a random number between these two minimum and maximum values that you specify. And quantity, as always, specifies the duration of your failure event, so in this case it is the duration of the congestion. So this is how you would simulate resiliency, or fault injection, in Aurora. All right, that’s about it. Let’s continue to the next lecture.
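For reference, here is a similar sketch for the disk-related fault injection queries described above, again with a placeholder endpoint and placeholder percentage and interval values:

```bash
ENDPOINT="my-aurora-cluster.cluster-abc123.us-east-1.rds.amazonaws.com"

# Mark 30 percent of the disk as faulting for 1 minute.
mysql -h "$ENDPOINT" -u admin -p -e \
  "ALTER SYSTEM SIMULATE 30 PERCENT DISK FAILURE FOR INTERVAL 1 MINUTE;"

# Mark 40 percent of the disk as congested for 1 minute, with a random delay
# between 10 and 100 milliseconds applied to affected requests.
mysql -h "$ENDPOINT" -u admin -p -e \
  "ALTER SYSTEM SIMULATE 40 PERCENT DISK CONGESTION BETWEEN 10 AND 100 MILLISECONDS FOR INTERVAL 1 MINUTE;"
```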

  1. Simulating failovers in Aurora – Demo

All right, in this demo we’re going to see how to manually trigger a failover in an Aurora cluster. So here we have an Aurora MySQL database cluster, and we have one writer and one reader. How do you test a failover? We simply select the writer node, go to Actions, and click on Failover. What it’s going to do is fail over to the replica, so it’s going to ask for confirmation: do you really want to fail over? Yes. And now the failover will happen in about 30 seconds, so let’s wait for about 30 seconds.

I’m going to pause the video here and come back in about 30 seconds. So it’s failing over, and now you can see that it has failed over to the reader instance. The original writer is now a reader, and we have a new writer here. So the replica has been promoted to be the new writer. You can see that the failover to a replica is very fast; it takes just about 30 seconds or so. And this is how you simulate a failover in your Aurora database.
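If you prefer the CLI over the console for this check, a quick way to confirm which instance is currently the writer (the cluster identifier is a placeholder assumption):

```bash
# List the cluster members and whether each one is currently the writer.
aws rds describe-db-clusters \
  --db-cluster-identifier my-aurora-mysql-cluster \
  --query "DBClusters[0].DBClusterMembers[*].{Instance:DBInstanceIdentifier,IsWriter:IsClusterWriter}" \
  --output table
```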
