CompTIA Cloud+ CV0-003 – Domain 5.0 Troubleshooting

  1. Resource Contention

When it comes to the cloud, we can certainly expect some level of resource contention at one time or another. When it comes to managing a resource conflict, we need to look at it from several different angles. So what does that really mean? Let's talk about the different ways contention can happen. The reality is that contention generally occurs when demand exceeds supply. For example, say you have a retail application that gets scaled out during the holidays, which may be your busiest time of the year. If you sized those cloud resources (CPU, memory, network, and storage) for average usage instead of peak, then you may experience some kind of resource contention.

Again, you need to realise that contention can happen in many different forms. It could be a factor of just the CPU, or of the CPU and memory together, or the network; whatever it is, you need to investigate appropriately. On this exam, the Cloud+ exam, you will be expected to look at specific use cases, or what I like to call paragraph questions: a scenario-based approach where you try to determine whether there is contention, or what the source of the contention is. For example, they may have CPU, memory, or the network listed as possible sources of contention. If they ask you why a user is not able to refresh his screen in a certain amount of time, then perhaps it is because of a network or application issue.

Again, it's hard to know until you have the right details. On the exam, you can certainly expect a question on resource contention. Resource contention can be hard to identify in the cloud; that is a given, and I won't even pretend I'm an expert at it. I will say that generally, the sources of contention fall into three typical areas. The first is the application itself. The user expects the application to perform at a certain level, and that's perfectly reasonable, but the application may not have been scaled for that number of users, or the network bandwidth might not be there. The second reason contention typically comes up is latency. I like to think of latency as a kind of ghost in the closet: a lot of organisations have plenty of latency issues, but they don't like to admit it.

A lot of that is just based on the fact that you're using a network and going through service providers, and you may be routed over several hops. I've seen customers go from Chicago to Atlanta over 30 hops; sometimes it's sort of crazy. Each of those hops can certainly affect the performance of the application, and when it affects the application, it generally affects the user experience. That's where we all heard the term "blame it on the network." Now again, it may not be the network's fault; it could be the service provider. I've had instances where the service provider decided, "Hey, we're going to go update our routing, and maybe we'll see who it affects." I don't know what the thinking is at the service provider, but sometimes they update their routes, and it can definitely affect your cloud application, especially if your user base is segmented in certain areas; it can be very noticeable. I/O issues can occur as well, such as IOPS bottlenecks on the front end and back end. You may have data services in the cloud that you're accessing from within your house; many different things can cause a bottleneck. When it comes to virtual machines, here are a few things you may want to look at:

The emulation could definitely be an issue. It could also be determined by the clustering configurations you employ. It could be shared resources as well, or you may not have scaled appropriately. These are just some of the issues that could affect virtual machines. Costs are another challenge: trying to get the right configuration at the right cost. It's very easy to add more resources than you need; the hard part is adding just the resources you do need. Allocate storage judiciously, and manage your snapshots and copies. A great way to run up your cloud bill, especially with storage, is to let those snapshots keep accumulating without being archived or deleted after a certain amount of time. Snapshots are typically point-in-time copies, and most people don't need more than a day or two of snaps. Again, a lot of this could be based on your DR/BC (disaster recovery and business continuity) strategies. You need to understand how they correlate with bandwidth charges and energy costs, especially if you have a private cloud. Some other areas of cost consideration are TCO and return on investment. This is the apples-to-apples approach: look at Google Cloud Storage, look at Amazon's cloud storage, and compare apples to apples. Each of the providers has different configuration variables and may or may not be the right fit for your application. In terms of migration costs, you must exercise caution. Other than dropping something into the cloud, everything has a cost in the cloud, usually to get it out or access it, so there will be some costs there that you need to be aware of, as well as monitoring costs.
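The snapshot-retention idea above can be sketched in a few lines of shell. This is only an illustration with a hypothetical naming scheme (snapshot names ending in a YYYY-MM-DD creation date) and a hardcoded "today"; real providers expose creation timestamps through their CLIs and usually offer lifecycle policies that do this for you.

```shell
#!/bin/sh
# Keep only snapshots newer than a 2-day retention window.
# Hypothetical snapshot names carry their creation date (YYYY-MM-DD);
# "today" is pretended to be 2024-01-10 so the example is deterministic.
cutoff="2024-01-08"   # today (2024-01-10) minus the 2-day retention window

decisions=$(printf '%s\n' \
    "snap-app01-2024-01-01" \
    "snap-app01-2024-01-09" \
    "snap-app01-2024-01-10" |
  awk -v cutoff="$cutoff" '{
    created = substr($0, length($0) - 9)   # trailing YYYY-MM-DD
    # ISO dates compare correctly as plain strings
    print (created < cutoff ? "DELETE " : "KEEP   ") $0
  }')

# In real life, each DELETE line would become a call to the
# provider delete-snapshot API instead of just being printed.
printf '%s\n' "$decisions"
```

Because ISO 8601 dates sort lexically, a plain string comparison is enough; no date arithmetic is needed once the cutoff is known.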

That's another cost you need to be aware of. Resource contention can be hard to identify in the cloud; compare against your baselines and then work from there. One of the things I like to point out is that if you don't have proper monitoring, you don't have baselines, or you haven't done a proof of concept with a provider, it's really hard to know how things should be performing. So with that said, make sure you have good practises in place before you go to the cloud, because it'll be a lot easier to troubleshoot and identify issues. Here's an exam tip: resource contention is what happens when demand exceeds supply for a shared resource. Understand the resource types and possible contention issues. For this exam, you will most likely need to identify specific contention issues in what I like to call paragraph questions, where they give you a scenario. I wouldn't call it a case study, because a case study would be much more lengthy; it's more along the lines of "a customer is having an issue; how do you solve the problem, or what is causing it?" That's what they like to ask on this exam.

  2. Troubleshooting Cloud Services

When it comes to the cloud, you're going to run into specific issues that may or may not be obvious to solve. Let's talk about some of the issues that can come up and some approaches to resolving them. When it comes to cloud security, you want to troubleshoot it, but also understand that it's going to take some time, and you'll have to look at it holistically. It could take many different approaches. For example, if it's a VPN issue or a VPC issue, there could be several variables involved that are affecting your performance, security, et cetera. It could be a permissions change or something changed by a developer. Whatever it is, you need to figure out a way to start finding that needle in the haystack, as they used to say.

When it comes to cloud security, it can be very complex. Recognize that you are not solely responsible for security; the provider is also providing you with some kind of security service. You'll need to look at the provider's best practices and documentation; you may need to contact support; whatever it is, you'll need to have an idea of how to initiate some kind of help. When it comes to security, realize that these issues often come up very suddenly, in most cases out of the blue. Everything was fine one week, and then you come in on Monday morning and nothing works, and it turns out that some APIs were updated without going through the proper change control process. Then there is ownership and control. Recognize that unless you're using a private cloud that you've set up and fully paid for, the resources you're using are likely not under your control. Visibility into some of the issues can be challenging as well.

One of the areas that commonly comes up in classes when I teach, or when I'm consulting, is: "How do we know our resources are where they're supposed to be?" And that's a really good question; I'll let you know when I have an answer. Now, the reality is that the provider is going to guarantee that your resources are in the zone you place them in. This does not imply that you have control over which racks are in which part of the data center; that is not something you're going to typically even be aware of. Now, you could certainly use solutions like ScienceLogic that may provide some kind of additional insight, especially with AWS as an example. But once again, visibility is just not something you can control very easily. When it comes to using tool sets, realise that the provider may have some, and recognize that you can use third-party services as well. Remember that in some cases, you may be able to use standard commands to assist with troubleshooting. Run solutions against your cloud services to determine what ports are open. You may proceed to run netstat, or you can use other solutions like Nmap; whatever problem you're trying to resolve, there are plenty of solutions out there. Now, of course, for some of these solutions, you need to get permission before running them. For anything that's going to scan, run a pen test, or perform a threat assessment, you need to validate that you are doing it the way the provider wants you to, because your actions may cause a cloud service to be shut down. It wouldn't be the first time that a provider's automated systems, threat assessments, and IDS/IPS solutions shut down a cloud service because it looks like it's being breached or has been breached. Now, a couple of other things are on the slide here. Look at log files; hopefully you set up log files and you're able to keep track of who's doing what and when.
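As a small sketch of the "what ports are open" check mentioned above, the snippet below pulls the listening TCP ports out of netstat-style output. A canned sample is used so the example is deterministic; on a live host you would pipe in `netstat -tln` (or `ss -tln`) instead.

```shell
#!/bin/sh
# Extract listening TCP ports from netstat-style output.
# The sample below stands in for the output of `netstat -tln`.
sample='Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:3306          0.0.0.0:*               LISTEN
tcp        0      0 10.0.0.5:443            52.1.2.3:55012          ESTABLISHED'

# Keep LISTEN rows only, then strip everything up to the last ":" in the
# local-address column, leaving just the port number.
listening=$(printf '%s\n' "$sample" |
  awk '$6 == "LISTEN" { sub(/.*:/, "", $4); print $4 }')

echo "Listening ports:" $listening
```

Note that ESTABLISHED connections are deliberately excluded; only sockets actually accepting connections are reported.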

Examine specific websites and resources. For example, US-CERT has a wealth of information on vulnerabilities, threats, et cetera. NIST has specific guidelines you can look at as well. And the CSA publishes the Treacherous Twelve. I shouldn't say every year; they only started doing it a few years ago, and they've changed the name of the top-threats list over time. I believe it was previously called the Notorious Nine; now it's the Treacherous Twelve. Once again, just look at what is going on in the industry; there are plenty of great resources out there. Other areas of concern could be trust; legal liability is a big deal now as well. Authenticity, key management, and data locality are all areas you may need to look into. For example, if you have certificates that are not valid, your user base could certainly get some kind of warning: "Do you want to proceed? This certificate has expired," et cetera. You need to look at those issues individually. Make sure you identify resources to assist with troubleshooting, like US-CERT, the CSA, et cetera. Take a look at what's going on out in the industry and try to correlate that to your environment. The CSA's Treacherous Twelve is one of the things you won't see on the test, but I put it there because it's a very good list and worth looking at from a knowledge standpoint.

  3. Troubleshooting Automation

When it comes to providing cloud services, connecting your hybrid cloud, provisioning, storage, or whatever you're doing, automation and orchestration are typically required. One of the challenges with automation and orchestration is that there will almost certainly be some significant workflow processes that you will need to understand in order to determine which configuration may be the issue. Why is this blueprint not performing the way it should? Why does this workflow not perform the way you expect it to?

A lot of variables could go into troubleshooting. One of the challenges is also understanding the vendors' capabilities. For example, whether you use AWS, Google, or Oracle, they have deployment scripts to use, SOPs you could follow, and tonnes of reference material to look at to determine how, when, and where those workflows should occur. So with that said, do look at the vendor's capabilities as well. I know we discussed this in an earlier module, but I bring it up again because this is a common source of confusion for some folks: automation is when you automate a single task without human intervention, and orchestration is the arranging of those automated tasks. So with that said, let's proceed to workflows. Once again, here's another term you want to know. A workflow is used to create that automated and orchestrated sequence of services and processes; an example is AWS Simple Workflow, as we talked about. Realise that a workflow relies on each step. So what could happen is that during the workflow, you get to the step that creates the VM and installs the software, but for some reason the workflow stops before the software gets installed.
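The dependency between workflow steps can be sketched in a few lines of shell. This is a toy model, not any vendor's engine: each automated task is a function, and `set -e` makes the workflow halt at the first failed step, which mirrors how an orchestration engine stops a blueprint when a step fails.

```shell
#!/bin/sh
# Minimal workflow sketch: ordered automated tasks, fail-fast semantics.
set -e   # any step exiting non-zero stops the whole workflow

create_vm()        { echo "step 1: VM created"; }
install_software() { echo "step 2: software installed"; }
configure_app()    { echo "step 3: application configured"; }

create_vm
install_software   # if this step failed, configure_app would never run
configure_app
echo "workflow complete"
```

When troubleshooting, this is why you work backwards from the step that did not happen: the failure is in that step or the one immediately before it.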

So you need to look at why that is. And this is a very simplistic workflow; usually the workflows are much longer, but as an example, it does the job. When it comes to troubleshooting deployment issues with automation, a number of different issues could come up. Accounts and permissions are a very common one, especially if you have a service account, for example. A service account is typically used for server-to-server communication, such as when a source drops files on a target, or when the source requests that a batch job be run on the target. Those permissions could be incorrect, or the accounts could be expired. Change is another concern: if you change names or IP addresses, all of those will definitely affect the deployment templates. They'll affect the workflows as well, because if there's anything wrong with any of the templates, you can expect your workflows not to complete properly, or if they do complete, maybe they're not configured the way you expect. There are also tool sets and APIs you need to look at, as well as location changes. These can certainly occur too, especially if you move VMs around; this is pretty common and will typically happen if you move from one zone to another.

Workflows are another thing you should consider. Review scripts, review log reports; logs can be really good indicators of what's really going on. For example, if you get an HTTP 404 error, then you know that web page just isn't being reached. Well, why is that? Is there some kind of routing issue? A configuration issue? A networking issue? A problem with the application? These are things you want to start looking into. Look at support and documentation, and contact the vendor if you need to. Automation is the completion of a single task without human intervention. Here's an exam tip: make sure you know how to troubleshoot high-level automation issues like workflows and orchestration. Once again, the main thing to take from this module is the difference between orchestration and automation, but also understand that a workflow depends on both. Now on this exam, I saw one question out of 100 that basically asks you to look at a workflow and try to determine what the failure was. That's pretty much it. Again, I just want to make sure you know where to start and that you know the difference between the terms, because they may ask a question about orchestration and not automation, and if you mix them up, you'll definitely go the wrong way in the answer choices.
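The log-review step above can be made concrete with a quick 404 hunt. The log lines here are canned samples in a common access-log shape; on a real system you would point the same awk filter at your web server's access log.

```shell
#!/bin/sh
# Count 404 responses in a (canned) access log and show which URLs they hit.
log='10.0.0.1 - - [10/Jan/2024:10:00:01] "GET /index.html HTTP/1.1" 200 512
10.0.0.2 - - [10/Jan/2024:10:00:02] "GET /missing.css HTTP/1.1" 404 209
10.0.0.3 - - [10/Jan/2024:10:00:03] "GET /missing.css HTTP/1.1" 404 209'

# The status code is the next-to-last field in this log shape.
count=$(printf '%s\n' "$log" | awk '$(NF-1) == 404 { n++ } END { print n + 0 }')
echo "404 responses: $count"

# Which paths are failing? (field 6 is the request path here)
printf '%s\n' "$log" | awk '$(NF-1) == 404 { print $6 }' | sort | uniq -c
```

A burst of 404s on one path usually points at a deployment or configuration issue (a missing file, a wrong document root) rather than a network problem.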

  4. Network Connectivity Tools

Troubleshooting networks. One of the challenges as a cloud architect or cloud administrator, whatever your role is going to be, is identifying network issues. This module is just one of several where we touch on various network issues, like troubleshooting commands to use, latency, et cetera. Let's go ahead and discuss this objective about troubleshooting networks. When it comes to troubleshooting network configurations, one of the challenges is to understand that your network needs resources as well; essentially, you have to plan those resources. What I mean is that perhaps over time the bandwidth is no longer acceptable. Or, what is fairly common: let's say you're using AT&T, Verizon, or Unisys, whoever your provider is.

What can happen is that the provider may be doing maintenance, and they may change a protocol or update routes. I've seen situations where the provider updated routes and blew away the latency that the customer was getting. This can happen pretty routinely. So your job as a cloud guru is to understand these different facets and areas; you'll have to troubleshoot to identify the issues. And when we talk about hybrid clouds, this adds more complexity. With hybrid clouds, we're going to have several services that are typically orchestrated and integrated as well, and that could be everything from Active Directory to LDAP to DNS to NTP, you name it, logging services, and so on. This definitely adds to the complexity, and it's an area that I think could use more attention in the cloud world but doesn't get it. With that said, I just want to clarify the objectives and make sure you know what to look for when you're studying for the exam.

For example, you could use a resource like CloudHarmony.com to help you identify issues around latency with a certain provider, or to see if there is an outage with the provider. That's one of the toolkits out there you could use. Bandwidth can be a problem as well. I've seen organisations that don't really plan accordingly for bandwidth. For example, if everything's in house one day and the next day they put services in the cloud, they may have had enough bandwidth with that ten-gig link or whatever, but now maybe they don't.

So you need to identify: is bandwidth an issue? Is it quality of service? Is latency an issue? Is jitter an issue? These are things you want to look at. On the Cloud+ exam, here's what I saw: you'll likely see two questions on troubleshooting deployment issues, and they'll focus mainly on the network issues we're talking about. You'll have to identify how to resolve those deployment issues, for example, misconfigured templates and images. It's more than possible that when you deploy an image, you don't realise that you need to update the networking, or that you deployed the wrong image to the wrong subnet or zone. You could also exceed cloud provider limits.

All the cloud providers have default limits. Amazon, for example, has very strict limits on their VPCs. Google Cloud has pretty strict limits on API usage. There are many limitations to look for with a cloud provider. Take a look at the Google Cloud quotas, for example: go straight over there, and it'll tell you the number of resources you're using and what the quota is. When it comes to tool sets and commands, you should first validate your network configuration. Make sure you look at OS tool sets and cloud vendor tools, and look at routing and IP configurations. Some of the networking tests you could do are covered here; this is not an exhaustive list, just what you can expect on the exam. You do want to know what ipconfig is; make sure you know what it's used for. You want to use it to validate your IP configuration. Another is netstat. Netstat is a really good tool to validate network statistics. The two switches I'd like to make sure you go look at are netstat -a and netstat -n; I'm going to leave it up to you to really dig into what those switches are for, but you'll likely see one of them on the exam. Next is ping. Ping uses ICMP and is probably the easiest networking tool to use. The goal of ping is to determine round-trip time and latency.

So, for example, if you had a question on the exam asking what tool you're going to use to determine latency, ping could be the right choice, depending of course on the question and the answers. ICMP, of course, is what ping uses to determine round-trip time and latency; again, know this for the exam. Traceroute is another ICMP tool; it maps the packet route to a destination, so traceroute is also a good tool to determine the number of hops. Now, in the networking world, a hop is essentially another router or switch the packet traverses on the way to the destination, and once you get over ten or twelve hops, the latency can jump dramatically. In general, the fewer the hops, the better, and the less latency you'll typically experience. Once again, these are tools that can help you identify issues. Next is DNS and nslookup. Make sure you know what nslookup is; it's going to help you correlate a domain name to an IP address. Then there's dig, which stands for domain information groper.
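The two numbers those tools give you, round-trip time from ping and hop count from traceroute, can be pulled out with a little awk. Canned sample output keeps this deterministic; on a live host you would feed in real `ping -c 4 <host>` and `traceroute <host>` output instead.

```shell
#!/bin/sh
# Extract the average RTT from a ping summary line.
# Format: rtt min/avg/max/mdev = <min>/<avg>/<max>/<mdev> ms
ping_summary='rtt min/avg/max/mdev = 12.031/13.456/15.902/1.210 ms'
avg_rtt=$(printf '%s\n' "$ping_summary" | awk -F'/' '{ print $5 }')
echo "average RTT: ${avg_rtt} ms"

# Count hops in (canned) traceroute output: one line per hop.
trace='1  10.0.0.1      1.2 ms
2  68.1.2.3      8.9 ms
3  72.4.5.6     14.1 ms
4  93.7.8.9     22.7 ms'
hops=$(printf '%s\n' "$trace" | awk 'END { print NR }')
echo "hops to destination: $hops"
```

Splitting the summary line on `/` lands the average in the fifth field; counting traceroute lines works because each hop prints exactly one line in this simplified sample.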

Dig will help you gather more detailed DNS information. Next is ARP. ARP is used to correlate a MAC address with an IP address. Now, a lot of people will confuse ARP with netstat, so I would like you to go back and make sure you know what netstat is used for. Let's just review it, because this is where the confusing part can come in: netstat is for network statistics; that is its main goal. There are switches you could use with netstat to get information such as addresses, routing tables, and which ports are open. But ARP is really mainly used for correlating a MAC address with an IP address. So arp -a is going to help you view your local ARP table; once again, arp -a views your local table. Some vendors support having a virtual private cloud.
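As a quick name-resolution sketch to go with nslookup and dig: `getent hosts` asks the same resolver the OS uses, so it works even where the BIND utilities aren't installed (note that `getent` is a Linux/glibc tool, an assumption here).

```shell
#!/bin/sh
# Resolve a hostname to an IP the way nslookup/dig would.
# `getent hosts` consults the OS resolver (hosts file + DNS).
getent hosts localhost            # e.g. "127.0.0.1  localhost"

# Equivalent lookups, and the ARP table, on a live host:
#   nslookup localhost
#   dig localhost +short
#   arp -a        # view the local IP-to-MAC table
```

The commented commands are the ones named in the exam objectives; the `getent` line is just a dependency-free way to demonstrate the same resolution step.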

Amazon is a good example of that. These virtual private clouds are going to have limits and thresholds. For example, in AWS you have specific limits on NAT, and you have specific limits on routing as well as the distribution of your CIDR pool. So just make sure you're aware of what those limits are, and, like I said, make sure you know the limitations of features and scalability. You're not going to get tested on the specific limits within a VPC, so don't worry about that. They may ask you what a VPC is, but they will not associate it with Amazon; there's no direct relationship to AWS or anything else on the exam. But they can ask you what a VPC is, so just be aware of that. Back to ping: it's used for testing connectivity. Once again, if you wanted to determine the round-trip time from source to destination, ping is an ICMP test that will help you determine the round-trip time and the latency. Here's an exam tip: you want to use ARP to correlate a MAC address with an IP address.

  5. Cloud Attacks

On the exam there is going to be a question on types of cloud attacks, and it will ask you to determine the type of attack, but also how to possibly resolve that type of attack. So let's go ahead and talk about some common cloud attacks you may run into. A cloud attack can occur due to vulnerabilities, and threats can occur inside and outside of your cloud. The following are some examples of common cloud attacks and vulnerabilities: DDoS (distributed denial of service), VM hopping, data bleeding, and compromised credentials. Now, there's certainly a lot more than just this, but for this exam, the objectives really focus on these types of vulnerabilities and attacks. Let's go ahead and talk about what a DDoS is.

A DDoS is a type of attack where multiple compromised systems, which are often infected with a Trojan, are used to target a single system, causing a denial of service. In general, you want to mitigate these types of DDoS attacks by using load balancing, firewalls, and traffic filtering. Those are generally the best ways to mitigate a DDoS attack. Mitigation won't stop it outright, but remember that mitigation is really the key in this case. VM hopping, when it comes to virtual machines, is generally not as much of a concern as it used to be, but it still could be. So if you're running infrastructure as a service or a private cloud, this may be something you need to be a little concerned about, and you should validate that you have all the vendor patches, for example. VM hopping is an attack method that exploits the hypervisor's weaknesses and allows a virtual machine to be accessed from another virtual machine. This is also known as "hyperjacking" or "VM guest hopping." You want to mitigate this with vendor patches, as well as proper uplink and VLAN configuration. Generally, when it comes to VMware, most of the patches for vSphere, for example, already address issues like this, so it should not be much of an issue. But again, stranger things have happened.
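A common first step in the traffic-filtering approach above is simply finding which source IPs dominate the traffic, so they can be rate-limited or blocked. The log lines here are canned stand-ins for a real access log; the pipeline is the standard sort/uniq idiom.

```shell
#!/bin/sh
# DDoS triage sketch: rank source IPs by request count.
# The canned log stands in for a web server access log.
log='203.0.113.9 GET /
203.0.113.9 GET /
203.0.113.9 GET /
198.51.100.4 GET /
203.0.113.9 GET /'

# Field 1 is the client IP; count occurrences, busiest first.
printf '%s\n' "$log" | awk '{ print $1 }' | sort | uniq -c | sort -rn
```

An IP sending orders of magnitude more requests than the rest is a candidate for the firewall or load-balancer filter rules, though a real distributed attack spreads across many sources, which is exactly what makes DDoS hard to filter.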

Data bleeding is a vulnerability that I have found a lot of people aren't even aware of. Data bleeds generally occur when you have code that hasn't been written completely securely, and data can be retrieved from cloud resources such as storage. So, for example, if you're accessing a service like Amazon S3, it's possible that the code developed for your own private application could have some HTML calls that are incorrect, or something of that nature. With that said, it's pretty rare to run into issues like these.

There was a fairly well-known vulnerability that I believe occurred with Box a couple of years ago, where some folks using the Box application didn't realise that their information was literally being bled away. So this can happen; just be aware that there are vulnerabilities out there. The way to mitigate this issue is to scan for the vulnerability and also ensure your developers are writing stronger code. Next, compromised credentials. This can generally occur as a result of poor identity and access management, typically where password rule sets are not enforced, such as lockouts after a certain number of failed attempts. Generally, this is something that can be mitigated pretty easily, especially if you use better techniques like multi-factor authentication and encryption, and certainly have strong user credential policies. Again, if you follow best practises and have excellent security personnel, this should not be an issue. DDoS: remember what DDoS is.

DDoS is an attack where you have compromised systems that can be located anywhere in the world. This is typically how a botnet is set up: the infected machines carry what's called a Trojan, basically a rogue programme that takes over resources on that bot or node and sends pings of death or other service calls to try to interrupt the service of a website, for example. It's a very common type of attack; pretty much anyone who has been on the web for any length of time has probably experienced one in some way or another, and if you're a large corporation, that would definitely be the case. Test tip: make sure you know the common attack types. Keep in mind DDoS, compromised credentials, and data bleeding; you'll likely see one of those on the test, so be prepared.
