Chapter 10 Ensuring High Availability for vSphere Clusters and the VCSA

2V0-21.19 EXAM OBJECTIVES COVERED IN THIS CHAPTER:

  • Section 1 - VMware vSphere Architectures and Technologies

    • Objective 1.2 - Identify vCenter high availability (HA) requirements
    • Objective 1.6 - Describe and differentiate among vSphere HA, DRS, and SDRS functionality
    • Objective 1.9 - Describe the purpose of a cluster and the features it provides
  • Section 2 - VMware Products and Solutions

    • Objective 2.2 - Describe HA solutions for vSphere
  • Section 4 - Installing, Configuring, and Setting Up a VMware vSphere Solution

    • Objective 4.2 - Create and configure vSphere objects
  • Section 7 - Administrative and Operational Tasks in a VMware vSphere Solution

    • Objective 7.13 - Identify and interpret affinity/anti-affinity rules

Ensuring that your resources are available when needed is a key requirement in a datacenter. In any environment, it takes a comprehensive plan to ensure availability, with participation from each of the teams with responsibilities in the datacenter.

The infrastructure team needs to make sure there are no single points of failure; the development team needs to build resilient and self-healing applications; and the operations team needs to track, trend, and anticipate potential issues before they can cause outages.

The vSphere team members are responsible for ensuring that the system is designed for resiliency, updates are applied regularly, and the system is configured to take advantage of the availability features built into vSphere. This chapter will address two of those availability features, the High Availability (HA) option for vSphere clusters and the vCenter High Availability solution for the VCSA.

Configuring vSphere Cluster High Availability

When you create clusters in vSphere, there are two primary options to enable: Distributed Resource Scheduler (DRS) and High Availability (HA). The primary purpose of DRS is to balance workloads among hosts in a cluster, which is discussed in Chapter 6, “Allocating Resources in a vSphere Data Center.” High Availability allows virtual machines to be restarted in the event of problems with the host or in the virtual machine.

All vSphere licenses at Essentials Plus and above include High Availability. The primary benefits of High Availability are recovery from the following scenarios:

  • A host failure, by restarting virtual machines on other hosts
  • A virtual machine failure, by restarting the VM
  • A storage loss on a host, by restarting virtual machines on other hosts
  • A network loss on a host, by restarting virtual machines on other hosts

You can also configure Proactive HA, which gives you both manual and automatic options for evacuating virtual machines from hosts whose health has degraded.

The only requirements to enable HA are licensing (Essentials Plus or higher) and a cluster object. No hosts or virtual machines are required to configure HA, although you will not be able to test your settings without them!

High Availability for a cluster can be enabled using the Edit Cluster Settings wizard (Figure 10.1), which is accessed from the vSphere DRS or vSphere Availability options under the Configure tab of the cluster.

FIGURE 10.1 Enable High Availability on a cluster.
Screenshot_596

HA Failures and Responses

Figure 10.1 shows the availability failures and responses available from High Availability, and the single green check shows that by default it will only protect against a host failure. To recover from other scenarios, additional configuration is required.

Proactive HA allows vCenter to move virtual machines off a host that has reported health degradation via a third-party provider. Many major server vendors, including Dell, Lenovo, and Cisco, have providers available for their hardware and might have requirements such as licensing or homogeneous hosts in a cluster.

As shown in Figure 10.2, there are a few options available for Proactive HA.

When Proactive HA is enabled, it will default to the Manual automation level with Quarantine mode. Manual mode will only provide suggestions; you need to manually move virtual machines off flagged hosts. Automated mode will leverage vMotion to automatically move virtual machines off troubled hosts.

FIGURE 10.2 Configure Proactive HA.
Screenshot_597

The remediation options allow you to adjust when virtual machines are migrated:

  • Quarantine Mode for All Failures allows virtual machines to migrate as long as performance is unaffected.
  • Maintenance Mode for All Failures moves all virtual machines off any host with a failure.
  • Mixed (the middle option) sets quarantine mode for hosts with moderate issues and maintenance mode for hosts with severe failures. You can adjust what types of failures (such as redundant PSU or fan) are treated as moderate or severe.

The types of failures detected are dependent on the Proactive HA provider. Contact your hardware vendor for availability, licensing, and installation.

Host Isolation

Host isolation is when a host stops receiving HA heartbeats and cannot access the isolation address. Hosts in an HA cluster communicate continuously over any network with management traffic enabled. If a host does not receive any communications in 12 seconds, it attempts to ping any configured isolation address (by default, the gateway IP address of the management network). If no response is received from the isolation address, the host will check the HA folder in the heartbeat datastore. If the host determines that it is isolated, the host will initiate the host isolation response (Figure 10.3).
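The detection sequence above can be sketched as simple decision logic. This is an illustration of the described behavior only, not actual FDM code; the function name is made up for the example, and the heartbeat-datastore step is summarized in the comments.

```python
# Illustrative sketch (not FDM internals) of the isolation detection
# sequence a host follows, per the text above. If the isolation address
# responds, the host is not isolated even though HA heartbeats are lost;
# if nothing responds, the host checks the heartbeat datastore and then
# triggers its configured isolation response.
def host_isolation_check(heartbeats_received, isolation_address_responds):
    """Return what the host concludes after the 12-second heartbeat timeout."""
    if heartbeats_received:
        return "connected"
    if isolation_address_responds:
        return "not isolated"  # HA traffic lost, but the network is up
    return "isolated: trigger isolation response"

print(host_isolation_check(False, False))  # isolated: trigger isolation response
```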

FIGURE 10.3 Configure host isolation response.
Screenshot_598

The default response is Disabled, which means a host will not react to being isolated. The other options can either power off the VMs or initiate a guest shutdown. If the uptime of the guests is a priority and they can handle being powered off, the Power Off and Restart VMs option will get the virtual machines back up faster as there is no wait time for the guest to gracefully power down.

The isolation response is governed by the Fault Domain Manager (FDM) agent, which runs on each host in the cluster. The FDM agents elect a master host to act as a primary point of contact with vCenter and the subordinate hosts, which are all of the other hosts in the cluster.

The master host listens for a heartbeat message from each subordinate host, which are sent every second. If a subordinate stops sending heartbeat messages, the master host will check the heartbeat datastore(s) for entries from the failed host. If the master determines that the subordinate is isolated from the management network and is not updating the heartbeat datastore, the master will initiate restarting the subordinate's virtual machines on other hosts in the cluster. If the master determines that the subordinate is isolated from the management network but is still updating the heartbeat datastore, the master host will watch for the virtual machines on that host to be powered off before attempting to restart them on other hosts in the cluster.
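The master's handling of a silent subordinate can be summarized the same way. Again, this is a sketch of the behavior described above, with hypothetical names, not VMware's implementation.

```python
# Illustrative sketch of the master host's decision logic when a
# subordinate stops sending network heartbeats.
def master_response(network_heartbeat: bool, datastore_heartbeat: bool) -> str:
    """Return the action the master takes for a subordinate host."""
    if network_heartbeat:
        return "none"  # subordinate is healthy; no action needed
    if not datastore_heartbeat:
        # Neither heartbeating nor updating the heartbeat datastore:
        # treat the host as failed and restart its VMs elsewhere.
        return "restart VMs on other hosts"
    # Isolated but alive: watch for its VMs to be powered off by the
    # isolation response before restarting them on other hosts.
    return "wait for isolation response, then restart VMs"

print(master_response(False, False))  # restart VMs on other hosts
```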

Heartbeat Datastores

The heartbeat datastores are selected automatically by default, but as seen in Figure 10.4, you have the option of specifying datastores to use. You can also specify datastores to use while allowing HA to choose additional datastores if needed. While it is best to use datastores available to all of the hosts, at a minimum each datastore must be accessible by at least one host; by default, the master designation will go to the host with the most datastores connected. High Availability will display a warning message if there are fewer than two datastores available or manually selected.

FIGURE 10.4 Configure heartbeat datastores.
Screenshot_599

If you are on a converged network, where your management, virtual machine, and storage traffic share the same NICs and switches, a loss of management traffic likely means VM network or storage loss as well, so shutting down the running virtual machines is a reasonable isolation response. On the other hand, if your management network is completely separate, you may want to leave the virtual machines running by disabling the host isolation response.

Advanced Options

There are a few common advanced settings for host isolation that are accessible under the Advanced Options section of the Edit Cluster Settings wizard (Figure 10.5):

  • das.ignoreRedundantNetWarning: This setting prevents a warning from being displayed if there is not a second HA network or NIC. This can be set to true for test/dev environments with limited networks.
  • das.usedefaultisolationaddress: When set to false, this setting prevents the FDM from using the default gateway of the management network to test for host isolation.
  • das.isolationaddress[0-9]: These settings (up to 10) set specific addresses to use for isolation response in addition to the default gateway if it is not disabled.
  • das.isolationshutdowntimeout: This setting adjusts the amount of time the host waits for a virtual machine to shut down before it issues a power off command. This setting is only used if a response of “Shut down and restart” is used.
FIGURE 10.5 Configure advanced settings for host isolation.
Screenshot_600
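As an illustration, the settings from this list might be combined like the following key/value pairs, expressed here as a Python dictionary. The IP addresses and timeout value are placeholders for the example, not recommendations.

```python
# Hypothetical example of the advanced options discussed above, shown as
# the key/value pairs you would enter in the Edit Cluster Settings wizard.
ha_advanced_options = {
    "das.ignoreRedundantNetWarning": "true",    # suppress single-NIC warning in a lab
    "das.usedefaultisolationaddress": "false",  # don't ping the default gateway
    "das.isolationaddress0": "192.168.10.1",    # first custom isolation address
    "das.isolationaddress1": "192.168.20.1",    # second custom isolation address
    "das.isolationshutdowntimeout": "300",      # seconds to wait for guest shutdown
}

# Up to ten custom isolation addresses are supported (indexes 0-9).
custom_addresses = [v for k, v in sorted(ha_advanced_options.items())
                    if k.startswith("das.isolationaddress")]
print(custom_addresses)  # ['192.168.10.1', '192.168.20.1']
```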

Up and Down: Know Your Network Before Setting Isolation

I was called out to help a company that was experiencing virtual machines randomly powering off overnight. While putting this story right in the middle of the network isolation section gives a big clue as to why, it took several hours of troubleshooting to pin it down at the time.

The client had a three-node cluster and one of the hosts (not always the same one) would have a few (but not all) virtual machines powered off at night, right around the time of the systemwide backup. Some nights there would be no virtual machines powered off, but more often than not, a few would be powered off.

One way to verify that the FDM agent is shutting down your VMs is to check the FDM log file /var/log/fdm.log on the host the virtual machines were running on for lines such as these:


  [LocalIsolationPolicy::TerminateVms] Terminating 1 vms
  [LocalIsolationPolicy::DoVmTerminate] Terminating /Tiny01.vmx
  [InventoryManagerImpl::MarkVmPowerOff] Adding /Tiny01.vmx to powering off set
  [LocalIsolationPolicy::HandleNetworkIsolation] Done with isolation handling

It turned out the network team had initiated reboots of some of the network equipment, including the switches the hosts plugged into (which prevented the subordinates from talking to the master) and the firewall that acted as the default gateway. The devices didn't always take the same amount of time to reboot; they usually took just long enough to trigger the isolation response, but the network would be available again before the isolation response completed.

To compound matters, they had disabled the datastore heartbeat during recent storage maintenance.

In the end, we reconfigured the datastore heartbeat and worked with the networking team to stagger the network switch reboots and identify IPs of devices that would respond to pings and always be available.

Configuring VMCP

High Availability can also protect against a host losing access to datastores in use by a virtual machine. Virtual Machine Component Protection (VMCP) refers to the capability for HA to respond to a Permanent Device Loss (PDL) or All Paths Down (APD) event. Figure 10.6 shows the response options for both scenarios.

FIGURE 10.6 Configure VMCP options.
Screenshot_601

Whether a storage loss is considered PDL or APD depends on the response from the storage device.

  • A PDL event occurs when the storage device responds to an I/O request with a SCSI sense code indicating the storage device is no longer available. This indicates to the host that the storage will not be coming back and requires virtual machines to be powered off before they can be recovered.
  • An APD event occurs when the storage device either does not respond at all or returns any response other than the PDL sense code indicating the device is gone. The host cannot tell whether the loss is permanent or transient.

NOTE

VMware's Knowledge Base article 2004684 lists the SCSI sense codes that will trigger a PDL state.

Permanent Device Loss Event

As PDL is the storage provider stating that the requested device is gone, the only options are to either issue an event message or power off affected virtual machines so they can be restarted on a host that can still access the storage device. If you have virtual machines with storage on multiple sources (OS on one datastore, data on another, logs on a third), you might not want to power off the virtual machines if one of the devices becomes unavailable. As shown in Figure 10.7, you can set per-virtual-machine overrides for any of the monitored conditions, including PDL.

All Paths Down Event

The possible responses to an All Paths Down event are more varied because it is not clear why the storage is unavailable or when it might return. As with PDL, the choices for APD events are to issue events or to power off virtual machines. However, there are two options for powering off virtual machines: conservative and aggressive. Conservative will power off VMs only if other hosts are available with connections to the datastore. Aggressive will start powering off the affected virtual machines without first checking whether other hosts can access the datastore.

While conservative should be the choice for most environments, there are scenarios, such as a stretched cluster, where aggressive could be a valid choice. If storage can become isolated along with some of the hosts in the cluster, aggressive may be appropriate because a remote host has no way to verify whether those hosts still have storage connectivity.
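The conservative versus aggressive choice can be summarized as a one-line rule. This is illustrative logic for the behavior described above, not vSphere internals.

```python
# Sketch of the conservative vs. aggressive APD power-off decision.
def apd_power_off(policy, other_host_has_access):
    """Conservative powers off a VM only when another host can take it."""
    if policy == "aggressive":
        return True               # power off without checking other hosts
    return other_host_has_access  # conservative: only if a host can run it

print(apd_power_off("conservative", False))  # False: VM is left running
print(apd_power_off("aggressive", False))    # True: VM is powered off anyway
```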

Monitoring Virtual Machines

The final failure vSphere High Availability addresses is virtual machine failure, monitoring both the operating system and running applications (Figure 10.8). Virtual machine monitoring requires VMware Tools to be installed and running on the virtual machine. VMware Tools sends a “heartbeat” signal to the host periodically. If the host does not receive a heartbeat in a certain amount of time and no I/O activity is observed within 120 seconds, the virtual machine will be reset. You can change the amount of time the host will wait for I/O activity, or disable the I/O check, using the das.iostatsinterval advanced option.

FIGURE 10.7 Override VMCP settings for a virtual machine
Screenshot_602

Applications that support vSphere application monitoring can also send heartbeats to the host. If the application heartbeat is not received within the time period, the guest will be restarted.

Heartbeat monitoring defaults to High sensitivity. If the guest has been sending heartbeats for 120 seconds and no heartbeats are received for 30 seconds, the recovery will proceed. A virtual machine will be reset no more than three times and not more than once in a 1-hour period. You can use the slider shown in Figure 10.8 to select Low, Medium, or High sensitivity (the grayed-out Custom section shows which settings correspond to each preset), or choose Custom and set specific time limits.
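The reset-throttling rule can be sketched as follows. The preset windows below follow the commonly documented vSphere defaults (Low: 7 days, Medium: 24 hours, High: 1 hour); verify them against your version, as they are assumptions here rather than quoted from this chapter.

```python
# Sketch of the reset-throttling rule: a VM is reset no more than
# `max_resets` times within the sensitivity preset's reset window.
PRESETS = {
    "low":    {"failure_interval": 120, "reset_window": 7 * 24 * 3600},
    "medium": {"failure_interval": 60,  "reset_window": 24 * 3600},
    "high":   {"failure_interval": 30,  "reset_window": 3600},
}

def reset_allowed(reset_times, now, sensitivity="high", max_resets=3):
    """Allow a reset only if fewer than max_resets occurred in the window."""
    window = PRESETS[sensitivity]["reset_window"]
    recent = [t for t in reset_times if now - t < window]
    return len(recent) < max_resets

# Three resets in the last hour at High sensitivity: a fourth is refused.
print(reset_allowed([100, 200, 300], now=400))   # False
print(reset_allowed([100], now=400))             # True
```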

As with VMCP settings, you can override the cluster settings on a per-VM basis (Figure 10.9).

FIGURE 10.8 Configure virtual machine monitoring.
Screenshot_603
FIGURE 10.9 Configure monitoring for a specific virtual machine.
Screenshot_604

Admission Control

One key concept with High Availability is admission control: admitting virtual machines into the cluster in such a way that capacity is guaranteed in the event of a failure. As shown in Figure 10.10, the default configuration is to reserve enough capacity to tolerate one host failure and to calculate the capacity needed by using a percentage of the resources in the cluster.

FIGURE 10.10 Configuring admission control
Screenshot_605

With admission control enabled, you need to have at least two hosts available in the cluster (powered on and not in maintenance mode) or no virtual machines will be allowed to power on because in the event that one host is unavailable, there will be no resources left.

Admission control works by providing vCenter with a capacity check before virtual machines power on. If the admission control calculations indicate that sufficient resources will be available in the event of a failure, the virtual machine is allowed to power on. The options available for defining host failover capacity are as follows:

  • Slot Policy: Uses a virtual machine “slot size” to determine how many slots are available on the hosts and ensure that number of slots will be available in the event of a failure.
  • Cluster Resource Percentage: Uses a percentage of resources to hold in reserve.
  • Dedicated Failover Hosts: Specifies hosts to hold in reserve. No VM will run on these hosts unless another host fails.

Slot Policy

Slot Policy (listed as Host Failures to Tolerate in vSphere prior to 6.5) uses the concept of a slot, which is a measurement of a virtual machine using memory and CPU reservation. The basic idea is to determine the “size” of the largest running virtual machine (by memory and CPU), determine how many virtual machines of this size can run on your hosts, and determine how much space needs to be kept free to account for the number of failed hosts you want the cluster to tolerate. See Figure 10.11.

FIGURE 10.11 Configuring Slot Policy
Screenshot_606

The default slot size is calculated by looking at all of the running virtual machines in the cluster for the largest memory reservation (plus overhead) and the largest CPU reservation. The largest values do not need to come from the same virtual machine. If VM A has a 1 GB memory reservation and a 500 MHz CPU reservation while VM B has a 768 MB memory reservation and a 1 GHz CPU reservation, the default slot size will be 1 GB memory and 1 GHz CPU.
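The VM A / VM B example above can be worked through in a few lines. This is a sketch of the calculation as described (taking the per-resource maxima independently), ignoring memory overhead for simplicity.

```python
# Worked example of the default slot-size calculation: the slot takes the
# largest CPU reservation and the largest memory reservation across all
# running VMs, not necessarily from the same VM.
vms = [
    {"name": "VM A", "mem_reservation_mb": 1024, "cpu_reservation_mhz": 500},
    {"name": "VM B", "mem_reservation_mb": 768,  "cpu_reservation_mhz": 1000},
]

slot_mem_mb = max(vm["mem_reservation_mb"] for vm in vms)
slot_cpu_mhz = max(vm["cpu_reservation_mhz"] for vm in vms)
print(slot_mem_mb, slot_cpu_mhz)  # 1024 1000 -> 1 GB memory, 1 GHz CPU
```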

The virtual machines currently running in the cluster are counted to determine the number of slots currently in use. The slot size is then compared to each host to determine how many slots each host can hold. If the hosts have different memory/CPU configurations, the slot counts of the largest hosts are the ones held in reserve, to cover the worst-case failure.

To view the slot size for your cluster along with the total slots, available slots, and failover slots, look at the Advanced Runtime Info in the vSphere HA summary under the Monitor tab of the cluster (Figure 10.12).

FIGURE 10.12 Viewing the slot information
Screenshot_607

You can also manually set the slot size if you feel the default size is not right for your environment, if you have virtual machines with varying reservations, or if you have virtual machines with large reservations that are not always running. If you are using Slot Policy without a fixed slot size, the slot calculations are performed at any virtual machine power state change. This can result in dramatic changes to slot availability if you power on or off a virtual machine with a large reservation. Each virtual machine will consume at least one slot. Virtual machines with reservations greater than the slot size will consume enough slots to cover their reservation. See Figure 10.13, Figure 10.14, and Figure 10.15 for examples.

FIGURE 10.13 A small environment with three small VMs running with no reservations and automatic slot sizes. Note 664 total slots.
Screenshot_608
FIGURE 10.14 A virtual machine with a 3000 MHz CPU reservation is powered on. Total slots in cluster changes to 6.
Screenshot_609
FIGURE 10.15 Slot size is manually set to 1000 MHz. Total slots in cluster are now 20. There are 6 used slots as the large VM is consuming 3 slots.
Screenshot_610
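The arithmetic behind Figures 10.14 and 10.15 can be reproduced as a short sketch: with a fixed 1000 MHz slot size, a VM with a 3000 MHz CPU reservation consumes 3 slots, while each VM with no reservation consumes 1 slot.

```python
import math

def slots_consumed(cpu_reservation_mhz, slot_cpu_mhz):
    """Every VM uses at least one slot; larger reservations use more."""
    return max(1, math.ceil(cpu_reservation_mhz / slot_cpu_mhz))

reservations = [0, 0, 0, 3000]  # three small VMs plus one large VM
used = sum(slots_consumed(r, 1000) for r in reservations)
print(used)  # 6 used slots, matching Figure 10.15
```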

EXERCISE 10.1 Configure a cluster for Slot Policy.

Requires a vCenter server with a datacenter created.

  • Create a new cluster in the datacenter named SlotPolicy and enable HA.

    Screenshot_611
  • Open the HA settings of the new cluster.

    Screenshot_612
  • Set Admission Control to Slot Policy.

    Screenshot_613
  • Configure a fixed slot size of 250 MHz and 256 MB. Set a warning if performance is expected to drop more than 20% during a failure.

    Screenshot_614
  • Open the Advanced Options section, set das.usedefaultisolationaddress to false, and create two new isolation addresses:

    Screenshot_615
  • Click OK to save the settings.

Cluster Resource Percentage

The next choice for admission control is Cluster Resource Percentage, where a specific percentage of cluster resources (CPU and memory) is held in reserve. By default, the percentage is the resources of the n largest hosts divided by the total resources available, where n is the number of host failures to tolerate. You can also manually set the percentage of CPU and memory reserved (Figure 10.16).
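The default percentage calculation can be sketched as follows (CPU only; memory works the same way). The host sizes are hypothetical values chosen to match the three-host example used later in this chapter.

```python
# Sketch of the default Cluster Resource Percentage: reserve the capacity
# of the n largest hosts, expressed as a percentage of the cluster total.
def default_reserve_pct(host_capacities_mhz, failures_to_tolerate=1):
    largest = sorted(host_capacities_mhz, reverse=True)[:failures_to_tolerate]
    return 100 * sum(largest) / sum(host_capacities_mhz)

# Three identical 5000 MHz hosts tolerating one failure -> about 33%.
print(round(default_reserve_pct([5000, 5000, 5000])))  # 33
```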

FIGURE 10.16 Manually setting Cluster Resource Percentage
Screenshot_616

When you view the Advanced Runtime Info in the vSphere HA summary under the Monitor tab of the cluster (Figure 10.17), you will not see any information about running virtual machines as you would with the Slot Policy.

FIGURE 10.17 Viewing Cluster Resource Percentage
Screenshot_617

However, you can use the Resource Reservation tab to view Cluster Total Capacity for CPU and Memory as well as Total Reservation Capacity for each resource (Figure 10.18).

FIGURE 10.18 Viewing total cluster CPU capacity and VM CPU reservations
Screenshot_618

Dedicated Failover Hosts

Using a dedicated failover host is the simplest configuration as it sets up a “hot standby” list. This ensures that a comparable amount of resources is available in the event of a failure, assuming your failover hosts have the same capacity as the largest hosts in the cluster. Virtual machines cannot power up on the specified hosts, and attempting to vMotion a virtual machine to a listed host will result in the message shown in Figure 10.19.

FIGURE 10.19 Message received when migrating a virtual machine to a failover host
Screenshot_619

The downside to dedicated hosts is that you cannot use those hosts for any workloads and you have less flexibility in workload placement. Using dedicated hosts also does not adjust to changing workloads or demands over time. You also need to ensure that the selected failover host has sufficient resources if not all of the hosts have the same specifications. In a cluster where not all hosts have the same CPU type or count or amount of RAM, choosing a smaller host will result in a warning (Figure 10.20).

FIGURE 10.20 Choosing a smaller host can result in a warning.
Screenshot_620

EXERCISE 10.2 Configure a cluster for dedicated hosts.

Requires a vCenter server and two hosts, each with VMs. Assumes hosts are in an existing cluster.

  1. Open the vSphere Availability settings of the cluster.

    Screenshot_621
  2. Enable HA.

    Screenshot_622
  3. Under Admission Control, set the host failover capacity to Dedicated Failover Hosts, select a host for the list, and click OK.

    Screenshot_623
  4. Monitor the host's HA configuration process.

    Screenshot_624
  5. When the host has been configured, attempt to power on a VM on the host.

    Screenshot_625
  6. Verify that the virtual machine cannot power on.

    Screenshot_626

Resource Fragmentation

Resource fragmentation refers to the problem of a virtual machine not being able to boot because, while the capacity to hold the virtual machine exists in the cluster, it does not all exist on one host. Resource fragmentation cannot occur with dedicated failover hosts, or with Slot Policy when the default (automatic) slot size is used. However, with Cluster Resource Percentage, or Slot Policy with a fixed slot size, you can run into resource fragmentation.

In Figure 10.21 we see a cluster of three hosts with 5000 MHz each. If Cluster Resource Percentage is enabled, it will reserve 33% of CPU and 33% of memory. In this example, in a failover scenario, while 33% of the total cluster resources are still available, VMK cannot be powered on on a surviving host because neither host has sufficient resources on its own.

If we used a custom slot size of 500 MHz with our example in Figure 10.21, we could have the same problem. A slot size of 500 MHz allows each host to have 10 slots, with 10 slots total being held in reserve. In the event of a failure, those slots would not necessarily be on the same host-and in this case VMK, which requires 6 slots, would not be able to power on.
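The fragmentation check itself reduces to one comparison: a VM needs its full reservation on a single host. The free-capacity figures below are hypothetical numbers in the spirit of Figure 10.21, not taken from the figure.

```python
# Sketch of the fragmentation scenario: total free capacity is
# sufficient, yet no single surviving host can fit the VM.
def can_power_on(vm_mhz, free_per_host_mhz):
    """A VM needs its full reservation on ONE host, not spread across hosts."""
    return any(free >= vm_mhz for free in free_per_host_mhz)

surviving_free = [1500, 1500]  # 3000 MHz free in total across two hosts
vm_k = 3000                    # VMK needs 3000 MHz on a single host
print(sum(surviving_free) >= vm_k)        # True: cluster-wide capacity exists
print(can_power_on(vm_k, surviving_free)) # False: fragmented, VMK stays off
```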

Performance

For Cluster Resource Percentage and Slot Policy, calculations are made to determine the current and possible load and the amount of resources to reserve in case of a failure. Both of these policies use virtual machine reservations plus VM overhead memory to determine capacity. If you do not have any CPU or memory reservations set for your virtual machines, only the memory overhead of the running machines is used. This results in a very low figure for running capacity and a very high figure for total capacity, which can lead to oversubscribing the environment in the event of a failure.

To help with this, vSphere 6.5 introduced a new setting that is only available when DRS is also enabled: Performance Degradation VMs Tolerate, specified as a percentage (Figure 10.22).

This setting will raise a warning if the virtual machines' performance is expected to degrade more than a set percentage during a failure. If the current CPU or memory utilization is greater than the percentage of reserved capacity specified, a warning will be set.

FIGURE 10.21 Resource fragmentation example showing that virtual machine “VMK” cannot be powered on due to insufficient resources
Screenshot_627
FIGURE 10.22 The Performance Degradation VMs Tolerate setting
Screenshot_628

The VMware documentation gives the formula as follows:

performance reduction = current utilization × percentage set

If current utilization - performance reduction > available capacity, a warning will be issued (Figure 10.23).

FIGURE 10.23 Performance degradation warning
Screenshot_629
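The documented formula can be checked with a short sketch. The utilization and capacity figures are made-up MHz values for illustration.

```python
# The documented warning rule: warn when current utilization minus the
# tolerated performance reduction still exceeds the capacity that would
# remain after a failure.
def degradation_warning(current_utilization, tolerate_pct, available_after_failure):
    performance_reduction = current_utilization * (tolerate_pct / 100)
    return current_utilization - performance_reduction > available_after_failure

# Tolerating 0% degradation: any shortfall after a failure warns.
print(degradation_warning(8000, 0, 6000))   # True
# Tolerating 50%: the same shortfall is accepted without a warning.
print(degradation_warning(8000, 50, 6000))  # False
```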

The white paper “VMware vCloud Architecture Toolkit for Service Providers” (download3.vmware.com/vcat/vmw-vcloudarchitecture-toolkit-spv1-webworks/index.html) suggests when to use each admission control policy:

  • Host Failures Cluster Tolerates admission control policy: When virtual machines have similar CPU/memory reservations and similar memory overheads.
  • Percentage of Cluster Resources Reserved admission control policy: When virtual machines have highly variable CPU and memory reservations.
  • Specify a Failover Host admission control policy: To accommodate organizational policies that dictate the use of a passive failover host, most typically seen with the use of virtualized business critical applications.

For best results, your hosts should all be about the same size with regard to CPU and RAM capacity. You should also either disable Distributed Power Management (DPM) or configure it to have enough hosts running to allow the reserve capacity to survive a host failure. If DPM has consolidated VMs and powered down all other hosts, HA will not have the capacity to recover a failed host.

You also need to keep an eye on any Distributed Resource Scheduler (DRS) rules created. DRS rules that are VM-to-Host “must” rules or rules to separate virtual machines will be enforced by HA during failures unless HA Advanced Options are set.

  • das.respectvmvmantiaffinityrules: Can be set to false to ignore VM anti-affinity rules during a failure
  • das.respectvmhostsoftaffinityrules: Can be set to false to restart a VM on any available host regardless of VM-to-host rules

Please refer to Figure 10.5 for configuring Advanced Options such as advanced isolation or DRS rule options.

NOTE

VMware recommends disabling HA before enabling or upgrading vSAN. Once the vSAN operation is complete, re-enable HA.

vCenter Server Appliance High Availability

With vSphere, the vCenter server provides centralized management, monitoring, and security. Keeping your vCenter server available will ensure that DRS is always working, new VMs can be deployed, and you know where all of your virtual machines are in your environment. With vSphere 6.5, there is a new feature available for the vCenter Server Appliance called vCenter High Availability, which provides a managed active/passive cluster of VCSAs to ensure that VCSA is not a single point of failure.

VCSA HA can be enabled using the web client in the vCenter HA section of the Configure tab of the vCenter server (Figure 10.24).

FIGURE 10.24 vCenter HA in the web client GUI
Screenshot_630

As shown in Figure 10.24, a vCenter HA configuration consists of active and passive vCenter hosts plus a witness virtual machine. The passive and witness appliances are clones of the original VCSA. Once vCenter HA is configured, two of the nodes must be available at all times. If the active node becomes unavailable, the passive node will take over. If the passive or witness node goes down, the active node will continue to run. However, if any two nodes go down, the remaining node will stop responding to requests.
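The two-of-three availability rule described above amounts to a simple majority check, sketched here for illustration (the function name is made up).

```python
# Sketch of the three-node vCenter HA availability rule: the cluster
# serves requests only while at least two of the three nodes are up.
def vcha_serving(active_up, passive_up, witness_up):
    """vCenter HA needs a majority (2 of 3) of nodes to keep serving."""
    return sum([active_up, passive_up, witness_up]) >= 2

print(vcha_serving(True, True, True))    # True: all nodes healthy
print(vcha_serving(False, True, True))   # True: passive takes over
print(vcha_serving(True, False, False))  # False: remaining node stops responding
```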

There are two methods of setting up vCenter HA: Basic and Advanced (Figure 10.25). You should create a new port group to use for the vCenter HA network.

If you use the Basic option, the VCSA must reside in the environment it manages or reside on a cluster managed by a 6.5 vCenter server in the same Single Sign-On domain. As a best practice, the cluster the VCSA is running on should have DRS enabled and at least three hosts. The Basic option will add a NIC to the VCSA before cloning the VCSA twice.

During the Basic wizard, you will be able to set the IP addresses for the vCenter HA network and add IP addresses for the management NIC of the passive appliance. You can also approve where the two new appliances will be created. The wizard will attempt to place them, but if the default options have problems, you will be prompted (Figure 10.26).

FIGURE 10.25 vCenter HA configuration options
Screenshot_631
FIGURE 10.26 Basic option compatibility errors (left) and the issues expanded (right)
Screenshot_632

Correct the issues and set the configuration as needed before continuing (Figure 10.27).

If you use the Advanced option to configure vCenter HA, you must add a second NIC to the vCenter server appliance before starting the Advanced Option wizard. During the wizard, you will be prompted to enter the IP addresses for the passive and witness virtual machines. The last step of the Advanced option is Clone VMs. While on this window, use the Clone to Virtual Machine wizard (Figure 10.28) to make two copies of the VCSA. During that wizard, use the Customize the Operating System option and create a Guest Customization Spec to change the hostname and IP address of NIC0 to the one specified in the Advanced Option wizard. For a detailed walkthrough, visit the following URL:

featurewalkthrough.vmware.com/t/vsphere-6-5/enabling-vcenter-haadvanced/26

FIGURE 10.27 Completing the Basic option process
Screenshot_633
FIGURE 10.28 Use the Clone to Virtual Machine wizard to copy the VCSA.
Screenshot_634

The Basic option will default to appending -peer to the name of the passive appliance and -witness to the witness appliance.

With either configuration, the witness appliance will only be connected to the HA network. You also need to ensure that SSH is enabled on the VCSA before cloning is performed or you will receive the errors shown in Figure 10.29 and/or Figure 10.30.

FIGURE 10.29 Tasks message regarding enabling SSH before the VCSA is cloned
Screenshot_635
FIGURE 10.30 Error message in the Configure vCenter HA window if SSH is not enabled
Screenshot_636

After vCenter HA is configured, you can monitor its status using the Monitor tab (Figure 10.31) or monitor the status and review the configuration using the Configure tab (Figure 10.32).

If a problem is detected, the Monitor tab will suggest some remedies. In this case (see Figure 10.33), the network connections for the passive node have been disconnected.

If the active node fails, the failover process (Figure 10.34) will start and the passive node will assume the hostname and management IP address of the active node (Figure 10.35).

When the failed node returns, it will become the passive node; there is no automatic failback. However, the Configure tab includes an Initiate Failover button that you can use to trigger a failback manually. After vCenter HA is enabled, the Configure tab also offers an Edit button that lets you place vCenter HA in maintenance mode before infrastructure changes that could affect connectivity. This matters because if the active node loses connectivity to both the passive and witness nodes, it will stop responding to requests. You can also disable or remove vCenter HA using the Edit button (Figure 10.36).

FIGURE 10.31 Monitor tab for vCenter HA
Screenshot_637
FIGURE 10.32 The Configure tab after vCenter HA is enabled
Screenshot_638
FIGURE 10.33 The Monitor tab after the passive node has been disconnected
Screenshot_639
FIGURE 10.34 Failover notification from the web client
Screenshot_640
FIGURE 10.35 The passive node has claimed the .201 IP address of the active node.
Screenshot_641
FIGURE 10.36 Four options are provided to edit the configurations after vCenter HA has been enabled.
Screenshot_642

EXERCISE 10.3 Enable vCenter HA and test failover.

Requires a VCSA residing in the cluster it manages.

  1. Ensure that the VCSA has SSH enabled.

    Screenshot_643
  2. Create a new port group named HA Network either on a vDS or on a vSS on each of your hosts and allocate an IP subnet for it. This will be a closed network and will not need routing.

  3. Open the Configure tab for your VCSA and click the Configure button on the Settings → vCenter HA page.

    Screenshot_644
  4. Leave Basic selected and click Next.

    Screenshot_645
  5. Enter the vCenter HA network IP address for the active node and select the port group (network) to use.

    Screenshot_646
  6. Enter the vCenter HA network IP addresses for the passive and witness nodes.

    Screenshot_647
  7. Verify that the selected configuration is appropriate for your environment and check any compatibility errors or warnings.

    Screenshot_648
  8. Double-check your settings and then click Finish to start the clone process.

    Screenshot_649
  9. Monitor the deployment via vCenter and watch the Monitor tab for vCenter HA to complete the process and start replicating.

    Screenshot_650
  10. When vCenter HA shows that its health is Good, suspend the active VCSA.

    Screenshot_651
  11. You can monitor the failover by pinging the public IP of the vCenter server or waiting for the web client to resume.

    Screenshot_652
  12. Verify that the passive appliance has taken over and that vCenter HA is in a degraded state.

    Screenshot_653
  13. Resume the original active node and verify that vCenter HA returns to a Good status.

    Screenshot_654

Summary

In a production vSphere environment, ensuring that the virtual machines are available is a priority. Part of keeping virtual machines available is recovering them from a failed host and reviving virtual machines that experienced a failure. Using vSphere High Availability, you can configure automated recovery from those scenarios.

The primary purpose of vSphere HA is to restart VMs from a failed host on a running host, and this behavior is enabled simply by turning on High Availability. However, HA can also monitor virtual machines (using VMware Tools) and restart a VM if its guest OS or a monitored application stops responding. HA can also recover from storage issues on a host by restarting the affected virtual machines on hosts that are not experiencing the issue.

New to vSphere 6.5, High Availability can also be proactive, leveraging vendor monitoring tools to evacuate virtual machines from hosts experiencing problems such as a failed power supply or overheating.

Also new to vSphere 6.5 is the ability to create an active/passive cluster from a VCSA. With vCenter HA, you can increase the availability of your vCenter server to improve management uptime.

Exam Essentials

Understand vSphere HA and how it is implemented. One of VMware's key features for many years, HA's primary goal is to restart VMs from failed hosts. Know that this is a per-cluster setting and the VMs will experience downtime during an HA recovery.

Know VM Component Protection (VMCP). Know the difference between Permanent Device Loss (PDL) and All Paths Down (APD) and the HA options for each. PDL means the array has reported (via SCSI sense code) that the storage device is no longer available. APD means the storage simply can't be reached by your host. The key difference is that with PDL, the array is reporting that the storage is gone, while with APD the host has no idea why it can't reach the storage. PDL assumes your VMs need to be powered off and restarted on a host that can reach the storage. With APD, you have a conservative option where HA won't stop VMs until it determines that other hosts can restart them.
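As a study aid, the PDL/APD decision logic described above can be modeled in a few lines of Python. This is an illustrative sketch only; the function and parameter names are hypothetical and are not vSphere API identifiers.

```python
# Illustrative model of VMCP responses to storage failures.
# Names here are hypothetical, not actual vSphere settings.

def vmcp_response(failure, apd_policy="conservative", peer_can_restart=False):
    """Return the action HA takes for a storage failure on a host.

    failure: "PDL" (array reports the device is gone) or
             "APD" (host has lost all paths for an unknown reason).
    apd_policy: "conservative" restarts VMs only after HA confirms
                another host can run them; "aggressive" does not wait.
    peer_can_restart: whether HA has confirmed another host can
                      power on the affected VMs.
    """
    if failure == "PDL":
        # The array itself says the device is gone: power off the VMs
        # and restart them on a host that can still reach the storage.
        return "power off and restart VMs"
    if failure == "APD":
        if apd_policy == "conservative" and not peer_can_restart:
            # Conservative option: don't stop VMs before HA determines
            # that other hosts can restart them.
            return "keep VMs running"
        return "power off and restart VMs"
    raise ValueError(f"unknown failure type: {failure}")
```

For example, `vmcp_response("APD")` keeps the VMs running under the conservative policy, while `vmcp_response("PDL")` always powers off and restarts them.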

Describe Proactive HA and know its requirements. Proactive HA is a new feature in vSphere 6.5 and can improve uptime in the environment by preventing VMs from running on suspect hosts. However, it requires your server vendor to provide a monitoring solution compatible with Proactive HA. While an obvious requirement is that all hosts must be from the same vendor, your vendor may have other requirements.

Understand HA admission control. Admission control is there to prevent you from starting more VMs in your environment than can be restarted in the event of host failure(s). Admission control has three methods for calculating how much capacity to reserve for the event of a failure: Slot Policy, Cluster Resource Percentage, and Dedicated failover hosts.
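To make the "capacity to reserve" idea concrete, here is a back-of-the-envelope sketch of the percentage that Cluster Resource Percentage holds back for a given number of tolerated host failures. It assumes all hosts are identical, which real clusters may not be, so treat it as illustrative only.

```python
def reserved_percentage(num_hosts, host_failures_tolerated=1):
    """Percentage of cluster CPU/memory held in reserve so that the
    capacity of the tolerated failed hosts stays free.
    Assumes identical hosts (an illustrative simplification)."""
    if not 0 < host_failures_tolerated < num_hosts:
        raise ValueError("must tolerate between 1 and num_hosts - 1 failures")
    return 100 * host_failures_tolerated / num_hosts

# A four-host cluster tolerating one host failure reserves 25% of
# its resources; an eight-host cluster tolerating two also reserves 25%.
print(reserved_percentage(4))     # 25.0
print(reserved_percentage(8, 2))  # 25.0
```

The same intuition applies to the other policies: each method differs only in how it estimates the capacity a failed host takes with it.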

Know Slot Policy vs. Cluster Resource Percentage. Two of the admission control policies use calculations and VM reservations to determine how much CPU and memory capacity to hold in reserve. Slot Policy uses the largest memory reservation and the largest CPU reservation among powered-on VMs to set a "slot" size. Each running VM takes up one slot, and enough extra slots are reserved to account for a host failure. Cluster Resource Percentage keeps free an amount of resources equal to the CPU and memory capacity of one host.
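The slot-size arithmetic can be sketched as follows. This is a simplified model: real HA also factors in per-VM memory overhead and per-host capacities, so the example below only illustrates the "largest reservation in each dimension" rule.

```python
def slot_size(vms):
    """Slot size = (max CPU reservation, max memory reservation) taken
    across all powered-on VMs. vms is a list of (cpu_mhz, mem_mb) tuples,
    where mem_mb is the memory reservation (or overhead if none is set)."""
    return (max(cpu for cpu, _ in vms), max(mem for _, mem in vms))

# Reservations in the style of the review questions: twenty VMs at
# 500 MHz / 1024 MB, twenty at 750 MHz / 250 MB, five at 3000 MHz / 512 MB.
vms = [(500, 1024)] * 20 + [(750, 250)] * 20 + [(3000, 512)] * 5
print(slot_size(vms))  # (3000, 1024) -> a 3000 MHz / 1 GB slot
```

Note how one VM with a large CPU reservation and a different VM with a large memory reservation combine into a single oversized slot, which is why Slot Policy suits clusters whose VMs have similar reservations.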

Understand vCenter HA architecture. With vCenter HA, there are three virtual machines set up in an active-passive-witness trio. The passive and witness appliances are clones of the active appliance and can be created by the Enable wizard using Basic mode or by the administrator manually if the Advanced option is used.

Be able to describe the requirements for vCenter HA. You can only enable vCenter HA on a vSphere 6.5 VCSA with SSH enabled. If the appliance is not managing itself or is not managed by a vCenter server in the same SSO domain, you must use the Advanced option. You need a separate network for the HA traffic. The Advanced option requires the admin to add a second NIC on the HA network prior to cloning and use the Guest Customization option to change the host name and IPs before the Advanced option wizard is completed.

Review Questions

  1. What is the minimum licensing level required for vSphere High Availability?

    1. Essentials
    2. Essentials Plus
    3. Standard
    4. Platinum
  2. What recovery is provided if no High Availability configuration is performed beyond enabling HA on a cluster?

    1. Host failure
    2. Storage failure
    3. vCenter failure
    4. Virtual machine failure
  3. Which options should be used if the default gateway of the management network does not respond to ICMP? (Choose two.)

    1. das.failuredetectiontime
    2. das.isolationaddress0
    3. das.usedefaultisolationaddress
    4. das.isolationshutdowntimeout
  4. Which admission control policy should be used when regulations require passive failover capacity?

    1. Slot Policy
    2. Cluster resource percentage
    3. Host Failures Cluster Tolerates
    4. Dedicated failover hosts
  5. Which admission control policy should be used when virtual machines have very different reservation settings for CPU and memory?

    1. Slot Policy
    2. Cluster resource percentage
    3. Host Failures Cluster Tolerates
    4. Dedicated failover hosts
  6. Which admission control policy should be used when virtual machines have very similar resource reservations?

    1. Slot Policy
    2. Cluster resource percentage
    3. Host Failures Cluster Tolerates
    4. Dedicated failover hosts
  7. Which admission control policy, when configured by vSphere, could result in resource fragmentation?

    1. Slot Policy (fixed slot size)
    2. Cluster resource percentage
    3. Host Failures Cluster Tolerates
    4. Dedicated failover hosts
  8. What could prevent virtual machines from restarting during a failure in an environment using a Slot size policy that covers all powered-on virtual machines?

    1. A new virtual machine with a very large reservation
    2. Fragmented resources
    3. DRS affinity rules
    4. Performance degradation set to 0%
  9. Which option will set a warning if the environment is anticipated to have insufficient performance during a failure?

    1. Cluster resource percentage set to 0%
    2. Cluster resource percentage set to 100%
    3. Performance degradation set to 0%
    4. Performance degradation set to 100%
  10. An environment has the following virtual machines in a cluster configured with the default Slot Policy admission control.

    • Twenty virtual machines with a 500 MHz CPU reservation
    • Twenty virtual machines with a 750 MHz CPU reservation
    • Five virtual machines with a 3000 MHz CPU reservation

    An administrator cannot power on a new virtual machine. What are two options that could allow the administrator to power on the VM? (Choose two.)

    1. Remove the CPU reservation on the new VM.
    2. Reduce the 500 MHz reservations to 250 MHz.
    3. Reduce the 750 MHz reservations to 500 MHz.
    4. Reduce the 3000 MHz reservations to 750 MHz.
  11. An environment has the following virtual machines in a cluster configured with the default Slot Policy admission control.

    • Twenty virtual machines with a 500 MHz CPU reservation and 1024 MB memory reservation
    • Twenty virtual machines with a 750 MHz CPU reservation and no memory reservation, 250 MB overhead
    • Five virtual machines with a 3000 MHz CPU reservation and no memory reservation, 512 MB overhead

    What is the slot size currently in use?

    1. 500 MHz and 1 GB
    2. 3000 MHz and 1 GB
    3. 3000 MHz and 512 MB
    4. 3000 MHz and 1786 MB
  12. What options are available if a monitored VM appliance stops sending heartbeats?

    1. Restart the VM on a new host.
    2. Power off the VM after confirming that another host has connectivity.
    3. Restart the guest OS.
    4. Restart the application.
  13. What could account for a virtual machine configured for VM monitoring not being restarted after a failure? (Choose two.)

    1. No VMware Tools.
    2. VM failed too quickly.
    3. HA cannot find a host that can access the datastore.
    4. Admission control is disabled.
  14. Which HA technologies require vendor support for implementation? (Choose two.)

    1. Proactive HA
    2. VMCP
    3. Heartbeat datastores
    4. Application monitoring
  15. What components are required for vCenter High Availability? (Choose two.)

    1. VCSA
    2. Load balancer
    3. Windows server
    4. Dedicated network
  16. Which HA failure scenarios will not allow usage of vCenter?

    1. Witness appliance failure
    2. Active appliance failure
    3. Passive or witness appliance failure
    4. Passive and witness appliance failure
  17. What steps are required to be taken manually for the vCenter HA Basic option? (Choose two.)

    1. Add a second NIC.
    2. Enable SSH.
    3. Clone the VCSA.
    4. Create a new network.
  18. What manual steps are required for the Advanced option for vCenter HA? (Choose two.)

    1. Clone the VCSA.
    2. Customize the guest.
    3. Configure PostgreSQL replication.
    4. Enable VMCP.
  19. Which step should be taken prior to initiating infrastructure changes in an environment configured for vCenter HA?

    1. Set the host to maintenance mode.
    2. Suspend the witness and passive nodes.
    3. Initiate a failover before working on the active host.
    4. Set vCenter HA to maintenance mode.
  20. What is the default number of heartbeat datastores per host?

    1. One
    2. Two
    3. Three
    4. Same as the number of hosts in the cluster
  21. How many more virtual machines without a reservation can be started in this environment? (See exhibit.)

    Screenshot_655
    1. 20
    2. 2
    3. 8
    4. No limit
  22. How many slots will a virtual machine with a 200 MB reservation take in this environment? (See exhibit.)

    Screenshot_656
    1. One
    2. Two
    3. Three
    4. Four