Google Professional Data Engineer – Dataproc ~ Managed Hadoop Part 2

  1. Lab: Running The PySpark REPL Shell And Pig Scripts On Dataproc

Dataproc clusters are basically managed Hadoop on the cloud. If you were running Hadoop in a local data center, you would be able to access the cluster manager console on port 8088 of your master node. What do you need to do to make this possible in Dataproc? In this lecture, we’ll talk about how we can set up firewall rules to restrict access to our Dataproc clusters. We have a Hadoop cluster set up on Google Cloud, and in order to administer it from our local machine, we need to set up a firewall rule that allows our local machine to access our Dataproc cluster.

Choose the Networking option in the side navigation bar and it will take you to a page where you can configure your firewall rules. The firewall rule that we’ll set up essentially whitelists our local machine so it can access the Dataproc cluster. Once you go to the main page for firewall rules, you’ll find that a whole bunch of firewall rules already exist.

When you set up VM instances and performed the actions in the earlier labs, one of the side effects was the creation of some firewall rules. Each firewall rule has a name, which you can use to reference it, and a target. The target specifies to whom that rule applies. It can apply to all instances, as in the case of the second entry in that list, or you can specify a tag there. Remember, you can tag resources on Google Cloud, and you can use those tags to formulate your firewall rules as well: you can essentially say that only resources which have these tags set may be accessed.

The source filters are the IP addresses that are whitelisted for each of these rules. The protocols and ports column specifies the software clients that can access these resources and the port numbers those clients are allowed to use. What we essentially want to do is whitelist the IP address of our local machine, the laptop or desktop that you’re working on, so that it can access our cluster on Dataproc.
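If you prefer the command line, you can see this same list of firewall rules with gcloud. This is a minimal sketch; it assumes the Cloud SDK is installed and your project is already set as the default, and the rule name in the second command is just an example of one of the default rules.

  # List all firewall rules in the current project
  gcloud compute firewall-rules list

  # Show the full definition of a single rule (substitute a name from the first column)
  gcloud compute firewall-rules describe default-allow-ssh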

To set this up, you need the IP address of your local machine. First, visit the website ip4.me on your local machine; this will give you the IPv4 address of your machine. Copy this IP address, switch back to your console dashboard, and let’s set up a new firewall rule to whitelist your IP address. Specify a name and a description for your firewall rule. This is useful in case you forget why you set this rule up in the first place.
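If you’d rather not use a browser, you can also get your public IPv4 address from a terminal. This is just a sketch: it assumes curl is installed and uses ifconfig.me as an example third-party service, which may or may not be the one you prefer.

  # Ask an external service to echo back your public IPv4 address
  curl -4 ifconfig.me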

We don’t have any specific network set up, so we choose the default, and we can set the priority of our firewall rule as well. The default value is 1000, and I’m just going to leave it at that. The first of these radio buttons asks you to specify the traffic direction to which this firewall rule applies. We are focused on ingress traffic, traffic that goes to our Dataproc cluster.

We want our local machine to be able to connect to our Dataproc cluster. The second set of radio buttons asks you to indicate whether this rule is a whitelist or a blacklist: should the action on match allow traffic, or should it deny traffic? We want to whitelist our IP address, so we choose Allow.

The target field is where you can indicate whether this firewall rule applies only to those resources which have certain tags or labels. If you choose Specified target tags, you actually have to specify the target tags that this firewall rule is for. In the text box for the source IP address, specify the IP address of your local machine. The /32 suffix specifies the range of IP addresses that you want to whitelist; /32 means just this one address. It’s always good practice in your firewall rules to specify particular protocols and the ports on which those clients can connect. We should always be as restrictive as possible where access policies are concerned, and Allow All is less restrictive than specifying the protocols and ports. Here we allow the TCP protocol to connect only to ports 8088, 50070, and 8080.

If you remember your Hadoop administration, 8088 is the port on which you connect to your cluster manager, and 50070 is the port on which you connect to your HDFS NameNode. We’ve seen this before. If you click on command line at the bottom, you’ll see the gcloud command that you can run on the command line to set up this very same firewall rule with all the custom specifications that we’ve configured (a sketch of it appears below). I go ahead and try to create this firewall rule, and there is an error. This is because I haven’t selected all instances in the network: my option was specified target tags and I hadn’t explicitly specified any tags. Switch over to all instances in this network and you’ll be good to go.
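The command-line preview from the console looks roughly like the sketch below. Treat it as an illustration rather than the exact command the console generates: the rule name, description, and source IP are placeholders you would substitute with your own values.

  # Whitelist a single local IP (note the /32) for the Hadoop web UIs on the default network.
  # No --target-tags flag is given, so the rule applies to all instances in the network.
  gcloud compute firewall-rules create allow-hadoop-ui \
      --description="Allow my laptop to reach the YARN and HDFS consoles" \
      --direction=INGRESS \
      --action=ALLOW \
      --rules=tcp:8088,tcp:50070,tcp:8080 \
      --source-ranges=203.0.113.25/32 \
      --network=default \
      --priority=1000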

Once the firewall rule has been created, it’s time for us to test it. We are going to check whether our local machine can connect to our Dataproc cluster. Use the navigation bar to go to your Dataproc dashboard, click on your cluster instance, and navigate to the VM instances of your cluster by clicking on VM instances. We want to find the IP address of our master node so we can connect to it from our local machine using the browser.

Click on the master node VM instance and you’ll see the details of the master node. Scroll down till you find the network interfaces; this is where you’ll get the IP address of your master node. What you’re looking for is the external IP address. Copy this external IP address and open up a new browser window which you’ll use to connect to your Dataproc cluster. You’ll connect to the master node using its IP address: paste in the master node IP address followed by the port, :8088. This should lead us to the cluster manager for YARN, and it does. This should be a very familiar sight to all of you who have worked on Hadoop. The first time I used Dataproc, this is what convinced me that it is indeed running Hadoop.
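You can also pull the master node’s external IP from the command line instead of digging through the console. A sketch, assuming the cluster is named my-cluster (so the master VM is my-cluster-m) and lives in the zone shown; both are placeholders.

  # Print just the external (NAT) IP of the master VM
  gcloud compute instances describe my-cluster-m \
      --zone=us-central1-a \
      --format='value(networkInterfaces[0].accessConfigs[0].natIP)'

  # Then browse to the web UIs:
  #   http://EXTERNAL_IP:8088   -> YARN ResourceManager (cluster manager)
  #   http://EXTERNAL_IP:50070  -> HDFS NameNode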

While we are at it, let’s also access the HDFS NameNode: the IP address of the master node, followed by a colon and 50070. And there you see it, the UI for the HDFS NameNode. If you scroll down, you can see the details of our worker nodes: there are two live nodes and no dead nodes. Now that you’re done exploring the cluster, you can go ahead and delete it so that you free up resources. Don’t keep resources hanging around that you don’t use; they will unnecessarily eat into your free credit. Hit OK on the confirmation dialog and you’re done with your test cluster. For the sake of completeness, let’s now use the Cloud Shell command line.

To set up a Dataproc cluster from the command line, we use the gcloud tool. The command is gcloud dataproc clusters create, followed by the name of the cluster, which in our case is second-test-cluster. Specify the zone you want your cluster to be created in; I always choose a US zone by default, since it’s the easiest. Then come a whole bunch of configuration parameters for your master node as well as your worker nodes.

As you submit this create command, you’ll find that the UI updates to show that this cluster is in the process of being created. Deletion of a cluster is equally easy: using the command line, simply run gcloud dataproc clusters delete and specify the name of the cluster. And in this lecture you should have found the answer to this question: in order to allow your local machine to access your Dataproc master node and connect to the cluster manager or the HDFS NameNode console, you need to set up an explicit firewall rule whitelisting your local IP address and allowing it to access specific ports with specific clients, in this case the TCP protocol.
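Put together, the create and delete commands from this lecture look something like the sketch below. The zone and machine types are placeholder values rather than the exact ones used in the recording, and depending on your gcloud version you may also need a --region flag.

  # Create a small cluster: one master, two workers
  gcloud dataproc clusters create second-test-cluster \
      --zone=us-central1-a \
      --master-machine-type=n1-standard-1 \
      --worker-machine-type=n1-standard-1 \
      --num-workers=2

  # Tear it down when you are done so it doesn't eat into your free credit
  gcloud dataproc clusters delete second-test-cluster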

  1. Lab: Submitting A Spark Jar To Dataproc

When you submit a Scala Spark job to Google Cloud, what kind of job type would you specify? This is a question you should be able to answer at the end of this lecture. In this lecture, we’ll focus on how we can execute a Spark jar on the Dataproc cluster. Open up the page for Dataproc on your web console, and within that, click on Jobs. Here is the last job that you ran on the Spark cluster.

Let’s now submit a new job by clicking on Submit Job. The cluster that we’ve set up is my cluster. Change the job type to Spark because we’re going to run a Spark job implemented in Scala. We have a Jar file available and that’s what we want to submit to Google Cloud.

The job type should be Spark and not PySpark. In the text box for the jar file, specify the complete path to the jar that you want to execute on the cloud. This particular jar file is present on the file system of the master node in the cluster itself, so we are going to point to it directly. It is one of the example Spark jars which come along with the Spark installation; that is why it’s present on the file system of the cluster.

The main class that we want to execute within the jar is SparkPi. It calculates the value of pi to a high degree of precision by parallelizing the computation, and you specify the number of partitions into which you want to divide that computation.

The command line argument that you pass to this program is the number of partitions into which you want to divide the computation of pi. Let’s specify 1000 partitions as our argument. Click on Submit Job and then view its logs. To see the result, click on the job itself. Here are the log files; click on Line Wrapping or scroll to the far right and you’ll see the value of pi. If you want to run a Scala Spark program, you package it into a jar, submit it to the cloud, and specify your job type as Spark.
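For reference, the same submission can be made from the command line. This sketch assumes the cluster is named my-cluster and that the example jar lives at the path shown, which can vary with the Dataproc image version.

  # Submit the SparkPi example class from the jar that ships with the cluster's Spark install,
  # passing 1000 as the number of partitions
  gcloud dataproc jobs submit spark \
      --cluster=my-cluster \
      --class=org.apache.spark.examples.SparkPi \
      --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
      -- 1000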

  1. Lab: Working With Dataproc Using The GCloud CLI

At the end of this lecture, you should know how to set up your Dataproc cluster so that you can run gcloud commands on it. In this lecture, we’ll focus on using the gcloud command line interface to perform common tasks with Dataproc clusters. We’ll first confirm that all the APIs that need to be enabled to perform these command line operations on Dataproc are in fact enabled.

Go to the API Manager and check that the Google Compute Engine API is enabled; it should be. Next, double check that the Dataproc API is enabled; it should be if you’ve been doing the demos so far. Use the side navigation bar to go to the Dataproc web console. I’ve deleted all the clusters I’ve set up so far, so I see an empty screen. Open up your Cloud Shell; we’ll now set up a cluster using the command line. Set the cluster name variable to test-cluster.

That’s the name of the cluster we are going to set up. Then use the gcloud dataproc clusters create command to set up the test cluster (a sketch of the full command is below). Let’s examine what the other parameters mean. --scopes=cloud-platform essentially states that you can run gcloud commands on your cluster: once the cluster has been set up, you can SSH into the master node and assume that the gcloud SDK has been installed and all the permissions have been set up for you to run gcloud commands. The --tags parameter allows you to tag this entire cluster with the codelab tag. Remember that tags are logical groupings of your resources; you can use them to view billing and resource-usage information together for all resources with a particular tag. These tags can also be used to set up firewall rules which target the resources carrying those tags. The last flag is pretty self explanatory: the zone where the cluster is to be set up.
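A minimal sketch of the create command described above. The zone and worker count are assumptions, not necessarily what was used in the recording, and newer gcloud releases may also want a --region flag.

  CLUSTER_NAME=test-cluster

  # --scopes=cloud-platform lets you run gcloud commands from the cluster's nodes;
  # --tags=codelab lets firewall rules (and billing views) target this cluster
  gcloud dataproc clusters create $CLUSTER_NAME \
      --scopes=cloud-platform \
      --tags=codelab \
      --zone=us-central1-c \
      --num-workers=2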

Once the cluster has been set up, it should appear on your web console as well. You can now go ahead and submit a Spark job to this cluster. This is the same Scala Spark program that we used in the earlier demo. We use the gcloud dataproc jobs submit command and then specify the type of job; this is a Spark job, remember, and not a PySpark job. All the other parameters are pretty obvious and need no explanation. Once you submit the job from the command line, the job has been handed off to the cloud, and even if you use Control-C and kill the execution within Cloud Shell, the cloud job will continue to run. In the meanwhile, let’s take a look at what other commands we can run on the command line. gcloud dataproc jobs list, with the name of the cluster, will give you all the jobs that are currently running on that cluster.
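A sketch of that listing command, assuming the cluster name used in this lab:

  # List jobs submitted to this cluster (the job keeps running in the cloud
  # even if you Ctrl-C the gcloud command that submitted it)
  gcloud dataproc jobs list --cluster=test-cluster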

If you want to reconnect to this job and view its logs in Cloud Shell, you can do so using the gcloud dataproc jobs wait command, specifying the job ID. After that, you’ve reconnected to the job and you should see its output in your Cloud Shell window. Notice at the end you can see the various states the job passed through: pending, setup done, and then running.
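Reattaching to a job looks like the sketch below; the job ID is a placeholder, and you would copy the real one from the output of gcloud dataproc jobs list.

  # Stream the driver output of an existing job until it finishes
  gcloud dataproc jobs wait 8b4f0e2a-example-job-id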

There is the name of the job, SparkPi, and the current state is finished. The tracking URL shows you where you can use the web console to view the progress of this job. You can scroll up and see that a whole host of other information is available about the job: the cluster it was submitted to, the job ID, the project ID, the arguments that we passed in on the command line, the jar file that we actually executed for this Spark job, and so on.

And somewhere in these log messages is the value of pi that it calculated. Run the command gcloud dataproc clusters describe, followed by the cluster name, to get details on the cluster: the settings that we set up, the configuration of the cluster, and so on. Here you can see that there are two worker nodes, test-cluster-w-0 and test-cluster-w-1. You can scroll and explore to see what other details are available. Let’s say I want to expand the capacity of this cluster, but I want to do so cheaply. An easy way to do this is by adding preemptible workers. I can do this from the command line with gcloud dataproc clusters update, specifying the name of the cluster and the argument for preemptible workers.

That argument is --num-preemptible-workers; I’m going to add two preemptible workers to my cluster. Once this has completed successfully, you can run the describe command once again (see the sketch below) and see that these preemptible workers have been added. This command spits out a lot of detail, but you can scroll through and see a pattern to it. Here is the master configuration.
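The resize sequence described here, as a sketch using the test-cluster name assumed earlier:

  # Inspect the current configuration (worker count, machine types, etc.)
  gcloud dataproc clusters describe test-cluster

  # Cheaply add capacity with two preemptible (secondary) workers
  gcloud dataproc clusters update test-cluster --num-preemptible-workers=2

  # Describe again: a secondary worker section with the preemptible nodes should now appear
  gcloud dataproc clusters describe test-cluster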

We have exactly one master node. The secondary worker configuration now shows us two preemptible nodes. They are labeled somewhat differently, with the preemptible flag set to true, and you can see that their machine type is the same as that of the worker nodes we set up. When you set up preemptible workers, they use the same image as the worker nodes on your cluster. Let’s confirm this resize of our cluster using the web console. Go to Dataproc and click on the cluster instance.

Click on VM instances and it will show you all the machines that make up this cluster. There is the master node, the two dedicated worker machines that we set up in our standard configuration, and the preemptible workers that we added using the command line. Let’s SSH into the master node of this cluster. We can do so using gcloud compute ssh, specifying the name of the cluster followed by -m; this is the format in which the master node is named. You should be able to SSH in just fine. Check that you are indeed on the master node. Yes, we are. If you remember, one of the flags that we specified while setting up this cluster was --scopes=cloud-platform. This means you can run gcloud commands on this cluster’s master node.
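SSHing into the master node from Cloud Shell is one line. This sketch assumes the test-cluster name and the zone used in the earlier sketch; newer gcloud releases may also ask for a --region flag on the dataproc command.

  # The master VM is named <cluster-name>-m
  gcloud compute ssh test-cluster-m --zone=us-central1-c

  # Once on the master, confirm where you are and that gcloud works
  hostname
  gcloud dataproc clusters list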

gcloud dataproc clusters list shows you what clusters are running at this point in time. You can also run a describe command against a single node instance of the cluster; here we want to describe the master node, identified by the -m suffix on the cluster name. At the very end of this describe command’s output, notice that the tags associated with this cluster show up.

The only tag that we’ve set up is the codelab tag. If you remember, earlier we set up a firewall rule that allowed our local machine to connect to any cluster that we set up on Google Cloud Platform. Let’s check out this firewall rule using the command line, and then we’ll go ahead and modify it. We’ll do the modification using the web console so it’s clear what change we are applying. Go to the Networking page on your Google Cloud Platform console and then click on Firewall rules.

If you remember from one of our earlier labs, we can specify tags within firewall rules to indicate which resources a rule applies to. We’ll edit our firewall rule to allow access only to the cluster that we just set up. The updated firewall rule will allow our local machine to connect only to those clusters which carry the codelab tag.

Update the settings so that the target is Specified target tags and the resources that we want to be able to access are those tagged codelab. That’s it; hit Done, and you have your updated firewall rule. You can test it out and see how it works. At this point, you should know that when you specify the scope of your cluster during creation as cloud-platform, you can run gcloud commands on your cluster’s master node.
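The same change can also be made from the command line. A sketch, assuming the rule is named allow-hadoop-ui as in the earlier example:

  # Restrict the rule so it only applies to instances tagged 'codelab'
  gcloud compute firewall-rules update allow-hadoop-ui --target-tags=codelab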
