Google Professional Data Engineer – BigQuery ~ Hive ~ OLAP Part 3

  1. Lab: Subqueries And Joins

At the end of this lecture, you’ll be familiar with concatenating strings together in a query in order to format your data the way you want. This lecture is a continuation of the lab that we worked on earlier; we’ll continue to work with aggregate and boolean functions in BigQuery. In the last lecture, we left off where we’d run a query to calculate the total number of flights and the number of flights that were delayed per airline from La Guardia for a particular day. Now we’ll query a different data set to find out whether individual days were rainy days or regular days, because we are interested in the fraction of flights delayed when the days are rainy. BigQuery has a public data set which collects weather data.

This data is the Global Surface Summary of the Day weather data. This weather data is available from over 9,000 stations, from 1929 up to 2016. Let’s run a query on this table to get a feel for what the weather data looks like. We want to select the year, month, and day; these are separate columns in this data. We want this information for New York City. The station number, that is, the weather station number, for New York City is as shown on screen. If you want to perform this analysis for regions other than New York City, you’ll have to look up the corresponding station number on the NOAA (National Oceanic and Atmospheric Administration) website. We’ve set up the WHERE clause of this query to find only those days where total precipitation was greater than zero, which means it was a rainy day. This is the information we need from this weather data.
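A legacy SQL sketch of this query might look like the following; the table name and the station number here are assumptions (the public `samples.gsod` table has integer year, month, and day columns as described, and 725030 is a commonly used station number for the La Guardia area):

```sql
-- Sketch: rainy days in New York City (station number is an assumption;
-- look up the station for your region on the NOAA website).
SELECT
  year, month, day
FROM
  [publicdata:samples.gsod]
WHERE
  station_number = 725030
  AND total_precipitation > 0
```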

Information on what days were rainy days in New York. If you look at the results and also if you examine the schema of this table, you’ll see that year, month and day are all integers and in separate columns in this data. We need to join this table with our Flights table in order to find out on rainy days how many flights were delayed for every airline. In order to do this, we need to get the dates in the same format. The two tables will be joined on date. We’ll convert the date format of this table to match the one in the Flights table. Take a look at the select clause of this query and notice the string functions that BigQuery allows us to use in the innermost function.

We cast the year, month, and day as strings and then perform string operations on them. Both the month and the day are left-padded with zeros. We concatenate all of these together to get the final format of yyyy-mm-dd. This is the format in which dates are present in our Flights table. Run this query to confirm that the dates are indeed in the format that you want them to be. We are now ready to perform our join operation.
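The cast-pad-concatenate step might be sketched like this in legacy SQL (table and station number are assumptions, as before):

```sql
-- Sketch: build a yyyy-mm-dd date string from the integer columns.
-- STRING() casts the integers to strings, LPAD left-pads month and day
-- with zeros, and CONCAT glues all the pieces together.
SELECT
  CONCAT(STRING(year), '-',
         LPAD(STRING(month), 2, '0'),
         '-',
         LPAD(STRING(day), 2, '0')) AS rainyday
FROM
  [publicdata:samples.gsod]
WHERE
  station_number = 725030
  AND total_precipitation > 0
```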

Let’s first put down a variation of the query that we ran last on our Flights table. This gives the delayed and the total number of flights as in our earlier query, but this time it’s flights arriving at La Guardia; we want to know which flights arrived late because of the rain. This query will form the first part of our join. Notice the JOIN keyword and the opening bracket. After the JOIN, within this opening bracket, we’ll specify our weather subquery to find the rainy days: find all the rainy days in New York City, where La Guardia is located, and make sure the dates are in a format which matches the dates in the Flights table. Our first query is aliased as f, and our second query is aliased as w.

The join is on w.rainyday = f.date; rainyday is the alias for the date that we formatted in the way that we want it to be. Let’s put in the remaining clauses. We want only those arrivals where the airport is La Guardia, and we want to group by f.airline. This is a pretty neat join query along with subqueries, and you’ll see the results in the table below. There are ways to make these results more intuitive. What if we want the fraction of arrival delays rather than just numbers? Because a certain airline might have a huge number of flights arriving into La Guardia, absolute numbers for total flights and the number of delays will not tell us how well that airline is performing.
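The join query described above might look like this legacy SQL sketch; the flights table name, the column names, and the station number are assumptions based on the airline on-time sample data used earlier in the lab:

```sql
-- Sketch: delayed and total arrivals at La Guardia, joined against
-- the rainy-day subquery on the formatted date.
SELECT
  f.airline AS airline,
  SUM(IF(f.arrival_delay > 0, 1, 0)) AS num_delayed,
  COUNT(f.arrival_delay) AS total_flights
FROM
  [bigquery-samples:airline_ontime_data.flights] AS f
JOIN (
  SELECT
    CONCAT(STRING(year), '-',
           LPAD(STRING(month), 2, '0'),
           '-',
           LPAD(STRING(day), 2, '0')) AS rainyday
  FROM
    [publicdata:samples.gsod]
  WHERE
    station_number = 725030      -- New York City area (assumption)
    AND total_precipitation > 0) AS w
ON
  w.rainyday = f.date
WHERE
  f.arrival_airport = 'LGA'
GROUP BY
  f.airline
```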

This involves another level of subqueries: take the entire query that you’ve written so far, make that a giant subquery, and then query that result at the bottom. We know the columns that we have so far: airline, num_delayed, and total_flights. From this result, simply select the airline, num_delayed, and total_flights, and compute an additional fraction as well, that is, num_delayed divided by total_flights as frac_delayed.

Make sure your brackets line up; the entire query that you wrote earlier is now a subquery and has brackets around it. Also add an ORDER BY: let’s order by frac_delayed, so we see the airlines that perform the best in terms of delays on rainy days. Run this query and examine the results; it should make for some interesting data exploration. US Airways seems to do pretty well here, with the lowest fraction of delays. In this lecture, you’ve seen how you can cast an integer as a string and use the LPAD and CONCAT string functions to format data within your query.
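The whole nested query might look like the following sketch (the same assumed table names and station number as above):

```sql
-- Sketch: the entire join query becomes a subquery; the outer query
-- computes the fraction of delayed arrivals per airline.
SELECT
  airline,
  num_delayed,
  total_flights,
  num_delayed / total_flights AS frac_delayed
FROM (
  SELECT
    f.airline AS airline,
    SUM(IF(f.arrival_delay > 0, 1, 0)) AS num_delayed,
    COUNT(f.arrival_delay) AS total_flights
  FROM
    [bigquery-samples:airline_ontime_data.flights] AS f
  JOIN (
    SELECT
      CONCAT(STRING(year), '-',
             LPAD(STRING(month), 2, '0'),
             '-',
             LPAD(STRING(day), 2, '0')) AS rainyday
    FROM
      [publicdata:samples.gsod]
    WHERE
      station_number = 725030
      AND total_precipitation > 0) AS w
  ON
    w.rainyday = f.date
  WHERE
    f.arrival_airport = 'LGA'
  GROUP BY
    f.airline)
ORDER BY
  frac_delayed ASC
```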

  1. Lab: Regular Expressions In Legacy SQL

In this particular lecture, you will see how you can use regular expressions in your BigQuery queries to match arbitrary text. This demo will focus on advanced queries using regular expressions. We’ll use another public data set for this query: the BigQuery samples Wikipedia pageviews data set, the free data set that we added two lectures ago. This should be available as part of the same project. We’re going to use a table which has information about page views for all Wikimedia projects, and we’re going to use the information from December 2011. You can access this URL directly to get to the details of this particular table. Let’s run a simple query first: let’s find the sum of all page views across Wikimedia’s projects for December 2011, which means we just do a SUM operation across the entire table.
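A legacy SQL sketch of this first query might look like the following; the table name is an assumption (the sample data set has one table per month, e.g. 201112 for December 2011):

```sql
-- Sketch: total page views across all Wikimedia projects for Dec 2011.
SELECT
  SUM(views)
FROM
  [bigquery-samples:wikimedia_pageviews.201112]
```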

We are using the legacy SQL dialect, so our table name is enclosed in square brackets. If you click on Show Options, let’s examine some of the other options there. There are some options that you can use to speed up this query; Use Cached Results is one of them. If you’ve run the query once before, those results are cached, and if you rerun the same query, those cached results will be used. This will obviously give you a tremendous speed-up. The results will be cached only so long as the table remains unmodified. It is less efficient to use BigQuery with this option unchecked, but you might want to do it at some point when you want to ensure that the results are always fresh. Let’s go ahead and uncheck it just now.

And you’ll notice a tag at the bottom of your query which says “no cached results”, kind of reminding you that this is a setting that you have turned on. Let’s just run this query and see what it tells us. And there you can see it: a huge number of page views for all of Wikimedia’s projects, including Wikipedia, which is the bulk of the page views here. Of all of Wikimedia’s projects, let’s say the one that is of most interest to us is Wikipedia. You can filter this data set to Wikipedia’s page views by adding a WHERE wikimedia_project = 'wp' clause. All the other Wikimedia projects, such as Wiktionary, Wikibooks, Wikiquote, et cetera, will be excluded.

They form a smaller fraction anyway. Run this query and notice that the amount of data processed has immediately shot up to 18 GB; it was around 12 GB before. Take a look at the query and see if you can tell why: we’ve asked for an additional column in our result, wikimedia_project. Let’s make the data that we extract from this table a little more interesting. We want to count the total number of page views for all Wikipedia pages where the title contains a word that starts with “Red”. The information we want is the title and the total sum of page views for that particular Wikipedia title; that’s in the SELECT statement here. Our WHERE clause will now contain a condition based on a regular expression match. We use the REGEXP_MATCH function for this. We want to match the title against words which start with “Red”; notice the dot and the asterisk for the wildcard. This is more meaningful if we group by the title, so that we get the total number of page views, and order by views to get them in descending order. Wikipedia results are always interesting.
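A legacy SQL sketch of this query might look like the following (the table name and the exact regular expression are assumptions):

```sql
-- Sketch: Wikipedia titles starting with "Red", ranked by page views.
SELECT
  title,
  SUM(views) AS views
FROM
  [bigquery-samples:wikimedia_pageviews.201112]
WHERE
  wikimedia_project = 'wp'
  AND REGEXP_MATCH(title, r'^Red.*')
GROUP BY
  title
ORDER BY
  views DESC
```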

Red Hot Chili Peppers is clearly the most popular and most viewed page starting with “Red”; Shawshank Redemption is in there, Red Dead Redemption, and so on. You can run the same query against multiple tables in one go by specifying more than one table in square brackets in the FROM clause. The data that will be queried will be the union of the data from all the tables that you’ve specified. You’ll see that the results haven’t changed very much, even though our query is now across two months, November and December 2011: Red Hot Chili Peppers is still on top, though views for Reddit are no longer in our top five. We’ve increased the number of tables, which means we’ve increased the amount of data that we process. It’s now up to 113 GB.

It was 62 GB with just one month of data. We’d mentioned earlier that we use standard SQL so that we can use wildcard functions while querying tables, and for that we need backticks. We can use wildcards in legacy SQL as well, except that we have to use the TABLE_QUERY function in order to specify a regular expression. TABLE_QUERY takes in the project ID, the data set ID, and a regular expression to match all tables within that data set. There is also a TABLE_DATE_RANGE function which lets you do something similar with date ranges in legacy SQL. This query runs against tables for the entire year of 2011, and you can see that in the amount of data that we processed: 672 GB. The Red Hot Chili Peppers are still the most popular. Through this lecture, we saw how we could use regular expressions for text matching in our queries. We also saw how we can query multiple tables in one go in legacy SQL by using TABLE_QUERY with regular expressions.
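Both multi-table approaches might be sketched like this in legacy SQL (table names and the table_id pattern are assumptions):

```sql
-- Sketch 1: list multiple monthly tables in the FROM clause;
-- the data queried is the union of both tables.
SELECT
  title, SUM(views) AS views
FROM
  [bigquery-samples:wikimedia_pageviews.201111],
  [bigquery-samples:wikimedia_pageviews.201112]
WHERE
  wikimedia_project = 'wp'
  AND REGEXP_MATCH(title, r'^Red.*')
GROUP BY title
ORDER BY views DESC

-- Sketch 2: match every 2011 monthly table with TABLE_QUERY,
-- which takes a data set and a regular expression on table_id.
SELECT
  title, SUM(views) AS views
FROM
  (TABLE_QUERY([bigquery-samples:wikimedia_pageviews],
               'REGEXP_MATCH(table_id, r"^2011")'))
WHERE
  wikimedia_project = 'wp'
  AND REGEXP_MATCH(title, r'^Red.*')
GROUP BY title
ORDER BY views DESC
```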

  1. Lab: Using The With Statement For SubQueries

In this lecture, we’ll work with BigQuery column types which have repeated values, and we’ll see how we can access the individual values in such columns. This lab is primarily about the WITH statement in BigQuery. In this lab, we are going to use GitHub data to answer the question: what is the most popular language for programmers to code in on weekends? In the bigquery-public-data project, you’ll find a data set named github_repos, and within that you’ll find a table named commits. This contains the metadata for all the commits that have been made to various GitHub repos. You can check out the details of this table to see that it’s pretty massive: it has 714 GB of data. This is a table that’s also updated pretty frequently.

Let’s examine the schema in some detail. As you probably already know, GitHub is organized in a hierarchical structure where there is a path where our code lives; this is in the tree column in this table. There are also the details of the author that made each commit; it’s a nested field with many subfields within it. An interesting column that we’ll work with is the difference column. difference is a record which is repeated, that is, an array of structs. Each struct summarizes the difference, or the delta, between this new commit and what existed previously for this file in this repo. We are going to run a query in the standard SQL dialect, so make sure that the Use Legacy SQL option is unchecked. Let’s now examine this query and see what’s going on. What we are trying to extract from this query is the email of the author of the file, the paths which were affected because of the commit, and the date on which the commit was made. If you’ve worked with GitHub or any code repository before, you know that one single commit can affect a huge number of files.

At the same time, you check in multiple files for one commit, which means that this path will have many values for a single commit. The information that the table provides us is in the difference field that we saw earlier. difference is an array of structs; it is a repeated field. The text that shows on screen is slightly buggy: difference is not an array of strings, it’s an array of structs where each struct has multiple fields. What we want to do here is to flatten this difference array so that we can access the path from individual struct elements within this array. This is possible in BigQuery using a command called UNNEST. UNNEST is a flattening command, and you apply it to a column which has repeated records. As an example, let’s say the columns in your table were set up such that you have an email column and a column for paths, which is of type array.

Applying the UNNEST command on the array of paths will flatten it so that you get the email associated with each path in the result: email with path one, email with path two, and so on. The same email is associated with each individual path in the array. The array here means the column with repeated records. The UNNEST function here will give us the structs which hold the delta information for every file that was committed as part of this overall commit, and we access only the path in the SELECT statement above. The WHERE clause in this particular query limits the data that we extract to the year 2016; we are not interested in older data.
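A standard SQL sketch of this query might look like the following; the field names are assumptions based on the github_repos.commits schema described above:

```sql
-- Sketch: flatten the repeated difference field with UNNEST so that
-- each changed file path gets its own row.
SELECT
  author.email,
  diff.new_path AS path,
  author.date
FROM
  `bigquery-public-data.github_repos.commits`,
  UNNEST(difference) AS diff
WHERE
  EXTRACT(YEAR FROM author.date) = 2016
LIMIT 10
```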

We’ll sample just ten records and see what the result looks like. Once you run this query, the effect of the UNNEST function will become much clearer to you. Notice that the same email ID, that is, the same author, has checked in multiple files on multiple paths in the same commit: the date is the same, the email is the same, only the paths are different. The paths have been unnested. Explore the results and you’ll see that this pattern holds; here is a commit with just two files. Let’s understand this difference column, which has repeated records, a little better by modifying the query in a bunch of different ways and seeing whether the resultant queries are valid.

Let’s simply access difference.new_path directly, instead of unnesting that column. You’ll see that the validator very helpfully tells you that this is wrong. That’s because difference is a repeated field: its type is an array of structs, and the array itself doesn’t have a new_path field within it. You can only perform array operations on difference. Let’s change this query a little bit; here we’ll perform an array operation on difference. We want to find the path, the new_path, at the 0th index in this array; OFFSET(0) gives us the 0th index. The result for this should look like what you expect it to: each email ID is present with just one path for each commit.
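The array-indexing variant might be sketched like this (same assumed schema as before):

```sql
-- Sketch: instead of unnesting, pick only the first element of the
-- difference array with OFFSET(0).
SELECT
  author.email,
  difference[OFFSET(0)].new_path AS path,
  author.date
FROM
  `bigquery-public-data.github_repos.commits`
WHERE
  EXTRACT(YEAR FROM author.date) = 2016
LIMIT 10
```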

It’ll just choose the first path it encounters in that commit from the difference array and display that. At this point, we have the full path for every commit, including the file name. Let’s use this file name to extract the programming language for that particular file. In this query, we are going to do this in a very simplistic way: we’ll extract the extension of the file name and use that to determine the programming language.

A .c file is the C programming language, .java is Java, .rb is Ruby, and so on. We use the REGEXP_EXTRACT function to extract the file name extension and convert it to lowercase for our programming language. Run this query and you’ll find the programming language along with the path for every commit in GitHub for the year 2016. We now have a query which gives us the email address of the author, the programming language in which their files were coded, the paths of those files, and the date of the check-in. We still have some work to do: in order to find the most popular programming languages that programmers code in on weekends, we’ll use this as a subquery and run more queries on top of it. We want to give the subquery a name; we are going to call it commits, and we set it up using a WITH statement, as you can see on screen.
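The language-extraction step might be sketched like this; the regular expression shown here, which takes the text after the last dot as the extension, is an assumption:

```sql
-- Sketch: derive a "language" from the file extension of each path.
SELECT
  author.email,
  LOWER(REGEXP_EXTRACT(diff.new_path, r'\.([^./]+)$')) AS lang,
  diff.new_path AS path,
  author.date
FROM
  `bigquery-public-data.github_repos.commits`,
  UNNEST(difference) AS diff
WHERE
  EXTRACT(YEAR FROM author.date) = 2016
LIMIT 10
```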

The WITH statement assigns a name to this subquery, and then you can use commits just like you would a table name in the queries that follow. Remove the limit on this subquery and enclose the entire thing in brackets. We now have a subquery named commits, and we are going to use this subquery in other queries that follow. Below the closing bracket of your WITH statement, you can write the query which uses this commits subquery. We are going to use a very simple, maybe even simplistic, metric to determine the popularity of a particular programming language: we simply count the number of commits on files that were made in that particular language. So we want to select the language, and we’ll count the number of file paths as num_commits.

We have some conditions on what we consider a valid language, though: the length of the language should be less than eight, since we can’t have arbitrarily long file extensions; the language should not be null; and the language should only contain letters, no numbers or other special characters. The COUNT aggregation should be accompanied by a GROUP BY, so we group by language, because we want to count the total number of commits for a particular language. We also want to make sure that num_commits is greater than 100, so if you’re developing your own specialized language and there are just three or four commits in there, it won’t count here. And we order by num_commits descending, so the most popular language, the one with the most commits, will appear at the top. The most popular language is JavaScript, and after that is PNG.
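Putting these pieces together, the full query might look like this sketch (names and validity conditions as described above; the exact predicates are assumptions):

```sql
-- Sketch: name the language-extraction subquery with WITH, then rank
-- languages by number of commits, filtering out implausible "languages".
WITH commits AS (
  SELECT
    author.email,
    LOWER(REGEXP_EXTRACT(diff.new_path, r'\.([^./]+)$')) AS lang,
    diff.new_path AS path,
    author.date
  FROM
    `bigquery-public-data.github_repos.commits`,
    UNNEST(difference) AS diff
  WHERE
    EXTRACT(YEAR FROM author.date) = 2016
)
SELECT
  lang,
  COUNT(path) AS num_commits
FROM
  commits
WHERE
  LENGTH(lang) < 8
  AND lang IS NOT NULL
  AND REGEXP_CONTAINS(lang, r'^[a-zA-Z]+$')
GROUP BY
  lang
HAVING
  num_commits > 100
ORDER BY
  num_commits DESC
```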

Well, those are images. PHP is up there, JSON is up there, HTML is up there; they’re not really languages, but you can see what kinds of files people check in. All right, at this point we know the most popular language across the board based on the number of commits, but we haven’t differentiated between weekdays and weekends. We are not done yet; there’s a little bit of additional work required for this, but BigQuery makes it very easy. In your subquery which makes up commits, you have the author date information. We can use this author date to determine whether it’s a weekday or a weekend. Use the EXTRACT function in BigQuery: extract the day of the week from author.date. The result of this EXTRACT function will be an integer representing the day of the week. It starts with Sunday, so one is Sunday, two is Monday, three is Tuesday, and so on.

This means that if the integer is one or seven, it represents a weekend; all integers from two through six, both inclusive, are weekdays. So if the day of the week is between two and six, is_weekday will be true. This is_weekday is a true/false value, which will be true if it’s a weekday and false if it’s a weekend. This is the update to our commits subquery. Once this is done, we can go to our outer query and add this information there as well: in the SELECT statement, simply include the is_weekday field, which we now have available, when we select fields along with an aggregation.
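The updated commits subquery might be sketched like this (EXTRACT(DAYOFWEEK ...) returns 1 for Sunday through 7 for Saturday, so 2 through 6 are weekdays; field names remain assumptions):

```sql
-- Sketch: add an is_weekday boolean to the commits subquery.
SELECT
  author.email,
  LOWER(REGEXP_EXTRACT(diff.new_path, r'\.([^./]+)$')) AS lang,
  diff.new_path AS path,
  EXTRACT(DAYOFWEEK FROM author.date) BETWEEN 2 AND 6 AS is_weekday
FROM
  `bigquery-public-data.github_repos.commits`,
  UNNEST(difference) AS diff
WHERE
  EXTRACT(YEAR FROM author.date) = 2016
```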

Those fields that are not part of the aggregate need to be part of the GROUP BY, so go ahead and add is_weekday to the GROUP BY fields as well. Run this query and view the results. It should be immediately obvious to you that JavaScript is not just popular on weekdays but also on weekends: it’s the top programming language based on the number of commits whether is_weekday is true or false. The key takeaway from this lecture is the UNNEST function, which you apply on an array of records; this unnesting allows you to access the individual records and the fields within each record.

 
