Google Professional Data Engineer – BigQuery ~ Hive ~ OLAP

  1. BigQuery Intro

As we begin our discussion of BigQuery, Google Cloud Platform’s workhorse data warehousing, business intelligence, and analytics processing service, here is a question that I’d like you to keep in mind: BigQuery takes Hive’s schema-on-read to the next level. How does it do that? What exactly does BigQuery do as far as schema enforcement goes? Recall for context that relational database management systems have schema on write. They enforce schema constraints each time you attempt to write to the database, and they reject write requests which violate the schema or any constraints. Hive does not have that luxury because it does not always own the data. The data resides on HDFS, and other applications like Pig or Spark might go in and change it behind Hive’s back. So Hive has to resort to schema on read. The question is, how does BigQuery take schema on read even one level further? Let’s pick up from where we left off.

While discussing use cases, we had a bunch of them, ranging from simple block storage requirements for compute down to analytics and OLAP processing requirements. We had discussed the technologies that we would use for these in the absence of the Google Cloud Platform: persistent disks or SSDs for block storage; HDFS or some distributed file system for file storage; Hive as a SQL-like interface that ran on top of MapReduce on HDFS; a document database like CouchDB or MongoDB for key-value indexed applications; HBase or Cassandra for fast scanning in NoSQL; MySQL or PostgreSQL for transaction processing (OLTP) applications; and once again something like Hive for analytics or advanced querying on very large distributed data sets.

This is a fairly important table. If any of these technologies seem unfamiliar, or if you’re not sure why they match a use case, I suggest that you go back and re-watch the storage and compute sections. Once we have this decent first-principles understanding of the use cases and technologies, we can then apply GCP’s products. For block storage for VM instances, there are persistent disks, both standard hard disks and local SSDs. For storing large media or blobs, which are immutable, Cloud Storage was the best option. For a SQL interface sitting on top of file system storage, we had mentioned BigQuery as the semantic equivalent of Hive. Datastore is GCP’s document database. This is a NoSQL technology based on very fast indexing, where the size of the result set, not the size of the data set, determines the query execution time. For fast sequential scanning, we had mentioned BigTable, which is a columnar data store. It has similar applications and use cases as Cassandra or HBase. Then, for transaction processing (OLTP), we had discussed a couple of options: Cloud SQL, which hosts open-source databases, and Cloud Spanner, which offers horizontal scaling.

Cloud Spanner in some sense blurs the boundaries between a traditional RDBMS and a big data technology. And lastly, we had also spoken about analytics, data warehousing, and OLAP applications, and there again it was BigQuery that we had noted against the use case. The only item on this list which we have yet to discuss in detail is BigQuery. We spoke about Datastore and BigTable and Cloud Storage; BigQuery is the last one standing, and that’s because it is really more of a big data technology than a storage technology. That said, BigQuery differs from Hive in its underlying representation. It uses something known as Capacitor, which is a proprietary columnar data format. This means that BigQuery is closer to BigTable than Hive is. I keep going on and on about these underlying implementation details because really understanding the how of these technologies helps us to understand some of the whys. In any case, let’s now move on and cut to the chase.

BigQuery is an enterprise data warehouse, with all of the implications of complex querying facilities on large distributed data sets. While talking about Hive, we had mentioned that operations like partitioning, bucketing, and windowing are nice-to-haves in a traditional RDBMS, but must-haves in OLAP. And that is exactly where BigQuery delivers. It offers SQL or SQL-like queries, complex ones, but with Google’s storage infrastructure lying underneath.

One of the advantages of BigQuery is that it is fully managed. There is no server: you don’t need to explicitly deploy any resources, though of course you will need to pay for them. You can access BigQuery in a variety of ways: using a web user interface, using REST APIs, or programmatically using client SDKs. You should know that there are also various third-party tools available for accessing BigQuery, but those really aren’t all that important if you plan to use BigQuery from the Google cloud itself. The BigQuery data model has data sets. A data set in BigQuery roughly approximates a database in the relational world, and it includes a set of tables and views.

Any table in BigQuery must belong to a data set, and of course any data set must belong to a project, because creating data sets is chargeable and billable, and therefore must be associated with a project, which is the billing entity. Tables are approximately relational in their format: they contain records, and these records have rows and columns. The columns are called fields. All of this is very similar to the relational world. Unlike the strict traditional relational model, though, the data model in BigQuery is quite relaxed about denormalized forms, so it’s okay to have nested and repeated fields as well. A schema can be specified when you create a table, but you can also specify the schema during the initial load.

So you don’t need to specify the schema until you actually plan to load data into the table. As with relational databases, the schema can be updated after the table has been created. If you’ve worked a lot with Hive, you’d be familiar with the idea of managed and external Hive tables. Managed tables have their data managed by Hive and stored in the warehouse directory. External Hive tables are not fully managed by Hive, and their data resides outside the warehouse directory. There is a similar concept in BigQuery. Tables can be either native, i.e. all of the data is in BigQuery storage, or they can be external, in which case the data resides in some underlying data source, such as BigTable or Cloud Storage, or even Google Drive. Just like Hive, BigQuery also supports views.

We’ll have more to say about those in a moment. BigQuery also has a really cool facility called schema auto-detection. As its name would suggest, this involves BigQuery inferring the structure of data. This is available either while loading data into a table or while querying external data. Given a set of data to query or load, BigQuery will pick a random file in the data source, scan up to 100 rows of data, and assume that that is a representative sample. On the basis of that sample, the schema auto-detection algorithm will infer the different fields and assign types to them. We’ve discussed the difference between batch processing and stream processing, and corresponding to these, there are two ways of loading data into BigQuery: batch loads and, you guessed it, streaming loads. These are the two important ways of ingesting data, or getting data into BigQuery.
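BigQuery's actual auto-detection algorithm is proprietary, but the idea of "sample up to 100 rows and assign each column the most specific type that fits every sampled value" can be illustrated with a toy sketch. Everything here, including the sample rows, is made up for illustration:

```python
def infer_type(values):
    """Assign the most specific type (INTEGER -> FLOAT -> STRING) that fits all values."""
    for caster, name in ((int, "INTEGER"), (float, "FLOAT")):
        try:
            for v in values:
                caster(v)  # raises ValueError if any value doesn't fit
            return name
        except ValueError:
            pass
    return "STRING"

def infer_schema(rows, sample_size=100):
    """Infer a column-name -> type mapping from up to sample_size rows.

    rows: list of dicts, e.g. parsed from a CSV file with a header.
    """
    sample = rows[:sample_size]
    columns = sample[0].keys()
    return {col: infer_type([r[col] for r in sample]) for col in columns}

rows = [
    {"name": "Emma", "gender": "F", "count": "19414"},
    {"name": "Noah", "gender": "M", "count": "19015"},
]
print(infer_schema(rows))  # {'name': 'STRING', 'gender': 'STRING', 'count': 'INTEGER'}
```

The key caveat, which applies to BigQuery too, is that the sample is assumed to be representative: a column whose first 100 rows happen to be integers will be typed INTEGER even if later rows contain text.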

Batch loads can be carried out by loading CSV files, JSON files (which need to be newline-delimited), Avro files, which use an open-source storage format, or GCP Datastore backups. Streaming loads correspond to use cases like event tracking, especially from logs, or real-time dashboarding. You could have data which feeds into BigQuery and then on into a visualization tool such as Data Studio. Let’s very quickly talk about Avro and some of the other data formats that are acceptable to BigQuery. CSV and JSON are both pretty obvious ones; just keep in mind that the JSON needs to be newline-delimited. Avro, which we just mentioned, is an open-source data format. The idea here is that data is serialized along with its schema, included in the same file. And do also keep in mind that BigQuery has special support for Cloud Datastore backups, particularly logs: BigQuery knows how to convert data from Cloud Datastore backup files to its own data types.

This is an easy way to interoperate with other GCP data. It is also possible to work with data without loading it into BigQuery, and you’ll see some examples of this in the demos. One way is to make use of public data sets, which anyone can access; another is through shared data sets which have been shared with you; and lastly, you can use data from Stackdriver. Remember that Stackdriver is an internal suite of tools which has to do with monitoring, logging, and so on.

You can use BigQuery to perform analytics on your logs with a minimal amount of effort. Coming now to the question that we had posed at the beginning of this video: BigQuery’s take on schema-on-read is even smarter than Hive’s, and that’s because BigQuery has schema auto-detection. Recall that BigQuery can read the first hundred rows of data, either from an external data source which is being queried, or from data files which need to be loaded into a table.

And by using those 100 rows of data, it can infer the types and the number of the columns, i.e. the schema. This is important because in BigQuery, just like in Hive, there are tables which do not have their data managed by BigQuery: just as Hive has managed and external tables, BigQuery has native and external tables. So it’s important for BigQuery, or Hive, or any OLAP processing facility to have the ability to deal with data which is not directly owned by it. This is very different from an RDBMS, which has complete control over, and complete ownership of, the data on the server.

  1. BigQuery Advanced

We are now going to discuss a bunch of additional BigQuery features, including partitioning, views, querying, and so on. While you are going through this material, here is a question that I’d like you to ponder: how does BigQuery’s take on partitioning differ from Hive’s? Do recall that Hive offers both dynamic and static partitioning. Dynamic partitioning is turned off by default and is rarely used, so static partitioning is the common use case for partitioning in Hive, where data gets stored in different directories. Static partitioning is not transparent to the user: when you create a table in Hive, if you’d like to partition it by, say, a state column, you explicitly have to say so.

That will then cause all data where the values of the state column correspond to, say, California, New York, Texas, and so on to be stored in different data directories, one per partition. Dynamic partitioning in Hive does not require the programmer to specify a partition column. This is inferred, and the partitions are created by Hive, but this tends to be turned off by default; it’s an advanced feature and it’s relatively rarely used. So again, the question is: how does Google BigQuery’s take on partitioning differ from that of Hive? We’ve now spoken a bit about the data model in BigQuery. Let’s move on to querying and viewing data, in other words, to doing useful things with the data once it is in BigQuery in the first place.

There are four ways of interacting with data. Interactive and batch queries are the first two; these are exactly as you would expect. Views and partitioned tables represent the other two, and we’ll discuss each of these in a little bit more detail. So let’s start with the simplest and the most common way of interacting with BigQuery, which is using interactive queries. Here you type out queries in a language very much like SQL, and those queries will be executed as soon as possible, i.e. interactively. Interactive queries will count towards your usage limits, both your daily usage limit as well as your concurrent usage limit.

So interactive queries do not come all that cheap. It is possible to explicitly run queries in batch mode, where you are telling BigQuery that they do not need to be attended to immediately. This gives BigQuery the flexibility to schedule them whenever idle resources are available. As a result, batch queries will not count towards your concurrent usage limit, although of course they will count towards your daily limits. If a batch query does not start within 24 hours of being submitted for execution, BigQuery will change its mode to interactive and hopefully run it right away. If there are queries which you run again and again, it probably makes sense for you to create views around them. Views can be thought of as queries which are stored in the metastore, so they have a status which is higher than that of just any other query, but lower than that of a table. BigQuery’s views are logical, which means that they are not materialized.
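The interactive/batch distinction surfaces as a priority field on the query job. As a hedged sketch, this is roughly the shape of the job configuration you would send to the BigQuery REST API's jobs.insert method (the SQL and dataset names here are placeholders; check the API reference for the full set of fields):

```python
def make_query_job(sql, priority="BATCH", use_legacy_sql=False):
    """Build a jobs.insert-style request body for a query job.

    priority: "INTERACTIVE" (run as soon as possible, counts against
    concurrent limits) or "BATCH" (run when idle resources are free).
    """
    return {
        "configuration": {
            "query": {
                "query": sql,
                "priority": priority,
                "useLegacySql": use_legacy_sql,
            }
        }
    }

job = make_query_job("SELECT name FROM babynames.babynames LIMIT 10")
print(job["configuration"]["query"]["priority"])  # BATCH
```

The same choice is exposed in the web UI under Show Options and in the client SDKs; only the spelling of the knob changes.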

The underlying query will be executed each time you access a view. And by the way, billing will also happen accordingly. All of the traditional advantages of Views apply to views in BigQuery as well. They are a way to reduce query complexity as well as to restrict access to data and to construct different logical tables from the same physical table. Let’s discuss one of these in a little more detail. Let’s understand how data access can be controlled using Views. So here’s what you cannot do. You cannot grant access control to a view that is purely based on the user who happens to be running the view as if it were a query.

However, you can create authorized views. That is, you can share query results with specific groups without giving those groups read access to the underlying data. There are some limitations on the number of authorized views that can be created; more on that in a moment. You should also know that row-level permissions can be specified within views, so you can specify that different users can only access different subsets of the results of a view. Views do come with a fairly large set of restrictions. For one, you cannot export data from a view.
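Row-level filtering in a view is typically achieved by comparing a column against the SESSION_USER() function, which returns the email of the user running the query. Here is a hedged sketch of such a view definition; the table and column names (sales, allowed_users, region, user_email) are hypothetical:

```python
# Hypothetical authorized-view definition: each user only sees rows
# for regions their email is mapped to in an access-control table.
view_sql = """
SELECT t.*
FROM mydataset.sales AS t
JOIN mydataset.allowed_users AS a
  ON a.region = t.region
WHERE a.user_email = SESSION_USER()
""".strip()

# The filter is evaluated per query, so the same view definition yields
# different result sets for different users.
print("SESSION_USER()" in view_sql)  # True
```

Combined with authorization, the group querying this view never needs read access to mydataset.sales itself.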

In this way, a view behaves like a query rather than a table. You can’t use the JSON API to retrieve data from a view; once again, the same fact applies. You can’t mix standard and legacy SQL while writing views. For instance, a view defined in standard SQL cannot be accessed by a legacy SQL query, and vice versa. As an aside, remember that the underlying query behind the view tends to be frozen. It is not automatically updated when there are changes in the table structure, so it’s possible for view queries to get out of whack. It’s also not possible to use user-defined functions in views. And you can’t use wildcard table references; this has to do with partitioning, which we’ll get to in a moment.

And lastly, there is a limit of 1,000 authorized views per data set. Recall that an authorized view is one where you have shared the results of the view with some group, but you have not shared the underlying data with that group. Let’s move on now and talk about partitioned tables. You should know that partitioned tables were not present in the very initial versions of BigQuery, and it might surprise you to know that there were instances where users would end up manually partitioning their tables, creating individual tables for specific chunks of time or based on some other column. Clearly, this would be extremely onerous, and it had a whole bunch of downsides. So when partitioning was added to BigQuery, it came as a boon. BigQuery’s partitioning support is not quite as rich or robust as Hive’s, but the idea is the same: partitioned tables are tables where the data is partitioned for you, so you as the programmer or the user do not need to create those partitions manually or programmatically, or manage them.

In BigQuery, it used to be the case that partitioned tables were not available, so manual partitioning was an unfortunate reality. But this used to lead to performance degradation, and it also used to run into limits: for instance, queries could only use a thousand tables, and partitioning tables manually used to run up against that limit. If you’ve used partitions in Hive, you’d be aware that you specify the column on which you’d like to partition the data. For instance, you might want to create one partition for each state in the US if you’re storing customer data. In BigQuery, partitioning works differently. The partitions are automatically created by BigQuery on the basis of a pseudo column. That pseudo column is called _PARTITIONTIME.

This contains a date-based timestamp for when the data was loaded into the table. So this has two very important implications. The partitions are automatically created for you; you do not pick the partition column. This makes it somewhat similar to dynamic partitioning in Hive. That’s the first implication. The second is that the partitioning happens on the basis of the data insertion date, the data loading date. You need to explicitly declare the table as a partitioned one while you are creating it, but you do not need to specify a partition column or the schema up front, because the partitioning really only has to do with that pseudo column, which is calculated by BigQuery. It has nothing to do with the loaded data or the schema. BigQuery will automatically create the date partitions, and in fact, it will also automatically go ahead and discard some of them once certain retention limits have passed.

Obviously, this feature of partition expiration is something that you need to turn on. You can specify what the partition expiration period is. You can also make use of something known as a partition decorator to delete specific portions of a partition table and also query only specific partitions within a table. All of these are the advantages of partitioning. Let’s wrap up with a quick look at a couple of other features of BigQuery. Query plan explanations are available. You can access these, for instance, in the Web user interface by clicking on the explanation link.
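The two mechanisms just mentioned, the _PARTITIONTIME pseudo column and the partition decorator, can be sketched side by side. The dataset and table names below (mydataset.events) are placeholders:

```python
from datetime import date

day = date(2017, 6, 1)

# 1) Restrict a query to one day's partition by filtering on the
#    _PARTITIONTIME pseudo column; only that partition is scanned.
query = (
    "SELECT COUNT(*) FROM mydataset.events "
    f"WHERE _PARTITIONTIME = TIMESTAMP('{day.isoformat()}')"
)

# 2) Address the partition directly with a partition decorator,
#    table$YYYYMMDD, e.g. when loading into or deleting one partition.
decorated = f"mydataset.events${day.strftime('%Y%m%d')}"
print(decorated)  # mydataset.events$20170601
```

Both techniques keep queries and maintenance operations from touching the whole table, which is where the cost and performance benefits of partitioning come from.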

Understanding query plans is a good way of debugging very complex queries which aren’t running as well as you’d like. Another way is to examine slot usage. A slot is a unit of computational capacity that is assigned for running queries. By default, BigQuery will assign a certain number of slots, calculated on the basis of the query’s size and complexity. This default assignment is usually enough, but if for any reason you find your BigQuery performance really poor, you might need to expand the number of slots that you’re using.

And of course, you should be aware that doing so is likely to make your bill go up as well. It’s possible to track your slot usage by making use of Stackdriver Monitoring. Let’s now turn our attention back to the question we had posed at the start of this video: how does BigQuery’s take on partitioning differ from that of Hive? And we have the answer now. BigQuery’s partitioning is dynamic, it is automatic, and it’s based on a pseudo column which is calculated by BigQuery on the basis of when data is loaded into a table. In Hive, on the other hand, there is much richer support for partitioning. There are both static and dynamic partitions. Static partitions are explicitly specified by the user. Dynamic partitions are not; they are auto-calculated by Hive, but turned off by default.

  1. Lab: Loading CSV Data Into Big Query

At the end of this lecture, you should be very familiar with BigQuery’s web console and be able to tell me where you would go to quickly look up metadata information for a table in BigQuery. In this lecture we’ll see how to create a data set and a table within it in BigQuery, and bulk load data using CSV files. In order to navigate to BigQuery’s web console, you need to click on the navigation sidebar and choose the BigQuery option there. BigQuery is Google’s fully managed, petabyte-scale analytics data warehouse. BigQuery is no-ops: there is no infrastructure that you need to set up and manage before you start using BigQuery. It completely eliminates the need for a database administrator; all administration is managed by Google Cloud Platform. The first thing to notice here is that BigQuery is serverless. You don’t need to instantiate a specific instance like you did in the case of Cloud SQL, Cloud Spanner, or any of the storage systems. Here in BigQuery, you simply create a data set and start analyzing the tables of data that you have within it.

This is why BigQuery comes under Google’s big data suite of offerings and not as a storage system. How the data is stored is unimportant compared with how you can query it and use it for analytics. Notice that you can always directly go to BigQuery’s web console by typing in bigquery.cloud.google.com. Query history on the left has a list of all the queries that you’ve run recently on BigQuery. If this is the first time that you are starting up, there will be no recent queries to display. The job history link will contain all the jobs that you’ve run on BigQuery data, such as creating a table, loading data into it, and so on. All data sets that you create within BigQuery have to live within a project. Notice the project here is my test project, which I set up earlier. I can use the drop-down next to my test project to do a bunch of actions. Firstly, I can create a new data set from here. I can also switch projects if I don’t want to work in my test project. If you click on this, you’ll see the list of projects that you have available on Google Cloud Platform, and you can choose the project that you want to work in. I’m going to create a new data set here. I’m going to call it baby names. The data set

ID is baby names. I can specify where I want this data to be located. This is important if, say, you are working for the government and your data can only be located in certain zones. You can also configure an expiration date for your data: you can say that you want your data set to only live for so many days. I’m just going to set it to Never, click on OK, and we’ve created our very first Baby Names data set. All the data sets that you have access to within your project, you can navigate through using this left navigation pane. The US Social Security website has a bunch of baby name data for every year going back over 100 years. Go to ssa.gov and download the names zip file. Store this file somewhere on your local machine and then unzip it. Once you unzip it, you’ll find that it has a bunch of files within it. These files contain the baby names that have been registered for every year. This is in a CSV format: these are text files, but they have comma-separated values within them.

You can always open up one of these files to examine it. Notice that they go all the way back to 1896, or even further back. Switch over to your BigQuery console and let’s create a table to store some of this data in. You can see the plus and the drop-down icons right next to your Baby Names data set. If you click on the drop-down icon, you’ll get an option to create a new table. And just like everything else in Google Cloud Platform, this will take you to a UI that will hand-hold you through the process of creating a table within your Baby Names data set. Here in this UI, I’m going to choose Create from Source, which means a table will be created when you specify the source data for that table. I’m going to use the CSV files that we downloaded from the Social Security website in order to create this table. Click on the Choose File button to bulk upload the CSV data to create this table within our Baby Names data set. I’m going to choose one of these files; just choose one from any year that takes your fancy. Once you’ve chosen your file, take a quick glance over the other settings to make sure they’re all okay. You want the location to be file upload, which is where you specify your bulk data, and the file format is CSV. All of this looks good; we can move on. The next bit of information that you need to specify is the destination table, the table where you want this data bulk-loaded into, and after that, the schema for the table that you want to create.

Because you’ve indicated the source file for bulk upload as a CSV file, there are a number of options that you can specify down there at the very bottom. You can specify what field delimiter your file has; by default it’s a comma, but it can also be a tab, a pipe, or any other delimiter. You can also indicate whether there is a header row that you need to skip in the file. There are some other options there as well; you can explore them when you’re dealing with real data of your own. In the schema specification for this table within our data set, you can either use the fields to add what fields are present in your data, or you can use Edit as Text to simply paste in a text field which has the schema for your data.

This might be much faster if you already have your schema available in the format that this text field expects: the name of the column, a colon, then the data type. When I went ahead and tried to create the table here, I realized that there was an error. BigQuery very kindly warned me that no destination table has been specified yet. The destination data set has been set to baby names; I need to specify an actual table. I’m just going to go ahead and call the destination table baby names as well, just for simplicity’s sake. When you want to reference a table in BigQuery, you specify the data set name, then the table name. So if you say baby names dot baby names, that will reference the baby names table within the baby names data set. I think we are all ready now.
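That "column:TYPE" shorthand is simple enough to parse by hand, which makes the format concrete. A small sketch (the default-to-STRING behavior mirrors what the schema shorthand does when a type is omitted, but treat that as an assumption worth verifying in the docs):

```python
def parse_schema_text(schema_text):
    """Parse comma-separated 'column:TYPE' schema shorthand, as used in
    the BigQuery UI's Edit as Text field, into field definitions."""
    fields = []
    for part in schema_text.split(","):
        name, _, ftype = part.strip().partition(":")
        # Assumption: a missing type defaults to STRING.
        fields.append({"name": name, "type": ftype or "STRING"})
    return fields

print(parse_schema_text("name:STRING,gender:STRING,count:INTEGER"))
```

For the baby names CSV, the three fields above (name, gender, count) are exactly what we will see in the table's Schema tab after the load.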

Hit the Create button and you’ll find that the table gets created. It might take a few seconds for your table to be created and then bulk-loaded with data. If you go to the job history on the left navigation pane, you’ll find that the loading job is currently in progress. When it turns green, you know the job has completed, and you can now query the table within your Baby Names data set. If you click on the baby names table, you’ll find a whole bunch of metadata information available to you on this web console. You can see the schema, details, and even a preview of the table. If you click on the Details tab for this table, you’ll get a whole bunch of metadata information. The fully qualified reference to this table contains the name of the project, the name of the data set, and finally the name of the table, as specified here. You can see that the size of the table is 615 KB.

There are about 32,000 rows of baby names within this table. And just like you can label other cloud platform resources, you can attach a label to this BigQuery table as well. This will allow you to logically group data to see its billing information, resource usage, and so on. You can click on the Preview tab here to see a brief preview of what the data looks like. Each row has a name, a gender for that baby, and a count of how many babies had that name.

From the first row you can see that 19,414 female babies were named Emma in 2016. I know it’s for the year 2016 because that’s the only data that I’ve uploaded into this table. If you click on the JSON tab at the very bottom, you can preview this same data in JSON format. Now that you’re familiar with BigQuery’s Web console, you know that you can simply click on the table name and then view the Details tab in order to see metadata information about a table.

  1. Lab: Running Queries On Big Query

After you watch this lecture, you should be able to answer the question: what are the two kinds of SQL that BigQuery supports, and how do you switch between them when you run your queries? So far, we’ve created a BigQuery table and uploaded data into it. Let’s run some queries on our Baby Names table. Let’s write our very first BigQuery query against the Baby Names table that we just set up. Click on Compose Query and you’ll get a window where you can type in your queries. Here you see a simple select statement which chooses 1,000 rows from our Baby Names table. Notice that we’ve specified the name of our table within square brackets. This is required when you use legacy SQL: your table names have to be specified in square brackets.

When you specify the table in your query, specify the complete form: the project ID, the data set ID, and the name of the table. The project ID is followed by a colon, then the data set ID, a dot, and the table name. In BigQuery, you are billed based on how much data your query processes. BigQuery is smart enough to give you an idea of how much data it’s going to process before you actually run your query. This information is available for every query specified in that query box if you click on the icon on the bottom right. This icon can either be a check mark, when the query is valid, or an exclamation mark, when the query is invalid.
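The two reference syntaxes (legacy SQL with brackets and a colon, standard SQL with backticks and dots) are easy to mix up, so here is a small sketch that spells both out. The project and table names are the ones from this lab, spelled as placeholders:

```python
def table_ref(project, dataset, table, legacy=True):
    """Format a fully qualified BigQuery table reference.

    Legacy SQL:   [project:dataset.table]
    Standard SQL: `project.dataset.table`
    """
    if legacy:
        return f"[{project}:{dataset}.{table}]"
    return f"`{project}.{dataset}.{table}`"

print(table_ref("my-test-project", "babynames", "babynames"))
# [my-test-project:babynames.babynames]
```

Within your own project you can drop the project ID and write just dataset.table, as we did when creating the table.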

The message at the bottom will tell you the query is valid and how much data it’s going to process, or it will tell you why the query is invalid. Fix the query up so that it’s a valid SQL query once again, and you’ll see that the message at the bottom says this query will process 615 KB when run. That’s the entire size of our table. This message tells you how much data is going to be processed by your query, not how much data is returned from your query.

We are returning only 1,000 rows, but 615 KB, the size of our entire table, which has 32,000 rows, will be processed. More details about this query are available if you click on Show Options at the bottom there. This will open up a window which has a whole bunch of options that you can tweak. You can ask for your query to use cached results from previous runs, you can specify a priority for your query, and so on. One of the options, if you notice, is the SQL dialect that your current query uses. You can check Legacy SQL or, if you uncheck it, BigQuery will use standard SQL. Let’s hide these additional options and go ahead and click Run Query.

You’ll get a warning dialog if this is the first time that you’re running a BigQuery query: with this query, you’ll be billed for all the data in the table, even if your query contains a limit clause. That’s because you used a select star with no where clauses; the select star has to process all the records, even if you limit what data is returned by the query. This can be potentially very expensive if your table is huge, and this is the warning which tells you that. The web console helpfully suggests that you can use the Preview tab if you just want to preview what the data looks like. Once you get an idea of how BigQuery data processing works, you can simply check “Don’t show me this dialog again” and you won’t be bothered by it popping up as you run your queries.

Click on Run Query and, depending on how large your table is, it might take a little bit of time to process, and you’ll get the results in table form at the bottom of the screen. If you click on the Explanation tab of the results, you’ll get an idea of how long each step in the query took. You can see that in stage one we waited for a little bit: Read took a really long time, Compute didn’t take very long, and Write was very quick as well. The output was 1,000 rows because we set our limit clause to 1000 in our query. Let’s run another query that’s a little more interesting. Let’s say we wanted to find the most popular male baby names in 2016; we want the top ten male baby names. You simply add a where and an order by clause to your query. Notice that this query also processes 615 KB of data. This might seem confusing to you at first, but it makes sense.

All records in the table have to be checked to see if they match the condition; if they match, then they will be retrieved by the query. The query runs through, and the result will once again be available at the very bottom of the screen. Noah was clearly the most popular male baby name in 2016. BigQuery displays just five rows on one screen; you can navigate to the remaining screens by clicking on Next at the bottom. It should be very easy for you to answer this question now. BigQuery supports both standard SQL as well as legacy SQL, and you can switch between them for each query by simply clicking on Show Options and unchecking the Legacy SQL checkbox.
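The logic of that top-ten query, filter on gender, sort by count descending, keep ten, can be mimicked locally to make the billing point concrete: the filter still touches every row even though only a few come back. The sample rows below are made up for illustration (only Noah's 2016 count comes from the lab):

```python
# Local simulation of:
#   SELECT name, count FROM babynames
#   WHERE gender = 'M' ORDER BY count DESC LIMIT 10
rows = [
    {"name": "Noah", "gender": "M", "count": 19015},
    {"name": "Emma", "gender": "F", "count": 19414},
    {"name": "Liam", "gender": "M", "count": 18138},
    {"name": "Olivia", "gender": "F", "count": 19246},
    {"name": "William", "gender": "M", "count": 15668},
]

top_male = sorted(
    (r for r in rows if r["gender"] == "M"),  # WHERE: every row is examined
    key=lambda r: r["count"],
    reverse=True,                             # ORDER BY count DESC
)[:10]                                        # LIMIT 10

print([r["name"] for r in top_male])  # ['Noah', 'Liam', 'William']
```

Note that the generator in the first argument visits all five rows even though only three survive, which is exactly why BigQuery bills for the full 615 KB scanned rather than for the rows returned.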