Google Professional Data Engineer – BigTable ~ HBase = Columnar Store
Column Families
Here's a question, a rather open-ended question, which I'd like you to keep in mind as you watch this video: how does the choice of a row key affect physical storage in BigTable or in HBase? The manner in which data is laid out in a traditional database is very different from the layout in a columnar store. And that layout has implications, important implications, for the performance of HBase and BigTable. So let's understand these differences in a little more detail. A traditional database has a two-dimensional data model.
We've discussed this to death: it has rows and columns. So to uniquely identify any data item, we need to know its row and its column, i.e., its row ID as well as the column that it belongs to. But wait, there's also one additional piece of information: the table which this particular data item belongs to, because in a relational database, of course, we are going to have multiple tables. And so really it is the combination of unique row ID, column name, and table name which will collectively identify any one data item. That increases the dimensionality of our data from two to three. Now, if we also take into account that most modern database systems use some kind of versioning based on timestamps, that is, they maintain multiple copies of the same data item, then really we get to a four-dimensional data model. And indeed, this four-dimensional data model is exactly what is used in HBase as well. Any data item can be uniquely identified using its row key. This is a fairly familiar concept. The column family basically corresponds to the table name in an RDBMS. This really is how columnar data stores are able to create multiple tables corresponding to a single database within one columnar store.
That's done by encoding the table name in the column family field. The third dimension corresponds to the column name. This is an individual column identifier from within a relational database schema. And the last dimension corresponds to the timestamp. Recall how Cloud Spanner allowed us to specify bounds on the timestamp? These were staleness bounds. For instance, we could say that we want the latest copy, or a copy no later than or no older than a given time, and so on. That was because Cloud Spanner and most modern databases support different versions, based on different timestamps, of the same data item, and HBase does the same. Combining all of these, we can see that HBase has a four-dimensional data model. To uniquely identify any data item, we have to specify the values of the row key, the column family, and the column name. Notice that we don't explicitly have to specify the timestamp, because if we omit the timestamp, we will always get the latest, most up-to-date version of that data item. We can always choose to explicitly specify a timestamp and get a slightly older version. So in this way, the four dimensions required to access any data value can be reduced to three.
If you omit the timestamp, this will have the effect of giving you the latest version of that particular value. Let's look at how this would play out in the case of a table for employee data. Here we have a row ID, which is the employee ID. We have a column family corresponding to work information and another column family corresponding to personal information. In other words, if this were a relational database system, we would have had two tables called Work and Personal. Each of those two tables would have a primary key on employee ID. Then there are the actual column names in the column family Work: those column names are Department, Grade, and Title. In the Personal column family, those are Name and SSN. Once again, map this to a relational database setup: we would have one database containing two tables called Work and Personal. In the table called Work, the columns or the fields would be Department, Grade, and Title.
In the table named Personal, those columns would be called Name and SSN. Both of these tables would be primary-keyed off of the employee ID column. Now, in order to access one single record for a single employee, we need to specify the row key.
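To make the four-dimensional model concrete, here is a minimal Python sketch, mirroring the employee example above, that addresses values by row key, column family, column, and timestamp. This is purely an illustration of the addressing scheme, not of how BigTable actually stores data.

```python
# Illustrative sketch of the 4D data model:
# (row key, column family, column, timestamp) -> value.
from typing import Dict

Cell = Dict[int, str]  # timestamp -> value

class FourDTable:
    def __init__(self):
        # {row_key: {column_family: {column: {timestamp: value}}}}
        self.data: Dict[str, Dict[str, Dict[str, Cell]]] = {}

    def put(self, row_key, family, column, value, ts):
        self.data.setdefault(row_key, {}).setdefault(family, {}) \
            .setdefault(column, {})[ts] = value

    def get(self, row_key, family, column, ts=None):
        versions = self.data[row_key][family][column]
        # Omitting the timestamp returns the latest version.
        if ts is None:
            ts = max(versions)
        return versions[ts]

t = FourDTable()
t.put("12345", "Work", "Department", "Engineering", ts=1)
t.put("12345", "Work", "Department", "Research", ts=2)
t.put("12345", "Personal", "Name", "John", ts=1)

print(t.get("12345", "Work", "Department"))        # latest version: "Research"
print(t.get("12345", "Work", "Department", ts=1))  # older version: "Engineering"
```

Note how the first three dimensions fully qualify a column, and the timestamp selects one version among possibly many.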
We need to specify the column family and also the column name. Let's now go ahead and nail down exactly what each of these dimensions in the four-dimensional data model is all about. Let's start with the row key, which is by far the most important. A row key uniquely identifies a row, and it can contain values which are primitives, structures, or arrays. This is a very important point: the row key can contain compound, that is, denormalized, data types as well. Internally, the row key is represented as a byte array, so it doesn't really matter what you pass in as the row key; HBase will just interpret it as a sequence of bytes. And here is maybe the most important point about how row keys are stored: they are stored sorted in ascending order. We are going to talk about the performance of BigTable in some detail in just a little bit, and this point will assume a lot of importance there. Again, the row keys of all rows in the columnar data store will be sorted lexicographically, and then all row data will be stored in order of this lexicographical sorting. Next up, let's talk about column families. Column families roughly correspond to tables in a relational database.
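The byte-wise lexicographic ordering is easy to demonstrate in a few lines of Python; note in particular that numeric IDs encoded as strings do not sort numerically, which often surprises people:

```python
# Row keys are plain byte strings, and rows are stored in ascending
# lexicographic (byte-wise) order of those keys.
keys = [b"user2", b"user10", b"user1"]
print(sorted(keys))  # [b'user1', b'user10', b'user2'] -- 'user10' sorts before 'user2'

# Zero-padding fixed-width numeric components restores the expected order:
padded = [b"user00002", b"user00010", b"user00001"]
print(sorted(padded))  # [b'user00001', b'user00002', b'user00010']
```

The `user`-prefixed keys here are made-up examples; the point is only that byte-wise ordering determines physical row placement.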
They are a set of logically related columns. All rows need to have the same set of column families, and each column family is going to be stored in a separate data file. This is again kind of similar to the idea of a table in a relational database being a logical unit. Column families need to be set up up front, i.e., at the time of schema definition. Note that this is different from columns, which can be added dynamically on the fly. So column families need to be specified up front at schema definition time, but columns can be specified dynamically on the fly. And it's also okay to have different columns for each row. This is an important point: kind of like in a document database, it's okay for different rows to have different columns. And this, of course, is in keeping with our conversation about sparse data. You'll only include columns for which there actually is data in existence. Row keys and column families are rather complicated to understand. Columns are a lot simpler. These are very similar to fields in a relational database. These are just units within a column family.
These correspond to columns in a relational database. A big difference between column families and columns is that new columns can be added on the fly. Remember that to uniquely identify, or qualify, a column, you also need to specify which column family it belongs to. The column family, in a sense, acts like a namespace. This is the same role that's played by a table name in a relational database. And the last dimension in that four-dimensional data model is the timestamp. This has to do with HBase's ability to store different versions of values in a given column.
You could specify an explicit timestamp corresponding to a specific version that you wish to access, or you could omit the timestamp and simply retrieve the most up-to-date version of data in that column. Let's come back to the question which we posed at the start of the video. The choice of a row key is really important. It completely determines physical storage in both BigTable and HBase, because all of the different values of the row keys are taken and sorted in lexicographical order, and data is then stored in that order. So all rows with similar row key values, when sorted, are going to reside close to each other. This clearly has a bunch of implications as far as hot spotting and the distribution of reads and writes go.
Here is a question, a relatively simple one: at what level are HBase operations atomic? HBase, which is the open-source first cousin of BigTable, is a columnar data store. Atomicity is supported at what level? Is it not supported at all? Or is it supported at the level of a column family, a column, or a row? BigTable and its doppelganger HBase have fairly complex performance considerations, so let's spend a bunch of time discussing these. Avoid BigTable under the following sets of circumstances. Do not use BigTable if you require transaction support, because, as we've already discussed, BigTable will only offer row-level ACID guarantees, and that's just not enough. So use Cloud SQL or Cloud Spanner if you need to carry out OLTP. This one was fairly obvious. The next one is more confusing: do not use BigTable if your data size is going to be less than one TB.
That's because BigTable needs to do a bunch of smart optimizations related to sharding and distributed storage, and it just won't be able to do that if your data set is too small. Also, do not use BigTable if you plan analytics, business intelligence, or data warehousing use cases. BigQuery is a lot better there, for three specific reasons. Reason number one is that BigQuery supports a SQL-like interface, which many data analysts are familiar with. Reason number two is that BigQuery supports really complex types of queries: partitioning, windowing operators, and so on. All of these are really important in OLAP and business intelligence operations. Both of these first two reasons would be applicable even if we were comparing Hive and HBase. The third reason has to do specifically with BigQuery.
BigQuery is a lot more performant than Hive; you can contemplate using it even for real-time applications. So for analytics or data warehousing or OLAP, definitely use BigQuery rather than BigTable. The next point is again a fairly simple one: do not use BigTable for very highly structured or hierarchical data. That is more in the realm of document-oriented databases such as Datastore if you are on GCP, or MongoDB or CouchDB if you are not on GCP. BigTable requires a key-value relationship, at least around the row ID, so it doesn't make a lot of sense to use it for immutable data like blobs or media files. Just use Cloud Storage there instead. These are all situations to avoid BigTable. Let's now talk about the cases where BigTable excels. The first and obvious one has to do with very fast scanning with low latency: high-throughput applications where you are going to be scanning on sequential row IDs. It's pretty tough to beat BigTable for this particular use case.
Also, think of BigTable anytime you have non-structured but key-value data. Because it's non-structured, relational databases won't work, and because it is key-valued with a single key, BigTable fits well. If there are multiple keys, think of a document-oriented database like Datastore. Also, keep in mind some guidelines on data sizes: use BigTable when each data item is less than ten megabytes and the total data set size is greater than 1 TB. We've already discussed this: a very small data set is not suitable for BigTable because it's not able to carry out the smart distributed processing that it relies on. If you have write operations which are very infrequent or not important, and you don't care about ACID support but you care about fast scans, or if you're using time-series data, these are all use cases for BigTable. The time-series one is a little surprising, so let's understand why this is the case. This really has to do with the fact that timestamps can be used as a part of the row key. More on that in a minute, when we talk about ideal row key choices. Remember, while talking about the row key in the four-dimensional data model, we had mentioned that data is stored in sorted lexicographical order of that row key. This is similar to Cloud Spanner. And then the next step is also similar to Cloud Spanner: data is distributed, sharded in effect, based on those key values, so that data which has similar key values will be grouped together. If you think about it, this implies that performance will be really poor if all of the reads or writes end up being concentrated in some particular shards or in some ranges of the key values. A classic example is if sequential key values are used: there will be a shifting hot spot as all of the sequential operations happen.
This is a classic problem, a classic anti-pattern in the design of an HBase data store. As with Cloud Spanner, we always have the option of hashing those key values or using non-sequential keys wherever possible. Let's spend a little more time talking about how hotspots can be avoided. There are some fairly typical techniques, one of which is field promotion. Here the idea is that you use a structured key, and that structured key is arranged in reverse URL order, something like a Java package name, for instance. Why is this a good idea, you ask? Well, because in this way, keys will have similar prefixes, but they will have differing endings.
So if the sequential scan is based on some subset of the key prefix, all of the related values will just be picked off in one go. Reverse URL order is a pretty standard way of arranging keys in HBase, and you should use it wherever possible. The other common way of avoiding hotspots is salting. That is the descriptive term for the practice of hashing the key value. This is something that we referred to even while discussing Cloud Spanner. Next, we'll talk about a somewhat surprising feature of BigTable, referred to colloquially as warming the cache.
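These two hotspot-avoidance techniques, field promotion and salting, can be sketched in a few lines of Python. The bucket count and the use of MD5 below are arbitrary illustrative choices, not anything BigTable prescribes:

```python
import hashlib

# Field promotion: build the row key from a reversed domain so that
# related rows share a prefix and a sequential scan on that prefix
# picks them all up together.
def reverse_domain(domain: str) -> str:
    return ".".join(reversed(domain.split(".")))

print(reverse_domain("maps.google.com"))  # "com.google.maps"

# Salting: prefix the key with a small, deterministic hash bucket so
# that otherwise-sequential keys spread across different shards.
def salted_key(key: str, buckets: int = 8) -> str:
    salt = int(hashlib.md5(key.encode()).hexdigest(), 16) % buckets
    return f"{salt}#{key}"
```

Note the trade-off with salting: a range scan now has to query each bucket, so it trades scan convenience for evenly distributed writes.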
It refers to the fact that BigTable or HBase will tend to improve in performance over time. The reason for this is that BigTable is smart: it observes the read and write patterns in your data, and it then goes ahead and redistributes the data in intelligent ways so that those reads and writes are evenly distributed over all of the shards, the distributed partitions. I should add that this feature is more prominent in BigTable than in HBase: BigTable is more proactive about moving data around in order to eliminate hot spots. BigTable will try to do this in ways that store roughly equal amounts of data in different shards. An important implication of this is that if you are testing the performance of your HBase or BigTable system, you need those performance tests to last for several hours in order to get a true sense of the performance.
If you run an inordinately short test, maybe half an hour or less, that's not going to give HBase or BigTable enough time to carry out all of these smart data movements to eliminate hotspots, and you will get a misleadingly poor indication of performance. Another decision that you've got to make while designing your BigTable implementation is whether you want to use SSD or HDD disks. The simple rule of thumb is: use SSDs unless you are really operating on a shoestring budget, and even then, SSD probably makes more sense. SSDs can be up to 20 times faster than ordinary hard disks on individual row reads, although that advantage is a lot less when you are considering batch reads or sequential scans. Another advantage of SSDs is that they are more predictable in terms of their throughput, and this gives BigTable room to learn and predict how they are going to operate. If the performance is very variable, that could throw BigTable's calculations for a spin. So really only think about using ordinary persistent disks if your data size exceeds about ten terabytes and if your common usage pattern is only batch queries. The greater the proportion of random access that you perform in your BigTable, the stronger the case for SSD. We can add here, as an asterisk, that if all of your data usage takes the form of random access, then maybe BigTable isn't even the right tool for you; maybe you should be looking at a document-oriented database like Datastore instead. Because BigTable is a rather complicated beast, reasons for its poor performance are often hard to find. Here are some pointers, indicators that might help you. If you are suffering from a badly performing BigTable, the first place to look would be at the schema design.
Check if you have sequential keys or some other such anti-pattern which is causing hot spotting, or causing the reads and writes to be concentrated in some specific shards. The next set of possible causes has to do with inappropriate workloads. Maybe your data set is too small, less than 300 GB? That's not enough for BigTable to really show its talents. BigTable comes into its own at more than 1 TB and can be used up to petabytes in size. Another possible problem has to do with the usage pattern: maybe your queries run in short bursts, when really BigTable performs best when it has hours of observation to tune performance internally. There are also some of the usual suspects.
For instance, maybe your cluster is just too small. Another possibility is that your cluster has just been fired up or just been scaled up. In either one of these cases, it's going to take some time for BigTable to understand the patterns and allocate the newly added resources optimally. As we just discussed, it might also be a case of you using HDDs instead of SSDs. Don't be penny-wise, pound-foolish, as the old saying goes. And lastly, it might be that you are pessimistic about your performance in a development environment; don't give up until you've tried it in production, because the differences between the levels of optimization are particularly stark for BigTable.
BigTable does a lot more in production than it does in development. Schema design is very important with BigTable, so let's spend a minute talking about some of this, even at the cost of repetition. Remember that each table has just one index, and that's the row key, so choose that index well. Unlike in Datastore or Cloud Spanner, you don't have the luxury of picking multiple indices per table. Next, remember that those row keys are going to be sorted lexicographically and rows will be arranged in that ordering, so be smart about your choice of row key. Do not use an anti-pattern like a sequentially increasing integer count. And also keep in mind the row-centric worldview of BigTable: all operations are atomic only at the row level. ACID properties are supported only at the row level; multi-row operations are not ACID-guaranteed.
The beauty of that four-dimensional data model, with its column family, column, and row key, is that related entities will be stored in adjacent rows, and this can give rise to the really fast sequential scanning performance that we hope for from HBase or BigTable. Remember the kinds of row keys that you should be looking to use. Reverse domain names are the first choice; they should jump to mind. String identifiers are fine as well, because they will typically hash evenly. And lastly, timestamps, but only as key suffixes. This is important: do not include timestamps as the first, or prefixed, portion of your key. This is likely to be a sequentially increasing field in order of insertion, and that will cause hot spotting. That brings us to row keys to avoid. And the first one that comes to mind there is a regular domain name rather than a reverse domain name, because here the common portion is going to be at the end of the row key, and that will cause adjacent values to not be logically related.
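The timestamp-as-suffix rule can be sketched in Python. The entity names and the fixed-width timestamp format below are made up for illustration:

```python
# Timestamp as a key *suffix*: rows for one entity stay contiguous
# (fast range scans per entity), while writes across many entities
# spread over the keyspace instead of piling onto one shard.
def good_key(entity_id: str, ts_millis: int) -> str:
    return f"{entity_id}#{ts_millis:013d}"

# Anti-pattern: timestamp as a *prefix* behaves like a sequentially
# increasing key, so all new writes land on the same "hot" shard.
def bad_key(entity_id: str, ts_millis: int) -> str:
    return f"{ts_millis:013d}#{entity_id}"

ts = 1700000000000
# With the suffix form, sorting groups rows by entity, not by time:
print(sorted([good_key("sensor42", ts + 1), good_key("sensor07", ts)]))
```

With `good_key`, a range scan over `sensor42#...` retrieves one sensor's whole history in a single contiguous read.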
A similar problem, or a similar anti-pattern, is sequential numeric values. These cause hot spotting, as we've discussed on a bunch of occasions. It's usually a pretty bad idea to use timestamps alone as the row IDs, and it's also a bad idea to use row keys which are prefixed by a timestamp; that ends up being quite similar to a sequential numeric value. And finally, because data storage is so tied to row key values, do not use as row keys fields which are likely to be changed repeatedly. Ideally, your row key should be immutable, so be careful if you are going to use mutable or frequently updated values as your row key. BigTable also has some recommendations for different size limits. Your row key should not exceed 4 KB.
You should not have more than 100 column families; at that point, it starts to get complicated for BigTable. Individual column values should not exceed about ten megabytes in size, and the total row size should not exceed about 100 megabytes. All in all, BigTable has a complicated set of performance considerations. And these are complicated for a good reason: they're tied to the equally complicated underlying physical representation of the columnar data store that BigTable and HBase both use. Let's return and answer the question we posed at the start. Operations in HBase and BigTable are atomic, but only at the level of a row. HBase and BigTable do not offer any stronger guarantees on atomicity; they certainly do not offer full ACID support.
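The size limits just listed can be collected into a small Python sketch; the `check_row` helper is purely illustrative, not part of any BigTable API:

```python
# Recommended size limits mentioned above, encoded as constants.
LIMITS = {
    "row_key_bytes": 4 * 1024,          # row key <= 4 KB
    "column_families": 100,             # <= 100 column families per table
    "cell_bytes": 10 * 1024 * 1024,     # single value <= ~10 MB
    "row_bytes": 100 * 1024 * 1024,     # total row <= ~100 MB
}

def check_row(row_key: bytes, cells: dict) -> list:
    """cells: {(family, column): value_bytes}; returns a list of violations."""
    problems = []
    if len(row_key) > LIMITS["row_key_bytes"]:
        problems.append("row key too large")
    families = {fam for fam, _ in cells}
    if len(families) > LIMITS["column_families"]:
        problems.append("too many column families")
    total = 0
    for (fam, col), value in cells.items():
        if len(value) > LIMITS["cell_bytes"]:
            problems.append(f"cell {fam}:{col} too large")
        total += len(value)
    if total > LIMITS["row_bytes"]:
        problems.append("row too large")
    return problems

print(check_row(b"12345", {("Personal", "Name"): b"John"}))  # []
```

A pre-write check like this is a convenient place to catch oversized rows before they ever reach the store.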
At the end of this demo, you should be able to answer this question: what are the two operations that are very low latency in Cloud BigTable? In this demo, we'll see a quick way to get up and running with Cloud BigTable using the HBase shell. In order to get connected to BigTable and start using it, we need to create a BigTable instance. Move to BigTable in your side navigation menu and click on Create a new instance. The web console will, as usual, make things very easy for you and walk you through creating a new BigTable instance. Give the instance a name, which is once again just for display purposes. The instance ID is permanent and will be used to refer to the instance. At this point, you have two choices. You can choose a production instance, which is what is recommended if you're setting up a real web app which is going to serve real traffic. This needs a minimum of three nodes and is highly available. Once you set up this instance, though, you cannot downgrade it later; you need to keep it, or delete it if you no longer need it. Or, if you're just playing around with BigTable in order to understand it, as we are, you can choose the development instance. It's lower cost, it's meant for development, and it's not highly available, but you can upgrade it to a production instance later. In addition to an instance ID, you also need to specify a cluster ID for BigTable.
Once again, this is permanent, and there are some constraints on what characters a cluster ID can accept. Also specify the zone where you want your instance to be located; I'm going to just choose us-central1-c, the first option in the list. Once again, you have a choice here as to what kind of storage your BigTable instance should use. You can choose the high-performance, low-latency SSD, which is what is recommended; or, if you have huge data sets and you want to lower your storage cost and you don't care about latency, you'll choose HDD to store your data. Click on Create and it will go ahead and create a BigTable instance for you. As you already know, BigTable is Google's columnar data store. It's the exact analog of HBase. In fact, HBase was built based on a paper that Google engineers released back in 2006 about BigTable. Now, HBase is really common in the open-source world, and many people who are moving to Google Cloud Platform are familiar with HBase, which is why Google Cloud Platform provides you an HBase shell where you can use HBase commands to connect to and work with BigTable. In order to use the HBase shell, you need to download the Google Cloud BigTable quickstart, a zip file hosted on storage.googleapis.com. This zip file has a script that can quickly set you up with the HBase shell and allow you to connect to your BigTable instance.
This is provided that you are authenticated and logged in, using Cloud Shell or a terminal on your local machine. Unzip the file and it will set up the quickstart in your current working directory. Notice that a quickstart folder has been created in your current working directory. cd into that folder, and we'll run the script to connect to BigTable using the HBase shell. Now, this script will work only if you have a BigTable instance set up. You can use gcloud beta bigtable instances list to see the list of BigTable instances that we have.
We have just one, called test-bt, and it is in the ready state at this point in time. This script works under three conditions: you have to be authenticated and logged in using gcloud auth login; you have to have a default project set up (my-test-project in our case); and you have to have a BigTable instance set up. Simply run the quickstart script and it will take you to the HBase shell. If you're familiar with the HBase shell, what you're going to see is going to be very straightforward for you. You can run list to see what tables you've set up within BigTable; we have no tables so far. You can use the create command to create the students table, and within the students table we want the Personal column family. As you know, BigTable is a columnar store, and all columns are logically grouped into column families; a table should have at least one column family. Running the list command now should confirm that exactly one table, the students table, has been set up.
Let's insert our first row into this BigTable table. We want to put into the students table a row with row key 12345. It's important that you specify a row key for every row that you insert into BigTable. The row key uniquely identifies a row and is what is used to index all the columns and column values that are present in one row. It is this indexed row key that allows very fast lookup operations in HBase and BigTable, and also very fast scan operations, where you scan a range of rows together. When we insert a value in a particular row, we need to specify the column family and the name of the column where this insert should occur. Personal is the column family that we set up, Name is the name of the column, and John is the value that we are adding to the Personal:Name column. Here we added just one value in a row for a particular row key.
We can add the other values for the same row as well, using put statements. So within students, with the same row key 12345, here is the state information: John lives in the state of California. Here we have another student, with a different row key, named Emily. We've only added the name information for Emily, and here is the state information: she lives in Washington. The row key for each row in BigTable has to be unique. Running a scan command in the HBase shell will list all the information in the students table. This table has just two rows of information, which we've added in this session. Here are the row keys. The row keys are repeated twice because there are two columns' worth of data on the right-hand side. For every row key, we can see the name of the column family and the name of the column; we have the Name and State columns for each of these row keys. Each of these values is associated with a timestamp.
This forms the versioning information for this BigTable value. And finally, we have the value for each of these columns: John lives in California and Emily lives in Washington. The list command will show you the students table; we haven't created any other table here. And if you wanted to delete this table, you'd simply run the drop 'students' command. Just like HBase, BigTable is optimized for very fast lookup operations using the row key, which is what is used to index all the rows in a table, and also very fast scan operations: contiguous row keys can be looked up in a range very quickly.
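Putting the demo together, the HBase shell session described above would look roughly like the sketch below. Emily's row key is not given in the transcript, so 67890 is a made-up placeholder:

```
create 'students', 'Personal'
list

put 'students', '12345', 'Personal:Name', 'John'
put 'students', '12345', 'Personal:State', 'California'
put 'students', '67890', 'Personal:Name', 'Emily'        # 67890: hypothetical row key
put 'students', '67890', 'Personal:State', 'Washington'

scan 'students'
drop 'students'
```

Every put names the table, the row key, the family:column qualifier, and the value, exactly matching the four-dimensional addressing discussed earlier (with the timestamp assigned automatically at write time).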