Google Professional Data Engineer – Datastore ~ Document Database

  1. Datastore

Here is a question that I’d like you to try to answer before we begin our conversation about Datastore. Keep that answer at the back of your mind during the upcoming video; we’ll revisit it at the end. The question is a fundamental computer science one: what data structure has approximately constant lookup time? We spent a lot of time talking about Bigtable and HBase, and that made sense because those were both complex beasts. Let’s now turn our attention to a rather simpler but also very cool data storage product, and that is Datastore. Going back to our master list of use cases, Datastore is something that we turn to when we are looking for document-oriented storage in a NoSQL database. This is something which Datastore offers in competition with other products like MongoDB, CouchDB and so on. Datastore is really easy to understand if you imagine that what’s going to be stored in Datastore is going to be document data.

Something like XML or HTML, with all of the characteristic patterns of root elements, children, attributes and so on. If you think about it, XML or HTML data has the characteristic patterns of a key-value structure. It is highly structured, in fact. But unlike the kind of columnar data we would put in HBase, there are a large number of differing keys, and those keys are hierarchically related to each other. Datastore performs a whole bunch of indexing around these different key values. And because it is so document-oriented, it is not typically used for either OLTP or OLAP. Rather, the most common use case for Datastore is when you want crazy fast lookup, and you would like that crazy fast lookup to scale almost infinitely with the size of your data set. That kind of crazy fast lookup, independent of the size of your data set, can only be achieved using indexing, specifically hash-based indices. That also has implications for writes, updates and transaction support.

Typically, transaction support in Datastore is better than in the other NoSQL products. Of course, it isn’t quite as good as in Cloud SQL or Cloud Spanner. Let’s return to our conversation about fast scaling of reads. The speciality of Datastore is that query execution time really depends on the size of the returned result and not on the size of the data set as a whole. The implication of this is that if your query is going to return ten documents, it doesn’t matter whether those ten documents are being drawn from a document store with just ten documents or from a much larger one with 10 billion documents. The query running time is basically going to scale in proportion to the size of your result set and not the size of your total data set. This, of course, makes Datastore perfect for searching for needles in haystacks, where you’re looking for a random key and you want to find all of the corresponding values. This is a different use case from HBase or Bigtable, where you are performing a sequential scan over an entire range of related keys.
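To make the needle-in-a-haystack point concrete, here is a minimal Python sketch. It is not how Datastore is implemented; `ToyIndex` and `build_index` are hypothetical names, and the index is just a dictionary from a property value to the keys of matching entities. The point it illustrates is that the work done by a lookup tracks the number of matches, not the number of entities stored.

```python
from collections import defaultdict


class ToyIndex:
    """Hypothetical single-property index: value -> list of entity keys."""

    def __init__(self):
        self._entries = defaultdict(list)

    def add(self, value, entity_key):
        self._entries[value].append(entity_key)

    def lookup(self, value):
        """Return (matching keys, number of index entries touched)."""
        matches = self._entries.get(value, [])
        return list(matches), len(matches)


def build_index(num_entities, needle_count):
    """Index `num_entities` entities, of which `needle_count` are 'white'."""
    index = ToyIndex()
    for i in range(num_entities):
        color = "white" if i < needle_count else f"color-{i}"
        index.add(color, i)
    return index


small = build_index(num_entities=10, needle_count=10)
large = build_index(num_entities=100_000, needle_count=10)

small_keys, small_work = small.lookup("white")
large_keys, large_work = large.lookup("white")

# Both lookups touch the same number of index entries.
print(small_work, large_work)
```

Both stores return ten matches and touch ten index entries, even though one holds ten thousand times more data than the other.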

Datastore is somewhat closer to relational databases than other NoSQL products. So let’s compare these two. Let’s juxtapose them and see how they line up. In a traditional RDBMS, you have atomic transactions. That is also true in Datastore. Datastore does support atomic transactions and the ACID properties. To a large extent, this has to do with the need to keep all of those internal indices consistent with each other. Both traditional RDBMS and Datastore make heavy use of indices for fast lookup. And here Datastore takes the use of indices to a whole other level. Every query will make use of indices. This is far beyond what traditional RDBMS do, and for that reason the query time property which we just discussed comes into play. The query execution time in Datastore is going to be basically independent of the size of the underlying data set. And as we probably know from long experience, that is certainly not the case with traditional RDBMS.

These, in a sense, were areas of similarity. Let’s also understand some of the differences. Traditional RDBMS use relational data, that is, rows and columns, without a whole bunch of hierarchical relationships within those entity relations. Datastore, on the other hand, is document-oriented, and that means that it’s optimized for hierarchically structured data like XML or HTML. And this has the form of a tree in its internal representation. Think of the DOM, the Document Object Model, in an HTML doc. There is also a slight change in terminology in terms of what rows, columns and attributes are called. In a relational database, rows are stored in tables. In a document database, entities are of different kinds. Focus on the word entity, which corresponds to a row.

Focus on the word kind, which corresponds to a table. Focus on the absence of the word stored. Entities are of different kinds; entities are not really stored in different kinds. This is a slightly different take on the world, which has to do with the document-oriented nature of Datastore. What would an example of an entity be? Well, think HTML tags: a head tag or a body tag would be an entity in a document datastore. Rows consist of fields in a traditional RDBMS, while entities consist of properties in Datastore. Again, think of HTML as an example. If you have a head tag, that’s going to have a bunch of nested tags, and those are the properties. Traditional databases have primary keys as a unique ID. In Datastore, the word primary is not used, since that’s kind of relational; you just refer to them as keys. So really, just remember, as far as the data model is concerned, Datastore has entities which are of different kinds, and those entities consist of properties.
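The terminology mapping above can be written down as a small Python sketch. This is only an illustration of the vocabulary, not a client library API: the class names `Key` and `Entity` and the sample property values are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class Key:
    kind: str        # plays the role of a table name in an RDBMS
    identifier: Any  # plays the role of a primary key value


@dataclass
class Entity:
    key: Key  # an entity is "of" a kind; the kind lives in its key
    # Properties play the role of a row's fields (name -> value).
    properties: Dict[str, Any] = field(default_factory=dict)


# Illustrative values only.
phone = Entity(Key("Products", 1), {"name": "iPhone 6s", "available": True})

print(phone.key.kind, sorted(phone.properties))
```

Notice that the kind is part of the entity's key rather than a container the entity sits in, which matches the "entities are of different kinds" phrasing.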

If you think about it, HTML documents are quite loose in terms of how strictly they enforce rules about nested tags and so on. And this carries over to schema checking in Datastore as well. In a traditional RDBMS, all rows of the same table need to have the same schema, or the same properties. In other words, they’ll have the same number of columns, and each column will have the same type in every row. Schemas are strongly enforced in relational databases. In contrast, Datastore is very lenient here. It’s perfectly okay for different entities of the same kind to have different properties.

So for instance, maybe you have two HTML documents. Each of these has a head tag, but inside one of the head tags is a body, whereas the other does not have a body tag at all. That’s perfectly okay in a document-oriented store. This is also true of types. In a relational database, all of the values in a particular column must have the same type, and this is strictly enforced. But in Datastore, properties with the same name can have different types in different entities. Imagine this: let’s say that you have two XML documents. Each of them has a body tag, but inside one of them, the body tag has another property called ID, which is an integer.

The other body tag also has a property called ID, but that one happens to be a GUID or a string of some sort. As we shall see, Datastore also has some quirks in which operations it will and will not support. For instance, unlike traditional relational databases, Datastore does not support joins, it does not support filtering on subqueries, and it also does not allow more than one inequality filter. We’ll have more to say about this in just a little bit. Let’s understand when it does not make sense to use Datastore. Because we’ve now run through a gamut of storage technologies, hopefully this won’t hold a lot of surprises for us. Don’t use Datastore if you need very strong transaction support. If you’re doing hardcore OLTP, you should use something like Cloud Spanner. If you want basic ACID support, however, Datastore is probably enough for you. Datastore comes into its own when data is hierarchical and highly structured. If you have data which is non-hierarchical or unstructured, Bigtable is probably a better NoSQL technology. Do not use Datastore for analytics, OLAP or business intelligence types of applications. BigQuery is a lot better because it has complex queries which are optimized for numerical calculations rather than documents. Clearly, Datastore requires key values and a whole bunch of indices. These don’t make sense if you’re storing immutable blobs like movies. If each is greater than 10 MB in size, just go with Cloud Storage instead.

And finally, keep in mind the extremely heavily indexed nature of Datastore. Do not use Datastore if your application is going to carry out a lot of writes and updates on your key columns. Okay, let’s now turn to those situations where Datastore shines. The basic use case which we’ve discussed is crazy scaling. We would like the read performance to scale to virtually any size of underlying data set. And of course this really makes sense for hierarchical documents because, after all, Datastore is document-oriented, as we already saw in its data model, enforcement of schema and so on. Let’s take a second to really understand full indexing and its implications for how Datastore works. Remember that there are built-in indices in Datastore on every property of every entity. That is saying a lot. That’s basically comparable to an RDBMS where every column of every table has an index constructed on it by default. Now, this only applies to individual properties, but there are also composite indices. These allow the indexing of multiple property values all at once.
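Here is a minimal sketch of what "an index on every property of every entity" implies for writes. It is a toy model under simplifying assumptions: `ToyDatastore` is a hypothetical class, it keeps exactly one index entry per property, and it ignores composite indices. It only counts how many index entries a write has to touch.

```python
from collections import defaultdict


class ToyDatastore:
    """Hypothetical store that keeps one index per (kind, property name)."""

    def __init__(self):
        self.entities = {}
        # (kind, property name) -> property value -> set of entity ids
        self.indexes = defaultdict(lambda: defaultdict(set))
        self.index_writes = 0

    def put(self, kind, entity_id, properties):
        # An update first removes the old index entries...
        old = self.entities.get((kind, entity_id), {})
        for name, value in old.items():
            self.indexes[(kind, name)][value].discard(entity_id)
            self.index_writes += 1
        # ...then every property of the new version is indexed.
        for name, value in properties.items():
            self.indexes[(kind, name)][value].add(entity_id)
            self.index_writes += 1
        self.entities[(kind, entity_id)] = dict(properties)

    def find(self, kind, name, value):
        """Equality lookup served entirely from the index."""
        return sorted(self.indexes[(kind, name)].get(value, ()))


store = ToyDatastore()
store.put("Products", 1, {"name": "iPhone 6s", "color": "white", "available": True})
writes_after_insert = store.index_writes

store.put("Products", 1, {"name": "iPhone 6s", "color": "black", "available": True})
writes_after_update = store.index_writes

print(writes_after_insert, writes_after_update)
print(store.find("Products", "color", "black"))
```

In this toy model, one insert of a three-property entity costs three index writes, and rewriting the same entity costs six more. That is the trade being made: every lookup is an index lookup, and every write pays for it.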

Now, if you are absolutely certain that a property will never be queried, you can explicitly exclude it from this full indexing. That might give you some performance benefits, particularly in write operations, where you do not want to be updating a whole bunch of unnecessary indices. The way Datastore works, every query will be evaluated using something known as its perfect index. The perfect index is an interesting concept. Given a query, the perfect index is that index which will most optimally return the query’s results. The perfect index is chosen by checking the following conditions in order. If there is an equality filtering condition, that will be treated as the perfect index. If there is an inequality filter on a property, and by the way, only one such inequality filter is allowed per query, then that will be used, provided that there is no equality filter.

And if there is neither an equality filter nor an inequality filter, but there is a sort condition, the index on whatever property is being sorted on will be used as the perfect index. So we can see from this that the perfect index will be the equality filter, i.e. the needle-in-the-haystack type of use case is the one that is optimized. If there is not an equality filter, then there can be at most one inequality filter. This makes it a range query. And if neither inequality nor equality filters apply, then something like the sort order will be considered. Full indexing is a wonderful feature of Datastore, but it also has some important implications which we need to grasp. The first of these, which is quite obvious, is that updates are really slow, because after all, the whole trade-off of indexing is that updates become slow but lookups become blazingly fast.
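The order of checks just described can be sketched as a small Python function. This mirrors only the order given in this lecture; it is not Datastore's actual query planner, which also involves composite indices. The function name `choose_index_property` and the `price` property are hypothetical.

```python
def choose_index_property(equality=(), inequality=(), sort=()):
    """Pick the property whose index drives a query.

    Each argument is a sequence of property names. The checks run in the
    order described above: equality filter, then inequality filter, then
    sort order.
    """
    if len(inequality) > 1:
        raise ValueError("at most one inequality filter is allowed per query")
    if equality:
        return equality[0]
    if inequality:
        return inequality[0]
    if sort:
        return sort[0]
    return None  # no filter and no sort order: nothing to choose


# Needle in a haystack: the equality filter wins.
needle = choose_index_property(equality=["color"], inequality=["price"])

# Range query: a single inequality filter and no equality filter.
range_query = choose_index_property(inequality=["price"], sort=["name"])

# Neither kind of filter: fall back to the sort property.
sorted_only = choose_index_property(sort=["name"])

print(needle, range_query, sorted_only)
```

A query with two inequality filters raises an error here, which corresponds to the restriction mentioned in the lecture.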

Another implication of this full indexing is that joins are not supported, so there are no joins in Datastore. This is another similarity with Bigtable, by the way, and another difference from relational databases. In addition, it’s not possible to filter results based on subquery results. And it’s not okay to have more than one inequality filter. One inequality filter is okay; more than one is not. Let’s now change tracks a little bit to a completely different aspect of Datastore, and this is multitenancy. In XML and HTML it’s possible to specify namespaces in your documents, and effectively each namespace can be used to refer to documents from different clients. This same idea is basically appropriated by Datastore. In this way it’s possible to have separate data partitions corresponding to separate client organizations.

This is easily achieved using namespaces, and this has the advantage that we can then use the same schema type for all clients but vary the values in those schemas. Like many other features of Datastore, this is easy to understand if one thinks about how namespaces are specified in XML documents. Different kinds and entities can coexist within namespaces. Let’s move on to transaction support. As previously mentioned, transaction support is slightly better in Datastore than in Bigtable. But it’s important to remember that this transaction support in Datastore is optional.

You can opt to use Datastore without transaction support turned on, and if you really want strong transaction support or consistency guarantees, then of course you ought to use an RDBMS, something like Cloud Spanner or Cloud SQL. Let’s also quickly look at the levels of consistency supported by Datastore. There are two choices here, and you could go with either one of them depending on your requirements. You could either require strong consistency, where Datastore will always return the up-to-date result however long it takes, or you could go with an eventual consistency model.

And of course this is a lot faster. But it also carries the risk that your query results might be stale. Let’s revisit the question that we posed at the start of this video. The answer to this question is a hash table. Now, there are some wrinkles around lookup time in a hash table. Strictly speaking, it is order of k, where k is the number of entries that have collided in the bucket being searched. But for all intents and purposes we can assume that lookup in a hash table is independent of the size of the data set. It depends really on the number of collisions. So it is basically constant. A hash table is the answer to this question.
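If you want to see the "depends on collisions, not on size" point in code, here is a minimal chained hash table in Python. It is a teaching sketch, not how Python's `dict` or Datastore work internally; the class name `ChainedHashTable` and the bucket count are assumptions for this example. The `get` method reports how many key comparisons it made.

```python
class ChainedHashTable:
    """Toy hash table that resolves collisions by chaining."""

    def __init__(self, num_buckets=1024):
        self._buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key):
        """Return (value, comparisons made); raise KeyError if absent."""
        comparisons = 0
        for existing_key, value in self._bucket(key):
            comparisons += 1
            if existing_key == key:
                return value, comparisons
        raise KeyError(key)


table = ChainedHashTable(num_buckets=1024)
for i in range(500):
    # Small integers hash to themselves in CPython, so each of these
    # keys lands in its own bucket and there are no collisions.
    table.put(i, f"v{i}")

value, comparisons = table.get(123)
print(value, comparisons)
```

With no collisions, a lookup costs a single comparison no matter how many keys are stored. If many keys were forced into one bucket, the cost would grow with the length of that bucket's chain.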

  1. Lab: Datastore demo

At the end of this lecture, you should be able to answer this question confidently: do all entities of a particular kind in Cloud Datastore have the same properties and the same associated data types? Is this true or false? This lecture will work with Cloud Datastore, which is Google’s NoSQL document database built for automatic scaling, high performance, and ease of application development. On your Google Cloud Platform dashboard, go to the side navigation bar, click on Datastore and choose Entities. This is where we can create the entities that will live in our Cloud Datastore. You’ll find a big blue box there asking you to create an entity and try out Cloud Datastore. Go ahead, click on it, and let’s create our very first entity.

The first thing you need to specify is a namespace where your entities will live. If you plan for your Datastore to be multitenant, that is, entities from different clients will live within Cloud Datastore, you should use different namespaces for your entities. This is how you separate entities from multiple clients. Notice here that we are directly jumping into the creation of entities. This implies that Cloud Datastore is serverless. We don’t create an instance of Cloud Datastore before we populate it. Within the default namespace, I’m going to create entities of the Kind Products. Remember that Kind corresponds to table in a relational database. Within this entity, I want a key identifier. I can choose for it to be numeric, in which case it will be autogenerated, or a custom name, where I have to specify a unique key identifier for every entity. Entities stored in Cloud Datastore have a hierarchical relationship, which means you can specify a parent entity for the entity that you are creating.
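The pieces the console asks for here (namespace, Kind, key identifier, optional parent) can be pictured with a small Python sketch. This is an illustration only, not the client library's key class: `Key`, its field names, the client namespaces and the `Reviews` kind are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Key:
    kind: str
    id_or_name: Union[int, str]    # numeric ID or custom name
    namespace: str = ""            # "" stands in for the default namespace
    parent: Optional["Key"] = None

    def path(self):
        """Return the chain of (kind, identifier) pairs from the root down."""
        prefix = self.parent.path() if self.parent else ()
        return prefix + ((self.kind, self.id_or_name),)


# Same kind and ID, different tenants: the namespace keeps them apart.
client_a_phone = Key("Products", 1, namespace="client-a")
client_b_phone = Key("Products", 1, namespace="client-b")
print(client_a_phone == client_b_phone)

# A child entity names its parent, which gives the hierarchy.
review = Key("Reviews", "r1", namespace="client-a", parent=client_a_phone)
print(review.path())
```

The first entity created in the demo has no parent, which corresponds to leaving `parent` as `None` in this sketch.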

Since this is our very first entity, there is no parent here, so that field is completely empty. Let’s go ahead and add properties to our entity. Click on Add Property. Specify a name for the property and a type, which in this case is String. And since it’s a product entity, we’ll simply say it’s an iPhone 6s. Properties in an entity are essentially key-value pairs. Just specify a type for the value and you’re good to go. Go ahead and add a bunch of other properties and specify different data types.

For each of these properties, the UI will update to accommodate these different data types. The memory property is where I’m going to specify the storage capacity of this particular phone in gigabytes. My phone entity also has a color property. It’s a string value: a white iPhone 6s. Products entities are something that an e-commerce site would support, so let’s go ahead and add the availability of this particular phone. It’s a Boolean property, and the only values that you can set there are false or true. When you’re done, just hit the Create button at the very bottom and this entity will now be created. Notice that all the properties of an entity are indexed by default in Cloud Datastore.

If you hit Create Entity once again, you will find that the web console will helpfully prepopulate the Kind and the properties that we specified for our earlier entity. We’ll create a new entity of the Kind Products. It’s important to remember in Cloud Datastore that all entities in a Kind need not have the same properties. And even the data types for those properties need not be the same across entities. As an example, in this new entity that I’m going to create, I’m going to go ahead and delete the color property. The next product will not have a color property associated with the entity. Go ahead and add in a bunch of values for the other existing properties.

You can also choose to change the data type of any property within this entity. It’s a Samsung Galaxy phone, so the name I specify is Samsung Galaxy. For this particular entity in the Kind Products, I’ve added a new property called Screen Size. This was not present in the previous entity. Hit Create. So now you have two entities under the Kind Products. I’m just going to go ahead and create a third entity. You can create as many entities as you wish, so you have a few to play around with. At the end of this, I’ll have three entities of the Kind Products. I have only one Kind, though. If you look at the drop-down at the very top of this list, you’ll see I have just the Kind Products. Creating a new Kind is simply a matter of creating an entity within that Kind.
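The two entities created so far can be pictured as plain Python dictionaries. The property names follow the demo, but the specific values (memory sizes, screen size, availability) are assumptions made for this sketch, and storing memory as a string for the second phone is just one way to show a property whose type differs between entities.

```python
from collections import defaultdict

# Two entities of the same Kind, Products, with different property sets.
products = [
    {"name": "iPhone 6s", "memory": 64, "color": "white", "availability": True},
    # No color property, a new screen_size property, and memory as a string.
    {"name": "Samsung Galaxy", "memory": "32 GB", "screen_size": 5.1,
     "availability": False},
]

# Collect, for each property name, the value types seen across entities.
types_by_property = defaultdict(set)
for entity in products:
    for name, value in entity.items():
        types_by_property[name].add(type(value).__name__)

for name in sorted(types_by_property):
    print(name, sorted(types_by_property[name]))
```

Nothing here rejects the second entity for missing `color` or for giving `memory` a different type, which is the behavior the console shows for entities of the same Kind.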

Click on Create Entity and specify the Kind as Orders, or any new Kind that you want to create. Go ahead and create a bunch of entities under this Kind, so you’ll have both Products and Orders set up in your Cloud Datastore. In this view, using the filter at the very top, I can choose to view either the Products or the Orders Kind and see all the entities that I have within it. I can also filter entities further by using their properties. If you were to use this Cloud Datastore as the back end for an e-commerce site, you would want to be able to filter your products by their availability. You want all products which are available to show up in your search results. Since all properties of an entity are indexed by default, you can apply these filters to any of these properties.

You can choose Availability and select those entities with availability true or false, all with helpful drop-downs in this UI. You can specify multiple filters as well by using the plus icon on the right. Say you want to filter by availability and by the color of your phone or product. That’s very straightforward as well. All of this uses the web UI, which makes things very easy for you. You can also query by using GQL, which is the Google Query Language. You can simply specify a SQL-like query in order to query your Cloud Datastore. In this lecture, as we walked through the demo, it must have become pretty clear to you that entities of the same kind in Cloud Datastore can have different properties, and each of those properties can have different data types across entities as well.
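As a rough picture of the availability-and-color filter from the demo, here is a minimal Python sketch over in-memory dictionaries. It does not call Datastore; `filter_entities` is a hypothetical helper, the third product is made up, and the GQL in the comment is only an approximation of the query you might type in the console.

```python
products = [
    {"name": "iPhone 6s", "color": "white", "availability": True},
    {"name": "Samsung Galaxy", "availability": True},  # no color property
    {"name": "Third phone", "color": "white", "availability": False},
]


def filter_entities(entities, **equals):
    """Keep entities whose properties equal every given value.

    An entity that lacks a filtered property never matches.
    """
    missing = object()
    return [
        entity for entity in entities
        if all(entity.get(name, missing) == value
               for name, value in equals.items())
    ]


# Roughly corresponds to the GQL:
#   SELECT * FROM Products WHERE availability = true AND color = 'white'
matches = filter_entities(products, availability=True, color="white")
print([entity["name"] for entity in matches])
```

The Samsung Galaxy entity drops out because it has no color property at all, so it cannot match a filter on color.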