Introduction
This page lists the various terms that relate to data storage. Although data storage is part of the supporting infrastructure needed to host a web application, it is a big topic in itself, so the various data storage related terms are discussed here rather than in the infrastructure glossary.
An application capable of supporting a business must be able to interact with individual customers and provide the services each of them needs. Serving the same set of data to everyone, every time (like this ebook does), cannot grow as a business because it ignores the individual needs of someone who would be willing to pay for a customized service. Since HTTP requests are themselves stateless, it is necessary to have a "data repository" where user-specific data can be saved and served back to the same user at a later time.
Memory, or primary storage
(Reference: Wikipedia) Memory refers to a device that is used to store information for immediate use in a computer. The term "memory" is often synonymous with "primary storage". It operates at high speed and is directly available to the CPU. An example of computer memory is random-access memory (RAM), which allows data items to be read or written in almost the same amount of time irrespective of the physical location of the data inside the memory. The flip side of this high speed is that memory storage is very costly. As soon as the business gains some traction and the user data starts to grow, it becomes extremely costly to store all of it in memory. Worse, data in memory is volatile and is lost if the application stops, crashes, or the power goes out.
Secondary storage
Secondary storage differs from primary storage in that it is not directly accessible by the CPU. The computer usually uses input/output channels to access secondary storage and transfer the desired data to primary storage. Compared to primary storage, the data access rate for secondary storage is much slower. However, secondary storage has the huge advantage of being non-volatile, i.e. it retains data when its power is shut off. Also, for storing a similar amount of data, secondary storage is much cheaper than primary storage. For these reasons, secondary storage is primarily used to store data for a web application. In modern computers, hard disk drives (HDDs) or solid-state drives (SSDs) are usually used as secondary storage. Access times for HDDs are typically measured in milliseconds (thousandths of a second) and for SSDs in microseconds (millionths of a second), while access times for primary storage are measured in nanoseconds (billionths of a second). Between the two, SSDs have smaller access times than HDDs but are more costly, though still less costly than memory. For this reason, a web application may be designed to store frequently accessed or modified data on SSD storage, and move archived or unchanging data to HDDs. Other common options for secondary storage include USB flash drives and rotating optical storage devices (like CDs and DVDs).
File system storage
(Reference: Filesystem) A file system is a method and data structure that the operating system uses to control how data is stored and retrieved. Without a file system, data placed in a storage medium would be one large body of data with no way to tell where one piece of data stops and the next begins. By separating the data into pieces and giving each piece a name, the data is easily isolated and identified. The file system is responsible for organizing files and directories, and for keeping track of which areas of the media belong to which file and which are not being used. Conceptually, the design and terms used in a computer file system closely follow the method of storing and retrieving paper documents that predates computers. For example, each group of data is called a "file", and a collection of related data forms a "directory". From an implementation perspective, however, a working computer file system more closely resembles a notebook, where the data chunks corresponding to a file are written on one or more pages that may be consecutive or non-consecutive, and there is an index section containing a mapping from file name to the corresponding sequence of page numbers. It should also be noted that although a file system is most commonly implemented on secondary storage, it is also possible to define one on primary storage, as done by Linux's tmpfs.
To understand how file systems work with an example, consider yourself a note keeper. Let's say the first note you make is for your friend, who asks you to save "This is the friend's note" under the file name "friend.txt". You do this by writing the text "This is the friend's note|" on page-1 of your notebook, and adding an entry on the notebook's index page saying that "friend.txt" is on page-1. The index page in this example maps to an inode in a Unix-like file system, the page maps to a disk sector, and the file system in this example is you! Note that the "|" sign at the end is not a typo; its use is discussed below. In this example, even though there may be more space left on page-1, it is not used to write any new note because, according to the index entry, any and all content on page-1 belongs to "friend.txt". This wasted space is called slack space. At least on Windows, you can right-click on a file, go down to "Properties" and then look at the numbers under "Size" and "Size on disk"; the former will always be less than or equal to the latter because the latter also includes the slack space. Let's say that some time later, the same friend asks you for the note in "friend.txt". You first look at the index and see that the entry is on page-1. You go to page-1, read up to the first "|" sign, and return everything before it, i.e. "This is the friend's note". That's the role of the "|" sign: it marks the end of your note, so you only read up to it. In a computer, the corollary is a null termination character. Alternately, you could store additional information in your index about the total length of the file, and then the "|" would not be needed; in a computer file system, doing so maps to storing metadata in the file system. There are more uses of the "|" sign, discussed later. Consider another scenario where your friend asks for the note in "new-note.txt". You look at the index and don't find any entry, so you just say that you don't have that file. This is how a file system is able to identify that a file is missing. From a security perspective, access is allowed only by file name and not by the disk sector storing the data.
Let's consider another example: the same friend comes back and asks you to delete "friend.txt". Since you always check the index page to determine whether a file exists, you take a shortcut: you only delete the index entry for "friend.txt", and do not clear the actual "This is the friend's note|" data on page-1. Now, if you're asked for "friend.txt", you won't find the entry in the index and will say that the note is not found. Let's say your friend now asks you to save "new-note" under the file name "new-note.txt". From the index page, you notice that page-1 is not linked to any file name, so you write "new-note|" onto it. But page-1 already had some old data. With the new note written on the same page, page-1 now reads "new-note|he friend's note|"; the new data simply overwrites the existing old data, but only as much as needed. You also make an entry in the index that "new-note.txt" is on page-1. When your friend asks for "new-note.txt", you check the index to see that it is on page-1. On page-1, you now read only up to the first "|" sign and return everything before it, i.e. you return "new-note". Here, everything after the first "|", i.e. "he friend's note|", is garbage data. Similar handling also happens in a computer file system, where storage data is not proactively deleted when a file is deleted (reference).
Now, let's say that your sister asks you to store a poem under "sister.txt". It takes the complete page-2 and page-3, and a small part of page-4. In the index, you mark "sister.txt" with the value 2, 3, 4. Just a small refresher: here, page-4 has slack space, but pages 2 and 3 do not. Let's say your friend asks you to update "new-note.txt" with a long todo list. You write it on page-1, then notice that pages 2-4 are already filled, so you write the remainder on page-5. You update the index so that "new-note.txt" has data on page-1 and page-5. Note that even though you can retrieve "new-note.txt" and return it to your friend as a single text, it is not necessary that it be stored in consecutive pages; you simply compensate by sequentially reading the data chunks regardless of whether they are consecutive or not. For whatever reason, you could have stored the todo list starting on page-5 and written the remainder on page-1, and that's still fine, because you'll then read page-5 before page-1, and the index entry will be updated correspondingly. For a computer file system, reading complete data stored in non-consecutive disk sectors requires "hopping" from one sector to another, and this is something SSDs do faster than HDDs, which is why the former are faster than the latter. Do note that reading data in consecutive disk sectors does not cause hopping; hence, for reading the data in "sister.txt", an HDD will perform similarly to an SSD, even though the data is distributed across multiple pages, i.e. disk sectors. Let's say that your sister now asks you to delete "sister.txt". If you now look at how your friend's note, i.e. "new-note.txt", is stored, you'll notice that even though it is a single file, it starts on page-1 and continues on page-5. After deleting "sister.txt", there is no longer any non-garbage data on pages 2-4. Had your friend asked you to save the todo list after your sister deleted her poem, it would have landed on pages 1-2 rather than being "fragmented" across non-contiguous pages. This is the corollary to disk fragmentation. Fragmentation naturally occurs as a file system is used over time, and it hurts the average speed of data retrieval because retrieving the data for a single file requires hopping from one disk sector to another. At least for HDDs, data on consecutive disk sectors reads much faster than data spread across distant sectors.
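To make the analogy concrete, here is a minimal, runnable sketch of the "notebook" file system described above. Everything in it (the NotebookFS class name, the 16-character page size) is made up purely for illustration; real file systems track far more metadata.

```python
# A toy "notebook" file system: an index maps file names to page numbers,
# and pages hold fixed-size chunks of text.
PAGE_SIZE = 16  # characters per "page" (a stand-in for a disk sector)

class NotebookFS:
    def __init__(self, page_count=8):
        self.pages = [""] * page_count   # raw storage
        self.index = {}                  # file name -> list of page numbers

    def _free_pages(self):
        used = {p for pages in self.index.values() for p in pages}
        return [i for i in range(len(self.pages)) if i not in used]

    def write(self, name, text):
        text += "|"                      # end-of-note marker (null terminator)
        free = self._free_pages()
        chunks = [text[i:i + PAGE_SIZE] for i in range(0, len(text), PAGE_SIZE)]
        if len(chunks) > len(free):
            raise IOError("notebook is full")
        pages = free[:len(chunks)]
        for page_no, chunk in zip(pages, chunks):
            # Overwrite only as much as needed; old bytes past the chunk remain.
            self.pages[page_no] = chunk + self.pages[page_no][len(chunk):]
        self.index[name] = pages         # update the "inode"

    def read(self, name):
        if name not in self.index:
            raise FileNotFoundError(name)
        data = "".join(self.pages[p] for p in self.index[name])
        return data.split("|", 1)[0]     # read only up to the first marker

    def delete(self, name):
        # Only the index entry is removed; page contents become garbage data.
        del self.index[name]

fs = NotebookFS()
fs.write("friend.txt", "This is the friend's note")
print(fs.read("friend.txt"))   # This is the friend's note
fs.delete("friend.txt")
fs.write("new-note.txt", "new-note")
print(fs.read("new-note.txt")) # new-note (old bytes past the marker are garbage)
```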
Database
See the articles about databases on Wikipedia and on Oracle. A database enables storage of "records", i.e. either structured or semi-structured data, so they can be read back at a later time. Most databases store data on secondary storage, but it is also possible to have in-memory databases that store data in primary storage. For example, let's say you are watching a Batman movie; the corresponding data to store is (you, Batman-movie). At a later time, maybe you are watching a Superman movie and your friend is now watching the Batman movie that you previously watched. The corresponding data are (you, Superman-movie) and (your-friend, Batman-movie). In this example, the database would store three entries about the person who saw the movie and the movie that was seen.
Types of databases
From the articles here, here, here and here, it can be observed that there are various types of databases available. Depending on the business use case, one may be better suited than the others. The broadest division is that of relational vs non-relational databases, and they are discussed below. Other classification categories also exist: for example, whether the database is on-premise vs on-cloud, or whether the database stores data in row format or in column format. For data stored in row format, it is easy to get all the fields corresponding to a particular record; this is the traditional manner in which databases store data. If a table column does not have many unique values, i.e. it has low cardinality (reference: Wikipedia, StackOverflow), then column storage allows for more compact storage.
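As a toy illustration of the row vs column distinction (all names and values here are just the movie example from this page), consider the same three records laid out both ways:

```python
# Illustrative only: the same three records laid out row-wise vs column-wise.
records = [
    {"viewer": "you",         "movie": "Batman-movie"},
    {"viewer": "you",         "movie": "Superman-movie"},
    {"viewer": "your-friend", "movie": "Batman-movie"},
]

# Row format: one record's fields sit together; fetching a whole record is cheap.
row_store = records

# Column format: each column's values sit together.
col_store = {
    "viewer": [r["viewer"] for r in records],
    "movie":  [r["movie"] for r in records],
}

# Low-cardinality columns compress well, e.g. with dictionary encoding:
movies = col_store["movie"]
dictionary = sorted(set(movies))                 # ['Batman-movie', 'Superman-movie']
encoded = [dictionary.index(m) for m in movies]  # [0, 1, 0] -- far more compact
```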
Relational database, or RDBMS
See the article about relational databases on Wikipedia. The "relational" adjective is used because the data is stored within the database in a manner that brings out the relations between different data pieces. Many relational database systems use Structured Query Language, or SQL, for adding and querying relational data in the database. Using the example of users and movies in the previous section, note that the data contains "Batman-movie" twice even though there is just one such movie, which is watched twice. One way to make this relation explicit is to not store the individual data as-is, but instead make three "database tables": viewer, movie and movie-view, with the "viewer" table containing 2 entries, you and your-friend; the "movie" table containing 2 entries, Batman-movie and Superman-movie; and the "movie-view" table containing 3 entries, (viewer-1, movie-1), (viewer-1, movie-2) and (viewer-2, movie-1). This design is better able to bring out the relationship between the various data elements. For example, it conveys the idea that the Batman-movie watched both by you and by your-friend is the same movie.
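A hedged sketch of this three-table design, using Python's built-in sqlite3 module (table names use underscores, e.g. movie_view, since hyphens aren't valid SQL identifiers):

```python
import sqlite3

# Minimal sketch of the viewer/movie/movie-view design from the example above.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE viewer     (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE movie      (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE movie_view (viewer_id INTEGER REFERENCES viewer(id),
                             movie_id  INTEGER REFERENCES movie(id));
""")
con.executemany("INSERT INTO viewer(id, name) VALUES (?, ?)",
                [(1, "you"), (2, "your-friend")])
con.executemany("INSERT INTO movie(id, name) VALUES (?, ?)",
                [(1, "Batman-movie"), (2, "Superman-movie")])
con.executemany("INSERT INTO movie_view(viewer_id, movie_id) VALUES (?, ?)",
                [(1, 1), (1, 2), (2, 1)])
con.commit()
# "Batman-movie" is stored exactly once; both views of it point at movie id 1.
```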
In addition to capturing and making visible the relationship between different data pieces, a relational structure also reduces data duplication. For example, if it is realized later that the "Batman-movie" is better identified as a "Batgirl-movie", then updating the data requires touching only one row in one table. Had the data been stored as-is and not in relational form, making this change would have required going through the entire database. Capturing data relations also allows the data model to be easily extended with new relations. For example, let's say we want to capture the release date and the lead actors of a movie. The "movie" table can be extended to include a release year; for the actor data, a new "actor" table can be made containing the actor names, linked to the movies they are in, and this can be done without touching the "movie", "viewer" or "movie-view" tables. The later discussion on database normalization covers concepts that help identify when a new column should be added to a table vs when a new table should be made.
Along with these advantages, storing data relationally also comes with the disadvantage that it can hurt performance. Continuing the above example, let's say that our business model is that the user provides us with some "movie-view-id", and we return details about the movie and the viewer. Every time this call is made, the database will first look in the "movie-view" table to get the entry for the provided "movie-view-id", which will be something like (viewer-1, movie-1). However, the user doesn't understand what "viewer-1" or "movie-1" means. So the database then goes to the "viewer" table to get the information for "viewer-1", which is you, and to the "movie" table to get the information for "movie-1", which is Batman-movie. Combining these data pieces, the user can now be told that the entry they are looking for is (you, Batman-movie). Contrast this with the case where the data isn't normalized and there is just one table with duplicated (viewer, movie) entries: it would have required just 1 lookup to get the required information, compared to making 3 lookups and then "joining" the data pieces together. That being said, note that relational databases make extensive use of indexes to speed up joins (see the section on database performance).
Non-relational database, or NoSQL database
See the articles about non-relational databases on Wikipedia, MongoDB and AWS. In the previous discussion on relational databases, it was identified that storing data in a relational manner promotes data integrity and reduces data duplication (see here), but it comes at a cost in performance (see here). Non-relational databases take the opposite route, prioritizing performance over relations. Although most articles on the web further decompose non-relational databases into four subtypes, i.e. key-value stores, wide-column or column-family stores, document stores and graph stores, I believe that a non-relational database is best seen simply as a key-value store. If the value becomes an XML or JSON document containing multiple fields, then the non-relational database becomes a document-oriented database. If, instead of one "document", the database allows for multiple documents with similar structure that can be identified based on a certain value, then it becomes a column-family store. Take a key-value pair non-relational database and also store the relations between any 2 values, and it becomes a graph database (Reference: Neo4j docs). At the end of the day, each non-relational database requires you to give it a key, for which it will give a "value" back.
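As a loose mental model only (plain Python dictionaries standing in for real databases), the subtypes can all be seen as variations on "key in, value out":

```python
import json

# Illustrative only: each NoSQL subtype still answers "give me a key, get a value".

# Plain key-value store:
kv = {"user:42": "you"}

# Document store: the value is a structured document.
docs = {"user:42": json.dumps({"name": "you", "watched": ["Batman-movie"]})}

# Column-family store (very loosely): a key maps to a family of named columns.
column_family = {"user:42": {"name": "you", "last-movie": "Batman-movie"}}

# Graph store: values plus relations between them.
graph = {
    "nodes": {"you": {}, "Batman-movie": {}},
    "edges": [("you", "WATCHED", "Batman-movie")],
}
```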
On a theoretical level, each kind of non-relational database can be seen as a key-value store. In practice, however, each of them is tuned to solving a specific business use case. Thus, using a graph database where a document database is needed, or vice versa, is not recommended. Each database also comes with special utilities that make it preferable over the others for a particular business need. Since the primary use case for a non-relational database is to achieve very fast stores and queries of key-value-like data that is added in high volume and at a very high rate, non-relational databases are also designed to scale out horizontally easily and quickly. Similar to relational databases, a non-relational database can also be "partitioned", so that the key-value pairs in a particular range are stored and queried in a particular shard; doing so improves the database performance.
Cache
See the articles about caches on Wikipedia and AWS. A cache is a hardware or software component that stores data so that future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation or a copy of data stored elsewhere. An important precaution when using a cache is to ensure that stale data is not sent to the user in a manner the user does not expect. Hardware components include the L1 cache, L2 cache, etc. (reference: here and here). Software components include code which checks for data in memory before checking on the hard disk drive (HDD), or storing highly reused data on a faster solid-state drive (SSD) while keeping archived and infrequently accessed data on the HDD.
Caches can be set up anywhere there is a temporal correlation in responses. A cache speeds up overall processing by returning a previous copy of the response rather than re-computing the response from scratch. A "cache hit" occurs when the requested data can be found in the cache, while a "cache miss" occurs when it cannot. For the cache to improve the overall response, its size should be of the order of the request rate multiplied by the temporal correlation period of the responses. Hence, for a highly repeating request, even a small cache is sufficient to notice improvements, but for a high request rate or rarely repeating requests, even a large cache wouldn't help. As the cache gets full, it starts to remove old (i.e. least recently used) or rarely used (i.e. least frequently used) data on a "cache miss" so that new data can be stored, with the expectation that the new data will be requested again. The process of removing data from the cache is called "eviction". Reuse of cached data defines its "hit rate". Optimizing a cache involves analyzing the way the cache is used to ensure a high hit rate along with a low eviction rate (so that the cache delivers its benefit rather than items getting constantly evicted and then re-loaded). A cache can also be configured as a distributed store wherein each cache node returns data corresponding to a particular "shard" or "partition" of the entire dataset.
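A minimal sketch of a least-recently-used (LRU) cache, one common eviction policy mentioned above; the LRUCache name and two-item capacity are made up for illustration:

```python
from collections import OrderedDict

class LRUCache:
    """A minimal least-recently-used cache; capacity and names are illustrative."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None                      # cache miss
        self.items.move_to_end(key)          # mark as most recently used
        return self.items[key]               # cache hit

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict the least recently used item

cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")          # "a" is now the most recently used
cache.put("c", 3)       # evicts "b", the least recently used
print(cache.get("b"))   # None -> cache miss
```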
Various types of caches can be added at different stages of request processing, e.g. to store data pulled from the database, to store the user profile for a corresponding cookie, or even at the front end to reuse previous responses. As an in-memory key-value store, a cache can be used as a NoSQL database (reference); one such example is Redis. On the other hand, H2, HSQL, Apache Derby and SQLite are some SQL database examples that can run in-memory.
Relational vs non-relational database
With the database categories named "relational" and "non-relational", it seems very easy to make the reductive argument that if relations are needed, then a relational database should be used, and that non-relational ones are for semi-structured or unstructured data because they don't enforce relations! But what if someone uses an Oracle database, which is a relational database, and makes a table containing 2 columns, one with a numeric "id" and the other with XML or JSON text? With this design, the Oracle database, a relational database, has now been made to store semi-structured information! On the other hand, note that a graph database also stores relations just like relational databases do, yet it is a non-relational database! Another example is using a non-relational database like Cassandra, a column-family store, and programmatically storing data within it in a normalized form! Conversely, an Oracle database can be used to store denormalized data, and the database will not somehow stop working! So, what's the difference between the two? To me, the difference is not about "what is in one that is not in the other", but instead "what one can do better, out of the box, with less work from developers, than the other", which further relates to "which database suits the use case better". If your business requires good data integrity and an ability to expand database tables as the business grows and the data model gets richer and more complex, go with a relational database. If your business is getting hosed with data, and the user just expects you to save it, then a key-value store is a good way to proceed. Another important question to ask is whether any business process will need a multi-table transaction with ACID guarantees; this capability is provided by relational databases. If using distributed databases, which is generally the case when using non-relational databases, then ensure that the business requirements do not break because of CAP restrictions. One more note: this Medium post describes how replication works for PostgreSQL. I would highly suggest reading it to gain an understanding of the related terms discussed above.
Database vs File system
Consider the following question: in the above database example, rather than using a database, why don't I simply make a file system directory called "database", and within it store 3 files: a "watch-1" file containing (you, Batman-movie), a "watch-2" file containing (you, Superman-movie), and a "watch-3" file containing (your-friend, Batman-movie)? With this alternate design, all user data is persisted without needing to pay for database software! So, is there any advantage to having a database over a file system? To me, the best answers for database vs file system are in these two StackOverflow posts: here and here. A huge benefit provided by database software is that rather than creating a huge number of small files, which would eventually drag down OS and file-system performance, a database enables saving millions of records without hurting file-system performance. It also provides additional services that can be used out of the box, like transactions with ACID guarantees, creation and automatic update of indexes to improve search performance, performance optimization by using system memory to store frequently used data, and relatively easier management of the database schema. For these additional and important services, it may be preferable to use database software rather than relying on the file system alone.
Primary storage database vs secondary storage database
As discussed in the earlier sections on primary storage and secondary storage, the decision to use one versus the other is an exercise in cost-benefit analysis. Similar logic applies when deciding between primary or secondary storage for a database. In-memory databases and caches are very fast, but for the same price they hold much less data than databases using secondary storage. Hence, if the business revenue gained from serving data fast surpasses the expense of an in-memory database, then it becomes worthwhile to invest in one. One use case for an in-memory-only database is functional testing, where a database on secondary storage would simply slow down setting up the data for each test, retrieving data during the test, and clearing the data after each test.
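A hedged sketch of the testing use case, assuming Python's sqlite3 and unittest; the ":memory:" connection string gives each test a fast, throwaway in-memory database:

```python
import sqlite3
import unittest

class MovieStoreTest(unittest.TestCase):
    """Each test gets a fresh in-memory database; nothing touches the disk."""
    def setUp(self):
        self.con = sqlite3.connect(":memory:")
        self.con.execute("CREATE TABLE movie (id INTEGER PRIMARY KEY, name TEXT)")

    def test_insert_and_read(self):
        self.con.execute("INSERT INTO movie(name) VALUES ('Batman-movie')")
        name = self.con.execute("SELECT name FROM movie").fetchone()[0]
        self.assertEqual(name, "Batman-movie")

    def tearDown(self):
        self.con.close()   # the data vanishes with the connection

if __name__ == "__main__":
    unittest.main()
```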
Database design concepts
Database schema
The first step towards database design is to identify the structure of the information to be held in the database, also called the conceptual data model. Using the business requirements as input, this step identifies the real-world entities with which the business will interact, and the relationships and constraints among/within those entities. Domain-driven design is applied at this stage. In the next step, the conceptual data model entities are developed further and the model is either normalized or denormalized depending on the business needs. The individual properties associated with each entity are identified, and the relations and constraints among/within entities are defined in terms of these columns. This is the logical data model. Finally, database-specific commands are used to create tables, thereby realizing the logical data model. The database schema is this final structure that mirrors the logical data model and is described in a formal language supported by the database. At this point, the database is ready to receive user data.
Database normalization
See the article about database normalization on Wikipedia. Database normalization is the process of structuring the data in a manner that brings out the various relations amongst the data pieces, eliminates data redundancy, and improves data integrity. Bringing out the relations helps the team (developers, testers, designers and managers alike) verify that the database design realizes the conceptual schema. Reducing data redundancy is important so that if the data is changed, there is just one unique place where the change needs to be made. If there is data redundancy, then all the other places where the data is stored need to be synced as soon as a change is made in any one place, and this is cumbersome and error-prone. While there are use cases for a denormalized design, it is generally the case that a normalized design should be used. Also, it is good practice to start with a normalized design, confirm that it matches the expected conceptual model, and only then denormalize it to meet the business requirements.
The Wikipedia article is a great reference for understanding normalization, and particularly helpful is its example section showing how to achieve the different normal forms. Rather than repeating the same information, and not doing a good job of it in a misguided attempt to shorten this important and lengthy discussion, I strongly encourage readers to refer to the article. Here, I would like to provide an alternate understanding of the normal forms. I have not seen the normal forms presented in this manner anywhere else, so there's a slight risk that I may be incorrect, or maybe I just don't know of a proper reference. It does sound correct to me, but I ask readers to exercise caution. So, here it is: using the example in the Wikipedia article, the application of 1NF can be seen as defining a "data" row in a database as something that should map to a single real-world object and not have multiplicity within it. If some kind of data multiplicity needs to be modeled, it should instead be done using a foreign-key relation. For every real-world object, some features can be identified that make it unique. For example, the name of this book, "Productionizing backend development", is unique. But what if I write a new edition of the same book? Then the combination of book name and edition is unique. What defines a real-world object as unique depends on the object and how it is used in the context of the business application, but there will always be some unique aspect to the data. 2NF can be seen as implying that the other columns of a given database table row must relate to the whole unique combination. So, if (name, edition) defines the uniqueness of a book, then every column in the book table other than name and edition must relate to the (name, edition) value, and not just to the name or to the edition alone. 3NF can be seen as making 2NF stricter by implying that all other columns of a database table row that are not part of the unique combination must relate only to the unique combination and not to any other column. Thus, except for the columns in the unique combination, the other columns of a row can take any possible value. This helps in better identifying the features of the real-world object that make it unique and different from others. From 4NF onwards, the focus moves to the unique combination itself. 4NF can be seen as requiring that the unique combination cannot be decomposed further into independent tables. 5NF makes this stricter and requires that the columns in the unique combination have no relations among themselves; each column of the unique combination can take any possible value. As mentioned in the Wikipedia article, a database table in 5NF is one that is truly "normalized". Going to 6NF is based on the realization that every column can itself be turned into a 1-column table (along with an extra id column in each table), with the data-record table simply containing foreign keys to these extremely slim tables. For a record-based design, as is always the case with OLTP requests, this creates unnecessary performance overhead. I agree with the article's suggestion of going up to 5NF, i.e. the fifth normal form, when designing a database schema.
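To ground the book example above, here is a hedged SQLite sketch: (name, edition) is the unique combination, the other column must depend on that whole combination, and author multiplicity moves into its own table via a foreign key rather than a multi-valued column. The page_count column and sample values are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE book (
        id         INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        edition    INTEGER NOT NULL,
        page_count INTEGER,            -- depends on the whole (name, edition) pair (2NF/3NF)
        UNIQUE (name, edition)         -- the unique real-world combination
    );
    -- Multiple authors per book: modeled via a foreign key, not a
    -- comma-separated column inside book (1NF).
    CREATE TABLE book_author (
        book_id INTEGER REFERENCES book(id),
        author  TEXT NOT NULL
    );
""")
con.execute("INSERT INTO book(name, edition, page_count) VALUES (?, ?, ?)",
            ("Productionizing backend development", 1, 250))
```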
To emphasize an important point that was only mentioned in passing above: when designing your tables, always seek to identify the combination of fields in a record that makes it unique. It may not always be possible, but such cases are very rare; much more often than not, it should be possible to do so. Since real-world data almost always has a unique reason for being created, having a unique combination in a data record is a way of acknowledging the cause for creating the data. This also aligns with one of the main goals of database normalization, which is to remove data redundancy. An additional benefit is that it provides stability under concurrent request processing: if multiple requests come to the server at the same time to add the same data, then only the first of those requests will be processed. Any subsequent ones will fail because the database will report a violation of the uniqueness constraint, since a record with the intended data already exists in the database.
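A small sketch of that concurrency benefit, again with sqlite3: once the uniqueness constraint exists, a second attempt to insert the same record fails with an integrity error:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE book (name TEXT, edition INTEGER, UNIQUE(name, edition))")
con.execute("INSERT INTO book VALUES ('Productionizing backend development', 1)")
try:
    # A second, concurrent request tries to add the very same record...
    con.execute("INSERT INTO book VALUES ('Productionizing backend development', 1)")
except sqlite3.IntegrityError as e:
    print("duplicate rejected:", e)   # ...and the database reports the violation
```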
Database de-normalization
Reasoning similar to that presented earlier for choosing between a relational and a non-relational database also applies when considering a normalized vs a denormalized database design. My suggestion is to always start with a normalized design, to confirm that it matches the expected conceptual model. A normalized design will involve various join operations, which hurt performance; but also note that departing from the normalized design requires extra effort in the software to ensure that data integrity is maintained, and that hurts performance too. Denormalizing is the process of weighing these two competing choices and concluding that, for the business cases of interest, it is the better compromise to break from the normal structure and fine-tune the database table design to those specific business requirements only. That said, note that more often than not, businesses should use a normalized design rather than a denormalized one.
Transaction
A database transaction refers to a logical unit of work, made up of one or more operations, performed on one or more tables in a database. The operations in a transaction can include only reading data from the database, or making changes to database entries, or both read and write operations. A database transaction, by definition, must be atomic (it must either complete in its entirety or have no effect whatsoever), consistent (it must conform to existing constraints in the database), isolated (it must not affect other transactions) and durable (it must get written to persistent storage, and if there is any outage after the transaction, the modifications applied during the transaction must not get undone). These four properties are commonly referred to using the acronym ACID. The atomicity and durability properties of a transaction enable data reliability in recovering from failures, i.e. if the database stops working and is then restarted, the data in the database will represent each transaction as either applied or not applied in its entirety. Atomicity and durability are primarily enabled via the use of a write-ahead log. The isolation property enables a reliable outcome when read and/or write operations are performed concurrently on the database. That being said, fully concurrency-safe processing is slow, so the SQL standard itself allows for different isolation levels with corresponding performance behavior, with uncommitted reads being very quick but allowing unsafe data reads, and serializable processing being the slowest and strongest among the isolation levels. Note that the default behavior of almost all databases is to NOT perform transactions with the strongest isolation guarantee. Within a transaction, isolation is achieved by using read/write locks at the table, page or row level; it can also be simulated on a database record by using either optimistic locking or pessimistic locking depending on the use case (locking references: a StackOverflow post, a blog post).
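A minimal sqlite3 sketch of atomicity: the debit and credit below are meant to apply together, and when a simulated failure interrupts the transaction, the database rolls the partial change back (account names and amounts are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
con.execute("INSERT INTO account VALUES ('alice', 100), ('bob', 0)")
con.commit()

try:
    with con:  # sqlite3 wraps this block in a transaction: commit or rollback
        con.execute("UPDATE account SET balance = balance - 50 WHERE name = 'alice'")
        # Simulate a crash before the matching credit to bob lands:
        raise RuntimeError("simulated crash mid-transaction")
except RuntimeError:
    pass

# The debit was rolled back along with the rest of the transaction:
print(con.execute("SELECT balance FROM account WHERE name = 'alice'").fetchone())
# -> (100,)
```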
A really good reference for understanding how ACID guarantees are preserved for a transaction is here; the link was itself obtained from this StackOverflow post. A "transaction manager" is the central component that enables transaction processing. A database may itself use a transaction manager to allow changes to be executed as a transaction. Some web applications may also use a different type of transaction manager to coordinate a distributed transaction, which can include one or more databases, message queues, or any other components capable of transactions. These distributed transaction managers leverage mechanisms like 2-phase commit, or the more robust 3-phase commit, to ensure that the corresponding changes are applied in all components involved in the distributed transaction.
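A toy sketch of the shape of 2-phase commit (the coordinator polls every participant in a prepare phase, then commits or rolls back everywhere); real transaction managers additionally handle timeouts, retries and crash recovery, none of which appear here:

```python
class Participant:
    """A made-up stand-in for one database/queue in a distributed transaction."""
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit = name, can_commit
    def prepare(self):
        return self.can_commit        # phase 1: vote yes or no
    def commit(self):
        print(self.name, "committed")
    def rollback(self):
        print(self.name, "rolled back")

def two_phase_commit(participants):
    # Phase 1 (prepare): every participant must vote yes.
    if all(p.prepare() for p in participants):
        for p in participants:        # Phase 2: commit everywhere
            p.commit()
    else:
        for p in participants:        # Phase 2: roll back everywhere
            p.rollback()

# One participant votes no, so the whole distributed transaction rolls back:
two_phase_commit([Participant("orders-db"),
                  Participant("inventory-db", can_commit=False)])
```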
Heads up: the comments in this paragraph are something that I stand by; however, they are not something most articles on the internet will agree with, so please take them with extra skepticism. Note that the above paragraphs mention ACID as a property of a transaction and not of a database. This viewpoint means that both relational and non-relational databases can have transactions that satisfy ACID constraints. To me, it is not that a non-relational database cannot have ACID transactions; rather, it is the scope of the ACID behavior in a transaction performed by a non-relational database that is very different and reduced compared to the transactions done in a relational database. A relational database can provide atomicity spanning multiple rows in multiple tables. However, since non-relational databases can effectively be viewed as key-value stores (discussed above), the best atomicity that a transaction on a non-relational database can guarantee is over a single key! Since every entry and every table in a non-relational database is designed to be independent, the concept of "consistency" does not exist; there are no relations between any two tables that need to be checked. Also, for each key a single value is stored (which can be a complex structured object, but is still one unit from the perspective of the non-relational database), so there is no concept of a consistency check for that value either. Isolation is a tricky topic even for relational databases, which provide varying degrees of it. It comes down to how the database, relational or non-relational, handles concurrent read and write operations; but there is nothing preventing a non-relational database from providing strong isolation guarantees. Non-relational databases can guarantee durability by persisting data, at least in the write-ahead log. Hence my point that even non-relational databases can have transactions with ACID properties, because, as I said earlier, a transaction must, by definition, have the ACID properties; it is just the scope of the ACID behavior that differs between the two. This idea is also discussed in this StackOverflow post. All the above being said, I notice two big sources of confusion in most online discussions of non-relational databases and transactions. First, the discussions mix the concept of the ACID properties of a transaction with the CAP theorem (discussed next), which, to me, is not a correct path to follow; because the two concepts repeatedly get mixed when discussing non-relational databases, this question is revisited in the discussion on the CAP theorem. Second, a transaction must not be confused with a batch operation. A batch operation is one where a sequence of commands is issued to the database that may be executed across multiple transactions. I've noticed that non-relational database providers generally offer batch operations (for example: Cassandra and MongoDB). Batch operations are designed to look similar to the multi-change transactions of relational databases, and they are also aggressively marketed as behaving like them. However, this similarity is not exact and depends on how the non-relational database is designed. Hence, the interchangeable use of batch commands as the equivalent of relational database transactions muddies the concept of a transaction as a logical unit of work having ACID behavior.
CAP theorem
The CAP theorem, also known as Brewer's theorem, states that it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees: Consistency (every read receives the most recent write or an error; note that this is NOT the "C" in the ACID properties of a transaction, as mentioned in this reference), Availability (every request receives a non-error response, though the response may not correspond to the most recent write), and Partition tolerance (the system continues to operate despite an arbitrary number of messages being dropped or delayed by the network between nodes). Since in real-world systems partition failures are rare, the design question generally comes down to identifying whether sacrificing consistency or availability is in the best interest of the business use case.
As discussed previously, the primary use case for non-relational databases is quick storage and retrieval of high-volume data arriving at a very high rate. For this reason, the data is stored as key-value pairs, and a non-relational database can effectively be viewed as a key-value store (discussed above). To accommodate this business use case, non-relational databases are designed so that they can be scaled out horizontally in a very quick and easy manner; and it is because of this "distributed" nature of non-relational databases that they get subjected to the CAP theorem. What is unfortunate, though, is that because non-relational databases are generally used in settings that subject them to CAP constraints, it has become the norm to say that non-relational databases, by design, are constrained by the CAP theorem, which is wrong! If a non-relational database is operated on just one machine, then it is not a distributed database, and so it is not subject to the CAP theorem. What is even more unfortunate is that, for some reason, it gets argued that since non-relational databases are subject to the CAP theorem, they don't provide ACID guarantees, which is a totally different thing, at least to me (see above). On the other hand, even relational databases can become subject to CAP constraints if several of them are connected together as a distributed data store. While distributed transaction managers do their best to ensure that a distributed transaction spanning multiple relational databases is executed using the 2-phase commit protocol, these distributed transactions have no partition tolerance, and it is hard to recover from failures if one or more databases are unavailable. Distributed transactions are also slow to execute and don't match the business use case for non-relational databases, i.e. handling very high-velocity data and being robust enough to respond even if some nodes fail. Hence, non-relational databases do not use 2-phase or 3-phase commit, and instead rely on alternative protocols like Paxos and Raft for conflict resolution and for forming consensus on the state of the data. This article on Medium covers the difference between 2-phase commit and Paxos, and how they apply to different scenarios. Finally, I want to mention this StackOverflow answer as one of the few posts I've seen that clarifies the confusion between ACID and CAP!
Database performance
As records are added to a database, both read and write performance can degrade over time. The most common way to improve read performance is by adding an index over a column of a database table. To enable storing large amounts of data while keeping read and write operations efficient, a database table can be "sharded" or "partitioned" into smaller units, each dealing with a smaller range of the table's data. Like bucket sort, sharding improves read performance because only a smaller range of data needs to be searched to identify the record of interest. For example: if the movie name starts with A-N, search in database table partition #1, else search in partition #2. So, just from the movie name, we know that the entry for "Batman-movie" (watched by you and, later, by your friend) is in partition #1, and the entry for the Superman-movie watched by you is in partition #2. It is also possible to add a "replica" of the entire database, so that read queries can be handled by different replicas, reducing the overall query load seen by any one server. Replicas also provide a fail-safe against data loss if for any reason the data in the database is corrupted.
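A sketch of the range-based routing rule from the example above; the partition_for function is hypothetical, and real databases more commonly hash the key instead:

```python
# Route a record to a partition by the first letter of the movie name:
# A-N goes to partition 1, everything else to partition 2.
def partition_for(movie_name):
    return 1 if movie_name[0].upper() <= "N" else 2

print(partition_for("Batman-movie"))    # 1
print(partition_for("Superman-movie"))  # 2
```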
In addition to tables, columns, constraints and relations, the database schema defines many other terms. One important term used extensively in backend development is an "index". A database index is a data structure, separate from the tables, that is constructed from the values of one or more columns across all rows. It improves the speed of data retrieval operations on a database table by identifying the row(s) containing a particular value of the indexed column(s). Since it is a separate structure, it costs additional space, and it also needs to be updated every time a row is added or deleted. Indexes are particularly useful if the business application has more reads than writes. In relational databases, an index is automatically made for the primary key because it helps speed up joining different relations. Note that while the use of indexes is standard in relational databases to achieve a performance boost, their use among non-relational databases is not consistent; in distributed stores like Cassandra or MongoDB, a secondary index often behaves more like a "grouping" of rows than a true search structure (Reference: StackOverflow). As discussed previously, all non-relational databases can be thought of as key-value stores: their primary use case is to be given a particular key and return the corresponding value, which is what allows them to store a large amount of data at a high throughput rate. That being said, a specific non-relational database implementation may choose to provide custom utilities, one of them being the ability to create an index over the content of the values. Most likely this is because the value (for a given key) is in a semi-structured form that allows creating indexes; or because the database is designed for use cases where the total count of individual keys won't be so large as to preclude having an indexing data structure; or because the column being indexed will not have a lot of unique values. Even when a database implementation allows creating an index, it is strongly suggested to rethink whether an index should be made at all, because for the volume and throughput of data that a non-relational database typically sees, doing so can quickly show up as degraded performance.
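A small sqlite3 sketch of creating and verifying an index (the idx_movie_name name is made up); the query plan should show the lookup using the index instead of a full table scan:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE movie (id INTEGER PRIMARY KEY, name TEXT)")
con.executemany("INSERT INTO movie(name) VALUES (?)",
                [("Batman-movie",), ("Superman-movie",)])
# The index over movie.name is a separate structure maintained alongside the table:
con.execute("CREATE INDEX idx_movie_name ON movie(name)")

# EXPLAIN QUERY PLAN reveals how SQLite intends to run the query:
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM movie WHERE name = ?", ("Batman-movie",)
).fetchall()
print(plan)  # the plan's detail column should mention idx_movie_name
```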