NoSQL is a term that has become ubiquitous in recent years. But what does "NoSQL" actually mean? How and why is it useful? In this article, we'll answer these questions by creating a toy NoSQL database in pure Python (or, as I like to call it, "slightly structured pseudo-code").
OldSQL
To most, SQL is synonymous with "database". SQL, an acronym for Structured Query Language, is not a database technology itself, however. Rather, it describes the language by which one retrieves data from a RDBMS, or Relational Database Management System. MySQL, PostgreSQL, MS SQL Server, and Oracle are all examples of RDBMSs.
The word "Relational" in the acronym RDBMS is the most informative. Data
is organized into tables, each with a set series of columns with an
associated type. The description of all tables, their columns, and the
columns' types are referred to as the database's schema. The schema completely
describes the structure of the database, with a description of each table.
For example, a Car
table may have the following columns:
- Make: a string
- Model: a string
- Year: a four-digit number; alternatively, a date
- Color: a string
- VIN (Vehicle Identification Number): a string
A single entry in a table is called a row, or record. To distinguish one
record from another, a primary key is usually defined. The primary key for a
table is one of its columns (or a combination thereof) that uniquely identifies
each row. In the Car
table, VIN is a natural choice to be the table's primary key as it
is guaranteed to be unique between cars. Two rows may share the exact same
values for Make, Model, Year, and Color but refer to different cars, meaning
they would have different VINs. If two rows have the same VIN, we don't even
have to check the other columns, they must refer to the same car.
Querying
SQL lets us query this database to gain useful information. To query simply means to ask questions of the RDBMS in a structured language and interpret the rows it returns as the answer. Imagine the database represents all vehicles registered in the US. To get all records, we could write the following SQL query against the database:
1 |
|
A translation of the SQL into plain English might be:
- "SELECT": "Show me"
- "Make, Model": "the value of Make and Model"
- "FROM Car": "for each row in the Car table"
Or, "Show me the value of Make and Model for each row in the Car table". We'd get back a list of results, each with Make and Model. If we cared only about the color of cars from 1994, we could say:
1 |
|
In this case, we'd get back a list like
Black
Red
Red
White
Blue
Black
White
Yellow
Lastly, using the table's primary key, we could look up a specific car by looking up a VIN:
1 |
|
That would give us the specific properties of that car.
Primary keys are defined to be unique. That is, a specific car with a specific VIN must only appear in the table at most once. Why is that important? Let's look at an example:
Relations
Imagine we are running an auto repair business. Among other things, we need to keep track
of a vehicle's service history: the record of all repairs and tune ups we've
performed on that car. We might create a ServiceHistory
table with the
following columns:
- VIN
- Make
- Model
- Year
- Color
- Service Performed
- Mechanic
- Price
- Date
Thus, each time a car comes in to get serviced, we add a new row to the table with all of the car's information along with what we did to it, who the mechanic was, how much it cost, and when the service was performed.
But wait. All of the columns related to the car itself are always the same for the same car. That is, if I bring in my Black 2014 Lexus RX 350 10 times for service, I'll need to record the Make, Model, Year, and Color each time, even though they won't change. Rather than repeat all of that information, it makes more sense to store it once and look it up when necessary.
How would we do this? We'd create a second table: Vehicle
, with the following
columns:
- VIN
- Make
- Model
- Year
- Color
For the ServiceHistory
table, we now want to trim down to the following
columns:
- VIN
- Service Performed
- Mechanic
- Price
- Date
Why does VIN appear in both tables? Because we need a way to specify that this
vehicle in the ServiceHistory
table refers to that vehicle in the Vehicle
table. This way, we only have to store information about a particular car once.
Each time it comes in for repair, we create a new row in the ServiceHistory
table but not the Vehicle
table; it's the same car, after all.
We can also issue queries that span the implicit relationship between Vehicle
and ServiceHistory
:
1 |
|
This query seeks to determine the Model and Year for all cars where the repair costs were greater than $75.00. Notice that we specify that the way to match rows from the Vehicle table to rows in the ServiceHistory table is to match up the VIN values. What it gives us back is a set of rows with the columns of both tables. We refine this by saying we only want the "Model" and "Year" columns of the "Vehicle" table.
If our database has no indexes (or, more correctly, indices), the query above would need to perform a table scan to locate rows that match our query. Table scans are an inspection of each row in the table in sequence and are notoriously slow. Indeed, they represent the slowest possible method of query execution.
Table scans can be avoided through the use of an index on a column or set of columns.
Think of indices as data structures that allow us to find a particular value
(or range of values) in the indexed column very quickly by pre-sorting the
values. That is, if we had an index on the Price column, instead of looking
through all rows one-at-a-time to determine if the price was greater
than 75.00
, we could simply use the information contained in the index
to "jump" to the first row with a price greater than 75.00
and return every
subsequent row (which would have a price at least as high as 75.00
, since the
index is ordered).
When dealing with non-trivial amounts of data, indices become an indispensable tool for improving query speed. Like all things, however, they come at a cost: the index's data structure consumes memory that could otherwise be used to store more data in the database. It's a trade off that one must examine in each individual case, but it's very common to index frequently queried columns.
The Clear Box
Advanced features like indices are possible due to the database's ability to inspect a table's schema (the description of what type of data each column holds) and make rational decisions based on the data. That is, to a database, a table is the opposite of a "black box" (a clear box?).
Keep this fact in mind when we talk about NoSQL databases. It becomes an important part of the discussion regarding the ability to query different types of database engines.
Schemas
A table's schema, we've learned, is a description of the names of the columns and the type of data they contain. It also contains information like which columns can be blank, which must be unique, and all other constraints on column values. A table may only have one schema at any given time and all rows in the table must conform to the schema.
This is an important restriction. Imagine you have a database table with millions of rows of customer information. Your sales team would like to begin capturing an additional piece of data (say, the user's age) to increase the precision of their email marketing algorithm. This requires you to alter the table by adding a column. You'll also need to decide if each row in the table needs a value for this column. Often times, it makes sense to make a column required, but doing so would require information we simply don't have access to (like the age of every user already in the database). Therefore, trade offs are often made in this regard.
In addition, making schema changes to very large database tables is rarely a simple matter. Having a rollback plan in case something goes wrong is important, but it's not always possible to undo a schema change once it's been made. Schema maintenance is probably one of the most difficult parts of the job of a DBA.
Key/Value Stores
Far before the term "NoSQL" existed, Key/Value Data Stores like memcached
provided storage for data without the overhead of a table schema. Indeed, in K/V
stores, there are no "tables" at all. There are simply keys and values.
If a key/value store sounds familiar, that's because it's built upon the same
principles as Python's dict
and set
classes: using hash tables to provide
quick key-based access to data. The most primitive Python-based NoSQL database
would simply be a big dictionary.
To understand how they work, let's write one ourselves! We'll start with a very simple design:
- A Python
dict
as the primary data store - Only support strings as keys
- Support for storing integers, strings, and lists
- A simple TCP/IP server that uses ASCII strings for messaging
- Slightly advanced commands like
INCREMENT
,DELETE
,APPEND
, andSTATS
The benefit of building the data store with an ASCII-based TCP/IP interface is
that we can use the simple telnet
program to interact with our server; no
special client is needed (though writing one would be a good exercise and can be
done in about 15 lines).
We need a "wire format" for the messages we send to the server and for the responses it sends back. Here's a loose specification:
Commands Supported
- PUT
- Arguments: Key, Value
- Purpose: Insert a new entry into the data store
- GET
- Arguments: Key
- Purpose: Retrieve a stored value from the data store
- PUTLIST
- Arguments: Key, Value
- Purpose: Insert a new list entry into the data store
- GETLIST
- Arguments: Key
- Purpose: Retrieve a stored list from the data store
- APPEND
- Arguments: Key, Value
- Purpose: Add an element to an existing list in the data store
- INCREMENT
- Arguments: Key
- Purpose: Increment the value of an integer value in the data store
- DELETE
- Arguments: Key
- Purpose: Delete an entry from the data store
- STATS
- Arguments: N/A
- Purpose: Request statistics on how many successful/unsuccessful executions of each command were executed
Now let's define the message structure itself.
Message Structure
Request Messages
A Request Message consists of a command, a key, a value, and a value type. The
last three are optional depending on the message. A ;
is used as a delimiter.
There must always be three ;
characters in the message, even if some optional
fields are not included.
COMMAND;[KEY];[VALUE];[VALUE TYPE]
- COMMAND is a command from the list above
- KEY is a string to be used as a data store key (optional)
- VALUE is a integer, list, or string to be stored in the data store (optional)
- Lists are represented as a comma-separated series of strings, e.g. "red,green,blue"
- VALUE TYPE describes what type VALUE should be interpreted as
- Possible values: INT, STRING, LIST
Examples
-
"PUT;foo;1;INT"
-
"GET;foo;;"
-
"PUTLIST;bar;a,b,c;LIST"
-
"APPEND;bar;d;STRING
-
"GETLIST;bar;;"
-
"STATS;;;"
-
"INCREMENT;foo;;"
-
"DELETE;foo;;"
Response Messages
A Response Message consists of two parts, separated by a ;
. The first part is always True|False
based on whether the command was successful. The second part is the command message.
On errors, this will describe the error. On successful commands that don't
expect a value to be returned (like PUT
), this will be a success message. For
commands that expect a value to be returned (like GET
), this will be the value
itself.
Examples
-
"True;Key [foo] set to [1]"
-
"True;1"
-
"True;Key [bar] set to [['a', 'b', 'c']]"
-
"True;Key [bar] had value [d] appended"
-
"True;['a', 'b', 'c', 'd']
-
"True;{'PUTLIST': {'success': 1, 'error': 0}, 'STATS': {'success': 0, 'error': 0}, 'INCREMENT': {'success': 0, 'error': 0}, 'GET': {'success': 0, 'error': 0}, 'PUT': {'success': 0, 'error': 0}, 'GETLIST': {'success': 1, 'error': 0}, 'APPEND': {'success': 1, 'error': 0}, 'DELETE': {'success': 0, 'error': 0}}"
Show Me The Code!
I'll present the code in digestible chunks. The entire server clocks in at just under 180 lines of code, so it's a quick read.
Set Up
Below is a lot of the boilerplate setup code required for our server:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
|
Not much to see here, just an import and some initialization of data.
Set Up (Cont'd)
I'll now skip a bit of code so that I can show the rest of the setup code. Note that it refers to functions that don't exist yet. That's fine, since I'm jumping around. In the full version (presented at the end), everything is in the proper order. Here's the rest of the setup code:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 |
|
We've created what's commonly referred to as a look-up table called COMMAND_HANDLERS
. It works by
associating the name of the command with the function used to handle commands of
that type. So, for example, if we get a GET
command, saying
COMMAND_HANDLERS[command](key)
is the same as saying handle_get(key)
.
Remember, functions can be treated as values and can be stored in a dict
like
any other value.
In the code above, I decided to handle each group of commands requiring the same
number of arguments separately. I could have simply forced all handle_
functions to accept a key
and value
, I just decided this was made the
handler functions more clear, easier to test, and was less error prone.
Note that the socket code is minimal. Though our entire server is based on TCP/IP communication, there's really not much interaction with low-level networking code.
The last thing to note is so innocuous you might have missed it: the DATA
dictionary. This is where we'll actually store the key-value pairs that make up
our database.
Command Parser
Let's take a look at the command parser, responsible for making sense of incoming messages:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
|
Here we can see type conversion occurring. If the value is meant to be a list,
we know we can create the proper value by calling str.split(',')
on the
string. For an int
, we simply make use of the fact that int()
can take
strings. Ditto for strings and str()
.
Command Handlers
Below is the code for the command handlers. They are all quite straight-forward and (hopefully) look as you would expect. Notice there's a good deal of error checking, but it's certainly not exhaustive. As you're reading, try to find error cases the code misses and post them in the discussion.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 |
|
Two things to take note of: the use of multiple assignment and code re-use. A
number of functions are simply wrappers around existing functions with a bit
more logic, like handle_get
and handle_getlist
for example. Since we are
occasionally just sending back the results of an existing function and other
times inspecting what that function returned, multiple assignment is used.
Look at handle_append
. If we try to call handle_get
and the key doesn't
exist, we can simply return exactly what handle_get
returned. Thus, we'd like
to be able to refer to the tuple returned by handle_get
as a single return
value. That lets us simply say return return_value
if the key does not exist.
If it does exist, we need to inspect the value that was returned. Thus,
we'd also like to refer to the return value of handle_get
as separate
variables. To handle both the case above and the case where we need to handle
the results separately, we use multiple assignment. This gives us the best of
both worlds without requiring multiple lines where our purpose is unclear.
return_value = exists, list_value = handle_get(key)
makes it explicit that we're
going to be referring to the value returned by handle_get
in at least two
different ways.
How Is This a Database?
The program above is certainly not an RDBMS, but it definitely qualifies as a NoSQL database. The reason it was so easy to create is because we don't have any real interaction with the data. We do minimal type checking, but otherwise just store whatever the user sends. If we needed to store more structured data, we'd likely need to create a schema for the database and refer to it while storing and retrieving data.
So if NoSQL databases are easier to write, easier to maintain, and easier to reason about, why don't we all just run mongoDB instances and be done with it? There is, of course, a trade off for all this data flexibility that NoSQL databases afford us: searchability.
Querying Data
Imagine we used our NoSQL database above to store the car data from earlier. We
might store them using the VIN as the key and a list of values as each column
value, i.e. 2134AFGER245267 = ['Lexus', 'RX350', 2013, Black]
. Of course,
we've lost the meaning of each index in the list. We just have to remember
somewhere that index one stores the Make of the car and index two stores the
Year.
Worse, what happens when we want to run some of the queries from earlier? To
find the colors of all cars from Year 1994 becomes a nightmare. We have to go
through every value in DATA
, somehow determine if the value is storing car
data or something else entirely, look at index two, then take the value of index
three if index two is equal to 1994. That's much worse than a table scan, since
it not only scans every row in the data store but needs to apply a somewhat
sophisticated set of rules to answer the query.
The authors of NoSQL databases are aware of these issues, of course, and (since querying is generally a useful feature) have come up with a number of ways to make queries possible. One way is to structure the data using, say, JSON and allow for references to other rows to represent relationships. Also, most NoSQL databases have some concept of namespaces, where data of a single type can be stored in it's own "section" of the database, allowing a query engine to make use of the fact that it knows the "shape" of the data being queried.
Of course, far more sophisticated approaches exist (and are implemented) to increase queryability, but there will always exist a trade off between storing schema-less data and queryability. Our database, for example, only supports querying by key. Things would get a lot more complex if we needed to support a richer set of queries.
Summary
Hopefully, by now it's clear what "NoSQL" means. We learned a bit of SQL and how RDBMSs work. We saw how data is retrieved from an RDBMS (using SQL queries). We built a toy NoSQL database to examine the trade offs between queryability and simplicity. We also discussed a few ways database authors deal with this.
The topic of databases, even of simple key-value stores, is incredibly deep. We've merely scratched the surface here. Hopefully however, you learned a bit about what NoSQL means, how it works, and when it's a good idea to use. You can continue the conversation at Chat For Smart People, the discussion board for this site.
Posted on by Jeff Knupp