To write this chapter, the following sources were used:
This part is partially inspired by (Sadalage and Fowler 2012, Chap. 1), but it has been further updated.
When you write a database application, you have two options:
The first option can severely impact the efficiency of your database: since maintaining the integrity of the database is a requirement, a lot of synchronization is needed, and your database becomes a bottleneck. With the second option, you develop an “application database” (i.e., a database dedicated to a particular application), and you have more freedom in the design, schema, and even DBMS (you can use one particular software solution for one particular database application, and a different one for a different database application).
The increase in everything (traffic, size of data, number of clients, etc.) meant “up or out”, and raised numerous challenges for the “one database for multiple applications” option. There were two ways to increase the resources and to scale up:
The second option was generally less expensive (compare buying 1,000 Raspberry Pis vs. buying 1 supercomputer that is not a cluster of more modest computers), but came with two drawbacks w.r.t. databases:
Developing DBMS better suited for distributed architectures became increasingly important, and some companies took a stab at it. The most important attempts were
These were solutions suited to the very specific needs of those big companies. But it was interesting to see SQL’s supremacy being questioned.
One of the goals was to get rid of the “impedance mismatch”: mapping classes or objects to database tables defined by a relational schema is complex and cumbersome. However, if you want your database applications to go naturally from their data representation to the representations in the DBMS, solving this issue becomes critical. Among the issues,
The term “impedance mismatch” describes that annoying need for a translation, and one of the goals of this first shift was to get rid of it.
Also, the data is now moving, growing fast, and extremely diverse, and traditional relational DBMS seemed not necessarily well-suited to handle those changes.
To renew the world of DBMS, there were multiple attempts, going in multiple directions. A meetup to discuss them coined the term “NoSQL” in an attempt to have a “twittable” hashtag, and it stuck (even if it is about as specific as describing a dog as “not being a cat”). The original meetup asked for “open-source, distributed, nonrelational databases”. Today, there is no “official” definition of NoSQL, but NoSQL often implies the following:
Another important notion that emerged was the notion of “polyglot persistence”, which is the idea of “using different data storage technologies to handle varying data storage needs.” In other terms, if you adopt the “application database” approach (i.e., one database dedicated to one particular application), then you can use DBMS A for your application 1, and DBMS B for your application 2, or even use A and B for the same application!
There was a lot of enthusiasm, also because this approach “frees the data” (and, actually, the metadata, cf. application/ld+json, JavaScript Object Notation for Linked Data, schema.org, etc.): sharing e.g. a JSON file is much easier than sharing a SQL view along with its schema (the example in the Document-Oriented Database section will make it clearer).
Some of it will last for sure: polyglot persistence, the possibility of being schema-less, being “distributed first”, the possibility of sacrificing consistency for a greater good, etc. This does not mean that SQL (“OldSQL”) and relational databases are over: they are still useful in many scenarios, and the powerful query language is great (writing your own every time is a nightmare…).
Starting around 2010, one reaction was to develop “NewSQL”, which would combine aspects of both approaches. For instance, having to drop the ACID requirements was often seen as a major drawback, and MongoDB announced that it would support more and more of the ACID properties!
Also, a really great use of NoSQL is to adopt it at an early stage of development, when it is not clear what the schemas should be. When the schemas are final, you can shift to a relational DBMS!
The retro-acronym “Not Only SQL” emphasizes that SQL will remain one of the principal actors, but that developers should be aware of other solutions for other needs.
SQL and the NoSQL approach can be compared in many different ways. Note that there is no “best tool”: it would be like trying to decide if a hammer is better than a saw; the answer is “it depends on what you want to do with it!”. But you can use one relational or non-relational DBMS for different purposes, sometimes, again, within the same application (“polyglot persistence”).
« Comparaison n’est pas raison »
A French proverb, meaning that “things should be judged on the individual qualities they possess, rather than by comparing one with another.” (Manser 2007)
ACID, BASE, and CAP are three acronyms: ACID and BASE capture desirable features of DBMS, while CAP is a theorem stating that some desirable properties cannot all hold at the same time in a distributed system.
ACID is the guarantee of validity even in the event of errors, power failures, etc.
CAP (a.k.a. Brewer’s theorem): Roughly, “In a distributed system, one has to choose between consistency (every read receives the most recent write or an error) and availability (every request receives a (non-error) response, without the guarantee that it contains the most recent write)” (the P standing for “Partition tolerance”: the system keeps operating even if messages between nodes are lost or delayed).
BASE (also formulated by Brewer) corresponds to Basically Available, Soft state, Eventual consistency. It is a series of properties that can be achieved by distributed systems, including NoSQL systems, and is often seen as “NoSQL’s version of ACID”. This answer, for instance, gives some insight into its meaning.
There are multiple ways to be “non-relational”. A rough hierarchy of the different approaches can be sketched as follows.
Model | Description | Examples |
---|---|---|
Document-based | Data is stored as “documents” (JSON, for instance), accessible via their ID (or other indexes). | Apache CouchDB (simple for web applications, and reliable), MongoDB (easy to operate), Couchbase (high concurrency, and high availability). |
Key-value stores | Fast access by the key to the value. The value can be a record, an object, a document, or be more complex. | Redis (an in-memory database with on-disk persistence: it stores everything in RAM!) |
Column-based (a.k.a. wide column) | Partitions a table by columns into column families, where each column family is stored in its own files. | Cassandra, HBase (both for huge amounts of data) |
Graph-based | Data is represented as graphs, and related nodes can be found by traversing the edges using path expressions. | Neo4J (excellent for pattern recognition, and data mining) |
Multi-model | Supports multiple data models. | Apache Ignite, ArangoDB, etc. |
MongoDB is
MongoDB is endowed with
Note that while the design of your database becomes a “second-class citizen”, as you can start manipulating data before a schema has been defined, this does not mean that design has become irrelevant. General design principles still need to be adopted, and everything that was said about design remains true. The key difficulty is that there is no foreign key in MongoDB, or at least no constraints attached to the relationships two documents can have, except for the ones you implement yourself. This is generally considered to be a downside in terms of consistency, and an advantage in terms of flexibility and scalability.
MongoDB is vulnerable to injection attacks similar to SQL injection (cf. Zanon) and should respect the same general guidelines as discussed in A Bit About Security (cf. MongoDB).
An additional challenge is that, since JOIN operations need to be performed “by hand” in the application program (cf. w3schools), your attack surface grows.
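As a minimal sketch of what “by hand” means (using the command-line syntax introduced later in this chapter; the customers and orders collections and their fields are made up for the example), emulating a join requires issuing two queries and gluing the results together yourself. Nothing checks that the reference actually points to an existing customer:

// hypothetical collections: "customer" in orders is a reference we maintain ourselves
var customer = db.customers.findOne({"name": "Martin"})
db.orders.find({"customer": customer._id})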
Let us start by detailing what a “document” is. There are multiple different implementations and definitions of what a document is, but at the core of all of them are the following:
Among the formats of documents, there are XML
, YAML
, JSON
(JavaScript Object Notation), PDF
, etc. You can generally convert from one format to the others, which is an important feature.
An example of an XML (Extensible Markup Language) document, storing information about what Martin and Pramod like, which cities they visited, etc.:
<?xml version="1.0" encoding="UTF-8"?>
<!-- code/xml/person.xml -->
<root>
  <element>
    <firstname>Martin</firstname>
    <lastVisited>Paris</lastVisited>
    <lastcity>Boston</lastcity>
    <likes>
      <element>Biking</element>
      <element>Photography</element>
    </likes>
  </element>
  <element>
    <firstname>Pramod</firstname>
    <lastcity>Chicago</lastcity>
    <addresses>
      <element>
        <city>DILLINGHAM</city>
        <state>AK</state>
      </element>
      <element>
        <city>PUNE</city>
        <state>MH</state>
      </element>
    </addresses>
    <citiesvisited>
      <element>Chicago</element>
      <element>London</element>
      <element>Pune</element>
      <element>Bangalore</element>
    </citiesvisited>
  </element>
</root>
As you can see, from this document:

- Each element (person) contains different information: we know the first name of both, but not the address of Martin, nor the lastVisited of Pramod.
- Some information is given as a list of elements (the addresses), but their order does not matter.
- There is no pre-defined schema: tags such as shiporder or item are made-up!

A more detailed example, including the design of a schema, can be found at w3schools.
The kind of document MongoDB uses is called BSON
(portmanteau of the words “binary” and “JSON
”), and it actually extends JSON
. Think of BSON
as a binary representation of JSON
documents.
MongoDB is a document-oriented database (document store), which means that the databases contain semi-structured data. It is a subclass of the key-value store:
RDB is excellent for optimization, but sometimes wastes space (placeholders for optional values) and is sometimes too rigid. KV does not allow any optimization, but provides flexibility and more closely follows modern programming concepts. DO has the flexibility of KV, and allows for some optimization.
One important difference: in RDB, data is stored in separate tables, and a single object (entity) may be spread across several tables. In DO, one object = one instance, and every stored object can be different from every other. There are pros to this approach:
Let us start by mapping the common notions of RDBMS to the MongoDB ecosystem:
RDBMS | MongoDB |
---|---|
database instance | MongoDB instance |
schema | database |
table | collection |
row | document |
Each MongoDB instance has multiple databases, each database can have multiple collections.
Our previous XML “person” example can be converted into two documents, enclosed in […] (which delimits an array of documents).

We actually had to use three documents: as JSON does not really have comments (cf. StackOverflow), we added a document containing only the attribute "_comment" to specify the path where that file is located.
[ { "_comment": "code/json/person.json" }, { "firstname": "Martin", "likes": [ "Biking", "Photography" ], "lastcity": "Boston", "lastVisited": "Paris" }, { "firstname": "Pramod", "citiesvisited": [ "Chicago", "London", "Pune", "Bangalore" ], "addresses": [ { "state": "AK", "city": "DILLINGHAM" }, { "state": "MH", "city": "PUNE" } ], "lastcity": "Chicago" } ]
Note that addresses is itself an array of documents!

A collection should contain “related” entities (do not store server logs, customers, and the list of employees in the same collection!), and not too abstract ones (no “Server stuff”). Also, if you store documents that are too different, your performance will take a big hit. Bottom line: think about your usage, and the kind of queries you will perform.
So, in summary, “Schema-less” does not mean “organization-less”!
The instructions are only for Linux, but should be easy to adapt.
Download and install mongodb
from MongoDB, select the “server” and “shell” packages.
As root, type
mkdir /tmp/mongotest
mongod --dbpath /tmp/mongotest
to start the server and create a “dummy” database in the folder /tmp/mongotest
.
Then, open another terminal and, as a normal user, type mongo.
The documentation is nicely written and well-organized: we’ll follow parts of it, please refer to it if needed. You can start by opening the “Getting started” tutorial and running its examples on your own installation.
The syntax for the command-line interface can be found at MongoDB. As a first approximation, the syntax is of the form:
db.<name of the collection>.<command>(<arguments>)
where db is not the name of the database: it is just a prefix.
To get information about your installation, use:

- show dbs to see the databases,
- use mydb to use the mydb database,
- show collections to see the collections in a particular database.
To insert, use:
db.books.insert({"title": "Mother Night", "author": "Kurt Blabal"})
MongoDB will add a unique identifier (_id
) if you do not provide one. You can think of that as a primary key.
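For instance (a made-up example: any value can serve as the identifier, as long as it is unique within the collection), you can also provide the _id yourself; trying to insert a second document with the same _id will then be rejected with a duplicate key error, which is consistent with the primary-key analogy:

db.books.insert({"_id": "slaughterhouse-five", "title": "Slaughterhouse-Five"})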
db.books.remove({"title":"Mother Night"})
db.books.update({"title":"Mother Night"}, {$set: {"quantity" : 10}})
Other functions, such as $inc (to increment), can be used.
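For instance, assuming the book above is still in the collection, the following small sketch increases its quantity by 5:

db.books.update({"title":"Mother Night"}, {$inc: {"quantity": 5}})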
- db.books.find() is like SELECT * FROM Books;
- db.books.find({"title":"Mother Night"}) and db.books.find({"title":"Mother Night"}, {"author":1, "quantity":1}) both search for the book with title “Mother Night”, and the second query displays only the author and quantity attributes.
- db.books.find({"title":"Mother Night"}, {"author":0, "quantity":0}) displays all the attributes, except the author and the quantity.
- db.books.find({"quantity":{"$gte": 10, "$lt": 50}}) displays the entries where the quantity is greater than or equal to 10, and less than 50.
It is possible to mimic some features of SQL (like unique attributes), but there is no referential integrity, for instance.
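For example, a unique index can mimic a UNIQUE constraint on the title attribute (a sketch: whether titles should really be unique is a design decision for your data):

db.books.createIndex({"title": 1}, {"unique": true})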
Most insert / update / delete operations will return success as soon as one node has received your command, but you can tweak them so that success is returned only once the operation has been performed on a majority of the nodes.
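For instance, in the shell, a write concern can be attached to an individual operation; the following sketch (the title is made up) asks for acknowledgement from a majority of the replica set members — on our single-node test installation, this changes nothing visible:

// wait until a majority of nodes have applied the insert before reporting success
db.books.insert({"title": "Jailbird"}, {"writeConcern": {"w": "majority"}})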
MongoDB does not offer as many features as e.g. MySQL, and a lot needs to be written on the program side. However, you can find a lot of APIs (i.e., it takes the “package manager” approach of offering modular software), cf. for instance an API over mongo-java-driver: Jongo (which supports some form of prepared statements).
This section will follow MongoDB’s “quick tour” of the Java API, as discussed at GitHub.
You will need to:
Place those two files, mongo-java-driver-3.9.1.jar
and QuickTour.java
in the same folder, and run
java -cp .:mongo-java-driver-3.9.1.jar QuickTour.java
You should see a large number of lines displayed at the screen, and around the top, the message INFO: Opened connection [connectionId{localValue:2, serverValue:12}] to localhost:27017
.
After various import statements, the program creates a MongoClient object called mongoClient, and connects it to the local database server:
MongoClient mongoClient;
if (args.length == 0) {
    // connect to the local database server
    mongoClient = MongoClients.create();
} else {
    mongoClient = MongoClients.create(args[0]);
}
To get a database and a collection, the program uses:
// get handle to "mydb" database
MongoDatabase database = mongoClient.getDatabase("mydb");

// get a handle to the "test" collection
MongoCollection<Document> collection = database.getCollection("test");
Note that a collection can simply be thought of as a list of documents.
Assume we want to create the following document:
{ "name": "MongoDB", "type": "database", "count": 1, "info": { "x": 203, "y": 102 } }
(Remember: order does not matter!)
Then we can use the Document class to create it, and then insert it:
// make a document and insert it
Document doc = new Document("name", "MongoDB")
        .append("type", "database")
        .append("count", 1)
        .append("info", new Document("x", 203).append("y", 102));

collection.insertOne(doc);
Note that we can “chain” the append
, using doc.append("type", "database").append("count", 1);
etc.
Only at this point are the database and the collection actually created.
To “witness” what the program is doing from the command line, you can, for instance,
Edit the Java program by commenting out the statement database.drop();,
Execute the modified version,
Open the command-line-interface (simply type mongo
), and run:
use mydb
show collections
db.test.find()
This last command should return something like
{ "_id" : ObjectId("5ea72152d8b5777d53c1a148"), "name" : "MongoDB", "type" : "database", "count" : 1, "info" : { "x" : 203, "y" : 102 } }
The program goes on and is discussed in detail at GitHub. You can see, for instance, that to construct lists of documents and insert them, one can use:
// now, lets add lots of little documents to the collection so we can explore queries and
// cursors
List<Document> documents = new ArrayList<Document>();
for (int i = 0; i < 100; i++) {
    documents.add(new Document("i", i));
}
collection.insertMany(documents);
A discipline similar to what we saw on Java applications interacting with MySQL should apply:
We can summarize some of the principles we have learned, and introduce some new, as follows:
What is polyglot persistence? Is it useful?
What does it mean to be “schemaless”? What does it imply?
What is denormalization? When could it be useful?
What is the object-relational impedance mismatch? Is it an issue that cannot be overcome?
For each of the following notions, indicate if they are usually an attribute of NoSQL or of “traditional” SQL:
| | Schema First | Distributed | Relational | Scalable |
|---|---|---|---|---|
| NoSQL | | | | |
| SQL | | | | |
| | Schema First | Distributed | Relational | Scalable |
|---|---|---|---|---|
| NoSQL | | ✓ | | ✓ |
| SQL | ✓ | | ✓ | |
“NoSQL” used to mean “Non SQL”, but was retroactively given the meaning “Not Only SQL.” Below, write a short essay that explains:
Consider the following xml file:
<?xml version="1.0" encoding="UTF-8"?> <!-- code/xml/sustomers.xml --> <Customers> <Customers> <Customer Name="Pamela Zave" ID="C001"> <Orders> <Order Date="2012-07-04T00:00:00" ID="10248"> <Product Quantity="5" ID="10"> <Description>A Box of Cereal</Description> <Brand>Cereal Company</Brand> <Price>$3</Price> </Product> <Product Quantity="10" ID="43"> <Description>A Box of Matches</Description> <Brand>Match Ltd</Brand> <Price>$1.20</Price> <Caution>Not suitable for children</Caution> </Product> </Order> </Orders> <Address>123 Main St., Augusta, GA, 30904</Address> </Customer> <Customer Name="Nancy Lynch" ID="C002"> <Orders> <Order Date="2011-07-04T00:00:00" ID="10245"> <Product Quantity="3" ID="10"> <Description>A Box of Cereal</Description> <Brand>Cereal Company</Brand> <Price>$3</Price> </Product> <Product Quantity="1" ID="5"> <Description>A Cup</Description> <Brand>Cup Company</Brand> <Price>$2</Price> <Material>Stoneware</Material> </Product> </Order> </Orders> <Address> Address line 5, 6, 7</Address> </Customer> <Customer Name="Shafi Goldwasser" ID="C003"> <Address>345 Second St., Augusta, GA, 30904</Address> </Customer> </Customers> </Customers>
Try to draw the ER model that would correspond to the relational implementation of this database. Justify your choices.
Find below a mashup of actual data from the National Science Foundation (courtesy of NSF):
<?xml version="1.0" encoding="UTF-8"?> <!-- code/xml/NSFAward.xml --> <rootTag> <Award> <AwardTitle>CAREER: Advances in Graph Learning and Inference</AwardTitle> <AwardEffectiveDate>11/01/2019</AwardEffectiveDate> <AwardExpirationDate>01/31/2023</AwardExpirationDate> <AwardAmount>105091</AwardAmount> <Organization> <Code>05010000</Code> <Directorate> <Abbreviation>CSE</Abbreviation> <LongName>Direct For Computer & Info Scie & Enginr</LongName> </Directorate> <Division> <Abbreviation>CCF</Abbreviation> <LongName>Division of Computing and Communication Foundations</LongName> </Division> </Organization> <ProgramOfficer> <SignBlockName>Phillip Regalia</SignBlockName> </ProgramOfficer> <AwardID>2005804</AwardID> <Investigator> <FirstName>Patrick</FirstName> <LastName>Hopkins</LastName> <EmailAddress>phopkins@virginia.edu</EmailAddress> <StartDate>11/22/2019</StartDate> <EndDate /> <RoleCode>Co-Principal Investigator</RoleCode> </Investigator> <Investigator> <FirstName>Jon</FirstName> <LastName>Ihlefeld</LastName> <EmailAddress>jfi4n@virginia.edu</EmailAddress> <StartDate>11/22/2019</StartDate> <EndDate /> <RoleCode>Principal Investigator</RoleCode> </Investigator> <Institution> <Name>University of Virginia Main Campus</Name> <CityName>CHARLOTTESVILLE</CityName> <ZipCode>229044195</ZipCode> <PhoneNumber>4349244270</PhoneNumber> <StreetAddress>P.O. BOX 400195</StreetAddress> <CountryName>United States</CountryName> <StateName>Virginia</StateName> <StateCode>VA</StateCode> </Institution> </Award> </rootTag>
It contains information about one particular award that was awarded to an institution on behalf of two researchers. Quoting the National Science Foundation (NSF):
NSF is divided into the following seven directorates that support science and engineering research and education:…. Each is headed by an assistant director and each is further subdivided into divisions like …
From this xml file and the information given above, draw an ER diagram for NSF’s awards. Do not hesitate to comment on the choices you are making and on what justifies them.
It should be clear that three entities are present in this file: Customer, Order, and Product. An order can contain a certain quantity of a product, and a customer can place 0 or more orders. Some attributes are natural primary keys (they are named “ID” in the diagram below), and some attributes seem to be optional (“Caution”, or “Material”), but should still be made attributes.
Put together, this gives the following diagram:
We made further assumptions: an order cannot be empty (captured by the total participation constraint on CONTAINS), and an order does not exist if it was not placed by a customer (captured by the fact that ORDER is a weak entity), which also implies that an order cannot be placed by more than one customer. Note that the same product cannot appear “twice” (with equal or different quantities) in an order: an order can contain a particular product only once, in any quantity; so if an order had two units of a product A and three units of the same product A, those two quantities should be merged so that the order contains five units of product A. This is enforced by the cardinality ratio of 1 in the CONTAINS relationship.
Of course, other choices are possible.
Two entities are easy to distinguish: RESEARCHER (for “Investigator”) and INSTITUTION. The status of the content between the <Organization> tags is less clear; apparently, an organization has a code, and is made of two parts: a Directorate and a Division. Using the quote, we know that a Division should be part of exactly one Directorate, and that a Directorate has an assistant director. But what is the status of that “Organization”? Is it subsumed by the Directorate or is it orthogonal? We decide to create an entity for it, but its precise role should be clarified. The relationship between Division and Directorate is clear, but, once again, the relationship between Division and Organization could have any constraint; we cannot really infer that information from the document.
The next difficulty is the status of the award itself: should it be a relationship with many attributes, between the RESEARCHER and INSTITUTION entities? The issue with this approach is that an award can have multiple investigators, as shown in the example, and that this number can vary. Hence, fixing the arity and constraints on this relationship will be difficult. We could have a relation of arity 2
, and “duplicate it” if multiple researchers are involved in the same grant, but that seems like a poor choice (since all the information about the grant will need to be duplicated). Therefore, it seems more reasonable to make the award an entity.
How should we connect the AWARD entity with the RESEARCHER and INSTITUTION entities? A ternary relation has some drawbacks, since it would require some duplication when multiple investigators are working on the same award. Instead, having one binary relationship between the award and the institution, and one binary relationship between the award and the researcher (that specifies further the role of the researcher for that particular award), seems like a safer choice. An award must be awarded to at least one researcher and one institution, but we do not know if there is a maximum number of institutions that can obtain the same award, so it is better not to restrict this arity. Whether there should be a relationship between the researcher and the institution is up in the air; we do not know if a researcher has to work for an institution to get a grant, nor if getting a grant for an institution means that you work for it, so it is probably better to refrain from adding such a relationship.
Most of the attributes are straightforward once we see that “Role” is an attribute of a relationship, not of an entity.
All together, this gives the following diagram: