Presentation of NoSQL
Resources
To write this chapter, were used
- (Sadalage and Fowler 2012),
- Wikipedia NoSQL,
- (Sullivan 2015),
- (Elmasri and Navathe 2015, Chapter 24),
- (Pavlo and Aslett 2016)
- (Lakshman and Malik 2010) and
- (Ellis 2013)
A Bit of History
This part is partially inspired from (Sadalage and Fowler 2012, Chap. 1), but it has been further updated.
Database Applications and Application Databases
When you write a database application, you have two options:
- One database for multiple applications,
- One database for each application.
The first option can cause severe impacts on the efficiency of your database: since maintening the integrity of the database is a requirement, a lot of synchronization is needed, and your database becomes a bottleneck. With the second option, you develop an “application database” (i.e., a database dedicated to a particular application), and you have more freedom in the design, schema, and even DBMS (you can use one particular software solution for one particular database application, and a different one for a different database application).
Clusters, Clusters…
The increase in everything (traffic, size of data, number of clients, etc.) meant “up or out”, and raised numerous challenges for the “one database for multiple application” option. There was two ways to increase the resources and to scale up:
- Bigger machines,
- More machines.
The second option was generally less expensive (compare buying 1,000 raspberry pi VS buying 1 supercomputer that is not a cluster of more modest computers), but came with two drawbacks w.r.t. databases:
- The cost of licences was excessive (indeed, you had to buy one licence per computer),
- and it forced to perform “unnatural acts”: relational model are really not made to be distributed.
A First Shift
Developping DBMS more suited for distributed architectures became growingly important, and some comanies took at stab at it. The more important attemts were
- Google Big Table, 2004 (made public in … 2015!) (Chang et al. 2006)
- Amazon DynamoDB, 2004 (used in Simple Storage Service (S3) in 2007)
- Facebook’s Cassandra is sometimes mentioned, but it came later on, around 2009 (Lakshman and Malik 2009).
It was solutions suited to the needs of those big companies, that were very specific. But it was interresting to see SQL’s supremacy being questionned.
One of the goal was to get rid of “impedance mismatch”: mapping classes or objects to database tables defined by a relational schema is complex and cumbersome. However, if you want your database application to go naturally from their data representation to the representations in the DBMS, solving this issue becomes critical. Among the issues,
- There is no absolute notion of “private” and “public” in RDBMS (relative to needs),
- There are many differences in the data-type (no pointer, weird way of defining string, etc.),
- The values in a relational structure have to be simple (no complex datatype, no structure).
The term “impedance mismatch” describes that annoying need for a translation, and one of the goal of this first shift was to get rid of it.
Also, the data is now moving, growing fast, extremely diverse, and traditional relational DBMS seemed not necessarily wel-suited to hande those changes.
Gathering Forces
To renew the world of DBMS, there were multiple attempts, going in multiple directions. A meetup to discuss them coined the term “NoSQL” in an attempt to have a “twittable” hashtag, and it stayed (even it is as specific as describing a dog as “not being a cat”). The original meet-up asked for “open-source, distributed, nonrelational database”. Today, there is no “official” definition of NoSQL, but NoSQL often implies the following:
- No relational model,
- Not using SQL. Some still have a query language, and it ressembles SQL (to minimize learning cost), for instance Cassandra’s CQL.,
- Run well on clusters,
- Schemaless: you can add records without having to define a change in the structure first,
- Open source.
Another important notion that emerged was the notion of “polyglot persistence”, which is the idea of “using different data storage technologies to handle varying data storage needs.” In other terms, if you adopt the “application database” approach (i.e., one database dedicated to one particular application), the you can use the DBMS A for your application 1, and the DBMS B for your application 2, or even use A and B for the same application!
The Future or the Past?
There was a lot of enthusiasm, also because this approach “frees the data” (and, actually, the metadata, cf. application/ld+json, JavaScript Object Notation for Linked Data, schema.org, etc.): sharing e.g. a json file is much easier that sharing a SQL view along with its schema (the example in the Document-Oriented Database will make it clearer).
Some of it will last for sure: polyglot persistency, the possibility of being schema-less, being “distributed first”, the possibility of sacrificing consistency for greater good, etc. This does not mean that SQL (“OldSQL”) and relational database are over: there are still useful in many scenario, and the powerfull query language is great (writing your own every time is a nightmare…).
Starting ~ 2010, one reaction was to develop “NewSQL”, which would combine aspects of both approaches. For instance, having to drop the ACID requirements was often seen as a major drawback, but, for instance, MongoDB announced that it would have more and more of the ACID properties!
Also, a really great use of NoSQL is to adopt it at an early stage of the development, when it is not clear what the schemas should be. When the schemas are final, then you can shift to relational DBMS!
The retro-acronym “Not Only SQL” emphasizes that SQL will still be one of the principal actor, but that developer should be aware of other solutions for other needs.
Comparison
SQL and the NoSQL approach can be compared in many different ways. Note that there is no “best tool”: it would be like trying to decide if a hammer is better than a saw, the answer is “it depends of what you want to do with it!”. But you can use one relational or non-relational DBMS for different purposes, sometimes, again, within the same application (“polyglot persistency”).
Overview
« Comparison n’est pas raison »
A French proverb, meaning that “things should be judged on the individual qualities they posses, rather than by comparing one with another.” (Manser 2007)
- NoSQL
- Semi-structured data (no schema)
- High performance
- Availability
- Data Replication (improves availability and performance)
- Scalability (horizontal scalabality (add nodes) instead of vertical (add memory))
- Eventual Consistency
- Natively versionning
- SQL
- Immediate data consistency
- Powerfull query language (for instance, join is often missing in NoSQL, has to be implemented on the application-side)
- Structured data storage (can be too restrictive)
ACID vs CAP vs BASE
ACID and BASE are three acronyms capturing desirable features of DBMS, while CAP is a theorem stating the impossibility to have some desirable properties at the same time in distributed systems.
ACID is the guarantee of validity even in the event of errors, power failures, etc.
- Atomicity → Transactions are all or nothing
- Consistency → Transactions maintains validity
- Isolation → Executing two transactions in parallel or one after the other would have the same result
- Durability → Once a transaction has been commited, it is stored in non-volatile memory.
CAP (a.k.a. Brewer’s theorem): Roughly, “In a distributed system, one has to choose between consistency (every read receives the most recent write or an error) and availability (every request receives a (non-error) response, without guarantee that it contains the most recent write)” (the P. standing for “Partition tolerance”, a guarantee of availability).
BASE (also formulated by Brewer) corresponds to Basic Availability, Soft state, Eventual consistency. It is a series of properties that can be reached by distributed systems, including NoSQL systems, and is often seen as the “NoSQL’s version of ACID”. This answer for answer, gives some insight on its meaning.
Categories of NoSQL Systems
There are multiple ways to be “non-relational”. A rough hierarchy of the different approaches can be sketched as follows.
Model | Description | Examples |
---|---|---|
Document-based | Data is stored as “documents” (JSON, for instance), accessible via their ID (other indexes). | Apache CouchDB (simble for web applications, and reliable), MongoDB (easy to operate), Couchbase (high concurrency, and high availability). |
Key-value stores | Fast access by the key to the value. Value can be a record, an object, a document, or be more complex. | Redis (in-memory but persistent on disk database, stores everything in the RAM!) |
Column-based (a.k.a. wide column) | Partition a table by colmuns into column families, where each column family is stored in its own files. | Cassandra, HBase (both for huge amount of data) |
Graph-based | Data is represented as graphs, and related nodes can be found by traversing the edges using path expressions. | Neo4J (excellent for pattern recognition, and data mining) |
Multi-model | Support multiple data models | Apache Ignite, ArangoDB, etc. |
MongoDB
Resources
- MongoDB
- MongoDB University
- Wikipedia
- (Sadalage and Fowler 2012, ch. 9)
- (Sullivan 2015, ch. 6)
Introduction
MongoDB is
- Free (i.e., provided at no cost). Their business model leverages training, support, and DB as service. They actually developped MongoDB because they wanted a good solution for a cloud solution!
- Open-source, even if recents changes makes their license not really open source, they share most of their code.
- Cross-platform: their community server, for instance, runs on all major operating systems.
- Document-oriented: it uses a JSON-like documents with schemas, and it is the most popular DBMS using documents (the next ones are Amazon DynamoDB, Couchbase, CouchDB).
Technologies
MongoDB is endowed with
- API and drivers for C, C++, C#, Hadoop Connector, Haskell, Java, node.js, PHP, Perl, Python, Ruby, Scala (Casbah),
- a “mongo shell” (a command-line interface), which is an interactive JavaScript interface to MongoDB. You can try it on-line.`
Design
Note that while the design of your database becomes a “second class citizen”, as you can start manipulating data before a schema has been defined, this does not mean that design became irrelevant. General design principles still needs to be adopted, and everything that was said about design remains true. The key difficulty is that there is no foreign key, in MongoDB, or at least no contraints attached to the relationships two documents can have, except for the one you implement. This is generally considered to be a downside in terms of consistency, and an advantage in terms of flexibility and scalability.
Security
Mongodb is vulnerable to SQL injection (cf. Zanon) and should respect the same general guidelines as discussed in A Bit About Security (cf. MongoDB).
And additional challenge is that e.g. since JOIN
operations need to be performed “by hand”, in the application program (cf. w3schools), your attack surface grows.
Document
Let us start by detailling what a “document” is. There are multiple different implementations and definition of what a document is, but at the core of all of them are the followings:
- Documents encapsulate and encode data (Self-Describing Data),
- Documents do not need to adhere a standard schema (but they can, if you want),
- One program can have many different types of objects, and those objects often have many optional fields.
Among the formats of documents, there is XML
, YAML
, JSON
(JavaScript Object Notation), PDF
, etc. You can generally convert from one format to the others, which is an important feature.
An example of XML (Extensible Markup Languag) document, storing information about what Martin and Pradmod like, which cities they visited, etc.:
<?xml version="1.0" encoding="UTF-8"?> <!-- code/xml/person.xml --> <root> <element> <firstname>Martin</firstname> <lastVisited>Paris</lastVisited> <lastcity>Boston</lastcity> <likes> <element>Biking</element> <element>Photography</element> </likes> </element> <element> <firstname>Pramod</firstname> <lastcity>Chicago</lastcity> <addresses> <element> <city>DILLINGHAM</city> <state>AK</state> </element> <element> <city>PUNE</city> <state>MH</state> </element> </addresses> <citiesvisited> <element>Chicago</element> <element>London</element> <element>Pune</element> <element>Bangalore</element> </citiesvisited> </element> </root>
As you can see, from this document:
- The two
element
(person) contains different information: we know the first name of both, but not the address of Martin, nor thelastVisited
of Pradmod. - Tags can have an internal structure (like
addresses
), but their order does not matter. - Invalid document exists! Imagine if one tag is not properly closed, then the parsing would fail.
- Documents are somehow human and computer-readable.
- There are no or little predefined tags: the
shiporder
oritem
tags are made-up! - Documents are extensible, as one can invent new tags, re-fine the organization inside an item, etc.
A more detailed example, including the design of a schema, can be found at w3schools.
The kind of document MongoDB uses is called BSON
(portmanteau of the words “binary” and “JSON
”), and it actually extends JSON
. Think of BSON
as a binary representation of JSON
documents.
Document-Oriented Database
Mongodb is a document-oriented database (document store), which means that the databases contain semi-structured data. It is a subclass of the key-value store:
- Relational databases (RDB) pre-define the data structure (i.e., the schema) in the database (fields + data type).
- Key-value (KV) treats the data as a single opaque collection, which may have any number (including 0) fields for every record.
- Document-oriented (DO) system relies on internal structure in the data to extract metadata.
RDB is excellent for optimization, but sometimes waste space (placeholders for optional values) and is sometimes too rigid. KV does not allow any optimization, but provides flexibility and follows more closely modern programming concepts. DO has the flexibility of KV, and allows for some optimization.
One important difference: in RDB, data is stored in separate tables, and a single object (entity) may be spread across several tables. In DO, one object = one instance, and every stored object can be different from every other. There are pros to this approach:
- Mapping objects to a DB simpler,
- Change “in place”,
- Increase speed of deployment.
General Organization of MongoDB Databases
Let us start by mapping the common notions of RDBMS to the mongoDB ecosystem:
RDBMS | MongoDB |
---|---|
database instance | MongoDB instance |
schema | database |
table | collection |
row | document |
Each MongoDB instance has multiple databases, each database can have multiple collections.
Our previous XML
“person” example can be converted into two documents delimited by […], used to delimit an array of document.
We actually had to have three documents: asJSON
does not really have comments (cf. StackOverflow), we added a document containing only the attribute"_comment"
to specify the path where that file is located.
[ { "_comment": "code/json/person.json" }, { "firstname": "Martin", "likes": [ "Biking", "Photography" ], "lastcity": "Boston", "lastVisited": "Paris" }, { "firstname": "Pramod", "citiesvisited": [ "Chicago", "London", "Pune", "Bangalore" ], "addresses": [ { "state": "AK", "city": "DILLINGHAM" }, { "state": "MH", "city": "PUNE" } ], "lastcity": "Chicago" } ]
Note that
addresses
is a d document!- Some attributes are common, some are not: that’s fine, every document can have its own schema.
A collection should be on “related” entities (do not store server logs, store customers and list of employee in the same collection!), and not too abstract ones (no “Server stuff”). Also, if you store document that are too different, your performances will take a big hit. Bottom line: think about your usage, and the kind of queries you will perform.
So, in summary, “Schema-less” does not mean “organization-less”!
Set Up
The instructions are only for Linux, but should be easy to adapt.
Download and install
mongodb
from MongoDB, select the “server” and “shell” packages.As root, type
mkdir /tmp/mongotest mongod --dbpath /tmp/mongotest
to start the server and create a “dummy” database in the folder
/tmp/mongotest
.Then, open another terminal, and type in, as a normal user
mongo
.
The documentation is nicely written and well-organized: we’ll follow parts of it, please refer to it if needed. You can start by opening the “Getting started” tutorial and running its examples on your own installation.
First Elements of Syntax
The syntax for the command-line interface can be found at MongoDB. In a first approximation, the syntax is of the form:
db.<name of the collection>.<command>(<arguments>)
Where db
is not the name of the database, it is just the prefix.
To get information about your installation, use
show dbs
to see the databases,use mydb
to use themydb
database,show collections
to see the collections in a particular database,
To insert, use:
db.books.insert({"title": "Mother Night", "author": "Kurt Blabal"})
MongoDB will add a unique identifier (
_id
) if you do not provide one. You can think of that as a primary key.- To remove an entry, use:
db.books.remove({"title":"Mother Night"})
- To update an entry, use:
db.books.update({"title":"Mother Night"}, {$set {"quantity" : 10}})
Other function, such as $inc
, to increment, can be used.
- To select, use:
is like 'SELECT * FROM Books;db.books.find()
both search for the book with title “Mother Night”, and the second query displays only the author and quantity attributes.db.books.find({"title":"Mother Night"}) db.books.find({"title":"Mother Night"}, {"author":1, "quantity":1})
display all the attributes, except the author and the quantity.db.books.find({"title":"Mother Night"}, {"author":0, "quantity":0})
displays the entries were the quantity is greater than equal to 10, and less than 50.db.books.find({"quantity":{"$gte": 10, "$lt": 50}})
It is possibility to mimic some features of SQL
(like the unique attributes), but there are no referential key integrity, for instance.
Most insert / update / delete will return success as soon as one node received your command, but you may tweak them so that success is returned only once the operation has been performed on the majority of the nodes.
Mongodb does not offer as many features as e.g. MySQL, and there is the need to write a lot on the program side. However, you can find a lot of API (i.e., it is taking the “package manager” approach to offer a modular software), cf. for instance an API over mongo-java-driver: Jongo (support some form of prepared statement).
MongoDB Database Program
This section will follow Mongodb’s “quick tour” of the Java api, as discussed at GitHub.
You will need to :
- Make sure mongodb server is up and running as noted in the “Set-up” section,
- Download mongo-java-driver-3.9.1.jar from Sonatype or from Github.
- Download QuickTour.java.
Place those two files, mongo-java-driver-3.9.1.jar
and QuickTour.java
in the same folder, and run
java -cp .:mongo-java-driver-3.9.1.jar QuickTour.java
You should see a large number of lines displayed at the screen, and around the top, the message INFO: Opened connection [connectionId{localValue:2, serverValue:12}] to localhost:27017
.
After various import statement, the program create a MongoClient
object called mongoClient
, and connects it to the local database server:
MongoClient mongoClient; if (args.length == 0) { // connect to the local database server mongoClient = MongoClients.create(); } else { mongoClient = MongoClients.create(args[0]); }
To get a database and a collection, the program uses:
// get handle to "mydb" database MongoDatabase database = mongoClient.getDatabase("mydb"); // get a handle to the "test" collection MongoCollection<Document> collection = database.getCollection("test");
Note that a collection is simply an ArrayList of documents.
Assume we want to create the following document:
{ "name": "MongoDB", "type": "database", "count": 1, "info": { "x": 203, "y": 102 } }
(Remember: order does not matter!)
Then we can use the Document
class to create it, and then insert the document created:
// make a document and insert it Document doc = new Document("name", "MongoDB") .append("type", "database") .append("count", 1) .append("info", new Document("x", 203).append("y", 102)); collection.insertOne(doc);
Note that we can “chain” the append
, using doc.append("type", "database").append("count", 1);
etc.
Only at this point would the database and collection being created.
To “witness” what the program is doing from the command line, you can, for instance,
Edit the Java program, by commenting the statement
database.drop();
.Execute the modified version,
Open the command-line-interface (simply type
mongo
), and run:use mydb show collections db.test.find()
This last command should return something like
{ "_id" : ObjectId("5ea72152d8b5777d53c1a148"), "name" : "MongoDB", "type" : "database", "count" : 1, "info" : { "x" : 203, "y" : 102 } }
The program goes on and is discussed in details at GitHub. You can see for instance that to construct lists of documents and insert them, one can use:
// now, lets add lots of little documents to the collection so we can explore queries and // cursors List<Document> documents = new ArrayList<Document>(); for (int i = 0; i < 100; i++) { documents.add(new Document("i", i)); } collection.insertMany(documents);
A discipline similar to what we saw on Java applications interacting with MySQL should apply:
- read the documentation,
- think about what should be the role of the application, and what should be left to the DBMS (knowing that mongoDB can not do as much as MySQL),
- secure your application,
- think if you need one database per application, or if you want to share a single database accross multiple users and applications.
Principles
We can summarize some of the principles we have learned, and introduce some new, as follows:
- “Schemaless means more responsibility”
- In NoSQL, it is sometimes ok to have a bit of denormalization: duplicating the information, to have everything in the same document, can save time at the price of memory. For instance, you could imagine having a table for phone number, for employee, for emergency contact. You can duplicate that information, it is not going to consume a lot of memory, but will allow you to perform fewer join operations (which are resources expensive), at the price of course of storage.
- As we said, “NoSQL injection” exists, and the same remedies apply: your application should accept only strings from your users (never allow objects by design) and sanitize the inputs before using them (mongo-sanitize or content-filter are good modules for this).
Exercises
- Exercise 7.1
What is polyglot persistence? Is it useful?
- Exercise 7.2
What does it mean to be “schemaless”? What does it imply?
- Exercise 7.3
What is denormalization? When could it be useful?
- Exercise 7.4
What is the object-relational impedance mismatch? Is it an issue that cannot be overcome?
- Exercise 7.5
For each of the following notions, indicate if they are usually an attribute of NoSQL or of “traditional” SQL:
Schema First Distributed Relational Scalable NoSQL SQL
Solution to Exercises
- Solution 7.1
- It is the act of picking the right DBMS for the task and involving multiple DBMS’s in a single application. Yes, it is useful. Per wikipedia, “Polyglot persistence is the concept of using different data storage technologies to handle different data storage needs within a given software application.”
- Solution 7.2
- “Schemaless” means hat a table can contain documents, or tuples, with different attributes. It implies more responsibilities.
- Solution 7.3
- Denormalization is to duplicate data about other entities in some entities. It is useful when joining is expensive.
- Solution 7.4
- Database and object-oriented principles are different and it requires work to make them work together. This correspondance, or matching, can be implemented in the application, or lead to the design of a new DBMS.
- Solution 7.5
Schema First Distributed Relational Scalable NoSQL ✓ ✓ SQL ✓ ✓
Problems
- Problem 7.1 (Explaining NoSQL)
“NoSQL” used to mean “Non SQL”, but was retro-actively given the meaning “Not Only SQL.” Below, write a short essay that explains:
- What motivated the “Non SQL” approach.
- What is the meaning of “Not Only SQL.”
- The benefits and drawbacks of the relational approach.
- Problem 7.2 (ER Diagram from XML File – Customer)
Consider the following xml file:
<?xml version="1.0" encoding="UTF-8"?> <!-- code/xml/sustomers.xml --> <Customers> <Customers> <Customer Name="Pamela Zave" ID="C001"> <Orders> <Order Date="2012-07-04T00:00:00" ID="10248"> <Product Quantity="5" ID="10"> <Description>A Box of Cereal</Description> <Brand>Cereal Company</Brand> <Price>$3</Price> </Product> <Product Quantity="10" ID="43"> <Description>A Box of Matches</Description> <Brand>Match Ltd</Brand> <Price>$1.20</Price> <Caution>Not suitable for children</Caution> </Product> </Order> </Orders> <Address>123 Main St., Augusta, GA, 30904</Address> </Customer> <Customer Name="Nancy Lynch" ID="C002"> <Orders> <Order Date="2011-07-04T00:00:00" ID="10245"> <Product Quantity="3" ID="10"> <Description>A Box of Cereal</Description> <Brand>Cereal Company</Brand> <Price>$3</Price> </Product> <Product Quantity="1" ID="5"> <Description>A Cup</Description> <Brand>Cup Company</Brand> <Price>$2</Price> <Material>Stoneware</Material> </Product> </Order> </Orders> <Address> Address line 5, 6, 7</Address> </Customer> <Customer Name="Shafi Goldwasser" ID="C003"> <Address>345 Second St., Augusta, GA, 30904</Address> </Customer> </Customers> </Customers>
Try to draw the ER model that would correspond to the relational implementation of this database. Justify your choices.
- Problem 7.3 (ER Diagram from XML File – Award)
Find below a mashup of actual data from the National Science Foundation (courtesy of NSF):
<?xml version="1.0" encoding="UTF-8"?> <!-- code/xml/NSFAward.xml --> <rootTag> <Award> <AwardTitle>CAREER: Advances in Graph Learning and Inference</AwardTitle> <AwardEffectiveDate>11/01/2019</AwardEffectiveDate> <AwardExpirationDate>01/31/2023</AwardExpirationDate> <AwardAmount>105091</AwardAmount> <Organization> <Code>05010000</Code> <Directorate> <Abbreviation>CSE</Abbreviation> <LongName>Direct For Computer & Info Scie & Enginr</LongName> </Directorate> <Division> <Abbreviation>CCF</Abbreviation> <LongName>Division of Computing and Communication Foundations</LongName> </Division> </Organization> <ProgramOfficer> <SignBlockName>Phillip Regalia</SignBlockName> </ProgramOfficer> <AwardID>2005804</AwardID> <Investigator> <FirstName>Patrick</FirstName> <LastName>Hopkins</LastName> <EmailAddress>phopkins@virginia.edu</EmailAddress> <StartDate>11/22/2019</StartDate> <EndDate /> <RoleCode>Co-Principal Investigator</RoleCode> </Investigator> <Investigator> <FirstName>Jon</FirstName> <LastName>Ihlefeld</LastName> <EmailAddress>jfi4n@virginia.edu</EmailAddress> <StartDate>11/22/2019</StartDate> <EndDate /> <RoleCode>Principal Investigator</RoleCode> </Investigator> <Institution> <Name>University of Virginia Main Campus</Name> <CityName>CHARLOTTESVILLE</CityName> <ZipCode>229044195</ZipCode> <PhoneNumber>4349244270</PhoneNumber> <StreetAddress>P.O. BOX 400195</StreetAddress> <CountryName>United States</CountryName> <StateName>Virginia</StateName> <StateCode>VA</StateCode> </Institution> </Award> </rootTag>
It contains information about one particular award that was awarded to an institution on behalf of two researchers. Quoting the National Science Foundation (NSF):
NSF is divided into the following seven directorates that support science and engineering research and education:…. Each is headed by an assistant director and each is further subdivided into divisions like …
From this xml file and the information given above, draw an ER diagram for NSF’s awards. Do not hesitate to comment on the choices you are making and on what justifies them.
Solutions to Selected Problems
- Solution to Problem 7.2 (ER Diagram from XML File – Customer)
It should be clear that three entities are present in this file: Customer, Order, and Product. An order can contain a certain quantity of a product, and a customer can pass 0 or more orders. Some attributes are natural primary keys (they are named “ID” in the diagram below), and some attributes seems to be optional (“Caution”, or “Material”), but should still be made an attribute.
Put together, this gives the following diagram:
We made further assumptions: an order cannot be empty (transcribed by the total constraint on CONTAINS), and an order does not exist if it was not passed by a customer (transcribed by the fact that ORDER is a weak entity), which also implies that an order cannot be passed by more than one customer. Note that the same product cannot be present “twice” (with the equal or different quantities) in an order: an order can contain a particular product only once in any quantity, implying that if an order had two of the product A, and three of the same product A, then those two quantities of A should be merged so that an order contains five of this product A. This is enforced by the cardinality ratio of
1
in the CONTAINS relationship.Of course, other choices are possible.
- Solution to Problem 7.3 (ER Diagram from XML File – Award)
Two entities are easy to distinguish: RESEARCHER (for “Investigator”) and INSTITUTION. The status of the the content between the
<Organization>
tags is less clear; apparently, an organization has a code, and is made of two parts: a Directorate and a Division. Using the quote, we know that a Division should be a part of exactly one Directorate, and that a Directorate has an assistant director. But what is the status of that “Organization”? Is it subsumed by the Directorate or is it orthogonal? We decide to create an entity for it, but its precise role should be clarified. The relationship between Division and Directorate is clear, but, once again, the relationship between Division and Organization could have any constraint, we can not really infer that information from the document.The next difficulty is the status of the award itself: should it be a relationship with many attributes, between the RESEARCHER and INSTITUTION entities? The issue with this approach is that an award can have multiple investigators, as shown in the example, and that this number can vary. Hence, fixing the arity and constraints on this relationship will be difficult. We could have a relation of arity
2
, and “duplicate it” if multiple researchers are involved in the same grant, but that seems like a poor choice (since all the information about the grant will need to be duplicated). Therefore, it seems more reasonable to make the award an entity.How should we connect the AWARD entity with the RESEARCHER and INSTITUTION entities? A ternary relation has some drawbacks, since it would require some duplication when multiple investigators are working on the same award. Instead, having one binary relationship between the award and the institution, and one binary relationship between the award and the researcher (that specifies further the role of the researcher for that particular award), seems like a safer choice. An award must be awarded to at least one researcher and one institution, but we do not know if there is a maximum number of institutions that can obtain the same award, so it is better not to restrict this arity. Whether there should be a relationship between the researcher and the institution is up in the air; we do not know if a researcher has to work for an institution to get a grant, nor if getting a grant for an institution means that you work for it, so it is probably better to refrain from adding such a relationship.
Most of the attributes are straightforward once we see that “Role” is an attribute of a relationship, not of an entity.
All together, this gives the following diagram: