Introduction

Resources

(Elmasri and Navathe 2010, ch. 1.1-1.6)
(Elmasri and Navathe 2015, ch. 1.1–1.6)
(Watt and Eng 2014, ch. 2–3)

The Need for a Specialized Tool

There is a good chance that any programming language you can think of is Turing complete. Actually, even some of the extremely basic tools you may be using may be Turing complete. However, being complete does not mean being good at any task: it just means that any computable problem can be solved, but does not imply anything in terms of efficiency, comfort, or usability.

In theory, pretty much any programming language can be used to

Store, retrieve and update data,
Have accessible catalog describing the metadata,
Support transactions and concurrency,
Support authorization of access and update of data,
Enforce constraints.

But to obtain a system that is fast in reading and writing on the disk, convenient to search in the data, and that provides as many “built-in” tools as possible, one should use a specialized tool.

In those lecture notes, we will introduce one of this tool–the SQL programming language– and the theory underneath it–the relational model–. We will also observe that a careful design is a mandatory step before implementing a catalog, and that how good a catalog is can be assessed, and introduce the tools to do so. Finally, we will discuss how an application interacting with a database can be implemented and secured, and the alternatives to SQL offered by the NoSQL approach, as well as the limitations and highlights of both models.

Database

A database (DB) is a collection of related data. Data (= information, can be anything, really) + management (= logical organization of the data), through Database Management System.

Represent a mini-world, a “Universe of Disclosure” (UoD).
Logically coherent, with a meaning.
Populated for a purpose.

A DBMS has multiple components, as follows:

Note that

The program can be written in any language, be a web interface, etc. It is sometimes part of the software shipped with the DBMS, but not necessarily (you can, and we will, develop your own program to interact with the DBMS).
Most DBMS software includes a Command-Line Interface (CLI).
The catalog (or schema, meta-data) contains the description of how the data is stored, i.e., the datatypes, nature of the attributes, etc.
The term “meta-data” has numerous definition (“data about the data”): we use it here to refer to the description of the organization of the data, and not e.g. to statistical data about the data.
Sometimes, catalog and data are closer than pictured (you can have “self-describing meta-data”, that is, they cannot be distinguished).

Database Management System (DBMS)

A DBMS contains a general purpose software that is used to

Define (= datatype, constraints, structures, etc.)
Construct / Create the data (= store the data)
Manipulate / Maintain (= change the structure, query the data, update it, etc.)
Share / Control access (= among users, applications)

You can think of a tool to

Specify a storage unit,
Fill it,
Allow to change its content, as well as its organization,
Allow multiple persons to access all or parts of it at the same time.

Subtasks

Exactly like a program can have

clients, that specify the requirements,
designers, that define the overall architecture of a program,
programmers, that implement the details of the program,
testers, that make sure the program is free of bugs, and
users, that actually use the program,

a DBMS offers multiple (sub)tasks and can be interacted with different persons with different roles.

Role	Task
Client	Specify the business statement, the specifications
DB Administrator	Install, configure, secure and maintain up-to-date the DBMS
Designer	Lay out the global organization of the data
Programmer	Implement the database, work on the programs that will interface with it
User	Provide, search, and edit the data (usually)

In those lecture notes, the main focus will be on design and implementation, but we will have to do a little bit of everything, without forgetting which role we are currently playing.

Life of a Project

From the business statement to the usage, a project generally follows one of this path:

Note that reverse-engineering can sometimes happen, i.e., if you are given a poor implementation and want to extract a relational model from it, to normalize it.

An Example

Let us consider the following:

STUDENT

Name	Student_number	Class	Major
Morgan	18	2	IT
Bob	17	1	CS

COURSE

Course_name	Course_number	Credit_hours	Department
Intro. to CS	1301	4	CS
DB Systems	3401	3	CS
Principles of Scripting and Automation	2120	3	AIST

SECTION

Section_identifier	Course_num	Semster	Year	Instructor
2910	1301	Fall	2019	Kate
9230	2103	Spring	2020	Todd

GRADE_REPORT

Student_number	Section_identifier	Grade
17	2910	A
18	2910	B

PREREQUISITE

Course_number	Prerequisite_number
2120	1301
1302	1301

You can describe the structure as a collection of relations, and a collection of columns:

RELATIONS

Relation Name	Number of Columns
STUDENT	4
COURSE	4
SECTION	5
GRADE_REPORT	3
PREREQUISITE	2

COLUMNS

Column Name	Datatype	Belongs to relation
Name	String	STUDENT
Student_number	Integer	STUDENT
Class	String	STUDENT
Major	String	STUDENT
Course_name	String	STUDENT
Course_number	Integer	COURSE
Credit_hours	Integer	COURSE
Department	String	COURSE
…	…	…
Prerequisite_number	Integer	PREREQUISITE

Structure

Database structure and records, 5 files (=collection of records), each containing data records of the same type, stored in a persistent way.
Each record has a structure, different data elements, each has a data type.
Records have relationships between them (for instance, you expect the Course_number of PREREQUISITE to occur as a Course_number in COURSE).

Interactions

This organization will allow some interactions. For instance, we can obtain the answer to questions like

“What is the name of the course whose number is 1301?”,
“What courses is Kate teaching this semester?”,
“Does Bob meets the pre-requisite for 2910?”
Note that this last query is a bit different, as it forces us to look up information in multiple relations.
We should also be able to perform updates, removal, addition of records in an efficient way (using auxiliary files (indexes), optimization).
Finally, selection (for any operation) requires care: do we want all the records, some of them, exactly one?

Organization

Why are the files separated like that? Why do not we store the section with the course with the students? For multiple reasons:

To avoid redundancy (“data normalization”), or having it controlled,
To controle multiple levels of access (multiple user interface),
Without sacrificing the usability!

In separating the datae, we also need to remember to be careful about consistency and referential integrity, which is a topic we will discuss in detail.

How Is a Database Conceived?

Specification and analysis. “Each student number will be unique, but they can have the same name. We want to access the letter grade, but not the numerical grade”, etc. This gives the businnes statement.
Conceptual design
Logical design
Physical design

There is a gradation, from really abstract specification that is easy to modify, to more solidified description of what needs to be coded. When we will be discussing high-level models, we will come back to those notions. The global idea is that it is easier to move things around early in the conception, and harder once everything is implemented.

Characteristics of the Database Approach

A database is more than just data: it also contains a complete description of the structure and constraints. We generally have a catalog (a.k.a. the meta-data, the schema) and the data (we can also have self-describing data, where meta-data and data are interleaved, but note that both are still present).
Data-abstraction: A DBMS provides a conceptual representation, and hides implementation details. This implies that changing the internals of the database should not require to change the application (the DBMS) or the way any of the client (program, or CLI) was interacting with the data.
Support of multiple views of the data: view is a subset of the database, or virtual data.
Sharing and multiuser transaction processing: concurrency control using transactions (= series of instructions that is supposed to execute a logically correct database access if executed in its entirety). Isolation, atomicity (all or nothing): cf. the ACID principles.

Exercises

Exercise 1.1

What is the difference between a database and the meta-data of the database?

Exercise 1.2

Is a pile of trash a database? Why, or why not?

Exercise 1.3

Define the word “miniworld”.

Exercise 1.4

Expand the acronym “DBMS”.

Exercise 1.5

Name two DBMS.

Exercise 1.6

Name the four different kinds of action that can be performed on data.

Exercise 1.7

Assign each of the following task to one of the “character” (administrator, client, etc.) we introduced:

Task	Assigned to
Install a DBMS on a server.
Sketch the schema so that the data will not be redundant.
Write client-side application that uses the DBMS API.
Establish the purpose of the database.

Exercise 1.8

List some of the tasks assigned to the Database Administrator.

Exercise 1.9

Why do DBMS include concurrency control?

Exercise 1.10

Do I have to change my DBMS if I want to change the structure of my data?

Exercise 1.11

Exercise 1.12

Assume that I have a file where one record corresponds to one student. Should the information about the classes a student is taking (e.g. room, instructor, code, etc.) being stored in the same file? Why, or why not?

Exercise 1.13

Which one comes first, the physical design, the conceptual design, or the logical design?

Exercise 1.14

What is a virtual data? How can I access it?

Solution to Exercises

Solution 1.1

The data is the information we want to store, the meta-data is its organization, how we are going to store it. Meta-data is information about the data, but of no use on its own.

Solution 1.2

No, because it lacks a logical structure.

Solution 1.3

The mini-world is the part of the universe we want to represent in the database. It is supposed to be meaningful and will serve a purpose.

Solution 1.4

Database Management System

Solution 1.5

Oracle RDBMS, IBM DB2, Microsoft SQL Server, MySQL, PostgreSQL, Microsoft Access, etc., are valid answers. Are not valid “SQL”, “NoSQL”, “Relational Model”, or such: we are asking for the names of actual softwares!

Solution 1.6

The four actions are:

Add / Insert
Update / Modify
Search / Query
Delete / Remove

Solution 1.7

We can have something like:

Task	Assigned to
Install a DBMS on a server.	Administrator, IT service
Sketch the schema so that the data will not be redundant.	Designer
Write client-side application that uses the DBMS API.	Programmer, Developer
Establish the purpose of the database.	Client, business owner

Solution 1.8

The database administrator is in charge of installing, configuring, securing and keeping up-to-date the database management system. They also control the accesses and the performance of the system, troubleshoot it, and create backup of the data.

Solution 1.9

DBMS have concurrency control to ensure that several users trying to update the same data will do so in a controlled manner. It is to avoid inconsistency to appear in the data.

Solution 1.10

Normally no, data and programs are independent. But actually, this is true only if the model does not change: shifting to a “less structured model”, e.g., one of the NoSQL models, can require to change the DBMS.

Solution 1.11

The application should not be sensible to the “internals” of the definition and organization of the data. It matters because having this independence means that changing the data will not require to change the programs.

Solution 1.12

If we were to store all the information about the classes in the student records, then we would have to store it as many time as its number of students! It is better to store it in a different file, and then to “link” the two files, to avoid redundancy.

Solution 1.13

The conceptual design.

Solution 1.14

It is a set of information that is derived from the database but not directly stored in it. It is accessed through queries. For instance, we can infer the age of a person if their date of birth is in the database, but strictly speaking the age is not an information stored in the database.

What is independence between program and data? Why does it matter?

Problems

Problem 1.1 (Define a database for CAMPUS)

Define a CAMPUS database organized into three files as follows:

A BUILDING file storing the name and GPS coordinates of each building.
A ROOM file storing the building, number and floor of each room.
A PROF file storing the name, phone number, email and room number where the office is located for each professor.

Pb 1.1 – Question 1

A database catalog is made of two part: a table containing the relations’ name and their number of columns, and a table containing the columns’ name, their data type, and the relation to which they belong. Refer to the example we made previously or consult, e.g., (Elmasri and Navathe 2010, Figure 1.3) or (Elmasri and Navathe 2015, Figure 1.3). Write the database catalog corresponding to the CAMPUS database.

Pb 1.1 – Question 2

Invent data for such a database, with two buildings, three rooms and two professors.

Pb 1.1 – Question 3

Answer the following, assuming all the knowledge you have of the situation comes from the CAMPUS database, which is an up-to-date and accurate representation of its miniworld:

Is it possible to list all the professors?
Is it possible to tell in which department is a professor?
Is it possible to get the office hours of a professor?
Is it possible to list all the professors whose offices are in the same building?
Is it possible to list all the rooms?
If a new professor arrives, and has to share his office with another professor, do you have to revise your database catalog?
Can you list which professors are at the same floor?
Can you tell which professor has the highest evaluations?

Solutions to Selected Problems

Solution to Problem 1.1 (Define a database for CAMPUS)

Pb 1.1 – Solution to Q. 1

The database catalog should be similar to the following:

RELATIONS

Relation name	Number of columns
`BUILDING`	3
`ROOM`	3
`PROF`	4

COLUMNS

Column name	Datatype	Belongs to relation
Building_Name	Character(30)	`Building`
GPSLat	Decimal(9,6)	`Building`
GPSLon	Decimal(9,6)	`Building`
Building_Name	Character(30)	`ROOM`
Room_Number	Integer(1)	`ROOM`
Floor	Integer (1)	`ROOM`
Prof_Name	Character (30)	`PROF`
Phone	Integer (10)	`PROF`
Email	Character (30)	`PROF`
Room_Number	Integer (1)	`PROF`

Pb 1.1 – Solution to Q. 2

For the data, you could have:

For the BUILDING file, we could have:

(Allgood Hall, 33.47520, -82.02503)
(Institut Galilé,   48.959001, 2.339999)

For the ROOM file, we could have:

(Allgood Hall, 128, 1)
(Institut Galilé, 205, 3)
(Allgood Hall, 228, 2)

For the PROF file, we could have:

(Aubert, 839401, dae@ipn.net, 128)
(Mazza, 938130, Dm@fai.net, 205)

Pb 1.1 – Solution to Q. 3

If everything we knew about the campus came from that database, then

Yes, we could list all the professors.
No, we could not tell in which department is a professor.
No, we could not get the office hours of a professor.
Yes, we could list all the professors whose offices are in the same building.
Yes, we could list all the rooms.
If a new professor arrives, and has to share his office with another professor, we would not have to revise our database catalog (it is fine for two professor to have the same room number, in our model).
Yes, we could list which professors are at the same floor.
No, we could not tell which professor has the highest evaluations.

The Relational Model

Show the following:

Adjust appearance:

Notes