Introduction
Resources
- (Elmasri and Navathe 2010, ch. 1.1-1.6)
- (Elmasri and Navathe 2015, ch. 1.1–1.6)
- (Watt and Eng 2014, ch. 2–3)
The Need for a Specialized Tool
There is a good chance that any programming language you can think of is Turing complete. Actually, even some of the extremely basic tools you may be using may be Turing complete. However, being complete does not mean being good at any task: it just means that any computable problem can be solved, but does not imply anything in terms of efficiency, comfort, or usability.
In theory, pretty much any programming language can be used to
- Store, retrieve and update data,
- Have accessible catalog describing the metadata,
- Support transactions and concurrency,
- Support authorization of access and update of data,
- Enforce constraints.
But to obtain a system that is fast in reading and writing on the disk, convenient to search in the data, and that provides as many “built-in” tools as possible, one should use a specialized tool.
In those lecture notes, we will introduce one of this tool–the SQL programming language– and the theory underneath it–the relational model–. We will also observe that a careful design is a mandatory step before implementing a catalog, and that how good a catalog is can be assessed, and introduce the tools to do so. Finally, we will discuss how an application interacting with a database can be implemented and secured, and the alternatives to SQL offered by the NoSQL approach, as well as the limitations and highlights of both models.
Database
A database (DB) is a collection of related data. Data (= information, can be anything, really) + management (= logical organization of the data), through Database Management System.
- Represent a mini-world, a “Universe of Disclosure” (UoD).
- Logically coherent, with a meaning.
- Populated for a purpose.
A DBMS has multiple components, as follows:
Note that
- The program can be written in any language, be a web interface, etc. It is sometimes part of the software shipped with the DBMS, but not necessarily (you can, and we will, develop your own program to interact with the DBMS).
- Most DBMS software includes a Command-Line Interface (CLI).
- The catalog (or schema, meta-data) contains the description of how the data is stored, i.e., the datatypes, nature of the attributes, etc.
The term “meta-data” has numerous definition (“data about the data”): we use it here to refer to the description of the organization of the data, and not e.g. to statistical data about the data.
- Sometimes, catalog and data are closer than pictured (you can have “self-describing meta-data”, that is, they cannot be distinguished).
Database Management System (DBMS)
A DBMS contains a general purpose software that is used to
- Define (= datatype, constraints, structures, etc.)
- Construct / Create the data (= store the data)
- Manipulate / Maintain (= change the structure, query the data, update it, etc.)
- Share / Control access (= among users, applications)
You can think of a tool to
- Specify a storage unit,
- Fill it,
- Allow to change its content, as well as its organization,
- Allow multiple persons to access all or parts of it at the same time.
Subtasks
Exactly like a program can have
- clients, that specify the requirements,
- designers, that define the overall architecture of a program,
- programmers, that implement the details of the program,
- testers, that make sure the program is free of bugs, and
- users, that actually use the program,
a DBMS offers multiple (sub)tasks and can be interacted with different persons with different roles.
Role | Task |
---|---|
Client | Specify the business statement, the specifications |
DB Administrator | Install, configure, secure and maintain up-to-date the DBMS |
Designer | Lay out the global organization of the data |
Programmer | Implement the database, work on the programs that will interface with it |
User | Provide, search, and edit the data (usually) |
In those lecture notes, the main focus will be on design and implementation, but we will have to do a little bit of everything, without forgetting which role we are currently playing.
Life of a Project
From the business statement to the usage, a project generally follows one of this path:
Note that reverse-engineering can sometimes happen, i.e., if you are given a poor implementation and want to extract a relational model from it, to normalize it.
An Example
Let us consider the following:
STUDENT
Name | Student_number | Class | Major |
---|---|---|---|
Morgan | 18 | 2 | IT |
Bob | 17 | 1 | CS |
COURSE
Course_name | Course_number | Credit_hours | Department |
---|---|---|---|
Intro. to CS | 1301 | 4 | CS |
DB Systems | 3401 | 3 | CS |
Principles of Scripting and Automation | 2120 | 3 | AIST |
SECTION
Section_identifier | Course_num | Semster | Year | Instructor |
---|---|---|---|---|
2910 | 1301 | Fall | 2019 | Kate |
9230 | 2103 | Spring | 2020 | Todd |
GRADE_REPORT
Student_number | Section_identifier | Grade |
---|---|---|
17 | 2910 | A |
18 | 2910 | B |
PREREQUISITE
Course_number | Prerequisite_number |
---|---|
2120 | 1301 |
1302 | 1301 |
You can describe the structure as a collection of relations, and a collection of columns:
RELATIONS
Relation Name | Number of Columns |
---|---|
STUDENT | 4 |
COURSE | 4 |
SECTION | 5 |
GRADE_REPORT | 3 |
PREREQUISITE | 2 |
COLUMNS
Column Name | Datatype | Belongs to relation |
---|---|---|
Name | String | STUDENT |
Student_number | Integer | STUDENT |
Class | String | STUDENT |
Major | String | STUDENT |
Course_name | String | STUDENT |
Course_number | Integer | COURSE |
Credit_hours | Integer | COURSE |
Department | String | COURSE |
… | … | … |
Prerequisite_number | Integer | PREREQUISITE |
Structure
- Database structure and records, 5 files (=collection of records), each containing data records of the same type, stored in a persistent way.
- Each record has a structure, different data elements, each has a data type.
- Records have relationships between them (for instance, you expect the Course_number of PREREQUISITE to occur as a Course_number in COURSE).
Interactions
- This organization will allow some interactions. For instance, we can obtain the answer to questions like
“What is the name of the course whose number is 1301?”,
Note that this last query is a bit different, as it forces us to look up information in multiple relations.
“What courses is Kate teaching this semester?”,
“Does Bob meets the pre-requisite for 2910?” - We should also be able to perform updates, removal, addition of records in an efficient way (using auxiliary files (indexes), optimization).
- Finally, selection (for any operation) requires care: do we want all the records, some of them, exactly one?
Organization
Why are the files separated like that? Why do not we store the section with the course with the students? For multiple reasons:
- To avoid redundancy (“data normalization”), or having it controlled,
- To controle multiple levels of access (multiple user interface),
- Without sacrificing the usability!
In separating the datae, we also need to remember to be careful about consistency and referential integrity, which is a topic we will discuss in detail.
How Is a Database Conceived?
- Specification and analysis. “Each student number will be unique, but they can have the same name. We want to access the letter grade, but not the numerical grade”, etc. This gives the businnes statement.
- Conceptual design
- Logical design
- Physical design
There is a gradation, from really abstract specification that is easy to modify, to more solidified description of what needs to be coded. When we will be discussing high-level models, we will come back to those notions. The global idea is that it is easier to move things around early in the conception, and harder once everything is implemented.
Characteristics of the Database Approach
- A database is more than just data: it also contains a complete description of the structure and constraints. We generally have a catalog (a.k.a. the meta-data, the schema) and the data (we can also have self-describing data, where meta-data and data are interleaved, but note that both are still present).
- Data-abstraction: A DBMS provides a conceptual representation, and hides implementation details. This implies that changing the internals of the database should not require to change the application (the DBMS) or the way any of the client (program, or CLI) was interacting with the data.
- Support of multiple views of the data: view is a subset of the database, or virtual data.
- Sharing and multiuser transaction processing: concurrency control using transactions (= series of instructions that is supposed to execute a logically correct database access if executed in its entirety). Isolation, atomicity (all or nothing): cf. the ACID principles.
Exercises
Exercise 1.1
What is the difference between a database and the meta-data of the database?
Exercise 1.2
Is a pile of trash a database? Why, or why not?
Exercise 1.3
Define the word “miniworld”.
Exercise 1.4
Expand the acronym “DBMS”.
Exercise 1.5
Name two DBMS.
Exercise 1.6
Name the four different kinds of action that can be performed on data.
Exercise 1.7
Assign each of the following task to one of the “character” (administrator, client, etc.) we introduced:
Task | Assigned to |
---|---|
Install a DBMS on a server. | |
Sketch the schema so that the data will not be redundant. | |
Write client-side application that uses the DBMS API. | |
Establish the purpose of the database. |
Exercise 1.8
List some of the tasks assigned to the Database Administrator.
Exercise 1.9
Why do DBMS include concurrency control?
Exercise 1.10
Do I have to change my DBMS if I want to change the structure of my data?
Exercise 1.11
Exercise 1.12
Assume that I have a file where one record corresponds to one student. Should the information about the classes a student is taking (e.g. room, instructor, code, etc.) being stored in the same file? Why, or why not?
Exercise 1.13
Which one comes first, the physical design, the conceptual design, or the logical design?
Exercise 1.14
What is a virtual data? How can I access it?
Solution to Exercises
Solution 1.1
The data is the information we want to store, the meta-data is its organization, how we are going to store it. Meta-data is information about the data, but of no use on its own.
Solution 1.2
No, because it lacks a logical structure.
Solution 1.3
The mini-world is the part of the universe we want to represent in the database. It is supposed to be meaningful and will serve a purpose.
Solution 1.4
Database Management System
Solution 1.5
Oracle RDBMS, IBM DB2, Microsoft SQL Server, MySQL, PostgreSQL, Microsoft Access, etc., are valid answers. Are not valid “SQL”, “NoSQL”, “Relational Model”, or such: we are asking for the names of actual softwares!
Solution 1.6
The four actions are:
- Add / Insert
- Update / Modify
- Search / Query
- Delete / Remove
Solution 1.7
We can have something like:
Task | Assigned to |
---|---|
Install a DBMS on a server. | Administrator, IT service |
Sketch the schema so that the data will not be redundant. | Designer |
Write client-side application that uses the DBMS API. | Programmer, Developer |
Establish the purpose of the database. | Client, business owner |
Solution 1.8
The database administrator is in charge of installing, configuring, securing and keeping up-to-date the database management system. They also control the accesses and the performance of the system, troubleshoot it, and create backup of the data.
Solution 1.9
DBMS have concurrency control to ensure that several users trying to update the same data will do so in a controlled manner. It is to avoid inconsistency to appear in the data.
Solution 1.10
Normally no, data and programs are independent. But actually, this is true only if the model does not change: shifting to a “less structured model”, e.g., one of the NoSQL models, can require to change the DBMS.
Solution 1.11
The application should not be sensible to the “internals” of the definition and organization of the data. It matters because having this independence means that changing the data will not require to change the programs.
Solution 1.12
If we were to store all the information about the classes in the student records, then we would have to store it as many time as its number of students! It is better to store it in a different file, and then to “link” the two files, to avoid redundancy.
Solution 1.13
The conceptual design.
Solution 1.14
It is a set of information that is derived from the database but not directly stored in it. It is accessed through queries. For instance, we can infer the age of a person if their date of birth is in the database, but strictly speaking the age is not an information stored in the database.
What is independence between program and data? Why does it matter?
Problems
Problem 1.1 (Define a database for CAMPUS
)
Define a CAMPUS
database organized into three files as follows:
- A
BUILDING
file storing the name and GPS coordinates of each building. - A
ROOM
file storing the building, number and floor of each room. - A
PROF
file storing the name, phone number, email and room number where the office is located for each professor.
Pb 1.1 – Question 1
A database catalog is made of two part: a table containing the relations’ name and their number of columns, and a table containing the columns’ name, their data type, and the relation to which they belong. Refer to the example we made previously or consult, e.g., (Elmasri and Navathe 2010, Figure 1.3) or (Elmasri and Navathe 2015, Figure 1.3). Write the database catalog corresponding to the CAMPUS
database.
Pb 1.1 – Question 2
Invent data for such a database, with two buildings, three rooms and two professors.
Pb 1.1 – Question 3
Answer the following, assuming all the knowledge you have of the situation comes from the CAMPUS
database, which is an up-to-date and accurate representation of its miniworld:
- Is it possible to list all the professors?
- Is it possible to tell in which department is a professor?
- Is it possible to get the office hours of a professor?
- Is it possible to list all the professors whose offices are in the same building?
- Is it possible to list all the rooms?
- If a new professor arrives, and has to share his office with another professor, do you have to revise your database catalog?
- Can you list which professors are at the same floor?
- Can you tell which professor has the highest evaluations?
Solutions to Selected Problems
Solution to Problem 1.1 (Define a database for CAMPUS
)
Pb 1.1 – Solution to Q. 1
The database catalog should be similar to the following:
RELATIONS
Relation name | Number of columns |
---|---|
BUILDING | 3 |
ROOM | 3 |
PROF | 4 |
COLUMNS
Column name | Datatype | Belongs to relation |
---|---|---|
Building_Name | Character(30) | Building |
GPSLat | Decimal(9,6) | Building |
GPSLon | Decimal(9,6) | Building |
Building_Name | Character(30) | ROOM |
Room_Number | Integer(1) | ROOM |
Floor | Integer (1) | ROOM |
Prof_Name | Character (30) | PROF |
Phone | Integer (10) | PROF |
Character (30) | PROF | |
Room_Number | Integer (1) | PROF |
Pb 1.1 – Solution to Q. 2
For the data, you could have:
- For the
BUILDING
file, we could have:
(Allgood Hall, 33.47520, -82.02503) (Institut Galilé, 48.959001, 2.339999)
- For the
ROOM
file, we could have:
(Allgood Hall, 128, 1) (Institut Galilé, 205, 3) (Allgood Hall, 228, 2)
- For the
PROF
file, we could have:
(Aubert, 839401, dae@ipn.net, 128) (Mazza, 938130, Dm@fai.net, 205)
Pb 1.1 – Solution to Q. 3
If everything we knew about the campus came from that database, then
- Yes, we could list all the professors.
- No, we could not tell in which department is a professor.
- No, we could not get the office hours of a professor.
- Yes, we could list all the professors whose offices are in the same building.
- Yes, we could list all the rooms.
- If a new professor arrives, and has to share his office with another professor, we would not have to revise our database catalog (it is fine for two professor to have the same room number, in our model).
- Yes, we could list which professors are at the same floor.
- No, we could not tell which professor has the highest evaluations.