Data Base Primer
We have a company which manages multiple software projects. We hire people
and assign them to projects. Consequently, each project will have several
people working on it. From this statement, it is easy to observe that
project-to-workers constitute a one-to-many
relationship,
i.e. multiple people work on one project.
Note that in this relationship, we are assuming that one person may not work on more than one project.
If we had to represent a one-to-many
relationship
in the relational database, it would do it like this:
Notes:
1. We only included a minimum set of fields to demonstrate the point of the current topic.
Let us analyze this design.
Each table must have a unique primary key. In both EMPLOYEE table and PROJECT tables, this field is named "ID". There are times when the primary key is made up of multiple keys, referred to as "composite key", however, we will not consider this case here.
For the EMPLOYEE table, we could have designated the SSN as the primary key since we know that the Social Security number would be unique. This would not be a great idea for at least two reasons:
1. We may hire some foreign workers who do not have a Social Security number and whatever unique ID they may have, will likely not conform to the SSN format - "xxx-xx-xxxx".
2. If we chose SSN to be the primary key, its datatype would be STRING. Primary keys are used for joining data from multiple tables. Joining on a field whose type is INTEGER is a lot faster than if its type were STRING.
All databases have a SEQUENCE GENERATOR. We can designate the primary key as a SEQUENCE GENERATED key and the database will created it for us. Thus, when we try to insert our first employee into the PEOPLE table, it's ID will be 1.
From EMPLOYEE and PROJECT tables above we can see that both have a primary key named "ID".
To express the one-to-many
relationship,
the EMPLOYEE table also has a key named "PROJECT_ID". The PROJECT_ID
is a foreign
key.
A foreign
key is
the primary key of another table, in this case table PROJECT.
From the demonstration above, we can see that people whose IDs are 1 & 2 work for project ID=1, people with IDs 3, 4, and 5 work for project ID=2, etc.
The one-to-many
(or many-to-one
) relationships are
incredibly common.
Consider another situation where we want to store information about each
employee. In addition to the obvious fields such as "First Name",
"Last Name", "SSN", each employee also has a phone number,
email address, and residence address. However, many employees will have more
than one phone number, email address, and may have multiple residences. The
only logical way to represent this would be with a one-to-many
relationship:
One-to-on
e relationships
are less common than many-to-one
but
they are still used very extensively. Lets go back to the task of storing data
for each employee. In addition to storing employee's name, phone-numbers, etc.,
we also want to store employee's salary, start and end date (end data would be
set to null for current employees), and other sensitive information. SSN field
may also be considered sensitive data. While many applications will need access
to employee's general information, such as name, department number, title,
etc., only HR applications that are entrusted to few employees should have
access to sensitive data, such as employee's salary.
Each database table can have associated privileges indicating who can see and modify data. Thus, we can create other tables for storing employee's sensitive data and assign privileges to this table to select few.
The one-to-one
relationships
are represented the same as the many-to-many
relationships,
that is with a foreign
key.
Note:
1. EMPLOYEE_ID in EMPLOYEE_HR table is the foreign key. It contains values from EMPLOYEE.ID field.
2. Unlike the primary key, a foreign need need not (and usually will not) be unique and it may be null.
3.
In the case of a one-to-one
relationship,
we could have placed the "foreign" key in either table.
For example, our design could look like this, where we place the foreign key - EMPLOYEE_HR_ID in the EMPLOYEE table.
However, since most people in the organization will be privy to data in the EMPLOYEE table, and only people who work in HR will be able to see the EMPLOYEE_HR table, it is better to place the foreign key into the EMPLOYEE_HR table so that those who examine contents of the EMPLOYEE table would not even know that EMPLOYEE_HR table exists.
There are other reasons to create one-to-one
relationships.
Let's say we also want to store a single hobby for our employees. The
description of a hobby may contain many fields. Most employees are very boring
and do not have any hobbies. If we added Hobby fields to the EMPLOYEE table,
most records would contain "holes", i.e. empty values. Thus, we can
make a decision to define a HOBBY table and link it to EMPLOYEE with a one-to-one
relationship.
Next, we expand our original definition by allowing one person to work on
many projects. For example, some person may dedicate 60% of his time working on
project #1 and 40% working on project #2. Any one person may work on more
than two projects. Following this rule, we progress from a one-to-many
relationship to a many-to-many
relationship. In
this many-to-many
relationship,
we have many people working on one project and each person may work on many
projects. To support the many-to-many relationship, we need to introduce a new
"MAPPING" table.
Note that table PROJECT_WORKER_MAP, maps Projects to Employees (Workers). For example,
From the PROJECT_EMPLOYEE_MAP table we can see that employees with IDs 1 and 2 work on project 1, employee with IDs 2, 4, and 5 work on project 2, employees with IDs 1, 7, and 8 work on project 3, etc. We can also observe that employee with ID 2 works on projects 1 and 2, employee with ID 1 works on projects 1 and 3.
With many-to-many
relationships,
it is quite possible to have some employees who do not work on any projects,
let's say that an active employee goes on a sabbatical and will not be involved
on any project. It is also possible to have a project that does not have any
employees, for example, an upstart project that is still in the research phase.
In cases where employees are not assigned to any projects or project do not
have any employees, the PROJECT_EMPLOYEE_MAP table will not have a row for
associating employee to a project.
Note that in the example above, no employee is assigned to project ID=5 and Employee ID=9 does not work on any projects.
All databases support two types of commands: DDL that stands for Data Definition Language and DML that stands for Data Manipulation Language.
The list of DDL commands is exhaustive. You use DDL commands to created and alter tables and indexes, drop (i.e. delete) tables, manage privileges, and much-much more.
Before we can execute any database commands we need to create required tables and populate them with data. Let us examine the CREATE TABLE command:
create table worker ( id integer not null, created_by varchar(255), in_utc timestamp not null, last_modified_by varchar(255), out_utc timestamp, audit_id varchar(255), compensation double not null, worker_code varchar(255), worker_name varchar(255), primary key (id) ); |
The WORKER table in our real project has following fields:
Each field is listed along with its datatype. For STRING fields (i.e. varchar) we also need to specify maximum length.
Some fields may require a value, i.e. may not contain nulls. For
example, "compensation double
not null"
One or more fields must be designated as primary. ( primary key (id)
). A Primary key in
unique and non-nullable by definition.
Databases support many types of indices and we can't cover this topic exhaustively. Let's first discuss an index in a general sense.
Suppose we have an ORDERS table with 1,000,000 rows. We may want to perform certain types of operations on the ORDERS table very frequently. For example, find order where SELLER_NAME is "One First". With 1M rows, we would have to check 1M records in the worst case (order O(n)).
SELECT * FROM ORDERS WHERE SELLER_NAME = 'One First' |
However, if we create an index for field SELLER_NAME, we would be able to locate this record in almost constant time (O(1)).
The DDL command to create an index may look like this:
CREATE INDEX INDEX_SELLER_NAME ON ORDERS(SELLER_NAME) |
One of the common interview questions goes something like this:
"You work on your application and test it continuously in your dev (development) environment. Everything works great so you decide to deploy it to the production environment. However, in production, your application's performance becomes very sluggish. Why is that?"
Your answer may be as follows:
"It is quite possible that the dev database has been property tuned with required indices and it is therefore performant. However, production environments uses a different database and it is possible that it does not have all the required indices".
In addition to a simple index, databases support a notion of a unique index
.
Let's say that we want our database to enforce business rules and one of our
business rules is that the combination of values in audit_id
and out_utc
fields must be unique.
Thus, it would be an error if two records had same values in both audit_id
and out_utc
fields. The DDL command for
a unique index would look like this:
create unique index UKbe6uin1lttt8dh3oddmrb6g6d on worker (audit_id, out_utc); |
The name of the index (" UKbe6uin1lttt8dh3oddmrb6g6d
"
in the example above) was randomly generated and the name itself is not
important. The only thing that is important is to identify a set of
fields that must be unique.
Let's go back to our many-to-many
illustration
and consider what would happen if someone decided to delete an employee with
ID=1
delete from employee where id = 1 |
We would be left with records in table PROJECT_EMPLOYEE_MAP that would not make sense. (See records with ID 1 and 6 where the EMPLOYEE_ID is 1 which was deleted)
We can still delete employee with ID=1, however, we first need to delete records from the PROJECT_EMPLOYEE_MAP table that associate employee ID=1 with projects. This constraint can be enforced with command:
alter table project_worker_map add constraint FK6baomleyjvq2jkfy8n5tscn0l foreign key (worker_id) references worker; |
There are too many DDL commands and it is not important to know them all, only to know that they exist and can be easily discovered. Most of the common DDL commands affect tables and indices, such as:
create table ... alter table ... drop table ... create index ... create unique index ... drop index ... |
We had mentioned earlier the need for a one-to-one relationship. In that example, we decided to store sensitive employee information inside EMPLOYEE_HR table, accessible to HR staff only. For people working in HR who do need this access, we can assign them to a user (or group) with a special name, i.e. USER_HR. Then, we can execute the following command:
GRANT ALL ON TABLE EMPLOYEE_HR TO USER USER_HR |
One may think that knowing the syntax of this command might allow you to peek at other employee's salary. However, only database admins, or other users with special privileges, would be able to execute the GRANT command.
The good news about Data manipulation commands (DML) is that they comprise only four commands: INSERT, UPDATE, DELETE, and SELECT. In the industry, these are often referred to as CRUD commands for Create, Read, Update, Delete.
The bad news is that some of these commands, particularly SELECT could get very complex. In this document we will only cover the basics.
Most of the INSERT commands are very simple.
For example:
INSERT INTO WORKER(ID, AUDIT_ID, IN_UTC, OUT_UTC, WORKER_CODE, WORKER_NAME, COMPENSATION, CREATED_BY, LAST_MODIFIED_BY) VALUES(1, '00000000-0000-0000-0000-000000000001', CURRENT_TIMESTAMP - CURRENT_TIMEZONE, '9999-12-31', 'CODE-1', 'One First', 120000.00, 'unknown', 'unknown'); |
This example could also be written like this:
INSERT INTO WORKER VALUES(1, '00000000-0000-0000-0000-000000000001', CURRENT_TIMESTAMP - CURRENT_TIMEZONE, '9999-12-31', 'CODE-1', 'One First', 120000.00, 'unknown', 'unknown'); |
The reason why the above syntax would work is because we are populating every single field in the WORKER table. If we only wanted to populate some of the fields, we would need to list them, as in:
INSERT INTO WORKER(ID, AUDIT_ID, WORKER_CODE, WORKER_NAME, COMPENSATION) VALUES(1, '00000000-0000-0000-0000-000000000001', 'CODE-1', 'One First', 120000.00); |
Note that if you don't specify a field inside the INSERT statement, that field must be nullable.
To change values in an existing record we can do this:
update worker set compensation = 200000, worker_code = 'CODE-X' where id = 3 |
Naturally, this syntax could get a lot more complex and we will return to this command later.
The simplest delete command looks like this:
delete from worker |
This command will delete all records from table WORKER.
One of the common interview questions it to ask about the difference of
this delete
command
and the drop table
command:
DELETE |
DROP |
delete from worker |
drop table worker |
When you use the delete
command, the table still
exists but contains no records. When you drop
the table, the table is destroyed and you would
need to create it again if you decide to add records to it.
More often than not, we will add a condition to the delete command, for example:
Command |
Description |
delete from worker where compensation < 100000 |
Delete workers who make less than 100,000 |
delete from worker where worker_code in ('CODE-1', 'CODE-2') |
Delete workers with WORKER_CODE CODE-1 and CODE-2 |
This is where all the fun is. Many books and PhD decertations are dedicated to this topic. Let us start simple:
Note:
1.
To select all fields use " select
* ...
"
2.
We use " where
"
to filter records to be displayed. Multiple expressions may be strung together
with and, or, in,
and
other operators.
3.
Note the usage of like
operator
4. Aliases are often used with SELECT commands.
Another powerful feature of SELECT is to aggregate records via GROUP BY. Let's start with example. We would like to find out how many people work on each project. We use the PROJECT_WORKER_MAP table to map workers to projects:
Query |
Response |
select project_id, count(worker_id) as WorkerCount from project_worker_map group by project_id |
PROJECT_ID WORKERCOUNT -------------------------- 1 2 2 2 3 4 |
Suppose we are interested in seeing only those projects where the worker count is greater than 2
Query |
Response |
select project_id, count(worker_id) as Worker_Count from project_worker_map group by project_id having count(worker_id) > 2 |
PROJECT_ID WORKER_COUNT --------------------------- 3 4 |
Note:
having
to group by
is what where
is to select
Above command is incredibly useful and you should store it somewhere in your memory bank. Often you run a process which will populate rows in some table. It may be an error to have records with a duplicate value in some field. Thus, you would execute the following query to check:
Query |
select some_field from some_table group by some_field having count(some_field) > 1 |
Let us take a leap into something more interesting. We would like to determine the total compensations for each project. The compensation is specified in the WORKER table, so we need to join WORKER and PROJECT_WORKER_MAP tables.
Query |
Result |
select p.project_id, sum(w.compensation) as TotalComp from worker w, project_worker_map p where = p.worker_id group by project_id |
PROJECT_ID TOTALCOMP ---------------------- 1 310000.0 2 425000.0 3 735000.0 |
Let us observe several things from above command:
In group by selects, fields that are listed on the select clause must match fields that appear in the group by clause. However, the select clause may also specify aggregate operations such as SUM, AVG, and others.
As a slight variation to above, let's say we only want to see projects whose total compensation is greater than 500,000.
Query |
Result |
select p.project_id, sum(w.compensation) as TotalComp from worker w, project_worker_map p where = p.worker_id group by project_id having sum(w.compensation) > 500000 |
PROJECT_ID TOTALCOMP ---------------------- 3 735000.0 |
Notes:
In the previous query, I used alias w for worker and p for project_worker_map. If I had not used aliases, I would need to type the full name of the table, as in:
select project_worker_map.project_id, sum(worker.compensation) from worker, project_worker_map where worker = project_worker_map.worker_id group by project_id having sum(worker.compensation) > 500000 |
These examples will be provided without explanations:
Query |
Response |
select max(compensation) as Max from worker |
MAX -------- 240000.0 |
select min(compensation) as Min from worker |
MIN ------- 40000.0 |
Suppose we only have information on WORKER_CODE, i.e. 'CODE-1' and 'CODE-2' and we would like to find out what projects they are working on. To join table WORKER to table WORKER_PROJECT_MAP we need to have WORKER.ID and WORKER_PROJECT_MAP.WORKER_ID. However, we do not know values for WORKER.ID, we only know WORKER_CODE.
Before we accomplish our final task, let's perform some intermediate steps, like joining three tables: PROJECT, WORKER, and PROJECT_WORKER_MAP.
Note:
Second command uses order by
clause
to sort results by worker_name
field.
However, for our current task, we don't have IDs of workers, we have worker_code
. Thus, we need to use the
worker_code to get to the WORKER.ID.
Query |
Result |
select p.project_name, w.worker_name from project p, worker w, project_worker_map m where m.project_id = p.id and m.worker_id = w.id and m.worker_id in ( select id from worker where worker_code in('CODE-1', 'CODE-2') ) |
PROJECT_NAME WORKER_NAME ------------------------- P-1 One First P-2 Two Second P-3 One First P-3 Two Second |
Note:
We need to join three tables and define the filtering criteria
We can combine this result with GROUP BY clause to determine the total compensation for these projects:
Query |
Result |
select p.project_name, sum(w.compensation) as TotalComp from project p, worker w, project_worker_map m where m.project_id = p.id and m.worker_id = w.id and m.worker_id in ( select id from worker where worker_code in('CODE-1', 'CODE-2') ) group by p.project_name |
PROJECT_NAME TOTALCOMP ----------------------- P-1 120000.0 P-2 240000.0 P-3 360000.0 |
The information presented so far does not even scratch the surface. There are things that you need to be aware of even though they are not demonstrated in this document.
Like everything else, SQL went through a lot of iterations over many years. The SQL presented here follows the original SQL-86 format. I believe the latest is called SQL:2003. The syntax for SELECT commands has changed, however, the original syntax is still supported. Some operations differ by database providers. For example, to perform a left outer join (discussed shortly), you will need to adhere to the format of the database provider such as Oracle, Sybase, Postgres, DB2, etc.
The main point is to understand what capabilities are possible. It is easy to find the syntax for a specific command on the internet.
So far, the examples provided here used a standard, or "INNER" join. For example, we can join multiple tables using INNER join to see all projects and people who are assigned to them. Some projects may not have any people assigned. Suppose we want to see all projects. If people are assigned to them, we will see person name in the WORKER column. If a project has no workers assigned, that project will not appear in the response. Often, we want to see projects and workers assigned to them and also projects that have no workers. This is the job for a LEFT OUTER join. If no one is assigned to some project, the project will still appear but the value in the WORKER column will be null. RIGHT OUTER join is similar to LEFT OUTER join but works in the right-to-left order.
Here is one of the most common problems for a SELF JOIN. Let's say we have a table of employees, along with their salaries and ID of the manager. Let's say our fictitious table looks like this:
We need to answer following question: "Provide a list on employees who make more money than their manager." From table's data we can observe that Employee IDs 1, 2 3, 4 all report to employee whose ID is 5. Only employee ID=4 makes more money than his manager.
This question comes up so often at interviews that I owe you a demonstration.
Let's create a new table with data as shown above:
Create Table |
create table workerX (id integer not null, name varchar(255), compensation double not null, manager_id integer not null) |
Populate |
insert into workerX (id, name, compensation, manager_id) values (1, 'One First', 50000, 5) insert into workerX (id, name, compensation, manager_id) values (2, 'Two Second', 70000, 5) insert into workerX (id, name, compensation, manager_id) values (3, 'Three Third', 90000, 5) insert into workerX (id, name, compensation, manager_id) values (4, 'Four Fourth', 180000, 5) insert into workerX (id, name, compensation, manager_id) values (5, 'Five Fifth', 160000, 9) |
Now we can execute a query to find all employees who make more money than their managers:
Query |
Result |
select employee.* from workerX employee, workerX manager where employee.manager_id = and employee.compensation > manager.compensation |
ID NAME COMPENSATION MANAGER_ID ---------------------------------------- 4 Four Fourth 180000.0 5 |
Note how use of aliases, employee and manager, were absolutely essential for a SELF JOIN query.
There may be times when we would like to combine results from multiple tables. Our SELECT clause identifies fields and they datatype. As long as the fields in the SELECT clause match, data from multiple tables may be combined with a UNION command.
SELECT command is extremely powerful and provides many clauses to wield
incredible power. In addition to standard features of the SQL, each database
provide furnishes a library of functions to works with strings, dates, and just
about anything else. However, sometimes all this power is still not enough.
There may be times when it would be useful to create intermediate results and
stored them in the temp
table.
Next, subsequent commands can me used in combination with data in the temp
table to produce more elaborate answers.
For anything that can't be done with SQL, there are stored procedures. Each database provider has its own programming language to interact with databases. A program written in this language is called the stored procedure. Some companies build their entire applications with stored procedures. In my view, this is a terrible idea but it may be justified for certain types of projects. One of the problems if stored procedures is that they can't be debugged. You may need tp place results of some intermediate computation into a table for debugging purposes.
It is also important to realize that a stored procedure is often composed to several SQL statements without resorting to any other programming aspects. To support transactions, you will need to place your SQL commands inside the stored procedure.
A Trigger is a stored procedure. It is triggered when a certain action occurs, such as inserting a new row or updating it. Some time ago, triggers where relied on very heavily, however, over the years they usefulness has diminished.
Let's say we have a site for taking customer orders. When the order is placed we need to update ORDERS table, INVENTORY table, and possible CUSTOMERS table to keep track of all the money they had spent on our crap.
Say we update the ORDERS table first and then try to update the INVERTORY table, but that update fails. Now we have a system in an inconsistent state. To prevent this inconsistent state we need to ROLL-BACK our transaction. Thus, either all commands complete successfully or none at all.
Failure to update the INVERNTORY table may happen for many diffident reasons, for example, the table may run out of storage space that was allocated for it.