Introduction to Relational Databases and SQL

An introduction to relational databases - Where do they come from? Where do they live? What do they eat?

Hi! I hope you’re enjoying The Miners’ Guide to Code Crafting series. Today, we’re diving into databases — one of the most essential tools in web development. Keep in mind that this subject is a universe of its own, so my goal here is to give you an introduction to this subject. As we all know, the best way to learn about databases is through consistent practice. So, with that in mind, if you want to follow this blog post by executing the queries, here’s a gist containing a composer file with a Postgres instance and a SQL script to create the database setup. If you’re ready, let’s jump into it by first answering a fundamental question:

What is a Relational Database?

A relational database organizes data in a structured manner, allowing relationships to be established through a common and unique attributes. This structure makes data easier to interpret and helps reveal connections between different entities, being that the relational model represents entities—whether concrete or abstract, such as a User, Order, or Vehicle—as tables stored within a database. As systems develop, the relationships between data can become more intricate. To effectively manage this complexity and improve understanding, various levels of data abstraction are employed:

Physical Level

The Physical level is the lower level of abstraction, representing and describing how the data are actually stored within the database, defining which data structures are going to be used, how these data are going to be accessed, what kind of security strategies are going to be implemented, etc.

Logical or conceptual Level

This level of abstraction describes what data is stored in the database, its relationships, constraints, nature, and organization. Most developers use this level of abstraction when modeling the database to solve their needs through the Entity Relational Diagram – a diagram that is used to conceptualize how data is going to be represented and organized.

View level

This is the highest level of abstraction and is often presented to the end-user, making it as user-friendly as possible. It is typically represented through the software’s GUI and usually displays only a portion of the data from the database. It is also common to mask certain data, such as dates, monetary values, measurements (height, width, etc.), and other relevant information.

Relational Database Management Systems – RDBMS

RDBMSs are software implementations of a Relational Database that provide us a plethora of tools to help organize data, maintain its integrity, and manage databases efficiently. Some of the most well-known RDBMSs include PostgreSQL, MySQL, MariaDB (a Linux-based fork of MySQL), OracleDB, and Microsoft SQL Server. In this blog post, we’ll focus on PostgreSQL.

RDBMSs uses SQL – Structured Query Language to manipulate the stored data.

Clients

Some applications use RDBMS APIs to provide a simple visual abstraction for various features (e.g., inserting data through an interface, managing multiple database connections, creating database backups, visualizing table relationships through a generated ER diagram, etc.), making it easier to manage and explore data. These applications are called clients, and some of the most popular PostgreSQL clients include DBeaver, PgAdmin, and TablePlus.

Database instances and Schemas – How databases are defined

A database instance is a collection of databases and their associated information and also refers to the physical installation of the database engine software – a set of processes, services, and tasks that runs on the Operational System, providing all of the necessary components to manage and operate a database such as indexes, constraints, etc. (We’ll see more about these components further on).

A database schema is the overall design of the database and it can be separated into two different types: The logical schema, and the physical schema. Both follow the same rationale as the levels of data abstraction.

The physical schema defines the physical details of the database, such as the syntax used to create the data structures, what type of storage is going to be used, details of the objects such as tables and columns, etc.

The logical schema defines the details of the database at a logical level, such as how tables are going to be organized, their relationships, constraints, columns data types, etc.

The physical schema is hidden beneath the logical schema and can usually be changed easily without affecting applications.

Bear in mind that database instances and schemas are the foundation that supports all database components.

Database Languages

Database systems provide languages for developers to manipulate the data. These languages form parts of a single database language, such as the Structured Query Language – SQL, which we’ll focus on.

Relational Model

As mentioned before, the relational model structures data as tables. Every table has a set of columns and every column has a data type. Except for specific cases (which we’ll discuss later), most of the tables have a unique identifier – a special column called Primary Key. The Primary Key is used to differentiate the rows/registers from one another, and it can range from a simple integer ID that is incremented as registers are inserted in the database to a string code that is generated through specific calculations defined by the developers.

Tables can have relationships between themselves, and that is done by adding a column that’s going to reference another table register through its Primary Keys. These columns are called Foreign Keys – an attribute that serves as a reference to a register from another table. In case that reference is invalid, for example when the reference points to a nonexistent register on another table, that reference is considered invalid and isn’t going to be saved – Postgres is going to throw a Referential Integrity Error. This happens when someone tries to:

  • Update or add an invalid reference to a row. For optional foreign keys (foreign keys that can be null), this means adding a reference to a table row that doesn’t exist. For required foreign keys, this means either not referencing a table row (when creating a new row or updating an existing reference) or referencing a table row that doesn’t exist.
  • Delete a register that’s already being referenced by another one
  • Create a register with an invalid reference.

To better illustrate these concepts, let’s see the following data model:

Here, the customers, orders, and products tables represent entities, and order_products is an intermediate table that represents the relationship between order and products.

Every entity on that model has an integer id as their Primary Keys, and attributes that represents the needed data for the business rules. Here we have some of the most used data types in PosgreSQL:

  • Boolean
  • Integer
  • Bigint
  • Character varying – varchar (a string, a sequence of characters that can vary its size from 1 to 255 bytes)
  • Text
  • Double precision (float8)
  • Date
  • Enum – Enumerated (a type of data that holds a static, ordered set of values, such as open, closed and on hold)

For more details over every data type that Postgres supports, see: https://www.postgresql.org/docs/17/datatype.html

Data-definition Language – DDL

Data Definition Language are SQL commands used to specify a database’s logical and physical schema, defining details such as the storage engine used by tables, how many bytes an integer field can use in a table’s row, etc. As for the logical part, DDL is responsible for describing the data, its constraints, limits and relationships. This is done through a set of predicates/rules that can be easily tested, called constraints, more specifically, integrity constraints:

Domain constraints

This type of constraint defines a domain of possible values that must be associated with every column across tables, ensuring that the values inserted into the database conform to the expected data type. It guarantees, for example, that a column designed to store only integers cannot receive a VARCHAR or a DECIMAL, for instance.

Referential integrity

This kind of constraint is used when there are cases in which a value that appears in one relation for a given set of attributes, also appears in a certain set of attributes on another relation. Imagine that we have a table called customers, and its unique identifier is a column named id, which is being used as a reference in a table called orders. In this case, for every order made by a customer, this kind of constraint is responsible for ensuring that the customer id reference is pointing to one that actually exists, and when that constraint is violated, the database should reject that action.

Authorization

This kind of constraint is used to restrict and allow users to perform certain actions, based on the level of permission that the constraint gives. The most common use case is to allow some users to read data but not to modify or write to any register in the database or in a set of tables. Or to allow users to read and update data, but not delete it. The SQL commands used to create authorization constraints are defined through the Data Control Language – DCL, which we’ll discuss later.

In SQL the most used DDL commands are:

Create

The CREATE command is used to define a component schema, such as Databases, Tables, Indexes, Views, etc:

CREATE DATABASE development;

CREATE TABLE customers(
    id SERIAL PRIMARY KEY,
    first_name VARCHAR,
    last_name VARCHAR,
    active BOOLEAN
);

CREATE TABLE orders(
    id SERIAL PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    ordered_at TIMESTAMP
);

CREATE TABLE products(
    id SERIAL PRIMARY KEY,
    name VARCHAR,
    description TEXT
);

CREATE TABLE order_products(
    order_id INTEGER REFERENCES orders(id),
    product_id INTEGER REFERENCES products(id),
    quantity INTEGER
);

CREATE INDEX products_list ON products (id, name);
CREATE INDEX order_products_index ON order_products (order_id, product_id, quantity);

_Quick disclaimer: The SERIAL keyword indicates that the column will be auto-incremented. This means that Postgres will create a sequence for this column, where each new value is incremented from the previous one. For example, when we insert our first record, its id will be 1 (since the initial sequence value is 0), the next one will be 2, and so on. We won’t need to explicitly define an id when inserting a new row, instead, we rely on the database to do this.

Alter

The ALTER command is used to modify or update a component’s definition, such as its name or internal structure (e.g., a table’s column). This clause can be combined with other commands, such as ADD COLUMN (which adds a new column to a given table) or OWNER TO (which changes the ownership of a component, such as a table or a database):

ALTER TABLE orders ADD COLUMN total_amount DECIMAL;

ALTER TABLE orders ALTER COLUMN total_amount TYPE INTEGER;

ALTER INDEX order_products_index RENAME TO orders_and_products_listing_index;

ALTER TABLE order DROP COLUMN total_amount;

ALTER DATABASE development OWNER TO codeminer42;
Drop

The DROP command is used to delete an existing definition/component.

CREATE TABLE drop_test(
    id SERIAL PRIMARY KEY,
    description TEXT
);

CREATE INDEX drop_test_index ON drop_test(id, description);

CREATE DATABASE test;

DROP TABLE drop_test;

DROP INDEX drop_test_index;

DROP DATABASE test;

Data manipulation language – DML

Data Manipulation Language are SQL commands that allow users to access or manipulate data as organized by the appropriate data model.

Data Manipulation Language offers a uniform language to access or manipulate data, being the types of access:

  • Retrieval of information stored in the database.
  • Insertion of new information into the database.
  • Deletion of information from the database.
  • Modification of information stored in the database.

There are two types of DMLs:

  • Procedural DMLs, which require a user to specify what data are needed and how to get those data
  • Declarative DMLs, which require a user to specify what data are needed without specifying how to get those data

These accesses are specified through queries, which are statements that requests the retrieval of information.

Insert

The INSERT command is used to save rows in a given table:

INSERT INTO products(name, description) VALUES ('Gibson Les Paul Guitar', 'Definitely one of the ever made guitars');

INSERT INTO customers(first_name, last_name, active) VALUES ('Edward', 'Something', TRUE);

INSERT INTO orders(customer_id, ordered_at) VALUES (1, 'NOW()');

INSERT INTO order_products(order_id, product_id) VALUES (1,1)
Update

The UPDATE command is used to update a set of attributes from a row. Note that is VERY IMPORTANT to use the WHERE on an update query, it is that clause that is going to indicate what rows should be updated with the desired values. If an update command is used without using a WHERE condition, all rows of the selected table are going to be affected and updated to have the desired values.

-- Only updates the product with ID 1 to have the new name and description
UPDATE products SET (name, descrption) = ('Gourmet Deluxe Premium Water Bottle', 'Totally not a scam') WHERE id = 1;

-- Updates all products to have the new name and description
UPDATE products SET (name, descrption) = ('Gourmet Deluxe Premium Water Bottle', 'Totally not a scam');
Delete

The DELETE command is used to delete a row from a table. The same logic from the Update command applies here – always use that command with a WHERE condition, otherwise all rows of the selected table are going to be deleted.

-- Deletes the product with id 1
DELETE FROM products WHERE id = 1;
-- Deletes all products
DELETE FROM products;

Data Query Language – DQL

Data Query Language are SQL commands used to retrieve data from our database, combining them to filter, aggregate, join and organize the data according to our needs:

Select

The SELECT command is used to read data from a table:

SELECT * FROM orders;
SELECT id, customer_id FROM orders;
Clauses

Along with the select command, we can use functions called CLAUSES, which gives us the ability to better organize the data that we want to see:

Where

The WHERE command applies a set of conditions to the query, retrieving only the rows that met them:

SELECT name, description FROM products WHERE name = 'Snickers';
-- Returns name and description from products with name 'Snickers' and id 5'
SELECT name, description FROM products WHERE name = 'Snickers' AND id = 5;
-- Returns name and description from products with name 'Snickers' or name 'Twix'
SELECT name, description FROM products WHERE name = 'Snickers' OR name = 'Twix';
-- Returns name and description from products that has a description.
SELECT name, description FROM products WHERE description IS NOT NULL;
Limit

The LIMIT command limits the amount of rows returned by the query:

SELECT * FROM products LIMIT 20;
-- OFFSET skips an amount of rows limited by the query. So supposing that we have 40 registers, OFFSET 20 will skip 20 rows, showing from row 20 to row 40.
SELECT * FROM products LIMIT 20 OFFSET 20;
Order by

The ORDER BY command orders rows by a specified attribute in the ascendant (ASC) or descendant (DESC) order.

SELECT * FROM products ORDER BY name ASC;
SELECT * FROM products ORDER BY name DESC;
Group by

The GROUP BY command groups rows based on an attribute. GROUP BY can only be used along with an aggregate function:

-- Returns the total amount of reserved products grouped by their ids.
SELECT SUM(quantity) FROM order_products GROUP BY product_id;
Distinct

The DISTINCT command selects unique rows based on a query:

SELECT DISTINCT name FROM products;
Aggregates

Aggregate functions are used to calculate a result from a set of input values.

Count

The COUNT command calculates how many rows a query result has.

SELECT COUNT(id) FROM products;
Sum

The SUM command calculates the sum of a given attribute (or subquery result) from a query result:

SELECT SUM(quantity) FROM order_products WHERE order_id = 5;
Min

The MIN command calculates the lowest value of a given attribute from a query result.

SELECT MIN(quantity) FROM order_products;
Max

The MAX command calculates the highest value of a given attribute from a query result.

SELECT MAX(quantity) FROM order_products WHERE product_id = 2;
Avg

The AVG command calculates the general average of a given attribute from a query result.

SELECT AVG(quantity) FROM order_products WHERE product_id = 11;
Array_agg

The ARRAY_AGG command receives values and aggregate them into an array.

SELECT active, ARRAY_AGG(first_name || ' ' || last_name) FROM customers GROUP BY active;
Joins

JOIN operations are used to join two or more tables and retrieve desired data involving them based on a common value. These operations are separated into different types:

Inner join:

INNER JOIN performs an intersection between two tables, based on a common value between them. For instance, imagine that we want to see the first and last names of the customers that has ordered a specific product, from what order that product is, and how much of it was ordered:

SELECT customers.id, first_name, last_name, products.name AS "product_name", order_products.quantity AS "order_products_quantity", order_products.order_id FROM customers
INNER JOIN orders ON orders.customer_id = customers.id
INNER JOIN order_products ON order_products.order_id = orders.id
INNER JOIN products ON products.id = order_products.product_id;
Left join / Left outer join

LEFT JOINS includes the query results from the left table and the matching rows from the right table. If there is no match, the result will include the left table row with null values for columns from the right table. For instance, imagine we want to see all customers and their orders if any exist.

SELECT customers.id, customers.first_name, customers.last_name, orders.id
FROM customers
LEFT JOIN orders ON orders.customer_id = customers.id;
Right join / Right outer join

RIGHT JOINS is the Left joins opposite – it includes the query results from the right table and the matching rows from the left table. If there is no match, the result will include the right table row with null values for columns from the left table. For instance, imagine that we want to see all orders and their customers if any exist:

SELECT orders.id, orders.ordered_at, customers.first_name, customers.last_name
FROM orders
RIGHT JOIN customers ON orders.customer_id = customers.id;
Full join / Full outer join

FULL JOIN or Full outer join includes all the data from both tables when there are data that correspond to at least one of them. For instance, imagine that we want to list all the customers and orders, showing their association when there is one:

SELECT customers.id, customers.first_name, customers.last_name, orders.id AS order_id, orders.ordered_at FROM customers FULL JOIN orders ON customers.id = orders.customer_id;
Cross join

CROSS JOIN includes all the combinations or possibilities of associations between two tables. For instance, imagine that we want to see all the possibilities between customers and products:

SELECT customers.id, customers.first_name, customers.last_name, products.id AS product_id
FROM customers CROSS JOIN products;
Union and Union All

UNION Combines the result from two or more queries into a unique set of results, while UNION ALL combines the result from two or more queries into a set of results that may have duplicate rows. These commands are useful when we want to retrieve data from columns in common from two or more tables, with or without duplicate rows. For instance, imagine that we had an additional table called ‘vip_customers’ that has a similar schema from the customers table, and we wanted to get the first and last names from both regular and VIP customers from the database:

-- Result doesn't have duplicate rows
SELECT first_name, last_name FROM vip_customers;
UNION 
SELECT first_name, last_name FROM customers;
------------------
-- Result have duplicate rows
SELECT first_name, last_name FROM vip_customers;
UNION ALL
SELECT first_name, last_name FROM customers;

To summarize and show a visual understanding of the JOIN commands, this image shows what every join does:

Transactions

Transactions are a mechanism that ensures a sequence of queries is executed as a single block of instructions. Every Transaction is defined by 4 properties which are the ACID properties:

Atomicity

Every operation made within the transaction must either be completed successfully, committing these changes to the database, or not completed successfully, failing to make these changes and not making any changes to the database.

Consistency

Every change committed to the database must be valid, conforming to the database’s constraints.

Isolation

Concurrent transactions must not affect each other.

Durability

Every data that is written by a successful transaction must be recorded in persistent storage.

Transactions have specific commands to help dealing with these properties:

BEGIN

Starts the transaction, grouping the SQL commands sequence.

COMMIT

Applies (permanently) the changes made by the SQL commands.

ROLLBACK

Undo the changes made during the transaction execution.

SAVEPOINT

Create checkpoints inside a transaction, allowing a partial rollback when needed. This is often used before executing a critical SQL command on the database.

ROLLBACK TO SAVEPOINT

Execute a partial rollback, reverting to a specific savepoint.

Ok so, a lot of keywords and different commands right? To make it more clear: transactions, at the engine level, behave as a state machine – it transitions from an initial to an end state according to the results generated by the instructions within. These commands build a workflow, and it can be illustrated as follows:

If it fails, no change is done to the database, and if it commits, the change is permanently done to the database.

Examples

Here, we have an example that creates a new product, a new order, and an association between them. If any of these commands fail, the transaction will throw an error and will not commit the changes to the database. This ensures that the product, order, and their association are properly created in the specified order, avoiding the risk of creating an order_product without an existing product or order, for example.

BEGIN;
    -- The WITH statement is used to create temporary result sets (or auxiliar queries) also known as Common Table Expressions.
    -- These result sets can be queried and reused through the main query. For more details, see
    -- https://www.postgresql.org/docs/17/queries-with.html
  WITH inserted_product_id AS (
    INSERT INTO products (name, description)
    VALUES ('white monster', 'good') 
    RETURNING id
  ),
  inserted_order_id AS (
    INSERT INTO orders(customer_id, ordered_at) 
    VALUES (5, NOW()) 
    RETURNING id
  )
  INSERT INTO order_products(order_id, product_id, quantity) 
    VALUES ((SELECT id FROM inserted_order_id), (SELECT id FROM inserted_product_id), 3);
COMMIT;

This transaction creates a new product and updates its name, but since we called ROLLBACK, neither of these changes is committed to the database, so no changes are persisted in the products table.

BEGIN;
  WITH inserted_product_id AS (
    INSERT INTO products (name, description) VALUES ('ruffles', 'good') RETURNING id
  )
  UPDATE products SET name = 'dr. pepper' WHERE id = (SELECT id FROM inserted_product_id);
ROLLBACK;

And finally, this transaction creates a new product and a SAVEPOINT. It then updates the newly created product with a new description, but we roll back to the SAVEPOINT created before that update operation. As a result, only the new product insertion is committed, while the update command is ignored since we rolled back to a point within the transaction where the update had not yet been executed.

BEGIN;
  -- For simplicity sake the update query is not updating a record with a specific id,
    -- due to the fact that is not possible to access a CTE after a SAVEPOINT within
    -- a transaction. Nonetheless, the execution of that query will be rolled back so it's not 
    -- a problem.
  INSERT INTO products (name, description) VALUES ('watermellon juice', 'good') RETURNING id;
  SAVEPOINT insert_product;
  UPDATE products SET description = 'bad' WHERE id IN(SELECT id FROM products WHERE name = 'watermellon juice');
  ROLLBACK TO SAVEPOINT insert_product;
COMMIT;

Indexes

Indexes are data structures that allow the database system to find rows that have a specified value for an attribute more easily without scanning all of the rows within a table. To simplify, it is possible to visualize this index as a book index, where we search for a subject or a chapter and find the page that we have to search for, instead of going through every page of the book. The indexes are not stored inside the tables, they are a separate structure that makes the read performance faster.

When an index is added, every time a new register with the indexed attributes is added to the indexed table, that new register will be inserted in the index. That’s why spamming an index on a table that has a lot of write operations might not be a good idea because that operation will become slower at some point.

The EXPLAIN command can show if an index it’s being used on a query:

  • EXPLAIN will show the execution plan.
  • EXPLAIN_ANALYZE will show the execution plan, execute the query and show how much time it took to execute. (Beware on using with update and delete queries)

Data Control Language – DCL

Data Control Language are SQL commands responsible for managing permissions within the database. These commands create Authorization constraints.

GRANT

The Grant command is used to grant permissions to a user or a role:

GRANT SELECT ON TABLE products TO john_doe;
GRANT SELECT ON TABLE products TO admin_users;
REVOKE

The REVOKE command is used to revoke permissions from a user or a role.

REVOKE SELECT ON TABLE products TO john_doe;
REVOKE SELECT ON TABLE products TO admin_users;

Views

Views are saved/stored queries that can be used as virtual tables. They don’t store data physically but represent a query result. Often used to simplify access to complex queries and to make access to specific columns or lines more restricted.

Materialized Views

A materialized view it’s a physical table that stores a query’s result, usually complex and heavy to execute. These views don’t update the result automatically after the data has changed. To do so, the REFRESH command is used.

The difference between a traditional view and a materialized view is that a materialized view stores the data physically, while a traditional view offers a virtual view of the data.

Functions

Functions are reusable pieces of code that perform a specific logic or task. They allow logic to be stored in the database, and executed from SQL statements. Often used to modularize and reuse business logic within the database.

Triggers

A trigger is a statement that the system executes automatically as a side effect (or event) of a modification to the database, such as views, tables, etc. Often used to audit data access and modification.

References

https://www.amazon.com.br/Database-Concepts-Sudarshan-Abraham-Silberschatz/dp/9390727502/
https://www.postgresql.org/docs/17/
https://www.solarwinds.com/resources/it-glossary/sql-database
https://use-the-index-luke.com/


Our previous post: Setting the backstage: Rack and Sinatra

This post is part of our ‘The Miners’ Guide to Code Crafting’ series, designed to help aspiring developers learn and grow. Stay tuned for more and continue your coding journey with us!! Check out the full summary here!

We want to work with you. Check out our "What We Do" section!