Databse 101 - MySQL Replication, Database Operation & Concept
this course is taken from “School of SRE” repository
Relational Databases
Prerequisites
- Complete DB 101
MySQL Replication
Replication enables data from one MySQL host (termed as Primary) to be copied to another MySQL host (termed as Replica). MySQL Replication is asynchronous in nature by default, but it can be changed to semi-synchronous with some configurations.
Some common applications of MySQL replication are:-
- Read-scaling - as multiple hosts can replicate the data from a single primary host, we can set up as many replicas as we need and scale reads through them, i.e. application writes will go to a single primary host and the reads can balance between all the replicas that are there. Such a setup can improve the write performance as well, as the primary is dedicated to only updates and not reads.
- Backups using replicas - the backup process can sometimes be a little heavy. But if we have replicas configured, then we can use one of them to get the backup without affecting the primary data at all.
- Disaster Recovery - a replica in some other geographical region paves a proper path to configure disaster recovery.
MySQL supports different types of synchronizations as well:-
- Asynchronous - this is the default synchronization method. It is one-way, i.e. one host serves as primary and one or more hosts as replica. We will discuss this method throughout the replication topic.
- Semi-Synchronous - in this type of synchronization, a commit performed on the primary host is blocked until at least one replica acknowledges it. Post the acknowledgement from any one replica, the control is returned to the session that performed the transaction. This ensures strong consistency but the replication is slower than asynchronous.
- Delayed - we can deliberately lag the replica in a typical MySQL replication by the number of seconds desired by the use case. This type of replication safeguards from severe human errors of dropping or corrupting the data on the primary, for example, in the above diagram for Delayed Replication, if a DROP DATABASE is executed by mistake on the primary, we still have 30 minutes to recover the data from R2 as that command has not been replicated on R2 yet.
Pre-Requisites
Before we dive into setting up replication, we should know about the binary logs. Binary logs play a very important role in MySQL replication. Binary logs, or commonly known as binlogs contain events about the changes done to the database, like table structure changes, data changes via DML operations, etc. They are not used to log SELECT statements. For replication, the primary sends the information to the replicas using its binlogs about the changes done to the database, and the replicas make the same data changes.
With respect to MySQL replication, the binary log format can be of two types that decides the main type of replication:-
- Statement-Based Replication or SBR
- Row-Based Replication or RBR
Statement Based Binlog Format
Originally, the replication in MySQL was based on SQL statements getting replicated and executed on the replica from the primary. This is called statement based logging. The binlog contains the exact SQL statement run by the session.
So If we run the above statements to insert 3 records and the update 3 in a single update statement, they will be logged exactly the same as when we executed them.
Row Based Binlog Format
The Row based is the default one in the latest MySQL releases. This is a lot different from the Statement format as here, row events are logged instead of statements. By that we mean, in the above example one update statement affected 3 records, but binlog had only one update statement; if it is a row based format, binlog will have an event for each record updated.
Statement Based v/s Row Based binlogs
Let’s have a look at the operational differences between statement-based and row-based binlogs.
Statement Based | Row Based |
---|---|
Logs SQL statements as executed | Logs row events based on SQL statements executed |
Takes lesser disk space | Takes more disk space |
Restoring using binlogs is faster | Restoring using binlogs is slower |
When used for replication, if any statement has a predefined function that has its own value, like sysdate(), uuid() etc, the output could be different on the replica which makes it inconsistent. | Whatever is executed becomes a row event with values, so there will be no problem if such functions are used in SQL statements. |
Only statements are logged so no other row events are generated. | A lot of events are generated when a table is copied into another using INSERT INTO SELECT. |
Note - There is another type of binlog format called Mixed. With mixed logging, statement based is used by default but it switches to row based in certain cases. If MySQL cannot guarantee that statement based logging is safe for the statements executed, it issues a warning and switches to row based for those statements.
We will be using binary log format as Row for the entire replication topic.
Replication in Motion
The above figure indicates how a typical MySQL replication works.
- Replica_IO_Thread is responsible to fetch the binlog events from the primary binary logs to the replica
- On the Replica host, relay logs are created which are exact copies of the binary logs. If the binary logs on primary are in row format, the relay logs will be the same.
- Replica_SQL_Thread applies the relay logs on the replica MySQL server.
- If log-bin is enabled on the replica, then the replica will have its own binary logs as well. If log-slave-updates is enabled, then it will have the updates from the primary logged in the binlogs as well.
Setting up Replication
In this section, we will set up a simple asynchronous replication. The binlogs will be in row based format. The replication will be set up on two fresh hosts with no prior data present. There are two different ways in which we can set up replication.
- Binlog based - Each replica keeps a record of the binlog coordinates on the primary - current binlog and position in the binlog till where it has read and processed. So, at a time different replicas might be reading different parts of the same binlog.
- GTID based - Every transaction gets an identifier called global transaction identifier or GTID. There is no need to keep the record of binlog coordinates, as long as the replica has all the GTIDs executed on the primary, it is consistent with the primary. A typical GTID is the server_uuid:# positive integer.
We will set up a GTID based replication in the following section but will also discuss binlog based replication setup as well.
Primary Host Configurations
The following config parameters should be present in the primary my.cnf file for setting up GTID based replication.
server-id - a unique ID for the mysql server
log-bin - the binlog location
binlog-format - ROW | STATEMENT (we will use ROW)
gtid-mode - ON
enforce-gtid-consistency - ON (allows execution of only those statements which can be logged using GTIDs)
Replica Host Configurations
The following config parameters should be present in the replica my.cnf file for setting up replication.
server-id - different than the primary host
log-bin - (optional, if you want replica to log its own changes as well)
binlog-format - depends on the above
gtid-mode - ON
enforce-gtid-consistency - ON
log-slave-updates - ON (if binlog is enabled, then we can enable this. This enables the replica to log the changes coming from the primary along with its own changes. Helps in setting up chain replication)
Replication User
Every replica connects to the primary using a mysql user for replicating. So there must be a mysql user account for the same on the primary host. Any user can be used for this purpose provided it has REPLICATION SLAVE privilege. If the sole purpose is replication then we can have a user with only the required privilege.
On the primary host
mysql> create user repl_user@<replica_IP> identified by 'xxxxx';
mysql> grant replication slave on *.* to repl_user@'<replica_IP>';
Obtaining Starting position from Primary
Run the following command on the primary host
mysql> show master status\G
*************************** 1. row ***************************
File: mysql-bin.000001
Position: 73
Binlog_Do_DB:
Binlog_Ignore_DB:
Executed_Gtid_Set: e17d0920-d00e-11eb-a3e6-000d3aa00f87:1-3
1 row in set (0.00 sec)
If we are working with binary log based replication, the top two output lines are the most important ones. That tells the current binlog on the primary host and till what position it has written. For fresh hosts we know that no data is written so we can directly set up replication using the very first binlog file and position 4. If we are setting up a replication from a backup, then that changes the way we obtain the starting position. For GTIDs, the executed_gtid_set is the value where primary is right now. Again, for a fresh setup, we don’t have to specify anything about the starting point and it will start from the transaction id 1, but when we set up from a backup, the backup will contain the GTID positions till where backup has been taken.
Setting up Replica
The replication setup must know about the primary host, the user and password to connect, the binlog coordinates (for binlog based replication) or the GTID auto-position parameter. The following command is used for setting up
change master to
master_host = '<primary host IP>',
master_port = <primary host port - default=3306>,
master_user = 'repl_user',
master_password = 'xxxxx',
master_auto_position = 1;
Note - the Change Master To command has been replaced with Change Replication Source To from Mysql 8.0.23 onwards, also all the master and slave keywords are replaced with source and replica.
If it is binlog based replication, then instead of master_auto_position, we need to specify the binlog coordinates.
master_log_file = 'mysql-bin.000001',
master_log_pos = 4
Starting Replication and Check Status
Now that everything is configured, we just need to start the replication on the replica via the following command
start slave;
OR from MySQL 8.0.23 onwards,
start replica;
Whether or not the replication is running successfully, we can determine by running the following command
show slave status\G
OR from MySQL 8.0.23 onwards,
show replica status\G
mysql> show replica status\G
*************************** 1. row ***************************
Replica_IO_State: Waiting for master to send event
Source_Host: <primary IP>
Source_User: repl_user
Source_Port: <primary port>
Connect_Retry: 60
Source_Log_File: mysql-bin.000001
Read_Source_Log_Pos: 852
Relay_Log_File: mysql-relay-bin.000002
Relay_Log_Pos: 1067
Relay_Source_Log_File: mysql-bin.000001
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Source_Log_Pos: 852
Relay_Log_Space: 1283
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Source_SSL_Allowed: No
Source_SSL_CA_File:
Source_SSL_CA_Path:
Source_SSL_Cert:
Source_SSL_Cipher:
Source_SSL_Key:
Seconds_Behind_Source: 0
Source_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Source_Server_Id: 1
Source_UUID: e17d0920-d00e-11eb-a3e6-000d3aa00f87
Source_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Replica_SQL_Running_State: Slave has read all relay log; waiting for more updates
Source_Retry_Count: 86400
Source_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Source_SSL_Crl:
Source_SSL_Crlpath:
Retrieved_Gtid_Set: e17d0920-d00e-11eb-a3e6-000d3aa00f87:1-3
Executed_Gtid_Set: e17d0920-d00e-11eb-a3e6-000d3aa00f87:1-3
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Source_TLS_Version:
Source_public_key_path:
Get_Source_public_key: 0
Network_Namespace:
1 row in set (0.00 sec)
Some of the parameters are explained below:-
- Relay_Source_Log_File - the primary’s file where replica is currently reading from
- Execute_Source_Log_Pos - for the above file on which position is the replica reading currently from. These two parameters are of utmost importance when binlog based replication is used.
- Replica_IO_Running - IO thread of replica is running or not
- Replica_SQL_Running - SQL thread of replica is running or not
- Seconds_Behind_Source - the difference of seconds when a statement was executed on Primary and then on Replica. This indicates how much replication lag is there.
- Source_UUID - the uuid of the primary host
- Retrieved_Gtid_Set - the GTIDs fetched from the primary host by the replica to be executed.
- Executed_Gtid_Set - the GTIDs executed on the replica. This set remains the same for the entire cluster if the replicas are in sync.
- Auto_Position - it directs the replica to fetch the next GTID automatically
Create a Replica for the already setup cluster
The steps discussed in the previous section talks about the setting up replication on two fresh hosts. When we have to set up a replica for a host which is already serving applications, then the backup of the primary is used, either fresh backup taken for the replica (should only be done if the traffic it is serving is less) or use a recently taken backup.
If the size of the databases on the MySQL primary server is small, less than 100G recommended, then mysqldump can be used to take backup along with the following options.
mysqldump -uroot -p -hhost_ip -P3306 --all-databases --single-transaction --master-data=1 > primary_host.bkp
--single-transaction
- this option starts a transaction before taking the backup which ensures it is consistent. As transactions are isolated from each other, so no other writes affect the backup.--master-data
- this option is required if binlog based replication is desired to be set up. It includes the binary log file and log file position in the backup file.
When GTID mode is enabled and mysqldump is executed, it includes the GTID executed to be used to start the replica after the backup position. The contents of the mysqldump output file will have the following
It is recommended to comment these before restoring otherwise they could throw errors. Also, using master-data=2 will automatically comment the master_log_file line.
Similarly, when taking backup of the host using xtrabackup, the file xtrabckup_info file contains the information about binlog file and file position, as well as the GTID executed set.
server_version = 8.0.25
start_time = 2021-06-22 03:45:17
end_time = 2021-06-22 03:45:20
lock_time = 0
binlog_pos = filename 'mysql-bin.000007', position '196', GTID of the last change 'e17d0920-d00e-11eb-a3e6-000d3aa00f87:1-5'
innodb_from_lsn = 0
innodb_to_lsn = 18153149
partial = N
incremental = N
format = file
compressed = N
encrypted = N
Now, after setting MySQL server on the desired host, restore the backup taken from any one of the above methods. If the intended way is binlog based replication, then use the binlog file and position info in the following command
change Replication Source to
source_host = ‘primary_ip’,
source_port = 3306,
source_user = ‘repl_user’,
source_password = ‘xxxxx’,
source_log_file = ‘mysql-bin.000007’,
source_log_pos = ‘196’;
If the replication needs to be set via GITDs, then run the below command to tell the replica about the GTIDs already executed. On the Replica host, run th following commands
reset master;
set global gtid_purged = ‘e17d0920-d00e-11eb-a3e6-000d3aa00f87:1-5’
change replication source to
source_host = ‘primary_ip’,
source_port = 3306,
source_user = ‘repl_user’,
source_password = ‘xxxxx’,
source_auto_position = 1
The reset master command resets the position of the binary log to initial. It can be skipped if the host is a freshly installed MySQL, but we restored a backup so it is necessary. The gtid_purged global variable lets the replica know the GTIDs that have already been executed, so that the replication can start after that. Then in the change source command, we set the auto-position to 1 which automatically gets the next GTID to proceed.
Further Reading
Database Operations & Concepts
LAB
Prepare the LAB database and populate to the mysql database
First create the environment
PREPARE To create a Mysql database please run the fallowing commands (you can just click below section)
docker run --rm --name mysql-db -p 3306:3306 -e MYSQL_ROOT_PASSWORD=mypassword -d mysql
Populate sample data for the LAB
following will download some sample data and create the database
git clone https://github.com/datacharmer/test_db.git
docker cp test_db mysql-db:/home
docker exec -it mysql-db bash
cd /home/test_db/
mysql -uroot -pmypassword < /home/test_db/employees.sql
mysql -uroot -pmypassword
now you are inside the MYSQL database inside the container. You also have a database prepared for you “EMPLOYEES”
SELECT Query
The most commonly used command while working with MySQL is SELECT. It is used to fetch the result set from one or more tables. The general form of a typical select query looks like:-
SELECT expr
FROM table1
[WHERE condition]
[GROUP BY column_list HAVING condition]
[ORDER BY column_list ASC|DESC]
[LIMIT #]
The above general form contains some commonly used clauses of a SELECT query:-
- expr - comma-separated column list or * (for all columns)
- WHERE - a condition is provided, if true, directs the query to select only those records.
- GROUP BY - groups the entire result set based on the column list provided. An aggregate function is recommended to be present in the select expression of the query. HAVING supports grouping by putting a condition on the selected or any other aggregate function.
- ORDER BY - sorts the result set based on the column list in ascending or descending order.
- LIMIT - commonly used to limit the number of records.
Let’s have a look at some examples for a better understanding of the above. The dataset used for the examples below is available here and is free to use.
Select all records
select * from employees limit 5;
mysql> select * from employees limit 5;
+--------+------------+------------+-----------+--------+------------+
| emp_no | birth_date | first_name | last_name | gender | hire_date |
+--------+------------+------------+-----------+--------+------------+
| 10001 | 1953-09-02 | Georgi | Facello | M | 1986-06-26 |
| 10002 | 1964-06-02 | Bezalel | Simmel | F | 1985-11-21 |
| 10003 | 1959-12-03 | Parto | Bamford | M | 1986-08-28 |
| 10004 | 1954-05-01 | Chirstian | Koblick | M | 1986-12-01 |
| 10005 | 1955-01-21 | Kyoichi | Maliniak | M | 1989-09-12 |
+--------+------------+------------+-----------+--------+------------+
5 rows in set (0.00 sec)
Select specific fields for all records
select first_name, last_name, gender from employees limit 5;
mysql> select first_name, last_name, gender from employees limit 5;
+------------+-----------+--------+
| first_name | last_name | gender |
+------------+-----------+--------+
| Georgi | Facello | M |
| Bezalel | Simmel | F |
| Parto | Bamford | M |
| Chirstian | Koblick | M |
| Kyoichi | Maliniak | M |
+------------+-----------+--------+
5 rows in set (0.00 sec)
Select all records Where hire_date >= January 1, 1990
select * from employees where hire_date >= '1990-01-01' limit 5;
mysql> select * from employees where hire_date >= '1990-01-01' limit 5;
+--------+------------+------------+-------------+--------+------------+
| emp_no | birth_date | first_name | last_name | gender | hire_date |
+--------+------------+------------+-------------+--------+------------+
| 10008 | 1958-02-19 | Saniya | Kalloufi | M | 1994-09-15 |
| 10011 | 1953-11-07 | Mary | Sluis | F | 1990-01-22 |
| 10012 | 1960-10-04 | Patricio | Bridgland | M | 1992-12-18 |
| 10016 | 1961-05-02 | Kazuhito | Cappelletti | M | 1995-01-27 |
| 10017 | 1958-07-06 | Cristinel | Bouloucos | F | 1993-08-03 |
+--------+------------+------------+-------------+--------+------------+
5 rows in set (0.01 sec)
Select first_name and last_name from all records Where birth_date >= 1960 AND gender = ‘F’
select first_name, last_name from employees where year(birth_date) >= 1960 and gender='F' limit 5;
mysql> select first_name, last_name from employees where year(birth_date) >= 1960 and gender='F' limit 5;
+------------+-----------+
| first_name | last_name |
+------------+-----------+
| Bezalel | Simmel |
| Duangkaew | Piveteau |
| Divier | Reistad |
| Jeong | Reistad |
| Mingsen | Casley |
+------------+-----------+
5 rows in set (0.00 sec)
Display the total number of records
select count(*) from employees;
mysql> select count(*) from employees;
+----------+
| count(*) |
+----------+
| 300024 |
+----------+
1 row in set (0.05 sec)
Display gender-wise count of all records
select gender, count(*) from employees group by gender;
mysql> select gender, count(*) from employees group by gender;
+--------+----------+
| gender | count(*) |
+--------+----------+
| M | 179973 |
| F | 120051 |
+--------+----------+
2 rows in set (0.14 sec)
Display the year of hire_date and number of employees hired that year, also only those years where more than 20k employees were hired
select year(hire_date), count(*) from employees group by year(hire_date) having count(*) > 20000;
mysql> select year(hire_date), count(*) from employees group by year(hire_date) having count(*) > 20000;
+-----------------+----------+
| year(hire_date) | count(*) |
+-----------------+----------+
| 1985 | 35316 |
| 1986 | 36150 |
| 1987 | 33501 |
| 1988 | 31436 |
| 1989 | 28394 |
| 1990 | 25610 |
| 1991 | 22568 |
| 1992 | 20402 |
+-----------------+----------+
8 rows in set (0.14 sec)
Display all records ordered by their hire_date in descending order. If hire_date is the same then in order of their birth_date ascending order
select * from employees order by hire_date desc, birth_date asc limit 5;
mysql> select * from employees order by hire_date desc, birth_date asc limit 5;
+--------+------------+------------+-----------+--------+------------+
| emp_no | birth_date | first_name | last_name | gender | hire_date |
+--------+------------+------------+-----------+--------+------------+
| 463807 | 1964-06-12 | Bikash | Covnot | M | 2000-01-28 |
| 428377 | 1957-05-09 | Yucai | Gerlach | M | 2000-01-23 |
| 499553 | 1954-05-06 | Hideyuki | Delgrande | F | 2000-01-22 |
| 222965 | 1959-08-07 | Volkmar | Perko | F | 2000-01-13 |
| 47291 | 1960-09-09 | Ulf | Flexer | M | 2000-01-12 |
+--------+------------+------------+-----------+--------+------------+
5 rows in set (0.12 sec)
SELECT - JOINS
JOIN statement is used to produce a combined result set from two or more tables based on certain conditions. It can be also used with Update and Delete statements but we will be focussing on the select query. Following is a basic general form for joins
SELECT table1.col1, table2.col1, ... (any combination)
FROM
table1 <join_type> table2
ON (or USING depends on join_type) table1.column_for_joining = table2.column_for_joining
WHERE …
Any number of columns can be selected, but it is recommended to select only those which are relevant to increase the readability of the resultset. All other clauses like where, group by are not mandatory. Let’s discuss the types of JOINs supported by MySQL Syntax.
Inner Join
This joins table A with table B on a condition. Only the records where the condition is True are selected in the resultset.
Display some details of employees along with their salary
select e.emp_no,e.first_name,e.last_name,s.salary from employees e join salaries s on e.emp_no=s.emp_no limit 5;
mysql> select e.emp_no,e.first_name,e.last_name,s.salary from employees e join salaries s on e.emp_no=s.emp_no limit 5;
+--------+------------+-----------+--------+
| emp_no | first_name | last_name | salary |
+--------+------------+-----------+--------+
| 10001 | Georgi | Facello | 60117 |
| 10001 | Georgi | Facello | 62102 |
| 10001 | Georgi | Facello | 66074 |
| 10001 | Georgi | Facello | 66596 |
| 10001 | Georgi | Facello | 66961 |
+--------+------------+-----------+--------+
5 rows in set (0.00 sec)
Similar result can be achieved by
select e.emp_no,e.first_name,e.last_name,s.salary from employees e join salaries s using (emp_no) limit 5;
mysql> select e.emp_no,e.first_name,e.last_name,s.salary from employees e join salaries s using (emp_no) limit 5;
+--------+------------+-----------+--------+
| emp_no | first_name | last_name | salary |
+--------+------------+-----------+--------+
| 10001 | Georgi | Facello | 60117 |
| 10001 | Georgi | Facello | 62102 |
| 10001 | Georgi | Facello | 66074 |
| 10001 | Georgi | Facello | 66596 |
| 10001 | Georgi | Facello | 66961 |
+--------+------------+-----------+--------+
5 rows in set (0.00 sec)
And also by
select e.emp_no,e.first_name,e.last_name,s.salary from employees e natural join salaries s limit 5;
mysql> select e.emp_no,e.first_name,e.last_name,s.salary from employees e natural join salaries s limit 5;
+--------+------------+-----------+--------+
| emp_no | first_name | last_name | salary |
+--------+------------+-----------+--------+
| 10001 | Georgi | Facello | 60117 |
| 10001 | Georgi | Facello | 62102 |
| 10001 | Georgi | Facello | 66074 |
| 10001 | Georgi | Facello | 66596 |
| 10001 | Georgi | Facello | 66961 |
+--------+------------+-----------+--------+
5 rows in set (0.00 sec)
Outer Join
Majorly of two types:-
- LEFT - joining complete table A with table B on a condition. All the records from table A are selected, but from table B, only those records are selected where the condition is True.
- RIGHT - Exact opposite of the left join.
Let us assume the below tables for understanding left join better.
select * from dummy1;
mysql> select * from dummy1;
+----------+------------+
| same_col | diff_col_1 |
+----------+------------+
| 1 | A |
| 2 | B |
| 3 | C |
+----------+------------+
```.term1
select * from dummy2;
mysql> select * from dummy2;
+----------+------------+
| same_col | diff_col_2 |
+----------+------------+
| 1 | X |
| 3 | Y |
+----------+------------+
A simple select join will look like the one below.
select * from dummy1 d1 left join dummy2 d2 on d1.same_col=d2.same_col;
mysql> select * from dummy1 d1 left join dummy2 d2 on d1.same_col=d2.same_col;
+----------+------------+----------+------------+
| same_col | diff_col_1 | same_col | diff_col_2 |
+----------+------------+----------+------------+
| 1 | A | 1 | X |
| 3 | C | 3 | Y |
| 2 | B | NULL | NULL |
+----------+------------+----------+------------+
3 rows in set (0.00 sec)
Which can also be written as
select * from dummy1 d1 left join dummy2 d2 using(same_col);
mysql> select * from dummy1 d1 left join dummy2 d2 using(same_col);
+----------+------------+------------+
| same_col | diff_col_1 | diff_col_2 |
+----------+------------+------------+
| 1 | A | X |
| 3 | C | Y |
| 2 | B | NULL |
+----------+------------+------------+
3 rows in set (0.00 sec)
And also as
select * from dummy1 d1 natural left join dummy2 d2;
mysql> select * from dummy1 d1 natural left join dummy2 d2;
+----------+------------+------------+
| same_col | diff_col_1 | diff_col_2 |
+----------+------------+------------+
| 1 | A | X |
| 3 | C | Y |
| 2 | B | NULL |
+----------+------------+------------+
3 rows in set (0.00 sec)
Cross Join
This does a cross product of table A and table B without any condition. It doesn’t have a lot of applications in the real world.
A Simple Cross Join looks like this
select * from dummy1 cross join dummy2;
mysql> select * from dummy1 cross join dummy2;
+----------+------------+----------+------------+
| same_col | diff_col_1 | same_col | diff_col_2 |
+----------+------------+----------+------------+
| 1 | A | 3 | Y |
| 1 | A | 1 | X |
| 2 | B | 3 | Y |
| 2 | B | 1 | X |
| 3 | C | 3 | Y |
| 3 | C | 1 | X |
+----------+------------+----------+------------+
6 rows in set (0.01 sec)
One use case that can come in handy is when you have to fill in some missing entries. For example, all the entries from dummy1 must be inserted into a similar table dummy3, with each record must have 3 entries with statuses 1, 5 and 7.
desc dummy3;
mysql> desc dummy3;
+----------+----------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------+----------+------+-----+---------+-------+
| same_col | int | YES | | NULL | |
| value | char(15) | YES | | NULL | |
| status | smallint | YES | | NULL | |
+----------+----------+------+-----+---------+-------+
3 rows in set (0.02 sec)
Either you create an insert query script with as many entries as in dummy1 or use cross join to produce the required resultset.
select * from dummy1 cross join (select 1 union select 5 union select 7) T2 order by same_col;
mysql> select * from dummy1
cross join
(select 1 union select 5 union select 7) T2
order by same_col;
+----------+------------+---+
| same_col | diff_col_1 | 1 |
+----------+------------+---+
| 1 | A | 1 |
| 1 | A | 5 |
| 1 | A | 7 |
| 2 | B | 1 |
| 2 | B | 5 |
| 2 | B | 7 |
| 3 | C | 1 |
| 3 | C | 5 |
| 3 | C | 7 |
+----------+------------+---+
9 rows in set (0.00 sec)
The T2 section in the above query is called a sub-query. We will discuss the same in the next section.
Natural Join
This implicitly selects the common column from table A and table B and performs an inner join.
select e.emp_no,e.first_name,e.last_name,s.salary from employees e natural join salaries s limit 5;
mysql> select e.emp_no,e.first_name,e.last_name,s.salary from employees e natural join salaries s limit 5;
+--------+------------+-----------+--------+
| emp_no | first_name | last_name | salary |
+--------+------------+-----------+--------+
| 10001 | Georgi | Facello | 60117 |
| 10001 | Georgi | Facello | 62102 |
| 10001 | Georgi | Facello | 66074 |
| 10001 | Georgi | Facello | 66596 |
| 10001 | Georgi | Facello | 66961 |
+--------+------------+-----------+--------+
5 rows in set (0.00 sec)
Notice how natural join and using takes care that the common column is displayed only once if you are not explicitly selecting columns for the query.
Some More Examples
Display emp_no, salary, title and dept of the employees where salary > 80000
select e.emp_no, s.salary, t.title, d.dept_no
from
employees e
join salaries s using (emp_no)
join titles t using (emp_no)
join dept_emp d using (emp_no)
where s.salary > 80000
limit 5;
mysql> select e.emp_no, s.salary, t.title, d.dept_no
from
employees e
join salaries s using (emp_no)
join titles t using (emp_no)
join dept_emp d using (emp_no)
where s.salary > 80000
limit 5;
+--------+--------+--------------+---------+
| emp_no | salary | title | dept_no |
+--------+--------+--------------+---------+
| 10017 | 82163 | Senior Staff | d001 |
| 10017 | 86157 | Senior Staff | d001 |
| 10017 | 89619 | Senior Staff | d001 |
| 10017 | 91985 | Senior Staff | d001 |
| 10017 | 96122 | Senior Staff | d001 |
+--------+--------+--------------+---------+
5 rows in set (0.00 sec)
Display title-wise count of employees in each department order by dept_no
select d.dept_no, t.title, count(*)
from titles t
left join dept_emp d using (emp_no)
group by d.dept_no, t.title
order by d.dept_no
limit 10;
mysql> select d.dept_no, t.title, count(*)
from titles t
left join dept_emp d using (emp_no)
group by d.dept_no, t.title
order by d.dept_no
limit 10;
+---------+--------------------+----------+
| dept_no | title | count(*) |
+---------+--------------------+----------+
| d001 | Manager | 2 |
| d001 | Senior Staff | 13940 |
| d001 | Staff | 16196 |
| d002 | Manager | 2 |
| d002 | Senior Staff | 12139 |
| d002 | Staff | 13929 |
| d003 | Manager | 2 |
| d003 | Senior Staff | 12274 |
| d003 | Staff | 14342 |
| d004 | Assistant Engineer | 6445 |
+---------+--------------------+----------+
10 rows in set (1.32 sec)
SELECT - Subquery
A subquery is generally a smaller resultset that can be used to power a select query in many ways. It can be used in a ‘where’ condition, can be used in place of join mostly where a join could be an overkill. These subqueries are also termed as derived tables. They must have a table alias in the select query.
Let’s look at some examples of subqueries.
Here we got the department name from the departments table by a subquery which used dept_no from dept_emp table.
select e.emp_no,
(select dept_name from departments where dept_no=d.dept_no) dept_name from employees e
join dept_emp d using (emp_no)
limit 5;
mysql> select e.emp_no,
(select dept_name from departments where dept_no=d.dept_no) dept_name from employees e
join dept_emp d using (emp_no)
limit 5;
+--------+-----------------+
| emp_no | dept_name |
+--------+-----------------+
| 10001 | Development |
| 10002 | Sales |
| 10003 | Production |
| 10004 | Production |
| 10005 | Human Resources |
+--------+-----------------+
5 rows in set (0.01 sec)
Here, we used the ‘avg’ query above (which got the avg salary) as a subquery to list the employees whose latest salary is more than the average.
select avg(salary) from salaries;
mysql> select avg(salary) from salaries;
+-------------+
| avg(salary) |
+-------------+
| 63810.7448 |
+-------------+
1 row in set (0.80 sec)
select e.emp_no, max(s.salary)
from employees e
natural join salaries s
group by e.emp_no
having max(s.salary) > (select avg(salary) from salaries)
limit 10;
mysql> select e.emp_no, max(s.salary)
from employees e
natural join salaries s
group by e.emp_no
having max(s.salary) > (select avg(salary) from salaries)
limit 10;
+--------+---------------+
| emp_no | max(s.salary) |
+--------+---------------+
| 10001 | 88958 |
| 10002 | 72527 |
| 10004 | 74057 |
| 10005 | 94692 |
| 10007 | 88070 |
| 10009 | 94443 |
| 10010 | 80324 |
| 10013 | 68901 |
| 10016 | 77935 |
| 10017 | 99651 |
+--------+---------------+
10 rows in set (0.56 sec)
Query Performance Improvement
Query Performance is a very crucial aspect of relational databases. If not tuned correctly, the select queries can become slow and painful for the application, and for the MySQL server as well. The important task is to identify the slow queries and try to improve their performance by either rewriting them or creating proper indexes on the tables involved in it.
The Slow Query Log
The slow query log contains SQL statements that take a longer time to execute then set in the config parameter long_query_time. These queries are the candidates for optimization. There are some good utilities to summarize the slow query logs like, mysqldumpslow (provided by MySQL itself), pt-query-digest (provided by Percona), etc. Following are the config parameters that are used to enable and effectively catch slow queries
Variable | Explanation | Example value |
---|---|---|
slow_query_log | Enables or disables slow query logs | ON |
slow_query_log_file | The location of the slow query log | /var/lib/mysql/mysql-slow.log |
long_query_time | Threshold time. The query that takes longer than this time is logged in slow query log | 5 |
log_queries_not_using_indexes | When enabled with the slow query log, the queries which do not make use of any index are also logged in the slow query log even though they take less time than long_query_time. | ON |
So, for this section, we will be enabling slow_query_log, long_query_time will be kept to 0.3 (300 ms), and log_queries_not_using index will be enabled as well.
Below are the queries that we will execute on the employees database.
- select * from employees where last_name = ‘Koblick’;
- select * from salaries where salary >= 100000;
- select * from titles where title = ‘Manager’;
- select * from employees where year(hire_date) = 1995;
- select year(e.hire_date), max(s.salary) from employees e join salaries s on e.emp_no=s.emp_no group by year(e.hire_date);
Now, queries 1, 3 and 4 executed under 300 ms but if we check the slow query logs, we will find these queries logged as they are not using any of the index. Queries 2 and 5 are taking longer than 300ms and also not using any index.
Use the following command to get the summary of the slow query log
mysqldumpslow /var/lib/mysql/mysql-slow.log
There are some more queries in the snapshot that were along with the queries mentioned. Mysqldumpslow replaces actual values that were used by N (in case of numbers) and S (in case of strings). That can be overridden by -a
option, however that will increase the output lines if different values are used in similar queries.
The EXPLAIN Plan
The EXPLAIN command is used with any query that we want to analyze. It describes the query execution plan, how MySQL sees and executes the query. EXPLAIN works with Select, Insert, Update and Delete statements. It tells about different aspects of the query like, how tables are joined, indexes used or not, etc. The important thing here is to understand the basic Explain plan output of a query to determine its performance.
Let’s take the following query as an example,
explain select * from salaries where salary = 100000;
mysql> explain select * from salaries where salary = 100000;
+----+-------------+----------+------------+------+---------------+------+---------+------+---------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+------+---------------+------+---------+------+---------+----------+-------------+
| 1 | SIMPLE | salaries | NULL | ALL | NULL | NULL | NULL | NULL | 2838426 | 10.00 | Using where |
+----+-------------+----------+------------+------+---------------+------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
The key aspects to understand in the above output are:-
- Partitions - the number of partitions considered while executing the query. It is only valid if the table is partitioned.
- Possible_keys - the list of indexes that were considered during creation of the execution plan.
- Key - the index that will be used while executing the query.
- Rows - the number of rows examined during the execution.
- Filtered - the percentage of rows that were filtered out of the rows examined. The maximum and most optimized result will have 100 in this field.
- Extra - this tells some extra information on how MySQL evaluates, whether the query is using only where clause to match target rows, any index or temporary table, etc.
So, for the above query, we can determine that there are no partitions, there are no candidate indexes to be used and so no index is used at all, over 2M rows are examined and only 10% of them are included in the result, and lastly, only a where clause is used to match the target rows.
Creating an Index
Indexes are used to speed up selecting relevant rows for a given column value. Without an index, MySQL starts with the first row and goes through the entire table to find matching rows. If the table has too many rows, the operation becomes costly. With indexes, MySQL determines the position to start looking for the data without reading the full table.
A primary key is also an index which is also the fastest and is stored along with the table data. Secondary indexes are stored outside of the table data and are used to further enhance the performance of SQL statements. Indexes are mostly stored as B-Trees, with some exceptions like spatial indexes use R-Trees and memory tables use hash indexes.
There are 2 ways to create indexes:-
- While creating a table - if we know beforehand the columns that will drive the most number of where clauses in select queries, then we can put an index over them while creating a table.
- Altering a Table - To improve the performance of a troubling query, we create an index on a table which already has data in it using ALTER or CREATE INDEX command. This operation does not block the table but might take some time to complete depending on the size of the table.
Let’s look at the query that we discussed in the previous section. It’s clear that scanning over 2M records is not a good idea when only 10% of those records are actually in the resultset.
Hence, we create an index on the salary column of the salaries table.
create index idx_salary on salaries(salary)
OR
alter table salaries add index idx_salary(salary)
And the same explain plan now looks like this
explain select * from salaries where salary = 100000;
mysql> explain select * from salaries where salary = 100000;
+----+-------------+----------+------------+------+---------------+------------+---------+-------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+------+---------------+------------+---------+-------+------+----------+-------+
| 1 | SIMPLE | salaries | NULL | ref | idx_salary | idx_salary | 4 | const | 13 | 100.00 | NULL |
+----+-------------+----------+------------+------+---------------+------------+---------+-------+------+----------+-------+
1 row in set, 1 warning (0.00 sec)
Now the index used is idx_salary, the one we recently created. The index actually helped examine only 13 records and all of them are in the resultset. Also, the query execution time is also reduced from over 700ms to almost negligible.
Let’s look at another example. Here we are searching for a specific combination of first_name and last_name. But, we might also search based on last_name only.
explain select * from employees where last_name = 'Dredge' and first_name = 'Yinghua';
mysql> explain select * from employees where last_name = 'Dredge' and first_name = 'Yinghua';
+----+-------------+-----------+------------+------+---------------+------+---------+------+--------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+------+---------------+------+---------+------+--------+----------+-------------+
| 1 | SIMPLE | employees | NULL | ALL | NULL | NULL | NULL | NULL | 299468 | 1.00 | Using where |
+----+-------------+-----------+------------+------+---------------+------+---------+------+--------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
Now only 1% record out of almost 300K is the resultset. Although the query time is particularly quick as we have only 300K records, this will be a pain if the number of records are over millions. In this case, we create an index on last_name and first_name, not separately, but a composite index including both the columns.
create index idx_last_first on employees(last_name, first_name)
and now see the difference
explain select * from employees where last_name = 'Dredge' and first_name = 'Yinghua';
mysql> explain select * from employees where last_name = 'Dredge' and first_name = 'Yinghua';
+----+-------------+-----------+------------+------+----------------+----------------+---------+-------------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+------+----------------+----------------+---------+-------------+------+----------+-------+
| 1 | SIMPLE | employees | NULL | ref | idx_last_first | idx_last_first | 124 | const,const | 1 | 100.00 | NULL |
+----+-------------+-----------+------------+------+----------------+----------------+---------+-------------+------+----------+-------+
1 row in set, 1 warning (0.00 sec)
We chose to put last_name before first_name while creating the index as the optimizer starts from the leftmost prefix of the index while evaluating the query. For example, if we have a 3-column index like idx(c1, c2, c3), then the search capability of the index follows - (c1), (c1, c2) or (c1, c2, c3) i.e. if your where clause has only first_name this index won’t work.
explain select * from employees where first_name = 'Yinghua';
mysql> explain select * from employees where first_name = 'Yinghua';
+----+-------------+-----------+------------+------+---------------+------+---------+------+--------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+------+---------------+------+---------+------+--------+----------+-------------+
| 1 | SIMPLE | employees | NULL | ALL | NULL | NULL | NULL | NULL | 299468 | 10.00 | Using where |
+----+-------------+-----------+------------+------+---------------+------+---------+------+--------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
But, if you have only the last_name in the where clause, it will work as expected.
explain select * from employees where last_name = 'Dredge';
mysql> explain select * from employees where last_name = 'Dredge';
+----+-------------+-----------+------------+------+----------------+----------------+---------+-------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+------+----------------+----------------+---------+-------+------+----------+-------+
| 1 | SIMPLE | employees | NULL | ref | idx_last_first | idx_last_first | 66 | const | 200 | 100.00 | NULL |
+----+-------------+-----------+------------+------+----------------+----------------+---------+-------+------+----------+-------+
1 row in set, 1 warning (0.00 sec)
For another example, use the following queries:-
create table employees_2 like employees;
create table salaries_2 like salaries;
alter table salaries_2 drop primary key;
We made copies of employees and salaries tables without the Primary Key of salaries table to understand an example of Select with Join.
When you have queries like the below, it becomes tricky to identify the pain point of the query.
mysql> select e.first_name, e.last_name, s.salary, e.hire_date from employees_2 e join salaries_2 s on e.emp_no=s.emp_no where e.last_name='Dredge';
1860 rows in set (4.44 sec)
This query is taking about 4.5 seconds to complete with 1860 rows in the resultset. Let’s look at the Explain plan. There will be 2 records in the Explain plan as 2 tables are used in the query.
mysql> explain select e.first_name, e.last_name, s.salary, e.hire_date from employees_2 e join salaries_2 s on e.emp_no=s.emp_no where e.last_name='Dredge';
+----+-------------+-------+------------+--------+------------------------+---------+---------+--------------------+---------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+------------------------+---------+---------+--------------------+---------+----------+-------------+
| 1 | SIMPLE | s | NULL | ALL | NULL | NULL | NULL | NULL | 2837194 | 100.00 | NULL |
| 1 | SIMPLE | e | NULL | eq_ref | PRIMARY,idx_last_first | PRIMARY | 4 | employees.s.emp_no | 1 | 5.00 | Using where |
+----+-------------+-------+------------+--------+------------------------+---------+---------+--------------------+---------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)
These are in order of evaluation i.e. salaries_2 will be evaluated first and then employees_2 will be joined to it. As it looks like, it scans almost all the rows of salaries_2 table and tries to match the employees_2 rows as per the join condition. Though where clause is used in fetching the final resultset, but the index corresponding to the where clause is not used for the employees_2 table.
If the join is done on two indexes which have the same data-types, it will always be faster. So, let’s create an index on the emp_no column of salaries_2 table and analyze the query again.
create index idx_empno on salaries_2(emp_no);
mysql> explain select e.first_name, e.last_name, s.salary, e.hire_date from employees_2 e join salaries_2 s on e.emp_no=s.emp_no where e.last_name='Dredge';
+----+-------------+-------+------------+------+------------------------+----------------+---------+--------------------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+------------------------+----------------+---------+--------------------+------+----------+-------+
| 1 | SIMPLE | e | NULL | ref | PRIMARY,idx_last_first | idx_last_first | 66 | const | 200 | 100.00 | NULL |
| 1 | SIMPLE | s | NULL | ref | idx_empno | idx_empno | 4 | employees.e.emp_no | 9 | 100.00 | NULL |
+----+-------------+-------+------------+------+------------------------+----------------+---------+--------------------+------+----------+-------+
2 rows in set, 1 warning (0.00 sec)
Now, not only did the index help the optimizer to examine only a few rows in both tables, it reversed the order of the tables in evaluation. The employees_2 table is evaluated first and rows are selected as per the index respective to the where clause. Then the records are joined to salaries_2 table as per the index used due to the join condition. The execution time of the query came down from 4.5s to 0.02s.
mysql> select e.first_name, e.last_name, s.salary, e.hire_date from employees_2 e join salaries_2 s on e.emp_no=s.emp_no where e.last_name='Dredge'\G
1860 rows in set (0.02 sec)
Further reading
- More practice with online resources like this one
- Normalization
- Routines, triggers
- Views
- Transaction isolation levels
- Sharding
- Setting up HA, monitoring, backups
Short quiz to repeat the topic
After selecting the correct options, press the Gonder button. You can see the results true and in false colours.
Which one is not a MySQL replication?
- ( ) Read-scaling
- ( ) Asynchronous
- (x) Stand By replication
- ( ) Semi-Synchronous
Which one can not be used for mysql backup?
- ( ) mysqldump
- (x) tar
- ( ) xtrabackup
- ( ) mysql