Tuesday, April 10, 2012

Global Transaction Identifiers are in MySQL 5.6.5 DMR

Global Transaction Identifiers are in!


I am very happy and especially proud to announce that the replication team has delivered global transaction identifiers to MySQL 5.6.5 Development Milestone Release (DMR). It is a very useful, big, impressive and game-changing feature that will make life easier for many of our users. With this feature in place, it is much simpler to track replication progress through the replication topology thus it removes part of the burden of deploying and administering complex multi-tier replication topologies. Actually, as stated before, this feature is an enabler, as it gives the user so much more flexibility when it comes to deploying replication and tracking the data that is replicated. In particular, it is a foundation for reliable and automated fail/switch over with slave promotion, reducing the need for 3rd party HA infrastructure (which adds cost and complexity). In fact, my good colleague Chuck has already written a couple of utilities that build on global transaction identifiers and implement automated fail/switch over. You should have a look at that as well!

Having said that, lets have a look at what is under the hood.

Structure, Life Cycle and Tracking of GTIDs

It has been discussed before what a global transaction identifier(GTID) is. Just to recap, a global transaction identifier is a tuple (SID, GNO). SID is normally the SERVER_UUID and GNO is a sequence number (1 for the first transaction committed on SID, 2 for the second, and so on). Basically, a GTID is a logical identifier that maps into physical coordinates (log file name, file offset). Physical coordinates are likely to be different at each different server. In contrast, global transaction identifiers are not.

Now, about the identifier lifecycle… When a transaction is executed for the first time on the master, the master assigns it a GTID. Since SID is the server's UUID, it remains constant for transactions executed on that server. GNO on the other hand, is generated automatically when the transaction commits (it is the smallest number not yet used as GNO for any other transaction with the same SID). GTIDs are persisted in the binary log as a new log event type that holds the actual identifier. The event is called Gtid_log_event. As such, when the group of events for a given transaction is to be written to the binary log, a new Gtid_log_event is also written, preceding the group.

Once the GTID is in the binary log, the identifier flows seamlessly through the replication stream. The Gtid_log_event is read and sent by the dump thread, received and stored in the slave's relay log by the IO thread and read from the relay log by the SQL thread. Nothing new here. However, when the SQL thread executes the transaction, it does not generate a new identifier. Instead, it preserves the same identifier, relaying it to its own binary log. Thus, the server ensures that a replayed transaction gets the same GTID it was assigned on the master. In fact, two properties hold:

  1. No transaction is re-executed more than once. (Therefore, a before applying a transaction a server checks that it has not applied it before. If it has, it skips it). 
  2. Two different transactions cannot have the same GTID.  
All in all, the server must not and will not execute a transaction if that transaction GTID already exists in the binary log, or if some other, concurrent client is executing a transaction with the same identifier. Ultimately, this means that the server keeps a record of which set of transactions it has seen/executed. This is mostly useful in fail-over scenarios in which the DBA does not need to calculate himself from which point in the replication stream the slave should pick up (when redirected to a new master). Since slaves know what they have processed, they can auto-position themselves in the replication stream.

To sum up, the current GTIDs implementation is comprised of two major blocks:
  1. The transaction identifier. It serves a simple and yet very powerful purpose. It uniquely identifies a set of events. 
  2. The state machine that keeps track of which transactions a server has seen. It is a key part of the procedure of switching slaves to a new master as well as keeping the data of a slave consistent by preventing undesirable re-execution of transactions. 

CHANGE MASTER Made Easy (Fail-over facilitator)

In a GTIDs enabled topology the master-slave handshake is slightly different from what it has been until now. When the slave connects to a new master, the slave will tell the master which GTIDs it has in its relay or binary log. Therefore, the master can pick which transactions the slave is missing and send only those to the slave. This is all automatic and means that the slave does not need to know any non-local data. Before GTIDs, the slave had to know positions in the master's binary log. That is not needed now. More importantly, the user does not have to know anything about the replication positions, thus no position (re)calculation has to be done when redirecting slaves to a new master. The user can just issue:  

CHANGE MASTER TO MASTER_HOST='...', MASTER_PORT=SOME_PORT, MASTER_USER='...', MASTER_AUTO_POSITION=1;
 
The new parameter, MASTER_AUTO_POSITION=1, should be used instead of the position parameters MASTER_LOG_FILE and MASTER_LOG_POS and tells the server to use GTIDs (i.e., the GTIDs protocol handshake between master and slave). Then, when the slave is started, both master and slave will automatically agree on which transactions the slave is missing and replication will resume from the correct point. Again, the DBA has only one thing to do while switching the slave to a new master: use MASTER_AUTO_POSITION=1 . Nothing else.

Restrictions and Design Changes

There are a few restrictions if the user wants to use this new feature, since some MySQL functionality is not compliant with GTIDs. I do not want to go into all the gory details, but here is a list of constructs that are automatically blocked by the server to ensure robust operation under GTIDs mode, and a brief explanation why:
  • Non-transactional updates, such as MyISAM. There are a few cases when using non-transactional tables might result in duplicate GTIDs throughout a replication chain. Either by having transactions including changes to non-transactional tables or by having master and slave with different engine types. 
  • CREATE TABLE … SELECT . In RBR, CREATE TABLE … SELECT is split into two transactional group of events: one for the CREATE TABLE and one for the row events. Thus, in a replication chain, both group of events could end up getting the same identifier. Since a server skips transactions that it has seen before, the row events would not be applied by a server further down the chain. 
  • [CREATE|DROP] TEMPORARY TABLE executed inside a transaction. CREATE TEMPORARY TABLE and DROP TEMPORARY table are special statements: they can be executed inside a transaction - there is no implicit commit - but they cannot be rolled back. This is similar to updates to non-transactional tables. There are scenarios in which replication could break and/or parts of transactions might be replayed twice while failing-over to a new master. I will skip the details here, as this is a somewhat convoluted example. 
Given the three cases above, to enable global transaction identifiers, one has to disable such offending statements, by starting the server with the switch:

  –disable-gtid-unsafe-statements

On the design changes, a note worth mentioning is the fact that the approach changed a bit since the last time I blogged. The current implementation leaves out the planned indexes that would map identifiers to physical coordinates. We are still looking into how to solve all issues around this.

Hands-on


To start making use of GTIDs, the entire replication infrastructure needs to be configured to use GTIDs. In addition, all servers should "speak" GTIDs, meaning that there should not be any transaction, without an identifier, still pending execution. Starting the server with GTIDs ON, requires four switches (two of which you already know from long before):
  • –log-bin
  • –log-slave-updates
  • –gtid-mode=ON
  • –disable-gtid-unsafe-statements
Obviously, the server needs the binary log turned ON (–log-bin). It also requires the –log-slave-updates ON, since an SQL thread must persist GTIDs data while relaying those events. Furthermore, it may be the case that a master will be demoted to a slave's role at some point in time. Turning this switch ON prevents the user from having to restart the server when that moment arrives.

There are also a couple of other two new options. One (–disable-gtid-unsafe-statements) was already explained on the "Restrictions" section. The last one, –gtid-mode, it is a simple switch that turns on the GTID feature.

That's it. Starting mysqld with these options gets us a server configured to use GTIDs. That can be checked that by issuing:


mysql> SHOW VARIABLES LIKE '%gtid_mode%';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| gtid_mode     | ON    |
+---------------+-------+
1 row in set (0,01 sec)
 
OK, great. Now, lets play a bit with it. Lets create a table:


mysql> use test;
Database changed
mysql> CREATE TABLE t1 (a INT);
Query OK, 0 rows affected (0,02 sec)

mysql> SHOW BINLOG EVENTS;
+-------------------+-----+----------------+-----------+-------------+-------------------------------------------------------------------+
| Log_name          | Pos | Event_type     | Server_id | End_log_pos | Info                                                              |
+-------------------+-----+----------------+-----------+-------------+-------------------------------------------------------------------+
| master-bin.000001 |   4 | Format_desc    |         1 |         117 | Server ver: 5.6.6-m8-debug-log, Binlog ver: 4                     |
| master-bin.000001 | 117 | Previous_gtids |         1 |         144 |                                                                   |
| master-bin.000001 | 144 | Gtid           |         1 |         188 | SET @@SESSION.GTID_NEXT= '4B2CBA63-8082-11E1-BE2D-F0DEF11A08B7:1' |
| master-bin.000001 | 188 | Query          |         1 |         281 | use `test`; CREATE TABLE t1 (a INT)                               |
+-------------------+-----+----------------+-----------+-------------+-------------------------------------------------------------------+

 We can notice a few new things in the binary log already. There is a Previous_gtids event, which I have not mentioned before to not clutter this blog entry. Lets just assume that it is an event that keeps track of transaction identifiers that existed in a set of binary log files previously purged. Anyway, the most interesting part for now is the Gtid event. It precedes the CREATE TABLE and assigns it the unique identifier:

  '4B2CBA63-8082-11E1-BE2D-F0DEF11A08B7:1'

Very good! Lets insert a value into the table:

mysql> INSERT INTO t1 VALUES (1);
Query OK, 1 row affected (0,01 sec)

This results in the following set of events to be logged into the binary log:

mysql> SHOW BINLOG EVENTS;
+-------------------+-----+----------------+-----------+-------------+-------------------------------------------------------------------+
| Log_name          | Pos | Event_type     | Server_id | End_log_pos | Info                                                              |
+-------------------+-----+----------------+-----------+-------------+-------------------------------------------------------------------+
| master-bin.000001 |   4 | Format_desc    |         1 |         117 | Server ver: 5.6.6-m8-debug-log, Binlog ver: 4                     |
| master-bin.000001 | 117 | Previous_gtids |         1 |         144 |                                                                   |
| master-bin.000001 | 144 | Gtid           |         1 |         188 | SET @@SESSION.GTID_NEXT= '4B2CBA63-8082-11E1-BE2D-F0DEF11A08B7:1' |
| master-bin.000001 | 188 | Query          |         1 |         281 | use `test`; CREATE TABLE t1 (a INT)                               |
| master-bin.000001 | 281 | Gtid           |         1 |         325 | SET @@SESSION.GTID_NEXT= '4B2CBA63-8082-11E1-BE2D-F0DEF11A08B7:2' |
| master-bin.000001 | 325 | Query          |         1 |         400 | BEGIN                                                             |
| master-bin.000001 | 400 | Query          |         1 |         495 | use `test`; INSERT INTO t1 VALUES (1)                             |
| master-bin.000001 | 495 | Xid            |         1 |         522 | COMMIT /* xid=11 */                                               |
+-------------------+-----+----------------+-----------+-------------+-------------------------------------------------------------------+
8 rows in set (0,00 sec)

We can see right away that another identifier was assigned to the new transaction. This time:'4B2CBA63-8082-11E1-BE2D-F0DEF11A08B7:2' .

Same SID, 4B2CBA63-8082-11E1-BE2D-F0DEF11A08B7, different GNO, 2.

Now, lets go over to another server and set it as slave of this one. One can do that by simply issuing (note: I am using root user just for demonstration purposes):

mysql> CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=13000, MASTER_USER='root', MASTER_AUTO_POSITION=1;
Query OK, 0 rows affected (0,00 sec)

Inspecting SHOW SLAVE STATUS, one can find that everything is setup:

mysql> SHOW SLAVE STATUS\G
(...)
                  Master_Host: 127.0.0.1
                  Master_User: root 
                  Master_Port: 13000
(...)
           Retrieved_Gtid_Set: 
            Executed_Gtid_Set: 
1 row in set (0,00 sec)

You can see here two new fields, Retrieved_Gtid_Set and Executed_Gtid_Set. The first one is the set of GTIDs pulled from the master. The second one is the set of GTIDs actually executed.

Continuing… Lets start the slave:

mysql> START SLAVE;
Query OK, 0 rows affected (0,00 sec)
And lets inspect the slave status again:
mysql> SHOW SLAVE STATUS\G
(...)
                  Master_Host: 127.0.0.1
                  Master_User: root
                  Master_Port: 13000
(...)
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
(...)
           Retrieved_Gtid_Set: 4B2CBA63-8082-11E1-BE2D-F0DEF11A08B7:1-2
            Executed_Gtid_Set: 4B2CBA63-8082-11E1-BE2D-F0DEF11A08B7:1-2
1 row in set (0,00 sec)

 Inspecting the RELAY log on the slave, one can find:

mysql> show relaylog events in 'slave-relay-bin.000002';
(...)
| slave-relay-bin.000002 |  345 | Gtid           |         1 |         188 | SET @@SESSION.GTID_NEXT= '4B2CBA63-8082-11E1-BE2D-F0DEF11A08B7:1' |
| slave-relay-bin.000002 |  389 | Query          |         1 |         281 | use `test`; CREATE TABLE t1 (a INT)                               |
| slave-relay-bin.000002 |  482 | Gtid           |         1 |         325 | SET @@SESSION.GTID_NEXT= '4B2CBA63-8082-11E1-BE2D-F0DEF11A08B7:2' |
| slave-relay-bin.000002 |  526 | Query          |         1 |         400 | BEGIN                                                             |
| slave-relay-bin.000002 |  601 | Query          |         1 |         495 | use `test`; INSERT INTO t1 VALUES (1)                             |
| slave-relay-bin.000002 |  696 | Xid            |         1 |         522 | COMMIT /* xid=11 */                                               |
+------------------------+------+----------------+-----------+-------------+-------------------------------------------------------------------+
13 rows in set (0,00 sec)

Looking into the slave's BINARY log, once can find:

mysql> SHOW BINLOG EVENTS;
+------------------+-----+----------------+-----------+-------------+-------------------------------------------------------------------+
| Log_name         | Pos | Event_type     | Server_id | End_log_pos | Info                                                              |
+------------------+-----+----------------+-----------+-------------+-------------------------------------------------------------------+
| slave-bin.000001 |   4 | Format_desc    |         2 |         117 | Server ver: 5.6.6-m8-debug-log, Binlog ver: 4                     |
| slave-bin.000001 | 117 | Previous_gtids |         2 |         144 |                                                                   |
| slave-bin.000001 | 144 | Gtid           |         1 |         188 | SET @@SESSION.GTID_NEXT= '4B2CBA63-8082-11E1-BE2D-F0DEF11A08B7:1' |
| slave-bin.000001 | 188 | Query          |         1 |         281 | use `test`; CREATE TABLE t1 (a INT)                               |
| slave-bin.000001 | 281 | Gtid           |         1 |         325 | SET @@SESSION.GTID_NEXT= '4B2CBA63-8082-11E1-BE2D-F0DEF11A08B7:2' |
| slave-bin.000001 | 325 | Query          |         1 |         400 | BEGIN                                                             |
| slave-bin.000001 | 400 | Query          |         1 |         495 | use `test`; INSERT INTO t1 VALUES (1)                             |
| slave-bin.000001 | 495 | Xid            |         1 |         522 | COMMIT /* xid=15 */                                               |
+------------------+-----+----------------+-----------+-------------+-------------------------------------------------------------------+
8 rows in set (0,00 sec)

Notice how the original identifiers were indeed preserved.

Getting back to the master, one interesting new variable to observe is GTID_DONE. This variable contains the set of logged transactions. One can query it and know which transactions were actually seen/executed by the server:

mysql> SELECT @@GLOBAL.GTID_DONE;
+------------------------------------------+
| @@GLOBAL.GTID_DONE                       |
+------------------------------------------+
| 4B2CBA63-8082-11E1-BE2D-F0DEF11A08B7:1-2 |
+------------------------------------------+
1 row in set (0,00 sec)

This is very cool. The same thing can be done on the slave side:

mysql> SELECT @@GLOBAL.GTID_DONE;
+------------------------------------------+
| @@GLOBAL.GTID_DONE                       |
+------------------------------------------+
| 4B2CBA63-8082-11E1-BE2D-F0DEF11A08B7:1-2 |
+------------------------------------------+
1 row in set (0,00 sec)

OK, this was just an appetizer! There are a few more variables and additional extensions either to the replication layer as well as to mysqlbinlog that are worth checking out! I will let you uncover those yourself with the help of the great online manual which already documents much of this great feature.

Summary

This post provides a very brief insight on Global Transaction Identifiers. It is a major replication feature that made it into MySQL 5.6.5 and one that is extremely important for people doing Highly Available systems based on MySQL replication.

Since it is a big feature and a big change in the replication behavior, I have only covered a small part of it in this post. It is just like a very brief introduction. Anyway, this post touches the very basic parts to get one going as it explains what a GTID and its life cycle is, how to activate it and how to connect servers using GTID protocol (and thus how easy and simple it has become to do a simple slave switchover to new master). In addition it also also mentions some of the restrictions one has to deal with when using this feature.

It concludes with a very short overview of how a server behaves when GTID is turned ON.

I cannot just end this post without referring the readers, again, to the very nice and extremely interesting blog post from Chuck. It presents two new MySQL utilities that automate a couple of the most complex replication administration tasks: switchover and fail over. Such utilities already build on global transaction identifiers and take much of the pain away while doing slave promotion or taking a master down for maintenance, and so forth… And, by the way, MySQL 5.6 is full of new, interesting and very useful MySQL replication features. You can find more details by going through the developer's zone replication article. Go and have a look.

I hope you have a great time trying this new feature out, but more importantly, that it fits nice and perfectly in your own use cases going forward.

Have fun!

Wednesday, October 5, 2011

Global Transaction Identifiers Feature Preview

The Case for Global Transaction Identifiers

"Global Transaction Identifiers" is a feature that has been requested every now and then. And it is not so much about what it actually is, but rather about what it enables MySQL users to do. Having a logical identifier associated with each transaction instead of a physical one (filename + offset), provides more flexibility and removes the burden of complex math from userland scripts. We have put out there an early access release (based on 5.6 codebase) of our ongoing effort to implement the global transaction identifiers and we would like some early feedback. Keep in mind that this is NOT something to use in production as it is in very early development stages. That said...

What exactly is a global transaction identifier?

A global identifier is a tag that pin-points a set of changes resulting from the execution of a transaction.

Why do we need global transaction identifiers?

If every transaction has its own universally unique identifier, it becomes a lot easier to follow changes through a complex replication stream. It is easier for us, humans, to visualize and understand what is going on, consequently, the algorithms we write for dealing with binary logs and replication tend to be far less convoluted.

In practice, what are the benefits of this all?
  • Fail-over: automation of fail-over suddenly becomes a lot easier. Instead of working through the physical coordinates to decide which slave is most up to date, with respect to the master it is going to replace, one can just compare global transaction identifiers of the last applied transactions. Slave promotion gets easier. However, the major benefit comes when switching over the slaves to the new master. They can reference a global transaction identifier and not have to convert binary log filenames and offsets between different servers.
  • Session consistency and Hierarchical replication: Offloading the master, through some hierarchical replication scheme, often works very well, especially if tied together with an intelligent load-balancer. Sometimes the load balancer sends read queries to a slave down in the hierarchy chain and at the same time it has to be sure that session consistency is guaranteed (updates on the master must have already been applied on the slave being queried). Making sure that the update has flowed all the way down to the desired slave without global transaction identifiers is laborious (one needs to climb down the hierarchy and follow the changes on every level in the hierarchy). On the other hand, with global transaction identifiers, one can just wait for the transaction with the desired identifier to be applied on a given slave and then run the query.
  • Enabler for multi-master update everywhere replication: This is a complex problem to solve, but the fact that transactions can be uniquely identified and distinguished from one another lays the ground for establishing (at least partial) order between them. This property is often important in such replication setups (even for dealing with conflict detection).
I am sure that there are more benefits, and that there is a whole bunch of interesting things one can do on top of global transaction identifiers... But I'll leave it up to you to think and decide how that would be useful in your own setup.

Designing a Global Transaction Identifier

The replication team has been working on several nice features that you must have noticed before (multi-threaded slave, row-based replication enhancements, ...). But now, we are adding global transaction identifiers to the list.


Global Transaction Identifier

The goal of this task is to augment MySQL binary log with Global Transaction Identifiers. Thus each transaction has an associated global identifier (GTID), which is essentially a pair:

GTID = <SID, GNO>

In practice a transaction is logged as a group of events in the binary log. Thus sometimes we refer to groups instead of transactions - I ended up use them intermixed in the text below. Actually, there are a few details about this, but to avoid risking excessive and unnecessary complexity in what I will describe below, I will just omit those.

The following describes more clearly each part of an GTID.

  • SID => currently it is a 128-bit number that identifies the server where the transaction/group of events was first committed. SID is normally the server UUID, but may be something different if a transaction is generated by something else other than than a regular MySQL Server. For example, for NDB, it identifies the Cluster.
  • GNO => is a 64-bit sequence number: 1 for the first changes logged on SID, 2 for the second changes, and so on. No change can have GNO 0.

Indexes and Relaying Transactions in the Replication Stream

In a typical MySQL replication setup there is one master server and a set of slave servers retrieving the changes from that one master and replaying those changes locally, against their own databases. Thus, GTIDs are added to transaction on the originating server - the master. As such, the following major changes have to be done on the originating side:

  1. Annotate existing binary log events with the GTID that they belong to... or create a new type of event that stores a GTID associated with a set of subsequent events in the replication stream.
  2. Create an index to quickly find out which are the physical coordinates that map into the logical identifiers, ie, the GTID. This makes looking up for which binary log file and at which offset a transaction is in, given a certain GTID, very easy and quick. It is especially useful for a dump thread, when it starts, to quickly find the physical position from which it should start reading and sending events to the slave.
In practice, this index, that maps GTIDs to binary log positions, has the form of a set of files, each file containing a sequence of transaction specifications. Each transaction specification has, among other fields, the following ones:
  • SID: The unique source identifier for this group of events/transaction.
  • GNO: the sequence number of the group of events/transaction.
  • LGID: This a local identifier like an auto-increment primary.
  • binlog file: name of binary log where this group is stored.
  • binlog pos: offset in binary log where this group starts.
  • binlog length: length of this group in binary log.
  • group end: true if this is the last set of events with GTID.
When a transaction commits, the master generates a GTID and atomically writes it to the binary log along with the events of that transaction. After that the in-memory group index data structures are updated. However, this data is asynchronously flushed to the index file. This requires that on server restart the recovery routine is extended so that it also runs a procedure to make sure that the index is properly setup and consistent.

Slaves relay changes from the master. This means that transactions that are replayed by a slave thread will keep the original GTID. Furthermore, slaves also maintain indexes to keep track of their relay logs, and its content is also flushed asynchronously. As in the master, on slave restart, a recovery routine is run to make sure that the indexes are consistent.

The current snapshot does not yet relay GTIDs through the replication protocol, so we do not get to see the identifiers flowing all the way to the slave. But we can inspect the master binary log and have a look at the identifiers...

Early Access: Exercising the Labs Snapshot.

We have uploaded a snapshot of our current work, to labs, which you can try out. It's buggy and it's incomplete, but it lays the ground to what we will be delivering in the future. So... how can we show off a bit of what we have done? Currently, we can issue a set of commands on the master and look into the resulting binary log to search for information regarding the new transaction identifiers. For instance, issuing the following commands on a server with binary log enabled:

shell> SET AUTOCOMMIT=0;
shell> CREATE TABLE t1 (a INT) Engine=InnoDB;
shell> INSERT INTO t1 VALUES (1);
shell> INSERT INTO t1 VALUES (2);
shell> INSERT INTO t1 VALUES (3);
shell> COMMIT;

Will get you an output very similar to the following one, when inspecting the binary log with the mysqlbinlog tool:
$ mysqlbinlog -v var/mysqld.1/data/master-bin.000001

(...)

# at 114
# Subgroup(#1, D5375118-EF7C-11E0-8C85-F0DEF11A08B7:1, END, COMMIT, binlog(no=0, pos=114, len=107, oals=0))
SET UGID_NEXT='D5375118-EF7C-11E0-8C85-F0DEF11A08B7:1', UGID_END=1, UGID_COMMIT=1/*!*/;
#111005 11:08:04 server id 1 end_log_pos 221 Query thread_id=1 exec_time=0 error_code=0
use test/*!*/;
SET TIMESTAMP=1317838084/*!*/;
SET @@session.pseudo_thread_id=1/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=0/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=8/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
CREATE TABLE t1 (a int) Engine=InnoDB
/*!*/;
# at 221
# Subgroup(#2, D5375118-EF7C-11E0-8C85-F0DEF11A08B7:2, END, COMMIT, binlog(no=0, pos=221, len=387, oals=27))
SET UGID_NEXT='D5375118-EF7C-11E0-8C85-F0DEF11A08B7:2', UGID_END=0, UGID_COMMIT=0/*!*/;
#111005 11:08:10 server id 1 end_log_pos 296 Query thread_id=1 exec_time=0 error_code=0
SET TIMESTAMP=1317838090/*!*/;
BEGIN
/*!*/;
# at 296
#111005 11:08:10 server id 1 end_log_pos 391 Query thread_id=1 exec_time=0 error_code=0
SET TIMESTAMP=1317838090/*!*/;
INSERT INTO t1 VALUES (1)
/*!*/;
# at 391
#111005 11:08:13 server id 1 end_log_pos 486 Query thread_id=1 exec_time=0 error_code=0
SET TIMESTAMP=1317838093/*!*/;
INSERT INTO t1 VALUES (2)
/*!*/;
# at 486
# Subgroup(#2, D5375118-EF7C-11E0-8C85-F0DEF11A08B7:2, END, COMMIT, binlog(no=0, pos=221, len=387, oals=27))
SET UGID_END=1, UGID_COMMIT=1/*!*/;
#111005 11:08:16 server id 1 end_log_pos 581 Query thread_id=1 exec_time=0 error_code=0
SET TIMESTAMP=1317838096/*!*/;
INSERT INTO t1 VALUES (3)
/*!*/;
# at 581
#111005 11:08:18 server id 1 end_log_pos 608 Xid = 11
COMMIT/*!*/;
SET UGID_NEXT='AUTOMATIC'/*!*/;
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;

In the output above one can find an additional line of metadata related to global transaction identifiers and a few new variables related to GTIDs. For now, lets just concentrate in finding the global transaction identifier. Looking at the line starting with "#Subgroup", one can find and '<UUID>:1' and '<UUID>:2'. These relate to the two transactional groups, the one that consists only of the DDL 'CREATE TABLE...' and the second group is the one that consists of the set of 'INSERT INTO...' statements that comprise the explicit transaction issued.

Now... we can filter out one of the transactions just by issuing (lets skip the create table):
$ mysqlbinlog -v --exclude-ugids=D5375118-EF7C-11E0-8C85-F0DEF11A08B7:1 var/mysqld.1/data/master-bin.000001
(...)
Or we could even not print identifiers at all:
$ mysqlbinlog -v --skip-ugids var/mysqld.1/data/master-bin.000001
(...)
There are a couple of more switches implemented in the mysqlbinlog tool that are useful to handle contents on the binary log, based on the global transaction identifier. But I'll leave it up to you to check that out.

Summary

This post provides some insights on the work the replication team is doing on designing and implementing global transaction ids. It gives a general overview of what the problem is and roughly what the solution is and how it is trying to solve a long standing requirement such as easier fail-over.

The good news are that we are not just designing anymore, we are already implementing it and you can even get a recent snapshot of this feature branch. Go... download it, look at the code, build the branch (or download a binary one), play a little bit with it. In the current implementation, global transaction ids are not yet part of the replication protocol, but you can see them by inspecting the master binary log, using mysqlbinlog tool.

Enjoy!

Tuesday, October 4, 2011

Multi-threaded slave: it's in!

I am very happy to share with you all that the much awaited feature request - multi-threaded slave (MTS) - has made it to the development tree targeting 5.6 release and is part of this latest 5.6 DMR release (5.6 DMR2).

I would like to take a moment to thank those of you that took the time to evaluate and provide invaluable feedback. Some started as early as the very first snapshot that was put out in the open. Part of this feedback is directly responsible for many changes, as outlined in my previous posts about MTS. Since the last snapshot, we kept on taking your suggestions seriously and improved the implementation by deploying a few of them in time for this release. Again, thanks!

What comes next for MTS? Well, it will continue to be debugged and get its overall quality improved.

What comes next for MySQL replication? You already know that we are working on the binlog group commit (by the way, you can get a hold of the second labs snapshot for that one). But, wait, there's more! We are also spending time and effort to implement another much awaited feature: Global Transaction Ids. I think you'll enjoy that one very much (there is also a very early access labs snapshot that you can check out - I'll blog about the details later on...).

Lots of interesting stuff going on. Stay tuned!

In the meantime, you can download 5.6 DMR2 and check out our labs releases.

Thursday, July 28, 2011

Update on the Multi-threaded Slave

A few months have passed since we announced our MySQL 5.6.2 DM based feature preview - Multi-threaded Slaves (MTS) - back in April. Since then, we have had some very good feedback from all around. You deserve a big thank you.

We have now a new snapshot available on labs.mysql.com. This snapshot includes:
  • Bug fixes. A lot of them, both in the server core and in the replication layer, especially at the MTS layer.
  • Sound recovery procedure. The recovery procedure, which runs on slave threads (re)start, has been improved significantly.
  • STOP SLAVE consistently. We have made STOP SLAVE to stop at a consistent point with respect to the master execution history. There were quite a few requests to implement this behavior. That said, one can still stop the SQL thread immediately if one KILLs the SQL thread.
  • Enhanced SBR replication. In multi-threaded mode, in particular, support for temporary tables replication was added.
  • Usability improvements. There is now a simplified set of configuration options which will make DBAs life easier.
  • Improved codebase. Following an effort on codebase clean up and related refactorings, the multi-threaded slave code is now easier to maintain.
Please, give the latest snapshot a spin and let us know how it goes!

These are exciting times... Not only MTS is coming, some very cool replication features are also on their way, such as binary log group commit and an API for inspecting binary logs:

http://www.oracle.com/us/corporate/press/439460

You can find these feature previews next to the MTS snapshot on labs.mysql.com as well.

Enjoy!

Wednesday, April 13, 2011

MySQL 5.6.2 DM: Binlog Informational Events

In some cases it would be really useful to have additional information in the binary log, mostly for debugging purposes. Starting with MySQL 5.6.2 DM release, the mysql replication event set was extended to include a new type of event that is basically used for transferring data around that does not actually changes state at the slaves. It can be safely ignored for the actual replication procedures, but it can be extremely relevant for conducting debugging…

Row-based Replication



Quite frequently, one finds someone out there requesting that in row based replication, the original statement is written into the binary log, alongside with the row event(s) it generated.

Starting with MySQL 5.6.2 DM, and by making use of the Informational Events facility, users can now do exactly that. While turning on the switch --binlog-rows-query-log-events (or activating it through the correspondent session variable) the mysql replication layer is instructed to write the statement to the binary log as a special event. These events do not require any extra processing from the slave(s), well… maybe just to relay them, and only if the user wants them to be relayed...

Lets have a look how it works! In the following example, the option to log statements is activated while logging in ROW, and then some rows are inserted into a table. This generates some row events entries in the binary log! As expected they show up when one issues the command 'SHOW BINLOG EVENTS'. In addition, the new event type Rows_query comes up in the output, showing the original query!

mysql> SET binlog_format=ROW;
Query OK, 0 rows affected (0.00 sec)

mysql> SET SESSION binlog_rows_query_log_events=ON;
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE TABLE t1 (a INT);
Query OK, 0 rows affected (0.07 sec)

mysql> INSERT INTO t1 VALUES (1), (2), (3);
Query OK, 3 rows affected (0.00 sec)
Records: 3 Duplicates: 0 Warnings: 0

mysql> SHOW BINLOG EVENTS;
+-------------------+-----+-------------+-----------+-------------+-----------------------------------------------+
| Log_name | Pos | Event_type | Server_id | End_log_pos | Info |
+-------------------+-----+-------------+-----------+-------------+-----------------------------------------------+
| master-bin.000001 | 4 | Format_desc | 1 | 114 | Server ver: 5.6.3-m5-debug-log, Binlog ver: 4 |
| master-bin.000001 | 114 | Query | 1 | 200 | use `test`; CREATE TABLE t1 (a INT) |
| master-bin.000001 | 200 | Query | 1 | 268 | BEGIN |
| master-bin.000001 | 268 | Rows_query | 1 | 323 | # INSERT INTO t1 VALUES (1), (2), (3) |
| master-bin.000001 | 323 | Table_map | 1 | 364 | table_id: 54 (test.t1) |
| master-bin.000001 | 364 | Write_rows | 1 | 408 | table_id: 54 flags: STMT_END_F |
| master-bin.000001 | 408 | Query | 1 | 477 | COMMIT |
+-------------------+-----+-------------+-----------+-------------+-----------------------------------------------+
7 rows in set (0.00 sec)


Actually, 'SHOW BINLOG EVENTS' is not the only way to display these events, one can also check them through the mysqlbinlog tool. There is a catch however, mysqlbinlog will only output the special event if one sets the verbosity level to 2 (-vv) or more.

shell> mysqlbinlog -vv data/master-bin.000001
(...)
# at 268
#110401 14:24:29 server id 1 end_log_pos 323 Rows_query
# INSERT INTO t1 VALUES (1), (2), (3)
# at 323
#110401 14:24:29 server id 1 end_log_pos 364 Table_map: `test`.`t1` mapped to number 54
# at 364
#110401 14:24:29 server id 1 end_log_pos 408 Write_rows: table id 54 flags: STMT_END_F
(...)
shell>


Summary



This blog post introduces the user to the new type of (informational) events that MySQL 5.6.2 DM provides. Furthermore, by making use of such type of events mysql replication can now log the original statements along with the rows it generated to the binary log. This is something that is asked every now and then, for a long time… So there you have it, you can find it in the MySQL 5.6.2 DM release.

Enjoy!

References

You can find detailed information about this feature in:

  • http://forge.mysql.com/worklog/task.php?id=4033

MySQL 5.6.2 DM: Optimized Row-based Replication Logging

Row based replication (RBR) is an amazing technology in MySQL replication. It has several advantages over statement based replication (SBR), where the lack of non-deterministic operations jumps to mind rather quickly. However, there is one drawback that drives some users away from it. RBR ships the changes over to the slave, instead of the operations as it happens in SBR. This means that RBR may exhibit a large binary log footprint for operations that make a rather big number of changes or operate over sizable rows. The problem boils down to the fact that for each row changed, one (sometimes even two) row image is inserted into the binary log, as part of a row event, and it contains all fields values (even those that are not part of the actual change). MySQL 5.6.2 DM release ships with a new feature that allows the user to tune logging in order to avoid this problem. Lets have a look at it!

Row events, Before and After images



In RBR, one row event contains changes to one or more rows in a given table. A row change, in its turn, may consist of one or two full row copies of the row that is being changed. These are generally known as row images. The first one, known as before image (BI), contains data as it was before the row was modified. The second one, known as after image (AI), is what the row looks like after the changes were done. Each image has different purpose: the BI is used for locating, in the storage engine, the row to be updated/deleted, while the AI, is used for replaying the actual changes. Needless to say, not both images are needed for every operation. Deletes only need the BI, inserts, on the other hand, just need the AI, while updates need both.

This is pretty neat stuff already, however, there is room for improvements, in particular to address storage overuse. Logging can be optimized in such a way that writing full images into the binary log can be avoided, and instead, log only those parts that are meaningful for replication. Not only does this help with reducing storage space, but also enables less network bandwidth consumption - smaller events are transmitted to the slaves - as well as smaller mysql memory footprint - less buffer usage for handling row events. So… How can one tackle this ?

Optimizing Before Image logging


The one and only usage for such image is to help finding the correct row to be updated or deleted at the slave's storage engine. Locating a row is done by considering one of the following search methods: search by primary key, by unique key, by an index scan or by doing a table scan - depending on what indexes are available on the slave's table. As such, assuming that the index structures are the same on master and slaves' tables, BIs can be logged based on the uniqueness of the index structures available. If there is a primary or unique key (with non-nullable columns), the fields that compose the key are enough to locate the row at the slave, thus BI can be composed of just those fields. On the other hand, if there is no unique key, all fields in the row need to be written into the binary log. Here is an example:

MASTER> CREATE TABLE t1 (c1 INT PRIMARY KEY, c2 char(1), c3 TEXT);
MASTER> INSERT INTO t1 VALUES (1, 'a', '1MB_text...');
MASTER> DELETE FROM t1 WHERE c1=1;

Logging the full BI in this case, means that each row deleted has a correspondent before image stored in the binary log sizing up to (4 + 1 + 1024*1024) bytes ~ 1MB. Deleting 10 rows in the table, easily grows the binary log to 10 MB (assuming that c3 is always 1MB large). Now, if only the primary key is logged, then for the same 10 rows, only 40 bytes are needed for storing the BI. Lets see how that plays out for the scenario above by trying out this feature on a real mysql session:

mysql> SET binlog_format=ROW;
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE TABLE t1 (c1 INT PRIMARY KEY, c2 char(1), c3 LONGTEXT);
Query OK, 0 rows affected (0.10 sec)

mysql> SHOW MASTER STATUS;
+-------------------+----------+--------------+------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+-------------------+----------+--------------+------------------+
| master-bin.000001 | 238 | | |
+-------------------+----------+--------------+------------------+
1 row in set (0.00 sec)

mysql> SET @text= repeat('a', 1048576);
Query OK, 0 rows affected (0.02 sec)

mysql> INSERT INTO t1 VALUES (1, 'a', @text);
Query OK, 1 row affected (0.04 sec)

mysql> SHOW MASTER STATUS;
+-------------------+----------+--------------+------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+-------------------+----------+--------------+------------------+
| master-bin.000001 | 1049037 | | |
+-------------------+----------+--------------+------------------+
1 row in set (0.00 sec)

mysql> RESET MASTER;
Query OK, 0 rows affected (0.30 sec)

mysql> DELETE FROM t1 WHERE c1=1;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1 Changed: 1 Warnings: 0

mysql> SHOW MASTER STATUS;
+-------------------+----------+--------------+------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+-------------------+----------+--------------+------------------+
| master-bin.000001 | 1048913 | | |
+-------------------+----------+--------------+------------------+
1 row in set (0.00 sec)

As one can see here, the DELETE operation resulted in a binary log entry that takes up to 1048913 bytes, although only the primary key would be needed to locate and delete the row at the slave. Now, if the server is set to log only minimal images, then it will instead log just a few bytes:
mysql> INSERT INTO t1 VALUES (1, 'a', @text);
Query OK, 1 row affected (0.04 sec)

mysql> RESET MASTER;
Query OK, 0 rows affected (0.37 sec)

mysql> SET SESSION binlog_row_image='MINIMAL';
Query OK, 0 rows affected (0.00 sec)

mysql> SHOW MASTER STATUS;
+-------------------+----------+--------------+------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+-------------------+----------+--------------+------------------+
| master-bin.000001 | 114 | | |
+-------------------+----------+--------------+------------------+
1 row in set (0.00 sec)

mysql> DELETE FROM t1 WHERE c1=1;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0

mysql> SHOW MASTER STATUS;
+-------------------+----------+--------------+------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+-------------------+----------+--------------+------------------+
| master-bin.000001 | 331 | | |
+-------------------+----------+--------------+------------------+
1 row in set (0.00 sec)
Looking at the output from the last 'SHOW MASTER STATUS', one finds that indeed the binlog size for the delete operation decreased from 1048913 to 331, that is 1MB less than when using a full row.


Optimizing the After Image Logging


Assuming that the table definitions are pretty much the same on both master and slave(s), the server can avoid logging the entire AI and log only the columns that were actually changed/inserted, much like in SBR. Lets look at an example:

MASTER> CREATE TABLE t1 (c1 INT PRIMARY KEY, c2 char(1), c3 CHAR(255) DEFAULT 'aaaaaaaaaaaaaaaaaaaa');
MASTER> INSERT INTO t1(c1,c2) VALUES (1, 'a');

When the server is set to log full rows, the insert operation will generate an AI containing c1, c2 and c3 values. This means that an additional 20 bytes ('aaaaaaaaaaaaaaaaaaaa') will be logged pointlessly. On the other hand, if the server is set to log only minimal rows, the extra 20 bytes are avoided. Lets check what happens with a real sample:

mysql> SET binlog_format=ROW;
Query OK, 0 rows affected (0.00 sec)

mysql> SET SESSION binlog_row_image='FULL';
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE TABLE t1 (c1 INT PRIMARY KEY, c2 char(1), c3 VARCHAR(1024) DEFAULT 'aaaaaaaaaaaaaaaaaaaa');
Query OK, 0 rows affected (0.10 sec)

mysql> RESET MASTER;
Query OK, 0 rows affected (0.37 sec)

mysql> SHOW MASTER STATUS;
+-------------------+----------+--------------+------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+-------------------+----------+--------------+------------------+
| master-bin.000001 | 114 | | |
+-------------------+----------+--------------+------------------+
1 row in set (0.00 sec)

mysql> INSERT INTO t1(c1,c2) VALUES (1, 'a');
Query OK, 1 row affected (0.00 sec)

mysql> SHOW MASTER STATUS;
+-------------------+----------+--------------+------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+-------------------+----------+--------------+------------------+
| master-bin.000001 | 356 | | |
+-------------------+----------+--------------+------------------+
1 row in set (0.00 sec)

We can find that the event logged, corresponding to the INSERT operation, grows the binary log up to 356 bytes. Setting the server to log only minimal images we get the following:

mysql> SET SESSION binlog_row_image='MINIMAL';
Query OK, 0 rows affected (0.00 sec)

mysql> DELETE FROM t1;
Query OK, 1 row affected (0.00 sec)

mysql> RESET MASTER;
Query OK, 0 rows affected (0.34 sec)

mysql> SHOW MASTER STATUS;
+-------------------+----------+--------------+------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+-------------------+----------+--------------+------------------+
| master-bin.000001 | 114 | | |
+-------------------+----------+--------------+------------------+
1 row in set (0.00 sec)

mysql> INSERT INTO t1(c1,c2) VALUES (1, 'a');
Query OK, 1 row affected (0.00 sec)

mysql> SHOW MASTER STATUS;
+-------------------+----------+--------------+------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+-------------------+----------+--------------+------------------+
| master-bin.000001 | 334 | | |
+-------------------+----------+--------------+------------------+
1 row in set (0.00 sec)


Again, looking at the last SHOW MASTER STATUS command, one finds that the event entry, related to the INSERT operation, grows the binary log to 334 instead of 356 bytes, i.e., 22 bytes less (2 bytes for c3 metadata + 20 bytes for c3 content).

The case for UPDATE operations



Updates generate pairs of images for each row changed. One needs the BI to locate the row and the AI to actually update it. Thus, this operation benefits from both optimizations that were described previously for the INSERT and DELETE operations. The following picture depicts what columns each image contains in previous MySQL versions and compares against 5.6.2 with partial rows logging enabled (assuming same queries and table definition as in the example shown before). One can easily see that in the green shaded box (5.6.2 DM binary log with partial logging turned on) BI and AI contain far less columns (the dark green boxes with "Cn" on them):




BLOBs special case



Apart from full and minimal rows, there is also a special option to optimize just the logging of blob fields, if setting binlog_row_image to 'NOBLOB';

In this case, replication will avoid logging blob fields whenever possible (BI: when they are not needed to locate the row, AI: when they are not modified). The other fields are still logged as if it the option binlog_row_image had been set to 'FULL'.

Summary


This post presents a new feature, available on MySQL 5.6.2 DM release, which enables more flexible logging of ROW events. It gives the user the freedom to choose whether the server should log full or partial rows, which is a major asset when either changing a large number of rows and/or changing sizable rows.

There are some assumptions when using this feature, for instance:
  • The tables on both master and slave should have the same index structures, or at least the index set on the master should be a subset of slave's index set. If not, unexpected behavior may happen;
  • The table definition should be the same with respect to default values, otherwise on INSERTs there may be different content inserted on the master and the slave (just like in SBR);
  • When setting binlog_row_image to 'MINIMAL' or 'NOBLOB', the logged rows are incomplete, therefore, if the user requires that the binary log contains full rows, then he should not use these options.
In general, if taking some care with indexes and default values and by knowing the workload the user can save a lot of storage space and network bandwidth by using these options to tweak RBR logging.

Go and try it out. Have fun!

References

Monday, April 11, 2011

Feature Preview: The Multi-Threaded Slave

The Need for Multi-Threaded Slaves!

The MySQL replication team has come up with a feature that looks very promising for improving slave scalability. Although there is some work that one can do on the master side, it is the work already done at the slave side that this blog post highlights...

At its core, MySQL replication is single-threaded! In detail, the tiniest unit of work for replication is an event, and a group of events forms a transaction. Events are pushed by the master to the slaves by a thread, known as "dump thread". At the slave, a reader ("IO thread") reads event-by-event and writes them to a local persistent queue, the "relay log". Then a single threaded applier, the "SQL thread", reads and applies events sequentially.

Contrary to the master, which executes transactions concurrently, the slave serializes execution of each and every transaction, even if they were executed concurrently on the master and are guaranteed to be conflict free. In fact, should that be the case, then such transactions could be applied in parallel thus taking advantage of multi-core/cpu hardware. Furthermore, recent optimizations on the master side enabled much better scale-up, thence enforcing the need for more scalable slave execution as well.

Today, Andrei Elkin has actually delivered a talk on this subject at Collaborate 11: Boosting MySQL Replication Performance Through the Multi-Threaded Slave. Roughly two hours earlier, a prototype - feature preview based on 5.6 codebase - was published in MySQL Labs for you to go and test it out. NOTE: the package is not intended for using in production. It IS a feature preview.

What is a Multi-Threaded Slave?

While multi-threaded is a rather broad term, it is loosely used in the current context, targeting a specific technique of parallelization of transaction execution. In other words, multi-threaded slave refers to inter-transactional parallelization while applying transactions to the slave's databases. This makes replication more scalable, especially on multi-core architectures.

The fact is that even the most common hardware today is multi-core and given that replication strives to exhibit good scalability always, the benefits from having a multi-core machine at the slave side must be exploited. Apparently, that is not completely the case yet, because every now and then we hear someone saying that the slave is not able to keep up with its master or that it is "lagging behind"! On such settings inter-transactional parallelization comes to the rescue, as it fits a big number of cases, is relatively easy to implement and is engine agnostic.

Earlier today, the MySQL replication team has put up for download, on MySQL Labs, a prototype that fits one particular scenario: apply, in parallel, transactions that operate on different database sets. In fact, it turns out to be quite a commonplace. In detail, it is often the case that the application partitions its data logically per database, thus the workload pattern is such, that a set of unrelated and non-conflicting transactions are executed concurrently, and on different databases, at the master. The replication slave can take advantage of this fact and execute such transactions, concurrently as well, i.e., by applying them in parallel. In practice, and on such scenarios, a multi-threaded slave would read transactions from the relay log and assign them to different worker threads, depending on the database the transaction is working on. Transactions operating on the same database would then be guaranteed to be serialized. On the other hand, transactions executing on different databases would be concurrently executed and committed independently. This is depicted in the following diagram:



In this figure we can find the usual stages that transactions go through during the apply procedure at the slave. It is not fundamentally different from what we have today on latest MySQL GA releases, except that now there is an additional "workers" stage. In a nutshell, the execution flow goes like this: IO thread receives events/transactions and queues them in the relay log; SQL coordinator thread reads them from the log file and queues the event on a specific worker queue. Selecting which queue to use is done by checking which database the event should be executed against (the best case scenario is the one that all events in a transaction change the same database, otherwise the scheduling may have some negative impact); Workers dequeue the events from their queues and execute them. If the event is a COMMIT operation, the worker then commits the transaction independently from other workers.

Some Special Cases

There are some special cases that one needs to think of. What about cross-database transactions? What about slave positions ? What about stopping and restarting the slave ? What about crashing and recovering?

I don't want to risk getting tangled into details, so lets not go into much detail and discuss them briefly:
  • For cross-database transactions, the slave waits until all preceding transactions that are working on the same database set are over. However, multiple databases in the same transaction can result in a partition itself that the slave can parallelize, e.g., if T1 changes "db1" and "db2" and T2 changes "db3" and "db4", then there is room for the slave to execute both transactions in parallel, despite the fact that both change more than one database;
  • On a multi-threaded slave, there are several workers, each committing independently, therefore usual concept of "last executed position" is not very meaningful. Some concurrent transactions may commit in a different order than the one that is established in the binary log, thus execution gaps may exist throughout execution time. The slave needs to track such gaps, and for that purpose it keeps track of each worker position and stores them persistently on system tables ("mysql.slave_worker_info");
  • Stopping and restarting the server is somewhat analogous to crashing and recovering it. On restart the slave checks each worker position, thus finding out which transactions need to be replayed since the last checkpoint (position in the log that has no execution gaps before it). This takes care of existing gaps,and the new workers can start executing from a base position.

State of the Art

For the following examples, a multi-threaded enabled slave server is required, replication is assumed to be setup, but the slave threads should not have been started. In addition, the slave should be configured to use the new system tables to store positions (not really necessary, but it turns the whole experience much more appealing). As such, one needs to start the slave MySQL server with the (additional) following options:

--relay-log-info-repository="TABLE" --master-info-repository="TABLE" --worker-info-repository="TABLE"

Now that the servers are started, lets begin by changing the number of workers that will be available for executing transactions and start the slave:
mysql> STOP SLAVE;
Query OK, 0 rows affected (0.00 sec)

mysql> SET GLOBAL mts_slave_parallel_workers=2;
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT @@mts_slave_parallel_workers;
+------------------------------+
| @@mts_slave_parallel_workers |
+------------------------------+
| 2 |
+------------------------------+
1 row in set (0.00 sec)

mysql> START SLAVE;
Query OK, 0 rows affected (0.09 sec)

mysql> SELECT USER,STATE FROM INFORMATION_SCHEMA.PROCESSLIST WHERE USER='system user';
+-------------+------------------------------------------------------------------+
| USER | STATE |
+-------------+------------------------------------------------------------------+
| system user | Slave has read all relay log; waiting for the slave I/O thread t |
| system user | Waiting for an event from sql thread |
| system user | Waiting for an event from sql thread |
| system user | Waiting for master to send event |
+-------------+------------------------------------------------------------------+
4 rows in set (0.00 sec)

mysql> SELECT * FROM mysql.slave_worker_info\G
*************************** 1. row ***************************
Master_id: 2
Worker_id: 0
Relay_log_name:
Relay_log_pos: 0
Master_log_name:
Master_log_pos: 0
Checkpoint_relay_log_name:
Checkpoint_relay_log_pos: 0
Checkpoint_master_log_name:
Checkpoint_master_log_pos: 0
Checkpoint_seqno: 0
Checkpoint_group_size: 64
Checkpoint_group_bitmap:
*************************** 2. row ***************************
Master_id: 2
Worker_id: 1
Relay_log_name:
Relay_log_pos: 0
Master_log_name:
Master_log_pos: 0
Checkpoint_relay_log_name:
Checkpoint_relay_log_pos: 0
Checkpoint_master_log_name:
Checkpoint_master_log_pos: 0
Checkpoint_seqno: 0
Checkpoint_group_size: 64
Checkpoint_group_bitmap:
2 rows in set (0.00 sec)

In the instructions above, we have changed the number of workers and then we started the slave and showed the status of all "system user" processes entries. There is one SQL thread that acts as coordinator, and is waiting for the IO thread to queue more events into the relay log. Additionally, we can find two worker threads and an IO thread that are waiting for events. Finally, we also printed out the position entries for each worker by inspecting the "mysql.slave_worker_info" table.

Next, lets issue a couple of statements that fit the multi-threaded use cases described previously. Thence, we need to create a couple of databases, some tables and issue some DML statements over those tables. On the master, we issue the following:

mysql> CREATE DATABASE db1;
Query OK, 1 row affected (0.00 sec)

mysql> CREATE DATABASE db2;
Query OK, 1 row affected (0.00 sec)

mysql> CREATE TABLE db2.t1 (a INT);
Query OK, 0 rows affected (0.03 sec)

mysql> CREATE TABLE db1.t1 (a INT);
Query OK, 0 rows affected (0.04 sec)

mysql> INSERT INTO db1.t1 VALUES (1), (2); INSERT INTO db2.t1 VALUES (3),(4);
Query OK, 2 rows affected (0.00 sec)
Records: 2 Duplicates: 0 Warnings: 0

Query OK, 2 rows affected (0.00 sec)
Records: 2 Duplicates: 0 Warnings: 0

At this point we have inserted into two tables that are on different databases. Since they are non-conflicting, the slave server will execute these two statements concurrently, one on each slave worker. Lets have a look at each worker positions:

mysql> SELECT * FROM mysql.slave_worker_info\G
*************************** 1. row ***************************
Master_id: 2
Worker_id: 0
Relay_log_name: ./slave-relay-bin.000003
Relay_log_pos: 1081
Master_log_name: master-bin.000001
Master_log_pos: 927
Checkpoint_relay_log_name: ./slave-relay-bin.000003
Checkpoint_relay_log_pos: 626
Checkpoint_master_log_name: master-bin.000001
Checkpoint_master_log_pos: 472
Checkpoint_seqno: 1
Checkpoint_group_size: 64
Checkpoint_group_bitmap:
*************************** 2. row ***************************
Master_id: 2
Worker_id: 1
Relay_log_name: ./slave-relay-bin.000003
Relay_log_pos: 854
Master_log_name: master-bin.000001
Master_log_pos: 700
Checkpoint_relay_log_name: ./slave-relay-bin.000003
Checkpoint_relay_log_pos: 626
Checkpoint_master_log_name: master-bin.000001
Checkpoint_master_log_pos: 472
Checkpoint_seqno: 0
Checkpoint_group_size: 64
Checkpoint_group_bitmap:
2 rows in set (0.00 sec)


The table shows us that both workers were engaged during the apply activities. Worker #1 handled insert for one database, while worker #2 handled the other insert (on the other database).

This is the very basics of multi-threaded slave, thence you should go and try/play with it yourself!

Tuning

The multi-threaded slave can be tuned by changing some of its parameters. The following are just a few (those that are somewhat more relevant). Since this is still work in progress, it can be that they get renamed or even removed before the feature is actually released in a GA version. Anyway, if you're playing with the feature preview, the following might be helpful:

  --mts-slave-parallel-workers=#
Number of worker threads for executing events in parallel.

--mts-pending-jobs-size-max=#
Max size of Slave Worker queues holding yet not
applied events.

A quick and dirty list of parameters that the server can actually accept related to multi-threaded slave, can be obtained by running mysqld --help as in the following command:

shell> ./bin/mysqld --verbose --help | grep mts

[... some error messages might show up here ...]

--mts-checkpoint-group=#
--mts-checkpoint-period=#
--mts-coordinator-basic-nap=#
--mts-exp-slave-local-timestamp
--mts-partition-hash-soft-max=#
Number of records in the mts partition hash below which
--mts-pending-jobs-size-max=#
--mts-slave-parallel-workers=#
--mts-slave-worker-queue-len-max=#
all queues are governed by mts_pending_jobs_size_max.
--mts-worker-underrun-level=#
mts-checkpoint-group 512
mts-checkpoint-period 300
mts-coordinator-basic-nap 5
mts-exp-slave-local-timestamp FALSE
mts-partition-hash-soft-max 16
mts-pending-jobs-size-max 16777216
mts-slave-parallel-workers 0
mts-slave-worker-queue-len-max 40000
mts-worker-underrun-level 0

Summary

This blog post presents the need for a Multi-threaded slave and stresses the fact that Oracle's MySQL replication team delivered a feature preview based on the MySQL 5.6 codebase. The user can download it from MySQL Labs.

Apart from explaining the needs, it also details what its use cases are and explains briefly how it actually works internally.

It finishes by presenting a few MySQL commands that one can use in the feature preview server to actually make use of several threads at the slave.