Wednesday, October 5, 2011

Global Transaction Identifiers Feature Preview

The Case for Global Transaction Identifiers

"Global Transaction Identifiers" is a feature that has been requested every now and then. And it is not so much about what it actually is, but rather about what it enables MySQL users to do. Having a logical identifier associated with each transaction instead of a physical one (filename + offset), provides more flexibility and removes the burden of complex math from userland scripts. We have put out there an early access release (based on 5.6 codebase) of our ongoing effort to implement the global transaction identifiers and we would like some early feedback. Keep in mind that this is NOT something to use in production as it is in very early development stages. That said...

What exactly is a global transaction identifier?

A global identifier is a tag that pin-points a set of changes resulting from the execution of a transaction.

Why do we need global transaction identifiers?

If every transaction has its own universally unique identifier, it becomes a lot easier to follow changes through a complex replication stream. It is easier for us, humans, to visualize and understand what is going on, consequently, the algorithms we write for dealing with binary logs and replication tend to be far less convoluted.

In practice, what are the benefits of this all?
  • Fail-over: automation of fail-over suddenly becomes a lot easier. Instead of working through the physical coordinates to decide which slave is most up to date, with respect to the master it is going to replace, one can just compare global transaction identifiers of the last applied transactions. Slave promotion gets easier. However, the major benefit comes when switching over the slaves to the new master. They can reference a global transaction identifier and not have to convert binary log filenames and offsets between different servers.
  • Session consistency and Hierarchical replication: Offloading the master, through some hierarchical replication scheme, often works very well, especially if tied together with an intelligent load-balancer. Sometimes the load balancer sends read queries to a slave down in the hierarchy chain and at the same time it has to be sure that session consistency is guaranteed (updates on the master must have already been applied on the slave being queried). Making sure that the update has flowed all the way down to the desired slave without global transaction identifiers is laborious (one needs to climb down the hierarchy and follow the changes on every level in the hierarchy). On the other hand, with global transaction identifiers, one can just wait for the transaction with the desired identifier to be applied on a given slave and then run the query.
  • Enabler for multi-master update everywhere replication: This is a complex problem to solve, but the fact that transactions can be uniquely identified and distinguished from one another lays the ground for establishing (at least partial) order between them. This property is often important in such replication setups (even for dealing with conflict detection).
I am sure that there are more benefits, and that there is a whole bunch of interesting things one can do on top of global transaction identifiers... But I'll leave it up to you to think and decide how that would be useful in your own setup.

Designing a Global Transaction Identifier

The replication team has been working on several nice features that you must have noticed before (multi-threaded slave, row-based replication enhancements, ...). But now, we are adding global transaction identifiers to the list.


Global Transaction Identifier

The goal of this task is to augment MySQL binary log with Global Transaction Identifiers. Thus each transaction has an associated global identifier (GTID), which is essentially a pair:

GTID = <SID, GNO>

In practice a transaction is logged as a group of events in the binary log. Thus sometimes we refer to groups instead of transactions - I ended up use them intermixed in the text below. Actually, there are a few details about this, but to avoid risking excessive and unnecessary complexity in what I will describe below, I will just omit those.

The following describes more clearly each part of an GTID.

  • SID => currently it is a 128-bit number that identifies the server where the transaction/group of events was first committed. SID is normally the server UUID, but may be something different if a transaction is generated by something else other than than a regular MySQL Server. For example, for NDB, it identifies the Cluster.
  • GNO => is a 64-bit sequence number: 1 for the first changes logged on SID, 2 for the second changes, and so on. No change can have GNO 0.

Indexes and Relaying Transactions in the Replication Stream

In a typical MySQL replication setup there is one master server and a set of slave servers retrieving the changes from that one master and replaying those changes locally, against their own databases. Thus, GTIDs are added to transaction on the originating server - the master. As such, the following major changes have to be done on the originating side:

  1. Annotate existing binary log events with the GTID that they belong to... or create a new type of event that stores a GTID associated with a set of subsequent events in the replication stream.
  2. Create an index to quickly find out which are the physical coordinates that map into the logical identifiers, ie, the GTID. This makes looking up for which binary log file and at which offset a transaction is in, given a certain GTID, very easy and quick. It is especially useful for a dump thread, when it starts, to quickly find the physical position from which it should start reading and sending events to the slave.
In practice, this index, that maps GTIDs to binary log positions, has the form of a set of files, each file containing a sequence of transaction specifications. Each transaction specification has, among other fields, the following ones:
  • SID: The unique source identifier for this group of events/transaction.
  • GNO: the sequence number of the group of events/transaction.
  • LGID: This a local identifier like an auto-increment primary.
  • binlog file: name of binary log where this group is stored.
  • binlog pos: offset in binary log where this group starts.
  • binlog length: length of this group in binary log.
  • group end: true if this is the last set of events with GTID.
When a transaction commits, the master generates a GTID and atomically writes it to the binary log along with the events of that transaction. After that the in-memory group index data structures are updated. However, this data is asynchronously flushed to the index file. This requires that on server restart the recovery routine is extended so that it also runs a procedure to make sure that the index is properly setup and consistent.

Slaves relay changes from the master. This means that transactions that are replayed by a slave thread will keep the original GTID. Furthermore, slaves also maintain indexes to keep track of their relay logs, and its content is also flushed asynchronously. As in the master, on slave restart, a recovery routine is run to make sure that the indexes are consistent.

The current snapshot does not yet relay GTIDs through the replication protocol, so we do not get to see the identifiers flowing all the way to the slave. But we can inspect the master binary log and have a look at the identifiers...

Early Access: Exercising the Labs Snapshot.

We have uploaded a snapshot of our current work, to labs, which you can try out. It's buggy and it's incomplete, but it lays the ground to what we will be delivering in the future. So... how can we show off a bit of what we have done? Currently, we can issue a set of commands on the master and look into the resulting binary log to search for information regarding the new transaction identifiers. For instance, issuing the following commands on a server with binary log enabled:

shell> SET AUTOCOMMIT=0;
shell> CREATE TABLE t1 (a INT) Engine=InnoDB;
shell> INSERT INTO t1 VALUES (1);
shell> INSERT INTO t1 VALUES (2);
shell> INSERT INTO t1 VALUES (3);
shell> COMMIT;

Will get you an output very similar to the following one, when inspecting the binary log with the mysqlbinlog tool:
$ mysqlbinlog -v var/mysqld.1/data/master-bin.000001

(...)

# at 114
# Subgroup(#1, D5375118-EF7C-11E0-8C85-F0DEF11A08B7:1, END, COMMIT, binlog(no=0, pos=114, len=107, oals=0))
SET UGID_NEXT='D5375118-EF7C-11E0-8C85-F0DEF11A08B7:1', UGID_END=1, UGID_COMMIT=1/*!*/;
#111005 11:08:04 server id 1 end_log_pos 221 Query thread_id=1 exec_time=0 error_code=0
use test/*!*/;
SET TIMESTAMP=1317838084/*!*/;
SET @@session.pseudo_thread_id=1/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=0/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
/*!\C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=8/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
CREATE TABLE t1 (a int) Engine=InnoDB
/*!*/;
# at 221
# Subgroup(#2, D5375118-EF7C-11E0-8C85-F0DEF11A08B7:2, END, COMMIT, binlog(no=0, pos=221, len=387, oals=27))
SET UGID_NEXT='D5375118-EF7C-11E0-8C85-F0DEF11A08B7:2', UGID_END=0, UGID_COMMIT=0/*!*/;
#111005 11:08:10 server id 1 end_log_pos 296 Query thread_id=1 exec_time=0 error_code=0
SET TIMESTAMP=1317838090/*!*/;
BEGIN
/*!*/;
# at 296
#111005 11:08:10 server id 1 end_log_pos 391 Query thread_id=1 exec_time=0 error_code=0
SET TIMESTAMP=1317838090/*!*/;
INSERT INTO t1 VALUES (1)
/*!*/;
# at 391
#111005 11:08:13 server id 1 end_log_pos 486 Query thread_id=1 exec_time=0 error_code=0
SET TIMESTAMP=1317838093/*!*/;
INSERT INTO t1 VALUES (2)
/*!*/;
# at 486
# Subgroup(#2, D5375118-EF7C-11E0-8C85-F0DEF11A08B7:2, END, COMMIT, binlog(no=0, pos=221, len=387, oals=27))
SET UGID_END=1, UGID_COMMIT=1/*!*/;
#111005 11:08:16 server id 1 end_log_pos 581 Query thread_id=1 exec_time=0 error_code=0
SET TIMESTAMP=1317838096/*!*/;
INSERT INTO t1 VALUES (3)
/*!*/;
# at 581
#111005 11:08:18 server id 1 end_log_pos 608 Xid = 11
COMMIT/*!*/;
SET UGID_NEXT='AUTOMATIC'/*!*/;
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;

In the output above one can find an additional line of metadata related to global transaction identifiers and a few new variables related to GTIDs. For now, lets just concentrate in finding the global transaction identifier. Looking at the line starting with "#Subgroup", one can find and '<UUID>:1' and '<UUID>:2'. These relate to the two transactional groups, the one that consists only of the DDL 'CREATE TABLE...' and the second group is the one that consists of the set of 'INSERT INTO...' statements that comprise the explicit transaction issued.

Now... we can filter out one of the transactions just by issuing (lets skip the create table):
$ mysqlbinlog -v --exclude-ugids=D5375118-EF7C-11E0-8C85-F0DEF11A08B7:1 var/mysqld.1/data/master-bin.000001
(...)
Or we could even not print identifiers at all:
$ mysqlbinlog -v --skip-ugids var/mysqld.1/data/master-bin.000001
(...)
There are a couple of more switches implemented in the mysqlbinlog tool that are useful to handle contents on the binary log, based on the global transaction identifier. But I'll leave it up to you to check that out.

Summary

This post provides some insights on the work the replication team is doing on designing and implementing global transaction ids. It gives a general overview of what the problem is and roughly what the solution is and how it is trying to solve a long standing requirement such as easier fail-over.

The good news are that we are not just designing anymore, we are already implementing it and you can even get a recent snapshot of this feature branch. Go... download it, look at the code, build the branch (or download a binary one), play a little bit with it. In the current implementation, global transaction ids are not yet part of the replication protocol, but you can see them by inspecting the master binary log, using mysqlbinlog tool.

Enjoy!

4 comments:

  1. Can you elaborate on the index to be maintained for (GTID -> file name/pos)? Is this fully in memory for the current binlog? What about for older binlogs? My concern is that this might use a lot of RAM. My memory of the Google patch is vague but the index was never persisted -- it was recreated on demand. The index does require an entry for every transaction/group. Storing entries for every Nth group would still make it faster to find the filename/offset of the binlog. Also, this blog mentions maintaining an index of the relay log on the slave. Why is that done?

    ReplyDelete
  2. Hi Mark,
    Some quick answers:
    - The index is persisted on disk. The entire index is not kept in memory.
    - There is one index that spans all binary logs, respectively all relay logs. The beginning of the index is purged when the corresponding binary logs are purged.
    - The index has one entry per transaction/group.
    - The reason for the index on the relay log is that we need to know the set of IDs that exist in the relay log. (Among other things, the slave sends this set to the master in the initial handshake, so that the master does not need to send transactions/groups that the slave already has.) If we did not have an index, we would lose this information on server restart and have to scan all relay logs. (Maybe there are other ways to do this, but since we already have the index on the binary log it seemed easier to use the same mechanism for the relay log than to invent something different.)

    ReplyDelete
  3. How much memory does this use assuming 1M groups per binlog file?

    By send IDs to the master does this mean send the last ID. Why would it need to send more than the last ID?

    I am curious about the use of an index because it will consume memory and is another thing to make crash-safe.

    ReplyDelete
  4. Hi Mark,

    > By send IDs to the master does this mean send the last ID. Why would it need to send more than the last ID?

    In an ideal world, transactions are re-executed consecutively: after trx(N) comes trx(N+1). If this was true, the set of all executed transactions could be represented as a single integer, N. However, there are several ways that there can be gaps in this sequence, or transactions can be out of order. For instance, multi-threaded slave executes transactions in parallel and can commit them out of order. That will cause transactions to be out of order in the slave's binary log, and temporarily there can be gaps: trx(M) may not exist in the slave's binary log even when trx(N) does, where M How much memory does this use assuming 1M groups per binlog file?

    - The size of the index is O(number_of_transactions). However, this is only on disk, not in memory.

    - The size of the set of groups is O(number_of_intervals). This is in memory, not on disk. When gaps are only generated by multi-threaded slave, number_of_intervals is bounded by the number of worker threads.

    > I am curious about the use of an index because it will consume memory and is another thing to make crash-safe.

    The index does not use memory, only disk space.

    The binary log is the source of truth. The index is never sync'ed. When the server starts, after it has executed any recovery procedure on the binary log, it performs the following recovery on the index: seek to the end of the index. If the index is shorter than the binary log, scan the tail of the binary log and copies the group information to the index.

    /Sven

    ReplyDelete