Planet MySQL

Getting Started with MySQL Replication for High-Availability

Many organizations have MySQL or MariaDB databases at the core of their business—for processing and storing product sales, collecting information related to services offered, or just providing essential information to customers. As a result, keeping these databases running continuously can be critical for the success of an organization.

There are many components of a database system that a database administrator will need to consider for maintaining high availability. We considered server equipment (e.g., memory) in a previous introductory article. Now let’s look at using multiple servers for your MySQL and MariaDB databases—let’s look at replication.

Replication Overview

One common and effective way to structure a highly available database system is through some form of database replication. There are a few reasons for using replication. One reason is for load balancing: you can split user traffic between servers, sending write traffic (e.g., UPDATE statements) to the master and read traffic (e.g., SELECT statements) to a slave—or distributed to multiple slaves. If this is new to you, it may seem complicated. So let’s keep it simple to start. We’ll look at how to set up a second server to replicate an existing database server: we’ll set up a slave. We’ll cover more complex replication systems in another article.

Basically, you designate one server as the master and another server as a slave. All changes that are made to the databases on the master are also made automatically on the slave. As for load balancing, we’ll cover it in another article. To start, you can set up replication to have an extra machine that you can use as a spare in case your main server fails, and for making back-ups.

Preparations on the Master

There are only a few steps to configuring replication. First, there’s no software to purchase or install—replication is built into MySQL and MariaDB. On the master, you will need to enable binary logging. To do this, add log-bin on a separate line to the database configuration file (e.g., my.cnf). Also, add a line with server-id to give the server a unique identifier—the number 1 is fine. Below is an excerpt from the configuration file showing these two variables. Be sure to restart MySQL when finished adding them.

[mysqld]
log-bin
server-id=1
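As a quick sanity check (a sketch of my own, not from the article), the two settings can be verified with Python's configparser; the excerpt string stands in for your actual my.cnf:

```python
import configparser

def check_master_config(text):
    """Return (binlog_enabled, server_id) from a my.cnf-style excerpt."""
    # allow_no_value accepts bare options such as log-bin
    cfg = configparser.ConfigParser(allow_no_value=True)
    cfg.read_string(text)
    mysqld = cfg["mysqld"]
    return ("log-bin" in mysqld, mysqld.get("server-id"))

# The same excerpt shown above
print(check_master_config("[mysqld]\nlog-bin\nserver-id=1\n"))  # (True, '1')
```

A check like this is handy when you manage several servers and want to confirm every server-id is unique before starting replication.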

Next, you will need to create a user account for the slave to use when communicating with the master. The slave doesn’t query the databases on the master. Instead, it requests new entries from the master’s binary log. This log records all changes to the server (e.g., SET statements), database structure (e.g., ALTER TABLE), and data (e.g., INSERT). These changes to the server, schema, and data are all that’s needed for replication. The binary log doesn’t include SELECT statements; the slave doesn’t need them.

So, the replication user needs only the REPLICATION SLAVE privilege, which allows it to get updates to the master’s binary log. You can execute this SQL statement on the master to create such a user:

GRANT REPLICATION SLAVE ON *.* TO 'replicator'@'12.345.0.2' IDENTIFIED BY 'its_pwd';

You would replace the name replicator with a username you prefer. Use the IP address of the slave server—not 12.345.0.2, as shown here. And provide a better password.

Now you need to make a complete back-up of the master’s databases to transfer to the slave. You can use the mysqldump utility to do this:

mysqldump -p --user backup_user \
  --master-data --flush-logs \
  --all-databases > full-backup.sql

You will have to change the username from backup_user to whatever user has the privileges needed to make a full back-up of all databases. The --master-data option tells mysqldump to include information on the master in the dump file. The --flush-logs option tells mysqldump to flush the binary logs so you have a fresh start.
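To illustrate what --master-data adds, here is a small Python sketch (an illustration I wrote, not part of the article) that pulls the replication coordinates out of the dump file's header; the sample line mimics what mysqldump writes:

```python
import re

def master_coordinates(dump_text):
    """Return (binlog_file, position) from the CHANGE MASTER TO line
    that --master-data writes near the top of the dump, or None."""
    m = re.search(
        r"CHANGE MASTER TO MASTER_LOG_FILE='([^']+)',\s*MASTER_LOG_POS=(\d+)",
        dump_text)
    return (m.group(1), int(m.group(2))) if m else None

sample = "CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000002', MASTER_LOG_POS=154;"
print(master_coordinates(sample))  # ('mysql-bin.000002', 154)
```

Because the coordinates are baked into the dump, the slave knows exactly where in the binary log to begin replicating after the back-up is loaded.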

Configuring the Slave

On the server which is to be the slave, install MySQL or MariaDB. You should use the same software, version, and release as you’re using on the master. In its configuration file, set server-id to 2 or some other unique number. Also add the read-only option so that no one can change the data directly on the slave. You’ll need to restart MySQL on the slave for these options to take effect.

[mysqld]
server-id=2
read-only

If you haven’t already, copy the back-up file from the master to the slave. You could use FTP or a similar method. Here’s how you might do this with scp from your home directory:

scp -i ./.ssh/my_key.pem \
  ec2-user@12.345.0.1:/home/ec2-user/full-backup.sql .

This example copies between two AWS instances, which is fast. On other systems, you may have to authenticate differently, without a key. Once you have the back-up copied, you can use the mysql client to load the databases contained in the back-up file:

mysql -p -u root < full-backup.sql

When that’s done, you need to provide the slave with the information it needs to authenticate with the master. To do this, log into MySQL on the slave as root and execute the following SQL statement:
CHANGE MASTER TO
  MASTER_HOST='12.345.0.1',
  MASTER_USER='replicator',
  MASTER_PASSWORD='its_pwd';


The host here should be set to the master’s IP address, not the slave’s IP address. When you’ve done this, you’re ready to start the slave. Just execute the following on the slave, from within MySQL:
START SLAVE;


At this point, the slave should be replicating from the master. Try changing or adding some data on the master to see if it’s immediately replicated on the slave. You can create a table in the test database or insert a row into a table on the master. The slave should immediately reproduce whatever you do on the master, though not SELECT statements. If it didn’t work, you’ll have to check the server’s status to troubleshoot the problem.

Initial Troubleshooting

MySQL and MariaDB provide a SHOW statement for monitoring and troubleshooting replication. Execute the following statement on the slave to check its status:

SHOW SLAVE STATUS \G

        Slave_IO_State: Waiting for master to send event
...
      Slave_IO_Running: Yes
     Slave_SQL_Running: Yes
...
            Last_Errno: 0
            Last_Error:
...

The results will tell you plenty, but we’ve listed just a few variables here for you to consider initially. In particular, look to see whether the fields Slave_IO_Running and Slave_SQL_Running each say Yes. If they do, the slave is replicating. If the IO and SQL threads on the slave aren’t running, check for error messages (i.e., Last_Errno and Last_Error). These may give you a clue as to what the problem is.
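If you check the status often, it can help to script the check. Below is a rough Python sketch of my own that parses the vertical (\G) output and applies the same two-thread test described above; the sample text is abbreviated:

```python
def parse_slave_status(vertical_output):
    """Turn the Field: Value lines of SHOW SLAVE STATUS \\G into a dict."""
    status = {}
    for line in vertical_output.splitlines():
        if ":" in line:
            field, _, value = line.partition(":")
            status[field.strip()] = value.strip()
    return status

def replication_running(status):
    """Both replication threads must report Yes."""
    return (status.get("Slave_IO_Running") == "Yes"
            and status.get("Slave_SQL_Running") == "Yes")

sample = """Slave_IO_State: Waiting for master to send event
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Last_Errno: 0
Last_Error:"""
print(replication_running(parse_slave_status(sample)))  # True
```

In practice you would feed this the output of mysql -e "SHOW SLAVE STATUS \G" from a cron job or monitoring agent.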

If you’re still having problems, execute SHOW MASTER STATUS on the master. It will show you the name of the current binary log file in use by the master, and the position number of the last entry in that binary log. Compare this to the results from SHOW SLAVE STATUS on the slave. The values for the master should be the same as on the slave. Also, check that the user name for the slave, and the host address for the master are correct, as well as the server identification number.

Setting up replication for the first time should go smoothly if you followed all of the steps above, set the user names and hosts correctly, and started with a fresh install on the slave. If troubleshooting becomes too difficult, though, you can start over: uninstall MySQL and delete MySQL’s data directory on the slave, then try again. Be careful to delete the data directory on the slave, not on the master.

Monitoring Replication

Once you have replication installed and it’s been running for a while, you’ll discover two things: it works well and it will stop unexpectedly—without notifying you. This means you will have to execute regularly the SHOW SLAVE STATUS statement on the slave to see if replication is running. Besides checking the IO and SQL threads mentioned above, you should also check the Seconds_Behind_Master field. If it’s too far behind the master, there may be a problem with your network.
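One way to automate that regular check (a hypothetical sketch; only the field names come from SHOW SLAVE STATUS, and the 60-second threshold is an arbitrary example):

```python
def lag_alert(status, max_lag_seconds=60):
    """Decide whether to raise an alert from SHOW SLAVE STATUS fields.
    Returns a short reason string, or None if all is well."""
    if (status.get("Slave_IO_Running") != "Yes"
            or status.get("Slave_SQL_Running") != "Yes"):
        return "replication stopped"
    lag = status.get("Seconds_Behind_Master")
    if lag is None or lag == "NULL":   # NULL means the SQL thread is not running
        return "lag unknown"
    if int(lag) > max_lag_seconds:
        return "slave lagging"
    return None
```

A wrapper around this could email you, which addresses the "stops unexpectedly without notifying you" problem described above.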

Besides what’s provided with SHOW SLAVE STATUS, you might want to check SHOW STATUS for slave-related information:

SHOW STATUS LIKE 'slave_%';
+----------------------------+----------+
| Variable_name              | Value    |
+----------------------------+----------+
| Slave_connections          | 0        |
| Slave_heartbeat_period     | 1800.000 |
| Slave_open_temp_tables     | 0        |
| Slave_received_heartbeats  | 0        |
| Slave_retried_transactions | 0        |
| Slave_running              | ON       |
| Slave_skipped_errors       | 0        |
| Slaves_connected           | 0        |
| Slaves_running             | 1        |
+----------------------------+----------+

If everything is going well, this won’t show much. But you may want to check this information regularly. For instance, if Slave_retried_transactions shows a high number, that might indicate a problem, in which case check the error logs for more information.

Monyog – Add a Slave Server

If you have Monyog installed and monitoring your server, you can easily add a second server: the slave.

Monitoring a slave and replication is much easier if you have Monyog installed already and are monitoring the master—if not, download it and try it. You can use Monyog to monitor both the master and the slave. It will provide you with information on all of the variables from the SHOW statements mentioned. You can also create alerts to notify you when the slave stops replicating or has other problems. Let’s go through how to add a slave to Monyog.

First, there are a few things to do to prepare the slave. Create a user for Monyog on the slave. This user needs only the REPLICATION CLIENT and SUPER privileges to monitor replication:

GRANT REPLICATION CLIENT, SUPER ON *.* TO 'monyog'@'12.345.0.1' IDENTIFIED BY 'its_pwd';

Now open Monyog in your web browser and click on Servers in the left margin. You’ll see the one server you already added (assuming you’re already using Monyog). You probably called it Localhost. For clarity, you might want to rename it. Just click on the ellipsis in the server’s box and you’ll see a list of choices. Choose Edit Server. The first box is labeled Name. Change the name from Localhost to Master, then Save.

Back at the Servers page, click on ADD NEW SERVER at the top. A panel will open on the right (see screenshot). Name the new server Slave or whatever seems appropriate to you. Enter the IP address of the slave in the MySQL Host box. Enter the Username and Password for Monyog on the slave. Everything else is fine for now with the default settings. On Amazon’s AWS, you may have to edit the Inbound Rules for the slave’s security group to allow MySQL traffic from the master. Once you’re done, click the blue button, Test MySQL Connection.

When the new server tests without a problem, click on the Advanced tab. Look for Replication in the list of monitor groups. Click its plus-sign to add it. You’ll see a line that reads, Is this replication slave? Enable it; this is necessary to gather information about replication on the slave. By the way, on that same screen you may notice that there is an option to Auto-register All Slaves. If you enable this on the master, you won’t need to add the slave manually: Monyog will automatically add a new slave when it detects it interacting with the master.

Monyog – Monitoring Replication

Monyog is now monitoring the slave. Let’s see how that looks. Click on Monitors in the left margin, then select Replication from the list of monitor groups (see screenshot). Here you’ll see the results of SHOW SLAVE STATUS and the replication-related fields from SHOW STATUS. Besides being a nicer display, if you click on the flag icon for a monitor, you can have Monyog send you an email or an SNMP trap to alert you when it exceeds parameters you set. At a minimum, Monyog will be monitoring Slave_IO_Running and Slave_SQL_Running and will alert you if those values switch from Yes to No, indicating that replication has stopped. You may just need to edit the server again and make sure you have the email address you want in the Notification section.


Replication can be useful for many things related to high-availability. At a minimum, as we said, you can use it as a hot spare or for making back-ups of the databases. We’ll look at back-ups related to high-availability, as well as other methods related to replication to improve high availability of databases, in upcoming articles in this series on high-availability.

You can download a 14-day free trial of Monyog MySQL monitor here.

The post Getting Started with MySQL Replication for High-Availability appeared first on Webyog Blog.

New Continuent Webinar Wednesdays and Training Tuesdays

We are just starting to get into the swing of setting up our new training and webinar schedule. Initially, there will be one webinar session (typically on a Wednesday) and one training session (on a Tuesday) every week from now on. We'll be covering a variety of different topics at each. Typically our webinars will be about products and features, comparisons to other products, mixed in with

The Insert Benchmark

The insert benchmark was first published by Tokutek in C++. Since then I converted it to Python, they made my Python version faster, I added support for MongoDB, and I have been using it to evaluate storage engine efficiency. I think it is about 10 years old and it has been useful at finding things we can make better in storage engines. I hope to convert it to a faster language, but even in Python the code is still faster than storage engines for IO-bound workloads.

I use a helper script to run a sequence of tests. For all tests there is either one table shared by all clients or a separate table per client. Some storage engines suffer more from concurrency within a table. For MySQL each table has a primary key index and up to 3 secondary indexes. I configure the tests to use 3 secondary indexes. The inserts during the load are in PK order but random for the secondary indexes. InnoDB benefits from the change buffer. MyRocks and MongoRocks benefit from read-free secondary index maintenance. MongoRocks and WiredTiger suffer from the hidden index used by MongoDB for engines that use clustered indexes, but the overhead is less with the insert benchmark than for Linkbench. The tests are:
  1. load - this is the first test and uses N clients to load into the test table(s).
  2. q1000 - this is the second test and uses N query clients and N writer clients. Each writer client is rate limited to 1000 inserts/second. With some storage engines the writer clients are unable to sustain that rate. With N writer clients the global write rate should be N*1000 inserts/second. The query clients perform range scans as fast as possible. The usage of tables is the same as for the load.
  3. q100 - this is the third test and uses N query clients and N writer clients. Each writer client is rate limited to 100 inserts/second. With N writer clients the global write rate should be N*100 inserts/second. The query clients perform range scans as fast as possible.
The benchmark client and my helper scripts are on github. One day I will add a few comments to the helper script to explain the command line. An example command line to run it for 500M rows, 16 clients and each client using a separate table is:
    bash innodb "" ~/b/orig801/bin/mysql /data/m/my/data nvme0n1 1 1 no no no 0 no 500000000

When it finishes there will be three directories named l, q1000 and q100. In each directory there is a file named o.res.$something that has performance and efficiency metrics. The l directory has results from the load step. The q1000 and q100 directories have results from the write+query steps where each writer is rate limited to 1000/second and 100/second, respectively. An example o.res.$something file for a test with 500M rows and 1 client is here for the load test and here for the q1000 test. Each file has a section with efficiency metrics normalized by the insert rate and then efficiency metrics normalized by the query rate. For the load test only the first section is interesting. For the q1000 test both sections are interesting. For the q100 test only the second section is interesting because the insert rates are too low.

The first section starts with the text iostat, vmstat normalized by insert rate and the second section starts with the text iostat, vmstat normalized by query rate. They both have similar data -- rates from iostat and vmstat normalized by the insert and query rate. An example is:
iostat, vmstat normalized by insert rate
samp    r/s     rkb/s   wkb/s   r/q     rkb/q   wkb/q   ips             spi
501     3406.8  54508   91964   3.410   54.563  92.057  999.0           0.100100

samp    cs/s    cpu/c   cs/q    cpu/q
525     16228   17.3    16.244  0.017276

iostat, vmstat normalized by query rate
samp    r/s     rkb/s   wkb/s   r/q     rkb/q   wkb/q   qps             spq
501     3406.8  54508   91964   4.023   64.370  108.602 846.8           0.118092

samp    cs/s    cpu/c   cs/q    cpu/q
525     16228   17.3    19.164  0.02038

The metric names are:
  • samp - number of iostat or vmstat samples collected during the test
  • r/s, rkb/s, wkb/s - average values for iostat r/s, rKB/s and wKB/s during the test
  • r/q - iostat r/s divided by the insert or query rate
  • rKB/q, wKB/q - iostat rKB/s and wKB/s divided by the insert or query rate
  • ips, qps - average insert and query rate
  • spi, spq - seconds per insert, seconds per query - average response time for inserts and queries
  • cs/s - average vmstat cs/s rate (cs is context switch)
  • cpu/c - average CPU utilization from vmstat us & sy (user & system)
  • cs/q - context switches per insert or per query
  • cpu/q - CPU utilization divided by insert or query rate then multiplied by a constant. This is only comparable between servers with the same CPU count.
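As a worked example (my own sketch, using the numbers from the output above), the normalized columns are just the iostat averages divided by the insert or query rate:

```python
def normalize(r_s, rkb_s, wkb_s, rate):
    """Divide iostat averages by the insert or query rate, matching the
    'normalized by insert rate' and 'normalized by query rate' sections."""
    return {"r/q": r_s / rate, "rkb/q": rkb_s / rate, "wkb/q": wkb_s / rate}

# Same iostat averages, normalized by ips (999.0) and by qps (846.8)
by_insert = normalize(3406.8, 54508, 91964, 999.0)
by_query = normalize(3406.8, 54508, 91964, 846.8)
print(round(by_insert["r/q"], 3), round(by_query["r/q"], 3))  # 3.41 4.023
```

This reproduces the r/q values 3.410 and 4.023 from the two example sections, which makes the metrics easy to recompute when comparing engines.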

Hybrid Data Encryption by Example using MySQL Enterprise Edition

Sharing keys and passphrases with applications is problematic, especially with regard to encrypting data. Too often applications are developed where “the keys are left in the door” or at best “under the mat” – hard-coded, or in a clear-text property file… exposed and vulnerable. …

Announcing Scotch Box 3.0 and Scotch Box Pro

It's been a while (maybe too long) since our last Scotch Box update. If you're not familiar with Scotch Box, it's a dead-simple local LAMP stack for Vagrant. It was built about a year and a half ago, back when Vagrant was the hot new thing coming out. Vagrant had a tendency to break a lot with various updates and was built with a million platforms connecting to it in mind (props to HashiCorp for everything they achieved).

Vagrant hooks into multiple virtualization platforms and provisioning tools, and works across multiple operating systems, but sometimes things just didn't mesh. Debugging Vagrant was kind of a lift for people who aren't well equipped with DevOps, but when you did get it working, development was amazing!

That's why we made Scotch Box. A tool to allow you to use Vagrant - in all its glory - in the most stupid-easy way possible.

Today, it has almost 2 million downloads! That's a crazy astronomical number that ended up being way bigger than ever predicted. In total, that's about 1250 terabytes of data downloaded from our servers.

If you're new to Scotch Box, I'll give you a complete introduction tutorial on it here, including Vagrant, and then cover some of the newer features and differences between the two boxes.

How to Use Vagrant

If you're not new to Vagrant, skip this section!

The simplest way to think about Vagrant is as a command-line tool to control VirtualBox (or some other virtualization software). To get started:

Boom, that's it! After you have that installed, open up your terminal and run:

vagrant -v

If you get something like this, you know you're in the clear:

Vagrant Boxes and the "Vagrantfile"

Vagrant doesn't really work on its own, though. It's simply a tool that connects all the systems you need together (like the VirtualBox software we just installed).

Think of Vagrant boxes as pre-packaged operating systems compressed into a .box file. It's kind of like an .iso file, but just for Vagrant.

The Vagrantfile is literally a file named Vagrantfile (no extension) that is pretty much your "config" for how you want Vagrant to act. The Vagrantfile is:

  • Written in Ruby
  • Kind of crazy looking / complex if you don't know servers, networking, or Vagrant

Here's what a simple Vagrantfile might look like for example:

And, here's a crazy complex Vagrantfile for perspective on how far you can go with it:

Note: Complex doesn't mean awful, though. That Vaprobash file is "badass" to its core. It's just not for everyone.

Making them talk to each other and be friends

After you get to the Scotch Box tutorial below, this whole section might make more sense, but I want to cover it now. Vagrant has some commands you run from your terminal.

These commands need to be run from the folder location of the Vagrantfile. When you run these commands it executes on the Vagrantfile.

The Vagrantfile then has a line in it saying to load whatever Vagrant box you want - Ubuntu, CentOS, Scotch Box, etc.

From the command line, you can do things like "use this vagrant box", "turn on vagrant box", "turn off vagrant box", etc.

Here they are:

# Use the box and/or turn the box back on
vagrant up

# Turn the box off (similar to power off switch)
vagrant halt

# Pause the box (similar to freezing it in time)
vagrant suspend

# Turn the box off and stop using it
vagrant destroy

# SSH into the box
vagrant ssh

How to Use Scotch Box

If you're not new to Scotch Box, skip this section!

Scotch Box is super easy to get up and running. Once you have installed Vagrant and VirtualBox, it's super quick!

Just clone and run vagrant up:

git clone my-project
cd my-project
vagrant up

Then visit:

That's it! Now if you ever need to run commands from within this new virtual server, just run:

vagrant ssh

Easy. As. Cake. From here, you can run all your other vagrant commands.

What's New in Scotch Box 3.0?

PHP 7

A while ago, the PPA for the PHP version that Scotch Box used became obsolete / broken:


This made it a pain to run updates or work with. A lot of people in the GitHub issues found provisioning work-arounds for this, but we finally added the fix directly to the box by upgrading to PHP 7.0.

That also included fixing a ton of other things that come with it like updating the PHP modules.

This makes Scotch Box an absolute beast of a perfect and free development environment.

Clearer Docs

We updated the website to be super clear on versions, features, and common GitHub help wanted questions. Hopefully it is now a way better resource.

New Landing Page

We also updated the page you land on when you first visit. This reflects updates to the Box and adds some additional example code / tests.

So, What About Scotch Box Pro

We are still free / MIT License for the 3.0 version. The free version is still an awesome machine and will allow people to keep going strong without ever needing to upgrade.

The "Pro" version will help support the project, pay for the expensive hosting bill, let you customize some things a bit more, and help this project survive.

If you're super upset about this, in school, working for good causes, hurting for cash, email me at so we can work something out.

I also kind of dropped off the map for 6 months, busy with work/life. When Scotch Box desperately needed an update, I couldn't seem to get around to it. Scotch Box Pro encourages people to create their own boxes based off the build scripts I provide, in case that happens again. This way, if I'm ever unavailable, people can fix, upgrade, and do whatever they want!

Here are the huge improvements:

  • Ubuntu 16.04!
  • PHP 7!
  • MySQL 5.7!
  • NGINX Option!
  • Go lang in the box
  • PHPUnit in the box
  • Yarn
  • Improved email "catching" with MailHog
  • Email "catching" is always on by default
  • Vagrant Share working nicely finally
  • The box's build scripts
  • Customize your own boxes in minutes
  • Generally higher versions of things (Node.js, Ruby, etc.)!

So let's talk about this!

Ubuntu 16.04, PHP 7.0, and MySQL 5.7

You get higher versions generally by default. It's nice to know you're working on a stable system.

For example, the free version uses a custom PPA to get PHP 7.0 working on the box; the Pro version doesn't have to. It's pretty nice and stable, and just works with sudo apt-get install php.

This version also gets nicer / higher versions of things with the newer operating system, such as MySQL 5.7!

Build Scripts

We've finally released the bash script written to build "Scotch Box". The Pro version is provided with tutorials on how to use these and create your own custom box in minutes.

You could do neat things like:

  • Customize the box's welcome message
  • Change passwords / login
  • Add, fix, adjust, change things to your liking
  • Build off this for future projects (such as a production server)
  • Help maintain the project / suggest improvements
NGINX Version

NGINX is really awesome and I know a lot of people who prefer it. I still find Apache a little bit easier to work with, but there's a ton of people who feel differently.

I decided to take extra time and start providing an "NGINX flavor" that comes with the Pro version if you'd rather have that as your web server. It also depends on the project that you're working on.


Thank you for supporting our project. It has grown way larger than I could ever imagine and am glad it has helped so many people. Please check it out, support the project, and let us know if you have any questions!

MySQL binlog administration

If you look into the key elements of replication, then the very basic element is the binary log, or binlog. Over time we have made efforts to improve the management of this quintessential element of replication. To keep up with our rising standards and the requirements coming from the global MySQL community, we have introduced two new features to help you manage your binary logs more efficiently.…

MySQL Shell - Easy scripting

With the introduction of MySQL InnoDB Cluster we also got the MySQL Shell (mysqlsh) interface. The shell offers scripting in JavaScript (default), SQL or Python. This offers a lot more options for writing scripts on MySQL; for example, it is much easier now to use multiple server connections in a single script.
A customer recently asked for a way to compare the transaction sets between servers. That is useful when setting up replication or identifying the server that has most transactions applied already. So I wrote this little script which can be executed from the OS shell:

#!/usr/bin/mysqlsh -f
// it is important to connect to the X protocol port,
// usually it is the traditional port + "0"
var serverA="root:root@localhost:40010"
var serverB="root:root@localhost:50010"
shell.connect(serverA)
var gtidA=session.sql("SELECT @@global.gtid_executed").execute().fetchOne()[0]
shell.connect(serverB)
var gtidB=session.sql("SELECT @@global.gtid_executed").execute().fetchOne()[0]
// If you want to use the pure X DevAPI the former statements should be
// gtid = session.getSchema("performance_schema").getTable("global_variables").select("VARIABLE_VALUE").where("VARIABLE_NAME='gtid_executed'").execute().fetchOne()[0]
println(" ")
println("Transactions that exist only on "+serverA)
println(session.sql("SELECT gtid_subtract('"+gtidA+"','"+gtidB+"')").execute().fetchOne()[0])
println(" ")
println("Transactions that exist only on "+serverB)
println(session.sql("SELECT gtid_subtract('"+gtidB+"','"+gtidA+"')").execute().fetchOne()[0])
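To make gtid_subtract() less opaque, here is a rough Python model of what it computes (a simplification written for illustration; real GTID sets can be huge, so the server works on intervals rather than expanding every transaction id):

```python
def parse_gtid_set(gtid_set):
    """Expand 'uuid:1-5:8' style GTID sets into {uuid: set of txn ids}."""
    out = {}
    for part in filter(None, gtid_set.replace("\n", "").split(",")):
        uuid, _, ranges = part.strip().partition(":")
        ids = out.setdefault(uuid, set())
        for r in ranges.split(":"):
            lo, _, hi = r.partition("-")
            ids.update(range(int(lo), int(hi or lo) + 1))
    return out

def gtid_subtract(a, b):
    """Transactions present in set a but not in set b, grouped per uuid."""
    sa, sb = parse_gtid_set(a), parse_gtid_set(b)
    return {u: sorted(ids - sb.get(u, set()))
            for u, ids in sa.items() if ids - sb.get(u, set())}

print(gtid_subtract("uuidA:1-10", "uuidA:1-5"))  # {'uuidA': [6, 7, 8, 9, 10]}
```

A non-empty difference in one direction only tells you which server is ahead, which is exactly what the script above uses to pick the most up-to-date server.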

MariaDB 10.1.24 and Connector/C 2.3.3 now available

The MariaDB project is pleased to announce the immediate availability of MariaDB 10.1.24, and MariaDB Connector/C 2.3.3. See the release notes and changelogs for details. Download MariaDB 10.1.24 Release Notes Changelog What is MariaDB 10.1? MariaDB APT and YUM Repository Configuration Generator Download MariaDB Connector/C 2.3.3 Release Notes Changelog About MariaDB Connector/C Thanks, and enjoy […]

The post MariaDB 10.1.24 and Connector/C 2.3.3 now available appeared first on

ProxySQL-Assisted Percona XtraDB Cluster Maintenance Mode

In this blog post, we’ll look at how Percona XtraDB Cluster maintenance mode uses ProxySQL to take cluster nodes offline without impacting workloads.

Percona XtraDB Cluster Maintenance Mode

Since Percona XtraDB Cluster offers a high availability solution, it must consider a data flow where a cluster node gets taken down for maintenance (through isolation from a cluster or complete shutdown).

Percona XtraDB Cluster facilitated this by introducing a maintenance mode. Percona XtraDB Cluster maintenance mode reduces the number of abrupt workload failures if a node is taken down using ProxySQL (as a load balancer).

The central idea is to delay the core node action and allow ProxySQL to divert the workload.

How ProxySQL Manages Percona XtraDB Cluster Maintenance Mode

With Percona XtraDB Cluster maintenance mode, ProxySQL marks the node as OFFLINE when a user triggers a shutdown signal (or wants to put a specific node into maintenance mode):

  • When a user triggers a shutdown, the Percona XtraDB Cluster node sets pxc_maint_mode to SHUTDOWN (from the DISABLED default) and sleeps for x seconds (dictated by pxc_maint_transition_period — 10 secs by default). ProxySQL auto-detects this change and marks the node as OFFLINE. With this change, ProxySQL avoids opening new connections for any DML transactions, but continues to service existing queries until pxc_maint_transition_period expires. Once the sleep period is complete, Percona XtraDB Cluster delivers the real shutdown signal, thereby giving ProxySQL enough time to transition the workload.
  • If the user needs to take a node into maintenance mode, the user can simply set pxc_maint_mode to MAINTENANCE. With that, pxc_maint_mode is updated and the client connection updating it goes into sleep for x seconds (as dictated by pxc_maint_transition_period) before giving back control to the user. ProxySQL auto-detects this change and marks the node as OFFLINE. With this change ProxySQL avoids opening new connections for any DML transactions but continues to service existing queries.
  • ProxySQL auto-detects this change in maintenance state and then automatically re-routes traffic, thereby reducing abrupt workload failures.

Technical Details:

  • The ProxySQL Galera checker script continuously monitors the state of individual nodes by checking the pxc_maint_mode variable status (in addition to the existing wsrep_local_state) using the ProxySQL scheduler feature
  • Scheduler is a Cron-like implementation integrated inside ProxySQL, with millisecond granularity.
  • If proxysql_galera_checker detects pxc_maint_mode = SHUTDOWN | MAINTENANCE, then it marks the node as OFFLINE_SOFT.  This avoids the creation of new connections (or workloads) on the node.
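The decision the checker makes can be modeled in a few lines (a hypothetical Python sketch; the real proxysql_galera_checker is a shell script, and only the variable names and statuses come from the text above):

```python
def desired_status(pxc_maint_mode, wsrep_local_state):
    """A node entering SHUTDOWN or MAINTENANCE is taken OFFLINE_SOFT even
    while Galera still reports it as Synced (wsrep_local_state 4)."""
    if pxc_maint_mode in ("SHUTDOWN", "MAINTENANCE"):
        return "OFFLINE_SOFT"
    return "ONLINE" if wsrep_local_state == 4 else "OFFLINE_SOFT"

print(desired_status("DISABLED", 4), desired_status("SHUTDOWN", 4))  # ONLINE OFFLINE_SOFT
```

Note how pxc_maint_mode takes precedence over the Galera state, which is what produces the OFFLINE_SOFT transition in the log excerpt below while wsrep_local_state stays at 4.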

Sample proxysql_galera_checker log:

Thu Dec  8 11:21:11 GMT 2016 Enabling config
Thu Dec  8 11:21:17 GMT 2016 Check server 10: , status ONLINE , wsrep_local_state 4
Thu Dec  8 11:21:17 GMT 2016 Check server 10: , status ONLINE , wsrep_local_state 4
Thu Dec  8 11:21:17 GMT 2016 Check server 10: , status ONLINE , wsrep_local_state 4
Thu Dec  8 11:21:17 GMT 2016 Changing server 10: to status OFFLINE_SOFT due to SHUTDOWN
Thu Dec  8 11:21:17 GMT 2016 Number of writers online: 2 : hostgroup: 10
Thu Dec  8 11:21:17 GMT 2016 Enabling config
Thu Dec  8 11:21:22 GMT 2016 Check server 10: , status ONLINE , wsrep_local_state 4
Thu Dec  8 11:21:22 GMT 2016 Check server 10: , status ONLINE , wsrep_local_state 4
Thu Dec  8 11:21:22 GMT 2016 Check server 10: , status OFFLINE_SOFT , wsrep_local_state 4

Ping us below with any questions or comments.

Sysbench, IO-bound & Intel NUC for MySQL 5.6, 5.7 & 8

In this post I share results from sysbench with an IO-bound workload on Intel NUC servers. My previous post shared results from an in-memory workload. This is part of my work from bug 86215 as I identify CPU performance regressions from MySQL 5.6 to 5.7 and 8.


  • The results here are similar to, but not as bad as, the results from the in-memory workload. This is reasonable because CPU regressions can be hidden by IO-bound tests. But if you scroll to the bottom of this page and look at the graph for the point-query test you will see there are problems even for IO-bound tests.
  • For the i5 NUC most of the regression is from 5.6 to 5.7
  • For the i3 NUC, MySQL 5.7 did better especially on range scans but there is still a regression from 5.6 to 8. From many tests I have run it looks like someone did great work in MySQL 5.7 to make range scans more efficient in InnoDB.
  • For long scans the overhead from the default charset/collation in MySQL 8 is significant compared to latin1/latin1_swedish_ci.

I tested MySQL using upstream 5.6.35, 5.7.17 and 8.0.1. For 8.0.1 I repeated the test: first with the default charset/collation and then with latin1/latin1_swedish_ci.

My servers are described here. The charts below refer to the servers as i3 NUC and i5 NUC. Compared to the i3 NUC, the i5 NUC is newer, has a CPU that is 2X faster, 2X more RAM and storage that is more than 2X faster.

My usage of sysbench, including the tests run, and the my.cnf used for each database engine is described here. The my.cnf files I share were for the i3 NUC. For the i5 NUC the InnoDB buffer pool and IO capacity options were increased using these values. I use the same server for mysqld and the sysbench clients. The binlog is enabled but sync-on-commit is disabled. Sysbench is run with 4 tables and 40M rows per table. The database is larger than RAM, although in future tests I might need to use larger tables for the i5 NUC, which has twice the memory of the i3 NUC.
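For reference, a setup of that shape (4 tables, 40M rows each) might be prepared and driven roughly like this with sysbench 1.0; the lua script path, credentials, thread count and duration are illustrative assumptions, not the exact invocation used for these results:

```
# prepare 4 tables with 40M rows each (paths and credentials are placeholders)
sysbench /usr/share/sysbench/oltp_read_write.lua \
  --mysql-user=sbuser --mysql-password=sbpass \
  --tables=4 --table-size=40000000 prepare

# run one of the tests at low concurrency
sysbench /usr/share/sysbench/oltp_update_index.lua \
  --mysql-user=sbuser --mysql-password=sbpass \
  --tables=4 --table-size=40000000 \
  --threads=1 --time=180 run
```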


I first ran these tests on the i3 NUC and then was curious about performance on a more modern CPU, so I set up a cluster of new i5 NUC servers. I am a big fan of the Intel NUC: I have 6 at home and have yet to trip a circuit breaker, overheat my office or suffer from too much fan noise. The results on the i5 NUC are different than on the i3 NUC. On the i3 NUC MySQL 5.7 has much less of a regression from 5.6, while on the i5 NUC most of the regression is from MySQL 5.6 to 5.7. Perhaps one day I will have time to explain that, but I am too busy right now.

The results are in the order in which tests are run.

There is a small regression from 5.6 to 5.7/8 for update-index on the i5 NUC. This test requires secondary index maintenance as part of the update (read-modify-write of secondary index leaf pages).  The regression for i5 NUC for update-nonindex is larger than for update-index. This test is less IO-bound than update-index because secondary index maintenance is not needed.
The regression for delete is worse on the i5 NUC. In fact, there isn't a regression on the i3 NUC.

The regression for write-only is worse on the i5 NUC and there isn't a regression on the i3 NUC.
For read-write the regression is worse on the i5 NUC. MySQL 5.7 does better than 5.6 on the i3 NUC and that difference is larger for the larger range scan (10,000 row) than the smaller (100 row).
For read-only using latin1/latin1_swedish_ci improves QPS at 1000 and 10,000 row range scans with MySQL 8 compared to the default charset/collation. Again, MySQL 5.7 does great on the i3 NUC and does better as the range scan increases. MySQL 8 has regressions on both the i3 and i5 NUC and that isn't explained by charset.

For point-query the regression on the i3 NUC starts with MySQL 8 and on the i5 NUC starts with MySQL 5.7.
For insert the regression is larger on the i5 NUC.

Kafka Replication from MySQL and Oracle

Hello again everybody.

Well, I promised it a couple of weeks ago, and I’m sorry it has been so long (I’ve been working on other fun stuff in addition to this). But I’m pleased to say that we now have a fully working applier that takes data from an incoming THL stream, whether that is Oracle or MySQL, and converts that into a JSON document and message for distribution over a Kafka topic.

Currently, the configuration is organised with the following parameters:

  • The topic name is set according to the incoming schema and table. You can optionally add a prefix. So, for example, if you have a table ‘invoices’ in the schema ‘sales’, your Kafka topic will be sales_invoices, or if you’ve added a prefix, ‘myprefix_schema_table’.
  • Data is marshalled into a JSON document as part of the message, and the structure is to have a bunch of metadata and then an embedded record. You’ll see an example of this below. You can choose what metadata is included here. You can also choose to send everything on a single topic. I’m open to suggestions on whether it would be useful for this to be configured on a more granular level.
  • The msgkey is composed of the primary key information (if we can determine it), or the sequence number otherwise.
  • Messages are generated one row of source data at a time. There were lots of ways we could have done this, especially with larger single dumps/imports/multi-million-row transactions, but one message per row seemed the most sensible approach. It may mean we get duplicate messages into Kafka, but these are potentially easier to handle than trying to send a massive 10GB Kafka message.
  • Since Zookeeper is a requirement for Kafka, we use Zookeeper to record the replicator status information.

Side note: One way I might consider mitigating that last item (and which may also apply to some of our other upcoming appliers, such as the ElasticSearch applier) is to actually change the incoming THL stream so that it is split into individual rows. This sounds entirely crazy, since it would separate the incoming THL sequence number from the source (MySQL binlog, Oracle, er, other upcoming extractors), but it would mean that we have THL on the applier side which is a single row of data. That means we would have a THL seqno per row of data, but would also mean that in the event of a problem, the replicator could restart from that one row of data, rather than restarting from the beginning of a multi-million-row transaction.
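The topic, key and message construction described above can be sketched in a few lines. The field names follow the sample output later in this post; the helper itself is hypothetical, not Tungsten code:

```python
import json

def build_kafka_message(schema, table, seqno, optype, committime,
                        row, pk_columns=None, topic_prefix=None):
    """Build the (topic, key, value) triple for one source row.

    Topic is schema_table, optionally prefixed. The message key is the
    primary-key value(s) if known, else the THL sequence number.
    """
    topic = f"{schema}_{table}"
    if topic_prefix:
        topic = f"{topic_prefix}_{topic}"
    if pk_columns:
        key = "_".join(str(row[c]) for c in pk_columns)
    else:
        key = str(seqno)
    value = json.dumps({
        "_meta_optype": optype,
        "_meta_committime": committime,
        "_meta_source_schema": schema,
        "_meta_source_table": table,
        "_meta_seqno": str(seqno),
        "record": row,          # the full row embedded under "record"
    })
    return topic, key, value

topic, key, value = build_kafka_message(
    "sbtest", "sbtest", 10130, "INSERT", "2017-05-27 14:27:18.0",
    {"id": "255759", "k": "100", "c": "Base Msg", "pad": "Some other submsg"},
    pk_columns=["id"])
print(topic, key)  # sbtest_sbtest 255759
```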

Anyway, what does it all look like in practice?

Well, here’s a simple MySQL instance and I’m going to insert a row into this table:

mysql> insert into sbtest.sbtest values (0,100,"Base Msg","Some other submsg");

OK, this looks like this:

mysql> select * from sbtest.sbtest where k = 100;
+--------+-----+----------+-------------------+
| id     | k   | c        | pad               |
+--------+-----+----------+-------------------+
| 255759 | 100 | Base Msg | Some other submsg |
+--------+-----+----------+-------------------+

Over in Kafka, let’s have a look what the message looks like. I’m just using the console consumer here:

{"_meta_optype":"INSERT","_meta_committime":"2017-05-27 14:27:18.0","record":{"pad":"Some other submsg","c":"Base Msg","id":"255759","k":"100"},"_meta_source_table":"sbtest","_meta_source_schema":"sbtest","_meta_seqno":"10130"}

And let’s reformat that into something more useful:

{    "_meta_committime" : "2017-05-27 14:27:18.0",    "_meta_source_schema" : "sbtest",    "_meta_seqno" : "10130",    "_meta_source_table" : "sbtest",    "_meta_optype" : "INSERT",    "record" : {       "c" : "Base Msg",       "k" : "100",       "id" : "255759",       "pad" : "Some other submsg"    } }


Woohoo! Kafka JSON message. We’ve got the metadata (and those field names/prefixes are likely to change), but we’ve also got the full record details. I’m still testing other data types and ensuring we get the data through correctly, but I don’t foresee any problems.

There are a couple of niggles still to be resolved:

  • The Zookeeper interface which is used to store state data needs addressing; although it works fine there are some occasional issues with key/path collisions.
  • Zookeeper and Kafka connections are not fully checked, so it’s possible to appear to be up and running when no connection is available.
  • Some further tweaking of the configuration would be helpful – for example, setting or implying specific formats for msg key and the embedded data.

I may add further configurations for other items, especially since longer term we might have a Kafka extractor and maybe we want to use that to distribute records, in which case we might want to track other information like the additional metadata and configuration (SQL mode etc) currently held within the THL. I’ll keep thinking about that though.

Anything else people would like to see here? Please email me at and we’ll sort something out.


DNS Infrastructure at GitHub

At GitHub we recently revamped how we do DNS from the ground up. This included both how we interact with external DNS providers and how we serve records internally to our hosts. To do this, we had to design and build a new DNS infrastructure that could scale with GitHub’s growth and across many data centers.

Previously GitHub’s DNS infrastructure was fairly simple and straightforward. It included a local, forwarding only DNS cache on every server and a pair of hosts that acted as both caches and authorities used by all these hosts. These hosts were available both on the internal network as well as public internet. We configured zone stubs in the caching daemon to direct queries locally rather than recurse on the internet. We also had NS records set up at our DNS providers that pointed specific internal zones to the public IPs of this pair of hosts for queries external to our network.

This configuration worked for many years but was not without its downsides. Many applications are highly sensitive to DNS resolution latency, and any performance or availability issues we ran into would cause queuing and degraded performance at best, and customer-impacting outages at worst. Configuration and code changes can cause large unexpected changes in query rates, so scaling beyond these two hosts became an issue. Due to the network configuration of these hosts, we would just need to keep adding IPs and hosts, which has its own problems. While attempting to firefight and remediate these issues, the old system made it difficult to identify causes due to a lack of metrics and visibility. In many cases we resorted to tcpdump to identify the traffic and queries in question. Another issue was that by running on public DNS servers we ran the risk of leaking internal network information. As a result, we decided to build something better and began to identify our requirements for the new system.

We set out to design a new DNS infrastructure that would improve the aforementioned operational issues, including scaling and visibility, as well as introducing some additional requirements. We wanted to continue to run our public DNS zones via external DNS providers, so whatever system we built needed to be vendor agnostic. Additionally, we wanted this system to be capable of serving both our internal and external zones, meaning internal zones were only available on our internal network unless specifically configured otherwise, and external zones were resolvable without leaving our internal network. We wanted the new DNS architecture to allow both a deploy-based workflow for making changes as well as API access to our records for automated changes via our inventory and provisioning systems. The new system could not have any external dependencies; too much relies on DNS functioning for it to get caught in a cascading failure. This includes connectivity to other data centers and DNS services that may reside there. Our old system mixed the use of caches and authorities on the same host; we wanted to move to a tiered design with isolated roles. Lastly, we wanted a system that could support many data center environments, whether it be EC2 or bare metal.


To build this system we identified three classes of hosts: caches, edges, and authorities. Caches serve as recursive resolvers and DNS “routers”, caching responses from the edge tier. The edge tier, running a DNS authority daemon, responds to queries from the caching tier for zones it is configured to zone transfer from the authority tier. The authority tier serves as the hidden DNS masters and our canonical source for DNS data, servicing zone transfers from the edge hosts as well as providing an HTTP API for creating, modifying or deleting records.

In our new configuration, caches live in each data center, meaning application hosts don’t need to traverse a data center boundary to retrieve a record. The caches are configured to map zones to the edge hosts within their region in order to route our internal zones to our own hosts. Any zone that is not explicitly configured will recurse on the internet to resolve an answer.
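That zone-to-edge mapping might look roughly like the following Unbound stub-zone configuration; the zone name and addresses are placeholders, not GitHub's actual config:

```
# Unbound cache config sketch: route an internal zone to the regional
# edge hosts; anything not listed recurses to the internet as usual.
stub-zone:
    name: "example.internal."
    stub-addr:
    stub-addr:
```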

The edge hosts are regional hosts, living in our network edge PoPs (Points of Presence). Our PoPs have one or more data centers that rely on them for external connectivity; without the PoP, the data center can’t get to the internet and the internet can’t get to it. The edges perform zone transfers with all authorities regardless of what region or location they exist in and store those zones locally on their disk.

Our authorities are also regional hosts, each containing only the zones applicable to its region. Our inventory and provisioning systems determine which regional authority a zone lives in and will create and delete records via an HTTP API as servers come and go. OctoDNS maps zones to regional authorities and uses the same API to create static records and to ensure dynamic sources are in sync. We have an additional separate authority for external domains, such as, to allow us to query our external domains during a disruption to connectivity. All records are stored in MySQL.


One huge benefit of moving to a more modern DNS infrastructure is observability. Our old DNS system had little to no metrics and limited logging. A large factor in deciding which DNS servers to use was the breadth and depth of metrics they produce. We finalized on Unbound for the caches, NSD for the edge hosts and PowerDNS for the authorities, all of which have been proven in DNS infrastructures much larger than at GitHub.

When running in our bare metal data centers, caches are accessed via a private anycast IP resulting in it reaching the nearest available cache host. The caches have been deployed in a rack aware manner that provides some level of balanced load between them and isolation against some power and network failure modes. When a cache host fails, servers that would normally use it for lookups will now automatically be routed to the next closest cache, keeping latency low as well as providing tolerance to some failure modes. Anycast allows us to scale the number of caches behind a single IP address unlike our previous configuration, giving us the ability to run as many caching hosts as DNS demand requires.

Edge hosts perform zone transfers with the authority tier, regardless of region or location. Our zones are not large enough that keeping a copy of all of them in every region is a problem. This means for every zone, all caches will have access to a local edge server with a local copy of all zones even when a region is offline or upstream providers are having connectivity issues. This change alone has proven to be quite resilient in the face of connectivity issues and has helped keep GitHub available during failures that not long ago would have caused customer facing outages.

These zone transfers include both our internal and external zones from their respective authorities. As you might guess zones like are external and zones like are generally internal. The difference between them is only the types of use and data stored in them. Knowing which zones are internal and external gives us some flexibility in our configuration.

$ dig +short

Public zones are sync’d to external DNS providers and are the records GitHub users use every day. Additionally, public zones are completely resolvable within our network without needing to communicate with our external providers. This means any service that needs to look up can do so without needing to rely on external network connectivity. We also use the stub-first configuration option of Unbound, which gives a lookup a second chance: if our internal DNS service is down for some reason, the lookup is retried externally when it fails.

$ dig +short

Most of the zone is completely private, inaccessible from the internet, and only contains RFC 1918 IP addresses. Private zones are split up per region and site. Each region and/or site has a set of sub-zones applicable to that location: sub-zones for the management network, service discovery, specific service records, and yet-to-be-provisioned hosts that are in our inventory. Private zones also include reverse lookup zones for PTRs.


Replacing an old system with a new one that is ready to serve millions of customers is never easy. Using a pragmatic, requirements based approach to designing and implementing our new DNS system resulted in a DNS infrastructure that was able to hit the ground running and will hopefully grow with GitHub into the future.

Want to help the GitHub SRE team solve interesting problems like this? We’d love for you to join us. Apply Here

Sysbench, in-memory & Intel NUC

This continues my series on low-concurrency performance regressions. Here I share results for in-memory sysbench using my home Intel NUC servers. The results were first shared in bug 86215 and the numbers for the charts in this post are here.

  • For the i5 NUC most of the regression is from MySQL 5.6 to 5.7
  • For the i3 NUC MySQL 5.7.17 matches 5.6.35 in many tests and does much better on range scans. Alas I have yet to debug this. But MySQL 8.0.1 is much worse than 5.6.35. At first I only had results for the i3 NUC and assumed the regression was from 5.7 to 8 but now I think that most of the regression comes from MySQL 5.6 to 5.7. Soon I will repeat these tests on servers I use at work that have a better CPU than what I get in the i5 NUC.
  • For long range scans there is more overhead with the default charset/collation in MySQL 8.0.1 and switching back to latin1/latin1_swedish_ci improves QPS.


I tested MySQL using upstream 5.6.35, 5.7.17 and 8.0.1. For 8.0.1 I repeated the test: first with the default charset/collation and then with latin1/latin1_swedish_ci.

My servers are described here. The charts below refer to the servers as i3 NUC and i5 NUC. Compared to the i3 NUC, the i5 NUC is newer, has a CPU that is 2X faster, 2X more RAM and storage that is more than 2X faster.

My usage of sysbench, including the tests run, and the my.cnf used for each database engine is described here. The my.cnf files I share were for the i3 NUC. For the i5 NUC the InnoDB buffer pool and IO capacity options were increased using these values. I use the same server for mysqld and the sysbench clients. The binlog is enabled but sync-on-commit is disabled. Sysbench is run with 4 tables and 1M rows per table. The database fits in the InnoDB buffer pool.


I first ran these tests on the i3 NUC and assumed that the regression was mostly from MySQL 5.6 to MySQL 8. Then I acquired the i5 NUC servers and repeated tests and it looks like more of the regression is from MySQL 5.6 to 5.7. Regardless, I hope we can make this better.

The results are in the order in which the tests are run. For several tests MySQL 5.7.17 does better on the i3 NUC than the i5 NUC, relative to MySQL 8. That is most apparent on the tests that do longer range scans. I thought that was a mistake but the result is repeatable. From past tests there appears to be improvements to range scan performance for InnoDB in MySQL 5.7. Perhaps this is an artifact of that improvement.

The results for update-index are typical. The regression is worse on the i5 NUC.
The regression for update-nonindex is worse than for update-index and worse on the i5 NUC.
The regression for delete is worse on the i5 NUC.
The regression for write-only is worse on the i5 NUC.
For read-write with a 100 row range scan MySQL 5.7.17 does better on the i3 NUC. I didn't debug that. Otherwise the regressions are similar between the i3 NUC and i5 NUC. Switching the charset for MySQL 8.0.1 from utf8mb4 to latin1 has a small impact.
For read-write with a 10,000 row range scan MySQL 5.7.17 does better on the i3 NUC. I didn't debug that. Otherwise the regressions are similar between the i3 NUC and i5 NUC. Switching the charset for MySQL 8.0.1 from utf8mb4 to latin1 has a big impact.
For the read-only tests using latin1/latin1_swedish_ci improves QPS at 1000 and 10,000 row range scans. MySQL 5.7.17 does great on the i3 NUC but has regressions on the i5 NUC. MySQL 8.0.1 has regressions on the i3 and i5 NUC.
For point-query MySQL 5.7.17 does better on the i3 NUC. Otherwise the regressions from MySQL 5.6 to newer releases are similar.
For insert MySQL 5.7.17 does better on the i3 NUC. Otherwise the regressions from MySQL 5.6 to newer releases are similar.

Percona XtraBackup 2.4.7-2 is Now Available

Percona announces the GA release of Percona XtraBackup 2.4.7-2 on May 29, 2017. You can download it from our download site and apt and yum repositories.

Percona XtraBackup enables MySQL backups without blocking user queries, making it ideal for companies with large data sets and mission-critical applications that cannot tolerate long periods of downtime. Offered free as an open source solution, Percona XtraBackup drives down backup costs while providing unique features for MySQL backups.

Bug Fixed:
  • Fixed build failure on Debian 9.0 (Stretch). Bug fixed #1678947.

Release notes with all the bugfixes for Percona XtraBackup 2.4.7-2 are available in our online documentation. Please report any bugs to the launchpad bug tracker.

Webinar May 31, 2017: Online MySQL Backups with Percona XtraBackup

Please join Percona’s solution engineer, Dimitri Vanoverbeke as he presents Online MySQL Backups with Percona XtraBackup on Wednesday, May 31, 2017 at 7:00 am PDT / 10:00 am EDT (UTC-7).

Register Now

Percona XtraBackup is a free, open source, complete online backup solution for all versions of Percona Server, MySQL® and MariaDB®. Percona XtraBackup provides:

  • Fast and reliable backups
  • Uninterrupted transaction processing during backups
  • Savings on disk space and network bandwidth with better compression
  • Automatic backup verification
  • Higher uptime due to faster restore time

This webinar will discuss the different features of Percona XtraBackup, including:

  • Full and incremental backups
  • Compression, streaming, and encryption of backups
  • Backing up to the cloud (swift)
  • Percona XtraDB Cluster / Galera Cluster
  • Percona Server specific features
  • MySQL 5.7 support
  • Tools that use Percona XtraBackup
  • Limitations

Register for the webinar here.

Dimitri Vanoverbeke, Solution Engineer

At the age of seven, Dimitri received his first computer. Since then he has felt addicted to anything with a digital pulse. Dimitri has been active in IT professionally since 2003, when he took various roles from internal system engineering to consulting. Prior to joining Percona, Dimitri worked as a consultant for a leading open source software consulting firm in Belgium. During his career, Dimitri has become familiar with a broad range of open source solutions and with the devops philosophy. Whenever he’s not glued to his computer screen, he enjoys traveling, cultural activities, basketball and the great outdoors.


Short guide on using performance_schema for user & table stats

It took me too long to figure this out while reading the chapter on PS in the MySQL manual. Hopefully this saves me time the next time I need to figure it out. I think it gives me the equivalent of the data I get from IS.user_statistics and IS.table_statistics when using FB MySQL. From a few tests I ran, the overhead from the PS was small (maybe less than 5%) while collecting this data.
  1. Add performance_schema=1 to my.cnf
  2. For table stats: select * from table_io_waits_summary_by_table
  3. For user stats: select * from events_statements_summary_by_account_by_event_name
Update - great advice from Mark Leith for MySQL 5.7 and newer
For table stats see the docs and run: SELECT * FROM sys.schema_table_statistics
For user stats see the docs and run: SELECT * FROM sys.user_summary
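Spelled out with a few of the more useful columns selected (the column lists below are trimmed; see the sys schema docs for the full set):

```
-- per-table I/O statistics
SELECT table_schema, table_name, rows_fetched, rows_inserted,
       rows_updated, rows_deleted
FROM sys.schema_table_statistics;

-- per-user activity summary
SELECT user, statements, statement_latency, table_scans
FROM sys.user_summary;
```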

MySQL on Docker: Swarm Mode Limitations for Galera Cluster in Production Setups

In the last couple of blog posts on Docker, we have looked into understanding and running Galera Cluster on Docker Swarm. It scales and fails over pretty well, but there are still some limitations that prevent it from running smoothly in a production environment. We will be discussing about these limitations, and see how we can overcome them. Hopefully, this will clear some of the questions that might be circling around in your head.

Docker Swarm Mode Limitations

Docker Swarm Mode is tremendous at orchestrating and handling stateless applications. However, since our focus is on trying to make Galera Cluster (a stateful service) run smoothly on Docker Swarm, we have to make some adaptations to bring the two together. Running Galera Cluster in containers in production requires at least:

  • Health check - Each of the stateful containers must pass the Docker health checks, to ensure it achieves the correct state before being included into the active load balancing set.
  • Data persistency - Whenever a container is replaced, it has to be started from the last known good configuration. Else you might lose data.
  • Load balancing algorithm - Since Galera Cluster can handle read/write simultaneously, each node can be treated equally. A recommended balancing algorithm for Galera Cluster is least connection. This algorithm takes into consideration the number of current connections each server has. When a client attempts to connect, the load balancer will try to determine which server has the least number of connections and then assign the new connection to that server.

We are going to discuss all the points mentioned above in greater detail, plus possible workarounds on how to tackle those problems.

Health Check

HEALTHCHECK is a command to tell Docker how to test a container, to check that it is still working. In Galera, the fact that mysqld is running does not mean the node is healthy and ready to serve. Without a proper health check, Galera could be wrongly diagnosed when something goes wrong, and by default, Docker Swarm’s ingress network will include the “STARTED” container into the load balancing set regardless of the Galera state. Otherwise, you would have to manually attach to a MySQL container and check various MySQL statuses to determine if the container is healthy.
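A minimal illustration of wiring this up in a Dockerfile follows. The check command is a placeholder: a real Galera health check should also verify that wsrep_local_state is 4 (Synced), not just that mysqld answers a ping:

```
# Dockerfile excerpt -- illustrative only
HEALTHCHECK --interval=30s --timeout=10s --retries=20 \
  CMD mysqladmin ping -h localhost || exit 1
```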

With HEALTHCHECK configured, container healthiness can be retrieved directly from the standard “docker ps” command:

$ docker ps
CONTAINER ID   IMAGE                       COMMAND          CREATED          STATUS                    PORTS
42f98c8e0934   severalnines/mariadb:10.1   "/ "             13 minutes ago   Up 13 minutes (healthy)   3306/tcp, 4567-4568/tcp

Plus, Docker Swarm’s ingress network will include only the healthy container right after the health check output starts to return 0 after startup. The following table shows the comparison of these two behaviours:

Without HEALTHCHECK, sample output:

Hostname: db_mariadb_galera.2
Hostname: db_mariadb_galera.3
Hostname: ERROR 2003 (HY000): Can't connect to MySQL server on '' (111)
Hostname: db_mariadb_galera.1
Hostname: db_mariadb_galera.2
Hostname: db_mariadb_galera.3
Hostname: ERROR 2003 (HY000): Can't connect to MySQL server on '' (111)
Hostname: db_mariadb_galera.1
Hostname: db_mariadb_galera.2
Hostname: db_mariadb_galera.3
Hostname: db_mariadb_galera.4
Hostname: db_mariadb_galera.1

Applications will see an error because container db_mariadb_galera.4 is introduced incorrectly into the load balancing set. Without HEALTHCHECK, the STARTED container will be part of the “active” tasks in the service.

With HEALTHCHECK, sample output:

Hostname: db_mariadb_galera.1
Hostname: db_mariadb_galera.2
Hostname: db_mariadb_galera.3
Hostname: db_mariadb_galera.1
Hostname: db_mariadb_galera.2
Hostname: db_mariadb_galera.3
Hostname: db_mariadb_galera.4
Hostname: db_mariadb_galera.1
Hostname: db_mariadb_galera.2

Container db_mariadb_galera.4 is introduced correctly into the load balancing set. With a proper HEALTHCHECK, the new container will be part of the “active” tasks in the service only once it is marked as healthy.

The only problem with the Docker health check is that it only supports two exit codes - either 1 (unhealthy) or 0 (healthy). This is enough for a stateless application, where containers can come and go without caring much about the state itself and other containers. With a stateful service like Galera Cluster or MySQL Replication, another exit code is required to represent a staging phase. For example, when a joiner node comes into the picture, syncing is required from a donor node (by SST or IST). This process is started automatically by Galera and may take minutes or hours to complete, and the current workaround is to configure [--update-delay] and [--health-interval * --health-retries] to be higher than the SST/IST time.

For a clearer perspective, consider the following “service create” command example:

$ docker service create \
  --replicas=3 \
  --health-interval=30s \
  --health-retries=20 \
  --update-delay=600s \
  --name=galera \
  --network=galera_net \
  severalnines/mariadb:10.1

The container will be destroyed if the SST process takes more than 600 seconds. While in this state, the health check script will return “exit 1 (unhealthy)” in both the joiner and donor containers, because neither is supposed to be included by Docker Swarm’s load balancer while they are in the syncing stage. After 20 consecutive failures at 30-second intervals (equal to 600 seconds), the joiner and donor containers will be removed by Docker Swarm and replaced by new containers.

It would be perfect if Docker’s HEALTHCHECK could accept more than exit "0" or "1" to signal Swarm’s load balancer. For example:

  • exit 0 => healthy => load balanced and running
  • exit 1 => unhealthy => no balancing and failed
  • exit 2 => unhealthy but ignore => no balancing but running

Thus, we don’t have to determine SST time for containers to survive the Galera Cluster startup operation, because:

  • Joiner/Joined/Donor/Desynced == exit 2
  • Synced == exit 0
  • Others == exit 1
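The proposed three-way mapping could be sketched as follows. This is hypothetical: Docker today only distinguishes 0 and 1, so exit 2 would currently be treated as unhealthy.

```shell
#!/bin/sh
# Map wsrep_local_state_comment to the hypothetical three-way exit
# code proposed above. A sketch of the wish, not a working fix.
galera_exit_code() {
  case "$1" in
    Synced)                        echo 0 ;;  # healthy: load balance
    Joiner|Joined|Donor|Desynced)  echo 2 ;;  # staging: keep, don't balance
    *)                             echo 1 ;;  # unhealthy: fail
  esac
}

galera_exit_code Synced  # -> 0
galera_exit_code Donor   # -> 2
```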

Another workaround, apart from setting [--update-delay] and [--health-interval * --health-retries] higher than the SST time, is to use HAProxy as the load balancer endpoint instead of relying on Docker Swarm’s load balancer. More discussion on that further on.

Data Persistency

Stateless services don’t really care about persistency. A container shows up, serves, and gets destroyed when the job is done or it becomes unhealthy. The problem with this behaviour is that there is a chance of total data loss in Galera Cluster, which is something a database service cannot afford. Take a look at the following example:

$ docker service create \
  --replicas=3 \
  --health-interval=30s \
  --health-retries=20 \
  --update-delay=600s \
  --name=galera \
  --network=galera_net \
  severalnines/mariadb:10.1

So, what happens if the switch connecting the three Docker Swarm nodes goes down? A network partition, which will split the three-node Galera Cluster into 'single-node' components. The cluster state will get demoted to Non-Primary and the Galera node state will turn to Initialized. This situation turns the containers into an unhealthy state according to the health check. If the network is still down after 600 seconds, those database containers will be destroyed and replaced with new containers by Docker Swarm, according to the "docker service create" command. You will end up with a new cluster starting from scratch, and the existing data will be removed.

There is a workaround to protect from this, by using mode global with placement constraints. This is the preferred way when you are running your database containers on Docker Swarm with persistent storage in mind. Consider the following example:

$ docker service create \
    --mode=global \
    --constraint='node.labels.type == galera' \
    --health-interval=30s \
    --health-retries=20 \
    --update-delay=600s \
    --name=galera \
    --network=galera_net \
    severalnines/mariadb:10.1

The cluster size is limited to the number of available Docker Swarm nodes labelled with "type=galera". Dynamic scaling is not an option here; scaling up or down is only possible if you introduce or remove a Swarm node with the correct label. The following diagram shows a 3-node Galera Cluster container with persistent volumes, constrained by the custom node label "type=galera":
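Each Swarm node that should host a Galera container has to carry the label first. A sketch, with hypothetical node names:

```shell
# Label the Swarm nodes that should run Galera containers
# (node1..node3 are hypothetical node names; use "docker node ls" to list yours):
docker node update --label-add type=galera node1
docker node update --label-add type=galera node2
docker node update --label-add type=galera node3
```

With global mode, Swarm then schedules exactly one Galera container on each labelled node.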

It would also be great if Docker Swarm supported more options to handle container failures:

  • Don’t delete the last X failed containers, for troubleshooting purposes.
  • Don’t delete the last X volumes, for recovery purposes.
  • Notify users if a container is recreated, deleted or rescheduled.
Load Balancing Algorithm

Docker Swarm comes with a load balancer, based on the IPVS module in the Linux kernel, which distributes traffic to all containers in round-robin fashion. It lacks several useful configurable options for routing stateful applications, for example persistent connections (so a source always reaches the same destination) and support for other balancing algorithms, like least connection, weighted round-robin or random. Despite IPVS being capable of handling persistent connections via the "-p" option, this doesn’t seem to be configurable in Docker Swarm.
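For comparison, plain IPVS exposes both of those knobs directly through ipvsadm. A sketch only; the virtual IP and backend address are placeholders:

```shell
# Hypothetical addresses: create an IPVS virtual service for MySQL on
# 10.0.0.100:3306 with least-connection scheduling (-s lc) and
# 300-second connection persistence (-p):
ipvsadm -A -t 10.0.0.100:3306 -s lc -p 300
# Register a backend container as a real server (masquerading mode):
ipvsadm -a -t 10.0.0.100:3306 -r 10.0.0.11:3306 -m
```
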

In MySQL, some connections might take a bit longer to process before the output is returned to the client. Galera Cluster load distribution should therefore use the "least connection" algorithm, so the load is distributed evenly across all database containers. The load balancer would ideally monitor the number of open connections for each server and send new connections to the least busy one. Kubernetes defaults to least connection when distributing traffic to the backend containers.
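The "least connection" choice itself is just a minimum over the per-backend connection counts. A toy illustration, with made-up backend names and counts:

```shell
# Given name:connections pairs, print the backend with the fewest
# open connections (numeric sort on the count field).
least_connection() {
    printf '%s\n' "$@" | sort -t: -k2 -n | head -n 1 | cut -d: -f1
}

least_connection galera1:12 galera2:5 galera3:9   # prints: galera2
```

A real balancer tracks these counts live, of course; this only shows the selection step.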

As a workaround, relying on other load balancers in front of the service is still the recommended way. HAProxy, ProxySQL and MaxScale excel in this area. However, you have to make sure these load balancers are aware of dynamic changes to the backend database containers, especially during scaling and failover.


Galera Cluster on Docker Swarm fits well in development, test and staging environments, but it needs some more work when running in production. The technology still needs some time to mature, but as we saw in this blog, there are ways to work around the current limitations.

Tags:  MySQL docker galera swarm cluster limitation

InnoDB Locks Analysis: Why is Blocking Query NULL and How To Find More Information About the Blocking Transaction?

This post was originally published on the MySQL Support Team Blog on 14 April 2017.

Consider the scenario that you execute a query. You expect it to be fast, typically subsecond, but now it does not return until after 50 seconds (innodb_lock_wait_timeout seconds) and then it fails with an error:

mysql> UPDATE world.City SET Population = Population + 999 WHERE ID = 130;
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

You continue to investigate the issue using the sys.innodb_lock_waits view or the underlying Information Schema tables (INNODB_TRX, INNODB_LOCKS and INNODB_LOCK_WAITS).

Note: The above Information Schema tables with lock and lock waits information have been moved to the Performance Schema in MySQL 8.0 as the data_locks and data_lock_waits tables. The sys schema view, however, works the same.

However, when you query the locks information, the blocking query is returned as NULL. What does that mean, and how do you proceed from there to get information about the blocking transaction?

Setting Up an Example

Before proceeding, let's set up an example which will be investigated later in the blog. The example can be set up as follows (do not disconnect Connection 1 when the queries have been executed):

  1. Connection 1:
    Connection 1> START TRANSACTION;
    Query OK, 0 rows affected (0.00 sec)

    Connection 1> UPDATE world.City SET Population = Population + 1 WHERE ID = 130;
    Query OK, 1 row affected (0.00 sec)
    Rows matched: 1  Changed: 1  Warnings: 0

    Connection 1> UPDATE world.City SET Population = Population + 1 WHERE ID = 131;
    Query OK, 1 row affected (0.00 sec)
    Rows matched: 1  Changed: 1  Warnings: 0

    Connection 1> UPDATE world.City SET Population = Population + 1 WHERE ID = 132;
    Query OK, 1 row affected (0.00 sec)
    Rows matched: 1  Changed: 1  Warnings: 0

    Connection 1> UPDATE world.City SET Population = Population + 1 WHERE ID = 133;
    Query OK, 1 row affected (0.00 sec)
    Rows matched: 1  Changed: 1  Warnings: 0
  2. Connection 2 (blocks for innodb_lock_wait_timeout seconds):
    Connection 2> UPDATE world.City SET Population = Population + 999 WHERE ID = 130;
  3. The following output while Connection 2 is still blocking from sys.innodb_lock_waits shows that the blocking query is NULL (slightly reformatted):
    Connection 3> SELECT * FROM sys.innodb_lock_waits\G
    *************************** 1. row ***************************
                    wait_started: 2017-04-15 09:54:53
                        wait_age: 00:00:03
                   wait_age_secs: 3
                    locked_table: `world`.`City`
                    locked_index: PRIMARY
                     locked_type: RECORD
                  waiting_trx_id: 2827
             waiting_trx_started: 2017-04-15 09:54:53
                 waiting_trx_age: 00:00:03
         waiting_trx_rows_locked: 1
       waiting_trx_rows_modified: 0
                     waiting_pid: 5
                   waiting_query: UPDATE world.City SET Populati ... opulation + 999 WHERE ID = 130
                 waiting_lock_id: 2827:24:6:41
               waiting_lock_mode: X
                 blocking_trx_id: 2826
                    blocking_pid: 3
                  blocking_query: NULL
                blocking_lock_id: 2826:24:6:41
              blocking_lock_mode: X
            blocking_trx_started: 2017-04-15 09:54:46
                blocking_trx_age: 00:00:10
        blocking_trx_rows_locked: 4
      blocking_trx_rows_modified: 4
         sql_kill_blocking_query: KILL QUERY 3
    sql_kill_blocking_connection: KILL 3
    1 row in set, 3 warnings (0.00 sec)

    Connection 3> SHOW WARNINGS\G
    *************************** 1. row ***************************
      Level: Warning
       Code: 1681
    Message: 'INFORMATION_SCHEMA.INNODB_LOCK_WAITS' is deprecated and will be removed in a future release.
    *************************** 2. row ***************************
      Level: Warning
       Code: 1681
    Message: 'INFORMATION_SCHEMA.INNODB_LOCKS' is deprecated and will be removed in a future release.
    *************************** 3. row ***************************
      Level: Warning
       Code: 1681
    Message: 'INFORMATION_SCHEMA.INNODB_LOCKS' is deprecated and will be removed in a future release.
    3 rows in set (0.00 sec)
    The warnings only occur in 5.7.14 and later, as the InnoDB lock tables are being moved to the Performance Schema in MySQL 8.0. It is recommended to use the sys.innodb_lock_waits view, as that is updated accordingly in MySQL 8.0.
Investigating Idle Transactions

To investigate idle transactions, you need to use the Performance Schema. First determine the Performance Schema thread id of the blocking transaction. For this you need the blocking_pid; in the above example:

                blocking_pid: 3

and use it to query the performance_schema.threads table:

Connection 3> SELECT THREAD_ID FROM performance_schema.threads WHERE PROCESSLIST_ID = 3;
+-----------+
| THREAD_ID |
+-----------+
|        28 |
+-----------+
1 row in set (0.00 sec)

For the following queries, substitute the thread id found above in the THREAD_ID = … WHERE clauses.

To get the latest query executed, use the performance_schema.events_statements_current table or the sys.session and sys.x$session views:

Connection 3> SELECT THREAD_ID, SQL_TEXT FROM performance_schema.events_statements_current WHERE THREAD_ID = 28;
+-----------+------------------------------------------------------------------+
| THREAD_ID | SQL_TEXT                                                         |
+-----------+------------------------------------------------------------------+
|        28 | UPDATE world.City SET Population = Population + 1 WHERE ID = 133 |
+-----------+------------------------------------------------------------------+
1 row in set (0.00 sec)


Connection 3> SELECT * FROM sys.session WHERE thd_id = 28\G
*************************** 1. row ***************************
                  thd_id: 28
                 conn_id: 3
                    user: root@localhost
                      db: NULL
                 command: Sleep
                   state: NULL
                    time: 447
       current_statement: NULL
       statement_latency: NULL
                progress: NULL
            lock_latency: 117.00 us
           rows_examined: 1
               rows_sent: 0
           rows_affected: 1
              tmp_tables: 0
         tmp_disk_tables: 0
               full_scan: NO
          last_statement: UPDATE world.City SET Population = Population + 1 WHERE ID = 133
  last_statement_latency: 321.06 us
          current_memory: 0 bytes
               last_wait: NULL
       last_wait_latency: NULL
                  source: NULL
             trx_latency: NULL
               trx_state: ACTIVE
          trx_autocommit: NO
                     pid: 7717
            program_name: mysql
1 row in set (0.08 sec)

In this case this does not explain why the lock is held, as the last query updated a different row than the one where the lock issue occurs. However, if the events_statements_history consumer is enabled (it is by default in MySQL 5.7 and later), the performance_schema.events_statements_history table will include the last 10 statements (by default) executed by the connection:

Connection 3> SELECT THREAD_ID, SQL_TEXT FROM performance_schema.events_statements_history WHERE THREAD_ID = 28 ORDER BY EVENT_ID;
+-----------+------------------------------------------------------------------+
| THREAD_ID | SQL_TEXT                                                         |
+-----------+------------------------------------------------------------------+
|        28 | SELECT DATABASE()                                                |
|        28 | NULL                                                             |
|        28 | show databases                                                   |
|        28 | show tables                                                      |
|        28 | START TRANSACTION                                                |
|        28 | UPDATE world.City SET Population = Population + 1 WHERE ID = 130 |
|        28 | UPDATE world.City SET Population = Population + 1 WHERE ID = 131 |
|        28 | UPDATE world.City SET Population = Population + 1 WHERE ID = 132 |
|        28 | UPDATE world.City SET Population = Population + 1 WHERE ID = 133 |
+-----------+------------------------------------------------------------------+
9 rows in set (0.00 sec)

So now the history of the blocking transaction can be seen, and it is possible to determine why the locking issue occurs.

Note: The history also includes some queries executed before the transaction started. These are not related to the locking issue.

If transaction monitoring is also enabled (only available in MySQL 5.7 and later), it is possible to get more information about the transaction and automatically limit the query of the history to the current transaction. Transaction monitoring is not enabled by default. To enable it, use:

mysql> UPDATE performance_schema.setup_consumers SET ENABLED = 'YES' WHERE NAME = 'events_transactions_current';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> UPDATE performance_schema.setup_instruments SET ENABLED = 'YES' WHERE NAME = 'transaction';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

Note: This must be done before either of the transactions is started. Only transactions started after transaction monitoring is enabled will be instrumented.

If the above was enabled before the blocking transaction started, you can get more details about the blocking transaction:

Connection 3> SELECT * FROM performance_schema.events_transactions_current WHERE THREAD_ID = 28\G
And to get the statement history of the transaction:

Connection 3> SELECT t.THREAD_ID, s.SQL_TEXT
                FROM performance_schema.events_transactions_current t
                     INNER JOIN performance_schema.events_statements_history s
                        ON s.THREAD_ID = t.THREAD_ID
                       AND s.NESTING_EVENT_ID = t.EVENT_ID
               WHERE t.THREAD_ID = 28
               ORDER BY s.EVENT_ID;
+-----------+------------------------------------------------------------------+
| THREAD_ID | SQL_TEXT                                                         |
+-----------+------------------------------------------------------------------+
|        28 | UPDATE world.City SET Population = Population + 1 WHERE ID = 130 |
|        28 | UPDATE world.City SET Population = Population + 1 WHERE ID = 131 |
|        28 | UPDATE world.City SET Population = Population + 1 WHERE ID = 132 |
|        28 | UPDATE world.City SET Population = Population + 1 WHERE ID = 133 |
+-----------+------------------------------------------------------------------+
4 rows in set (0.00 sec)