Planet MySQL

What Should I Monitor, and How Should I Do It?

Monitoring tools offer two core types of functionality: alerts based on aliveness checks and on comparing metrics to thresholds, and displaying time-series charts of status counters. Nagios and Graphite are the prototypical tools for these two jobs: Nagios for checks and alerts, Graphite for time-series charts.

But these tools don’t answer the crucial questions about what we should monitor. What kinds of aliveness/health checks should we build into Nagios? Which metrics should we monitor with thresholds to raise alarms, and what should the thresholds be? Which graphs of status counters should we build, which should we examine, and what do they mean?
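A threshold-based health check of the sort described above boils down to comparing a sampled metric against warning/critical limits and returning a Nagios-style exit code. A minimal, hypothetical sketch (the metric names and threshold values are illustrative only; in practice the value would be sampled from SHOW GLOBAL STATUS):

```shell
# Nagios-style threshold check: returns 0 (OK), 1 (WARNING) or 2 (CRITICAL)
# and prints a one-line status message, as Nagios plugins conventionally do.
check_threshold() {
    metric=$1; value=$2; warn=$3; crit=$4
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - $metric=$value (>=$crit)"; return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - $metric=$value (>=$warn)"; return 1
    fi
    echo "OK - $metric=$value"; return 0
}

check_threshold Threads_connected 42 100 200   # prints "OK - Threads_connected=42"
```

The hard part, as the webinar argues, is not the plumbing above but choosing which metrics deserve thresholds at all.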

We need guiding principles to help answer these questions. This webinar briefly introduces the principles that motivate and inform what we do at VividCortex, then explains which types of health checks and charts are valuable and what conclusions should be drawn from them. The webinar is focused mostly on MySQL database monitoring, but will be relevant beyond MySQL as well. Some of the questions we answer are:

  • What status counters from MySQL are central and core, and which are peripheral?
  • What is the meaning of MySQL status metrics?
  • Which subsystems inside MySQL are the most common causes of problems in production?
  • What is the unit of work-getting-done in MySQL, and how can you measure it?
  • Which open-source tools do a good job at monitoring in the way we recommend at VividCortex?
  • Which new and/or popular open-source tools should you evaluate when choosing a solution?

You will leave this webinar with a solid understanding of the types of monitoring you should be doing, the low-hanging fruit, and tools for doing it. This is not just a sales pitch for VividCortex. Register below, and we will send you a link to the recording and a copy of the slide deck.



Second day with InnoDB transparent page compression

My first day with InnoDB transparent page compression didn't turn out OK, but I am an optimist so onto day 2. I gave up trying to use it on a host with CentOS 6.6, XFS and a 3.10.53 kernel. I had a bit more luck with a Fedora 19 host using XFS and a 3.14.27 kernel and I ran linkbench.

You can follow along here or at bug 78277.

I configured linkbench with maxid1=100M and the database was about 100GB after the load. I used MySQL 5.7.8 with transparent page compression. I won't share the performance results just yet, but at the end of 7 days of the query test I shut down mysqld and my storage device (400G Intel s3700) had 180G of free space. I confirmed that I was able to use at least 176 of the 180 GB, which was a good sign. However, the device was only able to write at ~60 MB/second while making copies of a 2G test file. File copies were writing at ~200 MB/second on a second host that has the same hardware and the same amount of free space, but didn't use hole-punch for compression. So this is one more cost of using hole-punch: it makes the filesystem much slower. But I don't blame XFS. I don't think that doing a hole-punch on every small write was the intended use for that feature.
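For context on why free space grew while the filesystem suffered: hole-punching makes files sparse, so their apparent and allocated sizes diverge. A minimal sketch of sparseness (using truncate to create the hole, rather than the per-page FALLOC_FL_PUNCH_HOLE calls InnoDB issues; the file is a throwaway temp file):

```shell
# Create a 1 MiB file that is entirely a hole. The apparent size is 1 MiB,
# but on filesystems with sparse-file support few or no blocks are allocated.
# InnoDB's transparent compression punches holes like this at the tail of
# every compressed 16KB page, which is why free space appears after a dump
# even though 'ls' sizes look unchanged.
f=$(mktemp)
truncate -s 1M "$f"
stat -c 'apparent bytes: %s' "$f"    # apparent bytes: 1048576
stat -c 'allocated blocks: %b' "$f"  # typically 0 or very few on XFS/ext4
rm -f "$f"
```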

Then I ran DROP DATABASE for the database used by linkbench. It contained 3 unpartitioned tables; two were ~60G and one was ~10G. The drop took 8 minutes. While it ran there were many warnings like this:
[Warning] InnoDB: Trying to close/delete/truncate tablespace 'linkdb/linktable' but there are 1 pending operations on it.
[Warning] InnoDB: Trying to close/delete/truncate tablespace 'linkdb/linktable' but there are 1 pending operations on it.

Then it got more exciting: a long semaphore wait, courtesy of a slow unlink. Alas, the drop finished before the background thread killed mysqld.
InnoDB: A long semaphore wait:--Thread 139616032126720 has waited at line 830 for 241.00 seconds the semaphore:
S-lock on RW-latch at 0x327b74a8 created in file line 1153
a writer (thread id 139616238388992) has reserved it in mode  exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file line 830
Last time write locked in file /home/mdcallag/b/mysql-5.7.8-rc/storage/innobase/row/ line 4202
InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:


COLLABORATE16 IOUG – Call For Papers

There’s so many ways to proceed
To get the knowledge you need
One of the best
Stands out from the rest
COLLABORATE16 – indeed!

Why not be part of the show
By sharing the stuff that you know
Got something to say
For your colleagues each day
Call for papers –> let’s go

I believe many of you would agree that regardless of how insignificant you believe your corner of Oracle technology may be, everyone has something to say. I attended my first show in Anaheim CA USA in 1990 and started presenting at shows the year after in Washington DC USA. It's not hard to get over the hump, moving from "I would love to present a paper at a show but I just don't have the koyich" to "wow, that was fun." The only way you will ever get the strength is to do it (and do it and do it …).

Some suggestions for getting started …

  1. Co-present with a colleague
  2. Collaborate through paper and slides development WITH your colleague rather than parcel off portions to one another then merge at the end.
  3. Be cautious of trying to cover too much in too little time (I once attended a session at IOUW [a pre-cursor to COLLABORATE] where the presenter had over 400 slides to cover in 45 minutes).
  4. Ask for assistance from seasoned presenters (mentor/protégé type relationship).
  5. Go slowly at first and set yourself some realistic but aggressive goals.

The experience of presenting at shows is rewarding and I, for one, do it as much as I can … recent talks include Ensuring Your Physical Standby is Usable and Time to Upgrade to 12c (posting of 2015 presentation details pending).

The confidence gained, personal koyich, and rewards of presenting at events are lifelong and can help propel your career into the ionosphere. Speaking of confidence, 20 months ago I started playing bridge. Now look where my experience presenting at shows and writing for Oracle Press got me … check this out :).

Surprises surprises abound
With the new confidence found
Presenting is great
Get now on your plate
In no time you’ll be so renowned


Discover more about our expertise in the world of Oracle.


Percona Toolkit and systemd

After some recent work with systemd I've realized its power, and I can come clean: I am a fan. I realize that there are multitudes of posts out there arguing both for and against systemd, but let's look at some nice ways to have systemd provide us with (but not limited to) pt-kill-as-a-service.

This brief post introduces you to a systemd unit file and how we can leverage it to enable pt-kill at startup, have it start after mysqld, and ensure that MySQL is running by making the mysql service a dependency of pt-kill. By using systemd to handle this we don't have to complicate matters by 'monitoring the monitor' with hacky shell scripts, cron or utilities like monit.

So then, a quick primer on systemd, because let's face it, we've all been avoiding it. Systemd is not new, but it made recent headlines in the Linux world due to some of the major distros announcing their intentions to migrate upcoming releases to systemd.

What is it? Well, due to its depth it is best described as a suite of management daemons, libraries and tools that will replace the traditional init scripts. So essentially, remember how you start a service, mount a volume or read the system logs? Well, start forgetting all of that, because systemd is disrupting this space. With systemd comes some really neat tricks for administering your machines, and I'm really only beginning to see the tip of this iceberg. There is admittedly a lot to learn with systemd, but this should serve as a pragmatic entrée.

Systemd what? When did this happen? Here's when each major distribution shipped systemd as its default:

  • Arch Linux: October 2012
  • CoreOS: October 2013 (v94.0.0)
  • Debian: April 2015 (v8, aka jessie)
  • Fedora: May 2011 (v15)
  • Gentoo Linux: N/A
  • Mageia: May 2012 (v2.0)
  • openSUSE: September 2012 (v12.2)
  • Red Hat Enterprise Linux: June 2014 (v7.0)
  • Slackware: N/A
  • SUSE Linux Enterprise Server: October 2014 (v12)
  • Ubuntu: April 2015 (v15.04)

Lennart Poettering, the name most frequently attached to systemd, is seeking to modernize the most fundamental processes of the Linux startup system, bringing in the paradigms of modern computing: concurrency, parallelism and efficiency. The dependency tree of processes and services is more intuitive, and the structure of the underlying startup scripts is unified. I feel that the direction proposed by systemd is an evolutionary one which promotes consistency within the startup scripts, enabling conventions that can be more easily understood by a broader audience.

Systemd and Percona Toolkit

This post aims to show that we can rely on systemd to handle processes such as pt-kill, pt-stalk, and other daemonized scripts that we want running perpetually, started at boot, and reinstated after failure.

The scenario is this: I want pt-kill to drop all sleeping connections from a certain application user, let's call them 'caffeinebob', because they never close connections. Due to various reasons we can't make changes in the application, so we're employing a Percona Toolkit favourite, pt-kill, to do this for us. For convenience we want this behaviour to persist across server restarts. In the olden days we might have some cron job that fires a shell script in combination with a sentinel file to ensure it's running. I'm pretty sure that this kitty could be skinned many ways.

The systemd Unit File

After some research and testing, the below unit file will play nicely on a CentOS 7 node with systemd at its core. In this example I am running Percona Server 5.6 installed using Percona's yum repo, with the mysql.service unit file generated at installation. I suspect that there could be some systemd deviation with other MySQL variants; however, this configuration is working for me.

[Unit]
Description = pt-kill caffeinebob
After = mysql.service
Requires = mysql.service

[Service]
Type = simple
PIDFile = /var/run/
ExecStart = /usr/bin/pt-kill --daemonize --pid=/var/run/ --interval=5 --defaults-file=/root/.my.cnf --log=/var/log/ptkill.log --match-user caffeinebob --busy-time 10 --kill --print
Restart = on-abort

[Install]
WantedBy = multi-user.target

Let’s examine the above and see what we’re working with. Systemd unit files come in several types; the example above is a simple Service unit file. This means we are enacting a process controlled and supervised by systemd. The significance of the After directive is that this service will not attempt startup until the units it lists, including mysql.service, have been called. The Requires directive makes ptkill.service dependent on mysql.service starting up successfully.

The next part, the [Service] grouping, details the actions to be taken by the service. The Type can be one of many, but as it’s a simple call to a script I’ve used the simple type. We are describing the command and the handling of it. The ExecStart is evidently the pt-kill command that we would usually run from the shell prompt or from within a shell script. This is a very coarse example, because we could opt to parameterize the command with the assistance of an Environment file. Note the use of the Restart directive, used so that systemd can react should a failure interrupt the process.
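As a hypothetical sketch of that Environment-file parameterization (the file path and variable names here are my assumptions, not part of the setup above):

```
# /etc/sysconfig/ptkill -- hypothetical EnvironmentFile holding the tunables
PTKILL_MATCH_USER=caffeinebob
PTKILL_BUSY_TIME=10

# The [Service] section could then reference it like this:
# EnvironmentFile=/etc/sysconfig/ptkill
# ExecStart=/usr/bin/pt-kill --daemonize --match-user ${PTKILL_MATCH_USER} --busy-time ${PTKILL_BUSY_TIME} --kill --print
```

This keeps the unit file generic while the per-host tunables live in one small file, which suits the orchestration workflow mentioned later.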

Finally, under the [Install] grouping we’re telling systemd that this service should start up on a multi-user system, which could be thought of as runlevel 2 or 3 (multi-user mode).

So, providing that we’ve got all the relevant paths, users and dependencies in place: once you reboot your host, mysql.service will initiate mysqld, and once that dependency is met systemd will initiate pt-kill with our desired parameters to cull connections that meet the criteria stipulated in our configuration. This means you rely on systemd to manage pt-kill for you, and you don’t need to remember to start this or similar processes when you restart your node.

Start up & enable

Now, to invoke our service manually and enable it to run at start up, we run the following systemctl commands:

[moore@localhost ~]$ sudo systemctl start ptkill.service
[moore@localhost ~]$ sudo systemctl enable ptkill.service

No feedback, but no errors either, so let's check the status of the service:

[moore@localhost ~]$ sudo systemctl status ptkill -l
ptkill.service - keep pt-kill persistent across restarts
   Loaded: loaded (/etc/systemd/system/ptkill.service; enabled)
   Active: active (running) since Wed 2015-08-12 02:39:13 BST; 1h 19min ago
 Main PID: 2628 (perl)
   CGroup: /system.slice/ptkill.service
           └─2628 perl /usr/bin/pt-kill --daemonize --pid=/var/run/ --interval=5 --defaults-file=/root/.my.cnf --log=/var/log/ptkill.log --match-user caffeinebob --busy-time 10 --kill --print

Perfect. We can also instruct systemd to disable and/or stop our service once the application is changed and caffeinebob closes all those open connections:

[moore@localhost ~]$ sudo systemctl stop ptkill.service
[moore@localhost ~]$ sudo systemctl disable ptkill.service

Now, after successful implementation, we see that our process is running delightfully:

[moore@localhost ~]$ ps -ef | grep pt-kill
root      2547     1  0 02:37 ?        00:00:00 perl /usr/bin/pt-kill --daemonize --pid=/var/run/ --interval=5 --defaults-file=/root/.my.cnf --log=/var/log/ptkill.log --match-user caffeinebob --busy-time 10 --kill --print

Catch me if I fall

Let's issue a kill signal to the process and observe its behaviour using journalctl:

[moore@localhost ~]$ sudo kill -SEGV 2547

This will write entries similar to the following into the system log:

[moore@localhost ~]$ sudo journalctl -xn -f
Aug 12 02:39:13 localhost.localdomain sudo[2624]: moore : TTY=pts/1 ; PWD=/home/moore ; USER=root ; COMMAND=/bin/kill -SEGV 2547
Aug 12 02:39:13 localhost.localdomain systemd[1]: ptkill.service: main process exited, code=killed, status=11/SEGV
Aug 12 02:39:13 localhost.localdomain systemd[1]: Unit ptkill.service entered failed state.
Aug 12 02:39:13 localhost.localdomain systemd[1]: ptkill.service holdoff time over, scheduling restart.
Aug 12 02:39:13 localhost.localdomain systemd[1]: Stopping keep pt-kill persistent across restarts...
-- Subject: Unit ptkill.service has begun shutting down
-- Defined-By: systemd
-- Support:
--
-- Unit ptkill.service has begun shutting down.
Aug 12 02:39:13 localhost.localdomain systemd[1]: Starting keep pt-kill persistent across restarts...
-- Subject: Unit ptkill.service has begun with start-up
-- Defined-By: systemd
-- Support:
--
-- Unit ptkill.service has begun starting up.
Aug 12 02:39:13 localhost.localdomain systemd[1]: Started keep pt-kill persistent across restarts.
-- Subject: Unit ptkill.service has finished start-up
-- Defined-By: systemd
-- Support:
--
-- Unit ptkill.service has finished starting up.
--
-- The start-up result is done.

pt-kill flaps after the kill signal, but systemd has been instructed to restart it on failure, so we don't see caffeinebob saturate our processlist with sleeping connections.

Another bonus with this workflow is use within orchestration. Any standardized unit files can be propagated to your fleet of hosts with tools such as Ansible, Chef, Puppet or Saltstack.

Closing note

I’d love to hear from the pragmatists from the systemd world to understand if this approach can be improved or whether there are any flaws in this example unit file that would require addressing. This is very much a new-school of thought for me and feedback is both welcome and encouraged.

Thank you for your time, happy systemd-ing.

The post Percona Toolkit and systemd appeared first on MySQL Performance Blog.


Amazon RDS Migration Tool

Amazon has just released their RDS Migration Tool, and Pythian has recently undertaken training to use it for our clients. I wanted to share my initial thoughts on the tool, give some background on its internals, and provide a walk-through of the functionality it will be most commonly used for.

There are many factors to consider when evaluating cloud service providers, including cost, performance, and high availability and disaster recovery options. One of the most critical and overlooked elements of any cloud offering though, is the ease of migration. Often, weeks are spent evaluating all of the options only to discover after the choice is made that it will take hours of expensive downtime to complete the migration, and that there is no good rollback option in the case of failure.

In order to reduce the friction inherent in the move to a DBaaS offering, Amazon has developed an RDS Migration tool. This is an in-depth look at this new tool, which will be available after September 1, 2015. Contact Pythian to start a database migration.

With the introduction of the RDS Migration tool, Amazon has provided a powerful engine capable of handling much more than basic migration tasks. It works natively with Oracle, SQL Server, Sybase, MySQL, PostgreSQL, Redshift (target only), Aurora (target only), and provides an ODBC connector for all other source systems. The engine is powerful enough to handle fairly complex transformations and replication topologies; however, it is a migration tool and isn’t intended for long-term use.


Amazon’s RDS Migration Tool architecture is very simple. It consists of your source system, an AWS VM with the Migration Tool installed on it, and the target RDS instance.

Each migration is broken up into Tasks. Within a Task, a source and target database are defined, along with the ability to transform the data, filter the tables or data being moved, and perform complex transformations.

Tasks can be scheduled to run at particular times, can be paused and resumed, and can alert on success or failure. It’s important to note that if a task is paused while a table is loading, that table will be reloaded completely from the beginning when the task resumes.

Within a running task, the following high-level steps are performed:
• Data is pulled from the source using a single thread per table
• Data is converted into a generic data type
• All transformations are applied
• Data is re-converted into the target system’s datatype and inserted
• After the initial load, if specified, the tool monitors for updates to data and applies them in near real-time

While processing the data, each table has a single thread reading from it, and any updates are captured using the source system’s native change data capture utility. Changes are not applied until after the initial load is completed. This is done to avoid overloading the source system, where it’s assumed client applications will still be running.

Performance Considerations

There are several factors which might limit the performance seen when migrating a database.

Network Bandwidth
Network bandwidth is probably the biggest contributor to performance issues across data centers, and there is no magic button when moving to RDS. If the database is simply too big or too busy for the network to handle the data being sent across, then other options may need to be explored or used in conjunction with this tool.

Some workarounds to consider when network performance is slow include:
• Set up AWS Direct Connect
• Use a bulk-load utility, and then use the tool to catch up on transactions
• Only migrate data from a particular point in time

RDS Migration Tool Server CPU
The migration tool converts all data into a common data type before performing any transformations, then converts them into the target database’s data type. This is obviously very heavy on the server’s CPU, and this is where the main performance bottlenecks on the server are seen.

Capacity of Source database
This tool uses a single SELECT statement to migrate the data, and then returns for any changed data after the initial bulk load is completed. On a busy system, this can be a lot of undo and redo data to migrate, and the source system needs to be watched closely to ensure the log files don’t grow out of control.

Capacity of Target database
In the best case scenario, this will be the limiter as it means all other systems are moving very fast. Amazon does recommend disabling backups for the RDS system while the migration is running to minimize logging.


The following walkthrough looks at the below capabilities of this tool in version 1.2:

• Bulk Data Migration to and from the client’s environment and Amazon RDS
• Near Real-Time Updates to data after the initial load is completed
• The ability to transform data or add auditing information on the fly
• Filtering capabilities at the table or schema level

You will need to have set up network access to your databases for the RDS Migration Tool.

1. After confirming access with your account manager, access the tool by opening the AWS console, selecting EC2, and choosing AMIs.

2. Select the correct AMI and build your new VM. Amazon recommends an M4.large or M4.xlarge.

3. After building the new VM, you will need to install the connectors for your database engine. In this example, we’ll be using Oracle Instant Client and MySQL ODBC Connector 5.2.7.

  • For the SQL Server client tools, you will need to stop the Migration services before installing.

4. Access the Migration Tool

  • Within VM: http://localhost/AmazonRDSMigrationConsole/
  • Public URL: https://[VM-DNS]/AmazonRDSMigrationConsole/
    • Username/Password is the Administrator login to the VM

5. The first screen after logging in displays all of your current tasks and their statuses.

6. Clicking on the Tasks menu in the upper-left corner will bring up a drop-down menu to access Global Settings. From here, you can set Notifications, Error Handling, Logging, etc…

7. Back on the Tasks menu, click the Manage Databases button to add the source and target databases. As mentioned earlier, this walkthrough will be an Oracle to Aurora migration. Aurora targets are a MySQL database for the purposes of this tool.

8. After defining your connections, close the Manage Databases pop-up and select New Task. Here, you can define if the task will perform a bulk-load of your data and/or if it will attempt to apply changes made.

9. After closing the New Task window, simply drag & drop the source and target connectors into the task.

10. By selecting Task Settings, you can now define task level settings such as number of threads, truncate or append data, and define how a restart is handled when the task is paused. You can also override the global error handling and logging settings here.

  • The best practice recommendation is to find the largest LOB value in your source database and set that as the max LOB size in the task. Setting this value allows the task to optimize LOB handling, and will give the best performance.

11. Select the Table Selection button to choose which tables will be migrated. The tool uses wildcard searches to allow any combination of tables to exclude or include. For example, you can:

  • Include all tables in the database
  • Include all tables in a schema or set of schemas
  • Exclude individual tables and bring over all remaining tables
  • Include individual tables and exclude all remaining tables

The tool has an Expand List button which will display all tables that will be migrated.

In this screenshot, all tables in the MUSER08 schema that start with T1 will be migrated, while all tables that start with T2 will be excluded EXCEPT for the T20, T21, T22, & T23 tables.
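That include/exclude precedence can be mimicked with shell glob patterns. This function is purely illustrative and is not part of the migration tool; it just encodes the rule set from the screenshot (explicit includes beat the T2 exclusion):

```shell
# Emulate the walkthrough's selection rules for the MUSER08 schema:
# include T1*, exclude T2*, but re-include T20-T23.
is_migrated() {
    case "$1" in
        T20|T21|T22|T23) return 0 ;;  # explicit includes win over the exclusion
        T2*)             return 1 ;;  # excluded by the T2 wildcard
        T1*)             return 0 ;;  # included by the T1 wildcard
        *)               return 1 ;;  # everything else is not selected
    esac
}

for t in T1 T15 T20 T25; do
    if is_migrated "$t"; then echo "$t migrated"; else echo "$t skipped"; fi
done
# prints: T1 migrated / T15 migrated / T20 migrated / T25 skipped
```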

12. After defining which tables will be migrated, select an individual table and choose the Table Settings button. Here you can add transformations for the individual tables, add new columns or remove existing ones, and filter the data that is brought over.

In this screenshot, the T1 table records will only be brought over if the ID is greater than or equal to 50 and the C1 column is LIKE 'Migrated%'.

13. Select the Global Transformations button. Like the table selection screen, you use wildcards to define which tables these transformations will be applied to.
You can:

  • Rename the schema
  • Rename the table
  • Rename columns
  • Add new columns
  • Drop existing columns
  • Change the column data types

In this screenshot, a new column named MigratedDateTime will be created on all tables and populated with the current DateTime value.

14. Finally, save the task and choose Run. This will kick off the migration process and bring up the Monitoring window. From here, you can see the current task’s status, notifications, and errors, as well as get an idea of the remaining time.


Introducing mysqlpump

Starting with MySQL 5.7.8, we are shipping a new client utility called mysqlpump that performs logical backups, producing a set of SQL statements that can be run to reproduce the original schema objects and table data. The goal of mysqlpump is to have a modern utility that is extendable and has native support for parallelization. We felt that the best way to achieve this was to write an entirely new tool where we would be free to break compatibility with mysqldump and where we would not be forced to implement some of the legacy functionality that it provides. mysqlpump executes all of the operations related to dumping multiple databases and the objects inside them in parallel in order to drastically reduce the total time needed to perform a logical backup.

Backup Examples

Here is the command you would use to do a basic backup of all databases (by default mysqlpump will dump all databases):

mysqlpump --user=root --password > full_backup.sql
# OR
mysqlpump --all-databases --user=root --password > full_backup.sql

Note: mysqlpump will not dump the following special internal databases by default: PERFORMANCE_SCHEMA, INFORMATION_SCHEMA, SYS_SCHEMA, ndbinfo. mysqlpump will also not dump the following system tables from the mysql schema unless they are explicitly requested using the --include-tables option: user, db, tables_priv, columns_priv, procs_priv, proxies_priv, event, proc, apply_status, schema, general_log, slow_log.

mysqlpump divides the dump process into several sub-tasks and then adds these sub-tasks to a multi-threaded queue. This queue is then processed by N threads (2 by default) in order to allow the work to be done in parallel. Each thread makes a connection to the MySQL server to retrieve all necessary information and then begins its work. The number of threads used can be configured using the --default-parallelism and --parallel-schemas options.

To backup all databases with 4 threads:

mysqlpump --user=root --password --default-parallelism=4 > full_backup.sql

To create 2 worker queues, 1 queue to process databases db1,db2 and 1 queue to process all others with 3 threads per queue (note that by default 2 threads will be created per queue in order to complete the dump tasks):

mysqlpump --user=root --password --parallel-schemas=db1,db2 --default-parallelism=3 > full_backup.sql

To spawn 5 threads to work on the first queue in which db1,db2 will be processed and 3 threads to work on the default queue for rest of the databases:

mysqlpump --user=root --password --parallel-schemas=5:db1,db2 --default-parallelism=3 > full_backup.sql

To spawn 5 threads to work on the first queue in which db1,db2 will be processed, 2 threads to work on the second queue to process db3,db4 and 3 threads to work on default queue used for all other databases:

mysqlpump --user=root --password --parallel-schemas=5:db1,db2 --parallel-schemas=2:db3,db4 --default-parallelism=3 > full_backup.sql

To backup only the ‘accounts’ and ‘inventory’ databases/schemas:

mysqlpump --databases accounts inventory --user=root --password > partial_backup.sql

If we want to export only the metadata information and skip all the data associated with the tables then we can use the --skip-dump-rows option:

mysqlpump --databases accounts inventory --skip-dump-rows --user=root --password > partial_backup.sql

This example would only dump databases/schemas db1 and db2:

mysqlpump --user=root --password --include-databases=db1,db2 --result-file=db1_db2_backup.sql

This example would dump all databases/schemas with a name starting with 'db':

mysqlpump --user=root --password --include-databases=db% --result-file=all_db_backup.sql

This example would dump all databases/schemas except db1 and db2:

mysqlpump --user=root --password --exclude-databases=db1,db2 --result-file=partial_backup.sql

This example would dump all tables from all databases/schemas except a table named ‘t’ that may be present within any database/schema:

mysqlpump --user=root --password --exclude-tables=t --result-file=partial_backup.sql

This example would dump all tables from all databases/schemas except table names matching the ‘__user‘ regexp pattern in any database/schema (dbuser, STuser, 45user, etc.):

mysqlpump --user=root --password --exclude-tables=__user --result-file=partial_backup.sql

This example would dump only events with name 'ev2' and routines with name 'p1' from all databases/schemas, excluding the mysql system database/schema:

mysqlpump --user=root --password --include-events=ev2 --include-routines=p1 --exclude-databases=mysql --result-file=partial_backup.sql

This example would only dump the users present in mysql.user table:

mysqlpump --user=root --password --exclude-databases=% --users

This example would dump all users present in the mysql.user table except the root user(s):

mysqlpump --user=root --password --users --exclude-users=root

By default mysqlpump will also show the dump progress. For example:

mysqlpump --user=root --password > full_backup.sql
Dump progress: 0/1 tables, 6/6 rows
Dump progress: 4/6 tables, 53964/1867889 rows
Dump progress: 4/6 tables, 109464/1867889 rows
Dump progress: 4/6 tables, 173714/1867889 rows
Dump progress: 4/6 tables, 252464/1867889 rows
Dump progress: 4/6 tables, 316464/1867889 rows
Dump progress: 4/6 tables, 332964/1867889 rows
....
Dump completed in 40554 milliseconds

Restore Example

Importing and restoring the data is easy. To import the generated backup file into a new instance of MySQL, you can simply use the mysql command-line client to import the data:

mysql -uroot -p < partial_backup.sql

Current Limitations

mysqlpump currently tries to do as much work in parallel as possible and the dump threads lack a synchronization point before they start backing up the data. This makes it currently unsafe as a general purpose backup replacement (i.e. fully replacing mysqldump). We will be adding consistent backup related features and much more in upcoming versions.


For additional details on the development work so far, please see WL#7755.

We look forward to hearing from MySQL users as you begin trying out this new tool for creating logical dumps and backups! If you encounter any problems with this new tool—or you have ideas for new features—please let us know here in the comments, open a bug report / feature request at, or open a support ticket.

As always, THANK YOU for using MySQL!


Percona Server 5.5.45-37.4 is now available

Percona is glad to announce the release of Percona Server 5.5.45-37.4 on September 2, 2015. Based on MySQL 5.5.45, including all the bug fixes in it, Percona Server 5.5.45-37.4 is now the current stable release in the 5.5 series.

Percona Server is open-source and free. Details of the release can be found in the 5.5.45-37.4 milestone on Launchpad. Downloads are available here and from the Percona Software Repositories.

Bugs Fixed:

  • Querying INFORMATION_SCHEMA GLOBAL_TEMPORARY_TABLES table would crash threads working with internal temporary tables used by ALTER TABLE. Bug fixed #1113388.
  • The FLUSH INDEX_STATISTICS/FLUSH CHANGED_PAGE_BITMAPS and FLUSH USER_STATISTICS/RESET CHANGED_PAGE_BITMAPS pairs of commands were inadvertently joined, i.e. issuing either command had the effect of both. The first pair, besides flushing both index statistics and changed page bitmaps, had the effect of FLUSH INDEX_STATISTICS requiring the SUPER instead of the RELOAD privilege. The second pair resulted in FLUSH USER_STATISTICS destroying changed page bitmaps. Bug fixed #1472251.
  • If a new connection thread was created while a SHOW PROCESSLIST command or an INFORMATION_SCHEMA.PROCESSLIST query was in progress, it could have a negative TIME_MS value returned in the PROCESSLIST output. Bug fixed #1379582.

Other bugs fixed: #768038 and #1472256.

Release notes for Percona Server 5.5.45-37.4 are available in our online documentation. Bugs can be reported on the launchpad bug tracker.

The post Percona Server 5.5.45-37.4 is now available appeared first on MySQL Performance Blog.

PlanetMySQL Voting: Vote UP / Vote DOWN

Orchestrator visual cheatsheet, TL;DR the "smart" way

Orchestrator is really growing. And the number of users (DBAs, sys admins) using it is growing. Which gives me a lot of immediate feedback in the form of "Look, there are just too many options to move slaves around! Which ones should we use?"

TL;DR look at the two visualized commands below

They are enough

The "smart" commands to end all commands

So all relocation commands are important, and they give you fine-grained, pin-pointed control over the method of topology refactoring. However, most of the time you just want to move those servers around. That is why there is a new "smart" mode that supports these two commands, which you should be happy to use:

  • relocate: move a single slave to another position
  • relocate-slaves: move all/some slaves of some server to another position.

What makes these commands Smart? You can move slaves around from anywhere to anywhere. And orchestrator figures out the best execution path. If possible, it uses GTID. Not possible? Is Pseudo-GTID available? Great, using Pseudo-GTID. Oh, are there binlog servers involved? Really simple, use them. None of the above? Orchestrator will use "standard" binlog file:pos math (with limitations). Orchestrator will even figure out if multiple steps are necessary and will combine any of the above.

So you don't have to remember all the possible ways and options. The visual cheatsheet now boils down to these two:
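As a quick sketch (hostnames here are hypothetical; the -c/-i/-d flags follow the relocate-slaves example further down), the two commands look like this on the command line:

```shell
# Move slave10 to replicate from db20, letting orchestrator pick the method
# (GTID, Pseudo-GTID, binlog servers, or binlog file:pos math):
orchestrator -c relocate -i slave10.mydomain.com -d db20.mydomain.com

# Move all slaves of db20 under db30; slaves that cannot replicate
# from db30 are left behind:
orchestrator -c relocate-slaves -i db20.mydomain.com -d db30.mydomain.com
```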

Let's take a slightly deeper look


Moves a single slave X from any point to replicate from some other server Z

  • As usual, orchestrator first confirms that X can replicate from Z (Z has log-slave-updates or is a binlog server; binlog format compatible, etc.)
  • With GTID/Pseudo-GTID, move from any point to any point
  • With binlog servers, move around the binlog server environment (at this point you are not using binlog servers, so ignore)
  • With normal replication, requires an "atomic" operation: either move the slave one level up, or make it replicate from a sibling.
  • You can relocate under the same master, effectively repointing the slave back to its existing position. This serves to re-resolve the master hostname, to reset relay logs, or to verify that the slave is aligned with its master.
  • Or a combination of the above

Moves multiple slaves of server X to replicate from some other server W

  • By default moves all slaves of X, where possible
  • Each slave verified to be able to replicate from W. Those that can't are left behind.
  • Can filter using regular expression via --pattern=some?[reg]ex on slave hostnames
  • Can relocate under same master, effectively repointing all slaves (see above explanation)
  • Can relocate below one of the very slaves of X. If Y is in itself a slave of X and you're executing:
    orchestrator -c relocate-slaves -i X -d Y
    then Y is excluded from the list of relocated slaves. This effectively means "make Y local master of its current siblings". Very cool stuff.
  • When binlog servers involved, simple math-less repointing takes place
  • When GTID involved, let MySQL/MariaDB (both supported) do the math on a per-server basis
  • When Pseudo-GTID involved, greatly optimize by dividing into equivalence classes and only doing the math on a representative of each class.
  • Or a combination of the above
What about the other commands?

The above covers such commands as move-up, move-below, repoint, repoint-slaves, match-below, multi-match-slaves, regroup-slaves and more. It does not cover enslave-master and make-co-master which are a bit different.

My guess is you can handle 98% of your operations with relocate and relocate-slaves. Otherwise just run orchestrator with no arguments or options to get a full-blown breakdown of available commands.

GUI drag-n-drop

relocate-slaves is achieved by dragging the slaves of an instance on top of a new master, as follows:

You can find the latest orchestrator release on the project's releases page.


PlanetMySQL Voting: Vote UP / Vote DOWN

How MySQL-Sandbox is tested, and tests MySQL in the process

MySQL-Sandbox is a great tool for testing a new release, and in fact this is what I do when a new MySQL tarball becomes available. I don't think many people are aware of the full testing capabilities of the sandbox, though.

When you think about testing, you may just think of creating a sandbox with the new tarball, and then hammering it with your pet procedure. That works, of course, as the main purpose of MySQL-Sandbox is to allow you to do just that. There is, however, a full test suite that can tell you in a short while if your tarball is compatible with the past or not.

This procedure is quite strict. It has happened several times that I caught a bug in a new release of MySQL, or Percona Server, or MariaDB, just by running this suite.

How MySQL-Sandbox gets tested

Before describing how to test, I would like to show what I do. When a new version of MySQL-Sandbox is ready (and I happen to have time, because, hey! I also have a day job!) I subject it to the full test suite, which is about 500 different tests (the number may vary depending on the operating system and the MySQL version being tested). Then I repeat the full test for every version that I have stored in my test machines. That is, from version 5.0 to 5.7, passing through Percona Server and MariaDB forks, I test about a dozen versions:

mariadb 10.0.20
mariadb 10.1.6
mariadb 5.5.40
percona server 5.5.43
percona server 5.6.25

The above versions change when new releases are available. I repeat the same test in two different machines, covering OSX and Linux, for a grand total of ≈ 12,000 tests before I upload the tarball to CPAN.

What does the MySQL-Sandbox test suite do

There are 20 files in the ./t directory, each one starting a set of thematic tests:

  • t/01_modules.t tests the MySQL Sandbox module
  • t/02_test_binaries.t tests the completeness of MySQL-Sandbox deployment
  • t/03_test_sandbox.t is the longest and most comprehensive test in the suite. It installs every type of sandbox, and runs basic functional tests for each. It also includes an expensive test for sandbox installation with pattern recognition. In total, it runs for more than 10 minutes. This is where new versions may fail if they are not fully compatible with previous ones.
  • t/04_test_sbtool.t is a test for the companion tool, which can do many operations, including installing and testing hierarchical replication.
  • t/05_test_smoke.t tests a sandbox with the basic functionalities that were considered important to test manually when I was working at MySQL. Tired of testing it manually, I scripted the procedure. It may not be so vital now, but it does not hurt to run it.
  • t/06_test_user_defined.t is a demo test for the simplest user-defined tests implemented with the sandbox
  • t/07_test_user_defined.t does the same as the above, but testing replication instead of a single server
  • t/08_test_single_port_checking.t tests that we can install a sandbox multiple times, with automatic port choice
  • t/09_test_multiple_port_checking.t tests group sandboxes for the above feature
  • t/10_check_start_restart.t tests that we can restart sandboxed servers with options set on-the-fly
  • t/11_replication_parameters.t tests the installation of replication groups with user parameters that affect one or more nodes.
  • t/12_custom_user_pwd.t tests the ability of defining your own passwords instead of using the defaults
  • t/13_innodb_plugin_install.t tests the installation of the InnoDB plugin (it runs only for version 5.1)
  • t/14_semi_synch_plugin_install.t tests the installation and functioning of the semi-synchronous plugin (requires version 5.5+)
  • t/15_user_privileges.t tests that the sandbox users have the privileges that were expected
  • t/16_replication_options.t similar to test #11, but testing a different set of options
  • t/17_replication_flow.t tests that regular and circular replication groups can transfer data as expected.
  • t/18_force_creation.t tests that we can create a sandbox by overwriting an existing one
  • t/19_replication_gtid.t tests replication with GTID
  • t/20_add_option.t tests the ability of restarting a server with a permanent new option.
How to test a new tarball with MySQL-Sandbox

You do not need to go to the full testing of every version. That's done to make sure that MySQL-Sandbox does not have regressions, and works as expected with all the versions. But if your purpose is to make sure that your new tarball is ready to be used, a simple pass of the test suite will do. Here are the steps:

  1. Get the tarball. For this demonstration, we will use the latest Percona Server: Percona-Server-5.6.25-rel73.1-Linux.x86_64.ssl100.tar.gz
  2. We will need to download the MySQL-Sandbox code. It is not enough to have it installed, as we will need visibility and access to the test files. $ git clone
    $ cd mysql-sandbox
  3. We need the tarball to be extracted in the default directory ($HOME/opt/mysql): $ make_sandbox --export_binaries --add_prefix=Perc /path/to/Percona-Server-5.6.25-rel73.1-Linux.x86_64.ssl100.tar.gz
    This step moves the tarball under $HOME/opt/mysql, creates a directory named Perc5.6.25, and installs a sandbox from that new directory.
  4. The previous step is only needed for two things: creating the directory in the right place, and making sure the tarball can be installed. Sometimes this step fails because of some surprising incompatibility. At this point, we can remove the new sandbox: $ sbtool -o delete -s ~/sandboxes/msb_Perc5_6_25
  5. Now we are ready, and we can start the test: $ perl Makefile.PL
    $ make
    $ export TEST_VERSION=Perc5.6.25
    $ make test

This will run for quite a while. Depending on the power of your server, it can take from 20 to 40 minutes.

$ make test
PERL_DL_NONLAZY=1 /usr/bin/perl "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib/lib', 'blib/arch')" t/*.t
t/01_modules.t ...................... ok
t/02_test_binaries.t ................ ok
t/03_test_sandbox.t ................. Testing <5.6.26>. Please wait. This will take a few minutes
t/03_test_sandbox.t ................. ok
t/04_test_sbtool.t .................. ok
t/05_test_smoke.t ................... ok
t/06_test_user_defined.t ............ ok
t/07_test_user_defined.t ............ ok
t/08_test_single_port_checking.t .... ok
t/09_test_multiple_port_checking.t .. ok
t/10_check_start_restart.t .......... ok
t/11_replication_parameters.t ....... ok
t/12_custom_user_pwd.t .............. ok
t/13_innodb_plugin_install.t ........ # Skipping version 5.6.26 for this test. It is not in the required range (5.1.6 - 5.1.99)
t/13_innodb_plugin_install.t ........ ok
t/14_semi_synch_plugin_install.t .... ok
t/15_user_privileges.t .............. ok
t/16_replication_options.t .......... ok
t/17_replication_flow.t ............. ok
t/18_force_creation.t ............... ok
t/19_replication_gtid.t ............. ok
t/20_add_option.t ................... ok
All tests successful.
Files=20, Tests=495, 1560 wallclock secs ( 0.22 usr 0.03 sys + 480.89 cusr 73.55 csys = 554.69 CPU)
Result: PASS

CAVEATS: When you run this test, you must be aware of three things:

  1. If you have sandboxes running in $HOME/sandboxes, they will be stopped when the test starts. This is a necessity to avoid clashes, as the test needs to start every possible combination.
  2. If your sandboxes were running in a non-standard location, you need to stop them manually, as they may make the test fail. The test does not care if the main server runs on port 3306: it will not clash with it, and it will not attempt to shut it down.
  3. If you stop the test during the execution, you may end up with orphaned servers, which are installed under ./t/test_sb. If that happens, you will need to stop and eventually remove them manually.
Running only a subset of the tests

If you want to test only something specific, you can do that by invoking one of the tests mentioned above. For example, if you want to test only basic replication, you can do this:

$ perl Makefile.PL
$ make
$ export TEST_VERSION=Perc5.6.25
$ prove -b -v t/17_replication_flow.t
t/17_replication_flow.t ..
installing and starting master
installing slave 1
installing slave 2
starting slave 1
. sandbox server started
starting slave 2
. sandbox server started
initializing slave 1
initializing slave 2
replication directory installed in $HOME/sandboxes/repl_deployment
ok - Replication directory created
# Master log: mysql-bin.000001 - Position: 3667 - Rows: 2
# Testing slave #1
ok - Slave #1 acknowledged reception of transactions from master
ok - Slave #1 IO thread is running
ok - Slave #1 SQL thread is running
ok - Table t1 found on slave #1
ok - Table t1 has 2 rows on #1
# Testing slave #2
ok - Slave #2 acknowledged reception of transactions from master
ok - Slave #2 IO thread is running
ok - Slave #2 SQL thread is running
ok - Table t1 found on slave #2
ok - Table t1 has 2 rows on #2
# TESTS : 10
# FAILED: 0 ( 0.0%)
# PASSED: 10 (100.0%)
ok - Replication test was successful
# executing "stop" on $HOME/sandboxes/repl_deployment
executing "stop" on slave 1
executing "stop" on slave 2
executing "stop" on master
# executing "clear" on $HOME/sandboxes/repl_deployment
executing "clear" on slave 1
executing "clear" on slave 2
executing "clear" on master
sandbox at <$HOME/sandboxes/repl_deployment> has been removed
ok - Regular replication directory repl_deployment removed
installing node 1
installing node 2
installing node 3
# server: 1:
# server: 2:
# server: 3:
# server: 1:
# server: 2:
# server: 3:
# server: 1:
# server: 2:
# server: 3:
Circular replication activated
group directory installed in $HOME/sandboxes/repl_deployment
ok - circular replication installed
# Master log: mysql-bin.000001 - Position: 1126 - Rows: 2
# Testing slave #1
ok - Slave #1 IO thread is running
ok - Slave #1 SQL thread is running
ok - Table t1 found on slave #1
ok - Table t1 has 2 rows on #1
# Testing slave #2
ok - Slave #2 IO thread is running
ok - Slave #2 SQL thread is running
ok - Table t1 found on slave #2
ok - Table t1 has 2 rows on #2
# TESTS : 8
# FAILED: 0 ( 0.0%)
# PASSED: 8 (100.0%)
ok - Replication test was successful
# executing "stop" on $HOME/sandboxes/repl_deployment
# server: 1:
# server: 2:
# server: 3:
executing "stop" on node 1
executing "stop" on node 2
executing "stop" on node 3
# executing "clear" on $HOME/sandboxes/repl_deployment
# server: 1:
# server: 2:
# server: 3:
executing "clear" on node 1
executing "clear" on node 2
executing "clear" on node 3
sandbox at <$HOME/sandboxes/repl_deployment> has been removed
ok - Circular replication directory repl_deployment removed
All tests successful.
Files=1, Tests=24, 69 wallclock secs ( 0.05 usr 0.01 sys + 3.24 cusr 2.57 csys = 5.87 CPU)
Result: PASS

You can repeat the procedure for every file ./t/*.t

Writing your own tests

As a parting thought, let me mention again that you can create your own user-defined tests using the sandbox simple hooks.

Here is a sample user defined test that you can run using test_sandbox:

There are two kinds of tests: shell and sql. The test type is defined by a keyword followed by a colon.

The 'shell' test requires:

  • a 'command', which is passed to a shell.
  • The 'expected' label is a string that you expect to find within the shell output. If you don't expect anything, you can just say "expected = OK", meaning that you will be satisfied with a ZERO exit code reported by the operating system.
  • The 'msg' is the description of the test that is shown to you when the test runs.

command = make_sandbox $TEST_VERSION -- --no_confirm --sandbox_directory=msb_XXXX
expected = sandbox server started
msg = sandbox creation

The 'sql' test requires

  • a 'path', which is the place where the test engine expects to find a 'use' script.
  • The 'query' is passed to the above mentioned script and the output is captured for further processing.
  • The 'expected' parameter is a string that you want to find in the query output.
  • The 'msg' parameter is like the one used with the 'shell' test.

query = select 10 * 10
expected = 100
msg = checking database response

All strings starting with a $ are expanded to their corresponding environment variables. For example, if $SANDBOX_HOME is /home/sb/tests, the line "command = $SANDBOX_HOME/msb_5_1_30/stop" will expand to
command = /home/sb/tests/msb_5_1_30/stop

It is a good idea to finish every test with a cleanup. Here, we simply stop the server

command = $SANDBOX_HOME/msb_XXXX/stop
expected = OK
msg = stopped

To run this example, you have two options:

  • Run it directly with test_sandbox: $ test_sandbox --versions=Perc5.6.25 --tests=user
  • Or create a harness like the ones in the test suite. See for example t/06_test_user_defined.t, which then invokes the test proper: $ export TEST_VERSION=Perc5.6.25
    $ cat t/06_test_user_defined.t
    use lib './t';
    use Test_Helper;
    test_sandbox( 'test_sandbox --user_test=./t/', 3);
    In this code, the '3' after the test name is the number of tests expected to run.

If this paradigm is too simple (and I know that sometimes it is) you can write your own plugins in Perl, using as examples the ones in the suite. e.g.:

$ cat t/08_test_single_port_checking.t
use lib './t';
use Test_Helper;
test_sandbox( 'test_sandbox --tests=user --user_test=./t/', 6);

The Perl plugins require, of course, some knowledge of Perl, but they allow greater flexibility in creating your own checks.

PlanetMySQL Voting: Vote UP / Vote DOWN

Evaluating MySQL Parallel Replication Part 3: Benchmarks in Production

Tue, 2015-09-01 11:17, by jeanfrancoisgagne

Parallel replication is a much-expected feature of MySQL. It is available in MariaDB 10.0 and in MySQL 5.7. In this 3rd post of the series, we present benchmark results from production environments.

This is a repost of Jean-François Gagné's blog post.

Note: this post has an annex: Under the Hood. Benchmarking is a complex art and reporting results accurately is even harder. If all the details were put in a single article, it would make a very long post. The links to the annex should satisfy readers eager for more details.

Parallel replication is on its way and with it comes the hope that more transactions can be run on a master without introducing slave lag. But it remains to be seen whether this dream will come true - will parallel replication hold its promise or will there be surprises after its deployment? We would like to know whether or not we can count on that feature in the future.

To get answers, nothing is better than experimenting with the technology. Benchmark results have already been published (MariaDB 10.0: 10 times improvement; and MySQL 5.7: 3 to 6 times improvement) but results might be different in our production environments. The best test would be to run parallel replication on our real workloads but this is not trivial. To be able to run transactions in parallel, a slave needs parallelism information from the master (for more details, see Part 1). With a master in an older version (MySQL 5.6 in our case), slaves do not have this information.

Luckily, we can use slave group commit on an intermediate master to identify transactions that can be run in parallel. The trick is to execute transactions sequentially on a slave, but to delay their commit. While the commit is delayed, the next transaction is started. If the two transactions are non-conflicting, the second could complete and the two transactions would commit together. In this scenario, grouping has succeeded (the two transactions committed in a single group) and parallelism is identified (as they commit together, the transactions are non-conflicting, thus can be run in parallel on slaves). For more details on slave group commit, see Part 2.
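As a configuration sketch, the delayed-commit behavior described above is controlled in MariaDB 10.0 by two variables on the intermediate master; the values below are illustrative, not the ones used in these tests:

```
# my.cnf fragment on the intermediate master (MariaDB 10.0)
# Delay each commit until up to 35 transactions are queued, or at most
# 100 ms have passed, so non-conflicting transactions commit in one group.
binlog_commit_wait_count = 35
binlog_commit_wait_usec  = 100000
```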

Our benchmarks are established within four production environments: E1, E2, E3, and E4. Each of these environments is composed of one production master, with some intermediate masters, and a leaf slave. A complete description of those environments can be found in the annex.

Slave group commit identifies less parallelism than a parallel execution on a master would identify (details in the Group Commit: Slave vs. Master section of the annex). Even so, decent group sizes are obtained, as shown in the group commit size graphs in the annex. Usually we have group sizes of at least 5, most of the time they are larger than 10, and sometimes they are as big as 15 or even 20. These will allow us to test parallel replication with real production workload.

Before closing this long introduction, let's talk a little about our expectations. InnoDB is getting better at using many cores (RO and RW benchmark results) but single-threaded replication gets in the way of pushing more writes in a replicating environment. Without parallel replication, a single core can be used on a master to perform writes without incurring slave lag. This is disappointing as servers come with 12 and more cores (only one can be used for writes). Ideally, we would like to use a significant percentage of the cores of a server for writes (25% could be a good start). So speedups of 3 would be good results at this point (12 cores), and speedups of 6 and 10 would be needed in the near future (24 and 40 cores).

The Test: Catching Up with 24 Hours of Transactions

Our benchmark scenario is as follows: after restoring a backup of the database, starting MariaDB, waiting for the buffer pool to be loaded, and running the slave for at least 12 hours, we measure the length of time it has taken to process 24 hours of production transactions.

The tests are run in the following binary log configurations:

  • Intermediary Master (IM): both binary logs and log-slave-updates are enabled.
  • Slave with Binary Logs (SB): binary logs are enabled but log-slave-updates is disabled.
  • Standard Slave (SS): both binary logs and log-slave-updates are disabled.

And in the following durability configurations:

  • High Durability (HD): sync_binlog = 1 and innodb_flush_log_at_trx_commit = 1.
  • No Durability (ND): sync_binlog = 0 and innodb_flush_log_at_trx_commit = 2 (also described/known as relaxed durability).
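Expressed as my.cnf fragments, the two durability configurations are:

```
# High Durability (HD): sync the binary log and the redo log on every commit
sync_binlog                    = 1
innodb_flush_log_at_trx_commit = 1

# No Durability (ND): let the OS flush the binary log;
# sync the redo log roughly once per second
# sync_binlog                    = 0
# innodb_flush_log_at_trx_commit = 2
```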

For each of those configurations (6 in total: IM-HD, IM-ND, SB-HD, SB-ND, SS-HD, and SS-ND), the tests are run with different values of slave_parallel_threads (SPT). The full results are presented in the annex and the most interesting results are presented below (SB-HD and SB-ND). The times presented are in the format hours:minutes.seconds. Below the time taken to process 24-hours of transactions, the speedup achieved from the single-threaded run is presented in bold.

Execution Times and Speedups for Slave with Binary Logs
(times in hours:minutes.seconds; speedup vs. the single-threaded run in parentheses)

E1:  SPT | SB-HD          | SB-ND
       0 | 9:11.11        | 5:22.29
       5 | 6:47.38 (1.35) | 5:15.03 (1.02)
      10 | 6:16.49 (1.46) | 5:10.02 (1.04)
      20 | 6:00.22 (1.53) | 5:07.16 (1.05)
      40 | 5:53.42 (1.56) | 5:05.36 (1.06)

E2:  SPT | SB-HD          | SB-ND
       0 | 2:50.58        | 1:11.30
       5 | 1:40.32 (1.70) | 1:11.19 (1.00)
      10 | 1:28.48 (1.93) | 1:09.40 (1.03)
      20 | 1:24.52 (2.01) | 1:09.03 (1.04)
      40 | 1:22.12 (2.08) | 1:08.32 (1.04)

E3:  SPT | SB-HD          | SB-ND
       0 | 9:58.58        | 9:06.20
       5 | 8:41.47 (1.15) | 8:24.40 (1.08)
      10 | 8:23.07 (1.19) | 8:16.23 (1.10)
      20 | 8:06.12 (1.23) | 8:05.13 (1.13)
      40 | 8:07.26 (1.23) | 8:04.31 (1.13)

E4:  SPT | SB-HD          | SB-ND
       0 | 7:43.06        | 7:26.24
       5 | 6:36.06 (1.17) | 6:29.52 (1.15)
      10 | 6:17.59 (1.23) | 6:13.48 (1.19)
      20 | 6:05.40 (1.27) | 6:05.48 (1.22)
      40 | 5:59.30 (1.29) | 5:59.57

In the Graph during Tests section of the annex, you can find many details about the different test runs.


There are lots of things to say about those results. Let's start with observations that are not related to parallel replication (SPT=0):

  • Obs1: Standard Slave results (without binary logs) are very close to the results of Slave with Binary Logs (without log-slave-updates) (details in the annex).
  • Obs2: log-slave-updates has a visible cost for E1 (time difference between IM-HD and SB-HD), a less obvious but still noticeable cost for E2 and E3, and it is a win for E4 (that last one is disturbing; the numbers are in the annex).
  • Obs3: relaxing durability is a huge win for E1 and E2, a more limited win for E3, and a much smaller one for E4.

With reference to Obs1 above, this shows that binary logs should probably not be disabled on slave: the cost is almost inexistent and the wins are big (tracing errant transactions and being a candidate for master promotion). However, slaves with log-slave-updates are slower than slaves with only binary logs enabled (Obs2 above), so log-slave-updates should be avoided when possible. Binlog Servers can be used to replace log-slave-updates for intermediate masters, see MySQL Slave Scaling for more details (see also Better Parallel Replication for MySQL for an explanation why log-slave-updates is bad on intermediate masters for parallel replication).

With reference to Obs3 above, this can be explained by the different workload of the four environments (more details about the workloads can be found in the annex):

  • E2 is a CPU-bound workload (the dataset fits in RAM).
  • E1 is also mostly CPU-bound but with some cache misses in the InnoDB buffer pool, so it needs a page fetch from disk before doing a write.
  • E3 is a mixed CPU and IO workload (more cache misses in the InnoDB buffer pool but still with enough cache hit to get a good commit throughput).
  • E4 is an IO-bound workload (mostly cache misses).

Relaxing durability on CPU-bound workloads achieves good throughput improvements, but this does not happen on IO-bound workloads.

Now, let's focus on parallel replication. The Standard Slave results (SS-HD and SS-ND) are not worth discussing as they are very close to the Slave with Binary Logs results (Obs1 above). We will also not discuss Intermediate Master results (IM-HD and IM-ND) as they should be replaced by Binlog Servers. So all observations below are made on the results of Slave with Binary Logs (SB-HD and SB-ND):

  • Obs4: the best speedup (~2.10) is in E2 with high durability. E1 follows with a speedup of ~1.56 (always with high durability).
  • Obs5: the speedups for E4 are modest (~1.29) and the results are almost identical for both durability settings.
  • Obs6: for E1 and E2, the speedups with no durability are almost non-existent (less than 1.10).
  • Obs7: for both E1 and E2, relaxing durability with single-threaded replication leads to faster execution than enabling parallel replication.
  • Obs8: the results for E3 are halfway between E1/E2 and E4: both SB-HD and SB-ND get some modest speedups from parallel replication (like E4 and opposite to E1/E2) and relaxing durability makes things run a little faster (like E1/E2 and opposite to E4), but not to the point where single-threaded low durability is faster than multi-threaded high durability.

All those observations point to the importance of the workload in parallel replication speedups:

  • CPU-bound workloads seem to get modest speedups in high-durability configurations.
  • Relaxing durability for CPU-bound workloads looks like a better option than enabling parallel replication on a high-durability configuration.
  • IO-bound workloads get more limited speedups.

Our first reaction is disappointment: the speedups are not as high as expected. Don't get us wrong: faster is always better, especially when the only thing to do is to upgrade the software, which we will do anyway eventually. However, having only 25% more writes on a master (or 110% depending on which environment we look at) will not help us in the long term. Parallel replication is not the solution (at least not the only solution) that will allow us to stop/avoid sharding.

Ideas and Future Work

We have a hypothesis explaining the modest speedups: long-running transactions. In the presence of long-running transactions, the parallel replication pipeline on the slave stalls. Let's take the following six transactions committing on the master in two commit groups (B for begin and C for commit):

   --------Time-------->
T1: B-----------C
T2:          B--C
T3:          B--C
T4:          B--C
T5:              B--C
T6:              B--C

Running those transactions on a single-threaded slave takes 33 units of time (time scale is at the bottom):

   ----------------Time---------------->
T1: B-----------C
T2:              B--C
T3:                  B--C
T4:                      B--C
T5:                          B--C
T6:                              B--C
             1         2         3
    123456789012345678901234567890123

Running those transactions on a multi-threaded slave with SPT=4 takes 17 units of time:

   ---------Time-------->
T1: B-----------C
T2: B-- . . . . C
T3: B-- . . . . C
T4: B-- . . . . C
T5:              B--C
T6:              B--C
             1
    12345678901234567

So we barely achieve a speedup of 2 (and the second commit group does not even contain a large transaction). The low speedup is explained by T1 being much bigger than the other transactions in the group. So our intuition is that to get better speedup with parallel replication, all transactions should be of similar size, and bigger transactions should be broken down into smaller ones (when possible).
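To make the arithmetic explicit, here is a quick back-of-the-envelope check of the two timelines above (the unit counts are read straight off the diagrams):

```shell
# Serial slave: T1 (13 units) then T2..T6 (4 units each), back to back.
serial=$((13 + 5 * 4))
# Parallel slave (SPT=4): T2..T4 overlap T1's 13 units,
# then T5/T6 run together as one more 4-unit step.
parallel=$((13 + 4))
echo "serial=$serial parallel=$parallel"   # serial=33 parallel=17
```

The speedup is 33/17, roughly 1.94, bounded by the largest transaction in the commit group rather than by the number of slave threads.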

We have many of those big transactions in our workload. Most of our design choices predate MySQL 5.6, where a commit was expensive. Reducing the number of commits was a good optimization at that time, so doing many changes in a single transaction was a good thing. Now, with binary log group commit, this optimization is less useful, though it does no harm. However, it is very bad for parallel replication.

There are at least two other things to discuss from those results but this post is already too long, so you will have to go in the annex to read the Additional Discussions.


Conclusion

It is possible to test parallel replication with true production workload, even if the master is running an old version of MySQL. Thanks to slave group commit in MariaDB 10.0, we can identify parallelism on an intermediate master and enable parallel replication on a slave. Even if this parallelism identification is not as good as it would be on a master, we get decent group sizes.

Our CPU-bound workloads are getting speedups of ~1.56 to ~2.10 with high-durability constraints. This is a little disappointing: we would like to have more than two cores busy applying writes on slaves. Our guess is that better speedup could be obtained by hunting down large transactions, but that still needs to be verified. At this point, and for this type of workload, our tests show that relaxing durability is a better optimization than enabling parallel replication. Finally, with relaxed durability, parallel replication shows almost no improvement (4% to 6% improvement), and it is still unknown if hunting down large transactions and splitting them would result in better speedups.

Our IO-bound workloads are getting speedups of ~1.23 to ~1.29, which is also disappointing but expected because it is hard to fight against seek time of magnetic disks. In this type of workload, relaxed durability setting benefits from parallel replication. However, at high enough parallelism on the slave and for this type of workload, relaxing durability is not very beneficial. It is hard to tell what types of improvement would come from hunting down large transactions for this type of workload.

The next step on parallel replication evaluation would be to try optimistic parallel replication. This will make a good fourth part in the series.

This is a repost of Jean-François Gagné's blog post on the Booking.com blog.

Tags: MariaDB Releases, MaxScale, MySQL, Replication

About the Author

Jean-François Gagné is a System Engineer at Booking.com, supporting its growth with a focus on their MariaDB and MySQL infrastructure.


Orchestrator 1.4.340: GTID, binlog servers, Smart Mode, failovers and lots of goodies

Orchestrator 1.4.340 is released. It is not quite competing with the MySQL latest changelog, and as I haven't blogged about the orchestrator feature set in a while, this is a quick listing of orchestrator features available since my last publication:

  • Supports GTID (Oracle & MariaDB)
    • GTID still not being used in automated recovery -- in progress.
    • enable-gtid, disable-gtid, skip-query for GTID commands
  • Supports binlog servers (MaxScale)
    • Discovery & operations on binlog servers
    • Understanding slave repositioning in a binlog-server architecture
  • Smart mode: relocate & relocate-below commands (or Web/GUI drag-n-drop) let orchestrator figure out the best way of slave repositioning. Orchestrator picks from GTID, Pseudo GTID, binlog servers, binlog file:pos math (and more) options, or combinations of the above. Fine grained commands still there, but mostly you won't need them.
  • Crash recoveries (did you know orchestrator does that?):
    • For intermediate master recovery: improved logic in picking the best recovery plan (prefer in-DC, prefer promoting local slave, supporting binlog server topologies, ...)
    • For master recovery: even better slave promotion; supports candidate slaves (prefer promoting such slaves); supports binlog server shared topologies
    • Better auditing and logging of recovery cases
    • Better analysis of crash scenarios, also in the event of lost VIPs, hanging connections; emergent checks in crash suspected scenarios
    • recover-lite: do all topology-only recovery steps, without invoking external processes
  • Better browser support: the Web UI used to work only on Firefox and Chrome (and the latter has had issues); it should now work well on all browsers, at the cost of reduced d3 animation. More work still in progress.
  • Faster, more parallel, less-blocking operations on all counts; removed a lot of serialized code; fewer locks.
  • Web enhancements
    • More verbose drag-n-drop (operation hint; color hints)
    • Drag-n-drop for slaves-of-a-server
    • Replication/crash analysis dashboard
  • Pools: orchestrator can be told about instance-to-pool association (submit-pool-instances command)
    • And can then present pool status (web)
    • Or pool hints within topologies (web)
    • Or queried for all pools (cluster-pool-instances command)
  • Other:
    • Supports MySQL 5.7 (tested with 5.7.8)
    • Configurable graphite path for metrics
    • --noop flag; does all the work except for actually changing master on slaves. Shows intentions.
    • Web (or cli which-cluster-osc-slaves command) provide list of control slaves to use in pt-osc operation
    • hostname-unresolve: force orchestrator to unresolve a fqdn into VIP/CNAME/... when issuing a CHANGE MASTER TO
  • 3rd party contributions (hey, thanks!) include:
    • More & better SSL support
    • Vagrant templates
  • For developers:
    • Orchestrator is now go-gettable. Just go get
    • Improved build script; supports more architectures

Also consider these manuals:

Orchestrator is free and open source (Apache 2.0 License).

I'll be speaking about orchestrator at Percona Live Amsterdam.


Booking dot yeah! Booking.com's Jean-François Gagné on Percona Live Amsterdam

Booking.com, one of the world's leading e-commerce companies, helps travelers book nearly 1 million rooms per night. Established in 1996, Booking.com B.V. guarantees the best prices for any type of property, from small, family-run bed and breakfasts to executive apartments and five-star luxury suites.

The travel website is also a dedicated contributor to the MySQL and Perl communities. Other open source technologies in use include CentOS Linux, Nginx, Python, Puppet, Git and more.

A Diamond sponsor of Percona Live Amsterdam Sept. 21-23, Booking.com is where you can meet the people who power the site, at booth 205. Enter promo code “BlogInterview” at registration to save €20!

In the meantime, meet Jean-François Gagné, a system engineer at Booking.com. He’ll be presenting a couple of talks: “Riding the Binlog: an in Deep Dissection of the Replication Stream” and “Binlog Servers at Booking.com.”

Tom: Hi Jean-François, in your session, “Riding the Binlog: an in Deep Dissection of the Replication Stream“, you talk about how we can think of the binary logs as a transport for a “Stream of Transactions”. What will be the top 3 things attendees will come away with following this 50-minute talk?

Jean-François: Hi Tom, thanks for this opportunity to give a sneak peek of my talk.  The most important subject that will be discussed is that the binary log stream evolves: through the usage of “log-slave-updates”, the stream can grow, shrink or morph.  Said another way: the binary logs of a slave can be very different from the binary logs of the master, and this should be taken into account when relying on them (including when replicating through an intermediate master and when promoting a slave as a new master using GTIDs).  We will also explore how the binary logs can be decomposed into sub-streams, or viewed as the multiplexing of many streams.  We will also look at de-multiplexing functions and the new possibilities they open.


Tom: Percona Live, starting with this conference, has a new venue and a broader theme – now encompassing, in addition to MySQL, MongoDB, NoSQL and data in the cloud. Your thoughts? And what do think is missing – what would you change (if anything)?

Jean-François: I think you forgot the best change of all: going from a 2-day conference last year in London to a 3-day conference this year.  This will allow better knowledge exchange and I am very happy about that.  I think this event will be a success, with a good balance of sessions focused on technologies and presentations about specific use-cases of those technologies.  If I had one wish, I would like to see more sessions about specific use-cases of NoSQL technologies, with in-depth discussion about why they are a better choice than more traditional solutions: maybe more of those sessions will be submitted next year.


Tom: Which other session(s) are you most looking forward to besides your own?

Jean-François: I will definitely attend the Facebook session about Semi-Synchronous Replication: it is very close to my interests, especially as Booking.com is thinking about using loss-less semi-sync replication in the future, and I look forward to hearing war stories about this feature.  All sessions dissecting the internals of a technology (InnoDB, TokuDB, RocksDB, …) will also have my attention.  Finally, it is always interesting to hear how large companies are using databases, so I plan to attend the MySQL@Wikimedia session.


Tom: As a resident of Amsterdam, what are some of the must-do activities/sightseeing for those visiting for Percona Live from out of town?

Jean-François: Seeing the city from a high point is impressive, and you will have the opportunity to enjoy that view from the Booking.com office at the Community Dinner.  Also, I recommend finding a bike and discovering the city pedaling (there are many rental shops, just ask Google).  From the conference venue, you can do a 70-minute ride crossing three nice parks: the Westerpark, the Rembrandtpark and the Vondelpark, and you can discover the first or third park in a shorter ride (45 minutes).  If you feel a little more adventurous, I recommend a 90-minute ride south following the Amstel: once out of Amsterdam, you will have the water on one side at the level of the road, and the fields (Polder) 3 meters below on the other side.  This will allow you to see for yourself why this place is called the “Low Countries”.

The post Booking dot yeah!’s Jean-François Gagné on Percona Live Amsterdam appeared first on MySQL Performance Blog.


Cached linkbench performance for MySQL 5.7.8, 5.6, WebScale and MyRocks

This extends previous results for Linkbench to compare performance for a cached database with concurrent clients. My conclusions are:
  • InnoDB compression in the Facebook patch for MySQL 5.6 is much faster for insert-heavy workloads than the same feature in upstream 5.6 and 5.7. Too bad those changes might not reach upstream.
  • InnoDB transparent page compression is faster than non-transparent for write-heavy workloads assuming that feature is OK to use on your servers.
  • QPS for MyRocks suffers over time. We have work-in-progress to fix this. Otherwise it is already competitive with InnoDB. Compression with MyRocks is much better than InnoDB for linkbench data. That has also been true for real workloads.
  • Load rates are higher for compressed InnoDB tables when partitioning is used for 5.6 but not for 5.7. I didn't debug the slowdown in 5.7. It has been a win in the past for IO-bound linkbench because it reduces contention on the per-index mutex in InnoDB. Work has been done in 5.7 to reduce the contention on the per-index mutex.
Setup

The database size was between 10G and 30G after the load. The test was run with maxid=20000001, loaders=10 & requesters=20. Otherwise the default settings for linkbench were used. The InnoDB buffer pool was large enough to cache the database. The server has 144G of RAM, fast PCIe flash storage and 40 HW threads with HT enabled. The binlog was enabled but fsync was not done for the binlog or InnoDB redo log on commit. I tested several configurations for compression and partitioning:
  • p0.c0 - no partitioning, no compression
  • p0.c1 - no partitioning, compression (transparent & non-transparent)
  • p1.c0 - partitioning, no compression
  • p1.c1 - partitioning, compression (transparent & non-transparent)
I tested several binaries:
  • myrocks - the Facebook patch for MySQL, 5.6.X and the RocksDB storage engine
  • fb56 - the Facebook patch for MySQL, 5.6.X and InnoDB
  • orig56.ps - upstream 5.6.26 with the performance schema (PS) enabled
  • orig57.ps - upstream 5.7.8 with non-transparent compression and PS enabled
  • orig57.tc - upstream 5.7.8 with transparent compression and PS enabled
The test was done in two parts. First I measure the load performance, then I run the query test for 12 1-hour intervals. The data below is the insert rate from the load (load ips), the size after load (load gb), the QPS rate during the second and twelfth 1-hour runs (2h qps, 12h qps) and the size after the second and twelfth 1-hour runs (2h gb, 12h gb).

p0.c0

load ips   load gb   2h qps   2h gb   12h qps   12h gb   config
136041     14        43784    18      24298     20       myrocks
109724     22        49881    31      48459     51       fb56
103030     24        39979    34      39582     54       orig56.ps
116343     24        48506    35      48112     58       orig57.ps

p0.c1

load ips   load gb   2h qps   2h gb   12h qps   12h gb   config
 73115     15        42508    20      35766     32       fb56
 45660     16        36474    22      33107     34       orig56.ps
 46737     16        40890    22      37305     36       orig57.ps
101966     17        33716    23      29695     37       orig57.tc

p1.c0

load ips   load gb   2h qps   2h gb   12h qps   12h gb   config
101783     26        34342    30      21883     36       myrocks
105099     24        48686    33      47369     52       fb56
 97931     27        39343    36      39000     55       orig56.ps
109230     27        46671    37      46155     59       orig57.ps

p1.c1

load ips   load gb   2h qps   2h gb   12h qps   12h gb   config
 91884     15        46852    21      45223     36       fb56
 74080     17        39379    23      38627     38       orig56.ps
 77037     17        45156    24      44070     40       orig57.ps
 87708     19        37062    25      32424     40       orig57.tc

Graphs!

And for people who want graphs, this has the average insert rate from the load and the average query rate from the twelfth hour for the p0.c0 test (no partitioning, no compression).
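A couple of the conclusions above can be checked with quick arithmetic on these tables. This sketch only recomputes numbers copied from the p0.c0 and p0.c1 results; it is not part of the benchmark itself.

```python
# Quick arithmetic on the linkbench results above: how much smaller is
# the MyRocks database after 12 hours, and how much of its hour-2 QPS
# does each engine keep at hour 12?  Values copied from the p0.c0 table.
size_12h_gb = {"myrocks": 20, "fb56": 51}          # database size after 12h
qps = {"myrocks": (43784, 24298), "fb56": (49881, 48459)}  # (2h qps, 12h qps)

# Compression with MyRocks vs uncompressed InnoDB (fb56, p0.c0):
ratio = size_12h_gb["fb56"] / size_12h_gb["myrocks"]
print(f"MyRocks is {ratio:.1f}x smaller than uncompressed InnoDB after 12h")

# QPS retained from hour 2 to hour 12 -- this is the "QPS for MyRocks
# suffers over time" observation made in the bullet list:
for config, (qps_2h, qps_12h) in qps.items():
    print(f"{config}: keeps {100 * qps_12h / qps_2h:.0f}% of its 2h QPS")
```

MyRocks ends up about 2.5x smaller but retains only about 55% of its hour-2 QPS, while uncompressed InnoDB retains about 97%.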


High-load clusters and desynchronized nodes on Percona XtraDB Cluster

There can be a lot of confusion and lack of planning in Percona XtraDB Cluster with regard to nodes becoming desynchronized for various reasons. This can happen in a few ways.

When I say “desynchronized” I mean a node that is permitted to build up a potentially large wsrep_local_recv_queue while some operation is happening.  For example a node taking a backup would set wsrep_desync=ON during the backup and potentially fall behind replication some amount.

Some of these operations may completely block Galera from applying transactions, while others may simply increase load on the server enough that it falls behind and applies at a reduced rate.

In all the cases above, flow control is NOT used while the node cannot apply transactions, but it MAY be used while the node is recovering from the operation.  For an example of this, see my last blog about IST.

If a cluster is fairly busy, then the flow control that CAN happen when the above operations catch up MAY be detrimental to performance.

Example setup

Let us take my typical 3 node cluster with workload on node1.  We are taking a blocking backup of some kind on node3 so we are executing the following steps:

  1. node3> set global wsrep_desync=ON;
  2. Node3’s “backup” starts, this starts with FLUSH TABLES WITH READ LOCK;
  3. Galera is paused on node3 and the wsrep_local_recv_queue grows some amount
  4. Node3’s “backup” finishes, finishing with UNLOCK TABLES;
  5. node3> set global wsrep_desync=OFF;
During the backup

This includes up through step 3 above.  My node1 is unaffected by the backup on node3, I can see it averaging 5-6k writesets(transactions) per second which it did before we began:


node2 is also unaffected:

but node3 is not applying and its queue is building up:

Unlock tables, still wsrep_desync=ON

Let’s examine briefly what happens when node3 is permitted to start applying, but wsrep_desync stays enabled:

node1’s performance is pretty much the same, node3 is not using flow control yet. However, there is a problem:

It’s hard to notice, but node3 is NOT catching up, instead it is falling further behind!  We have potentially created a situation where node3 may never catch up.

The PXC nodes were close enough to the red-line of performance that node3 can only apply just about as fast (and somewhat slower until it heats up a bit) as new transactions are coming into node1.

This represents a serious concern in PXC capacity planning:

Nodes do not only need to be fast enough to handle normal workload, but also to catch up after maintenance operations or failures cause them to fall behind.

Experienced MySQL DBAs will realize this isn’t all that different from Master/Slave replication.
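The capacity concern above can be put into a tiny back-of-envelope model. The rates and backlog below are made-up numbers for illustration, not measurements from this cluster:

```python
# Back-of-envelope catch-up model for a desynced node: with a backlog of
# B writesets, an incoming rate r_in and an apply rate r_apply, the node
# only converges when r_apply > r_in, and then needs
# B / (r_apply - r_in) seconds to drain its queue.
def catchup_seconds(backlog, r_in, r_apply):
    if r_apply <= r_in:
        # At or past the redline: the node never catches up and
        # falls further behind instead.
        return float("inf")
    return backlog / (r_apply - r_in)

# Hypothetical: 1M writesets behind, 5500/s incoming, 6000/s apply capacity.
print(catchup_seconds(1_000_000, 5500, 6000))  # 2000.0 seconds (~33 minutes)

# Apply rate equal to the incoming rate: recovery is impossible.
print(catchup_seconds(1_000_000, 5500, 5500))  # inf
```

Even a modest 500 writesets/s of headroom needs over half an hour to drain a 1M-writeset queue, which is why nodes must be sized well above the normal workload.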

Flow Control as a way to recovery

So here’s the trick:  if we turn off wsrep_desync on node3 now, node3 will use flow control if and only if the incoming replication exceeds node3’s apply rate.  This gives node3 a good chance of catching up, but the tradeoff is reducing write throughput of the cluster.  Let’s see what this looks like in context with all of our steps.  wsrep_desync is turned off at the peak of the replication queue size on node3, around 12:20PM:

So at the moment node3 starts utilizing flow control to prevent falling further behind, our write throughput (in this specific environment and workload) is reduced by approximately 1/3rd (YMMV).   The cluster will remain in this state until node3 catches up and returns to the ‘Synced’ state.  This catchup is still happening as I write this post, almost 4 hours after it started and will likely take another hour or two to complete.

I can see a more realtime representation of this by using myq_status on node1, summarizing every minute:

[root@node1 ~]# myq_status -i 1m wsrep
mycluster / node1 (idx: 1) / Galera 3.11(ra0189ab)
         Cluster  Node        Outbound       Inbound        FlowC       Conflct Gcache  Appl
    time P cnf  # stat laten  msgs data que  msgs data que  pause   snt lcf bfa    ist  idx  %ef
19:58:47 P   5  3 Sync 0.9ms  3128 2.0M   0    27 213b   0  25.4s     0   0   0  3003k  16k  62%
19:59:47 P   5  3 Sync 1.1ms  3200 2.1M   0    31 248b   0  18.8s     0   0   0  3003k  16k  62%
20:00:47 P   5  3 Sync 0.9ms  3378 2.2M  32    27 217b   0  26.0s     0   0   0  3003k  16k  62%
20:01:47 P   5  3 Sync 0.9ms  3662 2.4M  32    33 266b   0  18.9s     0   0   0  3003k  16k  62%
20:02:47 P   5  3 Sync 0.9ms  3340 2.2M  32    27 215b   0  27.2s     0   0   0  3003k  16k  62%
20:03:47 P   5  3 Sync 0.9ms  3193 2.1M   0    27 215b   0  25.6s     0   0   0  3003k  16k  62%
20:04:47 P   5  3 Sync 0.9ms  3009 1.9M  12    28 224b   0  22.8s     0   0   0  3003k  16k  62%
20:05:47 P   5  3 Sync 0.9ms  3437 2.2M   0    27 218b   0  23.9s     0   0   0  3003k  16k  62%
20:06:47 P   5  3 Sync 0.9ms  3319 2.1M   7    28 220b   0  24.2s     0   0   0  3003k  16k  62%
20:07:47 P   5  3 Sync 1.0ms  3388 2.2M  16    31 251b   0  22.6s     0   0   0  3003k  16k  62%
20:08:47 P   5  3 Sync 1.1ms  3695 2.4M  19    39 312b   0  13.9s     0   0   0  3003k  16k  62%
20:09:47 P   5  3 Sync 0.9ms  3293 2.1M   0    26 211b   0  26.2s     0   0   0  3003k  16k  62%

This reports around 20-25 seconds of flow control every minute, which is consistent with that ~1/3rd of performance reduction we see in the graphs above.
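That consistency can be checked from the pause column itself. A quick sketch, using the FlowC pause values from the node1 myq_status output above (each sample covers one minute):

```python
# The FlowC "pause" column reports seconds of flow control per 1-minute
# sample.  Averaging the samples gives the fraction of wall time the
# cluster spent paused -- which should line up with the observed
# throughput reduction.
pause_seconds = [25.4, 18.8, 26.0, 18.9, 27.2, 25.6,
                 22.8, 23.9, 24.2, 22.6, 13.9, 26.2]

paused_fraction = sum(pause_seconds) / (60 * len(pause_seconds))
print(f"cluster paused {paused_fraction:.0%} of the time")  # roughly 1/3
```

Roughly 38% of each minute is spent paused, consistent with the approximately one-third write-throughput reduction seen in the graphs.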

Watching node3 the same way proves it is sending the flow control (FlowC snt):

mycluster / node3 (idx: 2) / Galera 3.11(ra0189ab)
         Cluster  Node        Outbound    Inbound         FlowC       Conflct Gcache  Appl
    time P cnf  # stat laten  msgs data que  msgs data que  pause   snt lcf bfa    ist  idx  %ef
17:38:09 P   5  3 Dono 0.8ms     0   0b   0  4434 2.8M 16m  25.2s    31   0   0  18634  16k  80%
17:39:09 P   5  3 Dono 1.3ms     0   0b   1  5040 3.2M 16m  22.1s    29   0   0  37497  16k  80%
17:40:09 P   5  3 Dono 1.4ms     0   0b   0  4506 2.9M 16m  21.0s    31   0   0  16674  16k  80%
17:41:09 P   5  3 Dono 0.9ms     0   0b   0  5274 3.4M 16m  16.4s    27   0   0  22134  16k  80%
17:42:09 P   5  3 Dono 0.9ms     0   0b   0  4826 3.1M 16m  19.8s    26   0   0  16386  16k  80%
17:43:09 P   5  3 Jned 0.9ms     0   0b   0  4957 3.2M 16m  18.7s    28   0   0  83677  16k  80%
17:44:09 P   5  3 Jned 0.9ms     0   0b   0  3693 2.4M 16m  27.2s    30   0   0   131k  16k  80%
17:45:09 P   5  3 Jned 0.9ms     0   0b   0  4151 2.7M 16m  26.3s    34   0   0   185k  16k  80%
17:46:09 P   5  3 Jned 1.5ms     0   0b   0  4420 2.8M 16m  25.0s    30   0   0   245k  16k  80%
17:47:09 P   5  3 Jned 1.3ms     0   0b   1  4806 3.1M 16m  21.0s    27   0   0   310k  16k  80%

There are a lot of flow control messages (around 30) per minute.  This is a lot of ON/OFF toggles of flow control where writes are briefly delayed rather than a steady “you can’t write” for 20 seconds straight.

It also interestingly spends a long time in the Donor/Desynced state (even though wsrep_desync was turned OFF hours before) and then moves to the Joined state (this has the same meaning as during an IST).

Does it matter?

As always, it depends.

If these are web requests and suddenly the database can only handle ~66% of the traffic, that’s likely a problem, but maybe it just slows down the website somewhat.  I want to emphasize that WRITES are what is affected here.  Reads on any and all nodes should be normal (though you probably don’t want to read from node3 since it is so far behind).

If this were some queue processing that had reduced throughput, I’d expect it to possibly catch up later.

This can only be answered for your application, but the takeaways for me are:

  • Don’t underestimate your capacity requirements
  • Being at the redline normally means you are well past the redline for abnormal events.
  • Plan for maintenance and failure recoveries
  • Where possible, build queuing into your workflows so diminished throughput in your architecture doesn’t generate failures.

Happy clustering!

Graphs in this post courtesy of VividCortex.

The post High-load clusters and desynchronized nodes on Percona XtraDB Cluster appeared first on MySQL Performance Blog.


MySQL replication in action - Part 5 - parallel appliers

Previous episodes: Parallel replication overview

One of the main grievances of replication users is that, while a well-tuned master server can handle thousands of concurrent operations, an equally tuned slave is constrained to work on a single thread. In Figure 1, we see the schematics of this paradigm. Multiple operations on the master are executed simultaneously and saved to the binary log. The slave IO thread copies the binary log events to a local log, and on such log the SQL thread executes the events on the slave database. When the master is very active, chances are that the slave lags behind, causing hatred and nightmares to the DBAs.

Figure 1 - Single applier

Until 2011, there was little remedy to this problem. Then we saw the first implementation of parallel apply, provided by Tungsten Replicator. The parallelism is schema-based, and provides 5x improvement on the slave performance.

Shortly after that, the same algorithm of schema-based parallel applier was implemented —with the same performance gain— in MySQL 5.6 (Figure 2). This implementation is valuable for those organizations that have data split by schemas, which can happen for those companies that have adopted sharding of data and split their logical chunks in different schemas, or for those companies that run multi-tenant services, where splitting data by schema is a necessity. This solution does not meet the needs of all users. Many systems are based on a single very active schema and perhaps a few ancillary ones with minimal traffic. For those users, parallel replication by schema is useless.

Figure 2 - Parallel applier

In more recent times, we have seen three more implementations:

  1. In MariaDB 10, parallel appliers are based on information generated by the master when using group commit;
  2. In MySQL 5.7, a new parallelism mode was introduced, based on the events logical clock: the events that have the same timestamp are safe to execute in parallel.
  3. In MariaDB 10.1, there is an optimistic mode that breaks the commit order that was guaranteed by the group-commit algorithm, and allows higher performance.

We are not going to measure the performance of all these methods. Having benchmarked a few of them, I know by experience that you can get enormous performance gains, but depending on the data traffic you can also get smaller and smaller advantages, and you may even slow down the slave further in some cases. The performance depends on data distribution, on the pattern of the data, the type and amount of indexes, and some unknown factors that sometimes baffle both developers and DBAs. Thus, we will focus our examination on the manageability of parallel appliers. What is important to note here is that all implementations introduce, in addition to a larger number of appliers, much more complexity in the operations, where we need to know, and possibly monitor, much more than we do in single-threaded replication. We'll see in a moment how the new features rise to this challenge.
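As a rough illustration of the group-commit idea described above, here is a toy sketch, not MariaDB's actual applier code: the master tags each transaction with the id of the group it committed in, and since transactions that committed in the same group did not conflict, the slave may apply each group's members in parallel while still committing groups in order.

```python
# Toy sketch of group-commit-based parallel apply: batch consecutive
# binlog transactions that share a group-commit id.  Each batch can be
# handed to the parallel workers; batches themselves stay ordered.
from itertools import groupby

# (group_commit_id, transaction) pairs as they appear in the binlog:
binlog = [(1, "trx1"), (1, "trx2"),
          (2, "trx3"),
          (3, "trx4"), (3, "trx5"), (3, "trx6")]

batches = [[trx for _, trx in events]
           for _, events in groupby(binlog, key=lambda e: e[0])]
print(batches)  # [['trx1', 'trx2'], ['trx3'], ['trx4', 'trx5', 'trx6']]
```

Larger commit groups on the master (or on an intermediate master doing slave group commit) directly translate into wider batches, which is why group size matters so much for the achievable speedup.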

Setting up Parallel replication

Compared to what we had to do for multi-source, the setup is not hard. There is one variable that we need to change, and its name differs between MySQL 5.6/5.7 and MariaDB 10. As with multi-source, we can set up the functionality using one of the mysql-replication-samples scripts on GitHub.

Setup in MySQL 5.7

When we run the script, what happens is that the slave is stopped, the parallel workers variable is increased, and then the slave is restarted. A look at SHOW PROCESSLIST on the slave tells us that there are now 10 processes waiting for work.

| Variable_name | Value |
| slave_parallel_workers | 10 |
| Id | User | Host | db | Command | Time | State | Info |
| 9 | system user | | NULL | Connect | 0 | Waiting for master to send event | NULL |
| 10 | system user | | NULL | Connect | 0 | Reading event from the relay log | NULL |
| 11 | system user | | NULL | Connect | 0 | Waiting for an event from Coordinator | NULL |
| 12 | system user | | NULL | Connect | 0 | Waiting for an event from Coordinator | NULL |
| 13 | system user | | NULL | Connect | 0 | Waiting for an event from Coordinator | NULL |
| 14 | system user | | NULL | Connect | 0 | Waiting for an event from Coordinator | NULL |
| 15 | system user | | NULL | Connect | 0 | Waiting for an event from Coordinator | NULL |
| 16 | system user | | NULL | Connect | 0 | Waiting for an event from Coordinator | NULL |
| 17 | system user | | NULL | Connect | 0 | Waiting for an event from Coordinator | NULL |
| 18 | system user | | NULL | Connect | 0 | Waiting for an event from Coordinator | NULL |
| 19 | system user | | NULL | Connect | 0 | Waiting for an event from Coordinator | NULL |
| 20 | system user | | NULL | Connect | 0 | Waiting for an event from Coordinator | NULL |
| 22 | msandbox | localhost | NULL | Query | 0 | starting | show processlist |
Setup in MariaDB 10

Same story in MariaDB 10. Stop the slave, change the variable, restart the slave. The process list shows willing workers.

| Variable_name | Value |
| slave_parallel_threads | 10 |
| Id | User | Host | db | Command | Time | State | Info | Progress |
| 10 | system user | | NULL | Connect | 0 | Waiting for master to send event | NULL | 0.000 |
| 11 | system user | | NULL | Connect | 0 | Waiting for work from SQL thread | NULL | 0.000 |
| 12 | system user | | NULL | Connect | 0 | Waiting for work from SQL thread | NULL | 0.000 |
| 13 | system user | | NULL | Connect | 0 | Waiting for work from SQL thread | NULL | 0.000 |
| 14 | system user | | NULL | Connect | 0 | Waiting for work from SQL thread | NULL | 0.000 |
| 15 | system user | | NULL | Connect | 0 | Waiting for work from SQL thread | NULL | 0.000 |
| 16 | system user | | NULL | Connect | 0 | Waiting for work from SQL thread | NULL | 0.000 |
| 17 | system user | | NULL | Connect | 0 | Waiting for work from SQL thread | NULL | 0.000 |
| 18 | system user | | NULL | Connect | 0 | Waiting for work from SQL thread | NULL | 0.000 |
| 19 | system user | | NULL | Connect | 0 | Waiting for work from SQL thread | NULL | 0.000 |
| 20 | system user | | NULL | Connect | 0 | Waiting for work from SQL thread | NULL | 0.000 |
| 21 | system user | | NULL | Connect | 0 | Slave has read all relay log; wait | NULL | 0.000 |
| 23 | msandbox | localhost | NULL | Query | 0 | init | show processlist | 0.000 |

The above output has been shortened for easier display. The message for thread 21 said "waiting for master to send event"

Monitoring parallel replication

After the setup, the interesting operations can start. We need to know what we can get from the monitoring facilities.

To monitor, however, we need some action to look at. In the same mysql-replication-samples repository there are a few scripts that can generate load for multiple databases or for many tables in the same database. We will use the first one in MySQL 5.7 to check the default algorithm (parallelism by schema) and the second script for MariaDB, again to test its default algorithm, which can run parallel appliers in the same schema. As I mentioned before, we are not comparing performance here: we just want to see what we can detect when parallel replication is running.

The scripts are nothing fancy. They will generate lots of concurrent inserts in many tables. This is enough to see all the parallel workers busy. You can use your favorite stress test instead of these simple scripts.

Monitoring in MySQL 5.7

Let's start with MySQL 5.7, where we have two tables dedicated to parallel replication monitoring. The first one, in the mysql schema, can show the progress of work on binary logs and relay logs. There is no mention of GTIDs, and there is no mention of databases. For a system where parallelism can happen by schema, this looks like an insufficient design. Another thing that is important to notice is the field Checkpoint_group_bitmap, whose purpose I don't know, but which is very prominent when you select the table contents, because this column is filled with non-printable characters. For this reason, we will select just a few columns, to see the basic progress.

slave1 [localhost] {msandbox} (mysql) > desc slave_worker_info;
| Field | Type | Null | Key | Default | Extra |
| Id | int(10) unsigned | NO | PRI | NULL | |
| Relay_log_name | text | NO | | NULL | |
| Relay_log_pos | bigint(20) unsigned | NO | | NULL | |
| Master_log_name | text | NO | | NULL | |
| Master_log_pos | bigint(20) unsigned | NO | | NULL | |
| Checkpoint_relay_log_name | text | NO | | NULL | |
| Checkpoint_relay_log_pos | bigint(20) unsigned | NO | | NULL | |
| Checkpoint_master_log_name | text | NO | | NULL | |
| Checkpoint_master_log_pos | bigint(20) unsigned | NO | | NULL | |
| Checkpoint_seqno | int(10) unsigned | NO | | NULL | |
| Checkpoint_group_size | int(10) unsigned | NO | | NULL | |
| Checkpoint_group_bitmap | blob | NO | | NULL | |
| Channel_name | char(64) | NO | PRI | NULL | |
13 rows in set (0.00 sec)
slave1 [localhost] {msandbox} (mysql) > select ID,Relay_log_name,Relay_log_pos,Master_log_name,Master_log_pos from slave_worker_info;
| ID | Relay_log_name | Relay_log_pos | Master_log_name | Master_log_pos |
| 1 | ./mysql-relay.000003 | 33029530 | mysql-bin.000002 | 33029317 |
| 2 | ./mysql-relay.000003 | 33066066 | mysql-bin.000002 | 33065853 |
| 3 | ./mysql-relay.000003 | 33111090 | mysql-bin.000002 | 33110877 |
| 4 | ./mysql-relay.000003 | 33107225 | mysql-bin.000002 | 33107012 |
| 5 | ./mysql-relay.000003 | 33059630 | mysql-bin.000002 | 33059417 |
| 6 | ./mysql-relay.000003 | 33056541 | mysql-bin.000002 | 33056328 |
| 7 | ./mysql-relay.000003 | 33086391 | mysql-bin.000002 | 33086178 |
| 8 | ./mysql-relay.000003 | 33018457 | mysql-bin.000002 | 33018244 |
| 9 | ./mysql-relay.000003 | 33052167 | mysql-bin.000002 | 33051954 |
| 10 | ./mysql-relay.000003 | 33044450 | mysql-bin.000002 | 33044237 |
10 rows in set (0.00 sec)

We can see that the workers are at different positions in the relay logs, which correspond to different positions in the binary logs. However, there are no timestamps, so it is hard to know which workers are working more.
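Since the table only exposes log positions, about the best we can do is compare them. A quick sketch using the Master_log_pos values from the slave_worker_info output above (all positions are in the same binlog file, mysql-bin.000002):

```python
# Compare the workers' master-log positions from slave_worker_info to
# see how far apart the workers are in the binlog.  Without timestamps,
# this spread is the only progress signal the table gives us.
positions = [33029317, 33065853, 33110877, 33107012, 33059417,
             33056328, 33086178, 33018244, 33051954, 33044237]

spread = max(positions) - min(positions)
print(f"workers span {spread} bytes of binlog")  # 92633 bytes
```

The workers span fewer than 100 KB of binlog, so they are all working on roughly the same region of the stream; but nothing here says which worker is applying the most events.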

Next, we look at the table dedicated to parallel replication in performance_schema.

slave1 [localhost] {msandbox} (mysql) > use performance_schema
Database changed

slave1 [localhost] {msandbox} (performance_schema) > desc replication_applier_status_by_worker;
| Field | Type | Null | Key | Default | Extra |
| CHANNEL_NAME | char(64) | NO | | NULL | |
| WORKER_ID | bigint(20) unsigned | NO | | NULL | |
| THREAD_ID | bigint(20) unsigned | YES | | NULL | |
| SERVICE_STATE | enum('ON','OFF') | NO | | NULL | |
| LAST_SEEN_TRANSACTION | char(57) | NO | | NULL | |
| LAST_ERROR_NUMBER | int(11) | NO | | NULL | |
| LAST_ERROR_MESSAGE | varchar(1024) | NO | | NULL | |

slave1 [localhost] {msandbox} (performance_schema) > select WORKER_ID, THREAD_ID , SERVICE_STATE , LAST_SEEN_TRANSACTION , LAST_ERROR_MESSAGE from replication_applier_status_by_worker;
| 1 | 40 | ON | 00013253-1111-1111-1111-111111111111:133621 | |
| 2 | 41 | ON | 00013253-1111-1111-1111-111111111111:133699 | |
| 3 | 42 | ON | 00013253-1111-1111-1111-111111111111:133826 | |
| 4 | 43 | ON | 00013253-1111-1111-1111-111111111111:133919 | |
| 5 | 44 | ON | 00013253-1111-1111-1111-111111111111:133686 | |
| 6 | 45 | ON | 00013253-1111-1111-1111-111111111111:133688 | |
| 7 | 46 | ON | 00013253-1111-1111-1111-111111111111:133770 | |
| 8 | 47 | ON | 00013253-1111-1111-1111-111111111111:133494 | |
| 9 | 48 | ON | 00013253-1111-1111-1111-111111111111:133731 | |
| 10 | 49 | ON | 00013253-1111-1111-1111-111111111111:133763 | |
10 rows in set (0.00 sec)

There is a timestamp in this table, but only for errors; regular operations don't get one. Here too we are missing info about the database and the time of extraction and apply, and we don't know what exactly the worker is doing: the column LAST_SEEN_TRANSACTION is about received GTIDs, not executed ones.

Finally, we have a look at the output of SHOW SLAVE STATUS. In the past episodes, here is where we had the most comprehensive set of information. But with parallel replication we are going to have a surprise.

slave1 [localhost] {msandbox} (performance_schema) > SHOW SLAVE STATUS\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_User: rsandbox
Master_Port: 13253
Connect_Retry: 60
Master_Log_File: mysql-bin.000002
Read_Master_Log_Pos: 38675840
Relay_Log_File: mysql-relay.000003
Relay_Log_Pos: 35124216
Relay_Master_Log_File: mysql-bin.000002
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Last_Errno: 0
Skip_Counter: 0
Exec_Master_Log_Pos: 35124003
Relay_Log_Space: 38676469
Until_Condition: None
Until_Log_Pos: 0
Master_SSL_Allowed: No
Seconds_Behind_Master: 30
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_SQL_Errno: 0
Master_Server_Id: 1
Master_UUID: 00013253-1111-1111-1111-111111111111
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Reading event from the relay log
Master_Retry_Count: 86400
Retrieved_Gtid_Set: 00013253-1111-1111-1111-111111111111:1-150420
Executed_Gtid_Set: 00013253-1111-1111-1111-111111111111:1-136616:136618-136624:136626-136632:

Auto_Position: 1
1 row in set (0.00 sec)

The output of Executed_Gtid_Set is actually one line; I split it to make it fit on the page. But even after splitting it, I have trouble understanding what it is showing, and how this monster piece of information can be useful. Want to see something more horrible than this? Try parallel replication with multi-source.
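One way to make such a set less opaque is to parse it: the intervals after the UUID are ranges of committed transactions, and the holes between them are transactions still in flight on some worker. A minimal sketch for a single-UUID set (illustration only, not a production parser):

```python
# Sketch: parse a single-UUID Executed_Gtid_Set value and list the gaps,
# i.e. transactions not yet committed while later ones already are, which
# is a normal sight with parallel replication.
def gtid_gaps(gtid_set):
    """Return (highest committed seq_no, list of missing (lo, hi) ranges)."""
    uuid, intervals = gtid_set.split(":", 1)
    ranges = []
    for part in intervals.split(":"):
        lo, _, hi = part.partition("-")
        ranges.append((int(lo), int(hi) if hi else int(lo)))
    gaps = [(prev_hi + 1, lo - 1)
            for (_, prev_hi), (lo, _) in zip(ranges, ranges[1:])
            if lo > prev_hi + 1]
    return ranges[-1][1], gaps

high, gaps = gtid_gaps(
    "00013253-1111-1111-1111-111111111111:1-136616:136618-136624:136626-136632")
print(high, gaps)  # 136632 [(136617, 136617), (136625, 136625)]
```

Run against the value above, it shows that only two transactions are pending; a monitoring check could alert when the number of gaps grows instead of staring at the raw string.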

Monitoring with MariaDB 10

In MariaDB there is only one table where we can follow progress. Unlike its counterparts in MySQL 5.6/5.7, it has no fixed number of rows. This is the same table where GTID operations are recorded with single-threaded replication. Rows are added or pruned according to the traffic in the various channels.

slave1 [localhost] {msandbox} (mysql) > select * from gtid_slave_pos;
+-----------+--------+-----------+--------+
| domain_id | sub_id | server_id | seq_no |
+-----------+--------+-----------+--------+
|         0 |  28569 |         1 |  28569 |
|         0 |  28570 |         1 |  28570 |
+-----------+--------+-----------+--------+
2 rows in set (0.00 sec)

slave1 [localhost] {msandbox} (mysql) > select * from gtid_slave_pos;
+-----------+--------+-----------+--------+
| domain_id | sub_id | server_id | seq_no |
+-----------+--------+-----------+--------+
|         0 |  42786 |         1 |  42786 |
|         0 |  42787 |         1 |  42787 |
|         0 |  42788 |         1 |  42788 |
|         0 |  42789 |         1 |  42789 |
+-----------+--------+-----------+--------+
4 rows in set (0.00 sec)

slave1 [localhost] {msandbox} (mysql) > select * from gtid_slave_pos;
+-----------+--------+-----------+--------+
| domain_id | sub_id | server_id | seq_no |
+-----------+--------+-----------+--------+
|         0 |  46807 |         1 |  46807 |
|         0 |  46808 |         1 |  46808 |
+-----------+--------+-----------+--------+
2 rows in set (0.00 sec)

The information here is slim. As we have noticed in the previous articles, we get little or no visibility into the internal operations. Here we have ten appliers that work concurrently, but there is little evidence that this is happening. Much like 'show processlist', the gtid table shows only a glimpse of a brief moment. Sometimes we find 10 rows showing activity, sometimes just two. It does not mean that the server is idle. It's just that we don't catch its activity. Same as when we try to monitor using process list: with a busy server, catching a view of many concurrent queries is a matter of luck.
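Thin as it is, the table can still yield a progress indicator: the highest seq_no seen per domain_id is a high-water mark of applied transactions. A minimal sketch (the rows are hard-coded from the sample output above; a real monitor would fetch them over a client connection, which is not shown):

```python
# Sketch: derive an applied-transactions high-water mark per replication
# domain from mysql.gtid_slave_pos rows. Rows are hard-coded from the
# sample above; a real monitor would query them over a connection.
def domain_watermarks(rows):
    """rows: iterable of (domain_id, sub_id, server_id, seq_no) tuples."""
    watermarks = {}
    for domain_id, _sub_id, _server_id, seq_no in rows:
        watermarks[domain_id] = max(watermarks.get(domain_id, 0), seq_no)
    return watermarks

sample = [(0, 42786, 1, 42786), (0, 42787, 1, 42787),
          (0, 42788, 1, 42788), (0, 42789, 1, 42789)]
print(domain_watermarks(sample))  # {0: 42789}
```

Sampling the table repeatedly and differencing successive watermarks gives an apply rate per domain, which tells more about the ten workers than any single snapshot can.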

Let's have a look at SHOW SLAVE STATUS. The good news is that we don't have the horrible crowd seen in MySQL 5.7. The bad news is that we get just as much information as we would with a single thread. And let's not forget that the only info about GTIDs in the slave status relates to the IO thread. For the SQL threads (in this case, we have 10) we need to look at the table above or at the GTID variables.

slave1 [localhost] {msandbox} ((none)) > SHOW SLAVE STATUS\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_User: rsandbox
Master_Port: 25030
Connect_Retry: 60
Master_Log_File: mysql-bin.000001
Read_Master_Log_Pos: 958300
Relay_Log_File: mysql-relay.000002
Relay_Log_Pos: 956223
Relay_Master_Log_File: mysql-bin.000001
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Last_Errno: 0
Skip_Counter: 0
Exec_Master_Log_Pos: 958300
Relay_Log_Space: 956516
Until_Condition: None
Until_Log_Pos: 0
Master_SSL_Allowed: No
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_SQL_Errno: 0
Master_Server_Id: 1
Using_Gtid: Current_Pos
Gtid_IO_Pos: 0-1-5826
1 row in set (0.00 sec)

Small improvements

About two years ago I wrote a post comparing parallel replication in MySQL 5.6 and Tungsten Replicator. Most of what was said there is still valid, but I want to acknowledge a small improvement. When an error happened in parallel replication, the message shown in my article said:

Last_SQL_Error: Error 'Duplicate entry '1' for key 'PRIMARY'' on query.
Default database: 'test'. Query: 'insert into t1 values (1)'

In this context, when working with 10 threads, finding where the problem happens may be tough. But now the same situation results in this more helpful error message:

Last_SQL_Error: Worker 8 failed executing transaction '00021891-1111-1111-1111-111111111111:90860'
at master log mysql-bin.000002, end_log_pos 23636368; Error 'Duplicate entry '340' for key 'PRIMARY'' on query.
Default database: 'db1'. Query: 'insert into t7 values (340, null)'

Here we get the thread ID (Worker 8), the GTID (yay!) and next to it the binary log name and position that was processed. This proves that the worker thread knows everything that is useful. Now the next step will be adding the same information to the appropriate monitoring table.
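Since the worker ID, GTID, and binlog coordinates are all embedded in that one string, a monitoring script can already pull them out with a regular expression. A hedged sketch (the pattern is written against the exact message shown above; other server versions may word it differently):

```python
import re

# Sketch: extract worker ID, GTID, and binlog coordinates from the
# Last_SQL_Error message format shown above. Treat the pattern as an
# assumption: other server versions may phrase the message differently.
ERROR_RE = re.compile(
    r"Worker (?P<worker>\d+) failed executing transaction '(?P<gtid>[^']+)'\s+"
    r"at master log (?P<binlog>\S+), end_log_pos (?P<pos>\d+)")

msg = ("Worker 8 failed executing transaction "
       "'00021891-1111-1111-1111-111111111111:90860' "
       "at master log mysql-bin.000002, end_log_pos 23636368; "
       "Error 'Duplicate entry '340' for key 'PRIMARY'' on query.")

m = ERROR_RE.search(msg)
print(m.group("worker"), m.group("binlog"), m.group("pos"))
# 8 mysql-bin.000002 23636368
```

With those fields in hand, an alert can point straight at the failing worker and position instead of quoting the whole error blob.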

Summing up

Of all the new features, parallel replication is probably the most needed, and the one that falls shortest in terms of usability. MySQL 5.7 offers what looks at first sight like a large amount of monitoring data, but in the end it is not enough to run operations confidently.

In MariaDB, the lack of monitoring data is so acute that I would be very reluctant to use or recommend it for production.

It's a start. I hope that both teams will pause a bit in their frenzy to add more features, and strengthen the existing ones instead.

What's next: Percona Live in Amsterdam!

This is the last episode in the series. There is much more to say about replication features, old and new, but the focus of these articles was on monitoring capabilities, and we've seen enough.

I am speaking at Percona Live Amsterdam 2015 on September 22, and I will cover all these topics with examples.

I would like to remind everyone that there is a public project on GitHub with mysql-replication-samples. I am sure plenty of other scripts could be added. Participation is welcome!


Unexpected Memory Consumption for Bulk Index Creation in InnoDB (MySQL)

In my last Hackathon, I worked on MyISAM vs. InnoDB for data loading (LOAD DATA INFILE) and bulk index creation. My motivation was the following: knowing that some people still use MyISAM for this particular use-case, I wanted to verify/understand if/why InnoDB is slower than MyISAM. I do not yet have complete results on this specific subject, but I found some interesting things that

First day with InnoDB transparent page compression

I ran linkbench overnight for a database that started at 100G using MySQL 5.7.8 and InnoDB transparent page compression. After ~24 hours I have 1 mysqld crash with nothing in the error log. I don't know if that is related to bug 77738. I will attach gdb and hope for another crash. For more about transparent page compression read here, here and here. For concerns about the feature see the post by Domas. I previously wrote about this feature.

Then I ran 'dmesg -e' and got 81 complaints from XFS on the host that uses transparent compression. The warnings are from the time when the benchmark ran. My other test host isn't using hole-punch and doesn't get these warnings.

[Aug27 05:53] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[  +1.999375] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[  +1.999387] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[  +1.983386] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[  +1.999379] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)

The host has Fedora 19 and the kernel is 3.14.27-100.fc19.x86_64. I don't know if Fedora 19 is officially supported. I know that hole punch is available because this is in the error log:
    [Note] InnoDB: PUNCH HOLE support available

And this was used in a create table statement:

From my host without hole-punch where the files for the table are ~44G.
    $ xfs_bmap /ssd1/scratch/data.mdcallag/data/linkdb/linktable*.ibd | wc -l

And from the host with it, where the files for the table are ~104G according to ls but are much smaller because of the holes.
    $ xfs_bmap /ssd1/scratch/data.mdcallag/data/linkdb/linktable.ibd  | wc -l
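The size difference that ls hides can also be checked from stat: st_size is the apparent length, while st_blocks counts what is actually allocated, so a hole-punched file shows far fewer blocks than its length implies. A small sketch that demonstrates the idea on a synthetic sparse file (no InnoDB involved; assumes a filesystem with sparse-file support):

```python
import os
import tempfile

# Sketch: apparent size (what ls shows) vs allocated size (what the file
# really occupies). A hole-punched .ibd behaves like this sparse file:
# st_size stays large while st_blocks stays small.
fd, path = tempfile.mkstemp()
try:
    os.lseek(fd, 10 * 1024 * 1024, os.SEEK_SET)  # leave a 10 MiB hole
    os.write(fd, b"x")                           # one real byte at the end
    os.close(fd)
    st = os.stat(path)
    apparent = st.st_size            # 10 MiB + 1 byte
    allocated = st.st_blocks * 512   # far less on sparse-aware filesystems
    print(apparent, allocated)
finally:
    os.unlink(path)
```

Comparing the two numbers for each .ibd file is a quick way to estimate how much space hole-punching actually saved, without running xfs_bmap.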

I don't like the vmstat output either. On the host that uses transparent page compression swap is being used and that started during the linkbench load. It is not being used on the other host. Doesn't look right.

 procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
13  6 4095996 8410200   1164 6298324    0    0    65   534    0    0  8  1 89  2
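Reading those columns in a script is straightforward; here is a minimal sketch that maps a vmstat data line to named fields so that swpd (and si/so) can be checked automatically. The field order is taken from the header line above:

```python
# Sketch: turn a vmstat data line into named fields so swap usage (swpd,
# si, so) can be tested in a script. Field order follows the vmstat
# header shown above.
VMSTAT_FIELDS = ("r", "b", "swpd", "free", "buff", "cache", "si", "so",
                 "bi", "bo", "in", "cs", "us", "sy", "id", "wa")

def parse_vmstat(line):
    return dict(zip(VMSTAT_FIELDS, (int(v) for v in line.split())))

row = parse_vmstat("13  6 4095996 8410200   1164 6298324    0    0"
                   "    65   534    0    0  8  1 89  2")
print(row["swpd"], row["si"], row["so"])  # 4095996 0 0
```

A non-zero swpd on a host that should never swap, like the one above, is exactly the kind of condition a threshold alert should catch.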


Log Buffer #438: A Carnival of the Vanities for DBAs

This Log Buffer Edition covers Oracle, MySQL, and SQL Server blog posts from the last week.

Oracle:

Integrating Telstra Public SMS API into Bluemix

Adaptive Query Optimization in Oracle 12c : Ongoing Updates

First flight into the Oracle Mobile Cloud Service

Oracle 12C Problem with datapatch. Part 2, the “fix”

oracle applications r12 auto start on linux

SQL Server:

Email Formatted HTML Table with T-SQL

SQL Server 2016 – Introduction to Stretch Database

Soundex – Experiments with SQLCLR Part 3

An Introduction to Real-Time Communication with SignalR

Strange Filtered Index Problem

MySQL:

Announcing Galera Cluster 5.5.42 and 5.6.25 with Galera 3.12

doing nothing on modern CPUs

Single-threaded linkbench performance for MySQL 5.7, 5.6, WebScale and MyRocks

Identifying Insecure Connections

MyOraDump, Oracle dump utility, version 1.2

The post Log Buffer #438: A Carnival of the Vanities for DBAs appeared first on Pythian - Data Experts Blog.


Log Buffer #437: A Carnival of the Vanities for DBAs

This Log Buffer Edition goes deep into the vistas of the database world and brings out a few of the good posts published during the week on Oracle, SQL Server, and MySQL.

Oracle:

Overriding Default Context-Sensitive Action Enablement

This is an alternative to if… then… else… elsif… end if when you want to use conditional statements in PL/SQL.

Achieving SAML interoperability with OAM OAuth Server

Release of BP02 for Oracle Identity Manager

IT Business Edge: Oracle Ties Mobile Security to Identity and Access Management

SQL Server:

How to render PDF documents using SQL CLR. Also a good introduction on creating SQL CLR functions.

What is DNX?

SQL Server Performance dashboard reports

Using Microsoft DiskSpd to Test Your Storage Subsystem

Connect to Salesforce Data as a Linked Server

MySQL:

Optimizing PXC Xtrabackup State Snapshot Transfer

Adding your own collation to MySQL

Monitoring your Amazon Aurora Databases using MONyog

How much could you benefit from MySQL 5.6 parallel replication?

MySQL checksum

The post Log Buffer #437: A Carnival of the Vanities for DBAs appeared first on Pythian - Data Experts Blog.
