Planet MySQL

MySQL 8.0 Hot Rows with NOWAIT and SKIP LOCKED

In MySQL 8.0 there are two new features designed to support lock handling: NOWAIT and SKIP LOCKED. In this post, we’ll look at how MySQL 8.0 handles hot rows. Until now, how have you handled rows that are locked by an active transaction, the hot rows? Most likely, the application attempts to access the data and, if there is a lock on the requested rows, incurs a timeout and has to retry the transaction. These two new features help you implement more sophisticated lock handling scenarios, allowing you to handle timeouts better and improve the application’s performance.

To demonstrate, I’ll use this product table.

mysql> select @@version;
+-----------+
| @@version |
+-----------+
| 8.0.11    |
+-----------+
1 row in set (0.00 sec)

CREATE TABLE `product` (
  `p_id` int(11) NOT NULL AUTO_INCREMENT,
  `p_name` varchar(255) DEFAULT NULL,
  `p_cost` decimal(19,4) NOT NULL,
  `p_availability` enum('YES','NO') DEFAULT 'NO',
  PRIMARY KEY (`p_id`),
  KEY `p_cost` (`p_cost`),
  KEY `p_name` (`p_name`)
) ENGINE=InnoDB AUTO_INCREMENT=8 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

Let’s run through an example. The transaction below will lock rows 2 and 3 if they are not already locked. The rows are released when our transaction issues a COMMIT or a ROLLBACK. Autocommit is enabled by default and can be suspended by using the START TRANSACTION clause, or disabled entirely by setting autocommit to 0.

Session 1:

mysql> START TRANSACTION;SELECT * FROM mydb.product WHERE p_cost >=20 and p_cost <=30 FOR UPDATE;
Query OK, 0 rows affected (0.00 sec)

+------+--------+---------+----------------+
| p_id | p_name | p_cost  | p_availability |
+------+--------+---------+----------------+
|    2 | Item2  | 20.0000 | YES            |
|    3 | Item3  | 30.0000 | YES            |
+------+--------+---------+----------------+
2 rows in set (0.00 sec)

InnoDB performs row-level locking in such a way that when it searches or scans a table index, it sets shared or exclusive locks on the index records it encounters. Thus, the row-level locks are actually index-record locks.

We can get the details of a transaction, such as the transaction id and row lock count, using SHOW ENGINE INNODB STATUS or by querying the performance_schema.data_locks table. The output of SHOW ENGINE INNODB STATUS can however be confusing, as we can see below. Our query only locked rows 2 and 3, but the output reports 5 row locks (the locked PRIMARY records, plus the locked p_cost secondary index records, plus the supremum pseudo-record). We can see that the row right next to the rows that we selected is also reported as locked. This is expected and documented behavior: since the table is small, with only 5 rows, a full scan of the table is much faster than an index search, which causes all or most rows of the table to end up locked as a result of our query.

InnoDB engine status:

---TRANSACTION 205338, ACTIVE 22 sec
3 lock struct(s), heap size 1136, 5 row lock(s)
MySQL thread id 8, OS thread handle 140220824467200, query id 28 localhost root

performance_schema.data_locks (another new feature in 8.0.1):

mysql> SELECT ENGINE_TRANSACTION_ID, CONCAT(OBJECT_SCHEMA, '.', OBJECT_NAME) TBL, INDEX_NAME, count(*) LOCK_DATA
       FROM performance_schema.data_locks
       WHERE LOCK_DATA != 'supremum pseudo-record'
       GROUP BY ENGINE_TRANSACTION_ID, INDEX_NAME, OBJECT_NAME, OBJECT_SCHEMA;
+-----------------------+--------------+------------+-----------+
| ENGINE_TRANSACTION_ID | TBL          | INDEX_NAME | LOCK_DATA |
+-----------------------+--------------+------------+-----------+
|                205338 | mydb.product | p_cost     |         3 |
|                205338 | mydb.product | PRIMARY    |         2 |
+-----------------------+--------------+------------+-----------+
2 rows in set (0.04 sec)

mysql> SELECT ENGINE_TRANSACTION_ID as ENG_TRX_ID, object_name, index_name, lock_type, lock_mode, lock_data
       FROM performance_schema.data_locks WHERE object_name = 'product';
+------------+-------------+------------+-----------+-----------+-------------------------+
| ENG_TRX_ID | object_name | index_name | lock_type | lock_mode | lock_data               |
+------------+-------------+------------+-----------+-----------+-------------------------+
|     205338 | product     | NULL       | TABLE     | IX        | NULL                    |
|     205338 | product     | p_cost     | RECORD    | X         | 0x800000000000140000, 2 |
|     205338 | product     | p_cost     | RECORD    | X         | 0x8000000000001E0000, 3 |
|     205338 | product     | p_cost     | RECORD    | X         | 0x800000000000320000, 5 |
|     205338 | product     | PRIMARY    | RECORD    | X         | 2                       |
|     205338 | product     | PRIMARY    | RECORD    | X         | 3                       |
+------------+-------------+------------+-----------+-----------+-------------------------+
6 rows in set (0.00 sec)

Session 1:

mysql> COMMIT;
Query OK, 0 rows affected (0.00 sec)

SELECT FOR UPDATE with innodb_lock_wait_timeout:

The innodb_lock_wait_timeout variable is one mechanism used to handle lock conflicts. It defaults to 50 seconds: any transaction that waits for a lock for longer than that is terminated, and a timeout error is returned to the user. The parameter is configurable based on the requirements of the application.
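The variable can be tuned per session as well as globally; a minimal sketch (the 5-second value is purely illustrative, pick one that matches your application's retry behavior):

```sql
-- Lower the lock wait timeout for the current session only
SET SESSION innodb_lock_wait_timeout = 5;

-- Or change the server-wide default (affects new sessions)
SET GLOBAL innodb_lock_wait_timeout = 5;
```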

Let’s look at how this feature works using an example with a select for update query.

mysql> select @@innodb_lock_wait_timeout;
+----------------------------+
| @@innodb_lock_wait_timeout |
+----------------------------+
|                         50 |
+----------------------------+
1 row in set (0.00 sec)

Session 1:

mysql> START TRANSACTION;SELECT * FROM mydb.product WHERE p_cost >=20 and p_cost <=30 FOR UPDATE;
Query OK, 0 rows affected (0.00 sec)

+------+--------+---------+----------------+
| p_id | p_name | p_cost  | p_availability |
+------+--------+---------+----------------+
|    2 | Item2  | 20.0000 | YES            |
|    3 | Item3  | 30.0000 | YES            |
+------+--------+---------+----------------+
2 rows in set (0.00 sec)

Session 2:

mysql> select now();SELECT * FROM mydb.product WHERE p_id=3 FOR UPDATE;select now();
+---------------------+
| now()               |
+---------------------+
| 2018-06-19 05:29:48 |
+---------------------+
1 row in set (0.00 sec)

ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction
+---------------------+
| now()               |
+---------------------+
| 2018-06-19 05:30:39 |
+---------------------+
1 row in set (0.00 sec)

Autocommit is enabled (by default) and, as expected, the transaction waited for the lock wait timeout and exited.

Session 1:

mysql> COMMIT;
Query OK, 0 rows affected (0.00 sec)


The NOWAIT clause causes a query to terminate immediately if any candidate rows are already locked. Considering the previous example, if the application’s requirement is not to wait for the locks to be released or for a timeout, using the NOWAIT clause is the perfect solution. (Setting innodb_lock_wait_timeout=1 in the session has a similar effect.)

Session 1:

mysql> START TRANSACTION;SELECT * FROM mydb.product WHERE p_cost >=20 and p_cost <=30 FOR UPDATE;
Query OK, 0 rows affected (0.00 sec)

+------+--------+---------+----------------+
| p_id | p_name | p_cost  | p_availability |
+------+--------+---------+----------------+
|    2 | Item2  | 20.0000 | YES            |
|    3 | Item3  | 30.0000 | YES            |
+------+--------+---------+----------------+
2 rows in set (0.00 sec)

Session 2:

mysql> SELECT * FROM mydb.product WHERE p_id = 3 FOR UPDATE NOWAIT;
ERROR 3572 (HY000): Statement aborted because lock(s) could not be acquired immediately and NOWAIT is set.

Session 1:

mysql> COMMIT;
Query OK, 0 rows affected (0.00 sec)


The SKIP LOCKED clause tells MySQL to skip over the locked rows (non-deterministically) and process the remaining rows that match the WHERE clause. Let’s look at how this works using some examples:

Session 1:

mysql> START TRANSACTION;SELECT * FROM mydb.product WHERE p_cost >=20 and p_cost <=30 FOR UPDATE;
Query OK, 0 rows affected (0.00 sec)

+------+--------+---------+----------------+
| p_id | p_name | p_cost  | p_availability |
+------+--------+---------+----------------+
|    2 | Item2  | 20.0000 | YES            |
|    3 | Item3  | 30.0000 | YES            |
+------+--------+---------+----------------+
2 rows in set (0.00 sec)

Session 2:

mysql> SELECT * FROM mydb.product WHERE p_cost = 30 FOR UPDATE SKIP LOCKED;
Empty set (0.00 sec)

mysql> SELECT * from mydb.product where p_id IN (1,2,3,4,5) FOR UPDATE SKIP LOCKED;
+------+--------+---------+----------------+
| p_id | p_name | p_cost  | p_availability |
+------+--------+---------+----------------+
|    1 | Item1  | 10.0000 | YES            |
|    5 | Item5  | 50.0000 | YES            |
+------+--------+---------+----------------+
2 rows in set (0.00 sec)

Session 1:

mysql> COMMIT;
Query OK, 0 rows affected (0.00 sec)

The first transaction selects rows 2 and 3 for update (i.e., it locks them). The second transaction skips these rows and returns the remaining rows when the SKIP LOCKED clause is used.

Important notes: as the SELECT … FOR UPDATE clause affects concurrency, it should only be used when absolutely necessary. Make sure to index the columns used in the WHERE clause, as SELECT … FOR UPDATE is likely to lock the whole table if proper indexes are not set up. When an index is used, only the candidate rows are locked.
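One practical pattern SKIP LOCKED enables is a simple job queue, where each of several workers claims a different unlocked row. The sketch below is illustrative only: it reuses the product table from the examples above, and the claiming logic would need adapting to a real schema:

```sql
-- A worker claims one available row that no other worker currently holds
START TRANSACTION;

SELECT p_id, p_name
  FROM mydb.product
 WHERE p_availability = 'YES'
 ORDER BY p_id
 LIMIT 1
 FOR UPDATE SKIP LOCKED;

-- ... do the work, then mark the claimed row as taken
-- (here 3 stands in for whatever p_id the SELECT returned) ...
UPDATE mydb.product SET p_availability = 'NO' WHERE p_id = 3;

COMMIT;
```

Because SKIP LOCKED skips rows held by other transactions, two workers running the same SELECT concurrently each receive a different row instead of one blocking on the other.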

The post MySQL 8.0 Hot Rows with NOWAIT and SKIP LOCKED appeared first on Percona Database Performance Blog.

How NOT to Monitor Your Database

Do you have experience putting out backend database fires? What were some things you wished you had done differently? Proactive database monitoring is more cost-efficient, manageable, and sanity-saving than reactive monitoring. We reviewed some of the most common mistakes - too many log messages, metric “melting pots,” retroactive changes, incomplete visibility, undefined KPIs - and put together an action plan on how to prevent them. From our experience, we've listed the top 5 biggest (and preventable!) database monitoring pitfalls.

Log Levels

There never seem to be enough logging levels to accurately capture the desired granularity and relevance of a log message. Is it INFO, TRACE, or DEBUG? What if it’s DEBUG but describes a condition we should WARN about? Is there really a linear hierarchy here? If you’re like most people, you’ve seen at least one attempt to extend the standard levels of a widely available logging system with even more custom levels. There is a good argument that there should really be only two types of log messages: those useful for writing and debugging the code, and those useful for operating it. Dave Cheney has a good blog post about this differentiation.

Mixed Status and Configuration Variables

Many systems don’t distinguish between status variables, which signal the system’s state, and configuration variables, which are inputs to the system’s operation. For example, in both MySQL and Redis, the commands to get system status will return mixtures of configuration variables and status metrics. Such a metrics “melting pot” is a very common problem that usually requires custom code or exception lists (blacklist/whitelist) to identify which variables are what. 
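MySQL at least exposes the two classes through separate statements, even though many monitoring agents still scrape them into one stream; a minimal illustration:

```sql
-- Configuration variable: an input, e.g. how large the buffer pool may grow
SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';

-- Status variable: an output, e.g. how many threads are connected right now
SHOW GLOBAL STATUS LIKE 'Threads_connected';
```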

Breaking Backwards Compatibility

If you change the meaning or dimensions of a metric, ideally you should leave the old behavior unchanged and introduce a replacement alongside it. Failure to do this causes a lot of work for other systems. For example, in MySQL, the SHOW STATUS command was changed to include connection-specific counters by default, with the old system-wide global counters accessible via a different query syntax. This change was just a bad decision, and it caused an enormous amount of grief. Likewise, the meaning of MySQL’s “Questions” status variable was changed at one point, and the old behavior was available in a new variable called “Queries.” Essentially, they renamed a variable and then introduced a new, different variable with the same name as the old one. This change also caused a lot of confusion. Don’t do this.

Incomplete Visibility

Again, the easiest example of this is in MySQL, which has had a SHOW VARIABLES command for many years. Most, but not all, of the server’s command line options had identically named variables visible in the output of this command. But some were missing entirely, and others were present but under names that didn’t match.

Missing KPIs

The list of crucial metrics for finding and diagnosing performance issues isn’t that large. Metrics such as utilization, latency, queue length, and the like can be incredibly valuable, and can be computed from fundamental metrics, if those are available. For an example, see the Linux /proc/diskstats metrics, which include values that you can analyze with queueing theory, as illustrated on Baron’s personal blog. But you’d be surprised how many systems don’t have any way to inspect these key metrics, because people without much knowledge of good monitoring built the systems. For example, PostgreSQL has a standard performance counter for transactions, but not for statements, so if you want to know how many queries (statements) per second your server is handling, you have to resort to much more complex alternatives. This lack of a basic performance metric (throughput) is quite a serious oversight.
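For example, if the optional pg_stat_statements extension is loaded (it is not enabled by default, which is rather the point), statement throughput can be approximated by sampling an aggregate counter and taking deltas:

```sql
-- PostgreSQL: total statement executions tracked by pg_stat_statements;
-- sample periodically and divide the delta by the interval for statements/sec
SELECT sum(calls) AS total_statements FROM pg_stat_statements;
```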

These are just some don’ts for developing and monitoring database applications. Interested in learning some of the do’s? Download the full eBook, Best Practices for Architecting Highly Monitorable Applications.

MySQL Swapping With Fsync

One problem that’s a lot less common these days is swapping. Most of the issues that cause swapping with MySQL have been nailed down to several key configuration points, either in the OS or in MySQL, or to issues like the swap insanity issue documented by Jeremy Cole back in 2010. As such, it’s usually pretty easy to resolve them and keep MySQL out of swap space. Recently, however, we hit an issue where MySQL was still swapping after we had tried all of the usual tricks.

The server with the issue was a VM with a single CPU socket (multiple cores), so we knew it wasn’t NUMA. Swappiness and MySQL were both configured correctly, and the output of free -m showed 4735M of memory available:

[sylvester@host ~]$ free -m
              total        used        free      shared  buff/cache   available
Mem:          16046       10861         242          16        4941        4735
Swap:         15255         821       14434

The point that needs a bit more attention here is the amount of memory being used by the OS cache. As you can see, there is a total of 16046M of physical memory available to the host, with only 10861M in use and the majority of what’s left over being used by the OS cache. This typically isn’t a problem. When requests for more memory come in from threads running in the OS, it should evict pages from the OS cache in order to make memory available to the requesting process. In this case, this did not occur. Instead, we observed that it held onto that cache memory and forced MySQL to turn to swap. But why?

As it turns out, the system in question had recently been converted from MyISAM to InnoDB without any server configuration changes to accommodate this. As such, innodb_flush_method was still at its default value, which in 5.7 is still fsync. Ivan Groenwold and I have both written blog posts about flush methods, and it’s generally accepted that O_DIRECT is a much better way to go in most use cases on Linux, including this one, so we wanted to align the system with best practices before investigating further. As it turns out, we didn’t have to look any further: switching the system to innodb_flush_method = O_DIRECT resolved the issue. It appears that fsync causes the kernel to hang onto its data pages, so when InnoDB attempted to expand its memory footprint, it was unable to do so without going to swap, even with swappiness set to 0 as a test.
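For reference, the fix amounts to a one-line change in my.cnf (a server restart is required, since innodb_flush_method is not a dynamic variable):

```ini
[mysqld]
innodb_flush_method = O_DIRECT
```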

Ever since we did the change to O_DIRECT, the OS cache usage has dropped and there have been no problems with OS cache page eviction.


MySQL swapping can really ruin your day and it’s something you want to avoid if at all possible. We still run into issues with swapping every now and then and want to continue to provide the community with our findings as they become available. So if you have a server that is swapping, and the OS cache isn’t making room for MySQL, and if you’re still using fsync for InnoDB flush, consider switching to O_DIRECT.

JFG Posted on the Percona Community Blog - A Nice Feature in MariaDB 10.3: no InnoDB Buffer Pool in Core Dumps

I just posted an article on the Percona Community Blog. You can access it following this link: A Nice Feature in MariaDB 10.3: no InnoDB Buffer Pool in Core Dumps. I do not know if I will stop publishing posts on my personal blog or use both; I will see how things go... In the rest of this post, I will share why I published there and how things went in the process.

A Nice Feature in MariaDB 10.3: no InnoDB Buffer Pool in Core Dumps

MariaDB 10.3 is now generally available (10.3.7 was released GA on 2018-05-25). The article What’s New in MariaDB Server 10.3 by the MariaDB Corporation lists three key improvements in 10.3: temporal data processing, Oracle compatibility features, and purpose-built storage engines. Even if I am excited about MyRocks and curious about Spider, I am also very interested in less flashy but still very important changes that make running the database in production easier. This post describes one such improvement: no InnoDB Buffer Pool in core dumps.

Hidden in the Compression section of the page Changes & Improvements in MariaDB 10.3 from the Knowledge Base, we can read:

On Linux, shrink the core dumps by omitting the InnoDB buffer pool

This is it, no more details, only a link to MDEV-10814 (Feature request: Optionally exclude large buffers from core dumps). This Jira ticket was opened on 2016-09-15 by a well-known MariaDB Support Engineer: Hartmut Holzgraefe. I know he had been asking for this feature for a long time; this is even mentioned by Hartmut in a GitHub comment.

The ways this feature eases operations with MariaDB are well documented by Hartmut in the description of the Jira ticket:

  • it needs less available disk space to store core dumps,
  • it reduces the time required to write core dumps (and hence restart MySQL after a crash),
  • it improves security by omitting a substantial amount of user data from core dumps.

In addition to that, I would add that smaller core dumps are easier to share in support tickets. I am often asked by support engineers to provide a core dump in relation to a crash, and my reply is “How do you want me to get a 192 GB file to you?” (or even bigger, as I have seen MySQL/MariaDB used on servers with 384 GB of RAM). This often leads to a “Let me think about this and I will come back to you” answer. Omitting the InnoDB Buffer Pool from core dumps makes this much less of an issue for both DBAs and support providers.

Before continuing the discussion on this improvement, I need to give more details about what a core dump is.

What is a Core Dump and Why is it Useful?

By looking at the Linux manual page for core (and core dump file), we can read:

[A core dump is] a disk file containing an image of the process’s memory at the time of termination. This image can be used in a debugger to inspect the state of the program at the time that it terminated.

The Wikipedia article for core dump also tells us that:

  • the core dump includes key pieces of program state as processor registers, memory management details, and other processor and operating system flags and information,
  • the name comes from magnetic core memory, the principal form of random access memory from the 1950s to the 1970s, and the name has remained even if magnetic core technology is obsolete.

So a core dump is a file that can be very useful for understanding the context of a crash. The exact details of how to use a core dump have already been discussed in many places and are beyond the scope of this post.

Now that we know more about core dumps, we can get back to the discussion of the new feature.

The no InnoDB Buffer Pool in Core Dump Feature from MariaDB 10.3

As already pointed out above, there are very few details in the release notes about how this feature works. By digging in MDEV-10814, following pointers to pull requests (#333, #364, #365, …), and reading the commit messages, I was able to gather this:

  • An initial patch was written by Hartmut in 2015.
  • It uses the MADV_DONTDUMP flag of the madvise system call (available in Linux kernel 3.4 and higher).
  • Hartmut’s patch was rebased by Daniel Black, a well-known MariaDB Community Contributor (pull request #333).
  • The first work by Daniel had a configuration parameter to allow including/excluding the InnoDB Buffer Pool in/from core dumps, but after a discussion in pull request #333, it was decided that the RELEASE builds would not put the InnoDB Buffer Pool in core dumps and that DEBUG builds would include it (more about this below).
  • The function buf_madvise_do_dump is added but never invoked by the server; it is there to be called from a debugger to re-enable full core dumping if needed (from this commit message).
  • The InnoDB Redo Log buffer is also excluded from core dumps (from this comment).

I have doubts about the absence of a configuration parameter to control the feature. Even if the InnoDB Buffer Pool (as written above, the feature also covers the InnoDB Redo Log buffer, but I will only mention the InnoDB Buffer Pool in the rest of this post for brevity) is not often required in core dumps, Marko Mäkelä, InnoDB Engineer, mentioned sometimes needing it to investigate deadlocks, corruption, or race conditions. Moreover, I was recently asked in a support ticket to provide a core dump to understand a crash in MariaDB 10.2 (public bug report in MDEV-15608): it looks to me that the InnoDB Buffer Pool would be useful here. Bottom line: having the InnoDB Buffer Pool (and Redo Log buffer) in core dumps might not be regularly useful, but it is sometimes needed.

To include the InnoDB Buffer Pool in core dumps, DBAs can either install DEBUG binaries or use a debugger to call the buf_madvise_do_dump function (nice thinking by Daniel to compensate for the absence of a configuration parameter, but there are caveats described below). Both solutions are suboptimal in my humble opinion. For the debugger option, there are risks and drawbacks to using a debugger on a live production database (when it works… see below for a war story). As for DEBUG binaries, and unless I am mistaken, they are not available from the MariaDB download site. This means they have to be built by engineers of your favorite support provider, or DBAs have to compile them manually: this is a lot of work to expect from either party. I also think the usage of DEBUG binaries in production should be minimized, not encouraged (DEBUG binaries are for developers, not DBAs), so I feel we are heading in the wrong direction. Bottom line: I would not be surprised (and I am not alone) if a parameter were added in a future release to ease investigations of InnoDB bugs.

Out of curiosity, I checked the core dump sizes for some versions of MySQL and MariaDB with dbdeployer (if you have not tried it yet, you should probably spend time learning how to use dbdeployer: it is very useful). Here are my naive first results with default configurations and freshly started instances:

  • 487 MB and 666 MB core dumps with MySQL 5.7.22 and 8.0.11 respectively,
  • 673 MB and 671 MB core dumps with MariaDB 10.2.15 and MariaDB 10.3.7 respectively.

I tried understanding where the inflation comes from in MySQL 8.0.11, but I tripped on Bug#90561, which prevented my investigation. We will have to wait for 8.0.12 to know more…

Back to the feature, I was surprised to see no shrinking between MariaDB 10.2 and 10.3. To make sure something was not wrong, I tried to get the InnoDB Buffer Pool into the core dump by calling the buf_madvise_do_dump function. I used the slides from the gdb tips and tricks for MySQL DBAs talk by Valerii Kravchuk presented at FOSDEM 2015 (I hope a similar talk will be given soon at Percona Live as my gdb skills need a lot of improvement), but I got the following result:

$ gdb -p $(pidof mysqld) -ex "call buf_madvise_do_dump()" -batch
[...]
No symbol "buf_madvise_do_dump" in current context.

After investigation, I understood that the generic MariaDB Linux packages that I used with dbdeployer are compiled without the feature. A reason could be that there is no way to know that those packages will be used on a Linux 3.4+ kernel (without a recent enough kernel, the MADV_DONTDUMP argument does not exist for the madvise system call). To be able to test the feature, I would either have to build my own binaries or try packages for a specific distribution. I chose to avoid compilation, but this was more tedious than I thought…

By the way, maybe the buf_madvise_do_dump function should always be present in binaries, returning a non-zero value on failure with a detailed message in the error logs. This would have spared me the time spent understanding why it did not work in my case. I opened MDEV-16605: Always include buf_madvise_do_dump in binaries for that.

Back to my tests and to see the feature in action, I started a Ubuntu 16.04.4 LTS instance in AWS (it comes with a 4.4 kernel). But again, I could not call buf_madvise_do_dump. After more investigation, I understood that the Ubuntu and Debian packages are not compiled with symbols, so calling buf_madvise_do_dump cannot easily be done on those (I later learned that there are mariadb-server-10.3-dbgsym packages, but I did not test them). I ended up falling back to CentOS 7.5, which comes with a 3.10 kernel, and it worked! Below are the core dump sizes with and without calling buf_madvise_do_dump:

  • 527 MB core dump on MariaDB 10.3.7 (without calling buf_madvise_do_dump),
  • 674 MB core dump on MariaDB 10.3.7 (with calling buf_madvise_do_dump).

I was surprised by bigger core dumps in MariaDB 10.3 than in MySQL 5.7, so I spent some time looking into that. It would have been much easier with the Memory Instrumentation from Performance Schema, but this is not yet available in MariaDB. There is a Jira ticket opened for that (MDEV-16431); if you are also interested in this feature, I suggest you vote for it.

I guessed that the additional RAM used by MariaDB 10.3 (compared to MySQL 5.7) comes from the caches for the MyISAM and Aria storage engines. Those caches, whose sizes are controlled by the key_buffer_size and aria_pagecache_buffer_size parameters, are 128 MB by default in MariaDB 10.3 (more discussion about these sizes below). I tried shrinking both caches to 8 MB (the default value in MySQL since at least 5.5), but I got another surprise:

> SET GLOBAL key_buffer_size = 8388608;
Query OK, 0 rows affected (0.001 sec)

> SET GLOBAL aria_pagecache_buffer_size = 8388608;
ERROR 1238 (HY000): Variable 'aria_pagecache_buffer_size' is a read only variable

The aria_pagecache_buffer_size parameter is not dynamic! This is annoying as I like tuning parameters to be dynamic, so I opened MDEV-16606: Make aria_pagecache_buffer_size dynamic for that. I tested by shrinking the MyISAM cache dynamically and by modifying the startup configuration for Aria. The results for the core dump sizes are the following:

  • 527 MB core dump for the default behavior,
  • 400 MB core dump by shrinking the MyISAM cache from 128 MB to 8 MB,
  • 268 MB core dump by also shrinking the Aria cache from 128 MB to 8 MB.

We are now at a core dump size smaller than MySQL 5.7.22: this is the result I was expecting.

I did some more tests with a larger InnoDB Buffer Pool and with a larger InnoDB Redo Log buffer while keeping MyISAM and Aria cache sizes to 8 MB. Here are the results of the sizes of the compact core dump (default behavior) vs the full core dump (using gdb):

  • 340 MB vs 1.4 GB core dumps when growing the InnoDB Buffer Pool from 128 MB to 1 GB,
  • 357 MB vs 1.7 GB core dumps when also growing the InnoDB Redo Log buffer from 16 MB to 128 MB.

I think the results above show the usefulness of the no InnoDB Buffer Pool in core dump feature.

Potential Improvements of the Shrinking Core Dump Feature

The end goal of excluding the InnoDB Buffer Pool from core dumps is to make generating and working with those files easier. As already mentioned above, the space and time taken to save core dumps are the main obstacles, and sharing them is also an issue (including leaking a lot of user data).

Ideally, I would like to always run MySQL/MariaDB with core dump enabled on crashes (I see one exception when using database-level encryption for not leaking data). I even think this should be the default behavior, but this is another discussion that I will not start here. My main motivation is that if/when MySQL crashes, I want all information needed to understand the crash (and eventually report a bug) without having to change parameters, restart the database, and generate the same crash again. Obviously, this configuration is unsuitable for servers with a lot of RAM and with a large InnoDB Buffer Pool. MariaDB 10.3 makes a big step forward by excluding the InnoDB Buffer Pool (and Redo Log buffer) from core dumps, but what else could be done to achieve the goal of always running MySQL with core dump enabled?
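For the record, enabling core dumps for mysqld involves both a server option and OS-level limits; a minimal sketch (the systemd detail is an assumption about your init system, adjust per distribution):

```ini
[mysqld]
# ask mysqld to write a core file when it crashes
core-file
```

This needs to be combined with an unlimited core file size for the mysqld process (ulimit -c unlimited, or LimitCORE=infinity in a systemd unit) and a kernel.core_pattern that points at a filesystem with enough free space.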

There is a pull request to exclude the query cache from core dumps (also by Daniel Black, thanks for this work). When MariaDB is run with a large query cache (and I know this is unusual, but if you know of a valid real world use case, please add a comment below), excluding it from core dumps is good. But I am not sure this is a generally needed improvement:

It looks like there is a consensus that the query cache is a very niche feature and otherwise should be disabled, so this work might not be the one that will profit most people. Still good to be done though.

I would like similar work to be done on MyISAM, Aria, TokuDB, and MyRocks. As we saw above, there is an opportunity, for default deployments, to remove 256 MB from core dumps by excluding the MyISAM and Aria caches. I think this work is particularly important for those two storage engines as they are loaded by default in MariaDB. By the way, and considering the relatively low usage of the MyISAM and Aria storage engines, maybe the default value for their caches should be lower: I opened MDEV-16607: Consider smaller defaults for MyISAM and Aria cache sizes for that.

I cannot think of any other large memory buffers that I would like to exclude from core dumps. If you think about one, please add a comment below.

Finally, I would like the shrinking core dump feature to also appear in Oracle MySQL and Percona Server, so I opened Bug#91455: Implement core dump size reduction for that. As an anecdote, I was recently working on a Percona Server crash in production, and we were reluctant to enable core dumps because of the additional minutes of downtime needed to write the file to disk. In this case, the no InnoDB Buffer Pool in core dump feature would have been very useful!

The post A Nice Feature in MariaDB 10.3: no InnoDB Buffer Pool in Core Dumps appeared first on Percona Community Blog.

What To Do When MySQL Runs Out of Memory: Troubleshooting Guide

Troubleshooting crashes is never a fun task, especially if MySQL does not report the cause of the crash, for example when MySQL runs out of memory. Peter Zaitsev wrote a blog post in 2012, Troubleshooting MySQL Memory Usage, with lots of useful tips. With the newer versions of MySQL (5.7+) and performance_schema, we can troubleshoot MySQL memory allocation much more easily.

In this blog post I will show you how to use it.

First of all, there are 3 major cases when MySQL will crash due to running out of memory:

  1. MySQL tries to allocate more memory than is available because we specifically told it to do so. For example: you did not set innodb_buffer_pool_size correctly. This is very easy to fix.
  2. There is some other process(es) on the server that allocates RAM. It can be the application (java, python, php), web server or even the backup (i.e. mysqldump). When the source of the problem is identified, it is straightforward to fix.
  3. Memory leaks in MySQL. This is a worst case scenario, and we need to troubleshoot.
Where to start troubleshooting MySQL memory leaks

Here is what we can start with (assuming it is a Linux server):

Part 1: Linux OS and config check
  1. Identify the crash by checking the MySQL error log and the Linux log file (i.e. /var/log/messages or /var/log/syslog). You may see an entry saying that the OOM Killer killed MySQL. Whenever MySQL has been killed by the OOM killer, “dmesg” also shows details about the circumstances surrounding it.
  2. Check the available RAM:
    • free -g
    • cat /proc/meminfo
  3. Check what applications are using RAM: “top” or “htop” (see the resident vs virtual memory)
  4. Check the MySQL configuration: check /etc/my.cnf, or in general /etc/my* (including /etc/mysql/* and other files). MySQL may be running with a different my.cnf (run ps ax | grep mysql)
  5. Run vmstat 5 5 to see if the system is reading/writing via virtual memory and if it is swapping
  6. For non-production environments we can use other tools (like Valgrind, gdb, etc) to examine MySQL usage
Part 2:  Checks inside MySQL

Now we can check things inside MySQL to look for potential MySQL memory leaks.

MySQL allocates memory in tons of places. Especially:

  • Table cache
  • Performance_schema (run: show engine performance_schema status  and look at the last line). That may be the cause on systems with a small amount of RAM, i.e. 1G or less
  • InnoDB (run show engine innodb status  and check the buffer pool section, memory allocated for buffer_pool and related caches)
  • Temporary tables in RAM (find all in-memory tables by running: select * from information_schema.tables where engine='MEMORY' )
  • Prepared statements, when they are not deallocated (compare the number of prepared statements to the number of deallocations by running show global status like 'Com_prepare_sql'; show global status like 'Com_dealloc_sql' )

The good news is: starting with MySQL 5.7 we have memory allocation instrumentation in performance_schema. Here is how we can use it:

  1. First, we need to enable collecting memory metrics. Run:
    UPDATE performance_schema.setup_instruments SET ENABLED = 'YES' WHERE NAME LIKE 'memory/%';
  2. Run the report from sys schema:
    select event_name, current_alloc, high_alloc from sys.memory_global_by_current_bytes where current_count > 0;
  3. Usually this will show you the place in the code where memory is allocated. It is usually self-explanatory. In some cases we can search for known bugs, or we might need to check the MySQL source code.

For example, for a bug where memory was over-allocated in triggers, the select shows:

mysql> select event_name, current_alloc, high_alloc from memory_global_by_current_bytes where current_count > 0;
+--------------------------------------------------------------------------------+---------------+-------------+
| event_name                                                                     | current_alloc | high_alloc  |
+--------------------------------------------------------------------------------+---------------+-------------+
| memory/innodb/buf_buf_pool                                                     | 7.29 GiB      | 7.29 GiB    |
| memory/sql/sp_head::main_mem_root                                              | 3.21 GiB      | 3.62 GiB    |
...

The largest chunk of RAM is usually the buffer pool but ~3G in stored procedures seems to be too high.

According to the MySQL source code documentation, sp_head represents one instance of a stored program which might be of any type (stored procedure, function, trigger, event). In the above case we have a potential memory leak.

In addition, we can get a total for each higher-level event if we want a bird's-eye view of what is eating memory:

mysql> select substring_index(
    ->   substring_index(event_name, '/', 2),
    ->   '/',
    ->   -1
    -> ) as event_type,
    -> round(sum(CURRENT_NUMBER_OF_BYTES_USED)/1024/1024, 2) as MB_CURRENTLY_USED
    -> from performance_schema.memory_summary_global_by_event_name
    -> group by event_type
    -> having MB_CURRENTLY_USED > 0;
+--------------------+-------------------+
| event_type        | MB_CURRENTLY_USED |
+--------------------+-------------------+
| innodb            | 0.61              |
| memory            | 0.21              |
| performance_schema | 106.26            |
| sql               | 0.79              |
+--------------------+-------------------+
4 rows in set (0.00 sec)

I hope those simple steps can help troubleshoot MySQL crashes due to running out of memory.


The post What To Do When MySQL Runs Out of Memory: Troubleshooting Guide appeared first on Percona Database Performance Blog.

How to Improve Performance of Galera Cluster for MySQL or MariaDB

Galera Cluster comes with many notable features that are not available in standard MySQL replication (or Group Replication): automatic node provisioning, true multi-master with conflict resolution, and automatic failover. There are also a number of limitations that could potentially impact cluster performance. Luckily, if you are aware of these, there are workarounds. And if you do it right, you can minimize the impact of these limitations and improve overall performance.

We have previously covered many tips and tricks related to Galera Cluster, including running Galera on AWS Cloud. This blog post distinctly dives into the performance aspects, with examples on how to get the most out of Galera.

Replication Payload

A bit of introduction - Galera replicates writesets during the commit stage, transferring writesets from the originator node to the receiver nodes synchronously through the wsrep replication plugin. This plugin will also certify writesets on the receiver nodes. If the certification process passes, it returns OK to the client on the originator node, and the writeset will be applied on the receiver nodes asynchronously at a later time. Otherwise, the transaction is rolled back on the originator node (returning an error to the client) and the writesets that have been transferred to the receiver nodes are discarded.

A writeset consists of the write operations inside a transaction that change the database state. In Galera Cluster, autocommit defaults to 1 (enabled). Effectively, any SQL statement executed in Galera Cluster is enclosed in a transaction, unless you explicitly start one with BEGIN, START TRANSACTION or SET autocommit=0. The following diagram illustrates the encapsulation of a single DML statement into a writeset:

For DML (INSERT, UPDATE, DELETE..), the writeset payload consists of the binary log events for a particular transaction, while for DDLs (ALTER, GRANT, CREATE..), the writeset payload is the DDL statement itself. For DMLs, the writeset has to be certified against conflicts on the receiver node, while for DDLs (depending on wsrep_osu_method, which defaults to TOI), the cluster runs the DDL statement on all nodes in the same total order sequence, blocking other transactions from committing while the DDL is in progress (see also RSU). In short, Galera Cluster handles DDL and DML replication differently.

Round Trip Time

Generally, the following factors determine how fast Galera can replicate a writeset from an originator node to all receiver nodes:

  • Round trip time (RTT) to the farthest node in the cluster from the originator node.
  • The size of a writeset to be transferred and certified for conflict on the receiver node.

For example, if we have a three-node Galera Cluster and one of the nodes is 10 milliseconds (0.01 second) away, it's very unlikely you will be able to write to the same row more than 100 times per second without conflicts. There is a popular quote from Mark Callaghan which describes this behaviour pretty well:

"[In a Galera cluster] a given row can’t be modified more than once per RTT"

To measure RTT value, simply perform ping on the originator node to the farthest node in the cluster:

$ ping # the farthest node

Wait for a couple of seconds (or minutes) and terminate the command. The last line of the ping statistic section is what we are looking for:

--- ping statistics ---
65 packets transmitted, 65 received, 0% packet loss, time 64019ms
rtt min/avg/max/mdev = 0.111/0.431/1.340/0.240 ms

The max value is 1.340 ms (0.00134 s) and we should take this value when estimating the minimum transactions per second (tps) for this cluster. The average value is 0.431 ms (0.000431 s), which we can use to estimate the average tps, while the min value of 0.111 ms (0.000111 s) can be used to estimate the maximum tps. The mdev value describes how the RTT samples were distributed around the average; a lower value means a more stable RTT.

Hence, transactions per second can be estimated by dividing 1 second by the RTT (in seconds):


  • Minimum tps: 1 / 0.00134 (max RTT) = 746.26 ~ 746 tps
  • Average tps: 1 / 0.000431 (avg RTT) = 2320.19 ~ 2320 tps
  • Maximum tps: 1 / 0.000111 (min RTT) = 9009.01 ~ 9009 tps
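The arithmetic above can be sketched as a small helper. This is a minimal Python sketch (not from the original post) using the RTT figures from the ping output:

```python
# Estimate per-row transactions-per-second bounds from ping RTT figures.
# RTT values (in milliseconds) are the ones from the ping statistics above.
rtt_ms = {"min": 0.111, "avg": 0.431, "max": 1.340}

def tps_from_rtt(rtt_milliseconds):
    """One conflicting write per round trip: tps = 1 second / RTT (in seconds)."""
    return 1.0 / (rtt_milliseconds / 1000.0)

minimum_tps = tps_from_rtt(rtt_ms["max"])   # worst case uses the max RTT
average_tps = tps_from_rtt(rtt_ms["avg"])
maximum_tps = tps_from_rtt(rtt_ms["min"])   # best case uses the min RTT

print(round(minimum_tps))   # 746
print(round(average_tps))   # 2320
print(round(maximum_tps))   # 9009
```

The same division works for any cluster: plug in your own ping statistics to get a first-order ceiling on single-row write throughput.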

Note that this is just an estimate of anticipated replication performance. There is not much we can do to improve this on the database side once everything is deployed and running, except moving or migrating the database servers closer to each other to improve the RTT between nodes, or upgrading the network peripherals or infrastructure. This would require a maintenance window and proper planning.

Chunk Up Big Transactions

Another factor is the transaction size. After the writeset is transferred, there will be a certification process. Certification is a process to determine whether or not the node can apply the writeset. Galera generates MD5 checksum pseudo keys from every full row. The cost of certification depends on the size of the writeset, which translates into a number of unique key lookups into the certification index (a hash table). If you update 500,000 rows in a single transaction, for example:

# a 500,000 row table
mysql> UPDATE mydb.settings SET success = 1;

The above will generate a single writeset with 500,000 binary log events in it. Provided this huge writeset does not exceed wsrep_max_ws_size (which defaults to 2GB), it will be transferred by the Galera replication plugin to all nodes in the cluster, and these 500,000 rows will be certified on the receiver nodes against any conflicting transactions that are still in the slave queue. Finally, the certification status is returned to the replication plugin. The bigger the transaction, the higher the risk it will conflict with other transactions coming from another master. Conflicting transactions waste server resources, and cause a huge rollback on the originator node. Note that a rollback operation in MySQL is far slower and less optimized than a commit operation.

The above SQL statement can be re-written into a more Galera-friendly statement with the help of simple loop, like the example below:

(bash)$ for i in {1..500}; do \
  mysql -uuser -ppassword -e "UPDATE mydb.settings SET success = 1 WHERE success != 1 LIMIT 1000"; \
  sleep 2; \
done

The above shell command would update 1000 rows per transaction, 500 times, waiting 2 seconds between executions. You could also use a stored procedure or other means to achieve a similar result. If rewriting the SQL query is not an option, simply instruct the application to execute the big transaction during a maintenance window to reduce the risk of conflicts.
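The same chunking idea can also live in application code. Here is a minimal Python sketch (an illustration, not the post's code); the `run_update` callable is a stand-in for whatever client library you use, assumed to return the number of affected rows as MySQL clients typically do:

```python
import time

def chunked_update(run_update, chunk_size=1000, pause_seconds=2):
    """Repeat a LIMIT-ed UPDATE until no rows are left to change.

    run_update(sql) is assumed to execute the statement and return
    the number of affected rows.
    """
    sql = ("UPDATE mydb.settings SET success = 1 "
           "WHERE success != 1 LIMIT {}".format(chunk_size))
    total = 0
    while True:
        affected = run_update(sql)
        total += affected
        if affected < chunk_size:      # last, partial chunk: we are done
            return total
        time.sleep(pause_seconds)      # let other transactions through

# Example with a fake executor simulating a 2,500-row backlog:
backlog = [2500]
def fake_run_update(sql):
    done = min(backlog[0], 1000)
    backlog[0] -= done
    return done

print(chunked_update(fake_run_update, pause_seconds=0))  # 2500
```

The `WHERE success != 1` predicate makes each chunk pick up only unprocessed rows, so the loop terminates naturally once a chunk comes back smaller than the limit.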

For huge deletes, consider using pt-archiver from the Percona Toolkit - a low-impact, forward-only job to nibble old data out of the table without impacting OLTP queries much.

Parallel Slave Threads

In Galera, the applier is a multithreaded process. An applier is a thread running within Galera that applies incoming writesets from another node. This means it is possible for all receivers to execute multiple DML operations coming from the originator (master) node simultaneously. Galera applies parallel replication only to transactions where it is safe to do so. It improves the probability of the node keeping up with the originator node. However, replication speed is still limited by RTT and writeset size.

To get the best out of this, we need to know two things:

  • The number of cores the server has.
  • The value of wsrep_cert_deps_distance status.

The wsrep_cert_deps_distance status tells us the potential degree of parallelization. It is the average distance between the highest and lowest seqno values that can possibly be applied in parallel, and you can use it to determine the maximum number of slave threads worth configuring. Take note that this is an average value over time; hence, in order to get a good value, you have to hit the cluster with write operations through a test workload or benchmark until you see a stable value coming out.

To get the number of cores, you can simply use the following command:

$ grep -c processor /proc/cpuinfo
4

Ideally, 2, 3 or 4 threads of slave applier per CPU core is a good start. Thus, the minimum value for the slave threads should be 4 x number of CPU cores, and must not exceed the wsrep_cert_deps_distance value:

MariaDB [(none)]> SHOW STATUS LIKE 'wsrep_cert_deps_distance';
+--------------------------+----------+
| Variable_name            | Value    |
+--------------------------+----------+
| wsrep_cert_deps_distance | 48.16667 |
+--------------------------+----------+
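The sizing rule just described can be condensed into a small sketch (mine, not the post's; the helper name is illustrative): start at a few threads per core, but cap it at the observed wsrep_cert_deps_distance, since extra applier threads beyond that would simply idle.

```python
def suggest_slave_threads(cpu_cores, cert_deps_distance, threads_per_core=4):
    """Candidate for wsrep_slave_threads: threads_per_core x cores,
    capped at the observed wsrep_cert_deps_distance."""
    candidate = threads_per_core * cpu_cores
    return min(candidate, int(cert_deps_distance))

# With the 4 cores and wsrep_cert_deps_distance = 48.16667 shown above:
print(suggest_slave_threads(4, 48.16667))  # 16
```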

You can control the number of slave applier threads with the wsrep_slave_threads variable. Even though this is a dynamic variable, only increasing the number has an immediate effect. If you reduce the value dynamically, it takes some time until an applier thread exits after it finishes applying. A recommended value is anywhere between 16 and 48:

mysql> SET GLOBAL wsrep_slave_threads = 48;

Take note that in order for parallel slave threads to work, the following must be set (which is usually pre-configured for Galera Cluster):

innodb_autoinc_lock_mode=2

Galera Cache (gcache)

Galera uses a preallocated file with a specific size called gcache, where a Galera node keeps a copy of writesets in circular buffer style. By default, its size is 128MB, which is rather small. Incremental State Transfer (IST) is a method to prepare a joiner by sending only the missing writesets available in the donor’s gcache. IST is faster than state snapshot transfer (SST), it is non-blocking and has no significant performance impact on the donor. It should be the preferred option whenever possible.

IST can only be achieved if all changes missed by the joiner are still in the gcache file of the donor. The safest setting is a gcache as big as the whole MySQL dataset. If disk space is limited or costly, determining the right gcache size is crucial, as it can influence the data synchronization performance between Galera nodes.

The statement below gives us an idea of the amount of data replicated by Galera. Run it on one of the Galera nodes during peak hours (tested on MariaDB >10.0 and PXC >5.6, Galera >3.x):

mysql> SET @start := (SELECT SUM(VARIABLE_VALUE/1024/1024) FROM information_schema.global_status WHERE VARIABLE_NAME LIKE 'WSREP%bytes');
do sleep(60);
SET @end := (SELECT SUM(VARIABLE_VALUE/1024/1024) FROM information_schema.global_status WHERE VARIABLE_NAME LIKE 'WSREP%bytes');
SET @gcache := (SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(@@GLOBAL.wsrep_provider_options,'gcache.size = ',-1), 'M', 1));
SELECT ROUND((@end - @start),2) AS `MB/min`,
       ROUND((@end - @start),2) * 60 as `MB/hour`,
       @gcache as `gcache Size(MB)`,
       ROUND(@gcache/round((@end - @start),2),2) as `Time to full(minutes)`;
+--------+---------+-----------------+-----------------------+
| MB/min | MB/hour | gcache Size(MB) | Time to full(minutes) |
+--------+---------+-----------------+-----------------------+
|   7.95 |  477.00 | 128             |                 16.10 |
+--------+---------+-----------------+-----------------------+

We can estimate that the Galera node can tolerate approximately 16 minutes of downtime without requiring SST to rejoin (unless Galera cannot determine the joiner state). If this is too short a time and you have enough disk space on your nodes, you can change wsrep_provider_options="gcache.size=<value>" to a more appropriate value. In this example workload, setting gcache.size=1G allows us to tolerate 2 hours of node downtime with a high probability of IST when the node rejoins.
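The reverse calculation is handy too: given the measured replication rate, work out how big the gcache must be for a target downtime window. A quick sketch (mine, using the 7.95 MB/min figure measured above):

```python
def gcache_mb_for_downtime(mb_per_minute, downtime_minutes):
    """The gcache must hold every writeset produced while the node is away."""
    return mb_per_minute * downtime_minutes

# 7.95 MB/min measured above; target two hours of tolerated downtime:
needed = gcache_mb_for_downtime(7.95, 120)
print(round(needed))  # 954 -> a 1G gcache leaves some headroom
```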

It's also recommended to use gcache.recover=yes in wsrep_provider_options (Galera >3.19), so that Galera will attempt to recover the gcache file to a usable state on startup rather than delete it, thus preserving the ability to use IST and avoiding SST as much as possible. Codership and Percona have covered this in detail in their blogs. IST is always the best method to sync up after a node rejoins the cluster. It is 50% faster than xtrabackup or mariabackup and 5x faster than mysqldump.

Asynchronous Slave

Galera nodes are tightly coupled, so replication performance is only as fast as the slowest node. Galera uses a flow control mechanism to control the replication flow among members and eliminate slave lag. Replication is all fast or all slow on every node, and is adjusted automatically by Galera. If you want to know more about flow control, read this blog post by Jay Janssen from Percona.

In most cases, heavy operations like long-running analytics (read-intensive) and backups (read-intensive, locking) are often inevitable, and could potentially degrade cluster performance. The best way to execute this type of query is to send it to a loosely-coupled replica server, for instance, an asynchronous slave.

An asynchronous slave replicates from a Galera node using the standard MySQL asynchronous replication protocol. There is no limit on the number of slaves that can be connected to one Galera node, and chaining it out with an intermediate master is also possible. MySQL operations that execute on this server won't impact the cluster performance, apart from the initial syncing phase where a full backup must be taken on the Galera node to stage the slave before establishing the replication link (although ClusterControl allows you to build the async slave from an existing backup first, before connecting it to the cluster).

GTID (Global Transaction Identifier) provides better transaction mapping across nodes, and is supported in MySQL 5.6 and MariaDB 10.0. With GTID, failing a slave over to another master (another Galera node) is simplified, without the need to figure out the exact log file and position. Galera also comes with its own GTID implementation, but the two are independent of each other.

Scaling out an asynchronous slave is one-click away if you are using ClusterControl -> Add Replication Slave feature:

Take note that binary logs must be enabled on the master (the chosen Galera node) before we can proceed with this setup. We have also covered the manual way in this previous post.

The following screenshot from ClusterControl shows the cluster topology, it illustrates our Galera Cluster architecture with an asynchronous slave:

ClusterControl automatically discovers the topology and generates the super cool diagram like above. You can also perform administration tasks directly from this page by clicking on the top-right gear icon of each box.

SQL-aware Reverse Proxy

ProxySQL and MariaDB MaxScale are intelligent reverse proxies which understand the MySQL protocol and are capable of acting as a gateway, router, load balancer and firewall in front of your Galera nodes. With the help of a virtual IP address provider like LVS or Keepalived, combined with Galera's multi-master replication technology, we can have a highly available database service, eliminating all possible single points of failure (SPOF) from the application point of view. This will surely improve the availability and reliability of the architecture as a whole.

Another advantage of this approach is the ability to monitor, rewrite or re-route incoming SQL queries based on a set of rules before they hit the actual database server, minimizing changes on the application or client side and routing queries to a more suitable node for optimal performance. Queries that are risky for Galera, like LOCK TABLES and FLUSH TABLES WITH READ LOCK, can be blocked before they cause havoc on the system, while impacting queries like "hotspot" queries (a row that different queries want to access at the same time) can be rewritten or redirected to a single Galera node to reduce the risk of transaction conflicts. Heavy read-only queries like OLAP or backups can be routed to an asynchronous slave, if you have one.

A reverse proxy also monitors the database state, queries and variables to understand topology changes and produce accurate routing decisions to the backend servers. Indirectly, this centralizes node monitoring and the cluster overview, without the need to check each and every Galera node regularly. The following screenshot shows the ProxySQL monitoring dashboard in ClusterControl:

There are also many other benefits that a load balancer can bring to improve Galera Cluster significantly, as covered in detail in this blog post, Become a ClusterControl DBA: Making your DB components HA via Load Balancers.

Final Thoughts

With a good understanding of how Galera Cluster works internally, we can work around some of its limitations and improve the database service. Happy clustering!

Tags:  MySQL MariaDB galera pxc performance

Scale-with-Maxscale-part5 (Multi-Master)

This is the 5th blog in our MaxScale series. Below is the list of our previous blogs, which provide deep insights into MaxScale and its use cases for different architectures.

Here we are going to discuss using MaxScale for a Multi-Master environment (M-M), in both Active-Passive and Active-Active modes.

Test Environment:

Below are the details of the environment used for testing:

OS               : Debian 8 (Jessie)
MySQL Version    : 5.7.21-20-log Percona Server (GPL)
MaxScale version : maxscale-1.4.5-1.debian.jessie.x86_64 (GPL)
Master 1         :
Master 2         :


Master-Master Replication


Setting up master-master replication is beyond the scope of this exercise; I will jump directly to the configuration of MaxScale for a Multi-Master setup.

Monitor Module for M-M

MaxScale comes with a special monitor module named “mmmon”. It monitors the health of the servers and sets the status flags based on which the router module (read-write splitter) routes connections.


Below is a basic configuration for Multi-Master using MaxScale.

[Splitter Service]
type=service
router=readwritesplit
servers=master1,master2
user=maxscale
passwd=B87D86D1025669B26FA53505F848EC9B

[Splitter Listener]
type=listener
service=Splitter Service
protocol=MySQLClient
port=3306
socket=/tmp/ClusterMaster

[Replication Monitor]
type=monitor
module=mmmon
servers=master1,master2
user=maxscale
passwd=B87D86D1025669B26FA53505F848EC9B
monitor_interval=200
detect_stale_master=1
detect_replication_lag=true

[master1]
type=server
address=
port=3306
protocol=MySQLBackend

[master2]
type=server
address=
port=3306
protocol=MySQLBackend

Active-Active Setup:

An Active-Active setup is one where reads & writes are fully balanced between the servers. With this setup, I would strongly recommend setting auto_increment_increment & auto_increment_offset to avoid conflicting writes.
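For illustration, a hypothetical my.cnf fragment for a two-writer setup might look like this (values are assumptions; the increment should equal your writer count, and the offset must be unique per node):

```ini
# On master1:
[mysqld]
auto_increment_increment = 2   # step = number of writing masters
auto_increment_offset    = 1   # unique per node

# master2 would use the same increment with auto_increment_offset = 2,
# so the two nodes generate interleaved, never-colliding key values.
```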

Active-Active Setup with Maxscale

Below is how it looks from Maxscale

MaxScale> list servers
Servers.
-------------------+-----------------+-------+-------------+--------------------
Server             | Address         | Port  | Connections | Status
-------------------+-----------------+-------+-------------+--------------------
master1            |                 |  3306 |           0 | Master, Running
master2            |                 |  3306 |           0 | Master, Running
-------------------+-----------------+-------+-------------+--------------------

Active Passive Setup:

An Active-Passive setup is one where writes happen on one of the nodes and reads are distributed among the servers. To achieve this, just enable “read_only=1” on one of the nodes. MaxScale identifies this flag and starts routing only read connections to that node.

The next question which arises immediately is: what happens when there is a writer (Active) node failure?

The answer is pretty simple: just disable read-only on the passive node. You can do it manually by logging into the node, or automate it with MaxScale by integrating it with a fail-over script, which will be called during unplanned or planned maintenance.


Active-Passive setup With Maxscale


I have just enabled read_only on Master2; you can see its status changed to ‘Slave’, as below.

MaxScale> list servers
Servers.
-------------------+-----------------+-------+-------------+--------------------
Server             | Address         | Port  | Connections | Status
-------------------+-----------------+-------+-------------+--------------------
master1            |                 |  3306 |           0 | Master, Running
master2            |                 |  3306 |           0 | Slave, Running
-------------------+-----------------+-------+-------------+--------------------

By setting up the above Multi-Master configuration, we have ensured that we have one more DB node for fail-over. This leaves MaxScale as a single point of failure for the application. We can build an HA setup for MaxScale using Keepalived, with an IP switching between the nodes, or, if you are using AWS, you can go with a Network ELB on TCP ports, balancing connections between the MaxScale nodes.

Image Courtesy : Photo by Vincent van Zalinge on Unsplash


Percona Monitoring and Management 1.12.0 Is Now Available

PMM (Percona Monitoring and Management) is a free and open-source platform for managing and monitoring MySQL and MongoDB performance. You can run PMM in your own environment for maximum security and reliability. It provides thorough time-based analysis for MySQL and MongoDB servers to ensure that your data works as efficiently as possible.

In release 1.12, we invested our efforts in the following areas:

  • Visual Explain in Query Analytics – Gain insight into MySQL’s query optimizer for your queries
  • New Dashboard – InnoDB Compression Metrics – Evaluate effectiveness of InnoDB Compression
  • New Dashboard – MySQL Command/Handler Compare – Contrast MySQL instances side by side
  • Updated Grafana to 5.1 – Fixed scrolling issues

We addressed 10 new features and improvements, and fixed 13 bugs.

Visual Explain in Query Analytics

We’re working on substantial changes to Query Analytics and the first part to roll out is something that users of Percona Toolkit may recognize – we’ve introduced a new element called Visual Explain based on pt-visual-explain.  This functionality transforms MySQL EXPLAIN output into a left-deep tree representation of a query plan, in order to mimic how the plan is represented inside MySQL.  This is of primary benefit when investigating tables that are joined in some logical way so that you can understand in what order the loops are executed by the MySQL query optimizer. In this example we are demonstrating the output of a single table lookup vs two table join:

Single Table Lookup:

SELECT DISTINCT c
FROM sbtest13
AND 49907

Two Tables via INNER JOIN:

SELECT sbtest3.c
FROM sbtest1
INNER JOIN sbtest3
ON =
WHERE sbtest3.c ='long-string';

InnoDB Compression Metrics Dashboard

A great feature of MySQL's InnoDB storage engine is compression of data that is transparently handled by the database, saving you space on disk while reducing the amount of I/O to disk, as fewer disk blocks are required to store the same amount of data, thus allowing you to reduce your storage costs.  We've deployed a new dashboard that helps you understand the most important characteristics of InnoDB compression.  Here's a sample visualizing compression and decompression attempts, alongside the overall Compression Success Ratio graph:
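As a reminder of what this dashboard measures: InnoDB compression is enabled per table. A hedged SQL sketch (table and column names are illustrative; requires innodb_file_per_table=ON):

```sql
-- Store this table's pages in compressed 8K blocks instead of the
-- default 16K pages; InnoDB compresses/decompresses transparently.
CREATE TABLE compressed_logs (
  id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  payload TEXT,
  PRIMARY KEY (id)
) ENGINE=InnoDB
  ROW_FORMAT=COMPRESSED
  KEY_BLOCK_SIZE=8;
```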


MySQL Command/Handler Compare Dashboard

We have introduced a new dashboard that lets you do side-by-side comparison of Command (Com_*) and Handler statistics.  A common use case would be to compare servers that share a similar workload, for example across MySQL instances in a pool of replicated slaves.  In this example I am comparing two servers under identical sysbench load, but exhibiting slightly different performance characteristics:

The number of servers you can select for comparison is unbounded, but depending on your screen resolution you might want to limit it to 3 at a time for a 1080p screen.

New Features & Improvements
  • PMM-2519: Display Visual Explain in Query Analytics
  • PMM-2019: Add new Dashboard InnoDB Compression metrics
  • PMM-2154: Add new Dashboard Compare Commands and Handler statistics
  • PMM-2530: Add timeout flags to mongodb_exporter (thank you unguiculus for your contribution!)
  • PMM-2569: Update the MySQL Golang driver for MySQL 8 compatibility
  • PMM-2561: Update to Grafana 5.1.3
  • PMM-2465: Improve pmm-admin debug output
  • PMM-2520: Explain Missing Charts from MySQL Dashboards
  • PMM-2119: Improve Query Analytics messaging when Host = All is passed
  • PMM-1956: Implement connection checking in mongodb_exporter
Bug Fixes
  • PMM-1704: Unable to connect to AtlasDB MongoDB
  • PMM-1950: pmm-admin (mongodb:metrics) doesn’t work well with SSL secured mongodb server
  • PMM-2134: rds_exporter exports memory in Kb with node_exporter labels which are in bytes
  • PMM-2157: Cannot connect to MongoDB using URI style
  • PMM-2175: Grafana singlestat doesn’t use consistent colour when unit is of type Time
  • PMM-2474: Data resolution on Dashboards became 15sec interval instead of 1sec
  • PMM-2581: Improve Travis CI tests by addressing pmm-admin check-network Time Drift
  • PMM-2582: Unable to scroll on “_PMM Add Instance” page when many RDS instances exist in an AWS account
  • PMM-2596: Set fixed height for panel content in PMM Add Instances
  • PMM-2600: InnoDB Checkpoint Age does not show data for MySQL
  • PMM-2620: Fix balancerIsEnabled & balancerChunksBalanced values
  • PMM-2634: pmm-admin cannot create user for MySQL 8
  • PMM-2635: Improve error message while adding metrics beyond “exit status 1”
Known Issues
  • PMM-2639: mysql:metrics does not work on Ubuntu 18.04 – We will address this in a subsequent release
How to get PMM Server

PMM is available for installation using three methods:

The post Percona Monitoring and Management 1.12.0 Is Now Available appeared first on Percona Database Performance Blog.

Presentation : Customer Experience on InnoDB Cluster


At Mydbops we have consulted on many large-scale MySQL deployments. This presentation is about one of our customers, one of the largest retailers in North America, and their data migration to InnoDB Cluster (MySQL) from an enterprise database.

Shinguz: FromDual Backup and Recovery Manager for MariaDB and MySQL 2.0.0 has been released

FromDual has the pleasure to announce the release of the new version 2.0.0 of its popular Backup and Recovery Manager for MariaDB and MySQL (brman).

The new FromDual Backup and Recovery Manager can be downloaded from here. How to install and use the Backup and Recovery Manager is described in the FromDual Backup and Recovery Manager (brman) installation guide.

In the inconceivable case that you find a bug in the FromDual Backup and Recovery Manager please report it to the FromDual Bugtracker or just send us an email.

Any feedback, statements and testimonials are welcome as well! Please send them to

Upgrade from 1.2.x to 2.0.0

brman 2.0.0 requires a new PHP package for ssh connections.

shell> sudo apt-get install php-ssh2
shell> cd ${HOME}/product
shell> tar xf /download/brman-2.0.0.tar.gz
shell> rm -f brman
shell> ln -s brman-2.0.0 brman
Changes in FromDual Backup and Recovery Manager 2.0.0

This release starts a new major release series. It contains a lot of new features. We have tried to maintain backward compatibility with the 1.2 release series, but you should still test the new release thoroughly!

You can verify your current FromDual Backup Manager version with the following command:

shell> fromdual_bman --version
shell> bman --version
FromDual Backup Manager
  • brman was made ready for MariaDB 10.3
  • brman was made ready for MySQL 8.0
  • DEB and RPM packages are prepared.
  • fromdual_brman was renamed to brman (because of DEB).
  • mariabackup support was added.
  • Timestamp for physical backup and compression added to log file.
  • Option --no-purge added to not purge binary logs during binlog backup.
  • A failed archive command in physical full backup does not abort backup loop any more.
  • Return code detection for different physical backup tools improved.
  • Bug in fetching binlog file and position from xtrabackup_binlog_info fixed.
  • Version made MyEnv compliant.
  • Errors and warnings are written to STDERR.
  • General tablespace check implemented, to work around a bug in mysqldump. This only affects MySQL 5.7 and 8.0; MariaDB up to 10.3 has not implemented this feature yet.
  • Warning messages improved.
  • Option --quick added for logical (mysqldump) backup to speed up backup.
  • On schema backups FLUSH BINARY LOGS is executed only once when --per-schema backup is used.
  • The database user root should not be used for backups any more. User brman is suggested.
  • Option --pass-through is implemented to pass options like --ignore-table through to the backend backup tool (mysqldump, mariabackup, xtrabackup, mysqlbackup).
  • bman can report to fpmmm/Zabbix now.
  • Check for binary logging made less intrusive.
  • All return codes (rc) now follow a new scheme. This means errors do not necessarily have the same error codes as in previous brman versions.
  • If the RELOAD privilege is missing, the --master-data and/or --flush-logs options are omitted. This makes bman backups possible in some shared hosting and cloud environments.
  • Schema backup does not require the SHOW DATABASES privilege any more. This makes it possible to use bman in shared hosting and cloud environments.
  • Info messages made nicer with empty lines.
  • Option --archivedir is replaced by --archivedestination.
  • Remote copy of backup via rsync, scp and sftp is possible.
  • Connect string was shown incorrectly in the log file.
  • Connect string of target and catalog made URI conform.
  • bman supports now mariabackup, xtrabackup and mysqlbackup properly (recent releases).
FromDual Backup Manager Catalog
  • Catalog write is done if physical backup hits an error in archiving.
  • Renamed catalog to brman_catalog.

For subscriptions of commercial use of brman please get in contact with us.


Webinar 6/28: Securing Database Servers From External Attacks

Please join Percona’s Chief Evangelist Colin Charles on Thursday, June 28th, 2018, as he presents Securing Database Servers From External attacks at 7:00 AM PDT (UTC-7) / 10:00 AM EDT (UTC-4).

Register Now


A critical piece of your infrastructure is the database tier, yet people don’t pay enough attention to it, judging by how many are bitten via poorly chosen defaults or just a lack of understanding of how to run a secure database tier. In this talk, I’ll focus on MySQL/MariaDB, PostgreSQL, and MongoDB, and cover external authentication, auditing, encryption, SSL, firewalls, replication, and more gems from over a decade of consulting in this space for Percona’s 4,000+ customers.



Colin Charles Chief Evangelist

Colin Charles is the Chief Evangelist at Percona. He was previously on the founding team of MariaDB Server in 2009, had worked at MySQL since 2005, and had been a MySQL user since 2000. Before joining MySQL, he worked actively on the Fedora and projects. He’s well known within open source communities in APAC and has spoken at many conferences. An experienced technologist, well known in the open source world for work that spans nearly two decades within the community, he pays attention to emerging technologies from an integration standpoint and is a prolific speaker at industry-wide conferences, delivering talks and tutorials with ease. Interests: application development, systems administration, database development, migration, Web-based technologies. Considered an expert in Linux and Mac OS X usage, administration and roll-outs. Specialties: MariaDB, MySQL, Linux, Open Source, Community, speaking and writing to technical audiences as well as business stakeholders.

The post Webinar 6/28: Securing Database Servers From External Attacks appeared first on Percona Database Performance Blog.

MySQL InnoDB Cluster : MySQL Shell and the AdminAPI

As promised, here is a post more detailed on how to create a MySQL InnoDB Cluster using MySQL Shell and the AdminAPI.

First of all, as good practice can never be repeated enough: whatever version of MySQL you are using, please use the latest MySQL Shell! So if you are using 5.7, please use MySQL Shell 8.0.11. See this compatibility matrix or this official one.

dba class

The AdminAPI can be accessed in the MySQL Shell via the dba object. The reference manual for this class is here. The Shell provides a very useful method for almost all objects: help() (this method is not yet documented).

Let’s call the help() in the Shell and see the output:

MySQL JS>

The global variable 'dba' is used to access the AdminAPI functionality and
perform DBA operations. It is used for managing MySQL InnoDB clusters.

The following properties are currently supported.

 - verbose Enables verbose mode on the Dba operations.

The following functions are currently supported.

 - checkInstanceConfiguration      Validates an instance for MySQL InnoDB Cluster usage.
 - configureInstance               Validates and configures an instance for MySQL InnoDB Cluster usage.
 - configureLocalInstance          Validates and configures a local instance for MySQL InnoDB Cluster usage.
 - createCluster                   Creates a MySQL InnoDB cluster.
 - deleteSandboxInstance           Deletes an existing MySQL Server instance on localhost.
 - deploySandboxInstance           Creates a new MySQL Server instance on localhost.
 - dropMetadataSchema              Drops the Metadata Schema.
 - getCluster                      Retrieves a cluster from the Metadata Store.
 - help                            Provides help about this class and it's members
 - killSandboxInstance             Kills a running MySQL Server instance on localhost.
 - rebootClusterFromCompleteOutage Brings a cluster back ONLINE when all members are OFFLINE.
 - startSandboxInstance            Starts an existing MySQL Server instance on localhost.
 - stopSandboxInstance             Stops a running MySQL Server instance on localhost.

For more help on a specific function use:'<functionName>')

e.g.'deploySandboxInstance')

It’s also possible to get more info as explained above, let’s try:

MySQL JS>'configureInstance')

Validates and configures an instance for MySQL InnoDB Cluster usage.

SYNTAX

  .configureInstance([instance][, options])

WHERE

  instance: An instance definition.
  options: Additional options for the operation.

RETURNS

  A descriptive text of the operation result.

DESCRIPTION

  This function auto-configures the instance for InnoDB Cluster usage. If the
  target instance already belongs to an InnoDB Cluster it errors out.

  The instance definition is the connection data for the instance. For
  additional information on connection data use \? connection. Only TCP/IP
  connections are allowed for this function.

  The options dictionary may contain the following options:

  - mycnfPath: The path to the MySQL configuration file of the instance.
  - outputMycnfPath: Alternative output path to write the MySQL configuration
    file of the instance.
  - password: The password to be used on the connection.
  - clusterAdmin: The name of the InnoDB cluster administrator user to be
    created. The supported format is the standard MySQL account name format.
  - clusterAdminPassword: The password for the InnoDB cluster administrator
    account.
  - clearReadOnly: boolean value used to confirm that super_read_only must be
    disabled.
  - interactive: boolean value used to disable the wizards in the command
    execution, i.e. prompts are not provided to the user and confirmation
    prompts are not shown.
  - restart: boolean value used to indicate that a remote restart of the
    target instance should be performed to finalize the operation.

  The connection password may be contained on the instance definition, however,
  it can be overwritten if it is specified on the options.

  This function reviews the instance configuration to identify if it is valid
  for usage in group replication and cluster. An exception is thrown if not.

  If the instance was not valid for InnoDB Cluster and interaction is enabled,
  before configuring the instance a prompt to confirm the changes is presented
  and a table with the following information:

  - Variable: the invalid configuration variable.
  - Current Value: the current value for the invalid configuration variable.
  - Required Value: the required value for the configuration variable.

EXCEPTIONS

  ArgumentError in the following scenarios:

  - If 'interactive' is disabled and the instance parameter is empty.
  - If the instance definition is invalid.
  - If the instance definition is a connection dictionary but empty.
  - If the instance definition is a connection dictionary but any option is
    invalid.
  - If 'interactive' mode is disabled and the instance definition is missing
    the password.
  - If 'interactive' mode is enabled and the provided password is empty.

  RuntimeError in the following scenarios:

  - If the configuration file path is required but not provided or wrong.
  - If the instance accounts are invalid.
  - If the instance is offline.
  - If the instance is already part of a Replication Group.
  - If the instance is already part of an InnoDB Cluster.
  - If the given instance cannot be used for Group Replication.

For this setup, we will use MySQL 8.0.11 servers having one user created on each one (clusteradmin) like this:

create user 'clusteradmin'@'%' identified by 'fred';
grant all privileges on *.* to 'clusteradmin'@'%' with grant option;
reset master;

This user can be created from your GUI interface if you are using nodes in the cloud. But we can also create such a user from the Shell, as explained in the next chapter.

If you want to grant only the required privileges, the list is available here:

GRANT SELECT ON mysql_innodb_cluster_metadata.* TO your_user@'%';
GRANT SELECT ON performance_schema.global_status TO your_user@'%';
GRANT SELECT ON performance_schema.replication_applier_configuration TO your_user@'%';
GRANT SELECT ON performance_schema.replication_applier_status TO your_user@'%';
GRANT SELECT ON performance_schema.replication_applier_status_by_coordinator TO your_user@'%';
GRANT SELECT ON performance_schema.replication_applier_status_by_worker TO your_user@'%';
GRANT SELECT ON performance_schema.replication_connection_configuration TO your_user@'%';
GRANT SELECT ON performance_schema.replication_connection_status TO your_user@'%';
GRANT SELECT ON performance_schema.replication_group_member_stats TO your_user@'%';
GRANT SELECT ON performance_schema.replication_group_members TO your_user@'%';
GRANT SELECT ON performance_schema.threads TO your_user@'%' WITH GRANT OPTION;

Configure the servers

The first step is to configure the MySQL servers (mysql1, mysql2 and mysql3) that will take part in the cluster. By default, a freshly installed MySQL instance needs some configuration changes before it can create or join an InnoDB Cluster.

Even if this is not required, it’s recommended to verify the configuration of the instances. This needs to be done against a running mysqld. From the Shell, we will verify the configuration of all the nodes using dba.checkInstanceConfiguration():

MySQL JS> dba.checkInstanceConfiguration('clusteradmin@mysql1')
Please provide the password for 'clusteradmin@mysql1': ****
Validating local MySQL instance listening at port 3306 for use in an InnoDB cluster...

This instance reports its own address as mysql1
Clients and other cluster members will communicate with it through this address by default.
If this is not correct, the report_host MySQL system variable should be changed.

Checking whether existing tables comply with Group Replication requirements...
No incompatible tables detected

Checking instance configuration...

Some configuration options need to be fixed:
+--------------------------+---------------+----------------+--------------------------------------------------+
| Variable                 | Current Value | Required Value | Note                                             |
+--------------------------+---------------+----------------+--------------------------------------------------+
| binlog_checksum          | CRC32         | NONE           | Update the server variable                       |
| enforce_gtid_consistency | OFF           | ON             | Update read-only variable and restart the server |
| gtid_mode                | OFF           | ON             | Update read-only variable and restart the server |
+--------------------------+---------------+----------------+--------------------------------------------------+

Please use the dba.configureInstance() command to repair these issues.

{
  "config_errors": [
    {
      "action": "server_update",
      "current": "CRC32",
      "option": "binlog_checksum",
      "required": "NONE"
    },
    {
      "action": "restart",
      "current": "OFF",
      "option": "enforce_gtid_consistency",
      "required": "ON"
    },
    {
      "action": "restart",
      "current": "OFF",
      "option": "gtid_mode",
      "required": "ON"
    }
  ],
  "errors": [],
  "status": "error"
}

As we can see, 3 settings must be changed. GTID must be enabled, which is expected as Group Replication is based on GTIDs. We also need to disable the checksum of binlogs, as GCS/XCOM already uses its own internal checksum of events.

We can also notice that some changes require a restart of mysqld.
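If you prefer to prepare the configuration by hand rather than let the Shell persist it for you, the equivalent my.cnf fragment would look like the following sketch (variable names and values are taken from the table above):

```
[mysqld]
# Group Replication is GTID-based
gtid_mode                = ON
enforce_gtid_consistency = ON
# GCS/XCOM uses its own event checksums
binlog_checksum          = NONE
```

Note that gtid_mode and enforce_gtid_consistency are read-only at runtime, so a server restart is still needed either way.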

You can also verify the two other nodes:

MySQL JS> dba.checkInstanceConfiguration('clusteradmin@mysql2')
...
MySQL JS> dba.checkInstanceConfiguration('clusteradmin@mysql3')
...

Now we need to make those configuration changes and restart the mysqld instances. As we use MySQL 8.0, we can configure and restart mysqld remotely!

Let’s configure and restart mysql1 using dba.configureInstance(). In case you haven’t yet created a user to manage your cluster, it’s also possible to do it now!

We then have two options: configure the instance while creating the user:

MySQL JS> dba.configureInstance('mysql1',{clusterAdmin: 'newclusteradmin@%',clusterAdminPassword: 'mypassword'})

Or using a user we already created earlier:

MySQL JS> dba.configureInstance('clusteradmin@mysql1')
Please provide the password for 'clusteradmin@mysql1': ****
Configuring local MySQL instance listening at port 3306 for use in an InnoDB cluster...

This instance reports its own address as mysql1
Clients and other cluster members will communicate with it through this address by default.
If this is not correct, the report_host MySQL system variable should be changed.

Some configuration options need to be fixed:
+--------------------------+---------------+----------------+--------------------------------------------------+
| Variable                 | Current Value | Required Value | Note                                             |
+--------------------------+---------------+----------------+--------------------------------------------------+
| binlog_checksum          | CRC32         | NONE           | Update the server variable                       |
| enforce_gtid_consistency | OFF           | ON             | Update read-only variable and restart the server |
| gtid_mode                | OFF           | ON             | Update read-only variable and restart the server |
+--------------------------+---------------+----------------+--------------------------------------------------+

Do you want to perform the required configuration changes? [y/n]: y
Do you want to restart the instance after configuring it? [y/n]: y
Configuring instance...
The instance 'mysql1:3306' was configured for cluster usage.
Restarting MySQL...
MySQL server at mysql1:3306 was restarted.

We can verify in the datadir the modification made by the server on its configuration:

# cat mysqld-auto.cnf | jq
{
  "Version": 1,
  "mysql_server": {
    "mysql_server_static_options": {
      "binlog_checksum": {
        "Value": "NONE",
        "Metadata": {
          "Timestamp": 1530087517651637,
          "User": "clusteradmin",
          "Host": "mysql1"
        }
      },
      "enforce_gtid_consistency": {
        "Value": "ON",
        "Metadata": {
          "Timestamp": 1530087516620027,
          "User": "clusteradmin",
          "Host": "mysql1"
        }
      },
      "gtid_mode": {
        "Value": "ON",
        "Metadata": {
          "Timestamp": 1530087517697838,
          "User": "clusteradmin",
          "Host": "mysql1"
        }
      }
    }
  }
}

We can also configure mysql2 and mysql3 using the same MySQL Shell session and we can also avoid the prompt:

MySQL JS> dba.configureInstance('clusteradmin:fred@mysql2',{'restart': true, 'interactive': false})
Configuring MySQL instance at mysql2:3306 for use in an InnoDB cluster...

This instance reports its own address as mysql2
Clients and other cluster members will communicate with it through this address by default.
If this is not correct, the report_host MySQL system variable should be changed.

Some configuration options need to be fixed:
+--------------------------+---------------+----------------+--------------------------------------------------+
| Variable                 | Current Value | Required Value | Note                                             |
+--------------------------+---------------+----------------+--------------------------------------------------+
| binlog_checksum          | CRC32         | NONE           | Update the server variable                       |
| enforce_gtid_consistency | OFF           | ON             | Update read-only variable and restart the server |
| gtid_mode                | OFF           | ON             | Update read-only variable and restart the server |
+--------------------------+---------------+----------------+--------------------------------------------------+

Configuring instance...
The instance 'mysql2:3306' was configured for cluster usage.
Restarting MySQL...
MySQL server at mysql2:3306 was restarted.

MySQL JS> dba.configureInstance('clusteradmin:fred@mysql3',{'restart': true, 'interactive': false})
...

After the restart of the instances, we can of course verify the configuration again:

MySQL JS> dba.checkInstanceConfiguration('clusteradmin@mysql1')
Please provide the password for 'clusteradmin@mysql1': ****
Validating local MySQL instance listening at port 3306 for use in an InnoDB cluster...

This instance reports its own address as mysql1
Clients and other cluster members will communicate with it through this address by default.
If this is not correct, the report_host MySQL system variable should be changed.

Checking whether existing tables comply with Group Replication requirements...
No incompatible tables detected

Checking instance configuration...
Instance configuration is compatible with InnoDB cluster

The instance 'mysql1:3306' is valid for InnoDB cluster usage.

{
  "status": "ok"
}

This is perfect, we can now create our MySQL InnoDB Cluster!

Cluster creation

To create the cluster, we need to connect to one of the 3 nodes in the Shell and use this connection to create the cluster with the dba.createCluster() method, which returns a cluster object. It’s also at this time that we choose between a Single-Primary (the default) and a Multi-Primary cluster.
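As a sketch of that choice, a Multi-Primary cluster can be requested with an option to dba.createCluster(). Note the option name: MySQL Shell 8.0.11 used multiMaster (together with force: true to confirm the choice); later releases renamed it to multiPrimary:

```
// Single-Primary cluster (the default)
cluster = dba.createCluster('lefredCluster')

// Multi-Primary cluster: all members accept writes
cluster = dba.createCluster('lefredCluster', {multiMaster: true, force: true})
```

In this post we stick with the Single-Primary default.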

Let’s create our Single-Primary Cluster using mysql1:

MySQL JS> \c clusteradmin@mysql1
Creating a session to 'clusteradmin@mysql1'
Enter password: ****
Fetching schema names for autocompletion... Press ^C to stop.
Your MySQL connection id is 10 (X protocol)
Server version: 8.0.11 MySQL Community Server - GPL
No default schema selected; type \use <schema> to set one.

MySQL [mysql1+ ssl] JS> cluster = dba.createCluster('lefredCluster')
A new InnoDB cluster will be created on instance 'clusteradmin@mysql1:3306'.

Validating instance at mysql1:3306...

This instance reports its own address as mysql1

Instance configuration is suitable.
Creating InnoDB cluster 'lefredCluster' on 'clusteradmin@mysql1:3306'...
Adding Seed Instance...

Cluster successfully created. Use Cluster.addInstance() to add MySQL instances.
At least 3 instances are needed for the cluster to be able to withstand up to
one server failure.

As you can see, the returned object is now in the variable ‘cluster’. In case you forgot to set a variable like this, you can also do it later using the dba.getCluster() method:

MySQL [mysql1+ ssl] JS> cluster = dba.getCluster()

As with the dba object, the cluster object also has a help() method available:

MySQL [mysql1+ ssl] JS>

The cluster object is the entry point to manage and monitor a MySQL InnoDB cluster.

A cluster is a set of MySQLd Instances which holds the user's data.

It provides high-availability and scalability for the user's data.

The following properties are currently supported.

 - name Cluster name.

The following functions are currently supported.

 - addInstance                 Adds an Instance to the cluster.
 - checkInstanceState          Verifies the instance gtid state in relation with the cluster.
 - describe                    Describe the structure of the cluster.
 - disconnect                  Disconnects all internal sessions used by the cluster object.
 - dissolve                    Dissolves the cluster.
 - forceQuorumUsingPartitionOf Restores the cluster from quorum loss.
 - getName                     Retrieves the name of the cluster.
 - help                        Provides help about this class and it's members
 - rejoinInstance              Rejoins an Instance to the cluster.
 - removeInstance              Removes an Instance from the cluster.
 - rescan                      Rescans the cluster.
 - status                      Describe the status of the cluster.

For more help on a specific function use:'<functionName>')

e.g.'addInstance')

We can verify our cluster using the status() method:

{ "clusterName": "lefredCluster", "defaultReplicaSet": { "name": "default", "primary": "mysql1:3306", "ssl": "REQUIRED", "status": "OK_NO_TOLERANCE", "statusText": "Cluster is NOT tolerant to any failures.", "topology": { "mysql1:3306": { "address": "mysql1:3306", "mode": "R/W", "readReplicas": {}, "role": "HA", "status": "ONLINE" } } }, "groupInformationSourceMember": "mysql://clusteradmin@mysql1:3306" }

We can now add a new instance to the cluster. But first, we can check the compatibility of the dataset (check that the new node doesn’t have any transactions, GTIDs, that are unknown to the cluster). We will use the cluster.checkInstanceState() method to achieve this task:

MySQL [mysql1+ ssl] JS> cluster.checkInstanceState('clusteradmin@mysql2')
Please provide the password for 'clusteradmin@mysql2': ****
Analyzing the instance replication state...

The instance 'clusteradmin@mysql2' is valid for the cluster.
The instance is new to Group Replication.

{
  "reason": "new",
  "state": "ok"
}

The state reported can be one of the following (we will see another example in the next chapter):

  • new: if the instance doesn’t have any transactions
  • recoverable: if the instance executed GTIDs are not conflicting with the executed GTIDs of the cluster instances
  • diverged: if the instance executed GTIDs diverged with the executed GTIDs of the cluster instances
  • lost_transactions: if the instance has more executed GTIDs than the executed GTIDs of the cluster instances

So for now, we can see that there is no problem and that the instance can be added; let’s do so using cluster.addInstance():

MySQL [mysql1+ ssl] JS> cluster.addInstance('clusteradmin@mysql2:3306')
A new instance will be added to the InnoDB cluster. Depending on the amount of
data on the cluster this might take from a few seconds to several hours.

Please provide the password for 'clusteradmin@mysql2:3306': ****
Adding instance to the cluster ...

Validating instance at mysql2:3306...

This instance reports its own address as mysql2

Instance configuration is suitable.
The instance 'clusteradmin@mysql2:3306' was successfully added to the cluster

Note that the port is required! We can verify the cluster using status() again:

MySQL [mysql1+ ssl] JS> cluster.status()
{
  "clusterName": "lefredCluster",
  "defaultReplicaSet": {
    "name": "default",
    "primary": "mysql1:3306",
    "ssl": "REQUIRED",
    "status": "OK_NO_TOLERANCE",
    "statusText": "Cluster is NOT tolerant to any failures.",
    "topology": {
      "mysql1:3306": {
        "address": "mysql1:3306",
        "mode": "R/W",
        "readReplicas": {},
        "role": "HA",
        "status": "ONLINE"
      },
      "mysql2:3306": {
        "address": "mysql2:3306",
        "mode": "R/O",
        "readReplicas": {},
        "role": "HA",
        "status": "ONLINE"
      }
    }
  },
  "groupInformationSourceMember": "mysql://clusteradmin@mysql1:3306"
}

Wrong Initial State

I’ve added a transaction in mysql3 to show what happens when there is an error:

MySQL [mysql1+ ssl] JS> cluster.checkInstanceState('clusteradmin@mysql3')
Please provide the password for 'clusteradmin@mysql3': ****
Analyzing the instance replication state...

The instance 'clusteradmin@mysql3' is invalid for the cluster.
The instance contains additional transactions in relation to the cluster.

{
  "reason": "diverged",
  "state": "error"
}

Router

The router configuration is very easy: the --bootstrap option can be used and the router will configure itself:

[root@mysql1 mysqlrouter]# mysqlrouter --user mysqlrouter --bootstrap clusteradmin@mysql1
Please enter MySQL password for clusteradmin:

Bootstrapping system MySQL Router instance...
MySQL Router has now been configured for the InnoDB cluster 'lefredCluster'.

The following connection information can be used to connect to the cluster.

Classic MySQL protocol connections to cluster 'lefredCluster':
- Read/Write Connections: localhost:6446
- Read/Only Connections: localhost:6447

X protocol connections to cluster 'lefredCluster':
- Read/Write Connections: localhost:64460
- Read/Only Connections: localhost:64470

Of course, don’t forget to start MySQL Router:

# systemctl start mysqlrouter
# systemctl status mysqlrouter
● mysqlrouter.service - MySQL Router
   Loaded: loaded (/usr/lib/systemd/system/mysqlrouter.service; disabled; vendor preset: disabled)
   Active: active (running) since Wed 2018-06-27 09:57:42 UTC; 1s ago
 Main PID: 14955 (main)
   CGroup: /system.slice/mysqlrouter.service
           └─14955 /usr/bin/mysqlrouter -c /etc/mysqlrouter/mysqlrouter.conf

Jun 27 09:57:42 mysql1 systemd[1]: Starting MySQL Router...
Jun 27 09:57:42 mysql1 systemd[1]: Started MySQL Router.
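Applications then connect through the Router ports listed in the bootstrap output rather than directly to a node. For example, using the standard mysql client against the host running the Router (host name and credentials from this setup):

```
# read/write connection, routed to the current primary
mysql -h mysql1 -P 6446 -u clusteradmin -p

# read-only connection, load-balanced across the secondaries
mysql -h mysql1 -P 6447 -u clusteradmin -p
```

If the primary fails, clients simply reconnect to port 6446 and the Router directs them to the newly elected primary.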

Don’t forget to use the latest MySQL Router 8.x, even with MySQL 5.7!

Links to the MySQL Shell APIs documentation:


Extended Connectivity in dbForge Tools for MySQL

We are excited to inform our users that we have released new versions of dbForge Schema Compare for MySQL, dbForge Data Compare for MySQL, dbForge Query Builder for MySQL, dbForge Data Generator for MySQL, and dbForge Documenter for MySQL. We have introduced brand new connectivity features, including support for the latest MySQL Server, v8.0, and […]

Encrypting an existing MySQL/MariaDB RDS Instance

Often it is necessary to convert an unencrypted RDS instance into an encrypted one. And it is usually expected that this process is done with minimum or no downtime. Unfortunately, one can only enable encryption when the instance is created. However, there is still hope, as there are a couple of workarounds to encrypt your existing data.

In this article, I will discuss two different solutions to achieve this result.

Solution 1: Create a snapshot and copy the snapshot to a new encrypted snapshot:

  • Create a manual snapshot of the unencrypted RDS instance
  • Go to Snapshots from the left panel and choose the snapshot just created
  • From the Actions, choose Copy snapshot option and enable encryption
  • Select the new encrypted snapshot
  • Go to Actions and select Restore snapshot

Of course, we need to stop writes on the instance while the above steps are executed. However, since we are using RDS snapshots to create the new instance, this can still be a feasible option for a production instance.
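Assuming the AWS CLI is configured, the same console steps could be scripted roughly as follows. The instance and snapshot identifiers are made up for the example, and the KMS key defaults to the aws/rds alias:

```
# 1. Snapshot the unencrypted instance and wait for it to complete
aws rds create-db-snapshot \
    --db-instance-identifier mydb \
    --db-snapshot-identifier mydb-unencrypted
aws rds wait db-snapshot-available \
    --db-snapshot-identifier mydb-unencrypted

# 2. Copy the snapshot with encryption enabled
aws rds copy-db-snapshot \
    --source-db-snapshot-identifier mydb-unencrypted \
    --target-db-snapshot-identifier mydb-encrypted \
    --kms-key-id alias/aws/rds
aws rds wait db-snapshot-available \
    --db-snapshot-identifier mydb-encrypted

# 3. Restore the encrypted snapshot into a new instance
aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier mydb-enc \
    --db-snapshot-identifier mydb-encrypted
```

This is a sketch only; in practice you would also pass instance class, subnet group and security group options to the restore step to match your environment.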

To help you decide on the best solution for you, I am sharing the statistics of a test case I performed.

Case: on a t2.medium RDS instance with 100 GB of allocated storage, the used data size is 96 GB, running in a single-AZ deployment.

Here is the time consumed at different stages of the complete process:

Creating a snapshot: 47 mins 16 secs

Copying snapshot: 14 mins 15 secs

Restoring snapshot: 2 mins 38 secs

The total time consumed, which can be considered as the downtime in this process, was approximately 64 mins in my test case.

Please note that this time will always vary and it will depend on many factors like the storage size, workload of your instance, the number of transactions going on your instance at that time, instance class, etc.

Also, note that new volumes created from existing EBS snapshots load lazily in the background. This means that after a volume is created from a snapshot, there is no need to wait for all the data to transfer from Amazon S3 to your EBS volume before your attached instance can start accessing the volume and all its data. This is the reason why the restore operation takes so little time.

If your instance accesses data that hasn’t yet been loaded, the volume immediately downloads the requested data from Amazon S3 and continues loading the rest of the data in the background. Storage blocks on volumes that were restored from snapshots must be initialized, i.e., pulled down from Amazon S3 and written to the volume, before you can access them. This preliminary action takes time and can cause a significant increase in the latency of an I/O operation the first time each block is accessed. Performance is restored after the data is accessed once.

If looking for a reduced downtime option, keep reading.

Solution 2:  Set up external replication from unencrypted to encrypted RDS instance:

The below-mentioned steps are performed on an unencrypted RDS MySQL 5.7.17 to convert it into an encrypted one. It’s recommended that before implementing this procedure in a production environment, you test it in your development instance and create your action plan accordingly.

Here is the overview of the steps:

  1. Extend the Binary Log Retention Period on the unencrypted RDS instance
  2. Create a Read Replica from the unencrypted RDS instance
  3. Stop Replication on the Read Replica and note down the Relay_Master_Log_File & Exec_Master_Log_Pos from SHOW SLAVE STATUS
  4. Create a manual snapshot from that Read Replica (This snapshot will be unencrypted)
  5. Copy that snapshot and enable encryption
  6. Restore that snapshot (this will create an encrypted RDS instance)
  7. Start an external replication from the binary log file and position mentioned in Step-3

Below are the details of each step:

Step-1: Extend the binary log retention period on the unencrypted RDS instance

mysql> CALL mysql.rds_set_configuration('binlog retention hours', 144);

RDS normally purges a binary log as soon as possible, but the binary log might still be required for the external replication. Please specify the number of hours to retain binary log files as per your requirement.
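The current retention setting can be verified with the companion procedure (assuming the standard RDS stored procedures are available on your instance):

```
mysql> CALL mysql.rds_show_configuration;
```

The output lists `binlog retention hours` along with its current value.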

Create a replication user on the unencrypted RDS instance.


NOTE: REPLICATION SLAVE privilege is restricted in the RDS MySQL 5.5
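The post doesn’t show the statement; a minimal sketch, assuming the user name repl and password repl that are used in the Step 7 command later (pick a strong password in practice):

```
CREATE USER 'repl'@'%' IDENTIFIED BY 'repl';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';
```

On RDS you cannot use GRANT ... IDENTIFIED BY in a single statement, hence the separate CREATE USER.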

Step-2: Create a Read Replica of the unencrypted RDS instance

Step-3: Once the Read Replica RDS instance becomes available, stop the replication using below command:

mysql> CALL mysql.rds_stop_replication;

After stopping the replication, note the Relay_Master_Log_File & Exec_Master_Log_Pos from the output of SHOW SLAVE STATUS\G

Step-4: Create a manual snapshot from that Read Replica

Select the Read Replica RDS instance and from the Instance Actions take a snapshot (this snapshot will be the unencrypted snapshot).

Step-5: Copy that snapshot and enable the encryption

Select the created snapshot and from the snapshot actions menu, copy the snapshot. Enable encryption and provide a name for the snapshot.

Step-6: Restore that snapshot to the RDS instance

Choose this encrypted snapshot and from snapshot actions select Restore Snapshot. This will create the new encrypted RDS instance from the snapshot.

Step-7: Set up external replication between unencrypted RDS instance and encrypted RDS instance

Once the encrypted RDS instance becomes available, set up its external replication with the unencrypted RDS instance. Start the external replication from the binary log file and position noted in Step 3.

mysql> CALL mysql.rds_set_external_master ('RDS_ENDPOINT', 3306, 'repl', 'repl', 'BINARY_LOG_FILE', BINARY_LOG_POS, 0);

Please replace RDS_ENDPOINT with the endpoint of the unencrypted RDS instance. Also replace BINARY_LOG_FILE and BINARY_LOG_POS with the details noted down in Step 3; note that the log position is passed as a number, not a quoted string.

Start the replication using the below command:

mysql> CALL mysql.rds_start_replication;

A few things to verify when setting up external replication between RDS instances:

  • The binary log retention period is extended on the master RDS instance, since RDS normally purges binary logs as soon as the replica has executed their events.
  • On the master RDS instance, a replication user is created with the required privileges.
  • The access of the replica RDS instance is allowed into the security group of the master RDS instance.

Monitor the slave status, and once the slave is in sync with the master, the instance is ready to be used. You can then switch traffic to the encrypted RDS instance. The cutover time will be the downtime in this case.


When encrypting an existing RDS instance, the first solution does not require creating a Read Replica, but on the flip side it requires considerably more downtime. So if downtime is a significant factor for you, evaluate the feasibility of creating a Read Replica on your end and go with the second solution, where the cutover from the unencrypted to the encrypted instance is the only downtime.

I hope this helps!

Schema Management Tips for MySQL & MariaDB

A database schema is not written in stone. It is designed for a given application, but requirements may, and usually do, change. New modules and functionality are added to the application, more data is collected, and code and data model refactoring is performed. Hence the need to modify the database schema to adapt to these changes: adding or modifying columns, creating new tables, or partitioning large ones. Queries change too, as developers add new ways for users to interact with the data - new queries could use new, more efficient indexes, so we rush to create them in order to provide the application with the best database performance.

So, how do we best approach a schema change? What tools are useful? How to minimize the impact on a production database? What are the most common issues with schema design? What tools can help you to stay on top of your schema? In this blog post we will give you a short overview of how to do schema changes in MySQL and MariaDB. Please note that we will not discuss schema changes in the context of Galera Cluster. We already discussed Total Order Isolation, Rolling Schema Upgrades and tips to minimize impact from RSU in previous blog posts. We will also discuss tips and tricks related to schema design and how ClusterControl can help you to stay on top of all schema changes.

Types of Schema Changes

First things first. Before we dig into the topic, we have to understand how MySQL and MariaDB perform schema changes. You see, one schema change is not equal to another schema change.

You may have heard about online alters, instant alters, or in-place alters. All of this is the result of ongoing work to minimize the impact of schema changes on the production database. Historically, almost all schema changes were blocking: if you executed a schema change, all queries would start to pile up, waiting for the ALTER to complete. Obviously, this posed serious issues for production deployments. People immediately started to look for workarounds, and we will discuss them later in this blog, as even today they are still relevant. But work also started on improving MySQL's capability to run DDL (Data Definition Language) statements without much impact on other queries.

Instant Changes

Sometimes it is not necessary to touch any data in the tablespace, because all that has to change is the metadata. An example would be dropping an index or renaming a column. Such operations are quick and efficient, and their impact is typically limited. They are not entirely without impact, though. Sometimes it takes a couple of seconds to perform the change in the metadata, and such a change requires a metadata lock to be acquired. This lock is per-table, and it may block other operations to be executed on that table. You'll see this as "Waiting for table metadata lock" entries in the processlist.

An example of such a change is instant ADD COLUMN, introduced in MariaDB 10.3 and MySQL 8.0. It makes it possible to execute this quite popular schema change without any delay. Both MariaDB and Oracle decided to include code from Tencent Games which allows a new column to be added to a table instantly. This works only under specific conditions: the column has to be added as the last one, fulltext indexes cannot exist in the table, and the row format cannot be compressed - you can find more information on how instant ADD COLUMN works in the MariaDB documentation. For MySQL, the only official reference can be found on the MySQL Server Blog, although a bug exists to update the official documentation.
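Borrowing the product table from the opening example, an instant column add can be requested explicitly (assuming a server version where ALGORITHM=INSTANT is available, i.e. MySQL 8.0.12+ or MariaDB 10.3+); if the preconditions aren't met, the statement errors out instead of silently falling back to a slower algorithm:

```sql
-- Metadata-only change; the hypothetical p_notes column must be added last,
-- and the table must not be compressed or carry fulltext indexes.
ALTER TABLE product
  ADD COLUMN p_notes VARCHAR(255) DEFAULT NULL,
  ALGORITHM=INSTANT;
```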

In Place Changes

Some changes require modification of the data in the tablespace. Such modifications can be performed on the data in place, with no need to create a temporary table with the new structure. Such changes typically (although not always) allow other queries touching the table to be executed while the schema change is running. An example of such an operation is adding a new secondary index to the table. This operation will take some time to perform but will allow DMLs to be executed meanwhile.
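You can also assert the in-place, non-locking behavior in the ALTER itself; MySQL then rejects the statement if it cannot comply, rather than quietly taking a more intrusive path. A sketch against the product table from earlier:

```sql
-- In-place secondary index build that keeps DML flowing;
-- fails fast if INPLACE or LOCK=NONE cannot be honored.
ALTER TABLE product
  ADD INDEX p_availability_idx (p_availability),
  ALGORITHM=INPLACE, LOCK=NONE;
```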

Table Rebuild

If it is not possible to make a change in place, InnoDB creates a temporary table with the new, desired structure and then copies the existing data into it. This operation is the most expensive one, and it is likely (although not guaranteed) to block DMLs. As a result, such a schema change is very tricky to execute on a large table on a standalone server without the help of external tools - typically you cannot afford to have your database locked for long minutes or even hours. An example of such an operation would be changing a column's data type, for example from INT to VARCHAR.
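The INT-to-VARCHAR case mentioned above, sketched against a hypothetical orders table (the table and column names are illustrative only):

```sql
-- A data type change forces a full table rebuild; requesting
-- ALGORITHM=COPY makes the row-by-row copy behavior explicit.
ALTER TABLE orders
  MODIFY COLUMN order_ref VARCHAR(32) NOT NULL,
  ALGORITHM=COPY;
```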

Schema Changes and Replication

OK, so we know that InnoDB allows online schema changes, and if we consult the MySQL documentation, we will see that the majority of schema changes (at least among the most common ones) can be performed online. So what is the reason for dedicating hours of development to online schema change tools like gh-ost? We can accept that pt-online-schema-change is a remnant of the old, bad times, but gh-ost is new software.

The answer is complex. There are two main issues.

For starters, once you start a schema change, you have no control over it. You can abort it, but you cannot pause it. You cannot throttle it. As you can imagine, rebuilding a table is an expensive operation, and even though InnoDB allows DMLs to be executed, the additional I/O workload from the DDL affects all other queries, and there is no way to limit this impact to a level acceptable to the application.

The second, even more serious issue is replication. If you execute a non-blocking operation which requires a table rebuild, it will indeed not block DMLs, but only on the master. Let's assume such a DDL takes 30 minutes to complete - ALTER speed depends on the hardware, but such execution times are fairly common on tables in the 20GB size range. The statement is then replicated to all slaves and, from the moment the DDL starts on those slaves, replication waits for it to complete. It does not matter whether you use MySQL or MariaDB, or whether you have multi-threaded replication. The slaves will lag - they will wait those 30 minutes for the DDL to complete before they commence applying the remaining binlog events. As you can imagine, 30 minutes of lag (sometimes even 30 seconds is not acceptable - it all depends on the application) makes it impossible to use those slaves for scale-out. Of course, there are workarounds - you can perform schema changes from the bottom to the top of the replication chain, but this seriously limits your options. Especially if you use row-based replication, you can only execute compatible schema changes this way. A couple of examples of the limitations of row-based replication: you cannot drop any column which is not the last one, and you cannot add a column in any position other than the last one. You also cannot change a column type (for example, INT -> VARCHAR).

As you can see, replication adds complexity to how you can perform schema changes. Operations which are non-blocking on a standalone host become blocking when executed on slaves. Let's take a look at a couple of methods you can use to minimize the impact of schema changes.

Online Schema Change Tools

As we mentioned earlier, there are tools intended to perform schema changes. The most popular ones are pt-online-schema-change, created by Percona, and gh-ost, created by GitHub. In a series of blog posts we compared them and discussed how gh-ost can be used to perform schema changes, and how you can throttle and reconfigure an ongoing migration. Here we will not go into details, but we would still like to mention some of the most important aspects of using those tools.

For starters, a schema change executed through pt-osc or gh-ost happens on all database nodes at once. There is no delay whatsoever in when the change is applied, which makes it possible to use those tools even for schema changes that are incompatible with row-based replication. The exact mechanism by which the tools track changes on the table differs (triggers in pt-osc vs. binlog parsing in gh-ost), but the main idea is the same: a new table is created with the desired schema, and existing data is copied from the old table. In the meantime, DMLs are tracked (one way or the other) and applied to the new table. Once all the data is migrated, the tables are renamed and the new table replaces the old one. The rename is an atomic operation, so it is not visible to the application.

Both tools have an option to throttle the load and pause operations. Gh-ost can stop all of its activity; pt-osc can only stop the process of copying data between the old and new table - the triggers stay active and continue duplicating data, which adds some overhead. Due to the table rename, both tools have limitations regarding foreign keys: gh-ost does not support them, while pt-osc supports them partially, either through a regular ALTER, which may cause replication lag (not feasible if the child table is large), or by dropping the old table before renaming the new one - which is dangerous, as there is no way to roll back if, for some reason, the data wasn't copied to the new table correctly. Triggers are also tricky to support.

They are not supported in gh-ost, and pt-osc in MySQL 5.7 and newer has only limited support for tables with existing triggers. Another important limitation of online schema change tools is that a unique or primary key has to exist in the table; it is used to identify the rows to copy between the old and new tables. These tools are also much slower than a direct ALTER - a change which takes hours as a plain ALTER may take days when performed using pt-osc or gh-ost.

On the other hand, as we mentioned, as long as the requirements are satisfied and the limitations don't come into play, you can run all schema changes using one of these tools. The change happens at the same time on all hosts, so you don't have to worry about compatibility. You also have some level of control over how the process is executed (less in pt-osc, much more in gh-ost).

You can reduce the impact of the schema change, you can pause them and let them run only under supervision, you can test the change before actually performing it. You can have them track replication lag and pause should an impact be detected. This makes those tools a really great addition to the DBA’s arsenal while working with MySQL replication.

Rolling Schema Changes

Typically, a DBA will use one of the online schema change tools. But as we discussed earlier, under some circumstances they cannot be used and a direct ALTER is the only viable option. If we are talking about a standalone MySQL instance, you have no choice - if the change is non-blocking, that's good; if it is not, well, there's nothing you can do about it. But then, not that many people run MySQL as a single instance, right? How about replication? As we discussed earlier, a direct ALTER on the master is not feasible - in most cases it will cause lag on the slaves, and this may not be acceptable. What can be done, though, is to execute the change in a rolling fashion. You start with the slaves and, once the change is applied on all of them, promote one of the slaves to be the new master, demote the old master to a slave, and execute the change on it. Sure, the change has to be replication-compatible, but to tell the truth, the most common case where you cannot use online schema change tools is the lack of a primary or unique key. For all other cases there is some sort of workaround, especially with pt-online-schema-change, as gh-ost has more hard limitations. It is a workaround you would call "so so" or "far from ideal", but it will do the job if you have no other option to pick from. Just as important, most of the limitations can be avoided if you monitor your schema and catch the issues before the table grows. Even if someone creates a table without a primary key, it is not a problem to run a direct ALTER that takes half a second or less while the table is almost empty.

If the table grows, this becomes a serious problem, and it is up to the DBA to catch these kinds of issues before they actually start to cause trouble. We will cover some tips and tricks on how to make sure you catch such issues in time. We will also share generic tips on how to design your schemas.

Tips and Tricks

Schema Design

As we showed in this post, online schema change tools are quite important when working with a replication setup, so it is important to design your schema in a way that does not limit your options for performing schema changes. There are three important aspects. First, a primary or unique key has to exist - make sure there are no tables without a primary key in your database, and monitor this on a regular basis; otherwise it may become a serious problem in the future. Second, seriously consider whether using foreign keys is a good idea. Sure, they have their uses, but they also add overhead to your database and can make it problematic to use online schema change tools. Relations can be enforced by the application; even if it means more work, that may still be a better idea than adopting foreign keys and being severely limited in which types of schema changes can be performed. Third, triggers - same story as with foreign keys. They are a nice feature to have, but they can become a burden. You need to seriously consider whether the gains from using them outweigh the limitations they impose.

Tracking Schema Changes

Schema change management is not only about running schema changes. You also have to stay on top of your schema structure, especially if you are not the only one doing the changes.

ClusterControl provides users with tools to track some of the most common schema design issues. It can help you to track tables which do not have primary keys:

As we discussed earlier, catching such tables early is very important, as primary keys have to be added using a direct ALTER.
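If you want to run the same check by hand, a query along these lines against information_schema finds base tables with no primary key (a sketch, not ClusterControl's actual implementation):

```sql
-- List user tables that lack a PRIMARY KEY constraint.
SELECT t.table_schema, t.table_name
FROM information_schema.tables t
LEFT JOIN information_schema.table_constraints c
       ON  c.table_schema = t.table_schema
       AND c.table_name   = t.table_name
       AND c.constraint_type = 'PRIMARY KEY'
WHERE t.table_type = 'BASE TABLE'
  AND t.table_schema NOT IN
      ('mysql', 'sys', 'information_schema', 'performance_schema')
  AND c.constraint_name IS NULL;
```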

ClusterControl can also help you track duplicate indexes. Typically, you don't want multiple redundant indexes. In the example above, you can see that there is an index on (k, c) and also an index on (k). Any query which can use the index created on column 'k' can also use the composite index on (k, c). There are cases where it is beneficial to keep redundant indexes, but you have to approach it on a case-by-case basis. Starting from MySQL 8.0, it is possible to quickly test whether an index is really needed. You can make a redundant index 'invisible' by running:
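The statement itself did not survive in the post; assuming the sysbench-style table sbtest1 with a redundant index k_1 on column (k), it would look something like:

```sql
-- Hide the redundant single-column index from the optimizer.
ALTER TABLE sbtest1 ALTER INDEX k_1 INVISIBLE;
```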


This will make MySQL ignore that index and, through monitoring, you can check whether there was any negative impact on the performance of the database. If everything works as planned for some time (a couple of days or even weeks), you can plan on removing the redundant index. If you detect that something is not right, you can always re-enable the index by running:
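The re-enabling statement was also omitted from the post; assuming a table sbtest1 with an index k_1, it would read:

```sql
-- Make the index visible to the optimizer again.
ALTER TABLE sbtest1 ALTER INDEX k_1 VISIBLE;
```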


These operations are instant, and the index is there all the time and is still maintained - it just will not be taken into consideration by the optimizer. Thanks to this option, removing indexes in MySQL 8.0 is a much safer operation. In previous versions, re-adding a wrongly removed index could take hours, if not days, on large tables.

ClusterControl can also let you know about MyISAM tables.

While MyISAM still may have its uses, you have to keep in mind that it is not a transactional storage engine. As such, it can easily introduce data inconsistency between nodes in a replication setup.

Another very useful feature of ClusterControl is one of the operational reports - a Schema Change Report.

In an ideal world, a DBA reviews, approves, and implements all schema changes. Unfortunately, this is not always the case. Such a review process just does not go well with agile development. In addition, the developer-to-DBA ratio is typically quite high, which can also become a problem, as DBAs struggle not to become a bottleneck. That's why it is not uncommon to see schema changes performed without the DBA's knowledge. Yet the DBA is usually the one responsible for the database's performance and stability. Thanks to the Schema Change Report, they can now keep track of schema changes.

First, some configuration is needed. In the configuration file for a given cluster (/etc/cmon.d/cmon_X.cnf), you have to define on which host ClusterControl should track the changes and which schemas should be checked.

schema_change_detection_address=
schema_change_detection_databases=sbtest

Once that’s done, you can schedule a report to be executed on a regular basis. An example output may be like below:

As you can see, two tables have changed since the previous run of the report. In the first one, a new composite index has been created on columns (k, c). In the second table, a column was added.

In a subsequent run, we got information about a new table which was created without any index or primary key. With this kind of information, we can easily act when needed and solve issues before they actually become blockers.

Tags:  MySQL MariaDB schema changes

Webinar 6/27: MySQL Troubleshooting Best Practices: Monitoring the Production Database Without Killing Performance

Please join Percona’s Principal Support Escalation Specialist Sveta Smirnova as she presents Troubleshooting Best Practices: Monitoring the Production Database Without Killing Performance on Wednesday, June 27th at 11:00 AM PDT (UTC-7) / 2:00 PM EDT (UTC-4).

Register Now


During the MySQL Troubleshooting webinar series, I covered many monitoring and logging tools such as:

  • General, slow, audit, binary, error log files
  • Performance Schema
  • Information Schema
  • System variables
  • Linux utilities
  • InnoDB monitors
  • PMM

However, I did not spend much time on the impact these instruments have on overall MySQL performance. And they do have an impact.

And this is the conflict many people face. MySQL Server users try these monitoring instruments, see that they slow down their installations, and turn them off. This is unfortunate. If the instrument that could help you resolve a problem is OFF, you won't have the information needed to understand when, how, and why the issue occurred. In the best case, you'll re-enable instrumentation and wait for the next occurrence of the disaster. In the worst case, you'll try various fixes without any real knowledge of whether they solve the problem.

This is why it is important to understand the impact monitoring tools have on your database, and therefore how to minimize it.

Understanding and controlling the impact of MySQL monitoring tools

In this webinar, I cover why certain monitoring tools affect performance, and how to minimize the impact without turning the instrument off. You will learn how to monitor safely and effectively.

Register Now


Sveta Smirnova Principal Support Escalation Specialist

Sveta joined Percona in 2015. Her main professional interests are problem solving, working with tricky issues and bugs, finding patterns that can quickly solve typical issues, and teaching others how to deal with MySQL issues, bugs, and gotchas effectively. Before joining Percona, Sveta worked as a Support Engineer in the MySQL Bugs Analysis Support Group at MySQL AB, Sun, and Oracle. She is the author of the book "MySQL Troubleshooting" and of JSON UDF functions for MySQL.

The post Webinar 6/27: MySQL Troubleshooting Best Practices: Monitoring the Production Database Without Killing Performance appeared first on Percona Database Performance Blog.

Migrating Messenger storage to optimize performance

More than a billion people now use Facebook Messenger to instantly share text, photos, video, and more. As we have evolved the product and added new functionality, the underlying technologies that power Messenger have changed substantially.

When Messenger was originally designed, it was primarily intended to be a direct messaging product similar to email, with messages waiting in your inbox the next time you visited the site. Today, Messenger is a mobile-first, real-time communications system used by businesses as well as individuals. To enable that shift, we have made many changes through the years to update the backend system. The original monolithic service was separated into a read-through caching service for queries; Iris to queue writes to subscribers (such as the storage service and devices); and a storage service to retain message history. This design optimized Messenger for a mobile-first world and helped fuel its success.

To help improve Messenger even more, we now have overhauled and modernized the storage service to make it faster, more efficient, more reliable, and easier to upgrade with new features. This evolution involved several major changes:

  • We redesigned and simplified the data schema, created a new source-of-truth index from existing data, and made consistent invariants to ensure that all data is formatted correctly.
  • We moved from HBase, an open source distributed key-value store based on HDFS, to MyRocks, Facebook’s open source database project that integrates RocksDB as a MySQL storage engine.
  • We moved from storing the database on spinning disks to flash on our new Lightning Server SKU.

The result has been a better experience for Messenger users themselves, who can now enjoy a more responsive product with enhanced functionality, such as mobile content search. We also have improved system resiliency, reduced latency, and decreased storage consumption by 90 percent. This was achieved in a seamless way that did not cause any disruption or downtime, but it required us to plan for and execute two different migration flows to account for every single Messenger user.

Handling the challenge of migrating at scale

HBase had been our database since 2010 and it served us well, but the change to MyRocks helped us realize several important benefits. We can now leverage our new Lightning Server SKU, from the Open Compute Project, to serve data housed in flash storage instead of relying on spinning disks. Also, the replication topology for MySQL is more compatible with the way Facebook data centers operate in production, enabling us to reduce the number of physical replicas of data while producing better availability and disaster recovery.

Once we decided to update the service and move to MyRocks, migrating data between storage systems while keeping Messenger up and running for more than 1 billion accounts proved to be an interesting challenge.

We wanted to make sure people using Messenger had an uninterrupted experience during the migration process. The migration required extensive reading across all historical data on HBase clusters, which are I/O operations bound. If we proceeded too aggressively, we would downgrade the performance of HBase, leading to errors that negatively affect the user experience. Also, business users often have many Messenger chat windows active simultaneously around the clock for tasks such as helping with customer service, taking online orders from customers, or providing updates based on customers’ requests. This also meant that as we worked on the migration, we had to make code changes to support new product features on both the old and new systems so that people using Messenger would not experience disruptions when their accounts were moved to the new database.

It was also crucial to ensure that we migrated all the information for every single Messenger account, which meant petabytes of information. Since we were changing the data schema, we had to carefully parse the existing data, handle messy legacy data, cover corner cases, and resolve conflicts so that people saw the exact same messages, videos, and photos as before.

As we were also migrating to a brand-new database, we were also developing the service and designing and manufacturing the Lightning flash servers at the same time. Rolling out the new service required fixing software, kernel, firmware, and even physical power path bugs.

To address these challenges, we designed two migration flows for all Messenger accounts. The normal flow covered 99.9 percent of accounts, while the buffered migration flow covered the remaining, hard-to-migrate accounts. We did a thorough data validation, prepared a revert plan, and performed an accounting job to verify we didn’t miss anyone. When we were confident we had migrated every account to the new system, we finally took the old system offline.

Chart 1: Migration workflow.

Normal migration

It is critical to have strong data consistency before and after migration to ensure the correct product experience. The basic single-user migrator does this through a crucial assumption: that no data is written to the account during migration. To ensure this, we clearly defined the state machine and developed abundant monitoring tools. For each account at any given time, the state machine puts it in one of three “static states” (not-migrated, double-writing, or done), or it is in a “dynamic state” and is actively undergoing migration. When migration starts, we log the last position of data in the old system (both the storage service and Iris have this information) and then start migrating data to new system (shown in step B in Chart 2 below). Once finished, we check whether the data position moved. If not, the account now allows writing traffic to the new system (step C in Chart 2) and enters the double-writing state. Otherwise, we classify this account migration as failed, clean up the data in MyRocks, and let a future migration job try to migrate that account again.

During the double-writing step, the migrator performs data validation and API validation to ensure accuracy. For data validation, we confirm that the data we have in HBase matches the data in MyRocks. If both systems process the data identically, the data will conform no matter how many new messages are received. For API validation, we issue read requests to both systems, old and new, and compare responses to make sure they are equivalent, in order to make sure the client can read seamlessly from either system.

Before reaching the done status in the workflow, we verify that the account has been successfully migrated. If issues are detected, there is a complete revert plan that can roll back an account to the not-migrated state. Whether migrating an individual or a large group of Messenger accounts, we developed tools to switch where their reads are served from as well as reverting back to the old system and wiping the data from the new system.

Chart 2: Normal migration flow.

Buffered migration flow

Some accounts cannot be migrated in the normal way. For example, a large business may run a Messenger bot to serve its customers, and there is no window of time to perform the migration when new messages aren't coming in. These accounts might also be much larger than is typical. So we developed a different flow to migrate these users. We set a cutoff time, called a migration start time, and then take a snapshot of the account's data at that moment. We then copy the snapshot to a buffer tier (step B in Chart 3), which usually takes about 10 percent of the migration time. Then we migrate the data from the buffer tier to MyRocks (step C in Chart 3). Meanwhile, the write traffic to the new (MyRocks) system is queueing up in Iris (step D in Chart 3). Iris was designed to be able to queue messages for weeks, so we were able to concurrently migrate hundreds of high-volume accounts, with each one receiving up to thousands of messages per second. Once all data from the buffer HBase tier is migrated, the account enters the double-writing state as in the normal flow: we resume write traffic to MyRocks. The new system then quickly drains the queue and catches up with the old system.

In this workflow, we also have two options to choose a buffer tier. We’ve built a dedicated HBase tier as the buffer tier, which keeps the exact same schema as old system; this approach can work for most accounts in the buffered migration flow. But for accounts with an extremely large amount of data, we reserve dedicated servers with SSDs, using embedded RocksDB as a buffer. This made their migration even faster.

Chart 3: Buffered migration flow.

Migrate at scale

To manage migration jobs at a high level, we’ve leveraged the Bistro framework, another Facebook open source project, to parallelize the work. We are able to flexibly schedule jobs, track progress, log and analyze the progress, and throttle jobs if issues arise with the service.

When we start to mark migrated accounts as “done,” all new Messenger accounts are then created on the new system only. We gradually stop writing traffic to HBase, one cluster at a time. We then run an accounting job to make sure every single account found in HBase has actually been migrated and no account has been left behind. Through this approach, we were able to finish this mass migration with 100 percent of Messenger accounts moved to the new system.

Benefits of the new system

In the end, we migrated 99.9 percent of accounts via the normal migration flow in two weeks, and finished the remainder via the buffered migration flow two weeks later.

The simplified data schema directly reduced the size of data on disk. We saved additional space in MyRocks by applying Zstandard, a state-of-the-art lossless data compression algorithm developed by Facebook. We were able to reduce the replication factor from six to three, thanks to differences between the HBase and MyRocks architectures. In total, we reduced storage consumption by 90 percent without data loss, thereby making it practical to use flash storage.

Compared with HBase, MyRocks at Facebook has more mature and more automated features to handle disaster recovery. In cases where we need to completely switch to another data center, we no longer need to involve human supervisors and perform manual operations. Automation makes switching data centers orders of magnitude faster than with the old system.

MyRocks is optimized for both reading and writing, and by leveraging Lightning with flash storage, we realized latency wins, in part because we are no longer I/O bound as we were with HBase. Read latency is now 50 times lower than in the previous system. People using Messenger can see that it is now much faster to scroll back through old messages. MyRocks is also more resilient.

These performance and architecture improvements have also enabled Messenger to add new features much more easily. The migration to MyRocks has helped us to launch message content search on mobile, a frequently requested feature. It was hard to implement mobile message search using HBase. Because it is I/O bound, it would require an equivalent read-heavy job to build an index for search. By switching to MyRocks, Messenger has now directly adopted the established Facebook search infrastructure built on top of MySQL, and we now have more than sufficient headroom to allow users to search their conversations on mobile as well as on desktop.

With the database migration complete, we now have a system in place to support further improvements and new features in Messenger. We are looking forward to sharing more updates in the future.