Planet MySQL

The MySQL 8.0.12 Maintenance Release is Generally Available

The MySQL Development team is very happy to announce that MySQL 8.0.12, the first 8.0 Maintenance Release, is now available for download. In addition to bug fixes there are a few new features added in this release, most notably the InnoDB instant add column feature. …
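The instant add column feature avoids rebuilding the table when a column is added, so the statement returns quickly even on very large tables. A minimal sketch (the table and column names are hypothetical):

```sql
-- ALGORITHM=INSTANT only changes metadata; no table rebuild is performed.
-- If the operation cannot be done instantly, the statement fails with an
-- error rather than silently falling back to a slower algorithm.
ALTER TABLE orders
  ADD COLUMN note VARCHAR(64) DEFAULT NULL,
  ALGORITHM=INSTANT;
```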

MySQL Workbench 8.0.12 has been released

Dear MySQL users,

The MySQL developer tools team announces 8.0.12 as our first general
availability (GA) release for MySQL Workbench 8.0.

For the full list of changes in this revision, visit

For discussion, join the MySQL Workbench Forums:

The release is now available in source and binary form for a number of
platforms from our download pages at:


MySQL Connector/C++ 8.0.12 has been released

Dear MySQL users,

MySQL Connector/C++ 8.0.12 is the second GA release version of the MySQL Connector/C++ 8.0 series.

Connector/C++ 8.0 can be used to access MySQL implementing Document Store or in a traditional way, using SQL queries. It allows writing both C++ and plain C applications using X DevAPI and X DevAPI for C. It also supports the legacy API of Connector/C++ 1.1 based on JDBC4.

To learn more about how to write applications using X DevAPI, see “X DevAPI User Guide”

See also “X DevAPI Reference” at

and “X DevAPI for C Reference” at

For generic information on using Connector/C++ 8.0, see

Note: For X DevAPI applications, Connector/C++ 8.0 requires MySQL Server 5.7.12 or higher with X Plugin enabled. For applications that use the legacy JDBC API, Connector/C++ 8.0 can use MySQL Server 5.5 or higher, and X Plugin is not required.

To download MySQL Connector/C++ 8.0.12, see the “Generally Available (GA)
Releases” tab at

Changes in MySQL Connector/C++ 8.0.12 (2018-07-27, General Availability)
  • Installation Notes
  • Packaging Notes
  • Security Notes
  • X DevAPI Notes
  • Bugs Fixed
Installation Notes

  • Because the Microsoft Visual C++ 2017 Redistributable installer deletes the Microsoft Visual C++ 2015 Redistributable registry keys that identify its installation, standalone MySQL MSIs may fail to detect the Microsoft Visual C++ 2015 Redistributable if both it and the Microsoft Visual C++ 2017 Redistributable are installed. The solution is to repair the Microsoft Visual C++ 2017 Redistributable via the Windows Control Panel to recreate the registry keys needed for runtime detection. Unlike the standalone MSIs, MySQL Installer for Windows contains a workaround for the detection problem.

Packaging Notes

  • An RPM package for installing ARM 64-bit (aarch64) binaries of Connector/C++ on Oracle Linux 7 is now available in the MySQL Yum Repository and for direct download. Known limitation for this ARM release: you must enable the Oracle Linux 7 Software Collections Repository (ol7_software_collections) to install this package, and must also adjust the libstdc++7 path. See Yum's Platform Specific Notes ( installation-yum-repo.html#yum-install-platform-specifics) for additional details.
  • Installers for Connector/C++ are now available in these formats: MSI packages (Windows); RPM packages (Linux); DMG packages (macOS).

Security Notes

  • yaSSL is no longer included in Connector/C++ source distributions. wolfSSL may be used as a functionally equivalent alternative that has a GPLv2-compatible license. In addition, wolfSSL (like OpenSSL) supports the TLSv1.2 protocol, which yaSSL does not. To build Connector/C++ using wolfSSL, use the -DWITH_SSL=path_name CMake option, where path_name indicates the location of the wolfSSL sources. For more information, see Source Installation System Prerequisites ( -cpp-installation-source-prerequisites.html) and Connector/C++ Source-Configuration Options ( -source-configuration-options.html).
X DevAPI Notes

  • Connector/C++ now supports the NOWAIT and SKIP LOCKED lock contention modes to be used with the lockExclusive() and lockShared() clauses of CRUD find/select operations (see Locking Read Concurrency with NOWAIT and SKIP LOCKED ( -locking-reads.html#innodb-locking-reads-nowait-skip-locked)), as well as a default lock contention mode. The following list names the permitted constants. For each item, the first and second constants apply to X DevAPI and X DevAPI for C, respectively.
      + LockContention::DEFAULT, LOCK_CONTENTION_DEFAULT: Block the query until existing row locks are released.
      + LockContention::NOWAIT, LOCK_CONTENTION_NOWAIT: Return an error if the lock cannot be obtained immediately.
      + LockContention::SKIP_LOCKED, LOCK_CONTENTION_SKIP_LOCKED: Execute the query immediately, excluding from the query those items that are locked.
    For X DevAPI and X DevAPI for C applications, lock mode methods accept these lock contention constants as a parameter. For X DevAPI applications, lock mode methods can be called without this parameter, as before; this is equivalent to passing a lock mode of DEFAULT.
  • Connector/C++ now supports the SHA256_MEMORY authentication mechanism for connections using the X Protocol. For X DevAPI applications, SessionOption::AUTH supports the new value AuthMethod::SHA256_MEMORY. For X DevAPI for C applications, the session option MYSQLX_OPT_AUTH supports the new value MYSQLX_AUTH_SHA256_MEMORY. These new values request using the sha256_memory authentication mechanism when creating a session.
  • For compliance with the Core DevAPI, these Connector/C++ changes were made:
      + getAffectedItemsCount() was moved from Result to Result_common.
      + Collection.modify(condition).arrayDelete() was removed.
      + getAffectedRowsCount() was removed. Use getAffectedItemsCount() instead.
      + getWarningCount() was renamed to getWarningsCount().

Bugs Fixed

  • utf8mb4 character data was handled incorrectly. (Bug #28240202)
  • Session creation had a memory leak. (Bug #27917942)
  • When configuring to build Connector/C++ with the legacy connector, CMake did not account for the MYSQL_CONFIG_EXECUTABLE option. (Bug #27874173, Bug #90389)
  • Improper error handling for unknown hosts when creating a session could result in unexpected application exit. (Bug #27868302)
  • The mysqlx_row_fetch_one() X DevAPI for C function could fail to return for large result sets exceeding the maximum packet size. Now such result sets produce an error. (Bug #27732224)
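The lock contention constants above map to the server's locking-read modifiers, which may make the three modes easier to picture. A rough SQL equivalent (the table and column names are hypothetical; NOWAIT and SKIP LOCKED require MySQL 8.0 on the server side):

```sql
-- DEFAULT: block until the conflicting row lock is released
SELECT * FROM seats WHERE seat_no = 3 FOR UPDATE;

-- NOWAIT: fail immediately with an error if the row is locked
SELECT * FROM seats WHERE seat_no = 3 FOR UPDATE NOWAIT;

-- SKIP LOCKED: return only the rows that are not currently locked
SELECT * FROM seats WHERE seat_no BETWEEN 1 AND 10 FOR UPDATE SKIP LOCKED;
```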

Enjoy and thanks for the support!

On Behalf of Oracle/MySQL Release Engineering Team,
Balasubramanian Kandasamy

MySQL Connector/ODBC 8.0.12 has been released

Dear MySQL users,

MySQL Connector/ODBC 8.0.12 is the second version of the MySQL Connector/ODBC 8.0 series, the ODBC driver for the MySQL Server.

The available downloads include both a Unicode driver and an ANSI driver based on the same modern codebase. Please select the driver type you need based on the type of your application – Unicode or ANSI. Server-side prepared statements are enabled by default. The driver is suitable for use with any MySQL server version from 5.5 onward.

This release of the MySQL ODBC driver conforms to the ODBC 3.8 specification. It contains implementations of key 3.8 features, including self-identification as an ODBC 3.8 driver, streaming of output parameters (supported for binary types only), and support for the SQL_ATTR_RESET_CONNECTION connection attribute (for the Unicode driver only).

The release is now available in source and binary form for a number of platforms from our download pages at

For information on installing, please see the documentation at

Functionality Added or Changed

  • Several code issues identified by Fortify were corrected.
  • Refactored codebase to remove legacy code and implement general performance improvements. For example, unused ANSI data conversion code and legacy functions were removed. Example improvements affect bookmark handling for bulk operations, handling of memory buffers for prepared statements, and handling of session variables.
  • On Windows, 32-bit support was added and 32-bit binaries are now available.
  • An RPM package for installing ARM 64-bit (aarch64) binaries of Connector/ODBC on Oracle Linux 7 is now available in the MySQL Yum Repository and for direct download. Known Limitation for this ARM release: You must enable the Oracle Linux 7 Software Collections Repository (ol7_software_collections) to install this package, and must also adjust the libstdc++7 path. See Yum’s Platform Specific Notes ( for additional details.

Bugs Fixed

  • Added checks for unsupported functionality that return SQL_ERROR instead of SQL_SUCCESS, where the error message refers to the unsupported functionality. (Bug #28217387)
  • The data source dependent type’s name was not always returned. For example, the ODBC driver reported TEXT as the database type for TINYTEXT, MEDIUMTEXT, and LONGTEXT, and reported BLOB for TINYBLOB, MEDIUMBLOB, and LONGBLOB. (Bug #11761407, Bug #53900)

On Behalf of Oracle/MySQL Release Engineering Team,
Piotr Obrzut

Fast shutdown, fast startup

Managing a web-scale database service is easier when shutdown and startup are fast, or at least fast enough. To me, fast enough means a few seconds, and too slow means tens of seconds.

When these operations are too slow then:
  • scripts might time out - one of the MySQL scripts used to do that, see bug 25341
  • uncertainty increases - storage engines rarely provide progress indicators for shutdown. Most provide only a few lines in the error log: one for shutdown starting, one for shutdown ending, and maybe a few more. Alas, you have to ssh to the host and tail the error log to see them. When startup for InnoDB does crash recovery there is a useful progress indicator in the error log, but again, you need to ssh to the host to see it. Note that "ssh to host to tail error log" is not a best practice for web-scale.
  • downtime increases - restart (shutdown/startup), shutdown and startup can be sources of downtime. They happen for many reasons -- changing a configuration option that isn't dynamic, upgrading to a new binary, upgrading the kernel, etc. When they take 60 seconds your service might incur 60 seconds of downtime.
The work done by shutdown/startup also depends on the index structure (LSM vs b-tree) and on implementation details.

For a b-tree either shutdown or startup will be slow. The choice is either to flush dirty pages on shutdown (one random write per dirty page) or to do crash recovery at startup (one random read per page that was dirty on shutdown, eventually those dirty pages must be written back). The innodb_fast_shutdown option lets you control which one will be slow.
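That tradeoff can be chosen per shutdown. A sketch of the option's values and what each trades away:

```sql
-- 0: slow shutdown - full purge and change buffer merge first
--    (slowest shutdown, cleanest state at startup)
-- 1: default fast shutdown - skips purge and merge, but still flushes
--    dirty pages from the buffer pool
-- 2: flush the log and shut down as if crashed
--    (fastest shutdown; crash recovery runs at the next startup)
SET GLOBAL innodb_fast_shutdown = 1;
```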

When dirty page writeback is done on shutdown then the time for that is a function of storage performance and the number of dirty pages. Back in the day (InnoDB running with disks) shutdown was slower. Storage is much faster today, but buffer pools are also larger because servers have more RAM. Shutdown can be made faster by reducing the value of innodb_max_dirty_pages_pct a few minutes before the shutdown. Alas, using a large value for innodb_max_dirty_pages_pct can be very good for performance -- less log IO, less page IO, less write-amplification.
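A minimal sketch of that pre-shutdown routine:

```sql
-- a few minutes before a planned shutdown, make the page cleaner work ahead
SET GLOBAL innodb_max_dirty_pages_pct = 0;

-- poll until the dirty page count is close to zero, then shut down
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';
```

If the larger value is not set in my.cnf, remember to restore it after the server comes back up.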

Amazon Aurora is a hybrid, or mullet, with a b-tree up front and log-structured in the back. Shutdown for it is fast. It also doesn't need to warmup after startup because the buffer pool survives instance restart. Many years ago there was an option in Percona XtraDB to make the buffer pool survive restart, I wonder if that option will return. InnoDB also has an option to warmup the buffer pool at startup, but that still does IO which isn't as nice as preserving the buffer pool.

Back in the day InnoDB startup was often too slow. My memory about this has faded. One part of the problem is that per-index statistics are computed the first time a table is opened and that did ~6 random reads per index. That was the first part of the delay. My memory about the second part of the delay has faded more but I think at times this was single-threaded. A post from Percona explained some of this. Today InnoDB stats can be persistent, so they don't have to be sampled at startup. But InnoDB was also enhanced to avoid some of this problem long before persistent stats were added. I hope a reader provides a less vague description of this.

Shutdown for an LSM is fast -- flush the write buffer, no random writes. One thing that made shutdown slower for RocksDB was calling free for every page in the LRU. Note that RocksDB does malloc per page in the LRU rather than one huge malloc like InnoDB. With MyRocks the LRU isn't free'd on shutdown so there are no stalls from that.

Startup for MyRocks should be fast but there is still at least one problem to solve. If you configure it with max_open_files=-1 then file descriptors are opened for all SSTs at startup. This helps performance by avoiding the need to search a hash table. The cost of this is 1) more open file descriptors and 2) more work at startup. See the description of the RocksDB option and more details in the tuning guide and FAQ. Note that the work done to open all of the SSTs can be done by multiple threads and the number of threads is controlled by the max_file_opening_threads RocksDB option. From looking at MyRocks code I don't think there is a way to change the value of max_file_opening_threads and the default is 16. The not-yet-solved problem is that RocksDB tries to precache some data from the end of every SST, by reading this data into the OS page cache, and that can be a lot of IO at startup, which also can make startup slower. With MyRocks when rocksdb_max_open_files is set to -2 then the open files limit is auto-configured, when set to -1 then there is no limit, and when set to > 0 then that is the limit.
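The open-files behavior described above can be sketched as a my.cnf fragment (option name as given in the post; the chosen value is illustrative):

```ini
[mysqld]
# -1: no limit; open a file descriptor for every SST at startup
#     (faster reads, but more startup work and more open descriptors)
# -2: auto-configure the open files limit
# >0: cap the number of cached SST file descriptors at this value
rocksdb_max_open_files = -2
```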

This Week in Data with Colin Charles 46: OSCON Recap, Google Site Reliability Workbook

Join Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.

OSCON happened last week and was incredible. It is true there was less of a database focus, and a lot more topics covered. In fact, you’d have found it hard to find database content. There was plenty of interesting content around AI/ML, cloud, SRE, blockchain and more. As a speaker, I found the 40-minute sessions that included Q&A quite compact (a little too short, and many speakers sped up towards the end). I guess it will make for more blog content.

The conference’s open source ethos is still extremely strong, and the keynotes exemplified that. It was not just the 20th anniversary of OSCON, but also the 20th anniversary of the Open Source Initiative (OSI). Percona is a sponsor (and I am an individual member). From a sponsor standpoint, Home Depot had a huge booth, but so did the NSA – who, along with giving out stickers, were actively recruiting. Readers might recall mention of the NSA’s involvement with LemonGraph and other open source data projects from column 43.

Google just released The Site Reliability Workbook, which looks like the full PDF (“launch day edition”) of their new book. It includes practical ways to implement SRE. You can pre-order the book, and the official release date is August 4, 2018. This should be the best companion to Site Reliability Engineering: How Google Runs Production Systems, which I highly recommend reading first before getting to the workbook. After a quick perusal of the new release, I can say I like it — the case studies from Evernote and Home Depot are all very interesting from a database standpoint (MySQL, Cloud SQL). Plenty of information is relevant if you’re a Prometheus user as well. I say skim the PDF, and devour the book!

Releases

Link List

Industry Updates
  • Elastic is on a spree of grabbing folk – Massimo Brignoli (ex-MongoDB, SkySQL, Oracle, and MySQL) joins as a Principal Solutions Architect, and Gerardo Narvaja joins as Sr. Solutions Architect for Pacific Northwest. He departs MariaDB Corporation, and has previously been at Tokutek, Pythian and MySQL.
  • Morgan Tocker (LinkedIn) has joined PingCAP as Senior Product & Community Manager. Previously he was both product manager and community manager at Oracle for MySQL, had a stint at Percona and also was at the original MySQL AB.
  • Baron Schwartz is now Chief Technology Officer of VividCortex, and Amena Ali has become the new CEO.
Upcoming Appearances

Feedback

I look forward to feedback/tips via e-mail at or on Twitter @bytebot.

The post This Week in Data with Colin Charles 46: OSCON Recap, Google Site Reliability Workbook appeared first on Percona Database Performance Blog.

Mastering Continuent Clustering Series: Configuring Startup on Boot

In this blog post, we talk about how to configure automatic start at boot time for the Tungsten Clustering components.

By default, Tungsten Clustering does not start automatically on boot. To enable Tungsten Clustering to start at boot time, use the deployall script provided to create the necessary boot scripts:

shell> sudo /opt/continuent/tungsten/cluster-home/bin/deployall

To disable automatic startup at boot time, use the undeployall command:

shell> sudo /opt/continuent/tungsten/cluster-home/bin/undeployall

For Multisite/Multimaster deployments specifically, there are separate cross-site replicators running. In this case, a custom startup script must be created; otherwise the replicator will be unable to start, as it has been configured in a different directory.

  1. Create a link from the Tungsten Replicator service startup script in the operating system startup directory (/etc/init.d):

    shell> sudo ln -s /opt/replicator/tungsten/tungsten-replicator/bin/replicator /etc/init.d/mmreplicator

  2. Stop the Tungsten Replicator process. Failure to do this will cause issues because the service will no longer recognize the existing PID file and will report that it is not running:

    shell> /etc/init.d/mmreplicator stop

  3. Modify the APP_NAME variable within the startup script (/etc/init.d/mmreplicator) to mmreplicator:

    APP_NAME="mmreplicator"

  4. Start the Tungsten Replicator:

    shell> /etc/init.d/mmreplicator start

  5. Update the operating system startup configuration to use the updated script.

    On Debian/Ubuntu:

    shell> sudo update-rc.d mmreplicator defaults

    On RedHat/CentOS:

    shell> sudo chkconfig --add mmreplicator

Click here for the full documentation

In future articles, we will continue to cover more advanced subjects of interest!

Questions? Contact Continuent

MySQL Shell 8.0.12 – storing MySQL passwords securely

MySQL Shell 8.0.12 introduces a new feature which allows users to store and automatically retrieve their MySQL account credentials. This enables them to seamlessly work with various servers without explicitly providing the password for each new connection and create secure unattended scripts which do not need to include plain text passwords.…

Introduction to Failover for MySQL Replication - the 101 Blog

You may have heard about the term “failover” in the context of MySQL replication.

Maybe you wondered what it is as you are starting your adventure with databases.

Maybe you know what it is but you are not sure about potential problems related to it and how they can be solved?

In this blog post we will try to give you an introduction to failover handling in MySQL & MariaDB.

We will discuss what the failover is, why it is unavoidable, what the difference is between failover and switchover. We will discuss the failover process in the most generic form. We will also touch a bit on different issues that you will have to deal with in relation to the failover process.

What does “failover” mean?

MySQL replication is a collection of nodes, each of which may serve one role at a time: it can be a master or a replica. There is only one master node at a given time. This node receives write traffic and replicates writes to its replicas.

As you can imagine, being the single point of entry for data into the replication cluster, the master node is quite important. What would happen if it failed and became unavailable?

This is quite a serious condition for a replication cluster: it cannot accept any writes for the moment. As you may expect, one of the replicas will have to take over the master’s tasks and start accepting writes. The rest of the replication topology may also have to change - remaining replicas should change their master from the old, failed node to the newly chosen one. This process of “promoting” a replica to become a master after the old master has failed is called “failover”.

On the other hand, “switchover” happens when the user triggers the promotion of a replica. A new master is promoted from a replica designated by the user, and the old master typically becomes a replica to the new master.

The most important difference between “failover” and “switchover” is the state of the old master. When a failover is performed, the old master is in some way not reachable. It may have crashed, or it may have suffered a network partition. It cannot be used at the moment and its state is typically unknown.

On the other hand, when a switchover is performed, the old master is alive and well. This has serious consequences. If a master is unreachable, it may mean that some of the data has not yet been sent to the slaves (unless semi-synchronous replication was used). Some of the data may have been corrupted or sent partially.

There are mechanisms in place to avoid propagating such corruptions to slaves, but the point is that some of the data may be lost in the process. On the other hand, while performing a switchover, the old master is available and data consistency is maintained.

Failover process

Let’s spend some time discussing what exactly the failover process looks like.

Master crash is detected

For starters, a master has to crash before the failover will be performed. Once it is not available, a failover is triggered. So far, it seems simple but the truth is, we are already on slippery ground.

First of all, how is master health tested? Is it tested from one location or are tests distributed? Does the failover management software just attempt to connect to the master or does it implement more advanced verifications before master failure is declared?

Let’s imagine the following topology:

We have a master and two replicas. We also have failover management software located on some external host. What would happen if the network connection between the host with the failover software and the master failed?

According to the failover management software, the master has crashed - there’s no connectivity to it. Still, the replication itself is working just fine. What should happen here is that the failover management software would try to connect to replicas and see what their point of view is.

Do they complain about a broken replication or are they happily replicating?

Things may become even more complex. What if we’d add a proxy (or a set of proxies)? It will be used to route traffic - writes to master and reads to replicas. What if a proxy cannot access the master? What if none of the proxies can access the master?

This means that the application cannot function under those conditions. Should the failover (actually, it would be more of a switchover as the master is technically alive) be triggered?

Technically, the master is alive but it cannot be used by the application. Here, the business logic has to come in and a decision has to be made.

Preventing old master from running

No matter how and why, if there is a decision to promote one of the replicas to become a new master, the old master has to be stopped and, ideally, it shouldn’t be able to start again.

How this can be achieved depends on the details of the particular environment; therefore this part of the failover process is commonly reinforced by external scripts integrated into the failover process through different hooks.

Those scripts can be designed to use tools available in the particular environment to stop the old master. It can be a CLI or API call that will stop a VM; it can be shell code that runs commands through some sort of “lights out management” device; it can be a script which sends SNMP traps to the Power Distribution Unit that disable the power outlets the old master is using (without electric power we can be sure it will not start again).

If the failover management software is part of a more complex product which also handles recovery of nodes (as is the case for ClusterControl), the old master may be marked as excluded from the recovery routines.

You may wonder why it is so important to prevent the old master from becoming available once more.

The main issue is that in replication setups, only one node can be used for writes. Typically you ensure that by enabling a read_only (and super_read_only, if applicable) variable on all replicas and keeping it disabled only on the master.

Once a new master is promoted, it will have read_only disabled. The problem is that, if the old master is unavailable, we cannot switch it back to read_only=1. If MySQL or the host crashed, this is not much of an issue, as good practice is to configure my.cnf with that setting so that, once MySQL starts, it always starts in read-only mode.
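That good practice can be sketched as a my.cnf fragment (super_read_only additionally blocks writes from accounts with the SUPER privilege, which plain read_only does not):

```ini
[mysqld]
# start every node read-only; the failover tooling flips these
# off only on the node it promotes to master
read_only       = ON
super_read_only = ON
```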

The problem shows when it’s not a crash but a network issue. The old master is still running with read_only disabled, it’s just not available. When the networks converge, you will end up with two writeable nodes. This may or may not be a problem. Some of the proxies use the read_only setting as an indicator whether a node is a master or a replica. Two masters showing up at the same time may result in a huge issue, as data is written to both hosts but replicas get only half of the write traffic (the part that hit the new master).

Sometimes it’s about hardcoded settings in some of the scripts which are configured to connect to a given host only. Normally they’d fail and someone would notice that the master has changed.

With the old master being available, they will happily connect to it and data discrepancy will arise. As you can see, making sure that the old master will not start is quite a high priority item.

Decide on a master candidate

The old master is down and it will not return from its grave; now it’s time to decide which host we should use as a new master. Usually there is more than one replica to pick from, so a decision has to be made. There are many reasons why one replica may be picked over another, therefore checks have to be performed.

Whitelists and blacklists

For starters, a team managing databases may have its reasons to pick one replica over another when deciding about a master candidate. Maybe it’s using weaker hardware or has some particular job assigned to it (that replica runs backup, analytic queries, developers have access to it and run custom, hand-made queries). Maybe it’s a test replica where a new version is undergoing acceptance tests before proceeding with the upgrade. Most failover management software supports white and blacklists, which can be utilized to precisely define which replicas should or cannot be used as master candidates.

Semi-synchronous replication

A replication setup may be a mix of asynchronous and semi-synchronous replicas. There’s a huge difference between them - a semi-synchronous replica is guaranteed to contain all of the events from the master. An asynchronous replica may not have received all the data, thus failing over to it may result in data loss. We would rather see semi-synchronous replicas promoted.
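For reference, a minimal sketch of enabling classic semi-synchronous replication with the MySQL 5.7/8.0 semisync plugins:

```sql
-- on the master
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled = 1;

-- on each replica that should acknowledge writes
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET GLOBAL rpl_semi_sync_slave_enabled = 1;
```

With this enabled, a commit on the master waits until at least one semi-sync replica acknowledges receipt of the events, or until a timeout expires and replication degrades to asynchronous.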

Replication lag

Even though a semi-synchronous replica will contain all the events, those events may still reside in relay logs only. With heavy traffic, all replicas, no matter if semi-sync or async, may lag.

The problem with replication lag is that, when you promote a replica, you should reset the replication settings so it will not attempt to connect to the old master. This will also remove all relay logs, even if they are not yet applied - which leads to data loss.

Even if you do not reset the replication settings, you still cannot open a new master to connections if it hasn’t applied all events from its relay log. Otherwise you risk that the new queries will affect transactions from the relay log, triggering all sorts of problems (for example, an application may remove some rows which are accessed by transactions from the relay log).

Taking all of this into consideration, the only safe option is to wait for the relay log to be applied. Still, it may take a while if the replica was lagging heavily. Decisions have to be made as to which replica would make a better master - asynchronous but with a small lag, or semi-synchronous but with a lag that would require a significant amount of time to apply.
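One way to check whether a candidate has applied its relay log is to compare the read and execute coordinates in its replication status (field names as in MySQL 5.7/8.0):

```sql
SHOW SLAVE STATUS\G
-- Seconds_Behind_Master                       : rough lag estimate
-- Master_Log_File / Read_Master_Log_Pos       : what has been received
-- Relay_Master_Log_File / Exec_Master_Log_Pos : what has been applied
-- the relay log is fully applied when the two pairs are equal
```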

Errant transactions

Even though replicas should not be written to, it still could happen that someone (or something) has written to one of them.

It may have been just a single transaction way in the past, but it still may have a serious effect on the ability to perform a failover. The issue is strictly related to Global Transaction ID (GTID), a feature which assigns a distinct ID to every transaction executed on a given MySQL node.

Nowadays it’s quite a popular setup as it brings great levels of flexibility and it allows for better performance (with multi-threaded replicas).

The issue is that, while re-slaving to a new master, GTID replication requires all events from that master (which have not been executed on replica) to be replicated to the replica.

Let’s consider the following scenario: at some point in the past, a write happened on a replica. It was a long time ago and this event has been purged from the replica’s binary logs. At some point a master has failed and the replica was appointed as a new master. All remaining replicas will be slaved off the new master. They will ask about transactions executed on the new master. It will respond with a list of GTIDs which came from the old master and the single GTID related to that old write. GTIDs from the old master are not a problem as all remaining replicas contain at least the majority of them (if not all) and all missing events should be recent enough to be available in the new master’s binary logs.

Worst case scenario, some missing events will be read from the binary logs and transferred to replicas. The issue is with that old write - it happened on the new master only, while it was still a replica, thus it does not exist on the remaining hosts. It is an old event, therefore there is no way to retrieve it from binary logs. As a result, none of the replicas will be able to slave off the new master. The only solution here is to take a manual action and inject an empty event with that problematic GTID on all replicas. It also means that, depending on what happened, the replicas may not be in sync with the new master.

As you can see, it is quite important to track errant transactions and determine if it is safe to promote a given replica to become a new master. If it contains errant transactions, it may not be the best option.
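A sketch of detecting an errant transaction and injecting the empty event described above (the GTID sets and the source UUID are hypothetical placeholders):

```sql
-- on the candidate: any GTIDs executed here but not on the master are errant
SELECT GTID_SUBTRACT(
    '<candidate gtid_executed>',   -- @@GLOBAL.gtid_executed on the candidate
    '<master gtid_executed>'       -- @@GLOBAL.gtid_executed on the master
) AS errant_gtids;
-- a non-empty result means errant transactions exist

-- on each host missing the errant transaction, inject an empty one
SET gtid_next = 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee:5';
BEGIN; COMMIT;
SET gtid_next = 'AUTOMATIC';
```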

Failover handling for the application

It is crucial to keep in mind that master switch, forced or not, does have an effect on the whole topology. Writes have to be redirected to a new node. This can be done in multiple ways and it is critical to ensure that this change is as transparent to the application as possible. In this section we will take a look at some of the examples of how the failover can be made transparent to the application.


DNS

One of the ways in which an application can be pointed to a master is by utilizing DNS entries. With low TTL it is possible to change the IP address to which a DNS entry such as ‘’ points. Such a change can be done through external scripts executed during the failover process.

Service discovery

Tools like Consul or etcd can also be used for directing traffic to the correct location. Such tools may contain information that the current master’s IP is set to some value. Some of them also give the ability to use hostname lookups to point to a correct IP. Again, entries in service discovery tools have to be maintained, and one of the ways to do that is to make those changes during the failover process, using hooks executed at different stages of the failover.

Proxy
Proxies may also be used as a source of truth about topology. Generally speaking, no matter how they discover the topology (it can be either an automatic process or the proxy has to be reconfigured when the topology changes), they should contain the current state of the replication chain as otherwise they wouldn’t be able to route queries correctly.

Using a proxy as a source of truth is quite common in conjunction with collocating proxies on application hosts. There are numerous advantages to collocating proxy and web servers: fast and secure communication using a Unix socket, and keeping a caching layer (as some of the proxies, like ProxySQL, can also do caching) close to the application. In such a case, it makes sense for the application to just connect to the proxy and assume it will route queries correctly.

Failover in ClusterControl

ClusterControl applies industry best practices to make sure that the failover process is performed correctly. It also ensures that the process will be safe - default settings are intended to abort the failover if possible issues are detected. Those settings can be overridden by the user should they want to prioritize failover over data safety.

Once a master failure has been detected by ClusterControl, a failover process is initiated and the first failover hook is immediately executed:

Next, master availability is tested.

ClusterControl performs extensive tests to make sure the master is indeed unavailable. This behavior is enabled by default and is managed by the following variable:

replication_check_external_bf_failover
Before attempting a failover, perform extended checks by checking the slave status to detect if the master is truly down, and also check if ProxySQL (if installed) can still see the master. If the master is detected to be functioning, then no failover will be performed. Default is 1, meaning the checks are enabled.
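The logic of these extended checks can be sketched roughly as follows; this is a simplification for illustration, not ClusterControl's actual implementation:

```python
# Rough model of the extended "is the master really down?" checks.
def master_confirmed_down(replica_io_running, proxysql_sees_master):
    """Decide whether failover may proceed.

    replica_io_running: one boolean per replica, True if that replica's
        IO thread is still connected to the master.
    proxysql_sees_master: True/False from the proxy check, or None when
        ProxySQL is not installed.
    """
    if any(replica_io_running):
        return False  # at least one replica still reaches the master
    if proxysql_sees_master:
        return False  # the proxy still reaches the master
    return True

print(master_confirmed_down([False, False], False))  # failover may proceed
print(master_confirmed_down([False, True], False))   # master still visible
```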

As a following step, ClusterControl ensures that the old master is down and, if not, that ClusterControl will not attempt to recover it:

The next step is to determine which host can be used as a master candidate. ClusterControl checks whether a whitelist or a blacklist is defined.

You can do that by using the following variables in the cmon configuration file:

replication_failover_blacklist
Comma separated list of hostname:port pairs. Blacklisted servers will not be considered as a candidate during failover. replication_failover_blacklist is ignored if replication_failover_whitelist is set.

replication_failover_whitelist
Comma separated list of hostname:port pairs. Only whitelisted servers will be considered as a candidate during failover. If no server on the whitelist is available (up/connected), the failover will fail. replication_failover_blacklist is ignored if replication_failover_whitelist is set.
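The candidate filtering these settings describe can be sketched like this (an illustrative helper, not ClusterControl code):

```python
# Illustrative helper implementing the whitelist/blacklist rules quoted above.
def failover_candidates(replicas, whitelist=None, blacklist=None):
    """Return the hosts eligible for promotion.

    When a whitelist is set, only whitelisted hosts qualify and the
    blacklist is ignored; otherwise every host not on the blacklist
    qualifies.
    """
    if whitelist:
        return [r for r in replicas if r in whitelist]
    blacklist = blacklist or []
    return [r for r in replicas if r not in blacklist]

replicas = ["db2:3306", "db3:3306", "db4:3306"]
print(failover_candidates(replicas, blacklist=["db4:3306"]))
# The blacklist entry is ignored because a whitelist is set:
print(failover_candidates(replicas, whitelist=["db3:3306"], blacklist=["db3:3306"]))
```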

It is also possible to configure ClusterControl to look for differences in binary log filters across all replicas, using the replication_check_binlog_filtration_bf_failover variable. By default, those checks are disabled. ClusterControl also verifies there are no errant transactions in place, which could cause issues.

You can also ask ClusterControl to automatically rebuild replicas which cannot replicate from the new master, using the following setting in the cmon configuration file:

* replication_auto_rebuild_slave: If the SQL thread is stopped and the error code is non-zero, then the slave will be automatically rebuilt. 1 means enable, 0 means disable (default).

Afterwards, a second script is executed: it is defined in the replication_pre_failover_script setting. Next, the candidate undergoes a preparation process.

ClusterControl waits for relay logs to be applied (ensuring that data loss is minimal). It also checks if there are other transactions available on the remaining replicas which have not been applied to the master candidate. Both behaviors can be controlled by the user, using the following settings in the cmon configuration file:

replication_skip_apply_missing_txs
Force failover/switchover by skipping applying transactions from other slaves. Default disabled. 1 means enabled.

replication_failover_wait_to_apply_timeout
Candidate waits up to this many seconds to apply outstanding relay log (retrieved_gtids) before failing over. Default -1 seconds (wait forever). 0 means failover immediately.
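The timeout semantics (-1 wait forever, 0 fail over immediately, N wait up to N seconds) can be modeled as below; the helper name and polling approach are illustrative:

```python
import time

# Toy model of the replication_failover_wait_to_apply_timeout semantics:
# -1 waits forever, 0 proceeds immediately, N > 0 waits up to N seconds.
def wait_for_relay_log(applied, timeout_s, poll_interval=0.01):
    """Wait for the candidate to apply its outstanding relay log.

    applied: callable returning True once retrieved_gtids are all applied.
    Returns True if the candidate is fully caught up when the failover
    proceeds; with timeout 0 the failover proceeds immediately either way.
    """
    if timeout_s == 0:
        return applied()
    deadline = None if timeout_s < 0 else time.monotonic() + timeout_s
    while not applied():
        if deadline is not None and time.monotonic() >= deadline:
            return False  # timed out: failover proceeds with events missing
        time.sleep(poll_interval)
    return True
```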

As you can see, you can force a failover even though not all of the relay log events have been applied. This lets the user decide which has the higher priority: data consistency or failover velocity.

Finally, the master is elected and the last script is executed (a script which can be defined as replication_post_failover_script).

If you haven’t tried ClusterControl yet, I encourage you to download it (it’s free) and give it a go.

Master detection in ClusterControl

ClusterControl gives you the ability to deploy a full high availability stack, including the database and proxy layers. Master discovery is always one of the issues to deal with.

How does it work in ClusterControl?

A high availability stack, deployed through ClusterControl, consists of three pieces:

  • database layer
  • proxy layer which can be HAProxy or ProxySQL
  • keepalived layer, which, with use of Virtual IP, ensures high availability of the proxy layer

Proxies rely on the read_only variable on the nodes.

As you can see in the screenshot above, only one node in the topology is marked as “writable”. This is the master and this is the only node which will receive writes.

A proxy (in this example, ProxySQL) will monitor this variable and it will reconfigure itself automatically.

On the other side of that equation, ClusterControl takes care of topology changes: failovers and switchovers. It makes the necessary changes to the read_only value to reflect the state of the topology after the change. If a new master is elected after the failover, read_only is disabled on it, and it becomes the only writable node.

On top of the proxy layer, keepalived is deployed. It manages a virtual IP (VIP) and monitors the state of the underlying proxy nodes. The VIP points to one proxy node at a given time. If that node goes down, the virtual IP is redirected to another node, ensuring that traffic directed to the VIP will reach a healthy proxy node.

To sum it up: an application connects to the database using a virtual IP address. This IP points to one of the proxies. The proxies redirect traffic according to the topology structure. Information about the topology is derived from the read_only state. This variable is managed by ClusterControl and is set based on topology changes the user requested or ClusterControl performed automatically.
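A minimal sketch of the write-routing rule described above: the proxy treats the single node with read_only disabled as the writer (host names are illustrative):

```python
# Minimal sketch of the routing rule: exactly one node has read_only
# disabled, and that node receives the writes.
def pick_writer(nodes):
    """nodes maps hostname -> read_only flag; return the writable node."""
    writers = [host for host, read_only in nodes.items() if not read_only]
    if len(writers) != 1:
        raise RuntimeError("expected exactly one writable node, got %r" % writers)
    return writers[0]

topology = {"db1": False, "db2": True, "db3": True}  # db1 is the master
print(pick_writer(topology))
```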


InnoDB physical files on MySQL 8.0

Introduction –

MySQL 8.0.10 GPL came out with many changes and advanced features, including changes to the InnoDB physical file layout (the MySQL data directory). This blog provides information about the MySQL 8 InnoDB physical files.

MySQL system tables are completely InnoDB now?

Yes. Previously, we did not have many InnoDB tables among the MySQL system tables: in MySQL 5.7, only innodb_index_stats, innodb_table_stats, slave_master_info, slave_relay_log_info and slave_worker_info were InnoDB. From MySQL 8.0, all the MySQL system tables are InnoDB, which helps enable transactional DDL.

Below is the list of InnoDB physical files in MySQL 8.0:

  • ibdata1
  • .ibd
  • .SDI
  • undo_001 & undo_002
  • ib_logfile0
  • ib_logfile1
  • ib_buffer_pool
  • ibtmp1

MySQL 8.0 InnoDB Disk Layout Architecture –

 ibdata1 (Shared Tablespace)  

ibdata1 is the shared tablespace (a tablespace common to all tables). It contains the change buffer, the MySQL system tables and the doublewrite buffer. The undo logs were also part of ibdata1 prior to MySQL 8.0, but they have been moved out now (undo_001 & undo_002). Without innodb_file_per_table, all table data and index pages are stored in ibdata1. Unlike individual tablespaces (.ibd), ibdata1 is hard to shrink: we need to perform a complete rebuild with mysqldump / mydumper / mysqlpump (a logical rebuild).

.ibd (Individual tablespace)

Each table has its own tablespace file named table_name.ibd. It was introduced in MySQL 5.1. We needed to manually enable the innodb_file_per_table variable for individual tablespaces until MySQL 5.5; from MySQL 5.6 it is enabled by default. The file contains the table's data and index pages.

Individual tablespaces bring more benefits to the database than disadvantages. In the case of partitioning, each partition gets its own .ibd file, and the SDI files take care of the partition details.

SDI (Serialized Dictionary Information)

The SDI file provides additional information about table metadata and table partition details. The file is in JSON format. It replaces the older .frm, .par, .trg and .trn files:

.par – file for partition structure
.frm – file for table structure
.trg & .trn – files for triggers

Below are some of the important details we can get from SDI files:

  • Table schema
  • Table name
  • Partitions
  • Collation
  • Foreign key
SDI file format for the table general_log –

Table Structure –

mysql> show create table general_log\G
*************************** 1. row ***************************
       Table: general_log
Create Table: CREATE TABLE `general_log` (
  `event_time` timestamp(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6) ON UPDATE CURRENT_TIMESTAMP(6),
  `user_host` mediumtext NOT NULL,
  `thread_id` bigint(21) unsigned NOT NULL,
  `server_id` int(10) unsigned NOT NULL,
  `command_type` varchar(64) NOT NULL,
  `argument` mediumblob NOT NULL
) ENGINE=CSV DEFAULT CHARSET=utf8 COMMENT='General log'
1 row in set (0.00 sec)

SDI Structure –

cat general_log_195.sdi
{"mysqld_version_id":80011,"dd_version":80011,"sdi_version":1,"dd_object_type":"Table","dd_object":{"name":"general_log","mysql_version_id":80011,"created":20180723063147,"last_altered":20180723063147,"hidden":1,"options":"avg_row_length=0;key_block_size=0;keys_disabled=0;pack_record=1;stats_auto_recalc=0;stats_sample_pages=0;","columns":[{"name":"event_time","type":18,"is_nullable":false,"is_zerofill":false,"is_unsigned":false,"is_auto_increment":false,"is_virtual":false,"hidden":1,"ordinal_position":1,"char_length":26,"numeric_precision":0,"numeric_scale":0,"numeric_scale_null":true,"datetime_precision":6,"datetime_precision_null":0,"has_no_default":false,"default_value_null":false,"srs_id_null":true,"srs_id":0,"default_value":"AAAAAAAAAA==","default_value_utf8_null":false,"default_value_utf8":"CURRENT_TIMESTAMP(6)","default_option":"CURRENT_TIMESTAMP(6)","update_option":"CURRENT_TIMESTAMP(6)","comment":"","generation_expression":"","generation_expression_utf8":"","options":"interval_count=0;","se_private_data":"","column_key":1,"column_type_utf8":"timestamp(6)","elements":[],"collation_id":8,"is_explicit_collation":false},{"name":"user_host","type":25,"is_nullable":false,"is_zerofill":false,"is_unsigned":false,"is_auto_increment":false,"is_virtual":false,"hidden":1,"ordinal_position":2,"char_length":8,"numeric_precision":0,"numeric_scale":
0,"numeric_scale_null":true,"datetime_precision":0,"datetime_precision_null":1,"has_no_default":true,"default_value_null":false,"srs_id_null":true,"srs_id":0,"default_value":"AAAAAAAAAAAAAAA=","default_value_utf8_null":true,"default_value_utf8":"","default_option":"","update_option":"","comment":"","generation_expression":"","generation_expression_utf8":"","options":"interval_count=0;","se_private_data":"","column_key":1,"column_type_utf8":"mediumtext","elements":[],"collation_id":33,"is_explicit_collation":false},{"name":"thread_id","type":9,"is_nullable":false,"is_zerofill":false,"is_unsigned":true,"is_auto_increment":false,"is_virtual":false,"hidden":1,"ordinal_position":3,"char_length":21,"numeric_precision":20,"numeric_scale":0,"numeric_scale_null":false,"datetime_precision":0,"datetime_precision_null":1,"has_no_default":true,"default_value_null":false,"srs_id_null":true,"srs_id":0,"default_value":"AAAAAAAAAAA=","default_value_utf8_null":true,"default_value_utf8":"","default_option":"","update_option":"","comment":"","generation_expression":"","generation_expression_utf8":"","options":"interval_count=0;","se_private_data":"","column_key":1,"column_type_utf8":"bigint(21) unsigned","elements":[],"collation_id":33,"is_explicit_collation":false},{"name":"server_id","type":4,"is_nullable":false,"is_zerofill":false,"is_unsigned":true,"is_auto_increment":false,"is_virtual":false,"hidden":1,"ordinal_position":4,"char_length":10,"numeric_precision":10,"numeric_scale":0,"numeric_scale_null":false,"datetime_precision":0,"datetime_precision_null":1,"has_no_default":true,"default_value_null":false,"srs_id_null":true,"srs_id":0,"default_value":"AAAAAA==","default_value_utf8_null":true,"default_value_utf8":"","default_option":"","update_option":"","comment":"","generation_expression":"","generation_expression_utf8":"","options":"interval_count=0;","se_private_data":"","column_key":1,"column_type_utf8":"int(10) 
unsigned","elements":[],"collation_id":33,"is_explicit_collation":false},{"name":"command_type","type":16,"is_nullable":false,"is_zerofill":false,"is_unsigned":false,"is_auto_increment":false,"is_virtual":false,"hidden":1,"ordinal_position":5,"char_length":192,"numeric_precision":0,"numeric_scale":0,"numeric_scale_null":true,"datetime_precision":0,"datetime_precision_null":1,"has_no_default":true,"default_value_null":false,"srs_id_null":true,"srs_id":0,"default_value":"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA\nAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA\nAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA\nAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA==","default_value_utf8_null":true,"default_value_utf8":"","default_option":"","update_option":"","comment":"","generation_expression":"","generation_expression_utf8":"","options":"interval_count=0;","se_private_data":"","column_key":1,"column_type_utf8":"varchar(64)","elements":[],"collation_id":33,"is_explicit_collation":false},{"name":"argument","type":25,"is_nullable":false,"is_zerofill":false,"is_unsigned":false,"is_auto_increment":false,"is_virtual":false,"hidden":1,"ordinal_position":6,"char_length":8,"numeric_precision":0,"numeric_scale":0,"numeric_scale_null":true,"datetime_precision":0,"datetime_precision_null":1,"has_no_default":true,"default_value_null":false,"srs_id_null":true,"srs_id":0,"default_value":"AAAAAAAAAAAAAAA=","default_value_utf8_null":true,"default_value_utf8":"","default_option":"","update_option":"","comment":"","generation_expression":"","generation_expression_utf8":"","options":"interval_count=0;","se_private_data":"","column_key":1,"column_type_utf8":"mediumblob","elements":[],"collation_id":63,"is_explicit_collation":true}],"schema_ref":"mysql","se_private_id":18446744073709551615,"engine":"CSV","comment":"General 
log","se_private_data":"","row_format":2,"partition_type":0,"partition_expression":"","partition_expression_utf8":"","default_partitioning":0,"subpartition_type":0,"subpartition_expression":"","subpartition_expression_utf8":"","default_subpartitioning":0,"indexes":[],"foreign_keys":[],"partitions":[],"collation_id":33}}

undo_ ( UNDO log files )

From MySQL 8.0, undo log files are separated from the system tablespace (ibdata1) by default and located in the data directory. Before MySQL 8.0, we needed to manually set the innodb_undo_tablespaces variable (introduced in MySQL 5.7) to separate the undo tablespace.

undo_001, undo_002 – Undo files are used to undo transactions when a transaction needs a ROLLBACK.

ib_logfile (Redo log files)

MySQL maintains the redo log files for crash recovery. We can resize the redo log files using innodb_log_file_size, and define the number of files using the variable innodb_log_files_in_group. By default there are 2 redo log files in the group. The log files are written sequentially; the log buffer feeds the data to the redo log files.

ib_logfile0, ib_logfile1 – changes buffered in the redo log buffer are written into the redo log files (on disk). This data is what makes recovery possible if a MySQL crash happens.

With innodb_flush_log_at_trx_commit we can define how the redo log files are written. Setting innodb_flush_log_at_trx_commit = 1 is transaction safe and the most ACID compliant.

ib_buffer_pool (buffer pool dump )

MySQL maintains the frequently accessed data and index pages in the buffer pool. When shutting down MySQL, all the data held in memory is lost, as it is volatile. So there is a performance degradation after a restart, until the buffer pool is warmed up.

From MySQL 5.6, we have the variables innodb_buffer_pool_dump_at_shutdown and innodb_buffer_pool_load_at_startup to dump and load the memory pages across a restart.

Also, innodb_buffer_pool_dump_pct defines the percentage of memory pages to dump and load. We can also dump the memory pages at any time using the variable innodb_buffer_pool_dump_now.

The ib_buffer_pool file contains the tablespace ID and the page ID of each dumped page (tablespace ID, page ID).

ib_buffer_pool Structure –

cat ib_buffer_pool | head -n5
432,3
432,1
430,3
430,1
428,3

These pages in the dump file are loaded back from disk to buffer pool while starting the MySQL again.
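Since the format is simply one space_id,page_id pair per line, the dump can be parsed with a few lines of Python (an illustrative sketch):

```python
# Illustrative parser for the dump format: one "space_id,page_id" pair per line.
def parse_buffer_pool_dump(text):
    pages = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        space_id, page_id = line.split(",")
        pages.append((int(space_id), int(page_id)))
    return pages

sample = "432,3\n432,1\n430,3\n430,1\n428,3\n"
print(parse_buffer_pool_dump(sample))
```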

ibtmp1 (Innodb temporary table space )

The MySQL community introduced the new file ibtmp1 (from MySQL 5.7) for storing InnoDB internal temporary tables. This is a shared tablespace for temporary tables created on disk.

We can define the file path and size using the variable innodb_temp_data_file_path, and the directory using innodb_tmpdir.

ibtmp1 – InnoDB temporary table file used for on-disk temporary tables during SELECTs.

Conclusion –

In this blog I have given information about each of the InnoDB files located in the MySQL 8.0 physical data directory.

Image Courtesy : Photo by chuttersnap on Unsplash

How Bluefin ensures 24/7/365 operation and application availability for PayConex and Decryptx with Continuent Clustering

Join MC Brown, VP Products at Continuent, on August 8th for our new webinar on high availability and disaster recovery for MySQL, MariaDB and Percona Server with Continuent Clustering. Learn how Bluefin Payment Systems provides 24/7/365 operation and application availability for their PayConex payment gateway and Decryptx decryption-as-a-service, essential for Point-Of-Sale solutions in retail, mobile, call centers and kiosks.

We’ll discuss why Bluefin uses Continuent Clustering, and how Bluefin runs two co-located data centers with multimaster replication between each cluster in each data center, with full fail-over within the cluster and between clusters, handling 350 million records each month.

Tune in Wednesday, August 8th @ 10 am PT/ pm ET. Save your spot at .

Tiered or leveled compaction, why not both via adaptive compaction?

First there was leveled compaction and it was good, but it took a while for implementations to become popular. Then there was (size) tiered compaction and it was good too, but more confusing given the diversity in strategies for picking files to compact. RocksDB didn't help with the confusion by calling it universal compaction. Eventually compaction algorithms optimized for time series were added (see DTCS for Cassandra). Finally, Kudu and InfluxDB have specialized compaction algorithms that are also worth understanding.

This post is about adaptive compaction which is yet another compaction algorithm. The summary for adaptive compaction is:
  • LSM file structure is orthogonal to the use of tiered or leveled compaction. A given LSM instance can be viewed as using tiered and leveled. It just depends on how you look at it.
  • Some levels in the LSM tree can be read optimized while others can be write optimized.
  • Each compaction step makes cost-based decisions to change a level between read and write optimized states, or remain as is. Compaction steps can also change the per-level fanout (thanks to Siying and Andrew for that suggestion).
  • Adaptive compaction can be configured to do strictly tiered or strictly leveled. So it can even adapt to become non-adaptive.

This is just an idea today. It has not been implemented. I think it will be good for RocksDB, but validation of ideas takes time.

Drawing the LSM file structure
The LSM file structure with leveled compaction is usually depicted vertically. The example below is from leveled compaction with 3 levels, per-level fanout set to 4, 4 SSTs in L1, 16 SSTs in L2 and 64 SSTs in L3. I ignore level 0 to simplify the picture.

With tiered compaction the LSM files are usually depicted horizontally as a sequence of sorted runs. The compaction algorithm merges 2 or more adjacent sorted runs at a time but it is hard to reason about the write amplification that will occur with such merging. Below each sorted run is range partitioned with 4, 16 and 64 partitions. Each partition is an SST.

File Structure vs Compaction

The key point here is that the LSM file structure is orthogonal to the use of tiered or leveled compaction. For the examples above the same LSM files are depicted vertically for leveled compaction and horizontally for tiered compaction.

The same files from an LSM instance can be used with tiered or leveled compaction. An instance using leveled compaction can be stopped and switched to use tiered compaction. An instance using tiered compaction can be stopped and switched to use leveled compaction. Metadata might have to be changed during the switch but otherwise the switch is trivial (allow me to wave hands).

To make this claim I ignore the work done in leveled with RocksDB to limit overlap between SSTs on adjacent levels. That is useful when incremental compaction is done to compact 1 SST from level N with the ~fanout SSTs from level N+1 as done by LevelDB and RocksDB. Limiting overlap isn't needed with the classic LSM approach because compaction between levels is all-to-all rather than incremental.

To transform from leveled to tiered assume that with tiered compaction the LSM structure is a sequence of sorted runs and each sorted run is 1+ SSTs. Then the N SSTs in the L0 become the first N sorted runs in the tiered sequence. They are followed by the sorted runs from L1 to Lmax. In this case the large sorted runs from L1 to Lmax are range partitioned (split into many SSTs) in the tiered sequence.

To transform from tiered to leveled the sorted runs from the prefix of the tiered sequence that are each a single SST become the SSTs in L0. Each of the remaining sorted runs becomes a level from L1 to Lmax. This requires that large sorted runs are range partitioned into many SSTs with tiered compaction.

Next is a specific example for transforming from leveled to tiered with fanout=8. The LSM with leveled has:
  • 4 SSTs in the L0 named L0.1, L0.2, L0.3 and L0.4
  • L1 is partitioned into 4 SSTs: L1.1 to L1.4
  • L2 is partitioned into 32 SSTs: L2.1 to L2.32
  • L3 is partitioned into 256 SSTs: L3.1 to L3.256
That uses 7 sorted runs with tiered compaction. With tiered the sequence of sorted runs is:
  • sorted runs 1 to 4 are L0.1, L0.2, L0.3, L0.4
  • sorted run 5 is L1.1 to L1.4 (range partitioned)
  • sorted run 6 is L2.1 to L2.32 (range partitioned)
  • sorted run 7 is L3.1 to L3.256 (range partitioned)
A similar transformation can be done to go from tiered back to leveled. Assume the LSM with tiered compaction uses the file structure from above with 7 sorted runs, the first 4 are each one SST and then runs 5, 6 and 7 are each range partitioned into many SSTs. This can be viewed as an LSM with leveled compaction where runs 1 to 4 are in the L0 and runs 5, 6 and 7 become levels 1, 2 and 3. As noted elsewhere this ignores the work done to limit overlap between adjacent levels with leveled and RocksDB.
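The two transformations can be expressed directly in code. This is only a sketch of the mapping described above, with a sorted run modeled as a list of SST names:

```python
# Sketch of the two mappings described above. The LSM is either
# (L0 SSTs, levels) for leveled or a sequence of sorted runs for tiered.
def leveled_to_tiered(l0_ssts, levels):
    """Each L0 SST becomes its own sorted run, then each level L1..Lmax
    becomes one range-partitioned sorted run."""
    runs = [[sst] for sst in l0_ssts]
    runs += [list(level) for level in levels]
    return runs

def tiered_to_leveled(runs):
    """The prefix of single-SST runs becomes L0; each remaining run
    becomes a level (assumes L1..Lmax runs hold more than one SST)."""
    i = 0
    while i < len(runs) and len(runs[i]) == 1:
        i += 1
    return [run[0] for run in runs[:i]], [list(run) for run in runs[i:]]

l0 = ["L0.1", "L0.2", "L0.3", "L0.4"]
levels = [["L1.%d" % i for i in range(1, 5)],
          ["L2.%d" % i for i in range(1, 33)],
          ["L3.%d" % i for i in range(1, 257)]]
runs = leveled_to_tiered(l0, levels)
print(len(runs))  # 7 sorted runs, matching the example above
```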

Vertical Depiction of Tiered Compaction

More recently I have seen the vertical depiction used to explain write amplification for tiered compaction (thanks DASLab for doing this). But that has unstated assumptions about how sorted runs are selected for compaction. With such assumptions the number of levels in the LSM tree is the same whether leveled or tiered is used, but the write-amplification is different. The number of levels is log_fanout(R) where R is size(Lmax) / size(L1). With leveled the worst-case per-level write-amp is equal to fanout and with tiered it is 1. The unstated assumptions are:
  1. Sorted runs to be merged all come from the same level.
  2. The fanout determines the number of sorted runs per level. When fanout=8 then each level is full with 8 sorted runs and a merge is started.
  3. When each level is full then each level has fanout times more data than the previous level.
  4. The output of a merge is written to the next larger level. Perhaps this rule can be broken when the size of the output is not large enough. For example assume the fanout is 8, the target size for sorted runs on L1 is 100 and the target size for sorted runs on the L2 is 800. When 8 sorted runs from L1 are merged, on which level should the output be written if the output has size 100 or size 200?
  5. The max level has fanout sorted runs, which implies that space amplification is fanout which is too large for me. I prefer to limit the max level to 1 sorted run which also increases the total write-amp. The space-amp can be further reduced by increasing the fanout between the next-to-max and max levels. I am curious whether existing LSMs can be configured to limit the max level to 1 sorted run to limit the worst-case space-amplification to 2 (ignoring transient space-amp during compaction) and a recent paper, Dostoevsky by Dayan and Idreos, claims they cannot.
The example below is a vertical depiction of tiered compaction with 4 sorted runs per level. If fanout=4 then each level is full. The sorted runs in levels 1, 2 and 3 are L1.r1 to L1.r4, L2.r1 to L2.r4 and L3.r1 to L3.r4. Each sorted run can be an SST or a sequence of range partitioned SSTs. A sorted run in level N+1 is approximately fanout times larger than a sorted run in level N. The size of the boxes below don't imply the size of the sorted runs (my drawing skills are limited).
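The formulas above can be checked numerically. This sketch just evaluates log_fanout(R) and the worst-case write-amp under those assumptions, with illustrative values for fanout and R:

```python
import math

# Numeric check of the formulas above: the number of levels is
# log_fanout(R) with R = size(Lmax) / size(L1); worst-case per-level
# write-amp is fanout for leveled and 1 for tiered.
fanout = 8
R = 64  # Lmax holds 64x more data than L1
levels = math.log(R, fanout)
total_write_amp_leveled = levels * fanout  # fanout per level
total_write_amp_tiered = levels * 1        # 1 per level
print(levels, total_write_amp_leveled, total_write_amp_tiered)
```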

Adaptive Compaction

A quick summary for adaptive compaction is that it uses the vertical depiction for tiered but each level can have a different target for the number of sorted runs -- from 1 to fanout.

First, let me show the LSM tree when the max level is constrained to one sorted run so that the worst-case space-amplification is <= 2, ignoring temp space during compaction. Each level has <= fanout sorted runs. A sorted run is either an SST or range partitioned into many SSTs. A level with 2+ sorted runs is write optimized. A level with 0 or 1 sorted runs is read optimized. The size of the boxes below don't imply the size of the sorted runs (my drawing skills are limited).

Adaptive compaction:
  1. Uses the vertical depiction of the LSM tree to constrain the compaction steps that can occur.
  2. Makes cost-based decisions during compaction to make the LSM tree more efficient for both the short-term and long-term workload. This is done one compaction step at a time. This is called adaptive compaction because it adapts the shape of the LSM tree to the workload.
  3. The decisions are 1) whether a level should be tiered or leveled and 2) the per-level fanout for the level. When a level is tiered then the per-level fanout determines the number of sorted runs on that level. When a level is leveled then the per-level fanout determines the size ratio between it and adjacent levels.
  4. Decisions respect constraints including the maximum space-amplification (both temporary — during compaction, and permanent — after compaction), write-amplification, number of levels and number of sorted runs. Constraints allow limits for the worst-case read and write efficiency. A possible constraint is the max number of sorted runs for the max level. Constraints can also include hints that optimization should favor point reads, range reads or writes.
Compaction is a sequence of compaction steps and at each compaction step adaptive compaction makes the LSM tree more or less read optimized to be more efficient for both the current and long-term workload. Note that workloads can have daily cycles so that optimizing for the current workload during daily backup or bulk load might not be the best decision when things return to normal in a few hours.

There are four types of compaction steps that can be scheduled. Some make the LSM tree more read-optimized, some make the LSM more write-optimized.
  1. tiered - this merges 2+ sorted runs from level N and writes the output to level N or level N+1 depending on the size of the output. Regardless, level N becomes read optimized. Level N+1 remains or becomes write optimized.
  2. tiered+leveled - this merges 2+ sorted runs from level N with 1 sorted run from level N+1 and writes the output to level N+1. For now I assume this cannot be used when level N+1 has more than 1 sorted run. When the compaction step finishes level N+1 remains read optimized and level N becomes read optimized.
  3. leveled.part - this merges one SST from a sorted run on level N with overlapping SSTs from a sorted run on level N+1. This cannot be used when levels N or N+1 have more than one sorted run. This leaves levels N and N+1 as read optimized. For now I ignore the details of minimizing overlaps between a sorted run in level N and sorted runs in level N+1.
  4. leveled.all - this merges the sorted run in level N with the sorted run in level N+1 and writes the output to level N+1. This cannot be used when levels N or N+1 have more than one sorted run. This leaves levels N and N+1 as read optimized. Unlike leveled.part this doesn't require minimizing overlap.
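Since this is an unimplemented idea, the following is only a toy model of the first two step types, tracking nothing but the number of sorted runs per level, as the list above describes:

```python
# Toy model of the tiered and tiered+leveled compaction steps, tracking
# only the number of sorted runs per level (a sketch of an unimplemented
# idea, not RocksDB code). Index 0 is L1.
def read_optimized(run_count):
    return run_count <= 1

def step_tiered(levels, n, to_next):
    """Merge the 2+ runs on level n; write the output to level n
    (to_next=False) or to level n+1 (to_next=True). Level n becomes
    read optimized either way."""
    assert levels[n] >= 2
    if to_next:
        levels[n] = 0
        levels[n + 1] += 1
    else:
        levels[n] = 1

def step_tiered_leveled(levels, n):
    """Merge level n's runs with the single run on level n+1; the output
    goes to n+1, which stays read optimized."""
    assert levels[n] >= 2 and levels[n + 1] == 1
    levels[n] = 0

run_counts = [4, 4, 1, 1]  # runs on L1..L4
step_tiered_leveled(run_counts, 1)  # L2 becomes read optimized
print(run_counts)  # [4, 0, 1, 1]
```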
Examples of adaptive compaction
The smaller levels of the LSM tree have the most churn. Therefore they adapt faster as the current workload changes. They can also be fixed faster when the workload shifts from read to write heavy. Optimizations of persistent data structures have risk — which is the time to undo those optimizations when things change. There is much more risk for optimizations done for Lmax than for L1.

A simple example of adaptive compaction is changing L1 and then L2 from having one sorted run to many sorted runs when the current workload shifts from read to write heavy. Assume that the LSM tree starts out with all levels read-optimized (one sorted run per level) and has been using either leveled.part or leveled.all compaction steps. Compaction to level 1 is special, I have yet to mention it but assume there is a write buffer and leveled.part compaction steps have been used to merge it with level 1.

Then there is a burst of writes and L1 switches from read to write optimized with 4 sorted runs. Write buffer flushes stop using leveled.all and just write new sorted runs into L1. Compaction steps into L2, L3 and L4 continue using leveled.all or leveled.part.

Then the burst of writes continues and L2 switches from read to write optimized with 4 sorted runs. Write buffer flushes continue to create new sorted runs in L1. Compaction into L2 begins to use tiered compaction steps.

Eventually the write burst ends and L2 changes from write to read optimized by using a tiered+leveled compaction step from L1 to L2. The result is 1 sorted run in L2 and 0 sorted runs in L1. Then a write buffer flush creates a sorted run in L1.

Open Problems
I am sure there are many; this section lists two.
One interesting problem is to optimize when there is skew in the workload. The workload is usually not uniform across the key space. Some ranges of the key space might need optimization for writes and point queries. Another range might benefit from optimization for range queries. The work proposed above doesn't handle this any better than leveled or tiered compaction today.

In some cases we can isolate different workloads to different column families, and with MyRocks that means a column family per index. But that still doesn't handle the case where the workload has variance within one index. Still, it is a first step.

Tuning InnoDB Primary Keys

The choice of good InnoDB primary keys is a critical performance tuning decision. This post will guide you through the steps of choosing the best primary key depending on your workload.

As a principal architect at Percona, one of my main duties is to tune customer databases. There are many aspects related to performance tuning which make the job complex and very interesting. In this post, I want to discuss one of the most important ones: the choice of good InnoDB primary keys. You would be surprised how many times I have had to explain the importance of primary keys, and how many debates I have had around the topic, as people often have preconceived ideas that translate into doing things a certain way without further thinking.

The choice of a good primary key for an InnoDB table is extremely important and can have a huge performance impact. I once started working with a customer using an overloaded x1.16xlarge RDS instance, with close to 1TB of RAM; after putting a new primary key in place, they ended up doing very well on an r4.4xlarge instance. That's a huge impact. Of course, it is not a silver bullet: you need to have a workload like the ones I'll highlight in the following sections. Keep in mind that tuning comes with trade-offs, especially with the primary key. What you gain somewhere, you have to pay for, performance-wise, elsewhere. You need to determine what is best for your workload.

What is special about InnoDB primary keys?

InnoDB is called an index-organized storage engine: it uses the B-Tree of the primary key to store the data, the table rows. That means a primary key is mandatory with InnoDB. If there is no primary key on a table, InnoDB adds a hidden auto-incremented 6-byte counter to the table and uses that hidden counter as the primary key. There are some issues with the InnoDB hidden primary key, so you should always define explicit primary keys on your tables. In summary, you access all InnoDB rows by primary key value.

An InnoDB secondary index is also a B-Tree. The search key is made of the index columns and the values stored are the primary keys of matching rows. A search by a secondary index very often results in an implicit search by primary key. You can find more information about InnoDB file format in the documentation. Jeremy Cole’s InnoDB Ruby tools are also a great way to learn about InnoDB internals.

What is a B-Tree?

A B-Tree is a data structure optimized for operations on block devices. Block devices, or disks, have rather high data access latency, especially spinning disks. Retrieving a single byte at a random position doesn't take much less time than retrieving a bigger piece of data like an 8KB or 16KB object. That's the fundamental argument for B-Trees. InnoDB uses 16KB pieces of data called pages.

A simple three-level B-Tree

Let’s attempt a simplified description of a B-Tree. A B-Tree is a data structure organized around a key. The key is used to search the data inside the B-Tree. A B-Tree normally has multiple levels. The data is stored only in the bottom-most level, the leaves. The pages of the other levels, the nodes, contain only keys and pointers to pages at the next lower level.

When you want to access a piece of data for a given value of the key, you start from the top node, the root node, compare the keys it contains with the search value and find the page to access at the next level. The process is repeated until you reach the last level, the leaves. In theory, you need one disk read operation per level of the B-Tree. In practice there is always a memory cache, and the nodes, since they are less numerous and accessed often, are easy to cache.
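The lookup walk just described can be sketched with a toy two-level tree. This illustrates the algorithm only; the structures below are made up and are not InnoDB's actual page format:

```python
import bisect

# A toy B-Tree: one root node with sorted separator keys and child pointers
# (one more child than keys), and leaf pages mapping keys to rows.
leaves = [{1: "row1", 5: "row5"}, {8: "row8", 12: "row12"}, {20: "row20"}]
root = {"keys": [8, 20], "children": leaves}  # keys[i] = smallest key of children[i+1]

def btree_lookup(node, key):
    # Binary-search the node's keys to pick the child, then read the leaf.
    child = bisect.bisect_right(node["keys"], key)
    return node["children"][child].get(key)

print(btree_lookup(root, 12))  # one node comparison step, then one leaf read
```

With real InnoDB pages, a node holds on the order of a thousand pointers, which is why a tree of only three levels can address millions of rows.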

An ordered insert example

Let’s consider the following sysbench table:

mysql> show create table sbtest1\G
*************************** 1. row ***************************
       Table: sbtest1
Create Table: CREATE TABLE `sbtest1` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `k` int(11) NOT NULL DEFAULT '0',
  `c` char(120) NOT NULL DEFAULT '',
  `pad` char(60) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `k_1` (`k`)
) ENGINE=InnoDB AUTO_INCREMENT=3000001 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

mysql> show table status like 'sbtest1'\G
*************************** 1. row ***************************
           Name: sbtest1
         Engine: InnoDB
        Version: 10
     Row_format: Dynamic
           Rows: 2882954
 Avg_row_length: 234
    Data_length: 675282944
Max_data_length: 0
   Index_length: 47775744
      Data_free: 3145728
 Auto_increment: 3000001
    Create_time: 2018-07-13 18:27:09
    Update_time: NULL
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options:
        Comment:
1 row in set (0.00 sec)

The primary key B-Tree size is Data_length. There is one secondary key B-Tree, the k_1 index, and its size is given by Index_length. The sysbench table was inserted in order of the primary key since the id column is auto-incremented. When you insert in order of the primary key, InnoDB fills its pages with up to 15KB of data (out of 16KB), even when innodb_fill_factor is set to 100. That allows for some row expansion by updates after the initial insert before a page needs to be split. There are also some headers and footers in the pages. If a page is too full and cannot accommodate an update adding more data, the page is split into two. Similarly, if two neighbor pages are less than 50% full, InnoDB will merge them. Here is, for example, a sysbench table inserted in id order:

mysql> select count(*), TABLE_NAME,INDEX_NAME, avg(NUMBER_RECORDS), avg(DATA_SIZE) from information_schema.INNODB_BUFFER_PAGE
    -> WHERE TABLE_NAME='`sbtest`.`sbtest1`' group by TABLE_NAME,INDEX_NAME order by count(*) desc;
+----------+--------------------+------------+---------------------+----------------+
| count(*) | TABLE_NAME         | INDEX_NAME | avg(NUMBER_RECORDS) | avg(DATA_SIZE) |
+----------+--------------------+------------+---------------------+----------------+
|    13643 | `sbtest`.`sbtest1` | PRIMARY    |             75.0709 |     15035.8929 |
|       44 | `sbtest`.`sbtest1` | k_1        |           1150.3864 |     15182.0227 |
+----------+--------------------+------------+---------------------+----------------+
2 rows in set (0.09 sec)

mysql> select PAGE_NUMBER,NUMBER_RECORDS,DATA_SIZE,INDEX_NAME,TABLE_NAME from information_schema.INNODB_BUFFER_PAGE
    -> WHERE TABLE_NAME='`sbtest`.`sbtest1`' order by PAGE_NUMBER limit 1;
+-------------+----------------+-----------+------------+--------------------+
| PAGE_NUMBER | NUMBER_RECORDS | DATA_SIZE | INDEX_NAME | TABLE_NAME         |
+-------------+----------------+-----------+------------+--------------------+
|           3 |             35 |       455 | PRIMARY    | `sbtest`.`sbtest1` |
+-------------+----------------+-----------+------------+--------------------+
1 row in set (0.04 sec)

mysql> select PAGE_NUMBER,NUMBER_RECORDS,DATA_SIZE,INDEX_NAME,TABLE_NAME from information_schema.INNODB_BUFFER_PAGE
    -> WHERE TABLE_NAME='`sbtest`.`sbtest1`' order by NUMBER_RECORDS desc limit 3;
+-------------+----------------+-----------+------------+--------------------+
| PAGE_NUMBER | NUMBER_RECORDS | DATA_SIZE | INDEX_NAME | TABLE_NAME         |
+-------------+----------------+-----------+------------+--------------------+
|          39 |           1203 |     15639 | PRIMARY    | `sbtest`.`sbtest1` |
|          61 |           1203 |     15639 | PRIMARY    | `sbtest`.`sbtest1` |
|          37 |           1203 |     15639 | PRIMARY    | `sbtest`.`sbtest1` |
+-------------+----------------+-----------+------------+--------------------+
3 rows in set (0.03 sec)

The table doesn’t fit in the buffer pool, but the queries give us good insights. The pages of the primary key B-Tree have on average 75 records and store a bit less than 15KB of data. The index k_1 is inserted in random order by sysbench. Why is the filling factor so good? It’s simply because sysbench creates the index after the rows have been inserted and InnoDB uses a sort file to create it.

You can easily estimate the number of levels in an InnoDB B-Tree. The above table needs about 40k leaf pages (3M/75). Each node page holds about 1200 pointers when the primary key is a four-byte integer. The level above the leaves thus has approximately 35 pages; on top of the B-Tree sits the root node (PAGE_NUMBER = 3). In total, we have three levels.
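That back-of-the-envelope computation can be checked in a few lines; the per-page counts below are the observed values from this post:

```python
import math

# Rough estimate of B-Tree depth from the numbers in the text:
# ~75 rows per 16KB leaf page, ~1200 pointers per node page (4-byte primary key).
rows, rows_per_leaf, ptrs_per_node = 3_000_000, 75, 1200

leaf_pages = math.ceil(rows / rows_per_leaf)   # ~40,000 leaves
levels = 1                                     # count the leaf level itself
pages = leaf_pages
while pages > 1:                               # walk up until a single root page remains
    pages = math.ceil(pages / ptrs_per_node)
    levels += 1
print(leaf_pages, levels)                      # 40000 leaves, 3 levels total
```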

A randomly inserted example

If you are a keen observer, you may have realized a direct consequence of inserting in random order of the primary key: pages are often split, and on average the filling factor is only around 65-75%. You can easily see the filling factor from the information schema. I modified sysbench to insert in random order of id and created a table, also with 3M rows. The resulting table is much larger:

mysql> show table status like 'sbtest1'\G
*************************** 1. row ***************************
           Name: sbtest1
         Engine: InnoDB
        Version: 10
     Row_format: Dynamic
           Rows: 3137367
 Avg_row_length: 346
    Data_length: 1088405504
Max_data_length: 0
   Index_length: 47775744
      Data_free: 15728640
 Auto_increment: NULL
    Create_time: 2018-07-19 19:10:36
    Update_time: 2018-07-19 19:09:01
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options:
        Comment:
1 row in set (0.00 sec)

While the size of the primary key B-Tree inserted in id order is 644MB, the size after random-order inserts is about 1GB, 60% larger. Obviously, we have a lower page filling factor:

mysql> select count(*), TABLE_NAME,INDEX_NAME, avg(NUMBER_RECORDS), avg(DATA_SIZE) from information_schema.INNODB_BUFFER_PAGE
    -> WHERE TABLE_NAME='`sbtestrandom`.`sbtest1`' group by TABLE_NAME,INDEX_NAME order by count(*) desc;
+----------+--------------------------+------------+---------------------+----------------+
| count(*) | TABLE_NAME               | INDEX_NAME | avg(NUMBER_RECORDS) | avg(DATA_SIZE) |
+----------+--------------------------+------------+---------------------+----------------+
|     4022 | `sbtestrandom`.`sbtest1` | PRIMARY    |             66.4441 |     10901.5962 |
|     2499 | `sbtestrandom`.`sbtest1` | k_1        |           1201.5702 |     15624.4146 |
+----------+--------------------------+------------+---------------------+----------------+
2 rows in set (0.06 sec)

The primary key pages are now filled with only about 10KB of data (~66%). It is a normal and expected consequence of inserting rows in random order. We’ll see that for some workloads, it is bad. For some others, it is a small price to pay.
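A quick sanity check of those numbers, using the 16KB page size and the average data sizes reported in the outputs above:

```python
# Fill-factor and size-ratio arithmetic from the buffer-pool stats in this post.
PAGE_SIZE = 16 * 1024  # InnoDB default page size

ordered_fill = 15035.8929 / PAGE_SIZE   # sequential inserts: ~92% of each page used
random_fill = 10901.5962 / PAGE_SIZE    # random inserts: ~67% after page splits
size_ratio = 1088405504 / 675282944     # random-insert table is ~60% larger on disk

print(round(ordered_fill, 2), round(random_fill, 2), round(size_ratio, 2))
```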

A practical analogy

It is always good to have a concrete model or analogy in your mind to better understand what is going on. Let’s assume you have been tasked to write the names and arrival time, on paper, of all the attendees arriving at a large event like Percona Live. So, you sit at a table close to the entry with a good pen and a pile of sheets of paper. As people arrive, you write their names and arrival time, one after the other. When a sheet is full, after about 40 names, you move it aside and start writing to a new one. That’s fast and effective. You handle a sheet only once, and when it is full, you don’t touch it anymore. The analogy is easy, a sheet of paper represents an InnoDB page.

The above use case represents an ordered insert. It is very efficient for the writes. Your only issue is with the organizer of the event: she keeps coming to you asking if “Mr. X” or “Mrs. Y” has arrived. You have to scan through your sheets to find the name. That’s the drawback of ordered inserts, reads can be more expensive. Not all reads are expensive, some can be very cheap. For example: “Who were the first ten people to get in?” is super easy. You’ll want an ordered insert strategy when the critical aspects of the application are the rate and the latency of the inserts. That usually means the reads are not user-facing. They are coming from report batch jobs, and as long as these jobs complete in a reasonable time, you don’t really care.

Now, let’s consider a random insertion analogy. For the next day of the event, tired of the organizer questions, you decide on a new strategy: you’ll write the names grouped by the first letter of the last name. Your goal is to ease the searches by name. So you take 26 sheets, and on top of each one, you write a different letter. As the first visitors arrive, you quickly realize you are now spending a lot more time looking for the right sheet in the stack and putting it back at the right place once you added a name to it.

At the end of the morning, you have worked much more. You also have more sheets than the previous day since for some letters there are few names while for others you needed more than a sheet. Finding names is much easier though. The main drawback of random insertion order is the overhead to manage the database pages when adding entries. The database will read and write from/to disk much more and the dataset size is larger.

Determine your workload type

The first step is to determine what kind of workload you have. When you have an insert-intensive workload, very likely the top queries are inserts on some large tables and the database writes heavily to disk. If you repeatedly execute “show processlist;” in the MySQL client, you see these inserts very often. That’s typical of applications logging a lot of data. There are many data collectors and they all wait to insert data. If they wait for too long, some data may be lost. If you have strict SLAs on the insert time and relaxed ones on the read time, you clearly have an insert-oriented workload and you should insert rows in order of the primary key.

You may also have a decent insert rate on large tables but these inserts are queued and executed by batch processes. Nobody is really waiting for these inserts to complete and the server can easily keep up with the number of inserts. What matters for your application is the large number of read queries going to the large tables, not the inserts. You already went through query tuning and even though you have good indexes, the database is reading from disk at a very high rate.

When you look at the MySQL processlist, you often see the same select query forms on the large tables. The only option seems to be adding more memory to lower the disk reads, but the tables are growing fast and you can’t add memory forever. We’ll discuss the read-intensive workload in detail in the next section.

If you can’t figure out whether you have an insert-heavy or read-heavy workload, maybe you just don’t have a big workload. In such a case, the default would be to use ordered inserts, and the best way to achieve this with MySQL is through an auto-increment integer primary key. That’s the default behavior of many ORMs.

A read-intensive workload

I have seen quite a few read-intensive workloads over my consulting years, mostly with online games and social networking applications. On top of that, some games have social networking features like watching the scores of your friends as they progress through the game. Before we go further, we first need to confirm the reads are inefficient. When reads are inefficient, the top select query forms will be accessing a number of distinct InnoDB pages close to the number of rows examined. The Percona Server for MySQL slow log, when the verbosity level includes “InnoDB”, exposes both quantities, and the pt-query-digest tool includes stats on them. Here’s an example output (I’ve removed some lines):

# Query 1: 2.62 QPS, 0.00x concurrency, ID 0x019AC6AF303E539E758259537C5258A2 at byte 19976
# This item is included in the report because it matches --limit.
# Scores: V/M = 0.00
# Time range: 2018-07-19T20:28:02 to 2018-07-19T20:28:23
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count         48      55
# Exec time     76    93ms   637us     3ms     2ms     2ms   458us     2ms
# Lock time    100    10ms    72us   297us   182us   247us    47us   176us
# Rows sent    100   1.34k      16      36   25.04   31.70    4.22   24.84
# Rows examine 100   1.34k      16      36   25.04   31.70    4.22   24.84
# Rows affecte   0       0       0       0       0       0       0       0
# InnoDB:
# IO r bytes     0       0       0       0       0       0       0       0
# IO r ops       0       0       0       0       0       0       0       0
# IO r wait      0       0       0       0       0       0       0       0
# pages distin 100   1.36k      18      35   25.31   31.70    3.70   24.84
# EXPLAIN /*!50100 PARTITIONS*/
select * from friends where user_id = 1234\G

The friends table definition is:

CREATE TABLE `friends` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `user_id` int(10) unsigned NOT NULL,
  `friend_user_id` int(10) unsigned NOT NULL,
  `created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `active` tinyint(4) NOT NULL DEFAULT '1',
  PRIMARY KEY (`id`),
  UNIQUE KEY `uk_user_id_friend` (`user_id`,`friend_user_id`),
  KEY `idx_friend` (`friend_user_id`)
) ENGINE=InnoDB AUTO_INCREMENT=144002 DEFAULT CHARSET=latin1

I built this simple example on my test server. The table easily fits in memory, so there are no disk reads. What matters here is the relation between “pages distin” and “Rows examine”. As you can see, the ratio is close to 1. It means that InnoDB rarely gets more than one row per page it accesses. For a given user_id value, the matching rows are scattered all over the primary key B-Tree. We can confirm this by looking at the output of the sample query:

mysql> select * from friends where user_id = 1234 order by id limit 10;
+-------+---------+----------------+---------------------+--------+
| id    | user_id | friend_user_id | created             | active |
+-------+---------+----------------+---------------------+--------+
|   257 |    1234 |             43 | 2018-07-19 20:14:47 |      1 |
|  7400 |    1234 |           1503 | 2018-07-19 20:14:49 |      1 |
| 13361 |    1234 |            814 | 2018-07-19 20:15:46 |      1 |
| 13793 |    1234 |            668 | 2018-07-19 20:15:47 |      1 |
| 14486 |    1234 |           1588 | 2018-07-19 20:15:47 |      1 |
| 30752 |    1234 |           1938 | 2018-07-19 20:16:27 |      1 |
| 31502 |    1234 |            733 | 2018-07-19 20:16:28 |      1 |
| 32987 |    1234 |           1907 | 2018-07-19 20:16:29 |      1 |
| 35867 |    1234 |           1068 | 2018-07-19 20:16:30 |      1 |
| 41471 |    1234 |            751 | 2018-07-19 20:16:32 |      1 |
+-------+---------+----------------+---------------------+--------+
10 rows in set (0.00 sec)

The rows are often thousands of id values apart. Although the rows are small, about 30 bytes, an InnoDB page doesn’t contain more than 500 rows. As the application becomes popular, there are more and more users and the table size grows like the square of the number of users. As soon as the table outgrows the InnoDB buffer pool, MySQL starts to read from disk. Worst case, with nothing cached, we need one read IOP per friend. If the rate of these selects is 300/s and, on average, every user has 100 friends, MySQL needs to access up to 30,000 pages per second. Clearly, this doesn’t scale for long.
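The arithmetic behind that worst case, as a quick sketch:

```python
# Worst-case read IOPS for the scattered layout described above:
# each friend row may live on a different page, so one page read per friend.
selects_per_s = 300   # rate of the friends query
avg_friends = 100     # average rows returned per query

worst_case_iops = selects_per_s * avg_friends
print(worst_case_iops)  # 30000 pages read per second with nothing cached
```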

We need to determine all the ways the table is accessed. For that, I use pt-query-digest and I raise the limit on the number of query forms returned. Let’s assume I found:

  • 93% of the time by user_id
  • 5% of the time by friend_user_id
  • 2% of the time by id

The above proportions are quite common. When there is a dominant access pattern, we can do something. The friends table is a typical example of a many-to-many table. With InnoDB, we should define such tables as:

CREATE TABLE `friends` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `user_id` int(10) unsigned NOT NULL,
  `friend_user_id` int(10) unsigned NOT NULL,
  `created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `active` tinyint(4) NOT NULL DEFAULT '1',
  PRIMARY KEY (`user_id`,`friend_user_id`),
  KEY `idx_friend` (`friend_user_id`),
  KEY `idx_id` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=144002 DEFAULT CHARSET=latin1

Now the rows are ordered, grouped by user_id, inside the primary key B-Tree, but the inserts arrive in random order. In other words, we slowed down the inserts to the benefit of the select statements on the table. To insert a row, InnoDB potentially needs one disk read to get the page where the new row is going and one disk write to save it back to disk. Remember the previous analogy: we needed to take one sheet from the stack, add a name, and put it back in place. We also made the table bigger: the InnoDB pages are not as full, and the secondary indexes are bigger since the primary key is larger. We also added a secondary index. Now we have less data in the InnoDB buffer pool.

Shall we panic because there is less data in the buffer pool? No, because now when InnoDB reads a page from disk, instead of getting only a single matching row, it gets up to hundreds of matching rows. The number of read IOPS is no longer correlated to the number of friends times the rate of select statements; it is now only a factor of the incoming rate of select statements. The impact of not having enough memory to cache the whole table is much reduced. As long as the storage can perform more read IOPS than the rate of select statements, all is fine. With the modified table, the relevant lines of the pt-query-digest output are now:

# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Rows examine 100   1.23k      16      34   23.72   30.19    4.19   22.53
# pages distin 100     111       2       5    2.09    1.96    0.44    1.96

With the new primary key, instead of 30k read IOPS, MySQL needs to perform only about 588 read IOPS (~300*1.96). It is a much easier workload to handle. The inserts are more expensive, but if their rate is 100/s, it just means 100 read IOPS and 100 write IOPS in the worst case.
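The same sketch as before, redone with the clustered layout; the 1.96 figure is the "pages distin" statistic reported by pt-query-digest in this post:

```python
# With rows clustered by user_id, each select touches only ~2 distinct pages
# instead of one page per friend.
selects_per_s = 300
pages_per_select = 1.96   # "pages distin" from the digest output above

read_iops = selects_per_s * pages_per_select
print(round(read_iops))   # ~588 read IOPS instead of 30,000
```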

The above strategy works well when there is a clear access pattern. Off the top of my head, here are a few other examples where there is usually a dominant access pattern:

  • Game leaderboards (by user)
  • User preferences (by user)
  • Messaging application (by from or to)
  • User object store (by user)
  • Likes on items (by item)
  • Comments on items (by item)

What can you do when you don’t have a dominant access pattern? One option is the use of a covering index. The covering index needs to cover all the required columns. The order of the columns is also important, as the first must be the grouping value. Another option is to use partitions to create an easy-to-cache hot spot in the dataset. I’ll discuss these strategies in future posts, as this one is long enough!

In this post we have seen a common strategy used to solve read-intensive workloads. This strategy doesn’t work all the time — you must access the data through a common pattern. But when it works, and you choose good InnoDB primary keys, you are the hero of the day!

The post Tuning InnoDB Primary Keys appeared first on Percona Database Performance Blog.

Best and fastest way to learn SQL

Imagine that Spider-Man didn’t know that he could shoot webs from his wrists… Not a good starting point, right?

Sometimes you can learn to write SQL queries but miss critical functionality and hidden parameters that can change your day-to-day life and save tons of hours when writing or reviewing SQL queries.

In the following article, we collected some of the best tutorials and hands-on exercises which can quickly help you get on board.

Basic SQL training
  • Khan Academy – Khan Academy is a non-profit educational organization created in 2006 with the goal of creating a set of online tools that help educate students. The organization produces short lessons in the form of YouTube videos. Here is a short training with videos and self-learning exercises
  • Coursera – Every course on Coursera is taught by top instructors from the world’s best universities and educational institutions. Courses include recorded video lectures, auto-graded and peer-reviewed assignments, and community discussion forums. When you complete a course, you’ll receive a shareable electronic Course Certificate. That’s cool!
    Here you can find different SQL courses for data scientist, excel experts, developers and business analysts:
  • Udemy – Udemy is an online learning platform aimed at professional adults. Unlike academic MOOC programs driven by traditional collegiate coursework, Udemy uses content from online content creators, sold for profit, which makes its content pretty good. The SQL course is not free and costs about $10, but it’s definitely worth it:
  • Codecademy – Codecademy is an education company, but not one in the way you might think. They are committed to building the best learning experience inside and out.
    Their SQL course includes good hands-on exercises:
  • SQL Bolt – cool website with good interactive exercise –
  • Lynda – Lynda is an American online education company offering video courses taught by industry experts in software, creative, and business skills. This course has good reviews as well:
  • Vertabelo is an online database design tool and they created some good courses, not all of them are free, but worth checking out:
  • Solo learn – Sololearn offers a lot of short quizzes and can be a good way for you to test your skills –
  • SQL Teaching – a nice online training designed for kids, with good hands-on exercises
SQL Advanced training

Performance oriented training

Highly Recommended Performance Related Books

Good learning, and we hope you’ll discover the SQL ninja inside you!

Mastering Continuent Clustering Series: Automatic Reconnect in the Tungsten Connector

In this blog post, we talk about the Automatic Reconnect feature in the Tungsten Connector.

Automatic reconnect enables the Connector to re-establish a connection in the event of a transient failure. Under specific circumstances, the Connector will also retry the query.

Connector automatic reconnect is enabled by default in Proxy and Smartscale modes.

Use the following tpm command option on the command line (remove the leading hyphens inside INI files) to disable automatic reconnect:

--connector-autoreconnect=false

This feature is not available while running in Bridge Mode. Use the tpm command option --connector-bridge-mode=false to disable Bridge mode.

Automatic reconnect enables retries of statements under the following circumstances:

  • not in bridge mode
  • not inside a transaction
  • no temp table has been created
  • no lock has been acquired and left unreleased
  • the request is a read

To disable:

shell> tpm update alpha --connector-autoreconnect=false

The autoreconnect status can be monitored via the autoReconnect parameter in the output of the tungsten show variables command while connected to the Connector. For example:

shell> tpm connector

mysql> tungsten show variables like "autoReconnect";
+----------------------+---------------+-------+
| Variable_Type        | Variable_name | Value |
+----------------------+---------------+-------+
|                      | autoReconnect | false |
+----------------------+---------------+-------+
1 row in set (0.00 sec)

The above output indicates that the autoreconnect feature is disabled. The tungsten show command is not available in Bridge mode.

Click here for the documentation page: Connector Automatic Reconnect

In future articles, we will continue to cover more advanced subjects of interest!

Questions? Contact Continuent

Default options in MyRocks

We need to make MyRocks easier to configure -- this isn't a new idea. If you are using MyRocks with default options in mid-2018 then you are probably not using bloom filters, compression or my favorite compaction policy.

You can fix all of that by setting rocksdb_default_cf_options. I wish this were the default:

rocksdb_default_cf_options=block_based_table_factory={cache_index_and_filter_blocks=1;filter_policy=bloomfilter:10:false;whole_key_filtering=1};level_compaction_dynamic_level_bytes=true;optimize_filters_for_hits=true;compaction_pri=kMinOverlappingRatio

The above enables the default compression type for all levels of the LSM tree, which is Snappy in a recent MyRocks build with FB MySQL. But one of the proper distros only provides zlib, and using that for the small levels of the LSM tree (L0, L1, L2) might slow down compaction too much.

To set rocksdb_default_cf_options but disable compression, use:

rocksdb_default_cf_options=block_based_table_factory={cache_index_and_filter_blocks=1;filter_policy=bloomfilter:10:false;whole_key_filtering=1};level_compaction_dynamic_level_bytes=true;optimize_filters_for_hits=true;compaction_pri=kMinOverlappingRatio;compression=kNoCompression

To set rocksdb_default_cf_options and use fast compression for all levels, use this after changing $fast to one of kLZ4Compression or kSnappyCompression:

rocksdb_default_cf_options=block_based_table_factory={cache_index_and_filter_blocks=1;filter_policy=bloomfilter:10:false;whole_key_filtering=1};level_compaction_dynamic_level_bytes=true;optimize_filters_for_hits=true;compaction_pri=kMinOverlappingRatio;compression=$fast

And to set rocksdb_default_cf_options with a fast compression configuration (no compression for the smallest levels, fast compression for the mid levels, slower compression for the max level), try this after changing $fast and $slow to the appropriate values:

rocksdb_default_cf_options=block_based_table_factory={cache_index_and_filter_blocks=1;filter_policy=bloomfilter:10:false;whole_key_filtering=1};level_compaction_dynamic_level_bytes=true;optimize_filters_for_hits=true;compaction_pri=kMinOverlappingRatio;compression_per_level=kNoCompression:kNoCompression:kNoCompression:$fast:$fast:$fast;bottommost_compression=$slow

You can determine which compression libraries are supported by your MyRocks build by looking in $datadir/.rocksdb/LOG. This is from an FB MySQL build that I use:

Compression algorithms supported:
         kZSTDNotFinalCompression supported: 1
         kZSTD supported: 1
         kXpressCompression supported: 0
         kLZ4HCCompression supported: 1
         kLZ4Compression supported: 1
         kBZip2Compression supported: 1
         kZlibCompression supported: 1
         kSnappyCompression supported: 1
Some information on my.cnf for MyRocks is here and much more info on RocksDB is on the wiki. If you want to tune, then start with the tuning guide.

Another potential problem with the default configuration is rocksdb_max_background_jobs=2, so there are only 2 threads for compaction; you usually want more.

Finally, there were some changes to MyRocks to make it better at avoiding too many tombstones, and it looks like that is disabled by default:

rocksdb_compaction_sequential_deletes            0
rocksdb_compaction_sequential_deletes_count_sd   OFF
rocksdb_compaction_sequential_deletes_file_size  0
rocksdb_compaction_sequential_deletes_window     0

Setting rocksdb_max_open_files to -1 can be great for performance, but there are side-effects (more open file descriptors, more untracked memory consumption).

If you are only using MyRocks then you might want to set transaction-isolation=READ-COMMITTED because repeatable-read in MyRocks doesn't use gap locks (yet).

Finally, you probably want to set rocksdb_block_cache_size.
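To pull the options discussed in this post together, a starting my.cnf sketch could look like the following. The background-jobs and block-cache values are illustrative placeholders, not recommendations from the post; size them for your hardware:

```ini
[mysqld]
# Bloom filters, cached index/filter blocks, and kMinOverlappingRatio (from this post)
rocksdb_default_cf_options=block_based_table_factory={cache_index_and_filter_blocks=1;filter_policy=bloomfilter:10:false;whole_key_filtering=1};level_compaction_dynamic_level_bytes=true;optimize_filters_for_hits=true;compaction_pri=kMinOverlappingRatio
# More compaction threads than the default of 2 (example value)
rocksdb_max_background_jobs=4
# Size the block cache for your hardware (example value)
rocksdb_block_cache_size=8G
# Only if MyRocks is the only engine in use
transaction-isolation=READ-COMMITTED
```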

Install ELK as Centralized Logfile Management Server on CentOS 7

This tutorial explains how to set up a centralized logfile management server using the ELK stack on CentOS 7. For anyone who does not already know, ELK is the combination of 3 services: Elasticsearch, Logstash, and Kibana. The tutorial uses MySQL log files as the source to be monitored.

MySQL Enterprise Monitor 8.0.1 has been released

We are pleased to announce that MySQL Enterprise Monitor 8.0.1 is now available for download on the My Oracle Support (MOS) web site. It will also be available for download via the Oracle Software Delivery Cloud in a few days. MySQL Enterprise Monitor is the best-in-class tool for monitoring and management of your MySQL assets and is included with your MySQL Enterprise Edition and MySQL Enterprise Carrier Grade subscriptions.

You can find more information on the contents of this release in the change log.

Highlights of MySQL Enterprise Monitor 8.0 include:

  • MySQL Enterprise Monitor 8.0 supports monitoring MySQL Server 8.0.11 and later. You can also use MySQL Server 8.0 as an external data repository for MySQL Enterprise Monitor 8.0.
  • Significant improvements have been made to the performance and user interface of the Query Analyzer. When you click on a SQL statement, the new Statement Details page enables you to browse all example statements in the specified time range, rather than just the one with the worst run time. Drill into any of these examples to see their individual statistics, literals used, and EXPLAIN plan. Graphs display the execution pattern of the selected query during the specified time period. For more information, see Section 30.3.1, "Detailed Query Information".
  • The new Global Memory Usage report builds upon the Memory instrumentation exposed within Performance Schema to provide greater insight into where memory is being allocated within the MySQL Instance - not just at the InnoDB Buffer Pool level, but for all memory allocations, across all connections and threads within MySQL. The report breaks down memory by allocation type, such as memory allocated for sorting, the buffer pool, or various other internal buffers used for statement processing. For more information, see Section 28.6, "Memory Usage Report".
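The Global Memory Usage report builds on Performance Schema memory summary tables that you can also query directly. A hedged example, assuming memory instruments are enabled (available in MySQL 5.7 and later):

```
-- Top memory consumers by allocation type (instrument), largest first
SELECT event_name,
       current_number_of_bytes_used AS bytes_used
FROM performance_schema.memory_summary_global_by_event_name
WHERE current_number_of_bytes_used > 0
ORDER BY bytes_used DESC
LIMIT 10;
```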

You will find binaries for the new release on My Oracle Support. Choose the "Patches & Updates" tab, and then choose the "Product or Family (Advanced Search)" side tab in the "Patch Search" portlet.

You will also find the binaries on the Oracle Software Delivery Cloud soon. Type "MySQL Enterprise Monitor 8.0.1" in the search box, or enter a license name to find Enterprise Monitor along with other MySQL products: "MySQL Enterprise Edition" or "MySQL Cluster Carrier Grade Edition". Then select your platform.

Please open a bug or a ticket on My Oracle Support to report problems, request features, or give us general feedback about how this release meets your needs.

If you are not a MySQL Enterprise customer and want to try the Monitor and Query Analyzer using our 30-day free customer trial, go to, or contact Sales at

Thanks and Happy Monitoring!

- The MySQL Enterprise Tools Development Team

Useful URLs

MySQL Enterprise Monitor 4.0.5 has been released

We are pleased to announce that MySQL Enterprise Monitor 4.0.5 is now available for download on the My Oracle Support (MOS) web site. This is a maintenance release that includes a few new features and fixes a number of bugs. You can find more information on the contents of this release in the change log.

You will find binaries for the new release on My Oracle Support. Choose the "Patches & Updates" tab, and then choose the "Product or Family (Advanced Search)" side tab in the "Patch Search" portlet.

Important: MySQL Enterprise Monitor (MEM) 8.0 offers many significant improvements over MEM 3.4 and 4.0 and we highly recommend that you consider upgrading. More information on MEM 8.0 is available here:

Please open a bug or a ticket on My Oracle Support to report problems, request features, or give us general feedback about how this release meets your needs.

If you are not a MySQL Enterprise customer and want to try the Monitor and Query Analyzer using our 30-day free customer trial, go to, or contact Sales at

Thanks and Happy Monitoring!

- The MySQL Enterprise Tools Development Team
