Planet MySQL

Multiple Data Center Setups Using Galera Cluster for MySQL or MariaDB

Building high availability, one step at a time

When it comes to database infrastructure, we all strive to build a highly available setup, and redundancy is the key. We implement redundancy at the lowest level and continue up the stack. It starts with hardware: redundant power supplies, redundant cooling, hot-swap disks. At the network layer, multiple NICs are bonded together and connected to different switches that use redundant routers. For storage, we use disks set up in RAID, which gives better performance as well as redundancy. Then, on the software level, we use clustering technologies such as MySQL Cluster or Galera Cluster, where multiple database nodes work together to provide redundancy.


All of this is no good if you have everything in a single datacenter: when the datacenter goes down, when some (important) services go offline, or even when you lose connectivity to the datacenter, your service will go down, no matter the amount of redundancy at the lower levels. And yes, those things happen.

  • S3 service disruption wreaked havoc in US-East-1 region in February, 2017
  • EC2 and RDS Service Disruption in US-East region in April, 2011
  • EC2, EBS and RDS were disrupted in EU-West region in August, 2011
  • Power outage brought down Rackspace Texas DC in June, 2009
  • UPS failure caused hundreds of servers to go offline in Rackspace London DC in January, 2010

This is by no means a complete list of failures; it's just the result of a quick Google search. These examples show that things can and will go wrong if you put all your eggs into the same basket. One more example is Hurricane Sandy, which caused an enormous exodus of data from US-East to US-West DCs - at that time you could hardly spin up instances in US-West, as everyone rushed to move their infrastructure to the other coast expecting that the North Virginia DC would be seriously affected by the weather.

So, multi-datacenter setups are a must if you want to build a high availability environment. In this blog post, we will discuss how to build such infrastructure using Galera Cluster for MySQL/MariaDB.

Galera concepts

Before we look into particular solutions, let us spend some time explaining two concepts which are very important in highly available, multi-DC Galera setups.

Quorum

High availability requires resources - namely, you need a number of nodes in the cluster to make it highly available. A cluster can tolerate the loss of some of its members, but only to a certain extent. Beyond a certain failure rate, you might be looking at a split-brain scenario.

Let's take an example with a two-node setup. If one of the nodes goes down, how can the other one know that its peer crashed and that it's not just a network failure? In that case, the other node might as well be up and running, serving traffic. There is no good way to handle such a case... This is why fault tolerance usually starts at three nodes. Galera uses a quorum calculation to determine whether it is safe for the cluster to handle traffic or whether it should cease operations. After a failure, all remaining nodes attempt to connect to each other and determine how many of them are up. The result is compared to the previous state of the cluster, and as long as more than 50% of the nodes are up, the cluster can continue to operate.

This results in the following:
2 node cluster - no fault tolerance
3 node cluster - up to 1 crash
4 node cluster - up to 1 crash (if two nodes crashed, only 50% of the cluster would be available; you need more than 50% of the nodes to survive)
5 node cluster - up to 2 crashes
6 node cluster - up to 2 crashes

You can probably see the pattern: you want your cluster to have an odd number of nodes. In terms of high availability, there's no point in moving from five to six nodes in the cluster; if you want better fault tolerance, go for seven nodes.
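
The rule can be expressed in a few lines. The sketch below is not Galera's actual implementation (which also supports weighted quorum); it only illustrates the "more than 50% of the previous membership" requirement:

# Illustration only: how many crashed nodes a cluster of a given size can
# tolerate while the survivors still hold more than 50% of the membership.
def max_tolerated_failures(nodes):
    survivors_needed = nodes // 2 + 1   # strictly more than half
    return nodes - survivors_needed

for n in range(2, 8):
    print("%d-node cluster tolerates %d crash(es)" % (n, max_tolerated_failures(n)))
# prints: 2 -> 0, 3 -> 1, 4 -> 1, 5 -> 2, 6 -> 2, 7 -> 3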

Segments

Typically, in a Galera cluster, all communication follows an all-to-all pattern. Each node talks to all the other nodes in the cluster.

As you may know, each writeset in Galera has to be certified by all of the nodes in the cluster, so every write that happened on a node has to be transferred to all of the nodes in the cluster. This works fine in a low-latency environment, but in multi-DC setups we need to account for much higher latency than in a local network. To make this more bearable in clusters spanning Wide Area Networks, Galera introduced segments.

They work by containing the Galera traffic within a group of nodes (a segment). All nodes within a single segment act as if they were in a local network. For cross-segment traffic, things are different: in each of the segments, one "relay" node is chosen, and all of the cross-segment traffic goes through those nodes. When a relay node goes down, another node is elected. This does not reduce latency by much - after all, WAN latency stays the same whether you connect to one remote host or to several - but given that WAN links tend to be limited in bandwidth and there may be a charge for the amount of data transferred, this approach lets you limit the amount of data exchanged between segments. Another time- and cost-saving benefit is that nodes in the same segment are prioritized when a donor is needed; again, this limits the amount of data transferred over the WAN and, most likely, speeds up SST, as a local network will almost always be faster than a WAN link.
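
Segments are assigned through the gmcast.segment provider option. The snippet below is a minimal sketch: the segment numbers are arbitrary, and any other wsrep_provider_options you already use must be kept in the same, semicolon-separated string.

# my.cnf on the nodes in the first datacenter
[mysqld]
wsrep_provider_options="gmcast.segment=1"

# my.cnf on the nodes in the second datacenter
[mysqld]
wsrep_provider_options="gmcast.segment=2"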

Galera in multi-DC setups

Now that we’ve got some of these concepts out of the way, let’s look at some other important aspects of multi-DC setups for Galera cluster.

Issues you are about to face

When working in environments spanning a WAN, there are a couple of issues you need to take into consideration when designing your environment.

Quorum calculation

In the previous section, we described how the quorum calculation works in a Galera cluster - in short, you want to have an odd number of nodes to maximize survivability. All of that is still true in multi-DC setups, but a few more elements come into play. First of all, you need to decide whether you want Galera to handle a datacenter failure automatically. This determines how many datacenters you are going to use. Let's imagine two DCs: if you split your nodes 50/50 and one datacenter goes down, the remaining one doesn't have the 50% + 1 nodes required to maintain its "primary" state. If you split your nodes unevenly, placing the majority of them in the "main" datacenter, then when that datacenter goes down, the "backup" DC won't have 50% + 1 nodes to form a quorum. You can assign different weights to nodes, but the result will be exactly the same: there's no way to automatically fail over between two DCs without manual intervention. To implement automated failover, you need more than two DCs. Again, ideally an odd number - three datacenters is a perfectly fine setup. The next question is how many nodes you need. You want them evenly distributed across the datacenters; the rest is just a matter of how many failed nodes your setup has to handle.

A minimal setup uses one node per datacenter, but it has serious drawbacks. Every state transfer requires moving data across the WAN, which results in either a longer time to complete SST or higher costs.

A fairly typical setup is six nodes, two per datacenter. This setup may seem unexpected as it has an even number of nodes, but when you think about it, it might not be that big of an issue: it's quite unlikely that three nodes will go down at once, and such a setup will survive a crash of up to two nodes. A whole datacenter may go offline and the two remaining DCs will continue operating. It also has a huge advantage over the minimal setup: when a node goes offline, there's always a second node in the same datacenter which can serve as a donor. Most of the time, the WAN won't be used for SST.

Of course, you can increase the number of nodes to three per datacenter, nine in total. This gives you even better survivability: up to four nodes may crash and the cluster will still survive. On the other hand, you have to keep in mind that, even with the use of segments, more nodes means higher operational overhead, and you can scale out a Galera cluster only to a certain extent.

It may happen that there's no need for a third datacenter because, let's say, your application is located in only two of them. The requirement of three datacenters still stands, so you can't get around it, but it is perfectly fine to use a Galera Arbitrator (garbd) instead of fully loaded database servers.

Garbd can be installed on smaller nodes, even virtual servers. It does not require powerful hardware, and it does not store any data or apply any of the writesets. But it does see all the replication traffic and takes part in the quorum calculation. Thanks to it, you can deploy setups like four data nodes, two per DC, plus garbd in the third one: you have five nodes in total, and such a cluster can accept up to two failures. That means it can accept a full shutdown of one of the datacenters.
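
A minimal sketch of starting garbd for such a setup (the cluster name and addresses are placeholders) could look like this:

# run in the third datacenter; it joins the cluster as an arbitrator only
garbd --group my_galera_cluster \
      --address "gcomm://10.0.1.10:4567,10.0.2.10:4567" \
      --options "gmcast.segment=3" \
      --daemon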

Which option is better for you? There is no best solution for all cases, it all depends on your infrastructure requirements. Luckily, there are different options to pick from: more or less nodes, full 3 DC or 2 DC and garbd in the third one - it’s quite likely you’ll find something suitable for you.

Network latency

When working with multi-DC setups, you have to keep in mind that network latency will be significantly higher than what you'd expect from a local network environment. This may seriously reduce the performance of a Galera cluster when you compare it with a standalone MySQL instance or a MySQL replication setup. The requirement that all of the nodes certify a writeset means that all of the nodes have to receive it, no matter how far away they are. With asynchronous replication, there's no need to wait before committing. Of course, replication has other issues and drawbacks, but latency is not the major one. The problem is especially visible when your database has hot spots - rows that are frequently updated (counters, queues, etc.). Those rows cannot be updated more often than once per network round trip. For clusters spanning the globe, this can easily mean that you won't be able to update a single row more often than 2 - 3 times per second. If this becomes a limitation for you, it may mean that Galera cluster is not a good fit for your particular workload.

Proxy layer in multi-DC Galera cluster

It's not enough to have a Galera cluster spanning multiple datacenters; you still need your application to access it. One of the popular methods of hiding the complexity of the database layer from an application is to utilize a proxy. Proxies are used as an entry point to the databases: they track the state of the database nodes and should always direct traffic only to the nodes that are available. In this section, we'll try to propose a proxy layer design which could be used for a multi-DC Galera cluster. We'll use ProxySQL, which gives you quite a bit of flexibility in handling database nodes, but you can use another proxy, as long as it can track the state of Galera nodes.

Where to locate the proxies?

In short, there are two common patterns here: you can either deploy ProxySQL on separate nodes or you can deploy it on the application hosts. Let's take a look at the pros and cons of each of these setups.

Proxy layer as a separate set of hosts

The first pattern is to build a proxy layer using separate, dedicated hosts. You can deploy ProxySQL on a couple of hosts and use a Virtual IP and keepalived to maintain high availability. The application uses the VIP to connect to the database, and the VIP ensures that requests are always routed to an available ProxySQL. The main issue with this setup is that you use at most one of the ProxySQL instances: all standby nodes are not used for routing traffic. This may force you to use more powerful hardware than you'd typically use. On the other hand, it is easier to maintain the setup: you will have to apply configuration changes on all of the ProxySQL nodes, but there will be just a handful of them. You can also utilize ClusterControl's option to sync the nodes. Such a setup will have to be duplicated in every datacenter that you use.
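
As an illustration, a keepalived instance guarding the VIP in front of the ProxySQL hosts could be configured along these lines (the interface, router id, priority and VIP below are placeholders; the standby hosts would use state BACKUP and a lower priority):

vrrp_instance VI_PROXYSQL {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 101
    virtual_ipaddress {
        10.0.1.100
    }
}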

Proxy installed on application instances

Instead of having a separate set of hosts, ProxySQL can also be installed on the application hosts themselves. The application then connects to ProxySQL on localhost and can even use a unix socket to minimize the overhead of a TCP connection. The main advantage of such a setup is that you have a large number of ProxySQL instances and the load is evenly distributed across them. If one goes down, only that application host is affected; the remaining nodes continue to work. The most serious issue to face is configuration management: with a large number of ProxySQL nodes, it is crucial to come up with an automated method of keeping their configurations in sync. You could use ClusterControl, or a configuration management tool like Puppet.
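
Whichever placement you choose, the Galera nodes are registered through the ProxySQL admin interface. A hedged example follows; the hostgroup ids, hostnames and weights are placeholders, with a lower weight keeping most traffic on the local datacenter:

INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight)
VALUES (10, 'galera-dc1-node1', 3306, 1000),
       (10, 'galera-dc1-node2', 3306, 1000),
       (10, 'galera-dc2-node1', 3306, 100);
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;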

Tuning of Galera in a WAN environment

Galera's defaults are designed for a local network; if you want to use it in a WAN environment, some tuning is required. Let's discuss some of the basic tweaks you can make. Please keep in mind that precise tuning requires production data and traffic - you can't just make some changes and assume they are good; you should do proper benchmarking.

Operating system configuration

Let’s start with the operating system configuration. Not all of the modifications proposed here are WAN-related, but it’s always good to remind ourselves what is a good starting point for any MySQL installation.

vm.swappiness = 1

Swappiness controls how aggressively the operating system uses swap. It should not be set to zero because, in more recent kernels, a value of zero prevents the OS from using swap at all and may cause serious performance issues.

/sys/block/*/queue/scheduler = deadline/noop

The I/O scheduler for the block device that MySQL uses should be set to either deadline or noop. The exact choice depends on benchmarks, but both settings should deliver similar performance, better than the default scheduler, CFQ.
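
For example, the scheduler can be changed at runtime as shown below (the device name is just a placeholder; make the change persistent through udev rules or your distribution's usual mechanism):

echo deadline > /sys/block/sda/queue/scheduler
cat /sys/block/sda/queue/scheduler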

For MySQL, you should consider using EXT4 or XFS, depending on the kernel (performance of those filesystems changes from one kernel version to another). Perform some benchmarks to find the better option for you.

In addition to this, you may want to look into the sysctl network settings. We will not discuss them in detail (you can find documentation here), but the general idea is to increase buffers, backlogs and timeouts to make it easier to accommodate stalls and an unstable WAN link.

net.core.optmem_max = 40960
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 87380 16777216
net.core.netdev_max_backlog = 50000
net.ipv4.tcp_max_syn_backlog = 30000
net.ipv4.tcp_congestion_control = htcp
net.ipv4.tcp_mtu_probing = 1
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_slow_start_after_idle = 0

In addition to OS tuning, you should consider tweaking Galera's network-related settings.

evs.suspect_timeout evs.inactive_timeout

You may want to consider changing the default values of these variables. Both timeouts govern how the cluster evicts failed nodes. The suspect timeout kicks in when all of the nodes cannot reach the inactive member. The inactive timeout defines a hard limit on how long a node can stay in the cluster if it's not responding. Usually you'll find that the default values work well, but in some cases, especially if you run your Galera cluster over a WAN (for example, between AWS regions), increasing those variables may result in more stable performance. We'd suggest setting both of them to PT1M, to make it less likely that WAN link instability will throw a node out of the cluster.

evs.send_window evs.user_send_window

These variables, evs.send_window and evs.user_send_window, define how many packets can be sent via replication at the same time (evs.send_window) and how many of them may contain data (evs.user_send_window). For high latency connections, it may be worth increasing those values significantly (512 or 1024 for example).

evs.inactive_check_period

This variable may also be changed. evs.inactive_check_period is set to one second by default, which may be too often for a WAN setup. We'd suggest setting it to PT30S.

gcs.fc_factor gcs.fc_limit

Here we want to minimize the chance that flow control will kick in, therefore we'd suggest setting gcs.fc_factor to 1 and increasing gcs.fc_limit to, for example, 260.

gcs.max_packet_size

As we are working over a WAN link, where latency is significantly higher, we want to increase the size of the packets. A good starting point would be 2097152.
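
Put together, the settings discussed above end up in wsrep_provider_options in my.cnf. Treat the values below as a starting point to benchmark, not a recommendation, and keep any options you already set in the same string:

wsrep_provider_options="evs.suspect_timeout=PT1M;evs.inactive_timeout=PT1M;evs.inactive_check_period=PT30S;evs.send_window=512;evs.user_send_window=512;gcs.fc_factor=1;gcs.fc_limit=260;gcs.max_packet_size=2097152"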

As we mentioned earlier, it is virtually impossible to give a simple recipe on how to set these parameters as it depends on too many factors - you will have to do your own benchmarks, using data as close to your production data as possible, before you can say your system is tuned. Having said that, those settings should give you a starting point for the more precise tuning.

That’s it for now. Galera works pretty well in WAN environments, so do give it a try and let us know how you get on.

Tags:  MySQL galera high availability galera cluster xtradb cluster mariadb cluster

Cost/Benefit Analysis of a MySQL Index

We all know that if we add a MySQL index to speed up a read, we end up making writes slower. How often do we do the analysis to look at how much more work is done?

Recently, a developer came to me and wanted to add an index to a very large table (hundreds of gigabytes) to speed up a query. We did some testing on a moderately used server:

  • Set long_query_time to 0 and turn slow query logging on.
  • Turn slow query logging off after 30 minutes.
  • Add the index (it was on a single field).
  • Repeat the slow query logging for 30 minutes at a similar time frame (in our case, we measured middle-of-the-day usage on a Tuesday and a Wednesday, when the database is heavily used).
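
For reference, capturing such a window looks roughly like this (the log file path is just an example):

-- in MySQL, for the duration of the measurement window
SET GLOBAL slow_query_log = ON;
SET GLOBAL long_query_time = 0;
-- ...30 minutes later, restore the previous settings...
SET GLOBAL long_query_time = 10;
SET GLOBAL slow_query_log = OFF;

# then summarize the captured log with pt-query-digest
pt-query-digest /var/lib/mysql/slow.log > slow-window-report.txt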

Then I looked at the write analysis – there were no DELETEs, no UPDATEs that updated the indexed field, and no UPDATEs that used the indexed field in the filtering. There were only INSERTs, and with the help of pt-query-digest, here’s what I found:

INSERT analysis:
Query hash 0xFD7…..
Count: 2627 before, 2093 after
Exec time:
– avg – 299us before, 369us after (70us slower)
– 95% – 445 us before, 596us after
– median – 273us before, 301us after

I extrapolated the average per query to 2400 queries, and got:
Total, based on 2400 queries – 71.76ms before, 88.56ms after, 16.8ms longer

There was only one read query that used the indexed field for ORDER BY (or anywhere at all!), so the read analysis was also simple:

Read analysis:
Query hash 0xF94……
Count:187 before, 131 after
Exec time:
– avg – 9ms before, 8ms after. 1 ms saved
– 95% – 20ms before, 16 ms after
– median – 9ms before, 8 ms after

Again, extrapolating the average to 150 queries:
Total, based on 150 queries: 150ms saved

So we can see in this case, the index created a delay of 16.8 ms in a half-hour timeframe, but saved 150 ms in reads.

It is also impressive that the index added very little time to each write (70 microseconds) while saving so much time on each read (1 millisecond). Even though there were about 16 times as many writes as reads, we still saw a big improvement, especially given the cost.

I cannot make a blanket statement that this kind of index will always have this kind of profile (a very tiny write cost for a very large read saving), but I am glad I did this analysis and would love to do it more in the future, to see what the real costs and savings are.

IO-bound sysbench on a smaller server

This is part 3 of my performance report on IO-bound sysbench. In part 1 and part 2 I used a large server with 48 HW threads. Here I use a core i5 NUC with 4 HW threads.

tl;dr
  • results are similar to the first post
  • MyRocks is competitive on the write-heavy tests
  • MyRocks is slower on the read-heavy tests
  • I don't attempt to show that MyRocks wins on space and write efficiency and that is why I like it

Configuration

I use my sysbench fork, helper scripts, and release-specific my.cnf files. The server is a core i5 NUC with 4 HW threads, 16gb of RAM and a fast SSD. The binlog was enabled and sync-on-commit was disabled for the binlog and database log. I remembered to disable SSL.

I tested MyRocks and InnoDB, with buffered IO and a 4g block cache for MyRocks and O_DIRECT and a 12gb buffer pool for InnoDB. The server is shared by the sysbench client and mysqld. For MyRocks I used a build from August 15 with git hash 0d76ae. For InnoDB I used upstream 5.6.35, 5.7.17 and 8.0.2. For InnoDB 8.0.2 I used latin1 charset and latin1_swedish_ci collation. Compression was not used for InnoDB or MyRocks. Note that this description is correct where it disagrees with the my.cnf files that I shared.

The test used 4 tables with 80M rows/table. My use of sysbench is explained here. Tests are run in an interesting pattern -- load, write-heavy, read-only, insert-only. On the core i5 NUC each test is run for 1 and then 2 concurrent connections for either 5 or 10 minutes per concurrency level. So each test runs for either 10 or 20 minutes total and I hope that is long enough to get the database into a steady state. An example command line to run the test with my helper scripts is:
bash all.sh 4 80000000 600 600 300 innodb 1 0 /orig5717/bin/mysql none /sysbench.new
Results without charts

All of the data is here. The results here are mostly similar to the results from the large server; however, MyRocks does worse here on some of the read-heavy tests. The QPS ratio for point-query is 0.651 for MyRocks here versus 0.850 on the large server. Note that the hot-points workload is not IO-bound; regardless, we need to make MyRocks more effective on it. MySQL did something to make InnoDB range scans more efficient starting in 5.7. I don't know whether the problem for MyRocks is CPU or IO overhead in the range-scan heavy tests (read-write.*, read-only.*).

QPS ratio:
* rocks = myrocks.none / inno5635
* inno = inno5717 / inno5635
* value less than 1.0 means that InnoDB 5.6 is faster

1 connection
rocks   inno
2.909   1.074   update-index
1.068   1.095   update-nonindex
1.006   0.935   update-nonindex-special
1.682   0.988   delete-only
1.053   0.961   read-write.range100
0.881   1.554   read-write.range10000
0.776   1.348   read-only.range100
0.898   1.584   read-only.range10000
0.651   1.197   point-query
1.000   1.285   random-points
0.267   0.943   hot-points
0.989   0.941   insert-only

Results with charts

Sorry, no charts this time. Charts from the previous post are close enough.

Protocol reverse engineering with tcpdump

Sometimes network protocols don’t entirely behave as documented. Other times there is no documentation at all beyond code. Either way you can sometimes find a need to sniff the traffic of a connection to find out what is really going on.

Whilst I have been working on MariaDB ColumnStore for a year now there are still some parts of the codebase I know little about. I recently had to write some code that worked with the network protocol of ColumnStore, but there were a few parts that were difficult to understand exactly what was happening just by looking at the code. This is where tcpdump came in.

tcpdump is a powerful tool to help you sniff the raw packet data of network connections. It can be very verbose, showing parts of the TCP/IP handshake, headers, etc. This is way more than I often need for reverse engineering network protocols, so I use tcpflow to filter the results. The final command looks a little like this:

sudo tcpdump -i lo -l -w - port <PORT> | tcpflow -D -C -r -

Breaking this down: we are listening on the localhost interface with line-buffered output, writing raw packets to a pipe. We then use tcpflow to show just the hex data when reading from the pipe.

If we look at port 8616 (DBRM controller) for ColumnStore the end result can look a little like this during a small insert query:

0000: 37c1 fb14 0500 0000 3100 0000 00 7.......1....
0000: 37c1 fb14 0600 0000 0000 0000 0000 7.............
0000: 37c1 fb14 0100 0000 2d 7.......-
0000: 37c1 fb14 0d00 0000 00bd 1d00 0000 0000 0000 0000 00 7....................
0000: 37c1 fb14 0100 0000 34 7.......4
0000: 37c1 fb14 0500 0000 0029 0000 00 7........)...
0000: 37c1 fb14 9100 0000 1a05 0000 0000 102d 0000 0000 0000 0000 0000 0000 80ff ffff 7..............-................
0020: ffff ffff 7ffe ffff ff00 202d 0000 0000 0000 0000 0000 0000 80ff ffff ffff ffff .......... -....................
0040: 7ffe ffff ff00 302d 0000 0000 0000 0000 0000 0000 80ff ffff ffff ffff 7ffe ffff ......0-........................
0060: ff00 502d 0000 0000 0000 0000 0000 0000 80ff ffff ffff ffff 7ffe ffff ff00 702d ..P-..........................p-
0080: 0000 0000 0000 0000 0000 0000 80ff ffff ffff ffff 7ffe ffff ff .........................

From observing the ColumnStore messaging code I know that “37c1 fb14” is an uncompressed packet header and the next 4 bytes are the packet length. The next byte is usually the packet type (or response), which we can look up in some ENUMs to discover. From there we can figure out the rest of the packet contents. I won’t go into details here, but on some occasions it required printing off this data and using highlighters to figure out the parts of the packet.
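
Once you know the framing, a few lines of Python are enough to split a captured stream into messages. This is only a sketch based on the observations above; the little-endian length and the meaning of the first payload byte are inferred from the dump, not taken from the ColumnStore sources:

import struct

MAGIC = bytes.fromhex("37c1fb14")   # uncompressed packet header

def parse_message(buf):
    if buf[:4] != MAGIC:
        raise ValueError("not an uncompressed ColumnStore packet")
    (length,) = struct.unpack_from("<I", buf, 4)   # 4-byte length, little-endian
    payload = buf[8:8 + length]
    return payload[0], payload                     # (type/response byte, raw payload)

msg = bytes.fromhex("37c1fb14" "05000000" "3100000000")
print(parse_message(msg))                          # (49, b'1\x00\x00\x00\x00')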

This method has been extremely useful for other things in the past as well such as debugging MySQL’s replication protocol. It is definitely part of my toolset for working on network daemons. If there are any similar tools you use please put them in the comments below. I’m always interested in improving my workflow and toolset.

Image credit: Terry Robinson, used under a Creative Commons license


Sysbench, a large server and fast SSD vs MyRocks, InnoDB and TokuDB: part 2

This provides more data using the setup in my previous post. The difference is that the previous post used 8 tables with 100M rows/table and this post uses 1 table with 800M rows, so this test has more opportunity for contention.

tl;dr
  • Results here are mostly similar to results from the previous test but ...
  • MyRocks does better here relative to other engines, so it suffers less from contention.
  • InnoDB and TokuDB don't do as well on this test as on the previous test so they suffer more from contention.

Results without pictures
All of the data is here. The data below is the QPS ratio comparing the QPS for MyRocks, InnoDB 5.7.17 and TokuDB 5.7.17 with InnoDB 5.6.35. A value less than one means the QPS for the engine is less than for InnoDB 5.6.35.
QPS ratio:
* rocks = myrocks.none / inno5635
* inno = inno5717 / inno5635
* toku = toku5717.none / inno5635

1 connection
rocks inno toku engine/test
1.765 1.306 0.995 update-index
1.049 1.122 0.421 update-nonindex
0.842 1.020 0.351 update-nonindex-special
4.415 1.821 1.111 delete-only
0.988 1.166 0.329 read-write.range100
1.189 1.300 1.339 read-write.range10000
0.910 1.092 0.358 read-only.range100
0.980 0.966 1.246 read-only.range10000
0.862 0.950 0.257 point-query
0.949 0.949 0.271 random-points
0.515 0.885 0.606 hot-points
0.963 0.991 0.324 insert-only

8 connections
rocks inno toku engine/test
5.958 2.644 2.782 update-index
1.513 1.434 0.473 update-nonindex
1.289 1.422 0.365 update-nonindex-special
1.542 0.995 0.275 delete-only
1.206 1.306 0.253 read-write.range100
1.002 1.288 1.159 read-write.range10000
1.017 1.083 0.329 read-only.range100
0.871 1.069 1.191 read-only.range10000
0.862 0.855 0.204 point-query
0.915 0.859 0.208 random-points
0.890 1.194 0.758 hot-points
0.951 1.149 0.353 insert-only

48 connections
rocks inno toku engine/test
3.836 2.340 1.492 update-index
2.425 2.520 0.312 update-nonindex
2.112 2.981 0.557 update-nonindex-special
3.028 1.950 0.306 delete-only
1.790 1.643 0.122 read-write.range100
1.488 1.883 0.768 read-write.range10000
0.820 1.005 0.093 read-only.range100
0.903 1.219 0.579 read-only.range10000
0.854 0.971 0.079 point-query
0.871 0.953 0.096 random-points
0.739 1.286 0.698 hot-points
1.768 1.579 0.491 insert-only
Results with pictures
Sorry, no charts this time. Charts from the previous post are close enough.

Looking at Disk Utilization and Saturation

In this blog post, I will look at disk utilization and saturation.

In my previous blog post, I wrote about CPU utilization and saturation, the practical difference between them and how different CPU utilization and saturation impact response times. Now we will look at another critical component of database performance: the storage subsystem. In this post, I will refer to the storage subsystem as “disk” (as a casual catch-all). 

The most common tool for command line IO performance monitoring is iostat, which shows information like this:

root@ts140i:~# iostat -x nvme0n1 5
Linux 4.4.0-89-generic (ts140i)         08/05/2017      _x86_64_        (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.51    0.00    2.00    9.45    0.00   88.04

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00 3555.57 5887.81 52804.15 87440.73    29.70     0.53    0.06    0.13    0.01   0.05  50.71

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.60    0.00    1.06   20.77    0.00   77.57

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00 7612.80    0.00 113507.20     0.00    29.82     0.97    0.13    0.13    0.00   0.12  93.68

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    1.26    6.08    0.00   92.16

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00 7653.20    0.00 113497.60     0.00    29.66     0.99    0.13    0.13    0.00   0.12  93.52

The first block of output shows the average performance since system start. In some cases, it is useful to compare the current load to the long-term average; in this case, as it is a test system, it can be safely ignored. The following blocks show the current performance metrics over five-second intervals (as specified on the command line).

The iostat command reports utilization information in the %util column, and you can look at saturation by either looking at the average request queue size (the avgqu-sz column) or looking at the r_await and w_await columns (which show the average wait for read and write operations). If it goes well above “normal” then the device is over-saturated.

As in my previous blog post, we’ll perform some system Sysbench runs and observe how the iostat command line tool and Percona Monitoring and Management graphs behave.

To focus specifically on the disk, we’re using the Sysbench fileio test. I’m using just one 100GB file, as I’m using DirectIO so all requests hit the disk directly. I’m also using “sync” request submission mode so I can get better control of request concurrency.

I’m using an Intel 750 NVME SSD in this test (though it does not really matter).

Sysbench FileIO 1 Thread

root@ts140i:/mnt/data# sysbench --threads=1 --time=600 --max-requests=0 fileio --file-num=1 --file-total-size=100G --file-io-mode=sync --file-extra-flags=direct --file-test-mode=rndrd run

File operations:
    reads/s:                      7113.16
    writes/s:                     0.00
    fsyncs/s:                     0.00

Throughput:
    read, MiB/s:                  111.14
    written, MiB/s:               0.00

General statistics:
    total time:                          600.0001s
    total number of events:              4267910

Latency (ms):
         min:                                  0.07
         avg:                                  0.14
         max:                                  6.18
         95th percentile:                      0.17

A single thread run is always great as a baseline, as with only one request in flight we should expect the best response time possible (though typically not the best throughput possible).

Iostat:
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00 7612.80    0.00 113507.20     0.00    29.82     0.97    0.13    0.13    0.00   0.12  93.68

Disk Latency

The Disk Latency graph confirms the disk IO latency we saw in the iostat command, and it will be highly device-specific. We use it as a baseline to compare changes to with higher concurrency.

Disk IO Utilization

Disk IO utilization is close to 100% even though we have just one outstanding IO request (queue depth). This is the problem with Linux disk utilization reporting: unlike CPUs, Linux does not have direct visibility on how the IO device is designed. How many “execution units” does it really have? How are they utilized?  Single spinning disks can be seen as a single execution unit while RAID, SSDs and cloud storage (such as EBS) are more than one.

Disk Load

This graph shows the disk load (or request queue size), which roughly matches the number of threads that are hitting disk as hard as possible.

Saturation (IO Load)

The IO load on the Saturation Metrics graph shows pretty much the same numbers. The only difference is that unlike Disk IO statistics, it shows the summary for the whole system.

Sysbench FileIO 4 Threads

Now let’s increase IO to four concurrent threads and see how disk responds:

sysbench --threads=4 --time=600 --max-requests=0 fileio --file-num=1 --file-total-size=100G --file-io-mode=sync --file-extra-flags=direct --file-test-mode=rndrd run

File operations:
    reads/s:                      26248.44
    writes/s:                     0.00
    fsyncs/s:                     0.00

Throughput:
    read, MiB/s:                  410.13
    written, MiB/s:               0.00

General statistics:
    total time:                          600.0002s
    total number of events:              15749205

Latency (ms):
         min:                                  0.06
         avg:                                  0.15
         max:                                  8.73
         95th percentile:                      0.21

We can see the number of requests scales almost linearly, while request latency changes very little: 0.14ms vs. 0.15ms. This shows the device has enough execution units internally to handle the load in parallel, and there are no other bottlenecks (such as the connection interface).

Iostat:
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00 28808.60    0.00 427668.00     0.00    29.69     4.05    0.14    0.14    0.00   0.03  99.92

Disk Latency

Disk Utilization

Disk Load

Saturation Metrics (IO Load)

These stats and graphs show an interesting picture: we barely see a response time increase for IO requests, while utilization inches closer to 100% (with four threads submitting requests all the time, it is hard to catch a moment when the disk does not have any requests in flight). The load is near four, showing the disk has to handle four requests at a time on average.

Sysbench FileIO 16 Threads

root@ts140i:/mnt/data# sysbench --threads=16 --time=600 --max-requests=0 fileio --file-num=1 --file-total-size=100G --file-io-mode=sync --file-extra-flags=direct --file-test-mode=rndrd run

File operations:
    reads/s:                      76845.96
    writes/s:                     0.00
    fsyncs/s:                     0.00

Throughput:
    read, MiB/s:                  1200.72
    written, MiB/s:               0.00

General statistics:
    total time:                          600.0003s
    total number of events:              46107727

Latency (ms):
         min:                                  0.07
         avg:                                  0.21
         max:                                  9.72
         95th percentile:                      0.36

Going from four to 16 threads, we again see a good throughput increase with a mild response time increase. If you look at the results closely, you will notice one more interesting thing: the average response time has increased from 0.15ms to 0.21ms (which is a 40% increase), while the 95% response time has increased from 0.21ms to 0.36ms (which is 71%). I also ran a separate test measuring 99% response time, and the difference is even larger: 0.26ms vs. 0.48ms (or 84%).

This is an important observation to make: once saturation starts to happen, the variance is likely to increase and some of the requests will be disproportionately affected (beyond what the average response time shows).

Iostat:
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00 82862.20    0.00 1230567.20     0.00    29.70    16.33    0.20    0.20    0.00   0.01 100.00

Disk IO Latency

Disk IO Utilization

Disk Load

Saturation Metrics IO Load

The graphs show an expected figure: the disk load and IO load from saturation are up to about 16, and utilization remains at 100%.

One thing to notice is the increased jitter in the graphs. IO utilization jumps to over 100% and disk IO load spikes to 18, when there should not be that many requests in flight. This comes from how the information is gathered: an attempt is made to sample this data every second, but on a loaded system it takes time for this process to work, so sometimes we ask for a one-second interval and really get data for a 1.05- or 0.95-second interval. When the math is applied to the data, it creates spikes and dips in the graph where there should be none. You can just ignore them if you’re looking at the big picture.

Sysbench FileIO 64 Threads

Finally, let’s run sysbench with 64 concurrent threads hitting the disk:

root@ts140i:/mnt/data# sysbench --threads=64 --time=600 --max-requests=0 fileio --file-num=1 --file-total-size=100G --file-io-mode=sync --file-extra-flags=direct --file-test-mode=rndrd run

File operations:
    reads/s:                      127840.59
    writes/s:                     0.00
    fsyncs/s:                     0.00

Throughput:
    read, MiB/s:                  1997.51
    written, MiB/s:               0.00

General statistics:
    total time:                          600.0014s
    total number of events:              76704744

Latency (ms):
         min:                                  0.08
         avg:                                  0.50
         max:                                  9.34
         95th percentile:                      1.25

We can see the average has risen from 0.21ms to 0.50ms (more than two times), and the 95th percentile has more than tripled, from 0.36ms to 1.25ms. From a practical standpoint, we can see some saturation starting to happen, but we’re still not seeing a linear response time increase with increasing numbers of parallel operations as we saw with CPU saturation. I guess this points to the fact that this IO device has a lot of parallel capacity inside and can process requests effectively (even going from 16 to 64 concurrent threads).

Over the series of tests, as we increased concurrency from one to 64, we saw response times increase from 0.14ms to 0.5ms (or approximately three times). The 95% response time at this time grew from 0.17ms to 1.25ms (or about seven times). For practical purposes, this is where we see the IO device saturation start to show.

Iostat:
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00 138090.20    0.00 2049791.20     0.00    29.69    65.99    0.48    0.48    0.00   0.01 100.24

We’ll skip the rest of the graphs as they basically look the same, just with higher latency and 64 requests in flight.

Sysbench FileIO 256 Threads

root@ts140i:/mnt/data# sysbench --threads=256 --time=600 --max-requests=0 fileio --file-num=1 --file-total-size=100G --file-io-mode=sync --file-extra-flags=direct --file-test-mode=rndrd run

File operations:
    reads/s:                      131558.79
    writes/s:                     0.00
    fsyncs/s:                     0.00

Throughput:
    read, MiB/s:                  2055.61
    written, MiB/s:               0.00

General statistics:
    total time:                          600.0026s
    total number of events:              78935828

Latency (ms):
         min:                                  0.10
         avg:                                  1.95
         max:                                 17.08
         95th percentile:                      3.89

Iostat:
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00 142227.60    0.00 2112719.20     0.00    29.71   268.30    1.89    1.89    0.00   0.01 100.00

With 256 threads, finally we’re seeing the linear growth of the average response time that indicates overload and queueing to process requests. There is no easy way to tell if it is due to the IO bus saturation (we’re reading 2GB/sec here) or if it is the internal device processing ability.  

As we’ve seen a less than linear increase in response time going from 16 to 64 connections, and a linear increase going from 64 to 256, we can see the “optimal” concurrency for this device: somewhere between 16 and 64 connections. This allows for peak throughput without a lot of queuing.

Before we get to the summary, I want to make an important note about this particular test. The test is a random reads test, which is a very important pattern for many database workloads, but it might not be the dominant load for your environment. You might be write-bound as well, or have mainly sequential IO access patterns (which could behave differently). For those other workloads, I hope this gives you some ideas on how to also analyze them.

Another Way to Think About Saturation

When I asked the Percona staff for feedback on this blog post, my colleague Yves Trudeau provided another way to think about saturation: measure saturation as the percent increase in the average response time compared to the single-user case. Like this:

Threads   Avg Response Time (ms)   Saturation
1         0.14                     –
4         0.15                     1.07x or 7%
16        0.21                     1.5x or 50%
64        0.50                     3.6x or 260%
256       1.95                     13.9x or 1290%

 

Summary

We can see how understanding disk utilization and saturation is much more complicated than for the CPU:

  • The Utilization metric (as reported by iostat and by PMM) is not very helpful for showing true storage utilization, as it only measures the time when there is at least one request in flight. If you had the same metric for the CPU, it would correspond to something running on at least one of the cores (not very useful for highly parallel systems).
  • Unlike a CPU, Linux tools do not provide us with information about the structure of the underlying storage and how much parallel load it should be able to handle without saturation. Even more so, storage might well have different low-level resources that cause saturation. For example, it could be the network connection, SATA BUS or even the kernel IO stack for older kernels and very fast storage.
  • Saturation as measured by the number of requests in flight is helpful for guessing if there might be saturation, but since we do not know how many requests the device can efficiently process concurrently, just looking at the raw metric doesn’t let us determine that the device is overloaded.
  • Avg Response Time is a great metric for looking at saturation, but as with the response time you can’t say what response time is good or bad for this device. You need to look at it in context and compare it to the baseline. When you’re looking at the Avg Response Time, make sure you’re looking at read request response time vs. write request response time separately, and keep the average request size in mind to ensure we are comparing apples to apples.

MySQL 5.7 InnoDB Tablespace

InnoDB is the default storage engine for MySQL and is widely used thanks to its ACID support, optimized read-write performance, and many other features that are of great significance for a database server.

In this blog post, we are going to cover InnoDB tablespaces and related features:

  • InnoDB engine tablespaces
  • Tablespace Data Encryption
  • Tablespace related Configuration
InnoDB engine tablespaces

System tablespace:

The common tablespace used for MySQL server operations. Apart from table data storage, InnoDB relies on it for table metadata, and for storing and retrieving MVCC information to support ACID compliance and transaction isolation. It contains several types of information for InnoDB objects.

  • Contains:
    Table Data Pages
    Table Index Pages
    Data Dictionary
    MVCC Control Data
    Undo Space
    Rollback Segments
    Double Write Buffer (Pages Written in the Background to avoid OS caching)
    Insert Buffer (Changes to Secondary Indexes)
  • Variables:
    innodb_data_file_path = /ibdata/ibdata1:10M:autoextend

    By enabling the innodb_file_per_table option (the default), we can store each newly created table (data and indexes) in a separate tablespace. The advantage of this storage method is less fragmentation within the disk data file.

InnoDB data dictionary:
Storage area in the system tablespace made up of internal system tables holding metadata about objects (tables, indexes, columns, etc.).

Double write buffer:
Storage area in the system tablespace where InnoDB writes pages flushed from the InnoDB buffer pool before writing them to their proper location in the data files.
If the mysqld process crashes in the middle of a page write, InnoDB can find a good copy of the page in the doublewrite buffer during crash recovery.

Variable: innodb_doublewrite (enabled by default)

REDO logs:
Used for crash recovery. At mysqld startup, InnoDB performs automatic recovery to correct data written by incomplete transactions. Transactions that did not finish updating the data files before an unexpected mysqld shutdown are replayed automatically at startup, even before any connections are accepted. It uses the LSN (Log Sequence Number) value.
Many data changes cannot be written to disk quickly enough, so they are recorded in the redo log first and then flushed to disk.

Why do we need redo logs for recovery?
Let’s take an example: a user changes data in the InnoDB buffer pool and commits. The change needs to be persisted somewhere before it is written to the data files, because in the case of a crash the buffer contents would be lost. That’s why we need redo logs.

– In the redo log, changes are recorded with information such as row_id, old column value, new column value, session_id and time.
– Once a commit completes, the data will eventually be written to a data file on disk.
– Variables:
innodb_log_files_in_group = [number of redo log files in the group]

innodb_log_buffer_size = [buffer size] (set a larger value to hold large transactions in memory; start from 10-20% of the total log file size)

innodb_log_file_size = [size of each redo log file] (should be set to a larger value for databases with BLOB data types)

UNDO tablespace and logs:

The UNDO tablespace contains one or more undo log files.
UNDO manages consistent reads by keeping modified, uncommitted data for active transactions (MVCC). Unmodified data is retrieved from this storage area. Undo logs are also called rollback segments.
By default, UNDO logs are part of the system tablespace; MySQL also allows storing undo logs in one or more separate UNDO tablespaces (introduced in MySQL 5.6). This needs to be configured before initializing the mysqld server.

– When we configure separate undo tablespaces, the undo logs in the system tablespace become inactive.
– This needs to be configured before initializing the mysqld server and cannot be changed afterwards.
– We can truncate undo logs, but we cannot drop them.
– The default initial size of an undo tablespace file is 10MB.
– Variables :
innodb_undo_tablespaces : number of undo tablespaces, default 0, max 95
innodb_undo_directory : location for undo tablespaces, default is the data directory, with a 10MB initial size.
innodb_undo_logs : number of undo logs in a single undo tablespace, default and max value is 128 [deprecated in 5.7.19; the innodb_rollback_segments variable controls this]
innodb_undo_log_truncate : truncate undo tablespaces, default OFF [when enabled, undo tablespaces that exceed the threshold value defined by innodb_max_undo_log_size are marked for truncation]

Key Points:

  • Truncating undo logs requires separate undo tablespaces; undo logs kept in the system tablespace cannot be truncated.
  • innodb_undo_tablespaces must be set to a value equal to or greater than 2.
  • innodb_rollback_segments must be set to a value equal to or greater than 35.

Benefits: this reduces the size of the system tablespace, since long-running transactions are stored in one or more separate UNDO tablespaces.

Temporary tablespace:

Storage used to keep and retrieve modified, uncommitted data for temporary tables and related objects. Introduced in MySQL 5.7.2 and used to roll back temporary table changes while the server is running.

– Undo logs for temporary tables reside in the temporary tablespace.
– The default tablespace file, ibtmp1, is recreated on server startup.
– It is not used for crash recovery.
– Advantage: performance gain by avoiding redo logging IO for temporary tables and related objects.
– Variable:
innodb_temp_data_file_path = ibtmp1:12M:autoextend (default)

General tablespace:

A shared tablespace that can store data for multiple tables. Introduced in MySQL 5.7.6. A user has to create it using the CREATE TABLESPACE syntax. The TABLESPACE option can be used with CREATE TABLE to create a table in a general tablespace, and with ALTER TABLE to move a table into one.

– Memory advantage over the innodb_file_per_table storage method.
– Supports both Antelope and Barracuda file formats.
– Supports all row formats and associated features.
– It is possible to create a general tablespace outside the data directory.

Tablespace Data Encryption:

InnoDB supports data encryption for InnoDB tables stored in file-per-table tablespaces using the MySQL keyring. MySQL 5.7.11 and higher includes a keyring plugin.

keyring_file: stores keyring data in a file local to the server host. keyring_file must be loaded at each server startup using the --early-plugin-load option.

keyring_okv: a back end for keyring storage products such as Oracle Key Vault. This plugin is available in MySQL Enterprise Edition distributions.

Variables:
early-plugin-load : ensures that the plugin is available prior to initialization of the InnoDB storage engine.
keyring_file_data : the keyring file path.

Tablespace related Configuration and operations:

MySQL InnoDB Configuration:

## DATA STORAGE ##
datadir=/var/lib/mysql

## InnoDB Configuration ##
innodb_file_per_table=1

# InnoDB Memory
innodb_buffer_pool_size = 2000M

# System Tablespace configuration
innodb_data_file_path= ibdata1:512M;ibdata2:512M:autoextend

# Redo log and buffer configuration
innodb-log-files-in-group=3
innodb_log_file_size=100M
innodb_log_buffer_size=30M

# InnoDB file format
innodb_file_format = Barracuda

# UNDO Tablespace Configuration
innodb_undo_directory = /var/lib/mysql/
innodb_undo_tablespaces = 3
innodb_undo_logs = 128
innodb_undo_log_truncate = ON
innodb_rollback_segments = 128

# Temp Tablespace Configuration
tmpdir = /var/lib/mysql/
innodb_temp_data_file_path = ibtmp1:20M:autoextend

# Keyring configuration
early-plugin-load=keyring_file.so
keyring_file_data=/var/lib/mysql-keyring/keyring

MySQL Server Initialization Logs:

[Note] InnoDB: Using Linux native AIO
[Note] InnoDB: Number of pools: 1
[Note] InnoDB: Using CPU crc32 instructions
[Note] InnoDB: Initializing buffer pool, total size = 2G, instances = 8, chunk size = 128M
[Note] InnoDB: Completed initialization of buffer pool
[Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
[Note] InnoDB: Opened 4 undo tablespaces
[Note] InnoDB: 4 undo tablespaces made active
[Note] InnoDB: Highest supported file format is Barracuda.
[Note] InnoDB: Creating shared tablespace for temporary tables
[Note] InnoDB: Setting file './ibtmp1' size to 20 MB. Physically writing the file full; Please wait ...
[Note] InnoDB: File './ibtmp1' size is now 20 MB.
[Note] InnoDB: 96 redo rollback segment(s) found. 96 redo rollback segment(s) are active.
[Note] InnoDB: 32 non-redo rollback segment(s) are active.
[Note] InnoDB: Waiting for purge to start
[Note] InnoDB: 5.7.19 started; log sequence number 2454162
[Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
[Note] Plugin 'FEDERATED' is disabled.
[Note] InnoDB: Buffer pool(s) load completed at 170828 12:03:52

UNDO and Temporary tablespaces:

General tablespace Example:

A general tablespace can be created inside the MySQL datadir (the default) or outside of the MySQL data directory.

Example:

# Create General tablespace
mysql> CREATE TABLESPACE gen_tblsp ADD DATAFILE 'gen_tlbsp.ibd' ENGINE = INNODB;
Query OK, 0 rows affected (0.01 sec)

mysql> select * from INFORMATION_SCHEMA.FILES where TABLESPACE_NAME ='gen_tblsp'\G
*************************** 1. row ***************************
FILE_ID: 27
FILE_NAME: ./gen_tlbsp.ibd
FILE_TYPE: TABLESPACE
TABLESPACE_NAME: gen_tblsp
....
....

# Create table inside general tablespace.
mysql> CREATE TABLE gen_ts_tbl (id int(11), c_desc varchar(100), c_comments text ) TABLESPACE gen_tblsp;
Query OK, 0 rows affected (0.01 sec)

mysql> INSERT INTO gen_ts_tbl values (1, 'test' , 'General tablespace testing');
Query OK, 1 row affected (0.01 sec)

mysql> select * from gen_ts_tbl;
+------+--------+----------------------------+
| id   | c_desc | c_comments                 |
+------+--------+----------------------------+
|    1 | test   | General tablespace testing |
+------+--------+----------------------------+
1 row in set (0.00 sec)

# Move existing table into general tablespace.
mysql> create table innodb_table (id int (11), uname varchar(78));
Query OK, 0 rows affected (0.01 sec)

mysql> insert into innodb_table values(1,'moving to gen_tblsp');
Query OK, 1 row affected (0.01 sec)

mysql> ALTER TABLE innodb_table TABLESPACE gen_tblsp;
Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> select * from innodb_table;
+------+---------------------+
| id   | uname               |
+------+---------------------+
|    1 | moving to gen_tblsp |
+------+---------------------+
1 row in set (0.00 sec)

# DROP General Tablespace [we need to drop all tables in the general tablespace before dropping it]
mysql> show tables;
+----------------+
| Tables_in_test |
+----------------+
| gen_ts_tbl     |
| innodb_table   |
+----------------+
2 rows in set (0.00 sec)

mysql> drop table gen_ts_tbl;
Query OK, 0 rows affected (0.01 sec)

mysql> drop table innodb_table;
Query OK, 0 rows affected (0.00 sec)

mysql> show tables;
Empty set (0.01 sec)

mysql> drop tablespace gen_tblsp;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from INFORMATION_SCHEMA.FILES where TABLESPACE_NAME ='gen_tblsp'\G
Empty set (0.00 sec)

InnoDB TDE using a keyring_file plugin:

mysql> SELECT PLUGIN_NAME, PLUGIN_STATUS FROM INFORMATION_SCHEMA.PLUGINS WHERE PLUGIN_NAME LIKE 'keyring%';
+--------------+---------------+
| PLUGIN_NAME  | PLUGIN_STATUS |
+--------------+---------------+
| keyring_file | ACTIVE        |
+--------------+---------------+
1 row in set (0.00 sec)

mysql> show variables like '%keyring%';
+-------------------+--------------------------------+
| Variable_name     | Value                          |
+-------------------+--------------------------------+
| keyring_file_data | /var/lib/mysql-keyring/keyring |
+-------------------+--------------------------------+
1 row in set (0.00 sec)

mysql> CREATE TABLE innodb_tde (id int(11), c_desc varchar(100), c_comments text ) ENCRYPTION='Y';
Query OK, 0 rows affected (0.01 sec)

mysql> SELECT TABLE_SCHEMA, TABLE_NAME, CREATE_OPTIONS FROM INFORMATION_SCHEMA.TABLES WHERE CREATE_OPTIONS LIKE '%ENCRYPTION="Y"%';
+--------------+------------+----------------+
| TABLE_SCHEMA | TABLE_NAME | CREATE_OPTIONS |
+--------------+------------+----------------+
| test         | innodb_tde | ENCRYPTION="Y" |
+--------------+------------+----------------+
1 row in set (0.01 sec)

mysql> INSERT INTO innodb_tde values (1, 'test tde' , 'innodb tde testing');
Query OK, 1 row affected (0.00 sec)

mysql> select * from innodb_tde;
+------+----------+--------------------+
| id   | c_desc   | c_comments         |
+------+----------+--------------------+
|    1 | test tde | innodb tde testing |
+------+----------+--------------------+
1 row in set (0.01 sec)

# Disable / enable ENCRYPTION on the table
mysql> ALTER TABLE innodb_tde ENCRYPTION='N';
Query OK, 1 row affected (0.02 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> ALTER TABLE innodb_tde ENCRYPTION='Y';
Query OK, 1 row affected (0.02 sec)
Records: 1  Duplicates: 0  Warnings: 0

mysql> select * from innodb_tde;
+------+----------+--------------------+
| id   | c_desc   | c_comments         |
+------+----------+--------------------+
|    1 | test tde | innodb tde testing |
+------+----------+--------------------+
1 row in set (0.00 sec)

# ENCRYPTION MASTER KEY rotation
mysql> ALTER INSTANCE ROTATE INNODB MASTER KEY;
Query OK, 0 rows affected (0.01 sec)

mysql> select * from innodb_tde;
+------+----------+--------------------+
| id   | c_desc   | c_comments         |
+------+----------+--------------------+
|    1 | test tde | innodb tde testing |
+------+----------+--------------------+
1 row in set (0.00 sec)

Refer MySQL doc for more:
https://dev.mysql.com/doc/refman/5.7/en/innodb-tablespace.html


Using Active Record migrations beyond SQLite

SQLite is really a good tool for setting up quick proofs of concept and small applications; however, it’s not the most robust solution on the market for working with relational databases. In the open source community, two databases sit at the top of the list: PostgreSQL and MySQL.

I did a small project for my studies, using SQLite because I didn’t need much out of it. Curious, I decided to see how the application would behave on other databases, so I tried PostgreSQL and MySQL. I had two problems to solve, and this post is about the first one: how to deal with the migrations. They were as follows:

Active Record automatically puts the id field in all of its tables, which is why it is omitted in the migrations.

In PostgreSQL it went smoothly: all the migrations ran without a hiccup. On MySQL, however, it gave me an error!

StandardError: An error has occurred, all later migrations canceled: Column `artist_id` on table `songs` has a type of `int(11)`. This does not match column `id` on `artists`, which has type `bigint(20)`. To resolve this issue, change the type of the `artist_id` column on `songs` to be :integer. (For example `t.integer artist_id`). Original message: Mysql2::Error: Cannot add foreign key constraint: ALTER TABLE `songs` ADD CONSTRAINT `fk_rails_5ce8fd4cc7` FOREIGN KEY (`artist_id`) REFERENCES `artists` (`id`)

The problem, beyond the illegible index name it generates (fk_rails_5ce8fd4cc7), is that artist_id on my table was an INT. The first thing I checked was whether artists.id was UNSIGNED and whether my foreign key was as well. Neither was, but since both were signed, that wouldn’t throw an error. Looking more closely at the error message, I noticed that the type of my foreign key column did not match the type of the primary key on the other table. Little did I know that Active Record generates the id field not as an INT, but as a BIGINT.

I decided to go back and look at PostgreSQL, and to my surprise (up to now I am still not sure why), PostgreSQL allowed the column type mismatch where MySQL threw an error.

To fix it, I had to change the migration as follows:

Digging online, I found out how to create a bigint field with Active Record. According to the post I found, this would only work on MySQL, but I found it also worked with PostgreSQL (I tested MySQL 5.7 and Postgres 9.6): t.integer :artist_id, limit: 8.

The limit option sets the maximum length for string types, or the number of bytes for numeric types.
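In MySQL terms, the schema the corrected migration produces is roughly equivalent to the following (a hedged sketch with hypothetical column lists, not the project’s exact schema):

CREATE TABLE artists (
  id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(255)
);

-- artist_id must also be BIGINT (which is what `limit: 8` maps to),
-- otherwise MySQL rejects the foreign key below:
CREATE TABLE songs (
  id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(255),
  artist_id BIGINT,
  FOREIGN KEY (artist_id) REFERENCES artists (id)
);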

Why type matching is important

As an INT, let’s say your number has to fit inside an espresso cup. Sure, you can use a Starbucks Venti-sized cup for your coffee, but the full contents of a Venti will never fit into an espresso cup.

In the specific domain I am working on, if I had a big list of artists and happened to have an artist whose ID was higher than 2,147,483,647 (the signed INT maximum in both PostgreSQL and MySQL), I would get an error when trying to insert it into the songs table, since an artist id can go up to 8 bytes (9,223,372,036,854,775,807).

Example:

Queen has its Artist id as: 21474836481 (which is a BIGINT)

Trying to insert “We Will Rock you” in the artist_id column for songs:

We get:

********** Error ********** ERROR: integer out of range SQL state: 22003

This is the kind of problem we usually don’t notice in the beginning, and more often than not only after the application has been in production for years, but it can and will happen if we don’t pay attention to foreign key types.

After that change, all the migrations ran smoothly, and I could actually move forward to the next problem (and post): filtering by song title or artist name.


Filed under: Article Tagged: activerecord, mysql, postgres, ruby, sqlite

Part of my history inside InfiniDB/ColumnStore

Several years ago there was a fork of the unreleased MySQL 6.0 called Drizzle. It was designed to be a lightweight, cloud/web/UTF8 first database server with a microkernel style core. I worked for a while as one of the core developers of Drizzle until the corporate sponsor I worked for ceased funding its development.

Fast-forward to 2016 and I start working on MariaDB ColumnStore and one of the biggest surprises to me is that it incorporated part of Drizzle! Specifically the BSD licensed MySQL/MariaDB compatible client library called libdrizzle.

ColumnStore’s MariaDB plugin gets the entire query plan tree for a query and passes it on to its internal processes to break it up into parts that can be worked on in parallel. Since this doesn’t happen inside MariaDB server it needs a way to get at data that is not part of ColumnStore (such as InnoDB) for joins and subqueries.

This is where libdrizzle comes in, ColumnStore builds a query which can get the data required from the non-ColumnStore tables, makes a new client connection, executes the query and joins the result internally to the ColumnStore tables.

I suspect libdrizzle was originally used due to its BSD license, which would make it easier for Calpont (the InfiniDB company) to have both a commercial and an open source version of InfiniDB, as well as for its libmysqlclient-like API and MySQL compatibility.

So, where does my history come in? One of the things I worked on the most when I was working on the Drizzle project is libdrizzle. I had no clue that it was being used in this way which was an interesting surprise!

Unfortunately libdrizzle 2.0 (the version used in ColumnStore) is no longer developed. The only version still being maintained is a fork of a more recent version that Brian Aker and I created whilst we worked for HP, called libdrizzle-redux. This was an attempt at simplifying the API and adding support for features that libdrizzle never supported. Development continues on GitHub by Sociomantic Labs GmbH. I have helped them get off the ground with the project and am happy a part of my work lives on.

Libdrizzle-redux has a very different API, so we couldn’t just drop it in as a replacement. As much as I liked libdrizzle, I didn’t want to continue maintaining the 2.0 version (I have fixed some things in the version ColumnStore 1.0 uses). In addition, we didn’t really have a need for a BSD-licensed connector in ColumnStore, so for ColumnStore 1.1 we are switching to the internal client connector that comes with MariaDB.

It was a little bittersweet removing a large chunk of my history from ColumnStore’s codebase, but I’m glad that Sociomantic have continued to build a community around my legacy.


This Week in Data with Colin Charles #3: More Percona Live Europe!

Join Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.

We are five weeks out from the conference! The tutorials and the sessions have been released, and there’s an added bonus – you can now look at all this in a grid view: tutorials, day one and day two. Now that you can visualize what’s being offered, don’t forget to register.

If you want a discount code, feel free to email me at colin.charles@percona.com.

We have some exciting keynotes as well. Some highlights:

  1. MySQL as a Layered Service: How to Use ProxySQL to Control Traffic and Scale Out, given by René Cannaò, the creator of ProxySQL
  2. Why Open Sourcing Our Database Tooling was the Smart Decision, given by Shlomi Noach, creator of Orchestrator, many other tools, and developer at GitHub (so expect some talk about gh-ost)
  3. MyRocks at Facebook and a Roadmap, given by Yoshinori Matsunobu, shepherd of the MyRocks project at Facebook
  4. Real Time DNS Analytics at CloudFlare with ClickHouse, given by Tom Arnfeld
  5. Prometheus for Monitoring Metrics, given by Brian Brazil, core developer of Prometheus
  6. A Q&A session with Charity Majors and Laine Campbell on Database Reliability Engineering, their new upcoming book!

Let’s not forget the usual State of the Dolphin, an update from Oracle’s MySQL team (representative: Geir Høydalsvik), as well as a keynote by Peter Zaitsev (CEO, Percona) and Continuent. There will also be a couple of Percona customers keynoting, so expect information-packed fun mornings! You can see more details about the keynotes here: day one and day two.

Releases
  • Tarantool 1.7.5 stable. The first in the 1.7 series that comes as stable, and it also comes with its own Log Structured Merge Tree (LSM) engine called Vinyl. They wrote this when they found RocksDB insufficient for them. Slides: Vinyl: why we wrote our own write-optimized storage engine rather than chose RocksDB (and check out the video).
  • MariaDB Server 10.2.8. As per my previous column, this build merges TokuDB from Percona Server 5.6.36-82.1 (fixing some bugs). There is also a new InnoDB from MySQL 5.7.19 (the current GA release). Have you tried MariaDB Backup yet? There are some GIS compatibility fixes (i.e., to make it behave like MySQL 5.7). One thing that piqued my interest is that the CONNECT storage engine (typically used for ETL operations) now has beta support for the MONGO table type. No surprises, it’s meant to read MongoDB tables via the MongoDB C Driver API. Definitely something to try!
Upcoming Appearances

Percona’s website keeps track of community events, so check out where to listen to a Perconian speak. My upcoming appearances are:

  1. db tech showcase Tokyo 2017 – 5-7 September 2017, Tokyo, Japan
  2. Open Source Summit North America – 11-14 September 2017, Los Angeles, CA, USA
  3. Percona Live Europe Dublin – 25-27 September 2017, Dublin, Ireland
  4. Velocity Europe – 17-20 October 2017, London, UK
  5. Open Source Summit Europe – 23-26 October 2017, Prague, Czech Republic
Feedback

Bill Bogasky (MariaDB Corporation) says that if you’re looking for commercial support for Riak now that Basho has gone under, you could get it from Erlang Solutions or TI Tokyo. See their announcement: Riak commercial support now available post-Basho. Thanks, Bill!

I look forward to feedback/tips via e-mail at colin.charles@percona.com or on Twitter @bytebot.

Percona Server for MySQL 5.6.37-82.2 Is Now Available

Percona announces the release of Percona Server for MySQL 5.6.37-82.2 on August 25, 2017. Download the latest version from the Percona web site or the Percona Software Repositories. You can also run Docker containers from the images in the Docker Hub repository.

Based on MySQL 5.6.37, and including all the bug fixes in it, Percona Server for MySQL 5.6.37-82.2 is now the current GA release in the Percona Server for MySQL 5.6 series. Percona Server for MySQL is open-source and free – this is the latest release of our enhanced, drop-in replacement for MySQL. Complete details of this release are available in the 5.6.37-82.2 milestone on Launchpad.

NOTE: Red Hat Enterprise Linux 5 (including CentOS 5 and other derivatives), Ubuntu 12.04 and older versions are no longer supported by Percona software. The reason for this is that these platforms reached end of life, will not receive updates and are not recommended for use in production.

Bugs Fixed

  • #1703105: Fixed overwriting of error log on server startup.
  • #1705729: Fixed the postinst script to correctly locate the datadir.
  • #1709834: Fixed the mysqld_safe script to correctly locate the basedir.
  • Other fixes: #1706262

TokuDB Changes

  • TDB-72: Fixed issue when renaming a table with non-alphanumeric characters in its name.

Platform Support

  • Stopped providing packages for RHEL 5 (CentOS 5) and Ubuntu 12.04.

Release notes for Percona Server for MySQL 5.6.37-82.2 are available in the online documentation. Please report any bugs on the launchpad bug tracker.

A How-To Guide for Galera Cluster - Updated Tutorial

Since it was originally published, more than 63,000 people (to date) have leveraged the Galera Cluster for MySQL Tutorial to both learn about and get started using MySQL Galera Cluster.

Galera Cluster for MySQL is a true Multi-master Cluster which is based on synchronous replication. Galera Cluster is an easy-to-use, high-availability solution, which provides high system uptime, no data loss and scalability to allow for future growth.

Severalnines was a very early adopter of the Galera Cluster technology, which was created by Codership and has since expanded to include versions from Percona and MariaDB.

Related resources  Read the Updated Tutorial  ClusterControl for Galera Cluster  Top Galera Resources

Included in this newly updated tutorial are topics like…

  • An introduction to Galera Cluster
  • An explanation of the differences between MySQL Replication and Galera Replication
  • Deployment of Galera Cluster
  • Accessing the Galera Cluster
  • Failure Handling
  • Management and Operations
  • FAQs and Common Questions

Check out the updated Galera Cluster for MySQL tutorial here.


ClusterControl makes it easy for those new to Galera to use the technology and deploy their first clusters. It centralizes the database management into a single interface. ClusterControl automation ensures DBAs and SysAdmins make critical changes to the cluster efficiently with minimal risks.

ClusterControl delivers on an array of features to help manage and monitor your open source database environments:

  • Deploy Database Clusters
  • Add Node, Load Balancer (HAProxy, ProxySQL) or Replication Slave
  • Backup Management
  • Configuration Management
  • Full stack monitoring (DB/LB/Host)
  • Query Monitoring
  • Enable SSL Encryption for Galera Replication
  • Node Management
  • Developer Studio with Advisors

Learn more about how ClusterControl can help you drive high availability with Galera Cluster here.

Tags:  galera galera cluster xtradb cluster mariadb cluster mysql cluster MySQL database cluster clustering tutorial

Using MySQL Connector/Python X (mysqlx module) MyRocks episode

This post is about how I tried to make a simple Python script that uses the mysqlx module work with MyRocks.
It is also related to pytest, as I have implemented simple pytest tests and call them from a bash file.

So let’s discuss the problem description:
The base problem is that, by default, when you create a collection using the Python X Plugin, the collection will have one JSON-type column called `doc` and one column called `_id`, generated from this `doc` column.
So basically, you cannot alter the table engine to MyRocks, because it will give an error something like:

ERROR 3106 (HY000): 'Specified storage engine' is not supported for generated columns.


Well, it can be solved by dropping the generated `_id` column. But here we encounter another issue: if you have a table with JSON data, please do NOT alter it to MyRocks, otherwise you will get some weird results, as described here:

https://jira.percona.com/browse/MYR-152
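In plain SQL, the workaround that the script below automates boils down to something like this (a sketch, assuming the `generated_columns_test` schema and `my_collection` collection used later in the post):

-- Drop the generated column so the engine can be switched
ALTER TABLE generated_columns_test.my_collection DROP COLUMN `_id`;

-- With the generated column gone, the ALTER no longer fails with error 3106
ALTER TABLE generated_columns_test.my_collection ENGINE=RocksDB;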

So basically, when I select from this collection, it raises an OperationalError:

mysqlx.errors.OperationalError("The JSON binary value contains invalid data")

Now the sample code portions:

The main worker class code:

import mysqlx


class MyXPlugin:

    def __init__(self, schema_name, collection_name):
        # Connect to a dedicated MySQL server
        self.session = mysqlx.get_session({
            'host': 'localhost',
            'port': 33060,
            'user': 'bakux',
            'password': 'Baku12345',
            'ssl-mode': mysqlx.SSLMode.DISABLED
        })

        self.schema_name = schema_name
        self.collection_name = collection_name

        # Getting schema object
        self.schema = self.session.get_schema(self.schema_name)
        # Creating collection
        self.schema.create_collection(self.collection_name)
        # Getting collection object
        self.collection_obj = self.schema.get_collection(self.collection_name)

    def insert_into_collection(self):
        # You can also add multiple documents at once
        print "Inserting 3 rows into collection"
        self.collection_obj.add({'_id': '2', 'name': 'Sakila', 'age': 15},
                                {'_id': '3', 'name': 'Jack', 'age': 15},
                                {'_id': '4', 'name': 'Clare', 'age': 37}).execute()

    def remove_from_collection(self):
        # Removing non-existing _id
        self.collection_obj.remove('_id = 1').execute()

    def alter_table_engine(self):
        # Altering table engine to rocksdb; should raise an error
        try:
            command = "alter table {}.{} engine=rocksdb".format(self.schema_name, self.collection_name)
            sql = self.session.sql(command)
            sql.execute()
        except Exception as e:
            raise mysqlx.errors.OperationalError("Could not alter engine of table here!")
        else:
            return 0

    def alter_table_drop_column(self):
        # Dropping generated column
        print "Altering default collection to drop generated column"
        try:
            command = "alter table {}.{} drop column `_id`".format(self.schema_name, self.collection_name)
            sql = self.session.sql(command)
            sql.execute()
        except Exception as e:
            raise
        else:
            return 0

    def return_table_obj(self):
        # Returning Table object
        table = mysqlx.Table(self.schema, self.collection_name)
        return table

    def create_view_from_collection(self, view_name):
        # Creating view from collection
        print "Trying to create view based on MyRocks collection"
        try:
            command = "create view {}.{} as select * from {}.{}".format(
                self.schema_name, view_name, self.schema_name, self.collection_name)
            sql = self.session.sql(command)
            sql.execute()
        except Exception as e:
            raise
        else:
            return 0

    def select_from_view(self, view_name):
        # Running select; should raise an error
        print "Trying to select from view [Should raise an OperationalError]"
        try:
            command = "select * from {}.{}".format(self.schema_name, view_name)
            sql = self.session.sql(command)
            sql.execute()
        except Exception as e:
            raise mysqlx.errors.OperationalError("The JSON binary value contains invalid data")
        else:
            return 0

    def select_from_table(self):
        # Running select; should raise an error
        print "Trying to select from table [Should raise an OperationalError]"
        try:
            command = "select * from {}.{}".format(self.schema_name, self.collection_name)
            sql = self.session.sql(command)
            sql.execute()
        except Exception as e:
            raise mysqlx.errors.OperationalError("The JSON binary value contains invalid data")
        else:
            return 0

The PyTest module:

import pytest
from mysqlx.errors import OperationalError


@pytest.mark.usefixtures("return_plugin_obj")
class TestXPlugin:
    """
    Tests for XPlugin + MyRocks
    """

    def test_check_if_collection_exists(self, return_plugin_obj):
        assert return_plugin_obj.collection_obj.exists_in_database() == True

    def test_check_collection_count(self, return_plugin_obj):
        # Inserting data
        return_plugin_obj.insert_into_collection()
        # Removing data
        return_plugin_obj.remove_from_collection()
        # Checking count
        assert return_plugin_obj.collection_obj.count() == 3

    def test_alter_table_engine_raises(self, return_plugin_obj):
        # Should raise error here
        print "Altering default collection engine from InnoDB to MyRocks [Should raise an OperationalError]"
        with pytest.raises(OperationalError) as er:
            return_plugin_obj.alter_table_engine()
        print er

    def test_alter_table_drop_column(self, return_plugin_obj):
        return_value = return_plugin_obj.alter_table_drop_column()
        assert return_value == 0

    def test_alter_table_engine(self, return_plugin_obj):
        print "Altering default collection engine from InnoDB to MyRocks [Should NOT raise an OperationalError]"
        return_value = return_plugin_obj.alter_table_engine()
        assert return_value == 0

    def helper_function(self, return_plugin_obj):
        table_obj = return_plugin_obj.return_table_obj()
        return table_obj

    def test_check_if_table_exists(self, return_plugin_obj):
        assert self.helper_function(return_plugin_obj).exists_in_database() == True

    def test_check_table_count(self, return_plugin_obj):
        assert self.helper_function(return_plugin_obj).count() == 3

    def test_check_table_name(self, return_plugin_obj):
        assert self.helper_function(return_plugin_obj).get_name() == "my_collection"

    def test_check_schema_name(self, return_plugin_obj):
        assert self.helper_function(return_plugin_obj).get_schema().get_name() == "generated_columns_test"

    def test_check_if_table_is_view(self, return_plugin_obj):
        assert self.helper_function(return_plugin_obj).is_view() == False

    def test_create_view_from_collection(self, return_plugin_obj):
        return_value = return_plugin_obj.create_view_from_collection("my_collection_view")

    def test_select_from_view(self, return_plugin_obj):
        with pytest.raises(OperationalError) as er:
            return_plugin_obj.select_from_view("my_collection_view")
        print er

    def test_select_from_table(self, return_plugin_obj):
        with pytest.raises(OperationalError) as er:
            return_plugin_obj.select_from_table()
        print er

The PyTest fixture code:

from myrocks_mysqlx_plugin.myrocks_mysqlx_plugin import MyXPlugin
import pytest

# schema_name = "generated_columns_test"
# collection_name = "my_collection"
plugin_obj = MyXPlugin("generated_columns_test", "my_collection")


@pytest.fixture()
def return_plugin_obj():
    return plugin_obj

The final bash runner test output:

#Running X Plugin tests#
========================================================================== test session starts ==========================================================================
platform linux2 -- Python 2.7.5, pytest-3.2.1, py-1.4.34, pluggy-0.4.0 -- /usr/bin/python
cachedir: ../../.cache
rootdir: /home/shahriyar.rzaev, inifile:
collected 13 items

../../percona-qa/myrocks-tests/myrocks_mysqlx_plugin_test/test_module01.py::TestXPlugin::test_check_if_collection_exists PASSED
../../percona-qa/myrocks-tests/myrocks_mysqlx_plugin_test/test_module01.py::TestXPlugin::test_check_collection_count PASSED
../../percona-qa/myrocks-tests/myrocks_mysqlx_plugin_test/test_module01.py::TestXPlugin::test_alter_table_engine_raises PASSED
../../percona-qa/myrocks-tests/myrocks_mysqlx_plugin_test/test_module01.py::TestXPlugin::test_alter_table_drop_column PASSED
../../percona-qa/myrocks-tests/myrocks_mysqlx_plugin_test/test_module01.py::TestXPlugin::test_alter_table_engine PASSED
../../percona-qa/myrocks-tests/myrocks_mysqlx_plugin_test/test_module01.py::TestXPlugin::test_check_if_table_exists PASSED
../../percona-qa/myrocks-tests/myrocks_mysqlx_plugin_test/test_module01.py::TestXPlugin::test_check_table_count PASSED
../../percona-qa/myrocks-tests/myrocks_mysqlx_plugin_test/test_module01.py::TestXPlugin::test_check_table_name PASSED
../../percona-qa/myrocks-tests/myrocks_mysqlx_plugin_test/test_module01.py::TestXPlugin::test_check_schema_name PASSED
../../percona-qa/myrocks-tests/myrocks_mysqlx_plugin_test/test_module01.py::TestXPlugin::test_check_if_table_is_view PASSED
../../percona-qa/myrocks-tests/myrocks_mysqlx_plugin_test/test_module01.py::TestXPlugin::test_create_view_from_collection PASSED
../../percona-qa/myrocks-tests/myrocks_mysqlx_plugin_test/test_module01.py::TestXPlugin::test_select_from_view FAILED
../../percona-qa/myrocks-tests/myrocks_mysqlx_plugin_test/test_module01.py::TestXPlugin::test_select_from_table PASSED

I also reported an interesting thing here:
https://bugs.mysql.com/bug.php?id=87531


Sysbench, a large server and fast SSD vs MyRocks, InnoDB and TokuDB

I have new performance results to share, starting with modern sysbench, a large server, a fast SSD and a database that is larger than RAM. Competitive read performance with much better space and write efficiency is why I like MyRocks, and this is an example of that. However, we have work to do on read efficiency that will be apparent in future perf reports.

tl;dr
  • For write-heavy workloads MyRocks competes with InnoDB 5.7 and 8.0 and does better than 5.6 at all concurrency levels. 
  • For read-heavy workloads MyRocks is usually within 10% of InnoDB 5.6.
  • For write-heavy workloads InnoDB in 5.7 and 8.0 do better than 5.6.
  • For some read-heavy workloads InnoDB has a perf regression from 5.6 to 5.7/8.0.
  • TokuDB usually has the worst QPS

Configuration

I use my sysbench fork and helper scripts, release-specific my.cnf files, and a server with 48 HW threads, fast SSD and 256GB of RAM, although only 50GB was available to the database and OS page cache. The binlog was enabled and sync-on-commit was disabled for the binlog and database log. I remembered to disable SSL.

I tested MyRocks, TokuDB and InnoDB, with buffered IO and a 10GB database cache for MyRocks/TokuDB, and O_DIRECT and a 35GB buffer pool for InnoDB. The server is shared by the sysbench client and mysqld. For MyRocks I used a build from August 15 with git hash 0d76ae and repeated the test first without compression, then with no compression for L0/L1/L2, LZ4 for the interior levels and zstd for the max level of the leveled LSM. For TokuDB I used Percona Server 5.7.17-12 and repeated the test first without compression and then with zlib compression. For InnoDB I used upstream 5.6.35, 5.7.17 and 8.0.2. For InnoDB 8.0.2 I used the latin1 charset and latin1_swedish_ci collation. Compression was not used for InnoDB. More details are in the release-specific my.cnf files, and I used the same my.cnf for InnoDB with 8.0.1 and 8.0.2.

The test used 8 tables with 100M rows/table. My use of sysbench is explained here. Tests are run in an interesting pattern -- load, write-heavy, read-only, insert-only. On the large server each test is run for 1, 2, 4, 8, 16, 24, 32, 40, 48 and 64 concurrent connections for either 3 or 5 minutes per concurrency level. So each test runs for either 30 or 50 minutes total and I hope that is long enough to get the database into a steady state. An example command line to run the test with my helper scripts is:
bash all.sh 8 100000000 180 300 180 innodb 1 0 /orig5717/bin/mysql none /sysbench.new
Results without pictures

I have a lot of numbers to share and struggled with the presentation. All of the data is here. I will start with relative QPS - the QPS for an engine versus the QPS for InnoDB from upstream 5.6.35. There is one column for MyRocks without compression, one for TokuDB without compression and one for InnoDB from upstream 5.7.17. I did not include columns for compressed MyRocks, compressed TokuDB or InnoDB from upstream 8.0.2, both to improve readability and because MyRocks/TokuDB QPS with compression is similar to that without compression, and InnoDB QPS from 8.0.2 is similar to InnoDB from 5.7.17.

When the QPS ratio is greater than one then the engine is faster than InnoDB 5.6.35. I show ratios for 1, 8 and 48 connections to compare low, medium and high concurrency workloads. Things that I notice in the results include:
  • For write-heavy workloads MyRocks competes with InnoDB 5.7 and 8.0 and does better than 5.6 at all concurrency levels.
  • For read-heavy workloads MyRocks is usually within 10% of InnoDB 5.6. 
  • For write-heavy workloads InnoDB in 5.7 and 8.0 do better than 5.6.
  • For some read-heavy workloads InnoDB has a perf regression from 5.6 to 5.7/8.0.
  • TokuDB usually has the worst QPS

QPS ratio:
* rocks = myrocks.none / inno5635
* inno = inno5717 / inno5635
* toku = toku5717.none / inno5635

1 connection
rocks   inno    toku    engine/test
2.516   1.338   1.395   update-index
1.103   1.210   0.482   update-nonindex
0.907   1.067   0.376   update-nonindex-special
1.325   1.099   0.337   delete-only
0.942   1.149   0.317   read-write.range100
1.064   1.157   1.210   read-write.range10000
0.798   0.968   0.323   read-only.range100
0.980   0.946   1.280   read-only.range10000
0.850   0.917   0.226   point-query
0.933   0.900   0.300   random-points
0.457   0.904   0.608   hot-points
0.890   1.009   0.317   insert-only

8 connections
rocks   inno    toku    engine/test
4.892   2.862   2.304   update-index
1.769   1.681   0.169   update-nonindex
1.382   1.506   0.476   update-nonindex-special
1.409   1.178   0.250   delete-only
0.989   1.082   0.202   read-write.range100
0.893   1.141   1.055   read-write.range10000
0.838   0.895   0.279   read-only.range100
0.870   1.046   1.258   read-only.range10000
0.857   0.840   0.215   point-query
0.911   0.824   0.245   random-points
0.621   1.194   0.819   hot-points
0.828   1.056   0.313   insert-only

48 connections
rocks   inno    toku    engine/test
2.294   2.223   0.817   update-index
2.685   2.726   0.393   update-nonindex
2.172   3.031   0.734   update-nonindex-special
2.830   3.054   0.283   delete-only
2.758   2.706   0.192   read-write.range100
1.113   1.394   0.577   read-write.range10000
0.810   0.996   0.101   read-only.range100
0.893   1.203   0.589   read-only.range10000
0.845   0.954   0.089   point-query
0.864   0.941   0.125   random-points
1.412   3.166   2.055   hot-points
1.111   1.855   0.579   insert-only

Results with pictures

Some people prefer graphs, so here is the relative QPS as a chart. I truncated the x-axis at 2 to make it easier to see differences. The first chart is from the test with 1 connection.
At 8 connections
At 48 connections

Percona Live Europe Featured Talks: A Quick Tour of MySQL 8.0 Roles with Giuseppe Maxia

Welcome to our series of interview blogs for the upcoming Percona Live Europe 2017 in Dublin. This series highlights a number of talks that will be at the conference and gives a short preview of what attendees can expect to learn from the presenter.

This first blog post is with Giuseppe Maxia of VMware. His talk is titled A Quick Tour of MySQL 8.0 Roles. MySQL 8.0 introduced roles, a new security and administrative feature that allows DBAs to simplify user management and increase the security of multi-user environments. In our conversation, we discussed MySQL roles and how they can help MySQL DBAs:

Percona: Hello Giuseppe, nice to interview you again (in our last blog we discussed “MySQL Document Store: SQL and NoSQL United”)! What have you been up to since the last Percona Live?

Giuseppe: Hi Dave, glad to be sharing ideas with you again. Since the last Percona conference, I’ve been going through two separate technical paths. For my day job, I work as software explorer on a large high availability in the cloud project. Exploratory software testing is a branch of QA that can easily qualify as a “dream job” for most senior QA engineers. For me, it’s a dynamic process that allows me to combine experience, creativity and development skills. Wildly interesting as it is, this job doesn’t require any MySQL skills. Thus my second technical path goes on in private, and I keep myself up to date with the MySQL world. This year I presented at several conferences and user groups in Europe, Asia, and America. I plan to keep working on MySQL in my own time, just because I like the topic.

Percona: You’re presenting a session called “A Quick Tour of MySQL 8.0 Roles”. What are MySQL roles, and why are they important?

Giuseppe: Roles are a way of simplifying user management. In database administration, users are granted privileges to access schemas, tables or columns (depending on the business needs). When many different users require authorization for different sets of privileges, administrators have to repeat the process of granting privileges several times. This is both tedious and error-prone. Using roles, administrators can define sets of privileges for a user category, and then the user authorization becomes a single statement operation.

In a well-regulated and security-minded organization, administrators should only use roles for privilege management. This policy not only simplifies user management, but also provides meaningful data on privilege usage.
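As a rough illustration of the workflow Giuseppe describes, here is a minimal sketch using hypothetical names (not taken from his talk):

-- Define the privilege set once, as a role
CREATE ROLE 'app_read';
GRANT SELECT ON appdb.* TO 'app_read';

-- Authorizing a new user then becomes a one-statement operation
CREATE USER 'alice'@'%' IDENTIFIED BY 'change_me';
GRANT 'app_read' TO 'alice'@'%';
SET DEFAULT ROLE 'app_read' TO 'alice'@'%';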

Percona: When getting into role assignment, what are some of the things that DBAs need to watch out for?

Giuseppe: Although roles make everyday tasks easier, they also present some additional challenges. In the MySQL implementation, roles are users with some small differences. While this similarity allows admins to get lazy and assign pre-existing users as roles to other users, this practice would make administration more difficult in the long run. To truly benefit from this new feature, DBAs must get organized, and spend some time planning how they want to orchestrate their roles for maximum efficiency before plunging in.

Percona: Could you share your personal experience with this feature?

Giuseppe: Roles have been on the MySQL community’s wish list for a long time. I remember several third party solutions that tried to implement roles as a hack on top of the existing privileges granting system. I created my own solution many years ago when I had to administer a large set of users with different levels of access.

Anytime a new project promised to ease the roles problem, I gave it a try. None of them truly delivered a secure solution.

When I saw the roles feature in MySQL 8, I tested it extensively, provided feedback to the MySQL team and asked for better usability. I was pleased to see that in the latest release (8.0.2) the team addressed some of my concerns, making the roles both easier to use and more powerful – although at the same time they introduced a new extension (mandatory roles) that could create more problems. All in all, I am pleased with the attitude of the MySQL team: they were willing to listen to my feedback and my proposals for improvement.

Percona: What do you want attendees to take away from your session? Why should they attend?

Giuseppe: When I first proposed this session at Percona Live in Santa Clara, my goal was to explain the various aspects of the new feature. Many users, when hearing about roles, think that it’s a straightforward extension of the existing privileges system. In practice, roles usage is a minefield. Many commands perform seemingly the same operation but often lead to unexpected results.

My session should make the basic operations clear, and teach attendees how to avoid the most common pitfalls.

Percona: What are you most looking forward to at Percona Live Europe 2017?

Giuseppe: First of all, Dublin! I have been an advocate of moving the conference to Ireland, and as soon as I saw that the venue was confirmed I booked flight and hotel without waiting for the CfP. At the very minimum, I will be a happy tourist there! As for the conference, there are several topics that I want to follow. The continued improvement of MySQL 8.0 is one, which now seems poised for a GA release in the near future. The explosion of technical solutions that are conquering the community is another: ProxySQL, Orchestrator, gh-ost, MyRocks. I am also interested in the evolution of InnoDB Cluster, which one year ago was presented as the solution to every DBA need (but so far has been less than overwhelming).

There are always lots of sessions with intriguing subjects, and I know already that I won’t be able to attend them all. But I am sure I will learn some new technique or methodology that comes in handy, as happens to me at every MySQL conference.

Want to find out more about Giuseppe and MySQL roles? Register for Percona Live Europe 2017, and see his talk A Quick Tour of MySQL 8.0 Roles. Register now to get the best price! Use discount code SeeMeSpeakPLE17 to get 10% off your registration.

Percona Live Open Source Database Conference Europe 2017 in Dublin is the premier European open source event for the data performance ecosystem. It is the place to be for the open source community as well as businesses that thrive in the MySQL, MariaDB, MongoDB, time series database, cloud, big data and Internet of Things (IoT) marketplaces. Attendees include DBAs, sysadmins, developers, architects, CTOs, CEOs, and vendors from around the world.

The Percona Live Open Source Database Conference Europe will be September 25-27, 2017 at the Radisson Blu Royal Hotel, Dublin.

Log Buffer #519: A Carnival of the Vanities for DBAs

This Log Buffer Edition covers Oracle, SQL Server and MySQL.

Oracle:

Fix index corruption found using analyze table validate structure

Customizing a CoreOS image/iso: Ignition

12.2 New Feature: the FLEX ASM disk group part 4

An Important Change Is Coming for Application Express 5.2 (if you use the Mobile User Interface)

Creating an Oracle Database Docker image

SQL Server:

Installing SQL Server Agent on SUSE Linux

Load Extended Events via Powershell

The SQL Server Features that Time Forgot

Using the Copy Wizard for the Azure Data Factory

Azure Resource Manager (ARM) Templates

MySQL:

Migrating Data from an Encrypted Amazon MySQL RDS Instance to an Encrypted Amazon Aurora Instance

Automatically Dropping Old Partitions in MySQL and MariaDB

Connecting PHP to MySQL on Bluemix

No more mysql.proc in MySQL 8.0

How ProxySQL deal with schema (and schemaname)

Plan to improve the out of the Box Experience in MySQL 8.0

In MySQL 8.0, we will be introducing a new configuration parameter called innodb_dedicated_server=bool. When ON, this option will look at the system memory, and then automatically set these configuration parameters using the following rules:

innodb_buffer_pool_size server_memory < 1G ?…

Taming a ‘wild’ NDB 7.3 with Cluster Manager 1.4.3 & direct upgrade to 7.5.

Well, having worked with outdated clusters and upgrade paths that quickly become obsolete, as in my last post, Migrating/importing NDB to Cluster Manager w/ version upgrade, I wanted to share that we can also use Cluster Manager, mcm, to upgrade NDB Cluster from 7.3 directly to 7.5. So we can start using new mcm features like autotune, which helps guide us towards some cluster tuning, or 7.5 new features like READ_BACKUP or FULLY_REPLICATED tables. Sometimes table comments can be so important…

So, as with the last scenario, a 7.3.8 ‘wild’ cluster. A reminder of what we had:

sn1:
ndb_mgmd --configdir=/usr/local/mysql-cluster-gpl-7.3.8-linux-glibc2.5-x86_64/conf -f /usr/local/mysql-cluster-gpl-7.3.8-linux-glibc2.5-x86_64/conf/config.ini --config-cache=false --ndb-nodeid=1

dn1 & dn2:
ndbmtd --ndb-nodeid=3 -c 10.0.0.10
ndbmtd --ndb-nodeid=4 -c 10.0.0.10

sn1 & sn2:
mysqld --defaults-file=/usr/local/mysql/conf/my.cnf --user=mysql --ndb-nodeid=10 &
mysqld --defaults-file=/usr/local/mysql/conf/my.cnf --user=mysql --ndb-nodeid=11 &

Double-checking everything is up and running (as the ‘mysql’ os user… not root):

ndb_mgm -e show
Connected to Management Server at: localhost:1186
Cluster Configuration
---------------------
[ndbd(NDB)]     2 node(s)
id=3    @10.0.0.12  (mysql-5.6.22 ndb-7.3.8, Nodegroup: 0, *)
id=4    @10.0.0.13  (mysql-5.6.22 ndb-7.3.8, Nodegroup: 0)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @10.0.0.10  (mysql-5.6.22 ndb-7.3.8)

[mysqld(API)]   5 node(s)
id=10   @10.0.0.10  (mysql-5.6.22 ndb-7.3.8)
id=11   @10.0.0.11  (mysql-5.6.22 ndb-7.3.8)
id=12 (not connected, accepting connect from any host)
id=13 (not connected, accepting connect from any host)
id=14 (not connected, accepting connect from any host)

As before, the ‘mcmd’@’localhost’ user exists on both sqlnodes.

PID files change time:

sqlnodes:

cd /opt/mysql/738/data
more sn1.pid
more sn2.pid
ps -ef | grep mysqld | grep -v grep | awk '{ print $2 }'
cp sn1.pid ndb_10.pid
cp sn2.pid ndb_11.pid

dnodes:

cd /opt/mysql/738/ndbd_data/
more *.pid
kill -9 6523
kill -9 2608
sed -i 's/6523/6524/' ndb_3.pid
sed -i 's/2608/2609/' ndb_4.pid
more *.pid

With our wild cluster semi-tamed, it’s time for the upgrade.

 

Now time to create the 7.5.7 MCM cluster to import into.

On all hosts:

mkdir -p /opt/mysql/757/mcm_data
cd /opt/mysql/757/mcm_data
chown -R mysql:mysql .

Add paths to MCM binaries to the os user ‘mysql’ .bash_profile:

vi ~/.bash_profile
export PATH=$PATH:/usr/local/mcm1.4.3/bin

Now to uncompress the latest version of cluster:

cd /usr/local
tar zxvf mysql-cluster-advanced-7.5.7-linux-glibc2.12-x86_64.tar.gz
cd mysql-cluster-advanced-7.5.7-linux-glibc2.12-x86_64
chown -R mysql:mysql .

Now to add a directory where we have read-write permissions, and that is also meaningful to us:

cd ../mcm1.4.3/
vi etc/mcmd.ini
...
log-file = /opt/mysql/757/mcm_data/mcmd.log
...
manager-directory = /opt/mysql/757/mcm_data

Now to get mcm running. Remember to do this on all nodes:

su - mysql
mcmd --defaults-file=/usr/local/mcm1.4.3/etc/mcmd.ini --daemon

Creating the cluster in mcm to import our wild cluster into, as in the previous scenario:

mcm> create site --hosts=10.0.0.10,10.0.0.11,10.0.0.12,10.0.0.13 mysite;

Add cluster binaries & create cluster ready for IMPORT:

mcm> add package --basedir=/usr/local/mysql-cluster-gpl-7.3.8-linux-glibc2.5-x86_64 cluster738;
mcm> create cluster --import --package=cluster738 --processhosts=ndb_mgmd:1@10.0.0.10,ndbmtd:3@10.0.0.12,ndbmtd:4@10.0.0.13 mycluster;
mcm> show status -r mycluster;
mcm> add process --processhosts=mysqld:10@10.0.0.10,mysqld:11@10.0.0.11 mycluster;
mcm> add process --processhosts=ndbapi:12@*,ndbapi:13@*,ndbapi:14@* mycluster;

Ok, looking good. “What about the previous config?” I hear you say:

mcm> import config --dryrun mycluster;

Ok, MCM complains about the mysqld processes running as root at the os level, so we can make this change via mcm on both mysqld’s:

mcm> set user:mysqld=mysql mycluster;

And re-run the dryrun config check:

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Command result                                                                                                                                                             |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Import checks passed. Please check /opt/mysql/757/mcm_data/clusters/mycluster/tmp/import_config.e40f3e52_97_3.mcm on host 10.0.0.13 for settings that will be applied.    |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (5.91 sec)

As you can see, this message is slightly different from the one received in MCM 1.4.0, which makes it easier to check.

So, if we’re happy, an importing we go:

mcm> import config mycluster;
+--------------------------------------------------------------------------------------------+
| Command result                                                                              |
+--------------------------------------------------------------------------------------------+
| Configuration imported successfully. Please manually verify the settings before proceeding |
+--------------------------------------------------------------------------------------------+
1 row in set (5.90 sec)

That went well. Now for the real McCoy:

mcm> import cluster mycluster;
+-------------------------------+
| Command result                |
+-------------------------------+
| Cluster imported successfully |
+-------------------------------+
1 row in set (3.32 sec)

Making sure all’s up and running:

mcm> show status -r mycluster;
+--------+----------+-----------+---------+-----------+------------+
| NodeId | Process  | Host      | Status  | Nodegroup | Package    |
+--------+----------+-----------+---------+-----------+------------+
| 1      | ndb_mgmd | 10.0.0.10 | running |           | cluster738 |
| 3      | ndbmtd   | 10.0.0.12 | running | 0         | cluster738 |
| 4      | ndbmtd   | 10.0.0.13 | running | 0         | cluster738 |
| 10     | mysqld   | 10.0.0.10 | running |           | cluster738 |
| 11     | mysqld   | 10.0.0.11 | running |           | cluster738 |
| 12     | ndbapi   | *         | added   |           |            |
| 13     | ndbapi   | *         | added   |           |            |
| 14     | ndbapi   | *         | added   |           |            |
+--------+----------+-----------+---------+-----------+------------+
8 rows in set (0.06 sec)

I just wanted to set expectations here. My restart / upgrade / import times are short because my cluster data set is almost non-existent. The amount of DataMemory & IndexMemory in use, the cores available and the disk speed will all determine how fast each node restart is and, hence, how long the import & upgrade process takes.
Upgrade time again: Direct from 7.3.8 to 7.5.7.

This is how we check we’re still ok, i.e., that we only have 1 binary package known to mcm:

mcm> list packages mysite;

And now add the new 7.5 binaries:

mcm> add package --basedir=/usr/local/mysql-cluster-advanced-7.5.7-linux-glibc2.12-x86_64 cluster757;
ERROR 7018 (00MGR): Sync spawning '/usr/local/mysql-cluster-advanced-7.5.7-linux-glibc2.12-x86_64/bin/mysqld' exited with 127. Started in , with stdout='' and stderr='/usr/local/mysql-cluster-advanced-7.5.7-linux-glibc2.12-x86_64/bin/mysqld: error while loading shared libraries: libnuma.so.1: cannot open shared object file: No such file or directory'

Make sure you’ve got your version of libnuma installed (hint: https://www.rpmfind.net/linux/rpm2html/search.php?query=libnuma.so.1()(64bit)). This is needed for MySQL Server 5.7, so it’s not really a MySQL Cluster or Cluster Manager error in itself.

I’m on CentOS 6.5 so:

# sudo yum install numactl

And retry:

mcm> add package --basedir=/usr/local/mysql-cluster-advanced-7.5.7-linux-glibc2.12-x86_64 cluster757;
+----------------------------+
| Command result             |
+----------------------------+
| Package added successfully |
+----------------------------+
1 row in set (0.66 sec)

mcm> list packages mysite;
+------------+----------------------------------------------------------------+-----------------------------------------+
| Package    | Path                                                           | Hosts                                   |
+------------+----------------------------------------------------------------+-----------------------------------------+
| cluster738 | /usr/local/mysql-cluster-gpl-7.3.8-linux-glibc2.5-x86_64       | 10.0.0.10,10.0.0.11,10.0.0.12,10.0.0.13 |
| cluster757 | /usr/local/mysql-cluster-advanced-7.5.7-linux-glibc2.12-x86_64 | 10.0.0.10,10.0.0.11,10.0.0.12,10.0.0.13 |
+------------+----------------------------------------------------------------+-----------------------------------------+
2 rows in set (0.05 sec)

Ok, so now we can upgrade:

mcm> upgrade cluster --package=cluster757 mycluster;
+-------------------------------+
| Command result                |
+-------------------------------+
| Cluster upgraded successfully |
+-------------------------------+
1 row in set (2 min 3.69 sec)

Double-check that mcm is now using the latest binaries. You can also check at the o.s. level via ‘ps -ef | grep mysqld’ / ‘ps -ef | grep ndbmtd’.

mcm> show status -r mycluster;

So, now to use one of mcm’s new features:

mcm> autotune --dryrun --writeload=low realtime mycluster;
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Command result                                                                                                                                                                   |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Autotuning calculation complete. Please check /opt/mysql/757/mcm_data/clusters/mycluster/tmp/autotune.e40f3e52_314_3.mcm on host 10.0.0.13 for settings that will be applied.   |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (1.28 sec)

And we cat that file to see what it contains:

# The following will be applied to the current cluster config:
set HeartbeatIntervalDbDb:ndbmtd=1500 mycluster;
set HeartbeatIntervalDbApi:ndbmtd=1500 mycluster;
set RedoBuffer:ndbmtd=32M mycluster;
set SendBufferMemory:ndbmtd+ndbmtd=2M mycluster;
set ReceiveBufferMemory:ndbmtd+ndbmtd=2M mycluster;
set SendBufferMemory:ndb_mgmd+ndbmtd=2M mycluster;
set ReceiveBufferMemory:ndb_mgmd+ndbmtd=2M mycluster;
set SendBufferMemory:mysqld+ndbmtd=2M mycluster;
set ReceiveBufferMemory:mysqld+ndbmtd=2M mycluster;
set SendBufferMemory:ndbapi+ndbmtd=2M mycluster;
set ReceiveBufferMemory:ndbapi+ndbmtd=2M mycluster;
set SharedGlobalMemory:ndbmtd=20M mycluster;
set FragmentLogFileSize:ndbmtd=256M mycluster;
set NoOfFragmentLogFiles:ndbmtd=3 mycluster;

Obviously now we can also look at using READ_BACKUP and other new NDB 7.5 features. An upgrade process that is far from painful.

Hope this helps someone.


Galera Cluster Comparison - Codership vs Percona vs MariaDB

Galera Cluster is a synchronous multi-master replication plugin for the InnoDB or XtraDB storage engine. It offers a number of outstanding features that standard MySQL replication doesn’t - read-write to any cluster node, automatic membership control, automatic node joining, parallel replication on row level, and still keeping the native look and feel of a MySQL server. The plugin is open source and developed by Codership as a patch for standard MySQL. Percona and MariaDB leverage the Galera library in Percona XtraDB Cluster (PXC) and MariaDB Server (MariaDB Galera Cluster for pre-10.1) respectively.

We often get the question - which version of Galera should I use? Percona? MariaDB? Codership? This is not an easy one, since they all use the same Galera plugin that is developed by Codership. Nevertheless, let’s give it a try.

In this blog post, we’ll compare the three vendors and their Galera Cluster releases. We will be using the latest stable version of each vendor available at the time of writing - Galera Cluster for MySQL 5.7.18, Percona XtraDB Cluster 5.7.18 and MariaDB 10.2.7 where all are shipped with InnoDB storage engine 5.7.18.

Database Release

A database vendor who wishes to leverage Galera Cluster technology needs to incorporate the WriteSet Replication (wsrep) API patch into its server codebase. This allows the Galera plugin to work as a wsrep provider, communicating and replicating transactions (writesets in Galera terms) via a group communication protocol.

The following diagram illustrates the difference between the standalone MySQL server, MySQL Replication and Galera Cluster:

Codership releases the wsrep-patched version of Oracle’s MySQL. MySQL 5.7 has been Generally Available (GA) since October 2015. However, the first beta of the wsrep-patched MySQL 5.7 was released about a year later, around October 2016, and became GA in January 2017. It took more than a year to incorporate Galera Cluster into Oracle’s MySQL 5.7 release line.

Percona releases the wsrep-patched version of its Percona Server for MySQL called Percona XtraDB Cluster (PXC). Percona Server for MySQL comes with XtraDB storage engine (a drop-in replacement of InnoDB) and follows the upstream Oracle MySQL releases very closely (including all the bug fixes in it) with some additional features like MyRocks storage engine, TokuDB as well as Percona’s own bug fixes. In a way, you can think of it as an improved version of Oracle’s MySQL, embedded with Galera technology.

MariaDB releases the wsrep-patched version of its MariaDB Server, and it has been embedded since MariaDB 10.1, so you don’t have to install separate packages for Galera. In the previous versions (5.5 and 10.0 in particular), the Galera variant of MariaDB is called MariaDB Galera Cluster (MGC), with separate builds. MariaDB has its own release path and versioning and does not follow any upstream like Percona does. The MariaDB server functionality has started diverging from MySQL, so it might not be as straightforward a replacement for MySQL. It still comes with a bunch of great features and performance improvements though.

System Status

Monitoring Galera nodes and the cluster requires the wsrep API to report several statuses, which is exposed through SHOW STATUS statement:

mysql> SHOW STATUS LIKE 'wsrep%';

PXC does have a number of extra statuses, if compared to other variants. The following list shows wsrep related status that can only be found in PXC:

  • wsrep_flow_control_interval
  • wsrep_flow_control_interval_low
  • wsrep_flow_control_interval_high
  • wsrep_flow_control_status
  • wsrep_cert_bucket_count
  • wsrep_gcache_pool_size
  • wsrep_ist_receive_status
  • wsrep_ist_receive_seqno_start
  • wsrep_ist_receive_seqno_current
  • wsrep_ist_receive_seqno_end

While MariaDB only has one extra wsrep status, if compared to the Galera version provided by Codership:

  • wsrep_thread_count

The above does not necessarily tell us that PXC is superior to the others. It means that you can get better insights with more statuses.

Configuration Options

Since Galera is part of MariaDB 10.1 and later, you have to explicitly enable the following option in the configuration file:

wsrep_on=ON

Note that if you do not enable this option, the server will act as a standard MariaDB installation. For Codership and Percona, this option is enabled by default.
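For reference, a minimal sketch of the Galera-related section of a MariaDB 10.1+ configuration file (the provider path and cluster addresses are placeholders):

[mysqld]
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_address=gcomm://10.0.0.1,10.0.0.2,10.0.0.3
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2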

Some Galera-related variables are NOT available across all Galera variants:

Codership’s MySQL Galera Cluster 5.7.18, wsrep 25.12
  • wsrep_mysql_replication_bundle
  • wsrep_preordered
  • wsrep_reject_queries
Percona XtraDB Cluster 5.7.18, wsrep 29.20
  • wsrep_preordered
  • wsrep_reject_queries
  • pxc_encrypt_cluster_traffic
  • pxc_maint_mode
  • pxc_maint_transition_period
  • pxc_strict_mode
MariaDB 10.2.7, wsrep 25.19
  • wsrep_gtid_domain_id
  • wsrep_gtid_mode
  • wsrep_mysql_replication_bundle
  • wsrep_patch_version

The above list might change once the vendor releases a new version. The only point that we would like to highlight here is, do not expect that Galera nodes hold the same set of configuration parameters across all variants. Some configuration variables were introduced by a vendor to specifically complement and improve the database server.

Contributions and Improvements

Database performance is not easily comparable, as it can vary a lot depending on the workload. For general workloads, replication performance is fairly similar across all variants. Under some specific workloads, it could differ.

Looking at the latest claims, Percona did an amazing job improving IST performance by up to 4x, as well as the commit operation. MariaDB also contributes a number of useful features, for example the WSREP_INFO plugin. On the other hand, Codership is focusing more on core Galera issues, including bug fixing and new features. Galera 4.0 has features like intelligent donor selection, huge transaction support, and non-blocking DDL.

The introduction of Percona XtraBackup (a.k.a. xtrabackup) as part of Galera’s SST has improved SST performance significantly. The syncing process becomes faster and non-blocking to the donor. MariaDB then came up with its own xtrabackup fork called MariaDB Backup (mariabackup), which is supported as a Galera SST method through the variable wsrep_sst_method=mariabackup. It also supports installation on Microsoft Windows.
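Selecting mariabackup as the SST method is a per-node configuration change; a minimal sketch (the SST credentials are placeholders):

[mysqld]
wsrep_sst_method=mariabackup
wsrep_sst_auth=sst_user:sst_password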

Support

All Galera Cluster variants are open source and available for free. This includes the syncing software supported by Galera, like mysqldump, rsync, Percona XtraBackup and MariaDB Backup. For community users, you can seek support, ask questions, file bug reports and feature requests, or even make pull requests through each vendor’s respective channels:

  • Database server public issue tracker: MySQL wsrep on Github (Codership), Percona XtraDB Cluster on Launchpad (Percona), MariaDB Server on JIRA (MariaDB)
  • Galera issue tracker: Galera on Github
  • Documentation: Galera Cluster Documentation (Codership), Percona XtraDB Cluster Documentation (Percona), MariaDB Documentation (MariaDB)
  • Support forum: Codership Team Groups (Codership), Percona Forum (Percona), MariaDB Open Questions (MariaDB)

Related resources  Video: ClusterControl for Galera Cluster  ClusterControl for Galera Cluster for MySQL  ClusterControl Product Video for Galera Clusters

Each vendor provides commercial support services.

Summary

We hope that this comparison gives you a clearer picture and helps you determine which vendor better suits your needs. They all use pretty much the same wsrep libraries; the differences are mainly on the server side - for instance, if you want to leverage some specific features in MariaDB or Percona Server. You might want to check out this blog that compares the different servers (Oracle MySQL, MariaDB and Percona Server). ClusterControl supports all three vendors, so you can easily deploy different clusters and compare them yourself with your own workload, on your own hardware. Do give it a try.

Tags:  MySQL galera MariaDB pxc comparison
