Planet MySQL

About ZFS Performance

If you are a regular reader of this blog, you likely know I like the ZFS filesystem a lot. ZFS has many very interesting features, but I am a bit tired of hearing negative statements on ZFS performance. It feels a bit like people are telling me “Why do you use InnoDB? I have read that MyISAM is faster.” I found the comparison of InnoDB vs. MyISAM quite interesting, and I’ll use it in this post.

To have some data to support my post, I started an AWS i3.large instance with a 1000GB gp2 EBS volume. A gp2 volume of this size is interesting because it is above the burst IOPS level, so it offers a constant 3000 IOPS performance level.

I used sysbench to create a table of 10M rows and then, using export/import tablespace, I copied it 329 times. I ended up with 330 tables for a total size of about 850GB. The dataset generated by sysbench is not very compressible, so I used lz4 compression in ZFS. For the other ZFS settings, I used what can be found in my earlier ZFS posts but with the ARC size limited to 1GB. I then used that plain configuration for the first benchmarks. Here are the results with the sysbench point-select benchmark, a uniform distribution and eight threads. The InnoDB buffer pool was set to 2.5GB.
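For reference, the plain setup amounts to something like the following sketch (the pool/dataset name is an assumption; the full list of settings is in the earlier posts):

# illustrative ZFS baseline, not the complete configuration
zfs set compression=lz4 mysqldata/data
zfs set recordsize=16k mysqldata/data
echo 1073741824 > /sys/module/zfs/parameters/zfs_arc_max   # cap the ARC at 1GB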

In both cases, the load is IO bound. The disk is doing exactly the allowed 3000 IOPS. The above graph appears to be a clear demonstration that XFS is much faster than ZFS, right? But is that really the case? The way the dataset has been created is extremely favorable to XFS since there is absolutely no file fragmentation. Once you have all the files opened, a read IOP is just a single seek to an offset, and XFS doesn't need to access any intermediate inode. The above result is about as fair as saying MyISAM is faster than InnoDB based only on the table scan performance of unfragmented tables with a default configuration. ZFS is much less affected by file-level fragmentation, especially for point accesses.

More on ZFS metadata

ZFS stores files in B-trees in a fashion very similar to the way InnoDB stores data. To access a piece of data in a B-tree, you need to access the top-level page (often called the root node) and then one block per level down to the leaf node containing the data. With no cache, reading something from a three-level B-tree thus requires 3 IOPS.

Simple three-level B-tree

The extra IOPS performed by ZFS are needed to access those internal blocks in the B-trees of the files. These internal blocks are labeled as metadata. Essentially, in the above benchmark, the ARC is too small to contain all the internal blocks of the table files’ B-trees. If we continue the comparison with InnoDB, it would be like running with a buffer pool too small to contain the non-leaf pages. The test dataset I used has about 600MB of non-leaf pages, about 0.1% of the total size, which was well cached by the 2.5GB buffer pool. So only one InnoDB page, a leaf page, needed to be read per point-select statement.

To correctly set the ARC size to cache the metadata, you have two choices. First, you can guess values for the ARC size and experiment. Second, you can try to evaluate it by looking at the ZFS internal data. Let’s review these two approaches.

You’ll often read or hear the ratio of 1GB of ARC for 1TB of data, which is about the same 0.1% ratio as for InnoDB. I wrote about that ratio a few times, having nothing better to propose. Actually, I found it depends a lot on the recordsize used. The 0.1% ratio implies a ZFS recordsize of 128KB. A ZFS filesystem with a recordsize of 128KB will use much less metadata than one using a recordsize of 16KB because it has 8x fewer leaf pages. Fewer leaf pages require fewer B-tree internal nodes, hence less metadata. A filesystem with a recordsize of 128KB is excellent for sequential access as it maximizes compression and reduces the IOPS, but it is poor for the small random accesses MySQL/InnoDB does.
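As a rough sanity check of those ratios, assuming 16KB internal nodes and the B-tree fanout of 128 described further below:

# non-leaf pages needed to address 1TB of data, fanout=128, 16KB per internal node
$ echo "(2^40/2^17/128 + 2^40/2^17/128^2) * 16384 / 2^30" | bc -l   # recordsize=128KB
1.00781250000000000000                                              # ~1GB, i.e. 0.1%
$ echo "(2^40/2^14/128 + 2^40/2^14/128^2) * 16384 / 2^30" | bc -l   # recordsize=16KB
8.06250000000000000000                                              # ~8GB, i.e. 0.8%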

To determine the correct ARC size, you can slowly increase the ARC size and monitor the number of metadata cache-misses with the arcstat tool. Here’s an example:

# echo 1073741824 > /sys/module/zfs/parameters/zfs_arc_max
# arcstat -f time,arcsz,mm%,mhit,mread,dread,pread 10
    time  arcsz  mm%  mhit  mread  dread  pread
10:22:49   105M    0     0      0      0      0
10:22:59   113M  100     0     22     73      0
10:23:09   120M  100     0     20     68      0
10:23:19   127M  100     0     20     65      0
10:23:29   135M  100     0     22     74      0

You’ll want the ‘mm%’, the metadata miss percentage, to reach 0. So when the ‘arcsz’ column is no longer growing and you still see high values for ‘mm%’, the ARC is too small. Increase the value of ‘zfs_arc_max’ and continue to monitor.

If the 1GB of ARC for 1TB of data ratio is good for a large ZFS recordsize, it is likely too small for a recordsize of 16KB. Do 8x more leaf pages automatically require 8x more ARC space for the non-leaf pages? Likely, but let’s verify.

The second option we have is the zdb utility that comes with ZFS, which allows us to view many internal structures including the B-tree list of pages for a given file. The tool needs the inode of a file and the ZFS filesystem as inputs. Here’s an invocation for one of the tables of my dataset:

# cd /var/lib/mysql/data/sbtest
# ls -li | grep sbtest1.ibd
36493 -rw-r----- 1 mysql mysql 2441084928 avr 15 15:28 sbtest1.ibd
# zdb -ddddd mysqldata/data 36493 > zdb5d.out
# more zdb5d.out
Dataset mysqldata/data [ZPL], ID 90, cr_txg 168747, 4.45G, 26487 objects, rootbp DVA[0]=<0:1a50452800:200> DVA[1]=<0:5b289c1600:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=3004977L/3004977P fill=26487 cksum=13723d4400:5d1f47fb738:fbfb87e6e278:1f30c12b7fa1d1

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
     36493    4    16K    16K  1.75G  2.27G   97.62  ZFS plain file
                                  168   bonus  System attributes
    dnode flags: USED_BYTES USERUSED_ACCOUNTED
    dnode maxblkid: 148991
    path    /var/lib/mysql/data/sbtest/sbtest1.ibd
    uid     103
    gid     106
    atime   Sun Apr 15 15:04:13 2018
    mtime   Sun Apr 15 15:28:45 2018
    ctime   Sun Apr 15 15:28:45 2018
    crtime  Sun Apr 15 15:04:13 2018
    gen     3004484
    mode    100640
    size    2441084928
    parent  36480
    links   1
    pflags  40800000004
Indirect blocks:
               0 L3    0:1a4ea58800:400 4000L/400P F=145446 B=3004774/3004774
               0  L2   0:1c83454c00:1800 4000L/1800P F=16384 B=3004773/3004773
               0   L1  0:1eaa626400:1600 4000L/1600P F=128 B=3004773/3004773
               0    L0 0:1c6926ec00:c00 4000L/c00P F=1 B=3004773/3004773
            4000    L0 EMBEDDED et=0 4000L/6bP B=3004484
            8000    L0 0:1c69270c00:400 4000L/400P F=1 B=3004773/3004773
            c000    L0 0:1c7fbae400:800 4000L/800P F=1 B=3004736/3004736
           10000    L0 0:1ce3f53600:3200 4000L/3200P F=1 B=3004484/3004484
           14000    L0 0:1ce3f56800:3200 4000L/3200P F=1 B=3004484/3004484
           18000    L0 0:18176fa600:3200 4000L/3200P F=1 B=3004485/3004485
           1c000    L0 0:18176fd800:3200 4000L/3200P F=1 B=3004485/3004485
... [more than 140k lines truncated]

The last section of the above output is very interesting as it shows the B-tree pages. The ZFS B-tree of the file sbtest1.ibd has four levels. L3 is the root page, L2 holds the first level (from the top) of pages, L1 the second level, and L0 the leaf pages. The metadata is essentially L3 + L2 + L1. When you change the recordsize property of a ZFS filesystem, you affect only the size of the leaf pages.

The non-leaf page size is always 16KB (4000L), and these pages are always compressed on disk with lzop (if I read correctly). In the ARC, they are stored uncompressed, so they use 16KB of memory each. The fanout of a ZFS B-tree, the largest possible ratio of the number of pages between levels, is 128. With the above output, we can easily calculate how much metadata we would need to cache all the non-leaf pages in the ARC.

# grep -c L3 zdb5d.out
1
# grep -c L2 zdb5d.out
9
# grep -c L1 zdb5d.out
1150
# grep -c L0 zdb5d.out
145447

So, each of the 330 tables of the dataset has 1160 non-leaf pages and 145447 leaf pages; a ratio very close to the prediction of 0.8%. For the complete dataset of 749GB, we would need the ARC to be, at a minimum, 6GB to fully cache all the metadata pages. Of course, there is some overhead to add. In my experiments, I found I needed to add about 15% for ARC overhead in order to have no metadata reads at all. The real minimum for the ARC size I should have used is almost 7GB.
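A quick back-of-the-envelope check of those numbers:

$ echo "330 * 1160 * 16 / 1024 / 1024" | bc -l          # tables x non-leaf pages x 16KB
5.84106445312500000000                                  # => ~6GB of metadata
$ echo "330 * 1160 * 16 * 1.15 / 1024 / 1024" | bc -l   # plus ~15% ARC overhead
6.71722412109375000000                                  # => ~7GB minimum ARC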

Of course, an ARC of 7GB on a server with 15GB of RAM is not small. Is there a way to do otherwise? The first option we have is to use a larger InnoDB page size, as allowed by MySQL 5.7. Instead of the regular InnoDB page size of 16KB, if you use a page size of 32KB with a matching ZFS recordsize, you will cut the ARC size requirement in half, to 0.4% of the uncompressed size.

Similarly, an InnoDB page size of 64KB with a matching ZFS recordsize would further reduce the ARC size requirement to 0.2%. That approach works best when the dataset is highly compressible. I’ll blog more about the use of larger InnoDB pages with ZFS in the near future. If the use of larger InnoDB page sizes is not a viable option for you, you still have the option of using the ZFS L2ARC feature to save on the required memory.
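As a sketch, aligning the two looks like this (the dataset name is an assumption, and innodb_page_size can only be set before the MySQL data directory is initialized):

# match the ZFS recordsize to a 32KB InnoDB page size
zfs set recordsize=32k mysqldata/data

# /etc/my.cnf excerpt
[mysqld]
innodb_page_size=32k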

So, let’s propose a new rule of thumb for the required ARC/L2ARC size for a given dataset:

  • Recordsize of 128KB => 0.1% of the uncompressed dataset size
  • Recordsize of 64KB => 0.2% of the uncompressed dataset size
  • Recordsize of 32KB => 0.4% of the uncompressed dataset size
  • Recordsize of 16KB => 0.8% of the uncompressed dataset size
The ZFS revenge

In order to improve ZFS performance, I had 3 options:

  1. Increase the ARC size to 7GB
  2. Use a larger InnoDB page size, like 64KB
  3. Add an L2ARC

I was reluctant to grow the ARC to 7GB, which was nearly half the overall system memory. At best, the ZFS performance would only match XFS. A larger InnoDB page size would increase the CPU load for decompression on an instance with only two vCPUs; not great either. The last option, the L2ARC, was the most promising.

The choice of an i3.large instance type is not accidental. The instance has a 475GB ephemeral NVMe storage device. Let’s try to use this storage for the ZFS L2ARC. The warming of an L2ARC device is not exactly trivial. In my case, with a 1GB ARC, I used:

echo 1073741824 > /sys/module/zfs/parameters/zfs_arc_max
echo 838860800 > /sys/module/zfs/parameters/zfs_arc_meta_limit
echo 67108864 > /sys/module/zfs/parameters/l2arc_write_max
echo 134217728 > /sys/module/zfs/parameters/l2arc_write_boost
echo 4 > /sys/module/zfs/parameters/l2arc_headroom
echo 16 > /sys/module/zfs/parameters/l2arc_headroom_boost
echo 0 > /sys/module/zfs/parameters/l2arc_norw
echo 1 > /sys/module/zfs/parameters/l2arc_feed_again
echo 5 > /sys/module/zfs/parameters/l2arc_feed_min_ms
echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch
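These tunables assume the NVMe device has already been attached to the pool as a cache vdev, along these lines (pool and device names are assumptions):

# add the ephemeral NVMe device as an L2ARC cache vdev
zpool add mysqldata cache /dev/nvme0n1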

I then ran ‘cat /var/lib/mysql/data/sbtest/* > /dev/null’ to force filesystem reads and cache all of the tables. A key setting here, to allow the L2ARC to cache data, is zfs_arc_meta_limit. It needs to be slightly smaller than zfs_arc_max in order to allow some data to be cached in the ARC. Remember that the L2ARC is fed by the LRU of the ARC: you need data cached in the ARC in order to have data cached in the L2ARC. Using lz4 in ZFS on the sysbench dataset results in a compression ratio of only 1.28x. A more realistic dataset would compress by more than 2x, if not 3x. Nevertheless, since the content of the L2ARC is compressed, the 475GB device caches nearly 600GB of the dataset. The figure below shows the sysbench results with the L2ARC enabled:

Now, the comparison is very different. ZFS completely outperforms XFS: 5000 qps for ZFS versus 3000 for XFS. The ZFS results could have been even higher, but the two vCPUs of the instance were clearly the bottleneck. Properly configured, ZFS can be pretty fast. Of course, I could use flashcache or bcache with XFS and improve the XFS results, but these technologies are way more exotic than the ZFS L2ARC. Also, only the L2ARC stores data in a compressed form, maximizing the use of the NVMe device. Compression also lowers the size requirement, and cost, of the gp2 disk.

ZFS is much more complex than XFS and EXT4, but that also means it has more tunables/options. I used a simplistic setup and an unfair benchmark, which initially led to poor ZFS results. With the same benchmark, very favorable to XFS, I added a ZFS L2ARC and that completely reversed the situation, more than tripling the ZFS results, now 66% above XFS.

Conclusion

We have seen in this post why the general perception is that ZFS under-performs compared to XFS or EXT4. The presence of B-trees for the files has a big impact on the amount of metadata ZFS needs to handle, especially when the recordsize is small. The metadata consists mostly of the non-leaf pages (or internal nodes) of the B-trees. When properly cached, the performance of ZFS is excellent. ZFS allows you to optimize the use of EBS volumes, both in terms of IOPS and size, when the instance has fast ephemeral storage devices. Using the ephemeral device of an i3.large instance for the ZFS L2ARC, ZFS outperformed XFS by 66%.

The post About ZFS Performance appeared first on Percona Database Performance Blog.

MySQL User Camp, Bangalore – 27th April, 2018

MySQL User Camp is a forum where MySQL Engineers and community users come together to connect, collaborate, and share knowledge. This year’s first MySQL User Camp was held on 27th April 2018, at Oracle India Pvt Ltd, Kalyani Magnum Infotech Park, Bangalore with an excellent turnout of 60 attendees. The event began with a welcome […]

Updated: Become a ClusterControl DBA: Safeguarding your Data

In the past four posts of the blog series, we covered deployment of clustering/replication (MySQL/Galera, MySQL Replication, MongoDB & PostgreSQL), management & monitoring of your existing databases and clusters, performance monitoring and health, and, in the last post, how to make your setup highly available through HAProxy and ProxySQL.

So now that you have your databases up and running and highly available, how do you ensure that you have backups of your data?

You can use backups for multiple things: disaster recovery, to provide production data to test against development or even to provision a slave node. This last case is already covered by ClusterControl. When you add a new (replica) node to your replication setup, ClusterControl will make a backup/snapshot of the master node and use it to build the replica. It can also use an existing backup to stage the replica, in case you want to avoid that extra load on the master. After the backup has been extracted, prepared and the database is up and running, ClusterControl will automatically set up replication.

Creating an Instant Backup

In essence, creating a backup is the same for Galera, MySQL replication, PostgreSQL and MongoDB. You can find the backup section under ClusterControl > Backup, where by default you will see a list of the backups created for the cluster (if any). Otherwise, you will see a placeholder to create a backup:

From here you can click on the "Create Backup" button to make an instant backup or schedule a new backup:

All created backups can also be uploaded to the cloud by toggling "Upload Backup to the Cloud", provided you supply working cloud credentials. By default, all backups older than 31 days will be deleted (configurable via the Backup Retention settings), or you can choose to keep them forever or define a custom period.

"Create Backup" and "Schedule Backup" share similar options except the scheduling part and incremental backup options for the latter. Therefore, we are going to look into Create Backup feature (a.k.a instant backup) in more depth.

As all these various databases have different backup tools, there is obviously some difference in the options you can choose. For instance, with MySQL you get to choose between mysqldump and xtrabackup (full and incremental). For MongoDB, ClusterControl supports mongodump and mongodb-consistent-backup (beta), while for PostgreSQL, pg_dump and pg_basebackup are supported. If in doubt which one to choose for MySQL, check out this blog about the differences and use cases for mysqldump and xtrabackup.

Backing up MySQL and Galera

As mentioned in the previous paragraph, you can make MySQL backups using either mysqldump or xtrabackup (full or incremental). In the "Create Backup" wizard, you can choose which host to run the backup on, the location where you want to store the backup files and its directory, and specific schemas (xtrabackup) or schemas and tables (mysqldump).

If the node you are backing up is receiving (production) traffic, and you are afraid the extra disk writes will become intrusive, it is advised to send the backups to the ClusterControl host by choosing the "Store on Controller" option. This will cause the backup to stream the files over the network to the ClusterControl host; you have to make sure there is enough space available on that node and that the streaming port is open on the ClusterControl host.

There are also several other options, such as whether to use compression and which compression level. The higher the compression level, the smaller the backup size will be. However, it requires higher CPU usage for the compression and decompression processes.

If you choose xtrabackup as the backup method, extra options open up: desync, backup locks, compression and xtrabackup parallel threads/gzip. The desync option is only applicable for desyncing a node from a Galera cluster. Backup locks use a new MDL lock type to block updates to non-transactional tables and DDL statements for all tables, which is more efficient for InnoDB-specific workloads. If you are running on Galera Cluster, enabling this option is recommended.

After scheduling an instant backup, you can keep track of the progress of the backup job under Activity > Jobs:

After it has finished, you should be able to see a new entry in the backup list.

Backing up PostgreSQL

Similar to the instant backups of MySQL, you can run a backup on your Postgres database. With Postgres backups, there are two backup methods supported: pg_dumpall or pg_basebackup. Take note that ClusterControl will always perform a full backup regardless of the chosen backup method.

We have covered this aspect in detail in Become a PostgreSQL DBA - Logical & Physical PostgreSQL Backups.

Backing up MongoDB

For MongoDB, ClusterControl supports the standard mongodump and mongodb-consistent-backup, developed by Percona. The latter is still in beta; it provides cluster-consistent point-in-time backups of MongoDB suitable for sharded cluster setups. As a sharded MongoDB cluster consists of multiple replica sets, a config replica set and shard servers, it is very difficult to make a consistent backup using only mongodump.

Note that in the wizard, you don't have to pick a database node to be backed up. ClusterControl will automatically pick the healthiest secondary replica as the backup node. Otherwise, the primary will be selected. When the backup is running, the selected backup node will be locked until the backup process completes.

Scheduling Backups

Now that we have played around with creating instant backups, we can extend that by scheduling backups.

The scheduling is very easy to do: you can select on which days the backup has to be made and at what time it needs to run.

For xtrabackup there is an additional feature: incremental backups. An incremental backup will only back up the data that changed since the last backup. Of course, incremental backups are useless without a full backup as a starting point. Between two full backups, you can have as many incremental backups as you like, but restoring them will take longer.

Once scheduled, the job(s) should become visible under the "Scheduled Backup" tab, and you can edit them by clicking on the "Edit" button. Like the instant backups, these jobs will schedule the creation of a backup, and you can keep track of the progress via the Activity tab.

Backup List

You can find the Backup List under ClusterControl > Backup; it gives you a cluster-level overview of all the backups made. Clicking on an entry will expand the row and expose more information about the backup:

Each backup is accompanied by a backup log from when ClusterControl executed the job, available under the "More Actions" button.

Offsite Backup in Cloud

Now that we have a lot of backups stored on either the database hosts or the ClusterControl host, we also want to ensure they don’t get lost in case we face a total infrastructure outage (e.g. a DC on fire or flooded). Therefore, ClusterControl allows you to store or copy your backups offsite in the cloud. The supported cloud platforms are Amazon S3, Google Cloud Storage and Azure Cloud Storage.

The upload process happens right after the backup is successfully created (if you toggle "Upload Backup to the Cloud") or you can manually click on the cloud icon button of the backup list:

Choose the cloud credential and specify the backup location accordingly:

Restore and/or Verify Backup

From the Backup List interface, you can directly restore a backup to a host in the cluster by clicking on the "Restore" button for the particular backup or click on the "Restore Backup" button:

One nice feature is that ClusterControl can restore a node or cluster using full and incremental backups: it keeps track of the last full backup made and starts the incremental backups from there. It then groups a full backup together with all the incremental backups up to the next full backup. This allows you to restore starting from the full backup and apply the incremental backups on top of it.

ClusterControl supports restore on an existing database node or restore and verify on a new standalone host:

These two options are pretty similar, except the verify one has extra options for the new host information. If you follow the restoration wizard, you will need to specify a new host. If "Install Database Software" is enabled, ClusterControl will remove any existing MySQL installation on the target host and reinstall the database software with the same version as the existing MySQL server.

Once the backup is restored and verified, you will receive a notification on the restoration status and the node will be shut down automatically.

Point-in-Time Recovery

For MySQL, both xtrabackup and mysqldump can be used to perform point-in-time recovery and also to provision a new replication slave for master-slave replication or Galera Cluster. A mysqldump PITR-compatible backup contains one single dump file, with GTID info, binlog file and position. Thus, only database nodes that produce binary logs will have the "PITR compatible" option available:

When the PITR compatible option is toggled, the database and table fields are greyed out, since ClusterControl will always perform a full backup of all databases, events, triggers and routines of the target MySQL server.

Now, let's restore the backup. If the backup is compatible with PITR, an option will be presented to perform a Point-In-Time Recovery. You will have two options for that - “Time Based” and “Position Based”. For “Time Based”, you can just pass the day and time. For “Position Based”, you can pass the exact position to which you want to restore. It is a more precise way to restore, although you might need to get the binlog position using the mysqlbinlog utility. More details about point-in-time recovery can be found in this blog.

Backup Encryption

Universally, ClusterControl supports backup encryption for MySQL, MongoDB and PostgreSQL. Backups are encrypted at rest using the AES-256 CBC algorithm. An auto-generated key is stored in the cluster's configuration file under /etc/cmon.d/cmon_X.cnf (where X is the cluster ID):

$ sudo grep backup_encryption_key /etc/cmon.d/cmon_1.cnf
backup_encryption_key='JevKc23MUIsiWLf2gJWq/IQ1BssGSM9wdVLb+gRGUv0='
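Should you ever need to decrypt such a backup manually, something along these lines should work (a sketch only; the file names are assumptions, and you should verify the exact procedure against the ClusterControl documentation):

# hypothetical manual decryption of an AES-256 CBC encrypted backup
KEY=$(sudo awk -F"'" '/backup_encryption_key/{print $2}' /etc/cmon.d/cmon_1.cnf)
openssl enc -d -aes-256-cbc -pass pass:"$KEY" \
  -in backup-full.xbstream.gz.aes -out backup-full.xbstream.gz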

If the backup destination is not local, the backup files are transferred in encrypted format. This feature complements the offsite backups in the cloud, where we do not have full access to the underlying storage system.

Final Thoughts

We showed you how to get your data backed up and how to store it safely offsite. Recovery is always a different thing. ClusterControl can automatically recover your databases from backups made in the past, stored on premises or copied back from the cloud.

Obviously there is more to securing your data, especially on the side of securing your connections. We will cover this in the next blog post!

Tags:  backup clustercontrol MariaDB MongoDB MySQL postgres PostgreSQL xtrabackup

MySQL Performance : 8.0 GA and TPCC Workloads


Generally the TPC-C benchmark workload is considered one of the #1 references for Database OLTP Performance. At the same time, for MySQL users it's often not something which is seen as "the most compelling" for performance evaluations.. -- well, when you're still fighting to scale with your own very simple queries, any good result on something more complex may only look "fake" ;-)) So, since a long time Sysbench workloads remained (and will remain) the main #1 "entry ticket" for MySQL evaluation -- the most simple to install, to use, and to point on some sensible issues (if any). Especially since in the new Sysbench version 1.0 a lot of improvements were made in the Sysbench code itself, it really scales now, has the lowest ever overhead, and also allows you to add your own test scenarios via extended Lua scripts (and again, with the lowest ever overhead) -- so, anyone can easily add whatever kind of different test scenarios and share them with others ! (while I'd say "the most compelling test workload" for any given user should be the workload which most closely reproduces his production load -- and you can probably do it now with the new Sysbench, just try it !).

However, from the MySQL Dev side, every given benchmark workload is mostly seen as yet one (or several) problem(s) to resolve. Some of the problems are common for many workloads, some are completely different ones, but generally it's never about "something cool" -- we're just progressing along this long road by fixing one problem after another (to hit yet another one again). So, the TPC-C workload for MySQL is just yet another problem to resolve ;-))

Historically the most popular TPC-C implementations for MySQL were :
  • DBT-2 : an open source version of TPC-C
  • TPCC-mysql : another open source version of TPC-C developed by Percona

Both versions were implemented completely differently, but at least both were very "compelling" to MySQL users, as they run the TPC-C workload via SQL queries (and not via stored procedures, which are more popular with other DB vendors).. So, it was up to anyone's preference which of the 2 test cases to use (while personally I'd say it was always more simple to install and use TPCC-mysql). However, since the new Sysbench is here and Percona has now ported their TPCC-mysql to Sysbench -- for me there's no doubt everyone should move to Sysbench-TPCC if interested in TPC-C testing ! (and kudos to Percona for making this happen !! ;-))

So, what is good with new Sysbench-TPCC :
  • first of all it's fully integrated with Sysbench, so if you already got Sysbench installed on your server, TPCC workload will just work, as well all your old sysbench scripts around to collect the test results and so on ;-))
  • it also goes further than the original TPC-C test -- it allows you to run several TPC-C data sets in parallel (I was doing the same in the past by running several TPCC-mysql or DBT-2 processes at the same time -- which allows you to see "what is the next problem") -- but now you have the same out-of-the-box !

From past testing I've already observed that the most "representative" data set size for the TPCC workload is around 1000W (1K "warehouses") -- it's not too small, nor so big that it takes a long time to generate the data + allocates too much disk space (generally it's using around 100GB in InnoDB once loaded).. -- probably over time I will also test a x10 times bigger volume (or more), but for the moment the 1000W volume is already big enough to investigate MySQL scalability limits on this workload..

So far, my test scenario will be the following :
  • data set :
    • 1000W (single dataset as in original TPC-C workload)
    • 10x100W (10 datasets of 100W executed in parallel)
  • concurrent users : 1, 2, 4, .. 1024
  • InnoDB Buffer Pool (BP) :
    • 128GB : data set is fully cached in BP (no I/O reads)
    • 32GB : not more than 1/3 of data set can be cached in BP (I/O reads and yet more writes to expect)
  • HW and my.conf are used the same as in the previous article about 8.0 OLTP_RW performance.
    • and as well I'm curious how well MySQL 8.0 is scaling on this workload when 1CPU socket (1S) is used comparing to 2CPU sockets (2S)
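For reference, a Sysbench-TPCC run for the 10x100W case looks roughly like this (flag names follow Percona's tpcc.lua; treat them as assumptions to verify against your version):

# prepare 10 parallel TPCC datasets of 100 warehouses each
./tpcc.lua --mysql-socket=/var/lib/mysql/mysql.sock --mysql-user=sbtest \
  --mysql-password=sbtest --mysql-db=sbtest --tables=10 --scale=100 \
  --threads=32 prepare
# run the workload with a given number of concurrent users
./tpcc.lua --mysql-socket=/var/lib/mysql/mysql.sock --mysql-user=sbtest \
  --mysql-password=sbtest --mysql-db=sbtest --tables=10 --scale=100 \
  --threads=64 --time=300 --report-interval=10 run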

Sysbench-TPCC 1000W
here is the result with MySQL 8.0 :
Comments :
  • the above graph is representing the test executed on 1S (left side) and then on 2S (right side)
  • the load is starting with 1 user session, then progressively increased to 2 users, 4, 8, .. 1024
  • as you can see, there is not much difference between the 1S and 2S results..

We're scaling on this workload only up to 32 concurrent users, so having more CPU cores cannot bring any help here.. And what is the bottleneck ? -- we're hitting the index RW-lock contention hard here :

with such a hot contention the difference between MySQL 8.0 and older versions could not be big ;-))

Sysbench-TPCC 1000W, BP=128GB

Comments :
  • interesting that MySQL 8.0 is still winning here anyway !
  • and even on low load 8.0 is mostly matching TPS of 5.6, which is also very positive
  • (and we may expect even better TPS once the index lock contention is lowered, but this is probably not a simple fix)..
  • no idea why MariaDB is even not matching TPS level of 5.7 (while using InnoDB 5.7)


Sysbench-TPCC 1000W, BP=32GB
Comments :
  • the same 1000W workload, but with BP=32GB, so more IO-bound activity is expected..
  • however, TPS results between BP=128GB and BP=32GB configs are not that much different, right ?
  • this is just because the TPCC workload itself is not as IO-intensive as many could imagine ;-))
  • (well, yes, it'll still involve a lot of I/O writes and reads, but they will be often grouped around the same data, so already cached pages could be re-used)
  • (this is completely opposite to, for ex., Sysbench OLTP_RW workload which with a similar amount of data and configured with BP=32GB will become extremely aggressive on I/O and get TPS decreased by several times)
  • again, no idea about MariaDB..

From the other side, I don't recall hitting the same index lock contention while testing the 1000W dataset with DBT-2, so I was very curious to compare it with Sysbench-TPCC 1000W on 2S and with BP=128GB :

DBT-2 1000W -vs- Sysbench-TPCC 1000W (2S, BP=128GB)
Comments :
  • so far, DBT2 workload is on the left side, and Sysbench-TPCC is on the right
  • as you can see, the peak TPS level reached on DBT2 is nearly twice as high as on Sysbench-TPCC
  • why ? -- this is a good question ;-))
  • initially I supposed this was due to more indexes used in the Sysbench-TPCC schema, but removing them did not help..
  • in fact, everything is looking similar, but there is still something which makes Sysbench-TPCC have a different processing "signature", resulting in this intensive index lock contention..
  • would be great if Percona eng. could find what is making this difference, and then make from this an additional test scenario ! -- after which we could definitively forget about DBT2 and use Sysbench-TPCC exclusively ;-))

So far, let's get a look how the things are changing when 10x100W dataset is used :

Sysbench-TPCC 10x100W
here is the result with MySQL 8.0 :
Comments :
  • as in the previous case, the result on 1S is on the left, and 2S on the right side of the graph
  • and you can see here that peak TPS on 2S is more than 50% higher !
  • (but not yet near x2 ;-))
  • as well, peak TPS on 1S is higher than the 1000W result on 1S

This is because by using 10 datasets in parallel we multiplied the number of all tables by 10, which divides the initially observed "index lock" contention by 10 too (!) for the same number of concurrent users -- moving the internal bottleneck to another place; now it's hitting the lock management part (lock_sys mutex contention) :

Comments :
  • as you can see, the index lock contention is still present, but it's divided by 10 now
  • and the presence of lock_sys contention is blocking us from going further..
  • work in progress, and I'm impatient to see this bottleneck gone ;-))

Ok, and now -- how MySQL 8.0 compares to other versions ?

Sysbench-TPCC 10x100W, BP=128GB
Comments :
  • MySQL 8.0 is still showing the best TPS result on this workload as well !
  • TPS is lower -vs- 5.7 on the 512 and 1024 user loads due to higher lock_sys contention in 8.0
  • (by fixing REDO locking in 8.0 we also made other locks more hot, and this is as expected)
  • NOTE : I could "hide" this TPS decrease by limiting thread concurrency, but I did not do it here intentionally to see all the other locks impact..
  • and yes, index lock itself makes a huge trouble when present -- as we see here x2 times better TPS -vs- 1000W
  • note as well, that MySQL 8.0 is matching 5.6 TPS on low load (which, sadly, was no more the case for 5.7)
  • no idea about MariaDB..

and now the same with 32GB BP :

Sysbench-TPCC 10x100W, BP=32GB
Comments :
  • MySQL 8.0 is still doing better than the others here too !
  • I'm a bit surprised to see 5.6 to do slightly better on low load (but hope it'll be improved in 8.0 soon)
  • again, TPS is not that much different compared to the BP=128GB config, so the workload is not as IO-bound as anyone could expect.. -- definitively not something to use as a test case if your target is Storage Performance evaluation..


And I was ready to finish my post here, when Percona published their benchmark results comparing InnoDB -vs- MyRocks on the Sysbench-TPCC 10x100W workload ! -- I was happy to see that MyRocks is progressing and doing well, but my main attention was attracted by the InnoDB results.. -- as you can see from all the results I've presented above, there is not too much difference when we're going from the 128GB BP config to the 32GB BP config.. While from the Percona results we're seeing exactly the opposite.. -- not far from x3 times lower TPS between the 128GB and 32GB BP configs ! -- how is this even possible then ?..

Unfortunately the article is not trying to explain what is going on behind the scenes, but just shows you the numbers.. -- so, let's try to investigate this a little bit ;-))

First of all, a 2S 28cores-HT server was used in the Percona tests, so I'll limit my HW setup to 1S and use 24cores-HT only (for sure, these are not the same CPU chips, but at least the number of really concurrent tasks executed in parallel will be similar)..

Then, comparing the configuration settings, the most notable differences are :
  • checksums = ON
  • doublewrite = ON
  • binlog = ON
  • adaptive hash index = ON (AHI)
  • and lower values for : io capacity / io max / lru depth / BP instances / cleaner threads / etc..

From the follow-up Percona results you can see that this TPS drop between 128GB and 32GB BP is NOT related to binlog, so I have at least one option less to investigate ;-))

So, first of all I wanted to re-check the "base line" results with BP=128GB.

The following graph represents MySQL 8.0 under the Sysbench-TPCC 10x100W workload with different config settings -- I'll try to be short and present all the test cases together rather than one by one, so you can see all 4 tests here :
Comments :
  • the test #1 is executed with my config as presented in all the results above
  • the test #2 is the same as #1, but doublewrite=ON and AHI=ON, and you can see a significant TPS drop..
  • however, this TPS drop is exactly because of AHI ! -- and as I've mentioned to PeterZ during his "InnoDB Tutorial" @PerconaLIVE -- as soon as you have data changes in your workload, in the current implementation AHI becomes the bottleneck by itself.. -- so, the only AHI option you should retain in your mind in this case is to switch AHI=OFF ! ;-))
  • so, the test #3 is the same as #2, but with AHI=OFF -- and as you can see, we got our lost TPS back ! ;-))
  • and another observation you may make here is that "doublewrite=ON" is not impacting the TPS result at all in the current workload.. -- even though it's still not fixed yet in MySQL 8.0
  • (Sunny, please, push the new doublewrite code asap to show people a real power of MySQL 8.0 !)
  • and the test #4 is with : doublewrite=ON, AHI=OFF, checksums=ON (crc32), io capacity=2K, io capacity max=4K, etc. -- mostly the same as Percona config, and you can see TPS on the same level again ;-))
  • NOTE : while using such low IO capacity settings does not result here in TPS drops, it lowers the resistance of the MySQL Server instance to activity bursts -- the Checkpoint Age is hitting its max limit, and sync flushing waits are already present during the test (aka "furious flushing").. -- so, I'd not suggest it as the right tuning.
  • I don't test the impact of checksums here as it'll simply not be present in this workload (all data are in BP, checksums will be involved only on page writes, which happen in the background, so zero impact on overall processing)..

Now, let's see the same workload, but with BP=32GB :
Comments :
  • the first test is, again, with my initial config settings, and TPS is not too much lower than with BP=128GB..
  • the test #2 is as test #1, but with doublewrite=ON and AHI=ON, and indeed, not far from x2 TPS drop..
  • let's switch AHI=OFF now as in the previous case..
  • the test #3 is as test #2, but with AHI=OFF, and as expected, we can see increased TPS here ;-))
  • now, what is the impact of checksums ?
  • the test #4 is the same as #3, but with checksums=ON (crc32) -- mostly zero impact on TPS
  • and the test #5 is mostly reusing Percona config, except with AHI=off -- which is slightly lowering TPS..

So far :
  • the biggest impact here is coming from doublewrite=ON
  • and the impact is NOT because we're writing the data twice.. -- but because there is a lock contention in the doublewrite code ! -- historically doublewrite was implemented as a small write zone, and as soon as you have many writes going in parallel -- you have a locking fight..
  • the new doublewrite code was implemented by Sunny without any of these limitations, and as long as your storage is able to follow (e.g. to write the same data twice) -- your TPS will remain the same ! ;-))
  • e.g. in my case I should obtain the same over 10K TPS as you can see in the test #1
  • but Percona is claiming to have it fixed, so that's why this x3 TPS drop in their results between the 128GB and 32GB BP configs is surprising me.. -- is it the AHI impacting their tests so much ?.. -- no idea
  • then, why is doublewrite more impacting in the 32GB BP config compared to 128GB BP ?
    • with 32GB BP we are doing much more I/O :
    • first of all, only 1/3 of the data may remain cached in BP, so we'll often Read from storage
    • but before being able to Read, we first need to find a free page in BP to re-use
    • and if most of the pages in BP are "dirty" with changes, we need to Write these changes first before declaring a given page as "free" and ready to re-use
    • which results in many more Writes -vs- the 128GB BP config (where you don't have any Reads at all)
  • other point : you should also keep in mind to look at TPS results as "one whole"
  • e.g. if you look at the 32 users load, you'll see 7.5K TPS, but if you look at 128 users only -- you'll see 5K TPS (or even less, depending on config ;-))
  • and if you're looking for reaching a max possible TPS, your main load level is then around a peak TPS
  • once the peak TPS is reached, your main worry then is only about how not to lose it with higher load..
  • there are many solutions available around (and the most optimal IMHO is with ProxySQL pool) -- and you have as well the old good one -- "thread concurrency" tuning ;-))
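For reference, the knob in question is a single dynamic variable, applicable at runtime:

# 0 (the default) means no concurrency limit inside InnoDB
mysql -e "SET GLOBAL innodb_thread_concurrency = 32;"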

So, let's add the test #6 which is the same as test #4 (doublewrite=ON, AHI=OFF, checksums=ON) but with innodb_thread_concurrency=32 :
Comments :
  • as you can see, even on higher load TPS is "improved" now as well ;-))
  • (I'd rather say it's "solved" from contention, as we're not improving here anything, just limiting the concurrency)
  • one day we will have no more TPS drops on high load at all (even with thread concurrency=0), but this day is not yet today (nor tomorrow ;-))

Ok, we're able to "hide" the doublewrite contention, fine ! -- but could we reduce the overall Writes impact here ? (with reduced Writes we'll stress the doublewrite buffer much less, meaning its lock contention will be lower, and probably overall TPS will be higher then ?.. -- And YES, in this workload it's possible ! ;-))

How ?.. -- remember that this is TPCC, e.g. a pure OLTP workload, and, as I've mentioned before -- the data access is "grouped" (so, some data are re-used from the BP cache before reading other ones). And these workload conditions perfectly match the story I've explained in the 1M IO-bound QPS article with MySQL 8.0 -- let's try the same test #6, but with InnoDB configured with page size = 4KB, which will be the test #7 on the next graph :

Comments :
  • as you can see, with 4KB page size the TPS level is even higher than in test #1 !! ;-))
  • (note : we're still having doublewrite=ON and checksums=ON)
  • and with the new doublewrite code it should be just the same in all test results here (just mind to switch AHI=OFF ;-))
  • also, as you can see, even with x4 times less RAM for BP (32GB -vs- 128GB) and doublewrite=ON and checksums=ON we're still NOT x3 times worse on TPS, but rather near the same result as with 128GB BP !!

INSTEAD OF SUMMARY
  • Sysbench-TPCC itself still has some surprises (comparing 1000W case with DBT2)
  • MySQL 8.0 is doing better than any other/older version here !
  • (but we're yet far from a good scalability -- work in progress, stay tuned ;-))
  • believe me, you're not yet finished to be surprised by InnoDB ;-))
  • Sunny, please, push to 8.0 the new doublewrite code ASAP !! ;-))

Thank you for using MySQL ! -- stay tuned ;-))

Rgds,
-Dimitri

Installing MySQL 8.0 on Ubuntu 16.04 LTS in Five Minutes

Do you want to install MySQL 8.0 on Ubuntu 16.04 LTS? In this quick tutorial, I show you exactly how to do it in five minutes or less.

This tutorial assumes you don’t have MySQL or MariaDB installed. If you do, it’s necessary to uninstall them or follow a slightly more complicated upgrade process (not covered here).

Step 1: Install MySQL APT Repository

Ubuntu 16.04 LTS, also known as Xenial, comes with a choice of MySQL 5.7 and MariaDB 10.0.

If you want to use MySQL 8.0, you need to install the MySQL/Oracle Apt repository first:

wget https://dev.mysql.com/get/mysql-apt-config_0.8.10-1_all.deb
dpkg -i mysql-apt-config_0.8.10-1_all.deb

The MySQL APT repository installation package allows you to pick what MySQL version you want to install, as well as if you want access to Preview Versions. Let’s leave them all as default:

Step 2: Update repository configuration and install MySQL Server

apt-get update
apt-get install mysql-server

Note: Do not forget to run “apt-get update”, otherwise you can get an old version of MySQL from Ubuntu repository installed.

The installation process asks you to set a password for the root user:

I recommend you set a root password for increased security. If you do not set a password for the root account, “auth_socket” authentication is enabled. This ensures only the operating system’s “root” user can connect to MySQL Server without a password.

Next, the installation script asks you whether to use Strong Password Encryption or Legacy Authentication:

While using strong passwords is recommended for security purposes, not all applications and drivers support this new authentication method. Going with Legacy Authentication is therefore the safer choice.

All Done

You should have MySQL 8.0 Server running. You can test it by connecting to it with a command line client:
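For example (the exact version string will vary with the build you get):

$ mysql -uroot -p -e "SELECT VERSION();"
Enter password:
+-----------+
| VERSION() |
+-----------+
| 8.0.11    |
+-----------+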

As you can see, it takes just a few simple steps to install MySQL 8.0 on Ubuntu 16.04 LTS.

Installing MySQL 8.0 on Ubuntu 16.04 LTS is easy. Go ahead give it a try!

The post Installing MySQL 8.0 on Ubuntu 16.04 LTS in Five Minutes appeared first on Percona Database Performance Blog.

MySQL Performance : 1M *IO-bound* QPS with 8.0 GA on Intel Optane SSD !


Historically, Random I/O Reads were always a major PITA for any OLTP workload.. While Random I/O Writes you could still "delay" via controller caches (or any kind of other battery-protected caches -- especially if Writes are coming in bursts), there is no way to "predict" I/O Reads if they are fully Random (so you cannot "cache" or "prefetch" them ahead and have to deliver the data directly from storage, read by read.. -- which hits a huge "rotation penalty" on HDD).

Indeed, things changed dramatically since the arrival of Flash Storage. You don't need to pay any particular attention to whether your I/O Reads are Random or Sequential. However, you still need to keep in mind not to hit the overall throughput limit of your Flash Device. As a result, reading in smaller I/O blocks allows you to do more I/O operations/sec than with bigger blocks.

And what about InnoDB ? -- InnoDB is using by default 16KB page size (so by default all Random I/O Reads are of 16KB) :

And with 16KB Random Reads you definitively will reach your throughput limit sooner than with 8KB or 4KB Reads. Many users are seeing "compression" as the best matching solution here. And, indeed, if you're very lucky and can compress your data, say, by x4 times => you'll Read then only 4KB from storage to deliver 16KB of data. And, yes, you'll be able to deliver x4 times more Reads/sec :

The only problem is that you'll still deliver 16KB of data, which you first need to copy to the Buffer Pool (BP) to be able to then process the query, and your BP is not unlimited ! ;-)) -- so, in reality, you're able to read data x4 times faster from your storage, but still not able to process it faster due to the BP size limit :


However, if your rows are well grouped within a 16KB page -- you may still win by doing fewer I/O Reads, but not get any overall QPS performance gain compared to the "non compressed" default case..

At the same time, you have another solution to radically improve your QPS and get a bigger profit from your Flash Device -- you may just use a 4KB page size for your whole InnoDB instance, which will allow you, within the same I/O throughput level, to Read x4 times more and process x4 times more with the same BP size :


Unfortunately, until now such an approach was only "partially" usable with InnoDB, because historically every I/O operation in InnoDB was going via a global lock (fil_system mutex), quickly limiting your overall I/O rate. The good news is that since MySQL 8.0 this limitation is finally gone, and our I/O scalability has moved to the next level ! ;-))
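For completeness, a 4KB-page instance is mostly a matter of initialization settings -- a minimal sketch (innodb_page_size cannot be changed after the data directory is created; the BP value matches the test below):

# /etc/my.cnf excerpt
[mysqld]
innodb_page_size=4k
innodb_buffer_pool_size=32G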

But this all may still remain just a good "theory" if there is no "practical" result to confirm the claim, right ? -- so, to confirm the gain I'll use an IO-bound Sysbench Point-Select workload :
  • test case : Sysbench 1.x uniform point-selects
  • volume : 50M x 8 tables (400M rows in total, ~100GB)
  • BP size : 32GB
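In Sysbench 1.x terms the scenario is roughly the following (standard oltp_point_select options; verify them against your build):

# load 8 tables of 50M rows each, then run uniform point-selects
sysbench oltp_point_select --mysql-user=sbtest --mysql-db=sbtest \
  --tables=8 --table-size=50000000 --rand-type=uniform prepare
sysbench oltp_point_select --mysql-user=sbtest --mysql-db=sbtest \
  --tables=8 --table-size=50000000 --rand-type=uniform \
  --threads=64 --time=300 --report-interval=10 run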

NOTE : while there is still a possibility that some data will be re-used from the BP, this test scenario generally saturates any flash device much faster than you can imagine.. -- so, the whole processing here is really IO-bound and IO-driven.

And I'd say the timing was very favorable here, allowing us to test it on the latest Intel Optane SSD drive ! -- this device by itself seems to bring the next level of standard to the whole Flash Storage story ;-)) Here are a few "pure IO" test results I've obtained from a single Optane device in my validation tests :

Intel Optane SSD drive :
  • single thread pure Random I/O Reads with 16KB : 65,000 reads/sec (!!), 1000 MB/sec
  • avg latency : 0.01ms (!!)
  • max throughput : 2000-2500 MB/sec

So, in a pure I/O test on this drive a single thread is already delivering 65K Random Reads/sec of 16KB ! -- but also reaching 1000 MB/sec throughput, meaning with 2 or 3 threads we're already reaching the max IO rate of this device. However, there is a big difference between a "pure IO" test and a database "IO-bound workload" -- there is still a lot of processing around every SQL request (not only I/O)..

But "in theory" by using x2 Optane drives together there should be possible to reach :
  • with 16KB page size : 260K Random Reads/sec
  • with 8KB page : over 500K/sec
  • with 4KB page : probably 1M Reads/sec

and as usual, the real answer you obtain only by a real test ;-))

so far :

Sysbench 50Mx8tab IO-bound point-selects

Comments :
  • yes, MySQL 8.0 is showing a real difference here by reaching over 1M IO-bound QPS !!!
  • NOTE : the max QPS on MySQL 5.6 and 5.7 is reached with the 8KB page size here (it becomes worse with 4KB pages due to the fil_system lock contention present in these versions)..

The results are obtained on the same Skylake server with the same config settings as described before. The only difference is that 50M rows were loaded per table and BP size was reduced to 32GB.

Thank you for using MySQL ! -- stay tuned ;-))

Rgds,
-Dimitri

Backing up users and privileges in MySQL

There are two simple ways to backup only the users and privileges in MySQL:

1- Using mysqlpump utility (as create user and grant statements):

[shell ~]$ mysqlpump -uUSER -p --exclude-databases=% --add-drop-user --users > /tmp/pump-all-users_privileges-timestamp.sql
Dump completed in 1364 milliseconds

Sample output:

[shell ~]$ head /tmp/pump-all-users_privileges-timestamp.sql

-- Dump created by MySQL pump utility, version: 5.7.21-20, Linux (x86_64)
-- Dump start time: Sun May 13 23:30:49 2018
-- Server version: 5.7.21
SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0;
SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0;
SET @OLD_SQL_MODE=@@SQL_MODE;
SET SQL_MODE="NO_AUTO_VALUE_ON_ZERO";
SET @@SESSION.SQL_LOG_BIN= 0;
SET @OLD_TIME_ZONE=@@TIME_ZONE;
SET TIME_ZONE='+00:00';
SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT;
SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS;
SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION;
SET NAMES utf8mb4;
DROP USER 'check'@'%';
CREATE USER 'check'@'%' IDENTIFIED WITH 'mysql_native_password' AS '*B865CAE8F340F6CE1485A06F4492BB49718DF' REQUIRE NONE PASSWORD EXPIRE DEFAULT ACCOUNT UNLOCK;
GRANT SELECT, PROCESS, SUPER, REPLICATION SLAVE ON *.* TO 'check'@'%';

You can also use --exclude-users=[comma separated usernames] to exclude some users from the backup, or --include-users=[comma separated usernames] to include only some users in the backup.

2- Using Percona toolkit (pt-show-grants):

[shell ~]$ pt-show-grants -uUSER --ask-pass --drop > /tmp/ptshowgrants-all-users_privileges-timestamp.sql

Sample output:

[shell ~]$ head /tmp/ptshowgrants-all-users_privileges-timestamp.sql

-- Grants dumped by pt-show-grants
-- Dumped from server Localhost via UNIX socket, MySQL 5.7.21-20-log at 2018-05-13 23:39:55
DROP USER 'check'@'%';
DELETE FROM `mysql`.`user` WHERE `User`='check' AND `Host`='%';
-- Grants for 'check'@'%'
CREATE USER IF NOT EXISTS 'check'@'%';
ALTER USER 'check'@'%' IDENTIFIED WITH 'mysql_native_password' AS '*B865CAE8F340F6CE1485A06F4492BB49718DF' REQUIRE NONE PASSWORD EXPIRE DEFAULT ACCOUNT UNLOCK;
GRANT PROCESS, REPLICATION SLAVE, SELECT, SUPER ON *.* TO 'check'@'%';

Similarly, you can also use --only or --ignore options to include/exclude specific users in the backup.

Note:
pt-show-grants used to print only the grant statements (no create user info), which caused issues in the restore, especially if the sql-mode NO_AUTO_CREATE_USERS is enabled.
Although – as you can see from the output – the Percona team has fixed this issue, I still see unnecessary statements, e.g. deleting the user record from the mysql.user table (isn’t DROP USER sufficient to drop the user?!).

Restore the grants

You can simply restore the privileges from either methods above by the following command:

mysql -uUSER -p < user_privileges.sql

Conclusion
  • Backing up the users and privileges – as Create User and grant statements – using mysqlpump or pt-show-grants is much better than backing up the grant tables in the mysql system database.
  • It’s recommended to run either of the above commands as a scheduled job alongside your normal backups.
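For example, a nightly cron entry could look like this (the backup path is an assumption, credentials are expected to come from ~/.my.cnf, and note that % must be escaped in a crontab):

# m h dom mon dow command
30 2 * * * mysqlpump --exclude-databases=\% --add-drop-user --users > /backups/users_privileges_$(date +\%F).sql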

MySQL master discovery methods, part 5: Service discovery & Proxy

This is the fifth in a series of posts reviewing methods for MySQL master discovery: the means by which an application connects to the master of a replication tree. Moreover, the means by which, upon master failover, it identifies and connects to the newly promoted master.

These posts are not concerned with the manner in which replication failure detection and recovery take place. I will share orchestrator-specific configuration/advice, and point out where a cross-DC orchestrator/raft setup plays a part in discovery itself, but for the most part any recovery tool, such as MHA, replication-manager, severalnines or other, is applicable.

We discuss asynchronous (or semi-synchronous) replication, a classic single-master-multiple-replicas setup. A later post will briefly discuss synchronous replication (Galera/XtraDB Cluster/InnoDB Cluster).

Master discovery via Service discovery and Proxy

Part 4 presented an anti-pattern setup, where a proxy would infer the identity of the master by drawing conclusions from backend server checks. This led to split brains and undesired scenarios. The problem was the loss of context.

We re-introduce a service discovery component (illustrated in part 3), such that:

  • The app does not own the discovery, and
  • The proxy behaves in an expected and consistent way.

In a failover/service discovery/proxy setup, there is clear ownership of duties:

  • The failover tool owns the failover itself and the master identity change notification.
  • The service discovery component is the source of truth as to the identity of the master of a cluster.
  • The proxy routes traffic but does not make routing decisions.
  • The app only ever connects to a single target, but should allow for a brief outage while failover takes place.

Depending on the technologies used, we can further achieve:

  • Hard cut for connections to old, demoted master M.
  • Black/hold off for incoming queries for the duration of failover.

We explain the setup using the following assumptions and scenarios:

  • All clients connect to master via cluster1-writer.example.net, which resolves to a proxy box.
  • We fail over from master M to promoted replica R.

A non planned failover illustration #1

Master M has died, the box had a power failure. R gets promoted in its place. Our recovery tool:

  • Updates service discovery component that R is the new master for cluster1.

The proxy:

  • Either actively or passively learns that R is the new master, rewires all writes to go to R.
  • If possible, kills existing connections to M.

The app:

  • Needs to know nothing. Its connections to M fail, it reconnects and gets through to R.
A non planned failover illustration #2

Master M gets network isolated for 10 seconds, during which time we failover. R gets promoted.

Everything is as before.

If the proxy kills existing connections to M, then the fact M is back alive turns meaningless. No one gets through to M. Clients were never aware of its identity anyhow, just as they are unaware of R's identity.

Planned failover illustration

We wish to replace the master, for maintenance reasons. We successfully and gracefully promote R.

  • In the process of promotion, M turned read-only.
  • Immediately following promotion, our failover tool updates service discovery.
  • Proxy reloads having seen the changes in service discovery.
  • Our app connects to R.
Discussion

This is a setup we use at GitHub in production. Our components are:

  • orchestrator for failover tool.
  • Consul for service discovery.
  • GLB (HAProxy) for proxy
  • Consul template running on proxy hosts:
    • listening on changes to Consul's KV data
    • Regenerate haproxy.cfg configuration file
    • reload haproxy
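To illustrate the consul-template piece, a minimal sketch of a template fragment (the file layout and exact KV paths are assumptions; orchestrator writes under the configured KVClusterMasterPrefix):

# haproxy.cfg.tpl fragment: consul-template re-renders the config and reloads
# haproxy whenever the mysql/master/cluster1 KV entries change
listen mysql-writer
    bind 0.0.0.0:3306
    server master {{ key "mysql/master/cluster1/hostname" }}:{{ key "mysql/master/cluster1/port" }} check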

As mentioned earlier, the apps need not change anything. They connect to a name that is always resolved to proxy boxes. There is never a DNS change.

At the time of failover, the service discovery component must be up and available, to catch the change. Otherwise we do not strictly require it to be up at all times.

For high availability we will have multiple proxies, each of which must listen for changes to the K/V store. Ideally the name (cluster1-writer.example.net in our example) resolves to any available proxy box.

  • This, in itself, is a high availability issue. Thankfully, managing the HA of a proxy layer is simpler than that of a MySQL layer. Proxy servers tend to be stateless and equal to each other.
  • See GLB as one example for a highly available proxy layer. Cloud providers, Kubernetes, two level layered proxies, Linux Heartbeat, are all methods to similarly achieve HA.

Sample orchestrator configuration

An orchestrator configuration would look like this:

"ApplyMySQLPromotionAfterMasterFailover": true, "KVClusterMasterPrefix": "mysql/master", "ConsulAddress": "127.0.0.1:8500", "ZkAddress": "srv-a,srv-b:12181,srv-c", "PostMasterFailoverProcesses": [ “/just/let/me/know about failover on {failureCluster}“, ],

In the above:

  • If ConsulAddress is specified, orchestrator will update given Consul setup with K/V changes.
  • As of orchestrator 3.0.10, ZooKeeper (via ZkAddress) is not yet supported.
  • PostMasterFailoverProcesses is shown here just to illustrate that hooks are not strictly required for the operation to run.

See orchestrator configuration documentation.

The witching hour: How illegal datetimes get stored into MySQL in US timezones

Most companies make a mistake at the very beginning.  They set their system time zone to Eastern, Central, Mountain, or Pacific time.  Then, they install their database on it.  They use all manner of timestamp and date types and insert terabytes of data.  After that, it becomes almost impossible (in terms of being worth the effort) to change it some months or years later when the company has a popular product or service.  One of the unfortunate issues that may arise is illegal datetimes.

On Sunday, March 11, 2018, when US time zones switched to daylight saving time, there was no 2AM hour.  When we switch back to standard time on Sunday, November 4, there will be two 2AM hours.
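
You can see the skipped hour directly in SQL. A minimal sketch, assuming the MySQL time zone tables have been loaded (via mysql_tzinfo_to_sql) so that named zones work:

SELECT CONVERT_TZ('2018-03-11 06:59:59', 'UTC', 'US/Eastern');  -- 2018-03-11 01:59:59
SELECT CONVERT_TZ('2018-03-11 07:00:00', 'UTC', 'US/Eastern');  -- 2018-03-11 03:00:00

One second of UTC time jumps the local clock from 01:59:59 straight to 03:00:00; no local time in between exists.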

Inevitably, out of many database servers and many different types of application servers, some data manages to get inserted into MySQL with an illegal datetime during the 2AM hour on the "spring forward" date.  Each year, developers ask questions about the application errors arising from it.

Here is an illustration of how it happens.

Given the following table definition:

CREATE TABLE `ts_test` (
  `js_ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `js_dt` datetime NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Let’s now try to insert some data.

mysql> insert into ts_test (js_ts,js_dt) values ('2018-03-11 02:12:00','2018-03-11 02:12:00');
ERROR 1292 (22007): Incorrect datetime value: '2018-03-11 02:12:00' for column 'js_ts' at row 1

We got an error for the timestamp column (note the error names js_ts).  This is good, right?  MySQL is protecting us from inserting a bad value.  So – if I’m running… let’s see here, what am I running.  MySQL 5.7.21 — there should be no issue, right?  Maybe this was yesterday’s issue on a different version of MySQL?

Not so fast.

What about your ORM?  Is your ORM giving any specific instructions to the database for inserts and updates to datetimes?  If so, maybe… something like THIS is happening.

mysql> set session time_zone='-5:00';
Query OK, 0 rows affected (0.00 sec)

mysql> insert into ts_test (js_ts,js_dt) values ('2018-03-11 02:12:00','2018-03-11 02:12:00');
Query OK, 1 row affected (0.00 sec)

mysql> select * from ts_test;
+---------------------+---------------------+
| js_ts               | js_dt               |
+---------------------+---------------------+
| 2018-03-11 02:12:00 | 2018-03-11 02:12:00 |
+---------------------+---------------------+
1 row in set (0.00 sec)

Hm.  So, if we set the session time zone to UTC -5, the insert succeeds.  That must mean that MySQL has made this “right”.  If I query again after disconnecting from this session and starting a fresh one in daylight savings time – what will happen?  Both timestamps will advance and show 3AM…. right?

mysql> select * from ts_test;
+---------------------+---------------------+
| js_ts               | js_dt               |
+---------------------+---------------------+
| 2018-03-11 03:12:00 | 2018-03-11 02:12:00 |
+---------------------+---------------------+
1 row in set (0.00 sec)

Wrong.

The timestamp column adjusted.  The datetime one did not.  I now have a situation that my application is not going to like.  My applications above this database will error on the js_dt column value because it’s an illegal local time.

What if I open up a virtual machine, create the same table, set the system time back to just before the time change, establish a MySQL session to the database server, leave that connection open, and continue to insert rows as the time changes?

I ran:

~# date -s "11 Mar 2018 01:59:30"
mysql> insert into ts_test values (now(),now()); # repeatedly

… and as we went over the time change, MySQL did the right thing with the now() function and the session as far as I could tell.

| 2018-03-11 01:59:55 | 2018-03-11 01:59:55 |
| 2018-03-11 01:59:57 | 2018-03-11 01:59:57 |
| 2018-03-11 01:59:58 | 2018-03-11 01:59:58 |
| 2018-03-11 01:59:59 | 2018-03-11 01:59:59 |
| 2018-03-11 03:00:00 | 2018-03-11 03:00:00 |
| 2018-03-11 03:00:01 | 2018-03-11 03:00:01 |
| 2018-03-11 03:00:02 | 2018-03-11 03:00:02 |
| 2018-03-11 03:00:03 | 2018-03-11 03:00:03 |
| 2018-03-11 03:00:04 | 2018-03-11 03:00:04 |

I did not have applications at hand to try a PHP session or a java session inserting data second by second as we passed over the time change.

Most of the time we see this in the real world, the datetime datapoints are only a few seconds past 2AM.  My belief is that ORMs and software in general can set a specific offset from UTC in the database session, and then fail to change it when an operation that started before the time change completes a few seconds after it.

Let’s recap.

Q: How do bad datetime values get into the database?

  • Because application ORMs and other session originators can explicitly set the session offset from UTC.
  • Because a session connected prior to the time change may still have the previous time zone setting.

Q: Why is this a problem for datetimes and *not* timestamps?

  • The underlying storage for timestamps is UTC, and times must be greater than the start of the epoch (1970-01-01 00:00:00, unixtime 0).  The DB converts timestamps to and from UTC, and that conversion works relative to the connected session's time zone.
  • Datetimes are not converted back and forth.  What you put in is what you get back out.  MySQL does try to validate datetimes against the current time zone (see the previous error), but an override of session.time_zone to an explicit UTC offset negates the comparison to the system time zone, thus allowing “illegal” values to be inserted.  A short demo follows this list.
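
A minimal sketch of the difference, using the ts_test table from above (no output shown, since it depends on your server's time zone):

SET SESSION time_zone = '+00:00';
SELECT js_ts, js_dt FROM ts_test;
-- js_ts is converted into the session time zone on the way out;
-- js_dt is returned verbatim, exactly as it was stored.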

Q: What do I do if I have illegal datetimes in my database?

  • Illegal datetimes should only occur during one hour per year, so it is easy to query for them and report on them.  They should be updated by adding one hour, as sketched below.
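
A hedged sketch, using the ts_test.js_dt column from the example above; adjust the window to the local "spring forward" hour for each affected year:

SELECT * FROM ts_test
WHERE js_dt >= '2018-03-11 02:00:00'
  AND js_dt <  '2018-03-11 03:00:00';

UPDATE ts_test
SET js_dt = js_dt + INTERVAL 1 HOUR
WHERE js_dt >= '2018-03-11 02:00:00'
  AND js_dt <  '2018-03-11 03:00:00';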

Streaming Data From MariaDB Server Into MariaDB ColumnStore via MariaDB MaxScale

markusmakela Fri, 05/11/2018 - 23:04

In this blog post, we look at how to configure Change Data Capture (CDC) from the MariaDB Server to
MariaDB ColumnStore via MariaDB MaxScale. Our goal in this blog post is to have our analytical
ColumnStore instance reflect the changes that happen on our operational MariaDB Server.

MariaDB MaxScale Configuration

We start by creating a MaxScale configuration with binlogrouter and avrorouter instances. The former
acts as a replication slave and fetches binary logs, and the latter processes the binary logs into
CDC records.

[replication-router]
type=service
router=binlogrouter
user=maxuser
passwd=maxpwd
server_id=2
master_id=1
binlogdir=/var/lib/maxscale
mariadb10-compatibility=1
filestem=mariadb-bin

[replication-listener]
type=listener
service=replication-router
protocol=MySQLClient
port=3306

[avro-router]
type=service
router=avrorouter
source=replication-router
avrodir=/var/lib/maxscale

[avro-listener]
type=listener
service=avro-router
protocol=cdc
port=4001

Copy the contents of this file into the `maxscale.cnf` file.

The docker-compose.yml File

The next step is to clone the MaxScale repository and to create the docker-compose file.

To clone the MaxScale repository, execute the following command.

git clone https://github.com/mariadb-corporation/MaxScale.git --branch=2.2 --depth=1

After the command completes, create the `docker-compose.yml` file with the following contents in the
same directory where you cloned MaxScale.
 

version: '2'
services:
    master:
        image: mariadb:10.2
        container_name: master
        environment:
            MYSQL_ALLOW_EMPTY_PASSWORD: Y
        command: mysqld --log-bin=mariadb-bin --binlog-format=ROW --server-id=1
        ports:
            - "3306:3306"

    maxscale:
        build: ./MaxScale/docker/
        container_name: maxscale
        volumes:
            - ./maxscale.cnf:/etc/maxscale.cnf.d/maxscale.cnf
        ports:
            - "3307:3306"
            - "4001:4001"

    mcs:
        image: mariadb/columnstore_singlenode:latest
        container_name: mcs
        ports:
            - "3308:3306"

    adapter:
        image: centos:7
        container_name: adapter
        command: /bin/sleep 0xffffffff

This file contains a MariaDB Server that acts as the master server, a MaxScale instance in a CDC
configuration and a single-node ColumnStore container. We also use a plain CentOS 7 container where
we install the adapter.

To start the cluster, run the following commands.

docker-compose build
docker-compose up -d

Configuring

The next step is to copy the ColumnStore configuration file from the `mcs` container and modify it
to use the container hostname instead of the loopback address. To do this, execute the following
commands.

docker cp mcs:/usr/local/mariadb/columnstore/etc/Columnstore.xml .
sed -i 's/127.0.0.1/mcs/' Columnstore.xml
docker cp Columnstore.xml adapter:/etc/Columnstore.xml

After we have copied the configuration file into the `adapter` container, we are ready to install the adapter.

Installing Adapter

To access the container, execute `docker-compose exec adapter bash`. This will launch a new shell
where the following commands will be executed.

yum -y install epel-release
yum -y install https://downloads.mariadb.com/Data-Adapters/mariadb-columnstore-api/1.1.3/centos/x86_64/7/mariadb-columnstore-api-1.1.3-1-x86_64-centos7.rpm
curl -sS https://downloads.mariadb.com/MariaDB/mariadb_repo_setup | bash
yum -y install https://downloads.mariadb.com/Data-Adapters/mariadb-streaming-data-adapters/cdc-data-adapter/1.1.3/centos-7/mariadb-columnstore-maxscale-cdc-adapters-1.1.3-1-x86_64-centos7.rpm

After the adapter is installed, exit the shell.

Next we can start preparing the data on the master server and configure the replication between it
and MaxScale.

Preparing Data and Configuring Replication

We connect to the MariaDB Server running on the `master` container with the following command.

mysql -uroot -h 127.0.0.1 -P 3306

Once connected, execute the following SQL. This will prepare the server, create a table and insert
some dummy data into the table. It also modifies the data to emulate changes in the database.

RESET MASTER;
CREATE USER 'maxuser'@'%' IDENTIFIED BY 'maxpwd';
GRANT ALL ON *.* TO 'maxuser'@'%';
CREATE DATABASE test;
USE test;
CREATE TABLE t1(id INT);
INSERT INTO t1 VALUES (1), (2), (3);
UPDATE t1 SET id = 4 WHERE id = 2;
DELETE FROM t1 WHERE id = 3;

Once we have created some data, we configure the replication between MaxScale and the master
server. To do this, execute the following command.

mysql -umaxuser -pmaxpwd -h 127.0.0.1 -P 3307 -e \
  "CHANGE MASTER TO MASTER_HOST='master', MASTER_PORT=3306, MASTER_USER='maxuser', MASTER_PASSWORD='maxpwd', MASTER_LOG_FILE='mariadb-bin.000001', MASTER_LOG_POS=4; START SLAVE"

MaxScale will start to replicate events from the master server and process them into CDC records.
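
To verify this, you can connect to the binlogrouter listener (port 3307 above) and issue a standard replication status query; a hedged sketch, assuming the binlogrouter's support for the usual replication admin commands:

SHOW SLAVE STATUS\G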

Create CDC User

To use the CDC system in MaxScale, we have to create a user for it. Execute the following command to create a user.

docker-compose exec maxscale maxctrl call command cdc add_user avro-router cdcuser cdcpassword

Starting the Adapter

We again execute the commands inside the adapter container. To access the container, execute
`docker-compose exec adapter bash`.

Once inside the container, we can try to start the adapter. Given that the table `test.t1` does not
exist on ColumnStore, the adapter will give us an error when we try to start it:
 

[root@d444d5c5b820 /]# mxs_adapter -c /etc/Columnstore.xml -u cdcuser -p cdcpassword -h maxscale -P 4001 test t1
Table not found, create with:
    CREATE TABLE test.t1 (domain int, event_number int, event_type varchar(50), id int, sequence int, server_id int, timestamp int) ENGINE=ColumnStore;

To create the table on ColumnStore, we have to exit the container. Once out of the container, we
connect to the ColumnStore container and create the table described in the error message with the
following command.

mysql -uroot -h 127.0.0.1 -P 3308 -e "CREATE TABLE test.t1 (domain int, event_number int, event_type varchar(50), id int, sequence int, server_id int, timestamp int) ENGINE=ColumnStore;"

Once the table is created, we go back into the adapter container with `docker-compose exec adapter
bash` and try to start it again.

[root@d444d5c5b820 /]# mxs_adapter -c /etc/Columnstore.xml -u cdcuser -p cdcpassword -h maxscale -P 4001 test t1
4 rows and 1 transactions inserted in 0.210798 seconds. GTID = 0-1-6
2 rows and 1 transactions inserted in 0.164197 seconds. GTID = 0-1-7

This time we see that it processed a total of six rows of data. We can now connect to the
ColumnStore on another terminal and see what the table contains.

[markusjm@localhost blog]$ mysql -uroot -h 127.0.0.1 -P 3308 -e "SELECT * FROM test.t1"
+--------+--------------+---------------+------+----------+-----------+------------+
| domain | event_number | event_type    | id   | sequence | server_id | timestamp  |
+--------+--------------+---------------+------+----------+-----------+------------+
|      0 |            1 | insert        |    1 |        5 |         1 | 1523948280 |
|      0 |            2 | insert        |    2 |        5 |         1 | 1523948280 |
|      0 |            3 | insert        |    3 |        5 |         1 | 1523948280 |
|      0 |            1 | update_before |    2 |        6 |         1 | 1523948280 |
|      0 |            2 | update_after  |    4 |        6 |         1 | 1523948280 |
|      0 |            1 | delete        |    3 |        7 |         1 | 1523948281 |
+--------+--------------+---------------+------+----------+-----------+------------+

The changes we did on the master MariaDB Server have been propagated to ColumnStore. To understand
what the values are, we can map the SQL statements to the rows in the table.

The first SQL statement is `INSERT INTO t1 VALUES (1), (2), (3);` which inserts three values into
the table. We see that the first three rows in the resultset are of type `insert` and the values
match what we inserted.

The next SQL statement is `UPDATE t1 SET id = 4 WHERE id = 2;` which only touches one row. Although
it modifies only one row in the database, it generated two rows in ColumnStore. This happened
because the MaxScale CDC system stores both the before and after images of the modified row. This
allows easy comparisons between new and old values.
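
Since the before and after images share a sequence number (6 in the sample output above), pairing them up is a single self-join; a hedged sketch for this example:

SELECT b.id AS id_before, a.id AS id_after
FROM test.t1 b
JOIN test.t1 a ON a.sequence = b.sequence
WHERE b.event_type = 'update_before'
  AND a.event_type = 'update_after';

On the sample data this returns one row, id_before = 2 and id_after = 4, matching the UPDATE statement.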

The final SQL statement was `DELETE FROM t1 WHERE id = 3;` which deleted one row. This statement was
converted to a delete entry with the data that was deleted (row with `id` of 3). This allows deleted
data to be retained for analytical and auditing purposes without actually storing it on the master
database.

In this blog post, we looked at how to configure Change Data Capture from the MariaDB Server to
MariaDB ColumnStore via MariaDB MaxScale, so that our analytical ColumnStore instance reflects the
changes that happen on our operational MariaDB Server.


How SendGrid Ships Better Code Faster with VividCortex

VividCortex CEO Baron Schwartz and SendGrid DBA Silvia Botros teamed up to discuss how performance monitoring leads to better, faster code deployment. This 30-minute webinar covers: 

  • How developers can deploy code faster and more safely.
  • A close-up view of a health- and monitoring-focused work environment.
  • How database monitoring fits into a culture of DevOps and lean, agile development. 
"Now, with VividCortex, whenever we have an issue that's impacting the mail processing throughput, we can very quickly go and answer, "What was running at that time? What was the most expensive query? What was taking up all the load?" Within an hour, we can typically figure out exactly which JIRA WAR to go to."  —Silvia Botros, Lead MySQL DBA

Take a look:

[Embedded webinar video]

Compiling ProxySQL 2.0.0

ProxySQL 2.0 is nearing GA, and a lot of folks in the MySQL world are eager to check out many of the new features such as GTID-consistent reads, frontend SSL and Galera native support. Packages will of course be released for all the supported Linux operating systems on the release date; until then, the only way to get up and running is to compile the source code.

There are two ways of going about this:

  • Compile within Docker to create an installable package for your Linux distribution (already documented in the GitHub wiki)
  • Compile directly on your operating system to generate the proxysql executable file

In this article we'll go through the steps for compiling on your operating system (officially only Linux is supported, although MacOS and FreeBSD are largely compatible). First we'll need to install some prerequisites, which differ slightly depending on the family your Linux distribution belongs to:

# RHEL / CentOS Based:
yum install automake bzip2 make g++ gcc git openssl openssl-devel patch

# Debian / Ubuntu Based:
apt-get install automake bzip2 make g++ gcc git openssl libssl-dev patch

The cmake package was omitted intentionally, as the version packaged with current operating systems is a bit older than we'd like; make sure to remove the default package by running apt-get remove cmake or yum remove cmake. Now we need to compile and install version 3.9.2 (the verified version, although any version newer than 3.5 should work):

cd /tmp
wget https://cmake.org/files/v3.9/cmake-3.9.2.tar.gz
tar -zxf cmake-3.9.2.tar.gz
cd cmake-3.9.2
./configure
gmake
gmake install

Now that all pre-requisites are installed we can proceed with the usual compilation process. The source code can be downloaded from here or alternatively by cloning the repository via Github:

git clone -b v2.0.0 https://github.com/sysown/proxysql.git
cd proxysql
make

Once the process completes the executable will be written to src/proxysql. You can also install it by running make install, although this is not recommended: installed versions should come from a package.

Enjoy your new installation of ProxySQL 2.0 and its many features! Looking forward to feedback and suggestions via our GitHub Issue Tracker!

Authored by: Nick Vyzas

This Week in Data with Colin Charles 39: a valuable time spent at rootconf.in

Join Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.

rootconf.in 2018 just ended, and it was very enjoyable to be in Bangalore for the conference. The audience was large, the conversations were great, and overall I think this is a rather important conference if you’re into the “DevOps” movement (or are a sysadmin!). From the data store world, Oracle MySQL was a sponsor, as was MyDBOPS (blog), and Elastic. There were plenty more, including Digital Ocean/GoJek/Walmart Labs — many MySQL users.

I took a handful of pictures with people, and here are some of the MyDBOPS team and myself. They have over 20 employees, and serve the Indian market at rates more palatable than straight-up USD rates. Traveling through Asia, I see that many businesses find local partners and offer local pricing; this becomes more complex in the SaaS space (everyone generally pays the same rate) and also in the services space.

Some of the Oracle MySQL team who were exhibiting were very happy they got a good amount of traffic to the booth based on stuff discussed at the talk and BOF.

From a talk standpoint, I did a keynote for an hour and also a BoF session for another hour (great discussion, lots of blog post ideas from there), and we had a Q&A session for about 15 minutes. There were plenty of good conversations in the hallway track.

A quick observation that I notice happens everywhere: many people don’t realize features have existed in MySQL since 5.6/5.7, so they are truly surprised by stuff in 8.0 as well. It is clear there is a huge market that would thrive around education: not just feature checklists, but also how to use features. Sometimes this feels like the MySQL of the mid-2000s; getting apps to actually use new features would be a great thing.

Releases

This seems to have been a quiet week on the releases front.

Are you a user of Amazon Aurora MySQL? There is now the Amazon Aurora Backtrack feature, which allows you to go back in time. It is described as working like this:

Aurora uses a distributed, log-structured storage system (read Design Considerations for High Throughput Cloud-Native Relational Databases to learn a lot more); each change to your database generates a new log record, identified by a Log Sequence Number (LSN). Enabling the backtrack feature provisions a FIFO buffer in the cluster for storage of LSNs. This allows for quick access and recovery times measured in seconds.

Feedback

I look forward to feedback/tips via e-mail at colin.charles@percona.com or on Twitter @bytebot.

 


My MySQL Database is Corrupted... What Do I Do Now?

How do MySQL tables get corrupted? There are many ways to spoil data files. Often, corruption is due to defects in the underlying platform that MySQL relies on to store and retrieve data: the disk subsystem, controllers, communication channels, drivers, firmware or other hardware faults. Data corruption can also occur if the MySQL server daemon restarts suddenly, or your server reboots due to a crash of other OS components. If the database instance was in the middle of writing data to disk, it could write the data partially, which may end up with a page checksum that is different than expected. There have also been bugs in MySQL, so even if the server hardware is OK, MySQL itself can cause corruption.

Usually when MySQL data gets corrupted, the recommendation is to restore it from the last backup, switch to a DR server, or take down the affected node if you have a Galera cluster, so data is served immediately from the other nodes. In some cases you can't: the backup is not there, the cluster was never set up, your replication has been down for a very long time, or the DR procedure was never tested. Even if you have a backup, you may still want to take some actions to attempt recovery, as it may take less time to get back online.

MyISAM, the bad and ugly

InnoDB is more fault-tolerant than MyISAM. InnoDB has auto-recovery features and is much safer compared to the older MyISAM engine.

MyISAM tables can easily get corrupted when many writes and many locks happen on a table. The storage engine "writes" data to the filesystem cache, which may take some time before it is flushed to disk. Therefore, if your server restarts suddenly, some unknown amount of data in the cache is lost. That's a usual way for MyISAM data to be corrupted. The recommendation is to migrate from MyISAM to InnoDB (a one-liner per table, sketched below), but there may be cases where this is not possible.
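
A hedged sketch of the migration, per table (mydb.mytable is a placeholder; check that the table is healthy first, and remember InnoDB needs appropriately sized buffer pool and log settings):

ALTER TABLE mydb.mytable ENGINE=InnoDB;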

Primum non nocere, the backup

Before you attempt to repair corrupted tables, you should back up your database files first. Yes, it’s already broken, but this is to minimize the risk of possible further damage caused by a recovery operation. There is no guarantee that any action you take will not harm untouched data blocks. Forcing InnoDB recovery with values greater than 4 can corrupt data files, so make sure you do it with a prior backup and ideally on a separate physical copy of the database.

To back up all of the files from all of your databases, follow these steps:

Stop the MySQL server

service mysqld stop

Type the following command to copy your datadir.

cp -r /var/lib/mysql /var/lib/mysql_bkp

After we have a backup copy of the data directory, we are ready to start troubleshooting.

Data corruption identification

The error log is your best friend. Usually, when data corruption happens, you will find relevant information (including links to documentation) in the error log. If you don't know where it's located, check my.cnf for the log_error variable, or ask a running server directly (see below); for more details check this article: https://dev.mysql.com/doc/refman/8.0/en/error-log-destination-configuration.html.
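
A quick way to ask the server itself for the error log location:

SHOW GLOBAL VARIABLES LIKE 'log_error';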

You should also know your storage engine type; you can find this information in the error log or in information_schema:

mysql> select table_name, engine from information_schema.tables
    -> where table_name = '<TABLE>' and table_schema = '<DATABASE>';

The main tools/commands to diagnose issues with data corruption are CHECK TABLE, REPAIR TABLE, and myisamchk. The mysqlcheck client performs table maintenance: It checks, repairs (MyISAM), optimizes or analyzes tables while MySQL is running.
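
For a quick one-off from the mysql client, the SQL statements can be used directly; for example, against the employees schema used below:

CHECK TABLE employees.salaries;   -- works for both InnoDB and MyISAM
REPAIR TABLE employees.salaries;  -- MyISAM (and a few other engines) only

To check everything in a database with mysqlcheck: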

mysqlcheck -uroot -p <DATABASE>

Replace DATABASE with the name of the database, and replace TABLE with the name of the table that you want to check:

mysqlcheck -uroot -p <DATABASE> <TABLE>

mysqlcheck checks the specified database and tables. If a table passes the check, mysqlcheck displays OK for the table.

employees.departments                              OK
employees.dept_emp                                 OK
employees.dept_manager                             OK
employees.employees                                OK
employees.salaries
Warning  : Tablespace is missing for table 'employees/salaries'
Error    : Table 'employees.salaries' doesn't exist in engine
status   : Operation failed
employees.titles                                   OK

Data corruption issues may also be related to permission issues. In some cases, the OS can switch the mount point to read-only mode due to R/W issues, or this can be caused by a user who accidentally changed ownership of the data files. In such cases, you will find relevant information in the error log.

[root@node1 employees]# ls -rtla
...
-rw-rw----. 1 mysql mysql  28311552 05-10 06:24 titles.ibd
-rw-r-----. 1 root  root  109051904 05-10 07:09 salaries.ibd
drwxr-xr-x. 7 mysql mysql      4096 05-10 07:12 ..
drwx------. 2 mysql mysql      4096 05-10 07:17 .

MySQL Client

MariaDB [employees]> select count(*) from salaries;
ERROR 1932 (42S02): Table 'employees.salaries' doesn't exist in engine

Error log entry

2018-05-10 9:15:38 140703666226944 [ERROR] InnoDB: Failed to find tablespace for table `employees`.`salaries` in the cache. Attempting to load the tablespace with space id 9
2018-05-10 9:15:38 140703666226944 [ERROR] InnoDB: Operating system error number 13 in a file operation.
2018-05-10 9:15:38 140703666226944 [ERROR] InnoDB: The error means mysqld does not have the access rights to the directory.
2018-05-10 9:15:38 140703666226944 [ERROR] InnoDB: Cannot open datafile for read-only: './employees/salaries.ibd' OS error: 81
2018-05-10 9:15:38 140703666226944 [ERROR] InnoDB: Operating system error number 13 in a file operation.
2018-05-10 9:15:38 140703666226944 [ERROR] InnoDB: The error means mysqld does not have the access rights to the directory.
2018-05-10 9:15:38 140703666226944 [ERROR] InnoDB: Could not find a valid tablespace file for `employees/salaries`. Please refer to http://dev.mysql.com/doc/refman/5.7/en/innodb-troubleshooting-datadict.html for how to resolve the issue.

Recovering InnoDB table

If you are using the InnoDB storage engine for a database table, you can run the InnoDB recovery process.
To enable auto recovery, MySQL needs the innodb_force_recovery option. innodb_force_recovery forces InnoDB to start up while preventing background operations from running, so that you can dump your tables.

To do this open my.cnf and add the following line to the [mysqld] section:

[mysqld]
innodb_force_recovery=1

You should start with innodb_force_recovery=1. Save the changes to the my.cnf file, and then restart the MySQL server using the appropriate command for your operating system:

service mysql restart

If you are able to dump your tables with an innodb_force_recovery value of 3 or less, then you are relatively safe. In many cases you will have to go up to 4, and as you already know, that can corrupt data. If needed, go higher still; six is the maximum and most dangerous.

Once you are able to start your database, type the following command to export all of the databases to the dump.sql file:

mysqldump --all-databases --add-drop-database --add-drop-table > dump.sql

Start mysql, and then try to drop the affected database or databases using the DROP DATABASE command. If MySQL is unable to drop a database, you can delete it manually using the steps below after you stop the MySQL server.

service mysqld stop

If you were unable to drop a database, type the following commands to delete it manually.

cd /var/lib/mysql
rm -rf <DATABASE>

Make sure you do not delete the internal database directories.
After you are done, comment out the following line in the [mysqld] section to disable InnoDB recovery mode.

#innodb_force_recovery=...

Save the changes to the my.cnf file, and then start the MySQL server

service mysqld start

Type the following command to restore the databases from the dump file you created earlier:

mysql> tee import_database.log
mysql> source dump.sql

Repairing MyISAM

If mysqlcheck reports an error for a table, type the mysqlcheck command with the --repair flag to fix it. The mysqlcheck repair option works while the server is up and running.

mysqlcheck -uroot -p -r <DATABASE> <TABLE>

If the server is down and for any reason mysqlcheck cannot repair your table, you still have an option to perform recovery directly on files using myisamchk. With myisamchk, you need to make sure that the server doesn't have the tables open.

Stop the MySQL server.

service mysqld stop
cd /var/lib/mysql

Change to the directory where the database is located.

cd /var/lib/mysql/employees
myisamchk <TABLE>

To check all of the tables in a database, type the following command:

myisamchk *.MYI

If the previous command does not work, you can try deleting temporary files that may be preventing myisamchk from running correctly. To do this, change back to the data directory, and then run the following command:

ls */*.TMD

If there are any .TMD files listed, delete them:

rm */*.TMD

Then re-run myisamchk.

To attempt to repair a table, execute the following command, replacing TABLE with the name of the table that you want to repair:

myisamchk --recover <TABLE>

Restart the MySQL server.

service mysqld start

How to avoid data loss

There are several things you can do to minimize the risk of unrecoverable data. First of all: backups. The problem with backups is that sometimes they can be overlooked. For cron-scheduled backups, we usually write wrapper scripts that detect problems in the backup log, but that does not cover cases where the backup didn't start at all. Cron can sometimes hang, and often there is no monitoring set on it. Another potential issue is when the backup was never set up in the first place. Good practice is to run reports from a separate tool that analyzes the backup status and informs you about missing backup schedules. You can use ClusterControl for that or write your own programs.

ClusterControl operational backup report

To reduce the impact of possible data corruption, you should always consider clustered systems. It's just a matter of time before a database crashes or gets corrupted, so it's good to have a copy you can switch to. It could be master/slave replication. The important aspect here is to have safe automatic recovery, to minimize the complexity of the switchover and the recovery time (RTO).

ClusterControl auto recovery features

Tags:  MySQL mariadb data corruption corrupted data

Fulfilled Tablespace Encryption (TDE) in Percona Cluster

Encryption is a very important form of security, and it's becoming a standard from a compliance perspective to ensure PCI, PII and HIPAA compliance. Encryption needs to be performed for data at rest and data over the wire.

Data At Rest:

  • Until 5.6, only disk-level encryption (e.g., Gazzang, LUKS) was available to encrypt the MySQL data files.
  • From 5.7.11, MySQL added a feature for tablespace encryption, which encrypts only the InnoDB tablespaces, not the log files.
  • Percona Server 5.7.20-19 added a feature to encrypt binary logs and relay logs.
  • DBaaS offerings like RDS support TDE (Transparent Data Encryption); it's disk-level encryption through Amazon KMS.
  • On EC2 and Google Cloud servers, it's possible to create the disk as encrypted at creation time and store the MySQL data on that disk.

Data Over Wire:

  • Encryption over the wire (network) can be set up by configuring MySQL with SSL.

Now, coming to the topic: we are about to explain how to encrypt your PXC installation properly to ensure compliance is met.

  • Encrypting Client Traffic / Cluster Communication Using SSL
  • Encrypting Tablespaces With Tablespace Encryption
  • Encrypting SST
  • Encrypting Binary/Relay logs With Percona Server Binlog Encryption

Below is the test environment used for this test case,

Operating System: CentOS 7.4

PXC Version: 5.7.21-20-57-log Percona XtraDB Cluster (GPL)

In this blog, we will not explain the basic installation and configuration of PXC.

Step 1: Generating SSL certificates

By default, during MySQL installation self-signed certificates are created under “/var/lib/mysql”. It's always recommended to keep the certificates below in a separate directory, common to all cluster nodes.

  • ca-key.pem (CA key file)
  • ca.pem (CA certificate file)
  • server-key.pem (server key file)
  • server-cert.pem (server certificate file)

You can copy these files generated during installation to a separate location, or use “mysql_ssl_rsa_setup” to generate them into the specified directory as below; in this demo I have used “/etc/certs”.

# mysql_ssl_rsa_setup -d /etc/certs
Generating a 2048 bit RSA private key
.................+++
.....+++
writing new private key to 'ca-key.pem'
Generating a 2048 bit RSA private key
....................................................................................................................................+++
................+++
writing new private key to 'server-key.pem'
Generating a 2048 bit RSA private key
.............................+++
.........................................................+++
writing new private key to 'client-key.pem'

Step 2: Enabling variables for encryption

Encrypting Tablespace:

Tablespace encryption is done using the keyring plugin; make sure to have an individual tablespace for each table (“innodb_file_per_table=1“). You can refer here for TDE with a detailed explanation.
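
Once the keyring configuration below is loaded and the server restarted, you can sanity-check the setup; a hedged sketch using MySQL 5.7 syntax (the table name is hypothetical):

CREATE TABLE test.secrets (id INT PRIMARY KEY, payload VARBINARY(255)) ENCRYPTION='Y';

SELECT TABLE_SCHEMA, TABLE_NAME, CREATE_OPTIONS
FROM information_schema.tables
WHERE CREATE_OPTIONS LIKE '%ENCRYPTION%';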

[mysqld]
early-plugin-load = keyring_file.so
keyring-file-data = /var/lib/mysql-keyring/keyring

Encrypting Replication Traffic:

Replication traffic in PXC mainly involves the following:

  • Write set replication, which is the major one
  • IST (incremental state transfer), which copies only the missing transactions from DONOR to JOINER
  • Service messages that ensure all the nodes are synchronised

We can set the variable “pxc-encrypt-cluster-traffic = ON” under the [mysqld] section of the config file to enable encryption of the cluster traffic.

Encrypting SST traffic:

As you are aware, SST refers to State Snapshot Transfer. It's always recommended to use xtrabackup-v2 as the SST method, as it is the safer and faster way to transfer. While doing the SST, the keyring file has to be sent along for decryption, hence provide the keyring file location under [xtrabackup] and also provide the same SSL files under [sst].

[xtrabackup]
keyring-file-data=/var/lib/mysql-keyring/keyring

[sst]
streamfmt = xbstream
encrypt = 4
tmpdir = /tmp
ssl-key=/etc/certs/server-key.pem
ssl-ca=/etc/certs/ca.pem
ssl-cert=/etc/certs/server-cert.pem

 

Encrypting Binlog/Relay log:

As you are aware, the binlog contains the change data (writes), and replication uses binlogs to copy data between master and slave. Anyone having access to a binlog basically has access to your DATA, which is a potential risk, hence it is vital to enable binlog encryption. Binlog encryption is supported from Percona Server for MySQL version 5.7.20-19. Below are the relevant variables to enable binlog/relay log encryption along with the basic binlog variables; binlog encryption also needs the keyring plugin to be enabled.

[mysqld]
encrypt-binlog=1
master_verify_checksum
binlog_checksum

It's mandatory to have the same setting on the slave server, which makes the relay log encrypted as well. A quick check that the option took effect is sketched below.
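
A hedged sketch; encrypt_binlog is the Percona Server variable name corresponding to the encrypt-binlog option above:

SHOW GLOBAL VARIABLES LIKE 'encrypt%';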

Consolidating all the configs above, a working PXC config will look as below.

[mysqld]
port = 3306
socket = /var/lib/mysql/mysql.sock
datadir = /var/lib/mysql
user = mysql
log-error = /var/log/mysqld.err
wsrep_provider = /usr/lib64/galera3/libgalera_smm.so
wsrep_cluster_address = gcomm://192.168.33.11,192.168.33.12,192.168.33.13
wsrep_node_address = 192.168.33.11
wsrep_node_name = v11
wsrep_sst_method = xtrabackup-v2
wsrep_cluster_name = xtradbcluster
wsrep_auto_increment_control = ON
wsrep_sst_auth = sstuser:s3cretPass
innodb_autoinc_lock_mode = 2
server_id = 311

#binlog encryption
log_bin = mysql-bin
encrypt-binlog=1
master_verify_checksum
binlog_checksum=crc32
log_slave_updates

#Cluster encryption
pxc-encrypt-cluster-traffic = ON
early-plugin-load = keyring_file.so
keyring-file-data = /var/lib/mysql-keyring/keyring
ssl-key=/etc/certs/server-key.pem
ssl-ca=/etc/certs/ca.pem
ssl-cert=/etc/certs/server-cert.pem

[sst]
streamfmt = xbstream
encrypt = 4
tmpdir = /tmp
ssl-key=/etc/certs/server-key.pem
ssl-ca=/etc/certs/ca.pem
ssl-cert=/etc/certs/server-cert.pem

[xtrabackup]
keyring-file-data = /var/lib/mysql-keyring/keyring

Now let's proceed to start the nodes. The first node will be bootstrapped; make sure to have the cluster address as “wsrep_cluster_address=gcomm://”.

Below is the log from the first node, trimmed to the SSL-related settings.

2018-04-24T17:59:48.485704Z 0 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 192.168.33.11; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 10; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 4; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.freeze_purge_at_seqno = -1; gcache.keep_pages_count = 0; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = yes; gcache.size = 8G; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 800; gcs.fc_master_slave = YES; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.npvo = false; pc.recovery = 1; pc.version = 0; pc.wait_prim = true; pc.wait_prim_timeout = PT30S; pc.weight = 1; protonet.backend = asio; protonet.version = 0; repl.causal_read_timeout = PT30S; repl.commit_order = 3; repl.key_format = FLAT8; repl.max_ws_size = 2147483647; repl.proto_max = 8; socket.checksum = 2; socket.recv_buf_size = 212992; socket.ssl_ca = /etc/certs/ca.pem; socket.ssl_cert = /etc/certs/server-cert.pem; socket.ssl_cipher = AES128-SHA; socket.ssl_compression = YES; socket.ssl_key = /etc/certs/server-key.pem;
2018-04-24T17:59:48.619657Z 0 [Note] WSREP: Assign initial position for certification: 3, protocol version: -1
2018-04-24T17:59:48.619875Z 0 [Note] WSREP: Preparing to initiate SST/IST
2018-04-24T17:59:48.619910Z 0 [Note] WSREP: Starting replication
2018-04-24T17:59:48.619946Z 0 [Note] WSREP: Setting initial position to 300efff9-431e-11e8-98e0-e27f04800ef7:3
2018-04-24T17:59:48.620379Z 0 [Note] WSREP: Using CRC-32C for message checksums.
2018-04-24T17:59:48.620489Z 0 [Note] WSREP: initializing ssl context
2018-04-24T17:59:48.621385Z 0 [Note] WSREP: gcomm thread scheduling priority set to other:0
2018-04-24T17:59:48.621800Z 0 [Warning] WSREP: Fail to access the file (/var/lib/mysql//gvwstate.dat) error (No such file or directory). It is possible if node is booting for first time or re-booting after a graceful shutdown
2018-04-24T17:59:48.621810Z 0 [Note] WSREP: Restoring primary-component from disk failed. Either node is booting for first time or re-booting after a graceful shutdown
2018-04-24T17:59:48.622330Z 0 [Note] WSREP: GMCast version 0
2018-04-24T17:59:48.622647Z 0 [Note] WSREP: (476f28e5, 'ssl://0.0.0.0:4567') listening at ssl://0.0.0.0:4567
2018-04-24T17:59:48.622657Z 0 [Note] WSREP: (476f28e5, 'ssl://0.0.0.0:4567') multicast: , ttl: 1
2018-04-24T17:59:48.623454Z 0 [Note] WSREP: EVS version 0
2018-04-24T17:59:48.623800Z 0 [Note] WSREP: gcomm: connecting to group 'xtradbcluster', peer ''
2018-04-24T17:59:48.623858Z 0 [Note] WSREP: start_prim is enabled, turn off pc_recovery
2018-04-24T17:59:48.626063Z 0 [Note] WSREP: Node 476f28e5 state primary
2018-04-24T17:59:48.626176Z 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(PRIM,476f28e5,1)
memb {
    476f28e5,0
}

Now let's proceed to join the next node. Please find the logs below (DONOR → JOINER).

2018-04-24T18:09:50.293384Z 0 [Note] WSREP: (ae0ecc92, 'ssl://0.0.0.0:4567') listening at ssl://0.0.0.0:4567
2018-04-24T18:09:50.293395Z 0 [Note] WSREP: (ae0ecc92, 'ssl://0.0.0.0:4567') multicast: , ttl: 1
2018-04-24T18:09:50.294414Z 0 [Note] WSREP: EVS version 0
2018-04-24T18:09:50.294860Z 0 [Note] WSREP: gcomm: connecting to group 'xtradbcluster', peer '192.168.33.11:,192.168.33.12:'
2018-04-24T18:09:50.312625Z 0 [Note] WSREP: SSL handshake successful, remote endpoint ssl://192.168.33.12:41398 local endpoint ssl://192.168.33.12:4567 cipher: AES128-SHA compression: none
2018-04-24T18:09:50.313114Z 0 [Note] WSREP: SSL handshake successful, remote endpoint ssl://192.168.33.12:4567 local endpoint ssl://192.168.33.12:41398 cipher: AES128-SHA compression: none
2018-04-24T18:09:50.313615Z 0 [Note] WSREP: (ae0ecc92, 'ssl://0.0.0.0:4567') connection established to ae0ecc92 ssl://192.168.33.12:4567
2018-04-24T18:09:50.313636Z 0 [Warning] WSREP: (ae0ecc92, 'ssl://0.0.0.0:4567') address 'ssl://192.168.33.12:4567' points to own listening address, blacklisting
2018-04-24T18:09:50.314606Z 0 [Note] WSREP: SSL handshake successful, remote endpoint ssl://192.168.33.11:4567 local endpoint ssl://192.168.33.12:35336 cipher: AES128-SHA compression: none
2018-04-24T18:09:51.305207Z 0 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 3)
2018-04-24T18:09:51.306246Z 2 [Note] WSREP: State transfer required:
    Group state: 300efff9-431e-11e8-98e0-e27f04800ef7:3
    Local state: 00000000-0000-0000-0000-000000000000:-1
2018-04-24T18:09:51.306312Z 2 [Note] WSREP: New cluster view: global state: 300efff9-431e-11e8-98e0-e27f04800ef7:3, view# 2: Primary, number of nodes: 2, my index: 1, protocol version 3
2018-04-24T18:09:51.306344Z 2 [Note] WSREP: Setting wsrep_ready to true
2018-04-24T18:09:51.306429Z 2 [Warning] WSREP: Gap in state sequence. Need state transfer.
2018-04-24T18:09:51.306433Z 2 [Note] WSREP: Setting wsrep_ready to false
2018-04-24T18:09:51.307298Z 0 [Note] WSREP: Initiating SST/IST transfer on JOINER side (wsrep_sst_xtrabackup-v2 --role 'joiner' --address '192.168.33.12' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '5788' '' )
2018-04-24T18:09:53.810961Z 0 [Note] WSREP: (ae0ecc92, 'ssl://0.0.0.0:4567') turning message relay requesting off
2018-04-24T18:09:57.217241Z 2 [Note] WSREP: Prepared SST/IST request: xtrabackup-v2|192.168.33.12:4444/xtrabackup_sst//1
2018-04-24T18:09:57.217414Z 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-04-24T18:09:57.217611Z 2 [Note] WSREP: REPL Protocols: 8 (3, 2)
2018-04-24T18:09:57.217739Z 2 [Note] WSREP: Assign initial position for certification: 3, protocol version: 3
2018-04-24T18:09:57.217901Z 0 [Note] WSREP: Service thread queue flushed.
2018-04-24T18:09:57.217974Z 2 [Note] WSREP: Check if state gap can be serviced using IST
2018-04-24T18:09:57.218020Z 2 [Note] WSREP: Local UUID: 00000000-0000-0000-0000-000000000000 != Group UUID: 300efff9-431e-11e8-98e0-e27f04800ef7
2018-04-24T18:09:57.218477Z 2 [Note] WSREP: State gap can't be serviced using IST. Switching to SST
2018-04-24T18:09:57.218549Z 2 [Note] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (300efff9-431e-11e8-98e0-e27f04800ef7): 1 (Operation not permitted) at galera/src/replicator_str.cpp:prepare_for_IST():538. IST will be unavailable.
2018-04-24T18:09:57.220866Z 0 [Note] WSREP: Member 1.0 (v12) requested state transfer from '*any*'. Selected 0.0 (v11)(SYNCED) as donor.
2018-04-24T18:09:57.221004Z 0 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 3)
2018-04-24T18:09:57.221495Z 2 [Note] WSREP: Requesting state transfer: success, donor: 0
2018-04-24T18:09:57.221534Z 2 [Note] WSREP: GCache history reset: 00000000-0000-0000-0000-000000000000:0 -> 300efff9-431e-11e8-98e0-e27f04800ef7:3
2018-04-24T18:10:06.909709Z WSREP_SST: [INFO] donor keyring received at: '/var/lib/mysql-keyring/donor-keyring'
2018-04-24T18:10:07.061101Z WSREP_SST: [INFO] Proceeding with SST.........
2018-04-24T18:10:07.350117Z WSREP_SST: [INFO] ............Waiting for SST streaming to complete!
2018-04-24T18:10:24.368719Z 0 [Note] WSREP: 0.0 (v11): State transfer to 1.0 (v12) complete.
2018-04-24T18:10:24.370118Z 0 [Note] WSREP: Member 0.0 (v11) synced with group.
2018-04-24T18:10:24.474538Z WSREP_SST: [INFO] Preparing the backup at /tmp/pxc_sst_VRBd/sst_Teu6
2018-04-24T18:10:33.692709Z WSREP_SST: [INFO] Moving the backup to /var/lib/mysql/
2018-04-24T18:10:33.891894Z WSREP_SST: [INFO] Moving sst keyring into place: moving /var/lib/mysql-keyring/donor-keyring to /var/lib/mysql-keyring/keyring
2018-04-24T18:10:33.939633Z WSREP_SST: [INFO] Galera co-ords from recovery: 300efff9-431e-11e8-98e0-e27f04800ef7:3
2018-04-24T18:10:34.021154Z 0 [Note] WSREP: SST complete, seqno: 3

From the above logs, it has undergone SST successfully with encryption and synced with the cluster.

In a future version of PXC (based on MySQL 8.0), we may get undo log and redo log encryption as well, which will make most data files at rest encrypted.

Deploying Cloud Databases with ClusterControl 1.6

ClusterControl 1.6 comes with tighter integration with AWS, Azure and Google Cloud, so it is now possible to launch new instances and deploy MySQL, MariaDB, MongoDB and PostgreSQL directly from the ClusterControl user interface. In this blog, we will show you how to deploy a cluster on Amazon Web Services.

Note that this new feature requires two modules called clustercontrol-cloud and clustercontrol-clud. The former is a helper daemon which extends CMON's cloud communication capabilities, while the latter is a file manager client to upload and download files on cloud instances. Both packages are dependencies of the clustercontrol UI package, and will be installed automatically if they do not exist. See the Components documentation page for details.

Cloud Credentials

ClusterControl allows you to store and manage your cloud credentials under Integrations (side menu) -> Cloud Providers:

The supported cloud platforms in this release are Amazon Web Services, Google Cloud Platform and Microsoft Azure. On this page, you can add new cloud credentials, manage existing ones and also connect to your cloud platform to manage resources.

The credentials that have been set up here can be used to:

  • Manage cloud resources
  • Deploy databases in the cloud
  • Upload backup to cloud storage

The following is what you would see if you clicked on the "Manage AWS" button:

You can perform simple management tasks on your cloud instances. You can also check the VPC settings under "AWS VPC" tab, as shown in the following screenshot:

The above features are useful as reference, especially when preparing your cloud instances before you start the database deployments.

Database Deployment on Cloud

In previous versions of ClusterControl, database deployment on cloud would be treated similarly to deployment on standard hosts, where you had to create the cloud instances beforehand and then supply the instance details and credentials in the "Deploy Database Cluster" wizard. The deployment procedure was unaware of any extra functionality and flexibility in the cloud environment, like dynamic IP and hostname allocation, NAT-ed public IP address, storage elasticity, virtual private cloud network configuration and so on.

With version 1.6, you just need to supply the cloud credentials, which can be managed via the "Cloud Providers" interface and follow the "Deploy in the Cloud" deployment wizard. From ClusterControl UI, click Deploy and you will be presented with the following options:

At the moment, the supported cloud providers are the three big players: Amazon Web Services (AWS), Google Cloud and Microsoft Azure. We are going to integrate more providers in future releases.

In the first page, you will be presented with the Cluster Details options:

In this section, you would need to select the supported cluster type, MySQL Galera Cluster, MongoDB Replica Set or PostgreSQL Streaming Replication. The next step is to choose the supported vendor for the selected cluster type. At the moment, the following vendors and versions are supported:

  • MySQL Galera Cluster - Percona XtraDB Cluster 5.7, MariaDB 10.2
  • MongoDB Cluster - MongoDB 3.4 by MongoDB, Inc and Percona Server for MongoDB 3.4 by Percona (replica set only).
  • PostgreSQL Cluster - PostgreSQL 10.0 (streaming replication only).

In the next step, you will be presented with the following dialog:

Here you can configure the selected cluster type accordingly. Pick the number of nodes. The Cluster Name will be used as the instance tag, so you can easily recognize this deployment in your cloud provider dashboard. No space is allowed in the cluster name. My.cnf Template is the template configuration file that ClusterControl will use to deploy the cluster. It must be located under /usr/share/cmon/templates on the ClusterControl host. The rest of the fields are pretty self-explanatory.

The next dialog is to select the cloud credentials:

You can choose the existing cloud credentials or create a new one by clicking on the "Add New Credential" button. The next step is to choose the virtual machine configuration:

Most of the settings in this step are dynamically populated from the cloud provider by the chosen credentials. You can configure the operating system, instance size, VPC setting, storage type and size and also specify the SSH key location on the ClusterControl host. You can also let ClusterControl generate a new key specifically for these instances. When clicking on "Add New" button next to Virtual Private Cloud, you will be presented with a form to create a new VPC:

VPC is a logical network infrastructure you have within your cloud platform. You can configure your VPC by modifying its IP address range, create subnets, configure route tables, network gateways, and security settings. It's recommended to deploy your database infrastructure in this network for isolation, security and routing control.

When creating a new VPC, specify the VPC name and IPv4 address block with subnet. Then, choose whether IPv6 should be part of the network and the tenancy option. You can then use this virtual network for your database infrastructure.


The last step is the deployment summary:

In this stage, you need to choose which subnet under the chosen virtual network that you want the database to be running on. Take note that the chosen subnet MUST have auto-assign public IPv4 address enabled. You can also create a new subnet under this VPC by clicking on "Add New Subnet" button. Verify if everything is correct and hit the "Deploy Cluster" button to start the deployment.

You can then monitor the progress by clicking on the Activity -> Jobs -> Create Cluster -> Full Job Details:

Depending on the connections, it could take 10 to 20 minutes to complete. Once done, you will see a new database cluster listed under the ClusterControl dashboard. For PostgreSQL streaming replication cluster, you might need to know the master and slave IP addresses once the deployment completes. Simply go to Nodes tab and you would see the public and private IP addresses on the node list on the left:

Your database cluster is now deployed and running on AWS.

At the moment, scaling up works similarly to the standard host setup: you need to create a cloud instance manually beforehand and specify the host under ClusterControl -> pick the cluster -> Add Node.

Under the hood, the deployment process does the following:

  1. Create cloud instances
  2. Configure security groups and networking
  3. Verify the SSH connectivity from ClusterControl to all created instances
  4. Deploy database on every instance
  5. Configure the clustering or replication links
  6. Register the deployment into ClusterControl

Take note that this feature is still in beta. Nevertheless, you can use this feature to speed up your development and testing environment by controlling and managing the database cluster in different cloud providers from a single user interface.

Database Backup on Cloud

This feature has been around since ClusterControl 1.5.0, and now we added support for Azure Cloud Storage. This means that you can now upload and download the created backup on all three major cloud providers (AWS, GCP and Azure). The upload process happens right after the backup is successfully created (if you toggle "Upload Backup to the Cloud") or you can manually click on the cloud icon button of the backup list:


You can then download and restore backups from the cloud, in case you lost your local backup storage, or if you need to reduce local disk space usage for your backups.

Current Limitations

There are some known limitations for the cloud deployment feature, as stated below:

  • There is currently no 'accounting' in place for the cloud instances. You will need to manually remove the cloud instances if you remove a database cluster.
  • You cannot add or remove a node automatically with cloud instances.
  • You cannot deploy a load balancer automatically with a cloud instance.

We have extensively tested the feature in many environments and setups, but there are always corner cases that we might have missed. For more information, please take a look at the change log.

Happy clustering in the cloud!

Tags:  cloud AWS gcp MySQL MariaDB galera MongoDB PostgreSQL azure

Moving a MariaDB Database to Encrypted and Unencrypted States

janlindstrom Thu, 05/10/2018 - 02:28

In this blog, we present a way to move an existing database first to an encrypted state, and then how to move your database back to an unencrypted state.

In order to use encryption, you need to load a plugin to manage the encryption keys. See the currently supported encryption plugins. Each key uses a 32-bit integer as a key identifier (key_id) plus the actual key. Keys can be versioned, so that data is re-encrypted from an older key version to a newer one. In this blog, we will use the file key management plugin as an example (see encryption key management). We also assume that you are using the most recent version of MariaDB Server (this blog assumes that MDEV-15566 is fixed, i.e. MariaDB version 10.1.33, 10.2.15 or 10.3.6).

Moving a database to an encrypted or unencrypted state is done using key rotation. Key rotation moves the database from one encryption state to another: a tablespace may start out unencrypted and be moved to an encrypted state, or it may be encrypted and be moved to an unencrypted state. Key rotation can happen periodically (based on the configuration variable innodb-encryption-rotate-key-age, i.e. how old a key can be before it is rotated), on request by the database administrator (e.g. by issuing set global innodb_encrypt_tables=ON;), or by the encryption key management system (see e.g. rotate keys).

Database administrators need to decide whether it is enough to encrypt only individual tables (see encrypting data for InnoDB) or the whole database, including the system tablespace. Note that table data is also written to the redo log and undo log. Thus, if the database contains tables with very sensitive data, innodb-encrypt-log should also be enabled. In this blog, we show how to encrypt the whole database; a sketch of the per-table alternative follows below.
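For comparison, a minimal sketch of the per-table alternative (the table and column names are purely illustrative; ENCRYPTION_KEY_ID refers to a key_id defined in the key file):

-- Encrypt a single table instead of the whole instance
CREATE TABLE customer_secrets (
  id INT PRIMARY KEY,
  card_number VARBINARY(64)
) ENGINE=InnoDB ENCRYPTED=YES ENCRYPTION_KEY_ID=2;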

Moving database to encrypted state

Before the database can be moved to an encrypted state, we need to add the encryption plugin configuration to the config file (see the detailed description of the parameters):

# File Key Management
plugin-load-add = file_key_management
file-key-management-filename = /mnt/flash/keys.txt
file-key-management-encryption-algorithm = aes_ctr

# InnoDB encryption setup
innodb-encrypt-tables=ON
innodb-encrypt-log=ON
innodb-encryption-rotate-key-age=1024
innodb-encryption-threads=4
innodb-tablespaces-encryption

After a restart, the progress of the encryption operation can be monitored from the INFORMATION_SCHEMA.INNODB_TABLESPACES_ENCRYPTION table. In the following example, we query the name of the tablespace, the current page under key rotation, and the maximum page in the tablespace, for those tables that are not yet encrypted:

MariaDB [(none)]> select name, KEY_ROTATION_PAGE_NUMBER, KEY_ROTATION_MAX_PAGE_NUMBER from information_schema.innodb_tablespaces_encryption where min_key_version = 0 or ROTATING_OR_FLUSHING = 1;
+---------------+--------------------------+------------------------------+
| name          | KEY_ROTATION_PAGE_NUMBER | KEY_ROTATION_MAX_PAGE_NUMBER |
+---------------+--------------------------+------------------------------+
| innodb_system |                    17641 |                      1397504 |
+---------------+--------------------------+------------------------------+
1 row in set (0.000 sec)

Naturally, you may also query the status of all tables:

MariaDB [tpcc1000]> select * from information_schema.innodb_tablespaces_encryption;
+-------+-------------------+-------------------+--------------------+-----------------+---------------------+--------------------------+------------------------------+----------------+----------------------+
| SPACE | NAME              | ENCRYPTION_SCHEME | KEYSERVER_REQUESTS | MIN_KEY_VERSION | CURRENT_KEY_VERSION | KEY_ROTATION_PAGE_NUMBER | KEY_ROTATION_MAX_PAGE_NUMBER | CURRENT_KEY_ID | ROTATING_OR_FLUSHING |
+-------+-------------------+-------------------+--------------------+-----------------+---------------------+--------------------------+------------------------------+----------------+----------------------+
|     0 | innodb_system     |                 1 |                  1 |               1 |                   1 |                     NULL |                         NULL |              1 |                    0 |
|     3 | tpcc1000/customer |                 1 |                  1 |               0 |                   1 |                     2401 |                      1317888 |              1 |                    1 |
+-------+-------------------+-------------------+--------------------+-----------------+---------------------+--------------------------+------------------------------+----------------+----------------------+
2 rows in set (0.000 sec)

From this we can see that the system tablespace is already encrypted, but the table customer from the database tpcc1000 is currently being encrypted. If your system has spare hardware resources and the encryption process seems slow, you may try the following parameters:

# Set close to the number of cores
set global innodb_encryption_threads=16;

# For SSDs, increase the number of I/O operations per second used for encryption
set global innodb_encryption_rotation_iops=40000;

Database encryption is finished when there are no tables in an unencrypted state:

MariaDB [tpcc1000]> select name, KEY_ROTATION_PAGE_NUMBER, KEY_ROTATION_MAX_PAGE_NUMBER from information_schema.innodb_tablespaces_encryption where min_key_version = 0 or ROTATING_OR_FLUSHING = 1;
Empty set (0.001 sec)

And to verify, list all tables that are encrypted:

MariaDB [tpcc1000]> select * from information_schema.innodb_tablespaces_encryption where min_key_version != 0;
+-------+---------------------+-------------------+--------------------+-----------------+---------------------+--------------------------+------------------------------+----------------+----------------------+
| SPACE | NAME                | ENCRYPTION_SCHEME | KEYSERVER_REQUESTS | MIN_KEY_VERSION | CURRENT_KEY_VERSION | KEY_ROTATION_PAGE_NUMBER | KEY_ROTATION_MAX_PAGE_NUMBER | CURRENT_KEY_ID | ROTATING_OR_FLUSHING |
+-------+---------------------+-------------------+--------------------+-----------------+---------------------+--------------------------+------------------------------+----------------+----------------------+
|     0 | innodb_system       |                 1 |                  1 |               1 |                   1 |                     NULL |                         NULL |              1 |                    0 |
|     3 | tpcc1000/customer   |                 1 |                  1 |               1 |                   1 |                     NULL |                         NULL |              1 |                    0 |
|     2 | tpcc1000/district   |                 1 |                  1 |               1 |                   1 |                     NULL |                         NULL |              1 |                    0 |
|     4 | tpcc1000/history    |                 1 |                  1 |               1 |                   1 |                     NULL |                         NULL |              1 |                    0 |
|     8 | tpcc1000/item       |                 1 |                  1 |               1 |                   1 |                     NULL |                         NULL |              1 |                    0 |
|     5 | tpcc1000/new_orders |                 1 |                  1 |               1 |                   1 |                     NULL |                         NULL |              1 |                    0 |
|     7 | tpcc1000/order_line |                 1 |                  1 |               1 |                   1 |                     NULL |                         NULL |              1 |                    0 |
|     6 | tpcc1000/orders     |                 1 |                  1 |               1 |                   1 |                     NULL |                         NULL |              1 |                    0 |
|     9 | tpcc1000/stock      |                 1 |                  1 |               1 |                   1 |                     NULL |                         NULL |              1 |                    0 |
|     1 | tpcc1000/warehouse  |                 1 |                  1 |               1 |                   1 |                     NULL |                         NULL |              1 |                    0 |
+-------+---------------------+-------------------+--------------------+-----------------+---------------------+--------------------------+------------------------------+----------------+----------------------+
10 rows in set (0.000 sec)

As can be seen, all tablespaces use ENCRYPTION_SCHEME=1 (encrypted) and MIN_KEY_VERSION=1. After this phase, the database administrator should consider decreasing the number of encryption threads and the rotation IOPS, as sketched below. Furthermore, the need for further key rotation should also be considered, as the file key management plugin does not support real key rotation. Key rotation can be disabled using innodb-encryption-rotate-key-age=0. Note that even with that setting, all newly created tables are still considered for encryption.
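As a sketch, scaling things back once the initial encryption pass has finished could look like this (the values are illustrative; pick ones that match your hardware and workload):

# Scale back background encryption work after the initial pass
set global innodb_encryption_threads=1;
set global innodb_encryption_rotation_iops=100;

# Disable time-based key rotation
set global innodb_encryption_rotate_key_age=0;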

Moving database to unencrypted state

Here we assume that you have an encrypted database and there is no longer a need to encrypt the data, or data protection is handled differently. We will use the same database as in moving the database to an encrypted state. At this point there is no need to restart the server; moving the database to an unencrypted state can be done as an online operation. First, the database administrator should check that there are no tables using explicit encryption, i.e. no tables created with the ENCRYPTED=YES table option.
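One heuristic way to look for such tables is to search the table options in the information schema (this matches on the textual CREATE_OPTIONS value, so treat it as a quick check rather than an authoritative one):

select table_schema, table_name, create_options
from information_schema.tables
where create_options like '%ENCRYPTED%';

Once no explicitly encrypted tables remain, moving the database to an unencrypted state can be done simply by issuing: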

SET GLOBAL innodb_encrypt_tables=OFF;

This will start decrypting all tablespaces, including the system tablespace. The progress of this operation can be monitored by:

MariaDB [tpcc1000]> select * from information_schema.innodb_tablespaces_encryption where min_key_version != 0;
+-------+---------------------+-------------------+--------------------+-----------------+---------------------+--------------------------+------------------------------+----------------+----------------------+
| SPACE | NAME                | ENCRYPTION_SCHEME | KEYSERVER_REQUESTS | MIN_KEY_VERSION | CURRENT_KEY_VERSION | KEY_ROTATION_PAGE_NUMBER | KEY_ROTATION_MAX_PAGE_NUMBER | CURRENT_KEY_ID | ROTATING_OR_FLUSHING |
+-------+---------------------+-------------------+--------------------+-----------------+---------------------+--------------------------+------------------------------+----------------+----------------------+
|     7 | tpcc1000/order_line |                 1 |                  1 |               1 |                   1 |                    76564 |                      1947904 |              1 |                    1 |
|     6 | tpcc1000/orders     |                 1 |                  1 |               1 |                   1 |                     NULL |                         NULL |              1 |                    0 |
|     9 | tpcc1000/stock      |                 1 |                  1 |               1 |                   1 |                     NULL |                         NULL |              1 |                    0 |
|     1 | tpcc1000/warehouse  |                 1 |                  1 |               1 |                   1 |                     NULL |                         NULL |              1 |                    0 |
|    10 | tpcc1000/t1         |                 1 |                  1 |               1 |                   1 |                     NULL |                         NULL |              1 |                    0 |
+-------+---------------------+-------------------+--------------------+-----------------+---------------------+--------------------------+------------------------------+----------------+----------------------+
5 rows in set (0.001 sec)

From this we can see that the table order_line from the database tpcc1000 is being rotated. The operation is finished when no tables are using encryption, i.e. no tables have min_key_version != 0.

MariaDB [tpcc1000]> select * from information_schema.innodb_tablespaces_encryption where min_key_version != 0 or rotating_or_flushing = 1;
Empty set (0.000 sec)

If the encryption setup needs to be removed from the configuration, now is the time to shut down the server. If the configuration uses redo log encryption, i.e. innodb-encrypt-log=ON, take backups of your database, including the InnoDB log files, and after that remove the InnoDB log files, as they are unusable if they contain encrypted data.

rm -rf ib_logfile*

Remove encryption setup from the configuration and restart the server. Now you have a database instance where no encryption is used.

Conclusion

Moving a database to an encrypted state, as seen above, requires the server to be restarted and careful encryption plugin configuration. How long this operation takes depends on the number of tables and how big they are. We have presented a way to monitor this progress and how to speed it up if the hardware has enough resources. Moving a database to an unencrypted state requires only setting one global variable. However, if encryption is no longer needed and all references to it are to be removed, one restart is needed. We have shown how to monitor this transition and how to fully remove the encryption setup from both the database and the configuration.



MySQL master discovery methods, part 4: Proxy heuristics

Note: the method described here is an anti-pattern

This is the fourth in a series of posts reviewing methods for MySQL master discovery: the means by which an application connects to the master of a replication tree. Moreover, the means by which, upon master failover, it identifies and connects to the newly promoted master.

These posts are not concerned with the manner in which replication failure detection and recovery take place. I will share orchestrator-specific configuration/advice, and point out where a cross-DC orchestrator/raft setup plays a part in discovery itself, but for the most part any recovery tool, such as MHA, replication-manager, severalnines or other, is applicable.

We discuss asynchronous (or semi-synchronous) replication, a classic single-master-multiple-replicas setup. A later post will briefly discuss synchronous replication (Galera/XtraDB Cluster/InnoDB Cluster).

Master discovery via Proxy Heuristics

In Proxy Heuristics all clients connect to the master through a proxy. The proxy observes the backend MySQL servers and determines who the master is.

This setup is simple and easy, but it is an anti-pattern. I recommend against using this method, as explained shortly.

Clients are all configured to connect to, say, cluster1-writer.proxy.example.net:3306. The proxy will intercept incoming requests either based on hostname or by port. It is aware of all/some MySQL backend servers in that cluster, and will route traffic to the master M.

A simple heuristic I've seen in use: pick the server that has read_only=0. A sketch of this check follows below.
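For illustration, this is the kind of probe such a proxy runs against each backend, together with how ProxySQL expresses the same heuristic via replication hostgroups (the hostgroup ids and comment are examples; consult the ProxySQL documentation before copying):

-- run against every backend; the one returning 0 is treated as the master
SELECT @@global.read_only;

-- ProxySQL: backends reporting read_only=0 are placed in the writer hostgroup
INSERT INTO mysql_replication_hostgroups (writer_hostgroup, reader_hostgroup, comment)
VALUES (10, 20, 'cluster1');
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;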

Let's take a look at how this works and what can go wrong.

An unplanned failover: illustration #1

Master M has died; the box had a power failure. R gets promoted in its place. Our recovery tool:

  • Fails over, but doesn't need to run any hooks.

The proxy:

  • Knows both about M and R.
  • Notices M fails health checks (select @@global.read_only returns an error since the box is down).
  • Notices R reports healthy and with read_only=0.
  • Routes all traffic to R.

Success, we're happy.

Configuration tip

With an automated failover solution, use read_only=1 in my.cnf at all times. Only the failover solution will set a server to read_only=0.

With this configuration, when M restarts, MySQL starts up as read_only=1.
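A minimal my.cnf sketch of this tip (the only point is that a restarted server boots read-only):

[mysqld]
read_only = 1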

An unplanned failover: illustration #2

Master M gets network-isolated for 10 seconds, during which time we fail over. R gets promoted. Our tool:

  • Fails over, but doesn't need to run any hooks.

The proxy:

  • Knows both about M and R.
  • Notices M fails health checks (select @@global.read_only returns an error since the box is unreachable).
  • Notices R reports healthy and with read_only=0.
  • Routes all traffic to R.
  • 10 seconds later M comes back to life, claiming read_only=0.
  • The proxy now sees two servers reporting as healthy and with read_only=0.
  • The proxy has no context. It does not know why both are reporting the same. It is unaware of failovers. All it sees is what the backend MySQL servers report.

Therein lies the problem: you cannot trust multiple servers (MySQL backends) to deterministically pick a leader (the master) without them collaborating through some elaborate consensus communication.

An unplanned failover: illustration #3

Master M's box is overloaded, serving too many connections and refusing new incoming ones.

Our tool decides to fail over.

  • And doesn't need to run any hooks.

The proxy:

  • Notices M fails health checks (select @@global.read_only does not respond because of the load).
  • Notices R reports healthy and with read_only=0.
  • Routes all traffic to R.
  • Shortly followed by M recovering (since no more writes are sent its way), claiming read_only=0.
  • The proxy now sees two servers reporting as healthy and with read_only=0.

Again, the proxy has no context, and neither do M and R, for that matter. The context (the fact we failed over from M to R) was known to our failover tool, but was lost along the way.

Planned failover illustration

We wish to replace the master, for maintenance reasons. We successfully and gracefully promote R.

  • M is available and responsive, we set it to read_only=1.
  • We set R to read_only=0.
  • All new connections route to R.
  • We should also instruct our proxy to kill all previous connections to M (a ProxySQL-flavoured sketch follows below).

This works very nicely.
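If the proxy is ProxySQL, one way to express that connection-killing step is to hard-offline the old master, which also terminates its existing connections (hostname and port are placeholders):

UPDATE mysql_servers SET status='OFFLINE_HARD' WHERE hostname='m.example.net' AND port=3306;
LOAD MYSQL SERVERS TO RUNTIME;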

Discussion

There is a substantial risk to this method. A correlation between failover and network partitioning or load issues (illustrations #2 and #3) is quite plausible: the very conditions that trigger a failover are the ones likely to later leave two servers claiming read_only=0.

The root of the problem is that we expect individual servers to resolve conflicts without speaking to each other: we expect the MySQL servers to correctly claim "I'm the master" without context.

We then add to that problem by using the proxy to "pick a side" without giving it any context, either.

Sample orchestrator configuration

By way of discouraging use of this method, I do not present an orchestrator configuration file.

Mydbops Delighted to be part of Root Conf 2018

Rootconf is a community and conference on DevOps and cloud infrastructure, and one of India's well-known conferences on DevOps and IT infrastructure. It is organised by Hasgeek.

This year (2018) it is scheduled for 10 May and 11 May. Mydbops is one of the exhibition sponsors, along with companies like MySQL, Walmart Labs, GO-JEK, Intuit and more.

This year the major themes focus on system security and infrastructure architecture.

Talk on MySQL:

  1. Securing your database server from external attacks – Colin Charles (Percona)

There are many interesting talks too.

Please visit our booth if you are at Rootconf 2018.
