Planet MySQL

What Should I Monitor, and How Should I Do It?

Monitoring tools offer two core types of functionality: alerts based on aliveness checks and comparing metrics to thresholds, and displaying time-series charts of status counters. Nagios + Graphite are the prototypical time-series tools that do these things.

But these tools don’t answer the crucial questions about what we should monitor. What kinds of aliveness/health checks should we build into Nagios? Which metrics should we monitor with thresholds to raise alarms, and what should the thresholds be? What graphs should we build of status counters, which graphs should we examine and what do they mean?

We need guiding principles to help answer these questions. This webinar briefly introduces the principles that motivate and inform what we do at VividCortex, then explains which types of health checks and charts are valuable and what conclusions should be drawn from them. The webinar is focused mostly on MySQL database monitoring, but will be relevant beyond MySQL as well. Some of the questions we answer are:

  • What status counters from MySQL are central and core, and which are peripheral?
  • What is the meaning of MySQL status metrics?
  • Which subsystems inside MySQL are the most common causes of problems in production?
  • What is the unit of work-getting-done in MySQL, and how can you measure it?
  • Which open-source tools do a good job at monitoring in the way we recommend at VividCortex?
  • Which new and/or popular open-source tools should you evaluate when choosing a solution?

You will leave this webinar with a solid understanding of the types of monitoring you should be doing, the low-hanging fruit, and tools for doing it. This is not just a sales pitch for VividCortex. Register below, and we will send you a link to the recording and a copy of the slide deck.

Pic Cred

PlanetMySQL Voting: Vote UP / Vote DOWN

Slave Election is welcoming GTID

Slave election is a popular HA architecture,  first MySQL MariaDB toolkit to manage switchover and failover in a correct way was introduce by Yoshinori Matsunobu into MHA.

Failover and switchover in asynchronous clusters require caution:

- The CAP theorem need to be satisfy. Getting strong consistency, require the slave election to reject transactions ending up in the old master when electing the candidate master.

- Slave election need to take care that all events on the old master are applied to the candidate master before switching roles.

- Should be instrumented to found a good candidate master and make sure it's setup to take the master role.

- Need topology detection, a master role can't be pre defined, as the role is moving around nodes .

- Need monitoring to escalate switchover to failover.

MHA as been coded at a time no unique event id was possible in a cluster,  each event was track as independent coordinate on each node, making MHA architecture to have an internal way to rematch coordinate on all nodes.

With introduction of GTID, MHA brings the heritage and looks like unnecessary complex, with an agent base solution and ssh connections requirement to all nodes .

A lighter MHA was needed for MariaDB when the replication is using GTID, and that's what my colleague Guillame Lefranc have been addressing inside a new MariaDB toolkit

In MariaDB GTID usage is as simple as:

#>stop slave;change master to master_use_gtid=current_pos;start slave; 
As a bonus, the code is in golang and do not require any external dependencies
We can enjoy a singe command line procedure in interactive mode.

mariadb-repmgr -hosts=,, -user=admin:xxxxx -rpluser=repl:xxxxxx -pre-failover-script="/root/" -post-failover-script="/root/" -verbose -maxdelay 15    Don't be afraid default is to run in interactive mode and it does not launch anything yet.

In my post configuration script i usually update some haproxy configuration store in a NAS or a SAN and reload or shoot in the head all proxies

Note that the new elected master will be passed as second argument of the script.

I strongly advice not to try to auto failover base on some monitoring, get a good replication monitoring tool and analyze all master status alerts, checking for false positive situation before enjoying pre coded failover.

Loss less semi-synchronous replication in MDEV-162  and multiple performance improvements of semi-synchronous MDEV-7257, have made it to MariaDB 10.1, it can be use to greatly improve zero data lost in case of failure . Combine with parallel replication it's now possible to have an HA architecture that is as robust as asynchronous can be, and under replication delay control is crash safe as well.    
Galera aka MariaDB Cluster as a write speed limit bound to upper network speed, it come at the advantage to always offer crash safe consistency. Slave election HA have the master disk speed limit and do not suffer lower network speed but is losing consistency in failover when slave can't catch.

Interesting time to see how flash storage adoption flavor one or the other architecture.
PlanetMySQL Voting: Vote UP / Vote DOWN

Deploying Galera Cluster for MySQL using Vagrant

Setting up environments, starting processes, and monitoring these processes on multiple machines can be time consuming and error prone - stale settings from previous test runs, wrong configurations, wrong commands, package conflicts, etc.. quite a few things can go wrong. If you are using Galera Cluster, you would probably want application developers to have a proper development environment on their local computers. Proper here means testing your code on a local Galera Cluster, not on a single instance MySQL. Galera Cluster differs from a single instance MySQL, so this allows you to catch these differences early in the project. But how can you quickly roll out a mini test clusters to your application developers, without having them waste time setting these up? This is where Vagrant comes in.

Vagrant is a system that allows you to easily create and move development environments from one machine to another. Simply define what type of VM you want in a file called Vagrantfile and then fire them up with a single command. It integrates well with virtual machine providers like VirtualBox, VMware and AWS. In this blog, we’ll show you how to expedite the deployment of your development environment using some Vagrant boxes we’ve put together.

Our Vagrantfile deploys 4 instances on VirtualBox platform, three for Galera nodes plus one for ClusterControl. It requires the following Vagrant boxes available on our site:

  • s9s-cc (505 MB) - Ubuntu 14.04.x, ClusterControl 1.2.10
  • s9s-galera (407 MB) - Ubuntu 14.04.x, Percona XtraDB Cluster 5.6

Here are the main steps:

  1. Install Vagrant and Virtualbox
  2. Download the related Vagrant boxes and Vagrantfile
  3. Launch the instances
  4. Bootstrap the Galera cluster
  5. Add the cluster to ClusterControl.

The following architecture diagram shows what you will get once everything is deployed:

Ensure that you have Vagrant and VirtualBox installed. We are not going to cover the installation of these in this blog post.

Deploying the Cluster

1.  Download and install the Vagrant boxes:

$ vagrant box add s9s-cc $ vagrant box add s9s-galera

Make sure you keep the box names s9s-cc and s9s-galera, otherwise you’ll need to change the corresponding values in the Vagrantfile.

2. Create a directory and download the Vagrantfile:

$ mkdir s9s-cc $ cd s9s-cc $ wget

3. Launch 4 instances, each requires 768 MB of memory:

$ vagrant up

4. Verify if all instances are up with:

$ vagrant status

4. SSH to vm2 (n2) and run the script located under s9s directory. This will copy the relevant my.cnf file and bootstrap the Galera cluster:

$ vagrant ssh vm2 vagrant@n2:~$ cd s9s vagrant@n2:~$ ./

5. Execute the same on vm3 (n3) and vm4 (n4). This will copy the relevant my.cnf file and start the node to join n2:

$ vagrant ssh vm3 vagrant@n3:~$ cd s9s vagrant@n3:~$ ./ $ vagrant ssh vm4 vagrant@n4:~$ cd s9s vagrant@n4:~$ ./

At this point, our Galera cluster should be up and running. You should be able to access each MySQL server on its respective IP address and port. The default MySQL root password is root123 while the ‘cmon’ password is cmon.

Adding Galera Cluster into ClusterControl

Once Galera Cluster is running, add it to ClusterControl. Open a web browser and point it to http://localhost:8080/clustercontrol. Create a default admin user with a valid email address and password, and click ‘Register & Create User’. 

Once logged in, click on ‘Add Existing Server/Cluster’, and enter the following details:

Click ‘Add Cluster’ and monitor the output of cluster jobs. Once done, you should able to see the Galera Cluster listed:

That’s it! Quick, simple and works every time :-)

Blog category: Tags:
PlanetMySQL Voting: Vote UP / Vote DOWN

Jörg Brühe: The Upcoming Leap Second

The press, be it the general daily newspaper or the computer magazines, is currently informing the public about an upcoming leap second, which will be taken in the night from June 30 to July 1 at 00:00:00 UTC. While we Europeans will enjoy our well-deserved sleep then, this will be at 5 PM (17:00) local time on June 30 for Califormia people, and during the morning of July 1 for people in China, Japan, Korea, or Australia. (Other countries not mentioned for the sake of brevity.) This is different from last time, when the leap second was taken in the night from Saturday to Sunday (2012-July-1 00:00:00 UTC), so it was a weekend everywhere on the globe.

We have got several requests from our customers about this upcoming leap second, whether they need to take any special precautions or whether they "are safe". Well, obviously nobody is "safe" from the leap second in the sense that it would circumvent them, everybody will encounter it on their systems. The concern is whether they have to expect any trouble.

For the people operating MySQL (or any other DBMS), this issue is threefold:

  • How will the operating system behave?
  • How will the database system behave?
  • How will the applications behave?
  • Let us look at the operating system first, dealing with Linux only. (All other operating systems don't show up with significant numbers in our customer base.)
    Linux measures the time in seconds (since Jan 1, 1970, 00:00:00 UTC), and it does not include the leap seconds in this counting. When the leap second is taken, the timestamp value (those seconds) will simply not be advanced at the end of the regular second, but it will re-use its current value in the leap second. As a result, the conversion of timestamps into common time reckoning will still produce values from 0 to 59 for the minute as well as for the second, there will not be a 60. Another consequence is that there is no way to tell the leap second from the preceding regular one.

    MySQL always takes the time information from the Linux kernel, so it will also use the same timestamp value (or "now()" result) for both the regular and the leap second. The MySQL manual describes this on its own leap second page, which has become inaccurate by some recent code changes: We could not reproduce the results given there (and will probably file a documentation bug about this). However, that difference does not affect the principle of the page's first paragraph.

    About the application, it is hard to claim anything - there are too many of them. However, it is obvious that an application might run into trouble if it managed to store a new timestamp value every second and assumed they are distinct: they will not be (unless the application manages to skip the leap second). Let's hope any application programmer creating such a high-resolution application was aware of the problem.

    So does it all look fine? Not completely: Following the last leap second (just three years ago), several administrators noticed that their machines became extraordinarily busy, which even let the power consumption rise significantly. This was caused by a bug in the Linux kernel, it let user processes resume immediately if they were trying to wait on a high-resolution timer. The exact details are beyond the scope of this article, your favourite search engine will provide you with more than enough references if you ask it.
    For our purposes, the important fact is that this also happened to MySQL server processes ("mysqld") because InnoDB was doing such operations.

    Back then, Sheeri K. Cabral (well-known in the MySQL blogosphere) published a cure which her team had discovered:
    This loop can be broken by simply setting the Linux kernel's clock to the current time.
    date -s "`date`"
    While this looks like a no-op at first sight, I assume that it will reset the sub-second time information and so will have a small effect, which obviously is sufficient to terminate that loop. (In practice, you should stop your NTP daemon before changing the date, and restart it afterwards. The exact command will depend on your Linux distribution.)

    Well, this was three years ago. The issue was reported to kernel developers, a fix was developed (which gives credit to Sheeri's report) and applied to newer kernels, and it w as also backported to older kernels. From my search, it seems that all reasonably maintained installations should have received that fix. For example, I have seen notices that Ubuntu 12.04 (since kernel 3.2.0-29.46) and 14.04 do have it, as does RedHat 6.4 (since kernel 2.6.32-298); for other distributions, I did not search.

    In addition to a current kernel, you obviously need current packages including the leap second information. Typically, this would be "ntp", "ntpdate", "tzdata", and related ones. Don't forget to restart your NTP daemon after installing these updates!

    So I wish all our readers that the leap second may not get them into trouble, or that (if worst comes to worst) the cure published by Sheeri may get them out. However, I have to remind you of the old truth which is attributed to Mark Twain (sometimes also to Niels Bohr):
    "It is difficult to make predictions, especially about the future."

    PlanetMySQL Voting: Vote UP / Vote DOWN

    Practical MySQL Performance Optimization (July 2 webinar)

    Applications often become impacted by MySQL performance. Optimization is the obvious solution but where to start? Join me on July 2 at 11 a.m. Pacific and I’ll show you how to optimize MySQL in a practical way – spending time on what gives you the best return. Many of you have heard me talk on this topic before and this updated presentation will get you up to speed on the latest areas to consider when improving MySQL performance.

    I’ll be pointing out the most important things you need to get right for each of the following:

    • Hardware
    • MySQL Configuration
    • Schema and Queries
    • Application Architecture

    I’ll also share some tools and techniques to focus on the queries most important for your application. At the end of this webinar, aptly titled “Practical MySQL Performance Optimization,” you will know how to optimize MySQL performance in a practical way to ensure our system is “good enough” in terms of responsiveness, efficiency and scalability.

    I hope to see you there! Register now to reserve your spot.

    The post Practical MySQL Performance Optimization (July 2 webinar) appeared first on MySQL Performance Blog.

    PlanetMySQL Voting: Vote UP / Vote DOWN

    TokuDB and Binlog Group Commit in MySQL 5.7

    The MySQL 5.7.6 release notes describe a change to the binary log group commit algorithm that can improve the performance of write intensive applications when the binary log is enabled (noted with Bug #19424075).  This change is probably inspired by the experiment reported in MySQL bug #73202.

    There is a long history of tweaking the binary log group commit algorithm and the InnoDB storage engine to scale performance.  Mark Callaghan described the work on this problem at Facebook.  Kristian Nielsen described how this problem was fixed in MariaDB.  Mats Kindahl described how this problem was fixed in MySQL 5.6.  Now, there is yet another tweak to the algorithm.

    Since the binary log group commit algorithm in MySQL 5.7 changes the interaction between the MySQL transaction commit logic and the storage engine, a small change to TokuDB is required to use the new algorithm.  I decided to run the sysbench non-indexed update test with the InnoDB and TokuDB storage engines on MySQL 5.7 to test this new algorithm.
    InnoDB experimentThe sysbench non-indexed update test is run on a small table that fits entirely in the InnoDB in memory buffer pool.   A small table is chosen to focus the test on the group commit algorithm and its disk sync's.  A consumer grade SSD with about 3 millisecond fsync time is used to store the data.  The sysbench update throughput as a function of the number of concurrent threads updating the table is reported.

    The new binary log group commit algorithm increases application throughput by about 20%.
    TokuDB experimentThe sysbench non-indexed update test is run on a small table that fits entirely in the TokuDB in memory cache. The test labelled 'mysql 5.7.7 tokudb flush' required a change to the TokuDB prepare function to work with the new binary log group commit algorithm.

    Sysbench throughput on Percona Server 5.6.24 and MySQL 5.7.7 are about the same as expected.  When TokuDB is changed to use the MySQL 5.7 binary log group commit algorithm,  sysbench non-indexed update throughput increases by about 30%.
    What is the new binary log group commit algorithm?MySQL uses a two-phase commit algorithm to coordinate the state of the MySQL binary log with the state of the storage engine recovery log.

    Here is the MySQL 5.6 commit logic and its interaction with the TokuDB storage engine.  The interaction with InnoDB is similar.

    MySQL commit logic calls TokuDB prepare:
    (1) write a prepare recovery log entry to the in memory TokuDB recovery log.
    (2) write and sync the TokuDB recovery log with a group commit algorithm.

    MySQL commit logic writes the binary log:
    (3) write transactional changes to the binary log.
    (4) sync the binary log with a group commit algorithm.

    MySQL commit logic calls TokuDB commit:
    (5) write a commit recovery log entry to the TokuDB recovery log.  A sync of the TokuDB recovery log is unnecessary here for proper recovery.

    Transaction durability occurs when the storage engine recovery log and the binary log are synced to disk in steps (2) and (4).

    During crash recovery, the MySQL transaction coordinator gets a list of all of the prepared transactions from the storage engine and will commit those prepared transactions that are present in the binary log.  If the prepared transaction is not in the binary log, as is the case if the crash occurs before the binary log is synced to disk, then the transaction is aborted.

    Disk syncs are really expensive compared to memory operations.  A group commit algorithm is used to amortize the sync cost over multiple transactions.   Since the expected sync time is about the same for both steps (2) and (4), we expect about the same number of threads to be group committed together.

    Here is the MySQL 5.7 commit logic and its interaction with the TokuDB storage engine.

    MySQL commit logic calls TokuDB prepare:
    (1) write a prepare recovery log entry to the in memory TokuDB recovery log.

    MySQL commit logic calls TokuDB flush logs:
    (2) write and sync the TokuDB recovery log.

    MySQL commit logic writes the binary log:
    (3) write transactional changes to the binary log
    (4) sync the binary log with a group commit algorithm

    MySQL commit logic calls TokuDB commit:
    (5) write a commit recovery log entry to the TokuDB recovery log.  A sync of the TokuDB recovery log is unnecessary for proper recovery.

    MySQL 5.7 groups threads together when executing steps (2) through (5).  A leader thread is chosen to execute these actions.  Steps (2) through (4) are executed by the leader thread on behalf of many other threads.  Since here is only one point where threads are collected for group commit (step 4), the sync cost is amortized over more threads than on MySQL 5.6 which has two places where group commit occurs.TokuDB changesTokuDB detects that the new two-phase commit algorithm is being used and does not sync its recovery log in the storage engine prepare (step 2).  TokuDB assumes that MySQL will call its flush logs function to write and sync its recovery log.  See this gitub commit for details.

    PlanetMySQL Voting: Vote UP / Vote DOWN

    Plan to deprecate PROCEDURE ANALYSE

    In the MySQL team, we have been refactoring the SQL parser to be more maintainable. Gleb Shchepa lists the goals of this project in more details on the MySQL Server Team blog.

    As part of this, we have identified the feature PROCEDURE ANALYSE as something that we would like to deprecate. For added context, here is a demonstration:

    mysql> SELECT * FROM City procedure analyse()\G *************************** 1. row *************************** Field_name: Min_value: 1 Max_value: 4079 Min_length: 1 Max_length: 4 Empties_or_zeros: 0 Nulls: 0 Avg_value_or_avg_length: 2040.0000 Std: 1177.5058 Optimal_fieldtype: SMALLINT(4) UNSIGNED NOT NULL *************************** 2. row *************************** Field_name: Min_value: A Coruña (La Coruña) Max_value: ´s-Hertogenbosch Min_length: 3 Max_length: 33 Empties_or_zeros: 0 Nulls: 0 Avg_value_or_avg_length: 8.5295 Std: NULL Optimal_fieldtype: VARCHAR(33) NOT NULL *************************** 3. row *************************** Field_name: Min_value: ABW Max_value: ZWE Min_length: 3 Max_length: 3 Empties_or_zeros: 0 Nulls: 0 Avg_value_or_avg_length: 3.0000 Std: NULL Optimal_fieldtype: ENUM('ABW','AFG','AGO','AIA','ALB','AND','ANT','ARE','ARG','ARM','ASM','ATG','AUS','AUT','AZE','BDI','BEL','BEN','BFA','BGD','BGR','BHR','BHS','BIH','BLR','BLZ','BMU','BOL','BRA','BRB','BRN','BTN','BWA','CAF','CAN','CCK','CHE','CHL','CHN','CIV','CMR','COD','COG','COK','COL','COM','CPV','CRI','CUB','CXR','CYM','CYP','CZE','DEU','DJI','DMA','DNK','DOM','DZA','ECU','EGY','ERI','ESH','ESP','EST','ETH','FIN','FJI','FLK','FRA','FRO','FSM','GAB','GBR','GEO','GHA','GIB','GIN','GLP','GMB','GNB','GNQ','GRC','GRD','GRL','GTM','GUF','GUM','GUY','HKG','HND','HRV','HTI','HUN','IDN','IND','IRL','IRN','IRQ','ISL','ISR','ITA','JAM','JOR','JPN','KAZ','KEN','KGZ','KHM','KIR','KNA','KOR','KWT','LAO','LBN','LBR','LBY','LCA','LIE','LKA','LSO','LTU','LUX','LVA','MAC','MAR','MCO','MDA','MDG','MDV','MEX','MHL','MKD','MLI','MLT','MMR','MNG','MNP','MOZ','MRT','MSR','MTQ','MUS','MWI','MYS','MYT','NAM','NCL','NER','NFK','NGA','NIC','NIU','NLD','NOR','NPL','NRU','NZL','OMN','PAK','PAN','PCN','PER','PHL','PLW','PNG','POL','PRI','PRK','PRT','PRY','PSE','PYF','QAT','REU','ROM','RUS','RWA','SAU','SDN','SEN','SGP','SHN','SJM','SLB','SLE','SLV','SMR','SOM','SPM','STP','SUR','SVK','SVN','SWE','SWZ','SYC','SYR','TCA','TCD','TGO','THA','TJK','TKL','TKM','TMP','TON','TTO','TUN','TUR','TUV','TWN','TZA','UGA','UKR','URY','USA','UZB','VAT','VCT','VEN','VGB','VIR','VNM','VUT','WLF','WSM','YEM','YUG','ZAF','ZMB','ZWE') NOT NULL *************************** 4. row *************************** Field_name: Min_value: Abhasia [Aphazeti] Max_value: – Min_length: 1 Max_length: 20 Empties_or_zeros: 4 Nulls: 0 Avg_value_or_avg_length: 9.0194 Std: NULL Optimal_fieldtype: VARCHAR(20) NOT NULL *************************** 5. row *************************** Field_name: Min_value: 42 Max_value: 10500000 Min_length: 2 Max_length: 8 Empties_or_zeros: 0 Nulls: 0 Avg_value_or_avg_length: 350468.2236 Std: 723686.9870 Optimal_fieldtype: MEDIUMINT(8) UNSIGNED NOT NULL 5 rows in set (0.01 sec)

    ANALYSE() examines the result from a query and returns an analysis of the results that suggests optimal data types for each column that may help reduce table sizes.

    Our justification for wanting to deprecate PROCEDURE ANALYZE is as follows:

    • There are no other uses of SELECT * FROM table PROCEDURE. This syntax is used exclusively by ANALYSE, and uses the UK English spelling.
    • The name PROCEDURE predates the addition of stored procedures as a MySQL feature. Ideally this feature would use a different name (CHANNEL?) to avoid confusion in usage. It also exists as an extension to the SQL standard.
    • There are numerous advantages to a feature similar to this being external to the MySQL server. The server must follow a stable release cycle, with core functionality being unchanged once it is declared GA. As an external tool, it is much easier to develop in an agile way, and provide new functionality without having to provide the same level of backward compatibility.

      By "external" I am implying that this could either be a script or as a view or stored procedure in MySQL. Shlomi has a good example of how to show auto_increment column capacity in common_schema!

    Our current plan is to deprecate PROCEDURE ANALYSE in MySQL 5.7, for removal as soon as MySQL 5.8. We are inviting feedback from the MySQL Community and would like to hear from you if you use PROCEDURE ANALYSE. Please leave a comment, or get in touch!

    PlanetMySQL Voting: Vote UP / Vote DOWN

    VALIDATE PASSWORD PLUGIN with mysql_secure_installation in 5.7.7-rc

    While testing installation steps with MySQL 5.7.7-rc, surely you will find much more improvements, changes, deprecated functionality and etc.

    The interesting one is activating VALIDATE PASSWORD PLUGIN via mysql_secure_installation script. Which we use by mean “securing” MySQL installations.

    I will write a separate topic about MySQL 5.7.7-rc installation steps from source, with related BUG reports.

    So after first run:

    [root@centos7_vm mysql]# bin/mysql_secure_installation --socket=/opt/mysql/datadir/mysqld-new.sock Securing the MySQL server deployment. Connecting to MySQL server using password in '/root/.mysql_secret'

    If you notice, now script trying to connect to MySQL using a temporary password which is generated for root@localhost and logged into hidden .mysql_secret file.

    VALIDATE PASSWORD PLUGIN can be used to test passwords and improve security. It checks the strength of password and allows the users to set only those passwords which are secure enough. Would you like to setup VALIDATE PASSWORD plugin? Press y|Y for Yes, any other key for No: Y There are three levels of password validation policy: LOW Length >= 8 MEDIUM Length >= 8, numeric, mixed case, and special characters STRONG Length >= 8, numeric, mixed case, special characters and dictionary file Please enter 0 = LOW, 1 = MEDIUM and 2 = STRONG: 1 Using existing root password. Estimated strength of the password: 100

    From now we have an option to activate our password policy on the fly after fresh installation.
    If you try to give a password such 12345 it will fail at password policy check:

    Change the root password? (Press y|Y for Yes, any other key for No) : Y New password: Re-enter new password: Estimated strength of the password: 25 Do you wish to continue with the password provided?(Press y|Y for Yes, any other key for No) : Y ... Failed! Error: Your password does not satisfy the current policy requirements

    The fail reason is strength of password is equal to 25 , where it should be 100.
    After giving a proper password:

    New password: Re-enter new password: Estimated strength of the password: 100 Do you wish to continue with the password provided?(Press y|Y for Yes, any other key for No) : Y

    The other parts of output should be familiar:

    By default, a MySQL installation has an anonymous user, allowing anyone to log into MySQL without having to have a user account created for them. This is intended only for testing, and to make the installation go a bit smoother. You should remove them before moving into a production environment. Remove anonymous users? (Press y|Y for Yes, any other key for No) : Y Success. Normally, root should only be allowed to connect from 'localhost'. This ensures that someone cannot guess at the root password from the network. Disallow root login remotely? (Press y|Y for Yes, any other key for No) : Y Success. By default, MySQL comes with a database named 'test' that anyone can access. This is also intended only for testing, and should be removed before moving into a production environment. Remove test database and access to it? (Press y|Y for Yes, any other key for No) : Y - Dropping test database... Success. - Removing privileges on test database... Success. Reloading the privilege tables will ensure that all changes made so far will take effect immediately. Reload privilege tables now? (Press y|Y for Yes, any other key for No) : Y Success. All done!

    The post VALIDATE PASSWORD PLUGIN with mysql_secure_installation in 5.7.7-rc appeared first on Azerbaijan MySQL UG.

    PlanetMySQL Voting: Vote UP / Vote DOWN

    Debugging Transaction Conflicts in Galera Cluster

    If you are using Galera Cluster in multi-master mode, you will most likely run into transaction conflicts if two clients attempt to modify the same row at the same time. Such conflicts are reported a deadlock errors to the application.

    Legacy applications are frequently unable to handle transaction conflicts properly and may not provide sufficient information to debug the source of the problem.

    If the wsrep_log_conflicts option is set, Galera can output all the information about transaction conflicts that is available to it to the error log. As it is a dynamic option, you can enable it while the server is running, collect some entries for examination, and disable it to avoid filling up the log.

    Decoding the Output

    The output from wsrep_log_conflicts may look a bit intimidating at first, but in fact contains a lot of information that can be used to pin-point the offending application, module or SQL operation. The relevant pieces of information have been underlined:

    *** Priority TRANSACTION: TRANSACTION 1375, ACTIVE 0 sec starting index read mysql tables in use 1, locked 1 1 lock struct(s), heap size 360, 0 row lock(s) MySQL thread id 2, OS thread handle 0x7fbbbc1f4700, query id 52 System lock *** Victim TRANSACTION: TRANSACTION 1374, ACTIVE 23 sec starting index read mysql tables in use 1, locked 1 4833 lock struct(s), heap size 554536, 1004832 row lock(s), undo log entries 934296 MySQL thread id 5, OS thread handle 0x7fbbb4601700, query id 50 localhost ::1 root updating update t1 set f2 = 'problematic_key_value21' *** WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 8 page no 4 n bits 280 index `PRIMARY` of table `test`.`t1` trx id 1374 lock_mode X Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0 0: len 8; hex 73757072656d756d; asc supremum;; Record lock, heap no 2 PHYSICAL RECORD: n_fields 4; compact format; info bits 0 0: len 4; hex 80000001; asc ;; 1: len 6; hex 00000000055e; asc ^;; 2: len 7; hex 39000021fd0110; asc 9 ! ;; 3: len 30; hex 70726f626c656d617469635f6b65795f76616c7565323120202020202020; asc problematic_key_value21 ; (total 50 bytes); ... 2015-06-29 09:41:02 7900 [Note] WSREP: cluster conflict due to high priority abort for threads: 2015-06-29 09:41:02 7900 [Note] WSREP: Winning thread: THD: 2, mode: applier, state: executing, conflict: no conflict, seqno: 24 SQL: (null) 2015-06-29 09:41:02 7900 [Note] WSREP: Victim thread: THD: 5, mode: local, state: executing, conflict: no conflict, seqno: -1 SQL: update t1 set f2 = 'problematic_key_value21'

    Galera provides the following information:

    • ACTIVE 23 sec - how long the victim transaction has been running before it was aborted. Long-running transactions are more prone to being unable to complete due to other transactions committing in the meantime.

      Consider breaking down such transactions into smaller parts or changing them to modify a smaller number of rows. Transactions that perform housekeeping can be modified to do less work, but run more frequently, or be moved to a dedicated maintenance window or a period of lower server activity.

    • MySQL thread id 5 - the ID of the thread that was aborted due to the conflict.

      This ID is the same that is used in SHOW PROCESSLIST, the query log and the slow query log, so can be used to cross-reference with those sources of information

    • localhost ::1 root - the username of the client for the aborted transaction and the host the client connected from.
    • update t1 set ... The SQL string of the query that was aborted.

      For multi-statement transactions, the SQL string may not be available, or show simply COMMIT, if the conflict was detected at commit time.

    • index `PRIMARY` - the name of the index that was used by the aborted query.
    • PHYSICAL RECORD - a dump of the record or records where the conflict occurred.

      This section comes directly from the InnoDB storage engine and follows the format used in SHOW ENGINE INNODB STATUS.

      The record with heap no 1 can be disregarded, while the following entries contain the actual conflicting records from the table.

    • hex 80000001 - in our example, this is the hex dump of the primary key.

      As the key is declared as UNSIGNED INTEGER, the value has a leading sign bit that should be taken into account when converting to decimal.

    • problematic_key_value21 - any string fields will be decoded and visible in the output.
    • seqno: 24 - the binary log ID of the winning transaction.
    Determining The Winning Transaction

    All the information from the output above, except for the seqno, pertains to the victim transaction of a conflict. Sometimes it is useful to determine the transaction that won the conflict and was not aborted, and the seqno can be used to obtain that information from the binary log.

    In order to be able to fetch this information, the server needs to be running with binary logging enabled. If log-slave-updates is enabled, then the binlog on each server will contain all updates, so only one server needs to be searched in order to find the transaction. Otherwise the binlogs of all servers needs to be searched separately.

    MariaDB Cluster provides the actual SQL update statements from the transaction if the --binlog-annotate-row-events option is enabled. Galera Cluster and Percona XtraDB Cluster will only provide a list of the updates made by the transaction.

    Unfortunately one needs to search through the entire binlog for the seqno in question, which is called Xid in the binlog:

    $ mysqlbinlog var/mysqld.2/data/0.000001 | grep 'Xid = 3' #150629 0:46:45 server id 2 end_log_pos 644 CRC32 0x39cbbd68 Xid = 3

    This provides the server id that executed the winning transaction, plus the position in the binlog when the transaction ended. We can use mysqlbinlog again to display the binlog up to and including the winning transaction:

    $ mysqlbinlog var/mysqld.2/data/0.000001 --server-id=2 --stop-position=644 --base64-output=DECODE-ROWS --verbose ... BEGIN /*!*/; # at 513 #150629 0:46:45 server id 2 end_log_pos 559 CRC32 0xae3feaec Table_map: `test`.`t1` mapped to number 70 # at 559 #150629 0:46:45 server id 2 end_log_pos 613 CRC32 0x77b280b0 Update_rows: table id 70 flags: STMT_END_F ### UPDATE `test`.`t1` ### WHERE ### @1=1 ### @2=1 ### SET ### @1=1 ### @2=50 # at 613 #150629 0:46:45 server id 2 end_log_pos 644 CRC32 0x39cbbd68 Xid = 3 COMMIT/*!*/; DELIMITER ;

    From this output, we can see what updates the winning transaction made, as artificially-generated SQL statements. MariaDB Cluster with --binlog-annotate-row-events would provide the original SQL that was issued.

    PlanetMySQL Voting: Vote UP / Vote DOWN

    Success Story: VividCortex Improves Reliability at DC Energy

    DC Energy is a trader in the commodities markets that relies heavily on MySQL. Since trading is time-sensitive, database management and minimal system downtime is a high priority. Prior to deploying VividCortex, DC Energy used query logs and custom tools, scripts and metrics to find and resolve problems. Internal tools did not provide a way to quickly review and assess the enormous amount of data available, and gaining insight into performance was time consuming.

    In March 2013, DC Energy began using VividCortex. Joe Kelly, DC Energy associated, highlighted the benefits since installation:

    VividCortex has an intuitive interface, and it’s clearly built by a team that understands monitoring because it only shows us what we need to know. It does not overwhelm us with a lot of information we don’t want. Everything I want is accessible, and I am not bogged down by extraneous information.

    This has led to faster diagnosis and prevention of potential problems. Because DC Energy analysts depend on quick response times for trading, the database performance improvements have helped them perform with increased efficiency.

    The full case study is available for download here. If you think this level of analysis could benefit your company, learn more or start your free trial today!

    PlanetMySQL Voting: Vote UP / Vote DOWN

    MariaDB Galera Cluster 5.5.44 and 10.0.20 now available

    The MariaDB project is pleased to announce the immediate availability of MariaDB Galera Cluster 10.0.20 and MariaDB Galera Cluster 5.5.44. These are Stable (GA) releases.

    Download MariaDB Galera Cluster 10.0.20

    Release Notes Changelog What is MariaDB Galera Cluster?

    Download MariaDB Galera Cluster 5.5.44

    Release Notes Changelog What is MariaDB Galera Cluster?

    MariaDB APT and YUM Repository Configuration Generator

    See the Release Notes and Changelogs for detailed information on these releases and the What is MariaDB Galera Cluster? page in the MariaDB Knowledge Base for general information about MariaDB Galera Cluster.

    Thanks, and enjoy MariaDB!

    PlanetMySQL Voting: Vote UP / Vote DOWN

    MySQL performance optimization: 50% more work with 60% less latency variance

    When I joined Pinterest, my first three weeks were spent in Base Camp, where the newest engineering hires work on real production issues across the entire software stack. In Base Camp, we learn how Pinterest is built by building it, and it’s not uncommon to be pushing code and making meaningful contributions within just a few days. At Pinterest, newly hired engineers have the flexibility to choose which team they’ll join, and working on different parts of the code as part of the Base Camp experience can help with this decision. Base Campers typically work on a variety of tasks, but my project was a deep dive into a MySQL performance optimization project.

    Pinterest, MySQL and AWS, oh my!

    We work with MySQL running entirely inside Amazon Web Services (AWS). Despite using fairly high-powered instance types with RAID-0 SSDs and a fairly simple workload (many point selects by PK or simple ranges) that peaks around 2,000 QPS, we had been unable to realize anywhere near the expected IO performance levels. Exceeding roughly 800 write IOPS would lead to unacceptable increases in latency and replication lag, and this replication lag or insufficient read performance on the slaves would slow down ETL and batch jobs, which would then have downstream impact on any other team relying on those jobs. The only options available were either to go to an even larger instance size, thus doubling our cost and obliterating our efficiency, or find ways to make the existing systems perform better.

    I took over the project from my colleague, Rob Wultsch, who had already made the significant discovery that Linux kernel version appears to be quite important when running on SSD inside AWS. The default 3.2 kernel that ships with Ubuntu 12.04 just doesn’t cut it, nor does the 3.8 kernel that AWS recommends as a minimum version (although it’s still more than twice as fast as 3.2). Running sysbench on an i2.2xlarge (2 disk RAID-0 of SSDs) instance with kernel 3.2 could barely hit 100MB/sec of 16K random writes. Upgrading the kernel to 3.8 got us to 350MB/sec with the same test, but this was still much lower than expected. Seeing this kind of improvement from such a simple change opened up many new questions and hypotheses about other inefficiencies and poor configuration options: Could we get better performance from an even newer kernel? Should we change other settings at the OS level? Are there optimizations to be found in my.cnf? How can we make MySQL go faster?

    In pursuit of answers, I set up almost 60 different sysbench fileIO test configurations with different kernels, filesystems, mount options and RAID block sizes. Once the best fit configuration was chosen from these experiments, I ran another 20 or so sysbench OLTP runs with other system permutations. The basic test methodology was identical across all trials: run the test for an hour collecting metrics at one second intervals, then drop the first 600 seconds to account for cache warm-up time and process the remainder. After the optimal configuration had been identified, we rebuilt our largest and most critical servers and rolled out the changes into production.

    From 5000 QPS to 26000 QPS: scaling MySQL performance without scaling hardware

    Let’s take a look at the impact of these changes on some basic sysbench OLTP tests via the p99 response times and throughput metrics at 16 and 32 threads for several different configurations.  

    Here is what each of the numbers represent:

    • CURRENT:                    3.2 kernel and standard MySQL configuration
    • STOCK:                          3.18 kernel with standard MySQL configuration
    • KERNEL:                        3.18 kernel with a few IO/memory sysctl tweaks
    • MySQL:                          3.18 kernel with an optimized MySQL configuration
    • KERN + MySQL:         3.18 kernel with tweaks from #3 and #4
    • KERN + JE:                    3.18 kernel with tweaks from #3 and jemalloc
    • MySQL + JE:                3.18 kernel with MySQL config from #4 and jemalloc
    • ALL:                                 3.18 kernel with #3,  #4 and jemalloc

    When we enable all of the optimizations, we find we can achieve roughly 500 percent more read and write throughput at both 16 and 32 threads while simultaneously reducing p99 latency by over 500ms in both directions. On the read side, we go from approximately 4100 – 4600 QPS to just over 22000 – 25000, depending on concurrency. On the write side, we go from approximately 1000 QPS to 5100 – 6000 QPS. These are massive gains in headroom and performance achieved with just a few simple changes.

    Of course, all the synthetic benchmarks in the world don’t mean much if they don’t translate into real-world results. The graph below shows latency on our primary clusters from both the client and server perspective from several days before the upgrades were pushed until several days after all the masters were upgraded. The process took just a week to complete.

    The red line represents client-perceived latency, and the green represents server-measured latency. From the client perspective, p99 latency dropped from a highly-variable 15-35ms with outliers over 100ms down to a pretty flat 15ms with outliers at 80ms or less. Server-measured latency also declined from a wavy 5-15ms to a basically-flat 5ms, with a daily 18ms spike due to system maintenance. Furthermore, since the beginning of the year, our peak throughput on these clusters has increased about 50 percent, so not only are we handling considerably more load (still well under our estimated capacity), we’re doing it with much better and more predictable throughput. And, in what can only be termed good news for everyone who enjoys sleeping through the night, the number of pageable incidents related specifically to system performance or general server overload dropped from over 300 in March down to less than 10 in April and May combined.

    For more information, including the before-and-after details of what our MySQL and OS configurations look like, check out the slides from “All Your IOPS Are Belong To Us,” my talk from the 2015 Percona Live MySQL Conference/Expo, and stay tuned for more insights on how we get the most out of MySQL, Redis and other storage technologies.

    Ernie Souhrada is a database engineer on the SRE team, part of the Cloud Engineering team.

    For Pinterest engineering news and updates, follow our engineering Pinterest, Facebook and Twitter. Interested in joining the team? Check out our Careers site.

    PlanetMySQL Voting: Vote UP / Vote DOWN

    Oracle dump utility

    I have created a small program to extract data from an Oracle database in a format suitable for importing into MariaDB (and MySQL I guess). It is pretty fast as it is written in C and uses the OCI interface. Also, it supports most of the Oracle basic types, including BLOB, CLOB and LONG. UTF8 is also supported and there are many generic features, as well as some features specific for later import into MariaDB.

    Download the 1.0 version from Sourceforge where the programs source, that is using autotools for building, as well as documentation is available. I have not tested to build on any other version of Oracle than 11, but maybe someone could help me there.

    PlanetMySQL Voting: Vote UP / Vote DOWN

    An update on GROUP BY implicit sort

    In the MySQL team, we have been planning for some time to remove the implicit sorting that is provided by GROUP BY. In doing so, we will make a number of existing queries faster (by no longer requiring a sort) as well as unlock opportunities to implement further optimizations.

    This is one of the more complicated behaviours to remove, because it is not possible to tell if an application relies upon implicit ordering. Since a GROUP BY query without an ORDER BY clause is a valid query, it is also not reasonable to issue deprecation warnings.

    However, one piece of the puzzle that was missing when I last wrote about this problem, is that MySQL 5.7 will support server-side query rewrite. What this means, is that Database Administrators will have the ability to inject an ORDER BY into queries that require this legacy behaviour. This is useful in the case where modifying the application directly is not possible.

    The second part to this update, is that we also plan to deprecate the closely related syntax GROUP BY .. [ASC|DESC]. I am sure many users are probably unaware that exists, but you can change the implicit ordering to be in descending order with:

    SELECT MAX(Population), Name FROM Country GROUP BY Name DESC;

    (Note the missing "ORDER BY").

    This represents an extension from the SQL standard, that what I can tell is not present in other databases.

    Our current plan is to make GROUP BY .. [ASC|DESC] deprecated as of MySQL 5.7, with removal in 5.8. As part of this, we also plan to remove the implicit GROUP BY sort as early as MySQL 5.8.

    We are inviting feedback from the MySQL Community regarding this plan. Please leave a comment, or get in touch! I would love to hear from you.

    PlanetMySQL Voting: Vote UP / Vote DOWN

    Log Buffer #429: A Carnival of the Vanities for DBAs

    This Log Buffer Edition gathers a wide sample of blogs and then purifies the best ones from Oracle, SQL Server and MySQL.


    • If you take a look at the “alter user” command in the old 9i documentation, you’ll see this: DEFAULT ROLE Clause.
    • There’s been an interesting recent discussion on the OTN Database forum regarding “Index blank blocks after a large update that was rolled back.”
    • 12c Parallel Execution New Features: 1 SLAVE distribution
    • Index Tree Dumps in Oracle 12c Database (New Age)
    • Is it possible to cause tables to be stale with only tiny amounts of change?

    SQL Server:

    • Making Data Analytics Simpler: SQL Server and R
    • Challenges with integrating MySQL data in an ETL regime and the amazing FMTONLY trick!
    • Azure Stream Analytics aims to extract knowledge structures from continuous ordered streams of data by real-time analysis.
    • Grant User Access to All SQL Server Databases
    • SQL SERVER – How Do We Find Deadlocks?


    • Efficient Use of Indexes in MySQL
    • Business-critical MySQL with DR in vCloud Air
    • Become a MySQL DBA blog series – Common operations – Schema Changes.
    • Building a Better CREATE USER Command

    Learn more about Pythian’s expertise in Oracle , SQL ServerMySQL, as well as the author Fahd Mirza.

    PlanetMySQL Voting: Vote UP / Vote DOWN

    MariaDB with Galera available on the IBM Power8 platform

    It was a very long time since I wrote something in this blob, but I have been very busy this spring with MariaDB on Power mostly. This has been a lot of work, but also a lot of fun. So, what is this MariaDB on Power thing all about, well I wrote an introduction to the Power platform late last year. Since then a lot of things has happened though.

    One thing is that several service providers out there has adopted Power8 as a platform. To be honest, this really isn't sexy, but it is useful and as a user of one of these services, you will just see the same old Linux you are used to, but potentially it is more powerful and reliable. One such provider is OVH, whose service is more known as RunAbove. If you want to try it, you can do so for free for 7 7 days, just go there and off you go.

    Another important thing is that MariaDB is now available on Power8 running, RedHat, SUSE or Ubuntu Linux. To get access to this, pop by MariaDB and if you are not yet signed up, then do this now and then go to "My Portal", further to "Downloads" and then select "MariaDB Enterprise and MariaDB Enterprise Cluster". You are now ready to install using the operating system of your choise, but on Power you are, as I said before,limited to SUSE, RedHat and Ubuntu, and if you want to test MariaDB Enterprise Cluster, i.e. MariaDB with Galera, you have to go with Ubuntu.

    Installing MariaDB Enterprise Cluster on Power8 is no more complex than on Intel. There are a few thing to adjust before you can get started with this, after having installed the software. The first node has, as usual, to be configured with wsrep_cluster_adress set to gcomm:// to ensure that this first node will bootstrap without having to connect to a cluster. Once the cluster is up and running though, this variable is set to the cluster addresses. In my case, this what the Galera setting look like in /etc/mysql/my.cnf which is the location of this file on Ubuntu.
    # Galera
    Note in particular the binlog_format setting. This MUST be set to ROW for Galera to work. But fact is that these setting are not particular to MariaDB on Power, this is the same even on Intel.

    Tf this isn't enough to convice you about the advantages of running MariaDB on IBM Power, then see what Foedus in Italy has to say about this combination in this Video:

    There is more to say about running MariaDB on Power and there is more to come here, I'll look at some performance data, we'll have a look at MaxScale on Power (this is not official yet, but that isn't stopping me) as well as a blog on how to run a Power8 emulation on Intel which I have promissed before.

    So, don't touch that dial!

    PlanetMySQL Voting: Vote UP / Vote DOWN

    Efficient Use of Indexes in MySQL

    The slides of “Efficient Use of Indexes in MySQL” talk we delivered on SFMySQL Meetup.

    This is an introductory talk for developers on MySQL indexes. In my opinion it’s quite important to understand how InnoDB organizes data. If you know how MySQL accesses data it’s easier to write optimal queries.

    When working with queries I imagine secondary indexes as a table with records sorted by secondary key fields. This is a powerful concept that helps to understand MySQL logic. It’s also easy to understand complex optimizations like loose index scan.

    For example, for index (last_name, rank) the secondary index table looks like:

    Enjoy the slides!

    The post Efficient Use of Indexes in MySQL appeared first on Backup and Data Recovery for MySQL.

    PlanetMySQL Voting: Vote UP / Vote DOWN


    We received feedback from a number of users in the MySQL community that the command SHOW ENGINE INNODB MUTEX remains useful in a number of scenarios. We listened, and the command is scheduled to make a return in MySQL 5.7.8.

    To lessen overhead, the command will now feature a mechanism to enable and disable metrics collection. This is documented in the manual here:

    SET GLOBAL innodb_monitor_enable='latch'; SET GLOBAL innodb_monitor_disable='latch';

    Thank you for helping make a better MySQL!

    PlanetMySQL Voting: Vote UP / Vote DOWN

    Become a MySQL DBA blog series - Common operations - Schema Changes

    Database schema changes are not popular among DBAs, not when you are operating production databases and cannot afford to switch off the service during a maintenance window. These are unfortunately frequent and necessary, especially when introducing new features to existing applications. 

    Schema changes can be performed in different ways, with tradeoffs such as complexity versus performance or availability. For instance, some methods would trigger a full table rewrite which could lead to high server load. This in turn would lead to degraded performance and increased replication lag in master-slave replication setups. 

    This is the fourth installment in the ‘Become a MySQL DBA’ series, and discusses the different approaches to schema changes in MySQL. Our previous posts in the DBA series include High Availability, Backup & Restore and Monitoring & Trending.

    Schema changes in MySQL

    A common obstacle when introducing new features to your application is making a MySQL schema change in production, in the form of additional columns or indexes. Traditionally, a schema change in MySQL was a blocking operation - a table had to be locked for the duration of the ALTER. This is unacceptable for many applications - you can’t just stop receiving writes as this causes your application to become unresponsive. In general, “maintenance breaks” are not popular - databases have to be up and running most of the time. The good news is that there are ways to make this an online process. 

    Rolling schema update in MySQL Replication setups

    MySQL replication is an easy way of setting up high availability, but managing schema updates are tricky. Some ALTERs may lock writes on the master and create replication lag - this is obvious for any ALTER statement. The reason is simple - MySQL replication is single-threaded and if the SQL thread is executing an ALTER statement, it won’t execute anything else. It is also important to understand that the slave is able to start replicating the schema change only after it has completed on the master. This results in a significant amount of time needed to complete changes on the slave: time needed for a change on the master + time needed for a change on the slave.

    All of this sounds bad but replication can be used to help a DBA manage some of the schema changes. The plan is simple - take one of the slaves out of rotation, execute ALTERs, bring it back, rinse and repeat until all slaves have been updated. Once that’s done, promote one of the slaves to master, run ALTER on the old master, bring it back as a slave.

    This is a simple yet efficient way of implementing schema changes. Failover requires some downtime but it is much less impacting than running all of the changes through the replication chain, starting from the master. The main limitation of this method is that the new schema has to be compatible with the current schema - remember, master (where all writes happen) has it unchanged until almost the end. This is a significant limitation, especially if you use row-based binary log format. While statement-based replication (SBR) is pretty flexible, row-based replication (RBR) is much more demanding when it comes to the schema consistency. For example, adding a new column in any place other than the end of the table won’t work in RBR. With SBR, it is not an issue. Be sure you checked the documentation and verified that your schema change is compatible. Last but not least, if you use mixed binlog format, keep in mind that while it uses mostly statement-based binlog format, it will use row-based binlog format for those queries which are not deterministic. Thus, it may cause similar problems as RBR.

    MySQL-based functionality for online schema change

    As we mentioned earlier, some of the operations may not be blocking in MySQL and thus can be executed on a live system. It is true especially with MySQL 5.6, which brought a number of improvements in this area. Unfortunately, it doesn’t solve problems with replication lag - ALTERs will still cause this type of problem. Still, this is a great choice for smaller tables where lag created is acceptable. Of course, it is application-dependent but usually it’s not a big deal if the slave lag is a couple of seconds and this may mean that tables even up to a couple of gigabytes (hardware-dependent) may be within range. If your application cannot accept even such small lag, then we’d strongly suggest to rethink about the design. Slaves will lag, it is just a matter of when it will happen.

    Other tools for Online schema change

    There are a couple of tools that perform online schema change in MySQL. The best known is probably pt-online-schema-change, which is part of Percona Toolkit. Another one is “Online Schema Change” developed by Facebook.

    • Those tools work in a similar way
    • create a new table with the desired schema;
    • create triggers on the old table that will mirror all changes and store them in the new table;
    • copy data from old table into a new one in batches;
    • once it’s done, rename tables and drop the old one.

    Those tools give the DBA great flexibility - you don’t have to do a time-consuming rolling upgrade, it’s enough to run pt-online-schema-change and it will take care of your ALTER. It’s even replication-aware and, as such, it can throttle itself down when a lag is detected on one of the slaves. It’s not without limitations, though. 

    You need to be aware that the “copy” operation is basically a number of low priority inserts. They will impact the overall performance - it’s inevitable. The process of moving millions of rows takes time - online schema change is much slower than the direct ALTER executed on the same table. By “much” we mean even an order of magnitude. Of course, it all depends on your hardware (disk throughput is the most important factor) and table schema, but it is not uncommon to see changes which literally take days to finish. Another limitation is the fact that this tool cannot be used on a table where triggers already exist. For now MySQL allows only a single trigger of a given type per table. This will probably change in MySQL 5.7 (the relevant worklog is marked as completed) but it doesn’t help much if you run on MySQL 5.6. 

    Another problem is with foreign keys - they are linked to a given table and if you create a new one and then swap it with the old table, foreign keys will have to be updated to point to the new table. Pt-online-schema-change gives you two options to deal with it but, frankly, none of them is good. 

    The first option, fast but risky, is to drop the old table instead of renaming it. The main problem here is two-fold - first, for a while there’s no table - renaming a table is an atomic operation, dropping it is not. Second, as the old table has been dropped, there’s no rollback if an error occurs after the drop. 

    The second option requires executing ALTERs on the tables linked by foreign keys - those tables are basically altered and new FKs are created. This is fine as long as those tables are small because the change is executed as a normal ALTER with all it’s consequences (replication lag, for example).

    Metadata locking is another problem that you may experience while using pt-online-schema-change. Pt-osc have to create triggers and this operation requires a metadata lock. On a busy server with plenty of long-running transactions, this could be hard to acquire. It is possible to increase timeouts and, in that way, increase chances of acquiring the lock. But we’ve seen servers where it’s virtually impossible to run pt-online-schema-change due to this problem. 

    Given this long list of the problems and limitations, you might think that this tool is not worth your time. Well, on the contrary. The list is so long because almost every MySQL DBA will rely on pt-online-schema-change heavily and, in the process, will learn all of it’s dark sides. This tool is one of the most useful tools in the DBA’s toolkit. Even though it has some limitations, it gives you great degree of flexibility regarding how to approach schema changes in MySQL.

    Schema changes in Galera Cluster

    Galera cluster brings another layer of complexity when it comes to schema changes. As it is a ‘virtually’ synchronous cluster, having a consistent schema is even more important than regular MySQL connected via replication. Galera brings two methods of running schema changes and we’ll discuss them and the repercussions of using them below.

    TOI (Total Order Isolation)

    The default one, TOI - Total Order Isolation, works in a way that the change happens at exactly the same time on all of the nodes in the cluster. This is great for consistency and allows you to run any kind of change, even non-compatible ones. But it comes with a huge cost - all other writes have to wait until the ALTER finishes. This, of course, make long-running ALTERs not feasible to execute because every one of them will cause significant downtime for the whole application. This mode can be used successfully for quick, small changes which do not take more than a second (unless you are ok with some ‘stalls’ in your application or you have a maintenance window defined for such changes).

    What is also important is that MySQL’s online ALTERs do not help here. Even a change, which you could easily run on the master without blocking it (and only be concerned about slaves lagging) will cause all writes to halt.

    RSU (Rolling Schema Upgrade)

    The second option that Galera offers is RSU - Rolling Schema Upgrade. This is somewhat similar to the option we discussed above, (see section on Rolling Schema Update in MySQL Replication setups). At that time we were pulling out our slaves, one by one, and finally we executed a master change. Here we’ll be taking the Galera nodes out of rotation.

    The whole process is partially automated - set the wsrep_OSU_method variable to RSU, and all you need to do is to proceed with the ALTER. The node will switch to the Desync state and flow control will be disabled ensuring that the ALTER will not affect the rest of the cluster. If your proxy layer is setup in a way that Desync state means no traffic will reach this node (and that’s how you should set up your proxy), such operation is transparent to the application. Once the ALTER finishes, the node is brought back to sync with the cluster.

    This has several repercussions that you need to keep in mind. First of all, similar to the rolling schema upgrade on MySQL replication, changes have to be compatible with the old schema. As Galera uses row-based format for replication, it is very strict regarding changes that can be done online. You should verify every change you plan to make (see MySQL documentation) to ensure it is indeed compatible. If you performed an incompatible schema change, Galera won’t be able to apply writesets and it will complain about a node not being consistent with the rest of the cluster. This will result in Galera wiping out the offending node and executing SST.

    You also need be aware of the fact that, for the duration of the change, the altered node does not process writesets. It will ask for them later, once it finishes the ALTER process. If it won’t find the writesets on any of the other synced nodes in the cluster, it will execute SST, removing the change completely. You have to ensure that gcache is large enough to store the data for the duration of the ALTER. It can be tricky and problematic as gcache size is only one of the factors - another one is the workload. You may have increased gcache but if the amount (and size) of the writesets in a given time increases too, you may still run out of space in the cache.

    Generic scenarios of the schema change

    Now, let’s look at some real life scenarios and how you could approach them. We hope this will make more clear the strong and weak points of each method. Please note that we are adding estimated time to each of these scenarios. It is critical that the DBA, before executing a change, has knowledge about the time needed to complete it. We cannot stress it enough - you have to know what you’ll be executing and how long will it take.

    There are a couple of ways in which you can estimate the performance. First, you can (and you should) have a development environment with a copy of your production data. This data should be as close to the real production copy as possible in terms of the size. Sure, sometimes you have to scrub it for security reasons, but still - closer to production means better estimates. If you have such environment, you can execute a change and assess the performance.

    Another way, even more precise, is to run the change on a host that is connected to the production setup via replication. It is more precise because, for example, pt-online-schema-change execute numerous inserts and they can be slowed down because of the regular traffic. Having the regular traffic flown in via replication helps to make a good assessment. 

    Finally, it’s all about the experience of the DBA - knowledge about the system’s performance and workload patterns. From our experience we’d say that when in doubt, add 50% to the estimated time. In the best case, you’ll be happy. In the worst, you should be about right, maybe a bit over the ETA.

    Scenario - Small table, alter takes up to 10s

    MySQL Replication

    In this case it’s a matter of answering the question - does your application allow some lag? If yes, and if the change is non-blocking, you can run direct ALTER. On the other hand, pt-online-schema-change shouldn’t take more than couple of minutes on such a table and it won’t cause any lag-related issues. It’s up to you to decide which approach is better. Of course, if the change is blocking on the MySQL version you have installed, online schema change is the only option.

    Galera Cluster

    In this case, we’d say the only feasible way of executing the change is to use pt-online-schema-change. Obviously we don’t want to use TOI as we’d be locked for couple of seconds. We could use RSU if the change is compatible, but it creates additional overhead of running the change on a node, one by one, keeping an eye on their status, ensuring the proxy layer is taking nodes out of rotation. It’s doable but if we can use online schema change and just let it run, why not do that?

    Scenario - Medium-sized table, from 20 - 30 minutes up to 1h

    Replication and Galera Cluster

    This is where pt-online-schema-change shines. Changes take too long for a direct ALTER to be feasible yet the table is not too big and pt-osc should be able to finish the process within several hours at the most. It may take a while but it will eventually be done. It’s also much less cumbersome than executing a rolling schema upgrade.

    Scenario - Large tables, more than 1h, up to 6 -12h

    MySQL Replication

    Such tables can become tricky ones. On the one hand, pt-online-schema-change will work fine, but problems may start to appear. As pt-osc is expected to take even 36 - 48h to finish such change, you need to consider impact on the performance (because pt-osc has its impact, the inserts need to be executed). You also need to assess if you have enough disk space. This is somewhat true for most of the methods we described (except maybe for online ALTERs) but it’s even more true for pt-osc as inserts will significantly increase the size of the binary logs. Therefore you may want to try to use Rolling Schema Upgrade - downtime will be required but the overall impact may be lower than using pt-osc.

    Galera Cluster

    In Galera, the situation is somewhat similar. You can also use pt-online-schema-change if you are ok with some performance impact. You may also use RSU mode and execute changes node by node. Keep in mind that gcache size for 12hrs worth of writesets, on a busy cluster, may require a significant amount of memory. What you can do is to monitor wsrep_last_committed and wsrep_local_cached_downto counters to estimate how long the gcache is able to store data in your case.

    Scenario - Very large tables, more than 12h

    First of all, why do you need such a large table? :-) Is it really required to have all this data in a single table? Maybe it’s possible to archive some of this data in a set of archive tables (one per year/month/week, depending on their size) and remove it from the “main” table?

    If it’s not possible to decrease the size (or it’s too late as this process will take weeks while ALTER has to be executed now), you need to get creative. For MySQL Replication you’ll probably use rolling schema upgrade as a method of choice with a slight change, though. Instead of running the ALTER over and over again you may want to use xtrabackup or even snapshots, if you have LVM or run on EBS volumes in EC2, to propagate changes through the replication chain. It will be probably faster to run ALTER once and then rebuild slaves from scratch using the new data (rather than executing the ALTER on every host).

    Galera Cluster may suffer from problems with gcache. If you can fit the 24h or even more data into gache, good for you - you can use RSU. If not, though, you will have to improvise. One way would be to take a backup of the cluster and use it to build a new Galera cluster which will be connected to the production one via replication. Once that is done, run the change on the ‘other’ cluster and, finally, failover to it.

    As you can see, schema changes may become a serious problem to deal with. This is a good point to keep in mind that the schema design is very important in relational databases - once you push data into tables, things may become hard to change. Therefore you need to design table schemas as time-proof as possible (including indexing any access pattern that may be used by queries in the future). Also, before you start inserting data in your tables, you need to plan how to archive this data. Partitions maybe? Separate archive tables? As long as you can keep the tables reasonably small, you won’t have problems with adding a new index.

    Of course, your mileage may vary - we used time as a main differentiating factor because an ALTER on a 10GB table may take minutes or hours. You also need to remember that pt-online-schema-change has its limitations - if a table has triggers, you may need to use rolling schema upgrade on it. Same with foreign keys. This is another question to answer while designing the schema - do you need triggers? Can it be done from within app? Are foreign keys required or can you have some consistency checks in the application? It is very likely that developers will push on using all those database features, and that’s perfectly understandable - they are there to be used. But you, as a DBA, will have to assess all of the pros and cons and help them decide whether the pros of using all those database features are larger than the cons of maintaining a database that is full of triggers and foreign keys. Schema changes will happen and eventually you’ll have to perform them. Not having an option to run pt-online-schema-change may significantly limit your possibilities.

    Related Blogs


    Blog category: Tags:
    PlanetMySQL Voting: Vote UP / Vote DOWN