Planet MySQL

Nasty MySQL Replication Bugs that Affect Upgrade to 5.6

There were two nasty MySQL replication bugs in two different 5.6 releases that would make it difficult to upgrade slaves to MySQL 5.6 while still connected to a MySQL 5.5 master. The first of those bugs is MySQL bug 72610, which affects 5.6.19. Essentially this bug is triggered when the table structure on the slave is different from the table structure on the master, which leads to an unnecessarily large amount of RAM usage while replicating events that affect that table. The amount of RAM used would generally be more noticeable when the replicated transaction consists of thousands of RBR events. The...

The post Nasty MySQL Replication Bugs that Affect Upgrade to 5.6 appeared first on ovais.tariq.



(More) Secure local passwords in MySQL 5.6 and up

I log into a lot of different servers running MySQL and one of the first things I do is create a file in my home directory called ‘.my.cnf’ with my credentials to that local mysql instance:

[client]
user=root
password=secret

This means I don’t have to type my password in every time, nor am I tempted to include it on the command line with -p and get the dreaded (but completely accurate):

Warning: Using a password on the command line interface can be insecure.

MySQL 5.6 introduces a utility to make this easier and more secure. First, let’s start with a new mysqld instance with a blank root password and make it more secure:

[vagrant@localhost ~]$ mysqladmin -u root password
New password:secret
Confirm new password:secret
[vagrant@localhost ~]$ mysql -u root
ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: NO)

Ok, so I’ve added a password, now I want to create my .my.cnf file:

[vagrant@localhost ~]$ mysql_config_editor set --user=root --password
Enter password:secret
[vagrant@localhost ~]$ mysql
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 10
Server version: 5.6.21-70.0 Percona Server (GPL), Release 70.0, Revision 688

Copyright (c) 2009-2014 Percona LLC and/or its affiliates
Copyright (c) 2000, 2014, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>

What did ‘mysql_config_editor set’ actually do? It creates a .mylogin.cnf file (which stands in for a .my.cnf) in my home directory that contains my credentials, just in encrypted form:

[vagrant@localhost ~]$ ls -alh .my*
-rw-------. 1 vagrant vagrant 100 Nov 19 16:20 .mylogin.cnf
-rw-------. 1 vagrant vagrant  29 Nov 19 16:20 .mysql_history
[vagrant@localhost ~]$ cat .mylogin.cnf
��>NTv�&�S���/�, >ј$%KZ 9i�V�jK䉦H[��� k.
[vagrant@localhost ~]$ mysql_config_editor print
[client]
user = root
password = *****

The mysql client picks this up right away and will use it by default. This file has good default filesystem permissions, is local to my homedir, and is a whole lot better than specifying it on the command line or typing it in every time.

This utility also supports a feature called ‘login-path’ wherein you can add multiple mysql logins (perhaps to different servers) and refer to them with the --login-path option in the mysql client:

[vagrant@localhost ~]$ mysql_config_editor set --login-path=remote --host=remote --user=remote --password
Enter password:secure
[vagrant@localhost ~]$ mysql_config_editor print --all
[client]
user = root
password = *****
[remote]
user = remote
password = *****
host = remote
[vagrant@localhost ~]$ mysql --login-path=remote
ERROR 2005 (HY000): Unknown MySQL server host 'remote' (0)

The ‘remote’ host doesn’t exist here, but you get the idea. You can create as many login-paths as you want with varied hostnames, credentials and other login parameters and quickly access them with any client supporting login-path.

Now, how secure is this really?  This isn’t secure from anyone who roots your DB server.  I would say the benefits are more about reducing careless password storage and tidier management of local credentials.

The post (More) Secure local passwords in MySQL 5.6 and up appeared first on MySQL Performance Blog.



Alternatives for chunking bulk deletes in common_schema

I've blogged about common_schema multiple times in the past, and it's a tool I use frequently. Last week I had a project to delete millions of rows from multiple rollup tables in a star schema. Since the tables are not partitioned I needed to use DELETE instead of DROP PARTITION, but I didn't want to delete millions of rows in a single transaction. My first instinct was to use common_schema's split() function to break the deletes into chunks. So I ran a query on INFORMATION_SCHEMA to generate a bunch of statements like this:

call common_schema.run("split(delete from rollup_table1 where the_date > '2013-03-30') pass;");
call common_schema.run("split(delete from rollup_table2 where the_date > '2013-03-30') pass;");
call common_schema.run("split(delete from rollup_table3 where the_date > '2013-03-30') pass;");
...

That's the simplest way to do deletes with split(), and the tool will automatically determine which index and what chunk size to use. If I were running this on an active database (or a master) I would probably use throttle to control the speed of the deletes, but in this case it was running on passive replicas so I just used pass to run the deletes with no sleep time in between them. I sorted the deletes by table size, from smallest to largest, and had a total of 33 tables to process.
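For reference, a query along these lines can generate those statements from INFORMATION_SCHEMA; this is only a minimal sketch under my assumptions (one schema named flite, tables matching rollup_table%, all sharing the the_date column), not the exact query I used:

select concat('call common_schema.run("split(delete from ', table_name,
              ' where the_date > ''2013-03-30'') pass;");') as stmt
from information_schema.tables
where table_schema = 'flite'
  and table_name like 'rollup_table%'
order by data_length + index_length;

Ordering by data_length + index_length approximates the smallest-to-largest processing order mentioned above.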

I started running the SQL on a stage database and it deleted data from the first 32 tables with no problem, but it got stuck on the 33rd table. I checked the process list and found this query running (table and column names have been changed for simplicity):

select REQ_DATE, DIMENSION1, DIMENSION2, DIMENSION3, DIMENSION4, DIMENSION5
from `flite`.`rollup_table33`
order by REQ_DATE ASC, DIMENSION1 ASC, DIMENSION2 ASC, DIMENSION3 ASC, DIMENSION4 ASC, DIMENSION5 ASC
limit 1
into @_split_column_variable_min_1, @_split_column_variable_min_2, @_split_column_variable_min_3,
     @_split_column_variable_min_4, @_split_column_variable_min_5, @_split_column_variable_min_6

The relevant table has about 100 million rows, but I would expect the above query to be fairly fast since there is a unique index on the columns being selected. I ran an explain and found that the query was doing a full table scan. I'm not sure exactly why it was doing a full table scan, but the table does have some quirks:

  • Two of the dimension columns use the TEXT data type, and thus only a substring of each of those columns is indexed
  • The default charset for the table is latin1, but for the TEXT columns it is utf8
  • The table uses ROW_FORMAT=DYNAMIC

Rather than trying to figure out exactly why that query was doing a full table scan, I checked the common_schema documentation to see if there were any options I could use to avoid running this particular query. The parameters for split are fairly limited, but I did try using start to see if that would get around it:

call common_schema.run("split({start:'2013-03-30,0,,0,0,', size: 10000} : delete from rollup_table3 where the_date > '2013-03-30') pass;");

I thought telling split() where to start and what chunk size to use might help, but it still ran the same query with the full table scan.

At that point I started looking for alternatives to split(). I remembered seeing an example in the documentation using WHILE to do a delete with a WHERE clause and a LIMIT. Here's the example:

while (DELETE FROM world.Country WHERE Continent = 'Asia' LIMIT 10) { throttle 2; }

So I decided to try that with my table. Here's the command I ran:

call common_schema.run(" while (delete from rollup_table33 where req_date > '2013-06-14' LIMIT 10000) { pass; }");

As I expected that delete started fast and slowed down over time, but within 3 hours the deletes were done!

At some point I intend to go back and figure out why my first approach was so slow, and might file a bug report against common_schema. But for now I'm happy that I could get the tool to do what I needed thanks to its flexibility.



Announcing iiBench for MySQL in Java

I just pushed the new Java based iiBench for MySQL (and Percona Server and MariaDB), the code and documentation are available now in the iibench-mysql Github repo. Pull requests are welcome!

The history of iiBench goes back to the early days of Tokutek. Since "indexed insertion" is a strength of Fractal Tree indexes, the first iiBench was created by Tokutek in C++ back in 2008. Mark Callaghan rewrote iiBench in Python, adding several features along the way. His version of iiBench is available in Launchpad.

So why did I create a new iiBench in Java?

  • Raw Speed
    • By eliminating the insert calls from each version I tested how many inserts per second each version could generate. Any time spent waiting for the next batch of inserts is time that could be put toward inserting rows (and yes, I understand that concurrent clients can reduce this concern).
    • All tests were run on my desktop (Intel i7-4790K). As the below graph shows, the 1 thread version of the Java iiBench is almost 4x faster than the 4 threaded Python iiBench, and the 4 thread Java version scales quite nicely.
  • Capability
    • Because of Python's Global Interpreter Lock, I need to run 4 copies of the Python iiBench to create 4 loader "threads". Each of these benchmark clients creates its own log files that need to be aggregated to show cumulative insertion performance. Java threading handles it cleanly, allowing a single client application to run regardless of how many client threads are needed.
  • Miscellaneous
    • Not really relevant to the version or language, but I cringe every time I need to get software from Launchpad. I am not a fan of Bazaar.
Again, contributions/comments/complaints are always welcome, here or in github!

Percona Server 5.6.21-70.1 is now available

Percona is glad to announce the release of Percona Server 5.6.21-70.1 on November 24, 2014. Download the latest version from the Percona web site or from the Percona Software Repositories.

Based on MySQL 5.6.21, including all the bug fixes in it, Percona Server 5.6.21-70.1 is the current GA release in the Percona Server 5.6 series. Percona Server is open-source and free. Complete details of this release can be found in the 5.6.21-70.1 milestone on Launchpad.

Bugs Fixed:

  • A slave replicating in RBR mode would crash, if a table definition between master and slave differs with an allowed conversion, and the binary log contains a table map event followed by two row log events. This bug is an upstream regression introduced by a fix for bug #72610. Bug fixed #1380010.
  • An incorrect source code function attribute would cause MySQL to crash on an InnoDB row write, if compiled with a recent GCC with certain compilation options. Bug fixed #1390695 (upstream #74842).
  • MTR tests for Response Time Distribution were not packaged in binary packages. Bug fixed #1387170.
  • The RPM packages provided for CentOS 5 were built using a debugging information format which is not supported in the gdb version included with CentOS 5.10. Bug fixed #1388972.
  • A session on a server in mixed mode binlogging would switch to row-based binlogging whenever a temporary table was created and then queried. This switch would last until the session ended or until all temporary tables in the session were dropped. This was unnecessarily restrictive and has been fixed so that only the statements involving temporary tables are logged in the row-based format, while the rest of the statements continue to use statement-based logging. Bug fixed #1313901 (upstream #72475).

Other bugs fixed: #1387227, and #1388001.

Release notes for Percona Server 5.6.21-70.1 are available in the online documentation. Please report any bugs on the launchpad bug tracker.

The post Percona Server 5.6.21-70.1 is now available appeared first on MySQL Performance Blog.



Log Buffer #398, A Carnival of the Vanities for DBAs

This Log Buffer Edition covers some informative and interesting posts from Oracle, SQL Server and MySQL.

Oracle:

If you’re not using Toad DBA Suite, it’s sometimes hard to find solutions. Somebody wanted to know how to find indexes that aren’t indirect. Indirect indexes are those created for a primary key because a primary key column or set of columns is both not null and uniquely constrained. You can’t drop a unique index for a primary key without dropping the primary key constraint that indirectly created it.

At the NoCOUG fall conference at the eBay town hall in San Jose, we got a first-hand look at the workings of the most complex database environments in the world.

Is there a feature you would like added to Inventory or Inventory Items? Tired of logging Service Requests with Support to request product enhancements? Well those days are over. You can now submit Enhancement Requests (ER’s) for Logistics (Inventory) and/or Inventory Items (APC/PLM/PIM) directly in their respective Communities.

Oracle Database 12c : EXPAND_SQL_TEXT, APPROX_COUNT_DISTINCT, Session Sequences and Temporary Undo.

Integrating Cordova Plugin with Oracle MAF – iOS Calendar Plugin by Chris Muir.

SQL Server:

Learn how to invoke SSRS reports from an SSIS package after the data load is completed.

Questions About Using TSQL to Import Excel Data You Were Too Shy to Ask.

This article shows a step-by-step guide to move the distribution database to a new SQL Server instance.

Monitoring Azure SQL Database.

Stairway to SQL Server Agent – Level 2: Job Steps and Subsystems.

Where in the Application Should Data Validation be Done?

MySQL:

Creating JSON documents with MariaDB.

Geographic replication with MySQL and Galera.

Sys Schema for MySQL 5.6 and MySQL 5.7.

Logging with MySQL: Error-Logging to Syslog & EventLog.

Multi-source Replication with Galera Cluster for MySQL.



Schema changes in MySQL for OpenStack Trove users

People using OpenStack Trove instances can hit a common issue in the MySQL world: how to perform schema change operations while minimizing the impact on the database server? Let’s explore the options that can allow online schema changes.

Summary

With MySQL 5.5, pt-online-schema-change from Percona Toolkit is your best option for large tables while regular ALTER TABLE statements are only acceptable for small tables. Also beware of metadata locks.

With MySQL 5.6, almost all types of schema changes can be done online. Metadata locks can also be an issue. pt-online-schema-change can still be worth using as it is also online on read replicas.

Regular ALTER TABLE with MySQL 5.5

If you are still using MySQL 5.5, almost all schema changes will require a table rebuild and MySQL will set a write lock. Therefore all writes to the table that gets modified will be blocked. As soon as the table gets large or if you cannot afford to have a maintenance window, using ALTER TABLE becomes tricky.

The only exception is that secondary indexes can be added or removed without rebuilding the whole table. The table is still write locked during the operation but it is much faster.

You can spot this ‘fast index creation’ process by looking at SHOW PROCESSLIST (see manage keys in the State field):

+----+------+-----------+--------+---------+------+-------------+-----------------------------------------+-----------+---------------+-----------+
| Id | User | Host      | db     | Command | Time | State       | Info                                    | Rows_sent | Rows_examined | Rows_read |
+----+------+-----------+--------+---------+------+-------------+-----------------------------------------+-----------+---------------+-----------+
|  1 | root | localhost | sbtest | Query   |    4 | manage keys | alter table sbtest2 add index idx_k (k) |         0 |             0 |         0 |
+----+------+-----------+--------+---------+------+-------------+-----------------------------------------+-----------+---------------+-----------+

It is helpful for small or even medium tables, but it is not fast enough for large tables.

A common workaround when you have a master-slave configuration is to perform the schema change on the slave first, promote the slave and then perform the schema change on the old master. However replication is not available if you are using Icehouse, and even with Juno, replicas are read-only. So this workaround is not usable with Trove.

pt-online-schema-change

A good alternative is to use pt-online-schema-change. In a nutshell the tool creates a new table with the desired structure, creates triggers on the original table, and copies rows in chunks from the original table to the new table.

Nice features of the tool include being able to write to the original table while the schema change is being performed (hence the name of the tool), monitoring replication lag on slaves and automatic throttling to keep lag under a reasonable threshold.

Let’s assume you want to add a new field called ‘city’ in the ‘user’ table. Instead of running:

mysql> ALTER TABLE mydb.user ADD city VARCHAR(50) NOT NULL DEFAULT '';

you will run:

# pt-online-schema-change --progress=percentage,10 --alter="ADD city VARCHAR(50) NOT NULL DEFAULT ''" h=localhost,u=root,D=mydb,t=user --execute

All ALTER TABLE statements can be converted to an equivalent pt-online-schema-change command.

Is there any tradeoff? The first one is that pt-online-schema-change will be slower than a plain ALTER TABLE. This is expected as the tool has extra code to throttle itself if it adds too much load.

How much slower is it? As always it depends. On a lightly loaded server the difference will be minimal; for instance I measured 3m24s for ALTER TABLE and 3m30s for pt-online-schema-change on a test table. However on a heavily loaded server, it can be something like 5x slower.

The second tradeoff is that pt-online-schema-change adds triggers on the original table to capture data change. So if you already have triggers, you can’t use the tool (this will be lifted in MySQL 5.7).

Metadata Locks

Metadata locks were introduced in MySQL 5.5 for better transaction isolation.

But one side effect is: if you start a schema change operation on a table while another transaction is reading or writing on the same table, ALTER TABLE will be stuck in the Waiting for metadata lock state.

Another negative side effect of metadata locks is that in some situations ALTER TABLE will also block all queries to the table that is being altered, even reads. In the worst cases, this may create a query pileup situation that will freeze the whole server. For more on this topic, have a look at this post.

What about pt-online-schema-change and metadata locks? It is of course hit by this issue/feature. However the good thing is that the timeout for metadata locks is set by the tool to 60s instead of the default 1 year. So after 60s, pt-online-schema-change will simply retry the operation that failed because of metadata locks.
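If you want to see this behavior for yourself, a minimal way to reproduce it is with two or three sessions (the mydb.user table is just the example used earlier in this post):

-- Session 1: leave a transaction open that has read the table
START TRANSACTION;
SELECT * FROM mydb.user LIMIT 1;   -- holds a metadata lock until COMMIT/ROLLBACK

-- Session 2: this ALTER now sits in the "Waiting for metadata lock" state
ALTER TABLE mydb.user ADD city VARCHAR(50) NOT NULL DEFAULT '';

-- Session 3: even a plain read on the same table queues up behind the ALTER
SELECT COUNT(*) FROM mydb.user;

-- Back in Session 1, COMMIT (or ROLLBACK) releases everything
COMMIT;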

MySQL 5.6: Online Schema Changes?

Metadata locks also exist with MySQL 5.6, the same issues as described above can then happen.

However the good news with MySQL 5.6 is that most schema changes can be done online. Now the question is: should you use pt-online-schema-change or a regular online ALTER TABLE statement?

Both have pros and cons:

  • ALTER TABLE is easy to use while being confident with pt-online-schema-change requires some time.
  • There is no way for ALTER TABLE to know if it’s overloading a replica while pt-online-schema-change monitors replication lag and throttles itself if needed.
  • ALTER TABLE is only online for the master, not for replicas while pt-online-schema-change is online for all servers.

The last point can be a bit confusing: why wouldn't an online ALTER TABLE on the master be online on the slave as well? The truth is that unless you are using multi-threaded replication, writes on slaves are serialized. So while the ALTER TABLE is running, the slave will not process any other write, which is not much different from an offline ALTER TABLE.

So if you are running a single server (all Icehouse users for instance), ALTER TABLE is probably the right choice. But if you have read replicas and specifically if the application cannot tolerate replication lag, pt-online-schema-change is definitely a better choice.
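As a side note, with MySQL 5.6 you can also state your expectations explicitly in the ALTER TABLE, so the statement fails immediately instead of silently falling back to a blocking table rebuild; a small sketch reusing the mydb.user example from earlier:

-- Ask for an in-place, non-blocking change; MySQL raises an error
-- if the requested operation cannot be performed that way
ALTER TABLE mydb.user
  ADD city VARCHAR(50) NOT NULL DEFAULT '',
  ALGORITHM=INPLACE, LOCK=NONE;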

Conclusion

Performing schema changes is becoming easier with newer versions of MySQL. However potential issues can be found with any version, so always be careful when you need to run ALTER TABLE.

The post Schema changes in MySQL for OpenStack Trove users appeared first on MySQL Performance Blog.



Creating JSON documents with MariaDB

You probably know about two MariaDB great features: Dynamic Columns and Virtual Colums. I don’t think that we already found all possible use cases for them: more ideas will come in the future, and hopefully someone will write about them. Here, I will describe a use case which involves both Virtual Columns and Dynamic Columns.

Now, suppose that we have relational data in MariaDB, and we want to periodically export it as JSON, so we can feed it to a variety of other data sources. For most data formats, the best exporting solution is generally the CONNECT storage engine; but CONNECT doesn’t currently support JSON, so this option is not available.

Here I will introduce you to the best solution I know. Then, you will find slightly more complex solutions for specific cases.

Our data

Before we can start having fun with MariaDB features, I will show our data. Dear data, this is my reader… say hello!

CREATE OR REPLACE TABLE category (
    id INTEGER AUTO_INCREMENT PRIMARY KEY
  , name VARCHAR(50) NOT NULL
) ENGINE InnoDB CHARACTER SET utf8;

INSERT INTO category (name) VALUES ('Chair'), ('Tables');

CREATE OR REPLACE TABLE product (
    id INTEGER AUTO_INCREMENT PRIMARY KEY
  , category INTEGER NOT NULL
  , name VARCHAR(50) NOT NULL
  , price DECIMAL(9, 2) NOT NULL
  , FOREIGN KEY (category) REFERENCES category (id)
) ENGINE InnoDB CHARACTER SET utf8;

INSERT INTO product (category, name, price) VALUES
  (1, 'Black Chair', '49.99')
, (1, 'Red Chair', '44.00')
, (2, 'Glass Table', '250.00')
;

COLUMN_JSON()

MariaDB supports Dynamic Columns. The COLUMN_JSON() function is used to transform a Dynamic Column value into JSON. The apparent problem here is: we don’t have a dynamic column, we have normal atomic columns.

Well, no problem. After all, a Dynamic Column is just a BLOB (binary) value encoded in a certain way, so that Dynamic Column functions can parse and modify it. We can create a dynamic value using COLUMN_CREATE(), and modify it with COLUMN_ADD(), COLUMN_DELETE(), etc.
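For instance, a quick sanity check of the idea on its own, with made-up values:

SELECT COLUMN_JSON(COLUMN_CREATE('id', 1, 'name', 'Black Chair')) AS json;
-- {"id":1,"name":"Black Chair"}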

That said, the following example query should be self-explanatory:

SELECT CONCAT('[\n\t', GROUP_CONCAT(
    COLUMN_JSON(
        COLUMN_ADD(
            COLUMN_CREATE('id', id)
          , 'name', name
          , 'price', price
        )
    )
    ORDER BY id
    SEPARATOR ',\n\t'
), '\n]') AS json
FROM product \G


Its output is:

*************************** 1. row ***************************
json: [
	{"id":1,"name":"Black Chair","price":49.99},
	{"id":2,"name":"Red Chair","price":44.00},
	{"id":3,"name":"Glass Table","price":250.00}
]
1 row in set (0.00 sec)

Neat, isn't it? Of course it has some limits:

  • It's expensive. No temptables are needed, but MariaDB does a full table scan and transforms the rows on the fly.
  • What if we want to build a JSON array from multiple (joined) tables? Unless there is a daytime when MariaDB does nothing, we don't want to do more than one table scan.

Let's see how we can optimize the basic idea.

Pre-transforming each product

MariaDB also supports Virtual Columns (other DBMSs call them Generated Columns or Computed Columns). These fields are deterministic expressions that don't involve subqueries, and they can be computed on the fly or permanently stored on disk. They are useful in a variety of situations, sometimes in combination with Dynamic Columns.

In our example, we can use PERSISTENT Virtual Columns to store JSON representations of individual products, so that they don't need to be generated on the fly. Here's how:

ALTER TABLE product ADD COLUMN IF NOT EXISTS json TEXT AS (
    COLUMN_JSON(
        COLUMN_ADD(
            COLUMN_CREATE('id', id)
          , 'name', name
          , 'price', price
        )
    )
) PERSISTENT;

To get a complete JSON array from these objects:

SELECT CONCAT(
    '[\n\t'
  , GROUP_CONCAT(json ORDER BY id SEPARATOR ',\n\t')
  , '\n]') AS json
FROM product \G

Output:

*************************** 1. row ***************************
json: [
	{"id":1,"name":"Black Chair","price":49.99},
	{"id":2,"name":"Red Chair","price":44.00},
	{"id":3,"name":"Glass Table","price":250.00}
]
1 row in set (0.00 sec)

Pre-transforming the complete array

Maybe we don't want to query a single table. Since Virtual Columns cannot be computed with a subquery, we cannot use them to put data from other tables into each product's JSON object.

A good alternative is storing the JSON objects in a summary table that is updated every time someone modifies the category or product tables.

I don't like to do this kind of thing within the application, because more than one application could modify the tables, and we could also modify the tables via SQL queries (or via a GUI). To reliably guarantee integrity in such cases, I prefer to use foreign keys and triggers.

I will not show the complete example here, because it is too complex: three kinds of actions (INSERT/UPDATE/DELETE) should be handled (or forbidden) for both the product and category tables. I will just show the summary table, and the stored procedure that creates a JSON object for one product. INSERT and UPDATE triggers will call this procedure when necessary. Here's the code:

CREATE OR REPLACE TABLE json (
    product INTEGER NOT NULL PRIMARY KEY
  , json TEXT
) ENGINE InnoDB CHARACTER SET utf8;

DELIMITER ||
CREATE PROCEDURE make_product_json(IN p_product_id INTEGER)
  MODIFIES SQL DATA
BEGIN
  REPLACE INTO json
    SELECT
        p_product_id
      , COLUMN_JSON(
            COLUMN_ADD(
                COLUMN_CREATE('id', p.id)
              , 'name', p.name
              , 'price', p.price
              , 'category', c.name
            )
        ) AS json
      FROM product p
      LEFT JOIN category c ON p.category = c.id
      WHERE p.id = p_product_id
  ;
END ||
DELIMITER ;
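Just to give an idea of how the triggers would plug into this, here is a minimal sketch of the INSERT trigger on product (the real solution would also need UPDATE/DELETE handling on both product and category, which I am deliberately leaving out):

DELIMITER ||
CREATE TRIGGER product_json_ai AFTER INSERT ON product
FOR EACH ROW
BEGIN
  -- keep the summary table in sync with the newly inserted row
  CALL make_product_json(NEW.id);
END ||
DELIMITER ;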


mylvmbackup 0.16 has been released

I'm happy to announce the release of mylvmbackup version 0.16. The source package is now available for download from http://lenzg.net/mylvmbackup/ and https://launchpad.net/mylvmbackup.

Installation packages for a number of platforms can be obtained from the openSUSE Build Service.

Version 0.16 adds support for sending out SNMP traps in case of backup successes or failures. I'd like to thank Alexandre Anriot for contributing this new feature and for his patience with me.

Please see the ChangeLog and bzr history for more details.



Geographic replication with MySQL and Galera

Introduction

With the introduction of Galera replication for MySQL, Codership has, in my opinion, provided the most significant contribution to MySQL that the community has seen in the last three years.

Codership has filled a significant gap in MySQL's replication capability, succeeding in a task where many before them had failed.

Not only is Codership constantly committed to making their solution better and more solid, they are also the most friendly and helpful team I have met in years.

In short, chapeau to all of the guys! Great job; I am really proud to have the opportunity to discuss and interact with you.

That said, no product or code is free of errors and/or issues at different levels. So even if Galera helps us a lot in solving many replication and architectural issues, it still has some limitations that must be taken into consideration.

Galera is easy to set up, especially in a basic configuration, which will work fine for almost all basic, low-volume, local OLTP traffic.

Unfortunately I have seen many implementations that started as basic and then became something different, in usage and very often in design.

The most common mistake I have seen, from an architectural point of view, is when customers start with a simple cluster of 3 nodes all on a local network, and then, given it "works so great", they decide to add one or two nodes on a geographically distributed site.

This solution is not wrong per se, but it is how it is implemented that makes the difference.

The easiest way to explain what I mean and what should not be done is to use a real case, and describe what was implemented and how we evolved the solution while Codership was optimizing their product.


Customer scenario

The customer approached us a few months after an engagement that was focused on showing them the basics of how MySQL/Galera works.

They reached us in an emergency, given that their cluster was totally stuck and their production site was down.

In order to understand what happened we first need to take a look at what the customer had put up.

Customer environment description

The initial environment of one single application node pointing to a cluster of 3 MySQL/Galera nodes was transformed into 6 (or more) application nodes getting traffic from a load balancer. Each application node could write to any data node (round robin) residing in the same geographic area (at least), but the cluster was geographically distributed: 3 nodes in site A, 3 nodes in site B.

As a result the customer's navigation was jumping from one application server to another, and from one data node to another as well.

Transactions were small and relevant only for a single request.

Finally, each application node had a timer to clean up "old" sessions; the concept of OLD was relevant only to the local node, and it could happen that an OLD session was in fact still ongoing on another application node.

In short, the customer was randomly pushing traffic to both zones, using a single table for the sessions to coordinate the activities, but also purging them without any good locking mechanism:

Problem description

When I connected to the machine, I was not aware of the architecture as it was; I was still referring to what I had found in our history, so one application node and a simple cluster.

It took me 5 seconds to see that the situation was quite different. The cluster was suffering from many points of view: Flow Control was high very often, and nodes between the zones were often declared dead.

It took me some time to make the customer understand why that was not working correctly, and why the architecture was designed wrongly, mostly because they were claiming that it was working fine before (reminds me of "it works on my machine", doesn't it?). There was discussion about what "before" was, how much traffic/load there was, and so on.

Finally, they redirected all the traffic to one zone only, and at least the FC went down enough to have the site working, and to give me the chance to review the settings and try to optimize them.

To keep it up I had to increase the FC related variables, modify the Galera parameters for node timeouts, and obviously optimize MySQL itself, and we survived for a bit.

Given the scope of this article is geographic replication, I am not going to report the whole review, but only the part related to the subject.

As said, I had extended the timeouts and FC, but I was not happy, and as such I tested in more detail what the network was really able to transmit.

 

 

[root@xx ~]# netperf -H 10.0.0.12 -t TCP_RR -v 2 -- -b 6 -r 32K -S 256K -S 256K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.12 () port 0 AF_INET : first burst 6
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  32768    32768   10.01    97.45
249856 249856
Alignment      Offset         RoundTrip  Trans    Throughput
Local  Remote  Local  Remote  Latency    Rate     10^6bits/s
Send   Recv    Send   Recv    usec/Tran  per sec  Outbound   Inbound
    8      0       0      0   71830.852  97.451   25.546     25.546   <------------ 71ms

 

So a simple round trip for a small record of 32K takes 71 ms. What if I simulate a basic load of 1MB?

 

[root@xx ~]# netperf -H 10.2.3.30 -t TCP_RR -v 2 -- -b 6 -r 1M,32K -S 256K -S 256K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.2.3.30 () port 0 AF_INET : first burst 6
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1048576  32768   10.00    2.70
249856 249856
Alignment      Offset         RoundTrip    Trans    Throughput
Local  Remote  Local  Remote  Latency      Rate     10^6bits/s
Send   Recv    Send   Recv    usec/Tran    per sec  Outbound   Inbound
    8      0       0      0   2592709.047  2.700    22.648     0.708   <-- we have 2592.709047 ms latency

 

 

Note: to understand better what was going on and why I use this method to evaluate the network, read my other article (Effective way to check the network connection performance, when using replication geographically distributed).

What we had to do

With this kind of connectivity, it is very unlikely that we can have Galera replication working correctly. Not even if we keep only one site as master and the other as standby. The way the nodes communicate and validate the writesets would be too demanding.

See the diagram below from Codership to understand why.

 

The only alternative to this is to use an asynchronous connection and set up two nodes, one for each site, in master-master mode:

Setting up master-master means that we lose the automatic replication shift from one node to another. That is, while this solution fixes the latency problem because it uses asynchronous replication, it does imply that in case of a crash of one of the two connected nodes, you must manually move the replication to another node.

How to set it up and how to fix it when it breaks

Setting up asynchronous replication in a Galera cluster is quite easy.

Assuming you have two nodes connected as master-master (A1, B1), and assuming the only active one was A1 and that it crashes, the only thing you need to do is to check on B1 for:

 

+----------------------+-----------+
| Variable_name        | Value     |
+----------------------+-----------+
| wsrep_last_committed | 221463896 |
+----------------------+-----------+

 

 

in the status variables. Then look on the node you have chosen as the new master, say A2, for the binary log containing the

 

Xid = 221463896

 

 

And issue the command CHANGE MASTER with the coordinates you will find there.

Then on A2, set the node as a slave of B1 using the master coordinates from that node.
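As a rough sketch of the whole sequence in SQL (hostnames, file names and positions below are placeholders, not real values from this setup):

-- On B1 (the surviving async node): note the last applied transaction
SHOW GLOBAL STATUS LIKE 'wsrep_last_committed';   -- e.g. 221463896

-- On A2 (the node picked as the new master): find the binary log event
-- whose Xid matches that value, e.g. by scanning the candidate log
SHOW BINLOG EVENTS IN 'mysql-bin.000123';

-- Then on B1, point replication at A2 using those coordinates
CHANGE MASTER TO
  MASTER_HOST='a2.example.com',
  MASTER_USER='repl',
  MASTER_PASSWORD='...',
  MASTER_LOG_FILE='mysql-bin.000123',
  MASTER_LOG_POS=456789;
START SLAVE;

-- Finally, on A2, run the symmetric CHANGE MASTER pointing back at B1
-- to restore the master-master pair.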

The procedure is simple and obviously worked decently, given the situation.

The customer was obviously not happy, because they would like to have the two clusters in perfect sync, but this was at least good enough to keep the site working efficiently. Moreover, after we set it up we had at least two incidents in production, and we were able to successfully fail over to the DR site and then fail back with less than a few seconds of service interruption.

Anyhow, our friends at Codership had already released the new version of Galera (3.x), working on MySQL 5.6.

So while the customer was refining and correcting a lot of issues on their side, mainly in the code and in how the application was using the common table, I reviewed, tested and crashed MySQL 5.6/Galera 3.

Galera 3.0 comes with some interesting new enhancements; for an overview see http://www.slideshare.net/Severalnines/webinar-slides-27249453.

At the moment of writing, Seppo is about to deliver his talk on the new features in Galera 4 at Percona Live London 2014. (http://www.percona.com/live/london-2014/sessions/galera-cluster-new-features)

Galera 3.X group replication and how this can change everything.

In Galera 3, the introduction of the segment concept basically isolates/identifies a group of data nodes inside a cluster.

The meaning is that nodes with the same segment value will communicate extensively with each other, while between segments each segment elects one node as the communication channel:

 

Quorum calculation also takes the segment allocation into account.

In the presentation I have linked here there is mention that using segments will allow managing WAN latencies of 100-300 ms.

Given I am a skeptic, I decided to perform extensive tests, and see/identify what it is really possible to accomplish.

Investigations

I did several tests, involving many parameters in Galera, but at the end I had two main actors I was playing with: gcs.max_packet_size and the gcache page size.

All the other ones I changed had proven to be stable and efficient; below are the settings I used.
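For reference, the effective values of these provider options, and the flow control counters discussed earlier, can be checked at runtime on any node; nothing here is specific to my setup:

SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options'\G
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_%';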

How to test the problem

I have setup 8 different tests:

1) simple insert on 1 node from 3 apps
2) simple insert on 1 node per segment (only one app on DR, two on production)
3) simple insert on 1 node from 3 apps, GALERA PACK larger
4) simple insert on 1 node per segment (only one app on DR, two on production), GALERA PACK larger
5) simple insert on 1 node from 3 apps, BUT with WAN connection
6) simple insert on 1 node per segment (only one app on DR, two on production), BUT with WAN connection
7) simple insert on 1 node from 3 apps, BUT with WAN connection AND GALERA PACK larger
8) simple insert on 1 node per segment (only one app on DR, two on production), BUT with WAN connection AND GALERA PACK larger

 

Each test used an OLTP approach, with 5 main tables and 6 child tables with the following structure:

 

CREATE TABLE `tbtest1` (
  `autoInc` bigint(11) NOT NULL AUTO_INCREMENT,
  `a` int(11) NOT NULL,
  `uuid` char(36) COLLATE utf8_bin NOT NULL,
  `b` varchar(100) COLLATE utf8_bin NOT NULL,
  `c` char(200) COLLATE utf8_bin NOT NULL,
  `counter` bigint(20) DEFAULT NULL,
  `time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `partitionid` int(11) NOT NULL DEFAULT '0',
  `date` date NOT NULL,
  `strrecordtype` char(3) COLLATE utf8_bin DEFAULT NULL,
  PRIMARY KEY (`autoInc`,`partitionid`),
  KEY `IDX_a` (`a`),
  KEY `IDX_uuid` (`uuid`)
) ENGINE=InnoDB AUTO_INCREMENT=3474631 DEFAULT CHARSET=utf8 COLLATE=utf8_bin

CREATE TABLE `tbtest_child1` (
  `a` int(11) NOT NULL,
  `bb` int(11) NOT NULL AUTO_INCREMENT,
  `date` date NOT NULL,
  `partitionid` int(11) NOT NULL DEFAULT '0',
  `stroperation` varchar(254) COLLATE utf8_bin DEFAULT NULL,
  `time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`a`,`bb`),
  UNIQUE KEY `bb` (`bb`)
) ENGINE=InnoDB AUTO_INCREMENT=36289 DEFAULT CHARSET=utf8 COLLATE=utf8_bin

 

 

Using 3 application nodes, each having 8 child threads, for a total of 24 threads writing to the cluster.

The tests from 1 to 4 were done as a baseline, so with NO WAN network simulation.

When I mention Galera pack larger I refer specifically to:

gcs.max_packet_size

This value is "All writesets exceeding that size will be fragmented." And it is set by default to 32Kb, which means anything larger that will be fragmented in side galera. My understanding is that writeset package exceeding that dimension will be split in smaller packages. But given the dimension of 32K, or even less is significantly larger then the network frame dimension, the packet fragmentation will take place in any case, but at lower level (network).

My reasoning here was: why should I want to do that, if I already have frame fragmentation at the network (TCP/IP) level? I may have got the meaning of the parameter wrong, but if not it is worth a try.

In this regard I had also optimized the machine's network layer:

net.core.optmem_max = 40960
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 50000
net.ipv4.tcp_max_syn_backlog = 30000
net.ipv4.tcp_congestion_control = htcp
net.ipv4.tcp_mtu_probing = 1
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_slow_start_after_idle = 0

I ran the tests on 6 data nodes using 2 segments, in a virtualized environment: 8 cores, 8GB RAM, RAID 10 with 6 10K RPM disks, CentOS 6.

On purpose I skipped HAProxy and preferred to point directly at the nodes I had defined.

The applications were closing and opening a connection at each transaction to emulate the customer's behavior.

When simulating the WAN network I had set a final total latency of:

 

[root@galera1h1n5 galera56N1]# netperf -H 10.0.0.33 -t TCP_RR -v 2 -- -b 6 -r 120K,64K -S 156K -S 156K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.33 () port 0 AF_INET : first burst 6
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

65536  87380  122880   65536   10.00    1.90
319488 319488
Alignment      Offset         RoundTrip    Trans    Throughput
Local  Remote  Local  Remote  Latency      Rate     10^6bits/s
Send   Recv    Send   Recv    usec/Tran    per sec  Outbound   Inbound
    8      0       0      0   3684481.420  1.900    1.868      0.996   <----- 3684.481 ms

 

With packet reordering 15-20%.

Just for the sake of comparison, a PING would have reported:

[root@galera1h1n5 galera56N1]# ping -M do -s 1472 -c 3 10.0.0.33
PING 10.0.0.33 (10.0.0.33) 1472(1500) bytes of data.
1480 bytes from 10.0.0.33: icmp_seq=1 ttl=64 time=403 ms
1480 bytes from 10.0.0.33: icmp_seq=2 ttl=64 time=389 ms
1480 bytes from 10.0.0.33: icmp_seq=3 ttl=64 time=410 ms

 

So ~400 ms for 1472 (1500) bytes.

 

With all this in place, I started to roll the ball and collect my data.

Note that I ran each set of tests at least 5 times to validate the data collected, and the data was cleaned, refreshed and the InnoDB buffer pool reloaded before each test.

Before reaching the last set of tests (the final 2 sets) I had run a total of 64 variants of the tests, sometimes crashing Galera, sometimes getting such bad performance that it was obvious it would not be usable in production, and so on.

Anyhow, I was more focused on the replication performance/behavior than on what MySQL/InnoDB was doing, given I was not looking for ultra-fine tuning at that level; actually some imperfections fit my purpose better.

Related issues

Initially I was not using optimized network settings, and as soon as I reached tests 7-8 for the first time I had issues; in short, test 7 was taking 4 times longer than test 5 (its twin), and in some cases it was so bad that I had to stop it.

But once the network was set up correctly most of the issues disappeared.

Results

Who wins?

There is not a single winner; all the tests show that there is the need to carefully pick the right solution for the job.

But these two sets of tests and settings had proven to be stable and better performing than all the others in my simulation:

 

This first image shows the set of tests performed with the gcache page set to a lower value of 32MB and more instances (4).

 

This second one shows the tests with a larger page of 250MB and only 2 instances.

I did not include the initial test where the page size/instances were left at the defaults, because it provided less performance.

Comparing the two, we can see that there is a different behavior between the tests done on a local network and the ones using the WAN.

In the case of a LAN, where the cluster is less limited by network performance, the use of more, smaller gcache pages with more instances allowed the cluster and MySQL to insert ~5700-5300 rows/sec.

The use of a large gcache page and only 2 instances, on the other hand, saw performance drop to ~5500-4950 rows/sec.

The difference is not significant with this traffic, but it is interesting to note that it exists and is consistent across all the tests.

To me, this means we have something that adds a cost, a significant cost, when Galera needs to access a smaller number of gcache pages. Given this is a memory access, I am inclined to think there is a mutex that prevents concurrent usage of the single page/resource.

On the other hand, the large gcache page seems to provide better support in the case of a distributed cluster.

I assume that, given the transmission is slowed down by the network, Galera/MySQL is able to better utilize the smaller number of gcache memory page instances.

Nevertheless, the large gcache with fewer instances was less effective when inserting into multiple nodes than the small gcache page with more instances and a larger gcs.max_packet_size.

But that is not all.

When using a larger packet size (gcs.max_packet_size), I was consistently able to complete the tests in less time when using a geographically distributed cluster, a large gcache page and multiple nodes writing.

While when the cluster was operating locally, the small (standard) gcs.max_packet_size worked better.

Let us navigate the results.

Segments distribution

The first test I did was to be sure that the segments were working as expected.

The cluster was compose by 2 segments, each segment 3 nodes.

Segment 1 nodes IP ending with: 5,21,31

Segment 2 nodes IP ending with 2,23,33

 

 

Local cluster (same network)

Display paused
           19.1Mb        38.1Mb       57.2Mb        76.3Mb       95.4Mb
└───────────────┴────────────────┴────────┴────────────────┴────────────────
10.0.0.5  <=> 10.0.0.21   1.65GB  76.8Kb  63.9Kb  67.1Kb
10.0.0.5  <=> 10.0.0.31   1.63GB  63.8Kb  51.8Kb  55.2Kb
10.0.0.5  <=> 10.0.0.23   1.58GB  31.8Kb  26.2Kb  28.5Kb  <----
10.0.0.5  <=> 10.0.0.2    30.8MB  19.3Kb  15.4Kb  16.2Kb
10.0.0.5  <=> 10.0.0.33    270KB    848b  1.81Kb  1.74Kb

Display paused
           19.1Mb        38.1Mb       57.2Mb        76.3Mb       95.4Mb
└───────────────┴────────────────┴────────┴────────────────┴────────────────
10.0.0.21 <=> 10.0.0.5    1.70GB  70.7Kb  71.6Kb  67.5Kb
10.0.0.21 <=> 10.0.0.31   83.4MB  44.9Kb  43.3Kb  40.9Kb
10.0.0.21 <=> 10.0.0.2    31.8MB  15.6Kb  16.9Kb  16.5Kb  <---- ??
10.0.0.21 <=> 10.0.0.33   32.1MB  15.6Kb  16.8Kb  16.3Kb  <---- ??
10.0.0.21 <=> 10.0.0.23    269KB    848b  1.34Kb  1.36Kb

Display paused
           19.1Mb        38.1Mb       57.2Mb        76.3Mb       95.4Mb
└───────────────┴────────────────┴────────┴────────────────┴────────────────
10.0.0.31 <=> 10.0.0.5    1.67GB  56.5Kb  58.3Kb  55.7Kb
10.0.0.31 <=> 10.0.0.21   83.4MB  44.7Kb  43.3Kb  41.5Kb
10.0.0.31 <=> 10.0.0.2    31.7MB  19.6Kb  17.6Kb  17.2Kb  <---- ??
10.0.0.31 <=> 10.0.0.33    273KB    848b  1.98Kb  1.89Kb
10.0.0.31 <=> 10.0.0.23    269KB    848b  1.85Kb  1.72Kb

Display paused
           19.1Mb        38.1Mb       57.2Mb        76.3Mb       95.4Mb
└───────────────┴────────────────┴────────┴────────────────┴────────────────
10.0.0.2  <=> 10.0.0.23   1.70GB  71.2Kb  71.2Kb  67.6Kb
10.0.0.2  <=> 10.0.0.33    111MB  58.2Kb  57.6Kb  54.6Kb
10.0.0.2  <=> 10.0.0.31   31.7MB  20.2Kb  17.7Kb  16.9Kb  <---- ??
10.0.0.2  <=> 10.0.0.5    30.8MB  15.6Kb  17.3Kb  16.5Kb  <---- ??
10.0.0.2  <=> 10.0.0.21   31.8MB  15.6Kb  16.9Kb  16.5Kb  <---- ??

Display paused
           19.1Mb        38.1Mb       57.2Mb        76.3Mb       95.4Mb
└───────────────┴────────────────┴────────┴────────────────┴────────────────
10.0.0.23 <=> 10.0.0.2    1.65GB  71.2Kb  71.2Kb  68.3Kb
10.0.0.23 <=> 10.0.0.33   1.60GB  43.7Kb  43.4Kb  41.7Kb
10.0.0.23 <=> 10.0.0.5    1.62GB  28.1Kb  29.6Kb  28.3Kb  <----
10.0.0.23 <=> 10.0.0.31    269KB    848b  1.85Kb  1.76Kb
10.0.0.23 <=> 10.0.0.21    269KB    848b  1.34Kb  1.36Kb

Display paused
           19.1Mb        38.1Mb       57.2Mb        76.3Mb       95.4Mb
└───────────────┴────────────────┴────────┴────────────────┴────────────────
10.0.0.33 <=> 10.0.0.2     111MB  58.8Kb  57.6Kb  55.4Kb
10.0.0.33 <=> 10.0.0.23   1.65GB  43.7Kb  43.3Kb  41.9Kb
10.0.0.33 <=> 10.0.0.21   32.1MB  15.6Kb  16.8Kb  16.5Kb  <---- ??
10.0.0.33 <=> 10.0.0.31    273KB    848b  1.98Kb  1.89Kb
10.0.0.33 <=> 10.0.0.5     274KB    848b  1.78Kb  1.73Kb


Wan distributed cluster

Display paused
           19.1Mb        38.1Mb       57.2Mb        76.3Mb       95.4Mb
└───────────────┴────────────────┴────────┴────────────────┴────────────────
10.0.0.5  <=> 10.0.0.31   1.71GB  30.2Kb  31.3Kb  31.0Kb
10.0.0.5  <=> 10.0.0.21   1.71GB  28.1Kb  30.4Kb  30.4Kb
10.0.0.5  <=> 10.0.0.2    1.61GB  15.1Kb  16.3Kb  16.6Kb  <----
10.0.0.5  <=> 10.0.0.33    682KB  1.19Kb  1.91Kb  1.77Kb
10.0.0.5  <=> 10.0.0.23   8.83MB  3.91Kb  1.77Kb  1.87Kb  <---- ??
10.0.0.5  <=> 10.0.0.153  5.05KB      0b      0b    572b
10.0.0.5  <=> 10.0.0.151  5.91KB      0b      0b    535b

Display paused
           19.1Mb        38.1Mb       57.2Mb        76.3Mb       95.4Mb
└───────────────┴────────────────┴────────┴────────────────┴────────────────
10.0.0.21 <=> 10.0.0.31    194MB  28.7Kb  31.2Kb  31.3Kb
10.0.0.21 <=> 10.0.0.5    1.76GB  28.5Kb  30.6Kb  30.4Kb
10.0.0.21 <=> 10.0.0.23   73.8MB  15.6Kb  17.1Kb  17.1Kb  <---- ??
10.0.0.21 <=> 10.0.0.33    298KB  1.86Kb  2.00Kb  1.91Kb
10.0.0.21 <=> 10.0.0.2    5.83MB  2.06Kb  1.53Kb  1.42Kb  <---- ??
10.0.0.21 <=> 10.0.0.12    351KB  1.53Kb  1.34Kb  1.41Kb

Display paused
           19.1Mb        38.1Mb       57.2Mb        76.3Mb       95.4Mb
└───────────────┴────────────────┴────────┴────────────────┴────────────────
10.0.0.31 <=> 10.0.0.5    1.76GB  28.7Kb  31.3Kb  31.1Kb
10.0.0.31 <=> 10.0.0.21    192MB  28.7Kb  31.2Kb  31.3Kb
10.0.0.31 <=> 10.0.0.33   53.3MB  17.8Kb  17.9Kb  17.7Kb  <---- ??
10.0.0.31 <=> 10.0.0.23   1.23MB  3.93Kb  2.01Kb  2.31Kb
10.0.0.31 <=> 10.0.0.2     323KB    528b  1.77Kb  1.81Kb
10.0.0.31 <=> 10.0.0.13    354KB  1.28Kb  1.32Kb  1.38Kb

Display paused
           19.1Mb        38.1Mb       57.2Mb        76.3Mb       95.4Mb
└───────────────┴────────────────┴────────┴────────────────┴────────────────
10.0.0.2  <=> 10.0.0.23   1.65GB  28.5Kb  30.9Kb  31.2Kb
10.0.0.2  <=> 10.0.0.33   1.65GB  27.7Kb  30.5Kb  31.0Kb
10.0.0.2  <=> 10.0.0.5    1.64GB  15.1Kb  16.3Kb  16.6Kb  <----
10.0.0.2  <=> 10.0.0.31    323KB    528b  1.77Kb  1.81Kb
10.0.0.2  <=> 10.0.0.21   5.88MB  2.06Kb  1.53Kb  1.46Kb  <---- ??

Display paused
           19.1Mb        38.1Mb       57.2Mb        76.3Mb       95.4Mb
└───────────────┴────────────────┴────────┴────────────────┴────────────────
10.0.0.23 <=> 10.0.0.33    118MB  29.8Kb  31.5Kb  31.5Kb
10.0.0.23 <=> 10.0.0.2    1.70GB  29.5Kb  30.8Kb  31.2Kb
10.0.0.23 <=> 10.0.0.21   74.0MB  15.4Kb  16.9Kb  17.0Kb  <---- ??
10.0.0.23 <=> 10.0.0.31   1.22MB  3.31Kb  1.85Kb  2.27Kb
10.0.0.23 <=> 10.0.0.5    8.96MB  3.40Kb  1.67Kb  1.84Kb  <---- ??
10.0.0.23 <=> 10.0.0.12    359KB  1.47Kb  1.47Kb  1.36Kb

Display paused
           19.1Mb        38.1Mb       57.2Mb        76.3Mb       95.4Mb
└───────────────┴────────────────┴────────┴────────────────┴────────────────
10.0.0.33 <=> 10.0.0.23    118MB  28.1Kb  31.5Kb  31.4Kb
10.0.0.33 <=> 10.0.0.2    1.70GB  27.5Kb  30.2Kb  30.7Kb
10.0.0.33 <=> 10.0.0.31   53.3MB  17.8Kb  17.9Kb  17.7Kb  <---- ??
10.0.0.33 <=> 10.0.0.21    298KB  1.86Kb  2.00Kb  1.91Kb
10.0.0.33 <=> 10.0.0.5     686KB  1.03Kb  1.96Kb  1.78Kb
10.0.0.33 <=> 10.0.0.13    370KB  1.41Kb  1.44Kb  1.42Kb

 

 

Apparently all was working fine, and if you notice, nodes 5-23 and 5-2 seem to work as bridges between the segments.

But if you look carefully (I have marked the cases with <---- ??), you will see that there is additional traffic going on; for instance in the first set we have nodes 33-21 exchanging 32MB of data, and in the second set (WAN), even more, we have nodes 23-21 exchanging 73MB.

It seems like nothing, but it means there is an exchange of some type. What kind of traffic is going on between SOME nodes? I say SOME on purpose, because other nodes are not generating this traffic, which I consider an anomaly.

Maybe this is by design; if so I would like to know, and would like to know why.

Anyhow, aside from this, most of the exchange is actually flowing as expected.

 

More inside the replication

As mentioned before, the whole exercise was to identify if, and if so which, are the best settings to use for a cluster working in a geographically distributed environment. In relation to that, the most relevant tests are numbers 6 and 8:

6) simple insert on 1 node per segment (only one app on DR, two on production), BUT with WAN connection
8) simple insert on 1 node per segment (only one app on DR, two on production), BUT with WAN connection AND GALERA PACK larger

I am also comparing the differences between the gcache.keep_pages_size and gcache.page_size settings.

Test6

Test8

Reviewing the two graphs above, related to the replication queue, we can see that neither test reports Flow Control action.

In the case of Test6, the receive and send queues had some spikes that were not present in Test8, meaning the two queues were processed more efficiently when max_packet_size is larger.

Note that wsrep_evs_repl_latency in Test8 is higher, so the messages seem to take longer to be dispatched, but in the end the performance is better.

 

Test6

Test8

Reviewing the graphs related to writeset dimension and netstat, we can notice that the writesets are larger in the test with the default max_packet_size, and network traffic is also higher. I was surprised by that, and I repeated the tests several more times. In the end, magnifying the information related to the transmitted packets, we can notice that when using a larger max_packet_size, the distance between received and transmitted is smaller.

Test6

Test8

Given that in this test we have writes going to two nodes, my conclusion is that the network layer was able to optimize the data transmission, reducing the number of vectors (frames) and as such reducing the overhead.

 

Test6

Test8

To complete the review related to Galera replication, let's take a look at the parallelization efficiency graphs.

Here the differences are more evident. The apply window (average distance between highest and lowest concurrently applied seqno) when using the default max_packet_size is definitely higher, as is wsrep_commit_window (average distance between highest and lowest concurrently committed seqno).

Also wsrep_cert_deps_distance (average distance between highest and lowest seqno value that can possibly be applied in parallel, i.e. the potential degree of parallelization) is steady in the first case, while with a larger max_packet_size it goes down.

I was inclined to read this as a more optimized way of applying and committing the incoming data.
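All the counters discussed here are ordinary status variables, so the same numbers can be pulled from any node with a plain query, for example:

SHOW GLOBAL STATUS WHERE Variable_name IN
  ('wsrep_apply_window', 'wsrep_commit_window', 'wsrep_cert_deps_distance',
   'wsrep_evs_repl_latency', 'wsrep_local_send_queue_avg', 'wsrep_local_recv_queue_avg');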

Test6

Test8

 

Reviewing the writeset bytes graphs, we can note that while the bytes replicated remain very close between Test6 and Test8, the bytes received are quite a bit larger in the case of the smaller/default max_packet_size.

So the node was in fact receiving less data in Test8, while the amount of data coming from the application stayed constant.

Test6

Test8

 

Analyzing the process information, Test6 (default max_packet_size) has more major page faults than Test8, and both system time and user time are larger in Test6. This means more I/O operations.

The differences are not huge, but they are consistent across the different tests.

 

Conclusions?

I have collected, and am still collecting, a lot of additional data, but since the scope of this article is the network and replication side, I prefer to stop here.

There is not A single conclusion about the whole MySQL/Galera setup, but let me say that I strongly suggest not using a monolithic setup for all cases.

Instead, adapt the configuration to the architecture you are implementing.

 

So far and from the tests I have done, I can say that:

Cluster implemented in a single site with very good connectivity, using MySQL/Galera for HA or as a write-distribution platform:

  • Use multiple gcache.keep_pages_size (I related the number to the number of cores, but I don't see any benefit in going above 8 instances at the moment)
  • Use gcache.page_size never less than 32MB and no more than 256MB total
  • Use small gcs.max_packet_size 32K

Cluster implemented across geographically distributed sites, using MySQL/Galera on the second site for HA only:

  • Use one or two gcache.keep_pages_size
  • Use the default gcache.page_size or a little bit larger, like 256M
  • Use a larger gcs.max_packet_size, e.g. 2097152 (as in the configuration below)

 

In the final tests I have been using the two following configurations, and they have worked great also with significant load, not only in the tests but also when we implemented them at the customer site, using the same load and operations that were crashing the system with Galera 2.x or with other Galera 3.x configurations.

Configuration used for local cluster:

 

wsrep_provider_options = "gmcast.segment=1|2;socket.ssl_compression=no;gcache.keep_pages_size=2;gcache.page_size=125M; gcache.size=5024M;evs.send_window=512;evs.user_send_window=512; evs.use_aggregate = true; gcs.fc_factor = 1;  gcs.fc_limit = 32; gcs.max_packet_size = 32768;gcs.fc_debug = 0; evs.inactive_check_period = PT1S;  evs.inactive_timeout = PT15S;evs.suspect_timeout = PT5S; evs.stats_report_period = PT1M;"

 

 

Configuration used for geographically distributed DR:

 

wsrep_provider_options = "gmcast.segment=1|2;socket.ssl_compression=no;gcache.keep_pages_size=4;gcache.page_size=64M; gcache.size=5024M;evs.send_window=1024;evs.user_send_window=1024; evs.use_aggregate = true; gcs.fc_factor = 1;  gcs.fc_limit = 260; gcs.max_packet_size = 2097152;gcs.fc_debug = 0; evs.inactive_check_period = PT30S; evs.inactive_timeout = PT1M;evs.suspect_timeout = PT1M; evs.stats_report_period = PT3M;"

 

 

 

Please let me know if you experience different results, and how/why.

 

Great MySQL .... & Galera to all.


PlanetMySQL Voting: Vote UP / Vote DOWN

Aurora for MySQL is coming

I am excited about Aurora for MySQL. While there aren't many details, I have two conclusions from the information that is available. First, many talented people did great work on this. Second, many customers want the features it provides and some of these features are otherwise unavailable unless you are web-scale and can afford a team of automation experts. This is a big deal and good for the MySQL community. I am not just writing this to boost my priority on the Aurora preview signup list.
Soapbox

Before describing what I learned about it I have a short editorial.
  • Is Amazon the reason we need AGPL? It isn't clear to me that they improve upstream MySQL. They do benefit the community by making it easier to run MySQL in the cloud. But I don't see them in the community. Which means I also don't see their developers in the community and who wants to disappear while working on mostly open-source?
  • Their marketing leads with "up to 5X faster than MySQL", which is translated by the tech press into "5X faster than MySQL". When I signed up for an Aurora preview the response email from Amazon also used the "5X faster than MySQL" claim. I prefer "up to 5X faster than MySQL".
  • In the video from AWS re:Invent 2014 James Hamilton makes some interesting claims.
    • I am not sure he heard about InnoDB based on this statement -- Just doing a state of the art storage engine would have been worth doing. Take Jim Gray's black book on transactions, implement it, I would be happy. Someone can tell him to be happy.
    • In big print -- 3X write performance, 5X read performance. In small print -- sysbench. It will take time to determine whether similar performance & availability can be had elsewhere at a similar cost.
    • In big print -- 400X less lag, 2 seconds vs 5 milliseconds. I doubt this. The slide also stated -- synchronous multi-DC replication. I don't see how that is possible except within a datacenter given network latency. Other content from Amazon claims this is async replication. But then Mr. Hamilton stated that everything is there, transactions are not lost were two data centers to instantly disappear. Again, that requires sync replication. This is confusing and I am surprised his claims don't match the other documentation from Amazon.
    • It can fix torn pages. InnoDB does this for writes in progress during a crash. I assume Aurora can also do this for pages written a long time ago and that would be a big deal.
Information

What is Aurora? I don't know and we might never find out. I assume it is a completely new storage engine rather than a new IO layer under InnoDB. I encourage people to read the details page, FAQ and pricing guide. The response from Clustrix is also useful.
The big deal is that the database is replicated 6X using 3 availability zones (AZs). I assume this means it uses 2 storage clusters per AZ. Documentation from Amazon states this is done using async replication and (committed) writes are available within a few milliseconds or 10s of milliseconds. I assume this is only for Aurora read replicas in the same AZ as the master. Network latency will make that impossible for some remote AZs.  In the presentation at AWS re:Invent there are claims that the replication is synchronous. That is confusing.
For now I will assume that Aurora does async replication to the two remote AZs and the non-primary storage cluster in the primary AZ -- which means that 5 of the 6 copies are updated via async replication. If this is true, then an outage at the primary storage cluster can mean that recent commits are lost. It would be great if someone were to make this clear. My preferred solution would be sync replication within the primary AZ to maintain copies in 2 storage clusters, and then async replication to the 4 copies in the 2 remote AZs. We will soon have multiple solutions in the MySQL community that can do sync replication within a datacenter and async replication elsewhere -- Galera, upstream and lossless semisync. But Aurora is much easier to deploy than the alternatives. My standard question is what commit rate can be sustained when all updates are to the same row? Sync replication with cross-country network round trips makes that slow.
The presentation also claimed this was mostly drop-in compatible. I am interested in compatibility with InnoDB. 
  • What are the semantics for cursor isolation? PostgreSQL, InnoDB and Oracle all do snapshot isolation with different semantics for writes. PostgreSQL has excellent documentation to describe the behavior.
  • Does Aurora support clustered indexes? 
  • What is the max size of an index key?
  • How are large columns supported? Is data always inline?
  • How is multi-versioning implemented? InnoDB usually does updates in place so there isn't much work for purge to do except for deletes and updates to secondary index columns. 
  • Does this use pessimistic or optimistic concurrency control?
  • Does this support partitioning?
  • What is the block size for reads?
  • Does this use compression?
Features

The brief description of features is very interesting. I will summarize that here. The features sound great and I expect them to get a lot of customers.
  • Unplanned failover to an Aurora read replica takes a few minutes. Unplanned failover to a new instance can take up to 15 minutes. This happens for customers who aren't spending money on Aurora read replicas. This is another feature that will make Aurora very popular. While it will be nice to make failover faster, the big deal is that they provide this. The usual MySQL deployment required some do-it-yourself effort to get something similar.
  • Storage uses SSD. I assume this is based on EBS. They do background scrubbing to detect and correct corrupt pages. Storage grows in 10G increments. You don't have to provision for a fixed amount of GB or TB, they will grow as needed up to 64T (or 64T/table). I am not sure I would want 64T in one instance, but it can make it easy to archive data in place. Also automatic growth to such large database sizes will make it much easier for deployments to avoid sharding especially when an instance has many TB of cold data.
  • There are interesting features for point-in-time recovery, incremental backup and snapshots. This is integrated with S3. Incremental backups make it affordable to archive data in place as you don't do full backups for data that doesn't change. But I don't understand all of their backup options. 
  • Database is replicated 2 times within 3 AZs so there are 6 copies. Up to 2 copies can be lost and writes are still possible. Up to 3 copies can be lost and reads are still possible. I assume that by copies can be lost they mean storage clusters can be lost. Automatic recovery here is another big deal.
  • The buffer pool survives mysqld process restart. I wonder if that is only true for planned restart. Regardless, this is a very useful feature for IO-bound workloads when it is important to have a warm cache. Percona used to have a patch for this with InnoDB.
  • Replication is supported in two ways -- via Aurora (at storage level) and MySQL (binlog). My bet is that Aurora replication will be most popular but some people will use MySQL replication to replicas with locally attached storage to save money. They claim much less lag with Aurora and I agree there will be less but I am not sure there will be 400X less. However, they avoid some overhead on commit by not using the binlog and they avoid a lot of complexity by not requiring MySQL replication on the master or slave. I assume that Aurora replication ships deltas rather than page images to be efficient on the network.
Cost

Cost comparisons will be interesting. I am very uncertain about my estimates here as I don't have much experience with prices on AWS or for normal sized datacenter deployments limited to a few servers. I estimate a 3 year cost of $568,814 for a high-end Aurora deployment: largest servers, 1 master, 1 replica, backup and storage. It will be interesting to compare this to non-AWS hardware because you need to account for features that Aurora provides and for the extra availability of the Aurora storage solution. I used 3TB and 10k IOPs from storage because that can be provided easily by a high-end PCIe flash device, but that also assumes a very busy server.
  • Reserved db.r3.8xlarge (32 vCPUs, 244GiB RAM) for 2 servers is $115,682 over 3 years
  • 50TB of backup at 10 cents/GB/month is $53,100 over 3 years
  • 3TB of storage and 10,000 IOPs per server is $400,032 for 2 servers over 3 years. But I am not sure if the storage cost includes IO done to all AZs to maintain the 6 database copies. In that case, the storage cost might be higher.
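
For reference, those three line items do add up to the quoted figure: $115,682 + $53,100 + $400,032 = $568,814 over 3 years.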

PlanetMySQL Voting: Vote UP / Vote DOWN

MySQL Workbench 6.2.4 GA has been released

The MySQL developer tools team announces 6.2.4 as our maintenance release for MySQL Workbench 6.2.

MySQL Workbench 6.2 is the new version of the official MySQL graphical development tool.

MySQL Workbench 6.2 focuses on support for innovations released in MySQL 5.6 and MySQL 5.7 DMR (Development Release) as well as MySQL Fabric 1.5, with features such as:
  • A new spatial data viewer, allowing graphical views of result sets containing GEOMETRY data and taking advantage of the new GIS capabilities in MySQL 5.7.
  • Support for new MySQL 5.7.4 SQL syntax and configuration options.
  • Metadata Locks View shows the locks connections are blocked or waiting on.
  • MySQL Fabric cluster connectivity - browse, view status, and connect to any MySQL instance in a Fabric Cluster.
  • MS Access migration Wizard - easily move to MySQL Databases.

Other significant usability improvements were made, aiming to raise productivity for advanced and new users:

  • Direct shortcut buttons to commonly used features in the schema tree.
  • Improved results handling. Columns have better auto-sizing and their widths are saved. Fonts can also be customized. Results can be "pinned" to persist viewing data.
  • A convenient Run SQL Script command to directly execute SQL scripts, without loading them first.
  • Database Modeling has been updated to allow changes to the formatting of note objects and attached SQL scripts can now be included in forward  engineering and synchronization scripts.
  • Integrated Visual Explain within the result set panel.
  • Visual Explain drill down for large to very large explain plans.
  • Shared SQL snippets in the SQL Editor, allowing multiple users to share SQL code by storing it within a MySQL instance.
  • And much more.

The list of provided binaries was updated, and MySQL Workbench binaries are now available for:
  • Windows 7 32 and 64bits
  • Mac OS X Lion or newer
  • Ubuntu 12.04 LTS and Ubuntu 14.04 64bits
  • Fedora 20 64bits
  • Oracle Linux 6.5 64bits
  • Oracle Linux 7 64bits
  • Sources for building in other Linux distributions. Pre-compiled binaries are no longer provided for 32bit Linux distributions, although users may still download and build from sources.
Note: Workbench for Windows requires Visual C++ Redistributable for
Visual Studio 2013, which can be downloaded from:

http://www.microsoft.com/en-us/download/details.aspx?id=40784

For the full list of changes in this revision, visit
http://dev.mysql.com/doc/relnotes/workbench/en/changes-6-2.html

For discussion, join the MySQL Workbench Forums:
http://forums.mysql.com/index.php?151

Download MySQL Workbench 6.2.4 now, for Windows, Mac OS X 10.7+, Oracle
Linux 6 and 7, Fedora 20, Ubuntu 12.04 and Ubuntu 14.04 or sources,
from:

http://dev.mysql.com/downloads/tools/workbench/
PlanetMySQL Voting: Vote UP / Vote DOWN


Sys Schema for MySQL 5.6 and MySQL 5.7

Performance Schema (P_S) has been available since MySQL 5.5, more than 4 years ago. It is still difficult to see production servers with P_S enabled, even with MySQL 5.6 where the feature is enabled by default. There have been several complaints, for example about the overhead (a fix for which is a work in progress) and about the ease of use: 52 tables to query and 31 configuration variables are enough to scare people.

There is a solution for the second problem, the usability. Its name is “sys schema”. It is a collection of views, functions and procedures that make P_S more user friendly.

Installation

If you are a MySQL Workbench user the installation is pretty easy because sys schema is already included. You just need to install it. Click on “Performance – Performance Reports” and there you will find the “Install Helper” button that will install sys schema.

If you don’t use MySQL Workbench you need to download sys_56.sql or sys_57.sql (depending on whether you use 5.6 or 5.7) from the github repository. Then just import the sql file as usual:

mysql -u root -p < ./sys_56.sql
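
A quick sanity check after the import (just a convenient way to confirm the views and procedures are in place):

mysql> SHOW DATABASES LIKE 'sys';
mysql> SHOW FULL TABLES FROM sys;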

Usage

After the import, you will have a new “sys” schema with some very descriptive table names. Let’s see an example. Do you want to know which tables are using most of your InnoDB buffer pool memory? Easy:

mysql> select * from sys.innodb_buffer_stats_by_table;
+---------------+--------------------+-----------+-----------+-------+--------------+-----------+-------------+
| object_schema | object_name        | allocated | data      | pages | pages_hashed | pages_old | rows_cached |
+---------------+--------------------+-----------+-----------+-------+--------------+-----------+-------------+
| test          | t                  | 63.61 MiB | 58.06 MiB |  4071 |         4071 |      4071 |     2101222 |
| InnoDB System | SYS_FOREIGN        | 32.00 KiB | 0 bytes   |     2 |            2 |         2 |           0 |
| InnoDB System | SYS_TABLES         | 32.00 KiB | 1.11 KiB  |     2 |            2 |         2 |          10 |
| InnoDB System | SYS_COLUMNS        | 16.00 KiB | 4.68 KiB  |     1 |            1 |         1 |          71 |
| InnoDB System | SYS_DATAFILES      | 16.00 KiB | 324 bytes |     1 |            1 |         1 |           6 |
| InnoDB System | SYS_FIELDS         | 16.00 KiB | 722 bytes |     1 |            1 |         1 |          17 |
| InnoDB System | SYS_INDEXES        | 16.00 KiB | 836 bytes |     1 |            1 |         1 |          12 |
| InnoDB System | SYS_TABLESPACES    | 16.00 KiB | 318 bytes |     1 |            1 |         1 |           6 |
| mysql         | innodb_index_stats | 16.00 KiB | 274 bytes |     1 |            1 |         1 |           3 |
| mysql         | innodb_table_stats | 16.00 KiB | 53 bytes  |     1 |            1 |         1 |           1 |
+---------------+--------------------+-----------+-----------+-------+--------------+-----------+-------------+

Pretty easy and useful, right? You can also see which database is using the most memory in the buffer pool by querying innodb_buffer_stats_by_schema.
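
The corresponding query is just as simple (output omitted here, as it depends entirely on your data):

mysql> SELECT * FROM sys.innodb_buffer_stats_by_schema;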

For each table there is another, similar one whose name starts with x$. For example, you have user_summary_by_file_io and x$user_summary_by_file_io. The difference is that the first table has human-readable output values while the second one has the real values. Let’s see an example:

mysql> select * from sys.user_summary_by_file_io;
+------------+-------+------------+
| user       | ios   | io_latency |
+------------+-------+------------+
| root       | 19514 | 2.87 s     |
| background |  5916 | 1.91 s     |
+------------+-------+------------+
2 rows in set (0.00 sec)

mysql> select * from sys.x$user_summary_by_file_io;
+------------+-------+---------------+
| user       | ios   | io_latency    |
+------------+-------+---------------+
| root       | 19514 | 2871847094292 |
| background |  5916 | 1905079715132 |
+------------+-------+---------------+

For humans, or at least for me, it is easier to read seconds rather than picoseconds.
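
If you do need to translate a raw value, the sys schema also bundles helper functions; assuming your version includes format_time(), the picosecond figure above converts back like this (it should print the same “2.87 s” seen earlier):

mysql> SELECT sys.format_time(2871847094292);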

There are multiple tables with very descriptive names.

– io_by_thread_by_latency
– schema_unused_indexes
– statements_with_temp_tables
– statements_with_errors_or_warnings
– user_summary_by_statement_type
– waits_by_host_by_latency

There are a lot more, and they are explained with examples in the project’s README file.

Configuration

On the MySQL side nothing special is needed. Just enable performance_schema:

performance_schema="on"
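
Note that performance_schema can only be set at server startup, so it belongs in the configuration file; after a restart you can confirm it took effect with:

mysql> SHOW VARIABLES LIKE 'performance_schema';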

sys schema also provides some procedures to enable/disable some features. For example:

– ps_setup_enable_background_threads
– ps_setup_enable_consumers
– ps_setup_enable_instrument
and so on…

We also have the same procedures with “disable”. After you have made the changes you can save them by calling ps_setup_save(), and reload them later if you want by calling ps_setup_reload_saved(). If you want to reset the configuration to the default values, just call ps_setup_reset_to_default().

For example, we can check that some consumers are disabled. It is possible to enable them and save the configuration:

mysql> CALL sys.ps_setup_show_disabled_consumers();
+--------------------------------+
| disabled_consumers             |
+--------------------------------+
| events_stages_current          |
| events_stages_history          |
| events_stages_history_long     |
| events_statements_history      |
| events_statements_history_long |
| events_waits_current           |
| events_waits_history           |
| events_waits_history_long      |
+--------------------------------+

mysql> CALL sys.ps_setup_enable_consumers('events');
+---------------------+
| summary             |
+---------------------+
| Enabled 8 consumers |
+---------------------+

mysql> CALL sys.ps_setup_show_disabled_consumers();
Empty set (0.00 sec)

mysql> CALL sys.ps_setup_save(TRUE);
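
To restore that saved configuration later, or to go back to the defaults, the matching calls look like this (a minimal sketch; the argument conventions follow the project’s README, so double-check them against your sys schema version):

mysql> CALL sys.ps_setup_reload_saved();
mysql> CALL sys.ps_setup_reset_to_default(TRUE);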

Conclusion

Performance Schema is very useful when we want to know what is happening inside the database. It is getting more features with each new GA and will probably be the single point of information in the near future. Now, thanks to sys schema, it is also easy to use.

The post Sys Schema for MySQL 5.6 and MySQL 5.7 appeared first on MySQL Performance Blog.


PlanetMySQL Voting: Vote UP / Vote DOWN

Logging with MySQL: Error-Logging to Syslog &amp; EventLog

You’ve already read it in What’s new in 5.7 (So Far) — the MySQL server now has new-and-improved supported for syslog (on unix-like systems) and EventLog (on Windows). In the next few paragraphs, we’ll take a look at what they are, what has changed, and how they can make your life easier.

The MySQL server supplies information in two main ways:

  1. The client will receive a reply to every statement. If everything goes right, then we’ll see a simple OK for success, or a result set for SELECT, SHOW, etc.; and even a successful statement may be qualified by a set of warnings or notices. If the statement fails for some reason then we’ll receive an error regarding the failure.
  2. On the server, we’ll see a variety of logs depending on the server configuration. Queries exceeding a certain execution time may get logged to the slow query log, while in some debug scenarios all queries may be logged to the general query log either as soon as they are received (--log-raw) or after parsing and having security sensitive information such as passwords replaced. Both the slow query log and the general query log may be sent to a file or table. The third major log is the error log.
The error log — what and why

Like the clients, the error log may receive messages of varying severities. It collects messages that do not have an obvious query, client, or connection that is to blame, messages that need persistence beyond that of a connection or the server’s runtime, and others that have an intended audience of the DBA or system administrator rather than the application developer. Examples are exhaustion of resources on the host machine, certain errors reported by the host operating system, backtraces of crashes, messages about damaged databases and indexes, errors on start-up, failed or refused connections, etc.

The error log — where and why

Unlike the slow and general query logs, the error log cannot be kept in a table — this sidesteps questions such as how to safely log an error about the table engine malfunctioning while using that same engine for error logging. This also makes the error log persistent beyond the runtime of the server, and makes the log file accessible to specialized log analyser tools.

Enter syslog

Both Unix and Windows do however already have standardized logging facilities — syslog on Unix and EventLog on Windows. Using those facilities gives us access to a plethora of tools written specifically for them, it lets us log to a different machine, etc. While the MySQL server (mysqld) previously had very basic support for the Windows EventLog, syslogging was implemented by simply writing error messages to a stream (STDOUT/STDERR). Through this stream, whatever started the server (e.g. mysqld_safe) may receive or redirect this information at the OS level (note: if you start mysqld directly these streams will be visible in your terminal window; on Windows you have to use the --console option). The mysqld_safe wrapper script would start the MySQL server and optionally direct this error output to a program that would write it to the Unix syslog.

Onwards and upwards

While that syslog support solved the immediate problem, there was room for improvement — all MySQL server errors were filed with a syslog severity of error, even if the message said something different. A line could contain one time-stamp from the MySQL server, and another from the syslog facilities. And those were just the cosmetic issues!

Native syslogging

Having native support for syslog in the MySQL server itself enables us to turn syslogging on or off at runtime, and to query its status.

The syslog facility (information about what category of software the message came from) is also no longer fixed to daemon, but can be queried and configured at runtime. This should make it easier to adapt the MySQL syslog messages to your syslog environment and enable better message filtering. See the manual page for syslog.conf (shell> man syslog.conf) for more on filtering.

The following options have also been added and can be set at start-up as well as viewed and set at runtime via system variables (setting them requires the SUPER privilege):

  1. --log_syslog
  2. --log_syslog_include_pid
  3. --log_syslog_facility
  4. --log_syslog_tag
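
For example, a user with the SUPER privilege could flip these at runtime (a minimal sketch using the 5.7 variable names listed above):

SET GLOBAL log_syslog = ON;
SET GLOBAL log_syslog_facility = 'daemon';
SET GLOBAL log_syslog_tag = 'production';
SHOW GLOBAL VARIABLES LIKE 'log_syslog%';
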
mysqld_safe

While the mysqld_safe wrapper script has been updated to be aware of the server’s new native syslog support, the options to the script remain unchanged. Where before, mysqld_safe used the program logger to pass the server’s error messages to syslog, the script now selects server options consistent with the options given to the script. Therefore mysqld_safe --syslog --syslog-tag=AZOU would result in mysqld being started with mysqld --log-syslog=1 --log-syslog-facility=daemon --log-syslog-tag=AZOU.

It bears noting that the options to mysqld_safe control both the server and the mysqld_safe script itself. In the above example, the MySQL server would log as mysqld-AZOU, whereas the mysqld_safe wrapper script would log as mysqld_safe-AZOU.

systemd-journal

A variety of Linux systems have recently adopted systemd. Its logging facilities offer a sink that is compatible with traditional syslog, making the MySQL server’s native support work out of the box. When systemd journaling is used, the corresponding tools should be used to query and filter the logs, e.g. the journalctl program. In the following example, our running MySQL server has the tag production (and therefore logs as mysqld-production).

We could then use journalctl to ask for all log entries using that identifier:
journalctl -f SYSLOG_IDENTIFIER=mysqld-production

Due to the richness of systemd-journal logging, this is somewhat verbose. That’s why in the following example we ask for the output to be JSON-formatted. We’ll then use the jq tool to filter this JSON stream so that only the fields we are interested in remain: SYSLOG_IDENTIFIER, SYSLOG_FACILITY, and MESSAGE (see the systemd.journal-fields manual page for others):
journalctl -f SYSLOG_IDENTIFIER=mysqld-production -o json | jq '.SYSLOG_IDENTIFIER + " " + .SYSLOG_FACILITY + ": " + .MESSAGE' 2>/dev/null

Windows

On Windows, EventLog is used in lieu of syslog. Like on Unix, logging can be enabled, disabled, and queried at runtime. The facility and process ID options, however, are not available on Windows. A tag can be set and queried at runtime, but to create a new tag (start using a tag that does not already exist) the MySQL server needs to be started with sufficient Windows system privileges to create an entry in the Windows Registry. Once this entry is created, the elevated system privileges are not required to use the newly created tag. As the entry exists in the Windows registry, and therefore outside the MySQL server, it will persist through restarts of the MySQL server and indeed the host machine. Care should therefore be taken not to pollute the registry with a great number of unused tags.

All’s well that sends well

Finally, after talking so much about the server (and laying the groundwork for future Log Lady reports), I would be remiss not to note that the MySQL client now also supports syslog. I look forward to your feedback on these new features! If you encounter any issues, please let us know by filing a bug or opening a support ticket.

Thank you for using MySQL!


PlanetMySQL Voting: Vote UP / Vote DOWN

Multi-source Replication with Galera Cluster for MySQL

November 20, 2014 By Severalnines

Multi-source replication means that one server can have multiple masters from which it replicates. Why multi-source? One good reason is to consolidate databases (e.g. merge your shards) for analytical reporting or as a centralized backup server. MariaDB 10 already has this feature, and MySQL 5.7 will also support it. 

It is possible to set up your Galera Cluster as an aggregator of your masters in a multi-source replication setup; we’ll walk you through the steps in this blog post. Note that the howto is for Galera Cluster for MySQL (Codership) and Percona XtraDB Cluster. In a separate post, we’ll show you how to configure MariaDB Cluster 10 instead.

 

Galera Cluster as Aggregator/Slave

 

A Galera cluster can operate both as a MySQL master and a slave. Each Galera node can act as a slave channel accepting replication from a master. The number of slave channels should be equal to or less than the number of Galera nodes in the cluster. So, if you have a three-node Galera cluster, you can have up to three different replication sources connected to it. Note that in MariaDB Galera Cluster 10, you can configure as many sources as you want since each node supports multi-source replication.

To achieve multi-source replication in MySQL 5.6, you cannot have GTID enabled for Galera Cluster. GTID will cause our Galera cluster to work as a single unit (imagine one single slave server), since it globally preserves the MySQL GTID events on the cluster.  So the cluster will not be able to replicate from more than one master. Hence, we will use the “legacy” way to determine the starting binary log file and position. On a side note, enabling GTID is highly recommended if your Galera Cluster acts as a MySQL master, as described in this blog post.
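
To make the “legacy” part concrete, pointing one Galera node at one of the masters boils down to a classic file/position based CHANGE MASTER TO. The host, credentials and coordinates below are placeholders for illustration only:

CHANGE MASTER TO
  MASTER_HOST='mysql1',
  MASTER_USER='repl',
  MASTER_PASSWORD='replpass',
  MASTER_LOG_FILE='binlog.000001',
  MASTER_LOG_POS=4;
START SLAVE;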

We will setup multi-source replication as below:

We have 3 standalone MySQL servers (masters), and each master has a separate database: mydb1, mydb2 and mydb3. We would like to consolidate all 3 databases into our Galera cluster.

 

Setting Up Masters

 

1. On each standalone MySQL server, configure it as a master by adding a server ID and enabling binary logging in ROW format:

# mysql1 my.cnf
server-id=101
log-bin=binlog
binlog-format=ROW

 

# mysql2 my.cnf
server-id=102
log-bin=binlog
binlog-format=ROW

 

# mysql3 my.cnf
server-id=103
log-bin=binlog
binlog-format=ROW

 

read more


PlanetMySQL Voting: Vote UP / Vote DOWN

Improving InnoDB index statistics

The MySQL/MariaDB optimiser likes to know things like the cardinality of an index – that is, the number of distinct values the index holds. For a PRIMARY KEY, which only has unique values, the number is the same as the number of rows.  For an indexed column that is boolean (such as yes/no) the cardinality would be 2.

There’s more to it than that, but the point is that the optimiser needs some statistics from indexes in order to try and make somewhat sane decisions about which index to use for a particular query. The statistics also need to be updated when a significant number of rows have been added, deleted, or modified.

In MyISAM, ANALYZE TABLE does a tablescan where everything is tallied, and the index stats are updated. InnoDB, on the other hand, has always done “index dives”, looking at a small sample and deriving from that. That can be ok as a methodology, but unfortunately the history is awkward. The number used to be a constant in the code (4), and that is inadequate for larger tables. Later the number was made a server variable innodb_stats_sample_pages and its default is now 8 – but that’s still really not enough for big(ger) tables.

We recently encountered this issue again with a client, and this time it really needed addressing as no workarounds were effective across the number of servers and of course over time. Open Query engineer Daniel filed https://mariadb.atlassian.net/browse/MDEV-7084 which was picked up by MariaDB developer Jan Lindström.

Why not just set the innodb_stats_sample_pages much higher? Well, every operation takes time, so setting the number appropriate for your biggest table means that the sampling would take unnecessarily long for all the other (smaller, or even much smaller) tables. And that’s just unfortunate.

So why doesn’t InnoDB just scale the sample size along with the table size? Because, historically, it didn’t know the table size: InnoDB does not maintain a row count (this has to do with its multi-versioned architecture and other practicalities – as with everything, it’s a trade-off). However, these days we have persistent stats tables – rather than redoing the stats the first time a table is opened after server restart, they’re stored in a table. Good improvement. As part of that information, InnoDB now also knows how many index pages (and leaf nodes in its B+Tree) it has for each table. And while that’s not the same as a row count (rows have a variable length so there’s no fixed number of rows per index page), at least it grows along with the table. So now we have something to work with! The historical situation is no longer a hindrance.

In order to scale the sample size sanely, that is not have either too large a number for small tables, or a number for big tables that’s over the top, we’ll want some kind of logarithmic scale. For instance, log2(16 thousand) = 14, and log2(1 billion) = 30. That’s small enough to be workable. The new code as I suggested:

n_sample_pages = max(min(srv_stats_sample_pages, index->stat_index_size), log2(index->stat_index_size) * srv_stats_sample_pages);

This is a shorter construct (using min/max instead of ifs) of what was already there, combined with the logarithmic sample basis. For very small tables, either the innodb_stats_sample_pages number or the actual number of pages is used, whichever is smaller; for bigger tables, the log2 of the number of index pages is used, multiplied by the dynamic system variable innodb_stats_sample_pages. So we can still scale and thus influence the system in case we want more samples. Simple, but it seems effective, and in any case we get decent results in many more cases than before, so it’s a worthwhile improvement. Obviously, since it’s a statistical sample, it could still be wrong for an individual case.
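
To put numbers on that (assuming the default innodb_stats_sample_pages of 8): an index of roughly a million pages (2^20) would get max(min(8, 1048576), 20 * 8) = 160 sample pages, where previously it would have stayed at just 8.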

Jan reckons that, just like MyISAM, InnoDB should do a table scan and work things out properly; I agree, this makes sense now that we have persistent stats. So the above is a good fix for 5.5 and 10.0, and the more significant change to comprehensive stats can come in a near-future major release. At that point we will have done away with sampling altogether, basing the stats on the full dataset instead. Excellent.

Another issue that needed to be dealt with is when InnoDB recalculates the statistics. You don’t want to do it on every change, but regularly if there has been some change is good as it might affect which indexes should be chosen for optimal query execution. The hardcoded rule was 1/16th of the table or 2 billion rows, whichever comes first. Again that’s unfortunate, because for a bigger table 1/16th still amounts to a very significant number. I’d tend towards setting the upper bound to say 100,000. Jan put in a new dynamic system variable for this, stat_modified_counter. It essentially replaces the old static value of 2 billion, providing us with a configurable upper bound. That should do nicely!

Once again horay for open source, but in particular responsive and open development. If someone reports a bug and there is no interaction between developer and the outside world until something is released, the discussion and iterative improvement process to come to a good solution cannot occur. The original code definitely worked as designed, but it was no longer suitable in the context of today’s usage/needs. The user-dev interaction allowed for a much better conclusion.


PlanetMySQL Voting: Vote UP / Vote DOWN

In Case You Missed It - Developing MySQL Applications with Go

In this webinar, Baron Schwartz introduces the integration of Go and MySQL. He discusses idiomatic database/sql code, available drivers for MySQL and time saving tips and tricks.

If you did not have a chance to join the webinar, the slide deck is embedded below. You can also register for a recording here.

Using MySQL with Go - Golang's Database/SQL Driver from VividCortex
PlanetMySQL Voting: Vote UP / Vote DOWN

How a set of queries can be killed in MySQL using Percona Toolkit’s pt-kill

You might have encountered situations where you had to kill some specific select queries that were running for long periods and choking the database. This post will go into more detail with an example of report query offloading.

Report query (select) offloading to a slave server is a common practice to reduce the workload on the master DB server. The long-running selects are executed on the slave for report generation. I have observed many cases where the slave gets delayed, or the slave DB encounters a slowdown, due to heavy long-running orphaned selects coming from misbehaving reports.

There are two main ways to kill queries in MySQL: 1. use custom scripts that match on a regular expression, or 2. use a tool written and supported by Percona that is designed to kill queries based on matching conditions. Below is one script that will help you to kill those queries. The script will take the process list from MySQL and filter the long-running select query considering the “User”, “Time” and “State” from the list. However I suggest that you use the pt-kill tool from Percona Toolkit which provides a more reliable choice and options based on your requirements.

Process list:

+-----+------+-----------+------+---------+------+------------------------------+-----------------------------+
| Id  | User | Host      | db   | Command | Time | State                        | Info                        |
+-----+------+-----------+------+---------+------+------------------------------+-----------------------------+
| 103 | root | localhost | test | Query   |    0 | init                         | show processlist            |
| 104 | root | localhost | test | Sleep   |  383 |                              | NULL                        |
| 106 | root | localhost | test | Query   |  377 | Waiting for table level lock | SELECT * FROM t FOR UPDATE  |
| 107 | root | localhost | test | Query   |  364 | Waiting for table level lock | insert into t value(5)      |
| 108 | rpt  | localhost | test | Query   |  345 | Waiting for table level lock | SELECT c1 FROM t FOR UPDATE |
+-----+------+-----------+------+---------+------+------------------------------+-----------------------------+

1. Shell script example:

List all queries from rpt user having query time greater than 1 minute:

[root@test3 ~]# for id in $(mysql -t -e "SHOW FULL PROCESSLIST" | /bin/awk -F "|" '($6 ~ /Query/) && (60< $7) && ($3 ~ /rpt/) { print $2}'); do echo "KILL QUERY $id;"; done
KILL QUERY 108;

Kill all queries from rpt user having query time greater than 1 minute:

[root@test3 ~]# for id in $(mysql -t -e "SHOW FULL PROCESSLIST" | /bin/awk -F "|" '($6 ~ /Query/) && (60< $7) && ($3 ~ /rpt/) { print $2}'); do mysql -e "KILL QUERY $id;"; done

2. Pt-kill example:

List all queries from rpt user having query time greater than 1 minute (--print):

[root@test3 ~]# pt-kill --busy-time 60 --match-command Query --user root --print --match-user rpt --socket /tmp/mysql.sock
# 2014-11-12T03:51:01 KILL 108 (Query 485 sec) SELECT c1 FROM t FOR UPDATE
# 2014-11-12T03:51:31 KILL 108 (Query 515 sec) SELECT c1 FROM t FOR UPDATE
# 2014-11-12T03:52:01 KILL 108 (Query 545 sec) SELECT c1 FROM t FOR UPDATE

Kill all queries from rpt user having query time greater than 1 minute (--kill-query):

[root@test3 ~]# pt-kill --busy-time 60 --match-command Query --user root --print --kill-query --match-user rpt --socket /tmp/mysql.sock
# 2014-11-12T03:53:26 KILL QUERY 108 (Query 630 sec) SELECT c1 FROM t FOR UPDATE

Process list:

+-----+------+-----------+------+---------+------+------------------------------+----------------------------+
| Id  | User | Host      | db   | Command | Time | State                        | Info                       |
+-----+------+-----------+------+---------+------+------------------------------+----------------------------+
| 103 | root | localhost | test | Query   |    0 | init                         | show processlist           |
| 104 | root | localhost | test | Sleep   |  843 |                              | NULL                       |
| 106 | root | localhost | test | Query   |  837 | Waiting for table level lock | SELECT * FROM t FOR UPDATE |
| 107 | root | localhost | test | Query   |  824 | Waiting for table level lock | insert into t value(5)     |
| 108 | rpt  | localhost | test | Sleep   |  805 |                              | NULL                       |
| 111 | root | localhost | NULL | Sleep   |   25 |                              | NULL                       |
+-----+------+-----------+------+---------+------+------------------------------+----------------------------+

Note: the --kill-query option makes pt-kill kill matching queries. This requires MySQL 5.0 or newer. Unlike --kill, which kills the connection for matching queries, this option only kills the query, not its connection.

pt-kill captures queries from SHOW PROCESSLIST, filters them, and then either kills or prints them. This is also known as a “slow query sniper” in some circles. The idea is to watch for queries that might be consuming too many resources, and kill them. For details please read the tool’s (pt-kill) documentation.
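
For the “slow query sniper” use case it can also be left running in the background; a hedged sketch (--daemonize, --interval and --log are pt-kill options, but verify them against your Percona Toolkit version before relying on this):

[root@test3 ~]# pt-kill --daemonize --interval 30 --busy-time 60 --match-command Query --match-user rpt --kill-query --print --log /var/log/pt-kill.log --socket /tmp/mysql.sock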

The post How a set of queries can be killed in MySQL using Percona Toolkit’s pt-kill appeared first on MySQL Performance Blog.


PlanetMySQL Voting: Vote UP / Vote DOWN

New ClusterControl User Guide for MySQL based Clusters

November 19, 2014 By Severalnines

Following the release of ClusterControl 1.2.8 with a range of cool new features, we have now also published a new ClusterControl User Guide that provides all the steps, tips & tricks to follow in order to successfully deploy, monitor, manage and scale database clusters.  

This user guide predominantly covers ClusterControl with MySQL-based clusters, namely:

  • Percona XtraDB Cluster
  • MariaDB Cluster
  • Galera Cluster for MySQL (Codership)
  • MySQL Cluster
  • MySQL Replication
  • A pool of MySQL single instances

 

New features in ClusterControl 1.2.8 include: 

  • Deployment and scaling of single-instance MySQL, MariaDB and MongoDB
  • YUM/APT repositories for ClusterControl
  • Alerts and incident tracking with PagerDuty
  • New flexible alarms/email notification system
  • Cluster-wide MySQL User Management
  • New default dashboards for MySQL/MariaDB
  • Puppet Module for ClusterControl

 

read more


PlanetMySQL Voting: Vote UP / Vote DOWN
