Planet MySQL

Fun with Bugs #44 - Community Bugs Fixed in MySQL 5.7.14

MySQL 5.7.14 was officially released yesterday. So, it's time to check what bugs reported by MySQL Community in public were fixed in this release. Some of these bugs are presented below.

As usual, let me start with InnoDB. The following bugs were fixed there:
  • Bug #80296 - "FTS query exceeds result cache limit". It was reported (for 5.6, but I do not see new 5.6 release notes yet) by Monty Solomon and verified by Umesh.
  • Bug #80304 - "generated columns don't work with foreign key actions". It was reported by Guilhem Bichot based on test case by Peter Gulutzan presented here.As most community bug reports during last 2-3 years, it was verified by Umesh.
  • Bug #80298 - "Full-Text queries with additional secondary index gives NULL or Zero rows", was reported by Ray Lambe and verified by Umesh.
  • Bug #76728 - "reduce lock_sys->mutex contention for transaction that only contains SELECT". This old bug report by Zhai Weixiang (who had provided a patch) was verified by Sinisa Milivojevic.
  • Bug #80083 - "Setting innodb_monitor_enable to ALL does not enable all monitors". It was reported by Davi Arnaut and verified by Miguel Solorzano.
  • Bug #79772 - "Foreign key not allowed when a virtual index exists". It was reported and verified by Jesper wisborg Krogh from Oracle.
There are many more bugs fixed in InnoDB, but all of them were reported in internal Oracle's bugs database by Oracle employees. I do not like this trend.

Now, let's check replication bugs that were fixed:
  • Bug #79324 - "Slave is ~10x slower to execute set of statements compared to master RBR", was reported by Serge Grachov and verified by Umesh.
  • Bug #62008 - "read-only option does not allow inserts/updates on temporary tables". This bug was reported long time ago by Ivan Stanojevic and verified by me when I worked in Oracle.It's really good to see it fixed now!
Some bugs were fixed in Performance_Schema (who could imagine it has bugs...), but they were either reported internally or remain private, like Bug #81464. Just take into account that SELECT from some P_S tables could crash server before 5.7.14, based on release notes...

This time I see several build-related bugs fixed, like these:
  • Bug #81274 - "Add support for Solaris Studio 12.5 aka 5.14". It was reported (and probably fixed) by Tor Didriksen.
  • Bug #81593 - "adapt to gcc 5.3 on solaris". It was also reported and fixed by Tor Didriksen. personally I am happy to see that Oracle still cares about Solaris and related software. Historical sentiments...
  • Bug #80996 - "correct make_pair for c++11 (contribution)". This fix was contributed by Daniel Black
  • Bug #80371 - "MySQL fails to build with new default mode in GCC6". It was reported by Terje Røsten.
The last but not the least, I also have to mention this bug in audit (and, thus, query rewrite) plugins, Bug #81298 - "query rewrite plugin suffers scalability issues". It was reported by Vadim Tkachenko and verified by Sinisa Milivojevic. This is a great improvement.

To summarize, I see reasons to upgrade for those who rely a lot on FTS indexes in InnoDB, replication, audit plugins and Performance_schema. I had not even tried to build 5.7.14 from source yet, so I do not have any personal experience to share.
PlanetMySQL Voting: Vote UP / Vote DOWN

WarpSQL now has SQL shim plugins

I made some improvements to the 'proxy' inside of MySQL 5.7 that I've created for WarpSQL (Shard-Query 3).  I've made the MySQL proxy/shim pluggable and I moved the SQLClient to sql/  I've merged these changes into 'master' in my fork.

Now you can create "SQL shim" plugins (SHOW PASSWORD is implemented in plugin/sql_shim) and install them in the server like regular plugins:
-- command doesn't work mysql> show password; ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'password' at line 1 -- install the example sql_shim plugin: mysql> install plugin sql_shim soname ''; Query OK, 0 rows affected (0.00 sec) -- now the command works mysql> show password; +--+ | | +--+ | | +--+ 1 row in set (0.00 sec)

There can be only one!
There may only be ONE "SQL shim" plugin running in the server at one time.  All "SQL shim" plugins must use the plugin name "sql_shim".  This prevents more than one "SQL shim" plugin from being utilized at once.  This is by design, because the MySQL plugin interface doesn't make any provisions for plugin execution order.  If you install a SQL shim plugin by another name, it won't cause harm, but it will not be utilized.
PlanetMySQL Voting: Vote UP / Vote DOWN

MariaDB 10.2 CHECK and DEFAULT clauses

In this blog post, we’ll look at the MariaDB 10.2 CHECK and DEFAULT clauses.

MariaDB 10.2 includes some long-awaited features. In this blog, we are going to discuss the improvements to some table definitions: the DEFAULT clause and the CHECK constraints. These clauses describe columns default values and rules for data validation.

Note that MariaDB 10.2 is still in alpha stage. This article describes the current state of these features, which could change before MariaDB 10.2 becomes GA.

The DEFAULT clause

The DEFAULT clause has always been supported in MariaDB/MySQL, but traditionally it only accepted literal values (like “hello world” or “2”). MariaDB 10.2 removes this limitation, so DEFAULT can now accept most SQL expressions. For example:

  • fiscal_year SMALLINT DEFAULT (YEAR(NOW()))
  • valid_until DATE DEFAULT (NOW() + INTERVAL 1 YEAR)
  • owner VARCHAR(100) DEFAULT (USER())

Additionally, MariaDB 10.2 allows you to set a DEFAULT value for the TEXT and BLOB columns. This was not possible in previous versions. While this might look like a small detail, it can be hard to add a column to an existing table that is used by production applications, if it cannot have a default value.

The DEFAULT clause has some very reasonable limitations. For example, it cannot contain a subquery or a stored function. An apparently strange limitation is that we can mention another column in DEFAULT only if it comes first in the CREATE TABLE command.

Note that DEFAULT can make use of non-deterministic functions even if the binary log uses the STATEMENT format. In this case, default non-deterministic values will be logged in the ROW format.

CHECK constraints

CHECK constraints are SQL expressions that are checked when a row is inserted or updated. If this expression result is false (0, empty string, empty date) or NULL, the statement will fail with an error. The error message states which CHECK failed in a way that is quite easy to parse:

ERROR 4022 (23000): CONSTRAINT `consistent_dates` failed for `test`.`author`

Some example of CHECK constraints:

  • CONSTRAINT non_empty_name CHECK (CHAR_LENGTH(name) > 0)
  • CONSTRAINT consistent_dates CHECK (birth_date IS NULL OR death_date IS NULL OR birth_date < death_date)
  • CONSTRAINT past_date CHECK (birth_date < NOW())

A possible trick is checking that a column is different from its default value. This forces users to assign values explicitly.

CHECK constraints cannot be added or altered. It is only possible to drop them. This is an important limitation for production servers.

Another limitation is that CHECK metadata are not accessible via the INFORMATION_SCHEMA. The only way to find out if a table has CHECK clauses is parsing the output of SHOW CREATE TABLE.

The exact behavior of CHECK constraints in a replication environment depends on the master binary log format. If it is STATEMENT, the slaves will apply CHECK constraints to events received from the master. If it is ROW, only the master will need to apply constraints, because failed statements will not be replicated.

Thus, in all cases, we recommend having identical constraints on master and slaves, and only using deterministic constraints.


While I didn’t run a professional benchmark, I can say that both DEFAULT and CHECK clauses don’t have a noticeable impact on a simple test where we insert one million rows (on my local machine).

However, these clauses evaluate an SQL expression each time a row is inserted or updated. The overhead is at least equal to the SQL expression performance. If high-performing writes are important, you will probably not want to use complex data validation.

To check how fast an expression is, we can use the BENCHMARK() function:

MariaDB [(none)]> SELECT BENCHMARK(10000000, (555 / 100 * 20)); +---------------------------------------+ | BENCHMARK(10000000, (555 / 100 * 20)) | +---------------------------------------+ | 0 | +---------------------------------------+ 1 row in set (1.36 sec) MariaDB [(none)]> SELECT BENCHMARK(100000000, MD5('hello world')); +------------------------------------------+ | BENCHMARK(100000000, MD5('hello world')) | +------------------------------------------+ | 0 | +------------------------------------------+ 1 row in set (14.84 sec)

In this example, we executed the specified expressions ten million times. BENCHMARK() always returns 0, but what we want to check is the execution time. We can see for example that evaluating MD5(‘hello world’) takes less than 0.000002 seconds. In some cases, we may want to retry the same expressions with different parameters (longer strings, higher numbers, etc.) to check if the execution times varies.

Unfortunately, we don’t have a status variable which tells us how many times MariaDB evaluated CHECK clauses. If our workload performs many writes, that variable could help us to find out if CHECK constraints are slowing down inserts. Maybe the MariaDB team can take this as a suggestion for the future.

PlanetMySQL Voting: Vote UP / Vote DOWN

Faceted search, why the DevAPI could matter one day

Faceted search or faceted navigation is a highly praised and widely use search pattern. And, it is a great reply to an off the records sales engineering question. MySQL finally has some document store features built-in. A bit of a yawn in 2016. There is a new X DevAPI available with some Connectors. A bit of a yawn technically. But it is a non-technical change of mind: developer centric counts! Sales, all, technical value could show at non-trivial developer tasks, like faceted search.

Todays X DevAPI does not get you very far

There are great stories to tell about the X DevAPI, see MySQL 5.7.12 – Part 3: More Than “Just” SQL?:

  • Non-blocking API
  • CRUD API for schemaless documents in collections and schemaful rows in tables
  • SQL support
  • Prepared for “we need new APIs” for distributed databases

Most importantly the “look and feel” is similar to comparable offerings from NoSQL vendors. Competitive offerings have been described as easy, natural, developer friendly. We tried to follow these maximes. Albeit a leap step forward the feature set of the first version of the X DevAPI is limited. It works for basic apps like the demo app. But instead of adding a bulk of additional features we broaden our view to achieve the goal of improving developer ease of use on a whole:

Selling point readability

The off the records sales engineers question was: why would one use the X DevAPI, given that it is only a dump SQL wrapper? Let a developer write some code to fetch all products from a collection that have a price higher than 1.2 Euros:

products.find("price > 1.2").exeucte();

Upon execution, the Connectors (drivers) send the query to the X server plugin which translates it to some SQL and executes it. The SQL statement makes use of ANSI/ISO SQL standard JSON features which MySQL 5.7.12+ supports:

SELECT product FROM products WHERE JSON_UNQUOTE(JSON_EXTRACT(product, "$.price")) > 1.2;

There is no technical value in this. Schemaless? Works with SQL. JSON processing? Works with SQL, too. The X DevAPI selling point is readability. But none of the available features today and none of feature on the short term roadmap has the potential to add more value to the X DevAPI.

Faceted search: let daddy sew a coat

Faceted search supports exploring large amounts of data by displaying summaries about various partitions of the data and later allowing to narrow the navigation to a specific partition. Let there be a father that wants to sew a rain coat exactly as the one below for his one year old daughter.

Daddy opens his web browser in the middle of the night after having completed all baby care and feeding duties and starts to search for fabric in an online retailer. But what to search for, how to quicky find the best fabric out of 100.000 offered by the online store? Many stores will display a search box and a faceted navigation at the left side. Possible facets (partitions, dimensions) are:

  • Material
    • Blends (16647)
    • Cotton (16762)
  • Usage
    • Baby (19913)
    • Dress (20005)
  • Price
    • Less than 10 Euros (13815)
    • 10-20 Euros (16207)
A faceted document search using SQL

Assume the products are stored in the database using (JSON) documents. All documents are in JSON column called “product” of a table called “products”. How to get the totals for all the facets using SQL?

{ "title": "Beach motive M182-16" "description": ... "weight": ... "material": "cotton" ... }

The SQL for one facet is pretty much straight-forward:

SELECT JSON_UNQUOTE(JSON_EXTRACT(product, "$.material")) AS facet_value, COUNT(*) AS facet_count FROM products WHERE JSON_EXTRACT(product, "$.material") IS NOT NULL GROUP BY JSON_EXTRACT(product, "$.material")

To get the values for more than one facet using no more than one query call in your application, combine the results for each facet using UNION ALL. Add a column “facet” to mark the orgininating facet in the combined row result.

SELECT "material" AS facet, JSON_UNQUOTE(JSON_EXTRACT(product, "$.material")) AS facet_value, COUNT(*) AS facet_count FROM products WHERE JSON_EXTRACT(product, "$.material") IS NOT NULL GROUP BY JSON_EXTRACT(product, "$.material") UNION ALL SELECT "usage" AS facet, JSON_UNQUOTE(JSON_EXTRACT(product, "$.usage")) AS facet_value, COUNT(*) AS facet_count FROM products WHERE JSON_EXTRACT(product, "$.usage") IS NOT NULL GROUP BY JSON_EXTRACT(product, "$.usage")

Speeding it up and shortening the SQL

There’s quite a bit to come and the length of the SQL statement will grow. Let’s shorten it a bit by adding generated columns (5.7.6) and speed things up using indicies. Long story short, here’s the CREATE TABLE statement I’m using for this blog post:

CREATE TABLE `products` ( `product_id` int(11) NOT NULL AUTO_INCREMENT, `product` json DEFAULT NULL, `facet_size` varchar(255) GENERATED ALWAYS AS (json_unquote(json_extract(`product`,'$.size'))) VIRTUAL, `facet_material` varchar(255) GENERATED ALWAYS AS (json_unquote(json_extract(`product`,'$.material'))) VIRTUAL, `facet_usage` varchar(255) GENERATED ALWAYS AS (json_unquote(json_extract(`product`,'$.usage'))) VIRTUAL, `facet_gender` varchar(255) GENERATED ALWAYS AS (json_unquote(json_extract(`product`,'$.gender'))) VIRTUAL, `facet_price` double GENERATED ALWAYS AS (json_unquote(json_extract(`product`,'$.price'))) VIRTUAL, PRIMARY KEY (`product_id`), KEY `idx_facet_size` (`facet_size`), KEY `idx_facet_material` (`facet_material`), KEY `idx_facet_usage` (`facet_usage`), KEY `idx_facet_gender` (`facet_gender`), KEY `idx_face_price` (`facet_price`) )

Using the above the query to gather one facets totals is shortened to:

SELECT facet_material AS facet_value, count(*) AS facet_count FROM products WHERE facet_material IS NOT NULL GROUP BY facet_material

A range based face: price

The price facet in the fabric online shop is range based. In Germany, the home of the example daddy, fabrics are sold per meter. The typical price of a fabric ranges from a few Euros up to 100 Euros with the majority topping around 40 Euros. The show owner wants to display totals for prices from 0..10, 10..20, 20..50, 50+ Euros. Note that the ranges are of different size.

Of course, SQL can deliver! For example, use a subquery and CASE expressions. Stackoverflow will guide a developer within seconds.

SELECT "price" AS facet, fpsub.facet_price_range AS facet_value, COUNT(*) AS facet_count FROM (SELECT (CASE WHEN facet_price BETWEEN 0 AND 10 THEN "0-10" WHEN facet_price BETWEEN 10.01 AND 20 THEN "10-20" WHEN facet_price BETWEEN 20.01 AND 50 THEN "20-50" ELSE "50+" END) AS facet_price_range FROM products WHERE facet_price IS NOT NULL) AS fpsub GROUP BY fpsub.facet_price_range

Combine all the individual facet queries using UNION ALL and you have the basic search.

Refinement: user selects a facet value

Example Daddy made up his mind on the material for the inner jacket, selects “cotton” and enters “Beach” into the search box. Likely, the users expectation is that the selection will have no impact on the totals shown for any other material but cotton. In other words the material face query needs to be split in two: one to count the total for “cotton” and “Beach” and another one for the totals of all the other facet values. Note that I ignore the problem of full text search and use LIKE exactly how one should not use it.

SELECT facet_material, count(*) FROM products WHERE facet_material = "cotton" AND JSON_UNQUOTE(JSON_EXTRACT(product, '$.title')) LIKE "%Beach%" GROUP BY facet_material UNION ALL SELECT facet_material, COUNT(*) FROM products WHERE facet_material != "cotton" AND facet_material IS NOT NULL GROUP BY facet_material

The grand total

Proof: you can use SQL for the job.

SELECT "material" AS facet, facet_material AS facet_value, COUNT(*) AS facet_count FROM products WHERE facet_material = "cotton" AND JSON_UNQUOTE(JSON_EXTRACT(product, '$.title')) LIKE "%Beach%" GROUP BY facet_material UNION ALL SELECT "material" AS facet, facet_material AS facet_value, COUNT(*) AS facet_count FROM products WHERE facet_material IS NOT NULL AND facet_material != "cotton" GROUP BY facet_material UNION ALL SELECT "price" AS facet, fpsub.facet_price_range AS facet_value, COUNT(*) AS facet_count FROM (SELECT (CASE WHEN facet_price BETWEEN 0 AND 10 THEN "0-10" WHEN facet_price BETWEEN 10.01 AND 20 THEN "10-20" WHEN facet_price BETWEEN 20.01 AND 50 THEN "20-50" ELSE "50+" END) AS facet_price_range FROM products WHERE facet_price IS NOT NULL) AS fpsub GROUP BY fpsub.facet_price_range UNION ALL SELECT "size" AS facet, facet_size AS facet_value, COUNT(*) AS facet_count FROM products WHERE facet_size IS NOT NULL GROUP BY facet_size UNION ALL SELECT "usage" AS facet, facet_usage AS facet_value, COUNT(*) AS facet_count FROM products WHERE facet_usage IS NOT NULL GROUP BY facet_usage UNION ALL SELECT "gender" AS facet, facet_gender AS facet_value, COUNT(*) AS facet_count FROM products WHERE facet_gender IS NOT NULL GROUP BY facet_gender ORDER BY facet, facet_value +----------+------------------+-------------+ | facet | facet_value | facet_count | +----------+------------------+-------------+ | gender | female | 33387 | | gender | male | 33327 | | material | blends | 16647 | | material | designer fabrics | 16703 | | material | knits | 16739 | | material | silk | 16594 | | material | wool | 16555 | | price | 0-10 | 13815 | | price | 10-20 | 16207 | | price | 20-50 | 55668 | | price | 50+ | 14310 | | size | 114cm | 14464 | | size | 140cm | 14366 | | size | L | 14009 | | size | M | 14303 | | size | S | 14327 | | size | XL | 14211 | | size | XXL | 14320 | | usage | baby | 19913 | | usage | dress | 20005 | | usage | inside | 19929 | | usage | outdoor | 20166 | +----------+------------------+-------------+ 22 rows in set (0,34 sec)

The execution time of 0.34s? Well, this is a source build of MySQL running in a VM on a notebook. No server parameters set. Inacceptable result for interactive search experience.

How the DevAPI could make a difference

A DevAPI counterpart of the above could read very different.

products .find("title LIKE :search") .facets({ "usage", "material", "gender", "size", "price" { "0-10": {"min": 0, "max": 10"} ... } .bind("search", "%Beach%")

As a developer, I could immediately grasp what the higher level DevAPI does. I would have no clue what the SQL is about if the column aliases would not hint me. Note also that the DevAPI and the X plugin are not limited to SQL. They could, if they want, use the lower level MySQL internal data access APIs. The internal APIs can be faster than SQL, way faster. SQL runs on top of them.

Dear Sales, All, the DevAPI is only a tiny bit of the MySQL document store story. You may see it as a dump SQL wrapper of little value. Or, as a chance to offer an alternative query API that solves web developer problems much easier than raw SQL could do.

Happy hacking!


The post Faceted search, why the DevAPI could matter one day appeared first on Ulf Wendel.

PlanetMySQL Voting: Vote UP / Vote DOWN

On Uber’s Choice of Databases

A few days ago Uber published the article “Why Uber Engineering Switched from Postgres to MySQL”. I didn’t read the article right away because my inner nerd told me to do some home improvements instead. While doing so my mailbox was filling up with questions like “Is PostgreSQL really that lousy?”. Knowing that PostgreSQL is not generally lousy, these messages made me wonder what the heck is written in this article. This post is an attempt to make sense out of Uber’s article.

In my opinion Uber’s article basically says that they found MySQL to be a better fit for their environment as PostgreSQL. However, the article does a lousy job to transport this message. Instead of writing “PostgreSQL has some limitations for update-heavy use-cases” the article just says “Inefficient architecture for writes,” for example. In case you don’t have an update-heavy use-case, don’t worry about the problems described in Uber’s article.

In this post I’ll explain why I think Uber’s article must not be taken as general advice about the choice of databases, why MySQL might still be a good fit for Uber, and why success might cause more problems than just scaling the data store.


The first problem Uber’s article describes in great, yet incomplete detail is that PostgreSQL always needs to update all indexes on a table when updating rows in the table. MySQL with InnoDB, on the other hand, needs to update only those indexes that contain updated columns. The PostgreSQL approach causes more disk IOs for updates that change non-indexed columns (“Write Amplification” in the article). If this is such a big problem to Uber, these updates might be a big part of their overall workload.

However, there is a little bit more speculation possible based upon something that is not written in Uber’s article: The article doesn’t mention PostgreSQL Heap-Only-Tuples (HOT). From the PostgreSQL source, HOT is useful for the special case “where a tuple is repeatedly updated in ways that do not change its indexed columns.” In that case, PostgreSQL is able to do the update without touching any index if the new row-version can be stored in the same page as the previous version. The latter condition can be tuned using the fillfactor setting. Assuming Uber’s Engineering is aware of this means that HOT is no solution to their problem because the updates they run at high frequency affect at least one indexed column.

This assumption is also backed by the following sentence in the article: “if we have a table with a dozen indexes defined on it, an update to a field that is only covered by a single index must be propagated into all 12 indexes to reflect the ctid for the new row”. It explicitly says “only covered by a single index” which is the edge case—just one index—otherwise PostgreSQL’s HOT would solve the problem.

[Side note: I’m genuinely curious whether the number of indexes they have could be reduced—index redesign in my challenge. However, it is perfectly possible that those indexes are used sparingly, yet important when they are used.]

It seems that they are running many updates that change at least one indexed column, but still relatively few indexed columns compared to the “dozen” indexes the table has. If this is a predominate use-case, the article’s argument to use MySQL over PostgreSQL makes sense.


There is one more statement about their use-case that caught my attention: the article explains that MySQL/InnoDB uses clustered indexes and also admits that “This design means that InnoDB is at a slight disadvantage to Postgres when doing a secondary key lookup, since two indexes must be searched with InnoDB compared to just one for Postgres.” I’ve previously written about this problem (“the clustered index penalty”) in context of SQL Server.

What caught my attention is that they describe the clustered index penalty as a “slight disadvantage”. In my opinion, it is a pretty big disadvantage if you run many queries that use secondary indexes. If it is only a slight disadvantage to them, it might suggest that those indexes are used rather seldom. That would mean, they are mostly searching by primary key (then there is no clustered index penalty to pay). Note that I wrote “searching” rather than “selecting”. The reason is that the clustered index penalty affects any statement that has a where clause—not just select. That also implies that the high frequency updates are mostly based on the primary key.

Finally there is another omission that tells me something about their queries: they don’t mention PostgreSQL’s limited ability to do index-only scans. Especially in an update-heavy database, the PostgreSQL implementation of index-only scans is pretty much useless. I’d even say this is the single issue that affects most of my clients. I’ve already blogged about this in 2011. In 2012, PostgreSQL 9.2 got limited support of index-only scans (works only for mostly static data). In 2014 I even raised one aspect of my concern at PgCon. However, Uber doesn’t complain about that. Select speed is not their problem. I guess query speed is generally solved by running the selects on the replicas (see below) and possibly limited by mostly doing primary key side.

By now, their use-case seems to be a better fit for a key/value store. And guess what: InnoDB is a pretty solid and popular key/value store. There are even packages that bundle InnoDB with some (very limited) SQL front-ends: MySQL and MariaDB are the most popular ones, I think. Excuse the sarcasm. But seriously: if you basically need a key/value store and occasionally want to run a simple SQL query, MySQL (or MariaDB) is a reasonable choice. I guess it is at least a better choice than any random NoSQL key/value store that just started offering an even more limited SQL-ish query language. Uber, on the other hand just builds their own thing (“Schemaless”) on top of InnoDB and MySQL.

On Index Rebalancing

One last note about how the article describes indexing: it uses the word “rebalancing” in context of B-tree indexes. It even links to a Wikipedia article on “Rebalancing after deletion.” Unfortunately, the Wikipedia article doesn’t generally apply to database indexes because the algorithm described on Wikipedia maintains the requirement that each node has to be at least half-full. To improve concurrency, PostgreSQL uses the Lehman, Yao variation of B-trees, which lifts this requirement and thus allows sparse indexes. As a side note, PostgreSQL still removes empty pages from the index (see slide 15 of “Indexing Internals”). However, this is really just a side issue.

What really worries me is this sentence: “An essential aspect of B-trees are that they must be periodically rebalanced, …” Here I’d like to clarify that this is not a periodic process one that runs every day. The index balance is maintained with every single index change (even worse, hmm?). But the article continues “…and these rebalancing operations can completely change the structure of the tree as sub-trees are moved to new on-disk locations.” If you now think that the “rebalancing” involves a lot of data moving, you misunderstood it.

The important operation in a B-tree is the node split. As you might guess, a node split takes place when a node cannot host a new entry that belongs into this node. To give you a ballpark figure, this might happen once for about 100 inserts. The node split allocates a new node, moves half of the entries to the new node and connects the new node to the previous, next and parent nodes. This is where Lehman, Yao save a lot of locking. In some cases, the new node cannot be added to the parent node straight away because the parent node doesn’t have enough space for the new child entry. In this case, the parent node is split and everything repeats.

In the worst case, the splitting bubbles up to the root node, which will then be split as well and a new root node will be put above it. Only in this case, a B-tree ever becomes deeper. Note that a root node split effectively shifts the whole tree down and therefore keeps the balance. However, this doesn’t involve a lot of data moving. In the worst case, it might touch three nodes on each level and the new root node. To be explicit: most real world indexes have no more than 5 levels. To be even more explicit: the worst case—root node split—might happen about five times for a billion inserts. On the other cases it will not need to go the whole tree up. After all, index maintenance is not “periodic”, not even very frequent, and is never completely changing the structure of the tree. At least not physically on disk.

On Physical Replication

That brings me to the next major concern the article raises about PostgreSQL: physical replication. The reason the article even touches the index “rebalancing” topic is that Uber once hit a PostgreSQL replication bug that caused data corruption on the downstream servers (the bug “only affected certain releases of Postgres 9.2 and has been fixed for a long time now”).

Because PostgreSQL 9.2 only offers physical replication in core, a replication bug “can cause large parts of the tree to become completely invalid.” To elaborate: if a node split is replicated incorrectly so that it doesn’t point to the right child nodes anymore, this sub-tree is invalid. This is absolutely true—like any other “if there is a bug, bad things happen” statement. You don’t need to change a lot of data to break a tree structure: a single bad pointer is enough.

The Uber article mentions other issues with physical replication: huge replication traffic—partly due to the write amplification caused by updates—and the downtime required to update to new PostgreSQL versions. While the first one makes sense to me, I really cannot comment on the second one (but there were some statements on the PostgreSQL-hackers mailing list).

Finally, the article also claims that “Postgres does not have true replica MVCC support.” Luckily the article links to the PostgreSQL documentation where this problem (and remediations) are explained. The problem is basically that the master doesn’t know what the replicas are doing and might thus delete data that is still required on a replica to complete a query.

According to the PostgreSQL documentation, there are two ways to cope with this issue: (1) delaying the application of the replication stream for a configurable timeout so the read transaction gets a chance to complete. If a query doesn’t finish in time, kill the query and continue applying the replication stream. (2) configure the replicas to send feedback to the master about the queries they are running so that the master does not vacuum row versions still needed by any slave. Uber’s article rules the first option out and doesn’t mention the second one at all. Instead the article blames the Uber developers.

On Developers

To quote it in all its glory: “For instance, say a developer has some code that has to email a receipt to a user. Depending on how it’s written, the code may implicitly have a database transaction that’s held open until after the email finishes sending. While it’s always bad form to let your code hold open database transactions while performing unrelated blocking I/O, the reality is that most engineers are not database experts and may not always understand this problem, especially when using an ORM that obscures low-level details like open transactions.”

Unfortunately, I understand and even agree with this argument. Instead of “most engineers are not database experts” I’d even say that most developers have very little understanding of databases because every developer that touches SQL needs know about transactions—not just database experts.

Giving SQL training to developers is my main business. I do it at companies of all sizes. If there is one thing I can say for sure is that the knowledge about SQL is ridiculously low. In context of the “open transaction” problem just mentioned I can conform that hardly any developer even knows that read only transactions are a real thing. Most developers just know that transactions can be used to back out writes. I’ve encountered this misunderstanding often enough that I’ve prepared slides to explain it and I just uploaded these slides for the curious reader.

On Success

This leads me to the last problem I’d like to write about: the more people a company hires, the closer their qualification will be to the average. To exaggerate, if you hire the whole planet, you’ll have the exact average. Hiring more people really just increases the sample size.

The two ways to beat the odds are: (1) Only hire the best. The difficult part with this approach is to wait if no above-average candidates are available; (2) Hire the average and train them on the job. This needs a pretty long warm-up period for the new staff and might also bind existing staff for the training. The problem with both approaches is that they take time. If you don’t have time—because your business is rapidly growing—you have to take the average, which doesn’t know a lot about databases (empirical data from 2014). In other words: for a rapidly growing company, technology is easier to change than people.

The success factor also affects the technology stack as requirements change over time. At an early stage, start-ups need out-of-the-box technology that is immediately available and flexible enough to be used for their business. SQL is a good choice here because it is actually flexible (you can query your data in any way) and it is easy to find people knowing SQL at least a little bit. Great, let’s get started! And for many—probably most—companies, the story ends here. Even if they become moderately successful and their business grows, they might still stay well within the limits of SQL databases forever. Not so for Uber.

A few lucky start-ups eventually outgrow SQL. By the time that happens, they have access to way more (virtually unlimited?) resources and then…something wonderful happens: They realize that they can solve many problems if they replace their general purpose database by a system they develop just for their very own use-case. This is the moment a new NoSQL database is born. At Uber, they call it Schemaless.

On Uber’s Choice of Databases

By now, I believe Uber did not replace PostgreSQL by MySQL as their article suggests. It seems that they actually replaced PostgreSQL by their tailor-made solution, which happens to be backed by MySQL/InnoDB (at the moment).

It seems that the article just explains why MySQL/InnoDB is a better backend for Schemaless than PostgreSQL. For those of you using Schemaless, take their advice! Unfortunately, the article doesn’t make this very clear because it doesn’t mention how their requirements changed with the introduction of Schemaless compared to 2013, when they migrated from MySQL to PostgreSQL.

Sadly, the only thing that sticks in the reader’s mind is that PostgreSQL is lousy.

If you like my way of explaining things, you’ll love my book.


Original title and author: “On Uber’s Choice of Databases” by Markus Winand.

PlanetMySQL Voting: Vote UP / Vote DOWN

Is the new MySQL Document Store and JSON data type cheating?

Is it cheating? Is using MySQL without Structured Query Language (SQL) or putting all your data into one column proper? Impossible a year ago and probably thought as a poor/crazy practice until recently, this is a new type of MySQL usage. NoSQL has had a big impact in the SQL world with several relational products from vendors like MySQL, Microsoft, Postgresql and others offering NoSQL interfaces and JSON data types to their databases.

Several old timers have come to me asking if putting lots of data in a JSON column is cheating? After all data normalization is part of relational databases and the way to efficiency and speed is well organized data. This higgledy–piggledy fashion of putting an entire document in a column without breaking it down to its component sections does violate the first rule of data normalization. And that has worked for decades pretty well.

But things change.

Many new development projects start with no idea what their data is going to look like. Maybe, if they are lucky, they will pick an API for interchanging data. Some will decide to use a database as a persistent message queue. Sometimes the service is product and the products offered by that service are nebulous at beast at the onset so the developers do not want to be locked into a schema.

And schema changes get expensive quickly. Code written before the change needs to be updated and you have to get the DBA to run an ALTER TABLE that can take too long to complete. And rolling back is almost impossible.

Compounding this is the lack of database skills in new developers. Very few have any training in Structured Query Language, relational theory, sets, data normalization, or other skills that have been held in high esteem since the days of Codd. And these new developers argue that they do not need these skills as technology and approaches to coding have evolved. So if the foundation of relational databases is the weak link then remove it.

The JSON data type has been very popular with developers. In the past you could dump JSON documents in a CHAR column and used REGEX or have your application dig out the parts you needed from within the document. Now with MySQL 5.7 there is a native JSON data type with supporting functions so that developers can manipulate the document data easily. So storing data in a JSON format in a column of a table within a database becomes a valuable has becomes a useful part of a developers life.

Ruby on Rails was the first piece of software that allowed many developers to see the power of CRUD (Create Read Update Delete) access to a database without needing to know SQL. The MySQL Document Store features that arrived with 5.7.13 also provide CRUD so developers can use 'collections' for schema-less data storage. The developers do not need to know SQL. relational theory, or any of the other old stuff. The good news for the old timers is that the data may be stored in a schema-less fashion but data is stored in a good ol' MySQL table. Which means the decades spent gathering SQL skills still work on this new data.

But is it cheating? Yes, and no. Yes in that you would get better performance out of having all your data in fourth or fifth normalized form just like a good driver can get better performance out of good driving practices with a manual transmission. No in that the data is still in a relational database and pertinent information can still be pried out of the JSON data. Heck you can even make materialized columns from the JSON data to create indexes.
PlanetMySQL Voting: Vote UP / Vote DOWN

SQL Injection Detection and Alerting are Vital For Secure Data

The recent Mossack Fonesca “Panama Papers” hack is the latest security breach to drive home how much an impact an SQL injection can have on modern-day organizations. Though that hack ultimately revealed massive professional fraud by companies and governments around the world, it involved the exposure of 11.5 million confidential documents. The 2.6 terabytes of data stolen were a powerful reminder that the history of web-based business has been riddled with instances of SQL injections. They remain common and potentially devastating to organizations.

Other high-profile examples of SQL injection include an instance when NASA sites were hacked in 2009, yielding site administrator info; when Heartland Payment systems were rummaged in 2008, resulting in the exposure of 134 million credit cards; and earlier this year, when the high-profile Ashley Madison leak occurred, many experts’ first thought was “SQL injection” (though that was later stated not to be the case). In 2012, Neira Jones, the head of payment security at Barclaycard cited SQL injections as responsible for 97 percent of all data breaches. And the Open Web Application Security Project called SQL injections the number one most prevalent attack in 2013.

The abundance and risk of SQL injection means that it’s vital for businesses to be able to detect when their own systems have been targeted, so they can respond as quickly and tactically as possible.

In the past, just knowing that your databases were under attack was difficult. However, VividCortex includes basic SQL injection detection for MySQL out of the box, ensuring that security teams, database administrators, and even CTOs can be made aware at the first sign of an attack. With our new Alerting & Integration module, those injection warnings can be transmitted directly to email, Slack, Victorops or any number of other messaging systems. We’ve written about our methods for detecting SQL injection in prior posts. As we mentioned there, we've actually built our SQL injection detection as part of our Query Analysis feature, so you can view the results of our detection as part of the Query Analysis dashboard. We class SQL injection events as “critical.”

[In Sample Info, if "Failed Rules" reads "Evidence of SQL injection," you'll know you need to take action.]

VividCortex is the first Database APM platform to reveal potential threats with such precision and speed. This can be used as a powerful, proactive, preventative measure: though many unskilled attackers will be able to trigger injections, they most likely will be unable to exploit them. However, the moment a skilled hacker is the one doing the probing, you're totalled. Knowing that, if you watch your system vigilantly, with the help of VividCortex. hopefully the frequent-but-unskilled probes will bring prompt attention to the faulty components that use unsanitized inputs.

Image cred: xkcd

If you want to see VividCortex in action, you can request a demo now. Once we're monitoring your systems, you'll be able to eliminate your most critical APM visibility gaps, with insights like detection of SQL injections in your MySQL databases. 

PlanetMySQL Voting: Vote UP / Vote DOWN

DDL Failures in MariaDB Galera Cluster

A MariaDB support customer recently asked me what would happen if a Data Definition Language (DDL) statement failed to complete on one or more nodes in MariaDB Galera Cluster. In this blog post, I will demonstrate what would happen. The demonstration below was performed on a 2-node cluster running MariaDB 10.1, but other Galera Cluster distributions should work similarly. Schema ... Read More
PlanetMySQL Voting: Vote UP / Vote DOWN

Percona Monitoring and Management 1.0.2 Beta

Percona announces the release of Percona Monitoring and Management 1.0.2 Beta on 28 July 2016.

Like prior versions, PMM is distributed through Docker Hub and is free to download. Full instructions for download and installation of the server and client are available in the documentation.

Notable changes to the tool include:

  • Upgraded to Grafana 3.1.0.
  • Upgraded to Prometheus 1.0.1.
  • Set default metrics retention to 30 days.
  • Eliminated port 9001. Now the container uses only one configurable port, 80 by default.
  • Eliminated the need to specify ADDRESS variable when creating Docker container.
  • Completely re-wrote pmm-admin with more functions
  • Added ability to stop all services using the new pmm-admin.
  • Added support to name instances using the new pmm-admin.
  • Query Analytics Application updates:
    • Redesigned queries profile table
    • Redesigned metrics table
    • Redesigned instance settings page
    • Added sparkline charts
    • Added ability to show more than ten queries
    • Various updates for MongoDB dashboards.

The full release notes are available in the documentation. The documentation also includes details on installation and architecture.

A demonstration of the tool has been set up at

We have also implemented forums for the discussion of PMM.

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system. As always, thanks for your continued support of Percona!

Some screen shots of the updates:

Note the new sparkline that shows the current load in context (so you know if the number is higher/normal/lower than normal), and the option to “Load next 10 queries” at the bottom of the listing.

Our admin tool was completely re-written with new functions:

pmm-admin –help output


pmm-admin list command output


pmm-admin check-network output, which provides information on the status of the client’s network connection to the server.
PlanetMySQL Voting: Vote UP / Vote DOWN

Upcoming Webinar August 2 10:00 am PDT: MySQL and Ceph

Join Brent Compton, Kyle Bader and Yves Trudeau on August 2, 2016 at 10 am PDT (UTC-7) for a MySQL and Ceph webinar.

Many operators select OpenStack as their control plane of choice for providing both internal and external IT services. The OpenStack user survey repeatedly shows Ceph as the dominant backend for providing persistent storage volumes through OpenStack Cinder. When building applications and repatriating old workloads, developers are discovering the need to provide OpenStack infrastructure database services. Given MySQL’s ubiquity, and it’s reliance on persistent storage, it is of utmost importance to understand how to achieve the performance demanded by today’s applications. Databases like MySQL can be incredibly IO intensive, and Ceph offers a great opportunity to go beyond the limitations presented by a single scale-up system. Since Ceph provides a mutable object store with atomic operations, could MySQL store InnoDB pages directly in Ceph?

This talk reviews the general architecture of Ceph, and then discusses benchmark results from small to mid-size Ceph clusters. These benchmarks lead to the development of prescriptive guidance around tuning Ceph storage nodes (OSDs), the impact the amount of physical memory, and the presence of SSDs, high-speed networks or RAID controllers.

Click here to register now. Speakers: Brent Compton
Director Storage Solution Architectures, Red Hat
Brent Compton is Director Storage Solution Architectures at Red Hat. He leads the team responsible for building Ceph and Gluster storage reference architectures with Red Hat Storage partners. Before Red Hat, Brent was responsible for emerging non-volatile memory software technologies at Fusion-io. Previous enterprise software leadership roles include VP Product Management at Micromuse (now IBM Tivoli Netcool) and Product Marketing Director within HP’s OpenView software division. Brent also served as Director Middleware Development Platforms at the LDS Church and as CIO at Joint Commission International. Brent has a tight-knit family, and can be found on skis or a mountain bike whenever possible. Kyle Bader
Sr Solution Architect, Red Hat
Kyle Bader, a Red Hat senior architect, provides expertise in the design and operation of petabyte-scale storage systems using Ceph. He joined Red Hat as part of the 2014 Inktank acquisition. As a senior systems engineer at DreamHost, he helped implement, operate, and design Ceph and OpenStack-based systems for DreamCompute and DreamObjects cloud products. Yves Trudeau
Principal Architect
Yves is a Principal Consultant at Percona, specializing in MySQL High-Availability and scaling solutions. Before joining Percona in 2009, he worked as a senior consultant for MySQL AB and Sun Microsystems, assisting customers across North America with NDB Cluster and Heartbeat/DRBD technologies. Yves holds a Ph.D. in Experimental Physics from Université de Sherbrooke. He lives in Québec, Canada with his wife and three daughters.
PlanetMySQL Voting: Vote UP / Vote DOWN

Performance of Inserts on Partitions – MySQL 5.6 v/s MySQL 5.7

Recently, I was discussing with one of my colleagues about how insert statement performs for MySQL partitioned tables. General prediction is that it should be slower than for non-partitioned tables, but how much that we didn’t know. So, I thought let’s test with different types of partitions (i.e range, list and hash) and also with different number of partitions and check how’s performance. As people says, MySQL 5.7 is must faster than old one, so I also tested partitions with it.

So, I took simple table with 3.2M records on Centos 6.7 VM (4 core with 2GB RAM) with default my.cnf settings and then created tables for range, list and hash partitioning with 5,25,50 and 100 partitions. i.e with 5 partition (range and list), the table structures were like

CREATE TABLE emp_range_5( id int, fname varchar (30), lname varchar (30), hired_date date not null, separated_date date not null, job_code int, store_id int ) PARTITION BY RANGE (store_id) ( PARTITION p0 VALUES LESS THAN (5), PARTITION p1 VALUES LESS THAN (25), PARTITION p2 VALUES LESS THAN (50), PARTITION p3 VALUES LESS THAN (75), PARTITION p4 VALUES LESS THAN MAXVALUE ); CREATE TABLE emp_list_5( id int, fname varchar(30), lname varchar(30), hired_date date not null, separated_date date not null, job_code int, store_id int ) PARTITION BY LIST(store_id) ( PARTITION pNorth VALUES IN (3,5,6,9,17,21,22,23,24,25,26,27,28,29,30,31), PARTITION pEast VALUES IN (1,2,10,11,19,20,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50), PARTITION pWest VALUES IN (4,12,13,14,18,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70), PARTITION pSouth VALUES IN (86,87,88,89,90,91,92,93,94,95,96,97,98,99,100), PARTITION pCentral VALUES IN (7,8,15,16,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85) );

One by one, I have inserted 3.2M records to each table with different number of partitions and types like,

mysql> INSERT INTO emp_range_5 SELECT * FROM emp_new; Query OK, 3276800 rows affected (23.15 sec) Records: 3276800 Duplicates: 0 Warnings: 0

and get the below result.

If we check MySQL 5.6 test result, range type is better than list and hash partitions, but when we add more and more partitions it’s taking more time to insert records than into non-partitioned table. So at least this test proves that “Partitions are slow”. Here, partitioned tables are slower than non-partitioned tables for inserts because different types of partitioning and number of partitions add different overhead. With huge tables, opposite trend may appear where partitions are fast because insert into one huge b-tree can be slower than search and insert into 100 times smaller b-tree. Need to do more research on this.

If we compare MySQL 5.6 test result with 5.7, surprisingly it’s more slow but non-partition table is faster than with MySQL 5.6. So we can consider that 5.7 is faster than 5.6 for non-partition tables but not the ones with partitions.

For MySQL 5.6, I was bit curious about why Inserts are slow with partitions so I’ve tried to figure out with different ways. First I’ve checked profile data from Performance_Schema (P_S), i.e. for below query

mysql> INSERT INTO emp_range_5 SELECT * FROM emp_new; Query OK, 3276800 rows affected (23.12 sec) Records: 3276800 Duplicates: 0 Warnings: 0 mysql> SELECT EVENT_ID, TRUNCATE(TIMER_WAIT/1000000000000,6) as Duration, SQL_TEXT FROM performance_schema.events_statements_history_long WHERE SQL_TEXT like '%emp%'; +----------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+ | EVENT_ID | Duration | SQL_TEXT | +----------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+ | 256 | 0.000328 | SELECT EVENT_ID, TRUNCATE(TIMER_WAIT/1000000000000,6) as Duration, SQL_TEXT FROM performance_schema.events_statements_history_long WHERE SQL_TEXT like '%emp%' | | 333 | 22.837071 | INSERT INTO emp_range_5 SELECT * FROM emp_new | +----------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+ 2 rows in set (0.02 sec) mysql> SELECT event_name AS Stage, TRUNCATE(TIMER_WAIT/1000000000000,6) AS Duration FROM performance_schema.events_stages_history_long WHERE NESTING_EVENT_ID=333; +--------------------------------+-----------+ | Stage | Duration | +--------------------------------+-----------+ | stage/sql/init | 0.000052 | | stage/sql/checking permissions | 0.000003 | | stage/sql/checking permissions | 0.000001 | | stage/sql/Opening tables | 0.001760 | | stage/sql/init | 0.001701 | | stage/sql/System lock | 0.000017 | | stage/sql/optimizing | 0.000004 | | stage/sql/statistics | 0.000010 | | stage/sql/preparing | 0.000008 | | stage/sql/executing | 0.000000 | | stage/sql/Sending data | 23.064697 | | stage/sql/end | 0.000027 | | stage/sql/query end | 0.057433 | | stage/sql/closing tables | 0.000024 | | stage/sql/freeing items | 0.001319 | | stage/sql/logging slow query | 0.000001 | | stage/sql/cleaning up | 0.000002 | +--------------------------------+-----------+ 17 rows in set (0.01 sec)

So “Sending data” is main task which is taking more time and that is expected as we are doing “insert into …select * from”. But why? Then I have tried to check in P_S  if we really have any waits instrumented in detailed enough level so it can apply to P_S for partitioning and we can check. I ran below query.

mysql> select event_name, count_star, sum_timer_wait -> from performance_schema.events_waits_summary_global_by_event_name -> where count_star > 0 -> order by sum_timer_wait desc limit 10; +--------------------------------------+------------+-------------------+ | event_name | count_star | sum_timer_wait | +--------------------------------------+------------+-------------------+ | idle | 550 | 39826303901000000 | | wait/io/table/sql/handler | 78522954 | 330357979232652 | | wait/io/file/innodb/innodb_data_file | 555776 | 251974879365594 | | wait/io/file/innodb/innodb_log_file | 18619 | 61417239253993 | | wait/io/file/myisam/dfile | 3125 | 407945499647 | | wait/io/file/sql/FRM | 2788 | 238943294597 | | wait/io/file/sql/partition | 643 | 46937210061 | | wait/io/file/myisam/kfile | 336 | 33820372614 | | wait/io/file/sql/pid | 3 | 9358099177 | | wait/io/file/sql/casetest | 10 | 2381744687 | +--------------------------------------+------------+-------------------+ 10 rows in set (1.68 sec)

I got the details but I didn’t truncated that table before test so it might be the old details. So I truncate that table, run the insert test again and check but I didn’t get event “wait/io/file/sql/partition” in the list. When we drop or truncate, it shows otherwise not. After some more research I found that probably it is instrumented only for DDL commands, but not for DML

We clearly see that inserts are getting slow from the above statistics, but I was not able to see where time is spent more from P_S. So finally I moved to perf utility where we can monitor particular PID and get the details. In one session, I ran this command “perf record -p 3306” and started insert test from another session in mysql, and captured the for each statements. I compare few partition related entries along with IO and found what’s presented below.

With 5 partitions: 6.26% mysqld [kernel.kallsyms] [k] finish_task_switch 5.55% mysqld [kernel.kallsyms] [k] ioread32 1.77% mysqld [kernel.kallsyms] [k] iowrite32 0.30% mysqld mysqld [.] ha_partition::write_row(unsigned char*) 0.24% mysqld mysqld [.] get_partition_id_range(partition_info*, unsigned int*, long long*) 0.22% mysqld mysqld [.] select_insert::send_data(List<Item>&) With 25 partitions: 7.14% mysqld [kernel.kallsyms] [k] finish_task_switch 6.56% mysqld [kernel.kallsyms] [k] ioread32 1.78% mysqld [kernel.kallsyms] [k] iowrite32 0.59% mysqld mysqld [.] get_partition_id_range(partition_info*, unsigned int*, long long*) 0.26% mysqld mysqld [.] ha_partition::write_row(unsigned char*) 0.26% mysqld mysqld [.] select_insert::send_data(List<Item>&) With 50 partitions: 6.75% mysqld [kernel.kallsyms] [k] finish_task_switch 6.23% mysqld [kernel.kallsyms] [k] ioread32 1.71% mysqld [kernel.kallsyms] [k] iowrite32 0.62% mysqld mysqld [.] get_partition_id_range(partition_info*, unsigned int*, long long*) 0.30% mysqld mysqld [.] ha_partition::write_row(unsigned char*) 0.29% mysqld mysqld [.] select_insert::send_data(List<Item>&) With 100 partitions: 6.45% mysqld [kernel.kallsyms] [k] ioread32 6.10% mysqld [kernel.kallsyms] [k] finish_task_switch 1.69% mysqld [kernel.kallsyms] [k] iowrite32 0.70% mysqld mysqld [.] get_partition_id_range(partition_info*, unsigned int*, long long*) 0.43% mysqld mysqld [.] ha_partition::write_row(unsigned char*) 0.25% mysqld mysqld [.] select_insert::send_data(List<Item>&)

So it seems as and when number of partitions increases, there are more ranges to compare with for each record and so more CPU is spent on that.  While without partitions this cost will be saved. As in above result get_partition_id_range function was related to partition, I tried to check the code,

int get_partition_id_range(partition_info *part_info, uint *part_id, longlong *func_value) ... /* Search for the partition containing part_func_value */ while (max_part_id > min_part_id) { loc_part_id= (max_part_id + min_part_id) / 2; if (range_array[loc_part_id] <= part_func_value) min_part_id= loc_part_id + 1; else max_part_id= loc_part_id; } loc_part_id= max_part_id; *part_id= (uint)loc_part_id; if (loc_part_id == max_partition && part_func_value >= range_array[loc_part_id] && !part_info->defined_max_value) ... }

There is a loop. the more partitions, the more time might be spent there. It is non-linear search and midpoint so log iterations where n is no. of partitions. Now, if we check same code for the list function, It’s linear, no midpoint and searching from both the side.

int get_partition_id_list(partition_info *part_info, uint32 *part_id, longlong *func_value) ... while (max_list_index >= min_list_index) { list_index= (max_list_index + min_list_index) >> 1; list_value= list_array[list_index].list_value; if (list_value < part_func_value) min_list_index= list_index + 1; else if (list_value > part_func_value) { if (!list_index) goto notfound; max_list_index= list_index - 1; } else { *part_id= (uint32)list_array[list_index].partition_id; DBUG_RETURN(0); } } ...

This is probably the reason why list partitions are slower than range.

I’ve tried to check the same perf report for MySQL 5.7, but it seems it uses different functions. I wasn’t able to find get_partition_id_range and ha_partitions. While checking doc, found that:

“As of MySQL 5.7.6, InnoDB supports native partitioning. Previously, InnoDB relied on the ha_partition handler, which creates a handler object for each partition. With native partitioning, a partitioned InnoDB table uses a single partition-aware handler object. This enhancement reduces the amount of memory required for partitioned InnoDB tables “

Instead I got this function,

1.14% mysqld mysqld [.] row_insert_for_mysql_using_ins_graph(unsigned char const*, row_prebuilt_t*)

This function must have some explanation but I already covered many things on this post. I don’t want to make it more lengthy. Seems, I’ll have to make part II to understand why the partitions are slow in MySQL 5.7 compared to 5.6.

Conclusion: during this research I found that partitioned tables are slower comparing to non-partitioned tables for INSERTs, and list and hash are slower partition types (with the same number of partitions) comparing to range partitioning. But few questions are still open like:

  1. Are partitioned tables really always slow for INSERTs? Might be my test was done within limited environment and result can be different with huge tables and large environment.
  2. What changes has been done in MySQL 5.7 that made it noticeably more slow with partitions comparing to MySQL 5.6?

PlanetMySQL Voting: Vote UP / Vote DOWN

Jörg Brühe: Multiple MySQL Instances on a Single Machine

Typically, on a single machine (be it a physical or a virtual one) only a single MySQL instance (process) is running. This is perfectly ok for all those situations where a single instance is sufficient, like for storing small amounts of data (RedHat using MySQL for postfix, KDE using it for akonadi, ...), as well as those where a dedicated machine per MySQL instance is appropriate (high CPU load, memory fully loaded, availability requirements).

But there are also those users who want to (or would like to) have multiple instances which would still fit into a single machine. Even among them, a single instance per machine is typical. For this, there are good reasons:

  • MySQL comes with defaults for files (config file, error log, ...) and directories (data directory, binlogs, ...) which would cause conflicts between multiple instances (unless they are changed).
  • The scripts coming with MySQL, especially the automated start/stop with machine reboot/shutdown, are written for a single instance only.
  • Last but not least: The instructions. those in the manual as well as the many "How to setup ..." in the Web, cover a single instance only.
Because of this, users often either restrict themselves to a single instance, or they set up several virtual machines (or containers) holding a single instance each.

But that overhead (both in software and in labour) isn't necessary: There is a way out, supporting easy handling of multiple MySQL instances on a single machine directly, without containers or VMs. This is our "MyEnv" package, available for download here, licensed under the GPL.

What Does MyEnv Do?

MyEnv cares about two aspects which in combination provide easy use of multiple instances:

  • It helps to configure multiple MySQL instances without overlap, so they won't collide with each other.
  • It maintains separate environments, each to manage and access one specific instance.

Each environment contains the path to the binaries (so the instances can use different versions), the config file, the socket and port number, data directory, error log etc. The environment is specified by a name (choose a meaningful one!), and it is switched by using its name as a shell command. (MyEnv creates an alias for that.)

Administrative commands like "start" and "stop" will manage the instance of the current environment. MySQL client programs like "mysql" or "mysqldump" will access that instance.

MyEnv supports the autostart of instances at machine boot, configurable per instance - something which is impossible using only the tools of a MySQL distribution.

Of course, an instance started via MyEnv (either manually or via autostart) can be accessed by any other client program on the machine, or from any other machine in the network - all that is needed is the specification of the proper socket or network port.

Handling Multiple Binaries

In the previous section, I wrote the instances can use different versions. This is done by installing those different versions into different locations, controlled by MyEnv, and the directory with the binaries will become a component of the user's PATH variable, switched when the environment is switched. Obviously, this works only if the destination path of the installation can be controlled, which implies the tar.gz format - RPM or DEB packages have fixed destinations, so different versions would overwrite each other on installation.
But that is no severe limitation, as all MySQL versions are available in tar.gz format, and these are sufficiently generic to run on any reasonably current Linux distribution.
(Yes, that is something I forgot to mention: MyEnv is developed and tested on Linux only. You are welcome to try it on any other Unix platform, and we will gladly listen to your experiences and accept your contributions, but we do not actively pursue non-Linux platforms.)

This support for multiple versions makes MyEnv the perfect tool for application development: Using a single machine, you can let your application access the MySQL servers of different versions and can verify it works the way you want it to.

Similar, you can install binaries of MySQL (Oracle), Percona Server, or MariaDB, and verify your application is portable across them.

And the adventurous among us can use different binaries, from the same or different vendor(s), to test whether replication works across versions and/or vendors, all without the effort of installing a separate VM or container setup.

MyEnv and Galera Cluster Till now, I mentioned MySQL (and its variants), and many readers may associate that term with a traditional single instance. So I better state explicitly: Of course, such an instance can take part in replication, in any role: master, slave, or intermediate in multi-level replication.

But besides single instances and replication, there exists a different MySQL setup: Nodes combined to form a Galera Cluster. And again, let me state explicitly: Again of course, an instance controlled by MyEnv can be a node participating in a Galera Cluster.

Those readers who have experiance with Galera Cluster (or who have just read the documentation or blogs about it) know that to start the first node of a cluster a special command is needed, called "bootstrap" - a simple "start" will not do. So this command was also added to MyEnv, it can manage a Galera Cluster completely by its builtin commands.

RPM and DEB packages Above, I wrote that to install different versions you cannot use RPM or DEB packages. I did not write that MyEnv cannot use RPM or DEB - in fact it can, the absolute path names in these formats just limit this to a single version.

So you can install the RPM or DEB of your choice, disable its autostart, and then call MyEnv to create multiple instances. You will give them different names, specify different sockets and ports and use different data directories, but for all of them you will specify the same path "/usr". As a result, MyEnv will simply manage multiple instances of the same version.

You can configure them differently to test the consequences, or you can set them up to replicate among them - master and slave can run on the same machine. Of course, this will not give you the "high availability" or the "scale-out" benefits which are the typical reasons to use replication, but I trust this wasn't your purpose for this test.

Using binaries that include Galera, and configuring them properly, you can even run all nodes of a Galera Cluster as separate instances on a single machine. That may be considered to stretch the concept, because a single machine is a very different setup than separate machines, but it gives an idea of the possibilities opened by MyEnv.

Typical Use of MyEnv

Admitted: The claim to know what MyEnv is used for by others would be arrogant, and I do not uphold it. Nonetheless, we do know some use cases of people who downloaded MyEnv, and they are close to our internal use of the tool.

MyEnv allows to have multiple MySQL instances on the same machine, to manage them separately, and to access them using MySQL client programs or other applications. So it is the perfect setup for all those who need to access different versions: developers and software testers.

When we encounter some unexpected behaviour, we often want to know whether it is specific to some version or series, or is widespread. To check that, MyEnv is the perfect infrastructure: You write a test case to provoke the effect and run it on several versions, then you note the result and can tell whether it exists "since ages" or is new, whether it still occurs in current versions or will change with an upgrade - exactly the information you need to decide about an upgrade or write a bug report.

Database administrators and application developers use it to avoid nasty surprises with new versions, so their production instances will not suffer from unexpected functional changes. Setting up a test environment, especially for multiple versions, becomes cheap, much less ressources are needed. You don't need to copy your test code onto different machines, and you are sure you are running identical tests, so that you won't compare apples and oranges.


If all that made you curious, I invite you to look into the instructions, to download MyEnv and to try it. And of course, your feedback and reports are very welcome.

Take care!

Appendix: Where to Meet Us

All FromDual colleagues will deliver talks at the FrOSCon in St. Augustin near Cologne, Germany, on August 20 and 21, so that is a good opportunity for personal contact. As several talks will be delivered in English, the conference also meets the needs of attendants who cannot follow a German talk - check the programme. Froscon is a famous event, very interesting talks are promised, and I look forward to enjoy the community atmosphere there.

I will deliver a talk at the "Open Source Backup Conference" in Cologne, Germany, on September 26 and 27; this conference is held in English.

I do not have feedback yet about Percona Live in Amsterdam, I may attend that also.

And finally, FromDual will again have a booth and deliver talks at the DOAG conference on November 15 - 18 in Nuremberg, Germany. This is "the" event for Oracle users (at least in Germany, maybe in all Europe), and it has a separate track dealing with MySQL only.

We will be delighted to meet you face to face!

PlanetMySQL Voting: Vote UP / Vote DOWN

Planets9s - Sign up for our webinar trilogy on MySQL Query Tuning

Welcome to this week’s Planets9s, covering all the latest resources and technologies we create around automation and management of open source database infrastructures.

Sign up for our webinar trilogy on MySQL Query Tuning

This is a new webinar trilogy on MySQL Query Tuning, which follows the popular webinar on MySQL database performance tuning. In this trilogy, we will look at query tuning process and tools to help with that. We’ll cover topics such as SQL tuning, indexing, the optimizer and how to leverage EXPLAIN to gain insight into execution plans. This is a proper deep-dive into optimising MySQL queries, which we’re covering in three parts.

Sign up for the webinars

ClusterControl Developer Studio: MongoDB Replication Lag Advisor

This blog post explains, step by step, how we implemented our MongoDB replication lag advisor in our Developer Studio. We have included this advisor in ClusterControl 1.3.2, and enabled it by default on any MongoDB cluster or replica set. ClusterControl Developer Studio allows you to write your own scripts, advisors and alerts. With just a few lines of code, you can already automate your clusters. Happy clustering!

Read the blog

MySQL on Docker: Single Host Networking for MySQL Containers

Having covered the basics of running MySQL in a container and how to build a custom MySQL image in our previous MySQL on Docker posts, we are now going to cover the basics of how Docker handles single-host networking and how MySQL containers can leverage that. We’d love to hear your feedback, so feel free to comment on our blogs as well.

Read the blog

That’s it for this week! Feel free to share these resources with your colleagues and follow us in our social media channels.

Have a good end of the week,

Jean-Jérôme Schmidt
Planets9s Editor
Severalnines AB

PlanetMySQL Voting: Vote UP / Vote DOWN

Percona Live Europe Amsterdam 2016 Tutorial Schedule is Up!

This blog post lists the Percona Live Europe Amsterdam 2016 tutorial schedule.

We are excited to announce that the tutorial schedule for the Percona Live Europe Amsterdam Open Source Database Conference 2016 is up!

The Percona Live Europe Amsterdam Open Source Database Conference is the premier event for the diverse and active open source community, as well as businesses that develop and use open source software. The conferences have a technical focus with an emphasis on the core topics of MySQL, MongoDB, and other open source databases. Tackling subjects such as analytics, architecture and design, security, operations, scalability and performance, Percona Live provides in-depth discussions for your high-availability, IoT, cloud, big data and other changing business needs.

Tackling subjects such as analytics, architecture and design, security, operations, scalability and performance, Percona Live Europe provides in-depth discussions for your high-availability, IoT, cloud, big data and other changing business needs. This conference is an opportunity to network with peers and technology professionals by bringing together accomplished DBA’s, system architects and developers from around the world to share their knowledge and experience – all to help you learn how to tackle your open source database challenges in a whole new way. These tutorials are a must for any data performance professional!

The Percona Live Europe Open Source Database Conference is October 3-5 at the Mövenpick Hotel Amsterdam City Centre.

Click through to the tutorial link right now, look them over, and pick which sessions you want to attend. Discounted passes available below!

Tutorial List: Early Bird Discounts

Just a reminder to everyone out there: our Early Bird discount rate for the Percona Live Europe Amsterdam Open Source Database Conference is only available ‘til August 8, 2016, 11:30 pm PST! This rate gets you all the excellent and amazing opportunities that Percona Live offers, at a very reasonable price!

Sponsor Percona Live

Become a conference sponsor! We have sponsorship opportunities available for this annual MySQL, MongoDB and open source database event. Sponsors become a part of a dynamic and growing ecosystem and interact with hundreds of DBAs, sysadmins, developers, CTOs, CEOs, business managers, technology evangelists, solutions vendors, and entrepreneurs who attend the event.

PlanetMySQL Voting: Vote UP / Vote DOWN

Monitoring MongoDB with Nagios

In this blog, we’ll discuss monitoring MongoDB with Nagios.

There is a significant amount of talk around graphing MongoDB metrics using things like Prometheus, Data Dog, New Relic, and Ops Manager from MongoDB Inc. However, I haven’t noticed a lot of talk around “What MongoDB alerts should I be setting up?”

While building out Percona’s remote DBA service for MongoDB, I looked at Prometheus’s AlertManager. After reviewing it, I’m not sure it’s quite ready to be used exclusively. We needed to decide quickly if there are better Nagios checks on the market, or did I need to write my own?

In the end, we settled on a hybrid approach. There are some good frameworks, but we need to create or tweak some of the things needed for an “SEV 1-” or “SEV 2-” type issue (which are most important to me). One of the most common problems for operations, Ops, DevOps, DBA teams and most engineering is alert spam. As such I wanted to be very careful to only alert on the things pointing to immediate dangers or current outages. As a result, we have now added to the GitHub for Percona Monitoring Plugins. Since we use Grafana and Prometheus for metrics and graphing, there are no accompanying Catci information templates. In the future, we’ll need to decide how this will change PMP overtime. In the meantime, we wanted to make the tool available now and worry about some of the issues later on.

As part of this push, I want to give you some real world examples of how you might use this tool. There are many options available to you, and Nagios is still a bit green in regards to making those options as user-friendly as our tools are.

Usage: [options] Options: -h, --help show this help message and exit -H HOST, --host=HOST The hostname you want to connect to -P PORT, --port=PORT The port mongodb is running on -u USER, --user=USER The username you want to login as -p PASSWD, --password=PASSWD The password you want to use for that user -W WARNING, --warning=WARNING The warning threshold you want to set -C CRITICAL, --critical=CRITICAL The critical threshold you want to set -A ACTION, --action=ACTION The action you want to take. Valid choices are (check_connections, check_election, check_lock_pct, check_repl_lag, check_flushing, check_total_indexes, check_balance, check_queues, check_cannary_test, check_have_primary, check_oplog, check_index_ratio, check_connect) Default: check_connect -s SSL, --ssl=SSL Connect using SSL -r REPLICASET, --replicaset=REPLICASET Connect to replicaset -c COLLECTION, --collection=COLLECTION Specify the collection in check_cannary_test -d DATABASE, --database=DATABASE Specify the database in check_cannary_test -q QUERY, --query=QUERY Specify the query, only used in check_cannary_test --statusfile=STATUS_FILENAME File to current store state data in for delta checks --backup-statusfile=STATUS_FILENAME_BACKUP File to previous store state data in for delta checks --max-stale=MAX_STALE Age of status file to make new checks (seconds)

There seems to be a huge amount going on here, but let’s break it down into a few categories:

  • Connection options
  • Actions
  • Action options
  • Status options

Hopefully, this takes some of the scariness out of the script above.

Connection options
  • Host / Port Number
    • Pretty simple, this is just the host you want to connect to and what TCP port it is listening on.
  • Username and Password
    • Like with Host/Port, this is some of your normal and typical Mongo connection field options. If you do not set both the username and password, the system will assume auth was disabled.
  • SSL
    • This is mostly around the old SSL support in Mongo clients (which was a boolean). This tool needs updating to support the more modern SSL connection options. Use this as a “deprecated” feature that might not work on newer versions.
  • ReplicaSet
    • Very particular option that is only used for a few checks and verifies that the connection uses a replicaset connection. Using this option lets the tool automatically find a primary node for you, and is helpful to some checks specifically around replication and high availability (HA):
      • check_election
      • check_repl_lag
      • check_cannary_test
      • chech_have_primary
      • check_oplog
Actions and what they mean
  • check_connections
    • This parameter refers to memory usage, but beyond that you need to know if your typical connections suddenly double. This indicates something unexpected happened in the application or database and caused everything to reconnect. It often takes up to 10 minutes for those old connections to go away.
  • check_election
    • This uses the status file options we will cover in a minute, but it checks to see if the primary from the last check differs from the current found primary. If so, it alerts. This check should only have a threshold of one before it alarms (as an alert means an HA event occurred).
  • check_lock_pct
    • MMAP only, this engine has a write lock on the whole collection/database depending on the version. This is a crucial metric to determine if MMAP writes are blocking reads, meaning you need to scale the DB layer in some way.
  • check_repl_lag
    • Checks the replication stream to understand how lagged a given node is the primary. To accomplish this, it uses a fake record in the test DB to cause a write. Without this, a read-only system would look lagged artificially as no new oplog entries get created.
  • check_flushing
    • A common issue with MongoDB is very long flush times, causing a system halt. This is a caused by your disk subsystem not keeping up, and then the DB having to wait on flushing to make sure writes get correctly journaled.
  • check_total_indexes
    • The more indexes you have, the more the planner has to work to determine which index is a good fit. This increases the risk that the recovery of a failure will take a long time. This is due to the way a restore builds indexes and how MongoDB can only make one index at a time.
  • check_balance
    • While MongoDB should keep things in balance across a cluster, many things can happen: jumbo chunks, a disabled balancer being, constantly attempting to move the same chunk but failing, and even adding/removing sharding. This alert is for these cases, as an imbalance means some records might get served faster than others. It is purely based on the chunk count that the MongoDB balancer is also based on, which is not necessarily the same as disk usage.
  • check_queues
    • No matter what engine you have selected, a backlog of sustained reads or writes indicates your DB layer is unable to keep up with demand. It is important in these cases to send an alert if the rate is maintained. You might notice this is also in our Prometheus exporter for graphics as both trending and alerting are necessary to watch in a MongoDB system.
  • check_cannary_test
    • This is a typical query for the database and then used to set critical/warning levels based on the latency of the returned query. While not as accurate as full synthetic transactions, queries through the application are good to measure response time expectations and SLAs.
  • check_have_primary
    • If we had an HA event but failed to get back up quickly, it’s important to know if a new primary is causing writes to error on the system. This check simply determines if the replica set has a primary, which means it can handle reads and writes.
  • check_oplog
    • This check is all about how much oplog history you have. This is much like measuring how much history you have in MySQL blogs. The reason this is important is when recovering from a backup and performing a point in time recovery, you can use the current oplog if the oldest timestamp in the oplog is newer than the backup timestamp. As a result, this is normal three times the backup interval you use to guarantee that you have plenty of time to find the newest recovery and then do the recovery.
  • check_index_ratio
    • This is an older metric that modern MongoDB versions will not find useful, but in the past, it was a good way to understand the percentage of queries not handled by an index.
  • check_connect
    • A very basic check to ensure it can connect (and optionally login) to MongoDB and verify the server is working.
Status File options

These options rarely need to be changed but are present in case you want to store the status on an SHM mount point to avoid actual disk writes.

  • statusfile
    • This is where a copy of the current rs.status, serverStatus and other command data is stored
  • backup-statusfile
    • Like status_file, but status_file is moved here when a new check is done. These two objects can then be compared to find the delta between two checkpoints.
  • max-stale
    • This is the amount of age for which an old file is still valid. Deltas older then this aren’t allowed and exist to protect the system from will assumption when a statusfile is hours or days old.

If you have any questions on how to use these parameters, feel free to let us know. In the code, there is also a defaults dictionary for most of these options so that in many cases setting warning and critical level are not needed.

PlanetMySQL Voting: Vote UP / Vote DOWN

New Webinar Trilogy: The MySQL Query Tuning Deep-Dive

Following our popular webinar on MySQL database performance tuning, we’re excited to introduce a new webinar trilogy dedicated to MySQL query tuning.

This is an in-depth look into the ins and outs of optimising MySQL queries conducted by Krzysztof Książek, Senior Support Engineer at Severalnines.

When done right, tuning MySQL queries and indexes can significantly increase the performance of your application as well as decrease response times. This is why we’ll be covering this complex topic over the course of three webinars of 60 minutes each.

Dates Part 1: Query tuning process and tools

Tuesday, August 30th

Part 2: Indexing and EXPLAIN - deep dive

Tuesday, September 27th

Part 3: Working with the optimizer and SQL tuning

Tuesday, October 25th

Agenda Part 1: Query tuning process and tools
  • Query tuning process
    • Build
    • Collect
    • Analyze
    • Tune
    • Test
  • Tools
    • tcpdump
    • pt-query-digest
Part 2: Indexing and EXPLAIN - deep dive
  • How B-Tree indexes are built?
  • Indexes - MyISAM vs. InnoDB
  • Different index types
    • B-Tree
    • Fulltext
    • Hash
  • Indexing gotchas
  • EXPLAIN walkthrough - query execution plan
Part 3: Working with optimizer and SQL tuning
  • Optimizer
    • How execution plans are calculated
    • InnoDB statistics
  • Hinting the optimizer
    • Index hints
    • JOIN order modifications
    • Tweakable optimizations
  • Optimizing SQL

Krzysztof Książek, Senior Support Engineer at Severalnines, is a MySQL DBA with experience in managing complex database environments for companies like Zendesk, Chegg, Pinterest and Flipboard. He’s the main author of the Severalnines blog and webinar series: Become a MySQL DBA.

PlanetMySQL Voting: Vote UP / Vote DOWN

Testing Samsung storage in tpcc-mysql benchmark of Percona Server

This blog post will detail the results of Samsung storage in tpcc-mysql benchmark using Percona Server.

I had an opportunity to test different Samsung storage devices under tpcc-mysql benchmark powered by Percona Server 5.7. You can find a summary with details here

I have in my possession:

  • Samsung 850 Pro, 2TB: This is a SATA device and is positioned as consumer-oriented, something that you would use in a high-end user desktop. As of this post, I estimate the price of this device as around $430/TB.
  • Samsung SM863, 1.92TB: this device is also a SATA, and positioned for a server usage. The current price is about $600/TB. 
  • Samsung PM1725, 800GB: This is an NVMe device, in a 2.5″ form factor, but it requires a connection to a PCIe slot, which I had to allocate in my server. The device is high-end, oriented for server-side and demanding workloads. The current price is about $1300/TB.

I am going to use 1000 warehouses in the tpcc-mysql benchmarks, which corresponds roughly to a data size of 100GB.

This benchmark varies the innodb_buffer_pool_size from 5GB to 115GB. With 5GB buffer pool size only a very small portion of data fits into memory, which results in intensive foreground IO reads and intensive background IO writes. With 115GB almost all data fits into memory, which results in very small (or almost zero) IO reads and moderate background IO writes.

All buffer pool sizes in the middle of the interval correspond to resulting IO reads and writes. For example, we can see the read to write ratio on the chart below (received for the PM1725 device) with different buffer pool sizes:

We can see that for the 5GB buffer pool size we have 56000 read IOPs operations and 32000 write IOPs. For 115GB, the reads are minimal at about 300 IOPS and the background writes are at the 20000 IOPs level. Reads gradually decline with the increasing buffer pool size.

The charts are generated with the Percona Monitoring and Management tools.


Let’s review the results. The first chart shows measurements taken every one second, allowing us to see the trends and stalls.

If we take averages, the results are:

In table form (the results are in new order transactions per minute (NOTPM)):

bp, GB pm1725 sam850 sam863 pm1725 / sam863 pm1725 / sam850 5 42427.57 1931.54 14709.69 2.88 21.97 15 78991.67 2750.85 31655.18 2.50 28.72 25 108077.56 5156.72 56777.82 1.90 20.96 35 122582.17 8986.15 93828.48 1.31 13.64 45 127828.82 12136.51 123979.99 1.03 10.53 55 130724.59 19547.81 127971.30 1.02 6.69 65 131901.38 27653.94 131020.07 1.01 4.77 75 133184.70 38210.94 131410.40 1.01 3.49 85 133058.50 39669.90 131657.16 1.01 3.35 95 133553.49 39519.18 132882.29 1.01 3.38 105 134021.26 39631.03 132126.29 1.01 3.38 115 134037.09 39469.34 132683.55 1.01 3.40 Conclusion

The Samsung 850 obviously can’t keep with the more advanced SM863 and PM1725. The PM1725 shows a greater benefit with smaller buffer pool sizes. In cases using large amounts of memory, there is practically no difference with SM863. The reason is that with big buffer pool sizes, MySQL does not push IO subsystem much to use all the PM1725 performance.

For the reference, my.cnf file is

[mysqld] datadir=/var/lib/mysql socket=/tmp/mysql.sock ssl=0 symbolic-links=0 sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES # general thread_cache_size=2000 table_open_cache = 200000 table_open_cache_instances=64 back_log=1500 query_cache_type=0 max_connections=4000 # files innodb_file_per_table innodb_log_file_size=15G innodb_log_files_in_group=2 innodb_open_files=4000 innodb_io_capacity=10000 loose-innodb_io_capacity_max=12000 innodb_lru_scan_depth=1024 innodb_page_cleaners=32 # buffers innodb_buffer_pool_size= 200G innodb_buffer_pool_instances=8 innodb_log_buffer_size=64M # tune innodb_doublewrite= 1 innodb_support_xa=0 innodb_thread_concurrency=0 innodb_flush_log_at_trx_commit= 1 innodb_flush_method=O_DIRECT_NO_FSYNC innodb_max_dirty_pages_pct=90 join_buffer_size=32K sort_buffer_size=32K innodb_use_native_aio=0 innodb_stats_persistent = 1 # perf special innodb_adaptive_flushing = 1 innodb_flush_neighbors = 0 innodb_read_io_threads = 16 innodb_write_io_threads = 8 innodb_purge_threads=4 innodb_adaptive_hash_index=0 innodb_change_buffering=none loose-innodb-log_checksum-algorithm=crc32 loose-innodb-checksum-algorithm=strict_crc32 loose-innodb_sched_priority_cleaner=39 loose-metadata_locks_hash_instances=256

PlanetMySQL Voting: Vote UP / Vote DOWN

SQL injection in the MySQL server! (of the proxy kind)

[this is a repost of my blog post, because it did not syndicate to]

As work on WarpSQL (Shard-Query 3) progresses, it has outgrown MySQL proxy.  MySQL proxy is a very useful tool, but it requires LUA scripting, and it is an external daemon that needs to be maintained.  The MySQL proxy module for Shard-Query works well, but to make WarpSQL into a real distributed transaction coordinator, moving the proxy logic inside of the server makes more sense.

The main benefit of MySQL proxy is that it allows a script to "inject" queries between the client and server, intercepting the results and possibly sending back new results to the client.  I would like similar functionality, but inside of the server.

For example, I would like to implement new SHOW commands, and these commands do not need to be implemented as actual MySQL SHOW commands under the covers.

For example, for this blog post I made a new example command called "SHOW PASSWORD"

Example "injection" which adds SHOW PASSWORD functionality to the server
mysql> select user(); +----------------+ | user() | +----------------+ | root@localhost | +----------------+ 1 row in set (0.00 sec) -- THIS COMMAND DOES NOT EXIST mysql> show password; +-------------------------------------------+ | password_hash | +-------------------------------------------+ | *00A51F3F48415C7D4E8908980D443C29C69B60C9 | +-------------------------------------------+ 1 row in set (0.00 sec)

Important - This isn't a MySQL proxy plugin.  There is C++ code in the SERVER to answer that query, but it isn't the normal SHOW command code.  This "plugin" (I put it in quotes because my plan is for a pluggable interface but it isn't added to the server yet) doesn't access the mysql.user table using normal internal access methods. It runs actual SQL inside of the server, on the same THD as the client connection, in the same transaction as the client connection, to get the answer!

Problem #1 - Running SQL in the server

The MySQL C client API doesn't have any methods for connecting to the server from inside of the server, except to connect to the normally available socket interfaces, authenticate, and then issue queries like a normal client.  While it is perfectly possible to connect to the server as a client in this manner, it is sub-optimal for a number of reasons.  First, it requires a second connection to the server, second, it requires that you authenticate again (which requires you have the user's password), and lastly, any work done in the second connection is not party to transactional changes in the first, and vice-versa.

The problem is communication between the client and server, which uses a mechanism called VIO.  There was work done a long time ago for external stored procedures, which never made it into the main server that would have alleviated this problem by implementing a in-server VIO layer, and making the parser re-entrant.  That work was done on MySQL 5.1 though.

It is possible to run queries without using VIO though.  You simply can't get results back, except to know if the query succeeded or not.  This means it is perfectly acceptable for any command that doesn't need a resultset, basically anything other than SELECT.  There is a loophole however, in that any changes made to the THD stay made to that THD.  Thus, if the SQL executed sets any user variables, then those variables are of course visible after query execution.

Solution  - encapsulate arbitrary SQL resultsets through a user variable

Since user variables are visible after query execution, the goal is to get the complete results of a query into a user variable, so that the resultset can be accessed from the server.  To accomplish this, first a method to get the results into the variable must be established, and then some data format for communication that is amenable to that method has to be decided upon so that the resultset can be accessed conveniently..

With a little elbow grease MySQL can convert any SELECT statement into CSV resultset.  To do so, the following are used:

  1. SELECT ... INTO @user_variable

  2. A subquery in the FROM clause (for the original query)

  3. CONCAT, REPLACE, IFNULL, GROUP_CONCAT (to encode the resultset data)

Here is the SQL that the SHOW PASSWORD command uses to get the correct password:
select authentication_string as pw, user from mysql.user where concat(user,'@',host) = USER() or user = USER() LIMIT 1
Here is the "injected" SQL that the database generates to encapsulate the SQL resultset as CSV:
select group_concat( concat('"', IFNULL(REPLACE(REPLACE(`pw`,'"','\\"'),"\n","\\n"),"\N"), '"|"', IFNULL(REPLACE(REPLACE(`user`,'"','\\"'),"\n","\\n"),"\N"), '"' ) separator "\n" ) from ( select authentication_string as pw, user from mysql.user where concat(user,'@',host) = USER() OR user = USER() LIMIT 1 ) the_query into @sql_resultset ; Query OK, 1 row affected (0.00 sec)
Here is the actual encapsulated resultset.  If there were more than one row, they would be newline separated.
mysql> select @sql_resultset; +----------------+ | @sql_resultset | +----------------+ | ""|"root" | +----------------+ 1 row in set (0.00 sec)
Injecting SQL in the server

With the ability to encapsulate resultsets into CSV in user variables, it is possible to create a cursor over the resultset data and access it in the server.  The MySQL 5.7 pre-parse rewrite plugins, however,  still run inside the parser.  The THD is not "clean" with respect to being able to run a second query.  The parser is not re-entrant.  Because I desire to run (perhaps many) queries between the time a user enters a query and the server actually answers the query (perhaps with a different query than the user entered!) the MySQL 5.7 pre-parse rewrite plugin infrastructure doesn't work for me.

I modified the server, instead, so that there is a hook in do_command() for query injections.  I called it conveniently query_injection_point() and the goal is to make it a new plugin type, but I haven't written that code yet.  Here is the current signature for query_injection_point():

bool query_injection_point( THD* thd, COM_DATA *com_data, enum enum_server_command command, COM_DATA* new_com_data, enum enum_server_command* new_command );

It has essentially the same signature as dispatch_command(), but it provides the ability to replace the command, or keep it as is.  It returns true when the command has been replaced.

Because it is not yet pluggable, here is the code that I placed in the injection point:

/* TODO: make this pluggable */ bool query_injection_point(THD* thd, COM_DATA *com_data, enum enum_server_command command, COM_DATA* new_com_data, enum enum_server_command* new_command) { /* example rewrite rule for SHOW PASSWORD*/ if(command != COM_QUERY) { return false; } /* convert query to upper case */ std::locale loc; std::string old_query(com_data->com_query.query,com_data->com_query.length); for(unsigned int i=0;i<com_data->com_query.length;++i) { old_query[i] = std::toupper(old_query[i], loc); } if(old_query == "SHOW PASSWORD") { std::string new_query; SQLClient conn(thd); SQLCursor* stmt; SQLRow* row; if(conn.query("pw,user", "select authentication_string as pw,user from mysql.user " \ "where concat(user,'@',host) = USER() or user = USER() LIMIT 1", &stmt)) { if(stmt != NULL) { if((row = stmt->next())) { new_query = "SELECT '" + row->at(0) + "' as password_hash"; } } else { return false; } } else { return false; } /* replace the command sent to the server */ if(new_query != "") { Protocol_classic *protocol= thd->get_protocol_classic(); protocol->create_command( new_com_data, COM_QUERY, (uchar *) strdup(new_query.c_str()), new_query.length() ); *new_command = COM_QUERY; } else { if(stmt) delete stmt; return false; } if(stmt) delete stmt; return true; } } /* don't replace command */ return false; }

You will notice that the code access the mysql.user table using SQL, using the SQLClient, SQLCursor, and SQLRow objects.  These are the objects that wrap around encapsulating the SQL into a CSV resultset, and actually accessing the result set.  The interface is very simple, as you can see from the example.  You create a SQLClient for a THD (one that is NOT running a query already!) and then you simply run queries and access the results.

The SQLClient uses a stored procedure to methodically encapsulate the SQL into CSV and then provides objects to access and iterate over the data that is buffered in the user variable.  Because MySQL 5.7 comes with the sys schema, I placed the stored procedure into it, as there is no other available default database that allows the creation of stored procedures.  I called it sys.sql_client().

Because the resultset is stored as text data, the SQLRow object returns all column values as std::string.

What's next?

I need to add a proper plugin type for "SQL injection plugins".  Then I need to work on a plugin for parallel queries.  Most of the work for that is already done, actually, at least to get it into an alpha quality state.  There is still quite a bit of work to be done though.

You can find the code in the internal_client branch of my fork of MySQL 5.7:

PlanetMySQL Voting: Vote UP / Vote DOWN

Percona XtraBackup 2.4.4 is now available

Percona announces the GA release of Percona XtraBackup 2.4.4 on July 25th, 2016. You can download it from our download site and from apt and yum repositories.

Percona XtraBackup enables MySQL backups without blocking user queries, making it ideal for companies with large data sets and mission-critical applications that cannot tolerate long periods of downtime. Offered free as an open source solution, Percona XtraBackup drives down backup costs while providing unique features for MySQL backups.

New Features:

  • Percona XtraBackup has been rebased on MySQL 5.7.13.

Bugs Fixed:

  • Percona XtraBackup reported the difference in the actual size of the system tablespace and the size which was stored in the tablespace header. This check is now skipped for tablespaces with autoextend support. Bug fixed #1550322.
  • Because Percona Server 5.5 and MySQL 5.6 store the LSN offset for large log files at different places inside the redo log header, Percona XtraBackup was trying to guess which offset is better to use by trying to read from each one and compare the log block numbers and assert lsn_chosen == 1 when both LSNs looked correct, but they were different. Fixed by improving the server detection. Bug fixed #1568009.
  • Percona XtraBackup didn’t correctly detect when tables were both compressed and encrypted. Bug fixed #1582130.
  • Percona XtraBackup would crash if the keyring file was empty. Bug fixed #1590351.
  • Backup couldn’t be prepared when the size in cache didn’t match the physical size. Bug fixed #1604299.
  • Free Software Foundation address in copyright notices was outdated. Bug fixed #1222777.
  • Backup process would fail if the datadir specified on the command-line was not the same as one that is reported by the server. Percona XtraBackup now allows the datadir from my.cnf override the one from SHOW VARIABLES. xtrabackup prints a warning that they don’t match, but continues. Bug fixed #1526467.
  • With upstream change of maximum page size from 16K to 64K, the size of incremental buffer became 1G. Which increased the requirement to 1G of RAM in order to prepare the backup. While in fact there is no need to allocate such a large buffer for smaller pages. Bug fixed #1582456.
  • Backup process would fail on MariaDB Galera cluster operating in GTID mode if binary logs were in non-standard directory. Bug fixed #1517629.

Other bugs fixed: #1583717, #1583954, and #1599397.

Release notes with all the bugfixes for Percona XtraBackup 2.4.4 are available in our online documentation. Please report any bugs to the launchpad bug tracker.

PlanetMySQL Voting: Vote UP / Vote DOWN