Planet MySQL

Asynchronous Replication from MySQL Cluster

MySQL Cluster is a highly available, distributed, shared-nothing database with very interesting performance characteristics for some workloads. Among other features, it supports automatic sharding and allows us to bypass the SQL layer if we don’t need it, via the NDB API (which, in my eyes, makes it one of the few transactional NoSQL databases out there).

In this post, I’ll describe how we can set up replication from MySQL Cluster into a standalone MySQL server using Innodb as the storage engine.

Introduction

There are a few reasons to set up replication between MySQL Cluster and a non-NDB based MySQL server. These reasons include (but are not limited to): the need to use other storage engines for various parts of the workload, improved availability courtesy of geographical distribution, or (based on my recent experience), migrating away from MySQL Cluster to MySQL (Galera Cluster in my case, but the procedure is the same).

While the manual provides a lot of information on NDB asynchronous replication, it assumes the desired setup involves running NDB on both ends of the link.

In the next few paragraphs, I’ll describe the lab environment I used for my tests, go through the required configuration changes on both the MySQL/NDB and MySQL/InnoDB sides, and then outline the steps required to get replication working.

Test environment

My test environment consists of a MySQL Cluster with 2 data nodes, 2 sql nodes and one management node, and a standalone MySQL server, as described by this diagram:

If you’d like to set up MySQL Cluster to try the commands shown in this post, but don’t have much experience with it, I recommend using the auto installer, or this vagrant environment. The latter offers something very close to what’s on the NDB side of that diagram, but with two management processes; each living in an sql node (so, 4 VMs in total).

MySQL Cluster required configuration

MySQL Cluster uses its own synchronous replication system, independent of MySQL’s asynchronous one. This is implemented at the storage engine level, so MySQL is unaware of it.

Among other things, this means that the usual configuration options used by MySQL replication are not required for MySQL Cluster to work. On a MySQL Cluster, the sql nodes are normal mysqld processes that use the ndb storage engine to connect to ndb processes living (hopefully if it’s production) on other hosts. Before we can set up a standalone mysqld as a replica of MySQL Cluster, we need to choose one sql node that will act as master, and set it up for replication, by adding the following to its global my.cnf file’s [mysqld] section:

binlog_format=ROW
log-bin=binlog
server_id=168150

As with any normal replication setup, server_id can be any valid value, provided it is unique across the topology. In my case, I chose sql1 for this task. Its server_id reflects part of its IP address, since an IP address is typically unique across a replication topology (the exception being multi-DC setups that reuse the same private network in different DCs, something I try to avoid).

This is a change that requires a mysqld service restart on the affected sql node. Since this is MySQL Cluster, we can perform the restart with no impact by simply removing this node from rotation.
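Once the node is back in rotation, it is worth a quick sanity check on sql1 to confirm that binary logging is active (a minimal sketch; the exact output will depend on your setup):

mysql> SHOW VARIABLES LIKE 'log_bin';
mysql> SHOW MASTER STATUS;

The first should report ON, and the second should list the current binary log file and position.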

Additionally, we will need to create a replication user, which we can do by executing the following statements:

CREATE USER 'replica'@'10.0.2.15' IDENTIFIED BY 'examplepassword';
GRANT REPLICATION SLAVE ON *.* TO 'replica'@'10.0.2.15';

MySQL required configuration

The standalone mysqld instance also needs to be configured for replication, by adding the following to its global my.cnf file’s [mysqld] section:

server_id=100215
log-bin=binlog
binlog_format=ROW

MySQL also needs to be restarted here for this change to take effect.
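On a systemd-based install this is just a service restart, something along these lines (the exact unit name depends on your distribution and packaging):

[root@mysql57 ~]# systemctl restart mysqld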

Getting data out of MySQL Cluster

We will create a logical backup of the data with mysqldump:

mysqldump --master-data=1 example > example.sql

The --master-data option set to 1 will include a CHANGE MASTER TO statement in the resulting dump, so that we can START SLAVE when the load completes and start replicating off the server where we generated the dump (sql1 in this case). I am including the database name explicitly, instead of using --all-databases, so that the mysql schema does not end up in the backup. You can back up multiple databases by using --databases if needed. Additionally, I am not using --single-transaction because NDB only supports the READ COMMITTED isolation level, and only at the row level, so this mysqldump option is not really supported for MySQL Cluster.
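For completeness, a multi-database dump would look something like this (a sketch; other_db is just a placeholder for any additional schema you want to include):

mysqldump --master-data=1 --databases example other_db > example.sql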

Getting data into MySQL

We will now load the logical backup into MySQL:


mysql -e 'change master to master_host="192.168.1.50";create database example'
mysql < example.sql example

Besides loading the data, this will also run a CHANGE MASTER TO command and set the right replication coordinates, thanks to the --master-data option. For example, here is the command generated in my test backup:


CHANGE MASTER TO MASTER_LOG_FILE='binlog.000001', MASTER_LOG_POS=154;

Since it does not include a host specification, the previously set master_host will be used: the one we set with the CHANGE MASTER TO command we ran right before creating the database and loading the data.
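If you prefer not to rely on the previously set master_host, an equivalent, fully explicit statement for this lab (a sketch using the host and coordinates shown above) would be:

CHANGE MASTER TO
  MASTER_HOST='192.168.1.50',
  MASTER_LOG_FILE='binlog.000001',
  MASTER_LOG_POS=154;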

A Short Detour: How Does This Work?

A few readers will likely have noticed that, on MySQL Cluster’s side, we enabled binary logging on sql1. What then happens to writes made on sql2?

To understand what happens, we must remember that NDB handles its own replication on a separate layer (and actually in separate system processes) from MySQL’s. In order to get NDB writes into the binary log, regardless of the originating node, MySQL Cluster uses the following auxiliary tables in the mysql schema: ndb_apply_status and ndb_binlog_index. On my installation, using MySQL Community 5.7.17 as the target version, only the ndb_binlog_index table was present, so I had to manually create the ndb_apply_status table. The manual recommends verifying that these tables exist as a prerequisite to setting up asynchronous replication in MySQL Cluster.
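A quick way to perform that check on the target server (a minimal sketch) is to query information_schema:

mysql> SELECT table_name FROM information_schema.tables
    ->  WHERE table_schema = 'mysql'
    ->  AND table_name IN ('ndb_apply_status', 'ndb_binlog_index');

If only ndb_binlog_index comes back, ndb_apply_status needs to be created by hand, as described next.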

To create it, I took the structure from sql1, and then ran the following on the target host:

mysql> use mysql
Database changed
mysql> CREATE TABLE `ndb_apply_status` (
-> `server_id` int(10) unsigned NOT NULL,
-> `epoch` bigint(20) unsigned NOT NULL,
-> `log_name` varchar(255) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
-> `start_pos` bigint(20) unsigned NOT NULL,
-> `end_pos` bigint(20) unsigned NOT NULL,
-> PRIMARY KEY (`server_id`) USING HASH
-> );
Query OK, 0 rows affected (0.02 sec)

The USING HASH clause will be ignored by InnoDB, but it won’t hurt (a regular BTREE index will be created instead).
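If you want to double-check what was actually created, something like this will show the resulting index (a sketch; on InnoDB the Index_type column should read BTREE):

mysql> SHOW INDEX FROM mysql.ndb_apply_status;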

Now we’re finally ready to start replication:


start slave user='replica' password='examplepassword';

We can now check and see that it is running:

mysql> pager egrep -i running
PAGER set to 'egrep -i running'
mysql> show slave status\G
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
1 row in set (0.00 sec)

We can also verify that everything works as expected: if we insert into MySQL Cluster using sql2, which does not have binary logging enabled, the write is still propagated to the MySQL slave replicating off sql1:

[vagrant@sql2 ~]$ mysql -e 'select max(id) from example.t'
+---------+
| max(id) |
+---------+
|       9 |
+---------+
[vagrant@sql2 ~]$ mysql -e 'select max(id) from example.t'
+---------+
| max(id) |
+---------+
|       9 |
+---------+
[vagrant@sql2 ~]$ mysql -e 'insert into example.t values (null)'
[vagrant@sql2 ~]$ mysql -e 'select max(id) from example.t'
+---------+
| max(id) |
+---------+
|      10 |
+---------+
[root@mysql57 ~]# mysql -e 'select max(id) from example.t'
+---------+
| max(id) |
+---------+
|      10 |
+---------+

And we can see we have a working asynchronous replication channel from MySQL Cluster into MySQL.

Summary

Even though manuals always discuss asynchronous replication in MySQL Cluster in the context of inter-cluster replication, we have seen that there are valid use cases for replicating to standalone MySQL (or a Galera based cluster), and we can achieve this setup with minimal effort.

I’d like to reiterate that in the specific case that inspired this post, the use case was a migration away from MySQL Cluster, so the replication channel was not expected to be live for a long time; just long enough to allow the migration to take place with minimal application impact. It is valid to use this in other scenarios, but bear in mind that these are mostly uncharted waters: be prepared for some sharks!

Log Buffer #507: A Carnival of the Vanities for DBAs

This Log Buffer edition covers Oracle, SQL Server and MySQL blog posts with a touch of Cloud.

Oracle:

Provisioning EBS on Oracle Cloud: Latest Enhancements

ORA-54002 when trying to create Virtual Column using REGEXP_REPLACE on Oracle 12cR2

Business rules, common sense and query performance

Problem with V$RECOVERY_AREA_USAGE view and FRA space not being reclaimed

Updated SQLcl: 2 New Commands, Bug Fixes, and 12cR2 Oh My!

SQL Server:

Windowing Functions: Tell me when that changes

SQL Server Bulk Insert Row Terminator Issues

Introducing DLM Techniques for a Multi-Database Multi-Server System

Ding – The World’s Largest Mobile Top-up Network Streamlines QA with SQL Server Containers

Enable Trace Flags in SQL Server

MySQL:

Migrating MySQL database from Amazon RDS to DigitalOcean

Monitoring Databases: A Product Comparison

New collations in MySQL 8.0.0

How to store date, time, and timestamps in UTC time zone with JDBC and Hibernate

MySQL Connector/NET 7.0.7 m6 development has been released

High Availability in ProxySQL: new webinar with René Cannaò

Following the interest we saw in this topic during our recent introductory webinar on ProxySQL, we’re pleased to invite you to join this new webinar on high availability in ProxySQL.

As you will know, the proxy layer is crucial when building a highly available MySQL infrastructure, so it is imperative not to let it become a single point of failure itself. Building a highly available proxy layer creates additional challenges of its own, such as how to manage multiple proxy instances, how to keep their configuration in sync, and how to handle virtual IPs and failover.

In this new webinar with ProxySQL’s creator, René Cannaò, we’ll discuss building a solid, scalable and manageable proxy layer using ProxySQL. And we will demonstrate how you can make your ProxySQL highly available when deploying it from ClusterControl.

Date, Time & Registration

Europe/MEA/APAC

Tuesday, April 4th at 09:00 BST (UK) / 10:00 CEST (Germany, France, Sweden)

Register Now

North America/LatAm

Tuesday, April 4th at 9:00 Pacific Time (US) / 12:00 Eastern Time (US)

Register Now

Related resources

  • MySQL & MariaDB load balancing with ProxySQL & ClusterControl
  • Video Interview with ProxySQL Creator René Cannaò
  • Using ClusterControl to Deploy and Configure ProxySQL on top of MySQL Replication

Agenda
  • Introduction
  • High Availability in ProxySQL
    • Layered approach
    • Virtual IP
    • Keepalived
  • Configuration management in distributed ProxySQL clusters
  • Demo: ProxySQL + keepalived in ClusterControl
    • Deployment
    • Failover
  • Q&A
Speakers

René Cannaò, Creator & Founder, ProxySQL. René has 10 years of working experience as a System, Network and Database Administrator, mainly on Linux/Unix platforms. In the last 4-5 years his experience has been focused mainly on MySQL, working as a Senior MySQL Support Engineer at Sun/Oracle and then as a Senior Operational DBA at Blackbird (formerly PalominoDB). In this period he built an analytical and problem-solving mindset, and he is always eager to take on new challenges, especially if they are related to high performance. And then he created ProxySQL …

Krzysztof Książek, Senior Support Engineer at Severalnines, is a MySQL DBA with experience managing complex database environments for companies like Zendesk, Chegg, Pinterest and Flipboard.

We look forward to “seeing” you there and to insightful discussions!

If you have any questions or would like a personalised live demo, please do contact us.

Tags: MySQL, clustercontrol, ha, high availability, proxysql

Defining Moments in Database History

I’ve been privileged to be intimately involved with one defining set of technology shifts in the mid-2000s, which ultimately created both a new industry-defining set of technologies and the impetus for contenders to emerge. I’m referring to the rise of the LAMP stack. I’ve been reflecting recently on what perhaps were key factors in that phenomenon and what’s happened since then, indeed, what’s happening now.

First, what was it about the LAMP stack, anyway? All of the ingredients in that stack were interesting and signaled tectonic shifts (or were the result of them), but I’m particularly interested in what MySQL can teach us about emerging databases today.

MySQL is the database that came to power much of the Internet as we know it today. MySQL was remarkable for many reasons, although it’s easy to forget them in hindsight. It wasn’t the first or perhaps even the best open source database, but it was just enough better that it became the best for the situation at hand. And ultimately it became a commercial success that even in hindsight seems improbable.

I’ve thought of many necessary conditions for MySQL to flourish in 2000-2010. The big question is which combination of those conditions was sufficient. I am not certain of the answer, but I’m certain the answer is plural.

And yet, partially because of its enormous popularity, MySQL helped spur the rise of NoSQL in 2008-2009. These databases sought to define a new moment in database history: one in which legacy relational technology was finally replaced by an utterly new generation. If you were around at the time, you might remember how vehemently people decried relational, joins, SQL, ACID, etc. It was not sufficient to lambaste a technology or implementation: you needed to have some highly-fermented bile against the concepts and foundations themselves or you weren’t really a NoSQL believer. The disruptor was being disrupted.

Where do we find ourselves today? Relational implementations rapidly improved (enter NewSQL), and NoSQL was backronymed to mean “not only SQL” instead of being a rejection of SQL. Many NoSQL databases today sport SQL-like languages. The obvious question has to be addressed: was it just a flare-up? Is there any need for next-generation data storage and processing? Or is good old relational going to improve and obviate every next-gen data technology anyway?

I believe strongly that the answer is no. I see a few current trends and I’m sure that at least some of them will eventually become something enduring in some form. I think we are seeing historic shifts in database technology emerge right now.

Next-Generation General-Purpose Databases

Relational, and SQL, are painful. SQL is a Yoda language that causes a lot of problems. It obscures intent, introduces illogical logic such as tri-valued truth, prompts books from the likes of Celko and Date about the small subset of how to do it right, and creates endless opportunities for the server to do things you didn’t intend and cause incredibly subtle bugs and performance disasters.

Not least, SQL is practically an open sore when it’s written in a program. Think about it: you’ve got this nice strictly-typed language with all sorts of compiler guarantees, and in the middle of it is a meaningless string blob that isn’t compiled, syntax-checked, or type-checked. It is bound to a foreign source of data through an API that isn’t knowable to the program or compiler, and may change without warning. It’s “I give up, random potentially correct garbage of dubious meaning goes here.” It’s the equivalent of an ugly CDATA in an XML document.

This should present significant opportunities for improvement. One can imagine a number of sensible first steps to take: find a way for the program and the database to use the same language and toolset; design a database query language that works similarly to a programming language; memory-map the database into the program; and so on. Problems begin immediately, and indeed the relational model was created to solve many of those issues—issues that have been happily and naively reinvented ever since. Those who are ignorant of history are doomed to repeat it.

But into this fray waded a brave new generation in 2009, with map-reduce databases, key-value databases, Javascript databases, and so forth. There were even some databases that were overall platforms, such as Firebase. All with some good ideas, all going in some productive direction, all with at least some aspects that could be legitimately criticized.

I’ve been an avid student of emerging databases and have even been seen as a champion of some of them. A while ago I predicted that MongoDB, Redis, and Riak would survive in a meaningful way. Of these, Riak seems to have been sidelined, but MongoDB and Redis are going strong.

Which other NoSQL databases have had impact on par with those two? Perhaps Cassandra, and arguably Neo4J, but both of those are less mainstream. MongoDB and Redis are ubiquitous.

Why? It’s instructive to look at the problems they solve. Redis starts with a simple conceptual foundation: label a piece of data, then you can use the label to fetch and manipulate the data. The data can be richly structured in ways that are familiar to programmers, and the operations you can perform on these structures are a Swiss Army knife of building blocks for applications. The types of things that otherwise force you to write boilerplate code or build frameworks. Redis focuses on doing these things well, and doesn’t try to solve a lot of other problems.

MongoDB also starts with a simple concept, essentially that databases should store nested, structured “documents” that can map directly to the structs or objects you use in your programming language. And on top of this, MongoDB adds another power tool: the programming language you use to query the database is the ubiquitous JavaScript, arguably the most popular and flexible programming language today. There’s much more, too, such as built-in scalability so you don’t have to build “sharding” into your app (anyone who’s done that knows that you’re actually building a new custom database in your app code).

Many of the NoSQL databases that sprang up like weeds in 2009 didn’t solve these types of problems in these kinds of ways. For example, Cassandra solved the scalability problem, but gave the programmer only limited expressive power. Ultimately, a highly scalable but not very “powerful” database can be less attractive than one that acts as a force multiplier for programmer productivity. To a first approximation, high scalability is a tech ops force multiplier, and devs outnumber ops tenfold.

Perhaps this is what makes Redis and MongoDB endure. I don’t know, but I am sure it’s part of what makes them a joy to use. And for better or for worse, from where I sit they seem to be the most viable answer to the proposition “a more modern database is a practical and useful thing to create.”

Time Series Databases

Another distinct emerging category is time series databases. These databases store facts with timestamps, and treat the time as a native and essential part of the data model. They allow you to do time-based analysis. Not only that, they really view temporal queries as central. Many of them even make time a mandatory dimension of any query.

I wrote extensively about time series databases previously. For example, I argued that the world is time series and I shared my requirements for a time series database a bit later. (That latter article is not something I agree with fully today).

InfluxDB is on a very steep growth trajectory as it seeks to define what it means for a database to be natively time oriented, and answer the question of whether that is enough for a database, or if there’ll be a “last mile problem” that will make people want some of the stuff they can get from other types of databases too. Defining the boundaries of a database’s functionality is hard. But InfluxDB seems to be doing an admirable job of it.

An alternative is ElasticSearch, which offers time series functionality in some ways, but not as the sole and central concept. It’s really a distributed search engine that knows about time. This quite naturally and properly raises the question: if you’re going to use a non-time-series database that knows about time, why use a search engine? Why not a relational database that has time series functionality?

There are many, many others. Time will tell what survives and what set of problems is worth solving and leaves nothing essential unsatisfied. I’d bet on InfluxDB at this point, personally. But one thing is certain: time series is important enough that first-class time series databases are necessary and worthwhile. It’s not enough to foist this use case onto another “yeah we do that too” database.

Stream-Oriented Databases

The final category of data technologies I think is going to end up defining a standalone category is stream-oriented, pub-sub, queueing, or messaging—choose your terminology; they’re different but related. These databases are essentially logs or buses (and some of them have names that indicate this). Instead of permanently storing the data and letting you retrieve and mutate it, the concept is insertion, immutable storage in order, and later reading it out again (potentially multiple times, potentially deleting on retrieval).

Why would you want this? It’s not obvious at first glance, but this “river of data, from which everything in the enterprise can drink” architecture is at once enormously powerful and enormously virtuous. It enables things that otherwise would require all kinds of gyrations and effort, while simultaneously simplifying things greatly.

The typical enterprise data architecture quickly becomes a nightmare spaghetti tangle. Data flows through the architecture in weird ways that are difficult to understand and manage. And problems like performance, reliability, and guarantees about hard things such as processing order, are prime motivators for a lot of complexity that you can solve or avoid with a queue or streaming database.

There are a lot of concepts related to these databases and their interplay with other types of database; too many to list here. I’ll just say that it’s a fundamental mindset shift, similar to the type of epiphany you get the first time you really understand purely functional programming. For example, you suddenly want to abolish replication forevermore, and you never want anything to poll or batch process again, ever.

Lots of technologies such as Spark are emerging around these areas. But in my view, Apache Kafka is the undisputed game-changer. It’s truly a watershed technology. Rather than try to explain why, I’ll just point you to the commercial company behind Kafka, Confluent. Read their materials. I know many of the people working there; they are genuine, smart, and it’s not marketing fluff. You can drink from their well. Deeply.

Conclusions

If anyone thought that NoSQL was just a flare-up and it’s died down now, they were wrong. NoSQL did flare up, and we did see a lot of bad technology emerge for a time. But the pains and many of the solutions are real. A key determinant of what’ll survive and what’ll be lost to history is going to be product-market fit. In my opinion, three important areas where markets aren’t being satisfied by relational technologies are relational and SQL backwardness, time series, and streaming data. Time will tell if I’m right.


Multiple ProxySQL instances on same ports, and seamless upgrade

Since its genesis, ProxySQL was designed to be reconfigurable at runtime, without restarting it. There are only two exceptions: changing the listening ports (work in progress to fix that) and the number of threads. Everything else has always been reconfigurable at runtime: adding and removing servers, adding and removing users, configuring routing, configuring caching, and tuning all the internal variables.

This design was driven by the wish to avoid restarting ProxySQL, no matter the reason.
ProxySQL provides high availability, therefore it is extremely important that it has very high uptime.
To ensure high availability, it also implements a strategy borrowed from ndbd in MySQL Cluster. When proxysql starts, it immediately forks:

  • the parent process only checks the exit status of the child process and restarts it if it didn't exit gracefully
  • the child process is the real proxysql service
    In other words, if proxysql dies, its parent automatically restarts it.
    You could achieve the same with systemd, supervisord or others, but ProxySQL tries to handle this autonomously.

Yet there is another situation in which ProxySQL breaks high availability: upgrades.
If you want to upgrade from version X to version Y, the usual procedure is:

  • download the new version
  • make a copy of the current database file proxysql.db in case you want to downgrade: this step is important, as an upgraded proxysql.db file is not always backward compatible (see the sketch after this list)
  • install the new version
  • restart the service
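The backup in the second step can be as simple as copying the file out of the datadir before installing the new package. A minimal sketch, assuming the default datadir /var/lib/proxysql:

cp /var/lib/proxysql/proxysql.db /var/lib/proxysql/proxysql.db.v1.bak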

The restart of proxysql to upgrade from version X to version Y generally takes just a fraction of a second, so this doesn't sound like a big availability hit.
However, there are certain ProxySQL users with very complex configurations for which a restart is a long process. An example of this is a hosting provider where a single proxysql instance has hundreds of thousands of users in the mysql_users table (yes, that many!).

Yet there are also cases in which you don't want to upgrade from version X to version Y without having first tested Y. Assume you have tried version Y in staging, all works great, and it is now time to deploy Y in production. Wouldn't it be great to have both versions running at the same time?

Do you want to run multiple proxysql instances in parallel, all listening on the same port(s)? You can!
ProxySQL makes it possible to run multiple processes, even different versions (or different configurations), at the same time ... on the same ports!

Starting from version 1.3.0, ProxySQL introduces a new feature that allows multiple instances to share the same TCP sockets.
To explicitly enable this feature, proxysql needs to be started with -r or --reuseport: this enables the SO_REUSEPORT socket option. The option is currently disabled by default because it requires Linux kernel 3.9 or newer: although kernel 3.9 was released in April 2013, there are still a lot of production environments running an older kernel. Future versions of ProxySQL will automatically detect whether the option is available, and enable it if possible.

To avoid interference between two (or more) proxysql instances running on the same host, it is recommended to use a different config file (using the -c flag) and a different datadir (using the -D flag) for each instance.
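As a sketch (the config file names and datadirs below are made up for illustration), two side-by-side instances sharing the same listening port(s) could be started like this:

proxysql --reuseport -c /etc/proxysql-v1.cnf -D /var/lib/proxysql-v1
proxysql --reuseport -c /etc/proxysql-v2.cnf -D /var/lib/proxysql-v2

Both processes bind to the same port(s) thanks to SO_REUSEPORT, and the kernel distributes incoming connections between them.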

Testing MyRocks vs InnoDB Performance Using sysbench 1.x oltp_point_select.lua

It seems MyRocks is going to become a hot topic in April 2017. Previously (here and there) I tried to compare its performance and scalability vs InnoDB from MySQL 5.7.17 using the test case from the famous bug #68079. It's an interesting case that took a lot of effort from Oracle to make InnoDB scale properly, and InnoDB (on my QuadCore box at least; others reported different results on other hardware in the comments) still outperformed MyRocks. But maybe it's a corner case that is not a big deal in general?

Earlier this month I decided to give MyRocks another chance and try it with "industry-standard" benchmarks, like those provided by the sysbench tool. At the same time, I studied the impact of the adaptive hash index (AHI) on InnoDB (for a reason I am not yet ready to share), along the lines of this great post by Peter Zaitsev. The study is not yet complete, and I am not yet sure that it makes sense to continue doing it on my ages-old QuadCore box with Fedora 25, but in the process I've got one interesting and repeatable result that I'd like to share in any case.

For that study I decided to use the recent sysbench 1.1.x, so I had to build it from source to begin with. I did the following:

[openxs@fc23 git]$ git clone https://github.com/akopytov/sysbench.git

but then during the ./configure run I've got a small problem:
...
checking for pkg-config... yes
checking for xxd... no
configure: error: "xxd is required to build sysbench (usually comes with the vim package)"

So, I had to install the vim package:
[openxs@fc23 sysbench]$ sudo yum install vim
...
Installed:
  gpm-libs.x86_64 1.20.7-9.fc24         vim-common.x86_64 2:8.0.386-1.fc25
  vim-enhanced.x86_64 2:8.0.386-1.fc25  vim-filesystem.x86_64 2:8.0.386-1.fc25

Complete!

and then the build and installation process (with all defaults, and with the MariaDB software provided by Fedora present) completed without any problem, and I've ended up with a nice new sysbench version:
[openxs@fc23 sysbench]$ /usr/local/bin/sysbench --version
sysbench 1.1.0-2343e4b

[openxs@fc23 sysbench]$ ls /usr/local/share/sysbench/
bulk_insert.lua  oltp_point_select.lua  oltp_update_non_index.lua  tests
oltp_common.lua  oltp_read_only.lua     oltp_write_only.lua
oltp_delete.lua  oltp_read_write.lua    select_random_points.lua
oltp_insert.lua  oltp_update_index.lua  select_random_ranges.lua

As I use all default settings for both MyRocks and InnoDB, I decided to start testing with the simplest test, oltp_point_select.lua, and a table size that does NOT fit into the default 128M InnoDB buffer pool:
[openxs@fc23 sysbench]$ sysbench /usr/local/share/sysbench/oltp_point_select.lua --report-interval=1 --oltp-table-size=1000000 --max-time=0 --oltp-read-only=off --max-requests=0 --num-threads=1 --rand-type=uniform --db-driver=mysql --mysql-db=test --mysql-socket=/tmp/mysql.sock --mysql-user=root --mysql-password= prepare
sysbench 1.1.0-2343e4b (using bundled LuaJIT 2.1.0-beta2)

invalid option: --oltp-table-size=1000000

Note that good old command lines copied verbatim from older sysbench versions may NOT work any more in 1.1.x. Some options changed; the names are now shorter:
[openxs@fc23 sysbench]$ sysbench /usr/local/share/sysbench/oltp_point_select.lua help
sysbench 1.1.0-2343e4b (using bundled LuaJIT 2.1.0-beta2)

oltp_point_select.lua options:
  --distinct_ranges=N           Number of SELECT DISTINCT queries per transaction [1]
  --sum_ranges=N                Number of SELECT SUM() queries per transaction [1]
  --skip_trx[=on|off]           Don't start explicit transactions and execute all queries in the AUTOCOMMIT mode [off]
  --secondary[=on|off]          Use a secondary index in place of the PRIMARY KEY [off]
  --create_secondary[=on|off]   Create a secondary index in addition to the PRIMARY KEY [on]
  --index_updates=N             Number of UPDATE index queries per transaction [1]
  --range_size=N                Range size for range SELECT queries [100]
  --auto_inc[=on|off]           Use AUTO_INCREMENT column as Primary Key (for MySQL), or its alternatives in other DBMS. When disabled, use client-generated IDs [on]
  --delete_inserts=N            Number of DELETE/INSERT combination per transaction [1]
  --tables=N                    Number of tables [1]
  --mysql_storage_engine=STRING Storage engine, if MySQL is used [innodb]
  --non_index_updates=N         Number of UPDATE non-index queries per transaction [1]
  --table_size=N                Number of rows per table [10000]
  --pgsql_variant=STRING        Use this PostgreSQL variant when running with the PostgreSQL driver. The only currently supported variant is 'redshift'. When enabled, create_secondary is automatically disabled, and delete_inserts is set to 0
  --simple_ranges=N             Number of simple range SELECT queries per transaction [1]
  --order_ranges=N              Number of SELECT ORDER BY queries per transaction [1]
  --range_selects[=on|off]      Enable/disable all range SELECT queries [on]
  --point_selects=N             Number of point SELECT queries per transaction [10]

I've ended up creating the table like this for the InnoDB case:
[openxs@fc23 sysbench]$ sysbench /usr/local/share/sysbench/oltp_point_select.lua --report-interval=1 --table-size=1000000 --num-threads=1 --rand-type=uniform --db-driver=mysql --mysql-db=test --mysql-socket=/tmp/mysql.sock --mysql-user=root --mysql-password= prepare
sysbench 1.1.0-2343e4b (using bundled LuaJIT 2.1.0-beta2)

Creating table 'sbtest1'...
Inserting 1000000 records into 'sbtest1'
Creating a secondary index on 'sbtest1'

to end up with the following table:
mysql> show table status like 'sbtest1'\G
*************************** 1. row ***************************
           Name: sbtest1
         Engine: InnoDB
        Version: 10
     Row_format: Dynamic
           Rows: 986400
 Avg_row_length: 228
    Data_length: 225132544
Max_data_length: 0
   Index_length: 0
      Data_free: 4194304
 Auto_increment: 1000001
    Create_time: 2017-03-02 16:18:57
    Update_time: NULL
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options:
        Comment:
1 row in set (0.00 sec)

For MyRocks I also had to specify the storage engine explicitly:
[openxs@fc23 fb56]$ sysbench /usr/local/share/sysbench/oltp_point_select.lua --table-size=1000000 --threads=1 --rand-type=uniform --db-driver=mysql --mysql-db=test --mysql-socket=/tmp/mysql.sock --mysql-user=root --mysql-password= --mysql_storage_engine=rocksdb prepare
sysbench 1.1.0-2343e4b (using bundled LuaJIT 2.1.0-beta2)

Creating table 'sbtest1'...
Inserting 1000000 records into 'sbtest1'
Creating a secondary index on 'sbtest1'...

to end up with the following table:
mysql> show table status like 'sbtest1'\G
*************************** 1. row ***************************
           Name: sbtest1
         Engine: ROCKSDB
        Version: 10
     Row_format: Fixed
           Rows: 1000000
 Avg_row_length: 198
    Data_length: 198545349
Max_data_length: 0
   Index_length: 16009534
      Data_free: 0
 Auto_increment: 1000001
    Create_time: NULL
    Update_time: NULL
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options:
        Comment:
1 row in set (0.00 sec)

Note that in the InnoDB case I've used MySQL 5.7.17 from Oracle, while MyRocks was built from this commit using my usual cmake options:
[openxs@fc23 mysql-5.6]$ git log -1
commit 01c386be8b02e6469b934c063aefdf8403844d99
Author: Herman Lee <herman@fb.com>
Date:   Wed Mar 1 18:14:25 2017 -0800

[openxs@fc23 mysql-5.6]$ cmake . -DCMAKE_BUILD_TYPE=RelWithDebInfo -DWITH_SSL=system -DWITH_ZLIB=bundled -DMYSQL_MAINTAINER_MODE=0 -DWITH_EMBEDDED_SERVER=OFF -DENABLED_LOCAL_INFILE=1 -DENABLE_DTRACE=0 -DCMAKE_INSTALL_PREFIX=/home/openxs/dbs/fb56

I've run the tests for InnoDB with the adaptive hash index set to ON (the default) and OFF (changed at runtime), and then for MyRocks, using 1, 2, 4, 8, 16, 32 and 64 concurrent threads (all cases except InnoDB with AHI ON), with a sysbench command line like this to run the test for 60 seconds (note the new option syntax of sysbench 1.x: --time, --threads, etc.):
[openxs@fc23 fb56]$ sysbench /usr/local/share/sysbench/oltp_point_select.lua --table-size=1000000 --time=60 --threads=1 --rand-type=uniform --db-driver=mysql --mysql-db=test --mysql-socket=/tmp/mysql.sock --mysql-user=root run
sysbench 1.1.0-2343e4b (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time


Initializing worker threads...

Threads started!

SQL statistics:
    queries performed:
        read:                            821511
        write:                           0
        other:                           0
        total:                           821511
    transactions:                        821511 (13691.77 per sec.)
    queries:                             821511 (13691.77 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          60.0003s
    total number of events:              821511

Latency (ms):
         min:                                  0.06
         avg:                                  0.07
         max:                                  1.11
         95th percentile:                      0.08
         sum:                              59537.46

Threads fairness:
    events (avg/stddev):           821511.0000/0.00
    execution time (avg/stddev):   59.5375/0.00

and then summarized the results into the following chart:


One day I'll share raw results, as a gist or somehow else, but for now let me summarize my findings as of March 3, 2017:
  1. MyRocks really rocks with this oltp_point_select.lua --table-size=1000000 test of sysbench 1.1.0! With default settings of server variables it outperformed InnoDB from MySQL 5.7.17 at all thread counts tested, from 1 to 64, and showed good scalability up to 64 threads on my QuadCore box. I've got more than 45K QPS starting from 4 threads.
  2. InnoDB with AHI disabled is somewhat faster for this test than with AHI enabled; the highest result was almost 44K QPS with AHI OFF at 4 threads (the statement used to toggle AHI at runtime is shown after this list).
  3. It seems my QuadCore is not relevant any more for serious benchmarks, as for quite some time people have been using 8 to 18 cores per socket etc. and start at 200K QPS with 8 threads already.
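For reference, disabling the adaptive hash index at runtime (as done for the second set of InnoDB runs) is just a dynamic variable change, something like:

mysql> set global innodb_adaptive_hash_index = OFF;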

Percona XtraDB Cluster 5.7.17-27.20 is now available

Percona announces the release of Percona XtraDB Cluster 5.7.17-27.20 on March 16, 2017. Binaries are available from the downloads section or our software repositories.

NOTE: You can also run Docker containers from the images in the Docker Hub repository.

Percona XtraDB Cluster 5.7.17-27.20 is now the current release, based on the following:

All Percona software is open-source and free. Details of this release can be found in the 5.7.17-27.20 milestone on Launchpad.

There are no new features or bug fixes to the main components, besides upstream changes and the following fixes related to packaging:

  • BLD-512: Fixed startup of garbd on Ubuntu 16.04.2 LTS (Xenial Xerus).
  • BLD-519: Added the garbd debug package to the repository.
  • BLD-569: Fixed the garbd script to return non-zero if it fails to start.
  • BLD-570: Fixed service script for garbd on Ubuntu 16.04.2 LTS (Xenial Xerus) and Ubuntu 16.10 (Yakkety Yak).
  • BLD-593: Limited the use of rm and chown by mysqld_safe to avoid exploits of the CVE-2016-5617 vulnerability. For more information, see 1660265.
    Credit to Dawid Golunski (https://legalhackers.com).
  • BLD-610: Added version number to the dependency requirements of the full RPM package.
  • BLD-643: Fixed systemctl to mark mysql process as inactive after it fails to start and not attempt to start it again. For more information, see 1662292.
  • BLD-644: Added the which package to PXC dependencies on CentOS 7. For more information, see 1661398.
  • BLD-645: Fixed mysqld_safe to support options with a forward slash (/). For more information, see 1652838.
  • BLD-647: Fixed systemctl to show correct status for mysql on CentOS 7. For more information, see 1644382.

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system. As always, thanks for your continued support of Percona!

Column Store Database Benchmarks: MariaDB ColumnStore vs. Clickhouse vs. Apache Spark

This blog shares some column store database benchmark results, and compares the query performance of MariaDB ColumnStore v. 1.0.7 (based on InfiniDB), Clickhouse and Apache Spark.

I’ve already written about ClickHouse (Column Store database).

The purpose of the benchmark is to see how these three solutions work on a single big server, with many CPU cores and a large amount of RAM. All three are massively parallel (MPP) systems, so they should use many cores for SELECT queries.

For the benchmarks, I chose three datasets:

  1. Wikipedia page counts, loaded in full for the year 2008, ~26 billion rows
  2. Query analytics data from Percona Monitoring and Management
  3. Online shop orders

This blog post shares the results for the Wikipedia page counts (same queries as for the Clickhouse benchmark). In the following posts I will use other datasets to compare the performance.

Databases, Versions and Storage Engines Tested
  • MariaDB ColumnStore v. 1.0.7, ColumnStore storage engine
  • Yandex ClickHouse v. 1.1.54164, MergeTree storage engine
  • Apache Spark v. 2.1.0, Parquet files and ORC files

Although all of the above solutions can run in a “cluster” mode (with multiple nodes), I’ve only used one server.

Hardware

This time I’m using newer and faster hardware:

  • CPU: physical = 2, cores = 32, virtual = 64, hyperthreading = yes
  • RAM: 256Gb
  • Disk: Samsung SSD 960 PRO 1TB, NVMe card
Data Sizes

I’ve loaded the above data into Clickhouse, ColumnStore and MySQL (for MySQL the data included a primary key; Wikistat was not loaded to MySQL due to the size). MySQL tables are InnoDB with a primary key.

Dataset          ColumnStore   Clickhouse   MySQL          Spark / Parquet   Spark / ORC file
Wikistat         374.24 Gb     211.3 Gb     n/a (> 2 Tb)   395 Gb            273 Gb
Query metrics    61.23 Gb      28.35 Gb     520 Gb
Store Orders     9.3 Gb        4.01 Gb      46.55 Gb

 

Query Performance

Wikipedia page counts queries

Test type (warm)                                      Spark    Clickhouse   ColumnStore
Query 1: count(*)                                     5.37     2.14         30.77
Query 2: group by month                               205.75   16.36        259.09
Query 3: top 100 wiki pages by hits (group by path)   750.35   171.22       1640.7

Test type (cold)                                      Spark     Clickhouse   ColumnStore
Query 1: count(*)                                     21.93     8.01         139.01
Query 2: group by month                               217.88    16.65        420.77
Query 3: top 100 wiki pages by hits (group by path)   887.434   182.56       1703.19


Partitioning and Primary Keys

All of the solutions have the ability to take advantage of data “partitioning,” and only scan needed rows.

Clickhouse has “primary keys” (for the MergeTree storage engine) and scans only the needed chunks of data (similar to partition “pruning” in MySQL). No changes to SQL or table definitions are needed when working with Clickhouse.

Clickhouse example:

:) select count(*), toMonth(date) as mon
:-] from wikistat where toYear(date)=2008
:-] and toMonth(date) = 1
:-] group by mon
:-] order by mon;

SELECT count(*), toMonth(date) AS mon
FROM wikistat
WHERE (toYear(date) = 2008) AND (toMonth(date) = 1)
GROUP BY mon
ORDER BY mon ASC

┌────count()─┬─mon─┐
│ 2077594099 │   1 │
└────────────┴─────┘

1 rows in set. Elapsed: 0.787 sec. Processed 2.08 billion rows, 4.16 GB (2.64 billion rows/s., 5.28 GB/s.)

:) select count(*), toMonth(date) as mon from wikistat where toYear(date)=2008 and toMonth(date) between 1 and 10 group by mon order by mon;

SELECT count(*), toMonth(date) AS mon
FROM wikistat
WHERE (toYear(date) = 2008) AND ((toMonth(date) >= 1) AND (toMonth(date) <= 10))
GROUP BY mon
ORDER BY mon ASC

┌────count()─┬─mon─┐
│ 2077594099 │   1 │
│ 1969757069 │   2 │
│ 2081371530 │   3 │
│ 2156878512 │   4 │
│ 2476890621 │   5 │
│ 2526662896 │   6 │
│ 2460873213 │   7 │
│ 2480356358 │   8 │
│ 2522746544 │   9 │
│ 2614372352 │  10 │
└────────────┴─────┘

10 rows in set. Elapsed: 13.426 sec. Processed 23.37 billion rows, 46.74 GB (1.74 billion rows/s., 3.48 GB/s.)

As we can see here, ClickHouse has processed ~two billion rows for one month of data, and ~23 billion rows for ten months of data. Queries that only select one month of data are much faster.

For ColumnStore we need to re-write the SQL query and use “between '2008-01-01' and '2008-01-10'” so it can take advantage of partition elimination (as long as the data is loaded in approximate time order). When using functions (i.e., year(dt) or month(dt)), the current implementation does not use this optimization. (This is similar to MySQL, in that if the WHERE clause has month(dt) or any other function, MySQL can’t use an index on the dt field.)

ColumnStore example:

MariaDB [wikistat]> select count(*), month(date) as mon
    -> from wikistat where year(date)=2008
    -> and month(date) = 1
    -> group by mon
    -> order by mon;
+------------+------+
| count(*)   | mon  |
+------------+------+
| 2077594099 |    1 |
+------------+------+
1 row in set (2 min 12.34 sec)

MariaDB [wikistat]> select count(*), month(date) as mon from wikistat where date between '2008-01-01' and '2008-01-31' group by mon order by mon;
+------------+------+
| count(*)   | mon  |
+------------+------+
| 2077594099 |    1 |
+------------+------+
1 row in set (12.46 sec)

Apache Spark does support partitioning, however it requires declaring partitions with the Parquet format in the table definition. Without declaring partitions, even the modified query (“select count(*), month(date) as mon from wikistat where date between ‘2008-01-01’ and ‘2008-01-31’ group by mon order by mon”) will have to scan all the data.

The following table and graph shows the performance of the updated query:

Test type / updated query                    Spark    Clickhouse   ColumnStore
group by month, one month, updated syntax    205.75   0.93         12.46
group by month, ten months, updated syntax   205.75   8.84         170.81

 

Working with Large Datasets

With 1Tb uncompressed data, doing a “GROUP BY” requires lots of memory to store the intermediate results (unlike MySQL, ColumnStore, Clickhouse and Apache Spark use hash tables to store groups by “buckets”). For example, this query requires a very large hash table:

SELECT path, count(*), sum(hits) AS sum_hits, round(sum(hits) / count(*), 2) AS hit_ratio FROM wikistat WHERE project = 'en' GROUP BY path ORDER BY sum_hits DESC LIMIT 100

As “path” is actually a URL (without the hostname), it takes a lot of memory to store the intermediate results (hash table) for GROUP BY.

MariaDB ColumnStore does not allow us to “spill” data on disk for now (only disk-based joins are implemented). If you need to GROUP BY on a large text field, you can decrease the disk block cache setting in Columnstore.xml (i.e., set disk cache to 10% of RAM) to make room for an intermediate GROUP BY:

<DBBC>
  <!-- The percentage of RAM to use for the disk block cache. Defaults to 86% -->
  <NumBlocksPct>10</NumBlocksPct>
</DBBC>

In addition, as the query has an ORDER BY, we need to increase max_length_for_sort_data in MySQL:

ERROR 1815 (HY000): Internal error: IDB-2015: Sorting length exceeded. Session variable max_length_for_sort_data needs to be set higher.

mysql> set global max_length_for_sort_data=8*1024*1024;

SQL Support

SQL                              Spark*   Clickhouse   ColumnStore
INSERT … VALUES                  ✅ yes    ✅ yes       ✅ yes
INSERT SELECT / BULK INSERT      ✅ yes    ✅ yes       ✅ yes
UPDATE                           ❌ no     ❌ no        ✅ yes
DELETE                           ❌ no     ❌ no        ✅ yes
ALTER … ADD/DROP/MODIFY COLUMN   ❌ no     ✅ yes       ✅ yes
ALTER … change partitions        ✅ yes    ✅ yes       ✅ yes
SELECT with WINDOW functions     ✅ yes    ❌ no        ✅ yes

 

*Spark does not support UPDATE/DELETE. However, Hive supports ACID transactions with UPDATE and DELETE statements. BEGIN, COMMIT, and ROLLBACK are not yet supported (only the ORC file format is supported).

ColumnStore is the only database out of the three that supports a full set of DML and DDL (almost all of MySQL’s implementation of SQL is supported).

Comparing ColumnStore to Clickhouse and Apache Spark

MariaDB ColumnStore

Advantages:
  • MySQL frontend (makes it easy to migrate from MySQL)
  • UPDATE and DELETE are supported
  • Window functions support

Disadvantages:
  • SELECT queries are slower
  • No replication from a normal MySQL server (planned for future versions)
  • No support for GROUP BY on disk

Yandex ClickHouse

Advantages:
  • Fastest performance
  • Better compression
  • Primary keys
  • Disk-based GROUP BY, etc.

Disadvantages:
  • No MySQL protocol support

Apache Spark

Advantages:
  • Flexible storage options
  • Machine learning integration (i.e., pyspark ML libraries run inside Spark nodes)

Disadvantages:
  • No MySQL protocol support
  • Slower SELECT queries (compared to ClickHouse)

Conclusion

Yandex ClickHouse is an absolute winner in this benchmark: it shows both better performance (>10x) and better compression than MariaDB ColumnStore and Apache Spark. If you are looking for the best performance and compression, ClickHouse looks very good.

At the same time, ColumnStore provides a MySQL endpoint (MySQL protocol and syntax), so it is a good option if you are migrating from MySQL. Right now it can’t replicate directly from MySQL, but if this option becomes available in the future, we could attach a ColumnStore replication slave to any MySQL master and use the slave for reporting queries (i.e., BI or data science teams can use a ColumnStore database that is updated very close to real time).

Table Structure and List of Queries

Table structure (MySQL / Columnstore version):

CREATE TABLE `wikistat` (
  `date` date DEFAULT NULL,
  `time` datetime DEFAULT NULL,
  `project` varchar(20) DEFAULT NULL,
  `subproject` varchar(2) DEFAULT NULL,
  `path` varchar(1024) DEFAULT NULL,
  `hits` bigint(20) DEFAULT NULL,
  `size` bigint(20) DEFAULT NULL
) ENGINE=Columnstore DEFAULT CHARSET=utf8

Query 1:

select count(*) from wikistat

Query 2a (full scan):

select count(*), month(dt) as mon from wikistat where year(dt)=2008 and month(dt) between 1 and 10 group by month(dt) order by month(dt)

Query 2b (for partitioning test)

select count(*), month(date) as mon from wikistat where date between '2008-01-01' and '2008-10-31' group by mon order by mon;

Query 3:

SELECT path, count(*), sum(hits) AS sum_hits, round(sum(hits) / count(*), 2) AS hit_ratio FROM wikistat WHERE project = 'en' GROUP BY path ORDER BY sum_hits DESC LIMIT 100;

 

How to install Apache, PHP 7.1 and MySQL on CentOS 7.3 (LAMP)

This tutorial shows how you can install an Apache webserver on a CentOS 7 server with PHP (mod_php with PHP 5.4, PHP 7.0, or PHP 7.1) and MySQL support. This setup is often referred to as LAMP which stands for Linux - Apache - MySQL - PHP.

MySQL 5.7: Improved JOIN Order by Taking Condition Filter Effect into Account

One of the major challenges of query optimizers is to correctly estimate how many rows qualify from each table of a join. If the estimates are wrong, the optimizer may choose a non-optimal join order.

Before MySQL 5.7, the estimated number of rows from a table only took into account the conditions from the WHERE clause that were used to set up the access method (e.g., the size of an index range scan). This often led to row estimates that were far too high, resulting in very wrong cost estimates for join plans. To address this issue, MySQL 5.7 introduced a cost model that considers the entire WHERE condition when estimating the number of qualifying rows from each table. This model estimates the filtering effect of each table’s conditions.

As shown in the above figure, the condition filter effect will reduce the estimated number of rows from tx that will lead to an index look-up on tx+1. For more details on how condition filtering works, see two earlier blog posts on this topic: part1, part2.

Taking condition filtering into account, the join optimizer will in many cases be able to find a more optimal join order. However, there are cases where the optimizer will overestimate the filtering effect and choose a non-optimal query plan. In this blog post, I will show an example of a query that benefits from condition filtering, and in a follow-up blog post I will discuss what could be done if condition filtering does not have the desired effect.

Example: DBT-3 Query 8

To show the benefits of condition filtering, we will look at Query 8 in the DBT-3 benchmark:

SELECT o_year,
SUM(CASE WHEN nation = 'FRANCE' THEN volume ELSE 0 END) / SUM(volume) AS mkt_share
FROM (
SELECT EXTRACT(YEAR FROM o_orderdate) AS o_year,
l_extendedprice * (1 - l_discount) AS volume, n2.n_name AS nation
FROM part
JOIN lineitem ON p_partkey = l_partkey
JOIN supplier ON s_suppkey = l_suppkey
JOIN orders ON l_orderkey = o_orderkey
JOIN customer ON o_custkey = c_custkey
JOIN nation n1 ON c_nationkey = n1.n_nationkey
JOIN region ON n1.n_regionkey = r_regionkey
JOIN nation n2 ON s_nationkey = n2.n_nationkey
WHERE r_name = 'EUROPE' AND o_orderdate BETWEEN '1995-01-01' AND '1996-12-31'
AND p_type = 'PROMO BRUSHED STEEL'
) AS all_nations GROUP BY o_year ORDER BY o_year;

Query 8 is called National Market Share Query, and it finds the market share in Europe for French suppliers of a given part type. You do not need to understand this query in detail. The main point is that 8 tables are joined, and that it is important to find an efficient join order for the query to perform well.

In MySQL 5.6, Visual EXPLAIN shows the following query plan for Query 8:

Tables are joined from left-to-right, starting with a full table scan of the region table, doing index look-ups into all other tables, with the part table as the last table. The execution time for this query is around 6 seconds on a scale factor 1 DBT-3 database.

If we look at the WHERE clause of the query, we see that there is one condition with high selectivity: we are interested in just one of many possible part types. That the region should be Europe, and that the time period is restricted to two years, will still leave many candidate rows, so these conditions have low selectivity. In other words, a query plan that puts the part table at the end is not optimal. If part were processed early, we would be able to filter out many rows, and the number of index look-ups into other tables could be significantly reduced.

In MySQL 5.7, the use of condition filtering estimates changes the query plan for Query 8, and the new query plan looks as follows:

As you see, part is now processed early. (We see that region is still processed first, but this is a small table with only 5 rows of which only one row matches its condition. Hence, it will not impact the fan-out of the join.) This new query plan takes 0.5 seconds to execute. In other words, execution time is reduced by 92% by changing the join order! The main saving is that, instead of going through all orders for customers in Europe, one will only look at orders of parts of a given type.

(The observant reader will have noticed another difference between the Visual EXPLAIN diagrams: the box around the tables is no longer present in the 5.7 diagram. If you look at Query 8, you notice that the many-table join is in a sub-query in the FROM clause, aka a derived table. In MySQL 5.6 and earlier, the result of derived tables was materialized in a temporary table before the main query was executed. The extra box in the 5.6 diagram represents such a temporary table. In MySQL 5.7, derived tables will be merged into the outer query if possible, and the materialization of the derived table is avoided.)

To see the optimizer's estimates for condition filter effects, we can look at the filtered column of tabular EXPLAIN (some of the columns have been removed to save space):

id  select_type  table     type    key                  rows    filtered  Extra
1   SIMPLE       region    ALL     NULL                 5       20.00     Using where; Using temporary; Using filesort
1   SIMPLE       part      ALL     NULL                 200000  10.00     Using where; Using join buffer (Block Nested Loop)
1   SIMPLE       lineitem  ref     i_l_partkey_suppkey  30      100.00    Using index condition
1   SIMPLE       supplier  eq_ref  PRIMARY              1       100.00    Using where
1   SIMPLE       n2        eq_ref  PRIMARY              1       100.00    NULL
1   SIMPLE       orders    eq_ref  PRIMARY              1       50.00     Using where
1   SIMPLE       customer  eq_ref  PRIMARY              1       100.00    Using where
1   SIMPLE       n1        eq_ref  PRIMARY              1       20.00     Using where

We can see that there are 4 tables for which the optimizer assumes some filtering; region, part, orders, and n1 (nation). The optimizer has three sources for its estimates:

  1. Range estimates:

    For indexed columns, the range optimizer will ask the storage engine for an estimate as to how many rows are within a given range. This estimate will be pretty accurate. For our query, this is the case for the region and orders table. Europe is 1 out of 5 regions and the requested two year period contains 50% of the orders in the database.

  2. Index statistics:

    For indexed columns, MySQL also keeps statistics on the average number of rows per value (records_per_key). This can be used to estimate the filtering effect of conditions that refer to columns of preceding tables in the join order. For our query this is the case for n1. The query has two conditions involving n1, but the condition on n_nationkey is used for the index look-up and will not contribute to extra filtering. On the other hand, the condition on n_regionkey will provide some extra filtering, and since records_per_key tells us that there are 5 distinct values for this column, the estimated filtering effect will be 20% (see the example after this list).

    The assumption is that values are evenly distributed. If the distribution is skewed, the filtering estimate will be less accurate. Another cause of estimation errors is that index statistics are based on sampling, so it may not precisely reflect the actual distribution of values.

  3. Guesstimates:

    For non-indexed columns, MySQL does not have any statistics, so the optimizer resorts to heuristics. For equality conditions, the filtering effect is assumed to be 10%. The DBT-3 database does not have an index on part type, so the filtering effect for the part table is set to 10%. This is actually a bit off, since there are 150 different part types, but in this case just assuming that there is some filtering made the optimizer choose a better plan. (A quick way to check such a guesstimate against the actual data is sketched right after this list.)
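If you want to see how far such a guesstimate is from reality, a couple of ad-hoc queries against the base tables are enough. This is only a sketch: in DBT-3 the column is named p_type, and the part type value below is a made-up example of one of the 150 types.

-- Compare the 10% guesstimate with the actual selectivity of the condition
SELECT COUNT(*) FROM part;                                       -- total number of parts
SELECT COUNT(*) FROM part WHERE p_type = 'PROMO BRUSHED BRASS';  -- parts of the requested type
-- actual filtering effect = second count / first count (roughly 1/150 here),
-- compared with the 10.00 shown in the filtered column of EXPLAIN

-- For indexed columns, the cardinality that feeds the records_per_key
-- statistics can be inspected with:
SHOW INDEX FROM nation;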

Caveat

As shown in this blog post, taking condition filtering into account may give better query plans. However, as discussed, there is a risk that the filtering estimates are inaccurate, especially for conditions on non-indexed columns. Another cause of estimation errors is that columns are assumed to be uncorrelated; when there are conditions on correlated columns, the optimizer will overestimate the filtering effect of those conditions.

We have received a few bug reports about performance regressions when upgrading from 5.6 to 5.7 that are caused by the optimizer overestimating the filtering effect. In most cases, this is because the query contains equality conditions on non-indexed columns with low cardinality, so the guesstimate of 10% is too optimistic. In my next blog post, I will discuss what can be done when this happens. We are also working on adding histograms to MySQL; histograms will give a much better basis for estimating the filtering effect of conditions on non-indexed columns.
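Until histograms are available, if such an overestimate does pick a bad plan for you, one possible workaround is to switch off condition filtering for the session running the affected query. This is only a sketch; the optimizer_switch flag name below is the one used by MySQL 5.7, so verify it against the manual for your version.

-- Possible session-level workaround when overestimated filtering leads to a bad plan
SET SESSION optimizer_switch = 'condition_fanout_filter=off';
-- ... run the affected query ...
SET SESSION optimizer_switch = 'condition_fanout_filter=on';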

Jeudis du Libre – Mons

Yesterday I was invited to speak at the “Jeudis du Libre” in Mons.

The location was very special, as it was one of the auditoriums of Polytech, the oldest university in the city of Mons.

I presented in French two very hot topics in the MySQL ecosystem:

  • MySQL InnoDB Cluster
  • MySQL as Document Store with JSON datatype & X plugin

Those are very new technologies illustrating MySQL’s innovation. And of course there is much more to come with MySQL 8!

Here are the slides if you are interested:

Jeudis du Libre – MySQL InnoDB Cluster from Frederic Descamps

Jeudis du Libre – MySQL comme Document Store from Frederic Descamps

I also recommend attending future sessions of the Jeudis du Libre; some might be very interesting. I plan to participate again once MySQL 8.0 is released.

Monitoring Databases: A Product Comparison

In this blog post, I will discuss the solutions for monitoring databases (which include alerting) that I have worked with and recommended to my clients in the past. This survey will mostly focus on MySQL solutions.

One of the most common issues I come across when working with clients is monitoring and alerting. Many times, companies will fall into one of these categories:

  • No monitoring or alerting. This means they have no idea what’s going on in their environment whatsoever.
  • Inadequate monitoring. Maybe people in this camp are using a platform that just tells them the database is up or connections are happening, but there is no insight into what the database is doing.
  • Too much monitoring and alerting. Companies in this camp have tons of dashboards filled with graphs, and their inbox is full of alerts that get promptly ignored. This type of monitoring is just as useful as the first option. Alert fatigue is a real thing!

With my clients, I like to talk about what monitoring they need and what will work for them.

Before we get started, I do want to point out that I have borrowed some text and/or graphics from the websites and promotional material of some of the products I’m discussing.

Simple Alerting

Percona provides a Nagios plugin for database alerts: https://www.percona.com/downloads/percona-monitoring-plugins/.

I also like to point out to clients which metrics are important to monitor long term to make sure there are no performance issues. I prefer the following approach (a few example status queries for spot-checking these counters follow the list):

  • On the hardware level:
    • Monitor CPU, IO and network usage, and how they trend monthly. If consumption of some resource approaches a critical level, this might be a signal that you need more capacity.
  • On the MySQL server level:
    • Monitor connections, active threads, table locks, row locks, InnoDB IO and buffer pool usage
    • For replication, monitor seconds behind master (SBM), binlog size and replication errors. In Percona XtraDB Cluster, you might want to watch wsrep_local_recv_queue.
  • On the query level:
    • Regularly check query execution and response time, and make sure it stays within acceptable levels. When execution time approaches or exceeds established levels, evaluate ways to optimize your queries.
  • On the application side:
    • Monitor that response time is within established SLAs.
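If you want to spot-check a few of these counters by hand before settling on a tool, the following statements cover several of the metrics above. This is just a sketch using standard SHOW STATUS / SHOW SLAVE STATUS output; the wsrep variable only exists on Galera-based servers such as Percona XtraDB Cluster.

-- Connections and active threads
SHOW GLOBAL STATUS LIKE 'Threads_connected';
SHOW GLOBAL STATUS LIKE 'Threads_running';
-- InnoDB IO and buffer pool activity
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';
SHOW GLOBAL STATUS LIKE 'Innodb_data_reads';
-- Replication: Seconds_Behind_Master, Last_Error, binlog coordinates
SHOW SLAVE STATUS\G
-- Percona XtraDB Cluster / Galera receive queue
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';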
High-Level Monitoring Solution Comparison

Products compared, in column order: PMM | MonYOG | Severalnines | VividCortex | SelectStar

Databases Supported: MySQL, MongoDB and others with custom addons | MySQL | MySQL, MongoDB, PostgreSQL | MySQL, MongoDB, PostgreSQL, Redis | MySQL, MongoDB, PostgreSQL, Hadoop, Cassandra, Amazon Dynamo, IBM DB2, SQL Server, Oracle
Open Source: x
Cost: Free | Subscription per node | Subscription per node and free Community Edition | Subscription per instance | Subscription per instance
Cloud or On Premises: On premises | On premises | On premises | Cloud with on premises collector | Cloud with on premises collector
Has Agents: x x
Monitoring: x x x x x
Alerting: Yes, but requires custom setup | x x x x
Replication Topology Management: x x
Query Analytics: x x x x
Configuration Management: x x
Backup Management: x
OS Metrics: x x  x x x
Configuration Advisors: x  x x
Failover Management: x x
ProxySQL and HA Proxy Support: Monitors ProxySQL | x

 

PMM

http://pmmdemo.percona.com

https://www.percona.com/blog/2016/04/18/percona-monitoring-and-management/

https://www.percona.com/doc/percona-monitoring-and-management/index.html

Percona Monitoring and Management (PMM) is a fully open source solution for managing MySQL platform performance and tuning query performance. It allows DBAs and application developers to optimize the performance of the database layer. PMM is an on-premises solution that keeps all of your performance and query data inside the confines of your environment, with no requirement for data to cross the Internet.

Assembled from a supported package of “best-of-breed” open source tools such as Prometheus, Grafana and Percona’s Query Analytics, PMM delivers results right out of the box.

With PMM, anyone with database maintenance responsibilities can get more visibility for actionable enhancements, realize faster issue resolution times, increase performance through focused optimization and better manage resources. More information allows you to concentrate efforts on the areas that yield the highest value, rather than hunting and pecking for speed.

PMM monitors and provides performance data for Oracle’s MySQL Community and Enterprise Servers, as well as Percona Server for MySQL and MariaDB.

Alerting

In the current version of PMM, custom alerting can be set up. Percona has a guide here: https://www.percona.com/blog/2017/01/23/mysql-and-mongodb-alerting-with-pmm-and-grafana/.

Architecture

The PMM platform is based on a simple client-server model that enables efficient scalability. It includes the following modules:

  • PMM Client is installed on every MySQL host that you want to monitor. It collects MySQL server metrics, general system metrics, and query analytics data for a complete performance overview. Collected data is sent to the PMM Server.
  • PMM Server aggregates collected data and presents it in the form of tables, dashboards and graphs in a web interface.

MySQL Configuration

Percona recommends certain settings to get the most out of PMM. You can get more information and a guide here: https://www.percona.com/doc/percona-monitoring-and-management/conf-mysql.html.
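As a rough illustration only (follow the linked guide for the full, version-specific recommendations), the Query Analytics slow-log source generally wants the slow query log enabled and capturing everything:

-- Illustrative sketch; see the Percona guide above for the recommended values
SET GLOBAL slow_query_log  = 'ON';
SET GLOBAL log_output      = 'FILE';
SET GLOBAL long_query_time = 0;   -- let Query Analytics see all queries
-- Percona Server adds further log_slow_* options (rate limiting, extra verbosity)
-- that the guide covers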

Advantages
  • Fast setup
  • Fully supported and backed by Percona
  • Impressive roadmap ahead
  • Monitors your database in depth
  • Query analytics
  • Quick setup docker container
  • Free and open source
Disadvantages
  • New, could still have some growing pains
  • Requires agents on database machines
Severalnines

http://severalnines.com/

Severalnines ClusterControl provides access to 100+ key database and host metrics that matter to your operational performance. You can visualize historical performance in custom dashboards to establish operational baselines and capacity planning. It lets you proactively monitor and receive advice to address immediate and potential database and server issues, and ships with over 100 built-in advisors or easily-writeable custom advisors for your specific needs. It is very scriptable and customizable with some effort.

Severalnines has a free community version as well as a commercial offering. The free version includes deployment, monitoring and advisors with a Developer Studio (with which users can create their own advisors).

Severalnines is definitely more sysadmin focused. The best part about it is its ability to deploy and manage your databases with almost no command-line work.

The community edition of ClusterControl is “free forever”.

Architecture

ClusterControl is an agentless management and automation software for database clusters. It helps deploy, monitor, manage and scale your database server/cluster directly from ClusterControl user interface.

ClusterControl consists of four components:

Component: ClusterControl controller (cmon)
Package Naming: clustercontrol-controller
Role: The brain of ClusterControl. A backend service performing automation, management, monitoring and scheduling tasks. All the collected data will be stored directly inside the CMON database.

Component: ClusterControl REST API
Package Naming: clustercontrol-cmonapi
Role: Interprets request and response data between the ClusterControl UI and the CMON database.

Component: ClusterControl UI
Package Naming: clustercontrol
Role: A modern web user interface to visualize and manage the cluster. It interacts with the CMON controller via remote procedure call (RPC) or the REST API interface.

Component: ClusterControl NodeJS
Package Naming: clustercontrol-nodejs
Role: This optional package is introduced in ClusterControl version 1.2.12 to provide an interface for notification services and integration with 3rd-party tools.

 

Advantages
  • Agentless
  • Monitors, deploys and manages:
    • Database
    • Configuration
    • Backups
    • Users
  • Simple web GUI to manage your databases, alerts, users, settings
  • Can create custom monitors or jobs
  • Can off-load and compress backups
  • Great support team
  • Rich feature set and multiple databases supported
Disadvantages
  • Cost per node
  • UI can occasionally be clunky
  • Query tools are lacking compared to the other solutions here
  • Metrics and advisors may not be as powerful or easy to use as those of other products
MONyog

https://www.webyog.com/product/monyog

MONyog MySQL Monitor and Advisor is a “MySQL DBA in a box” that helps MySQL DBAs manage more MySQL servers, tune their current MySQL servers and find and fix problems with their MySQL database applications before they can become serious problems or costly outages.

MONyog proactively monitors enterprise database environments and provides expert advice on how even those new to MySQL can tighten security, optimize performance and reduce downtime of their MySQL powered systems.

MONyog is more DBA focused and focuses on the MySQL configuration and queries.

Architecture

The MONyog web server runs on Linux, monitoring MySQL on all platforms and also monitoring OS data on Linux servers. To retrieve OS metrics, MONyog uses SSH. However, with this scenario (MONyog installed on a Linux machine), the MONyog web server/agent cannot collect Windows OS metrics.

Of course, the client where the MONyog output is viewed can be any browser supporting AJAX, on any platform. MONyog can be installed on a remote PC as well as on the server itself. Because monitoring is agentless, it can collect and retrieve data from the server without installing anything on the monitored hosts.

Advantages
  • Setup and startup within two minutes
  • Agentless
  • Good query tools
  • Manages configuration
  • Great advisors for database tuning built-in
  • Most comprehensive and detailed alerting
Disadvantages
  • Cost per node
  • Only supports MySQL
VividCortex

VividCortex is a good cloud-based tool to see what your production databases are doing. It is a modern SaaS database performance monitoring platform that significantly eases the pain of database performance at scale, on distributed and polyglot systems, for the entire engineering team. It’s hosted for you with industry-leading security, and is continuously improved and maintained. VividCortex measures and analyzes the system’s work and resource consumption. The result is an immediate insight into query performance, better performance and quality, faster time-to-market and reduced cost and effort.

Architecture

VividCortex is the combination of agent programs, APIs and a web application. You install the agents on your servers, they send data to their APIs, and you access the results through the web application at https://app.vividcortex.com. VividCortex has a diagram on their site showing how it works:

The agents are self-supervising, managed by an agent called vc-agent-007. You can read more about the agents in the agent-specific documentation. They primarily send time-series metrics to the APIs at one-second granularity, and sometimes additional metadata as well; for example, query digests are required to show which query is responsible for specific query-related metrics.

On the backend, a distributed, fully multi-tenant service stores your data separately from that of all other customers. VividCortex servers are currently hosted in the Amazon AWS public cloud.

Advantages
  • Great visibility into query-level performance to pinpoint optimization efforts
  • Granularity, with the ability to identify performance fluctuations down to a one-second resolution
  • Smart anomaly detection using advanced statistics and machine learning to reduce false-positives and make alerts meaningful and actionable
  • Unique collaboration tools, enabling developers to answer many of their own questions and freeing DBAs to be more responsive and proactive.
Disadvantages
  • Cloud-based tools may not be desirable in a secure environment
  • Cost
  • Not useful if you lose outside network access during an incident
  • Dependent on AWS availability
SelectStar

https://selectstar.io

SelectStar monitors key metrics for many different database types, and has a comprehensive alerts and recommendations system. SelectStar supports monitoring and alerts on:

  • MySQL, Percona Server for MySQL, MariaDB
  • PostgreSQL
  • Oracle
  • MongoDB
  • Microsoft SQL
  • DB2
  • Amazon RDS and Aurora
  • Hadoop
  • Cassandra

The alerts and recommendations are designed to ensure you have an immediate understanding of key issues — and where they are coming from. You can pinpoint the exact database instance that may be causing the issue, or go further up the chain and see if it’s an issue impacting several database instances at the host level.

Recommendations are often tied to alerts — if you have a red alert, there’s going to be a recommendation tied to it on how you can improve. However, the recommendations pop up even if your database is completely healthy — ensuring that you have visibility into how you can improve your configuration before you actually have an issue impacting performance.

Architecture

Using agentless collectors, SelectStar gathers data from both your on-premises and AWS platforms so that you can have insight into all of your database instances.

The collector is an independent machine within your infrastructure that pulls data from your databases. It is designed to be lightweight so that it does not affect performance. This is a different approach from all of the other monitoring tools I have looked at.

Advantages
  • Multiple database technologies (the most out of the tools presented here)
  • Great visibility into query-level performance to pinpoint optimization efforts
  • Agentless
  • Good query tools
  • Great advisors for database tuning built in
  • Good alerting
  • Fast setup
  • Monitors your database in depth
  • Query analytics
Disadvantages
  • Cloud-based tools may not be desirable in a secure environment
  • Cost
  • New, could still have some growing pains
  • Still requires an on-premises collector
So What Do I Recommend?

“It depends.” – Peter Z., CEO of Percona

As always, I recommend whatever works best for your workload, in your environment, and within the standards of your company’s practices!

New collations in MySQL 8.0.0

Since MySQL 5.5, MySQL has supported the utf8mb4 character set. While the character set defines the repertoire of characters that can be stored (utf8mb4 can represent Unicode characters from U+0000 to U+10FFFF), a collation defines how sorting order and comparisons should behave.…
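As a quick refresher on the distinction, here is a small sketch using two collations that already exist in MySQL 5.5–5.7 (the post above goes on to cover the new 8.0 ones): the same pair of utf8mb4 strings compares differently depending on the collation.

-- Same character set, different comparison rules
SELECT _utf8mb4'a' = _utf8mb4'A' COLLATE utf8mb4_general_ci;  -- case-insensitive: returns 1
SELECT _utf8mb4'a' = _utf8mb4'A' COLLATE utf8mb4_bin;         -- binary comparison: returns 0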

Test a Flask App with Selenium WebDriver – Part 1

Ever wondered how to write tests for the front-end of your web application? You may already have functional back-end tests, for example to ensure that your models and views are working. However, you may be unsure how to simulate a user of your app for testing. How can you test front-end functions like registration and logging in, which are done in a browser?

In this two-part tutorial, I will show you how to write front-end tests for an existing Python/Flask web application. You should therefore already have a functional application, along with a virtual environment with the necessary software dependencies installed. We will use Project Dream Team, a CRUD web app I built in a three-part tutorial (here is Part One, Part Two and Part Three). I recommend that you go through the tutorial and build out the app, especially if you are new to Flask or to testing in Flask. If you have built and tested a Flask app before, or don't want to go through the tutorial, you can clone the complete app from my GitHub repository, here and follow the set-up instructions in the README file.

Testing Overview

To recap, the application we will be testing is an employee management app. The app allows users to register as employees and login. It also allows admin users to view all employees, as well as to create departments and roles, and then assign them to each employee.

We will therefore write tests to ensure all these functions are achieved in the app. If you followed the tutorial, you should already have a test database set up. If not, you can create one and grant your database user privileges on it, as follows:

$ mysql -u root

mysql> CREATE DATABASE dreamteam_test;
Query OK, 1 row affected (0.00 sec)

mysql> GRANT ALL PRIVILEGES ON dreamteam_test . * TO 'dt_admin'@'localhost';
Query OK, 0 rows affected (0.00 sec)

We use the test database to run our tests to prevent our test data from being saved into the main database we are using for development.

The back-end tests, which were already written in the app tutorial, test that the app's models and views are working as expected. The front-end tests, which we will write in this tutorial, will simulate a user using the app in the browser. For these tests, we will create test data, and then test the registration, login, and CRUD functionality of the app.

Test Directory Structure

We previously had only one test file, tests.py, in the root directory. Rather than have all the tests in one file, we shall create a tests directory. In it, create two new files: __init__.py and test_front_end.py. Move the tests.py file to the tests directory and rename it to test_back_end.py. Your tests directory should now look like this:

└── tests
    ├── __init__.py
    ├── test_back_end.py
    └── test_front_end.py

The __init__.py file initializes the tests directory as a Python package. test_back_end.py will contain all our back-end tests, while test_front_end.py will contain all our front-end tests. Update the test_back_end.py file as follows:

# tests/test_back_end.py import unittest from flask import abort, url_for from flask_testing import TestCase from app import create_app, db from app.models import Department, Employee, Role class TestBase(TestCase): def create_app(self): # pass in test configurations config_name = 'testing' app = create_app(config_name) app.config.update( SQLALCHEMY_DATABASE_URI='mysql://dt_admin:dt2016@localhost/dreamteam_test' ) return app def setUp(self): """ Will be called before every test """ db.session.commit() db.drop_all() db.create_all() # create test admin user admin = Employee(username="admin", password="admin2016", is_admin=True) # create test non-admin user employee = Employee(username="test_user", password="test2016") # save users to database db.session.add(admin) db.session.add(employee) db.session.commit() def tearDown(self): """ Will be called after every test """ db.session.remove() db.drop_all() class TestModels(TestBase): def test_employee_model(self): """ Test number of records in Employee table """ self.assertEqual(Employee.query.count(), 2) def test_department_model(self): """ Test number of records in Department table """ # create test department department = Department(name="IT", description="The IT Department") # save department to database db.session.add(department) db.session.commit() self.assertEqual(Department.query.count(), 1) def test_role_model(self): """ Test number of records in Role table """ # create test role role = Role(name="CEO", description="Run the whole company") # save role to database db.session.add(role) db.session.commit() self.assertEqual(Role.query.count(), 1) class TestViews(TestBase): def test_homepage_view(self): """ Test that homepage is accessible without login """ response = self.client.get(url_for('home.homepage')) self.assertEqual(response.status_code, 200) def test_login_view(self): """ Test that login page is accessible without login """ response = self.client.get(url_for('auth.login')) self.assertEqual(response.status_code, 200) def test_logout_view(self): """ Test that logout link is inaccessible without login and redirects to login page then to logout """ target_url = url_for('auth.logout') redirect_url = url_for('auth.login', next=target_url) response = self.client.get(target_url) self.assertEqual(response.status_code, 302) self.assertRedirects(response, redirect_url) def test_dashboard_view(self): """ Test that dashboard is inaccessible without login and redirects to login page then to dashboard """ target_url = url_for('home.dashboard') redirect_url = url_for('auth.login', next=target_url) response = self.client.get(target_url) self.assertEqual(response.status_code, 302) self.assertRedirects(response, redirect_url) def test_admin_dashboard_view(self): """ Test that dashboard is inaccessible without login and redirects to login page then to dashboard """ target_url = url_for('home.admin_dashboard') redirect_url = url_for('auth.login', next=target_url) response = self.client.get(target_url) self.assertEqual(response.status_code, 302) self.assertRedirects(response, redirect_url) def test_departments_view(self): """ Test that departments page is inaccessible without login and redirects to login page then to departments page """ target_url = url_for('admin.list_departments') redirect_url = url_for('auth.login', next=target_url) response = self.client.get(target_url) self.assertEqual(response.status_code, 302) self.assertRedirects(response, redirect_url) def test_roles_view(self): """ Test that roles page is inaccessible without login and 
redirects to login page then to roles page """ target_url = url_for('admin.list_roles') redirect_url = url_for('auth.login', next=target_url) response = self.client.get(target_url) self.assertEqual(response.status_code, 302) self.assertRedirects(response, redirect_url) def test_employees_view(self): """ Test that employees page is inaccessible without login and redirects to login page then to employees page """ target_url = url_for('admin.list_employees') redirect_url = url_for('auth.login', next=target_url) response = self.client.get(target_url) self.assertEqual(response.status_code, 302) self.assertRedirects(response, redirect_url) class TestErrorPages(TestBase): def test_403_forbidden(self): # create route to abort the request with the 403 Error @self.app.route('/403') def forbidden_error(): abort(403) response = self.client.get('/403') self.assertEqual(response.status_code, 403) self.assertTrue("403 Error" in response.data) def test_404_not_found(self): response = self.client.get('/nothinghere') self.assertEqual(response.status_code, 404) self.assertTrue("404 Error" in response.data) def test_500_internal_server_error(self): # create route to abort the request with the 500 Error @self.app.route('/500') def internal_server_error(): abort(500) response = self.client.get('/500') self.assertEqual(response.status_code, 500) self.assertTrue("500 Error" in response.data) if __name__ == '__main__': unittest.main()

Because we now have multiple test files, we need a way to run all the tests at once. It would be inconvenient to have to run each test file individually. For this we will use nose2, a package that extends Python's unit testing framework, unittest, and makes testing easier.

$ pip install nose2

Now, we will run the tests using the nose2 command. This command looks for all files whose names begin with "test", and runs the methods inside these files whose names begin with "test". This means that if you name your test files and methods incorrectly, they will not be run by nose2. First, remember to ensure that your MySQL server is running. You can do this by running the following command:

$ mysqld

Then, in another terminal window, run the tests:

$ nose2
..............
----------------------------------------------------------------------
Ran 14 tests in 2.582s

OK

Introduction to Selenium WebDriver

Selenium is a suite of tools to automate web browsers for a variety of purposes. Selenium WebDriver in particular is useful when writing browser-based tests.

Begin by installing Selenium:

$ pip install selenium

Selenium Set-Up

Now, let's set up Selenium. We are using Flask Testing, an extension that provides unit testing utilities for Flask. To use Selenium with Flask Testing, we need to inherit from the LiveServerTestCase class, which will allow us to run a live server during our tests.

We will use ChromeDriver, a WebDriver for Chrome. This will allow us to run the front-end tests on the Chrome browser. If you are on MacOS, you can install it simply using the brew install chromedriver command. For other platforms, follow this Getting Started guide to download and properly set up ChromeDriver. Note that you will need to have Chrome browser installed.

If you prefer to use the Firefox browser for your tests, you can install and set up FirefoxDriver instead.

For the front-end tests, we will begin by creating test data that will be added to the database before each test runs. This includes two test users, two test departments, and two test roles. Add the following code to tests/test_front_end.py:

# tests/front_end_tests.py import unittest import urllib2 from flask_testing import LiveServerTestCase from selenium import webdriver from app import create_app, db from app.models import Employee, Role, Department # Set test variables for test admin user test_admin_username = "admin" test_admin_email = "admin@email.com" test_admin_password = "admin2016" # Set test variables for test employee 1 test_employee1_first_name = "Test" test_employee1_last_name = "Employee" test_employee1_username = "employee1" test_employee1_email = "employee1@email.com" test_employee1_password = "1test2016" # Set test variables for test employee 2 test_employee2_first_name = "Test" test_employee2_last_name = "Employee" test_employee2_username = "employee2" test_employee2_email = "employee2@email.com" test_employee2_password = "2test2016" # Set variables for test department 1 test_department1_name = "Human Resources" test_department1_description = "Find and keep the best talent" # Set variables for test department 2 test_department2_name = "Information Technology" test_department2_description = "Manage all tech systems and processes" # Set variables for test role 1 test_role1_name = "Head of Department" test_role1_description = "Lead the entire department" # Set variables for test role 2 test_role2_name = "Intern" test_role2_description = "3-month learning position" class TestBase(LiveServerTestCase): def create_app(self): config_name = 'testing' app = create_app(config_name) app.config.update( # Specify the test database SQLALCHEMY_DATABASE_URI='mysql://dt_admin:dt2016@localhost/dreamteam_test', # Change the port that the liveserver listens on LIVESERVER_PORT=8943 ) return app def setUp(self): """Setup the test driver and create test users""" self.driver = webdriver.Chrome() self.driver.get(self.get_server_url()) db.session.commit() db.drop_all() db.create_all() # create test admin user self.admin = Employee(username=test_admin_username, email=test_admin_email, password=test_admin_password, is_admin=True) # create test employee user self.employee = Employee(username=test_employee1_username, first_name=test_employee1_first_name, last_name=test_employee1_last_name, email=test_employee1_email, password=test_employee1_password) # create test department self.department = Department(name=test_department1_name, description=test_department1_description) # create test role self.role = Role(name=test_role1_name, description=test_role1_description) # save users to database db.session.add(self.admin) db.session.add(self.employee) db.session.add(self.department) db.session.add(self.role) db.session.commit() def tearDown(self): self.driver.quit() def test_server_is_up_and_running(self): response = urllib2.urlopen(self.get_server_url()) self.assertEqual(response.code, 200) if __name__ == '__main__': unittest.main()

At the beginning of the file, right after the imports, we begin by creating test variables which we will use in the tests. These include details for three test users, one admin and the other two non-admin, as well as two test departments and two test roles.

Next, we create a TestBase class, which subsequent test classes will inherit from. In it, we have the create_app method, which Flask Testing requires to return an app instance. In this method, we specify the Flask configuration and test database, just like we did in the back-end tests. We also specify the port that the live server will run on. Flask uses port 5000 by default. Changing this for the live server will ensure that the tests run on another port (in this case, port 8943), thus preventing any conflict with another Flask app that could be running.

In the setUp method, we set up the test driver as the ChromeDriver. (If using another webdriver, such as Firefox for example, use self.driver = webdriver.Firefox().) Take note of the self.get_server_url method; we use this to get the home URL. We also create a test employee, test admin user, test department and test role using the variables we created initially. In the tearDown method, we quit the webdriver, and clear the test database. The setUp method is called before every test, while the tearDown method is called after each test. It is important to have the test data before the tests run because most of our tests will require existing users, departments and roles. It would be repetitive to create new users, departments and roles for each test, so it is simpler and DRY-er to do it in the setUp method.

We have one test method in the base class, test_server_is_up_and_running. This method simply ensures that the test server is working as expected and returns a 200 OK status code.

Run the tests using the nose2 command:

$ nose2 ............... ---------------------------------------------------------------------- Ran 15 tests in 4.313s OK

It shows that 15 tests ran: the 14 back-end tests we had initially, plus the 1 front-end test we just added.

Finding Elements

So far, we have only one test in the test_front_end.py file. We need to write additional tests that simulate a user actually using the app, such as clicking links and buttons, entering data into forms, and so on. To tell Selenium to click a particular button or enter data in a particular field in a form, we need to use unique identifiers for these web elements.

Selenium provides methods that we can call to find web elements in a variety of ways, such as by id, by name, by XPath, and by CSS selector. I find that locating an element by id is one of the simplest ways; it only requires you to give an id attribute to the element that needs to be located.

We'll start by assigning id attributes to the menu links in the app so that we can direct the test user to click them as required. Open the templates/base.html file and update the menu items with id attributes as follows:

<!-- app/templates/base.html --> <!-- Modify nav bar menu --> <div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1"> <ul class="nav navbar-nav navbar-right"> {% if current_user.is_authenticated %} {% if current_user.is_admin %} <li id="dashboard_link_admin"><a href="{{ url_for('home.admin_dashboard') }}">Dashboard</a></li> <li id="departments_link"><a href="{{ url_for('admin.list_departments') }}">Departments</a></li> <li id="roles_link"><a href="{{ url_for('admin.list_roles') }}">Roles</a></li> <li id="employees_link"><a href="{{ url_for('admin.list_employees') }}">Employees</a></li> {% else %} <li id="dashboard_link_employee"><a href="{{ url_for('home.dashboard') }}">Dashboard</a></li> {% endif %} <li id="logout_link"><a href="{{ url_for('auth.logout') }}">Logout</a></li> <li id="username_greeting"><a><i class="fa fa-user"></i> Hi, {{ current_user.username }}!</a></li> {% else %} <li id="home_link"><a href="{{ url_for('home.homepage') }}">Home</a></li> <li id="register_link"><a href="{{ url_for('auth.register') }}">Register</a></li> <li id="login_link"><a href="{{ url_for('auth.login') }}">Login</a></li> {% endif %} </ul> </div>

Most other web elements that we will need to locate already have id attributes. For those that don't, we will find them by their CSS class names and selectors.

Testing the Auth Blueprint

We will begin by writing tests for the Auth blueprint, which handles registration, login, and logout. These tests need to handle edge cases, such as users entering invalid data, so that we can ensure the app responds appropriately.

Registration Tests

We'll start by writing the registration tests. Add the following code to the test_front_end.py file, after the TestBase class:

# tests/front_end_tests.py # update imports import time from flask import url_for class TestRegistration(TestBase): def test_registration(self): """ Test that a user can create an account using the registration form if all fields are filled out correctly, and that they will be redirected to the login page """ # Click register menu link self.driver.find_element_by_id("register_link").click() time.sleep(1) # Fill in registration form self.driver.find_element_by_id("email").send_keys(test_employee2_email) self.driver.find_element_by_id("username").send_keys( test_employee2_username) self.driver.find_element_by_id("first_name").send_keys( test_employee2_first_name) self.driver.find_element_by_id("last_name").send_keys( test_employee2_last_name) self.driver.find_element_by_id("password").send_keys( test_employee2_password) self.driver.find_element_by_id("confirm_password").send_keys( test_employee2_password) self.driver.find_element_by_id("submit").click() time.sleep(1) # Assert that browser redirects to login page assert url_for('auth.login') in self.driver.current_url # Assert success message is shown success_message = self.driver.find_element_by_class_name("alert").text assert "You have successfully registered" in success_message # Assert that there are now 3 employees in the database self.assertEqual(Employee.query.count(), 3) def test_registration_invalid_email(self): """ Test that a user cannot register using an invalid email format and that an appropriate error message will be displayed """ # Click register menu link self.driver.find_element_by_id("register_link").click() time.sleep(1) # Fill in registration form self.driver.find_element_by_id("email").send_keys("invalid_email") self.driver.find_element_by_id("username").send_keys( test_employee2_username) self.driver.find_element_by_id("first_name").send_keys( test_employee2_first_name) self.driver.find_element_by_id("last_name").send_keys( test_employee2_last_name) self.driver.find_element_by_id("password").send_keys( test_employee2_password) self.driver.find_element_by_id("confirm_password").send_keys( test_employee2_password) self.driver.find_element_by_id("submit").click() time.sleep(5) # Assert error message is shown error_message = self.driver.find_element_by_class_name( "help-block").text assert "Invalid email address" in error_message def test_registration_confirm_password(self): """ Test that an appropriate error message is displayed when the password and confirm_password fields do not match """ # Click register menu link self.driver.find_element_by_id("register_link").click() time.sleep(1) # Fill in registration form self.driver.find_element_by_id("email").send_keys(test_employee2_email) self.driver.find_element_by_id("username").send_keys( test_employee2_username) self.driver.find_element_by_id("first_name").send_keys( test_employee2_first_name) self.driver.find_element_by_id("last_name").send_keys( test_employee2_last_name) self.driver.find_element_by_id("password").send_keys( test_employee2_password) self.driver.find_element_by_id("confirm_password").send_keys( "password-won't-match") self.driver.find_element_by_id("submit").click() time.sleep(5) # Assert error message is shown error_message = self.driver.find_element_by_class_name( "help-block").text assert "Field must be equal to confirm_password" in error_message

We've created a class, TestRegistration, that inherits from the TestBase class, and has three test methods. Note that the if __name__ == '__main__': section in the file should always be at the bottom of the file, so any new code additions will go before it.

The test_registration method tests that a user can successfully create an account if they fill out all fields in the registration form correctly. It also tests that the user is redirected to the login page after registration and that a message is displayed inviting them to login. The test_registration_invalid_email method tests that entering an invalid email format in the email field will prevent the registration form from being submitted, and display an appropriate error message. The test_registration_confirm_password method tests that the data in the "Password" and "Confirm password" fields must match, and that an appropriate error message is displayed if they do not.

Take note of the send_keys method, which we use to enter data into form fields, and the click method, which we use to click on web elements such as buttons and links. Additionally, take note of the time.sleep method. We use this to pause the tests to give the browser time to load all web elements completely before proceeding to the next step.

Login Tests

The next tests are for logging in. Add a TestLogin class as follows:

# tests/front_end_tests.py class TestLogin(TestBase): def test_login(self): """ Test that a user can login and that they will be redirected to the homepage """ # Click login menu link self.driver.find_element_by_id("login_link").click() time.sleep(1) # Fill in login form self.driver.find_element_by_id("email").send_keys(test_employee1_email) self.driver.find_element_by_id("password").send_keys( test_employee1_password) self.driver.find_element_by_id("submit").click() time.sleep(2) # Assert that browser redirects to dashboard page assert url_for('home.dashboard') in self.driver.current_url # Assert that welcome greeting is shown username_greeting = self.driver.find_element_by_id( "username_greeting").text assert "Hi, employee1!" in username_greeting def test_admin_login(self): """ Test that an admin user can login and that they will be redirected to the admin homepage """ # Click login menu link self.driver.find_element_by_id("login_link").click() time.sleep(1) # Fill in login form self.driver.find_element_by_id("email").send_keys(test_admin_email) self.driver.find_element_by_id("password").send_keys( test_admin_password) self.driver.find_element_by_id("submit").click() time.sleep(2) # Assert that browser redirects to dashboard page assert url_for('home.admin_dashboard') in self.driver.current_url # Assert that welcome greeting is shown username_greeting = self.driver.find_element_by_id( "username_greeting").text assert "Hi, admin!" in username_greeting def test_login_invalid_email_format(self): """ Test that a user cannot login using an invalid email format and that an appropriate error message will be displayed """ # Click login menu link self.driver.find_element_by_id("login_link").click() time.sleep(1) # Fill in login form self.driver.find_element_by_id("email").send_keys("invalid") self.driver.find_element_by_id("password").send_keys( test_employee1_password) self.driver.find_element_by_id("submit").click() time.sleep(2) # Assert error message is shown error_message = self.driver.find_element_by_class_name( "help-block").text assert "Invalid email address" in error_message def test_login_wrong_email(self): """ Test that a user cannot login using the wrong email and that an appropriate error message will be displayed """ # Click login menu link self.driver.find_element_by_id("login_link").click() time.sleep(1) # Fill in login form self.driver.find_element_by_id("email").send_keys(test_employee2_email) self.driver.find_element_by_id("password").send_keys( test_employee1_password) self.driver.find_element_by_id("submit").click() time.sleep(2) # Assert that error message is shown error_message = self.driver.find_element_by_class_name("alert").text assert "Invalid email or password" in error_message def test_login_wrong_password(self): """ Test that a user cannot login using the wrong password and that an appropriate error message will be displayed """ # Click login menu link self.driver.find_element_by_id("login_link").click() time.sleep(1) # Fill in login form self.driver.find_element_by_id("email").send_keys(test_employee1_email) self.driver.find_element_by_id("password").send_keys( "invalid") self.driver.find_element_by_id("submit").click() time.sleep(2) # Assert that error message is shown error_message = self.driver.find_element_by_class_name("alert").text assert "Invalid email or password" in error_message

The test_login method tests that when the correct email and password combination is entered for a registered user, the user is redirected to the dashboard, where their username is displayed. It is the same as the test_admin_login method, except that the latter tests login for admin users, who are redirected to the admin dashboard. The test_login_invalid_email_format method tests that users cannot submit the login form if they enter a non-email value in the email field. The test_login_wrong_email and test_login_wrong_password methods test that a user cannot log in using invalid credentials, and that an appropriate error message is displayed.

Run the tests again. You'll notice that for the front-end tests, a browser window will open. It will simulate a user clicking links and filling out forms as specified in the tests we wrote. The output of running the tests should be similar to this:

$ nose2
........................
----------------------------------------------------------------------
Ran 25 tests in 96.581s

OK

Testing the Admin Blueprint

Next, we will write tests for the Admin blueprint, which is where the CRUD functionality of the application is. In these tests, we will simulate an admin user creating, viewing, updating, and deleting departments and roles. We will also simulate an admin user assigning departments and roles to employees.

Because we will need to log in multiple times before we can do any of this, we will create helper methods that we can call whenever we need to log in as a particular type of user.

In your tests/test_front_end.py file, add a CreateObjects class just after the test variables and before the TestRegistration class:

# tests/front_end_tests.py # after initialization of test variables class CreateObjects(object): def login_admin_user(self): """Log in as the test employee""" login_link = self.get_server_url() + url_for('auth.login') self.driver.get(login_link) self.driver.find_element_by_id("email").send_keys(test_admin_email) self.driver.find_element_by_id("password").send_keys( test_admin_password) self.driver.find_element_by_id("submit").click() def login_test_user(self): """Log in as the test employee""" login_link = self.get_server_url() + url_for('auth.login') self.driver.get(login_link) self.driver.find_element_by_id("email").send_keys(test_employee1_email) self.driver.find_element_by_id("password").send_keys( test_employee1_password) self.driver.find_element_by_id("submit").click()

Now, if we need to be logged in as a particular type of user in any of our tests, we can simply call the relevant method.

Department Tests

Next, we'll write our tests. Add a TestDepartments class after the TestLogin class, as follows:

# tests/front_end_tests.py class TestDepartments(CreateObjects, TestBase): def test_add_department(self): """ Test that an admin user can add a department """ # Login as admin user self.login_admin_user() # Click departments menu link self.driver.find_element_by_id("departments_link").click() time.sleep(1) # Click on add department button self.driver.find_element_by_class_name("btn").click() time.sleep(1) # Fill in add department form self.driver.find_element_by_id("name").send_keys(test_department2_name) self.driver.find_element_by_id("description").send_keys( test_department2_description) self.driver.find_element_by_id("submit").click() time.sleep(2) # Assert success message is shown success_message = self.driver.find_element_by_class_name("alert").text assert "You have successfully added a new department" in success_message # Assert that there are now 2 departments in the database self.assertEqual(Department.query.count(), 2) def test_add_existing_department(self): """ Test that an admin user cannot add a department with a name that already exists """ # Login as admin user self.login_admin_user() # Click departments menu link self.driver.find_element_by_id("departments_link").click() time.sleep(1) # Click on add department button self.driver.find_element_by_class_name("btn").click() time.sleep(1) # Fill in add department form self.driver.find_element_by_id("name").send_keys(test_department1_name) self.driver.find_element_by_id("description").send_keys( test_department1_description) self.driver.find_element_by_id("submit").click() time.sleep(2) # Assert error message is shown error_message = self.driver.find_element_by_class_name("alert").text assert "Error: department name already exists" in error_message # Assert that there is still only 1 department in the database self.assertEqual(Department.query.count(), 1) def test_edit_department(self): """ Test that an admin user can edit a department """ # Login as admin user self.login_admin_user() # Click departments menu link self.driver.find_element_by_id("departments_link").click() time.sleep(1) # Click on edit department link self.driver.find_element_by_class_name("fa-pencil").click() time.sleep(1) # Fill in add department form self.driver.find_element_by_id("name").clear() self.driver.find_element_by_id("name").send_keys("Edited name") self.driver.find_element_by_id("description").clear() self.driver.find_element_by_id("description").send_keys( "Edited description") self.driver.find_element_by_id("submit").click() time.sleep(2) # Assert success message is shown success_message = self.driver.find_element_by_class_name("alert").text assert "You have successfully edited the department" in success_message # Assert that department name and description has changed department = Department.query.get(1) self.assertEqual(department.name, "Edited name") self.assertEqual(department.description, "Edited description") def test_delete_department(self): """ Test that an admin user can delete a department """ # Login as admin user self.login_admin_user() # Click departments menu link self.driver.find_element_by_id("departments_link").click() time.sleep(1) # Click on edit department link self.driver.find_element_by_class_name("fa-trash").click() time.sleep(1) # Assert success message is shown success_message = self.driver.find_element_by_class_name("alert").text assert "You have successfully deleted the department" in success_message # Assert that there are no departments in the database self.assertEqual(Department.query.count(), 0)

The test_add_department method tests that an admin user can add a department using the add department form, while the test_add_existing_department method tests that they cannot add a department whose name already exists. The test_edit_department method tests that a department can be edited using the edit form. Take note of the clear method, which allows us to clear a textbox of any existing text before entering new text. Lastly, the test_delete_department method tests that a department can be deleted using the delete link on the departments page.

Role Tests

We will write similar tests for roles:

# tests/front_end_tests.py class TestRoles(CreateObjects, TestBase): def test_add_role(self): """ Test that an admin user can add a role """ # Login as admin user self.login_admin_user() # Click roles menu link self.driver.find_element_by_id("roles_link").click() time.sleep(1) # Click on add role button self.driver.find_element_by_class_name("btn").click() time.sleep(1) # Fill in add role form self.driver.find_element_by_id("name").send_keys(test_role2_name) self.driver.find_element_by_id("description").send_keys( test_role2_description) self.driver.find_element_by_id("submit").click() time.sleep(2) # Assert success message is shown success_message = self.driver.find_element_by_class_name("alert").text assert "You have successfully added a new role" in success_message # Assert that there are now 2 roles in the database self.assertEqual(Role.query.count(), 2) def test_add_existing_role(self): """ Test that an admin user cannot add a role with a name that already exists """ # Login as admin user self.login_admin_user() # Click roles menu link self.driver.find_element_by_id("roles_link").click() time.sleep(1) # Click on add role button self.driver.find_element_by_class_name("btn").click() time.sleep(1) # Fill in add role form self.driver.find_element_by_id("name").send_keys(test_role1_name) self.driver.find_element_by_id("description").send_keys( test_role1_description) self.driver.find_element_by_id("submit").click() time.sleep(2) # Assert error message is shown error_message = self.driver.find_element_by_class_name("alert").text assert "Error: role name already exists" in error_message # Assert that there is still only 1 role in the database self.assertEqual(Role.query.count(), 1) def test_edit_role(self): """ Test that an admin user can edit a role """ # Login as admin user self.login_admin_user() # Click roles menu link self.driver.find_element_by_id("roles_link").click() time.sleep(1) # Click on edit role link self.driver.find_element_by_class_name("fa-pencil").click() time.sleep(1) # Fill in add role form self.driver.find_element_by_id("name").clear() self.driver.find_element_by_id("name").send_keys("Edited name") self.driver.find_element_by_id("description").clear() self.driver.find_element_by_id("description").send_keys( "Edited description") self.driver.find_element_by_id("submit").click() time.sleep(2) # Assert success message is shown success_message = self.driver.find_element_by_class_name("alert").text assert "You have successfully edited the role" in success_message # Assert that role name and description has changed role = Role.query.get(1) self.assertEqual(role.name, "Edited name") self.assertEqual(role.description, "Edited description") def test_delete_role(self): """ Test that an admin user can delete a role """ # Login as admin user self.login_admin_user() # Click roles menu link self.driver.find_element_by_id("roles_link").click() time.sleep(1) # Click on edit role link self.driver.find_element_by_class_name("fa-trash").click() time.sleep(1) # Assert success message is shown success_message = self.driver.find_element_by_class_name("alert").text assert "You have successfully deleted the role" in success_message # Assert that there are no roles in the database self.assertEqual(Role.query.count(), 0) Exception Handling

You may also need to edit the add_department and add_role views to handle the SQLAlchemy exception that is raised when attempting to add a department or role that already exists; the handler should roll back the transaction and flash an error message. To do this, edit the try-except block in the add_department and add_role views as follows:

# app/admin/views.py def add_department(self): # existing code remains if form.validate_on_submit(): department = Department(name=form.name.data, description=form.description.data) try: # add department to the database db.session.add(department) db.session.commit() flash('You have successfully added a new department.') except: # in case department name already exists db.session.rollback() flash('Error: department name already exists.') def add_role(self): # existing code remains if form.validate_on_submit(): role = Role(name=form.name.data, description=form.description.data) try: # add role to the database db.session.add(role) db.session.commit() flash('You have successfully added a new role.') except: # in case role name already exists db.session.rollback() flash('Error: role name already exists.')

Run your tests now:

$ nose2
...................................
----------------------------------------------------------------------
Ran 35 tests in 234.457s

OK

Employee Tests

Now we will write tests where the admin user assigns departments and roles to employees.

# tests/front_end_tests.py # update imports from selenium.webdriver.support.ui import Select class TestEmployees(CreateObjects, TestBase): def test_assign(self): """ Test that an admin user can assign a role and a department to an employee """ # Login as admin user self.login_admin_user() # Click employees menu link self.driver.find_element_by_id("employees_link").click() time.sleep(1) # Click on assign link self.driver.find_element_by_class_name("fa-user-plus").click() time.sleep(1) # Department and role already loaded in form self.driver.find_element_by_id("submit").click() time.sleep(2) # Assert success message is shown success_message = self.driver.find_element_by_class_name("alert").text assert "You have successfully assigned a department and role" in success_message # Assert that department and role has been assigned to employee employee = Employee.query.get(2) self.assertEqual(employee.role.name, test_role1_name) self.assertEqual(employee.department.name, test_department1_name) def test_reassign(self): """ Test that an admin user can assign a new role and a new department to an employee """ # Create new department department = Department(name=test_department2_name, description=test_department2_description) # Create new role role = Role(name=test_role2_name, description=test_role2_description) # Add to database db.session.add(department) db.session.add(role) db.session.commit() # Login as admin user self.login_admin_user() # Click employees menu link self.driver.find_element_by_id("employees_link").click() time.sleep(1) # Click on assign link self.driver.find_element_by_class_name("fa-user-plus").click() time.sleep(1) # Select new department and role select_dept = Select(self.driver.find_element_by_id("department")) select_dept.select_by_visible_text(test_department2_name) select_role = Select(self.driver.find_element_by_id("role")) select_role.select_by_visible_text(test_role2_name) self.driver.find_element_by_id("submit").click() time.sleep(2) # Assert success message is shown success_message = self.driver.find_element_by_class_name("alert").text assert "You have successfully assigned a department and role" in success_message # Assert that employee's department and role has changed employee = Employee.query.get(2) self.assertEqual(employee.role.name, test_role2_name) self.assertEqual(employee.department.name, test_department2_name)

The test_assign method assigns the existing department and role to the existing employee using the assign link on the Employees page. The test_reassign method adds a new department and role to the database, and then assigns them to the existing employee. Take note of the Select class that we have imported from Selenium; from it, we use the select_by_visible_text method to select the department and role from a dropdown menu.

Let's run the tests one more time:

$ nose2
......................................
----------------------------------------------------------------------
Ran 38 tests in 181.709s

OK

Conclusion

That's it for Part One! To recap, in this tutorial you've learnt about Selenium WebDriver and how it can be used to run front-end tests by simulating a user of your app. You've also learnt about web elements and how to find them using some of Selenium's in-built methods. We have written tests for registration, login, and performing CRUD operations on departments and roles.

In Part Two, we will write tests for permissions, to ensure that only authorised users can access certain resources. Part Two will also cover continuous integration and linking our app with CircleCI, a continuous integration and delivery platform.

Quiz: Drop non-global users

Somebody asked on Freenode. I don't know why they wanted it. How would you drop all MySQL users who do not have "GRANT ALL ON *.* ... WITH GRANT OPTION"? That is, drop any users who have 'N' in any of the privilege columns in `mysql`.`user`.

My solution is shown below. Did you think of a different approach?

My solution

I used SQL to build SQL, which built more SQL. Get the list of 'priv' columns from information_schema.columns. Build a query that looks for 'N' in any of those columns (it feels funny to search for a constant in a list of fields instead of a field in a list of constants, but it's perfectly legal). Use the results to build the DROP USER statements.

mysql -BNe "SELECT CONCAT('SELECT CONCAT(''DROP USER '', QUOTE(user), ''@'', QUOTE(Host), '';'') FROM mysql.user WHERE ''N'' IN (', GROUP_CONCAT(column_name), ')') FROM information_schema.columns WHERE table_schema = 'mysql' AND table_name = 'user' AND column_name LIKE '%priv';" | mysql -BN | mysql

MySQL Connector/NET 7.0.7 m6 development has been released

MySQL Connector/Net 7.0.7 is the fourth development release that expands cross-platform support to Linux and OS X when using Microsoft’s .NET Core framework. Now,
.NET developers can use the X DevAPI with .NET Core and Entity Framework Core
(EF Core) 1.0 to create server applications that run on Windows, Linux and OS X.
We are very excited about this change and really look forward to your feedback on it!

MySQL Connector/Net 7.0.7 is also the sixth development release of MySQL Connector/Net to add support for the new X DevAPI. The X DevAPI enables
application developers to write code that combines the strengths of the
relational and document models using a modern, NoSQL-like syntax that
does not assume previous experience writing traditional SQL.

To learn more about how to write applications using the X DevAPI, see
http://dev.mysql.com/doc/x-devapi-userguide/en/index.html. For more
information about how the X DevAPI is implemented in Connector/Net, see
http://dev.mysql.com/doc/dev/connector-net.

Note

The X DevAPI requires MySQL Server version 5.7.12 or higher with
the X Plugin enabled. For general documentation about how to get started
using MySQL as a document store, see
http://dev.mysql.com/doc/refman/5.7/en/document-store.html.

To download MySQL Connector/Net 7.0.7 M6, see the “Development
Releases” tab at http://dev.mysql.com/downloads/connector/net/

Functionality Added or Changed

  • X DevAPI: Added support for Internet Protocol version 6 (IPv6) addresses. Host names can now resolve as IPv4 or IPv6 addresses.
  • X DevAPI: Connection string syntax is now identical to the URI scheme, which provides a cross-product syntax for defining the connection data to establish a session.
  • X DevAPI: Added new methods to the Schema class to create, alter, and drop views. Usage notes include:
    • Views created with the Schema.CreateView().DefinedAs() method chain are
      supported for use with the table Select method, but are not supported with the collection Find method.
    • A collection view created in the database by some mechanism other than the CreateView method is not defined as a view by Table.IsView.
    • Query objects assigned to a view with the CreateView method are static, even when the underlying query changes.

Nuget

Packages are available at:

https://www.nuget.org/packages/MySql.Data/7.0.7-m6
https://www.nuget.org/packages/MySql.Web/7.0.7-m6
https://www.nuget.org/packages/MySql.Data.Entity/7.0.7-m6
https://www.nuget.org/packages/MySql.Data.EntityFrameworkCore/7.0.7-m6

Enjoy and thanks for the support!

MySQL/Oracle Release Engineering Team

Percona Live Featured Session with Evan Elias: Automatic MySQL Schema Management with Skeema

Welcome to another post in the series of Percona Live featured session blogs! In these blogs, we’ll highlight some of the session speakers that will be at this year’s Percona Live conference. We’ll also discuss how these sessions can help you improve your database environment. Make sure to read to the end to get a special Percona Live 2017 registration bonus!

In this Percona Live featured session, we’ll meet Evan Elias, Director of Engineering, Tumblr. His session is Automatic MySQL Schema Management with Skeema. Skeema is a new open source CLI tool for managing MySQL schemas and migrations. It allows you to easily track your schemas in a repository, supporting a pull-request-based workflow for schema change submission, review, and execution.

I had a chance to speak with Evan about Skeema:

Percona: How did you get into database technology? What do you love about it?

Evan: I first started using MySQL at a college IT job in 2003, and over the years I eventually began tackling much larger-scale deployments at Tumblr and Facebook. I’ve spent most of the past decade working on social networks, where massive high-volume database technology is fundamental to the product. I love the technical challenges present in that type of environment, as well as the huge potential impact of database automation and tooling. In companies with giant databases and many engineers, a well-designed automation system can provide a truly enormous increase in productivity.

Percona: Your talk is called Automatic MySQL Schema Management with Skeema. What is Skeema, and how is it helpful for engineers and DBAs?

Evan: Skeema is an open source tool for managing MySQL schemas and migrations. It allows users to diff, push or pull schema definitions between the local filesystem and one or more databases. It can be configured to support multiple environments (e.g. development/staging/production), external online schema change tools, sharding, and service discovery. Once configured, an engineer or DBA can use Skeema to execute an online schema change on many shards concurrently simply by editing a CREATE TABLE statement in a file and then running “skeema push”.

Percona: What are the benefits of storing schemas in a repository?

Evan: The whole industry is moving towards infrastructure-as-code solutions, providing automated configuration which is reproducible across multiple environments. In extending this concept to database schemas, a file repository stores the desired state of each table, and a schema change is tied to simply changing these files. A few large companies like Facebook have internal closed-source tools to tie MySQL schemas to a git repo, allowing schema changes to be powered by pull requests (without any manual DBA effort). There hasn’t previously been an open source, general-purpose tool for managing schemas and migrations in this way, however. I developed Skeema to fill this gap.

Percona: What do you want attendees to take away from your session? Why should they attend?

Evan: In this session, MySQL DBAs will learn how to automate their schema change workflow to reduce manual operational work, while software engineers will discover how Skeema permits easy online migrations even in frameworks like Rails or Django. Skeema is a brand new tool, and this is the first conference session to introduce it. At this relatively early stage, feedback and feature requests from attendees will greatly influence the direction and prioritization of future development.

Percona: What are you most looking forward to at Percona Live 2017?

Evan: Percona Live is my favorite technical conference. It’s the best place to learn about all of the recent developments in the database world, and meet the top experts in the field. This is my fifth year attending in Santa Clara. I’m looking forward to reconnecting with old friends and making some new ones as well!

Register for Percona Live Data Performance Conference 2017, and see Evan present his session on Automatic MySQL Schema Management with Skeema. Use the code FeaturedTalk and receive $100 off the current registration price!

Percona Live Data Performance Conference 2017 is the premier open source event for the data performance ecosystem. It is the place to be for the open source community as well as businesses that thrive in the MySQL, NoSQL, cloud, big data and Internet of Things (IoT) marketplaces. Attendees include DBAs, sysadmins, developers, architects, CTOs, CEOs, and vendors from around the world.

The Percona Live Data Performance Conference will be April 24-27, 2017 at the Hyatt Regency Santa Clara & The Santa Clara Convention Center.

Generating a MySQL Password

So, I’ve started a new job as a Senior Database Engineer at Salesforce, and one of the services I help provide is adding users to MySQL. We have some nice chef recipes, so all I have to do is update a few files, including adding in the MySQL password hash.

Now, when I added myself, I just logged into MySQL and generated a password hash. But when my SRE (systems reliability engineer) colleague needed to generate a password, he did not have a MySQL system he could login to.

The good news is that it’s easy to generate a MySQL password hash. The MySQL password hash is simply a SHA1 hash of a (binary) SHA1 hash of the password, hex-encoded in uppercase with a * prepended. This means you do not need a MySQL database to create a MySQL password hash – all you need is a programming language that has a SHA1 function (well, and a concatenate function).

And I found it, of course, on this post at StackExchange. So you don’t have to click through, here is what it says – and I have tested all these methods and I get the same password hash. I have changed their example of “right” to “PASSWORD HERE” so it’s more readable and obvious where the password goes, in case you copy and paste from here.

Some one-liners:

MySQL (may require you to add -u(user) -p):

mysql -NBe "select password('PASSWORD HERE')"

Python:

python -c 'from hashlib import sha1; print "*" + sha1(sha1("PASSWORD HERE").digest()).hexdigest().upper()'

Perl:

perl -MDigest::SHA1=sha1_hex -MDigest::SHA1=sha1 -le 'print "*". uc sha1_hex(sha1("PASSWORD HERE"))'

PHP:

php -r 'echo "*" . strtoupper(sha1(sha1("PASSWORD HERE", TRUE))). "\n";'
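Note that the Python one-liner above uses Python 2 syntax (a print statement and implicit byte strings). A minimal Python 3 equivalent, using only the standard library, would be:

python3 -c 'from hashlib import sha1; print("*" + sha1(sha1("PASSWORD HERE".encode()).digest()).hexdigest().upper())'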

Hopefully these help you – they have enabled my colleagues to easily generate what’s needed without having to find (or create) a MySQL instance that they can already log in to.

Migrating MySQL database from Amazon RDS to DigitalOcean

In previous blogs (part 1 and part 2), we discussed how to migrate your RDS data into an EC2 instance. In the process, we managed to move our data out of RDS, but we are still running on AWS. If you would like to move your data completely out of Amazon Web Services, there’s a bit more work to do. In today’s blog post, we will show you how it can be done.

Environment introduction

The environment we’ll be working with is pretty similar to what we ended up with in our last post in the series. The only difference is that no cutover happened, as we will use the EC2 instance as an intermediate step in the process of moving out of AWS.

Initial infrastructure setup

The action plan

In the previous blog, we first migrated our data from RDS to an EC2 instance that we have full access to. As we already have MySQL running on our EC2 instance, we have more options to choose from regarding how to copy our data to another cloud. DigitalOcean is only used for demo purposes here; the process we describe below can be used to migrate to any other hosting or cloud provider, as long as you have direct access to the VPS instances. In this process, we will use xtrabackup to copy the data (although it is perfectly fine to use any other method of binary transfer). We will need to prepare a safe connection between AWS and DigitalOcean. Once we do that, we will set up replication from the EC2 instance into a DigitalOcean droplet. The next step would be to perform a cutover and move applications, but we won’t cover it here.

Deciding on connectivity method

Amazon Web Services allows you to pick from many different ways to create a connection to external networks. If you have a hardware appliance which supports VPN connections, you can use it to form a VPN connection between your VPC in AWS and your local infrastructure. If your network provider offers you a peering connection with the AWS network and you have a BGP router, you can get a direct VLAN connection between your network and AWS via AWS Direct Connect. If you have multiple, isolated networks, you can link them together with Amazon by using AWS VPN CloudHub. Finally, as EC2 instances are yours to manage, you can also set up a VPN between that EC2 instance and your local network using software solutions like OpenVPN.

As we are talking databases, you can also decide to set up SSL replication between MySQL on EC2 (the master) and the slave running on DigitalOcean. You still have to figure out how to do the initial data transfer to the slave - one solution could be to tar the output of xtrabackup, encrypt that file and either send it over the WAN (rsync) or upload it to an S3 bucket and then download it from there. You could also rely on SSH encryption and just scp (or even rsync, using SSH) the data to the new location.

There are many options to choose from. We will use another solution though - we are going to establish an SSH tunnel between the EC2 instance and our DigitalOcean droplet to form a secure channel over which we will replicate data. The initial transfer will be made using rsync over the SSH connection.

Configuring a DigitalOcean droplet

As we decided to use DigitalOcean, we can leverage NinesControl to deploy the node. We will deploy a single PXC 5.7 node (to match the MySQL 5.7 version that we use on EC2 - keep in mind that replication from a newer to an older version of MySQL is not supported and will most likely fail). We will also have to configure an SSH tunnel between the EC2 and DigitalOcean instances.

We won’t cover the setup of NinesControl and the deployment process here, but you can check the following blog posts:

Registering an account and deploying on DigitalOcean should not take more than 10-15 minutes. After the deployment completes, you will see your database in the NinesControl UI:

NinesControl screen with cluster deployed

Note that we have deployed a single node of Galera Cluster here. You can find an option to download the SSH key - this is what we need to get access to the host.

NinesControl screen with details of a cluster

Infrastructure that we want to build

Copying data to DigitalOcean

Once we have MySQL 5.7 up and running on the DigitalOcean instance, we need to perform a backup of the EC2 instance and then transfer it to DO. Technically, it should be possible to perform a direct streaming of xtrabackup data between the nodes but we cannot really recommend it. WAN links can be unreliable, and it would be better to take a backup locally and then use rsync with its ability to retry the transfer whenever something is not right.

First, let’s take a backup on our EC2 instance:

root@ip-172-30-4-238:~# innobackupex --user=tpcc --password=tpccpass /tmp/backup

Once it’s ready, we need to transfer it to the DigitalOcean network. To do it in a safe way, we will create a new user on the DO droplet, generate an SSH key and use this user to copy the data. Of course, you could also use any of the existing users; it’s not required to create a new one. So, let’s add a new user. There are many ways to do this; we’ll use the ‘adduser’ command.

root@galera1-node-1:~# adduser rdscopy
Adding user `rdscopy' ...
Adding new group `rdscopy' (1001) ...
Adding new user `rdscopy' (1001) with group `rdscopy' ...
Creating home directory `/home/rdscopy' ...
Copying files from `/etc/skel' ...
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Changing the user information for rdscopy
Enter the new value, or press ENTER for the default
        Full Name []:
        Room Number []:
        Work Phone []:
        Home Phone []:
        Other []:
Is the information correct? [Y/n] y

Now, it’s time to generate a pair of ssh keys to use for connectivity:

root@galera1-node-1:~# ssh-keygen -C 'rdscopy' -f id_rsa_rdscopy -t rsa -b 4096
Generating public/private rsa key pair.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in id_rsa_rdscopy.
Your public key has been saved in id_rsa_rdscopy.pub.
The key fingerprint is:
3a:b0:d2:80:5b:b8:60:1b:17:58:bd:8e:74:c9:56:b3 rdscopy

With the SSH key generated, we need to set it up on our DigitalOcean droplet: we need to create the .ssh directory and an authorized_keys file with the proper access permissions.

root@galera1-node-1:~# mkdir /home/rdscopy/.ssh
root@galera1-node-1:~# cat id_rsa_rdscopy.pub > /home/rdscopy/.ssh/authorized_keys
root@galera1-node-1:~# chown rdscopy.rdscopy /home/rdscopy/.ssh/authorized_keys
root@galera1-node-1:~# chmod 600 /home/rdscopy/.ssh/authorized_keys

Then, we need to copy our private key to the EC2 instance. When we are ready with it, we can copy our data. As we mentioned earlier, we will use rsync for that - it will let us restart the transfer if, for whatever reason, the process is interrupted. Combined with SSH, this gives us a safe and robust method of copying the data over the WAN. Let’s start rsync on the EC2 host:

root@ip-172-30-4-238:~# rsync -avz -e "ssh -i id_rsa_rdscopy -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null" --progress /tmp/backup/2017-02-20_16-34-18/ rdscopy@198.211.97.97:/home/rdscopy

After a while, which will depend on the amount of data and the transfer speed, our backup data should become available on the DigitalOcean droplet. This means that it is time to prepare it by applying the InnoDB redo logs, and then copy it back into the MySQL data directory. For that we need to stop MySQL, remove the current data directory, copy the files back using either innobackupex or manually, and finally verify that the owner and group of the new files are set to mysql:

root@galera1-node-1:~# innobackupex --apply-log /home/rdscopy/
root@galera1-node-1:~# service mysql stop
root@galera1-node-1:~# rm -rf /var/lib/mysql/*
root@galera1-node-1:~# innobackupex --copy-back /home/rdscopy/
root@galera1-node-1:~# chown -R mysql.mysql /var/lib/mysql

Before we start MySQL, we also need to ensure that both the server_id and the server UUID are different. The former can be edited in my.cnf; the latter can be taken care of by removing auto.cnf (a new UUID will be generated when MySQL starts):

root@galera1-node-1:~# rm /var/lib/mysql/auto.cnf

Now, we can start MySQL:

root@galera1-node-1:~# service mysql start

Setting up replication

We are ready to set up replication between EC2 and DO, but first we need to set up an SSH tunnel - we’ll create an additional SSH key for the ubuntu user on the EC2 instance and copy it to the DO instance. Then we will use the ubuntu user to create the tunnel that we will use for replication.

Let’s start by creating the new ssh key:

root@ip-172-30-4-238:~# ssh-keygen -C 'tunnel' -f id_rsa_tunnel -t rsa -b 4096
Generating public/private rsa key pair.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in id_rsa_tunnel.
Your public key has been saved in id_rsa_tunnel.pub.
The key fingerprint is:
c4:44:79:39:9c:c6:ce:45:bb:13:e5:6f:c5:d9:8c:14 tunnel

Next, we need to add our public key to the authorized_keys file on the EC2 instance, which we will connect to from DigitalOcean to create the tunnel.

root@ip-172-30-4-238:~# cat id_rsa_tunnel.pub >> /home/ubuntu/.ssh/authorized_keys

We also need to transfer the private key to the DO droplet. This can be done in many ways, but we’ll use scp with the rdscopy user and the key that we created earlier:

root@ip-172-30-4-238:~# scp -i id_rsa_rdscopy id_rsa_tunnel rdscopy@198.211.97.97:/home/rdscopy
id_rsa_tunnel    100% 3247   3.2KB/s   00:00

That’s all we need - now we can create the SSH tunnel. We want it to be available all the time, so we will use a screen session for it.

root@galera1-node-1:~# screen -S tunnel
root@galera1-node-1:~# ssh -L 3307:localhost:3306 ubuntu@54.224.107.6 -i /home/rdscopy/id_rsa_tunnel

What we did here was to open an SSH tunnel between localhost, port 3307, and the remote host 54.224.107.6, port 3306, using the “ubuntu” user and the key located in /home/rdscopy/id_rsa_tunnel. Detach the screen session, and the remote host should be available via 127.0.0.1:3307.

To set up replication, we still need to add a user that we will use to connect to MySQL on EC2. We will create it on the EC2 host, using ‘127.0.0.1’ as the host - connections coming in via the SSH tunnel will look like they come from localhost:

mysql> CREATE USER rds_rpl@127.0.0.1 IDENTIFIED BY 'rds_rpl_pass';
Query OK, 0 rows affected (0.00 sec)
mysql> GRANT REPLICATION SLAVE ON *.* TO rds_rpl@127.0.0.1;
Query OK, 0 rows affected (0.00 sec)

Everything is ready to set up replication; it’s now time to follow the traditional process of creating a slave based on xtrabackup data. We need to use the data from xtrabackup_binlog_info to identify the master’s position at the time of the backup. This position is what we want to use in our CHANGE MASTER TO … command. Let’s take a look at the contents of the xtrabackup_binlog_info file:

root@galera1-node-1:~# cat /home/rdscopy/xtrabackup_binlog_info
binlog.000052 896957365

This is the binary log file and position we’ll use in our CHANGE MASTER TO:

root@galera1-node-1:~# mysql -u root -ppass
mysql> CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=3307, MASTER_USER='rds_rpl', MASTER_PASSWORD='rds_rpl_pass', MASTER_LOG_FILE='binlog.000052', MASTER_LOG_POS=896957365; START SLAVE;
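If you prefer to script this step, a small, hypothetical Python helper along these lines could read xtrabackup_binlog_info and print the statements for you (the file path and credentials below are simply the example values used in this post):

# build_change_master.py - hypothetical helper, not part of xtrabackup
with open('/home/rdscopy/xtrabackup_binlog_info') as f:
    # the file holds the binary log file name and position, whitespace-separated
    binlog_file, binlog_pos = f.read().split()[:2]

print("CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=3307, "
      "MASTER_USER='rds_rpl', MASTER_PASSWORD='rds_rpl_pass', "
      "MASTER_LOG_FILE='{0}', MASTER_LOG_POS={1};".format(binlog_file, int(binlog_pos)))
print("START SLAVE;")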

This is it - replication should now be up and running, and our DigitalOcean slave should be catching up. Once it has caught up, our database tier is ready for switchover. Of course, usually it’s more than just a single node - you will most likely have to set up multiple slaves on DO before the infrastructure is ready to handle production traffic.

Switchover itself is a different topic - you will have to devise a plan to minimize downtime. In general, traffic should be moved from the old to the new location, but how that should be done depends mostly on your environment. It can be anything from a simple change in a DNS entry, to complex scripts which will pull all the triggers in the correct order to redirect the traffic. No matter what, your database is now in the new location, ready to serve requests.


Network attacks on MySQL, Part 2: SSL stripping with MySQL

Intro

In my previous blog post I told you to use SSL/TLS to secure your MySQL network connections. So I followed my advice and did enable SSL. Great!

So first let's quickly verify that everything is working.

So you enabled SSL with mysql_ssl_rsa_setup, used an OpenSSL-based build or put ssl-cert, ssl-key and ssl-ca in the mysqld section of your /etc/my.cnf, and now show global variables like 'have_SSL'; returns 'YES'.

And you have configured the client with --ssl-mode=PREFERRED. Now show global status like 'Ssl_cipher'; indicates the session is indeed secured.

You could also dump traffic and it looks 'encrypted' (i.e. not readable)...

With SSL enabled, everything should be safe, shouldn’t it?

The handshake which MySQL uses always starts unencrypted and is only upgraded to an encrypted connection if both the client and the server have the SSL flag set. This is very similar to STARTTLS as used in the SMTP protocol.

To attack this we need an active attack; we need to actually sit in between the client and the server and modify packets.

Then we modify the flags sent from the server to the client to have the SSL flag disabled. This is called SSL stripping.

Because the client thinks the server doesn't support SSL the connection is not upgraded and continues in clear text.

An example can be found in the dolfijn_stripssl.py script.
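To illustrate the idea (this is only a rough sketch of the technique, not the dolfijn_stripssl.py code), the man in the middle simply has to clear the CLIENT_SSL capability bit (0x0800) in the server's initial HandshakeV10 packet before forwarding it to the client:

# Rough sketch: clear the SSL capability bit in a MySQL HandshakeV10 payload.
# 'payload' is the packet body, without the 4-byte packet header.
CLIENT_SSL = 0x0800  # capability flag with which the server advertises SSL support

def strip_ssl_flag(payload: bytes) -> bytes:
    # Layout: 1 byte protocol version, NUL-terminated server version string,
    # 4 bytes connection id, 8 bytes auth-plugin data, 1 filler byte,
    # then 2 bytes (little-endian) holding the lower capability flags.
    end_of_version = payload.index(b'\x00', 1)
    cap_offset = end_of_version + 1 + 4 + 8 + 1
    caps = int.from_bytes(payload[cap_offset:cap_offset + 2], 'little')
    caps &= ~CLIENT_SSL
    return payload[:cap_offset] + caps.to_bytes(2, 'little') + payload[cap_offset + 2:]

A proxy that applies this to the server greeting and forwards everything else untouched makes the client believe the server cannot do SSL, so the session continues in clear text.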

Once the SSL layer is stripped from the connection, an attacker can once again see your queries and result sets, as described before.

To protect against this attack:

  1. Set REQUIRE SSL on accounts which should never use unencrypted connections.
  2. On the client, use --ssl-mode=REQUIRED to force the use of SSL. This is available since 5.6.30 / 5.7.11.
  3. For older clients: check the Ssl_cipher status variable and exit if it is empty (see the sketch below).
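For the last point, a minimal sketch of such a check (shown here with mysql-connector-python as the assumed client library and placeholder connection parameters; any connector that can run SHOW STATUS will do) could look like this:

# Minimal sketch: refuse to continue if the session is not actually encrypted.
import sys
import mysql.connector  # assumes mysql-connector-python is installed

cnx = mysql.connector.connect(host='127.0.0.1', user='app', password='secret')
cur = cnx.cursor()
cur.execute("SHOW SESSION STATUS LIKE 'Ssl_cipher'")
_, cipher = cur.fetchone()
if not cipher:
    sys.exit("Connection is not using SSL - possible SSL stripping, aborting.")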
