Planet MySQL

Plenty of new MySQL books

In the old days, when we wanted to strengthen our skills, the only option was to buy a good book. Nowadays one can find a lot of resources on the Internet; however, quality is often poor. Fortunately there are still some great people who are brave enough to write new books that will help a new generation of women and men build modern applications with MySQL, the world's most popular open source database. Let me introduce you to three MySQL books: Introducing the MySQL 8 Document Store / MySQL and JSON: A Practical Programming Guide / Pro MySQL NDB Cluster

Missed our Continuent Clustering 6.0 webinar and training? Don’t sweat it. Watch them on-demand.

Missed our Continuent Clustering 6.0 webinar and technical deep dive? Don’t sweat it. Watch these recordings of our presentations to find out what’s new in multisite, multimaster clustering, what’s changed, the improvements we’ve made under the hood, and how all of this will improve the quality and support of your multimaster MySQL/Percona/MariaDB deployments.

In our technical deep dive, we take a detailed look at the new deployment model, how the different components work together, and how you can now manage and support your new multimaster environment using Continuent Clustering 6.0. We cover the new integration of the replication, how that affects your deployment, and how that alters the management and monitoring of your database cluster.

CPU overheads for RocksDB queries

An LSM like RocksDB has much better write and space efficiency than a B-Tree. That means with RocksDB you will use less SSD than with a B-Tree, and the SSD will either last longer or you can use lower-endurance SSD. But this efficiency comes at a cost. In the worst case RocksDB might use 2X more CPU per query than a B-Tree, which means that in the worst case QPS with RocksDB might be half of what it is with a B-Tree. In the examples below I show that RocksDB can use ~2X more comparisons per query compared to a B-Tree, and it is nice when results in practice can be explained by theory.

But first I want to explain the context where this matters and where it doesn't matter. It matters when the CPU overhead from RocksDB is a significant fraction of the query response time -- so the workload needs to be CPU bound (cached working set). It isn't a problem for workloads that are IO-bound. Many performance results have been published on my blog and more are coming later this year. With fast storage devices available I recommend that database workloads strive to be IO-bound to avoid using too much memory+power.

Basic Operations

Where does the CPU overhead come from? I started to explain this in my review of the original LSM paper. My focus in this post is on SELECT statements in SQL, but note that writes in SQL also do reads from the storage engine (updates have a where clause, searches are done to find matching rows and appropriate leaf pages, queries might be done to determine whether constraints will be violated, etc).

With RocksDB data can be found in the memtable and sorted runs from the L0 to Lmax (Lmax is the max/largest level in the LSM tree). There are 3 basic operations that are used to search for data - get, range seek and range next. Get searches for an exact match. Range seek positions iterators at the start of a range scan. Range next gets the next row from a range scan.

First a few comments before I explain the CPU overhead:
  1. I will try to be clear when I refer to the keys for a SQL table (SQL key) and the key as used by RocksDB (RocksDB key). Many SQL indexes can be stored in the same LSM tree (column family) and they are distinguished by prepending the index ID as the first bytes in the RocksDB key. The prepended bytes aren't visible to the user.
  2. Get is used for exact match on a PK index but not used for exact match on a secondary index because with MyRocks the RocksDB key for a secondary index entry is made unique by appending the PK to it. A SQL SELECT that does an exact match on the secondary key searches for a prefix of the RocksDB key and that requires a range scan.
  3. With RocksDB a bloom filter can be on a prefix of the key or the entire key (prefix vs whole key). A prefix bloom filter can be used for range and point queries. A whole key bloom filter will only be used for point queries on a PK. A whole key bloom filter won't be used for range queries. Nor will it be used for point queries on a secondary index (because secondary index exact match uses a range scan internally).
  4. Today bloom filters are configured per column family. A good use for column families is to put secondary indexes into separate ones that are configured with prefix bloom filters. Eventually we expect to make this easier in RocksDB.
Get
A point query is evaluated for RocksDB by searching sorted runs in order: first the memtable, then runs in the L0, then the L1 and so on until the Lmax is reached. The search stops as soon as the key is found, whether from a tombstone or a key-value pair. The work done to search each sorted run varies. I use comparisons and bloom filter probes as the proxy for work. I ignore memory system behavior (cache misses) but assume that the skiplist isn't great in that regard:
  • memtable - this is a skiplist and I assume the search cost is log2(N) when it has N entries. 
  • L0 - for each sorted run first check the bloom filter and if the key might exist then do binary search on the block index and then do binary search within the data block. The cost is a bloom filter probe and possibly log2(M) comparisons when an L0 sorted run has M entries.
  • L1 to Ln - each of these levels is range partitioned into many SST files so the first step is to do a binary search to find the SST that might contain the key, then check the bloom filter for that SST and if the key might exist then do binary search on the block index for that SST and then do binary search within the data block. The binary search cost to find the SST gets larger for the larger levels and that cost doesn't go away when bloom filters are used. For RocksDB with a 1T database, per-level fanout=8 and SST size=32M, the number of SSTs per level is 8 in L1, 64 in L2, 512 in L3, 4096 in L4 and 32768 in L5. The number of comparisons before the bloom filter check is 3 for L1, 6 for L2, 9 for L3 and 12 for L4. I assume there is no bloom filter on L5 because it is the max level.
A bloom filter check isn't free. Fortunately, RocksDB makes the cost less than I expected by limiting the bits that are set for a key, and must be checked on a probe, to one cache line. See the code in AddHash. For now I assume that the cost of a bloom filter probe is equivalent to a few (< 5) comparisons given the cache line optimization.
A Get operation on a B-Tree with 8B rows needs ~33 comparisons. With RocksDB it might need 20 comparisons for the memtable, bloom filter probes for 4 SSTs in L0, 3+6+9+12 SST search comparisons for L1 to L4, bloom filter probes for L1 to L4, and then ~33 comparisons to search the max level (L5). So it is easy to see that the search cost might be double the cost for a B-Tree.
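To make the arithmetic concrete, here is the tally for the example above (the memtable entry count giving ~20 comparisons is assumed, matching the numbers already used):

memtable (skiplist)       ~20 comparisons
L0                          4 bloom filter probes
L1 to L4, SST search        3 + 6 + 9 + 12 = 30 comparisons
L1 to L4                    4 bloom filter probes
L5 (no bloom filter)      ~33 comparisons
total                     ~83 comparisons + 8 bloom filter probes, vs ~33 comparisons for the B-Tree

Even before charging anything for the bloom filter probes, ~83 comparisons is more than twice the ~33 needed by the B-Tree.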

The search cost for Get with RocksDB can be reduced by configuring the LSM tree to use fewer sorted runs although we are still figuring out how much that can be reduced. With this example about 1/3 of the comparisons are from the memtable, another 1/3 are from the max level and the remaining 1/3 are from L0 to L4.

Range Seek and Next
The cost for a range scan has two components: range seek to initialize the scan and range next to produce each row. For range seek an iterator is positioned within each sorted run in the LSM tree. For range next the merging iterator combines the per-run iterators. For RocksDB the cost of range seek depends on the number of sorted runs while the cost of range next does not (for leveled compaction, assuming uniform distribution).

Range seek does binary search per sorted run. This is similar to a point query without a bloom filter. The cost across all sorted runs depends on the number of sorted runs. In the example above, where RocksDB has a memtable, 4 SSTs in the L0, levels L1 to L5 and 8B rows, that requires 20 comparisons for the memtable, 4 x 20 comparisons for the L0 SSTs, and 21 + 24 + 27 + 30 + 33 comparisons for L1 to L5. The total is 235 comparisons. There isn't a way to avoid this, and for short range queries the cost of range seek dominates the cost of range next. While this overhead is significant for short range queries with embedded RocksDB and an in-memory workload, it is harder to notice with MyRocks because there is a lot of CPU overhead above MyRocks from optimize, parse and client RPC for a short range query. It is easier to notice the difference in CPU overhead between MyRocks and a B-Tree with longer range scans.
The cost for range next is interesting. LevelDB has a comment that suggests using a heap but the merging iterator code uses N-1 comparisons per row produced which means the overhead is dependent on the number of sorted runs. The cost of range next in RocksDB is much less dependent on the number of sorted runs because it uses an optimized binary heap and the number of comparisons to produce a row depends on the node that produces the winner. Only one comparison is done if the root produces the winner and the root produced the previous winner. Two comparisons are done if the root produces the winner but did not produce the previous winner. More comparisons are needed in other cases. The code is optimized for long runs of winners from one iterator and that is likely with leveled compaction because Lmax is ~10X larger than the next largest level and usually produces most winners. Using a simulation with uniform distribution the expected number of comparisons per row is <= 1.55 regardless of the number of sorted runs for leveled compaction.

The optimization in the binary heap used by the merging iterator is limited to remembering the comparison result between the two children of the root node. Were that extended to remembering comparison results between any two sibling nodes in the heap then the expected number of comparisons would be reduced from ~1.55 to ~1.38.

For a B-Tree:
  • The overhead for range seek is similar to a point query -- ~33 comparisons when there are 8B rows.
  • There are no merging iterators. The overhead for range next is low -- usually move to the next row in the current leaf page.


Galera Cluster Recovery 101 - A Deep Dive into Network Partitioning

One of the cool features in Galera is automatic node provisioning and membership control. If a node fails or loses communication, it will be automatically evicted from the cluster and remain non-operational. As long as the majority of nodes are still communicating (Galera calls this PC - primary component), there is a very high chance the failed node will be able to automatically rejoin, resync and resume replication once connectivity is back.

Generally, all Galera nodes are equal. They hold the same data set and the same master role, capable of handling reads and writes simultaneously, thanks to Galera group communication and the certification-based replication plugin. Therefore, there is actually no failover from the database point of view due to this equilibrium. Only the application side requires failover, to skip the non-operational nodes while the cluster is partitioned.

In this blog post, we are going to look into how Galera Cluster performs node and cluster recovery in case a network partition happens. Just as a side note, we covered a similar topic in this blog post some time back. Codership explains Galera's recovery concept in great detail on the documentation page, Node Failure and Recovery.

Node Failure and Eviction

In order to understand recovery, we have to first understand how Galera detects node failure and handles the eviction process. Let's put this into a controlled test scenario so we can understand the eviction process better. Suppose we have a three-node Galera Cluster as illustrated below:

The following command can be used to retrieve our Galera provider options:

mysql> SHOW VARIABLES LIKE 'wsrep_provider_options'\G

It's a long list, but we just need to focus on some of the parameters to explain the process:

evs.inactive_check_period = PT0.5S;
evs.inactive_timeout = PT15S;
evs.keepalive_period = PT1S;
evs.suspect_timeout = PT5S;
evs.view_forget_timeout = P1D;
gmcast.peer_timeout = PT3S;

First of all, Galera follows ISO 8601 formatting to represent duration. P1D means the duration is one day, while PT15S means the duration is 15 seconds (note the time designator, T, that precedes the time value). For example if one wanted to increase evs.view_forget_timeout to 1 day and a half, one would set P1DT12H, or PT36H.
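Provider options can also be changed through wsrep_provider_options at runtime, which is handy when experimenting with these timeouts. A sketch (note that not every evs.* option is dynamic; options that aren't must instead be set under wsrep_provider_options in my.cnf):

SET GLOBAL wsrep_provider_options = 'evs.view_forget_timeout=P1DT12H';
SHOW VARIABLES LIKE 'wsrep_provider_options'\G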

Assuming none of the hosts have been configured with firewall rules, we use the following script, called block_galera.sh, on galera2 to simulate a network failure to/from this node:

#!/bin/bash
# block_galera.sh
# galera2, 192.168.55.172
iptables -I INPUT -m tcp -p tcp --dport 4567 -j REJECT
iptables -I INPUT -m tcp -p tcp --dport 3306 -j REJECT
iptables -I OUTPUT -m tcp -p tcp --dport 4567 -j REJECT
iptables -I OUTPUT -m tcp -p tcp --dport 3306 -j REJECT

# print timestamp
date

By executing the script, we get the following output:

$ ./block_galera.sh
Wed Jul 4 16:46:02 UTC 2018

The reported timestamp can be considered as the start of the cluster partitioning, where we lose galera2, while galera1 and galera3 are still online and accessible. At this point, our Galera Cluster architecture is looking something like this:

From Partitioned Node Perspective

On galera2, you will see some printouts inside the MySQL error log. Let's break them out into several parts. The downtime started around 16:46:02 UTC and after gmcast.peer_timeout=PT3S, the following appears:

2018-07-04 16:46:05 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') connection to peer 8b2041d6 with addr tcp://192.168.55.173:4567 timed out, no messages seen in PT3S
2018-07-04 16:46:05 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.55.173:4567
2018-07-04 16:46:06 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') connection to peer 737422d6 with addr tcp://192.168.55.171:4567 timed out, no messages seen in PT3S
2018-07-04 16:46:06 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') reconnecting to 8b2041d6 (tcp://192.168.55.173:4567), attempt 0

Once evs.suspect_timeout = PT5S has passed, both galera1 and galera3 are suspected as dead by galera2:

2018-07-04 16:46:07 140454904243968 [Note] WSREP: evs::proto(62116b35, OPERATIONAL, view_id(REG,62116b35,54)) suspecting node: 8b2041d6
2018-07-04 16:46:07 140454904243968 [Note] WSREP: evs::proto(62116b35, OPERATIONAL, view_id(REG,62116b35,54)) suspected node without join message, declaring inactive
2018-07-04 16:46:07 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') reconnecting to 737422d6 (tcp://192.168.55.171:4567), attempt 0
2018-07-04 16:46:08 140454904243968 [Note] WSREP: evs::proto(62116b35, GATHER, view_id(REG,62116b35,54)) suspecting node: 737422d6
2018-07-04 16:46:08 140454904243968 [Note] WSREP: evs::proto(62116b35, GATHER, view_id(REG,62116b35,54)) suspected node without join message, declaring inactive

Then, Galera will revise the current cluster view and the position of this node:

2018-07-04 16:46:09 140454904243968 [Note] WSREP: view(view_id(NON_PRIM,62116b35,54) memb { 62116b35,0 } joined { } left { } partitioned { 737422d6,0 8b2041d6,0 })
2018-07-04 16:46:09 140454904243968 [Note] WSREP: view(view_id(NON_PRIM,62116b35,55) memb { 62116b35,0 } joined { } left { } partitioned { 737422d6,0 8b2041d6,0 })

With the new cluster view, Galera will perform quorum calculation to decide whether this node is part of the primary component. If the new component sees "primary = no", Galera will demote the local node state from SYNCED to OPEN:

2018-07-04 16:46:09 140454288942848 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2018-07-04 16:46:09 140454288942848 [Note] WSREP: Flow-control interval: [16, 16]
2018-07-04 16:46:09 140454288942848 [Note] WSREP: Trying to continue unpaused monitor
2018-07-04 16:46:09 140454288942848 [Note] WSREP: Received NON-PRIMARY.
2018-07-04 16:46:09 140454288942848 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 2753699)

With the latest change on the cluster view and node state, Galera returns the post-eviction cluster view and global state as below:

2018-07-04 16:46:09 140454222194432 [Note] WSREP: New cluster view: global state: 55238f52-41ee-11e8-852f-3316bdb654bc:2753699, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3
2018-07-04 16:46:09 140454222194432 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

You can see that the following global status values on galera2 changed during this period:

mysql> SELECT * FROM information_schema.global_status WHERE variable_name IN ('WSREP_CLUSTER_STATUS','WSREP_LOCAL_STATE_COMMENT','WSREP_CLUSTER_SIZE','WSREP_EVS_DELAYED','WSREP_READY');
+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------+
| VARIABLE_NAME             | VARIABLE_VALUE                                                                                                                    |
+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------+
| WSREP_CLUSTER_SIZE        | 1                                                                                                                                 |
| WSREP_CLUSTER_STATUS      | non-Primary                                                                                                                       |
| WSREP_EVS_DELAYED         | 737422d6-7db3-11e8-a2a2-bbe98913baf0:tcp://192.168.55.171:4567:1,8b2041d6-7f62-11e8-87d5-12a76678131f:tcp://192.168.55.173:4567:2 |
| WSREP_LOCAL_STATE_COMMENT | Initialized                                                                                                                       |
| WSREP_READY               | OFF                                                                                                                               |
+---------------------------+-----------------------------------------------------------------------------------------------------------------------------------+

At this point, the MySQL/MariaDB server on galera2 is still accessible (the database is listening on 3306 and Galera on 4567) and you can query the mysql system tables and list out the databases and tables. However, when you query a non-system table with something simple like this:

mysql> SELECT * FROM sbtest1;
ERROR 1047 (08S01): WSREP has not yet prepared node for application use

You will immediately get an error indicating WSREP is loaded but not ready for use on this node, as reported by the wsrep_ready status. This is because the node lost its connection to the Primary Component and entered the non-operational state (the local node status was changed from SYNCED to OPEN). Data reads from nodes in a non-operational state are considered stale unless you set wsrep_dirty_reads=ON to permit reads, although Galera still rejects any command that modifies or updates the database.
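A minimal sketch of the dirty-reads escape hatch mentioned above, run on the partitioned node (whether the variable is available at session or global scope depends on your Galera/MariaDB version):

SET GLOBAL wsrep_dirty_reads = ON;
SELECT * FROM sbtest1 LIMIT 1; -- now permitted, but possibly stale; writes are still rejected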

Finally, Galera will keep listening and reconnecting to other members in the background indefinitely:

2018-07-04 16:47:12 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') reconnecting to 8b2041d6 (tcp://192.168.55.173:4567), attempt 30
2018-07-04 16:47:13 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') reconnecting to 737422d6 (tcp://192.168.55.171:4567), attempt 30
2018-07-04 16:48:20 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') reconnecting to 8b2041d6 (tcp://192.168.55.173:4567), attempt 60
2018-07-04 16:48:22 140454904243968 [Note] WSREP: (62116b35, 'tcp://0.0.0.0:4567') reconnecting to 737422d6 (tcp://192.168.55.171:4567), attempt 60

The eviction process flow performed by Galera group communication for the partitioned node during a network issue can be summarized as follows:

  1. Disconnects from the cluster after gmcast.peer_timeout.
  2. Suspects other nodes after evs.suspect_timeout.
  3. Retrieves the new cluster view.
  4. Performs quorum calculation to determine the node's state.
  5. Demotes the node from SYNCED to OPEN.
  6. Attempts to reconnect to the primary component (other Galera nodes) in the background.
From Primary Component Perspective

On galera1 and galera3 respectively, after gmcast.peer_timeout=PT3S, the following appears in the MySQL error log:

2018-07-04 16:46:05 139955510687488 [Note] WSREP: (8b2041d6, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.55.172:4567
2018-07-04 16:46:06 139955510687488 [Note] WSREP: (8b2041d6, 'tcp://0.0.0.0:4567') reconnecting to 62116b35 (tcp://192.168.55.172:4567), attempt 0

Once evs.suspect_timeout = PT5S has passed, galera2 is suspected as dead by galera3 (and galera1):

2018-07-04 16:46:10 139955510687488 [Note] WSREP: evs::proto(8b2041d6, OPERATIONAL, view_id(REG,62116b35,54)) suspecting node: 62116b35
2018-07-04 16:46:10 139955510687488 [Note] WSREP: evs::proto(8b2041d6, OPERATIONAL, view_id(REG,62116b35,54)) suspected node without join message, declaring inactive

Galera checks whether the other nodes respond to group communication; on galera3, it finds galera1 in a primary and stable state:

2018-07-04 16:46:11 139955510687488 [Note] WSREP: declaring 737422d6 at tcp://192.168.55.171:4567 stable
2018-07-04 16:46:11 139955510687488 [Note] WSREP: Node 737422d6 state prim

Galera revises the cluster view of this node (galera3):

2018-07-04 16:46:11 139955510687488 [Note] WSREP: view(view_id(PRIM,737422d6,55) memb { 737422d6,0 8b2041d6,0 } joined { } left { } partitioned { 62116b35,0 })
2018-07-04 16:46:11 139955510687488 [Note] WSREP: save pc into disk

Galera then removes the partitioned node from the Primary Component:

2018-07-04 16:46:11 139955510687488 [Note] WSREP: forgetting 62116b35 (tcp://192.168.55.172:4567)

The new Primary Component now consists of two nodes, galera1 and galera3:

2018-07-04 16:46:11 139955502294784 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2

The Primary Component members will exchange states with each other to agree on the new cluster view and global state:

2018-07-04 16:46:11 139955502294784 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2018-07-04 16:46:11 139955510687488 [Note] WSREP: (8b2041d6, 'tcp://0.0.0.0:4567') turning message relay requesting off
2018-07-04 16:46:11 139955502294784 [Note] WSREP: STATE EXCHANGE: sent state msg: b3d38100-7f66-11e8-8e70-8e3bf680c993
2018-07-04 16:46:11 139955502294784 [Note] WSREP: STATE EXCHANGE: got state msg: b3d38100-7f66-11e8-8e70-8e3bf680c993 from 0 (192.168.55.171)
2018-07-04 16:46:11 139955502294784 [Note] WSREP: STATE EXCHANGE: got state msg: b3d38100-7f66-11e8-8e70-8e3bf680c993 from 1 (192.168.55.173)

Galera calculates and verifies the quorum of the state exchange between online members:

2018-07-04 16:46:11 139955502294784 [Note] WSREP: Quorum results: version = 4, component = PRIMARY, conf_id = 27, members = 2/2 (joined/total), act_id = 2753703, last_appl. = 2753606, protocols = 0/8/3 (gcs/repl/appl), group UUID = 55238f52-41ee-11e8-852f-3316bdb654bc
2018-07-04 16:46:11 139955502294784 [Note] WSREP: Flow-control interval: [23, 23]
2018-07-04 16:46:11 139955502294784 [Note] WSREP: Trying to continue unpaused monitor

Galera updates the new cluster view and global state after galera2 eviction:

2018-07-04 16:46:11 139955214169856 [Note] WSREP: New cluster view: global state: 55238f52-41ee-11e8-852f-3316bdb654bc:2753703, view# 28: Primary, number of nodes: 2, my index: 1, protocol version 3
2018-07-04 16:46:11 139955214169856 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-07-04 16:46:11 139955214169856 [Note] WSREP: REPL Protocols: 8 (3, 2)
2018-07-04 16:46:11 139955214169856 [Note] WSREP: Assign initial position for certification: 2753703, protocol version: 3
2018-07-04 16:46:11 139956691814144 [Note] WSREP: Service thread queue flushed.

Galera then cleans up the partitioned node (galera2) from the active list:

2018-07-04 16:46:14 139955510687488 [Note] WSREP: cleaning up 62116b35 (tcp://192.168.55.172:4567)

At this point, both galera1 and galera3 will be reporting similar global status:

mysql> SELECT * FROM information_schema.global_status WHERE variable_name IN ('WSREP_CLUSTER_STATUS','WSREP_LOCAL_STATE_COMMENT','WSREP_CLUSTER_SIZE','WSREP_EVS_DELAYED','WSREP_READY');
+---------------------------+------------------------------------------------------------------+
| VARIABLE_NAME             | VARIABLE_VALUE                                                    |
+---------------------------+------------------------------------------------------------------+
| WSREP_CLUSTER_SIZE        | 2                                                                 |
| WSREP_CLUSTER_STATUS      | Primary                                                           |
| WSREP_EVS_DELAYED         | 1491abd9-7f6d-11e8-8930-e269b03673d8:tcp://192.168.55.172:4567:1  |
| WSREP_LOCAL_STATE_COMMENT | Synced                                                            |
| WSREP_READY               | ON                                                                |
+---------------------------+------------------------------------------------------------------+

They list out the problematic member in the wsrep_evs_delayed status. Since the local state is "Synced", these nodes are operational and you can redirect the client connections from galera2 to any of them. If this step is inconvenient, consider using a load balancer sitting in front of the database to simplify the connection endpoint from the clients.

Node Recovery and Joining

A partitioned Galera node will keep attempting to establish a connection with the Primary Component indefinitely. Let's flush the iptables rules on galera2 to let it connect with the remaining nodes:

# on galera2
$ iptables -F

Once the node is capable of connecting to one of the nodes, Galera will start re-establishing the group communication automatically:

2018-07-09 10:46:34 140075962705664 [Note] WSREP: (1491abd9, 'tcp://0.0.0.0:4567') connection established to 8b2041d6 tcp://192.168.55.173:4567
2018-07-09 10:46:34 140075962705664 [Note] WSREP: (1491abd9, 'tcp://0.0.0.0:4567') connection established to 737422d6 tcp://192.168.55.171:4567
2018-07-09 10:46:34 140075962705664 [Note] WSREP: declaring 737422d6 at tcp://192.168.55.171:4567 stable
2018-07-09 10:46:34 140075962705664 [Note] WSREP: declaring 8b2041d6 at tcp://192.168.55.173:4567 stable

Node galera2 will then connect to one of the Primary Component members (in this case galera1, node ID 737422d6) to get the current cluster view and node states:

2018-07-09 10:46:34 140075962705664 [Note] WSREP: Node 737422d6 state prim
2018-07-09 10:46:34 140075962705664 [Note] WSREP: view(view_id(PRIM,1491abd9,142) memb { 1491abd9,0 737422d6,0 8b2041d6,0 } joined { } left { } partitioned { })
2018-07-09 10:46:34 140075962705664 [Note] WSREP: save pc into disk

Galera will then perform state exchange with the rest of the members that can form the Primary Component:

2018-07-09 10:46:34 140075954312960 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 3
2018-07-09 10:46:34 140075954312960 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 4b23eaa0-8322-11e8-a87e-fe4e0fce2a5f
2018-07-09 10:46:34 140075954312960 [Note] WSREP: STATE EXCHANGE: sent state msg: 4b23eaa0-8322-11e8-a87e-fe4e0fce2a5f
2018-07-09 10:46:34 140075954312960 [Note] WSREP: STATE EXCHANGE: got state msg: 4b23eaa0-8322-11e8-a87e-fe4e0fce2a5f from 0 (192.168.55.172)
2018-07-09 10:46:34 140075954312960 [Note] WSREP: STATE EXCHANGE: got state msg: 4b23eaa0-8322-11e8-a87e-fe4e0fce2a5f from 1 (192.168.55.171)
2018-07-09 10:46:34 140075954312960 [Note] WSREP: STATE EXCHANGE: got state msg: 4b23eaa0-8322-11e8-a87e-fe4e0fce2a5f from 2 (192.168.55.173)

The state exchange allows galera2 to calculate the quorum and produce the following result:

2018-07-09 10:46:34 140075954312960 [Note] WSREP: Quorum results: version = 4, component = PRIMARY, conf_id = 71, members = 2/3 (joined/total), act_id = 2836958, last_appl. = 0, protocols = 0/8/3 (gcs/repl/appl), group UUID = 55238f52-41ee-11e8-852f-3316bdb654bc

Galera will then promote the local node state from OPEN to PRIMARY and establish the node's connection to the Primary Component:

2018-07-09 10:46:34 140075954312960 [Note] WSREP: Flow-control interval: [28, 28]
2018-07-09 10:46:34 140075954312960 [Note] WSREP: Trying to continue unpaused monitor
2018-07-09 10:46:34 140075954312960 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 2836958)

As reported by the above line, Galera calculates the gap showing how far the node is behind the cluster. This node requires a state transfer to catch up from writeset number 2761994 to 2836958:

2018-07-09 10:46:34 140075929970432 [Note] WSREP: State transfer required:
        Group state: 55238f52-41ee-11e8-852f-3316bdb654bc:2836958
        Local state: 55238f52-41ee-11e8-852f-3316bdb654bc:2761994
2018-07-09 10:46:34 140075929970432 [Note] WSREP: New cluster view: global state: 55238f52-41ee-11e8-852f-3316bdb654bc:2836958, view# 72: Primary, number of nodes: 3, my index: 0, protocol version 3
2018-07-09 10:46:34 140075929970432 [Warning] WSREP: Gap in state sequence. Need state transfer.
2018-07-09 10:46:34 140075929970432 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-07-09 10:46:34 140075929970432 [Note] WSREP: REPL Protocols: 8 (3, 2)
2018-07-09 10:46:34 140075929970432 [Note] WSREP: Assign initial position for certification: 2836958, protocol version: 3

Galera prepares the IST listener on port 4568 on this node and asks any Synced node in the cluster to become a donor. In this case, Galera automatically picks galera3 (192.168.55.173), or it could also pick a donor from the list under wsrep_sst_donor (if defined) for the syncing operation:

2018-07-09 10:46:34 140075996276480 [Note] WSREP: Service thread queue flushed.
2018-07-09 10:46:34 140075929970432 [Note] WSREP: IST receiver addr using tcp://192.168.55.172:4568
2018-07-09 10:46:34 140075929970432 [Note] WSREP: Prepared IST receiver, listening at: tcp://192.168.55.172:4568
2018-07-09 10:46:34 140075954312960 [Note] WSREP: Member 0.0 (192.168.55.172) requested state transfer from '*any*'. Selected 2.0 (192.168.55.173)(SYNCED) as donor.

It will then change the local node state from PRIMARY to JOINER. At this stage, galera2 is granted the state transfer request and starts to cache write-sets:

2018-07-09 10:46:34 140075954312960 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 2836958)
2018-07-09 10:46:34 140075929970432 [Note] WSREP: Requesting state transfer: success, donor: 2
2018-07-09 10:46:34 140075929970432 [Note] WSREP: GCache history reset: 55238f52-41ee-11e8-852f-3316bdb654bc:2761994 -> 55238f52-41ee-11e8-852f-3316bdb654bc:2836958
2018-07-09 10:46:34 140075929970432 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): full reset

Node galera2 starts receiving the missing writesets from the selected donor's gcache (galera3):

2018-07-09 10:46:34 140075954312960 [Note] WSREP: 2.0 (192.168.55.173): State transfer to 0.0 (192.168.55.172) complete.
2018-07-09 10:46:34 140075929970432 [Note] WSREP: Receiving IST: 74964 writesets, seqnos 2761994-2836958
2018-07-09 10:46:34 140075593627392 [Note] WSREP: Receiving IST... 0.0% ( 0/74964 events) complete.
2018-07-09 10:46:34 140075954312960 [Note] WSREP: Member 2.0 (192.168.55.173) synced with group.
2018-07-09 10:46:34 140075962705664 [Note] WSREP: (1491abd9, 'tcp://0.0.0.0:4567') connection established to 737422d6 tcp://192.168.55.171:4567
2018-07-09 10:46:41 140075962705664 [Note] WSREP: (1491abd9, 'tcp://0.0.0.0:4567') turning message relay requesting off
2018-07-09 10:46:44 140075593627392 [Note] WSREP: Receiving IST... 36.0% (27008/74964 events) complete.
2018-07-09 10:46:54 140075593627392 [Note] WSREP: Receiving IST... 71.6% (53696/74964 events) complete.
2018-07-09 10:47:02 140075593627392 [Note] WSREP: Receiving IST...100.0% (74964/74964 events) complete.
2018-07-09 10:47:02 140075929970432 [Note] WSREP: IST received: 55238f52-41ee-11e8-852f-3316bdb654bc:2836958
2018-07-09 10:47:02 140075954312960 [Note] WSREP: 0.0 (192.168.55.172): State transfer from 2.0 (192.168.55.173) complete.

Once all the missing writesets are received and applied, Galera will promote galera2 to JOINED, up to seqno 2837012:

2018-07-09 10:47:02 140075954312960 [Note] WSREP: Shifting JOINER -> JOINED (TO: 2837012)
2018-07-09 10:47:02 140075954312960 [Note] WSREP: Member 0.0 (192.168.55.172) synced with group.

The node applies any cached writesets in its slave queue and finishes catching up with the cluster. Its slave queue is now empty. Galera will promote galera2 to SYNCED, indicating the node is now operational and ready to serve clients:

2018-07-09 10:47:02 140075954312960 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 2837012)
2018-07-09 10:47:02 140076605892352 [Note] WSREP: Synchronized with group, ready for connections

At this point, all nodes are back in operation. You can verify with the following statement on galera2:

mysql> SELECT * FROM information_schema.global_status WHERE variable_name IN ('WSREP_CLUSTER_STATUS','WSREP_LOCAL_STATE_COMMENT','WSREP_CLUSTER_SIZE','WSREP_EVS_DELAYED','WSREP_READY');
+---------------------------+----------------+
| VARIABLE_NAME             | VARIABLE_VALUE |
+---------------------------+----------------+
| WSREP_CLUSTER_SIZE        | 3              |
| WSREP_CLUSTER_STATUS      | Primary        |
| WSREP_EVS_DELAYED         |                |
| WSREP_LOCAL_STATE_COMMENT | Synced         |
| WSREP_READY               | ON             |
+---------------------------+----------------+

wsrep_cluster_size is reported as 3 and the cluster status is Primary, indicating galera2 is part of the Primary Component. wsrep_evs_delayed has also been cleared and the local state is now Synced.

The recovery process flow for the partitioned node during a network issue can be summarized as follows:

  1. Re-establishes group communication to other nodes.
  2. Retrieves the cluster view from one of the Primary Component members.
  3. Performs state exchange with the Primary Component and calculates the quorum.
  4. Changes the local node state from OPEN to PRIMARY.
  5. Calculates the gap between local node and the cluster.
  6. Changes the local node state from PRIMARY to JOINER.
  7. Prepares IST listener/receiver on port 4568.
  8. Requests state transfer via IST and picks a donor.
  9. Starts receiving and applying the missing writesets from the chosen donor's gcache.
  10. Changes the local node state from JOINER to JOINED.
  11. Catches up with the cluster by applying the cached writesets in the slave queue.
  12. Changes the local node state from JOINED to SYNCED.
Cluster Failure

A Galera Cluster is considered failed if no primary component (PC) is available. Consider a similar three-node Galera Cluster as depicted in the diagram below:

A cluster is considered operational if all nodes, or a majority of them, are online. Online means they can see each other through Galera's replication traffic or group communication. If no traffic is coming in and out from a node, the cluster sends a heartbeat beacon for the node to respond to in a timely manner. Otherwise, the node will be put into the delayed or suspected list according to how it responds.

If a node goes down, let's say node C, the cluster remains operational because nodes A and B are still in quorum with 2 votes out of 3 to form a Primary Component. You should get the following cluster state on A and B:

mysql> SHOW STATUS LIKE 'wsrep_cluster_status';
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| wsrep_cluster_status | Primary |
+----------------------+---------+

Now suppose the primary switch goes kaput, as illustrated in the following diagram:

At this point, every single node loses communication with the others, and the cluster state will be reported as non-Primary on all nodes (as happened to galera2 in the previous case). Every node calculates the quorum and finds that it is in the minority (1 vote out of 3), thus losing quorum; no Primary Component is formed and consequently all nodes refuse to serve any data. This is deemed a cluster failure.

Once the network issue is resolved, Galera will automatically re-establish communication between members, exchange node states and determine whether the primary component can be re-formed by comparing node states, UUIDs and seqnos. If it can, Galera will merge the primary components, as shown in the following lines:

2018-06-27 0:16:57 140203784476416 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 2, memb_num = 3
2018-06-27 0:16:57 140203784476416 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2018-06-27 0:16:57 140203784476416 [Note] WSREP: STATE EXCHANGE: sent state msg: 5885911b-795c-11e8-8683-931c85442c7e
2018-06-27 0:16:57 140203784476416 [Note] WSREP: STATE EXCHANGE: got state msg: 5885911b-795c-11e8-8683-931c85442c7e from 0 (192.168.55.171)
2018-06-27 0:16:57 140203784476416 [Note] WSREP: STATE EXCHANGE: got state msg: 5885911b-795c-11e8-8683-931c85442c7e from 1 (192.168.55.172)
2018-06-27 0:16:57 140203784476416 [Note] WSREP: STATE EXCHANGE: got state msg: 5885911b-795c-11e8-8683-931c85442c7e from 2 (192.168.55.173)
2018-06-27 0:16:57 140203784476416 [Warning] WSREP: Quorum: No node with complete state:

        Version      : 4
        Flags        : 0x3
        Protocols    : 0 / 8 / 3
        State        : NON-PRIMARY
        Desync count : 0
        Prim state   : SYNCED
        Prim UUID    : 5224a024-791b-11e8-a0ac-8bc6118b0f96
        Prim seqno   : 5
        First seqno  : 112714
        Last seqno   : 112725
        Prim JOINED  : 3
        State UUID   : 5885911b-795c-11e8-8683-931c85442c7e
        Group UUID   : 55238f52-41ee-11e8-852f-3316bdb654bc
        Name         : '192.168.55.171'
        Incoming addr: '192.168.55.171:3306'

        Version      : 4
        Flags        : 0x2
        Protocols    : 0 / 8 / 3
        State        : NON-PRIMARY
        Desync count : 0
        Prim state   : SYNCED
        Prim UUID    : 5224a024-791b-11e8-a0ac-8bc6118b0f96
        Prim seqno   : 5
        First seqno  : 112714
        Last seqno   : 112725
        Prim JOINED  : 3
        State UUID   : 5885911b-795c-11e8-8683-931c85442c7e
        Group UUID   : 55238f52-41ee-11e8-852f-3316bdb654bc
        Name         : '192.168.55.172'
        Incoming addr: '192.168.55.172:3306'

        Version      : 4
        Flags        : 0x2
        Protocols    : 0 / 8 / 3
        State        : NON-PRIMARY
        Desync count : 0
        Prim state   : SYNCED
        Prim UUID    : 5224a024-791b-11e8-a0ac-8bc6118b0f96
        Prim seqno   : 5
        First seqno  : 112714
        Last seqno   : 112725
        Prim JOINED  : 3
        State UUID   : 5885911b-795c-11e8-8683-931c85442c7e
        Group UUID   : 55238f52-41ee-11e8-852f-3316bdb654bc
        Name         : '192.168.55.173'
        Incoming addr: '192.168.55.173:3306'

2018-06-27 0:16:57 140203784476416 [Note] WSREP: Full re-merge of primary 5224a024-791b-11e8-a0ac-8bc6118b0f96 found: 3 of 3.
2018-06-27 0:16:57 140203784476416 [Note] WSREP: Quorum results: version = 4, component = PRIMARY, conf_id = 5, members = 3/3 (joined/total), act_id = 112725, last_appl. = 112722, protocols = 0/8/3 (gcs/repl/appl), group UUID = 55238f52-41ee-11e8-852f-3316bdb654bc
2018-06-27 0:16:57 140203784476416 [Note] WSREP: Flow-control interval: [28, 28]
2018-06-27 0:16:57 140203784476416 [Note] WSREP: Trying to continue unpaused monitor
2018-06-27 0:16:57 140203784476416 [Note] WSREP: Restored state OPEN -> SYNCED (112725)
2018-06-27 0:16:57 140202564110080 [Note] WSREP: New cluster view: global state: 55238f52-41ee-11e8-852f-3316bdb654bc:112725, view# 6: Primary, number of nodes: 3, my index: 2, protocol version 3

A good indicator that the re-bootstrapping process succeeded is the following line in the error log:

[Note] WSREP: Synchronized with group, ready for connections

ClusterControl Auto Recovery

ClusterControl comes with node and cluster automatic recovery features, because it oversees and understands the state of all nodes in the cluster. Automatic recovery is enabled by default if the cluster is deployed using ClusterControl. To enable or disable it, simply click the power icon in the summary bar as shown below:

A green icon means automatic recovery is turned on, while red means it is off. You can monitor the recovery progress from the Activity -> Jobs dialog. In this case, galera2 was totally inaccessible due to the firewall block, forcing ClusterControl to report the following:


The recovery process only commences after a graceful timeout (30 seconds), to give the Galera node a chance to recover itself beforehand. If ClusterControl fails to recover a node or cluster, it will first pull all MySQL error logs from all accessible nodes and raise the necessary alarms to notify the user via email or by pushing critical events to third-party integration modules like PagerDuty, VictorOps or Slack. Manual intervention is then required. For Galera Cluster, ClusterControl will keep trying to recover from the failure until you mark the node as under maintenance or disable the automatic recovery feature.

ClusterControl's automatic recovery is one of our users' favorite features. It helps you take the necessary actions quickly, with a complete report on what was attempted and recommended steps for further troubleshooting. Users with support subscriptions can look for extra hands by escalating the issue to our technical support team for assistance.

Conclusion

Galera's automatic node recovery and membership control are neat features that simplify cluster management, improve database reliability and reduce the risk of human error, which commonly haunts other open-source database replication technologies like MySQL Replication, Group Replication and PostgreSQL Streaming/Logical Replication.


Data Integrity and Performance Considerations in MySQL Semisynchronous Replication

MySQL semisynchronous replication provides improved data integrity because when a commit returns successfully, it’s known that the data exists in at least two places – the master and its slave. In this blog post, we review some of the MySQL configurations that influence the data integrity and performance aspects of semisynchronous replication. We’ll be using the InnoDB storage engine and GTID-based replication in a 3-node replica set (master and 2 slaves), which ensures there is redundancy in the slaves. This means that if there are issues with one slave, we can fall back on the other.

Configurations Applicable to Both Master and Slave Nodes

These configurations guarantee high durability and consistency settings for data. That is, each committed transaction is guaranteed to be present in the binary logs, and the logs are flushed to disk. Hence, in the case of a power failure or operating system crash, the data consistency of MySQL is always preserved.
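A sketch of the standard durability settings that provide the behavior described above (these are the common fully-durable values, not necessarily the author's exact configuration):

# my.cnf, on both master and slaves
# flush the InnoDB redo log to disk at each transaction commit
innodb_flush_log_at_trx_commit = 1
# sync the binary log to disk at each commit
sync_binlog = 1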

Configurations on the Master Node

The rpl_semi_sync_master_wait_for_slave_count option is used to configure the number of slaves that must send an acknowledgment before a semisynchronous master can commit the transaction. In the 3-node replica set, we recommend setting this to 1, so that we always have an assurance that the data is available in at least one slave while avoiding the performance impact involved in waiting for acknowledgment from both slaves.

The rpl_semi_sync_master_timeout option is used to configure the amount of time that a semisynchronous master waits for slave acknowledgment before switching back to asynchronous mode. We recommend setting this to a large number so that there is no fallback to asynchronous mode, which would defeat our data integrity objective. Since we’re operating with 2 slaves and rpl_semi_sync_master_wait_for_slave_count is set to 1, we can assume that at least one of the slaves does acknowledge within a reasonable amount of time, thereby minimizing the performance impact of this setting.
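Putting the two recommendations together, a sketch of the master-side settings (the timeout value, in milliseconds, is an arbitrary large illustrative number):

# my.cnf on the master (assumes the semisync master plugin is installed)
rpl_semi_sync_master_enabled = 1
rpl_semi_sync_master_wait_for_slave_count = 1
# large enough that we effectively never fall back to asynchronous mode
rpl_semi_sync_master_timeout = 1000000000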

Configurations on the Slave Nodes

On the slaves, it’s always important to track two positions very accurately: the current executed position of the SQL thread in the relay log, and the current position of the IO thread, which indicates how far the master binary log has been read and copied to the slave. The consequences of not maintaining these positions are quite obvious. If there’s a slave crash and restart, the SQL thread can start processing transactions from a wrong offset, or the IO thread can start pulling data from a wrong position in the master binary logs. Both cases will lead to data corruption.

It is important to ensure crash-safety of slaves through the following configurations:

Setting relay_log_info_repository to TABLE will ensure the position of the SQL thread is updated together with each transaction commit on the slave. However, it’s difficult to maintain the exact position of the IO thread and flush it to disk, because reading the master binary log and writing to the slave relay log is not based on transactions. The impact on performance would be very high if the IO thread position had to be updated and flushed to disk after each write to the slave relay logs. A more elegant solution is to set relay_log_recovery = ON, in which case, after a MySQL restart, the current relay logs are assumed to be corrupted and are freshly pulled from the master based on the SQL thread position.
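A sketch of the resulting slave-side crash-safety settings (master_info_repository = TABLE is a common companion setting, added here as an assumption rather than something the post prescribes):

# my.cnf on each slave
relay_log_info_repository = TABLE
relay_log_recovery = ON
# commonly paired with the above so the IO thread metadata also lives in a table
master_info_repository = TABLE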

Last but not least, it’s important to note that semisynchronous replication only ensures that the data has ‘reached’ one of the slaves before the master commits the transaction; it does not mean that the transactions are committed on the slave. Hence, it is good to ensure that the SQL thread performs well. In the ideal case, the SQL thread moves hand in hand with the IO thread, so we get the benefit of the slave not only receiving transactions, but also committing them. It’s recommended to go with a multi-threaded slave configuration for increased slave SQL thread performance. The important configurations for multi-threaded slaves are sketched below:
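A sketch of the typical MySQL 5.7 multi-threaded slave settings consistent with this description (the worker count is illustrative; tune it to your workload):

# my.cnf on each slave
slave_parallel_type = LOGICAL_CLOCK
slave_parallel_workers = 4
# keep the slave's commit order identical to the master's
# (requires log_bin and log_slave_updates)
slave_preserve_commit_order = 1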

The above configurations provide parallelism in the slave while preserving the order of transactions as seen on the master.

In summary, by using the above configurations on our MySQL replica set, we’re able to maintain high data integrity along with optimal performance.

As always, if you have any questions, leave us a comment, reach out to us at @scalegridio on Twitter, or send us an email at support@scalegrid.io.

AMD EPYC Performance Testing… or Don’t get on the wrong side of SystemD

Ever since AMD released their EPYC CPU for servers I wanted to test it, but I did not have the opportunity until recently, when Packet.net started offering bare metal servers for a reasonable price. So I started a couple of instances to test Percona Server for MySQL under this CPU. In this benchmark, I discovered some interesting discrepancies in performance between AMD and Intel CPUs when running under systemd.

The set up

To test CPU performance, I used a read-only in-memory sysbench OLTP benchmark, as it burns CPU cycles and no IO is performed by Percona Server.
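For reproducibility, here is a sketch of a comparable sysbench 1.0 invocation; the connection details, table count, table size and run duration are my assumptions, not the exact command used in the post:

# read-only OLTP, data sized to fit in the buffer pool
sysbench oltp_read_only \
  --mysql-host=127.0.0.1 --mysql-user=sbtest --mysql-password=sbtest \
  --tables=8 --table-size=10000000 \
  --threads=48 --time=300 --report-interval=10 \
  run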

For this benchmark I used Packet.net c2.medium.x86 instances powered by AMD EPYC 7401P processors. The OS is exposed to 48 CPU threads.

For the OS I tried

  • Ubuntu 16.04 with default kernel 4.4 and upgraded to 4.15
  • Ubuntu 18.04 with kernel 4.15
  • Percona Server started from SystemD and without SystemD (for reasons which will become apparent later)

To have some points for comparison, I also ran a similar workload on my 2-socket Intel server, with CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz. I recognize this is not the most recent Intel CPU, but it was the best I had at the time, and it also gave 48 CPU threads.

Ubuntu 16

First, let’s review the results for Ubuntu 16

Or in tabular format:

Threads | Ubuntu 16, kernel 4.4, systemd | Ubuntu 16, kernel 4.4, NO systemd | Ubuntu 16, kernel 4.15
1 | 943.44 | 948.7 | 899.82
2 | 1858.58 | 1792.36 | 1774.45
4 | 3533.2 | 3424.05 | 3555.94
8 | 6762.35 | 6731.57 | 7010.51
12 | 10012.18 | 9950.3 | 10062.82
16 | 13063.39 | 13043.55 | 12893.17
20 | 15347.68 | 15347.56 | 14756.27
24 | 16886.24 | 16864.81 | 16176.76
30 | 18150.2 | 18160.07 | 17860.5
36 | 18923.06 | 18811.64 | 19528.27
42 | 19374.86 | 19463.08 | 21537.79
48 | 20110.81 | 19983.05 | 23388.18
56 | 20548.51 | 20362.31 | 23768.49
64 | 20860.51 | 20729.52 | 23797.14
72 | 21123.71 | 21001.06 | 23645.95
80 | 21370 | 21191.24 | 23546.03
90 | 21622.54 | 21441.73 | 23486.29
100 | 21806.67 | 21670.38 | 23392.72
128 | 22161.42 | 22031.53 | 23315.33
192 | 22388.51 | 22207.26 | 22906.42
256 | 22091.17 | 21943.37 | 22305.06
512 | 19524.41 | 19381.69 | 19181.71


There are a few conclusions we can draw from this data:

  1. AMD EPYC CPU scales quite well to the number of CPU Threads
  2. The recent kernel helps to boost the throughput.
Ubuntu 18.04

Now, let’s review the results for Ubuntu 18.04

Threads | Ubuntu 18, systemd | Ubuntu 18, NO systemd
1 | 833.14 | 843.68
2 | 1684.21 | 1693.93
4 | 3346.42 | 3359.82
8 | 6592.88 | 6597.48
12 | 9477.92 | 9487.93
16 | 12117.12 | 12149.17
20 | 13934.27 | 13933
24 | 15265.1 | 15152.74
30 | 16846.02 | 16061.16
36 | 18488.88 | 16726.14
42 | 20493.57 | 17360.56
48 | 22217.47 | 17906.4
56 | 22564.4 | 17931.83
64 | 22590.29 | 17902.95
72 | 22472.75 | 17857.73
80 | 22421.99 | 17766.76
90 | 22300.09 | 17773.57
100 | 22275.24 | 17646.7
128 | 22131.86 | 17411.55
192 | 21750.8 | 17134.63
256 | 21177.25 | 16826.53
512 | 18296.61 | 17418.72


This is where the result surprised me: on Ubuntu 18.04 with SystemD, running Percona Server for MySQL as a service gave up to 24% better throughput than starting Percona Server for MySQL from a bash shell. I do not know exactly what causes this dramatic difference. Systemd uses different slices for services and user commands, and somehow this affects the performance.

Baseline benchmark

To establish a baseline, I ran the same benchmark on my Intel box, running Ubuntu 16, and I tried two kernels: 4.13 and 4.15

Threads | Ubuntu 16, kernel 4.13, systemd | Ubuntu 16, kernel 4.15, systemd | Ubuntu 16, kernel 4.15, NO systemd
1 | 820.07 | 798.42 | 864.21
2 | 1563.31 | 1609.96 | 1681.91
4 | 2929.63 | 3186.01 | 3338.47
8 | 6075.73 | 6279.49 | 6624.49
12 | 8743.38 | 9256.18 | 9622.6
16 | 10580.14 | 11351.31 | 11984.64
20 | 12790.96 | 12599.78 | 14147.1
24 | 14213.68 | 14659.49 | 15716.61
30 | 15983.78 | 16096.03 | 17530.06
36 | 17574.46 | 18098.36 | 20085.9
42 | 18671.14 | 19808.92 | 21875.84
48 | 19431.05 | 22036.06 | 23986.08
56 | 19737.92 | 22115.34 | 24275.72
64 | 19946.57 | 21457.32 | 24054.09
72 | 20129.7 | 21729.78 | 24167.03
80 | 20214.93 | 21594.51 | 24092.86
90 | 20194.78 | 21195.61 | 23945.93
100 | 20753.44 | 21597.26 | 23802.16
128 | 20235.24 | 20684.34 | 23476.82
192 | 20280.52 | 20431.18 | 23108.36
256 | 20410.55 | 20952.64 | 22775.63
512 | 20953.73 | 22079.18 | 23489.3


Here we see the opposite result with SystemD: Percona Server running from a bash shell shows better throughput than the SystemD service. So for some reason, systemd works differently for AMD and Intel CPUs. Please let me know if you have any ideas on how to deal with the impact that systemd has on performance.

Conclusions

So there are some conclusions from these results:

  1. AMD EPYC shows a decent performance scalability; the new kernel helps to improve it
  2. systemd shows different effects on throughput for AMD and Intel CPUs
  3. With AMD the throughput declines for a high concurrent workload with 512 threads, while Intel does not show a decline.

The post AMD EPYC Performance Testing… or Don’t get on the wrong side of SystemD appeared first on Percona Database Performance Blog.

The best way to use SQL functions in JPQL or Criteria API queries with JPA and Hibernate

Introduction When executing an entity query (e.g. JPQL, HQL or Criteria API), you can use any SQL function without having to register it as long as the function is passed directly to the WHERE clause of the underlying SQL statement. However, if the SQL function is used in the SELECT clause, and Hibernate has not …

The post The best way to use SQL functions in JPQL or Criteria API queries with JPA and Hibernate appeared first on Vlad Mihalcea.

What is MySQL partitioning ?

MySQL partitioning distributes the data of individual tables (typically we recommend partitioning large, I/O-heavy tables for performance, scalability and manageability) across multiple files based on a partition strategy / rules. In very simple terms, different portions of a table are stored as separate tables in different locations to distribute I/O optimally. The user-defined division of data by some rule is known as the partition function; in MySQL we partition data by a RANGE of values, a LIST of values, an internal hashing function or a linear hashing function. Restricting the query's examination to the selected partitions that match its rows increases query performance by multiple times compared to the same query on a non-partitioned table. This methodology is also called partition pruning (trimming of unwanted partitions). Please find below an example of partition pruning:

CREATE TABLE tab1 (
    col1 VARCHAR(30) NOT NULL,
    col2 VARCHAR(30) NOT NULL,
    col3 TINYINT UNSIGNED NOT NULL,
    col4 DATE NOT NULL
)
PARTITION BY RANGE( col3 ) (
    PARTITION p0 VALUES LESS THAN (100),
    PARTITION p1 VALUES LESS THAN (200),
    PARTITION p2 VALUES LESS THAN (300),
    PARTITION p3 VALUES LESS THAN MAXVALUE
);

A SELECT query benefiting from partition pruning:

SELECT col1, col2, col3, col4 FROM tab1 WHERE col3 > 200 AND col3 < 250;
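To confirm the pruning, you can ask the optimizer which partitions it will touch. A sketch (EXPLAIN PARTITIONS is the MySQL 5.6 syntax; in 5.7+, plain EXPLAIN prints the partitions column by default, and it should list only p2 for this query):

mysql> EXPLAIN PARTITIONS SELECT col1, col2, col3, col4 FROM tab1 WHERE col3 > 200 AND col3 < 250\G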

What is explicit partitioning in MySQL and how is it different from partition pruning?

In MySQL we can explicitly select partitions and subpartitions when executing a statement matching a given WHERE condition. This sounds very similar to partition pruning, but there is a difference:

  • The partitions to be checked are explicitly mentioned in the query statement; in partition pruning the selection is automatic.
  • Explicit selection of partitions is supported for both queries and DML statements; partition pruning applies only to queries.
  • SQL statements supported in explicit partitioning: SELECT, INSERT, UPDATE, DELETE, LOAD DATA, LOAD XML and REPLACE

Explicit partition example:

CREATE TABLE customer (
    cust_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    cust_fname VARCHAR(25) NOT NULL,
    cust_lname VARCHAR(25) NOT NULL,
    cust_phone INT NOT NULL,
    cust_fax INT NOT NULL
)
PARTITION BY RANGE(cust_id) (
    PARTITION p0 VALUES LESS THAN (100),
    PARTITION p1 VALUES LESS THAN (200),
    PARTITION p2 VALUES LESS THAN (300),
    PARTITION p3 VALUES LESS THAN MAXVALUE
);

A query explicitly mentioning a partition:

mysql> SELECT * FROM customer PARTITION (p1);

RANGE partitioning
In RANGE partitioning you partition rows whose column values fall within a given range. Ranges should be contiguous but not overlapping, and are usually defined with the VALUES LESS THAN operator. The following examples explain how to create and use RANGE partitioning for MySQL performance:

CREATE TABLE customer_contract(
    cust_id INT NOT NULL,
    cust_fname VARCHAR(30),
    cust_lname VARCHAR(30),
    st_dt DATE NOT NULL DEFAULT '1970-01-01',
    end_dt DATE NOT NULL DEFAULT '9999-12-31',
    contract_code INT NOT NULL,
    contract_id INT NOT NULL
)
PARTITION BY RANGE (contract_id) (
    PARTITION p0 VALUES LESS THAN (50),
    PARTITION p1 VALUES LESS THAN (100),
    PARTITION p2 VALUES LESS THAN (150),
    PARTITION p3 VALUES LESS THAN (200)
);

For example, suppose that you wish to partition based on the year the contract ended:

CREATE TABLE customer_contract(
    cust_id INT NOT NULL,
    cust_fname VARCHAR(30),
    cust_lname VARCHAR(30),
    st_dt DATE NOT NULL DEFAULT '1970-01-01',
    end_dt DATE NOT NULL DEFAULT '9999-12-31',
    contract_code INT NOT NULL,
    contract_id INT NOT NULL
)
PARTITION BY RANGE (year(end_dt)) (
    PARTITION p0 VALUES LESS THAN (2001),
    PARTITION p1 VALUES LESS THAN (2002),
    PARTITION p2 VALUES LESS THAN (2003),
    PARTITION p3 VALUES LESS THAN (2004)
);

It is also possible to partition a table by RANGE, based on the value of a TIMESTAMP column, using the UNIX_TIMESTAMP() function, as shown in this example:

CREATE TABLE sales_forecast (
    sales_forecast_id INT NOT NULL,
    sales_forecast_status VARCHAR(20) NOT NULL,
    sales_forecast_updated TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
)
PARTITION BY RANGE ( UNIX_TIMESTAMP(sales_forecast_updated) ) (
    PARTITION p0 VALUES LESS THAN ( UNIX_TIMESTAMP('2008-01-01 00:00:00') ),
    PARTITION p1 VALUES LESS THAN ( UNIX_TIMESTAMP('2008-04-01 00:00:00') ),
    PARTITION p2 VALUES LESS THAN ( UNIX_TIMESTAMP('2008-07-01 00:00:00') ),
    PARTITION p3 VALUES LESS THAN ( UNIX_TIMESTAMP('2008-10-01 00:00:00') ),
    PARTITION p4 VALUES LESS THAN ( UNIX_TIMESTAMP('2009-01-01 00:00:00') ),
    PARTITION p5 VALUES LESS THAN ( UNIX_TIMESTAMP('2009-04-01 00:00:00') ),
    PARTITION p6 VALUES LESS THAN ( UNIX_TIMESTAMP('2009-07-01 00:00:00') ),
    PARTITION p7 VALUES LESS THAN ( UNIX_TIMESTAMP('2009-10-01 00:00:00') ),
    PARTITION p8 VALUES LESS THAN ( UNIX_TIMESTAMP('2010-01-01 00:00:00') ),
    PARTITION p9 VALUES LESS THAN (MAXVALUE)
);

LIST partitioning
The difference between RANGE and LIST partitioning is that in LIST partitioning, each partition is grouped on a selected list of values of a specific column. You do this with PARTITION BY LIST (expr), where expr is the selected column expression for the list partition. LIST partitioning is explained with the example below:

CREATE TABLE students (
  student_id INT NOT NULL,
  student_fname VARCHAR(30),
  student_lname VARCHAR(30),
  student_joined DATE NOT NULL DEFAULT '1970-01-01',
  student_separated DATE NOT NULL DEFAULT '9999-12-31',
  student_house INT,
  student_grade_id INT
)
PARTITION BY LIST(student_grade_id) (
  PARTITION P1 VALUES IN (1,2,3,4),
  PARTITION P2 VALUES IN (5,6,7),
  PARTITION P3 VALUES IN (8,9,10),
  PARTITION P4 VALUES IN (11,12)
);
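Keep in mind that a row whose list column value does not appear in any partition definition is rejected outright. A quick illustration against the students table above (grade 13 is a hypothetical, unlisted value):

INSERT INTO students (student_id, student_grade_id) VALUES (1, 13);
-- ERROR 1526 (HY000): Table has no partition for value 13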

HASH partitioning
HASH partitioning distributes data evenly among a predetermined number of partitions. In RANGE and LIST partitioning you must explicitly define the partitioning logic, i.e. in which partition a given column value or set of column values is stored; in HASH partitioning MySQL takes care of this. The following example explains HASH partitioning better:

CREATE TABLE store (
  store_id INT NOT NULL,
  store_name VARCHAR(30),
  store_location VARCHAR(30),
  store_started DATE NOT NULL DEFAULT '1997-01-01',
  store_code INT
)
PARTITION BY HASH(store_id)
PARTITIONS 4;

P.S. – if you do not include a PARTITIONS clause, the number of partitions defaults to 1.
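With plain HASH partitioning on an integer expression, the target partition is simply MOD(expression, number_of_partitions), so store_id 7 above lands in p3 (7 MOD 4 = 3). To see how the rows actually spread across partitions, you can query INFORMATION_SCHEMA (table name as created above):

SELECT PARTITION_NAME, TABLE_ROWS
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'store';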

LINEAR HASH partitioning
LINEAR HASH partitioning utilizes a linear powers-of-two algorithm, whereas regular HASH partitioning employs the modulus of the hashing function’s value. Please find a LINEAR HASH partitioning example below:

CREATE TABLE store (
  store_id INT NOT NULL,
  store_name VARCHAR(30),
  store_location VARCHAR(30),
  store_started DATE NOT NULL DEFAULT '1997-01-01',
  store_code INT
)
PARTITION BY LINEAR HASH( YEAR(store_started) )
PARTITIONS 4;

KEY partitioning
KEY partitioning is very similar to HASH; the only difference is that the hashing function for KEY partitioning is supplied by MySQL itself. In the case of MySQL NDB Cluster, MD5() is used; for tables using other storage engines, the MySQL server uses its own internal hashing function, which is based on the same algorithm as PASSWORD().

CREATE TABLE contact(
  id INT NOT NULL,
  name VARCHAR(20),
  contact_number INT,
  email VARCHAR(50),
  UNIQUE KEY (id)
)
PARTITION BY KEY()
PARTITIONS 5;

P.S. – if the unique key column were not defined as NOT NULL, then the previous statement would fail.

Subpartitioning
Subpartitioning is also known as composite partitioning. You can partition a table combining RANGE and HASH for better results. The example below explains subpartitioning better:

CREATE TABLE purchase (id INT, item VARCHAR(30), purchase_date DATE)
PARTITION BY RANGE( YEAR(purchase_date) )
SUBPARTITION BY HASH( TO_DAYS(purchase_date) )
SUBPARTITIONS 2 (
  PARTITION p0 VALUES LESS THAN (2000),
  PARTITION p1 VALUES LESS THAN (2010),
  PARTITION p2 VALUES LESS THAN MAXVALUE
);

It is also possible to define subpartitions explicitly using SUBPARTITION clauses to specify options for individual subpartitions:

CREATE TABLE purchase (id INT, item VARCHAR(30), purchase_date DATE)
PARTITION BY RANGE( YEAR(purchase_date) )
SUBPARTITION BY HASH( TO_DAYS(purchase_date) ) (
  PARTITION p0 VALUES LESS THAN (2000) (
    SUBPARTITION s0,
    SUBPARTITION s1
  ),
  PARTITION p1 VALUES LESS THAN (2010) (
    SUBPARTITION s2,
    SUBPARTITION s3
  ),
  PARTITION p2 VALUES LESS THAN MAXVALUE (
    SUBPARTITION s4,
    SUBPARTITION s5
  )
);

Things to remember:

  • Each partition must have the same number of subpartitions.
  • Each SUBPARTITION clause must include (at a minimum) a name for the subpartition. Beyond that, you may set any desired option for the subpartition or allow it to assume its default setting for that option.
  • Subpartition names must be unique across the entire table. For example, the following CREATE TABLE statement is valid in MySQL 5.7:

CREATE TABLE purchase (id INT, item VARCHAR(30), purchase_date DATE)
PARTITION BY RANGE( YEAR(purchase_date) )
SUBPARTITION BY HASH( TO_DAYS(purchase_date) ) (
  PARTITION p0 VALUES LESS THAN (1990) (
    SUBPARTITION s0,
    SUBPARTITION s1
  ),
  PARTITION p1 VALUES LESS THAN (2000) (
    SUBPARTITION s2,
    SUBPARTITION s3
  ),
  PARTITION p2 VALUES LESS THAN MAXVALUE (
    SUBPARTITION s4,
    SUBPARTITION s5
  )
);
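If you want to verify how partitions and subpartitions were laid out, INFORMATION_SCHEMA exposes both levels (the table name matches the purchase example above):

SELECT PARTITION_NAME, SUBPARTITION_NAME, TABLE_ROWS
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'purchase';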

MySQL partitioning limitations
MySQL partitioning also has limitations. We list them below:

Every unique key on a partitioned table must use all columns in the table’s partitioning expression (the error message refers to this rule as the PRIMARY KEY requirement). The table below violates it, since neither unique key includes both partitioning columns:

CREATE TABLE tab3 (
  column1 INT NOT NULL,
  column2 DATE NOT NULL,
  column3 INT NOT NULL,
  column4 INT NOT NULL,
  UNIQUE KEY (column1, column2),
  UNIQUE KEY (column3)
)
PARTITION BY HASH(column1 + column3)
PARTITIONS 4;

Expect this error after running the above statement – ERROR 1503 (HY000): A PRIMARY KEY must include all columns in the table’s partitioning function

The right way of doing it:

CREATE TABLE table12 (
  column1 INT NOT NULL,
  column2 DATE NOT NULL,
  column3 INT NOT NULL,
  column4 INT NOT NULL,
  UNIQUE KEY (column1, column2, column3)
)
PARTITION BY HASH(column3)
PARTITIONS 5;

CREATE TABLE table25 (
  column11 INT NOT NULL,
  column12 DATE NOT NULL,
  column13 INT NOT NULL,
  column14 INT NOT NULL,
  UNIQUE KEY (column11, column13)
)
PARTITION BY HASH(column11 + column13)
PARTITIONS 5;

Since a primary key is by definition a unique key, this restriction also includes the table’s primary key, if it has one. The examples below explain this limitation better:

CREATE TABLE table55 (
  column11 INT NOT NULL,
  column12 DATE NOT NULL,
  column13 INT NOT NULL,
  column14 INT NOT NULL,
  PRIMARY KEY(column11, column12)
)
PARTITION BY HASH(column13)
PARTITIONS 4;

CREATE TABLE table65 (
  column20 INT NOT NULL,
  column25 DATE NOT NULL,
  column30 INT NOT NULL,
  column35 INT NOT NULL,
  PRIMARY KEY(column20, column30),
  UNIQUE KEY(column25)
)
PARTITION BY HASH( YEAR(column25) )
PARTITIONS 5;

Both of the above statements will return this error – ERROR 1503 (HY000): A PRIMARY KEY must include all columns in the table’s partitioning function

The right way of doing it:

CREATE TABLE t45 (
  column50 INT NOT NULL,
  column55 DATE NOT NULL,
  column60 INT NOT NULL,
  column65 INT NOT NULL,
  PRIMARY KEY(column50, column55)
)
PARTITION BY HASH(column50 + YEAR(column55))
PARTITIONS 5;

CREATE TABLE table88 (
  column80 INT NOT NULL,
  column81 DATE NOT NULL,
  column82 INT NOT NULL,
  column83 INT NOT NULL,
  PRIMARY KEY(column80, column81, column82),
  UNIQUE KEY(column81, column82)
);

In the above examples, every column used in the partitioning expression belongs to the table’s primary key (and the second table is not partitioned at all), so both statements are valid!

You can still successfully partition a MySQL table without unique keys (this includes having no primary key), and you may use any column or columns in the partitioning expression as long as the column type is compatible with the partitioning type. The example below shows partitioning a table with no unique or primary keys:

CREATE TABLE table_has_no_pk (column10 INT, column11 INT, column12 VARCHAR(20))
PARTITION BY RANGE(column10) (
  PARTITION p0 VALUES LESS THAN (500),
  PARTITION p1 VALUES LESS THAN (600),
  PARTITION p2 VALUES LESS THAN (700),
  PARTITION p3 VALUES LESS THAN (800)
);

You cannot later add a unique key to a partitioned table unless the key includes all columns used by the table’s partitioning expression. The statements below demonstrate this: both ADD PRIMARY KEY statements succeed because each proposed key contains the partitioning column column10.

ALTER TABLE table_has_no_pk ADD PRIMARY KEY(column10);

ALTER TABLE table_has_no_pk drop primary key;

ALTER TABLE table_has_no_pk ADD PRIMARY KEY(column10,column11);

ALTER TABLE table_has_no_pk drop primary key;

However, the next statement fails, because column10 is part of the partitioning key, but is not part of the proposed primary key:

mysql> ALTER TABLE table_has_no_pk ADD PRIMARY KEY(column11);
ERROR 1503 (HY000): A PRIMARY KEY must include all columns in the table's partitioning function
mysql>

MySQL partitioning limitations (at storage engine level)

InnoDB

  • InnoDB foreign keys and MySQL partitioning are not compatible. Partitioned InnoDB tables cannot have foreign key references, nor can they have columns referenced by foreign keys, so you cannot partition InnoDB tables that have foreign keys or are referenced by them.
  • InnoDB does not support the use of multiple disks for subpartitions (MyISAM supports this feature).
  • Use ALTER TABLE … REBUILD PARTITION and ALTER TABLE … ANALYZE PARTITION rather than ALTER TABLE … OPTIMIZE PARTITION, as shown in the sketch below.
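A minimal sketch of that recommended maintenance pair, run here against the partitioned customer table from earlier (partition name p0 assumed):

ALTER TABLE customer REBUILD PARTITION p0;  -- rebuilds the partition; same effect as dropping and reinserting its rows, useful for defragmentation
ALTER TABLE customer ANALYZE PARTITION p0;  -- refreshes index statistics for the partition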

NDB storage engine 

  • In the NDB storage engine, we can only partition by KEY (including LINEAR KEY).

FEDERATED storage engine 

  • Partitioning not supported in FEDERATED storage engine.

CSV storage engine

  • Partitioning not supported in CSV storage engine.

MERGE storage engine 

  • Tables using the MERGE storage engine cannot be partitioned. Partitioned tables cannot be merged.

MySQL functions shown in the following list are allowed in partitioning expressions:

  • ABS()
  • CEILING()
  • DATEDIFF()
  • DAY()
  • DAYOFMONTH()
  • DAYOFWEEK()
  • DAYOFYEAR()
  • EXTRACT()
  • FLOOR()
  • HOUR()
  • MICROSECOND()
  • MINUTE()
  • MOD()
  • MONTH()
  • QUARTER()
  • SECOND()
  • TIME_TO_SEC()
  • TO_DAYS()
  • TO_SECONDS()
  • UNIX_TIMESTAMP()
  • WEEKDAY()
  • YEAR()
  • YEARWEEK()

MySQL partitioning and locks 

Effect on DML statements

  • In MySQL 5.7, updating a partitioned MyISAM table causes only the affected partitions to be locked.
  • SELECT statements (including those containing unions or joins) lock only those partitions that actually need to be read. This also applies to SELECT … PARTITION.
  • An UPDATE prunes locks only for tables on which no partitioning columns are updated.
  • REPLACE and INSERT lock only those partitions having rows to be inserted or replaced. However, if an AUTO_INCREMENT value is generated for any partitioning column then all partitions are locked.
  • INSERT … ON DUPLICATE KEY UPDATE is pruned as long as no partitioning column is updated.
  • INSERT … SELECT locks only those partitions in the source table that need to be read, although all partitions in the target table are locked.
  • Locks imposed by LOAD DATA statements on partitioned tables cannot be pruned.

Effect on DDL statements

  • CREATE VIEW does not cause any locks.
  • ALTER TABLE … EXCHANGE PARTITION prunes locks; only the exchanged table and the exchanged partition are locked.
  • ALTER TABLE … TRUNCATE PARTITION prunes locks; only the partitions to be emptied are locked.
  • In addition, ALTER TABLE statements take metadata locks on the table level.

Effect on other statements

  • LOCK TABLES cannot prune partition locks.
  • CALL stored_procedure(expr) supports lock pruning, but evaluating expr does not.
  • DO and SET statements do not support partitioning lock pruning.

 

The post What is MySQL partitioning ? appeared first on MySQL Consulting, Support and Remote DBA Services.

When Database Warm Up is Not Really UP

The common wisdom with database performance management is that a “cold” database server has poor performance. Then, as it “warms up”, performance improves until you finally reach a completely warmed up state with peak database performance. In other words, the idea is that to get peak performance from MySQL you need to wait for database warm up.

This thinking comes from the point of view of database cache warmup. Indeed, from the cache standpoint, you start with an empty cache and over time the cache is filled with data. Moreover, the longer the database runs, the more statistics it has about data access patterns, and the better it can manage the database cache contents.

Over recent years, with the rise of SSDs, cache warmup has become less of an issue. High-performance NVMe storage can do more than 1GB/sec of reads, meaning you can warm up a 100GB database cache in less than two minutes. Also, SSD IO latency tends to be quite good, so you’re not paying as high a penalty for a higher miss rate during the warm up stage.

It is not all so rosy with database performance over time. Databases tend to delay work when possible, but there is only so much delaying you can do. When the database can’t delay work any longer, performance tends to be negatively impacted. Here are some examples of delayed work:

  • Checkpointing: depending on the database technology and configuration, checkpointing may be delayed for 30 minutes or more after database start
  • Change Buffer (Innodb) can delay index maintenance work
  • Pushing Messages from Buffers to Leaves (TokuDB) can be delayed until space in the buffers is exhausted
  • Compaction for RocksDB and other LSM-tree based systems can take quite a while to reach steady state

In all these cases database performance can be a lot better almost immediately after start compared to when it is completely “warmed up”.
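For the checkpointing case, you can watch the delayed work pile up yourself. A rough sketch using counters available in any recent MySQL or Percona Server (how much dirty data counts as “a lot” is workload-specific):

-- how many dirty pages are waiting to be flushed
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';

-- checkpoint age: compare "Log sequence number" with
-- "Last checkpoint at" in the LOG section of the output
SHOW ENGINE INNODB STATUS\G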

An experiment with database warm up

Let’s illustrate this with a little experiment running Sysbench with MySQL and Innodb storage engine for 15 minutes:

sysbench --db-driver=mysql --threads=200 --rand-type=uniform \
  --report-interval=10 --percentile=99 --time=900 \
  --mysql-user=root --mysql-password= \
  /usr/share/sysbench/oltp_update_index.lua --table_size=100000000 run

Let’s look in detail at what happens during the run, using graphs from Percona Monitoring and Management.

As you can see, the number of updates/sec we’re doing actually gets worse (and more uneven) after the first three minutes, while the jump to peak performance is almost immediate.

The log space usage explains some of this—in the first few minutes, we did not need to do as aggressive flushing as we had to do later.

On the InnoDB I/O graph we can see a couple of interesting things. First, you can see how quickly warm up happens—in 2 minutes the IO is already at half of its peak. You can also see the explanation for the little performance dip after its initial high performance (around 19:13)—this is where we got close to using all log space, so active flushing was required while, at the same time, a lot of IO was still needed for cache warmup.

Reaching Steady State is another term commonly used to describe the stage after warm up completes. Note though that such steady state is not guaranteed to be steady at all. In fact, the most typical steady state is unsteady. For example, you can see in this blog post both InnoDB and MyRocks have quite a variance.

Summary

While the term database warm up may imply that performance after warm up will be better, that is often not the case. “Reaching Steady State” is a better term, as long as you understand that “steady” does not mean uniform performance.

 

The post When Database Warm Up is Not Really UP appeared first on Percona Database Performance Blog.

How to Automate Minor Version Upgrades for MySQL on RDS

Amazon RDS for MySQL offers the option to automate minor version upgrades using the minor version upgrade policy, a property that lets you decide if Amazon is allowed to perform the upgrades on your behalf. Usually the goal is not to upgrade every RDS instance automatically, but to automatically keep non-production deployments up to date. This helps you address engine issues as soon as possible and improve the automation of the deployment process.

If you are using the AWS Command Line Interface (CLI) and you have an instance called test-rds01, it is as simple as setting

[--auto-minor-version-upgrade | --no-auto-minor-version-upgrade]

For example:

aws rds modify-db-instance --db-instance-identifier test-rds01 --apply-immediately --auto-minor-version-upgrade true

And if you use the AWS Management Console, it is just a check box. All sorted? Unfortunately not. The main problem is that Amazon performs those upgrades only in rare circumstances.

As per Amazon’s documentation:

Minor version upgrades only occur automatically if a minor upgrade replaces an unsafe version, such as a minor upgrade that contains bug fixes for a previous version. In all other cases, you must modify the DB instance manually to perform a minor version upgrade.

If the new version fixes any vulnerabilities that were present in the previous version, then the auto minor version upgrade will automatically take place during the next weekly maintenance window on your DB instance. In all other cases, you should perform the minor version upgrade manually. So in most scenarios the automatic upgrade is unlikely to happen, and using the auto-minor-version-upgrade attribute is not the way to keep your MySQL running on RDS updated to the latest available minor version.

How to improve automation of minor version upgrades on Amazon RDS for MySQL

Let’s say you want to reduce the time it takes for a newer minor version to reach your development environments, or even your production ones. How can you achieve that on RDS? First of all, you have to consider the delay it takes for a minor version to reach RDS, which can be anything between a few weeks and a few months. And you might not even notice that a new minor is available, as it is not obvious how to be notified when it is.

What is the best way to be notified of new minor versions available on RDS MySQL?

In the past you could (even automatically) monitor the release notes page, but that page is no longer used for RDS. Now you have to monitor the database announcement page, something that you can hardly automate.

Any way to speed up the minor version upgrades?

You can use the AWS CLI invoking the describe-db-engine-versions API or write a simple Lambda function to retrieve the latest available minor version and act accordingly: you can, for example, notify your team of DBAs using Amazon Simple Notification Service (SNS) or you can automatically upgrade the instance. Let’s first see how to achieve that using the command line:

aws --profile sandbox rds describe-db-engine-versions --engine 'mysql' --engine-version '5.7' --query "DBEngineVersions[-1].EngineVersion"

where the -1 array index selects the very latest version of the engine available on RDS. Today the result is “5.7.21”, and a simple cron job can monitor it and trigger a notification on changes. Note that the same approach can be used to retrieve the latest available minor version for engines running MySQL 5.5 and MySQL 5.6, and for PostgreSQL engines too.
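As a sketch of the notification option mentioned above: compare the last version you saw with the current one and publish to an SNS topic on change. The topic ARN, state file location and schedule are hypothetical placeholders, not anything prescribed by AWS:

#!/bin/bash
# Notify the DBA team via SNS when a new RDS MySQL 5.7 minor appears.
STATE_FILE=/var/tmp/rds_mysql57_latest                    # hypothetical location
TOPIC_ARN=arn:aws:sns:eu-west-1:123456789012:dba-alerts   # placeholder ARN

latest=$(aws rds describe-db-engine-versions --engine mysql \
  --engine-version 5.7 --output text \
  --query "DBEngineVersions[-1].EngineVersion")

previous=$(cat "$STATE_FILE" 2>/dev/null)

if [ "$latest" != "$previous" ]; then
  aws sns publish --topic-arn "$TOPIC_ARN" \
    --message "New RDS MySQL minor available: $latest"
  echo "$latest" > "$STATE_FILE"
fi

Run it from cron (daily is plenty, given how slowly new minors appear on RDS).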

If you want to automatically and immediately upgrade your instance, the logic can be easily done in a few lines in bash with a cron on a EC2. For example, the following function requires only the database instance identifier:

rds_minor_upgrade() {
  rds_endpoint=$1
  engine_version="5.7"
  rds_current_minor=$(aws rds describe-db-instances \
    --db-instance-identifier="$rds_endpoint" \
    --output text --query "DBInstances[].EngineVersion")
  rds_latest_minor=$(aws rds describe-db-engine-versions \
    --engine 'mysql' --engine-version "$engine_version" \
    --output text --query "DBEngineVersions[-1].EngineVersion")
  if [ "$rds_latest_minor" != "$rds_current_minor" ]; then
    aws rds modify-db-instance --apply-immediately \
      --engine-version "$rds_latest_minor" \
      --db-instance-identifier "$rds_endpoint"
  fi
}

Alternatively, you can write the code as a scheduled Lambda function in your favourite language. For example, using the AWS node.js SDK you can manage RDS and implement the logic above, using rds.describeDBEngineVersions and rds.modifyDBInstance to achieve the same.

(...)
rds.describeDBEngineVersions(params, function(err, data) {
  (...)
});
(...)
var params = {
  DBInstanceIdentifier: 'test-rds01',
  ApplyImmediately: true,
  EngineVersion: '<new minor version>',
  (...)
};
rds.modifyDBInstance(params, function(err, data) {
  (...)
});

Speed up your minor upgrade!

To summarize, Amazon Web Services does not offer a real way to automatically upgrade an RDS instance to the latest available minor version in the most common scenarios, but it is very easy to achieve that by taking advantage of the AWS CLI or the many SDKs.

The goal is not to upgrade every deployment automatically; you would not normally use this for production deployments. However, being able to monitor the latest available minor version on RDS and apply the changes automatically for development and staging deployments can significantly reduce the time it takes to have MySQL up to date on RDS, and make your upgrade process more automated.

 

The post How to Automate Minor Version Upgrades for MySQL on RDS appeared first on Percona Community Blog.

Percona Live Europe 2018 Call for Papers is Now Open

Announcing the opening of the call for papers for the Percona Live Europe Open Source Database Conference 2018 in Frankfurt, Germany. It will be open from now until August 10, 2018. The conference takes place November 5–7.

Our theme this year is
Connect. Accelerate. Innovate.

As a speaker at Percona Live Europe, you’ll have the opportunity to CONNECT with your peers—open source database experts and enthusiasts who share your commitment to improving knowledge and exchanging ideas. ACCELERATE your projects and career by presenting at the premier open source database event, a great way to build your personal and company brands. And influence the evolution of the open source software movement by demonstrating how you INNOVATE!

Community initiatives remain core to the open source ethos, and we are proud of the contribution we make with Percona Live Europe in showcasing thought leading practices in the open source database world.

With a nod to innovation, for the first time, this year we are introducing a business track to benefit those business leaders who are exploring the use of open source and are interested in learning more about its costs and benefits.

Speaking Opportunities

The Percona Live Europe Open Source Database Conference 2018 Call for Papers is open until August 10, 2018. We invite you to submit your speaking proposal for breakout, tutorial or lightning talk sessions. Classes and talks are invited for Foundation (either entry level or of general interest to all), Core (intermediate) and Masterclass (advanced) levels.

If selected, you will receive a complimentary full conference pass.

  • Breakout Session. Broadly cover a technology area using specific examples. Sessions should be either 25 minutes or 50 minutes in length (including Q&A).
  • Tutorial Session. Present a technical session that aims for a level between a training class and a conference breakout session. We encourage attendees to bring and use laptops for working on detailed and hands-on presentations. Tutorials will be three or six hours in length (including Q&A).
  • Lightning Talk. Give a five-minute presentation focusing on one key point that interests the open source community: technical, lighthearted or entertaining talks on new ideas, a successful project, a cautionary story, a quick tip or demonstration.
Topics and Tracks

We want proposals that cover the many aspects of application development using all open source databases, as well as new and interesting ways to monitor and manage database environments. Did you just embrace open source databases this year? What are the technical and business values of moving to or using open source databases? How did you convince your company to make the move? Was there tangible ROI?

Best practices and current trends, including design, application development, performance optimization, HA and clustering, cloud, containers and new technologies, as well as new and interesting ways to monitor and manage database environments—what’s holding your focus? Share your case studies, experiences and technical knowledge with an engaged audience of open source peers.

In the submission entry you will be asked to indicate which of these tracks your proposal best fits: tutorial; business needs; case studies/use cases; operations; or developer.

A few ideas

The conference committee is looking for proposals that cover the many aspects of using, deploying and managing open source databases, including:

  • Open source – Describe the technical and business values of moving to or using open source databases. How did you convince your company to make the move? Was there tangible ROI?
  • Security – All of us have experienced security challenges. Whether they are initiated by legislature (GDPR), bugs (Meltdown/Spectre), experience (external attacks) or due diligence (planning for the worst), when do you have ‘enough’ security? Are you finding that security requirements are preventing your ability to be agile?
  • Serverless, Cloud or On-Premise – The technology landscape is no longer a simple one, and mixing infrastructures has almost become the norm. Are you designing data architectures for the new landscape, and eager to share your experience? Have microservices become an important part of your plans?
  • MySQL – Do you have an opinion on what is new and exciting in MySQL? With the release of MySQL 8.0, are you using the latest features? How and why? Are they helping you solve any business issues, or making deployment of applications and websites easier, faster or more efficient? Did the new release get you to change to MySQL? What do you see as the biggest impact of the MySQL 8.0 release? Do you use MySQL in conjunction with other databases in your environment?
  • MongoDB – How has the 3.6 release improved your experience in application development or time-to-market? How are the new features making your database environment better? What is it about MongoDB 4.0 that excites you? What are your experiences with Atlas? Have you moved to it, and has it lived up to its promises? Do you use MongoDB in conjunction with other databases in your environment?
  • PostgreSQL – Why do you use PostgreSQL as opposed to other SQL options? Have you done a comparison or benchmark of PostgreSQL vs. other types of databases related to your tasks? Why and what were the results? How does PostgreSQL help you with application performance or deployment? How do you use PostgreSQL in conjunction with other databases in your environment?
  • SQL, NewSQL, NoSQL – It’s become a perennial question without an easy answer. How do databases compare, how do you choose the right technology for the job, how do you trade off between features and their benefits in comparing databases? If you have ever tried a hybrid database approach in a single application, how did that work out? How nicely does MongoDB play with MySQL in the real world? Do you have anything to say about using SQL with NoSQL databases?
  • High Availability – What choices are you making to ensure high availability? How do you find the balance between redundancy and cost? Are you using hot backups, and if so, what happened when you needed to rollback on them?
  • Scalability – When did you recognize you needed to address data scale? Did your data growth take you by surprise or were you always in control? Did it take a degradation in performance to get your management to sit up and take notice? How do you plan for scale if you can’t predict demand?
  • What the Future Holds – What do you see as the “next big thing”? What new and exciting features are going to be released? What’s in your next release? What new technologies will affect the database landscape? AI? Machine learning? Blockchain databases? Let us know about innovations you see on the way.
How to respond to the call for papers

For information on how to submit your proposal visit our call for papers page. The conference web pages will be updated throughout the next few weeks and bios, synopsis and slides will be published on those pages after the event.

Sponsorship

If you would like to obtain a sponsor pack for Percona Live Europe Open Source Database Conference 2018, you will find more information including a prospectus on our sponsorship page. You are welcome to contact me, Bronwyn Campbell, directly.

The post Percona Live Europe 2018 Call for Papers is Now Open appeared first on Percona Database Performance Blog.

MariaDB Galera cluster and GTID

In MariaDB 10.2.12, these two don’t yet work together. GTID = Global Transaction ID.  In the master-slave asynchronous replication realm, this means that you can reconnect a slave to another server (change its master) and it’ll happily continue replicating from the correct point.  No more fussing with filenames and offsets (which of course will both differ on different machines).

So in concept the GTID is “globally” unique – that means it’s consistent across an entire infra: a binlogged write transaction will have the same GTID no matter on which machine you look at it.

  • OK: if you are transitioning from async replication to Galera cluster, and have a cluster as slave of the old infra, then GTID will work fine.
  • PROBLEM: if you want to run an async slave in a Galera cluster, GTID will currently not work. At least not reliably.

The overview issue is MDEV-10715, the specific problem is documented in MDEV-14153 with some comments from me from late last week. MDEV-14153 documents cases where the GTID is not in fact consistent – and the way in which it isn’t is most disturbing.

The issue appears as “drift”. A GTID is made up of R-S-#, where R is the replication domain (0 unless set by an app), S is the server-id where the write was originally done, and # is just a number. The only required rule for the # is that each next event has to have a higher number than the previous. In principle there could be #s missing, that’s ok.

In certain scenarios, the # part of the GTID falls behind on the “other nodes” in the Galera cluster. There was the node where the statement was first issued, and then there are the other nodes which pick up the change through the Galera (wsrep) cluster mechanism. Those other nodes.  So at that point, different nodes in the cluster have different GTIDs for the same query. Not so great.
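You can observe the drift yourself by comparing the GTID state on each node. A minimal check using standard MariaDB system variables (run it on every cluster member; the variable names are stock MariaDB, nothing specific to this bug):

SELECT @@hostname, @@gtid_binlog_pos, @@gtid_current_pos;

If GTIDs were propagated consistently, the positions for the same data state would match across the cluster; with this bug, the # component on the other nodes falls behind.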

To me, this looked like a giant red flag though: if a GTID is assigned on a commit, and then replicated through the cluster as part of that commit, it can’t change. Not drift, or any other change. So the only possible conclusion must be that it is in fact not passed through the cluster, but “reinvented” by a receiving cluster node, which simply assumes that the current event from a particular server-id is previous-event id + 1.  That assumption is false, because as I mentioned above it’s ok for gaps to exist.  As long as the number keeps going up, it’s fine.

Here is one of the simplest examples of breakage (extract from a binlog, with obfuscated table names):

# at 12533795
#180704 5:00:22 server id 1717 end_log_pos 12533837 CRC32 0x878fe96e GTID 0-1717-1672559880 ddl
/*!100001 SET @@session.gtid_seq_no=1672559880*//*!*/;
# at 12533837
#180704 5:00:22 server id 1717 end_log_pos 12534024 CRC32 0xc6f21314 Query thread_id=4468 exec_time=0 error_code=0
SET TIMESTAMP=1530644422/*!*/;
SET @@session.time_zone='SYSTEM'/*!*/;
DROP TEMPORARY TABLE IF EXISTS `qqq`.`tmp_foobar` /* generated by server */
/*!*/;

Fact: temporary tables are not replicated (imagine restarting a slave, it wouldn’t have whatever temporary tables were supposed to exist). So, while this event is stored in the binary log (which it is to ensure that if you replay the binlog on a machine, it correctly drops the temporary table after creating and using it), it won’t go through a cluster.  Remember that Galera cluster is essentially a ROW-based replication scheme; if there are changes in non-temporary tables, of course they get replicated just fine.  So if an app creates a temporary table, does some calculations, and then inserts the result of that into a regular table, the data of that last bit will get replicated. As it should. In a nutshell, as far as data consistency goes, we’re all fine.

But the fact that we have an event that doesn’t really get replicated creates the “fun” in the “let’s assume the next event is just the previous + 1” logic. This is where the drift comes in. Sigh.

In any case, this issue needs to be fixed by let’s say “being re-implemented”: the MariaDB GTID needs to be propagated through the Galera cluster, so it’s the same on every server, as it should be. Doing anything else is always going to go wrong somewhere, so trying to catch more cases like the above example is not really the correct way to go.

If you are affected by this or related problems, please do vote on the relevant MDEV issues. That is important!  If you need help tracking down problems, feel free to ask.  If you have more information on the matter, please comment too!  I’m sure this and related bugs will be fixed, there are very capable developers at MariaDB Corp and Codership Oy (the Galera company). But the more information we can provide, the better. It often helps with tracking down problems and creating reproducible test cases.

MySQL Guide to Ports

There could be anywhere from one to over ten ports used within your MySQL ecosystem. It really depends on what you have enabled, which components you are using, how your applications connect, and other characteristics of your environment.

From a security point of view, these ports need to be opened just wide enough to allow the various components that should communicate to talk, while blocking out anything else trying to hack its way in – the goal being to enforce the security principle of “least privilege”. …

InnoDB Cluster in a nutshell – Part 1

Since MySQL 5.7 we have a new player in the field: MySQL InnoDB Cluster. This is an Oracle High Availability solution that can be easily installed over MySQL to get High Availability with multi-master capabilities and automatic failover.

This solution consists of three components: InnoDB Group Replication, MySQL Router and MySQL Shell. You can see how these components interact in this graphic:

In this three-part blog series, we will cover each of these components to get a sense of what this tool provides and how it can help with architecture decisions.

Group Replication

This is the actual High Availability solution, and a while ago I wrote a short review of it when it still was in its labs stage. It has improved a lot since then.

This solution is based on a plugin that has to be installed (it is not installed by default) and works on top of the built-in replication. So it relies on binary logs and relay logs to apply writes to the members of the cluster.

The main concept about this new type of replication is that all members of a cluster (i.e. each node) are considered equals. This means there is no master-slave (where slaves follow master) but members that apply transactions based on a consensus algorithm. This algorithm forces all members of a cluster to commit or reject a given transaction following decisions made by each individual member.

In practical terms, this means each member of the cluster has to decide if a transaction can be committed (i.e. no conflicts) or should be rolled back but all other members follow this decision. In other words, the transaction is either committed or rolled back according to the majority of members in a consistent state.

To achieve this, there is a service that exposes a view of the cluster status, indicating which members form the cluster and the current status of each of them. Additionally, Group Replication requires GTID and row-based replication (binlog_format=ROW) to distribute each writeset with row changes between members. This is done via binary logs and relay logs, but before each transaction is pushed to the binary/relay logs it has to be acknowledged by a majority of the members of the cluster, in other words through consensus. This process is synchronous, unlike legacy replication. After a transaction is replicated there is a certification process to commit the transaction, thus making it durable.

Here appears a new concept, the certification process, which is the process that confirms if a writeset can be applied/committed (i.e. a row change can be done without conflicts) after replication of the transaction is complete.

Basically this process consists of inspecting writesets to check if there are conflicts (i.e. the same row updated by concurrent transactions). Based on an order set in the writeset, the conflict is resolved by ‘first-committer wins’ while the second is rolled back on the originator. Finally, the transaction is pushed to the binary/relay logs and committed.

Solution features

Some other features provided by this solution are:

  • Single-primary or multi-primary modes, meaning that the cluster can operate with a single writer and multiple readers (the recommended and default setup), or with multiple writers, where all nodes are capable of accepting write transactions. The latter comes at the cost of a performance penalty due to conflict resolution.
  • Automatic failure detection, where an internal mechanism is able to detect a failed node (i.e. a crash, network problems, etc.) and decide to exclude it from the cluster automatically. Also, if a member can’t communicate with the cluster and gets isolated, it can’t accept transactions. This ensures that cluster data is not impacted by this situation.
  • Fault tolerance. This is the strategy the cluster uses to survive failing members. As described above, it is based on majorities. A cluster needs at least three members to support one node failure, because the remaining two members then still keep the majority. The bigger the number of nodes, the bigger the number of failing nodes the cluster supports. The maximum number of members (nodes) in a cluster is currently limited to 7. With seven members, the majority is kept by four or more active members; in other words, a cluster of seven supports up to three failing nodes. You can check the current membership with the query shown below.
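A quick way to see the members and their state is the performance_schema table that Group Replication exposes (available once the plugin is installed and the group is bootstrapped):

SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE
FROM performance_schema.replication_group_members;

A healthy cluster shows all members as ONLINE; a failed or isolated member shows up as UNREACHABLE or ERROR, which is how the automatic failure detection described above becomes visible.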

We will not cover installation and configuration aspects now. This will probably come with a new series of blogs where we can cover not only deployment but also use cases and so on.

In the next post we will talk about the next cluster component: MySQL Router, so stay tuned.

The post InnoDB Cluster in a nutshell – Part 1 appeared first on Percona Database Performance Blog.

Career as a Python Programmer Freelancing Jobs in Upwork Linkedin

In this episode, we discuss some statistics and data about freelancing and jobs for Python programmers and developers worldwide. Python is one of the most popular programming languages in the market. We will first discuss Python-related jobs on Upwork.com, then the jobs on Linkedin.com.



Summary of my discussion:

To start a career in Python-based work one needs to know additional technologies besides the Python programming language, like:

  1. Web scraping
  2. Bash Shell scripting
  3. Elastic Search
  4. Hadoop
  5. Django [ For Web Application ]
  6. JavaScript [ For Web Application ]
  7. Data Structure
  8. Git
  9. MySQL or PostgreSQL
  10. Redis [ Server Side Cache ]
  11. Kafka
Recommended Beginner Python Books:

Python Crash Course : https://amzn.to/2pOCmtm
Automate the Boring Stuff: https://amzn.to/2pM4XjI
Python for Everybody: https://amzn.to/2usGcxV

Recommended Advanced Python Books:

Fluent Python: https://amzn.to/2E4TCzA
Python Cookbook: https://amzn.to/2Go7yXy
Web Scraping: https://amzn.to/2uujJQR

Reference:

1. Upwork – https://www.upwork.com
2. Linkedin – https://www.linkedin.com
3. Python Flask – http://flask.pocoo.org
4. Python Django – https://www.djangoproject.com
5. Python Django-CMS – https://www.django-cms.org/en/
6. Python SciPy – https://www.scipy.org
7. Python TensorFlow – https://www.tensorflow.org
8. Programming Language Comparison – https://bit.ly/2wMoHpQ

The post Career as a Python Programmer Freelancing Jobs in Upwork Linkedin appeared first on Thinkdiff.net.

Problems of Oracle's MySQL as an Open Source Product

In my previous summary blog post I listed 5 problems I see with the way Oracle handles MySQL server development. The first of them was that "Oracle does not develop MySQL server in a true open source way", and this is actually what I started my draft of that entire blog post with. Now it's time to get into the details, as so far there was mostly fun around this, plus statements in the related Twitter discussion I had that MariaDB could also do better.

So, let me explain what forces me to think that Oracle is treating MySQL somewhat wrong for the open source product.

Nice pathway on this photo, but it's not straight and it's not clear where it goes. Same with MySQL development...

We get MySQL source code updated at GitHub only when (or, as it often happened in the past, some time after) the official release of a new version happens. You can see, for example, that the MySQL 8.0 source code at GitHub was last updated on April 3, 2018, while MySQL 8.0.11 GA was released officially on April 19, 2018 (and that's when the new code became really available in the public repository). We do not see any code changes later than April 3, while it's clear that there are bug fixes already implemented for MySQL 8.0.12 (see Bug #90523 - "[MySQL 8.0 GA Release Build] InnoDB Assertion: (capacity & (capacity - 1)) == 0", for example: there is an easy way to crash official MySQL 8.0.11 binaries upon startup, fixed back before April 30, with some description of the fix even, but no source code of the fix published) and even for 8.0.13 (see Bug #90999 - "Bad usage of ppoll in libmysql"). With Oracle's approach to sharing the source code, we cannot see the fixes that were already made a long time ago, apply them, test them or comment on them. This is fundamentally wrong, IMHO, for any open source software.

In other projects we can usually see the code as soon as it is pushed to a branch (check MariaDB if you care: the last change was a few hours ago at the moment of writing). Main branches may have stricter rules for updating, but in general we see fixes as they happen, not only when a new official release happens.
Side note: if you see that Bug #90523 became private after I mentioned it here, that's another wrong thing they often do. More on that in the next post, on the handling of community bug reports by Oracle... Interestingly enough, when the fix comes from the community we can usually see the patch. This happened with the Bug #90999 mentioned above - we have a fix provided by Facebook and one can see the patch in Bug #91067 - "Contribution by Facebook: Do not use sigmask in ppoll for client libraries". When somebody makes a pull request, the patch source is visible. But one can never be sure if it's the final patch, whether it passed all the usual QA tests and reviews, or what happens to pull requests closed because the developer had not signed the agreement...

If the fix is developed by Oracle, you'll see the code changed only with/after the official release. Moreover, it would be on you to identify the exact commit(s) that introduced the fix. For a long time Laurynas Biveinis from Percona took care to add comments about the exact commit that fixed a bug to public bug reports (see Bug #77689 - "mysql_execute_command SQLCOM_UNLOCK_TABLES redundant trans_check_state check?" as one example). Community members have to work hard to "reverse engineer" Oracle's fixes and link them back to the details of the real problems (community bug reports) they were intended to resolve!

Compare this to a typical changelog of MariaDB that leads you directly to commits and code changes.

What's even worse, Oracle started the practice of publishing only part of the changes made for a release. Some tests, those for "security" bugs, are NOT published, even when we can assume they exist or even be 100% sure that they exist.

My recent enough favorite example is the "The CREATE TABLE of death" bug reported by Jean-François Gagné. If you follow his blog post and links in it you can find out all the details, including the test case that is public in MariaDB. With this public information you can go and crash any affected older MySQL versions. Bug reporter did everything to inform affected vendors properly, and responsible vendors disclosed the test (after they fixed the problem)!

Now, try to find similar test in public GitHub tree of Oracle MySQL. I tried to find it literally, try to find references to somewhat related public bug numbers etc, but failed. If you know better and can identify the related public test at GitHub, please, add a comment and correct me!

To summarize, this is what I am mostly concerned about:
  1. Public source code is updated only with the releases. There are no feature-specific code branches, development branches, just nothing public until the official release.
  2. Oracle does not provide any details about commits and their relations to bugs fixed in the release notes or anywhere else outside GitHub. One has to go study the source code to make his own conclusions.
  3. Oracle does not share some of test cases in their commits. So, some test cases remain non-public and we can only guess (based on code analysis) what was the real intention of the fix. This applies to security bugs and who knows to what else.
I would not go into other potential problems (I've heard about some others from developers, for example related to code refactoring Oracle does) or more details. The above is enough for me to state that Oracle does the wrong things with the way they publish source code and treat MySQL as an open source product.

All the problems mentioned above were introduced by Oracle; these never happened in MySQL AB or Sun. MariaDB and Percona servers may have their own problems, but the above do NOT apply to them, so I state that other vendors develop MySQL forks and related projects differently, and are still in business and doing well!

The Night of the Living Files

Scary movies for teenagers and database administration tasks share some common traits. Usually, the movie starts with a group of teens making unwise choices. Maybe you, as a DBA, are no longer a teenager, but you shouldn’t underestimate your ability to do the same.

$ sudo su -
Darrera entrada: ds jun 2 14:43:53 UTC 2018 des de a a pts/0
[root@mysql ~]# cd /var/lib/mysql
[root@mysql mysql]# df -h .
Filesystem                      Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00  38G   35G    3G  92% /
[root@mysql mysql]#

Back to the movie…today is just another regular day. Checking Jira (or whatever your ticketing system is), answering emails and phone calls. Maybe a couple of meetings. And in the meantime, you will perform some critical tasks on the most important systems for your company. Just the usual stuff!

[root@mysql mysql]# ls -l *.log
ls: cannot access *.log: No such file or directory
[root@mysql mysql]# ls -l *log*
-rw-r-----. 1 mysql mysql 50331648 Jun 2 15:07 ib_logfile0
-rw-r-----. 1 mysql mysql 50331648 Jun 2 14:53 ib_logfile1
[root@mysql mysql]# rm -f ib*
[root@mysql mysql]#

There’s a shadow behind you

Right now you are not able to remember why… and it doesn’t matter anymore… but you had to perform a few tasks in a staging environment, releasing some storage by removing files, and you did it in production. You (I mean YOU) removed critical files from the production database.

Suddenly, you perceive the damage you have caused. You become aware of the full scale of what you’ve done. You feel like you’re going to die.

[root@mysql mysql]# ls -l ib*
ls: cannot access ib*: No such file or directory
[root@mysql mysql]#

Don’t go into the light, Carol Anne

While you are starting to feel really sick, you realize that you should be getting thousands of pages, and a few of the screens near you should display a bunch of red blinking alerts. And a few people should be asking you about what’s happening.

But for now, everything looks fine. Again, in teen horror movies, there is a moment when our heroes have a sense of the real danger ahead, but this danger is ignored by the other people.

[root@mysql mysql]# ps -ef | grep mysql | grep -v grep
mysql 2983 1 0 10:16 ? 00:00:00 /usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid
[root@mysql mysql]#
[root@mysql mysql]# mysql -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 5
Server version: 5.7.22-22 Percona Server (GPL), Release 22, Revision f62d93c

Copyright (c) 2009-2018 Percona LLC and/or its affiliates
Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show processlist;
+----+------+-----------+------+---------+------+----------+------------------+-----------+---------------+
| Id | User | Host      | db   | Command | Time | State    | Info             | Rows_sent | Rows_examined |
+----+------+-----------+------+---------+------+----------+------------------+-----------+---------------+
REDACTED TO PROTECT THE INNOCENT
+----+------+-----------+------+---------+------+----------+------------------+-----------+---------------+
XXX rows in set (0.00 sec)

mysql>

You feel like… maybe I failed to remove the files. But then you check and the files are missing. You did it all right, you removed the files. The database should be shaking and trembling before dying in pain.

But it is not.

BYOO (Bring your own Ouija)

You breathe deeply and remember that there’s no such thing as removing a file. What you did was remove metadata information from the filesystem, but as long as the file is open by a process, that process is able to access it, for both reads and writes. Nobody else can… but the process still keeps a link to that file, a link that keeps this file between us and the hell of removed files.

You must keep the database processes running, and by doing so you ensure that the removed files are available. But you know that this is impossible. You can’t keep your database up and running forever. So now what?

Think, think, think!

It’s easy, blah, blah, blah and you are dead

Another horror picture cliché occurs when one of the characters says that he knows how to fix the problem, both quickly and easily. In our movie, the easy and quick fix (and the wrong fix) is to perform a logical dump of the database. Unfortunately, a dump of a database of several TB will take some time… and then you have to restore the dump back. A long downtime is not the best option.

And to make things worse, after restoring the backup, you need to apply the binary logs to bring the database up to date.

Run, just run!

While you look at your mates being devoured by the mysqldump zombies, you understand that a better option is required if you want to return home safely. The lsof command (aka List Open Files) can help you understand the problem.

[root@mysql mysql]# lsof -c mysql | grep /var/lib/mysql/ib
mysqld 2983 mysql 3uW  REG 253,0 50331648 34323185 /var/lib/mysql/ib_logfile0 (deleted)
mysqld 2983 mysql 9uW  REG 253,0 50331648 34323186 /var/lib/mysql/ib_logfile1 (deleted)
mysqld 2983 mysql 10uW REG 253,0 12582912 34323184 /var/lib/mysql/ibdata1 (deleted)
mysqld 2983 mysql 12uW REG 253,0 12582912 34323188 /var/lib/mysql/ibtmp1 (deleted)

Here you can see the missing files with The Mark of Cain: Deleted
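A handy variation if you do not know in advance which process holds the deleted files: lsof can list every open file on the system whose link count has dropped to zero, i.e. deleted but still held open:

[root@mysql mysql]# lsof +L1

This lists open-but-deleted files for all processes, which is also a good way to find out who is quietly eating your disk space.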

In /proc we trust

The proc filesystem is an illusionary filesystem that presents information about processes (and other system information). Does the proc filesystem give information about open files for a process? Yes, it does. But not only this. In the proc filesystem, you can access any of the files that a process has currently open. Remember that 2983 is the pid of mysqld.

[root@mysql mysql]# ls -l /proc/2983/fd | grep "lib/mysql/ib"
lrwx------. 1 mysql mysql 64 Jun 28 10:27 10 -> /var/lib/mysql/ibdata1 (deleted)
lrwx------. 1 mysql mysql 64 Jun 28 10:27 12 -> /var/lib/mysql/ibtmp1 (deleted)
lrwx------. 1 mysql mysql 64 Jun 28 10:27 3 -> /var/lib/mysql/ib_logfile0 (deleted)
lrwx------. 1 mysql mysql 64 Jun 28 10:27 9 -> /var/lib/mysql/ib_logfile1 (deleted)

Yes, you can access any file that a process has open, even if that file is no longer there. In the directory /proc/<pid>/fd, you have a bunch of soft links to the original files, even if these files were removed. The links and the files are available as long as the process doesn’t close them.

"MySQL 5.7.22-2[root@mysql mysql]# head -c 31 /proc/2983/fd/3 "MySQL 5.7.22-22[root@mysql mysql]# [root@mysql mysql]# Between a rock and a hard link

Wait a minute… we have soft links to removed files, and we can access those files. It would be great if we could create a hard link based on a soft link. You know that the ln command is able to create both soft links and hard links. You check the man page for ln and discover the option -L, which dereferences soft links.

You execute a quick test to make sure that ln -L does what you think it should do… and it does!

[root@mysql mysql]# cd /tmp
[root@mysql tmp]# touch test_file
[root@mysql tmp]# ln -s test_file test_file_link
[root@mysql tmp]# ls -l test_file*
-rw-r--r--. 1 root root 0 Jun 2 15:32 test_file
lrwxrwxrwx. 1 root root 9 Jun 2 15:33 test_file_link -> test_file
[root@mysql tmp]# echo AreYouThere > test_file
[root@mysql tmp]# ln -L test_file_link test_file_hard
[root@mysql tmp]# ls -l test*
-rw-r--r--. 2 root root 12 Jun 2 15:33 test_file
-rw-r--r--. 2 root root 12 Jun 2 15:33 test_file_hard
lrwxrwxrwx. 1 root root 9 Jun 2 15:33 test_file_link -> test_file
[root@mysql tmp]# cat test_file_hard
AreYouThere
[root@mysql tmp]# rm test_file
rm: remove regular file ‘test_file'? y
[root@mysql tmp]# cat test_file_hard
AreYouThere
[root@mysql tmp]# cat test_file_link
cat: test_file_link: No such file or directory
[root@mysql tmp]#

Easy Peasy Lemon Squeezy

Now it is time to apply our fix. We will create hard links from the soft links.

[root@mysql tmp]# ln -L /proc/2983/fd/3 /var/lib/mysql/ib_logfile0
ln: failed to create hard link ‘/var/lib/mysql/ib_logfile0' => ‘/proc/2983/fd/3': No such file or directory
[root@mysql tmp]# ls -l /proc/2983/fd/3
lrwx------. 1 mysql mysql 64 Jun 2 15:22 /proc/2983/fd/3 -> /var/lib/mysql/ib_logfile0 (deleted)

The file is there, but we can’t create the hard link! Why? There are two reasons. The first one is that ln -L tries to retrieve the filesystem information from the target of the soft link /proc/2983/fd/3. That target is /var/lib/mysql/ib_logfile0, which, as we all know, has been deleted.

But if the destination is deleted, why can we see the contents of that file in /proc? Well, because, as we’ve seen before, /proc is an illusionary filesystem. The “illusionary” part explains why we can see the contents of the file even though it has been deleted. The “filesystem” part gives us the other reason we are unable to create a hard link from a file “stored” in /proc: you can’t create hard links between different filesystems.

It was not so easy… but there is still hope.

No downtime no pain

We know that we can view, open and copy the deleted files’ contents using the magic of /proc, but databases require that all the files stay in a consistent state. It is not enough to copy the files; we need to make sure that the files are not modified during the copy, after the copy, and before the database re-opens them.

Even if you find a way to keep the consistent state between the /proc files and the “new” ones, usually the only way to get the database to re-open a file is to restart the database (I know there are certain types of files that are re-opened… but let’s keep it this way for simplicity).

Both keeping some database files untouched and restarting the database will require downtime. Keeping the files untouched means that we have to lock them for writes, to make sure we keep the consistency. And we need to restart the database to reopen these files.

Some of us have to die

Now that we know that downtime is unavoidable, we have to discuss it with the organization and plan for the downtime… but we don’t have long to do this. A database crash could make the deleted files disappear.

But while we discuss the downtime… there is still one thing missing: how are we going to keep the consistency of the files during the copy? We have two options: find a way to keep the files untouched during the copy and the database shutdown, or just shut down the database.

Ok, ok, ok… You told us that the files will disappear if we stop the database. And now you are telling us to stop the database? Are you crazy?

No, I am not crazy. As we already know, if a file is open then it is not removed by the operating system. We also know that there is a place called /proc where we can re-open any file already opened by any other process. The only thing we need to do is keep the files open from another process; then we stop the database, copy the files, and start the database again.

And how do we keep an instance of the files open? Well, it’s up to you… choose your favorite application or command that can open a file and keep it open. Just make sure that the file you are interested in is really open. For example, vim is a bad choice, as it reads the file at the beginning but doesn’t keep it open. The command I use is tail with the -f flag, which keeps the file open. And make sure you don’t close/kill it… or the file will disappear. (lsof should help you make sure the files are still open.)

[root@mysql ~]# cd /proc/2983/fd
[root@mysql fd]# nohup tail -f 10 > /dev/null &
[2] 3112
[root@mysql fd]# nohup: ignoring input and redirecting stderr to stdout
[root@mysql fd]# nohup tail -f 12 > /dev/null &
[3] 3113
[root@mysql fd]# nohup: ignoring input and redirecting stderr to stdout
[root@mysql fd]# nohup tail -f 3 > /dev/null &
[4] 3114
[root@mysql fd]# nohup: ignoring input and redirecting stderr to stdout
[root@mysql fd]# nohup tail -f 9 > /dev/null &
[5] 3115
[root@mysql fd]# nohup: ignoring input and redirecting stderr to stdout

Oops! Where are my files?

Once we’re sure that all the files we want to recover are open by another process, we may proceed to stop the database.

[root@mysql fd]# service mysql stop
Redirecting to /bin/systemctl stop mysql.service
[root@mysql fd]# ls -l
total 0

“Oh come on… you told us that we were going to keep our files available, but they are gone! You are a big liar!”

Yes, when we stopped mysql, all the symbolic links to files opened by mysqld disappeared from /proc. But remember… a file only disappears when ALL the processes holding it open are gone. We have four processes keeping the files open, and these processes have their own pids, so we need to find those identifiers and then we will be able to copy the files. In our case, the identifiers are 3112, 3113, 3114 and 3115.

[root@mysql fd]# ls -l /proc/3112/fd
total 0
lrwx------. 1 root root 64 Jun 28 10:48 0 -> /dev/pts/1
l-wx------. 1 root root 64 Jun 28 10:48 1 -> /dev/null
lrwx------. 1 root root 64 Jun 28 10:48 2 -> /dev/pts/1
lr-x------. 1 root root 64 Jun 28 10:48 3 -> /var/lib/mysql/ibdata1 (deleted)
[root@mysql fd]# ls -l /proc/3113/fd
total 0
lrwx------. 1 root root 64 Jun 28 10:48 0 -> /dev/pts/1
l-wx------. 1 root root 64 Jun 28 10:48 1 -> /dev/null
lrwx------. 1 root root 64 Jun 28 10:48 2 -> /dev/pts/1
lr-x------. 1 root root 64 Jun 28 10:48 3 -> /var/lib/mysql/ibtmp1 (deleted)
[root@mysql fd]# ls -l /proc/3114/fd
total 0
lrwx------. 1 root root 64 Jun 28 10:48 0 -> /dev/pts/1
l-wx------. 1 root root 64 Jun 28 10:48 1 -> /dev/null
lrwx------. 1 root root 64 Jun 28 10:48 2 -> /dev/pts/1
lr-x------. 1 root root 64 Jun 28 10:48 3 -> /var/lib/mysql/ib_logfile0 (deleted)
[root@mysql fd]# ls -l /proc/3115/fd
total 0
lrwx------. 1 root root 64 Jun 28 10:48 0 -> /dev/pts/1
l-wx------. 1 root root 64 Jun 28 10:48 1 -> /dev/null
lrwx------. 1 root root 64 Jun 28 10:48 2 -> /dev/pts/1
lr-x------. 1 root root 64 Jun 28 10:48 3 -> /var/lib/mysql/ib_logfile1 (deleted)

It’s alive!

Now it is time to restore the files to the original locations.

[root@mysql fd]# cp /proc/3112/fd/3 /var/lib/mysql/ibdata1
[root@mysql fd]# cp /proc/3113/fd/3 /var/lib/mysql/ibtmp1
[root@mysql fd]# cp /proc/3114/fd/3 /var/lib/mysql/ib_logfile0
[root@mysql fd]# cp /proc/3115/fd/3 /var/lib/mysql/ib_logfile1

Restore ownership.

[root@mysql fd]# chown mysql:mysql /var/lib/mysql/ib*
[root@mysql fd]#

And restart the database.

[root@mysql fd]# service mysql start
Redirecting to /bin/systemctl start mysql.service
[root@mysql ~]# mysql -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 3
Server version: 5.7.22-22 Percona Server (GPL), Release 22, Revision f62d93c

Copyright (c) 2009-2018 Percona LLC and/or its affiliates
Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> select "Look! It's moving. It's alive. It's alive... It'alive, it's moving, it's alive, it's alive, it's alive, it's alive, IT'S ALIVE!";
+---------------------------------------------------------------------------------------------------------------------------------+
| Look! It's moving. It's alive. It's alive... It'alive, it's moving, it's alive, it's alive, it's alive, it's alive, IT'S ALIVE! |
+---------------------------------------------------------------------------------------------------------------------------------+
| Look! It's moving. It's alive. It's alive... It'alive, it's moving, it's alive, it's alive, it's alive, it's alive, IT'S ALIVE! |
+---------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> exit

The database is up and running!

The sun is shining brightly again… or not!

The approach we’ve seen today for recovering files removed by mistake has some caveats. Strictly speaking, we’re not recovering removed files, we are copying them, which means the filesystem must be able to hold both sets of files (the removed ones and the copies) at the same time. And even with enough storage, when the removed files are large the copy can take a long time to complete.

Also, depending on the file type, it is possible that MySQL closes the removed files because its internal file cache is exhausted, so you need to move fast to recover them. The first thing you must do is start processes that keep the removed files open (one tail -f per removed file; a loop that does this is sketched below). Also, keep monitoring the MySQL error log during the entire recovery.
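
A rough sketch of that first step, assuming mysqld’s pid is 2983 as in the listings above (untested; adapt it before relying on it):

# Keep every deleted file still held by mysqld (pid 2983) open with tail -f
for fd in /proc/2983/fd/*; do
  case "$(readlink "$fd")" in
    *" (deleted)") nohup tail -f "$fd" > /dev/null 2>&1 & ;;
  esac
done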

There are additional strategies to recover from this situation. For example, you can lock the files to prevent changes and flush the dirty pages, then restore the files and restart, as seen in this post (a rough outline follows).
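
A very rough outline of that lock-and-flush idea (the linked post has the full procedure; do not treat this sketch as complete):

mysql> FLUSH TABLES WITH READ LOCK;                 -- block writes
mysql> SET GLOBAL innodb_max_dirty_pages_pct = 0;   -- push InnoDB to flush dirty pages
mysql> SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';  -- wait for this to reach ~0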

Also look at the comment from Ivgeni Segal in the previous link, which presents an interesting method: create a slave from the server with removed files using symbolic links and xtrabackup, then promote the slave to master. This can be a great option when you have large files and need minimal downtime; in that case it is probably the best approach. Another strategy (DISCLAIMER: this is just an idea, I have never tested it) could be to perform an xtrabackup while copying the missing files, and then apply the logs to get a consistent image.

I’ve also seen people actively monitor for missing files. I don’t think this is a good option, because it doesn’t protect your database; it only detects the disaster after it has happened. If you don’t trust yourself and your fellow DBAs, a better option is to run a process that checks all the files and creates hard links to them in a different location. Then you only need to check whether one of the originals is missing, and if so, use the hard link to re-link it back to the original location. But be careful with this approach: any online change could increase the storage requirements, or you could end up hard-linking an old version of the file. A sketch of such a safety net follows.
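
A minimal sketch of that safety net, assuming the link directory sits on the same filesystem as the datadir (remember, hard links cannot cross filesystems) and that the paths are illustrative:

# Hard-link the InnoDB files into a safety-net directory on the SAME filesystem
mkdir -p /var/lib/mysql/.hardlinks
for f in /var/lib/mysql/ib*; do
  ln -f "$f" "/var/lib/mysql/.hardlinks/${f##*/}"
done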

But there are so many things that can go wrong, so many different ways to remove critical data, that I recommend just three things:

  1. Keep your critical databases as small as possible. You don’t want to have your important database down just because you needed to spend a day recovering several TBs of old email logs and historical data.
  2. Do not plan your backups. Plan your recovery. Use several backup strategies and keep at least three kinds of backups: physical, logical and binary logs. Physical is usually the fastest if you need to perform a full recovery. Logical is great if you need to recover one table or even a few rows from one table. And binary logs are the only way to perform a point-in-time recovery (a minimal sketch follows this list). Last, but not least: slaves are good for availability, not for recovery.
  3. Use FIT-ACER. There are so many things that can go wrong that it is almost impossible to plan for each and every one. But you can do something better: work to prevent things from going wrong, as we do every day at Pythian.
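
As promised, a minimal point-in-time recovery sketch; the binary log file name and the timestamp are hypothetical, and in practice you would replay every binary log recorded after your last backup was taken:

# Replay binary log events up to just before the disaster
mysqlbinlog --stop-datetime="2018-07-06 10:00:00" /var/lib/mysql/binlog.000042 | mysql -u root -p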

Finding Values with JSON_CONTAINS

There was an interesting but hard-to-read post on StackOverflow about how to 'insert select delete' data from a MySQL JSON data type column. The first line of the writer's problem is a little confusing: 'In order to record user mac_address and count mac_address to restrict user login's pc or notebook to control user available max for example (it's work)', but the examples revealed more about what was desired. The idea was to track the MAC addresses used by various users, and the author of the question was wondering how to update a JSON array of values with JSON_INSERT. JSON_INSERT is for inserting, and the better choice would be JSON_ARRAY_APPEND or JSON_ARRAY_INSERT.

But what caught my eye was the second question: 'Select sql command for json column ? could be example? If I want to check whether mac value exists 'c84wr8492eda''. Well, here comes a shameless plug for my book MySQL and JSON - A Practical Programming Guide, as it details how to do this sort of thing. What is desired is a certain value (c84wr8492eda) and we can find that easily enough. We know the key that needs to be searched (mac) and the desired MAC address.

mysql> select json_contains(auth_list, json_quote('c84wr8492eda'), '$.mac') from users;

+-------------------------------------------------------------+
| json_contains(auth_list,json_quote('c84wr8492eda'),'$.mac') |
+-------------------------------------------------------------+
|                                                           1 |
+-------------------------------------------------------------+
A better code snippet would be SELECT id FROM users WHERE JSON_CONTAINS(auth_list, JSON_QUOTE('c84wr8492eda'), '$.mac') = 1; as you will probably be acting on the 'id' field of the row with the matching MAC address.


You can find answers to problems like this in my book, available from Amazon.com and other booksellers. The third question, 'Delete sql command for json column ? Could be example? if I want to delete a item where mac value is 'c84wr8492eda'', was also unclear. Delete the entire record, or delete the MAC address from the JSON column? Ah, the joys of StackOverflow.
DELETE FROM users WHERE JSON_CONTAINS(auth_list, JSON_QUOTE('c84wr8492eda'), '$.mac') = 1; would remove the entire row. But what about pruning the one item out of the array? Well, there is no JSON_ARRAY_REMOVE_ONE_ITEM function. I would want to get the values of auth_list into a string, remove the desired MAC address, and then use JSON_SET to rewrite the column. But if you have other options, please let me know!
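That said, one option worth sketching (untested, and assuming auth_list looks like {"mac": ["c84wr8492eda", ...]}) is to let JSON_SEARCH find the path of the matching array element and JSON_REMOVE prune it:

UPDATE users
   SET auth_list = JSON_REMOVE(
           auth_list,
           -- JSON_SEARCH returns a quoted path such as "$.mac[1]";
           -- JSON_UNQUOTE turns it into $.mac[1] for JSON_REMOVE
           JSON_UNQUOTE(JSON_SEARCH(auth_list, 'one', 'c84wr8492eda', NULL, '$.mac')))
 WHERE JSON_CONTAINS(auth_list, JSON_QUOTE('c84wr8492eda'), '$.mac');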


Percona Toolkit 3.0.11 Is Now Available

Percona announces the release of Percona Toolkit 3.0.11 on July 6, 2018.

Percona Toolkit is a collection of advanced open source command-line tools, developed and used by the Percona technical staff, that are engineered to perform a variety of MySQL®, MongoDB® and system tasks that are too difficult or complex to perform manually. With over 1,000,000 downloads, Percona Toolkit supports Percona Server for MySQL, MySQL®, MariaDB®, Percona Server for MongoDB and MongoDB.

Percona Toolkit, like all Percona software, is free and open source. You can download packages from the website or install from official repositories.

This release includes the following changes:

New Features:
  • PT-1571: Improved hostname recognition in pt-secure-collect
  • PT-1569: Disabled --alter-foreign-keys-method=drop_swap in pt-online-schema-change
  • PT-242: (pt-stalk) Include SHOW SLAVE STATUS on MySQL 5.7 (Thanks Marcelo Altmann)
Fixed bugs:
  • PT-1570: pt-archiver fails to detect columns with the word *GENERATED* as part of the comment
  • PT-1563: pt-show-grants fails for MySQL 5.6, producing an error which reports that an unknown column account_locked has been detected.
  • PT-1551: pt-table-checksum fails on MySQL 8.0.11
  • PT-241: (pt-stalk) Slave queries don’t run on MySQL 5.7 because the FQDN was missing (Thanks Marcelo Altmann)
Breaking changes:

Starting with this version, pt-query-digest stores query checksums as the full MD5 value in a CHAR(32) column instead of storing just the least significant bytes of the checksum in a BIGINT column. The reason for this change is that storing only the least significant bytes was producing inconsistent results in MySQL 8 compared to MySQL 5.6+.
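
To illustrate the difference between the two storage formats (this is only an illustration, not the tool's actual code):

-- Old format: the least significant 8 bytes of the MD5, stored as a BIGINT
SELECT CAST(CONV(RIGHT(MD5('select 1'), 16), 16, 10) AS UNSIGNED) AS old_checksum;
-- New format: the full 128-bit MD5, stored as CHAR(32)
SELECT MD5('select 1') AS new_checksum;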

pt-online-schema-change in MySQL 8:

Due to a bug in MySQL 8.0+, it is not possible to use the drop_swap method to rebuild constraints, because renaming a table will result in losing its foreign keys. You must specify a different method explicitly.

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system.

The post Percona Toolkit 3.0.11 Is Now Available appeared first on Percona Database Performance Blog.

Another Day, Another Data Leak

In the last few days, there has been information released about yet another alleged data leak, placing in jeopardy “…[the] personal information on hundreds of millions of American adults, as well as millions of businesses.” In this case, the “victim” was Exactis, for whom data collection and data security are core business functions.

Some takeaways from Exactis

Please excuse the pun! In security, we have few chances to chuckle. In fact, as a Security Architect, I sigh deeply when I read about this kind of issue. Firstly, it’s preventable. Secondly, I worry that if an organization like Exactis is not getting it right, what chance does the rest of the world have?

As the Wired article notes, the tool https://shodan.io/ can be revealing and is well worth a look. For example, you can see there are still MANY Elasticsearch systems exposed to the public internet here. Why not use Shodan to check what everyone else in the world can see openly on your systems?

Securing databases

Databases in themselves do not need to be at risk, as long as you take the necessary precautions. We discussed this in this blog post that I co-authored last year.

As far as I can discern, had the setup in this latest alleged gaffe made use of iptables or a similar feature, the breach could not have occurred.
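
As a minimal sketch of the idea (10.0.0.5 stands in for your own application host; 3306 is the default MySQL port):

# Accept MySQL connections only from the trusted application server…
iptables -A INPUT -p tcp --dport 3306 -s 10.0.0.5 -j ACCEPT
# …and drop the port for everybody else.
iptables -A INPUT -p tcp --dport 3306 -j DROP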

With immaculate timing, my colleague Marco Tusa wrote a post last month on how to set up iptables for Percona XtraDB Cluster, and if you are not sure if or how that applies to your setup, it is definitely worth a read. In fact, you can access all of our security blog posts if you would like some more pointers.

Of course, security does not stop with iptables. Application developers should already be familiar with the need to avoid SQL injection, and there is a decent SQL injection prevention cheat sheet here, offered by The Open Web Application Security Project (OWASP). Even if you don’t fully understand the technical details, a cheat sheet like this might help you to ask the right questions for your application.
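
As a minimal illustration of the idea behind that cheat sheet, here is a parameterized query using MySQL's own server-side prepared statements (the table and column names are hypothetical); the user-supplied value travels as data, never as SQL text:

PREPARE stmt FROM 'SELECT id FROM users WHERE email = ?';
SET @email = 'user@example.com';   -- attacker-controlled input stays a plain string
EXECUTE stmt USING @email;
DEALLOCATE PREPARE stmt;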

MySQL resources

For a more in-depth look at MySQL security, I have two talks up on YouTube. The first of these is a twenty-minute presentation on hardening MySQL and the second on web application security and why you really should review yours. You could also check out our recorded webinar Security and Encryption in the MySQL world presented by Dimitri Vanoverbeke.

MongoDB resources

Of course, security challenges are not unique to SQL databases. If you are a MongoDB user, this webinar MongoDB Security: Making things secure by default might be of interest to you. Or perhaps this one on using LDAP Authentication with MongoDB? Adamo Tonete presents both of these webinars.

For a more widely applicable view, you could try Colin Charles’ recent webinar too.

There are always consequences

As Exactis are no doubt discovering, managing the fallout from such a breach is a challenge. If you are not sure where you stand on security, or what you can do to improve your situation, then audit services such as those we offer could prove to be a valuable investment.

Finally, some of you will be lucky enough to have someone dedicated to IT security in your organizations. Next time you see them, instead of avoiding their steely stare, why not invite them for a coffee* and a chat? It could be enlightening!

*Beer or scotch is almost always accepted too…

The post Another Day, Another Data Leak appeared first on Percona Database Performance Blog.
