This article describes how I tuned MariaDB to give the best write throughput with SSD-based storage.

When you have a write-heavy application writing into InnoDB, you will probably experience the InnoDB Checkpoint Blues. The effect manifests as stalls: short periods of time where the throughput falls to zero and I/O activity goes crazy. The phenomenon is well known and has been described elsewhere, e.g. here. More background about checkpointing can be found here.

The XtraDB fork of the InnoDB engine (and heart of Percona Server) contains patches aimed at overcoming this odd behavior. MariaDB uses XtraDB as its default InnoDB implementation, so we can configure some extra variables and hopefully avoid the checkpoint blues.

The first and most important setting is innodb_io_capacity. This is the approximate number of write operations per second that your hardware can do. If you don't know that number, you can easily find it out: wait until you experience a stall and run iostat -x. You should see fairly high numbers for wrqm/s and w/s on the device holding your InnoDB table spaces. w/s is the number of write requests that hit the device, wrqm/s is the number of requests that could be merged. The sum of both is what InnoDB was effectively using.
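For illustration, such a measurement could look like the snippet below; the device name sdb and the 10-second interval are placeholders for whatever fits your setup.

    # during a stall, sample extended device statistics every 10 seconds
    # (replace sdb with the device that holds your InnoDB table spaces)
    iostat -x sdb 10

    # relevant columns in the output:
    #   w/s    - write requests issued to the device per second
    #   wrqm/s - write requests merged per second
    # the sum of both approximates the write rate InnoDB was effectively using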

In my case the disk is a RAID-0 of 3 SAS SSDs and I have seen up to 25,000 writes per second. So I set innodb_io_capacity=20000 and left everything else at the default:
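In the configuration file this is a single line; the 20000 is of course specific to my RAID-0, so put in whatever you measured for your own hardware:

    [mysqld]
    # approximate write IOPS the storage can sustain (measured with iostat -x)
    innodb_io_capacity = 20000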

Here we can already see that the stall is gone. However, after ~370 seconds we see a rather heavy I/O spike. This is when XtraDB starts eager flushing, once the checkpoint age reaches 75% of the log group capacity.

Now let's see if we can do better. The Percona Server documentation claims that innodb_adaptive_flushing_method=keep_average "… is designed for use with SSD cards". So let's try that next:
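For reference, this is an XtraDB-specific variable (it does not exist in stock InnoDB), and the experiment boils down to one line in my.cnf; to my understanding the value to revert to is the default, estimate:

    [mysqld]
    # XtraDB-only flushing algorithm, advertised for SSD storage
    innodb_adaptive_flushing_method = keep_average
    # revert to the default after the experiment:
    # innodb_adaptive_flushing_method = estimate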

FAIL! Let's quickly revert that to the default.

The next thing to look at is neighbor page flushing. This works like so: InnoDB pages are organized in blocks of 64 pages. When the checkpointing algorithm has picked a dirty page to be written to disk, it checks whether there are more dirty pages in the block and, if so, writes all those pages at once. The rationale is that with rotating disks the most expensive part of a write operation is the head movement. Once the head is over the right track, it does not make much difference whether we write 10 or 100 sectors.

The default setting is innodb_flush_neighbor_pages=area. For SSDs, however, there is no head movement penalty, so it makes sense to experiment with the other settings. Let's try innodb_flush_neighbor_pages = cont next:

We still see a write spike around 400s run time, but it's less pronounced than last time. Now let's finally try innodb_flush_neighbor_pages=none:

And here we are! For SSD-based storage the best parameter combination seems to be: innodb_flush_neighbor_pages = none, innodb_io_capacity = whatever your hardware can do, and all others at the defaults.
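Put together, the my.cnf fragment for this SSD box looks roughly like this (the io_capacity value again being the one measured above):

    [mysqld]
    # write IOPS the storage can sustain, measured with iostat -x under load
    innodb_io_capacity          = 20000
    # SSDs have no head movement penalty, so skip neighbor page flushing
    innodb_flush_neighbor_pages = none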

If the storage uses rotating disks, we are probably better off with innodb_flush_neighbor_pages = cont or even innodb_flush_neighbor_pages = area, but at the moment I don’t have the hardware to test that.

Disclaimer: the above graphs use the sysbench multi-table OLTP read/write benchmark. Data set size was 10G, the InnoDB buffer pool 16G, and the InnoDB total log capacity 4G.

A few days ago MariaDB, MySQL and Percona each released a new version of their 5.5 server. So I decided it's time to run sysbench once more and compare the OLTP performance. The test candidates are:

  • MariaDB-5.5.24, using either XtraDB (default) or InnoDB
  • MySQL-5.5.25
  • Percona Server 5.5.24-26.0

For the benchmarks I used our trusty old pitbull machine, which has 24 CPU cores, 24G of RAM and a nice RAID-0 composed of 3 SAS SSDs.

The benchmark was sysbench-0.5 multi-table OLTP, using 8 tables with 10G of data in total. The InnoDB buffer pool was 16G, the InnoDB log group capacity 4G (the maximum for MySQL). The mysqld process was constrained to 16 cores, sysbench to the other 8.
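To give an idea of such a setup, a run might be started roughly as below; the lua script path, socket path, table size and thread count are illustrative, not copied from my scripts (mysqld itself was started under taskset -c 0-15).

    # prepare 8 sysbench tables; the row count is a rough guess for ~10G in total
    sysbench --test=tests/db/oltp.lua --mysql-socket=/tmp/mysql.sock --mysql-user=root \
             --oltp-tables-count=8 --oltp-table-size=5000000 prepare

    # run read/write OLTP with 16 client threads, pinned to the cores not used by mysqld
    taskset -c 16-23 \
    sysbench --test=tests/db/oltp.lua --mysql-socket=/tmp/mysql.sock --mysql-user=root \
             --oltp-tables-count=8 --oltp-table-size=5000000 \
             --num-threads=16 --max-time=600 --max-requests=0 run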

The only nonstandard tuning was for I/O scalability: innodb_io_capacity = 20000 and innodb_flush_neighbor_pages = none (for Percona Server and MariaDB/XtraDB). See my recent article on I/O tuning for an explanation.

The result for read-only OLTP is nicely balanced. This plot shows the median throughput:

The response time is also nicely flat (the plot shows the 99% quantile, meaning at most 1% of the requests were slower than that):

In fact the behavior shows nearly perfect scaling: up to 16 threads (= the number of CPU cores for mysqld) the response time is constant. For higher concurrency we see a straight line in the double-logarithmic plot.

For read/write OLTP the picture looks quite similar, if we look at the transactions per second graph first:

The picture changes when we look at the response time (99% quantile again). The InnoDB-based servers exhibit severe stalls caused by checkpointing as soon as we approach 16 client threads:

The XtraDB-based servers handle the write load much better thanks to their better checkpointing algorithm.

Conclusion: if your workload is read-mostly OLTP, then it makes little difference performance-wise whether you choose MariaDB, MySQL or Percona Server. So you can base your decision on other factors: cost, additional features (in MariaDB and Percona Server) or your in-house know-how.

The servers based on InnoDB (MySQL, and optionally MariaDB) show slightly higher throughput, but suffer checkpointing stalls at high write load. MariaDB gives you the highest flexibility here, as you can choose between XtraDB and InnoDB.

Disclaimer: the raw benchmark results as well as the scripts and config files used can be found in the public mariadb-benchmarks repository at Launchpad. There are also additional plots.

When I published the MariaDB-5.3.4 sysbench results I said "if your workload includes complex (sub)queries, then you will probably benefit more from MariaDB's new optimizer features". Today I will present some benchmark results for such a complex workload.

The benchmark is DBT3, an implementation of the TPC-H specification. DBT3 is written in C and hosted at Sourceforge.

The DBT3 benchmark can run at different scale factors, which define the size of the database. I used a scale factor of 30, which yields ~30GB of raw data and ~48GB of disk footprint. The machine running the benchmark had 16G of memory.

InnoDB has the problem of fluctuating table statistics, leading to rather unpredictable query plans. For this reason the benchmark tables were created with the MyISAM engine. I will definitely run this benchmark vs. InnoDB in the future, but this requires some changes in our DBT3 automation scripts.

Both MySQL and MariaDB had to be tuned to use the new optimizer features. Specifically, I turned on multi range read (MRR) and batched key access (BKA). The MRR cost estimation was turned off because it is not production-ready in either MySQL-5.6 or MariaDB. For details see the tarball with the configuration files.
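The exact option names differ between the two servers; as a rough sketch of what such tuning looks like (reconstructed from memory of the 5.6 and 5.3 option names, not copied from my config files):

    # MySQL-5.6, my.cnf [mysqld] section: enable MRR and BKA, disable the MRR cost model
    optimizer_switch = 'mrr=on,mrr_cost_based=off,batched_key_access=on'

    # MariaDB-5.3, my.cnf [mysqld] section: MRR with key sorting, cost model off,
    # and a raised join cache level so BKA join buffers may be used
    optimizer_switch = 'mrr=on,mrr_cost_based=off,mrr_sort_keys=on'
    join_cache_level = 8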

The following concentrates on the DBT3 Power Test, which consists of 22 rather heavy JOIN queries. Each query was run 5 times on cold caches (server restart + file system cache cleared). Query execution time was limited by a 2-hour timeout. Execution times vary between 10 seconds and more than 2000 seconds, so for the following picture the execution time was normalized (MariaDB = 100%). The colored bar shows the median and the whiskers show min and max:

DBT3 Power Test, large scale

Here we can already see that MySQL-5.5 cannot cope well with this type of load. MySQL-5.6 has improved on many queries, but for queries 3 and 8 it is even worse than 5.5. Query 22, where MySQL 5.5 comes in first, remains rather a mystery.

For more details, here is a table with the numbers. I have colored the rows to mark the winner (same colors as in the diagram). If the difference is below 5%, it's a draw; those rows are not colored.

Query | MariaDB-5.3.5 Seconds | MySQL-5.6.4 Seconds | vs. MariaDB | MySQL-5.5.21 Seconds | vs. MariaDB
1     | 289   | 286   | -1.04%  | 319   | +10.38%
2     | 46    | 48    | +4.35%  | 490   | +965.2%
3     | 243   | 1322  | +444%   | 534   | +119.7%
4     | 138   | 141   | +2.17%  | 185   | +34.06%
5     | 187   | 1232  | +558.8% | >7200 | n/a
6     | 199   | 191   | -4.02%  | 284   | +42.71%
7     | 861   | 777   | -9.76%  | 803   | -6.74%
8     | 288   | 1628  | +465.3% | 742   | +157.6%
9     | 268   | 307   | +14.55% | >7200 | n/a
10    | 818   | 1504  | +83.86% | 1083  | +32.40%
11    | 13    | 14    | +7.69%  | 342   | +2531%
12    | 213   | 206   | -3.29%  | 452   | +112.2%
13    | 250   | 230   | -8.00%  | 1576  | +530.4%
14    | 90    | 92    | +2.22%  | >7200 | n/a
15    | 194   | 190   | -2.06%  | 401   | +106.7%
16    | 129   | 149   | +15.50% | 148   | +14.73%
17    | 800   | 797   | -0.38%  | 867   | +8.38%
18    | 284   | >7200 | n/a     | >7200 | n/a
19    | 62    | 61    | -1.61%  | 2013  | +3147%
20    | >7200 | >7200 | n/a     | >7200 | n/a
21    | 185   | 182   | -1.62%  | >7200 | n/a
22    | 13    | 14    | +7.69%  | 10    | -23.08%

Query 11 is a draw because the difference is only 1 second (the granularity of time measurement). Query 11 showed a strange phenomenon for MariaDB: the first run took 43 seconds, subsequent runs only 13 seconds. This is probably a caching effect of the disk controller.

Neither MySQL version finishes query 18 within the time limit. None of the 3 candidates finishes query 20 within the limit.

Query execution plans differed between the candidates. Sometimes I could tweak the results by forcing the known good join order (using STRAIGHT_JOIN). For example, on query 7 MariaDB used a different join order; when I forced the MySQL plan, MariaDB needed only ~70% of the MySQL time. And for query 12, MySQL 5.5 used a different plan than the others; after forcing the better join order, all three came in at the same time.
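To illustrate the mechanism (using the TPC-H customer/orders/lineitem tables, not the actual DBT3 query text): STRAIGHT_JOIN makes the optimizer join the tables exactly in the order they are written, so a known good plan can be forced by hand.

    -- join order is forced to customer -> orders -> lineitem instead of
    -- whatever the optimizer would have chosen on its own
    SELECT STRAIGHT_JOIN c_custkey, SUM(l_extendedprice)
    FROM customer
    JOIN orders   ON o_custkey  = c_custkey
    JOIN lineitem ON l_orderkey = o_orderkey
    GROUP BY c_custkey;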

Finally I have another diagram, this time showing the details for the region around 100%.

DBT3 power test, small scale

If you want to run the DBT3 benchmark yourself, you can use our DBT3 scripts. I have tarred up the configuration files for those scripts and the extracted raw data for download here.