Back when the first version of the MariaDB Java Client was released, someone asked in the comments about the performance characteristics of the driver compared to ConnectorJ. I answered with hand-waving, saying that nobody does anything stupid, the performance of the drivers would be roughly the same, but I promised to measure it and tell the world one day. And now that day has come. The day where three MySQL JDBC drivers (ConnectorJ, MariaDB JDBC, and Drizzle JDBC) are compared against each other. Unlike the server, which gets benchmarking attention all the time, there is no standard benchmark for connectors, so I needed to improvise, while trying to keep the overhead of the server minimal. So I did something very primitive to start. I used my two favorite queries:

  • DO 1 — this one does not retrieve a result set, and thus can be seen as a small “update”.
  • SELECT 1 — the minimal SELECT query.

The test program runs a query N times, and if the query was a select, it retrieves all values from the result set, using ResultSet.getObject(i), and calculates the queries-per-second value. (The best thing is that the test program is single-threaded, and how often does one get to run single-threaded tests? :)  the test was run on my own workstation, which runs Windows Server 2008 R2, and I have useConfigs=maxPerformance in the URL for ConnectorJ.

Results (Queries per second,  unprepared)

ConnectorJ-5.1.24 MariaDB-JDBC-1.1.2 Drizzle-JDBC-1.3-SNAPSHOT
DO 1 19543 22104 15288
SELECT 1 17004 19305 13410

jdbc_fast_queries

 

MariaDB JDBC appears to be a little faster (~10%) than  ConnectorJ, and much faster (~30%) than Drizzle JDBC.

Can ConnectorJ do better? I bet it can. Looking into profiler output – CPU profiling, instrumentation mode in NetBeans – for  a test that executes “SELECT 1″ in a loop,  shows com.mysql.jdbc.StatementImpl.findStartOfStatement() taking 7.5% of runtime. Ok, instrumentation results should be taken with a grain of salt, however the single reason string search is used, is because - if an update (DML) statement is executed inside ResultSet.executeQuery(), it is rejected with an exception. This can be done differenty, I believe. If absolutely necessary, throwing an exception can be delayed, until the client finds out that the server sent an OK packet instead of a result set.

Even more interesting is the case with Drizzle JDBC. In theory, since the MariaDB driver has a Drizzle JDBC heritage, the performance characteristics should be similar, but they are not, so there must be a bug somewhere. It appears very easy to find, as according to profiler, 50.2% CPU time (take that number with a big grain of salt) is spent in a function that constructs a hexdump from a byte buffer. Looking at the source code, we find following line that is unconditionally executed:

log.finest("Sending : " + MySQLProtocol.hexdump(byteHeader, 0));

While the result of the hexdump is never used (unless logging level is FINEST), the dump string is still created, using relatively expensive Formatter routines, concatenated with the String “Sending:”, and then thrown away… In Markus’ defense, hexdump() is not his fault, it was contributed 3 years ago. But it remained undetected for 3 years. This bug is now filed  https://github.com/krummas/DrizzleJDBC/issues/21 [UPDATE: this bug was resolved within hours  after reporting]

So, let’s check how much we can gain by putting the offending code into an if (log.getLevel() == java.util.logging.Level.FINEST) condition.
The QPS from “DO 1″ raises from 15288 to 19968 (30%), and for “SELECT 1″ we have increase from 13410 to respectable 16824 (25%). Not bad for a single line fix.
jdbc_fast_queries_drizzle_fix

While the one-liner makes the Drizzle JDBC faster, with slightly better numbers than ConnectorJ, it is still not as fast as MariaDB.

In the MariaDB JDBC connector, there were a couple of improvements to performance which were made since forking. One of the early improvements was to avoid copying data unnecessarily when sending, and to decrease the number of byte buffers.  Another improvement came recently, after profiling and finding that parsing Field packets is expensive (mostly due to the construction of Strings for column name, aliases, and etc…). The improvement was lazy parsing,  delaying string construction, and avoiding it entirely in most cases. For example, if column names are not used, and rows are accessed using integer indexes in ResultSet.getXXX(int i), the metadata won’t be fully parsed. Also, perhaps there were some other fixes that I do not remember anymore. :)

Can we further increase the QPS?

We can try. First, statements can be prepared. MariaDB and Drizzle so far only provide client-side prepared statements (ConnectorJ can do both client and server-side prepared statements) but using them saves having to convert the query to bytes, and JDBC escapes preprocessing. From now on I’ll stay just with “DO 1″ which proved to be the fastest query. Trying it on MariaDB driver shows some minimal QPS increase 22104 (not prepared) vs 22183 (prepared), or 0.3%. Slightly more on ConnectorJ (19543 vs 20096, or 2.9%). Nothing revolutionary so far.

But, We still have not used all of the options in this (admittedly silly) quest for maximizing the performance of “DO 1″. Recall that ConnectorJ can support named pipes on Windows, which are allegedly much faster than TCP connections. Restart server with named pipe, set JDBC URL to “jdbc:mysql:///?socketFactory=com.mysql.jdbc.NamedPipeSocketFactory&namedPipePath=\\\\.\\Pipe\\MySQL&user=root&useConfigs=maxPerformance”, and rerun the test with 1000000 prepared queries. Now the QPS grew to 29542! That is strong, and is a 33% improvement compared to the best result seen so far. Yet, unfortunately, still no cigar, since JVM dumps a stack trace when the named pipe connection is closed. This is a “Won’t fix” (chalked off as a JVM problem) MySQL bug Bug#62518, which renders named pipe support almost useless – though maybe there is a trick to shut up th JVM somehow in this case, but I do not know of such a trick.

How fast is C client library in comparison?

Out of curiosity, I also tested how the native client compares to JDBC. With the TCP protocol, it does slightly better than the fastest JDBC (MariaDB, prepared), but it is not a huge margin – 24063 QPS vs 22183 (8.5% difference), and I believe Java drivers could improve further.
With named pipe, QPS is 33122, which is ~12% better than what ConnectorJ could do, if pipes worked properly there.

 

Accessing benchmark program

I put the benchmark program on Launchpad, together with the drivers. If you’re on Windows, and if you have a server running on port 3306, and the ‘root’ user doesn’t have a password, you can just branch the repository and run bench_all.bat. Those of you who are using other operating systems, I trust you to be able to quickly rewrite the batch files as shell scripts.

We haven’t posted any Windows benchmarks for a while, and MariaDB for Windows contains some specific improvements which might not be widely know since we haven’t talked much about them yet. This post is an attempt to fix that. We’ll also share current MySQL 5.5 numbers.

My setup is an 8 core 2 socket server (yes, a little bit dated for today, but it is the best machine I have at my disposal), 10K SAS disks with RAID1. I ran sysbench 0.4 single table / 1,000,000 records. I ran the benchmark over a network, with the number of concurrent clients ranging from 4 to 4096.

Here is what OLTP-readonly throughput looks like:

  • For most of the tests, MariaDB’s throughput is approx 10% higher than MySQL’s
  • For 4096 concurrent clients, MariaDB’s throughput is better than MySQL 5.5 by 476% (2382 vs 413 TPS).

Many people rightfully remark that throughput does not necessarily represent performance as a whole, and that in OLTP benchmarks, fairness also matters. It actually often matters more than throughput. I agree, so below are the results for 95% of response time, meaning that 95% transactions were finished under the given time.

OLTP Readonly response time, 95th percentile

Concurrent clients 4 8 16 32 64 128 256 512 1024 2048 4096
MariaDB 4.87 6.81 8.83 12.35 22.12 43.56 90.35 180.57 619.05 1003.88 1965.77
MySQL 4.86 7.14 9.96 16.21 37.39 101.33 238.89 499.63 971.07 2241.83 25215.29

As the above table shows, MariaDB 5.5 outperforms MySQL 5.5 in both row throughput, and in fairness. Fairness (response time) results are in fact more impressive than the row throughput.

But, why does MariaDB perform better than MySQL 5.5 on Windows in readonly tests? Why it is at all better than 5.5? It is, after all, based on the same codebase, so performance should be about equal. The answer to this is not our optimizer enhancements, nor XtraDB serving as InnoDB in MariaDB. The answer is the MariaDB threadpool. This is my feature in 5.5 and I’m glad it performs so well, and I’m glad that it is switched “on” by default on Windows (so, Windows users, you do not have to do anything to switch it on).

Still, why does using threadpool result in such better performance? The answer is that the implementation delegates all responsibility for sizing the pool and running callbacks to the Windows native threadpool, so it is a sort of black box inside the OS that just delivers good results. The heart of the Windows native threadpool, if you want to know, is the good old IO completion port, something that Windows has had since the NT 3.5 days. It’s a unique Windows feature, and something that any server application running on this platform is using or should be using. And the tricks that make it run well are:

  • It does not let too many threads run “on CPU” at the same time (this reduces context switches). Reducing context switches is, in my experience, the single most important factor to increasing throughput.
  • It activates threads waiting on completion LIFO order — hot threads remain hot, which reduces cache misses
  • IO completions are processed in FIFO order — this is why fairness performance is so good
  • And yes, it naturally reduces contention on hot locks.

Enough about readonly performance already, and about threadpool.

The next interesting question is whether MariaDB can do as well on write workloads. So let’s run the sysbench in the ultimate “write-only” mode, also known as update_non_index (every query only increments a non-indexed int column). To maximize the write throughput, I set the innodb_flush_log_at_trx_commit parameter to 0, so the log is flushed once a second rather than each time at transaction commit.

Here are the results:

OLTP write-only (update_non_index/flush_log=0) throughput:

This looks pretty good. The difference could be a combination of many factors. XtraDB write performance, group commit, and threadpool could all have a positive (for MariaDB) influence on the results. However I think the Windows version of MariaDB’s asynchronous IO optimization (it, again, uses completion ports) is the main reason for the big difference here. I, actually, first saw the difference when I was implementing the feature.

All IO related parameters and Innodb parameters in the test above are default. The result looks almost too good to be true, so I really wanted to bring MySQL to the point it can achieve better IO throughput by varying innodb_io_capacity and/or innodb_write_io_threads, to no avail. Does someone know the right trick here?

OLTP writeonly (update_non_index/flush_log=0) response time, 95 percentile

Concurrent clients 4 8 16 32 64 128 256 512 1024 2048
MariaDB 0.33 0.63 0.75 1.06 1.94 3.85 8.25 21.38 129.79 274.40
MySQL 0.32 0.61 0.73 1.61 7.62 26.82 96.45 219.29 661.19 2723.36

For completeness, here are database parameters I’m using.

[mysqld]
sql-mode="NO_ENGINE_SUBSTITUTION"
back_log=500
user=root
port=3306
max_connections=4096
max_connect_errors=5000
max_prepared_stmt_count=50000
table_cache=2048
transaction_isolation=REPEATABLE-READ
loose-skip-external-locking
innodb_status_file=0
innodb_data_file_path=ibdata1:200M:autoextend
innodb_buffer_pool_size=1G
innodb_additional_mem_pool_size=20M
innodb_log_file_size=650M
innodb_log_buffer_size=100M
innodb_support_xa=0
innodb_doublewrite=0
innodb_flush_log_at_trx_commit=0
query-cache-size=0
query-cache-type=0
symbolic-links=0
skip-grant-tables