Introduction

When you delete rows, for example, they are only marked as deleted, not physically removed from the indexes, and the freed space is not returned to the operating system for later reuse. The purge thread will eventually delete the index keys and rows physically, but the freed space is still not returned to the operating system, and the operation can leave holes in the pages. If you have variable-length rows, this can lead to a situation where the free space cannot be used for new rows (if the new rows are larger than the old ones). The user can run OPTIMIZE TABLE or ALTER TABLE <table> ENGINE=InnoDB to rebuild the table.

Unfortunately, running OPTIMIZE TABLE against an InnoDB table stored in the shared tablespace file ibdata1 does two things:

  • Makes the table’s data and indexes contiguous inside ibdata1
  • Makes ibdata1 grow because the contiguous data and index pages are appended to ibdata1

New defragmentation

In MariaDB 10.1 we have merged Facebook's defragmentation code, prepared for MariaDB by Matt, Seong Uck Lee from Kakao. The only major difference from Facebook's code and Matt's patch is that in MariaDB we decided not to introduce new keywords to SQL and made no changes to server code. Instead, we use the existing OPTIMIZE TABLE statement, and all code changes are inside the InnoDB/XtraDB storage engines. To enable this new feature you need to add the following to my.cnf (this requirement exists to keep the original behavior of OPTIMIZE TABLE for those users who need it).
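
A minimal sketch of the required my.cnf entry, assuming only the innodb_defragment switch described below needs to be turned on:

    [mysqld]
    # enable the new OPTIMIZE TABLE based defragmentation in InnoDB/XtraDB
    innodb-defragment = 1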

This new defragmentation feature works in place, so no new tables are created and there is no need to copy data from the old table to a new one. Instead, the feature loads n pages and tries to move records between them so that pages become full of records, and then frees the pages that are completely empty after the operation.

New configuration variables

  • innodb_defragment: Enable/disable InnoDB defragmentation. When set to FALSE, all existing defragmentation is paused and new defragmentation commands fail. Paused defragmentation commands resume when this variable is set back to TRUE. Default value is FALSE.
  • innodb_defragment_n_pages: Number of pages considered at once when merging multiple pages to defragment. Range of 2–32 and default is 7.
  • innodb_defragment_stats_accuracy: How many defragmentation stats changes are accumulated before the stats are written to persistent storage. Setting it to 0 disables defragmentation stats tracking. Default is 0.
  • innodb_defragment_fill_factor_n_recs: How many records' worth of space defragmentation should leave free on the page. This variable, together with innodb_defragment_fill_factor, is introduced so that defragmentation does not pack the page so full that the next insert causes a page split on every page. The variable indicating the larger defragmentation gain is the one that takes effect. Range is 1–100 and default is 20.
  • innodb_defragment_fill_factor: A number in the range [0.7, 1] that tells defragmentation how full it should fill a page. Default is 0.9. Values below 0.7 do not make much sense. This variable, together with innodb_defragment_fill_factor_n_recs, is introduced so that defragmentation does not pack the page so full that the next insert causes a page split on every page. The variable indicating the larger defragmentation gain is the one that takes effect.
  • innodb_defragment_frequency: Do not defragment a single index more than this number of times per second. This controls the number of times the defragmentation thread can request an X_LOCK on an index. The defragmentation thread will check whether 1/innodb_defragment_frequency seconds have passed since it last worked on this index, and put the index back into the queue if not enough time has passed. The actual frequency can only be lower than this given number.
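
The current values of these options can be checked on a running server with a pattern match; a small sketch (no particular output is assumed):

    SHOW GLOBAL VARIABLES LIKE 'innodb_defragment%';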

New status variables

  • Innodb_defragment_compression_failures: Number of defragment re-compression failures.
  • Innodb_defragment_failures: Number of defragment failures.
  • Innodb_defragment_count: Number of defragment operations.
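
Since these are ordinary global status variables, they can be inspected in the same way; for example:

    SHOW GLOBAL STATUS LIKE 'Innodb_defragment%';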

Example

After CREATE TABLE and INSERT operations we can see the following from INFORMATION_SCHEMA:
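
The original INFORMATION_SCHEMA output is not reproduced here. As an illustrative sketch (the table name t1 is hypothetical), the number of PRIMARY index pages currently in the buffer pool can be counted like this:

    SELECT COUNT(*) AS primary_index_pages
      FROM information_schema.INNODB_BUFFER_PAGE
     WHERE TABLE_NAME LIKE '%t1%'
       AND INDEX_NAME = 'PRIMARY';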

Now, if we delete 3/4 of the records, this will leave holes in the pages, and then we run OPTIMIZE TABLE to execute defragmentation:
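
A hedged sketch of that step, again assuming a table t1 with an integer primary key column a (the delete predicate is only illustrative):

    DELETE FROM t1 WHERE a % 4 <> 0;  -- removes roughly 3/4 of the rows, leaving holes in the pages
    OPTIMIZE TABLE t1;                -- with innodb_defragment=1 this defragments the table in place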

After this we can see that some pages have been freed and some pages have been merged:

Links

WebScaleSQL Git repository https://github.com/webscalesql/webscalesql-5.6

Facebook Percona Live presentation: https://www.percona.com/live/mysql-conference-2014/sites/default/files/slides/defragmentation.pdf

Introduction

Online DDL is a new feature in MariaDB 10.0. It is processed through the four tasks below, in sequence:

  1. InnoDB::ha_prepare_inplace_alter_table(..)
  2. InnoDB::ha_inplace_alter_table(..)
  3. InnoDB::ha_commit_inplace_alter_table(..)
  4. mysql_rename_table(..)

The InnoDB storage engine allocates a temporary memory buffer for transaction logging in phase 1, where row changes made during this phase are logged. The size of this buffer starts at sort_buffer_size and it can grow up to innodb_online_alter_log_max_size. During phase 2 the thread processing the ALTER statement copies the old table's rows to the new, altered table. After this, MariaDB takes an exclusive lock on the target table and applies the row log buffer to the new, altered table.
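
For reference, a sketch of how the row log limit can be inspected or raised at runtime (the 256 MiB value is only an example):

    -- check the current limit for the online ALTER row log
    SHOW GLOBAL VARIABLES LIKE 'innodb_online_alter_log_max_size';
    -- raise it, for example to 256 MiB, if heavy concurrent DML is expected during the ALTER
    SET GLOBAL innodb_online_alter_log_max_size = 268435456;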

This introduces a new, unpredictable failure case: row log buffer overflow. The MariaDB server will roll back the ALTER statement if the row log buffer overflows. Thus, there are the following problems:

  • If the row log buffer is too small, the ALTER statement is rolled back and you have wasted precious time and resources.
  • If the row log buffer is too big, you have wasted precious main memory that could be used, for example, for the buffer pool.
  • Currently, there is no way to see how much of the row log buffer is used and how much space is free.
  • Currently, there is not even an estimate of how much work has been done and how much is still to be done.
  • Currently, the merge sort phase can also take a long time, and there is no progress information.

Improvements

There are two improvements in MariaDB 10.1: new status variables and progress information for online DDL.

New status variables and progress info

MariaDB Corporation would like to thank Matt, Seong Uck Lee from Kakao for contributing a patch that has now been merged into MariaDB 10.1.

First of all, there are three new global status variables.

  • Innodb_onlineddl_rowlog_rows: Shows how many rows are stored in the row log buffer.
  • Innodb_onlineddl_rowlog_pct_used: Shows row log buffer usage as a 5-digit integer (10000 means 100.00%).
  • Innodb_onlineddl_pct_progress: Shows the progress of the in-place ALTER TABLE. It may not be very accurate, because in-place ALTER is highly dependent on disk and buffer pool status.

Let's consider an example where we have an InnoDB table containing 150000 rows and we try to add a new column.
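
A minimal sketch of such a statement (the table name tab and column name c1 are hypothetical):

    ALTER TABLE tab ADD COLUMN c1 INT NOT NULL DEFAULT 0, ALGORITHM=INPLACE, LOCK=NONE;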

Concurrently, if we add a new row, update some rows, and delete a few rows, we can then check the new status variables:
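
The original status output is not reproduced here; the counters can be polled from another session with, for example:

    SHOW GLOBAL STATUS LIKE 'Innodb_onlineddl%';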

This means that at the time of the status statement there were 2003 rows in the row log, 23% of the memory allocated for the row log was in use, and the online ALTER TABLE had completed 56.77% of its work.

There is also additional output in the error log, for example:

Merge sort progress

Additionally, the SHOW PROCESSLIST statement will output an estimate of the index merge sort progress, e.g.:

Links

http://seonguck.blogspot.kr/2014/09/what-is-problem-of-mysql-online-ddl.html
http://kakao-dbe.blogspot.kr/2014/09/mysql-status-variables-for-innodb.html

Introduction

Evaluating the performance of database systems is a very demanding task. There are a lot of hard choices to be made, e.g.:

  • What operating system and operating system version is to be used
  • What configuration setup is to be used
  • What benchmarks are to be used and how long are the warm-up and measure times
  • What test setups are to be used
  • What version of the database management system is used
  • What storage engine is used

While performance evaluation is mostly machine time, there is still a lot of hard work for the human monitoring the tests. In this blog post we have made the following choices:

  • We’re using an Intel Xeon E5-2690 @ 2.9GHz CPU containing 32 cores and Linux 3.4.12 with 132G of main memory. The database is stored on a Fusion-IO ioDrive2 Duo 2.41TB, Firmware v7.2.5, rev 110646, using Driver 3.3.4 build 5833069. The database file system is NVMFS, and all test logs and outputs are stored on EXT4.
  • We have selected the following benchmarks:
    • LinkBench [1], which is based on traces from production databases that store social graph data at Facebook, a major social network. LinkBench provides a realistic and challenging test for persistent storage of social and web service data.
    • Percona provides an on-line transaction processing (OLTP) benchmark which implements a TPC Benchmark C [2] like workload, simulating a warehouse application. Approved in July of 1992, TPC Benchmark C is an on-line transaction processing (OLTP) benchmark. TPC-C is more complex than previous OLTP benchmarks such as TPC-A because of its multiple transaction types, more complex database, and overall execution structure. TPC-C involves a mix of five concurrent transactions of different types and complexity, either executed on-line or queued for deferred execution. The database consists of nine types of tables with a wide range of record and population sizes. TPC-C is measured in transactions per minute (tpmC).
  • We are using LinkBench with 10x database size, i.e. about 100G, and TPC-C with 1000 warehouses. The InnoDB buffer pool is set to 50G, thus in both benchmarks the database does not fit in main memory.
  • We selected alpha versions of MariaDB 10.1.0 and MySQL 5.7.4-labs-tplc, because we are interested in how their development releases perform. We selected exactly these versions because they both include support for FusionIO hardware, atomic writes [3,4] and page compression [5].
  • We decided to use InnoDB storage engine on both servers tested.
  • We use FusionIO hardware atomic writes [3,4] on all tests.

Our reasons for these choices are the following:

  • We want to evaluate performance on modern hardware not some old soon-to-be-deprecated hardware. The selected hardware is something that could be used when new hardware is acquired today for a database server.
  • We do not want to evaluate the performance of any corner cases, i.e. a single-threaded client, all data in main memory, ridiculously high or low numbers of client threads, only read-only or write-only workloads, etc. These cases are interesting for some applications, but not all.
  • We want to use benchmarks that are well known and approved by the database community.
  • XtraDB is not available for MySQL and we have evidence that XtraDB does not perform that well on FusionIO hardware.
  • We use atomic writes [3,4] in all tests because they provide better performance, and we have therefore disabled the doublewrite buffer, which is unnecessary when atomic writes are used.

MySQL/MariaDB configuration

We can’t use exactly the same configuration file (my.cnf) for both MariaDB and MySQL because there are some differences in the naming of the parameters. So, below, the common variables and their values are shown first, followed by a section tagged oracle containing the MySQL parameters and one tagged mariadb for the MariaDB parameters. All parameters are selected so that they are fair and representative, but we do not claim that they are optimal values for this hardware or these benchmarks.
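
The full configuration listing is not reproduced here. As a rough sketch of the kind of settings discussed in this post (the values are assumptions and the variable names are given in their MariaDB form; the MySQL labs build uses its own naming for some of them):

    [mysqld]
    innodb_buffer_pool_size      = 50G
    innodb_doublewrite           = 0     # unnecessary when atomic writes are used
    innodb_use_atomic_writes     = 1     # FusionIO atomic writes [3,4]
    innodb_compression_algorithm = lz4   # page compression method used in the tests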

Benchmark setup

As mentioned before, for LinkBench we use 10x database size, which creates a database of about 100G. The load command is as follows:
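
The exact command from the original post is not reproduced; a hedged sketch of a LinkBench load invocation (the config file name and the maxid1 override giving the 10x size are illustrative):

    ./bin/linkbench -c config/LinkConfigMysql.properties -D maxid1=100000001 -l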

The LinkBench measure phase uses 64 client threads, a maxtime of 86300 seconds, i.e. roughly 24 hours, and the default warm-up of 180 seconds. The measure command is as follows:
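
Again as a hedged sketch rather than the original command (the property names follow LinkBench configuration conventions and are assumptions here):

    ./bin/linkbench -c config/LinkConfigMysql.properties \
        -D requesters=64 -D maxtime=86300 -D warmup_time=180 -r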

For TPC-C we use 1000 warehouses, thus the load command is as follows:
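
A sketch of a tpcc-mysql load command for 1000 warehouses (host, database name, and credentials are placeholders, and the exact argument syntax depends on the tpcc-mysql version used):

    ./tpcc_load localhost tpcc1000 user password 1000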

We use the default number of threads, i.e. 64 client threads (unless otherwise mentioned). We have run TPC-C with a time limit of 10800 seconds, i.e. 3 hours, and a warmup of 30 seconds, i.e. the measure command is as follows:
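
A similar sketch of the measure command with 64 connections, a 30 second ramp-up, and a 10800 second run (flags as in Percona's tpcc_start; the values are those stated above, with placeholder credentials):

    ./tpcc_start -h localhost -d tpcc1000 -u user -p password -w 1000 -c 64 -r 30 -l 10800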


Performance results: LinkBench

Both servers contain support for atomic writes [3,4] and page compression [5] for FusionIO hardware. Below we compare the operations (transactions) per second reported when using this feature:

[Figure LinkBench_measure: LinkBench operations per second over time using page-compressed tables, MariaDB 10.1 vs MySQL 5.7]

Both servers provide stable performance after the buffer pool is warm and full. In both servers the performance drops after the buffer pool is full and we really need to fetch pages from FusionIO hardware. However, MariaDB provides clearly superior performance after the buffer pool is full. In this test we used the LZ4 compression method that is provided by both servers. MariaDB can provide about 30% better performance and the difference at the end of the test is 5654 operations/second.
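
In MariaDB, page compression is enabled per table in the DDL, together with the chosen compression algorithm; a minimal sketch (the table and column names are hypothetical, and the MySQL labs build uses its own syntax for the equivalent feature):

    CREATE TABLE page_compressed_example (
      id      BIGINT NOT NULL PRIMARY KEY,
      payload VARBINARY(255)
    ) ENGINE=InnoDB PAGE_COMPRESSED=1;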

Both servers provide the following compression methods if available on the system:


MariaDB also provides support for the following compression methods if available on the system:

Below we present the same results using uncompressed tables on both servers.

[Figure LinkBench_measure2: LinkBench operations per second over time using uncompressed tables]

Both servers start with very good performance; however, performance slowly drops when the buffer pool is full and write amplification starts to slow down the I/O performance. Write amplification (WA) can happen on flash memory and solid-state drives (SSDs) when the actual amount of physical information written is a multiple of the logical amount intended to be written. Because flash memory must be erased before it can be rewritten, performing these operations results in moving (or rewriting) user data and metadata more than once. This multiplying effect increases the number of writes required over the life of the SSD, which shortens the time it can reliably operate. The increased writes also consume bandwidth to the flash memory, which mainly reduces random write performance of the SSD [6]. MariaDB provides about 8% better performance, and at the end of the tests the difference is 2225 operations/second.

The next research point is row-compressed tables. Because both MySQL and MariaDB contain the same, or almost the same, implementation, we do not expect any significant differences in performance. A sketch of such a table definition follows, and then the results:
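
Row compression here means the InnoDB COMPRESSED row format; a minimal sketch of such a table (the table and column names are hypothetical):

    CREATE TABLE row_compressed_example (
      id      BIGINT NOT NULL PRIMARY KEY,
      payload VARBINARY(255)
    ) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;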

[Figure LinkBench_measure3: LinkBench operations per second over time using row-compressed tables]

The results are almost identical, as expected, and the small differences seen are clearly within statistical variation. Finally, let’s combine all the previous results to see the differences between the uncompressed, row-compressed, and page-compressed results.

[Figure LinkBench_measure4: combined LinkBench results for uncompressed, row-compressed, and page-compressed tables]

As expected, the best performance is achieved by using uncompressed tables. Note that this result does not contradict the results presented in [5], because here we use atomic writes and have disabled the doublewrite buffer in all tests. At the end of the measure time, page-compressed performance is about 22% less than uncompressed performance and the difference is 5500 operations/second. Similarly, at the end of the measure time, row-compressed performance is about 68% less than uncompressed performance, and row-compressed performance is about 37% less than page-compressed performance. Finally, the difference is 11800 operations/second between uncompressed and row-compressed, and 6568 operations/second between page-compressed and row-compressed.

For our final test, we measured the load time used to create the 10x LinkBench database using MariaDB 10.1.

[Figure load_time: time to load the 10x LinkBench database with uncompressed, row-compressed, and page-compressed tables]


Clearly, creating a page-compressed database takes the shortest time, followed by an uncompressed database. Creating a row-compressed database is significantly slower than either of those.

Performance results: TPC-C

We again start by presenting the results, in TpmC, for the page compression feature [5] using different numbers of client threads:


[Figure tpcc_page: TPC-C TpmC results with page-compressed tables for different numbers of client threads]

Again, MariaDB provides clearly better performance in all tests. The server used has 32 cores, and that is the reason why the performance numbers drop when the number of client threads is increased beyond 32. Let’s see the performance results for New Order transactions using the default number of client threads:


[Figure tpcc_pagecomp: completed TPC-C New Order transactions over time with page-compressed tables]

While MySQL can occasionally provide a higher number of completed New Order transactions, the variation is significantly higher and the median is significantly lower than in the MariaDB results. MariaDB provides quite stable performance over time after the buffer pool is warm. Clearly, the MySQL flush implementation can’t keep up with the very fast FusionIO hardware. The bottleneck in all the tests in this post is not the FusionIO hardware; instead, the bottleneck is in the MySQL/MariaDB implementation.

In the following graph we have compared the results using uncompressed tables and page compressed tables while varying the number of TPC-C test suite client threads from 4 to 512:

[Figure tpcc_full: TPC-C TpmC results for uncompressed and page-compressed tables with 4 to 512 client threads]

When the number of client threads is between 4 and 32, MariaDB provides clearly better performance for both uncompressed and page-compressed tables. In fact, MariaDB's performance with page-compressed tables is the same as, or even better than, MySQL's performance with uncompressed tables. When the number of client threads is increased further, MariaDB's performance drops earlier than MySQL's, but not as dramatically as MySQL's performance drops in the final results with 512 client threads. With high numbers of client threads, MariaDB page compression can offer significantly better performance than uncompressed tables.

Conclusions

First of all, both alpha versions of these database servers were very stable. We did not encounter even a single crash or other problem while the performance evaluations were being done. This is a very good sign for the quality of these development releases.

Secondly, there is clearly a need to identify and fix the new bottlenecks introduced by the significantly faster FusionIO storage device. In all tests we could not come even close to using the full throughput provided by the FusionIO hardware. FusionIO hardware can be seen as a new level in the memory hierarchy [7] that requires new kinds of optimizations and research, in a similar way as main memory has required [8]. Some early research results can be seen in [9].

Finally, MariaDB can clearly offer better performance to applications having a similar workload as presented by LinkBench and TPC-C.

References

[1] Timothy G. Armstrong, Vamsi Ponnekanti, Dhruba Borthakur, and Mark Callaghan. 2013. LinkBench: a database benchmark based on the Facebook social graph. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD ’13). ACM, New York, NY, USA, 1185-1196. http://doi.acm.org/10.1145/2463676.2465296.

[2] http://www.tpc.org/tpcc/

[3] MariaDB Introduces Atomic Writes, https://blog.mariadb.org/mariadb-introduces-atomic-writes/

[4] Xiangyong Ouyang and David W. Nellans and Robert Wipfel and David Flynn and Dhabaleswar K. Panda: Beyond block I/O: Rethinking traditional storage primitives, in Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture, pages 301-311, http://dx.doi.org/10.1109/HPCA.2011.5749738

[5] Significant performance boost with new MariaDB page compression on FusionIO, https://blog.mariadb.org/significant-performance-boost-with-new-mariadb-page-compression-on-fusionio/

[6] Write amplification. http://en.wikipedia.org/wiki/Write_amplification

[7] Memory hierarchy. http://en.wikipedia.org/wiki/Memory_hierarchy

[8] Stefan Manegold, Peter A. Boncz, and Martin L. Kersten. 2000. Optimizing database architecture for the new bottleneck: memory access. The VLDB Journal 9, 3 (December 2000), 231-246. DOI=10.1007/s007780000031 http://dx.doi.org/10.1007/s007780000031

[9] Seok-Hoon Kang, Dong-Hyun Koo, Woon-Hak Kang and Sang-Won Lee: A Case for Flash Memory SSD in Hadoop Applications, International Journal of Control and Automation, Vol. 6, No. 1, February, 2013. http://www.sersc.org/journals/IJCA/vol6_no1/17.pdf