Introduction

Causal consistency [1] is one of the consistency criteria that can be used in distributed databases.

A distributed database provides causal consistency if read and write operations that are causally related are seen by every node of the distributed system in the same order. Concurrent writes may be seen in different orders on different nodes. Causal consistency is weaker than sequential consistency [2] but stronger than eventual consistency [3]. See our earlier blog post for a more detailed description of eventual consistency: https://blog.mariadb.org/eventually-consistent-databases-state-of-the-art/.

When a transaction performs a read operation followed later by a write operation, even on a different object, the first read is said to be causally ordered before the write, because the value created by the write may have depended on the result of the read. Similarly, a read operation is causally ordered after the earlier write on the same object that stored the data retrieved by the read. Furthermore, two write operations performed by the same node are defined to be causally ordered in the order they were performed: intuitively, after writing value v into object x, a node knows that a read of x would return v, so a later write can be said to be (potentially) causally related to the earlier one. Finally, causal order is transitive: if operation A is (causally) ordered before B, and B is ordered before C, then A is ordered before C.

Operations that are not causally related, even transitively through other operations, are said to be concurrent. Consider the following example:
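One possible such execution (the values a and b are illustrative):

  N1: w1[x]=a
  N2: w2[x]=b
  N3: r3[x]=b   r3[x]=a
  N4: r4[x]=a   r4[x]=b

Writes w1[x] and w2[x] are issued by different nodes with no intervening reads, so they are concurrent.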

This execution is causally consistent because read operation r3[x] is causally dependent on write w2[x], and read operation r4[x] is causally dependent on write w1[x]. Note that the concurrent write operations may be seen in different orders by different transactions or nodes.

Implementing Causal Consistency

Causal consistency can be achieved by using Lamport clocks [4] or version vectors [5]. The causal consistency model is implemented using multi-part timestamps, which are assigned to each object. These timestamps are stored in a vector that contains the version number of the object at each replica. This vector must be included (in the form of dependencies) in all update and query requests so that operations respect the causal ordering: an operation A can only be processed at a given node once all operations on which A causally depends have already been applied at that node.

Each element in the vector clock corresponds to one host, and the value of an element indicates the number of messages sent from that host. The vector clock information is used to order, or delay, the delivery of messages where necessary, thus maintaining the required consistency. However, to maintain the consistency of data items, we need information about the writes on each data item, and maintaining a clock per data item helps here. Therefore, instead of a vector clock of size N (the number of hosts), we maintain a vector of size M (the number of objects). The value v[i] in the vector v is the number of writes on data item i.

Each host maintains a vector of size M. Whenever it updates the value of an item, it increments the corresponding element and sends the vector, along with the data item and its new value, to every site that has a copy of the replica. When a host receives an update message, it delays its delivery until all causally preceding updates recorded in the piggybacked vector have been applied locally; after that, the update to the data item is applied. The message overhead in this case is O(M) and is thus independent of the number of hosts in the system.

Since every message is an update message, it carries the new value of the data item rather than instructions. The delivery of an update on a data item therefore does not need to wait for the previous updates on the same item. This would not have been possible with plain vector clocks: in that case, the delivery of a message would have been delayed even for earlier messages whose effects are causally overwritten.
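To make the scheme concrete, below is a minimal Python sketch of the per-object version-vector protocol described above. It is an illustration under stated assumptions, not a production implementation: the delivery condition follows the text (all causally preceding writes to other items must be applied first, while older writes to the same item are simply superseded by the full value carried in the message), and conflict resolution between truly concurrent writes to the same item (e.g. last-writer-wins) is left out.

  M = 3  # number of replicated data items

  class Replica:
      def __init__(self):
          self.v = [0] * M          # v[i] = number of writes to item i seen here
          self.store = [None] * M   # current value of each item
          self.pending = []         # updates waiting for their causal dependencies

      def local_write(self, i, value):
          """Apply a local write and return the update message to broadcast."""
          self.v[i] += 1
          self.store[i] = value
          return (i, value, list(self.v))   # piggyback a copy of the vector

      def receive(self, msg):
          self.pending.append(msg)
          self._deliver_ready()

      def _ready(self, i, vr):
          # Deliverable once all causally preceding writes to *other* items
          # have been applied; earlier writes to item i itself need not be
          # waited for, because the message carries the item's full new value.
          return all(self.v[j] >= vr[j] for j in range(M) if j != i)

      def _deliver_ready(self):
          progress = True
          while progress:
              progress = False
              for msg in list(self.pending):
                  i, value, vr = msg
                  if self._ready(i, vr):
                      if vr[i] > self.v[i]:   # skip updates already superseded
                          self.store[i] = value
                          self.v[i] = vr[i]
                      self.pending.remove(msg)
                      progress = True

Each message carries the vector of size M, so the overhead is O(M), independent of the number of hosts, as noted above.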

Applications and Databases using Causal Consistency

COPS and Eiger

The COPS system [6] (Clusters of Order-Preserving Servers) introduces causal+ consistency and is designed to support complex online applications hosted in a small number of large-scale data centers, each of which is composed of front-end servers (the clients of COPS) and back-end key-value data stores. Eiger [7] has a similar design but a different implementation.

COPS and Eiger support causality through a client library. Both systems replicate writes to geographically distributed data centers and enforce the observed ordering by delaying write operations until all causally preceding operations have been applied at that data center.
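As a rough sketch of how such a client library can track causality, the following Python fragment mimics the approach described for COPS in [6]: reads add the observed version to the session's dependency set, and a write is replicated together with its dependencies, which a remote data center must commit first. All class and method names here are hypothetical, not the actual COPS API.

  class LocalDC:
      """Minimal in-memory stand-in for the local data center (illustration only)."""
      def __init__(self):
          self.data = {}        # key -> (value, version)
          self.version = 0

      def get(self, key):
          return self.data[key]

      def put(self, key, value, deps):
          # A real store commits locally (linearizably) and replicates
          # (key, value, deps) asynchronously; a remote data center applies
          # the write only after all of deps have been committed there.
          self.version += 1
          self.data[key] = (value, self.version)
          return self.version

  class CausalClient:
      def __init__(self, local_dc):
          self.dc = local_dc
          self.deps = set()     # (key, version) pairs this session has observed

      def get(self, key):
          value, version = self.dc.get(key)
          self.deps.add((key, version))   # reads create causal dependencies
          return value

      def put(self, key, value):
          version = self.dc.put(key, value, deps=frozenset(self.deps))
          # The new write transitively subsumes everything observed so far,
          # so the dependency set can collapse to just this write.
          self.deps = {(key, version)}
          return version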

COPS executes all read and write operations in the local data center in a linearizable [8] fashion, and then replicates data across data centers in a causal+ consistent order in the background. Figure 1 shows the high-level architecture of COPS.

 


Figure 1: Architecture of COPS and Eiger [6].

Similarly, the Eiger system provides linearizability inside each data center and a causally consistent data store based on a column-family data model, to achieve better performance in a geo-distributed setting. All client operations are served from the local data center through a client library. The library mediates access to the nodes in the local data center, executes the read and write transaction algorithms, and tracks causality and attaches dependencies to write operations. Each data center stores a full replica of the database, and operations are handled locally. After an operation has executed locally, it is asynchronously pushed to the remote data centers, but it is committed there only after all causally preceding operations have been committed.

Bolt-on

Bailis et al. [9] propose a client-side middleware called Bolt-on. This middleware guarantees only application-defined dependencies, as an alternative to full causal consistency. Figure 2 shows the architecture of the Bolt-on middleware.

 


Figure 2: Architecture of Bolt-on [9].

The Bolt-on architecture assumes that the underlying data store handles most aspects of data management, including replication, availability, and convergence. The underlying data store is only eventually consistent and allows a large space of read and write histories; the middleware handles the causal dependencies and consists of a set of rules that limit the possible histories to those that obey the desired consistency model.

Application: MMORPG

In Massively Multiplayer Online Role-Playing Games (MMORPG), players can cooperate with others in a virtual game world, and both the players and the different game worlds are naturally distributed. These systems manage large amounts of data, and the biggest problem is how to support data consistency. According to the CAP theorem, we have to sacrifice one of two properties: consistency or availability [10]. If an online game does not guarantee availability, players' requests may fail. If data is inconsistent, players may see data that does not conform to the game logic, and this data can affect their actions. Therefore, it is important for an MMORPG environment to find a balance between data consistency and system availability, which requires analyzing the data consistency requirements of MMORPG [10].

Diao [10] has studied different consistency models for MMORPG and found that there is indeed a class of data for which causal consistency is an appealing choice: game data. Game data contains, e.g., the world appearance, the metadata of non-player characters (characters created by the game developers and controlled only by the game logic), the system configuration, and the game rules. This data is used by players and the game engine throughout the game, but can be modified only by the game developers. Consistency requirements for game data are less strict than, e.g., for account data, because a change of a non-player character's name or of the duration of a bird animation may not even be noticed by players.

Furthermore, some changes to the game data need to be delivered to all online players synchronously, e.g. changes to the world appearance, weapon powers, non-player characters, game rules, and scripts. Inconsistency in these areas causes display errors and logic errors for players. Therefore, some data needs to be stored on the server side and some on the client side. The game data on the client side can only synchronize with the servers when a player logs in to or starts a game. For this reason, causal consistency is required [10].

This could mean that when player A uses the browser to connect to the game server, the server checks the current local data and updates the client-side game data in the form of data packets. After the update, all future local accesses return the updated values. Player B, who has not communicated with the game server, will still retain the outdated game data.

Game servers maintain the primary version of the game data and transfer it to the client side. Additionally, players in different game worlds cannot talk to each other. Thus, it is only necessary to ensure that the game data is consistent within one game world at a time, so that all players in that world are treated equally. This requires strong consistency locally within a game world and causal consistency among different game worlds. When the game data is modified by the developers, the updated value should be delivered synchronously to all replicas in that game world, and asynchronously to the other game worlds.

While the possibility of using causal consistency in MMORPG has been identified in research [10], to the authors' knowledge there are no publications or other information showing that causal consistency is actually used in MMORPG games (e.g. Lord of the Rings Online).

Application: Facebook

Facebook is an online social networking service. After registering on the site, users can create a user profile, add other users as friends, exchange messages, post status updates and photos, share videos, and receive notifications when others update their profiles.

When you log into your Facebook account, the server shows your own status messages and your friends' status messages at that point in time. Status messages on Facebook may contain pictures, shared links, stories, or your own posts. Naturally, your account data requires strong consistency, but for status data weaker consistency models are acceptable. While a user is online, the status updates of the user and the user's friends do not need to be strictly ordered; causal ordering is enough.

Thus, when user A posts a status update and user B replies to that update, there is a causal order between the two updates. However, when users C and D make totally unrelated updates, the order in which those updates appear to users A and B is not relevant, because A and B do not know in which order the updates were performed.

The reason why eventual consistency is not enough for Facebook status updates is that eventual consistency does not require any ordering between writes. Consider a case where user A first posts a status update, and a few seconds later edits it. Under eventual consistency, all of A's friends could see only the first version, because nothing guarantees that the first write is applied before the second. Under causal consistency, since A reads the first update and then writes the edited one, the two writes are causally related, and all of A's friends will naturally see the second version.

Although causal consistency is a possible consistency model for Facebook status updates, and for several similar distributed services with status updates such as LinkedIn, Twitter, and Yahoo, to the authors' knowledge there is no scientific or other literature showing that causal consistency is actually used in them.

Conclusions

The causal consistency model can be enforced with Lamport clocks. Transactions using causal consistency are executed in an order that reflects the order of their causally related read/write operations. Concurrent operations may be committed in different orders, and their results may also be read in different orders.

Causal consistency can solve many problems that cannot be solved under eventual consistency, such as ordering operations. It ensures that every node sees causally related operations in the same order, and this makes causal consistency stronger than eventual consistency. However, causal consistency cannot support, e.g., distributed integrity constraints.

Although a few promising systems have been developed in research, to the authors' knowledge there are no commercial or mature systems using the causal consistency model. For a more thorough discussion, see [11].

References

[1] Mustaque Ahamad, Gil Neiger, James E. Burns, Prince Kohli, P. W. Hutto: Causal Memory: Definitions, Implementation and Programming, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.50.3356

[2] Leslie Lamport: How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs, IEEE Trans. Comput., C-28:9, pp. 690–691, Sept. 1979.

[3] Werner Vogels: Eventually consistent, Communications of the ACM, 52:1, pp. 40–44, 2009. doi:10.1145/1435417.1435432

[4] Leslie Lamport: Time, clocks, and the ordering of events in a distributed system, Communications of the ACM, 21:7, pp. 558–565, 1978.

[5] C. J. Fidge: Timestamps in message-passing systems that preserve the partial ordering, in Proceedings of the 11th Australian Computer Science Conference, 1988.

[6] W. Lloyd, M. J. Freedman, M. Kaminsky, and D. G. Andersen: Don’t settle for eventual: scalable causal consistency for wide-area storage with COPS, in Proceedings of the 23rd ACM Symposium on Operating Systems Principles, pp. 401–416, Portugal, October 23-26, 2011.

[7] W. Lloyd, M. J. Freedman, M. Kaminsky, and D. G. Andersen: Stronger semantics for low-latency geo-replicated storage, in Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation, pp. 313–328, Lombard, IL, USA, April 2-5, 2013.

[8] M. Herlihy and J. M. Wing: Linearizability: a correctness condition for concurrent objects, ACM Trans. Program. Lang. Syst., 12:3, pp. 463–492, 1990.

[9] P. Bailis, A. Ghodsi, J. M. Hellerstein, and I. Stoica: Bolt-on causal consistency, in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 761–772, New York, NY, USA, June 22-27, 2013.

[10] Z. Diao: Consistency models for cloud-based online games: the storage system's perspective, in Proceedings of the 25th Workshop on Grundlagen von Datenbanken, pp. 16–21, Ilmenau, Germany, May 28-31, 2013.

[11] Mawahib Musa Elbushra, Jan Lindström: Causal Consistent Databases, Open Journal of Databases (OJDB), 2:1, 2015, http://www.ronpub.com/publications/OJDB_2015v2i1n02_Elbushra.pdf

Introduction

When you, for example, delete rows, those rows are only marked as deleted, not physically removed from the indexes, and the freed space is not returned to the operating system for reuse. The purge thread will physically delete the index keys and rows, but the freed space is still not returned to the operating system, and the operation can leave holes in pages. If you have variable-length rows, this can lead to a situation where the free space cannot be used for new rows (if the new rows are larger than the old ones). The user can run OPTIMIZE TABLE or ALTER TABLE <table> ENGINE=InnoDB to reconstruct the table.

Unfortunately, running OPTIMIZE TABLE against an InnoDB table stored in the shared tablespace file ibdata1 does two things:

  • Makes the table’s data and indexes contiguous inside ibdata1
  • Makes ibdata1 grow because the contiguous data and index pages are appended to ibdata1

New defragmentation

In MariaDB 10.1 we have merged Facebook's defragmentation code, prepared for MariaDB by Matt, Seong Uck Lee from Kakao. The only major difference between Facebook's code, including Matt's patch, and MariaDB is that we decided not to introduce new literals to SQL and made no changes to server code. Instead, we use the existing OPTIMIZE TABLE statement, and all code changes are inside the InnoDB/XtraDB storage engines. To enable this new feature you need to add the following to my.cnf (this requirement keeps the original behavior of OPTIMIZE TABLE for those users who need it).
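A minimal example, based on the innodb_defragment switch described below:

  [mysqld]
  innodb_defragment = 1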

This new defragmentation feature works in place: no new tables are created and there is no need to copy data from the old table to a new one. Instead, the feature loads n pages at a time and tries to move records so that the pages become full of records, freeing any pages that are completely empty after the operation.
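The following Python sketch illustrates the idea only; it is not InnoDB's actual algorithm. Records from a window of n pages (cf. innodb_defragment_n_pages below) are repacked towards the front of the window up to a fill-factor limit (cf. innodb_defragment_fill_factor), and a window is rewritten only if that actually frees at least one page.

  N_PAGES = 7          # cf. innodb_defragment_n_pages
  FILL_FACTOR = 0.9    # cf. innodb_defragment_fill_factor
  CAPACITY = 100       # records per page; purely illustrative

  def defragment(pages):
      """pages: a list of pages, each page a list of records."""
      limit = int(CAPACITY * FILL_FACTOR)   # leave headroom for future inserts
      result = []
      for start in range(0, len(pages), N_PAGES):
          window = pages[start:start + N_PAGES]
          records = [r for page in window for r in page]
          repacked = [records[i:i + limit] for i in range(0, len(records), limit)]
          # Rewrite the window only if repacking frees at least one page;
          # pages dropped here would go back to the free list.
          result.extend(repacked if len(repacked) < len(window) else window)
      return result

For example, three pages holding 50, 30, and 90 records would be repacked into two pages of 90 and 80 records, freeing one page.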

New configuration variables

  • innodb_defragment: Enable/disable InnoDB defragmentation. When set to FALSE, all existing defragmentation operations are paused and new defragmentation commands fail. Paused defragmentation commands resume when the variable is set back to TRUE. Default value: FALSE.
  • innodb_defragment_n_pages: Number of pages considered at once when merging multiple pages to defragment. Range 2–32, default 7.
  • innodb_defragment_stats_accuracy: How many defragmentation stats changes are accumulated before the stats are written to persistent storage. Setting this to 0 disables defragmentation stats tracking. Default 0.
  • innodb_defragment_fill_factor_n_recs: How many records' worth of space defragmentation should leave free on a page. This variable, together with innodb_defragment_fill_factor, is introduced so that defragmentation does not pack pages so full that the next insert on any page causes a page split. Whichever variable indicates a larger defragmentation gain is the one that takes effect. Range 1–100, default 20.
  • innodb_defragment_fill_factor: A number between 0.7 and 1.0 that tells defragmentation how full it should fill a page. Default 0.9. Values below 0.7 do not make much sense. This variable, together with innodb_defragment_fill_factor_n_recs, is introduced so that defragmentation does not pack pages so full that the next insert on any page causes a page split. Whichever variable indicates a larger defragmentation gain is the one that takes effect.
  • innodb_defragment_frequency: Do not defragment a single index more than this number of times per second. This controls the number of times per second the defragmentation thread can request an X_LOCK on an index. The defragmentation thread checks whether 1/innodb_defragment_frequency seconds have passed since it last worked on the index, and puts the index back into the queue if not enough time has passed. The actual frequency can only be lower than this given number.

New status variables

  • Innodb_defragment_compression_failures: Number of defragment re-compression failures.
  • Innodb_defragment_failures: Number of defragment failures.
  • Innodb_defragment_count: Number of defragment operations.

Example

After CREATE TABLE and INSERT operations we can see the following from INFORMATION_SCHEMA:

Now, if we delete 3/4 of the records, leaving holes in the pages, and then run OPTIMIZE TABLE to execute defragmentation:
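The statements could look like the following (the table name and delete predicate are hypothetical):

  DELETE FROM t1 WHERE id % 4 <> 0;   -- removes roughly 3/4 of the rows
  OPTIMIZE TABLE t1;                  -- runs defragmentation when innodb_defragment=1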

After this we can see that some pages were freed and some pages were merged:

Links

WebScaleSQL Git repository https://github.com/webscalesql/webscalesql-5.6

Facebook Percona Live presentation: https://www.percona.com/live/mysql-conference-2014/sites/default/files/slides/defragmentation.pdf

Introduction

Online DDL is a new feature in MariaDB 10.0. An online DDL statement is processed through the four tasks below, in sequence:

  1. InnoDB::ha_prepare_inplace_alter_table(..)
  2. InnoDB::ha_inplace_alter_table(..)
  3. InnoDB::ha_commit_inplace_alter_table(..)
  4. mysql_rename_table(..)

The InnoDB storage engine allocates a temporary memory buffer for transaction logging in phase 1; row changes made during this phase are logged in the buffer. The initial size of this buffer is sort_buffer_size, and it can grow up to innodb_online_alter_log_max_size. During phase 2, the thread processing the ALTER statement copies the old table's rows to the new, altered table. After this, MariaDB takes an exclusive lock on the target table and applies the row log buffer to the new, altered table.

This introduces a new, unpredictable failure case: row log buffer overflow. The MariaDB server rolls back the ALTER statement if the row log buffer overflows. Thus, there are the following problems:

  • If the row log buffer is too small, the ALTER statement is rolled back and you have wasted precious time and resources.
  • If the row log buffer is too big, you have wasted precious main memory that could be used, e.g., for the buffer pool.
  • Currently, there is no way to see how much of the row log buffer is in use and how much is free.
  • Currently, there is not even an estimate of how much work has been done and how much is still to be done.
  • Currently, the merge sort phase can also take a long time, and there is no progress information.

Improvements

There are two improvements in MariaDB 10.1: new status variables and progress information for online DDL.

New status variables and progress info

MariaDB Corporation would like to thank Matt, Seong Uck Lee from Kakao for contributing a patch that has now been merged into MariaDB 10.1.

First of all, there are three new global status variables:

  • Innodb_onlineddl_rowlog_rows: Shows how many rows are stored in the row log buffer.
  • Innodb_onlineddl_rowlog_pct_used: Shows the row log buffer usage as a 5-digit integer (10000 means 100.00%).
  • Innodb_onlineddl_pct_progress: Shows the progress of the in-place ALTER TABLE. It may not be very accurate, because the in-place alter is highly dependent on the disk and buffer pool state.

Let's consider an example where we have an InnoDB table containing 150,000 rows and we try to add a new column:
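For instance (the table and column names here are hypothetical):

  ALTER TABLE t1 ADD COLUMN extra_col INT, ALGORITHM=INPLACE;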

Concurrently, we add a new row, update some rows, and delete a few rows:
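Querying the new status variables at that point might return something like the following (the values are reconstructed from the description below):

  SHOW GLOBAL STATUS LIKE 'Innodb_onlineddl%';
  -- Innodb_onlineddl_rowlog_rows       2003
  -- Innodb_onlineddl_rowlog_pct_used   2300
  -- Innodb_onlineddl_pct_progress      5677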

This means that at the time of the status statement there were 2003 rows in the row log, 23% of the memory allocated for the row log was in use, and the online ALTER TABLE had completed 56.77% of its work.

There is also additional output in the error log, for example:

Merge sort progress

Additionally, the SHOW PROCESSLIST statement will output an estimate of the index merge sort progress, e.g.:

Links

http://seonguck.blogspot.kr/2014/09/what-is-problem-of-mysql-online-ddl.html
http://kakao-dbe.blogspot.kr/2014/09/mysql-status-variables-for-innodb.html