Background

Eventual consistency is a consistency model used in many large distributed databases. It requires that all changes to a replicated piece of data eventually reach all affected replicas; conflict resolution is not handled, and in the event of conflicting updates the responsibility is pushed up to the application author [13].

Eventual consistency is a specific form of weak consistency; the storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value [14]. If no failures occur, the maximum size of the inconsistency window can be determined based on factors such as communication delays, the load on the system, and the number of replicas involved in the replication scheme [3].

A few examples of eventually consistent systems:

  • DNS
  • Asynchronous master/slave replication on an RDBMS, e.g. MariaDB
  • memcached in front of MariaDB, caching reads

Eventual consistency states that in an updatable replicated database, eventually all copies of each data item converge to the same value. The origin of eventual consistency can be traced back to Thomas’ majority consensus algorithm [12]. The term was coined by Terry et al. [11] and later popularized by Amazon in their Dynamo system, which supported only eventual consistency [7].

The CAP theorem, also called Brewer’s theorem after its author Dr. Eric A. Brewer, was introduced at PODC 2000 [4, 5]. The theorem was formally shown by Gilbert and Lynch [8]. Brewer introduced consistency, availability and partition tolerance as three desired properties of any shared-data system and made the conjecture that at most two of them can be guaranteed at any one time [6].

In general, this theorem matches the needs of today’s Internet systems well. Ideally we expect a service to be available for as long as the network connection to it is available [1]. Thus, if a network connection is available, the service should be available as well [9,10]. To achieve good performance, requests need to be processed by a distributed system. If the number of servers is increased, the probability of a server failure or network failure also increases. A system therefore needs to take this into account and be designed in such a way that these failures are transparent to the client and their impact is minimized [2]. The abbreviation of the theorem comes from these three properties:

  • Consistency: This property requires that each operation executed in a distributed system, where data is spread among many servers, ends with the same result as if it were executed on a single server holding all the data.
  • Availability: This property requires that sending a request to any functional node is enough for the requester to get a response. By complying with this property, a system is tolerant to the failure of any node caused, for instance, by network throughput issues.
  • Partition Tolerance: A distributed system consists of many servers interconnected by a network. A frequent requirement is to distribute the system across multiple data centers so that the loss of one of them can be tolerated. Failures are frequent during network communication, so a system needs to withstand an arbitrary number of lost messages between servers. A temporary communication interruption among a set of servers must not cause the whole system to respond incorrectly [9].

Eventual consistency is defined as follows:

Definition 1: Eventual consistency.

  • Eventual delivery: An update executed at one node eventually executes at all nodes.
  • Termination: All update executions terminate.
  • Convergence: Nodes that have executed the same updates eventually reach an equivalent state (and stay in it).

Example 1: Consider a case where a data item R=0 is stored on three nodes (node0, node1, node2). Assume that we have the following sequence of writes and commits on node0: W(R=3) C W(R=5) C W(R=7) C. Now reads from node1 could return R=3 and reads from node2 could return R=5. This is eventually consistent as long as reads from all nodes eventually return the same value. Note that this final value could be R=3; eventual consistency does not restrict the order in which the writes must be executed.

MariaDB Demonstration

As already stated, a normal master/slave setup on MariaDB is eventually consistent. In this article we are interested in a multi-master setup. We will use MariaDB 10.0. There are several possible topologies that could be considered here, but we have selected a ring topology (see Figure 1).


Figure 1: MariaDB ring topology.

In this topology Node0 is the master and Node1 is the slave for Node0. Similarly, Node2 is the slave for Node1, and Node0 in turn is the slave for Node2, closing the ring. Let’s start configuration of the nodes with Node0:
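A minimal sketch of what the Node0 server configuration might contain (file path and log name are illustrative; log_slave_updates is what lets events continue around the ring):

  # /etc/my.cnf on Node0 (illustrative)
  [mysqld]
  server_id         = 1           # must be unique on every node
  log_bin           = node0-bin   # binary log, required to act as a master
  log_slave_updates = 1           # write replicated events to the binlog so they travel on around the ring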

Similarly Node1:
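The same sketch for Node1, with its own server id and log name:

  # /etc/my.cnf on Node1 (illustrative)
  [mysqld]
  server_id         = 2
  log_bin           = node1-bin
  log_slave_updates = 1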

And finally Node2:
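And the corresponding sketch for Node2:

  # /etc/my.cnf on Node2 (illustrative)
  [mysqld]
  server_id         = 3
  log_bin           = node2-bin
  log_slave_updates = 1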

After this is done we can install the MariaDB databases and start the servers.
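One way this step might look, assuming a binary tarball installation (package-based installs typically initialize and start the server for you):

  # On every node (paths are illustrative)
  scripts/mysql_install_db --user=mysql --datadir=/var/lib/mysql   # initialize the data directory
  bin/mysqld_safe --defaults-file=/etc/my.cnf &                    # start the server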

Now that the servers are up and running, let’s set up the first master node on Node0:
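A sketch of the usual steps: create a replication account and note the binlog coordinates (the user name and password are placeholders; the same account is needed on every node, since each one also acts as a master for the next):

  -- On Node0 (repeat the GRANT on Node1 and Node2 as well)
  GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%' IDENTIFIED BY 'replpass';
  SHOW MASTER STATUS;   -- note File and Position for the CHANGE MASTER TO statements below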

Fine, now we need to set Node1 as a slave for this with:
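A sketch, with placeholder credentials and binlog coordinates taken from the SHOW MASTER STATUS output above:

  -- On Node1
  CHANGE MASTER TO
    MASTER_HOST='node0', MASTER_USER='repl', MASTER_PASSWORD='replpass',
    MASTER_LOG_FILE='node0-bin.000001', MASTER_LOG_POS=313;
  START SLAVE;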

Similarly, set Node2 as slave to Node1:
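Again a sketch with placeholder values:

  -- On Node2
  CHANGE MASTER TO
    MASTER_HOST='node1', MASTER_USER='repl', MASTER_PASSWORD='replpass',
    MASTER_LOG_FILE='node1-bin.000001', MASTER_LOG_POS=313;
  START SLAVE;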

And finally, set Node0 as slave to Node2:
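And closing the ring (placeholder values again):

  -- On Node0
  CHANGE MASTER TO
    MASTER_HOST='node2', MASTER_USER='repl', MASTER_PASSWORD='replpass',
    MASTER_LOG_FILE='node2-bin.000001', MASTER_LOG_POS=313;
  START SLAVE;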

Now let’s create one table and add some data to it from the different nodes:
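A hypothetical table and data set (the table a and these values are assumptions used for the rest of the example):

  -- On node0
  CREATE TABLE test.a (i INT NOT NULL PRIMARY KEY) ENGINE=InnoDB;
  INSERT INTO test.a VALUES (1),(2);
  -- On node1
  INSERT INTO test.a VALUES (3);
  -- On node2
  INSERT INTO test.a VALUES (4);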

After this all nodes are eventually consistent and return the same result set, for example:
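Continuing the hypothetical example, once replication has caught up every node returns the same rows:

  SELECT i FROM test.a ORDER BY i;   -- returns 1, 2, 3, 4 on node0, node1 and node2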

From this we can conclude that MariaDB is eventually consistent also in a multi-master setup, as long as no conflicting operations are performed.

But what happens if there is a conflict? We can test this scenario by trying to insert a duplicate key into the table a. We try to insert the value 5 on both node0 and node2 so that the final commit commands are issued at about the same time.
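A sketch of the conflicting statements (table and value as described above, sessions with autocommit off):

  -- On node0:
  SET autocommit=0;
  INSERT INTO test.a VALUES (5);
  COMMIT;
  -- On node2, at about the same time:
  SET autocommit=0;
  INSERT INTO test.a VALUES (5);
  COMMIT;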

Because we used the InnoDB storage engine and autocommit was off, no error message is shown on either client connection at commit time. This is because MariaDB does not support deferred constraint checks, and no error can be raised in the following case:

  • You insert 5 on server at node0, it succeeds.
  • Before the insert is replicated to the server at node2, you insert 5 on node2; this also succeeds because the replication is asynchronous.
  • Then the second insert is replicated from node2 to node0; this causes a conflict due to the duplicate key value 5, so the replication thread gets the error and rolls back.

Thus the result set is the following on all three nodes:
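Sticking with the hypothetical data above, the shared result would be:

  SELECT i FROM test.a ORDER BY i;   -- returns 1, 2, 3, 4, 5 on all three nodes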

This is also eventually consistent, because all servers return exactly the same values and they have executed exactly the same transactions. From the server log we can find out that the replication thread has stopped on the duplicate-key error.

And from, for example, node0 you can see it with:
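Presumably something along these lines (the exact error text will vary):

  SHOW SLAVE STATUS\G
  -- Look for Slave_SQL_Running: No and a Last_SQL_Error mentioning the duplicate key value 5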

As seen from the logs, the problem is that replication between the nodes has stopped. However, there is a way to ignore replication errors caused by application errors, by configuring --slave-skip-errors=XXX and --slave_exec_mode=IDEMPOTENT. The --slave_exec_mode option controls whether IDEMPOTENT or STRICT mode is used in replication conflict resolution and error checking. The IDEMPOTENT mode suppresses duplicate-key and no-key-found errors. This mode is needed for multi-master replication and circular replication. Other valid errors caused by the application can be skipped using --slave-skip-errors.

To demonstrate, let’s set --slave-skip-errors=all and --slave-exec-mode=IDEMPOTENT on all servers and restart them. We can now try to get the servers into different states (i.e. alternative futures). Execute the following on node0:
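One hypothetical way to push the nodes apart is to change the same row to different values on two nodes; the statements below are assumptions, not the article's original ones:

  -- On node0
  UPDATE test.a SET i = 50 WHERE i = 5;
  COMMIT;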

And the following on node1:
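And the conflicting counterpart (again hypothetical):

  -- On node1, at about the same time
  UPDATE test.a SET i = 55 WHERE i = 5;
  COMMIT;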

From the slave status we do not see any problems:
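For example (field names from SHOW SLAVE STATUS; values as expected when errors are skipped):

  SHOW SLAVE STATUS\G
  -- Slave_IO_Running: Yes, Slave_SQL_Running: Yes, Last_SQL_Error is empty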

But in the server log there is a warning about the skipped conflict.

This situation is not eventually consistent and MariaDB can’t resolve it automatically. If the application needs eventual consistency, it has to resolve this conflict itself so that all databases are again in the same state, one that is correct by the application’s rules.

Conclusions

Eventual consistency means that, given enough time over which no changes are performed, all successful updates will propagate through the system and all replicas will be synchronized. At any given time there is no guarantee that the data accessed is consistent, therefore conflicts have to be resolved. Using this definition, MariaDB is eventually consistent if replication errors are not ignored, even in cases when replication is stopped on a replication error, as long as replication is resumed at some point in time (bounded time) and all servers return the same state. If replication errors are ignored, the application must correct the cases where two or more servers are in different states.

Our original question was: Is MariaDB eventually consistent?

Answer: For most master/slave setups where all data is replicated to the slaves, MariaDB is eventually consistent. For multi-master setups where only application-handled error cases are ignored and where the application makes sure that servers can’t diverge into alternative futures, MariaDB is eventually consistent. However, there are replication configurations where MariaDB is not eventually consistent.

References

[1] Bailis, P. and Ghodsi, A.: Eventual consistency today: limitations, extensions, and beyond. Communications of the ACM, vol. 56, no. 5, pp. 55-63, May 2013.

[2] Bernstein, P. A. and Das, S.: Rethinking Eventual Consistency. SIGMOD ’13, June 22-27, 2013.

[3] Bermbach, D. and Tai, S.: Eventual Consistency: How soon is eventual? In Proceedings of the 6th Workshop on Middleware for Service Oriented Computing (MW4SOC ’11), Article 1, New York, December 2011.

[4] Brewer, E: PODC keynote. http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf, 2000.

[5] Brewer, E.: Towards Robust Distributed Systems (invited talk). Principles of Distributed Computing (PODC), Portland, Oregon, July 2000.

[6] Brewer, E.: CAP twelve years later: How the “rules” have changed. IEEE Computer, vol. 45, no. 2, pp. 23-29, February 2012.

[7] DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P. and Vogels, W.: Dynamo: Amazon’s highly available key-value store. In Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP), pp. 205-220, 2007.

[8] Gilbert, S. and Lynch, N.: Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, vol. 33, no. 2, pp. 51-59, 2002.

[9] Hale, C.: You can’t sacrifice partition tolerance. Available from http://codahale.com/you-cant-sacrifice-partition-tolerance.

[10] Shapiro, M. and Kemme, B.: Eventual Consistency. In Encyclopedia of Database Systems, pp. 1071-1072, 2009.

[11] Terry, D. B., Demers, A. J., Petersen, K., Spreitzer, M.J., Theimer, M.M., Welch, B. B.: Session guarantees for Weakly Consistent Replicated Data. In PDIS, pp. 140-149, 1994.

[12] Thomas, R. H.: A majority consensus approach to concurrency control for multiple copy databases. ACM Trans. on Database Systems, vol. 4, no. 2, pp. 180–209, June 1979.

[13] Vogels, W.: Scalable Web services: Eventually Consistent, ACM Queue, vol. 6, no. 6, pp. 14-16, October 2009.

[14] Vogels, W.: Eventually consistent, Communications of the ACM, vol. 52, no. 1, pp. 40–44, January 2009.

Installing MariaDB Galera Cluster on Debian/Ubuntu

A MariaDB Howto authored by Erkan Yanar.

This is a Howto about installing MariaDB Galera Cluster on Debian/Ubuntu. Because a lot of people were having problems installing MariaDB Galera Cluster, elenst from #maria on freenode forced me to write this Howto :)

Installing MariaDB Galera Cluster is in fact quite easy and actually kind of boring in the end. This Howto is written for (and tested on) Debian 7.1 (Wheezy) and Ubuntu 12.04 (Precise).

What we need

In our setup we assume 3 nodes (node01, node02, node03) with one interface each. We assume the following IP addresses: 172.16.8.5, 172.16.8.6, and 172.16.8.4. We need three packages installed on all nodes:

  • rsync
  • galera
  • mariadb-galera-server

As Galera does not ship with the distribution repositories, go to the repo configurator and follow the instructions to include the repository fitting your system. Keep in mind to choose “5.5” in Step 3 (Choose a Version). Doing this, you can jump directly to Install Packages.

Adding the Repository

Alternatively, you can just take the following steps.

Debian Wheezy
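A sketch of the usual steps; the mirror URL is a placeholder, so substitute the one the repo configurator gives you:

  apt-get install -y python-software-properties
  apt-key adv --recv-keys --keyserver keyserver.ubuntu.com 0xcbcb082a1bb943db
  add-apt-repository 'deb http://mirror.example.com/mariadb/repo/5.5/debian wheezy main'
  apt-get update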

Ubuntu Precise
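The same sketch for Precise (again with a placeholder mirror URL):

  apt-get install -y python-software-properties
  apt-key adv --recv-keys --keyserver keyserver.ubuntu.com 0xcbcb082a1bb943db
  add-apt-repository 'deb http://mirror.example.com/mariadb/repo/5.5/ubuntu precise main'
  apt-get update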

Yes, they are nearly identical. :)

Install Packages

(Just another shortcut for the impatient)
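Something along these lines (package names as listed above):

  apt-get update
  apt-get install -y rsync galera mariadb-galera-server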

After installing the packages you will have a running MariaDB server on each node. But none of them will be configured to run as a node in a MariaDB Galera Cluster.

Configuring Galera

So we have to do some configuration next. There is a MariaDB configuration part and a part that configures Galera (the settings starting with wsrep_). As we do the most basic and simple installation in this Howto, it is sufficient to just replace the IPs (remember: 172.16.8.5, 172.16.8.6, 172.16.8.4) with your own. They are needed to define the wsrep_cluster_address variable (the list of nodes a starting mysqld contacts to join the cluster).

The following configuration file has to be distributed to all nodes. We use a separate configuration file /etc/mysql/conf.d/galera.cnf with the following settings:
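A minimal sketch of such a galera.cnf; the cluster name is an example, the IPs are the ones from this Howto:

  [mysqld]
  # MariaDB settings needed by Galera
  binlog_format            = ROW
  default-storage-engine   = innodb
  innodb_autoinc_lock_mode = 2
  bind-address             = 0.0.0.0
  # Galera (wsrep_) settings
  wsrep_provider           = /usr/lib/galera/libgalera_smm.so
  wsrep_cluster_name       = "my_wsrep_cluster"
  wsrep_cluster_address    = "gcomm://172.16.8.5,172.16.8.6,172.16.8.4"
  wsrep_sst_method         = rsync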

FYI: The shared library for wsrep_provider is provided by the installed galera package.

We could also change the cluster name by changing the value of wsrep_cluster_name to fit our style. This setting also works as a shared secret to control access to the cluster. With wsrep_cluster_address you see the IP addresses of our setup. The wsrep_sst_method tells which method to use to synchronise the nodes. While mysqldump and xtrabackup are also available, I prefer rsync because it is easy to configure (i.e. it does not need any credentials set on the nodes). If you are considering using the xtrabackup method, don’t forget to install xtrabackup.

Starting the Galera Cluster

First we stop mysqld on all nodes.
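On Debian/Ubuntu that is simply:

  # On every node
  service mysql stop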

The configuration file (galera.cnf) is already distributed to all nodes, so we next start the first mysqld. This node initializes/starts the cluster (creates a GTID).
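A sketch of bootstrapping the first node (if your init script does not pass options through, temporarily setting wsrep_cluster_address="gcomm://" in galera.cnf achieves the same):

  # On node01 only
  service mysql start --wsrep-new-cluster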

To have a look and see if everything really worked we’ll check the cluster size status variable.
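For example (expecting a cluster of one at this point):

  mysql -u root -p -e "SHOW STATUS LIKE 'wsrep_cluster_size'"
  # Variable_name        Value
  # wsrep_cluster_size   1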

If you see the above, great! That’s what we would expect. Now that the Cluster already exists, we let the next nodes just start and join the cluster.
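On node02 that is just a normal start; on Debian/Ubuntu it may print an error about the 'debian-sys-maint'@'localhost' user, which is explained further below:

  # On node02
  service mysql start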

We can ignore the above error for now. This node is still starting fine.

Let’s pause here and do a quick check. As we are running a cluster it is not important if we execute the following on node01 or node02.
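For example:

  mysql -u root -p -e "SHOW STATUS LIKE 'wsrep_cluster_size'"
  # wsrep_cluster_size   2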

If you see the above, very nice! Now let’s start the third node:
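Just like the second one:

  # On node03
  service mysql start
  # wsrep_cluster_size should now report 3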

Ok we are finished. We have a running MariaDB Galera Cluster \o/

Having fun with Debian/Ubuntu init scripts

But we’ve got to fix some things because of some Debian/Ubuntu oddities.

Remember the error we saw when starting node02 and node03? What happened? Well, Debian/Ubuntu uses a special user ('debian-sys-maint'@'localhost') in their init script, and the credentials for that user are stored in /etc/mysql/debian.cnf. This user is used to make some checks when starting MySQL. Checks that, in my opinion, don’t belong in a service script anyway.

We could simply ignore it, but the user is also used to shut down mysqld. This is also not required, as a SIGTERM is sufficient to shut down mysqld :/

As we copied the data from node01 to all other nodes, the credentials in /etc/mysql/debian.cnf don’t match on node02 and node03. So we will not be able to shut down mysqld on either of these nodes.

So we’ve got to fix it by copying /etc/mysql/debian.cnf from the first node (node01) to all the other nodes, so that the data and the configuration file match again.
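For example, from node01:

  scp /etc/mysql/debian.cnf node02:/etc/mysql/debian.cnf
  scp /etc/mysql/debian.cnf node03:/etc/mysql/debian.cnf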

After that we are able to shut down the daemon:
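On any of the nodes:

  service mysql stop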

Great.

So if we had a proper init script, this Howto would have been even shorter ;)

Follow the Bug :)

Enjoy your MariaDB Galera Cluster and have fun!

– Erkan Yanar

Thx to teuto.net for providing me an OpenStack tenant, so I could run the tests for this Howto.

Auto-increments in a Multi-Master Cluster

Let’s start by considering a scenario where records are being inserted into a single auto-increment table via different nodes of a multi-master cluster. One issue that might arise is a ‘collision’ of generated auto-increment values on different nodes, which is precisely the subject of this article.

As the cluster is multi-master, it allows writes on all master nodes. As a result, a table might get the same auto-increment values on different nodes on INSERT. This issue is discovered only after the writeset is replicated, and that’s a problem!

Galera cluster suffers from a similar problem.

Let’s try to emulate this on a 2-node Galera cluster:
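A sketch of such an emulation (the table name is hypothetical; it assumes wsrep_auto_increment_control is switched OFF so that both nodes hand out the same auto-increment sequence):

  -- On both nodes
  SET GLOBAL wsrep_auto_increment_control = OFF;

  -- On node 1
  CREATE TABLE t1 (i INT AUTO_INCREMENT PRIMARY KEY) ENGINE=InnoDB;
  START TRANSACTION;
  INSERT INTO t1 VALUES (NULL);   -- gets auto-increment value 1

  -- On node 2, before node 1 commits
  START TRANSACTION;
  INSERT INTO t1 VALUES (NULL);   -- also gets auto-increment value 1

  -- On node 1
  COMMIT;                         -- succeeds

  -- On node 2
  COMMIT;                         -- fails: certification detects the duplicate key (reported as a deadlock)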

As expected, the second commit could not succeed because of the collision.

So, how do we handle this issue? Enter @@auto_increment_increment and @@auto_increment_offset! Using these two system variables one can control the sequence of auto-generated values on a MySQL/MariaDB server. The trick is to set them in such a way that every node in the cluster generates a sequence of non-colliding numbers.

For instance, let’s discuss this for a 3-node cluster (n=3):
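A sketch of what this could look like, with n = 3:

  -- Node 1: generates 1, 4, 7, ...
  SET GLOBAL auto_increment_increment = 3, GLOBAL auto_increment_offset = 1;
  -- Node 2: generates 2, 5, 8, ...
  SET GLOBAL auto_increment_increment = 3, GLOBAL auto_increment_offset = 2;
  -- Node 3: generates 3, 6, 9, ...
  SET GLOBAL auto_increment_increment = 3, GLOBAL auto_increment_offset = 3;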

As you can see, by setting each node’s auto_increment_increment to the total number of nodes (n) in the cluster and auto_increment_offset to a number in [1,n], we can ensure that the auto-increment values generated are unique across the cluster and thus avoid any conflict or collision.

In Galera cluster this is already taken care of by default. As and when a node joins the cluster, the two auto-increment variables are adjusted automatically to avoid collisions. However, this capability can be controlled by using the wsrep_auto_increment_control variable.
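For instance, making sure the automatic adjustment is back on and inspecting the resulting values:

  SET GLOBAL wsrep_auto_increment_control = ON;   -- the default
  SHOW VARIABLES LIKE 'auto_increment%';          -- increment equals the cluster size, offset differs per node
  SHOW VARIABLES LIKE 'wsrep_auto_increment_control';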

With this setting the last COMMIT in the above example would succeed.