On Thursday MySQL technology saw a huge boost. It's hard for anyone now to argue that MySQL isn't in the game of extreme scalability and performance, which some NoSQL vendors have been using as a tagline for the last few years. Seeing four of the largest MySQL and MariaDB users come together to bootstrap a branch of MySQL for extreme scaling needs is simply fantastic. The improvements made inside these companies will now be available to the rest of the community. To be fair, Facebook and Twitter in particular have made their improvements publicly available before as well. Google has also made some improvements publicly available over the years and has lately been active in the MariaDB project with code reviews, bug fixes and other patches. But broadening the public contributions further and combining it all is new.

Engineering of MySQL technology happens in many places. Aside from Oracle and the companies behind WebScaleSQL, two other entities have made significant contributions to improving MySQL technology in recent years: Percona and MariaDB. Percona has made many performance-related improvements to functionality and tooling. The MariaDB project has made significant engineering efforts, bringing many new features to MySQL technology, and has also become the home for community contributions. Take a look at the list of bigger contributions that have made it into MariaDB 10.0.

MariaDB 5.5 is constantly being merged with the MySQL 5.5 community edition. Every time there is a new minor release of MySQL 5.5, a new release of MariaDB 5.5 with exactly the same number comes out shortly afterwards. In MariaDB 10, this dependency is looser, which the numbering also indicates. MariaDB 10 includes a lot of merged code from MySQL 5.6, but it also includes a large amount of MariaDB-specific code and code merged from the wider community. WebScaleSQL will be another important source for merges in the future. Without knowing all the details of WebScaleSQL, it should be safe to say that there are two sorts of patches: the ones that improve MySQL technology in general and the ones that are specific to running MySQL at extreme scale, with integration into technologies not commonly used in more typical MySQL or MariaDB setups.

MariaDB is all about improving the world's most installed database, MySQL, and keeping it available to the masses, whether they are individuals who need a database for their blog or organizations running mission-critical enterprise applications. MariaDB therefore needs to provide all the components needed, from database drivers (connectors) to integrated high-availability solutions like MariaDB Galera Cluster.

In addition, the majority of the users and organizations using MariaDB or MySQL don't have the in-house skills to build and make changes to MariaDB or MySQL. This is why MariaDB has to be supported on a wide variety of platforms, with binaries provided for all of them. WebScaleSQL is currently only compatible with GNU/Linux x86_64 platforms and no binaries are produced.

WebScaleSQL confirms the power of community driven development in open source and is a very nice addition to the branches and forks of MySQL!

Background

Eventual consistency is a consistency model used in many large distributed databases which requires that all changes to a replicated piece of data eventually reach all affected replicas; conflict resolution is not handled and responsibility is pushed up to the application author in the event of conflicting updates [13].

Eventual consistency is a specific form of weak consistency; the storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value [14]. If no failures occur, the maximum size of the inconsistency window can be determined based on factors such as communication delays, the load on the system, and the number of replicas involved in the replication scheme [3].

A few examples of eventually consistent systems:

  • DNS
  • Asynchronous master/slave replication on an RDBMS e.g. MariaDB
  • memcached in front of MariaDB, caching reads

Eventual consistency states that in an updatable replicated database, eventually all copies of each data item converge to the same value. The origin of eventual consistency can be traced back to Thomas’ majority consensus algorithm [12]. The term was coined by Terry et al. [11] and later popularized by Amazon in their Dynamo system, which supported only eventual consistency [7].

The CAP theorem, also called Brewer's theorem after its author Dr. Eric A. Brewer, was introduced at PODC 2000 [4, 5]. The theorem was formally proven by Gilbert and Lynch [8]. Brewer introduced consistency, availability and partition tolerance as three desired properties of any shared-data system and made the conjecture that at most two of them can be guaranteed at any one time [6].

In general, this theorem matches the needs of today's Internet systems well. Ideally we expect a service to be available for the whole time it is connected to the network/Internet [1]. Thus if a network connection is available, the service should be available as well [9, 10]. To achieve good performance, requests need to be processed by a distributed system. If the number of servers is increased, the probability of server failure or network failure also increases. A system therefore needs to take this into account and be designed in such a way that these failures are transparent to the client and their impact is minimized [2]. The abbreviation of the theorem comes from these three properties:

  • Consistency: This property requires that each operation executed within a distributed system where data is spread among many servers ends with the same result as if executed on one server with all data.
  • Availability: This property of a distributed system requires that sending a request to any functional node is enough for a requester to get a response. By complying with this property, a system is tolerant to the failure of individual nodes caused, for instance, by network throughput issues.
  • Partition Tolerance: A distributed system consists of many servers interconnected by a network. A frequent requirement is distributing the system across more data centers to eliminate the failure of one of them. During network communication, failures are frequent. Hence, a system needs to be fail-proof against an arbitrary number of failed messages among servers. Temporary communication interruption among a server set must not cause the whole system to respond incorrectly [9].

Eventual consistency is defined as follows:

Definition 1: Eventual consistency.

  • Eventual delivery: An update executed at one node eventually executes at all nodes.
  • Termination: All update executions terminate.
  • Convergence: Nodes that have executed the same updates eventually reach an equivalent state (and stay in it).

Example 1: Consider a case where data item R=0 is stored on all three nodes. Assume that we have the following sequence of writes and commits on node0: W(R=3) C W(R=5) C W(R=7) C. Now reads from node1 could return R=3 and reads from node2 could return R=5. This is eventually consistent as long as reads from all nodes eventually return the same value. Note that this final value could be R=3. Eventual consistency does not restrict the order in which the writes must be executed.

MariaDB Demonstration

As already stated, a normal master/slave setup on MariaDB is eventually consistent. In this article we are interested in a situation where we have a multi-master setup. We will use MariaDB 10.0. There are several possible topologies that could be considered here, but we have selected a ring topology (see Figure 1).


Figure 1: MariaDB ring topology.

In this topology Node0 is master and Node1 is slave for Node0. Similarly, Node2 is slave for Node1, and Node0 in turn is slave for Node2, which closes the ring. Let's start the configuration of the nodes with Node0:
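The configuration files are not reproduced in the text; a minimal sketch of what the Node0 server configuration could contain (the option values and log names here are assumptions) is:

    [mysqld]
    server_id         = 1            # must be unique for every node in the ring
    log_bin           = node0-bin    # binary log, so this node can act as a master
    log_slave_updates = 1            # forward replicated events, required in a ring topology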

Similarly Node1:
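A matching sketch for Node1, differing only in the server id and log name:

    [mysqld]
    server_id         = 2
    log_bin           = node1-bin
    log_slave_updates = 1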

And finally Node2:
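And the corresponding sketch for Node2:

    [mysqld]
    server_id         = 3
    log_bin           = node2-bin
    log_slave_updates = 1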

After this is done we can install the MariaDB databases and start the servers.
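The commands are not shown in the original text; on a typical installation this boils down to something like the following (paths and the service name vary by platform and are assumptions here):

    # on every node
    mysql_install_db --user=mysql     # initialize the data directory
    service mysql start               # start the server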

Now that the servers are up and running, let's set up the first master node on Node0:
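The statements themselves are not included here; a sketch of the usual master-side setup (the account name and password are made up, and the same account has to exist on every node, since each node acts as a master in the ring):

    -- run on every node
    CREATE USER 'repl'@'%' IDENTIFIED BY 'replpass';
    GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';
    SHOW MASTER STATUS;   -- note File and Position for the CHANGE MASTER TO statements below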

Fine, now we need to set Node1 as a slave for this with:
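A sketch of the slave-side statement on Node1 (host name, credentials and binlog coordinates are assumptions; use the values reported by SHOW MASTER STATUS on Node0):

    -- on Node1
    CHANGE MASTER TO
      MASTER_HOST     = 'node0',
      MASTER_USER     = 'repl',
      MASTER_PASSWORD = 'replpass',
      MASTER_LOG_FILE = 'node0-bin.000001',
      MASTER_LOG_POS  = 4;
    START SLAVE;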

Similarly, set Node2 as slave to Node1:
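The same statement again, this time on Node2 and pointing at Node1:

    -- on Node2
    CHANGE MASTER TO MASTER_HOST='node1', MASTER_USER='repl', MASTER_PASSWORD='replpass',
                     MASTER_LOG_FILE='node1-bin.000001', MASTER_LOG_POS=4;
    START SLAVE;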

And finally, set Node0 as slave to Node2:
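And on Node0, closing the ring towards Node2:

    -- on Node0
    CHANGE MASTER TO MASTER_HOST='node2', MASTER_USER='repl', MASTER_PASSWORD='replpass',
                     MASTER_LOG_FILE='node2-bin.000001', MASTER_LOG_POS=4;
    START SLAVE;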

Now let’s create one table and add some data to it from the different nodes:
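The statements are not reproduced in the text; as a sketch (only the table name a appears later in the article, the columns and values are assumptions):

    -- on node0
    CREATE TABLE a (i INT NOT NULL PRIMARY KEY, v INT) ENGINE=InnoDB;
    INSERT INTO a VALUES (1, 1);
    -- on node1
    INSERT INTO a VALUES (2, 2);
    -- on node2
    INSERT INTO a VALUES (3, 3);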

After this all nodes are eventually consistent and return the same result set, for example:
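For instance, assuming the sketch above:

    SELECT * FROM a;   -- eventually returns the same three rows on every node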

From this we can conclude that MariaDB is eventually consistent also with a multiple masters setup when there are no conflicting operations done.

But what happens if there is a conflict? We can test this scenario by trying to insert a duplicate key into the table a. We try to insert the value 5 on both node0 and node2 so that the final commit commands are issued at about the same time.
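A sketch of what is executed on node0 and node2 at roughly the same time, continuing the assumptions above (autocommit is off, as noted below):

    SET autocommit = 0;
    INSERT INTO a VALUES (5, 5);
    COMMIT;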

Because we used the InnoDB storage engine and autocommit was off, no error message is shown on either client connection at commit time. This is because MariaDB does not support deferred constraint checks and no error is possible in the following case:

  • You insert 5 on the server at node0; it succeeds.
  • Before the insert is replicated to the server at node2, you insert 5 on node2; that also succeeds, because the replication is asynchronous.
  • Then the second insert is replicated from node2 to node0; this causes a conflict due to the duplicate key value 5, so the replication thread gets the error and rolls back.

Thus the result set is the following on all three nodes:
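Assuming the sketch data from above, every node returns the same rows, with the key 5 present exactly once:

    SELECT * FROM a;   -- identical result set on node0, node1 and node2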

This is also eventually consistent because all servers return exactly the same value and they have executed exactly the same transactions. From the server log we can find out that:

And from, for example, node0 you can see it with:
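The command is not shown in the text; presumably it is the slave status, for example:

    SHOW SLAVE STATUS\G
    -- Slave_SQL_Running: No
    -- Last_SQL_Errno / Last_SQL_Error: the duplicate-key error that stopped the SQL thread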

As seen from the logs, the problem is that replication between the nodes has been stopped. However, there is a way to ignore replication errors caused by the application, by configuring --slave-skip-errors=XXX and --slave-exec-mode=IDEMPOTENT. The --slave-exec-mode option controls whether IDEMPOTENT or STRICT mode is used in replication conflict resolution and error checking. IDEMPOTENT mode suppresses duplicate-key and no-key-found errors. This mode is needed for multi-master replication and circular replication. Other valid errors caused by the application can be skipped using --slave-skip-errors.
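As a sketch, these options could be placed in the [mysqld] section of the configuration file on every node (or given on the command line):

    [mysqld]
    slave_skip_errors = all
    slave_exec_mode   = IDEMPOTENT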

To demonstrate, let's set --slave-skip-errors=all and --slave-exec-mode=IDEMPOTENT on all servers and restart them. We can now try to get the servers into different states (i.e. alternative futures). Execute the following on node0:
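The original statements are not included; one way to provoke such an alternative future, assuming the two-column table a sketched earlier, is to insert the same key with different data on two nodes:

    -- on node0
    INSERT INTO a VALUES (6, 100);
    COMMIT;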

And the following on node1:
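And, at about the same time:

    -- on node1
    INSERT INTO a VALUES (6, 200);
    COMMIT;

When each insert later replicates to the other node it hits a duplicate key; since that error is now suppressed, node0 keeps (6, 100) while node1 keeps (6, 200), so the servers have diverged.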

From the slave status we do not see any problems:
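Presumably something like:

    SHOW SLAVE STATUS\G
    -- Slave_IO_Running: Yes
    -- Slave_SQL_Running: Yes, and no errors reported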

But in the server log there is a warning:

This situation is not eventually consistent and MariaDB can't resolve it automatically. If the application needs eventual consistency, it must resolve this conflict itself so that all databases are again in the same state, one that is correct according to the application's rules.

Conclusions

Eventual consistency means that given enough time, during which no changes are performed, all successful updates will propagate through the system and all replicas will be synchronized. At any given time there is no guarantee that the data accessed is consistent, therefore conflicts have to be resolved. By this definition MariaDB is eventually consistent when replication errors are not ignored, even though replication stops on such errors, as long as replication is resumed at some point (in bounded time) and all servers then return the same state. If replication errors are ignored, the application must correct the case where two or more servers are in different states.

Our original question was: Is MariaDB eventually consistent?

Answer: For most master/slave setups where all data is replicated to the slaves, MariaDB is eventually consistent. For multi-master setups where only error cases handled by the application are ignored and where the application makes sure that servers can't diverge into alternative futures, MariaDB is eventually consistent. However, there are replication configurations where MariaDB is not eventually consistent.

References

[1] Bailis, P. and Ghodsi, A.: Eventual consistency today: limitations, extensions, and beyond. Communications of the ACM, vol. 56, no. 5, pp. 55-63, May 2013.

[2] Bernstein, P. A. and Das, S.: Rethinking Eventual Consistency. In SIGMOD '13, June 22-27, 2013.

[3] Bermbach, D. and Tai, S.: Eventual Consistency: How soon is eventual? In Proceedings of the 6th Workshop on Middleware for Service Oriented Computing (MW4SOC '11), ACM, New York, December 2011.

[4] Brewer, E: PODC keynote. http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf, 2000.

[5] Brewer, E.: Towards Robust Distributed Systems (invited talk). In Principles of Distributed Computing (PODC), Portland, Oregon, July 2000.

[6] Brewer, E.: CAP twelve years later: How the “rules” have changed. IEEE Computer, vol. 45, no. 2, pp. 23-29, February 2012.

[7] DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P. and Vogels, W.: Dynamo: Amazon's highly available key-value store. In Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP), pp. 205-220, 2007.

[8] Gilbert, S. and Lynch, N.: Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, vol. 33, no. 2, pp. 51-59, 2002.

[9] Hale, C.: You can't sacrifice partition tolerance. Available from http://codahale.com/you-cant-sacrifice-partition-tolerance.

[10] Shapiro, M. and Kemme, B.: Eventual Consistency. In Encyclopedia of Database Systems, pp. 1071-1072, 2009.

[11] Terry, D. B., Demers, A. J., Petersen, K., Spreitzer, M.J., Theimer, M.M., Welch, B. B.: Session guarantees for Weakly Consistent Replicated Data. In PDIS, pp. 140-149, 1994.

[12] Thomas, R. H.: A majority consensus approach to concurrency control for multiple copy databases. ACM Trans. on Database Systems, vol. 4, no. 2, pp. 180–209, June 1979.

[13] Vogels, W.: Scalable Web services: Eventually Consistent, ACM Queue, vol. 6, no. 6, pp. 14-16, October 2009.

[14] Vogels, W.: Eventually consistent, Communications of the ACM, vol. 52, no. 1, pp. 40–44, January 2009.

A MariaDB Howto authored by: Erkan Yanar.

This is a Howto about installing MariaDB Galera Cluster on Debian/Ubuntu. Because a lot of people were having problems installing MariaDB Galera Cluster, elenst from #maria on freenode forced me to write this Howto :)

Installing MariaDB Galera Cluster is in fact quite easy and actually kind of boring in the end. This Howto is written for (and tested on) Debian 7.1 (Wheezy) and Ubuntu 12.04 (Precise).

What we need

In our setup we assume 3 nodes (node01, node02, node03) with one interface each. We assume the following IP addresses: 172.16.8.5, 172.16.8.6, and 172.16.8.4. We need three packages installed on all nodes:

  • rsync
  • galera
  • mariadb-galera-server

As Galera does not ship with the distribution repositories, use the repo configurator and follow the instructions to include the repository fitting your system. Keep in mind to choose "5.5" in Step 3 (Choose a Version). Having done this, you can jump directly to Install Packages.

Adding the Repository

Alternatively you can just take the following steps.

Debian Wheezy
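The commands are not reproduced here; roughly, adding the repository on Wheezy looks like this (take the exact mirror URL and signing key from the repo configurator; the placeholders below are assumptions):

    apt-get install python-software-properties
    apt-key adv --recv-keys --keyserver keyserver.ubuntu.com <repo-key-id>
    add-apt-repository 'deb http://<mirror>/repo/5.5/debian wheezy main'
    apt-get update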

Ubuntu Precise
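And on Precise, where only the distribution part of the repository line changes:

    apt-get install python-software-properties
    apt-key adv --recv-keys --keyserver keyserver.ubuntu.com <repo-key-id>
    add-apt-repository 'deb http://<mirror>/repo/5.5/ubuntu precise main'
    apt-get update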

Yes, they are nearly identical. :)

Install Packages

(Just another shortcut for the impatient)
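With the repository in place, installing the three packages listed above is a single command on each node:

    apt-get update
    apt-get install rsync galera mariadb-galera-server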

After installing the packages you will have a running MariaDB server on each node. But none of them will be configured to run as a node in a MariaDB Galera Cluster.

Configuring Galera

So we have to do some configuration next. There is a MariaDB configuration part and one part that configures Galera (the settings starting with wsrep_). As we do the most basic and simple installation in this Howto, it is sufficient to just replace the IPs (remember: 172.16.8.5, 172.16.8.6, 172.16.8.4) with your own.
These are needed to define the wsrep_cluster_address variable (the list of nodes a starting mysqld contacts to join the cluster).

The following configuration file has to be distributed to all nodes. We use a separate configuration file /etc/mysql/conf.d/galera.cnf with the following settings:
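The file itself is not reproduced here; a minimal sketch of such a galera.cnf (the library path and cluster name are assumptions, the IPs are the ones from our setup):

    [mysqld]
    # settings Galera requires from MariaDB
    binlog_format            = ROW
    default_storage_engine   = InnoDB
    innodb_autoinc_lock_mode = 2
    query_cache_size         = 0
    bind-address             = 0.0.0.0

    # wsrep_ settings
    wsrep_provider        = /usr/lib/galera/libgalera_smm.so
    wsrep_cluster_name    = "my_cluster"
    wsrep_cluster_address = gcomm://172.16.8.5,172.16.8.6,172.16.8.4
    wsrep_sst_method      = rsync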

FYI: The shared library for wsrep_provider is provided by the installed galera package.

We could also change the cluster name by changing the value of wsrep_cluster_name to fit our style. This setting also works as a shared secret to control access to the cluster. With wsrep_cluster_address you see the IP addresses of our setup. The wsrep_sst_method tells what method to use to synchronise the nodes. While mysqldump and xtrabackup are also available, I prefer rsync because it is easy to configure (i.e. it does not need any credentials set on the nodes). If you are considering using the xtrabackup method, don't forget to install xtrabackup.

Starting the Galera Cluster

First we stop mysqld on all nodes.
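On Debian/Ubuntu this is simply:

    node01# service mysql stop
    node02# service mysql stop
    node03# service mysql stop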

The configuration file (galera.cnf) is already distributed to all nodes, so we next start the first mysqld. This node initializes/starts the cluster (creates a GTID).
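One common way to do that, assuming the init script passes extra options on to mysqld, is:

    node01# service mysql start --wsrep-new-cluster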

To have a look and see if everything really worked we’ll check the cluster size status variable.
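For example, from the mysql client on node01:

    SHOW STATUS LIKE 'wsrep_cluster_size';
    -- +--------------------+-------+
    -- | Variable_name      | Value |
    -- +--------------------+-------+
    -- | wsrep_cluster_size | 1     |
    -- +--------------------+-------+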

If you see the above, great! That’s what we would expect. Now that the Cluster already exists, we let the next nodes just start and join the cluster.
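A plain start is enough for the joining nodes (the init script may print an error about the 'debian-sys-maint' user here, which is discussed further down):

    node02# service mysql start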

We can ignore the above error for now. This node is still starting fine.

Let’s pause here and do a quick check. As we are running a cluster it is not important if we execute the following on node01 or node02.
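For example:

    SHOW STATUS LIKE 'wsrep_cluster_size';   -- the Value should now be 2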

If you see the above, very nice! Now let’s start the third node:
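Again just a plain start:

    node03# service mysql start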

Ok we are finished. We have a running MariaDB Galera Cluster \o/

Having fun with Debian/Ubuntu init scripts

But we’ve got to fix some things because of some Debian/Ubuntu oddities.

Remember the error we saw when starting node02 and node03? What happened? Well, Debian/Ubuntu uses a special user ('debian-sys-maint'@'localhost') in their init script, and the credentials for that user are stored in /etc/mysql/debian.cnf. This user is used to perform some checks when starting MySQL, checks I don't think belong in a service script anyway.

We could simply ignore it, but the user is also used to shut down mysqld. This is also not required, as a SIGTERM is sufficient to shut down mysqld :/

As we copied the data from node01 to all other nodes, the credentials in /etc/mysql/debian.cnf don't match on node02 and node03. So we will not be able to shut down mysqld on either of these nodes.

So we've got to fix it by copying /etc/mysql/debian.cnf from the first node (node01) to all other nodes, so that the data and the configuration file match again.
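For example (assuming ssh access between the nodes):

    node01# scp /etc/mysql/debian.cnf node02:/etc/mysql/debian.cnf
    node01# scp /etc/mysql/debian.cnf node03:/etc/mysql/debian.cnf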

After that we are able to shut down the daemon:
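For example:

    node02# service mysql stop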

Great.

So if we had a proper init script, the Howto would have been even shorter ;)

Follow the Bug :)

Enjoy your MariaDB Galera Cluster and have fun!

– Erkan Yanar

Thx to teuto.net for providing me an OpenStack tenant, so I could run the tests for this Howto.