MariaDB connect engine XML

In January I wrote my first post about node.js and MariaDB . In February I continued with a second post about using jQuery and some GIS calculations. Now it is time for the third and this time the main focus is not so much on GIS functionality, but instead on the capabilities MariaDB has for handling piles of unstructured data. In this case I’ll be focusing on crunching a pile of XML files without importing the XML data itself.

MariaDB 10.0, which had its GA launch in early April, includes the Connect engine. It has been developed to dynamically access all kinds of data sources, from other databases over ODBC connections to web log files. The Connect engine establishes a connection to the data source and exposes the data source to MariaDB as a table. Let’s get started!

To begin with, check that you do not have the Connect engine installed, which you shouldn’t if you haven’t played with it before. It’s a separate install. To verify if it’s installed (or not) run the SHOW ENGINES command. The Connect engine is simply called CONNECT in the output. I’m running this demo on Ubuntu 13.10 so the following instructions focus on that platform, but I’ll indicate how it’s done on other platforms as well.

Make sure you’ve configured APT to use the MariaDB repository. Instructions for that are found in the MariaDB repository configuration tool part of downloads.mariadb.org. Once done with adding the repository the Connect engine can be installed with a normal apt-get command. For example:

Notice that on RPM platforms you would do something similar with yum, i.e. yum install MariaDB-connect-engine .

Next, tell MariaDB to enable the Connect engine:

Now we’re ready with all prerequisites to actually start implementing the solution. Since I’m continuing on the development demo I built earlier, which is about showing running tracks on Google Maps I’m going to continue with that. Also in this case I will use GPX files. GPX files are stored by many GPS devices including running watches and smartphones. GPX itself is a data representation format with a specific XML schema:

For the demo I’ll use a directory /home/rasmus/nodegpxmariadb/public/uploads/ which I’ve preloaded with a set of GPX files. Now starts the interesting part. I want to make the data in these XML formatted files available to MariaDB. It’s pretty straight forward by creating a new table with the Connect engine:

In the CREATE TABLE statement you can see that Connect is specified as the engine, the table_type option is set to XML, multiple is set to 1 to allow for multiple files and file_name includes the path and the pattern of the XML files that I want to make accessible. In the tabname option the XML node under which the data resides is given. Data in this node will be represented as table rows. In this case the node is called trkseg. This node includes all the trkpt -nodes that we’re interested in. A single trkpt node represents a track point stored at certain intervals while running. One track point includes longitude, latitude, elevation and time. By combining these trackpoints you’ll have a representation of the running track. That is what I’m going to do.

In the CREATE TABLE statement there is one more thing to notice. The option tabname and the table field ele both include namespace directives to give to the XML parser so that it knows how to get hold of the right node. This was a bit tricky, but it has to do with the use of the namespaces of GPX and the XML parser that is used. The Connect engine uses libxml2 on Linux.

With the table GPXSource now in place let’s try to do a SELECT over it:

Voila! We can see that it works and I get out the number of trackpoints per GPX file.

Let’s try another SELECT that I will make use of in the actual application. We want to get the duration of each of the runs by calculating the difference between first and last trackpoint for each GPX file:

Now I have all the data I need to be able to create my sample application.  In this blog post I won’t dive into the Node.js application code part, but I have made all source code available in the same way as before on Github, https://github.com/rasmushoj/nodejs-gpx-mariadb . Hopefully you find it useful. The end result however accessing the XML files through MariaDB without importing the data looks like this:

MariaDB connect engine XML

Dotdeb is a repository currently targeting Debian and Ubuntu, providing a nice set of packages for LAMP servers.

Recently, MySQL 5.6 was added to the dotdeb repository. On the surface, this is a very harmless addition. MariaDB is a replacement for MySQL and it should be possible for applications designed for MySQL to easily switch to MariaDB. Therefore MariaDB also includes the libraries that applications using MySQL depend upon, such as libmysqlclient18 and mysql-common. The dpkg package manager looks at the MySQL 5.6 packages in dotdeb and assumes that 5.6 is a higher version than 5.5, which results in it removing or replacing libraries during normal apt-get installation and upgrade procedures.

The problems appear in the following scenarios:

  1. apt-get install mariadb-server-5.5 (fresh or over a previous version)
  2. apt-get upgrade (if MariaDB 5.5 is installed)
  3. apt-get dist-upgrade (if MariaDB 5.5 is installed)

Scenario 3, dist-upgrade, is especially nasty because in its attempt to upgrade the libmysqlclient18 and mysql-common packages, dpkg will REMOVE MariaDB.

MySQL 5.5 users are also affected in a similar way although the consequences look to be a bit lighter. Our early tests show for example that dist-upgrade will upgrade MySQL 5.5 to MySQL 5.6, which might come as a surprise. MySQL 5.5 users should therefore double-check how they are affected.

See below for examples of what these problems look like in practice. It is also worth noting that the problems do NOT exist if you are using MariaDB 10.0. With MariaDB 10.0 all the listed scenarios work just fine. MariaDB 10.0 is GA now and therefore a good alternative.

Solution

Fortunately there is a workaround that can be used and has been proven to work. By pinning the MariaDB repository you can override the normal behavior (where the highest version number wins). The process is simple. You only need to add a preferences file (with a .pref extension) for the MariaDB repository under the /etc/apt/preferences.d/ directory. The contents should look similar to the following:

The value of Pin-Priority must be at least 1000. Also note that the Pin: row needs to have the domain name of the repository that you are using for your MariaDB apt sources file entry. In this example the domain is ftp.osuosl.org.

With the preference file in place the packages in the MariaDB repository will be preferred over the similarly named packages in the dotdeb repository and the problematic scenarios listed above will work just fine.

Many thanks to our users who promptly reported the problem on IRC, helped to find the root cause and experimented with the workaround!

Examples of the problematic scenarios

Scenario 1: Fresh installation or installation over a previous version of MariaDB 5.5
Result: Fails

Scenario 2: apt-get upgrade
Result: MariaDB server and client are not upgraded

Scenario 3: apt-get dist-upgrade
Result: mariadb-server and mariadb-client are removed

Best MariaDB ever hits the streets – MariaDB 10 innovations developed with Google, SkySQL & Fusion-IO

London, United Kingdom – 31 March 2014 – The MariaDB Foundation, an independent body which promotes the popular open source database MariaDB, today announced the much-anticipated general availability of MariaDB 10, providing today’s generation of application developers with enhanced performance and functionality.

Since its launch in 2009, MariaDB has built an active and vibrant open source community and has led the way in database innovation. In 2013, Wikimedia Foundation, the people behind Wikipedia, announced the migration of many of its production MySQL systems over to MariaDB and Google confirmed it aims to move its thousand-plus MySQL instances over to MariaDB technology.

Engineered by some of the world’s leading database experts, including MySQL founder, Michael ‘Monty’ Widenius, MariaDB has been enthusiastically adopted by a fast-growing eco-system including influential Linux distributions Red Hat, Fedora, Suse and Debian. MariaDB 10 includes numerous innovations developed with and for web-scale players like Google, Fusion-IO and Taobao such as:

Replication

  • MariaDB 10 sets a new standard in performance. It is many times faster than previous generations of MariaDB and especially legacy database MySQL thanks to new features including parallel replication and a further advanced group commit. Also, the replication slaves are now crash-safe.
  • It’s also now possible to replicate data from multiple master servers giving a complete view of the distributed data across big datasets for real-time analytical purposes via the multi-source replication feature.

NoSQL Capabilities

  • The CONNECT engine enables dynamic access to diverse data sources dynamically, including unstructured files such as log files in a folder, or any ODBC database, from within MariaDB 10. Great for ETL (Extraction, Transformation and Load) and Real-Time analysis.
  • Dynamic Columns store disparate labelled data objects in each row of a table in much the same way as NoSQL technologies.
  • Access data from Cassandra data directly inside MariaDB 10, and interoperate directly with a widely adopted Big Data technology.

Sharding

  • MariaDB 10 includes built-in sharding in the form of the SPIDER engine, allowing big database tables to be split across multiple servers, for performance and scale. MariaDB combines sharding with the new replication features to provide true high availability.

Michael ‘Monty’ Widenius, CTO MariaDB Foundation said: “MariaDB has led the way as a cutting-edge, open source database innovation. MariaDB 10 continues MariaDB’s reputation for delivering new capabilities ahead of the curve, building advanced features to address the needs of the most demanding users.

“We’d like to thank the tireless support of our principal contributors SkySQL and community members like Google and Fusion-IO whose continuous feedback is essential in driving innovation in MariaDB. We are also very thankful for all community developers that have contributed to the success of MariaDB.” concludes Widenius.

“We are delighted to see our collaboration with the MariaDB community come to fruition in the MariaDB 10.0 solution,” said Madhu Matta, Fusion-io Vice President of Hyperscale. “Databases continue to showcase how flash technologies like Fusion ioMemory drive application acceleration and solve critical customer problems. The unique combination of MariaDB 10 innovations and our flash aware interfaces like NVM Compression and Atomic Writes will help drive the next chapter of database technologies.”

Patrik Sallner, CEO SkySQL said: “MariaDB 10 is a landmark open source innovation and an evolution of the fundamental SQL technology which runs the world’s largest databases. Developers can now achieve state-of-the-art application performance at scale with a SQL database they know and trust. If you want fast and secure applications, MariaDB is the right choice.”

For application developers looking to build serious, revenue-generating applications using MariaDB, SkySQL will be soon releasing its next generation of MariaDB Enterprise that provides all the tools and professional expertise needed to run MariaDB in large, complex production environments.

About MariaDB Foundation
The MariaDB Foundation promotes, protects, and advances the MariaDB codebase, community, and ecosystem. It’s members are companies like SkySQL (who also employ many core MariaDB code contributors), Booking.com, Parallels and Automattic. In addition it receives sponsor support from a range of vendor and user companies. Visit www.mariadb.org to learn more.

For more information visit www.mariadb.org or contact Paul Maher, Positive Marketing on +44 7900 600013.

Download MariaDB 10.0.10

Release Notes Changelog What is MariaDB 10.0?

MariaDB APT and YUM Repository Configuration Generator