ThoughtBlog: More on XMPP and Clustering – 2/20
Crazy goals and dreams
- Horizontally scalable at the application server. The system works with 1 mongrel as well as 1000. Since shared-nothing lets us do this very easily, maintaining or improving this area basically boils down to whether the XMPP server can handle at least as many active connections as a MySQL server. I think GSFN will be quite gigantic by the time we run into a bottleneck at this level with either choice.
- Horizontally scalable reads from, and queries of, the data store. I should be able to add boxes into the mix and have the number of users that can access the store simultaneously grow accordingly.
- Horizontally scalable writes into the data store. As above.
- Fault tolerance. Build the system such that failures are seen by the end user only locally: the whole site goes down only if every node in the system goes down. Pieces of the system can wink in and out of existence as failures happen, but the system should be able to recover from such faults.
- Redundant (which is, of course, different from fault tolerant). Given a node whose functionality/data is redundantly maintained across several nodes in the system, any single node should be able to fail and affect at most a single user request. If we make as many of the system’s functions idempotent as possible, we get this easily: just re-run the request.
- Easily-commandable. I should be able to shrink/grow the machines in the system, assess system health (and act accordingly), and quickly query data from one place.
Now, achieving these goals can happen through any number of paths, but for the purposes of this exercise I’ll think through how I’d use XMPP as a tool to achieve them.
Crazy assumptions and assertions about how XMPP will fit into the system
So, let’s list out the things I’m assuming I want this system to be. To be added to, changed, revised, and scrapped in the future.
- Data is sharded amongst N data nodes, and user requests are served by M application nodes. Each node is represented in the system as a Jabber client.
- Nodes join/leave the system dynamically at runtime, and use presence notifications to broadcast health and availability to the general system. Other nodes that want to know the health/availability of a given node subscribe to that node’s presence (i.e., add it to their contact list).
- All modifications to the data store (insert, update, delete) happen by sending an XMPP message. This can happen in one of several different ways, but each method starts with an application node forming an XMPP message and sending it off into the Jabber server.
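To make the presence idea concrete, here is a sketch of the kind of stanza a data node might broadcast when it comes online. The JID and the status text are made up for illustration; standard XMPP presence only defines the `<show>` and `<status>` children, so any richer health payload would need its own convention.

```xml
<!-- Hypothetical health broadcast from a data node; subscribers on its
     contact list receive this automatically when presence changes. -->
<presence from="data7@cluster.example.com/shard">
  <show>chat</show>
  <status>healthy load=0.42</status>
</presence>
```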
A Choice of Jabber Server
ejabberd. Given that I’m a fan of, and can hack on, Erlang, it seems a sane choice. Work on 2.0 (at the release-candidate stage at the moment) has apparently centered on a re-architecture with an eye towards scalability. DJabberd was another possibility, but apparently it hasn’t seen much development over the past 6 months.
A Choice of Development Languages and Jabber Libraries
JRuby and Smack. Rather, JRuby because of Smack. The graphical debugger bundled with Smack is fucking sexy.
I can see that being a major boon when bumbling my way through this implementation. Not to mention it seems more mature than XMPP4R, as well as having a better license (Apache for Smack, GPL for XMPP4R).
Also, JRuby offers a number of benefits that are very appealing: access to the entire body of libraries available in Java, and a community process I feel I can participate in, which opens the possibility of contributing to JRuby if I run into any problems. Just to name a few.
Ideas for performing writes
We need to partition our objects amongst the various data nodes. Certain classes within the Get Satisfaction codebase create objects that are the roots of a tree of objects. We can partition on these root objects, distributing them across the entire data store. When one root object refers to another root object, we store the link as a weak reference through which we can load the referenced object from its own node.
There are various ways to actually partition these roots, and from what I’ve seen you really want to choose this algorithm based on experimental results from testing your application: each choice has tradeoffs. For now, I’m going to assume the system will employ some continuum-based consistent hashing algorithm (like libketama, or the hashing strategies described in the Amazon Dynamo paper) from which a location to write data is chosen.
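As a rough illustration of the continuum idea, here’s a minimal ketama-style ring in Ruby. The node names, the virtual-point count, and the truncated-SHA1 hash are all invented for the sketch; a real deployment would tune these experimentally, as noted above.

```ruby
require 'digest/sha1'

# A minimal consistent-hash ring: each node claims many points on a
# circle, and a key is owned by the first node point clockwise of it.
class HashRing
  POINTS_PER_NODE = 160 # virtual points smooth out the distribution

  def initialize(nodes)
    @ring = {}
    nodes.each do |node|
      POINTS_PER_NODE.times do |i|
        @ring[hash_point("#{node}:#{i}")] = node
      end
    end
    @sorted = @ring.keys.sort
  end

  # Walk clockwise from the key's position to the first node point,
  # wrapping around to the start of the circle if necessary.
  def node_for(key)
    h = hash_point(key)
    point = @sorted.find { |p| p >= h } || @sorted.first
    @ring[point]
  end

  private

  def hash_point(s)
    Digest::SHA1.hexdigest(s)[0, 8].to_i(16)
  end
end

ring = HashRing.new(%w[data1 data2 data3])
owner = ring.node_for("topic:1234") # deterministic owner for this key
```

The payoff of the continuum over naive modulo hashing is that adding or removing a node only remaps the keys adjacent to that node’s points, rather than reshuffling everything.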
So, writes into the system need to be scalable, fault-tolerant, and redundant. Let’s think about how we can leverage XMPP to achieve this.
directly messaging nodes
When an object is to be persisted:
- The ‘owning’ node is calculated using the hashing algorithm.
- A message is sent directly to that node
- There is no step 3.
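The two steps above might look something like the following sketch, with a stub standing in for a real Smack connection. The node JIDs, the `WRITE` wire format, and the simple modulo hash (used here for brevity in place of the consistent hash) are all assumptions of mine.

```ruby
require 'digest/sha1'

# Stub transport so the sketch is self-contained; a real app node
# would hold an authenticated XMPP connection instead.
class StubXmppClient
  attr_reader :sent
  def initialize; @sent = []; end
  def send_message(jid, body); @sent << [jid, body]; end
end

DATA_NODES = %w[data1@cluster data2@cluster data3@cluster]

# Step 1: calculate the 'owning' node from the object's key.
def owning_node(key)
  DATA_NODES[Digest::SHA1.hexdigest(key)[0, 8].to_i(16) % DATA_NODES.size]
end

# Step 2: send the write directly to that node. There is no step 3.
def persist(client, key, payload)
  client.send_message(owning_node(key), "WRITE #{key} #{payload}")
end

client = StubXmppClient.new
persist(client, "topic:1234", '{"title":"hello"}')
```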
With this method we achieve a measure of scalability easily: a given data node could exist anywhere in the federated XMPP network. We can handle as many data nodes as our Jabber network can handle connected clients, and we can handle as many writes as our Jabber network can handle messages.
It’s worth noting that the scalability I’m talking about above is when dealing with reads/writes for multiple objects. Partitioning doesn’t give us scalability when one object is getting hammered: if an object only exists on one node, you can only handle as many writes to that object as the box it lives on can handle. To crack that nut we would need some (more) complicated system, like what Amazon’s Dynamo employs, to correct conflicts when writes across two nodes clash.
Fault tolerance is a bit tougher (but costs the same across any of the possible methods I’ve thought of so far). Since a message sent in the system is not guaranteed to be responded to, the application node requesting the write needs to check back to ensure that the write succeeded. The simplest, but not the best, method is to query the data store to see if the object has been stored successfully; if not, retry. We could also leverage the event notifications extension to XMPP to request a message back from the data node when it has successfully received our message: we post the message and wait for the delivery notification with a timeout; if the timeout expires, we retry the write.
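The ack-or-retry loop can be sketched independently of any particular transport. Everything here is invented for illustration (the retry cap, and the two lambdas standing in for "send the write" and "wait for the delivery notification until a timeout"); the point is just that the loop only works because the write is idempotent.

```ruby
# Retry a write until its delivery notification arrives or we exhaust
# our attempts. await_ack returns false when the timeout expires.
def write_with_retry(send_write, await_ack, max_attempts: 3)
  max_attempts.times do
    send_write.call
    return true if await_ack.call # ack arrived before the timeout
  end
  false # give up and let the caller escalate the failure
end

# Simulate two timed-out sends followed by an acknowledged one.
acks = [false, false, true]
attempts = 0
ok = write_with_retry(-> { attempts += 1 }, -> { acks.shift })
```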
Since messaging a data node directly doesn’t have any redundancy built into it, we need to build some in on the backside. We could, for example, have a data node replicate writes around to its peers. Such a system would require that each application node can figure out where those replicas exist in the cloud.
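One conventional way an application node could locate those replicas is a Dynamo-style preference list: the owner plus its next clockwise successors on the hash ring. This sketch hashes whole node names rather than virtual points, and the node names and replica count are illustrative.

```ruby
require 'digest/sha1'

# Return the `replicas` nodes responsible for `key`: the first node at
# or clockwise of the key's position, then its distinct successors.
def preference_list(nodes, key, replicas)
  key_hash = Digest::SHA1.hexdigest(key)
  sorted = nodes.sort_by { |n| Digest::SHA1.hexdigest(n) }
  start = sorted.index { |n| Digest::SHA1.hexdigest(n) >= key_hash } || 0
  replicas.times.map { |i| sorted[(start + i) % sorted.size] }
end

replicas = preference_list(%w[data1 data2 data3 data4], "topic:1234", 3)
```

Because every application node can compute this list from the ring membership alone, no lookup service is needed to find where a write’s copies live.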
Okay, that’s enough for today. Tomorrow we continue with this line of thinking, exploring more possibilities for writing into the store.