This series of posts (parts 1, 2, and 3) explores approaches to improving the distribution of repository cache invalidation events in a large ATG cluster. As mentioned, the default ATG distributed and distributedJMS cache modes have some serious problems but, thanks to the configurability of PatchBay, distributedJMS offers us some options.
Using distributedJMS cache mode with a third-party JMS provider offers a pretty good solution, although, as I pointed out in part 2, this can be challenging using JBoss JMS.
While I was thinking about how I would prefer to distribute invalidation events, it occurred to me that IP Multicast offered an almost perfect solution. With multicast, each event would be sent only once and distributed to all multicast group members. Multicast across subnets can be problematic in most enterprises, as routers are generally not configured to allow the packets to pass. This didn’t concern me, as all instances in each of my clusters run on the same subnet. Really, the only downside I could think of to using multicast was that it is unreliable.
Now, generally, when you say a communication protocol is unreliable people immediately think, “That is totally unacceptable!” But wait, is it really? Consider the following assertions:
- Some applications simply don’t require reliability. Take, for example, an application distributing stock quotes every 10 seconds. If you miss a quote, how important is it that it be retransmitted to you? Not very, I’d say.
- Multicast, or I should say UDP/IP in general, is pretty reliable if firewalls and routers are not involved. On the same subnet a very high percentage of packets will be delivered. In fact, the largest cause of lost UDP packets on the same subnet is application latency, i.e. the application is unable to consume packets fast enough. This is good news, as the problem can be solved at the application layer by adding caches or in-memory queues and processing in background threads.
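To make that last point concrete, here is a minimal Java sketch of the queue-plus-background-thread idea. `PacketBuffer` and its methods are hypothetical names for illustration only, not part of JGroups or ATG. The thread reading the socket calls `offer()`, which never blocks, and a single worker thread drains the queue and does the slow processing.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Hypothetical helper: decouples packet receipt from packet processing. */
class PacketBuffer implements AutoCloseable {
    private final BlockingQueue<byte[]> queue;
    private final Thread worker;
    private final AtomicInteger dropped = new AtomicInteger();

    PacketBuffer(int capacity, Consumer<byte[]> handler) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.worker = new Thread(() -> {
            try {
                while (true) {
                    // Slow work happens here, off the socket-reading thread.
                    handler.accept(queue.take());
                }
            } catch (InterruptedException e) {
                // close() was called: leave the loop and let the thread end.
            }
        }, "packet-worker");
        worker.setDaemon(true);
        worker.start();
    }

    /** Called from the socket-reading thread; returns immediately. */
    boolean offer(byte[] packet) {
        boolean accepted = queue.offer(packet);
        if (!accepted) {
            dropped.incrementAndGet(); // queue full: count the loss
        }
        return accepted;
    }

    int droppedCount() {
        return dropped.get();
    }

    @Override
    public void close() {
        worker.interrupt();
    }
}
```

The queue is bounded on purpose: if the consumer falls far enough behind, packets are dropped and counted rather than exhausting memory.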
Given the above, I felt that most of the time I would be fine distributing invalidation events via multicast but, in my case, a lost event might result in a stale item in a cache. In eCommerce this could mean telling a customer that a product which might be back ordered was in stock. That is an FTC violation and a good way to disappoint customers. Neither is a desirable outcome.
As a result, I began looking for a package to add reliability to a multicast solution, and what I found was JGroups. This product, which is used for cluster communications in JBoss and JBossCache, is a perfect fit. It offers a highly configurable protocol stack that allows as much (or as little) reliability as needed. In fact, if I have any complaint about JGroups it’s that the stack configuration has too many options. Fortunately, the product is distributed with example configurations for various purposes and, IMHO, it would be wise to stick with one of them.
Ok, JGroups sounds like a good solution, but how do we configure it in our ATG cluster? If you are thinking of using the distributedJMS cache mode, you’re on my wavelength. As the diagram on the left depicts, we can easily modify how invalidation events are distributed via this cache mode by extending ATG’s message source class atg.adapter.gsa.invalidator.GSAInvalidatorService. The image on the right was taken from the Eclipse Override/Implement Methods tool. By examining this information it is pretty clear that there are two methods that GSA may call to send an invalidation event. This pretty much jibes with what we learned in part 2 about the message classes used to communicate invalidation information: atg.adapter.gsa.invalidator.GSAInvalidationMessage and atg.adapter.gsa.invalidator.MultiTypeInvalidationMessage. In fact, since one of these invalidate() methods takes a MultiTypeInvalidationMessage as a parameter, it is reasonable to assume that the other invalidate() method is expected to construct and send a GSAInvalidationMessage.
So, how does this all work? It’s pretty straightforward really. When the GSA repository sends an invalidation event to the configured message source, our class extension is called via one of the invalidate() methods. Rather than placing the event on a JMS destination as PatchBay expects, we use JGroups to send the event to all group members. We use this same component (the extended message source) to register a JGroups Receiver which will be notified of incoming events. Our receiver code simply places the event on a LocalJMS topic which has been configured to have the GSA Invalidation Receiver as a subscriber. The rest, as they say, is out of the box ATG.
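The flow described above can be sketched in plain Java. Everything here is a hypothetical stand-in: `GroupTransport` plays the role of the JGroups channel, the `Consumer<String>` passed to `InvalidationBridge` plays the role of the LocalJMS topic, and `InvalidationBridge.invalidate()` sits where the overridden invalidate() methods would be. The real ATG and JGroups signatures differ, so treat this as an outline of the wiring rather than working integration code.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

/** Hypothetical stand-in for a group channel such as a JGroups channel. */
interface GroupTransport {
    void send(String event);
    void onReceive(Consumer<String> receiver);
}

/** In-memory "group": every member sees every message, including its own. */
class InMemoryGroup {
    private final List<Consumer<String>> receivers = new CopyOnWriteArrayList<>();

    GroupTransport join() {
        return new GroupTransport() {
            @Override
            public void send(String event) {
                for (Consumer<String> receiver : receivers) {
                    receiver.accept(event);
                }
            }

            @Override
            public void onReceive(Consumer<String> receiver) {
                receivers.add(receiver);
            }
        };
    }
}

/** Hypothetical stand-in for the extended message source. */
class InvalidationBridge {
    private final GroupTransport transport;

    InvalidationBridge(GroupTransport transport, Consumer<String> localTopic) {
        this.transport = transport;
        // Inbound: hand each group message to the local topic, where the
        // invalidation receiver is assumed to be subscribed.
        transport.onReceive(localTopic);
    }

    /** Outbound: send to the group instead of a JMS destination. */
    void invalidate(String repositoryPath, String itemDescriptor, String itemId) {
        transport.send(repositoryPath + "|" + itemDescriptor + "|" + itemId);
    }
}
```

With two bridges joined to the same `InMemoryGroup`, a call to `invalidate()` on one is delivered to the local topic of both, which mirrors the default behaviour of a group in which members also receive their own messages.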
One problem I encountered in early testing of this environment is related to two key facts:
- By default a JGroups subscriber receives all group messages, including the ones it sends. This is how multicast works by default as well.
- The ATG repository sends invalidation events before the transaction is committed. This is a real head scratcher but it is documented to work this way and obviously does. ATG should reconsider this strategy as it only makes sense to distribute an event after a transaction is successfully committed.
Given this information, here is what I found. The invalidation event was often being received by the instance that originated it before the repository transaction was committed. This created a race condition that sometimes resulted in a deadlock, as the Invalidation Receiver attempted to remove the repository item(s) from cache while another thread was committing them.
This issue was easily avoided by setting an option on the JGroups Channel that disables receipt of a member’s own messages.
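The exact option depends on the JGroups version: in 2.x it is typically `channel.setOpt(Channel.LOCAL, Boolean.FALSE)`, while 3.x and later expose `channel.setDiscardOwnMessages(true)`, so check the documentation for the release you run. The effect is roughly equivalent to the following filter, shown as a plain Java sketch with hypothetical names (`DiscardOwnMessages`, string addresses) rather than real JGroups types.

```java
import java.util.function.BiConsumer;
import java.util.function.Consumer;

/** Hypothetical filter: drops events that this member sent itself. */
class DiscardOwnMessages implements BiConsumer<String, String> {
    private final String localAddress;
    private final Consumer<String> delegate;

    DiscardOwnMessages(String localAddress, Consumer<String> delegate) {
        this.localAddress = localAddress;
        this.delegate = delegate;
    }

    /**
     * @param sender address of the member that sent the event
     * @param event  the invalidation event payload
     */
    @Override
    public void accept(String sender, String event) {
        if (localAddress.equals(sender)) {
            return; // our own event: skip it to avoid racing the local commit
        }
        delegate.accept(event);
    }
}
```

Letting the channel do this is preferable to filtering by hand, since the message is then discarded before it ever reaches the receiver.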
One final issue I’d like to point out relates to multi-homed hosts. All of my test machines are multi-homed, and I discovered that sometimes the JGroups connection would be bound to interface A while other connections were bound to interface B. This resulted in the group being partitioned into two separate groups with the same name. Since the two groups were on different subnets (and our routers do not pass multicast packets), they didn’t know about each other.
There are a couple of solutions to this problem, but the simplest one I found was to configure the JGroups UDP layer with the option receive_on_all_interfaces="true". This allows a connection to receive packets on all interfaces configured on the machine.
I currently have this configuration running in a test environment of 10 ATG instances and it is working very well. I can’t duplicate the volume or size of our production clusters in test so it remains to be seen how this will work under load. I’ll report back when I have more information but, for now, I remain optimistic.