Boardin’ a ship today fer Hawaii. Do ye reckon ’tis a coincidence that ’tis Talk Like a Pirate Day? I reckon nah, matey.
Ok, so maybe the best breakfast in Vancouver is at Scoozis. At least it is the best I’ve had. As Judy Jetson might say, “It’s the most ut”. Don’t know who Judy Jetson is? Joan Jett’s sister, I think 🙂
Vacation time is finally here and we are currently in Vancouver, BC after a grueling 14-hour trip from Dallas. We encountered a plane with a defective tire and Alaska Air was kind enough to allow us to sit onboard while it was changed. Our connection, which we’d have missed anyway, was canceled and we had to wait 4 hours for the next flight. Still, we arrived and it was absolutely worth it.
Spent yesterday wandering around and have big plans for today, but first – time for breakfast!
I’m not sure what is in the water in tiny Ruswil, Switzerland, but it seems that the major pastime is watching cows wander about in a marked field and making bets about which square they will poop in. This from the land of the Swiss-Miss? Cripes, Ruswil – get some horses! MmmmmKay?
This series of posts, consisting of parts 1, 2, and 3, explores approaches to improving the distribution of repository cache invalidation events in a large ATG cluster. As mentioned, the default ATG distributed and distributedJMS cache modes have some serious problems but, thanks to the configurability of PatchBay, distributedJMS offers us some options.
Using distributedJMS cache mode with a third party JMS provider offers a pretty good solution, although as I pointed out in part 2 this can be challenging using JBoss JMS.
While I was thinking about how I would prefer to distribute invalidation events it occurred to me that IP Multicast offered an almost perfect solution. With multicast each event would be sent only once and distributed to all multicast group members. Multicast across subnets can be problematic in most enterprises as generally routers are not configured to allow the packets to pass. This didn’t concern me as all instances in each of my clusters run on the same subnet. Really the only down side I could think of using multicast was that it is unreliable.
Now, generally when you say a communication protocol is unreliable, people immediately think, “That is totally unacceptable!” But wait, is it really? Consider the following assertions:
- Some applications simply don’t require reliability. Take for example an application distributing stock quotes every 10 seconds. If you miss a quote how important is it that it be retransmitted to you? Not very I’d say.
- Multicast, or I should say UDP/IP in general, is pretty reliable if firewalls and routers are not involved. On the same subnet a very high percentage of packets will be delivered. In fact the largest problem with lost UDP packets on the same subnet is related to application latency, i.e. the application is unable to consume packets fast enough. This is good news as the problem can be solved at the application layer by adding caches or in-memory queues and processing in background threads.
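The second point can be sketched in a few lines of Java. This is only an illustration of the "in-memory queue plus background thread" idea, not code from my cluster: the class name BufferedPacketConsumer and its methods are hypothetical, and the capacity and poll timeout are arbitrary. The thread that reads the socket hands each packet to a bounded queue and returns immediately, while a worker thread does the slow processing.

```java
import java.util.Arrays;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Hypothetical helper: buffers packets so the socket thread never waits on a slow handler. */
final class BufferedPacketConsumer implements AutoCloseable {
    private final BlockingQueue<byte[]> queue;
    private final Thread worker;
    private final AtomicInteger dropped = new AtomicInteger();
    private volatile boolean running = true;

    BufferedPacketConsumer(int capacity, Consumer<byte[]> handler) {
        this.queue = new LinkedBlockingQueue<>(capacity);
        this.worker = new Thread(() -> {
            try {
                // Keep draining until close() was called and the queue is empty.
                while (running || !queue.isEmpty()) {
                    byte[] packet = queue.poll(50, TimeUnit.MILLISECONDS);
                    if (packet != null) {
                        handler.accept(packet);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "packet-worker");
        this.worker.setDaemon(true);
        this.worker.start();
    }

    /** Called from the socket-reading thread. Never blocks; returns false if the queue is full. */
    boolean onPacket(byte[] buffer, int length) {
        // Copy the bytes because the caller will reuse its receive buffer.
        boolean accepted = queue.offer(Arrays.copyOf(buffer, length));
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    /** Number of packets rejected because the queue was full. */
    int droppedCount() {
        return dropped.get();
    }

    /** Stops accepting work and waits for the worker to drain what is already queued. */
    @Override
    public void close() {
        running = false;
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

A receive loop built on java.net.MulticastSocket would call onPacket(packet.getData(), packet.getLength()) after each receive() and go straight back to reading. A bounded queue still drops packets when it fills, but the drop is now visible in droppedCount() rather than silent in the kernel buffer.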
Given the above, I felt that most of the time I would be fine distributing invalidation events via multicast but, in my case, a lost event might result in a stale item in a cache. In eCommerce this could mean telling a customer that a product which might be back ordered was in stock. That is an FTC violation and a good way to disappoint customers. Neither is a desirable outcome.
As a result, I began looking for a package to layer reliability on top of a multicast solution, and what I found was JGroups. This product, which is used to cluster communications in JBoss and JBossCache, is a perfect fit. It offers a highly configurable protocol stack that allows as much (or as little) reliability as needed. In fact, if I have any complaint about JGroups it’s that the stack configuration has too many options. Fortunately the product is distributed with example configurations for various purposes and, IMHO, it would be wise to stick with one of them.
Ok, JGroups sounds like a good solution, but how do we configure it in our ATG cluster? If you are thinking of using the distributedJMS cache mode, you’re on my wavelength. As the diagram on the left depicts, we can easily modify how invalidation events are distributed via this cache mode by extending ATG’s message source class atg.adapter.gsa.invalidator.GSAInvalidatorService. The image on the right was taken from the Eclipse Override/Implement Methods tool. By examining this information it is pretty clear that there are two methods that GSA may call to send an invalidation event. This pretty much jibes with what we learned in part 2 about the message classes used to communicate invalidation information: atg.adapter.gsa.invalidator.GSAInvalidationMessage and atg.adapter.gsa.invalidator.MultiTypeInvalidationMessage. In fact, since one of these invalidate() methods takes a MultiTypeInvalidationMessage as a parameter, it is reasonable to assume that the other invalidate() method is expected to construct and send a GSAInvalidationMessage.
So, how does this all work? It’s pretty straightforward really. When the GSA repository sends an invalidation event to the configured message source, our class extension is called via one of the invalidate() methods. Rather than placing the event on a JMS destination as PatchBay expects, we use JGroups to send the event to all group members. We use this same component (the extended message source) to register a JGroups Receiver which will be notified of incoming events. Our receiver code simply places the event on a LocalJMS topic which has been configured to have the GSA Invalidation Receiver as a subscriber. The rest, as they say, is out of the box ATG.
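To make the flow concrete, here is a toy model of it in plain Java. None of this is ATG or JGroups code: InMemoryGroup stands in for the JGroups channel, LocalTopic stands in for the LocalJMS topic that the GSA Invalidation Receiver subscribes to, and GroupInvalidatorSketch plays the part of the extended message source. The invalidate() signature and the string event format are assumptions made up for the illustration; the real methods and message classes are the ATG ones named above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

/** Hypothetical stand-in for the group channel: every member receives every event. */
final class InMemoryGroup {
    private final List<Consumer<String>> members = new CopyOnWriteArrayList<>();

    void join(Consumer<String> receiver) {
        members.add(receiver);
    }

    void send(String event) {
        for (Consumer<String> member : members) {
            member.accept(event);
        }
    }
}

/** Hypothetical stand-in for the LocalJMS topic feeding the invalidation receiver. */
final class LocalTopic {
    final List<String> delivered = new ArrayList<>();

    void publish(String event) {
        delivered.add(event);
    }
}

/** Plays the role of the extended message source: sends to the group, relays incoming events locally. */
final class GroupInvalidatorSketch {
    private final InMemoryGroup group;

    GroupInvalidatorSketch(InMemoryGroup group, LocalTopic localTopic) {
        this.group = group;
        // Register the receiver: anything arriving from the group goes to the local topic.
        group.join(localTopic::publish);
    }

    /** Assumed signature, for illustration only. */
    void invalidate(String repositoryPath, String itemDescriptor, String itemId) {
        // Instead of writing to a JMS destination, hand the event to the group.
        group.send(repositoryPath + ":" + itemDescriptor + ":" + itemId);
    }
}
```

With two instances joined to the same group, a call to invalidate() on one lands on the local topic of both. That includes the sender's own topic, because this toy group, like a default multicast group, delivers to every member.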
One problem I encountered in early testing of this environment is related to two key facts:
- By default a JGroups subscriber receives all group messages, including the ones it sends. This is how multicast works by default as well.
- The ATG repository sends invalidation events before the transaction is committed. This is a real head scratcher but it is documented to work this way and obviously does. ATG should reconsider this strategy as it only makes sense to distribute an event after a transaction is successfully committed.
Given this information here is what I found. The invalidation event was often being received by the instance that originated it before the repository transaction was committed. This created a race condition that, sometimes, resulted in a deadlock as the Invalidation Receiver attempted to remove the repository item(s) from cache while another thread was committing them.
This issue was easily avoided by setting an option on the JGroups Channel that disables receipt of a member’s own messages.
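The effect of that option is easy to picture with another toy model. Again, this is plain Java and not the JGroups API: LoopbackFilterGroup and its flag are hypothetical names. As far as I know the real switch depends on the JGroups version (channel.setOpt(Channel.LOCAL, Boolean.FALSE) in the 2.x line, JChannel.setDiscardOwnMessages(true) in later releases), so check the documentation for the version you run.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of a group that can skip delivery back to the sender. */
final class LoopbackFilterGroup {
    private final Map<String, List<String>> inboxes = new LinkedHashMap<>();
    private final boolean discardOwnMessages;

    LoopbackFilterGroup(boolean discardOwnMessages) {
        this.discardOwnMessages = discardOwnMessages;
    }

    void join(String member) {
        inboxes.put(member, new ArrayList<>());
    }

    void send(String sender, String event) {
        for (Map.Entry<String, List<String>> entry : inboxes.entrySet()) {
            // With the flag set, the originating member never sees its own event.
            if (discardOwnMessages && entry.getKey().equals(sender)) {
                continue;
            }
            entry.getValue().add(event);
        }
    }

    List<String> inbox(String member) {
        return inboxes.get(member);
    }
}
```

With the flag on, the instance that originated the change never receives its own invalidation event, so it never tries to evict an item that another of its threads is still committing. Every other member still gets the event.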
One final issue I’d like to point out relates to multi-homed hosts. All of my test machines are multi-homed and I discovered that sometimes the JGroups connection on one machine would be bound to interface A while connections on other machines were bound to interface B. This resulted in the group being partitioned into two separate groups with the same name. Since the two groups were on different subnets (and our routers do not pass multicast packets) they didn’t know about each other.
There are a couple of solutions to this problem but the simplest one I found was to configure the JGroups UDP layer with the option receive_on_all_interfaces="true". This allows a connection to receive packets on all interfaces configured on the machine.
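When chasing this kind of partition it helps to see which interfaces on a host could carry multicast traffic at all. The following is a small diagnostic sketch using only the standard java.net API. The class name MulticastInterfaces is made up, and it has nothing to do with how JGroups itself picks an interface.

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

/** Hypothetical diagnostic: lists the addresses of interfaces that are up and multicast-capable. */
final class MulticastInterfaces {
    static List<String> describe() {
        List<String> lines = new ArrayList<>();
        try {
            Enumeration<NetworkInterface> all = NetworkInterface.getNetworkInterfaces();
            if (all == null) {
                return lines; // no interfaces reported on this host
            }
            for (NetworkInterface nic : Collections.list(all)) {
                if (!nic.isUp() || !nic.supportsMulticast()) {
                    continue;
                }
                for (InetAddress address : Collections.list(nic.getInetAddresses())) {
                    lines.add(nic.getName() + " -> " + address.getHostAddress());
                }
            }
        } catch (SocketException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

Printing MulticastInterfaces.describe() on each machine shows at a glance whether a host has more than one candidate interface and which subnets they sit on.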
I currently have this configuration running in a test environment of 10 ATG instances and it is working very well. I can’t duplicate the volume or size of our production clusters in test so it remains to be seen how this will work under load. I’ll report back when I have more information but, for now, I remain optimistic.
In part 1 and part 2 of this series I discuss problems with, and potential solutions to, ATG distributed repository cache options. As a second alternative I considered completely replacing the cache used by the ATG Repository. I know, at first this sounds crazy, but if you think about it, why shouldn’t you be able to plug in a new cache to ATG’s Repository?
If you spend even a few minutes researching this you will find, as I did, that ATG has not designed the repository cache to be replaced. At the time the Repository was designed this was probably a reasonable approach, but now that we have standards like JSR-107 JCACHE it seems rather limiting. My plan, the grand scheme as it were, was to convince my upper management to convince ATG’s upper management that supporting a plug-in repository cache would be in both ATG’s and their customers’ best interest.
Before initiating this plan I wanted to find a top notch enterprise cache that could be used as a replacement and get the people behind it involved as a partner. This was, for me at least, an easy selection. I have long been a fan of Tangosol (now Oracle) Coherence and this is the product I wanted to use as a distributed repository cache.
I had several virtual conversations with Cameron Purdy, Founder and CEO of Tangosol, and now something-or-the-other with Oracle. I’d met Cameron several years ago and I had briefly asked if any ATG customers were using Coherence. Apparently not. My hope was to get some of Cameron’s folks onboard and try to convince ATG to partner with them to offer Coherence as a plug-in replacement for the ATG Repository cache.
I think Cameron was interested in this idea but, I suspect, he has so many things on his radar these days that this just hasn’t bubbled up to a visible position.
I still think this is a good idea but I need a solution now and things on this front are moving way too slowly so I set out to look at other alternatives.