I’m going to post about a serious problem my employer encountered with the ATG Repository and its use of distributed caches. I expect this will be a lengthy subject, so I’m going to post in multiple parts. In this post I’ll give an overview of repository caches and describe the problem we encountered. In future posts I’ll describe the solutions I evaluated and the one we selected.
First some background on how the ATG Repository uses caches.
An ATG repository maintains two types of cache:
- Item Cache – there is an item cache for each item descriptor in the repository which holds repository items that have been loaded from the data store. Items are keyed by their repository ID.
- Query Cache – the query cache, which is disabled by default, stores the set of repository IDs returned by an RQL invocation. In theory this could save a lot of time if the same RQL, with the same variables, is executed repeatedly. In real life this is seldom the case. In addition, correctly maintaining the query cache can be devilishly complex. For example, suppose an RQL statement matches 4 repository items and these IDs are cached. If another instance then inserts a new item that would have matched the RQL, will the query cache be updated or flushed? I don’t know the answer and, frankly, I’m too lazy to code up an example to find out.
Update: OK, I did run some tests on how the query cache responds to repository adds/updates, and it appears that any modification of a repository item causes all the query caches for that item type to be discarded. So while it isn’t all that devilishly complex, and will always be correct, this has a very negative impact on the query cache hit rate. My suggestion would be to leave it disabled unless you have a read-only repository in which a lot of the same queries are executed.
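To make that observed behavior concrete, here’s a toy model in plain Java (not ATG code; the class and method names are my own invention) of a query cache that maps an RQL string to its result IDs and throws away every cached query for an item type the moment any item of that type is modified:

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

// Toy sketch of the behavior observed above: any write to an item of a
// given type discards ALL cached query results for that type.
public class ToyQueryCache {
    // item type -> (RQL string -> set of matching repository IDs)
    private final Map<String, Map<String, Set<String>>> byType = new ConcurrentHashMap<>();

    public void put(String itemType, String rql, Set<String> ids) {
        byType.computeIfAbsent(itemType, t -> new ConcurrentHashMap<>()).put(rql, ids);
    }

    public Set<String> get(String itemType, String rql) {
        Map<String, Set<String>> queries = byType.get(itemType);
        return queries == null ? null : queries.get(rql);
    }

    // Called on any add/update/delete of an item of this type: every cached
    // query for the type is gone, which is why the hit rate suffers so badly.
    public void itemModified(String itemType) {
        byType.remove(itemType);
    }
}
```

A read-mostly repository keeps entries alive long enough to matter; a repository with any steady write traffic will flush the cache faster than it can pay off.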
Each Item Cache is assigned a Cache Mode which describes how the cache should be maintained. The available cache modes are:
- Disabled – items are cached only in a transaction-specific cache for the duration of a transaction, and that cache is flushed when the transaction terminates. The idea is that you don’t want any caching of these items, but ATG performs transaction-local caching for performance reasons; from the perspective of other repository users the items are not cached. In my experience this cache mode does not work correctly and instead behaves like simple caching. I’d be happier if ATG dispensed with the “transaction-local cache” and just did no caching whatsoever!
- Simple – In simple caching each ATG instance maintains an item cache for use by repository users on that instance. There is no synchronization between instances and changes made by other instances may not be seen. This mode is useful for read only repository items as it gives the benefits of caching without the overhead of cluster synchronization.
- Locked – locked caching uses a lock manager instance to track which instances have an item cached. The general idea is that when an item is loaded into a cache, that instance obtains a read lock on the item’s repository ID; as a result, the Lock Manager has a list of all instances caching a particular item. When an instance updates or deletes an item it requests a write lock on the repository ID, which causes the Lock Manager to notify all read lock holders that they must relinquish their locks. The read lock owners take this as an indication that they should flush the item from their cache. ATG believes this method may be efficient because invalidation events are sent only to those instances that have the item cached rather than being broadcast to all instances. I think this is a truly bizarre outlook and that ATG should instead look at making broadcast events more efficient. But hey, that’s just me.
- Distributed – in distributed caching all the caches in a cluster are kept synchronized by sending invalidation events when a repository item is updated or deleted. These events are sent to each member of the cluster, serially, via a TCP connection. Yes, every instance in the cluster has a TCP connection to every other instance in the cluster. That amounts to a total of N * (N – 1) socket connections (where N is the number of cluster members) just to support cache invalidation. In one of the clusters I support this number was approaching 11,000. In order to identify cluster members, each instance maintains rows in a database table, DAS_GSA_SUBSCRIBER, identifying the item types for which it has distributed caching enabled and the host/port it is listening on for invalidation event connections. To ATG’s credit, they use the same TCP connection to deliver all distributed invalidation events.
- DistributedJMS – this cache mode works like distributed but uses a PatchBay message source/sink pair to deliver cache invalidation events via a JMS Topic. DistributedJMS is the new kid on the cache mode block, joining the fray with ATG 7.0. This, to me, is a very promising delivery mechanism for invalidation events, but it falls short in that it is based on SQLJMS, which supports only persistent JMS destinations that operate in a polled manner. The whole purpose of a cache is to avoid disk/database I/O, so a distribution scheme that relies on database I/O doesn’t make much sense. In addition, the polled nature of SQLJMS can easily introduce latency into event distribution. However, plugging in a third-party JMS provider that supports in-memory topics could be just the ticket.
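The full-mesh arithmetic behind distributed mode is worth seeing in numbers. A quick sketch (the cluster sizes are illustrative; 105 instances is my guess at a size that lands near 11,000):

```java
public class MeshConnections {
    // In distributed mode every instance holds a TCP connection to every
    // other instance, so a cluster of n members needs n * (n - 1) sockets
    // in total just for cache invalidation.
    static long connections(int n) {
        return (long) n * (n - 1);
    }

    public static void main(String[] args) {
        for (int n : new int[] {20, 105, 120}) {
            System.out.println(n + " members -> " + connections(n) + " connections");
        }
    }
}
```

Note how the count grows quadratically: doubling the cluster roughly quadruples the socket count, which is why the big clusters feel it first.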
So there you have a quickie tour of repository cache invalidation. Now let me describe the problem my group encountered.
We run a number of ATG clusters which range from about 20 members up to about 120 members. From time to time we encounter a condition we call idle threads, which occurs when all the threads in an instance are blocked for some reason. The reason varies: sometimes it’s a problem with a scenario, a lock manager problem, the JVM running out of memory, or one of several other causes. This sort of thing just happens from time to time, and it’s one of the reasons we run multiple instances, right? High availability and failover. The problem is that when an instance encounters idle threads, the GSA Event Server – the thread reading cache invalidation events from a TCP socket – also gets blocked and stops reading from the socket. As a result, all the other instances trying to send invalidation events to the hung instance fill up their socket write buffers and their invalidation threads hang. Unfortunately, these hung invalidation threads hold Java monitors that are required by most of the other threads in the instance. This condition spreads through our clusters like a virus, and within about 10 minutes of the original problem every instance is hung and the entire site is down.
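The underlying failure mode, a peer that stops reading eventually stalls the sender’s write, is easy to demonstrate with two plain sockets. This sketch (standard Java NIO, nothing ATG-specific) connects a sender to a peer that accepts the connection but never reads, then writes in non-blocking mode until the OS buffers fill and write() returns 0; at that exact point a blocking write, like the one an invalidation thread performs, would simply hang:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class BlockedPeerDemo {
    // Write to a peer that never reads, until the socket buffers fill.
    // Returns the number of bytes accepted before the socket backed up.
    static long fillUntilBlocked() throws IOException {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress("127.0.0.1", 0));
            try (SocketChannel sender = SocketChannel.open(server.getLocalAddress());
                 SocketChannel hungPeer = server.accept()) { // accepted, but we never read
                // Non-blocking mode lets us observe the back-pressure instead of
                // hanging; a blocking write here would stall this thread forever.
                sender.configureBlocking(false);
                ByteBuffer event = ByteBuffer.allocate(8192);
                long sent = 0;
                while (true) {
                    event.clear();
                    int n = sender.write(event);
                    if (n == 0) {
                        return sent; // send + receive buffers are now full
                    }
                    sent += n;
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("accepted " + fillUntilBlocked() + " bytes before backing up");
    }
}
```

The exact byte count depends on the OS socket buffer sizes, but the shape of the failure is the same one we saw in production: the sender makes progress only as long as the receiver keeps draining.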
So there you have the problem. I was charged by my management with finding a solution and, quite honestly, I’m very pleased with what we put in place to fix this.
In future posts I’ll talk about the options I considered and what I finally settled on. If you have thoughts or ideas about this situation I’d love to hear them.