ATG Repository Distributed Caches, Part 2

August 26, 2007

Alternative 1

In part 1 of this series I briefly described how ATG’s distributed cache invalidation works and the options it provides. I then described a serious production problem my company had been encountering with the distributed cache mode in large clusters.

As I mentioned in part 1, ATG’s newest cache-mode option, distributedJMS, appeared to offer a good alternative to using TCP connections to distribute cache invalidation events. The main problem with this approach is that, by default, it is based on ATG SQLJMS, which offers only polled, persistent destinations. If you are using ATG’s DPS module the configuration for the distributedJMS cache mode is already in place; otherwise you can follow the configuration examples in the ATG Repository Users Guide.

One very important property must be set for distributedJMS cache invalidation to work: gsaInvalidatorEnabled on the /atg/dynamo/Configuration component must be true.
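In a standard Nucleus configuration layer that amounts to a one-line properties override, something like:

# localconfig/atg/dynamo/Configuration.properties
gsaInvalidatorEnabled=true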

[Figure: distributedJMS cache invalidation via PatchBay]

Distribution of events via JMS is supported by a PatchBay message source/sink combination, which gives us the opportunity to override the definitions and use a third-party JMS provider. The advantage of using a real JMS provider is that message distribution is event driven rather than polled, and in-memory destinations may be used, avoiding disk/database I/O. The figure above depicts how JMS is used in event distribution. For each item descriptor defined with a cache mode of distributedJMS, ATG’s repository routes all invalidation events to a PatchBay message source defined by the class atg.adapter.gsa.invalidator.GSAInvalidatorService. The Nucleus component the repository uses may be set via the invalidatorService property on the GSARepository. By default this component is located at /atg/dynamo/service/GSAInvalidatorService, but you can place it anywhere you like.
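For reference, the cache mode itself is declared on the item descriptor in the repository definition file. A minimal sketch, using a made-up item descriptor name:

<gsa-template>
  <!-- hypothetical item descriptor; table and property definitions as usual -->
  <item-descriptor name="product" cache-mode="distributedJMS">
  </item-descriptor>
</gsa-template>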

The actual cache invalidation takes place in the message sink, which is defined by the ATG class atg.adapter.gsa.invalidator.GSAInvalidationReceiver. This component receives events of the following types, resolves the name of the supplied repository component, and issues an invalidation request for the appropriate item descriptor and repository item(s).

  • atg.adapter.gsa.invalidator.GSAInvalidationMessage – defines a single repository item that should be flushed from the cache.
  • atg.adapter.gsa.invalidator.MultiTypeInvalidationMessage – defines one or more repository items that should be flushed from the cache. All the items defined in this message must belong to the same repository.

The default PatchBay configuration for these components looks like the following:

<dynamo-message-system>
  <patchbay>
    <!-- DAS Messages -->
    <message-source>
      <nucleus-name>/atg/dynamo/service/GSAInvalidatorService</nucleus-name>
      <output-port>
        <port-name>GSAInvalidate</port-name>
        <output-destination>
          <provider-name>sqldms</provider-name>
          <destination-name>sqldms:/sqldms/DASTopic/GSAInvalidator</destination-name>
          <destination-type>Topic</destination-type>
        </output-destination>
      </output-port>
    </message-source>
    <message-sink>
      <nucleus-name>/atg/dynamo/service/GSAInvalidationReceiver</nucleus-name>
      <input-port>
        <port-name>GSAInvalidate</port-name>
        <input-destination>
          <provider-name>sqldms</provider-name>
          <destination-name>sqldms:/sqldms/DASTopic/GSAInvalidator</destination-name>
          <destination-type>Topic</destination-type>
        </input-destination>
      </input-port>
    </message-sink>
  </patchbay>
</dynamo-message-system>

OK, so my first thought was to modify this configuration to use JBoss as the JMS provider. I considered one of the fine stand-alone JMS providers like Fiorano or Sonic, and I think these would have worked just fine. We are currently still running on DAS but expect to move to JBoss over the next year, so using JBoss seemed like a natural fit. I promptly overrode the above configuration like this:

<dynamo-message-system>
  <patchbay>
    <provider>
      <provider-name>JBoss</provider-name>
      <topic-connection-factory-name>ConnectionFactory</topic-connection-factory-name>
      <queue-connection-factory-name>ConnectionFactory</queue-connection-factory-name>
      <xa-topic-connection-factory-name>XAConnectionFactory</xa-topic-connection-factory-name>
      <xa-queue-connection-factory-name>XAConnectionFactory</xa-queue-connection-factory-name>
      <supports-transactions>true</supports-transactions>
      <supports-xa-transactions>true</supports-xa-transactions>
      <username></username>
      <password></password>
      <client-id></client-id>
      <initial-context-factory>/my/utils/jms/J2EEInitialContextFactory</initial-context-factory>
    </provider>

    <message-source xml-combine="replace">
      <nucleus-name>/atg/dynamo/service/GSAInvalidatorService</nucleus-name>
      <output-port>
        <port-name>GSAInvalidate</port-name>
        <output-destination>
          <provider-name>JBoss</provider-name>
          <destination-name>/topic/GSAInvalidator</destination-name>
          <destination-type>Topic</destination-type>
        </output-destination>
      </output-port>
    </message-source>

    <message-sink xml-combine="replace">
      <nucleus-name>/atg/dynamo/service/GSAInvalidationReceiver</nucleus-name>
      <input-port>
        <port-name>GSAInvalidate</port-name>
        <input-destination>
          <provider-name>JBoss</provider-name>
          <destination-name>/topic/GSAInvalidator</destination-name>
          <destination-type>Topic</destination-type>
        </input-destination>
      </input-port>
    </message-sink>
  </patchbay>
</dynamo-message-system>

Notice that you have to define a component that is used to obtain an initial context from the third-party JMS provider. ATG’s documentation covers this in detail, so I won’t go into it here.
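For what it’s worth, the guts of such a component boil down to building a JNDI InitialContext with the JBoss client naming properties. Here is a rough sketch of just that part; the class name and default provider URL are mine, and you still have to wrap it in whatever interface ATG’s <initial-context-factory> element expects (see the ATG documentation for that):

import java.util.Properties;
import javax.naming.Context;
import javax.naming.InitialContext;
import javax.naming.NamingException;

// Sketch of the JNDI lookup a JBoss initial-context component would perform.
// The surrounding Nucleus component must implement the interface ATG's
// <initial-context-factory> element expects -- see the ATG DMS documentation.
public class JBossInitialContextHelper {

    // Hypothetical property; point it at your JBoss naming service.
    private String providerUrl = "jnp://localhost:1099";

    public Context createContext() throws NamingException {
        Properties env = new Properties();
        // Standard JBoss 4.x JNDI client settings.
        env.put(Context.INITIAL_CONTEXT_FACTORY,
                "org.jnp.interfaces.NamingContextFactory");
        env.put(Context.URL_PKG_PREFIXES,
                "org.jboss.naming:org.jnp.interfaces");
        env.put(Context.PROVIDER_URL, providerUrl);
        return new InitialContext(env);
    }

    public void setProviderUrl(String pUrl) { providerUrl = pUrl; }
    public String getProviderUrl() { return providerUrl; }
}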

With this configuration in place and a properly configured JBoss server, I was distributing invalidation events and things were looking great. That’s always the moment a nasty problem arises, and this situation was no different.

I had tested this configuration, but of course for our production environment we wanted to run a cluster of JBoss instances to provide high availability. The problem I ran into is that JBoss offers two different JMS providers:

  1. JBossMQ – offers a highly available singleton JMS service and is the out-of-the-box configuration for JBoss 4.2 and all earlier versions. This implementation supports Java 1.4.
  2. JBoss Messaging – offers a highly available distributed message service but requires Java 1.5+. It may be configured into JBoss 4.2 and will be the out-of-the-box configuration in the next JBoss release.

We currently run ATG 7.2 under Java 1.4, and I wanted to keep our JBoss servers at the same level if possible, so I decided to use JBoss 4.0.5 and JBossMQ. The first problem I encountered was that even though JBossMQ supports high availability, it does so with the assistance of its clients: JBossMQ expects every client to register a JMS ExceptionListener and to handle connection failures by reopening the connection and re-creating all of its JMS objects. Clearly this wasn’t going to work for ATG PatchBay; I needed transparent failover.
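To make the expectation concrete, here is roughly what a well-behaved JBossMQ client is supposed to do, sketched with the plain javax.jms API (the class and method names are mine):

import javax.jms.ExceptionListener;
import javax.jms.JMSException;
import javax.jms.Session;
import javax.jms.Topic;
import javax.jms.TopicConnection;
import javax.jms.TopicConnectionFactory;
import javax.jms.TopicSession;
import javax.jms.TopicSubscriber;

// What a JBossMQ client is expected to do: listen for connection failures
// and rebuild the connection, session and subscriber itself.
public class ReconnectingSubscriber implements ExceptionListener {

    private final TopicConnectionFactory factory;
    private final Topic topic;
    private TopicConnection connection;

    public ReconnectingSubscriber(TopicConnectionFactory pFactory, Topic pTopic) {
        factory = pFactory;
        topic = pTopic;
    }

    public void connect() throws JMSException {
        connection = factory.createTopicConnection();
        connection.setExceptionListener(this);   // JBossMQ relies on this hook
        TopicSession session =
            connection.createTopicSession(false, Session.AUTO_ACKNOWLEDGE);
        TopicSubscriber subscriber = session.createSubscriber(topic);
        // subscriber.setMessageListener(...) would go here
        connection.start();
    }

    // Called by the provider when the connection dies; the client must
    // throw everything away and start over.
    public void onException(JMSException failure) {
        try { connection.close(); } catch (JMSException ignored) { }
        try {
            connect();
        } catch (JMSException e) {
            // retry/backoff policy omitted from this sketch
        }
    }
}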

My next approach was to use JBoss 4.2 with JBoss Messaging. This required the JBoss servers to run on Java 1.5, but I figured I could live with that until we moved to ATG 2007.1. Of course this didn’t work: the JBoss 4.2 client jars were compiled for Java 1.5, and all my ATG instances were running under 1.4. This was starting to look like more trouble than it was worth, but first I ran all the JBoss client jars through Retroweaver and deployed them under Java 1.4. This looked promising until I connected to the JBoss instance and pulled back an InitialContext: the stub that was returned required Java 1.5. I may have been able to work around this, but I gave up on JBoss 4.2.

Now a reasonable person would have given up on JBoss at this point and perhaps purchased SonicMQ. Instead I set about writing a JMS mapping layer that would sit between PatchBay and JBossMQ and perform transparent failover. I used a decorator pattern to wrap every JBoss JMS class with one of my own that knew how to recreate itself in the event of a failover. This wasn’t difficult, but it involved a fair amount of coding.
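To give a flavor of that layer, here is a minimal sketch of the decorator idea for one of the wrapped types. This is not the actual code; the real layer wraps every JMS interface and contains all the reconnection plumbing, which is omitted here:

import javax.jms.JMSException;
import javax.jms.TopicConnection;
import javax.jms.TopicSession;

// Sketch of the decorator idea: every javax.jms object handed to PatchBay is
// one of these wrappers, so when failover occurs the delegate can be swapped
// for one created against the surviving JBossMQ node without PatchBay noticing.
public class FailoverTopicConnection {

    private TopicConnection delegate;

    public FailoverTopicConnection(TopicConnection pDelegate) {
        delegate = pDelegate;
    }

    // Invoked by the failover machinery (not shown) after it reconnects.
    synchronized void replaceDelegate(TopicConnection pNewDelegate) {
        delegate = pNewDelegate;
    }

    // Each JMS method simply forwards to the current delegate; the real layer
    // also records enough state (sessions, subscribers, listeners) to rebuild
    // the child objects it has handed out.
    public synchronized TopicSession createTopicSession(boolean transacted,
                                                        int ackMode)
            throws JMSException {
        return delegate.createTopicSession(transacted, ackMode);
    }

    public synchronized void start() throws JMSException {
        delegate.start();
    }

    public synchronized void close() throws JMSException {
        delegate.close();
    }
}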

I actually got this approach working, and it performed very well, but then I had another idea and set this option aside for the time being.

By the way, the JBossMQ transparent failover layer is not specific to PatchBay; if anyone has a need for it I can probably arrange to give you the code.

That wraps up part 2 of this series. Stay tuned for my second alternative presented in part 3.


ATG Repository Distributed Caches, Part 1

August 25, 2007

I’m going to post about a serious problem my employer has encountered with the ATG Repository and the use of distributed caches. I expect this may be a lengthy subject, so I’m going to post in multiple parts. In this post I’ll give an overview of repository caches and describe the problem we encountered. In future posts I’ll describe the solutions I evaluated and the one we selected.

First some background on how the ATG Repository uses caches.

An ATG repository maintains two types of cache:

  1. Item Cache – there is an item cache for each item descriptor in the repository; it holds repository items that have been loaded from the data store, keyed by their repository ID.
  2. Query Cache – the query cache, which is disabled by default, stores the set of repository IDs returned by an RQL query. In theory this could save a lot of time if the same RQL, with the same parameters, is executed repeatedly; in real life this is seldom the case. In addition, correctly maintaining the query cache can be devilishly complex. For example, suppose an RQL statement matches 4 repository items and those IDs are cached. If another instance then inserts a new item that would have matched the RQL, will the query cache be updated or flushed? I don’t know the answer and, frankly, I’m too lazy to code up an example to find out.


Update: OK, I did run some tests on how the query cache responds to repository adds/updates, and it appears that any modification of a repository item causes all the query caches for that item type to be discarded. So while it isn’t all that devilishly complex, and it will always be correct, it will have a very negative impact on the query cache hit rate. My suggestion would be to leave it disabled unless you have a read-only repository against which a lot of the same queries are executed.

Each Item Cache is assigned a Cache Mode which describes how the cache should be maintained. The available cache modes are:

  • Disabled – items are cached only in a transaction-specific cache that is flushed when the transaction ends. The idea is that you don’t want any caching of these items, but ATG performs transaction-local caching for performance reasons; from the perspective of other repository users the items are not cached. In my experience this cache mode does not work correctly and instead acts like simple caching. I’d be happier if ATG dispensed with the “transaction local cache” and just did no caching whatsoever!
  • Simple – in simple caching each ATG instance maintains an item cache for use by repository users on that instance. There is no synchronization between instances, and changes made by other instances may not be seen. This mode is useful for read-only repository items, as it gives the benefits of caching without the overhead of cluster synchronization.
  • Locked – locked caching uses a lock manager to track which instances have an item cached. The general idea is that when an item is loaded into a cache, that instance obtains a read lock on the item’s repository ID; as a result the lock manager has a list of all instances caching a particular item. When an instance updates or deletes an item it requests a write lock on the repository ID, which causes the lock manager to notify all read-lock holders that they must relinquish their locks. The read-lock owners take this as an indication that they should flush the item from their cache. ATG believes this method may be efficient because the invalidation events are sent only to the instances that have the item cached rather than being broadcast to all instances. I think this is a truly bizarre outlook and that ATG should instead look to making broadcast events more efficient. But hey, that’s just me.
  • Distributed – in distributed caching all the caches in a cluster are kept synchronized by sending invalidation events when a repository item is updated or deleted. These events are sent to each member of the cluster, serially, via a TCP connection. Yes, every instance in the cluster has a TCP connection to every other instance in the cluster, for a total of N * (N – 1) socket connections (where N is the number of cluster members) just to support cache invalidation; a 105-instance cluster, for example, needs 105 * 104 = 10,920 connections. In one of the clusters I support this number was approaching 11,000. To identify cluster members, each instance maintains rows in a database table, DAS_GSA_SUBSCRIBER, recording the item types it caches as distributed and the host/port it is listening on for invalidation event connections. To ATG’s credit, the same TCP connection is used to distribute all distributed invalidation events.
  • DistributedJMS – this cache mode works like distributed but uses a PatchBay message source/sink pair to deliver cache invalidation events over a JMS Topic. DistributedJMS is the new kid on the cache-mode block, joining the fray with ATG 7.0. This, to me, is a very promising delivery mechanism for invalidation events, but it falls short in that it is based on SQLJMS, which supports only persistent JMS destinations that operate in a polled manner. The whole purpose of a cache is to avoid disk/database I/O, so a distribution scheme that relies on database I/O doesn’t make much sense. In addition, the polled nature of SQLJMS can easily introduce latency into event distribution. However, plugging in a third-party JMS provider that supports in-memory topics could be just the ticket.

So there you have a quickie tour of repository cache invalidation. Now let me describe the problem my group encountered.

We run a number of ATG clusters which range from about 20 members up to about 120 members. From time to time we encounter a condition we call idle threads, which occurs when all the threads in an instance are blocked for some reason. The reason varies: sometimes it is a problem with a scenario, a lock manager problem, the JVM running out of memory, or one of several other causes. This sort of thing just happens from time to time, and it’s one of the reasons we run multiple instances, right? High availability and failover. The problem is that when an instance hits idle threads, the GSA Event Server (the thread reading cache invalidation events from a TCP socket) also gets blocked and stops reading from the socket. As a result, all the other instances trying to send invalidation events to the hung instance fill up their socket write buffers and their invalidation threads hang. Unfortunately, those hung invalidation threads are holding Java monitors that are required by most of the other threads in the instance. This condition spreads through our clusters like a virus, and within about 10 minutes of the original problem every instance is hung and the entire site is down.

So there you have the problem. I was charged by my management with finding a solution and, quite honestly, I’m very pleased with what we put in place to fix this.

In future posts I’ll talk about the options I considered and what I finally settled on. If you have thoughts or ideas about this situation I’d love to hear them.


The Pot at the End of the Rainbow

August 14, 2007

It doesn’t seem to be the pot everyone was expecting!
[Image: Rainbowpot]


Timbuk2 Has a Sense of Humor!

August 11, 2007

I love doing business with people who don’t take themselves too seriously. Today I ordered a couple of laptop sleeves from Timbuk2 and really got a charge out of their order summary email:


Thanks for picking us. Your new bag is gonna ROCK!

Here is your order summary email.

We suggest that you actually read this, confirm your order and make sure you ordered what you wanted. Once your order makes it to our warehouse, we can’t change it for you and if it’s custom, we can’t take it back. Not because we don’t love you; but because we already have really, really nice custom made Messenger bags from Timbuk2. It’s part of the uniform.

Your order number is xxxxxxxxx

ORDER SUMMARY:

BILLING INFORMATION or WHO PAID
xxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxx

SHIPPING ADDRESS or THE LUCKY ONE
xxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxx

ORDER DETAILS or WHATCHUGOT
1 Medium Curly Q Sleeve $35.00 ea.
Ballistic Nylon rose
1 Medium Classic Sleeve $35.00 ea.
Ballistic Nylon navy-grey-navy

PAYMENT DETAILS or HOW MUCH YOU SHELLED OUT
Visa – number: XXXXXXXXXXXXXXXX – $82.50

Subtotal: $70.00
Shipping: $12.50
Tax: $0.00
Total: $82.50

Now here’s the part you really want to read.

When will my order ship?
Your order will be sent to the warehouse and shipped in 2 to 5 business days.
If you selected expedited shipping like Second Day or Next Day Air, your order will ship within 1 to 2 business days. Business means Monday through Friday and excludes most US holidays just in case you didn’t know. Now you do. FYI: custom bags do not require any additional processing time. We are just that awesome. Of course there is one exception, isn’t there always? Think “i” before “e” EXCEPT after “c”. If you ordered the Lex Pack the above rules do not apply to you. You see, the Lex Pack takes longer to assemble; there’s a lot to this bag, so it will take 5-7 business days for the Lex Pack just to leave our warehouse. If you opted for Second Day or Next Day Air, please note the shipping method you selected at checkout will determine how many days in transit your bag will take to get to you but does not include production time.

How long will it take my order to get to my loving, waiting arms?
Once your order ships via UPS it will take 3 to 7 business days to arrive depending on where you call home. You can track your order on our website using your order number and clicking http://www.timbuk2.com/tb2/cms/orderTracking.htm or you can be patient and wait for UPS to send you a notification of shipment indicating that your order has left the building. Please note that your item must ship before anyone can track it and it can take up to 48 hours for your tracking information to trickle down from UPS to our systems.

What if I live across the pond?
International orders are shipped via UPS International Express.
Transit times average 3-5 business days for delivery.

**International Express price does not include Duty, Customs Adjustment or VAT.
Up to $75 additional charges in VAT and duties may apply upon delivery of the product.

What if I entered my email address in wrong?
Our deepest condolence goes out to you because you won’t be able to read any of this.

Much Love,

Timbuk2 Designs


Some Fights Can’t Be Avoided

August 5, 2007

[Image: Nutnbitch]