Reliability Problems

Another manifestation of the sharing problem arises when there is a failure. Reliability is key for most enterprise applications. An agile infrastructure has the potential to achieve it through redundancy: if one node fails, the remaining nodes can carry on the work. However, the degree of data sharing affects how complex this is to achieve. If there is no sharing, a failed computation can simply be re-run.
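The no-sharing case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `NodeFailure`, `run_with_failover`, `square_on` and the node names are hypothetical and do not belong to any particular Grid toolkit. The task touches no shared data, so re-running it on another node is always safe.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class NodeFailure(Exception):
    """Hypothetical error raised when a node cannot finish a task."""


def run_with_failover(task: Callable[[str], T], nodes: Sequence[str]) -> T:
    """Try the task on each node in turn and return the first result.

    This is only safe because the task shares no data: a failed attempt
    leaves nothing behind that needs to be cleaned up.
    """
    last_error: Optional[Exception] = None
    for node in nodes:
        try:
            return task(node)
        except NodeFailure as err:
            last_error = err  # remember the failure, move to the next node
    raise RuntimeError("all nodes failed") from last_error


def square_on(node: str) -> int:
    """Hypothetical side-effect-free computation; 'node-a' is simulated as down."""
    if node == "node-a":
        raise NodeFailure(f"{node} is down")
    return 7 * 7


result = run_with_failover(square_on, ["node-a", "node-b"])
```

Here `result` comes from the second node after the first one fails. Nothing more than a retry loop is needed because the computation has no partially applied effects to undo.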

Many existing Grid solutions offer this re-run level of fault tolerance. However, if data is shared, it is imperative that the effects of a failure are contained so that inconsistent, partially updated state does not permeate through applications. Infrastructure that knows when data is in an inconsistent state can be of great benefit when a failure occurs. A computation that has not completed its work at the point of failure may leave data in an inconsistent state. A system with the appropriate support could then either 'roll back' the state to a previously consistent state (backward recovery) or 'compensate' to create a new consistent state (forward recovery). Without this support, there is a real danger of corrupting the data on which the enterprise relies.
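The difference between backward and forward recovery can be shown with a small Python sketch. It assumes a toy shared ledger in which every transfer must add one debit and one matching credit, so the state is consistent when all entries sum to zero. The names `transfer`, `backward_recover`, `forward_recover` and `is_consistent` are hypothetical, and a real system would rely on transactional infrastructure rather than an in-memory list.

```python
from typing import List, Tuple

Entry = Tuple[str, int]  # (account, change in balance)


class NodeFailure(Exception):
    """Hypothetical error simulating a node that dies mid-computation."""


def is_consistent(ledger: List[Entry]) -> bool:
    # Assumed invariant: debits and credits always cancel out.
    return sum(delta for _, delta in ledger) == 0


def transfer(ledger: List[Entry], src: str, dst: str, amount: int,
             fail_midway: bool = False) -> None:
    ledger.append((src, -amount))
    if fail_midway:
        raise NodeFailure("failed after the debit, before the credit")
    ledger.append((dst, amount))


def backward_recover(ledger: List[Entry], checkpoint: int) -> None:
    """Roll back: discard everything written since the checkpoint."""
    del ledger[checkpoint:]


def forward_recover(ledger: List[Entry], checkpoint: int) -> None:
    """Compensate: append entries that cancel the partial work."""
    for account, delta in list(ledger[checkpoint:]):
        ledger.append((account, -delta))


# Backward recovery returns to the previously consistent state.
ledger_a: List[Entry] = []
checkpoint_a = len(ledger_a)
try:
    transfer(ledger_a, "alice", "bob", 10, fail_midway=True)
except NodeFailure:
    backward_recover(ledger_a, checkpoint_a)

# Forward recovery reaches a new, different consistent state.
ledger_b: List[Entry] = []
checkpoint_b = len(ledger_b)
try:
    transfer(ledger_b, "alice", "bob", 10, fail_midway=True)
except NodeFailure:
    forward_recover(ledger_b, checkpoint_b)
```

After recovery both ledgers satisfy the invariant, but `ledger_a` is empty again while `ledger_b` keeps the failed debit together with its compensating entry. Either way, the recovery code has to know where the last consistent point was, which is the kind of knowledge the infrastructure needs to hold.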

Even determining which applications and data sets have been affected by a failure can be a problem in an agile infrastructure. Because software, data replicas and caches are deployed dynamically to meet varying demands, a hardware failure may leave it unclear which applications are affected or how they can be recovered. Recovery can be even more complicated if applications are widely distributed across resources that are managed under different regimes.

The danger is that there may be no consistent, infrastructure-wide mechanisms for detecting failure, reporting failure, deciding what action to take and performing recovery in a co-ordinated manner.
