Best practices for eliminating SPoF in cluster architecture

Nov 26, 2012

Much as a chain is only as strong as its weakest link, the effectiveness of a high availability cluster is limited by any single points of failure (SPOFs) that exist within its deployment. To ensure the highest levels of availability, SPOFs must be removed. There is a straightforward method for ridding the cluster of these weak links.

First, identify any SPOFs that exist, paying particular attention to servers, network connections, and storage devices. Modern servers come with redundant, error-correcting memory, redundant data striping across hard disks (RAID), and multiple CPUs, which together eliminate most hardware components as SPOFs. Software and human error, however, can still result in server or application downtime. Deploying a high availability cluster solution that monitors the health of servers and critical applications, and takes automatic recovery actions when a failure occurs, eliminates this SPOF. All clustering solutions provide basic ping tests to validate server functionality, but only more advanced offerings also track application health and can automatically recover from detected failures. This deeper level of detection and recovery minimizes downtime.
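As a rough illustration of the difference between a ping test and application-level monitoring with automatic recovery, here is a minimal Python sketch. It does not reflect any particular clustering product: the `MonitoredResource` class, its callbacks, and the retry limit are all hypothetical names chosen for the example. The idea is simply that a failed health check first triggers a local recovery attempt and, if that keeps failing, a failover to a standby node.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MonitoredResource:
    """A server or application watched by a (hypothetical) cluster monitor."""
    name: str
    check: Callable[[], bool]      # application-level check; True means healthy
    recover: Callable[[], None]    # local recovery, e.g. restart the service
    failover: Callable[[], None]   # move the resource to a standby node
    max_local_retries: int = 2     # assumed policy, not taken from any product
    failures: int = 0
    log: List[str] = field(default_factory=list)

    def poll(self) -> str:
        """Run one health check and escalate if it fails."""
        if self.check():
            self.failures = 0
            return "healthy"
        self.failures += 1
        if self.failures <= self.max_local_retries:
            self.log.append(f"{self.name}: check failed, local recovery #{self.failures}")
            self.recover()
            return "recovering"
        # Local recovery did not help, so hand the resource to a standby node.
        self.log.append(f"{self.name}: local recovery exhausted, failing over")
        self.failover()
        self.failures = 0
        return "failed_over"
```

In a real deployment `check` would do something meaningful for the application, such as running a test query against a database, rather than only confirming that the host answers a ping.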

Architecting all components of the cluster for redundancy is paramount to maximizing uptime. Connections to storage often represent a SPOF, so it is critical that multi-pathing is designed into any shared storage configuration. Linux DM Multipath (DM-MPIO) reroutes block I/O to an alternate path when a path fails. This removes every component in the path from server to storage as a potential SPOF and provides automatic recovery should a failure occur.
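The rerouting behavior described above can be pictured with a small Python model. This is only a conceptual sketch: DM-MPIO does this work inside the kernel's device-mapper layer, and the `MultipathDevice` class, the path names, and the exceptions below are hypothetical and not part of any real multipath API.

```python
from typing import Callable, Dict, List


class PathFailed(Exception):
    """Raised by a path when its HBA, cable, switch, or controller is down."""


class AllPathsFailed(Exception):
    """Raised when no path to the storage device is usable."""


class MultipathDevice:
    """Toy model of one storage device reachable over several paths."""

    def __init__(self, paths: Dict[str, Callable[[bytes], str]]) -> None:
        self.paths = paths            # path name -> function that sends a block
        self.failed: List[str] = []   # paths already marked as failed

    def write(self, block: bytes) -> str:
        """Send a block over the first healthy path, rerouting on failure."""
        for name, send in self.paths.items():
            if name in self.failed:
                continue
            try:
                return send(block)
            except PathFailed:
                # Mark the path failed and retry the same I/O on the next one.
                self.failed.append(name)
        raise AllPathsFailed("no usable path to storage")
```

The caller never sees the failure of an individual path; an error surfaces only when every path is down. A real multipath stack would also periodically re-test failed paths and return them to service, which this sketch leaves out.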

But even when configured with multi-pathing, shared storage and SANs still represent single points of failure, as does the physical data center in which they are located. To provide further protection, off-site replication of critical data must be deployed together with cross-site clustering. Combined with network redundancy between sites, this optimal solution removes all SPOFs. Real-time replication ensures that an up-to-date copy of business-critical data is always available; replicating off-site to a backup data center or into a cloud service also protects against primary data center outages caused by fire, power loss, and similar events.
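One way to check a design for remaining SPOFs is to describe it as a dependency graph and test whether the service survives the loss of each component in turn. The Python sketch below does this for two invented topologies; the component names (`node_a`, `san`, `dr_node`, `replica`) and the helper functions are assumptions made for illustration, not part of any clustering tool.

```python
from typing import Dict, List, Optional, Set


def reachable(graph: Dict[str, List[str]], start: str,
              goals: Set[str], removed: Optional[str]) -> bool:
    """Return True if any goal can be reached from start with one component removed."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        if node in goals:
            return True
        for nxt in graph.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def find_spofs(graph: Dict[str, List[str]], start: str, goals: Set[str]) -> List[str]:
    """List every component whose loss cuts start off from all copies of the data."""
    components = set(graph) | {n for nbrs in graph.values() for n in nbrs}
    components.discard(start)
    return sorted(c for c in components if not reachable(graph, start, goals, c))


# Two cluster nodes sharing one SAN in a single data center.
single_site = {
    "clients": ["node_a", "node_b"],
    "node_a": ["san"],
    "node_b": ["san"],
}

# The same cluster plus a node at a second site holding a replicated copy.
two_site = {
    "clients": ["node_a", "node_b", "dr_node"],
    "node_a": ["san"],
    "node_b": ["san"],
    "dr_node": ["replica"],
}

single_site_spofs = find_spofs(single_site, "clients", {"san"})
two_site_spofs = find_spofs(two_site, "clients", {"san", "replica"})
```

With these inputs, `single_site_spofs` contains only `"san"`, while `two_site_spofs` is empty because the replica keeps the data reachable when the SAN is lost. The model is deliberately coarse: network links, power, and the data center itself could be added as further nodes to test them the same way.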

Application-level monitoring with auto-recovery, multi-pathing for shared storage, and data replication for off-site protection each eliminate potential single points of failure within your cluster architecture. Paying attention to these components during cluster design and deployment will ensure the greatest possible levels of uptime.

LinuxClustering.net
