Monday, March 30, 2015

High-Availability Design Paradigms for Applications

We often encounter a design dilemma between service reliability and application development cost. High availability demands a lot of effort beyond the business requirements. However, an HA solution that is too simple can bring your business down when you face an incident caused by any kind of hardware failure. Here we discuss three levels of HA paradigm, ordered by implementation complexity.
1. Manual Switch: The service has a basic monitoring infrastructure that helps you identify a hardware failure, so you can bring the application up on spare capacity and keep the service running. Manual switching is easy to adopt and costs very little. However, it hardly counts as an HA design, because the time from service interruption to manual restart can exceed 30 minutes: the monitoring interval usually takes 5 to 10 minutes to catch the event and alert, a human checks whether it is a false alarm, confirms it, follows the SOP to start the application, and only then does the service resume. This low level of HA suits tasks that are not time-critical, such as file transfer or report generation, where a 30-minute outage is tolerable.
2. Semi-Auto Switch: Some services require strict data consistency and exact transaction results with no race conditions, and at the same time the interruption must not last more than a couple of minutes. We usually design Active and Passive nodes and let them coordinate with each other, so that only one Active node is working at any time. Once the Active node shuts down, the Passive one takes over control and acquires a lock (usually in a database). When the malfunctioning machine recovers, it will not process any transactions, because the lock is already held by its partner. Many designs work this way, such as database clusters: the nodes vote under some quorum arrangement and promote another candidate to Active to continue the task. In a service flow, we place the application behind a queue to maintain data consistency, so there is no service interruption in front of the queue and all switching is handled behind it. The system interface looks fine from outside, but the internal service flow has a couple of minutes of downtime, and the queue acts as a cushion that keeps the damage from propagating to dependent systems.
3. Active-Active Mode: This is the ideal for a service: every node has the same responsibility, and no single node's failure interrupts the service. However, this design can impose a lot of overhead, because the applications must communicate with each other to maintain data consistency or prevent race conditions during transactions. In a poor design, this overhead often drags down the performance of the whole cluster. Hence only a few scenarios can adopt it without too much effort, such as a web farm that only serves queries (no transactions), or a service flow in which a user has only one concurrent connection to the server at a time, like an ATM (you would only have one debit card, right?). For applications that focus on data availability, this design is very attractive. But if your application must maintain data consistency across multiple connections from many concurrent users, Active-Active Mode takes a lot of time to tune for performance and needs better data structures to reduce locking, plus a crystal-clear service flow and business purpose in case you need to expand the features as the business changes.
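
To make the lock-based takeover in level 2 more concrete, here is a minimal sketch in Python. It is only an illustration under assumptions: the in-memory LeaseLock object stands in for a lock row in a shared database, and the names LeaseLock, try_acquire, process and the 30-second lease are hypothetical, not taken from any particular product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeaseLock:
    """Hypothetical stand-in for a lock row kept in a shared database."""
    ttl: float                      # seconds a lease stays valid without renewal
    holder: Optional[str] = None    # node that currently owns the lock
    expires_at: float = 0.0         # time at which the current lease runs out

    def try_acquire(self, node: str, now: float) -> bool:
        """Take or renew the lease; fail if another node holds a live lease."""
        if self.holder in (None, node) or now >= self.expires_at:
            self.holder = node
            self.expires_at = now + self.ttl
            return True
        return False


def process(lock: LeaseLock, node: str, now: float) -> str:
    """A node handles transactions only while it holds the lease."""
    return "active" if lock.try_acquire(node, now) else "standby"


lock = LeaseLock(ttl=30.0)          # assumed lease length, for illustration only
timeline = [
    process(lock, "A", now=0.0),    # A starts first and becomes Active
    process(lock, "B", now=10.0),   # B sees a live lease and stays Passive
    process(lock, "B", now=45.0),   # A stopped renewing, so B takes over
    process(lock, "A", now=50.0),   # A recovers, but B already holds the lock
]
print(timeline)  # ['active', 'standby', 'active', 'standby']
```

In a real deployment the check and the update would have to be a single atomic operation in the database, otherwise two nodes could both believe they acquired the lock. The lease length is also what sets the "couple of minutes" of downtime: the Passive node cannot take over until the old lease has expired.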

Usually we can compromise and settle for level 1 or 2. But if we can afford it, why not go Active-Active to complete the solution?
