7.4. Remote Services Redundancy

We'll be talking about general application scalability and redundancy in Chapter 9, but some redundancy issues apply specifically to remote services. Assuming you have some degree of control over the remote services you wish to use, you'll need to carefully calculate how redundant each component needs to be, provided the component can be set up redundantly at all. General BCP might dictate that you need one or more hot spares, depending on the number of online nodes that comprise a component. Judging how many spare nodes you'll need should take into account several metrics:
Unlike an application as a whole, a key remote service could be disabled while your primary service is still running, so users are still able to initiate actions that rely on making remote service requests. This is in stark contrast to web application server failures: if Apache crashes, you can no longer serve pages. With a dead web server, you don't need to deal with requests that come in and partially succeed, having perhaps already sent back a response or written data to storage; with a failed remote service, you do.

It's important to remember that any component in the chain can fail: the local machine's network interface could go down, DNS could go down, the network switch could collapse, one of the routing points between the hosts could fail, or some part of the remote server itself could fail. Failures at each of these points in the calling chain manifest themselves in different ways, and some are difficult to monitor (although we'll talk about that more in Chapter 10).

When components in the system fail, we want the system as a whole to carry on if it possibly can. Ideally, we want as many components as we can to provide hot failover behavior. So what do we mean by hot failover in this case? When we have multiple available instances of a remote service, hot failover is the ability to automatically migrate traffic from a failed node to a functioning node. For some services, this means using a dedicated hardware or software load-balancing appliance that monitors the things it's balancing. You then need multiple load balancers to ensure hot failover in the case where one of the balancers fails. For nonbalanced services, this can mean trying a list of hosts until a functional one is found.

It's not just fully failed components we need to skip over, either. Where a service is reachable but returns a certain class of error response (one that specifies that the remote service failed), we might want our application to retry the request on a sibling server.
In this case, a load balancer doesn't really help us: it can only detect that the service is available, not that it can successfully execute different requests.

User-facing components such as web servers tend to need a specialist software or hardware load balancer to handle hot failover; we've already discussed load balancing in Chapter 2. Behind the scenes, components can be given hot failover abilities right in your application code. This can reduce the complexity and cost of your architecture by eliminating extra balancing and routing nodes.

The most basic example is software database load balancing. For the cluster of database servers we want to connect to, we have a list of hostnames. First we shuffle the list so that we pick a random server to connect to each time. Next we iterate over the list, trying to connect to each host in turn. When we find a host we can connect to, we stop looking and return the connection handle. If we try all hosts in the list and don't manage to connect to any, we return zero and let the application logic worry about what needs to be done:

    function db_connect($hosts, $user, $pass){

        // randomize the order so each call starts with a different server
        shuffle($hosts);

        foreach($hosts as $host){

            debug("Trying to connect to $host...");

            // @ suppresses the warning on failure; the 4th argument forces a new link
            $dbh = @mysql_connect($host, $user, $pass, 1);

            if ($dbh){
                debug("Connected to $host!");
                return $dbh;
            }

            debug("Failed to connect to $host!");
        }

        debug("Failed to connect to all hosts in list - giving up!");

        return 0;
    }

A slightly more complex example might be for a service where more than the connection matters: a service where, even if we manage to connect, we might not be able to converse, or the service might not be able to fulfill our request:

    function store_file($storage_hosts, $filename){

        shuffle($storage_hosts);

        // try each host in turn until one completes the whole operation
        foreach($storage_hosts as $host){

            $result = store_file_2($host, $filename);

            if ($result){
                return $result;
            }
        }

        return 0;
    }

    function store_file_2($host, $filename){

        ...

        if ($connection_failed){
            return 0;
        }

        ...

        if ($operation_failed){
            return 0;
        }

        return $result;
    }

Here we shuffle and loop over each possible hostname, retrying the operation until it succeeds. Success is defined as connecting to the remote service, issuing our command, and getting the correct response. Where we contact the service but it fails to respond or gives us a failure response, we move on and try the next one.

In some cases we won't have hot failover capacity, or all servers in the pool will be unavailable. What we do in these circumstances depends on the kind of action being performed. If we were querying a remote service for search results, then we might need to display an error message to the user when all else fails. If we were updating a remote system, then we might want to queue the update request locally so we can resend it later when the remote service becomes available. In the case of a read request to a remote service, we can sometimes fall back on a local cache for frequently requested values. All of these, of course, depend on the nature of the request. We don't want to cache time-sensitive data or queue search queries.
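The fallback strategies just described can be sketched in code. What follows is a minimal illustration under stated assumptions, not a definitive implementation: the function names (try_hosts, read_with_fallback, update_with_queue) are hypothetical, the cache and queue are plain in-memory arrays standing in for whatever local storage you would really use, and the $fetch and $send callbacks stand in for the actual remote calls. The sketch assumes PHP 7 or later.

```php
<?php
// Hypothetical sketch: none of these names come from a real library.

// Try each host in random order. $operation returns null on failure;
// the first non-null result wins. Returns null if every host fails.
function try_hosts(array $hosts, callable $operation) {
    shuffle($hosts);
    foreach ($hosts as $host) {
        $result = $operation($host);
        if ($result !== null) {
            return $result;
        }
    }
    return null;
}

// Read request: when all hosts fail, fall back on a locally cached value.
// Only suitable for data that is not time-sensitive.
function read_with_fallback(array $hosts, callable $fetch, string $key, array &$cache) {
    $value = try_hosts($hosts, function ($host) use ($fetch, $key) {
        return $fetch($host, $key);
    });
    if ($value !== null) {
        $cache[$key] = $value;   // refresh the cache on success
        return $value;
    }
    return $cache[$key] ?? null; // possibly stale value, or null if none
}

// Update request: when all hosts fail, queue the update locally
// so that it can be resent once the remote service is back.
function update_with_queue(array $hosts, callable $send, $update, array &$queue): bool {
    $ok = try_hosts($hosts, function ($host) use ($send, $update) {
        return $send($host, $update) ? true : null;
    });
    if ($ok === null) {
        $queue[] = $update;
        return false;
    }
    return true;
}
```

A separate process would then drain the queue periodically, passing each stored update back through update_with_queue() until it succeeds. For requests such as search queries, neither fallback applies, and returning an error to the user remains the right outcome.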