Applications fail every day for a variety of reasons. The challenge is how to architect a business-critical application so that failure is less likely, its impact is minimal, and recovery is fast.

There are various options available to mitigate hardware failure (RAID, master/slave replication, load balancers, etc.), and there is plenty of content to read on ways to avoid it.

Apart from hardware failure, there are a bunch of other reasons services fail:

  • erroneous deployment procedures, like the wrong configuration files being deployed
  • buggy code
  • encryption keys being corrupted or lost.

I encountered some of these issues in my previous life at Zoho, where we learnt, failed and learnt some more, and lived to tell the story. The failures mentioned above can be mitigated by following some of these practices.

Split the service into a Test site and a LIVE site, and expose Test to customers.

For a business-critical application it is good practice to expose a sandbox (test) environment that customers integrate with from their own development or staging setups. For critical apps, Test becomes the environment customers are integrating with and testing against on a daily basis.

  • Always roll out code changes to the “test” accounts first. The service should be architected to accommodate this, preferably at the domain level (sketched below).
  • Let it run for some time, with customers exercising it against their sandbox environments.
  • Then push to LIVE accounts after validation.

With the API and code being exercised daily, most issues can be caught here.

This is especially applicable to services that customers interact with via an API. An example is our own service, which provides a Test and a Live account for every customer.
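As a rough illustration of domain-level separation, the sketch below (hypothetical names and versions throughout) resolves an incoming host to either the test or the live deployment, so test accounts can be served by a build that is one version ahead:

```python
# Hypothetical sketch: route requests to a "test" or "live" deployment
# based on the subdomain, so test accounts can run newer code first.

DEPLOYMENTS = {
    # environment -> application version currently serving it (assumed values)
    "test": "v2.4.0-rc1",   # new builds land here first
    "live": "v2.3.7",       # promoted only after the test soak period
}

def environment_for_host(host: str) -> str:
    """Map a request host like 'test.api.example.com' to an environment."""
    subdomain = host.split(".")[0].lower()
    return "test" if subdomain == "test" else "live"

def version_for_request(host: str) -> str:
    return DEPLOYMENTS[environment_for_host(host)]

if __name__ == "__main__":
    print(version_for_request("test.api.example.com"))  # -> v2.4.0-rc1
    print(version_for_request("api.example.com"))       # -> v2.3.7
```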

Architect to run as mini self-contained services in separate regions.

For business applications, data can usually be segmented at the customer level, and hence the application can be architected to run as mini-services. Each mini-service should serve a subset of customers and be deployed in a different region. If the mini-services need to interact among themselves, they should do so as if interacting with a third-party service.

As an example, check out Mailchimp: when you log in, your account itself is pinned to a region, as captured in the domain.

https://us4.admin.mailchimp.com/

With this architecture, each mini-service can ideally be deployed independently.
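A minimal sketch of the same idea, assuming customers are assigned to a region at sign-up and the region is carried in the domain, much like the Mailchimp URL above; the region names and domains here are hypothetical:

```python
# Hypothetical sketch: pin each customer to a regional mini-service at
# sign-up and derive their region-specific domain from that assignment.

REGIONS = ["us1", "us2", "eu1"]  # assumed list of deployed regions

# In a real system this mapping would live in a small global directory/database.
CUSTOMER_REGION: dict[str, str] = {}

def assign_region(customer_id: str) -> str:
    """Assign new customers to a region, e.g. round-robin by current count."""
    region = REGIONS[len(CUSTOMER_REGION) % len(REGIONS)]
    CUSTOMER_REGION[customer_id] = region
    return region

def base_url(customer_id: str) -> str:
    """Every later request for this customer goes to their home region."""
    region = CUSTOMER_REGION[customer_id]
    return f"https://{region}.app.example.com"

if __name__ == "__main__":
    assign_region("acme-corp")
    print(base_url("acme-corp"))  # e.g. https://us1.app.example.com
```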

Before the days of Amazon and Rackspace, such a distributed setup was difficult for small players. Now it is far easier to have smaller services deployed across regions.

The cost should not increase radically: instead of a few high-end servers, you run more servers of lower capacity. An additional advantage is that table schema updates on databases that lock and copy (like MySQL) take less time, since each mini-service operates on a smaller segment of data at a time.

Deployment and monitoring tools do need to be in place, though.

This architecture also allows you to do staggered updates across regions.

By staggering, the chances of noticing an issue and fixing it before it spreads are much higher, and your entire customer base is never affected at once.
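A sketch of what such a staggered rollout loop might look like, assuming each region exposes some health signal you can check before moving on; the deploy and health-check functions below are placeholders for whatever your tooling provides:

```python
# Hypothetical sketch: roll a release out region by region, pausing to
# verify health (error rates, key transactions) before the next region.

import time

REGIONS = ["us1", "us2", "eu1"]   # assumed rollout order
SOAK_SECONDS = 30 * 60            # how long to watch each region

def deploy(region: str, version: str) -> None:
    print(f"deploying {version} to {region}")   # placeholder for the real deploy

def healthy(region: str) -> bool:
    return True                                  # placeholder for real health checks

def staggered_rollout(version: str) -> None:
    for region in REGIONS:
        deploy(region, version)
        time.sleep(SOAK_SECONDS)                 # let it soak with real traffic
        if not healthy(region):
            print(f"{region} unhealthy; halting rollout of {version}")
            return                               # remaining regions stay untouched
    print(f"{version} rolled out to all regions")
```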

Everything should be backed up/versioned.

If you use a managed database service like Amazon RDS, there is an option to restore data to a point in time, typically up to the last five minutes. If you manage your own database, you need to go for incremental backups.
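With RDS, for example, a point-in-time restore can be driven through boto3 roughly as below; the instance identifiers are hypothetical, and the restore creates a new instance rather than overwriting the existing one:

```python
# Hedged sketch: restore an RDS instance to its latest restorable time
# (typically within the last ~5 minutes) as a new instance.

import boto3

rds = boto3.client("rds", region_name="us-east-1")   # assumed region

rds.restore_db_instance_to_point_in_time(
    SourceDBInstanceIdentifier="orders-db",           # hypothetical source instance
    TargetDBInstanceIdentifier="orders-db-restored",  # new instance to create
    UseLatestRestorableTime=True,                     # or pass RestoreTime=<datetime>
)
```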

A delayed replica (as provided by MongoDB et al.) is another good option.
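In MongoDB, a delayed secondary is just a replica-set member configured to lag behind the primary; the snippet below sketches what that member entry might look like (the field is `secondaryDelaySecs` in recent versions, `slaveDelay` in older ones), with a hypothetical host name:

```python
# Hedged sketch: a MongoDB replica-set member that applies writes one hour
# late, giving you a window to recover from destructive mistakes.

delayed_member = {
    "_id": 2,
    "host": "db3.example.com:27017",  # hypothetical host
    "priority": 0,                    # never eligible to become primary
    "hidden": True,                    # invisible to normal client reads
    "secondaryDelaySecs": 3600,        # stays one hour behind the primary
}
# This entry would be added to the replica-set configuration and applied
# via a replica-set reconfig (rs.reconfig() in the mongo shell).
```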

Similarly, encrypted data and keys should be handled with proper versioned backups. There is no point in backing up encrypted data if you don’t have a similar backup process for the keys.

Update in sets.

Even within the same mini-service, rolling out a key update should preferably be staggered.

Example: if you have a process to rotate keys every 30 days, you do not have to re-encrypt the data in one go. Instead, update the data in sets over an hour or more. To support this, you would have to store the key version along with the encrypted data.
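A minimal sketch of rotating keys in sets, assuming each record carries a key version alongside its ciphertext; the crypto and persistence calls are placeholders for whatever your stack actually uses:

```python
# Hypothetical sketch: re-encrypt records in small batches during key
# rotation, storing the key version with each record so old and new keys
# can coexist until the rotation finishes.

KEYS = {1: b"old-key-bytes", 2: b"new-key-bytes"}   # versioned key store (assumed)
CURRENT_VERSION = 2

def decrypt(ciphertext: bytes, key: bytes) -> bytes:
    return ciphertext        # placeholder for the real cipher

def encrypt(plaintext: bytes, key: bytes) -> bytes:
    return plaintext         # placeholder for the real cipher

def rotate_batch(rows: list[dict]) -> None:
    """Each row: {'id': ..., 'ciphertext': bytes, 'key_version': int}."""
    for row in rows:
        if row["key_version"] == CURRENT_VERSION:
            continue                                  # already rotated
        plaintext = decrypt(row["ciphertext"], KEYS[row["key_version"]])
        row["ciphertext"] = encrypt(plaintext, KEYS[CURRENT_VERSION])
        row["key_version"] = CURRENT_VERSION
        # persist the row here; readers pick the key by row['key_version']
```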

Finally, all of this only helps you mitigate issues. Monitoring and managing the servers needs to be robust enough to capture failures so you can act on them immediately. Investing in tools like New Relic, Pingdom/Site24x7 and home-grown app-specific monitoring could make the difference between a lynch mob and a few customers you can reach out to personally.

For a SaaS startup that is gaining traction, not all of these need to be in place while starting out, but there has to be a plan to make a natural progression towards them.