We are in the midst of designing the API for our service. Here we share our notes, thoughts and design process, based on our experience so far.

API design is a tricky business. The problem is that once users start using an API, it is tough to modify. You either end up with a lot of versions to support or with a lot of angry API users (if you keep asking them to migrate). The SaaS approach of putting things out and iterating quickly works against you here.

Security is one major aspect that we need to get right in the very first version of the API. This is particularly true for services that deal with money and credit cards (like ours).

An excellent paper by Rui Wang et al. from Indiana University and Microsoft Research explains the security pitfalls in the API designs of various “Cashier-as-a-Service” offerings such as PayPal, Amazon Payments and Google Checkout. The paper is:

How to Shop for Free Online – Security Analysis of Cashier-as-a-Service Based Web Stores

(It is a little academic and dry, but I would strongly recommend that API developers, and API users, go through it.)

We emphasize “security” as something that must be right from the first version for this reason: once a vulnerability is identified or exposed, fixing it completely still takes a long time.

In Amazon’s case, an issue was identified on the API provider side, and the vulnerable SDK was promptly fixed. But, as the paper notes, it took 40 days after the advisory for Amazon to force all stores to migrate to the new version. Until then, Amazon had to keep supporting the older, buggier version.

In the majority of cases, the mistake was on the merchant side (the API user). But that does not completely absolve us, the service providers. Merchants come to us because they lack the developer bandwidth or the expertise to build their own service.

The holy grail is to design an API that prevents merchants from making mistakes. At the very least, we should document the possible pitfalls and the design merchants should follow to avoid them.

In a series of posts, I will try to explain some of the possible API designs, and their pitfalls, for integrating multiple web services (not only payment services). Some of the examples are based on the paper mentioned above. Beyond the security angle, the posts also try to give a more holistic picture.

This post is specifically targeted at API users (in our case, the merchants). Feedback and comments based on your experiences are most welcome.

Scenario

Let’s take the example of an online shop that uses a 3rd party service, such as a payment gateway, to collect money.

Design #1

Your customer (using Browser) -> Your service -> 3rd party service

This is one of the simpler designs.

  • You receive the request from the customer along with all the required parameters.
  • You do the pre-processing, for example input validation and authorization.
  • You then call the 3rd party service and receive the response (a synchronous API call).
  • You send your response back to the client.

(Note: All communication is assumed to be via secure connections such as HTTPS.)
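Below is a minimal Python sketch of Design #1. `handle_checkout`, the parameter names and the `call_gateway` callable are hypothetical placeholders for your own handler and the provider’s SDK. Authorization and transport are left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical signature for the gateway call: order parameters in, the
# gateway's response out. Not any real provider's SDK.
GatewayCall = Callable[[Dict[str, object]], Dict[str, object]]


@dataclass
class Response:
    status: int                  # HTTP status code we return to the browser
    body: Dict[str, object]


def handle_checkout(params: Dict[str, object], call_gateway: GatewayCall) -> Response:
    """Design #1: validate, call the 3rd party synchronously, relay the result."""
    # Pre-processing: input validation (authorization omitted for brevity).
    order_id = params.get("order_id")
    amount = params.get("amount")
    if not order_id:
        return Response(400, {"error": "order_id is required"})
    if isinstance(amount, bool) or not isinstance(amount, int) or amount <= 0:
        return Response(400, {"error": "amount must be a positive integer (minor units)"})

    # Synchronous call: this thread blocks until the 3rd party answers.
    result = call_gateway({"order_id": order_id, "amount": amount})

    # Relay the outcome back to the client.
    paid = result.get("status") == "succeeded"
    return Response(200, {"order_id": order_id, "paid": paid})
```

This sketch has exactly the weaknesses discussed next. If `call_gateway` hangs, or the process dies mid-call, nothing records that a payment may have happened.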

It looks simple, much like a db operation. And it is simple, as long as the complete operation (including auditing) is done on the 3rd party service and your service acts mostly as a proxy.

But if part of the work has to be done on your side, it gets complex. In an online store, for example, billing would be taken care of by the payment gateway, but the actual shipping is done by the store.

Pitfalls

  • You must be PCI DSS compliant if the request parameters contain credit card information.
  • The 3rd party service might be down or unresponsive.
  • Even worse, the 3rd party service might go down while processing the request, leaving you unsure whether the request was completed.
  • Your service might go down after you send the request to the 3rd party service but before you receive the response.

Services go down?

Services going down might seem a bit far-fetched. But at internet scale, it does happen (and I have a lot of grey hairs to prove it!).

Also,

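  • Your db transactions don’t extend to the 3rd party services. Rolling back your transaction does not undo a request the 3rd party service has already processed.
  • Your 3rd party service might promise an SLA of 99.99% availability, but that is not a guarantee for each individual request.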
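A 99.99% SLA still permits some downtime. Assuming a 365-day year and a 30-day month, the allowed downtime works out as follows:

```latex
\begin{aligned}
(1 - 0.9999) \times 365 \times 24 \times 60~\text{min} &= 52.56~\text{min per year} \\
(1 - 0.9999) \times 30 \times 24 \times 60~\text{min} &= 4.32~\text{min per month}
\end{aligned}
```

Any request that is in flight during those minutes can fail, however good the monthly figure looks.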

The 3rd party service being down or unresponsive.

You should keep short “connection timeouts” and “read timeouts” on the connections to the 3rd party service. This is especially important for thread-based (or process-based) servers like Apache and Tomcat. Suppose only a small percentage of your users try to use the feature. Your threads would still all clog up waiting for responses from the 3rd party service. Requests from your other users would then wait for threads to free up, and your service would become unresponsive.
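Python’s standard `http.client` accepts a single `timeout`. One way to get a short connect timeout plus a separate read timeout is to connect first and then change the socket’s timeout. The sketch below assumes a plain GET. `get_with_timeouts` and its parameters are our own names. In production, you would more likely use the separate connect/read timeout settings of your HTTP client library.

```python
import http.client
from typing import Tuple


def get_with_timeouts(host: str, port: int, path: str,
                      connect_timeout: float, read_timeout: float,
                      use_tls: bool = False) -> Tuple[int, bytes]:
    """GET `path`, failing fast if the 3rd party is slow to connect or to answer.

    http.client takes one timeout, so we connect using the (short) connect
    timeout, then switch the open socket to the read timeout. A timeout
    raises socket.timeout (TimeoutError on Python 3.10+).
    """
    conn_cls = http.client.HTTPSConnection if use_tls else http.client.HTTPConnection
    conn = conn_cls(host, port, timeout=connect_timeout)
    try:
        conn.connect()                       # bounded by connect_timeout
        conn.sock.settimeout(read_timeout)   # each later socket send/recv bounded by read_timeout
        conn.request("GET", path)            # reuses the already-open socket
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        conn.close()
```

The read timeout applies to each socket operation, not to the request as a whole. A server that trickles bytes slowly can therefore still hold a thread for longer than `read_timeout`.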

Depending on your requirements, a connection pool with status checks on the 3rd party service’s availability could be useful.
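One way to implement the status checks is a small guard in the style of a circuit breaker. This is a technique we are swapping in here, not something prescribed above. It covers only the availability check, not the connection pool itself. After a few consecutive failures, the guard treats the 3rd party as down for a cooldown period and fails fast instead of tying up a worker thread. `AvailabilityGuard`, `max_failures` and `cooldown` are hypothetical names, and the defaults are arbitrary.

```python
import threading
import time
from typing import Callable, Optional


class AvailabilityGuard:
    """Fail fast while the 3rd party service looks down.

    After `max_failures` consecutive failures, calls raise immediately for
    `cooldown` seconds. Once the cooldown passes, calls go through again;
    a further failure re-opens the cooldown, a success resets the counter.
    """

    def __init__(self, max_failures: int = 3, cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.down_until: Optional[float] = None
        self._lock = threading.Lock()   # state is shared by server threads

    def available(self) -> bool:
        with self._lock:
            return self.down_until is None or self.clock() >= self.down_until

    def call(self, fn: Callable[[], object]) -> object:
        if not self.available():
            raise RuntimeError("3rd party service marked unavailable; failing fast")
        try:
            result = fn()                # the actual (timeout-bounded) remote call
        except Exception:
            with self._lock:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.down_until = self.clock() + self.cooldown
            raise
        with self._lock:
            self.failures = 0
            self.down_until = None
        return result
```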

3rd party service down (or unresponsive) while processing the request

The 3rd party service may become unresponsive after you have sent the request. Set your read timeouts based on the kind of request. Only “process-heavy” requests, such as report generation, should need large timeouts. Even those are arguably better handled as scheduled jobs than inside a user request.

In any case, if you fail to receive the response, you still need to know whether the request was completed. In the case of an online store, you must be able to either refund the customer or provide them with the goods or service.

Your service going down in the middle of a request

You would generally log the request to your persistent store (db) for further processing.

To do that you could

  1. Log the request in a persistent store (like a db).
  2. Send the request to the 3rd party service and receive the response.
  3. Send your response back to the browser.

Or you could

  1. First send the request to the 3rd party service and receive the response.
  2. Then log it in a persistent store.
  3. Send your response back to the browser.

In both cases, your records and the 3rd party service’s records would disagree if your server went down just after the first step.

The proper flow for this design is:

  1. Generate an id for the request. Log the request in a persistent store (like a db), marking its status as “in-progress” along with a timestamp.
  2. Call the 3rd party service with the necessary parameters (including the id). Remember to set a short connection timeout and an appropriate read timeout.
  3. On receiving the response, update the status to either “failed” or “succeeded” accordingly.
  4. Send your response back to the browser.
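Below is a minimal Python sketch of the four steps, using SQLite as the persistent store. The table layout, `process_payment` and the `call_gateway` callable are hypothetical. `call_gateway` stands in for the provider’s SDK, with its timeouts configured inside it.

```python
import sqlite3
import time
import uuid
from typing import Callable, Dict

# Hypothetical gateway call: receives our request id plus parameters and returns
# {"status": "succeeded" | "failed"}; raises TimeoutError/OSError on no answer.
GatewayCall = Callable[[str, Dict[str, object]], Dict[str, str]]


def init_db(db: sqlite3.Connection) -> None:
    db.execute("""CREATE TABLE IF NOT EXISTS payment_requests (
                      id TEXT PRIMARY KEY,
                      amount INTEGER NOT NULL,
                      status TEXT NOT NULL,
                      created_at REAL NOT NULL)""")
    db.commit()


def process_payment(db: sqlite3.Connection, amount: int,
                    call_gateway: GatewayCall) -> Dict[str, str]:
    # 1. Generate an id and durably record the request as in-progress
    #    *before* calling out.
    request_id = str(uuid.uuid4())
    db.execute("INSERT INTO payment_requests (id, amount, status, created_at) "
               "VALUES (?, ?, 'in-progress', ?)", (request_id, amount, time.time()))
    db.commit()

    # 2. Call the 3rd party service, passing our id along.
    try:
        result = call_gateway(request_id, {"amount": amount})
    except OSError:  # includes TimeoutError and connection errors
        # No answer: we do NOT know whether the payment happened, so the row
        # stays in-progress for later reconciliation.
        return {"id": request_id, "status": "in-progress"}

    # 3. Record the outcome the 3rd party reported.
    status = "succeeded" if result.get("status") == "succeeded" else "failed"
    db.execute("UPDATE payment_requests SET status = ? WHERE id = ?", (status, request_id))
    db.commit()

    # 4. This is what we would send back to the browser.
    return {"id": request_id, "status": status}
```

The key design choice is committing in step 1, before the network call. If the process dies during step 2, the in-progress row survives as evidence that something may have happened at the 3rd party.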

The 3rd party service should

  • Take in the “id” generated by you to identify the request along with other parameters.
  • Provide a way of querying the status of a particular request based on the id.
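To illustrate that contract, here is a hypothetical in-memory stand-in for such a 3rd party service. It does not model any real provider’s API. Beyond the two requirements above, it makes one assumption of ours: a repeated id is treated as the same request, so a retry cannot charge twice.

```python
import threading
from typing import Dict, Optional


class FakeGateway:
    """In-memory 3rd party service keyed by the caller's request id."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._results: Dict[str, str] = {}
        self.charges = 0  # how many times money actually moved

    def charge(self, request_id: str, amount: int) -> Dict[str, str]:
        with self._lock:
            # Duplicate id: return the original outcome instead of charging again.
            if request_id in self._results:
                return {"id": request_id, "status": self._results[request_id]}
            status = "succeeded" if amount > 0 else "failed"
            if status == "succeeded":
                self.charges += 1
            self._results[request_id] = status
            return {"id": request_id, "status": status}

    def status(self, request_id: str) -> Optional[str]:
        # None means this id never reached the gateway.
        with self._lock:
            return self._results.get(request_id)
```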

Handling failures

  • You could have a scheduler that walks through “in-progress” requests/transactions older than a certain age and tries to reconcile them, for example by querying the 3rd party service for their status.
  • Or you could handle them manually through an internal administrative UI in your service that lists old requests/transactions still marked “in-progress”.
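Below is a sketch of the scheduler option, meant to run periodically (e.g. from cron). It makes two assumptions:

  • A `payment_requests` table with columns id, amount, status and created_at (a Unix timestamp).
  • A `query_status(id)` function wrapping the 3rd party’s status lookup. It returns “succeeded”, “failed”, some other value while the request is still processing, or `None` if the id is unknown.

All of these names are hypothetical. Treating an unknown id as failed is also our assumption. It is only safe once a row is old enough that the original request can no longer arrive.

```python
import sqlite3
import time
from typing import Callable, Optional


def reconcile(db: sqlite3.Connection, query_status: Callable[[str], Optional[str]],
              older_than: float = 300.0, now: Optional[float] = None) -> int:
    """Settle in-progress rows older than `older_than` seconds; return rows updated."""
    now = time.time() if now is None else now
    rows = db.execute("SELECT id FROM payment_requests "
                      "WHERE status = 'in-progress' AND created_at < ?",
                      (now - older_than,)).fetchall()
    updated = 0
    for (request_id,) in rows:
        remote = query_status(request_id)  # ask the 3rd party what really happened
        if remote in ("succeeded", "failed"):
            new_status = remote
        elif remote is None:
            new_status = "failed"          # the 3rd party never saw it: no money moved
        else:
            continue                       # still processing remotely; retry next run
        cur = db.execute("UPDATE payment_requests SET status = ? "
                         "WHERE id = ? AND status = 'in-progress'",
                         (new_status, request_id))
        updated += cur.rowcount            # 0 if someone else settled it meanwhile
    db.commit()
    return updated
```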

The advantage of this design is that a malicious user of your service has less room to cause security problems (assuming you do proper input validation).

You must be PCI DSS compliant to use this design when handling credit card information. Other designs can reduce the burden of PCI DSS compliance.

More designs in the subsequent posts.