Overview
Camayak customers experienced three service outages during the evening hours of February 3rd (US time zones) / early morning of February 4th (UTC).
The outages spanned the following times:
Outage 1 (34 minutes):
– UTC From 02/04/2015 03:39 to 02/04/2015 04:13
– CST From 02/03/2015 21:39 to 02/03/2015 22:13
Outage 2 (12 minutes):
– UTC From 02/04/2015 04:37 to 02/04/2015 04:49
– CST From 02/03/2015 22:37 to 02/03/2015 22:49
Outage 3 (40 minutes):
– UTC From 02/04/2015 04:54 to 02/04/2015 05:34
– CST From 02/03/2015 22:54 to 02/03/2015 23:34
In the times between the outages, service may have been slow as customers tried to reconnect.
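The durations and CST times listed above can be re-derived from the UTC timestamps. The sketch below is only a cross-check of that arithmetic; the names `OUTAGES_UTC` and `summarize` are illustrative, and CST is assumed to be a fixed UTC-6 offset (no daylight saving time applies in February).

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
# Assumption: CST modeled as a fixed UTC-6 offset.
CST = timezone(timedelta(hours=-6), "CST")

# (start, end) of each outage in UTC, copied from the list above.
OUTAGES_UTC = [
    (datetime(2015, 2, 4, 3, 39, tzinfo=UTC), datetime(2015, 2, 4, 4, 13, tzinfo=UTC)),
    (datetime(2015, 2, 4, 4, 37, tzinfo=UTC), datetime(2015, 2, 4, 4, 49, tzinfo=UTC)),
    (datetime(2015, 2, 4, 4, 54, tzinfo=UTC), datetime(2015, 2, 4, 5, 34, tzinfo=UTC)),
]


def summarize(outages):
    """Return (duration in minutes, CST start, CST end) for each outage."""
    fmt = "%m/%d/%Y %H:%M"
    rows = []
    for start, end in outages:
        minutes = int((end - start).total_seconds() // 60)
        rows.append((
            minutes,
            start.astimezone(CST).strftime(fmt),
            end.astimezone(CST).strftime(fmt),
        ))
    return rows


if __name__ == "__main__":
    for minutes, cst_start, cst_end in summarize(OUTAGES_UTC):
        print(f"{minutes} minutes: CST from {cst_start} to {cst_end}")
```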
Part 1: The service outage
A: Cause
We have determined that the service outages were caused by the following circumstances:
1) A customer inadvertently created a large number of publishing destinations pointing to non-existent or misconfigured WordPress instances.
2) When the customer went to the manage publishing destinations page, Camayak attempted to validate the publishing destinations. Due to the specific circumstances of those WordPress instances, the validation code ended up waiting a long time for validation before timing out.
3) The validation attempts tied up application server threads, preventing them from being used to service other customers’ requests.
4) Our load balancer, seeing that the application servers were not responding to the validation requests in a timely fashion, re-attempted those requests on other application servers, causing application server threads on those servers to be tied up waiting for validation as well.
5) This process continued until all application server threads were tied up waiting for validation.
6) Eventually, the validation requests timed out fully, and the application server threads were available to process other requests.
Steps 2-6 of this process happened three times, causing the three outages.
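The cascade in steps 2-5 can be illustrated with a small toy model. This is a simplified sketch, not a description of our actual infrastructure: the function name `first_saturation`, the round-robin retry policy, and every number used are hypothetical. It assumes that each attempt holds one thread until the validation call gives up, and that the load balancer retries a request on the next server only when the validation outlasts the load balancer timeout.

```python
def first_saturation(servers, threads_per_server, slow_requests,
                     lb_timeout, validation_timeout, max_retries):
    """Return the first time (seconds) at which every server has all of its
    threads tied up by validation attempts, or None if that never happens."""
    # The load balancer only retries when the application server has not
    # answered within lb_timeout, i.e. when validation outlasts it.
    retries = max_retries if validation_timeout > lb_timeout else 0

    attempts = []  # (start, end, server index)
    for r in range(slow_requests):
        for k in range(retries + 1):
            start = k * lb_timeout
            # Assumption: retry k goes to the next server, round-robin.
            attempts.append((start, start + validation_timeout, (r + k) % servers))

    # Thread usage only grows at attempt start times, so check those.
    for t in sorted({start for start, _, _ in attempts}):
        busy = [0] * servers
        for start, end, server in attempts:
            if start <= t < end:
                busy[server] += 1
        if all(b >= threads_per_server for b in busy):
            return t
    return None


if __name__ == "__main__":
    # Hypothetical numbers: validation hangs far longer than the LB timeout.
    print(first_saturation(3, 4, 6, lb_timeout=30, validation_timeout=300, max_retries=2))
    # Same load, but validation gives up before the LB timeout: no retries.
    print(first_saturation(3, 4, 6, lb_timeout=30, validation_timeout=10, max_retries=2))
```

In this model, six slow requests alone occupy only half of the twelve threads; it is the retried copies arriving on the other servers that fill the rest.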
B: Remediation
1) As of February 4th 16:45 UTC (10:45 CST), we have released a fix to our application server software that enforces reasonable timeouts on the validation attempts. The validation timeout is well below the load balancer timeout, to ensure that the cascade effect described in cause (4) does not happen.
2) We have, as an additional measure, increased our application server capacity.
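As an illustration of the kind of timeout described in remediation (1), the sketch below bounds a destination check using only the Python standard library. It is not our production code: the function name `validate_destination`, the use of an HTTP HEAD request as the check, and both timeout values are assumptions made for the example.

```python
import http.client
from urllib.parse import urlsplit

# Hypothetical values; the only property that matters here is that the
# validation timeout stays well below the load balancer timeout.
LOAD_BALANCER_TIMEOUT_SECONDS = 30
VALIDATION_TIMEOUT_SECONDS = 5
assert VALIDATION_TIMEOUT_SECONDS < LOAD_BALANCER_TIMEOUT_SECONDS


def validate_destination(url, timeout=VALIDATION_TIMEOUT_SECONDS):
    """Return True if the destination answers a HEAD request with a
    non-error status, False on any failure, including a timeout.

    Note: the timeout applies to each blocking socket operation (connect,
    send, read), so it is not a strict deadline for the whole call.
    """
    try:
        parts = urlsplit(url)
        port = parts.port  # raises ValueError on a malformed port
    except ValueError:
        return False
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False

    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, port, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, port, timeout=timeout)
    try:
        conn.request("HEAD", parts.path or "/")
        return conn.getresponse().status < 400
    except (OSError, http.client.HTTPException):
        # Covers refused connections, DNS failures, and socket timeouts.
        return False
    finally:
        conn.close()
```

With a bound like this, a misconfigured destination costs a thread a few seconds at most, and the application server answers before the load balancer decides to retry the request elsewhere.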
Part 2: The support outage
A: Cause
Customers began sending support queries at 04:01 UTC, but the Camayak support team was unable to respond to them until 07:29 UTC. The reasons for this are detailed here.
B: Remediation
As of February 4th 14:45 UTC (08:45 CST), we have set up a notification system with extra features to alert and inform off-duty support personnel of service outages. This is in addition to our existing notifications framework for other critical and non-critical alerts.