Probe widgets not polling data
Postmortem

We had an intermittent outage of probe data for all customers and of custom metrics data for a subset of customers between 2/21/2015 at 8 AM and 2/23/2015 at 10 AM CT. We are very sorry that we interrupted our service to you during this time period; we know that you rely on our monitoring for your systems' health. We'd like to share some of the details about what happened and how we intend to prevent this and similar problems from recurring.

What Happened

On Saturday 2/21 at 8:18 AM, the primary of a redundant pair of data servers for one of our customer data clusters locked up hard in Amazon. An operations engineer responded to a pager alert and confirmed that failover had worked as designed. The initial failover caused a brief period of probe delay on that cluster, but service was only briefly interrupted and the system then worked normally.

The failed server had to be hard rebooted. The reboot left its data corrupted, so the server had to be rebuilt and then set up to resync its data from the live server. A manual error was made during the rebuild, and replication was set up in an infinite loop. These syncs can take a long time, so it was left to run overnight. On the evening of 2/22 the operations engineer noticed it was taking an unusually long time and logged a ticket for the development engineers to look at the issue when they came in on Monday. Unfortunately, the operations engineer did not report this to our Support organization, which skipped a step in the incident communication process.
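A replication loop of this kind can in principle be caught before a resync starts. The following is a minimal sketch, not the check we run: it assumes the replication topology can be described as a mapping from each server to the server it replicates from, and the server names and the function find_replication_cycle are hypothetical.

```python
def find_replication_cycle(sources):
    """Return a list of servers forming a replication loop, or None.

    `sources` is a hypothetical description of the topology: it maps each
    server name to the server it replicates from (None for a primary).
    """
    for start in sources:
        seen = [start]
        node = sources.get(start)
        # Follow the "replicates from" chain until it ends or repeats.
        while node is not None:
            if node in seen:
                # The chain came back to a server already visited: a loop.
                return seen[seen.index(node):] + [node]
            seen.append(node)
            node = sources.get(node)
    return None


# Hypothetical misconfiguration: each server replicates from the other.
looped = {"data-a": "data-b", "data-b": "data-a"}
# Intended setup: the rebuilt server replicates from the live primary.
healthy = {"data-a": None, "data-b": "data-a"}

print(find_replication_cycle(looped))
print(find_replication_cycle(healthy))
```

Running a check like this as a gate before starting the resync would refuse the looped configuration while allowing the intended one.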

As the data sync continued to tax the system, probe data began to overwhelm the remaining data server on the evening of 2/22, and probes again began to update only intermittently for all customers at 5:40 CT. Support saw user reports of probes not updating and notified Engineering, who fixed the cluster configuration; the probe data backlog cleared by ~3 AM on 2/23. The affected cluster still had some intermittent loss of custom metrics data until 9:50 AM on the morning of 2/23.

What We Plan To Improve

The problem was detected immediately and our redundancy and failover worked to maintain service. The real outage was self-inflicted by an operator error, but once the right engineers were looking at the issue it was fixed promptly. We believe the major gaps this exposed were:

  1. Our incident communication process needs to ensure that Support is notified immediately, so that user problem reports can be evaluated against ongoing issues and our public status page can be updated promptly. In this case we did not post an update to the status page, even though an engineer knew about a potential issue, because the communication process wasn't followed. We will also ensure better weekend coverage (we don't have people on shift over the weekend, but we do have ways to notify people of off-hours issues).

  2. Our system's probe data feeds need to be reengineered so that they do not back up for all customers when one cluster's data server is slow. In this case the impact was widened to all customers when it could have been limited to a smaller subset.

  3. We will further automate the process of rebuilding one of our data servers so that a manual misconfiguration is not possible. Many of our servers are auto-configuring and auto-scaling, but there are still some manual steps in the process of rebuilding our core data servers. A mistake during this rebuild was the proximate cause of the outage.
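To illustrate the isolation described in item 2, here is a minimal sketch of one possible approach, not our actual design: each cluster gets its own bounded queue, so a slow data server only backs up its own cluster's feed. The class ClusterFeeds, its parameters, and the choice to shed the oldest samples when a backlog is full are all assumptions made for the example.

```python
from collections import deque


class ClusterFeeds:
    """Hypothetical per-cluster probe feed with a bounded backlog."""

    def __init__(self, max_backlog):
        self.max_backlog = max_backlog
        self.queues = {}   # cluster name -> deque of pending samples
        self.dropped = {}  # cluster name -> count of shed samples

    def enqueue(self, cluster, sample):
        q = self.queues.setdefault(cluster, deque())
        if len(q) >= self.max_backlog:
            # Assumption: shed the oldest sample rather than block producers.
            q.popleft()
            self.dropped[cluster] = self.dropped.get(cluster, 0) + 1
        q.append(sample)

    def drain(self, cluster, write, budget):
        """Write up to `budget` samples; stop at the first failed write."""
        q = self.queues.get(cluster)
        sent = 0
        while q and sent < budget:
            if not write(q[0]):
                break  # this cluster's server is slow; leave the rest queued
            q.popleft()
            sent += 1
        return sent


feeds = ClusterFeeds(max_backlog=3)
for i in range(5):
    feeds.enqueue("slow-cluster", i)
for i in range(2):
    feeds.enqueue("healthy-cluster", i)

# The slow cluster accepts nothing; the healthy cluster is unaffected.
slow_sent = feeds.drain("slow-cluster", lambda s: False, budget=10)
healthy_sent = feeds.drain("healthy-cluster", lambda s: True, budget=10)
print(slow_sent, healthy_sent, feeds.dropped)
```

The per-call budget keeps one cluster from monopolizing a shared worker. Whether to shed old samples or apply backpressure instead depends on how much data loss is acceptable for probe data.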

Thanks for reading our postmortem; please report any issues or ask any questions via CopperEgg Support (https://copperegg.zendesk.com).

Posted Mar 02, 2015 - 13:59 CST

Resolved
The issue has been resolved. It was caused by intermittent load on back-end servers, which created a long queue of data to be processed. This has now been rectified. If you notice any related issues, please contact support@copperegg.com
Posted Feb 23, 2015 - 04:36 CST
Monitoring
Probe monitoring data and graphs are working now. We are monitoring the situation.
Posted Feb 23, 2015 - 04:18 CST
Update
Custom Metrics may also be partially affected by this issue.
Posted Feb 23, 2015 - 03:54 CST
Investigating
Probe widgets are not polling data. Our Engineering team is investigating the issue. It is likely due to overload at the back-end. We will post an update soon.
Posted Feb 23, 2015 - 03:52 CST