API Issues
Postmortem

We had a widespread intermittent outage of many of CopperEgg's services between 11/19/2014 and 12/01/2014. We are very sorry that we interrupted our service to you during this time period; we know that you rely on our monitoring for your systems' health. We'd like to share some of the details about what happened and how we intend to prevent this and similar problems from recurring. This is the largest service disruption we've had at CopperEgg, and it was the result of a "perfect storm" of multiple issues coming together; we are working on fixing all of them and want to be transparent with you about how this happened.

What Happened

On November 19 at 2:30 PM CT, our custom metrics API (api.copperegg.com), which is used by custom metrics agents, our mobile apps, and the collector installer, began to have performance issues and timeouts. We detected the issue immediately and had a notice posted on the status page by 2:39 PM. We updated the status page roughly twice a day throughout the service disruption as there were new developments.

This same API slowdown had happened recently for brief periods on Nov 6 and Nov 17 and was already under investigation in engineering. In those previous cases, taking the servers out of the load balancer in turn and letting them "cool down" restored performance. In this case, it did not. Our operations personnel tried to mitigate the issue and it would clear for ~20 minutes at a time, but then it would come back. The load on the API servers began to get very high, because most consumers of the API retry failed requests; as a result, when the API has a performance-oriented failure, its load doubles and triples in short order as the number of requests whose retry interval is shorter than the API's current response time continues to grow. Our operations engineers started adding API capacity by spinning up additional servers in Amazon to handle the additional load.
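
This retry amplification is why client-side backoff matters. As a minimal sketch (in Python, and not our actual agent code), a metrics sender can use capped exponential backoff with jitter so retries spread out instead of piling on when the API slows down; the endpoint URL and payload handling below are placeholders, not documented paths:

    import random
    import time

    import requests

    API_URL = "https://api.copperegg.com/v2/samples"  # placeholder endpoint, not a documented path

    def post_sample(payload, max_attempts=6, base_delay=2.0, max_delay=120.0):
        """Post one metrics sample, backing off exponentially (with jitter) on failure."""
        for attempt in range(max_attempts):
            try:
                resp = requests.post(API_URL, json=payload, timeout=10)
                if resp.status_code < 500:
                    return resp  # success or a client error; only retry on 5xx and network errors
            except requests.RequestException:
                pass  # timeout or connection error; fall through to the backoff sleep
            # Sleep up to base_delay * 2^attempt (capped), with full jitter so that many
            # agents retrying at once do not synchronize into a thundering herd.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
        return None  # give up for now; a real agent would queue the sample and retry later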

At 5:30 PM the API had degraded enough to cause false alarms to be sent to users. The API is where the Server:Not Seen and Server:Health metrics are sourced from, and alerts that depend on those two metrics fired. Overnight, our overseas engineers tried building an entirely new API server farm (new ELB, new larger servers) on the hypothesis that something was wrong with our ELB or existing servers, or that the total memory capacity of the existing servers was simply not enough to process through the current backlog. That did not help.

At 8:43 AM on 11/20, probe stations began to go offline as well. Our probers use the API to push in data, but they also use it to get updated information on what they should be probing; if they can't get fresh information for half a day, they hibernate until they can.

On 11/20 we escalated the service disruption because of its duration and severity and brought in additional specialist consultants to help with the issue, over and above our usual platform engineering team. All of these engineers worked on the issue on 11/21 and tried various things to correct the problem, but server monitoring and RUM also became unavailable for what were, at the time, unknown reasons. At this point we shut off alert SMS notifications because we decided we'd be sending more false alerts than accurate ones.

On 11/21 the engineering team narrowed the problem down to bad calls to the old v1 custom metrics API. We decided to temporarily block v1 API traffic entirely to restore service to the system. This worked, but the problems with server monitoring persisted. We discovered that as the FastAPI servers had been autoscaling (data ingestion from collectors goes through a different path than our usual API, the "FastAPI"), the new servers were not being added correctly to the FastAPI round robin DNS entry. We updated the entries manually and began searching for the root cause of that issue. Then the third issue hit: some of our AWS instances came off reservation, causing us to hit our on-demand instance cap in AWS East. This prevented our autoscaling from working and prevented us from launching instances manually. The API farm was unbalanced across Availability Zones at this time, and Elastic Load Balancers send equal load to each AZ regardless of how many servers are running there; this caused additional performance issues because some servers were overloaded while others were not. It took some time to figure out what was happening, because the AWS Trusted Advisor information that tells you whether you've hit your various limits is not updated in real time; in fact, it takes 4+ hours to update. The combination of the initial API issue, the DNS issue, and the autoscaling issue made an environment where it was difficult for engineers to figure out why a given piece of functionality wasn't working. Regardless, over the course of the day we got the v2 API and server monitoring working again. Probes were working again but had large amounts of backed-up data to process through.
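
As an aside on the per-AZ behavior described above: it is the default for classic Elastic Load Balancers, and one standard mitigation (not necessarily the one we applied at the time) is to enable cross-zone load balancing so traffic is spread across all registered instances rather than split evenly by zone. A minimal sketch using boto3, with a placeholder load balancer name:

    import boto3

    ELB_NAME = "api-farm-elb"  # placeholder; the real API farm's ELB name is not part of this writeup

    elb = boto3.client("elb", region_name="us-east-1")

    # Enable cross-zone load balancing so requests are distributed across all registered
    # instances instead of being divided evenly per Availability Zone.
    elb.modify_load_balancer_attributes(
        LoadBalancerName=ELB_NAME,
        LoadBalancerAttributes={"CrossZoneLoadBalancing": {"Enabled": True}},
    )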

On the weekend of 11/22-11/23 we got the AWS scaling cap issue fixed with the help of our AWS sales rep, processed through the probe backlogs, and continued to manually update the FastAPI DNS to keep it online for server monitoring. Most services were functioning normally over this weekend except for the v1 API (and alert notifications were still turned off).

On Monday 11/24 the situation degraded again. Our redis servers' replication became significantly backlogged by all of the pent-up prober and server data now flowing into the system, and probes and server monitoring went out again around 8 AM. A change to the redis servers' client-output-buffer-limit allowed us to clear the backlog and restore services other than the v1 API and notifications around 11 AM. Various services failed again that evening and were worked on overnight.
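
For reference, the replica output buffer limit mentioned above is a standard redis setting that can be changed at runtime with CONFIG SET; the values below are illustrative only, not the ones we applied. A sketch using the redis-py client:

    import redis

    r = redis.StrictRedis(host="localhost", port=6379)  # placeholder connection details

    # Raise the replica ("slave" in the redis 2.x terminology of the time) client output
    # buffer limit so a large replication backlog does not cause the master to drop the
    # replica connection. Format: <class> <hard-limit> <soft-limit> <soft-seconds>.
    r.config_set("client-output-buffer-limit", "slave 1gb 512mb 120")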

On Tuesday 11/25 the root cause of the DNS issue was finally identified. A symbolic link had disappeared from a system, causing a script that dynamically updated the FastAPI DNS entries to behave incorrectly. Services all eventually came back online and were stable over the course of the day, except for the v1 API and alerting (probes were up but backlogged) and several specific probe stations.

On Wednesday 11/26 one of our NestDB server pairs had a completely unrelated issue that took probes down for a couple of hours; engineering mitigated this by swapping over to the working server (automated swapover exists but failed) and worked on fixing it behind the scenes over the next couple of days. Services were all brought back up except for the v1 API; we also still had alerting turned off until we could be sure we would be sending good alerts.

On Thursday 11/27 (Thanksgiving), we were stable. In the background, engineers cleaned up the general chaos from the outage, which had left a lot of cruft in logs, DNS, our monitoring system, etc.

On Friday 11/28 we turned the v1 API back on and remained stable. We declared all services working, but decided to wait until Monday (many people were out for the Thanksgiving holiday) to tell customers all-clear and re-enable alert notifications.

On Monday 12/1 we told customers all-clear and re-enabled SMS notifications.

The overall period where "something" was wrong was 12 days, but services other than the v1 custom metrics API were affected on only a subset of those days. Data was not captured from servers and probes during the worst parts of the disruption; exact times with missing data will vary by customer and server/probe, but that is primarily parts of the 19th-21st, 24th, and 26th for servers and parts of the 20th-22nd and 26th for probes. AWS monitoring was not disrupted at all, and I don't see any missing v2 API custom data from the period on my own custom metrics agent. (CopperEgg collectors and agents are pretty good about queueing up data for a couple of hours and then sending it when they get a chance, so in periods of intermittent service like this you may not see data in real time, but it will backfill historically fairly reliably.) Direct use of the v1 custom metrics API would not have logged data for the majority of the 19th-28th.

What We Plan To Improve

Though the old v1 custom metrics API was what provoked these issues, a "perfect storm" of three or four unrelated issues during this time period made the service disruption more severe and prolonged. As a result, we have fixes planned in several areas.

  1. The custom metrics API itself. We are working on code for better API validation and throttling, on obsoleting the v1 API entirely, and on better inspection around the API (most notably log analysis) so that we can identify problematic requests quickly and block them, or make the code resilient in the face of them (see the throttling sketch after this list). There are also infrastructure changes and upgrades we can make to limit request queue depth so that request overloads do not turn into outages. In the short term we are also reaching out to specific customers we see using the v1 API heavily and generating long-running queries on it, to see if there's a different way of accomplishing their goals. If you are using the v1 API, please consider switching over to the new v2 API, as we will be deprecating the v1 API shortly.

  2. The DNS updating mechanism. As many of you know, CopperEgg went through an ownership transition this year and, as part of that, switched out a lot of its engineering team members. The DNS updating was an undocumented part of the system that was running in a place that was hard to discover, and we were not monitoring that every box's DNS entries were set correctly. As a result it took a while to locate; once it was located, determining the issue and fixing it was quick. We have been focused on documenting the system over the last couple of months because we know there are still some remaining "blind spots" like this; unfortunately, this problem occurred before we got through all the DevOps pieces. Besides better documentation and monitoring of these operational processes (see the DNS reconciliation sketch after this list), we are embarking on a wider-scale reengineering of our systems: they are partially under automated configuration management and AWS autoscaling, but not completely, and some of the support processes (like the DNS script) were written quickly by a startup in support of a growing system and need to be rewritten to be robust and to effectively support a more mature, larger system with larger customers. One of the big takeaways from this is the truism that "infrastructure is code": if your operational code isn't as well documented and as robust as your application code, you need to invest more in it. And monitoring of scripts and cron jobs is as important as monitoring your user-facing services.

  3. The AWS scaling limitation. We were monitoring this metric, but we didn't know it took so long to update. We have therefore set our thresholds so that we review capacity while we are still well away from any limits, and we will work with Amazon to keep those limits well above our current usage, not just 20% above it. (A sketch of checking the instance limit directly, rather than waiting on Trusted Advisor, also follows this list.)

  4. An outage of this magnitude doesn't happen without many other small flaws lining up to help cause, prolong, or disguise it. We discovered a host of other small issues that surfaced as impediments once large parts of the system were in trouble: one problem masked another, a symptom of problem 1 didn't read right because of problem 2, et cetera. For example, we have a separate CopperEgg installation that we use to monitor CopperEgg's own servers, affectionately called "watchdog," but the DNS updating issues meant that our automatically added server probes did not work there, so it was hard to determine whether a fix on the API servers had worked, and validation took longer than it should have. Similarly, with so many errors being encountered by our servers, log files grew much larger than usual and some of our logrotate policies couldn't keep up with the new rate of logging, causing server outages from full disks. All of these issues have been added to our work backlog and prioritized by severity. The ones around issuing false alarms during service issues are the highest priority and are in work now.
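
As referenced in item 1, here is a minimal sketch of per-key request throttling. It is illustrative only, not our actual implementation; the limit, window, and key scheme are placeholders, and it uses redis (which we already run) as the shared counter store:

    import time

    import redis

    r = redis.StrictRedis(host="localhost", port=6379)  # placeholder connection details

    def allow_request(api_key, limit=300, window_seconds=60):
        """Fixed-window throttle: allow at most `limit` requests per API key per window."""
        window = int(time.time() // window_seconds)
        counter_key = "throttle:%s:%d" % (api_key, window)
        pipe = r.pipeline()
        pipe.incr(counter_key)                        # count this request
        pipe.expire(counter_key, window_seconds * 2)  # old windows expire on their own
        count, _ = pipe.execute()
        return count <= limit  # False means the API should answer with HTTP 429

Requests that come back False get rejected immediately with a 429 instead of sitting in a queue, which is what keeps an overload from turning into a backlog.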
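
As referenced in item 2, here is a sketch of the kind of reconciliation check that would have caught the FastAPI DNS drift: compare the instances currently in the autoscaling group against the values published in the round robin DNS entry and alert on any difference. The group name, hosted zone ID, and record name below are placeholders, and this is not the script that actually broke:

    import boto3

    # Placeholder identifiers; the real names are not part of this writeup.
    ASG_NAME = "fastapi-asg"
    HOSTED_ZONE_ID = "Z123EXAMPLE"
    RECORD_NAME = "fastapi.example.com."

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")
    ec2 = boto3.client("ec2", region_name="us-east-1")
    route53 = boto3.client("route53")

    # IPs of the instances currently in the autoscaling group
    # (private IPs here; use public IPs if that is what the record holds).
    group = autoscaling.describe_auto_scaling_groups(AutoScalingGroupNames=[ASG_NAME])
    instance_ids = [i["InstanceId"] for i in group["AutoScalingGroups"][0]["Instances"]]
    reservations = []
    if instance_ids:
        reservations = ec2.describe_instances(InstanceIds=instance_ids)["Reservations"]
    asg_ips = {inst["PrivateIpAddress"] for res in reservations for inst in res["Instances"]}

    # IPs currently published in the round robin A record.
    record_sets = route53.list_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        StartRecordName=RECORD_NAME,
        StartRecordType="A",
        MaxItems="1",
    )["ResourceRecordSets"]
    dns_ips = {
        rr["Value"]
        for rec in record_sets
        if rec["Name"] == RECORD_NAME
        for rr in rec.get("ResourceRecords", [])
    }

    # Any difference means the DNS entry has drifted from reality and should page someone.
    if asg_ips != dns_ips:
        print("DNS drift: in ASG but not DNS:", asg_ips - dns_ips,
              "; in DNS but not ASG:", dns_ips - asg_ips)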
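
As referenced in item 3, here is a sketch of checking instance-count headroom directly against the account limit instead of waiting on Trusted Advisor. It assumes the legacy per-region instance limit exposed through the max-instances account attribute, and the 80% threshold is illustrative:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # The account's on-demand instance limit, read directly from the EC2 API rather than
    # from Trusted Advisor (whose limit checks can lag by several hours).
    attrs = ec2.describe_account_attributes(AttributeNames=["max-instances"])
    max_instances = int(attrs["AccountAttributes"][0]["AttributeValues"][0]["AttributeValue"])

    # Count instances currently running in the region.
    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(Filters=[{"Name": "instance-state-name", "Values": ["running"]}])
    running = sum(len(res["Instances"]) for page in pages for res in page["Reservations"])

    # Alert while there is still headroom, well before autoscaling gets blocked by the cap.
    if running > 0.8 * max_instances:
        print("EC2 usage %d/%d is above 80%% of the account limit" % (running, max_instances))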

Again, we apologize for the disruption to our services. It definitely ruined our Thanksgiving weekend and we know it did the same to some of you. We have switched our product development roadmap over to pure stability issues for the next quarter and have invested heavily in bringing in expert consultants to significantly reengineer CopperEgg's infrastructure and automation over that period.

Posted Dec 22, 2014 - 12:40 CST

Resolved
Alerting has been re-enabled. All CopperEgg functionality has been restored. We apologize for the lengthy service interruption and now that we're stable we'll move on to further addressing customer concerns.
Posted Dec 01, 2014 - 08:04 CST
Update
At this time the Server monitoring, Probe monitoring, RUM, Custom Metrics, and CopperEgg API components are all operational. Alerting may be restricted to email at this time. We will monitor the service and, if we determine that we will not be sending spurious messages, we will also re-enable SMS alerting today.
Posted Dec 01, 2014 - 04:39 CST
Update
We now have a partial restoration of the alerting mechanism. We are continuing to monitor the situation and will update again once we have more information to pass along.
Posted Nov 26, 2014 - 15:44 CST
Monitoring
We have restored probe functionality at this time. We are still reviewing the alerting configuration to ensure as much as possible that we will not be sending duplicate or false alerts when the service is re-enabled. We will update again when we have turned the alerting mechanism back on.
Posted Nov 26, 2014 - 11:27 CST
Update
We have been alerted that the probes are currently not reporting any data. Our engineers are currently investigating the cause of the probe outage.
Posted Nov 26, 2014 - 07:39 CST
Identified
An additional detailed update on CopperEgg service status. Server monitoring is fully working. The v2 API is working, but we still have the v1 API partially shut off. RUM is having some issues. All probes but Dallas, Tokyo, and Sao Paulo are functioning correctly. Alerting is still shut off until we have full stability. Sorry about the length of the outage; we have many engineers working furiously on the issue.
Posted Nov 25, 2014 - 10:54 CST
Monitoring
The Server monitoring, Probe monitoring, RUM, Custom Metrics, and CopperEgg API components are operational now. We are monitoring the situation.
Posted Nov 25, 2014 - 06:41 CST
Update
CopperEgg probes, Web Apps (RUM) and server statistics are down again. The v1 API and alerting are also down. The customer UI, AWS stats, and the v2 API are working. Our engineering team has brought in some specialists and continues to work on the stability issues.
Posted Nov 24, 2014 - 22:21 CST
Update
The Server and Probe monitoring components are down again.
Posted Nov 24, 2014 - 06:38 CST
Update
All CopperEgg probe stations have been restored to service except for Fremont and Sao Paulo.
Posted Nov 23, 2014 - 10:19 CST
Update
Today we've been working on making server stats more stable. Next we will work on restoring probers. RUM, AWS, and v2 API should be operable.
Posted Nov 22, 2014 - 23:58 CST
Update
Server data and RUM data are now operational. The v2 API endpoint is also working. We are still working on restoring probes and the rest of the API. Alerting is still disabled.
Posted Nov 21, 2014 - 23:28 CST
Update
We have had some success in restoring server monitoring and the v2 version of the custom API (has /v2 in the URL). We have disabled the v1 custom API (has /2011-04-26 in the URL) and alerting until we are more stable.
Posted Nov 21, 2014 - 16:56 CST
Update
CopperEgg services are still down. We have brought in additional resources to help work the issue and we have a group of engineers working continuously on the issue. We apologize for the long term outage and will give more information as soon as we have it. Thank you for your understanding.
Posted Nov 21, 2014 - 10:43 CST
Update
The CopperEgg API is down. This affects the ability to put, get, and alert on custom metrics; display and alerting based on server health checks and Not Seen times; probes; and alerting in general. Our engineering team has rebuilt some of our infrastructure, but that has not resolved the problem; more help is being brought in to resolve the issue.
Posted Nov 20, 2014 - 18:41 CST
Update
The CopperEgg custom metrics API is still unstable and many Server: Not Seen alerts are misfiring (the Not Seen metric specifically comes from that API). Engineering is continuing to work on the problem and is trying everything it can to restore service to you. We certainly apologize for this unusually long outage. If you are experiencing false alarms, consider using alerts other than Not Seen to check your servers' uptime in the interim.
Posted Nov 20, 2014 - 08:28 CST
Update
The custom metrics API is still down and some users may be receiving spurious alerts for Server: Not Seen. Debugging is still underway; routine attempts at restoring the service have not worked. You may want to mute your Not Seen alerts for the time being.
Posted Nov 19, 2014 - 18:13 CST
Investigating
The CopperEgg Custom Metrics API is having issues again as of 2:33 PM CT. Our technical team is responding.
Posted Nov 19, 2014 - 14:39 CST