Impact: Data collection was stopped and UI Dashboard was not responsive
Resolved
Root cause:

Redis-master crashed due to an underlying hardware issue with our hosting platform (AWS). The failure recovery script was triggered when Redis-master's failure was detected. But, the recovery script works in such a way that it is checking for the status of Redis master and whenever master gets down or becomes inaccessible then it makes a connection to Jenkins server and executes the REDIS_FAILOVER_TASK, which in turn ensures that a new Redis master is created and it becomes accessible again. The issue was in making a connection to the Jenkins server because the credentials used to connect to Jenkins had expired.

Once the team was aware of the issue, we manually ran the Failover task and we had to manually create extra processing servers to make sure that the pending queues were processed.

Preventive Measures:

Early alerting: We have added more alerts in our internal monitoring system to notify us immediately as soon as possible for similar issues.
Ensuring Auto-Recovery: We have also upgraded our Jenkins credentials in all the recovery scripts and we are looking and removing the expiry time limits for Jenkins credentials.
Posted Apr 25, 2020 - 20:00 CDT