Redirector connectivity issues

Incident Report for urllo

Postmortem

We’d like to shed more light on the responsiveness issues experienced by our redirection services on February 6th. Firstly, we’d like to sincerely apologize that this incident occurred. The performance of our services over this period does not reflect our goal of 100% uptime, and we recognize this affected our customers negatively. We hope the following information shows that we understand what happened, and that we have laid plans to reduce the likelihood of this occurring in the future.

Customer-impacted incident start time:

February 6, 2020 @ 20:01 MST (2020-02-07 @ 03:01 UTC)

Customer-impacted incident end time:

February 6, 2020 @ 20:33 MST (2020-02-07 @ 03:33 UTC)

Impact:

The redirection services on 54.68.182.72 and 34.213.106.51 responded to requests very slowly, and in some instances requests were dropped.

Root cause:

A distributed denial-of-service (DDoS) attack on a customer website.

Solution:

Provisioned enough capacity to handle the full load of the attack while processing all traffic within our typical response times.

Background

The EasyRedir redirection services are hosted on AWS across multiple availability zones (AZ) within the US-West-2 region. There is an AWS Network Load Balancer (NLB) that has an interface in each AZ, and a fleet of EC2 instances in each AZ that actually processes the redirection requests. This architecture has proven to be highly reliable and easily scales to very high traffic levels.

Incident

On February 6, 2020 we received alerts from our monitoring tools of high loads on our redirection servers. We immediately began an investigation and determined the servers were receiving vastly higher traffic levels than we typically process at any given time. At peak, our servers were processing 44x our typical traffic levels. It’s important to note that although our systems were loaded much higher than is typical, we were still responding to this traffic within our typical response times.

Our systems have a variety of tools at their disposal to mitigate attacks from bad actors. Our AWS NLB has a variety of DDoS mitigation functions built into it (which typically operate at the IP or TCP layers of the network stack). Our redirection servers also have a variety of tools to handle this level of traffic (highly tuned Linux kernel parameters, iptables-based IP blocking, request and connection limits built into the web server configuration, and crucially, a carefully constructed series of RAM-based caches that cache redirect configuration information).
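To make the idea of per-client request limits concrete, here is a minimal sketch of one common technique, a token bucket keyed by client IP. It is illustrative only: the report does not say which algorithm the web server uses, and the names TokenBucket and PerClientLimiter, as well as the rate and burst values, are hypothetical.

```python
from typing import Dict, Optional


class TokenBucket:
    """Allows short bursts up to `capacity`, then `rate` requests per second."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity        # start with a full bucket
        self.updated_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        # Refill according to the time elapsed since the previous request.
        if self.updated_at is not None:
            elapsed = max(0.0, now - self.updated_at)
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerClientLimiter:
    """Keeps one bucket per client IP (hypothetical helper for illustration)."""

    def __init__(self, rate: float, capacity: float) -> None:
        self._rate = rate
        self._capacity = capacity
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, client_ip: str, now: float) -> bool:
        bucket = self._buckets.get(client_ip)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity)
            self._buckets[client_ip] = bucket
        return bucket.allow(now)
```

The timestamp is passed in by the caller (for example from time.monotonic()) so the behaviour can be checked deterministically. A limiter like this only helps against clients that present a stable address, which is one reason spoofed source IPs complicate mitigation.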

The customer-visible impact stemmed from our decision to make a web server configuration change intended to block this traffic at an earlier point in our processing pipeline. This change required a reload of our server configurations. What was not fully understood at that time was the degree to which our caches were contributing to our low (and typical) response times. When each server configuration was reloaded, its cache was cleared. This had a knock-on effect throughout our processing pipeline: connections to backing cache servers had to be reestablished, and RAM caches had to be rebuilt. It was this action that started the customer-visible incident, as our systems struggled to respond to client requests in a timely manner.
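The cold-cache effect described above can be sketched in a few lines. The example below is a minimal illustration, not the production implementation: SingleFlightCache and its loader argument are hypothetical names, and the loader stands in for a lookup against a backing cache server. When the in-memory cache is empty, concurrent requests for the same key are coalesced so that only one of them reaches the backing store.

```python
import threading
from typing import Callable, Dict


class SingleFlightCache:
    """In-memory read-through cache that coalesces concurrent misses per key."""

    def __init__(self, loader: Callable[[str], str]) -> None:
        self._loader = loader                      # hypothetical backing-store lookup
        self._values: Dict[str, str] = {}
        self._key_locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()             # protects both dictionaries

    def get(self, key: str) -> str:
        with self._guard:
            if key in self._values:
                return self._values[key]           # warm path: no backing-store call
            key_lock = self._key_locks.setdefault(key, threading.Lock())
        # Cold path: only one thread per key loads; the others wait here.
        with key_lock:
            with self._guard:
                if key in self._values:
                    return self._values[key]       # filled while we were waiting
            value = self._loader(key)
            with self._guard:
                self._values[key] = value
            return value

    def clear(self) -> None:
        """Simulates what a configuration reload did to the RAM cache."""
        with self._guard:
            self._values.clear()
```

Without the per-key lock, every request that arrives while the cache is empty would call the loader, which is the kind of load spike on the backing datastore that a reload under heavy traffic can produce.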

We immediately began to provision additional EC2 instances and added them to the NLB. Once this capacity started to come online, response times started to drop back down towards normal levels. Normal response times and traffic processing capabilities were fully restored 32 minutes into the customer-visible incident. It’s important to note that during this time, many requests were serviced successfully, albeit at times much more slowly than we typically process a request.

Resolution

The redirection services were fully restored within 32 minutes of the start of the customer-visible event.

Corrective Action

This failure was regrettable on both a corporate and personal level. The decision to initiate the actions that led to this incident was taken by our staff; this was not a failure of our architecture or technology. This has been felt personally by us, and we are sincerely sorry.

We have already taken a number of actions as a result of this incident, and plan to take many more in the days to come.

Actions already taken include:

Provisioned additional “standby” capacity that is ready to be activated at a moment’s notice
Added system monitors to detect anomalies in our traffic levels before they would trigger a “high load” monitor
Changed some system/server configuration parameters to be more aggressive towards limiting bad actors
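As a rough illustration of the traffic-anomaly monitoring mentioned in the list above, the sketch below flags a request-rate sample when it exceeds a multiple of a rolling median baseline. It is an assumption-laden example rather than a description of the monitors actually deployed: the class name TrafficAnomalyMonitor, the window size, and the 3x threshold are all hypothetical.

```python
from collections import deque
from statistics import median
from typing import Deque


class TrafficAnomalyMonitor:
    """Flags request rates far above a rolling median of recent normal traffic."""

    def __init__(self, window: int = 60, ratio_threshold: float = 3.0,
                 min_samples: int = 10) -> None:
        self._samples: Deque[float] = deque(maxlen=window)
        self._ratio_threshold = ratio_threshold    # hypothetical alerting multiple
        self._min_samples = min_samples            # wait for a baseline before alerting

    def observe(self, requests_per_second: float) -> bool:
        """Record one sample and return True if it looks anomalous."""
        anomalous = False
        if len(self._samples) >= self._min_samples:
            baseline = median(self._samples)
            anomalous = (baseline > 0
                         and requests_per_second > self._ratio_threshold * baseline)
        if not anomalous:
            # Keep anomalous samples out of the baseline so an ongoing
            # attack does not gradually come to look normal.
            self._samples.append(requests_per_second)
        return anomalous
```

A monitor based on traffic volume can fire while servers are still responding within typical times, which is earlier than a monitor that waits for high system load.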

Actions which will be taken over the coming days:

Investigate using proxy technology on each redirection instance to prevent process restarts from crushing caching datastores
Investigate how to detect and ignore spoofed IP addresses
Tune logging related to request and connection limits
Investigate whether using instances with greater network capacity would be helpful for our caching datastores

Posted Feb 07, 2020 - 16:50 MST

Resolved

We're satisfied the issues impacting the connectivity of the redirection clusters have been resolved. We'll continue to closely monitor this situation and will perform a root-cause analysis to determine steps to help prevent a disruption like this in the future.

Posted Feb 06, 2020 - 21:17 MST

Monitoring

The additional capacity we've provisioned has resolved the issues with response times. We're currently monitoring the situation to ensure systems are performing as expected.

Posted Feb 06, 2020 - 20:53 MST

Update

We are provisioning more server capacity to reduce response times. We will post further updates shortly.

Posted Feb 06, 2020 - 20:27 MST

Update

We are continuing to investigate this issue.

Posted Feb 06, 2020 - 20:18 MST

Investigating

We are currently investigating connectivity and responsiveness issues related to our redirection cluster. We will report further information as it becomes available.

Posted Feb 06, 2020 - 20:17 MST

This incident affected: Redirection Services (US West Edge (HTTP traffic), US West Edge (HTTPS traffic)).