Internet service outages happen. Try as we all might to prevent them, they still occur because the internet is an interdependent ecosystem of sophisticated services that sometimes affect each other in unexpected or unwanted ways.
On June 24, a number of important internet services such as Cloudflare and Amazon Web Services experienced outages, mostly for users on Verizon. The outage was caused by incorrect BGP advertisements from the Verizon network, which black-holed internet traffic and prevented users from reaching the services.
To be clear, the outages experienced by users of Cloudflare, Amazon Web Services, and other services were not the fault of the services themselves. Those platforms were fully functional for the users who could reach them. But that highlights a challenge with the internet: sometimes services are working but are not actually reachable by users.
Citrix ITM Radar is a worldwide community of websites and applications that collect information about the health and performance of internet services in real time. Radar collects about 10 billion health measurements from a billion users, which provides an unmatched level of visibility into users’ ability to reach the most important platforms hosting the internet’s services.
What Did Radar See?
First, let’s look at some of the services that were impacted. Citrix ITM Radar measures most of the popular platforms that web applications use to host their services, such as cloud regions, cloud storage, global CDNs, and private data centers.
We’ll start with Amazon Web Services. The AWS cloud regions were impacted to differing degrees based on the BGP configurations.
AWS US-East 1 Virginia had a large outage early in the incident. It recovered relatively quickly compared to the other AWS regions, but it continued to have reduced availability for the duration of the outage, from about 10:34 a.m. to 1:03 p.m. UTC.
AWS US-West 2 Oregon exhibited the same early behavior but also experienced an extended, large outage from 11:30 a.m. until 12:28 p.m. UTC.
AWS US-East 2 Ohio had an availability profile very similar to Oregon, with two large outages and lower availability from Verizon throughout the incident.
AWS CloudFront was never fully unavailable but still saw large availability drops on Verizon. It was likely partially available because it was also serving content out of AWS cloud regions that were not impacted by the BGP configuration errors.
In addition to Amazon Web Services, Cloudflare also experienced a large availability impact during the incident. Cloudflare sustained a larger availability drop throughout the incident than the other services.
Citrix ITM Radar measures the availability of these services from users on tens of thousands of networks around the world. That gives Radar the ability to see how platforms perform when they are reached from different access networks.
Citrix ITM can report on availability for each access network separately. The Cloudflare availability drop occurred mostly on Verizon networks, as would be expected based on the outage. The chart below shows that users on Comcast and AT&T (the blue and green lines) were unaffected, while users on Verizon networks (the pink and purple lines) were impacted.
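To make the per-network view concrete, here is a minimal sketch of how individual health measurements could be rolled up into availability per platform and access network. It is not Radar's actual pipeline: the record fields (`platform`, `network`, `success`), the function name, and the sample data are all assumptions made up for illustration.

```python
from collections import defaultdict


def availability_by_network(measurements):
    """Return {(platform, network): fraction of successful measurements}."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for m in measurements:
        key = (m["platform"], m["network"])
        totals[key] += 1
        if m["success"]:
            successes[key] += 1
    return {key: successes[key] / totals[key] for key in totals}


# Made-up sample records, for illustration only (not real Radar data).
sample = [
    {"platform": "cdn-a", "network": "Verizon", "success": False},
    {"platform": "cdn-a", "network": "Verizon", "success": True},
    {"platform": "cdn-a", "network": "Comcast", "success": True},
    {"platform": "cdn-a", "network": "Comcast", "success": True},
]
result = availability_by_network(sample)
```

Grouping by the (platform, network) pair is what lets a drop on one access network stand out even when the platform looks healthy in aggregate.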
Obviously, this situation is bad for everyone in the internet ecosystem: for the operators running these important internet platforms, for enterprises trying to keep their services available, and for users trying to access those applications.
What Can Be Done?
Now for the good news: Citrix ITM enables you to avoid this problem. Citrix ITM customers can automatically adjust their traffic management rules based on the Radar data. Users are sent to a data center, region, or CDN that is still reachable, so they are wholly unaware anything happened.
In this case, it means that users accessing a service from Verizon are sent to a different endpoint that is still reachable from that network.
To illustrate, here is the experience of an actual customer. Like many companies, this customer hosts their applications in the cloud and in private data centers, but we will focus on a service that is hosted in two AWS regions: US-East 1 in Virginia and US-East 2 in Ohio.
The service normally operates in an active/passive configuration in which traffic is sent to US-East 2 unless that endpoint has an outage.
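As a rough sketch, an active/passive rule of this kind could look like the following. This is not Citrix ITM's actual rule syntax or algorithm; the function name, the endpoint labels, and the 0.9 threshold are hypothetical values chosen for illustration.

```python
def choose_endpoint(availability, primary, standby, threshold=0.9):
    """Prefer the primary; use the standby only when the primary falls
    below the threshold and the standby is measurably better."""
    primary_avail = availability.get(primary, 0.0)
    standby_avail = availability.get(standby, 0.0)
    if primary_avail >= threshold:
        return primary
    return standby if standby_avail > primary_avail else primary


# Hypothetical availability figures (fraction of successful measurements).
normal = choose_endpoint({"ohio": 0.99, "virginia": 0.95}, "ohio", "virginia")
failover = choose_endpoint({"ohio": 0.40, "virginia": 0.80}, "ohio", "virginia")
both_bad = choose_endpoint({"ohio": 0.40, "virginia": 0.20}, "ohio", "virginia")
```

Staying on the primary when the standby is no better avoids moving traffic for no gain when both endpoints are degraded at once.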
As described above, Radar sees connectivity issues in AWS Virginia and Ohio for users on Verizon.
Region Available Chart (AWS US-East 1 Virginia is blue, AWS US-East 2 Ohio is green)
Citrix ITM Decision Chart
Here’s how that plays out:
The Virginia and Ohio regions both have an outage, recover relatively quickly to a more stable state, but continue to have reachability issues. (1, in the Region Available Chart)
At that point, user traffic starts being sent to Virginia from the states where Ohio is less available. (2, in the Citrix ITM Decision Chart)
About an hour later, AWS Ohio suffers a sustained larger outage. (3)
In most locations, Citrix ITM sends traffic to AWS Virginia, which is more available for most users in the United States. (4)
The outage in AWS Ohio ends and the region recovers quickly. Users routed to AWS Virginia experience some lingering impact from the outage as the BGP route advertisements are fixed. (5)
All traffic returns to AWS Ohio as normal. (6)
All of this detection and recovery happened automatically. The customer didn’t have to respond to a page or escalate to change any failover settings. Radar measured the issue from end users, and Citrix ITM automatically made the best choices to keep the service as available as possible, making traffic management decisions as granular as the state level on Verizon.
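The state-level granularity can be illustrated by applying the same kind of primary/standby comparison separately for each state on a single access network. The sketch below is an assumption-laden illustration, not Citrix ITM's implementation: the function name, the data layout, the threshold, and the availability figures are all invented.

```python
def decide_per_state(availability_by_state, primary, standby, threshold=0.9):
    """Return {state: chosen endpoint} for users on one access network."""
    decisions = {}
    for state, avail in availability_by_state.items():
        p = avail.get(primary, 0.0)
        s = avail.get(standby, 0.0)
        # Keep the primary while it is healthy or the standby is no better.
        decisions[state] = primary if p >= threshold or s <= p else standby
    return decisions


# Invented per-state availability for users on a single network.
verizon = {
    "NY": {"ohio": 0.50, "virginia": 0.85},
    "TX": {"ohio": 0.95, "virginia": 0.70},
}
decisions = decide_per_state(verizon, primary="ohio", standby="virginia")
```

With these figures, only the state where the primary region is degraded is moved to the standby; the other stays on the primary, which mirrors the partial shift described in step 2 above.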