My sister sent me a grumpy email this morning. It appears that within the family, any web site outage is now my fault, much like she takes the heat for bad behavior by attorneys.
Let’s get the cynics out of the way first: Black Friday and Cyber Monday are soon upon us, so it seems worth noting that as of 2008, ecommerce is clocking in at a healthy $141B/yr out of a total of $2.6T/yr, or about 5.4% (not including food and cars). At that rate, it’s easy to see that the dollars are real and that their impact on business is significant. It shouldn’t be a surprise really – even traditional brick and mortar outlets like Barnes & Noble recently announced that around 33% of their revenue came from ecommerce and related transactions.
Having had most of the large ecommerce sites as customers over the years, we’ve heard firsthand what the web means to business scalability and of course, the flip side of the equation, what outages mean in terms of opportunity cost. At a brick and mortar location, people are willing to stand in line and wait to get in, turning the whole experience into an almost social event in itself. (Hey, it’s better than second guessing how long the line is somewhere else.) But when a few clicks let a shopper bounce between your web site and your competition’s, site availability and experience matter.
What it comes down to are a few key metrics:
1. The First Response
The first page of a web site sets the tone of a session for a user. When it takes a long time, everything feels like it takes a long time. To help manage that, properties that have invested appropriately in their infrastructure often go out of their way to dedicate some of it to serving common page elements quickly. The underlying technologies vary based on need but include web caches, CDNs, and dedicated servers for static content. The bottom line is that each second saved is revenue made. You’ll notice this as a common characteristic of all top revenue generating web sites: the first page always loads fast. Many of our NetScaler customers leverage content switching policies to make sure that content that should be delivered quickly is routed to special groups of servers that take care of that task, and that the overall overhead for processing these requests is minimized.
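As a rough illustration of the content switching idea, the sketch below routes requests for static content to a dedicated pool and everything else to the application servers. It is not NetScaler policy syntax; the pool names, the "/static/" prefix, and the list of file suffixes are assumptions made up for the example.

```python
# Hypothetical suffixes and pool names, chosen only for illustration.
STATIC_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".ico")

POOLS = {
    "static": ["static-1", "static-2"],         # servers tuned for fast static delivery
    "dynamic": ["app-1", "app-2", "app-3"],     # servers that build personalized pages
}


def choose_pool(path: str) -> str:
    """Return the name of the pool that should serve this URL path."""
    # Ignore any query string when looking at the file suffix.
    bare_path = path.split("?", 1)[0].lower()
    if bare_path.startswith("/static/") or bare_path.endswith(STATIC_SUFFIXES):
        return "static"
    return "dynamic"


if __name__ == "__main__":
    for url in ("/static/logo.png", "/theme/site.css?v=2", "/cart/checkout?item=3"):
        print(url, "->", POOLS[choose_pool(url)])
```

The point of the split is that the common page elements never compete with expensive page generation for the same servers.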
In case you were wondering – Google does consider your site’s page load times as part of its search ranking criteria. A faster site with less content can very easily beat out a slower site with more content.
2. Site Availability
Site availability has two key dimensions: coping with failure events and managing unnatural spikes in load. To understand the first (failure events), you have to accept that failure can and will happen. Software updates go wrong, people trip over power cables, and sometimes people hit the big red fire emergency button in the datacenter by accident, which kills power to everything in one go. You can put as many procedural and technological pieces in play as you like to derisk a given threat to site availability, but removing the possibility of failure is guaranteed to, well… fail.
Outages caused by unnatural spikes in load have a special place in Hell where everyone in the company gets to one day vent their frustration at a missed opportunity. Customers that you may have converted into life-long buyers disappear. The revenue from the moment goes away. The list goes on.
Solving for the first problem (failures) means creating a system that transparently routes traffic around problems. A lot of tier-1 ecommerce customers use sophisticated server health checks that validate not only that the servers are working, but that all of their dependencies (e.g., databases, etc.) are also working; servers that fail these checks are taken out of the load balancing pool. Datacenter level failures (“the big red button”) are addressed with DNS based global server load balancing (GSLB), where another datacenter takes over live traffic by responding to DNS requests with another site’s IP addresses. So when you hear that “so and so has n datacenters” where n is greater than 1, know that this probably means they are using GSLB to manage site availability.
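To make the two layers concrete, here is a minimal sketch, assuming each server has a list of probes (one for the server itself, one per dependency) and each datacenter has one public address. The function names, probe shapes, and addresses are all hypothetical; real health monitors and GSLB add timeouts, retry counts, and DNS TTL handling that are left out here.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A probe returns True when the thing it checks (server, database, ...) is healthy.
Probe = Callable[[], bool]


def is_healthy(probes: List[Probe]) -> bool:
    """A server counts as healthy only if every one of its probes passes."""
    for probe in probes:
        try:
            if not probe():
                return False
        except Exception:
            # A probe that raises (timeout, refused connection) is a failure.
            return False
    return True


def active_pool(servers: Dict[str, List[Probe]]) -> List[str]:
    """Return the servers that should stay in the load balancing pool."""
    return [name for name, probes in servers.items() if is_healthy(probes)]


def dns_answer(datacenters: List[Tuple[str, Dict[str, List[Probe]]]]) -> Optional[str]:
    """GSLB-style answer: the address of the first datacenter with healthy servers."""
    for address, servers in datacenters:
        if active_pool(servers):
            return address
    return None  # every datacenter is down
```

A server whose web process is up but whose database probe fails drops out of `active_pool`, and a datacenter with no healthy servers left is simply skipped when answering DNS queries.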
Handling load spikes is much more subtle and tricky. Our customers use a technology called “SurgeQueue” which defines an upper bound on the traffic that a server can accept before it fails. If traffic exceeds that level, the excess is temporarily queued on the NetScaler until a server frees up. This keeps the servers themselves from crashing under a load spike (which is typically the cause of a site outage).
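The queueing idea can be sketched in a few lines. This is not how SurgeQueue is actually implemented; it is a simplified, single-server model in which the class name and the `max_active` cap are assumptions chosen for illustration.

```python
from collections import deque
from typing import Optional


class SurgeQueueSketch:
    """Cap the requests in flight to one server and hold the excess in a queue."""

    def __init__(self, max_active: int):
        self.max_active = max_active   # assumed upper bound the server can handle
        self.active = 0                # requests currently being served
        self.waiting = deque()         # requests held back at the load balancer

    def submit(self, request_id: str) -> str:
        """Forward the request if the server has room, otherwise queue it."""
        if self.active < self.max_active:
            self.active += 1
            return "forwarded"
        self.waiting.append(request_id)
        return "queued"

    def complete(self) -> Optional[str]:
        """Call when the server finishes a request; returns the request released, if any."""
        if self.active == 0:
            raise RuntimeError("no request is in flight")
        self.active -= 1
        if self.waiting:
            self.active += 1
            return self.waiting.popleft()  # oldest waiting request goes first
        return None
```

With `max_active=2`, a third simultaneous request waits briefly instead of landing on a server that is already at its limit, so the user sees a short delay rather than an error.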
To provide some perspective, prior to 2004, Weather.com would suffer site outages when too many high profile weather events happened at once. This was sufficiently rare that they grudgingly accepted the lost revenue. 2004 changed that – a significant number of hurricanes were hitting the US East Coast back to back, and the lost revenue added up. They had to fix this issue quickly as more hurricanes were predicted. Prior to the four back to back hurricanes of August, 2004, they did an overnight move to NetScaler with SurgeQueue. When the hurricanes came, they saw site load burst, but because the servers did not crash, end users only experienced a momentary slowdown (usually 1-2 seconds at most). Compared to before, this was a night and day difference since no clicks were lost. Without SurgeQueue, they estimated that the site would have crashed many times, each crash destroying a significant amount of revenue for the month.
3. Ongoing Application Performance
After that first page loads quickly, getting subsequent pages to load usually becomes harder. The premise is that the subsequent pages are customized to reflect the user’s interests based on historical data, etc. In other words, a user will get a page that no other user will ever get.
At a casual glance, this can be very tricky to address. With less and less page reuse possible, the server needs to do more and more work. So what do we do? Well, we help the server in two ways: offload grunt work and maximize the server’s ability to hold context.
Offloading grunt work is a long standing tradition for devices like NetScaler. These highly optimized appliances take on the network, security, and performance optimization burdens traditionally carried by the server and perform those tasks with great efficiency (e.g., a NetScaler can perform up to 220k SSL transactions/sec – most commodity servers peak at around 1000-2000 SSL transactions/sec). At first glance, this is a very tactical move, but with some quick back of the envelope math, the benefits of these optimizations can be truly outstanding for some web sites.
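For the SSL figures quoted above, the back of the envelope math looks like this. It compares peak numbers only, so treat the result as a rough upper bound rather than a sizing guide; the variable names are just labels for the figures in the text.

```python
# Peak figures quoted in the text.
NETSCALER_SSL_TPS = 220_000    # SSL transactions/sec on one appliance
SERVER_SSL_TPS_LOW = 1_000     # low end for a commodity server
SERVER_SSL_TPS_HIGH = 2_000    # high end for a commodity server

# How many servers' worth of peak SSL work one appliance represents.
servers_equiv_min = NETSCALER_SSL_TPS // SERVER_SSL_TPS_HIGH
servers_equiv_max = NETSCALER_SSL_TPS // SERVER_SSL_TPS_LOW

print(f"One appliance ~ {servers_equiv_min} to {servers_equiv_max} servers of SSL capacity")
```

In other words, at peak a single appliance handles the SSL work of somewhere between 110 and 220 commodity servers, which is where the server count (and the power and cooling bill) starts to shrink.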
Some perspective… CNET was able to remove enough servers from their farms by offloading this overhead that they shaved a cool $250k/yr in power and cooling costs in their datacenter. Countless other stories with similar endings abound in NetScaler lore.
The problem of server context is more subtle.
The first step in a server responding to a user’s request is to know who the user is. Once it has that information, it can start making decisions about what to do. Finally, armed with those decisions, it can generate the necessary web page for the user to see. Typically, applications store the context for a user and the decisions made for them in local memory, on the premise that if a user came back once, they will be back again.
This is where load balancing can break things. When a user is moved from server to server, all of that context disappears and the server slows down again because it has to go back and recompute that information. Ideally, we want a given user to go back to the same server over and over again, so long as the server is up and working.
To do this, the NetScaler transparently inserts a special browser cookie into the traffic stream. This cookie is a marker that the NetScaler can use to look up its own state information about that user and the server he was connected to. When a click returns, the cookie is looked up and, rather than being sent to a server at random, the traffic is sent back to the same server that last served it.
Because of this, the server gains significant efficiencies as its own caches are hit with much higher frequency. The impact can be significant – in fact so significant that performance benchmarking often requires that servers do a clean reboot prior to testing performance just to make sure that performance is not artificially boosted through caching.
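The cookie based persistence described above can be sketched as follows. This is an illustration of the idea, not NetScaler’s implementation: the class name is hypothetical, new users are assigned round-robin (an assumption made to keep the example deterministic), and the cookie value is just a random token.

```python
import uuid
from typing import Dict, List, Optional, Tuple


class StickyBalancer:
    """Send a returning user to the server that served them last, while it is up."""

    def __init__(self, servers: List[str]):
        self.up = list(servers)               # servers currently passing health checks
        self._next = 0                        # round-robin counter for new users
        self._sessions: Dict[str, str] = {}   # cookie value -> server

    def mark_down(self, server: str) -> None:
        self.up.remove(server)

    def _pick(self) -> str:
        if not self.up:
            raise RuntimeError("no servers available")
        server = self.up[self._next % len(self.up)]
        self._next += 1
        return server

    def route(self, cookie: Optional[str] = None) -> Tuple[str, str]:
        """Return (server, cookie); the cookie is what gets set in the response."""
        server = self._sessions.get(cookie)
        if server is None or server not in self.up:
            # New user, unknown cookie, or the remembered server is down.
            server = self._pick()
            if cookie not in self._sessions:
                cookie = uuid.uuid4().hex     # issue a fresh marker
            self._sessions[cookie] = server
        return server, cookie
```

A first request gets a server and a cookie; every later request carrying that cookie lands on the same server, and only a server failure causes the user to be reassigned.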
So there you have it. If you’re doing business on the web, these are three considerations for availability and performance that are mandatory for handling the imposing load and scale that come with running a business there. It can feel intimidating at times, but done correctly it can be the difference between business success and business failure. After all, all that technology dictates your ability to collect revenue.
Get good enough at it and you can become the next Amazon.com.