Scrapers are a multi-billion dollar nuisance on the Internet, posing growing threats to legitimate eCommerce sites: revenue loss from traffic interception, data manipulation, copyright violations, and damage to site reputation, to name a few. It is estimated that scraper traffic from China alone surpasses the combined page views of Google, Yahoo & MSN. It should come as no surprise that ADCs, at the forefront of web content delivery, are the pioneers of website protection against bots. NetScaler, the de facto ADC choice of the world’s largest eCommerce sites, offers some slick solutions for scraper control.
Traditional scraper and bot control mechanisms focus on these methods:
Use of specific cookies
Check source IP ranges
Challenge response methods like captcha
Check whether images are being downloaded
Check if the user agent is valid
Check the rate of requests
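As a sketch of the cookie check above, a responder policy can drop requests that arrive without any cookies at all, something a real browser session rarely does. The policy name is illustrative, and this assumes the responder feature is enabled:

# illustrative sketch: drop requests that present no Cookie header
# (assumes "enable ns feature RESPONDER" has been run; policy name is hypothetical)
add responder policy drop_no_cookie "!HTTP.REQ.HEADER(\"Cookie\").EXISTS" DROP
bind responder global drop_no_cookie 100 END -type REQ_DEFAULT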
Some of these methods have drawbacks; for example, CAPTCHA degrades the user experience and reduces site popularity, rendering the method ineffective. The best approach is a combination of cookies and rate limiting based on request rates. A typical user-generated HTTP request carries more cookies than a bot-generated request; if bots were to present that many cookies with every request, it would be very resource intensive for them. On most eCommerce sites it is unlikely that a human is visiting numerous listing pages within a second. If users want to look at more than one result, they go back to the list in their browser and then click on another item for details; this takes several seconds at best. If they want to modify the query, they return to the query page and select new attributes like price, color, manufacturer, etc., which also takes several seconds at a minimum. Scrapers have no such delay between requests, so the rate-limiting trigger can be set very aggressively, for example more than one request within 2 seconds. Once the NetScaler detects such clients and tags them as bots with a cookie or HTTP header for that session, it can make intelligent decisions on subsequent requests, such as redirecting them to another site or sending resets. A NetScaler ADC that can handle millions of connections is needed to offload this functionality from the web servers. For more involved scraper control solutions, customers should also consider HTTP callout.
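The detect-and-act pattern described above can be sketched with a responder policy; the limit identifier name here is hypothetical, and resetting the connection is only one of the possible actions:

# hedged sketch: reset connections from clients that exceed a rate limit
# (bot_limit is a hypothetical limitIdentifier; RESET is a built-in responder action)
add responder policy reset_bots "SYS.CHECK_LIMIT(\"bot_limit\")" RESET
bind responder global reset_bots 120 END -type REQ_DEFAULT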
Here is a customer-deployed configuration developed by John Meikrantz, one of our stellar systems engineers. In this case the scrapers were rapidly incrementing the listingId value in the URL, walking through thousands of product listings in a span of seconds. Imagine the frustration of a vendor with a million views on a listing but no buyers! The solution below inserts a tag that lets the second-tier ADCs content switch or reset the scraper traffic, thus preserving meaningful statistics for vendors’ product listings.
# rate limiting selectors looking at jsessionid and client source address
add ns limitSelector site_selector "HTTP.REQ.COOKIE.VALUE(\"JSESSIONID\")" CLIENT.IP.SRC
# the limit identifier evaluates true if more than one request is seen within two seconds from a given client ip address and jsessionid
add ns limitIdentifier site_selector_limit -timeSlice 2000 -selectorName site_selector
# rewrite action that inserts a header SITE-SCRAPER with a string value of LIMIT_HIT
add rewrite action scraper_header insert_http_header SITE-SCRAPER "\"LIMIT_HIT\""
# rewrite policies that trigger the action when the expression is true
add rewrite policy scraper_detail "HTTP.REQ.URL.PATH.EQ(\"/go/search/detail.jsp\") && SYS.CHECK_LIMIT(\"site_selector_limit\")" scraper_header
add rewrite policy scraper_searchresults "HTTP.REQ.URL.PATH.EQ(\"/forsale/searchresults.action\") && SYS.CHECK_LIMIT(\"site_selector_limit\")" scraper_header
# bound globally if in transparent mode
bind rewrite global scraper_searchresults 100 END -type REQ_OVERRIDE
bind rewrite global scraper_detail 110 END -type REQ_OVERRIDE
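On the second-tier ADCs, a content switching policy can key off the inserted SITE-SCRAPER header to steer tagged traffic away from the vservers that feed the listing statistics. This is a sketch only; cs_frontend and lb_scraper_sink are hypothetical vserver names:

# hedged sketch: second-tier content switching on the SITE-SCRAPER tag
# (cs_frontend and lb_scraper_sink are hypothetical vserver names)
add cs policy cs_scraper_tagged -rule "HTTP.REQ.HEADER(\"SITE-SCRAPER\").EXISTS"
bind cs vserver cs_frontend -policyName cs_scraper_tagged -targetLBVserver lb_scraper_sink -priority 10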