What Is Web Scraping?
Web scraping is the process of using bots to crawl websites and capture content and data, such as pricing information and product descriptions, at scale. Sometimes this information is used productively: a site like Google aggregates it so users can find relevant content easily. But malicious bots, possibly commissioned by competitors, are also crawling your site and scraping your content with far more dangerous intent.
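To make the mechanics concrete, here is a minimal sketch of what a scraping bot does, written in Python with the requests and BeautifulSoup libraries; the URL and the CSS selectors are placeholders, not a real target site.

    # Fetch a catalog page and extract product names and prices.
    # The URL and the ".product" / ".name" / ".price" selectors are
    # hypothetical stand-ins for whatever markup the target site uses.
    import requests
    from bs4 import BeautifulSoup

    resp = requests.get("https://example.com/products", timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")

    for item in soup.select(".product"):
        name = item.select_one(".name")
        price = item.select_one(".price")
        if name and price:
            print(name.get_text(strip=True), price.get_text(strip=True))

A few dozen lines like these, run on a schedule across an entire catalog, are all it takes to mirror your pricing and product data continuously.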
Your competitors may be using automated price scraping bots to match or beat your pricing and take business away from you. Some ruthless competitors use content scraping bots to steal your exclusive, copyrighted content and images; the republished copies can also damage your SEO rankings when search engines detect pages with duplicate content.
Web Scraping Is a Growing Business Problem
Web scraping is a rapidly growing threat across many industries, with travel and hospitality, e-commerce, and media among the top targets. And the more successful your business, the more likely competitors are to scrape it, fueling ever more targeted attacks.
Scraping bots are increasingly sophisticated and difficult to detect because they imitate normal human interactions. Bots can mimic every part of user behavior, from mouse movements to clicks and keystrokes, yet no purchase ever follows. Scraping attacks have also become more widely distributed, mounting low-and-slow campaigns that use thousands of different IP addresses, each requesting only a few pages of content, while rotating browser user-agents to evade detection. It is common practice for operators to develop and continuously improve their scraping bots to maintain an edge on the competition.
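As an illustration of why these tactics defeat simple detection, here is a sketch of the rotation pattern described above, again in Python with the requests library; the user-agent strings, proxy addresses, and URLs are all illustrative placeholders.

    # "Low and slow": every request goes out with a different user-agent
    # and, via a proxy pool, a different source IP, with human-like
    # pauses in between. No single IP ever requests enough pages to
    # trip a volumetric threshold. All values are placeholders.
    import random
    import time
    import requests

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
        "Mozilla/5.0 (X11; Linux x86_64) ...",
    ]
    PROXY_POOL = ["http://198.51.100.10:8080", "http://203.0.113.25:8080"]

    for url in ["https://example.com/page-1", "https://example.com/page-2"]:
        proxy = random.choice(PROXY_POOL)
        resp = requests.get(
            url,
            headers={"User-Agent": random.choice(USER_AGENTS)},
            proxies={"http": proxy, "https": proxy},
            timeout=10,
        )
        time.sleep(random.uniform(5, 30))  # pace requests like a human

From the defender's side, each request looks like an unrelated visitor arriving on a different browser from a different network.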
How Are Companies Fighting Web Scraping Bots?
The most common method of protecting a website from scraping relies on tracking IP addresses and domains associated with past attacks. But bad bots keep finding new ways in, and basic detection tools based on signatures or volumetric sensors cannot keep up, leaving site owners with thousands of obsolete threat profiles. Web application firewalls (WAFs) are likewise ineffective at stopping bot attacks because modern bots evade detection by mimicking human behavior. Hyper-distributed attacks that spread across many user-agents, IPs, and ASNs easily bypass WAFs and homegrown bot solutions. Homegrown bot management and CAPTCHA challenges are typically no match for advanced scraping bots and succeed only in frustrating legitimate site visitors.
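For contrast, here is a sketch of the kind of homegrown defense just described: a static blocklist plus a per-IP rate limit, in Python. The IPs, signature strings, and threshold are illustrative, not drawn from any real product.

    # Naive signature + volumetric defense: block known-bad IPs and
    # user-agent substrings, and cap requests per IP per minute.
    # All values here are illustrative placeholders.
    import time
    from collections import defaultdict

    BLOCKED_IPS = {"203.0.113.7"}                       # stale profiles from past attacks
    BLOCKED_UA_SUBSTRINGS = ("curl", "python-requests", "scrapy")
    RATE_LIMIT = 100                                    # requests per IP per minute

    hits = defaultdict(list)

    def allow_request(ip: str, user_agent: str) -> bool:
        if ip in BLOCKED_IPS:
            return False
        if any(sig in user_agent.lower() for sig in BLOCKED_UA_SUBSTRINGS):
            return False
        now = time.time()
        hits[ip] = [t for t in hits[ip] if now - t < 60]  # keep the last minute
        hits[ip].append(now)
        return len(hits[ip]) <= RATE_LIMIT

A distributed bot sending a handful of requests per IP under a browser-like user-agent passes every one of these checks, which is exactly why such defenses fall behind.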