What is Web Scraping?
Web scraping, or content scraping, is the practice of using automated bots and web crawlers to extract content or data from third-party websites. The scraper can then replicate this data on another website or application.
Web scraping can be a confusing issue from a security perspective because it is a widespread practice in many digital businesses and has legitimate uses. Online businesses scrape websites for legitimate purposes such as powering search engines, delivering price comparisons to consumers, and aggregating news or weather content.
In other cases, however, unscrupulous businesses use bots to gain an unfair competitive edge, scraping pricing data from competitors' sites in order to undercut them or harvesting and duplicating marketing content from those sites for SEO purposes.
In the most malicious scenarios, cybercriminals deploy bots to scrape user data and resell it or use it in a broader attack. In April and July of 2021, LinkedIn fell victim when data from over one billion user accounts was scraped and offered for sale on the dark web. In May of 2021, Business Insider reported that Facebook had been similarly targeted: scrapers gained information on over 500 million users. In addition to reselling data for a quick profit, attackers scrape a site to identify employee names and deduce username and email formats, then launch targeted phishing and account takeover (ATO) attacks.
How Do Attackers Execute a Web Scraping Attack?
Web scraping attacks can be very broad, copying an entire site to see if there is any data that can be exploited, or very targeted, seeking specific data on specific pages. Either way, every attack starts with a plan.
The attacker may begin by deploying web crawlers or spiders to map the targeted site, cataloging URLs and page metadata and identifying access gates such as account log-ins and CAPTCHAs on specific pages. With this map, the attacker develops a script for the scraper bots to follow: it tells the scrapers which URLs to visit and what to do on each page. The attacker may also create fake user accounts to register bots as legitimate users on the website and give them access to paid content.
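To make this concrete, a reconnaissance crawler of this kind can be sketched in a few lines of Python. This is a minimal illustration, assuming the common requests and beautifulsoup4 libraries; the target URL, the crawl cap, and the login/CAPTCHA markers it looks for are all hypothetical placeholders.

    # Minimal site-mapping crawler sketch (hypothetical target URL).
    # Requires: pip install requests beautifulsoup4
    from collections import deque
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    START_URL = "https://example.com/"  # hypothetical target
    seen, queue, site_map = set(), deque([START_URL]), {}

    while queue and len(seen) < 100:      # cap the crawl for the sketch
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        resp = requests.get(url, timeout=10)
        soup = BeautifulSoup(resp.text, "html.parser")
        # Record page metadata and likely "access gates" (log-ins, CAPTCHAs).
        site_map[url] = {
            "title": soup.title.string if soup.title else "",
            "has_login": bool(soup.find("input", {"type": "password"})),
            "has_captcha": "captcha" in resp.text.lower(),
        }
        # Queue same-site links for further mapping.
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc == urlparse(START_URL).netloc:
                queue.append(link)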
A web scraper bot will typically send a series of HTTP GET requests to the targeted website to access the HTML source code. Data locators within the scraper are programmed to look for specific data types and save all of the relevant information from the targeted web server’s response to a CSV or JSON file.
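In code, that loop is very simple. The sketch below assumes the same requests/beautifulsoup4 stack; the product-page URLs and the CSS selectors acting as "data locators" are hypothetical and would in practice be tailored to the target site's markup.

    # Scraper sketch: fetch pages, extract targeted fields, save to CSV.
    import csv

    import requests
    from bs4 import BeautifulSoup

    urls = [f"https://example.com/products/{i}" for i in range(1, 4)]  # hypothetical

    with open("scraped.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "name", "price"])
        for url in urls:
            html = requests.get(url, timeout=10).text  # HTTP GET for the source
            soup = BeautifulSoup(html, "html.parser")
            # "Data locators": selectors that pick out the fields of interest.
            name = soup.select_one("h1.product-name")
            price = soup.select_one("span.price")
            writer.writerow([url,
                             name.get_text(strip=True) if name else "",
                             price.get_text(strip=True) if price else ""])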
A more advanced type of scraping is database scraping, in which the scraper interacts with an application on the site to retrieve content from its underlying database. For example, sophisticated bots can be programmed to make thousands of requests to internal application programming interfaces (APIs) for associated data – such as product prices or contact details – that is stored in a database and delivered to the browser in HTTP responses.
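Database scraping often amounts to replaying the site's own API calls. A sketch, assuming a hypothetical paginated JSON endpoint (/api/products?page=N) of the kind an attacker might discover by watching the browser's network traffic:

    # Database-scraping sketch against a hypothetical internal JSON API.
    import json

    import requests

    records, page = [], 1
    while True:
        resp = requests.get("https://example.com/api/products",  # hypothetical
                            params={"page": page}, timeout=10)
        batch = resp.json().get("items", [])   # assumed response shape
        if not batch:
            break                              # no more pages to harvest
        records.extend(batch)
        page += 1

    with open("products.json", "w") as f:
        json.dump(records, f, indent=2)        # dump the harvested records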
Collecting and copying these large data repositories requires an enormous amount of processing power. While businesses engaged in legitimate scraping activity invest in vast server arrays to process the data, criminals are more likely to employ a network of hundreds or thousands of hijacked computers, known as a botnet, which spreads the processing load and helps to mask the malicious activity by distributing the data requests.
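The same fan-out can be sketched on a single machine with a thread pool; a real botnet simply spreads equivalent workers across thousands of compromised hosts so that no single source stands out. The URL list below is a placeholder.

    # Sketch of distributing fetch work across concurrent workers.
    from concurrent.futures import ThreadPoolExecutor

    import requests

    urls = [f"https://example.com/page/{i}" for i in range(100)]  # hypothetical

    def fetch(url):
        # Each worker handles a slice of the overall request load.
        return url, requests.get(url, timeout=10).status_code

    with ThreadPoolExecutor(max_workers=20) as pool:
        results = dict(pool.map(fetch, urls))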
Web Scraping is a Growing Business Problem
Web scraping is a rapidly growing threat across many industries, with travel and hospitality, e-commerce and media among the top targets. And in every industry, the more successful a business is, the more likely it is to be scraped by competitors, which fuels further targeted attacks.
Scraping bots are increasingly sophisticated and increasingly difficult to detect because they can imitate normal human interactions. Every element of user behavior, from mouse movement to keystrokes and typing cadence, can be mimicked by a bot; the only thing missing is any intent behind those actions other than collecting data.
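For example, a bot driving a real browser can replay human-looking input events. Below is a minimal sketch using the Playwright library (assuming pip install playwright and an installed Chromium build); the target URL, coordinates and delays are purely illustrative.

    # Sketch: human-like mouse movement and typing via a scripted browser.
    # Requires: pip install playwright && playwright install chromium
    import random

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://example.com/search")  # hypothetical page
        # Glide the cursor in small steps instead of jumping to the target.
        page.mouse.move(random.randint(100, 300),
                        random.randint(100, 300), steps=25)
        # Type with per-keystroke delays to resemble human typing cadence.
        page.keyboard.type("running shoes", delay=random.randint(80, 160))
        browser.close()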
Scraping bot attacks have also become more widely distributed, mounting low-and-slow attacks that use thousands of geographically distributed IP addresses, each requesting only a few pages of content and rotating browser user-agents to avoid detection by web security tools.
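The rotation itself is trivial to script. In the sketch below, the user-agent strings and the target URLs are placeholders; in a real attack the source IPs would also rotate, typically via residential proxy networks or a botnet.

    # Sketch: rotating user-agents and pacing requests "low and slow".
    import random
    import time

    import requests

    USER_AGENTS = [  # placeholder strings; real attacks rotate hundreds
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
        "Mozilla/5.0 (X11; Linux x86_64) ...",
    ]

    for i in range(1, 6):
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        requests.get(f"https://example.com/page/{i}",  # hypothetical target
                     headers=headers, timeout=10)
        time.sleep(random.uniform(30, 120))  # few requests per source, slowly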
The report from Aberdeen Research, The Business Impact of Website Scraping, finds that "the median annual business impact of website scraping is as much as 80% of overall e-commerce website profitability." For the media sector, the research estimates "the annual business impact of website scraping is between 3.0% and 14.8% of annual website revenue, with a median of 7.9%."
Common Defenses Against Web Scraping Bots
The most common method used to protect a website from scraping relies on tracking known-bad IP addresses and domains observed in past attacks. But bad bots constantly find new ways in, and basic detection tools based on signatures or volumetric sensors cannot keep up with the changes, leaving site owners with thousands of obsolete threat profiles and an ongoing problem.
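A simplified version of this signature-and-threshold approach, sketched below, makes the weakness obvious: a static blocklist plus a per-IP rate limit stops yesterday's bots, but a distributed attack in which each IP stays under the threshold sails through. The blocklist entry, limit and window are illustrative values, not a real product's defaults.

    # Sketch: naive IP-based bot detection, and why distribution defeats it.
    import time
    from collections import defaultdict

    KNOWN_BAD_IPS = {"203.0.113.7"}   # stale signatures from old attacks
    RATE_LIMIT = 100                  # max requests per IP per minute
    hits = defaultdict(list)

    def is_suspicious(ip):
        if ip in KNOWN_BAD_IPS:
            return True
        now = time.time()
        hits[ip] = [t for t in hits[ip] if now - t < 60]  # 60s sliding window
        hits[ip].append(now)
        # A botnet of 10,000 IPs making a few requests each never trips this.
        return len(hits[ip]) > RATE_LIMIT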
Web application firewalls (WAFs) are also commonly used, but they are largely ineffective at stopping bot attacks because modern bots evade detection by mimicking human behavior. Hyper-distributed attacks that use many different user-agents, IP addresses and ASNs bypass WAFs with ease, and homegrown bot management tools and CAPTCHA challenges are typically no match for advanced scraping bots, succeeding only in frustrating legitimate site visitors.