What is data scraping?

Data scraping: A (mostly) legal way to harvest your information

When you post information or images of yourself publicly online, there is always the risk that someone will record that information and use it in some way. But the practice of data scraping, in which large amounts of public information are collected in an automated fashion, has made this possibility almost a guarantee.

What is data scraping?

With data scraping, machines are used to record information that was meant for human eyes. This happens most commonly in the form of web scraping, where a program copies data from a web page, often while presenting itself as an ordinary human visitor.
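To make the idea concrete, here is a minimal sketch of the copying step using only Python's standard library. The page markup, the airline names, and the "price" class are invented for illustration; a real scraper would first download the page and would have to match whatever markup the target site actually uses.

```python
from html.parser import HTMLParser

# Hypothetical page markup; a real scraper would first download this text.
SAMPLE_PAGE = """
<html><body>
  <h1>Flights to Lisbon</h1>
  <ul>
    <li><span class="airline">Example Air</span> <span class="price">$412</span></li>
    <li><span class="airline">Sample Jet</span> <span class="price">$389</span></li>
  </ul>
</body></html>
"""


class PriceParser(HTMLParser):
    """Collects the text inside every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        # Keep only text that appears while we are inside a price span.
        if self.in_price and data.strip():
            self.prices.append(data.strip())


def scrape_prices(html):
    parser = PriceParser()
    parser.feed(html)
    return parser.prices


print(scrape_prices(SAMPLE_PAGE))  # ['$412', '$389']
```

The page was laid out for a person to read, yet a few lines of code are enough to pull out just the numbers a competitor or aggregator cares about.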

Web scrapers are commonly used by companies to keep tabs on their competitors’ websites, scanning for updates, inventory changes, and price fluctuations. Travel sites scrape data from different airline and hotel websites to show users price comparisons. Some retailers also scrape Twitter and review sites like Yelp for sales leads.

But more recently, data scraping has been used to copy, en masse, information that individuals have made publicly available on social media. While this information was never a secret to begin with, attackers using data scraping have been able to compile it into large, organized collections and offer them for sale.

Data scraping vs. web crawling vs. hacking

Search engines like Google use web crawlers to discover and record pages on the internet so people can search for them. It’s a symbiotic relationship between web crawlers and websites: Google wants to know what content websites have to offer its users, and website owners (usually) want those users to be able to find them easily.
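The discovery step a crawler performs can be sketched as following links from one page to the next. The example below is a simplified illustration and not a description of how any particular search engine works; the three-page SITE dictionary and the example.com URLs are made up and stand in for real HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# A hypothetical three-page site held in memory instead of fetched over HTTP.
SITE = {
    "https://example.com/": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "https://example.com/about": '<a href="/">Home</a>',
    "https://example.com/blog": '<a href="/about">About</a>',
}


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url):
    """Breadth-first walk over the site, returning pages in discovery order."""
    seen = [start_url]
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        parser = LinkParser()
        parser.feed(SITE.get(url, ""))
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.append(absolute)
                queue.append(absolute)
    return seen


print(crawl("https://example.com/"))
```

Starting from the home page, the sketch discovers the other two pages purely by following links, which is the sense in which a crawler records what a site has to offer.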

Data scrapers, meanwhile, can be thought of as parasites. They are not customers and provide no value back to the website. Deployed on a massive scale, they can overload web servers and slow down websites for legitimate users. Ever had to solve a CAPTCHA to “prove you’re not a robot”? It’s partly to prevent data scraping.
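One way a website might notice this kind of traffic is to count how many requests each client sends within a short window. The sketch below is an assumption about how such a check could look, not any real site's defense; the RequestMonitor class, the limit of three requests, and the ten-second window are all hypothetical.

```python
from collections import defaultdict, deque


class RequestMonitor:
    """Flags clients that exceed max_requests within window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.history = defaultdict(deque)  # client id -> recent request times

    def allow(self, client_id, now):
        times = self.history[client_id]
        # Forget requests that have fallen out of the window.
        while times and now - times[0] >= self.window_seconds:
            times.popleft()
        if len(times) >= self.max_requests:
            return False  # over the limit: block, or show a CAPTCHA
        times.append(now)
        return True


monitor = RequestMonitor(max_requests=3, window_seconds=10)
# A bot sending five requests one second apart:
results = [monitor.allow("bot", now=t) for t in range(5)]
print(results)  # [True, True, True, False, False]
```

A person clicking through a site rarely trips a limit like this, while a script requesting pages as fast as it can does so almost immediately.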

It’s not that websites don’t want any other machines touching their data. Many websites provide APIs, or application programming interfaces, software that lets legitimate apps and their algorithms access databases without clogging up the pipes for customers. But when a program doesn’t use an API and instead attempts to parse data off a public-facing web page, that’s data scraping.
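The difference is easy to see side by side. In this sketch, both the API response and the web page are invented examples describing the same hypothetical product; the field names and markup are assumptions, not any real site's format.

```python
import json
import re

# Hypothetical API response: structured data intended for programs.
API_RESPONSE = '{"item": "Desk lamp", "price": 24.99, "in_stock": true}'

# The same facts as a hypothetical web page presents them to people.
WEB_PAGE = '<div class="product"><h2>Desk lamp</h2><p>Only <b>$24.99</b> - in stock!</p></div>'


def price_from_api(body):
    # The API names the field, so the program simply asks for it.
    return json.loads(body)["price"]


def price_from_page(html):
    # The scraper has to guess: find something that looks like a dollar amount.
    match = re.search(r"\$(\d+(?:\.\d{2})?)", html)
    return float(match.group(1)) if match else None


print(price_from_api(API_RESPONSE), price_from_page(WEB_PAGE))  # 24.99 24.99
```

Both functions arrive at the same number, but the scraping version depends on the page's wording and layout and can break whenever the site is redesigned.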

Left unchecked, data scraping can be a huge problem for companies and their customers, on a scale that’s beginning to rival that of more traditional hacks and data breaches.

There are also nuances when it comes to the difference between hacking and data scraping. Hacking is analogous to theft: An attacker gains access to data that was protected somehow, usually by a password.

Data scraping is morally fuzzier. The data in question was technically out in the open already. For example, airlines already make their airfares public to help potential customers, so if a competitor’s bot wants the same info, is it really “stealing”?

Is data scraping legal?

Web scraping is legal, in theory. Let’s say you are copying and pasting text from a free resource like Wikipedia and decide to write an automated script to make your job easier. This is perfectly legal and doesn’t hurt anyone.

Many websites, however, have terms of service that explicitly prohibit data scraping, and the consequences of violating them can vary dramatically. If the scraping was small in scale, you may simply lose access to the service. But you may also face legal action, especially if the scraping was large-scale enough to impact the company’s bottom line.

This is what happened when eBay sued Bidder’s Edge, a service that aggregated auction data scraped from eBay and, in doing so, generated approximately 100,000 extra server requests per day. The auction site argued that Bidder’s Edge had committed “trespass to chattels” by interfering with its business. A federal court granted eBay a preliminary injunction in 2000, and the case ended in an undisclosed settlement in eBay’s favor.

Other companies have followed suit, notably Craigslist (v. Padmapper), QVC (v. Resultly), and LinkedIn (in its dispute with hiQ), setting more and more precedents for legal action against data scrapers.

In the long-running dispute between LinkedIn and hiQ Labs, which began when hiQ went to court after LinkedIn demanded that it stop scraping public profiles, an appeals court reaffirmed in April 2022 its original finding that scraping data that is publicly accessible on the internet is not a violation of the Computer Fraud and Abuse Act, which governs what constitutes computer hacking under U.S. law. LinkedIn has vowed to fight on.
