Scrapers Be Scrapin'

FrgMstr

Scraping bots are nothing new. These bots move all over the internet and are used by tons of companies to collect data on just about everything you can think of. If you are not in the know about scraping bots, Wired has a write-up that will tell you all you need to know, and likely why you don't really care, unless of course you use price comparison sites, which are generally built on scraping.


Companies like Amazon and Walmart have internal teams dedicated to scraping, says Alexandr Galkin, CEO of the retail price optimization company Competera. Others turn to companies like his. Competera scrapes pricing data from across the web, for companies ranging from footwear retailer Nine West to industrial outfitter Deelat, and uses machine-learning algorithms to help its customers decide how much to charge for different products.
 
IMHO, use of data should very much be contingent on the purpose for which it was put up. Just because someone puts information on the internet doesn't mean that it should be OK for anyone to scrape it and use it as they see fit.

I like price comparison sites, but some sanity needs to be applied here. It's not supposed to be a free-for-all buffet of data.
 

A lot of it is protected, much like copyrighted material. Essentially, you're OK to view/read it, but you can't legally reuse it in a new product. Policing that is very hard.

Google/Yahoo had to restructure their finance APIs because they didn't have proper redistribution licenses, and their sites/APIs made it too easy to pull the data in bulk.

I do a lot of scraping for research, so I'm familiar with the tricks sites pull to prevent it. Many sites do simple checks on the request headers (the User-Agent, for example); other sites use JavaScript to post-load the data. For every prevention, there's a counter. For the most advanced sites you have to use tools like Selenium with Python, which drives a real browser (optionally headless) so that it simulates the user and loads the entire DOM, including content added by JavaScript.
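
To make the header check and the headless-browser approach a bit more concrete, here is a rough Python sketch. Treat it as an illustration under assumptions, not a recipe: the URL, the User-Agent string, and the function names are all made up, and the second half assumes the selenium package plus Chrome and a matching driver are installed.

```python
import urllib.request

# Hypothetical values, for illustration only.
EXAMPLE_URL = "https://example.com/products"
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_UA):
    """Build a request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_static(url):
    """Fetch the raw HTML; enough when the data is in the initial response."""
    with urllib.request.urlopen(build_request(url), timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def fetch_rendered(url):
    """Fetch the page source after JavaScript has run, via headless Chrome."""
    # Imported lazily so the functions above work without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

If a site loads its data well after the initial page load, `fetch_rendered` would probably also need an explicit wait before reading `page_source`.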

But to be honest, it's not as hard as it used to be. Modern sites use a JSON feed for AJAX loading of the data. If you sniff the network traffic during the page load, you can usually find the exposed API call and simulate it to dump the data as a JSON array. Waaaay easier.
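
As a rough idea of what simulating that API call might look like in Python: the sketch below calls a JSON endpoint directly and keeps only the fields of interest. The payload shape (`items`, `name`, `price`, `sku`) and the function names are hypothetical; a real site's feed will have its own structure, which you would read off the network tab.

```python
import json
import urllib.request


def fetch_json(url):
    """Call the JSON endpoint directly and decode the response body."""
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def extract_rows(payload, key="items", fields=("name", "price")):
    """Pull the selected fields out of each record listed under `key`."""
    return [{f: item.get(f) for f in fields} for item in payload.get(key, [])]


# Hypothetical payload standing in for what fetch_json() might return.
sample = json.loads('{"items": [{"name": "widget", "price": 9.99, "sku": "A1"}]}')
rows = extract_rows(sample)
```

Splitting the fetch from the extraction keeps the parsing step easy to check offline against a saved response.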
 
How many of these companies that try to prevent scraping are also in the business of collecting and selling their users' information?
 
I like how Google was penalizing websites for scraping and Google was scraping hardcore at the same time.
 