Is there a library that can trudge through AJAX/JavaScript?

Coldblackice

[H]ard|Gawd
Joined
Aug 14, 2010
Messages
1,152
I'm using PHP to scrape some information off webpages. However, I've discovered that the info I'm trying to scrape is loaded through some manner of AJAX/JavaScript. I thought I remembered that cURL could work through the JavaScript, but I've found that's not the case.

I seem to remember some sort of backend "web browser" library/function that could trace through JavaScript and AJAX to arrive at the same final page that a fully functional browser would produce.

Is there a library or function that can do this? Any ideas on how to go about this, other than having to manually trace through the scripts/redirects myself? It doesn't have to be pretty -- I'm just looking to scrape the resulting text.
 
Take a look at the URL that is being pulled in the AJAX requests, and see if you could take that URL, insert your own parameters, and get what you need.
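A minimal sketch of that idea, written in Python for brevity (the same approach works from PHP with cURL). The endpoint URL, the parameter names `page` and `sort`, and the assumption that the endpoint returns JSON are all hypothetical; substitute whatever the page actually requests.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint: replace with the URL the page really calls.
ENDPOINT = "https://example.com/ajax/items"


def build_ajax_url(endpoint, params):
    """Append your own query parameters to the endpoint the page calls."""
    return endpoint + "?" + urlencode(params)


def fetch_json(url, timeout=10):
    """Request the endpoint directly and decode its body as JSON.

    Assumes the endpoint answers with JSON; some return HTML fragments instead.
    """
    # Some servers only answer AJAX endpoints when this header is present.
    request = Request(url, headers={"X-Requested-With": "XMLHttpRequest"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    url = build_ajax_url(ENDPOINT, {"page": 2, "sort": "date"})
    print(url)
    # data = fetch_json(url)  # enable once ENDPOINT points at a real URL
```

If the endpoint returns an HTML fragment rather than JSON, read the body as text and parse it the same way you would parse a normal page.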
 
I've heard from some others at work that Node.js can be pretty powerful for scraping websites, as you have access to the DOM and jQuery. I haven't used it personally, though.
 
Take a look at the URL that is being pulled in the AJAX requests, and see if you could take that URL, insert your own parameters, and get what you need.
This is almost trivially easy to do with the Firebug plugin for Firefox. Just open Firebug to the Console tab (you might have to enable the console), and it will show each AJAX request as it is made, along with the request and response bodies, headers, etc.
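Once the console shows a request's URL, headers, and body, you can replay it outside the browser. Here is a rough sketch in Python, assuming the captured request was a form-encoded POST; the URL, form fields, and Referer value are hypothetical stand-ins for whatever you actually captured.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_replay_request(url, form_fields, referer):
    """Rebuild a captured AJAX POST from the details shown in the console."""
    body = urlencode(form_fields).encode("ascii")
    headers = {
        # Copied from the captured request; the server may check these.
        "X-Requested-With": "XMLHttpRequest",
        "Referer": referer,
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return Request(url, data=body, headers=headers, method="POST")


def send(request, timeout=10):
    """Send the rebuilt request and return the response body as text."""
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Hypothetical values: substitute the ones you captured.
    req = build_replay_request(
        "https://example.com/ajax/search",
        {"q": "widgets", "page": 1},
        "https://example.com/search",
    )
    print(req.get_method(), req.full_url, req.data)
    # print(send(req))  # enable once the URL points at a real endpoint
```

If the site also relies on cookies or a session token, those would need to be copied from the captured request as well.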
 
An aside question: do you have permission/is it okay for you to be scraping this data?
 
I can't think of any reliable way to do this. If the asynchronous calls are really trivial, you could perhaps simulate them yourself, provided you can get around XSS issues.

But if the site has any kind of complexity, you'd hit a wall real fast. There are some major services out there that try to do this, like Mint (finances) and what have you, and even they are only partially successful.

I'm no expert, though, so don't take my word as final. This seems promising: http://www.phparch.com/books/phparchitects-guide-to-web-scraping-with-php/
 